Prior to the transformer, language processing was handled by recurrent neural networks (RNNs) and their refinements, long short-term memory networks (LSTMs) and gated recurrent units (GRUs). These models read text word by word, passing along a hidden state that encoded everything read so far. It was an intuitive process.
These architectures had shortcomings. They struggled with long passages: context from the beginning of a paragraph faded by the end. They were also hard to parallelize, since each step depended on the previous one. The field needed a way to process sequences without being stuck in this linear rut.
Google Brain researchers wanted to change this dynamic. The simplest solution was to forgo recurrence altogether: the model should instead look at every word in a sequence simultaneously and figure out how the words relate to each other.
The attention mechanism lets the model focus on the most relevant parts of a sentence, shedding the baggage of recurrence. The result was the arrival of the transformer: fast, parallelizable, and good at handling context over long stretches of text.
What was the breakthrough idea? Attention. The transformer elevated attention from a supporting role to the hero of the show.
There were brainstorming sessions, and the attention mechanism had already made translations more effective. The transformer architecture has two main parts: an encoder, which processes the input data and creates a meaningful representation of it using self-attention and simple neural networks, and a decoder, which generates the output, attending to the previously generated tokens and drawing on information from the encoder.
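To make the idea concrete, here is a minimal sketch of scaled dot-product attention, the operation at the heart of both the encoder and the decoder. The token values and dimensions below are invented for illustration; real models also apply learned projection matrices to produce the queries, keys and values.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to every key, and the values are mixed
    according to those attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)            # similarity of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                                         # weighted sum of the values

# Toy example: a "sentence" of 4 tokens, each embedded in 8 dimensions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
# In self-attention, queries, keys and values all come from the same tokens.
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)  # (4, 8): every token now carries context from every other token
```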
What is so brilliant about this design? Because every position in a sequence can be processed at the same time, it enables training on huge datasets quickly and efficiently.
The architecture was released publicly and became an experimental tool for researchers all over the world. This led to the realization of how powerful the transformer is; the idea that it had revolutionized the whole field of AI slowly sank in.
In 2017, Google published the paper Attention Is All You Need. The model outperformed all previous approaches and led to a wave of innovations. Google itself designed BERT (Bidirectional Encoder Representations from Transformers), which made NLP much more efficient. OpenAI took the transformer blueprint and built GPT (Generative Pre-trained Transformer), releasing GPT-1 and GPT-2, models that produced human-like text. In late 2022, ChatGPT was introduced, based on GPT-3.5. It generated coherent text, translated languages, wrote code and even produced poetry. The machine's ability to write became a tangible reality, and AI-assisted creativity became an everyday thing.
The transformer is not confined to text. The attention mechanism can be extended to other kinds of data: images, music, code.
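As a sketch of how this works for images, the Vision Transformer recipe cuts a picture into patches and treats each flattened patch as a token; the image size and patch size below are illustrative assumptions, not fixed requirements.

```python
import numpy as np

# Turn an image into a "sentence" of patch tokens (ViT-style).
image = np.random.rand(224, 224, 3)        # height x width x colour channels
patch = 16                                  # 16x16 pixel patches
patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)
print(patches.shape)                        # (196, 768): 196 "tokens", each a 768-dim vector

# A learned linear projection then maps each patch to the model dimension,
# after which standard self-attention treats the patches exactly like words.
```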
Software frameworks such as TensorFlow and PyTorch incorporated transformer-friendly building blocks. Models were scaled up by increasing both the number of parameters in a transformer and the size of its training dataset. A race began for bigger models, more data and more GPUs, and the field attracted massive investment.
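For example, PyTorch ships ready-made transformer layers, so a small encoder stack is only a few lines. This is a sketch, not a production model; the width, head count and layer count below are arbitrary choices made for the example.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

batch = torch.randn(2, 10, d_model)    # 2 sequences, 10 tokens each, already embedded
contextualised = encoder(batch)        # same shape, but every token now "sees" the whole sequence
print(contextualised.shape)            # torch.Size([2, 10, 512])
```

Scaling up is largely a matter of turning these knobs: more layers, a wider model dimension, more attention heads, and far more training data and GPUs.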
Training such models consumes enormous amounts of compute time and electricity.
There are open issues around synthetic data, copyright, misinformation, deepfakes and ethical deployment. Governments are proposing to regulate the field without stifling innovation.
The Attention Is All You Need paper is a testament to what open research can do for the world: it allowed every researcher to build on its ideas, in the spirit of openness. Transformers are now the default backbone of NLP. Should we go beyond attention to find new models? The field is not stagnant.