ChatGPT and similar large language models use a multi-layered transformer network. The transformer first breaks the input text into a sequence of tokens — ‘what’, ‘is’, ‘the’, ‘capital’, ‘of’, ‘India’. These tokens are then subjected to a series of mathematical operations, and a technique called self-attention is used to generate an output sequence.
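To make the tokenization step concrete, here is a minimal sketch in Python. The whitespace tokenizer and the tiny vocabulary are illustrative assumptions; production models use learned subword tokenizers such as byte-pair encoding.

```python
# A toy sketch of tokenization, assuming a whitespace tokenizer and a
# made-up vocabulary; real LLMs use learned subword tokenizers (e.g. BPE).
vocab = {"what": 0, "is": 1, "the": 2, "capital": 3, "of": 4, "india": 5}

def tokenize(text):
    # Lowercase, split on whitespace, and map each token to an integer ID.
    return [vocab[word] for word in text.lower().split()]

print(tokenize("What is the capital of India"))  # -> [0, 1, 2, 3, 4, 5]
```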
A transformer is an encoder-decoder architecture. The first encoder-decoder models for translation, introduced in 2014, were RNN-based.
One process represents, or encodes, the input data into a single vector; another process decodes that vector into the desired output. This design is useful in natural language processing and computer vision.
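The sketch below shows this idea in Python; it is a minimal illustration assuming PyTorch, with made-up layer sizes and names (EncoderDecoder, hidden) rather than the design of any particular 2014 system.

```python
# A minimal sketch of a 2014-style RNN encoder-decoder; sizes are toy values.
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, vocab_size=100, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src, tgt):
        # Encode the whole input sequence into ONE final hidden-state vector.
        _, context = self.encoder(self.embed(src))
        # Decode the output, conditioned only on that single vector.
        dec_states, _ = self.decoder(self.embed(tgt), context)
        return self.out(dec_states)   # per-step scores over the vocabulary

model = EncoderDecoder()
src = torch.randint(0, 100, (1, 6))  # a 6-token input sentence
tgt = torch.randint(0, 100, (1, 5))  # a 5-token output prefix
print(model(src, tgt).shape)         # torch.Size([1, 5, 100])
```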
When an RNN-based encoder-decoder faced a longer input sequence, the information from earlier parts of the sequence grew noisier. This hurt the model’s ability to make good predictions over long sequence lengths.
Bidirectional encoders were introduced as an innovation in 2016. They consider all the hidden state vectors generated during encoding, rather than only the last encoder hidden state.
The self-attention mechanism relates different positions of a single sequence to one another in order to compute a representation of that sequence.
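Here is a minimal sketch of self-attention in Python, assuming NumPy and toy dimensions; for brevity it omits the learned query, key and value projections that a real transformer layer applies.

```python
# Scaled dot-product self-attention, stripped to its core (no learned
# Q/K/V projections); every position is related to every other position.
import numpy as np

def self_attention(x):
    # x: (seq_len, d) token vectors -> (seq_len, d) context-aware vectors.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)     # relate each position to every other
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ x                # each token: a weighted mix of all tokens

x = np.random.randn(6, 8)             # six tokens, eight dimensions each
print(self_attention(x).shape)        # (6, 8)
```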
Previously, RNNs were the kings of sequence problems; today we use BERT, GPT, GPT-2, GPT-3, GPT-3.5 and GPT-4. The paper that set this trend was ‘Attention Is All You Need’, co-authored by Vaswani and others in 2017.
Transformers were first described in the above-mentioned 2017 paper from Google. Stanford researchers called transformers ‘foundation models’ in a 2021 paper, marking a paradigm shift in AI.
The attention mechanism gives transformers a long-term memory: the model can focus on, or attend to, all the tokens that have been generated so far.
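This “attend only to what came before” behaviour is implemented with a causal mask. The sketch below, assuming NumPy and toy dimensions, shows how future positions are blanked out so that each token can mix in itself and all earlier tokens, but nothing later.

```python
# Causal (masked) attention: token i may attend only to tokens 0..i.
import numpy as np

seq_len, d = 6, 8
x = np.random.randn(seq_len, d)
scores = x @ x.T / np.sqrt(d)

# Mask out future positions before the softmax.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf

scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))           # upper triangle is all zeros
```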
The attention mechanism allows the model to associate words with one another, incorporating its understanding of the relevant words into the one currently being processed. The resulting interactions are global and input-dependent.
The attention mechanism overcomes a limitation of the plain encoder-decoder model, which encodes the input sequence into one fixed-length vector from which every output time step must be decoded.
The attention mechanism lets the output focus on the relevant parts of the input while the output is being produced; self-attention lets the inputs interact with each other.
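That first kind, where the output looks back at the input, is often called cross-attention. A minimal sketch in Python, again assuming NumPy and toy dimensions: the decoder positions act as queries over the encoder’s states.

```python
# Cross-attention: decoder positions (queries) attend over encoder states.
import numpy as np

def cross_attention(dec, enc):
    # dec: (m, d) decoder queries; enc: (n, d) encoder states (keys/values).
    d = dec.shape[-1]
    scores = dec @ enc.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ enc              # each output step is a mix of the input

enc = np.random.randn(6, 8)           # six encoded input tokens
dec = np.random.randn(4, 8)           # four output positions so far
print(cross_attention(dec, enc).shape)  # (4, 8)
```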
Unlike RNNs, transformers process the entire input all at once rather than one word at a time. This makes it possible to exploit high-performance hardware such as GPUs.
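The contrast can be seen in a toy sketch, assuming NumPy: the RNN update is a loop in which each step waits for the previous one, while a transformer-style computation touches every position in a single matrix product.

```python
# Sequential RNN update versus one-shot, parallel-friendly matrix math.
import numpy as np

x = np.random.randn(6, 8)             # six tokens, eight dimensions
W = np.random.randn(8, 8)

# RNN-style: strictly sequential; step t cannot start before step t-1.
h = np.zeros(8)
for t in range(6):
    h = np.tanh(x[t] + h @ W)

# Transformer-style: all six positions handled in one matrix product,
# which maps naturally onto parallel hardware such as GPUs.
scores = (x @ W) @ x.T                 # all pairwise interactions at once
print(scores.shape)                    # (6, 6)
```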
At a high level, the encoder maps an input sequence into an abstract continuous representation that holds all the learned information.
The decoder then takes that continuous representation and, step by step, generates a single output token at a time.
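A minimal sketch of that step-by-step decoding loop follows; it assumes the toy EncoderDecoder from the earlier sketch, with made-up BOS/EOS token IDs, and uses greedy decoding (always picking the most likely token), which is just one simple strategy among several.

```python
# Greedy, token-by-token decoding with the toy EncoderDecoder defined above.
import torch

def greedy_decode(model, src, bos_id=1, eos_id=2, max_len=20):
    tokens = [bos_id]                      # start-of-sequence marker
    for _ in range(max_len):
        tgt = torch.tensor([tokens])       # everything generated so far
        logits = model(src, tgt)           # (1, len(tokens), vocab_size)
        next_id = int(logits[0, -1].argmax())  # most likely next token
        tokens.append(next_id)
        if next_id == eos_id:              # stop at end-of-sequence
            break
    return tokens

# e.g. greedy_decode(EncoderDecoder(), torch.randint(0, 100, (1, 6)))
```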
A transformer model is a neural network that learns context, and therefore meaning, by tracking relationships in sequential data, such as the words in a sentence.
The attention mechanism spots how the data elements in a series influence and depend on each other.
Any application that uses sequential text, images or video is a candidate for a transformer model. Transformers make self-supervised learning possible, and they have replaced the CNNs and RNNs that were popular just five years ago. Most papers on AI now cite transformers, a shift from a 2017 study that named RNNs and CNNs as the most popular models for pattern recognition.
Prior to the advent of transformers, neural networks had to be trained with large labelled data sets, which were costly and time-consuming to produce. Because transformers find the patterns between elements mathematically, through self-supervision, much of that cumbersome preparation is eliminated.
The mathematics used by transformers is amenable to parallel processing. The models thus run faster.
The name ‘transformer’ was coined in 2017 by Jakob Uszkoreit, a software engineer on the Google Brain team.
Transformers have since advanced from modelling relationships between words to modelling relationships between atoms in molecules.