The debut of the Transformer in 2017 stimulated a race to produce new models. OpenAI took the first initiative in June 2018 with GPT: a decoder-only model that excelled at Natural Language Generation (NLG) and ultimately powered ChatGPT. Google responded four months later, in October 2018, with BERT: an encoder-only model designed for Natural Language Understanding (NLU).
Decoder-Only Models
The decoder block in the Transformer generates an output sequence based on the input provided to the encoder. Decoder-only models eliminate the encoder block entirely; instead, multiple decoders are stacked together in a single model. These models accept a prompt as input and generate a response by predicting the next most probable word (or rather, token) one at a time, a task called Next Token Prediction (NTP). Decoder-only models thus excel at NLG tasks such as conversational chatbots, machine translation and code generation. Because ChatGPT is widely used, the public is familiar with such models: ChatGPT is powered by decoder-only models such as GPT-3.5 and GPT-4.
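To make NTP concrete, here is a minimal sketch of greedy next-token prediction, assuming the Hugging Face transformers library and the small GPT-2 checkpoint (any decoder-only model would follow the same loop; the prompt and token count are illustrative):

```python
# Minimal sketch of Next Token Prediction with a decoder-only model (GPT-2 assumed).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The Transformer architecture was introduced in"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate one token at a time: pick the most probable next token (greedy decoding)
# and append it to the sequence before predicting again.
with torch.no_grad():
    for _ in range(10):
        logits = model(input_ids).logits              # shape: (1, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

Production chatbots use more sophisticated decoding (sampling, temperature, beam search), but the core loop of predicting and appending one token at a time is the same.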
Encoder-Only Models
The encoder block in the Transformer accepts an input sequence and creates a vector representation for each word (or token). Encoder-only models eliminate the decoder and stack multiple encoders in a single model. These models do not accept prompts; instead, they take an input sequence and predict a missing word within it. Encoder-only models lack the capacity to generate new words, so they are not used for chatbot applications. Instead, they are used for NLU tasks such as Named Entity Recognition (NER) and Sentiment Analysis, where the vector representations give BERT models a deep understanding of the input text. Though it is technically possible to generate text with BERT, that is not what the architecture is meant for, and the results are not as good as those of decoder-only models.
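To illustrate the difference, here is a minimal sketch of masked-word prediction, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the model fills in the [MASK] token from the surrounding context rather than generating a continuation:

```python
# Minimal sketch of masked-word prediction with an encoder-only model (BERT assumed).
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the missing token using context on both sides of [MASK].
for candidate in unmasker("The movie was absolutely [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```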
Thus, the Transformer model has both encoders and decoders, GPT models are decoder-only, and BERT models are encoder-only.
It was the GPT model that made transformer pre-training popular. Pre-training gives the model a broad understanding of language nuances (word usage and grammatical patterns), producing a task-agnostic foundational model. After pre-training, a foundational model can be fine-tuned for a specific task. Fine-tuning involves training only a linear layer (a small feedforward neural network) added on top of the model; the weights and biases of the rest of the model, the foundational portion, remain unchanged.
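As a rough sketch of this style of fine-tuning, the example below freezes a BERT base model (assumed here via the Hugging Face transformers library) and trains only a small linear classification head for a hypothetical two-class sentiment task; the label and learning rate are illustrative, not prescriptive:

```python
# Minimal sketch of fine-tuning: freeze the foundational model, train only a linear head.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

base = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Freeze the foundational portion: its weights and biases will not be updated.
for param in base.parameters():
    param.requires_grad = False

# Task-specific linear layer (here, 2 classes for a sentiment task).
head = nn.Linear(base.config.hidden_size, 2)

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

inputs = tokenizer("This film was a delight.", return_tensors="pt")
label = torch.tensor([1])  # hypothetical label: 1 = positive

outputs = base(**inputs)
cls_vector = outputs.last_hidden_state[:, 0, :]   # vector for the [CLS] token
logits = head(cls_vector)

loss = nn.functional.cross_entropy(logits, label)
loss.backward()        # gradients flow only into the head
optimizer.step()
```

Using the [CLS] token's vector as a summary of the whole sequence is a common convention for sentence-level tasks with BERT-style models.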