Transformers are neural architectures that have become the foundation of modern NLP. They were introduced in the 2017 paper by Vaswani et al., "Attention Is All You Need". Transformers differ from RNNs and CNNs by avoiding recurrence and processing data in parallel, which significantly reduces training time.
An attention mechanism is used to weigh the influence of different words on one another, which lets Transformers handle sequences without processing them step by step. This makes them effective for a range of NLP tasks such as translation, text summarization, and sentiment analysis. A minimal sketch of the attention computation follows.
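The sketch below shows scaled dot-product attention, the core operation that weighs how much each token attends to every other token. It uses NumPy for clarity; real implementations add multiple heads, masking, and dropout, and the function name here is illustrative rather than taken from any particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # pairwise similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over each row
    return weights @ V                                    # weighted sum of value vectors

# Toy example: 3 tokens with 4-dimensional embeddings, self-attention (Q = K = V)
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (3, 4)
```

Because every position attends to every other position in a single matrix multiplication, all tokens are processed at once rather than one after another.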
Transformers have achieved state-of-the-art results on a wide range of NLP tasks, with BERT and GPT being two prominent variants. The architecture consists of an encoder and a decoder, each composed of multiple layers of self-attention, which enables the model to capture long-range dependencies in the input sequence.
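As a brief illustration of the two variants, the sketch below loads BERT (an encoder-style model that produces contextual embeddings) and GPT-2 (a decoder-style model that generates text). It assumes the Hugging Face `transformers` library and PyTorch are installed; the checkpoint names are standard public ones.

```python
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

# BERT: encoder-only model producing a contextual embedding for each token
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
inputs = bert_tok("Transformers process tokens in parallel.", return_tensors="pt")
embeddings = bert(**inputs).last_hidden_state   # shape: (1, seq_len, 768)

# GPT-2: decoder-only model trained to predict the next token
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = gpt_tok("Attention is", return_tensors="pt")
generated = gpt.generate(**prompt, max_new_tokens=10)
print(gpt_tok.decode(generated[0]))
```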
The encoder processes the input sequence and the decoder generates the output sequence. Because the architecture does not rely on recurrent connections, it is highly parallelizable and therefore more efficient to train.
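The following is a minimal sketch of this encoder-decoder structure using PyTorch's built-in `nn.Transformer`. The dimensions and layer counts are illustrative, and the random tensors stand in for embedded source and target sequences.

```python
import torch
import torch.nn as nn

# Small encoder-decoder Transformer; all values here are illustrative
model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.rand(1, 10, 64)   # embedded source sequence: (batch, src_len, d_model)
tgt = torch.rand(1, 7, 64)    # embedded target sequence: (batch, tgt_len, d_model)

# The encoder consumes all source positions at once (no recurrence),
# and the decoder attends to the encoder output while producing the target.
out = model(src, tgt)
print(out.shape)              # torch.Size([1, 7, 64])
```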