Add and Normalize
In the transformer architecture, we come across the term ‘add and normalize’. The first step is ‘add’: a residual connection that adds the input of a sublayer, such as self-attention or the feedforward network, to that sublayer's output. This mitigates the vanishing gradient problem and makes it possible for the model to learn deeper representations. The second step is ‘normalize’: layer normalization is applied to the result across the feature dimension. It stabilizes the training process and reduces dependency on initialization.
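As a rough illustration, here is a minimal Keras-style sketch of the add-and-normalize step, assuming a generic sublayer whose input and output share the same shape; the class name and epsilon value are illustrative choices, not part of any particular library.

```python
import tensorflow as tf

class AddAndNorm(tf.keras.layers.Layer):
    """Residual connection followed by layer normalization (illustrative)."""

    def __init__(self, epsilon=1e-6):
        super().__init__()
        # Layer normalization acts over the last (feature) dimension.
        self.norm = tf.keras.layers.LayerNormalization(epsilon=epsilon)

    def call(self, sublayer_input, sublayer_output):
        # 'Add': residual connection between the sublayer's input and output.
        # 'Normalize': layer normalization of the sum across features.
        return self.norm(sublayer_input + sublayer_output)
```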
Multi-head Attention
Multi-head attention enables a neural network to learn different aspects of the input sequence by applying multiple attention functions in parallel. The underlying idea is that different queries, keys and values can capture different semantic information from the same input. To illustrate, one attention head can focus on the syntactic structure of the sentence, while another can focus on the semantics of the words.
There are four steps in multi-head attention.
1. First, the input queries, keys and values are projected into h subspaces using linear transformations, where h is the number of attention heads. Each subspace has a lower dimension than the original input space.
2. Second, each projected query, key and value is fed into a scaled dot-product attention function, which computes the attention weights and outputs for each subspace independently (a sketch of this function follows the list).
3. Third, the outputs of the h attention heads are concatenated and linearly transformed into the final output dimension.
4. Lastly, the final output is optionally passed through layer normalization and a feedforward network.
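As referenced in step 2, here is a minimal TensorFlow sketch of scaled dot-product attention; the function name and the optional mask argument are assumptions for illustration.

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    # Raw attention scores: similarity of each query with every key.
    scores = tf.matmul(q, k, transpose_b=True)
    # Scale by sqrt(d_k) so the softmax stays in a well-behaved range.
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = scores / tf.math.sqrt(d_k)
    if mask is not None:
        # Push masked positions towards zero weight after the softmax.
        scores += mask * -1e9
    # Attention weights sum to 1 over the key positions.
    weights = tf.nn.softmax(scores, axis=-1)
    # Output is a weighted sum of the values.
    return tf.matmul(weights, v), weights
```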
Multi-head attention has several advantages: it can learn more complex and diverse patterns from the input sequence by combining several attention functions; it is computationally efficient, since each head works in a reduced-dimensionality subspace, which also keeps memory usage manageable; and the additional parameters give the model extra representational capacity, which can make it more robust. Multi-head attention can be implemented from scratch in TensorFlow and Keras, as sketched below.
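Following the four steps above, a from-scratch sketch in TensorFlow/Keras might look like the following. It reuses the scaled_dot_product_attention function sketched earlier; the class and argument names are illustrative conventions, not a specific library API.

```python
class MultiHeadAttention(tf.keras.layers.Layer):
    """Illustrative multi-head attention built from the four steps above."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.depth = d_model // num_heads
        # Step 1: linear projections for queries, keys and values.
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        # Step 3: final linear transformation after concatenation.
        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, q, k, v, mask=None):
        batch_size = tf.shape(q)[0]
        # Step 1: project, then split into h lower-dimensional subspaces.
        q = self.split_heads(self.wq(q), batch_size)
        k = self.split_heads(self.wk(k), batch_size)
        v = self.split_heads(self.wv(v), batch_size)
        # Step 2: scaled dot-product attention in each subspace independently.
        attn, _ = scaled_dot_product_attention(q, k, v, mask)
        # Step 3: concatenate the heads and project to the output dimension.
        attn = tf.transpose(attn, perm=[0, 2, 1, 3])
        concat = tf.reshape(attn, (batch_size, -1, self.num_heads * self.depth))
        return self.dense(concat)
```

The optional step 4 (layer normalization and a feedforward network) would typically be applied outside this layer, for example with the AddAndNorm sketch shown earlier.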
Multi-head Attention and Self-attention
Multi-head attention and self-attention are related yet distinct concepts in the transformer architecture.
Attention is the ability of the network to attend to different parts of another sequence while making predictions.
Self-attention is the ability of the network to attend to different parts of the same sequence while making predictions.
Multi-head attention makes it possible for the neural network to learn different aspects of the input or output sequence by applying multiple attention functions in parallel.
Self-attention can capture long-range dependencies and contextual information from the input sequence. It can be combined with multi-head attention, and it can be regularized by applying dropout or other methods to the attention weights, which reduces overfitting, as in the brief sketch below.
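As a brief usage sketch, multi-head self-attention with dropout on the attention weights can be obtained with the built-in tf.keras.layers.MultiHeadAttention layer; the tensor shape and hyperparameters here are arbitrary example values.

```python
import tensorflow as tf

# Hypothetical input: a batch of 2 sequences, 10 tokens each, 64 features.
x = tf.random.normal((2, 10, 64))

# Passing the same tensor as query, key and value makes this self-attention;
# dropout=0.1 regularizes the attention weights during training.
self_attention = tf.keras.layers.MultiHeadAttention(
    num_heads=8, key_dim=8, dropout=0.1)

y = self_attention(query=x, value=x, key=x, training=True)
print(y.shape)  # (2, 10, 64)
```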