Parallel computations occur in a transformer model in the self-attention mechanism and in the feedforward neural network layers.
The self-attention mechanism computes attention scores between all pairs of tokens in the input sequence, and this is done through matrix multiplication operations.
Specifically, the scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, produces the attention scores, and this computation is parallelized across all tokens in the sequence.
The feedforward layer applies a linear transformation (a matrix multiplication) followed by an element-wise non-linear activation function (ReLU), and this is parallelized across all tokens in the sequence.
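As a minimal sketch (using NumPy, with arbitrary shapes and random values rather than weights from any specific model), the position-wise feedforward layer is two matrix multiplications with a ReLU in between, applied to every token position at once:

```python
import numpy as np

# Hypothetical sizes: 4 tokens, model dimension 3, hidden dimension 8.
seq_len, d_model, d_ff = 4, 3, 8
x = np.random.randn(seq_len, d_model)   # one row per token
W1 = np.random.randn(d_model, d_ff)     # first linear transformation
W2 = np.random.randn(d_ff, d_model)     # second linear transformation

# Both matrix multiplications act on all token rows simultaneously;
# ReLU is element-wise, so it too is applied to every token in parallel.
hidden = np.maximum(0, x @ W1)          # linear + ReLU, shape (seq_len, d_ff)
out = hidden @ W2                       # shape (seq_len, d_model)
```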
During training, the layers of a transformer are applied one after another, but within each layer the computation for all positions of the sequence runs in parallel, which makes the process efficient and fast.
Thus, parallelization facilitates efficient processing of input sequences and faster training.
Illustration
Suppose we have four tokens in the input sequence: [Token1, Token2, Token3, Token4].
From the input embeddings we compute three matrices: Query (Q), Key (K) and Value (V). Each matrix has dimensions (sequence length) x (embedding dimension). In a real transformer these come from learned linear projections of the embeddings; for this illustration we simply reuse the input embeddings themselves.
Let us assume an embedding dimension of 3.
Input Embeddings:
Token 1: [1, 2, 3]
Token 2: [4, 5, 6]
Token 3: [7, 8, 9]
Token 4: [10, 11, 12]
Query Matrix (Q):
[ 1, 2, 3 ]
[ 4, 5, 6 ]
[ 7, 8, 9 ]
[ 10, 11, 12 ]
Key Matrix (K):
[ 1, 2, 3 ]
[ 4, 5, 6 ]
[ 7, 8, 9 ]
[ 10, 11, 12 ]
Value Matrix (V):
[ 1, 2, 3 ]
[ 4, 5, 6 ]
[ 7, 8, 9 ]
[ 10, 11, 12 ]
Calculate the attention scores by taking the dot product of the Query and Key matrices (QK^T) and scaling the result by the square root of the key dimension.
Apply a softmax to each row of the scaled scores to obtain the attention weights.
Finally, compute the output as the weighted sum of the rows of the Value matrix using these attention weights.
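To make the arithmetic concrete, here is a minimal NumPy sketch of scaled dot-product attention using the toy matrices above (with the simplification that Q, K and V all equal the input embeddings):

```python
import numpy as np

# Toy input: 4 tokens, embedding dimension 3 (one row per token).
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12]], dtype=float)

# Simplification from the illustration: Q = K = V = X
# (in a real transformer these come from learned linear projections).
Q, K, V = X, X, X
d_k = Q.shape[-1]

# All pairwise attention scores in one matrix multiplication: shape (4, 4).
scores = Q @ K.T / np.sqrt(d_k)

# Softmax over each row turns the scores into attention weights.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Weighted sum of the value vectors, again for all tokens at once.
output = weights @ V                    # shape (4, 3)
```

Note that the scores for all four tokens come out of a single matrix multiplication rather than a loop over tokens, which is exactly the parallelism discussed next.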
Parallelization
At each step of the self-attention computation, operations are performed in parallel across all tokens in the sequence. For example, while the attention scores for Token 1 are being computed, the scores for Token 2, Token 3 and Token 4 are computed simultaneously.
This parallelization makes the computation efficient and the models scalable.
Transformer models are far more amenable to parallel computation than RNNs, where information is processed one step at a time.
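To illustrate the contrast, here is a rough sketch with made-up NumPy shapes (not an actual RNN or transformer implementation): an RNN must loop over time steps because each hidden state depends on the previous one, whereas a transformer layer can transform every position with one matrix multiplication.

```python
import numpy as np

seq_len, d = 4, 3
X = np.random.randn(seq_len, d)
W = np.random.randn(d, d)   # input weights
U = np.random.randn(d, d)   # recurrent weights (RNN only)

# RNN-style recurrence: step t depends on step t-1, so the loop is sequential.
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(X[t] @ W + h @ U)

# Transformer-style projection: every position is transformed independently,
# so a single matrix multiplication handles the whole sequence at once.
H = np.tanh(X @ W)
```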
The feedforward layers (in both the encoder and the decoder) are also highly parallelizable. In training, computations across different sequences within a batch can be parallelized as well.
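As a rough sketch of this batch-level parallelism (assuming NumPy and an arbitrary batch of two sequences), the attention scores for every sequence in a batch can be computed with a single batched matrix multiplication:

```python
import numpy as np

batch, seq_len, d_model = 2, 4, 3
X = np.random.randn(batch, seq_len, d_model)   # a batch of sequences

# np.matmul broadcasts over the leading batch dimension, so the attention
# scores for every sequence in the batch are computed in one call.
scores = X @ X.transpose(0, 2, 1) / np.sqrt(d_model)   # shape (2, 4, 4)
```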
It should be noted that not all aspects of transformers are parallelizable; for example, autoregressive decoding at inference time generates tokens one at a time and therefore proceeds sequentially.