In a transformer model, masked multi-head attention in the decoder is a mechanism that lets each position attend to different parts of the sequence while preventing it from attending to future tokens.
This is useful in language modelling and sequence generation, where the model should only have access to past tokens, during both training and inference, in order to maintain causality.
As in the encoder, multi-head attention is used in the decoder so that different heads can capture different aspects of the sequence.
After the attention scores are computed, but before the softmax, a mask is applied so that the model cannot attend to future tokens. Masking ensures that each token attends only to earlier tokens in the sequence, preserving the autoregressive property.
In practice, this is done by setting the attention scores for future positions to a large negative value (effectively negative infinity) before applying the softmax, which drives their attention weights to zero. The model therefore attends only to past tokens during training.
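Here is a minimal NumPy sketch of that masking step. The shapes, the `-1e9` fill value, and the `masked_attention` helper are illustrative choices, not taken from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    """Scaled dot-product attention with a causal (look-ahead) mask."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)               # raw attention scores
    # Causal mask: True above the diagonal marks "future" positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)       # large negative value ~ -inf
    weights = softmax(scores, axis=-1)            # future positions get ~0 weight
    return weights @ V, weights

# Toy example: 4 tokens with 8-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
output, weights = masked_attention(Q, K, V)
```

Each row of `weights` sums to one, but all of its probability mass sits on the current and earlier positions, which is exactly the autoregressive constraint described above.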
This mechanism is employed in the decoder block so that, during sequence generation tasks such as machine translation or text summarization, the model attends only to previously generated outputs.
Masked attention restricts access to future information.
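To see this restriction concretely, the check below reuses the `masked_attention` helper from the sketch above (again just an illustrative setup): every attention weight above the diagonal is zero, and changing the last (future) token leaves the outputs of all earlier positions unchanged.

```python
import numpy as np

# Reuses masked_attention from the sketch above.
rng = np.random.default_rng(1)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))

out1, w = masked_attention(Q, K, V)
# Every attention weight on a "future" position (above the diagonal) is zero.
assert np.allclose(np.triu(w, k=1), 0.0)

# Changing only the last token does not affect the outputs of earlier positions.
K2, V2 = K.copy(), V.copy()
K2[-1] = rng.standard_normal(8)
V2[-1] = rng.standard_normal(8)
out2, _ = masked_attention(Q, K2, V2)
assert np.allclose(out1[:-1], out2[:-1])
```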