Self-attention and Multi-head Attention

Let us understand the self-attention mechanism in a transformer. It is a core part of the transformer's neural network architecture and underpins much of modern natural language processing. It lets the model focus on the relevant parts of an input sequence while processing each element.

First, each word in the sequence is converted into a vector (the input embedding). Then, for each word, three vectors are created: a Query vector (the current focus of attention), a Key vector (the essence of each word) and a Value vector (the actual information each word carries). Next, the similarity between the query vector and each key vector is computed, giving the attention scores (how relevant each word is to the current focus). Lastly, a weighted summation of the value vectors is performed, using the attention scores as weights. The result is a new representation of the current word, enriched by the context of related words in the sequence.
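
As a concrete illustration, here is a minimal sketch of these steps in Python with NumPy. The function and matrix names (self_attention, W_q, W_k, W_v), the toy dimensions and the scaling by the square root of the key dimension follow the standard scaled dot-product formulation and are assumptions for the example, not details taken from the text above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) matrix of word embeddings for one sequence."""
    Q = X @ W_q                                # query vectors: the focus of attention
    K = X @ W_k                                # key vectors: the essence of each word
    V = X @ W_v                                # value vectors: the information each word carries
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # similarity of every query with every key
    weights = softmax(scores, axis=-1)         # attention scores, one row per word
    return weights @ V                         # weighted sum of values: context-enriched vectors

# Toy example: a sequence of 4 words with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (4, 8): one new vector per word
```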

This process is repeated for every position in the sequence. It captures long-range dependencies and lets the model understand relationships between words even when they are far apart in the sentence. This is what makes tasks such as machine translation, text summarization and question answering work well.

The weighted sum of the value vectors becomes the output for the current position in the sequence. This output captures the context of the current word based on the relevant parts of the sequence.

Self-attention provides contextual embeddings: the meaning of each word is represented in the context of the surrounding words. To illustrate, consider 'money' and 'bank' versus 'river' and 'bank'. In these two pairs, the meaning of the word 'bank' changes with the context, and so does its contextual embedding.
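
To make the 'bank' illustration concrete, the sketch below reuses self_attention, rng and the weight matrices from the code above, with made-up toy embeddings (purely hypothetical) just to show that the same word ends up with a different context-dependent representation:

```python
# Reuses rng, self_attention, W_q, W_k, W_v from the sketch above.
# Hypothetical toy embeddings for the words involved.
emb = {w: rng.normal(size=8) for w in ["river", "money", "bank"]}

river_bank = self_attention(np.stack([emb["river"], emb["bank"]]), W_q, W_k, W_v)
money_bank = self_attention(np.stack([emb["money"], emb["bank"]]), W_q, W_k, W_v)

# The output row for "bank" (index 1) differs between the two contexts,
# because it is a weighted sum over different neighbours.
print(np.allclose(river_bank[1], money_bank[1]))   # False
```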

Multi-head attention is an extension of the self-attention mechanism in transformer models. Instead of a single attention mechanism, several attention mechanisms run in parallel. Each such mechanism is called a head.

The input embeddings are split across multiple heads, each with its own set of parameters (query, key and value matrices); this step is called splitting. Each head computes its attention scores independently (between its query and key vectors) and takes its own weighted sum of value vectors.

The outputs from all heads are then concatenated and linearly transformed, generating the output of the multi-head attention layer.
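
Here is a sketch of the multi-head version, again reusing self_attention, rng and X from the first code block. The number of heads, the per-head dimension and the output projection W_o are illustrative assumptions that follow the usual transformer layout:

```python
def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per head; W_o: output projection."""
    # Each head attends independently with its own query/key/value matrices ...
    head_outputs = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    # ... then the per-head outputs are concatenated and linearly transformed.
    return np.concatenate(head_outputs, axis=-1) @ W_o

# Two heads, each projecting the 8-dimensional embeddings down to 4 dimensions
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
W_o = rng.normal(size=(2 * 4, 8))   # maps the concatenated heads back to the model dimension
print(multi_head_attention(X, heads, W_o).shape)   # (4, 8)
```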

The multi-head attention mechanism allows the model to focus on two or more perspectives simultaneously, capturing diverse relationships between words. This enhances the model's representational capacity and lets it capture complex patterns in the data.
