Softmax Function in Transformers

The softmax function serves as an activation in the transformer model. It is applied to the output of the self-attention layer, which calculates the relevance of each input token to every other token. Softmax normalizes these scores so that they sum to 1, and each value then represents the probability that one input token is relevant to another.
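To see where softmax sits in this computation, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The shapes, variable names, and the use of the same vectors for queries, keys, and values are simplifying assumptions for illustration, not a full transformer implementation.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    # q, k, v: (seq_len, d) arrays of query, key, and value vectors.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)        # raw relevance of each token to every other token
    weights = softmax(scores, axis=-1)   # each row now sums to 1
    return weights @ v, weights          # weighted sum of values, plus the attention weights

# Toy usage: 6 tokens with 4-dimensional representations.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
output, weights = self_attention(x, x, x)
print(weights.sum(axis=-1))  # each row sums to 1.0
```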

The model learns long-range dependencies between tokens. Softmax lets the model quantify these relationships and learn to predict the next word in a sequence.

Consider the sentence "A dog sat on the mat." The self-attention layer calculates the relevance of each word to every other word. Its output is a matrix of numbers, where each row represents the relevance of one word to every other word in the sentence.

The softmax function normalizes the output of the self-attention layer so that the values in each row sum to 1. Each value in the output matrix is thus a probability that indicates how relevant one word is to another.

In our example, softmax is likely to assign the highest probability to the word ‘dog’ being relevant to the word ‘mat’, and a high probability to ‘sat’ as well, since both words are closely related to ‘mat’ in the sentence.
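To make this concrete, the sketch below assigns hypothetical relevance scores for the query word ‘mat’ against each word in the sentence and passes them through softmax. The numbers are invented for illustration only, not the output of a trained model.

```python
import numpy as np

# Hypothetical raw relevance scores for the query word 'mat' against each word
# in "A dog sat on the mat" (these numbers are made up for illustration).
words  = ["A", "dog", "sat", "on", "the", "mat"]
scores = np.array([0.1, 2.0, 1.5, 0.3, 0.2, 0.9])

# Softmax turns the scores into a probability distribution over the words.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

for word, w in zip(words, weights):
    print(f"{word:>4}: {w:.2f}")
# 'dog' and 'sat' receive the largest weights, matching the intuition above.
```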

Softmax is a tool that enables the transformer to learn long-range dependencies between tokens. It is essential for NLP tasks such as translation and text summarization.

The softmax function is defined as follows:

softmax(x) = exp(x) / sum(exp(x))

where

x = a vector of scores

The softmax function first exponentiates each element of the vector and then divides each exponentiated value by the sum of all of them.

The output of the softmax function is thus a vector of probabilities. Each probability represents how relevant the corresponding token in the input sequence is judged to be.
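Here is a direct translation of that definition into NumPy, offered as a sketch. Subtracting the maximum before exponentiating is a standard stability trick and does not change the result.

```python
import numpy as np

def softmax(x):
    """softmax(x) = exp(x) / sum(exp(x)), computed in a numerically stable way."""
    # Subtracting max(x) avoids overflow in exp() and leaves the result unchanged.
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum()

scores = np.array([2.0, 1.0, 0.1])   # a vector of scores
probs = softmax(scores)
print(probs)        # approximately [0.659 0.242 0.099]
print(probs.sum())  # 1.0 -- a valid probability distribution
```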

This allows the model to focus on the most relevant tokens in the input sequence, since softmax concentrates most of the probability mass on the tokens with the highest scores.

During generation, however, the model attends over the whole sequence to produce one token at a time, which is less flexible and can be computationally expensive for long inputs.

