The hidden layers refer to the layers between the input and output layers in a transformer. These hidden layers are where most of the computation and transformations occur.
In the transformer architecture, each hidden layer consists of a self-attention mechanism followed by a feedforward neural network.
First comes the self-attention mechanism, which contains multiple attention heads. When processing each token, it weighs the significance of every other input token, enabling the model to capture relationships between tokens in the input sequence.
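As a minimal sketch (assuming PyTorch, and illustrative sizes such as d_model=512 and num_heads=8 that are not tied to any particular model), multi-head self-attention can be written roughly like this:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Illustrative multi-head self-attention; sizes are assumptions."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One projection produces queries, keys, and values for all heads at once.
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)

        # Split into heads: (batch, num_heads, seq_len, d_head)
        def split_heads(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = map(split_heads, (q, k, v))

        # Scaled dot-product attention: each token weighs every other token.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = scores.softmax(dim=-1)
        context = weights @ v  # (batch, num_heads, seq_len, d_head)

        # Merge the heads back and project to the model dimension.
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.out_proj(context)
```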
Immediately after the self-attention mechanism comes the feedforward network layer. It consists of two linear transformations with a non-linear activation function (usually a ReLU) between them, which lets the model learn more complex patterns and relationships.
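A corresponding sketch of the feedforward layer, with an assumed inner dimension (d_ff=2048 is a common expansion size, not a requirement), might look like this:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feedforward block: two linear layers with a ReLU in between."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),  # expand to a wider hidden dimension
            nn.ReLU(),                 # non-linear activation
            nn.Linear(d_ff, d_model),  # project back to the model dimension
        )

    def forward(self, x):
        # Applied independently to every token position.
        return self.net(x)
```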
Finally, both the self-attention and feedforward sub-layers are augmented with residual connections and layer normalization: residual connections facilitate the flow of gradients during training, while layer normalization stabilizes the training process.
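Putting the two sub-layers together with residual connections and layer normalization gives one hidden layer. The sketch below reuses the MultiHeadSelfAttention and FeedForward classes above and uses the post-norm arrangement (normalize after adding the residual); some models normalize before each sub-layer instead:

```python
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One hidden layer: self-attention and feedforward sub-layers,
    each wrapped with a residual connection and layer normalization."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Residual connection around self-attention, then layer normalization.
        x = self.norm1(x + self.attn(x))
        # Residual connection around the feedforward network, then layer normalization.
        x = self.norm2(x + self.ffn(x))
        return x
```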
Together, the hidden layers transform an input sequence into a representation that captures its semantic and contextual information.
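As a hypothetical usage example, stacking several of the sketched layers and passing an embedded sequence through them yields a contextualized representation of the same shape:

```python
import torch
import torch.nn as nn

# Toy example: stack a few hidden layers and run a dummy sequence through them.
layers = nn.Sequential(*[TransformerLayer() for _ in range(6)])
tokens = torch.randn(1, 10, 512)  # (batch, seq_len, d_model): stand-in for embedded tokens
contextual = layers(tokens)       # same shape, now a contextualized representation
print(contextual.shape)           # torch.Size([1, 10, 512])
```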