As we know, LLMs such as ChatGPT consume extensive memory and have huge computational demands. Such models are therefore expensive to build, and their operating costs are very high.
It is possible to simplify the architecture of LLMs to make them more economical. The underlying architecture is the transformer, and it is this that ETH Zurich researchers have targeted. They have come up with a new, streamlined design of the transformer block that retains its accuracy and inference capability.
LLMs, as we know, operate on a foundation of transformer blocks, which process data sequences. In each transformer block there are two key sub-blocks: the attention mechanism and the multi-layer perceptron (MLP). The attention layer focuses selectively on different parts of the input data (say, tokens in a sequence) to capture context and relative importance. Even when tokens are far apart in the sequence, the model learns how they relate to each other.
The MLP sub-block then further refines and processes the information highlighted by the attention mechanism. Together, the two sub-blocks capture relationships within the data.
Transformer blocks also include additional features, such as residual (skip) connections and normalization layers, which speed up learning and mitigate issues such as vanishing gradients.
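To make this structure concrete, here is a minimal sketch of a conventional pre-norm transformer block in PyTorch. The class name, dimensions and layer choices are illustrative assumptions, not the researchers' code.

```python
import torch
import torch.nn as nn

class StandardTransformerBlock(nn.Module):
    """A conventional pre-norm transformer block: attention and MLP
    applied one after the other, each wrapped in a skip connection."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sub-block with residual (skip) connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # MLP sub-block with residual (skip) connection
        x = x + self.mlp(self.norm2(x))
        return x
```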
Transformer blocks are stacked to increase the model's capacity to capture complex relationships in the training data. However, the fundamental design of the transformer block has remained unaltered since its introduction.
Given the high costs of training and deploying these models, any efficiency gained in training and inference translates into substantial savings.
The transformer block can be simplified by eliminating unnecessary components. This reduces the parameter count and increases the throughput of the model.
The stripped-down version of the transformer, as per the research team, does not compromise either the training speed or performance on downstream tasks.
A transformer model has multiple attention heads, each with its own key (K), query (Q) and value (V) parameters, which together map the interactions among the input tokens. The researchers found that the V parameters, along with the projection layer that prepares the attention output for the MLP block, can be eliminated with no loss of performance.
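A rough sketch of what the attention computation looks like once the value and output-projection matrices are dropped (equivalently, fixed to the identity). The module name and dimensions are assumptions for illustration, not the researchers' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedAttention(nn.Module):
    """Attention with the value and output-projection matrices removed
    (equivalent to fixing them to the identity): only Q and K are learned."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        # No v_proj and no out_proj: the inputs themselves act as values.

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = x.view(b, t, self.n_heads, self.d_head).transpose(1, 2)  # identity values
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = F.softmax(scores, dim=-1)
        out = (weights @ v).transpose(1, 2).reshape(b, t, d)
        return out  # handed to the MLP without an output projection
```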
At the same time, the researchers removed the skip connections, which are normally used to avoid vanishing gradients. Vanishing gradients make training difficult: the gradient becomes too small to bring about significant learning in the preceding layers.
The transformer block has also been redesigned to process the attention heads and the MLP concurrently, rather than one after the other. It is this parallel processing that deviates from the conventional architecture.
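A minimal sketch of such a parallel block, reusing the SimplifiedAttention module from the earlier sketch. The names, dimensions and normalization placement here are assumptions rather than the researchers' exact design.

```python
import torch
import torch.nn as nn

class ParallelSimplifiedBlock(nn.Module):
    """Streamlined block sketch: attention and MLP applied in parallel to
    the same normalized input, with no skip connections around them.
    Assumes the SimplifiedAttention module defined above."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = SimplifiedAttention(d_model, n_heads)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        # Attention and MLP see the same input and their outputs are summed,
        # instead of being applied one after the other with residual adds.
        return self.attn(h) + self.mlp(h)
```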
The reduction in parameters has been compensated for by adjusting non-learnable parameters, by refining the training procedure, and by implementing architectural tweaks. Taken together, these alterations maintain the model's learning capabilities despite the leaner structure.
The researchers have tested the new transformer block. The transformer shrank in size by as much as 16 per cent without diluting its capabilities. Extended to a large model with billions of parameters, this could result in massive memory savings.
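As a rough, hypothetical back-of-the-envelope illustration (the model size and precision below are assumptions, not figures from the study):

```python
# Hypothetical illustration: memory saved by a 16% parameter reduction
# on a 7-billion-parameter model stored in 16-bit precision.
params = 7e9          # assumed model size (not from the study)
bytes_per_param = 2   # 16-bit weights
saving = 0.16 * params * bytes_per_param
print(f"~{saving / 1e9:.1f} GB of weight memory saved")  # ~2.2 GB
```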
The simplified models also train faster and make better use of the extra capacity that greater depth provides. However, the approach has so far been tested only at a smaller scale, and it remains unverified on larger models.