StreamingLLM

LLMs can handle long text sequences, but as the input grows they eventually hit their context limit. At that point it becomes necessary to extend the model's context to longer sequences.

Most existing solutions are computationally intensive, memory hungry, or imprecise. One breakthrough is StreamingLLM, developed by researchers from Meta AI, Carnegie Mellon University, and MIT. The technique extends the context to millions of tokens without consuming vast compute and memory resources, and the model's performance is preserved. StreamingLLM is therefore very useful for processing long-sequence text.

We have already studied the concept of a context window. A Llama-2 model has a context window of 4,096 tokens, or roughly 3,000 words. As long as the interaction stays within this context, performance is unaffected, but the limit is finite and it restricts what the model can do.
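As a quick illustration of where that 4,096-token budget lives, the sketch below reads the limit from a model's configuration using the Hugging Face transformers library. It assumes you have been granted access to the gated meta-llama/Llama-2-7b-hf checkpoint; any causal-LM checkpoint works the same way.

```python
# Minimal sketch: read a model's context-window size and count prompt tokens.
# Assumes the transformers library and access to the gated Llama-2 checkpoint.
from transformers import AutoConfig, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
config = AutoConfig.from_pretrained(model_id)
print(config.max_position_embeddings)        # 4096 -- the context window

tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Long documents quickly exhaust a fixed context window."
print(len(tokenizer(prompt)["input_ids"]))   # tokens consumed out of that budget
```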

One option is to extend the context window itself. This modifies the model's architecture and requires retraining, which is expensive and beyond the reach of many organizations. Because attention costs grow quadratically with sequence length, doubling the window size roughly quadruples the memory and compute costs.
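A back-of-the-envelope sketch of that quadratic scaling (the cost function below is illustrative, not a measurement of any particular model):

```python
# Attention cost grows with the square of the window size, so doubling the
# window roughly quadruples the attention memory and compute.
def attention_cost(window_size: int) -> int:
    return window_size * window_size      # proportional cost per layer/head

base = attention_cost(4_096)
for window in (4_096, 8_192, 16_384):
    print(window, attention_cost(window) / base)   # 1.0, 4.0, 16.0
```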

Alternatively, a sliding context window can be used: a model with a 4,096-token window is fed 4,096 − x tokens, where x is the number of tokens it is expected to generate. This practice has drawbacks. Autoregressive LLMs rely on KV caching to improve efficiency, a mechanism that computes and stores the key and value vectors of previous tokens so they need not be recomputed for every new token. Since each token's attention output depends on all of its preceding tokens, shifting the context window forces the entire KV cache to be recomputed, which sharply reduces the model's throughput.
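The toy sketch below (NumPy only, single head, no learned projections) shows the basic idea of a KV cache: each new token appends its key and value and attends over everything cached so far. It is an illustration of the mechanism, not the internals of any specific model.

```python
import numpy as np

d = 8                                  # toy head dimension
kv_cache = {"k": [], "v": []}          # grows by one entry per decoded token

def decode_step(token_embedding: np.ndarray) -> np.ndarray:
    # For simplicity the embedding itself serves as query, key, and value.
    kv_cache["k"].append(token_embedding)
    kv_cache["v"].append(token_embedding)
    K, V = np.stack(kv_cache["k"]), np.stack(kv_cache["v"])
    scores = K @ token_embedding / np.sqrt(d)   # attend over all cached tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                          # attention output for the new token

for _ in range(5):
    decode_step(np.random.randn(d))

# Sliding the window (dropping the oldest tokens and re-encoding the prompt)
# invalidates these cached keys/values, so they must all be recomputed.
```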

There is one more option: move the window but keep the cached values for the tokens that overlap between the old and new context. This is more efficient, but it has a flaw. The model's quality declines quickly once the earliest tokens are evicted and the context drifts away from the initial setting.
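A sketch of that reuse-and-evict variant, under the assumption of a simple fixed-size cache (plain integers stand in for key/value tensors):

```python
MAX_CACHE = 8   # toy cache size

def evict_oldest(kv_cache: dict) -> None:
    # Keep only the most recent MAX_CACHE entries; no recomputation needed.
    kv_cache["k"] = kv_cache["k"][-MAX_CACHE:]
    kv_cache["v"] = kv_cache["v"][-MAX_CACHE:]

cache = {"k": list(range(12)), "v": list(range(12))}
evict_oldest(cache)
print(cache["k"])   # [4, 5, ..., 11] -- tokens 0-3, the would-be sinks, are gone
```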

The researchers therefore focused on attention sinks. In autoregressive LLMs, a substantial proportion of the attention score is allocated to the initial tokens, irrespective of their relevance to the language-modeling task. These initial tokens are called attention sinks.
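This is easy to probe empirically. The sketch below (assuming the transformers library and the small, openly available gpt2 checkpoint rather than a Llama model) averages the attention weights over layers and heads and sums the mass that each position places on the first four tokens.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The quick brown fox jumps over the lazy dog near the river bank."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer
attn = torch.stack(out.attentions).mean(dim=(0, 1, 2))   # average to (seq, seq)
sink_mass = attn[:, :4].sum(dim=-1)                      # mass on the first 4 tokens
print(sink_mass)   # later positions tend to put a noticeable share of attention here
```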

A model's perplexity rises sharply when the text grows longer than the cache size, because the initial tokens get excluded. Perplexity measures the uncertainty in a model's predictions; the lower it is, the more confident and accurate the model. The attention sinks, however far they are from the tokens being predicted, therefore play a vital role in keeping the LLM stable. This is intuitive: language modeling is autoregressive, so the initial tokens are visible to almost all subsequent tokens. They are readily trained to act as attention sinks and end up capturing a disproportionate share of attention.
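For reference, perplexity is simply the exponential of the average negative log-likelihood the model assigns to the observed tokens; the probabilities below are made up purely to show the arithmetic.

```python
import math

next_token_probs = [0.25, 0.10, 0.40, 0.05]        # toy next-token probabilities
nll = [-math.log(p) for p in next_token_probs]     # negative log-likelihoods
perplexity = math.exp(sum(nll) / len(nll))
print(round(perplexity, 2))   # ~6.7; higher probabilities would give a lower value
```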

Remove the keys and values of the first few tokens from the cache and the model's performance deteriorates, because a significant share of the attention mass is lost. StreamingLLM therefore preserves these attention sinks.

This allows the model to perform well without any fine-tuning. Because the attention sinks are preserved, the attention-score distribution stays close to what the model saw during training. When an interaction exceeds the model's context length, StreamingLLM retains the KV cache for the attention-sink tokens (four initial tokens are enough), which extends the model's context and stabilizes its performance without recomputing the entire KV cache.

Under the StreamingLLM framework, the KV cache consists of the attention sinks plus a rolling KV cache that retains the most recent tokens, which are vital for language modeling. It is a versatile design that can be incorporated into any autoregressive language model that employs relative positional encoding.
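A minimal sketch of this cache policy, assuming four sink tokens and an arbitrary rolling-window size (the real implementation also reassigns positions within the cache, which is omitted here):

```python
def streaming_evict(kv_cache: list, n_sinks: int = 4, window: int = 1020) -> list:
    # Keep the first n_sinks entries (attention sinks) plus the most recent
    # `window` entries (rolling cache); evict everything in between.
    if len(kv_cache) <= n_sinks + window:
        return kv_cache
    return kv_cache[:n_sinks] + kv_cache[-window:]

cache = list(range(2_000))        # stand-ins for per-token (key, value) pairs
cache = streaming_evict(cache)
print(cache[:6], len(cache))      # [0, 1, 2, 3, 980, 981] 1024
```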

Generation of token 6

0 1 2 3 ( 4 5 ) 6. Here 0 1 2 3 are the attention sinks, 4 5 is the rolling KV cache, and 6 is the token being generated.

Generation of token 7

0 1 2 3 ( 4 5 6 ) 7. Here 0 1 2 3 are the attention sinks, 4 5 6 is the rolling KV cache, and 7 is the token being generated.

Generation of token 8

0 1 2 3 4 ( 5 6 7 ) 8. Here 0 1 2 3 are the attention sinks, 4 has been evicted, 5 6 7 is the rolling KV cache, and 8 is the token being generated.

Generation of token 9

0 1 2 3 4 5 ( 6 7 8 ) 9. Here 0 1 2 3 are the attention sinks, 4 and 5 have been evicted, 6 7 8 is the rolling KV cache, and 9 is the token being generated.
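The walkthrough above can be reproduced in a few lines, assuming four attention sinks and a rolling cache of the three most recent tokens (seven cached positions in total):

```python
N_SINKS, WINDOW = 4, 3            # matches the example above

cache = []
for token in range(10):           # generate tokens 0..9
    print(f"generating {token}: sinks={cache[:N_SINKS]}, rolling={cache[N_SINKS:]}")
    cache.append(token)
    if len(cache) > N_SINKS + WINDOW:
        cache = cache[:N_SINKS] + cache[-WINDOW:]   # evict the oldest rolling token
# e.g. "generating 8: sinks=[0, 1, 2, 3], rolling=[5, 6, 7]" -- token 4 evicted
```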

The code for StreamingLLM is available on GitHub, and Hugging Face is closely following its development.
