StreamingLLM

We have already studied this concept in a previous blog; here we cover a few additional points. The StreamingLLM framework enables LLMs to handle inputs of effectively unbounded length. Its key mechanism is the use of attention sinks, generally the initial tokens of the sequence, which attract a large share of attention even when they are not semantically important (they may be nothing more than punctuation marks). The key/value (KV) states of these sink tokens are kept in the cache so that the model's performance does not degrade once the text grows longer than the cache size.
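To make this concrete, here is a minimal sketch of an attention-sink KV cache. The class name, parameters, and defaults are hypothetical, not the official StreamingLLM API: the idea is simply to keep the first few tokens permanently while the rest of the cache behaves as a sliding window.

```python
from collections import deque

class SinkKVCache:
    """Toy KV cache: permanent attention sinks + sliding window (illustrative only)."""

    def __init__(self, num_sinks=4, window_size=1020):
        self.num_sinks = num_sinks
        self.window_size = window_size
        self.sinks = []        # KV states of the initial "attention sink" tokens
        self.window = deque()  # KV states of the most recent tokens

    def append(self, kv):
        """Add the key/value state of one newly processed token."""
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv)        # the first tokens become permanent sinks
        else:
            self.window.append(kv)
            if len(self.window) > self.window_size:
                self.window.popleft()    # evict the oldest non-sink token

    def current(self):
        """KV states the model attends over at the current step."""
        return self.sinks + list(self.window)
```

With this layout the cache never holds more than `num_sinks + window_size` entries, no matter how long the input stream becomes, while the sink tokens that soak up attention are always present.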

When streamed this way, LLMs generalize to texts longer than their training sequence length, and no fine-tuning of the model is required to achieve this.

In this scheme, some tokens are evicted: their key and value states are removed from the KV cache, which would otherwise consume extensive memory. There are two approaches to eviction, token-by-token eviction and batched token eviction, as sketched below.
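The following sketch contrasts the two strategies, reusing the window structure from the cache above. The function names and the batch size are assumptions made for illustration: token-by-token eviction drops one old entry per overflow, while batched eviction waits until the window is full and then drops a chunk of old entries at once.

```python
def evict_token_by_token(window, window_size):
    # Drop exactly one oldest non-sink entry whenever the window overflows.
    while len(window) > window_size:
        window.popleft()

def evict_batched(window, window_size, batch_size=128):
    # Once the window is full, drop the oldest `batch_size` entries in one go,
    # amortizing the bookkeeping cost over many future tokens.
    if len(window) >= window_size:
        for _ in range(min(batch_size, len(window))):
            window.popleft()
```

Token-by-token eviction keeps the cache exactly at its size limit at every step, whereas batched eviction trades a slightly smaller effective window just after a purge for fewer, cheaper cache updates.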
