AI has advanced because of the natural language processing (NLP) capabilities of large language models (LLMs), which interpret vast amounts of existing data and generate human-like text. Despite these capabilities, LLMs pose a significant challenge: computational inefficiency. They can be slow even on the most powerful hardware. Because these models are built on millions or billions of parameters, their demands on compute, memory and processing power are enormous, and such resources are not always available. In addition, LLMs often have slow response times, making them ill-suited for real-time or interactive applications. The issue is how to address these challenges so that LLMs become widely accessible.
Researchers from the University of California, Berkeley worked on this problem and developed vLLM, an open-source LLM inference and serving library that is simpler, faster and more economical than existing approaches. Teams that currently use the HuggingFace Transformers library to power their models can instead use vLLM as the serving backend and handle roughly five times more peak traffic than before, on the same limited computational resources, which reduces operational cost. vLLM supports HuggingFace Transformers models.
Research indicated that memory-related issues are what slow down LLM performance. LLMs use the input tokens to generate attention key and value tensors, which are cached in GPU memory and reused to generate subsequent tokens. These dynamic key and value tensors, known as the KV cache, occupy substantial memory, and managing them is cumbersome.
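To make the memory pressure concrete, here is a minimal sketch in PyTorch (not vLLM's actual code) of how a naive KV cache grows by one entry per generated token; the tensor shapes and function names are illustrative assumptions.

```python
# Illustrative sketch only: a per-sequence KV cache that grows with every
# decoding step, so GPU memory use scales with sequence length.
import torch

num_heads, head_dim = 8, 64                      # assumed model dimensions
k_cache = torch.empty(0, num_heads, head_dim)    # grows by one row per token
v_cache = torch.empty(0, num_heads, head_dim)

def decode_step(new_k, new_v, new_q):
    """Append this token's key/value to the cache, then attend over all of it."""
    global k_cache, v_cache
    k_cache = torch.cat([k_cache, new_k.unsqueeze(0)])   # (seq_len, heads, dim)
    v_cache = torch.cat([v_cache, new_v.unsqueeze(0)])
    scores = torch.einsum("hd,shd->hs", new_q, k_cache) / head_dim ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return torch.einsum("hs,shd->hd", weights, v_cache)

# One simulated decoding step with random tensors.
out = decode_step(torch.randn(num_heads, head_dim),
                  torch.randn(num_heads, head_dim),
                  torch.randn(num_heads, head_dim))
```

Because the cache keeps growing and its final length is unknown in advance, a serving system that reserves one long contiguous region per request ends up wasting a large share of GPU memory.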
The innovative concept of PagedAttention was introduced to resolve this challenge. PagedAttention is an attention algorithm that extends the idea of paging in operating systems to LLM serving. It manages the KV tensors flexibly by storing them in non-contiguous memory blocks, which removes the need for long contiguous memory allocations. Using a block table, these blocks can be retrieved independently during attention computation. The result is efficient memory utilization with far less waste. PagedAttention can also batch up to five times more sequences together, improving GPU utilization and throughput, and it enables efficient memory sharing across sequences.
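The following hedged Python sketch illustrates the block-table idea; the block size, pool size and helper names are assumptions for illustration and do not reflect vLLM's actual internals.

```python
# Illustrative sketch of PagedAttention's block-table concept: each sequence's
# KV cache is split into fixed-size blocks that may live anywhere in a physical
# pool, and a per-sequence block table maps logical blocks to physical ones.
BLOCK_SIZE = 16                       # tokens per block; assumed for illustration

physical_blocks = {}                  # physical_block_id -> list of (key, value)
free_block_ids = list(range(1024))    # assumed pool size
block_tables = {}                     # sequence_id -> [physical_block_id, ...]

def append_kv(seq_id, key, value):
    """Store one token's key/value, allocating a new physical block only
    when the sequence's last block is full."""
    table = block_tables.setdefault(seq_id, [])
    if not table or len(physical_blocks[table[-1]]) == BLOCK_SIZE:
        block_id = free_block_ids.pop()       # any free block; non-contiguous
        physical_blocks[block_id] = []
        table.append(block_id)
    physical_blocks[table[-1]].append((key, value))

def gather_kv(seq_id):
    """Walk the block table to retrieve the full KV cache for attention."""
    return [kv for block_id in block_tables[seq_id]
               for kv in physical_blocks[block_id]]
```

Because memory is reserved one small block at a time, waste is limited to the unused slots of each sequence's last block, and two sequences that share a prefix could in principle point their tables at the same physical blocks.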
vLLM manages attention key and value memory effectively through its implementation of the PagedAttention mechanism, and it integrates with HuggingFace models. The library can be installed with a simple pip command and supports both online serving and offline inference. As an open-source LLM inference and serving library, vLLM accelerates HuggingFace Transformers serving by up to 24x.
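For reference, a minimal offline-inference example using vLLM's documented Python API looks like the following; the model name and sampling settings are illustrative choices.

```python
# Install first (in a shell): pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")     # any supported HuggingFace model
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The future of LLM serving is"], params)
for output in outputs:
    print(output.outputs[0].text)
```

For online serving, vLLM also ships an OpenAI-compatible API server, so existing clients can point at a vLLM endpoint with minimal changes.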