vLLM — KV Cache & PagedAttention

KV Cache & PagedAttention

The KV cache stores the key and value tensors computed during attention, so they aren't recomputed each step. PagedAttention manages this cache in fixed-size pages — the central innovation of vLLM.

Reference: PagedAttention paper (SOSP 2023) | vLLM blog post

The KV Cache Problem

Why the KV Cache Needs Special Management

During autoregressive decoding, at each step the model must attend over all prior tokens. The key (K) and value (V) projections of those tokens are stored in the KV cache to avoid recomputation.

# Memory per token (per layer):
#   K tensor:  [num_kv_heads, head_dim] × dtype_bytes
#   V tensor:  [num_kv_heads, head_dim] × dtype_bytes
#
# Example: Llama-3-8B, FP16
#   num_layers=32, num_kv_heads=8, head_dim=128
#   Per token: 2 × 8 × 128 × 2 bytes × 32 layers = 131,072 bytes ≈ 128 KB
#   For a 4096-token context: 4096 × 128 KB = 512 MB just for one request

With many concurrent requests, naive per-request pre-allocation wastes memory: you must reserve worst-case (max_model_len) tokens even if the request only uses a fraction. This leads to fragmentation — like reserving a 4096-token block for a 50-token request.

PagedAttention borrows the virtual memory + paging concept from OS design. Physical GPU memory is divided into equal-sized pages (blocks); requests reference logical pages that map to physical ones via a page table.

PagedAttention

Block-Based Memory Layout

The GPU KV cache is a fixed tensor allocated once at startup:

# Physical KV cache tensors (one per layer, per device)
# Shape: [num_gpu_blocks, block_size, num_kv_heads, head_dim]
kv_cache_K = torch.empty(
    (num_gpu_blocks, block_size, num_kv_heads, head_dim),
    dtype=cache_dtype, device="cuda"
)
kv_cache_V = torch.empty_like(kv_cache_K)

Each block holds block_size tokens (default: 16). Requests are assigned blocks dynamically. The block table (a mapping from logical page to physical block ID) is passed to the attention kernel each step.

Visual: GPU KV cache pool (24 blocks, 3 active requests):

free

Blue=Request A, Green=Request B, Orange=Request C. Blocks are non-contiguous in physical memory but appear contiguous to the attention kernel via the block table.

# Block table for request A:
# Logical page 0 → physical block 2
# Logical page 1 → physical block 7
# Logical page 2 → physical block 15
block_table_A = [2, 7, 15]

# The attention kernel uses this to gather K/V:
for logical_page, phys_block in enumerate(block_table_A):
    kv_cache[phys_block]  # physical access

Prefix Caching (Automatic KV Reuse)

When multiple requests share the same prefix (e.g., a system prompt or a long document in RAG), vLLM avoids re-computing KV values for those tokens by sharing blocks.

# Block hashing for prefix caching
# Each full block gets a content hash:
#   hash(prefix_hash, token_ids_in_block) → BlockHash

# Request A: [system_prompt...] + "Q: What is 2+2?"
# Request B: [system_prompt...] + "Q: Tell me a joke"
#
# Shared:  blocks 0, 1, 2 (system_prompt blocks)
# Unique:  block 3 (question-specific tokens)

When a new request arrives, the scheduler checks if its prompt prefix matches any cached block hashes. Matched blocks are shared (reference counted) — no new allocation needed for those tokens.

SYS 0

SYS 1

SYS 2

A-Q

B-Q

free

Purple = shared prefix blocks. Both requests A and B reference the same physical blocks for the system prompt.

Source: vllm/v1/core/kv_cache_utils.py

KVCacheManager

Block Allocator

Source: vllm/v1/core/kv_cache_manager.py

The KVCacheManager is a pure Python block allocator. It runs on CPU (inside the scheduler) and manages logical block IDs, block hashes, and reference counts. The GPU tensors themselves are untouched here.

class KVCacheManager:
    def __init__(self, kv_cache_config: KVCacheConfig, ...):
        self.num_gpu_blocks = kv_cache_config.num_gpu_blocks
        self.block_size = kv_cache_config.block_size
        # Free list — available block IDs
        self.free_block_ids: list[int] = list(range(self.num_gpu_blocks))
        # Block metadata indexed by block_id
        self.blocks: dict[int, KVCacheBlock] = {}
        # Hash → block_id (for prefix cache hits)
        self.hash_to_block: dict[BlockHash, int] = {}

    def allocate_slots(
        self,
        request: Request,
        num_new_tokens: int,
    ) -> KVCacheBlocks | None:
        # Returns None if not enough blocks available
        # Assigns block IDs to request, updates block table
        ...

    def free(self, request: Request) -> None:
        # Decrement reference counts, return blocks to free list
        # Hashed blocks stay in cache (evicted LRU when needed)
        ...

KVCacheBlock

@dataclass
class KVCacheBlock:
    block_id: int
    block_hash: BlockHash | None   # None = not yet hashed (partial block)
    ref_count: int                  # shared = ref_count > 1
    is_null: bool                   # sentinel for padding
    token_count: int                # tokens stored in this block

KVCacheConfig & KVCacheSpec

Source: vllm/v1/kv_cache_interface.py

@dataclass
class KVCacheConfig:
    num_gpu_blocks: int          # total physical blocks on GPU
    num_cpu_blocks: int          # for CPU swap (offloading)
    block_size: int              # tokens per block (default 16)
    kv_cache_groups: list[KVCacheGroupSpec]  # per model component

# KVCacheSpec variants describe the attention type:
class FullAttentionSpec(KVCacheSpec):
    # Standard transformer: full sequence attention
    block_size: int
    num_kv_heads: int
    head_size: int
    dtype: torch.dtype

class SlidingWindowSpec(KVCacheSpec):
    # Mistral-style: only attend to last `window_size` tokens
    window_size: int
    ...

class CrossAttentionSpec(KVCacheSpec):
    # Encoder-decoder: attend to fixed encoder output
    ...

Different model architectures (Llama = full, Mistral = sliding window, T5 = cross-attention) use different specs, allowing the KV cache manager to allocate correctly for each.

Memory Profiling & Block Count Determination

At startup, vLLM runs a memory profiling pass to determine how many KV cache blocks fit in available GPU memory:

# In Worker.determine_num_available_blocks():
# 1. Run model with dummy input to measure activation peak memory
dummy_output = model_runner.profile_run()

# 2. Free memory after profiling pass
torch.cuda.empty_cache()
free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()

# 3. Reserve fraction for KV cache
kv_cache_memory = free_gpu_memory * gpu_memory_utilization

# 4. Compute how many blocks fit
bytes_per_block = (
    block_size * num_layers * num_kv_heads * head_dim * 2  # K+V
    * dtype.itemsize
)
num_gpu_blocks = int(kv_cache_memory // bytes_per_block)

The gpu_memory_utilization parameter (default: 0.9) controls what fraction of GPU memory is reserved for KV cache. Setting it too high risks OOM from PyTorch activation memory.