vLLM Internals

vLLM Architecture Overview

An in-depth walkthrough of how vLLM processes a request end-to-end — from the user's prompt to the generated tokens. Each section links to the actual source code on GitHub.

Source: vllm @ d400445  |  Paper: Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023)

What is vLLM?

vLLM is a high-throughput LLM inference engine built around four core insights:

PagedAttention

Manages KV cache memory in fixed-size pages (blocks), like virtual memory in an OS. This avoids external fragmentation, confines wasted space to the last partially filled block of each sequence, and enables sharing of prompt prefixes across requests.
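To make the paging analogy concrete, here is a minimal sketch of a block allocator. It is not vLLM's KVCacheManager: the `BlockPool` class, its method names, and the block size of 16 tokens are all assumptions chosen for illustration. It only shows how a per-request block table maps logical token positions to physical blocks.

```python
import math


class BlockPool:
    """Toy pool of fixed-size KV cache blocks (hypothetical, for illustration)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables: dict[str, list[int]] = {}  # request id -> block ids

    def allocate(self, request_id: str, num_tokens: int) -> list[int]:
        """Grow the request's block table so it can hold num_tokens tokens."""
        table = self.block_tables.setdefault(request_id, [])
        needed = math.ceil(num_tokens / self.block_size) - len(table)
        if needed > len(self.free_blocks):
            raise MemoryError("not enough free KV blocks")
        for _ in range(max(needed, 0)):
            table.append(self.free_blocks.pop())
        return table

    def free(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))

    def physical_slot(self, request_id: str, token_pos: int) -> tuple[int, int]:
        """Translate a logical token position into (physical block, offset)."""
        table = self.block_tables[request_id]
        return table[token_pos // self.block_size], token_pos % self.block_size
```

Because blocks are handed out one at a time, a 20-token sequence occupies two blocks and wastes at most the unused tail of the second one, regardless of where those blocks sit in physical memory.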

Continuous Batching

Requests are processed iteration-by-iteration. New requests join and finished ones leave at each step, keeping GPU utilization high without waiting for a full batch.
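The iteration-level behavior can be sketched with a small simulation. The function below is a toy model, not vLLM's scheduler: `continuous_batching` and its parameters are hypothetical, and each request is reduced to a count of tokens it still has to generate.

```python
from collections import deque


def continuous_batching(requests: dict[str, int], max_batch: int) -> list[list[str]]:
    """Simulate iteration-level scheduling.

    requests maps a request id to the number of tokens it still has to
    generate. Returns the ids that ran at each step.
    """
    waiting = deque(requests.items())
    running: dict[str, int] = {}
    steps: list[list[str]] = []
    while waiting or running:
        # Admit new requests into any free batch slots before the step.
        while waiting and len(running) < max_batch:
            req_id, remaining = waiting.popleft()
            running[req_id] = remaining
        steps.append(list(running))
        # One decode step: every running request produces one token.
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] <= 0:
                del running[req_id]  # finished requests leave immediately
    return steps
```

With a batch limit of two, a short request that finishes after one step frees its slot for the next waiting request on the very next iteration, instead of the batch idling until its longest member completes.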

Chunked Prefill

Long prompts are split into chunks and interleaved with decode steps, balancing latency and throughput rather than blocking on prefill.
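One way to picture the interleaving is a per-step token budget shared by decodes and prompt chunks. The sketch below is an assumption-laden illustration: `schedule_step`, the `token_budget` parameter, and the decode-first ordering are hypothetical simplifications, not vLLM's actual scheduling policy.

```python
def schedule_step(prefill_remaining: dict[str, int],
                  decoding: list[str],
                  token_budget: int) -> dict[str, int]:
    """Return the number of tokens scheduled per request for one step.

    Decode requests are served first (one token each); whatever budget is
    left is spent on prompt chunks. This ordering is an assumption of the
    sketch.
    """
    scheduled: dict[str, int] = {}
    for req_id in decoding:
        if token_budget == 0:
            break
        scheduled[req_id] = 1
        token_budget -= 1
    for req_id, remaining in prefill_remaining.items():
        if token_budget == 0:
            break
        chunk = min(remaining, token_budget)
        if chunk > 0:
            scheduled[req_id] = chunk  # only part of the prompt this step
            token_budget -= chunk
    return scheduled
```

A prompt of 10 tokens arriving alongside two decoding requests and a budget of 6 gets a 4-token chunk this step; the rest of the prompt is processed in later steps while the decodes keep advancing.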

Prefix Caching

Identical prompt prefixes share KV cache blocks via content hashing. Repeated system prompts or RAG contexts are computed only once.
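Content hashing of blocks can be sketched as a hash chain, where each block's hash covers its own tokens plus the hash of the block before it, so a match implies the whole prefix matches. The helper names, the use of SHA-256, and the string encoding below are assumptions for illustration and do not reflect vLLM's actual hash function or key layout.

```python
import hashlib


def block_hashes(token_ids: list[int], block_size: int = 16) -> list[str]:
    """Hash each full block together with the hash of the block before it."""
    hashes: list[str] = []
    parent = ""
    num_full = len(token_ids) // block_size  # partial trailing block is skipped
    for i in range(num_full):
        block = token_ids[i * block_size:(i + 1) * block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_blocks(token_ids: list[int], cache: set[str],
                         block_size: int = 16) -> int:
    """Count the leading blocks whose hashes are already in the cache."""
    hits = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break  # the chain is broken, later blocks cannot match
        hits += 1
    return hits
```

Two prompts that share a system prompt produce identical hashes for the shared leading blocks and diverge from the first block that differs, so only the divergent tail needs fresh computation.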

End-to-End Data Flow

Key Data Structures

Struct             Direction  File                     Purpose
SamplingParams     in         sampling_params.py       User-facing: temperature, top_p, max_tokens, stop…
EngineCoreRequest  in         v1/engine/__init__.py    Serializable IPC request (msgspec)
Request            internal   v1/request.py            Scheduler-internal state with lifecycle tracking
SchedulerOutput    internal   v1/core/sched/output.py  Batch description + block assignments sent to executor
ModelRunnerOutput  internal   v1/outputs.py            Raw logits + sampled tokens from device
EngineCoreOutput   out        v1/engine/__init__.py    Per-request new tokens + finish reason
RequestOutput      out        outputs.py               User-facing: decoded text, logprobs, metrics

Source Layout

vllm/
├── entrypoints/          # LLM, OpenAI API, gRPC server
│   ├── llm.py            # Offline inference class
│   └── openai/           # HTTP API server (FastAPI)
├── v1/                   # V1 engine (production)
│   ├── engine/
│   │   ├── core.py       # EngineCore orchestrator
│   │   ├── core_client.py # IPC client / proxy
│   │   ├── async_llm.py  # Async engine (online serving)
│   │   ├── llm_engine.py # Sync engine (offline)
│   │   ├── input_processor.py
│   │   └── output_processor.py
│   ├── core/
│   │   ├── sched/
│   │   │   ├── scheduler.py     # Main scheduler
│   │   │   ├── output.py        # SchedulerOutput
│   │   │   └── request_queue.py
│   │   ├── kv_cache_manager.py  # Block allocator
│   │   └── kv_cache_utils.py
│   ├── executor/         # Distributed execution adapters
│   ├── worker/
│   │   ├── gpu_worker.py
│   │   └── gpu_model_runner.py  # Model forward + sampling
│   ├── sample/
│   │   ├── sampler.py
│   │   └── rejection_sampler.py
│   ├── attention/backend.py
│   └── kv_cache_interface.py
├── model_executor/
│   ├── layers/attention/ # Attention layer with paged KV
│   └── model_loader.py
├── config/
│   └── vllm.py           # VllmConfig (central config)
├── sampling_params.py
└── outputs.py

Class Responsibility Map

Class           File          Role
LLM             entry-points  User API: offline inference
AsyncLLM        entry-points  User API: online async serving
EngineCore      engine        Orchestrates scheduler + executor each step
Scheduler       scheduler     Batches requests, manages KV allocation
KVCacheManager  kv-cache      PagedAttention block allocation
Executor        execution     Dispatches to workers (single/multi/Ray)
GPUModelRunner  execution     Model forward pass, CUDA graphs
Attention       attention     QKV attention with paged KV cache
Sampler         sampling      Token sampling from logits
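The division of labor in the map above can be sketched as a loop in which the engine core asks the scheduler for a batch, hands it to the executor, and feeds the result back to the scheduler. The `Toy*` classes below are hypothetical stand-ins with invented method signatures, not vLLM's classes. The executor returns a dummy token where a real one would run the model forward pass and sampler.

```python
class ToyScheduler:
    """Stand-in for the scheduler: picks requests and tracks progress."""

    def __init__(self, max_batch: int):
        self.max_batch = max_batch
        self.waiting: list[str] = []
        self.running: dict[str, list[int]] = {}  # request id -> generated tokens
        self.max_tokens: dict[str, int] = {}

    def add_request(self, req_id: str, max_tokens: int) -> None:
        self.waiting.append(req_id)
        self.max_tokens[req_id] = max_tokens

    def schedule(self) -> list[str]:
        """Admit waiting requests into free slots and return the batch."""
        while self.waiting and len(self.running) < self.max_batch:
            self.running[self.waiting.pop(0)] = []
        return list(self.running)

    def update_from_output(self, sampled: dict[str, int]) -> dict[str, list[int]]:
        """Record new tokens and return the outputs of finished requests."""
        finished = {}
        for req_id, token in sampled.items():
            self.running[req_id].append(token)
            if len(self.running[req_id]) >= self.max_tokens[req_id]:
                finished[req_id] = self.running.pop(req_id)
        return finished


class ToyExecutor:
    """Stand-in for executor + model runner + sampler: emits a dummy token."""

    def execute_model(self, batch: list[str]) -> dict[str, int]:
        return {req_id: 0 for req_id in batch}


class ToyEngineCore:
    """Stand-in for the orchestrator: one scheduler/executor round per step."""

    def __init__(self, scheduler: ToyScheduler, executor: ToyExecutor):
        self.scheduler = scheduler
        self.executor = executor

    def step(self) -> dict[str, list[int]]:
        batch = self.scheduler.schedule()
        sampled = self.executor.execute_model(batch)
        return self.scheduler.update_from_output(sampled)
```

Keeping scheduling state in one class and model execution in another means the executor can be swapped (single process, multiple workers) without touching the batching logic, which is the separation the map describes.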

External References