Engine Layer
The engine is the central orchestrator. It owns the request lifecycle, drives the scheduler, dispatches work to the executor, and coordinates the iterative generation loop. Two layers exist: the EngineCore (compute-side) and the front-end (LLMEngine / AsyncLLM), connected via IPC.
Architecture: Two-Process Split
Front End vs. Engine Core
In production, vLLM separates the HTTP/user-facing front end from the GPU-facing engine core into distinct processes. This keeps front-end work from competing with the engine loop for the Python GIL, so the loop can run without being interrupted by request handling.
┌──────────────────────────────────┐   IPC (msgspec over socket)
│  FRONT-END PROCESS               │◄──────────────────────────────►
│  LLMEngine / AsyncLLM            │
│   ├─ InputProcessor              │      ┌──────────────────────────┐
│   ├─ OutputProcessor             │      │  ENGINE-CORE PROCESS     │
│   └─ EngineCoreClient (proxy)    │      │  EngineCore              │
└──────────────────────────────────┘      │   ├─ Scheduler           │
                                          │   ├─ KVCacheManager      │
                                          │   └─ Executor → Workers  │
                                          └──────────────────────────┘
Single-process fallback: both sides run in one process (debugging/testing)
Ray mode: engine core lives on Ray actors
The boundary is EngineCoreClient. On the client side it serializes EngineCoreRequest structs; on the server side EngineCore deserializes and processes them.
EngineCore
EngineCore — The Main Loop
Source: vllm/v1/engine/core.py
EngineCore is instantiated once. It holds the scheduler, executor, and KV cache config. Its step() method performs one iteration of the generation loop and is called repeatedly until all requests are done.
class EngineCore:
    def __init__(self, vllm_config: VllmConfig, ...):
        self.scheduler = Scheduler(vllm_config, ...)
        self.model_executor = Executor.get_class(vllm_config)(vllm_config)
        self.kv_cache_config: KVCacheConfig = self._init_kv_cache()

    def add_request(self, request: EngineCoreRequest) -> None:
        # Deserialize and hand to scheduler
        self.scheduler.add_request(Request.from_engine_core_request(request))

    def step(self) -> tuple[dict[str, EngineCoreOutputs], bool]:
        # 1. Scheduler decides what to run this iteration
        scheduler_output = self.scheduler.schedule()
        # 2. Executor runs the model (may span multiple GPUs)
        model_output = self.model_executor.execute_model(scheduler_output)
        # 3. Scheduler updates state from results (free blocks, mark done)
        engine_core_outputs = self.scheduler.update_from_output(
            scheduler_output, model_output
        )
        return engine_core_outputs, self.scheduler.has_unfinished_requests()
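The schedule → execute → update cycle can be tried out in isolation with a toy model. The sketch below is not vLLM code: ToyScheduler, ToyExecutor, ToyEngineCore, and run_to_completion are hypothetical stand-ins, the "model" emits a step counter instead of sampling tokens, and every unfinished request is scheduled on every step.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    request_id: str
    max_tokens: int
    output_token_ids: list[int] = field(default_factory=list)


class ToyScheduler:
    """Hypothetical scheduler: schedules every unfinished request each step."""

    def __init__(self) -> None:
        self.running: dict[str, ToyRequest] = {}

    def add_request(self, request: ToyRequest) -> None:
        self.running[request.request_id] = request

    def schedule(self) -> list[str]:
        return list(self.running)

    def update_from_output(
        self, scheduled: list[str], tokens: dict[str, int]
    ) -> dict[str, list[int]]:
        outputs: dict[str, list[int]] = {}
        for request_id in scheduled:
            request = self.running[request_id]
            request.output_token_ids.append(tokens[request_id])
            outputs[request_id] = list(request.output_token_ids)
            # Mark the request done once it reaches its length limit.
            if len(request.output_token_ids) >= request.max_tokens:
                del self.running[request_id]
        return outputs

    def has_unfinished_requests(self) -> bool:
        return bool(self.running)


class ToyExecutor:
    """Hypothetical executor: the 'model' emits the current step index."""

    def __init__(self) -> None:
        self.step_index = 0

    def execute_model(self, scheduled: list[str]) -> dict[str, int]:
        tokens = {request_id: self.step_index for request_id in scheduled}
        self.step_index += 1
        return tokens


class ToyEngineCore:
    def __init__(self) -> None:
        self.scheduler = ToyScheduler()
        self.model_executor = ToyExecutor()

    def add_request(self, request: ToyRequest) -> None:
        self.scheduler.add_request(request)

    def step(self) -> tuple[dict[str, list[int]], bool]:
        scheduled = self.scheduler.schedule()                      # 1. schedule
        model_output = self.model_executor.execute_model(scheduled)  # 2. execute
        outputs = self.scheduler.update_from_output(scheduled, model_output)  # 3. update
        return outputs, self.scheduler.has_unfinished_requests()


def run_to_completion(core: ToyEngineCore) -> tuple[int, dict[str, list[int]]]:
    """Call step() until no request is left; return (step count, final outputs)."""
    steps = 0
    final: dict[str, list[int]] = {}
    has_more = core.scheduler.has_unfinished_requests()
    while has_more:
        outputs, has_more = core.step()
        final.update(outputs)
        steps += 1
    return steps, final
```

With one request limited to two tokens and another limited to three, the loop takes three steps: the shorter request drops out of the schedule after the second step.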
Initialization sequence
Request Lifecycle in EngineCore
Requests move through states managed by Request (vllm/v1/request.py):
class RequestStatus(enum.Enum):
    WAITING = "waiting"          # in request queue
    RUNNING = "running"          # being processed this step
    PREEMPTED = "preempted"      # evicted due to memory pressure
    FINISHED_STOP = "finished_stop"
    FINISHED_LENGTH = "finished_length"
    FINISHED_ABORT = "finished_abort"

class Request:
    request_id: str
    prompt_token_ids: list[int]
    sampling_params: SamplingParams
    status: RequestStatus
    # Tracking progress
    num_computed_tokens: int     # tokens whose KV is in cache
    output_token_ids: list[int]
    stop_reason: int | str | None
Under memory pressure, RUNNING → PREEMPTED (blocks reclaimed) → WAITING (requeued for re-prefill).
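The preemption path can be written as a small state machine. This is a sketch under assumptions rather than the vLLM implementation: ToyStatus mirrors the enum above, but the `_ALLOWED` transition table, `ToyRequest`, and `preempt_and_requeue` are hypothetical. Resetting `num_computed_tokens` to zero is how the sketch models "blocks reclaimed, re-prefill needed".

```python
import enum
from dataclasses import dataclass


class ToyStatus(enum.Enum):
    WAITING = "waiting"
    RUNNING = "running"
    PREEMPTED = "preempted"
    FINISHED_STOP = "finished_stop"
    FINISHED_LENGTH = "finished_length"
    FINISHED_ABORT = "finished_abort"


# Assumed transition table for illustration; finished states are terminal.
_ALLOWED = {
    ToyStatus.WAITING: {ToyStatus.RUNNING, ToyStatus.FINISHED_ABORT},
    ToyStatus.RUNNING: {
        ToyStatus.PREEMPTED,
        ToyStatus.FINISHED_STOP,
        ToyStatus.FINISHED_LENGTH,
        ToyStatus.FINISHED_ABORT,
    },
    ToyStatus.PREEMPTED: {ToyStatus.WAITING, ToyStatus.FINISHED_ABORT},
}


@dataclass
class ToyRequest:
    request_id: str
    status: ToyStatus = ToyStatus.WAITING
    num_computed_tokens: int = 0

    def transition(self, new_status: ToyStatus) -> None:
        """Move to new_status, rejecting moves the table does not allow."""
        if new_status not in _ALLOWED.get(self.status, set()):
            raise ValueError(
                f"illegal transition {self.status.name} -> {new_status.name}"
            )
        self.status = new_status


def preempt_and_requeue(request: ToyRequest) -> None:
    """RUNNING -> PREEMPTED -> WAITING, discarding cached progress."""
    request.transition(ToyStatus.PREEMPTED)
    # The KV blocks are reclaimed, so nothing computed so far survives.
    request.num_computed_tokens = 0
    request.transition(ToyStatus.WAITING)
```

Because the cached progress is discarded, a requeued request has to prefill its prompt (and any tokens generated so far) again when it is next scheduled.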
EngineCoreClient — IPC Layer
Cross-Process Communication
Source: vllm/v1/engine/core_client.py
The client abstracts the communication mechanism. vLLM supports multiple backends:
| Mode | Class | Use Case |
|---|---|---|
| In-process | UniProcEngineCoreClient | Debugging, single-GPU, tests |
| Multiprocessing | MPClient | Default multi-GPU on single node |
| Ray | RayClient | Multi-node distributed serving |
| External launcher | ExternalLauncherClient | Kubernetes / custom orchestration |
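The table amounts to one interface with several interchangeable backends. The sketch below illustrates that idea only; ToyCore, ToyClient, InProcessToyClient, SerializingToyClient, and make_client are hypothetical names, and JSON stands in for the real wire format so that the example needs nothing beyond the standard library.

```python
import json
from abc import ABC, abstractmethod


class ToyCore:
    """Stand-in for the engine core: queues requests, reports their ids on step."""

    def __init__(self) -> None:
        self.pending: list[dict] = []

    def add_request(self, request: dict) -> None:
        self.pending.append(request)

    def step(self) -> list[str]:
        done = [request["request_id"] for request in self.pending]
        self.pending = []
        return done


class ToyClient(ABC):
    """The interface the front end programs against, whatever the backend."""

    def __init__(self, core: ToyCore) -> None:
        self._core = core

    @abstractmethod
    def add_request(self, request: dict) -> None: ...

    @abstractmethod
    def step(self) -> list[str]: ...


class InProcessToyClient(ToyClient):
    """Calls the core directly; no serialization."""

    def add_request(self, request: dict) -> None:
        self._core.add_request(request)

    def step(self) -> list[str]:
        return self._core.step()


class SerializingToyClient(ToyClient):
    """Round-trips every message through bytes, as a cross-process client would."""

    def add_request(self, request: dict) -> None:
        wire_bytes = json.dumps(request).encode("utf-8")
        self._core.add_request(json.loads(wire_bytes))

    def step(self) -> list[str]:
        wire_bytes = json.dumps(self._core.step()).encode("utf-8")
        return json.loads(wire_bytes)


def make_client(mode: str, core: ToyCore) -> ToyClient:
    """Pick a backend by name; the caller only ever sees ToyClient."""
    clients = {"inproc": InProcessToyClient, "serialized": SerializingToyClient}
    return clients[mode](core)
```

Both backends return the same result for the same calls, which is what lets the front end stay unaware of where the engine core actually runs.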
Serialization
# msgspec handles binary (MessagePack) serialization of the request structs.
# `socket` below is a placeholder for an already-connected socket object.
import msgspec
encoder = msgspec.msgpack.Encoder()
decoder = msgspec.msgpack.Decoder(EngineCoreRequest)
# Send (front-end side): encode the request and write it to the socket
wire_bytes = encoder.encode(engine_core_request)
socket.send(wire_bytes)
# Receive (engine-core side): decode the request the front end sent
engine_core_request = decoder.decode(socket.recv())
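msgspec is designed to be considerably faster than pickle or the standard-library json module for structured data; the exact speedup depends on the message shape and should be measured for the workload. This matters in high-QPS scenarios, where serialization overhead is paid on every request and every output.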
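To see the send and receive halves working together, here is a minimal, self-contained sketch of one request crossing a socket. It makes two substitutions that are assumptions of the example, not statements about vLLM: the standard-library json module stands in for msgspec, and a 4-byte length prefix is used to delimit messages. The helper names (send_message, recv_message, round_trip) are hypothetical.

```python
import json
import socket
import struct


def send_message(sock: socket.socket, payload: dict) -> None:
    """Encode payload and write it with a 4-byte big-endian length prefix."""
    body = json.dumps(payload).encode("utf-8")
    sock.sendall(struct.pack("!I", len(body)) + body)


def _recv_exact(sock: socket.socket, num_bytes: int) -> bytes:
    """Read exactly num_bytes; recv() may return fewer bytes than requested."""
    chunks = []
    remaining = num_bytes
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_message(sock: socket.socket) -> dict:
    """Read one length-prefixed message and decode it."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return json.loads(_recv_exact(sock, length))


def round_trip(request: dict) -> dict:
    """Send a request from a 'front end' socket to an 'engine core' socket."""
    front_end, engine_core = socket.socketpair()
    try:
        send_message(front_end, request)
        return recv_message(engine_core)
    finally:
        front_end.close()
        engine_core.close()
```

The length prefix is needed because a stream socket does not preserve message boundaries; a messaging library that delivers whole messages would make that step unnecessary.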
LLMEngine (Sync) & AsyncLLM (Async)
LLMEngine — Synchronous Wrapper
Source: vllm/v1/engine/llm_engine.py
LLMEngine is the sync wrapper used by the offline LLM class. It drives the engine core in a loop.
class LLMEngine:
    def __init__(self, vllm_config: VllmConfig):
        self.engine_core = EngineCoreClient.get_class(vllm_config)(vllm_config)
        self.input_processor = InputProcessor(vllm_config, ...)
        self.output_processor = OutputProcessor(...)

    def add_request(self, request_id: str, prompt, params):
        engine_request = self.input_processor.process(prompt, params, ...)
        self.engine_core.add_request(engine_request)

    def step(self) -> list[RequestOutput]:
        # Drive one engine iteration
        engine_core_outputs, has_more = self.engine_core.step()
        return self.output_processor.process_outputs(engine_core_outputs)

    def has_unfinished_requests(self) -> bool: ...
    def abort_request(self, request_id: str): ...
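A caller typically drives this interface by adding requests and then calling step() until has_unfinished_requests() returns False. The sketch below shows that loop against a hypothetical ToyEngine, which only counts down a per-request token budget. The method names match the excerpt above, but the parameter dictionary, the return type of step(), and the generate helper are assumptions of the example.

```python
class ToyEngine:
    """Hypothetical engine with the same method names as the excerpt above."""

    def __init__(self) -> None:
        self._remaining: dict[str, int] = {}

    def add_request(self, request_id: str, prompt: str, params: dict) -> None:
        self._remaining[request_id] = params["max_tokens"]

    def step(self) -> list[tuple[str, bool]]:
        """Advance every request by one token; return (request_id, finished)."""
        outputs = []
        for request_id in list(self._remaining):
            self._remaining[request_id] -= 1
            finished = self._remaining[request_id] == 0
            if finished:
                del self._remaining[request_id]
            outputs.append((request_id, finished))
        return outputs

    def has_unfinished_requests(self) -> bool:
        return bool(self._remaining)


def generate(engine: ToyEngine, prompts: list[str], max_tokens: int) -> list[str]:
    """Submit all prompts, then step until done; return ids in finish order."""
    for index, prompt in enumerate(prompts):
        engine.add_request(str(index), prompt, {"max_tokens": max_tokens})

    finished_ids = []
    while engine.has_unfinished_requests():
        for request_id, finished in engine.step():
            if finished:
                finished_ids.append(request_id)
    return finished_ids
```

All requests are submitted before the first step, so each iteration can batch them together; this is the offline pattern, in contrast to a server that accepts requests while the loop is running.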
Configuration — VllmConfig
Source: vllm/config/vllm.py
VllmConfig is the central configuration object passed everywhere. It composes all sub-configs:
class VllmConfig:
    model_config: ModelConfig                    # model, dtype, quantization
    cache_config: CacheConfig                    # block_size, num_gpu_blocks
    parallel_config: ParallelConfig              # tensor/pipeline parallelism
    scheduler_config: SchedulerConfig            # max_num_seqs, max_num_batched_tokens
    device_config: DeviceConfig                  # cuda, cpu, tpu, …
    speculative_config: SpeculativeConfig
    attention_config: AttentionConfig            # backend choice
    compilation_config: CompilationConfig        # CUDA graphs, torch.compile
    observability_config: ObservabilityConfig
    kv_transfer_config: KVTransferConfig | None  # disaggregated prefill/decode
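The benefit of a single composed config object is that each component takes one argument and reads only the sub-configs it cares about. A minimal sketch of that pattern with dataclasses follows; the Toy* classes are hypothetical, and the default values are illustrative rather than vLLM's defaults.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCacheConfig:
    block_size: int = 16  # tokens per KV cache block (illustrative value)


@dataclass
class ToySchedulerConfig:
    max_num_seqs: int = 8               # illustrative value
    max_num_batched_tokens: int = 512   # illustrative value


@dataclass
class ToyConfig:
    """Composes the sub-configs, in the spirit of the class above."""

    cache_config: ToyCacheConfig = field(default_factory=ToyCacheConfig)
    scheduler_config: ToySchedulerConfig = field(default_factory=ToySchedulerConfig)


class ToyScheduler:
    """Receives the whole config and picks out only what it needs."""

    def __init__(self, config: ToyConfig) -> None:
        self.max_num_seqs = config.scheduler_config.max_num_seqs
        self.block_size = config.cache_config.block_size

    def blocks_needed(self, num_tokens: int) -> int:
        # Ceiling division: a partially filled block still occupies a block.
        return -(-num_tokens // self.block_size)
```

Overriding one sub-config leaves the rest at their defaults, for example `ToyConfig(cache_config=ToyCacheConfig(block_size=32))`.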