vLLM Internals

Executor, Worker & Model Runner

The execution layer runs the actual transformer model. It has three tiers: the Executor abstracts distribution strategy, the Worker manages one device (GPU), and the ModelRunner runs the model forward pass and sampling on that device.

Execution Tier Architecture

Three-Tier Stack

Executor
vllm/v1/executor/abstract.py
Distribution strategy (single/multi-GPU/Ray). Broadcasts SchedulerOutput to all workers.
↓ calls execute_model() on each worker
Worker (per device)
vllm/v1/worker/gpu_worker.py
Owns one GPU. Manages model, KV cache tensors, communicators.
↓ calls model_runner.execute_model()
GPUModelRunner
vllm/v1/worker/gpu_model_runner.py
Prepares inputs, runs transformer forward pass, samples next tokens.

Executor

Executor Implementations

Source: vllm/v1/executor/abstract.py

The executor is selected at startup based on parallel_config.distributed_executor_backend:

Executor                  | Backend          | When
UniProcExecutor           | in-process       | Single GPU, debugging
MultiprocExecutor         | multiprocessing  | Multi-GPU on one node (default)
RayDistributedExecutor    | Ray              | Multi-node cluster
ExternalLauncherExecutor  | external         | Custom launcher (Kubernetes, SLURM)
class Executor(ABC):
    @classmethod
    def get_class(cls, vllm_config: VllmConfig) -> type["Executor"]:
        backend = vllm_config.parallel_config.distributed_executor_backend
        if backend == "mp":
            return MultiprocExecutor
        elif backend == "ray":
            return RayDistributedExecutor
        ...

    @abstractmethod
    def execute_model(
        self, scheduler_output: SchedulerOutput
    ) -> ModelRunnerOutput:
        # Broadcast input to all workers, gather outputs
        ...
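
The broadcast-and-gather contract of execute_model() can be illustrated with a toy, single-process stand-in. This is only a sketch: ToyWorker and ToyExecutor are hypothetical names, the scheduler output is a plain dict, and the real executors dispatch to separate worker processes rather than looping in-process.

```python
from dataclasses import dataclass


@dataclass
class ToyWorker:
    """Hypothetical stand-in for one per-device worker."""
    rank: int

    def execute_model(self, scheduler_output: dict) -> dict:
        # Every rank runs the same step; here we only echo what was scheduled.
        return {"rank": self.rank, "req_ids": list(scheduler_output["req_ids"])}


class ToyExecutor:
    """Hypothetical executor: same input to every worker, rank 0's output back."""

    def __init__(self, world_size: int) -> None:
        self.workers = [ToyWorker(rank) for rank in range(world_size)]

    def execute_model(self, scheduler_output: dict) -> dict:
        # "Broadcast": hand the identical scheduler output to each worker.
        outputs = [w.execute_model(scheduler_output) for w in self.workers]
        # "Gather": only the rank-0 result is passed back to the scheduler.
        return outputs[0]


executor = ToyExecutor(world_size=4)
result = executor.execute_model({"req_ids": ["req-a", "req-b"]})
```

The point of the shape is that the caller sees one call and one result regardless of how many workers sit behind the executor.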

Tensor Parallelism

For large models split across multiple GPUs with tensor parallelism (TP), the executor launches one worker per GPU. Workers communicate via NCCL all_reduce during the forward pass (after each attention and MLP block). The executor broadcasts the SchedulerOutput to every worker, but only rank 0's output is returned to the scheduler.

GPU 0 (rank 0): Heads 0-7 + LM Head
GPU 1 (rank 1): Heads 8-15
GPU 2 (rank 2): Heads 16-23
GPU 3 (rank 3): Heads 24-31

Each GPU holds a shard of every transformer layer. After each attention and MLP block, an all_reduce sums the partial outputs across GPUs, so every rank continues with the same full activations.
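
Why a sum is enough to recombine the shards can be seen in a small pure-Python sketch of a row-parallel projection, the pattern typically used for the last matmul of an attention or MLP block. The helper names are hypothetical, all_reduce_sum is a local stand-in for the NCCL collective, and integers are used so the comparison is exact.

```python
def matvec(x: list[int], w: list[list[int]]) -> list[int]:
    """Compute x @ w for a vector x (len k) and a matrix w (k rows, n cols)."""
    n_cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n_cols)]


def all_reduce_sum(partials: list[list[int]]) -> list[int]:
    """Stand-in for an all_reduce(SUM): element-wise sum over all ranks."""
    return [sum(values) for values in zip(*partials)]


def row_parallel_matvec(x: list[int], w: list[list[int]], tp_size: int) -> list[int]:
    """Shard the inner dimension across tp_size ranks, then sum the partials."""
    shard = len(x) // tp_size  # assumes len(x) is divisible by tp_size
    partials = []
    for rank in range(tp_size):
        lo, hi = rank * shard, (rank + 1) * shard
        # Each rank sees only its slice of the input and of the weight rows.
        partials.append(matvec(x[lo:hi], w[lo:hi]))
    return all_reduce_sum(partials)


x = [1, 2, 3, 4]
w = [[1, 0], [0, 1], [2, 2], [1, -1]]
full = matvec(x, w)                             # single-device result
sharded = row_parallel_matvec(x, w, tp_size=2)  # two "ranks" plus a sum
```

Both paths give the same vector, which is why each rank can hold only a slice of the weights and still continue with identical activations after the reduction.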

Worker

GPU Worker

Source: vllm/v1/worker/gpu_worker.py | worker_base.py

One worker process per GPU. Lifecycle:

class GPUWorker(WorkerBase):
    def init_device(self):
        # Set CUDA device, init distributed group (NCCL)
        torch.cuda.set_device(self.local_rank)
        dist.init_process_group(backend="nccl", ...)
        self.gpu_model_runner = GPUModelRunner(self.vllm_config, ...)

    def load_model(self):
        # Download from HuggingFace, load weights into GPU tensors
        self.gpu_model_runner.load_model()

    def initialize_cache(self, num_gpu_blocks: int):
        # Allocate KV cache tensors on GPU
        self.gpu_model_runner.initialize_kv_cache(num_gpu_blocks)

    def compile_or_warm_up_model(self):
        # Run dummy inference to trigger torch.compile / CUDA graph capture
        self.gpu_model_runner.capture_model()

    def execute_model(
        self,
        scheduler_output: SchedulerOutput,
    ) -> ModelRunnerOutput:
        return self.gpu_model_runner.execute_model(scheduler_output)
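
The num_gpu_blocks argument to initialize_cache() is essentially a memory budget expressed in blocks. A back-of-the-envelope version of that calculation is sketched below, assuming a plain per-layer key/value layout and no sharding across ranks; the helper names and the example model shape are hypothetical, not read from any real configuration.

```python
def kv_block_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    block_size: int,
    dtype_bytes: int,
) -> int:
    """Bytes one KV cache block occupies across all layers (keys and values)."""
    per_token_per_layer = 2 * num_kv_heads * head_dim * dtype_bytes  # K and V
    return num_layers * block_size * per_token_per_layer


def num_blocks_that_fit(free_bytes: int, block_bytes: int) -> int:
    """Whole blocks that fit in the memory left over for the KV cache."""
    return free_bytes // block_bytes


# Illustrative numbers only (assumed):
# 32 layers, 8 KV heads of dim 128, 16 tokens per block, fp16 (2 bytes).
block_bytes = kv_block_bytes(32, 8, 128, 16, 2)
num_gpu_blocks = num_blocks_that_fit(8 * 1024**3, block_bytes)
```

With these assumed numbers a block costs 2 MiB, so 8 GiB of free memory holds 4096 blocks. Under tensor parallelism the KV heads are split across ranks, which shrinks the per-GPU cost accordingly.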

GPUModelRunner

Model Forward Pass

Source: vllm/v1/worker/gpu_model_runner.py

GPUModelRunner is the core of GPU-side execution. It prepares the input tensors from the scheduler's abstract description and runs them through the model.

class GPUModelRunner:  # simplified; a plain class that wraps the model, not an nn.Module
    def execute_model(
        self, scheduler_output: SchedulerOutput
    ) -> ModelRunnerOutput:
        # 1. Build input tensors from scheduler description
        input_ids, positions, attn_metadata = self._prepare_inputs(scheduler_output)

        # 2. Forward pass through transformer
        hidden_states = self.model(
            input_ids=input_ids,
            positions=positions,
            kv_caches=self.kv_cache,
            attn_metadata=attn_metadata,
        )

        # 3. Select only the *last* hidden state per sequence (for sampling)
        #    (earlier prompt tokens in a prefill don't produce output tokens)
        logits = self.compute_logits(hidden_states, scheduler_output)

        # 4. Sample next tokens
        sampler_output = self.sampler(logits, sampling_metadata)

        return ModelRunnerOutput(
            sampled_token_ids=sampler_output.sampled_tokens,
            logprobs=sampler_output.logprobs,
        )
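
Step 3 relies on knowing where each sequence ends inside the flattened batch. A minimal sketch of that index computation, using a hypothetical helper and string labels in place of real hidden-state rows:

```python
from itertools import accumulate


def last_token_indices(query_lens: list[int]) -> list[int]:
    """Index of each sequence's final token in the flattened batch."""
    # The running total gives the end offset (exclusive) of each sequence.
    return [end - 1 for end in accumulate(query_lens)]


# One 4-token prefill followed by two single-token decodes.
query_lens = [4, 1, 1]
indices = last_token_indices(query_lens)

# Toy "hidden states": one label per flattened token position.
hidden_states = ["p0", "p1", "p2", "p3", "d0", "e0"]
selected = [hidden_states[i] for i in indices]
```

Only the selected rows need to go through the LM head, so a long prefill costs one logits row rather than one per prompt token.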

_prepare_inputs()

This method converts the request-level SchedulerOutput into the flat tensors the model consumes:

def _prepare_inputs(self, scheduler_output: SchedulerOutput):
    # input_ids: [total_tokens] — all prompt + decode tokens concatenated
    # positions: [total_tokens] — position indices for RoPE
    # block_tables: [num_seqs, max_blocks] — KV cache lookup table
    # seq_lens: [num_seqs] — actual length of each sequence
    # query_lens: [num_seqs] — how many query tokens each seq contributes
    #
    # Example (2 requests: prefill=4tok, decode=1tok):
    # input_ids = [p0, p1, p2, p3, d0]
    # positions  = [ 0,  1,  2,  3,  7]  # d0 is at position 7
    # seq_lens   = [4,  8]
    # query_lens = [4,  1]   # req0 contributes 4 queries, req1 contributes 1
    ...
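
The flattening described in the comments above can be reproduced with plain Python lists. This is a sketch under simplifying assumptions: ScheduledRequest and prepare_inputs are hypothetical names, block tables are omitted, and lists stand in for GPU tensors.

```python
from dataclasses import dataclass


@dataclass
class ScheduledRequest:
    """Hypothetical per-request view of what the scheduler decided."""
    num_computed_tokens: int    # tokens already in the KV cache
    new_token_ids: list[int]    # tokens to run through the model this step


def prepare_inputs(requests: list[ScheduledRequest]):
    input_ids, positions, seq_lens, query_lens = [], [], [], []
    for req in requests:
        n_new = len(req.new_token_ids)
        start = req.num_computed_tokens
        input_ids.extend(req.new_token_ids)
        # New tokens continue from where the cached context ends.
        positions.extend(range(start, start + n_new))
        seq_lens.append(start + n_new)
        query_lens.append(n_new)
    return input_ids, positions, seq_lens, query_lens


batch = [
    # Prefill: nothing cached yet, four prompt tokens to process.
    ScheduledRequest(num_computed_tokens=0, new_token_ids=[101, 102, 103, 104]),
    # Decode: seven tokens cached, one new token at position 7.
    ScheduledRequest(num_computed_tokens=7, new_token_ids=[205]),
]
input_ids, positions, seq_lens, query_lens = prepare_inputs(batch)
```

The results match the worked example in the comments: positions are [0, 1, 2, 3, 7], seq_lens are [4, 8], and query_lens are [4, 1].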

CUDA Graph Execution

For decode-only steps (no prefill), vLLM captures CUDA graphs to eliminate Python overhead and kernel launch latency.

# During warm-up (capture phase):
for batch_size in [1, 2, 4, 8, 16, 32, ...]:
    # Capture the decode-only forward pass for this batch size.
    # Inputs and outputs live in static buffers so replay can reuse them.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        self.output_buf[:batch_size] = model(self.input_ids_buf[:batch_size], ...)
    self.graphs[batch_size] = graph  # store

# During inference (replay phase):
if is_decode_only and batch_size in self.graphs:
    # Update input buffer in-place
    self.input_ids_buf[:batch_size] = input_ids
    # Replay graph — no Python overhead, ~10-30% throughput gain
    self.graphs[batch_size].replay()
    return self.output_buf[:batch_size]
else:
    # Eager mode for prefill or unseen batch sizes
    return model(input_ids, ...)

CUDA graphs eliminate per-kernel launch overhead (~5-10µs per kernel). For small decode batches, where GPU compute is minimal, this overhead dominates; replaying a captured graph replaces the individual launches with a single one.
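
To get a feel for the scale, the per-kernel figure can be multiplied out for one decode step. The kernel count below is an assumption chosen for illustration (32 layers at roughly 10 kernels each), not a measurement of any particular model.

```python
def launch_overhead_ms(num_kernels: int, per_kernel_us: float) -> float:
    """Total CPU-side launch overhead for one forward pass, in milliseconds."""
    return num_kernels * per_kernel_us / 1000.0


# Assumed workload: 32 layers with roughly 10 kernels each per decode step.
num_kernels = 32 * 10
low = launch_overhead_ms(num_kernels, 5.0)
high = launch_overhead_ms(num_kernels, 10.0)
print(f"{low:.1f} to {high:.1f} ms of launch overhead per decode step")
```

Under these assumptions the launches alone cost 1.6 to 3.2 ms per step, which is comparable to or larger than the GPU time of a small decode batch.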

ModelRunnerOutput

Source: vllm/v1/outputs.py

@dataclass
class ModelRunnerOutput:
    # One token ID per sequence (the newly sampled token)
    sampled_token_ids: torch.Tensor   # shape: [num_seqs]
    # Log-probabilities (only if requested by SamplingParams)
    logprobs: LogprobsTensors | None
    # Optional: hidden states for embedding tasks
    hidden_states: torch.Tensor | None
    # Output from KV transfer connector (disaggregated prefill/decode)
    kv_connector_output: KVConnectorOutput | None

This is collected from all workers (only rank 0 returns meaningful sampled tokens). The scheduler's update_from_output() consumes it to update request state and emit EngineCoreOutput.

Pipeline Parallelism

With pipeline parallelism (PP), layers are split across GPUs in sequence: GPU 0 runs layers 0-7, GPU 1 runs layers 8-15, and so on. Activations (hidden states) are passed between adjacent GPUs via point-to-point sends and receives.

# PP communication: send/recv between adjacent ranks
# GPU 0 → GPU 1: hidden_states after layer 7
dist.send(hidden_states, dst=next_rank)
# GPU 1 receives into a preallocated buffer of matching shape and dtype:
dist.recv(hidden_states, src=prev_rank)

PP is useful for very large models that don't fit on a single node even with TP. TP and PP can be combined (e.g., TP=4, PP=2 for an 8-GPU setup).
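
The layer split and the combined TP=4, PP=2 layout can be sketched in a few lines. Both helpers are hypothetical: the even split and the "TP ranks are contiguous within a stage" rank ordering are assumptions made for illustration, and a real deployment may partition layers or number ranks differently.

```python
def partition_layers(num_layers: int, pp_size: int) -> list[range]:
    """Split layers into pp_size contiguous stages, as evenly as possible."""
    base, extra = divmod(num_layers, pp_size)
    stages, start = [], 0
    for stage in range(pp_size):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


def rank_grid(tp_size: int, pp_size: int) -> list[list[int]]:
    """Global ranks per pipeline stage, assuming TP ranks are contiguous."""
    return [
        [stage * tp_size + tp_rank for tp_rank in range(tp_size)]
        for stage in range(pp_size)
    ]


stages = partition_layers(num_layers=16, pp_size=2)  # layers 0-7 and 8-15
grid = rank_grid(tp_size=4, pp_size=2)               # 8 GPUs in total
```

Here stage 0 owns layers 0-7 on ranks 0-3 and stage 1 owns layers 8-15 on ranks 4-7. All-reduce traffic stays inside a stage, while only the hidden states cross the stage boundary.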