vLLM — Scheduler & Continuous Batching

Scheduler & Continuous Batching

The scheduler is the brain of vLLM. Each engine iteration it decides: which requests to run, how many tokens to process, and which KV cache blocks to allocate. Its decisions directly determine throughput and latency.

Reference: Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022), the paper that introduced iteration-level scheduling for LLMs

Continuous Batching

Why Continuous Batching?

Traditional static batching waits for a full batch of requests to arrive, processes them all to completion, then accepts new ones. GPU utilization collapses when some requests finish early.

Continuous (or iteration-level) batching makes scheduling decisions once per iteration (one forward pass, typically one new token per decoding request), so requests can join or leave the batch at any iteration boundary:

# Static batching (naive):
Iteration 1-50:  [req-A(20tok), req-B(35tok), req-C(50tok)]   # batch is held until the longest request finishes
# req-A finishes at iter 20, req-B at iter 35 → their slots sit idle until iter 50

# Continuous batching (vLLM):
Iter 1-20:  [req-A, req-B, req-C]
Iter 21:    [req-D joins, req-A left]   # immediate!
Iter 21-35: [req-B, req-C, req-D]
Iter 36:    [req-E joins, req-B left]
...

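The vLLM authors report up to 24× higher throughput than HuggingFace Transformers, and the PagedAttention paper reports 2–4× higher throughput than FasterTransformer and Orca at comparable latency, with larger gains on longer sequences.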
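The timeline above can be reproduced with a small simulation. This is a minimal sketch, not vLLM code: the function names, the fixed number of batch slots, and the request lengths are assumptions chosen for illustration, and each request is modeled only by the number of decode iterations it needs.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Iterations needed when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Iterations needed when freed slots are refilled at every iteration boundary."""
    waiting = deque(lengths)
    running = []  # remaining iterations for each in-flight request
    steps = 0
    while waiting or running:
        # Fill any free slots from the waiting queue.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One iteration: every running request produces one token;
        # requests with nothing left to generate drop out of the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


lengths = [20, 35, 50, 30, 15]  # req-A .. req-E (hypothetical output lengths)
print(static_batching_steps(lengths, batch_size=3))      # 80
print(continuous_batching_steps(lengths, batch_size=3))  # 50
```

With three slots, the static schedule needs 80 iterations (50 for the first batch, 30 for the second), while the continuous schedule needs 50 because req-D and req-E take over slots as soon as req-A and req-B finish.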

Prefill vs. Decode Tokens

In one scheduler step, two types of tokens can appear in the same batch:

Prefill tokens

Processing the user's input prompt. Produces KV cache entries for all prompt tokens in one shot. Compute-bound (parallelizable across the prompt length).

Decode tokens

Generating output tokens one at a time. Each new token attends over all prior KV cache. Memory-bandwidth bound.

With chunked prefill, long prompts are split into chunks that fit within the per-step token budget (max_num_batched_tokens) and are batched alongside decode tokens, so no single prefill monopolizes the GPU.

# Example: scheduler_config
SchedulerConfig(
    max_num_seqs=256,              # max concurrent requests
    max_num_batched_tokens=8192,   # max total tokens per step
    enable_chunked_prefill=True,   # interleave prefill chunks with decode
    max_num_partial_prefills=1,    # max simultaneous chunked-prefill requests
)
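The chunking arithmetic can be sketched as follows. This is an illustration under simplifying assumptions, not the scheduler's actual logic: the function name `prefill_chunks` is hypothetical, the number of decode requests is assumed to stay constant while the prompt is prefilled, and each decode request is assumed to consume one token of the budget per step. Only `max_num_batched_tokens` comes from the config above.

```python
def prefill_chunks(prompt_len, num_decode_reqs, max_num_batched_tokens):
    """Return the prefill chunk size used in each step for one long prompt.

    Assumption: decode requests are served first (one token each) and the
    prefill receives whatever is left of the per-step token budget.
    """
    budget = max_num_batched_tokens - num_decode_reqs
    if budget <= 0:
        raise ValueError("no token budget left for prefill")
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(remaining, budget)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


# A 20,000-token prompt arriving while 192 requests are decoding:
print(prefill_chunks(20_000, num_decode_reqs=192, max_num_batched_tokens=8192))
# [8000, 8000, 4000]
```

In this example the prompt is prefilled over three steps, and the 192 decode requests keep producing one token in each of them instead of stalling behind a single 20,000-token prefill.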

Scheduler Internals

Scheduler.schedule() — Step by Step

Source: vllm/v1/core/sched/scheduler.py

Called once per engine iteration. Returns a SchedulerOutput describing exactly what the model runner should execute.

def schedule(self) -> SchedulerOutput:
    # Phase 1: Resume cached (prefix-hit) requests
    #   These reuse previously computed KV blocks — no re-prefill needed
    cached_requests = self._schedule_cached_requests()

    # Phase 2: Schedule running (decode) requests
    #   Each needs one new KV slot for the next token; a new block is
    #   allocated only when the request's last block is full
    running_scheduled = self._schedule_running_requests()

    # Phase 3: Prefill new requests from the queue
    #   Allocate KV blocks for prompt tokens
    #   Subject to: max_num_batched_tokens, max_num_seqs, available blocks
    new_requests = self._schedule_new_requests()

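    # Per-request token counts for this step (request_id → tokens);
    # collecting them from the three phases is elided in this sketch
    num_scheduled_tokens: dict[str, int] = ...

    return SchedulerOutput(
        new_requests=new_requests,
        cached_requests=cached_requests,
        scheduled_running_reqs=running_scheduled,
        num_scheduled_tokens=num_scheduled_tokens,
        ...
    )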
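To make the interplay between running and new requests concrete, here is a toy, self-contained version of one scheduling step. It is a sketch under stated assumptions rather than the code in scheduler.py: the function `schedule_step` and its arguments are hypothetical, KV block accounting and prefix-cache hits are left out, and prompts are admitted whole (no chunked prefill).

```python
from collections import deque


def schedule_step(running, waiting, max_num_batched_tokens, max_num_seqs):
    """Return request_id -> number of tokens scheduled in this step.

    running: request ids currently decoding.
    waiting: deque of (request_id, prompt_len) in arrival order.
    """
    budget = max_num_batched_tokens
    scheduled = {}

    # Running requests go first: each decodes exactly one token.
    for req_id in running:
        if budget == 0:
            break
        scheduled[req_id] = 1
        budget -= 1

    # Then admit waiting requests in FCFS order while the whole prompt fits.
    while waiting and len(scheduled) < max_num_seqs:
        req_id, prompt_len = waiting[0]
        if prompt_len > budget:
            break  # head-of-line request does not fit; stop admitting
        waiting.popleft()
        scheduled[req_id] = prompt_len
        budget -= prompt_len

    return scheduled


waiting = deque([("c", 100), ("d", 500), ("e", 10)])
step = schedule_step(["a", "b"], waiting, max_num_batched_tokens=512, max_num_seqs=8)
print(step)                # {'a': 1, 'b': 1, 'c': 100}
print(sum(step.values()))  # 102
```

Request "d" does not fit in the remaining budget, and because the queue is FCFS the smaller request "e" behind it waits as well. Chunked prefill removes this head-of-line effect by letting "d" start with a partial chunk.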

Scheduling constraints checked

Constraint	Config field	Purpose
Max concurrent sequences	`max_num_seqs`	Prevents OOM from too many KV caches
Max tokens per step	`max_num_batched_tokens`	Bounds GPU memory for activations
Available KV blocks	runtime	Can't schedule if no blocks available
Max model len	`max_model_len`	Request can't exceed context window
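The constraints in the table can be expressed as a single admission check. The sketch below is illustrative only: `Limits`, `first_violation`, and the block size of 16 tokens are assumptions for this example, and the check admits a prompt only if it fits in one step (that is, without chunked prefill).

```python
import math
from dataclasses import dataclass


@dataclass
class Limits:
    max_num_seqs: int
    max_num_batched_tokens: int
    max_model_len: int
    block_size: int = 16  # tokens per KV block (assumed value)


def first_violation(limits, prompt_len, max_new_tokens, num_running,
                    tokens_already_scheduled, free_blocks):
    """Return the name of the first violated constraint, or None if admissible."""
    if prompt_len + max_new_tokens > limits.max_model_len:
        return "max_model_len"
    if num_running >= limits.max_num_seqs:
        return "max_num_seqs"
    if tokens_already_scheduled + prompt_len > limits.max_num_batched_tokens:
        return "max_num_batched_tokens"
    if math.ceil(prompt_len / limits.block_size) > free_blocks:
        return "kv_blocks"
    return None


limits = Limits(max_num_seqs=256, max_num_batched_tokens=8192, max_model_len=4096)
# A 1000-token prompt needs ceil(1000 / 16) = 63 blocks.
print(first_violation(limits, 1000, 100, num_running=10,
                      tokens_already_scheduled=0, free_blocks=62))  # kv_blocks
print(first_violation(limits, 1000, 100, num_running=10,
                      tokens_already_scheduled=0, free_blocks=63))  # None
```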

SchedulerOutput

Source: vllm/v1/core/sched/output.py

This is the contract between the scheduler and the executor. It describes the full batch for one iteration.

@dataclass
class SchedulerOutput:
    # New requests starting their prefill phase
    new_requests: list[NewRequestData]
    # Requests resuming from full prefix cache hit
    cached_requests: list[CachedRequestData]
    # Requests in ongoing decode phase
    scheduled_running_reqs: list[RunningRequestData]

    # Tokens scheduled per request in this step (prefill + decode);
    # the step total is the sum of the values
    num_scheduled_tokens: dict[str, int]  # request_id → tokens

    # KV cache block assignments per request (request_id → list of block IDs)
    # Used by the attention kernel to look up K/V values
    block_ids: dict[str, list[int]]

    # Preempted requests (evicted due to memory pressure)
    preempted_reqs: list[str]

@dataclass
class NewRequestData:
    request_id: str
    prompt_token_ids: list[int]
    prompt_embeds: torch.Tensor | None
    mm_features: list[MultiModalFeatureSpec]
    sampling_params: SamplingParams
    block_ids: list[int]        # allocated KV blocks
    num_computed_tokens: int    # 0 for fresh requests
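How a block ID list turns into a K/V location can be sketched with the usual paged-addressing arithmetic. This is a simplified illustration: `slot_for_position` is a hypothetical helper, the block size of 16 is an assumed value, and real attention kernels perform the equivalent lookup on the GPU rather than in Python.

```python
def slot_for_position(block_ids, position, block_size=16):
    """Map a token position within a request to a flat KV cache slot.

    block_ids: physical block IDs assigned to the request, in logical order.
    position:  0-based index of the token within the request.
    """
    block_index, offset = divmod(position, block_size)
    return block_ids[block_index] * block_size + offset


block_ids = [7, 2, 9]  # the request's blocks need not be contiguous
print(slot_for_position(block_ids, 0))   # 112  (block 7, offset 0)
print(slot_for_position(block_ids, 17))  # 33   (block 2, offset 1)
print(slot_for_position(block_ids, 40))  # 152  (block 9, offset 8)
```

Because every lookup goes through the block list, a request's KV cache can be scattered across physical memory, which is what lets the scheduler allocate and free blocks independently per request.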

RequestQueue

Source: vllm/v1/core/sched/request_queue.py

Manages the queue of waiting requests. Supports pluggable scheduling policies:

Policy	Description
FCFS	First-come first-served (default). Prevents starvation.
Priority	User-assigned priority scores. Useful for SLA tiers.

class RequestQueue:
    def __init__(self, scheduler_config: SchedulerConfig):
        self._waiting: deque[Request] = deque()
        self._policy = get_policy(scheduler_config.policy)

    def add_request(self, request: Request) -> None:
        self._waiting.append(request)

    def pop_next_request(self) -> Request | None:
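        # Policy-specific selection: the policy is expected to remove the
        # chosen request from the queue and return it (None if it is empty)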
        return self._policy.get_priority(self._waiting)
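The two policies in the table can be sketched with standard-library containers. The classes below are hypothetical stand-ins, not vLLM's implementation; the sketch assumes that a lower priority value is served first and that ties are broken by arrival order.

```python
import heapq
import itertools
from collections import deque


class FCFSQueue:
    """First-come first-served: requests leave in arrival order."""

    def __init__(self):
        self._queue = deque()

    def add(self, request_id, priority=0):
        self._queue.append(request_id)  # priority is ignored

    def pop(self):
        return self._queue.popleft() if self._queue else None


class PriorityQueue:
    """Lower priority value is served first (assumed convention)."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker: arrival order

    def add(self, request_id, priority=0):
        heapq.heappush(self._heap, (priority, next(self._arrival), request_id))

    def pop(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


for queue in (FCFSQueue(), PriorityQueue()):
    queue.add("a", priority=2)
    queue.add("b", priority=1)
    queue.add("c", priority=1)
    print(type(queue).__name__, [queue.pop() for _ in range(3)])
# FCFSQueue ['a', 'b', 'c']
# PriorityQueue ['b', 'c', 'a']
```

The arrival counter in the priority variant keeps ordering stable among requests with equal priority, so a tier of same-priority requests still behaves like FCFS.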

Preemption & Memory Pressure

When the GPU KV cache is exhausted, the scheduler must free blocks so that the remaining requests can keep making progress. It does this by preempting running requests.

# Preemption strategy: recompute (re-prefill)
# 1. Pick lowest-priority running request
# 2. Free its KV blocks
# 3. Move it back to WAITING status
# 4. When it's re-scheduled, full prefill runs again

# Alternative: swap to CPU (not default in V1)
# KV blocks moved to CPU DRAM, swapped back when resources free up

The re-computation approach avoids PCIe CPU↔GPU transfer overhead but wastes prefill compute. For long prompts, prefix caching softens this — only the non-cached suffix is recomputed.
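The recompute strategy described above can be sketched as a loop that evicts the lowest-priority requests until enough blocks are free. The function `preempt_until_fits` and its data layout are assumptions for illustration; a real scheduler also resets the preempted request's state and returns it to the waiting queue.

```python
def preempt_until_fits(running, free_blocks, blocks_needed):
    """Preempt lowest-priority requests until blocks_needed blocks are free.

    running: list of (request_id, blocks_held), ordered from highest to
             lowest priority; it is modified in place.
    Returns (preempted_ids, free_blocks_after).
    """
    preempted = []
    while free_blocks < blocks_needed and running:
        request_id, blocks_held = running.pop()  # lowest-priority request
        free_blocks += blocks_held               # its KV blocks are released
        preempted.append(request_id)             # it must be re-prefilled later
    return preempted, free_blocks


running = [("a", 4), ("b", 3), ("c", 2)]
preempted, free = preempt_until_fits(running, free_blocks=1, blocks_needed=5)
print(preempted, free)  # ['c', 'b'] 6
print(running)          # [('a', 4)]
```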

update_from_output() — Feedback Loop

After each model execution, the scheduler is updated with results. This closes the loop.

def update_from_output(
    self,
    scheduler_output: SchedulerOutput,
    model_runner_output: ModelRunnerOutput,
) -> list[EngineCoreOutput]:
    outputs = []
    # Assumes sampled_token_ids is ordered like num_scheduled_tokens
    for request_id, new_token_id in zip(
        scheduler_output.num_scheduled_tokens,
        model_runner_output.sampled_token_ids,
    ):
        request = self.running[request_id]
        request.output_token_ids.append(new_token_id)
        request.num_computed_tokens += scheduler_output.num_scheduled_tokens[request_id]

        # Check stop conditions
        finished = self._should_stop(request, new_token_id)
        if finished:
            self._free_request(request)    # releases KV blocks
        outputs.append(EngineCoreOutput(
            request_id=request_id,
            new_token_ids=[new_token_id],
            # None means the request is still running
            finish_reason=FinishReason.STOP if finished else None,
        ))
    return outputs
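A stop check of the kind called above might look like the sketch below. It is a hypothetical helper, not the body of `_should_stop`: it covers only two common conditions (an end-of-sequence token and an output length limit), returns plain strings instead of a `FinishReason` enum, and ignores stop strings and aborts.

```python
def finish_reason(output_token_ids, eos_token_id, max_tokens):
    """Return "stop", "length", or None if the request should keep running."""
    if output_token_ids and output_token_ids[-1] == eos_token_id:
        return "stop"    # the model emitted the end-of-sequence token
    if len(output_token_ids) >= max_tokens:
        return "length"  # the output length limit was reached
    return None


print(finish_reason([5, 6, 2], eos_token_id=2, max_tokens=10))  # stop
print(finish_reason([5, 6, 7], eos_token_id=2, max_tokens=3))   # length
print(finish_reason([5], eos_token_id=2, max_tokens=3))         # None
```

Distinguishing the two reasons matters to callers: a "length" finish usually means the output was truncated, whereas "stop" means the model ended the sequence on its own.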