Scheduler & Continuous Batching
The scheduler is the brain of vLLM. In each engine iteration it decides which requests to run, how many tokens to process for each, and which KV cache blocks to allocate. Its decisions directly determine throughput and latency.
Reference: Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022), the paper that introduced iteration-level scheduling for LLM serving
Continuous Batching
Why Continuous Batching?
Traditional static batching waits for a full batch of requests to arrive, processes them all to completion, then accepts new ones. When some requests finish early, their slots sit idle until the longest request in the batch completes, so GPU utilization collapses.
Continuous (or iteration-level) batching makes scheduling decisions at every model iteration instead of once per batch, so requests can join or leave at any iteration boundary:
# Static batching (naive):
Iteration 1-50: [req-A(20tok), req-B(35tok), req-C(50tok)] # all 3 slots held until iter 50
# req-A finishes at iter 20, req-B at iter 35 → their slots sit idle until iter 50
# Continuous batching (vLLM):
Iter 1-20: [req-A, req-B, req-C]
Iter 21: [req-D joins, req-A left] # immediate!
Iter 21-35: [req-B, req-C, req-D]
Iter 36: [req-E joins, req-B left]
...
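The PagedAttention paper reports 2–4× higher throughput than prior serving systems such as FasterTransformer and Orca at the same latency, with larger gains on longer sequences. The original vLLM announcement reports up to 24× higher throughput than Hugging Face Transformers.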
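The difference between the two timelines can be checked with a toy simulation. This is a minimal sketch, not vLLM code: it assumes every running request emits exactly one token per iteration, ignores prefill cost, and uses illustrative output lengths for req-A through req-E. The function names are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Iterations needed when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Iterations needed when a freed slot is refilled at the next iteration."""
    waiting = deque(lengths)
    running = []  # remaining output tokens per running request
    steps = 0
    while waiting or running:
        # Fill free slots at the iteration boundary.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One iteration: every running request emits one token;
        # requests with nothing left to generate leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps

lengths = [20, 35, 50, 30, 15]  # output tokens for req-A..req-E (illustrative)
print(static_batching_steps(lengths, batch_size=3))      # 80
print(continuous_batching_steps(lengths, batch_size=3))  # 50
```

With three slots, static batching needs 50 iterations for the first batch and 30 for the second, while continuous batching finishes all five requests in 50 iterations because req-D and req-E take over the slots freed at iterations 20 and 35.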
Prefill vs. Decode Tokens
A single scheduler step can mix two types of tokens in the same batch:
Prefill: processes the user's input prompt and produces KV cache entries for all prompt tokens in one shot. Compute-bound (parallelizable across the prompt length).
Decode: generates output tokens one at a time. Each new token attends over the entire prior KV cache. Memory-bandwidth-bound.
With Chunked Prefill, long prompts are split into chunks that fit within the per-step max_num_batched_tokens budget and interleaved with decode steps, preventing any single prefill from monopolizing the GPU.
# Example: scheduler_config
SchedulerConfig(
    max_num_seqs=256,              # max concurrent requests
    max_num_batched_tokens=8192,   # max total tokens per step
    enable_chunked_prefill=True,   # interleave prefill chunks with decode
    max_num_partial_prefills=1,    # max simultaneous chunked-prefill requests
)
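To make the token budget concrete, the sketch below splits one long prompt into chunks while a fixed number of decode requests shares each step. It is a simplified model under stated assumptions: decodes are scheduled first at one token each, a single prefill is in flight, and `plan_step` and `chunk_prefill` are hypothetical helpers, not vLLM functions.

```python
def plan_step(prefill_remaining, num_decodes, max_num_batched_tokens):
    """Return (prefill_chunk, decode_tokens) for one scheduler step.

    Decodes take one token each; the leftover budget goes to the prefill.
    """
    decode_tokens = min(num_decodes, max_num_batched_tokens)
    budget = max_num_batched_tokens - decode_tokens
    prefill_chunk = min(prefill_remaining, budget)
    return prefill_chunk, decode_tokens

def chunk_prefill(prompt_len, num_decodes, max_num_batched_tokens):
    """List the prefill chunk sizes needed to process a prompt of prompt_len tokens."""
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        chunk, _ = plan_step(remaining, num_decodes, max_num_batched_tokens)
        if chunk == 0:
            raise ValueError("no token budget left for prefill")
        chunks.append(chunk)
        remaining -= chunk
    return chunks

print(chunk_prefill(prompt_len=20000, num_decodes=192, max_num_batched_tokens=8192))
# [8000, 8000, 4000]
```

With 192 decode requests in flight and a budget of 8192 tokens, each step leaves 8000 tokens for the prefill, so a 20000-token prompt takes three steps and the decodes keep producing tokens throughout.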
Scheduler Internals
Scheduler.schedule() — Step by Step
Source: vllm/v1/core/sched/scheduler.py
Called once per engine iteration. Returns a SchedulerOutput describing exactly what the model runner should execute.
def schedule(self) -> SchedulerOutput:
    # Simplified sketch: the helper names below are illustrative.

    # Phase 1: Resume cached (prefix-hit) requests
    # These reuse previously computed KV blocks, so no re-prefill is needed
    cached_requests = self._schedule_cached_requests()

    # Phase 2: Schedule running (decode) requests
    # Each needs one new KV slot for the next token; a new block is
    # allocated only when the request's last block is full
    running_scheduled = self._schedule_running_requests()

    # Phase 3: Prefill new requests from the queue
    # Allocate KV blocks for prompt tokens
    # Subject to: max_num_batched_tokens, max_num_seqs, available blocks
    new_requests = self._schedule_new_requests()

    return SchedulerOutput(
        new_requests=new_requests,
        cached_requests=cached_requests,
        scheduled_running_reqs=running_scheduled,
        num_scheduled_tokens=num_scheduled_tokens,  # request_id → tokens, filled in by the phases
        ...
    )
Scheduling constraints checked
| Constraint | Config field | Purpose |
|---|---|---|
| Max concurrent sequences | max_num_seqs | Prevents OOM from too many KV caches |
| Max tokens per step | max_num_batched_tokens | Bounds GPU memory for activations |
| Available KV blocks | runtime | Can't schedule if no blocks available |
| Max model len | max_model_len | Request can't exceed context window |
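The four checks in the table can be sketched as a single admission function. This is an illustration only: it assumes a new request is prefilled in full within one step (no chunking), a fixed `block_size`, and hypothetical names (`Limits`, `blocks_needed`, `can_schedule`) that are not part of vLLM.

```python
from dataclasses import dataclass

@dataclass
class Limits:
    max_num_seqs: int
    max_num_batched_tokens: int
    max_model_len: int
    block_size: int  # tokens per KV block (assumed fixed)

def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Number of KV blocks required to hold num_tokens (ceiling division)."""
    return -(-num_tokens // block_size)

def can_schedule(prompt_len, num_running, tokens_scheduled, free_blocks, limits):
    """Return (ok, reason) for admitting a new full-prompt prefill this step."""
    if prompt_len > limits.max_model_len:
        return False, "max_model_len"
    if num_running + 1 > limits.max_num_seqs:
        return False, "max_num_seqs"
    if tokens_scheduled + prompt_len > limits.max_num_batched_tokens:
        return False, "max_num_batched_tokens"
    if blocks_needed(prompt_len, limits.block_size) > free_blocks:
        return False, "kv_blocks"
    return True, "ok"

limits = Limits(max_num_seqs=256, max_num_batched_tokens=8192,
                max_model_len=4096, block_size=16)
print(can_schedule(1000, 10, 512, free_blocks=100, limits=limits))  # (True, 'ok')
print(can_schedule(1000, 10, 512, free_blocks=50, limits=limits))   # (False, 'kv_blocks')
```

A 1000-token prompt needs 63 blocks of 16 tokens, so it is admitted with 100 free blocks and rejected with 50.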
SchedulerOutput
Source: vllm/v1/core/sched/output.py
This is the contract between the scheduler and the executor. It describes the full batch for one iteration.
@dataclass
class SchedulerOutput:
    # New requests starting their prefill phase
    new_requests: list[NewRequestData]
    # Requests resuming from full prefix cache hit
    cached_requests: list[CachedRequestData]
    # Requests in ongoing decode phase
    scheduled_running_reqs: list[RunningRequestData]
    # Tokens scheduled per request (prefill + decode) for this step
    num_scheduled_tokens: dict[str, int]  # request_id → tokens
    # KV cache block assignments per request
    # One list of block IDs per request
    # Used by attention kernel to look up K/V values
    block_ids: dict[str, list[int]]
    # Preempted requests (evicted due to memory pressure)
    preempted_reqs: list[str]

@dataclass
class NewRequestData:
    request_id: str
    prompt_token_ids: list[int]
    prompt_embeds: torch.Tensor | None
    mm_features: list[MultiModalFeatureSpec]
    sampling_params: SamplingParams
    block_ids: list[int]        # allocated KV blocks
    num_computed_tokens: int    # 0 for fresh requests
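As a rough illustration of how a consumer might use these fields, the sketch below sums the per-request token counts and maps token positions to physical KV cache slots through a request's block list. The data values, `BLOCK_SIZE`, and `slot_for_position` are assumptions made for the example and are not taken from the vLLM source.

```python
def slot_for_position(block_ids: list[int], position: int, block_size: int) -> int:
    """Map a token position within a request to a physical KV cache slot."""
    block_id = block_ids[position // block_size]
    return block_id * block_size + position % block_size

# Hypothetical per-step data, shaped like the fields above.
num_scheduled_tokens = {"req-1": 40, "req-2": 1}
block_ids = {"req-1": [7, 2, 9], "req-2": [4]}
BLOCK_SIZE = 16  # assumed tokens per block

total_tokens = sum(num_scheduled_tokens.values())
slots = [slot_for_position(block_ids["req-1"], p, BLOCK_SIZE) for p in (0, 17, 39)]
print(total_tokens)  # 41
print(slots)         # [112, 33, 151]
```

Because blocks need not be contiguous, logically adjacent positions 15 and 16 of req-1 land in physical blocks 7 and 2 respectively.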
RequestQueue
Source: vllm/v1/core/sched/request_queue.py
Manages the queue of waiting requests. Supports pluggable scheduling policies:
| Policy | Description |
|---|---|
| FCFS | First-come first-served (default). Prevents starvation. |
| Priority | User-assigned priority scores. Useful for SLA tiers. |
class RequestQueue:
    def __init__(self, scheduler_config: SchedulerConfig):
        self._waiting: deque[Request] = deque()
        self._policy = get_policy(scheduler_config.policy)

    def add_request(self, request: Request) -> None:
        self._waiting.append(request)

    def pop_next_request(self) -> Request | None:
        if not self._waiting:
            return None  # nothing queued
        # Policy-specific selection: the policy picks the next request
        # and removes it from the queue
        return self._policy.get_priority(self._waiting)
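The two policies can be illustrated with standard-library containers. This is a self-contained sketch with hypothetical class names (`QueuedReq`, `FCFSQueue`, `PriorityRequestHeap`), assuming that a lower priority value is served first and that ties fall back to arrival order.

```python
import heapq
import itertools
from collections import deque
from dataclasses import dataclass

@dataclass
class QueuedReq:
    request_id: str
    priority: int = 0  # lower value = served earlier (assumed convention)

class FCFSQueue:
    """Serve requests strictly in arrival order."""
    def __init__(self):
        self._q = deque()

    def add(self, req):
        self._q.append(req)

    def pop(self):
        return self._q.popleft() if self._q else None

class PriorityRequestHeap:
    """Serve the lowest priority value first; ties keep arrival order."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # unique tie-breaker

    def add(self, req):
        heapq.heappush(self._heap, (req.priority, next(self._counter), req))

    def pop(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

reqs = [QueuedReq("a", 5), QueuedReq("b", 1), QueuedReq("c", 5)]
fcfs, prio = FCFSQueue(), PriorityRequestHeap()
for r in reqs:
    fcfs.add(r)
    prio.add(r)
print([fcfs.pop().request_id for _ in reqs])  # ['a', 'b', 'c']
print([prio.pop().request_id for _ in reqs])  # ['b', 'a', 'c']
```

The arrival counter in the heap tuple keeps the ordering stable and ensures request objects themselves are never compared.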
Preemption & Memory Pressure
When GPU KV cache is exhausted, the scheduler must free blocks to make room for new prefills. It does this by preempting running requests.
# Preemption strategy: recompute (re-prefill)
# 1. Pick lowest-priority running request
# 2. Free its KV blocks
# 3. Move it back to WAITING status
# 4. When it's re-scheduled, full prefill runs again
# Alternative: swap to CPU (not default in V1)
# KV blocks moved to CPU DRAM, swapped back when resources free up
The re-computation approach avoids PCIe CPU↔GPU transfer overhead but wastes prefill compute. For long prompts, prefix caching softens this — only the non-cached suffix is recomputed.
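The recompute strategy can be sketched as a loop that evicts victims until enough blocks are free. This is a minimal illustration, not the vLLM implementation: it assumes a higher `priority` value means less important, models the free pool as a plain counter, ignores prefix caching, and uses hypothetical names (`RunningReq`, `preempt_until`).

```python
from dataclasses import dataclass, field

@dataclass
class RunningReq:
    request_id: str
    priority: int  # higher value = less important (assumed convention)
    block_ids: list = field(default_factory=list)
    num_computed_tokens: int = 0
    status: str = "RUNNING"

def preempt_until(free_blocks, needed, running, waiting):
    """Preempt lowest-priority running requests until `needed` blocks are free.

    Mutates `running` and `waiting`; returns the new free-block count.
    """
    while free_blocks < needed and running:
        victim = max(running, key=lambda r: r.priority)
        running.remove(victim)
        free_blocks += len(victim.block_ids)  # step 2: free its KV blocks
        victim.block_ids = []
        victim.num_computed_tokens = 0        # step 4: prefill starts over
        victim.status = "WAITING"             # step 3: back to the queue
        waiting.insert(0, victim)             # retried before newer arrivals
    return free_blocks

running = [
    RunningReq("a", priority=0, block_ids=[1, 2, 3], num_computed_tokens=40),
    RunningReq("b", priority=2, block_ids=[4, 5], num_computed_tokens=25),
    RunningReq("c", priority=1, block_ids=[6], num_computed_tokens=10),
]
waiting = []
free = preempt_until(free_blocks=1, needed=4, running=running, waiting=waiting)
print(free)                               # 4
print([r.request_id for r in running])    # ['a']
print([r.request_id for r in waiting])    # ['c', 'b']
```

Requests b and c are evicted in that order, which frees three blocks; request a keeps running with its KV cache intact.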
update_from_output() — Feedback Loop
After each model execution, the scheduler is updated with results. This closes the loop.
def update_from_output(
    self,
    scheduler_output: SchedulerOutput,
    model_runner_output: ModelRunnerOutput,
) -> list[EngineCoreOutput]:
    # Simplified sketch: assumes one sampled token per scheduled request,
    # with sampled_token_ids in the same order as num_scheduled_tokens.
    outputs = []
    for request_id, new_token_id in zip(
        scheduler_output.num_scheduled_tokens,
        model_runner_output.sampled_token_ids,
    ):
        request = self.running[request_id]
        request.output_token_ids.append(new_token_id)
        request.num_computed_tokens += scheduler_output.num_scheduled_tokens[request_id]

        # Check stop conditions
        finish_reason = None  # None means the request is still running
        if self._should_stop(request, new_token_id):
            self._free_request(request)  # releases KV blocks
            # Simplified: a request that hits its length limit would
            # report FinishReason.LENGTH instead.
            finish_reason = FinishReason.STOP

        outputs.append(EngineCoreOutput(
            request_id=request_id,
            new_token_ids=[new_token_id],
            finish_reason=finish_reason,
        ))
    return outputs
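The stop check itself is not shown above. A minimal sketch of what such a check could look like follows, assuming two stop conditions (an end-of-sequence or stop token, and a maximum output length). The `check_stop` function and the local `FinishReason` enum are stand-ins defined for this example, not the vLLM definitions.

```python
from enum import Enum

class FinishReason(Enum):
    STOP = "stop"      # EOS or a user-supplied stop token was generated
    LENGTH = "length"  # the output reached its token limit

def check_stop(output_token_ids, eos_token_id, max_tokens, stop_token_ids=()):
    """Return a FinishReason if generation should end, else None."""
    if not output_token_ids:
        return None
    last = output_token_ids[-1]
    if last == eos_token_id or last in stop_token_ids:
        return FinishReason.STOP
    if len(output_token_ids) >= max_tokens:
        return FinishReason.LENGTH
    return None

print(check_stop([5, 9, 2], eos_token_id=2, max_tokens=16))  # FinishReason.STOP
print(check_stop([5, 9, 7], eos_token_id=2, max_tokens=3))   # FinishReason.LENGTH
print(check_stop([5, 9, 7], eos_token_id=2, max_tokens=16))  # None
```

Returning the reason rather than a boolean lets the caller report why a request finished instead of assuming a stop token in every case.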