How the llama.cpp server turns HTTP requests into batched inference and streamed responses.
| Route | Purpose | Returns |
|---|---|---|
| POST /v1/completions | OpenAI-compatible text completion | JSON / SSE stream |
| POST /v1/chat/completions | Chat completion (messages format) | JSON / SSE stream |
| POST /v1/embeddings | Compute token embeddings | float[] JSON |
| POST /v1/rerank | Score and rank documents vs query | float[] JSON |
| GET /v1/models | List loaded models | JSON model list |
| POST /tokenize | Tokenize text (debug/inspection) | token ID array |
| POST /detokenize | Convert token IDs to text | string |
| GET /health | Server status and slot availability | JSON |
| GET /metrics | Prometheus-format metrics | text/plain |
| GET /slots | Current slot state (debug) | JSON |
The server maintains a fixed pool of slots, where each slot handles one concurrent inference request. The number of slots is set by --parallel N (default 1). All slots share the same llama_context and KV cache, but each has its own sampler chain, token buffer, and generation state.
On each iteration, the tokens from all active slots are combined into a single llama_batch. After llama_decode(), each slot independently samples its next token. New requests can start filling idle slots immediately, without waiting for other requests to finish, which is what keeps throughput high under concurrent load.
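The slot pool described above can be pictured with a small, self-contained sketch. The names `Slot`, `SlotPool`, and `find_idle` are hypothetical stand-ins chosen for illustration, not the server's actual types, and the shared context and KV cache are left out entirely.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-ins for the server's slot bookkeeping.
enum class SlotState { Idle, ProcessingPrompt, Generating };

struct Slot {
    int       id    = 0;                // would double as the sequence id in the shared KV cache
    SlotState state = SlotState::Idle;  // per-slot generation state
};

struct SlotPool {
    std::vector<Slot> slots;

    // n_parallel plays the role of --parallel N: the pool size is fixed at startup.
    explicit SlotPool(int n_parallel) {
        assert(n_parallel > 0);
        for (int i = 0; i < n_parallel; ++i) {
            slots.push_back(Slot{i, SlotState::Idle});
        }
    }

    // Returns the first idle slot, or nullptr when every slot is busy.
    Slot * find_idle() {
        for (Slot & s : slots) {
            if (s.state == SlotState::Idle) {
                return &s;
            }
        }
        return nullptr;
    }
};
```

With `SlotPool pool(2)`, two requests can be in flight at once; a third call to `find_idle()` returns `nullptr` until one of them finishes.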
// tools/server/server-http.cpp: route handler for POST /v1/chat/completions
svr.Post("/v1/chat/completions", [&](const Request & req, Response & res) {
    // Parse JSON body
    json body = json::parse(req.body);

    // Build server_task from request
    server_task task;
    task.type   = SERVER_TASK_TYPE_INFERENCE;
    task.params = parse_sampling_params(body);              // temperature, top-p, etc.
    task.prompt = apply_chat_template(body["messages"]);    // Jinja → string
    task.id     = unique_id();

    // Queue for the inference thread
    queue_tasks.post(task);

    // If streaming (stream=true), set up SSE response
    if (body.value("stream", false)) {
        res.set_chunked_content_provider("text/event-stream", stream_handler);
    } else {
        // Blocking: wait for result future
        auto result = task.result_future.get();
        res.set_content(result.to_json(), "application/json");
    }
});
Chat APIs send a messages array ([{role:"user", content:"Hello"}]). The server uses the model's built-in Jinja2 chat template (stored in GGUF metadata) to render this into the prompt string the model expects.
// tools/server/server-context.cpp
// common/chat.cpp: chat template rendering
// The template renders something like (Llama-3 format):
//   <|begin_of_text|>
//   <|start_header_id|>user<|end_header_id|>
//   Hello
//   <|eot_id|>
//   <|start_header_id|>assistant<|end_header_id|>
//   ↑ no closing tag: the model generates from here
// This rendered string is then tokenized normally:
std::string prompt = apply_template(model, messages, /*add_gen_prompt=*/true);
std::vector<llama_token> prompt_tokens = tokenize(vocab, prompt, true, true);
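To make the rendering step concrete, here is a hand-written approximation of the Llama-3 layout shown above. It is only a sketch: the server evaluates the Jinja template stored in the GGUF file rather than hard-coding any format, and `ChatMessage` and `render_llama3` are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct ChatMessage {
    std::string role;     // e.g. "system", "user", "assistant"
    std::string content;
};

// Hard-coded approximation of the Llama-3 chat layout (assumption: a blank line
// follows each header). The real server renders the model's Jinja template instead.
std::string render_llama3(const std::vector<ChatMessage> & messages, bool add_gen_prompt) {
    std::string out = "<|begin_of_text|>";
    for (const ChatMessage & m : messages) {
        assert(!m.role.empty());
        out += "<|start_header_id|>" + m.role + "<|end_header_id|>\n\n";
        out += m.content + "<|eot_id|>";
    }
    if (add_gen_prompt) {
        // Open an assistant turn and leave it unterminated so the model continues it.
        out += "<|start_header_id|>assistant<|end_header_id|>\n\n";
    }
    return out;
}
```

Passing `add_gen_prompt = true` is what leaves the final assistant header open; without it the model would have no cue that it is its turn to speak.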
// tools/server/server-context.cpp: update_slots()
// Find an available slot:
server_slot * slot = find_idle_slot();
if (!slot) {
    // All slots busy: queue the task, retry next tick
    queue_tasks.defer(task);
    return;
}

// Assign the task to the slot:
slot->task      = task;
slot->state     = SLOT_STATE_PROCESSING_PROMPT;
slot->n_decoded = 0;
slot->smpl      = common_sampler_init(model, task.params);

// Allocate KV cache space for this sequence:
// Each slot gets a unique seq_id; KV slots are assigned from the pool
llama_memory_seq_rm(ctx->memory, slot->id, -1, -1);   // clear any old state
// KV slots will be reserved during the first llama_decode() call
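The defer-and-retry behaviour can be sketched in isolation. This is an illustrative model only: `Task`, `Slot`, and `assign_tasks` are hypothetical names, and a plain `std::deque` stands in for the server's deferred-task queue.

```cpp
#include <cassert>
#include <deque>
#include <vector>

struct Task { int id; };

struct Slot {
    int  id;
    bool busy;
    int  task_id;   // id of the task currently held, -1 when none
};

// Places waiting tasks into idle slots in arrival order.
// Tasks that find no idle slot simply stay in `waiting` until the next call.
void assign_tasks(std::deque<Task> & waiting, std::vector<Slot> & slots) {
    for (Slot & slot : slots) {
        if (waiting.empty()) {
            return;
        }
        if (slot.busy) {
            continue;
        }
        slot.busy    = true;
        slot.task_id = waiting.front().id;
        waiting.pop_front();
    }
}
```

Calling `assign_tasks` once per tick gives the "retry next tick" behaviour: a deferred task is picked up as soon as any slot has been released.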
The server runs a tight loop on the inference thread. Each iteration builds one batch from all active slots and executes it together — this is continuous batching.
// tools/server/server-context.cpp: inference loop (simplified)
while (server_running) {
    // PHASE 1: Build batch from all active slots
    common_batch_clear(batch);
    for (server_slot & slot : slots) {
        if (slot.state == SLOT_STATE_PROCESSING_PROMPT) {
            // Add the prompt tokens that are not in the KV cache yet
            // (the real loop also caps each chunk at n_batch tokens)
            while (slot.n_past < (int) slot.prompt_tokens.size()) {
                common_batch_add(batch, slot.prompt_tokens[slot.n_past], slot.n_past, {slot.id}, false);
                slot.n_past++;
            }
            // Mark last token to compute logits (for sampling):
            batch.logits[batch.n_tokens - 1] = true;
            slot.logit_index = batch.n_tokens - 1;
        } else if (slot.state == SLOT_STATE_GENERATING) {
            // Add the previously sampled token as next input
            slot.logit_index = batch.n_tokens;
            common_batch_add(batch, slot.last_token, slot.n_past, {slot.id}, true);
            slot.n_past++;
        }
    }
    if (batch.n_tokens == 0) {
        // Nothing to process: wait for new tasks
        wait_for_tasks();
        continue;
    }

    // PHASE 2: Execute the full batch in one decode call
    int ret = llama_decode(ctx, batch);   // → runs transformer, fills logits
    if (ret != 0) {
        // Non-zero means the decode failed; this sketch simply stops the loop
        break;
    }

    // PHASE 3: Sample next token for each slot that requested logits
    for (server_slot & slot : slots) {
        if (!needs_sampling(slot)) continue;
        // Find the logit index for this slot's last token:
        int32_t idx = slot.logit_index;
        // Sample:
        llama_token id = common_sampler_sample(slot.smpl, ctx, idx);
        common_sampler_accept(slot.smpl, id, true);
        slot.n_decoded++;
        // Convert to text and send:
        std::string piece = token_to_piece(vocab, id);
        slot.generated_text += piece;
        send_stream_chunk(slot, piece);   // SSE: "data: {delta: piece}\n\n"
        // Check stop conditions:
        if (llama_vocab_is_eog(vocab, id) ||
            slot.n_decoded >= slot.params.n_predict ||
            slot.generated_text.ends_with(slot.params.stop)) {
            slot.state = SLOT_STATE_DONE;
            send_stream_done(slot);
        } else {
            slot.last_token = id;
            // Prompt slots switch to GENERATING; generating slots stay there for the next tick
            slot.state = SLOT_STATE_GENERATING;
        }
    }
}
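The scheduling effect of this loop can be simulated without a model. The sketch below is a toy under stated assumptions: `ToySlot` and `tick` are hypothetical names, token contents are ignored, and it assumes generating slots are added to the batch before prompt chunks share the remaining `n_batch` budget.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class Phase { Idle, Prompt, Generating };

struct ToySlot {
    Phase phase       = Phase::Idle;
    int   prompt_left = 0;   // prompt tokens not yet decoded
    int   to_generate = 0;   // output tokens still wanted (at least 1 for a new request)
};

// Runs one scheduler tick and returns how many tokens went into the shared batch.
int tick(std::vector<ToySlot> & slots, int n_batch) {
    assert(n_batch > 0);
    int n_tokens = 0;

    // Generating slots contribute exactly one token each: the one sampled last tick.
    for (ToySlot & s : slots) {
        if (s.phase != Phase::Generating || n_tokens >= n_batch) continue;
        n_tokens += 1;
        s.to_generate -= 1;
        if (s.to_generate == 0) s.phase = Phase::Idle;
    }

    // Prompt slots fill whatever batch budget is left, one chunk per tick.
    for (ToySlot & s : slots) {
        if (s.phase != Phase::Prompt || n_tokens >= n_batch) continue;
        const int chunk = std::min(s.prompt_left, n_batch - n_tokens);
        n_tokens      += chunk;
        s.prompt_left -= chunk;
        if (s.prompt_left == 0) {
            // The last prompt chunk yields the first output token.
            s.to_generate -= 1;
            s.phase = s.to_generate > 0 ? Phase::Generating : Phase::Idle;
        }
    }
    return n_tokens;
}
```

A request that arrives while another is mid-prompt simply joins the next tick's batch, which is the property that distinguishes continuous batching from waiting for a whole batch of requests to finish together.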
If two requests share a common prompt prefix (e.g., a system prompt), the server can reuse the KV cache from the first request for the second, skipping the expensive prefill for the shared part.
// tools/server/server-context.cpp: slot assignment with cache lookup
// When assigning a new task to a slot:
//   1. Tokenize the full prompt
//   2. Compare with any existing slot's prompt tokens
//   3. If a prefix matches, reuse that slot's KV cache state:
int n_matching = count_matching_prefix(new_tokens, slot->prompt_tokens);
if (n_matching > 0) {
    // Truncate the slot's KV cache to the matching length:
    llama_memory_seq_rm(ctx->memory, slot->id,
                        n_matching, -1);   // drop positions >= n_matching, keep [0, n_matching)
    slot->n_past = n_matching;             // prompt processing resumes from here
    // Only process the non-cached suffix of the prompt
}
// This is critical for system prompt reuse:
//   Request 1: [system_prompt + "What is 2+2?"]   → full prefill
//   Request 2: [system_prompt + "What is Paris?"] → only prefill "What is Paris?"
//                                                   system_prompt comes from cache
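The prefix comparison itself is a short loop. The following is a minimal sketch, assuming tokens are plain 32-bit integers; `token_t` and `tokens_to_prefill` are hypothetical names, and `count_matching_prefix` here is an illustrative reimplementation rather than the server's own helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using token_t = std::int32_t;   // stand-in for llama_token

// Number of leading tokens shared by the cached prompt and the incoming prompt.
std::size_t count_matching_prefix(const std::vector<token_t> & cached,
                                  const std::vector<token_t> & incoming) {
    std::size_t n = 0;
    while (n < cached.size() && n < incoming.size() && cached[n] == incoming[n]) {
        ++n;
    }
    return n;
}

// Tokens that still need a prefill pass once the shared prefix is reused.
std::vector<token_t> tokens_to_prefill(const std::vector<token_t> & cached,
                                       const std::vector<token_t> & incoming) {
    const std::size_t n = count_matching_prefix(cached, incoming);
    assert(n <= incoming.size());
    return std::vector<token_t>(incoming.begin() + static_cast<std::ptrdiff_t>(n), incoming.end());
}
```

Because the comparison is token-by-token from position 0, reuse only helps when the shared text sits at the very start of the prompt; a differing first token means nothing is reused.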
The server supports speculative decoding: a small "draft" model generates N candidate tokens cheaply, then the large "target" model verifies them in one batched forward pass. Every accepted draft token saves a separate target-model decode step; at the first rejection the target's own token is used and the remaining draft tokens are discarded.
// Speculative decoding flow (server-context.cpp):
// Requires: --model (target) + --model-draft (small fast model)
// Each generation step:
//   1. Draft model generates N tokens cheaply (e.g. N=8):
//        draft_tokens = [t1, t2, t3, t4, t5, t6, t7, t8]
//   2. Target model processes all N+1 tokens in ONE batch:
//        target_batch  = [last_accepted, t1, t2, ..., t8]
//        target_logits = llama_decode(ctx_tgt, target_batch)
//   3. Verify each draft token:
//        for i in 0..N:
//            if sample(target_logits[i]) == draft_tokens[i]:
//                accept → free token!
//            else:
//                reject → sample from target_logits[i] instead
//                       → stop verification, discard remaining drafts
// Expected speedup: 2-4× if draft model is accurate
// Key condition: draft model must use same vocabulary as target
// Server flags:
//   --model-draft path/to/small.gguf
//   --draft-max 8      # max draft tokens per step
//   --draft-min 5      # min to bother drafting
//   --draft-p-min 0.9  # only draft if probability high enough
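The verification step can be written as a short function once the target model's choices are known. This is a sketch of greedy verification only, assuming the target's pick at each position has already been extracted from its logits; `verify_draft` and its parameters are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// target_choice[i] is the token the target model picks at position i, given the
// context plus draft[0..i-1]. It has draft.size() + 1 entries because the batch
// also contains the last accepted token.
std::vector<int> verify_draft(const std::vector<int> & draft,
                              const std::vector<int> & target_choice) {
    assert(target_choice.size() == draft.size() + 1);
    std::vector<int> accepted;
    for (std::size_t i = 0; i < draft.size(); ++i) {
        if (target_choice[i] != draft[i]) {
            // First mismatch: keep the target's token and drop the remaining drafts.
            accepted.push_back(target_choice[i]);
            return accepted;
        }
        accepted.push_back(draft[i]);
    }
    // Every draft token matched, so the extra target position yields one more token.
    accepted.push_back(target_choice.back());
    return accepted;
}
```

Each call returns at least one token, so a step never produces less than ordinary decoding would; the gain comes from the steps where several draft tokens match.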
When the client sends "stream": true, the server uses Server-Sent Events (SSE). Each generated token is sent as a separate SSE event immediately, enabling the client to display text as it's generated.
// SSE format (OpenAI-compatible):
// Each token delta:
data: {"id":"cmpl-xyz","object":"chat.completion.chunk","model":"llama3",
"choices":[{"delta":{"content":" Paris"},"index":0}]}
// Final event:
data: {"id":"cmpl-xyz","object":"chat.completion.chunk","model":"llama3",
"choices":[{"delta":{},"finish_reason":"stop","index":0}],
"usage":{"prompt_tokens":25,"completion_tokens":12}}
data: [DONE]
// The server sends each piece as soon as it's sampled:
// llama_token → piece string → HTTP chunk → SSE event
// Latency from "token sampled" to "client receives": ~1ms + network RTT
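The framing of a single event is simple enough to sketch directly. The helpers below are illustrative assumptions, not the server's code: `json_escape`, `chunk_payload`, and `sse_event` are hypothetical names, and the payload is reduced to the `choices` field for brevity.

```cpp
#include <cassert>
#include <string>

// Escapes the characters that most often appear in generated text.
// Not a complete JSON encoder: other control characters are passed through.
std::string json_escape(const std::string & s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:   out += c;      break;
        }
    }
    return out;
}

// Builds a reduced chunk payload that carries only the delta text.
std::string chunk_payload(const std::string & piece) {
    return "{\"choices\":[{\"delta\":{\"content\":\"" + json_escape(piece) + "\"},\"index\":0}]}";
}

// Frames a single-line payload as one SSE event: "data: <payload>" plus a blank line.
std::string sse_event(const std::string & payload) {
    assert(payload.find('\n') == std::string::npos);
    return "data: " + payload + "\n\n";
}
```

Escaping matters because a raw newline inside the payload would end the `data:` line early; a generated piece such as `"\n"` must reach the client as the two characters `\n` inside the JSON string.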
// Non-streaming response (stream=false):
// Server accumulates all generated tokens before responding:
{
  "id": "cmpl-xyz",
  "object": "chat.completion",
  "model": "llama3",
  "choices": [{
    "message": {"role": "assistant", "content": "The capital of France is Paris."},
    "finish_reason": "stop",
    "index": 0
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 12,
    "total_tokens": 37
  }
}
// Server thread layout:
// HTTP threads (httplib thread pool, default 8):
//   Accept connections, parse JSON, queue tasks, send responses
//   These never call llama_decode(): they just enqueue + wait
// Inference thread (single, dedicated):
//   Runs the main inference loop
//   The ONLY thread that calls llama_decode()
//   Eliminates race conditions on KV cache and context state
//   All slot management happens here
// Queue: task_queue (thread-safe)
//   HTTP threads → produce server_task
//   Inference thread → consume server_task, produce results
// Result queue:
//   Inference thread → produce results (token chunks, final responses)
//   HTTP threads → consume results (send SSE chunks to clients)
// Internal llama_decode() uses its own threadpool:
//   CPU computation: n_threads (set in context params)
//   This threadpool is separate from the HTTP threads above
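The producer/consumer handoff between HTTP threads and the inference thread can be sketched with a mutex and a condition variable. `TaskQueue` is a hypothetical, generic stand-in written for this example; the server's own queue carries more state (deferred tasks, cancellation) that is omitted here.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <utility>

// Minimal thread-safe FIFO: many producers (HTTP threads), one consumer (inference thread).
template <typename T>
class TaskQueue {
public:
    // Called by producers; wakes the consumer if it is waiting.
    void post(T task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

    // Called by the consumer; blocks until at least one task is available.
    T wait_and_pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !tasks_.empty(); });
        T task = std::move(tasks_.front());
        tasks_.pop_front();
        return task;
    }

    bool empty() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return tasks_.empty();
    }

private:
    mutable std::mutex      mutex_;
    std::condition_variable cv_;
    std::deque<T>           tasks_;
};
```

Keeping all model calls on the single consumer side means the queue is the only shared structure that needs locking; the context and KV cache are never touched by two threads at once.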
// State transitions for server_slot:
IDLE
  │  (new task assigned, tokens loaded)
  ▼
PROCESSING_PROMPT
  │  prompt tokens added to batch, n chunks processed
  │  (all prompt tokens decoded, logits available for last token)
  ▼
GENERATING
  │  one new token sampled per tick
  │  token appended to generated_text, sent as SSE chunk
  │  (EOS token OR n_predict reached OR stop string matched)
  ▼
DONE
  │  final usage stats computed, response completed
  ▼
IDLE  (slot ready for next request)
// In PROCESSING_PROMPT, if prompt is longer than n_batch:
// Multiple ticks process the prompt in chunks:
//   Tick 1: tokens[0..511] → llama_decode() → no sampling
//   Tick 2: tokens[512..1023] → llama_decode() → no sampling
//   ...
//   Final prompt tick: last chunk → llama_decode() → sample first output token
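The lifecycle above maps naturally onto a small transition function. This is a sketch for illustration: `SlotState`, `SlotEvent`, `next_state`, and `prompt_ticks` are hypothetical names, and events that do not apply to the current state are simply ignored.

```cpp
#include <cassert>

enum class SlotState { Idle, ProcessingPrompt, Generating, Done };
enum class SlotEvent { TaskAssigned, PromptDecoded, TokenSampled, StopReached, ResponseSent };

// Returns the state that follows `s` when `e` occurs; invalid events leave the state unchanged.
SlotState next_state(SlotState s, SlotEvent e) {
    switch (s) {
        case SlotState::Idle:
            if (e == SlotEvent::TaskAssigned)  return SlotState::ProcessingPrompt;
            break;
        case SlotState::ProcessingPrompt:
            if (e == SlotEvent::PromptDecoded) return SlotState::Generating;
            break;
        case SlotState::Generating:
            if (e == SlotEvent::TokenSampled)  return SlotState::Generating;
            if (e == SlotEvent::StopReached)   return SlotState::Done;
            break;
        case SlotState::Done:
            if (e == SlotEvent::ResponseSent)  return SlotState::Idle;
            break;
    }
    return s;
}

// Number of ticks needed to decode a prompt when each tick holds at most n_batch tokens.
int prompt_ticks(int n_prompt, int n_batch) {
    assert(n_prompt >= 0 && n_batch > 0);
    return (n_prompt + n_batch - 1) / n_batch;   // ceiling division
}
```

For a 1200-token prompt with `n_batch = 512`, `prompt_ticks` gives three ticks: two full chunks with no sampling, then a final 176-token chunk after which the first output token is sampled.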
This completes the full loop. Each generated token feeds back into the batch for the next decode step, continuing until the model emits EOS or the max token limit is reached.