HTTP Server

How the llama.cpp server turns HTTP requests into batched inference and streamed responses.

Source files: tools/server/server.cpp, tools/server/server-context.cpp, tools/server/server-http.cpp. See also the Server README.

HTTP API Routes

Route                       Purpose                               Returns
POST /v1/completions        OpenAI-compatible text completion     JSON / SSE stream
POST /v1/chat/completions   Chat completion (messages format)     JSON / SSE stream
POST /v1/embeddings         Compute token embeddings              float[] JSON
POST /v1/rerank             Score and rank documents vs query     float[] JSON
GET /v1/models              List loaded models                    JSON model list
POST /tokenize              Tokenize text (debug/inspection)      token ID array
POST /detokenize            Convert token IDs to text             string
GET /health                 Server status and slot availability   JSON
GET /metrics                Prometheus-format metrics             text/plain
GET /slots                  Current slot state (debug)            JSON

Architecture: Slots for Parallel Inference

The server maintains a fixed pool of slots, where each slot handles one concurrent inference request. The number of slots is set by --parallel N (default 1). All slots share the same llama_context and KV cache, but each has its own sampler chain, token buffer, and generation state.

Example slot pool (4 slots):
slot 0   PREFILL      tok 0/512
slot 1   GENERATING   tok 42/200
slot 2   GENERATING   tok 7/100
slot 3   IDLE

Continuous batching: Each server tick, all active slots contribute tokens to a shared llama_batch. After llama_decode(), each slot independently samples its next token. New requests can start filling idle slots immediately, without waiting for other requests to finish — this is the key to high throughput serving.
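
The batching idea above can be sketched without any model at all. The toy below is an illustration under stated assumptions, not the server's code: ToySlot, ToyState and run_tick are hypothetical names, and each slot is reduced to two counters. It shows how generating slots add one token each while prefill slots use whatever room is left in the shared batch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy slot: just the counters needed to show how one batch is shared.
// (Hypothetical type, not the server's server_slot.)
enum class ToyState { Idle, Prefill, Generating };

struct ToySlot {
    ToyState state      = ToyState::Idle;
    int      n_prompt   = 0;  // prompt tokens not yet decoded
    int      n_generate = 0;  // output tokens still to produce
};

// One tick: all active slots contribute tokens to a single shared batch.
// Returns how many tokens one decode call would process.
int run_tick(std::vector<ToySlot> & slots, int n_batch) {
    assert(n_batch > 0);
    int n_tokens = 0;

    // Generating slots add exactly one token each.
    for (ToySlot & s : slots) {
        if (s.state != ToyState::Generating) continue;
        n_tokens += 1;
        s.n_generate -= 1;
        if (s.n_generate <= 0) s.state = ToyState::Idle;  // finished, slot is free again
    }

    // Prefill slots use whatever room is left in the batch.
    for (ToySlot & s : slots) {
        if (s.state != ToyState::Prefill) continue;
        const int take = std::min(n_batch - n_tokens, s.n_prompt);
        if (take > 0) {
            n_tokens   += take;
            s.n_prompt -= take;
        }
        if (s.n_prompt == 0) s.state = ToyState::Generating;  // prompt fully decoded
    }
    return n_tokens;
}
```

With n_batch = 8, a slot holding a 10-token prompt and a slot that is already generating share the first tick: the generating slot takes one position and the prompt takes the remaining seven. A slot that finishes becomes idle in the same tick, so a new request can use it on the next one.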

Request Lifecycle

1. HTTP request arrives — JSON parsing and task queuing
// tools/server/server-http.cpp: route handler for POST /v1/chat/completions (simplified sketch)

svr.Post("/v1/chat/completions", [&](const Request & req, Response & res) {
    // Parse JSON body (json::parse throws on malformed input; error handling omitted)
    json body = json::parse(req.body);

    // Build server_task from request
    server_task task;
    task.type   = SERVER_TASK_TYPE_INFERENCE;
    task.params = parse_sampling_params(body);  // temperature, top-p, etc.
    task.prompt = apply_chat_template(body["messages"]);  // Jinja → string
    task.id     = unique_id();

    // Queue for the inference thread
    queue_tasks.post(task);

    // If streaming (stream=true), set up SSE response
    if (body.value("stream", false)) {
        res.set_chunked_content_provider("text/event-stream", stream_handler);
    } else {
        // Blocking: wait for result future
        auto result = task.result_future.get();
        res.set_content(result.to_json(), "application/json");
    }
});
2. Chat template application — messages → prompt string

Chat APIs send a messages array ([{role:"user", content:"Hello"}]). The server renders it with the model's built-in Jinja chat template, stored in the GGUF metadata under tokenizer.chat_template, to produce the prompt string the model expects.

// tools/server/server-context.cpp
// common/chat.cpp — chat template rendering

// The template renders something like (Llama-3 format):
// <|begin_of_text|>
// <|start_header_id|>user<|end_header_id|>
// Hello
// <|eot_id|>
// <|start_header_id|>assistant<|end_header_id|>
//                    ↑ no closing tag — model generates from here

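// This rendered string is then tokenized. parse_special=true makes markers
// such as <|eot_id|> map to their single special-token IDs:
std::string prompt = apply_template(model, messages, /*add_gen_prompt=*/true);
std::vector<llama_token> prompt_tokens = tokenize(vocab, prompt, /*add_special=*/true, /*parse_special=*/true);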
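
To make the rendering step concrete, here is a hand-written stand-in for the Llama-3 layout shown above. It is only a sketch: the real server evaluates the model's Jinja template rather than hard-coding a format, and ChatMessage and render_llama3 are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical message type for this sketch.
struct ChatMessage {
    std::string role;     // "system", "user" or "assistant"
    std::string content;
};

// Renders messages in the Llama-3 layout. A hard-coded stand-in for what the
// model's Jinja chat template would produce, not the server's implementation.
std::string render_llama3(const std::vector<ChatMessage> & messages, bool add_generation_prompt) {
    std::string out = "<|begin_of_text|>";
    for (const ChatMessage & m : messages) {
        assert(!m.role.empty());
        out += "<|start_header_id|>" + m.role + "<|end_header_id|>\n\n";
        out += m.content + "<|eot_id|>";
    }
    if (add_generation_prompt) {
        // Open an assistant turn and leave it unclosed: the model continues from here.
        out += "<|start_header_id|>assistant<|end_header_id|>\n\n";
    }
    return out;
}
```

Calling render_llama3 with a single user message "Hello" and add_generation_prompt set to true yields the string outlined in the comment block above, ending in the open assistant header.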
3. Slot assignment and KV cache allocation
// tools/server/server-context.cpp — update_slots()

// Find an available slot:
server_slot * slot = find_idle_slot();
if (!slot) {
    // All slots busy — queue the task, retry next tick
    queue_tasks.defer(task);
    return;
}

// Assign the task to the slot:
slot->task       = task;
slot->state      = SLOT_STATE_PROCESSING_PROMPT;
slot->n_decoded  = 0;
slot->smpl       = common_sampler_init(model, task.params);

// Reset the KV cache for this sequence:
// Each slot uses its own seq_id; p0 = -1 and p1 = -1 select every position
llama_memory_seq_rm(llama_get_memory(ctx), slot->id, -1, -1);  // clear any old state
// KV cells for the new tokens are allocated during llama_decode()
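
The assign-or-defer logic can be sketched in isolation. This is a minimal illustration, assuming tasks are plain integer IDs and a slot is idle when it holds no task; MiniSlot, find_idle and dispatch are hypothetical names, not the server's API.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical, stripped-down slot: it only remembers which task it holds.
struct MiniSlot {
    int id      = 0;
    int task_id = -1;  // -1 means idle
    bool idle() const { return task_id < 0; }
};

// Returns the first idle slot, or nullptr when every slot is busy.
MiniSlot * find_idle(std::vector<MiniSlot> & slots) {
    for (MiniSlot & s : slots) {
        if (s.idle()) return &s;
    }
    return nullptr;
}

// Hands pending tasks to idle slots in FIFO order. Tasks that find no idle
// slot stay in `pending` (deferred) and are retried on a later call.
int dispatch(std::deque<int> & pending, std::vector<MiniSlot> & slots) {
    int n_assigned = 0;
    while (!pending.empty()) {
        MiniSlot * slot = find_idle(slots);
        if (slot == nullptr) break;   // all slots busy: defer the rest
        assert(pending.front() >= 0); // task IDs are non-negative in this sketch
        slot->task_id = pending.front();
        pending.pop_front();
        ++n_assigned;
    }
    return n_assigned;
}
```

With two slots and three pending tasks, the first call assigns two tasks and leaves the third deferred; once a slot is released, the next call picks the deferred task up.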
4. The main inference loop — continuous batching

The server runs a tight loop on the inference thread. Each iteration builds one batch from all active slots and executes it together — this is continuous batching.

// tools/server/server-context.cpp — inference loop (simplified)

while (server_running) {
    // PHASE 1: Build batch from all active slots
    common_batch_clear(batch);

    // Generating slots first: each adds its previously sampled token
    for (server_slot & slot : slots) {
        if (slot.state != SLOT_STATE_GENERATING) continue;
        common_batch_add(batch, slot.last_token, slot.n_past, {slot.id}, true);
        slot.logit_index = batch.n_tokens - 1;  // row to sample from after decode
        slot.n_past++;
    }

    // Prompt slots then fill the remaining room, up to n_batch tokens
    for (server_slot & slot : slots) {
        if (slot.state != SLOT_STATE_PROCESSING_PROMPT) continue;

        // Add the next chunk of prompt tokens, resuming at n_past
        const int n_prompt = (int) slot.prompt_tokens.size();
        int n_added = 0;
        while (slot.n_past < n_prompt && batch.n_tokens < n_batch) {
            common_batch_add(batch, slot.prompt_tokens[slot.n_past], slot.n_past, {slot.id}, false);
            slot.n_past++;
            n_added++;
        }

        // Prompt fully queued: request logits for its last token so the
        // first output token can be sampled after this decode
        if (n_added > 0 && slot.n_past == n_prompt) {
            batch.logits[batch.n_tokens - 1] = true;
            slot.logit_index = batch.n_tokens - 1;
            slot.state       = SLOT_STATE_GENERATING;
        }
    }

    if (batch.n_tokens == 0) {
        // Nothing to process: wait for new tasks
        wait_for_tasks();
        continue;
    }

    // PHASE 2: Execute the full batch in one decode call
    const int ret = llama_decode(ctx, batch);  // → runs transformer, fills logits
    if (ret != 0) {
        // Decode failed (for example, no room left in the KV cache);
        // retry and error reporting are omitted in this sketch
        break;
    }

    // PHASE 3: Sample next token for each slot that requested logits
    for (server_slot & slot : slots) {
        if (slot.state != SLOT_STATE_GENERATING) continue;

        // Logit row for this slot's last token in the batch:
        int32_t idx = slot.logit_index;

        // Sample:
        llama_token id = common_sampler_sample(slot.smpl, ctx, idx);
        common_sampler_accept(slot.smpl, id, true);
        slot.n_decoded++;

        // Convert to text and send:
        std::string piece = token_to_piece(vocab, id);
        slot.generated_text += piece;
        send_stream_chunk(slot, piece);  // SSE: "data: {delta: piece}\n\n"

        // Check stop conditions:
        if (llama_vocab_is_eog(vocab, id) ||
            slot.n_decoded >= slot.params.n_predict ||
            slot.generated_text.ends_with(slot.params.stop)) {
            slot.state = SLOT_STATE_DONE;
            send_stream_done(slot);
        } else {
            slot.last_token = id;
            // Slot stays in GENERATING state for next tick
        }
    }
}

KV Cache Reuse (Prompt Caching)

How the server reuses cached prompt prefixes

If two requests share a common prompt prefix (e.g., a system prompt), the server can reuse the KV cache from the first request for the second, skipping the expensive prefill for the shared part.

// tools/server/server-context.cpp — slot assignment with cache lookup

// When assigning a new task to a slot:
// 1. Tokenize the full prompt
// 2. Compare with any existing slot's prompt tokens
// 3. If a prefix matches, reuse that slot's KV cache state:

int n_matching = count_matching_prefix(new_tokens, slot->prompt_tokens);

if (n_matching > 0) {
    // Drop everything after the shared prefix from this slot's sequence:
    llama_memory_seq_rm(llama_get_memory(ctx), slot->id,
                        n_matching, -1);  // keep positions [0, n_matching)
    slot->n_past = n_matching;  // resume prompt processing from here
    // Only the non-cached suffix of the prompt is decoded. If the whole prompt
    // matched, the last token must still be decoded to obtain logits.
}

// This is critical for system prompt reuse:
// Request 1: [system_prompt + "What is 2+2?"]  → full prefill
// Request 2: [system_prompt + "What is Paris?"] → only prefill "What is Paris?"
//                                                  system_prompt comes from cache
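
The prefix comparison itself is simple enough to write out. The sketch below assumes tokens are 32-bit integers; ReusePlan and plan_reuse are hypothetical names for this example. It also encodes one detail worth making explicit: when the incoming prompt matches the cache completely, at least one token is still decoded so that logits exist to sample from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Token = std::int32_t;

// Length of the longest common prefix of two token sequences.
std::size_t count_matching_prefix(const std::vector<Token> & a, const std::vector<Token> & b) {
    std::size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) ++n;
    return n;
}

// Hypothetical result type: how much cache to keep, how much prompt to decode.
struct ReusePlan {
    std::size_t n_keep;    // cached positions [0, n_keep) are reused
    std::size_t n_decode;  // prompt tokens that still need a forward pass
};

ReusePlan plan_reuse(const std::vector<Token> & cached, const std::vector<Token> & incoming) {
    assert(!incoming.empty());
    std::size_t n_keep = count_matching_prefix(cached, incoming);
    // Full match: re-decode the last token so there are logits to sample from.
    if (n_keep == incoming.size()) n_keep -= 1;
    return ReusePlan{n_keep, incoming.size() - n_keep};
}
```

For a cached prompt [1,2,3,4,5] and an incoming prompt [1,2,3,9,9,9], three positions are kept and three tokens are decoded. Resending [1,2,3,4,5] unchanged keeps four positions and decodes one.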

Speculative Decoding

Draft model acceleration for faster generation

The server supports speculative decoding: a small "draft" model generates N candidate tokens cheaply, then the large "target" model verifies them in one batched forward pass. Every accepted draft token saves one full target-model decode step. A rejection wastes only the draft work, because the target's own token for that position comes out of the same pass.

// Speculative decoding flow (server-context.cpp):
// Requires: --model (target) + --model-draft (small fast model)

// Each generation step:
// 1. Draft model generates N tokens cheaply (e.g. N=8):
//    draft_tokens = [t1, t2, t3, t4, t5, t6, t7, t8]

// 2. Target model processes all N+1 tokens in ONE batch:
//    target_batch = [last_accepted, t1, t2, ..., t8]
//    target_logits = llama_decode(ctx_tgt, target_batch)

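// 3. Verify each draft token:
//    for i in 0..N-1:
//      if sample(target_logits[i]) == draft_tokens[i]:
//        accept → token obtained without an extra target pass
//      else:
//        reject → keep the token sampled from target_logits[i] instead
//        → stop verification, discard remaining drafts
//    if all N drafts were accepted, target_logits[N] yields one more token

// Speedup depends on how often drafts are accepted: 2-4× is plausible
// when the draft model agrees with the target most of the time
// Key condition: draft model must use a vocabulary compatible with the target's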

// Server flags:
// --model-draft path/to/small.gguf
// --draft-max 8           # max draft tokens per step
// --draft-min 5           # min to bother drafting
// --draft-p-min 0.9       # only draft if probability high enough
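
The verification step can be sketched for the greedy case, where "sampling" means taking the target's top token at each position. This is an illustration under that assumption, not the server's code: verify_draft is a hypothetical name, and target_choice stands for the tokens the target model would pick at each of the N+1 positions of the verification batch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Token = std::int32_t;

// draft:         the N tokens proposed by the draft model
// target_choice: N+1 tokens; target_choice[i] is the target's pick after
//                seeing the last accepted token followed by draft[0..i-1]
// Returns the tokens actually emitted by this step (between 1 and N+1).
std::vector<Token> verify_draft(const std::vector<Token> & draft,
                                const std::vector<Token> & target_choice) {
    assert(target_choice.size() == draft.size() + 1);
    std::vector<Token> emitted;
    for (std::size_t i = 0; i < draft.size(); ++i) {
        // The target's token is always what gets emitted at position i.
        emitted.push_back(target_choice[i]);
        if (target_choice[i] != draft[i]) {
            return emitted;  // rejection: later draft tokens are discarded
        }
    }
    // Every draft token was accepted: the last logits give one bonus token.
    emitted.push_back(target_choice[draft.size()]);
    return emitted;
}
```

With draft [5, 6, 7] and target choices [5, 6, 9, 1], the step emits [5, 6, 9]: two accepted drafts plus the target's correction. Each step therefore emits at least one token, so output is never lost relative to plain decoding; the cost of a poor draft model is the wasted draft and verification work.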

Streaming Response (SSE)

How tokens are streamed back to the client

When the client sends "stream": true, the server uses Server-Sent Events (SSE). Each generated token is sent as a separate SSE event immediately, enabling the client to display text as it's generated.

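// SSE format (OpenAI-compatible). The JSON is wrapped here for readability;
// on the wire each event is a single "data: ..." line followed by a blank line.
// Each token delta: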
data: {"id":"cmpl-xyz","object":"chat.completion.chunk","model":"llama3",
       "choices":[{"delta":{"content":" Paris"},"index":0}]}

// Final event:
data: {"id":"cmpl-xyz","object":"chat.completion.chunk","model":"llama3",
       "choices":[{"delta":{},"finish_reason":"stop","index":0}],
       "usage":{"prompt_tokens":25,"completion_tokens":12}}

data: [DONE]

// The server sends each piece as soon as it's sampled:
// llama_token → piece string → HTTP chunk → SSE event
// Latency from "token sampled" to "client receives" is a small server-side overhead plus network latency

// Non-streaming response (stream=false):
// Server accumulates all generated tokens before responding:
{
  "id": "cmpl-xyz",
  "object": "chat.completion",
  "model": "llama3",
  "choices": [{
    "message": {"role": "assistant", "content": "The capital of France is Paris."},
    "finish_reason": "stop",
    "index": 0
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 12,
    "total_tokens": 37
  }
}
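
The framing of the streamed events can be sketched on both ends. The code below is a minimal illustration, assuming one "data: " line per event and a blank line as the separator, as in the stream shown above. sse_event, SseResult and parse_sse are hypothetical names, and the JSON payloads are treated as opaque strings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Server side: wrap one payload as an SSE event.
std::string sse_event(const std::string & payload) {
    assert(payload.find('\n') == std::string::npos);  // one data line per event in this sketch
    return "data: " + payload + "\n\n";
}

// Hypothetical client-side result: payloads seen so far, and whether [DONE] arrived.
struct SseResult {
    std::vector<std::string> payloads;
    bool done = false;
};

// Client side: split a received buffer into event payloads.
SseResult parse_sse(const std::string & stream) {
    SseResult result;
    const std::string prefix = "data: ";
    std::size_t pos = 0;
    while (true) {
        const std::size_t end = stream.find("\n\n", pos);
        if (end == std::string::npos) break;  // incomplete event: wait for more bytes
        const std::string event = stream.substr(pos, end - pos);
        pos = end + 2;
        if (event.compare(0, prefix.size(), prefix) != 0) continue;  // not a data line
        const std::string payload = event.substr(prefix.size());
        if (payload == "[DONE]") {
            result.done = true;  // end-of-stream sentinel, not JSON
            break;
        }
        result.payloads.push_back(payload);
    }
    return result;
}
```

A client would typically parse each payload as JSON and append choices[0].delta.content to the text shown to the user, stopping when the [DONE] sentinel arrives.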

Thread Architecture

How threads are organized in the server
// Server thread layout:

// HTTP threads (httplib thread pool, size set by --threads-http):
//   Accept connections, parse JSON, queue tasks, send responses
//   These never call llama_decode() — they just enqueue + wait

// Inference thread (single, dedicated):
//   Runs the main inference loop
//   The ONLY thread that calls llama_decode()
//   Eliminates race conditions on KV cache and context state
//   All slot management happens here

// Queue: task_queue (thread-safe)
//   HTTP threads → produce server_task
//   Inference thread → consume server_task, produce results

// Result queue:
//   Inference thread → produce results (token chunks, final responses)
//   HTTP threads → consume results (send SSE chunks to clients)

// Internal llama_decode() uses its own threadpool:
//   CPU computation: n_threads (set in context params)
//   This threadpool is separate from the HTTP threads above
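
The hand-off between HTTP threads and the inference thread is a classic producer/consumer queue. The sketch below shows one common way to build it with a mutex and a condition variable. TaskQueue is a hypothetical class written for this example and is not the server's own queue type.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <utility>

// Minimal thread-safe FIFO: many producers (HTTP threads), one consumer
// (the inference thread). Hypothetical sketch, not the server's queue.
template <typename T>
class TaskQueue {
public:
    // Called by producers: enqueue and wake the consumer.
    void post(T task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

    // Called by the consumer: blocks until a task is available or the queue
    // was terminated. Returns false only when terminated and empty.
    bool pop(T & out) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !items_.empty() || stopped_; });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }

    // Wakes every waiter so the consumer can shut down cleanly.
    void terminate() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopped_ = true;
        }
        cv_.notify_all();
    }

private:
    std::mutex              mutex_;
    std::condition_variable cv_;
    std::deque<T>           items_;
    bool                    stopped_ = false;
};
```

Because only the consumer ever touches the model context, the mutex protects nothing but the queue itself. That is the design choice behind the single inference thread: contention stays limited to short enqueue and dequeue operations.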

server_slot — State Machine

The full lifecycle of a request slot
// State transitions for server_slot:

IDLE
  │  (new task assigned, tokens loaded)
  ▼
PROCESSING_PROMPT
  │  prompt tokens added to batch, n chunks processed
  │  (all prompt tokens decoded, logits available for last token)
  ▼
GENERATING
  │  one new token sampled per tick
  │  token appended to generated_text, sent as SSE chunk
  │  (EOS token OR n_predict reached OR stop string matched)
  ▼
DONE
  │  final usage stats computed, response completed
  ▼
IDLE  (slot ready for next request)

// In PROCESSING_PROMPT, if the prompt is longer than n_batch (512 in this example),
// it is processed in chunks over multiple ticks:
// Tick 1: tokens[0..511]    → llama_decode() → no sampling
// Tick 2: tokens[512..1023] → llama_decode() → no sampling
// ...
// Final prompt tick: last chunk → llama_decode() → sample first output token
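
The chunking rule above can be written as a small planning function. This sketch assumes the slot has the whole batch to itself, which a busy server does not guarantee; PromptChunk and plan_prompt_chunks are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One decode call's share of the prompt (hypothetical type for this sketch).
struct PromptChunk {
    int  begin;         // first token index (inclusive)
    int  end;           // one past the last token index
    bool needs_logits;  // true only for the final chunk, where sampling starts
};

// Splits a prompt of n_prompt tokens into chunks of at most n_batch tokens.
std::vector<PromptChunk> plan_prompt_chunks(int n_prompt, int n_batch) {
    assert(n_prompt >= 0 && n_batch > 0);
    std::vector<PromptChunk> chunks;
    for (int begin = 0; begin < n_prompt; begin += n_batch) {
        const int end = std::min(begin + n_batch, n_prompt);
        chunks.push_back(PromptChunk{begin, end, end == n_prompt});
    }
    return chunks;
}
```

A 1200-token prompt with n_batch = 512 gives three chunks, [0, 512), [512, 1024) and [1024, 1200), and only the last one requests logits.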

This completes the full loop. Each generated token feeds back into the batch for the next decode step, continuing until the model emits EOS or the max token limit is reached.
