llama.cpp — Inference Data Flow

How text goes from input string to generated tokens, step by step through the stack.

Source @ 35c9b1f | Attention Is All You Need | GGUF Format

Full Pipeline

Every inference in llama.cpp follows this sequence.

STEP 0
HTTP Request / CLI Input
A POST to /v1/chat/completions arrives at the server, or the CLI reads its arguments. In the server, the JSON body is parsed into a server_task and queued for processing.
"What is the capital of France?" → server_task{prompt, params}
STEP 1 — Tokenization
Text → Token IDs
The prompt string is split by the vocabulary tokenizer (BPE, SentencePiece, etc.) into integer token IDs. Each ID maps to a learned embedding vector.
"Hello world" → [1, 15043, 3186]
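
To make the text-to-IDs mapping concrete, here is a toy sketch in C. It uses greedy longest-match over a five-entry vocabulary, which is a simplified stand-in for BPE or SentencePiece rather than either algorithm. The names TOY_VOCAB, TOY_BOS and toy_tokenize are hypothetical and not part of llama.cpp.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical five-entry vocabulary; the array index is the token ID.
static const char *TOY_VOCAB[] = { "<s>", "Hello", " world", " ", "!" };
enum { TOY_VOCAB_SIZE = 5, TOY_BOS = 0 };

// Greedy longest-match tokenization (illustrative only, not real BPE/SPM).
// Writes up to max_out IDs into out and returns the count, or -1 if the
// text contains an unknown character or the buffer is too small.
static int toy_tokenize(const char *text, int *out, int max_out) {
    assert(text != NULL && out != NULL && max_out > 0);
    int n = 0;
    out[n++] = TOY_BOS;                        // prepend beginning-of-sequence
    size_t pos = 0;
    const size_t len = strlen(text);
    while (pos < len) {
        int best = -1;
        size_t best_len = 0;
        for (int id = 1; id < TOY_VOCAB_SIZE; id++) {
            const size_t l = strlen(TOY_VOCAB[id]);
            if (l > best_len && l <= len - pos &&
                strncmp(text + pos, TOY_VOCAB[id], l) == 0) {
                best = id;                     // longest piece matching at pos
                best_len = l;
            }
        }
        if (best < 0 || n >= max_out) return -1;
        out[n++] = best;
        pos += best_len;
    }
    return n;
}
```

With this toy vocabulary, "Hello world" becomes three IDs (BOS, "Hello", " world"), the same shape as the example line above.
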
STEP 2 — Batch Preparation
Token IDs → llama_batch
Token IDs are packaged into a llama_batch along with positions, sequence IDs, and logit flags. Multiple sequences can share a batch (parallel inference).
batch.token=[…], batch.pos=[0,1,2], batch.logits=[0,0,1]
STEP 3 — Graph Building
llama_batch → ggml_cgraph
The model architecture (LLaMA, Qwen, Mistral, …) builds a lazy computation graph: embedding lookup → n_layer transformer layers (attention + FFN; 32 in this example) → output projection. No computation yet.
ggml_cgraph{1847 nodes} describing full forward pass
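
The "build now, compute later" idea can be sketched with scalar nodes in plain C. This is only an illustration of lazy graph construction under simplified assumptions: toy_node, toy_leaf, toy_build and toy_compute are hypothetical names, and real ggml graphs hold tensors and are executed by a backend scheduler rather than by recursion.

```c
#include <assert.h>
#include <stddef.h>

enum toy_op { TOY_NONE, TOY_ADD, TOY_MUL };

// Scalar stand-in for a graph node: an op tag plus pointers to its inputs.
struct toy_node {
    enum toy_op op;
    struct toy_node *src[2];   // graph edges
    float value;               // leaf data, or the result once computed
    int computed;              // 0 until toy_compute() has visited the node
};

// A leaf carries data and needs no computation.
static struct toy_node toy_leaf(float v) {
    struct toy_node n = { TOY_NONE, { NULL, NULL }, v, 1 };
    return n;
}

// Building a node only records the op and its inputs; no arithmetic runs here.
static struct toy_node toy_build(enum toy_op op, struct toy_node *a, struct toy_node *b) {
    assert(op != TOY_NONE && a != NULL && b != NULL);
    struct toy_node n = { op, { a, b }, 0.0f, 0 };
    return n;
}

// Evaluation walks the src[] edges depth-first and fills in the values.
static float toy_compute(struct toy_node *n) {
    assert(n != NULL);
    if (!n->computed) {
        const float a = toy_compute(n->src[0]);
        const float b = toy_compute(n->src[1]);
        n->value = (n->op == TOY_ADD) ? a + b : a * b;
        n->computed = 1;
    }
    return n->value;
}
```

Building y = w * x + b from three leaves leaves every non-leaf node uncomputed until toy_compute() is called on y.
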
STEP 4 — Backend Execution
ggml_cgraph → device kernels
The backend scheduler assigns each graph node to a device (CPU / CUDA / Metal). Kernels execute: matrix multiplications, attention softmax, RoPE encoding, RMS norm, SiLU activation.
W_q @ x → CUDA kernel / CPU SIMD / Metal shader
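
As a reference for what a kernel such as W_q @ x computes, here is a naive matrix-vector product in C. It is a minimal sketch assuming a dense, row-major F32 matrix. The function name matvec_f32 is hypothetical, and the real backends use quantized formats, SIMD and GPU kernels instead of this loop.

```c
#include <assert.h>

// y = W @ x for a row-major W of shape [rows x cols].
static void matvec_f32(const float *W, const float *x, float *y, int rows, int cols) {
    assert(W != 0 && x != 0 && y != 0 && rows > 0 && cols > 0);
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (int c = 0; c < cols; c++) {
            acc += W[r * cols + c] * x[c];   // dot product of row r with x
        }
        y[r] = acc;
    }
}
```

Every backend has to produce the same result as this loop, up to floating-point rounding, which makes a naive version useful as a correctness baseline.
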
STEP 5 — Logits Extraction
hidden state → logits [n_vocab]
After all layers, the final hidden state is projected to vocabulary size. Only logits for tokens marked in batch.logits are copied from device to host.
logits[128256] — one float per vocab token
STEP 6 — Sampling
logits → next token ID
A sampler chain filters logits: temperature scaling, top-K, top-P nucleus, repetition penalty. A categorical draw picks the next token ID from the filtered distribution.
logits[128256] → top-K/P → temp → sample → token_id=4521
STEP 7 — Detokenization
token ID → text piece
The sampled token ID is converted back to a string via the vocabulary. The piece is appended to the output buffer and streamed to the client.
token_id=4521 → " Paris"
STEP 8 — Loop / Stop
Autoregressive generation
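Steps 2–7 repeat: the new token becomes the next batch input. The KV cache stores past key/value tensors, so only the new token runs through the layers at each step. Generation stops on an end-of-generation token or when the token limit is reached.
KV cache grows by 1 row per step; attention for each new token costs O(n) in the cached length, instead of reprocessing the whole prefix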
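
A rough way to see what the KV cache saves is to count how many tokens must be pushed through the transformer to produce a given number of new tokens. The sketch below is a back-of-the-envelope model. The function names are hypothetical, and it ignores batching and the per-token attention cost, which still grows with context length.

```c
#include <assert.h>

// Without a cache, every step reprocesses the whole prefix:
// the prompt plus all tokens generated so far.
static long long tokens_processed_no_cache(int n_prompt, int n_gen) {
    assert(n_prompt > 0 && n_gen >= 0);
    long long total = 0;
    for (int i = 0; i < n_gen; i++) total += n_prompt + i;
    return total;
}

// With a cache, the prompt is prefilled once. The first new token is sampled
// from the prefill logits, and each later token needs one single-token decode.
static long long tokens_processed_with_cache(int n_prompt, int n_gen) {
    assert(n_prompt > 0 && n_gen >= 0);
    if (n_gen == 0) return 0;
    return (long long)n_prompt + (n_gen - 1);
}

// Keys a token at position pos (0-based) attends to, itself included.
static int keys_attended(int pos) {
    assert(pos >= 0);
    return pos + 1;
}
```

For an 8-token prompt and 4 generated tokens this gives 38 token passes without a cache versus 11 with one, and the gap widens quadratically as generation continues.
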

Key Data Structures

llama_model — loaded weights

Opaque struct holding all model tensors (embeddings, attention matrices, FFN weights) plus hyperparameters loaded from the GGUF file. Shared across contexts and never mutated after load.

// include/llama.h — forward declaration only (opaque)
struct llama_model;   // ← allocated by llama_model_load_from_file()

// Key hyperparameters exposed via llama_model_* accessors:
//   llama_model_n_embd()   → embedding dimension (e.g. 4096)
//   llama_model_n_layer()  → transformer depth (e.g. 32)
//   llama_model_n_head()   → attention heads
//   llama_vocab_n_tokens() → vocabulary size (on the vocab from llama_model_get_vocab())
llama_context — inference state

Holds the KV cache, logits buffer, backend scheduler, and all per-request state. One context per inference session; multiple contexts can share a model.

// include/llama.h#L336 — context parameters
struct llama_context_params {
    uint32_t n_ctx;        // token context window (default 512)
    uint32_t n_batch;      // max tokens per llama_decode() call
    uint32_t n_ubatch;     // physical micro-batch size
    int32_t  n_threads;    // threads for CPU ops (generation)
    enum ggml_type type_k; // KV cache key dtype (F16 default)
    enum ggml_type type_v; // KV cache value dtype
    enum llama_flash_attn_type flash_attn_type; // flash attention: auto, disabled, or enabled
    // ...
};
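
Since n_ctx, type_k and type_v together determine how much memory the KV cache needs, a quick estimate can help when choosing them. The helper below is a rough sketch: kv_cache_bytes is a hypothetical name, n_embd_kv (KV heads times head dimension) is assumed to be known for the model, and the formula assumes a fixed byte size per element, which does not hold exactly for block-quantized cache types.

```c
#include <assert.h>
#include <stdint.h>

// Approximate KV cache size in bytes.
// n_embd_kv is the per-token width of K (and of V): n_head_kv * head_dim.
static uint64_t kv_cache_bytes(uint32_t n_ctx, uint32_t n_layer,
                               uint32_t n_embd_kv, uint32_t bytes_per_elem) {
    assert(n_ctx > 0 && n_layer > 0 && n_embd_kv > 0 && bytes_per_elem > 0);
    // The leading 2 accounts for storing both K and V in every layer.
    return 2ull * n_layer * n_ctx * n_embd_kv * bytes_per_elem;
}
```

For example, 32 layers, n_ctx = 4096, n_embd_kv = 4096 and F16 (2 bytes per element) come to 2 GiB. A model using grouped-query attention with n_embd_kv = 1024 would need a quarter of that.
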
llama_batch — one decode call's input

All data needed for a single llama_decode() call. Can contain multiple tokens from multiple sequences simultaneously.

// include/llama.h#L240
typedef struct llama_batch {
    int32_t       n_tokens;   // total tokens in this batch
    llama_token * token;      // token IDs   [n_tokens]
    float       * embd;       // embeddings  [n_tokens * n_embd] (or NULL)
    llama_pos   * pos;        // positions   [n_tokens]
    int32_t     * n_seq_id;   // seq count   [n_tokens]
    llama_seq_id** seq_id;    // seq IDs     [n_tokens]
    int8_t      * logits;     // want logits?[n_tokens]  1=yes, 0=no
} llama_batch;
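
To show how the per-token arrays line up, here is a simplified single-sequence stand-in for llama_batch with fixed-size arrays. The names toy_batch and toy_batch_from_prompt are hypothetical. The real struct uses pointers, supports several sequence IDs per token, and can carry embeddings instead of token IDs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_MAX_TOKENS = 16 };

// Simplified stand-in for llama_batch: one sequence ID per token, no embd.
struct toy_batch {
    int32_t n_tokens;
    int32_t token[TOY_MAX_TOKENS];
    int32_t pos[TOY_MAX_TOKENS];
    int32_t seq_id[TOY_MAX_TOKENS];
    int8_t  logits[TOY_MAX_TOKENS];
};

// Pack a prompt for one sequence, starting at position pos0.
// Only the last token requests logits, because only it feeds the sampler.
static struct toy_batch toy_batch_from_prompt(const int32_t *tokens, int32_t n,
                                              int32_t pos0, int32_t seq) {
    assert(tokens != NULL && n > 0 && n <= TOY_MAX_TOKENS);
    struct toy_batch b = { 0 };
    b.n_tokens = n;
    for (int32_t i = 0; i < n; i++) {
        b.token[i]  = tokens[i];
        b.pos[i]    = pos0 + i;        // consecutive positions in the sequence
        b.seq_id[i] = seq;
        b.logits[i] = (i == n - 1);    // 1 only for the final token
    }
    return b;
}
```

For the three-token prompt [1, 15043, 3186] this produces pos = [0, 1, 2] and logits = [0, 0, 1], matching the batch shown in Step 2.
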
ggml_tensor — the fundamental compute unit

Every weight matrix and intermediate activation is a ggml_tensor. Tensors are nodes in the computation graph; their src[] pointers form the graph edges. No computation happens at construction time.

// ggml/include/ggml.h#L666
struct ggml_tensor {
    enum ggml_type type;       // F32, F16, Q4_0, Q8_0, ...
    int64_t ne[4];             // dimensions [cols, rows, batch, batch2]
    size_t  nb[4];             // byte strides per dimension
    enum ggml_op op;           // NONE | MUL_MAT | SOFT_MAX | ROPE | ...
    struct ggml_tensor * src[GGML_MAX_SRC]; // input tensors (graph edges)
    void  * data;              // raw data pointer (host or device memory, depending on the buffer)
    char    name[GGML_MAX_NAME];
};
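
The relationship between ne[] and nb[] is easiest to see for a contiguous, non-quantized tensor. The sketch below computes byte strides and an element offset under that assumption. contiguous_strides and element_offset are hypothetical helper names, and quantized types, which store elements in blocks, are not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Byte strides for a contiguous tensor whose elements are elem_size bytes each.
// nb[0] is the element size; each later stride spans one full lower dimension.
static void contiguous_strides(const int64_t ne[4], size_t elem_size, size_t nb[4]) {
    assert(ne != NULL && nb != NULL && elem_size > 0);
    nb[0] = elem_size;
    for (int i = 1; i < 4; i++) {
        nb[i] = nb[i - 1] * (size_t)ne[i - 1];
    }
}

// Byte offset of element (i0, i1, i2, i3) from the start of the data buffer.
static size_t element_offset(const size_t nb[4], int64_t i0, int64_t i1,
                             int64_t i2, int64_t i3) {
    assert(nb != NULL);
    return (size_t)i0 * nb[0] + (size_t)i1 * nb[1]
         + (size_t)i2 * nb[2] + (size_t)i3 * nb[3];
}
```

For an F32 tensor with ne = [4096, 32, 1, 1] the strides are [4, 16384, 524288, 524288] bytes, so moving one step along dimension 1 skips a full row of 4096 floats.
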

Minimal Inference Example

Complete C API call sequence

This is the skeleton every llama.cpp-based program follows. The server and CLI examples wrap this pattern.

// 1. Init
llama_backend_init();                               // llama.h#L449

// 2. Load model from GGUF file
struct llama_model_params mparams = llama_model_default_params();
mparams.n_gpu_layers = 35;                          // offload to GPU
struct llama_model * model =
    llama_model_load_from_file("model.gguf", mparams); // llama.h#L484

// 3. Create inference context
struct llama_context_params cparams = llama_context_default_params();
cparams.n_ctx    = 4096;
cparams.n_batch  = 512;
struct llama_context * ctx =
    llama_init_from_model(model, cparams);           // llama.h#L509

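// 4. Tokenize
const char * prompt = "Hello, world!";
const int max_tokens = 64;                           // capacity of the token buffer
llama_token tokens[64];
int n_tokens = llama_tokenize(                       // llama.h#L1125
    llama_model_get_vocab(model),
    prompt, strlen(prompt),
    tokens, max_tokens, true, false);                // add_special=true, parse_special=false
// A negative return value means the buffer was too small.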

// 5. Create sampler chain
struct llama_sampler * smpl =
    llama_sampler_chain_init(llama_sampler_chain_default_params());
llama_sampler_chain_add(smpl, llama_sampler_init_top_k(50));
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(0.9, 1));
llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.8));
llama_sampler_chain_add(smpl, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

// 6. Prefill prompt (process all prompt tokens at once)
struct llama_batch batch = llama_batch_get_one(tokens, n_tokens);
llama_decode(ctx, batch);                            // llama.h#L953

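// 7. Autoregressive generation loop
const int max_new_tokens = 128;                      // generation budget (example value)
for (int i = 0; i < max_new_tokens; i++) {
    // llama_sampler_sample() also accepts the token into the sampler state,
    // so no separate llama_sampler_accept() call is needed.
    llama_token id =
        llama_sampler_sample(smpl, ctx, -1);         // llama.h#L1481
    if (llama_vocab_is_eog(llama_model_get_vocab(model), id)) break;

    // Print the token; the piece is not null-terminated, so print n bytes
    char piece[128];
    int n = llama_token_to_piece(llama_model_get_vocab(model), id, piece, sizeof(piece), 0, true);
    if (n > 0) printf("%.*s", n, piece);

    // Feed sampled token back as next input
    batch = llama_batch_get_one(&id, 1);
    if (llama_decode(ctx, batch) != 0) break;        // KV cache grows +1
}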

// 8. Cleanup
llama_sampler_free(smpl);
llama_free(ctx);
llama_model_free(model);
llama_backend_free();

Architecture and File Map

Key source files
llama.cpp/
├── include/llama.h            ← Public C API (all structs & functions)
├── src/
│   ├── llama.cpp              ← Thin wrappers → internal implementations
│   ├── llama-model.cpp        ← GGUF loading, architecture dispatch
│   ├── llama-context.cpp      ← decode(), KV cache, logits extraction
│   ├── llama-graph.cpp        ← Shared ggml_cgraph building blocks (inputs, attention, FFN)
│   ├── llama-batch.cpp        ← Batch splitting into micro-batches
│   ├── llama-vocab.cpp        ← Tokenizer (BPE / SPM / WPM / UGM / RWKV)
│   ├── llama-sampler.cpp      ← Sampler chain implementations
│   └── models/llama*.cpp      ← Per-architecture graph builders
├── ggml/
│   ├── include/ggml.h         ← Tensor API (ggml_tensor, all ops)
│   ├── src/ggml.c             ← Core tensor and op definitions, graph construction
│   ├── src/ggml-backend.cpp   ← Backend abstraction & scheduler
│   ├── src/ggml-cpu/          ← SIMD kernels (AVX2, NEON, …)
│   ├── src/ggml-cuda/         ← CUDA kernels
│   └── src/ggml-metal/        ← Metal shaders
└── tools/server/
    ├── server.cpp             ← HTTP server entry point
    ├── server-context.cpp     ← Inference loop, slot management
    └── server-http.cpp        ← Route handlers, JSON I/O