How raw logits become the next token: the composable sampler chain.
llama.cpp uses a composable chain of samplers. Each sampler in the chain receives the current token candidates array, optionally filters or transforms it, and passes it to the next. The last sampler in the chain must be a "selecting" sampler that returns a single token ID.
// include/llama.h#L1190 — sampler interface
// Core sampler vtable (every sampler implements some of these):
struct llama_sampler_i {
    const char *           (*name)  (const struct llama_sampler * smpl);
    void                   (*accept)(struct llama_sampler * smpl, llama_token token);
    void                   (*apply) (struct llama_sampler * smpl,
                                     llama_token_data_array * cur_p); // ← filters candidates
    void                   (*reset) (struct llama_sampler * smpl);
    struct llama_sampler * (*clone) (const struct llama_sampler * smpl);
    void                   (*free)  (struct llama_sampler * smpl);
};
// llama_token_data_array: the candidates being filtered
typedef struct llama_token_data_array {
    llama_token_data * data; // array of {id, logit, p} for each token
    size_t size;             // current number of candidates (shrinks as filtered)
    int64_t selected;        // index of selected token (-1 if not yet sampled)
    bool sorted;             // true if sorted by probability (descending)
} llama_token_data_array;
// Build a typical chat sampler chain:
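struct llama_sampler * smpl =
    llama_sampler_chain_init(llama_sampler_chain_default_params());
// Penalties sampler. Arguments here are (penalty_last_n, penalty_repeat,
// penalty_freq, penalty_present); older releases took extra vocab-related
// arguments, so check the llama.h you build against.
llama_sampler_chain_add(smpl, llama_sampler_init_penalties(
    /*penalty_last_n=*/64, /*penalty_repeat=*/1.1f,
    /*penalty_freq=*/0.0f, /*penalty_present=*/0.0f));
llama_sampler_chain_add(smpl, llama_sampler_init_top_k(40));
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(0.95f, 1)); // p, min_keep
llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.8f));
llama_sampler_chain_add(smpl, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
// ↑ final selector: sample from distribution
// During generation loop:
llama_token id = llama_sampler_sample(smpl, ctx, -1); // -1 = last token's logits
// llama_sampler_sample() already calls llama_sampler_accept() on the token it
// returns, which updates the repetition state. Call llama_sampler_accept()
// yourself only when you drive llama_sampler_apply() manually.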
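The chain idea can be shown in isolation from llama.cpp. The following is a minimal standalone sketch, not the library's implementation: `Candidate`, `Candidates`, `Stage`, `top_k_stage`, `temp_stage`, `greedy_stage`, and `run_chain` are all hypothetical names chosen for illustration. Each stage mutates the candidate list in turn, and the final stage must set `selected`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-ins for llama_token_data / llama_token_data_array.
struct Candidate { int id; float logit; };
struct Candidates {
    std::vector<Candidate> data;
    long selected = -1; // index into data; -1 until a selecting stage runs
};

// A stage filters, transforms, or selects from the candidate list.
using Stage = std::function<void(Candidates &)>;

// Filter stage: keep only the k highest-logit candidates.
inline Stage top_k_stage(std::size_t k) {
    return [k](Candidates & c) {
        std::sort(c.data.begin(), c.data.end(),
                  [](const Candidate & a, const Candidate & b) { return a.logit > b.logit; });
        if (c.data.size() > k) c.data.resize(k);
    };
}

// Transform stage: divide every logit by the temperature t (t > 0).
inline Stage temp_stage(float t) {
    assert(t > 0.0f);
    return [t](Candidates & c) {
        for (auto & x : c.data) x.logit /= t;
    };
}

// Selecting stage: pick the candidate with the highest logit.
inline Stage greedy_stage() {
    return [](Candidates & c) {
        assert(!c.data.empty());
        std::size_t best = 0;
        for (std::size_t i = 1; i < c.data.size(); i++) {
            if (c.data[i].logit > c.data[best].logit) best = i;
        }
        c.selected = static_cast<long>(best);
    };
}

// Run every stage in order and return the selected token id.
inline int run_chain(const std::vector<Stage> & chain, Candidates c) {
    for (const auto & stage : chain) stage(c);
    assert(c.selected >= 0); // the last stage must be a selecting stage
    return c.data[static_cast<std::size_t>(c.selected)].id;
}
```

Calling `run_chain({top_k_stage(2), temp_stage(0.8f), greedy_stage()}, candidates)` mirrors the ordering used above: filters first, then the transform, then the selector.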
Temperature controls the sharpness of the output distribution. Dividing logits by T before softmax is equivalent to raising each probability to the power 1/T and renormalizing.
// src/llama-sampler.cpp — llama_sampler_init_temp
// Applied in apply():
for (size_t i = 0; i < cur_p->size; i++) {
    cur_p->data[i].logit /= temp; // scale logits
}
// Softmax happens at the selection step
// Effect:
// temp<=0 → greedy (argmax, deterministic); handled as a special case, not by dividing by zero
// temp=0.5 → sharper than the raw softmax, less variety
// temp=1.0 → raw softmax probabilities
// temp=2.0 → flatter, more random
// temp>10 → approaches uniform → essentially random token selection
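The equivalence between scaling logits and raising probabilities to the power 1/T can be checked numerically. This is a small standalone sketch; `softmax_with_temp` and `sharpen` are hypothetical helper names, not llama.cpp functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax over logits divided by temperature t (t > 0).
std::vector<double> softmax_with_temp(const std::vector<double> & logits, double t) {
    assert(t > 0.0 && !logits.empty());
    const double max_l = *std::max_element(logits.begin(), logits.end());
    std::vector<double> p(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); i++) {
        p[i] = std::exp((logits[i] - max_l) / t);
        sum += p[i];
    }
    for (double & x : p) x /= sum;
    return p;
}

// Raise each probability to the power 1/t, then renormalize.
std::vector<double> sharpen(const std::vector<double> & probs, double t) {
    assert(t > 0.0 && !probs.empty());
    std::vector<double> q(probs.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < probs.size(); i++) {
        q[i] = std::pow(probs[i], 1.0 / t);
        sum += q[i];
    }
    for (double & x : q) x /= sum;
    return q;
}
```

For any logits, `softmax_with_temp(logits, T)` and `sharpen(softmax_with_temp(logits, 1.0), T)` should agree up to floating-point rounding, because exp(l/T) = (exp(l))^(1/T) and the normalizing constant cancels.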
Top-P (nucleus) sampling keeps only the "nucleus": the smallest set of highest-probability tokens whose cumulative probability reaches p. This adapts to the model's confidence: when the model is very sure (peaked distribution), the nucleus is small; when uncertain, it's large.
// Pseudocode for top-p:
// 1. Compute softmax of all logits
// 2. Sort tokens by probability, descending
// 3. Walk down the sorted list, accumulating probability
// 4. Stop when cumulative sum >= p
// 5. Drop all tokens beyond the cutoff (the candidate list is truncated)
float cumsum = 0.0f;
for (size_t i = 0; i < sorted_size; i++) {
    cumsum += probs[i];
    if (cumsum >= p && i + 1 >= min_keep) {
        cur_p->size = i + 1; // keep only top i+1 tokens
        break;
    }
}
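The steps above can be sketched as a standalone function. This is an illustration under stated assumptions rather than the llama.cpp code: `TokenProb` and `top_p_filter` are hypothetical names, the input is assumed to hold already-normalized probabilities, and the renormalization at the end is a choice made here for clarity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical candidate type: token id plus its softmax probability.
struct TokenProb { int id; float p; };

// Keep the smallest probability-sorted prefix whose cumulative mass reaches
// top_p, but never fewer than min_keep tokens. The result is renormalized.
std::vector<TokenProb> top_p_filter(std::vector<TokenProb> cand, float top_p, std::size_t min_keep) {
    assert(!cand.empty());
    std::sort(cand.begin(), cand.end(),
              [](const TokenProb & a, const TokenProb & b) { return a.p > b.p; });

    std::size_t keep = cand.size(); // fall back to keeping everything
    float cumsum = 0.0f;
    for (std::size_t i = 0; i < cand.size(); i++) {
        cumsum += cand[i].p;
        if (cumsum >= top_p && i + 1 >= min_keep) {
            keep = i + 1;
            break;
        }
    }
    cand.resize(keep);

    // Renormalize the survivors so they sum to 1 again.
    float total = 0.0f;
    for (const auto & c : cand) total += c.p;
    for (auto & c : cand) c.p /= total;
    return cand;
}
```

With probabilities 0.5, 0.3, 0.15, 0.05 and `top_p = 0.9`, the running sum passes 0.9 at the third token, so three tokens survive; a peaked distribution would cross the threshold sooner and keep fewer.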
Three related penalties discourage the model from repeating tokens from recent context. The token history is tracked via llama_sampler_accept() after each generation step.
// For each token t in the candidate list:
//   if t appeared in the last `penalty_last_n` tokens:
//
//     Base penalty (multiplicative):
//       logit[t] = (logit[t] >= 0) ? logit[t] / penalty_repeat
//                                  : logit[t] * penalty_repeat
//
//     Frequency penalty (subtractive, scales with count):
//       logit[t] -= count(t) * penalty_freq
//
//     Presence penalty (subtractive, binary: appeared or not):
//       logit[t] -= penalty_present   (if t appeared at all in window)
// Typical values:
//   penalty_repeat  = 1.1 (positive logits divided by 1.1; applied once per token, regardless of count)
//   penalty_freq    = 0.0 (disabled by default)
//   penalty_present = 0.0 (disabled by default)
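The three penalties can be combined in one pass over the candidates. The sketch below is standalone and follows the pseudocode above; `Candidate` and `apply_penalties` are hypothetical names, and the history is assumed to be a plain vector with the most recent token last.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical candidate type: token id plus its raw logit.
struct Candidate { int id; float logit; };

// Apply repeat, frequency, and presence penalties to every candidate that
// occurs within the last `last_n` tokens of `history`.
void apply_penalties(std::vector<Candidate> & cand, const std::vector<int> & history,
                     std::size_t last_n, float repeat, float freq, float present) {
    assert(repeat > 0.0f);
    // Count occurrences inside the penalty window.
    std::unordered_map<int, int> counts;
    const std::size_t start = history.size() > last_n ? history.size() - last_n : 0;
    for (std::size_t i = start; i < history.size(); i++) counts[history[i]]++;

    for (auto & c : cand) {
        const auto it = counts.find(c.id);
        if (it == counts.end()) continue; // token not in window: untouched

        // Repeat penalty: applied once, whatever the count.
        c.logit = (c.logit >= 0.0f) ? c.logit / repeat : c.logit * repeat;
        // Frequency penalty: grows with the number of occurrences.
        c.logit -= static_cast<float>(it->second) * freq;
        // Presence penalty: flat cost for having appeared at all.
        c.logit -= present;
    }
}
```

The sign check in the repeat penalty matters: dividing a negative logit by a value above 1 would make the token more likely, so negative logits are multiplied instead.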
Min-P (introduced 2023) is an alternative to Top-P. Instead of a fixed cumulative threshold, it removes tokens whose probability is below min_p × max_prob. This scales the cutoff with the model's confidence.
// float max_prob = max of all softmax probabilities
// Keep token t if: prob[t] >= min_p * max_prob
//
// Example:
//   max_prob = 0.6 (model very confident)
//   min_p    = 0.1
//   cutoff   = 0.06 → removes all tokens with prob < 6%
//
//   max_prob = 0.05 (model unsure)
//   cutoff   = 0.005 → much more permissive
// Benefit: behaves like greedy when confident,
//          like broad sampling when uncertain
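The rule above fits in a few lines. This standalone sketch uses hypothetical names (`TokenProb`, `min_p_filter`) and assumes the input already holds softmax probabilities; it leaves the survivors unnormalized, which is a simplification made here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical candidate type: token id plus its softmax probability.
struct TokenProb { int id; float p; };

// Keep every token whose probability is at least min_p times the
// probability of the most likely token.
std::vector<TokenProb> min_p_filter(const std::vector<TokenProb> & cand, float min_p) {
    assert(!cand.empty());
    float max_prob = 0.0f;
    for (const auto & c : cand) max_prob = std::max(max_prob, c.p);

    const float cutoff = min_p * max_prob; // scales with model confidence
    std::vector<TokenProb> kept;
    for (const auto & c : cand) {
        if (c.p >= cutoff) kept.push_back(c);
    }
    return kept; // never empty: the top token always passes when min_p <= 1
}
```

Because the cutoff is relative to the top probability, the same `min_p` prunes aggressively on a peaked distribution and leniently on a flat one, which matches the two examples above.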
Mirostat targets a fixed average surprise (per-token cross-entropy, measured in bits) across generated tokens. Rather than changing the temperature, it adjusts a truncation threshold μ after every step so that the "surprise level" of generation stays stable, avoiding both repetitive and incoherent text.
// Mirostat v2 algorithm (Basu 2021):
//   τ = 3.0 (target surprise per token, in bits)
//   η = 0.1 (learning rate for the μ update)
//   μ = 2τ  (initial surprise cap)
// Each step:
//   1. Sort tokens by probability
//   2. Drop every token whose surprise -log2(p) exceeds μ (always keep at least one)
//   3. Renormalize the survivors and sample from them
//   4. Compute surprise s = -log2(p_selected)
//   5. Update: μ = μ - η * (s - τ)
//      → if sampled token was surprising: lower μ next step (more focused)
//      → if too predictable: raise μ (more diverse)
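One step of this feedback loop can be sketched as follows. This is a standalone illustration of the steps listed above, not the llama.cpp implementation; `TokenProb` and `MirostatV2` are hypothetical names, the input is assumed to hold normalized probabilities, and the random generator is passed in by the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical candidate type: token id plus its normalized probability.
struct TokenProb { int id; double p; };

struct MirostatV2 {
    double tau; // target surprise per token, in bits
    double eta; // learning rate for the mu update
    double mu;  // current surprise cap, initialized to 2 * tau

    MirostatV2(double tau_, double eta_) : tau(tau_), eta(eta_), mu(2.0 * tau_) {}

    // Sample one token id and update mu from the observed surprise.
    template <class Rng>
    int sample(std::vector<TokenProb> cand, Rng & rng) {
        assert(!cand.empty());
        std::sort(cand.begin(), cand.end(),
                  [](const TokenProb & a, const TokenProb & b) { return a.p > b.p; });

        // Truncate: surprise rises down the sorted list, so stop at the first
        // token whose surprise exceeds mu. Index 0 is always kept.
        std::size_t keep = 1;
        while (keep < cand.size() && -std::log2(cand[keep].p) <= mu) keep++;
        cand.resize(keep);

        // Sample from the survivors; discrete_distribution renormalizes.
        std::vector<double> weights;
        double total = 0.0;
        for (const auto & c : cand) { weights.push_back(c.p); total += c.p; }
        std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
        const std::size_t idx = dist(rng);

        // Observed surprise under the renormalized distribution, then feedback.
        const double s = -std::log2(cand[idx].p / total);
        mu -= eta * (s - tau);
        return cand[idx].id;
    }
};
```

A sampled token with surprise above τ shrinks μ, so the next step keeps fewer low-probability tokens; a very predictable token does the opposite.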
// llama_sampler_init_greedy()
// Selects the token with highest logit value unconditionally
// Equivalent to temperature → 0
// apply():
int64_t best = 0;
for (size_t i = 1; i < cur_p->size; i++) {
    if (cur_p->data[i].logit > cur_p->data[best].logit) best = i;
}
cur_p->selected = best;
// Used for: code generation, structured output, reproducible evaluation
// Problem: can get stuck in repetition loops (use repetition penalty)
llama.cpp includes a sampler that enforces a GBNF (GGML BNF) grammar. It maintains a parser state machine and sets logits to -infinity for any token that would produce an invalid continuation.
// Grammar sampler — applied BEFORE top-K/P/temp to constrain the space
struct llama_sampler * grammar =
    llama_sampler_init_grammar(llama_model_get_vocab(model),
                               grammar_str, // GBNF grammar text
                               "root");     // start rule
// During apply(), the grammar parser:
// 1. Identifies which tokens are valid next tokens given the current parse state
// 2. Sets logit = -INFINITY for all invalid tokens
// 3. The subsequent top-K/P samplers then only see valid tokens
// Example GBNF for JSON object with "name" field:
// root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
// string ::= "\"" char* "\""
// char   ::= [a-zA-Z0-9 ]
// ws     ::= [ \t\n]*
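The masking step on its own is simple, even though a real grammar parser is not. The toy sketch below shows only the masking: `Candidate`, `mask_invalid`, and `starts_object` are hypothetical names, and `starts_object` is a stand-in for the parser's "is this token a valid continuation?" check at the very start of the `root` rule.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Hypothetical candidate type: token id, its text piece, and its logit.
struct Candidate { int id; std::string piece; float logit; };

// Toy validity check: at the start of `root`, a piece must begin with '{'.
bool starts_object(const std::string & piece) {
    return !piece.empty() && piece[0] == '{';
}

// Set the logit of every rejected candidate to -infinity so that later
// filters and the final selector can never pick it.
void mask_invalid(std::vector<Candidate> & cand,
                  const std::function<bool(const std::string &)> & accepts) {
    for (auto & c : cand) {
        if (!accepts(c.piece)) {
            c.logit = -std::numeric_limits<float>::infinity();
        }
    }
}
```

After masking, a token such as `hello` cannot be selected however high its original logit was, since exp(-infinity) contributes zero probability in the softmax.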
The selected token_id is returned from llama_sampler_sample(), which has already accepted the token into the sampler's history (for repetition tracking). The caller must then:
// 1. Convert to text (the return value is the number of bytes written;
//    the buffer is not NUL-terminated):
char piece[64];
int n = llama_token_to_piece(vocab, token_id, piece, sizeof(piece), 0, true); // → " Paris"
// 2. Feed back for next step:
struct llama_batch next_batch = llama_batch_get_one(&token_id, 1);
llama_decode(ctx, next_batch); // KV cache grows +1, computes new logits