Converting a raw text string into a sequence of integer token IDs — and back again.
Language models don't operate on characters or words — they operate on tokens: variable-length subword units chosen to balance vocabulary size against sequence length. The mapping between strings and token IDs is fixed after training.
The tokenizer type is stored in the GGUF file and read at model load time. llama.cpp supports 6+ types covering essentially all modern LLM families.
How "Hello world" gets tokenized with a LLaMA-3 BPE vocabulary: ["Hello", " world"]
Notice the space is part of the token " world" — BPE encodes whitespace as part of the following token, which is why the leading space matters for correct round-tripping.
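To make the round-tripping point concrete, here is a minimal sketch that detokenizes by concatenating pieces verbatim. The table `kRoundTripPieces` and the function `toy_detokenize` are hypothetical names for illustration, and the ID used for " world" is a placeholder rather than a value taken from a real vocabulary.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Toy vocabulary. 9906 is the ID the text gives for "Hello";
// the ID for " world" is a placeholder chosen for illustration.
static const std::unordered_map<int, std::string> kRoundTripPieces = {
    {9906, "Hello"},
    {1917, " world"},
};

// Detokenize by concatenating each token's piece exactly as stored.
// No separator is inserted: the space already lives inside " world".
std::string toy_detokenize(const std::vector<int> & ids) {
    std::string out;
    for (int id : ids) {
        out += kRoundTripPieces.at(id);
    }
    return out;
}
```

Because the whitespace is carried by the token itself, joining the pieces with an extra space (or stripping the leading one) would produce a string that differs from the input.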
// include/llama.h#L1125
LLAMA_API int32_t llama_tokenize(
const struct llama_vocab * vocab,
const char * text,
int32_t text_len,
llama_token * tokens, // output array
int32_t n_tokens_max,
bool add_special, // add BOS/EOS tokens if the model is configured to?
bool parse_special); // treat <|im_start|> etc as tokens?
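Returns the number of tokens written. If the output buffer is too small, it instead returns the negative of the number of tokens required, so a call with tokens=NULL and n_tokens_max=0 can be used to query the count first.
// Usage pattern (text_len is the byte length of the input, e.g. strlen(text)):
const int32_t len = (int32_t) strlen(text);
int n = -llama_tokenize(vocab, text, len, NULL, 0, true, false); // query count: negate the result
std::vector<llama_token> ids(n);
llama_tokenize(vocab, text, len, ids.data(), n, true, false);    // second call fills the buffer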
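The two-call pattern can be wrapped in a small helper. The sketch below is self-contained, so it uses a hypothetical stand-in, `toy_tokenize`, that emits one token per byte and mirrors only the return convention documented in llama.h (token count on success, negated required count when the buffer is too small). It is not the real tokenizer, and `tokenize_all` is an illustrative name.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for llama_tokenize: one token per input byte.
// Returns the token count, or the negated required count if the
// output buffer cannot hold all tokens.
int32_t toy_tokenize(const char * text, int32_t text_len,
                     int32_t * tokens, int32_t n_tokens_max) {
    if (text_len > n_tokens_max) {
        return -text_len;
    }
    for (int32_t i = 0; i < text_len; ++i) {
        tokens[i] = static_cast<unsigned char>(text[i]);
    }
    return text_len;
}

// Two-pass wrapper: query the size, allocate, then tokenize for real.
std::vector<int32_t> tokenize_all(const std::string & text) {
    const int32_t len = static_cast<int32_t>(text.size());

    // Pass 1: a zero-capacity buffer makes the call report the needed size.
    const int32_t probe  = toy_tokenize(text.c_str(), len, nullptr, 0);
    const int32_t needed = probe < 0 ? -probe : probe;

    // Pass 2: the buffer is now large enough, so the result is non-negative.
    std::vector<int32_t> ids(static_cast<std::size_t>(needed));
    const int32_t n = toy_tokenize(text.c_str(), len, ids.data(), needed);
    ids.resize(n > 0 ? static_cast<std::size_t>(n) : 0);
    return ids;
}
```

With the real API the same shape applies: swap the stand-in for `llama_tokenize` and pass the vocab and the two boolean flags through.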
// include/llama.h#L1147
LLAMA_API int32_t llama_token_to_piece(
const struct llama_vocab * vocab,
llama_token token,
char * buf,
int32_t length,
int32_t lstrip, // number of leading spaces to strip
bool special); // render special tokens as text?
// Returns number of bytes written.
// Negative return = buffer too small (abs value = needed size)
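The negative-return convention suggests a grow-and-retry pattern on the caller's side. The following is a sketch under stated assumptions: `toy_token_to_piece` and `kPieceTable` are hypothetical stand-ins that reproduce only the return convention described above, so the example runs without the library. Like the real function, the stand-in does not write a null terminator.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical toy vocabulary, indexed by token ID.
static const std::vector<std::string> kPieceTable = {
    "Hello", " world", " a-much-longer-piece",
};

// Stand-in for llama_token_to_piece: copies the piece into buf and
// returns the byte count, or the negated required size if buf is too small.
int32_t toy_token_to_piece(int32_t token, char * buf, int32_t length) {
    const std::string & piece = kPieceTable.at(static_cast<std::size_t>(token));
    const int32_t n = static_cast<int32_t>(piece.size());
    if (n > length) {
        return -n;
    }
    std::memcpy(buf, piece.data(), static_cast<std::size_t>(n));
    return n;
}

// Try a small buffer first; on a negative return, resize to the
// reported size and call again.
std::string piece_to_string(int32_t token) {
    std::string out(8, '\0');
    int32_t n = toy_token_to_piece(token, &out[0], static_cast<int32_t>(out.size()));
    if (n < 0) {
        out.resize(static_cast<std::size_t>(-n));
        n = toy_token_to_piece(token, &out[0], static_cast<int32_t>(out.size()));
    }
    out.resize(n > 0 ? static_cast<std::size_t>(n) : 0);
    return out;
}
```

The final `resize` trims the buffer to the bytes actually written, which matters because the output is not null-terminated.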
BPE tokenization is a two-stage greedy process:
The raw string is first split by a regex pattern (model-specific) into "words". For GPT-2/LLaMA-3 style BPE, this separates punctuation, handles whitespace, and prevents merges across word boundaries.
// Example regex (simplified LLaMA-3 style pattern; the original GPT-2 pattern is similar but simpler):
// "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|..."
// "Hello world!" → ["Hello", " world", "!"]
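The pre-tokenization step can be approximated with `std::regex`. This is an ASCII-only sketch: `std::regex` has no `\p{L}` or `\p{N}` classes, so the Unicode categories are replaced by `[A-Za-z]` and `[0-9]`, and the whitespace lookahead of the real patterns is dropped. The function name `toy_pretokenize` is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Split text into pre-tokens with an ASCII approximation of a
// GPT-2-style pattern: contractions, optional-space + letters,
// optional-space + digits, optional-space + punctuation, whitespace runs.
std::vector<std::string> toy_pretokenize(const std::string & text) {
    static const std::regex pattern(
        "'s|'t|'re|'ve|'m|'ll|'d"
        "| ?[A-Za-z]+"
        "| ?[0-9]+"
        "| ?[^ \t\r\nA-Za-z0-9]+"
        "|[ \t\r\n]+");

    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
         it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

On "Hello world!" this yields the three pieces shown above, and on "don't" it splits off the contraction as "'t", which is what keeps later merges from crossing those boundaries.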
Each pre-token is converted to UTF-8 bytes, then the BPE merge rules (learned during training) are applied greedily in priority order:
// "Hello" as bytes: ['H','e','l','l','o']
// Merge table lookup (highest priority first):
// ('H','e') → 'He' → ['He','l','l','o']
// ('l','l') → 'll' → ['He','ll','o']
// ('He','ll') → 'Hell' → ['Hell','o']
// ('Hell','o') → 'Hello' → ['Hello'] ← single token!
// Result: token_id for "Hello" = 9906
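The merge loop traced above can be sketched in a few lines. This is a minimal quadratic-time illustration, not the priority-queue implementation a production tokenizer would use; `MergeRanks` and `toy_bpe` are hypothetical names, and a lower rank means a higher priority.

```cpp
#include <climits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Maps an adjacent pair of symbols to its merge rank (lower = earlier).
using MergeRanks = std::map<std::pair<std::string, std::string>, int>;

// Apply BPE merges to one pre-token: start from single bytes and
// repeatedly merge the adjacent pair with the best (lowest) rank.
std::vector<std::string> toy_bpe(const std::string & word, const MergeRanks & ranks) {
    std::vector<std::string> syms;
    for (char c : word) {
        syms.emplace_back(1, c);
    }

    while (syms.size() > 1) {
        int best_rank = INT_MAX;
        std::size_t best_i = 0;
        for (std::size_t i = 0; i + 1 < syms.size(); ++i) {
            auto it = ranks.find({syms[i], syms[i + 1]});
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_i = i;
            }
        }
        if (best_rank == INT_MAX) {
            break;  // no applicable merge rule is left
        }
        syms[best_i] += syms[best_i + 1];
        syms.erase(syms.begin() + static_cast<std::ptrdiff_t>(best_i) + 1);
    }
    return syms;
}
```

With the four rules from the trace ranked 0 to 3, "Hello" collapses to a single symbol, while a word such as "Help" stops at ["He", "l", "p"] because no rule covers the remaining pairs. Each surviving symbol is then looked up in the vocabulary to get its ID.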
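SPM (SentencePiece) doesn't pre-tokenize on whitespace: it treats the whole input as one sequence and encodes spaces as a visible '▁' symbol. SentencePiece models come in two variants. Unigram models use a Viterbi search to find the most likely segmentation under a unigram language model. BPE models (the variant used by LLaMA-1/2) instead start from individual characters and repeatedly merge adjacent pieces, choosing the pair whose merged token has the highest score in the vocabulary.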
// SPM special handling:
// 1. Normalizes unicode (NFKC or custom rules)
// 2. Adds '▁' (U+2581) before each word to encode spaces
// 3. Byte fallback: unknown bytes → <0xHH> tokens
// "Hello world" with SPM:
// → ['▁Hello', '▁world'] (if both are in vocab)
// → ['▁He', 'llo', '▁world'] (if 'Hello' not in vocab)
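Steps 2 and 3 are simple string transformations, sketched here in isolation. The helper names are hypothetical, and adding a '▁' in front of the whole input is an assumption: whether a prefix is added is model-specific. Normalization and the actual segmentation are out of scope for this sketch.

```cpp
#include <cstdio>
#include <string>

// U+2581 (LOWER ONE EIGHTH BLOCK) encoded as UTF-8.
static const std::string kSpmSpace = "\xE2\x96\x81";

// Replace every space with U+2581 and put one in front of the input
// (assumed default; some models skip the prefix).
std::string toy_spm_escape(const std::string & text) {
    std::string out = kSpmSpace;
    for (char c : text) {
        if (c == ' ') {
            out += kSpmSpace;
        } else {
            out += c;
        }
    }
    return out;
}

// Byte fallback: spell a byte with no vocabulary entry as "<0xHH>".
std::string toy_byte_fallback(unsigned char byte) {
    char buf[8];
    std::snprintf(buf, sizeof(buf), "<0x%02X>", static_cast<unsigned>(byte));
    return std::string(buf);
}
```

After escaping, "Hello world" becomes "▁Hello▁world", which is the string the segmentation step then splits into vocabulary pieces.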
Every model defines special tokens stored in the GGUF vocabulary metadata. These are not subwords learned from the training text but reserved IDs with specific semantic meaning.
// include/llama.h: special token accessors
LLAMA_API llama_token llama_vocab_bos(const struct llama_vocab * vocab); // begin-of-sequence
LLAMA_API llama_token llama_vocab_eos(const struct llama_vocab * vocab); // end-of-sequence
LLAMA_API llama_token llama_vocab_eot(const struct llama_vocab * vocab); // end-of-turn
LLAMA_API llama_token llama_vocab_sep(const struct llama_vocab * vocab); // separator
// Check if token signals end of generation:
LLAMA_API bool llama_vocab_is_eog(const struct llama_vocab * vocab, llama_token token);
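A model can have more than one token that ends generation (for example both an end-of-sequence and an end-of-turn token), which is why a set-membership check is safer than comparing against EOS alone. The sketch below illustrates that idea with a hypothetical `ToyVocab`; it does not call the real `llama_vocab_is_eog`, and the IDs in the usage note are illustrative.

```cpp
#include <set>
#include <vector>

// Hypothetical vocabulary holding only the end-of-generation token IDs.
struct ToyVocab {
    std::set<int> eog_ids;
};

// Mirrors the role of an is-EOG check: true for any stop token.
bool toy_is_eog(const ToyVocab & vocab, int token) {
    return vocab.eog_ids.count(token) > 0;
}

// Keep sampled tokens up to, but not including, the first stop token.
std::vector<int> take_until_eog(const ToyVocab & vocab, const std::vector<int> & sampled) {
    std::vector<int> out;
    for (int token : sampled) {
        if (toy_is_eog(vocab, token)) {
            break;
        }
        out.push_back(token);
    }
    return out;
}
```

For instance, with stop IDs {128001, 128009}, the stream [9906, 1917, 128009, 42] is cut to [9906, 1917], whichever of the two stop tokens appears first.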
Chat-format models (instruct/assistant variants) additionally use role-delimiter tokens like <|im_start|>, <|im_end|>, [INST], <|eot_id|>. These are inserted by the chat template (a Jinja2 template string stored in the GGUF metadata) when a conversation is rendered to text, before tokenization.
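To show what a chat template produces, here is a hand-written equivalent of the ChatML layout. It is a sketch that hard-codes one format rather than evaluating a Jinja2 template, and `ToyMessage` and `toy_chatml` are hypothetical names.

```cpp
#include <string>
#include <vector>

// One conversation turn.
struct ToyMessage {
    std::string role;     // e.g. "system", "user", "assistant"
    std::string content;
};

// Render messages in ChatML style:
//   <|im_start|>ROLE\nCONTENT<|im_end|>\n
// Optionally open an assistant turn so the model continues from there.
std::string toy_chatml(const std::vector<ToyMessage> & messages, bool add_generation_prompt) {
    std::string out;
    for (const ToyMessage & m : messages) {
        out += "<|im_start|>" + m.role + "\n" + m.content + "<|im_end|>\n";
    }
    if (add_generation_prompt) {
        out += "<|im_start|>assistant\n";
    }
    return out;
}
```

The rendered string has to be tokenized with parse_special set to true; otherwise the delimiters would be split into ordinary subwords instead of mapping to their reserved IDs.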
The GGUF file format stores tokenizer data as key-value metadata entries. These are loaded before any tensors.
// GGUF metadata keys for vocabulary:
"tokenizer.ggml.model"        → "gpt2" | "llama" | "bert" | ...
"tokenizer.ggml.tokens"       → array of token strings (length = n_vocab)
"tokenizer.ggml.scores"       → array of log-probabilities (for SPM)
"tokenizer.ggml.token_type"   → NORMAL | UNKNOWN | CONTROL | USER_DEFINED | BYTE
"tokenizer.ggml.merges"       → BPE merge rules (ordered by priority)
"tokenizer.ggml.bos_token_id" → integer
"tokenizer.ggml.eos_token_id" → integer
"tokenizer.chat_template"     → Jinja2 template string
At load time, llama_vocab builds hash maps from token strings to IDs and vice versa, plus the merge priority table for BPE.
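The lookup structures can be sketched from two plain arrays. This assumes the merges are stored as "left right" strings separated by a single space and that a rule's position in the array is its priority; `ToyVocabTables` and `toy_build_tables` are hypothetical names, not the real llama_vocab internals.

```cpp
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct ToyVocabTables {
    std::vector<std::string> id_to_token;                          // ID -> string
    std::unordered_map<std::string, int> token_to_id;              // string -> ID
    std::map<std::pair<std::string, std::string>, int> merge_rank; // pair -> priority
};

// Build both directions of the token mapping plus the merge rank table.
ToyVocabTables toy_build_tables(const std::vector<std::string> & tokens,
                                const std::vector<std::string> & merges) {
    ToyVocabTables t;
    t.id_to_token = tokens;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        t.token_to_id[tokens[i]] = static_cast<int>(i);
    }
    for (std::size_t rank = 0; rank < merges.size(); ++rank) {
        const std::string & rule = merges[rank];
        // Search from index 1 so a left part that is itself a space still splits.
        const std::size_t sep = rule.find(' ', 1);
        if (sep == std::string::npos) {
            continue;  // malformed rule, skipped in this sketch
        }
        t.merge_rank[{rule.substr(0, sep), rule.substr(sep + 1)}] = static_cast<int>(rank);
    }
    return t;
}
```

Since token IDs are dense and start at zero, the reverse direction fits a plain vector, while the string-to-ID direction needs a hash map.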
The integer token IDs produced here are placed into llama_batch.token[]. During the forward pass, each token ID is used as an index into the tok_embd weight matrix to retrieve that token's learned embedding vector — the first step of the transformer computation.
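The embedding lookup is a row gather, which the sketch below illustrates on a small row-major table. `ToyEmbedding` and `toy_embed` are hypothetical names, and the real weights are typically quantized tensors rather than a flat float vector.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Row-major toy embedding table: n_vocab rows of n_embd floats each.
struct ToyEmbedding {
    std::size_t n_vocab;
    std::size_t n_embd;
    std::vector<float> data;  // size must be n_vocab * n_embd
};

// Gather one row per token ID; the result holds tokens.size() rows
// laid out back to back.
std::vector<float> toy_embed(const ToyEmbedding & emb, const std::vector<int> & tokens) {
    std::vector<float> out;
    out.reserve(tokens.size() * emb.n_embd);
    for (int id : tokens) {
        if (id < 0 || static_cast<std::size_t>(id) >= emb.n_vocab) {
            throw std::out_of_range("token id outside vocabulary");
        }
        const float * row = emb.data.data() + static_cast<std::size_t>(id) * emb.n_embd;
        out.insert(out.end(), row, row + emb.n_embd);
    }
    return out;
}
```

The bounds check reflects why tokenizer and weights must come from the same model: an ID from a different vocabulary would index the wrong row or fall outside the table.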