GGML Backends

How the scheduler maps computation graph nodes to CPU, CUDA, Metal, and Vulkan device kernels.

ggml-backend.cpp ggml-backend.h ggml-cuda/ ggml-metal/

Available Backends

CPU
x86-64 / ARM / WASM
Multi-threaded with SIMD kernels. Supports AVX2, AVX-512 (x86), NEON, SVE (ARM). Always available as fallback.
ggml/src/ggml-cpu/
CUDA
NVIDIA GPUs
Primary GPU backend. Highly optimized CUDA kernels for all major ops. Supports cuBLAS for gemm.
ggml/src/ggml-cuda/
Metal
Apple Silicon / macOS
Apple GPU backend via Metal API. Excellent performance on M-series chips. Supports all quantization types.
ggml/src/ggml-metal/
Vulkan
AMD / Intel / cross-platform
Cross-platform GPU compute. Good for AMD GPUs on Linux/Windows where ROCm isn't available.
ggml/src/ggml-vulkan/
ROCm (HIP)
AMD GPUs
Uses the same CUDA codebase compiled with HIP. Close to CUDA performance on RDNA3.
ggml/src/ggml-cuda/ (HIP build)
SYCL
Intel Arc / oneAPI
Intel GPU acceleration via SYCL + oneMKL. Targets Intel Arc discrete and integrated GPUs.
ggml/src/ggml-sycl/

Backend Abstraction Layer

ggml_backend — the device interface

Each backend implements a common interface via vtable-like function pointers. The rest of llama.cpp never calls CUDA or Metal directly — it calls through this interface.

// ggml/include/ggml-backend.h

// A backend = a device + its allocator + its compute engine
typedef struct ggml_backend * ggml_backend_t;

// Core operations every backend implements:
struct ggml_backend_i {
    const char * (*get_name)(ggml_backend_t backend);

    // Allocate a tensor's data on this device:
    void (*buffer_alloc)(ggml_backend_buffer_t buf, struct ggml_tensor * t);

    // Execute a computation graph on this backend:
    enum ggml_status (*graph_compute)(
        ggml_backend_t backend,
        struct ggml_cgraph * cgraph);

    // Async copy: host ↔ device
    void (*tensor_set)(ggml_backend_t, struct ggml_tensor *, const void * data, size_t offset, size_t size);
    void (*tensor_get)(ggml_backend_t, const struct ggml_tensor *, void * data, size_t offset, size_t size);

    // Synchronize (wait for all async ops to complete):
    void (*synchronize)(ggml_backend_t backend);
};
ggml_backend_sched — the multi-device scheduler

The scheduler is the key piece enabling split inference across devices. It inspects the full graph, decides which device handles each node, allocates memory on the right device, and inserts copy nodes where tensors cross device boundaries.

// ggml/src/ggml-backend.cpp

// Create scheduler with list of backends (in priority order):
ggml_backend_t backends[] = { cuda_backend, cpu_backend };
ggml_backend_sched_t sched =
    ggml_backend_sched_new(backends, /*n_backends=*/2, max_nodes);

// Assign: the scheduler calls each backend's can_compute(node) function.
// CUDA backend takes: MUL_MAT, ROPE, SOFT_MAX, NORM, SILU, ADD, ...
// CPU backend takes: ops not supported by GPU, or ops on CPU-only tensors.

// Execute the full graph:
ggml_backend_sched_graph_compute(sched, cgraph);
ggml_cgraph (all nodes) │ ▼ ggml_backend_sched ┌────────────────────────────────────────────┐ │ For each node: │ │ if CUDA backend can_compute(node): │ │ assign → CUDA │ │ else: │ │ assign → CPU │ │ │ │ For tensors that cross device boundary: │ │ insert ggml_cpy node (device transfer) │ └────────────────────────────────────────────┘ │ ┌────┴────┐ │ │ CUDA CPU kernels threads

GPU Layer Offloading

n_gpu_layers — partial model on GPU

Models too large to fit in VRAM can be split: the first N layers go to GPU, the rest stay on CPU. This is controlled by n_gpu_layers in llama_model_params.

// include/llama.h#L291
struct llama_model_params {
    int32_t n_gpu_layers;  // how many transformer layers to offload to GPU
                           // -1 = all, 0 = CPU only

    enum llama_split_mode split_mode;
    // LLAMA_SPLIT_MODE_NONE  = single GPU
    // LLAMA_SPLIT_MODE_LAYER = split layers across GPUs
    // LLAMA_SPLIT_MODE_ROW   = split individual tensors row-wise
};

// At model load, each layer's tensors are allocated on the assigned device:
// layers[0..n_gpu_layers-1]  → CUDA/Metal buffer
// layers[n_gpu_layers..end]  → CPU buffer
// tok_embd, output           → follow last GPU or CPU

During a forward pass with split layers, the scheduler automatically inserts CPU↔GPU transfer nodes at layer boundaries. The hidden state tensor travels GPU→CPU→GPU as it crosses from offloaded to non-offloaded layers.

Multi-GPU splitting

// LLAMA_SPLIT_MODE_LAYER: round-robin across GPUs
// GPU 0: layers 0..15
// GPU 1: layers 16..31

// LLAMA_SPLIT_MODE_ROW: each GPU holds part of every weight matrix
// GPU 0: W[:n_embd/2, :]
// GPU 1: W[n_embd/2:, :]
// Requires NCCL for all-reduce across GPUs

Quantization Types

llama.cpp stores model weights in quantized formats to reduce memory and increase throughput. The backend handles dequantization on-the-fly during matrix multiplication.

Supported quantization formats and their trade-offs
TypeBits/weightSize vs F32Notes
F3232Full precision, largest
F16160.5×Standard GPU inference dtype
BF16160.5×Better range, same size as F16
Q8_080.25×Near-lossless, good for KV cache
Q4_K_M~4.5~0.14×Best quality/size for most models
Q4_040.125×Faster, slightly lower quality
Q3_K_M~3.4~0.11×Very small, some quality loss
Q2_K~2.5~0.08×Extreme compression, noticeable quality drop
IQ4_XS~4.25~0.13×Importance-matrix quant, better than Q4_0

K-quants (Q4_K, Q5_K, Q6_K): group weights into blocks of 256, store a per-block scale + min value. Better quality than uniform quantization at same bit width.

// ggml/include/ggml.h#L389 — quantization types enum
enum ggml_type {
    GGML_TYPE_F32  = 0,
    GGML_TYPE_F16  = 1,
    GGML_TYPE_Q4_0 = 2,
    GGML_TYPE_Q4_1 = 3,
    GGML_TYPE_Q5_0 = 6,
    GGML_TYPE_Q5_1 = 7,
    GGML_TYPE_Q8_0 = 8,
    // ... K-quants:
    GGML_TYPE_Q2_K = 10,
    GGML_TYPE_Q3_K = 11,
    GGML_TYPE_Q4_K = 12,
    GGML_TYPE_Q5_K = 13,
    GGML_TYPE_Q6_K = 14,
    // imatrix quants:
    GGML_TYPE_IQ4_XS = 22,
    // ...
};

CPU Backend: Threading and SIMD

How CPU kernels parallelize matrix multiplication

The CPU backend uses a threadpool. Each ggml_mul_mat is split into row blocks, one per thread. Within each thread, SIMD intrinsics handle the inner loop.

// ggml/src/ggml-cpu/ contains type-specific kernels:
//
// ggml_vec_dot_q4_0_q8_0(): dot product of Q4_0 × Q8_0 vectors
//   → depack 4-bit weights to 8-bit
//   → multiply-accumulate with 8-bit inputs
//   → uses AVX2: _mm256_maddubs_epi16, _mm256_madd_epi16
//
// ggml_vec_dot_f32(): F32 dot product
//   → AVX2: _mm256_fmadd_ps
//   → ARM NEON: vfmaq_f32

// Thread count set in llama_context_params.n_threads
// (default: physical core count)

// For prompt processing (large batch): n_threads_batch is used
// For single-token generation: n_threads is used
// Recommendation: n_threads = physical cores, n_threads_batch = all cores

CUDA Backend: Key Kernels

How CUDA accelerates the main operations
// ggml/src/ggml-cuda/ — major kernel files:
//
// mmq.cu         — quantized matrix multiplication (MUL_MAT with Q types)
// mul_mat_vec.cu — matrix-vector product (single token generation)
// softmax.cu     — attention softmax with causal mask
// rope.cu        — rotary position embedding
// norm.cu        — RMS normalization
// fattn/         — Flash Attention kernel variants
//   fattn-vec.cuh  — vector attention (prefill with small batch)
//   fattn-tile.cuh — tiled attention (large batch prefill)

// Key insight: llama.cpp uses different kernels for:
// PREFILL (processing full prompt):
//   batch size = n_tokens (e.g. 512)
//   uses large GEMM (batched matmul) kernels → high GPU utilization
//
// GENERATION (one new token per step):
//   batch size = 1 (or small with parallel requests)
//   uses GEMV (matrix-vector) kernels → bandwidth bound, not compute bound

Metal Backend (Apple Silicon)

How Apple GPU kernels work
// ggml/src/ggml-metal/ggml-metal.m — Objective-C Metal interface
// ggml/src/ggml-metal/ggml-metal.metal — MSL shader kernels

// Metal shaders for key operations:
// kernel_mul_mat_q4_0_f32 — Q4_0 × F32 matmul
// kernel_mul_mat_f16_f32  — F16 × F32 matmul
// kernel_soft_max         — causal softmax
// kernel_rope_norm        — RoPE with normalization
// kernel_rms_norm         — RMS normalization

// Apple Silicon advantage: unified memory
// CPU and GPU share the same physical RAM → no PCIe transfer overhead
// Model weights loaded once into memory, GPU accesses them directly

// Flash Attention on Metal:
// kernel_flash_attn_ext_f16 — tiled flash attention kernel
// Very efficient on M-series due to large GPU L2 cache

Memory Layout and mmap

How model weights are loaded efficiently

llama.cpp memory-maps (mmap) the GGUF file by default. This means weight tensors point directly into the mapped file pages — no upfront copy into RAM. Pages are loaded on demand by the OS as the GPU reads them.

// src/llama-model.cpp — model loading

// Default: use_mmap=true
// Weight tensor data pointers → mmap'd file pages
// OS loads pages lazily when GPU DMA-reads them

// When use_mmap=false:
// All tensors explicitly copied to heap
// Slower startup but avoids page-fault latency during inference

// For GPU offloading:
// CPU-resident tensors: mmap pages (or heap copy)
// GPU-resident tensors: allocated in VRAM, weights copied during load
//   (ggml_backend_tensor_set copies from host mmap → device VRAM)

// Memory pinning: for faster host→GPU transfers, CPU buffers
// can be pinned (CUDA: cudaHostRegister) to enable DMA

After device kernels execute and logits are on CPU, sampling selects the next token.

→ Sampling Pipeline ← Computation Graph