tinygrad — Runtime & Execution

run_linear — the execution entry point engine/realize.py:252

Called at the end of Tensor.realize(). Takes the LINEAR UOp produced by the scheduler and executes each CALL node against the device.

def run_linear(
  linear: UOp,                        # LINEAR with ordered CALLs
  var_vals: dict[str, int] | None,    # symbolic variable values
  input_uops: tuple[UOp, ...] = (),   # externally provided buffers
  update_stats: bool = True,
  jit: bool = False,
  wait: bool = False,
):
  ctx = ExecContext(var_vals or {}, ...)
  for call in linear.src:
    pm_exec.rewrite(call, ctx)       # dispatch by CALL type

engine/realize.py:252 — run_linear

compile_linear — deferred compilation engine/realize.py:246

Compiles all kernels in the LINEAR plan without executing them. Used by the JIT to pre-warm the kernel cache.

1. Apply pm_compile pattern matcher

Walks the LINEAR and compiles any CALL nodes that contain un-compiled PROGRAM ASTs via do_to_program(). Converts SINK ASTs to BINARY UOps.

2. BEAM search (optional)

When BEAM>0, compile many tiling candidates and benchmark them. Keep the fastest. The winner's compiled binary is cached.

3. Validate (optional)

When VALIDATE_WITH_CPU=1, run the same computation on CPU and compare outputs. Useful for debugging backend correctness.

engine/realize.py:246 — compile_linear
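
The "benchmark candidates and keep the fastest" step can be sketched in plain Python. This is a toy illustration only: pick_fastest, the candidate names, and the timing table are hypothetical and not part of tinygrad, and the sketch does a single selection round rather than a full beam search with iterative expansion.

```python
from typing import Callable, Sequence


def pick_fastest(candidates: Sequence[str],
                 benchmark: Callable[[str], float]) -> tuple[str, float]:
    """Time every candidate once and return (fastest_candidate, its_time)."""
    if not candidates:
        raise ValueError("no candidates to benchmark")
    best, best_time = candidates[0], benchmark(candidates[0])
    for cand in candidates[1:]:
        t = benchmark(cand)
        if t < best_time:  # strictly faster wins; ties keep the earlier candidate
            best, best_time = cand, t
    return best, best_time


# Hypothetical timings in seconds, standing in for real kernel runs.
fake_timings = {"tile_8x8": 3.0e-4, "tile_16x16": 1.2e-4, "tile_32x8": 1.9e-4}

compiled_cache: dict[str, str] = {}  # kernel name -> winning candidate
winner, seconds = pick_fastest(list(fake_timings), fake_timings.__getitem__)
compiled_cache["matmul"] = winner    # only the winner is kept
```

Passing the benchmark function in as a parameter keeps the selection logic testable without a device; a real run would time actual compiled kernels instead of reading a table.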

pm_exec — execution dispatch (pattern matcher) engine/realize.py:233

Each CALL node is dispatched to the appropriate executor via a PatternMatcher. The CALL type determines which function handles it.

exec_kernel — run a compiled GPU/CPU kernel

Resolves buffer arguments to concrete Buffer objects, looks up or creates the kernel runtime via get_runtime(), then calls runtime(bufs, var_vals, global_size, local_size).

engine/realize.py:170 — exec_kernel

exec_copy — buffer data transfer

Handles host→device uploads, device→host downloads, and device→device copies. Dispatches to allocator's copyin()/copyout()/_transfer().

engine/realize.py:156 — exec_copy

exec_view — buffer alias/slice

Creates a BUFFER_VIEW — a zero-copy view into an existing buffer at an offset. Used for slices and tensor views.

engine/realize.py:149 — exec_view

exec_graph — batched kernel graph

Submits multiple kernels as a single batched command graph (CUDA Graph, Metal command buffer, HCQ graph). Reduces per-kernel CPU overhead for training loops.

engine/realize.py:200 — exec_graph

Buffer — device memory abstraction device.py:99

Every tensor's data lives in a Buffer. Allocation is lazy — the underlying device memory is only allocated on first use.

device              Device string this buffer lives on ("CUDA:0", "CPU", etc.)
size                Number of elements
dtype               Element type
allocate()          Trigger actual device memory allocation via the device's allocator
copyin(mv)          Copy data from a Python memoryview into device memory
copyout(mv)         Copy device memory back to a Python memoryview
view(offset, size)  Create a zero-copy alias starting at the given element offset
ref(count)          Adjust reference count for kernel-held buffers (prevents premature free)

device.py:99 — class Buffer
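
Lazy allocation and zero-copy views can be illustrated with a small host-memory stand-in. ToyBuffer is a hypothetical class written for this sketch; it borrows the method names listed above but does not reproduce tinygrad's Buffer, and its "device memory" is just a bytearray.

```python
class ToyBuffer:
    """Toy stand-in for a device buffer; the bytes live in host memory."""

    def __init__(self, size: int, itemsize: int = 4,
                 base: "ToyBuffer | None" = None, offset: int = 0):
        self.size, self.itemsize = size, itemsize
        self.base, self.offset = base, offset   # offset is counted in elements
        self._mem: bytearray | None = None      # nothing is allocated yet

    @property
    def nbytes(self) -> int:
        return self.size * self.itemsize

    def _root(self) -> "ToyBuffer":
        return self.base if self.base is not None else self

    def is_allocated(self) -> bool:
        return self._root()._mem is not None

    def allocate(self) -> "ToyBuffer":
        root = self._root()
        if root._mem is None:                   # allocate on first use only
            root._mem = bytearray(root.nbytes)
        return self

    def _window(self) -> memoryview:
        self.allocate()
        start = self.offset * self.itemsize
        return memoryview(self._root()._mem)[start:start + self.nbytes]

    def copyin(self, mv: memoryview) -> None:
        if mv.nbytes != self.nbytes:
            raise ValueError("size mismatch")
        self._window()[:] = bytes(mv)

    def copyout(self, mv: memoryview) -> memoryview:
        if mv.nbytes != self.nbytes:
            raise ValueError("size mismatch")
        mv.cast("B")[:] = self._window()
        return mv

    def view(self, offset: int, size: int) -> "ToyBuffer":
        if offset < 0 or size < 0 or offset + size > self.size:
            raise ValueError("view out of bounds")
        # The view shares the root's storage, so no bytes are copied.
        return ToyBuffer(size, self.itemsize, base=self._root(),
                         offset=self.offset + offset)


buf = ToyBuffer(size=4, itemsize=1)
allocated_before = buf.is_allocated()           # still lazy at this point
buf.copyin(memoryview(b"\x01\x02\x03\x04"))     # first use triggers allocation
tail = buf.view(offset=2, size=2)
tail_bytes = bytes(tail.copyout(memoryview(bytearray(2))))
tail.copyin(memoryview(b"\x09\x09"))            # a write through the view...
full_bytes = bytes(buf.copyout(memoryview(bytearray(4))))  # ...shows in the parent
```

Because a view keeps a reference to the root buffer and an element offset, writes through the view are visible in the parent, which is the behavior "zero-copy alias" implies.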

Device singleton & Compiled class device.py:15, 287

Device["CUDA"] returns the singleton for that backend, which provides the allocator, compiler, and renderer triplet.

Device[name]    Returns cached device instance, loading the runtime module on first access
Device.DEFAULT  Current default device string from env or first available
allocator       Device-specific memory allocator (e.g. CUDAAllocator, MetalAllocator)
compiler        Code compiler for this device (nvrtc, clang, Metal, etc.)
renderer        Source code renderer for this device
runtime(prg)    Creates a callable kernel from compiled bytes
graph()         Creates a batched graph executor for JIT training loops

device.py:15 — class _Device device.py:287 — class Compiled
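
The "load on first access, then return the cached instance" behavior of the Device lookup can be sketched with a small registry. ToyDevice and ToyRegistry are hypothetical names for this illustration; the real _Device class also resolves the default device and imports backend modules, none of which is modeled here.

```python
from typing import Callable


class ToyDevice:
    """Hypothetical backend object bundling allocator, compiler and renderer."""

    def __init__(self, name: str):
        self.name = name
        self.allocator = f"{name}Allocator"
        self.compiler = f"{name}Compiler"
        self.renderer = f"{name}Renderer"


class ToyRegistry:
    def __init__(self, factory: Callable[[str], ToyDevice]):
        self._factory = factory
        self._opened: dict[str, ToyDevice] = {}
        self.loads = 0                     # how many backends were actually built

    def __getitem__(self, name: str) -> ToyDevice:
        dev = self._opened.get(name)
        if dev is None:                    # first access: build and remember
            self.loads += 1
            dev = self._opened[name] = self._factory(name)
        return dev


registry = ToyRegistry(ToyDevice)
first, second = registry["CUDA"], registry["CUDA"]   # second lookup hits the cache
cpu = registry["CPU"]
```

Indexing with `registry[name]` rather than calling a constructor makes it clear to the caller that repeated lookups are cheap and return the same object.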

Runtime backends runtime/ops_*.py

CUDA

runtime/ops_cuda.py

nvrtc JIT compiler, CUDAAllocator, CUDA stream management.

Metal (Apple)

runtime/ops_metal.py

Metal command queues, MTLBuffer allocation, MSL compilation.

AMD (HIP/HSA)

runtime/ops_amd.py

HCQ-based direct HSA queue submission. Low-overhead GPU dispatch.

NV (low-level)

runtime/ops_nv.py

Direct NVIDIA HCQ (hardware command queue) submission without CUDA driver overhead.

CPU (clang)

runtime/ops_cpu.py

Clang JIT compiler, mmap allocator, shared library loading and calling.

OpenCL

runtime/ops_opencl.py

OpenCL platform/device enumeration, cl_mem allocation, kernel build and enqueue.

WebGPU

runtime/ops_webgpu.py

Browser-native GPU via the WebGPU API. Runs in Node or browser environments.

DISK

runtime/ops_disk.py

Memory-mapped file buffers. Used for loading model weights directly from disk.

Stats tracking & profiling engine/realize.py:49

Every kernel execution updates global counters and can emit profiling events.

GlobalCounters  Tracks kernel count, total FLOPS, memory bandwidth, and wall time across all executions
track_stats()   Records per-kernel execution time, op count, memory bytes
ProfileEvent    Emitted for each kernel/copy; used by the profiler for timeline visualization
estimate_uop()  Static analysis of a CALL to estimate FLOPS and memory before running

engine/realize.py:49 — track_stats engine/realize.py:38 — estimate_uop
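
As a rough sketch of how a static estimate feeds the counters: the snippet below estimates ops and bytes for an elementwise kernel, then accumulates them with a measured time to derive throughput. ToyCounters and estimate_elementwise are hypothetical helpers for illustration; the field names and the cost model are assumptions, not tinygrad's GlobalCounters or estimate_uop.

```python
from dataclasses import dataclass


def estimate_elementwise(n: int, itemsize: int, n_inputs: int) -> tuple[int, int]:
    """Return (ops, bytes moved) assuming one op per output element."""
    ops = n
    mem_bytes = (n_inputs + 1) * n * itemsize   # read every input, write one output
    return ops, mem_bytes


@dataclass
class ToyCounters:
    kernel_count: int = 0
    total_ops: int = 0
    total_mem: int = 0
    time_sum_s: float = 0.0

    def track(self, ops: int, mem_bytes: int, seconds: float) -> tuple[float, float]:
        """Accumulate one kernel run and return (GFLOPS, GB/s) for that run."""
        self.kernel_count += 1
        self.total_ops += ops
        self.total_mem += mem_bytes
        self.time_sum_s += seconds
        return ops / seconds / 1e9, mem_bytes / seconds / 1e9


counters = ToyCounters()
# Adding two float32 vectors of one million elements, timed at 0.5 s (made-up number).
ops, mem_bytes = estimate_elementwise(n=1_000_000, itemsize=4, n_inputs=2)
gflops, gbps = counters.track(ops, mem_bytes, seconds=0.5)
```

The estimate is computed before the kernel runs, while the time is only known afterwards, so the two are combined at tracking time.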

Caching layers summary multi-level cache

Level 1: UOp deduplication
  UOpMetaClass caches UOp instances by (op, dtype, src, arg)
  → same computation graph node returned if already exists

Level 2: Schedule cache (SCACHE=1)
  schedule_cache[uop_key] = linear_uop
  → skip re-scheduling for identical tensor graphs

Level 3: Kernel source cache
  Compiler.compile_cached(source) hashes source string
  → skip recompiling identical kernels within a process

Level 4: Disk cache (~/.cache/tinygrad/)
  Compiled binaries persisted to disk
  → survive process restarts; keyed on source hash + device

Level 5: Runtime cache
  runtime_cache[(key, device)] = compiled_kernel_callable
  → skip kernel loading overhead on repeated runs

Level 6: Graph cache (JIT)
  graph_cache caches batched CUDA Graph / Metal command buffer
  → near-zero CPU overhead per training step
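
Level 1 is the least familiar of these layers, so here is a minimal sketch of instance deduplication through a metaclass. InternMeta and Node are hypothetical names; the key follows the (op, dtype, src, arg) tuple mentioned above, but nothing else is taken from tinygrad's UOpMetaClass. This sketch holds strong references, so nodes are never freed; a long-running cache would more likely hold weak ones.

```python
class InternMeta(type):
    """Metaclass that returns an existing instance for an already-seen key."""
    _cache: dict = {}

    def __call__(cls, op, dtype, src=(), arg=None):
        src = tuple(src)
        key = (cls, op, dtype, src, arg)
        node = InternMeta._cache.get(key)
        if node is None:                     # first time this node is requested
            node = super().__call__(op, dtype, src, arg)
            InternMeta._cache[key] = node
        return node


class Node(metaclass=InternMeta):
    __slots__ = ("op", "dtype", "src", "arg")

    def __init__(self, op, dtype, src, arg):
        self.op, self.dtype, self.src, self.arg = op, dtype, src, arg


one_a = Node("CONST", "int", (), 1)
one_b = Node("CONST", "int", (), 1)          # same key, so the same object comes back
two = Node("CONST", "int", (), 2)
add_a = Node("ADD", "int", (one_a, two))
add_b = Node("ADD", "int", (one_b, two))     # children are interned, so parents match too
```

Since children are themselves deduplicated, comparing nodes by identity is enough to compare whole subgraphs, which is what makes keying later caches on a graph node cheap.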