run_linear — the execution entry point
engine/realize.py:252
Called at the end of Tensor.realize(). Takes the LINEAR UOp produced by the scheduler and executes each CALL node against the device.
def run_linear(
    linear: UOp,                        # LINEAR with ordered CALLs
    var_vals: dict[str, int] | None,    # symbolic variable values
    input_uops: tuple[UOp, ...] = (),   # externally-provided buffers
    update_stats: bool = True,
    jit: bool = False,
    wait: bool = False,
):
    ctx = ExecContext(var_vals or {}, ...)
    for call in linear.src:
        pm_exec.rewrite(call, ctx)      # dispatch by CALL type
engine/realize.py:252 — run_linear
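The loop above can be mimicked with a small stand-alone model. This is a toy sketch, not tinygrad code: ToyCall, ToyContext, HANDLERS, and toy_run_linear are hypothetical stand-ins for CALL UOps, ExecContext, pm_exec, and run_linear. It only illustrates ordered execution with dispatch by call kind and lookup of symbolic values at run time.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ToyCall:
    kind: str            # hypothetical tag, e.g. "kernel" or "copy"
    payload: tuple = ()

@dataclass
class ToyContext:
    var_vals: dict       # symbolic variable name -> concrete int
    log: list = field(default_factory=list)

def _run_kernel(call, ctx):
    # The symbolic size is resolved from var_vals only at execution time.
    name, size_var = call.payload
    ctx.log.append(f"kernel {name} n={ctx.var_vals[size_var]}")

def _run_copy(call, ctx):
    src, dst = call.payload
    ctx.log.append(f"copy {src}->{dst}")

# Hypothetical dispatch table standing in for the pattern matcher.
HANDLERS = {"kernel": _run_kernel, "copy": _run_copy}

def toy_run_linear(calls, var_vals=None):
    ctx = ToyContext(var_vals or {})
    for call in calls:                  # calls run strictly in plan order
        HANDLERS[call.kind](call, ctx)
    return ctx.log

plan = (ToyCall("copy", ("host", "dev0")), ToyCall("kernel", ("add", "N")))
log = toy_run_linear(plan, {"N": 1024})
```

Because the plan is already ordered, the executor needs no dependency tracking of its own; it walks the sequence once.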
compile_linear — deferred compilation
engine/realize.py:246
Compiles all kernels in the LINEAR plan without executing them. Used by the JIT to pre-warm the kernel cache.
1. Apply pm_compile pattern matcher
Walks the LINEAR and compiles any CALL nodes that contain un-compiled PROGRAM ASTs via do_to_program(). Converts SINK ASTs to BINARY UOps.
2. BEAM search (optional)
When BEAM>0, compile many tiling candidates and benchmark them. Keep the fastest. The winner's compiled binary is cached.
3. Validate (optional)
When VALIDATE_WITH_CPU=1, run the same computation on CPU and compare outputs. Useful for debugging backend correctness.
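The optional steps above follow a common pattern: try several candidates, keep the fastest, cache the winner, and cross-check results against a reference. The sketch below shows that pattern in isolation. Everything in it is hypothetical (toy_compile, beam_pick, outputs_match, and the fake cost table). A real search would time actual kernel runs rather than read costs from a dictionary.

```python
import hashlib

# Hypothetical in-memory cache: AST hash -> (winning tile, compiled "binary").
_beam_cache = {}

def toy_compile(ast, tile):
    # Stand-in for a real compiler: encode the tiling choice into bytes.
    return f"{ast}|tile={tile[0]}x{tile[1]}".encode()

def beam_pick(ast, candidates, measure):
    """Compile every candidate, measure each one, cache and return the fastest."""
    key = hashlib.sha256(ast.encode()).hexdigest()
    if key not in _beam_cache:
        timed = [(measure(toy_compile(ast, t)), t) for t in candidates]
        _, best = min(timed)
        _beam_cache[key] = (best, toy_compile(ast, best))
    return _beam_cache[key]

def outputs_match(device_out, cpu_out, tol=1e-6):
    """Validation step: compare device results with a CPU reference."""
    return len(device_out) == len(cpu_out) and all(
        abs(a - b) <= tol for a, b in zip(device_out, cpu_out))

# Deterministic fake cost model so the example is reproducible.
fake_cost = {(8, 8): 3.0, (16, 16): 1.5, (32, 8): 2.0}
measured = []

def fake_measure(binary):
    measured.append(binary)
    w, h = map(int, binary.decode().split("tile=")[1].split("x"))
    return fake_cost[(w, h)]

tile, binary = beam_pick("sink(add(a,b))", list(fake_cost), fake_measure)
tile_again, _ = beam_pick("sink(add(a,b))", list(fake_cost), fake_measure)
```

The second call returns the cached winner without measuring anything, which is the reason to cache the result of an expensive search.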
pm_exec — execution dispatch (pattern matcher)
engine/realize.py:233
Each CALL node is dispatched to the appropriate executor via a PatternMatcher. The CALL type determines which function handles it.
exec_kernel — run a compiled GPU/CPU kernel
Resolves buffer arguments to concrete Buffer objects, looks up or creates the kernel runtime via get_runtime(), then calls runtime(bufs, var_vals, global_size, local_size).
engine/realize.py:170 — exec_kernel
exec_copy — buffer data transfer
Handles host→device uploads, device→host downloads, and device→device copies. Dispatches to allocator's copyin()/copyout()/_transfer().
engine/realize.py:156 — exec_copy
exec_view — buffer alias/slice
Creates a BUFFER_VIEW, a zero-copy view into an existing buffer at an offset. Used for slices and tensor views.
engine/realize.py:149 — exec_view
exec_graph — batched kernel graph
Submits multiple kernels as a single batched command graph (CUDA Graph, Metal command buffer, HCQ graph). Reduces per-kernel CPU overhead for training loops.
engine/realize.py:200 — exec_graph
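The saving from batching can be illustrated by counting host-side submissions. The sketch below is a toy model under stated assumptions: ToyQueue, run_eagerly, and ToyGraph are hypothetical names, and a "submission" stands in for whatever per-launch work the driver does on the CPU.

```python
class ToyQueue:
    """Hypothetical device queue that counts host-side submissions."""
    def __init__(self):
        self.submissions = 0
        self.executed = []

    def submit(self, kernels):
        # One submission may carry any number of kernels.
        self.submissions += 1
        self.executed.extend(kernels)

def run_eagerly(queue, kernels):
    # One submission per kernel: CPU overhead grows with kernel count.
    for k in kernels:
        queue.submit([k])

class ToyGraph:
    """Captures a kernel sequence once, then replays it as one submission."""
    def __init__(self, kernels):
        self.kernels = tuple(kernels)

    def replay(self, queue):
        queue.submit(self.kernels)

kernels = ["matmul", "relu", "softmax"]
eager, batched = ToyQueue(), ToyQueue()

for _ in range(10):                 # ten "training steps", launched one by one
    run_eagerly(eager, kernels)

graph = ToyGraph(kernels)
for _ in range(10):                 # same ten steps, replayed as a graph
    graph.replay(batched)
```

Both queues execute the same 30 kernels, but the batched path needs one submission per step instead of one per kernel.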
Buffer — device memory abstraction
device.py:99
Every tensor's data lives in a Buffer. Allocation is lazy — the underlying device memory is only allocated on first use.
device
Device string this buffer lives on ("CUDA:0", "CPU", etc.)
size
Number of elements
dtype
Element type
allocate()
Trigger actual device memory allocation via the device's allocator
copyin(mv)
Copy data from a Python memoryview into device memory
copyout(mv)
Copy device memory back to a Python memoryview
view(offset, size)
Create a zero-copy alias starting at the given element offset
ref(count)
Adjust reference count for kernel-held buffers (prevents premature free)
device.py:99 — class Buffer
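Lazy allocation and zero-copy views are easier to see in a host-memory model. The sketch below is not the real Buffer class: ToyBuffer is a hypothetical stand-in backed by a bytearray, fixed to float32 elements, with method names borrowed from the list above for readability only.

```python
import struct

class ToyBuffer:
    """Hypothetical host-memory stand-in for a device buffer of float32 elements."""
    ITEMSIZE = 4

    def __init__(self, size, base=None, offset=0):
        self.size, self._base, self._offset = size, base, offset
        self._mem = None          # nothing is allocated at construction time
        self.refcount = 0

    def _root(self):
        return self if self._base is None else self._base

    def is_allocated(self):
        return self._root()._mem is not None

    def allocate(self):
        root = self._root()
        if root._mem is None:     # allocate backing storage on first use only
            root._mem = bytearray(root.size * self.ITEMSIZE)
        return self

    def _window(self):
        self.allocate()
        start = self._offset * self.ITEMSIZE
        return memoryview(self._root()._mem)[start:start + self.size * self.ITEMSIZE]

    def copyin(self, mv):
        self._window()[:] = memoryview(mv).cast("B")
        return self

    def copyout(self, mv):
        mv[:] = self._window()
        return mv

    def view(self, offset, size):
        # The view shares the root's storage; no bytes are copied.
        assert 0 <= offset and offset + size <= self.size
        return ToyBuffer(size, base=self._root(), offset=self._offset + offset)

    def ref(self, count):
        self.refcount += count
        return self.refcount

buf = ToyBuffer(4)
allocated_before = buf.is_allocated()
buf.copyin(struct.pack("4f", 1.0, 2.0, 3.0, 4.0))

tail = buf.view(offset=2, size=2)
tail_values = struct.unpack("2f", tail.copyout(memoryview(bytearray(8))))

tail.copyin(struct.pack("2f", 30.0, 40.0))       # write through the view
full = struct.unpack("4f", buf.copyout(memoryview(bytearray(16))))
```

Writing through the view changes the parent's contents, which is what zero-copy means here: both objects address the same bytes.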
Device singleton & Compiled class
device.py:15, 287
Device["CUDA"] returns the singleton for that backend, which provides the allocator, compiler, and renderer triplet.
Device[name]
Returns cached device instance, loading the runtime module on first access
Device.DEFAULT
Current default device string from env or first available
allocator
Device-specific memory allocator (e.g. CUDAAllocator, MetalAllocator)
compiler
Code compiler for this device (nvrtc, clang, Metal, etc.)
renderer
Source code renderer for this device
runtime(prg)
Creates a callable kernel from compiled bytes
graph()
Creates a batched graph executor for JIT training loops
device.py:15 — class _Device
device.py:287 — class Compiled
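The "cached instance, loaded on first access" behaviour can be sketched with a small registry. The names below (ToyBackend, ToyDeviceRegistry, the loaders mapping) are hypothetical, and upper-casing the name is an assumption made for the toy, not a statement about how tinygrad canonicalizes device strings.

```python
class ToyBackend:
    """Hypothetical bundle of the three per-device components."""
    def __init__(self, name):
        self.name = name
        self.allocator = f"{name}-allocator"
        self.compiler = f"{name}-compiler"
        self.renderer = f"{name}-renderer"

class ToyDeviceRegistry:
    def __init__(self, loaders):
        self._loaders = loaders     # backend name -> factory (stands in for a module import)
        self._opened = {}
        self.load_count = 0

    def __getitem__(self, name):
        key = name.upper()          # assumption: names are case-insensitive
        if key not in self._opened:
            # First access: pay the loading cost once, then reuse the instance.
            self.load_count += 1
            self._opened[key] = self._loaders[key](key)
        return self._opened[key]

Device = ToyDeviceRegistry({"CUDA": ToyBackend, "CPU": ToyBackend})

first = Device["CUDA"]
second = Device["cuda"]
cpu = Device["CPU"]
```

Indexing twice with the same name yields the same object, so state such as allocator pools and compiled-kernel caches is shared by every caller.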
Runtime backends
runtime/ops_*.py
CUDA
runtime/ops_cuda.py
nvrtc JIT compiler, CUDAAllocator, CUDA stream management.
Metal (Apple)
runtime/ops_metal.py
Metal command queues, MTLBuffer allocation, MSL compilation.
AMD (HIP/HSA)
runtime/ops_amd.py
HCQ-based direct HSA queue submission. Low-overhead GPU dispatch.
NV (low-level)
runtime/ops_nv.py
Direct NVIDIA HCQ (hardware command queue) submission without CUDA driver overhead.
CPU (clang)
runtime/ops_cpu.py
Clang JIT compiler, mmap allocator, shared library loading and calling.
OpenCL
runtime/ops_opencl.py
OpenCL platform/device enumeration, cl_mem allocation, kernel build and enqueue.
WebGPU
runtime/ops_webgpu.py
Browser-native GPU via the WebGPU API. Runs in Node or browser environments.
DISK
runtime/ops_disk.py
Memory-mapped file buffers. Used for loading model weights directly from disk.
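As an illustration of the memory-mapped approach used by the DISK backend, the sketch below reads a slice of float32 values from a file through Python's standard mmap module. It is independent of tinygrad: write_weights and read_slice are hypothetical helpers, and the flat little-endian float32 layout is an assumption made for the example.

```python
import mmap
import os
import struct
import tempfile

def write_weights(path, values):
    # Store values as a flat array of little-endian float32.
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}f", *values))

def read_slice(path, start, count):
    """Read `count` float32 values starting at element `start` via mmap."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the pages backing this range need to be read from disk.
            return list(struct.unpack_from(f"<{count}f", mm, start * 4))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.bin")
    write_weights(path, [float(i) for i in range(10)])
    middle = read_slice(path, start=3, count=4)
```

The file is never read into memory as a whole; the operating system pages in only the region that is touched, which matters when the file is a multi-gigabyte weight checkpoint.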
Stats tracking & profiling
engine/realize.py:49
Every kernel execution updates global counters and can emit profiling events.
GlobalCounters
Tracks kernel count, total operations (FLOPs), memory traffic in bytes, and wall time across all executions
track_stats()
Records per-kernel execution time, op count, memory bytes
ProfileEvent
Emitted for each kernel/copy; used by the profiler for timeline visualization
estimate_uop()
Static analysis of a CALL to estimate operation count (FLOPs) and memory traffic before running
engine/realize.py:49 — track_stats
engine/realize.py:38 — estimate_uop
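The split between a static estimate and a measured run can be sketched as follows. ToyCounters, estimate_elementwise, and track are hypothetical names, and the cost model (one operation per output element, each operand touched once) is a simplifying assumption for elementwise kernels only.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass

@dataclass
class ToyCounters:
    kernel_count: int = 0
    ops: int = 0
    mem_bytes: int = 0
    time_s: float = 0.0

def estimate_elementwise(n, n_inputs, itemsize=4):
    """Static estimate: n ops, and every input plus the output touched once."""
    return n, (n_inputs + 1) * n * itemsize

@contextmanager
def track(counters, ops, mem_bytes):
    start = time.perf_counter()
    try:
        yield
    finally:
        # Wall time is measured; ops and bytes come from the static estimate.
        counters.time_s += time.perf_counter() - start
        counters.kernel_count += 1
        counters.ops += ops
        counters.mem_bytes += mem_bytes

counters = ToyCounters()
a, b = list(range(1000)), list(range(1000))

ops, mem = estimate_elementwise(len(a), n_inputs=2)
with track(counters, ops, mem):
    c = [x + y for x, y in zip(a, b)]      # the "kernel" being measured
```

Dividing counters.ops or counters.mem_bytes by counters.time_s gives throughput figures of the kind a profiler reports per kernel.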
Caching layers summary
multi-level cache
Level 1: UOp deduplication
  UOpMetaClass caches UOp instances by (op, dtype, src, arg)
  → same computation graph node returned if already exists

Level 2: Schedule cache (SCACHE=1)
  schedule_cache[uop_key] = linear_uop
  → skip re-scheduling for identical tensor graphs

Level 3: Kernel source cache
  Compiler.compile_cached(source) hashes source string
  → skip recompiling identical kernels within a process

Level 4: Disk cache (~/.cache/tinygrad/)
  Compiled binaries persisted to disk
  → survive process restarts; keyed on source hash + device

Level 5: Runtime cache
  runtime_cache[(key, device)] = compiled_kernel_callable
  → skip kernel loading overhead on repeated runs

Level 6: Graph cache (JIT)
  graph_cache caches batched CUDA Graph / Metal command buffer
  → near-zero CPU overhead per training step
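Level 1 is a form of instance interning, which can be sketched with a metaclass and a weak-value dictionary. This is a minimal model under assumptions, not the real UOpMetaClass: InternMeta and ToyNode are hypothetical names, and the use of weakref.WeakValueDictionary is a design choice of this sketch so that unused nodes can still be garbage collected.

```python
import weakref

class InternMeta(type):
    """Return the existing instance when one with the same key is alive."""
    _cache = weakref.WeakValueDictionary()

    def __call__(cls, op, dtype=None, src=(), arg=None):
        key = (op, dtype, src, arg)
        node = InternMeta._cache.get(key)
        if node is None:
            node = super().__call__(op, dtype, src, arg)
            InternMeta._cache[key] = node
        return node

class ToyNode(metaclass=InternMeta):
    __slots__ = ("op", "dtype", "src", "arg", "__weakref__")

    def __init__(self, op, dtype, src, arg):
        self.op, self.dtype, self.src, self.arg = op, dtype, src, arg

a = ToyNode("CONST", "int", (), 2)
b = ToyNode("CONST", "int", (), 2)      # same key: the cached node is returned
other = ToyNode("CONST", "int", (), 3)

# Children are interned, so structurally equal parents share one key as well.
s1 = ToyNode("ADD", "int", (a, b))
s2 = ToyNode("ADD", "int", (b, a))
```

Since equal nodes are the same object, graph comparison reduces to an identity check, and later cache levels can key on the node itself.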