Pipeline
STAGE 1
Tensor & Operations
tensor.py · mixin/
User calls a + b, matmul(), sum(). Each op builds a node in a lazy DAG; no computation happens yet.
out: UOp DAG
↓
STAGE 2
UOps — IR
uop/ops.py · uop/__init__.py
The intermediate representation. Every operation (tensor math, memory, loops, and compiled programs) is a UOp(op, dtype, src, arg) node in a DAG. PatternMatcher rewrites transform the graph at every stage.
↓ Tensor.realize() / linear_with_vars()
STAGE 3
Scheduler
schedule/__init__.py
Transforms the tensor UOp graph into an ordered list of kernel invocations. Decides kernel fusion boundaries, allocates buffers, handles multi-device routing.
out: LINEAR(CALL, CALL, …)
↓
STAGE 4
Codegen
codegen/__init__.py
Lowers each kernel AST through a cascade of PatternMatcher passes: movement ops eliminated, loops inserted, vector ops expanded, device-specific lowering applied.
out: SINK with RANGE/LOAD/STORE
↓
STAGE 5
Renderer & Compiler
renderer/ · device.py
Linearized UOp list → device source string (C, PTX, MSL, WGSL, LLVM IR) → compiled binary bytes via the device compiler.
out: BINARY(bytes)
↓
STAGE 6
Runtime & Execution
engine/realize.py · runtime/ops_*.py
Executes the compiled plan: kernel dispatch, memory copies, buffer views. Multi-level caching ensures JIT training loops have near-zero CPU overhead.
actual FLOPS ✓
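Stage 2 describes every operation as a UOp(op, dtype, src, arg) node in a shared DAG. The following is a minimal, stdlib-only sketch of how such nodes can be interned so that identical subgraphs become the same object. `InternMeta` and `Node` are hypothetical names chosen for illustration; this is not tinygrad's UOpMetaClass, only a toy model of the same idea.

```python
class InternMeta(type):
    """Toy metaclass: constructing a node with a known key returns the cached instance."""
    _cache = {}

    def __call__(cls, op, dtype, src=(), arg=None):
        key = (op, dtype, src, arg)
        node = InternMeta._cache.get(key)
        if node is None:
            node = super().__call__(op, dtype, src, arg)
            InternMeta._cache[key] = node
        return node


class Node(metaclass=InternMeta):
    """Hypothetical stand-in for a UOp: an immutable-by-convention (op, dtype, src, arg) record."""
    __slots__ = ("op", "dtype", "src", "arg")

    def __init__(self, op, dtype, src=(), arg=None):
        self.op, self.dtype, self.src, self.arg = op, dtype, src, arg


# Two buffers with different args are distinct nodes.
a = Node("BUFFER", "float32", arg=0)
b = Node("BUFFER", "float32", arg=1)

# Building the same expression twice yields the very same object,
# so equality of whole subgraphs reduces to an identity check.
x = Node("ADD", "float32", (a, b))
y = Node("ADD", "float32", (a, b))
same_object = x is y
```

Because children are already interned, the default identity-based hash of each child is enough to make the parent's cache key structural.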
Annotated code flow
full trace
# 1. User code
from tinygrad import Tensor
a = Tensor.randn(1024, 1024) # UOp(BUFFER, shape=(1024,1024))
b = Tensor.randn(1024, 1024) # UOp(BUFFER, shape=(1024,1024))
c = (a @ b).relu() # DAG: RELU(REDUCE(MUL(EXPAND(a), EXPAND(b))))
# ↑ nothing computed — just graph nodes
# 2. Trigger execution
c.realize()
# calls: Tensor.linear_with_vars()
#   → create_linear_with_vars(big_sink)          [schedule/__init__.py:122]
#       → full graph rewrite (movement op fusion)
#       → create_schedule(sched_sink)            [schedule/__init__.py:21]
#           → topological sort of kernel dependencies
#           → returns LINEAR(CALL(kernel_ast, buf_a, buf_b, buf_c))
#       → returns (linear_uop, var_vals={})
#
# calls: run_linear(linear, var_vals)            [engine/realize.py:252]
#   → for each CALL in linear.src:
#       compile_linear if not cached
#         → do_to_program(ast, renderer)         [codegen/__init__.py:176]
#         → full_rewrite_to_sink(ast, ren)       [codegen/__init__.py:27]
#             phase 1: eliminate RESHAPE/EXPAND
#             phase 2: add RANGE loop bounds
#             phase 3: expand WMMA / tensor cores
#             phase 4: finalize loop bounds
#             phase 5: devectorize, lower index dtypes
#         → do_linearize(ren, prg, sink)         [codegen/__init__.py:132]
#             toposort → flat list[UOp]
#         → do_render(ren, prg, lin)             [codegen/__init__.py:155]
#             renderer.render(uops) → source str
#         → do_compile(ren, prg, source)         [codegen/__init__.py:160]
#             compiler.compile_cached(src) → bytes
#       exec_kernel(ctx, call, ast)              [engine/realize.py:170]
#         → get_runtime(device, ast)
#         → runtime(bufs, var_vals, global_sz, local_sz)
#         → GPU kernel runs ← ACTUAL COMPUTATION HERE
tensor.py:241 — realize()
schedule/__init__.py:122 — create_linear_with_vars
codegen/__init__.py:176 — do_to_program
engine/realize.py:252 — run_linear
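The trace above shows that building `c` only records graph nodes and that work starts at `realize()`. The deferred-execution idea can be sketched in a few lines of plain Python. `Lazy` is a hypothetical class for illustration; it evaluates node by node and does none of the scheduling, fusion, or codegen that the real pipeline performs.

```python
import operator


class Lazy:
    """Toy lazy value: records an operation and its inputs instead of running it."""
    evaluations = 0  # how many nodes have actually been computed so far

    def __init__(self, fn, *src):
        self.fn = fn
        self.src = src
        self._value = None  # filled in on first realize()

    @staticmethod
    def const(value):
        return Lazy(lambda: value)

    def __add__(self, other):
        return Lazy(operator.add, self, other)

    def __mul__(self, other):
        return Lazy(operator.mul, self, other)

    def realize(self):
        # Compute inputs first, then this node; cache the result for reuse.
        if self._value is None:
            args = [s.realize() for s in self.src]
            self._value = self.fn(*args)
            Lazy.evaluations += 1
        return self._value


a, b = Lazy.const(2.0), Lazy.const(3.0)
c = (a + b) * a               # four graph nodes exist, none evaluated
pending = Lazy.evaluations    # still 0 at this point
result = c.realize()          # walks the graph and computes (2 + 3) * 2
```

A second call to `c.realize()` returns the cached value without recomputing any node.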
Key files quick reference
index
tinygrad/tensor.py
Tensor class, realize(), _apply_uop()
tinygrad/mixin/elementwise.py
add, mul, exp2, log2, cast, where…
tinygrad/mixin/movement.py
reshape, permute, expand, pad, shrink…
tinygrad/mixin/reduce.py
sum, max, min, mean, argmax…
tinygrad/uop/__init__.py
Ops enum — all 80+ operation codes
tinygrad/uop/ops.py
UOp class, PatternMatcher, UPat, graph_rewrite
tinygrad/schedule/__init__.py
create_linear_with_vars, create_schedule
tinygrad/codegen/__init__.py
do_to_program, full_rewrite_to_sink, do_render, do_compile
tinygrad/codegen/late/linearizer.py
linearize(): DAG → ordered flat list[UOp]
tinygrad/renderer/__init__.py
Renderer base class, TensorCore config
tinygrad/renderer/cstyle.py
C/OpenCL source generation
tinygrad/renderer/ptx.py
NVIDIA PTX assembly generation
tinygrad/engine/realize.py
run_linear, compile_linear, exec_kernel, pm_exec
tinygrad/device.py
Device, Buffer, Allocator, Compiler, Compiled
Core design patterns
architecture
Lazy evaluation
Tensors build a UOp DAG without computing. .realize() is the single trigger that flushes everything. This enables cross-op fusion decisions to be made globally.
Graph rewrites (PatternMatcher)
Every optimization and lowering pass is expressed as declarative pattern → replacement rules. The same mechanism is used from early Tensor-level rewrites all the way to device-specific instruction selection.
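To make the "pattern → replacement" idea concrete, here is a small sketch of a rule-driven rewriter over tuple expressions. The tuple encoding, the `?x` wildcard syntax, and the names `match`, `substitute`, `rewrite`, and `RULES` are all assumptions made for this example; tinygrad's PatternMatcher and UPat have their own API, which this does not reproduce.

```python
def match(pat, node, env):
    """Return wildcard bindings if node matches pat, else None. '?name' is a wildcard."""
    if isinstance(pat, str) and pat.startswith("?"):
        if pat in env:
            return env if env[pat] == node else None
        return {**env, pat: node}
    if isinstance(pat, tuple) and isinstance(node, tuple) and len(pat) == len(node):
        for p, n in zip(pat, node):
            env = match(p, n, env)
            if env is None:
                return None
        return env
    return env if pat == node else None


def substitute(template, env):
    """Fill the wildcards of a replacement template with their bound subtrees."""
    if isinstance(template, str) and template.startswith("?"):
        return env[template]
    if isinstance(template, tuple):
        return tuple(substitute(t, env) for t in template)
    return template


# Declarative rules: (pattern, replacement).
RULES = [
    (("MUL", "?x", ("CONST", 1)), "?x"),           # x * 1 -> x
    (("ADD", "?x", ("CONST", 0)), "?x"),           # x + 0 -> x
    (("MUL", "?x", ("CONST", 0)), ("CONST", 0)),   # x * 0 -> 0
]


def rewrite(node):
    """Rewrite children first, then apply rules at this node until none fires."""
    if isinstance(node, tuple):
        node = tuple(rewrite(child) for child in node)
    for pat, repl in RULES:
        env = match(pat, node, {})
        if env is not None:
            return rewrite(substitute(repl, env))
    return node


expr = ("ADD", ("MUL", "a", ("CONST", 1)), ("MUL", "b", ("CONST", 0)))
simplified = rewrite(expr)  # a*1 + b*0 collapses to "a"
```

New passes in this style are added by appending rules rather than by writing new traversal code, which is the property the paragraph above attributes to the rewrite mechanism.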
UOp deduplication
UOpMetaClass caches instances by (op, dtype, src, arg). Two identical subgraphs always share the same nodes. This keeps the DAG compact and enables O(1) equality.
Symbolic shapes
Dimension sizes can be symbolic UOp expressions. Concrete values are only bound at runtime via sym_infer(var_vals). This allows one compiled kernel to handle multiple tensor sizes.
Renderer abstraction
All device-specific code lives in a Renderer subclass. Codegen only knows abstract UOps; it calls renderer.render(uops) which maps them to device strings. Adding a new backend means writing one Renderer.
Multi-level caching
Six independent cache layers (UOp, schedule, source, disk, runtime, graph) ensure that repeated forward passes — as in training — approach zero CPU overhead after the first iteration.
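Two of the patterns above combine naturally: a kernel keyed by a symbolic shape can be compiled once and reused while the concrete size changes between calls. The sketch below illustrates that interaction under simplifying assumptions. `sym_eval`, `get_kernel`, and `stats` are hypothetical names; the "kernel" is an ordinary Python closure, and the expression format is invented for this example rather than taken from tinygrad's sym_infer.

```python
stats = {"compiles": 0}
_kernel_cache = {}


def sym_eval(expr, var_vals):
    """Evaluate a toy symbolic expression: an int, a variable name, or (op, lhs, rhs)."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return var_vals[expr]
    op, lhs, rhs = expr
    left, right = sym_eval(lhs, var_vals), sym_eval(rhs, var_vals)
    if op == "+":
        return left + right
    if op == "*":
        return left * right
    raise ValueError(f"unknown op {op!r}")


def get_kernel(shape):
    """'Compile' once per symbolic shape; the shape tuple itself is the cache key."""
    if shape not in _kernel_cache:
        stats["compiles"] += 1

        def kernel(data, var_vals):
            n = sym_eval(shape[0], var_vals)  # concrete size is bound only at call time
            return sum(data[:n])

        _kernel_cache[shape] = kernel
    return _kernel_cache[shape]


shape = (("*", "batch", 2),)   # one dimension of size batch * 2
data = list(range(10))
small = get_kernel(shape)(data, {"batch": 2})  # sums the first 4 elements
large = get_kernel(shape)(data, {"batch": 4})  # sums the first 8 elements, same kernel
```

Only the first `get_kernel` call does any "compilation"; the second is a dictionary lookup, which is the effect the cache layers are meant to have on repeated forward passes.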
UOp shape across the pipeline
transformation summary
Stage 1 — Tensor graph (movement ops present, high-level)
SINK
└─ REDUCE(ADD) ← MUL ← EXPAND ← RESHAPE ← BUFFER(a)
                     ← EXPAND ← RESHAPE ← BUFFER(b)
Stage 2 — After scheduling (movement ops fused/eliminated, CALL tree)
LINEAR(
CALL(PROGRAM(ast), buf_a, buf_b, buf_out)
)
Stage 3 — After codegen / full_rewrite_to_sink (explicit loops)
SINK(
  RANGE(0..M)                          ← outer loop
    RANGE(0..N)                        ← inner loop
      STORE(INDEX(buf_out, m*N+n),
            REDUCE(MUL(LOAD(a[m,k]),
                       LOAD(b[k,n]))))
    END
  END
)
Stage 4 — After linearize (flat list)
[DEFINE_VAR(m), RANGE(m, 0, M),
DEFINE_VAR(n), RANGE(n, 0, N),
LOAD(a[m,k]), LOAD(b[k,n]), MUL, STORE(out[m,n]),
END(n), END(m)]
Stage 5 — After render (device source string)
"__kernel void r_MxN(global float* out, global float* a, global float* b) { ... }"
Stage 6 — After compile
BINARY(b'\x7fELF...' or b'PTX...')
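The linearize and render steps in the summary above (DAG to flat list, flat list to source string) can be sketched on a tiny expression graph. This is a toy model: `Op`, `linearize`, and `ToyCRenderer` are hypothetical names, the node set is limited to LOAD, ADD, MUL, and STORE, and the emitted text is only C-like. It is not the output of any tinygrad renderer.

```python
from collections import namedtuple

# Hypothetical node type: structural equality and hashing come from the tuple.
Op = namedtuple("Op", ["op", "src", "arg"], defaults=[(), None])


def linearize(sink):
    """Depth-first post-order walk: each node is emitted after all of its sources."""
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for s in node.src:
            visit(s)
        order.append(node)

    visit(sink)
    return order


class ToyCRenderer:
    """Hypothetical backend: turns a flat node list into C-like statements."""
    templates = {"ADD": "{0} + {1}", "MUL": "{0} * {1}"}

    def render(self, uops):
        names, lines = {}, []
        for u in uops:
            if u.op == "LOAD":
                names[u] = u.arg  # a load is referred to by its input expression
            elif u.op == "STORE":
                lines.append(f"{u.arg} = {names[u.src[0]]};")
            else:
                names[u] = f"t{len(lines)}"
                expr = self.templates[u.op].format(*(names[s] for s in u.src))
                lines.append(f"float {names[u]} = {expr};")
        return "\n".join(lines)


a = Op("LOAD", arg="a[i]")
b = Op("LOAD", arg="b[i]")
sink = Op("STORE", (Op("ADD", (Op("MUL", (a, b)), a)),), arg="out[i]")

uops = linearize(sink)              # LOAD, LOAD, MUL, ADD, STORE
source = ToyCRenderer().render(uops)
```

Supporting another target in this sketch would mean adding a second class with its own `render` method while `linearize` stays unchanged, mirroring the separation between codegen and renderer described earlier.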