tinygrad — Data Flow

# 1. User code
from tinygrad import Tensor
a = Tensor.randn(1024, 1024)          # UOp(BUFFER, shape=(1024,1024))
b = Tensor.randn(1024, 1024)          # UOp(BUFFER, shape=(1024,1024))
c = (a @ b).relu()                    # DAG: RELU(REDUCE(MUL(EXPAND(a), EXPAND(b))))
# ↑ nothing computed — just graph nodes

# 2. Trigger execution
c.realize()
# calls: Tensor.linear_with_vars()
#   → create_linear_with_vars(big_sink)          [schedule/__init__.py:122]
#     → full graph rewrite (movement op fusion)
#     → create_schedule(sched_sink)              [schedule/__init__.py:21]
#       → topological sort of kernel dependencies
#       → returns LINEAR(CALL(kernel_ast, buf_a, buf_b, buf_c))
#     → returns (linear_uop, var_vals={})
#
# calls: run_linear(linear, var_vals)             [engine/realize.py:252]
#   → for each CALL in linear.src:
#       compile_linear if not cached
#         → do_to_program(ast, renderer)         [codegen/__init__.py:176]
#           → full_rewrite_to_sink(ast, ren)     [codegen/__init__.py:27]
#               phase 1: eliminate RESHAPE/EXPAND
#               phase 2: add RANGE loop bounds
#               phase 3: expand WMMA / tensor cores
#               phase 4: finalize loop bounds
#               phase 5: devectorize, lower index dtypes
#           → do_linearize(ren, prg, sink)       [codegen/__init__.py:132]
#               toposort → flat list[UOp]
#           → do_render(ren, prg, lin)           [codegen/__init__.py:155]
#               renderer.render(uops) → source str
#           → do_compile(ren, prg, source)       [codegen/__init__.py:160]
#               compiler.compile_cached(src) → bytes
#       exec_kernel(ctx, call, ast)              [engine/realize.py:170]
#           → get_runtime(device, ast)
#           → runtime(bufs, var_vals, global_sz, local_sz)
#           → GPU kernel runs ← ACTUAL COMPUTATION HERE

tensor.py:241               realize()
schedule/__init__.py:122    create_linear_with_vars
codegen/__init__.py:176     do_to_program
engine/realize.py:252       run_linear
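To make the DAG from step 1 concrete, the sketch below spells out the arithmetic behind RELU(REDUCE(MUL(EXPAND(a), EXPAND(b)))) using plain Python lists instead of tinygrad. The function name `matmul_relu_reference` is hypothetical; this is only a reference for what gets computed, not how tinygrad schedules or runs the kernel.

```python
def matmul_relu_reference(a, b):
    """Reference arithmetic for (a @ b).relu() on nested lists (illustrative only)."""
    M, K, N = len(a), len(b), len(b[0])
    assert all(len(row) == K for row in a), "inner dimensions must match"
    out = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            # EXPAND + MUL: pair a[m][k] with b[k][n] for every k
            products = [a[m][k] * b[k][n] for k in range(K)]
            # REDUCE(ADD) over k, then RELU clamps negatives to zero
            out[m][n] = max(0.0, sum(products))
    return out

a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [2.0, -1.0]]
c = matmul_relu_reference(a, b)
```

The two nested loops correspond to the output dimensions, and the inner list comprehension plus `sum` plays the role of the reduction axis.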

Key files (quick reference index)

tinygrad/tensor.py                     Tensor class, realize(), _apply_uop()
tinygrad/mixin/elementwise.py          add, mul, exp2, log2, cast, where…
tinygrad/mixin/movement.py             reshape, permute, expand, pad, shrink…
tinygrad/mixin/reduce.py               sum, max, min, mean, argmax…
tinygrad/uop/__init__.py               Ops enum: all 80+ operation codes
tinygrad/uop/ops.py                    UOp class, PatternMatcher, UPat, graph_rewrite
tinygrad/schedule/__init__.py          create_linear_with_vars, create_schedule
tinygrad/codegen/__init__.py           do_to_program, full_rewrite_to_sink, do_render, do_compile
tinygrad/codegen/late/linearizer.py    linearize(): DAG → ordered flat list[UOp]
tinygrad/renderer/__init__.py          Renderer base class, TensorCore config
tinygrad/renderer/cstyle.py            C/OpenCL source generation
tinygrad/renderer/ptx.py               NVIDIA PTX assembly generation
tinygrad/engine/realize.py             run_linear, compile_linear, exec_kernel, pm_exec
tinygrad/device.py                     Device, Buffer, Allocator, Compiler, Compiled

Core design patterns (architecture)

Lazy evaluation

Tensor operations only build a UOp DAG; nothing is computed until .realize() is called, which schedules and runs the pending graph. Deferring execution lets fusion decisions be made globally, across ops.
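The idea can be illustrated with a toy lazy node in plain Python. `LazyNode` and its `evaluations` counter are hypothetical names invented for this sketch; tinygrad's real Tensor and UOp classes are far richer.

```python
class LazyNode:
    """Hypothetical stand-in for a lazy tensor: records ops, computes on realize()."""
    evaluations = 0  # how many non-constant nodes have actually been computed

    def __init__(self, op, srcs=(), value=None):
        self.op, self.srcs, self.value = op, srcs, value

    @staticmethod
    def const(v):
        return LazyNode("CONST", value=v)

    # Operators only record a new graph node; no arithmetic happens here.
    def __add__(self, other): return LazyNode("ADD", (self, other))
    def __mul__(self, other): return LazyNode("MUL", (self, other))

    def realize(self):
        if self.value is None:
            lhs, rhs = (s.realize() for s in self.srcs)
            self.value = lhs + rhs if self.op == "ADD" else lhs * rhs
            LazyNode.evaluations += 1
        return self.value

x, y = LazyNode.const(3), LazyNode.const(4)
z = (x + y) * y                                  # builds two graph nodes
computed_before = LazyNode.evaluations           # still 0: nothing has run
result = z.realize()                             # walks the graph and computes
```

Because the whole expression is visible before anything runs, a real system can inspect it and decide which operations to fuse; this toy simply evaluates node by node.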

Graph rewrites (PatternMatcher)

Every optimization and lowering pass is expressed as declarative pattern → replacement rules. The same mechanism is used from early Tensor-level rewrites all the way to device-specific instruction selection.
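A minimal sketch of the pattern → replacement idea, assuming expressions are plain tuples and wildcards are strings starting with "?". The names `match`, `rewrite`, and `RULES` are made up for this example and do not mirror the PatternMatcher or UPat APIs.

```python
# Expressions: ("ADD", x, y), ("MUL", x, y), ("CONST", v), ("VAR", name)
RULES = [
    (("MUL", "?x", ("CONST", 1)), "?x"),   # x * 1 -> x
    (("ADD", "?x", ("CONST", 0)), "?x"),   # x + 0 -> x
]

def match(pat, expr, binds):
    """Structurally match pat against expr, recording wildcard bindings."""
    if isinstance(pat, str) and pat.startswith("?"):
        binds[pat] = expr
        return True
    if isinstance(pat, tuple) and isinstance(expr, tuple) and len(pat) == len(expr):
        return all(match(p, e, binds) for p, e in zip(pat, expr))
    return pat == expr

def rewrite(expr):
    """Rewrite sources first, then apply rules at this node until none match."""
    if expr[0] in ("ADD", "MUL"):
        expr = (expr[0],) + tuple(rewrite(s) for s in expr[1:])
    for pat, repl in RULES:
        binds = {}
        if match(pat, expr, binds):
            return rewrite(binds[repl])
    return expr

expr = ("ADD", ("MUL", ("VAR", "x"), ("CONST", 1)), ("CONST", 0))
simplified = rewrite(expr)
```

The rules are data, so adding an optimization means appending one entry to `RULES` rather than editing the traversal.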

UOp deduplication

UOpMetaClass caches instances by (op, dtype, src, arg), so two identical subgraphs always share the same nodes. This keeps the DAG compact and reduces equality checks to an O(1) identity comparison.
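The caching behaviour can be sketched with a small interning metaclass. `InternMeta` and `Node` are hypothetical names, and the weak-value cache is an assumption made for this example, not a description of UOpMetaClass internals.

```python
import weakref

class InternMeta(type):
    """Hypothetical metaclass: returns the existing instance for an identical key."""
    _cache = weakref.WeakValueDictionary()

    def __call__(cls, op, dtype, src=(), arg=None):
        key = (op, dtype, src, arg)
        node = InternMeta._cache.get(key)
        if node is None:
            # First time this (op, dtype, src, arg) is seen: build and remember it.
            node = super().__call__(op, dtype, src, arg)
            InternMeta._cache[key] = node
        return node

class Node(metaclass=InternMeta):
    def __init__(self, op, dtype, src, arg):
        self.op, self.dtype, self.src, self.arg = op, dtype, src, arg

x = Node("CONST", "float", arg=2.0)
y = Node("CONST", "float", arg=2.0)          # same key, so the same object as x
s1 = Node("ADD", "float", (x, y))
s2 = Node("ADD", "float", (Node("CONST", "float", arg=2.0), y))
```

Since sources are themselves interned, the key for a parent node compares its children by identity, which is what makes structural equality cheap.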

Symbolic shapes

Dimension sizes can be symbolic UOp expressions. Concrete values are only bound at runtime via sym_infer(var_vals). This allows one compiled kernel to handle multiple tensor sizes.
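As a rough illustration, a symbolic dimension can be modelled as a small expression tree that is evaluated once concrete values are known. The helper `sym_eval` below is a hypothetical stand-in written for this sketch; it is not tinygrad's sym_infer.

```python
def sym_eval(expr, var_vals):
    """Evaluate a tiny symbolic expression: int, variable name, or (op, lhs, rhs)."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return var_vals[expr]            # bind the variable at "runtime"
    op, lhs, rhs = expr
    l, r = sym_eval(lhs, var_vals), sym_eval(rhs, var_vals)
    return l + r if op == "+" else l * r

shape = ("n", 64)                        # first dimension is symbolic
numel = ("*", shape[0], shape[1])        # element count as an expression
# The same expression serves several concrete sizes without being rebuilt.
sizes = [sym_eval(numel, {"n": n}) for n in (1, 8, 32)]
```

The expression for `numel` is built once; only the `var_vals` mapping changes between runs, which mirrors how one kernel can serve several tensor sizes.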

Renderer abstraction

All device-specific code generation lives in a Renderer subclass. Codegen only knows abstract UOps; it calls renderer.render(uops), which maps them to device source strings. Adding a new code-generation backend means writing one Renderer.
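The separation can be sketched with a toy base class whose subclasses only supply string templates. `ToyRenderer`, its subclasses, and the tuple-based op format are all assumptions made for illustration; the real Renderer interface is more involved.

```python
class ToyRenderer:
    """Hypothetical base class: the traversal is shared, the strings are not."""
    templates = {}

    def render(self, uops):
        # Each op is (name, *operands); look up the device-specific template.
        return "\n".join(self.templates[op].format(*args) for op, *args in uops)

class ToyCRenderer(ToyRenderer):
    templates = {
        "LOAD": "float {0} = {1}[i];",
        "MUL": "float {0} = {1} * {2};",
        "STORE": "{0}[i] = {1};",
    }

class ToyAsmRenderer(ToyRenderer):
    templates = {
        "LOAD": "ld {0}, [{1}+i]",
        "MUL": "mul {0}, {1}, {2}",
        "STORE": "st [{0}+i], {1}",
    }

uops = [("LOAD", "v0", "a"), ("LOAD", "v1", "b"),
        ("MUL", "v2", "v0", "v1"), ("STORE", "out", "v2")]
c_src = ToyCRenderer().render(uops)
asm_src = ToyAsmRenderer().render(uops)
```

The same `uops` list produces two different source strings, and neither the list nor `render` needs to know which device is targeted.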

Multi-level caching

Six independent cache layers (UOp, schedule, source, disk, runtime, graph) let repeated forward passes, as in training, reuse earlier work, so CPU overhead after the first iteration is close to zero.

UOp shape across the pipeline (transformation summary)

Stage 1 — Tensor graph (movement ops present, high-level)
  SINK
    └─ REDUCE(ADD) ← MUL ← EXPAND ← RESHAPE ← BUFFER(a)
                          ← EXPAND ← RESHAPE ← BUFFER(b)

Stage 2 — After scheduling (movement ops fused/eliminated, CALL tree)
  LINEAR(
    CALL(PROGRAM(ast), buf_a, buf_b, buf_out)
  )

Stage 3 — After codegen / full_rewrite_to_sink (explicit loops)
  SINK(
    RANGE(0..M)                          ← outer loop
      RANGE(0..N)                        ← inner loop
        STORE(INDEX(buf_out, m*N+n),
              REDUCE(MUL(LOAD(a[m,k]),
                         LOAD(b[k,n]))))
      END
    END
  )

Stage 4 — After linearize (flat list)
  [DEFINE_VAR(m), RANGE(m, 0, M),
   DEFINE_VAR(n), RANGE(n, 0, N),
   LOAD(a[m,k]), LOAD(b[k,n]), MUL, STORE(out[m,n]),
   END(n), END(m)]

Stage 5 — After render (device source string)
  "__kernel void r_MxN(global float* out, global float* a, global float* b) { ... }"

Stage 6 — After compile
  BINARY(b'\x7fELF...' or b'PTX...')
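The step from a DAG to a flat list (stage 3 to stage 4) is essentially a topological sort. The sketch below is a generic post-order traversal over a dictionary-based graph; `toy_linearize` and the string node names are hypothetical and much simpler than the linearizer in codegen/late/linearizer.py.

```python
def toy_linearize(sink, srcs):
    """Post-order DFS: every node appears after all of its sources."""
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for s in srcs.get(node, ()):
            visit(s)
        order.append(node)   # emitted only once all sources are emitted

    visit(sink)
    return order

# Each key lists the nodes it depends on.
graph = {
    "STORE": ["MUL"],
    "MUL": ["LOAD_a", "LOAD_b"],
    "LOAD_a": ["RANGE"],
    "LOAD_b": ["RANGE"],
    "RANGE": [],
}
flat = toy_linearize("STORE", graph)
```

The shared `RANGE` node is visited once even though two loads depend on it, so the flat list contains no duplicates and respects every dependency.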