UOp dataclass
uop/ops.py:128
UOp is the single IR node that represents every operation in the entire pipeline — from high-level tensor ops through scheduling to compiled kernels. All Tensor computation is a DAG of UOps.
@dataclass(eq=False, slots=True)
class UOp(OpMixin, metaclass=UOpMetaClass):
  op: Ops                      # what this node does
  dtype: DType = dtypes.void   # output type
  src: tuple[UOp,...] = ()     # input nodes (the DAG edges)
  arg: Any = None              # op-specific extra data
  tag: Any = None              # optional tracing metadata
  metadata: Metadata = None    # user-supplied metadata
uop/ops.py:128 — class UOp
Deduplication:
UOpMetaClass (line 85) caches instances: constructing two UOps with identical (op, dtype, src, arg) returns the same object. This keeps the DAG compact and enables O(1) equality checks.
Key properties
toposort()
Returns ordered dict of all ancestor UOps; respects control-flow entry ordering
_shape
Recursively infers shape from src nodes; may contain symbolic sint values
ssimplify()
Constant-fold symbolic expressions where possible
sym_infer(var_vals)
Substitute symbolic variables with concrete ints
uop/ops.py:174 — toposort()
uop/ops.py:212 — _shape
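The deduplication and toposort behaviour described in this section can be sketched in plain Python. This is a minimal illustration, not tinygrad's code: InternMeta and Node are hypothetical stand-ins for UOpMetaClass and UOp, the weak-value cache is an assumption about how instances might be retained, and dtype is simplified to a string.

```python
import weakref

class InternMeta(type):
    """Hypothetical stand-in for UOpMetaClass: one instance per distinct key."""
    _cache = weakref.WeakValueDictionary()  # assumption: entries vanish once unused

    def __call__(cls, op, dtype="void", src=(), arg=None):
        key = (op, dtype, src, arg)
        node = InternMeta._cache.get(key)
        if node is None:
            node = super().__call__(op, dtype, src, arg)  # build a fresh instance
            InternMeta._cache[key] = node
        return node

class Node(metaclass=InternMeta):
    """Hypothetical stand-in for UOp. No __eq__ is defined, so equality is identity."""
    __slots__ = ("op", "dtype", "src", "arg", "__weakref__")

    def __init__(self, op, dtype, src, arg):
        self.op, self.dtype, self.src, self.arg = op, dtype, src, arg

    def toposort(self):
        """Return a dict whose keys are this node and its ancestors, inputs first."""
        order = {}
        stack = [(self, False)]
        while stack:
            node, children_done = stack.pop()
            if node in order:
                continue
            if children_done:
                order[node] = None
            else:
                stack.append((node, True))
                stack.extend((s, False) for s in reversed(node.src))
        return order

one_a = Node("CONST", "int", (), 1)
one_b = Node("CONST", "int", (), 1)       # same key, so the cached object comes back
total = Node("ADD", "int", (one_a, one_b))
same_total = Node("ADD", "int", (one_a, one_a))
topo = list(total.toposort())
```

Because children are already interned, comparing two nodes by identity is equivalent to comparing the whole subgraph, which is where the O(1) equality check comes from. In this sketch the shared CONST input appears only once in the topological order.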
Ops enum — all operation codes
uop/__init__.py:13
The Ops enum enumerates every operation a UOp can represent. The ordering controls toposort priority.
Defines / Params
DEFINE_VAR
BIND
SPECIAL
DEFINE_LOCAL
DEFINE_REG
Program Structure
SINK
LINEAR
PROGRAM
SOURCE
BINARY
PARAM
FUNCTION
CALL
Memory
BUFFER
BUFFER_VIEW
LOAD
STORE
INDEX
COPY
Arithmetic (Unary)
CAST
BITCAST
EXP2
LOG2
SIN
SQRT
RECIPROCAL
NEG
TRUNC
Arithmetic (Binary)
ADD
MUL
SUB
FDIV
MAX
SHL
SHR
XOR
OR
AND
CMPLT
CMPNE
CMPEQ
POW
Ternary & Reduce
WHERE
MULACC
REDUCE
ALLREDUCE
WMMA
Control Flow
RANGE
END
IF
ENDIF
BARRIER
WAIT
Tensor-only (pre-schedule)
RESHAPE
PERMUTE
EXPAND
PAD
SHRINK
FLIP
CONTIGUOUS
DEVICE
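To illustrate how an enum's declaration order can act as a priority, here is a small sketch using the standard library's IntEnum with auto(). The subset of members and their relative order are assumptions chosen for the example and are not copied from uop/__init__.py; how tinygrad consumes the ordering inside toposort is not shown.

```python
from enum import IntEnum, auto

class Ops(IntEnum):
    # Hypothetical subset in a hypothetical order; auto() assigns 1, 2, 3, ...
    DEFINE_VAR = auto()
    BIND = auto()
    SINK = auto()
    BUFFER = auto()
    LOAD = auto()
    ADD = auto()
    MUL = auto()
    RANGE = auto()

# Members compare by their integer value, i.e. by declaration order.
ops_seen = [Ops.MUL, Ops.RANGE, Ops.DEFINE_VAR, Ops.LOAD]
ranked = sorted(ops_seen)
```

With this design, reordering the members in the class body is enough to change the rank of every op, with no separate priority table to maintain.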
PatternMatcher — AST rewriting
uop/ops.py:1299
All optimization and lowering passes are expressed as PatternMatcher rules. Each rule pairs a pattern (a UPat describing a UOp subgraph) with a function; when the pattern matches, the function receives the named sub-matches and returns the replacement UOp.
pm = PatternMatcher([
  # constant folding: ADD(CONST a, CONST b) → CONST(a+b)
  (UPat(Ops.ADD, src=(UPat(Ops.CONST, name="a"),
                      UPat(Ops.CONST, name="b"))),
   lambda a, b: UOp.const(a.dtype, a.arg + b.arg)),
  ...
])
new_uop = graph_rewrite(uop, pm)
uop/ops.py:1299 — class PatternMatcher
Bottom-up rewriting:
graph_rewrite applies the pattern matcher bottom-up, repeatedly until no more rules match. All codegen, scheduling, and optimization passes are built from these rules.
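The bottom-up, rewrite-until-stable behaviour described above can be sketched without tinygrad. This is a toy model under stated assumptions: nodes are plain (op, arg, src) tuples, rules are ordinary functions that return a replacement or None, and rewrite, fold_add and mul_by_one are hypothetical names, not part of the real API.

```python
def const(v):
    return ("CONST", v, ())

def add(a, b):
    return ("ADD", None, (a, b))

def mul(a, b):
    return ("MUL", None, (a, b))

def fold_add(node):
    """ADD(CONST a, CONST b) -> CONST(a+b); None means no match."""
    op, _, src = node
    if op == "ADD" and all(s[0] == "CONST" for s in src):
        return const(src[0][1] + src[1][1])
    return None

def mul_by_one(node):
    """MUL(x, CONST 1) -> x; None means no match."""
    op, _, src = node
    if op == "MUL" and src[1] == const(1):
        return src[0]
    return None

RULES = [fold_add, mul_by_one]

def rewrite(node, rules):
    """Rewrite children first, then this node, until no rule matches."""
    op, arg, src = node
    node = (op, arg, tuple(rewrite(s, rules) for s in src))
    for rule in rules:
        replacement = rule(node)
        if replacement is not None:
            # The replacement may itself match a rule, so rewrite it again.
            return rewrite(replacement, rules)
    return node

expr = mul(add(const(2), const(3)), const(1))
result = rewrite(expr, RULES)
nested = rewrite(add(add(const(1), const(2)), const(3)), RULES)
```

Processing children first is what lets the outer rule fire: the inner ADD folds to a constant before the MUL (or the outer ADD) is examined.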
Kernel-related UOp containers
uop/ops.py:1037–1097
Three frozen dataclasses wrap kernel-level information and are stored in UOp arg fields.
KernelInfo
Local sizes, dont_use_locals, tensor_core config. Stored in PROGRAM arg.
ProgramInfo
Device, name, globals/locals shape, op counts. Attached to PROGRAM after compilation.
CallInfo
Argument buffer list for a CALL node: which Buffers the kernel reads/writes and their roles.
uop/ops.py:1037 — KernelInfo
uop/ops.py:1050 — ProgramInfo
uop/ops.py:1097 — CallInfo
UOp lifecycle through the pipeline
stages
# Stage 1 — Tensor graph (high-level, movement ops present)
BUFFER ─ RESHAPE ─ PERMUTE ─ EXPAND ─┐
├─ MUL ─ REDUCE ─ SINK
BUFFER ─────────────────────────────┘
# Stage 2 — After scheduling (movement ops eliminated, CALL tree)
LINEAR(
CALL(PROGRAM(kernel_ast), buf_a, buf_b, buf_out)
CALL(COPY, buf_src, buf_dst)
...
)
# Stage 3 — After codegen (lowered to indexed loops)
SINK(
RANGE(0..N) ─┐
├─ STORE(INDEX(buf_out, i), LOAD(buf_a,i) * LOAD(buf_b,i))
└─ END
)
# Stage 4 — After render (device source string)
SOURCE("kernel void k(float* a, float* b, float* c) { ... }")
# Stage 5 — After compile
BINARY(bytes_of_ptx_or_metallib)