Renderer base class
renderer/__init__.py:62
The Renderer class is the abstract interface all device backends implement. Codegen calls renderer.render(uops); the renderer walks the linearized UOp list and emits device source code.
    @dataclass
    class Renderer:
        device: str = ""
        suffix: str = ""
        # capability flags
        has_local: bool = True    # has thread-local storage
        has_shared: bool = True   # has shared/LDS memory
        has_threads: bool = True  # is parallel (GPU)
        # resource limits
        global_max: tuple = (...)
        local_max: tuple = (...)
        shared_max: int = 32768
        # operation → code string mapping
        code_for_op: dict[Ops, Callable] = {}
        # available tensor-core configs
        tensor_cores: list[TensorCore] = []
        # methods
        def render(self, uops: list[UOp]) -> str: ...
        def asm(self, prg, lin) -> bytes: ...  # for ISA renderers
supported_ops is derived from code_for_op.keys(). Codegen uses this to decompose ops the renderer can't express into combinations of ops it can.
Key properties
has_local
Whether thread-local variables exist (false for simple CPU backends)
has_shared
Shared memory across a thread group (GPU only)
global_max
Max grid dimensions; controls loop tiling strategy
local_max
Max thread block dimensions
tensor_cores
Available WMMA/tensor-core configurations (e.g. 16×8×16 fp16)
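To make the relationship between code_for_op, supported_ops, and render() concrete, here is a minimal sketch. It is not the real class: Op, ToyUOp, ToyRenderer, and lower_unsupported are hypothetical stand-ins invented for illustration, and the real UOp and Ops types are considerably richer. The sketch shows a renderer that has no entry for SUB, a lowering pass that rewrites SUB(a, b) as ADD(a, NEG(b)), and a render() that walks the linear list and emits one C-like statement per op.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable


class Op(Enum):  # hypothetical stand-in for the real Ops enum
    CONST = auto()
    ADD = auto()
    NEG = auto()
    SUB = auto()


@dataclass(frozen=True)
class ToyUOp:  # hypothetical, far simpler than a real UOp
    op: Op
    srcs: tuple[int, ...] = ()  # indices of earlier entries in the list
    arg: object = None          # payload, used here only by CONST


@dataclass
class ToyRenderer:
    # default_factory avoids sharing one dict between instances
    code_for_op: dict[Op, Callable[..., str]] = field(default_factory=dict)

    @property
    def supported_ops(self) -> set[Op]:
        # derived from the mapping, as described above
        return set(self.code_for_op.keys())

    def render(self, uops: list[ToyUOp]) -> str:
        lines = []
        for i, u in enumerate(uops):
            if u.op is Op.CONST:
                expr = repr(u.arg)
            else:
                expr = self.code_for_op[u.op](*[f"v{s}" for s in u.srcs])
            lines.append(f"float v{i} = {expr};")
        return "\n".join(lines)


def lower_unsupported(uops: list[ToyUOp], supported: set[Op]) -> list[ToyUOp]:
    """Rewrite SUB(a, b) as ADD(a, NEG(b)) when SUB is not supported."""
    if Op.SUB in supported:
        return list(uops)
    out: list[ToyUOp] = []
    remap: dict[int, int] = {}  # old index -> new index
    for i, u in enumerate(uops):
        srcs = tuple(remap[s] for s in u.srcs)
        if u.op is Op.SUB:
            out.append(ToyUOp(Op.NEG, (srcs[1],)))
            out.append(ToyUOp(Op.ADD, (srcs[0], len(out) - 1)))
        else:
            out.append(ToyUOp(u.op, srcs, u.arg))
        remap[i] = len(out) - 1
    return out


renderer = ToyRenderer(code_for_op={
    Op.ADD: lambda a, b: f"({a}+{b})",
    Op.NEG: lambda a: f"(-{a})",
})
program = [ToyUOp(Op.CONST, arg=2.0), ToyUOp(Op.CONST, arg=3.0), ToyUOp(Op.SUB, (0, 1))]
lowered = lower_unsupported(program, renderer.supported_ops)
source = renderer.render(lowered)
print(source)
```

Keeping supported_ops as a view over code_for_op means a backend only has to maintain one table; the lowering pass can then query it without knowing anything else about the device.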
Device backends
renderer/ & runtime/
C-style (CPU / OpenCL)
renderer/cstyle.py
Generates C/OpenCL C with explicit memory casts. Used by CPU (clang) and OpenCL backends.
CPU OpenCL
LLVM IR
renderer/llvmir.py
Generates textual LLVM IR. Compiled via llvmlite. Used for high-performance CPU execution.
CPU
NVIDIA PTX
renderer/ptx.py
Generates PTX assembly. Assembled via CUDA's nvrtc or ptxas. Supports tensor cores.
CUDA
WebGPU WGSL
renderer/wgsl.py
Generates WGSL for WebGPU. Enables browser-based GPU execution.
WebGPU
NIR
renderer/nir.py
Mesa NIR (generic GPU IR). Used as an alternative GPU backend path.
GPU
AMD GCN/RDNA
renderer/amd/
AMD-specific GPU code generation for GCN and RDNA architectures. Supports AMD tensor/matrix cores.
AMD
x86 / ARM ISA
renderer/isa/
Direct machine-code generation bypassing LLVM. Used for embedded/bare-metal targets.
CPU
Metal (Apple)
runtime/ops_metal.py
Metal Shading Language (MSL) generation. Uses Metal C++ compiler on macOS/iOS. Supports Apple GPU tensor ops.
Metal
Compiler class
device.py:276
The Compiler class wraps device-specific compilation. Each backend provides a Compiler subclass that implements compile(src: str) -> bytes.
compile(src)
Takes rendered source string, returns compiled binary bytes
compile_cached
Wraps compile() with disk/memory caching keyed on source hash
disassemble
Optional: convert bytes back to human-readable assembly for debugging
Compile results are cached by content hash. On warm runs or JIT loops, identical kernels hit the cache and skip compilation entirely.
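The caching behavior can be sketched as follows. This is an illustration under stated assumptions, not the code in device.py: ToyCompiler is a hypothetical class, the "binary" it produces is a placeholder, SHA-256 is an assumed choice of hash, and only an in-memory cache is shown (the disk layer is omitted).

```python
import hashlib


class ToyCompiler:
    """Hypothetical stand-in for a backend's Compiler subclass."""

    def __init__(self) -> None:
        self._cache: dict[str, bytes] = {}  # source hash -> compiled bytes
        self.compile_calls = 0              # counts real compilations

    def compile(self, src: str) -> bytes:
        # Placeholder for invoking a real device compiler.
        self.compile_calls += 1
        return src.encode("utf-8")[::-1]

    def compile_cached(self, src: str) -> bytes:
        # Key on the content hash so identical kernels share one entry.
        key = hashlib.sha256(src.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.compile(src)
        return self._cache[key]


compiler = ToyCompiler()
kernel = "float v0 = 2.0;"
first = compiler.compile_cached(kernel)
second = compiler.compile_cached(kernel)  # cache hit, compile() is not called again
print(compiler.compile_calls)
```

Because the key depends only on the source text, a kernel that is re-rendered byte-for-byte identically in a later run or a later JIT iteration maps to the same entry.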
Tensor cores & WMMA
hardware acceleration
When a Renderer declares tensor_cores, the optimizer can emit WMMA/SHAPED_WMMA UOps for matrix multiplications. Each TensorCore entry describes:
dims
The M×N×K tile size (e.g. 16×8×16 for CUDA fp16)
dtypes
Input and accumulator dtypes (e.g. half → float)
threads
Required thread layout within the warp/wave
layout
Memory layout of the fragments (row-major, column-major)
The codegen expander converts matmul DAGs into WMMA UOps which the renderer maps to hardware instructions (mma.sync in PTX, simdgroup_multiply_accumulate in Metal, etc.).
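As a rough illustration of how a tensor-core description might be matched against a matmul, the sketch below models the four fields listed above and counts how many tile-sized operations a problem would need. ToyTensorCore, pick_tensor_core, and wmma_count are hypothetical names, the field types are assumptions, and the real optimizer applies more conditions than divisibility and dtype.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyTensorCore:  # hypothetical; field types are assumptions
    dims: tuple[int, int, int]  # (M, N, K) tile size
    dtype_in: str               # input dtype
    dtype_out: str              # accumulator dtype
    threads: int                # threads cooperating on one tile
    layout: str = "row-major"   # fragment memory layout


def pick_tensor_core(cores: list[ToyTensorCore], m: int, n: int, k: int,
                     dtype_in: str, dtype_out: str) -> Optional[ToyTensorCore]:
    """Return the first config whose dtypes match and whose tile divides the problem."""
    for tc in cores:
        tm, tn, tk = tc.dims
        if (tc.dtype_in, tc.dtype_out) != (dtype_in, dtype_out):
            continue
        if m % tm == 0 and n % tn == 0 and k % tk == 0:
            return tc
    return None


def wmma_count(tc: ToyTensorCore, m: int, n: int, k: int) -> int:
    """Number of tile-sized multiply-accumulates needed to cover an m x n x k matmul."""
    tm, tn, tk = tc.dims
    return (m // tm) * (n // tn) * (k // tk)


# One config matching the 16x8x16 half -> float example; 32 threads is an assumed warp size.
cores = [ToyTensorCore(dims=(16, 8, 16), dtype_in="half", dtype_out="float", threads=32)]
chosen = pick_tensor_core(cores, 64, 64, 64, "half", "float")
tiles = wmma_count(chosen, 64, 64, 64) if chosen else 0
print(tiles)
```

For a 64×64×64 half-precision matmul and a 16×8×16 tile, this gives 4 × 8 × 4 = 128 tile operations; a float-input matmul finds no matching config and would fall back to ordinary multiply-adds.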