Renderer

Stage 5 of 6 · tinygrad/renderer/
Renderer base class renderer/__init__.py:62

The Renderer class is the abstract interface all device backends implement. Codegen calls renderer.render(uops); the renderer walks the linearized UOp list and emits device source code.

class Renderer:
  device: str = ""
  suffix: str = ""
  # capability flags
  has_local: bool  = True    # has thread-local storage
  has_shared: bool = True    # has shared/LDS memory
  has_threads: bool = True   # is parallel (GPU)
  # resource limits
  global_max: tuple = (...)
  local_max:  tuple = (...)
  shared_max: int   = 32768
  # operation → code string mapping
  code_for_op: dict[Ops, Callable] = {}
  # available tensor-core configs
  tensor_cores: list[TensorCore] = []
  # methods
  def render(self, uops: list[UOp]) -> str: ...
  def asm(self, prg, lin) -> bytes: ...        # for ISA renderers
renderer/__init__.py:62 — class Renderer
supported_ops is derived from code_for_op.keys(). Codegen uses this to decompose ops the renderer can't express into combinations of ops it can.
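
The walk-and-emit loop and the role of supported_ops can be sketched with a toy model. Everything below is hypothetical: ToyOp, ToyUOp, ToyCRenderer and decompose_sub are illustrative stand-ins, not tinygrad's Ops, UOp, or codegen passes. The sketch assumes each op refers to its operands by index into the linearized list, and that an unsupported SUB can be rewritten as ADD and MUL before rendering.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ToyOp(Enum):  # hypothetical stand-in for tinygrad's Ops
    CONST = auto()
    ADD = auto()
    MUL = auto()
    SUB = auto()


@dataclass(frozen=True)
class ToyUOp:  # hypothetical stand-in for UOp
    op: ToyOp
    src: tuple = ()     # indices of earlier entries in the linearized list
    arg: object = None  # payload, here only the value of a CONST


class ToyCRenderer:
    # operation -> function that builds a C expression from operand names
    code_for_op: dict = {
        ToyOp.ADD: lambda a, b: f"({a}+{b})",
        ToyOp.MUL: lambda a, b: f"({a}*{b})",
    }

    @property
    def supported_ops(self) -> set:
        return set(self.code_for_op.keys())

    def render(self, uops: list) -> str:
        lines = []
        for i, u in enumerate(uops):
            if u.op is ToyOp.CONST:
                expr = str(u.arg)
            elif u.op in self.supported_ops:
                expr = self.code_for_op[u.op](*(f"v{s}" for s in u.src))
            else:
                raise NotImplementedError(f"{u.op.name} must be decomposed first")
            lines.append(f"float v{i} = {expr};")
        return "\n".join(lines)


def decompose_sub(uops: list, supported: set) -> list:
    """Rewrite SUB(a, b) as ADD(a, MUL(b, -1)) when SUB is unsupported."""
    out, remap = [], {}
    for i, u in enumerate(uops):
        src = tuple(remap[s] for s in u.src)  # operands may have moved
        if u.op is ToyOp.SUB and ToyOp.SUB not in supported:
            out.append(ToyUOp(ToyOp.CONST, arg=-1.0))
            out.append(ToyUOp(ToyOp.MUL, (src[1], len(out) - 1)))
            out.append(ToyUOp(ToyOp.ADD, (src[0], len(out) - 1)))
        else:
            out.append(ToyUOp(u.op, src, u.arg))
        remap[i] = len(out) - 1  # old index -> index of its result
    return out


renderer = ToyCRenderer()
program = [ToyUOp(ToyOp.CONST, arg=2.0), ToyUOp(ToyOp.CONST, arg=3.0), ToyUOp(ToyOp.SUB, (0, 1))]
lowered = decompose_sub(program, renderer.supported_ops)
source = renderer.render(lowered)
print(source)
```

Rendering `program` directly raises NotImplementedError because SUB has no entry in code_for_op; after the rewrite, every op in the list is one the renderer can express.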

Key properties

has_local     Whether thread-local variables exist (false for simple CPU backends)
has_shared    Shared memory across a thread group (GPU only)
global_max    Max grid dimensions; controls loop tiling strategy
local_max     Max thread block dimensions
tensor_cores  Available WMMA/tensor-core configurations (e.g. 16×8×16 fp16)
Device backends renderer/ & runtime/
C-style (CPU / OpenCL)
renderer/cstyle.py
Generates C/OpenCL C with explicit memory casts. Used by CPU (clang) and OpenCL backends.
CPU OpenCL
LLVM IR
renderer/llvmir.py
Generates textual LLVM IR. Compiled via llvmlite. Used for high-performance CPU execution.
CPU
NVIDIA PTX
renderer/ptx.py
Generates PTX assembly. Assembled via CUDA's nvrtc or ptxas. Supports tensor cores.
CUDA
WebGPU WGSL
renderer/wgsl.py
Generates WGSL for WebGPU. Enables browser-based GPU execution.
WebGPU
NIR
renderer/nir.py
Mesa NIR (generic GPU IR). Used as an alternative GPU backend path.
GPU
AMD GCN/RDNA
renderer/amd/
AMD-specific GPU code generation for GCN and RDNA architectures. Supports AMD tensor/matrix cores.
AMD
x86 / ARM ISA
renderer/isa/
Direct machine-code generation bypassing LLVM. Used for embedded/bare-metal targets.
CPU
Metal (Apple)
runtime/ops_metal.py
Metal Shading Language (MSL) generation. Uses Metal C++ compiler on macOS/iOS. Supports Apple GPU tensor ops.
Metal
Compiler class device.py:276

The Compiler class wraps device-specific compilation. Each backend provides a Compiler subclass that implements compile(src: str) -> bytes.

compile(src)    Takes rendered source string, returns compiled binary bytes
compile_cached  Wraps compile() with disk/memory caching keyed on source hash
disassemble     Optional: convert bytes back to human-readable assembly for debugging
device.py:276 — class Compiler
Compile results are cached by content hash. On warm runs or JIT loops, identical kernels hit the cache and skip compilation entirely.
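
A minimal sketch of hash-keyed compile caching, under stated assumptions: ToyCompiler is a hypothetical class, its compile() is a placeholder that only encodes the source, and the cache lives in an in-memory dict, whereas the text above also mentions a disk cache. The compile_calls counter exists only to make cache hits visible.

```python
import hashlib


class ToyCompiler:
    """Hypothetical stand-in for a Compiler subclass, not tinygrad's code."""

    def __init__(self):
        self._cache = {}        # source hash -> compiled bytes (memory only)
        self.compile_calls = 0  # counts real compilations, for illustration

    def compile(self, src: str) -> bytes:
        # Placeholder "compilation": a real backend would invoke its toolchain.
        self.compile_calls += 1
        return src.encode("utf-8")

    def compile_cached(self, src: str) -> bytes:
        key = hashlib.sha256(src.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.compile(src)
        return self._cache[key]


compiler = ToyCompiler()
kernel = "float v0 = 2.0;"
first = compiler.compile_cached(kernel)
second = compiler.compile_cached(kernel)  # identical source: cache hit
print(compiler.compile_calls)
```

Keying on the source hash means two kernels that render to the same text share one compiled binary, regardless of how they were produced.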
Tensor cores & WMMA hardware acceleration

When a Renderer declares tensor_cores, the optimizer can emit WMMA/SHAPED_WMMA UOps for matrix multiplications. Each TensorCore entry describes:

dims     The M×N×K tile size (e.g. 16×8×16 for CUDA fp16)
dtypes   Input and accumulator dtypes (e.g. half → float)
threads  Required thread layout within the warp/wave
layout   Memory layout of the fragments (row-major, column-major)

The codegen expander converts matmul DAGs into WMMA UOps which the renderer maps to hardware instructions (mma.sync in PTX, simdgroup_multiply_accumulate in Metal, etc.).
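
To make the tile arithmetic concrete, here is a rough sketch of a tensor-core descriptor and a check for whether a matmul shape divides into tiles. ToyTensorCore and wmma_ops_needed are hypothetical names, the field layout is an assumption rather than tinygrad's TensorCore class, and the threads value is illustrative. The sketch also simplifies by rejecting shapes that do not divide evenly; a real optimizer may handle those differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyTensorCore:  # hypothetical descriptor, not tinygrad's TensorCore
    dims: tuple     # (M, N, K) tile size
    dtype_in: str   # input fragment dtype
    dtype_out: str  # accumulator dtype
    threads: int    # threads cooperating on one tile (illustrative value)


def wmma_ops_needed(tc: ToyTensorCore, m: int, n: int, k: int) -> Optional[int]:
    """Tile-sized multiply-accumulates for an (m x k) @ (k x n) matmul,
    or None if the shape does not divide evenly into tiles."""
    tm, tn, tk = tc.dims
    if m % tm or n % tn or k % tk:
        return None
    return (m // tm) * (n // tn) * (k // tk)


cuda_fp16 = ToyTensorCore(dims=(16, 8, 16), dtype_in="half", dtype_out="float", threads=32)
print(wmma_ops_needed(cuda_fp16, 64, 64, 64))  # 4 * 8 * 4 = 128
print(wmma_ops_needed(cuda_fp16, 60, 64, 64))  # None: 60 is not a multiple of 16
```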