tinygrad — Renderer

Renderer base class renderer/__init__.py:62

The Renderer class is the abstract interface all device backends implement. Codegen calls renderer.render(uops); the renderer walks the linearized UOp list and emits device source code.

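from dataclasses import dataclass, field
from typing import Callable

# Ops, UOp and TensorCore are tinygrad types; their imports are omitted here.
@dataclass
class Renderer:
  device: str = ""
  suffix: str = ""
  # capability flags
  has_local: bool = True     # has thread-local storage
  has_shared: bool = True    # has shared/LDS memory
  has_threads: bool = True   # is parallel (GPU)
  # resource limits
  global_max: tuple = (...)  # values elided
  local_max: tuple = (...)   # values elided
  shared_max: int = 32768
  # operation → code string mapping
  # (a dataclass rejects a bare {} or [] default, hence default_factory)
  code_for_op: dict[Ops, Callable] = field(default_factory=dict)
  # available tensor-core configs
  tensor_cores: list[TensorCore] = field(default_factory=list)
  # methods
  def render(self, uops: list[UOp]) -> str: ...
  def asm(self, prg, lin) -> bytes: ...        # for ISA renderers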

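The walk-and-emit step can be sketched without tinygrad at all. Everything in the block below (ToyUOp, ToyCRenderer, the string op names) is a hypothetical stand-in, not tinygrad's API; it only illustrates how a renderer might map each entry of an already-ordered op list to a line of C through a code_for_op table.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for one linearized UOp: result name, op name, operand names.
@dataclass(frozen=True)
class ToyUOp:
    out: str
    op: str
    srcs: tuple[str, ...]

class ToyCRenderer:
    """Illustrative only; not tinygrad's Renderer."""
    device = "TOY"
    # operation -> function that builds a C expression from operand strings
    code_for_op: dict[str, Callable[..., str]] = {
        "ADD": lambda a, b: f"({a}+{b})",
        "MUL": lambda a, b: f"({a}*{b})",
        "NEG": lambda a: f"(-{a})",
    }

    def render(self, uops: list[ToyUOp]) -> str:
        lines = []
        for u in uops:  # the list is assumed to be in execution order already
            expr = self.code_for_op[u.op](*u.srcs)
            lines.append(f"float {u.out} = {expr};")
        return "\n".join(lines)

uops = [ToyUOp("t0", "MUL", ("a", "b")), ToyUOp("t1", "ADD", ("t0", "c"))]
src = ToyCRenderer().render(uops)
print(src)
```

Because the input is already linearized, the renderer needs no scheduling logic of its own: one pass and one table lookup per op are enough in this toy model.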

supported_ops is derived from code_for_op.keys(). Codegen uses this to decompose ops the renderer can't express into combinations of ops it can.
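As a rough illustration of that decomposition, the sketch below assumes a backend whose op table has no SUB and rewrites a - b as a + (-b) before rendering. The tuple-based expression format, the string op names and the decompose helper are all invented for this example and are not how tinygrad represents or rewrites UOps.

```python
from typing import Callable

# Hypothetical op table for a backend with no native SUB.
code_for_op: dict[str, Callable[..., str]] = {
    "ADD": lambda a, b: f"({a}+{b})",
    "NEG": lambda a: f"(-{a})",
}
supported_ops = set(code_for_op.keys())

# Expressions are nested tuples (op, operand, ...); leaves are variable names.
def decompose(expr):
    if isinstance(expr, str):
        return expr
    op, *srcs = expr
    srcs = [decompose(s) for s in srcs]
    if op == "SUB" and op not in supported_ops:
        # a - b  ==  a + (-b)
        return ("ADD", srcs[0], ("NEG", srcs[1]))
    if op not in supported_ops:
        raise NotImplementedError(f"no rewrite rule for {op}")
    return (op, *srcs)

def render(expr) -> str:
    if isinstance(expr, str):
        return expr
    op, *srcs = expr
    return code_for_op[op](*(render(s) for s in srcs))

lowered = decompose(("SUB", "x", ("ADD", "y", "z")))
rendered = render(lowered)
print(rendered)
```

After the rewrite, every op in the expression has an entry in code_for_op, so rendering never hits a missing key.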

Key properties

has_local: Whether thread-local variables exist (false for simple CPU backends)
has_shared: Shared memory across a thread group (GPU only)
global_max: Maximum grid dimensions; controls loop tiling strategy
local_max: Maximum thread block dimensions
tensor_cores: Available WMMA/tensor-core configurations (e.g. 16×8×16 fp16)
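To make the role of global_max concrete, here is one simple way a limit on grid size could feed into tiling: clamp each launch dimension to its limit and turn the remainder into an in-kernel loop. This is an assumption-laden sketch, not tinygrad's actual strategy; fit_grid and the 65535 limits are hypothetical, and sizes are assumed to be positive.

```python
def fit_grid(sizes, global_max):
    """Clamp each launch dimension to its limit and return how many
    in-kernel loop iterations are needed to cover the rest."""
    grid, loops = [], []
    for size, limit in zip(sizes, global_max):
        g = min(size, limit)
        grid.append(g)
        loops.append(-(-size // g))  # ceil(size / g) using integer arithmetic
    return tuple(grid), tuple(loops)

# Hypothetical limits; real values depend on the device.
grid, loops = fit_grid((100000, 8, 1), (65535, 65535, 65535))
print(grid, loops)
```

Note that grid[i] * loops[i] can exceed the requested size (here 131070 versus 100000), so a kernel tiled this way would also need a bounds check.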

Device backends renderer/ & runtime/

C-style (CPU / OpenCL)

renderer/cstyle.py

Generates C/OpenCL C with explicit memory casts. Used by CPU (clang) and OpenCL backends.

CPU OpenCL

LLVM IR

renderer/llvmir.py

Generates textual LLVM IR. Compiled via llvmlite. Used for high-performance CPU execution.

CPU

NVIDIA PTX

renderer/ptx.py

Generates PTX assembly. Assembled via CUDA's nvrtc or ptxas. Supports tensor cores.

CUDA

WebGPU WGSL

renderer/wgsl.py

Generates WGSL for WebGPU. Enables browser-based GPU execution.

WebGPU

NIR

renderer/nir.py

Mesa NIR (generic GPU IR). Used as an alternative GPU backend path.

GPU

AMD GCN/RDNA

renderer/amd/

AMD-specific GPU code generation for GCN and RDNA architectures. Supports AMD tensor/matrix cores.

AMD

x86 / ARM ISA

renderer/isa/

Direct machine-code generation bypassing LLVM. Used for embedded/bare-metal targets.

CPU

Metal (Apple)

runtime/ops_metal.py

Metal Shading Language (MSL) generation. Uses Metal C++ compiler on macOS/iOS. Supports Apple GPU tensor ops.

Metal

Compiler class device.py:276

The Compiler class wraps device-specific compilation. Each backend provides a Compiler subclass that implements compile(src: str) -> bytes.

compile(src): Takes the rendered source string and returns the compiled binary as bytes
compile_cached: Wraps compile() with disk/memory caching keyed on the source hash
disassemble: Optional; converts the bytes back to human-readable assembly for debugging


Compile results are cached by content hash. On warm runs or JIT loops, identical kernels hit the cache and skip compilation entirely.
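The caching behaviour can be modelled in a few lines. The sketch below is a toy, not tinygrad's Compiler: the method names mirror the ones described above, but the internals (an in-memory dict keyed on SHA-256, a fake compile step, no disk layer) are assumptions made for illustration.

```python
import hashlib

class ToyCompiler:
    """Illustrative only; internals are assumed, not taken from tinygrad."""
    def __init__(self):
        self._cache: dict[str, bytes] = {}  # in-memory only; a disk layer is omitted
        self.compile_calls = 0

    def compile(self, src: str) -> bytes:
        # Stand-in for invoking a real device compiler.
        self.compile_calls += 1
        return src.encode("utf-8")[::-1]

    def compile_cached(self, src: str) -> bytes:
        # Key the cache on a hash of the source text.
        key = hashlib.sha256(src.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.compile(src)
        return self._cache[key]

c = ToyCompiler()
a = c.compile_cached("kernel void k() {}")
b = c.compile_cached("kernel void k() {}")  # identical source: served from cache
print(c.compile_calls)
```

Keying on the source content rather than on a kernel name means two kernels that render to identical text share one compiled binary.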

Tensor cores & WMMA hardware acceleration

When a Renderer declares tensor_cores, the optimizer can emit WMMA/SHAPED_WMMA UOps for matrix multiplications. Each TensorCore entry describes:

dims: The M×N×K tile size (e.g. 16×8×16 for CUDA fp16)
dtypes: Input and accumulator dtypes (e.g. half → float)
threads: Required thread layout within the warp/wave
layout: Memory layout of the fragments (row-major, column-major)

The codegen expander converts matmul DAGs into WMMA UOps which the renderer maps to hardware instructions (mma.sync in PTX, simdgroup_multiply_accumulate in Metal, etc.).
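One way to picture how a declared tensor-core entry gets matched against a matmul is a simple divisibility and dtype check. The sketch below is hypothetical: ToyTensorCore and pick_tensor_core are invented names, only the dims and dtype fields from the description above are modelled, and a real optimizer may pad or reshape rather than reject a shape outright.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyTensorCore:
    dims: tuple[int, int, int]  # (M, N, K) tile size
    dtype_in: str
    dtype_out: str

def pick_tensor_core(cores, m, n, k, dtype_in, dtype_out):
    """Return the first entry whose dtypes match and whose tile divides the problem."""
    for tc in cores:
        tm, tn, tk = tc.dims
        dtypes_match = (tc.dtype_in, tc.dtype_out) == (dtype_in, dtype_out)
        if dtypes_match and m % tm == 0 and n % tn == 0 and k % tk == 0:
            return tc
    return None

cores = [ToyTensorCore((16, 8, 16), "half", "float")]
hit = pick_tensor_core(cores, 64, 64, 64, "half", "float")
miss = pick_tensor_core(cores, 64, 60, 64, "half", "float")  # 60 is not a multiple of 8
print(hit, miss)
```

When no entry matches, the matmul would simply be lowered through ordinary multiply-accumulate ops instead of WMMA.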