
tinygrad notes

rough progression of a tinygrad tensor operation

1) Tensors are just thin wrappers over UOps

Files involved:

  • tensor.py
  • uop/ops.py
  • uop/__init__.py

Tensors are just a thin wrapper over UOps. Every tensor operation creates a UOp (or a set of UOps) that gets injected into the current UOp graph.

For example:

Tensor.ones(3).uop

UOp(Ops.EXPAND, dtypes.float, arg=None, src=(
  UOp(Ops.RESHAPE, dtypes.float, arg=None, src=(
    UOp(Ops.CONST, dtypes.float, arg=1.0, src=(
      UOp(Ops.DEVICE, dtypes.void, arg='CPU', src=()),
      UOp(Ops.UNIQUE, dtypes.void, arg=0, src=()),)),
    UOp(Ops.CONST, dtypes.index, arg=1, src=()),)),
  UOp(Ops.CONST, dtypes.index, arg=3, src=()),))
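
Chaining operations just stacks more UOps onto this graph. A quick way to see that for yourself (the exact ops you get back can vary between tinygrad versions):

from tinygrad import Tensor

a = Tensor.ones(3)
print((a + 1).uop)   # roughly: an Ops.ADD node built on top of the graph above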

all the ops

You can write an entire GPU kernel using just UOps, bypassing the tensor layer entirely.

Ops that don’t appear in compiled programs

These are higher-level ops that get lowered or eliminated during compilation:

  • unique, device, kernel, assign, custom_kernel, lunique
  • contiguous, contiguous_backward, detach
  • bufferize, copy, buffer, buffer_view, mselect, mstack, encdec
  • reshape, permute, expand, pad, shrink, flip, multi
  • reduce_axis, reduce, allreduce
  • unroll, contract, cat, ptrcat

everything else (ops that appear in compiled programs)

These are the low-level ops that actually make it into the final compiled kernel:

  • define_global, define_var, bind, special, define_local, define_reg
  • noop, rewrite_error, program, linear, source, binary, sink, after, group, gep, vectorize
  • index, load, store
  • wmma
  • cast, bitcast, exp2, log2, sin, sqrt, reciprocal, neg, trunc
  • add, mul, shl, shr, idiv, max, mod, cmplt, cmpne, cmpeq, xor, or, and, threefry, sub, fdiv, pow
  • where, mulacc
  • barrier, range, if, end, endif
  • vconst, const
  • custom, customi

2) Schedule creation (kernel partitioning)

Files involved:

  • engine/schedule.py
  • schedule/__init__.py
  • rangeify.py

Each UOp graph gets turned into one or more kernel ASTs (each rooted at Ops.SINK). Every kernel becomes an ExecItem with its own AST and buffers. This is where tinygrad figures out how to partition your computation into separate GPU kernel launches.
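
You can poke at this boundary directly from Python. A minimal sketch, assuming Tensor.schedule() and the ScheduleItem fields (ast, bufs) behave as described here; names shift a bit between versions:

from tinygrad import Tensor

out = (Tensor.ones(16) * 3).sum()
for si in out.schedule():                       # one ScheduleItem per kernel
  print(si.ast.op, [b.size for b in si.bufs])   # AST rooted at Ops.SINK, plus its buffers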

3) Kernel AST rewrite + optimization

Files involved:

  • codegen/__init__.py
  • postrange.py
  • opt/search.py (BEAM)

Kernel ASTs are rewritten and optimized based on rewrite rules defined in PatternMatchers. These are scattered all around the codebase and rewrite portions of the UOp graph: movement ops, range splitting, constant folding, etc.
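
As a rough illustration of what one of these rules looks like (a sketch, not tinygrad's actual rule set; I'm assuming the UPat/PatternMatcher API lives in tinygrad.uop.ops, older versions keep it in tinygrad.ops):

from tinygrad import dtypes
from tinygrad.uop.ops import UOp, UPat, PatternMatcher, graph_rewrite

# fold x + 0 -> x, similar in spirit to the symbolic simplification rules
fold_add_zero = PatternMatcher([
  (UPat.var("x") + 0, lambda x: x),
])

expr = UOp.const(dtypes.int, 5) + UOp.const(dtypes.int, 0)
print(graph_rewrite(expr, fold_add_zero))   # the +0 gets rewritten away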

apply_opts applies optimizations to the AST. This path varies based on the BEAM environment variable.

  • BEAM on: BEAM search runs. See the BEAM section below.
  • BEAM off: hand-coded optimizations from codegen/opt/heuristic.py are applied to each AST, assuming NOOPT is unset and the AST doesn't already have optimizations applied. See the pattern-matchers section below (todo).

More rewrites happen after this:

  • expander — lowers ranges and reduces into loops
  • devectorizer — handles UPCAST ops
  • gpudims — GPU launch dimensions
  • lower index dtype — converts symbolic dtypes into dtypes for the respective backend
  • decompositions — replaces unsupported ops with supported ops on the target renderer
  • final rewrite — one last cleanup pass

These can be found in:

  • codegen/late/expander.py
  • codegen/late/devectorizer.py
  • codegen/late/gpudims.py
  • codegen/simplify.py

At the end of the rewrite chain, the UOp graph is linearized into a list of UOps to be run in a straight line.

Linearization happens in:

  • codegen/late/linearizer.py
  • renderer/__init__.py

4) Render to source

Files involved:

  • renderers, like renderer/ptx.py
  • renderer/__init__.py

The renderer (tinygrad/renderer/*) turns the linearized UOps into device source code (CUDA, HIP, etc.). A ProgramSpec object is created.
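
A quick way to see what the renderer produced is the DEBUG env var (as far as I remember, DEBUG=4 prints the generated source for every kernel):

DEBUG=4 python your_script.py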

5) Compile to binary

Files involved:

  • engine/realize.py
  • runtime/ops_*.py

The device compiler compiles the source into a binary (ProgramSpec.lib). This is done via CompiledRunner in tinygrad/engine/realize.py.

6) Runtime launch

ExecItem.run() calls the device runtime with buffers, global/local sizes, and variable values. The kernel actually runs on the GPU/CPU here.
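
You can drive this stage by hand too. A sketch assuming lower_schedule_item in engine/realize.py still returns an ExecItem (the exact helper names move around between versions):

from tinygrad import Tensor
from tinygrad.engine.realize import lower_schedule_item

t = Tensor.ones(16) + 2
for si in t.schedule():
  ei = lower_schedule_item(si)   # ScheduleItem -> ExecItem (compiles the kernel if needed)
  ei.run()                       # launches it on the device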

7) Caching/JIT

Files involved:

  • jit.py

get_runner memoizes compiled programs based on the AST + context (including BEAM, NOOPT, DEVECTORIZE settings). If TinyJit is used, it captures kernels and replays them without recompiling.
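
A minimal TinyJit sketch (the first couple of calls run normally and capture the kernels, later calls replay them):

from tinygrad import Tensor, TinyJit

@TinyJit
def step(x: Tensor) -> Tensor:
  return (x * 2 + 1).realize()

for _ in range(4):
  out = step(Tensor.rand(8).realize())   # same shapes every call, so the capture can be replayed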

BEAM search

https://github.com/tinygrad/tinygrad/pull/13836

Python 3.14 removed support for pickling itertools objects, which broke BEAM in tinygrad. Currently, BEAM pickles Scheduler objects and sends them to worker processes to be executed. This is an issue because the Scheduler contains an itertools.count that cannot be serialized:

# tinygrad/codegen/opt/postrange.py:21
self.opt_range = itertools.count(start=max([x.arg[0] for x in self.rngs], default=0)+1)
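
The underlying failure is easy to reproduce without tinygrad at all (this worked on 3.13 and earlier):

import itertools, pickle

pickle.dumps(itertools.count(5))   # TypeError on Python 3.14+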

This led me down a long rabbit hole into what BEAM actually does. It’s one of the main reasons tinygrad is so fast.

What BEAM does

It runs many variants of every AST in your graph to find the one that runs fastest on your hardware. Think of it as auto-tuning at the kernel optimization level.
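
Turning it on is just an environment variable; as far as I can tell, the value sets the search width:

BEAM=2 python your_script.py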

The parameters BEAM adjusts are:

| opt | arguments | description |
| --- | --- | --- |
| UNROLL | axis, amt | loop unrolling, i.e. #pragma unroll |
| UPCAST | axis, amt | like float4 loads, later touched by devectorize |
| LOCAL | axis, amt | split work between threads in a workgroup |
| GROUP | axis, amt | parallelize reduce operations, inner axis inside the outer loop |
| GROUPTOP | axis, amt | same as above, but the inner axis is above the outer loop |
| THREAD | axis, amt | which globalizable axis becomes the thread index? local is workgroup size (block) |
| TC | axis, (tc_select, tc_opt, use) | tensor core/wmma ops when the device supports them |
| SWAP | axis0, axis1 | two RANGE nodes exchange axis IDs, only on global axes |
| PADTO | axis, amt | pads loop axes (for tile sizes that require dimensions to be multiples), adds a guard |
| NOLOCALS | NOLOCALS=1 (optional env var) | disables local/shared memory axes |

(from actions in tinygrad/codegen/opt/search.py)

Keep in mind that all of these optimizations are run at the AST stage, before the kernel is linearized and rendered into a program.

Examples of each optimization

Unroll

Upcast

Local

Group

Grouptop

Thread

TC

Swap

PadTo

Nolocals

Examples of kernels optimized with BEAM

pattern matchers + graph rewrites (the non-BEAM optimization path)

undocumented environment variables

There are a lot of random environment variables scattered around the tinygrad codebase, and most of these aren’t documented at all. These are incredibly useful for debugging, profiling, and understanding what’s going on. Here’s a list of all the ones I’ve found so far:

| variable | range | description |
| --- | --- | --- |
| IGNORE_BEAM_CACHE | 0 or nonzero | always regenerate BEAM kernels |
| CACHELEVEL | 0 or nonzero | 0 disables the disk cache |
| TRACK_MATCH_STATS | 0-3 | 1: basic tracking, 2: detailed trace data, 3: per-match timing |
| CUDA_PTX | 0 or nonzero | PTX codegen for Nvidia GPUs, use with NV=1 |
| NOOPT | 0 or nonzero | disables optimizations on the AST |
| DEVECTORIZE | 0 or nonzero | controls whether devectorization happens |
| HCQ_VISIBLE_DEVICES | device # | on AMD, choose which device tinygrad uses. just try each one |
| AMD_LLVM | 0 or nonzero | on AMD, use the LLVM-IR -> machine code path instead of comgr |

You can set some of these with the with helpers.Context(VAR=n): pattern, which is useful for enabling certain features for only part of a program, but it isn't guaranteed to work with every setting. Otherwise, set them with os.environ["VAR"] = "value" at the top of your file, or pass them when you run the program:

BEAM=1 TRACK_MATCH_STATS=2 python your_script.py
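
For the settings that do support it, the Context pattern looks like this (a sketch; it only applies to settings backed by a ContextVar in tinygrad.helpers):

from tinygrad.helpers import Context

with Context(BEAM=1, TRACK_MATCH_STATS=2):
  ...   # only work realized inside this block sees these settings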

random tweaks and issues

multilib handled poorly in c.py

Some Linux distros are multilib, meaning they install 32-bit libraries alongside 64-bit ones (for applications like Steam); Fedora is one of them. A lot of tinygrad (essentially all the code that runs your kernels on the GPU) relies on FFI (calling C functions from Python), and these libraries are loaded in tinygrad/runtime/support/c.py:42 (findlib).

The issue is that on Fedora, /lib contains 32-bit libraries:

boop@framework:/lib$ file libLTO.so.21.1
libLTO.so.21.1: ELF 32-bit LSB shared object, Intel i386, version 1 (SYSV), dynamically linked, BuildID[sha1]=574d091ee52fa7856759bc49c393f2c2a4636ae4, stripped

The 64-bit libs are actually at /lib64 or /usr/lib64/. You have to patch c.py to scan those first, and to exclude 32-bit ELFs by checking the magic numbers. If you installed CUDA, add its libs to scanlist as well.
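
The check itself is tiny. This isn't tinygrad's actual code, just the idea: byte 4 of an ELF header (EI_CLASS) is 1 for 32-bit and 2 for 64-bit.

def is_elf64(path: str) -> bool:
  # ELF magic is 0x7f 'E' 'L' 'F'; the next byte (EI_CLASS) is 1 for 32-bit, 2 for 64-bit
  with open(path, "rb") as f:
    header = f.read(5)
  return header[:4] == b"\x7fELF" and header[4:5] == b"\x02"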

wrong rocm path in amd_disassemble

If you have a non-standard ROCm installation (your ROCm isn’t in /opt/rocm), you’ll get a few errors while running VIZ=1, especially during disassembly. You might also encounter this if you run code with DEBUG>5.

Patch compiler_amd.py and edit amdgpu_disassemble to change the default location of llvm-objdump. This can point to your system llvm install or to your custom ROCm build of llvm.

tinygrad editable install not working with lsp?

If you install tinygrad with uv pip install -e ., it won’t generate types and completions for your LSP.

Run this instead: uv pip install -e . --config-settings editable_mode=strict.