From Python decorators to C++ kernel strings — a layer-by-layer tour of the compilation pipeline.
tt-lang is, fundamentally, a Python frontend that builds MLIR, runs a chain of MLIR passes that progressively lower the IR, and finally serializes a small per-thread C++ source string for every RISC-V core that will run on the Tenstorrent device. tt-metal then compiles those strings the same way it compiles a hand-written tt-metalium kernel.

[Pipeline diagram: ttl dialect → ttkernel dialect → EmitC and translation to C++ strings]
Everything below this diagram is a more careful walk through each stage.
The user never writes MLIR or C++. They write a Python function that looks like normal Python but actually has very restricted semantics — it is a small DSL pretending to be Python, much like JAX or Triton.
The canonical "hello world" is element-wise add (from examples/eltwise_add.py):
@ttl.operation(grid="auto")
def eltwise_add(a_in, b_in, out):
    a_dfb = ttl.make_dataflow_buffer_like(a_in, shape=(GRANULARITY, 1), block_count=2)
    b_dfb = ttl.make_dataflow_buffer_like(b_in, shape=(GRANULARITY, 1), block_count=2)
    out_dfb = ttl.make_dataflow_buffer_like(out, shape=(GRANULARITY, 1), block_count=2)

    @ttl.compute()
    def compute():
        node_col, node_row = ttl.node(dims=2)
        ...
        with a_dfb.wait() as a_blk, \
             b_dfb.wait() as b_blk, \
             out_dfb.reserve() as out_blk:
            out_blk.store(a_blk + b_blk)

    @ttl.datamovement()
    def read():
        ...
        with a_dfb.reserve() as a_blk, b_dfb.reserve() as b_blk:
            tx_a = ttl.copy(a_in[r0:r1, col:col+1], a_blk)
            tx_b = ttl.copy(b_in[r0:r1, col:col+1], b_blk)
            tx_a.wait(); tx_b.wait()

    @ttl.datamovement()
    def write():
        ...
A few things to notice:

- The @ttl.compute and @ttl.datamovement functions are registered into a per-call thread registry (_thread_registry in ttl_api.py); the outer @ttl.operation function is the entry point that drives the compile.
- ttl.node(dims=2) is the per-core (col, row) id.
- reserve() grabs a slot to write into; wait() blocks until a slot is full to read.
- The compute thread does tile math directly on whole blocks (a_blk + b_blk).
- ttl.copy issues an asynchronous transfer; tx.wait() blocks for completion.
- Other primitives follow the same pattern (not used in eltwise_add but present in matmul_1d_mcast.py, etc.).

Critically, a_blk + b_blk is not a real Python __add__. By the time tt-lang sees it, the function body has been re-parsed as an AST and the + is just a node that the AST walker decides to lower into a tile-add op.
When a decorated @ttl.operation function is called, tt-lang does not actually run the Python body in the normal sense. Instead:
- inspect.getsource() grabs the function text; _cleanup_source_code normalizes indentation. The text is then parsed with the standard library ast module.
- The @ttl.compute / @ttl.datamovement functions are pulled out of the registry and one TTLGenericCompiler instance is built per thread (see python/ttl/_src/ttl_ast.py).
- Each compiler walks the AST and emits MLIR into a shared module (using the ttl, ttkernel, ttcore, arith, func, ... dialects).

So when the visitor sees an ast.BinOp(left=Name("a_blk"), op=Add(), right=Name("b_blk")) inside a with that opened blocks of a tile-typed DFB, it emits an MLIR op like ttl.tile_add (or whichever per-element op the operator maps to). When it sees ttl.copy(...) it emits a ttl.copy op. When it sees the with a_dfb.wait() as a_blk it emits the wait/release scoped operations on the buffer.
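To make the "parsed, not executed" point concrete, here is a minimal self-contained sketch of the same trick. It is not the real TTLGenericCompiler — the function and names here (trace_tile_ops, the tuple format) are purely illustrative:

# Toy version of the AST walk: re-parse a function's source and record every
# `+` between names as a symbolic "tile_add", instead of executing __add__.
import ast
import inspect
import textwrap

def trace_tile_ops(fn):
    source = textwrap.dedent(inspect.getsource(fn))
    tree = ast.parse(source)
    ops = []

    class Walker(ast.NodeVisitor):
        def visit_BinOp(self, node):
            if isinstance(node.op, ast.Add):
                lhs = getattr(node.left, "id", "<expr>")
                rhs = getattr(node.right, "id", "<expr>")
                ops.append(("tile_add", lhs, rhs))  # the real compiler emits a ttl.tile_add op here
            self.generic_visit(node)

    Walker().visit(tree)
    return ops

def compute():
    out_blk = a_blk + b_blk  # never executed, only parsed

print(trace_tile_ops(compute))  # [('tile_add', 'a_blk', 'b_blk')]

The real visitor does the same thing, except that "record a tuple" becomes "emit an MLIR operation at the current insertion point".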
The result is one func.func per thread, all stitched into one builtin.module:
module attributes {ttl.target_arch = "wormhole_b0"} {
  func.func @eltwise_add_compute(...) attributes {ttkernel.thread = #ttkernel.thread<compute>} { ... }
  func.func @eltwise_add_read (...) attributes {ttkernel.thread = #ttkernel.thread<noc>} { ... }
  func.func @eltwise_add_write (...) attributes {ttkernel.thread = #ttkernel.thread<noc>} { ... }
}
If you set TTLANG_INITIAL_MLIR=/tmp/initial.mlir before running an example, this is exactly what you'll see written to that file.
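As a concrete (illustrative) recipe — the only requirement is that the variable is set before the decorated operation is first called:

# Sketch: dump the pre-pass MLIR for a run of the eltwise_add example.
import os
os.environ["TTLANG_INITIAL_MLIR"] = "/tmp/initial.mlir"

# ... build the ttnn tensors and call eltwise_add(a, b, out) as usual ...
# Afterwards /tmp/initial.mlir contains the module shown above.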
ttl dialect

The "initial MLIR" lives in a custom MLIR dialect called ttl. Its job is to preserve the user's intent as faithfully as possible: it has first-class ops for things like:
| tt-lang concept | ttl op (representative) |
|---|---|
| dataflow buffer reserve / wait | ttl.dfb_reserve / ttl.dfb_wait |
| tile-level math (a + b on tiles) | ttl.tile_add, ttl.tile_mul, ttl.tile_matmul, ... |
| explicit DMAs | ttl.copy |
| per-core address (ttl.node) | ttl.node_index |
The dialect is defined under include/ttlang/Dialect/TTL/ and implemented in lib/Dialect/TTL/. There is a separate ttkernel dialect that is much closer to the tt-metal C++ API.
The architectural rule: every level of abstraction that the user can express must have a 1:1 op in the dialect. This avoids the antipattern of "guess what the user meant by chasing SSA defs" — instead the semantics live on the operation itself, and lowering can be a localised rewrite.
Once the module is built, tt-lang assembles a pipeline string and runs it with PassManager.parse(...).run(module) (see ttl_api.py ~line 1584–1679). The same ordering also exists in C++ for the standalone ttlang-opt tool in lib/Dialect/TTL/Pipelines/TTLPipelines.cpp.
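A sketch of that driver step, assuming the MLIR Python bindings are packaged under ttmlir (the pass names below are a subset of the real list, which ttl_api.py assembles in full):

# Illustrative only — the real pipeline string and ordering live in ttl_api.py.
from ttmlir.passmanager import PassManager  # assumption: bindings packaged under ttmlir

def run_ttl_pipeline(module):
    passes = [
        "ttl-insert-intermediate-dfbs",
        "ttl-insert-copy-wait",
        "convert-ttl-to-compute",
        "convert-ttl-to-ttkernel",
        "convert-ttkernel-to-emitc",
    ]
    pipeline = "builtin.module(" + ",".join(passes) + ")"
    pm = PassManager.parse(pipeline, context=module.context)
    pm.run(module.operation)  # ttl_api.py calls .run(module); the exact signature depends on the bindings' vintage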
The passes can be grouped into three rough phases. Setting TTLANG_VERBOSE_PASSES=1 dumps the IR before and after every one of them, which is the easiest way to see what each does.
Phase 1: synchronization scaffolding (still ttl)

ttl-insert-intermediate-dfbs # auto-create CBs for fused intermediate values
ttl-insert-copy-wait # add wait points after copies the user forgot about
ttl-insert-cb-sync # cb_push_back / cb_pop_front pairs around regions
ttl-annotate-l1-acc-loops # mark loops eligible for L1 accumulation
These passes don't change the dialect — they fill in the synchronization scaffolding that the user gets to omit because it's mechanical.
Phase 2: lowering the compute (still ttl)

convert-ttl-to-compute # rewrite tile arithmetic into a "compute region"
ttl-set-compute-kernel-config # pick fp32_dest_acc_en, dst_full_sync_en, etc.
ttl-assign-dst # assign destination registers (the DST file)
ttl-subblock-compute-for-dst (opt.) # tile DST sub-block scheduling for throughput
ttl-lower-to-loops # turn block-level math into explicit affine loops
ttl-schedule-operations (opt.) # reorder ops to maximize DST utilisation
ttl-finalize-dfb-indices # bake CB indices into ops
ttl-annotate-cb-associations # tag which CB each tile op reads / writes
By the end of this phase the IR still references ttl.* ops, but the high-level value-typed math has been replaced by an explicit compute region with affine loops over tile coordinates and ops that name a specific DST register.
Phase 3: lowering to ttkernel and to C

convert-ttl-to-ttkernel # ttl ops -> ttkernel API calls
ttkernel-insert-inits # binary_op_init_common, mm_init, etc.
ttkernel-insert-l1-accumulation # L1-accum specific guard ops
ttkernel-combine-pack-tiles (opt.) # merge adjacent pack_tile calls
canonicalize, cse # standard cleanups
lower-affine # affine loops -> scf / cf
ttl-lower-signpost-to-emitc # profiling signposts -> emitc.call
convert-ttkernel-to-emitc # the big rewrite to the EmitC dialect
symbol-dce # drop unused functions
The pipeline is run once over the whole module — all three threads of a kernel are lowered together so synchronization between them stays consistent.
ttkernel dialect

Mid-pipeline the IR transitions to the ttkernel dialect. ttkernel is intentionally close to the tt-metal "kernel API" surface that a tt-metalium user types in by hand:
// after convert-ttl-to-ttkernel (sketch)
%cb_in0 = ttkernel.get_compile_time_arg : !ttkernel.cb
%cb_in1 = ttkernel.get_compile_time_arg : !ttkernel.cb
%cb_out = ttkernel.get_compile_time_arg : !ttkernel.cb
ttkernel.binary_op_init_common(%cb_in0, %cb_in1, %cb_out)
scf.for %i = ... {
  ttkernel.cb_wait_front %cb_in0, 1
  ttkernel.cb_wait_front %cb_in1, 1
  ttkernel.cb_reserve_back %cb_out, 1
  ttkernel.tile_regs_acquire
  ttkernel.add_tiles(%cb_in0, %cb_in1, 0, 0, 0)
  ttkernel.tile_regs_commit
  ttkernel.tile_regs_wait
  ttkernel.pack_tile(0, %cb_out)
  ttkernel.tile_regs_release
  ttkernel.cb_pop_front %cb_in0, 1
  ttkernel.cb_pop_front %cb_in1, 1
  ttkernel.cb_push_back %cb_out, 1
}
Each operation here corresponds 1:1 to an inline function in tt_metal/include/compute_kernel_api/*.h. The dialect carries enough metadata (tile types, CB ids, thread type) for the EmitC pass below to print a runnable C++ file.
Each func.func in the module is also tagged with a ttkernel.thread = #ttkernel.thread<compute|noc> attribute, which is how the next stage decides what kind of kernel to emit and which header file to include.
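For illustration, reading that tag back from Python looks roughly like this (the real helper is get_ttkernel_names in ttl_api.py; this is a sketch of the idea, not its implementation):

# Walk the module's top-level func.func ops and collect (name, thread kind) pairs.
def list_kernel_threads(module):
    kernels = []
    for op in module.body.operations:
        attrs = op.operation.attributes
        if op.operation.name != "func.func" or "ttkernel.thread" not in attrs:
            continue
        name = str(attrs["sym_name"]).strip('"')   # e.g. eltwise_add_compute
        thread = str(attrs["ttkernel.thread"])     # e.g. #ttkernel.thread<compute>
        kernels.append((name, thread))
    return kernels

The same attribute is what the packaging step below uses to decide whether a kernel is destined for a TRISC, NCRISC, or BRISC core.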
EmitC and translation to C++ strings

MLIR has an upstream dialect called emitc that models a small C-like subset (calls, locals, expressions, types). convert-ttkernel-to-emitc rewrites every ttkernel.* op into an emitc.call referencing the corresponding tt-metal kernel-API function, plus the right emitc.includes and void main() wrapper.
Once the module is in emitc, lowering to a C++ string is mechanical — MLIR's built-in mlir-translate --mlir-to-cpp handles it. tt-lang exposes this through a tiny C++ helper bound into Python:
// python/ttmlir/TT_MLIRMinimalPasses.cpp
m.def(
    "ttkernel_to_cpp_by_name",
    [](MlirModule module, const std::string &kernelName) -> std::string {
      // Convert to EmitC first (idempotent if already done)
      if (!ttmlirMinimalRunTTKernelToEmitC(module)) { /* error */ }
      // Translate ONE func.func by name to a self-contained C++ source string
      char *result = ttmlirMinimalTranslateKernelToCpp(module, kernelName.c_str());
      std::string output(result);
      free(result);
      return output;
    });
The Python side then loops once per thread and asks for that thread's C++:
# ttl_api.py, _compile_ttnn_kernel
for name, thread_type in get_ttkernel_names(module):
    cpp_source = ttkernel_to_cpp_by_name(module, name)   # <-- the string!
    kernel_path = _write_kernel_to_tmp(name, cpp_source)  # write it to /tmp/.../foo.cpp
    ...
So the canonical "output" of the tt-lang compiler — what tt-metal will eventually see — is a small set of C++ strings, one per RISC-V core. For a typical operation that's three: compute.cpp, reader.cpp, and writer.cpp.
tt-lang doesn't run the kernel itself. It hands the kernel strings, plus metadata, to the existing tt-metal runtime. The packaging step (_compile_ttnn_kernel in ttl_api.py) does a few things:
- Based on the ttkernel.thread attributes it labels each kernel as compute (TRISC0/1/2), reader (NCRISC), or writer (BRISC).
- If the compute kernel contains a reduce_tile/matmul_block call, it flips on fp32_dest_acc_en automatically.
- Each func.func carries an arg_spec attribute listing what the kernel expects at runtime; get_ttkernel_arg_spec reads it back into Python.
- The ttl.base_cta_index and ttl.crta_indices attributes (set during AST→MLIR) tell tt-metal how the user tensors map to circular buffers.
- Everything is then dispatched through ttnn.generic_op. tt-metal picks up the C++ files, compiles them with its own RISC-V toolchains, links them against the tt-metal kernel API, and launches the program on the device just like any hand-written tt-metalium kernel.

Optionally tt-lang also emits a kernel bundle on disk under ttlang_kernel_dumps/program_<hash>/ — the C++ files plus a JSON manifest describing the program. That's how examples can be replayed (or shared) without re-running the Python frontend.
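End to end, the user-facing flow is just a function call. A minimal sketch, assuming a ttnn device and tensor setup along these lines (the shapes and tensor-creation calls are illustrative, not taken from the example):

# Calling the decorated function compiles the three kernels, then dispatches
# them through ttnn.generic_op as described above.
import torch
import ttnn

device = ttnn.open_device(device_id=0)

a   = ttnn.from_torch(torch.rand(1024, 1024), layout=ttnn.TILE_LAYOUT, device=device)
b   = ttnn.from_torch(torch.rand(1024, 1024), layout=ttnn.TILE_LAYOUT, device=device)
out = ttnn.from_torch(torch.zeros(1024, 1024), layout=ttnn.TILE_LAYOUT, device=device)

eltwise_add(a, b, out)

print(ttnn.to_torch(out)[:2, :2])   # should match (a + b) in the top-left corner
ttnn.close_device(device)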
If you squint, every transition in the pipeline is the same trick:
pick a dialect whose ops match the current level of abstraction, and rewrite each op locally into the ops of the level below — the Python source at the top, then the three dialect families the module passes through (ttl, ttkernel, emitc).

The user only ever writes layer 1. Everything else is a series of well-defined, individually testable rewrites. That's why you can debug a tt-lang program either by reading its Python source, by dumping TTLANG_INITIAL_MLIR, by stepping through TTLANG_VERBOSE_PASSES=1, or by reading the final C++ in ttlang_kernel_dumps/ — they're four views of the same program at four levels of detail.