From Python decorators to C++ kernel strings — a layer-by-layer tour of the compilation pipeline.
tt-lang is, fundamentally, a Python frontend that builds MLIR, runs a chain of MLIR passes that progressively lower the IR, and finally serializes a small per-thread C++ source string for every RISC-V core that will run on the Tenstorrent device. tt-metal then compiles those strings the same way it compiles a hand-written tt-metalium kernel.

[Pipeline diagram: ttl dialect → ttkernel dialect → EmitC and translation to C++ strings]
Everything below this diagram is a more careful walk through each stage.
The user never writes MLIR or C++. They write a Python function that looks like normal Python but actually has very restricted semantics — it is a small DSL pretending to be Python, much like JAX or Triton.
The canonical "hello world" is element-wise add (from examples/eltwise_add.py):
@ttl.operation(grid="auto")
def eltwise_add(a_in, b_in, out):
    a_dfb = ttl.make_dataflow_buffer_like(a_in, shape=(GRANULARITY, 1), block_count=2)
    b_dfb = ttl.make_dataflow_buffer_like(b_in, shape=(GRANULARITY, 1), block_count=2)
    out_dfb = ttl.make_dataflow_buffer_like(out, shape=(GRANULARITY, 1), block_count=2)

    @ttl.compute()
    def compute():
        node_col, node_row = ttl.node(dims=2)
        ...
        with a_dfb.wait() as a_blk, \
             b_dfb.wait() as b_blk, \
             out_dfb.reserve() as out_blk:
            out_blk.store(a_blk + b_blk)

    @ttl.datamovement()
    def read():
        ...
        with a_dfb.reserve() as a_blk, b_dfb.reserve() as b_blk:
            tx_a = ttl.copy(a_in[r0:r1, col:col+1], a_blk)
            tx_b = ttl.copy(b_in[r0:r1, col:col+1], b_blk)
            tx_a.wait(); tx_b.wait()

    @ttl.datamovement()
    def write():
        ...
A few things to notice:

- The @ttl.compute and @ttl.datamovement functions are registered into a per-call thread registry (_thread_registry in ttl_api.py); the outer @ttl.operation function is the entry point that drives the compile.
- ttl.node(dims=2) is the per-core (col, row) id.
- reserve() grabs a slot to write into; wait() blocks until a slot is full to read.
- The compute thread does tile math directly on whole blocks (a_blk + b_blk).
- ttl.copy issues an asynchronous transfer; tx.wait() blocks for completion.
- Other primitives follow the same pattern (not used in eltwise_add but present in matmul_1d_mcast.py, etc.).

Critically, a_blk + b_blk is not a real Python __add__. By the time tt-lang sees it, the function body has been re-parsed as an AST and the + is just a node that the AST walker decides to lower into a tile-add op.
When a decorated @ttl.operation function is called, tt-lang does not actually run the Python body in the normal sense. Instead:
- inspect.getsource() grabs the function text; _cleanup_source_code normalizes indentation. The text is then parsed with the standard library ast module.
- The @ttl.compute / @ttl.datamovement functions are pulled out of the registry and one TTLGenericCompiler instance is built per thread (see python/ttl/_src/ttl_ast.py).
- Each compiler walks the AST and emits MLIR into a shared module (using the ttl, ttkernel, ttcore, arith, func, ... dialects).

So when the visitor sees an ast.BinOp(left=Name("a_blk"), op=Add(), right=Name("b_blk")) inside a with that opened blocks of a tile-typed DFB, it emits an MLIR op like ttl.tile_add (or whichever per-element op the operator maps to). When it sees ttl.copy(...) it emits a ttl.copy op. When it sees the with a_dfb.wait() as a_blk it emits the wait/release scoped operations on the buffer.
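To make the "parsed, not executed" point concrete, here is a minimal self-contained sketch of the same trick. It is not the real TTLGenericCompiler — the function and names here (trace_tile_ops, the tuple format) are purely illustrative:

# Toy version of the AST walk: re-parse a function's source and record every
# `+` between names as a symbolic "tile_add", instead of executing __add__.
import ast
import inspect
import textwrap

def trace_tile_ops(fn):
    source = textwrap.dedent(inspect.getsource(fn))
    tree = ast.parse(source)
    ops = []

    class Walker(ast.NodeVisitor):
        def visit_BinOp(self, node):
            if isinstance(node.op, ast.Add):
                lhs = getattr(node.left, "id", "<expr>")
                rhs = getattr(node.right, "id", "<expr>")
                ops.append(("tile_add", lhs, rhs))  # the real compiler emits a ttl.tile_add op here
            self.generic_visit(node)

    Walker().visit(tree)
    return ops

def compute():
    out_blk = a_blk + b_blk  # never executed, only parsed

print(trace_tile_ops(compute))  # [('tile_add', 'a_blk', 'b_blk')]

The real visitor does the same thing, except that "record a tuple" becomes "emit an MLIR operation at the current insertion point".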
The result is one func.func per thread, all stitched into one builtin.module:
module attributes {ttl.target_arch = "wormhole_b0"} {
  func.func @eltwise_add_compute(...) attributes {ttkernel.thread = #ttkernel.thread<compute>} { ... }
  func.func @eltwise_add_read (...) attributes {ttkernel.thread = #ttkernel.thread<noc>} { ... }
  func.func @eltwise_add_write (...) attributes {ttkernel.thread = #ttkernel.thread<noc>} { ... }
}
If you set TTLANG_INITIAL_MLIR=/tmp/initial.mlir before running an example, this is exactly what you'll see written to that file.
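As a concrete (illustrative) recipe — the only requirement is that the variable is set before the decorated operation is first called:

# Sketch: dump the pre-pass MLIR for a run of the eltwise_add example.
import os
os.environ["TTLANG_INITIAL_MLIR"] = "/tmp/initial.mlir"

# ... build the ttnn tensors and call eltwise_add(a, b, out) as usual ...
# Afterwards /tmp/initial.mlir contains the module shown above.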
ttl dialect

The "initial MLIR" lives in a custom MLIR dialect called ttl. Its job is to preserve the user's intent as faithfully as possible: it has first-class ops for things like:
| tt-lang concept | ttl op (representative) |
|---|---|
| dataflow buffer reserve / wait | ttl.dfb_reserve / ttl.dfb_wait |
| tile-level math (a + b on tiles) | ttl.tile_add, ttl.tile_mul, ttl.tile_matmul, ... |
| explicit DMAs | ttl.copy |
| per-core address (ttl.node) | ttl.node_index |
The dialect is defined under include/ttlang/Dialect/TTL/ and implemented in lib/Dialect/TTL/. There is a separate ttkernel dialect that is much closer to the tt-metal C++ API.
The architectural rule: every level of abstraction that the user can express must have a 1:1 op in the dialect. This avoids the antipattern of "guess what the user meant by chasing SSA defs" — instead the semantics live on the operation itself, and lowering can be a localised rewrite.
Once the module is built, tt-lang assembles a pipeline string and runs it with PassManager.parse(...).run(module) (see ttl_api.py ~line 1584–1679). The same ordering also exists in C++ for the standalone ttlang-opt tool in lib/Dialect/TTL/Pipelines/TTLPipelines.cpp.
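A sketch of that driver step, assuming the MLIR Python bindings are packaged under ttmlir (the pass names below are a subset of the real list, which ttl_api.py assembles in full):

# Illustrative only — the real pipeline string and ordering live in ttl_api.py.
from ttmlir.passmanager import PassManager  # assumption: bindings packaged under ttmlir

def run_ttl_pipeline(module):
    passes = [
        "ttl-insert-intermediate-dfbs",
        "ttl-insert-copy-wait",
        "convert-ttl-to-compute",
        "convert-ttl-to-ttkernel",
        "convert-ttkernel-to-emitc",
    ]
    pipeline = "builtin.module(" + ",".join(passes) + ")"
    pm = PassManager.parse(pipeline, context=module.context)
    pm.run(module.operation)  # ttl_api.py calls .run(module); the exact signature depends on the bindings' vintage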
The passes can be grouped into three rough phases. Setting TTLANG_VERBOSE_PASSES=1 dumps the IR before and after every one of them, which is the easiest way to see what each does.
Phase 1: synchronization scaffolding (still ttl)

ttl-insert-intermediate-dfbs # auto-create CBs for fused intermediate values
ttl-insert-copy-wait # add wait points after copies the user forgot about
ttl-insert-cb-sync # cb_push_back / cb_pop_front pairs around regions
ttl-annotate-l1-acc-loops # mark loops eligible for L1 accumulation
These passes don't change the dialect — they fill in the synchronization scaffolding that the user gets to omit because it's mechanical.
Phase 2: lowering the compute (still ttl)

convert-ttl-to-compute # rewrite tile arithmetic into a "compute region"
ttl-set-compute-kernel-config # pick fp32_dest_acc_en, dst_full_sync_en, etc.
ttl-assign-dst # assign destination registers (the DST file)
ttl-subblock-compute-for-dst (opt.) # tile DST sub-block scheduling for throughput
ttl-lower-to-loops # turn block-level math into explicit affine loops
ttl-schedule-operations (opt.) # reorder ops to maximize DST utilisation
ttl-finalize-dfb-indices # bake CB indices into ops
ttl-annotate-cb-associations # tag which CB each tile op reads / writes
By the end of this phase the IR still references ttl.* ops, but the high-level value-typed math has been replaced by an explicit compute region with affine loops over tile coordinates and ops that name a specific DST register.
Phase 3: lowering to ttkernel and to C

convert-ttl-to-ttkernel # ttl ops -> ttkernel API calls
ttkernel-insert-inits # binary_op_init_common, mm_init, etc.
ttkernel-insert-l1-accumulation # L1-accum specific guard ops
ttkernel-combine-pack-tiles (opt.) # merge adjacent pack_tile calls
canonicalize, cse # standard cleanups
lower-affine # affine loops -> scf / cf
ttl-lower-signpost-to-emitc # profiling signposts -> emitc.call
convert-ttkernel-to-emitc # the big rewrite to the EmitC dialect
symbol-dce # drop unused functions
The pipeline is run once over the whole module — all three threads of a kernel are lowered together so synchronization between them stays consistent.
ttkernel dialect

Mid-pipeline the IR transitions to the ttkernel dialect. ttkernel is intentionally close to the tt-metal "kernel API" surface that a tt-metalium user types in by hand:
// after convert-ttl-to-ttkernel (sketch)
%cb_in0 = ttkernel.get_compile_time_arg : !ttkernel.cb
%cb_in1 = ttkernel.get_compile_time_arg : !ttkernel.cb
%cb_out = ttkernel.get_compile_time_arg : !ttkernel.cb
ttkernel.binary_op_init_common(%cb_in0, %cb_in1, %cb_out)
scf.for %i = ... {
  ttkernel.cb_wait_front %cb_in0, 1
  ttkernel.cb_wait_front %cb_in1, 1
  ttkernel.cb_reserve_back %cb_out, 1
  ttkernel.tile_regs_acquire
  ttkernel.add_tiles(%cb_in0, %cb_in1, 0, 0, 0)
  ttkernel.tile_regs_commit
  ttkernel.tile_regs_wait
  ttkernel.pack_tile(0, %cb_out)
  ttkernel.tile_regs_release
  ttkernel.cb_pop_front %cb_in0, 1
  ttkernel.cb_pop_front %cb_in1, 1
  ttkernel.cb_push_back %cb_out, 1
}
Each operation here corresponds 1:1 to an inline function in tt_metal/include/compute_kernel_api/*.h. The dialect carries enough metadata (tile types, CB ids, thread type) for the EmitC pass below to print a runnable C++ file.
Each func.func in the module is also tagged with a ttkernel.thread = #ttkernel.thread<compute|noc> attribute, which is how the next stage decides what kind of kernel to emit and which header file to include.
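For illustration, reading that tag back from Python looks roughly like this (the real helper is get_ttkernel_names in ttl_api.py; this is a sketch of the idea, not its implementation):

# Walk the module's top-level func.func ops and collect (name, thread kind) pairs.
def list_kernel_threads(module):
    kernels = []
    for op in module.body.operations:
        attrs = op.operation.attributes
        if op.operation.name != "func.func" or "ttkernel.thread" not in attrs:
            continue
        name = str(attrs["sym_name"]).strip('"')   # e.g. eltwise_add_compute
        thread = str(attrs["ttkernel.thread"])     # e.g. #ttkernel.thread<compute>
        kernels.append((name, thread))
    return kernels

The same attribute is what the packaging step below uses to decide whether a kernel is destined for a TRISC, NCRISC, or BRISC core.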
EmitC and translation to C++ strings

MLIR has an upstream dialect called emitc that models a small C-like subset (calls, locals, expressions, types). convert-ttkernel-to-emitc rewrites every ttkernel.* op into an emitc.call referencing the corresponding tt-metal kernel-API function, plus the right emitc.includes and void main() wrapper.
Once the module is in emitc, lowering to a C++ string is mechanical — MLIR's built-in mlir-translate --mlir-to-cpp handles it. tt-lang exposes this through a tiny C++ helper bound into Python:
// python/ttmlir/TT_MLIRMinimalPasses.cpp
m.def(
    "ttkernel_to_cpp_by_name",
    [](MlirModule module, const std::string &kernelName) -> std::string {
      // Convert to EmitC first (idempotent if already done)
      if (!ttmlirMinimalRunTTKernelToEmitC(module)) { /* error */ }
      // Translate ONE func.func by name to a self-contained C++ source string
      char *result = ttmlirMinimalTranslateKernelToCpp(module, kernelName.c_str());
      std::string output(result);
      free(result);
      return output;
    });
The Python side then loops once per thread and asks for that thread's C++:
# ttl_api.py, _compile_ttnn_kernel
for name, thread_type in get_ttkernel_names(module):
    cpp_source = ttkernel_to_cpp_by_name(module, name)   # <-- the string!
    kernel_path = _write_kernel_to_tmp(name, cpp_source)  # write it to /tmp/.../foo.cpp
    ...
So the canonical "output" of the tt-lang compiler — what tt-metal will eventually see — is a small set of C++ strings, one per RISC-V core. For a typical operation that's three: compute.cpp, reader.cpp, and writer.cpp.
tt-lang doesn't run the kernel itself. It hands the kernel strings, plus metadata, to the existing tt-metal runtime. The packaging step (_compile_ttnn_kernel in ttl_api.py) does a few things:
- Based on the ttkernel.thread attributes it labels each kernel as compute (TRISC0/1/2), reader (NCRISC), or writer (BRISC).
- If the compute kernel contains a reduce_tile/matmul_block call, it flips on fp32_dest_acc_en automatically.
- Each func.func carries an arg_spec attribute listing what the kernel expects at runtime; get_ttkernel_arg_spec reads it back into Python.
- The ttl.base_cta_index and ttl.crta_indices attributes (set during AST→MLIR) tell tt-metal how the user tensors map to circular buffers.
- Everything is then dispatched through ttnn.generic_op. tt-metal picks up the C++ files, compiles them with its own RISC-V toolchains, links them against the tt-metal kernel API, and launches the program on the device just like any hand-written tt-metalium kernel.

Optionally tt-lang also emits a kernel bundle on disk under ttlang_kernel_dumps/program_<hash>/ — the C++ files plus a JSON manifest describing the program. That's how examples can be replayed (or shared) without re-running the Python frontend.
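End to end, the user-facing flow is just a function call. A minimal sketch, assuming a ttnn device and tensor setup along these lines (the shapes and tensor-creation calls are illustrative, not taken from the example):

# Calling the decorated function compiles the three kernels, then dispatches
# them through ttnn.generic_op as described above.
import torch
import ttnn

device = ttnn.open_device(device_id=0)

a   = ttnn.from_torch(torch.rand(1024, 1024), layout=ttnn.TILE_LAYOUT, device=device)
b   = ttnn.from_torch(torch.rand(1024, 1024), layout=ttnn.TILE_LAYOUT, device=device)
out = ttnn.from_torch(torch.zeros(1024, 1024), layout=ttnn.TILE_LAYOUT, device=device)

eltwise_add(a, b, out)

print(ttnn.to_torch(out)[:2, :2])   # should match (a + b) in the top-left corner
ttnn.close_device(device)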
If you squint, every transition in the pipeline is the same trick:
pick a dialect whose ops match the current level of abstraction, and rewrite each op locally into the ops of the level below — the Python source at the top, then the three dialect families the module passes through (ttl, ttkernel, emitc).

The user only ever writes layer 1. Everything else is a series of well-defined, individually testable rewrites. That's why you can debug a tt-lang program either by reading its Python source, by dumping TTLANG_INITIAL_MLIR, by stepping through TTLANG_VERBOSE_PASSES=1, or by reading the final C++ in ttlang_kernel_dumps/ — they're four views of the same program at four levels of detail.