How one developer spent four and a half months writing a pure-Python userspace driver, runtime, RISC-V assembler, and kernel compiler for a Tenstorrent Blackhole AI accelerator — from a single ioctl to a 120-core kernel launch.
A reverse-engineered, dependency-free replacement for Tenstorrent's entire tt-metal software stack — written in Python, talking straight to the silicon.
Tenstorrent's Blackhole is an AI accelerator. A normal program reaches it through tt-metal: tens of thousands of lines of C++, a GCC-based RISC-V toolchain (sfpi), a kernel-mode driver, and a host runtime. blackhole-py throws out everything above the kernel driver and asks a harder question: what is the absolute minimum you need to make this chip compute?
The answer, built up commit by commit, is a stack of pure-Python layers. At the bottom it opens a character device and fires raw ioctls; at the top it lets you write a Tensix compute kernel in a Python DSL, assembles it to RISC-V machine code in-process, uploads the firmware over PCIe, and launches it across the whole chip. No C++. No external assembler. No tt-metal.
Opens /dev/tenstorrent (and later VFIO), drives the card with hand-written ctypes ioctl structs, pins host DMA pages, maps PCIe BAR windows.
TLB windows aim a slice of the BAR at an (noc_x, noc_y, addr) on-chip location for mmap'd reads/writes.
RISC-V plus Tensix custom ops (TTUNPACR, TTPACR, TTSFPLOADI…). Labels, fixups, ELF-style segments, mixin-built kernels.
Even a "from scratch" project needs ground truth. tt-metal-deps is the carved-out skeleton of tt-metal that blackhole-py measured itself against, and eventually outgrew.
For most of the project, tt-metal-deps was the lifeline: a ~219 MB extraction of just the pieces of Tenstorrent's stack that the Python code genuinely couldn't reproduce — chiefly a real RISC-V toolchain and the header definitions that describe the hardware's mailboxes and memory map.
Three things mattered here. The include/ headers (dev_msgs.h and friends) define the host↔device contract — the exact byte layout of launch messages, go-signals and mailboxes the firmware reads. The toolchain/blackhole/*.ld linker scripts encode each core's memory map. And sfpi-toolchain/ was the actual compiler used while firmware was still C++.
The whole arc of the back half of the project is the story of shedding this dependency. The setup script marched through tt-metal-deps v0.7.0 → v0.8.0, and then on Apr 21 came the line that changed everything: "…no more tt-metal-deps. not functional". Once the Python assembler could emit RISC-V directly, the GCC toolchain became dead weight. The headers' knowledge got absorbed into ttk/addrs.py and ttk/_cfg_regs.py. Today tt-metal-deps survives only in the old snapshot next door — kept around as an oracle to diff against, not a thing the live code needs.
Python LOC tracked at ten checkpoints. The story isn't a line going up — it's a mountain. The codebase grew to almost 19k lines, then the author deleted more than half of it.
Peak (~18.8k) includes the throwaway emulator + C++ firmware + tests. The cliff is the pure-Python pivot: delete the emulator, delete C++, and relentlessly simplify.
LOC counted as raw wc -l across all .py files present at each commit (includes emulator & tests at the peak). Directional, not exact.
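The counting method described here can be approximated in a few lines of Python. This is a sketch rather than the author's actual script: the function name `count_py_lines` is hypothetical, and it counts newline bytes the way `wc -l` does.

```python
from pathlib import Path


def count_py_lines(root: str) -> int:
    """Sum newline counts over every .py file under root, like `wc -l`."""
    total = 0
    for path in Path(root).rglob("*.py"):
        # wc -l counts newline bytes, so a last line without a trailing
        # newline is not counted; reading raw bytes mirrors that behaviour.
        total += path.read_bytes().count(b"\n")
    return total
```

Running it against a checkout at each checkpoint commit would give one data point per checkpoint.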
Reading the commit log front to back, the project moves through distinct epochs — each one a layer of the stack coming alive, breaking, and getting rebuilt.
The repo is born as exactly two files: main.py and autogen.py, 253 lines total. main.py opens /dev/tenstorrent/0, builds a TenstorrentGetDeviceInfoIn/Out struct by hand, and fires GET_DEVICE_INFO to decode the PCIe BDF. It asserts arch in ("p100a","p150b") — "only blackhole is supported" — and mmaps BAR0/BAR1. autogen.py is generated ctypes for the kernel driver's ioctl ABI.
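As a rough illustration of the "struct by hand" step, the sketch below parses a device-info reply with ctypes and decodes a PCIe BDF. The struct layout, its field names, and the sample values are assumptions made for this example; the real layout comes from the kernel driver's ioctl header. Only the bus/device/function bit split is the standard PCI encoding.

```python
import ctypes
import struct


class DeviceInfoOut(ctypes.LittleEndianStructure):
    # Hypothetical layout for illustration only; not the driver's real ABI.
    _fields_ = [
        ("output_size_bytes", ctypes.c_uint32),
        ("vendor_id", ctypes.c_uint16),
        ("device_id", ctypes.c_uint16),
        ("bus_dev_fn", ctypes.c_uint16),
        ("pci_domain", ctypes.c_uint16),
    ]


def decode_bdf(domain: int, bus_dev_fn: int) -> str:
    """Standard PCI packing: bus in bits 15:8, device in 7:3, function in 2:0."""
    bus = (bus_dev_fn >> 8) & 0xFF
    dev = (bus_dev_fn >> 3) & 0x1F
    fn = bus_dev_fn & 0x7
    return f"{domain:04x}:{bus:02x}:{dev:02x}.{fn:x}"


# Stand-in for the bytes an ioctl would fill in (made-up sample values).
raw = struct.pack("<IHHHH", 12, 0x1234, 0x5678, 0x0100, 0)
info = DeviceInfoOut.from_buffer_copy(raw)
print(decode_bdf(info.pci_domain, info.bus_dev_fn))
```

On real hardware the `raw` bytes would come back from `fcntl.ioctl` on the opened device node instead of `struct.pack`.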
To touch anything beyond the BAR you need TLB windows: programmable apertures that retarget a slice of the PCIe BAR at an on-chip (noc_x, noc_y, addr). They land in a256625 — but ARC harvesting reads don't work yet, a fact the author dutifully records in the commit message. The fix arrives the next morning at 4:40 AM. The NOC coordinate system and tile-indexing math follow.
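The aperture idea can be sketched without hardware. In the example below the 2 MiB window size, the `split_address` helper and the `WindowCache` class are all assumptions chosen for illustration; the real driver programs TLB registers where this sketch only bumps a counter.

```python
WINDOW_SIZE = 2 * 1024 * 1024  # assumption: a 2 MiB window, for illustration


def split_address(addr, window_size=WINDOW_SIZE):
    """Split an on-chip address into (window base, offset inside the window)."""
    offset = addr & (window_size - 1)  # low bits index into the mapped aperture
    base = addr - offset               # high bits get programmed into the TLB
    return base, offset


class WindowCache:
    """Hypothetical helper that remembers the last target, so repeated
    accesses to the same region skip reprogramming the window."""

    def __init__(self):
        self.target = None
        self.reprograms = 0

    def aim(self, noc_x, noc_y, addr):
        base, offset = split_address(addr)
        if self.target != (noc_x, noc_y, base):
            self.target = (noc_x, noc_y, base)
            self.reprograms += 1       # real code would write TLB registers here
        return offset                  # caller then reads/writes mmap[offset]
```

Two accesses that fall in the same window on the same tile reuse the mapping; changing the tile or crossing a window boundary forces a retarget.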
Each Tensix tile has five RISC-V cores: BRISC & NCRISC (data-movement babies that drive NOC transfers) and TRISC0/1/2 (the compute trio — unpacker, FPU/math, packer). Firmware starts as prebuilt .elf files checked into the repo, then self-compiled C++ via the sfpi toolchain. The payoff is an all-caps milestone, immediately undercut by a hardware reality.
Two long-running battles overlap here. First, DRAM reads keep returning mismatched data, and the interleaved bank tables get rewritten again and again. Second, the project moves from slow dispatch (the host pushes every command over PCIe) to fast dispatch (the host writes commands into a host-memory ring; on-device cq_prefetch/cq_dispatch firmware streams them to the cores). Fast dispatch was a multi-week grind, including a commit literally named "non-working," promptly reverted.
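The host-memory ring at the heart of fast dispatch can be modeled in plain Python. This is a toy model under stated assumptions: the `CommandRing` class and its byte-at-a-time copy are hypothetical, and the real ring lives in pinned host memory with the read pointer advanced by device firmware rather than by a method call.

```python
class CommandRing:
    """Toy single-producer/single-consumer ring with wrap-around."""

    def __init__(self, size: int):
        self.buf = bytearray(size)
        self.size = size
        self.wr = 0  # total bytes the host has written
        self.rd = 0  # total bytes the device has consumed

    def free(self) -> int:
        return self.size - (self.wr - self.rd)

    def push(self, cmd: bytes) -> bool:
        """Host side: enqueue a command, or report that the ring is full."""
        if len(cmd) > self.free():
            return False  # host must wait for the device to catch up
        for b in cmd:
            self.buf[self.wr % self.size] = b
            self.wr += 1
        return True

    def pop(self, n: int) -> bytes:
        """Device side: consume up to n bytes in FIFO order."""
        n = min(n, self.wr - self.rd)
        out = bytes(self.buf[(self.rd + i) % self.size] for i in range(n))
        self.rd += n
        return out
```

Keeping the pointers as monotonically increasing totals and reducing them modulo the size only when indexing avoids the ambiguity between a full ring and an empty one.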
With kernels running, attention turns to performance. The matmul_peak benchmark climbs through blocking strategies, packer L1 accumulation, HiFi2 math fidelity and bf16, going from ~60 to 94 to 151 TFLOPS (the very first PR is literally #1 matmul-block-94tflops). A profiler appears that reads on-device perf counters and renders an HTML flame UI. Board detection grows to handle P150 alongside P100. A compiler.py + IR (dispatch.py, hw.py, kernels.py) splits out, and the old abi/autogen gets removed.
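Throughput figures for a matmul benchmark conventionally count 2·M·N·K floating-point operations (one multiply and one add per inner-product term). A minimal sketch of that arithmetic follows; the matrix sizes and the timing are made-up inputs, not measurements from the project, and the function name is hypothetical.

```python
def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """TFLOPS for an (m x k) @ (k x n) matmul that took `seconds`."""
    flops = 2 * m * n * k  # one multiply + one add per inner-product term
    return flops / seconds / 1e12


# Made-up example: an 8192-cubed matmul finishing in 10 ms.
print(round(matmul_tflops(8192, 8192, 8192, 0.010), 2))
```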
The most radical phase. The author bets that the entire LLVM/GCC toolchain and the C++ firmware can be replaced by Python. First a DSL that emits "all riscv + tenstorrent ops" (cc94cf1). Then a full instruction-level emulator (emulator part 1 / part 2) so kernels can be brought up off-device. Then the scorched-earth commits: the profiler, the IR, and tt-metal-deps all go in one swing — with an honest "not functional" tacked on — followed by the deletion of ~5,000 lines of C++ firmware in a single commit.
The assembler stabilizes (asm.py final form; split into asm/program/dsl), and then — core by core — all five RISC-V firmwares come back to life, now authored in Python and assembled in-process. Once the real firmware works on the device, the emulator that bootstrapped it is deleted; it had done its job. The assembler grows composable mixins that become today's ttk/ toolkit.
The final stretch ties the bow: match Tenstorrent's ttas assembler output, integrate their ttsim simulator for validation, and use it to auto-name mystery registers. The canonical "add 1 to every element" kernel runs end-to-end on the all-Python stack and scales to 120 cores. And then — true to form — the project ends mid-debugging: fast dispatch works, but breaks after the 81st iteration.
Every from-scratch project has a moment of conviction. For blackhole-py it's d463403.
One commit removes 11 files and 4,955 lines of C++: cq_prefetch.cpp (1,608), cq_dispatch.cpp (1,019), cq_common.hpp (538), brisc.cc (419), and the rest. At that exact moment the tree was not functional — the author had already admitted as much in the previous commit. It took three weeks of Python firmware work (Phase 6) to climb back to a working device. Deleting working C++ to bet on unwritten Python is the whole personality of this repo.
Before, in C++:

    // firmware/cq/cq_prefetch.cpp: 1608 lines
    // compiled with riscv-tt-elf-g++,
    // linked against firmware_brisc.ld,
    // from tt-metal-deps/

After, in Python:

    from dsl import TTPACR, TTUNPACR, jal, sw

    class Asm(Tensix, Noc, Cb, Flow, Debug):
        def emit(self, *insns): ...
        # asm.py: labels, fixups, segments
        # every instruction → 4 little-endian bytes
The crown jewel: ~1,400 lines of Python replacing an entire LLVM/GCC RISC-V toolchain.
dsl.py (566 LOC) is a metaclass-driven instruction encoder. Each RISC-V format — RType, IType, SType, BType, UType, JType — is a class whose fields are BitField descriptors that pack/sign-extend into a 32-bit word. On top of the base ISA sit the Tensix custom ops the matrix engine speaks: TTUNPACR (unpack), TTPACR (pack), TTSFPLOADI (vector-engine load-immediate), TTSEMWAIT, TTZEROACC, and dozens more — each one a bitfield struct that emits a single instruction word.
In outline:

    class BitField:
        def encode(self, val):
            ...  # pack into [hi:lo], sign-extend if signed

    class SType(Inst):
        # split immediate: imm[11:5] | imm[4:0]
        def imm(self):
            return _sext((self.imm11_5 << 5) | self.imm4_0, 12)

    class JType(Inst):
        # the gnarly scrambled jump immediate
        def imm(self):
            return _sext(
                (self.imm20 << 20) | (self.imm19_12 << 12)
                | (self.imm11 << 11) | (self.imm10_1 << 1),
                21,
            )
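To make the bit-packing concrete, here is a standalone sketch of I-type and J-type encoding that follows the standard RISC-V base formats. It does not use the project's `BitField` machinery; the function names are hypothetical and plain shifts stand in for the descriptor classes.

```python
def sext(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def encode_itype(opcode: int, rd: int, funct3: int, rs1: int, imm: int) -> int:
    """I-type: imm[11:0] | rs1 | funct3 | rd | opcode."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def encode_jtype(opcode: int, rd: int, imm: int) -> int:
    """J-type: imm[20] | imm[10:1] | imm[11] | imm[19:12] | rd | opcode."""
    imm &= 0x1FFFFF
    return (
        (((imm >> 20) & 0x1) << 31)
        | (((imm >> 1) & 0x3FF) << 21)
        | (((imm >> 11) & 0x1) << 20)
        | (((imm >> 12) & 0xFF) << 12)
        | (rd << 7)
        | opcode
    )


def decode_jtype_imm(word: int) -> int:
    """Reassemble the scrambled J-type immediate and sign-extend it."""
    imm20 = (word >> 31) & 0x1
    imm10_1 = (word >> 21) & 0x3FF
    imm11 = (word >> 20) & 0x1
    imm19_12 = (word >> 12) & 0xFF
    return sext((imm20 << 20) | (imm19_12 << 12) | (imm11 << 11) | (imm10_1 << 1), 21)


print(hex(encode_itype(0x13, rd=1, funct3=0, rs1=0, imm=5)))  # addi x1, x0, 5
print(hex(encode_jtype(0x6F, rd=0, imm=-4)))                  # jal x0, -4
```

Decoding what was just encoded is a cheap self-check for any new instruction format.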
asm.py (469 LOC) is the assembler proper. The Asm class inherits from five ttk mixins — Tensix, Noc, Cb, Flow, Debug — so a kernel author writes self.write32(addr, val), self.delay_cycles(n), self.cb_wait_front(...) instead of raw instructions. It tracks register constants for peephole optimization (_reg_const, _const_delta), resolves label()/Fixup references in a second pass, and emits raw bytes. program.py (407 LOC) handles ELF-like ptload segments, dtypes and tile sizes.
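The label-and-fixup pass can be illustrated with a toy assembler. `TinyAsm` below is a hypothetical stand-in, not the project's `Asm` class: it supports only `addi` and `jal`, records each jump as a fixup while emitting, patches the immediates once all labels are known, and serializes every word as 4 little-endian bytes.

```python
import struct


class TinyAsm:
    def __init__(self):
        self.words = []   # encoded 32-bit instruction words
        self.labels = {}  # label name -> byte offset
        self.fixups = []  # (word index, label name) awaiting resolution

    def label(self, name):
        self.labels[name] = len(self.words) * 4

    def addi(self, rd, rs1, imm):
        self.words.append(((imm & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x13)

    def jal(self, rd, target):
        # Emit the jump with a zero immediate and remember to patch it.
        self.fixups.append((len(self.words), target))
        self.words.append((rd << 7) | 0x6F)

    def assemble(self) -> bytes:
        words = list(self.words)
        for index, name in self.fixups:
            rel = (self.labels[name] - index * 4) & 0x1FFFFF  # PC-relative
            words[index] |= (
                (((rel >> 20) & 0x1) << 31)
                | (((rel >> 1) & 0x3FF) << 21)
                | (((rel >> 11) & 0x1) << 20)
                | (((rel >> 12) & 0xFF) << 12)
            )
        return b"".join(struct.pack("<I", w) for w in words)


a = TinyAsm()
a.jal(0, "end")   # forward reference, unknown until "end" is defined
a.label("loop")
a.addi(1, 1, 1)
a.jal(0, "loop")  # backward reference
a.label("end")
a.addi(0, 0, 0)   # nop
code = a.assemble()
print(code.hex())
```

Deferring resolution to `assemble()` is what lets the forward jump to `end` be written before the label exists.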
The result is that examples/add1.py — the canonical kernel — is written entirely in this DSL: MOP (macro-op) expander templates of TTPACR/TTUNPACR/TTNOP slots, semaphore handshakes between TRISC cores, and circular-buffer bookkeeping, all in Python that compiles to machine code with no external tool in the loop.
339 messages, one author, no filter. The log reads like a lab notebook — triumph, defeat, and dry self-awareness, often in the same line.
Even as a one-person project, it ran on feature branches and pull requests: 18 merged PRs from the author's own branches (#1 matmul-block-94tflops, #8 codex/always-upload-firmware-startup, #11 tlb-window-fix). There's even a commit, 5700518 "migrate to jj", that marks a detour into the Jujutsu version-control system mid-project.
6afb021 · May 30, 2026 · ~8,079 lines of Python across 23 files.
| File | LOC | Role |
|---|---|---|
| pcie.py | 1059 | ioctl/VFIO ABI, BAR mapping, TLB windows, sysmem, board detection |
| examples/add1.py | 701 | the canonical kernel, hand-written in the DSL |
| fw/cq.py | 699 | fast-dispatch prefetch/dispatch firmware (was 2,600+ lines of C++) |
| tt-smi.py | 628 | a clone of Tenstorrent's status tool, with a TUI fan editor |
| dsl.py | 566 | RISC-V + Tensix instruction encoder |
| asm.py | 469 | assembler: labels, fixups, segments, mixin kernels |
| cq.py | 463 | host-side command queue / dispatch |
| program.py | 407 | ELF-style ptload segments, dtypes, tile sizes |
| device.py | 232 | high-level device: open, harvest, map, launch |
| dram.py | 122 | DRAM bank allocator |
| ttk/ · fw/ | ~3.4k | toolkit mixins (cb, noc, tensix, sfpu, flow, debug, addrs) + Python firmware |
The journey: open the device → read ARC harvesting → map TLBs → load core firmware → run one kernel (slow dispatch) → build fast dispatch and chase matmul TFLOPs → add a profiler & board support → then the big bet: delete the C++ toolchain and firmware, write a Python RISC-V assembler + emulator + firmware, validate against ttsim/ttas, and land add1 on 120 cores. It ends, fittingly, mid-debug on the 81st iteration.