a git archaeology report

blackhole-py

How one developer spent four and a half months writing a pure-Python userspace driver, runtime, RISC-V assembler, and kernel compiler for a Tenstorrent Blackhole AI accelerator — from a single ioctl to a 120-core kernel launch.

339 commits | Jan 13 – May 30, 2026 | 1 author · anu <hi@anuraagw.me> | ~8,000 lines of Python at HEAD | arch · blackhole p100a / p150b
5
RISC-V cores per tile, driven from Python
~5k
lines of C++ firmware deleted in one commit
120
cores in the final working launch
81
the iteration fast-dispatch still breaks on

00 · What is this thing?

A reverse-engineered, dependency-free replacement for Tenstorrent's entire tt-metal software stack — written in Python, talking straight to the silicon.

Tenstorrent's Blackhole is an AI accelerator. A normal program reaches it through tt-metal: tens of thousands of lines of C++, a GCC-based RISC-V toolchain, a kernel-mode driver, and a host runtime. blackhole-py throws all of that out and asks a harder question: what is the absolute minimum you need to make this chip compute?

The answer, built up commit by commit, is a stack of pure-Python layers. At the bottom it opens a character device and fires raw ioctls; at the top it lets you write a Tensix compute kernel in a Python DSL, assembles it to RISC-V machine code in-process, uploads the firmware over PCIe, and launches it across the whole chip. No C++. No external assembler. No tt-metal.

The layer cake (as it stands at HEAD)

kernel ABI · pcie.py · 1059 LOC
Opens /dev/tenstorrent (and later VFIO), drives the card with hand-written ctypes ioctl structs, pins host DMA pages, maps PCIe BAR windows.

TLB windows · TLBWindow / _MappedBar
The chip's address space dwarfs the PCIe BAR. Programmable TLB apertures retarget a BAR window at any (noc_x, noc_y, addr) on-chip location for mmap'd reads/writes.

NOC + harvesting · ttk/noc.py · device.py
A 2D mesh wiring cores, DRAM and PCIe. Two physical NOCs, coordinate translation, multicast upload. Reads the harvesting mask from ARC telemetry to find which cores actually exist.

dispatch · cq.py · fw/cq.py · 463+699 LOC
Slow dispatch pushes every command over PCIe synchronously. Fast dispatch streams commands through an on-device prefetch/dispatch firmware ring buffer.

assembler + DSL · dsl.py · asm.py · program.py
A from-scratch RISC-V encoder (R/I/S/B/U/J types) plus Tensix custom ops (TTUNPACR, TTPACR, TTSFPLOADI…). Labels, fixups, ELF-style segments, mixin-built kernels.

firmware · fw/{brisc,ncrisc,trisc,cq}.py
The five RISC-V cores' firmware, once C++, now written in the Python DSL and assembled at runtime: NOC command buffers, subordinate-sync mailboxes, go-message dispatch.
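
To make the harvesting step in the list above concrete, here is a minimal sketch of turning a harvesting mask into a list of usable columns. It assumes the mask is a plain bitfield in which a set bit marks a disabled column; that bit meaning, the column count and the name live_columns are hypothetical and are not taken from blackhole-py or from the ARC telemetry format.

```python
def live_columns(harvest_mask: int, total_columns: int) -> list[int]:
    """Return indices of columns whose bit is clear in harvest_mask.

    Assumption for illustration only: bit i set means column i is harvested.
    """
    return [i for i in range(total_columns) if not (harvest_mask >> i) & 1]


# Hypothetical 14-column grid with columns 2 and 9 fused off.
mask = (1 << 2) | (1 << 9)
print(live_columns(mask, 14))
```

A runtime would then build its core list only from the surviving columns, which is why the mask has to be read before any kernel is placed.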

01 · The one dependency it couldn't fake: tt-metal-deps

Even a "from scratch" project needs ground truth. tt-metal-deps is the carved-out skeleton of tt-metal that blackhole-py measured itself against — and eventually outgrew.

For most of the project, tt-metal-deps was the lifeline: a ~219 MB extraction of just the pieces of Tenstorrent's stack that the Python code genuinely couldn't reproduce — chiefly a real RISC-V toolchain and the header definitions that describe the hardware's mailboxes and memory map.

tt-metal-deps/
├── sfpi-toolchain/          # the real GCC: riscv-tt-elf-{gcc,g++,as,ld,objcopy,nm}
│   └── bin/riscv-tt-elf-gcc-15.1.0
├── toolchain/blackhole/     # 16 linker scripts: firmware_*.ld, kernel_*.ld
│   ├── firmware_brisc.ld firmware_ncrisc.ld
│   └── firmware_trisc0.ld trisc1.ld trisc2.ld ...
├── firmware-src/            # reference C firmware: brisck.cc, ncrisck.cc, trisck.cc
├── lib/blackhole/           # prebuilt objects: noc.o, substitutes.o, tmu-crt0.o
└── include/                 # the hardware contract: what every register means
    ├── hostdev/dev_msgs.h   # host↔device mailbox layout
    └── tt_metal/{hw,api,hostdevcommon}

Three things mattered here. The include/ headers (dev_msgs.h and friends) define the host↔device contract — the exact byte layout of launch messages, go-signals and mailboxes the firmware reads. The toolchain/blackhole/*.ld linker scripts encode each core's memory map. And sfpi-toolchain/ was the actual compiler used while firmware was still C++.

The whole arc of the back half of the project is the story of shedding this dependency. The setup script marched through tt-metal-deps v0.7.0 → v0.8.0, and then on Apr 21 came the line that changed everything: "…no more tt-metal-deps. not functional". Once the Python assembler could emit RISC-V directly, the GCC toolchain became dead weight. The headers' knowledge got absorbed into ttk/addrs.py and ttk/_cfg_regs.py. Today tt-metal-deps survives only in the old snapshot next door, kept around as an oracle to diff against, not a thing the live code needs.

02 · The shape of the code over time

Python LOC tracked at ten checkpoints. The story isn't a line going up — it's a mountain. The codebase grew to almost 19k lines, then the author deleted more than half of it.

Figure: Python lines of code, by commit

Peak (~18.8k) includes the throwaway emulator + C++ firmware + tests. The cliff is the pure-Python pivot: delete the emulator, delete C++, and relentlessly simplify.

LOC counted as raw wc -l across all .py files present at each commit (includes emulator & tests at the peak). Directional, not exact.
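
For readers who want to reproduce a count of this kind, the following is a rough sketch of a `wc -l` style tally over the .py files in one checked-out tree. The function name count_py_lines is hypothetical, and walking the checkpoints themselves (checking out each commit) is left out; this is not the script used for the report.

```python
from pathlib import Path


def count_py_lines(root: str) -> int:
    """Sum newline counts over every .py file under root, like `wc -l`."""
    total = 0
    for path in Path(root).rglob("*.py"):
        # Counting newline bytes matches wc -l and avoids decoding errors.
        total += path.read_bytes().count(b"\n")
    return total
```

Because it counts raw newlines, blank lines and comments are included, which is the same "directional, not exact" measure described above.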

03 · Eight phases of bring-up

Reading the commit log front to back, the project moves through distinct epochs — each one a layer of the stack coming alive, breaking, and getting rebuilt.

Jan 13 – Jan 16 · 2 files

Phase 0 — A device handle and a heartbeat · ioctl + mmap

The repo is born as exactly two files: main.py and autogen.py, 253 lines total. main.py opens /dev/tenstorrent/0, builds a TenstorrentGetDeviceInfoIn/Out struct by hand, and fires GET_DEVICE_INFO to decode the PCIe BDF. It asserts arch in ("p100a", "p150b"), with the message "only blackhole is supported", and mmaps BAR0/BAR1. autogen.py is generated ctypes for the kernel driver's ioctl ABI.

23e5abe · Jan 13 · first commit
9184b56 · Jan 14 · refactor setup and open
62a65a5 · Jan 16 · rename autogen to abi, trim >150 char lines, add abi for kernel launch
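
As an illustration of the Phase 0 idea (a hand-written ctypes struct whose bytes are filled in by an ioctl, then decoded into a PCIe BDF), here is a minimal sketch that runs without a device. The struct name, its field list and the ID values are assumptions made for the example and may not match the real driver ABI; only the bus/device/function bit split is the standard PCI encoding.

```python
import ctypes


class GetDeviceInfoOut(ctypes.Structure):
    # Hypothetical field layout for illustration; the real driver ABI may differ.
    _fields_ = [
        ("output_size_bytes", ctypes.c_uint32),
        ("vendor_id", ctypes.c_uint16),
        ("device_id", ctypes.c_uint16),
        ("bus_dev_fn", ctypes.c_uint16),
        ("pci_domain", ctypes.c_uint16),
    ]


def decode_bdf(domain: int, bus_dev_fn: int) -> str:
    """Format a PCI address: bus in bits 15:8, device in 7:3, function in 2:0."""
    bus = (bus_dev_fn >> 8) & 0xFF
    dev = (bus_dev_fn >> 3) & 0x1F
    fn = bus_dev_fn & 0x7
    return f"{domain:04x}:{bus:02x}:{dev:02x}.{fn}"


# Simulate the bytes an ioctl would have filled in, rather than touching a device.
# The vendor and device IDs below are placeholders.
raw = GetDeviceInfoOut(output_size_bytes=ctypes.sizeof(GetDeviceInfoOut),
                       vendor_id=0x1234, device_id=0x5678,
                       bus_dev_fn=(0x03 << 8) | (0x00 << 3) | 0, pci_domain=0)
info = GetDeviceInfoOut.from_buffer_copy(bytes(raw))
print(decode_bdf(info.pci_domain, info.bus_dev_fn))
```

On real hardware the buffer would be passed to an ioctl call on the opened device file instead of being built by hand; the decoding step stays the same.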
Jan 16 – Jan 20

Phase 1 — Windows into the chip · TLB · ARC · NOC

To touch anything beyond the BAR you need TLB windows: programmable apertures that retarget a slice of the PCIe BAR at an on-chip (noc_x, noc_y, addr). They land in a256625 — but ARC harvesting reads don't work yet, a fact the author dutifully records in the commit message. The fix arrives the next morning at 4:40 AM. The NOC coordinate system and tile-indexing math follow.

a256625 · Jan 16 · add TLB windows, arc harvesting reading not working
7b40602 · Jan 17 04:40 · harvesting mask working
0ed19d9 · Jan 19 · add noc helpers and tile indexing calculations
823f771 · Jan 20 · added interleaved dram allocator, add noc1 coordinate translation
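
The TLB-window idea from this phase can be sketched as simple address arithmetic: point a fixed-size aperture at an aligned on-chip base, then access the target at an offset inside the mapped BAR window. The 2 MiB window size, the TlbTarget type and the retarget function are hypothetical names and values chosen for the example, not the repo's TLBWindow API or the hardware's actual window sizes.

```python
from dataclasses import dataclass

WINDOW_SIZE = 2 * 1024 * 1024  # hypothetical 2 MiB aperture; must be a power of two


@dataclass(frozen=True)
class TlbTarget:
    noc_x: int
    noc_y: int
    base: int     # window-aligned on-chip address the aperture is pointed at
    offset: int   # byte offset inside the BAR window for the requested address


def retarget(noc_x: int, noc_y: int, addr: int, size: int = WINDOW_SIZE) -> TlbTarget:
    """Split an on-chip address into a window-aligned base and an in-window offset."""
    if size <= 0 or size & (size - 1):
        raise ValueError("window size must be a power of two")
    base = addr & ~(size - 1)
    return TlbTarget(noc_x, noc_y, base, addr - base)


t = retarget(noc_x=1, noc_y=2, addr=0x0030_1234)
print(hex(t.base), hex(t.offset))
```

In a real driver the base and coordinates would be written into the TLB configuration registers, after which a plain memory access at offset within the mmap'd window reaches the chosen core.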
Jan 20 – Feb 3

Phase 2 — Firmware, and the first computation · brisc/ncrisc/trisc

Each Tensix tile has five RISC-V cores: BRISC & NCRISC (data-movement babies that drive NOC transfers) and TRISC0/1/2 (the compute trio — unpacker, FPU/math, packer). Firmware starts as prebuilt .elf files checked into the repo, then self-compiled C++ via the sfpi toolchain. The payoff is an all-caps milestone, immediately undercut by a hardware reality.

b930c49 · Jan 16 · tensix tile firmware loaded successfully
0f4494e · KERNEL WORKING (both cores use noc0)
c5a6165 · fix firmware, add kernel running logic. dram read appears broken.
late Jan – Feb 14

Phase 3 — The DRAM read saga & fast dispatch war · command queues

Two long-running battles overlap here. First, DRAM reads keep returning mismatched data, and the interleaved bank tables get rewritten again and again. Second, the project moves from slow dispatch (the host pushes every command over PCIe) to fast dispatch (the host writes commands into a host-memory ring; on-device cq_prefetch/cq_dispatch firmware streams them to cores). Fast dispatch was a multi-week grind, including a commit literally named "non-working fast dispatch" that was promptly reverted.

2085b2c · add dram interleaved tables, mismatch remains
529c7c7 · add abi for fast dispatch
201ecf3 · non-working fast dispatch  →  ef910dc · Revert "non-working fast dispatch"
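
The difference between the two dispatch modes comes down to a producer/consumer ring. Below is a minimal host-side model of such a ring, assuming one writer and one reader with monotonically increasing byte counters. The CommandRing class is a hypothetical teaching model; it does not reflect the actual command format, pointer layout or synchronization used by cq.py or the on-device firmware.

```python
class CommandRing:
    """Fixed-size byte ring with one writer (host) and one reader (device model)."""

    def __init__(self, size: int):
        self.buf = bytearray(size)
        self.size = size
        self.wr = 0  # total bytes ever written
        self.rd = 0  # total bytes ever consumed

    def free(self) -> int:
        return self.size - (self.wr - self.rd)

    def push(self, cmd: bytes) -> bool:
        """Copy cmd into the ring, wrapping at the end; False if it does not fit."""
        if len(cmd) > self.free():
            return False
        for b in cmd:
            self.buf[self.wr % self.size] = b
            self.wr += 1
        return True

    def pop(self, n: int) -> bytes:
        """Consume up to n pending bytes in FIFO order."""
        n = min(n, self.wr - self.rd)
        out = bytes(self.buf[(self.rd + i) % self.size] for i in range(n))
        self.rd += n
        return out


ring = CommandRing(8)
assert ring.push(b"\x01\x02\x03\x04\x05\x06")
assert ring.pop(4) == b"\x01\x02\x03\x04"
assert ring.push(b"\xaa\xbb\xcc\xdd")      # wraps past the end of the buffer
assert not ring.push(b"\x00" * 5)          # only 2 bytes free: back-pressure
print(ring.pop(8).hex())
```

The host only stalls when push reports that the ring is full, which is the property that lets fast dispatch overlap command submission with execution.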
Feb – Mar 13

Phase 4 — Chasing TFLOPs, building tools · matmul · profiler · p150

With kernels running, attention turns to performance. The matmul_peak benchmark climbs through blocking strategies, packer L1 accumulation, HiFi2 math fidelity and bf16 — from ~60 to 94 to 151 TFLOPs (the very first PR is literally #1 matmul-block-94tflops). A profiler appears that reads on-device perf counters and renders an HTML flame UI. Board detection grows to handle P150 alongside P100. A compiler.py + IR (dispatch.py, hw.py, kernels.py) splits out, and the old abi/autogen gets removed.

3bbc91f · …151 TFLOPs
82c186c · Mar 13 · IR + lowering split; abi/autogen removed
72c11fd · Apr 3 · removes the tt-kmd dependency — talks to the card over VFIO instead
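
For context on how a matmul throughput figure of this kind is usually derived, here is a small sketch of the standard arithmetic: an (m × k) by (k × n) matrix multiply performs 2·m·n·k floating-point operations. The function name and the shape and timing in the example are placeholders, not measurements or code from the matmul_peak benchmark.

```python
def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOP/s for one (m x k) @ (k x n) matmul: 2*m*n*k FLOPs / time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2 * m * n * k / seconds / 1e12


# Placeholder shape and timing, not measurements from the repo.
print(round(matmul_tflops(8192, 8192, 8192, 0.01), 1))
```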
Mar 25 – Apr 21

Phase 5 — The Great Pivot: burn the C++ · DSL · emulator · assembler

The most radical phase. The author bets that the entire LLVM/GCC toolchain and the C++ firmware can be replaced by Python. First a DSL that emits "all riscv + tenstorrent ops" (cc94cf1). Then a full instruction-level emulator (emulator part 1 / part 2) so kernels can be brought up off-device. Then the scorched-earth commits: the profiler, the IR, and tt-metal-deps all go in one swing — with an honest "not functional" tacked on — followed by the deletion of ~5,000 lines of C++ firmware in a single commit.

cc94cf1 · Mar 25 · add dsl with all riscv + tenstorrent ops
0511943 · Apr 15 · emulator part 1 · d570394 · emulator part 2
4ae247e · Apr 21 · delete profiler, rvir … no more tt-metal-deps. not functional
d463403 · Apr 21 · delete c++ firmware. python risc-v only · −4955 lines
May 9 – May 19

Phase 6 — Re-deriving firmware, in pure Python · asm.py final form

The assembler stabilizes (asm.py final form; split into asm/program/dsl), and then — core by core — all five RISC-V firmwares come back to life, now authored in Python and assembled in-process. Once the real firmware works on the device, the emulator that bootstrapped it is deleted; it had done its job. The assembler grows composable mixins that become today's ttk/ toolkit.

fb64fa1 · May 9 · asm.py final form
7713bb1 · May 15 · brisc firmware working in fast dispatch
4a8de11 · May 16 · …python firmware works on the device
61f5726 · May 16 · Remove emulator  (it served its purpose)
87f5c1c · May 18 · slow dispatch fw working!
May 28 – May 30

Phase 7 — Validation, 120 cores, and one stubborn bug · ttsim · ttas · add1

The final stretch ties the bow: match Tenstorrent's ttas assembler output, integrate their ttsim simulator for validation, and use it to auto-name mystery registers. The canonical "add 1 to every element" kernel runs end-to-end on the all-Python stack and scales to 120 cores. And then — true to form — the project ends mid-debugging: fast dispatch works, but breaks after the 81st iteration.

f576c92 · May 28 · match ttas, thanks Zaneham
cc4d80f · May 28 · ttsim integration
192525e · May 28 · working add1 (one run)
6afb021 · May 30 · …working launch on 120 cores. fast dispatch breaks after 81st iteration

04 · The commit that defines the project

Every from-scratch project has a moment of conviction. For blackhole-py it's d463403.

d463403 · Apr 21 · "delete c++ firmware. python risc-v only"

One commit removes 11 files and 4,955 lines of C++: cq_prefetch.cpp (1,608), cq_dispatch.cpp (1,019), cq_common.hpp (538), brisc.cc (419), and the rest. At that exact moment the tree was not functional — the author had already admitted as much in the previous commit. It took three weeks of Python firmware work (Phase 6) to climb back to a working device. Deleting working C++ to bet on unwritten Python is the whole personality of this repo.

Before — C++ firmware

// firmware/cq/cq_prefetch.cpp — 1608 lines
// compiled with riscv-tt-elf-g++,
// linked against firmware_brisc.ld,
// from tt-metal-deps/

After — Python DSL → bytes

from dsl import TTPACR, TTUNPACR, jal, sw
class Asm(Tensix, Noc, Cb, Flow, Debug):
  def emit(self, *insns): ...
# asm.py: labels, fixups, segments
# every instruction → 4 little-endian bytes

05 · How the assembler actually works

The crown jewel: ~1,400 lines of Python replacing an entire LLVM/GCC RISC-V toolchain.

dsl.py (566 LOC) is a metaclass-driven instruction encoder. Each RISC-V format — RType, IType, SType, BType, UType, JType — is a class whose fields are BitField descriptors that pack/sign-extend into a 32-bit word. On top of the base ISA sit the Tensix custom ops the matrix engine speaks: TTUNPACR (unpack), TTPACR (pack), TTSFPLOADI (vector-engine load-immediate), TTSEMWAIT, TTZEROACC, and dozens more — each one a bitfield struct that emits a single instruction word.

class BitField:
  def encode(self, val): ...  # pack into [hi:lo], sign-extend if signed
class SType(Inst):       # split immediate: imm[11:5] | imm[4:0]
  def imm(self): return _sext((self.imm11_5 << 5) | self.imm4_0, 12)
class JType(Inst):       # the gnarly scrambled jump immediate
  def imm(self): return _sext((self.imm20<<20)|(self.imm19_12<<12)|(self.imm11<<11)|(self.imm10_1<<1), 21)
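
To show what packing fields into a 32-bit word looks like in the encoding direction, here is a self-contained sketch for the I and J formats. It uses plain functions rather than the descriptor and metaclass design described above, and the helper names (_field, encode_itype, encode_jtype, to_bytes) are hypothetical. The bit positions follow the base RISC-V instruction formats.

```python
import struct


def _field(value: int, hi: int, lo: int) -> int:
    """Place the low (hi-lo+1) bits of value at bit positions [hi:lo]."""
    width = hi - lo + 1
    return (value & ((1 << width) - 1)) << lo


def encode_itype(opcode: int, rd: int, funct3: int, rs1: int, imm: int) -> int:
    if not -2048 <= imm < 2048:
        raise ValueError("I-type immediate must fit in 12 signed bits")
    return (_field(imm, 31, 20) | _field(rs1, 19, 15) | _field(funct3, 14, 12)
            | _field(rd, 11, 7) | _field(opcode, 6, 0))


def encode_jtype(opcode: int, rd: int, imm: int) -> int:
    if imm % 2 or not -(1 << 20) <= imm < (1 << 20):
        raise ValueError("J-type offset must be even and fit in 21 signed bits")
    # The offset bits are stored out of order: imm[20|10:1|11|19:12].
    return (_field(imm >> 20, 31, 31) | _field(imm >> 1, 30, 21)
            | _field(imm >> 11, 20, 20) | _field(imm >> 12, 19, 12)
            | _field(rd, 11, 7) | _field(opcode, 6, 0))


def to_bytes(word: int) -> bytes:
    return struct.pack("<I", word)  # one instruction = 4 little-endian bytes


addi = encode_itype(0x13, rd=1, funct3=0, rs1=0, imm=5)   # addi x1, x0, 5
jal = encode_jtype(0x6F, rd=1, imm=8)                     # jal x1, +8
print(f"{addi:08x} {jal:08x} {to_bytes(addi).hex()}")
```

Masking each value to its field width is what makes negative immediates work: Python's arbitrary-precision integers behave like sign-extended two's complement under `&` and `>>`.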

asm.py (469 LOC) is the assembler proper. The Asm class inherits from five ttk mixins — Tensix, Noc, Cb, Flow, Debug — so a kernel author writes self.write32(addr, val), self.delay_cycles(n), self.cb_wait_front(...) instead of raw instructions. It tracks register constants for peephole optimization (_reg_const, _const_delta), resolves label()/Fixup references in a second pass, and emits raw bytes. program.py (407 LOC) handles ELF-like ptload segments, dtypes and tile sizes.
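
The label and fixup mechanism mentioned above can be sketched as a two-pass scheme: emit placeholder words while recording which ones refer to a label, then patch the PC-relative offsets once every label address is known. TinyAsm and its methods are hypothetical and far simpler than the Asm class described here; the sketch only handles jal and omits mixins, segments and the register-constant tracking.

```python
import struct


class TinyAsm:
    """Two-pass sketch: emit words now, patch jump targets once labels are known."""

    def __init__(self):
        self.words: list[int] = []
        self.labels: dict[str, int] = {}
        self.fixups: list[tuple[int, str]] = []  # (word index, label name)

    def label(self, name: str) -> None:
        self.labels[name] = len(self.words) * 4  # byte address of next instruction

    def emit(self, word: int) -> None:
        self.words.append(word)

    def jal(self, rd: int, target: str) -> None:
        self.fixups.append((len(self.words), target))
        self.emit((rd << 7) | 0x6F)  # offset bits left zero until resolve()

    def resolve(self) -> bytes:
        for index, name in self.fixups:
            off = self.labels[name] - index * 4  # PC-relative byte offset
            imm = (((off >> 20) & 1) << 31 | ((off >> 1) & 0x3FF) << 21
                   | ((off >> 11) & 1) << 20 | ((off >> 12) & 0xFF) << 12)
            self.words[index] |= imm
        return b"".join(struct.pack("<I", w) for w in self.words)


a = TinyAsm()
a.jal(0, "end")        # forward reference: "end" is not defined yet
a.emit(0x00000013)     # nop (addi x0, x0, 0)
a.label("end")
a.jal(0, "end")        # jumps to itself, offset 0
print(a.resolve().hex())
```

Deferring the patch to a second pass is what allows forward references: the first jal is emitted before the address of "end" exists.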

The result is that examples/add1.py — the canonical kernel — is written entirely in this DSL: MOP (macro-op) expander templates of TTPACR/TTUNPACR/TTNOP slots, semaphore handshakes between TRISC cores, and circular-buffer bookkeeping, all in Python that compiles to machine code with no external tool in the loop.

06 · The commit log as a diary

339 messages, one author, no filter. The log reads like a lab notebook — triumph, defeat, and dry self-awareness, often in the same line.

"harvesting mask working"
7b40602 · the 4:40 AM victory
"KERNEL WORKING (both cores use noc0)"
0f4494e · all-caps elation
"fix firmware, add kernel running logic. dram read appears broken."
c5a6165 · triumph & defeat, one line
"add dram interleaved tables, mismatch remains"
2085b2c · the saga continues
"non-working fast dispatch" → Revert "non-working fast dispatch"
201ecf3 / ef910dc · naming it, then reverting it
"…not functional"
4ae247e · admitting the tree is broken on purpose
"add delay to trisc1; our kernel runs too fast"
175667b · the rare too-fast bug
"remove rvir and rvlib; wrong direction"
ba5a1ed · an honest dead end
"add1 working e2egit statusgit status!"
1c71335 · a literal `git status` mashed into the message
"format like tinygrad" / "like tinygrad" / "de-slopifying part 1"
f7fa2d8 · tinygrad is the aesthetic north star
"opus 4.6 rewrites fast dispatch"
75de285 · crediting an LLM for a subsystem
"match ttas, thanks Zaneham"
f576c92 · the one human shout-out in 339 commits
"use WC-only TLB mapping — dual UC/WC mmap silently breaks"
37709f0 · a real hardware gotcha, captured
"…working launch on 120 cores. fast dispatch breaks after 81st iteration"
6afb021 · ending on a specific, unsolved heisenbug

Solo dev, full process

Even as a one-person project, it ran on feature branches and pull requests — 18 merged PRs from the author's own branches (#1 matmul-block-94tflops, #8 codex/always-upload-firmware-startup, #11 tlb-window-fix). There's even a 5700518 "migrate to jj" — a detour into the Jujutsu version-control system mid-project.

07 · Where it stands at HEAD

6afb021 · May 30, 2026 · ~8,079 lines of Python across 23 files.

| File | LOC | Role |
| --- | --- | --- |
| pcie.py | 1059 | ioctl/VFIO ABI, BAR mapping, TLB windows, sysmem, board detection |
| examples/add1.py | 701 | the canonical kernel, hand-written in the DSL |
| fw/cq.py | 699 | fast-dispatch prefetch/dispatch firmware (was 2,600+ lines of C++) |
| tt-smi.py | 628 | a clone of Tenstorrent's status tool, with a TUI fan editor |
| dsl.py | 566 | RISC-V + Tensix instruction encoder |
| asm.py | 469 | assembler: labels, fixups, segments, mixin kernels |
| cq.py | 463 | host-side command queue / dispatch |
| program.py | 407 | ELF-style ptload segments, dtypes, tile sizes |
| device.py | 232 | high-level device: open, harvest, map, launch |
| dram.py | 122 | DRAM bank allocator |
| ttk/ · fw/ | ~3.4k | toolkit mixins (cb, noc, tensix, sfpu, flow, debug, addrs) + Python firmware |

From 253 lines to a working accelerator runtime

The journey: open the device → read ARC harvesting → map TLBs → load core firmware → run one kernel (slow dispatch) → build fast dispatch and chase matmul TFLOPs → add a profiler & board support → then the big bet: delete the C++ toolchain and firmware, write a Python RISC-V assembler + emulator + firmware, validate against ttsim/ttas, and land add1 on 120 cores. It ends, fittingly, mid-debug on the 81st iteration.