blackhole-py-old — host code in cq.py / dispatch.py / device.py, firmware in firmware/cq/.

The pipeline has three cooperating actors:
- **Host** (cq.py) — writes commands into a pinned hugepage ("issue region") and pushes 2-byte size entries onto a 16-bit prefetch_q in the prefetch core's L1.
- **Prefetcher** ((PREFETCH_NOC_X, PREFETCH_NOC_Y), NOC 0) — reads from the host hugepage over PCIe-NOC, lands data in its cmddat_q, relays commands to the dispatch core's CB. Built from cq_prefetch.cpp.
- **Dispatcher** ((DISPATCH_NOC_X, DISPATCH_NOC_Y), NOC 1, with a co-resident NCRISC "subordinate") — drains the relayed command stream, executes individual dispatch commands. Built from cq_dispatch.cpp (BRISC) and cq_dispatch_subordinate.cpp (NCRISC).

The board layout is wired in device.py:
P100: tensix_x = 1..7, 10..14 prefetch=(14,2) dispatch=(14,3)
P150: tensix_x = 1..7, 10..16 prefetch=(16,2) dispatch=(16,3)
PCIE NOC: x=19 y=24 (absolute; not subject to NOC mirroring)
build.sh compiles the CQ kernels with -DPREFETCH_NOC_X/Y, -DDISPATCH_NOC_X/Y, -DPCIE_NOC_X=19 -DPCIE_NOC_Y=24 and -DNOC_INDEX={0,1}. Prefetch BRISC is NOC 0; dispatch BRISC + the subordinate NCRISC are both NOC 1.
cq_fixed_config.hpp is the source of truth for the L1 layout. Three include-guarded blocks select the layout per processor.
| Region | Base | Size | Purpose |
|---|---|---|---|
| PCIE NOC view of host hugepage | pcie_base = 0x4000_0100 | 64 MiB | Prefetcher reads via NOC PCIe |
| prefetch_q | 0x19840 | 3068 B (1534 × 2 B) | Ring of 16-bit fetch sizes |
| prefetch_q_rd_ptr | 0x196C0 | 4 B | Device read pointer (echoed back to host) |
| prefetch_q_pcie_rd_ptr | 0x196C4 | 4 B | Current PCIe read offset (echoed back) |
| cmddat_q | 0x1A440 | 256 KiB, 64 × 4 KiB, 4 blocks | Where commands land |
| scratch_db | 0x5A440 | 128 KiB (2 × 64 KiB) | Double-buffered bounce for paged/linear relays |
| dispatch_s_buffer (view) | 0x9A000 | 32 KiB, log_page=8 | Path to dispatch subordinate |
| downstream_cb (view) | 0x1A000 | 128 × 4 KiB = 512 KiB | Dispatch BRISC CB |
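The map hangs together arithmetically; a quick sanity check in plain Python, with the constants copied from the table:

```python
KiB = 1024

# Bases/sizes from the L1 layout table above
cmddat_q_base, cmddat_q_size = 0x1A440, 64 * 4 * KiB   # 64 pages × 4 KiB = 256 KiB
scratch_db_base = 0x5A440

# cmddat_q runs right up to the scratch_db base
assert cmddat_q_base + cmddat_q_size == scratch_db_base
# prefetch_q: 1534 two-byte slots = 3068 B
assert 1534 * 2 == 3068
# downstream CB view: 128 pages × 4 KiB = 512 KiB
assert 128 * 4 * KiB == 512 * KiB
```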
Semaphores (16 B aligned, on the FD core type):
- MY_DOWNSTREAM_CB_SEM_ID = 0 — page credits granted by dispatch_d.
- MY_DISPATCH_S_CB_SEM_ID = 1 — page credits granted by dispatch_s.
- DOWNSTREAM_SYNC_SEM_ID = 2 — counter dispatch_d increments to unblock prefetch's STALL.
- MY_UPSTREAM_CB_SEM_ID = 3 — only used in _d variants (split path).

| Address | Name | Purpose |
|---|---|---|
| 0x196D0 | DEV_COMPLETION_Q_WR_PTR | Device-side completion FIFO write ptr (mirror) |
| 0x196E0 | DEV_COMPLETION_Q_RD_PTR | Host's read ptr written into device L1 |
| 0x1A000 | DISPATCH_CB_BASE | Input CB from prefetcher, 128 × 4 KiB, 4 blocks |
| 0x40000000 (NOC) | COMMAND_QUEUE_BASE_ADDR | Host completion ptr in sysmem (mirror target) |
| 0x44000100 (NOC) | COMPLETION_QUEUE_BASE_ADDR | Host completion region (32 MiB) |
| 0x370 | UNICAST_GO_SIGNAL_ADDR | Per-worker go-message offset |
The dispatch subordinate is co-resident with dispatch_d on the same tile. Its CB sits at 0x9A000, 32 KiB, log_page_size=8. Worker-completion stream registers live in the NOC stream register file from FIRST_STREAM_USED = 48. DISPATCH_S_SYNC_SEM_BASE_ADDR = 0x19710 — an array of 16-byte-strided semaphores bumped by dispatch_d so dispatch_s knows when it's safe to send a go signal.
The host hugepage is laid out as:

```
+0                   (header / reserved)
+2 * PCIE_ALIGN      _HOST_CQ_WR_OFF         (4 B: completion-fifo wr-ptr | toggle)
+3 * PCIE_ALIGN      _HOST_CQ_RD_OFF         (4 B: host's rd-ptr | toggle echo)
+4 * PCIE_ALIGN      _HOST_ISSUE_BASE        issue region, 64 MiB
+ ...                _HOST_COMPLETION_BASE   completion region, 32 MiB
+ ...                _HOST_TIMESTAMP_BASE    4096 × 16 B (CQTimestamp slots)
+ ...                _HOST_CORE_TIMING_BASE  4096 × 16 B (per-core BRISC start/end)

PCIE_ALIGN = 64, L1_ALIGN = 16, PAGE_SIZE = 4096
```
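The fixed offsets fall out of PCIE_ALIGN directly; a minimal check (the elided completion/timestamp bases are intentionally not reproduced):

```python
PCIE_ALIGN = 64
L1_ALIGN = 16
PAGE_SIZE = 4096

HOST_CQ_WR_OFF = 2 * PCIE_ALIGN    # completion-fifo wr-ptr | toggle
HOST_CQ_RD_OFF = 3 * PCIE_ALIGN    # host's rd-ptr | toggle echo
HOST_ISSUE_BASE = 4 * PCIE_ALIGN   # the 64 MiB issue region starts here

assert (HOST_CQ_WR_OFF, HOST_CQ_RD_OFF, HOST_ISSUE_BASE) == (128, 192, 256)
```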
The hugepage is mmap'd with MAP_SHARED|MAP_ANONYMOUS|MAP_POPULATE, pinned via PCIDevice.pin_pages(), and the returned IOVA must have bit 60 set (the "PCIE NOC base" marker _PCIE_NOC_BASE = 1 << 60). The 32-bit truncation noc_local = noc_addr - _PCIE_NOC_BASE is what the prefetcher uses as pcie_base offsets.

On the host (cq.py), every dispatch command is a 16-byte fixed-size header followed by an L1-aligned payload. The byte layout is defined by cq_commands.hpp and produced on the host by _cq_hdr(fmt, ...), which struct-packs and zero-pads to 16 bytes.
| ID | Name | Dataclass |
|---|---|---|
| 3 | WRITE_LINEAR_H_HOST | CQHostEvent |
| 5 | WRITE_PACKED | CQWritePacked |
| 6 | WRITE_PACKED_LARGE | CQWritePackedLarge |
| 7 | WAIT | CQWaitStream, CQBarrier |
| 14 | SEND_GO_SIGNAL | CQSendGoSignal |
| 17 | SET_GO_SIGNAL_NOC_DATA | CQSetGoSignalNocData |
| 18 | TIMESTAMP | CQTimestamp |
Each top-level dispatch command is wrapped on the host into a RELAY_INLINE prefetch command:
```python
def _relay_inline(payload):
    stride = align_up(CQ_CMD_SIZE + len(payload), PCIE_ALIGN)
    hdr = _cq_hdr("<BBHII", _RELAY_INLINE, 0, 0, len(payload), stride)
    return hdr + payload.ljust(stride - CQ_CMD_SIZE, b"\0")
```
The stride rounds up to PCIE_ALIGN (64 B), and the 2-byte ring entry encodes stride >> 4.

The host treats prefetch_q as a 1534-slot SPSC ring. Each slot is a uint16 holding either 0 (free) or the 16-byte-unit size of the next chunk to fetch. The high bit is an EXEC_BUF stall flag (unused in this build). To push a record:

1. Bump _issue_wr to the next PCIE_ALIGN boundary.
2. If record won't fit, wrap to 0.
3. Copy record into sysmem at _HOST_ISSUE_BASE + _issue_wr.
4. _issue_wr += len(record).
5. Spin waiting until prefetch_q[idx] == 0.
6. Atomically write 2-byte size (record_len >> 4) into prefetch_q[idx].
7. idx = (idx + 1) % CQ_PREFETCH_Q_ENTRIES.
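The seven steps can be modeled in a few lines; a simplified stand-in for the real CQSysmem state, where sysmem, ring, and state are plain Python objects:

```python
PCIE_ALIGN = 64
CQ_PREFETCH_Q_ENTRIES = 1534

def push_record(sysmem: bytearray, ring: list, state: dict, record: bytes):
    # 1. bump the issue write offset to the next PCIE_ALIGN boundary
    wr = (state["issue_wr"] + PCIE_ALIGN - 1) & ~(PCIE_ALIGN - 1)
    # 2. wrap to 0 if the record won't fit
    if wr + len(record) > len(sysmem):
        wr = 0
    # 3./4. copy into the issue region and advance the write offset
    sysmem[wr:wr + len(record)] = record
    state["issue_wr"] = wr + len(record)
    # 5. the real code busy-waits here until the slot reads 0
    assert ring[state["idx"]] == 0, "prefetch_q full"
    # 6. publish the record size in 16-byte units
    ring[state["idx"]] = len(record) >> 4
    # 7. advance the ring index
    state["idx"] = (state["idx"] + 1) % CQ_PREFETCH_Q_ENTRIES

sysmem = bytearray(4 * PCIE_ALIGN)
ring = [0] * CQ_PREFETCH_Q_ENTRIES
state = {"issue_wr": 0, "idx": 0}
push_record(sysmem, ring, state, b"\x01" * 64)
push_record(sysmem, ring, state, b"\x02" * 128)
```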
In dispatch.py, build_ir(program, ...) produces a list of IRCommands; _lower_ir then converts:
- Write(cores, addr, bytes) → CQWritePackedLarge + CQBarrier (per 1 KiB chunk)
- Write(cores, addr, list[bytes]) → CQWritePacked (per L1_ALIGN slice)
- Launch(cores) → the four-command go sequence:
  - CQSetGoSignalNocData(cores)
  - CQWaitStream(DONE_STREAM=48, 0)
  - CQSendGoSignal(go_word, 48, 0, len(cores))
  - CQWaitStream(DONE_STREAM, len(cores))
lower_fast(programs, ...) bookends each per-program IR with CQTimestamp writes. device._run_fast_dispatch() appends a single CQHostEvent — that is the event the host waits on.
The prefetch firmware lives in cq_prefetch.cpp. It compiles to three variants (_h, _d, _hd) but for blackhole-py-old only the unified _hd is used because IS_H_VARIANT = IS_D_VARIANT = 1.
```cpp
void kernel_main_hd() {
    uint32_t cmd_ptr = cmddat_q_base;
    uint32_t fence = cmddat_q_base;
    bool done = false;
    uint32_t l1_cache[l1_cache_elements_rounded];
    PrefetchExecBufState exec_buf_state;
    cq_noc_async_write_init_state<...>(0, dst_dispatch, 0);   // BRISC_WR_CMD_BUF
    cq_noc_async_write_init_state<...>(0, dst_dispatch_s, 0); // BRISC_WR_REG_CMD_BUF
    while (!done) {
        fetch_q_get_cmds<0>(fence, cmd_ptr, pcie_read_ptr);
        cmd = (volatile CQPrefetchCmd*)cmd_ptr;
        uint32_t stride;
        done = process_cmd<false, false>(cmd_ptr, downstream_data_ptr, stride, l1_cache, exec_buf_state);
        cmd_ptr += stride;
    }
}
```
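The prefetcher's fetch handshake can be modeled from the host's perspective: scan the ring for a nonzero size, fetch that many bytes from the issue region, zero the slot. A sketch of the protocol, not the firmware:

```python
def drain_prefetch_q(ring, rd_idx, pcie_rd_ptr, on_fetch):
    # Consume nonzero size entries in order; each is a count of 16-byte units.
    while ring[rd_idx] != 0:
        size_16b = ring[rd_idx] & 0x7FFF   # high bit = EXEC_BUF stall flag
        on_fetch(pcie_rd_ptr, size_16b << 4)
        ring[rd_idx] = 0                   # hand the slot back to the host
        pcie_rd_ptr += size_16b << 4
        rd_idx = (rd_idx + 1) % len(ring)
    return rd_idx, pcie_rd_ptr

fetches = []
ring = [4, 8, 0, 0]  # two pending records: 64 B and 128 B
rd_idx, pcie_rd_ptr = drain_prefetch_q(ring, 0, 0,
                                       lambda off, n: fetches.append((off, n)))
```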
The prefetcher implements fetch_q_get_cmds as a state machine over three flags. The function returns as soon as there is something to process; otherwise it advances any in-flight read, or starts a new PCIe-NOC read keyed off the next nonzero prefetch_q slot.

**process_cmd (prefetcher)**

| Prefetch cmd | Action |
|---|---|
| RELAY_LINEAR | NOC-read from (noc_xy, addr) into scratch_db, stream to dispatch CB. (unused here) |
| RELAY_PAGED / _PACKED | Pull pages from interleaved DRAM banks, double-buffer through scratch_db_top[0..1]. (unused) |
| RELAY_INLINE | The path Python uses for every command. Copy body from cmddat_q to dispatch CB, flushing pages as it goes. |
| RELAY_INLINE_NOFLUSH | Same, but doesn't release the trailing page; the next command can append. |
| EXEC_BUF / _END | Switch source from PCIe to a DRAM-resident command buffer; stalls new PCIe reads. |
| STALL | Spin on my_downstream_sync_sem hitting a count (driven by dispatch_d's NOTIFY_PREFETCH WAIT flag). |
| TERMINATE | Drop out of kernel_main_hd. |
**process_relay_inline_common — the hot path**

```cpp
uint32_t length = cmd->relay_inline.length;
uint32_t stride = cmd->relay_inline.stride;
uint32_t npages = (length + page_size - 1) >> log_page_size;
RelayInlineState::cb_writer.acquire_pages(npages); // wait on credits from dispatch

CmdPayloadStream stream{cmd_ptr, exec_buf_state};
stream.init(sizeof(CQPrefetchCmd)); // skip 16 B prefetch header
while (remaining > 0) {
    ByteSpan span = stream.next_span(remaining, stride);
    write_downstream<...>(span.ptr, local_downstream_data_ptr, span.len, ...);
    remaining -= span.len;
}
local_downstream_data_ptr = round_up_pow2(local_downstream_data_ptr, page_size);
noc_async_writes_flushed(RelayInlineState::downstream_noc_index);
RelayInlineState::cb_writer.release_pages(npages, local_downstream_data_ptr);
```
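acquire_pages/release_pages compare free-running 32-bit counters, so an ordinary < would break at wrap; the wrap_gt signed-diff trick (and the wrap_ge sibling used by WAIT_MEMORY) reduces to interpreting the difference as a signed 32-bit value. A sketch of the idea, not the firmware code:

```python
def _sdiff32(a, b):
    # (a - b) mod 2**32, reinterpreted as a signed 32-bit value
    d = (a - b) & 0xFFFFFFFF
    return d - (1 << 32) if d & 0x80000000 else d

def wrap_gt(a, b):
    return _sdiff32(a, b) > 0

def wrap_ge(a, b):
    return _sdiff32(a, b) >= 0
```

For example wrap_gt(2, 0xFFFFFFFE) is true: the counter advanced four steps across the 2³² boundary, which a plain comparison would miss.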
CBWriter keeps an additional_count field for "borrowed credit" (used by EXEC_BUF / ringbuffer paths), and uses a wrap_gt signed-diff comparison to safely handle the semaphore wrapping past 2³¹.

The dispatcher's main loop (cq_dispatch.cpp):

```cpp
void kernel_main() {
    set_l1_data_cache<true>();
    if constexpr (NOC_INDEX != UPSTREAM_NOC_INDEX) noc_local_state_init(UPSTREAM_NOC_INDEX);
    reset_worker_completion_stream_counts<FIRST_STREAM_USED=48, MAX_NUM_WORKER_SEMS=8>();
    dispatch_cb_reader.init();
    cmd_ptr = DISPATCH_CB_BASE;
    write_offset[0..2] = 0;
    completion_queue_wr_ptr_and_toggle = *get_cq_completion_write_ptr();
    while (!done) {
        dispatch_cb_reader.wait_for_available_data_and_release_old_pages(cmd_ptr);
        done = process_cmd<IS_D_VARIANT=true>(cmd_ptr, l1_cache);
        cmd_ptr = round_up_pow2(cmd_ptr, dispatch_cb_page_size);
    }
}
```
CBReaderWithReleasePolicy (defined in cq_common.hpp) maintains:
- local_count_ — pages logically consumed.
- cb_fence_ — byte address one past the last guaranteed-present byte.
- block_next_start_addr_[4] — block boundary fences.
- rd_block_idx_ — current block (0..3).
- block_noc_writes_to_clear_ — snapshot of the NOC nonposted-write counter.

**process_cmd (dispatcher)**

| Dispatch cmd | Function |
|---|---|
| WRITE_LINEAR | process_write_linear(num_mcast_dests) |
| WRITE_LINEAR_H_HOST | process_write_host_h() — host event / read-buffer path |
| WRITE_PAGED | process_write_paged<is_dram>() |
| WRITE_PACKED | process_write_packed<mcast, SubCmdT>() |
| WRITE_PACKED_LARGE | process_write_packed_large() |
| WAIT | process_wait() |
| SEND_GO_SIGNAL | process_go_signal_mcast_cmd() |
| SET_GO_SIGNAL_NOC_DATA | copies words into a local 256-entry array |
| SET_WRITE_OFFSET | updates write_offset[3] for relocation |
| TIMESTAMP | samples wall-clock and writes 8 B to a NOC dst |
| TERMINATE | breaks the loop |
**process_write_packed**

Sub-commands (one per target core) are copied into BRISC's local cache, then a tight loop issues one stateful NOC write per sub-command. If NO_STRIDE is set, all cores share the same payload pointer; otherwise data_ptr advances by the L1_ALIGN-padded xfer_size.

**process_write_host_h (host event / read-back)**

```cpp
data_ptr = cmd_ptr; // include cmd header so host can parse it
wlength = cmd->write_linear_host.length;
// init NOC state to pcie_noc_xy
while (wlength != 0) {
    chunk_len = min(wlength, ~(dispatch_cb_page_size - 1));
    while (chunk_len != 0) {
        wait_for_data_in_cb(data_ptr);
        xfer = min(chunk_len, available_in_cb);
        npages = div_up(xfer, completion_page_size);
        completion_queue_reserve_back(npages); // blocks on rd_ptr
        wr_addr = cq_write_interface.completion_fifo_wr_ptr << 4;
        if (wr_addr + xfer > completion_queue_end) { /* split + wrap */ }
        // write to (pcie_noc_xy, wr_addr) of size xfer
        completion_queue_push_back(npages); // updates ptr in sysmem
        data_ptr += xfer; chunk_len -= xfer; wlength -= xfer;
    }
}
```
**process_wait (WAIT flags)**

- BARRIER (0x01) → noc_async_write_barrier()
- WAIT_MEMORY (0x04) → spin until *addr >= count (using wrap_ge)
- WAIT_STREAM (0x08) → spin on the stream's SPACE_AVAILABLE_REG until >= count
- CLEAR_STREAM (0x10) → atomically zero the stream by writing -(*sem) to the update reg
- NOTIFY_PREFETCH (0x02) → noc_semaphore_inc(prefetch's UPSTREAM_SYNC_SEM, 1)

cq_dispatch_subordinate.cpp lives on the same tensix as dispatch_d but runs on the NCRISC. It uses NOC 1 with a customized command-buffer allocation so BRISC and NCRISC never collide:
NOC_CMD_BUF_WRITE_REG setup at boot enforces the partition.

In blackhole-py-old's fixed config DISTRIBUTED_DISPATCHER = 0, so the host never sends go-signal commands as DISPATCH_SUBORDINATE. dispatch_s mostly idles after start-up — it is kept fully functional only so future split configurations can drop in.
_lower_ir lowers a Launch(cores) to four CQ commands in order:
```python
CQSetGoSignalNocData(cores)                 # 17: SET_GO_SIGNAL_NOC_DATA
CQWaitStream(DONE_STREAM=48, 0)             # 7 + WAIT_STREAM + CLEAR_STREAM
CQSendGoSignal(go_word, 48, 0, len(cores))  # 14: SEND_GO_SIGNAL (num_unicast = N)
CQWaitStream(DONE_STREAM, len(cores))       # 7 + WAIT_STREAM + CLEAR_STREAM
```
SEND_GO_SIGNAL unicasts RUN_MSG_GO to 0x370 on each worker; each worker increments stream 48 on the dispatch core when it transitions to RUN_MSG_DONE; dispatch's WAIT(N) unblocks and clears the stream for the next launch.

The go_word is built in device._go_word():
```python
go = GoMsg()
go.bits.signal = DevMsgs.RUN_MSG_GO  # = 0x80
go.bits.master_x, go.bits.master_y = DISPATCH_NOC_X, DISPATCH_NOC_Y
go.bits.dispatch_message_offset = 0
return go.all  # uint32: [signal | mx | my | offset]
```
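For illustration only — GoMsg is a bitfield whose exact widths and ordering live in the firmware headers; assuming the four fields pack low-to-high as 8-bit lanes (signal, master_x, master_y, offset), the word would be:

```python
RUN_MSG_GO = 0x80  # from DevMsgs

def pack_go_word(signal, master_x, master_y, dispatch_message_offset):
    # Hypothetical layout: four 8-bit fields, lowest byte first.
    assert all(0 <= v < 256 for v in (signal, master_x, master_y, dispatch_message_offset))
    return signal | (master_x << 8) | (master_y << 16) | (dispatch_message_offset << 24)

go = pack_go_word(RUN_MSG_GO, 14, 3, 0)  # e.g. P100 dispatch core at (14, 3)
```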
SET_GO_SIGNAL_NOC_DATA copies a list of NOC xy encodings into go_signal_noc_data[256]. A subsequent SEND_GO_SIGNAL walks the array starting at noc_data_start_index.
The first CQWaitStream(DONE_STREAM, 0) trivially passes the WAIT check (≥ 0 is always true), then atomically zeroes the stream counter by writing -(*sem) << REMOTE_DEST_BUF_WORDS_FREE_INC to the update reg. Streams are accumulator-style registers, not normal memory.
SEND_GO_SIGNAL's unicast loop:

```cpp
for (i = 0; i < num_unicasts; ++i) {
    dst = get_noc_addr_helper(go_signal_noc_data[idx++], UNICAST_GO_SIGNAL_ADDR);
    noc_async_write_one_packet(aligned_go_signal_storage, dst, sizeof(uint32_t));
}
```
The go word lands at 0x370 on each worker tensix. The worker BRISC spins on that address, sees signal == RUN_MSG_GO (0x80), starts the program; when done it writes back RUN_MSG_DONE = 0 and bumps stream 48 of the dispatch core by 1.
Dispatcher spins until stream 48 reaches N, then clears it. This is the "bottom of cycle" fence — after it returns, the next launch can proceed.
wait_completion on the host:

```python
while True:
    wr_raw = self._sysmem_read32(_HOST_CQ_WR_OFF)
    wr_16b, wr_toggle = wr_raw & 0x7FFF_FFFF, (wr_raw >> 31) & 1
    if (wr_16b, wr_toggle) != (self._completion_rd_16b, self._completion_rd_toggle):
        off = (self._completion_rd_16b << 4) - self.noc_local
        got = self._sysmem_read32(off + 16)  # skip echoed header
        self._completion_rd_16b += self._completion_page_16b
        if self._completion_rd_16b >= self._completion_end_16b:
            self._completion_rd_16b = self._completion_base_16b
            self._completion_rd_toggle ^= 1
        raw = (self._completion_rd_16b & 0x7FFF_FFFF) | (self._completion_rd_toggle << 31)
        self._dispatch_win.write32(CQ_COMPLETION_RD_PTR, raw)
        self._sysmem_write32(_HOST_CQ_RD_OFF, raw)
        if got != (event_id & 0xFFFFFFFF):
            raise RuntimeError("event mismatch")
        return
    time.sleep(0.0002)
```
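Both sides encode completion pointers the same way: a 16-byte-unit offset in the low 31 bits plus a wrap toggle in bit 31. Factored out as helpers (a sketch mirroring the packing in the loop above):

```python
def pack_cq_ptr(ptr_16b: int, toggle: int) -> int:
    # low 31 bits: pointer in 16-byte units; bit 31: wrap toggle
    return (ptr_16b & 0x7FFF_FFFF) | ((toggle & 1) << 31)

def unpack_cq_ptr(raw: int):
    return raw & 0x7FFF_FFFF, (raw >> 31) & 1

raw = pack_cq_ptr(0x123, 1)
```

The toggle is what distinguishes a full ring from an empty one when the write and read offsets are equal; the byte address is simply ptr_16b << 4.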
A concrete trace for device.run() on one program with N cores:
1. _compile_ir(program, ...) builds IR: Write(go_msg reset) + Write(go_msg_index) + Write(uniform_rta) + Write(launch, role) + Write(shared, role) + Launch(cores).
2. lower_fast prepends a CQTimestamp for slot 0, lowers IR to CQ commands, appends a CQTimestamp for slot 1, then CQHostEvent(event_id).
3. Each command is to_bytes()-serialized, wrapped in _relay_inline, and appended to cq.stream.
4. CQSysmem.flush(cq) copies each record into sysmem and pushes its size onto prefetch_q.
5. fetch_q_get_cmds sees the new slot, NOC-reads the record from the hugepage into cmddat_q, marks the slot 0.
6. The prefetcher processes RELAY_INLINE: acquires N pages of dispatch CB credit, NOC-writes the inner command into the dispatch CB, releases credit via NOC atomic-inc.
7. The dispatcher's process_cmd — for WRITE_PACKED issues N NOC writes to workers; for WAIT_STREAM polls stream 48; for SEND_GO_SIGNAL unicasts the go word.
8. WRITE_LINEAR_H_HOST writes 32 B (cmd hdr + event page) to the completion region and updates the wr_ptr in the sysmem mirror.
9. wait_completion polls _HOST_CQ_WR_OFF, sees the move, reads event_id, advances rd_ptr, writes back. Done.

Bookkeeping along the way:

- Three rings: prefetch_q (16-bit sizes, 1534 entries), cmddat_q (256 KiB byte ring), dispatch CB (128 × 4 KiB page ring). Each has its own producer/consumer ptrs.
- Semaphores for credit and sync (including the STALL / NOTIFY_PREFETCH pair). All counted, all over-NOC atomic-incremented.
- Timing: CQTimestamp commands have the dispatcher sample its wall clock and write 8 B to a sysmem slot. With TT_CORE_TIMING=1, the host also arms each worker BRISC to log its own start/end into another pair of slots.
- Unused firmware paths: RELAY_PAGED*, RELAY_LINEAR, RELAY_RINGBUFFER, EXEC_BUF*, WRITE_PAGED, WRITE_LINEAR(_H), NOTIFY_SUBORDINATE_GO_SIGNAL, mcast SEND_GO_SIGNAL. The host only emits WRITE_PACKED, WRITE_PACKED_LARGE, WAIT, SEND_GO_SIGNAL, SET_GO_SIGNAL_NOC_DATA, WRITE_LINEAR_H_HOST, TIMESTAMP.