
Fast Dispatch / CQ Firmware on Blackhole

A self-contained, detailed walkthrough of how the Command Queue (CQ) firmware works in blackhole-py-old — host code in cq.py / dispatch.py / device.py, firmware in firmware/cq/.

0. System overview

The pipeline has three cooperating actors:

  1. Host — runs the Python in cq.py. Writes commands into a pinned hugepage ("issue region") and pushes 2-byte size entries onto the prefetch_q, a ring of 16-bit slots in the prefetch core's L1.
  2. Prefetch core (BRISC on (PREFETCH_NOC_X, PREFETCH_NOC_Y), NOC 0) — reads from host hugepage over PCIe-NOC, lands data in its cmddat_q, relays commands to the dispatch core's CB. Built from cq_prefetch.cpp.
  3. Dispatch core (BRISC on (DISPATCH_NOC_X, DISPATCH_NOC_Y), NOC 1, with a co-resident NCRISC "subordinate") — drains the relayed command stream, executes individual dispatch commands. Built from cq_dispatch.cpp (BRISC) and cq_dispatch_subordinate.cpp (NCRISC).
Figure 1 — Three-actor pipeline. Host pushes records through hugepage and a 16-bit ring; prefetch core relays them into dispatch's CB; dispatch interprets them and unicasts go-words to workers; workers signal back via NOC stream registers; dispatch writes events into host's completion FIFO.

The board layout is wired in device.py:

P100: tensix_x = 1..7, 10..14   prefetch=(14,2)  dispatch=(14,3)
P150: tensix_x = 1..7, 10..16   prefetch=(16,2)  dispatch=(16,3)
PCIE NOC: x=19  y=24   (absolute; not subject to NOC mirroring)

build.sh compiles the CQ kernels with -DPREFETCH_NOC_X/Y, -DDISPATCH_NOC_X/Y, -DPCIE_NOC_X=19 -DPCIE_NOC_Y=24 and -DNOC_INDEX={0,1}. Prefetch BRISC is NOC 0; dispatch BRISC + the subordinate NCRISC are both NOC 1.

1. Memory map and pointers

cq_fixed_config.hpp is the source of truth for L1 layout. Three include-guarded blocks select the layout for each processor.

Figure 2 — L1 layouts for prefetch and dispatch cores, and the host hugepage. Arrows show the three NOC traffic flows: PCIe-NOC reads (host → prefetch), RELAY_INLINE (prefetch → dispatch CB), and completion-FIFO writes back to sysmem.

1.1 Prefetch core L1 (NOC 0 BRISC)

Region                          Base                     Size                           Purpose
PCIE NOC view of host hugepage  pcie_base = 0x4000_0100  64 MiB                         Prefetcher reads via NOC PCIe
prefetch_q                      0x19840                  3068 B (1534 × 2 B)            Ring of 16-bit fetch sizes
prefetch_q_rd_ptr               0x196C0                  4 B                            Device read pointer (echoed back to host)
prefetch_q_pcie_rd_ptr          0x196C4                  4 B                            Current PCIe read offset (echoed back)
cmddat_q                        0x1A440                  256 KiB, 64 × 4 KiB, 4 blocks  Where commands land
scratch_db                      0x5A440                  128 KiB (2 × 64 KiB)           Double-buffered bounce for paged/linear relays
dispatch_s_buffer (view)        0x9A000                  32 KiB, log_page=8             Path to dispatch subordinate
downstream_cb (view)            0x1A000                  128 × 4 KiB = 512 KiB          Dispatch BRISC CB

Semaphores are 16 B aligned and allocated on the FD core type; among them is my_downstream_sync_sem, which the STALL command (section 3.1) spins on.

1.2 Dispatch core L1 (NOC 1 BRISC, "dispatch_d")

Address            Name                        Purpose
0x196D0            DEV_COMPLETION_Q_WR_PTR     Device-side completion FIFO write ptr (mirror)
0x196E0            DEV_COMPLETION_Q_RD_PTR     Host's read ptr written into device L1
0x1A000            DISPATCH_CB_BASE            Input CB from prefetcher, 128 × 4 KiB, 4 blocks
0x40000000 (NOC)   COMMAND_QUEUE_BASE_ADDR     Host completion ptr in sysmem (mirror target)
0x44000100 (NOC)   COMPLETION_QUEUE_BASE_ADDR  Host completion region (32 MiB)
0x370              UNICAST_GO_SIGNAL_ADDR      Per-worker go-message offset

1.3 Dispatch subordinate L1 (NCRISC, NOC 1)

Co-resident with dispatch_d on the same tile. CB at 0x9A000, 32 KiB, log_page_size=8. Worker-completion stream registers live in the NOC stream register file from FIRST_STREAM_USED = 48. DISPATCH_S_SYNC_SEM_BASE_ADDR = 0x19710 — array of 16-byte-strided semaphores bumped by dispatch_d so dispatch_s knows when it's safe to send a go signal.

1.4 Host sysmem layout

+0                       (header / reserved)
+2 * PCIE_ALIGN          _HOST_CQ_WR_OFF      (4 B: completion-fifo wr-ptr | toggle)
+3 * PCIE_ALIGN          _HOST_CQ_RD_OFF      (4 B: host's rd-ptr | toggle echo)
+4 * PCIE_ALIGN          _HOST_ISSUE_BASE     issue region, 64 MiB
+   ...                  _HOST_COMPLETION_BASE  completion region, 32 MiB
+   ...                  _HOST_TIMESTAMP_BASE   4096 × 16 B   (CQTimestamp slots)
+   ...                  _HOST_CORE_TIMING_BASE 4096 × 16 B   (per-core BRISC start/end)

PCIE_ALIGN = 64,  L1_ALIGN = 16,  PAGE_SIZE = 4096
The buffer is allocated with mmap(MAP_SHARED|MAP_ANONYMOUS|MAP_POPULATE) and pinned via PCIDevice.pin_pages(); the returned IOVA must have bit 60 set (the "PCIE NOC base" marker, _PCIE_NOC_BASE = 1 << 60). Stripping that marker and truncating to 32 bits (noc_local = noc_addr - _PCIE_NOC_BASE) yields the offset the prefetcher uses against pcie_base.
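A minimal sketch of that address arithmetic (the constant mirrors cq.py; the helper itself is illustrative):

_PCIE_NOC_BASE = 1 << 60   # marker bit required on the pinned IOVA

def pcie_base_offset(noc_addr: int) -> int:
    # The prefetcher reaches sysmem through the PCIe NOC endpoint, which only
    # sees a 32-bit local offset; strip the marker and truncate.
    assert noc_addr & _PCIE_NOC_BASE, "pinned IOVA must carry the PCIE NOC marker"
    return (noc_addr - _PCIE_NOC_BASE) & 0xFFFF_FFFF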

2. Host-side command construction (cq.py)

2.1 Commands

Every dispatch command is a 16-byte fixed-size header followed by an L1-aligned payload. The byte layout is defined by cq_commands.hpp and produced on the host by _cq_hdr(fmt, ...) which struct-packs and zero-pads to 16 bytes.
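A plausible shape for _cq_hdr (a sketch under those constraints; the real helper lives in cq.py):

import struct

CQ_CMD_SIZE = 16   # fixed dispatch/prefetch header size

def _cq_hdr(fmt, *fields):
    # struct-pack the header fields, then zero-pad to the fixed 16 B size.
    hdr = struct.pack(fmt, *fields)
    assert len(hdr) <= CQ_CMD_SIZE
    return hdr.ljust(CQ_CMD_SIZE, b"\0")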

ID   Name                    Dataclass
3    WRITE_LINEAR_H_HOST     CQHostEvent
5    WRITE_PACKED            CQWritePacked
6    WRITE_PACKED_LARGE      CQWritePackedLarge
7    WAIT                    CQWaitStream, CQBarrier
14   SEND_GO_SIGNAL          CQSendGoSignal
17   SET_GO_SIGNAL_NOC_DATA  CQSetGoSignalNocData
18   TIMESTAMP               CQTimestamp

2.2 RELAY_INLINE wrapping

Each top-level dispatch command is wrapped on the host into a RELAY_INLINE prefetch command:

def _relay_inline(payload):
  # Wrap a serialized dispatch command in a RELAY_INLINE prefetch header and
  # pad the record to a PCIE_ALIGN (64 B) multiple; stride is the full record size.
  stride = align_up(CQ_CMD_SIZE + len(payload), PCIE_ALIGN)
  hdr = _cq_hdr("<BBHII", _RELAY_INLINE, 0, 0, len(payload), stride)
  return hdr + payload.ljust(stride - CQ_CMD_SIZE, b"\0")
Figure 3 — A single record in the issue region: outer 16 B prefetch header, inner 16 B dispatch header, sub-commands, payload(s), and tail padding so the whole thing is PCIE_ALIGN (64 B). The 2-byte ring entry encodes stride >> 4.
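The size arithmetic is easy to sanity-check in isolation (illustrative helper, not from cq.py):

PCIE_ALIGN, CQ_CMD_SIZE = 64, 16

def ring_entry(payload_len):
    # stride = record length rounded up to 64 B; the uint16 slot holds stride >> 4
    stride = (CQ_CMD_SIZE + payload_len + PCIE_ALIGN - 1) // PCIE_ALIGN * PCIE_ALIGN
    assert stride >> 4 <= 0xFFFF       # must fit the 16-bit slot
    return stride >> 4

assert ring_entry(48) == 4             # 16 B hdr + 48 B payload -> one 64 B record -> entry 4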

2.3 prefetch_q ring

Figure 4 — prefetch_q as a 1534-slot SPSC ring. Each slot is a uint16 holding either 0 (free) or the 16-byte-unit size of the next chunk to fetch. The high bit is an EXEC_BUF stall flag (unused in this build).

2.4 Issue region writes

1. Bump _issue_wr to next PCIE_ALIGN boundary.
2. If record won't fit, wrap to 0.
3. Copy record into sysmem at _HOST_ISSUE_BASE + _issue_wr.
4. _issue_wr += len(record).
5. Spin waiting until prefetch_q[idx] == 0.
6. Atomically write 2-byte size (record_len >> 4) into prefetch_q[idx].
7. idx = (idx + 1) % CQ_PREFETCH_Q_ENTRIES.
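Condensed into one hypothetical method (helper names like _sysmem_write, _prefetch_q_write16 and the _HOST_ISSUE_SIZE constant are illustrative, not the exact cq.py API):

def push_record(self, record):
    if self._issue_wr + len(record) > _HOST_ISSUE_SIZE:
        self._issue_wr = 0                                    # wrap; records never straddle the end
    self._sysmem_write(_HOST_ISSUE_BASE + self._issue_wr, record)
    self._issue_wr += len(record)
    while self._prefetch_q_read16(self._pq_idx) != 0:         # slot still owned by the device
        pass
    self._prefetch_q_write16(self._pq_idx, len(record) >> 4)  # publish size in 16 B units
    self._pq_idx = (self._pq_idx + 1) % CQ_PREFETCH_Q_ENTRIES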

2.5 Lowering programs (dispatch.py)

build_ir(program, ...) produces a list of IRCommands; _lower_ir then converts:

Write(cores, addr, bytes)         → CQWritePackedLarge + CQBarrier  (per 1 KiB chunk)
Write(cores, addr, list[bytes])   → CQWritePacked                   (per L1_ALIGN slice)
Launch(cores)                     → CQSetGoSignalNocData(cores)
                                    CQWaitStream(DONE_STREAM=48, 0)
                                    CQSendGoSignal(go_word, 48, 0, len(cores))
                                    CQWaitStream(DONE_STREAM, len(cores))

lower_fast(programs, ...) bookends each per-program IR with CQTimestamp writes. device._run_fast_dispatch() appends a single CQHostEvent — that is the event the host waits on.
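Schematically, and assuming a per-program pair of timestamp slots (the slot numbering here is a guess; section 8 only pins down slots 0 and 1 for a single program):

def lower_fast(programs):
    cmds = []
    for i, prog in enumerate(programs):
        cmds.append(CQTimestamp(slot=2 * i))        # start stamp (slot scheme assumed)
        cmds.extend(_lower_ir(build_ir(prog)))
        cmds.append(CQTimestamp(slot=2 * i + 1))    # end stamp
    return cmds   # device._run_fast_dispatch() then appends the CQHostEvent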

3. Prefetcher startup and main loops

The prefetch firmware lives in cq_prefetch.cpp. It compiles to three variants (_h, _d, _hd) but for blackhole-py-old only the unified _hd is used because IS_H_VARIANT = IS_D_VARIANT = 1.

void kernel_main_hd() {
    uint32_t cmd_ptr = cmddat_q_base;
    uint32_t fence   = cmddat_q_base;
    uint32_t pcie_read_ptr = pcie_base;  // advanced as fetch_q_get_cmds issues PCIe reads
    bool done = false;
    uint32_t l1_cache[l1_cache_elements_rounded];
    PrefetchExecBufState exec_buf_state;

    cq_noc_async_write_init_state<...>(0, dst_dispatch, 0);    // BRISC_WR_CMD_BUF
    cq_noc_async_write_init_state<...>(0, dst_dispatch_s, 0);  // BRISC_WR_REG_CMD_BUF

    while (!done) {
        fetch_q_get_cmds<0>(fence, cmd_ptr, pcie_read_ptr);
        volatile CQPrefetchCmd* cmd = (volatile CQPrefetchCmd*)cmd_ptr;
        uint32_t stride;
        done = process_cmd<false, false>(cmd_ptr, downstream_data_ptr, stride, l1_cache, exec_buf_state);
        cmd_ptr += stride;
    }
}
Figure 5 — fetch_q_get_cmds as a state machine over three flags. The function returns as soon as there is something to process; otherwise it advances any in-flight read, or starts a new PCIe-NOC read keyed off the next nonzero prefetch_q slot.
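Rendered as host-side Python pseudocode (illustrative; noc_read / noc_read_barrier stand in for the NOC primitives, and the real loop can overlap multiple reads):

CQ_PREFETCH_Q_ENTRIES = 1534

def fetch_q_get_cmds(state, noc_read, noc_read_barrier):
    # state carries cmd_ptr / fence / pending_read_size / prefetch_q / rd_idx / pcie_read_ptr.
    while True:
        if state.cmd_ptr != state.fence:            # cmd_ready: let process_cmd run
            return
        if state.pending_read_size:                 # read_pending: retire the in-flight read
            noc_read_barrier()
            state.fence += state.pending_read_size
            state.pending_read_size = 0
            continue
        size16 = state.prefetch_q[state.rd_idx]     # next slot (after dcache invalidate)
        if size16 == 0:
            continue                                # spin until the host publishes a size
        nbytes = size16 << 4
        noc_read(src=state.pcie_noc_xy | state.pcie_read_ptr, dst=state.fence, nbytes=nbytes)
        state.pending_read_size = nbytes
        state.pcie_read_ptr += nbytes
        state.prefetch_q[state.rd_idx] = 0          # release the slot back to the host
        state.rd_idx = (state.rd_idx + 1) % CQ_PREFETCH_Q_ENTRIES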

3.1 process_cmd (prefetcher)

Prefetch cmd           Action
RELAY_LINEAR           NOC-read from (noc_xy, addr) into scratch_db, stream to dispatch CB. (unused here)
RELAY_PAGED / _PACKED  Pull pages from interleaved DRAM banks, double-buffer through scratch_db_top[0..1]. (unused)
RELAY_INLINE           The path Python uses for every command. Copy body from cmddat_q to dispatch CB, flushing pages as it goes.
RELAY_INLINE_NOFLUSH   Same, but doesn't release the trailing page; the next command can append.
EXEC_BUF / _END        Switch source from PCIe to a DRAM-resident command buffer; stalls new PCIe reads.
STALL                  Spin on my_downstream_sync_sem hitting a count (driven by dispatch_d's NOTIFY_PREFETCH WAIT flag).
TERMINATE              Drop out of kernel_main_hd.

3.2 process_relay_inline_common — the hot path

uint32_t length = cmd->relay_inline.length;
uint32_t stride = cmd->relay_inline.stride;
uint32_t npages = (length + page_size - 1) >> log_page_size;

RelayInlineState::cb_writer.acquire_pages(npages);     // wait on credits from dispatch
CmdPayloadStream stream{cmd_ptr, exec_buf_state};
stream.init(sizeof(CQPrefetchCmd));                    // skip 16 B prefetch header
uint32_t remaining = length;                           // dispatch-cmd bytes left to relay
while (remaining > 0) {
    ByteSpan span = stream.next_span(remaining, stride);
    write_downstream<...>(span.ptr, local_downstream_data_ptr, span.len, ...);
    remaining -= span.len;
}
local_downstream_data_ptr = round_up_pow2(local_downstream_data_ptr, page_size);
noc_async_writes_flushed(RelayInlineState::downstream_noc_index);
RelayInlineState::cb_writer.release_pages(npages, local_downstream_data_ptr);
The CBWriter keeps an additional_count field for "borrowed credit" (used by the EXEC_BUF / ringbuffer paths) and uses a wrap_gt signed-difference comparison so the credit semaphore can safely wrap past 2³¹.
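The comparison itself is tiny; in Python terms (assuming 32-bit counters):

def wrap_gt(a, b):
    # Signed view of the 32-bit difference: stays correct across wrap-around
    # as long as the two counters never diverge by 2**31 or more.
    diff = (a - b) & 0xFFFFFFFF
    return 0 < diff < 0x80000000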

4. Dispatch core (master, BRISC NOC 1)

void kernel_main() {
    set_l1_data_cache<true>();
    if constexpr (NOC_INDEX != UPSTREAM_NOC_INDEX) noc_local_state_init(UPSTREAM_NOC_INDEX);
    reset_worker_completion_stream_counts<FIRST_STREAM_USED, MAX_NUM_WORKER_SEMS>();  // 48, 8
    dispatch_cb_reader.init();
    uint32_t cmd_ptr = DISPATCH_CB_BASE;
    uint32_t l1_cache[l1_cache_elements_rounded];
    write_offset[0] = write_offset[1] = write_offset[2] = 0;
    completion_queue_wr_ptr_and_toggle = *get_cq_completion_write_ptr();
    bool done = false;
    while (!done) {
        dispatch_cb_reader.wait_for_available_data_and_release_old_pages(cmd_ptr);
        done = process_cmd<true /* IS_D_VARIANT */>(cmd_ptr, l1_cache);
        cmd_ptr = round_up_pow2(cmd_ptr, dispatch_cb_page_size);
    }
}

4.1 The dispatch CB — 4-block ring with deferred release

DISPATCH_CB: 128 × 4 KiB pages = 512 KiB, partitioned into 4 blocks of 32 pages (128 KiB each). A block is freed only after the next block is finished AND its NOC writes have left the NIU, so the prefetcher's incoming writes can't race with the dispatcher's outgoing writes from the same pages.
Figure 6 — Dispatch CB deferred-release pattern. The active block (cmd_ptr) drains; the previous block waits for outstanding NOC writes to land before its pages are released back to the prefetcher.

CBReaderWithReleasePolicy (defined in cq_common.hpp) maintains the read fence (cb_fence), the current block index, and per-block page counts; it returns a block's credits to the prefetcher only once the deferred-release condition of Figure 6 holds.
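A toy model of the release discipline (names hypothetical; release_cb stands in for the NOC atomic-inc that returns credits to the prefetcher):

class DeferredReleaseReader:
    def __init__(self, release_cb, npages=128, nblocks=4):
        self.release_cb = release_cb          # returns pages to the prefetcher
        self.pages_per_block = npages // nblocks
        self.nblocks = nblocks
        self.current = 0                      # block being drained now

    def advance_block(self, noc_writes_flushed):
        # The block we just left may still be feeding in-flight NOC writes,
        # so free the block *before* it, after confirming its writes left the NIU.
        noc_writes_flushed()
        self.release_cb((self.current - 1) % self.nblocks, self.pages_per_block)
        self.current = (self.current + 1) % self.nblocks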

4.2 process_cmd (dispatcher)

Dispatch cmd            Function
WRITE_LINEAR            process_write_linear(num_mcast_dests)
WRITE_LINEAR_H_HOST     process_write_host_h() — host event / read-buffer path
WRITE_PAGED             process_write_paged<is_dram>()
WRITE_PACKED            process_write_packed<mcast, SubCmdT>()
WRITE_PACKED_LARGE      process_write_packed_large()
WAIT                    process_wait()
SEND_GO_SIGNAL          process_go_signal_mcast_cmd()
SET_GO_SIGNAL_NOC_DATA  copies words into local 256-entry array
SET_WRITE_OFFSET        updates write_offset[3] for relocation
TIMESTAMP               samples wall-clock and writes 8 B to a NOC dst
TERMINATE               breaks the loop

4.3 The hot path: process_write_packed

Figure 7 — process_write_packed. Sub-commands (one per target core) are copied into BRISC's local cache, then a tight loop issues one stateful NOC write per sub-command. If NO_STRIDE is set, all cores share the same payload pointer; otherwise data_ptr advances by L1_ALIGN-padded xfer_size.
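A host-side model of the loop (illustrative; hdr.addr / sub.noc_xy field names and the noc_write callable are assumptions standing in for the firmware primitives):

L1_ALIGN = 16

def align_up(n, a):
    return (n + a - 1) & ~(a - 1)

def process_write_packed(hdr, subs, payload_base, noc_write, no_stride=False):
    stride = 0 if no_stride else align_up(hdr.xfer_size, L1_ALIGN)
    data_ptr = payload_base
    for sub in subs:                          # one sub-command per target core
        noc_write(dst_xy=sub.noc_xy, dst_addr=hdr.addr,
                  src=data_ptr, nbytes=hdr.xfer_size)
        data_ptr += stride                    # all cores share the payload when NO_STRIDE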

4.4 process_write_host_h (host event / read-back)

data_ptr = cmd_ptr;     // include cmd header so host can parse it
wlength  = cmd->write_linear_host.length;
init NOC state to pcie_noc_xy.
while (wlength != 0) {
    chunk_len = min(wlength, ~(dispatch_cb_page_size - 1));
    while (chunk_len != 0) {
        wait_for_data_in_cb(data_ptr);
        xfer = min(chunk_len, available_in_cb);
        npages = div_up(xfer, completion_page_size);
        completion_queue_reserve_back(npages);          // blocks on rd_ptr
        wr_addr = cq_write_interface.completion_fifo_wr_ptr << 4;
        if (wr_addr + xfer > completion_queue_end) { split + wrap; }
        write to (pcie_noc_xy, wr_addr) of size xfer
        completion_queue_push_back(npages);             // updates ptr in sysmem
        data_ptr += xfer; chunk_len -= xfer; wlength -= xfer;
    }
}

4.5 Sync commands (WAIT flags)

5. Dispatch subordinate (NCRISC, NOC 1)

cq_dispatch_subordinate.cpp lives on the same tensix as dispatch_d but runs on the NCRISC. It uses NOC 1 with a customized command-buffer allocation so BRISC and NCRISC never collide:

NOC 1 command-buffer allocation on the dispatch tile:

  Cmd Buf 0   dispatch_s   regular writes
  Cmd Buf 1   dispatch_s   inline / reg writes
  Cmd Buf 2   dispatch_s   atomics
  Cmd Buf 3   dispatch_d   NCRISC_WR_CMD_BUF
Figure 8 — On the dispatch tile, BRISC (dispatch_d) and NCRISC (dispatch_s) share NOC 1. The subordinate is restricted to cmd-bufs 0–2; the master keeps cmd-buf 3. Manual NOC_CMD_BUF_WRITE_REG setup at boot enforces the partition.

In blackhole-py-old's fixed config DISTRIBUTED_DISPATCHER = 0, so the host never flags go-signal commands for the subordinate (DISPATCH_SUBORDINATE). After start-up, dispatch_s mostly idles; it is kept fully functional so that future split configurations can drop in.

6. Launch flow — how a program actually starts

_lower_ir lowers a Launch(cores) to four CQ commands in order:

CQSetGoSignalNocData(cores)                 # 17: SET_GO_SIGNAL_NOC_DATA
CQWaitStream(DONE_STREAM=48, 0)             # 7 + WAIT_STREAM + CLEAR_STREAM
CQSendGoSignal(go_word, 48, 0, len(cores))  # 14: SEND_GO_SIGNAL  (num_unicast = N)
CQWaitStream(DONE_STREAM, len(cores))       # 7 + WAIT_STREAM + CLEAR_STREAM
Figure 9 — Launch handshake (vertical = time). Four CQ commands move from host → prefetch → dispatch; dispatch unicasts RUN_MSG_GO to 0x370 on each worker; each worker increments stream 48 on the dispatch core when it transitions to RUN_MSG_DONE; dispatch's WAIT(N) unblocks and clears the stream for the next launch.

The go_word is built in device._go_word():

go = GoMsg()
go.bits.signal = DevMsgs.RUN_MSG_GO          # = 0x80
go.bits.master_x, go.bits.master_y = DISPATCH_NOC_X, DISPATCH_NOC_Y
go.bits.dispatch_message_offset = 0
return go.all                                # uint32: [signal | mx | my | offset]

6.1 SET_GO_SIGNAL_NOC_DATA

Copies a list of NOC xy encodings into go_signal_noc_data[256]. Subsequent SEND_GO_SIGNAL walks the array starting at noc_data_start_index.

6.2 The WAIT(stream=48, 0) + CLEAR

Trivially passes the WAIT check (≥0 always true), then atomically zeroes the stream counter by writing -(*sem) << REMOTE_DEST_BUF_WORDS_FREE_INC to the update reg. Streams are accumulator-style registers, not normal memory.
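The accumulator semantics in miniature (a sketch of the behavior, not the register encoding):

class StreamCounter:
    # A NOC stream register accumulates: writing v to its update register
    # performs counter += v atomically, so workers can increment concurrently.
    def __init__(self):
        self.value = 0
    def remote_update(self, v):
        self.value = (self.value + v) & 0xFFFFFFFF
    def clear(self):
        self.remote_update(-self.value)   # WAIT+CLEAR posts the negated count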

6.3 SEND_GO_SIGNAL

for (i = 0; i < num_unicasts; ++i) {   // idx starts at noc_data_start_index
    dst = get_noc_addr_helper(go_signal_noc_data[idx++], UNICAST_GO_SIGNAL_ADDR);
    noc_async_write_one_packet(aligned_go_signal_storage, dst, sizeof(uint32_t));
}

The go word lands at 0x370 on each worker tensix. The worker BRISC spins on that address, sees signal == RUN_MSG_GO (0x80), starts the program; when done it writes back RUN_MSG_DONE = 0 and bumps stream 48 of the dispatch core by 1.
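The worker side, sketched with the GoMsg / DevMsgs types from device.py (the surrounding l1_read32 / l1_write32 / stream_inc helpers are hypothetical):

def worker_wait_and_run(l1_read32, l1_write32, run_kernel, stream_inc):
    while True:
        go = GoMsg()
        go.all = l1_read32(UNICAST_GO_SIGNAL_ADDR)            # spin on 0x370
        if go.bits.signal == DevMsgs.RUN_MSG_GO:              # 0x80
            break
    run_kernel()
    l1_write32(UNICAST_GO_SIGNAL_ADDR, DevMsgs.RUN_MSG_DONE)  # = 0
    stream_inc(48, 1)   # remote atomic-inc of stream 48 on the dispatch core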

6.4 WAIT(stream=48, N) + CLEAR

Dispatcher spins until stream 48 reaches N, then clears it. This is the "bottom of cycle" fence — after it returns, the next launch can proceed.

7. Host-side completion path

Figure 10 — Completion FIFO. The high bit of both rd/wr pointers is the wraparound toggle: when the wr_ptr wraps to base, its toggle flips, so the dispatcher can tell "wr is one cycle ahead" from "rd has caught up". Each push is a full 4 KiB page regardless of payload size.
while True:
  wr_raw = self._sysmem_read32(_HOST_CQ_WR_OFF)
  wr_16b, wr_toggle = wr_raw & 0x7FFF_FFFF, (wr_raw >> 31) & 1
  if (wr_16b, wr_toggle) != (self._completion_rd_16b, self._completion_rd_toggle):
    off = (self._completion_rd_16b << 4) - self.noc_local
    got = self._sysmem_read32(off + 16)            # skip echoed header
    self._completion_rd_16b += self._completion_page_16b
    if self._completion_rd_16b >= self._completion_end_16b:
      self._completion_rd_16b = self._completion_base_16b
      self._completion_rd_toggle ^= 1
    raw = (self._completion_rd_16b & 0x7FFF_FFFF) | (self._completion_rd_toggle << 31)
    self._dispatch_win.write32(CQ_COMPLETION_RD_PTR, raw)
    self._sysmem_write32(_HOST_CQ_RD_OFF, raw)
    if got != (event_id & 0xFFFFFFFF):
      raise RuntimeError("event mismatch")
    return
  time.sleep(0.0002)

8. Putting it all end-to-end

Figure 11 — End-to-end pipeline. The host can push records far ahead of the device draining them; backpressure is enforced by three independent rings (prefetch_q slots, dispatch CB credits, completion FIFO toggle).

A concrete trace for device.run() on one program with N cores:

  1. _compile_ir(program, ...) builds IR: Write(go_msg reset) + Write(go_msg_index) + Write(uniform_rta) + Write(launch, role) + Write(shared, role) + Launch(cores).
  2. lower_fast prepends a CQTimestamp for slot 0, lowers IR to CQ commands, appends CQTimestamp for slot 1, then CQHostEvent(event_id).
  3. Each CQ command is to_bytes()-serialized, wrapped in _relay_inline, appended to cq.stream.
  4. CQSysmem.flush(cq) copies each record into sysmem and pushes its size onto prefetch_q.
  5. Prefetch core's fetch_q_get_cmds sees the new slot, NOC-reads the record from hugepage into cmddat_q, marks the slot 0.
  6. Prefetch decodes outer RELAY_INLINE, acquires N pages of dispatch CB credit, NOC-writes the inner command into the dispatch CB, releases credit via NOC atomic-inc.
  7. Dispatch BRISC wakes; runs process_cmd — for WRITE_PACKED issues N NOC writes to workers; for WAIT_STREAM polls stream 48; for SEND_GO_SIGNAL unicasts the go word.
  8. Workers receive go word, run, write completion to stream 48 on dispatch core.
  9. Dispatch processes WRITE_LINEAR_H_HOST: writes 32 B (the 16 B echoed cmd header plus the L1_ALIGN-padded event_id) into the completion region, consuming one 4 KiB page; updates wr_ptr in the sysmem mirror.
  10. Host's wait_completion polls _HOST_CQ_WR_OFF, sees move, reads event_id, advances rd_ptr, writes back. Done.

9. Pieces worth re-emphasizing

Cleanest mental model: the prefetcher is a credit-managed NOC DMA from host hugepage to dispatch L1, and the dispatcher is an interpreter of a short command ISA whose primary verbs are "scatter to L1 of N tensix cores (packed)", "atomic-inc / poll a stream register" (the sync primitive), and "write something into the host completion FIFO". Everything else is plumbing and corner-case backpressure.