
Fast Dispatch / CQ Firmware on Blackhole

A self-contained, detailed walkthrough of how the Command Queue (CQ) firmware works in blackhole-py-old — host code in cq.py / dispatch.py / device.py, firmware in firmware/cq/.

0. System overview

The pipeline has three cooperating actors:

  1. Host — runs the Python in cq.py. Writes commands into a pinned hugepage ("issue region") and pushes 2-byte size entries onto the prefetch_q, a ring of 16-bit slots in the prefetch core's L1.
  2. Prefetch core (BRISC on (PREFETCH_NOC_X, PREFETCH_NOC_Y), NOC 0) — reads from host hugepage over PCIe-NOC, lands data in its cmddat_q, relays commands to the dispatch core's CB. Built from cq_prefetch.cpp.
  3. Dispatch core (BRISC on (DISPATCH_NOC_X, DISPATCH_NOC_Y), NOC 1, with a co-resident NCRISC "subordinate") — drains the relayed command stream, executes individual dispatch commands. Built from cq_dispatch.cpp (BRISC) and cq_dispatch_subordinate.cpp (NCRISC).
Figure 1 — Three-actor pipeline. Host pushes records through hugepage and a 16-bit ring; prefetch core relays them into dispatch's CB; dispatch interprets them and unicasts go-words to workers; workers signal back via NOC stream registers; dispatch writes events into host's completion FIFO.

The board layout is wired in device.py:

P100: tensix_x = 1..7, 10..14   prefetch=(14,2)  dispatch=(14,3)
P150: tensix_x = 1..7, 10..16   prefetch=(16,2)  dispatch=(16,3)
PCIE NOC: x=19  y=24   (absolute; not subject to NOC mirroring)

build.sh compiles the CQ kernels with -DPREFETCH_NOC_X/Y, -DDISPATCH_NOC_X/Y, -DPCIE_NOC_X=19 -DPCIE_NOC_Y=24 and -DNOC_INDEX={0,1}. Prefetch BRISC is NOC 0; dispatch BRISC + the subordinate NCRISC are both NOC 1.

1. Memory map and pointers

cq_fixed_config.hpp is the source of truth for L1 layout. Three include-guarded blocks select the layout for each processor.

Figure 2 — L1 layouts for prefetch and dispatch cores, and the host hugepage. Arrows show the three NOC traffic flows: PCIe-NOC reads (host → prefetch), RELAY_INLINE (prefetch → dispatch CB), and completion-FIFO writes back to sysmem.

1.1 Prefetch core L1 (NOC 0 BRISC)

Region                          Base                     Size                           Purpose
PCIE NOC view of host hugepage  pcie_base = 0x4000_0100  64 MiB                         Prefetcher reads via NOC PCIe
prefetch_q                      0x19840                  3068 B (1534 × 2 B)            Ring of 16-bit fetch sizes
prefetch_q_rd_ptr               0x196C0                  4 B                            Device read pointer (echoed back to host)
prefetch_q_pcie_rd_ptr          0x196C4                  4 B                            Current PCIe read offset (echoed back)
cmddat_q                        0x1A440                  256 KiB, 64 × 4 KiB, 4 blocks  Where commands land
scratch_db                      0x5A440                  128 KiB (2 × 64 KiB)           Double-buffered bounce for paged/linear relays
dispatch_s_buffer (view)        0x9A000                  32 KiB, log_page=8             Path to dispatch subordinate
downstream_cb (view)            0x1A000                  128 × 4 KiB = 512 KiB          Dispatch BRISC CB

Semaphores are 16 B aligned and allocated on the FD core type; among them is my_downstream_sync_sem, which the STALL command (section 3.1) spins on.

1.2 Dispatch core L1 (NOC 1 BRISC, "dispatch_d")

Address            Name                        Purpose
0x196D0            DEV_COMPLETION_Q_WR_PTR     Device-side completion FIFO write ptr (mirror)
0x196E0            DEV_COMPLETION_Q_RD_PTR     Host's read ptr written into device L1
0x1A000            DISPATCH_CB_BASE            Input CB from prefetcher, 128 × 4 KiB, 4 blocks
0x40000000 (NOC)   COMMAND_QUEUE_BASE_ADDR     Host completion ptr in sysmem (mirror target)
0x44000100 (NOC)   COMPLETION_QUEUE_BASE_ADDR  Host completion region (32 MiB)
0x370              UNICAST_GO_SIGNAL_ADDR      Per-worker go-message offset

1.3 Dispatch subordinate L1 (NCRISC, NOC 1)

Co-resident with dispatch_d on the same tile. CB at 0x9A000, 32 KiB, log_page_size=8. Worker-completion stream registers live in the NOC stream register file from FIRST_STREAM_USED = 48. DISPATCH_S_SYNC_SEM_BASE_ADDR = 0x19710 — array of 16-byte-strided semaphores bumped by dispatch_d so dispatch_s knows when it's safe to send a go signal.

1.4 Host sysmem layout

+0                       (header / reserved)
+2 * PCIE_ALIGN          _HOST_CQ_WR_OFF      (4 B: completion-fifo wr-ptr | toggle)
+3 * PCIE_ALIGN          _HOST_CQ_RD_OFF      (4 B: host's rd-ptr | toggle echo)
+4 * PCIE_ALIGN          _HOST_ISSUE_BASE     issue region, 64 MiB
+   ...                  _HOST_COMPLETION_BASE  completion region, 32 MiB
+   ...                  _HOST_TIMESTAMP_BASE   4096 × 16 B   (CQTimestamp slots)
+   ...                  _HOST_CORE_TIMING_BASE 4096 × 16 B   (per-core BRISC start/end)

PCIE_ALIGN = 64,  L1_ALIGN = 16,  PAGE_SIZE = 4096
The buffer is allocated with mmap(MAP_SHARED|MAP_ANONYMOUS|MAP_POPULATE) and pinned via PCIDevice.pin_pages(); the returned IOVA must have bit 60 set (the "PCIE NOC base" marker, _PCIE_NOC_BASE = 1 << 60). Stripping that marker and truncating to 32 bits (noc_local = noc_addr - _PCIE_NOC_BASE) yields the offset the prefetcher uses against pcie_base.
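A minimal sketch of that address arithmetic (the constant mirrors cq.py; the helper itself is illustrative):

_PCIE_NOC_BASE = 1 << 60   # marker bit required on the pinned IOVA

def pcie_base_offset(noc_addr: int) -> int:
    # The prefetcher reaches sysmem through the PCIe NOC endpoint, which only
    # sees a 32-bit local offset; strip the marker and truncate.
    assert noc_addr & _PCIE_NOC_BASE, "pinned IOVA must carry the PCIE NOC marker"
    return (noc_addr - _PCIE_NOC_BASE) & 0xFFFF_FFFF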

2. Host-side command construction (cq.py)

2.1 Commands

Every dispatch command is a 16-byte fixed-size header followed by an L1-aligned payload. The byte layout is defined by cq_commands.hpp and produced on the host by _cq_hdr(fmt, ...) which struct-packs and zero-pads to 16 bytes.
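A plausible shape for _cq_hdr (a sketch under those constraints; the real helper lives in cq.py):

import struct

CQ_CMD_SIZE = 16   # fixed dispatch/prefetch header size

def _cq_hdr(fmt, *fields):
    # struct-pack the header fields, then zero-pad to the fixed 16 B size.
    hdr = struct.pack(fmt, *fields)
    assert len(hdr) <= CQ_CMD_SIZE
    return hdr.ljust(CQ_CMD_SIZE, b"\0")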

ID   Name                    Dataclass
3    WRITE_LINEAR_H_HOST     CQHostEvent
5    WRITE_PACKED            CQWritePacked
6    WRITE_PACKED_LARGE      CQWritePackedLarge
7    WAIT                    CQWaitStream, CQBarrier
14   SEND_GO_SIGNAL          CQSendGoSignal
17   SET_GO_SIGNAL_NOC_DATA  CQSetGoSignalNocData
18   TIMESTAMP               CQTimestamp

2.2 RELAY_INLINE wrapping

Each top-level dispatch command is wrapped on the host into a RELAY_INLINE prefetch command:

def _relay_inline(payload):
  # Wrap a serialized dispatch command in a RELAY_INLINE prefetch header and
  # pad the record to a PCIE_ALIGN (64 B) multiple; stride is the full record size.
  stride = align_up(CQ_CMD_SIZE + len(payload), PCIE_ALIGN)
  hdr = _cq_hdr("<BBHII", _RELAY_INLINE, 0, 0, len(payload), stride)
  return hdr + payload.ljust(stride - CQ_CMD_SIZE, b"\0")
Figure 3 — A single record in the issue region: outer 16 B prefetch header, inner 16 B dispatch header, sub-commands, payload(s), and tail padding so the whole thing is PCIE_ALIGN (64 B). The 2-byte ring entry encodes stride >> 4.
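The size arithmetic is easy to sanity-check in isolation (illustrative helper, not from cq.py):

PCIE_ALIGN, CQ_CMD_SIZE = 64, 16

def ring_entry(payload_len):
    # stride = record length rounded up to 64 B; the uint16 slot holds stride >> 4
    stride = (CQ_CMD_SIZE + payload_len + PCIE_ALIGN - 1) // PCIE_ALIGN * PCIE_ALIGN
    assert stride >> 4 <= 0xFFFF       # must fit the 16-bit slot
    return stride >> 4

assert ring_entry(48) == 4             # 16 B hdr + 48 B payload -> one 64 B record -> entry 4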

2.3 prefetch_q ring

Figure 4 — prefetch_q as a 1534-slot SPSC ring. Each slot is a uint16 holding either 0 (free) or the 16-byte-unit size of the next chunk to fetch. The high bit is an EXEC_BUF stall flag (unused in this build).

2.4 Issue region writes

1. Bump _issue_wr to next PCIE_ALIGN boundary.
2. If record won't fit, wrap to 0.
3. Copy record into sysmem at _HOST_ISSUE_BASE + _issue_wr.
4. _issue_wr += len(record).
5. Spin waiting until prefetch_q[idx] == 0.
6. Atomically write 2-byte size (record_len >> 4) into prefetch_q[idx].
7. idx = (idx + 1) % CQ_PREFETCH_Q_ENTRIES.
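Condensed into one hypothetical method (helper names like _sysmem_write, _prefetch_q_write16 and the _HOST_ISSUE_SIZE constant are illustrative, not the exact cq.py API):

def push_record(self, record):
    if self._issue_wr + len(record) > _HOST_ISSUE_SIZE:
        self._issue_wr = 0                                    # wrap; records never straddle the end
    self._sysmem_write(_HOST_ISSUE_BASE + self._issue_wr, record)
    self._issue_wr += len(record)
    while self._prefetch_q_read16(self._pq_idx) != 0:         # slot still owned by the device
        pass
    self._prefetch_q_write16(self._pq_idx, len(record) >> 4)  # publish size in 16 B units
    self._pq_idx = (self._pq_idx + 1) % CQ_PREFETCH_Q_ENTRIES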

2.5 Lowering programs (dispatch.py)

build_ir(program, ...) produces a list of IRCommands; _lower_ir then converts:

Write(cores, addr, bytes)         → CQWritePackedLarge + CQBarrier  (per 1 KiB chunk)
Write(cores, addr, list[bytes])   → CQWritePacked                   (per L1_ALIGN slice)
Launch(cores)                     → CQSetGoSignalNocData(cores)
                                    CQWaitStream(DONE_STREAM=48, 0)
                                    CQSendGoSignal(go_word, 48, 0, len(cores))
                                    CQWaitStream(DONE_STREAM, len(cores))

lower_fast(programs, ...) bookends each per-program IR with CQTimestamp writes. device._run_fast_dispatch() appends a single CQHostEvent — that is the event the host waits on.
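Schematically, and assuming a per-program pair of timestamp slots (the slot numbering here is a guess; section 8 only pins down slots 0 and 1 for a single program):

def lower_fast(programs):
    cmds = []
    for i, prog in enumerate(programs):
        cmds.append(CQTimestamp(slot=2 * i))        # start stamp (slot scheme assumed)
        cmds.extend(_lower_ir(build_ir(prog)))
        cmds.append(CQTimestamp(slot=2 * i + 1))    # end stamp
    return cmds   # device._run_fast_dispatch() then appends the CQHostEvent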

3. Prefetcher startup and main loops

The prefetch firmware lives in cq_prefetch.cpp. It compiles to three variants (_h, _d, _hd) but for blackhole-py-old only the unified _hd is used because IS_H_VARIANT = IS_D_VARIANT = 1.

void kernel_main_hd() {
    uint32_t cmd_ptr = cmddat_q_base;
    uint32_t fence   = cmddat_q_base;
    uint32_t pcie_read_ptr = pcie_base;  // advanced as fetch_q_get_cmds issues PCIe reads
    bool done = false;
    uint32_t l1_cache[l1_cache_elements_rounded];
    PrefetchExecBufState exec_buf_state;

    cq_noc_async_write_init_state<...>(0, dst_dispatch, 0);    // BRISC_WR_CMD_BUF
    cq_noc_async_write_init_state<...>(0, dst_dispatch_s, 0);  // BRISC_WR_REG_CMD_BUF

    while (!done) {
        fetch_q_get_cmds<0>(fence, cmd_ptr, pcie_read_ptr);
        volatile CQPrefetchCmd* cmd = (volatile CQPrefetchCmd*)cmd_ptr;
        uint32_t stride;
        done = process_cmd<false, false>(cmd_ptr, downstream_data_ptr, stride, l1_cache, exec_buf_state);
        cmd_ptr += stride;
    }
}
Figure 5 — fetch_q_get_cmds as a state machine over three flags. The function returns as soon as there is something to process; otherwise it advances any in-flight read, or starts a new PCIe-NOC read keyed off the next nonzero prefetch_q slot.
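Rendered as host-side Python pseudocode (illustrative; noc_read / noc_read_barrier stand in for the NOC primitives, and the real loop can overlap multiple reads):

CQ_PREFETCH_Q_ENTRIES = 1534

def fetch_q_get_cmds(state, noc_read, noc_read_barrier):
    # state carries cmd_ptr / fence / pending_read_size / prefetch_q / rd_idx / pcie_read_ptr.
    while True:
        if state.cmd_ptr != state.fence:            # cmd_ready: let process_cmd run
            return
        if state.pending_read_size:                 # read_pending: retire the in-flight read
            noc_read_barrier()
            state.fence += state.pending_read_size
            state.pending_read_size = 0
            continue
        size16 = state.prefetch_q[state.rd_idx]     # next slot (after dcache invalidate)
        if size16 == 0:
            continue                                # spin until the host publishes a size
        nbytes = size16 << 4
        noc_read(src=state.pcie_noc_xy | state.pcie_read_ptr, dst=state.fence, nbytes=nbytes)
        state.pending_read_size = nbytes
        state.pcie_read_ptr += nbytes
        state.prefetch_q[state.rd_idx] = 0          # release the slot back to the host
        state.rd_idx = (state.rd_idx + 1) % CQ_PREFETCH_Q_ENTRIES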

3.1 process_cmd (prefetcher)

Prefetch cmd           Action
RELAY_LINEAR           NOC-read from (noc_xy, addr) into scratch_db, stream to dispatch CB. (unused here)
RELAY_PAGED / _PACKED  Pull pages from interleaved DRAM banks, double-buffer through scratch_db_top[0..1]. (unused)
RELAY_INLINE           The path Python uses for every command. Copy body from cmddat_q to dispatch CB, flushing pages as it goes.
RELAY_INLINE_NOFLUSH   Same, but doesn't release the trailing page; the next command can append.
EXEC_BUF / _END        Switch source from PCIe to a DRAM-resident command buffer; stalls new PCIe reads.
STALL                  Spin on my_downstream_sync_sem hitting a count (driven by dispatch_d's NOTIFY_PREFETCH WAIT flag).
TERMINATE              Drop out of kernel_main_hd.

3.2 process_relay_inline_common — the hot path

uint32_t length = cmd->relay_inline.length;
uint32_t stride = cmd->relay_inline.stride;
uint32_t npages = (length + page_size - 1) >> log_page_size;

RelayInlineState::cb_writer.acquire_pages(npages);     // wait on credits from dispatch
CmdPayloadStream stream{cmd_ptr, exec_buf_state};
stream.init(sizeof(CQPrefetchCmd));                    // skip 16 B prefetch header
uint32_t remaining = length;                           // dispatch-cmd bytes left to relay
while (remaining > 0) {
    ByteSpan span = stream.next_span(remaining, stride);
    write_downstream<...>(span.ptr, local_downstream_data_ptr, span.len, ...);
    remaining -= span.len;
}
local_downstream_data_ptr = round_up_pow2(local_downstream_data_ptr, page_size);
noc_async_writes_flushed(RelayInlineState::downstream_noc_index);
RelayInlineState::cb_writer.release_pages(npages, local_downstream_data_ptr);
The CBWriter keeps an additional_count field for "borrowed credit" (used by the EXEC_BUF / ringbuffer paths) and uses a wrap_gt signed-difference comparison so the credit semaphore can safely wrap past 2³¹.
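The comparison itself is tiny; in Python terms (assuming 32-bit counters):

def wrap_gt(a, b):
    # Signed view of the 32-bit difference: stays correct across wrap-around
    # as long as the two counters never diverge by 2**31 or more.
    diff = (a - b) & 0xFFFFFFFF
    return 0 < diff < 0x80000000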

4. Dispatch core (master, BRISC NOC 1)

void kernel_main() {
    set_l1_data_cache<true>();
    if constexpr (NOC_INDEX != UPSTREAM_NOC_INDEX) noc_local_state_init(UPSTREAM_NOC_INDEX);
    reset_worker_completion_stream_counts<FIRST_STREAM_USED, MAX_NUM_WORKER_SEMS>();  // 48, 8
    dispatch_cb_reader.init();
    uint32_t cmd_ptr = DISPATCH_CB_BASE;
    uint32_t l1_cache[l1_cache_elements_rounded];
    write_offset[0] = write_offset[1] = write_offset[2] = 0;
    completion_queue_wr_ptr_and_toggle = *get_cq_completion_write_ptr();
    bool done = false;
    while (!done) {
        dispatch_cb_reader.wait_for_available_data_and_release_old_pages(cmd_ptr);
        done = process_cmd<true /* IS_D_VARIANT */>(cmd_ptr, l1_cache);
        cmd_ptr = round_up_pow2(cmd_ptr, dispatch_cb_page_size);
    }
}

4.1 The dispatch CB — 4-block ring with deferred release

DISPATCH_CB: 128 × 4 KiB pages = 512 KiB, partitioned into 4 blocks of 32 pages (128 KiB each). A block is freed only after the next block is finished AND its NOC writes have left the NIU, so the prefetcher's incoming writes can't race with the dispatcher's outgoing writes from the same pages.
Figure 6 — Dispatch CB deferred-release pattern. The active block (cmd_ptr) drains; the previous block waits for outstanding NOC writes to land before its pages are released back to the prefetcher.

CBReaderWithReleasePolicy (defined in cq_common.hpp) maintains the read fence (cb_fence), the current block index, and per-block page counts; it returns a block's credits to the prefetcher only once the deferred-release condition of Figure 6 holds.
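A toy model of the release discipline (names hypothetical; release_cb stands in for the NOC atomic-inc that returns credits to the prefetcher):

class DeferredReleaseReader:
    def __init__(self, release_cb, npages=128, nblocks=4):
        self.release_cb = release_cb          # returns pages to the prefetcher
        self.pages_per_block = npages // nblocks
        self.nblocks = nblocks
        self.current = 0                      # block being drained now

    def advance_block(self, noc_writes_flushed):
        # The block we just left may still be feeding in-flight NOC writes,
        # so free the block *before* it, after confirming its writes left the NIU.
        noc_writes_flushed()
        self.release_cb((self.current - 1) % self.nblocks, self.pages_per_block)
        self.current = (self.current + 1) % self.nblocks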

4.2 process_cmd (dispatcher)

Dispatch cmd            Function
WRITE_LINEAR            process_write_linear(num_mcast_dests)
WRITE_LINEAR_H_HOST     process_write_host_h() — host event / read-buffer path
WRITE_PAGED             process_write_paged<is_dram>()
WRITE_PACKED            process_write_packed<mcast, SubCmdT>()
WRITE_PACKED_LARGE      process_write_packed_large()
WAIT                    process_wait()
SEND_GO_SIGNAL          process_go_signal_mcast_cmd()
SET_GO_SIGNAL_NOC_DATA  copies words into local 256-entry array
SET_WRITE_OFFSET        updates write_offset[3] for relocation
TIMESTAMP               samples wall-clock and writes 8 B to a NOC dst
TERMINATE               breaks the loop

4.3 The hot path: process_write_packed

Figure 7 — process_write_packed. Sub-commands (one per target core) are copied into BRISC's local cache, then a tight loop issues one stateful NOC write per sub-command. If NO_STRIDE is set, all cores share the same payload pointer; otherwise data_ptr advances by L1_ALIGN-padded xfer_size.
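A host-side model of the loop (illustrative; hdr.addr / sub.noc_xy field names and the noc_write callable are assumptions standing in for the firmware primitives):

L1_ALIGN = 16

def align_up(n, a):
    return (n + a - 1) & ~(a - 1)

def process_write_packed(hdr, subs, payload_base, noc_write, no_stride=False):
    stride = 0 if no_stride else align_up(hdr.xfer_size, L1_ALIGN)
    data_ptr = payload_base
    for sub in subs:                          # one sub-command per target core
        noc_write(dst_xy=sub.noc_xy, dst_addr=hdr.addr,
                  src=data_ptr, nbytes=hdr.xfer_size)
        data_ptr += stride                    # all cores share the payload when NO_STRIDE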

4.4 process_write_host_h (host event / read-back)

data_ptr = cmd_ptr;     // include cmd header so host can parse it
wlength  = cmd->write_linear_host.length;
init NOC state to pcie_noc_xy.
while (wlength != 0) {
    chunk_len = min(wlength, ~(dispatch_cb_page_size - 1));
    while (chunk_len != 0) {
        wait_for_data_in_cb(data_ptr);
        xfer = min(chunk_len, available_in_cb);
        npages = div_up(xfer, completion_page_size);
        completion_queue_reserve_back(npages);          // blocks on rd_ptr
        wr_addr = cq_write_interface.completion_fifo_wr_ptr << 4;
        if (wr_addr + xfer > completion_queue_end) { split + wrap; }
        write to (pcie_noc_xy, wr_addr) of size xfer
        completion_queue_push_back(npages);             // updates ptr in sysmem
        data_ptr += xfer; chunk_len -= xfer; wlength -= xfer;
    }
}

4.5 Sync commands (WAIT flags)

5. Dispatch subordinate (NCRISC, NOC 1)

cq_dispatch_subordinate.cpp lives on the same tensix as dispatch_d but runs on the NCRISC. It uses NOC 1 with a customized command-buffer allocation so BRISC and NCRISC never collide:

NOC 1 command-buffer allocation on the dispatch tile:

  Cmd Buf 0   dispatch_s   regular writes
  Cmd Buf 1   dispatch_s   inline / reg writes
  Cmd Buf 2   dispatch_s   atomics
  Cmd Buf 3   dispatch_d   NCRISC_WR_CMD_BUF
Figure 8 — On the dispatch tile, BRISC (dispatch_d) and NCRISC (dispatch_s) share NOC 1. The subordinate is restricted to cmd-bufs 0–2; the master keeps cmd-buf 3. Manual NOC_CMD_BUF_WRITE_REG setup at boot enforces the partition.

In blackhole-py-old's fixed config DISTRIBUTED_DISPATCHER = 0, so the host never flags go-signal commands for the subordinate (DISPATCH_SUBORDINATE). After start-up, dispatch_s mostly idles; it is kept fully functional so that future split configurations can drop in.

6. Launch flow — how a program actually starts

_lower_ir lowers a Launch(cores) to four CQ commands in order:

CQSetGoSignalNocData(cores)                 # 17: SET_GO_SIGNAL_NOC_DATA
CQWaitStream(DONE_STREAM=48, 0)             # 7 + WAIT_STREAM + CLEAR_STREAM
CQSendGoSignal(go_word, 48, 0, len(cores))  # 14: SEND_GO_SIGNAL  (num_unicast = N)
CQWaitStream(DONE_STREAM, len(cores))       # 7 + WAIT_STREAM + CLEAR_STREAM
Figure 9 — Launch handshake (vertical = time). Four CQ commands move from host → prefetch → dispatch; dispatch unicasts RUN_MSG_GO to 0x370 on each worker; each worker increments stream 48 on the dispatch core when it transitions to RUN_MSG_DONE; dispatch's WAIT(N) unblocks and clears the stream for the next launch.

The go_word is built in device._go_word():

go = GoMsg()
go.bits.signal = DevMsgs.RUN_MSG_GO          # = 0x80
go.bits.master_x, go.bits.master_y = DISPATCH_NOC_X, DISPATCH_NOC_Y
go.bits.dispatch_message_offset = 0
return go.all                                # uint32: [signal | mx | my | offset]

6.1 SET_GO_SIGNAL_NOC_DATA

Copies a list of NOC xy encodings into go_signal_noc_data[256]. Subsequent SEND_GO_SIGNAL walks the array starting at noc_data_start_index.

6.2 The WAIT(stream=48, 0) + CLEAR

Trivially passes the WAIT check (≥0 always true), then atomically zeroes the stream counter by writing -(*sem) << REMOTE_DEST_BUF_WORDS_FREE_INC to the update reg. Streams are accumulator-style registers, not normal memory.
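The accumulator semantics in miniature (a sketch of the behavior, not the register encoding):

class StreamCounter:
    # A NOC stream register accumulates: writing v to its update register
    # performs counter += v atomically, so workers can increment concurrently.
    def __init__(self):
        self.value = 0
    def remote_update(self, v):
        self.value = (self.value + v) & 0xFFFFFFFF
    def clear(self):
        self.remote_update(-self.value)   # WAIT+CLEAR posts the negated count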

6.3 SEND_GO_SIGNAL

for (i = 0; i < num_unicasts; ++i) {   // idx starts at noc_data_start_index
    dst = get_noc_addr_helper(go_signal_noc_data[idx++], UNICAST_GO_SIGNAL_ADDR);
    noc_async_write_one_packet(aligned_go_signal_storage, dst, sizeof(uint32_t));
}

The go word lands at 0x370 on each worker tensix. The worker BRISC spins on that address, sees signal == RUN_MSG_GO (0x80), starts the program; when done it writes back RUN_MSG_DONE = 0 and bumps stream 48 of the dispatch core by 1.
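The worker side, sketched with the GoMsg / DevMsgs types from device.py (the surrounding l1_read32 / l1_write32 / stream_inc helpers are hypothetical):

def worker_wait_and_run(l1_read32, l1_write32, run_kernel, stream_inc):
    while True:
        go = GoMsg()
        go.all = l1_read32(UNICAST_GO_SIGNAL_ADDR)            # spin on 0x370
        if go.bits.signal == DevMsgs.RUN_MSG_GO:              # 0x80
            break
    run_kernel()
    l1_write32(UNICAST_GO_SIGNAL_ADDR, DevMsgs.RUN_MSG_DONE)  # = 0
    stream_inc(48, 1)   # remote atomic-inc of stream 48 on the dispatch core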

6.4 WAIT(stream=48, N) + CLEAR

Dispatcher spins until stream 48 reaches N, then clears it. This is the "bottom of cycle" fence — after it returns, the next launch can proceed.

7. Host-side completion path

Figure 10 — Completion FIFO. The high bit of both rd/wr pointers is the wraparound toggle: when the wr_ptr wraps to base, its toggle flips, so the dispatcher can tell "wr is one cycle ahead" from "rd has caught up". Each push is a full 4 KiB page regardless of payload size.
while True:
  wr_raw = self._sysmem_read32(_HOST_CQ_WR_OFF)
  wr_16b, wr_toggle = wr_raw & 0x7FFF_FFFF, (wr_raw >> 31) & 1
  if (wr_16b, wr_toggle) != (self._completion_rd_16b, self._completion_rd_toggle):
    off = (self._completion_rd_16b << 4) - self.noc_local
    got = self._sysmem_read32(off + 16)            # skip echoed header
    self._completion_rd_16b += self._completion_page_16b
    if self._completion_rd_16b >= self._completion_end_16b:
      self._completion_rd_16b = self._completion_base_16b
      self._completion_rd_toggle ^= 1
    raw = (self._completion_rd_16b & 0x7FFF_FFFF) | (self._completion_rd_toggle << 31)
    self._dispatch_win.write32(CQ_COMPLETION_RD_PTR, raw)
    self._sysmem_write32(_HOST_CQ_RD_OFF, raw)
    if got != (event_id & 0xFFFFFFFF):
      raise RuntimeError("event mismatch")
    return
  time.sleep(0.0002)

8. Putting it all end-to-end

Figure 11 — End-to-end pipeline. The host can push records far ahead of the device draining them; backpressure is enforced by three independent rings (prefetch_q slots, dispatch CB credits, completion FIFO toggle).

A concrete trace for device.run() on one program with N cores:

  1. _compile_ir(program, ...) builds IR: Write(go_msg reset) + Write(go_msg_index) + Write(uniform_rta) + Write(launch, role) + Write(shared, role) + Launch(cores).
  2. lower_fast prepends a CQTimestamp for slot 0, lowers IR to CQ commands, appends CQTimestamp for slot 1, then CQHostEvent(event_id).
  3. Each CQ command is to_bytes()-serialized, wrapped in _relay_inline, appended to cq.stream.
  4. CQSysmem.flush(cq) copies each record into sysmem and pushes its size onto prefetch_q.
  5. Prefetch core's fetch_q_get_cmds sees the new slot, NOC-reads the record from hugepage into cmddat_q, marks the slot 0.
  6. Prefetch decodes outer RELAY_INLINE, acquires N pages of dispatch CB credit, NOC-writes the inner command into the dispatch CB, releases credit via NOC atomic-inc.
  7. Dispatch BRISC wakes; runs process_cmd — for WRITE_PACKED issues N NOC writes to workers; for WAIT_STREAM polls stream 48; for SEND_GO_SIGNAL unicasts the go word.
  8. Workers receive go word, run, write completion to stream 48 on dispatch core.
  9. Dispatch processes WRITE_LINEAR_H_HOST: writes 32 B (the 16 B echoed cmd header plus the L1_ALIGN-padded event_id) into the completion region, consuming one 4 KiB page; updates wr_ptr in the sysmem mirror.
  10. Host's wait_completion polls _HOST_CQ_WR_OFF, sees move, reads event_id, advances rd_ptr, writes back. Done.

9. Pieces worth re-emphasizing

Cleanest mental model: the prefetcher is a credit-managed NOC DMA from host hugepage to dispatch L1, and the dispatcher is an interpreter of a short command ISA whose primary verbs are "scatter to L1 of N tensix cores (packed)", "atomic-inc / poll a stream register" (the sync primitive), and "write something into the host completion FIFO". Everything else is plumbing and corner-case backpressure.