anuraagw.me

niu

NIU Emulator Specification (Blackhole)

1. Address Space Overview

Each Tensix tile has 2 NIU instances (NoC0 and NoC1), memory-mapped into the tile’s private address space:

NIUBase AddressSize
NoC00xFFB200000x10000
NoC10xFFB300000x10000

The formula firmware uses to compute a register’s absolute address:

addr = (buf << 11) + (noc << 16) + register_offset

Where register_offset already includes the 0xFFB20000 base (the defines in noc_parameters.h are absolute addresses). For the emulator, offsets from the NIU base are:

RegionOffset RangeDescription
Command Buffers0x0000-0x1FFF4 buffers x 0x800 stride
Misc Control0x0040-0x0068CMD_CTRL, NODE_ID, ECC, CMD_BUF_AVAIL
Configuration0x0100-0x017FNIU_CFG_0, ROUTER_CFG, translate tables
Status Counters0x0200-0x02FF128 x 4-byte counters
Security Fence0x0400-0x04A8Not needed for emulation
Flit Counters0x0500-0x05FFNot needed for emulation

2. Command Buffer Registers

4 command buffers at offsets 0x000, 0x800, 0x1000, 0x1800. Each has this layout:

OffsetNameR/WDescription
+0x00NOC_TARG_ADDR_LORWTarget address bits [31:0]
+0x04NOC_TARG_ADDR_MIDRWTarget address bits [63:32] (PCIe: bit 60 set)
+0x08NOC_TARG_ADDR_HIRWTarget XY coordinate (see encoding below)
+0x0CNOC_RET_ADDR_LORWReturn/dest address bits [31:0]
+0x10NOC_RET_ADDR_MIDRWReturn/dest address bits [63:32]
+0x14NOC_RET_ADDR_HIRWReturn/dest XY coordinate
+0x18NOC_PACKET_TAGRW[13:10] = transaction_id, [9] = header_store
+0x1CNOC_CTRLRWCommand type/flags (see bitfield below)
+0x20NOC_AT_LEN_BERWTransfer length in bytes (or byte-enable low word)
+0x24NOC_AT_LEN_BE_1RWByte-enable high word (for BE writes)
+0x28NOC_AT_DATARWInline write data / atomic operand
+0x2CNOC_BRCST_EXCLUDERWBroadcast exclusion mask
+0x30NOC_L1_ACC_AT_INSTRNRWL1 accumulate atomic instruction encoding
+0x34NOC_SEC_CTRLRWSecurity control (ignore for emu)
+0x40NOC_CMD_CTRLRWTrigger: write 0x1 to fire. Reads 0x0 when ready.
+0x44NOC_NODE_IDROPhysical node XY: {y[5:0], x[5:0]}
+0x48NOC_ENDPOINT_IDROEndpoint ID

Registers at offsets +0x44 and +0x48 read the same value in all 4 command buffers — they are shared (the hardware aliases them).

Conventional buffer assignments (dedicated mode, the common case)

BufferIndexUsed for
WR_CMD_BUF0Large DMA writes
RD_CMD_BUF1All DMA reads
WR_REG_CMD_BUF2Small register/semaphore writes
AT_CMD_BUF3Atomics (incr_get, CAS, swap, accumulate)

3. NOC_CTRL Bitfield

Bit  0: AT       (1 = atomic, 0 = copy)
Bit  1: WR       (1 = write, 0 = read)
Bit  2: WR_BE    (byte-enable write: AT_LEN_BE is a bitmask, not a length)
Bit  3: WR_INLINE (data comes from NOC_AT_DATA register, not from L1)
Bit  4: RESP_MARKED (non-posted: expect ACK; used for barrier tracking)
Bit  5: BRCST_PACKET (multicast)
Bit  6: VC_LINKED (linked VC allocation with previous transaction)
Bit  7: VC_STATIC (use static VC from bits [15:13])
Bit  8: PATH_RESERVE (reserve path for multicast)
Bit  9: MEM_RD_DROP_ACK (drop read ack -- fire-and-forget)
[12:10]: reserved
[15:13]: STATIC_VC (virtual channel number when VC_STATIC=1)
[16]: BRCST_XY (multicast direction flag)
[17]: BRCST_SRC_INCLUDE (include source in multicast)
[26:18]: reserved
[29:27]: ARB_PRIORITY
[30]: reserved
[31]: L1_ACC_AT_EN (enable L1 accumulate atomic)

Transaction type decoding for the emulator

ATWRWR_BEWR_INLINETransaction
0000Read: copy from TARG to RET
0100Write: copy from TARG (local src) to RET (remote dest)
0110Write with byte-enables: AT_LEN_BE/BE_1 = 64-bit bitmask
0101Inline write: 4 bytes from NOC_AT_DATA to TARG addr
1xxxAtomic: operation encoded in NOC_L1_ACC_AT_INSTRN or NOC_AT_LEN_BE

4. XY Coordinate Encoding

Coordinates are packed as 6-bit fields:

Unicast (in NOC_TARG_ADDR_HI / NOC_RET_ADDR_HI):

[5:0]  = x
[11:6] = y

Multicast (when BRCST_PACKET=1):

[5:0]   = end_x
[11:6]  = end_y
[17:12] = start_x
[23:18] = start_y

64-bit address encoding (as used by the API to build src_noc_addr / dst_noc_addr):

[35:0]  = local address (36-bit)
[41:36] = x
[47:42] = y

The API splits these: bits [31:0] go to ADDR_LO, bits [35:32] (plus PCIe bit 60 -> bit 28 of MID) go to ADDR_MID, and bits [47:36] (the XY coordinate) go to ADDR_HI/ADDR_COORDINATE.

5. Configuration Registers (0x100 + index*4)

These are the registers the emulator needs to support (firmware reads/writes them):

IndexNameKey bits
0x00NIU_CFG_0[14] = NOC_ID_TRANSLATE_EN, [16] = CMD_BUFFER_FIFO_EN
0x01ROUTER_CFG_0Router config (not needed for functional emu)
0x06-0x0BNOC_X_ID_TRANSLATE_TABLE_0..5X coordinate translation (6 regs, 6 entries each, 5 bits/entry)
0x0C-0x11NOC_Y_ID_TRANSLATE_TABLE_0..5Y coordinate translation
0x12NOC_ID_LOGICALLogical coordinates: {y[5:0], x[5:0]}
0x14NOC_ID_TRANSLATE_COL_MASKColumn mask for ID translation
0x15NOC_ID_TRANSLATE_ROW_MASKRow mask for ID translation

Critical for emulation: NOC_ID_LOGICAL (at absolute address NIU_base + 0x148) is what firmware reads during noc_init() to discover “who am I.” The ARC firmware pre-programs this before tensix cores boot. Your emulator must pre-populate this for each tile.

6. Status Counters (0x200 + index*4)

These are hardware-maintained monotonically increasing counters. Firmware polls them for barriers (“have all my writes been acked?”). The emulator must increment them when transactions complete.

Key counters firmware actually reads:

IndexNameWhen to increment
0x0NIU_MST_ATOMIC_RESP_RECEIVEDAfter atomic completes
0x1NIU_MST_WR_ACK_RECEIVEDAfter non-posted write ACK received
0x2NIU_MST_RD_RESP_RECEIVEDAfter read response received
0x4NIU_MST_CMD_ACCEPTEDAfter any command accepted (optional)
0x5NIU_MST_RD_REQ_SENTAfter read request sent
0xANIU_MST_NONPOSTED_WR_REQ_SENTAfter non-posted write sent
0xBNIU_MST_POSTED_WR_REQ_SENTAfter posted write sent
0x10+idNIU_MST_REQS_OUTSTANDING_ID(id)Outstanding count per transaction_id

Barrier pattern (this is what firmware spins on):

// Read barrier: spin until RD_RESP_RECEIVED == local counter
while (NOC_STATUS_READ_REG(noc, NIU_MST_RD_RESP_RECEIVED) != noc_reads_num_issued[noc]);

// Write barrier: spin until WR_ACK_RECEIVED == local counter
while (NOC_STATUS_READ_REG(noc, NIU_MST_WR_ACK_RECEIVED) != noc_nonposted_writes_acked[noc]);

For a synchronous emulator: increment counters immediately when NOC_CMD_CTRL fires. This makes all barriers resolve on the next load.

7. Misc Control Registers (at NIU base)

OffsetNameNotes
0x50NUM_MEM_PARITY_ERRAlways 0 in emu
0x54NUM_HEADER_1B_ERRAlways 0
0x58NUM_HEADER_2B_ERRAlways 0
0x5CECC_CTRLIgnore
0x60NOC_CLEAR_OUTSTANDING_REQ_CNTWrite to clear outstanding request counts by ID mask
0x64CMD_BUF_AVAIL[4:0]=buf0 slots, [12:8]=buf1, [20:16]=buf2, [28:24]=buf3. Return all-available.
0x68CMD_BUF_OVFLOverflow flag, always 0 in emu

8. Transaction Execution (on CMD_CTRL write)

When firmware writes 0x1 to NOC_CMD_CTRL at offset +0x40 of a command buffer:

def fire_cmd(self, buf_idx):
    regs = self.cmd_bufs[buf_idx]
    ctrl = regs[0x1C]  # NOC_CTRL

    is_atomic = ctrl & 1
    is_write  = (ctrl >> 1) & 1
    is_wr_be  = (ctrl >> 2) & 1
    is_inline = (ctrl >> 3) & 1
    is_resp_marked = (ctrl >> 4) & 1  # non-posted
    is_mcast  = (ctrl >> 5) & 1

    targ_lo  = regs[0x00]
    targ_mid = regs[0x04]
    targ_xy  = regs[0x08]
    ret_lo   = regs[0x0C]
    ret_mid  = regs[0x10]
    ret_xy   = regs[0x14]
    length   = regs[0x20]

    if is_atomic:
        # Atomic operation on target, return result to ret addr
        execute_atomic(...)
        self.inc_counter(NIU_MST_ATOMIC_RESP_RECEIVED)
    elif is_write:
        if is_inline:
            # Write NOC_AT_DATA (4 bytes) to targ address
            execute_inline_write(...)
        else:
            # DMA from local TARG addr to remote RET addr
            execute_dma_write(...)
        if is_resp_marked:
            self.inc_counter(NIU_MST_WR_ACK_RECEIVED)
            self.inc_counter(NIU_MST_NONPOSTED_WR_REQ_SENT)
        else:
            self.inc_counter(NIU_MST_POSTED_WR_REQ_SENT)
    else:
        # Read: DMA from remote TARG addr to local RET addr
        execute_dma_read(...)
        self.inc_counter(NIU_MST_RD_RESP_RECEIVED)

9. Firmware Boot Sequence (what the emulator must handle)

The BRISC firmware noc_init() does this for each NoC:

  1. Read NOC_CFG(NOC_ID_LOGICAL) (offset 0x148 from NIU base) to get my_x, my_y
  2. Pre-program command buffers with local coordinates:
    • WR_CMD_BUF (0): set TARG_ADDR_MID=0, TARG_ADDR_COORDINATE=my_xy (local source for writes)
    • WR_REG_CMD_BUF (2): same as above
    • AT_CMD_BUF (3): set RET_ADDR_LO, RET_ADDR_MID=0, RET_ADDR_COORDINATE=my_xy (atomic return addr)
    • RD_CMD_BUF (1): set NOC_CTRL to read command flags, RET_ADDR_MID=0, RET_ADDR_COORDINATE=my_xy
  3. Read status counters (noc_local_state_init): read RD_RESP_RECEIVED, NONPOSTED_WR_REQ_SENT, WR_ACK_RECEIVED, ATOMIC_RESP_RECEIVED, POSTED_WR_REQ_SENT and store in L1 variables

This means your emulator must pre-populate before firmware starts:

  • NOC_CFG(NOC_ID_LOGICAL) = (y << 6) | x for the tile’s logical coordinates
  • NOC_NODE_ID (offset 0x44) = (y << 6) | x for the tile’s physical coordinates
  • All status counters = 0

10. Disassembly Verification

From the real add1 kernel disassembly, here’s noc_async_write (NCRISC, using NoC0, cmd_buf 0):

; a4 = 0xFFB20000 (NoC0 base, cmd buf 0)
; Wait for CMD_CTRL ready
.L4:
    lw   a5, 64(a4)     ; read NOC_CMD_CTRL (+0x40)
    bnez a5, .L4         ; spin until 0

    ; Write NOC_CTRL = 0x2092 (CPY|WR|RESP_MARKED|VC_STATIC|STATIC_VC(1))
    li   a5, 0x2092
    sw   a5, 28(a4)      ; +0x1C = NOC_CTRL

    ; Write source address
    sw   a0, 0(a4)       ; +0x00 = NOC_TARG_ADDR_LO  (local L1 source)

    ; Write dest address
    sw   a1, 12(a4)      ; +0x0C = NOC_RET_ADDR_LO
    sw   zero, 16(a4)    ; +0x10 = NOC_RET_ADDR_MID = 0

    ; Write dest coordinate (pre-shifted)
    srli a2, a2, 4
    sw   a2, 20(a4)      ; +0x14 = NOC_RET_ADDR_COORDINATE

    ; Write length
    li   a5, 0x800       ; 2048 bytes
    sw   a5, 32(a4)      ; +0x20 = NOC_AT_LEN_BE

    ; FIRE
    li   a5, 1
    sw   a5, 64(a4)      ; +0x40 = NOC_CMD_CTRL = 1

And from noc_async_read (BRISC, using NoC1, cmd_buf 1):

; a4 = 0xFFB31000 (NoC1 base + buf1*0x800 = 0xFFB30000 + 0x800)
; Wait for CMD_CTRL ready
.L7:
    lw   a5, -1984(a4)   ; 0xFFB31000 + (-1984) = 0xFFB30840 = NOC_CMD_CTRL of buf 1
    bnez a5, .L7

    ; Write dest coordinate (local return)
    sw   a2, -2036(a4)   ; 0xFFB30814 = NOC_RET_ADDR_HI of buf 1

    ; Write source address
    sw   a0, -2048(a4)   ; 0xFFB30800 = NOC_TARG_ADDR_LO of buf 1
    sw   zero, -2044(a4) ; 0xFFB30804 = NOC_TARG_ADDR_MID of buf 1 = 0

    ; Write source coordinate
    sw   a1, -2040(a4)   ; 0xFFB30808 = NOC_TARG_ADDR_HI of buf 1

    ; Write length
    li   a2, 0x800
    sw   a2, -2016(a4)   ; 0xFFB30820 = NOC_AT_LEN_BE of buf 1

    ; FIRE
    li   a2, 1
    sw   a2, -1984(a4)   ; 0xFFB30840 = NOC_CMD_CTRL of buf 1

11. Overlay / Stream Path

The stream/overlay engine lives at a separate base:

  • NOC_OVERLAY_START_ADDR = 0xFFB40000
  • 64 streams, each with 0x1000 bytes of register space
  • STREAM_REG_ADDR(stream_id, reg_id) = 0xFFB40000 + stream_id*0x1000 + reg_id*4

The stream engine can issue NoC transactions on behalf of firmware (used for data movement pipelines). This is a separate subsystem from the raw NIU command path. For the matmul_peak and add1 kernels, the raw NIU path above is what’s used — the stream/overlay path is for pipelined data movement (not needed for basic emulation of compute kernels).

12. CMD_BUFFER_FIFO_EN (NIU_CFG_0 bit 16)

When bit 16 of NIU_CFG_0 is set, the 4 command buffers operate as a FIFO rather than independently addressable slots. Commands are written to buffer 0’s registers and the hardware auto-advances. Not used in add1 or matmul_peak kernels. Can be stubbed as a no-op for initial emulation.

13. Atomic Operations

When NOC_CMD_AT (bit 0 of NOC_CTRL) is set, the transaction is an atomic. The operation type is encoded in NOC_AT_LEN_BE (bits [15:12]) or NOC_L1_ACC_AT_INSTRN (when L1_ACC_AT_EN is set):

CodeNameDescription
0x0NOPNo operation
0x1INCR_GETIncrement target, return old value
0x2INCR_GET_PTRIncrement-get with pointer wrap
0x3SWAPSwap target with NOC_AT_DATA
0x4CASCompare-and-swap
0x5GET_TILE_MAPTile map lookup
0x6STORE_INDIndirect store
0x7SWAP_4B4-byte swap
0x9ACCAccumulate (FP32/FP16/INT32/INT8)

The atomic instruction field in NOC_AT_LEN_BE:

[1:0]  = IND_32 (index)
[5:2]  = WRAP (wrap count)
[9:6]  = (reserved / INCR)
[11:10] = IND_32_SRC (source index)
[15:12] = INS (instruction opcode from table above)

For L1_ACC_AT_INSTRN (when L1_ACC_AT_EN=1):

[2:0]  = ACC_FORMAT (0=FP32, 1=FP16_A, 2=FP16_B, 3=INT32, 4=INT32_COMPL, 5=INT32_UNS, 6=INT8)
[3]    = ACC_SAT_DIS (disable saturation)
[15:12] = INS (must be 0x9 for ACC)

14. Multicast Delivery Model

When NOC_CMD_BRCST_PACKET (bit 5 of NOC_CTRL) is set, the transaction is delivered to a rectangle of tiles rather than a single target.

Coordinate encoding (in NOC_RET_ADDR_HI for DMA writes, NOC_TARG_ADDR_HI for inline writes)

[5:0]   = EndX      (6 bits)
[11:6]  = EndY      (6 bits)
[17:12] = StartX    (6 bits)
[23:18] = StartY    (6 bits)

The 64-bit multicast address is constructed by the firmware API:

#define NOC_MULTICAST_ADDR(x_start, y_start, x_end, y_end, addr)  \
    ((uint64_t)(x_start) << 48) | ((uint64_t)(y_start) << 54) |   \
    ((uint64_t)(x_end) << 36) | ((uint64_t)(y_end) << 42) |       \
    ((uint64_t)(addr))

This splits across the three address registers: bits [31:0] to ADDR_LO, bits [35:32] to ADDR_MID, and the XY fields to ADDR_HI.

Delivery algorithm

The hardware delivers to all tiles whose coordinates lie within the rectangle:

def multicast_targets(start_x, start_y, end_x, end_y, src_x, src_y, src_include):
    """Return set of (x, y) coordinates that receive the multicast."""
    targets = set()
    for y in range(start_y, end_y + 1):
        for x in range(start_x, end_x + 1):
            if not src_include and x == src_x and y == src_y:
                continue  # skip sender unless BRCST_SRC_INCLUDE is set
            targets.add((x, y))
    return targets

The rectangle is always {(x,y) | StartX <= x <= EndX AND StartY <= y <= EndY}. Only tiles that have been registered in the NOC routing table actually receive data (unregistered grid positions are ignored).

NOC_CMD_BRCST_XY (bit 16) — routing axis, not delivery set

This bit controls which axis the multicast packet traverses first through the routers:

  • BRCST_XY=0: X is the major axis (traverse columns first, then branch along rows)
  • BRCST_XY=1: Y is the major axis (traverse rows first, then branch along columns)

This affects only routing topology and congestion behavior — it does not change which tiles receive the write. The emulator should ignore this bit entirely.

NOC_CMD_BRCST_SRC_INCLUDE (bit 17) — source tile inclusion

Controls whether the sending tile receives its own multicast:

  • BRCST_SRC_INCLUDE=0 (default): if the sender’s (x, y) falls inside the rectangle, it is excluded from receiving
  • BRCST_SRC_INCLUDE=1: the sender also receives the write (loopback)

The firmware API exposes two variants for each multicast operation:

  • noc_async_write_multicast(...) — excludes self (default)
  • noc_async_write_multicast_loopback_src(...) — includes self

NOC_BRCST_EXCLUDE (register offset +0x2C) — non-rectangular exclusion

When bit 22 is set in this register, a single row or column can be carved out of the rectangle:

Bits [7:0]   = start_x of excluded region
Bits [13:8]  = start_y of excluded region
Bit  [20]    = exclude direction X
Bit  [21]    = exclude direction Y
Bit  [22]    = exclude enabled

Used by noc_multicast_copy_exclude() for non-rectangular broadcast patterns. Standard firmware APIs (including noc_semaphore_set_multicast) always write 0x0 to this register (no exclusion). The emulator can leave this unimplemented for basic kernel emulation.

Multicast is write-only

Multicast applies only to write and atomic transactions. Read requests must always be unicast (BRCST_PACKET=0).

Counter accounting for multicast

For non-posted multicast writes (RESP_MARKED=1):

  • NIU_MST_NONPOSTED_WR_REQ_SENT increments by 1 (one command issued)
  • NIU_MST_WR_ACK_RECEIVED increments by 1 per destination tile that ACKs

The firmware pre-charges its local LDM counter: noc_nonposted_writes_acked[noc] += num_dests. The synchronous emulator should increment WR_ACK_RECEIVED by the number of actual targets delivered to when fire_cmd() runs, so that barrier polling resolves immediately.

Emulator fire_cmd() multicast path

def fire_cmd(self, buf_idx):
    regs = self.cmd_bufs[buf_idx]
    ctrl = regs[0x1C]
    is_mcast = (ctrl >> 5) & 1
    mcast_src_include = (ctrl >> 17) & 1

    if is_mcast and is_write:
        sx, sy, ex, ey = self._mcast_rect(ret_xy)
        targets_hit = 0
        for y in range(sy, ey + 1):
            for x in range(sx, ex + 1):
                if not mcast_src_include and x == self.x and y == self.y:
                    continue
                if (x, y) in self.fabric:
                    self.fabric[(x, y)].write(dst_addr, data)
                    targets_hit += 1
        if is_resp_marked:
            self.inc_counter(NIU_MST_WR_ACK_RECEIVED, targets_hit)
            self.inc_counter(NIU_MST_NONPOSTED_WR_REQ_SENT)
    # ... (existing unicast path unchanged)

Firmware API summary

FunctionSrc included?Notes
noc_async_write_multicast(src, dst_mcast, size, num_dests, linked, noc)NoStandard multicast DMA
noc_async_write_multicast_loopback_src(...)Yesnum_dests includes self
noc_semaphore_set_multicast(src, dst_mcast, num_dests, linked, noc)No4-byte non-posted write
noc_semaphore_set_multicast_loopback_src(...)Yes4-byte with loopback
noc_semaphore_inc_multicast(addr, incr, num_dests, noc)NoAtomic increment multicast
get_noc_multicast_addr(x_start, y_start, x_end, y_end, addr, noc)Constructs the 64-bit multicast address

15. Source Files Reference

PurposeFile
Primary BH register definestt-metal/tt_metal/hw/inc/internal/tt-1xx/blackhole/noc/noc_parameters.h
NoC nonblocking API (C++)tt-metal/tt_metal/hw/inc/internal/tt-1xx/blackhole/noc_nonblocking_api.h
Overlay/stream paramstt-metal/tt_metal/hw/inc/internal/tt-1xx/blackhole/noc/noc_overlay_parameters.h
Firmware C noc_transfertt-metal/tt_metal/hw/firmware/src/tt-1xx/blackhole/noc.c
Firmware multicast helperstt-metal/tt_metal/hw/inc/api/dataflow/dataflow_api.h
Multicast addr constructiontt-metal/tt_metal/hw/inc/internal/dataflow/dataflow_api_addrgen.h
NOC architecture overviewtt-isa-documentation/BlackholeA0/NoC/README.md
NOC register field spectt-isa-documentation/BlackholeA0/NoC/MemoryMap.md
NOC routing pathstt-isa-documentation/BlackholeA0/NoC/RoutingPaths.md
NIU register Python maptt-exalens/ttexalens/hardware/blackhole/niu_registers.py
ARC firmware coordinate maptt-zephyr-platforms/lib/tenstorrent/bh_arc/noc.c
ARC NIU init/configtt-zephyr-platforms/lib/tenstorrent/bh_arc/noc_init.c
Existing emulator NIU modelblackhole-py/extra/emu/noc.py