Blackhole emulator

A guided reading order for the Blackhole emulator specs, from the machine model and boot flow through the data path, hardware units, and advanced synchronization details.

33 chapters · 7 stages

Stage 1 — What Is This Machine?

  1. 01 device grid NOC grid topology for P100A and P150 boards: 2D grid with Tensix columns, DRAM controllers, PCIe endpoint, and ARC router tile coordinates in virtual NOC space.
  2. 02 execution model Cycle-approximate Python emulator using single-threaded round-robin scheduling: each RISC-V core steps once, the Tensix coprocessor processes one instruction per thread, and the NOC ticks.
  3. 03 address space Complete 32-bit address map for Tensix tile memory: L1 scratchpad, LDM, MMIO registers, instruction buffers, and semaphore windows visible to on-tile RISC-V cores.
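
The tick order that chapter 02 describes can be sketched as a minimal Python loop. All class and method names below are illustrative assumptions, not the emulator's actual API:

```python
# Hypothetical sketch of one emulator tick: every RISC-V core steps once,
# the Tensix coprocessor consumes at most one instruction per frontend
# thread, then the NOC advances. Names and structure are illustrative only.
class Tile:
    def __init__(self):
        self.riscv_cores = ["BRISC", "NCRISC", "TRISC0", "TRISC1", "TRISC2"]
        self.tensix_threads = [[], [], []]   # per-thread instruction FIFOs (T0/T1/T2)
        self.steps = 0

    def step_core(self, core):
        self.steps += 1                      # stand-in for executing one instruction

    def step_tensix_thread(self, fifo):
        if fifo:
            fifo.pop(0)                      # retire one queued Tensix opcode

class Emulator:
    def __init__(self, tiles):
        self.tiles = tiles
        self.cycle = 0

    def tick(self):
        for tile in self.tiles:              # deterministic round-robin over tiles
            for core in tile.riscv_cores:    # each RISC-V core steps once
                tile.step_core(core)
            for fifo in tile.tensix_threads: # one instruction per Tensix thread
                tile.step_tensix_thread(fifo)
        self.cycle += 1                      # the NOC would tick here
```

Single-threaded round-robin keeps the emulator deterministic: the same firmware always observes the same interleaving.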

Stage 2 — The Five RISC-V Cores and Boot

  1. 04 registers Which CSRs and tile registers the emulator must model: cfg0 (BFloat mode), SOFT_RESET_0 (core reset), and various control registers with firmware-observable effects.
  2. 05 ldm layouts Local Data Memory (LDM) layouts for five RISC-V cores per tile: BRISC and NCRISC have 8 KiB each, TRISC0/1/2 have 4 KiB each, holding per-core state and lookup tables.
  3. 06 firmware upload Firmware upload and core boot process for 5 RISC-V cores (BRISC, NCRISC, TRISC0/1/2) on each tile, including reset sequencing and instruction cache invalidation.
  4. 07 logical to virtual coordinates Three coordinate systems for Tensix tiles: physical (NOC mesh), virtual (post-NIU translation), and logical (sequential, harvesting-agnostic) with runtime translation tables.
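
The logical-to-virtual translation of chapter 07 amounts to a lookup table that skips harvested columns. A minimal sketch, with made-up column coordinates (the real tables come from per-chip harvesting data):

```python
# Hypothetical logical-to-virtual translation for Tensix columns.
# Logical coordinates are sequential and harvesting-agnostic; the
# translation table simply skips columns that harvesting removed.
# The specific coordinates below are invented for illustration.
def build_translation(all_virtual_cols, harvested_cols):
    usable = [c for c in all_virtual_cols if c not in harvested_cols]
    return {logical: virtual for logical, virtual in enumerate(usable)}

table = build_translation(all_virtual_cols=[1, 2, 3, 4, 5, 6, 7],
                          harvested_cols={3, 6})
# Logical x=2 lands on virtual column 4 because virtual column 3 is harvested.
```
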

Stage 3 — Tensix Coprocessor Frontend

  1. 08 tensix coprocessor pipeline Multi-threaded Tensix coprocessor with 3 independent frontend threads (T0/T1/T2) feeding 9 shared backend units: Unpack, Pack, Matrix, Vector, Mover, Misc, Sync, Config, Parity.
  2. 09 instruction push Two mechanisms to push Tensix opcodes: MMIO stores to INSTRN_BUF_BASE and .ttinsn inline instructions (up to 4-way fusion), both feeding per-thread instruction FIFOs.
  3. 10 mop and replay expanders MOP Expander and Replay Expander chain: MOP instruction expands to up to 32639 instructions, Replay Expander further expands REPLAY to up to 64 instructions.
  4. 11 stallwait conditions STALLWAIT and SEMWAIT synchronization instructions that install wait gates in the Tensix frontend pipeline to hold instructions until conditions are met.
  5. 12 semaphores Two semaphore systems: Tensix hardware semaphores (8 counting, 4-bit value/max) and software semaphores (NOC atomics via semaphore window in PCBuf space).
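
The Replay Expander of chapter 10 can be modeled as a small record/re-emit buffer. This is a hedged sketch with simplified semantics; the real instruction encoding and interaction with the MOP Expander are covered in the chapter:

```python
# Hypothetical Replay Expander model: a REPLAY instruction either records
# the next `count` instructions into the replay buffer or re-emits a
# recorded slice. Buffer size matches the spec's 64-instruction limit;
# everything else is simplified for illustration.
class ReplayExpander:
    BUFFER_SIZE = 64                        # up to 64 replayable instructions

    def __init__(self):
        self.buffer = []
        self.recording = 0                  # instructions left to record

    def record(self, count):
        assert count <= self.BUFFER_SIZE
        self.buffer = []
        self.recording = count

    def push(self, instr, out):
        if self.recording:
            self.buffer.append(instr)       # capture instead of executing
            self.recording -= 1
        else:
            out.append(instr)               # pass through unchanged

    def replay(self, start, length, out):
        out.extend(self.buffer[start:start + length])
```
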

Stage 4 — Data Path: Registers, Addressing, CBs

  1. 13 data types and conversions Numeric formats used in Tensix coprocessor internals: shuffled 19-bit register format for SrcA/SrcB, conversions to/from IEEE FP32, and BFP/FP8 expansion rules.
  2. 14 dest srca srcb registers Three register files for tile data: Dest (1024 rows × 16 cols, 16-bit or 32-bit dual view), SrcA/SrcB (64 rows × 16 cols × 2 banks, 19-bit shuffled format).
  3. 15 rwc and addressing RWC (Read-Write Counters) and ADC (Address Counters) for tile addressing: RWCs track register-file row positions, ADCs track L1 coordinates during pack/unpack operations.
  4. 16 pack unpack registers Pack/Unpack configuration registers in Tensix Config Register space with double-buffered ping-pong state, supporting multiple data formats (FP32, BF16, BFP, INT, etc.).
  5. 17 circular buffers Circular Buffers (CBs) are hardware-tracked FIFOs in L1 memory that pass data between RISC-V data-movement cores and Tensix compute pipeline, with 64 CBs per tile.
  6. 18 pcbufs 3 PC Buffers (16-entry FIFOs) for control/dispatch from BRISC to each TRISC, with semaphore access window and coprocessor/MOP done check addresses.
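
The hardware-tracked FIFO behavior of chapter 17's circular buffers reduces to producer/consumer page counters. A minimal sketch; the field names are hypothetical stand-ins for the hardware pointer registers:

```python
# Hypothetical model of one circular buffer's pointer state. Real CBs
# track pointers in hardware/stream registers; here `pushed`/`popped`
# stand in for the producer and consumer page counters.
class CircularBuffer:
    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.pushed = 0                     # producer-side page count
        self.popped = 0                     # consumer-side page count

    def free_pages(self):
        return self.num_pages - (self.pushed - self.popped)

    def push(self, n):                      # producer side: publish n pages
        assert self.free_pages() >= n, "would overflow"
        self.pushed += n

    def pop(self, n):                       # consumer side: release n pages
        assert self.pushed - self.popped >= n, "would underflow"
        self.popped += n
```

Monotonic counters (rather than wrapping read/write pointers) make the full-vs-empty distinction trivial: occupancy is just `pushed - popped`.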

Stage 5 — Hardware Unit Deep-Dives

  1. 19 fpu operations Matrix Unit (FPU) operations on SrcA/SrcB/Dest register files, supporting MVMUL, elementwise ops, MOV*, and gated operations with register-write counter (RWC) addressing.
  2. 20 unpack data path Two unpackers that read L1 tile data, perform format conversion, and write to SrcA/SrcB register files, with address and data path configuration for tiling operations.
  3. 21 pack data path PACR packer data path: Dest register file → ADC addressing → format conversion → ReLU/activation → histogram → thresholding → quantization → L1 tile write.
  4. 22 sfpu operations 32-lane Vector Unit (SFPU) with 5 sub-units (Load, Simple, MAD, Round, Store) operating on 32-bit LRegs, supporting arithmetic, transcendental, and LUT-based operations.
  5. 23 niu NIU (Network Interface Unit) with 2 instances per tile (NoC0/NoC1), 4 command buffers, configuration, status counters; handles NOC transaction generation and routing.
  6. 24 dram DRAM and PCIe as NOC endpoints: tile format (32×32 faces in Z-order), DRAM bank geometry/interleaving, address generation, and PCIe endpoint for host memory access.
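
The bank interleaving of chapter 24 follows a round-robin address-generation pattern. A sketch with illustrative (not Blackhole-accurate) geometry:

```python
# Hypothetical round-robin interleaving of pages across DRAM banks.
# The bank count and page size below are illustrative, not Blackhole's
# real geometry; the point is the address-generation pattern.
NUM_BANKS = 8
PAGE_SIZE = 2048                            # e.g. roughly one 32x32 BF16 tile

def page_address(page_index):
    bank = page_index % NUM_BANKS           # banks are visited round-robin
    offset = (page_index // NUM_BANKS) * PAGE_SIZE
    return bank, offset
```
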

Stage 6 — Cross-Cutting Infrastructure

  1. 25 stream registers Stream/NOC overlay registers: 64 streams per tile, CB-to-stream mapping (CB N → stream N on Blackhole), synchronization registers for CB push/pop/wait operations.
  2. 26 gpr and dma instructions GPR file (192 registers, 64 per thread) as staging for config values, Scalar Unit (ThCon) instructions for arithmetic, and Configuration Unit for register read-modify-write.
  3. 27 noc atomics NIU atomic operations on L1 memory: INCR_GET, INCR_GET_PTR, SWAP, CAS, ACC (accumulate), GET_TILE_MAP, STORE_IND, and RISC-V Zaamo via NOC.
  4. 28 mutexes Hardware mutexes in Tensix Sync Unit for mutual exclusion between threads: 4 mutexes (math, unpack0, unpack1, pack0) with ATGETM/ATRELM acquire/release instructions.
  5. 29 xmov and tdma mover Mover hardware DMA block accessible via XMOV instruction or TDMA-RISC registers: 16-byte aligned transfers between L1, backend config space, and instruction RAM.
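
At its core, chapter 27's INCR_GET is fetch-and-add with wrap. A simplified model, assuming L1 is a word-addressed dict and ignoring the real wrap-field encoding:

```python
# Hypothetical model of the NIU's INCR_GET atomic: return the old value
# of an L1 word, then add an increment, wrapping at a power-of-two range.
# The wrap-width handling here is simplified for illustration.
def incr_get(l1, addr, incr, wrap_bits=32):
    old = l1[addr]
    l1[addr] = (old + incr) & ((1 << wrap_bits) - 1)
    return old
```

Because the NIU performs the read-modify-write at the destination tile, remote cores can use this as a fetch-and-add without ever holding a lock.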

Stage 7 — Niche / Advanced

  1. 30 specialty fpu operations Specialty Matrix Unit instructions: legacy neutered ops (CONV3S1/S2, APOOL3S1/S2, MPOOL3S1/S2) and functional specialties (GAPOOL, GATESRCRST, SHIFTXA/B, CLREXPHIST).
  2. 31 sfploadmacro and sfptransp SFPLOADMACRO enables IPC > 1 by scheduling pipelined operations; SFPTRANSP performs cross-lane data movement for row-wise reduction and transpose in Vector Unit.
  3. 32 additional scalar unit instructions Four additional Scalar Unit (ThCon) instructions (SHIFTDMAREG, BITWOPDMAREG, CMPDMAREG, SUBDMAREG) that operate on the GPR file, used infrequently but required for modeling completeness.
  4. 33 config sync instructions Four instructions (CFGSHIFTMASK, STREAMWRCFG, STREAMWAIT, REG2FLOP) that interact with backend configuration registers and NoC overlay stream system for synchronized updates.