book
blackhole emulator
A guided reading order for the Blackhole emulator specs, from the machine model and boot flow through the data path, hardware units, and advanced synchronization details.
Stage 1 — What Is This Machine?
- 01 device grid NOC grid topology for P100A and P150 boards: 2D grid with Tensix columns, DRAM controllers, PCIe endpoint, and ARC router tile coordinates in virtual NOC space.
- 02 execution model Cycle-approximate Python emulator using single-threaded round-robin scheduling: in each round every RISC-V core steps once, the Tensix coprocessor processes one instruction per thread, and the NOC ticks.
- 03 address space Complete 32-bit address map for Tensix tile memory: L1 scratchpad, LDM, MMIO registers, instruction buffers, and semaphore windows visible to on-tile RISC-V cores.
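
As a rough illustration of the round-robin scheduling summarized in 02 execution model, the sketch below steps every component once per round in a fixed order. The `Counter` class and `run_rounds` function are hypothetical stand-ins invented for this example, not names from the emulator.

```python
class Counter:
    """Hypothetical stand-in for any steppable component (core, thread, NOC)."""

    def __init__(self, name):
        self.name = name
        self.steps = 0

    def step(self):
        # A real component would execute one instruction or move one flit;
        # here we only count how often the scheduler visited us.
        self.steps += 1


def run_rounds(riscv_cores, tensix_threads, noc, rounds):
    """Single-threaded round-robin: one fixed-order pass per round."""
    for _ in range(rounds):
        for core in riscv_cores:        # each RISC-V core steps once
            core.step()
        for thread in tensix_threads:   # one instruction per Tensix thread
            thread.step()
        noc.step()                      # the NOC ticks once per round


cores = [Counter(n) for n in ("BRISC", "NCRISC", "TRISC0", "TRISC1", "TRISC2")]
threads = [Counter(n) for n in ("T0", "T1", "T2")]
noc = Counter("NOC")
run_rounds(cores, threads, noc, rounds=4)
```

Because everything runs on one host thread in a fixed order, a run is deterministic, which is presumably part of the appeal of this design for a cycle-approximate model.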
Stage 2 — The Five RISC-V Cores and Boot
- 04 registers Which CSRs and tile registers the emulator must model: cfg0 (BFloat mode), SOFT_RESET_0 (core reset), and various control registers with firmware-observable effects.
- 05 ldm layouts Local Data Memory (LDM) layouts for five RISC-V cores per tile: BRISC and NCRISC have 8 KiB each, TRISC0/1/2 have 4 KiB each, holding per-core state and lookup tables.
- 06 firmware upload Firmware upload and core boot process for 5 RISC-V cores (BRISC, NCRISC, TRISC0/1/2) on each tile, including reset sequencing and instruction cache invalidation.
- 07 logical to virtual coordinates Three coordinate systems for Tensix tiles: physical (NOC mesh), virtual (post-NIU translation), and logical (sequential, harvesting-agnostic) with runtime translation tables.
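
The LDM sizes quoted in 05 ldm layouts can be captured in a small table, which also makes a bounds check easy to write. This is a minimal sketch: the sizes come from the summary above, while `LDM_SIZE_BYTES` and `ldm_in_bounds` are hypothetical names chosen for illustration.

```python
KIB = 1024

# Per-core LDM sizes as stated in the 05 ldm layouts summary.
LDM_SIZE_BYTES = {
    "BRISC": 8 * KIB,
    "NCRISC": 8 * KIB,
    "TRISC0": 4 * KIB,
    "TRISC1": 4 * KIB,
    "TRISC2": 4 * KIB,
}


def ldm_in_bounds(core, offset, length=4):
    """Return True if the byte range [offset, offset + length) fits in the core's LDM."""
    return 0 <= offset and offset + length <= LDM_SIZE_BYTES[core]


# Two 8 KiB regions plus three 4 KiB regions: 28 KiB of LDM per tile.
total_ldm_bytes = sum(LDM_SIZE_BYTES.values())
```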
Stage 3 — Tensix Coprocessor Frontend
- 08 tensix coprocessor pipeline Multi-threaded Tensix coprocessor with 3 independent frontend threads (T0/T1/T2) feeding 9 shared backend units: Unpack, Pack, Matrix, Vector, Mover, Misc, Sync, Config, Parity.
- 09 instruction push Two mechanisms to push Tensix opcodes: MMIO stores to INSTRN_BUF_BASE and .ttinsn inline instructions (up to 4-way fusion), both feeding per-thread instruction FIFOs.
- 10 mop and replay expanders MOP Expander and Replay Expander chain: a single MOP instruction expands to up to 32639 instructions, and the Replay Expander further expands each REPLAY to up to 64 instructions.
- 11 stallwait conditions STALLWAIT and SEMWAIT synchronization instructions that install wait gates in the Tensix frontend pipeline to hold instructions until conditions are met.
- 12 semaphores Two semaphore systems: Tensix hardware semaphores (8 counting, 4-bit value/max) and software semaphores (NOC atomics via semaphore window in PCBuf space).
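
To make the "8 counting, 4-bit value/max" description in 12 semaphores concrete, here is a minimal sketch of one such bank. The method names `post` and `get`, and the choice to saturate at the bounds rather than wrap, are assumptions made for this example; the spec itself defines the real behavior.

```python
class TensixSemaphore:
    """One counting semaphore with 4-bit value and max fields (sketch)."""

    FIELD_MAX = 0xF  # largest value representable in a 4-bit field

    def __init__(self, value=0, maximum=FIELD_MAX):
        if not (0 <= value <= maximum <= self.FIELD_MAX):
            raise ValueError("value and max must satisfy 0 <= value <= max <= 15")
        self.value = value
        self.maximum = maximum

    def post(self):
        # Assumption: incrementing at max leaves the value unchanged.
        if self.value < self.maximum:
            self.value += 1

    def get(self):
        # Assumption: decrementing at zero leaves the value unchanged.
        if self.value > 0:
            self.value -= 1

    def is_zero(self):
        return self.value == 0

    def is_max(self):
        return self.value == self.maximum


# The summary above mentions 8 hardware semaphores.
semaphores = [TensixSemaphore() for _ in range(8)]

sem = TensixSemaphore(value=0, maximum=2)
for _ in range(3):
    sem.post()          # third post has no effect under the saturation assumption
value_after_posts = sem.value
```

A wait gate of the kind described in 11 stallwait conditions could then be modeled as polling `is_zero()` or `is_max()` before letting an instruction through.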
Stage 4 — Data Path: Registers, Addressing, CBs
- 13 data types and conversions Numeric formats used in Tensix coprocessor internals: shuffled 19-bit register format for SrcA/SrcB, conversions to/from IEEE FP32, and BFP/FP8 expansion rules.
- 14 dest srca srcb registers Three register files for tile data: Dest (1024 rows × 16 cols, 16-bit or 32-bit dual view), SrcA/SrcB (64 rows × 16 cols × 2 banks, 19-bit shuffled format).
- 15 rwc and addressing RWC (Read-Write Counters) and ADC (Address Counters) for tile addressing: RWCs track register-file row positions, ADCs track L1 coordinates during pack/unpack operations.
- 16 pack unpack registers Pack/Unpack configuration registers in Tensix Config Register space with double-buffered ping-pong state, supporting multiple data formats (FP32, BF16, BFP, INT, etc.).
- 17 circular buffers Circular Buffers (CBs) are hardware-tracked FIFOs in L1 memory that pass data between RISC-V data-movement cores and Tensix compute pipeline, with 64 CBs per tile.
- 18 pcbufs 3 PC Buffers (16-entry FIFOs) for control/dispatch from BRISC to each TRISC, with semaphore access window and coprocessor/MOP done check addresses.
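
The circular buffers in 17 circular buffers behave like page-granular FIFOs over a region of L1. The sketch below shows one way to model that with running push/pop counters. The class, its field names, and the example base address and page size are all hypothetical; the actual register layout is described in the spec.

```python
class CircularBuffer:
    """Page-granular FIFO over an L1 region (illustrative model only)."""

    def __init__(self, base, page_size, num_pages):
        self.base = base
        self.page_size = page_size
        self.num_pages = num_pages
        self.pushed = 0  # total pages ever pushed by the producer
        self.popped = 0  # total pages ever popped by the consumer

    def pages_available(self):
        """Pages the consumer may read."""
        return self.pushed - self.popped

    def pages_free(self):
        """Pages the producer may still fill."""
        return self.num_pages - self.pages_available()

    def write_addr(self):
        """L1 address of the next page to fill, wrapping at the end."""
        return self.base + (self.pushed % self.num_pages) * self.page_size

    def read_addr(self):
        """L1 address of the oldest unread page, wrapping at the end."""
        return self.base + (self.popped % self.num_pages) * self.page_size

    def push(self, n=1):
        if n > self.pages_free():
            raise RuntimeError("producer must wait for free pages")
        self.pushed += n

    def pop(self, n=1):
        if n > self.pages_available():
            raise RuntimeError("consumer must wait for data")
        self.popped += n


cb = CircularBuffer(base=0x10000, page_size=2048, num_pages=4)
cb.push(3)
cb.pop(2)
cb.push(2)  # write pointer wraps past the end of the region
```

In an emulator, the two `RuntimeError` cases would more likely become stalls on the data-movement or compute side rather than exceptions.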
Stage 5 — Hardware Unit Deep-Dives
- 19 fpu operations Matrix Unit (FPU) operations on SrcA/SrcB/Dest register files, supporting MVMUL, elementwise ops, MOV*, and gated operations with read-write counter (RWC) addressing.
- 20 unpack data path Two unpackers that read L1 tile data, perform format conversion, and write to SrcA/SrcB register files, with address and data path configuration for tiling operations.
- 21 pack data path PACR packer data path: Dest register file → ADC addressing → format conversion → ReLU/activation → histogram → thresholding → quantization → L1 tile write.
- 22 sfpu operations 32-lane Vector Unit (SFPU) with 5 sub-units (Load, Simple, MAD, Round, Store) operating on 32-bit LRegs, supporting arithmetic, transcendental, and LUT-based operations.
- 23 niu NIU (Network Interface Unit) with 2 instances per tile (NoC0/NoC1), 4 command buffers, configuration, status counters; handles NOC transaction generation and routing.
- 24 dram DRAM and PCIe as NOC endpoints: tile format (32×32 tiles with faces in Z-order), DRAM bank geometry/interleaving, address generation, and PCIe endpoint for host memory access.
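
The face ordering mentioned in 24 dram can be illustrated with a small index calculation. This sketch assumes a 32×32 tile split into four 16×16 faces stored top-left, top-right, bottom-left, bottom-right, with row-major elements inside each face. The face size and ordering are assumptions of this example, and `tile_offset` is a hypothetical helper; consult the spec for the authoritative layout.

```python
TILE_DIM = 32
FACE_DIM = 16  # assumption: four 16x16 faces per 32x32 tile
FACES_PER_ROW = TILE_DIM // FACE_DIM


def tile_offset(row, col):
    """Element index of (row, col) inside a tile stored face by face."""
    if not (0 <= row < TILE_DIM and 0 <= col < TILE_DIM):
        raise ValueError("row and col must be in [0, 32)")
    # Which face the element lands in, visiting faces left-to-right, top-to-bottom.
    face = (row // FACE_DIM) * FACES_PER_ROW + (col // FACE_DIM)
    # Row-major position inside that face.
    within = (row % FACE_DIM) * FACE_DIM + (col % FACE_DIM)
    return face * FACE_DIM * FACE_DIM + within


# Every (row, col) pair should map to a distinct index in [0, 1024).
all_offsets = sorted(
    tile_offset(r, c) for r in range(TILE_DIM) for c in range(TILE_DIM)
)
```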
Stage 6 — Cross-Cutting Infrastructure
- 25 stream registers Stream/NOC overlay registers: 64 streams per tile, CB-to-stream mapping (CB N → stream N on Blackhole), synchronization registers for CB push/pop/wait operations.
- 26 gpr and dma instructions GPR file (192 registers, 64 per thread) as staging for config values, Scalar Unit (ThCon) instructions for arithmetic, and Configuration Unit for register read-modify-write.
- 27 noc atomics NIU atomic operations on L1 memory: INCR_GET, INCR_GET_PTR, SWAP, CAS, ACC (accumulate), GET_TILE_MAP, STORE_IND, and RISC-V Zaamo via NOC.
- 28 mutexes Hardware mutexes in Tensix Sync Unit for mutual exclusion between threads: 4 mutexes (math, unpack0, unpack1, pack0) with ATGETM/ATRELM acquire/release instructions.
- 29 xmov and tdma mover Mover hardware DMA block accessible via XMOV instruction or TDMA-RISC registers: 16-byte aligned transfers between L1, backend config space, and instruction RAM.
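
The four mutexes listed in 28 mutexes lend themselves to a very small model: each mutex records which thread, if any, owns it. In the sketch below, `MutexBank`, `try_acquire`, and `release` are hypothetical names, and two behaviors are assumed rather than taken from the spec: a thread re-acquiring a mutex it already holds succeeds, and a release by a non-owner is ignored.

```python
MUTEX_NAMES = ("math", "unpack0", "unpack1", "pack0")


class MutexBank:
    """Ownership table for the four Sync Unit mutexes (illustrative only)."""

    def __init__(self):
        self.owner = {name: None for name in MUTEX_NAMES}

    def try_acquire(self, name, thread):
        """Model of an ATGETM attempt.

        Returns True if `thread` holds the mutex afterwards; False means the
        issuing thread should stall and retry on a later round.
        """
        if self.owner[name] is None:
            self.owner[name] = thread
            return True
        # Assumption: re-acquiring a mutex you already own succeeds.
        return self.owner[name] == thread

    def release(self, name, thread):
        """Model of ATRELM; assumption: a non-owner release is ignored."""
        if self.owner[name] == thread:
            self.owner[name] = None


bank = MutexBank()
t0_got_it = bank.try_acquire("math", thread=0)   # T0 takes the mutex
t1_got_it = bank.try_acquire("math", thread=1)   # T1 must stall
bank.release("math", thread=0)
t1_retry = bank.try_acquire("math", thread=1)    # now T1 succeeds
```

Returning a boolean instead of blocking fits a single-threaded round-robin scheduler, where a stalled thread simply makes no progress in that round.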
Stage 7 — Niche / Advanced
- 30 specialty fpu operations Specialty Matrix Unit instructions: legacy neutered ops (CONV3S1/S2, APOOL3S1/S2, MPOOL3S1/S2) and functional specialties (GAPOOL, GATESRCRST, SHIFTXA/B, CLREXPHIST).
- 31 sfploadmacro and sfptransp SFPLOADMACRO enables IPC > 1 by scheduling pipelined operations; SFPTRANSP performs cross-lane data movement for row-wise reduction and transpose in Vector Unit.
- 32 additional scalar unit instructions Four additional Scalar Unit (ThCon) instructions (SHIFTDMAREG, BITWOPDMAREG, CMPDMAREG, SUBDMAREG) that operate on the GPR file, used infrequently but required for modeling completeness.
- 33 config sync instructions Four instructions (CFGSHIFTMASK, STREAMWRCFG, STREAMWAIT, REG2FLOP) that interact with backend configuration registers and NoC overlay stream system for synchronized updates.