anuraagw.me

Tensix Coprocessor Pipeline

Overview

The Tensix coprocessor is a multi-threaded instruction-driven accelerator embedded in each Tensix tile. It has 3 independent threads (T0, T1, T2), each with its own frontend pipeline, feeding into 9 shared backend execution units that run concurrently.

The 5 RISC-V cores on a Tensix tile map to these roles:

| Index | Core | Role | Tensix Push Access |
|-------|------|------|--------------------|
| 0 | BRISC | Data Movement 0 | All 3 thread FIFOs |
| 1 | NCRISC | Data Movement 1 | None |
| 2 | TRISC0 | Unpack kernels (T0) | Own thread FIFO (T0) only |
| 3 | TRISC1 | Math/Compute kernels (T1) | Own thread FIFO (T1) only |
| 4 | TRISC2 | Pack kernels (T2) | Own thread FIFO (T2) only |

Frontend Pipeline (per-thread)

Each thread has its own independent frontend pipeline:

RISC-V store to INSTRN_BUF
         |
         v
  Instruction FIFO (32 slots)
         |
         v
  MOP Expander (expands MOP to up to 32639 instructions)
         |
         v
  Replay Expander (32-slot replay buffer)
         |
         v
  Wait Gate (STALLWAIT/SEMWAIT re-evaluate each cycle)
         |
         v
  Backend Dispatch (by opcode)

The three frontend pipelines are fully independent. Each thread dispatches its own instructions in order, but instructions from different threads can complete in any relative order, because each backend unit processes work at its own rate.

Backend Execution Units (shared)

The backend has 9 concurrent execution units. Dispatch is purely opcode-driven:

| Unit | Instructions |
|------|--------------|
| Sync Unit | STALLWAIT, SEMWAIT, SEMINIT, SEMPOST, SEMGET |
| Unpacker 0 (SrcA/Dst) | UNPACR variants |
| Unpacker 1 (SrcB) | UNPACR variants |
| Matrix Unit (FPU) | MVMUL, ELWADD, ELWSUB, ELWMUL, MATMUL (1 IPC) |
| Packers 0-3 | PACR variants (4 packer units) |
| Vector Unit (SFPU) | SFPLOAD, SFPSTORE, SFPADD, SFPMAD, etc. (32x32b) |
| Scalar Unit (ThCon) | SETDMAREG, ADDDMAREG, LOAD_IND, STORE_IND, CAS |
| Configuration Unit | SETC16, WRCFG, RMWCIB, CFGSHIFTMASK |
| Mover | Bulk L1 data transfers |

The Matrix Unit (FPU) accepts at most 1 instruction per cycle regardless of source thread. Unpackers and packers are similarly contended across threads.

The design intent is triple-buffered execution: T0 unpacks the next tile, T1 computes on the current tile, and T2 packs the previous tile’s results, all running concurrently. Cross-thread synchronization uses hardware semaphores (MATH_PACK, UNPACK_TO_DEST, MATH_DONE) via STALLWAIT.

Instruction Encoding

All Tensix instructions are 32-bit words:

bits[31:24] = opcode (8 bits)
bits[23:0]  = parameters (24 bits)

Constructed via:

#define TT_OP(opcode, params) ((opcode << 24) + params)
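
As a minimal sketch of the layout above, the helpers below pack and unpack an instruction word. The names `tensix_encode`, `tensix_opcode`, and `tensix_params` are hypothetical and do not come from the firmware headers. Unlike the raw macro, this sketch masks the parameter field so that a stray high bit cannot spill into the opcode.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers (not from the firmware headers) mirroring TT_OP. */

/* Pack an 8-bit opcode and a 24-bit parameter field into one 32-bit word. */
static inline uint32_t tensix_encode(uint32_t opcode, uint32_t params) {
    assert(opcode <= 0xFFu);                 /* opcode must fit in bits [31:24] */
    return (opcode << 24) | (params & 0x00FFFFFFu);
}

/* Extract bits [31:24]. */
static inline uint32_t tensix_opcode(uint32_t word) {
    return word >> 24;
}

/* Extract bits [23:0]. */
static inline uint32_t tensix_params(uint32_t word) {
    return word & 0x00FFFFFFu;
}
```

For example, `tensix_encode(0x28, 0x123456)` yields `0x28123456`, and the two extractors recover `0x28` and `0x123456` from it.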

Instruction Issue Mechanisms

There are two ways to push instructions into a thread’s FIFO. Both deliver the identical 32-bit instruction word.

1. MMIO Store (runtime-variable operands)

Write the 32-bit instruction word to the thread’s INSTRN_BUF address:

// volatile pointer to INSTRN_BUF_BASE (0xFFE40000)
volatile uint32_t* instrn_buffer = (volatile uint32_t*)INSTRN_BUF_BASE;
instrn_buffer[0] = TT_OP_MVMUL(src_a, src_b, ...);

Used via TT_XXX() macros (e.g. TT_MVMUL(...)) and the ex_push_insn() helper in firmware.

2. Inline .ttinsn Custom RISC-V Instruction (compile-time constants)

A custom RISC-V ISA extension that encodes the Tensix instruction directly in the RISC-V instruction stream:

#define INSTRUCTION_WORD(x) __asm__ __volatile__(".ttinsn %0" : : "i"((x)))

The 32-bit Tensix instruction word is rotated left by 2 bits and stored in the encoding space normally reserved for the RISC-V “C” extension (which these cores do not implement). The hardware rotates right by 2 to recover the original instruction word and pushes it to the thread’s FIFO.

Used via TTI_XXX() macros (e.g. TTI_MVMUL(...)). Requires compile-time constant operands.
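
The rotation described above can be sketched in plain C. The function names here are hypothetical; only the rotate-left-by-2 and rotate-right-by-2 steps come from the text. The last helper relies on a general RISC-V fact: standard 32-bit instructions have their two lowest bits set to `0b11`, and the other three patterns belong to 16-bit compressed encodings. After the rotation, those two bits hold the top two bits of the Tensix opcode, so any opcode below 0xC0 lands outside the standard 32-bit space.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left by 2: Tensix instruction word -> word placed in the RISC-V stream. */
static inline uint32_t ttinsn_pack(uint32_t tensix_word) {
    return (tensix_word << 2) | (tensix_word >> 30);
}

/* Rotate right by 2: the recovery step the hardware is described as doing. */
static inline uint32_t ttinsn_unpack(uint32_t riscv_word) {
    return (riscv_word >> 2) | (riscv_word << 30);
}

/* True when the low two bits are not 0b11, i.e. the word does not look like
   a standard 32-bit RISC-V instruction. */
static inline int ttinsn_in_compressed_space(uint32_t riscv_word) {
    return (riscv_word & 0x3u) != 0x3u;
}

/* Sanity check: packing then unpacking must return the original word. */
static inline void ttinsn_check_roundtrip(uint32_t tensix_word) {
    assert(ttinsn_unpack(ttinsn_pack(tensix_word)) == tensix_word);
}
```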

On Blackhole, up to 4 adjacent .ttinsn instructions can be fused and pushed in a single cycle (though the FIFO still dequeues at most 1 per cycle per thread).

FIFO Backpressure

The instruction FIFO holds 32 entries per thread. When full, the RISC-V core automatically stalls on the next store/.ttinsn until a slot frees up. A slot is freed when the instruction is consumed by the MOP Expander, not when it reaches the backend.
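
The backpressure rule can be illustrated with a small software model. This is a sketch only: the type and function names are hypothetical, and a failed push stands in for the RISC-V core stalling. The one behavior taken from the text is that a slot frees when an entry is dequeued toward the MOP Expander.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FIFO_SLOTS 32u  /* per-thread depth stated above */

/* Hypothetical model of one thread's instruction FIFO (ring buffer). */
typedef struct {
    uint32_t slots[FIFO_SLOTS];
    uint32_t head;   /* index of the next entry to dequeue */
    uint32_t count;  /* number of entries currently held */
} insn_fifo_t;

/* Try to push one word; returns false when full, modeling a core stall. */
static bool fifo_try_push(insn_fifo_t *f, uint32_t word) {
    if (f->count == FIFO_SLOTS) {
        return false;
    }
    f->slots[(f->head + f->count) % FIFO_SLOTS] = word;
    f->count++;
    return true;
}

/* Dequeue toward the MOP Expander; this is the point where a slot frees. */
static uint32_t fifo_pop(insn_fifo_t *f) {
    assert(f->count > 0);
    uint32_t word = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_SLOTS;
    f->count--;
    return word;
}
```

In this model, the 33rd consecutive push fails, and it succeeds again as soon as a single entry has been popped, regardless of whether that entry has reached a backend unit.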

INSTRN_BUF Address Map

| Address | Symbol | Purpose |
|---------|--------|---------|
| 0xFFE40000 | INSTRN_BUF_BASE | T0 instruction FIFO |
| 0xFFE50000 | INSTRN1_BUF_BASE | T1 instruction FIFO |
| 0xFFE60000 | INSTRN2_BUF_BASE | T2 instruction FIFO |

Stride between threads: 0x10000 (64 KB).

Routing by Source Core

The address 0xFFE40000 is context-sensitive — hardware routes the write based on which RISC-V core performs the store:

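| Store Address | From BRISC | From TRISC0 | From TRISC1 | From TRISC2 |
|---------------|------------|-------------|-------------|-------------|
| 0xFFE40000 | Push to T0 | Push to T0 | Push to T1 | Push to T2 |
| 0xFFE50000 | Push to T1 | (hangs) | (hangs) | (hangs) |
| 0xFFE60000 | Push to T2 | (hangs) | (hangs) | (hangs) |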

Each TRISC can only push to its own thread via 0xFFE40000. The hardware remaps the address per-core. Writing to 0xFFE50000 or 0xFFE60000 from a TRISC will hang the core.

BRISC can target any thread by writing to the corresponding address directly.
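
The routing rules above can be written as a small lookup function. This is a software model for reasoning about the table, not firmware code. `route_push`, the `enum core` values, and the two negative return codes are hypothetical names. What an NCRISC store actually does is not stated above, so the model simply reports "no push".

```c
#include <assert.h>
#include <stdint.h>

#define INSTRN_BUF_BASE   0xFFE40000u
#define INSTRN_BUF_STRIDE 0x00010000u  /* 64 KB between thread FIFOs */

/* Core indices as listed in the overview table. */
enum core { BRISC = 0, NCRISC = 1, TRISC0 = 2, TRISC1 = 3, TRISC2 = 4 };

#define ROUTE_HANG (-1)  /* the store hangs the issuing core */
#define ROUTE_NONE (-2)  /* model-only: no push happens (assumption) */

/* Return the destination thread (0..2) for a store by core c to addr,
   or one of the negative codes above. */
static int route_push(enum core c, uint32_t addr) {
    assert((int)c >= 0 && (int)c <= 4);
    uint32_t offset = addr - INSTRN_BUF_BASE;
    if (addr < INSTRN_BUF_BASE || offset % INSTRN_BUF_STRIDE != 0 ||
        offset / INSTRN_BUF_STRIDE > 2) {
        return ROUTE_NONE;               /* not one of the three listed addresses */
    }
    int window = (int)(offset / INSTRN_BUF_STRIDE);
    switch (c) {
    case BRISC:
        return window;                   /* BRISC addresses any thread directly */
    case NCRISC:
        return ROUTE_NONE;               /* no push access */
    case TRISC0:
    case TRISC1:
    case TRISC2:
        /* Only the first address is valid; it is remapped to the core's own thread. */
        return window == 0 ? (int)c - (int)TRISC0 : ROUTE_HANG;
    }
    return ROUTE_NONE;
}
```

For instance, `route_push(TRISC1, 0xFFE40000u)` returns 1, while `route_push(TRISC1, 0xFFE50000u)` returns `ROUTE_HANG`.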

BRISC Coprocessor Access

BRISC can push Tensix instructions to all three thread FIFOs. However:

  • BRISC’s pushes enter after the MOP Expander (bypassing MOP expansion). This means BRISC cannot issue MOP instructions — only fully-expanded individual instructions.
  • There is a mux at each thread’s frontend that merges BRISC and TRISC_i inputs. If both push on the same cycle, the TRISC_i instruction is silently discarded. BRISC must only push to a thread when that thread’s TRISC is not actively issuing.

In practice, BRISC issues instructions only during initialization (before TRISCs start their kernels):

// From brisc.cc — device_setup()
instrn_buf[0] = core.instrn_buf_base(0);  // 0xFFE40000
instrn_buf[1] = core.instrn_buf_base(1);  // 0xFFE50000
instrn_buf[2] = core.instrn_buf_base(2);  // 0xFFE60000

core.ex_zeroacc(instrn_buf[0]);                     // Clear dest registers
core.ex_encc(instrn_buf[0]);                         // Enable CC stack
core.ex_load_const(instrn_buf[0]);                   // Load SFPU constants
core.initialize_tensix_semaphores(instrn_buf[0]);    // Init hardware semaphores

NCRISC cannot push Tensix instructions.

Hardware Semaphores

The Tensix coprocessor has 8 hardware semaphores accessed via the PC Buffer address space (PC_BUF_BASE + semaphore_offset). Key semaphores used for inter-thread synchronization:

| Semaphore | Index | Purpose |
|-----------|-------|---------|
| MATH_PACK | 1 | TRISC1 (math) <-> TRISC2 (pack) sync on Dst register |
| UNPACK_TO_DEST | 2 | TRISC0 (unpack) <-> TRISC1 (math) sync on unpack-to-dest |
| MATH_DONE | 7 | Wait for TRISC1 math completion |

Semaphore operations (SEMINIT, SEMPOST, SEMGET) are Tensix instructions issued through the instruction FIFO. STALLWAIT with semaphore wait conditions allows a thread to stall in the Wait Gate until a semaphore reaches a threshold.

BRISC initializes all semaphores at boot via SEMINIT through T0’s FIFO.
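
To make the producer/consumer pattern concrete, here is a rough software model of one semaphore. Several details are assumptions rather than facts from the text above: that each semaphore is a counter with a maximum set at init time, that SEMPOST increments and SEMGET decrements with saturation, and that the two useful wait conditions are "value is zero" and "value is at max". All names in the block are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_SEMAPHORES 8u  /* count stated above */
#define SEM_MATH_PACK  1u  /* index from the table above */

/* Hypothetical model: a saturating counter with a configurable maximum. */
typedef struct {
    uint8_t value;
    uint8_t max;
} tensix_sem_t;

/* Models SEMINIT (assumed to set both the maximum and the initial value). */
static void model_seminit(tensix_sem_t *s, uint8_t max, uint8_t initial) {
    assert(initial <= max);
    s->max = max;
    s->value = initial;
}

/* Models SEMPOST (assumed: increment, saturating at max). */
static void model_sempost(tensix_sem_t *s) {
    if (s->value < s->max) {
        s->value++;
    }
}

/* Models SEMGET (assumed: decrement, saturating at zero). */
static void model_semget(tensix_sem_t *s) {
    if (s->value > 0) {
        s->value--;
    }
}

/* A consumer (e.g. pack) would stall while nothing has been produced. */
static bool model_stall_while_zero(const tensix_sem_t *s) {
    return s->value == 0;
}

/* A producer (e.g. math) would stall while the consumer has not caught up. */
static bool model_stall_while_max(const tensix_sem_t *s) {
    return s->value >= s->max;
}
```

Under these assumptions, a MATH_PACK exchange reads as: math posts after filling Dst, pack stalls while the value is zero, and pack issues a get once it has drained Dst so that math can proceed.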

Other Key Address Regions

| Address | Symbol | Purpose |
|---------|--------|---------|
| 0xFFE00000 | REGFILE_BASE | ThCon GPR file (192 regs, 64 per thread) |
| 0xFFE80000 | PC_BUF_BASE | T0 PC buffer / sync registers |
| 0xFFE90000 | PC1_BUF_BASE | T1 PC buffer |
| 0xFFEA0000 | PC2_BUF_BASE | T2 PC buffer |
| 0xFFB80000 | TENSIX_MOP_CFG_BASE | MOP Expander config (write-only, 9 words) |
| 0xFFB11000 | RISCV_TDMA_REGS | TDMA mover command registers |
| 0xFFEC0000 | TENSIX_MAILBOX0 | Hardware mailbox (BRISC) |
| 0xFFEC1000 | TENSIX_MAILBOX1 | Hardware mailbox (TRISC0) |
| 0xFFEC2000 | TENSIX_MAILBOX2 | Hardware mailbox (TRISC1) |
| 0xFFEC3000 | TENSIX_MAILBOX3 | Hardware mailbox (TRISC2) |
| 0xFFEF0000 | TENSIX_CFG_BASE | Backend config registers (unpack/pack/FPU config) |
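
The PC buffer and mailbox rows follow regular strides (0x10000 per thread and 0x1000 per mailbox), so their addresses can be computed rather than listed. The helper names and the two stride macros below are hypothetical; the base addresses are the ones in the table.

```c
#include <assert.h>
#include <stdint.h>

#define PC_BUF_BASE      0xFFE80000u
#define PC_BUF_STRIDE    0x00010000u  /* derived from the table: one window per thread */
#define TENSIX_MAILBOX0  0xFFEC0000u
#define MAILBOX_STRIDE   0x00001000u  /* derived from the table: one window per core */

/* Address of the PC buffer for thread 0, 1, or 2. */
static inline uint32_t pc_buf_addr(uint32_t thread) {
    assert(thread < 3);
    return PC_BUF_BASE + thread * PC_BUF_STRIDE;
}

/* Address of hardware mailbox 0..3 (0 = BRISC, 1..3 = TRISC0..TRISC2). */
static inline uint32_t mailbox_addr(uint32_t index) {
    assert(index < 4);
    return TENSIX_MAILBOX0 + index * MAILBOX_STRIDE;
}
```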

BRISC <-> TRISC Orchestration

BRISC controls TRISC lifecycle via two mechanisms:

Software Mailboxes (L1)

BRISC writes a go signal to each TRISC’s slot in subordinate_sync in L1, and each TRISC writes a done signal back to its slot when its kernel finishes:

subordinate_sync->trisc0 = RUN_SYNC_MSG_GO;   // 0x80 = start kernel
subordinate_sync->trisc1 = RUN_SYNC_MSG_GO;
subordinate_sync->trisc2 = RUN_SYNC_MSG_GO;
// TRISCs write back RUN_SYNC_MSG_DONE (0x00) when finished
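
The handshake can be modeled on a host as a simple sketch. The struct below is a stand-in for the real L1 layout, which may contain other fields, and the function names are hypothetical. Only the field names and the two message values come from the snippet above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RUN_SYNC_MSG_GO   0x80u
#define RUN_SYNC_MSG_DONE 0x00u

/* Stand-in for the L1 mailbox structure (assumed layout, one byte per TRISC). */
typedef struct {
    volatile uint8_t trisc0;
    volatile uint8_t trisc1;
    volatile uint8_t trisc2;
} subordinate_sync_model_t;

/* BRISC side: tell all three TRISCs to start their kernels. */
static void brisc_signal_go(subordinate_sync_model_t *s) {
    s->trisc0 = RUN_SYNC_MSG_GO;
    s->trisc1 = RUN_SYNC_MSG_GO;
    s->trisc2 = RUN_SYNC_MSG_GO;
}

/* TRISC side: report completion for thread 0, 1, or 2. */
static void trisc_signal_done(subordinate_sync_model_t *s, unsigned thread) {
    assert(thread < 3);
    switch (thread) {
    case 0: s->trisc0 = RUN_SYNC_MSG_DONE; break;
    case 1: s->trisc1 = RUN_SYNC_MSG_DONE; break;
    default: s->trisc2 = RUN_SYNC_MSG_DONE; break;
    }
}

/* BRISC side: true once every TRISC has written DONE back. */
static bool brisc_all_done(const subordinate_sync_model_t *s) {
    return s->trisc0 == RUN_SYNC_MSG_DONE &&
           s->trisc1 == RUN_SYNC_MSG_DONE &&
           s->trisc2 == RUN_SYNC_MSG_DONE;
}
```

On real hardware BRISC would poll a check like `brisc_all_done` in a loop; the model only captures the state transitions.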

PC Buffer

BRISC writes kernel launch tokens to pc_buf[thread] to trigger TRISC execution. TRISCs call tensix_sync() (a blocking store to pc_buf_base[1]) after each kernel to drain the coprocessor pipeline before signaling done.

Key Opcodes

| Opcode | Instruction | Execution Unit |
|--------|-------------|----------------|
| 0x01 | MOP | Frontend (MOP Expander) |
| 0x02 | REPLAY | Frontend (Replay Expander) |
| 0x08-0x0f | STALLWAIT, SEMWAIT, SEMINIT | Sync Unit |
| 0x28 | ELWADD | Matrix Unit (FPU) |
| 0x29 | ELWSUB | Matrix Unit (FPU) |
| 0x2a | ELWMUL | Matrix Unit (FPU) |
| 0x41 | PACR | Pack Unit |
| 0x42 | UNPACR | Unpack Unit |
| 0x58 | MATMUL | Matrix Unit (FPU) |
| 0x80-0x8f | SFP* | Vector Unit (SFPU) |
| 0xa0-0xaf | WRCFG, RMWCIB | Config Unit |
| 0xb0-0xbf | THCON_LD_IND, THCON_ST_IND | Scalar Unit (ThCon) |
| 0xb2 | SETC16 | Config Unit |

Source References

  • ISA spec (coprocessor overview): tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/README.md
  • Push mechanism: tt-isa-documentation/WormholeB0/TensixTile/BabyRISCV/PushTensixInstruction.md
  • Blackhole push (.ttinsn fusion): tt-isa-documentation/BlackholeA0/TensixTile/BabyRISCV/PushTensixInstruction.md
  • MOP Expander: tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/MOPExpander.md
  • Instruction encoding (all opcodes): tt-llk/tt_llk_blackhole/common/inc/ckernel_ops.h
  • ISA YAML (opcode -> execution unit mapping): tt-llk/tt_llk_blackhole/instructions/assembly.yaml
  • BRISC firmware: tt-metal/tt_metal/hw/firmware/src/tt-1xx/brisc.cc
  • Instruction push helpers: tt-metal/tt_metal/hw/inc/internal/tensix_functions.h
  • Address map: tt-metal/tt_metal/hw/inc/internal/tt-1xx/blackhole/tensix.h
  • Stall/wait parameters: tt-llk/tt_llk_blackhole/common/inc/ckernel_instr_params.h