anuraagw.me


Dest, SrcA, and SrcB Register Files

The Tensix coprocessor has three named register files that hold tile data during computation: Dest (also called “accumulator” or “Dst”), SrcA, and SrcB. Each is physically separate storage with its own addressing and ownership rules.


1. Dest Register File

1.1 Physical Storage

Dest is a flat array of 16-bit cells:

uint16_t DstBits[1024][16];
bool     DstRowValid[1024];
  • 1024 rows, each 16 columns wide.
  • Each cell is 16 bits of raw storage.
  • Each row has one associated valid bit (DstRowValid).
  • Total storage: 1024 × 16 × 2 bytes = 32 KiB.

1.2 Two Views: Dst16b and Dst32b

The same storage is exposed under two named views depending on the active data format:

| View | Shape | Element width | Row index range |
|---|---|---|---|
| Dst16b | 1024 rows × 16 cols | 16-bit | 0–1023 |
| Dst32b | 512 rows × 16 cols | 32-bit | 0–511 |

Both views alias the same DstBits[1024][16] array. Dst32b[Row][Col] is defined as:

uint32_t read  = (DstBits[Adj32(Row)][Col] << 16) | DstBits[Adj32(Row) + 8][Col];
// write stores the high 16b at Adj32(Row) and the low 16b at Adj32(Row)+8

Dst16b[Row][Col] is normally sugar for DstBits[Adj16(Row)][Col].

Adj16 and Adj32 functions (from Dst.md):

uint10_t Adj16(uint10_t r) {
  if (Config.DEST_ACCESS_CFG_remap_addrs) {
    r = (r & 0x3c7) ^ ((r & 0x030) >> 1) ^ ((r & 0x008) << 2);
  }
  return r;
}

uint10_t Adj32(uint10_t r) {
  r = Adj16(r);
  if (Config.DEST_ACCESS_CFG_swizzle_32b) {
    r = (r & 0x3f3) ^ ((r & 0x018) >> 1) ^ ((r & 0x004) << 1);
  }
  return ((r & 0x1f8) << 1) | (r & 0x207);
}

Both remap_addrs and swizzle_32b also affect how packers address Dest.
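
The aliasing between the two views can be sketched as a small C++ model. The type and method names (DestConfig, DestModel, write32, read32) are made up for illustration; only the Adj16/Adj32 arithmetic and the high/low split at Adj32(Row) and Adj32(Row) + 8 are taken from the text above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the two config bits named in the text.
struct DestConfig {
    bool remap_addrs = false;  // Config.DEST_ACCESS_CFG_remap_addrs
    bool swizzle_32b = false;  // Config.DEST_ACCESS_CFG_swizzle_32b
};

struct DestModel {
    std::uint16_t bits[1024][16] = {};  // DstBits
    DestConfig cfg;

    std::uint16_t adj16(std::uint16_t r) const {
        if (cfg.remap_addrs) {
            r = (r & 0x3c7) ^ ((r & 0x030) >> 1) ^ ((r & 0x008) << 2);
        }
        return static_cast<std::uint16_t>(r & 0x3ff);  // keep 10 bits
    }

    std::uint16_t adj32(std::uint16_t r) const {
        r = adj16(r);
        if (cfg.swizzle_32b) {
            r = (r & 0x3f3) ^ ((r & 0x018) >> 1) ^ ((r & 0x004) << 1);
        }
        return static_cast<std::uint16_t>(((r & 0x1f8) << 1) | (r & 0x207));
    }

    // Dst32b write: high 16 bits at Adj32(row), low 16 bits at Adj32(row) + 8.
    void write32(std::uint16_t row, unsigned col, std::uint32_t v) {
        assert(row < 512 && col < 16);
        const std::uint16_t r = adj32(row);
        bits[r][col] = static_cast<std::uint16_t>(v >> 16);
        bits[r + 8][col] = static_cast<std::uint16_t>(v & 0xffff);
    }

    // Dst32b read: reassemble the two 16-bit halves.
    std::uint32_t read32(std::uint16_t row, unsigned col) const {
        assert(row < 512 && col < 16);
        const std::uint16_t r = adj32(row);
        return (static_cast<std::uint32_t>(bits[r][col]) << 16) | bits[r + 8][col];
    }
};
```

With both flags clear, Dst32b row 256 lands on DstBits rows 512 and 520 in this model, which is why the Dst32b midpoint lines up with the Dst16b midpoint.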

1.3 Data Types

Dst16b elements hold one of:

| Type | Width | Notes |
|---|---|---|
| BF16 | 16-bit | Sign(1) + Man(7) + Exp(8); bit order in storage is Sign,Man(7b),Exp(8b) |
| FP16 | 16-bit | Sign(1) + Exp(5) + Man(10); bit order in storage is Sign,Man(10b),Exp(5b) |
| Integer “8” | 16-bit | Overlaid on FP16; Sign(1) + Mag(10b), raw exponent field held as 16 (or 0 when zero) |
| Integer “16” | 16-bit | Opaque 16 bits; no computation instructions use this type |

Dst32b elements hold one of:

| Type | Width | Notes |
|---|---|---|
| FP32 | 32-bit | Sign(1) + Exp(8) + Man(23); stored as Sign,Man(high 7b),Exp(8b),Man(low 16b) |
| Integer “32” | 32-bit | Sign/magnitude: 1 sign bit + 31 magnitude bits |

The coprocessor does not fully conform to IEEE 754 for floating-point types.

Active mode selection: Config.ALU_ACC_CTRL_Fp32_enabled or Config.ALU_ACC_CTRL_INT8_math_enabled selects Dst32b; otherwise Dst16b is active. Most instructions (MVMUL, ZEROACC, MOVD2A, etc.) read this flag to decide which view to use.

1.4 Tile and Face Decomposition

LLK software decomposes Dest into tiles and faces for addressing:

| Constant | Value | Source |
|---|---|---|
| FACE_HEIGHT | 16 rows | ckernel_defs.h |
| FACE_WIDTH | 16 columns | ckernel_defs.h |
| TILE_HEIGHT | 32 rows | ckernel_defs.h |
| TILE_WIDTH | 32 columns | ckernel_defs.h |
| DEST_REGISTER_FULL_SIZE | 1024 rows (Dst16b) | tensix_types.h: 64 * DEST_FACE_HEIGHT |
| DEST_REGISTER_HALF_SIZE | 512 rows (Dst16b) | tensix_types.h: DEST_REGISTER_FULL_SIZE / 2 |
| BIT32_DEST_REGISTER_HALF_SIZE | 256 rows (Dst32b) | tensix_types.h: DEST_REGISTER_HALF_SIZE / 2 |
| DEST_NUM_TILES_FP16 | 16 tiles (32×32) | ckernel_defs.h |
| MATH_HALF_DEST_SIZE | 32 faces (of 16×16) | ckernel_structs.h |

A 32×32 tile occupies 64 Dst16b rows (4 faces × 16 rows/face). A 32×16 tile occupies 32 rows. A 16×16 tile occupies 16 rows.

In FP32 (Dst32b) mode, a 32×32 tile still spans 64 row indices, but they are Dst32b rows: each one is backed by two DstBits rows (Adj32(Row) and Adj32(Row) + 8), so the tile consumes 128 DstBits rows. The effective tile capacity is therefore halved: only 8 tiles fit in Dest instead of 16, and each half holds 4 tiles instead of 8.
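
The capacity figures follow from counting physical DstBits rows rather than view row indices. A minimal sketch of that arithmetic, with hypothetical helper names:

```cpp
#include <cassert>

constexpr int kDstBitsRows = 1024;  // DEST_REGISTER_FULL_SIZE
constexpr int kFaceRows = 16;       // FACE_HEIGHT
constexpr int kFacesPerTile = 4;    // a 32x32 tile is four 16x16 faces

// Physical DstBits rows consumed by one 32x32 tile.
// In Dst32b mode every view row is backed by two 16-bit rows.
constexpr int dstbits_rows_per_tile(bool fp32) {
    return (fp32 ? 2 : 1) * kFacesPerTile * kFaceRows;
}

constexpr int tiles_in_dest(bool fp32) {
    return kDstBitsRows / dstbits_rows_per_tile(fp32);
}

constexpr int tiles_per_half(bool fp32) {
    return tiles_in_dest(fp32) / 2;
}

static_assert(tiles_in_dest(false) == 16, "matches DEST_NUM_TILES_FP16");
static_assert(tiles_in_dest(true) == 8, "capacity is halved in Dst32b mode");
```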

1.5 Half-Dest Double-Buffering

Dest is logically split into two halves:

| Half | Dst16b rows | Dst32b rows | LLK constant |
|---|---|---|---|
| Low half | 0–511 | 0–255 | offset 0 (dest_offset_id == 0) |
| High half | 512–1023 | 256–511 | offset DEST_REGISTER_HALF_SIZE = 512 (dest_offset_id == 1) |

The MATH_PACK semaphore (index 1) coordinates ownership of these halves between the math thread (T1) and the pack thread (T2). The protocol:

Initialization (SyncHalf mode):
  SEMINIT(max=2, val=0, sem=MATH_PACK)
  dest_offset_id = 0

Math thread (T1) before each tile:
  SEMWAIT on MATH_PACK != Max (i.e. room to write)
  → writes math results into current half (dest_offset_id)
  → SEMPOST(MATH_PACK)    [set_math_semaphores()]
  → dest_section_flip():  update_dest_offset_id(), flip DEST_TARGET_REG_CFG_MATH_Offset

Pack thread (T2) before each tile:
  SEMWAIT on MATH_PACK != 0 (i.e. data to pack)
  → PACR reads the half that math just finished
  → ZEROACC(CLR_HALF, dest_offset_id) to mark that half as invalid
  → SEMGET(MATH_PACK)     [_llk_packer_set_math_semaphore_()]
  → flip_packer_dest_offset_id(), select_packer_dest_registers()

The semaphore’s Max value is 2 in SyncHalf mode and 1 in SyncFull mode (where math and pack share the entire Dest sequentially). With Max=2, the semaphore can range 0–2, allowing one half to be in math’s hands and the other in pack’s hands simultaneously.

The variable dest_offset_id (0 or 1) tracks which half is currently in use. get_dest_buffer_base() returns dest_offset_id ? DEST_REGISTER_HALF_SIZE : 0. The math thread and the pack thread each keep their own copy of this state (math flips it in dest_section_flip(), pack in flip_packer_dest_offset_id()); the two copies stay in lockstep only because the semaphore orders the flips.
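
The token-counting behaviour can be illustrated with a single-threaded model. This is only a sketch: DestSync and its methods are hypothetical names, and real hardware blocks in SEMWAIT where this model asserts.

```cpp
#include <cassert>

// Single-threaded model of the MATH_PACK token counter (hypothetical names).
struct DestSync {
    int max;            // 2 for SyncHalf, 1 for SyncFull
    int val = 0;        // current semaphore value
    int math_half = 0;  // math thread's dest_offset_id
    int pack_half = 0;  // pack thread's dest_offset_id

    explicit DestSync(bool sync_half) : max(sync_half ? 2 : 1) {}

    // SEMWAIT conditions from the protocol above.
    bool math_can_write() const { return val != max; }
    bool pack_can_read() const { return val != 0; }

    // Math finished a section: SEMPOST, then flip to the other half.
    // Returns the half that was just written.
    int math_section_done() {
        assert(math_can_write());
        const int written = math_half;
        ++val;                          // SEMPOST(MATH_PACK)
        if (max == 2) math_half ^= 1;   // dest_section_flip()
        return written;
    }

    // Pack finished a section: SEMGET, then flip to the other half.
    // Returns the half that was just packed.
    int pack_section_done() {
        assert(pack_can_read());
        const int packed = pack_half;
        --val;                          // SEMGET(MATH_PACK)
        if (max == 2) pack_half ^= 1;   // flip_packer_dest_offset_id()
        return packed;
    }
};
```

In SyncHalf mode the model lets math run up to two sections ahead of pack before math_can_write() turns false; in SyncFull mode the two sides strictly alternate.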

1.6 ZEROACC: Marking Rows Invalid

ZEROACC does not write zeroes into DstBits. It clears DstRowValid bits. Subsequent reads of an invalid row behave as:

  • Packers: read zero from invalid rows.
  • Matrix Unit (FPU): read the identity element (0 for MVMUL/ELWADD; −∞ for GMPOOL/MPOOL3S*), then mark the row valid after writing.
  • Vector Unit (SFPU): UndefinedBehavior if reading an invalid row.
  • Unpackers: UndefinedBehavior if writing to only some columns of an invalid row.

ZEROACC field encoding (from dsl.py and ZEROACC.md):

bits[23:19] = clear_mode    (5 bits)
bits[18]    = use_32_bit_mode
bits[17]    = clear_zero_flags
bits[16:14] = addr_mode     (3 bits, AddrMod)
bits[13:0]  = where         (14 bits, Imm10 + extras)
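
As a sketch of how these fields pack together, the helper below assembles the low 24 bits of a ZEROACC word from the layout just listed. The function name is hypothetical and the opcode bits above bit 23 are deliberately left out.

```cpp
#include <cassert>
#include <cstdint>

// Packs the ZEROACC operand fields into bits [23:0]; opcode bits are not modeled.
constexpr std::uint32_t zeroacc_fields(std::uint32_t clear_mode,
                                       std::uint32_t use_32_bit_mode,
                                       std::uint32_t clear_zero_flags,
                                       std::uint32_t addr_mode,
                                       std::uint32_t where) {
    return ((clear_mode & 0x1fu) << 19)       // bits[23:19]
         | ((use_32_bit_mode & 0x1u) << 18)   // bits[18]
         | ((clear_zero_flags & 0x1u) << 17)  // bits[17]
         | ((addr_mode & 0x7u) << 14)         // bits[16:14]
         | (where & 0x3fffu);                 // bits[13:0]
}

// Field extractors, the inverse of the packing above.
constexpr std::uint32_t zeroacc_clear_mode(std::uint32_t w) { return (w >> 19) & 0x1fu; }
constexpr std::uint32_t zeroacc_where(std::uint32_t w) { return w & 0x3fffu; }
```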

clear_mode values (from LLK ckernel_defs.h):

| clear_mode | Constant | Behavior |
|---|---|---|
| 0 | ClearRow / ZEROACC_MODE_ONE_ROW | Clear one row at Imm10 + DEST_TARGET_REG_CFG_MATH_Offset + RWC.Dst + DEST_REGW_BASE. Applies address remapping when Dst32b mode is active. Advances AddrMod. |
| 1 | Clear16Rows / ZEROACC_MODE_16_ROWS | Clear 16 contiguous rows within a tile-aligned block. Imm10 selects the block. In Dst16b mode: block = Imm10 * 16 (requires Imm10 < 64). In Dst32b (use_32_bit_mode=1): block = Imm10 * 32 with 16 rows scattered (requires Imm10 < 32). Advances AddrMod. Out-of-range Imm10 is a NOP on current silicon. |
| 2 | ClearHalf / ZEROACC_MODE_HALF_OF_DST | Clear rows 0–511 when (where & 1) == 0, or rows 512–1023 when (where & 1) == 1. No AddrMod. |
| 3 | ClearFull / ZEROACC_MODE_ALL_OF_DST | Clear all 1024 rows. No AddrMod. |

use_32_bit_mode=1 selects the Dst32b scatter pattern for ZEROACC_MODE_16_ROWS. For ZEROACC_MODE_ONE_ROW, the equivalent behavior is instead controlled by ALU_ACC_CTRL_Fp32_enabled or ALU_ACC_CTRL_INT8_math_enabled in the backend config.

Typical packer sequence (SyncHalf):

// In pack thread, after reading a half:
TT_ZEROACC(p_zeroacc::CLR_HALF, is_fp32_dest_acc_en, 0, ADDR_MOD_1, dest_offset_id % 2);
// dest_offset_id==0 → clears low half (rows 0–511)
// dest_offset_id==1 → clears high half (rows 512–1023)

Typical init sequence (SyncFull):

TTI_ZEROACC(p_zeroacc::CLR_ALL, 0, 0, ADDR_MOD_1, 0);
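
For an emulator, the key point is that ZEROACC touches only the valid bits. The sketch below models three of the clear modes and the packer's view of invalid rows. DestRows and its method names are hypothetical, and the single-row mode, the Dst32b scatter pattern, and address remapping are intentionally omitted.

```cpp
#include <cassert>
#include <cstdint>

struct DestRows {
    std::uint16_t bits[1024][16] = {};  // DstBits
    bool row_valid[1024] = {};          // DstRowValid

    // ZEROACC_MODE_HALF_OF_DST: rows 0-511 or 512-1023, chosen by where & 1.
    void clear_half(unsigned where) {
        const unsigned base = (where & 1u) ? 512u : 0u;
        for (unsigned r = 0; r < 512; ++r) row_valid[base + r] = false;
    }

    // ZEROACC_MODE_ALL_OF_DST.
    void clear_all() {
        for (bool& v : row_valid) v = false;
    }

    // ZEROACC_MODE_16_ROWS in Dst16b mode; out-of-range Imm10 is a NOP.
    void clear_16_rows_16b(unsigned imm10) {
        if (imm10 >= 64) return;
        for (unsigned r = 0; r < 16; ++r) row_valid[imm10 * 16 + r] = false;
    }

    // Stand-in for any writer that fills a whole row and marks it valid.
    void write_row(unsigned row, std::uint16_t value) {
        assert(row < 1024);
        for (unsigned c = 0; c < 16; ++c) bits[row][c] = value;
        row_valid[row] = true;
    }

    // Packers read zero from invalid rows; the stored bits are untouched.
    std::uint16_t packer_read(unsigned row, unsigned col) const {
        assert(row < 1024 && col < 16);
        return row_valid[row] ? bits[row][col] : std::uint16_t{0};
    }
};
```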

1.7 RISCV Debug Window at 0xFFBD8000

RISCV T0, T1, and T2 can read and write Dest directly via a memory-mapped window:

Base address: 0xFFBD8000
Size:         32 KiB  (= DEST_REGISTER_FULL_SIZE_BYTES = 1024 × 16 × 2 bytes)

The window always spans the entire 1024-row, 16-column Dest array regardless of whether Dst16b or Dst32b mode is active. Access format is controlled by per-thread config RISC_DEST_ACCESS_CTRL_SEC[CurrentThread].{no_swizzle, unsigned_int, fmt}:

| fmt | RISCV view | Access width |
|---|---|---|
| 0 | float Dst32b[512][16] or uint32_t Dst32b[512][16] | 32-bit lw/sw |
| 1 | int32_t Dst32b[512][16] (two’s complement ↔ sign-magnitude conversion) | 32-bit |
| 2 | __fp16 Dst16b[1024][16] | 16-bit lh/sh |
| 3 | __bf16 Dst16b[1024][16] | 16-bit |
| 4 | int16_t or uint16_t Dst16b[1024][16] | 16-bit |
| 5 | int8_t or uint8_t Dst16b[1024][16] | 8-bit lb/sb |

Bit-layout conversions (swizzling) are applied on access unless no_swizzle is set. For example, FP32 values are stored in Dest with a non-standard layout (Sign,Man(high 7b),Exp(8b),Man(low 16b)), and the no_swizzle=0 path reorders them to and from the standard IEEE 754 layout on each load/store.
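
The two’s complement ↔ sign-magnitude conversion listed for fmt=1 can be sketched as follows. The function names are hypothetical, and the handling of INT32_MIN and of negative zero is an assumption of this sketch rather than documented hardware behavior.

```cpp
#include <cassert>
#include <cstdint>

// Two's complement -> 1 sign bit + 31 magnitude bits.
// Assumption: INT32_MIN, whose magnitude does not fit in 31 bits, wraps to magnitude 0.
std::uint32_t to_sign_magnitude(std::int32_t v) {
    if (v >= 0) return static_cast<std::uint32_t>(v);
    // Negate in unsigned arithmetic so INT32_MIN does not overflow.
    const std::uint32_t mag = 0u - static_cast<std::uint32_t>(v);
    return 0x80000000u | (mag & 0x7fffffffu);
}

// 1 sign bit + 31 magnitude bits -> two's complement.
// Assumption: negative zero (0x80000000) reads back as 0.
std::int32_t from_sign_magnitude(std::uint32_t raw) {
    const std::int32_t mag = static_cast<std::int32_t>(raw & 0x7fffffffu);
    return (raw & 0x80000000u) ? -mag : mag;
}
```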

RISCV T0 and T1 access one element at a time with the appropriate-width instruction. RISCV T2 can access multiple elements per instruction, aligned to the total transfer size.

Address calculation (for T0/T1): element_index = (Addr - 0xFFBD8000) / element_bytes.

// FP32 mode (fmt=0): Addr = 0xFFBD8000 + (row * 16 + col) * 4
// BF16 mode (fmt=3): Addr = 0xFFBD8000 + (row * 16 + col) * 2
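
A minimal sketch of that address arithmetic, using hypothetical helper names and covering only the 4-byte (fmt=0) and 2-byte (fmt=3) cases given above:

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kDestWindowBase = 0xFFBD8000u;

// Address of element (row, col) for a given element size in bytes.
constexpr std::uint32_t dest_window_addr(std::uint32_t row, std::uint32_t col,
                                         std::uint32_t element_bytes) {
    return kDestWindowBase + (row * 16u + col) * element_bytes;
}

// Inverse: flat element index (row * 16 + col) for an address in the window.
constexpr std::uint32_t dest_window_element_index(std::uint32_t addr,
                                                  std::uint32_t element_bytes) {
    return (addr - kDestWindowBase) / element_bytes;
}

// The last element of either view sits at the very end of the 32 KiB window.
static_assert(dest_window_addr(511, 15, 4) + 4 == kDestWindowBase + 32u * 1024u, "");
static_assert(dest_window_addr(1023, 15, 2) + 2 == kDestWindowBase + 32u * 1024u, "");
```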

Debug usage example (from dprint_tensix.h):

// Read a FP32 row from Dest (ARCH_BLACKHOLE path):
const uint32_t* addr = reinterpret_cast<const uint32_t*>(0xFFBD8000);
for (int i = 0; i < 16; ++i) {
    rd_data[i] = addr[i + (row << 4)];  // row * 16 + column
}

Note on debug window size: the window is always 32 KiB regardless of the active data format, because the underlying DstBits[1024][16] is 32 KiB of 16-bit storage. The 32-bit formats (fmt=0 and fmt=1) expose Dst32b rows 0–511 (512 × 16 × 4 bytes = 32 KiB), and the 16-bit formats expose Dst16b rows 0–1023 (1024 × 16 × 2 bytes = 32 KiB); both cover the same physical storage.

1.8 Instruction Scheduling Hazard

After any instruction that writes to Dest, the aligned 8-row × 16-column block it wrote cannot be read for the next 4 cycles. The hardware stalls the thread automatically if a Matrix Unit or PACR instruction tries to read it during that window. To avoid stalls when accumulating (e.g., in a loop of MVMUL instructions), software should cycle over at least 5 distinct 8-row blocks of Dest before returning to the same block.


2. SrcA Register File

2.1 Physical Storage

enum class SrcClient { MatrixUnit, Unpackers };

struct {
  SrcClient AllowedClient;   // initially Unpackers
  uint19_t  Rows[64][16];
} SrcA[2];
  • 2 banks, each with 64 rows × 16 columns of 19-bit data.
  • Total storage per bank: 64 × 16 × 19 bits = 19,456 bits = 2,432 bytes ≈ 2.4 KiB when packed (3 KiB if each element is padded to a 3-byte cell).
  • Total SrcA: 2 × 2.4 KiB ≈ 4.8 KiB.

The 19-bit element width accommodates TF32 (19 bits: 1 sign + 10 mantissa + 8 exponent), the widest data type in SrcA.

2.2 Data Types

| Type | Storage width | Notes |
|---|---|---|
| TF32 | 19 bits | Sign(1) + Man(10) + Exp(8); stored as Sign,Man(10b),Exp(8b) |
| BF16 | 19 bits | Overlaid on TF32 with low 3 mantissa bits = 0; stored as Sign,Man(10b),Exp(8b) |
| FP16 | 19 bits | Stored as Sign,Man(10b),Zero(3b),Exp(5b) |
| Integer “8” | 19 bits | Overlaid on FP16; Sign(1) + Mag(10b), fixed exponent field |
| Integer “16” | 19 bits | Opaque 16-bit transfer; no computation instructions |

The BF16/TF32 and FP16/Int8 internal representations differ between Src and Dst. Shuffle functions handle the conversion:

uint19_t ShuffleBF16(uint16_t x) {  // Dst BF16 → Src BF16
  return ((x & 0xFF00) << 3) | (x & 0xFF);
}
uint19_t ShuffleFP16(uint16_t x) {  // Dst FP16 → Src FP16
  return ((x & 0xFFE0) << 3) | (x & 0x1F);
}
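uint19_t ShuffleTF32(uint19_t x) {  // Dst TF32 → Src TF32
  // Input layout (19 bits): Sign,Man(high 7b) in [18:11], Exp(8b) in [10:3], Man(low 3b) in [2:0]
  uint19_t SignHiMan = x & 0x07f800;
  uint19_t Exp       = x & 0x0007f8;
  uint19_t LoMan     = x & 0x000007;
  // Output layout: Sign,Man(10b),Exp(8b)
  return SignHiMan | (LoMan << 8) | (Exp >> 3);
}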
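
As a worked example of the BF16 path, the sketch below converts an IEEE-ordered bf16 bit pattern into the Dst storage order and then applies the BF16 shuffle. It assumes the layouts in the tables are listed most-significant field first, which is consistent with the masks used by ShuffleBF16. The helper names, and the inverse function for the Src → Dst direction, are illustrative rather than taken from the spec.

```cpp
#include <cassert>
#include <cstdint>

// IEEE bf16 bits (Sign, Exp8, Man7) -> Dst storage order (Sign, Man7, Exp8).
std::uint16_t bf16_ieee_to_dst(std::uint16_t ieee) {
    const std::uint16_t sign = ieee & 0x8000;
    const std::uint16_t exp = (ieee >> 7) & 0xff;
    const std::uint16_t man = ieee & 0x7f;
    return static_cast<std::uint16_t>(sign | (man << 8) | exp);
}

// Dst BF16 (16 bits) -> Src BF16 (19 bits: Sign, Man10 with low 3 bits zero, Exp8).
std::uint32_t shuffle_bf16(std::uint16_t x) {
    return (static_cast<std::uint32_t>(x & 0xFF00) << 3) | (x & 0xFFu);
}

// Assumed inverse for the Src -> Dst direction: drop the 3 low mantissa bits.
std::uint16_t unshuffle_bf16(std::uint32_t s) {
    return static_cast<std::uint16_t>(((s >> 3) & 0xFF00u) | (s & 0xFFu));
}
```

For instance, bf16 −1.5 is 0xBFC0 in IEEE order; under these assumptions it becomes 0xC07F in Dst and 0x6007F in SrcA or SrcB.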

2.3 Bank Tracking State

Four bank indices are maintained in total, two for SrcA (shown here) and two for SrcB (section 3.3):

uint1_t MatrixUnit::SrcABank = 0;     // which bank FPU is reading from
uint1_t Unpackers[0]::SrcBank = 0;   // which bank unpacker 0 is writing to

Additionally, each unpacker tracks a per-thread row cursor:

uint6_t Unpackers[0]::SrcRow[3];  // indexed by Tensix thread

2.4 Double-Buffering Protocol (Bank Flipping)

The two SrcA banks allow the unpacker to load the next tile while the Matrix Unit consumes the current tile. Ownership is mediated by AllowedClient:

Giving a bank to the Matrix Unit (SETDVALID / UNPACR_NOP):

// SETDVALID with FlipSrcA=1 (or UNPACR_NOP 0x7 for WhichUnpacker=0):
SrcA[Unpackers[0].SrcBank].AllowedClient = SrcClient::MatrixUnit;
Unpackers[0].SrcBank ^= 1;
Unpackers[0].SrcRow[CurrentThread] = ThreadConfig[CurrentThread].SRCA_SET_Base << 4;

Giving a bank back to Unpackers (CLEARDVALID):

// CLEARDVALID with FlipSrcA=1:
SrcA[MatrixUnit.SrcABank].AllowedClient = SrcClient::Unpackers;
if (!KeepReadingSameSrc) MatrixUnit.SrcABank ^= 1;

Flipping during MVMUL (FlipSrcA bit):

// At end of MVMUL with FlipSrcA=1:
if (!ThreadConfig[CurrentThread].CLR_DVALID_SrcA_Disable) {
    SrcA[MatrixUnit.SrcABank].AllowedClient = SrcClient::Unpackers;
}
MatrixUnit.SrcABank ^= 1;

STALLWAIT conditions for SrcA bank synchronization:

| Condition | Meaning |
|---|---|
| C5 | SrcA[Unpackers[0].SrcBank].AllowedClient != SrcClient::Unpackers (unpack side is not ready) |
| C7 | SrcA[MatrixUnit.SrcABank].AllowedClient != SrcClient::MatrixUnit (math side is not ready) |
| SRCA_VLD (0x80) | Used as a STALLWAIT mask for waiting for SrcA to be given to the Matrix Unit |

LLK calls: wait_bank_valid<SrcA>() issues TTI_STALLWAIT(p_stall::STALL_MATH, p_stall::SRCA_VLD).

SETRWC with CLR_A also gives the current SrcA bank back to Unpackers and flips MatrixUnit.SrcABank.
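
The ownership hand-off can be sketched as a tiny state machine. SrcABanks and its method names are hypothetical; the transitions mirror the SETDVALID and CLEARDVALID pseudocode above, and the row-cursor reset is left out.

```cpp
#include <cassert>

enum class SrcClient { MatrixUnit, Unpackers };

struct SrcABanks {
    SrcClient allowed[2] = {SrcClient::Unpackers, SrcClient::Unpackers};
    unsigned unpack_bank = 0;  // Unpackers[0].SrcBank
    unsigned math_bank = 0;    // MatrixUnit.SrcABank

    // STALLWAIT condition C5: the unpacker's current bank is not its to write.
    bool c5_unpack_must_wait() const {
        return allowed[unpack_bank] != SrcClient::Unpackers;
    }
    // STALLWAIT condition C7: the Matrix Unit's current bank is not its to read.
    bool c7_math_must_wait() const {
        return allowed[math_bank] != SrcClient::MatrixUnit;
    }

    // SETDVALID with FlipSrcA=1 (or UNPACR_NOP 0x7 for WhichUnpacker=0).
    void setdvalid() {
        allowed[unpack_bank] = SrcClient::MatrixUnit;
        unpack_bank ^= 1u;
    }

    // CLEARDVALID with FlipSrcA=1.
    void cleardvalid(bool keep_reading_same_src = false) {
        allowed[math_bank] = SrcClient::Unpackers;
        if (!keep_reading_same_src) math_bank ^= 1u;
    }
};
```

In this model the unpacker can hand over both banks before it has to wait (C5), which is the double-buffering window the text describes.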

2.5 How UNPACR Fills SrcA

Unpacker 0 fills SrcA. Key fields from UNPACR_Regular.md:

  • WhichUnpacker = 0 → targets SrcA (or Dest if configured).
  • The unpacker writes to SrcA[Unpackers[0].SrcBank] starting at Unpackers[0].SrcRow[CurrentThread].
  • Row filling is sequential from the initial row, advancing by 1 per datum until the configured YDim rows are written.
  • XDim controls the column count per row (normally 16).
  • In MultiContextMode, XDim can vary by context.
  • X/Y transposition is available when writing to SrcA (not available for SrcB or Dest): if enabled, the unpacker writes columns as rows, producing a transposed result in SrcA.

After the UNPACR operation completes, UNPACR_NOP with opcode 0x7 is used to hand the filled bank to the Matrix Unit and flip the unpacker to the other bank.

2.6 How MVMUL/ELWADD Read from SrcA

MVMUL reads a 16-row × 16-column block from SrcA:

uint6_t SrcARow = RWCs[CurrentThread].SrcA & 0x38;  // aligned to 8
// reads SrcA[MatrixUnit.SrcABank][SrcARow + 0..15][0..15]

The SrcA operand to MVMUL is always exactly 16 rows × 16 columns (a 16×16 matrix). The SrcB operand is 8 rows × 16 columns (aligned to 8). The result is an 8×16 matrix added into Dest.

ELWADD and other element-wise operations also read from SrcA[MatrixUnit.SrcABank] using RWCs[CurrentThread].SrcA as the row index, operating on up to 16 rows.
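
Treating the shapes above as an ordinary matrix product gives the following floating-point sketch of one MVMUL step: an 8×16 block of Dest accumulates SrcB (8×16) times SrcA (16×16). The function name and the use of float are assumptions of the sketch; hardware number formats, rounding, and the row-addressing logic are not modeled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

template <std::size_t R>
using Rows = std::array<std::array<float, 16>, R>;

// dst (8x16) += srcb (8x16, left operand) * srca (16x16, right operand)
void mvmul_sketch(Rows<8>& dst, const Rows<8>& srcb, const Rows<16>& srca) {
    for (std::size_t i = 0; i < 8; ++i) {
        for (std::size_t j = 0; j < 16; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < 16; ++k) {
                acc += srcb[i][k] * srca[k][j];
            }
            dst[i][j] += acc;  // accumulate into Dest
        }
    }
}
```

With an identity matrix in srca, each call adds srcb to dst unchanged, which is a convenient sanity check for an emulator.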


3. SrcB Register File

3.1 Physical Storage

struct {
  SrcClient AllowedClient;   // initially Unpackers
  uint19_t  Rows[64][16];
} SrcB[2];

Identical layout to SrcA: 2 banks × 64 rows × 16 columns × 19 bits.

SrcA and SrcB are physically the same size and the same element width. The distinction is which unpacker writes them (unpacker 0 → SrcA, unpacker 1 → SrcB) and that certain operations treat them asymmetrically (SrcA is the right-hand matrix in MVMUL; SrcB is the left-hand matrix).

3.2 Data Types

Identical to SrcA: TF32, BF16, FP16, Integer “8”, Integer “16” — all stored in 19-bit cells with the same bit layouts.

3.3 Bank Tracking State

uint1_t MatrixUnit::SrcBBank = 0;     // which bank FPU is reading from
uint1_t Unpackers[1]::SrcBank = 0;   // which bank unpacker 1 is writing to
uint6_t Unpackers[1]::SrcRow[3];     // per-thread row cursor

3.4 Double-Buffering Protocol

Identical to SrcA, but using Unpackers[1] and MatrixUnit.SrcBBank:

// SETDVALID with FlipSrcB=1 (or UNPACR_NOP 0x7 for WhichUnpacker=1):
SrcB[Unpackers[1].SrcBank].AllowedClient = SrcClient::MatrixUnit;
Unpackers[1].SrcBank ^= 1;
Unpackers[1].SrcRow[CurrentThread] = ThreadConfig[CurrentThread].SRCB_SET_Base << 4;
// CLEARDVALID with FlipSrcB=1:
SrcB[MatrixUnit.SrcBBank].AllowedClient = SrcClient::Unpackers;
if (!KeepReadingSameSrc) MatrixUnit.SrcBBank ^= 1;

STALLWAIT conditions for SrcB:

| Condition | Meaning |
|---|---|
| C6 | SrcB[Unpackers[1].SrcBank].AllowedClient != SrcClient::Unpackers |
| C8 | SrcB[MatrixUnit.SrcBBank].AllowedClient != SrcClient::MatrixUnit |
| SRCB_VLD (0x100) | Used as STALLWAIT mask for waiting for SrcB to be given to the Matrix Unit |

LLK constant SRCB_ROW16_OFFSET = 0x10 (16 rows) is a frequently used SrcB row offset for separating two 16×16 faces within the same bank.

3.5 How UNPACR Fills SrcB

Unpacker 1 fills SrcB. Behavior is symmetric to SrcA / Unpacker 0, except:

  • No X/Y transposition available for SrcB (only SrcA supports this).
  • SrcB unpacker (WhichUnpacker = 1) does not support MultiContextMode with WhichContext >= 2 (UndefinedBehavior if attempted).
  • Row filling is sequential from Unpackers[1].SrcRow[CurrentThread].

3.6 TRNSPSRCB: Transpose Rows 16–31 In Place

TRNSPSRCB transposes the 16×16 matrix stored in SrcB rows 16–31 of the current Matrix Unit bank:

// Waits for SrcB[MatrixUnit.SrcBBank].AllowedClient == MatrixUnit:
uint6_t RowBase = 16;
for (unsigned i = 0; i < 16; ++i) {
  for (unsigned j = 0; j < i; ++j) {
    uint19_t ij = SrcB[MatrixUnit.SrcBBank][RowBase + i][j];
    uint19_t ji = SrcB[MatrixUnit.SrcBBank][RowBase + j][i];
    SrcB[MatrixUnit.SrcBBank][RowBase + i][j] = ji;
    SrcB[MatrixUnit.SrcBBank][RowBase + j][i] = ij;
  }
}

What changes: only rows 16–31 of the active SrcB bank are affected. Rows 0–15 are untouched. The operation swaps elements [i][j] and [j][i] for all j < i, producing a standard matrix transpose of the 16×16 block in place. Rows 0–15 are typically used to hold the SrcB matrix for the current MVMUL, while rows 16–31 hold a pre-transposed version for the next phase.

TRNSPSRCB waits at the Wait Gate until SrcB[MatrixUnit.SrcBBank].AllowedClient == MatrixUnit.

3.7 SrcA vs SrcB Differences

| Property | SrcA | SrcB |
|---|---|---|
| Unpacker | Unpacker 0 | Unpacker 1 |
| X/Y transpose during unpack | Yes | No |
| Role in MVMUL | Right-hand matrix (16×16) | Left-hand matrix (8×16 or 1×16 broadcast) |
| TRNSPSRCB | Not applicable | Rows 16–31 transposable |
| MultiContextMode context limit | 8 contexts (context 0–7) | 2 contexts (0 and 1 only) |
| Row cursor SETBASE config | SRCA_SET_Base | SRCB_SET_Base |

4. Data Movement Instructions: MOVD2A, MOVD2B, MOVA2D, MOVB2D, MOVB2A

These Matrix Unit (FPU) instructions move data between the three register files without involving L1 memory.

4.1 MOVD2A — Dest → SrcA

Copies 1 or 4 aligned rows from Dest into the active SrcA bank.

TT_MOVD2A(/* bool */ UseDst32bLo,
          /* u6  */ SrcRow,      // destination row in SrcA
          /* u2  */ AddrMod,
         (/* bool */ Move4Rows) << 1,
          /* u10 */ DstRow)      // source row in Dest
  • Move4Rows=0: copies 1 row; Move4Rows=1: copies 4 rows (DstRow aligned to 4, SrcRow aligned to 4).
  • SrcRow range: SrcRow + RWCs[CurrentThread].SrcA, masked to 6 bits.
  • DstRow range: DstRow + DEST_TARGET_REG_CFG_MATH_Offset + RWCs[CurrentThread].Dst + DEST_REGW_BASE_Base.
  • Writes to SrcA[MatrixUnit.SrcABank].
  • Does not automatically wait for SrcA[MatrixUnit.SrcABank].AllowedClient == MatrixUnit. Use STALLWAIT(B6, C7) before MOVD2A if needed.
  • After MOVD2A, the next cycle only accepts MOVD2A or MOVB2A from the Matrix Unit; other instructions are automatically stalled for 1 cycle.
  • Data format conversion: applies ShuffleBF16, ShuffleFP16, or ShuffleTF32 based on ALU_FORMAT_SPEC_REG0_SrcA and Fp32 mode.

4.2 MOVD2B — Dest → SrcB

Copies 1 or 4 aligned rows from Dest into the active SrcB bank.

TT_MOVD2B(/* bool */ UseDst32bLo,
          /* u6  */ SrcRow,      // destination row in SrcB
          /* u2  */ AddrMod,
         (/* bool */ Move4Rows) << 1,
          /* u10 */ DstRow)
  • Identical structure to MOVD2A but targets SrcB[MatrixUnit.SrcBBank].
  • Does not automatically wait for SrcB bank validity. Use STALLWAIT(B6, C8).
  • After MOVD2B, the next 3 cycles only accept another MOVD2B.
  • Note: MOVD2B uses ALU_FORMAT_SPEC_REG0_SrcA (not SrcB) to determine the conversion style — this is not a documentation error, it is hardware behavior.

4.3 MOVA2D — SrcA → Dest

Copies 1 or 8 aligned rows from the active SrcA bank into Dest.

TT_MOVA2D(/* bool */ UseDst32bLo,
          /* u6  */ SrcRow,
          /* u2  */ AddrMod,
         (/* bool */ Move8Rows) << 1,
          /* u10 */ DstRow)
  • Move8Rows=1: copies 8 rows (SrcRow aligned to 8, DstRow aligned to 8).
  • Waits at Wait Gate for SrcA[MatrixUnit.SrcABank].AllowedClient == MatrixUnit.
  • After MOVA2D, software should avoid reading the written Dest region for 3 cycles (hardware partially enforces this by stalling on follow-up MOVD2A, MOVD2B, ELWMUL, MVMUL, etc.).
  • Data format: reverse of MOVD2A; removes low mantissa (BF16/TF32) or high exponent bits (FP16) to produce Dest’s 16-bit format.

4.4 MOVB2D — SrcB → Dest

Copies 1, 4, or 8 rows from the active SrcB bank into Dest, with optional column-0 broadcast or row broadcast.

TT_MOVB2D(/* bool */ UseDst32bLo,
          /* u6  */ SrcRow,
          /* u2  */ AddrMod,
        ((/* bool */ Move4Rows)        << 2) +
        ((/* bool */ Broadcast1RowTo8) << 1) +
          /* bool */ BroadcastCol0,
          /* u10 */ DstRow)
  • Broadcast1RowTo8=1: one SrcB row is replicated into 8 consecutive Dest rows (DstRow aligned to 8).
  • BroadcastCol0=1: column 0 of each SrcB row is replicated across all 16 columns of the corresponding Dest row.
  • Waits at Wait Gate for SrcB[MatrixUnit.SrcBBank].AllowedClient == MatrixUnit.
  • After MOVB2D, avoid reading the written Dest region for 3 cycles.
  • Data format: same ALU_FORMAT_SPEC_REG0_SrcA-driven conversion as MOVD2B.
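
The two broadcast modes and the mode-field packing from the TT_MOVB2D call above can be sketched as below. The helper names are hypothetical, rows are held as plain 32-bit values, and data format conversion and address calculation are ignored.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Row = std::array<std::uint32_t, 16>;

// Mode field as composed in the TT_MOVB2D call above.
constexpr unsigned movb2d_mode(bool move4_rows, bool broadcast_1row_to_8,
                               bool broadcast_col0) {
    return (static_cast<unsigned>(move4_rows) << 2)
         + (static_cast<unsigned>(broadcast_1row_to_8) << 1)
         + static_cast<unsigned>(broadcast_col0);
}

// BroadcastCol0: column 0 replicated across all 16 columns of the row.
Row broadcast_col0(const Row& src) {
    Row out;
    out.fill(src[0]);
    return out;
}

// Broadcast1RowTo8: one source row replicated into 8 consecutive Dest rows.
std::array<Row, 8> broadcast_row_to_8(const Row& src) {
    std::array<Row, 8> out;
    out.fill(src);
    return out;
}
```

If both broadcasts are enabled, composing the two helpers fills an 8×16 block with a single scalar; that this matches the hardware's combined behavior is an assumption here.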

4.5 MOVB2A — SrcB → SrcA

Copies 1 or 4 aligned rows from the active SrcB bank into the active SrcA bank.

TT_MOVB2A(/* u6 */ SrcARow,
          /* u2 */ AddrMod,
         (/* bool */ Move4Rows) << 1,
          /* u6 */ SrcBRow)
  • Waits at Wait Gate for SrcB[MatrixUnit.SrcBBank].AllowedClient == MatrixUnit.
  • Does not automatically wait for SrcA bank validity.
  • After MOVB2A, the next cycle only accepts MOVD2A or MOVB2A.

5. Ownership Model Across Threads

The three Tensix threads have dedicated roles:

| Thread | RISC-V core | Primary role | Register files owned |
|---|---|---|---|
| T0 | TRISC0 | Unpack | Writes SrcA (via Unpacker 0), writes SrcB (via Unpacker 1) |
| T1 | TRISC1 | Math | Reads SrcA, reads SrcB, writes Dest |
| T2 | TRISC2 | Pack | Reads Dest |

5.1 SrcA/SrcB Ownership Flow

T0 thread:
  UNPACR (WhichUnpacker=0) → writes SrcA[Unpackers[0].SrcBank]
  UNPACR_NOP (0x7, WhichUnpacker=0) → flips SrcA bank to MatrixUnit ownership
  (Similarly for SrcB using WhichUnpacker=1)

T1 thread:
  STALLWAIT(STALL_MATH, SRCA_VLD) → wait until SrcA is owned by MatrixUnit
  STALLWAIT(STALL_MATH, SRCB_VLD) → wait until SrcB is owned by MatrixUnit
  MVMUL / ELWADD / etc. → consume SrcA and SrcB, write Dest
  MVMUL with FlipSrcA/FlipSrcB → return consumed bank to Unpackers, flip to other bank

5.2 Dest Ownership Flow (MATH_PACK Semaphore)

Dest ownership between T1 (math) and T2 (pack) is coordinated by the MATH_PACK semaphore (index 1). The semaphore acts as a token counter: one token = one half-Dest’s worth of math results ready to be packed.

SyncHalf initialization:
  SEMINIT(max=2, val=0, sem=MATH_PACK)
  dest_offset_id = 0

T1 (math) per tile:
  SEMWAIT(MATH_PACK, STALL_ON_MAX)   ← block if both halves are full
  ... compute into current half (dest_offset_id) ...
  SEMPOST(MATH_PACK)                 ← signal: one half ready to pack
  update_dest_offset_id()            ← flip dest_offset_id (0→1 or 1→0)
  SETC16(DEST_TARGET_REG_CFG_MATH_Offset, new_base)  ← point to new half

T2 (pack) per tile:
  SEMWAIT(MATH_PACK, STALL_ON_ZERO)  ← block if no half is ready
  PACR ...                            ← read and pack a half
  ZEROACC(CLR_HALF, dest_offset_id)  ← invalidate rows in packed half
  SEMGET(MATH_PACK)                   ← release token; signal math can write here again
  flip_packer_dest_offset_id()       ← advance to the other half

In SyncFull mode, SEMINIT(max=1, val=0) is used and the full Dest is treated as a single unit. Math waits for the semaphore to be 0 (i.e. not at max), fills Dest, and posts; pack then waits for a non-zero value, packs the whole Dest, and releases it with SEMGET.

5.3 Other Relevant Semaphores

UNPACK_OPERAND_SYNC (index 3): coordinates between T0 (unpack) and T1 (math) on operand tile lifecycle. T1 calls SEMGET(UNPACK_OPERAND_SYNC) after consuming a tile’s SrcA/SrcB banks (via _llk_math_release_tile_()), which releases the operand slot for T0 to refill.

UNPACK_TO_DEST (index 2): used when Unpacker 0 writes directly to Dest (bypassing SrcA) for certain operations. T0 posts to this semaphore when the unpack-to-dest write completes; T1 waits on it before reading from Dest.

MATH_DONE (index 7): used when unpacking directly to Dest; signals T1 to proceed with SFPU computation.

Full semaphore table (from ckernel_structs.h):

| Index | Name | Direction |
|---|---|---|
| 0 | FPU_SFPU | FPU ↔ SFPU sync |
| 1 | MATH_PACK | T1 math → T2 pack; Dest ownership |
| 2 | UNPACK_TO_DEST | T0 unpack → T1 math; unpack-to-dest completion |
| 3 | UNPACK_OPERAND_SYNC | T0 unpack ↔ T1 math; operand get/release |
| 4 | PACK_DONE | T2 pack iteration start/end; perf events |
| 5 | UNPACK_SYNC | TRISC ↔ unpack; HW kernel sync |
| 6 | UNPACK_MATH_DONE | Unpack or math iteration done; perf events |
| 7 | MATH_DONE | T1 math done; used with unpack-to-dest |

6. Emulator Implementation Notes

Dest

Model DstBits[1024][16] as a flat u16 array and DstRowValid[1024] as a bool array. Track dest_offset_id (0 or 1) as a per-tile-state variable shared between the math and pack thread contexts.

ZEROACC clears DstRowValid bits (not DstBits). Packers read zeros from invalid rows. Matrix Unit reads identity elements from invalid rows and then marks them valid.

The debug window at 0xFFBD8000 is a 32 KiB view directly into DstBits, with optional bit-layout swizzling on load/store controlled by per-thread RISC_DEST_ACCESS_CTRL_SEC.fmt and .no_swizzle.

SrcA / SrcB

Model each as two banks of u32 (or packed u19 for exact fidelity) arrays [64][16]. Track four AllowedClient bits (2 for SrcA, 2 for SrcB) and four bank index bits (MatrixUnit.SrcABank, MatrixUnit.SrcBBank, Unpackers[0].SrcBank, Unpackers[1].SrcBank).

For bank-ownership checks: STALLWAIT conditions C5–C8 map directly to the four AllowedClient states. Instructions that wait at Wait Gate (MOVA2D, MOVB2D, MOVB2A, MVMUL, etc.) spin until the relevant bank is owned by MatrixUnit.

ZEROSRC (TT_ZEROSRC opcode 0x11) zeros one or both SrcA/SrcB banks by writing the zero_val pattern (typically 0) to all cells in the specified bank. src_mask bits: bit 0 = SrcA, bit 1 = SrcB.

TRNSPSRCB operates on SrcB[MatrixUnit.SrcBBank].Rows[16..31] in place.


Source References

| File | Purpose |
|---|---|
| tt-isa-documentation/BlackholeA0/TensixTile/TensixCoprocessor/Dst.md | Authoritative Dest spec: storage, types, bit layouts, Adj16/Adj32, RISCV debug window |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/SrcASrcB.md | Authoritative SrcA/SrcB spec (Wormhole; Blackhole is identical for these register files) |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/ZEROACC.md | ZEROACC functional model and mode constants |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/MOVD2A.md | MOVD2A functional model and shuffle functions |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/MOVD2B.md | MOVD2B functional model |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/MOVA2D.md | MOVA2D functional model |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/MOVB2D.md | MOVB2D functional model including broadcast modes |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/MOVB2A.md | MOVB2A functional model |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/MVMUL.md | MVMUL including FlipSrcA/FlipSrcB bank-flip behavior |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/SETDVALID.md | SETDVALID: give SrcA/SrcB to Matrix Unit |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/CLEARDVALID.md | CLEARDVALID: give SrcA/SrcB to Unpackers |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/TRNSPSRCB.md | TRNSPSRCB: SrcB rows 16–31 transpose |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/UNPACR_NOP_SETDVALID.md | UNPACR_NOP SETDVALID variant |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/Unpackers/README.md | Unpacker pipeline overview |
| tt-isa-documentation/BlackholeA0/TensixTile/TensixCoprocessor/STALLWAIT.md | STALLWAIT condition codes C5–C8 for SrcA/SrcB ownership |
| tt-isa-documentation/BlackholeA0/TensixTile/TensixCoprocessor/BackendConfiguration.md | Config register space; RISC_DEST_ACCESS_CTRL_SEC |
| tt-llk/tt_llk_blackhole/common/inc/ckernel_structs.h | Semaphore index constants; MATH_HALF_DEST_SIZE |
| tt-llk/tt_llk_blackhole/common/inc/ckernel_defs.h | FACE_HEIGHT/WIDTH, TILE_HEIGHT/WIDTH, DEST_NUM_TILES_FP16 |
| tt-llk/tt_llk_blackhole/common/inc/cmath_common.h | wait_math_semaphores, set_math_semaphores, dest_section_flip |
| tt-llk/tt_llk_blackhole/common/inc/cpack_common.h | flip_packer_dest_offset_id, select_packer_dest_registers |
| tt-llk/tt_llk_blackhole/common/inc/ckernel_instr_params.h | SRCA_VLD, SRCB_VLD, SRCB_ROW16_OFFSET constants |
| tt-llk/tt_llk_blackhole/llk_lib/llk_math_common.h | llk_math_pack_sync_init, llk_math_dest_section_done |
| tt-llk/tt_llk_blackhole/llk_lib/llk_pack_common.h | llk_packer_wait_for_math_done, llk_pack_dest_section_done |
| tt-llk/tt_llk_blackhole/llk_lib/llk_defs.h | DstSync enum (SyncHalf, SyncFull) |
| blackhole-py/tt-metal-deps/include/.../tensix_types.h | DEST_REGISTER_FULL_SIZE, DEST_REGISTER_HALF_SIZE, DEST_FACE_WIDTH/HEIGHT |
| blackhole-py/dsl.py | ZEROACC, MOVD2A, MOVD2B, MOVB2D, CLEARDVALID, ZEROSRC field encodings |
| tt-metal/tt_metal/hw/inc/api/debug/dprint_tensix.h | Debug window usage at 0xFFBD8000 |