anuraagw.me


Tensix GPRs, Scalar Unit, and Configuration Unit Instructions

The Tensix coprocessor has a 192-register GPR (General Purpose Register) file that serves as a staging area for values destined for backend configuration registers. The RISC-V cores load constants into GPRs, then Tensix instructions copy those values into the config registers that control hardware units (unpackers, packers, matrix unit, etc.).

SETDMAREG → GPR → (sync) → WRCFG/RDCFG/RMWCIB/SETC16 → Config/ThreadConfig → controls hardware

Two execution units are involved:

  • Scalar Unit (ThCon): Operates on GPRs. Runs SETDMAREG, ADDDMAREG, MULDMAREG, DMANOP, etc.
  • Configuration Unit: Reads/writes config registers. Runs WRCFG, RDCFG, SETC16, RMWCIB, etc.

GPR File Layout

The GPR file is mapped at 0xFFE00000 in the RISC-V address space:

0xFFE00000 .. 0xFFE000FF   Thread 0 (T0) — 64 x 32-bit registers (256 bytes)
0xFFE00100 .. 0xFFE001FF   Thread 1 (T1) — 64 x 32-bit registers
0xFFE00200 .. 0xFFE002FF   Thread 2 (T2) — 64 x 32-bit registers

The ISA models them as:

uint32_t GPRs[3][64];   // 192 total, 64 per thread

The address space reservation extends to 0xFFE3FFFF (256 KiB), but only 768 bytes are populated.
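The layout above implies a simple address formula: a 0x100-byte stride per thread and 4 bytes per register. A minimal sketch of that arithmetic follows, from the point of view of a core that can see all three threads. The helper name gpr_mmio_addr is hypothetical and is not an LLK function.

```c
#include <assert.h>
#include <stdint.h>

#define GPR_BASE        0xFFE00000u  /* base address from the layout above */
#define GPRS_PER_THREAD 64u

/* Hypothetical helper: byte address of GPRs[thread][index]. */
uint32_t gpr_mmio_addr(uint32_t thread, uint32_t index)
{
    assert(thread < 3 && index < GPRS_PER_THREAD);
    return GPR_BASE + thread * (GPRS_PER_THREAD * 4u) + index * 4u;
}
```

For example, GPR 16 of thread 0 lands at offset 64 from the base, and the last populated word is GPR 63 of thread 2.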

Access Rules

  • Each coprocessor thread (T0/T1/T2) can only access its own 64 GPRs via Tensix instructions.
  • BRISC has full MMIO read/write access to all three threads’ GPRs.
  • NCRISC has no access.
  • Each RISC-V Ti also has its own GPRs mapped at REGFILE_BASE (0xFFE00000), but can only see GPRs[i].

Sub-Word Addressing

SETDMAREG addresses GPRs in 16-bit half-register units. The index space is 0–127 (7 bits):

  • Index 2*n = low 16 bits of GPR n
  • Index 2*n+1 = high 16 bits of GPR n
#define LO_16(REG) (2 * (REG))       // low half of GPR
#define HI_16(REG) (2 * (REG) + 1)   // high half of GPR
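
To make the half-register indexing concrete, here is a small host-side model (not hardware access) that writes one 16-bit half at a time and builds a 32-bit constant from two writes. The type gpr_file_t and the helpers write_half and load_const32 are hypothetical names for this sketch. It uses shifts rather than pointer casts so it does not depend on host endianness.

```c
#include <assert.h>
#include <stdint.h>

#define LO_16(REG) (2 * (REG))       /* low half of GPR  */
#define HI_16(REG) (2 * (REG) + 1)   /* high half of GPR */

/* Software model of one thread's 64 GPRs. */
typedef struct { uint32_t gpr[64]; } gpr_file_t;

/* Write one 16-bit half; half_index is 0..127 as in SETDMAREG. */
void write_half(gpr_file_t *f, unsigned half_index, uint16_t value)
{
    assert(half_index < 128);
    uint32_t *reg = &f->gpr[half_index >> 1];
    unsigned shift = (half_index & 1u) ? 16u : 0u;
    /* Clear the selected half, keep the other half unchanged. */
    *reg = (*reg & ~(0xFFFFu << shift)) | ((uint32_t)value << shift);
}

/* Load a full 32-bit constant as two half-register writes. */
void load_const32(gpr_file_t *f, unsigned reg, uint32_t value)
{
    write_half(f, LO_16(reg), (uint16_t)(value & 0xFFFFu));
    write_half(f, HI_16(reg), (uint16_t)(value >> 16));
}
```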

Named GPR Conventions

From tt-llk/.../ckernel_gpr_map.h:

Common (all threads):

GPR   Name           Purpose
0     ZERO           Always 0
1     DBG_RESERVED   Reserved
2     DBG_MSG        Firmware debug message
3     DBG_CKID       Ckernel ID

T0 (unpack thread): GPRs 4–59 hold operand base/offset addresses, tile sizes, face dimensions, stride save/restore values.

T1 (math thread): GPRs 4–61 hold dest register offsets for SFPU, perf counters.

T2 (pack thread): GPRs 4–63 hold output L1 address, tile headers, stride configs, edge offsets.

Instructions

All of the instructions in this section (SETDMAREG, ADDDMAREG, MULDMAREG, DMANOP, LOADIND, STOREIND) execute on the Scalar Unit (ThCon). The Scalar Unit is fully serialized: at most one instruction at a time, no internal pipelining, and it blocks all threads’ Wait Gates while executing.

SETDMAREG — Set 16 bits of one GPR (opcode 0x45)

The workhorse instruction. Writes a 16-bit immediate to one half of a GPR, leaving the other half unchanged. Loading a full 32-bit constant requires two SETDMAREG instructions.

Encoding:

[31:24] = 0x45  (opcode)
[23:22] = Payload_SigSelSize  (2 bits; ResultSize in signal mode, top 2 bits of the 16-bit immediate in load mode)
[21:8]  = Payload_SigSel      (14 bits; low 14 bits of the 16-bit immediate in load mode)
[7]     = SetSignalsMode       (0 = immediate load, 1 = signal/packer read)
[6:0]   = RegIndex16b          (7 bits — half-register index 0–127)
#define TT_OP_SETDMAREG(Payload_SigSelSize, Payload_SigSel, SetSignalsMode, RegIndex16b) \
    TT_OP(0x45, (((Payload_SigSelSize) << 22) + ((Payload_SigSel) << 8) \
               + ((SetSignalsMode) << 7) + ((RegIndex16b) << 0)))

Functional model (immediate mode, SetSignalsMode=0):

uint16_t *HalfReg = (uint16_t*)((char*)&GPRs[CurrentThread][0] + ResultHalfReg * 2);
*HalfReg = NewValue;
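
A minimal sketch of packing and unpacking an immediate-mode SETDMAREG word is shown below. It assumes, as the unmasked TT_OP_SETDMAREG macro implies, that a 16-bit immediate passed as Payload_SigSel occupies bits [23:8], so its top two bits spill into the Payload_SigSelSize field. The names encode_setdmareg_imm, decode_setdmareg, and setdmareg_fields_t are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoder for immediate mode (SetSignalsMode = 0). */
uint32_t encode_setdmareg_imm(uint16_t value, unsigned half_index)
{
    assert(half_index < 128);
    return (0x45u << 24) | ((uint32_t)value << 8) | (0u << 7) | half_index;
}

typedef struct {
    unsigned opcode;      /* bits [31:24] */
    unsigned imm16;       /* bits [23:8]  */
    unsigned signal_mode; /* bit  [7]     */
    unsigned half_index;  /* bits [6:0]   */
} setdmareg_fields_t;

/* Hypothetical decoder: split an instruction word back into fields. */
setdmareg_fields_t decode_setdmareg(uint32_t word)
{
    setdmareg_fields_t f;
    f.opcode      = word >> 24;
    f.imm16       = (word >> 8) & 0xFFFFu;
    f.signal_mode = (word >> 7) & 1u;
    f.half_index  = word & 0x7Fu;
    return f;
}
```

Decoding is handy when reading disassemblies where the instruction word is assembled by hand: half_index >> 1 gives the GPR number and half_index & 1 says whether the high half is targeted.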

Signal mode (SetSignalsMode=1): Reads 128 bits of packer configuration or state, then writes 16–128 bits to GPRs. The field layout changes meaning:

TT_SETDMAREG(/* u2 */ ResultSize,
           ((/* u4 */ WhichPackers) << 7) +
           ((/* u4 */ InputSource ) << 3) +
             /* u3 */ InputHalfReg,
             1,  // SetSignalsMode = 1
             /* u7 */ ResultHalfReg)

InputSource selects what 128-bit value to read:

InputSource   Value read (128 bits across Values[0..3])
0             Per-packer {AccTileSize[hi16], LastTileSize[lo16]} for packers 0–3
1             Per-packer AllZeroFlags for packers 0–3
2–5           Full tile header for packer InputSource-2: TileSize, DataFormat, DisableZeroCompression, AllZeroFlags
6–7           16-byte slice of exponent histogram for packer WhichPackers (bytes [0..15] or [16..31])
8             Bit 0 of each packer’s AllZeroFlags packed into Values[0] bits [3:0]; optionally resets AccTileSize (masked by WhichPackers)
9             Packers[0].ExponentHistogramMaxExponent

ResultSize selects how many bits to write to GPRs:

ResultSize   Effect
0            16-bit: HalfRegs[ResultHalfReg] = InputHalves[InputHalfReg]
1            32-bit: GPRs[ResultHalfReg >> 1] = Values[InputHalfReg >> 1]
2            128-bit: writes all 4 Values to 4 consecutive aligned GPRs
3            128-bit tile header: writes only tile header fields, preserving reserved bits

Performance: 1 cycle (both modes).

ADDDMAREG — 32-bit GPR addition (opcode 0x58)

TT_ADDDMAREG(0, ResultReg, RightReg, LeftReg)   // reg + reg
TT_ADDDMAREG(1, ResultReg, RightImm6, LeftReg)  // reg + 6-bit unsigned immediate

Encoding:

[31:24] = 0x58
[23]    = OpBisConst  (0 = reg-reg, 1 = reg-immediate)
[17:12] = ResultRegIndex  (6 bits)
[11:6]  = OpBRegIndex     (6 bits — GPR index or 6-bit constant)
[5:0]   = OpARegIndex     (6 bits — GPR index)

Functional model:

uint32_t LeftVal  = GPRs[CurrentThread][LeftReg];
uint32_t RightVal = OpBisConst ? RightImm6 : GPRs[CurrentThread][RightReg];
GPRs[CurrentThread][ResultReg] = LeftVal + RightVal;  // 32-bit, wraps on overflow

Performance: 3 cycles (immediate, or same aligned group of 4), 4 cycles otherwise.

MULDMAREG — 16-bit unsigned multiply → 32-bit result (opcode 0x5A)

Same encoding layout as ADDDMAREG. Key distinction: inputs are truncated to 16 bits, but the product is a full 32-bit result.

GPRs[CurrentThread][ResultReg] = (LeftVal & 0xFFFF) * (RightVal & 0xFFFF);

Performance: Same as ADDDMAREG.
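
The difference between the two arithmetic instructions is easy to check on a host. The sketch below models both results and packs the shared field layout. The function names are hypothetical, and the encoder only mirrors the bit positions listed under ADDDMAREG.

```c
#include <assert.h>
#include <stdint.h>

/* ADDDMAREG model: 32-bit add, wraps modulo 2^32. */
uint32_t model_adddmareg(uint32_t left, uint32_t right)
{
    return left + right;
}

/* MULDMAREG model: both inputs truncated to 16 bits, full 32-bit product. */
uint32_t model_muldmareg(uint32_t left, uint32_t right)
{
    return (left & 0xFFFFu) * (right & 0xFFFFu);
}

/* Hypothetical encoder for the shared layout (opcode 0x58 or 0x5A). */
uint32_t encode_dmareg_alu(unsigned opcode, unsigned b_is_const,
                           unsigned result, unsigned op_b, unsigned op_a)
{
    assert(opcode < 256 && b_is_const < 2);
    assert(result < 64 && op_b < 64 && op_a < 64);
    return ((uint32_t)opcode << 24) | ((uint32_t)b_is_const << 23) |
           ((uint32_t)result << 12) | ((uint32_t)op_b << 6) | op_a;
}
```

Note that the multiply silently drops the upper 16 bits of each operand, so a value such as 0x12345 is treated as 0x2345.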

DMANOP — Scalar Unit NOP (opcode 0x60)

#define TT_OP_DMANOP TT_OP(0x60, 0)

Does nothing except occupy the Scalar Unit for 1 cycle. Used as a pipeline bubble between SETDMAREG and WRCFG when the Scalar Unit is provably already idle (replacing STALLWAIT in carefully scheduled code paths).

LOADIND — Indirect GPR Load from L1 (opcode 0x49)

Reads 8, 16, 32, or 128 bits from tile-local L1 memory into one or more GPRs, using an indirect address computed from two GPR values. The address register holds a 16-byte-aligned base, and a separate offset half-register provides the byte offset.

Encoding:

[31:24] = 0x49  (opcode)
[23:22] = SizeSel        (2 bits — 0=16B/4 GPRs, 1=32-bit, 2=16-bit, 3=8-bit)
[21:14] = OffsetIndex    (8 bits — half-register index for byte offset)
[13:12] = AutoIncSpec    (2 bits — 0=none, 1=+2B, 2=+4B, 3=+16B)
[11:6]  = DataRegIndex   (6 bits — destination GPR index)
[5:0]   = AddrRegIndex   (6 bits — base address GPR index)
#define TT_OP_LOADIND(SizeSel, OffsetIndex, AutoIncSpec, DataRegIndex, AddrRegIndex) \
    TT_OP(0x49, (((SizeSel) << 22) + ((OffsetIndex) << 14) + ((AutoIncSpec) << 12) \
               + ((DataRegIndex) << 6) + ((AddrRegIndex) << 0)))

Address computation:

uint32_t L1Address = GPRs[CurrentThread][AddrRegIndex] * 16 + *OffsetHalfReg;

The AddrRegIndex GPR holds a 16-byte word address (multiply by 16 to get byte address). The OffsetIndex selects a 16-bit half-register (using the same HalfReg[index] scheme as SETDMAREG) that provides a byte offset added to the base.

Functional model:

uint32_t* GPR = &GPRs[CurrentThread][DataRegIndex & (SizeSel ? 0x3F : 0x3C)];
uint16_t* Offset = (uint16_t*)((char*)&GPRs[CurrentThread][0] + OffsetIndex * 2);
uint32_t L1Addr = GPRs[CurrentThread][AddrRegIndex] * 16 + *Offset;

// Auto-increment offset register
switch (AutoIncSpec) {
    case 0: break;              // no increment
    case 1: *Offset += 2;  break;  // +2 bytes
    case 2: *Offset += 4;  break;  // +4 bytes
    case 3: *Offset += 16; break;  // +16 bytes
}

// Deferred: data arrives asynchronously after the Scalar Unit releases
switch (SizeSel) {
    case 0: memcpy(GPR, (void*)(L1Addr & ~15), 16); break;  // 16B → 4 aligned GPRs
    case 1: *GPR = *(uint32_t*)(L1Addr & ~3);       break;  // 32-bit word
    case 2: *(uint16_t*)GPR = *(uint16_t*)(L1Addr & ~1); break;  // 16-bit, low half only
    case 3: *(uint8_t*)GPR  = *(uint8_t*)L1Addr;    break;  // 8-bit, low byte only
}

For SizeSel=0 (16B), the destination register index is masked to a 4-aligned boundary (& 0x3C), and four consecutive GPRs are written.
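
The address and alignment rules from the functional model can be isolated into two small pure functions, which makes them easy to test on a host. This is only a sketch of the arithmetic; loadind_l1_addr and loadind_dest_reg are hypothetical names and no L1 access is performed.

```c
#include <assert.h>
#include <stdint.h>

/* Effective L1 byte address actually read, after size-dependent alignment.
   base_words16 is the value of the AddrRegIndex GPR (16-byte units). */
uint32_t loadind_l1_addr(uint32_t base_words16, uint16_t offset, unsigned size_sel)
{
    assert(size_sel < 4);
    uint32_t addr = base_words16 * 16u + offset;
    switch (size_sel) {
        case 0:  return addr & ~15u;  /* 16 B, aligned to 16 */
        case 1:  return addr & ~3u;   /* 32-bit, aligned to 4 */
        case 2:  return addr & ~1u;   /* 16-bit, aligned to 2 */
        default: return addr;         /* 8-bit, no alignment */
    }
}

/* First destination GPR: 4-aligned for 16 B loads, unchanged otherwise. */
unsigned loadind_dest_reg(unsigned data_reg, unsigned size_sel)
{
    assert(data_reg < 64 && size_sel < 4);
    return data_reg & (size_sel ? 0x3Fu : 0x3Cu);
}
```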

Synchronization: The Scalar Unit dispatches the read request and releases after >= 3 cycles, but the GPR data arrives asynchronously. Software must issue STALLWAIT(STALL_CFG, THCON) (block=B7, wait=C0) before any instruction that consumes the loaded GPR value.

Performance: >= 3 cycles occupying the Scalar Unit.

Parameter constants (from ckernel_instr_params.h):

struct p_ind {
    static constexpr uint32_t HIER_L1   = 0x1;  // MemHierSel for STOREIND
    static constexpr uint32_t INC_NONE  = 0x0;
    static constexpr uint32_t INC_2B    = 0x1;
    static constexpr uint32_t INC_4B    = 0x2;
    static constexpr uint32_t INC_16B   = 0x3;
    static constexpr uint32_t LD_16B    = 0;
    static constexpr uint32_t LD_32bit  = 1;
    static constexpr uint32_t LD_16bit  = 2;
    static constexpr uint32_t LD_8bit   = 3;
};

STOREIND — Indirect Store from GPR (opcode 0x66)

The counterpart to LOADIND. A polymorphic instruction with three modes selected by bit [23] (MemHierSel) and bit [22] (SizeSel):

MemHierSel   SizeSel   Mode
1            x         L1 mode: write GPR data to L1 memory
0            1         MMIO mode: write 32-bit GPR to MMIO register
0            0         Src mode: write 2 GPRs (4×BF16) to SrcA or SrcB register file

Encoding:

[31:24] = 0x66  (opcode)
[23]    = MemHierSel     (1 = L1 write, 0 = regfile/MMIO write)
[22]    = SizeSel        (mode-dependent — see above)
[21]    = RegSizeSel     (mode-dependent — data width or Src select)
[20:14] = OffsetIndex    (7 bits — half-register index for byte offset)
[13:12] = AutoIncSpec    (2 bits — auto-increment: 0=none, 1=+2B, 2=+4B, 3=+16B)
[11:6]  = DataRegIndex   (6 bits — source GPR index)
[5:0]   = AddrRegIndex   (6 bits — base address GPR index)
#define TT_OP_STOREIND(MemHierSel, SizeSel, RegSizeSel, OffsetIndex, AutoIncSpec, DataRegIndex, AddrRegIndex) \
    TT_OP(0x66, (((MemHierSel) << 23) + ((SizeSel) << 22) + ((RegSizeSel) << 21) \
               + ((OffsetIndex) << 14) + ((AutoIncSpec) << 12) \
               + ((DataRegIndex) << 6) + ((AddrRegIndex) << 0)))

L1 Mode (MemHierSel=1): GPR → L1

Address computation identical to LOADIND. SizeSel and RegSizeSel together encode the transfer size (same 0/1/2/3 scheme: 16B, 32-bit, 16-bit, 8-bit).

// Functional model (L1 mode); Size is the 2-bit transfer size encoded by SizeSel and RegSizeSel:
uint32_t* GPR = &GPRs[CurrentThread][DataRegIndex & (Size ? 0x3F : 0x3C)];
uint16_t* Offset = (uint16_t*)((char*)&GPRs[CurrentThread][0] + OffsetIndex * 2);
uint32_t L1Addr = GPRs[CurrentThread][AddrRegIndex] * 16 + *Offset;

// Auto-increment
switch (AutoIncSpec) { /* same as LOADIND */ }

// Write to L1
switch (Size) {
    case 0: memcpy((void*)(L1Addr & ~15), GPR, 16); break;  // 16B from 4 GPRs
    case 1: *(uint32_t*)(L1Addr & ~3)  = *GPR;      break;  // 32-bit
    case 2: *(uint16_t*)(L1Addr & ~1)  = (uint16_t)*GPR; break;  // 16-bit
    case 3: *(uint8_t*)L1Addr          = (uint8_t)*GPR;  break;  // 8-bit
}

MMIO Mode (MemHierSel=0, SizeSel=1): GPR → MMIO Register

Always a 32-bit write to the 0xFFB_____ address range (Tensix MMIO window, lower bound 0xFFB11000).

uint16_t* Offset = (uint16_t*)((char*)&GPRs[CurrentThread][0] + OffsetIndex * 2);
uint32_t Addr = GPRs[CurrentThread][AddrRegIndex] + (*Offset >> 4);
Addr = 0xFFB00000 + (Addr & 0x000FFFFC);
*(uint32_t*)Addr = GPRs[CurrentThread][DataRegIndex];
// then apply AutoIncSpec to Offset
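
The MMIO address arithmetic above is worth isolating, since the offset is scaled down by 16 and the result is forced into the 0xFFB window with 4-byte alignment. A host-side sketch follows, with the hypothetical name storeind_mmio_addr. It computes the address only and performs no store.

```c
#include <assert.h>
#include <stdint.h>

/* Target address of a STOREIND in MMIO mode.
   base is the AddrRegIndex GPR value, offset the selected half-register. */
uint32_t storeind_mmio_addr(uint32_t base, uint16_t offset)
{
    uint32_t addr = base + ((uint32_t)offset >> 4);
    uint32_t result = 0xFFB00000u + (addr & 0x000FFFFCu);
    /* Sanity check: the result always stays inside the 0xFFB_____ range. */
    assert((result & 0xFFF00000u) == 0xFFB00000u);
    return result;
}
```

Only bits [19:2] of the sum survive, so any upper bits in the base GPR are ignored.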

Src Mode (MemHierSel=0, SizeSel=0): 2×GPR → SrcA/SrcB

Bit [21] (RegSizeSel) selects SrcA (0) or SrcB (1). Writes 4×BF16 values extracted from two consecutive GPRs into the FPU source register file. Waits on bank ownership semaphore. Rarely used — for software-feeding matrix input data.

Performance: >= 3 cycles (all modes).

Common usage (L1 mode — writing tile header from packer):

// Write 16B tile header from GPRs to output L1 address
TTI_STOREIND(1, 0, p_ind::LD_16B, LO_16(0), p_ind::INC_NONE,
             p_gpr_pack::TILE_HEADER, p_gpr_pack::OUTPUT_ADDR);

Backend Configuration Model

Before describing the Config Unit instructions, here’s the data they operate on. Two distinct configuration spaces exist:

uint32_t Config[2][CFG_STATE_SIZE * 4];                          // thread-agnostic, two banks
struct {uint16_t Value, Padding[7];} ThreadConfig[3][THD_STATE_SIZE]; // per-thread, one bank each

Both are mapped contiguously at TENSIX_CFG_BASE (0xFFEF0000, 64 KiB).

Config — Thread-agnostic, double-buffered. The active bank is selected by ThreadConfig[CurrentThread].CFG_STATE_ID_StateID. Holds unpack tile descriptors, pack config, ALU formats, ADDR_MOD slots, packer output addresses, stride configs, etc. Writes to indices >= GLOBAL_CFGREG_BASE_ADDR32 write to both banks simultaneously.

ThreadConfig — Per-thread, single-banked. Holds thread-specific fields like CFG_STATE_ID (bank selector), dest register offsets, unpack context config, clock gater control. Only writable by SETC16; RISC-V sw cannot write ThreadConfig.

The cfg_defines.h file defines Name_ADDR32, Name_MASK, Name_SHAMT constants for both spaces. The // Registers for THREAD section indexes ThreadConfig; all other sections index Config.

               Config (thread-agnostic)           ThreadConfig (per-thread)
Tensix write   WRCFG, RMWCIB, REG2FLOP            SETC16 only
Tensix read    RDCFG, various implicit            Various implicit only
RISC-V write   sw only (with ordering hazards)    Cannot write directly
RISC-V read    lw/lh/lb                           lw/lh/lb

Special side effects on write:

  • Writing anything to Config[i][STATE_RESET_EN_ADDR32] (except via RMWCIB) zeros all non-global config in that bank.
  • Writing to Config.PRNG_SEED_Seed_Val_ADDR32 reseeds all PRNGs.
  • Writing to ThreadConfig[i][CG_CTRL_EN_*] or [CG_CTRL_KICK_*] immediately affects clock gaters.

Configuration Unit Instructions

The Configuration Unit handles all config register reads and writes. It accesses the same GPRs as the Scalar Unit. Key throughput rules:

  • SETC16 has its own IPC group: up to 3 per cycle (one per thread), independent of everything else.
  • All other instructions (WRCFG, RDCFG, RMWCIB, RISCV requests, Mover requests) share a single Config IPC group with sustained throughput of at most 1 per cycle. Excessive WRCFG from one thread can starve RDCFG/RMWCIB from other threads and delay RISC-V config accesses.

WRCFG — Write GPR to Config (opcode 0xB0)

Copies 32 or 128 bits from a GPR to Config.

Encoding:

[31:24] = 0xB0
[21:16] = GprAddress  (6 bits — which GPR provides the data)
[15]    = wr128b      (0 = 32-bit write, 1 = 128-bit write)
[10:0]  = CfgReg      (11 bits — config register index, matches Name_ADDR32)
#define TT_OP_WRCFG(GprAddress, wr128b, CfgReg) \
    TT_OP(0xb0, (((GprAddress) << 16) + ((wr128b) << 15) + ((CfgReg) << 0)))

Functional model:

uint1_t StateID = ThreadConfig[CurrentThread].CFG_STATE_ID_StateID;

if (Is128Bit) {
    // InputReg aligned to 4-GPR boundary; writes 4 consecutive config words
    memcpy(&Config[StateID][CfgIndex & ~3], &GPRs[CurrentThread][InputReg & ~3], 16);
} else {
    Config[StateID][CfgIndex] = GPRs[CurrentThread][InputReg];
}

Performance: 2 cycles, fully pipelined (one per cycle). The issuing thread is not blocked during the 2nd cycle. The instruction immediately after WRCFG must not consume the config just written — insert a NOP.
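
A host-side sketch of the WRCFG encoding and functional model is given below. The bank size MODEL_CFG_WORDS is a made-up value for this example (the real CFG_STATE_SIZE is not stated here), the type and function names are hypothetical, and the mirroring of global registers into both banks is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_CFG_WORDS 64u  /* hypothetical size, for this sketch only */

typedef struct {
    uint32_t gpr[64];                    /* current thread's GPRs */
    uint32_t config[2][MODEL_CFG_WORDS]; /* two Config banks */
    unsigned state_id;                   /* active bank, 0 or 1 */
} tensix_model_t;

/* Pack the WRCFG fields listed above into an instruction word. */
uint32_t encode_wrcfg(unsigned gpr, unsigned wr128b, unsigned cfg_reg)
{
    assert(gpr < 64 && wr128b < 2 && cfg_reg < 2048);
    return (0xB0u << 24) | ((uint32_t)gpr << 16) |
           ((uint32_t)wr128b << 15) | cfg_reg;
}

/* Apply WRCFG to the model: 32-bit copy, or 128-bit aligned copy. */
void model_wrcfg(tensix_model_t *m, unsigned gpr, unsigned wr128b, unsigned cfg_reg)
{
    assert(gpr < 64 && cfg_reg < MODEL_CFG_WORDS && m->state_id < 2);
    if (wr128b) {
        /* Both indices are rounded down to a multiple of 4. */
        memcpy(&m->config[m->state_id][cfg_reg & ~3u], &m->gpr[gpr & ~3u], 16);
    } else {
        m->config[m->state_id][cfg_reg] = m->gpr[gpr];
    }
}
```

The model makes the 128-bit alignment visible: asking for GPR 30 and config index 13 actually copies GPRs 28..31 into config words 12..15 of the active bank.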

RDCFG — Read Config to GPR (opcode 0xB1)

Reads 32 bits from Config into a GPR. Cannot read ThreadConfig.

Encoding:

[31:24] = 0xB1
[23:16] = GprAddress  (8 bits — but only low 6 used for GPR index)
[15:0]  = CfgReg      (16 bits — but only low 11 used for config index)
#define TT_OP_RDCFG(GprAddress, CfgReg) \
    TT_OP(0xb1, (((GprAddress) << 16) + ((CfgReg) << 0)))

Functional model:

uint1_t StateID = ThreadConfig[CurrentThread].CFG_STATE_ID_StateID;
GPRs[CurrentThread][ResultReg] = Config[StateID][CfgIndex];

Performance: At least 2 cycles. The GPR result is not available immediately — the instruction(s) after RDCFG must not read the destination GPR. Use STALLWAIT(STALL_CFG, CFGEXU) after issuing RDCFG(s) to ensure the result has landed before consuming it.

Wormhole B0 hardware bug: If multiple threads issue RDCFG on the same cycle, all but one are silently dropped. Software must ensure single-thread RDCFG. Blackhole does not have this bug — RDCFG is fully pipelined there (one per cycle, non-blocking, but still needs a stall before consuming the GPR result).

SETC16 — Write 16-bit immediate to ThreadConfig (opcode 0xB2)

The only instruction that can write ThreadConfig. Writes a 16-bit immediate to one entry in the current thread’s ThreadConfig bank.

Encoding:

[31:24] = 0xB2
[23:16] = CfgIndex    (8 bits — indexes ThreadConfig, matches "Registers for THREAD" Name_ADDR32)
[15:0]  = NewValue    (16 bits — immediate value)
#define TT_OP_SETC16(setc16_reg, setc16_value) \
    TT_OP(0xb2, (((setc16_reg) << 16) + ((setc16_value) << 0)))

Functional model:

ThreadConfig[CurrentThread][CfgIndex].Value = NewValue;

It always writes to the current thread’s bank — there is no cross-thread ThreadConfig access. The 16-bit value replaces the entire Value field of the ThreadConfig entry (each entry is 16 bits of value + padding).

Performance: 1 cycle. Up to 3 SETC16 instructions can execute per cycle (one from each thread) because it has its own IPC group, independent of the Config pipeline.

Common uses:

// Switch active config bank (double-buffering)
TT_SETC16(CFG_STATE_ID_StateID_ADDR32, new_state_id);

// Set math dest offset
TT_SETC16(DEST_TARGET_REG_CFG_MATH_Offset_ADDR32, dst_index);

// Set unpack config context
TT_SETC16(UNPACK_MISC_CFG_CfgContextOffset_0_ADDR32, 0x0104);
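
For reference, a minimal host-side sketch of the SETC16 encoding and its effect on one thread's bank. The entry type mirrors the ThreadConfig struct shown earlier. The names thread_cfg_entry_t, encode_setc16, and model_setc16 are hypothetical, and the index 3 used in any example call is arbitrary rather than a real Name_ADDR32 value.

```c
#include <assert.h>
#include <stdint.h>

/* One ThreadConfig entry: 16 bits of value plus padding (16 bytes total). */
typedef struct { uint16_t Value; uint16_t Padding[7]; } thread_cfg_entry_t;

/* Pack the SETC16 fields listed above into an instruction word. */
uint32_t encode_setc16(unsigned cfg_index, uint16_t value)
{
    assert(cfg_index < 256);
    return (0xB2u << 24) | ((uint32_t)cfg_index << 16) | value;
}

/* Apply SETC16 to the current thread's bank: the whole Value is replaced. */
void model_setc16(thread_cfg_entry_t *thread_bank, unsigned cfg_index, uint16_t value)
{
    assert(cfg_index < 256);
    thread_bank[cfg_index].Value = value;
}
```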

Blackhole scheduling restriction: After reset, SETC16(CFG_STATE_ID_StateID_ADDR32, x) must be executed once before any other config-bank-dependent instruction. Also, within a fused instruction bundle, instructions after a CFG_STATE_ID write must not depend on the new value.

RMWCIB0/1/2/3 — Read-Modify-Write Config Byte (opcodes 0xB3–0xB6)

Atomic read-modify-write on a single byte of Config. The digit suffix (0/1/2/3) selects which byte within the 32-bit config word to modify.

Encoding (same for all four, opcode differs):

[31:24] = 0xB3 (RMWCIB0), 0xB4 (RMWCIB1), 0xB5 (RMWCIB2), 0xB6 (RMWCIB3)
[23:16] = Mask          (8 bits — which bits to modify)
[15:8]  = NewValue      (8 bits — new bit values)
[7:0]   = CfgRegAddr    (8 bits — config register index, matches Name_ADDR32)
#define TT_OP_RMWCIB0(Mask, Data, CfgRegAddr) TT_OP(0xb3, (((Mask)<<16)+((Data)<<8)+((CfgRegAddr)<<0)))
#define TT_OP_RMWCIB1(Mask, Data, CfgRegAddr) TT_OP(0xb4, (((Mask)<<16)+((Data)<<8)+((CfgRegAddr)<<0)))
#define TT_OP_RMWCIB2(Mask, Data, CfgRegAddr) TT_OP(0xb5, (((Mask)<<16)+((Data)<<8)+((CfgRegAddr)<<0)))
#define TT_OP_RMWCIB3(Mask, Data, CfgRegAddr) TT_OP(0xb6, (((Mask)<<16)+((Data)<<8)+((CfgRegAddr)<<0)))

Functional model:

uint1_t StateID = ThreadConfig[CurrentThread].CFG_STATE_ID_StateID;
uint8_t* ByteAddr = (uint8_t*)&Config[StateID][CfgRegAddr] + Index1;  // Index1 = opcode digit (0-3)
atomic {
    uint8_t OldValue = *ByteAddr;
    *ByteAddr = (NewValue & Mask) | (OldValue & ~Mask);
}

The formula is: bits where Mask=1 get NewValue; bits where Mask=0 keep OldValue. This is the standard RMW pattern (new & mask) | (old & ~mask).

Performance: 1 cycle (but may wait at Wait Gate if Config pipeline is busy).

LLK wrapper: cfg_reg_rmw_tensix<CfgAddr32, Shamt, Mask>(val) in ckernel.h decomposes a 32-bit mask+value write into up to 4 RMWCIB calls (one per non-zero byte lane):

// Only emits RMWCIB for byte lanes where the mask is non-zero
if (mask_b0) TT_RMWCIB0(mask_b0, data_b0, CfgAddr32);
if (mask_b1) TT_RMWCIB1(mask_b1, data_b1, CfgAddr32);
if (mask_b2) TT_RMWCIB2(mask_b2, data_b2, CfgAddr32);
if (mask_b3) TT_RMWCIB3(mask_b3, data_b3, CfgAddr32);
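
The byte-lane decomposition can be checked against the plain 32-bit formula on a host. The sketch below assumes the mask and value are already shifted into position (it does not model the Shamt template parameter), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of one RMWCIBn: modify byte `lane` (0..3) of a 32-bit config word. */
uint32_t model_rmwcib(uint32_t word, unsigned lane, uint8_t mask, uint8_t data)
{
    assert(lane < 4);
    unsigned shift = lane * 8u;
    uint8_t old_byte = (uint8_t)(word >> shift);
    uint8_t new_byte = (uint8_t)((data & mask) | (old_byte & (uint8_t)~mask));
    return (word & ~(0xFFu << shift)) | ((uint32_t)new_byte << shift);
}

/* Apply a 32-bit mask/value as up to four byte-lane updates.
   Returns the new word; *emitted counts lanes with a non-zero mask. */
uint32_t model_rmw32(uint32_t word, uint32_t mask, uint32_t value, unsigned *emitted)
{
    *emitted = 0;
    for (unsigned lane = 0; lane < 4; ++lane) {
        uint8_t m = (uint8_t)(mask >> (lane * 8u));
        uint8_t d = (uint8_t)(value >> (lane * 8u));
        if (m) {  /* skip lanes the mask does not touch */
            word = model_rmwcib(word, lane, m, d);
            ++*emitted;
        }
    }
    return word;
}
```

A mask that touches only two byte lanes therefore costs two instructions, and the combined result matches (value & mask) | (old & ~mask) computed on the full word.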

Synchronization: SETDMAREG → WRCFG

The Scalar Unit and Configuration Unit are independent asynchronous backend units. Software must synchronize between them. The standard pattern:

// 1. Load 32-bit constant into GPR via two 16-bit halves
TT_SETDMAREG(0, LOWER_HALFWORD(value), 0, LO_16(p_gpr_pack::TMP0));
TT_SETDMAREG(0, UPPER_HALFWORD(value), 0, HI_16(p_gpr_pack::TMP0));

// 2. Stall Config Unit until Scalar Unit finishes
TTI_STALLWAIT(p_stall::STALL_CFG, p_stall::THCON);
//             B7=0x80: block CFG   C0=0x001: wait while ThCon busy

// 3. Copy GPR to config register
TTI_WRCFG(p_gpr_pack::TMP0, p_cfg::WRCFG_32b, TARGET_ADDR32);

// 4. NOPs — WRCFG takes 2 cycles, next insn must not read this config
TTI_NOP;
TTI_NOP;

RISC-V Direct Write Alternative

BRISC can write GPRs directly via stores to 0xFFE00000, but must ensure the write completes before pushing a Tensix instruction that reads the GPR. Three approaches:

  1. Use SETDMAREG instead (preferred).
  2. Push a STALLWAIT with condition C13 before the consuming instruction.
  3. Use a load-back fence: sw to GPR, lw from same address, consume result, then sw to push the Tensix instruction.
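
The third approach can be sketched in C as a store followed by a read-back through a volatile pointer. This is a host-testable illustration with a hypothetical helper name, gpr_write_fenced. On hardware the pointer would refer to the GPR window at 0xFFE00000, and the caller would need to use the returned value before pushing the Tensix instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Store to a GPR, then load it back from the same address. */
uint32_t gpr_write_fenced(volatile uint32_t *gpr, uint32_t value)
{
    *gpr = value;               /* sw to the GPR */
    uint32_t readback = *gpr;   /* lw from the same address */
    assert(readback == value);  /* host-side sanity check only */
    return readback;            /* caller must consume this value */
}
```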

Examples from Real Disassemblies

Firmware GPR Init (fw_trisc2.S)

TRISC2 firmware zeroes its own GPR file at boot (per the access rules above, TRISC2 sees only T2's 64 GPRs at 0xFFE00000):

# from disasms/rvir/fw_trisc2.S — zero_gprs
    lui  a5, 0xffe00          # a5 = 0xFFE00000 (GPR base)
    addi a4, a5, 256          # a4 = 0xFFE00100 (end of 64 GPRs)
zero_gprs:
    sw   zero, 0(a5)          # *a5 = 0
    addi a5, a5, 4            # next GPR
    bne  a5, a4, zero_gprs    # loop until all 64 zeroed

SETDMAREG → WRCFG Sequence (add1_trisc2.S)

Pack thread loading constants into GPRs 28–29, then writing to config:

# Build SETDMAREG instructions and push them to the T2 instruction FIFO (s0 = 0xFFE60000)
    lui  a4, 0x45000          # opcode 0x45, payload (bits [23:8]) = 0x0000
    addi a4, a4, 56           # RegIndex16b = 56 = LO_16(28)
    sw   a4, 0(s0)            # push: SETDMAREG(0, 0x0000, 0, LO_16(28))

    lui  a4, 0x45002          # opcode 0x45, payload (bits [23:8]) = 0x0020
    addi a4, a4, 57           # RegIndex16b = 57 = HI_16(28)
    sw   a4, 0(s0)            # push: SETDMAREG(0, 0x0020, 0, HI_16(28))

    lui  a4, 0x45020          # opcode 0x45, payload (bits [23:8]) = 0x0200
    addi a4, a4, 58           # RegIndex16b = 58 = LO_16(29)
    sw   a4, 0(s0)            # push: SETDMAREG(0, 0x0200, 0, LO_16(29))

    lui  a4, 0x45080          # opcode 0x45, payload (bits [23:8]) = 0x0800
    addi a4, a4, 59           # RegIndex16b = 59 = HI_16(29)
    sw   a4, 0(s0)            # push: SETDMAREG(0, 0x0800, 0, HI_16(29))

    # Inline Tensix: wait for Scalar Unit to finish
    TT_STALLWAIT 0x400001     # STALL_CFG, THCON

    # Inline Tensix: copy GPRs 28-29 to config
    TT_WRCFG 0x1c000c        # GPR 28 → Config[12], 32-bit
    TT_WRCFG 0x1d000d        # GPR 29 → Config[13], 32-bit
    TT_NOP
    TT_NOP

Note: The SETDMAREG instructions are built manually with lui+addi and pushed to the instruction FIFO via sw to 0xFFE60000, while STALLWAIT/WRCFG/NOP appear as inline Tensix instructions in the rvir disassembly.

RISC-V Direct GPR Write (add1_trisc2.S)

TRISC2 also writes GPRs directly via MMIO stores, bypassing SETDMAREG entirely:

    lui  s2, 0xffe00          # s2 = GPR base for this thread
    lui  a4, 0x1
    addi a4, a4, -2048        # a4 = 0x800
    sw   a4, 64(s2)           # GPR[16] = 0x800     (offset 64 = GPR index 16)
    sw   zero, 68(s2)         # GPR[17] = 0
    sw   zero, 72(s2)         # GPR[18] = 0
    sw   zero, 76(s2)         # GPR[19] = 0
    lw   a4, 76(s2)           # load-back fence: read GPR[19]
    sw   a4, 76(s2)           # consume result before pushing Tensix insn

Encoding Quick Reference

Scalar Unit (ThCon) — GPR operations

Instruction   Opcode   Key Fields
SETDMAREG     0x45     [23:22] SigSelSize, [21:8] Payload, [7] SignalMode, [6:0] RegIndex16b
LOADIND       0x49     [23:22] SizeSel, [21:14] OffsetIndex, [13:12] AutoIncSpec, [11:6] DataReg, [5:0] AddrReg
ADDDMAREG     0x58     [23] OpBisConst, [17:12] Result, [11:6] OpB, [5:0] OpA
SUBDMAREG     0x59     (same as ADDDMAREG)
MULDMAREG     0x5A     (same as ADDDMAREG)
DMANOP        0x60     (no fields)
STOREIND      0x66     [23] MemHierSel, [22] SizeSel, [21] RegSizeSel, [20:14] OffsetIdx, [13:12] AutoInc, [11:6] DataReg, [5:0] AddrReg

Configuration Unit — Config register operations

Instruction   Opcode   Key Fields
WRCFG         0xB0     [21:16] GprAddr, [15] wr128b, [10:0] CfgReg
RDCFG         0xB1     [23:16] GprAddr, [15:0] CfgReg
SETC16        0xB2     [23:16] CfgIndex (ThreadConfig), [15:0] NewValue
RMWCIB0       0xB3     [23:16] Mask, [15:8] Data, [7:0] CfgRegAddr
RMWCIB1       0xB4     (same as RMWCIB0, targets byte 1)
RMWCIB2       0xB5     (same as RMWCIB0, targets byte 2)
RMWCIB3       0xB6     (same as RMWCIB0, targets byte 3)

ISA Documentation Pointers

File                                               Content
Scalar Unit
tt-isa-documentation/.../ScalarUnit.md             GPR model, access rules, instruction latency table
tt-isa-documentation/.../SETDMAREG_Immediate.md    SETDMAREG load mode functional model
tt-isa-documentation/.../SETDMAREG_Special.md      SETDMAREG signal/packer-state mode (all InputSource/ResultSize combos)
tt-isa-documentation/.../ADDDMAREG.md              Addition functional model
tt-isa-documentation/.../MULDMAREG.md              Multiply functional model, 16-bit truncation
tt-isa-documentation/.../DMANOP.md                 NOP functional model
Configuration Unit
tt-isa-documentation/.../ConfigurationUnit.md      IPC groups, pipeline stages, throughput rules, starvation bugs
tt-isa-documentation/.../BackendConfiguration.md   Config vs ThreadConfig model, address space, special side effects
tt-isa-documentation/.../WRCFG.md                  Config write from GPR, 32b vs 128b, scheduling
tt-isa-documentation/.../RDCFG.md                  Config read to GPR, WH multi-thread bug, BH pipelining
tt-isa-documentation/.../SETC16.md                 ThreadConfig write, BH scheduling restrictions
tt-isa-documentation/.../RMWCIB.md                 Read-modify-write byte, mask formula, all 4 variants
LLK headers
tt-llk/.../ckernel_gpr_map.h                       Named GPR constants
tt-llk/.../ckernel_ops.h                           Instruction encoding macros
tt-llk/.../ckernel_defs.h                          LO_16/HI_16/LOWER_HALFWORD/UPPER_HALFWORD macros
tt-llk/.../ckernel.h                               cfg_reg_rmw_tensix wrapper, flip_cfg_state_id
Indirect Memory Ops
tt-isa-documentation/.../LOADIND.md                LOADIND functional model (WormholeB0 dir, applies to Blackhole)
tt-isa-documentation/.../STOREIND.md               STOREIND dispatcher (WormholeB0 dir)
tt-isa-documentation/.../STOREIND_L1.md            STOREIND L1 mode
tt-isa-documentation/.../STOREIND_MMIO.md          STOREIND MMIO mode
tt-isa-documentation/.../STOREIND_Src.md           STOREIND SrcA/SrcB mode
tt-llk/.../ckernel_instr_params.h                  struct p_ind (HIER_L1, INC_, LD_ constants)