RWC and Addressing — Blackhole Tensix Coprocessor

This document specifies how the Tensix coprocessor tracks position within tiles during computation, covering Read-Write Counters (RWC), Address Mode descriptors (AddrMod), and Address Counters (ADC).


1. Overview

Math and SFPU instructions (MVMUL, ELWADD, SFPLOAD, SFPSTORE, …) do not accept absolute register-file addresses in their instruction words — the instruction word’s dst field is only a small offset. The actual addressing is driven by two complementary counter systems:

| System | Users | What it counts |
|---|---|---|
| RWC | Matrix Unit (FPU), Vector Unit (SFPU) | Rows in SrcA, SrcB, Dst register files |
| ADC | Unpackers, Packers | X/Y/Z/W tile coordinates → L1 address and Dst datum index |

Both systems follow the same pattern: an instruction reads the current counter value to form its address, then advances the counter via a post-instruction update specified by an AddrMod descriptor (for RWCs) or an explicit increment (for ADCs).


2. RWC — Read-Write Counters

2.1 State

Each of the three Tensix threads has its own independent RWC state. There is no cross-thread access.

struct {
    uint10_t Dst,     Dst_Cr;    // 10-bit; Dest register row index
    uint6_t  SrcA,    SrcA_Cr;  //  6-bit; SrcA register row index
    uint6_t  SrcB,    SrcB_Cr;  //  6-bit; SrcB register row index
    uint2_t  FidelityPhase;     //  2-bit; 0..3 multiplication fidelity phase
    uint1_t  ExtraAddrModBit;   //  1-bit; selects upper AddrMod bank
} RWCs[3];  // indexed by CurrentThread

The _Cr (“Column Register”) variants are checkpointed copies of the main counters. They serve as the base for CR-mode increments and clears (see §3 AddrMod).

Bit widths and valid ranges:

| Counter | Width | Max value | Meaning of 1 unit |
|---|---|---|---|
| SrcA | 6 bits | 63 | 1 row of SrcA (16 elements × 19 bits each) |
| SrcB | 6 bits | 63 | 1 row of SrcB |
| Dst | 10 bits | 1023 | 1 row of Dst16b (16 elements × 16 bits) |
| FidelityPhase | 2 bits | 3 | Selects mantissa bits for multiply |

SrcA and SrcB each hold 2 banks of 64 rows × 16 columns. Dst holds 1024 rows × 16 columns in 16-bit mode (512 rows in 32-bit mode).

2.2 How Instructions Use RWC

MVMUL (Dst += SrcB @ SrcA) reads:

uint6_t SrcARow = RWCs[CurrentThread].SrcA & 0x30;  // aligned to 16-row block
uint6_t SrcBRow = RWCs[CurrentThread].SrcB & 0x38;  // aligned to 8-row block
uint10_t DstRow = RWCs[CurrentThread].Dst + ConfigState.DEST_REGW_BASE_Base
                + DstField + ThreadConfig[CurrentThread].DEST_TARGET_REG_CFG_MATH_Offset;
DstRow &= ~7;  // aligned to 8-row block

MVMUL consumes 16 rows from SrcA (rows SrcARow through SrcARow+15) and 8 rows from SrcB (rows SrcBRow through SrcBRow+7), writes an 8×16 result to Dst rows DstRow through DstRow+7.

ELWADD (Dst = SrcA + SrcB or Dst += SrcA + SrcB) reads in the same pattern but uses 8 rows from each source and destination:

uint6_t SrcARow = RWCs[CurrentThread].SrcA & 0x38;
uint6_t SrcBRow = RWCs[CurrentThread].SrcB & (broadcast ? 0x3f : 0x38);
uint10_t DstRow = (RWCs[CurrentThread].Dst + ... ) & 0x3f8;  // aligned to 8

SFPLOAD / SFPSTORE read Dst at a 4-row-aligned address:

uint10_t Addr = Imm10 + DEST_TARGET_REG_CFG_MATH_Offset
              + RWCs[CurrentThread].Dst + ConfigState.DEST_REGW_BASE_Base;
uint10_t Row    = (Addr & ~3) + (Lane / 8);  // 4-row aligned group
uint4_t  Column = (Lane & 7) * 2;            // even column; bit 1 of Addr selects odd

Lanes 0..7 map to row Addr & ~3, lanes 8..15 to +1, 16..23 to +2, 24..31 to +3. The Imm10 field of SFPLOAD/SFPSTORE is dest_reg_addr, which is added directly to RWCs[CurrentThread].Dst plus the base offsets.
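As a rough illustration of the lane mapping above, the sketch below evaluates the Row/Column formulas for a few lanes. The function name `sfp_lane_to_dst` is made up for this example, and treating bit 1 of the address as an odd-column select follows the comment in the pseudocode rather than a verified hardware model.

```python
def sfp_lane_to_dst(addr: int, lane: int) -> tuple[int, int]:
    """Map an SFPU lane (0..31) to a (row, column) pair in Dst.

    `addr` is the already-summed address
    (Imm10 + MATH offset + RWC Dst + DEST_REGW_BASE).
    """
    addr &= 0x3FF                    # keep 10 bits
    row = (addr & ~3) + lane // 8    # 4-row aligned group, 8 lanes per row
    column = (lane & 7) * 2          # even columns by default
    if addr & 2:                     # assumption: bit 1 of addr selects odd columns
        column += 1
    return row & 0x3FF, column


for lane in (0, 7, 8, 31):
    print(lane, sfp_lane_to_dst(addr=4, lane=lane))
```

With `addr=4`, the 32 lanes cover rows 4 through 7, eight lanes per row.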

SFPLOAD/SFPSTORE apply ApplyPartialAddrMod (§3.2) after execution: they advance the Dst/SrcA/SrcB RWCs but do not update FidelityPhase.

PACR (packer) uses ADC counters, not RWC directly (see §5).

2.3 SETRWC Instruction

TT_SETRWC(clear_ab_vld, rwc_cr, rwc_d, rwc_b, rwc_a, BitMask)

Encoding (6-bit opcode 0x37, full word left-shifted 2 when in instruction stream):

| Bits [29:24] | Bits [23:22] | Bits [21:18] | Bits [17:14] | Bits [13:10] | Bits [9:6] | Bits [5:0] |
|---|---|---|---|---|---|---|
| opcode=0x37 | clear_ab_vld | rwc_cr (4b) | rwc_d (4b) | rwc_b (4b) | rwc_a (4b) | BitMask (6b) |

BitMask selects which counters receive the new values (using p_setrwc constants):

| Mask | Meaning |
|---|---|
| SET_A = 0x1 | Set SrcA |
| SET_B = 0x2 | Set SrcB |
| SET_D = 0x4 | Set Dst |
| SET_F = 0x8 | Set Fidelity (always clears to 0) |
| Combinations | SET_AB=0x3, SET_ABD=0x7, SET_ABD_F=0xf, etc. |

CR modifier in rwc_cr field (4-bit field):

| Bit | Meaning |
|---|---|
| CR_A=0x1 | SrcA: add rwc_a to existing SrcA_Cr |
| CR_B=0x2 | SrcB: add rwc_b to existing SrcB_Cr |
| CR_D=0x4 | Dst: add rwc_d to existing Dst_Cr |
| C_TO_CR_MODE=0x8 | Dst: add rwc_d to current Dst (not Dst_Cr) then checkpoint |

clear_ab_vld (2-bit): optionally flip SrcA/SrcB bank and release bank back to unpackers.

Functional model:

auto& RWC = RWCs[CurrentThread];
if (BitMask & SET_A) {
    if (rwc_cr & CR_A) rwc_a += RWC.SrcA_Cr;
    RWC.SrcA = rwc_a;  RWC.SrcA_Cr = rwc_a;
}
if (BitMask & SET_B) {
    if (rwc_cr & CR_B) rwc_b += RWC.SrcB_Cr;
    RWC.SrcB = rwc_b;  RWC.SrcB_Cr = rwc_b;
}
if (BitMask & SET_D) {  // C_TO_CR_MODE is an rwc_cr flag, so it is tested only inside this block
    if (rwc_cr & C_TO_CR_MODE) rwc_d += RWC.Dst;       // base = current C
    else if (rwc_cr & CR_D)    rwc_d += RWC.Dst_Cr;    // base = checkpoint
    RWC.Dst = rwc_d;  RWC.Dst_Cr = rwc_d;
}
if (BitMask & SET_F) RWC.FidelityPhase = 0;
if (clear_ab_vld & 1) { release_srca_bank(); flip_srca_bank(); }
if (clear_ab_vld & 2) { release_srcb_bank(); flip_srcb_bank(); }

The most common usage is SETRWC(CLR_NONE, 0, 0, 0, 0, SET_ABD_F) to reset all counters to zero at the start of a tile computation.
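A minimal sketch of the encoding, assuming the field layout in the table above and the 2-bit left shift applied to words in the instruction stream. The helper name `encode_setrwc` is hypothetical and is not part of any Tenstorrent header.

```python
SETRWC_OPCODE = 0x37


def encode_setrwc(clear_ab_vld=0, rwc_cr=0, rwc_d=0, rwc_b=0, rwc_a=0, bitmask=0):
    """Return (tt_op_word, stream_word) for SETRWC."""
    word = (
        (SETRWC_OPCODE << 24)
        | ((clear_ab_vld & 0x3) << 22)
        | ((rwc_cr & 0xF) << 18)
        | ((rwc_d & 0xF) << 14)
        | ((rwc_b & 0xF) << 10)
        | ((rwc_a & 0xF) << 6)
        | (bitmask & 0x3F)
    )
    stream = (word << 2) & 0xFFFFFFFF  # form seen in the instruction stream
    return word, stream


SET_ABD_F = 0xF
word, stream = encode_setrwc(bitmask=SET_ABD_F)
print(hex(word), hex(stream))
```

For the reset-everything case this produces the stream word 0xdc00003c, which is the `ttsetrwc 0,0,0,0,0,15` word shown in §4.4.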

2.4 INCRWC Instruction

TT_INCRWC(rwc_cr, rwc_d, rwc_b, rwc_a)

Encoding (opcode 0x38):

| Bits [23:18] | Bits [17:14] | Bits [13:10] | Bits [9:6] |
|---|---|---|---|
| rwc_cr (3b CR flags, 3b padding) | rwc_d (4b) | rwc_b (4b) | rwc_a (4b) |

Functional model:

auto& RWC = RWCs[CurrentThread];
if (rwc_cr & SrcACr) { RWC.SrcA_Cr += rwc_a; RWC.SrcA = RWC.SrcA_Cr; }
else                 { RWC.SrcA += rwc_a; }
if (rwc_cr & SrcBCr) { RWC.SrcB_Cr += rwc_b; RWC.SrcB = RWC.SrcB_Cr; }
else                 { RWC.SrcB += rwc_b; }
if (rwc_cr & DstCr)  { RWC.Dst_Cr  += rwc_d; RWC.Dst  = RWC.Dst_Cr; }
else                 { RWC.Dst  += rwc_d; }

INCRWC does not touch FidelityPhase. It is useful for fine-grained manual counter control outside of the AddrMod mechanism.
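The difference between a plain increment and a CR-mode increment can be sketched with a single counter. `RowCounter` is an illustrative class, not a hardware or library type; the 6-bit mask matches the SrcA/SrcB width given in §2.1.

```python
from dataclasses import dataclass


@dataclass
class RowCounter:
    value: int = 0    # active counter, e.g. SrcB
    cr: int = 0       # checkpoint, e.g. SrcB_Cr
    mask: int = 0x3F  # 6-bit SrcA/SrcB width

    def inc(self, n: int, use_cr: bool = False) -> None:
        if use_cr:
            # CR mode: advance the checkpoint, then copy it to the counter
            self.cr = (self.cr + n) & self.mask
            self.value = self.cr
        else:
            # Plain mode: advance the counter only; the checkpoint is kept
            self.value = (self.value + n) & self.mask


srcb = RowCounter()
srcb.inc(8)                # value=8,  cr=0
srcb.inc(8)                # value=16, cr=0
srcb.inc(32, use_cr=True)  # cr=32, value snaps to 32
print(srcb.value, srcb.cr)
```

A CR-mode increment of 0 therefore acts as "return to checkpoint", which is how the matmul kernel in §4 rewinds SrcA and SrcB.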


3. AddrMod — Address Mode Descriptors

3.1 Purpose

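Every math/SFPU instruction has an addr_mode field (called AddrMod in ISA docs), modeled as 2 bits in the pseudocode below. It selects one of up to 8 pre-configured descriptor entries that specify how to update the RWC counters after the instruction executes, which avoids encoding large increment values directly in the (narrow) instruction word. The Blackhole disassembly in §4.3 carries addr_mode values 4 and 5 directly in the instruction word, so the field as encoded there is evidently wider than 2 bits.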

3.2 Index Calculation

void ApplyAddrMod(uint2_t AddrMod, bool UpdateFidelityPhase = true) {
    auto& RWC = RWCs[CurrentThread];
    uint3_t Index = AddrMod;  // 2-bit field from instruction
    if (RWC.ExtraAddrModBit || ThreadConfig[CurrentThread].ADDR_MOD_SET_Base) {
        Index += 4;  // use upper bank (sections 4..7)
    }
    // ... apply descriptor at Index
}

Instructions with a 2-bit addr_mode field select entries 0..3 in the lower bank, or 4..7 in the upper bank (when ExtraAddrModBit or ADDR_MOD_SET_Base is set). The Bias sub-descriptor can flip ExtraAddrModBit, enabling context switching between the two banks within a single instruction sequence.
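The bank selection can be sketched as a small function that follows the 2-bit model in the pseudocode above. The function name `addr_mod_index` is an assumption made for this example.

```python
def addr_mod_index(addr_mod: int, extra_addr_mod_bit: int, addr_mod_set_base: int) -> int:
    """Return the descriptor index (0..7) selected by an instruction."""
    index = addr_mod & 0x3  # 2-bit field, as modeled in the pseudocode
    if extra_addr_mod_bit or addr_mod_set_base:
        index += 4          # upper bank, sections 4..7
    return index


for extra in (0, 1):
    print(extra, [addr_mod_index(m, extra, 0) for m in range(4)])
```

With the extra bit clear the four field values map to sections 0..3; with it set (or with ADDR_MOD_SET_Base set) the same values map to sections 4..7.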

SFPLOAD and SFPSTORE use ApplyPartialAddrMod — same as ApplyAddrMod but UpdateFidelityPhase = false.

3.3 Config Register Layout

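The 8 descriptors are stored in ThreadConfig[CurrentThread] (per-thread configuration, written via the SETC16 instruction). Each section has three sub-descriptors that drive the RWCs (AB, DST, BIAS); a fourth, PACK, drives the packer ADC Y/Z counters and exists for only 4 sections: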

AB sub-descriptor (SrcA/SrcB update), ADDR32 = 12 + section_index:

| Bits | Field | Description |
|---|---|---|
| [5:0] | SrcAIncr | Unsigned addend to SrcA (6-bit) |
| [6] | SrcACR | 1 = add to SrcA_Cr checkpoint then assign |
| [7] | SrcAClear | 1 = SrcA = 0, SrcA_Cr = 0 |
| [13:8] | SrcBIncr | Unsigned addend to SrcB (6-bit) |
| [14] | SrcBCR | 1 = add to SrcB_Cr checkpoint then assign |
| [15] | SrcBClear | 1 = SrcB = 0, SrcB_Cr = 0 |

DST sub-descriptor (Dst / Fidelity update), ADDR32 = 28 + section_index:

| Bits | Field | Description |
|---|---|---|
| [9:0] | DestIncr | Signed addend to Dst (10-bit, two’s complement) |
| [10] | DestCR | 1 = add to Dst_Cr checkpoint then assign |
| [11] | DestClear | 1 = Dst = 0, Dst_Cr = 0 |
| [12] | DestCToCR | 1 = add DestIncr to current Dst (C), then checkpoint |
| [14:13] | FidelityIncr | 2-bit unsigned addend to FidelityPhase |
| [15] | FidelityClear | 1 = FidelityPhase = 0 |

BIAS sub-descriptor (ExtraAddrModBit control), ADDR32 = 47 + section_index:

| Bits | Field | Description |
|---|---|---|
| [3:0] | BiasIncr | If bits [1:0] != 0: ExtraAddrModBit += 1 |
| [4] | BiasClear | 1 = ExtraAddrModBit = 0 |

PACK sub-descriptor (packer ADC Y/Z update), ADDR32 = 37 + section_index (only 4 sections):

| Bits | Field | Description |
|---|---|---|
| [3:0] | YsrcIncr | ADC Y increment for pack input (Dst row) |
| [4] | YsrcCR | CR-mode for YsrcIncr |
| [5] | YsrcClear | Clear Y for pack input |
| [9:6] | YdstIncr | ADC Y increment for pack output (L1 addr) |
| [10] | YdstCR | CR-mode for YdstIncr |
| [11] | YdstClear | Clear Y for pack output |
| [12] | ZsrcIncr | 1-bit Z increment for pack input |
| [13] | ZsrcClear | Clear Z for pack input |
| [14] | ZdstIncr | 1-bit Z increment for pack output |
| [15] | ZdstClear | Clear Z for pack output |
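The AB and DST layouts above can be unpacked from 16-bit register values with a few shifts and masks. This is a sketch under the assumption that each sub-descriptor occupies the low 16 bits of its config word; the helper names and the example words are made up for illustration, and the dictionary keys simply mirror the field names.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Extract the inclusive bit range [hi:lo] from word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value: int, width: int) -> int:
    """Interpret a width-bit two's complement value as a signed int."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def decode_ab(word: int) -> dict:
    return {
        "srca_incr": bits(word, 5, 0),
        "srca_cr": bits(word, 6, 6),
        "srca_clr": bits(word, 7, 7),
        "srcb_incr": bits(word, 13, 8),
        "srcb_cr": bits(word, 14, 14),
        "srcb_clr": bits(word, 15, 15),
    }


def decode_dst(word: int) -> dict:
    return {
        "dest_incr": sign_extend(bits(word, 9, 0), 10),
        "dest_cr": bits(word, 10, 10),
        "dest_clr": bits(word, 11, 11),
        "dest_c2cr": bits(word, 12, 12),
        "fidelity_incr": bits(word, 14, 13),
        "fidelity_clr": bits(word, 15, 15),
    }


print(decode_ab(0x6000))   # SrcB in CR mode with increment 32
print(decode_dst(0x03F8))  # Dst increment of -8
```

The sign extension matters only for DestIncr, the one field documented as signed.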

3.4 Complete ApplyAddrMod Pseudocode

void ApplyAddrMod(uint2_t AddrMod, bool UpdateFidelityPhase) {
    auto& RWC = RWCs[CurrentThread];
    uint3_t Index = AddrMod;
    if (RWC.ExtraAddrModBit || ThreadConfig[CurrentThread].ADDR_MOD_SET_Base)
        Index += 4;
    auto& AB   = ThreadConfig[CurrentThread].ADDR_MOD_AB_SEC[Index];
    auto& Dst  = ThreadConfig[CurrentThread].ADDR_MOD_DST_SEC[Index];
    auto& Bias = ThreadConfig[CurrentThread].ADDR_MOD_BIAS_SEC[Index];

    // SrcA update
    if (AB.SrcAClear)       { RWC.SrcA = 0; RWC.SrcA_Cr = 0; }
    else if (AB.SrcACR)     { RWC.SrcA_Cr += AB.SrcAIncr; RWC.SrcA = RWC.SrcA_Cr; }
    else                    { RWC.SrcA += AB.SrcAIncr; }

    // SrcB update
    if (AB.SrcBClear)       { RWC.SrcB = 0; RWC.SrcB_Cr = 0; }
    else if (AB.SrcBCR)     { RWC.SrcB_Cr += AB.SrcBIncr; RWC.SrcB = RWC.SrcB_Cr; }
    else                    { RWC.SrcB += AB.SrcBIncr; }

    // Dst update
    if (Dst.DestClear)      { RWC.Dst = 0; RWC.Dst_Cr = 0; }
    else if (Dst.DestCToCR) { RWC.Dst += Dst.DestIncr; RWC.Dst_Cr = RWC.Dst; }
    else if (Dst.DestCR)    { RWC.Dst_Cr += Dst.DestIncr; RWC.Dst = RWC.Dst_Cr; }
    else                    { RWC.Dst += Dst.DestIncr; }

    // Fidelity update
    if (UpdateFidelityPhase) {
        if (Dst.FidelityClear)   RWC.FidelityPhase = 0;
        else                     RWC.FidelityPhase += Dst.FidelityIncr;
    }

    // ExtraAddrModBit update
    if (Bias.BiasClear)     RWC.ExtraAddrModBit = 0;
    else if (Bias.BiasIncr & 3) RWC.ExtraAddrModBit += 1;
}

4. Matmul Kernel — Concrete AddrMod Examples

4.1 Configuration Setup (from matmul_trisc1.S)

The peak matmul kernel (standard 32×32 tile, no transpose, single fidelity) configures five addr_mod sections via SETC16 before the compute loop. Decoded from the observed ttsetc16 instructions at 0x63c4–0x63fc:

| Section | SrcA | SrcB | Dst | Fidelity | Purpose |
|---|---|---|---|---|---|
| ADDR_MOD_0 | incr=0 | incr=8 | incr=8 | - | Step both SrcB and Dst by 8 rows; SrcA holds |
| ADDR_MOD_1 | incr=16 | cr+0 (reset to checkpoint) | incr=8 | - | Advance SrcA by 16 rows, restore SrcB to CR |
| ADDR_MOD_2 | cr+0 (reset to checkpoint) | cr+32 | incr=8 | - | Restore SrcA to CR, advance SrcB CR by 32 |
| ADDR_MOD_4 | cr+32 | cr+48 | cr+0 (reset to checkpoint) | - | Advance SrcA CR by 32, SrcB CR by 48, restore Dst |
| ADDR_MOD_5 | clr | clr | clr | incr=1 | Reset all counters, increment fidelity phase |

The cr in cr+N means “add N to the checkpoint register and assign to the active counter”. With 6-bit unsigned arithmetic, cr+48 from a checkpoint of 32 yields (32+48) & 0x3f = 16, which is how the SrcB counter wraps from face 2 back to face 1 (row 16).

4.2 RWC Trace: Standard 32×32 Tile Computation

A 32×32 tile is held as four 16×16 faces in SrcA and four 16×16 faces in SrcB. The FPU computes Dst[8,16] += SrcB[8,16] @ SrcA[16,16] per MVMUL. One complete tile (single fidelity phase) requires 16 MVMULs: eight B-face/A-face pairings, each issued as two 8-row halves of the SrcB face.

SrcA layout in register file: face 0 at rows 0–15, face 1 at rows 16–31, face 2 at rows 32–47, face 3 at rows 48–63. SrcB is identical. Dst: each MVMUL writes one 8-row block, and the trace below walks Dst rows 0–63 twice (MVMULs 1–8, then 9–16), with the second pass accumulating onto the first.

Initial state: rwc_a=0(cr=0) rwc_b=0(cr=0) rwc_d=0(cr=0).

Insn  AddrMod  Reads [SrcB, SrcA → Dst]           After: rwc_a     rwc_b     rwc_d
  1   MOD_0    SrcB[0..7],   SrcA[0..15]  → Dst[0..7]    0(cr=0)   8(cr=0)   8(cr=0)
  2   MOD_1    SrcB[8..15],  SrcA[0..15]  → Dst[8..15]   16(cr=0)  0(cr=0)   16(cr=0)
  3   MOD_0    SrcB[0..7],   SrcA[16..31] → Dst[16..23]  16(cr=0)  8(cr=0)   24(cr=0)
  4   MOD_2    SrcB[8..15],  SrcA[16..31] → Dst[24..31]  0(cr=0)   32(cr=32) 32(cr=0)
  5   MOD_0    SrcB[32..39], SrcA[0..15]  → Dst[32..39]  0(cr=0)   40(cr=32) 40(cr=0)
  6   MOD_1    SrcB[40..47], SrcA[0..15]  → Dst[40..47]  16(cr=0)  32(cr=32) 48(cr=0)
  7   MOD_0    SrcB[32..39], SrcA[16..31] → Dst[48..55]  16(cr=0)  40(cr=32) 56(cr=0)
  8   MOD_4    SrcB[40..47], SrcA[16..31] → Dst[56..63]  32(cr=32) 16(cr=16) 0(cr=0)
  9   MOD_0    SrcB[16..23], SrcA[32..47] → Dst[0..7]    32(cr=32) 24(cr=16) 8(cr=0)
 10   MOD_1    SrcB[24..31], SrcA[32..47] → Dst[8..15]   48(cr=32) 16(cr=16) 16(cr=0)
 11   MOD_0    SrcB[16..23], SrcA[48..63] → Dst[16..23]  48(cr=32) 24(cr=16) 24(cr=0)
 12   MOD_2    SrcB[24..31], SrcA[48..63] → Dst[24..31]  32(cr=32) 48(cr=48) 32(cr=0)
 13   MOD_0    SrcB[48..55], SrcA[32..47] → Dst[32..39]  32(cr=32) 56(cr=48) 40(cr=0)
 14   MOD_1    SrcB[56..63], SrcA[32..47] → Dst[40..47]  48(cr=32) 48(cr=48) 48(cr=0)
 15   MOD_0    SrcB[48..55], SrcA[48..63] → Dst[48..55]  48(cr=32) 56(cr=48) 56(cr=0)
 16   MOD_5    SrcB[56..63], SrcA[48..63] → Dst[56..63]  0(cr=0)   0(cr=0)   0(cr=0)   fidelity+=1

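MVMUL reads SrcA aligned to 16-row blocks (rwc_a & 0x30) and SrcB aligned to 8-row blocks (rwc_b & 0x38). Every counter value in the trace is already block-aligned, so the register-file rows consumed equal the table values. For Dst the hardware aligns to 8-row blocks.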

After MVMUL 16, all counters reset to 0 and FidelityPhase increments to 1. The outer MOP loop re-executes the 16-MVMUL replay buffer for fidelity phases 1, 2, 3 if high-fidelity is enabled; SETRWC(CLR_A, 0, 0, 0, 0, SET_ABD_F) (or CLR_B) resets all and clears FidelityPhase=0 at the end of the last phase.
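The trace above can be reproduced with a short simulation. This is a sketch, assuming the five descriptor settings listed in §4.1 and the replay order shown in §4.3; the data layout and function names are invented for the example, and FidelityPhase is left out for brevity.

```python
SRC_MASK, DST_MASK = 0x3F, 0x3FF

# (kind, amount) per counter, transcribed from the table in §4.1.
# kinds: "incr" = plain add, "cr" = add to checkpoint then assign, "clr" = clear
ADDR_MODS = {
    0: {"a": ("incr", 0),  "b": ("incr", 8), "d": ("incr", 8)},
    1: {"a": ("incr", 16), "b": ("cr", 0),   "d": ("incr", 8)},
    2: {"a": ("cr", 0),    "b": ("cr", 32),  "d": ("incr", 8)},
    4: {"a": ("cr", 32),   "b": ("cr", 48),  "d": ("cr", 0)},
    5: {"a": ("clr", 0),   "b": ("clr", 0),  "d": ("clr", 0)},
}
SEQUENCE = [0, 1, 0, 2, 0, 1, 0, 4, 0, 1, 0, 2, 0, 1, 0, 5]  # replay buffer order


def step(value, cr, kind, amount, mask):
    """Apply one counter update; returns the new (value, checkpoint)."""
    if kind == "clr":
        return 0, 0
    if kind == "cr":
        cr = (cr + amount) & mask
        return cr, cr
    return (value + amount) & mask, cr


def run_trace():
    state = {"a": (0, 0), "b": (0, 0), "d": (0, 0)}  # (value, checkpoint)
    rows = []
    for mod in SEQUENCE:
        # Rows read by this MVMUL are the counter values before the update
        reads = (state["b"][0], state["a"][0], state["d"][0])
        for key, mask in (("a", SRC_MASK), ("b", SRC_MASK), ("d", DST_MASK)):
            kind, amount = ADDR_MODS[mod][key]
            state[key] = step(*state[key], kind, amount, mask)
        rows.append((mod, reads, dict(state)))
    return rows


for i, (mod, (b, a, d), after) in enumerate(run_trace(), start=1):
    print(f"{i:2d} MOD_{mod} SrcB[{b}..{b + 7}] SrcA[{a}..{a + 15}] "
          f"-> Dst[{d}..{d + 7}]  after: {after}")
```

No alignment masking is applied to the read addresses because every counter value in this sequence is already block-aligned.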

4.3 Actual Instruction Stream (matmul_trisc1.S, replay buffer at 0x6404)

6404:  98000000   ttmvmul  0,0,0,0    ; clear_dvalid=0, addr_mode=0, dst=0
6408:  98010000   ttmvmul  0,0,1,0    ; addr_mode=1
640c:  98000000   ttmvmul  0,0,0,0    ; addr_mode=0
6410:  98020000   ttmvmul  0,0,2,0    ; addr_mode=2
6414:  98000000   ttmvmul  0,0,0,0
6418:  98010000   ttmvmul  0,0,1,0
641c:  98000000   ttmvmul  0,0,0,0
6420:  98040000   ttmvmul  0,0,4,0    ; addr_mode=4
6424:  98000000   ttmvmul  0,0,0,0
6428:  98010000   ttmvmul  0,0,1,0
642c:  98000000   ttmvmul  0,0,0,0
6430:  98020000   ttmvmul  0,0,2,0
6434:  98000000   ttmvmul  0,0,0,0
6438:  98010000   ttmvmul  0,0,1,0
643c:  98000000   ttmvmul  0,0,0,0
6440:  98050000   ttmvmul  0,0,5,0    ; addr_mode=5: reset + fidelity increment

This replay buffer (16 instructions) is loaded via ttreplay 16,16,0,1 at 0x6400. The MOP wrapper executes it once per fidelity phase (inner_loops = to_underlying(MathFidelity)).

Note on encoding: All TTI_ (inline) Tensix instructions appear in the instruction stream left-shifted by 2 bits relative to the TT_OP() encoding. That is, if TT_OP(opcode, params) yields a 32-bit word W, then the instruction stream contains W << 2. The opcode field sits in bits [29:24] of TT_OP, and lands in bits [31:26] of the physical instruction word.

The disassembled opcode byte (0x98 for ttmvmul) is therefore 0x26 << 2 = 0x98, and 0xdc for ttsetrwc is 0x37 << 2 = 0xdc.
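The shift can be undone in a couple of lines to recover the opcode from a stream word. The function names below are hypothetical. The position of the MVMUL addr_mode field (bit 14 of the unshifted word, read here as 3 bits) is inferred only from the words in the listing above and should be treated as a guess, not as a documented layout.

```python
def stream_to_tt_op(stream_word: int) -> int:
    """Undo the 2-bit left shift applied to words in the instruction stream."""
    return (stream_word & 0xFFFFFFFF) >> 2


def opcode_of(stream_word: int) -> int:
    """Opcode taken from bits [29:24] of the TT_OP word."""
    return (stream_to_tt_op(stream_word) >> 24) & 0x3F


def mvmul_addr_mode_guess(stream_word: int) -> int:
    """Assumption: addr_mode starts at bit 14 of the TT_OP word (inferred, unverified)."""
    return (stream_to_tt_op(stream_word) >> 14) & 0x7


for w in (0x98000000, 0x98010000, 0x98050000, 0xDC00003C):
    print(hex(w), hex(opcode_of(w)))
```

Applied to the replay-buffer words, the guessed field extraction returns the same addr_mode operands that the disassembler prints (0, 1, 2, 4, 5).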

4.4 SETRWC at Loop Boundary (matmul_trisc1.S, 0x64a0)

64a0:  dc00003c   ttsetrwc  0,0,0,0,0,15

Decoded: SETRWC(clear_ab_vld=0, rwc_cr=0, rwc_d=0, rwc_b=0, rwc_a=0, BitMask=0xf)

BitMask 0xf = SET_ABD_F — sets SrcA=0, SrcB=0, Dst=0, Fidelity=0. All checkpoints also reset to 0 (new values are written to both main counter and _Cr). This resets all RWC state before the inner tile computation begins.


5. ADC — Address Counters

5.1 State

struct {
    struct {
        struct {
            uint18_t X, X_Cr;
            uint13_t Y, Y_Cr;
            uint8_t  Z, Z_Cr;
            uint8_t  W, W_Cr;
        } Channel[2];
    } Unpacker[2], Packers;
} ADCs[3];  // indexed by CurrentThread (or overridden)

ADCs are used by unpackers and packers. The RWC system is entirely separate.

Per unit:

| Unit | What it addresses |
|---|---|
| Unpacker[0] | SrcA data in L1 (UNP0) |
| Unpacker[1] | SrcB data in L1 (UNP1) |
| Packers | Dst→L1 (all 4 packers share one set) |

5.2 Channel Semantics

For unpackers, Channel[0] drives the input (L1 read) path and Channel[1] drives the output (register write) path:

| Counter | Channel 0 | Channel 1 |
|---|---|---|
| X | L1 input address generation; part of datum count | Upper limit of datum count |
| Y | Decompressor: seeks to row within L1 | Output address to Dst (UNP0) or SrcB (UNP1) |
| Z | Decompressor (BFP exponent section) | Output address (continued) |
| W | Decompressor | Output address (continued) |

For packers, Channel[0] drives the input (Dst read) path and Channel[1] drives the output (L1 write) path:

| Counter | Channel 0 | Channel 1 |
|---|---|---|
| X | Dst row/column address; datum count start | Datum count end |
| Y | Dst address offset | L1 output address offset |
| Z | Dst address offset | L1 output address offset |
| W | Dst address offset | L1 output address offset |

The packer’s Dst input address is:

uint32_t Addr = PCK0_ADDR_BASE_REG_0_Base
    + ADC.X * (PCK0_ADDR_CTRL_XY_REG_0_Xstride & 0xf)
    + ADC.Y * PCK0_ADDR_CTRL_XY_REG_0_Ystride
    + ADC.Z * PCK0_ADDR_CTRL_ZW_REG_0_Zstride
    + ADC.W * PCK0_ADDR_CTRL_ZW_REG_0_Wstride;
// Then: Addr_datum = ((Addr / BytesPerDatum) & ~ADC_X_Mask) + (ADC.X & ADC_X_Mask)
//       + (DEST_TARGET_REG_CFG_PACK_SEC[i].Offset << 4);

The packer L1 output address is computed from PCK0_ADDR_BASE_REG_1_Base plus Channel[1].Y/Z/W weighted by the corresponding stride registers, aligned to 16 bytes.
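The first step of the packer input address (the weighted sum, before the datum conversion) can be sketched as below. The function name and all stride values are hypothetical and chosen only to make the arithmetic visible; the datum-conversion step is omitted because BytesPerDatum and ADC_X_Mask are not defined in this document.

```python
def packer_dst_byte_addr(base, x, y, z, w, *, xstride, ystride, zstride, wstride):
    """Weighted sum of the Channel[0] counters, per the formula above."""
    return (base
            + x * (xstride & 0xF)  # only the low 4 bits of Xstride are used
            + y * ystride
            + z * zstride
            + w * wstride)


# Hypothetical strides: 32 bytes per Y step, 512 bytes per Z step
print(packer_dst_byte_addr(0, x=0, y=3, z=1, w=0,
                           xstride=2, ystride=32, zstride=512, wstride=0))
```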

5.3 ADC Instructions

All ADC instructions execute on the Miscellaneous Unit.

CntSetMask (3-bit) selects which counters to modify:

| CntSetMask | Constant | Units affected |
|---|---|---|
| 0b001 | UNP0 / UNP_A | Unpacker 0 (SrcA) |
| 0b010 | UNP1 / UNP_B | Unpacker 1 (SrcB) |
| 0b011 | UNP_AB | Both unpackers |
| 0b100 | PAC | Packers |

SETADC — set one dimension of one channel:

TT_SETADC(CntSetMask, Channel, XYZW, NewValue)
// Sets ADC.Channel[Channel].{X|Y|Z|W} = NewValue (and its _Cr)
// NewValue bits[17:16] = ThreadOverride (0 = use CurrentThread)

SETADCXY — set X and Y of both channels simultaneously:

TT_SETADCXY(CntSetMask, Y1Val, X1Val, Y0Val, X0Val, BitMask)
// BitMask bits: X0(0), Y0(1), X1(2), Y1(3) — select which to update
// Values are 3-bit (for small tile face indices)

SETADCZW — set Z and W of both channels simultaneously:

TT_SETADCZW(CntSetMask, W1Val, Z1Val, W0Val, Z0Val, BitMask)
// BitMask: Z0(0), W0(1), Z1(2), W1(3)

SETADCXX — set X of both channels from 10-bit values:

TT_SETADCXX(CntSetMask, X1Val, X0Val)
// X0Val (10b): Channel[0].X and Channel[0].X_Cr
// X1Val (10b): Channel[1].X and Channel[1].X_Cr
// (No ThreadOverride; always uses CurrentThread)

INCADCXY — increment X and Y of both channels:

TT_INCADCXY(CntSetMask, Y1Inc, X1Inc, Y0Inc, X0Inc)
// All increments are 3-bit; adds directly to X/Y (not _Cr)

INCADCZW — increment Z and W of both channels:

TT_INCADCZW(CntSetMask, W1Inc, Z1Inc, W0Inc, Z0Inc)

X wrapping via SETADCXX: The SETADCXX instruction sets both Channel[0].X (the starting X counter) and Channel[1].X (the ending X counter used for datum count). The datum count passed to the unpacker/packer is Channel[1].X - Channel[0].X + 1. Wrapping or stopping conditions are handled externally by software resetting these via SETADCXX or SETADC before the next UNPACR/PACR.
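As a small worked example of the datum count, the sketch below models only the X counters of the two channels. `XChannel`, `setadcxx`, and `datum_count` are illustrative names, and the values 0 and 255 are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class XChannel:
    X: int = 0
    X_Cr: int = 0


def setadcxx(ch0: XChannel, ch1: XChannel, x0_val: int, x1_val: int) -> None:
    """Set X and X_Cr of both channels from 10-bit values."""
    ch0.X = ch0.X_Cr = x0_val & 0x3FF
    ch1.X = ch1.X_Cr = x1_val & 0x3FF


def datum_count(ch0: XChannel, ch1: XChannel) -> int:
    """Datum count handed to the unpacker/packer."""
    return ch1.X - ch0.X + 1


ch0, ch1 = XChannel(), XChannel()
setadcxx(ch0, ch1, x0_val=0, x1_val=255)
print(datum_count(ch0, ch1))
```

Because the end counter is inclusive, a start of 0 and an end of 255 describe 256 datums.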

5.4 ADC Usage in the Matmul Kernel (matmul_trisc2.S)

The pack thread (TRISC2) sets up ADC for the packers:

7114:  4600002d   ttsetadcxy  4,0,0,0,0,11    ; PAC, Y1=0,X1=0,Y0=0,X0=0, mask=0b1011 (X0,Y0,Y1)
7118:  5200003d   ttsetadczw  4,0,0,0,0,15    ; PAC, all Z/W = 0

This zeroes X and Y of Channel[0], Y of Channel[1], and Z/W of both channels, targeting the beginning of the Dst tile and the L1 destination address. The mask 0b1011 leaves Channel[1].X untouched.
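To make the effect of the SETADCXY BitMask concrete, the sketch below applies mask 0b1011 to a toy ADC state. The dictionary layout and function names are invented for this example, and writing the matching _Cr alongside each counter is an assumption carried over from the SETADC description.

```python
def make_adc(initial=0):
    """Toy X/Y state for two channels, each with a checkpoint copy."""
    return {ch: {"X": initial, "X_Cr": initial, "Y": initial, "Y_Cr": initial}
            for ch in ("ch0", "ch1")}


def setadcxy(adc, y1=0, x1=0, y0=0, x0=0, bitmask=0):
    """BitMask bits: X0 (0), Y0 (1), X1 (2), Y1 (3). Values are 3-bit."""
    targets = (("ch0", "X", x0), ("ch0", "Y", y0), ("ch1", "X", x1), ("ch1", "Y", y1))
    for bit, (ch, name, value) in enumerate(targets):
        if bitmask & (1 << bit):
            # Assumption: the checkpoint is written together with the counter
            adc[ch][name] = adc[ch][name + "_Cr"] = value & 0x7
    return adc


adc = setadcxy(make_adc(initial=5), bitmask=0b1011)
print(adc["ch0"])
print(adc["ch1"])
```

Starting every counter at 5 makes it easy to see which ones the mask skips: only X of channel 1 keeps its old value.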

The TRISC0 (unpack) thread similarly initializes Unpacker ADCs before each UNPACR sequence using SETADCXY/SETADCZW to position at the correct tile face within L1.


6. Functional Model (Python)

class RWCState:
    # Counter widths expressed as bit masks
    SRCA_MASK = 0x3F
    SRCB_MASK = 0x3F
    DST_MASK  = 0x3FF
    FIDELITY_MASK = 0x3

    def __init__(self):
        self.srca = self.srca_cr = 0
        self.srcb = self.srcb_cr = 0
        self.dst  = self.dst_cr  = 0
        self.fidelity = 0
        self.extra_addr_mod_bit = 0

def apply_addr_mod(rwc: RWCState, ab, dst_d, bias, update_fidelity=True):
    """Apply one AddrMod descriptor entry to RWC state."""
    # SrcA
    if ab['srca_clr']:
        rwc.srca = rwc.srca_cr = 0
    elif ab['srca_cr']:
        rwc.srca_cr = (rwc.srca_cr + ab['srca_incr']) & rwc.SRCA_MASK
        rwc.srca = rwc.srca_cr
    else:
        rwc.srca = (rwc.srca + ab['srca_incr']) & rwc.SRCA_MASK

    # SrcB
    if ab['srcb_clr']:
        rwc.srcb = rwc.srcb_cr = 0
    elif ab['srcb_cr']:
        rwc.srcb_cr = (rwc.srcb_cr + ab['srcb_incr']) & rwc.SRCB_MASK
        rwc.srcb = rwc.srcb_cr
    else:
        rwc.srcb = (rwc.srcb + ab['srcb_incr']) & rwc.SRCB_MASK

    # Dst
    if dst_d['dest_clr']:
        rwc.dst = rwc.dst_cr = 0
    elif dst_d['dest_c2cr']:     # CtoCR: add to C, checkpoint
        rwc.dst = (rwc.dst + dst_d['dest_incr']) & rwc.DST_MASK
        rwc.dst_cr = rwc.dst
    elif dst_d['dest_cr']:        # CR: add to checkpoint, assign
        rwc.dst_cr = (rwc.dst_cr + dst_d['dest_incr']) & rwc.DST_MASK
        rwc.dst = rwc.dst_cr
    else:
        rwc.dst = (rwc.dst + dst_d['dest_incr']) & rwc.DST_MASK

    # Fidelity
    if update_fidelity:
        if dst_d['fidelity_clr']:
            rwc.fidelity = 0
        else:
            rwc.fidelity = (rwc.fidelity + dst_d['fidelity_incr']) & rwc.FIDELITY_MASK

    # ExtraAddrModBit
    if bias['bias_clr']:
        rwc.extra_addr_mod_bit = 0
    elif bias['bias_incr'] & 3:
        rwc.extra_addr_mod_bit = (rwc.extra_addr_mod_bit + 1) & 1  # 1-bit counter wraps, i.e. flips


def setrwc(rwc: RWCState, rwc_a=0, rwc_b=0, rwc_d=0, rwc_cr=0, bitmask=0,
           clear_srca_bank=False, clear_srcb_bank=False):
    """Execute SETRWC instruction."""
    SET_A, SET_B, SET_D, SET_F = 1, 2, 4, 8
    CR_A, CR_B, CR_D, C_TO_CR = 1, 2, 4, 8
    if bitmask & SET_A:
        base = rwc.srca_cr if (rwc_cr & CR_A) else 0
        rwc.srca = rwc.srca_cr = (base + rwc_a) & RWCState.SRCA_MASK
    if bitmask & SET_B:
        base = rwc.srcb_cr if (rwc_cr & CR_B) else 0
        rwc.srcb = rwc.srcb_cr = (base + rwc_b) & RWCState.SRCB_MASK
    if bitmask & SET_D:  # C_TO_CR is an rwc_cr flag, tested below
        if rwc_cr & C_TO_CR:   base = rwc.dst
        elif rwc_cr & CR_D:    base = rwc.dst_cr
        else:                  base = 0
        rwc.dst = rwc.dst_cr = (base + rwc_d) & RWCState.DST_MASK
    if bitmask & SET_F:
        rwc.fidelity = 0
    # Bank flip (clear_dvalid) handled separately by hardware


def incrwc(rwc: RWCState, rwc_a=0, rwc_b=0, rwc_d=0, rwc_cr=0):
    """Execute INCRWC instruction."""
    CR_A, CR_B, CR_D = 1, 2, 4
    if rwc_cr & CR_A:
        rwc.srca_cr = (rwc.srca_cr + rwc_a) & RWCState.SRCA_MASK
        rwc.srca = rwc.srca_cr
    else:
        rwc.srca = (rwc.srca + rwc_a) & RWCState.SRCA_MASK
    if rwc_cr & CR_B:
        rwc.srcb_cr = (rwc.srcb_cr + rwc_b) & RWCState.SRCB_MASK
        rwc.srcb = rwc.srcb_cr
    else:
        rwc.srcb = (rwc.srcb + rwc_b) & RWCState.SRCB_MASK
    if rwc_cr & CR_D:
        rwc.dst_cr = (rwc.dst_cr + rwc_d) & RWCState.DST_MASK
        rwc.dst = rwc.dst_cr
    else:
        rwc.dst = (rwc.dst + rwc_d) & RWCState.DST_MASK


def mvmul_dst_row(rwc: RWCState, dst_field=0, dest_target_offset=0, dest_regw_base=0):
    """Compute the Dst row used by MVMUL."""
    row = dst_field + dest_target_offset + rwc.dst + dest_regw_base
    return row & 0x3F8  # align to 8-row block, masked to 10 bits

def sfpload_row_column(rwc: RWCState, imm10=0, dest_target_offset=0,
                        dest_regw_base=0, lane=0, addr_bit1_exchange=False):
    """Compute the Dst row and column accessed by SFPLOAD/SFPSTORE for a given lane."""
    addr = (imm10 + dest_target_offset + rwc.dst + dest_regw_base) & 0x3FF
    row    = (addr & ~3) + (lane // 8)
    column = (lane & 7) * 2
    if (addr & 2) or addr_bit1_exchange:
        column += 1
    return row & 0x3FF, column & 0xF

7. Source References

| File | Content |
|---|---|
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/RWCs.md | RWC state definition, ApplyAddrMod pseudocode, instruction list |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/SETRWC.md | SETRWC encoding and functional model |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/INCRWC.md | INCRWC encoding and functional model |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/MVMUL.md | MVMUL functional model (RWC consumption) |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/ELWADD.md | ELWADD functional model |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/SFPLOAD.md | SFPLOAD: Dst addressing, ApplyPartialAddrMod |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/SFPSTORE.md | SFPSTORE: Dst addressing |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/ADCs.md | ADC state definition and channel usage table |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/SETADC.md | SETADC instruction |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/SETADCXY.md | SETADCXY instruction |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/SETADCZW.md | SETADCZW instruction |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/SETADCXX.md | SETADCXX instruction |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/INCADCXY.md | INCADCXY instruction |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/INCADCZW.md | INCADCZW instruction |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/Packers/InputAddressGenerator.md | Packer Dst→L1 address computation using ADCs |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/Packers/OutputAddressGenerator.md | Packer L1 output address computation |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/Dst.md | Dst register file layout, 16-bit vs 32-bit rows |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/SrcASrcB.md | SrcA/SrcB register file layout and fidelity phase details |
| tt-llk/tt_llk_blackhole/common/inc/ckernel_addrmod.h | addr_mod_t struct, field layout, SETC16 config register addresses |
| tt-llk/tt_llk_blackhole/common/inc/ckernel_instr_params.h | p_setrwc, p_setadc constant definitions |
| tt-llk/tt_llk_blackhole/common/inc/ckernel_ops.h | TT_OP_SETRWC, TT_OP_INCRWC, TT_OP_SETADC, TT_OP_SETADCXY, etc. macros |
| tt-llk/tt_llk_blackhole/llk_lib/llk_math_matmul.h | matmul_configure_addrmod(): actual addr_mod setup for all tile shapes |
| tt-metal/tt_metal/hw/inc/internal/tt-1xx/blackhole/cfg_defines.h | ADDR_MOD_AB_SEC[0..7]_*, ADDR_MOD_DST_SEC[0..7]_*, ADDR_MOD_BIAS_SEC[0..7]_*, ADDR_MOD_PACK_SEC[0..3]_* ADDR32 offsets |
| blackhole-py/disasms/matmul_peak/matmul_trisc1.S | TRISC1 (math) disassembly: SETC16 addr_mod config + MVMUL replay buffer |
| blackhole-py/disasms/matmul_peak/matmul_trisc2.S | TRISC2 (pack) disassembly: SETADCXY/SETADCZW for packer ADC init |
| blackhole-py/dsl.py | TensixOp definitions with opcode values; encoding note: TTI_ instructions appear in stream left-shifted by 2 bits |