unpack data path
Two unpackers that read L1 tile data, perform format conversion, and write to SrcA/SrcB register files, with address and data path configuration for tiling operations.
Unpack Data Path
Overview
The Tensix coprocessor has two unpackers: Unpacker 0 (moves L1 data to SrcA or directly to Dest) and Unpacker 1 (moves L1 data to SrcB). Both read from L1 memory, perform format conversion, and write into register files. They operate concurrently and are controlled by the UNPACR instruction issued from TRISC0 (the unpack thread).
The full pipeline from software perspective:
L1 tile bytes
│
▼
[TileDescriptor: InDataFormat, XDim, YDim, ZDim, WDim, blobs_per_xy_plane]
│
▼
[ADC address counters: X0, Y0, Z0, W0 (L1 input), X1, Y1, Z1, W1 (Src output)]
│
▼
[Format conversion: BFP expansion, FP conversion, bit rearrangement]
│
▼
SrcA[bank][row][col] (19-bit elements, 2 banks × 64 rows × 16 cols)
SrcB[bank][row][col] (19-bit elements, 2 banks × 64 rows × 16 cols)
Dst[row][col] (16-bit or 32-bit elements, 1024 rows × 16 cols)
The primary instruction is UNPACR. Secondary instructions (UNPACR_NOP, SETADC, SETADCXY, SETADCZW, SETADCXX, INCADCXY, INCADCZW) configure ADC state, signal bank handoff, and pop overlay stream messages.
Sources:
- tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/UNPACR_Regular.md
- tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/Unpackers/
- tt-llk/tt_llk_blackhole/common/inc/cunpack_common.h
- tt-llk/tt_llk_blackhole/llk_lib/llk_unpack_AB.h, llk_unpack_A.h, llk_unpack_tilize.h, llk_unpack_untilize.h
- tt-metal/tt_metal/hw/inc/internal/tt-1xx/blackhole/cfg_defines.h
1. L1 Tile Data Layouts (Per Format)
1.1 Non-BFP Formats (FP32, TF32, BF16, FP16, FP8, INT8, UINT8, INT16, INT32)
Non-BFP tiles are flat packed arrays of elements. There is no separate exponent section.
Each tile in L1 begins with a tile header of (1 + DigestSize) * 16 bytes. In practice DigestSize = 0, so the header is 16 bytes (one 16-byte block). The header is skipped when computing InAddr_Datums; elements begin immediately after.
After the header, elements are laid out in row-major order within each face, then faces are concatenated:
[16-byte tile header]
[Face 0: XDim × YDim elements, DatumSizeBytes each]
[Face 1: ...]
...
[Face ZDim-1: ...]
For a standard 32×32 BF16 tile (XDim=256, YDim=1, ZDim=4 faces, DatumSizeBytes=2):
- Header: 16 bytes
- Each face: 256 elements × 1 row × 2 bytes = 512 bytes
- Total: 16 + 4 × 512 = 2064 bytes
Critical: XDim is the total number of datums per Z-slice (one “row” of the tile-level iteration). For a 16×16 face: XDim = 256, YDim = 1, ZDim = 4 (four faces per tile).
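The size arithmetic for a flat tile can be checked with a small sketch. `flat_tile_bytes` is a hypothetical helper, not part of tt-llk or tt-metal; it only assumes the 16-byte header unit and the XDim/YDim/ZDim convention described in this section.

```python
def flat_tile_bytes(x_dim, y_dim, z_dim, datum_size_bytes, digest_size=0, w_dim=1):
    """Bytes occupied in L1 by one non-BFP tile, header included (illustrative helper)."""
    header = (1 + digest_size) * 16                      # tile header, in bytes
    payload = x_dim * y_dim * z_dim * w_dim * datum_size_bytes
    return header + payload


# 32x32 BF16 tile: four 16x16 faces, each flattened to XDim = 256 datums.
print(flat_tile_bytes(x_dim=256, y_dim=1, z_dim=4, datum_size_bytes=2))  # 2064
```

The result is the familiar 2048-byte payload plus the 16-byte header.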
1.2 BFP Tile Layout in L1
BFP tiles (BFP8/BFP4/BFP2, both A and B exponent variants) have a richer structure.
For uncompressed BFP tiles (IsUncompressed = true):
[16-byte tile header]
[Exponent section: ceil(NumExponents/16) × 16 bytes, 16-byte aligned]
[Mantissa section: flat array of mantissa bits, DatumSizeBytes each]
Where:
- NumExponents = ceil(NumElements / 16): one exponent byte per 16 datums
- NumElements = XDim × YDim × ZDim × WDim
- DatumSizeBytes = 1 for BFP8, 0.5 for BFP4, 0.25 for BFP2
Exception for BFP4/BFP2: When NoBFPExpSection is set in the tile descriptor (ConfigDescriptor.NoBFPExpSection = true), the exponent section is omitted and a forced shared exponent (FORCED_SHARED_EXP_shared_exp) is used instead. BFP8 always has an exponent section regardless of NoBFPExpSection.
The address computation (from UNPACR functional model):
// After tile header:
InAddr = (BaseAddress + 1 + DigestSize) * 16;
// For BFP: exponent section starts here
if (IsBFPFormat(InDataFormat) && !REG2_Force_shared_exp) {
    InAddr_Exponents = InAddr;
    if (InDataFormat == BFP8 || InDataFormat == BFP8a || !NoBFPExpSection) {
        NumElements = XDim * YDim * ZDim * WDim;
        NumExponents = ceil(NumElements / 16.0);
        InAddr += ceil(NumExponents / 16.0) * 16; // 16-byte aligned
    }
}
// Mantissa/datum section starts here
InAddr_Datums = InAddr;
// Advance by FirstDatum * DatumSizeBytes to reach the starting element
InAddr_Datums += FirstDatum * DatumSizeBytes;
// Exponent pointer also offset by FirstDatum/16 to reach the starting exponent
InAddr_Exponents += FirstDatum / 16.0;
Exponent sharing: Each exponent byte covers exactly 16 consecutive mantissa datums. The exponent pointer advances by 1/16 per datum (i.e., one new exponent byte every 16 datums). The index into the exponent array is floor(InAddr_Exponents) for integer byte access.
Forced shared exponent (REG2_Force_shared_exp = 1): The exponent section is omitted entirely; all datums use the value in UNP[n].FORCED_SHARED_EXP_shared_exp. Useful for INT8→BF16 conversion.
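As a rough check of the uncompressed BFP layout, the sketch below computes where the exponent and mantissa sections start and how large the tile is. `bfp_section_offsets` and its `has_exp_section` flag are hypothetical names introduced here; the flag stands in for the combined NoBFPExpSection / Force_shared_exp decision in the address computation above.

```python
import math


def bfp_section_offsets(num_elements, datum_size_bytes, digest_size=0, has_exp_section=True):
    """Byte offsets (relative to the tile base) of the exponent and mantissa sections."""
    header = (1 + digest_size) * 16
    exp_offset = header
    exp_bytes = 0
    if has_exp_section:
        num_exponents = math.ceil(num_elements / 16)     # one exponent byte per 16 datums
        exp_bytes = math.ceil(num_exponents / 16) * 16   # padded to a 16-byte boundary
    mant_offset = exp_offset + exp_bytes
    total = mant_offset + math.ceil(num_elements * datum_size_bytes)
    return exp_offset, mant_offset, total


print(bfp_section_offsets(1024, 1))     # BFP8 32x32 tile: (16, 80, 1104)
print(bfp_section_offsets(1024, 0.5))   # BFP4 32x32 tile: (16, 80, 592)
```

For a 32×32 BFP8 tile this gives 64 exponent bytes and 1024 mantissa bytes after the 16-byte header.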
1.3 Compressed Tile Layout
For compressed tiles (IsUncompressed = false), the layout is:
[16-byte tile header]
[Row Start Index (RSI) section: ceil((NumRows+1)*2/16)*16 bytes, 16-byte aligned]
- Array of uint16_t: RSI[0..NumRows], where RSI[i] = index of the first datum of row i in the datum stream (the functional model in section 2.3 multiplies it by DatumSizeBytes)
- Used to seek to a specific compressed row
[Exponent section: same as uncompressed if BFP format]
[Interleaved datum+delta stream:
- 32 datums (DatumSizeBytes each)
- 32 RLE delta nibbles (4 bits each = 16 bytes total)
- 32 datums
- 32 RLE delta nibbles
- ...]
Where NumRows = YDim * ZDim * WDim for regular tiles, or BlobsPerXYPlane * ZDim * WDim when blobs are used.
Each RLE nibble specifies how many zeros to insert after the corresponding datum (0-15).
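The zero-insertion rule can be illustrated with a short sketch. It takes the delta nibbles as an already-decoded list of integers, because the packing order of nibbles within the 16-byte group is not stated here; `expand_rle` is a hypothetical helper, not a hardware or library function.

```python
def expand_rle(datums, zero_runs):
    """Insert zero_runs[i] zeros after datums[i]; each run length must fit a 4-bit nibble."""
    if len(datums) != len(zero_runs):
        raise ValueError("one delta nibble is expected per datum")
    out = []
    for datum, run in zip(datums, zero_runs):
        if not 0 <= run <= 15:
            raise ValueError("a delta nibble encodes 0..15 zeros")
        out.append(datum)
        out.extend([0] * run)
    return out


print(expand_rle([5, 7, 9], [2, 0, 1]))  # [5, 0, 0, 7, 9, 0]
```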
2. UNPACR Functional Model
This is the complete functional model of UNPACR (Regular mode). See source at UNPACR_Regular.md.
2.1 Instruction Encoding
TT_OP_UNPACR(
/* u1 */ WhichUnpacker, // 0=SrcA/Dst, 1=SrcB
// 8-bit packed increment field:
((/* u2 */ Ch1YInc) << 6) +
((/* u2 */ Ch1ZInc) << 4) +
((/* u2 */ Ch0YInc) << 2) +
/* u2 */ Ch0ZInc,
false, // not FlushCache or IncrCtxCounter form
/* u3 */ ContextNumber, // which context to use (MultiContextMode only)
/* u2 */ ContextADC, // which ADC set to use (MultiContextMode only)
/* bool */ MultiContextMode, // enable multi-context
/* bool */ FlipSrc, // transfer bank to MatrixUnit and flip
false,
/* bool */ AllDatumsAreZero, // write zeros regardless of L1 content
/* bool */ UseContextCounter, // use hardware context counter
/* bool */ RowSearch, // BFP blob row search mode
false,
false)
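The 8-bit increment field is four 2-bit values packed together. A minimal sketch of that packing, using a hypothetical helper name (`pack_addr_inc`) rather than any tt-llk macro:

```python
def pack_addr_inc(ch1_y_inc, ch1_z_inc, ch0_y_inc, ch0_z_inc):
    """Pack four 2-bit ADC increments into the 8-bit UNPACR increment field."""
    fields = (ch1_y_inc, ch1_z_inc, ch0_y_inc, ch0_z_inc)
    if any(not 0 <= f <= 3 for f in fields):
        raise ValueError("each increment is a 2-bit value (0..3)")
    return (ch1_y_inc << 6) | (ch1_z_inc << 4) | (ch0_y_inc << 2) | ch0_z_inc


# Advance to the next face on both the L1 side (Ch0 Z) and the Src side (Ch1 Z).
print(bin(pack_addr_inc(0, 1, 0, 1)))  # 0b10001
```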
2.2 Phase 1: Context and Config Selection
StateID = ThreadConfig[CurrentThread].CFG_STATE_ID_StateID
ConfigState = Config[StateID]
CurrentUnpacker = Unpackers[WhichUnpacker]
# Determine context
if MultiContextMode:
    if UseContextCounter:
        WhichContext = CurrentUnpacker.ContextCounter[CurrentThread]
    else:
        WhichContext = ContextNumber
    WhichContext += ThreadConfig[CurrentThread].UNPACK_MISC_CFG_CfgContextOffset[WhichUnpacker]
    WhichADC = ContextADC
    # WhichUnpacker==1 requires WhichContext < 2; WhichADC must not be 3
else:
    WhichContext = 0
    WhichADC = CurrentThread
ConfigDescriptor = ConfigState.THCON_SEC[WhichUnpacker].REG0_TileDescriptor
# Determine IsUncompressed
if MultiContextMode:
    IsUncompressed = ConfigState.THCON_SEC[WhichUnpacker].REG2_Disable_zero_compress_cntx[WhichContext]
else:
    IsUncompressed = ConfigDescriptor.IsUncompressed
# Tile dimensions
if MultiContextMode and WhichUnpacker == 0:
    XDim = ConfigState.THCON_SEC[0].REG5_Tile_x_dim_cntx[WhichContext & 3]
else:
    XDim = ConfigDescriptor.XDim
YDim = ConfigDescriptor.YDim
ZDim = max(ConfigDescriptor.ZDim, 1)
WDim = max(ConfigDescriptor.WDim, 1)
# Data format
if MultiContextMode and ConfigState.THCON_SEC[WhichUnpacker].REG2_Ovrd_data_format:
    InDataFormat = ConfigState.THCON_SEC[WhichUnpacker].REG7_Unpack_data_format_cntx[WhichContext]
    OutDataFormat = ConfigState.THCON_SEC[WhichUnpacker].REG7_Unpack_out_data_format_cntx[WhichContext]
else:
    InDataFormat = ConfigDescriptor.InDataFormat
    OutDataFormat = ConfigState.THCON_SEC[WhichUnpacker].REG2_Out_data_format
2.3 Phase 2: Input Address Computation
# Base address (16-byte units)
if MultiContextMode and WhichContext != 0:
    InAddr = ConfigState.THCON_SEC[WhichUnpacker].REG3_Base_cntx[WhichContext].address \
           + (ConfigState.THCON_SEC[WhichUnpacker].REG7_Offset_cntx[WhichContext & 3].address & 0xffff)
else:
    InAddr = ConfigState.THCON_SEC[WhichUnpacker].REG3_Base_address \
           + (ConfigState.THCON_SEC[WhichUnpacker].REG7_Offset_address & 0xffff)
# Skip tile header: (1 + DigestSize) * 16 bytes
InAddr = (InAddr + 1 + ConfigDescriptor.DigestSize) * 16  # now in bytes
# For compressed tiles: RSI section
InAddr_RowStart = None
if not IsUncompressed:
    InAddr_RowStart = InAddr  # pointer to uint16_t RSI array
    if ConfigDescriptor.BlobsPerXYPlane:
        NumBlobs = ConfigDescriptor.BlobsPerXYPlane * ZDim * WDim
        InAddr += ceil_16((NumBlobs + 1) * 2)
    else:
        NumRows = YDim * ZDim * WDim
        InAddr += ceil_16((NumRows + 1) * 2)
# For BFP: exponent section
InAddr_Exponents = None
if IsBFPFormat(InDataFormat) and not ConfigState.THCON_SEC[WhichUnpacker].REG2_Force_shared_exp:
    InAddr_Exponents = InAddr
    if InDataFormat in (BFP8, BFP8a) or not ConfigDescriptor.NoBFPExpSection:
        NumElements = XDim * YDim * ZDim * WDim
        NumExponents = ceil(NumElements / 16.0)
        InAddr += ceil_16(NumExponents)
# Compute FirstDatum and InputNumDatums from ADC
ADC_XY = ADCs[WhichADC].Unpacker[WhichUnpacker].Channel[0]
ADC_ZW = ADCs[CurrentThread].Unpacker[WhichUnpacker].Channel[0]
if IsUncompressed:
    if not RowSearch:
        XPos = ADC_XY.X
        YPos = ADC_XY.Y
        XEnd = ADCs[WhichADC].Unpacker[WhichUnpacker].Channel[1].X + 1
    # (RowSearch/BlobsPerXYPlane path elided for brevity)
    FirstDatum = ((ADC_ZW.W * ZDim + ADC_ZW.Z) * YDim + YPos) * XDim + XPos
    InputNumDatums = XEnd - XPos
else:
    # Compressed: RSI lookup
    InAddr_RowStart += (ADC_ZW.W * ZDim + ADC_ZW.Z) * YDim * 2  # seek to Z/W plane
    FirstDatum = RSI_read(InAddr_RowStart, ADC_XY.Y & 0xff)  # uint16_t lookup
    InputNumDatums = RSI_read(InAddr_RowStart, (ADC_XY.Y & 0xff) + 1) - FirstDatum
# Datum address
InAddr_Datums = InAddr + FirstDatum * DatumSizeBytes
# Exponent pointer offset to match FirstDatum
if InAddr_Exponents is not None:
    InAddr_Exponents += FirstDatum / 16.0
# Circular FIFO wrap
limit = ConfigState.THCON_SEC[WhichUnpacker].Unpack_limit_address * 16
fifo = ConfigState.THCON_SEC[WhichUnpacker].Unpack_fifo_size * 16
def WrapAddr(addr):
    if addr > limit:
        addr -= fifo
    return addr
# Compare against None explicitly so a legitimate address of 0 is still wrapped
InAddr_Exponents = WrapAddr(InAddr_Exponents) if InAddr_Exponents is not None else None
InAddr_Datums = WrapAddr(InAddr_Datums)
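Two pieces of this phase are easy to exercise in isolation: the linear FirstDatum index for uncompressed tiles and the FIFO wrap. The sketch below restates them as plain functions; the names `first_datum` and `wrap_addr` and the sample limit/size values are illustrative assumptions, not register contents.

```python
def first_datum(w, z, y, x, x_dim, y_dim, z_dim):
    """Linear datum index of (W, Z, Y, X) inside an uncompressed tile."""
    return ((w * z_dim + z) * y_dim + y) * x_dim + x


def wrap_addr(addr, limit, fifo_size):
    """Circular-FIFO wrap: step back by the FIFO size once past the limit."""
    return addr - fifo_size if addr > limit else addr


# 32x32 tile seen as four faces of 256 datums: the face index Z selects a 256-datum slice.
starts = [first_datum(0, z, 0, 0, x_dim=256, y_dim=1, z_dim=4) for z in range(4)]
print(starts)                              # [0, 256, 512, 768]
print(wrap_addr(0x1100, 0x1000, 0x800))    # 2304 (0x900)
```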
2.4 Phase 3: Output Address Computation
ADC_Out = ADCs[CurrentThread].Unpacker[WhichUnpacker].Channel[1]
OutAddr = (ConfigState.UNP[WhichUnpacker].ADDR_BASE_REG_1_Base
           + ADC_Out.Y * ConfigState.UNP[WhichUnpacker].ADDR_CTRL_XY_REG_1_Ystride
           + ADC_Out.Z * ConfigState.UNP[WhichUnpacker].ADDR_CTRL_ZW_REG_1_Zstride
           + ADC_Out.W * ConfigState.UNP[WhichUnpacker].ADDR_CTRL_ZW_REG_1_Wstride)
# Scale OutAddr by element size
if OutDataFormat in (FP32, TF32, INT32):
    OutAddr >>= 2  # 4-byte elements
elif OutDataFormat in (FP16, BF16, INT16):
    OutAddr >>= 1  # 2-byte elements
# else INT8/UINT8: OutAddr is 1-byte units
# Apply per-context dest address offset (MultiContextMode, Unpacker 0 only)
if MultiContextMode and WhichUnpacker == 0:
    CtxOutAddr = ConfigState.THCON_SEC[0].REG5_Dest_cntx[WhichContext & 3].address
    if UnpackToDst or ConfigState.UNP[0].ADD_DEST_ADDR_CNTR_add_dest_addr_cntr:
        OutAddr += CtxOutAddr
    else:
        OutAddr = CtxOutAddr
Dest address for SrcA/SrcB: OutAddr indexes elements (not bytes). Row = OutAddr / 16, Col = OutAddr & 15.
SrcA Row offset: The SrcA register file is indexed as Row = (OutAddr/16 - 4) + CurrentUnpacker.SrcRow[CurrentThread]. The -4 accounts for a fixed 4-row header offset. If SRCA_SET_SetOvrdWithAddr is set, the raw row index is used directly (for unpack-to-dest path).
2.5 Phase 4: Row Stride (Tilize Mode vs Normal)
DiscontiguousInputRows = ConfigState.THCON_SEC[WhichUnpacker].REG2_Tileize_mode
if DiscontiguousInputRows:
    # RowStride is the stride between input rows in L1 (tilize mode)
    RowStride = ((ConfigState.THCON_SEC[WhichUnpacker].REG2_Shift_amount_cntx[0] << 4)
               | (ConfigState.THCON_SEC[WhichUnpacker].REG2_Shift_amount_cntx[1] << 8)
               | (ConfigState.THCON_SEC[WhichUnpacker].REG2_Shift_amount_cntx[2] << 12))
    # max RowStride = 65520 bytes (12-bit precision with 4-bit shift)
else:
    RowStride = DatumSizeBytes * 16  # contiguous: advance by one row of 16 elements
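The stride assembly can be tried out directly. In the sketch below, `row_stride_bytes` mirrors the expression above, while `split_shift_amount` is a hypothetical inverse added only to show which field values produce a given stride; how software actually writes the three fields is not shown in this section.

```python
def row_stride_bytes(shift_amount_cntx):
    """Combine the three 4-bit Shift_amount_cntx fields into a byte stride."""
    s0, s1, s2 = (v & 0xF for v in shift_amount_cntx)
    return (s0 << 4) | (s1 << 8) | (s2 << 12)


def split_shift_amount(stride_bytes):
    """Inverse direction (illustrative): split a 16-byte-aligned stride into three fields."""
    if stride_bytes % 16 or not 0 <= stride_bytes <= 0xFFF0:
        raise ValueError("stride must be a multiple of 16 in 0..65520")
    units = stride_bytes >> 4
    return [units & 0xF, (units >> 4) & 0xF, (units >> 8) & 0xF]


print(row_stride_bytes([0xF, 0xF, 0xF]))   # 65520, the maximum stride
print(split_shift_amount(64 * 2))          # [8, 0, 0] for 64 BF16 elements per row
```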
2.6 Phase 5: Main Unpack Loop
for i in range(InputNumDatums):
    # Read datum from L1
    DatumBits = ReadL1(InAddr_Datums, DatumSizeBytes)
    InAddr_Datums += DatumSizeBytes
    # Advance row after every 16 elements
    if (i + 1) % 16 == 0:
        InAddr_Datums -= DatumSizeBytes * 16
        InAddr_Datums += RowStride
        InAddr_Datums = WrapAddr(InAddr_Datums)
    # Read exponent for BFP formats
    ExpBits = 0
    if IsBFPFormat(InDataFormat):
        if REG2_Force_shared_exp:
            ExpBits = UNP[WhichUnpacker].FORCED_SHARED_EXP_shared_exp
        else:
            ExpBits = ReadL1Byte(floor(InAddr_Exponents))
            InAddr_Exponents += 1.0 / 16.0
            if InAddr_Exponents == floor(InAddr_Exponents / 16.0) * 16.0:
                InAddr_Exponents = WrapAddr(InAddr_Exponents)
    # Format conversion
    Datum = FormatConversion(InDataFormat, OutDataFormat, DatumBits, ExpBits,
                             WhichUnpacker, UnpackToDst)
    if AllDatumsAreZero:
        Datum = 0
    # Write to output register file
    Bank = CurrentUnpacker.SrcBank
    Row = OutAddr // 16
    Col = OutAddr & 15
    OutAddr += 1
    if WhichUnpacker == 1:
        # SrcB
        while SrcB[Bank].AllowedClient != UNPACKERS: wait
        Row = (Row + CurrentUnpacker.SrcRow[CurrentThread]) & 0x3f
        SrcB[Bank][Row][Col] = Datum
    else:
        # SrcA or Dst
        while SrcA[Bank].AllowedClient != UNPACKERS: wait
        if not UnpackToDst:
            # SrcA path: skip 4 header rows, apply ColShift
            if Row < 4 or Col < ColShift: continue
            Row -= 4
            Col -= ColShift
            if not SRCA_SET_SetOvrdWithAddr:
                Row += CurrentUnpacker.SrcRow[CurrentThread]
            if Transpose:
                RowLowBits = Row & 0xf
                RowLowBits, Col = Col, RowLowBits  # swap
                Row = (Row & ~0xf) | RowLowBits
            SrcA[Bank][Row & 0x3f][Col] = Datum
        else:
            # Unpack-to-Dest path
            Row -= 4
            if SRCA_SET_SetOvrdWithAddr:
                Row &= 0xf
            else:
                Row &= 0x3ff
            if OutDataFormat in (FP32, TF32, INT32):
                Dst32b[Row][Col] = Datum
            else:
                Dst16b[Row][Col] = Datum
2.7 Phase 6: Post-instruction Counter Updates
# Context counter increment (MultiContextMode + UseContextCounter)
if MultiContextMode and UseContextCounter:
    IncrementedCounter = WhichContext + 1
    if IncrementedCounter >= (1 << ConfigState.THCON_SEC[WhichUnpacker].Context_count):
        IncrementedCounter = 0
    CurrentUnpacker.ContextCounter[CurrentThread] = IncrementedCounter
# ADC Y and Z increments (from instruction encoding)
# A set is used so that an ADC is incremented only once when WhichADC == CurrentThread
for thread in {CurrentThread, WhichADC}:
    ADCs[thread].Unpacker[WhichUnpacker].Channel[0].Y += Ch0YInc
    ADCs[thread].Unpacker[WhichUnpacker].Channel[0].Z += Ch0ZInc
    ADCs[thread].Unpacker[WhichUnpacker].Channel[1].Y += Ch1YInc
    ADCs[thread].Unpacker[WhichUnpacker].Channel[1].Z += Ch1ZInc
# Bank flip / SrcRow reset
SrcRowBase = ThreadConfig[CurrentThread].SRCA_SET_Base << 4  # (or SRCB_SET_Base for unp1)
if FlipSrc:
    # Transfer current bank to MatrixUnit, flip to other bank
    (SrcB if WhichUnpacker else SrcA)[CurrentUnpacker.SrcBank].AllowedClient = MATRIX_UNIT
    CurrentUnpacker.SrcBank ^= 1
    CurrentUnpacker.SrcRow[CurrentThread] = SrcRowBase
elif ConfigState.THCON_SEC[WhichUnpacker].Unpack_Src_Reg_Set_Upd:
    # Advance SrcRow by 16 rows for next unpack
    CurrentUnpacker.SrcRow[CurrentThread] += 16 + SrcRowBase
3. Format Conversion Details
3.1 Format Encoding
Data format is a 4-bit field. The canonical encoding is:
bits [1:0]: "size class"    bits [3:2]: "exp class"
0b0000 = FP32     0b0100 = TF32    0b1000 = INT32    0b1100 = (unused)
0b0001 = FP16     0b0101 = BF16    0b1001 = INT16    0b1101 = (unused)
0b0010 = BFP8a    0b0110 = BFP8    0b1010 = FP8      0b1110 = INT8
0b0011 = BFP4a    0b0111 = BFP4    0b1011 = BFP2a    0b1111 = BFP2
In the ISA docs the same encoding is expressed as:
| | 0b??11 | 0b??10 | 0b??01 | 0b??00 |
|---|---|---|---|---|
| 0b00?? | BFP4a | BFP8a | FP16 | FP32 |
| 0b01?? | BFP4 | BFP8 | BF16 | TF32 |
| 0b10?? | BFP2a | FP8 | INT16 | INT32 |
| 0b11?? | BFP2 | INT8 | (unused) | (unused) |
The DataFormat enum values in software (see pack-unpack-registers.md):
Float32=0, Float16=1, Bfp8a=2, Bfp4a=3, Tf32=4, Float16_b=5, Bfp8=6, Bfp4=7, Int32=8, Int16=9, Fp8_e5m2=10, Bfp2a=11, Int8=14, Bfp2=15, UInt32=24, Fp8_e4m3=26, UInt8=30
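A lookup table makes the 4-bit encoding easy to cross-check against the enum values. The sketch below follows the ISA-style table and the software enum listed above (Int8 = 14); `DATA_FORMAT_NAMES`, `decode_format`, and `is_bfp_format` are hypothetical names for illustration. Masking with 0xF is an observation about the listed values (24, 26 and 30 share their low nibble with Int32, Fp8_e5m2 and Int8), not a documented rule.

```python
DATA_FORMAT_NAMES = {
    0b0000: "FP32",  0b0001: "FP16",  0b0010: "BFP8a", 0b0011: "BFP4a",
    0b0100: "TF32",  0b0101: "BF16",  0b0110: "BFP8",  0b0111: "BFP4",
    0b1000: "INT32", 0b1001: "INT16", 0b1010: "FP8",   0b1011: "BFP2a",
    0b1110: "INT8",  0b1111: "BFP2",
}
BFP_FORMATS = {"BFP8", "BFP8a", "BFP4", "BFP4a", "BFP2", "BFP2a"}


def decode_format(code):
    """Name for the low 4 bits of a data-format value; None for unused encodings."""
    return DATA_FORMAT_NAMES.get(code & 0xF)


def is_bfp_format(code):
    """True when the format carries shared exponents."""
    return decode_format(code) in BFP_FORMATS


print(decode_format(5), decode_format(14), decode_format(12))  # BF16 INT8 None
print(is_bfp_format(6), is_bfp_format(1))                      # True False
```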
3.2 FormatConversion Pseudocode (Complete)
def FormatConversion(InDataFormat, OutDataFormat, DatumBits, ExpBits, WhichUnpacker, UnpackToDst):
    """
    Returns a 19-bit value for SrcA/SrcB, or 16-bit (or 32-bit) value for Dst.
    """
    if InDataFormat == FP32:
        if OutDataFormat == FP32:
            pass  # keep DatumBits as-is (32-bit, only valid for Dst)
        elif OutDataFormat == TF32:
            if UnpackToDst:
                return WriteDstFP32(DatumBits)  # TF32 in Dst = FP32
            else:
                return WriteSrcTF32(DatumBits >> 13)  # drop low 13 bits = 10-bit mantissa
        elif OutDataFormat == BF16:
            # Flush denormals to zero
            if not (DatumBits & 0x7f800000):
                DatumBits &= 0x80000000
            DatumBits >>= 16
            InDataFormat = BF16  # fall through to BF16 path
        elif OutDataFormat == FP16:
            DatumBits = FP32ToFP16(DatumBits)
            InDataFormat = FP16
        else:
            raise UndefinedBehaviour
    else:
        # For all non-FP32 inputs, InDataFormat must equal OutDataFormat
        if InDataFormat != OutDataFormat:
            raise UndefinedBehaviour
        # Normalize to 16-bit or 32-bit
        if InDataFormat == FP8:
            # E5M2: shift left 8 bits to align in FP16 position
            DatumBits <<= 8
            InDataFormat = FP16
        elif InDataFormat == FP8_E4M3:
            # E4M3 mode selected by THCON_SEC[n]_REG1_Unp_LF8_4b_exp
            # Conversion is implementation-specific; treated as FP16
            DatumBits = FP8E4M3ToFP16(DatumBits)
            InDataFormat = FP16
        elif InDataFormat == BFP8:
            DatumBits = BFP8ToBF16(DatumBits, ExpBits)
            InDataFormat = BF16
        elif InDataFormat == BFP4:
            DatumBits = BFP8ToBF16(DatumBits << 4, ExpBits)
            InDataFormat = BF16
        elif InDataFormat == BFP2:
            DatumBits = BFP8ToBF16(DatumBits << 6, ExpBits)
            InDataFormat = BF16
        elif InDataFormat == BFP8a:
            DatumBits = BFP8aToFP16(DatumBits, ExpBits)
            InDataFormat = FP16
        elif InDataFormat == BFP4a:
            DatumBits = BFP8aToFP16(DatumBits << 4, ExpBits)
            InDataFormat = FP16
        elif InDataFormat == BFP2a:
            DatumBits = BFP8aToFP16(DatumBits << 6, ExpBits)
            InDataFormat = FP16
        elif InDataFormat == INT8:
            # INT8 sign-magnitude or UINT8 (selected by ALU_FORMAT_SPEC_REG0_SrcAUnsigned/SrcBUnsigned)
            StateID = ThreadConfig[CurrentThread].CFG_STATE_ID_StateID
            IsUnsigned = ConfigState.ALU_FORMAT_SPEC_REG0_SrcBUnsigned if WhichUnpacker else \
                         ConfigState.ALU_FORMAT_SPEC_REG0_SrcAUnsigned
            Sign = 0 if IsUnsigned else (DatumBits & 0x80)
            DatumBits -= Sign
            if DatumBits:
                DatumBits |= (16 << 10)  # dummy FP16 exponent for Integer "8" overlay
            DatumBits |= (Sign << 8)
            InDataFormat = FP16
        elif InDataFormat == TF32:
            if UnpackToDst:
                return WriteDstFP32(DatumBits)
            else:
                raise UndefinedBehaviour  # TF32 as input only valid for Dst
    # Final bit rearrangement to output format
    if InDataFormat == INT16:
        if UnpackToDst:
            return DatumBits & 0xffff
        else:
            # Rearrange INT16 to SrcA/SrcB layout: (hi<<3) | lo
            return ((DatumBits & 0xff00) << 3) | (DatumBits & 0xff)
    elif InDataFormat == INT32:
        if UnpackToDst:
            return WriteDstFP32(DatumBits)
        else:
            raise UndefinedBehaviour
    elif InDataFormat == FP32:
        if UnpackToDst:
            return WriteDstFP32(DatumBits)
        else:
            raise UndefinedBehaviour
    elif InDataFormat == BF16:
        if UnpackToDst:
            return WriteDstBF16(DatumBits)
        else:
            return WriteSrcBF16(DatumBits)
    elif InDataFormat == FP16:
        if UnpackToDst:
            return WriteDstFP16(DatumBits)
        else:
            return WriteSrcFP16(DatumBits)
3.3 Register Layout Transforms
The SrcA/SrcB register files store data in a specific bit layout different from the L1 representation. These transforms rearrange floating-point bits:
# === SrcA / SrcB (19-bit elements) ===
def WriteSrcTF32(x: int) -> int:
    """TF32: 1 sign + 8 exp + 10 mant → 19-bit Src field: Sign,Mant,Exp"""
    # Input:  [18]=Sign [17:10]=Exp (8 bits) [9:0]=Man (10 bits)
    # Output: [18]=Sign [17:8]=Man [7:0]=Exp
    Sign = x & 0x40000  # bit 18
    Exp = x & 0x3fc00   # bits 17:10 (8 bits)
    Man = x & 0x003ff   # bits 9:0 (10 bits)
    return Sign | (Man << 8) | (Exp >> 10)

def WriteSrcBF16(x: int) -> int:
    """BF16 → TF32 Src layout (zero-extends mantissa)"""
    # Shifts 16-bit BF16 left by 3 to produce 19-bit TF32
    return WriteSrcTF32(x << 3)

def WriteSrcFP16(x: int) -> int:
    """FP16 → TF32 Src layout"""
    # FP16: Sign[15], Exp[14:10], Man[9:0] → expand to 19-bit TF32 form
    return WriteSrcTF32(((x & 0x8000) << 3) | (x & 0x7fff))

# === Dest register (16-bit or 32-bit elements) ===
def WriteDstFP16(x: int) -> int:
    """FP16 → Dst layout: Sign,Man,Exp (fields swapped)"""
    Sign = x & 0x8000
    Exp = x & 0x7c00
    Man = x & 0x03ff
    return Sign | (Man << 5) | (Exp >> 10)

def WriteDstBF16(x: int) -> int:
    """BF16 → Dst layout: Sign,Man,Exp (fields swapped)"""
    Sign = x & 0x8000
    Exp = x & 0x7f80
    Man = x & 0x007f
    return Sign | (Man << 8) | (Exp >> 7)

def WriteDstFP32(x: int) -> int:
    """FP32 → Dst layout: WriteDstBF16 applied to high 16 bits, low 16 unchanged"""
    Hi = WriteDstBF16(x >> 16)
    Lo = x & 0xffff
    return (Hi << 16) | Lo
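To see the field swap on concrete values, the sketch below restates a few of the transforms in snake_case so it runs on its own, and adds `read_dst_bf16`, a hypothetical inverse that exists only to check that the layout is a pure bit permutation.

```python
def write_src_tf32(x):
    """[18]=Sign [17:10]=Exp [9:0]=Man  ->  [18]=Sign [17:8]=Man [7:0]=Exp."""
    return (x & 0x40000) | ((x & 0x003FF) << 8) | ((x & 0x3FC00) >> 10)


def write_src_bf16(x):
    return write_src_tf32(x << 3)


def write_dst_bf16(x):
    return (x & 0x8000) | ((x & 0x007F) << 8) | ((x & 0x7F80) >> 7)


def write_dst_fp32(x):
    return (write_dst_bf16(x >> 16) << 16) | (x & 0xFFFF)


def read_dst_bf16(v):
    """Inverse of write_dst_bf16 (illustrative helper only)."""
    return (v & 0x8000) | ((v & 0x00FF) << 7) | ((v & 0x7F00) >> 8)


print(hex(write_src_bf16(0x3F80)))       # BF16 1.0  -> 0x7f (exponent 127 in the low byte)
print(hex(write_dst_bf16(0xBFC0)))       # BF16 -1.5 -> 0xc07f
print(hex(write_dst_fp32(0x3F800000)))   # FP32 1.0  -> 0x7f0000
print(hex(read_dst_bf16(0xC07F)))        # back to 0xbfc0
```

In every case the exponent ends up in the low byte and the mantissa sits directly below the sign bit.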
3.4 BFP to Floating-Point Conversion
def BFP8ToBF16(DatumBits: int, ExpBits: int) -> int:
    """BFP8 (B-exponent) → BF16"""
    Sign = DatumBits >> 7           # 1 bit
    Mag = (DatumBits & 0x7f) << 1   # 7-bit magnitude, shift left by 1 = 8 bits
    if Mag == 0:
        return 0xff80 if Sign else 0  # sign bit set: -Infinity; otherwise +0
    LZ = count_leading_zeros_8bit(Mag)
    Mag = (Mag << LZ) & 0xff        # normalize: top bit becomes the implicit 1
    ExpBits -= LZ
    return (Sign << 15) | (ExpBits << 7) | (Mag & 0x7e)

def BFP8aToFP16(DatumBits: int, ExpBits: int) -> int:
    """BFP8a (A-exponent, 5-bit exponent field) → FP16"""
    Sign = DatumBits >> 7
    Mag = (DatumBits & 0x7f) << 1
    if Mag == 0:
        return 0xfc00 if Sign else 0  # sign bit set: -Infinity; otherwise +0
    LZ = count_leading_zeros_8bit(Mag)
    Mag = (Mag << LZ) & 0xff
    ExpBits -= LZ
    # ExpBits must fit in 5 bits (no bits in 0xe0 range)
    assert not (ExpBits & 0xe0), "ExpBits overflow"
    return (Sign << 15) | (ExpBits << 10) | ((Mag & 0x7e) << 3)

# BFP4/BFP2 use the same routines with pre-shifted DatumBits:
# BFP4→BF16: BFP8ToBF16(DatumBits << 4, ExpBits)
# BFP2→BF16: BFP8ToBF16(DatumBits << 6, ExpBits)
# BFP4a→FP16: BFP8aToFP16(DatumBits << 4, ExpBits)
# BFP2a→FP16: BFP8aToFP16(DatumBits << 6, ExpBits)
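A few worked values help confirm the normalization step. The sketch below is a self-contained restatement of the BFP8 routine with an assumed `clz8` helper and a `bf16_to_float` decoder built on the standard `struct` module; it does not guard against the exponent underflowing when it is smaller than the leading-zero count.

```python
import struct


def clz8(v):
    """Leading zeros of a non-zero 8-bit value."""
    return 8 - v.bit_length()


def bfp8_to_bf16(datum_bits, exp_bits):
    sign = datum_bits >> 7
    mag = (datum_bits & 0x7F) << 1
    if mag == 0:
        return 0xFF80 if sign else 0
    lz = clz8(mag)
    mag = (mag << lz) & 0xFF      # top bit is now the implicit leading 1
    exp_bits -= lz                # keep the value unchanged after the shift
    return (sign << 15) | (exp_bits << 7) | (mag & 0x7E)


def bf16_to_float(bits):
    """Interpret a 16-bit BF16 pattern as a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]


# Shared exponent 127: a mantissa byte of 0x40 decodes to 1.0.
for datum in (0x40, 0x20, 0x60, 0xC0):
    print(hex(datum), bf16_to_float(bfp8_to_bf16(datum, 127)))
# 0x40 1.0, 0x20 0.5, 0x60 1.5, 0xc0 -1.0
```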
3.5 Format Conversion Table (Summary)
| L1 Input | Config (In=Out unless noted) | SrcA/SrcB output | Dst output |
|---|---|---|---|
| FP32 | FP32→TF32 | TF32 (19-bit) | FP32 (32-bit) |
| FP32 | FP32→BF16 | BF16 in TF32 (19-bit) | BF16 (16-bit) |
| FP32 | FP32→FP16 | FP16 in TF32 (19-bit) | FP16 (16-bit) |
| TF32 | FP32→TF32 | TF32 (same) | FP32 (32-bit) |
| BF16 | BF16 | BF16 in TF32 | BF16 (16-bit) |
| BFP8 | BFP8 | BFP8→BF16 in TF32 | BFP8→BF16 (16-bit) |
| BFP4 | BFP4 | BFP4→BF16 in TF32 | BFP4→BF16 (16-bit) |
| BFP2 | BFP2 | BFP2→BF16 in TF32 | BFP2→BF16 (16-bit) |
| BFP8a | BFP8a | BFP8a→FP16 in TF32 | BFP8a→FP16 (16-bit) |
| BFP4a | BFP4a | BFP4a→FP16 in TF32 | BFP4a→FP16 (16-bit) |
| BFP2a | BFP2a | BFP2a→FP16 in TF32 | BFP2a→FP16 (16-bit) |
| FP16 | FP16 | FP16 in TF32 | FP16 (16-bit) |
| FP8 E5M2 | FP8 | FP8→FP16 in TF32 | FP8→FP16 (16-bit) |
| INT8 (s-mag) | BFP8 + Force_shared_exp | INT8→BF16 in TF32 | INT8→BF16 |
| INT8 | INT8 | Int8 overlay on FP16 | Int8 (16-bit) |
| UINT8 | INT8 (SrcAUnsigned=1) | UInt8 overlay | UInt8 (16-bit) |
| INT16 | INT16 | Opaque 16-bit rearranged | INT16 (16-bit) |
| INT32 | INT32 | Not possible | INT32 (32-bit) |
4. ADC Counter Mechanics
4.1 ADC State Structure
Each of 3 Tensix threads has its own ADC state. Each ADC has entries for Unpacker 0, Unpacker 1, and Packers, each with 2 channels:
struct {
    struct {
        struct {
            uint18_t X, X_Cr; // X counter and checkpoint
            uint13_t Y, Y_Cr; // Y counter and checkpoint
            uint8_t Z, Z_Cr;  // Z counter and checkpoint
            uint8_t W, W_Cr;  // W counter and checkpoint
        } Channel[2];
    } Unpacker[2], Packers;
} ADCs[3]; // one per thread
Checkpoint values (_Cr) are used by ADDRCRZW/ADDRCRXY for ADC reset operations.
4.2 Channel Usage
| Counter | Channel 0 | Channel 1 |
|---|---|---|
| X | L1 input position within current row | End-of-row boundary (XEnd - 1) |
| Y | L1 row position within Z-face | Output Y position (Ystride multiplier) |
| Z | L1 Z-face (face index within tile) | Output Z position (Zstride multiplier) |
| W | L1 W-face | Output W position (Wstride multiplier) |
Channel 0 drives the L1 read address: which face (Z), which row (Y), which element within the row (X).
Channel 1 drives the Src/Dst write address:
- X1 = end of row (= face_r_dim × face_c_dim − 1); together with X0 it bounds the number of elements to write (XEnd = X1 + 1)
- Y1, Z1, W1 = output face/row position (combined with strides to form byte offset)
4.3 Instructions
SETADC — Set one counter
TT_SETADC(target_mask, // bits: PK=bit2, U1=bit1, U0=bit0
channel, // 0 or 1
xyzw, // 0=X, 1=Y, 2=Z, 3=W
new_value) // 18-bit; bits[17:16] = ThreadOverride
Sets the specified counter and its checkpoint (X_Cr, Y_Cr, etc.).
ThreadOverride (bits [17:16] of new_value):
- 0 = CurrentThread
- 1..3 = thread 0..2
SETADCXY — Set X and Y counters together
TT_SETADCXY(target_mask,
Y1Val, X1Val, Y0Val, X0Val, // 3-bit each
bit_mask) // bits: Y1=3, X1=2, Y0=1, X0=0 — which to update
SETADCZW — Set Z and W counters together
TT_SETADCZW(target_mask,
W1Val, Z1Val, W0Val, Z0Val, // 3-bit each
bit_mask) // bits: W1=3, Z1=2, W0=1, Z0=0
SETADCXX — Set both X counters (wider range)
TT_SETADCXX(target_mask,
X1Val, // 10-bit: end-of-row for channel 1 (XEnd-1)
X0Val) // 10-bit: start-of-row for channel 0
Used to program Channel[1].X = face_r_dim * face_c_dim - 1 (the datum count boundary).
INCADCXY / INCADCZW — Increment counters
TT_INCADCXY(target_mask, Y1Inc, X1Inc, Y0Inc, X0Inc)
TT_INCADCZW(target_mask, W1Inc, Z1Inc, W0Inc, Z0Inc)
Adds the increment to the current counter value. Used during untilize to advance Y pointer row-by-row.
ADDRCRXY / ADDRCRZW — Restore checkpoint values
TT_ADDRCRZW(target_mask, W1, Z1, W0, Z0, bit_mask)
Restores selected counters from their checkpoint (_Cr) values. Used in untilize mode to reset the Z counter back to its starting face after completing each row.
4.4 How ADC Drives Tile Traversal
Standard face-by-face unpacking of a 32×32 tile (4 faces of 16×16):
Init:
SETADCZW(UNP_AB, 0,0,0,0, 0b1111) // Z0=0, W0=0, Z1=0, W1=0
SETADCXY(UNP_AB, 0,0,0,0, 0b1011) // X0=0, Y0=0, Y1=0 (X1 is left untouched; the SETADCXX below programs it)
SETADCXX(UNP_A, face_r_dim*16-1, 0) // X1 = 255 for 16-row face
Per UNPACR (Ch1ZInc=1 advances to next face in SrcA):
- UNPACR reads XDim=256 datums starting at Z0*face, writes to SrcA starting at SrcRow
- Ch0ZInc=1: Z0 increments after each UNPACR, selecting next face in L1
- Ch1ZInc=1: Z1 increments, selecting next output row group in SrcA
For tilize mode, Y0 advances via INCADCXY to move to the next L1 row, while Z0 is reset with ADDRCRZW.
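The face-by-face walk can be simulated with a toy model of the two ADC channels. This is a sketch under simplifying assumptions: a single thread, WDim = 1, no checkpoints, and hypothetical names (`AdcChannel`, `unpack_faces`). It applies the FirstDatum formula from section 2.3 and the post-instruction Z increments from section 2.7.

```python
from dataclasses import dataclass


@dataclass
class AdcChannel:
    x: int = 0
    y: int = 0
    z: int = 0
    w: int = 0


def unpack_faces(num_faces=4, x_dim=256, y_dim=1, ch0_z_inc=1, ch1_z_inc=1):
    """Return (FirstDatum, InputNumDatums, Z1) for each UNPACR of a face-by-face walk."""
    ch0 = AdcChannel()
    ch1 = AdcChannel(x=x_dim - 1)            # X1 = XEnd - 1, as SETADCXX programs it
    trace = []
    for _ in range(num_faces):
        first = ((ch0.w * num_faces + ch0.z) * y_dim + ch0.y) * x_dim + ch0.x
        count = (ch1.x + 1) - ch0.x
        trace.append((first, count, ch1.z))
        ch0.z += ch0_z_inc                   # next face in L1
        ch1.z += ch1_z_inc                   # next output row group
    return trace


for step in unpack_faces():
    print(step)   # (0, 256, 0) then (256, 256, 1), (512, 256, 2), (768, 256, 3)
```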
5. Tilize Mode
5.1 What Tilize Does
Tilize converts row-major input data (a normal 2D array in L1) into the tile layout expected by SrcA. The input data is NOT in tile format — it is laid out as a contiguous 2D array where each row has block_c_dim elements.
tileize_mode = 1 in unpack_config_t enables tilize. The unpacker reads one row of 16 elements at a time from L1 (one 1×16 sub-row of a face), then jumps by RowStride bytes to the next row in L1.
RowStride is computed from Shift_amount_cntx[0..2] fields and equals block_c_dim * DatumSizeBytes. This is the byte distance between adjacent rows in the L1 row-major layout.
5.2 Tilize Configuration
// From _llk_unpack_tilize_init_():
config.f.tileize_mode = 1;
config.f.shift_amount = (SCALE_DATUM_SIZE(src_format, block_c_dim)) >> 4;
// shift_amount = (block_c_dim * bytes_per_datum) / 16
// RowStride = shift_amount << 4 = block_c_dim * bytes_per_datum
// Tile x_dim set to cover entire tile row (all faces in X direction):
Tile_x_dim = face_r_dim * num_faces * FACE_C_DIM;
// z_dim = 1 (the entire tile is treated as one Z-slice)
Tile_z_dim = 1;
// ADC: X end covers entire tile row
SETADCXX(UNP0, Tile_x_dim - 1, 0);
Each UNPACR call in tilize mode reads Tile_x_dim elements. After every 16-element sub-row, the L1 read pointer moves to RowStride bytes past the start of that sub-row, so consecutive sub-rows come from consecutive rows of the row-major block. These 16-element sub-rows, spaced RowStride bytes apart in L1, land in consecutive SrcA rows, which assembles the tile face by face.
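The gather pattern follows directly from the main unpack loop in section 2.6. A minimal sketch, with the FIFO wrap omitted and `tilize_subrow_addresses` as a hypothetical helper name:

```python
def tilize_subrow_addresses(base, tile_x_dim, block_c_dim, datum_size_bytes):
    """L1 byte address of the first datum of every 16-element sub-row read by one UNPACR."""
    row_stride = block_c_dim * datum_size_bytes      # byte distance between L1 rows
    addr = base
    starts = []
    for i in range(tile_x_dim):
        if i % 16 == 0:
            starts.append(addr)
        addr += datum_size_bytes
        if (i + 1) % 16 == 0:
            # Rewind to the sub-row start, then jump one full L1 row forward.
            addr += row_stride - 16 * datum_size_bytes
    return starts


# BF16 rows of 64 elements; one 16x16 face is 16 sub-rows, 128 bytes apart.
print(tilize_subrow_addresses(0, 256, 64, 2)[:4])    # [0, 128, 256, 384]
```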
5.3 TilizeA+B
The _llk_unpack_tilizeA_B_ variant unpacks SrcA one 1×16 row at a time (with UNPACR CH1_Y+=1 to advance the SrcA destination row). For each face:
- SrcB is loaded once for the entire face
- SrcA rows are loaded individually using a replay buffer
Face layout in L1 for tilizeA+B:
Face 0 top-left: base_address + tile_index * datum_size
Face 0 top-right: base_address + tile_index * datum_size + face_c_dim * datum_size
Face 1 top-left: base_address + block_c_dim * tile_height * datum_size + ...
Face 1 top-right: ...
6. Untilize Mode
Untilize reads SrcA-style tiled data from L1 and writes it row-major to SrcA (in the sense of presenting it row by row for math operations). It is the inverse of tilize.
The untilize loop reads 1×16 element rows from L1, using:
- UNPACR(SrcA, CH1_Y+=1, CH0_Z+=1) to advance both the L1 Z-pointer and the SrcA Y-pointer (INCADCXY itself only increments X and Y counters)
- ADDRCRZW(CH0_Z) to reset Z back to its start when a new L1 face column starts
- WRCFG to update the L1 tile offset register (THCON_SEC0_REG7_Offset_address) for the next tile
The _llk_unpack_untilize_pass_ function iterates over FACE_HEIGHT=16 rows, with an inner loop over tiles in the row. The MOP contains:
DMANOP;
UNPACR(SrcA, CH1_Y+=1, CH0_Z+=1); // unpack 2 adjacent 1x16 rows
UNPACR(SrcA, CH1_Y+=1, CH0_Z+=1);
ADDDMAREG(TILE_OFFSET, TILE_OFFSET, TILE_SIZE); // advance to next tile
STALLWAIT(STALL_CFG, THCON);
ADDRCRZW(CH0_Z); // reset Z counter to checkpoint
7. Unpack-to-Dest Mode
7.1 When Used
Unpacker 0 can write directly to Dest instead of SrcA. This is used for:
- 32-bit data types (FP32, INT32, UInt32) — only writable via unpack-to-dest
- Tilize with FP32 input
- Reducing latency by bypassing the SrcA→math→Dest path
7.2 Control Bits
Mode is selected by:
- Non-MultiContextMode: REG2_Unpack_If_Sel (bit 11 in ADDR32 72)
- MultiContextMode: REG2_Unpack_if_sel_cntx[WhichContext] (bits in ADDR32 73)
7.3 UNPACK_TO_DEST Semaphore Protocol
Semaphore UNPACK_TO_DEST (semaphore index defined in ckernel_defs.h) synchronizes the unpack-to-dest path with the math thread:
// TRISC0 (unpack thread) before starting unpack-to-dest:
wait_for_dest_available():
t6_semaphore_wait_on_max<STALL_UNPACK>(semaphore::UNPACK_TO_DEST)
// Blocks until UNPACK_TO_DEST count is < max (Dest is not currently occupied)
// TRISC0 after unpack-to-dest tile done:
unpack_to_dest_tile_done(context_id):
t6_semaphore_post<UNPACK0>(semaphore::UNPACK_TO_DEST)
// Signals Dest has been written
// Also restores stride and context config
The math thread (TRISC1) signals when Dest processing is complete, allowing the unpack thread to write the next tile.
7.4 Dest Address Setup
// From set_dst_write_addr():
dst_byte_addr = 16 * (4 + mailbox_read(ThreadId::MathThreadId))
// MathThreadId is a value in [0, 7] selecting which 16-row block of Dest to write
TTI_SETC16(SRCA_SET_Base_ADDR32, 0x0) // disable address bit swizzle for Dest
// Program per-context dest address:
cfg[THCON_SEC0_REG5_Dest_cntx[ctx]_address] = dst_byte_addr
// Set Unpack_if_sel_cntx[ctx] = 1
cfg_reg_rmw(THCON_SEC0_REG2_Unpack_if_sel_cntxN_RMW, 1)
The Dest_cntx_address field programs the starting byte offset in Dest. This is combined with OutAddr (from ADC Channel 1) using the ADD_DEST_ADDR_CNTR enable bit.
7.5 Destination Write Order
In unpack-to-dest mode, OutAddr directly indexes Dest rows and columns:
# In the unpack loop:
Row = (OutAddr // 16 - 4) & 0x3ff # subtract 4-row header offset, mask to 10-bit
Col = OutAddr & 15
if OutDataFormat in (FP32, TF32, INT32):
Dst32b[Row][Col] = Datum # 32-bit write
else:
Dst16b[Row][Col] = Datum # 16-bit write
The -4 row offset is fixed hardware behavior. In MultiContextMode, CtxOutAddr (the Dest_cntx address) is added to the ADC-derived OutAddr on this path (see section 2.4), and the SrcRow tracking used for SrcA writes is not applied.
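The interaction between the `16 * (4 + value)` start address and the fixed -4 row offset is easiest to see numerically. The sketch below uses hypothetical helper names and covers only the row/column split, not the write itself.

```python
def dest_row_col(out_addr, ovrd_with_addr=False):
    """Row/column selected in Dest for a given OutAddr on the unpack-to-dest path."""
    row = (out_addr // 16 - 4) & (0xF if ovrd_with_addr else 0x3FF)
    col = out_addr & 15
    return row, col


def dst_start_addr(value):
    """16 * (4 + value), mirroring the set_dst_write_addr() expression."""
    return 16 * (4 + value)


print(dest_row_col(dst_start_addr(0)))        # (0, 0): the +4 cancels the -4
print(dest_row_col(dst_start_addr(2) + 17))   # (3, 1)
```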
8. UNPACR_NOP Functional Model
UNPACR_NOP is a family of side-channel instructions that operate within the unpacker pipeline, sequenced after previous UNPACR instructions. They share the same execution unit as UNPACR.
8.1 Mode Encoding
| Mode bits [4:0] | Operation |
|---|---|
| 0b00000 (0x0) | OverlayClear (pop stream message, using NOC_OVERLAY_MSG_CLEAR_StreamId) |
| 0b00001 (0x1) | ZEROSRC: zero out a SrcA/SrcB bank |
| 0b00010 (0x2) | Nop: occupy unpacker for one cycle |
| 0b00011 (0x3) | OverlayClear with explicit stream + count |
| 0b00100 (0x4) | SETREG: MMIO register write |
| 0b00111 (0x7) | SETDVALID: transfer bank to MatrixUnit |
8.2 SETDVALID (0x7)
TT_UNPACR_NOP(WhichUnpacker, 0x7)
Functionally equivalent to SETDVALID but sequenced through the unpacker pipeline:
if WhichUnpacker == 0:
SrcA[Unpackers[0].SrcBank].AllowedClient = MatrixUnit
Unpackers[0].SrcBank ^= 1
Unpackers[0].SrcRow[CurrentThread] = ThreadConfig[CurrentThread].SRCA_SET_Base << 4
else:
SrcB[Unpackers[1].SrcBank].AllowedClient = MatrixUnit
Unpackers[1].SrcBank ^= 1
Unpackers[1].SrcRow[CurrentThread] = ThreadConfig[CurrentThread].SRCB_SET_Base << 4
Does not automatically wait for AllowedClient == Unpackers. Use STALLWAIT before if needed (block B3, condition C10 or C11).
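The bank handoff above can be exercised with a small runnable model. This is a sketch only: the `SrcFile` and `Unpacker` classes, the string constants and the `set_base` parameter (standing in for `SRCA_SET_Base` or `SRCB_SET_Base`) are hypothetical names chosen for illustration, and the initial state with both banks owned by the unpackers is an assumption.

```python
from dataclasses import dataclass, field

UNPACKERS, MATRIX_UNIT = "Unpackers", "MatrixUnit"


@dataclass
class SrcFile:
    # One AllowedClient entry per bank; both banks start with the unpackers (assumption).
    allowed_client: list = field(default_factory=lambda: [UNPACKERS, UNPACKERS])


@dataclass
class Unpacker:
    src_bank: int = 0
    src_row: list = field(default_factory=lambda: [0, 0, 0])  # one entry per thread


def unpacr_nop_setdvalid(unpacker, src, thread, set_base):
    """Hand the current bank to the MatrixUnit, flip banks, reset SrcRow."""
    src.allowed_client[unpacker.src_bank] = MATRIX_UNIT
    unpacker.src_bank ^= 1
    unpacker.src_row[thread] = set_base << 4
```

Note that, as in the functional model, nothing here waits for the bank to be owned by the unpackers first; a caller modelling STALLWAIT would have to check `allowed_client` before calling.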
8.3 ZEROSRC (0x1)
TT_UNPACR_NOP(WhichUnpacker,
((WaitLikeUnpacr) << 4) +
((BothBanks) << 3) +
((NegativeInfSrcA)<< 2) +
0x1)
Clears SrcA or SrcB to zeros (or negative infinity for SrcA):
UnpackBank = Unpackers[WhichUnpacker].SrcBank
# Wait for bank access (either unpack bank or math bank)
if WhichUnpacker == 0:
target_bank = UnpackBank if WaitLikeUnpacr else MatrixUnit.SrcABank
while SrcA[target_bank].AllowedClient != Unpackers: wait
else:
target_bank = UnpackBank if WaitLikeUnpacr else MatrixUnit.SrcBBank
while SrcB[target_bank].AllowedClient != Unpackers: wait
# Clear
for bank in range(2):
if BothBanks or bank == UnpackBank:
ClearVal = 0x7FFFF if (WhichUnpacker == 0 and NegativeInfSrcA) else 0 # all 19 bits set, or zero
for row in range(64):
for col in range(16):
(SrcA if WhichUnpacker == 0 else SrcB)[bank][row][col] = ClearVal
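The mode operand for ZEROSRC is built from three flag bits plus the 0x1 opcode, as shown in the TT_UNPACR_NOP call above. A minimal sketch of that packing follows; `zerosrc_operand` is a hypothetical helper name, not an LLK function.

```python
def zerosrc_operand(wait_like_unpacr, both_banks, negative_inf_srca):
    """Pack the ZEROSRC flags into the UNPACR_NOP mode operand."""
    return (
        (int(wait_like_unpacr) << 4)
        + (int(both_banks) << 3)
        + (int(negative_inf_srca) << 2)
        + 0x1  # ZEROSRC opcode in the low bits
    )
```

For example, clearing both banks while waiting like a regular UNPACR gives `zerosrc_operand(True, True, False) == 0x19`.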
8.4 OverlayClear (0x0 and 0x3)
Pops a message from a NoC Overlay stream. Used in the unpack thread to acknowledge CB message consumption:
# Mode 0x0: use StreamId from ThreadConfig
StreamId = ThreadConfig[CurrentThread].NOC_OVERLAY_MSG_CLEAR_StreamId[WhichUnpacker]
NOC_STREAM_WRITE_REG(StreamId, STREAM_MSG_DATA_CLEAR_REG_INDEX, 1)
# Mode 0x3: explicit stream and count
TT_UNPACR_NOP(WhichUnpacker,
((WhichStream) << 16) + ((ClearCount) << 4) + 0x3)
# Clears 'ClearCount' messages from 'WhichStream'
8.5 SETREG (0x4)
Writes a value to an MMIO register once previous UNPACR L1 reads complete:
Addr = 0xFFB00000 + Unpackers.SetRegBase[AddrSel] + (AddrMid << 12)
if Accumulate:
AccValue = Unpackers[WhichUnpacker].SetRegAcc
if Value11 == 0:
AccValue = 0
else:
AccValue = (AccValue + Value11) & 0x1ffff
write32(Addr, AccValue)
Unpackers[WhichUnpacker].SetRegAcc = AccValue
else:
write32(Addr, Value11)
Used for stream consumer count updates and similar MMIO side effects synchronized with unpack.
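The accumulate path is easy to get wrong in an emulator, so here is a small sketch of just the value computation from the model above. The function name `setreg_value` and the constant `ACC_MASK` are hypothetical; the MMIO write itself is left out.

```python
ACC_MASK = 0x1FFFF  # 17-bit accumulator, per the model above


def setreg_value(acc, value11, accumulate):
    """Return (value_written, new_acc) for one SETREG."""
    if not accumulate:
        return value11, acc          # plain write, accumulator untouched
    if value11 == 0:
        new_acc = 0                  # writing 0 resets the accumulator
    else:
        new_acc = (acc + value11) & ACC_MASK
    return new_acc, new_acc
```

Starting from an accumulator of 5, `setreg_value(5, 3, True)` writes 8 and stores 8, while `setreg_value(5, 0, True)` writes 0 and clears the accumulator.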
9. Config Context Switching
9.1 Why Double-Buffered Config?
The Tensix unpacker uses a double-buffered configuration so that TRISC0 can set up the next tile's config while the previous tile is still being unpacked. UNPACR reads its tile config registers (base address, format, XDim, etc.) when it executes, and TRISC0 must write those registers before issuing the UNPACR. With only one copy of the registers, TRISC0 could not write the next tile's values until the in-flight UNPACR had finished reading them.
The hardware supports 8 contexts (for Unpacker 0; only 2 for Unpacker 1). In practice, software uses 2 contexts (a ping-pong pair).
9.2 Config Ping-Pong Protocol
State tracked in software: unp_cfg_context (global in ckernel_globals.h, 0 or 1).
Context 0:
THCON_SEC0_REG3_Base_address = tile_A_address (context 0)
THCON_SEC0_REG7_Offset_address = tile_A_offset (context 0)
Context 1:
THCON_SEC0_REG3_Base_cntx1_address = tile_B_address (context 1)
THCON_SEC0_REG7_Offset_cntx1_address = tile_B_offset (context 1)
The THCON_SEC0_REG5_Dest_cntx[N]_address and THCON_SEC0_REG5_Tile_x_dim_cntx[N] registers hold per-context values for Unpacker 0. Unpacker 1 only uses contexts 0 and 1 for its base addresses.
9.3 CfgContextOffset (ADDR32 41)
UNPACK_MISC_CFG at ADDR32 41 selects which config context the unpacker accesses:
| Bits | Field | Description |
|---|---|---|
| [3:0] | CfgContextOffset_0 | Context offset for Unpacker 0 |
| [4] | CfgContextCntReset_0 | Reset context counter (Unpacker 0) |
| [5] | CfgContextCntInc_0 | Increment context counter each UNPACR (Unpacker 0) |
| [11:8] | CfgContextOffset_1 | Context offset for Unpacker 1 |
| [12] | CfgContextCntReset_1 | Reset context counter (Unpacker 1) |
| [13] | CfgContextCntInc_1 | Increment context counter (Unpacker 1) |
These are written with SETC16 at ADDR32 41:
// Context 0 active (both unpackers):
TTI_SETC16(UNPACK_MISC_CFG_CfgContextOffset_0_ADDR32, 0x0000)
// Encoding: [3:0]=CfgContextOffset_0=0, [11:8]=CfgContextOffset_1=0
// Context 1 active (both unpackers):
TTI_SETC16(UNPACK_MISC_CFG_CfgContextOffset_0_ADDR32, 0x0101)
// Encoding: [3:0]=CfgContextOffset_0=1, [11:8]=CfgContextOffset_1=1
// During unpacker_iteration_cleanup (ping-pong):
// context=1: TTI_SETC16(addr, 0x0104) ← offset0=4, offset1=1
// context=0: TTI_SETC16(addr, 0x0000) ← offset0=0, offset1=0
// At wrapup (both offsets 0, both context-counter reset bits set):
TTI_SETC16(addr, 0x1010) // bit 4 = CfgContextCntReset_0, bit 12 = CfgContextCntReset_1
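To check what a given SETC16 value does, the 16-bit word can be split according to the field table above. The sketch below does exactly that; `decode_unpack_misc_cfg` is a hypothetical helper name and the bit positions are the ones listed in the table.

```python
def decode_unpack_misc_cfg(value):
    """Split a 16-bit UNPACK_MISC_CFG value into the fields from the table above."""
    return {
        "CfgContextOffset_0": value & 0xF,            # bits [3:0]
        "CfgContextCntReset_0": (value >> 4) & 1,     # bit [4]
        "CfgContextCntInc_0": (value >> 5) & 1,       # bit [5]
        "CfgContextOffset_1": (value >> 8) & 0xF,     # bits [11:8]
        "CfgContextCntReset_1": (value >> 12) & 1,    # bit [12]
        "CfgContextCntInc_1": (value >> 13) & 1,      # bit [13]
    }
```

Decoding the values used in this section: 0x0101 sets both offsets to 1, 0x0104 sets offset 0 to 4 and offset 1 to 1, and 0x1010 leaves both offsets at 0 while setting the two counter-reset bits.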
9.4 Switch Sequence
From switch_config_context():
void switch_config_context(uint32_t &unp_cfg_context) {
unp_cfg_context = 1 - unp_cfg_context;
if (unp_cfg_context == 0) {
TTI_SETC16(UNPACK_MISC_CFG_CfgContextOffset_0_ADDR32, 0x0000);
} else {
TTI_SETC16(UNPACK_MISC_CFG_CfgContextOffset_0_ADDR32, 0x0101);
}
}
9.5 Full Double-Buffer Sequence (Per Tile)
TRISC0 (unpack loop per tile):
1. wait_for_next_context(2)
→ spins until semaphore::UNPACK_SYNC < 2
(ensures at most 2 contexts are "in flight")
2. Write tile address to cfg:
if unp_cfg_context == 0:
cfg[THCON_SEC0_REG3_Base_address] = L1_addr_A
cfg[THCON_SEC1_REG3_Base_address] = L1_addr_B
else:
cfg[THCON_SEC0_REG3_Base_cntx1_address] = L1_addr_A
cfg[THCON_SEC1_REG3_Base_cntx1_address] = L1_addr_B
3. semaphore_post(UNPACK_SYNC) ← "I have a context ready"
4. TTI_STALLWAIT(STALL_UNPACK, TRISC_CFG) ← wait for CFG writes to propagate
5. Execute MOP (UNPACR instructions)
6. t6_semaphore_get(UNPACK_SYNC) ← "context is consumed"
7. switch_config_context(unp_cfg_context)
The T6 semaphore_get happens within the MOP/UNPACR instruction itself (the FlipSrc flag or UNPACR_NOP_SETDVALID transfers ownership and implicitly synchronizes). The TRISC0 t6_semaphore_get in step 6 acknowledges context release from the coprocessor side.
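The register selection in step 2 and the context switch in step 7 can be traced with a short sketch. It models only which base-address registers are written and which SETC16 value follows each tile; the semaphore handshake and the MOP are deliberately omitted. `plan_unpack_loop` is a hypothetical name, and the register strings are the ones used in the sequence above.

```python
def plan_unpack_loop(tiles, start_context=0):
    """For each (L1_addr_A, L1_addr_B) pair, list the config writes and the
    SETC16 value issued by switch_config_context() afterwards."""
    ctx = start_context
    plan = []
    for l1_addr_a, l1_addr_b in tiles:
        # Step 2: context 0 uses Base_address, context 1 uses Base_cntx1_address.
        suffix = "Base_address" if ctx == 0 else "Base_cntx1_address"
        writes = {
            f"THCON_SEC0_REG3_{suffix}": l1_addr_a,
            f"THCON_SEC1_REG3_{suffix}": l1_addr_b,
        }
        # Step 7: flip the software context, then program both offsets to match.
        ctx = 1 - ctx
        setc16 = 0x0000 if ctx == 0 else 0x0101
        plan.append((writes, setc16))
    return plan
```

For two consecutive tiles starting in context 0, the first entry writes the context-0 registers and is followed by SETC16 0x0101; the second writes the cntx1 registers and is followed by SETC16 0x0000.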
10. UNPACR Instruction Context Counter Mode
When UseContextCounter = true and MultiContextMode = true:
# Before UNPACR executes:
WhichContext = CurrentUnpacker.ContextCounter[CurrentThread]
WhichContext += ThreadConfig[CurrentThread].UNPACK_MISC_CFG_CfgContextOffset[WhichUnpacker]
# After UNPACR executes:
IncrementedCounter = CurrentUnpacker.ContextCounter[CurrentThread] + 1 # counter value without the offset
if IncrementedCounter >= (1 << Context_count):
IncrementedCounter = 0
CurrentUnpacker.ContextCounter[CurrentThread] = IncrementedCounter
The context counter automatically cycles through 0..(2^Context_count - 1), where Context_count is a 2-bit field (0=max 1 context, 1=2, 2=4, 3=8 contexts).
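A minimal sketch of the wrap rule, assuming the counter is advanced from its own stored value (without the CfgContextOffset added) so that it cycles as described. `next_context_counter` is a hypothetical helper name.

```python
def next_context_counter(counter, context_count):
    """Advance the per-thread context counter, wrapping at 2**context_count."""
    incremented = counter + 1
    if incremented >= (1 << context_count):
        incremented = 0
    return incremented
```

With `context_count = 1` the counter alternates 0, 1, 0, 1; with `context_count = 0` it stays at 0.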
The UNPACR (Increment context counter) instruction variant just increments the counter without performing any unpack:
TT_OP_UNPACR(WhichUnpacker, 0, true, 0, 0, false, false, false, false, false, false, false)
11. Real Instruction Sequences (Annotated)
11.1 Initialization Sequence (from add1/matmul TRISC0)
This sequence runs once at startup to initialize the unpack configuration. Observed in both add1_trisc0.S and matmul_trisc0.S:
; === ADC reset ===
5f48: ttsetadcxy 3,0,0,0,0,11 ; SETADCXY(UNP_AB, Y1=0,X1=0,Y0=0,X0=0, mask=0b1011)
; Resets X0, Y0, Y1 for both unpackers (bit 0=X0, 1=Y0, 3=Y1)
5f4c: ttsetadczw 3,0,0,0,0,15 ; SETADCZW(UNP_AB, W1=0,Z1=0,W0=0,Z0=0, mask=0b1111)
; Resets all Z/W counters
; === Config register writes via instrn_buffer ===
; (These are stores to 0xFFE40000 = __instrn_buffer, raw 32-bit instruction words)
5f70: ttatgetm 0 ; Acquire mutex 0 (REG_RMW mutex)
; Store ALU format/config words via buffer:
5f7c: sw a4, 0(a0) → 0xb3ff0... ; WRCFG: ALU_FORMAT_SPEC_REG (ADDR32=0)
5f84: sw a4, 0(a0) → 0xb47f0... ; WRCFG: ADDR32=1 (ALU_FORMAT_SPEC_REG + rounding)
5f8c: sw a4, 0(a0) → 0xb3070001 ; WRCFG+1b: Disable zero compress flags
5f98: sw a4, 0(a0) → 0xb4800001 ; WRCFG: another format spec
5fa4: sw a4, 0(a0) → 0xb5010001 ; WRCFG: THCON_SEC0/1 config
5fb0: sw a4, 0(a0) → 0xb6600001 ; WRCFG: more config
5fb8: sw a4, 0(a0) → 0xb3010002 ; WRCFG: out_data_format + throttle
5fc8: ttatrelm 0 ; Release mutex 0
; === Address stride config ===
; (Direct memory writes to instrn_buffer region at various offsets)
; UNP0_ADDR_CTRL_ZW_REG_1_Zstride (ADDR32=57): z-stride for output channel
; UNP1_ADDR_CTRL_ZW_REG_1_Zstride (ADDR32=59): z-stride for SrcB output channel
; === Tile descriptor writes ===
; (Writes to THCON_SEC0_REG0_TileDescriptor at ADDR32=64)
; === Per-context dim config ===
; THCON_SEC0_REG5_Tile_x_dim_cntx0 (ADDR32=86)
; THCON_SEC0_REG5_Dest_cntx0_address (ADDR32=84)
; === ADC x_end ===
60c0: ttsetadcxx 1,255,0 ; SETADCXX(UNP_A, X1=255, X0=0)
; X1=255 = face_r_dim*face_c_dim-1 = 16*16-1
; === Context reset ===
6074: ttsetc16 5,4 ; SETC16 at ADDR32=5 (SRCA_SET_Base), value=4
; Sets SrcA base row = 4 (skip 4-row header)
6088: ttsetc16 41,0 ; SETC16 at ADDR32=41 (UNPACK_MISC_CFG), value=0
; Reset config context to 0
11.2 Per-Tile Unpack Loop (add1)
The add1 kernel unpacks one tile of SrcA and one tile of SrcB per iteration:
; === Wait for idle (previous contexts consumed) ===
; (Busy-wait loop checking semaphore::UNPACK_SYNC)
61c4: ttsetadczw 3,0,0,0,0,15 ; Reset Z/W counters for both unpackers
; === Write tile addresses to config ===
; (RISC-V stores to THCON_SEC0_REG3_Base_address and THCON_SEC1_REG3_Base_address)
61e0: sw zero, 52(a3) ; Clear instrn_buffer+0x34 (busy flag / semaphore reg)
; === Issue UNPACR + MOP ===
61e4: ttstallwait 8,1024 ; STALLWAIT(STALL_UNPACK, TRISC_CFG)
; Block unpacker until TRISC CFG writes complete
; (condition 1024 = TRISC_CFG, block 8 = STALL_UNPACK)
61e8: ttmop 1,0,0 ; Execute MOP program
; The MOP contains UNPACR instructions for SrcA+SrcB
61ec: ttsemget 32 ; t6_semaphore_get(semaphore::UNPACK_SYNC)
; Context released (coprocessor acknowledges)
61f4: ttsetc16 41,257 ; SETC16 UNPACK_MISC_CFG=0x0101
; Switch to context 1 (CfgContextOffset_0=1, _1=1)
11.3 Matmul Unpack SrcB (with per-tile address accumulation)
The matmul kernel increments SrcB tile addresses using SETREG-style RDCFG/ADDDMAREG/WRCFG:
; Replay buffer programs SrcB address update per face:
611c: ttreplay 0,12,0,1 ; REPLAY(start=0, len=12, execute_while_loading=0, load_mode=1): record the next 12 instructions into the replay buffer
; Replay buffer contents:
6120: ttunpacr 1,0,0,0,0,1,1,0,0,0,0,0,1 ; UNPACR SrcB: Ch0ZInc=0, FlipSrc=1
; (WhichUnpacker=1, Ch1YInc=0, Ch1ZInc=0,
; Ch0YInc=0, Ch0ZInc=0, no FlushCache,
; ContextNum=0, ContextADC=0,
; MultiContextMode=1, FlipSrc=1,
; no extra flags)
6124: ttrdcfg 12,124 ; RDCFG r12, ADDR32=124 (THCON_SEC1_REG3_Base_address)
; Read SrcB base address into register r12
6128: ttadddmareg 0,12,12,18 ; ADDDMAREG r12 = r12 + r18
; r18 = tile_size_B (preloaded)
; Advance base address by one tile
612c: ttstallwait 128,1 ; STALLWAIT(STALL_CFG, UNPACK1)
; Wait for unpacker 1 to be idle before writing config
6130: ttwrcfg 12,0,124 ; WRCFG ADDR32=124, r12
; Write updated SrcB address back
6134: ttnop
6138: ttunpacr 1,... ; Next SrcB face
6140+: ; Repeat for second context (cntx1_address ADDR32=125)
11.4 Matmul Unpack SrcA (single face per UNPACR)
; Inside tile loop:
62ac: ttstallwait 8,1024 ; Wait for TRISC_CFG writes to propagate
62b0: ttunpacr 0,0,0,0,0,1,1,0,0,0,0,0,1
; UNPACR(SrcA, Ch1ZInc=0,Ch1YInc=0,Ch0ZInc=0,Ch0YInc=0,
; no_flush_cache, ctx_num=0, ctx_adc=0,
; MultiContextMode=1, FlipSrc=1, ...)
; Unpacks one face from L1 to SrcA, then flips bank to MatrixUnit
62c0: ttsemget 32 ; Release context semaphore
62d0: ttsetc16 41,0 ; Switch config context back to 0
11.5 Config Context Switch Pattern
The observed binary pattern in both kernels:
; Context 0→1:
ttsetc16 41, 257 ; 257 = 0x101 → UNPACK_MISC_CFG offset0=1, offset1=1
; Context 1→0:
ttsetc16 41, 0 ; 0 = 0x000 → UNPACK_MISC_CFG offset0=0, offset1=0
12. Upsampling Mode
Controlled by upsample_rate (2-bit) and upsample_and_interleave (1-bit) in unpack_config_t:
| upsample_rate | upsample_and_interleave | Effect |
|---|---|---|
| 0 | Any | No upsampling |
| 1 | false | Insert 1 zero after every datum |
| 2 | false | Insert 2 zeros after every datum |
| 3 | false | Insert 4 zeros after every datum |
| 1 | true | Skip 1 output position after every datum |
| 2 | true | Skip 2 output positions after every datum |
| 3 | true | Skip 4 output positions after every datum |
UpsampleZeroes = (1 << upsample_rate) - 1 # 0, 1, 2, or 4 zeros to insert
for j in range(UpsampleZeroes + 1):
datum_to_write = Datum if j == 0 else 0
if upsample_and_interleave and j != 0:
OutAddr += 1 # skip position (don't write)
continue
write_to_output(OutAddr, datum_to_write)
OutAddr += 1
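The difference between inserting zeros and skipping positions is easiest to see on a small buffer. The sketch below follows the loop above but takes the per-datum zero count directly as a parameter instead of deriving it from upsample_rate; the function name `upsample` and the optional pre-filled `out` buffer are assumptions made for illustration.

```python
def upsample(datums, zeros_per_datum, interleave, out=None):
    """Write datums with gaps; gaps are zero-filled, or left untouched when interleaving."""
    stride = zeros_per_datum + 1
    if out is None:
        out = [0] * (len(datums) * stride)
    out_addr = 0
    for datum in datums:
        for j in range(stride):
            if j == 0:
                out[out_addr] = datum      # the real datum
            elif not interleave:
                out[out_addr] = 0          # inserted zero
            # interleave and j != 0: skip the position without writing
            out_addr += 1
    return out
```

With one zero per datum, `upsample([1, 2], 1, False)` yields `[1, 0, 2, 0]`, whereas the interleave variant over a buffer pre-filled with 9s, `upsample([1, 2], 1, True, out=[9, 9, 9, 9])`, yields `[1, 9, 2, 9]` because the skipped positions keep their previous contents.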
13. Performance Characteristics
L1 Bandwidth
Each unpacker has three speed tiers:
- x1: Up to 16 bytes/cycle
- x2: Up to 32 bytes/cycle
- x4: Up to 64 bytes/cycle
Configured by Throttle_mode field (0=x1, 1=x2, 2=x4). Default in LLK code: throttle_mode = 2 (x4).
When both unpackers are active simultaneously, they share L1 bandwidth per the interference table in the ISA docs (see UNPACR_Regular.md).
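For a rough feel of what the tiers mean, the sketch below turns the per-cycle limits into a lower bound on transfer cycles for one unpacker running alone. It ignores the initial latency, interference between the two unpackers, and any forced-speed constraints, so real cycle counts will be higher. `min_unpack_cycles` is a hypothetical helper, and the 2048-byte tile in the usage note (32 x 32 BF16 datums, header excluded) is an assumed example size.

```python
def min_unpack_cycles(tile_bytes, throttle_mode):
    """Lower bound on L1 read cycles: ceil(bytes / peak bytes-per-cycle)."""
    if throttle_mode not in (0, 1, 2):
        raise ValueError("throttle_mode must be 0 (x1), 1 (x2) or 2 (x4)")
    bytes_per_cycle = 16 << throttle_mode   # 16, 32 or 64 bytes per cycle
    return -(-tile_bytes // bytes_per_cycle)  # ceiling division
```

For the assumed 2048-byte tile this gives at least 128 cycles at x1, 64 at x2 and 32 at x4.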
Forced Speed Constraints
Certain modes force lower bandwidth:
- DiscontiguousInputRows (tilize): always x4
- !IsUncompressed (compressed data): always x1
- UpsampleZeroes == 3 (4 zeros per datum): always x1
- BFP2 / BFP2a: always x1
- UpsampleZeroes == 1 (2 zeros): x1 or x2
Initial Latency
Every UNPACR instruction incurs at least 2 cycles of address computation before L1 reads begin. During these cycles, no other thread can issue UNPACR (shared frontend resource). Compressed data incurs additional cycles.
14. Emulator Implementation Notes
14.1 State to Track
class UnpackerState:
SrcBank: int # 0 or 1, current write bank
SrcRow: list[int] # [thread0_row, thread1_row, thread2_row]
ContextCounter: list[int] # [thread0_ctx, thread1_ctx, thread2_ctx]
SetRegAcc: int # Accumulated SETREG value
class ADCChannel:
X: int; X_Cr: int
Y: int; Y_Cr: int
Z: int; Z_Cr: int
W: int; W_Cr: int
class ADCEntry:
Channel: list[ADCChannel] # [channel0, channel1]
class ADCState:
Unpacker: list[ADCEntry] # [unpacker0, unpacker1]
Packers: ADCEntry
ADCs: list[ADCState] # [thread0, thread1, thread2]
Unpackers: list[UnpackerState] # [unpacker0, unpacker1]
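One way to make the state layout above directly instantiable is with dataclasses, as in the sketch below. The field names follow the listing; the zero initial values and the `make_initial_state` helper are assumptions for illustration, not documented reset behaviour.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UnpackerState:
    SrcBank: int = 0
    SrcRow: List[int] = field(default_factory=lambda: [0, 0, 0])          # per thread
    ContextCounter: List[int] = field(default_factory=lambda: [0, 0, 0])  # per thread
    SetRegAcc: int = 0


@dataclass
class ADCChannel:
    X: int = 0
    X_Cr: int = 0
    Y: int = 0
    Y_Cr: int = 0
    Z: int = 0
    Z_Cr: int = 0
    W: int = 0
    W_Cr: int = 0


@dataclass
class ADCEntry:
    Channel: List[ADCChannel] = field(default_factory=lambda: [ADCChannel(), ADCChannel()])


@dataclass
class ADCState:
    Unpacker: List[ADCEntry] = field(default_factory=lambda: [ADCEntry(), ADCEntry()])
    Packers: ADCEntry = field(default_factory=ADCEntry)


def make_initial_state():
    """Three per-thread ADC sets and two unpackers, all zeroed (assumption)."""
    adcs = [ADCState() for _ in range(3)]
    unpackers = [UnpackerState() for _ in range(2)]
    return adcs, unpackers
```

Using `default_factory` keeps every thread and unpacker on its own list objects, so writing `adcs[0].Unpacker[0].Channel[0].X` does not affect the other threads.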
14.2 Key Emulation Points
- L1 Circular Buffer: Implement WrapAddr(addr) using limit_addr and fifo_size from config. The wrapping is used for both data and exponent pointers.
- 19-bit SrcA/SrcB storage: Values are stored in the rearranged bit layout (Sign, Mantissa, Exponent rather than Sign, Exponent, Mantissa). Always apply WriteSrcTF32() / WriteSrcBF16() / WriteSrcFP16() before storing.
- Row offset in SrcA: The SrcA register file conceptually starts at row 0, but the unpacker computes Row = (OutAddr/16 - 4) + SrcRow[thread]. The -4 is a hardware-fixed offset. Rows 0-3 of the output address space are skipped/header.
- SrcRow tracking: SrcRow advances by 16 after each UNPACR (when Unpack_Src_Reg_Set_Upd = 1) and resets to SRCA_SET_Base << 4 on bank flip. In the typical 4-face tile unpack (4 UNPACR calls before FlipSrc), SrcRow advances 0, 16, 32, 48, then resets on flip.
- Bank ownership: Before any write, check SrcA[bank].AllowedClient == UNPACKERS. The bank starts owned by Unpackers, is transferred to MatrixUnit on FlipSrc or UNPACR_NOP_SETDVALID, and returns to Unpackers after math processes it (via CLEARDVALID or equivalent).
- ColShift: In non-tilize, non-SrcB mode, ColShift = Shift_amount_cntx[WhichContext & 3]. Skip elements where Col < ColShift. Used for partial-row unpacking.
- Exponent pointer alignment: The exponent pointer uses fractional arithmetic (advances by 1/16 per datum). In practice, use integer counters: maintain an exponent index exp_idx that increments by 1 every 16 datums, starting at FirstDatum / 16.
- Compressed data: The RSI (row start index) array is a sequence of uint16_t values in L1. RSI[i] gives the byte offset (from the start of the datum stream) of compressed row i. The decompressor uses RLE nibbles interleaved 32-per-block.
- FP8 E4M3 mode: Enabled by THCON_SEC[n]_REG1_Unp_LF8_4b_exp. When set, FP8 input is interpreted as E4M3 instead of E5M2. The conversion to FP16 differs significantly.
- Context counter wrap: Context_count is a 2-bit field; the counter wraps at 2^Context_count (i.e., 1, 2, 4, or 8).
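Two of the points above reduce to one-line integer formulas, sketched here for reference. The helper names `srca_row` and `exponent_index` are hypothetical; the arithmetic follows the row-offset and exponent-pointer items in the list.

```python
def srca_row(out_addr, src_row):
    """SrcA row for an output address: (OutAddr / 16 - 4) + SrcRow[thread]."""
    return (out_addr // 16 - 4) + src_row


def exponent_index(first_datum, datum_offset):
    """Shared-exponent index: one exponent per 16 datums, starting at FirstDatum / 16."""
    return (first_datum + datum_offset) // 16
```

For example, with SrcRow at 16 (the second face of a four-face tile), output address 64 lands in row 16, and datums 0 through 15 of a BFP stream share exponent 0 while datum 16 moves to exponent 1.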
14.3 Simplified UNPACR Dispatch Logic
def emulate_UNPACR(WhichUnpacker, Ch0YInc, Ch0ZInc, Ch1YInc, Ch1ZInc,
ContextNumber, ContextADC, MultiContextMode, FlipSrc,
AllDatumsAreZero, UseContextCounter, RowSearch):
"""Core UNPACR emulation."""
cfg = get_config_for_current_thread()
# 1. Context selection
ctx = select_context(WhichUnpacker, MultiContextMode, UseContextCounter,
ContextNumber, ContextADC)
# 2. Read tile descriptor and config
td = get_tile_descriptor(WhichUnpacker, ctx)
fmt = get_data_format(WhichUnpacker, ctx, td)
# 3. Compute addresses
in_addr = compute_input_address(WhichUnpacker, ctx, td, fmt)
out_addr = compute_output_address(WhichUnpacker, ctx, fmt)
# 4. Main loop
for datum in read_datums(in_addr, td, fmt):
out_val = format_convert(datum.bits, datum.exp, fmt, WhichUnpacker)
if AllDatumsAreZero:
out_val = 0
write_to_register(WhichUnpacker, out_addr, out_val, ctx)
out_addr += 1
# 5. Post-update
update_context_counter(WhichUnpacker, ctx, MultiContextMode, UseContextCounter)
update_ADC_increments(WhichUnpacker, Ch0YInc, Ch0ZInc, Ch1YInc, Ch1ZInc, ContextADC)
if FlipSrc:
flip_bank(WhichUnpacker)
15. Source References
| File | Contents |
|---|---|
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/UNPACR_Regular.md | Complete UNPACR functional model and encoding |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/Unpackers/README.md | Unpacker overview, decompression, upsampling |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/Unpackers/FormatConversion.md | Format conversion table and configuration |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/FloatBitPatterns.md | BFP/FP16/BF16/TF32/FP32 bit pattern semantics |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/ADCs.md | ADC counter structure and usage |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/SETADC.md | SETADC functional model |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/SETADCXY.md | SETADCXY functional model |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/SETADCZW.md | SETADCZW functional model |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/SETADCXX.md | SETADCXX functional model |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/INCADCXY.md | INCADCXY functional model |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/INCADCZW.md | INCADCZW functional model |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/UNPACR_NOP_SETDVALID.md | SETDVALID NOP functional model |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/UNPACR_NOP_ZEROSRC.md | ZEROSRC NOP functional model |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/UNPACR_NOP_SETREG.md | SETREG NOP functional model |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/UNPACR_NOP_OverlayClear.md | OverlayClear NOP functional model |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/UNPACR_IncrementContextCounter.md | Context counter increment instruction |
| tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/UNPACR_FlushCache.md | Flush decompression cache instruction |
| tt-llk/tt_llk_blackhole/common/inc/cunpack_common.h | unpack_tile_descriptor_t, unpack_config_t, core unpack functions |
| tt-llk/tt_llk_blackhole/llk_lib/llk_unpack_AB.h | Dual-operand unpack: MOP config, init, execute |
| tt-llk/tt_llk_blackhole/llk_lib/llk_unpack_A.h | Single-operand unpack: MOP config, init, execute |
| tt-llk/tt_llk_blackhole/llk_lib/llk_unpack_tilize.h | Tilize and TilizeA+B implementations |
| tt-llk/tt_llk_blackhole/llk_lib/llk_unpack_untilize.h | Untilize implementation |
| tt-llk/tt_llk_blackhole/llk_lib/llk_unpack_common.h | _llk_unpack_hw_configure_, address validation |
| tt-metal/tt_metal/hw/inc/internal/tt-1xx/blackhole/cfg_defines.h | All ADDR32 register positions and bit masks |
| blackhole-py/disasms/add1/add1_trisc0.S | Eltwise add kernel unpack disassembly |
| blackhole-py/disasms/matmul_peak/matmul_trisc0.S | Matmul kernel unpack disassembly |