This walkthrough covers how data moves from the input circular buffer in L1 through the unpackers into SrcA/SrcB/Dst, gets computed on by the FPU/SFPU, and is written back to L1 by the four packers. It covers every config register, every relevant TT* instruction, and the LLK programming model.
Note. Tenstorrent's official ISA docs split documentation across tt-isa-documentation/WormholeB0/ (which has the detailed Unpackers/Packers sub-trees) and tt-isa-documentation/BlackholeA0/ (which has BH-specific overrides like Dst.md, BackendConfiguration.md, ConfigurationUnit.md). On Blackhole the functional model is largely the same as Wormhole B0: the same UNPACR / PACR pipeline, with a few BH-specific additions (Auto-TTSync, extra context counts, an 8-bit FP8 e4m3 helper, 1.5 MiB L1 instead of 1.46 MiB, the Mover, new SFPU ops). Below I cite which docs the details come from, and how the LLK (tt-llk/tt_llk_blackhole/) actually programs the registers.
A Tensix tile has:
Figure 1. CB → SrcA/SrcB/Dst → math → Dst → CB. Dashed arrow shows Unpacker 0's optional direct-to-Dst bypass; bidirectional arrow shows the SFPU operating in-place on Dst.

Three Tensix-coprocessor threads (T0/T1/T2) issue Tensix instructions independently and the backend execution units (Unpackers / Matrix Unit / Vector Unit / Packers / ThCon / Sync / Mover / Configuration / Misc) are shared. Unpackers usually take instructions from T0, the FPU/SFPU from T1, the packers from T2, but any thread can issue any instruction. Synchronization is via STALLWAIT, semaphores, SETDVALID, and the SrcA/SrcB "AllowedClient" mechanism described below.
SrcA / SrcB: 64 rows × 16 columns. Each datum is up to 19 bits wide (1 sign + 8 exponent + 10 mantissa, i.e. TF32). Software-visible types in SrcA/SrcB are: TF32, BF16, FP16, Integer "16" (opaque), Integer "8" (1S + 10M magnitude). One bank holds enough for a 16×16 face's worth of data with a few extras; unpackers and the FPU swap banks via the "AllowedClient" hand-off.

Dst: 1024 rows × 16 cols × 16 bits (Dst16b view) or 512 rows × 16 cols × 32 bits (Dst32b view), both over the same underlying storage. Dst can hold FP32, BF16, FP16, Integer "32", Integer "16", Integer "8". Dst's row index goes through the Adj16/Adj32 swizzle (DEST_ACCESS_CFG_remap_addrs, DEST_ACCESS_CFG_swizzle_32b). Each row has a 1-bit DstRowValid flag used by the hardware to track when a write has retired vs. is still in flight (4-cycle scoreboard).

Tile: 32×32 datums, divided into 4 faces (F0..F3) of 16×16 each, each face stored row-major. Multiple tiles can be held in Dst (configured via DEST_TARGET_REG_CFG_*_Offset).

Both unpackers and packers compute addresses by stepping through "ADC" channels rather than by getting an explicit address each instruction. Each thread has one ADC per unpacker (and a shared one for the packers). Each ADC has two channels (Channel[0] = input/source side, Channel[1] = output/dest side), each with X, Y, Z, W counters + Y_Cr, Z_Cr "current row" shadow counters.
ADC counters are advanced by:
- SETADC / SETADCXY / SETADCZW / SETADCXX: absolute set
- INCADCXY / INCADCZW: increment
- ADDRCRXY / ADDRCRZW: copy Cr → current
- REG2FLOP: load from L1
- AddrMod field of UNPACR/PACR (auto-increment using ADDR_MOD_AB_SEC[i] / ADDR_MOD_PACK_SEC[i] "modifier" config slots, indexed 0..3, two banks for unpack)

This is how a single UNPACR for a face advances to the next face automatically.
There are two unpackers. Unpacker 0 writes to SrcA or directly to Dst. Unpacker 1 writes to SrcB. They share an L1 read port but can run concurrently with throttle-mode arbitration (see §2.5).
Source: tt-llk/tt_llk_blackhole/common/inc/ckernel_ops.h, opcode 0x42. The full BH encoding is:

    TT_OP_UNPACR(
        Unpack_block_selection, // 1 bit  : 0 = Unpacker 0, 1 = Unpacker 1
        AddrMode,               // 8 bits : packed Ch0YInc/Ch0ZInc/Ch1YInc/Ch1ZInc (2 bits each)
        CfgContextCntInc,       // 2 bits : used to bump context-counter after instr
        CfgContextId,           // 3 bits : explicit context (used when AutoIncContextID=0)
        AddrCntContextId,       // 2 bits : which ADC bank to use (0..2 = T0..T2)
        OvrdThreadId,           // 1 bit  : was MultiContextMode in WH docs (see below)
        SetDatValid,            // 1 bit  : hand SrcA/B bank to Matrix Unit after unpack
        srcb_bcast,             // 1 bit  : RAREFYB / broadcast B
        ZeroWrite2,             // 1 bit  : AllDatumsAreZero, write zeros instead of fetching
        AutoIncContextID,       // 1 bit  : use context counter instead of CfgContextId
        RowSearch,              // 1 bit  : use BlobsYStart row-search mode
        SearchCacheFlush,       // 1 bit  : flush the RSI cache
        Last)                   // 1 bit  : end-of-context marker

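The field widths in the listing add up to 24 bits, which fits under an 8-bit opcode in a 32-bit instruction word. The following is a minimal sketch of such an encoder, not the LLK macro itself: it assumes the fields are packed MSB-first in the order listed and that the opcode sits in bits [31:24]. The names `UnpacrFields`, `field`, and `encode_unpacr_sketch` are hypothetical.

```cpp
#include <cstdint>

// Field order and widths are copied from the listing above. The MSB-first
// packing and the opcode position in bits [31:24] are assumptions of this sketch.
struct UnpacrFields {
    std::uint32_t unpack_block_selection; // 1 bit
    std::uint32_t addr_mode;              // 8 bits
    std::uint32_t cfg_context_cnt_inc;    // 2 bits
    std::uint32_t cfg_context_id;         // 3 bits
    std::uint32_t addr_cnt_context_id;    // 2 bits
    std::uint32_t ovrd_thread_id;         // 1 bit
    std::uint32_t set_dat_valid;          // 1 bit
    std::uint32_t srcb_bcast;             // 1 bit
    std::uint32_t zero_write2;            // 1 bit
    std::uint32_t auto_inc_context_id;    // 1 bit
    std::uint32_t row_search;             // 1 bit
    std::uint32_t search_cache_flush;     // 1 bit
    std::uint32_t last;                   // 1 bit
};

// Mask a value to `width` bits and move it to bit position `shift`.
constexpr std::uint32_t field(std::uint32_t value, unsigned width, unsigned shift) {
    return (value & ((1u << width) - 1u)) << shift;
}

// Build a 32-bit word: opcode 0x42 on top, 24 bits of operands below.
constexpr std::uint32_t encode_unpacr_sketch(const UnpacrFields& f) {
    return (0x42u << 24)
         | field(f.unpack_block_selection, 1, 23)
         | field(f.addr_mode, 8, 15)
         | field(f.cfg_context_cnt_inc, 2, 13)
         | field(f.cfg_context_id, 3, 10)
         | field(f.addr_cnt_context_id, 2, 8)
         | field(f.ovrd_thread_id, 1, 7)
         | field(f.set_dat_valid, 1, 6)
         | field(f.srcb_bcast, 1, 5)
         | field(f.zero_write2, 1, 4)
         | field(f.auto_inc_context_id, 1, 3)
         | field(f.row_search, 1, 2)
         | field(f.search_cache_flush, 1, 1)
         | field(f.last, 1, 0);
}
```

Under these assumptions, an Unpacker 1 instruction with SetDatValid and Last set and everything else zero encodes with bits 23, 6, and 0 set below the opcode byte. Check the real bit positions against ckernel_ops.h before relying on this layout.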
The full functional model is the one in WormholeB0/.../UNPACR_Regular.md (621 lines). It is exactly the same on Blackhole; the field layout above is the BH-specific encoding (the AddrCntContextId / OvrdThreadId split is BH's renaming of WH's ContextADC/MultiContextMode).
The high-level steps the hardware performs:

    1. Pick state-bank from ThreadConfig[CurrentThread].CFG_STATE_ID_StateID (0 or 1)
    2. Resolve WhichContext (0..7) from either CfgContextId or the per-unpacker
       ContextCounter, then add UNPACK_MISC_CFG_CfgContextOffset[WhichUnpacker]
    3. Read the tile descriptor (THCON_SEC[u].REG0_TileDescriptor) → InDataFormat,
       XDim, YDim, ZDim, WDim, IsUncompressed, BlobsPerXYPlane, DigestSize, ...
    4. Compute starting L1 address from REG3_Base_address (+ REG3_Base_cntx in MC mode)
       and REG7_Offset_address, skip 16 bytes of tile header times (1+DigestSize).
       If compressed: also read row-start indices from a sub-region.
       If BFP format: also point at the per-16-datum exponent sub-region.
    5. Compute number of input datums = (ADC.Channel[1].X+1) - ADC.Channel[0].X
    6. Compute starting OUTPUT address inside SrcA/SrcB/Dst from:
       UNP[u].ADDR_BASE_REG_1_Base + ADC.Ch1.Y*Ystride + Z*Zstride + W*Wstride
       (in datums, not bytes; >>1 for 16-bit, >>2 for 32-bit output formats)
    7. For each input datum:
       a. Read DatumSizeBytes from L1
       b. If BFP: read shared 8-bit exponent (1 byte per 16 datums)
       c. If compressed: read the 4-bit RLE delta and insert that many zeroes
       d. FormatConversion(InDataFormat, OutDataFormat, datum, exp, unpacker, toDst)
       e. Optional upsampling (insert 0/1/2/4 zero datums after each datum)
       f. Write to SrcA[bank][row][col], SrcB[bank][row][col], or Dst32b/Dst16b
          (wait if AllowedClient ≠ Unpackers, i.e. FPU hasn't released the bank)
    8. After the loop: bump ADC channels by AddrMod, optionally flip the SrcA/SrcB
       bank (FlipSrc) and hand the now-full bank to the Matrix Unit, bump
       ContextCounter if AutoIncContextID.

The address-wrap behaviour is the circular FIFO in L1: addresses beyond Unpack_limit_address wrap by subtracting Unpack_fifo_size. This is what lets the CB abstraction work — the CB is just a configured circular range in L1.
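The wrap rule is small enough to write down directly. This is a sketch under one assumption: "beyond" is read as a strict greater-than comparison. The function name `wrap_unpack_address` is hypothetical, and all values are in 16-byte units like the Unpack_limit_address and Unpack_fifo_size fields.

```cpp
#include <cstdint>

// All arguments are in 16-byte units. Treating "beyond the limit" as a strict
// greater-than is an assumption of this sketch, not something the text confirms.
constexpr std::uint32_t wrap_unpack_address(std::uint32_t addr,
                                            std::uint32_t unpack_limit_address,
                                            std::uint32_t unpack_fifo_size) {
    return addr > unpack_limit_address ? addr - unpack_fifo_size : addr;
}
```

With a FIFO of 0x10 units ending at 0x10F, an address of 0x110 would wrap back to 0x100, which is the first line of the circular range.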
UNPACR_NOP (TT_OP_UNPACR_NOP) is a multi-mode "stall/control" instruction that runs in order with the unpacker queue, so that its effect happens at exactly the right point in the unpacker pipeline:
| Mode | Effect | Doc |
|---|---|---|
| ZEROSRC mode | Set SrcA or SrcB bank to all zeroes, sequenced with UNPACR | UNPACR_NOP_ZEROSRC.md |
| SETDVALID mode (Set_Dvalid=1) | Mark Src bank as valid → hand to Matrix Unit (AllowedClient = MatrixUnit) | UNPACR_NOP_SETDVALID.md |
| SETREG mode | MMIO register write, sequenced with UNPACR | UNPACR_NOP_SETREG.md |
| Msg_Clr_Cnt mode | Stream/Overlay message clear | UNPACR_NOP_OverlayClear.md |
| Nop mode | Occupy unpacker for one cycle (used as a barrier/replay padding) | UNPACR_NOP_Nop.md |
| Unpack_Pop mode | Pop unpacker FIFO | (covered in UNPACR_NOP.md) |
UNPACR itself has 3 special sub-modes when Unpack_block_selection / SetDvalid etc. are programmed unusually:
- Increment context counter (UNPACR_IncrementContextCounter.md): bumps the per-unpacker ContextCounter[CurrentThread] without doing any L1 access.
- Flush cache (UNPACR_FlushCache.md): invalidates the small RSI cache used when decompressing.
- Regular (UNPACR_Regular.md): the normal one described above.

The unpacker config registers live in the global Config[2][...] array (TENSIX_CFG_BASE), with the symbols in tt_metal/hw/inc/internal/tt-1xx/blackhole/cfg_defines.h. There are ~370 packer/unpacker fields total; the ones that matter for unpack are:

THCON_SEC[u].REG0_TileDescriptor (written with WRCFG ... wr128b=1 as a 128-bit blob). This is the single most important block; unpack_tile_descriptor_t in cunpack_common.h shows the BH bitfield layout:
| Field | Bits | Meaning |
|---|---|---|
| in_data_format | [3:0] | Source format in L1: encoding table below. |
| uncompressed | [4] | 1 = no RLE; 0 = decompress on the fly (must also configure Disable_zero_compress_cntx). |
| blobs_per_xy_plane | [11:8] | Sparsity ("blob") structure, used with RowSearch + BlobsYStart. |
| x_dim | [31:16] | X (column) dim of the source tile in L1. |
| y_dim | [47:32] | Y (row) dim. |
| z_dim | [63:48] | Z stride count (0 means 1). |
| w_dim | [79:64] | W stride count (0 means 1). |
| blobs_y_start | [127:80] | Row-start table for sparse blob mode. |
| digest_type/digest_size | [...:120] | Tile-header skip = (1+digest_size)*16 bytes. |
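To make the bit ranges concrete, here is a hedged sketch that packs the low fields of the descriptor into 32-bit words using only the positions given in the table. It is not the LLK's `unpack_tile_descriptor_t`; the struct `TileDescriptorSketch` and the helper names are hypothetical, and the blobs_y_start and digest fields are left out because the table does not pin down their exact layout.

```cpp
#include <cstdint>

// Hypothetical container for the fields whose bit ranges the table states fully.
struct TileDescriptorSketch {
    std::uint32_t in_data_format;     // bits [3:0]
    std::uint32_t uncompressed;       // bit  [4]
    std::uint32_t blobs_per_xy_plane; // bits [11:8]
    std::uint32_t x_dim;              // bits [31:16]
    std::uint32_t y_dim;              // bits [47:32]
    std::uint32_t z_dim;              // bits [63:48]
    std::uint32_t w_dim;              // bits [79:64]
};

// Word 0 of the 128-bit blob: descriptor bits [31:0].
constexpr std::uint32_t descriptor_word0(const TileDescriptorSketch& d) {
    return (d.in_data_format & 0xFu)
         | ((d.uncompressed & 0x1u) << 4)
         | ((d.blobs_per_xy_plane & 0xFu) << 8)
         | ((d.x_dim & 0xFFFFu) << 16);
}

// Word 1: descriptor bits [63:32], i.e. y_dim in the low half, z_dim in the high half.
constexpr std::uint32_t descriptor_word1(const TileDescriptorSketch& d) {
    return (d.y_dim & 0xFFFFu) | ((d.z_dim & 0xFFFFu) << 16);
}

// Low half of word 2: descriptor bits [79:64] hold w_dim.
constexpr std::uint32_t descriptor_word2_low(const TileDescriptorSketch& d) {
    return d.w_dim & 0xFFFFu;
}

// Tile-header skip from the last table row: (1 + digest_size) * 16 bytes.
constexpr std::uint32_t tile_header_skip_bytes(std::uint32_t digest_size) {
    return (1u + digest_size) * 16u;
}
```

For an uncompressed descriptor with format code 5 and a 16×16 face, word 0 carries the format in its low nibble, the uncompressed flag in bit 4, and 16 in the top half.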
Data-format 4-bit encoding (same field everywhere — TileDescriptor, REG2, REG7 contexted, ALU_FORMAT_SPEC, packer REG1):
| | 0b??11 | 0b??10 | 0b??01 | 0b??00 |
|---|---|---|---|---|
| 0b00?? | BFP4a | BFP8a | FP16 | FP32 |
| 0b01?? | BFP4 | BFP8 | BF16 | TF32 |
| 0b10?? | BFP2a | FP8 (e5m2) | INT16 | INT32 |
| 0b11?? | BFP2 | INT8 | - | - |
Plus a 5th bit "Unp_LF8_4b_exp" turns FP8 e5m2 into FP8 e4m3 — set when DataFormat::Fp8_e4m3; see THCON_SEC0_REG1_Unp_LF8_4b_exp_RMW programmed in configure_unpack_AB().
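The encoding table can be read off as a plain enum. The sketch below transcribes the 4-bit codes from the table; the enumerator names follow the table's labels rather than the LLK's `DataFormat` spelling, and `DataFormatCode` and `is_block_float` are hypothetical names introduced for illustration.

```cpp
#include <cstdint>

// 4-bit codes transcribed from the table (row = high two bits, column = low two bits).
enum class DataFormatCode : std::uint8_t {
    FP32  = 0b0000,
    FP16  = 0b0001,
    BFP8a = 0b0010,
    BFP4a = 0b0011,
    TF32  = 0b0100,
    BF16  = 0b0101,
    BFP8  = 0b0110,
    BFP4  = 0b0111,
    INT32 = 0b1000,
    INT16 = 0b1001,
    FP8   = 0b1010, // e5m2 unless the separate Unp_LF8_4b_exp bit selects e4m3
    BFP2a = 0b1011,
    INT8  = 0b1110,
    BFP2  = 0b1111,
};

// True for the block-floating-point formats, which carry a shared exponent
// stream in L1 next to the datum stream.
constexpr bool is_block_float(DataFormatCode code) {
    switch (code) {
        case DataFormatCode::BFP8a:
        case DataFormatCode::BFP4a:
        case DataFormatCode::BFP8:
        case DataFormatCode::BFP4:
        case DataFormatCode::BFP2a:
        case DataFormatCode::BFP2:
            return true;
        default:
            return false;
    }
}
```

Codes 0b1100 and 0b1101 are deliberately absent because the table leaves them empty.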
THCON_SEC[u] and UNP[u] fields (unpack_config_t is REG1+REG2 in 128-bit form):

| Field (BH name in cfg_defines.h) | What it does |
|---|---|
| THCON_SEC[u]_REG2_Out_data_format | OutDataFormat (4 bits): what hardware presents to SrcA/SrcB/Dst (see §2.6 conversions). |
| THCON_SEC[u]_REG2_Throttle_mode | x1/x2/x4 L1 read speed (0/1/2; see §2.5). |
| THCON_SEC[u]_REG2_Context_count | log₂ of how many UnpackContexts to round-robin (0..3 → 1/2/4/8 contexts). |
| THCON_SEC[u]_REG2_Haloize_mode | Enables X/Y transpose on the way into SrcA (Unpacker 0 only). |
| THCON_SEC[u]_REG2_Tileize_mode | "Discontiguous input rows" mode: rows are NOT contiguous in L1, hop by Shift_amount_cntx. Used by llk_unpack_tilize.h. |
| THCON_SEC[u]_REG2_Unpack_Src_Reg_Set_Upd | If 1, after each UNPACR advance SrcRow by 16 + SrcA/B_SET_Base*16. Used for ping-pong'ing within a bank. |
| THCON_SEC[u]_REG2_Unpack_If_Sel | Unpacker 0 only: 0 = write to SrcA, 1 = write to Dst. (Per-context version: REG2_Unpack_if_sel_cntx[].) |
| THCON_SEC[u]_REG2_Upsample_rate | 0/1/2/3 → 0/1/2/4 inserted zeros per datum. |
| THCON_SEC[u]_REG2_Upsample_and_interleave | If 1, skip output positions instead of inserting zeros. |
| THCON_SEC[u]_REG2_Ovrd_data_format | If 1 (in MultiContext mode), use the per-context REG7 format instead of the descriptor. |
| THCON_SEC[u]_REG2_Force_shared_exp | For BFP formats: ignore the L1 exponent stream, use UNP[u].FORCED_SHARED_EXP_shared_exp for all datums. Used when you want to upcast INT8 via the BFP path. |
| THCON_SEC[u]_REG2_Shift_amount_cntx[0..3] | Tileize row stride (4 bits per ctx, concatenated to make a 12-bit byte stride) OR ColShift for SrcA writes. |
| THCON_SEC[u]_REG2_Unpack_limit_address | Upper L1 bound (in 16-byte units) for FIFO wrap. |
| THCON_SEC[u]_REG2_Unpack_fifo_size | FIFO size (in 16-byte units): what gets subtracted on wrap. |
| THCON_SEC[u]_REG2_Disable_zero_compress_cntx[0..7] | Per-context: 1 = treat that context's data as uncompressed (overrides the TileDescriptor when in MultiContextMode). |
| THCON_SEC[u]_REG3_Base_address (32b) | Base L1 address of the input buffer, in 16-byte units. |
| THCON_SEC[u]_REG3_Base_cntx[1..7].address | Per-context base addresses (context 0 uses REG3_Base_address). |
| THCON_SEC[u]_REG5_Tile_x_dim_cntx[0..3] | Unpacker 0 multi-context X override (16 bits per ctx, packed two-per-32). |
| THCON_SEC[u]_REG5_Dest_cntx[0..3].address | Unpacker 0 per-context Dst starting row. |
| THCON_SEC[u]_REG7_Offset_address | 16-bit offset added to Base_address each instruction (set by unpacker_wrapup() to zero on flush). |
| THCON_SEC[u]_REG7_Offset_cntx[1..3].address | Per-context offsets. |
| THCON_SEC[u]_REG7_Unpack_data_format_cntx[0..7] | Per-context InDataFormat override. |
| THCON_SEC[u]_REG7_Unpack_out_data_format_cntx[0..7] | Per-context OutDataFormat override. |
| THCON_SEC[u]_REG1_Unp_LF8_4b_exp | 1 = treat 8-bit float as e4m3 instead of e5m2. |
| UNP[u].ADDR_BASE_REG_1_Base | Base "datum index" within SrcA/SrcB/Dst, combined with ADC Channel[1] Y/Z/W and the strides. |
| UNP[u].ADDR_CTRL_XY_REG_1_Ystride | Y stride in datums. |
| UNP[u].ADDR_CTRL_ZW_REG_1_Zstride, Wstride | Z, W strides. |
| UNP[u].FORCED_SHARED_EXP_shared_exp | The fixed exponent used when Force_shared_exp = 1. |
| UNP[u].ADD_DEST_ADDR_CNTR_add_dest_addr_cntr | If 1, add per-context Dst row to base; if 0, override. |
| UNP0_BLOBS_Y_START_CNTX[0/2].blobs_y_start[8] | Per-context BlobsYStart (overrides TileDescriptor in MC mode). |
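Several of the fields above are stored in an encoded form rather than as the quantity they control. The helpers below are a small sketch of those decodings, taken directly from the table rows for Context_count, Upsample_rate, and the 16-byte-unit address fields. The function names are hypothetical.

```cpp
#include <cstdint>

// Context_count holds log2 of the number of contexts: 0..3 -> 1/2/4/8.
constexpr unsigned unpack_context_count(unsigned context_count_field) {
    return 1u << (context_count_field & 0x3u);
}

// Upsample_rate: 0/1/2/3 -> 0/1/2/4 zeros inserted after each datum.
constexpr unsigned upsample_inserted_zeros(unsigned upsample_rate_field) {
    return (upsample_rate_field & 0x3u) == 0u
               ? 0u
               : 1u << ((upsample_rate_field & 0x3u) - 1u);
}

// Base_address, Unpack_limit_address and Unpack_fifo_size are in 16-byte units.
constexpr std::uint32_t l1_units_to_bytes(std::uint32_t units) {
    return units << 4;
}
```

The reverse direction (`bytes >> 4`) is what the LLK excerpt later in this document does when it writes a CB base address into REG3_Base_address.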
Per-thread ThreadConfig fields (written with SETC16 since you can't sw to it):

| Field | Purpose |
|---|---|
| CFG_STATE_ID_StateID | Selects which of the two Config banks (0/1) this thread reads: basic double-buffering of all backend config. |
| UNPACK_MISC_CFG_CfgContextOffset_0, _1 | Per-thread per-unpacker base for context index (added to CfgContextId). |
| UNPACK_MISC_CFG_CfgContextCntReset_0, _1 | Writing here resets that thread's context counter (immediate side-effect). |
| SRCA_SET_Base, SRCB_SET_Base | Source-row base (×16) used when Unpack_Src_Reg_Set_Upd=1 for ping-pong write addressing. |
| SRCA_SET_SetOvrdWithAddr | If 1, write to a full 64-row range of SrcA using the address generator directly. |
| ADDR_MOD_AB_SEC[0..3].SrcXIncr/SrcYIncr/... | Address-modifier slots for UNPACR / MOVD2A etc., used to bump SrcA/SrcB row counters. |
| ALU_FORMAT_SPEC_REG0_SrcA*, REG0_SrcB*, REG2_Dstacc* | The format the FPU sees (overrides for INT8 signed/unsigned and Dst accumulator format). |
| ALU_ACC_CTRL_Fp32_enabled, SFPU_Fp32_enabled, INT8_math_enabled, Zero_Flag_disabled_src/dst | FPU/SFPU global modes. |
From the UNPACR functional model (relevant slice):

    if (WhichUnpacker == 1) {                       // SrcB
        while (SrcB[Bank].AllowedClient != SrcClient::Unpackers) wait;
        Row = (Row + CurrentUnpacker.SrcRow[CurrentThread]) & 0x3f;
        SrcB[Bank][Row][Col] = Datum;
    } else if (!UnpackToDst) {                      // SrcA
        while (SrcA[Bank].AllowedClient != SrcClient::Unpackers) wait;
        if (Row < 4 || Col < ColShift) continue;    // header rows are skipped
        Row -= 4;
        Col -= ColShift;                            // optional per-context column shift
        Row += CurrentUnpacker.SrcRow[CurrentThread];
        if (Transpose) std::swap(Row & 0xf, Col);   // Haloize_mode XY transpose
        SrcA[Bank][Row][Col] = Datum;
    } else {                                        // Dst
        Row -= 4;
        if (OutDataFormat in {FP32,TF32,INT32}) Dst32b[Row][Col] = Datum;
        else                                    Dst16b[Row][Col] = Datum;
    }

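The `AllowedClient` checks in the model above amount to a two-state ownership hand-off per Src bank. The following is a minimal sketch of that state machine, assuming only the two clients named in this document; `SrcBankState` and the helper functions are hypothetical and model ownership only, not the data in the bank.

```cpp
// The two owners a Src bank can have, as named in the functional model.
enum class SrcClient { Unpackers, MatrixUnit };

// Hypothetical per-bank ownership state.
struct SrcBankState {
    SrcClient allowed_client;
};

// The unpacker may write only while it owns the bank; otherwise it stalls.
constexpr bool unpacker_may_write(SrcBankState bank) {
    return bank.allowed_client == SrcClient::Unpackers;
}

// The Matrix Unit may read only after the bank has been handed over.
constexpr bool matrix_unit_may_read(SrcBankState bank) {
    return bank.allowed_client == SrcClient::MatrixUnit;
}

// SETDVALID (or SetDatValid on UNPACR) hands the bank to the Matrix Unit.
constexpr SrcBankState after_setdvalid(SrcBankState) {
    return SrcBankState{SrcClient::MatrixUnit};
}

// CLRDVALID returns the bank to the unpackers.
constexpr SrcBankState after_clrdvalid(SrcBankState) {
    return SrcBankState{SrcClient::Unpackers};
}
```

With two banks per Src register, running this state machine once per bank gives the double-buffering pattern: one bank can be in the `MatrixUnit` state while the other is in the `Unpackers` state.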
Key things:
- AllowedClient ∈ {Unpackers, MatrixUnit}. The unpacker stalls until the bank is its own; after writing it, software issues a SETDVALID or sets FlipSrc=1 on the next UNPACR to flip the bank and hand it to the FPU. The FPU then computes from it and eventually releases the bank (via CLRDVALID from MVMUL-style instructions, controlled by CLR_DVALID_SrcA_Disable/SrcB_Disable). This is the natural double-buffer: while FPU consumes bank 0, unpacker fills bank 1.
- ... MVMUL with srcb_bcast).
- ... llk_unpack_A.h with BroadcastType::NONE and unpack_to_dest=1).

There's one shared L1 read port. Each unpacker requests x1 (16 B/cyc), x2 (32), or x4 (64), chosen by THCON_SEC[u].Throttle_mode (0/1/2). Constraints force x1 when decompressing, when upsampling ≥ 3, when BFP2/2a, etc. If both unpackers want bandwidth, an arbiter shaves one of them down; the full conflict table is in UNPACR_Regular.md#performance.
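The three throttle settings map to read widths by a simple shift. The sketch below encodes that mapping and derives a best-case cycle count for a given number of bytes. It ignores arbitration and the forced-x1 cases, so treat the result as a lower bound, not a prediction. Both function names are hypothetical.

```cpp
#include <cstdint>

// Throttle_mode 0/1/2 -> x1/x2/x4 -> 16/32/64 bytes per cycle.
constexpr std::uint32_t l1_read_bytes_per_cycle(unsigned throttle_mode) {
    return 16u << throttle_mode;
}

// Best-case cycles to stream `bytes` from L1 at the requested rate.
// Assumes the unpacker actually gets that rate, with no arbiter slowdown.
constexpr std::uint32_t min_read_cycles(std::uint32_t bytes, unsigned throttle_mode) {
    return (bytes + l1_read_bytes_per_cycle(throttle_mode) - 1u)
           / l1_read_bytes_per_cycle(throttle_mode);
}
```

A 16×16 face of 2-byte datums is 512 bytes, so under these assumptions it needs at least 32 cycles at x1 and at least 8 cycles at x4.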
| L1 in | → SrcA/B as | → Dst as |
|---|---|---|
| FP32/TF32 | TF32 / BF16 / FP16 | FP32 / BF16 / FP16 |
| BF16 | TF32 / BF16 | BF16 |
| BFP8/4/2 (or INT8 via Force_shared_exp) | TF32 / BF16 | BF16 |
| BFP8a/4a/2a | FP16 | FP16 |
| FP16 / FP8 | FP16 | FP16 |
| INT32 (sign-magnitude) | — | Integer "32" |
| INT16 | Integer "16" | Integer "16" |
| INT8 / UINT8 | Integer "8" (via BFP8 path, needs SrcAUnsigned/SrcBUnsigned) | Integer "8" |
(From tt-isa-documentation/WormholeB0/.../Unpackers/FormatConversion.md; BH is identical apart from the e4m3 helper.)
There are four packers. They share one input port to Dst and one output port to L1, but each has its own pipeline and its own copy of REG1 in Config (the four packer configs live as THCON_SEC[01]_REG[18] etc., with PackerIndex selecting which one). Each PACR instruction kicks off one to four packers at once.
Source: tt-llk/tt_llk_blackhole/common/inc/ckernel_ops.h, opcode 0x41. BH encoding (notice it grew vs. WH):

    TT_OP_PACR(
        CfgContext,     // 3 bits : which packer config context to use
        RowPadZero,     // 1 bit  : pad short rows with zero
        DstAccessMode,  // 1 bit  : Dst input vs L1 input (Source_interface_selection)
        AddrMode,       // 2 bits : ADDR_MOD_PACK_SEC slot 0..3
        AddrCntContext, // 2 bits : which ADC to use when OvrdThreadId=1
        ZeroWrite,      // 1 bit  : pull from /dev/null (write zeros)
        ReadIntfSel,    // 1 bit  : alternate Dst read interface
        OvrdThreadId,   // 1 bit  : use AddrCntContext instead of CurrentThread
        Concat,         // 1 bit  : continue current compression row
        CtxtCtrl,       // 1 bit  : auto-context-counter behavior
        Flush,          // 1 bit  : flush pre-L1 output buffers
        Last)           // 1 bit  : flush + start next packer at fresh address

On Blackhole, PackerMask is not encoded in the instruction the way the simplified WH model shows. Instead, the per-context pack_start_intf_pos (4 bits in unpack_config_t/pack_config_t) and the per-packer Enable_out_fifo together select which packers run. Typical kernels enable all four for one full tile (matching the 4 faces) or just one when packing untilized.
There's also TT_OP_PACR_SETREG (sequenced MMIO-register write that retires after late format conversion of all in-flight packers — used to atomically tick a CB-write-pointer right when the data hits L1) and the helper instructions:
- TT_OP_SETPKEDGOF(y_end, y_start, x_end, x_start): set edge-mask offsets used by edge-masking
- TT_OP_CLREXPHIST: clear all four packers' exponent histogram
- TT_OP_SETASHRMH, TT_OP_SETASHRMH0, TT_OP_SETASHRMH1, TT_OP_SETASHRMV: set "all-shared row mask" (used for halo/shift patterns)

Source: Packers/README.md and per-stage docs.
The four packers reuse THCON_SEC0_REG1_* (packer 0+1) and THCON_SEC1_REG1_* (packer 2+3), plus THCON_SEC0_REG8_* / THCON_SEC1_REG8_* for the other half — see cfg_defines.h for the exact mapping. The pack_config_t in cpack_common.h describes the per-packer 96-bit programmable block (REG1 in WH-language).
| Field | Meaning |
|---|---|
| row_ptr_section_size | Bytes reserved in L1 for the RSI (row-start-index) section when compressing. |
| exp_section_size | Bytes reserved for the per-16-datum exponent section (BFP outputs / FP8 / INT8). |
| l1_dest_addr | Base L1 address (16-byte units) for this packer's output. |
| uncompress | 1 = no RLE-zero compression. |
| add_l1_dest_addr_offset | If 1, add l1_dest_addr_offset (a 16b TDMA-RISC writable register) to base. |
| disable_pack_zero_flag | If 1, never produce the per-datum zero flag (forces uncompressed). |
| out_data_format (4b) | "LateToFormat": what hits L1. Format encoding from §2.3. |
| in_data_format (4b) | "LateFromFormat": the intermediate format coming into late conversion. |
| dis_shared_exp_assembler | Disable the BFP shared-exponent assembler. |
| auto_set_last_pacr_intf_sel | Auto-flip interface-select on the last PACR of a packing burst. |
| enable_out_fifo | Enable this packer's L1 output FIFO (must be 1 to write at all). |
| sub_l1_tile_header_size | 1 = the addr math subtracts the 16-byte tile-header bump that L1_Dest_addr otherwise has built in. |
| src_if_sel (also Source_interface_selection) | Packer 0 only: 0 = source from Dst, 1 = source from L1 (DMA mode used by Mover-style copies). |
| pack_start_intf_pos (4b) | Which interfaces are "live": bitmask selecting packers 0..3 per face. |
| all_pack_disable_zero_compress_ovrd | If 1, the global All_pack_disable_zero_compress[i] mask overrides per-packer uncompress. |
| add_tile_header_size | Add the 16-byte tile header to outputs. |
| pack_dis_y_pos_start_offset | Disable applying Y-position start offset (used by un-tilize). |
| l1_src_addr (8b) | High bits of L1 source address when src_if_sel=1. |
| Field | Meaning |
|---|---|
| PCK0_ADDR_BASE_REG_0_Base / _REG_1_Base | Base address for the input (Channel[0], Dst-side) and output (Channel[1], L1-side) sides. |
| PCK0_ADDR_CTRL_XY_REG_0_Xstride / Ystride | Channel[0] X/Y strides (Dst-side, in bytes for X). |
| PCK0_ADDR_CTRL_ZW_REG_0_Zstride / Wstride | Channel[0] Z/W strides. |
| PCK0_ADDR_CTRL_XY_REG_1_Ystride | Channel[1] Y stride (L1-side; X stride on L1 side isn't used because output is always linear within a row). |
| PCK0_ADDR_CTRL_ZW_REG_1_Zstride / Wstride | Channel[1] Z/W strides. |
| PACK_COUNTERS_SEC[i].pack_per_xy_plane | Iteration count per face. |
| PACK_COUNTERS_SEC[i].pack_reads_per_xy_plane | How many "reads" the TilePositionGenerator does before advancing Y or Z. |
| PACK_COUNTERS_SEC[i].pack_xys_per_til | XY-plane count per tile. |
| PACK_COUNTERS_SEC[i].pack_yz_transposed | If 1, advance Z before Y (transposed face order). |
| PACK_COUNTERS_SEC[i].pack_per_xy_plane_offset | Per-packer Y offset. |
| DEST_TARGET_REG_CFG_PACK_SEC[i].Offset | Per-packer Dst-row offset (in 16-row units): where in Dst this packer starts reading. |
| DEST_TARGET_REG_CFG_PACK_SEC[i].ZOffset | Per-packer Z-offset into Dst. |
| THCON_SEC[01]_REG9.{Pack_limit_address, Pack_fifo_size} | Circular L1 wrap. |
| Field | Meaning |
|---|---|
| PCK_DEST_RD_CTRL_Read_32b_data | 1 = read 32-bit from Dst (Dst32b view), 0 = 16-bit (Dst16b view). |
| PCK_DEST_RD_CTRL_Round_10b_mant | If 1, round mantissa to 10 bits during early conversion (TF32 path). |
| PCK_DEST_RD_CTRL_Read_int8 | "Read raw": skip the early conversion rounding (truncate / bitcast). |
| PCK_DEST_RD_CTRL_Read_unsigned | For INT8/UINT8 path: say whether the byte is signed or unsigned. |
| ALU_FORMAT_SPEC_REG_Dstacc_override, ALU_FORMAT_SPEC_REG_Dstacc_val, ALU_FORMAT_SPEC_REG2_Dstacc | Specify the IntermediateFormat for the early conversion. |
| ALU_ROUNDING_MODE_Packer_srnd_en | Stochastic rounding for the packer (BFP rounds). |
| ALU_ROUNDING_MODE_Gasket_srnd_en | Stochastic rounding for the late conversion gasket. |
| INT_DESCALE_Enable, INT_DESCALE_Mode, INT_DESCALE_VALUES_SEC[].Value | Shift-amount applied during INT32→INT8 conversion (for quantization). |
| STACC_RELU_ApplyRelu (4b) | NO_RELU / ZERO_RELU / MIN_THRESHOLD_RELU / MAX_THRESHOLD_RELU (2b per packer). |
| STACC_RELU_ReluThreshold | 16-bit BF16 or FP16 threshold value. |
| PCK_EDGE_OFFSET_SEC[0..3].mask | 16-bit column masks for edge masking. |
| PCK_EDGE_OFFSET_SEC0.{mode, tile_row_set_select_pack[0..3]} | Which of the four masks each packer uses, and what mode means (per-row vs per-face). |
| TILE_ROW_SET_MAPPING[0..3].row_set_mapping[16] | 2 bits per row of a face → which mask slot to apply. |
| PCK_EDGE_TILE_FACE_SET_SELECT_*, PCK_EDGE_TILE_ROW_SET_SELECT_* | Higher-level mask selectors. |
| PCK_EDGE_MODE_mode | 0 = replace masked datums with 0, 1 = replace with −∞ (for softmax-style max-reduce). |
| ENABLE_ACC_STATS_Enable (per-thread) | Enable exponent histogram. |
| THCON_SEC[01]_REG1_Downsample_mask, Downsample_rate | 16-bit downsample mask + rate. |
| THCON_SEC[01]_REG1_Exp_threshold_en, Exp_threshold | Exponent-threshold clamp-to-zero. |
| THCON_SEC[01]_REG1_pack_dis_y_pos_start_offset, pack_start_intf_pos | Packer interface position controls (untilize mode). |
| THCON_SEC0_REG1_All_pack_disable_zero_compress (4b mask) | Per-packer compression-disable when override is on. |
| ADDR_MOD_PACK_SEC[0..3].{Y/Z}{src/dst}{Clear,CR,Incr} | Address-modifier slots used by PACR.AddrMod. |
| DEST_ACCESS_CFG_{remap_addrs, swizzle_32b, disable_full_write_dest_q_bypass, zeroacc_absolute_tile_mode} | Affect both packer Dst-address swizzle and RISCV Dst access. |

    Addr = PCK0_ADDR_BASE_REG_0_Base
         + ADC[w].Ch0.X * (Xstride & 0xf)
         + ADC[w].Ch0.Y * Ystride
         + ADC[w].Ch0.Z * Zstride
         + ADC[w].Ch0.W * Wstride
    InputNumDatums = Channel[1].X - Channel[0].X + 1   (one row of one face)

If Source_interface_selection=1 and i==0, the packer fetches from L1 (used for some L1-to-L1 reshape ops); otherwise it fetches from Dst with InputSourceAddr = (Addr/BytesPerDatum) + (DEST_TARGET_REG_CFG_PACK_SEC[i].Offset << 4), interpreted as Row = InputSourceAddr >> 4, Col = InputSourceAddr & 0xf into Dst. BytesPerDatum is decided by In_data_format: 4 for FP32/TF32/I32, 2 for FP16/BF16/I16, else 1.
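The Dst-side read position can be sketched in a few lines. This is an illustration of the arithmetic in the paragraph above, not the hardware model: it assumes the per-packer Offset is shifted left by 4 before being added, and it uses the 4-bit format codes from the encoding table in this document. `DstPosition`, `bytes_per_datum`, and `dst_read_position` are hypothetical names.

```cpp
#include <cstdint>

// Row/column pair inside Dst.
struct DstPosition {
    std::uint32_t row;
    std::uint32_t col;
};

// 4 bytes for FP32/TF32/INT32, 2 bytes for FP16/BF16/INT16, else 1.
// Format codes follow the 4-bit encoding table given earlier.
constexpr std::uint32_t bytes_per_datum(unsigned in_data_format) {
    switch (in_data_format & 0xFu) {
        case 0b0000: case 0b0100: case 0b1000: return 4;
        case 0b0001: case 0b0101: case 0b1001: return 2;
        default: return 1;
    }
}

// Convert a byte address from the Channel[0] address generator into a Dst
// row/column, adding the per-packer Offset (assumed to be shifted left by 4).
constexpr DstPosition dst_read_position(std::uint32_t addr_bytes,
                                        unsigned in_data_format,
                                        std::uint32_t pack_sec_offset) {
    const std::uint32_t datum_addr =
        addr_bytes / bytes_per_datum(in_data_format) + (pack_sec_offset << 4);
    return DstPosition{datum_addr >> 4, datum_addr & 0xFu};
}
```

For example, a byte address of 64 with a 2-byte format is datum 32, and an Offset of 2 adds another 32 datums, which lands on row 4, column 0 under these assumptions.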

    YZW_Addr = PCK0_ADDR_BASE_REG_1_Base
             + ADC[w].Ch1.Y*Ystride + Z*Zstride + W*Wstride
    Addr = PackerIConfig.L1_Dest_addr + !Sub_l1_tile_header_size + (YZW_Addr & ~0xf)
         + (Add_l1_dest_addr_offset ? PackerI.l1_dest_addr_offset : 0)
    if (Addr > Pack_limit_address*2 + 1) Addr -= Pack_fifo_size*2    // circular CB wrap
    if (compressing) reserve Row_start_section_size bytes for RSI
    if (out_data_format < 16b) reserve Exp_section_size bytes for exponents
    DataStream.ByteAddress = Addr   (only when NeedsNewAddress)

The Y/Z/W contribution is masked & ~0xf so it only adjusts at 256-byte granularity — fine-grained packing within a 16-byte L1 line is done by the buffered output below.
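The L1-side address computation above, including the circular wrap, can be written as one small function. This is a sketch that follows the pseudocode term by term and leaves out the RSI and exponent section reservations. `PackerAddrConfigSketch` and `pack_output_addr` are hypothetical names, and all addresses are in the same units as the pseudocode.

```cpp
#include <cstdint>

// Hypothetical bundle of the config fields the address formula reads.
struct PackerAddrConfigSketch {
    std::uint32_t l1_dest_addr;
    bool          sub_l1_tile_header_size;
    bool          add_l1_dest_addr_offset;
    std::uint32_t l1_dest_addr_offset;
    std::uint32_t pack_limit_address;
    std::uint32_t pack_fifo_size;
};

// Mirrors the pseudocode: base + header bump + masked Y/Z/W term + optional
// offset, then one circular wrap against the configured limit.
constexpr std::uint32_t pack_output_addr(const PackerAddrConfigSketch& c,
                                         std::uint32_t yzw_addr) {
    std::uint32_t addr = c.l1_dest_addr
                       + (c.sub_l1_tile_header_size ? 0u : 1u)
                       + (yzw_addr & ~0xFu)
                       + (c.add_l1_dest_addr_offset ? c.l1_dest_addr_offset : 0u);
    if (addr > c.pack_limit_address * 2u + 1u) {
        addr -= c.pack_fifo_size * 2u; // circular CB wrap
    }
    return addr;
}
```

Note how a Y/Z/W term of 0x25 contributes only 0x20 after masking, which is the coarse granularity the paragraph above describes.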
Output buffering and 16-byte alignment. Output always reaches L1 as aligned 16-byte writes. Datums are accumulated into per-stream buffers (data / exp / RSI) and only flushed to L1 when full. The Flush and Last bits on PACR force the buffers to flush, but a partially full buffer is zero-padded up to 16 B before flushing. This means: if you do many small PACRs, they share the same line; if you do one PACR with Flush=1, you might waste up to 15 B at the end of each stream.
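The "up to 15 B" figure follows from rounding each stream up to the next 16-byte line. A minimal sketch of that arithmetic, with a hypothetical function name:

```cpp
#include <cstdint>

// Zero-padding bytes added when a stream buffer holding `buffered_bytes`
// is forced to flush as whole 16-byte lines.
constexpr std::uint32_t flush_padding_bytes(std::uint32_t buffered_bytes) {
    return (16u - buffered_bytes % 16u) % 16u;
}
```

A stream that ends exactly on a line boundary pays nothing, while a stream with a single trailing byte pays the full 15 bytes.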
| Op | Purpose |
|---|---|
| TT_OP_PACR | The main pack instruction. |
| TT_OP_PACR_SETREG | Atomic MMIO register write sequenced after late conversion (used to bump CB write pointers, set semaphore values, kick streams). |
| TT_OP_SETPKEDGOF | Edge-mask offsets (x_start, x_end, y_start, y_end → PCK_EDGE_OFFSET masks). |
| TT_OP_CLREXPHIST | Clear all four packers' exponent histograms. |
| TT_OP_SETASHRMH(0/1), TT_OP_SETASHRMV | Set ALL-SHARED row-mask H/V (halo column/row masks). |
| TT_OP_SETDMAREG mode 6/7 | Read the exponent histogram bins into a Tensix GPR. |
| TT_OP_SETDMAREG mode 9 | Read packer 0's max-exponent. |
| TT_OP_STALLWAIT(stall_res, ...) | Wait for THCON / PACK0..3 / UNPACK0..1 resource idle (this is how Last becomes useful: you flush, then stall until L1 actually has the data). |
| TT_OP_RESOURCEDECL | Auto-TTSync tracking declaration. |
Here is what the LLK (llk_unpack_AB.h + llk_math_eltwise_* + llk_pack.h) actually does for a simple "load tile A, load tile B, A+B, store tile" sequence:
configure_unpack_AB(...) in cunpack_common.h writes (excerpt of actual register writes):

    // Tell the FPU what format SrcA and SrcB hold:
    cfg_reg_rmw_tensix<ALU_FORMAT_SPEC_REG_SrcA_val_ADDR32, ...>(unp_src_format_A);
    cfg_reg_rmw_tensix<ALU_FORMAT_SPEC_REG_SrcB_val_ADDR32, ...>(unp_src_format_B);

    // Configure each unpacker's REG0_TileDescriptor (4-dword blob):
    cfg[THCON_SEC0_REG0_TileDescriptor_ADDR32 + 0..3] = {format/xdim/ydim/zdim/wdim/...};
    cfg[THCON_SEC1_REG0_TileDescriptor_ADDR32 + 0..3] = ...;

    // Configure REG2 (out format, throttle, context_count, tileize, etc.):
    cfg[THCON_SEC0_REG2_Out_data_format_ADDR32 word] = ...;
    cfg[THCON_SEC1_REG2_Out_data_format_ADDR32 word] = ...;

    // e4m3 helper:
    cfg_reg_rmw_tensix<THCON_SEC0_REG1_Unp_LF8_4b_exp_RMW>(unpA_is_e4m3);
    cfg_reg_rmw_tensix<THCON_SEC1_REG1_Unp_LF8_4b_exp_RMW>(unpB_is_e4m3);

    // REG3 base addresses (= base of the CB in L1, /16 since it's 16B units):
    cfg[THCON_SEC0_REG3_Base_address_ADDR32] = cb_a_base >> 4;
    cfg[THCON_SEC1_REG3_Base_address_ADDR32] = cb_b_base >> 4;

    // REG5 per-context Dst rows / face dims (multi-context unpacker 0 only):
    cfg[THCON_SEC0_REG5_Tile_x_dim_cntx0_ADDR32] = face_dim | (face_dim << 16);
    cfg[THCON_SEC0_REG5_Dest_cntx0_address_ADDR32] = Dest_cntx0 | (Dest_cntx1 << 16);

    // Reset ADC counters for the two unpackers + the packer:
    TTI_SETADCXY(p_setadc::UNP_A | p_setadc::UNP_B, 0,0,0,0, 0b1011);
    TTI_SETADCZW(p_setadc::UNP_A | p_setadc::UNP_B, 0,0,0,0, 0b1111);
    TTI_SETADCXY(0b100, 0,0,0,0, 0b1011);   // packer side
    TTI_SETADCZW(0b100, 0,0,0,0, 0b1111);

    // "X-end" per face, written into ADC[*].Unpacker[*].Channel[1].X:
    TT_SETADCXX(p_setadc::UNP_A, unpA_x_end, 0);
    TT_SETADCXX(p_setadc::UNP_B, (unpB_face_r_dim << 4) - 1, 0);

    // SrcA row base for ping-pong:
    TTI_SETC16(SRCA_SET_Base_ADDR32, 0x4);

And the packer side from cpack_common.h (set_packer_strides):

    TT_SETDMAREG(0, LOWER_HALFWORD(y_stride << PCK0_ADDR_CTRL_XY_REG_0_Ystride_SHAMT), 0, LO_16(TMP0));
    TT_SETDMAREG(0, UPPER_HALFWORD(y_stride << PCK0_ADDR_CTRL_XY_REG_0_Ystride_SHAMT), 0, HI_16(TMP0));
    TT_SETDMAREG(0, LOWER_HALFWORD(z_stride << PCK0_ADDR_CTRL_ZW_REG_0_Zstride_SHAMT), 0, LO_16(TMP1));
    TT_SETDMAREG(0, UPPER_HALFWORD(w_stride << PCK0_ADDR_CTRL_ZW_REG_0_Wstride_SHAMT), 0, HI_16(TMP1));
    TTI_STALLWAIT(p_stall::STALL_CFG, p_stall::THCON);
    TTI_WRCFG(TMP0, p_cfg::WRCFG_32b, PCK0_ADDR_CTRL_XY_REG_0_Xstride_ADDR32);
    TTI_WRCFG(TMP1, p_cfg::WRCFG_32b, PCK0_ADDR_CTRL_ZW_REG_0_Zstride_ADDR32);

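The stride setup builds each 32-bit config word in a Tensix GPR from two 16-bit immediates before writing it with WRCFG. The sketch below shows only the halfword split; `lower_halfword` and `upper_halfword` are stand-ins for the LLK's LOWER_HALFWORD / UPPER_HALFWORD macros, and the shift amount of 16 is a value assumed for illustration, not taken from cfg_defines.h.

```cpp
#include <cstdint>

// Stand-ins for the LLK halfword macros: split a 32-bit word into two
// 16-bit immediates, one per SETDMAREG.
constexpr std::uint16_t lower_halfword(std::uint32_t word) {
    return static_cast<std::uint16_t>(word & 0xFFFFu);
}

constexpr std::uint16_t upper_halfword(std::uint32_t word) {
    return static_cast<std::uint16_t>(word >> 16);
}

// Hypothetical shift amount: assumes the stride field starts at bit 16 of its word.
constexpr unsigned kAssumedStrideShamt = 16;

// The 32-bit word that would end up in the GPR for a given stride value.
constexpr std::uint32_t stride_word(std::uint32_t stride) {
    return stride << kAssumedStrideShamt;
}
```

Under that assumed shift, a stride of 0x40 lands entirely in the upper halfword, so the low SETDMAREG immediate is zero and the high one carries the stride.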
1. Wait for the input tiles (cb_wait_front); this is done in RISCV, not Tensix.
2. Issue one UNPACR per face. Each UNPACR advances ADC Channel[0].Y → the next face row; on the 4th call the AddrMod flips back and bumps Z.
3. The unpacker starts from REG3_Base_address, follows REG7_Offset_address, then iterates 16×16=256 datums, doing format conversion + writing to SrcA[bank][row][col] (and stalls on AllowedClient).
4. The last UNPACR of a tile sets SetDatValid=1 (or it's followed by UNPACR_NOP_SETDVALID); that flips AllowedClient = MatrixUnit so the FPU can read this bank.
5. ZEROACC (sub-instruction of ZEROSRC) zeroes a 16×16 block of Dst.
6. MVMUL / ELWADD / etc. read SrcA[bank0] + SrcB[bank0] and produce Dst[Row][Col]. Each writes one row at a time; the 4-cycle Dst scoreboard prevents the packer from reading until the write retires.
7. The math instruction then issues CLRDVALID (unless CLR_DVALID_SrcA_Disable/SrcB_Disable is set), returning the bank to AllowedClient = Unpackers.
8. Issue PACR with PackerMask = 0xF to run all four packers in parallel: each packer is configured to point at a different quarter of Dst via DEST_TARGET_REG_CFG_PACK_SEC[i].Offset, and each lands in a different quarter of the output tile via PCK0_ADDR_*_REG_1 strides.
9. PACR with Last=1 (or Flush=1) flushes the 16-byte buffers to L1.
10. PACR_SETREG (or a follow-up MMIO write) bumps the CB write pointer so the consumer can pick up the tile.
11. STALLWAIT(p_stall::PACK0|PACK1|PACK2|PACK3, ...) is used if the kernel needs to be sure the L1 store has happened before proceeding (e.g. before triggering a NoC send).

From tt-llk/tt_llk_blackhole/common/inc/ckernel_ops.h. (TT_OP_X is the encoder; TTI_X is the form with immediate operands; TT_X allows GPR operands.)
| Instruction | Opcode | Purpose |
|---|---|---|
| TT_OP_UNPACR(...) | 0x42 | The unpack instruction. 13 fields described in §2.1. |
| TT_OP_UNPACR_NOP(...) | 0x43 | Multi-mode unpacker-sequenced no-op: ZEROSRC / SETDVALID / SETREG / MsgClr / pure-NOP / Unpack_Pop. Used to flip Src banks, zero a bank, push register writes through the unpacker queue. |
| TT_OP_ZEROSRC(zero_val, write_mode, bank_mask, src_mask) | 0x11 | Zero a SrcA/SrcB bank directly (bypasses unpacker pipeline; sometimes faster than going through UNPACR_NOP). |
| TT_OP_SETDVALID(setvalid) | 0x57 | Set "AllowedClient = MatrixUnit" on a Src bank (i.e. mark unpacked data ready for FPU). Bit 0 = SrcA, bit 1 = SrcB. |
| TT_OP_TRNSPSRCA | 0x14 | Transpose SrcA in place. |
| TT_OP_TRNSPSRCB | 0x16 | Transpose SrcB in place (BH-specific; not on Wormhole). |
| TT_OP_SETADC(CntSetMask, ChannelIndex, DimensionIndex, Value) | n/a | Set one of {X,Y,Z,W} on Channel[0] or Channel[1] of one ADC bank. |
| TT_OP_SETADCXY(CntSetMask, Ch1_Y, Ch1_X, Ch0_Y, Ch0_X, BitMask) | n/a | Set X/Y on both channels of selected ADC banks at once. BitMask selects which of {Ch0X,Ch0Y,Ch1X,Ch1Y} are written. |
| TT_OP_SETADCZW(CntSetMask, Ch1_W, Ch1_Z, Ch0_W, Ch0_Z, BitMask) | n/a | Same for Z/W. |
| TT_OP_SETADCXX(CntSetMask, x_end2, x_start) | 0x5e | Set Ch0.X = x_start, Ch1.X = x_end2 in one shot (the typical "set face width minus 1"). |
| TT_OP_INCADCXY(CntSetMask, Ch1_Y, Ch1_X, Ch0_Y, Ch0_X) | n/a | Increment X/Y by the given deltas (signed). |
| TT_OP_INCADCZW(CntSetMask, Ch1_Y, Ch1_X, Ch0_Y, Ch0_X) | n/a | Increment Z/W. |
| TT_OP_ADDRCRXY(CntSetMask, Ch1_Y, Ch1_X, Ch0_Y, Ch0_X, BitMask) | n/a | Copy Y_Cr → Y / X_Cr → X (used to start a fresh face row). |
| TT_OP_ADDRCRZW(CntSetMask, Ch1_Y, Ch1_X, Ch0_Y, Ch0_X, BitMask) | n/a | Same for Z/W. |
| TT_OP_REG2FLOP(SizeSel, TargetSel, ByteOffset, ContextId_2, FlopIndex, RegIndex) | n/a | Load an ADC/RWC value (or per-context unpacker offset) from L1 via a register copy. Used to set up multi-context unpack from a descriptor in L1. |
| TT_OP_SETC16(reg, val) | 0xb2 | Write a 16-bit immediate to ThreadConfig[CurrentThread]: the only way to set UNPACK_MISC_CFG_CfgContextOffset_*, UNPACK_MISC_CFG_CfgContextCntReset_*, SRCA_SET_Base, SRCB_SET_Base, CFG_STATE_ID_StateID and the rest of the per-thread unpacker config. |
| TT_OP_WRCFG(GprAddress, wr128b, CfgReg) | 0xb0 | Write 32b (or 128b if wr128b=1) from a Tensix GPR to Config[StateID]. Used to set REG0 TileDescriptor (4 dwords as one 128b write), and all the THCON_SEC[u].REG[1..7] words. |
| TT_OP_RDCFG(GprAddress, CfgReg) | 0xb1 | Read a Config word into a Tensix GPR; used by introspection / read_unpack_tile_descriptor(). |
| TT_OP_RMWCIB0/1/2/3(Mask, Data, CfgRegAddr) | 0xb3..0xb6 | Read-modify-write a byte of a Config word; used for cfg_reg_rmw_tensix<...> to flip a single bit/field without disturbing neighbours (e.g. Unp_LF8_4b_exp, Ovrd_data_format). The 4 variants target the 4 bytes of the word. |
| TT_OP_STREAMWRCFG(stream_id_sel, ...) | n/a | Write a Config word from a NoC overlay stream; used when descriptor data is being streamed in. |
| TT_OP_STALLWAIT(stall_res, wait_res) | n/a | Block this thread until specified resources (e.g. UNPACK0, UNPACK1, THCON, XSEARCH) are idle. The single most common synchronization primitive between config writes and unpack/pack work. |
| TT_OP_RESOURCEDECL(linger_time, ...) | n/a | Declare resources being used by upcoming instructions, for the BH Auto-TTSync hardware. |
| TT_OP_REPLAY(start_idx, len, ...) | n/a | Execute a saved sequence of Tensix instructions (the LLK uses this to compress repetitive UNPACR sequences). |
TT_OP_XMOV(Mov_block_selection, ...) | n/a | Mover instruction — sometimes used by unpack-untilize when the data needs an L1↔L1 copy before unpack. |
TT_OP_MOVA2D(...), TT_OP_MOVB2D(...) | n/a | Copy SrcA/SrcB → Dst directly via the FPU's data path — alternative to "UNPACR into SrcA, then FPU copies to Dst". Listed here because they are sometimes used inside llk_unpack_* paths when staging into Dst is needed. |
TT_OP_MOVD2A(...), TT_OP_MOVD2B(...) | n/a | Reverse direction: Dst → SrcA/SrcB. Used by llk_math_transpose_dest.h and matmul to feed the next round of FPU ops from Dst. |
TT_OP_MOVDBGA2D(...), TT_OP_MOVDBGB2D(...) | n/a | Debug-path variants of MOVA2D/MOVB2D used by the debug tooling. |
TT_OP_MOVB2A(...) | 0x0b | Copy SrcB → SrcA. |
TT_OP_SETPKEDGOF(y_end, y_start, x_end, x_start) | 0x1d | Edge-mask offsets (technically a packer control instruction, but uses the same ADC infrastructure and is often issued by the unpack thread when tile-padding state is being set up). |
TT_OP_SETASHRMH, SETASHRMH0, SETASHRMH1, SETASHRMV(reg_mask2) | 0x1e, 0x1a, 0x1b, 0x1c | "All-shared halo mask" — set the row/column halo bits used by the unpacker's halo-padding mode (also feeds into packer edge masks). |
TT_OP_CLREXPHIST | 0x21 | Clears packer exponent histograms; listed here because it's commonly placed in the unpack/setup phase to start fresh telemetry. |
TT_OP_SETRWC(...), TT_OP_SETIBRWC(...) | n/a | Set the RWC (FPU-side row/word counters) — not strictly an unpacker op, but unpack sequences typically reset them between tiles since the FPU is the immediate consumer of unpacker output. |
(For rows marked n/a I have not listed an explicit 0x__ opcode, because the TT_OP macro packs the value differently; see ckernel_ops.h for the exact encoder.)
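
To make the Opcode column concrete, here is a minimal sketch of how an encoder in the style of TT_OP_X could build a 32-bit instruction word. It assumes the 8-bit opcode sits in bits [31:24] and the operand fields are packed into bits [23:0]; that layout and every name in the block (`encode_tensix_op`, `encode_setdvalid`, `opcode_of`, `kSrcA`, `kSrcB`) are illustrative assumptions, not the LLK's own definitions. The only values taken from the table are the SETDVALID opcode (0x57) and its "bit 0 = SrcA, bit 1 = SrcB" operand. Check ckernel_ops.h before relying on any of it.

```cpp
#include <cstdint>

// Assumed layout: opcode in bits [31:24], operand fields in bits [23:0].
// Hypothetical helper, not the real TT_OP macro.
constexpr std::uint32_t encode_tensix_op(std::uint32_t opcode, std::uint32_t params) {
    return ((opcode & 0xFFu) << 24) | (params & 0x00FFFFFFu);
}

// Opcode value as listed in the table above.
constexpr std::uint32_t kOpcodeSetDvalid = 0x57u;

// Operand bits as described in the SETDVALID row: bit 0 = SrcA, bit 1 = SrcB.
constexpr std::uint32_t kSrcA = 1u << 0;
constexpr std::uint32_t kSrcB = 1u << 1;

// Hypothetical SETDVALID encoder: the 2-bit bank mask goes in the low operand bits.
constexpr std::uint32_t encode_setdvalid(std::uint32_t bank_mask) {
    return encode_tensix_op(kOpcodeSetDvalid, bank_mask & 0x3u);
}

// Recover the opcode byte from an encoded word (useful when dumping replay buffers).
constexpr std::uint32_t opcode_of(std::uint32_t word) {
    return word >> 24;
}

// Compile-time checks of the sketch itself.
static_assert(encode_setdvalid(kSrcA) == 0x57000001u, "SrcA only");
static_assert(encode_setdvalid(kSrcA | kSrcB) == 0x57000003u, "SrcA and SrcB");
static_assert(opcode_of(encode_setdvalid(kSrcB)) == kOpcodeSetDvalid, "round trip");
```

Under these assumptions, marking both Src banks valid encodes to the word 0x57000003. Keeping the helpers `constexpr` mirrors the idea that an encoder of this kind produces a constant that can be stored or replayed, separately from the step that actually issues the instruction.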
- tt-isa-documentation/BlackholeA0/TensixTile/TensixCoprocessor/BackendConfiguration.md: how Config[2][...] and ThreadConfig[3][...] are laid out.
- tt-isa-documentation/BlackholeA0/TensixTile/TensixCoprocessor/ConfigurationUnit.md: config-unit pipeline / latency / which instructions write config.
- tt-isa-documentation/BlackholeA0/TensixTile/TensixCoprocessor/Dst.md: Dst layout, swizzle, scoreboarding, RISCV access.
- tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/UNPACR_Regular.md: exhaustive (621-line) UNPACR functional model; applies unchanged to BH apart from the e4m3 helper and Auto-TTSync.
- tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/PACR.md + Packers/{README,InputAddressGenerator,OutputAddressGenerator,FormatConversion,EdgeMasking,ReLU,Downsampling,Compression,ExponentHistogram,ExponentThresholding}.md: full packer functional model.
- tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/Unpackers/{README,FormatConversion}.md: unpacker overview.
- tt-metal/tt_metal/hw/inc/internal/tt-1xx/blackhole/cfg_defines.h: concrete _ADDR32, _MASK, _SHAMT for all 800+ config fields on Blackhole.
- tt-llk/tt_llk_blackhole/common/inc/{ckernel_ops.h, cunpack_common.h, cpack_common.h}: the instruction macros and the C++ structs (unpack_tile_descriptor_t, unpack_config_t, pack_config_t, relu_config_t, dest_rd_ctrl_t, pck_edge_offset_t, pack_counters_t) the LLK actually writes.
- tt-llk/tt_llk_blackhole/llk_lib/llk_unpack_*.h, llk_pack*.h: the higher-level helpers (llk_unpack_A, llk_unpack_AB, llk_unpack_tilize, llk_pack, llk_pack_untilize).