
Blackhole Tensix Unpacker & Packer — Complete Reference

How data moves from the input circular buffer in L1 through the unpackers into SrcA/SrcB/Dst, gets computed on by the FPU/SFPU, and is written back to L1 by the four packers. Covers every config register, every relevant TT* instruction, and the LLK programming model.

Note. Tenstorrent's official ISA docs split documentation across tt-isa-documentation/WormholeB0/ (which has the detailed Unpackers/Packers sub-trees) and tt-isa-documentation/BlackholeA0/ (which has BH-specific overrides like Dst.md, BackendConfiguration.md, ConfigurationUnit.md). On Blackhole the functional model is largely the same as Wormhole B0 — same UNPACR / PACR pipeline, with a few BH-specific additions (Auto-TTSync, extra context counts, an 8-bit FP8 e4m3 helper, 1.5 MiB L1 instead of 1.46 MiB, the Mover, new SFPU ops). Below I cite which docs the details come from, and how the LLK (tt-llk/tt_llk_blackhole/) actually programs the registers.

Contents

  1. The big picture: L1 → SrcA/SrcB/Dst → math → Dst → L1
  2. The Unpacker
  3. The Packer
  4. Step-by-step: data going CB → SrcA/Dst → Dst → CB
  5. Complete list of unpacker-related TT* instructions
  6. Source references

1. The big picture: L1 → SrcA/SrcB/Dst → math → Dst → L1

A Tensix tile has:

[Figure 1: block diagram. Elements shown, in data-flow order:]

  1. L1 input CBs: circular FIFOs in the 1536 KiB shared SRAM, bounded by Unpack_limit_address / Unpack_fifo_size. A tile = 32×32 datums = 4 faces × 16×16.
  2. Unpacker 0 → SrcA or → Dst; Unpacker 1 → SrcB. Issued by TRISC0 (unpack).
  3. SrcA (bank 0/1) and SrcB (bank 0/1): 64×16 × 19 bits each, behind the AllowedClient gate.
  4. FPU (Matrix Unit): MVMUL / ELW / REDUCE on SrcA, SrcB. SFPU (Vector Unit): 32-lane SIMD, reads and writes Dst. Issued by TRISC1 (math).
  5. Dst register: 1024 × 16 × 16b or 512 × 16 × 32b, with the DstRowValid scoreboard and remap_addrs / swizzle_32b. A tile occupies 4 faces in Dst.
  6. Packers 0..3 read faces F0..F3 of Dst. Issued by TRISC2 (pack), typically.
  7. L1 output CBs: bounded by Pack_limit_address / Pack_fifo_size; L1 writes are 16-byte aligned.

Figure 1. CB → SrcA/SrcB/Dst → math → Dst → CB. Dashed arrow shows Unpacker 0's optional direct-to-Dst bypass; bidirectional arrow shows the SFPU operating in-place on Dst.

Three Tensix-coprocessor threads (T0/T1/T2) issue Tensix instructions independently and the backend execution units (Unpackers / Matrix Unit / Vector Unit / Packers / ThCon / Sync / Mover / Configuration / Misc) are shared. Unpackers usually take instructions from T0, the FPU/SFPU from T1, the packers from T2 — but any thread can issue any instruction. Synchronization is via STALLWAIT, semaphores, SETDVALID, and the SrcA/SrcB "AllowedClient" mechanism described below.

1.1 SrcA / SrcB / Dst shapes (Blackhole)

1.2 ADCs (Addressing Counters)

Both unpackers and packers compute addresses by stepping through "ADC" channels rather than by getting an explicit address each instruction. Each thread has one ADC per unpacker (and a shared one for the packers). Each ADC has two channels (Channel[0] = input/source side, Channel[1] = output/dest side), each with X, Y, Z, W counters + Y_Cr, Z_Cr "current row" shadow counters.

ADC counters are advanced by:

This is how a single UNPACR for a face advances to the next face automatically.


2. The Unpacker

There are two unpackers. Unpacker 0 writes to SrcA or directly to Dst. Unpacker 1 writes to SrcB. They share an L1 read port but can run concurrently with throttle-mode arbitration (see §2.5).

2.1 The UNPACR instruction

Source: tt-llk/tt_llk_blackhole/common/inc/ckernel_ops.h, opcode 0x42. The full BH encoding is:

TT_OP_UNPACR(
    Unpack_block_selection,  // 1 bit  : 0 = Unpacker 0, 1 = Unpacker 1
    AddrMode,                // 8 bits : packed Ch0YInc/Ch0ZInc/Ch1YInc/Ch1ZInc (2 bits each)
    CfgContextCntInc,        // 2 bits : used to bump context-counter after instr
    CfgContextId,            // 3 bits : explicit context (used when AutoIncContextID=0)
    AddrCntContextId,        // 2 bits : which ADC bank to use (0..2 = T0..T2)
    OvrdThreadId,            // 1 bit  : was MultiContextMode in WH docs (see below)
    SetDatValid,             // 1 bit  : hand SrcA/B bank to Matrix Unit after unpack
    srcb_bcast,              // 1 bit  : RAREFYB / broadcast B
    ZeroWrite2,              // 1 bit  : AllDatumsAreZero — write zeros instead of fetching
    AutoIncContextID,        // 1 bit  : use context counter instead of CfgContextId
    RowSearch,               // 1 bit  : use BlobsYStart row-search mode
    SearchCacheFlush,        // 1 bit  : flush the RSI cache
    Last)                    // 1 bit  : end-of-context marker

The full functional model is the one in WormholeB0/.../UNPACR_Regular.md (621 lines). It is exactly the same on Blackhole; the field layout above is the BH-specific encoding (the AddrCntContextId / OvrdThreadId split is BH's renaming of WH's ContextADC/MultiContextMode).

The high-level steps the hardware performs:

1. Pick state-bank from ThreadConfig[CurrentThread].CFG_STATE_ID_StateID  (0 or 1)
2. Resolve WhichContext (0..7) from either CfgContextId or the per-unpacker
   ContextCounter, then add UNPACK_MISC_CFG_CfgContextOffset[WhichUnpacker]
3. Read the tile descriptor (THCON_SEC[u].REG0_TileDescriptor) → InDataFormat,
   XDim, YDim, ZDim, WDim, IsUncompressed, BlobsPerXYPlane, DigestSize, ...
4. Compute starting L1 address from REG3_Base_address (+ REG3_Base_cntx in MC mode)
   and REG7_Offset_address, skip 16 bytes of tile header times (1+DigestSize).
   If compressed: also read row-start indices from a sub-region.
   If BFP format: also point at the per-16-datum exponent sub-region.
5. Compute number of input datums = (ADC.Channel[1].X+1) - ADC.Channel[0].X
6. Compute starting OUTPUT address inside SrcA/SrcB/Dst from:
      UNP[u].ADDR_BASE_REG_1_Base + ADC.Ch1.Y*Ystride + Z*Zstride + W*Wstride
   (in datums, not bytes; >>1 for 16-bit, >>2 for 32-bit output formats)
7. For each input datum:
      a. Read DatumSizeBytes from L1
      b. If BFP: read shared 8-bit exponent (1 byte per 16 datums)
      c. If compressed: read the 4-bit RLE delta and insert that many zeroes
      d. FormatConversion(InDataFormat, OutDataFormat, datum, exp, unpacker, toDst)
      e. Optional upsampling (insert 0/1/2/4 zero datums after each datum)
      f. Write to SrcA[bank][row][col], SrcB[bank][row][col], or Dst32b/Dst16b
         (wait if AllowedClient ≠ Unpackers — i.e. FPU hasn't released the bank)
8. After the loop: bump ADC channels by AddrMod, optionally flip the SrcA/SrcB
   bank (FlipSrc) and hand the now-full bank to the Matrix Unit, bump
   ContextCounter if AutoIncContextID.
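
The arithmetic in steps 4 to 6 can be written out as a small sketch. The function and struct names below are hypothetical (they are not LLK or ISA-doc names), and the way the format shift is applied in step 6 is my reading of the note "(>>1 for 16-bit, >>2 for 32-bit)", so treat that part as an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Step 4: bytes of tile header skipped before the first datum.
constexpr std::uint32_t tile_header_skip_bytes(std::uint32_t digest_size) {
    return 16u * (1u + digest_size);
}

// Step 5: datums consumed by one UNPACR, from the X counters of both ADC channels.
constexpr std::uint32_t input_datum_count(std::uint32_t ch0_x, std::uint32_t ch1_x) {
    return (ch1_x + 1u) - ch0_x;
}

// Hypothetical bundle of the step-6 inputs (illustrative names only).
struct OutputAddressInputs {
    std::uint32_t base;                          // UNP[u].ADDR_BASE_REG_1_Base
    std::uint32_t y, z, w;                       // ADC Channel[1] counters
    std::uint32_t y_stride, z_stride, w_stride;  // UNP[u] Y/Z/W strides
};

// Step 6: starting output position inside SrcA/SrcB/Dst.
// ASSUMPTION: the summed address is shifted right by 1 (16-bit output) or
// 2 (32-bit output); pass 0 when no adjustment applies.
constexpr std::uint32_t output_start(const OutputAddressInputs& in, unsigned format_shift) {
    return (in.base + in.y * in.y_stride + in.z * in.z_stride + in.w * in.w_stride)
           >> format_shift;
}
```

For a 16×16 face with Channel[0].X = 0 and Channel[1].X = 255, input_datum_count returns 256, which matches the per-face datum count used elsewhere in this document.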

The address-wrap behaviour is the circular FIFO in L1: addresses beyond Unpack_limit_address wrap by subtracting Unpack_fifo_size. This is what lets the CB abstraction work — the CB is just a configured circular range in L1.
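
A minimal sketch of that wrap rule, with all addresses in 16-byte units as the config registers use. The helper names are hypothetical, and reading "beyond" as a strict greater-than comparison is an assumption (it mirrors the packer-side formula in §3.5).

```cpp
#include <cassert>
#include <cstdint>

// Convert a byte address in L1 to the 16-byte units used by the unpacker config.
constexpr std::uint32_t to_16b_units(std::uint32_t byte_addr) {
    return byte_addr >> 4;
}

// Circular-FIFO wrap: an address past the limit is pulled back by the FIFO size.
// ASSUMPTION: "beyond" means strictly greater than Unpack_limit_address.
constexpr std::uint32_t wrap_unpack_address(std::uint32_t addr,
                                            std::uint32_t unpack_limit_address,
                                            std::uint32_t unpack_fifo_size) {
    return addr > unpack_limit_address ? addr - unpack_fifo_size : addr;
}
```

With a limit of 0x100 and a FIFO size of 0x80, address 0x101 wraps to 0x81 while 0x0FF is left untouched.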

2.2 UNPACR sub-variants (UNPACR_NOP, opcode 0x43)

UNPACR_NOP (TT_OP_UNPACR_NOP) is a multi-mode "stall/control" instruction that runs in-order with the unpacker queue, used so the no-op effect happens at exactly the right point in the unpacker pipeline:

Mode | Effect | Doc
ZEROSRC mode | Set SrcA or SrcB bank to all zeroes, sequenced with UNPACR | UNPACR_NOP_ZEROSRC.md
SETDVALID mode (Set_Dvalid=1) | Mark Src bank as valid → hand to Matrix Unit (AllowedClient = MatrixUnit) | UNPACR_NOP_SETDVALID.md
SETREG mode | MMIO register write, sequenced with UNPACR | UNPACR_NOP_SETREG.md
Msg_Clr_Cnt mode | Stream/Overlay message clear | UNPACR_NOP_OverlayClear.md
Nop mode | Occupy unpacker for one cycle (used as a barrier/replay padding) | UNPACR_NOP_Nop.md
Unpack_Pop mode | Pop unpacker FIFO | (covered in UNPACR_NOP.md)

UNPACR itself has 3 special sub-modes when Unpack_block_selection / SetDvalid etc. are programmed unusually:

2.3 Unpacker config registers (THCON_SEC[u] = sec 0 for Unpacker 0 / sec 1 for Unpacker 1)

These live in the global Config[2][...] array (TENSIX_CFG_BASE), with the symbols in tt_metal/hw/inc/internal/tt-1xx/blackhole/cfg_defines.h. There are ~370 packer/unpacker fields total; the ones that matter for unpack are:

REG0 — TileDescriptor (4 × 32-bit, written via WRCFG ... wr128b=1 as a 128-bit blob).

This is the single most important block; unpack_tile_descriptor_t in cunpack_common.h shows the BH bitfield layout:

Field | Bits | Meaning
in_data_format | [3:0] | Source format in L1: encoding table below.
uncompressed | [4] | 1 = no RLE; 0 = decompress on the fly (must also configure Disable_zero_compress_cntx).
blobs_per_xy_plane | [11:8] | Sparsity ("blob") structure, used with RowSearch + BlobsYStart.
x_dim | [31:16] | X (column) dim of the source tile in L1.
y_dim | [47:32] | Y (row) dim.
z_dim | [63:48] | Z stride count (0 means 1).
w_dim | [79:64] | W stride count (0 means 1).
blobs_y_start | [127:80] | Row-start table for sparse blob mode.
digest_type/digest_size | [...:120] | Tile-header skip = (1+digest_size)*16 bytes.
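
As an illustration of how the low fields of that table land in the four 32-bit words, here is a sketch that packs them by the bit positions listed above. The struct and function names are hypothetical (the real layout is unpack_tile_descriptor_t in cunpack_common.h), and the fields above bit 79 are deliberately left at zero because their exact positions are not fully given here.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical plain-value view of the TileDescriptor fields up to bit 79.
struct TileDescriptorFields {
    std::uint32_t in_data_format;      // bits [3:0]
    std::uint32_t uncompressed;        // bit  [4]
    std::uint32_t blobs_per_xy_plane;  // bits [11:8]
    std::uint32_t x_dim;               // bits [31:16]
    std::uint32_t y_dim;               // bits [47:32]
    std::uint32_t z_dim;               // bits [63:48]
    std::uint32_t w_dim;               // bits [79:64]
};

// Pack the fields into the four words of the 128-bit blob.
// blobs_y_start and the digest fields (bit 80 and up) stay zero in this sketch.
inline std::array<std::uint32_t, 4> pack_tile_descriptor(const TileDescriptorFields& f) {
    std::array<std::uint32_t, 4> words{};
    words[0] = (f.in_data_format & 0xFu) | ((f.uncompressed & 0x1u) << 4) |
               ((f.blobs_per_xy_plane & 0xFu) << 8) | ((f.x_dim & 0xFFFFu) << 16);
    words[1] = (f.y_dim & 0xFFFFu) | ((f.z_dim & 0xFFFFu) << 16);
    words[2] = f.w_dim & 0xFFFFu;
    return words;
}
```

For example, an uncompressed descriptor with format code 5, x_dim = y_dim = 16, z_dim = 4 and w_dim = 1 packs to words 0x00100015, 0x00040010, 0x00000001, 0.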

Data-format 4-bit encoding (same field everywhere — TileDescriptor, REG2, REG7 contexted, ALU_FORMAT_SPEC, packer REG1):

Code | 0b??11 | 0b??10 | 0b??01 | 0b??00
0b00?? | BFP4a | BFP8a | FP16 | FP32
0b01?? | BFP4 | BFP8 | BF16 | TF32
0b10?? | BFP2a | FP8 (e5m2) | INT16 | INT32
0b11?? | BFP2 | INT8 | |

Plus a 5th bit "Unp_LF8_4b_exp" turns FP8 e5m2 into FP8 e4m3 — set when DataFormat::Fp8_e4m3; see THCON_SEC0_REG1_Unp_LF8_4b_exp_RMW programmed in configure_unpack_AB().
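
Read row by row, the encoding table gives the following 4-bit codes. The enum below is only a sketch that uses this document's format labels; it is not the LLK DataFormat enum, and the UnpackInputFormat struct that carries the extra e4m3 bit is a hypothetical illustration.

```cpp
#include <cassert>
#include <cstdint>

// 4-bit data-format codes, transcribed from the table (high two bits = row,
// low two bits = column). Labels follow this document, not the LLK headers.
enum class DataFormatCode : std::uint8_t {
    FP32  = 0x0, FP16  = 0x1, BFP8a = 0x2, BFP4a = 0x3,
    TF32  = 0x4, BF16  = 0x5, BFP8  = 0x6, BFP4  = 0x7,
    INT32 = 0x8, INT16 = 0x9, FP8   = 0xA, BFP2a = 0xB,
    INT8  = 0xE, BFP2  = 0xF,
};

constexpr unsigned table_row(DataFormatCode c) { return static_cast<unsigned>(c) >> 2; }
constexpr unsigned table_col(DataFormatCode c) { return static_cast<unsigned>(c) & 0x3u; }

// Hypothetical pairing of the 4-bit code with the Unp_LF8_4b_exp bit.
struct UnpackInputFormat {
    DataFormatCode code;
    bool lf8_4b_exp;  // 1 = interpret the FP8 code as e4m3 instead of e5m2
};

constexpr bool is_fp8_e4m3(UnpackInputFormat f) {
    return f.code == DataFormatCode::FP8 && f.lf8_4b_exp;
}
```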

REG1 / REG2 — main per-unpacker config (unpack_config_t is REG1+REG2 in 128-bit form):

Field (BH name in cfg_defines.h) | What it does
THCON_SEC[u]_REG2_Out_data_format | OutDataFormat (4 bits): what hardware presents to SrcA/SrcB/Dst (see §2.6 conversions).
THCON_SEC[u]_REG2_Throttle_mode | x1/x2/x4 L1 read speed (0/1/2, see §2.5).
THCON_SEC[u]_REG2_Context_count | log₂ of how many UnpackContexts to round-robin (0..3 → 1/2/4/8 contexts).
THCON_SEC[u]_REG2_Haloize_mode | Enables X/Y transpose on the way into SrcA (Unpacker 0 only).
THCON_SEC[u]_REG2_Tileize_mode | "Discontiguous input rows" mode: rows are NOT contiguous in L1, hop by Shift_amount_cntx. Used by llk_unpack_tilize.h.
THCON_SEC[u]_REG2_Unpack_Src_Reg_Set_Upd | If 1, after each UNPACR advance SrcRow by 16 + SrcA/B_SET_Base*16. Used for ping-pong'ing within a bank.
THCON_SEC[u]_REG2_Unpack_If_Sel | Unpacker 0 only: 0 = write to SrcA, 1 = write to Dst. (Per-context version: REG2_Unpack_if_sel_cntx[].)
THCON_SEC[u]_REG2_Upsample_rate | 0/1/2/3 → 0/1/2/4 inserted zeros per datum.
THCON_SEC[u]_REG2_Upsample_and_interleave | If 1, skip output positions instead of inserting zeros.
THCON_SEC[u]_REG2_Ovrd_data_format | If 1 (in MultiContext mode), use the per-context REG7 format instead of the descriptor.
THCON_SEC[u]_REG2_Force_shared_exp | For BFP formats: ignore the L1 exponent stream, use UNP[u].FORCED_SHARED_EXP_shared_exp for all datums. Used when you want to upcast INT8 via the BFP path.
THCON_SEC[u]_REG2_Shift_amount_cntx[0..3] | Tileize row stride (4 bits per ctx, concatenated to make a 12-bit byte stride) OR ColShift for SrcA writes.
THCON_SEC[u]_REG2_Unpack_limit_address | Upper L1 bound (in 16-byte units) for FIFO wrap.
THCON_SEC[u]_REG2_Unpack_fifo_size | FIFO size (in 16-byte units): what gets subtracted on wrap.
THCON_SEC[u]_REG2_Disable_zero_compress_cntx[0..7] | Per-context: 1 = treat that context's data as uncompressed (overrides the TileDescriptor when in MultiContextMode).
THCON_SEC[u]_REG3_Base_address (32b) | Base L1 address of the input buffer, in 16-byte units.
THCON_SEC[u]_REG3_Base_cntx[1..7].address | Per-context base addresses (context 0 uses REG3_Base_address).
THCON_SEC[u]_REG5_Tile_x_dim_cntx[0..3] | Unpacker 0 multi-context X override (16 bits per ctx, packed two-per-32).
THCON_SEC[u]_REG5_Dest_cntx[0..3].address | Unpacker 0 per-context Dst starting row.
THCON_SEC[u]_REG7_Offset_address | 16-bit offset added to Base_address each instruction (set by unpacker_wrapup() to zero on flush).
THCON_SEC[u]_REG7_Offset_cntx[1..3].address | Per-context offsets.
THCON_SEC[u]_REG7_Unpack_data_format_cntx[0..7] | Per-context InDataFormat override.
THCON_SEC[u]_REG7_Unpack_out_data_format_cntx[0..7] | Per-context OutDataFormat override.
THCON_SEC[u]_REG1_Unp_LF8_4b_exp | 1 = treat 8-bit float as e4m3 instead of e5m2.
UNP[u].ADDR_BASE_REG_1_Base | Base "datum index" within SrcA/SrcB/Dst, combined with ADC Channel[1] Y/Z/W and the strides.
UNP[u].ADDR_CTRL_XY_REG_1_Ystride | Y stride in datums.
UNP[u].ADDR_CTRL_ZW_REG_1_Zstride, Wstride | Z, W strides.
UNP[u].FORCED_SHARED_EXP_shared_exp | The fixed exponent used when Force_shared_exp = 1.
UNP[u].ADD_DEST_ADDR_CNTR_add_dest_addr_cntr | If 1, add per-context Dst row to base; if 0, override.
UNP0_BLOBS_Y_START_CNTX[0/2].blobs_y_start[8] | Per-context BlobsYStart (overrides TileDescriptor in MC mode).
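
Two of the REG2 fields in this table are encoded rather than literal values. A small sketch of the decoding, using hypothetical helper names and only the mappings stated in the table (0..3 → 1/2/4/8 contexts, 0/1/2/3 → 0/1/2/4 inserted zeros):

```cpp
#include <cassert>

// Context_count holds log2 of the number of round-robin unpack contexts.
constexpr unsigned unpack_contexts(unsigned context_count_field) {
    return 1u << (context_count_field & 0x3u);
}

// Upsample_rate 0/1/2/3 selects 0/1/2/4 zero datums inserted after each datum.
inline unsigned upsample_zeros_per_datum(unsigned upsample_rate_field) {
    static const unsigned zeros[4] = {0u, 1u, 2u, 4u};
    return zeros[upsample_rate_field & 0x3u];
}
```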

Thread-local unpacker config (ThreadConfig, written via SETC16 because RISC-V store instructions (sw) cannot reach it):

Field | Purpose
CFG_STATE_ID_StateID | Selects which of the two Config banks (0/1) this thread reads; basic double-buffering of all backend config.
UNPACK_MISC_CFG_CfgContextOffset_0, _1 | Per-thread per-unpacker base for context index (added to CfgContextId).
UNPACK_MISC_CFG_CfgContextCntReset_0, _1 | Writing here resets that thread's context counter (immediate side-effect).
SRCA_SET_Base, SRCB_SET_Base | Source-row base (×16) used when Unpack_Src_Reg_Set_Upd=1 for ping-pong write addressing.
SRCA_SET_SetOvrdWithAddr | If 1, write to a full 64-row range of SrcA using the address generator directly.
ADDR_MOD_AB_SEC[0..3].SrcXIncr/SrcYIncr/... | Address-modifier slots for UNPACR / MOVD2A etc., used to bump SrcA/SrcB row counters.
ALU_FORMAT_SPEC_REG0_SrcA*, REG0_SrcB*, REG2_Dstacc* | The format the FPU sees (overrides for INT8 signed/unsigned and Dst accumulator format).
ALU_ACC_CTRL_Fp32_enabled, SFPU_Fp32_enabled, INT8_math_enabled, Zero_Flag_disabled_src/dst | FPU/SFPU global modes.

2.4 What "writing to SrcA / SrcB / Dst" looks like

[Figure 2: UNPACR instruction pipeline (per face). Stages shown, top to bottom, with the registers that drive each:]

  1. L1 read, 16 / 32 / 64 B per cycle: Throttle_mode (0/1/2), arbiter with the other unpacker. Address = REG3.Base_address + REG7.Offset_address + ADC.Ch0.Y·stride…
  2. Skip tile header, (1+digest_size) × 16 B: REG0.digest_size, REG3.Base_address, REG7.Offset_address.
  3. Read shared exponent (BFP) / RLE decompress to a datum stream: REG0.uncompressed, Force_shared_exp, blobs_per_xy_plane.
  4. Format conversion (per datum): REG0.in_data_format → REG2.Out_data_format (+ Unp_LF8_4b_exp); see the data-format table (§2.3).
  5. Optional upsample / interleave: REG2.Upsample_rate, Upsample_and_interleave.
  6. Haloize XY transpose / Tileize discontiguous rows: REG2.Haloize_mode, Tileize_mode, Shift_amount_cntx (Unp 0: ColShift, XY-swap into SrcA).
  7. Wait until AllowedClient = Unpackers (blocks until the FPU releases the Src bank; ping-pong with the FPU consumer).
  8. Write to one of three destinations: SrcA[bank][row+SrcRow][col] (Unp 0, skip rows<4 (header); SRCA_SET_Base, ColShift), SrcB[bank][row+SrcRow][col] (Unp 1; SRCB_SET_Base), or Dst[row][col] directly (Unp 0, when Unpack_If_Sel=1; REG5.Dest_cntx, Dst16b/Dst32b).
  9. After the loop: ADC.AddrMod auto-increment (ADDR_MOD_AB_SEC[0..3]); FlipSrc → next bank; SetDvalid → hand bank to MatrixUnit (FPU); AutoIncContextID → bump ContextCounter; Last=1 → end-of-context.

Figure 2. The functional stages an UNPACR instruction goes through inside the unpacker pipeline. The bottom row shows the three possible destinations (SrcA, SrcB, or Dst directly), and the right column lists what the instruction does after the per-datum loop completes.

From the UNPACR functional model (relevant slice):

if (WhichUnpacker == 1) {            // SrcB
    while (SrcB[Bank].AllowedClient != SrcClient::Unpackers) wait;
    Row = (Row + CurrentUnpacker.SrcRow[CurrentThread]) & 0x3f;
    SrcB[Bank][Row][Col] = Datum;
} else if (!UnpackToDst) {           // SrcA
    while (SrcA[Bank].AllowedClient != SrcClient::Unpackers) wait;
    if (Row < 4 || Col < ColShift) continue;     // header rows are skipped
    Row -= 4;
    Col -= ColShift;                              // optional per-context column shift
    Row += CurrentUnpacker.SrcRow[CurrentThread];
    if (Transpose) std::swap(Row & 0xf, Col);    // Haloize_mode XY transpose
    SrcA[Bank][Row][Col] = Datum;
} else {                              // Dst
    Row -= 4;
    if (OutDataFormat in {FP32,TF32,INT32}) Dst32b[Row][Col] = Datum;
    else                                     Dst16b[Row][Col] = Datum;
}
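
The SrcA branch of that slice is compact pseudocode (std::swap on Row & 0xf is not valid C++ as written), so here is one way to spell it out as a plain function. The names SrcAWriteParams and srca_target are hypothetical, and reading the transpose as "exchange the low four bits of Row with Col" is my interpretation of the slice, not a quote from the functional model.

```cpp
#include <cassert>

// Hypothetical inputs that the functional model takes from config/thread state.
struct SrcAWriteParams {
    unsigned col_shift;  // per-context ColShift
    unsigned src_row;    // CurrentUnpacker.SrcRow[CurrentThread]
    bool transpose;      // Haloize_mode XY transpose
};

// Result of mapping one datum: dropped, or written at (row, col) in the bank.
struct SrcATarget {
    bool written;
    unsigned row;
    unsigned col;
};

inline SrcATarget srca_target(unsigned row, unsigned col, const SrcAWriteParams& p) {
    if (row < 4 || col < p.col_shift) {
        return SrcATarget{false, 0u, 0u};  // header rows / shifted-out columns are skipped
    }
    row -= 4;
    col -= p.col_shift;
    row += p.src_row;
    if (p.transpose) {
        // ASSUMPTION: swap the low four bits of row with col, keep the upper row bits.
        const unsigned row_low = row & 0xFu;
        row = (row & ~0xFu) | (col & 0xFu);
        col = row_low;
    }
    return SrcATarget{true, row, col};
}
```

With SrcRow = 16 and transpose on, a datum at logical (row 9, col 2) first becomes row 21, col 2 and then lands at row 18, col 5.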

Key things:

2.5 Throttle / arbitration

There's one shared L1 read port. Each unpacker requests x1 (16 B/cyc), x2 (32), or x4 (64), chosen by THCON_SEC[u].Throttle_mode (0/1/2). Constraints force x1 when decompressing, when upsampling ≥ 3, when BFP2/2a, etc. If both unpackers want bandwidth, an arbiter shaves one of them down — the full conflict table is in UNPACR_Regular.md#performance.
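
For a feel of what the throttle setting means in cycles, the sketch below turns Throttle_mode into bytes per cycle and computes an idealised read time. The helper names are hypothetical, and the cycle count is only a lower bound under the assumption that the unpacker gets its requested bandwidth every cycle (no arbitration loss, no forced x1, header and exponent reads ignored).

```cpp
#include <cassert>
#include <cstdint>

// Throttle_mode 0/1/2 → x1/x2/x4 → 16/32/64 bytes per cycle.
constexpr std::uint32_t l1_read_bytes_per_cycle(unsigned throttle_mode) {
    return 16u << throttle_mode;
}

// Idealised number of cycles to stream `bytes` from L1 (rounded up).
inline std::uint32_t ideal_read_cycles(std::uint32_t bytes, unsigned throttle_mode) {
    const std::uint32_t per_cycle = l1_read_bytes_per_cycle(throttle_mode);
    return (bytes + per_cycle - 1u) / per_cycle;
}
```

A 16×16 face of 2-byte datums is 512 B, so under these assumptions it takes 32 cycles at x1 and 8 cycles at x4.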

2.6 Unpacker format-conversion summary

L1 in | → SrcA/B as | → Dst as
FP32/TF32 | TF32 / BF16 / FP16 | FP32 / BF16 / FP16
BF16 | TF32 / BF16 | BF16
BFP8/4/2 (or INT8 via Force_shared_exp) | TF32 / BF16 | BF16
BFP8a/4a/2a | FP16 | FP16
FP16 / FP8 | FP16 | FP16
INT32 (sign-magnitude) | - | Integer "32"
INT16 | Integer "16" | Integer "16"
INT8 / UINT8 | Integer "8" (via BFP8 path, needs SrcAUnsigned/SrcBUnsigned) | Integer "8"

(From tt-isa-documentation/WormholeB0/.../Unpackers/FormatConversion.md; BH is identical apart from the e4m3 helper.)


3. The Packer

There are four packers. They share one input port to Dst and one output port to L1, but each has its own pipeline and its own copy of REG1 in Config (the four packer configs live as THCON_SEC[01]_REG[18] etc., with PackerIndex selecting which one). Each PACR instruction kicks off one to four packers at once.

3.1 The PACR instruction

tt-llk/tt_llk_blackhole/common/inc/ckernel_ops.h, opcode 0x41. BH encoding (notice it grew vs. WH):

TT_OP_PACR(
    CfgContext,       // 3 bits : which packer config context to use
    RowPadZero,       // 1 bit  : pad short rows with zero
    DstAccessMode,    // 1 bit  : Dst input vs L1 input (Source_interface_selection)
    AddrMode,         // 2 bits : ADDR_MOD_PACK_SEC slot 0..3
    AddrCntContext,   // 2 bits : which ADC to use when OvrdThreadId=1
    ZeroWrite,        // 1 bit  : pull from /dev/null (write zeros)
    ReadIntfSel,      // 1 bit  : alternate Dst read interface
    OvrdThreadId,     // 1 bit  : use AddrCntContext instead of CurrentThread
    Concat,           // 1 bit  : continue current compression row
    CtxtCtrl,         // 1 bit  : auto-context-counter behavior
    Flush,            // 1 bit  : flush pre-L1 output buffers
    Last)             // 1 bit  : flush + start next packer at fresh address

On Blackhole the instruction does not carry PackerMask the way the simplified WH model shows. Instead, the per-context pack_start_intf_pos (4 bits in unpack_config_t/pack_config_t) and the per-packer Enable_out_fifo together select which packers run. Typical kernels enable all four for one full tile (matching the 4 faces) or just one when packing untilized.

There's also TT_OP_PACR_SETREG, a sequenced MMIO-register write that retires after late format conversion of all in-flight packers (used to atomically tick a CB write pointer right when the data hits L1), plus the helper instructions listed in §3.6.

3.2 The packer pipeline (in order)

[Figure 3: packer pipeline diagram. Input is Dst[Row][Col] (or L1 if src_if_sel=1). Stages, in order:]

  1. Input Address Generator → Channel[0]
  2. Edge masking: PCK_EDGE_OFFSET / TILE_ROW_SET_MAPPING
  3. Early format conversion: PCK_DEST_RD_CTRL_* (Dst → intermediate)
  4. ReLU stage: STACC_RELU_ApplyRelu, ReluThreshold
  5. Exponent thresholding: Exp_threshold_en, Exp_threshold
  6. Downsampling (vcompress): Downsample_mask, Downsample_rate
  7. Exponent histogramming: ENABLE_ACC_STATS (CLREXPHIST / SETDMAREG)
  8. Late format conversion: In_data_format → Out_data_format
  9. Compression: RLE-zero + BFP shared-exp assembler
  10. 16-byte buffers (data / exp / RSI), flushed on Flush / Last / full line → L1 write (16-byte aligned), addressed by the Output Address Generator, Channel[1], L1 side: L1_Dest_addr + YZW × strides (~0xf) + optional offset, with Pack_limit_address / Pack_fifo_size wrap.

Per-packer (×4) registers: pack_config_t (REG1), PCK_ADDR_BASE_REG_{0,1}, PCK_ADDR_CTRL_{XY,ZW}, PACK_COUNTERS_SEC[i], DEST_TARGET_REG_CFG_PACK.
Global / shared registers: PCK_EDGE_OFFSET_SEC[0..3], STACC_RELU, INT_DESCALE, ADDR_MOD_PACK_SEC[0..3].
Notes on the diagram: PACR launches ≤4 packers, selected by enable_out_fifo + pack_start_intf_pos. Packer 0 may source from L1 (src_if_sel=1) for L1↔L1 reshape. Last=1 → flush all buffers; the next packer starts at a fresh aligned L1 address.

Figure 3. The 10-stage packer pipeline from Dst to L1. The two highlighted stages (early & late format conversion) are where the most non-trivial bit-level conversion happens; everything else is row-by-row processing. The left column lists the most-relevant per-packer and global config registers feeding the pipeline; the right column lists post-pipeline buffering and output-address rules. Diagram condensed from Packers/README.md and per-stage docs.

3.3 Packer config registers (per-packer "PCK[i]" + per-thread + global)

The four packers reuse THCON_SEC0_REG1_* (packer 0+1) and THCON_SEC1_REG1_* (packer 2+3), plus THCON_SEC0_REG8_* / THCON_SEC1_REG8_* for the other half — see cfg_defines.h for the exact mapping. The pack_config_t in cpack_common.h describes the per-packer 96-bit programmable block (REG1 in WH-language).

REG1 — per-packer "pack_config_t":

Field | Meaning
row_ptr_section_size | Bytes reserved in L1 for the RSI (row-start-index) section when compressing.
exp_section_size | Bytes reserved for the per-16-datum exponent section (BFP outputs / FP8 / INT8).
l1_dest_addr | Base L1 address (16-byte units) for this packer's output.
uncompress | 1 = no RLE-zero compression.
add_l1_dest_addr_offset | If 1, add l1_dest_addr_offset (a 16b TDMA-RISC writable register) to base.
disable_pack_zero_flag | If 1, never produce the per-datum zero flag (forces uncompressed).
out_data_format (4b) | "LateToFormat": what hits L1. Format encoding from §2.3.
in_data_format (4b) | "LateFromFormat": the intermediate format coming into late conversion.
dis_shared_exp_assembler | Disable the BFP shared-exponent assembler.
auto_set_last_pacr_intf_sel | Auto-flip interface-select on the last PACR of a packing burst.
enable_out_fifo | Enable this packer's L1 output FIFO (must be 1 to write at all).
sub_l1_tile_header_size | 1 = the addr math subtracts the 16-byte tile-header bump that L1_Dest_addr otherwise has built in.
src_if_sel (also Source_interface_selection) | Packer 0 only: 0 = source from Dst, 1 = source from L1 (DMA mode used by Mover-style copies).
pack_start_intf_pos (4b) | Which interfaces are "live": bitmask selecting packers 0..3 per face.
all_pack_disable_zero_compress_ovrd | If 1, the global All_pack_disable_zero_compress[i] mask overrides per-packer uncompress.
add_tile_header_size | Add the 16-byte tile header to outputs.
pack_dis_y_pos_start_offset | Disable applying Y-position start offset (used by un-tilize).
l1_src_addr (8b) | High bits of L1 source address when src_if_sel=1.

Other per-packer ("PCK0_*" / "PCK[i]_*") regs (in cfg_defines.h):

Field | Meaning
PCK0_ADDR_BASE_REG_0_Base / _REG_1_Base | Base address for the input (Channel[0], Dst-side) and output (Channel[1], L1-side) sides.
PCK0_ADDR_CTRL_XY_REG_0_Xstride / Ystride | Channel[0] X/Y strides (Dst-side, in bytes for X).
PCK0_ADDR_CTRL_ZW_REG_0_Zstride / Wstride | Channel[0] Z/W strides.
PCK0_ADDR_CTRL_XY_REG_1_Ystride | Channel[1] Y stride (L1-side; X stride on L1 side isn't used, since output is always linear within a row).
PCK0_ADDR_CTRL_ZW_REG_1_Zstride / Wstride | Channel[1] Z/W strides.
PACK_COUNTERS_SEC[i].pack_per_xy_plane | Iteration count per face.
PACK_COUNTERS_SEC[i].pack_reads_per_xy_plane | How many "reads" the TilePositionGenerator does before advancing Y or Z.
PACK_COUNTERS_SEC[i].pack_xys_per_til | XY-plane count per tile.
PACK_COUNTERS_SEC[i].pack_yz_transposed | If 1, advance Z before Y (transposed face order).
PACK_COUNTERS_SEC[i].pack_per_xy_plane_offset | Per-packer Y offset.
DEST_TARGET_REG_CFG_PACK_SEC[i].Offset | Per-packer Dst-row offset (in 16-row units): where in Dst this packer starts reading.
DEST_TARGET_REG_CFG_PACK_SEC[i].ZOffset | Per-packer Z-offset into Dst.
THCON_SEC[01]_REG9.{Pack_limit_address, Pack_fifo_size} | Circular L1 wrap.

Global / shared packer config:

Field | Meaning
PCK_DEST_RD_CTRL_Read_32b_data | 1 = read 32-bit from Dst (Dst32b view), 0 = 16-bit (Dst16b view).
PCK_DEST_RD_CTRL_Round_10b_mant | If 1, round mantissa to 10 bits during early conversion (TF32 path).
PCK_DEST_RD_CTRL_Read_int8 | "Read raw": skip the early conversion rounding (truncate / bitcast).
PCK_DEST_RD_CTRL_Read_unsigned | For INT8/UINT8 path: say whether the byte is signed or unsigned.
ALU_FORMAT_SPEC_REG_Dstacc_override, ALU_FORMAT_SPEC_REG_Dstacc_val, ALU_FORMAT_SPEC_REG2_Dstacc | Specify the IntermediateFormat for the early conversion.
ALU_ROUNDING_MODE_Packer_srnd_en | Stochastic rounding for the packer (BFP rounds).
ALU_ROUNDING_MODE_Gasket_srnd_en | Stochastic rounding for the late conversion gasket.
INT_DESCALE_Enable, INT_DESCALE_Mode, INT_DESCALE_VALUES_SEC[].Value | Shift-amount applied during INT32→INT8 conversion (for quantization).
STACC_RELU_ApplyRelu (4b) | NO_RELU / ZERO_RELU / MIN_THRESHOLD_RELU / MAX_THRESHOLD_RELU (2b per packer).
STACC_RELU_ReluThreshold | 16-bit BF16 or FP16 threshold value.
PCK_EDGE_OFFSET_SEC[0..3].mask | 16-bit column masks for edge masking.
PCK_EDGE_OFFSET_SEC0.{mode, tile_row_set_select_pack[0..3]} | Which of the four masks each packer uses, and what mode means (per-row vs per-face).
TILE_ROW_SET_MAPPING[0..3].row_set_mapping[16] | 2 bits per row of a face → which mask slot to apply.
PCK_EDGE_TILE_FACE_SET_SELECT_*, PCK_EDGE_TILE_ROW_SET_SELECT_* | Higher-level mask selectors.
PCK_EDGE_MODE_mode | 0 = replace masked datums with 0, 1 = replace with −∞ (for softmax-style max-reduce).
ENABLE_ACC_STATS_Enable (per-thread) | Enable exponent histogram.
THCON_SEC[01]_REG1_Downsample_mask, Downsample_rate | 16-bit downsample mask + rate.
THCON_SEC[01]_REG1_Exp_threshold_en, Exp_threshold | Exponent-threshold clamp-to-zero.
THCON_SEC[01]_REG1_pack_dis_y_pos_start_offset, pack_start_intf_pos | Packer interface position controls (untilize mode).
THCON_SEC0_REG1_All_pack_disable_zero_compress (4b mask) | Per-packer compression-disable when override is on.
ADDR_MOD_PACK_SEC[0..3].{Y/Z}{src/dst}{Clear,CR,Incr} | Address-modifier slots used by PACR.AddrMod.
DEST_ACCESS_CFG_{remap_addrs, swizzle_32b, disable_full_write_dest_q_bypass, zeroacc_absolute_tile_mode} | Affect both packer Dst-address swizzle and RISCV Dst access.

3.4 PACR input address generator (Channel[0])

Addr = PCK0_ADDR_BASE_REG_0_Base
     + ADC[w].Ch0.X * (Xstride & 0xf)
     + ADC[w].Ch0.Y * Ystride
     + ADC[w].Ch0.Z * Zstride
     + ADC[w].Ch0.W * Wstride

InputNumDatums = Channel[1].X - Channel[0].X + 1     (one row of one face)

If Source_interface_selection=1 and i==0, the packer fetches from L1 (used for some L1-to-L1 reshape ops); otherwise it fetches from Dst with InputSourceAddr = (Addr/BytesPerDatum) + (DEST_TARGET_REG_CFG_PACK_SEC[i].Offset << 4), interpreted as Row = addr >> 4, Col = addr & 0xf into Dst. BytesPerDatum is decided by In_data_format: 4 for FP32/TF32/I32, 2 for FP16/BF16/I16, else 1.
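
The input-side address math can be sketched as two small functions. All struct and function names are hypothetical, and I am reading the shift in the InputSourceAddr expression as applying to the Offset term only, which is what the Row = addr >> 4 interpretation suggests.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of ADC Channel[0] counters.
struct AdcChannel {
    std::uint32_t x, y, z, w;
};

// Hypothetical snapshot of the Channel[0] base and strides (PCK0_ADDR_*_REG_0).
struct PackInputConfig {
    std::uint32_t base, x_stride, y_stride, z_stride, w_stride;
};

// Addr = Base + X*(Xstride & 0xf) + Y*Ystride + Z*Zstride + W*Wstride
constexpr std::uint32_t pack_input_addr(const PackInputConfig& c, const AdcChannel& ch0) {
    return c.base + ch0.x * (c.x_stride & 0xFu) + ch0.y * c.y_stride +
           ch0.z * c.z_stride + ch0.w * c.w_stride;
}

struct DstPosition {
    std::uint32_t row, col;
};

// ASSUMPTION: datum index = Addr/BytesPerDatum + (Offset << 4); Row/Col split at bit 4.
inline DstPosition dst_position(std::uint32_t addr, std::uint32_t bytes_per_datum,
                                std::uint32_t dest_offset) {
    const std::uint32_t datum = addr / bytes_per_datum + (dest_offset << 4);
    return DstPosition{datum >> 4, datum & 0xFu};
}
```

With 2-byte datums, Xstride = 2 and Ystride = 32, counters X = 3, Y = 2 give Addr = 70, and a packer Offset of 1 then selects Dst row 3, column 3.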

3.5 PACR output address generator (Channel[1])

YZW_Addr = PCK0_ADDR_BASE_REG_1_Base
         + ADC[w].Ch1.Y*Ystride + Z*Zstride + W*Wstride
Addr = PackerIConfig.L1_Dest_addr + !Sub_l1_tile_header_size + (YZW_Addr & ~0xf)
       + (Add_l1_dest_addr_offset ? PackerI.l1_dest_addr_offset : 0)

if (Addr > Pack_limit_address*2 + 1) Addr -= Pack_fifo_size*2   // circular CB wrap

if (compressing) reserve Row_start_section_size bytes for RSI
if (out_data_format < 16b)  reserve Exp_section_size  bytes for exponents
DataStream.ByteAddress = Addr     (only when NeedsNewAddress)

The Y/Z/W contribution is masked & ~0xf so it only adjusts at 256-byte granularity — fine-grained packing within a 16-byte L1 line is done by the buffered output below.
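
A direct transcription of the output-address formula into a function, as a sketch only. PackOutputConfig and pack_output_addr are hypothetical names, the RSI and exponent section reservations are left out, and the units are whatever the formula above uses (I have not tried to resolve them further).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bundle of the per-packer fields the formula reads.
struct PackOutputConfig {
    std::uint32_t l1_dest_addr;
    bool sub_l1_tile_header_size;
    bool add_l1_dest_addr_offset;
    std::uint32_t l1_dest_addr_offset;
    std::uint32_t pack_limit_address;
    std::uint32_t pack_fifo_size;
};

// yzw_addr is the Channel[1] base + Y/Z/W stride sum computed beforehand.
inline std::uint32_t pack_output_addr(const PackOutputConfig& c, std::uint32_t yzw_addr) {
    std::uint32_t addr = c.l1_dest_addr
                       + (c.sub_l1_tile_header_size ? 0u : 1u)  // !Sub_l1_tile_header_size
                       + (yzw_addr & ~0xFu)
                       + (c.add_l1_dest_addr_offset ? c.l1_dest_addr_offset : 0u);
    // Circular CB wrap, as written in the formula.
    if (addr > c.pack_limit_address * 2u + 1u) {
        addr -= c.pack_fifo_size * 2u;
    }
    return addr;
}
```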

Output buffering and 16-byte alignment. Output always reaches L1 as aligned 16-byte writes. Datums accumulate in per-stream buffers (data / exp / RSI) and are flushed to L1 only when a buffer is full. The Flush and Last bits on PACR force a flush; a partially full buffer is zero-padded up to 16 B first. In practice: many small PACRs without Flush share the same line, whereas a PACR with Flush=1 can waste up to 15 B at the end of each stream.
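
The padding cost of a forced flush is simple to compute. A minimal sketch, with hypothetical helper names, for one output stream holding a given number of pending bytes:

```cpp
#include <cassert>
#include <cstdint>

// Bytes actually written to L1 when a stream with `pending` bytes is flushed:
// the count is rounded up to the next multiple of 16.
constexpr std::uint32_t flushed_bytes(std::uint32_t pending) {
    return (pending + 15u) & ~15u;
}

// Zero padding added by the flush (0..15 bytes).
constexpr std::uint32_t flush_padding_bytes(std::uint32_t pending) {
    return flushed_bytes(pending) - pending;
}
```

A stream holding 17 bytes is written as 32 bytes, of which 15 are padding; a stream holding exactly 16 bytes incurs none.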

3.6 Packer-relevant TT instructions

Op | Purpose
TT_OP_PACR | The main pack instruction.
TT_OP_PACR_SETREG | Atomic MMIO register write sequenced after late conversion (used to bump CB write pointers, set semaphore values, kick streams).
TT_OP_SETPKEDGOF | Edge-mask offsets (x_start, x_end, y_start, y_end → PCK_EDGE_OFFSET masks).
TT_OP_CLREXPHIST | Clear all four packers' exponent histograms.
TT_OP_SETASHRMH(0/1), TT_OP_SETASHRMV | Set ALL-SHARED row-mask H/V (halo column/row masks).
TT_OP_SETDMAREG mode 6/7 | Read the exponent histogram bins into a Tensix GPR.
TT_OP_SETDMAREG mode 9 | Read packer 0's max-exponent.
TT_OP_STALLWAIT(stall_res, ...) | Wait for THCON / PACK0..3 / UNPACK0..1 resource idle (this is how Last becomes useful: you flush, then stall until L1 actually has the data).
TT_OP_RESOURCEDECL | Auto-TTSync tracking declaration.

4. Step-by-step: data going CB → SrcA/Dst → Dst → CB on Blackhole

Here is what the LLK (llk_unpack_AB.h + llk_math_eltwise_* + llk_pack.h) actually does for a simple "load tile A, load tile B, A+B, store tile" sequence:

4.1 Setup (called once per format change)

configure_unpack_AB(...) in cunpack_common.h writes (excerpt of actual register writes):

// Tell the FPU what format SrcA and SrcB hold:
cfg_reg_rmw_tensix<ALU_FORMAT_SPEC_REG_SrcA_val_ADDR32, ...>(unp_src_format_A);
cfg_reg_rmw_tensix<ALU_FORMAT_SPEC_REG_SrcB_val_ADDR32, ...>(unp_src_format_B);

// Configure each unpacker's REG0_TileDescriptor (4-dword blob):
cfg[THCON_SEC0_REG0_TileDescriptor_ADDR32 + 0..3] = {format/xdim/ydim/zdim/wdim/...};
cfg[THCON_SEC1_REG0_TileDescriptor_ADDR32 + 0..3] = ...;

// Configure REG2 (out format, throttle, context_count, tileize, etc.):
cfg[THCON_SEC0_REG2_Out_data_format_ADDR32 word] = ...;
cfg[THCON_SEC1_REG2_Out_data_format_ADDR32 word] = ...;

// e4m3 helper:
cfg_reg_rmw_tensix<THCON_SEC0_REG1_Unp_LF8_4b_exp_RMW>(unpA_is_e4m3);
cfg_reg_rmw_tensix<THCON_SEC1_REG1_Unp_LF8_4b_exp_RMW>(unpB_is_e4m3);

// REG3 base addresses (= base of the CB in L1, /16 since it's 16B units):
cfg[THCON_SEC0_REG3_Base_address_ADDR32] = cb_a_base >> 4;
cfg[THCON_SEC1_REG3_Base_address_ADDR32] = cb_b_base >> 4;

// REG5 per-context Dst rows / face dims (multi-context unpacker 0 only):
cfg[THCON_SEC0_REG5_Tile_x_dim_cntx0_ADDR32] = face_dim | (face_dim << 16);
cfg[THCON_SEC0_REG5_Dest_cntx0_address_ADDR32] = Dest_cntx0 | (Dest_cntx1 << 16);

// Reset ADC counters for the two unpackers + the packer:
TTI_SETADCXY(p_setadc::UNP_A | p_setadc::UNP_B, 0,0,0,0, 0b1011);
TTI_SETADCZW(p_setadc::UNP_A | p_setadc::UNP_B, 0,0,0,0, 0b1111);
TTI_SETADCXY(0b100, 0,0,0,0, 0b1011);   // packer side
TTI_SETADCZW(0b100, 0,0,0,0, 0b1111);

// "X-end" per face, written into ADC[*].Unpacker[*].Channel[1].X:
TT_SETADCXX(p_setadc::UNP_A, unpA_x_end, 0);
TT_SETADCXX(p_setadc::UNP_B, (unpB_face_r_dim << 4) - 1, 0);

// SrcA row base for ping-pong:
TTI_SETC16(SRCA_SET_Base_ADDR32, 0x4);

And the packer side from cpack_common.h (set_packer_strides):

TT_SETDMAREG(0, LOWER_HALFWORD(y_stride << PCK0_ADDR_CTRL_XY_REG_0_Ystride_SHAMT), 0, LO_16(TMP0));
TT_SETDMAREG(0, UPPER_HALFWORD(y_stride << PCK0_ADDR_CTRL_XY_REG_0_Ystride_SHAMT), 0, HI_16(TMP0));
TT_SETDMAREG(0, LOWER_HALFWORD(z_stride << PCK0_ADDR_CTRL_ZW_REG_0_Zstride_SHAMT), 0, LO_16(TMP1));
TT_SETDMAREG(0, UPPER_HALFWORD(w_stride << PCK0_ADDR_CTRL_ZW_REG_0_Wstride_SHAMT), 0, HI_16(TMP1));
TTI_STALLWAIT(p_stall::STALL_CFG, p_stall::THCON);
TTI_WRCFG(TMP0, p_cfg::WRCFG_32b, PCK0_ADDR_CTRL_XY_REG_0_Xstride_ADDR32);
TTI_WRCFG(TMP1, p_cfg::WRCFG_32b, PCK0_ADDR_CTRL_ZW_REG_0_Zstride_ADDR32);

4.2 Per-tile UNPACR (T0)

  1. The kernel waits on the producer CB semaphore (cb_wait_front) — done in RISCV, not Tensix.
  2. T0 issues one UNPACR per face. Each UNPACR advances ADC Channel[0].Y → the next face row; on the 4th call the AddrMod flips back and bumps Z.
  3. The unpacker reads REG3_Base_address, follows REG7_Offset_address, then iterates 16×16=256 datums, doing format conversion + writing to SrcA[bank][row][col] (and stalls on AllowedClient).
  4. The last UNPACR of a tile sets SetDatValid=1 (or it's followed by UNPACR_NOP_SETDVALID) — that flips AllowedClient = MatrixUnit so the FPU can read this bank.
  5. The unpacker can immediately start filling the other bank while FPU consumes the first one.
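
The bank hand-off in steps 2 to 5 can be modelled as a small state machine. This is a toy sketch, not the hardware's behaviour: the type and function names are hypothetical, and a refused UNPACR is reported as `false` here, whereas the real unpacker would stall until the bank is released.

```cpp
#include <cassert>
#include <cstdint>

enum class Client : std::uint8_t { Unpacker, MatrixUnit };

// Toy model of the two SrcA banks and their AllowedClient gate.
struct SrcPingPong {
    Client allowed[2] = {Client::Unpacker, Client::Unpacker};
    int faces_written[2] = {0, 0};
    int unpack_bank = 0;  // bank the unpacker fills next
    int math_bank = 0;    // bank the FPU reads next

    // One UNPACR. Returns false (models a stall) if the FPU still owns the bank.
    bool unpack_face(bool set_dat_valid) {
        if (allowed[unpack_bank] != Client::Unpacker) return false;
        ++faces_written[unpack_bank];
        if (set_dat_valid) {  // last face of the tile: hand the bank to the FPU
            allowed[unpack_bank] = Client::MatrixUnit;
            unpack_bank ^= 1;  // continue with the other bank
        }
        return true;
    }

    // The FPU consumes a ready bank and returns it; false if nothing is ready.
    bool math_consume() {
        if (allowed[math_bank] != Client::MatrixUnit) return false;
        faces_written[math_bank] = 0;
        allowed[math_bank] = Client::Unpacker;
        math_bank ^= 1;
        return true;
    }
};

// Unpack one 4-face tile; returns how many UNPACRs were accepted.
inline int unpack_tile(SrcPingPong& s) {
    int accepted = 0;
    for (int face = 0; face < 4; ++face)
        accepted += s.unpack_face(/*set_dat_valid=*/face == 3) ? 1 : 0;
    return accepted;
}
```

Two tiles can be unpacked back to back, one per bank. A third is refused until `math_consume()` releases a bank, which is the overlap described in step 5.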

4.3 Math (T1)

4.4 PACR (T2)

  1. T2 (sometimes T0/T1 in fused kernels) waits on free space in the output CB.
  2. It issues PACR with PackerMask = 0xF to run all four packers in parallel — each packer is configured to point at a different quarter of Dst via DEST_TARGET_REG_CFG_PACK_SEC[i].Offset, and each lands in a different quarter of the output tile via PCK0_ADDR_*_REG_1 strides.
  3. After all faces are packed, a final PACR with Last=1 (or Flush=1) flushes the 16-byte buffers to L1.
  4. PACR_SETREG (or a follow-up MMIO write) bumps the CB write pointer so the consumer can pick up the tile.
  5. STALLWAIT(p_stall::PACK0|PACK1|PACK2|PACK3, ...) is used if the kernel needs to be sure the L1 store has happened before proceeding (e.g. before triggering a NoC send).
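
How a single PACR with PackerMask = 0xF fans out across the four packers can be sketched as a planning function. Everything here is an illustrative assumption: one 16×16 face per packer, face i starting at Dst row 16·i relative to the tile, and faces stored back to back in the output tile. The struct and function names are hypothetical and do not correspond to LLK symbols.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct PackJob {
    int packer;               // packer index, 0..3
    std::uint32_t dst_row;    // first Dst row (relative to the tile) it reads
    std::uint32_t l1_offset;  // byte offset of its face inside the output tile
};

// Expand a 4-bit PackerMask into one job per enabled packer.
// Assumptions: one 16x16 face per packer, faces contiguous in L1.
inline std::vector<PackJob> plan_pacr(std::uint32_t packer_mask,
                                      std::uint32_t bytes_per_datum) {
    constexpr std::uint32_t kFaceRows = 16;
    constexpr std::uint32_t kFaceCols = 16;
    std::vector<PackJob> jobs;
    for (int p = 0; p < 4; ++p) {
        if (!(packer_mask & (1u << p))) continue;  // packer not selected
        const std::uint32_t up = static_cast<std::uint32_t>(p);
        jobs.push_back({p, up * kFaceRows,
                        up * kFaceRows * kFaceCols * bytes_per_datum});
    }
    return jobs;
}
```

With 2-byte datums each face occupies 512 bytes, so every face offset in this model is a multiple of 16 and consistent with 16-byte aligned L1 writes.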

5. Complete list of unpacker-related TT* instructions on Blackhole

From tt-llk/tt_llk_blackhole/common/inc/ckernel_ops.h. (TT_OP_X is the encoder that builds the instruction word; TTI_X is the form whose operands are compile-time immediates; TT_X accepts operands computed at run time.)

| Instruction | Opcode | Purpose |
| --- | --- | --- |
| TT_OP_UNPACR(...) | 0x42 | The unpack instruction. 13 fields described in §2.1. |
| TT_OP_UNPACR_NOP(...) | 0x43 | Multi-mode unpacker-sequenced no-op: ZEROSRC / SETDVALID / SETREG / MsgClr / pure-NOP / Unpack_Pop. Used to flip Src banks, zero a bank, push register writes through the unpacker queue. |
| TT_OP_ZEROSRC(zero_val, write_mode, bank_mask, src_mask) | 0x11 | Zero a SrcA/SrcB bank directly (bypasses unpacker pipeline; sometimes faster than going through UNPACR_NOP). |
| TT_OP_SETDVALID(setvalid) | 0x57 | Set "AllowedClient = MatrixUnit" on a Src bank (i.e. mark unpacked data ready for FPU). Bit 0 = SrcA, bit 1 = SrcB. |
| TT_OP_TRNSPSRCA | 0x14 | Transpose SrcA in place. |
| TT_OP_TRNSPSRCB | 0x16 | Transpose SrcB in place (BH-specific; not on Wormhole). |
| TT_OP_SETADC(CntSetMask, ChannelIndex, DimensionIndex, Value) | n/a | Set one of {X,Y,Z,W} on Channel[0] or Channel[1] of one ADC bank. |
| TT_OP_SETADCXY(CntSetMask, Ch1_Y, Ch1_X, Ch0_Y, Ch0_X, BitMask) | n/a | Set X/Y on both channels of selected ADC banks at once. BitMask selects which of {Ch0X,Ch0Y,Ch1X,Ch1Y} are written. |
| TT_OP_SETADCZW(CntSetMask, Ch1_W, Ch1_Z, Ch0_W, Ch0_Z, BitMask) | n/a | Same for Z/W. |
| TT_OP_SETADCXX(CntSetMask, x_end2, x_start) | 0x5e | Set Ch0.X = x_start, Ch1.X = x_end2 in one shot (the typical "set face width minus 1"). |
| TT_OP_INCADCXY(CntSetMask, Ch1_Y, Ch1_X, Ch0_Y, Ch0_X) | n/a | Increment X/Y by the given deltas (signed). |
| TT_OP_INCADCZW(CntSetMask, Ch1_Y, Ch1_X, Ch0_Y, Ch0_X) | n/a | Increment Z/W. |
| TT_OP_ADDRCRXY(CntSetMask, Ch1_Y, Ch1_X, Ch0_Y, Ch0_X, BitMask) | n/a | Copy Y_Cr → Y / X_Cr → X (used to start a fresh face row). |
| TT_OP_ADDRCRZW(CntSetMask, Ch1_Y, Ch1_X, Ch0_Y, Ch0_X, BitMask) | n/a | Same for Z/W. |
| TT_OP_REG2FLOP(SizeSel, TargetSel, ByteOffset, ContextId_2, FlopIndex, RegIndex) | n/a | Load an ADC/RWC value (or per-context unpacker offset) from L1 via a register copy. Used to set up multi-context unpack from a descriptor in L1. |
| TT_OP_SETC16(reg, val) | 0xb2 | Write a 16-bit immediate to ThreadConfig[CurrentThread]. This is the only way to set UNPACK_MISC_CFG_CfgContextOffset_*, UNPACK_MISC_CFG_CfgContextCntReset_*, SRCA_SET_Base, SRCB_SET_Base, CFG_STATE_ID_StateID and the rest of the per-thread unpacker config. |
| TT_OP_WRCFG(GprAddress, wr128b, CfgReg) | 0xb0 | Write 32b (or 128b if wr128b=1) from a Tensix GPR to Config[StateID]. Used to set REG0 TileDescriptor (4 dwords as one 128b write), and all the THCON_SEC[u].REG[1..7] words. |
| TT_OP_RDCFG(GprAddress, CfgReg) | 0xb1 | Read a Config word into a Tensix GPR; used by introspection / read_unpack_tile_descriptor(). |
| TT_OP_RMWCIB0/1/2/3(Mask, Data, CfgRegAddr) | 0xb3..0xb6 | Read-modify-write a byte of a Config word; used for cfg_reg_rmw_tensix<...> to flip a single bit/field without disturbing neighbours (e.g. Unp_LF8_4b_exp, Ovrd_data_format). The 4 variants target the 4 bytes of the word. |
| TT_OP_STREAMWRCFG(stream_id_sel, ...) | n/a | Write a Config word from a NoC overlay stream; used when descriptor data is being streamed in. |
| TT_OP_STALLWAIT(stall_res, wait_res) | n/a | Block this thread until specified resources (e.g. UNPACK0, UNPACK1, THCON, XSEARCH) are idle. The single most common synchronization primitive between config writes and unpack/pack work. |
| TT_OP_RESOURCEDECL(linger_time, ...) | n/a | Declare resources being used by upcoming instructions, for the BH Auto-TTSync hardware. |
| TT_OP_REPLAY(start_idx, len, ...) | n/a | Execute a saved sequence of Tensix instructions (the LLK uses this to compress repetitive UNPACR sequences). |
| TT_OP_XMOV(Mov_block_selection, ...) | n/a | Mover instruction, sometimes used by unpack-untilize when the data needs an L1↔L1 copy before unpack. |
| TT_OP_MOVA2D(...), TT_OP_MOVB2D(...) | n/a | Copy SrcA/SrcB → Dst directly via the FPU's data path, as an alternative to "UNPACR into SrcA, then FPU copies to Dst". Listed here because they are sometimes used inside llk_unpack_* paths when staging into Dst is needed. |
| TT_OP_MOVD2A(...), TT_OP_MOVD2B(...) | n/a | Reverse direction: Dst → SrcA/SrcB. Used by llk_math_transpose_dest.h and matmul to feed the next round of FPU ops from Dst. |
| TT_OP_MOVDBGA2D(...), TT_OP_MOVDBGB2D(...) | n/a | Debug-path variants of MOVA2D/MOVB2D used by the debug tooling. |
| TT_OP_MOVB2A(...) | 0x0b | Copy SrcB → SrcA. |
| TT_OP_SETPKEDGOF(y_end, y_start, x_end, x_start) | 0x1d | Edge-mask offsets (technically a packer control instruction, but uses the same ADC infrastructure and is often issued by the unpack thread when tile-padding state is being set up). |
| TT_OP_SETASHRMH, SETASHRMH0, SETASHRMH1, SETASHRMV(reg_mask2) | 0x1e, 0x1a, 0x1b, 0x1c | "All-shared halo mask": set the row/column halo bits used by the unpacker's halo-padding mode (also feeds into packer edge masks). |
| TT_OP_CLREXPHIST | 0x21 | Clears packer exponent histograms; listed here because it's commonly placed in the unpack/setup phase to start fresh telemetry. |
| TT_OP_SETRWC(...), TT_OP_SETIBRWC(...) | n/a | Set the RWC (FPU-side row/word counters). Not strictly an unpacker op, but unpack sequences typically reset them between tiles since the FPU is the immediate consumer of unpacker output. |

(Rows marked n/a are instructions whose opcode I did not list explicitly; see ckernel_ops.h for the exact encoding of each one.)
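
As a rough illustration of what "encoder" means for the TT_OP_X macros, the sketch below models an instruction word as an 8-bit opcode in the top byte with the operand fields packed below it. That layout and the helper names are assumptions made for illustration; ckernel_ops.h remains the authority for the real field positions of each instruction.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: bits [31:24] hold the opcode, bits [23:0] the operand fields.
constexpr std::uint32_t encode_op(std::uint32_t opcode, std::uint32_t params) {
    return ((opcode & 0xFFu) << 24) | (params & 0x00FFFFFFu);
}

// Recover the opcode byte from an encoded word.
constexpr std::uint32_t opcode_of(std::uint32_t word) {
    return word >> 24;
}

// SETDVALID using the 0x57 opcode from the table. Placing `setvalid` in the
// lowest bits is an assumption, not taken from the header.
constexpr std::uint32_t encode_setdvalid(std::uint32_t setvalid) {
    return encode_op(0x57u, setvalid);
}
```

Under these assumptions, marking both SrcA and SrcB valid (`setvalid = 3`) would encode to `0x57000003`.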


6. Source references