Blackhole Tensix Unpacker & Packer — Complete Reference

This reference traces how data moves from the input circular buffer in L1 through the unpackers into SrcA/SrcB/Dst, gets computed on by the FPU/SFPU, and is written back to L1 by the four packers. It covers every config register, every relevant TT* instruction, and the LLK programming model.

Note. Tenstorrent's official ISA documentation is split between tt-isa-documentation/WormholeB0/ (which has the detailed Unpackers/Packers sub-trees) and tt-isa-documentation/BlackholeA0/ (which has BH-specific overrides like Dst.md, BackendConfiguration.md, ConfigurationUnit.md). On Blackhole the functional model is largely the same as Wormhole B0: the same UNPACR / PACR pipeline, with a few BH-specific additions (Auto-TTSync, extra context counts, an FP8 e4m3 helper, 1536 KiB (1.5 MiB) of L1 instead of 1464 KiB, the Mover, new SFPU ops). Below I cite which docs the details come from and show how the LLK (tt-llk/tt_llk_blackhole/) actually programs the registers.

The big picture: L1 → SrcA/SrcB/Dst → math → Dst → L1
The Unpacker
The Packer
Step-by-step: data going CB → SrcA/Dst → Dst → CB
Complete list of unpacker-related TT* instructions
Source references

1. The big picture: L1 → SrcA/SrcB/Dst → math → Dst → L1

A Tensix tile has the data path shown in Figure 1.

Figure 1. CB → SrcA/SrcB/Dst → math → Dst → CB. Dashed arrow shows Unpacker 0's optional direct-to-Dst bypass; bidirectional arrow shows the SFPU operating in-place on Dst.

Three Tensix-coprocessor threads (T0/T1/T2) issue Tensix instructions independently and the backend execution units (Unpackers / Matrix Unit / Vector Unit / Packers / ThCon / Sync / Mover / Configuration / Misc) are shared. Unpackers usually take instructions from T0, the FPU/SFPU from T1, the packers from T2 — but any thread can issue any instruction. Synchronization is via STALLWAIT, semaphores, SETDVALID, and the SrcA/SrcB "AllowedClient" mechanism described below.

1.1 SrcA / SrcB / Dst shapes (Blackhole)

SrcA and SrcB each have two banks; a bank is 64 rows × 16 columns. Each datum is up to 19 bits wide (1 sign + 8 exponent + 10 mantissa, i.e. TF32). Software-visible types in SrcA/SrcB are: TF32, BF16, FP16, Integer "16" (opaque), Integer "8" (1S + 10M magnitude). One bank therefore holds 64 × 16 = 1024 datums, i.e. up to four 16×16 faces; unpackers and the FPU swap banks via the "AllowedClient" hand-off.
Dst is 1024 rows × 16 cols × 16 bits (Dst16b view) or 512 rows × 16 cols × 32 bits (Dst32b view) — same underlying storage. Dst can hold FP32, BF16, FP16, Integer "32", Integer "16", Integer "8". Dst's row index goes through the Adj16/Adj32 swizzle (DEST_ACCESS_CFG_remap_addrs, DEST_ACCESS_CFG_swizzle_32b). Each row has a 1-bit DstRowValid flag used by the hardware to track when a write has retired vs. is still in flight (4-cycle scoreboard).
A tile in Tensix convention = 32×32 datums, divided into 4 faces (F0..F3) of 16×16 each, each face stored row-major. Multiple tiles can be held in Dst (configured via DEST_TARGET_REG_CFG_*_Offset).
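
To make the face layout concrete, here is a minimal sketch that maps a (row, col) position in a 32×32 tile to its face and to a linear datum offset. The face ordering (F0 top-left, F1 top-right, F2 bottom-left, F3 bottom-right, stored back to back) and the names TilePos and tile_pos are assumptions made for illustration; only the 32×32 tile / 16×16 row-major face geometry comes from the text above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type; not part of the LLK.
struct TilePos {
    std::uint32_t face;    // 0..3 (F0..F3)
    std::uint32_t offset;  // linear datum index inside the tile, 0..1023
};

// ASSUMPTION: faces are ordered F0 = top-left, F1 = top-right,
// F2 = bottom-left, F3 = bottom-right, and stored back to back.
inline TilePos tile_pos(std::uint32_t row, std::uint32_t col) {
    assert(row < 32 && col < 32);
    const std::uint32_t face = (row / 16) * 2 + (col / 16);
    const std::uint32_t in_face = (row % 16) * 16 + (col % 16);  // row-major within a face
    return TilePos{face, face * 256 + in_face};
}
```

Under these assumptions, element (0, 16) is the first datum of F1 and lands at offset 256, right after the 256 datums of F0.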

1.2 ADCs (Addressing Counters)

Both unpackers and packers compute addresses by stepping through "ADC" channels rather than by getting an explicit address each instruction. Each thread has one ADC per unpacker (and a shared one for the packers). Each ADC has two channels (Channel[0] = input/source side, Channel[1] = output/dest side), each with X, Y, Z, W counters + Y_Cr, Z_Cr "current row" shadow counters.

ADC counters are advanced by:

SETADC / SETADCXY / SETADCZW / SETADCXX — absolute set
INCADCXY / INCADCZW — increment
ADDRCRXY / ADDRCRZW — copy Cr → current
REG2FLOP — load from a Tensix GPR
The AddrMode field of UNPACR (four packed 2-bit increments, see §2.1) or of PACR (which selects one of the ADDR_MOD_PACK_SEC[i] "modifier" config slots, indexed 0..3)

This is how a single UNPACR for a face advances to the next face automatically.

2. The Unpacker

There are two unpackers. Unpacker 0 writes to SrcA or directly to Dst. Unpacker 1 writes to SrcB. They share an L1 read port but can run concurrently with throttle-mode arbitration (see §2.5).

2.1 The UNPACR instruction

Source: tt-llk/tt_llk_blackhole/common/inc/ckernel_ops.h, opcode 0x42. The full BH encoding is:

TT_OP_UNPACR(
    Unpack_block_selection,  // 1 bit  : 0 = Unpacker 0, 1 = Unpacker 1
    AddrMode,                // 8 bits : packed Ch0YInc/Ch0ZInc/Ch1YInc/Ch1ZInc (2 bits each)
    CfgContextCntInc,        // 2 bits : used to bump context-counter after instr
    CfgContextId,            // 3 bits : explicit context (used when AutoIncContextID=0)
    AddrCntContextId,        // 2 bits : which ADC bank to use (0..2 = T0..T2)
    OvrdThreadId,            // 1 bit  : was MultiContextMode in WH docs (see below)
    SetDatValid,             // 1 bit  : hand SrcA/B bank to Matrix Unit after unpack
    srcb_bcast,              // 1 bit  : RAREFYB / broadcast B
    ZeroWrite2,              // 1 bit  : AllDatumsAreZero — write zeros instead of fetching
    AutoIncContextID,        // 1 bit  : use context counter instead of CfgContextId
    RowSearch,               // 1 bit  : use BlobsYStart row-search mode
    SearchCacheFlush,        // 1 bit  : flush the RSI cache
    Last)                    // 1 bit  : end-of-context marker

The full functional model is the one in WormholeB0/.../UNPACR_Regular.md (621 lines). The same model applies on Blackhole; the field layout above is the BH-specific encoding (AddrCntContextId and OvrdThreadId are BH's names for WH's ContextADC and MultiContextMode).
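
As an illustration of the field list above, the sketch below assembles a 32-bit UNPACR word from its 13 fields. It assumes the fields are packed MSB-first in listing order into bits [23:0], with the opcode 0x42 in bits [31:24]; that packing, and the names UnpacrFields and encode_unpacr, are assumptions for this example. The authoritative encoding is the TT_OP_UNPACR macro in ckernel_ops.h.

```cpp
#include <cassert>
#include <cstdint>

// Field widths follow the listing above. Hypothetical struct, not from the LLK.
struct UnpacrFields {
    std::uint32_t unpack_block_selection = 0;  // 1 bit
    std::uint32_t addr_mode = 0;               // 8 bits
    std::uint32_t cfg_context_cnt_inc = 0;     // 2 bits
    std::uint32_t cfg_context_id = 0;          // 3 bits
    std::uint32_t addr_cnt_context_id = 0;     // 2 bits
    std::uint32_t ovrd_thread_id = 0;          // 1 bit
    std::uint32_t set_dat_valid = 0;           // 1 bit
    std::uint32_t srcb_bcast = 0;              // 1 bit
    std::uint32_t zero_write2 = 0;             // 1 bit
    std::uint32_t auto_inc_context_id = 0;     // 1 bit
    std::uint32_t row_search = 0;              // 1 bit
    std::uint32_t search_cache_flush = 0;      // 1 bit
    std::uint32_t last = 0;                    // 1 bit
};

// ASSUMPTION: opcode in bits [31:24], fields packed MSB-first in listing order.
inline std::uint32_t encode_unpacr(const UnpacrFields& f) {
    std::uint32_t word = 0;
    // Append one field below the bits accumulated so far.
    auto push = [&word](std::uint32_t value, unsigned width) {
        assert(value < (1u << width));
        word = (word << width) | value;
    };
    push(0x42, 8);  // UNPACR opcode
    push(f.unpack_block_selection, 1);
    push(f.addr_mode, 8);
    push(f.cfg_context_cnt_inc, 2);
    push(f.cfg_context_id, 3);
    push(f.addr_cnt_context_id, 2);
    push(f.ovrd_thread_id, 1);
    push(f.set_dat_valid, 1);
    push(f.srcb_bcast, 1);
    push(f.zero_write2, 1);
    push(f.auto_inc_context_id, 1);
    push(f.row_search, 1);
    push(f.search_cache_flush, 1);
    push(f.last, 1);
    return word;
}
```

The field widths sum to 24 bits, which is why they fit exactly below an 8-bit opcode in a 32-bit instruction word.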

The high-level steps the hardware performs:

1. Pick state-bank from ThreadConfig[CurrentThread].CFG_STATE_ID_StateID  (0 or 1)
2. Resolve WhichContext (0..7) from either CfgContextId or the per-unpacker
   ContextCounter, then add UNPACK_MISC_CFG_CfgContextOffset[WhichUnpacker]
3. Read the tile descriptor (THCON_SEC[u].REG0_TileDescriptor) → InDataFormat,
   XDim, YDim, ZDim, WDim, IsUncompressed, BlobsPerXYPlane, DigestSize, ...
4. Compute starting L1 address from REG3_Base_address (+ REG3_Base_cntx in MC mode)
   and REG7_Offset_address, skip 16 bytes of tile header times (1+DigestSize).
   If compressed: also read row-start indices from a sub-region.
   If BFP format: also point at the per-16-datum exponent sub-region.
5. Compute number of input datums = (ADC.Channel[1].X+1) - ADC.Channel[0].X
6. Compute starting OUTPUT address inside SrcA/SrcB/Dst from:
      UNP[u].ADDR_BASE_REG_1_Base + ADC.Ch1.Y*Ystride + Z*Zstride + W*Wstride
   (in datums, not bytes; >>1 for 16-bit, >>2 for 32-bit output formats)
7. For each input datum:
      a. Read DatumSizeBytes from L1
      b. If BFP: read shared 8-bit exponent (1 byte per 16 datums)
      c. If compressed: read the 4-bit RLE delta and insert that many zeroes
      d. FormatConversion(InDataFormat, OutDataFormat, datum, exp, unpacker, toDst)
      e. Optional upsampling (insert 0/1/2/4 zero datums after each datum)
      f. Write to SrcA[bank][row][col], SrcB[bank][row][col], or Dst32b/Dst16b
         (wait if AllowedClient ≠ Unpackers — i.e. FPU hasn't released the bank)
8. After the loop: bump ADC channels by AddrMod, optionally flip the SrcA/SrcB
   bank (FlipSrc) and hand the now-full bank to the Matrix Unit, bump
   ContextCounter if AutoIncContextID.
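
Steps 5 and 6 above are plain arithmetic and can be sketched as follows. The struct UnpackOutCfg and the function names are hypothetical, and the handling of the ">>1 / >>2" note (shifting the base-plus-strides sum to obtain a datum index) is one reading of the step, not a statement of what the hardware does.

```cpp
#include <cassert>
#include <cstdint>

// Step 5: datums moved by one UNPACR = (Channel[1].X + 1) - Channel[0].X.
inline std::uint32_t num_input_datums(std::uint32_t ch0_x, std::uint32_t ch1_x) {
    assert(ch1_x + 1 >= ch0_x);
    return (ch1_x + 1) - ch0_x;
}

// X-end value for a face with face_r_dim rows of 16 datums each.
inline std::uint32_t x_end_for_face(std::uint32_t face_r_dim) {
    assert(face_r_dim >= 1);
    return (face_r_dim << 4) - 1;
}

// Hypothetical container for the UNP[u] output-side base and strides.
struct UnpackOutCfg {
    std::uint32_t base;
    std::uint32_t y_stride;
    std::uint32_t z_stride;
    std::uint32_t w_stride;
};

// Step 6. ASSUMPTION: the sum is shifted right by 1 for 16-bit output
// formats and by 2 for 32-bit ones to give a datum index.
inline std::uint32_t output_start_datum(const UnpackOutCfg& c, std::uint32_t y, std::uint32_t z,
                                        std::uint32_t w, unsigned out_datum_bytes) {
    assert(out_datum_bytes == 1 || out_datum_bytes == 2 || out_datum_bytes == 4);
    const std::uint32_t sum = c.base + y * c.y_stride + z * c.z_stride + w * c.w_stride;
    const unsigned shift = (out_datum_bytes == 4) ? 2 : (out_datum_bytes == 2) ? 1 : 0;
    return sum >> shift;
}
```

For a full 16-row face, x_end_for_face(16) gives 255, and with Channel[0].X = 0 one UNPACR then moves 256 datums.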

The address-wrap behaviour is the circular FIFO in L1: addresses beyond Unpack_limit_address wrap by subtracting Unpack_fifo_size. This is what lets the CB abstraction work — the CB is just a configured circular range in L1.
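
The wrap rule can be written as a one-line helper. This is a minimal sketch with all values in 16-byte units; the function name is hypothetical, and treating the limit as inclusive (wrap only when the address is strictly greater) is an assumption based on the word "beyond" above.

```cpp
#include <cassert>
#include <cstdint>

// All arguments are in 16-byte units, like Unpack_limit_address and
// Unpack_fifo_size. ASSUMPTION: the limit itself is still a valid address.
inline std::uint32_t wrap_fifo_addr(std::uint32_t addr, std::uint32_t limit_address,
                                    std::uint32_t fifo_size) {
    assert(fifo_size > 0 && fifo_size <= limit_address + 1);
    return (addr > limit_address) ? addr - fifo_size : addr;
}
```

With a CB occupying units 0x100..0x11F (limit 0x11F, size 0x20), an access at 0x120 wraps back to 0x100 while 0x11F is left alone.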

2.2 UNPACR sub-variants (UNPACR_NOP, opcode `0x43`)

UNPACR_NOP (TT_OP_UNPACR_NOP) is a multi-mode "stall/control" instruction that runs in-order with the unpacker queue, used so the no-op effect happens at exactly the right point in the unpacker pipeline:

Mode	Effect	Doc
`ZEROSRC` mode	Set SrcA or SrcB bank to all zeroes, sequenced with UNPACR	`UNPACR_NOP_ZEROSRC.md`
`SETDVALID` mode (`Set_Dvalid`=1)	Mark Src bank as valid → hand to Matrix Unit (`AllowedClient = MatrixUnit`)	`UNPACR_NOP_SETDVALID.md`
`SETREG` mode	MMIO register write, sequenced with UNPACR	`UNPACR_NOP_SETREG.md`
`Msg_Clr_Cnt` mode	Stream/Overlay message clear	`UNPACR_NOP_OverlayClear.md`
Nop mode	Occupy unpacker for one cycle (used as a barrier/replay padding)	`UNPACR_NOP_Nop.md`
`Unpack_Pop` mode	Pop unpacker FIFO	(covered in `UNPACR_NOP.md`)

UNPACR itself has three sub-modes, selected by how Unpack_block_selection, SetDatValid, etc. are programmed:

"Increment context counter" (UNPACR_IncrementContextCounter.md) — bumps the per-unpacker ContextCounter[CurrentThread] without doing any L1 access.
"Flush decompression row-start cache" (UNPACR_FlushCache.md) — invalidates the small RSI cache used when decompressing.
"Move datums" (UNPACR_Regular.md) — the normal one described above.

2.3 Unpacker config registers (THCON_SEC[u] = sec 0 for Unpacker 0 / sec 1 for Unpacker 1)

These live in the global Config[2][...] array (TENSIX_CFG_BASE), with the symbols in tt_metal/hw/inc/internal/tt-1xx/blackhole/cfg_defines.h. There are ~370 packer/unpacker fields total; the ones that matter for unpack are:

REG0 — TileDescriptor (4 × 32-bit, written via `WRCFG ... wr128b=1` as a 128-bit blob).

This is the single most important block; unpack_tile_descriptor_t in cunpack_common.h shows the BH bitfield layout:

Field	Bits	Meaning
`in_data_format`	[3:0]	Source format in L1: encoding table below.
`uncompressed`	[4]	1 = no RLE; 0 = decompress on the fly (must also configure `Disable_zero_compress_cntx`).
`blobs_per_xy_plane`	[11:8]	Sparsity ("blob") structure — used with `RowSearch` + `BlobsYStart`.
`x_dim`	[31:16]	X (column) dim of the source tile in L1.
`y_dim`	[47:32]	Y (row) dim.
`z_dim`	[63:48]	Z stride count (0 means 1).
`w_dim`	[79:64]	W stride count (0 means 1).
`blobs_y_start`	[111:80]	Row-start table for sparse blob mode.
`digest_type` / `digest_size`	[119:112] / [127:120]	Tile-header skip = `(1+digest_size)*16` bytes.
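
The lower fields of the table can be packed into the four 32-bit words of the descriptor as sketched below. Only the fields up to bit 79 are handled; blobs_y_start and the digest fields are left at zero. The struct and function names are hypothetical, and the real layout should be taken from unpack_tile_descriptor_t in cunpack_common.h.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical plain-struct view of the descriptor fields up to bit 79.
struct TileDescriptorFields {
    std::uint32_t in_data_format;      // bits [3:0]
    bool uncompressed;                 // bit  [4]
    std::uint32_t blobs_per_xy_plane;  // bits [11:8]
    std::uint16_t x_dim;               // bits [31:16]
    std::uint16_t y_dim;               // bits [47:32]
    std::uint16_t z_dim;               // bits [63:48]
    std::uint16_t w_dim;               // bits [79:64]
};

// Returns the descriptor as four 32-bit words, word 0 holding bits [31:0].
inline std::array<std::uint32_t, 4> pack_tile_descriptor(const TileDescriptorFields& d) {
    assert(d.in_data_format < 16 && d.blobs_per_xy_plane < 16);
    std::array<std::uint32_t, 4> words{};  // zero-initialised
    words[0] = d.in_data_format | (d.uncompressed ? (1u << 4) : 0u) |
               (d.blobs_per_xy_plane << 8) | (static_cast<std::uint32_t>(d.x_dim) << 16);
    words[1] = d.y_dim | (static_cast<std::uint32_t>(d.z_dim) << 16);
    words[2] = d.w_dim;  // upper half of word 2 and all of word 3 stay zero here
    return words;
}
```

For an uncompressed BF16 tile (format code 0b0101) with x_dim = y_dim = 16, z_dim = 4 and w_dim = 1, word 0 comes out as 0x00100015.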

Data-format 4-bit encoding (same field everywhere — TileDescriptor, REG2, REG7 contexted, ALU_FORMAT_SPEC, packer REG1):

	`0b??11`	`0b??10`	`0b??01`	`0b??00`
`0b00??`	BFP4a	BFP8a	FP16	FP32
`0b01??`	BFP4	BFP8	BF16	TF32
`0b10??`	BFP2a	FP8 (e5m2)	INT16	INT32
`0b11??`	BFP2	INT8	—	—

Plus a 5th bit "Unp_LF8_4b_exp" turns FP8 e5m2 into FP8 e4m3 — set when DataFormat::Fp8_e4m3; see THCON_SEC0_REG1_Unp_LF8_4b_exp_RMW programmed in configure_unpack_AB().
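
The encoding table and the extra e4m3 bit can be captured in a small lookup. This is a sketch that transcribes the table above; the function name and the "unassigned" label for the two empty cells are choices made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Maps the 4-bit format code (plus the Unp_LF8_4b_exp bit) to a name.
inline std::string data_format_name(std::uint32_t code4, bool lf8_4b_exp = false) {
    assert(code4 < 16);
    static const char* const kNames[16] = {
        "FP32",       "FP16",       "BFP8a",      "BFP4a",  // 0b00??
        "TF32",       "BF16",       "BFP8",       "BFP4",   // 0b01??
        "INT32",      "INT16",      "FP8 (e5m2)", "BFP2a",  // 0b10??
        "unassigned", "unassigned", "INT8",       "BFP2",   // 0b11??
    };
    // The extra bit only changes the meaning of the FP8 code.
    if (code4 == 0b1010 && lf8_4b_exp) return "FP8 (e4m3)";
    return kNames[code4];
}
```

Note that the table in the text lists columns from 0b??11 down to 0b??00, while the array here is in ascending code order.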

REG1 / REG2 — main per-unpacker config (`unpack_config_t` is REG1+REG2 in 128-bit form):

Field (BH name in cfg_defines.h)	What it does
`THCON_SEC[u]_REG2_Out_data_format`	OutDataFormat (4 bits) — what hardware presents to SrcA/SrcB/Dst (see §2.6 conversions).
`THCON_SEC[u]_REG2_Throttle_mode`	x1/x2/x4 L1 read speed (0/1/2 — see §2.5).
`THCON_SEC[u]_REG2_Context_count`	log₂ of how many UnpackContexts to round-robin (0..3 → 1/2/4/8 contexts).
`THCON_SEC[u]_REG2_Haloize_mode`	Enables X/Y transpose on the way into SrcA (Unpacker 0 only).
`THCON_SEC[u]_REG2_Tileize_mode`	"Discontiguous input rows" mode — rows are NOT contiguous in L1, hop by `Shift_amount_cntx`. Used by `llk_unpack_tilize.h`.
`THCON_SEC[u]_REG2_Unpack_Src_Reg_Set_Upd`	If 1, after each UNPACR advance SrcRow by `16 + SrcA/B_SET_Base*16`. Used for ping-pong'ing within a bank.
`THCON_SEC[u]_REG2_Unpack_If_Sel`	Unpacker 0 only: 0 = write to SrcA, 1 = write to Dst. (Per-context version: `REG2_Unpack_if_sel_cntx[]`.)
`THCON_SEC[u]_REG2_Upsample_rate`	0/1/2/3 → 0/1/2/4 inserted zeros per datum.
`THCON_SEC[u]_REG2_Upsample_and_interleave`	If 1, skip output positions instead of inserting zeros.
`THCON_SEC[u]_REG2_Ovrd_data_format`	If 1 (in MultiContext mode), use the per-context REG7 format instead of the descriptor.
`THCON_SEC[u]_REG2_Force_shared_exp`	For BFP formats: ignore the L1 exponent stream, use `UNP[u].FORCED_SHARED_EXP_shared_exp` for all datums. Used when you want to upcast INT8 via the BFP path.
`THCON_SEC[u]_REG2_Shift_amount_cntx[0..3]`	Tileize row stride (4 bits per ctx, concatenated to make a 12-bit byte stride) OR `ColShift` for SrcA writes.
`THCON_SEC[u]_REG2_Unpack_limit_address`	Upper L1 bound (in 16-byte units) for FIFO wrap.
`THCON_SEC[u]_REG2_Unpack_fifo_size`	FIFO size (in 16-byte units) — what gets subtracted on wrap.
`THCON_SEC[u]_REG2_Disable_zero_compress_cntx[0..7]`	Per-context: 1 = treat that context's data as uncompressed (overrides the TileDescriptor when in MultiContextMode).
`THCON_SEC[u]_REG3_Base_address` (32b)	Base L1 address of the input buffer, in 16-byte units.
`THCON_SEC[u]_REG3_Base_cntx[1..7].address`	Per-context base addresses (context 0 uses `REG3_Base_address`).
`THCON_SEC[u]_REG5_Tile_x_dim_cntx[0..3]`	Unpacker 0 multi-context X override (16 bits per ctx, packed two-per-32).
`THCON_SEC[u]_REG5_Dest_cntx[0..3].address`	Unpacker 0 per-context Dst starting row.
`THCON_SEC[u]_REG7_Offset_address`	16-bit offset added to Base_address each instruction (set by `unpacker_wrapup()` to zero on flush).
`THCON_SEC[u]_REG7_Offset_cntx[1..3].address`	Per-context offsets.
`THCON_SEC[u]_REG7_Unpack_data_format_cntx[0..7]`	Per-context InDataFormat override.
`THCON_SEC[u]_REG7_Unpack_out_data_format_cntx[0..7]`	Per-context OutDataFormat override.
`THCON_SEC[u]_REG1_Unp_LF8_4b_exp`	1 = treat 8-bit float as e4m3 instead of e5m2.
`UNP[u].ADDR_BASE_REG_1_Base`	Base "datum index" within SrcA/SrcB/Dst — combined with ADC Channel[1] Y/Z/W and the strides.
`UNP[u].ADDR_CTRL_XY_REG_1_Ystride`	Y stride in datums.
`UNP[u].ADDR_CTRL_ZW_REG_1_Zstride`, `Wstride`	Z, W strides.
`UNP[u].FORCED_SHARED_EXP_shared_exp`	The fixed exponent used when `Force_shared_exp = 1`.
`UNP[u].ADD_DEST_ADDR_CNTR_add_dest_addr_cntr`	If 1, add per-context Dst row to base; if 0, override.
`UNP0_BLOBS_Y_START_CNTX[0/2].blobs_y_start[8]`	Per-context BlobsYStart (overrides TileDescriptor in MC mode).

Thread-local unpacker config (ThreadConfig, written via `SETC16` since you can't `sw` to it):

Field	Purpose
`CFG_STATE_ID_StateID`	Selects which of the two Config banks (0/1) this thread reads — basic double-buffering of all backend config.
`UNPACK_MISC_CFG_CfgContextOffset_0`, `_1`	Per-thread per-unpacker base for context index (added to `CfgContextId`).
`UNPACK_MISC_CFG_CfgContextCntReset_0`, `_1`	Writing here resets that thread's context counter (immediate side-effect).
`SRCA_SET_Base`, `SRCB_SET_Base`	Source-row base (×16) used when `Unpack_Src_Reg_Set_Upd=1` for ping-pong write addressing.
`SRCA_SET_SetOvrdWithAddr`	If 1, write to a full 64-row range of SrcA using the address generator directly.
`ADDR_MOD_AB_SEC[0..3].SrcXIncr/SrcYIncr/...`	Address-modifier slots for UNPACR / MOVD2A etc. — used to bump SrcA/SrcB row counters.
`ALU_FORMAT_SPEC_REG0_SrcA`, `REG0_SrcB`, `REG2_Dstacc*`	The format the FPU sees (overrides for INT8 signed/unsigned and Dst accumulator format).
`ALU_ACC_CTRL_Fp32_enabled`, `SFPU_Fp32_enabled`, `INT8_math_enabled`, `Zero_Flag_disabled_src/dst`	FPU/SFPU global modes.

2.4 What "writing to SrcA / SrcB / Dst" looks like

Figure 2. The functional stages an UNPACR instruction goes through inside the unpacker pipeline. The bottom row shows the three possible destinations (SrcA, SrcB, or Dst directly), and the right column lists what the instruction does after the per-datum loop completes.

From the UNPACR functional model (relevant slice):

if (WhichUnpacker == 1) {            // SrcB
    while (SrcB[Bank].AllowedClient != SrcClient::Unpackers) wait;
    Row = (Row + CurrentUnpacker.SrcRow[CurrentThread]) & 0x3f;
    SrcB[Bank][Row][Col] = Datum;
} else if (!UnpackToDst) {           // SrcA
    while (SrcA[Bank].AllowedClient != SrcClient::Unpackers) wait;
    if (Row < 4 || Col < ColShift) continue;     // header rows are skipped
    Row -= 4;
    Col -= ColShift;                              // optional per-context column shift
    Row += CurrentUnpacker.SrcRow[CurrentThread];
    if (Transpose) std::swap(Row & 0xf, Col);    // Haloize_mode XY transpose
    SrcA[Bank][Row][Col] = Datum;
} else {                              // Dst
    Row -= 4;
    if (OutDataFormat in {FP32,TF32,INT32}) Dst32b[Row][Col] = Datum;
    else                                     Dst16b[Row][Col] = Datum;
}

Key things:

The SrcA/SrcB hand-off. Each Src bank has a state AllowedClient ∈ {Unpackers, MatrixUnit}. The unpacker stalls until the bank is "theirs". Once the bank is filled, software either sets SetDatValid=1 on the UNPACR that fills it (this is FlipSrc in the WH functional model) or issues a separate SETDVALID; either one flips the bank and hands it to the FPU. The FPU then computes from it and eventually releases the bank (via CLRDVALID from MVMUL-style instructions, controlled by CLR_DVALID_SrcA_Disable/SrcB_Disable). This is the natural double-buffer: while the FPU consumes bank 0, the unpacker fills bank 1.
The first 4 rows of SrcA are "reserved" (Row < 4 is skipped). This is to leave 4 rows of "header" — used by the FPU's broadcast (MVMUL with srcb_bcast).
Writing to Dst directly via Unpacker 0 bypasses SrcA entirely — used when you want to do SFPU-only work on freshly-loaded data, or when you want to skip the FPU (llk_unpack_A.h with BroadcastType::NONE and unpack_to_dest=1).
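
The hand-off described above behaves like a two-slot ownership protocol. The following is a deliberately simplified software model of it, written for illustration only: the struct SrcBanks and its methods are hypothetical, and the real hardware tracks more state than this.

```cpp
#include <cassert>

enum class SrcClient { Unpackers, MatrixUnit };

// Simplified model of one Src register file (SrcA or SrcB) with two banks.
struct SrcBanks {
    SrcClient allowed[2] = {SrcClient::Unpackers, SrcClient::Unpackers};
    int unpack_bank = 0;  // bank the unpacker writes next
    int math_bank = 0;    // bank the Matrix Unit reads next

    // True when an UNPACR could write without stalling.
    bool unpacker_can_write() const { return allowed[unpack_bank] == SrcClient::Unpackers; }
    // True when the Matrix Unit has a valid bank to consume.
    bool math_can_read() const { return allowed[math_bank] == SrcClient::MatrixUnit; }

    // SETDVALID (or SetDatValid=1): hand the filled bank over, move to the other one.
    void set_dvalid() {
        assert(unpacker_can_write());
        allowed[unpack_bank] = SrcClient::MatrixUnit;
        unpack_bank ^= 1;
    }

    // CLRDVALID: the Matrix Unit releases its bank and moves to the other one.
    void clr_dvalid() {
        assert(math_can_read());
        allowed[math_bank] = SrcClient::Unpackers;
        math_bank ^= 1;
    }
};
```

In this model the unpacker can run at most two banks ahead: after two set_dvalid() calls without a clr_dvalid(), unpacker_can_write() turns false, which corresponds to the stall on AllowedClient.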

2.5 Throttle / arbitration

There's one shared L1 read port. Each unpacker requests x1 (16 B/cycle), x2 (32 B/cycle), or x4 (64 B/cycle), chosen by THCON_SEC[u]_REG2_Throttle_mode (0/1/2). Some configurations force x1, including decompression, upsampling ≥ 3, and BFP2/BFP2a input. If both unpackers want bandwidth, an arbiter reduces the rate of one of them; the full conflict table is in UNPACR_Regular.md#performance.
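
The throttle setting translates directly into a best-case read time. The sketch below computes it under the assumption that nothing else limits the transfer (no arbitration loss, no forced x1); the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Throttle_mode 0/1/2 selects x1/x2/x4, i.e. 16/32/64 bytes per cycle.
inline std::uint32_t throttle_bytes_per_cycle(std::uint32_t throttle_mode) {
    assert(throttle_mode <= 2);
    return 16u << throttle_mode;
}

// Best-case cycles to stream `bytes` from L1, rounded up.
// ASSUMPTION: ignores arbitration between the two unpackers and any
// condition that forces the unpacker back to x1.
inline std::uint32_t best_case_read_cycles(std::uint32_t bytes, std::uint32_t throttle_mode) {
    const std::uint32_t per_cycle = throttle_bytes_per_cycle(throttle_mode);
    return (bytes + per_cycle - 1) / per_cycle;
}
```

A 16×16 face of 16-bit datums is 512 bytes, so this lower bound is 32 cycles at x1 and 8 cycles at x4.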

2.6 Unpacker format-conversion summary

L1 in	→ SrcA/B as	→ Dst as
FP32/TF32	TF32 / BF16 / FP16	FP32 / BF16 / FP16
BF16	TF32 / BF16	BF16
BFP8/4/2 (or INT8 via `Force_shared_exp`)	TF32 / BF16	BF16
BFP8a/4a/2a	FP16	FP16
FP16 / FP8	FP16	FP16
INT32 (sign-magnitude)	—	Integer "32"
INT16	Integer "16"	Integer "16"
INT8 / UINT8	Integer "8" (via BFP8 path, needs `SrcAUnsigned`/`SrcBUnsigned`)	Integer "8"

(From tt-isa-documentation/WormholeB0/.../Unpackers/FormatConversion.md; BH is identical apart from the e4m3 helper.)

3. The Packer

There are four packers. They share one input port to Dst and one output port to L1, but each has its own pipeline and its own copy of REG1 in Config (the four packer configs live in the THCON_SEC0 / THCON_SEC1 REG1 and REG8 fields, with PackerIndex selecting which one). Each PACR instruction kicks off one to four packers at once.

3.1 The PACR instruction

Source: tt-llk/tt_llk_blackhole/common/inc/ckernel_ops.h, opcode 0x41. BH encoding (note that it grew relative to WH):

TT_OP_PACR(
    CfgContext,       // 3 bits : which packer config context to use
    RowPadZero,       // 1 bit  : pad short rows with zero
    DstAccessMode,    // 1 bit  : Dst input vs L1 input (Source_interface_selection)
    AddrMode,         // 2 bits : ADDR_MOD_PACK_SEC slot 0..3
    AddrCntContext,   // 2 bits : which ADC to use when OvrdThreadId=1
    ZeroWrite,        // 1 bit  : pull from /dev/null (write zeros)
    ReadIntfSel,      // 1 bit  : alternate Dst read interface
    OvrdThreadId,     // 1 bit  : use AddrCntContext instead of CurrentThread
    Concat,           // 1 bit  : continue current compression row
    CtxtCtrl,         // 1 bit  : auto-context-counter behavior
    Flush,            // 1 bit  : flush pre-L1 output buffers
    Last)             // 1 bit  : flush + start next packer at fresh address

On Blackhole, PACR carries no PackerMask field of the kind shown in the simplified WH model. Instead, the per-context pack_start_intf_pos (4 bits in pack_config_t) and the per-packer Enable_out_fifo together select which packers run. Typical kernels enable all four for one full tile (matching the 4 faces) or just one when packing untilized.

There's also TT_OP_PACR_SETREG (sequenced MMIO-register write that retires after late format conversion of all in-flight packers — used to atomically tick a CB-write-pointer right when the data hits L1) and the helper instructions:

TT_OP_SETPKEDGOF(y_end, y_start, x_end, x_start) — set edge-mask offsets used by edge-masking
TT_OP_CLREXPHIST — clear all four packers' exponent histogram
TT_OP_SETASHRMH, TT_OP_SETASHRMH0, TT_OP_SETASHRMH1, TT_OP_SETASHRMV — set "all-shared row mask" (used for halo/shift patterns)

3.2 The packer pipeline (in order)

Figure 3. The 10-stage packer pipeline from Dst to L1. The two highlighted stages (early & late format conversion) are where the most non-trivial bit-level conversion happens; everything else is row-by-row processing. The left column lists the most-relevant per-packer and global config registers feeding the pipeline; the right column lists post-pipeline buffering and output-address rules. Diagram condensed from Packers/README.md and per-stage docs.

3.3 Packer config registers (per-packer "PCK[i]" + per-thread + global)

The four packers reuse THCON_SEC0_REG1_* (packer 0+1) and THCON_SEC1_REG1_* (packer 2+3), plus THCON_SEC0_REG8_* / THCON_SEC1_REG8_* for the other half — see cfg_defines.h for the exact mapping. The pack_config_t in cpack_common.h describes the per-packer 96-bit programmable block (REG1 in WH-language).

REG1 — per-packer "pack_config_t":

Field	Meaning
`row_ptr_section_size`	Bytes reserved in L1 for the RSI (row-start-index) section when compressing.
`exp_section_size`	Bytes reserved for the per-16-datum exponent section (BFP outputs / FP8 / INT8).
`l1_dest_addr`	Base L1 address (16-byte units) for this packer's output.
`uncompress`	1 = no RLE-zero compression.
`add_l1_dest_addr_offset`	If 1, add `l1_dest_addr_offset` (a 16b TDMA-RISC writable register) to base.
`disable_pack_zero_flag`	If 1, never produce the per-datum zero flag (forces uncompressed).
`out_data_format` (4b)	"LateToFormat" — what hits L1. Format encoding from §2.3.
`in_data_format` (4b)	"LateFromFormat" — the intermediate format coming into late conversion.
`dis_shared_exp_assembler`	Disable the BFP shared-exponent assembler.
`auto_set_last_pacr_intf_sel`	Auto-flip interface-select on the last PACR of a packing burst.
`enable_out_fifo`	Enable this packer's L1 output FIFO (must be 1 to write at all).
`sub_l1_tile_header_size`	1 = the addr math subtracts the 16-byte tile-header bump that `L1_Dest_addr` otherwise has built in.
`src_if_sel` (also `Source_interface_selection`)	Packer 0 only: 0 = source from Dst, 1 = source from L1 (DMA mode used by Mover-style copies).
`pack_start_intf_pos` (4b)	Which interfaces are "live" — bitmask selecting packers 0..3 per face.
`all_pack_disable_zero_compress_ovrd`	If 1, the global `All_pack_disable_zero_compress[i]` mask overrides per-packer `uncompress`.
`add_tile_header_size`	Add the 16-byte tile header to outputs.
`pack_dis_y_pos_start_offset`	Disable applying Y-position start offset (used by un-tilize).
`l1_src_addr` (8b)	High bits of L1 source address when `src_if_sel=1`.

Other per-packer ("PCK0_" / "PCK[i]_") regs (in cfg_defines.h):

Field	Meaning
`PCK0_ADDR_BASE_REG_0_Base` / `_REG_1_Base`	Base address for the input (Channel[0], Dst-side) and output (Channel[1], L1-side) sides.
`PCK0_ADDR_CTRL_XY_REG_0_Xstride` / `Ystride`	Channel[0] X/Y strides (Dst-side, in bytes for X).
`PCK0_ADDR_CTRL_ZW_REG_0_Zstride` / `Wstride`	Channel[0] Z/W strides.
`PCK0_ADDR_CTRL_XY_REG_1_Ystride`	Channel[1] Y stride (L1-side; X stride on L1 side isn't used — output is always linear within a row).
`PCK0_ADDR_CTRL_ZW_REG_1_Zstride` / `Wstride`	Channel[1] Z/W strides.
`PACK_COUNTERS_SEC[i].pack_per_xy_plane`	Iteration count per face.
`PACK_COUNTERS_SEC[i].pack_reads_per_xy_plane`	How many "reads" the TilePositionGenerator does before advancing Y or Z.
`PACK_COUNTERS_SEC[i].pack_xys_per_til`	XY-plane count per tile.
`PACK_COUNTERS_SEC[i].pack_yz_transposed`	If 1, advance Z before Y (transposed face order).
`PACK_COUNTERS_SEC[i].pack_per_xy_plane_offset`	Per-packer Y offset.
`DEST_TARGET_REG_CFG_PACK_SEC[i].Offset`	Per-packer Dst-row offset (in 16-row units) — where in Dst this packer starts reading.
`DEST_TARGET_REG_CFG_PACK_SEC[i].ZOffset`	Per-packer Z-offset into Dst.
`THCON_SEC[01]_REG9.{Pack_limit_address, Pack_fifo_size}`	Circular L1 wrap.

Global / shared packer config:

Field	Meaning
`PCK_DEST_RD_CTRL_Read_32b_data`	1 = read 32-bit from Dst (Dst32b view), 0 = 16-bit (Dst16b view).
`PCK_DEST_RD_CTRL_Round_10b_mant`	If 1, round mantissa to 10 bits during early conversion (TF32 path).
`PCK_DEST_RD_CTRL_Read_int8`	"Read raw" — skip the early conversion rounding (truncate / bitcast).
`PCK_DEST_RD_CTRL_Read_unsigned`	For INT8/UINT8 path — say whether the byte is signed or unsigned.
`ALU_FORMAT_SPEC_REG_Dstacc_override`, `ALU_FORMAT_SPEC_REG_Dstacc_val`, `ALU_FORMAT_SPEC_REG2_Dstacc`	Specify the IntermediateFormat for the early conversion.
`ALU_ROUNDING_MODE_Packer_srnd_en`	Stochastic rounding for the packer (BFP rounds).
`ALU_ROUNDING_MODE_Gasket_srnd_en`	Stochastic rounding for the late conversion gasket.
`INT_DESCALE_Enable`, `INT_DESCALE_Mode`, `INT_DESCALE_VALUES_SEC[].Value`	Shift-amount applied during INT32→INT8 conversion (for quantization).
`STACC_RELU_ApplyRelu` (4b)	NO_RELU / ZERO_RELU / MIN_THRESHOLD_RELU / MAX_THRESHOLD_RELU (2b per packer).
`STACC_RELU_ReluThreshold`	16-bit BF16 or FP16 threshold value.
`PCK_EDGE_OFFSET_SEC[0..3].mask`	16-bit column masks for edge masking.
`PCK_EDGE_OFFSET_SEC0.{mode, tile_row_set_select_pack[0..3]}`	Which of the four masks each packer uses, and what `mode` means (per-row vs per-face).
`TILE_ROW_SET_MAPPING[0..3].row_set_mapping[16]`	2 bits per row of a face → which mask slot to apply.
`PCK_EDGE_TILE_FACE_SET_SELECT_`, `PCK_EDGE_TILE_ROW_SET_SELECT_`	Higher-level mask selectors.
`PCK_EDGE_MODE_mode`	0 = replace masked datums with 0, 1 = replace with −∞ (for softmax-style max-reduce).
`ENABLE_ACC_STATS_Enable` (per-thread)	Enable exponent histogram.
`THCON_SEC[01]_REG1_Downsample_mask`, `Downsample_rate`	16-bit downsample mask + rate.
`THCON_SEC[01]_REG1_Exp_threshold_en`, `Exp_threshold`	Exponent-threshold clamp-to-zero.
`THCON_SEC[01]_REG1_pack_dis_y_pos_start_offset`, `pack_start_intf_pos`	Packer interface position controls (untilize mode).
`THCON_SEC0_REG1_All_pack_disable_zero_compress` (4b mask)	Per-packer compression-disable when override is on.
`ADDR_MOD_PACK_SEC[0..3].{Y/Z}{src/dst}{Clear,CR,Incr}`	Address-modifier slots used by `PACR.AddrMod`.
`DEST_ACCESS_CFG_{remap_addrs, swizzle_32b, disable_full_write_dest_q_bypass, zeroacc_absolute_tile_mode}`	Affect both packer Dst-address swizzle and RISCV Dst access.

3.4 PACR input address generator (Channel[0])

Addr = PCK0_ADDR_BASE_REG_0_Base
     + ADC[w].Ch0.X * (Xstride & 0xf)
     + ADC[w].Ch0.Y * Ystride
     + ADC[w].Ch0.Z * Zstride
     + ADC[w].Ch0.W * Wstride

InputNumDatums = Channel[1].X - Channel[0].X + 1     (one row of one face)

If Source_interface_selection=1 on packer 0 (i == 0), the packer fetches from L1 (used for some L1-to-L1 reshape ops); otherwise packer i fetches from Dst with InputSourceAddr = (Addr/BytesPerDatum) + (DEST_TARGET_REG_CFG_PACK_SEC[i].Offset << 4), interpreted as Row = InputSourceAddr >> 4, Col = InputSourceAddr & 0xf into Dst. BytesPerDatum is decided by In_data_format: 4 for FP32/TF32/INT32, 2 for FP16/BF16/INT16, else 1.
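
The Channel[0] address computation and the Dst row/column split can be sketched as below. The struct and function names are hypothetical, the format codes are taken from the encoding table in §2.3, and applying the shift to the Offset term before the addition is an assumption about the intended precedence.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical containers for the Channel[0] config and ADC counters.
struct PackCh0Cfg { std::uint32_t base, x_stride, y_stride, z_stride, w_stride; };
struct AdcCounters { std::uint32_t x, y, z, w; };
struct DstPos { std::uint32_t row, col; };

// Addr = Base + X*(Xstride & 0xf) + Y*Ystride + Z*Zstride + W*Wstride
inline std::uint32_t pack_input_addr(const PackCh0Cfg& c, const AdcCounters& a) {
    return c.base + a.x * (c.x_stride & 0xf) + a.y * c.y_stride + a.z * c.z_stride +
           a.w * c.w_stride;
}

// 4 bytes for FP32/TF32/INT32, 2 for FP16/BF16/INT16, else 1.
inline std::uint32_t bytes_per_datum(std::uint32_t in_data_format) {
    assert(in_data_format < 16);
    switch (in_data_format) {
        case 0: case 4: case 8: return 4;  // FP32, TF32, INT32
        case 1: case 5: case 9: return 2;  // FP16, BF16, INT16
        default: return 1;
    }
}

// ASSUMPTION: the per-packer Offset is shifted left by 4 before being added.
inline DstPos dst_pos(std::uint32_t addr, std::uint32_t datum_bytes, std::uint32_t offset) {
    assert(datum_bytes == 1 || datum_bytes == 2 || datum_bytes == 4);
    const std::uint32_t src = addr / datum_bytes + (offset << 4);
    return DstPos{src >> 4, src & 0xf};
}
```

For example, a byte address of 32 with 2-byte datums and Offset = 2 resolves to datum 48, i.e. row 3, column 0 of Dst.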

3.5 PACR output address generator (Channel[1])

YZW_Addr = PCK0_ADDR_BASE_REG_1_Base
         + ADC[w].Ch1.Y*Ystride + Z*Zstride + W*Wstride
Addr = PackerIConfig.L1_Dest_addr + !Sub_l1_tile_header_size + (YZW_Addr & ~0xf)
       + (Add_l1_dest_addr_offset ? PackerI.l1_dest_addr_offset : 0)

if (Addr > Pack_limit_address*2 + 1) Addr -= Pack_fifo_size*2   // circular CB wrap

if (compressing) reserve Row_start_section_size bytes for RSI
if (out_data_format < 16b)  reserve Exp_section_size  bytes for exponents
DataStream.ByteAddress = Addr     (only when NeedsNewAddress)

The Y/Z/W contribution is masked & ~0xf so it only adjusts at 256-byte granularity — fine-grained packing within a 16-byte L1 line is done by the buffered output below.

Output buffering and 16-byte alignment. Output always reaches L1 as aligned 16-byte writes. Datums are accumulated into per-stream buffers (data / exp / RSI) and flushed to L1 only when a buffer is full. The Flush and Last bits on PACR force a flush; a partially full buffer is zero-padded up to 16 B first. Consequently, many small PACRs share the same L1 line, whereas a PACR with Flush=1 can waste up to 15 B at the end of each stream.
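
The output-address rule and the flush padding can be sketched together. The code follows the pseudocode of this section term by term; the struct PackOutCfg and the function names are hypothetical, and the section reservations for RSI and exponents are left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the per-packer output settings used below.
struct PackOutCfg {
    std::uint32_t l1_dest_addr;
    bool sub_l1_tile_header_size;
    bool add_l1_dest_addr_offset;
    std::uint32_t l1_dest_addr_offset;
    std::uint32_t pack_limit_address;
    std::uint32_t pack_fifo_size;
};

// Mirrors: Addr = L1_Dest_addr + !Sub_l1_tile_header_size + (YZW_Addr & ~0xf)
//                + (Add_l1_dest_addr_offset ? l1_dest_addr_offset : 0),
// followed by the circular wrap against Pack_limit_address / Pack_fifo_size.
inline std::uint32_t pack_output_addr(const PackOutCfg& c, std::uint32_t yzw_addr) {
    std::uint32_t addr = c.l1_dest_addr + (c.sub_l1_tile_header_size ? 0u : 1u) +
                         (yzw_addr & ~0xfu);
    if (c.add_l1_dest_addr_offset) addr += c.l1_dest_addr_offset;
    if (addr > c.pack_limit_address * 2 + 1) addr -= c.pack_fifo_size * 2;
    return addr;
}

// Zero bytes appended when a partially filled 16-byte buffer is flushed.
inline std::uint32_t flush_padding_bytes(std::uint32_t buffered_bytes) {
    return (16 - buffered_bytes % 16) % 16;
}
```

A flush with 20 bytes buffered therefore writes 12 bytes of zero padding, and a flush with exactly 32 bytes buffered writes none.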

3.6 Packer-relevant TT instructions

Op	Purpose
`TT_OP_PACR`	The main pack instruction.
`TT_OP_PACR_SETREG`	Atomic MMIO register write sequenced after late conversion (used to bump CB write pointers, set semaphore values, kick streams).
`TT_OP_SETPKEDGOF`	Edge-mask offsets (x_start, x_end, y_start, y_end → PCK_EDGE_OFFSET masks).
`TT_OP_CLREXPHIST`	Clear all four packers' exponent histograms.
`TT_OP_SETASHRMH(0/1)`, `TT_OP_SETASHRMV`	Set ALL-SHARED row-mask H/V (halo column/row masks).
`TT_OP_SETDMAREG` mode 6/7	Read the exponent histogram bins into a Tensix GPR.
`TT_OP_SETDMAREG` mode 9	Read packer 0's max-exponent.
`TT_OP_STALLWAIT(stall_res, ...)`	Wait for `THCON` / `PACK0..3` / `UNPACK0..1` resource idle (this is how `Last` becomes useful — you flush, then stall until L1 actually has the data).
`TT_OP_RESOURCEDECL`	Auto-TTSync tracking declaration.

4. Step-by-step: data going CB → SrcA/Dst → Dst → CB on Blackhole

Here is what the LLK (llk_unpack_AB.h + llk_math_eltwise_* + llk_pack.h) actually does for a simple "load tile A, load tile B, A+B, store tile" sequence:

4.1 Setup (called once per format change)

configure_unpack_AB(...) in cunpack_common.h writes (excerpt of actual register writes):

// Tell the FPU what format SrcA and SrcB hold:
cfg_reg_rmw_tensix<ALU_FORMAT_SPEC_REG_SrcA_val_ADDR32, ...>(unp_src_format_A);
cfg_reg_rmw_tensix<ALU_FORMAT_SPEC_REG_SrcB_val_ADDR32, ...>(unp_src_format_B);

// Configure each unpacker's REG0_TileDescriptor (4-dword blob):
cfg[THCON_SEC0_REG0_TileDescriptor_ADDR32 + 0..3] = {format/xdim/ydim/zdim/wdim/...};
cfg[THCON_SEC1_REG0_TileDescriptor_ADDR32 + 0..3] = ...;

// Configure REG2 (out format, throttle, context_count, tileize, etc.):
cfg[THCON_SEC0_REG2_Out_data_format_ADDR32 word] = ...;
cfg[THCON_SEC1_REG2_Out_data_format_ADDR32 word] = ...;

// e4m3 helper:
cfg_reg_rmw_tensix<THCON_SEC0_REG1_Unp_LF8_4b_exp_RMW>(unpA_is_e4m3);
cfg_reg_rmw_tensix<THCON_SEC1_REG1_Unp_LF8_4b_exp_RMW>(unpB_is_e4m3);

// REG3 base addresses (= base of the CB in L1, /16 since it's 16B units):
cfg[THCON_SEC0_REG3_Base_address_ADDR32] = cb_a_base >> 4;
cfg[THCON_SEC1_REG3_Base_address_ADDR32] = cb_b_base >> 4;

// REG5 per-context Dst rows / face dims (multi-context unpacker 0 only):
cfg[THCON_SEC0_REG5_Tile_x_dim_cntx0_ADDR32] = face_dim | (face_dim << 16);
cfg[THCON_SEC0_REG5_Dest_cntx0_address_ADDR32] = Dest_cntx0 | (Dest_cntx1 << 16);

// Reset ADC counters for the two unpackers + the packer:
TTI_SETADCXY(p_setadc::UNP_A | p_setadc::UNP_B, 0,0,0,0, 0b1011);
TTI_SETADCZW(p_setadc::UNP_A | p_setadc::UNP_B, 0,0,0,0, 0b1111);
TTI_SETADCXY(0b100, 0,0,0,0, 0b1011);   // packer side
TTI_SETADCZW(0b100, 0,0,0,0, 0b1111);

// "X-end" per face, written into ADC[*].Unpacker[*].Channel[1].X:
TT_SETADCXX(p_setadc::UNP_A, unpA_x_end, 0);
TT_SETADCXX(p_setadc::UNP_B, (unpB_face_r_dim << 4) - 1, 0);

// SrcA row base for ping-pong:
TTI_SETC16(SRCA_SET_Base_ADDR32, 0x4);

And the packer side from cpack_common.h (set_packer_strides):

TT_SETDMAREG(0, LOWER_HALFWORD(y_stride << PCK0_ADDR_CTRL_XY_REG_0_Ystride_SHAMT), 0, LO_16(TMP0));
TT_SETDMAREG(0, UPPER_HALFWORD(y_stride << PCK0_ADDR_CTRL_XY_REG_0_Ystride_SHAMT), 0, HI_16(TMP0));
TT_SETDMAREG(0, LOWER_HALFWORD(z_stride << PCK0_ADDR_CTRL_ZW_REG_0_Zstride_SHAMT), 0, LO_16(TMP1));
TT_SETDMAREG(0, UPPER_HALFWORD(w_stride << PCK0_ADDR_CTRL_ZW_REG_0_Wstride_SHAMT), 0, HI_16(TMP1));
TTI_STALLWAIT(p_stall::STALL_CFG, p_stall::THCON);
TTI_WRCFG(TMP0, p_cfg::WRCFG_32b, PCK0_ADDR_CTRL_XY_REG_0_Xstride_ADDR32);
TTI_WRCFG(TMP1, p_cfg::WRCFG_32b, PCK0_ADDR_CTRL_ZW_REG_0_Zstride_ADDR32);

4.2 Per-tile UNPACR (T0)

The kernel waits on the producer CB semaphore (cb_wait_front) — done in RISCV, not Tensix.
T0 issues one UNPACR per face. Each UNPACR advances ADC Channel[0].Y → the next face row; on the 4th call the AddrMod flips back and bumps Z.
The unpacker reads REG3_Base_address, follows REG7_Offset_address, then iterates 16×16=256 datums, doing format conversion + writing to SrcA[bank][row][col] (and stalls on AllowedClient).
The last UNPACR of a tile sets SetDatValid=1 (or it's followed by UNPACR_NOP_SETDVALID) — that flips AllowedClient = MatrixUnit so the FPU can read this bank.
The unpacker can immediately start filling the other bank while FPU consumes the first one.
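
The per-face walk above can be modeled on the host as four steps of 256 datums each. This sketch assumes a 32×32 tile stored face after face in L1 with no tile header, no shared-exponent section and no compression, which is a simplification. The names face_byte_offset and unpack_tile_model are hypothetical and do not exist in the LLK.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kFaceRows = 16;
constexpr std::uint32_t kFaceCols = 16;
constexpr std::uint32_t kFacesPerTile = 4;
constexpr std::uint32_t kDatumsPerFace = kFaceRows * kFaceCols;  // 256

// Byte offset of a face inside a tile that is stored face after face.
constexpr std::uint32_t face_byte_offset(std::uint32_t face,
                                         std::uint32_t bytes_per_datum) {
    return face * kDatumsPerFace * bytes_per_datum;
}

// Walk one tile as T0 would: one modeled "UNPACR" per face. Records the L1
// address of each face in 16-byte units and returns the datums consumed.
inline std::uint32_t unpack_tile_model(std::uint32_t cb_base_bytes,
                                       std::uint32_t bytes_per_datum,
                                       std::uint32_t face_addr_16B[kFacesPerTile]) {
    assert((cb_base_bytes & 0xFu) == 0);  // assumption: 16-byte aligned base
    std::uint32_t datums = 0;
    for (std::uint32_t face = 0; face < kFacesPerTile; ++face) {
        face_addr_16B[face] =
            (cb_base_bytes + face_byte_offset(face, bytes_per_datum)) >> 4;
        datums += kDatumsPerFace;
    }
    return datums;
}
```

For a BF16 tile (2 bytes per datum) at byte address 0x1000, the four faces start at 0x100, 0x120, 0x140 and 0x160 in 16-byte units, and the walk consumes 1024 datums in total.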

4.3 Math (T1)

ZEROACC (a separate instruction from ZEROSRC, which clears SrcA/SrcB rather than Dst) zeroes a 16×16 block of Dst.
MVMUL / ELWADD / etc. read SrcA[bank0] and SrcB[bank0] and produce Dst[Row][Col]. Each writes one row at a time; the 4-cycle Dst scoreboard prevents the packer from reading until the write retires.
When done, the FPU implicitly does CLRDVALID (unless CLR_DVALID_SrcA_Disable/SrcB_Disable is set), returning the bank to AllowedClient = Unpackers.
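
The SETDVALID / CLRDVALID hand-off between the unpacker and the FPU behaves like a two-slot ping-pong buffer. Here is a minimal state-machine sketch of one Src register's two banks. It is a model of the ownership rule only, not of hardware timing, and the SrcBanks type and its method names are hypothetical.

```cpp
#include <cassert>

enum class Client { Unpackers, MatrixUnit };

// Ownership state for the two banks of one Src register (SrcA or SrcB).
struct SrcBanks {
    Client allowed[2] = {Client::Unpackers, Client::Unpackers};
    int unpack_bank = 0;  // bank the unpacker fills next
    int math_bank = 0;    // bank the FPU reads next

    // The unpacker may write only while it owns its current bank.
    bool unpacker_can_write() const {
        return allowed[unpack_bank] == Client::Unpackers;
    }
    // The FPU may read only after the bank has been handed over.
    bool fpu_can_read() const {
        return allowed[math_bank] == Client::MatrixUnit;
    }

    // Last UNPACR of a tile (SetDatValid=1): give the bank to the FPU and
    // move the unpacker on to the other bank.
    void set_dvalid() {
        assert(unpacker_can_write());
        allowed[unpack_bank] = Client::MatrixUnit;
        unpack_bank ^= 1;
    }

    // Implicit CLRDVALID at the end of the math op: return the bank to the
    // unpackers and move the FPU on to the other bank.
    void clr_dvalid() {
        assert(fpu_can_read());
        allowed[math_bank] = Client::Unpackers;
        math_bank ^= 1;
    }
};
```

After two set_dvalid() calls with no clr_dvalid() in between, unpacker_can_write() returns false: both banks belong to the FPU and the unpacker stalls until the FPU releases one.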

4.4 PACR (T2)

T2 (sometimes T0/T1 in fused kernels) waits on free space in the output CB.
It issues PACR with PackerMask = 0xF to run all four packers in parallel — each packer is configured to point at a different quarter of Dst via DEST_TARGET_REG_CFG_PACK_SEC[i].Offset, and each lands in a different quarter of the output tile via PCK0_ADDR_*_REG_1 strides.
After all faces are packed, a final PACR with Last=1 (or Flush=1) flushes the 16-byte buffers to L1.
PACR_SETREG (or a follow-up MMIO write) bumps the CB write pointer so the consumer can pick up the tile.
STALLWAIT(p_stall::PACK0|PACK1|PACK2|PACK3, ...) is used if the kernel needs to be sure the L1 store has happened before proceeding (e.g. before triggering a NoC send).
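
The "one packer per quarter" arrangement can be written down as a small mapping from packer index to a Dst row and an output offset. This sketch assumes a tile that occupies four consecutive 16-row faces in Dst and an output tile laid out face after face with no header. The real offsets come from the DEST_TARGET_REG_CFG_PACK_SEC[i] and PCK0_ADDR_* registers. PackerPlan, plan_for and packer_selected are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kNumPackers = 4;
constexpr std::uint32_t kFaceRows = 16;
constexpr std::uint32_t kFaceCols = 16;

struct PackerPlan {
    std::uint32_t dst_row;      // first Dst row this packer reads
    std::uint32_t l1_byte_off;  // byte offset inside the output tile
};

// Packer i takes face i: 16 Dst rows in, one face-sized chunk of L1 out.
inline PackerPlan plan_for(std::uint32_t packer, std::uint32_t tile_dst_row,
                           std::uint32_t bytes_per_datum) {
    assert(packer < kNumPackers);
    PackerPlan p;
    p.dst_row = tile_dst_row + packer * kFaceRows;
    p.l1_byte_off = packer * kFaceRows * kFaceCols * bytes_per_datum;
    return p;
}

// PackerMask is treated here as one bit per packer; 0xF selects all four.
constexpr bool packer_selected(std::uint32_t mask, std::uint32_t packer) {
    return ((mask >> packer) & 1u) != 0;
}
```

With a 2-byte output format, packer 3 in this model reads Dst rows 48..63 of the tile and writes at byte offset 1536 of the output tile.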

5. Complete list of unpacker-related TT* instructions on Blackhole

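From tt-llk/tt_llk_blackhole/common/inc/ckernel_ops.h. (TT_OP_X builds the 32-bit instruction word; TTI_X emits that word inline and therefore needs compile-time-constant operands; TT_X writes the word to the instruction buffer at run time, so its operands can be run-time values.)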

Instruction	Opcode	Purpose
`TT_OP_UNPACR(...)`	`0x42`	The unpack instruction. 13 fields described in §2.1.
`TT_OP_UNPACR_NOP(...)`	`0x43`	Multi-mode unpacker-sequenced no-op: `ZEROSRC` / `SETDVALID` / `SETREG` / `MsgClr` / pure-NOP / `Unpack_Pop`. Used to flip Src banks, zero a bank, push register writes through the unpacker queue.
`TT_OP_ZEROSRC(zero_val, write_mode, bank_mask, src_mask)`	`0x11`	Zero a SrcA/SrcB bank directly (bypasses unpacker pipeline; sometimes faster than going through `UNPACR_NOP`).
`TT_OP_SETDVALID(setvalid)`	`0x57`	Set "AllowedClient = MatrixUnit" on a Src bank (i.e. mark unpacked data ready for FPU). Bit 0 = SrcA, bit 1 = SrcB.
`TT_OP_TRNSPSRCA`	`0x14`	Transpose SrcA in place.
`TT_OP_TRNSPSRCB`	`0x16`	Transpose SrcB in place (BH-specific; not on Wormhole).
`TT_OP_SETADC(CntSetMask, ChannelIndex, DimensionIndex, Value)`	n/a	Set one of {X,Y,Z,W} on Channel[0] or Channel[1] of one ADC bank.
`TT_OP_SETADCXY(CntSetMask, Ch1_Y, Ch1_X, Ch0_Y, Ch0_X, BitMask)`	n/a	Set X/Y on both channels of selected ADC banks at once. `BitMask` selects which of {Ch0X,Ch0Y,Ch1X,Ch1Y} are written.
`TT_OP_SETADCZW(CntSetMask, Ch1_W, Ch1_Z, Ch0_W, Ch0_Z, BitMask)`	n/a	Same for Z/W.
`TT_OP_SETADCXX(CntSetMask, x_end2, x_start)`	`0x5e`	Set Ch0.X = `x_start`, Ch1.X = `x_end2` in one shot (the typical "set face width minus 1").
`TT_OP_INCADCXY(CntSetMask, Ch1_Y, Ch1_X, Ch0_Y, Ch0_X)`	n/a	Increment X/Y by the given deltas (signed).
`TT_OP_INCADCZW(CntSetMask, Ch1_Y, Ch1_X, Ch0_Y, Ch0_X)`	n/a	Increment Z/W.
`TT_OP_ADDRCRXY(CntSetMask, Ch1_Y, Ch1_X, Ch0_Y, Ch0_X, BitMask)`	n/a	Copy `Y_Cr → Y` / `X_Cr → X` (used to start a fresh face row).
`TT_OP_ADDRCRZW(CntSetMask, Ch1_Y, Ch1_X, Ch0_Y, Ch0_X, BitMask)`	n/a	Same for Z/W.
`TT_OP_REG2FLOP(SizeSel, TargetSel, ByteOffset, ContextId_2, FlopIndex, RegIndex)`	n/a	Load an ADC/RWC value (or per-context unpacker offset) from L1 via a register copy. Used to set up multi-context unpack from a descriptor in L1.
`TT_OP_SETC16(reg, val)`	`0xb2`	Write a 16-bit immediate to `ThreadConfig[CurrentThread]` — the only way to set `UNPACK_MISC_CFG_CfgContextOffset_`, `UNPACK_MISC_CFG_CfgContextCntReset_`, `SRCA_SET_Base`, `SRCB_SET_Base`, `CFG_STATE_ID_StateID` and the rest of the per-thread unpacker config.
`TT_OP_WRCFG(GprAddress, wr128b, CfgReg)`	`0xb0`	Write 32b (or 128b if `wr128b=1`) from a Tensix GPR to `Config[StateID]`. Used to set REG0 TileDescriptor (4 dwords as one 128b write), and all the THCON_SEC[u].REG[1..7] words.
`TT_OP_RDCFG(GprAddress, CfgReg)`	`0xb1`	Read a `Config` word into a Tensix GPR — used by introspection / `read_unpack_tile_descriptor()`.
`TT_OP_RMWCIB0/1/2/3(Mask, Data, CfgRegAddr)`	`0xb3..0xb6`	Read-modify-write a byte of a `Config` word — used for `cfg_reg_rmw_tensix<...>` to flip a single bit/field without disturbing neighbours (e.g. `Unp_LF8_4b_exp`, `Ovrd_data_format`). The 4 variants target the 4 bytes of the word.
`TT_OP_STREAMWRCFG(stream_id_sel, ...)`	n/a	Write a `Config` word from a NoC overlay stream — used when descriptor data is being streamed in.
`TT_OP_STALLWAIT(stall_res, wait_res)`	n/a	Block this thread until specified resources (e.g. `UNPACK0`, `UNPACK1`, `THCON`, `XSEARCH`) are idle. The single most common synchronization primitive between config writes and unpack/pack work.
`TT_OP_RESOURCEDECL(linger_time, ...)`	n/a	Declare resources being used by upcoming instructions, for the BH Auto-TTSync hardware.
`TT_OP_REPLAY(start_idx, len, ...)`	n/a	Execute a saved sequence of Tensix instructions (the LLK uses this to compress repetitive UNPACR sequences).
`TT_OP_XMOV(Mov_block_selection, ...)`	n/a	Mover instruction — sometimes used by unpack-untilize when the data needs an L1↔L1 copy before unpack.
`TT_OP_MOVA2D(...)`, `TT_OP_MOVB2D(...)`	n/a	Copy SrcA/SrcB → Dst directly via the FPU's data path — alternative to "UNPACR into SrcA, then FPU copies to Dst". Listed here because they are sometimes used inside `llk_unpack_*` paths when staging into Dst is needed.
`TT_OP_MOVD2A(...)`, `TT_OP_MOVD2B(...)`	n/a	Reverse direction: Dst → SrcA/SrcB. Used by `llk_math_transpose_dest.h` and matmul to feed the next round of FPU ops from Dst.
`TT_OP_MOVDBGA2D(...)`, `TT_OP_MOVDBGB2D(...)`	n/a	Debug-path variants of MOVA2D/MOVB2D used by the debug tooling.
`TT_OP_MOVB2A(...)`	`0x0b`	Copy SrcB → SrcA.
`TT_OP_SETPKEDGOF(y_end, y_start, x_end, x_start)`	`0x1d`	Edge-mask offsets (technically a packer control instruction, but uses the same ADC infrastructure and is often issued by the unpack thread when tile-padding state is being set up).
`TT_OP_SETASHRMH`, `SETASHRMH0`, `SETASHRMH1`, `SETASHRMV(reg_mask2)`	`0x1e`, `0x1a`, `0x1b`, `0x1c`	"All-shared halo mask" — set the row/column halo bits used by the unpacker's halo-padding mode (also feeds into packer edge masks).
`TT_OP_CLREXPHIST`	`0x21`	Clears packer exponent histograms; listed here because it's commonly placed in the unpack/setup phase to start fresh telemetry.
`TT_OP_SETRWC(...)`, `TT_OP_SETIBRWC(...)`	n/a	Set the RWC (FPU-side row/word counters) — not strictly an unpacker op, but unpack sequences typically reset them between tiles since the FPU is the immediate consumer of unpacker output.

(Rows marked n/a do not give an opcode here; see the corresponding TT_OP_* macro in ckernel_ops.h for the exact value and encoding.)
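
To make the Opcode column and the BitMask operand concrete, here is a small sketch of how an instruction word and an ADC write mask can be modeled. Both layouts are assumptions of this sketch: an 8-bit opcode in the top byte with operand fields in the low 24 bits, and BitMask bit 0 = Ch0 X, bit 1 = Ch0 Y, bit 2 = Ch1 X, bit 3 = Ch1 Y. Check ckernel_ops.h and the ISA docs before relying on either. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed encoder shape: opcode in bits 31:24, operand fields in bits 23:0.
constexpr std::uint32_t tt_op(std::uint32_t opcode, std::uint32_t params) {
    return (opcode << 24) | (params & 0x00FFFFFFu);
}
constexpr std::uint32_t opcode_of(std::uint32_t word) { return word >> 24; }

// Same encoder with range checks on both inputs.
inline std::uint32_t encode_checked(std::uint32_t opcode, std::uint32_t params) {
    assert(opcode <= 0xFFu);
    assert(params <= 0x00FFFFFFu);
    return tt_op(opcode, params);
}

// Assumed bit order of the SETADCXY BitMask operand.
enum AdcField : std::uint32_t {
    Ch0X = 1u << 0,
    Ch0Y = 1u << 1,
    Ch1X = 1u << 2,
    Ch1Y = 1u << 3
};
constexpr bool writes(std::uint32_t bit_mask, AdcField f) {
    return (bit_mask & f) != 0;
}

static_assert(opcode_of(tt_op(0xB2, 0x1234)) == 0xB2, "opcode round-trips");
static_assert(!writes(0xB, Ch1X), "0b1011 leaves Ch1.X untouched");
```

Under that bit order, the 0b1011 mask used in the init sequence of §4.1 resets Ch0.X, Ch0.Y and Ch1.Y but leaves Ch1.X alone, which fits with the X-end being written separately by SETADCXX.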

6. Source references

tt-isa-documentation/BlackholeA0/TensixTile/TensixCoprocessor/BackendConfiguration.md — how Config[2][...] and ThreadConfig[3][...] are laid out.
tt-isa-documentation/BlackholeA0/TensixTile/TensixCoprocessor/ConfigurationUnit.md — config-unit pipeline / latency / which instructions write config.
tt-isa-documentation/BlackholeA0/TensixTile/TensixCoprocessor/Dst.md — Dst layout, swizzle, scoreboarding, RISCV access.
tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/UNPACR_Regular.md — exhaustive (621-line) UNPACR functional model; applies unchanged to BH apart from the e4m3 helper and Auto-TTSync.
tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/PACR.md + Packers/{README,InputAddressGenerator,OutputAddressGenerator,FormatConversion,EdgeMasking,ReLU,Downsampling,Compression,ExponentHistogram,ExponentThresholding}.md — full packer functional model.
tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/Unpackers/{README,FormatConversion}.md — unpacker overview.
tt-metal/tt_metal/hw/inc/internal/tt-1xx/blackhole/cfg_defines.h — concrete _ADDR32, _MASK, _SHAMT for all 800+ config fields on Blackhole.
tt-llk/tt_llk_blackhole/common/inc/{ckernel_ops.h, cunpack_common.h, cpack_common.h} — the instruction macros and the C++ structs (unpack_tile_descriptor_t, unpack_config_t, pack_config_t, relu_config_t, dest_rd_ctrl_t, pck_edge_offset_t, pack_counters_t) the LLK actually writes.
tt-llk/tt_llk_blackhole/llk_lib/llk_unpack_*.h, llk_pack*.h — the higher-level helpers (llk_unpack_A, llk_unpack_AB, llk_unpack_tilize, llk_pack, llk_pack_untilize).