anuraagw.me

Pack/Unpack Configuration Registers

Overview

All pack/unpack configuration lives in the Tensix Config Register space, a dedicated register file separate from L1 and MMIO. It is accessed via direct pointer writes during init or via Tensix instructions (WRCFG, RMWCIB, SETC16) during kernel execution.

TENSIX_CFG_BASE = 0xFFEF0000

The space supports two config states (double-buffered ping-pong). State 0 starts at the base; state 1 starts at base + CFG_STATE_SIZE * 16 (offset +896 bytes, since CFG_STATE_SIZE = 56). All register positions are given as ADDR32 — a 32-bit word index from the state base.
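As a minimal sketch of the addressing rule above, the byte address of any config word can be computed on the host like this. The helper name `cfg_word_addr` is hypothetical (it is not a tt-metal function); the constants are the ones quoted in the text.

```cpp
#include <cassert>
#include <cstdint>

// Values quoted in the overview.
constexpr uint32_t TENSIX_CFG_BASE = 0xFFEF0000u;
constexpr uint32_t CFG_STATE_SIZE  = 56u;  // counted in 16-byte units

// Byte address of config word `addr32` inside config state `state` (0 or 1).
// Hypothetical helper for illustration only.
constexpr uint32_t cfg_word_addr(uint32_t state, uint32_t addr32) {
    return TENSIX_CFG_BASE + state * CFG_STATE_SIZE * 16u + addr32 * 4u;
}

// State 1 begins 896 bytes after state 0.
static_assert(cfg_word_addr(1, 0) - cfg_word_addr(0, 0) == 896u, "state stride");
// Pack config word 0 (ADDR32 68) in state 0.
static_assert(cfg_word_addr(0, 68) == 0xFFEF0110u, "ADDR32 68, state 0");
```

An ADDR32 is a word index, so it is scaled by 4 to get a byte offset before the state offset is added.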

Source: tt-metal/tt_metal/hw/inc/internal/tt-1xx/blackhole/tensix.h, cfg_defines.h

DataFormat Enum

From tensix_types.h:213. Hardware encodes only the bottom 4 bits (DATA_FORMAT_BIT_COUNT = 4, mask 0xF).

| Value | Name | Notes |
|---|---|---|
| 0 | Float32 | |
| 1 | Float16 | IEEE FP16 |
| 2 | Bfp8 | Block FP, 8-bit mantissa, format A exponent |
| 3 | Bfp4 | Block FP, 4-bit mantissa, format A exponent |
| 4 | Tf32 | TensorFloat-32 |
| 5 | Float16_b | BFloat16 |
| 6 | Bfp8_b | Block FP, 8-bit mantissa, format B exponent |
| 7 | Bfp4_b | Block FP, 4-bit mantissa, format B exponent |
| 8 | Int32 | |
| 9 | UInt16 | |
| 10 | Lf8 | FP8 E5M2 |
| 11 | Bfp2 | Block FP, 2-bit mantissa, format A exponent |
| 14 | Int8 | |
| 15 | Bfp2_b | Block FP, 2-bit mantissa, format B exponent |
| 24 | UInt32 | |
| 26 | Fp8_e4m3 | SW alias for Lf8 with Pac_LF8_4b_exp/Unp_LF8_4b_exp mode bit set |
| 30 | UInt8 | |
| 0xFF | Invalid | |

Fp8_e4m3 is encoded as Lf8 (value 10) in the 4-bit register field, with a separate mode bit to select E4M3 vs E5M2.
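The 4-bit truncation can be illustrated with a small host-side sketch. The enum below copies a subset of the values from the table; `hw_format_code` and `needs_e4m3_mode_bit` are hypothetical helper names. The text states the aliasing explicitly only for Fp8_e4m3. That UInt32 and UInt8 land on the Int32 and Int8 codes is simply what the 0xF mask does to the enum values, not a claim about how the hardware tells them apart.

```cpp
#include <cassert>
#include <cstdint>

// Subset of the DataFormat values listed in the table.
enum class DataFormat : uint8_t {
    Int32 = 8, Lf8 = 10, Int8 = 14,
    UInt32 = 24, Fp8_e4m3 = 26, UInt8 = 30,
};

constexpr uint32_t DATA_FORMAT_MASK = 0xFu;  // DATA_FORMAT_BIT_COUNT = 4

// 4-bit code that fits a register format field (hypothetical helper).
constexpr uint32_t hw_format_code(DataFormat f) {
    return static_cast<uint32_t>(f) & DATA_FORMAT_MASK;
}

// True when the separate E4M3 mode bit must accompany the Lf8 code.
constexpr bool needs_e4m3_mode_bit(DataFormat f) {
    return f == DataFormat::Fp8_e4m3;
}

static_assert(hw_format_code(DataFormat::Fp8_e4m3) == hw_format_code(DataFormat::Lf8), "26 & 0xF == 10");
static_assert(hw_format_code(DataFormat::UInt32) == 8, "24 & 0xF == 8");
static_assert(hw_format_code(DataFormat::UInt8) == 14, "30 & 0xF == 14");
```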

1. Packer Registers

1.1 Pack Config (THCON_SEC0_REG1) — ADDR32 68–71

Struct: pack_config_t in tt_llk_blackhole/common/inc/cpack_common.h:27–64. Written as 4 consecutive 32-bit words. A second packer context lives at THCON_SEC0_REG8 (ADDR32 96–99).

Word 0 (ADDR32 68):

| Bits | Field | Description |
|---|---|---|
| [15:0] | row_ptr_section_size | BFP tile row pointer section size |
| [31:16] | exp_section_size | Exponent section size (num_faces for BFP, 0 for Lf8/Int8) |

Word 1 (ADDR32 69):

| Bits | Field | Description |
|---|---|---|
| [31:0] | l1_dest_addr | L1 destination address for pack output (16-byte aligned) |

Word 2 (ADDR32 70):

| Bits | Field | Description |
|---|---|---|
| [0] | uncompress | 1 = uncompressed output |
| [1] | add_l1_dest_addr_offset | Add dest addr offset on each tile |
| [2] | disable_pack_zero_flag | Disable zero flag generation (for L1 acc) |
| [7:4] | out_data_format | Output data format (4-bit DataFormat) |
| [11:8] | in_data_format | Input (source/Dest register) format (4-bit DataFormat) |
| [12] | dis_shared_exp_assembler | Disable shared exponent assembly for BFP |
| [13] | auto_set_last_pacr_intf_sel | Auto-set last packer interface select |
| [14] | enable_out_fifo | Enable output FIFO |
| [15] | sub_l1_tile_header_size | Subtract tile header size from addresses |
| [16] | src_if_sel | Source interface select (0=SrcA, 1=SrcB) |
| [20:17] | pack_start_intf_pos | Start interface position |
| [21] | all_pack_disable_zero_compress_ovrd | Override: disable z-compress for all packers |
| [22] | add_tile_header_size | Prepend 16B tile header to output |
| [23] | pack_dis_y_pos_start_offset | Disable Y position start offset |
| [31:24] | l1_src_addr | L1 source address (upper bits) |
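To make the word 2 layout concrete, here is a host-side sketch that assembles the word from a few of its fields. `PackWord2`, `field`, and `encode_pack_word2` are hypothetical names; the real firmware uses the `pack_config_t` bitfield struct instead, and fields omitted here are simply left at zero.

```cpp
#include <cassert>
#include <cstdint>

// Place `value` in `width` bits starting at bit `lo` (width < 32).
constexpr uint32_t field(uint32_t value, unsigned lo, unsigned width) {
    return (value & ((1u << width) - 1u)) << lo;
}

// Hypothetical subset of pack config word 2 (ADDR32 70).
struct PackWord2 {
    uint32_t uncompress;       // bit 0
    uint32_t out_data_format;  // bits [7:4]
    uint32_t in_data_format;   // bits [11:8]
    uint32_t src_if_sel;       // bit 16
    uint32_t l1_src_addr;      // bits [31:24]
};

constexpr uint32_t encode_pack_word2(const PackWord2& w) {
    return field(w.uncompress, 0, 1)
         | field(w.out_data_format, 4, 4)
         | field(w.in_data_format, 8, 4)
         | field(w.src_if_sel, 16, 1)
         | field(w.l1_src_addr, 24, 8);
}

// Uncompressed output, Float32 (0) in, Float16_b (5) out.
static_assert(encode_pack_word2({1, 5, 0, 0, 0}) == 0x00000051u, "word 2 example");
```

Bit 3 is left unassigned by the table, which is why out_data_format starts at bit 4 rather than bit 3.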

Word 3 (ADDR32 71):

| Bits | Field | Description |
|---|---|---|
| [19] | Pack_L1_Acc | Enable L1 accumulation mode |
| [20] | Exp_threshold_en | Enable exponent thresholding |
| [22] | Unp_LF8_4b_exp | FP8 E4M3 mode for unpacker 0 (shared register) |
| [23] | Pac_LF8_4b_exp | FP8 E4M3 mode for packer |
| [31:24] | Exp_threshold | Exponent threshold value (e.g. 113 for FP32->BFP-A) |
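The word 3 flags can be sketched the same way. The constant and function names below are hypothetical; only the bit positions and the example threshold of 113 come from the table.

```cpp
#include <cassert>
#include <cstdint>

// Bit positions from the word 3 table (ADDR32 71).
constexpr uint32_t PACK_L1_ACC      = 1u << 19;
constexpr uint32_t EXP_THRESHOLD_EN = 1u << 20;
constexpr uint32_t UNP_LF8_4B_EXP   = 1u << 22;
constexpr uint32_t PAC_LF8_4B_EXP   = 1u << 23;

// Hypothetical encoder for the packer-owned fields of word 3.
// The unpacker bit (UNP_LF8_4B_EXP) is deliberately not touched here.
constexpr uint32_t encode_pack_word3(bool l1_acc, bool e4m3_out,
                                     bool thresh_en, uint32_t thresh) {
    return (l1_acc ? PACK_L1_ACC : 0u)
         | (thresh_en ? EXP_THRESHOLD_EN : 0u)
         | (e4m3_out ? PAC_LF8_4B_EXP : 0u)
         | ((thresh & 0xFFu) << 24);
}

// Exponent thresholding enabled with threshold 113 (0x71).
static_assert(encode_pack_word3(false, false, true, 113) == 0x71100000u, "threshold example");
```

Because Unp_LF8_4b_exp lives in the same word, a real update would normally be a read-modify-write on the packer bits rather than a full-word store.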

1.2 Dest Read Control (PCK_DEST_RD_CTRL) — ADDR32 18

Controls how the packer reads values from the Dest accumulator register.

| Bits | Field | Description |
|---|---|---|
| [0] | Read_32b_data | Read 32-bit from Dest (Float32/Int32/UInt32, or FP32 dest mode) |
| [1] | Read_unsigned | Treat data as unsigned (UInt8 output) |
| [2] | Read_int8 | Read as 8-bit integer from Dest |
| [3] | Round_10b_mant | Round to 10-bit mantissa (FP32->FP16, or FP8-E4M3 output) |

1.3 ReLU (STACC_RELU) — ADDR32 2

Shares a register with ALU_ACC_CTRL_Zero_Flag_*.

| Bits | Field | Description |
|---|---|---|
| [0] | Zero_Flag_disabled_src | Disable zero flagging for source |
| [1] | Zero_Flag_disabled_dst | Disable zero flagging for dest |
| [5:2] | ApplyRelu | 0=off, 1=ReLU, 2=threshold min, 3=threshold max |
| [21:6] | ReluThreshold | 16-bit threshold value in BF16 format |
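As an illustration of the ReLU fields, the following host-side sketch builds the ApplyRelu and ReluThreshold bits from a mode and a float threshold. The names are hypothetical, and converting the threshold to BF16 by truncating the upper 16 bits of the float is an assumption of this sketch; the firmware may round differently or take a pre-encoded value.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Mode encoding from the ApplyRelu row.
enum class ReluMode : uint32_t { Off = 0, Relu = 1, ThresholdMin = 2, ThresholdMax = 3 };

// BF16 shares the top 16 bits of an IEEE-754 binary32 value.
// Truncating conversion (an assumption of this sketch).
inline uint16_t float_to_bf16_trunc(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return static_cast<uint16_t>(bits >> 16);
}

// ApplyRelu in bits [5:2], ReluThreshold in bits [21:6].
// The zero-flag bits [1:0] are left at 0 here.
inline uint32_t encode_relu(ReluMode mode, float threshold) {
    return (static_cast<uint32_t>(mode) << 2)
         | (static_cast<uint32_t>(float_to_bf16_trunc(threshold)) << 6);
}
```

For example, `encode_relu(ReluMode::ThresholdMin, 1.0f)` yields `0xFE008`: 1.0f is `0x3F80` in BF16, shifted left by 6, plus mode 2 shifted left by 2.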

1.4 Pack Counters (PACK_COUNTERS_SEC0) — ADDR32 28

| Bits | Field | Description |
|---|---|---|
| [7:0] | pack_per_xy_plane | Packs per XY plane |
| [15:8] | pack_reads_per_xy_plane | Reads per XY plane |
| [22:16] | pack_xys_per_tile | XY planes per tile |
| [23] | pack_yz_transposed | YZ transpose flag |
| [31:24] | auto_ctxt_inc_xys_cnt | Auto context increment XYs count |

1.5 Edge Masking — ADDR32 24–27

Edge masking supports partial tile packing. Four edge offset registers (PCK_EDGE_OFFSET_SEC[0:3]) each hold a 16-bit mask in their lower half. TILE_ROW_SET_MAPPING[0:3] (ADDR32 20–23) assign each of the 16 face rows a 2-bit selector (16 rows x 2 bits = 32 bits per register) that picks one of the 4 edge offset masks.

| Register | ADDR32 | Content |
|---|---|---|
| PCK_EDGE_OFFSET_SEC0 | 24 | mask[15:0], mode[16], tile_row_set_select_pack[25:17] |
| PCK_EDGE_OFFSET_SEC1 | 25 | mask[15:0] |
| PCK_EDGE_OFFSET_SEC2 | 26 | mask[15:0] |
| PCK_EDGE_OFFSET_SEC3 | 27 | mask[15:0] |
| TILE_ROW_SET_MAPPING0 | 20 | 16 rows x 2-bit mapping |
| TILE_ROW_SET_MAPPING1 | 21 | 16 rows x 2-bit mapping |
| TILE_ROW_SET_MAPPING2 | 22 | 16 rows x 2-bit mapping |
| TILE_ROW_SET_MAPPING3 | 23 | 16 rows x 2-bit mapping |
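A rough sketch of how one TILE_ROW_SET_MAPPING word could be assembled from 16 per-row selectors follows. The function name is hypothetical, and the bit ordering (row r in bits [2r+1:2r]) is an assumption: the text only says "16 rows x 2 bits".

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Pack 16 two-bit mask selectors into one 32-bit mapping word.
// Assumed ordering: row r occupies bits [2r+1 : 2r].
inline uint32_t encode_row_set_mapping(const std::array<uint8_t, 16>& sel) {
    uint32_t word = 0;
    for (unsigned r = 0; r < 16; ++r) {
        // Each selector picks one of PCK_EDGE_OFFSET_SEC[0:3].
        word |= static_cast<uint32_t>(sel[r] & 0x3u) << (2 * r);
    }
    return word;
}
```

With every row pointing at mask 1 the word is `0x55555555`; with every row pointing at mask 0 it is `0`.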

1.6 Packer Address Strides — ADDR32 12–17

| Register | ADDR32 | Content |
|---|---|---|
| PCK0_ADDR_CTRL_XY_REG_0 | 12 | X-stride [15:0], Y-stride [31:16] |
| PCK0_ADDR_CTRL_ZW_REG_0 | 13 | Z-stride [15:0], W-stride [31:16] |
| PCK0_ADDR_CTRL_XY_REG_1 | 14 | Channel 1 X/Y strides |
| PCK0_ADDR_CTRL_ZW_REG_1 | 15 | Channel 1 Z/W strides |
| PCK0_ADDR_BASE_REG_0 | 16 | Base address register 0 |
| PCK0_ADDR_BASE_REG_1 | 17 | Base address register 1 |

1.7 Dest Target (DEST_TARGET_REG_CFG_PACK_SEC) — ADDR32 180–183

Holds the packer's Dest register offset and Z-offset for each of the 4 packer sections (selects which half of Dest to read from).

2. Unpacker Registers

2.1 Tile Descriptor (THCON_SEC0/1_REG0) — ADDR32 64–67 / 112–115

Struct: unpack_tile_descriptor_t in tt_llk_blackhole/common/inc/cunpack_common.h:20–88. Unpacker 0 (SrcA) at ADDR32 64–67, Unpacker 1 (SrcB) at ADDR32 112–115.

Word 0 (ADDR32 64 / 112):

| Bits | Field | Description |
|---|---|---|
| [3:0] | in_data_format | Input tile data format (4-bit DataFormat) |
| [4] | uncompressed | 1 = tile is uncompressed (no zero-compress) |
| [11:8] | blobs_per_xy_plane | BFP metadata blobs per XY plane |
| [31:16] | x_dim | Tile X dimension (face_width x face_count) |

Word 1 (ADDR32 65 / 113):

| Bits | Field | Description |
|---|---|---|
| [15:0] | y_dim | Tile Y dimension |
| [31:16] | z_dim | Z dimension (number of faces: 1, 2, or 4) |

Word 2 (ADDR32 66 / 114):

| Bits | Field | Description |
|---|---|---|
| [15:0] | w_dim | W dimension |
| [31:16] | blobs_y_start_lo | BFP blob start Y (low 16 bits) |

Word 3 (ADDR32 67 / 115):

| Bits | Field | Description |
|---|---|---|
| [15:0] | blobs_y_start_hi | BFP blob start Y (high 16 bits) |
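The four descriptor words can be assembled from their fields as in this host-side sketch. `TileDescriptor` and `encode_tile_descriptor` are hypothetical stand-ins for the real `unpack_tile_descriptor_t` bitfield struct, and the dimension values used in the usage note are illustrative only.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical plain-struct view of the tile descriptor fields.
struct TileDescriptor {
    uint32_t in_data_format;      // word 0 [3:0]
    bool     uncompressed;        // word 0 [4]
    uint32_t blobs_per_xy_plane;  // word 0 [11:8]
    uint32_t x_dim;               // word 0 [31:16]
    uint32_t y_dim;               // word 1 [15:0]
    uint32_t z_dim;               // word 1 [31:16]
    uint32_t w_dim;               // word 2 [15:0]
    uint32_t blobs_y_start;       // split: low half in word 2, high half in word 3
};

inline std::array<uint32_t, 4> encode_tile_descriptor(const TileDescriptor& d) {
    return {{
        (d.in_data_format & 0xFu)
            | (static_cast<uint32_t>(d.uncompressed) << 4)
            | ((d.blobs_per_xy_plane & 0xFu) << 8)
            | ((d.x_dim & 0xFFFFu) << 16),
        (d.y_dim & 0xFFFFu) | ((d.z_dim & 0xFFFFu) << 16),
        (d.w_dim & 0xFFFFu) | ((d.blobs_y_start & 0xFFFFu) << 16),
        (d.blobs_y_start >> 16) & 0xFFFFu,
    }};
}
```

For an uncompressed Float16_b tile (format 5) with x_dim = 256, y_dim = 1, z_dim = 4, w_dim = 1, the words come out as `0x01000015`, `0x00040001`, `0x00000001`, `0x00000000`.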

2.2 Unpack Config (THCON_SEC0/1_REG2) — ADDR32 72–75 / 120–123

Struct: unpack_config_t.

Word 0 (ADDR32 72 / 120):

| Bits | Field | Description |
|---|---|---|
| [3:0] | out_data_format | Output format (format in srcA/srcB register file) |
| [5:4] | throttle_mode | Throttle mode (default=2) |
| [7:6] | context_count | Number of double-buffered contexts |
| [8] | haloize_mode | XY transpose mode |
| [9] | tileize_mode | Tilize mode (row-major -> tile layout) |
| [10] | unpack_src_reg_set_upd | Update source register set |
| [11] | unpack_if_sel | Unpack interface select |
| [13:12] | upsample_rate | Upsampling rate |
| [15] | upsample_and_interleave | Upsample + interleave mode |
| [31:16] | shift_amount | Shift amount |

Word 1 (ADDR32 73 / 121):

| Bits | Field | Description |
|---|---|---|
| [3:0] | uncompress_cntx0_3 | Per-context uncompress flags (contexts 0-3) |
| [7:4] | unpack_if_sel_cntx0_3 | Per-context interface select (0-3) |
| [8] | force_shared_exp | Force shared exponent mode |
| [19:16] | uncompress_cntx4_7 | Per-context uncompress flags (contexts 4-7) |
| [23:20] | unpack_if_sel_cntx4_7 | Per-context interface select (4-7) |

Word 2 (ADDR32 74 / 122):

| Bits | Field | Description |
|---|---|---|
| [16:0] | limit_addr | L1 FIFO limit address |

Word 3 (ADDR32 75 / 123):

| Bits | Field | Description |
|---|---|---|
| [16:0] | fifo_size | L1 FIFO size |

2.3 L1 Base Address (THCON_SEC0/1_REG3)

| Register | ADDR32 | Description |
|---|---|---|
| THCON_SEC0_REG3_Base_address | (REG3 base) | Unp0 tile L1 base address, context 0 |
| THCON_SEC0_REG3_Base_cntx1_address | (REG3 base+1) | Unp0 tile L1 base address, context 1 |
| THCON_SEC1_REG3_Base_address | (SEC1 equivalent) | Unp1 tile L1 base address |

Written per-tile before issuing UNPACR.

2.4 Per-Context Dest Address (THCON_SEC0_REG5) — ADDR32 84–87

| ADDR32 | Content |
|---|---|
| 84 | Dest_cntx0_address [15:0], Dest_cntx1_address [31:16] |
| 85 | Dest_cntx2_address [15:0], Dest_cntx3_address [31:16] |
| 86 | Tile_x_dim_cntx0 [15:0], Tile_x_dim_cntx1 [31:16] |
| 87 | Tile_x_dim_cntx2 [15:0], Tile_x_dim_cntx3 [31:16] |

2.5 Per-Context Format Override (THCON_SEC0_REG7) — ADDR32 92–93

| ADDR32 | Bits | Field | Description |
|---|---|---|---|
| 92 | [15:0] | Offset_address | Tile offset (context 0) |
| 92 | [19:16] | Unpack_data_format_cntx0 | Per-context input format override |
| 92 | [23:20] | Unpack_out_data_format_cntx0 | Per-context output format override |
| 92 | [27:24] | Unpack_data_format_cntx4 | Context 4 input format override |
| 92 | [31:28] | Unpack_out_data_format_cntx4 | Context 4 output format override |
| 93 | [15:0] | Offset_cntx1_address | Tile offset (context 1) |

2.6 Unpacker Address Strides — ADDR32 44–50, 60–62

| Register | ADDR32 | Description |
|---|---|---|
| UNP0_ADDR_CTRL_XY_REG_0 | 44 | Unp0 X/Y stride (channel 0) |
| UNP0_ADDR_CTRL_ZW_REG_0 | 45 | Unp0 Z/W stride (channel 0) |
| UNP1_ADDR_CTRL_XY_REG_0 | 46 | Unp1 X/Y stride |
| UNP1_ADDR_CTRL_ZW_REG_0 | 47 | Unp1 Z/W stride |
| UNP0_ADDR_BASE_REG_0 | 48 | Unp0 base address reg 0 |
| UNP0_ADDR_BASE_REG_1 | 49 | Unp0 base address reg 1 |
| UNP0_ADD_DEST_ADDR_CNTR | 50 | [8]: Enable adding dest address counter |
| UNP1_ADDR_BASE_REG_0 | 60 | Unp1 base address reg 0 |
| UNP1_ADDR_BASE_REG_1 | 61 | Unp1 base address reg 1 |
| UNP1_ADD_DEST_ADDR_CNTR | 62 | Unp1 dest address counter enable |

2.7 Unpack Misc Config — ADDR32 41

Controls double-buffered config context switching.

| Bits | Field | Description |
|---|---|---|
| [3:0] | CfgContextOffset_0 | Offset to context 0 config block |
| [4] | CfgContextCntReset_0 | Reset context counter 0 |
| [5] | CfgContextCntInc_0 | Increment context counter 0 |
| [11:8] | CfgContextOffset_1 | Offset to context 1 config block |

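To connect this layout with the 16-bit immediate used in the SETC16 example in section 4.4, here is a small sketch that builds and decodes the two offset fields. The function names are hypothetical, and the reset/increment bits [5:4] are left at zero.

```cpp
#include <cassert>
#include <cstdint>

// Build the low 16 bits of the misc config word from the two context offsets.
constexpr uint16_t encode_ctx_offsets(uint32_t offset0, uint32_t offset1) {
    return static_cast<uint16_t>((offset0 & 0xFu) | ((offset1 & 0xFu) << 8));
}

// Extract each offset again.
constexpr uint32_t ctx_offset0(uint16_t v) { return v & 0xFu; }
constexpr uint32_t ctx_offset1(uint16_t v) { return (v >> 8) & 0xFu; }

// 0x0101 sets both CfgContextOffset_0 and CfgContextOffset_1 to 1.
static_assert(encode_ctx_offsets(1, 1) == 0x0101, "matches the SETC16 immediate");
static_assert(ctx_offset0(0x0101) == 1 && ctx_offset1(0x0101) == 1, "round trip");
```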
3. ALU Format / Stochastic Rounding — ADDR32 0–2

Written by the unpack thread (TRISC0) but affects all of pack/unpack/math.

ADDR32 0 — ALU_FORMAT_SPEC override values:

| Bits | Field | Description |
|---|---|---|
| [3:0] | SrcA_val | SrcA format value (auto-inferred from tile) |
| [8:5] | SrcB_val | SrcB format value |
| [13:10] | Dstacc_val | Dest accumulator format value |

ADDR32 1 — ALU_FORMAT_SPEC + ALU_ROUNDING_MODE + ALU_ACC_CTRL (packed):

| Bits | Field | Description |
|---|---|---|
| [0] | Fpu_srnd_en | FPU stochastic rounding enable |
| [1] | Gasket_srnd_en | Gasket (pre-packer) stochastic rounding |
| [2] | Packer_srnd_en | Packer stochastic rounding |
| [13] | GS_LF | Gasket LF8 mode |
| [14] | Bfp8_HF | BFP8 high-fidelity mode |
| [15] | SrcAUnsigned | SrcA is unsigned |
| [16] | SrcBUnsigned | SrcB is unsigned |
| [20:17] | SrcA | SrcA format (4 bits) |
| [24:21] | SrcB | SrcB format (4 bits) |
| [28:25] | Dstacc | Dest accumulator format (4 bits) |
| [29] | Fp32_enabled | FP32 dest accumulation mode |
| [30] | SFPU_Fp32_enabled | SFPU reads FP32 from dest |
| [31] | INT8_math_enabled | INT8 math mode |
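When inspecting a register dump, it can help to decode this packed word back into its fields. The following host-side sketch does that for the format and mode fields; `AluConfig`, `bits`, and `decode_alu_config` are hypothetical names, and only the bit ranges come from the table.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decoded view of ADDR32 1 (format and mode fields only).
struct AluConfig {
    uint32_t src_a;    // [20:17]
    uint32_t src_b;    // [24:21]
    uint32_t dst_acc;  // [28:25]
    bool fp32_enabled;       // [29]
    bool sfpu_fp32_enabled;  // [30]
    bool int8_math_enabled;  // [31]
};

// Extract bits [hi:lo] of `word` (field width < 32).
constexpr uint32_t bits(uint32_t word, unsigned hi, unsigned lo) {
    return (word >> lo) & ((1u << (hi - lo + 1)) - 1u);
}

inline AluConfig decode_alu_config(uint32_t w) {
    return {bits(w, 20, 17), bits(w, 24, 21), bits(w, 28, 25),
            bits(w, 29, 29) != 0, bits(w, 30, 30) != 0, bits(w, 31, 31) != 0};
}
```

For instance, `0x20AA0000` decodes to SrcA = SrcB = 5 (Float16_b), Dstacc = 0 (Float32), with Fp32_enabled set.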

ADDR32 2 — STACC_RELU + Zero Flags:

See section 1.3 above.

4. How Firmware Writes Config Registers

All mechanisms target the config space at TENSIX_CFG_BASE. Firmware code accesses it via get_cfg_pointer() which returns a volatile uint32_t* to the base.

4.1 Direct Pointer Write (initialization)

volatile uint32_t *cfg = get_cfg_pointer();
cfg[THCON_SEC0_REG0_TileDescriptor_ADDR32 + 0] = tile_descriptor.val[0];
cfg[THCON_SEC0_REG0_TileDescriptor_ADDR32 + 1] = tile_descriptor.val[1];
// ...

Used during configure_unpack_AB() and configure_pack() at kernel init.

4.2 WRCFG (write 32-bit or 128-bit from GPR to config)

Opcode 0xB0. Operand fields: GprAddress (8 bits) | wr128b (bit 15) | CfgReg (bits [14:0]).

TTI_STALLWAIT(p_stall::STALL_CFG, p_stall::THCON);
TTI_WRCFG(p_gpr_pack::TMP0, p_cfg::WRCFG_32b, STACC_RELU_ApplyRelu_ADDR32);
TTI_NOP; TTI_NOP;  // 2 NOPs required after WRCFG

4.3 RMWCIB0/1/2/3 (read-modify-write individual bytes)

Opcodes 0xB3–0xB6 for bytes 0–3. Used for sub-word bitfield modifications without disturbing surrounding bits.

// cfg_reg_rmw_tensix<ADDR32, SHAMT, MASK>(val) decomposes to:
TT_RMWCIB0(mask_b0, data_b0, CfgAddr32);
TT_RMWCIB1(mask_b1, data_b1, CfgAddr32);
TT_RMWCIB2(mask_b2, data_b2, CfgAddr32);
TT_RMWCIB3(mask_b3, data_b3, CfgAddr32);
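The byte-wise decomposition can be modeled on the host as follows. This is a sketch of the idea, not the firmware implementation: `ByteWrite`, `split_rmw`, and `apply_rmw` are hypothetical names, and skipping bytes whose mask is zero is an assumption of this model.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One RMWCIBn-style operation: which byte, which bits, what data.
struct ByteWrite {
    unsigned byte_index;
    uint8_t  mask;
    uint8_t  data;
};

// Split a field update (shift amount, 32-bit mask, field value) into
// per-byte masked writes. Bytes with an empty mask are skipped.
inline std::vector<ByteWrite> split_rmw(uint32_t shamt, uint32_t mask, uint32_t val) {
    std::vector<ByteWrite> out;
    const uint32_t data = (val << shamt) & mask;
    for (unsigned b = 0; b < 4; ++b) {
        const uint8_t m = static_cast<uint8_t>(mask >> (8 * b));
        if (m != 0) {
            out.push_back({b, m, static_cast<uint8_t>(data >> (8 * b))});
        }
    }
    return out;
}

// Apply the byte writes to an existing register value; bits outside
// each mask keep their old contents.
inline uint32_t apply_rmw(uint32_t old_word, const std::vector<ByteWrite>& writes) {
    for (const ByteWrite& w : writes) {
        const uint32_t m = static_cast<uint32_t>(w.mask) << (8 * w.byte_index);
        const uint32_t d = static_cast<uint32_t>(w.data) << (8 * w.byte_index);
        old_word = (old_word & ~m) | (d & m);
    }
    return old_word;
}
```

Updating ReluThreshold (shift 6, mask `0x3FFFC0`) to `0x3F80` touches bytes 0, 1, and 2 only, and applying the result to `0xFFFFFFFF` gives `0xFFCFE03F`, leaving the ApplyRelu and zero-flag bits untouched.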

4.4 SETC16 (16-bit write to config register)

Opcode 0xB2.

TTI_SETC16(UNPACK_MISC_CFG_CfgContextOffset_0_ADDR32, 0x0101);

4.5 Sequencing: STALLWAIT

All config writes affecting packer/unpacker hardware must be preceded by a STALLWAIT to avoid races:

TTI_STALLWAIT(p_stall::STALL_CFG, p_stall::THCON);   // before THCON (pack/unpack) config
TTI_STALLWAIT(p_stall::STALL_CFG, p_stall::PACK);     // before ReLU, edge mask config
TTI_STALLWAIT(p_stall::STALL_CFG, p_stall::UNPACK0);  // before unpack reconfig

5. Blackhole-Specific Notes

From cpack_common.h:51–61:

  • Word 3 of pack config (ADDR32 71) was restructured vs Wormhole: Pack_L1_Acc, Exp_threshold_en, Unp_LF8_4b_exp, Pac_LF8_4b_exp, and Exp_threshold moved here to avoid a race condition between unpack and pack threads sharing these fields.
  • Only 1 packer is active in the Blackhole LLK (NUM_PACKERS = 1), compared to up to 4 in hardware.
  • Packer x_start/x_end must be within 1 row (0 to FACE_C_DIM-1 = 15). Set via TTI_SETADCXX(p_setadc::PAC, FACE_C_DIM-1, 0x0).
  • FP8 E4M3 support uses Unp_LF8_4b_exp / Pac_LF8_4b_exp bits at ADDR32 71.

Key Source Files

| File | Content |
|---|---|
| tt-metal/tt_metal/hw/inc/internal/tt-1xx/blackhole/cfg_defines.h | Master register map: all ADDR32/SHAMT/MASK definitions |
| tt-metal/tt_metal/hw/inc/internal/tt-1xx/blackhole/tensix.h | Address map (TENSIX_CFG_BASE, TDMA, debug) |
| tt-metal/tt_metal/hw/inc/internal/tt-1xx/blackhole/tensix_types.h | DataFormat enum, ReLU modes, stochastic rounding enum |
| tt_llk_blackhole/common/inc/cpack_common.h | pack_config_t, ReLU config, dest_rd_ctrl, configure_pack |
| tt_llk_blackhole/common/inc/cunpack_common.h | unpack_tile_descriptor_t, unpack_config_t, configure_unpack_AB |
| tt_llk_blackhole/common/inc/ckernel.h | cfg_reg_rmw_tensix, cfg_write, get_cfg_pointer |
| tt_llk_blackhole/common/inc/ckernel_ops.h | WRCFG, RMWCIB0-3, SETC16 instruction encoding macros |
| tt_llk_blackhole/llk_lib/llk_pack_common.h | High-level pack LLK (ReLU, L1 acc, edge masks) |
| tt_llk_blackhole/llk_lib/llk_unpack_common.h | High-level unpack LLK (stochastic rounding, reconfig) |