april 16, 2025•approx 10m read•📁blackhole emulator

data types and conversions

Numeric formats used in Tensix coprocessor internals: shuffled 19-bit register format for SrcA/SrcB, conversions to/from IEEE FP32, and BFP/FP8 expansion rules.

Data Types and Format Conversions

Overview

The Blackhole Tensix coprocessor uses several numeric formats internally. Values move between L1 memory, the SrcA/SrcB register files, the Dest register file, and the SFPU, and each boundary may require format conversion. This document specifies the bit-level representations, the conversion functions the emulator must implement, and the hardware quirks that must be replicated faithfully.

1. Internal Register Format (Shuffled)

SrcA and SrcB register files store 19-bit values in shuffled order:

{ sign(1), mantissa(10), exponent(8) }

Bit positions within the 19-bit cell:

Field	Bits	Width
sign	18	1
mantissa	17:8	10
exponent	7:0	8

This is not IEEE order. IEEE FP32 uses {sign(1), exponent(8), mantissa(23)}. The emulator must convert between the shuffled 19-bit representation and IEEE FP32 bit patterns whenever values cross the SrcA/SrcB boundary.

def shuffled_to_ieee(val_19bit):
    sign = (val_19bit >> 18) & 1
    mantissa = (val_19bit >> 8) & 0x3FF
    exponent = val_19bit & 0xFF
    return (sign << 31) | (exponent << 23) | (mantissa << 13)

def ieee_to_shuffled(ieee_bits):
    sign = (ieee_bits >> 31) & 1
    exponent = (ieee_bits >> 23) & 0xFF
    mantissa = (ieee_bits >> 13) & 0x3FF
    return (sign << 18) | (mantissa << 8) | exponent

Note that shuffled_to_ieee expands 10 mantissa bits to a 23-bit field by placing them in the top 10 bits (bits 22:13) and zeroing bits 12:0. Conversely, ieee_to_shuffled truncates the 23-bit mantissa to its top 10 bits.

Supported Input Formats Stored in the 19-bit Cell

Format	Bit layout	Fit in 19-bit cell
TF32	1 + 10 + 8 = 19	Native fit, no padding needed
BF16	1 + 7 + 8 = 16	Mantissa zero-padded from 7 to 10 bits
FP16	1 + 10 + 5 = 16	Exponent zero-padded from 5 to 8 bits
INT8	1 + 8 + 0 = 9	Sign-magnitude; stored in sign+mantissa fields

2. FP32 <-> BF16

BF16 (Brain Float 16) is the top 16 bits of an IEEE FP32 value. It has the same exponent range as FP32 (8-bit exponent, bias 127) but only 7 mantissa bits instead of 23.

import struct

def fp32_to_bf16(f):
    bits = struct.unpack('<I', struct.pack('<f', f))[0]
    return (bits >> 16) & 0xFFFF

def bf16_to_fp32(bf16):
    bits = bf16 << 16
    return struct.unpack('<f', struct.pack('<I', bits))[0]

Rounding mode: The default is round-toward-zero (RTZ), implemented by plain truncation (the >> 16 discards the low 16 mantissa bits without rounding). Firmware can enable round-to-nearest-even (RTNE) by setting cfg0 bit 31 (EnBFloatRTNE). When RTNE is enabled the emulator must add a round bit based on the discarded bits before truncating.

3. FP32 -> TF32

TF32 (TensorFloat-32) keeps the FP32 sign and full 8-bit exponent but truncates the mantissa from 23 bits to 10 bits. It is a 19-bit logical type that occupies the full 19-bit shuffled cell.

def fp32_to_tf32(f):
    bits = struct.unpack('<I', struct.pack('<f', f))[0]
    return bits & 0xFFFFE000  # zero low 13 mantissa bits

The result is stored as an IEEE FP32 bit pattern with the low 13 mantissa bits cleared. When written into a SrcA/SrcB cell it is then passed through ieee_to_shuffled.

There is no TF32-to-FP32 conversion: TF32 values in Src are zero-extended in the mantissa when expanded to FP32 by shuffled_to_ieee, which is equivalent.

4. FP32 <-> FP16 (IEEE)

FP16 is standard IEEE 754 half-precision: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits.

Conversion FP32 -> FP16:

Extract sign, exponent, mantissa from the FP32 bit pattern.
Rebias the exponent: exp16 = exp32 - 127 + 15. If exp16 <= 0, the result is a subnormal or zero. If exp16 >= 31, the result saturates to the maximum finite FP16 value (Tensix does not produce Inf or NaN on overflow).
Truncate mantissa from 23 bits to 10 bits (take bits 22:13).
Pack as {sign(1), exp16(5), mantissa10(10)}.

Conversion FP16 -> FP32:

Extract sign, 5-bit exponent, 10-bit mantissa.
Rebias exponent: exp32 = exp16 - 15 + 127. Handle subnormals (exp16 == 0) by normalizing.
Zero-extend mantissa from 10 to 23 bits.
Pack as IEEE FP32.

Hardware non-conformance: The Tensix FP16 implementation does not conform to IEEE 754 in edge cases. Overflow saturates to the maximum finite FP16 value rather than producing infinity. NaN inputs may not produce NaN outputs. The emulator should replicate this saturating behavior rather than IEEE-correct behavior.

5. Sign-Magnitude Integers

Tensix uses sign-magnitude representation for integers in Dest and in the SFPU. This differs from the two’s complement used by general-purpose CPUs.

In sign-magnitude:

The most-significant bit is the sign (1 = negative).
The remaining bits are the absolute value (magnitude).
There are two representations of zero: +0 (0x00000000) and -0 (0x80000000).

def int_to_signmag(val, bits=32):
    if val < 0:
        return (1 << (bits-1)) | (-val & ((1 << (bits-1)) - 1))
    return val

def signmag_to_int(val, bits=32):
    sign = (val >> (bits-1)) & 1
    mag = val & ((1 << (bits-1)) - 1)
    return -mag if sign else mag

Known hardware bug — SFPCAST mode 2 (sign-magnitude to two’s complement):

When SFPCAST is invoked in mode 2, it converts a sign-magnitude INT32 to two’s complement INT32. The hardware contains a bug: sign-magnitude -0 (0x80000000) is converted to the most-negative INT32 (0x80000000 in two’s complement, i.e., -2147483648) instead of zero. The emulator must replicate this behavior.

The correct conversion of sign-magnitude -0 to two’s complement would be 0. Any conforming implementation would handle this as a special case, but the hardware does not.

6. DataFormat Enum

The hardware uses a 4-bit field to identify numeric formats in tile descriptors and configuration registers. The encoding is:

Value	Name	Notes
0	Float32	IEEE FP32
1	Float16	IEEE FP16
2	Bfp8	Block floating point, 8-bit mantissa, format A exponent
3	Bfp4	Block floating point, 4-bit mantissa, format A exponent
4	Tf32	TensorFloat-32 (19-bit logical)
5	Float16_b	BFloat16
6	Bfp8_b	Block floating point, 8-bit mantissa, format B exponent
7	Bfp4_b	Block floating point, 4-bit mantissa, format B exponent
8	Int32	Sign-magnitude 32-bit integer
9	UInt16	Unsigned 16-bit integer
10	Lf8	FP8 E5M2 (or E4M3 depending on mode bit; see below)
11	Bfp2	Block floating point, 2-bit mantissa, format A
12	(reserved)
13	(reserved)
14	Int8	Sign-magnitude 8-bit integer
15	Bfp2_b	Block floating point, 2-bit mantissa, format B

_b suffix: The “bfloat” variant of each BFP format. Format A and Format B differ in how shared exponents are scoped — see section 7.

FP8 dual-mode encoding: The value 10 (Lf8) encodes two different FP8 formats selected by a separate mode bit:

Pac_LF8_4b_exp = 0 / Unp_LF8_4b_exp = 0: E5M2 (5 exponent bits, 2 mantissa bits)
Pac_LF8_4b_exp = 1 / Unp_LF8_4b_exp = 1: E4M3 (4 exponent bits, 3 mantissa bits)

The mode bit is in separate packer/unpacker configuration registers and is not part of the 4-bit DataFormat field itself.

7. BFP (Block Floating Point)

BFP formats compress data by sharing a single exponent across a group of elements. Each element stores only its mantissa; the exponent is stored once per group and applied to all elements in that group during unpacking.

Tile Layout

A 32x32 BFP tile in L1 stores data and exponents as separate sections:

Format	Data bytes (mantissa only)	Exponent bytes	Total
Bfp8 / Bfp8_b	1024 (8 bits per element)	64	1088
Bfp4 / Bfp4_b	512 (4 bits per element)	64	576
Bfp2 / Bfp2_b	256 (2 bits per element)	64	320

The 64 exponent bytes encode 4 faces x 16 shared exponents = 64 exponent values. Each exponent is 1 byte.

Format A vs Format B

Both format variants store 64 exponents per tile (4 faces, 16 exponents per face). The difference is in which elements share a given exponent:

Format A: 1 shared exponent per face row. A face is 16x16 elements. Each of the 16 rows within a face gets its own shared exponent. All 16 elements in that row share that exponent.
Format B: Same structural layout (16 exponents per face) but uses a different derivation rule for the shared exponent value. The exact derivation depends on packer configuration (typically the maximum exponent in the group).

The unpacker applies: element_value = mantissa * 2^(shared_exponent - bias).

8. Tile Byte Sizes

Total bytes per 32x32 tile in L1, including any exponent sections but excluding the tile header (16 bytes):

Format	Bytes per 32x32 tile
Float32 / Int32 / UInt32	4096
Float16 / Float16_b / UInt16	2048
Bfp8 / Bfp8_b	1088
Bfp4 / Bfp4_b	576
Bfp2 / Bfp2_b	320
UInt8 / Int8 / Lf8 / Fp8_e4m3	1024

These sizes assume a full 32x32 tile (1024 elements). The tile header (16 bytes) is in addition to these values and is read by the unpacker’s header parser before element data begins.

9. Dest Register Data Views

The Dest register file provides two different logical views over the same physical 1024x16 storage of 16-bit cells.

Physical Storage

The underlying physical array is:

Dest_physical[1024 rows][16 cols] of 16-bit cells

Total: 1024 * 16 * 2 = 32768 bytes = 32 KB.

Dst16b View (16-bit mode)

When ALU_ACC_CTRL_Fp32_enabled = 0:

Dst16b[1024 rows][16 cols] of 16-bit values

Each element maps directly to one physical cell. Used for BF16, FP16, INT16, and UINT16 accumulation.

Dst32b View (32-bit mode)

When ALU_ACC_CTRL_Fp32_enabled = 1:

Dst32b[512 rows][16 cols] of 32-bit values

Each 32-bit value is split across two physical rows that are 8 apart:

Low 16 bits: physical row r
High 16 bits: physical row r + 8

For logical row n in Dst32b:

The low half is at physical row n % 8 + (n / 8) * 16
The high half is at physical row n % 8 + (n / 8) * 16 + 8

Used for FP32 and INT32 accumulation.

The emulator must implement both views and apply the correct mapping based on the ALU_ACC_CTRL_Fp32_enabled bit when reading or writing Dest.

Source References

Source	Path	Relevant Content
Unpackers overview	`tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/Unpackers/`	Shuffled format, format conversion pipeline, BFP expansion
UNPACR.md	`tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/UNPACR_Regular.md`	Unpacker instruction, TileDescriptor fields, DataFormat encoding
Dst.md	`tt-isa-documentation/BlackholeA0/TensixTile/TensixCoprocessor/Dst.md`	Dest register layout, Dst16b/Dst32b views, row interleaving
SFPCAST.md	`tt-isa-documentation/BlackholeA0/TensixTile/TensixCoprocessor/SFPCAST*.md`	SFPCAST mode 2 sign-magnitude to two’s complement, -0 bug
cfg_defines.h	`tt-metal/tt_metal/hw/inc/internal/tt-1xx/blackhole/cfg_defines.h`	`ALU_ACC_CTRL_Fp32_enabled`, `EnBFloatRTNE`, `Pac_LF8_4b_exp`, `Unp_LF8_4b_exp` bit positions
cpack_common.h	`tt-llk/tt_llk_blackhole/common/inc/cpack_common.h`	BFP exponent assembly, tile byte size computations
cunpack_common.h	`tt-llk/tt_llk_blackhole/common/inc/cunpack_common.h`	DataFormat enum values, BFP exponent consumption
ckernel_instr_params.h	`tt-llk/tt_llk_blackhole/common/inc/ckernel_instr_params.h`	Format constants and identifiers
pack-data-path.md	`emu-specs/pack-data-path.md`	Packer format conversion stages, BFP shared exponent assembly
unpack-data-path.md	`emu-specs/unpack-data-path.md`	Unpacker format conversion, L1 tile layouts, BFP expansion
dest-srca-srcb-registers.md	`emu-specs/dest-srca-srcb-registers.md`	Detailed SrcA/SrcB/Dest register file structure