april 16, 2025•approx 6m read•📁blackhole emulator

additional scalar unit instructions

Four additional Scalar Unit (ThCon) instructions (SHIFTDMAREG, BITWOPDMAREG, CMPDMAREG, SUBDMAREG) that operate on the GPR file, used infrequently but required for modeling completeness.

Additional Scalar Unit Instructions

Four additional Scalar Unit (ThCon) instructions operate on the GPR file. They are used infrequently in LLK kernels (~400–450 occurrences across 747 Blackhole ELFs) and are not needed for matmul or add1, but they appear in STALLWAIT block masks and must be modeled for completeness.

All four share the same execution characteristics as ADDDMAREG/MULDMAREG: they execute on the Scalar Unit (ThCon), which is fully serialized. They stall under STALLWAIT block bit B5 (STALL_THCON).

SHIFTDMAREG — Bitwise Shift GPR (opcode 0x5C)

Performs an unsigned bitwise shift (left or right) of one GPR by another GPR, or by a 5-bit immediate.

Encoding:

[31:24] = 0x5C
[23]    = OpBisConst  (0 = reg-reg, 1 = reg-immediate)
[20:18] = Mode        (3 bits — shift direction)
[17:12] = ResultRegIndex  (6 bits)
[11:6]  = OpBRegIndex     (6 bits — GPR index or 5-bit shift amount if OpBisConst=1)
[5:0]   = OpARegIndex     (6 bits — source GPR)

#define TT_OP_SHIFTDMAREG(OpBisConst, OpSel, ResultRegIndex, OpBRegIndex, OpARegIndex) \
    TT_OP(0x5c, (((OpBisConst) << 23) + ((OpSel) << 18) + ((ResultRegIndex) << 12) \
               + ((OpBRegIndex) << 6) + ((OpARegIndex) << 0)))

Modes:

Mode	Name	Operation
0	`SHIFTDMAREG_MODE_LEFT`	`Result = Left << Right`
1	`SHIFTDMAREG_MODE_RIGHT`	`Result = Left >> Right` (unsigned)

Functional model:

def SHIFTDMAREG(OpBisConst, mode, result_reg, right_reg_or_imm, left_reg):
    left_val = GPRs[CurrentThread][left_reg]
    if OpBisConst:
        right_val = right_reg_or_imm & 0x1F    # 5-bit immediate
    else:
        right_val = GPRs[CurrentThread][right_reg_or_imm] & 0x1F

    if mode == 0:    # LEFT
        result = (left_val << right_val) & 0xFFFFFFFF
    elif mode == 1:  # RIGHT
        result = left_val >> right_val
    else:
        raise UndefinedBehaviour()

    GPRs[CurrentThread][result_reg] = result

Performance: 3 cycles (immediate variant, or both regs in same aligned group of 4 GPRs), 4 cycles otherwise.

BITWOPDMAREG — Bitwise AND/OR/XOR on GPR (opcode 0x5B)

Performs a bitwise AND, OR, or XOR between two GPRs, or between a GPR and a 6-bit immediate.

Encoding:

[31:24] = 0x5B
[23]    = OpBisConst  (0 = reg-reg, 1 = reg-immediate)
[20:18] = OpSel       (3 bits — operation select)
[17:12] = ResultRegIndex  (6 bits)
[11:6]  = OpBRegIndex     (6 bits — GPR index or 6-bit constant)
[5:0]   = OpARegIndex     (6 bits — source GPR)

#define TT_OP_BITWOPDMAREG(OpBisConst, OpSel, ResultRegIndex, OpBRegIndex, OpARegIndex) \
    TT_OP(0x5b, (((OpBisConst) << 23) + ((OpSel) << 18) + ((ResultRegIndex) << 12) \
               + ((OpBRegIndex) << 6) + ((OpARegIndex) << 0)))

Modes:

OpSel	Name	Operation
0	`BITWOPDMAREG_MODE_AND`	`Result = A & B`
1	`BITWOPDMAREG_MODE_OR`	`Result = A \| B`
2	`BITWOPDMAREG_MODE_XOR`	`Result = A ^ B`

Functional model:

def BITWOPDMAREG(OpBisConst, mode, result_reg, right_reg_or_imm, left_reg):
    left_val = GPRs[CurrentThread][left_reg]
    if OpBisConst:
        right_val = right_reg_or_imm & 0x3F    # 6-bit immediate
    else:
        right_val = GPRs[CurrentThread][right_reg_or_imm]

    if mode == 0:    result = left_val & right_val
    elif mode == 1:  result = left_val | right_val
    elif mode == 2:  result = left_val ^ right_val
    else:            raise UndefinedBehaviour()

    GPRs[CurrentThread][result_reg] = result

Performance: 3 cycles (immediate variant, or both regs in same aligned group of 4 GPRs), 4 cycles otherwise.

CMPDMAREG — Compare GPRs (opcode 0x5D)

Unsigned comparison (GT, LT, EQ) between two GPRs, or between a GPR and a 6-bit immediate. Result is 0 or 1.

Encoding:

[31:24] = 0x5D
[23]    = OpBisConst  (0 = reg-reg, 1 = reg-immediate)
[20:18] = OpSel       (3 bits — comparison mode)
[17:12] = ResultRegIndex  (6 bits)
[11:6]  = OpBRegIndex     (6 bits — GPR index or 6-bit constant)
[5:0]   = OpARegIndex     (6 bits — source GPR)

#define TT_OP_CMPDMAREG(OpBisConst, OpSel, ResultRegIndex, OpBRegIndex, OpARegIndex) \
    TT_OP(0x5d, (((OpBisConst) << 23) + ((OpSel) << 18) + ((ResultRegIndex) << 12) \
               + ((OpBRegIndex) << 6) + ((OpARegIndex) << 0)))

Modes:

OpSel	Name	Operation
0	`CMPDMAREG_MODE_GT`	`Result = (A > B) ? 1 : 0`
1	`CMPDMAREG_MODE_LT`	`Result = (A < B) ? 1 : 0`
2	`CMPDMAREG_MODE_EQ`	`Result = (A == B) ? 1 : 0`

Functional model:

def CMPDMAREG(OpBisConst, mode, result_reg, right_reg_or_imm, left_reg):
    left_val = GPRs[CurrentThread][left_reg]
    if OpBisConst:
        right_val = right_reg_or_imm & 0x3F    # 6-bit unsigned immediate
    else:
        right_val = GPRs[CurrentThread][right_reg_or_imm]

    if mode == 0:    result = 1 if left_val > right_val else 0
    elif mode == 1:  result = 1 if left_val < right_val else 0
    elif mode == 2:  result = 1 if left_val == right_val else 0
    else:            raise UndefinedBehaviour()

    GPRs[CurrentThread][result_reg] = result

All comparisons are unsigned. Performance: same as SHIFTDMAREG/BITWOPDMAREG.

FLUSHDMA — Occupy Scalar Unit Until Conditions Met (opcode 0x46)

Stalls the issuing thread and all other threads trying to use the Scalar Unit until selected conditions are met. In almost every case, STALLWAIT should be preferred — it waits without blocking other threads’ Scalar Unit access.

Encoding:

[31:24] = 0x46
[3:0]   = ConditionMask  (4 bits — conditions C0–C3)

#define TT_OP_FLUSHDMA(FlushSpec) TT_OP(0x46, (((FlushSpec) << 0)))

Condition mask:

Bit	Condition	Keep waiting if…
C0	Scalar Unit memory	The Scalar Unit has outstanding memory requests for the current thread
C1	Unpacker 0	The current thread has an instruction in any stage of Unpacker 0’s pipeline
C2	Unpacker 1	The current thread has an instruction in any stage of Unpacker 1’s pipeline
C3	Packer 0	The current thread has an instruction in any stage of Packer 0’s pipeline

If ConditionMask == 0, it defaults to 0xF (all conditions). The instruction waits until all selected conditions are simultaneously met (i.e., none of the “keep waiting” conditions are true). These condition bits coincide exactly with the low four bits of STALLWAIT’s condition mask.

Functional model:

def FLUSHDMA(condition_mask):
    if condition_mask == 0:
        condition_mask = 0xF

    # Block the Scalar Unit for all threads until conditions are met
    while any_selected_condition_indicates_busy(condition_mask, CurrentThread):
        wait()  # stalls this thread AND any other thread trying to use ThCon

Performance: At least 2 cycles, plus however long the wait takes.

Emulator note: For a synchronous emulator, FLUSHDMA is functionally equivalent to STALLWAIT with block mask 0x20 (STALL_THCON) and the same condition bits. The distinction (blocking other threads’ ThCon access) only matters for cycle-accurate timing.

AutoTTSync Classification

All four instructions share the same STALLWAIT behavior:

Class	Instructions	Behavior
0	ADDDMAREG, SUBDMAREG, MULDMAREG, BITWOPDMAREG, SHIFTDMAREG, CMPDMAREG, SETDMAREG	Read and write Tensix GPRs
7	FLUSHDMA	Write TDMA-RISC state (synchronization barrier)

Encoding Quick Reference

Instruction	Opcode	Field Layout
BITWOPDMAREG	0x5B	`[23]` OpBisConst, `[20:18]` OpSel, `[17:12]` Result, `[11:6]` OpB, `[5:0]` OpA
SHIFTDMAREG	0x5C	(same — but immediate is 5-bit, not 6-bit)
CMPDMAREG	0x5D	(same as BITWOPDMAREG)
FLUSHDMA	0x46	`[3:0]` ConditionMask

Source References

Source	Path
ISA functional models	`tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/{SHIFTDMAREG,BITWOPDMAREG,CMPDMAREG,FLUSHDMA}.md`
Blackhole C macros	`tt-llk/tt_llk_blackhole/common/inc/ckernel_ops.h`
Blackhole assembly YAML	`tt-llk/tt_llk_blackhole/instructions/assembly.yaml`
Python instruction encoders	`tt-exalens/ttexalens/hardware/blackhole/tensix_ops.py`
Instruction frequency data	`boop-docs/llk-sfpi/instruction-frequency-report.md`
AutoTTSync classes	`tt-isa-documentation/BlackholeA0/TensixTile/BabyRISCV/AutoTTSync.md`