april 16, 2025•approx 6m read•📁blackhole emulator

mutexes

Hardware mutexes in Tensix Sync Unit for exclusive mutual exclusion between threads: 4 mutexes (math, unpack0, unpack1, pack0) with ATGETM/ATRELM acquire/release instructions.

Mutexes

Hardware mutexes in the Tensix Sync Unit provide exclusive mutual exclusion among the three coprocessor threads (T0, T1, T2) within a single tile. They are manipulated by two instructions: ATGETM (acquire) and ATRELM (release).

Mutexes are unrelated to semaphores. Semaphores are counting primitives for producer/consumer synchronization (see semaphores.md). Mutexes are exclusive locks for protecting shared register read-modify-write sequences.

Hardware State

There are 4 mutexes per tile. Each mutex holds a 2-bit owner field:

Mutex[i].HeldBy : enum { Nobody, T0, T1, T2 }

Valid indices: 0, 2, 3, 4. Index 1 is invalid. Indices > 4 are invalid. Using an invalid index causes the issuing thread to wait forever.

Index	Name	Typical use
0	`math`	Atomic config register read-modify-write (`REG_RMW`) between threads
2	`unpack0`	Unpacker 0
3	`unpack1`	Unpacker 1
4	`pack0`	Packer 0 / SFPU (SFPU instructions can be issued by both T1 and T2)

Initial state at reset: all mutexes are Nobody (not held).

Comparison with Wormhole B0

Wormhole B0 has 7 mutexes (indices 0, 2, 3, 4, 5, 6, 7) — the extra three are pack1 (5), pack2 (6), pack3 (7). Index 1 is still invalid. Blackhole reduced the count to 4.

Instruction Encodings

Both instructions are 32 bits wide. The top 8 bits are the opcode, the bottom 24 bits are the mutex_index field (only the low 3 bits matter in practice).

ATGETM — Acquire Mutex (opcode `0xA0`)

 31      24 23                              0
┌─────────┬──────────────────────────────────┐
│ 0xA0    │         mutex_index              │
│ [31:24] │            [23:0]                │
└─────────┴──────────────────────────────────┘

#define TT_OP_ATGETM(mutex_index) TT_OP(0xa0, ((mutex_index) << 0))
// word = (0xA0 << 24) | (mutex_index & 0xFFFFFF)

ATRELM — Release Mutex (opcode `0xA1`)

 31      24 23                              0
┌─────────┬──────────────────────────────────┐
│ 0xA1    │         mutex_index              │
│ [31:24] │            [23:0]                │
└─────────┴──────────────────────────────────┘

#define TT_OP_ATRELM(mutex_index) TT_OP(0xa1, ((mutex_index) << 0))
// word = (0xA1 << 24) | (mutex_index & 0xFFFFFF)

Both instructions use execution resource SYNC and are routed to the Sync Unit backend.

Functional Model

ATGETM (Acquire)

The instruction blocks at the Wait Gate until it can acquire the mutex, then proceeds through the Sync Unit in 1 cycle.

void ATGETM(uint thread_id, uint index) {
    // 1. Validate index
    if (index == 1 || index > 4) {
        while (true) { wait; }  // infinite stall
    }

    // 2. Wait for availability
    if (Mutex[index].HeldBy == thread_id) {
        // Already held by this thread — reentrant acquire.
        // May wait 1-2 cycles due to contention with other threads'
        // concurrent ATGETM/ATRELM on the same mutex.
    } else {
        // Spin at the Wait Gate until the mutex is free.
        while (Mutex[index].HeldBy != Nobody) {
            wait;
        }
    }

    // 3. Acquire
    Mutex[index].HeldBy = thread_id;
}

If multiple threads are waiting for the same free mutex simultaneously, one is chosen to acquire it. The fairness guarantee comes from ATRELM (see below).

ATRELM (Release)

void ATRELM(uint thread_id, uint index) {
    // 1. Validate index
    if (index == 1 || index > 4) {
        while (true) { wait; }  // infinite stall
    }

    // 2. May wait 1-2 cycles due to contention with concurrent
    //    ATGETM/ATRELM from other threads on the same mutex.

    // 3. Release (only if held by this thread)
    if (Mutex[index].HeldBy == thread_id) {
        Mutex[index].HeldBy = Nobody;
    }
    // If not held by this thread, instruction completes with no effect.
}

Round-robin fairness: When thread i releases a mutex and both other threads are waiting on it via ATGETM, thread (i + 1) % 3 is chosen as the next acquirer.

Timing and Throughput

Property	Value
Latency	1 cycle (once through the Wait Gate)
Throughput	Up to 3 ATGETM/ATRELM per cycle, if they reference different mutexes
Contention delay	1-2 extra cycles if multiple threads touch the same mutex simultaneously
Blocked-thread behavior	Thread stalls at the Wait Gate; no instructions from that thread proceed past the gate

Semaphore instructions (SEMINIT, SEMPOST, SEMGET, SEMWAIT) share the Sync Unit but have independent throughput — at most 1 semaphore instruction per cycle. Mutex and semaphore instructions can execute concurrently.

Interaction with STALLWAIT

ATGETM and ATRELM are classified as Sync Unit instructions. They are blocked by the STALL_SYNC (B1, value 0x02) bit in a STALLWAIT/SEMWAIT block mask.

If a thread has a latched STALLWAIT with B1 set and the wait condition has not yet been met, any ATGETM or ATRELM from that thread is held at the Wait Gate until the STALLWAIT condition clears.

See stallwait-conditions.md for the full block mask reference.

Per-Core Access

Mutexes are only accessible from the three Tensix coprocessor threads (T0, T1, T2) via pushed Tensix instructions.

Core	Can use ATGETM/ATRELM?	Notes
TRISC0 (T0)	Yes	Pushes instructions to T0’s FIFO
TRISC1 (T1)	Yes	Pushes instructions to T1’s FIFO
TRISC2 (T2)	Yes	Pushes instructions to T2’s FIFO
BRISC	No	Cannot push ATGETM/ATRELM
NCRISC	No	Cannot push ATGETM/ATRELM

There is no memory-mapped interface for mutexes (unlike semaphores, which have the PCBuf semaphore window). The only way to manipulate mutexes is through the Tensix instruction FIFO.

Emulator Implementation

State

# 4 mutexes. HeldBy is None (nobody) or 0/1/2 (thread id).
mutex_held_by = [None] * 5  # indexed 0-4; index 1 is unused/invalid

VALID_MUTEX_INDICES = {0, 2, 3, 4}

Decoding

def decode(word):
    opcode = (word >> 24) & 0xFF
    mutex_index = word & 0xFFFFFF  # only low bits matter
    return opcode, mutex_index

Execution

def exec_atgetm(thread_id, index):
    """Returns True if acquired, False if must stall."""
    if index not in VALID_MUTEX_INDICES:
        return False  # stall forever (or raise in emulator)

    held = mutex_held_by[index]
    if held is None or held == thread_id:
        mutex_held_by[index] = thread_id
        return True   # acquired
    else:
        return False  # stall — re-evaluate next cycle

def exec_atrelm(thread_id, index):
    """Always completes (never stalls, beyond 1-2 cycle contention)."""
    if index not in VALID_MUTEX_INDICES:
        return False  # stall forever (or raise in emulator)

    if mutex_held_by[index] == thread_id:
        mutex_held_by[index] = None
    # else: no effect
    return True

For a cycle-accurate emulator, exec_atgetm should be called each cycle while the thread’s instruction pointer is parked on the ATGETM. The thread’s pipeline stalls (no further instructions issue) until the function returns True.

For a functional emulator that doesn’t model cycle-level timing, you can treat ATGETM as an immediate acquire if the mutex is free, and use a simple scheduling policy (round-robin or arbitrary) to resolve contention when multiple threads attempt to acquire the same mutex in the same “step.”

Fairness (optional for functional emulation)

If modeling round-robin fairness: when exec_atrelm releases mutex i from thread t, and both other threads are stalled on ATGETM for mutex i, the next acquirer should be thread (t + 1) % 3.

Real-World Usage Pattern

The primary use of mutexes in existing kernels is protecting shared config register read-modify-write (RMW) sequences between T0 (unpack) and T2 (pack), since both threads need to modify ALU_FORMAT_SPEC_REG registers:

// T0 (cunpack_common.h) — unpack config
t6_mutex_acquire(mutex::REG_RMW);    // ATGETM(0)
cfg_reg_rmw_tensix<ALU_FORMAT_SPEC_REG_SrcA_val_ADDR32, ...>(alu_src_format);
// ... more RMW operations ...
t6_mutex_release(mutex::REG_RMW);    // ATRELM(0)

// T2 (cpack_common.h) — pack config
t6_mutex_acquire(mutex::REG_RMW);    // ATGETM(0)
cfg_reg_rmw_tensix<ALU_FORMAT_SPEC_REG2_Dstacc_RMW>(pack_output_src_format);
cfg_reg_rmw_tensix<STACC_RELU_ApplyRelu_ADDR32, ...>(relu_config);
t6_mutex_release(mutex::REG_RMW);    // ATRELM(0)

The C++ layer provides an RAII guard for convenience:

// ckernel_mutex_guard.h
{
    T6MutexLockGuard guard(mutex::REG_RMW);
    // critical section — automatically released on scope exit
}