december 30, 2025

rdna docs

human-readable RDNA3 documentation — instruction set reference, wavefront execution, register layout, and memory model explained with examples.

rdna documentation

This was taken from the RDNA3.5 ISA pdf.

states we need to keep track of

per program

Global memory (allocate 512mb box to start)
Program counter (PC) points to first instruction when wave is created

State	Description	Width / Range
TBA	trap base address	48-bit
TMA	trap memory address	48-bit

a note about branch jumps

branches jump to pc_of_the_instruction_after_the_branch + offset*4
get_pc and swap_pc are relative to the next instruction, not the current one.
all prior instructions have been issued but may or may not have completed execution
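The branch rule above can be sketched in a few lines. This is a minimal sketch, not the emulator's actual code: `branch_target` and `sign_extend_16` are hypothetical helper names, and it assumes the offset is a signed 16-bit immediate counted in dwords and that the branch instruction itself is 4 bytes long.

```python
def sign_extend_16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(branch_pc: int, simm16: int, inst_size: int = 4) -> int:
    """Byte address a branch jumps to.

    branch_pc: byte address of the branch instruction itself
    simm16:    raw 16-bit offset field (assumed signed, in dwords)
    inst_size: size of the branch instruction in bytes (assumed 4)
    """
    next_pc = branch_pc + inst_size          # pc of the instruction after the branch
    return next_pc + sign_extend_16(simm16) * 4


# offset 0 falls through to the next instruction
fallthrough = branch_target(0x100, 0)
# offset -1 (0xFFFF) lands back on the branch itself, i.e. a tight loop
self_loop = branch_target(0x100, 0xFFFF)
```

Note that an offset of -1, not 0, is what makes a branch jump to itself, because the base is the next instruction.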

state per wave

State	Description	Width / Range
SGPRs	scalar general purpose registers	s0–s105
VGPRs	vector general purpose registers	v0–v255 (32-bit)
LDS	local data share (scratch RAM); open question: do we need to emulate cache?	—
EXEC	top half not used in wave32	64-bit
EXECZ	exec is zero	1-bit
VCC	vector condition code	64-bit
VCCZ	vcc is zero	1-bit
SCC	scalar condition code	1-bit
Flat_scratch	base address for scratch memory used this wave (overflow registers)	48-bit
M0	misc reg	32-bit
TRAPSTS	trap status	32-bit
TTMP0-TTMP15	trap temporary SGPRs	32-bit
VMcnt	vmem load and sample instructions issued but not yet completed	6-bit
VScnt	vmem store instructions…	6-bit
EXPcnt	export/gds instructions (do we need this)	3-bit
LGKMcnt	lds, gds, constant and message count	6-bit
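One way to hold the per-wave state from the table is a small dataclass. This is only a sketch for wave32: the class name `WaveState` and its field names are made up here, trap-related registers are left out, and EXECZ / VCCZ are derived from EXEC / VCC instead of being stored.

```python
from dataclasses import dataclass, field

NUM_SGPRS = 106   # s0..s105
NUM_VGPRS = 256   # v0..v255
LANES = 32        # wave32 (assumption of this sketch)


@dataclass
class WaveState:
    sgprs: list = field(default_factory=lambda: [0] * NUM_SGPRS)
    # one 32-long list of 32-bit values per VGPR
    vgprs: list = field(
        default_factory=lambda: [[0] * LANES for _ in range(NUM_VGPRS)]
    )
    exec_mask: int = 0   # 64-bit, top half unused in wave32
    vcc: int = 0         # 64-bit
    scc: int = 0         # 1-bit
    m0: int = 0          # 32-bit
    vmcnt: int = 0       # outstanding-instruction counters
    vscnt: int = 0
    expcnt: int = 0
    lgkmcnt: int = 0

    @property
    def execz(self) -> int:
        """1 when EXEC is all zeros."""
        return int(self.exec_mask == 0)

    @property
    def vccz(self) -> int:
        """1 when VCC is all zeros."""
        return int(self.vcc == 0)


wave = WaveState()
wave.exec_mask = 0b101   # lanes 0 and 2 active
```

Deriving EXECZ and VCCZ on read means they can never get out of sync with the registers they describe.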

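PC: program counter. Holds the address of the next shader instruction to execute. Read/write only via scalar control flow instructions and indirectly using branch. The 2 LSBs are forced to zero: the PC is a byte address and instructions are aligned to 4 bytes, so the low two bits can never be set.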
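In an emulator, "2 LSBs are forced to zero" can be modeled by masking every value written to the PC. A minimal sketch: `write_pc` is a hypothetical helper, and the 48-bit PC width is an assumption borrowed from the other address-sized registers in these notes.

```python
PC_BITS = 48  # assumption: same width as the other address registers here


def write_pc(value: int) -> int:
    """Return the value the PC would actually hold after a write.

    Truncates to PC_BITS and clears the two low bits, so the PC is
    always a multiple of 4 bytes.
    """
    return value & ((1 << PC_BITS) - 1) & ~0b11


aligned = write_pc(0x1003)   # low bits dropped, PC lands on 0x1000
```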

EXECute Mask

Controls which threads in the vector are executed. 1=execute, 0=do not execute. Exec can be read/written via scalar instructions. Can be written as a result of vector-alu compare.

Exec affects: vector-alu, vector-memory, LDS, GDS, and export instructions. No effect on scalar execution / branches.

Wave64 uses all 64 bits, wave32 only uses 31:0.
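The EXEC rules above boil down to "bit N set means lane N runs". A small sketch of that, where `active_lanes` is a hypothetical helper name and `wave_size` selects how many bits of EXEC are looked at:

```python
def active_lanes(exec_mask: int, wave_size: int = 32) -> list[int]:
    """Return the lane indices whose EXEC bit is 1.

    In wave32 only bits 31:0 are considered; in wave64 all 64 bits are.
    """
    if wave_size not in (32, 64):
        raise ValueError("wave_size must be 32 or 64")
    return [lane for lane in range(wave_size) if (exec_mask >> lane) & 1]


lanes = active_lanes(0b1011)             # lanes 0, 1 and 3
upper_only = active_lanes(1 << 40, 32)   # bit 40 is ignored in wave32
```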

Instruction skipping (exec=0): todo: this makes no sense right now

SGPRs

106 normal SGPRs (s0–s105). vcc_lo and vcc_hi are technically stored in SGPR 106 and 107, respectively.

Alignment for SGPRs: the register pair must be even-aligned (start at an even register number) in these cases:

any time 64-bit data is used
scalar memory reads when the address-base comes from an SGPR pair (loading in arguments, i guess)

Other notes:

Writes to an out-of-range SGPR are ignored
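Putting the SGPR notes together, a scalar register file could look roughly like this. It is a sketch under assumptions: `ScalarRegs` is a made-up class, it treats every index other than 0–107 as out of range (it does not model TTMPs or other special operands), out-of-range reads return 0, and raising on a misaligned 64-bit access is a choice of this sketch, not documented hardware behavior.

```python
NUM_SGPRS = 106            # s0..s105
VCC_LO, VCC_HI = 106, 107  # VCC halves live right after the normal SGPRs
MASK32 = 0xFFFFFFFF


class ScalarRegs:
    def __init__(self):
        self.sgprs = [0] * NUM_SGPRS
        self.vcc = 0  # 64-bit

    def read32(self, idx: int) -> int:
        if 0 <= idx < NUM_SGPRS:
            return self.sgprs[idx]
        if idx == VCC_LO:
            return self.vcc & MASK32
        if idx == VCC_HI:
            return (self.vcc >> 32) & MASK32
        return 0  # assumption: out-of-range reads give 0 in this sketch

    def write32(self, idx: int, value: int) -> None:
        value &= MASK32
        if 0 <= idx < NUM_SGPRS:
            self.sgprs[idx] = value
        elif idx == VCC_LO:
            self.vcc = (self.vcc & ~MASK32) | value
        elif idx == VCC_HI:
            self.vcc = (self.vcc & MASK32) | (value << 32)
        # any other index: the write is ignored

    def write64(self, idx: int, value: int) -> None:
        if idx % 2 != 0:
            raise ValueError(f"64-bit SGPR access must be even-aligned, got s{idx}")
        self.write32(idx, value)            # low half
        self.write32(idx + 1, value >> 32)  # high half


regs = ScalarRegs()
regs.write32(200, 5)                  # out of range, silently dropped
regs.write64(VCC_LO, 0x1_0000_0002)   # writes vcc_lo=2, vcc_hi=1
```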

VCC: vector condition code, written by V_CMP and integer vector add/sub instructions. VCC is read by many instructions. It is a named SGPR pair, subject to the same dependency checks (?) as other SGPRs.

VGPRs: in wave32, each VGPR can be modeled by a 32-long array of 32-bit values, one entry per lane.
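A sketch of that model, combined with the EXEC rule from earlier (vector writes only touch lanes whose EXEC bit is set). `write_vgpr` is a hypothetical helper, and wave32 is assumed.

```python
LANES = 32  # wave32 (assumption of this sketch)


def write_vgpr(vgpr: list, values: list, exec_mask: int) -> None:
    """Write values into vgpr, but only for lanes enabled in exec_mask.

    vgpr and values are both LANES-long lists; results wrap to 32 bits.
    """
    for lane in range(LANES):
        if (exec_mask >> lane) & 1:
            vgpr[lane] = values[lane] & 0xFFFFFFFF


v0 = [0] * LANES
# only lanes 0 and 2 are active, so lane 1 keeps its old value
write_vgpr(v0, list(range(100, 100 + LANES)), exec_mask=0b101)
```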

data types

b32 (binary untyped 32-bit), this is not really used
b64
f16
f32
f64
bf16
i8
i16
i32
i64
u16
u32
u64
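Registers only hold raw bits; the type suffix says how an instruction interprets them. A quick sketch of viewing the same 32-bit value as u32, i32 or f32, using Python's `struct` module. The helper names (`as_i32`, `as_f32`, `f32_bits`) are made up for this example.

```python
import struct


def as_i32(bits: int) -> int:
    """Reinterpret a 32-bit pattern as a signed two's-complement integer."""
    bits &= 0xFFFFFFFF
    return bits - (1 << 32) if bits & 0x80000000 else bits


def as_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE-754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def f32_bits(x: float) -> int:
    """Return the 32-bit pattern of x rounded to single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


raw = 0x3F800000           # what the register actually stores
as_float = as_f32(raw)     # the same bits read as f32
as_signed = as_i32(0xFFFFFFFF)
```

Storing everything as unsigned 32-bit integers and converting only at the point of use keeps the register file itself type-agnostic.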

what changes in wave64 vs wave32

Every thread gets u
exec uses the entire 64 bits
vcc uses the entire 64 bits