Idli: Instruction Set
Created: 2026-04-25, Updated: 2026-05-08
This article will be split into two main sections. First up I want to go through the instructions that are implemented in the current version of Idli and briefly describe how they work. After that I'd like to go through the things I'm not happy with and would like to change in the future.
Current Instructions
Let's look at the instructions as defined in the current version of the machine, loosely grouped together into functional categories. Each instruction is provided with its encoding and signature, with operand types indicated by the letter used:
AandBare standard register operands.Ais typically a destination whileBis a source, but these rules are a little flexible for memory operations.Cis a register operand, except when the encoded value isR15i.e.SP, in which case the sixteen bits following the instruction are treated as an immediate instead.Nis an unsigned integer indicating a general purpose input or output pin.Mis a conditional execution mask indicating how the instructions following the current one should be predicated.RandSindicate a range of registersR..Sinclusive. This wraps around, soR10..R2indicatesR10..R15followed byR0..R2. An empty range cannot be encoded:R1..R1is the same as justR1.Jis a four bit unsigned immediate used for system operations that modify the count register.
Now let's dig into the instructions. In no particular order:
Arithmetic
0000 AAAA BBBB CCCC ADD A, B, C # A = B + C
0001 AAAA BBBB CCCC SUB A, B, C # A = B - C
Standard addition and subtraction operations. There is some redundancy here when
C is an immediate but given we need both for when it's a register it isn't
worth worrying about. Combining ADD with the zero register also gives us a
couple of handy synonyms:
MOV A, C -> ADD A, ZR, C
NOP -> ADD ZR, ZR, ZR
This also means the encoding of the canonical NOP is encoded as all zero, so
if we happen to read some memory that's been cleared we won't do anything too
unexpected.
1010 AAAA BBBB 1000 INC A, B # A = B + 1
1010 AAAA BBBB 1001 DEC A, B # A = B - 1
Addition and subtraction of one is quite a common operation, but encoding this
using the standard ADD and SUB instructions results in an extra sixteen bits
of consumption. To avoid this overhead, INC and DEC were introduced. In the RTL
these are implemented by setting the carry in to the ALU.
1100 AAAA ???0 CCCC ADDPC A, C # A = PC + C
We need a way to access information relative to the current program counter, and
as PC isn't part of the register file a separate instruction is needed. This
is essentially a branch instruction except the result is written to A instead
of PC.
Logical
0010 AAAA BBBB CCCC AND A, B, C # A = B & C
0011 AAAA BBBB CCCC ANDN A, B, C # A = B & ~C
0100 AAAA BBBB CCCC OR A, B, C # A = B | C
0101 AAAA BBBB CCCC XOR A, B, C # A = B ^ C
1010 AAAA BBBB 1110 NOT A, B # A = ~B
The standard set of logical operations - there's not much to say here!
One interesting point is that we need an actual instruction for logical negation,
but not arithmetic negation which can be implemented as the synonym
SUB A, ZR, B. To implement NOT as a synonym we'd need to have an all-ones
register or use an immediate like XOR A, B, 0xffff. This would result in
a double-width instruction due to the immediate, so it seemed worth including
NOT, but given how infrequently it actually turns up perhaps it isn't worth the
encoding space.
There's also no point in having NOT take C instead of B as an immediate is
always a full sixteen bits, so there's no situation where the negated form couldn't
be encoded directly anyway. This also means we can perform logical negation on
SP, but so far I haven't found a practical use for it.
Shift
1010 AAAA BBBB 1010 SRL A, B # A = { 1'b0, B[15:1]}
1010 AAAA BBBB 1011 SRA A, B # A = {B[15], B[15:1]}
1010 AAAA BBBB 1100 ROR A, B # A = {B[ 0], B[15:1]}
1010 AAAA BBBB 1101 ROL A, B # A = {B[14:0], B[15]}
Shifts in Idli are all a single bit for two reasons:
- Barrel shifters are quite expensive in terms of area and I wasn't sure it would be worth the space.
- The encoding space is limited and having even more three-operation instructions wouldn't fit.
This does make performing multi-bit shifts quite painful, especially if you don't know the shift amount until runtime as this means every bit shifted will also incur a branch redirection penalty.
SRL and SRA can both shift via the internal carry register for performing shifts
over integers larger than sixteen bits. Reviewing the implementation now I can't
remember why I didn't also support this for ROR and ROL, so perhaps this is
something to look at as an improvement.
Given we only have single bit shifts, LSL A, B is implemented as the synonym
ADD A, B, B to save on encoding space.
Comparison
1011 X000 BBBB CCCC EQ{X} B, C # P = (B == C)
1011 X001 BBBB CCCC NE{X} B, C # P = (B != C)
1011 X010 BBBB CCCC LT{X} B, C # P = (B < C) (signed)
1011 X011 BBBB CCCC LTU{X} B, C # P = (B < C) (unsigned)
1011 X100 BBBB CCCC GE{X} B, C # P = (B >= C) (signed)
1011 X101 BBBB CCCC GEU{X} B, C # P = (B >= C) (unsigned)
1011 X110 BBBB CCCC ANY{X} B, C # P = |(B & C)
1011 X111 ??NN CCCC INP{X} B, C # P = PIN(N)
Idli provides a full suite of comparison operations. These all write their results
to the predicate register P. Each instruction has two forms: standard and X.
The X form of an instruction performs the update of P as normal, but also
updates the COND register to predicate the following instruction such that it
only runs if the result of the comparison is true. This gives us a couple of
significant benefits:
- Code density is improved, as performing a comparison and updating
CONDare performed in a single instruction rather than requiring an explicit update viaCEX(see next section for details on this instruction). - There's no need to encode any conditional branch instructions, saving a chunk of encoding space. Yes, we do need to encode two forms of every comparison, but this allows us to predicate anything so is much more flexible.
Control Flow
There are two ways to modify control flow in Idli. The first are branches and jumps:
1100 ??00 ???1 CCCC B C # PC += C
1100 ??01 ???1 CCCC J C # PC = C
1100 ??10 ???1 CCCC BL C # LR = NEXT_PC, PC += C
1100 ??11 ???1 CCCC JL C # LR = NEXT_PC, PC = C
Branches add an offset to the program counter while jumps overwrite it directly.
The link register is also optionally updated to the PC following the branch or
jump, allowing for return from function calls without needing to access the
stack. The RET instruction is implemented as the synonym J LR.
The second form of control flow modification is predication via the COND
register. This can be implicitly updated via the X comparisons discussed above
or by directly writing it using CEX.
1110 ???0 MMMM MMMM CEX M # COND = M
The M operand holds a mask indicating how the following instructions should be
predicated. For each bit i, if M[i] matches the value of P the instruction
will be executed, otherwise it will be skipped. This provides an extremely
flexible way to avoid the costly branch redirection penalty when running short
conditional sequences.
The mask is indicated in assembly language by adding a .T or .F suffix
to the instructions following the CEX, which itself indicates the number
of instructions to be predicated. For example:
CEX 4
ADD.T R1, R2, R3 # IF( P): R1 = R2 + R3
ADD.F R1, R4, R5 # IF(!P): R1 = R4 + R5
INC.F R2, R10 # IF(!P): R2 = R10 + 1
AND.T R2, R2, R11 # IF( P): R2 = R2 & R11
In this case, if P was true then the first and final instruction would be run,
otherwise the second and third would be executed.
Memory Access
There are a few different ways of accessing memory in Idli. First up are the standard load and store instructions:
0110 AAAA BBBB CCCC LD A, B, C # A = [B + C]
0111 AAAA BBBB CCCC ST A, B, C # [B + C] = A
These allow for indexing into memory relative to a base register. As Idli has sixteen bit bytes to match the register width, as discussed previously, we don't need to support different widths and sign or zero extension.
Auto-incrementing load and store operations are also supported:
1010 AAAA BBBB 0000 LD+ A, B # A = [B++]
1010 AAAA BBBB 0001 ST+ A, B # [B++] = A
1010 AAAA BBBB 0010 +LD A, B # A = [++B]
1010 AAAA BBBB 0011 +ST A, B # [++B] = A
1010 AAAA BBBB 0100 LD- A, B # A = [B--]
1010 AAAA BBBB 0101 ST- A, B # [B--] = A
1010 AAAA BBBB 0110 -LD A, B # A = [--B]
1010 AAAA BBBB 0111 -ST A, B # [--B] = A
These allow for incrementing or decrementing the base register either before or after the memory is accessed. This is particularly common in loops and can often save us from needing to increment base addresses manually. In the RTL writeback of the base register is performed before the memory access, so stores of the base register will show the updated value and loads will overwrite it.
As Idli uses serial memories which share instructions and data, each access
requires a redirection to perform the access followed by a redirection back to
the PC to continue execution. As such, load and store multiple operations are
also provided to allow for batching accesses together to minimise the overhead:
1000 RRRR SSSS BBBB LDM R..S, B # R..S = [B, B + 1, ...]
1001 RRRR SSSS BBBB STM R..S, B # [B, B + 1, ...] = R..S
These both take register ranges, and take advantage of the sequential mode of
the attached memories to perform multiple read or write operations with a single
redirection. This greatly enhances performance for batch memory access operations
such as memcpy.
A couple of synonyms are also provided for common stack operations:
PUSH A -> -ST A, SP
POP A -> LD+ A, SP
System
System operations are a catch-all group for things that didn't really fit into
any other category. The boring ones are for moving the predicate register P
into a general purpose register:
1101 ???1 ??10 AAAA GETP A # A = P
1101 ???1 ??11 CCCC PUTP C # P = C & 1
More interesting are the operations that update the COUNT register, which are
used to modify the behaviour of a number of instructions following the update.
1110 ???1 ??00 JJJJ CARRY J # COUNT = {CARRY, J}
1110 ???1 ??01 JJJJ ANDP J # COUNT = { ANDP, J}
1110 ???1 ??10 JJJJ ORP J # COUNT = { ORP, J}
Let's look at CARRY first. By default the processor clears the internal carry
flag after each operation, but this instruction allows us to override this
behaviour and propagate the carry forward for J instructions instead.
For example, if R2..R5 and R6..R9 contain two 64b numbers, we can perform
a full 64b add efficiently by writing
CARRY 4
ADD R2, R2, R6
ADD R3, R3, R7
ADD R4, R4, R8
ADD R5, R5, R9
This saves us from needing to encode a specific add with carry or subtract with
borrow. It's also worth noting that COUNT is a separate register to COND,
allowing for both to be used simultaneously. This can be useful in a number of
scenarios such as implementing a 32b soft multiply.
ANDP and ORP are used to efficiently build up complex boolean expressions
by changing how P is updated. By default comparisons replace the value of the
predicate register, but these allow for the result to be ANDed or ORed into P
instead. Let's say you want to branch if the following condition is true:
(R1 == R2 || R3 == R4) && (R5 >= R6 || R7 != R8 || R9 == R10). We could write
this in Idli assembler as:
EQ R1, R2 # P = R1 == R2
ORP 1
EQ R3, R4 # P |= R3 == R4
GETP R12 # TMP = P
GE R5, R6 # P = R5 >= R6
ORP 2
NE R7, R8 # P |= R7 != R8
EQ R9, R10 # P |= R9 == R10
ANDP 1
NEX R12, ZR # IF (P && TMP):
B.T @TARGET # GOTO TARGET
This saves us from writing out a number of conditional branches, improving both performance and code density.
IO
Communication with the outside world can be performed via general purpose IO pins or a built-in UART.
1101 ???0 00NN AAAA IN A, P # A = PIN(N)
1101 ???0 01NN CCCC OUT N, C # PIN(N, C & 1)
1101 ???0 10NN CCCC OUTN N, C # PIN(N ~(C & 1))
1101 ???0 11NN 0000 OUTP N # PIN(N, P)
Pin instructions allow for getting or setting an arbitrary value on each of the
four general purpose input or output pins from either a general purpose or
predicate register. OUTN primarily exists to allow for writing a one to an
output pin via OUTN N, ZR rather than needing in incur the sixteen bit immediate
cost of writing OUT N, 1.
1101 ???1 ??00 AAAA URX A # A = UART()
1101 ???1 ??01 CCCC UTX C # UART(C)
These instructions allow for receiving (RX) or transmitting (TX) a sixteen bit value across the UART interface. Each read or write is split into two eight bit transactions with a single stop bit. The low eight bits are processed in the first transaction.
Idli has a single sixteen bit buffer to hold incoming data from the UART. If a
URX is performed and the buffer is empty, the processor will stall until a
full sixteen bits of data have been received. Flow control is not implemented, so
if the buffer is full and additional data is sent into the core it will be
silently dropped.
Future Improvements
Since implementing the latest version of Idli I've been considering what I don't like and having ideas about how to improve it. A few these are listed below:
- Remove branch/jump without link
- Remove ANDN
- Add a Program Status Register
- Improve CEX Syntax & Checking
- Multiply Acceleration
- Unify IO Instructions
- Add LD/ST with Immediate Writeback
- Add PC-relative LD/ST
- Rework GETP and PUTP
- Add Rotate Through Carry
- Reduce COUNT Size
- Replace INP{X} With Something More Useful
Remove Branch and Jump Without Link
Given writes to the zero register are ignored B and J can be implemented as
synonyms instead of requiring their own encodings:
B C -> BL ZR, C
J C -> JL ZR, C
This saves a little encoding space and simplifies decode without any downsides so it seems like an obvious change to make.
In the current implementation LR is also baked into the RTL as a constant so
it can be implicitly updated. Having the link register defined has certain benefits
for high-performance microarchitectures (e.g. link register caching), but given
Idli is specifically optimised as a low-area in-order core this isn't really
relevant. I think I'll keep R14 defined as LR just for convenience use with
synonyms like RET, but there will no longer be anything special about it in
the microarchitecture.
Remove ANDN
In the real world ANDN is almost always used with an immediate to e.g. mask
off the bottom bits of a register for address alignment. In Idli, all immediates
passed into the ALU are a full sixteen bits, so there's no penalty for writing
ANDN R1, R2, 0x7 as AND R1, R2, 0xfff8 instead.
Given this, keeping ANDN doesn't seem like it's worth the encoding space of a
three-operand instruction, so I think it makes sense to remove it. This does mean
an ANDN with a register operand will need one extra instruction to negate the
mask, but it's so infrequent that I'd rather gain back the encoding space for
something more useful.
Add a Program Status Register
In the current machine there's no way to read back COND, COUNT, or the carry
flag. These could be combined with P into a Program Status Register, PSR,
which can then be accessed with an instruction like GETPSR A. This would also
allow the random test generator to check previously inaccessible state,
strengthening verification.
I'm not sure yet if a PUTPSR should also be added. This would be useful for
saving/restoring when context switching, but this wasn't ever something that was
originally intended to be supported and I'm not sure I want to deal with the
edge cases it could introduce e.g. updating COND with a value derived at
runtime.
Improve CEX Syntax & Checking
In the current assembler a CEX instruction is followed by a number indicating
the number of instructions after it that are predicated, with the predication
mask deduced by .T and .F suffixes of those instructions following it. This
complicates the assembler as it needs to look ahead to deduce the value of M,
so an improvement would be to specify the mask directly as part of the CEX
itself, then assert that this matches the suffix of the followers. For example:
OLD NEW
------------------ ------------------
CEX 2 CEX TF
SUB.T R1, R2, R3 SUB.T R1, R2, R3
SUB.F R2, R1, R3 SUB.F R2, R1, R3
This should also help to catch mistakes where a .T or .F suffix has been
misplaced or unintentionally inverted.
Another common case is to predicate a group of instructions in the same way, so
to make this easier to work with it would be nice for the assembler to support
Python-like replication syntax. This would allow for writing CEX T*5 instead
of CEX TTTTT which is much easier to read.
The assembler should also be enhanced to check that a branch instruction can only appear at the end of a predicated block. Branching away mid-predication leads to undefined behaviour, as the instructions at the branch target won't have been written with predication in mind, so this just catches a common source of error at assembly time rather than needing runtime debug.
An alternative to this would be to define that taken branches clear COND. This
would allow for writing things like:
CEX TTFF
MOV.T R1, R2
B.T @TARGET1
MOV.F R1, R4
B.F @TARGET2
I could see this being useful so it may be preferable to take this approach. At this stage though more thought is required!
Multiply Acceleration
The current machine has no multiply instruction, which makes multiplying two numbers exceedingly slow. An example of a soft multiply implemented in Idli assembler is part of the FNV-1a hash algorithm (see here for the full listing):
2: eq r5, zr # done = prime0 == 0
andp 1 # (and next compare)
eq r6, zr # done &= prime1 == 0
cex 3 # if p:
mov.t r1, r8 # hash0 = mul0
mov.t r2, r9 # hash1 = mul1
b.t @1b # goto 1b
any r5, 0x1 # p = (prime0 & 1) != 0
cex 3 # if p:
carry.t 2 # (sticky carry)
add.t r8, r8, r1 # c, mul0 = mul0 + hash0
add.t r9, r9, r2 # mul1 = mul1 + hash1 + c
carry 2 # (shift through carry)
srl r6, r6 # c, prime1 >>= 1
srl r5, r5 # prime0 = (c << 15) | (prime0 >> 1)
carry 2 # (shift through carry)
sll r1, r1 # c, hash0 <<= 1
sll r2, r2 # hash1 = (hash1 << 1) | c
b @2b # goto 2b
Not very performant! So how can we make this faster?
The most obvious answer is to introduce a MUL A, B, C instruction that writes
the low sixteen bits of the product of B and C to A. This is obviously a
huge win for performance and code density, but multiplier hardware consumes a
significant amount of area. Given Idli is four bit serial it may not be too bad,
but I need to do some synthesis experiements to know for sure.
An alternative way to add further acceleration is adding a barrel shifter. This would allow for fast multiplies by powers of two, from which other multiplications could be composed. Again, I'm not sure if I'd like to spend the area on the introduction of a shifter, so some further experimentation is needed. This also consumes a significant amount of encoding space, as each shift operation now takes a full three operands.
Another possibility would be to introduce instructions that always shift by four bits. This would save on encoding space and could be implemented relatively efficiently in the hardware as the core is already operating on four bit slices. Right shifts fall out naturally from this, as we can nudge each four bit slice further down the register on each tick. Left shits are unfortunately harder, as we need somewhere to store the slice from the previous cycle after writing the current one. Introducing a four bit flop to save the state will definitely be better for area than the multiply or barrel shifter, but whether it's worth it for the performance gain requires some experimentation.
Unify IO Instructions
It would make the instruction set cleaner to only have two instructions of non-memory
IO which operate on their own address space: IN A, F reads from IO address
F into A, and OUT F, P performs the opposite. F could be an eight bit
address space indicating some kind of input/output device. To match the
capabilities of the current machine, this could be defined as:
7 ???
6 ???
5 ???
4 UART
3 GPIO_3
2 GPIO_2
1 GPIO_1
0 GPIO_0
The original UART and pin instructions can then be defined as synonyms, and we also have space for three more IO interfaces such as SPI or I2C. This also gives a more uniform encoding format for the IO instructions, which should help reduce area by being easier to decode.
The only IO functionality this loses is reading or writing the predicate
register directly to a pin at the cost of an extra instruction to move P
between the registers. Given how infrequently this actually comes up in
practice I think it's a good trade off.
Add LD/ST with Immediate Writeback
The original design for Idli had writeback memory access instructions that took
a C operand, allowing for an arbitrary offset to be used. I wasn't able to fit
this into the encoding space, hence the current LD+ and LD- instructions that
have a fixed implicit offset of one.
Having since though about it more, I realised the variable offset versions of the instructions can be brought back by enforcing that the offset is always an immediate. These could be encoded as
1011 AAAA BBBB 1000 LD! A, B, I # A = [B], B += I
1011 AAAA BBBB 1001 ST! A, B, I # [B] = A, B += I
1011 AAAA BBBB 1010 !LD A, B, I # B += I, A = [B]
1011 AAAA BBBB 1011 !ST A, B, I # B += I, [B] = A
This would work by essentially having an implicit C=SP operand, meaning these
instructions would always be guaranteed to have a sixteen bit immediate offset.
We don't really lose anything here in practice, as almost every case where writeback
is desired would be with an immediate offset anyway.
Even with the introduction of the variable offset writebacks, I think it's still
worth keeping the existing LD+ style of instruction as an immediate of one is
an extremely common case, so the encoding space cost seems worth the code
density gain.
Add PC-relative LD/ST
It's a common operation to want to load/store data relative to the current program counter. In fact, one analysis of the original Quake found that over 10% of the instructions in the binary used program counter relative addressing. Of course, here's no way we'll be running Quake on Idli, but the point is that the operation is common enough that providing some amount of hardware support for it is desirable.
In the current version of Idli, this would be achieved by first
performing an ADDPC to get the final address, followed by the actual load or
store instruction to access the data. This means if we want to perform a PC
relative load there's always at minimum a two instruction sequence:
addpc r2, @data
ld r1, r2, zr
Adding PC-relative load/store instructions would allow for folding these two instructions into one. If these instructions also perform writeback, then we also gain the benefit of saving the address to perform further relative accesses without the immediate overhead.
!LDPC A, B, I
!STPC A, B, I
LDPC! A, B, I
STPC! A, B, I
These would work similarly to the implicit immediate writeback instructions,
except the address update is performed as B = PC + I instead of B + I. For
example, if we had a structure containing three integers at address data we
could load all the struct members as follows:
!ldpc r1, r4, @data
+ld r2, r4
+ld r3, r4
Non-writeback synonyms could be defined with the address register as ZR:
LDPC A, I -> !LDPC A, ZR, I
STPC A, I -> !STPC A, ZR, I
I'm not sure how useful the post-writeback variants of PC-relative accesses are,
as this would mean we access data at PC + I but write back PC, so these may
be removed if I can think of some better use for the encoding space.
Rework GETP and PUTP
There's no actual need to encode PUTP in the instruction set as it can always
be replaced by a comparison instruction instead:
PUTP 1 -> EQ ZR, ZR
PUTP 0 -> NE ZR, ZR
PUTP R1 -> NE R1, ZR
GETP is a little more complicated. An initial idea was to conditionally take
the value of P by incrementing the zero register, but this would actually
require a three instruction sequence:
mov r1, zr # tmp = 0
cex 1 # if p:
inc.t r1, r1 # tmp = 1
If GETPSR is added this could be reduced to two instructions with a GETPSR
and AND to extract the bottom bit. To further reduce the instruction count to
one we'd need an instruction like a conditional move or select that both reads
the current predicate state and always writes a value into the destination
register even when the predicate is false. An example instruction could be
predicated increment PINC A, B which would be implemented as:
A = P ? (B + 1) : B
Allowing GETP to be implemented as a synonym:
GETP A -> PINC A, ZR
This is more flexible than GETP itself as B could be anything, but I'm not
sure how useful it'll be given in general to perform a conditional increment
you would typically set P immediately before, in which case an X comparison
could be used instead. Perhaps a masked form of GETPSR would be more useful,
combining it with an AND. More thought is needed!
Add Rotate Through Carry
In the current machine it's possible to shift through the carry flag, but not to rotate. There isn't any reason why this shouldn't be possible, so it would be useful to support rotate through carry for more efficient rotations of operands with more than sixteen bits.
Reduce COUNT Size
In the current machine COUNT is four bits, allow for the operation to apply to
up to fifteen instructions. In real use counts above four are rarely
needed, so it may be worth reducing the counter to two bits instead and defining
a value of zero to be interpreted as four.
This would save some area and a chunk of encoding space, while still allowing for
processing 64b operands by linking together four operations with a CARRY 4.
Replace INP{X} With Something More Useful
These instructions were added to allow for directly reading an input pin into
P and using it in a condition, but in practice this is rarely needed so it's hard
to justify the consumed encoding space.
Instead, it would be more useful to add an extra comparison operation that takes
the B and C operands. One that would be useful is ALL to check whether
all the bits under the mask are set. This is slightly more complicated to
implement in the RTL than ANY, but is probably useful enough to merit inclusion
anyway. I'm not completely sold on ALL yet though, so if I can think of a
comparison operation that's more useful then that'll be added instead.
Next Steps
We've now gone through the current state of Idli and the changes to the instruction set that I'd like to make going forward, so I think the next step should be to start on the next version of the core.
I'd like to take a different approach to the RTL this time, starting with the UART interface, as this should allow for seeing something working on an FPGA as soon as possible which will make the iterative process of implementing and debugging the rest of the core much easier.
It would also be nice to have some formal verification this time around, although I'm not too sure how much of SystemVerilog assertions are supported by the open source tooling so some concessions may need to be made.
Either way, the first step will be to create the new repository, start on the assembler, and implement a behavioural model. More on this coming soon!