Matt Woodhouse

Idli: Instruction Set

Created: 2026-04-25, Updated: 2026-05-08

This article will be split into two main sections. First up I want to go through the instructions that are implemented in the current version of Idli and briefly describe how they work. After that I'd like to go through the things I'm not happy with and would like to change in the future.

Current Instructions

Let's look at the instructions as defined in the current version of the machine, loosely grouped together into functional categories. Each instruction is provided with its encoding and signature, with operand types indicated by the letter used:

Now let's dig into the instructions. In no particular order:

Arithmetic

0000 AAAA BBBB CCCC     ADD A, B, C     # A = B + C
0001 AAAA BBBB CCCC     SUB A, B, C     # A = B - C

Standard addition and subtraction operations. There is some redundancy here when C is an immediate but given we need both for when it's a register it isn't worth worrying about. Combining ADD with the zero register also gives us a couple of handy synonyms:

MOV A, C    ->  ADD  A, ZR,  C
NOP         ->  ADD ZR, ZR, ZR

This also means the encoding of the canonical NOP is encoded as all zero, so if we happen to read some memory that's been cleared we won't do anything too unexpected.

1010 AAAA BBBB 1000     INC A, B    # A = B + 1
1010 AAAA BBBB 1001     DEC A, B    # A = B - 1

Addition and subtraction of one is quite a common operation, but encoding this using the standard ADD and SUB instructions results in an extra sixteen bits of consumption. To avoid this overhead, INC and DEC were introduced. In the RTL these are implemented by setting the carry in to the ALU.

1100 AAAA ???0 CCCC     ADDPC A, C  # A = PC + C

We need a way to access information relative to the current program counter, and as PC isn't part of the register file a separate instruction is needed. This is essentially a branch instruction except the result is written to A instead of PC.

Logical

0010 AAAA BBBB CCCC     AND  A, B, C    # A =  B &  C
0011 AAAA BBBB CCCC     ANDN A, B, C    # A =  B & ~C
0100 AAAA BBBB CCCC     OR   A, B, C    # A =  B |  C
0101 AAAA BBBB CCCC     XOR  A, B, C    # A =  B ^  C
1010 AAAA BBBB 1110     NOT  A, B       # A = ~B

The standard set of logical operations - there's not much to say here!

One interesting point is that we need an actual instruction for logical negation, but not arithmetic negation which can be implemented as the synonym SUB A, ZR, B. To implement NOT as a synonym we'd need to have an all-ones register or use an immediate like XOR A, B, 0xffff. This would result in a double-width instruction due to the immediate, so it seemed worth including NOT, but given how infrequently it actually turns up perhaps it isn't worth the encoding space.

There's also no point in having NOT take C instead of B as an immediate is always a full sixteen bits, so there's no situation where the negated form couldn't be encoded directly anyway. This also means we can perform logical negation on SP, but so far I haven't found a practical use for it.

Shift

1010 AAAA BBBB 1010     SRL A, B    # A = { 1'b0, B[15:1]}
1010 AAAA BBBB 1011     SRA A, B    # A = {B[15], B[15:1]}
1010 AAAA BBBB 1100     ROR A, B    # A = {B[ 0], B[15:1]}
1010 AAAA BBBB 1101     ROL A, B    # A = {B[14:0], B[15]}

Shifts in Idli are all a single bit for two reasons:

  1. Barrel shifters are quite expensive in terms of area and I wasn't sure it would be worth the space.
  2. The encoding space is limited and having even more three-operation instructions wouldn't fit.

This does make performing multi-bit shifts quite painful, especially if you don't know the shift amount until runtime as this means every bit shifted will also incur a branch redirection penalty.

SRL and SRA can both shift via the internal carry register for performing shifts over integers larger than sixteen bits. Reviewing the implementation now I can't remember why I didn't also support this for ROR and ROL, so perhaps this is something to look at as an improvement.

Given we only have single bit shifts, LSL A, B is implemented as the synonym ADD A, B, B to save on encoding space.

Comparison

1011 X000 BBBB CCCC     EQ{X}  B, C     # P =  (B == C)
1011 X001 BBBB CCCC     NE{X}  B, C     # P =  (B != C)
1011 X010 BBBB CCCC     LT{X}  B, C     # P =  (B <  C) (signed)
1011 X011 BBBB CCCC     LTU{X} B, C     # P =  (B <  C) (unsigned)
1011 X100 BBBB CCCC     GE{X}  B, C     # P =  (B >= C) (signed)
1011 X101 BBBB CCCC     GEU{X} B, C     # P =  (B >= C) (unsigned)
1011 X110 BBBB CCCC     ANY{X} B, C     # P = |(B &  C)
1011 X111 ??NN CCCC     INP{X} B, C     # P = PIN(N)

Idli provides a full suite of comparison operations. These all write their results to the predicate register P. Each instruction has two forms: standard and X. The X form of an instruction performs the update of P as normal, but also updates the COND register to predicate the following instruction such that it only runs if the result of the comparison is true. This gives us a couple of significant benefits:

  1. Code density is improved, as performing a comparison and updating COND are performed in a single instruction rather than requiring an explicit update via CEX (see next section for details on this instruction).
  2. There's no need to encode any conditional branch instructions, saving a chunk of encoding space. Yes, we do need to encode two forms of every comparison, but this allows us to predicate anything so is much more flexible.

Control Flow

There are two ways to modify control flow in Idli. The first are branches and jumps:

1100 ??00 ???1 CCCC     B   C       # PC += C
1100 ??01 ???1 CCCC     J   C       # PC  = C
1100 ??10 ???1 CCCC     BL  C       # LR  = NEXT_PC, PC += C
1100 ??11 ???1 CCCC     JL  C       # LR  = NEXT_PC, PC  = C

Branches add an offset to the program counter while jumps overwrite it directly. The link register is also optionally updated to the PC following the branch or jump, allowing for return from function calls without needing to access the stack. The RET instruction is implemented as the synonym J LR.

The second form of control flow modification is predication via the COND register. This can be implicitly updated via the X comparisons discussed above or by directly writing it using CEX.

1110 ???0 MMMM MMMM     CEX M       # COND = M

The M operand holds a mask indicating how the following instructions should be predicated. For each bit i, if M[i] matches the value of P the instruction will be executed, otherwise it will be skipped. This provides an extremely flexible way to avoid the costly branch redirection penalty when running short conditional sequences.

The mask is indicated in assembly language by adding a .T or .F suffix to the instructions following the CEX, which itself indicates the number of instructions to be predicated. For example:

CEX     4
ADD.T   R1, R2, R3      # IF( P): R1 = R2  + R3
ADD.F   R1, R4, R5      # IF(!P): R1 = R4  + R5
INC.F   R2, R10         # IF(!P): R2 = R10 + 1
AND.T   R2, R2, R11     # IF( P): R2 = R2  & R11

In this case, if P was true then the first and final instruction would be run, otherwise the second and third would be executed.

Memory Access

There are a few different ways of accessing memory in Idli. First up are the standard load and store instructions:

0110 AAAA BBBB CCCC     LD A, B, C      # A = [B + C]
0111 AAAA BBBB CCCC     ST A, B, C      # [B + C] = A

These allow for indexing into memory relative to a base register. As Idli has sixteen bit bytes to match the register width, as discussed previously, we don't need to support different widths and sign or zero extension.

Auto-incrementing load and store operations are also supported:

1010 AAAA BBBB 0000     LD+ A, B        # A = [B++]
1010 AAAA BBBB 0001     ST+ A, B        # [B++] = A
1010 AAAA BBBB 0010     +LD A, B        # A = [++B]
1010 AAAA BBBB 0011     +ST A, B        # [++B] = A
1010 AAAA BBBB 0100     LD- A, B        # A = [B--]
1010 AAAA BBBB 0101     ST- A, B        # [B--] = A
1010 AAAA BBBB 0110     -LD A, B        # A = [--B]
1010 AAAA BBBB 0111     -ST A, B        # [--B] = A

These allow for incrementing or decrementing the base register either before or after the memory is accessed. This is particularly common in loops and can often save us from needing to increment base addresses manually. In the RTL writeback of the base register is performed before the memory access, so stores of the base register will show the updated value and loads will overwrite it.

As Idli uses serial memories which share instructions and data, each access requires a redirection to perform the access followed by a redirection back to the PC to continue execution. As such, load and store multiple operations are also provided to allow for batching accesses together to minimise the overhead:

1000 RRRR SSSS BBBB     LDM R..S, B     # R..S = [B, B + 1, ...]
1001 RRRR SSSS BBBB     STM R..S, B     # [B, B + 1, ...] = R..S

These both take register ranges, and take advantage of the sequential mode of the attached memories to perform multiple read or write operations with a single redirection. This greatly enhances performance for batch memory access operations such as memcpy.

A couple of synonyms are also provided for common stack operations:

PUSH A      ->  -ST A, SP
POP  A      ->  LD+ A, SP

System

System operations are a catch-all group for things that didn't really fit into any other category. The boring ones are for moving the predicate register P into a general purpose register:

1101 ???1 ??10 AAAA     GETP    A       # A = P
1101 ???1 ??11 CCCC     PUTP    C       # P = C & 1

More interesting are the operations that update the COUNT register, which are used to modify the behaviour of a number of instructions following the update.

1110 ???1 ??00 JJJJ     CARRY   J       # COUNT = {CARRY, J}
1110 ???1 ??01 JJJJ     ANDP    J       # COUNT = { ANDP, J}
1110 ???1 ??10 JJJJ     ORP     J       # COUNT = {  ORP, J}

Let's look at CARRY first. By default the processor clears the internal carry flag after each operation, but this instruction allows us to override this behaviour and propagate the carry forward for J instructions instead. For example, if R2..R5 and R6..R9 contain two 64b numbers, we can perform a full 64b add efficiently by writing

CARRY   4
ADD     R2, R2, R6
ADD     R3, R3, R7
ADD     R4, R4, R8
ADD     R5, R5, R9

This saves us from needing to encode a specific add with carry or subtract with borrow. It's also worth noting that COUNT is a separate register to COND, allowing for both to be used simultaneously. This can be useful in a number of scenarios such as implementing a 32b soft multiply.

ANDP and ORP are used to efficiently build up complex boolean expressions by changing how P is updated. By default comparisons replace the value of the predicate register, but these allow for the result to be ANDed or ORed into P instead. Let's say you want to branch if the following condition is true: (R1 == R2 || R3 == R4) && (R5 >= R6 || R7 != R8 || R9 == R10). We could write this in Idli assembler as:

EQ      R1, R2      # P = R1 == R2
ORP     1
EQ      R3, R4      # P |= R3 == R4
GETP    R12         # TMP = P
GE      R5, R6      # P = R5 >= R6
ORP     2
NE      R7, R8      # P |= R7 != R8
EQ      R9, R10     # P |= R9 == R10
ANDP    1
NEX     R12, ZR     # IF (P && TMP):
B.T     @TARGET     #   GOTO TARGET

This saves us from writing out a number of conditional branches, improving both performance and code density.

IO

Communication with the outside world can be performed via general purpose IO pins or a built-in UART.

1101 ???0 00NN AAAA     IN      A, P    # A = PIN(N)
1101 ???0 01NN CCCC     OUT     N, C    # PIN(N,  C & 1)
1101 ???0 10NN CCCC     OUTN    N, C    # PIN(N ~(C & 1))
1101 ???0 11NN 0000     OUTP    N       # PIN(N, P)

Pin instructions allow for getting or setting an arbitrary value on each of the four general purpose input or output pins from either a general purpose or predicate register. OUTN primarily exists to allow for writing a one to an output pin via OUTN N, ZR rather than needing in incur the sixteen bit immediate cost of writing OUT N, 1.

1101 ???1 ??00 AAAA     URX     A       # A = UART()
1101 ???1 ??01 CCCC     UTX     C       # UART(C)

These instructions allow for receiving (RX) or transmitting (TX) a sixteen bit value across the UART interface. Each read or write is split into two eight bit transactions with a single stop bit. The low eight bits are processed in the first transaction.

Idli has a single sixteen bit buffer to hold incoming data from the UART. If a URX is performed and the buffer is empty, the processor will stall until a full sixteen bits of data have been received. Flow control is not implemented, so if the buffer is full and additional data is sent into the core it will be silently dropped.

Future Improvements

Since implementing the latest version of Idli I've been considering what I don't like and having ideas about how to improve it. A few these are listed below:

Remove Branch and Jump Without Link

Given writes to the zero register are ignored B and J can be implemented as synonyms instead of requiring their own encodings:

B   C       ->  BL  ZR, C
J   C       ->  JL  ZR, C

This saves a little encoding space and simplifies decode without any downsides so it seems like an obvious change to make.

In the current implementation LR is also baked into the RTL as a constant so it can be implicitly updated. Having the link register defined has certain benefits for high-performance microarchitectures (e.g. link register caching), but given Idli is specifically optimised as a low-area in-order core this isn't really relevant. I think I'll keep R14 defined as LR just for convenience use with synonyms like RET, but there will no longer be anything special about it in the microarchitecture.

Remove ANDN

In the real world ANDN is almost always used with an immediate to e.g. mask off the bottom bits of a register for address alignment. In Idli, all immediates passed into the ALU are a full sixteen bits, so there's no penalty for writing ANDN R1, R2, 0x7 as AND R1, R2, 0xfff8 instead.

Given this, keeping ANDN doesn't seem like it's worth the encoding space of a three-operand instruction, so I think it makes sense to remove it. This does mean an ANDN with a register operand will need one extra instruction to negate the mask, but it's so infrequent that I'd rather gain back the encoding space for something more useful.

Add a Program Status Register

In the current machine there's no way to read back COND, COUNT, or the carry flag. These could be combined with P into a Program Status Register, PSR, which can then be accessed with an instruction like GETPSR A. This would also allow the random test generator to check previously inaccessible state, strengthening verification.

I'm not sure yet if a PUTPSR should also be added. This would be useful for saving/restoring when context switching, but this wasn't ever something that was originally intended to be supported and I'm not sure I want to deal with the edge cases it could introduce e.g. updating COND with a value derived at runtime.

Improve CEX Syntax & Checking

In the current assembler a CEX instruction is followed by a number indicating the number of instructions after it that are predicated, with the predication mask deduced by .T and .F suffixes of those instructions following it. This complicates the assembler as it needs to look ahead to deduce the value of M, so an improvement would be to specify the mask directly as part of the CEX itself, then assert that this matches the suffix of the followers. For example:

OLD                         NEW
------------------          ------------------
CEX     2                   CEX     TF
SUB.T   R1, R2, R3          SUB.T   R1, R2, R3
SUB.F   R2, R1, R3          SUB.F   R2, R1, R3

This should also help to catch mistakes where a .T or .F suffix has been misplaced or unintentionally inverted.

Another common case is to predicate a group of instructions in the same way, so to make this easier to work with it would be nice for the assembler to support Python-like replication syntax. This would allow for writing CEX T*5 instead of CEX TTTTT which is much easier to read.

The assembler should also be enhanced to check that a branch instruction can only appear at the end of a predicated block. Branching away mid-predication leads to undefined behaviour, as the instructions at the branch target won't have been written with predication in mind, so this just catches a common source of error at assembly time rather than needing runtime debug.

An alternative to this would be to define that taken branches clear COND. This would allow for writing things like:

CEX     TTFF
MOV.T   R1, R2
B.T     @TARGET1
MOV.F   R1, R4
B.F     @TARGET2

I could see this being useful so it may be preferable to take this approach. At this stage though more thought is required!

Multiply Acceleration

The current machine has no multiply instruction, which makes multiplying two numbers exceedingly slow. An example of a soft multiply implemented in Idli assembler is part of the FNV-1a hash algorithm (see here for the full listing):

2:  eq      r5, zr              # done = prime0 == 0
    andp    1                   # (and next compare)
    eq      r6, zr              # done &= prime1 == 0
    cex     3                   # if p:
    mov.t   r1, r8              #   hash0 = mul0
    mov.t   r2, r9              #   hash1 = mul1
    b.t     @1b                 #   goto 1b
    any     r5, 0x1             # p = (prime0 & 1) != 0
    cex     3                   # if p:
    carry.t 2                   #   (sticky carry)
    add.t   r8, r8, r1          #   c, mul0 = mul0 + hash0
    add.t   r9, r9, r2          #   mul1 = mul1 + hash1 + c
    carry   2                   # (shift through carry)
    srl     r6, r6              # c, prime1 >>= 1
    srl     r5, r5              # prime0 = (c << 15) | (prime0 >> 1)
    carry   2                   # (shift through carry)
    sll     r1, r1              # c, hash0 <<= 1
    sll     r2, r2              # hash1 = (hash1 << 1) | c
    b       @2b                 # goto 2b

Not very performant! So how can we make this faster?

The most obvious answer is to introduce a MUL A, B, C instruction that writes the low sixteen bits of the product of B and C to A. This is obviously a huge win for performance and code density, but multiplier hardware consumes a significant amount of area. Given Idli is four bit serial it may not be too bad, but I need to do some synthesis experiements to know for sure.

An alternative way to add further acceleration is adding a barrel shifter. This would allow for fast multiplies by powers of two, from which other multiplications could be composed. Again, I'm not sure if I'd like to spend the area on the introduction of a shifter, so some further experimentation is needed. This also consumes a significant amount of encoding space, as each shift operation now takes a full three operands.

Another possibility would be to introduce instructions that always shift by four bits. This would save on encoding space and could be implemented relatively efficiently in the hardware as the core is already operating on four bit slices. Right shifts fall out naturally from this, as we can nudge each four bit slice further down the register on each tick. Left shits are unfortunately harder, as we need somewhere to store the slice from the previous cycle after writing the current one. Introducing a four bit flop to save the state will definitely be better for area than the multiply or barrel shifter, but whether it's worth it for the performance gain requires some experimentation.

Unify IO Instructions

It would make the instruction set cleaner to only have two instructions of non-memory IO which operate on their own address space: IN A, F reads from IO address F into A, and OUT F, P performs the opposite. F could be an eight bit address space indicating some kind of input/output device. To match the capabilities of the current machine, this could be defined as:

7   ???
6   ???
5   ???
4   UART
3   GPIO_3
2   GPIO_2
1   GPIO_1
0   GPIO_0

The original UART and pin instructions can then be defined as synonyms, and we also have space for three more IO interfaces such as SPI or I2C. This also gives a more uniform encoding format for the IO instructions, which should help reduce area by being easier to decode.

The only IO functionality this loses is reading or writing the predicate register directly to a pin at the cost of an extra instruction to move P between the registers. Given how infrequently this actually comes up in practice I think it's a good trade off.

Add LD/ST with Immediate Writeback

The original design for Idli had writeback memory access instructions that took a C operand, allowing for an arbitrary offset to be used. I wasn't able to fit this into the encoding space, hence the current LD+ and LD- instructions that have a fixed implicit offset of one.

Having since though about it more, I realised the variable offset versions of the instructions can be brought back by enforcing that the offset is always an immediate. These could be encoded as

1011 AAAA BBBB 1000     LD!     A, B, I     # A = [B], B += I
1011 AAAA BBBB 1001     ST!     A, B, I     # [B] = A, B += I
1011 AAAA BBBB 1010     !LD     A, B, I     # B += I, A = [B]
1011 AAAA BBBB 1011     !ST     A, B, I     # B += I, [B] = A

This would work by essentially having an implicit C=SP operand, meaning these instructions would always be guaranteed to have a sixteen bit immediate offset. We don't really lose anything here in practice, as almost every case where writeback is desired would be with an immediate offset anyway.

Even with the introduction of the variable offset writebacks, I think it's still worth keeping the existing LD+ style of instruction as an immediate of one is an extremely common case, so the encoding space cost seems worth the code density gain.

Add PC-relative LD/ST

It's a common operation to want to load/store data relative to the current program counter. In fact, one analysis of the original Quake found that over 10% of the instructions in the binary used program counter relative addressing. Of course, here's no way we'll be running Quake on Idli, but the point is that the operation is common enough that providing some amount of hardware support for it is desirable.

In the current version of Idli, this would be achieved by first performing an ADDPC to get the final address, followed by the actual load or store instruction to access the data. This means if we want to perform a PC relative load there's always at minimum a two instruction sequence:

addpc   r2, @data
ld      r1, r2, zr

Adding PC-relative load/store instructions would allow for folding these two instructions into one. If these instructions also perform writeback, then we also gain the benefit of saving the address to perform further relative accesses without the immediate overhead.

!LDPC   A, B, I
!STPC   A, B, I
LDPC!   A, B, I
STPC!   A, B, I

These would work similarly to the implicit immediate writeback instructions, except the address update is performed as B = PC + I instead of B + I. For example, if we had a structure containing three integers at address data we could load all the struct members as follows:

!ldpc   r1, r4, @data
+ld     r2, r4
+ld     r3, r4

Non-writeback synonyms could be defined with the address register as ZR:

LDPC    A, I    ->  !LDPC A, ZR, I
STPC    A, I    ->  !STPC A, ZR, I

I'm not sure how useful the post-writeback variants of PC-relative accesses are, as this would mean we access data at PC + I but write back PC, so these may be removed if I can think of some better use for the encoding space.

Rework GETP and PUTP

There's no actual need to encode PUTP in the instruction set as it can always be replaced by a comparison instruction instead:

PUTP    1       ->      EQ  ZR, ZR
PUTP    0       ->      NE  ZR, ZR
PUTP    R1      ->      NE  R1, ZR

GETP is a little more complicated. An initial idea was to conditionally take the value of P by incrementing the zero register, but this would actually require a three instruction sequence:

mov     r1, zr      # tmp = 0
cex     1           # if p:
inc.t   r1, r1      #   tmp = 1

If GETPSR is added this could be reduced to two instructions with a GETPSR and AND to extract the bottom bit. To further reduce the instruction count to one we'd need an instruction like a conditional move or select that both reads the current predicate state and always writes a value into the destination register even when the predicate is false. An example instruction could be predicated increment PINC A, B which would be implemented as:

A = P ? (B + 1) : B

Allowing GETP to be implemented as a synonym:

GETP    A       ->  PINC    A, ZR

This is more flexible than GETP itself as B could be anything, but I'm not sure how useful it'll be given in general to perform a conditional increment you would typically set P immediately before, in which case an X comparison could be used instead. Perhaps a masked form of GETPSR would be more useful, combining it with an AND. More thought is needed!

Add Rotate Through Carry

In the current machine it's possible to shift through the carry flag, but not to rotate. There isn't any reason why this shouldn't be possible, so it would be useful to support rotate through carry for more efficient rotations of operands with more than sixteen bits.

Reduce COUNT Size

In the current machine COUNT is four bits, allow for the operation to apply to up to fifteen instructions. In real use counts above four are rarely needed, so it may be worth reducing the counter to two bits instead and defining a value of zero to be interpreted as four.

This would save some area and a chunk of encoding space, while still allowing for processing 64b operands by linking together four operations with a CARRY 4.

Replace INP{X} With Something More Useful

These instructions were added to allow for directly reading an input pin into P and using it in a condition, but in practice this is rarely needed so it's hard to justify the consumed encoding space.

Instead, it would be more useful to add an extra comparison operation that takes the B and C operands. One that would be useful is ALL to check whether all the bits under the mask are set. This is slightly more complicated to implement in the RTL than ANY, but is probably useful enough to merit inclusion anyway. I'm not completely sold on ALL yet though, so if I can think of a comparison operation that's more useful then that'll be added instead.

Next Steps

We've now gone through the current state of Idli and the changes to the instruction set that I'd like to make going forward, so I think the next step should be to start on the next version of the core.

I'd like to take a different approach to the RTL this time, starting with the UART interface, as this should allow for seeing something working on an FPGA as soon as possible which will make the iterative process of implementing and debugging the rest of the core much easier.

It would also be nice to have some formal verification this time around, although I'm not too sure how much of SystemVerilog assertions are supported by the open source tooling so some concessions may need to be made.

Either way, the first step will be to create the new repository, start on the assembler, and implement a behavioural model. More on this coming soon!