The Beauty of VPTERNLOGD

Created: 2026-02-19

Updated: 2026-02-22

AVX-512 is one of the more recent SIMD extensions for x86. The most obvious difference from prior versions of AVX is the doubling of the SIMD register width to 512 bits, allowing a single instruction to operate on up to sixty-four 8-bit or sixteen 32-bit lanes. It has received a certain amount of justified criticism, along with some lackluster implementations, but I wanted to write a little note about one instruction that I've found particularly useful.

I recently had the opportunity to dig into AVX-512 at work while doing some optimisation of machine learning kernels. Having worked with older AVX and SSE intrinsics previously, I headed over to the fantastic Intel Intrinsics Guide to have a quick look through some of the new instructions. After recovering from my shock at the sheer number of new instructions that had been added (it's lucky x86 has a variable-length encoding!), two things caught my eye.

First up, masks. One of the big annoyances of using SIMD instructions is dealing with loops that don't cleanly unroll because the number of iterations isn't an exact multiple of the register width. With AVX-512, almost all instructions (including loads and stores!) have mask and maskz variants, which are a neat way of addressing the problem. The ISA introduces eight new mask registers, k0 to k7, which hold a single bit for each SIMD lane of the main zmm registers. One of these mask registers can then be passed to the mask instructions along with an extra src register. When the i-th bit of the mask is zero, the result of the instruction is suppressed for lane i, and the result contains src[i] instead. maskz is similar, except that the suppressed lanes are zeroed rather than filled from src. These make unrolling and conditional evaluation of operations far more pleasant than in prior versions of the architecture.
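
To make the merging semantics concrete, here's a scalar C++ model I put together for this note (it is not real intrinsics code; the actual instructions would be something like _mm512_mask_add_epi32 and _mm512_maskz_add_epi32), showing what the mask and maskz variants of an addition do per lane:

```cpp
#include <array>
#include <cstdint>

// Model a zmm register as sixteen 32-bit lanes.
using Vec = std::array<std::int32_t, 16>;

// mask variant: lanes whose mask bit is clear keep the value from src.
Vec mask_add(Vec src, std::uint16_t k, Vec a, Vec b) {
    Vec dst{};
    for (int i = 0; i < 16; i++)
        dst[i] = ((k >> i) & 1) ? a[i] + b[i] : src[i];
    return dst;
}

// maskz variant: lanes whose mask bit is clear are zeroed instead.
Vec maskz_add(std::uint16_t k, Vec a, Vec b) {
    Vec dst{};
    for (int i = 0; i < 16; i++)
        dst[i] = ((k >> i) & 1) ? a[i] + b[i] : 0;
    return dst;
}
```

A mask like 0x0003, for example, is exactly what you'd feed the final iteration of a loop with two leftover elements: the two live lanes compute, the rest stay untouched (or become zero).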

The second thing, and the main focus of this article, is the ternarylogic family of intrinsics, which map to the vpternlogd instruction. The intrinsics guide describes one of these as follows:

Bitwise ternary logic that provides the capability to implement any three-operand binary function; the specific binary function is specified by value in imm8. For each bit in each packed 32-bit integer, the corresponding bit from a, b, and c are used according to imm8, and the result is written to the corresponding bit in dst.

Latency is listed as one cycle and throughput as half a cycle per instruction (that is, two can issue per cycle). Any three-operand boolean function with high throughput and minimal latency? This all seemed too good to be true, which made me wonder: how could this be implemented efficiently in hardware?

My initial thought was that the RTL would need to contain a big lookup table describing the operation of each possible boolean function. This is also suggested by the pseudocode provided in the documentation, which shows part of a case statement with one arm for each of the 256 possible functions selected by the eight-bit immediate operand.

While it would be possible to implement the operation in this way, it seemed to me that this approach would take up a significant amount of area, meaning there was almost certainly something cleverer going on. After pondering this a little, I think I figured it out - the eight-bit operand is itself the lookup table, but instead of looking up which operation to perform, we directly look up the result.

As an example, let's take an arbitrary three-operand boolean function, out = a & (b | ~c), and write out its truth table:

a b c out
1 1 1 1
1 1 0 1
1 0 1 0
1 0 0 1
0 1 1 0
0 1 0 0
0 0 1 0
0 0 0 0

Interpreting the out column as an 8-bit binary number, with the first row as the most-significant bit, we end up with the value 11010000, or 208 in decimal. If we then plug this into the vpternlogd instruction as the immediate operand, it will return the result of our original boolean function.

// out = a & (b | ~c)
const auto out = _mm512_ternarylogic_epi32(a, b, c, 208);
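
Rather than working the immediate out by hand each time, we can derive it mechanically. Here's a small helper I sketched for this note (it's not part of any library): it evaluates a boolean function over all eight input rows and sets the corresponding bit of the table for each row where the function is true.

```cpp
#include <cstdint>

// Build the imm8 for vpternlogd from any three-operand boolean function:
// row (a, b, c) of the truth table supplies bit {a, b, c} of the immediate.
template <typename F>
constexpr std::uint8_t ternlog_imm(F f) {
    std::uint8_t imm = 0;
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            for (int c = 0; c <= 1; c++)
                if (f(a, b, c))
                    imm |= std::uint8_t(1) << ((a << 2) | (b << 1) | c);
    return imm;
}
```

Feeding it our example function, ternlog_imm([](int a, int b, int c) { return a && (b || !c); }), yields 208 - the same value we read off the truth table above.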

So now we know how to derive this magical immediate for any boolean function, but how is the instruction actually implemented in the hardware? My theory, as mentioned above, is that the immediate is the lookup table, indexed by the concatenation of each bit of the three register operands.

Looking back at our truth table, if we take the first three columns of each row and stick them together, we end up with a three bit integer which just so happens to be the index of that row in the table counting upwards from row zero at the bottom. More succinctly, we can write this in SystemVerilog as:

function automatic zmm vpternlogd(
    input zmm         a,
    input zmm         b,
    input zmm         c,
    input logic [7:0] imm
);
    zmm out = 'x;

    for (int unsigned i = 0; i < $bits(zmm); i++) begin
        out[i] = imm[{a[i], b[i], c[i]}];
    end

    return out;
endfunction

I obviously don't have access to the RTL of any official implementations of AVX-512, so this is something of a guess, but given how neatly it maps to the hardware I think there's a decent chance of it at least being on the right track.
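
As a sanity check on my guess, here's the same per-bit lookup written as plain C++ over a 32-bit word. It uses a sum-of-minterms formulation - ORing together the minterms whose table entries are set in imm8 - which is equivalent to indexing the table with {a[i], b[i], c[i]} for each bit. Again, this is a software model of my theory, not anything taken from a real implementation.

```cpp
#include <cstdint>

// Software model of vpternlogd on one 32-bit element: bit i of the result
// is bit {a[i], b[i], c[i]} of the eight-bit immediate.
std::uint32_t ternlog32(std::uint32_t a, std::uint32_t b,
                        std::uint32_t c, std::uint8_t imm) {
    std::uint32_t out = 0;
    for (int k = 0; k < 8; k++) {
        if ((imm >> k) & 1) {
            // Minterm k: AND of the three inputs, each inverted when the
            // corresponding bit of the index k is zero.
            std::uint32_t term = ((k & 4) ? a : ~a)
                               & ((k & 2) ? b : ~b)
                               & ((k & 1) ? c : ~c);
            out |= term;
        }
    }
    return out;
}
```

With imm = 208 this reproduces a & (b | ~c) for any inputs, which matches the behaviour the intrinsics guide describes.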

A truly elegant way to implement any three-operand boolean function with a single instruction and minimal hardware cost. Beautiful!