AVX-512 Assembly Programming

An Intro to AVX-512 Assembly Programming


In 1999, Intel released processors that supported SIMD (single instruction, multiple data) floating-point instructions, enabling processors to carry out multiple arithmetic operations using one instruction. This technology was an early step toward parallelization at the instruction level. The technology, SSE (Streaming SIMD Extensions), made it possible to perform arithmetic operations on four pairs of 32-bit floating point numbers at a time, using 128-bit registers: eight of them (XMM0 through XMM7) originally, and 16 (XMM0 through XMM15) in 64-bit mode. Since then, Intel has expanded its offering, releasing SSE2 in 2000, SSE3 in 2004, and several other versions, each with support for more instructions.

In 2011, Intel released AVX (Advanced Vector eXtensions), increasing the size of the registers to 256 bits (YMM0 through YMM15). For backward compatibility with SSE, the lower 128 bits of each YMM register are the XMM register with the same numeric suffix. With AVX, programs can process eight pairs of 32-bit float numbers or four pairs of 64-bit double numbers at a time. Two years later, in 2013, Intel released AVX2, extending the 256-bit operations to integer types as well as the floating point types.

The culmination of this string of technologies is Intel’s AVX-512 instruction set, which doubles the number of registers to 32 and doubles the size of each register to 512 bits. The AVX-512 registers are named ZMM0 through ZMM31. The lower 256 bits of each ZMM register are a YMM register; the lower 128 bits are an XMM register. At the time of writing, this instruction set is available only on Intel Xeon Phi processors, Intel Xeon Scalable processors, and some Core-X processors; Advanced Micro Devices (AMD) processors do not support AVX-512.
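Because AVX-512 is present on only some processors, a program can check for it at run time before executing any ZMM instruction. The following is a minimal sketch, not part of the original project: CpuHasAvx512F is a hypothetical helper name, and the code assumes either the Microsoft compiler (using its __cpuid, __cpuidex, and _xgetbv intrinsics) or GCC/Clang (using __builtin_cpu_supports). It tests only the foundation subset, AVX512F, which is the subset that ZMM loads, stores, and vaddps belong to.

```cpp
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>      // __cpuid, __cpuidex
#include <immintrin.h>   // _xgetbv
#endif

// Hypothetical helper (not from the article): returns true if the CPU
// reports AVX-512 Foundation (AVX512F) and the OS saves ZMM state.
bool CpuHasAvx512F()
{
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4] = {};                          // EAX, EBX, ECX, EDX
    __cpuid(regs, 0);                          // EAX = highest basic leaf
    if (regs[0] < 7) return false;
    __cpuid(regs, 1);
    if ((regs[2] & (1 << 27)) == 0) return false;   // ECX bit 27: OSXSAVE
    // XCR0 bits 1 and 2 (SSE, AVX) and 5, 6, 7 (opmask and ZMM state)
    // must all be enabled by the operating system.
    if ((_xgetbv(0) & 0xE6) != 0xE6) return false;
    __cpuidex(regs, 7, 0);
    return (regs[1] & (1 << 16)) != 0;         // EBX bit 16: AVX512F
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return false;                              // not an x86 build
#endif
}
```

A driver could call CpuHasAvx512F() at the top of main and exit with a message when it returns false, rather than crashing with an illegal-instruction fault.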

Why This Article?

One goal of this Insights article is to provide additional information about the new AVX-512 technology. Although Intel provides documentation for its entire instruction set (see https://software.intel.com/en-us/articles/intel-sdm#specification), the documentation is terse and extremely sparse on examples. Examples of AVX-512 instructions online are few and hard to find.

Another goal, in this and subsequent articles in the series, is to highlight some interesting features of the AVX-512 instruction set, especially those that have not been adequately documented.

An Example

Here’s a fairly simple example of adding two 16-element arrays of type float, using a single instruction. After performing the addition, the program stores the results in a destination array. The project is implemented as a mixed-language program, with a C++ main function that calls an x86-64 assembly subroutine. The listings are complete, but you will need to do some additional work to turn them into a running program. If you are interested in the details, send me a private message. Also, this program must be run on a computer that supports the basic AVX-512 instructions.

Here’s the C++ portion:

// Driver.cpp - Use AVX512 instructions to add two arrays.

#include <iostream>
using std::cout;
using std::endl;

// Prototypes
extern "C" void AddArrays(float Dest[], float Arr1[], float Arr2[]);
void PrintArray(float[], int count);

// Data is aligned to 64-byte boundaries
float __declspec(align(64)) Array1[] =    // First source array
{ 1, 2, 3, 4,
  5, 6, 7, 8,
  9, 10, 11, 12,
 13, 14, 15, 16 };

float __declspec(align(64)) Array2[] =    // Second source array
{ 2, 4, 6, 8,
  3, 6, 9, 12,
  4, 8, 12, 16,
  5, 10, 15, 20};

float __declspec(align(64)) Dest[16];     // Destination array

int main()
{
    AddArrays(Dest, Array1, Array2);  // Call the assembly routine
    PrintArray(Dest, 16);
    return 0;
}

void PrintArray(float Arr[], int count)
{
    for (int i = 0; i < count; i++) {
        cout << Arr[i] << '\t';
    }
    cout << endl;
}

And here’s the assembly code:

AddArrays PROC C
; Prototype: extern "C" void AddArrays(float Dest[], float Arr1[], float Arr2[]);
; Parameters:
; Dest - Address of the start of the destination array, in RCX
; Arr1 - Address of the first source array, in RDX
; Arr2 - Address of the second source array, in R8
; Returns nothing

; Prologue
   push rdi ; RDI is a non-volatile register, so save it.
   sub rsp, 20h ; Allocate 20h (32) bytes of stack space
   mov rdi, rsp 

; Main body of routine
   vmovaps zmm0, zmmword ptr [rdx] ; Load the first source array
   vmovaps zmm1, zmmword ptr [r8] ; Load the second source array
   vaddps zmm2, zmm0, zmm1 ; Add the two arrays
   vmovaps zmmword ptr[rcx], zmm2 ; Store the array sum

; Epilogue
   add rsp, 20h ; Adjust the stack back to original state
   pop rdi ; Restore RDI
   ret ; Return to the caller
AddArrays ENDP

The Prologue and Epilogue parts are typical boilerplate for 64-bit assembly procedures (a leaf routine this small does not strictly need a stack frame). The part that’s of interest to us is the four lines that make up the main body of the routine.

vmovaps zmm0, zmmword ptr [rdx]		; Load the first source array
vmovaps zmm1, zmmword ptr [r8]		; Load the second source array
vaddps zmm2, zmm0, zmm1			; Add the two arrays
vmovaps zmmword ptr[rcx], zmm2		; Store the array sum

Although the names of these instructions might seem like gibberish, the prefixes and suffixes provide a lot of information about the instructions:

  • v – vector instruction — operands are vectors
  • mov – move
  • a – aligned — the data is aligned on a 64-byte boundary
  • ps – packed single — the data consists of packed single-precision values
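The alignment requirement matters in practice: vmovaps raises a fault if the memory operand of a ZMM load or store is not on a 64-byte boundary (vmovups is the unaligned counterpart). The driver above uses the Microsoft-specific __declspec(align(64)); the sketch below shows the standard C++11 alignas spelling together with a small check. The names AlignedData and IsAligned64 are hypothetical and exist only for illustration.

```cpp
#include <cstdint>

// Standard C++11 way to request 64-byte alignment for a 16-float array.
alignas(64) float AlignedData[16] = {};

// Hypothetical helper: true if the address is a multiple of 64 bytes,
// which is what an aligned 512-bit load or store requires.
bool IsAligned64(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}
```

Calling IsAligned64 on each pointer before passing it to an assembly routine is a cheap way to catch a misaligned buffer in a debug build.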

The first vmovaps instruction moves (copies) the 16 float values that start at the address in RDX to the ZMM0 register. The second vmovaps is similar, copying the 16 float values that start at the address in R8 to ZMM1.

The vaddps instruction adds the 16 float values in the ZMM0 register to those in ZMM1, storing the results in register ZMM2. The following figure shows this operation.

[Figure: AVX-512 floats. Add 16 pairs of floats]

The fourth instruction, vmovaps, copies the 16 sums in ZMM2 to the memory whose address is in register RCX.
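For readers who prefer to stay in C++, the four-instruction body can also be approximated with compiler intrinsics. This is a sketch under stated assumptions rather than the author's method: AddArraysCpp is a hypothetical name, the intrinsics path is compiled only when the compiler defines __AVX512F__ (for example with /arch:AVX512 or -mavx512f), and a plain loop that produces the same result is used otherwise. The compiler is free to pick different registers or to fold a load into the add.

```cpp
#include <cstddef>
#if defined(__AVX512F__)
#include <immintrin.h>   // _mm512_load_ps, _mm512_add_ps, _mm512_store_ps
#endif

// Hypothetical C++ counterpart of AddArrays. Each pointer must refer to
// 16 floats; the AVX-512 path also requires 64-byte alignment.
void AddArraysCpp(float* dest, const float* arr1, const float* arr2)
{
#if defined(__AVX512F__)
    __m512 a = _mm512_load_ps(arr1);     // aligned load, like vmovaps
    __m512 b = _mm512_load_ps(arr2);     // aligned load, like vmovaps
    __m512 sum = _mm512_add_ps(a, b);    // 16 additions, like vaddps
    _mm512_store_ps(dest, sum);          // aligned store, like vmovaps
#else
    // Portable fallback: the same 16 sums, one element at a time.
    for (std::size_t i = 0; i < 16; ++i) {
        dest[i] = arr1[i] + arr2[i];
    }
#endif
}
```

With the two source arrays from the driver, the first four results are 3, 6, 9, and 12, and the last is 36.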

In the next installment, learn how AVX-512 can be used to significantly boost program performance by eliminating “if – then” decision structures. The program I’ll show runs in one third of the time that its fully optimized C++ equivalent takes.



