AVX-512 Programming: Extracting Column Subtotals from a Table

In this Insights article I’ll present an example that shows how Intel® AVX-512 instructions can be used to read a whole row of data in a single operation, and then generate subtotals for some or all of the columns of that table. To motivate the example, I’ll use a table of household expenses in several categories over a period of several months.

Household Expenses Table

The table shown here represents some of my monthly expenses in eight categories. I’m not very imaginative, so all of the numbers for one month’s expenses are just copied to the other three months. Later in the article we’ll see how the numbers 1 through 8 in the second header row play a role in the expense categories I want to get subtotals for.

Mortgage | HO Ins | Prop. Tax | Electricity | Propane | Car Ins. | MC Ins. | Cable
1        | 2      | 3         | 4           | 5       | 6        | 7       | 8

Program Description

The program I’ve written consists of two files:

  1. Expenses.cpp defines the array that contains the data shown above, and defines a main() function that calls a function defined in a different file. After the called function returns, output statements in main() display the results.
  2. ExpenseTotals.asm is written in 64-bit assembly code, and contains the definition for a routine that reads each row in the array, calculates the totals of specified columns, and tucks the values away for later use by main().

It’s possible to incorporate 32-bit assembly code inline in C++ code. But the Microsoft Visual Studio C++ compiler does not support inline assembly in 64-bit builds, and my code uses AVX-512 instructions and the 64-bit register set, so the assembly code has to live in its own file.

Program Details – C++ Code

In this section we take a look at the important details of how to set up the “plumbing” for calling an assembly routine from within a C++ program.

Declaring and calling GetSelectedTotals in main()

The main() function contains the following prototype for GetSelectedTotals.

extern "C" void GetSelectedTotals(int nMonths, unsigned char ColSelect, float Table[], float Totals[]);

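Since this function is only declared but not defined in the C++ file, the linker has to match it to the routine in the assembly file. The extern "C" linkage specification tells the compiler not to apply C++ name decoration, so the name the linker looks for is the same one that appears in the assembly code. The rest of this declaration indicates that the function doesn’t return a value, and that the function takes four arguments.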

  • nMonths – the number of months to process
  • ColSelect – a byte whose bits determine the columns we want to add
  • Table – the address of the table containing expense data
  • Totals – the address of an array that will receive the calculated subtotals

The ColSelect parameter deserves some more description. We can get by with a byte (eight bits) because I have laid out the expenses array so that each row contains eight (float) numbers. If we want the subtotals for, say, the Mortgage, Homeowner Insurance, and Property Tax categories (columns 1, 2, and 3), we need to set bits 1, 2, and 3 of ColSelect. In binary, this would be 0000 0111, or 7 in decimal form. Note that I’m numbering the bits from 1 so that they match the column numbers; in the usual scheme, which starts at 0, these are bits 0, 1, and 2.
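To make the mapping from column numbers to mask bits concrete, here is a minimal sketch of a helper that builds the ColSelect byte from 1-based column numbers. The name MakeColumnMask is hypothetical and is not part of the program in this article; it only illustrates that column n corresponds to a shift of n - 1.

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical helper (not part of the original program): builds the
// ColSelect byte from 1-based column numbers, so column 1 maps to the
// lowest-order bit and column 8 to the highest-order bit.
unsigned char MakeColumnMask(std::initializer_list<int> columns)
{
    unsigned char mask = 0;
    for (int col : columns) {
        assert(col >= 1 && col <= 8);   // the table has eight columns
        mask = static_cast<unsigned char>(mask | (1u << (col - 1)));
    }
    return mask;
}
```

With this helper, MakeColumnMask({1, 2, 3}) produces 0x07, the value used for the home-related expenses, and MakeColumnMask({4, 5, 8}) produces 0x98, the value used for the utilities.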

Calling GetSelectedTotals

If we’re interested in the four months shown in the Expenses table above, for the Mortgage, HO Insurance, and Prop. Tax categories, we’ll call the GetSelectedTotals() function like this:

GetSelectedTotals(4, 7, Table, Totals);

The complete code for the C++ portion of the program and its output appears at the bottom of this article.

Program Details – Assembly Code

In this section, we’ll examine the most important details in the assembly code.

Registers used in parameter passing

When main() calls GetSelectedTotals(), the arguments are passed in registers. Under the Microsoft x64 calling convention, the first four integer or pointer arguments go in the 64-bit general purpose registers RCX, RDX, R8, and R9, in that order. RCX is set to the number of months (rows) in the table, RDX is set to the value for ColSelect, R8 is set to the address of the expenses table, and R9 is set to the address of the output array.

Setting the opmask register

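The first order of business inside the assembly routine is to copy the ColSelect value (in RDX) to one of the opmask registers. The eight opmask (or writemask) registers, k0 through k7, were added in AVX-512; they are 64-bit registers whose primary use is to hold bitmasks. The kmovb instruction below moves 8 bits (one byte) from EDX to the lowest 8 bits of the k1 opmask register. Note that kmovb belongs to the AVX512DQ extension, and that using an opmask with the 256-bit YMM registers requires AVX512VL, so the processor has to support both in addition to the AVX-512 foundation instructions.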

	kmovb k1, edx

Heart of the routine

Next we see the heart of the routine. The value in the RCX register controls the loop. Each time the loop instruction is executed, the value in RCX is decremented; when it gets to zero, the loop terminates.

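AnotherRow:
	vmovaps ymm0 {k1}{z}, ymmword ptr[r8]   ; Read a row of floats, zeroing unselected columns
	vaddps ymm1, ymm1, ymm0                 ; Accumulate onto the totals
	add r8, 32                              ; Advance to the next row (8 floats = 32 bytes)
	loop AnotherRow                         ; Decrement RCX; get another row if RCX is not 0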

This vmovaps instruction moves aligned packed single-precision numbers (floats) from memory to the destination register. Although the instruction addresses eight floats in the memory pointed to by R8, it copies to YMM0 only the floats whose corresponding bits in k1 are set, and each one lands in the same position in YMM0 that it occupies in memory. The {z} (zeroing) modifier used in the complete listing sets the unselected positions of YMM0 to zero, so they contribute nothing to the totals.
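As a way of checking one's understanding of the masked load, the following is a plain scalar model of what the instruction does to each of the eight positions. MaskedLoad8 is a hypothetical name used only for this sketch; it is not an intrinsic or a library function, and the zeroing parameter is my own device for contrasting zero-masking ({z}) with the default merge-masking.

```cpp
#include <cassert>

// Scalar model of a masked load of eight floats (illustration only).
// mask    - one bit per lane, lowest-order bit = lane 0
// src     - the eight floats in memory (one table row)
// dst     - stands in for the destination register
// zeroing - true models {k1}{z}; false models {k1} alone (merge-masking)
void MaskedLoad8(unsigned char mask, const float* src, float* dst, bool zeroing)
{
    assert(src != nullptr && dst != nullptr);
    for (int lane = 0; lane < 8; ++lane) {
        if ((mask >> lane) & 1)
            dst[lane] = src[lane];   // selected lane: copy from memory
        else if (zeroing)
            dst[lane] = 0.0f;        // zero-masking: unselected lane is cleared
        // merge-masking: unselected lane keeps whatever it held before
    }
}
```

The model shows why zero-masking matters for this program: with merge-masking, the unselected lanes would keep stale values, and those values would be added to the accumulator on every pass through the loop.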

The figure below shows how the three set bits in k1 control which locations in memory are written to the destination register, YMM0.

Operation of the vmovaps instruction, with opmask

The next instruction in the loop body, vaddps, adds the packed single-precision numbers in YMM0 to YMM1, an accumulator register, storing the results back into YMM1.

I chose to use 256-bit (32-byte) YMM registers for this example because the Expenses table contains 8 columns of floats per row, and an entire row of eight 4-byte values fits exactly into a YMM register. If I had been working with 16 categories (also of floats), I would have used the 512-bit (64-byte) ZMM registers, with only small changes to the code: the mask would need 16 bits (loaded with kmovw rather than kmovb), and each row would occupy 64 bytes.

I don’t know about you, but it seems pretty amazing to me that the code can cycle through a table with an arbitrary number of rows (but 8 columns), and get subtotals of any combination of the columns using essentially four lines of code!

Finishing up

After the loop finishes, it’s a simple matter to write the subtotals back to the output array.

	vmovaps ymmword ptr[r9], ymm1

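Here the vmovaps instruction copies the calculated subtotals in YMM1 to the memory address for the Totals array. The main() function can then display the results. Because vmovaps requires its memory operand to be aligned, both the expenses table and the output array are declared with 32-byte alignment in the C++ code.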
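For comparison, the same steps (load the mask, do a zero-masked load of each row, accumulate, store) can also be sketched in C++ with compiler intrinsics. This is only a sketch under stated assumptions, not the routine used in this article: GetSelectedTotalsCpp is a hypothetical name, both arrays are assumed to be 32-byte aligned, and the intrinsic path is compiled only when the compiler has AVX512F and AVX512VL enabled. Otherwise a scalar fallback that produces the same totals is used.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__AVX512F__) && defined(__AVX512VL__)
#include <immintrin.h>
#endif

// Hypothetical C++ counterpart of GetSelectedTotals, for comparison only.
// table and totals are assumed to be 32-byte aligned; each row is 8 floats.
void GetSelectedTotalsCpp(int nMonths, unsigned char colSelect,
                          const float* table, float* totals)
{
    assert(reinterpret_cast<std::uintptr_t>(table) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(totals) % 32 == 0);
#if defined(__AVX512F__) && defined(__AVX512VL__)
    __m256 sum = _mm256_setzero_ps();                 // clear the accumulator
    for (int r = 0; r < nMonths; ++r) {
        // Zero-masked aligned load: unselected lanes are set to 0.0f
        __m256 row = _mm256_maskz_load_ps(colSelect, table + 8 * r);
        sum = _mm256_add_ps(sum, row);                // accumulate the row
    }
    _mm256_store_ps(totals, sum);                     // aligned store of the totals
#else
    // Scalar fallback, used when AVX-512 code generation is not enabled
    for (int c = 0; c < 8; ++c)
        totals[c] = 0.0f;
    for (int r = 0; r < nMonths; ++r)
        for (int c = 0; c < 8; ++c)
            if ((colSelect >> c) & 1)
                totals[c] += table[8 * r + c];
#endif
}
```

One design point worth noting in this sketch is that the accumulator is cleared at the start of every call, so the result does not depend on what a previous call or the caller left behind.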

C++ Code for the Example

In the example here, the GetSelectedTotals() function is called twice, once to get the subtotals of home expenses (mortgage, homeowner insurance, and property tax) and again to get subtotals of utility expenses.

// Expenses.cpp
#include <iostream>
using std::cout;
using std::endl;

extern "C" void GetSelectedTotals(int nMonths, unsigned char ColSelect, float Table[], float Totals[]);

// Both arrays are 32-byte aligned, as required by the vmovaps instructions
// in the assembly routine.
float __declspec(align(32)) Expenses[] =
//   1   2    3   4    5    6   7   8      -- expense categories
{ 1800, 32, 200, 70, 130, 100, 60, 150,
  1800, 32, 200, 70, 130, 100, 60, 150,
  1800, 32, 200, 70, 130, 100, 60, 150,
  1800, 32, 200, 70, 130, 100, 60, 150 };
// Legend for expense categories
// 1 = Mortgage, 2 = HO Ins, 3 = PropTax, 4 = Elect,
// 5 = Propane, 6 = Car Ins, 7 = MC Ins, 8 = Cable

float __declspec(align(32)) TotalsArray[8];

int main()
{
   // Get the totals for 4 months for home-related expenses:
   // mortgage, HO insurance, property tax
   unsigned char Selector = 0x7;     // Bit pattern 0000 0111, columns 3, 2, 1
   GetSelectedTotals(4, Selector, Expenses, TotalsArray);

   cout << "Mortgage expenses: " << "$" << TotalsArray[0] << endl;
   cout << "HO Insurance: " << "$" << TotalsArray[1] << endl;
   cout << "Property tax: " << "$" << TotalsArray[2] << endl << endl;

   // Get totals for 4 months for utilities: Electricity, Propane, Cable
   Selector = 0x98; // Bit pattern 1001 1000, columns 8, 5, 4
   GetSelectedTotals(4, Selector, Expenses, TotalsArray);
   cout << "Electricity: " << "$" << TotalsArray[3] << endl;
   cout << "Propane: " << "$" << TotalsArray[4] << endl;
   cout << "Cable: " << "$" << TotalsArray[7] << endl;

   return 0;
}

Assembly Code for the Example

Here is the complete assembly code.  The Prologue and Epilogue are more-or-less boilerplate required in 64-bit assembly programs. Most of the rest of the routine has been described above in detail.

; ExpenseTotals.asm
.code
GetSelectedTotals PROC C
; Prototype: extern "C" void GetSelectedTotals(int nMonths, unsigned char ColSelect, float Table[], float Totals[]);
; Sum the values from Table according to the opmask bits in ColSelect,
; storing the sums in the Totals array.
; Parameters and register usage:
;   nMonths - number of rows to process in Table, in RCX
;   ColSelect - byte with bit pattern for columns to select, in RDX
;   Table - address of the data array, in R8
;   Totals - address of the array to hold the sums, in R9

; Prologue -- boilerplate for 64-bit assembly code
	push	rdi			; RDI is nonvolatile, so save it
	sub	rsp, 20h 
	mov     rdi, rsp	

; Initialization
	kmovb k1, edx				; Load the bit pattern in ColSelect into k1
	vxorps ymm1, ymm1, ymm1			; Clear the accumulator before summing

AnotherRow:
	vmovaps ymm0 {k1}{z}, ymmword ptr[r8]   ; Read a row of floats, but write only
						; the ones specified in the k1 opmask register 
	vaddps ymm1, ymm1, ymm0			; Accumulate onto the totals
	add r8, 32				; Advance to the next row (8 floats = 32 bytes)
	loop AnotherRow                         ; Get another row if RCX is not 0; otherwise exit

	vmovaps ymmword ptr[r9], ymm1		; Store the totals back into memory

; Epilogue
	add		rsp, 20h		; Adjust the stack back to original state
	pop		rdi
	ret					; Return to the caller
GetSelectedTotals ENDP
END

Program Output

Mortgage expenses: $7200
HO Insurance:      $128
Property tax:      $800
Electricity:       $280
Propane:           $520
Cable:             $600



1 reply
  1. Mark44 says:

    I hope that the example I'm presenting here will be of interest to some of you. The program in the article uses a list of monthly expenses for four months in eight categories, such as mortgage, homeowner insurance, utilities, and so on. The example program can calculate the subtotals of any combination of the eight categories in a loop that has three lines of code. The heart of the loop reads all eight values for a given month in one operation, but writes only the ones of interest into a 32-byte destination register (that could hold all eight values, if necessary). The remainder of the loop adds the items of interest to an accumulator, and starts the loop body again until the data is exhausted.
