AVX-512 Programming: Extracting Column Subtotals from a Table

  • #1
Greg Bernhardt submitted a new blog post

AVX-512 Programming: Extracting Column Subtotals from a Table


Continue reading the Original Blog Post.
 


Answers and Replies

  • #2
I hope that the example I'm presenting here will be of interest to some of you. The program in the article uses a list of monthly expenses for four months in eight categories, such as mortgage, homeowner insurance, utilities, and so on. The example program can calculate the subtotals of any combination of the eight categories in a loop that has three lines of code. The heart of the loop reads all eight values for a given month in one operation, but writes only the ones of interest into a 32-byte destination register (that could hold all eight values, if necessary). The remainder of the loop adds the items of interest to an accumulator, and starts the loop body again until the data is exhausted.
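For anyone who wants to see the shape of that loop without reading assembly, here is a rough scalar model in Python. It is not the code from the article. The function names, the mask value, and the expense figures are all made up for illustration, and I am assuming zero-masking (lanes that are not selected read as zero). With C intrinsics, the closest equivalents would probably be a zero-masked load such as `_mm256_maskz_loadu_ps` followed by `_mm256_add_ps`, but that mapping is an assumption on my part.

```python
# Scalar model of a zero-masked vector load followed by a vector add.
# All names and numbers are hypothetical; this only mimics the semantics.

LANES = 8  # eight expense categories per month


def masked_load(row, mask):
    """Copy row, forcing every lane whose mask bit is 0 to 0.0."""
    return [x if (mask >> lane) & 1 else 0.0 for lane, x in enumerate(row)]


def column_subtotals(table, mask):
    """Accumulate only the selected categories across all months."""
    acc = [0.0] * LANES
    for row in table:                    # one iteration per month
        picked = masked_load(row, mask)  # read all eight, keep the selected ones
        acc = [a + p for a, p in zip(acc, picked)]  # lane-wise add
    return acc


# Four months of invented expenses in eight categories.
expenses = [
    [1200.0, 90.0, 150.0, 60.0, 400.0, 80.0, 45.0, 30.0],
    [1200.0, 90.0, 175.0, 60.0, 420.0, 80.0, 45.0, 25.0],
    [1200.0, 90.0, 160.0, 60.0, 380.0, 80.0, 45.0, 40.0],
    [1200.0, 90.0, 140.0, 60.0, 410.0, 80.0, 45.0, 35.0],
]

mask = 0b00000101  # bit i selects category i: here categories 0 and 2
subtotals = column_subtotals(expenses, mask)
print(subtotals)
```

Changing `mask` is all it takes to pick a different combination of categories; the loop body stays the same, which is the point of the masked read.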
 
  • #3
Now that AMD has announced that Zen4 will support AVX-512, do you think it will gain in popularity, especially for consumer chips?

One problem is that AVX-512 really shines in fused multiply-adds (FMAs). It's 512 bits wide, not 64, so there's a factor of 8, and it takes half a clock, so that's 16. But this runs into two problems. One is Amdahl's Law: if half your time is spent doing FMAs, an infinitely fast FMA buys you only a factor of 2. The other is that GPUs are already pretty good at FMAs.
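To put numbers on the Amdahl's Law point, here is a minimal sketch. The function name is just something picked for illustration, and the 50% fraction and 16x factor are the figures from the paragraph above, not measurements.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time becomes `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Half the time in FMAs, FMAs 16x faster: about 1.88x overall.
print(amdahl_speedup(0.5, 16))

# Half the time in FMAs, FMAs infinitely fast: the ceiling is 2x.
print(amdahl_speedup(0.5, float("inf")))
```

So even the full 16x on the FMA part gets you most of the way to the ceiling, and the ceiling itself is set by the half of the program that isn't FMAs.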

Related, if most CPUs don't have this feature, vendors won't code for it.
 
  • #4
Thanks for commenting!
Now that AMD has announced that Zen4 will support AVX-512, do you think it will gain in popularity, especially for consumer chips?
I don't have any idea about whether AVX-512 will gain in popularity now that AMD also supports it. I've been interested in vector (SIMD - single instruction multiple data) operations at the assembly level since I first found out that Intel supported them, back nearly 20 years ago. When I learned that some Intel processors supported 512-bit instructions in their AVX-512 extensions I went out and bought a Dell computer running a Xeon Scalable processor, one of the few Intel products that had support for AVX-512. Not long after getting that computer, about 4 years ago, I wrote about a dozen example assembly programs that used a number of different AVX-512 instructions. The program described in this Insights article was one of them.

Related, if most CPUs don't have this feature, vendors won't code for it.
IIRC, when I bought the Dell computer with its Xeon Scalable, the high-end AMD processors didn't have support for AVX-512. Now that AMD, the other big player in the CPU market, also has it, perhaps software vendors will be more apt to use it.

What I do know is that compiler vendors, such as Microsoft with their Visual Studio product, and likely other compilers such as gcc, generate code that takes advantage of at least some processor capabilities, such as SSE (streaming SIMD extensions) rather than the legacy 8087 floating point instructions. My current VS compiler is about 6 years old -- I keep thinking I will upgrade to a newer version, but haven't done so just yet. I haven't investigated the extent to which that compiler and newer versions take advantage of these newer processor extensions.

The AVX-512 extensions for CPUs and the capabilities of GPUs from nVidia and others are naturals for parallel processing. The hard part seems to be coming up with ways to partition a program to take advantage of parallelization, at least from my perspective.
 
  • #5
Well, IMHO the problem with parallelization is that people are trying to optimize the wrong thing, i.e. CPU efficiency. In a SIMD world, you don't try to figure out which calculation you want before you do it. You do them all and then pick the one you want. People don't learn to code that way. Based on some of the posts here, many don't learn to code at all.
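A toy illustration of "do them all and then pick", written as plain Python lists standing in for vector lanes. The two branch formulas and the function names are invented for the example; real SIMD code would do the select with a mask or blend instruction.

```python
def scalar_style(values):
    """Decide per element which calculation to run, then run only that one."""
    out = []
    for v in values:
        if v < 0:
            out.append(-v * 2)  # hypothetical "negative" calculation
        else:
            out.append(v + 1)   # hypothetical "non-negative" calculation
    return out


def simd_style(values):
    """Run both calculations on every lane, then select per lane with a mask."""
    neg_result = [-v * 2 for v in values]  # branch A for all lanes
    pos_result = [v + 1 for v in values]   # branch B for all lanes
    mask = [v < 0 for v in values]         # which lanes want branch A
    return [a if m else b for a, b, m in zip(neg_result, pos_result, mask)]


values = [-3, 0, 4, -1]
print(scalar_style(values))
print(simd_style(values))
```

Both functions return the same list. The second one does roughly twice the arithmetic, but it has no data-dependent branch, which is the trade a vector unit wants you to make.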

Intel told the world "Use AVX-512" and then released only a few chips that supported it, and did so via a bewildering array of subsets of the instructions. I think if AMD comes out with a fairly wide-ranging standard, that will be different than if they too offer it only here and there. And remember, we don't always get the chance to recompile our code. Sometimes we buy it and are at the vendor's mercy.

I may want to be more explicit: AVX-512 lies in an uncomfortable space between CPUs and GPUs. There certainly are workflows where it's better than either scalar code or GPUs, but are these enough to get people to pay the price premium?
 
