AVX-512 Programming: Extracting Column Subtotals from a Table


Discussion Overview

The discussion centers on programming with AVX-512, using the example of extracting column subtotals from a table of monthly expenses. Participants also explore what AMD's support for AVX-512 could mean for software development and for performance on consumer chips.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Exploratory

Main Points Raised

  • One participant describes a program that calculates subtotals for various expense categories using AVX-512, emphasizing its efficiency in processing multiple values simultaneously.
  • Another participant raises questions about the future popularity of AVX-512 now that AMD supports it, noting that its effectiveness is tied to the prevalence of CPUs that can utilize it.
  • Concerns are expressed regarding Amdahl's Law and the competition from GPUs, which are already proficient at fused multiply-add operations.
  • A participant reflects on their long-standing interest in SIMD operations and the historical context of AVX-512 support in Intel processors, suggesting that AMD's involvement may encourage broader software adoption.
  • There is a critique of current parallelization approaches, arguing that many developers focus on CPU efficiency rather than leveraging the full capabilities of SIMD, which may hinder effective coding practices.
  • Another participant notes that AMD's implementation of AVX-512 involves executing instructions through two sequential 256-bit operations, which may affect performance expectations.

Areas of Agreement / Disagreement

Participants express a range of views on the implications of AMD's support for AVX-512, with some optimistic about its potential for increased adoption, while others remain skeptical about its practical benefits and the challenges of parallelization. The discussion does not reach a consensus on these points.

Contextual Notes

Participants mention the complexity of AVX-512 instruction sets and the historical limitations of CPU support, which may influence software development decisions. There are also references to the challenges of optimizing code for parallel processing, indicating a need for further exploration of effective programming strategies.

Greg Bernhardt submitted a new blog post

AVX-512 Programming: Extracting Column Subtotals from a Table
Continue reading the Original Blog Post.
 

I hope that the example I'm presenting here will be of interest to some of you. The program in the article uses a list of monthly expenses for four months in eight categories, such as mortgage, homeowner insurance, utilities, and so on. The example program can calculate the subtotals of any combination of the eight categories in a loop that has three lines of code. The heart of the loop reads all eight values for a given month in one operation, but writes only the ones of interest into a 32-byte destination register (that could hold all eight values, if necessary). The remainder of the loop adds the items of interest to an accumulator, and starts the loop body again until the data is exhausted.
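The loop described above can be sketched in C with intrinsics. This is a minimal sketch and not the article's assembly code: the function name `column_subtotals`, the use of 32-bit integer amounts, and the row-major table layout are all assumptions made for illustration. When the compiler is not targeting AVX-512F and AVX-512VL, the sketch falls back to a scalar loop that produces the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX512F__) && defined(__AVX512VL__)
#include <immintrin.h>
#endif

#define CATEGORIES 8

/* Hypothetical helper: sum the selected columns of a months x 8 table of
 * 32-bit amounts (assumed layout: one row per month, eight categories).
 * Bit c of `mask` selects category c; unselected slots of `out` end up 0. */
void column_subtotals(const int32_t *table, size_t months,
                      uint8_t mask, int32_t out[CATEGORIES])
{
    assert(table != NULL && out != NULL);
#if defined(__AVX512F__) && defined(__AVX512VL__)
    __m256i acc = _mm256_setzero_si256();
    for (size_t m = 0; m < months; ++m) {
        /* Load one month's row; lanes whose mask bit is 0 are zeroed. */
        __m256i row = _mm256_maskz_loadu_epi32((__mmask8)mask,
                                               table + m * CATEGORIES);
        /* Add the surviving lanes to the running subtotals. */
        acc = _mm256_add_epi32(acc, row);
    }
    _mm256_storeu_si256((__m256i *)out, acc);
#else
    /* Portable fallback with the same result, one lane at a time. */
    for (size_t c = 0; c < CATEGORIES; ++c)
        out[c] = 0;
    for (size_t m = 0; m < months; ++m)
        for (size_t c = 0; c < CATEGORIES; ++c)
            if (mask & (1u << c))
                out[c] += table[m * CATEGORIES + c];
#endif
}
```

Calling `column_subtotals(table, 4, 0x05, out)` would select categories 0 and 2 only; changing the mask changes the combination of categories without touching the loop.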
 
Now that AMD has announced that Zen4 will support AVX-512, do you think it will gain in popularity, especially for consumer chips?

One problem is that AVX-512 really shines in fused multiply-adds (FMAs). It's 512 bits wide, not 64, so there's a factor of 8, and an FMA takes half a clock, so that's a factor of 16. But this runs into two problems. One is Amdahl's Law: if half your time is spent doing FMAs, an infinitely fast FMA buys you a factor of 2. The other is that GPUs are already pretty good at FMAs.
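The Amdahl's Law arithmetic can be checked with a few lines of C. This is only an illustrative sketch; the function name `amdahl_speedup` is made up here. With half the run time in FMAs (p = 0.5) and those FMAs sped up 16-fold (s = 16), the overall speedup is 1 / (0.5 + 0.5/16) = 1 / 0.53125, or about 1.88, and it approaches but never reaches 2 as s grows.

```c
/* Amdahl's Law: overall speedup when a fraction p of the run time
 * (0 <= p <= 1) is accelerated by a factor s (s > 0). */
double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}
```

For example, `amdahl_speedup(0.5, 16.0)` evaluates the 16-fold case above, and passing a very large `s` shows the factor-of-2 ceiling.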

Related, if most CPUs don't have this feature, vendors won't code for it.
 
Thanks for commenting!
Vanadium 50 said:
Now that AMD has announced that Zen4 will support AVX-512, do you think it will gain in popularity, especially for consumer chips?
I don't have any idea about whether AVX-512 will gain in popularity now that AMD also supports it. I've been interested in vector (SIMD, single instruction multiple data) operations at the assembly level since I first found out that Intel supported them, nearly 20 years ago. When I learned that some Intel processors supported 512-bit instructions in their AVX-512 extensions, I went out and bought a Dell computer running a Xeon Scalable processor, one of the few Intel products that had support for AVX-512. Not long after getting that computer, about 4 years ago, I wrote about a dozen example assembly programs that used a number of different AVX-512 instructions. The program described in this Insights article was one of them.

Vanadium 50 said:
Related, if most CPUs don't have this feature, vendors won't code for it.
IIRC, when I bought the Dell computer with its Xeon Scalable, the high-end AMD processors didn't have support for AVX-512. Now that AMD, the other big player in the CPU market, also has it, perhaps software vendors will be more apt to use it.

What I do know is that compiler vendors, such as Microsoft with their Visual Studio product, and likely other compilers such as gcc, generate code that takes advantage of at least some processor capabilities, such as SSE (streaming SIMD extensions) rather than the legacy 8087 floating point instructions. My current VS compiler is about 6 years old. I keep thinking I will upgrade to a newer version, but haven't done so just yet. I haven't investigated the extent to which that compiler and newer versions take advantage of these newer processor extensions.

The AVX-512 extensions for CPUs and the capabilities of GPUs from nVidia and others are naturals for parallel processing. The hard part seems to be coming up with ways to partition a program to take advantage of parallelization, at least from my perspective.
 
Well, IMHO the problem with parallelization is that people are trying to optimize the wrong thing, i.e. CPU efficiency. In a SIMD world, you don't try to figure out which calculation you want before you do it: you do them all and then pick the one you want. People don't learn to code that way. Based on some of the posts here, many don't learn to code at all.
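The "do them all and then pick the one you want" style can be illustrated with a small branch-free loop in C. This is a sketch under assumptions: the function name `select_both` and the two candidate calculations are invented for the example, and the inputs are assumed small enough that the arithmetic does not overflow. Both results are computed for every element, and a mask that is all ones or all zeros picks one, which is the same idea a SIMD blend applies to many lanes at once.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: for each element, the result is 2*x when x is
 * negative and x + 10 otherwise. Both candidates are always computed;
 * a bit mask then selects one of them, with no branch in the loop body. */
void select_both(const int32_t *x, int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        int32_t if_neg    = 2 * x[i];    /* candidate for x < 0  */
        int32_t if_nonneg = x[i] + 10;   /* candidate for x >= 0 */
        int32_t m = -(int32_t)(x[i] < 0); /* all ones if negative, else 0 */
        out[i] = (if_neg & m) | (if_nonneg & ~m);
    }
}
```

The scalar version does more arithmetic per element than a branching version would, which is the trade the post describes: the work is wasted per lane but the whole vector proceeds in lockstep.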

Intel told the world "Use AVX-512" and then released only a few chips that supported it, and did so via a bewildering array of subsets of the instructions. I think if AMD comes out with a fairly wide-ranging standard, that will be different than if they too offer it only here and there. And remember, we don't always get the chance to recompile our code. Sometimes we buy it and are at the vendor's mercy.

I may want to be more explicit: AVX-512 lies in an uncomfortable space between CPUs and GPUs. There certainly are workflows where it's better than either scalar code or GPUs, but are these enough to get people to pay the price premium?
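Because support comes in subsets and shipped binaries cannot always be recompiled, software that wants AVX-512 usually has to check for it at run time. The following is a minimal sketch, assuming a GCC or Clang compiler on x86, where the `__builtin_cpu_supports` builtin is available; the function name `cpu_has_avx512_f_vl` is hypothetical, and other compilers or architectures simply take the fallback path here.

```c
/* Hypothetical helper: returns 1 if the running CPU reports both the
 * AVX-512F and AVX-512VL subsets, else 0. Uses GCC/Clang x86 builtins;
 * anywhere else it conservatively reports 0. */
int cpu_has_avx512_f_vl(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") &&
           __builtin_cpu_supports("avx512vl");
#else
    return 0; /* unknown compiler or architecture: use the scalar path */
#endif
}
```

A program could call this once at startup and choose between an AVX-512 routine and a scalar or AVX2 routine, so that a single binary runs on chips with and without the feature.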
 
Update: the AMD chips with AVX-512 accept AVX-512 instructions, but execute them as two sequential 256-bit operations.
 