AVX-512 Programming: Extracting Column Subtotals from a Table

Mark44 · Mar 4, 2019

Greg Bernhardt submitted a new blog post

Mark44 · Mar 4, 2019

I hope that the example I'm presenting here will be of interest to some of you. The program in the article uses a list of monthly expenses for four months in eight categories, such as mortgage, homeowner insurance, utilities, and so on. The example program can calculate the subtotals of any combination of the eight categories in a loop that has three lines of code. The heart of the loop reads all eight values for a given month in one operation, but writes only the ones of interest into a 32-byte destination register (that could hold all eight values, if necessary). The remainder of the loop adds the items of interest to an accumulator, and starts the loop body again until the data is exhausted.
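For readers without AVX-512 hardware, the loop described above can be modeled in plain Python. This is only a scalar sketch, not the assembly from the article: it assumes zero-masking load semantics (unselected lanes read as 0), and the names `masked_load` and `column_subtotals`, as well as the expense figures, are made up for illustration.

```python
# Scalar model of an 8-lane masked load followed by a vector add.
# Assumption: unselected lanes are zeroed, as with a zero-masking load.
# All names and figures here are illustrative, not taken from the article.

CATEGORIES = 8  # eight 4-byte values fill one 32-byte register


def masked_load(row, mask):
    """Lane i keeps row[i] only if bit i of mask is set; other lanes become 0."""
    return [row[i] if (mask >> i) & 1 else 0 for i in range(CATEGORIES)]


def column_subtotals(table, mask):
    """Add the selected lanes of every row (month) into an accumulator."""
    acc = [0] * CATEGORIES
    for row in table:
        lanes = masked_load(row, mask)              # read all eight, keep some
        acc = [a + x for a, x in zip(acc, lanes)]   # lane-wise add
    return acc


# Four months of made-up expenses in eight categories.
expenses = [
    [1200, 90, 150, 60, 400, 80, 45, 30],
    [1200, 90, 170, 60, 380, 75, 45, 25],
    [1200, 90, 140, 60, 420, 90, 45, 40],
    [1200, 90, 130, 60, 390, 85, 45, 35],
]

mask = 0b00000101  # select categories 0 and 2 only
subtotals = column_subtotals(expenses, mask)
```

Changing `mask` selects a different combination of categories without touching the loop, which is the point of the masked approach.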

Vanadium 50 · May 6, 2023

Now that AMD has announced that Zen 4 will support AVX-512, do you think it will gain in popularity, especially for consumer chips?

One problem is that AVX-512 really shines in fused multiply-adds. It's 512 bits wide, not 64, so there's a factor of 8, and it takes half a clock, so that's 16. But this runs into two problems - one is Amdahl's Law: if half your time is spent doing FMAs, an infinitely fast FMA buys you a factor of 2. The other is that GPUs are already pretty good at FMAs.

Related, if most CPUs don't have this feature, vendors won't code for it.
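The Amdahl's Law figure above is easy to check. A minimal sketch, assuming the usual form of the law (overall speedup = 1 / ((1 - p) + p / s), where p is the fraction of run time that is accelerated and s is the speedup of that fraction); the function name `amdahl_speedup` is just for illustration.

```python
import math


def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Half the run time is FMAs, and the FMAs get 16 times faster.
fma_16x = amdahl_speedup(0.5, 16)

# Half the run time is FMAs, and the FMAs become infinitely fast.
fma_limit = amdahl_speedup(0.5, math.inf)
```

With half the time in FMAs, the 16x case already gives about 1.88x overall, so even an infinitely fast FMA adds little beyond that.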

Mark44 · May 7, 2023

Thanks for commenting!

Vanadium 50 said:

Now that AMD has announced that Zen 4 will support AVX-512, do you think it will gain in popularity, especially for consumer chips?

I don't have any idea about whether AVX-512 will gain in popularity now that AMD also supports it. I've been interested in vector (SIMD - single instruction multiple data) operations at the assembly level since I first found out that Intel supported them, back nearly 20 years ago. When I learned that some Intel processors supported 512-bit instructions in their AVX-512 extensions I went out and bought a Dell computer running a Xeon Scalable processor, one of the few Intel products that had support for AVX-512. Not long after getting that computer, about 4 years ago, I wrote about a dozen example assembly programs that used a number of different AVX-512 instructions. The program described in this Insights article was one of them.

Vanadium 50 said:

Related, if most CPUs don't have this feature, vendors won't code for it.

IIRC, when I bought the Dell computer with its Xeon Scalable, the high-end AMD processors didn't have support for AVX-512. Now that AMD, the other big player in the CPU market, also has it, perhaps software vendors will be more apt to use it.

What I do know is that compiler vendors, such as Microsoft with their Visual Studio product, and likely other compilers such as gcc, generate code that takes advantage of at least some processor capabilities, such as SSE (streaming SIMD extensions) rather than the legacy 8087 floating point instructions. My current VS compiler is about 6 years old -- I keep thinking I will upgrade to a newer version, but haven't done so just yet. I haven't investigated the extent to which that compiler and newer versions take advantage of these newer processor extensions.

The AVX-512 extensions for CPUs and the capabilities of GPUs from nVidia and others are naturals for parallel processing. The hard part seems to be coming up with ways to partition a program to take advantage of parallelization, at least from my perspective.

Vanadium 50 · May 7, 2023

Well, IMHO the problem with parallelization is that people are trying to optimize the wrong thing - i.e. CPU efficiency. In a SIMD world, you don't try and figure out which calculation you want before you do it - you do them all and then pick the one you want. People don't learn to code that way. Based on some of the posts here, many don't learn to code at all.
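As a rough illustration of "do them all and then pick the one you want", here is a scalar Python model of a lane-wise select. It is only a sketch: the helper `select_lanes` and the sample data are invented for this example, and real SIMD code would use a mask register and a blend or masked instruction rather than a Python loop.

```python
def select_lanes(mask, if_true, if_false):
    """Per lane, keep the if_true value where the mask is set, else if_false."""
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]


values = [-3.0, 4.0, -1.5, 9.0]

# Both candidate results are computed for every lane, with no branching.
doubled = [2.0 * v for v in values]
negated = [-v for v in values]

# The comparison produces a per-lane mask; the select picks one result per lane.
mask = [v >= 0 for v in values]
result = select_lanes(mask, doubled, negated)
```

Half of the arithmetic here is thrown away, which looks wasteful by scalar standards, but every lane follows the same instruction stream.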

Intel told the world "Use AVX-512" and then released only a few chips that supported it - and did so via a bewildering array of subsets of the instructions. I think if AMD comes out with a fairly wide-ranging standard that will be different than if they too offer it only here and there. And remember, we don't always get the chance to recompile our code - sometimes we buy it and are at the vendor's mercy.

I may want to be more explicit - AVX-512 lies in an uncomfortable space between CPUs and GPUs. There certainly are workflows where it's better than either scalar or GPUs, but are these enough to get people to pay the price premium?

Vanadium 50 · Sep 10, 2023

Update - the AMD chips with AVX-512 accept AVX-512 instructions, but execute them through two sequential 256-bit operations.
