Interest in post comparing nVidia CUDA code vs. Intel AVX-512 code?


Discussion Overview

The discussion compares CUDA programming on nVidia graphics cards with Intel AVX-512 assembly code, focusing on the performance of regression-line calculations over a set of generated data points. The scope includes technical exploration of programming languages and their capabilities in parallel computing.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant shares their experience with CUDA and AVX-512 implementations, noting similar performance times of approximately 8 ms for CUDA and 9 ms for AVX-512.
  • Another participant suggests exploring various programming languages (C, Java, Julia, Python) for performance comparisons, highlighting Julia's numerical computing capabilities with CUDA.
  • Some participants point out that numpy by itself has no CUDA support, and that Numba is required to run Python code on the GPU.
  • Concerns are raised about the parallelism of the AVX-512 code, with one participant noting that it runs on a single core rather than across cores, unlike the CUDA version, which limits its performance.
  • Discussion includes the impact of data transfer rates over PCIe and memory bandwidth on overall performance, with some participants suggesting that the computational intensity of the task may not fully utilize the GPU's capabilities.
  • One participant mentions they are preparing two articles, one for CUDA and one for AVX-512, with timing comparisons included in the AVX-512 article.
  • Another participant expresses interest in reading the upcoming articles, indicating a supportive community response.

Areas of Agreement / Disagreement

Participants express varying opinions on the performance of different programming languages and the efficiency of parallel execution in CUDA versus AVX-512. There is no consensus on the superiority of one approach over the other, and the discussion remains open-ended regarding the implications of data transfer rates on performance.

Contextual Notes

Participants note limitations related to the computational intensity of the tasks being performed and the dependence on hardware capabilities, particularly regarding PCIe bandwidth and memory performance. There are unresolved questions about the optimal conditions for leveraging parallel computing effectively.

Who May Find This Useful

Readers interested in parallel computing, performance comparisons between different programming languages, and those looking for insights into CUDA and AVX-512 implementations may find this discussion valuable.

Messages
38,089
Reaction score
10,637
I've done a bit of CUDA programming lately, to exercise some parallel code on my nVidia graphics card. I also implemented the same computations in Intel AVX-512 assembly code.

The code I wrote takes a bunch of points (262,144, or ##2^{18}##, to be exact) and calculates the slope and y-intercept of the regression line that best fits these points. Since all the points were generated using a straight-line function, it's easy to tell whether the computed slope and intercept are correct. The two programs came in surprisingly close in elapsed time: about 8 milliseconds for the CUDA version, and about 9 milliseconds for the AVX-512 version. Both versions were run on my Dell computer with a 10-core Xeon Silver processor. The nVidia card is a Quadro P2000, with 8 multiprocessors and 128 cores per MP.
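For readers who want a feel for the general shape of such a kernel, here is a minimal CUDA sketch of accumulating the regression sums on the GPU. It is not the code described above (which hasn't been posted); it assumes a grid-stride loop per thread and double-precision atomicAdd, which Pascal-class cards like the P2000 support.

Code:
// Minimal sketch of accumulating least-squares sums on the GPU.
// Compile with: nvcc -arch=sm_60 ... (double atomicAdd needs sm_60+).
__global__ void regressionSums(const double *x, const double *y, int n,
                               double *sumX, double *sumY,
                               double *sumXY, double *sumXX)
{
    // Each thread accumulates private partial sums over a grid-stride loop.
    double sx = 0.0, sy = 0.0, sxy = 0.0, sxx = 0.0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        sx  += x[i];
        sy  += y[i];
        sxy += x[i] * y[i];
        sxx += x[i] * x[i];
    }
    // Fold the partial sums into the global totals.
    atomicAdd(sumX,  sx);
    atomicAdd(sumY,  sy);
    atomicAdd(sumXY, sxy);
    atomicAdd(sumXX, sxx);
}
// Host side, after copying the four sums back:
//   slope     = (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX);
//   intercept = (sumY - slope*sumX) / n;

A tuned version would typically reduce within each block in shared memory before touching global memory; the plain atomics just keep the sketch short.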

If this piques the interest of enough people, I'll write something up explaining what I did. If not, I won't.
 
It would be interesting to run it with various languages: C vs. Java vs. Julia vs. Python. Julia in particular has some numerical computing support for CUDA, and Python has numpy, although I'm not sure of its CUDA capabilities.
 
jedishrfu said:
It would be interesting to run it with various languages: C vs. Java vs. Julia vs. Python. Julia in particular has some numerical computing support for CUDA, and Python has numpy, although I'm not sure of its CUDA capabilities.
My code is pretty much C/C++ (including CUDA extensions) for the CUDA version, and C/C++ with raw AVX-512 assembly for the other version. There are intrinsics for a lot of the AVX-512 and other assembly instructions, but I've never felt the need to use them.

I've never done anything in Julia, so can't say anything about it. Python + numpy would be very much slower, I believe. Recoding the C/C++ parts in Java might be anywhere from somewhat slower to a lot slower -- don't know.
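For comparison with the raw-assembly approach, the same accumulation written with the compiler intrinsics mentioned above might look like the following sketch (the function name and the assumption that n is a multiple of 8 are mine, not from the posted code):

Code:
#include <immintrin.h>   // AVX-512 intrinsics; compile with -mavx512f

// Sketch: accumulate the four least-squares sums 8 doubles at a time.
// Assumes n is a multiple of 8; a scalar tail loop would handle the rest.
void regressionSumsAvx512(const double *x, const double *y, size_t n,
                          double &sumX, double &sumY,
                          double &sumXY, double &sumXX)
{
    __m512d sx  = _mm512_setzero_pd();
    __m512d sy  = _mm512_setzero_pd();
    __m512d sxy = _mm512_setzero_pd();
    __m512d sxx = _mm512_setzero_pd();
    for (size_t i = 0; i < n; i += 8) {
        __m512d vx = _mm512_loadu_pd(x + i);   // 8 x-values
        __m512d vy = _mm512_loadu_pd(y + i);   // 8 y-values
        sx  = _mm512_add_pd(sx, vx);
        sy  = _mm512_add_pd(sy, vy);
        sxy = _mm512_fmadd_pd(vx, vy, sxy);    // sxy += x*y, fused
        sxx = _mm512_fmadd_pd(vx, vx, sxx);    // sxx += x*x, fused
    }
    // Horizontal reduction of the 8 lanes down to scalars.
    sumX  = _mm512_reduce_add_pd(sx);
    sumY  = _mm512_reduce_add_pd(sy);
    sumXY = _mm512_reduce_add_pd(sxy);
    sumXX = _mm512_reduce_add_pd(sxx);
}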
 
jedishrfu said:
It would be interesting to run it with various languages: C vs. Java vs. Julia vs. Python.
Maybe, but on a totally different level. This would simply show whether the particular implementation of the relevant language was capable of exploiting efficiencies in either AVX-512, CUDA, or both.
jedishrfu said:
Python has numpy, although I'm not sure of its CUDA capabilities.
It has none: for CUDA you need Numba. Which I think demonstrates my point: massively parallel numeric computing is not something that speeds up everything you do; it only speeds up code that is written and/or compiled specifically to take advantage of it.
 
Mark44 said:
If this piques the interest of enough people, I'll write something up explaining what I did. If not, I won't.
Yes it does. With 1024 cores you might expect the GPU to be at least 10x faster than the ##10 \times 512 / 64 = 80## parallel 64-bit computations of 10 AVX-512 cores; however, I wouldn't expect this GPU to shine in this test:
  • 64-bit performance on the P2000 is not great (95 GFLOPS vs. 3 TFLOPS for 32-bit).
  • ##2^{18} \times 8## bytes is 2 GB, in 8 ms that's 25 GB/s. PCIe3 is 32 GB/s, so I think this is the bottleneck rather than core processing.
 
pbuk said:
Yes it does. With 1024 cores you might expect the GPU to be at least 10x faster than the ##10 \times 512 / 64 = 80## parallel 64-bit computations of 10 AVX-512 cores,
But the AVX code isn't running in parallel, at least not based on anything I did. I've done some experimentation in the past on splitting a program up into threads, but there is so much overhead in comparison to the relatively small amount of work I'm doing that it takes way longer with multiple threads than just running a single thread.
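For concreteness, the kind of thread-splitting being described might look like the sketch below (hypothetical code, not the experiment above). The fixed cost of creating and joining the threads is exactly the overhead that can swamp a workload of only a few milliseconds.

Code:
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sketch: divide the points among threads, accumulate
// per-thread partial sums, then combine. Spawning and joining threads
// has a fixed cost that can exceed the time saved when each thread's
// share of the work is only a few milliseconds to begin with.
struct Sums { double sx = 0, sy = 0, sxy = 0, sxx = 0; };

Sums regressionSumsThreaded(const double *x, const double *y, size_t n,
                            unsigned numThreads)
{
    std::vector<Sums> partial(numThreads);
    std::vector<std::thread> workers;
    size_t chunk = n / numThreads;
    for (unsigned t = 0; t < numThreads; ++t) {
        size_t lo = t * chunk;
        size_t hi = (t + 1 == numThreads) ? n : lo + chunk;
        workers.emplace_back([=, &partial] {
            Sums s;
            for (size_t i = lo; i < hi; ++i) {
                s.sx  += x[i];
                s.sy  += y[i];
                s.sxy += x[i] * y[i];
                s.sxx += x[i] * x[i];
            }
            partial[t] = s;   // each thread writes only its own slot
        });
    }
    for (auto &w : workers) w.join();
    Sums total;
    for (const Sums &s : partial) {
        total.sx += s.sx;   total.sy += s.sy;
        total.sxy += s.sxy; total.sxx += s.sxx;
    }
    return total;
}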
pbuk said:
##2^{18} \times 8## bytes is 2 GB
I didn't mention it, but the points are all doubles, so both programs are working with 2 GB of data.
 
Mark44 said:
But the AVX code isn't running in parallel, at least not based on anything I did.
Well, it is executing ##512 / 64 = 8## operations in parallel per core, but with only one core in use that's only 8 parallel operations vs. 1024 for the GPU.

Mark44 said:
I didn't mention it, but the points are all doubles, so both programs are working with 2 GB of data.
Yes, that's what I assumed, but 2 GB takes at least 6 ms over PCIe 3 or 8 ms over DDR4-3200, so the CPU/GPU performance is dominated by bus bandwidth in either case. You need something more computationally intensive than the 6 FLOPs per ##2 \times 64##-bit data point of simple linear least-squares regression before the extra cores and GDDR5 bandwidth of the GPU can make a difference.
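For reference, the 6 FLOPs per point come from updating the four running sums of ordinary least squares (two multiplications and four additions per point); the fit itself is then just a handful of operations at the end:

$$S_x = \sum_i x_i, \qquad S_y = \sum_i y_i, \qquad S_{xy} = \sum_i x_i y_i, \qquad S_{xx} = \sum_i x_i^2$$
$$m = \frac{n S_{xy} - S_x S_y}{n S_{xx} - S_x^2}, \qquad b = \frac{S_y - m S_x}{n}$$

So the arithmetic per point is tiny compared with the cost of moving that point's 16 bytes of data, which is why the task ends up bandwidth-bound.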
 
Mark44 said:
If this piques the interest of enough people, I'll write something up explaining what I did. If not, I won't.
I would be interested in reading it.
 
I'm still working on things. I'm doing it in two Insights articles, one for the CUDA part, and one for the AVX-512 part. The AVX-512 part will also include some timing comparisons. The CUDA part is pretty well done, but I haven't started on the AVX-512 writeup just yet -- it's tax season and I've been working on gathering info for the federal and state returns for my mother's estate.
 
Blessings to you and yours in this time of your bereavement. Among your readership, I, and I'm confident many others, look forward to more of your good writings. Please carry on when you're ready.
 
I'm nearly done with two articles -- one on a CUDA application and the other that does approximately the same thing in AVX-512. One article is finished, and the other is nearly finished. I'm hoping to get them published by the end of this week, maybe.
Edit: I contacted Greg by PM, and he said that the earliest the articles could be published was next Tuesday.
 
