Interest in post comparing nVidia CUDA code vs. Intel AVX-512 code?

AI Thread Summary
The discussion centers on a comparison between nVidia CUDA and Intel AVX-512 code for performing linear regression on a dataset of 262,144 points. Both implementations yielded similar execution times, with the CUDA version taking about 8 milliseconds and the AVX-512 version approximately 9 milliseconds. The conversation highlights the importance of optimizing code to leverage the capabilities of each platform, noting that the performance is often limited by data transfer speeds rather than processing power. There is interest in exploring various programming languages for this task, including C, Java, Julia, and Python, with a focus on their respective capabilities for parallel computing. The author plans to publish detailed articles on both implementations soon, contributing to the understanding of performance differences between these technologies.
I've done a bit of CUDA programming lately, to exercise some parallel code on my nVidia graphics card. I also implemented the computations in Intel AVX-512 assembly code.

The code I wrote takes a bunch of points (262,144, or ##2^{18}##, to be exact) and calculates the slope and y-intercept of the regression line that best fits them. Since all the points were generated using a straight-line function, it's easy to tell whether the computed slope and intercept are correct. The two programs came in surprisingly close in elapsed time: about 8 milliseconds for the CUDA version, and about 9 milliseconds for the AVX-512 version. Both versions were run on my Dell computer with a 10-core Xeon Silver processor. The nVidia card is a Quadro P2000, with 8 multiprocessors and 128 cores per MP.
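For context, here is a minimal serial sketch of the computation being described -- the standard closed-form least-squares fit. The variable names and line coefficients below are placeholders for illustration, not taken from the actual programs:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 262144;  // 2^18 points
    std::vector<double> x(n), y(n);

    // Generate points on a known straight line (these coefficients
    // are made up; the original program's values aren't given).
    const double true_slope = 2.0, true_intercept = -3.0;
    for (std::size_t i = 0; i < n; ++i) {
        x[i] = static_cast<double>(i);
        y[i] = true_slope * x[i] + true_intercept;
    }

    // Accumulate the four sums needed for least squares.
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }

    // Closed-form solution for the best-fit line.
    const double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    const double intercept = (sy - slope * sx) / n;
    std::printf("slope = %f, intercept = %f\n", slope, intercept);
}
```

Since the input points lie exactly on a line, the computed slope and intercept should reproduce the generating coefficients, which is what makes correctness easy to verify.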

If this piques the interest of enough people, I'll write something up explaining what I did. If not, I won't.
 
It would be interesting to run it in various languages: C vs. Java vs. Julia vs. Python. Julia in particular has some numerical computing support for CUDA, and Python has NumPy, although I'm not sure of its CUDA capabilities.
 
jedishrfu said:
It would be interesting to run it in various languages: C vs. Java vs. Julia vs. Python. Julia in particular has some numerical computing support for CUDA, and Python has NumPy, although I'm not sure of its CUDA capabilities.
My code is pretty much C/C++ (including CUDA extensions) for the CUDA version, and C/C++ with raw AVX-512 assembly for the other version. There are intrinsics for a lot of the AVX-512 and other assembly instructions, but I've never felt the need to use them.
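To give a flavor of the CUDA side, here is a minimal sketch of how the per-point sums for this regression could be accumulated on the GPU -- a hypothetical kernel, not the actual code from the programs discussed here:

```cpp
// Each thread accumulates private partial sums over a grid-stride loop,
// then folds them into global totals with atomicAdd. Double-precision
// atomicAdd requires compute capability 6.0+, which the Pascal-based
// P2000 has.
__global__ void regression_sums(const double *x, const double *y, size_t n,
                                double *sx, double *sy,
                                double *sxx, double *sxy)
{
    double px = 0.0, py = 0.0, pxx = 0.0, pxy = 0.0;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        px  += x[i];
        py  += y[i];
        pxx += x[i] * x[i];
        pxy += x[i] * y[i];
    }
    atomicAdd(sx, px);
    atomicAdd(sy, py);
    atomicAdd(sxx, pxx);
    atomicAdd(sxy, pxy);
}
```

A production kernel would more likely use a shared-memory or warp-shuffle reduction before the atomics, but the grid-stride pattern above is the simplest correct form.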

I've never done anything in Julia, so can't say anything about it. Python + numpy would be very much slower, I believe. Recoding the C/C++ parts in Java might be anywhere from somewhat slower to a lot slower -- don't know.
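For what it's worth, the intrinsics route mentioned above (but not used here) might look roughly like this for the inner reduction -- a sketch assuming n is a multiple of 8, not a transcription of the actual assembly:

```cpp
#include <immintrin.h>
#include <cstddef>

// Accumulate the four regression sums 8 doubles at a time with AVX-512.
// out = {sum_x, sum_y, sum_xx, sum_xy}
void avx512_sums(const double *x, const double *y, std::size_t n,
                 double out[4])
{
    __m512d sx  = _mm512_setzero_pd();
    __m512d sy  = _mm512_setzero_pd();
    __m512d sxx = _mm512_setzero_pd();
    __m512d sxy = _mm512_setzero_pd();
    for (std::size_t i = 0; i < n; i += 8) {   // n assumed % 8 == 0
        __m512d vx = _mm512_loadu_pd(x + i);
        __m512d vy = _mm512_loadu_pd(y + i);
        sx  = _mm512_add_pd(sx, vx);
        sy  = _mm512_add_pd(sy, vy);
        sxx = _mm512_fmadd_pd(vx, vx, sxx);    // sxx += vx * vx
        sxy = _mm512_fmadd_pd(vx, vy, sxy);    // sxy += vx * vy
    }
    // Horizontal reductions of the 8-lane accumulators.
    out[0] = _mm512_reduce_add_pd(sx);
    out[1] = _mm512_reduce_add_pd(sy);
    out[2] = _mm512_reduce_add_pd(sxx);
    out[3] = _mm512_reduce_add_pd(sxy);
}
```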
 
jedishrfu said:
It would be interesting to run it in various languages: C vs. Java vs. Julia vs. Python.
Maybe, but on a totally different level. This would simply show whether the particular implementation of the relevant language was capable of exploiting efficiencies in either AVX-512, CUDA, or both.
jedishrfu said:
Python has NumPy, although I'm not sure of its CUDA capabilities.
It has none: for CUDA you need Numba. Which I think demonstrates my point: massively parallel numeric computing is not something that speeds up everything you do; it only speeds up code that is written and/or compiled specifically to take advantage of it.
 
Mark44 said:
If this piques the interest of enough people, I'll write something up explaining what I did. If not, I won't.
Yes it does. With 1024 cores you might expect the GPU to be at least 10x faster than the ##10 \times 512/64 = 80## parallel 64-bit computations of 10 AVX-512 cores; however, I wouldn't expect this GPU to shine in this test:
  • 64-bit performance on the P2000 is not great (95 GFLOPS vs. 3 TFLOPS for 32-bit).
  • ##2^{18} \times 8## bytes is 2 GB; in 8 ms that's 25 GB/s. PCIe 3 is 32 GB/s, so I think this is the bottleneck rather than core processing.
 
pbuk said:
Yes it does. With 1024 cores you might expect the GPU to be at least 10x faster than the ##10 \times 512/64 = 80## parallel 64-bit computations of 10 AVX-512 cores;
But the AVX code isn't running in parallel, at least not based on anything I did. I've done some experimentation in the past on splitting a program up into threads, but there is so much overhead compared to the relatively small amount of work I'm doing that it takes way longer with multiple threads than just running a single thread.
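To illustrate the kind of threading split being described, here is a rough sketch assuming a simple chunked partition with std::thread -- an example of the technique, not the actual experiment. For a few milliseconds of total work, the thread creation and join overhead can easily exceed the savings:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Accumulate partial regression sums over the index range [lo, hi).
static void partial_sums(const double *x, const double *y,
                         std::size_t lo, std::size_t hi, double *out) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = lo; i < hi; ++i) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    out[0] = sx; out[1] = sy; out[2] = sxx; out[3] = sxy;
}

// Split n points across nthreads workers and combine the results.
void threaded_sums(const double *x, const double *y, std::size_t n,
                   unsigned nthreads, double totals[4]) {
    std::vector<std::thread> pool;
    std::vector<double> parts(4 * nthreads, 0.0);
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = n * t / nthreads, hi = n * (t + 1) / nthreads;
        pool.emplace_back(partial_sums, x, y, lo, hi, &parts[4 * t]);
    }
    for (auto &th : pool) th.join();  // thread startup/join is the overhead
    for (int k = 0; k < 4; ++k) totals[k] = 0.0;
    for (unsigned t = 0; t < nthreads; ++t)
        for (int k = 0; k < 4; ++k) totals[k] += parts[4 * t + k];
}
```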
pbuk said:
##2^{18} \times 8## bytes is 2 GB
I didn't mention it, but the points are all doubles, so both programs are working with 2 GB of data.
 
Mark44 said:
But the AVX code isn't running in parallel, at least not based on anything I did.
Well, it is executing ##512/64 = 8## operations in parallel per core, but with only one core that's only 8 parallel operations vs. 1024 for the GPU.

Mark44 said:
I didn't mention it, but the points are all doubles, so both programs are working with 2 GB of data.
Yes, that's what I assumed, but 2 GB takes at least 6 ms over PCIe 3 or 8 ms over DDR4-3200, so the CPU/GPU performance is dominated by bus bandwidth in either case. You need something more intensive than the 6 FLOPs per ##2 \times 64##-bit data point of simple linear least-squares regression so that the extra cores and GDDR5 bandwidth of the GPU can make a difference.
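A rough roofline-style check makes the point. Using the 6 FLOPs per 16-byte point quoted above, together with a nominal 25.6 GB/s for single-channel DDR4-3200 (my figure, not from the thread):

$$\frac{6 \text{ FLOPs}}{16 \text{ bytes}} \times 25.6 \text{ GB/s} = 0.375 \text{ FLOP/byte} \times 25.6 \text{ GB/s} \approx 9.6 \text{ GFLOP/s},$$

which is far below even the P2000's 95 GFLOPS double-precision peak, so both devices sit on the memory-bound side of the roofline for this problem.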
 
Mark44 said:
If this piques the interest of enough people, I'll write something up explaining what I did. If not, I won't.
I would be interested in reading it.
 
I'm still working on things. I'm doing it in two Insights articles, one for the CUDA part, and one for the AVX-512 part. The AVX-512 part will also include some timing comparisons. The CUDA part is pretty well done, but I haven't started on the AVX-512 writeup just yet -- it's tax season and I've been working on gathering info for the federal and state returns for my mother's estate.
 
Blessings to you and yours in this time of your bereavement. Among your readership, I, and I'm confident many others, look forward to more of your good writings. Please carry on when you're ready.
 
I'm nearly done with the two articles -- one on the CUDA application, and the other doing approximately the same thing in AVX-512. The first is finished, and the second is nearly so. I'm hoping to get them published by the end of this week, maybe.
Edit: I contacted Greg by PM, and he said that the earliest the articles could be published was next Tuesday.
 