How Does CUDA Programming Unlock the Potential of Parallel Computing with GPUs?

  • Thread starter: Mark44
  • Tags: GPU programming

Discussion Overview

The discussion centers on CUDA programming and its role in unlocking the potential of parallel computing with GPUs. Participants share their experiences with CUDA, explore its use for improving floating point performance, and discuss the challenges and benefits of parallel processing for various computational problems.

Discussion Character

  • Exploratory
  • Technical explanation
  • Conceptual clarification
  • Debate/contested

Main Points Raised

  • One participant shares their initial experience with CUDA programming, highlighting the performance benefits of parallelism in vector addition tasks.
  • Another participant notes that programmers are increasingly using GPUs for floating point performance, suggesting that multiple GPUs can be utilized for enhanced throughput.
  • Some participants mention specific applications of parallel processing, such as fractal generation, while also noting that certain interconnected problems, like solving partial differential equations, may not benefit from parallelism due to dependencies between computation threads.
  • A participant emphasizes the importance of organizing algorithms to effectively leverage parallel computing, referencing the independence of operations in their CUDA example.

Areas of Agreement / Disagreement

Participants express a range of views on the effectiveness of parallel processing for different types of problems. While there is enthusiasm for the capabilities of CUDA and GPUs, there is also acknowledgment of the limitations and challenges posed by certain computational tasks that require coordination between threads.

Contextual Notes

Some discussions touch on the need for synchronization in algorithms when threads depend on one another, which may complicate the use of parallel computing in specific scenarios.

Who May Find This Useful

Individuals interested in GPU programming, parallel computing, and those exploring performance optimization in computational tasks may find this discussion relevant.

Not a question - I just got started writing CUDA code to run on an NVIDIA card I bought a couple of months ago. I've been interested in highly performant code for a long time and have spent some time in the last few years tinkering with the technologies featured in Intel and AMD CPUs. These include code that directly accesses the floating point unit (FPU), as well as the SIMD instruction set extensions that have been added over the years, such as MMX, SSE, AVX and a whole alphabet soup of other technologies.

The CUDA stuff offered by NVIDIA is pretty cool, making it possible to write some really fast code via massive parallelism.

Here's a very simple example to give you a little of the flavor of what's going on. There's a lot that's not shown, but there's enough here to show the power that's available.

The first snippet is "host" code, code that is mostly plain old C or C++ that will run on the CPU. Here we have two 50,000 element arrays being initialized. The "h" prefixes are reminders that these are arrays in the host memory.

C:
// Host code
int numElements = 50000;
for (int i = 0; i < numElements; ++i)
{
    h_A[i] = 0.0625f * i;
    h_B[i] = h_A[i] + 0.0625f;
}
// ... (copying the host arrays to the device is omitted)
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
// ...
The part that isn't plain old C code is the vectorAdd<<<...>>> line, which uses a CUDA extension to C to call "device" code, code that runs on the graphics processing unit (GPU). A part that I omitted was the business of copying the two arrays on the host to corresponding arrays on the device. These latter arrays have "d" prefixes.
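The launch line takes blocksPerGrid and threadsPerBlock, which the snippet above doesn't define. Here is a minimal sketch of how such values are commonly worked out, written in plain C so it runs without a GPU. The helper names blocks_per_grid and surplus_threads are hypothetical, and the block size of 256 used in the note below is an assumption, not something stated in the post.

```c
#include <assert.h>

/* Smallest number of blocks such that
 * blocks * threadsPerBlock >= numElements (integer ceiling division). */
int blocks_per_grid(int numElements, int threadsPerBlock)
{
    assert(numElements >= 0 && threadsPerBlock > 0);
    return (numElements + threadsPerBlock - 1) / threadsPerBlock;
}

/* Threads launched past the end of the array. The device code has to
 * make sure these threads do nothing. */
int surplus_threads(int numElements, int threadsPerBlock)
{
    return blocks_per_grid(numElements, threadsPerBlock) * threadsPerBlock
           - numElements;
}
```

With 50,000 elements and an assumed 256 threads per block, blocks_per_grid returns 196, so 50,176 threads are launched and 176 of them have no element to work on. That is why device code normally checks its index against the array length before touching memory.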

Below is the code for the function that adds two vectors. A close look suggests that it is adding a component of one array to the corresponding component of another array, and storing this value in the component with the same index of a third array. In short, this code is just adding two numbers together, and storing their sum in another number. I thought we were adding two 50,000 element arrays. What gives here?
C:
// Device code
/* Vector addition: C = A + B.
 * This sample implements element by element vector addition.
 */
__global__ void vectorAdd(const float *A, const float *B, float *C, int N)
{
    // Global index of this thread across the whole grid
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // Guard: the grid may contain more threads than array elements
    if (i < N)
    {
        C[i] = A[i] + B[i];
    }
}
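What's happening under the hood is that the kernel launch creates one thread for every (block, thread) pair, tens of thousands of them in this case, and the GPU runs many of them concurrently. Each thread handles a single index i, so instead of iterating through the two arrays sequentially, the additions are done in parallel. For large arrays this can be much faster than a sequential loop on a CPU, although for an operation as simple as addition the time spent copying the arrays to and from the device can eat up much of the gain.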
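To make the indexing concrete, here is a rough CPU-only emulation of the launch in plain C. It is a sketch, not how CUDA actually schedules work: the function names are hypothetical, and the int parameters blockDim, blockIdx and threadIdx stand in for CUDA's built-in blockDim.x, blockIdx.x and threadIdx.x.

```c
#include <assert.h>

/* What ONE GPU thread does: compute its global index and add one pair.
 * The three int parameters emulate CUDA's built-in index variables. */
static void vector_add_one_thread(const float *A, const float *B, float *C,
                                  int N, int blockDim, int blockIdx,
                                  int threadIdx)
{
    int i = blockDim * blockIdx + threadIdx; /* this thread's global index */
    if (i < N)                               /* surplus threads do nothing */
        C[i] = A[i] + B[i];
}

/* Visit every (block, thread) pair exactly once, in order. On a GPU
 * these calls would run concurrently, with no guaranteed ordering. */
void emulate_launch(const float *A, const float *B, float *C, int N,
                    int blocksPerGrid, int threadsPerBlock)
{
    assert(blocksPerGrid >= 0 && threadsPerBlock > 0);
    for (int block = 0; block < blocksPerGrid; ++block)
        for (int thread = 0; thread < threadsPerBlock; ++thread)
            vector_add_one_thread(A, B, C, N, threadsPerBlock, block, thread);
}
```

Because each call writes only its own C[i] and reads nothing that another call writes, the two loops could be executed in any order, or all at once, with the same result. That independence is what lets the GPU version run the additions in parallel.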

Anyway, I'm pretty excited about this and thought I'd share what I've learned so far (which isn't very much - I'm still pretty much a n00b).
 
Programmers looking to enhance floating point performance without spending a lot of money on fancy CPUs are using the parallel processing features of GPUs to boost throughput. It's not unheard of in custom rigs to have multiple GPUs running for even more number crunching capability. nVidia competitor ATI announced similar plans several years ago for using its GPU in non-graphics number crunching.

http://xcorr.net/2013/12/25/gpu-computing-which-card-should-you-get/
 
Look into parallel processing more too. Some problems, like fractal generation, benefit from parallel computation. Other, more interconnected problems don't, like numerically solving a system of partial differential equations, where the value at a given point depends on the values at nearby points. That means the computation thread for that point must wait for, or coordinate with, the threads computing the neighboring points.
 
SteamKing said:
Programmers looking to enhance floating point performance without spending a lot of money on fancy CPUs are using the parallel processing features of GPUs to boost throughput. It's not unheard of in custom rigs to have multiple GPUs running for even more number crunching capability. nVidia competitor ATI announced similar plans several years ago for using its GPU in non-graphics number crunching.

http://xcorr.net/2013/12/25/gpu-computing-which-card-should-you-get/
My nVidia card (GTX 750) was pretty reasonable at around $100. It has 4 multiprocessors with 128 CUDA cores in each one. Some of the higher-end nVidia cards (like the GTX 980 in one of the links in the page you linked to above) have over 2000 CUDA cores, but sell for between $550 and $600, and I guess people who are really into it can drop $4000 on one of these cards (e.g., Tesla K40). I picked a card that had close to the highest "compute capability" (5.0) at a reasonable cost.

A lot of people get these kinds of cards for the computer games they support, but since Solitaire suits me just fine, I don't care at all about the game performance.
 
jedishrfu said:
Look into parallel processing more too. Some problems, like fractal generation, benefit from parallel computation. Other, more interconnected problems don't, like numerically solving a system of partial differential equations, where the value at a given point depends on the values at nearby points. That means the computation thread for that point must wait for, or coordinate with, the threads computing the neighboring points.
Right. There's a real trick to organizing your algorithm to take advantage of parallel computing. Some of the samples that are provided in the CUDA toolkit deal with synchronizing threads in situations where one thread has to wait on another. In the sample code I have in my post, the additions are completely independent of one another, so they can be done in parallel.
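One common way to organize a neighbor-dependent computation is to keep two buffers, so that every update in a sweep reads only the previous sweep's values. The sketch below shows the idea in plain C for a 1D grid where each interior point is replaced by the average of its two neighbors (a Jacobi-style iteration). This is an illustration chosen for this post, not one of the CUDA toolkit samples, and the names jacobi_sweep and jacobi_solve are hypothetical.

```c
#include <assert.h>

/* One sweep: each interior point becomes the average of its neighbours.
 * All reads come from `in` and all writes go to `out`, so the n-2
 * updates inside a sweep do not depend on one another. */
void jacobi_sweep(const float *in, float *out, int n)
{
    assert(n >= 2);
    out[0] = in[0];             /* boundary values stay fixed */
    out[n - 1] = in[n - 1];
    for (int i = 1; i < n - 1; ++i)
        out[i] = 0.5f * (in[i - 1] + in[i + 1]);
}

/* Run `sweeps` sweeps, swapping the buffers after each one. The swap is
 * the one place where every point must be finished before any point
 * continues. Returns whichever buffer holds the final values. */
float *jacobi_solve(float *a, float *b, int n, int sweeps)
{
    float *in = a, *out = b;
    for (int s = 0; s < sweeps; ++s) {
        jacobi_sweep(in, out, n);
        float *tmp = in;        /* previous output becomes next input */
        in = out;
        out = tmp;
    }
    return in;
}
```

Within a sweep, the loop body has the same independence as the vector addition, so it could be handed to one thread per point. The dependence between neighbors hasn't gone away; it has been moved to the boundary between sweeps, which is where a parallel version needs its synchronization. On a GPU, one possible arrangement is to make each sweep its own kernel launch, since CUDA's __syncthreads() only synchronizes the threads within a single block.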
 
