Sorry if this answer is way too elementary...
Finding the maximum value in an array is a classic reduction problem; it's the same as finding the sum, average, etc., just with a different binary operator (one that returns the larger of its two arguments). Given a fast (or the fastest) way to compute the sum of an array in CUDA, the same algorithm should be a fast (or the fastest) way to compute the maximum value in the array.
So I'm seeing this as a multi-phase solution. In the first phase, each block computes the maximum of the portion of the array assigned to it. In the second phase, some subset of the blocks computes the maximum of the per-block maxima produced in the first phase. Repeat until a single block is reducing all of the remaining candidates; the result of that final reduction is guaranteed to be the maximum of the array. You can then consider things like shared memory, data distribution, etc. to increase performance.
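Here's a CPU-side sketch (plain C++, not actual CUDA) of the tree reduction each block would perform on its chunk; `block_max` is just an illustrative name I'm using, and you should imagine the chunk sitting in shared memory, each `i` handled by one thread, and a `__syncthreads()` between strides:

```cpp
#include <algorithm>
#include <cstddef>

// In-place tree reduction over one block's chunk: at stride s,
// slot i (for i a multiple of 2s) absorbs the maximum of slot i+s.
// Other slots keep stale values, which is fine -- only slot 0 matters.
int block_max(int* chunk, std::size_t n) {
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t i = 0; i + stride < n; i += 2 * stride) {
            chunk[i] = std::max(chunk[i], chunk[i + stride]);
        }
    }
    return chunk[0]; // the block's maximum ends up in slot 0
}
```

In a real kernel the inner loop would become something like `if (tid % (2 * stride) == 0) sdata[tid] = max(sdata[tid], sdata[tid + stride]);` followed by `__syncthreads();` per stride, though there are well-known variants with better addressing patterns.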
Example: A = {3, 10, 1, 9, 2, 8, 3, 4, 8, 2, 1, 7, 2, 5, 6, 1, 2, 5, 3, 2}
Using 5 blocks to handle 4 elements each (each row below shows the arrays after one in-place reduction step: at stride s, element i becomes the max of elements i and i+s for every i that is a multiple of 2s, so other slots keep stale values):
Phase 1:
{3, 10, 1, 9} {2, 8, 3, 4} {8, 2, 1, 7} {2, 5, 6, 1} {2, 5, 3, 2}
{10, 10, 9, 9} {8, 8, 4, 4} {8, 2, 7, 7} {5, 5, 6, 1} {5, 5, 3, 2}
{10, 10, 9, 9} {8, 8, 4, 4} {8, 2, 7, 7} {6, 5, 6, 1} {5, 5, 3, 2}
Phase 2:
{10, 8, 8, 6} {5}
{10, 8, 8, 6} {5}
{10, 8, 8, 6} {5}
Phase 3:
{10, 5}
{10, 5}
Maximum is 10
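The whole multi-phase scheme above can be sketched on the CPU like this (again plain C++, not CUDA; `multi_phase_max` is an illustrative name, and each pass over the chunks stands in for one kernel launch over the surviving candidates):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Repeatedly reduce every chunk of at most `block_size` candidates to
// a single maximum, as a grid of CUDA blocks would, until one value
// remains. block_size = 4 mirrors the worked example above.
int multi_phase_max(std::vector<int> v, std::size_t block_size = 4) {
    while (v.size() > 1) {
        std::vector<int> next;  // one maximum per block
        for (std::size_t start = 0; start < v.size(); start += block_size) {
            std::size_t end = std::min(start + block_size, v.size());
            int m = v[start];
            for (std::size_t i = start + 1; i < end; ++i)
                m = std::max(m, v[i]);  // the block's reduction
            next.push_back(m);
        }
        v = next;  // this phase's maxima are the next phase's input
    }
    return v[0];
}
```

On the example array this runs exactly the phases traced above: {10, 8, 8, 6, 5} after phase 1, {10, 5} after phase 2, then 10.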