Sorry if this answer is way too elementary...
Finding the maximum value in an array is a classic reduction problem; it's the same as finding the sum, average, etc., just with a different binary operator (one that returns the larger of its two arguments). Given a fast (or the fastest) way to compute the sum of an array in CUDA, the same algorithm should be a fast (or the fastest) way to compute the maximum value in the array.
So I'm seeing this as a multi-phase solution. In the first phase, each block computes the maximum of the portion of the array assigned to it. In the second phase, some subset of the blocks computes the maximum of the per-block maxima produced in the first phase. Repeat until a single block is reducing all of the remaining candidates; the result of that final reduction is guaranteed to be the maximum of the array. You can then consider things like shared memory, data distribution, etc. to increase performance.
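Here's a CPU-side sketch (plain C++, not actual CUDA) of the tree reduction each block would perform on its chunk; `block_max` is just an illustrative name I'm using, and you should imagine the chunk sitting in shared memory, each `i` handled by one thread, and a `__syncthreads()` between strides:

```cpp
#include <algorithm>
#include <cstddef>

// In-place tree reduction over one block's chunk: at stride s,
// slot i (for i a multiple of 2s) absorbs the maximum of slot i+s.
// Other slots keep stale values, which is fine -- only slot 0 matters.
int block_max(int* chunk, std::size_t n) {
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t i = 0; i + stride < n; i += 2 * stride) {
            chunk[i] = std::max(chunk[i], chunk[i + stride]);
        }
    }
    return chunk[0]; // the block's maximum ends up in slot 0
}
```

In a real kernel the inner loop would become something like `if (tid % (2 * stride) == 0) sdata[tid] = max(sdata[tid], sdata[tid + stride]);` followed by `__syncthreads();` per stride, though there are well-known variants with better addressing patterns.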
Example: A = {3, 10, 1, 9, 2, 8, 3, 4, 8, 2, 1, 7, 2, 5, 6, 1, 2, 5, 3, 2}
Using 5 blocks to handle 4 elements each (each row below shows the arrays after one in-place reduction step: at stride s, element i becomes the max of elements i and i+s for every i that is a multiple of 2s, so other slots keep stale values):
Phase 1:
{3, 10, 1, 9} {2, 8, 3, 4} {8, 2, 1, 7} {2, 5, 6, 1} {2, 5, 3, 2}
{10, 10, 9, 9} {8, 8, 4, 4} {8, 2, 7, 7} {5, 5, 6, 1} {5, 5, 3, 2}
{10, 10, 9, 9} {8, 8, 4, 4} {8, 2, 7, 7} {6, 5, 6, 1} {5, 5, 3, 2}
Phase 2:
{10, 8, 8, 6} {5}
{10, 8, 8, 6} {5}
{10, 8, 8, 6} {5}
Phase 3:
{10, 5}
{10, 5}
Maximum is 10
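The whole multi-phase scheme above can be sketched on the CPU like this (again plain C++, not CUDA; `multi_phase_max` is an illustrative name, and each pass over the chunks stands in for one kernel launch over the surviving candidates):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Repeatedly reduce every chunk of at most `block_size` candidates to
// a single maximum, as a grid of CUDA blocks would, until one value
// remains. block_size = 4 mirrors the worked example above.
int multi_phase_max(std::vector<int> v, std::size_t block_size = 4) {
    while (v.size() > 1) {
        std::vector<int> next;  // one maximum per block
        for (std::size_t start = 0; start < v.size(); start += block_size) {
            std::size_t end = std::min(start + block_size, v.size());
            int m = v[start];
            for (std::size_t i = start + 1; i < end; ++i)
                m = std::max(m, v[i]);  // the block's reduction
            next.push_back(m);
        }
        v = next;  // this phase's maxima are the next phase's input
    }
    return v[0];
}
```

On the example array this runs exactly the phases traced above: {10, 8, 8, 6, 5} after phase 1, {10, 5} after phase 2, then 10.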