Why Is My OpenMP FORTRAN Program Slower in Parallel Than in Single Thread?

  • Context: Fortran
  • Thread starter: jelanier
  • Tags: Beginner, Fortran

Discussion Overview

The discussion centers around the performance of a FORTRAN program utilizing OpenMP for parallel processing. Participants are exploring why the program exhibits slower execution times in parallel compared to single-threaded execution. The focus includes aspects of code optimization, parallelization overhead, and the suitability of the test being conducted.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Jim reports that his OpenMP FORTRAN program runs slower in parallel than in single-threaded mode, providing specific timing results for different thread counts.
  • Some participants note that parallelization introduces overhead, which can lead to increased execution times unless the parallel workload is substantial enough to offset this overhead.
  • There is a suggestion to post the code directly in the thread for better analysis, as the provided link to the code is broken.
  • One participant emphasizes that the test may not be effective, as the loop could be optimized away, recommending the use of array operations instead.
  • Another participant points out that the output of the loop is not being utilized, which could lead to the compiler optimizing it out entirely, thus affecting timing results.
  • Jim expresses uncertainty about why OpenMP does not perform well on loops, particularly since he has many loops in his code that he wishes to parallelize.
  • There is a discussion about the number of iterations in loops and the potential impact of thread setup time relative to the workload of each thread.

Areas of Agreement / Disagreement

Participants generally agree that overhead from parallelization can lead to slower performance if the workload per thread is not significant. However, there is no consensus on the specific reasons for Jim's performance issues, and multiple viewpoints on how to address the problem are presented.

Contextual Notes

Some limitations are noted, such as the potential for the compiler to optimize away unused calculations, and the need for more details about the hardware and code structure to provide better assistance.

Who May Find This Useful

This discussion may be useful for programmers and researchers working with OpenMP in FORTRAN, particularly those interested in performance optimization and parallel processing challenges.

  • #31
jelanier said:
If the loop is not a function of i, it can't possibly run in parallel.

Close. If you have x = x * i, it is a function of i, but still can't run in parallel.

Probably the easiest way to imagine it is to think about giving half the loop to one thread and half to another. Would you get the same answer? No matter how you split it? I usually ask myself "if I had N CPUs, one for each of the N elements of the loop, would I get the right answer?"
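To make the dependence point concrete, here is a hypothetical sketch (not the thread's actual code): the first loop has fully independent iterations and splits across threads trivially, while the second carries a dependence through s from one iteration to the next and only parallelizes because summation is associative, which OpenMP exposes through the REDUCTION clause.

```fortran
program dependence_demo
  implicit none
  integer :: i
  real(8) :: s, x(1000000)

  ! Independent iterations: each x(i) depends only on i, so
  ! any split of the loop across threads gives the same answer.
  !$omp parallel do
  do i = 1, 1000000
     x(i) = sqrt(dble(i))
  end do
  !$omp end parallel do

  ! Loop-carried dependence: s needs the previous iteration's s.
  ! A naive split would be wrong; REDUCTION makes it safe by giving
  ! each thread a private partial sum and combining them at the end.
  s = 0.0d0
  !$omp parallel do reduction(+:s)
  do i = 1, 1000000
     s = s + x(i)
  end do
  !$omp end parallel do

  print *, 's =', s
end program dependence_demo
```

Compile with gfortran -fopenmp; without the flag the directives are ignored and the loops simply run serially.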

jelanier said:
The purpose of this code was to count time more reliably.

Well, you aren't doing that, exactly. You're measuring something, that's for sure, but it may be totally unrelated to any real code. In particular, since x is never used at the end, the compiler might optimize away part or all of the code involving x, including the code inside the PARALLEL region. That said, using wall clock time is a good idea: that's what you really care about, after all.
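For example (a hypothetical sketch, not the thread's code): if nothing reads x after the loop, an optimizing compiler is free to delete the loop entirely, and the timer then brackets an empty region. Consuming the result, even trivially, defeats that dead-code elimination.

```fortran
program dce_demo
  implicit none
  integer :: i
  real(8) :: x

  x = 0.0d0
  do i = 1, 100000000
     x = x + sin(dble(i))   ! with -O3, this whole loop could be
  end do                    ! deleted if x were never used afterwards

  ! Printing x forces the compiler to keep the computation,
  ! so any timing around the loop reflects real work.
  print *, 'x =', x
end program dce_demo
```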
 
  • #32
Understood. I think you took me too literally on "function"; I clarified when I said "each iteration requires knowledge of the previous one." The counter I'm using should work because it reads the system clock, so the code being timed should not interfere with it. I think you are saying it may read the system clock while a code process is still running. One of the timing procedures I used in another test shows the timing of each thread, and when multiplied by the number of threads the results were close to the system-clock result. Can you give me an example of wall clock time so I can compare? I thought the system clock was wall clock. A quote from another forum: "system_clock reports 'wall time' or elapsed time." Is this not correct? I would expect it to be conservative, because the computer is doing things other than the code I'm testing.

see https://gcc.gnu.org/onlinedocs/gfortran/SYSTEM_005fCLOCK.html
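For what it's worth, system_clock does report elapsed (wall) time, provided the tick difference is divided by count_rate. A minimal sketch (my own example, not from the thread), with omp_get_wtime() shown as the OpenMP alternative that returns seconds directly:

```fortran
program timing_demo
  use omp_lib          ! provides omp_get_wtime()
  implicit none
  integer(8) :: t0, t1, rate
  integer :: i
  real(8) :: w0, w1, s

  call system_clock(count_rate=rate)
  call system_clock(t0)
  w0 = omp_get_wtime()

  s = 0.0d0
  do i = 1, 10000000
     s = s + sqrt(dble(i))
  end do

  w1 = omp_get_wtime()
  call system_clock(t1)

  print *, 'checksum:          ', s   ! use s so the loop is not optimized away
  print *, 'system_clock (s):  ', dble(t1 - t0) / dble(rate)
  print *, 'omp_get_wtime (s): ', w1 - w0
end program timing_demo
```

Both timers should agree closely; compile with gfortran -fopenmp so omp_lib is available.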

Thanks
 
  • #33
You are using wall clock time now.
 
  • #34
OK, I thought you had seen my timing-post code using system_clock. Anyway, I wrote some code with a parallel loop, and the speed increase was verified: it is a little better than twice the speed. A variable, Z(100), is printed to confirm there are no errors in the parallel portion. I also compiled with and without optimization to compare results.

See results:

C:\OpenMP\Test Timing>set OMP_NUM_THREADS=1

C:\OpenMP\Test Timing>Test_T
MP time 0.961 seconds
z100 WI MP 406.86585116386414
NO MP time 0.938 seconds
z100 NO MP 406.86585116386414
Percentage Time 102.500%

C:\OpenMP\Test Timing>set OMP_NUM_THREADS=2

C:\OpenMP\Test Timing>Test_T
MP time 0.500 seconds
z100 WI MP 406.86585116386414
NO MP time 1.000 seconds
z100 NO MP 406.86585116386414
Percentage Time 50.000%

C:\OpenMP\Test Timing>set OMP_NUM_THREADS=4

C:\OpenMP\Test Timing>Test_T
MP time 0.391 seconds
z100 WI MP 406.86585116386414
NO MP time 0.949 seconds
z100 NO MP 406.86585116386414
Percentage Time 41.152%

C:\OpenMP\Test Timing>set OMP_NUM_THREADS=8

C:\OpenMP\Test Timing>Test_T
MP time 0.422 seconds
z100 WI MP 406.86585116386414
NO MP time 0.953 seconds
z100 NO MP 406.86585116386414
Percentage Time 44.262%

C:\OpenMP\Test Timing>pause
Press any key to continue . . .

Thanks again for your help.
 
  • #35
Vanadium 50 said:
That's very surprising. Each array is only 8 MB in size. Wonder why that is.

I figured it out. Windows fixes the stack size at link time; you can increase it by adding a linker flag when you compile.
Example: gfortran -Wl,--stack,16000000 -O3 -fopenmp -static test.f -o Test

The stack size should be at least (number of array elements) × (number of arrays) × 8 bytes for double precision. So if you had x(1000000) and y(1000000), the stack would need 1000000 × 2 × 8 = 16,000,000 bytes as a minimum (plus other overhead).
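An alternative to raising the stack limit, assuming the declarations can be changed: allocatable arrays live on the heap, so they are not subject to the linker's --stack setting at all. A minimal sketch:

```fortran
program heap_demo
  implicit none
  real(8), allocatable :: x(:), y(:)
  integer :: n

  n = 1000000
  allocate(x(n), y(n))   ! heap allocation; not limited by --stack

  x = 1.0d0
  y = 2.0d0
  print *, 'sum =', sum(x) + sum(y)

  deallocate(x, y)
end program heap_demo
```

This also lets the array sizes be decided at run time instead of being fixed at compile time.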

Jim
 
  • #36
I have written some code that demonstrates the gains from OpenMP and verifies the output by comparison. Using my old laptop (2-core i5) I get at best about a 3X speedup; it is probably better on more modern machines. I have attached the code and an msys2 compile shell script.

http://www.chemroc.com/MISC/OpenMP/MP_Test.f

http://www.chemroc.com/MISC/OpenMP/MP_Test.sh
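In case the links go stale: a hypothetical sketch of the kind of timed MP vs. no-MP comparison the output below reflects (the variable names and the work inside the loop are my assumptions, not necessarily what MP_Test.f contains):

```fortran
program mp_test_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  real(8) :: xm(n), x(n), t0, t_mp, t_ser

  ! Timed parallel version.
  t0 = omp_get_wtime()
  !$omp parallel do
  do i = 1, n
     xm(i) = sin(dble(i)) * cos(dble(i))
  end do
  !$omp end parallel do
  t_mp = omp_get_wtime() - t0

  ! Timed serial version of the same work.
  t0 = omp_get_wtime()
  do i = 1, n
     x(i) = sin(dble(i)) * cos(dble(i))
  end do
  t_ser = omp_get_wtime() - t0

  print '(a,f8.3,a)', ' MP time    ', t_mp,  ' seconds'
  print '(a,f8.3,a)', ' NO MP time ', t_ser, ' seconds'
  print '(a,f8.3,a)', ' Percentage Time ', 100.0d0 * t_mp / t_ser, '%'

  ! Spot-check one element to verify the two versions agree.
  print *, ' xM(540)', xm(540), '  x(540)', x(540)
end program mp_test_sketch
```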

*************************************************************

This is output using my laptop:

***********************************************************
C:\OpenMP\Test Timing - random>set OMP_NUM_THREADS=1

C:\OpenMP\Test Timing - random>MP_Test
MP time 2.125 seconds
xM( 540) 0.480
xM( 540) 1.127
NO MP time 2.062 seconds
x( 540) 0.480
x( 540) 1.127
Percentage Time 103.030%

C:\OpenMP\Test Timing - random>set OMP_NUM_THREADS=2

C:\OpenMP\Test Timing - random>MP_Test
MP time 1.219 seconds
xM( 109) 0.821
xM( 109) 1.469
NO MP time 2.062 seconds
x( 109) 0.821
x( 109) 1.469
Percentage Time 59.091%

C:\OpenMP\Test Timing - random>set OMP_NUM_THREADS=4

C:\OpenMP\Test Timing - random>MP_Test
MP time 0.750 seconds
xM( 869) 0.151
xM( 869) 0.696
NO MP time 2.094 seconds
x( 869) 0.151
x( 869) 0.696
Percentage Time 35.821%

C:\OpenMP\Test Timing - random>set OMP_NUM_THREADS=8

C:\OpenMP\Test Timing - random>MP_Test
MP time 0.750 seconds
xM( 384) 0.952
xM( 384) 1.669
NO MP time 2.094 seconds
x( 384) 0.952
x( 384) 1.669
Percentage Time 35.821%

C:\OpenMP\Test Timing - random>set OMP_NUM_THREADS=16

C:\OpenMP\Test Timing - random>MP_Test
MP time 0.750 seconds
xM( 186) 0.869
xM( 186) 1.551
NO MP time 2.031 seconds
x( 186) 0.869
x( 186) 1.551
Percentage Time 36.923%

*******************************************************

Later,

Jim
 
  • #37
So you get a 2.8x improvement on a 2C/4T machine? You should be pretty happy about that. :smile:
 
  • #38
Vanadium 50 said:
So you get a 2.8x improvement on a 2C/4T machine? You should be pretty happy about that. :smile:
I ran this on a Mac with the same processor. The percentage gains are similar, but the Mac/UNIX speed is much better overall.

Results on MACOS:

+ export OMP_NUM_THREADS=1
+ OMP_NUM_THREADS=1
+ ./MP_Test
MP time 1.125 seconds
xM( 376) 0.995
xM( 376) 1.605
NO MP time 1.125 seconds
x( 376) 0.995
x( 376) 1.605
Percentage Time 100.000%
+ export OMP_NUM_THREADS=2
+ OMP_NUM_THREADS=2
+ ./MP_Test
MP time 0.625 seconds
xM( 627) 0.348
xM( 627) 0.774
NO MP time 1.125 seconds
x( 627) 0.348
x( 627) 0.774
Percentage Time 55.556%
+ export OMP_NUM_THREADS=4
+ OMP_NUM_THREADS=4
+ ./MP_Test
MP time 0.375 seconds
xM( 677) 0.929
xM( 677) 1.739
NO MP time 1.125 seconds
x( 677) 0.929
x( 677) 1.739
Percentage Time 33.333%
+ export OMP_NUM_THREADS=8
+ OMP_NUM_THREADS=8
+ ./MP_Test
MP time 0.375 seconds
xM( 914) 0.882
xM( 914) 1.578
NO MP time 1.250 seconds
x( 914) 0.882
x( 914) 1.578
Percentage Time 30.000%
+ export OMP_NUM_THREADS=16
+ OMP_NUM_THREADS=16
+ ./MP_Test
MP time 0.375 seconds
xM( 423) 0.360
xM( 423) 1.078
NO MP time 1.125 seconds
x( 423) 0.360
x( 423) 1.078
Percentage Time 33.333%
+ export OMP_NUM_THREADS=32
+ OMP_NUM_THREADS=32
+ ./MP_Test
MP time 0.375 seconds
xM( 832) 0.834
xM( 832) 1.575
NO MP time 1.125 seconds
x( 832) 0.834
x( 832) 1.575
Percentage Time 33.333%

[Process completed]
 
