Why Is My OpenMP FORTRAN Program Slower in Parallel Than in Single Thread?

  • Context: Fortran
  • Thread starter: jelanier
  • Tags: Beginner, Fortran
SUMMARY

The discussion centers on the performance issues encountered when using OpenMP in a FORTRAN program. Users reported that parallel execution times exceeded those of single-threaded execution, with specific results showing increasing times as the number of threads increased. Key factors contributing to this issue include overhead from parallelization and potential compiler optimizations that eliminate unnecessary computations. Recommendations include using more complex operations, such as array manipulations, to better leverage OpenMP's capabilities.

PREREQUISITES
  • Understanding of OpenMP directives and their usage in FORTRAN
  • Familiarity with FORTRAN programming and syntax
  • Knowledge of performance profiling and timing functions in parallel computing
  • Basic concepts of multithreading and its overheads
NEXT STEPS
  • Explore OpenMP's performance optimization techniques, focusing on minimizing overhead
  • Learn about effective array operations in FORTRAN to utilize parallel processing
  • Investigate compiler optimization flags that can affect parallel execution
  • Study performance profiling tools specific to FORTRAN and OpenMP
USEFUL FOR

FORTRAN developers, computational scientists, and anyone interested in optimizing parallel performance using OpenMP in their applications.

  • #31
jelanier said:
If the loop is not a function of i, it can't possibly run in parallel.

Close. If you have x = x * i, it is a function of i, but still can't run in parallel.

Probably the easiest way to imagine it is to think about giving half the loop to one thread and half to another. Would you get the same answer? No matter how you split it? I usually ask myself "if I had N CPUs, one for each of the N elements of the loop, would I get the right answer?"
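To make the "split it between threads" test concrete, here is a minimal sketch, not code from this thread. The program name, the arrays a and b, and the size n are made up for illustration. The first loop passes the test, the second has a loop-carried dependence and is left serial, and the third shows that x = x * i, which cannot simply be split as written, can still be parallelized if you tell OpenMP it is a reduction.

```fortran
program dependence_demo
  implicit none
  integer, parameter :: n = 10      ! hypothetical size, small on purpose
  integer :: i
  double precision :: a(n), b(n), x

  ! Independent iterations: a(i) depends only on i, so any split of
  ! the index range across threads produces the same array.
  !$omp parallel do
  do i = 1, n
     a(i) = 2.0d0 * dble(i)
  end do
  !$omp end parallel do

  ! Loop-carried dependence: b(i) needs b(i-1) from the previous
  ! iteration, so this loop is deliberately left serial.
  b(1) = 1.0d0
  do i = 2, n
     b(i) = b(i-1) + a(i)
  end do

  ! x = x * i depends on the previous x, but multiplication is
  ! associative, so a reduction clause lets each thread build a
  ! partial product that OpenMP combines at the end.
  x = 1.0d0
  !$omp parallel do reduction(*:x)
  do i = 1, n
     x = x * dble(i)
  end do
  !$omp end parallel do

  print *, 'a(n) =', a(n)
  print *, 'b(n) =', b(n)
  print *, 'x    =', x
end program dependence_demo
```

Because the directives are comments to a compiler without OpenMP enabled, the same source builds with or without -fopenmp, and x should come out as 10! either way.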

jelanier said:
The purpose of this code was to count time more reliably.

Well, you aren't doing that exactly. You're measuring something, that's for sure, but it may be totally unrelated to any real code. In particular, because x is never used at the end, the compiler might optimize away part or all of the code involving x, including the code inside the PARALLEL region. That said, using wall clock time is a good idea: that's what you really care about, after all.
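One common way to guard against the optimizer dropping the timed work is to make the result observable, for example by printing it. The sketch below assumes a made-up workload (a sum of square roots) and made-up names; it is not the code being discussed in the thread.

```fortran
program keep_result
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n = 5000000   ! hypothetical loop length
  integer :: i
  integer(i8) :: t0, t1, rate
  double precision :: x

  call system_clock(t0, rate)         ! start count and ticks per second
  x = 0.0d0
  !$omp parallel do reduction(+:x)
  do i = 1, n
     x = x + sqrt(dble(i))
  end do
  !$omp end parallel do
  call system_clock(t1)               ! end count

  print '(a,f8.3,a)', 'elapsed ', dble(t1 - t0) / dble(rate), ' seconds'

  ! Printing x makes the loop result observable, so the optimizer
  ! has to keep the work that was timed.
  print *, 'x =', x
end program keep_result
```

If the print of x is removed and the program is built with -O3, the compiler is free to discard the loop, and the reported time may then say nothing about the loop at all.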
 
  • #32
I understood. I think you took me too literally about the function; I clarified when I said "Each iteration requires knowledge of the previous one." The counter I'm using should work because it reads the system clock, so the code under test should not interfere with it. I think you are saying it may read the system clock while part of the code is still running. One of the timing procedures I used in another test shows the timing of each thread, and when multiplied by the number of threads the results were close to the system clock result.

Can you give me an example of wall clock time so I can compare? I thought system clock was wall clock. A quote from another forum: "system_clock reports 'wall time' or elapsed time." Is this not correct? I would expect it to be conservative, because the computer is doing things other than the code I'm testing.

See https://gcc.gnu.org/onlinedocs/gfortran/SYSTEM_005fCLOCK.html

Thanks
 
  • #33
You are using wall clock time now.
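For comparison, here is a minimal sketch that reports both kinds of time around the same loop. The workload and names are made up. The assumption, which holds for typical gfortran builds but is worth checking on your own system, is that cpu_time returns CPU time for the whole process (summed over threads), while system_clock returns elapsed wall clock time.

```fortran
program clocks
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n = 20000000  ! hypothetical loop length
  integer :: i
  integer(i8) :: c0, c1, rate
  real :: cpu0, cpu1
  double precision :: s

  call cpu_time(cpu0)                 ! processor time, in seconds
  call system_clock(c0, rate)         ! wall clock count and tick rate

  s = 0.0d0
  !$omp parallel do reduction(+:s)
  do i = 1, n
     s = s + sin(dble(i))
  end do
  !$omp end parallel do

  call system_clock(c1)
  call cpu_time(cpu1)

  print '(a,f8.3)', 'wall (system_clock) seconds: ', dble(c1 - c0) / dble(rate)
  print '(a,f8.3)', 'cpu  (cpu_time)     seconds: ', cpu1 - cpu0
  print *, 'checksum', s              ! keeps the loop from being dropped
end program clocks
```

With one thread the two numbers should be close. With several threads the wall time should drop while the CPU time stays roughly the same or grows, which is why wall time is the number to compare.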
 
  • #34
OK, I thought you had seen my post with the timing code using system_clock. Anyway, I wrote some code for a parallel loop and verified the speed increase: it runs a little better than twice as fast. The program prints one element, Z(100), to confirm that the parallel portion gives the same result as the serial one. I also compiled with and without optimization to test the results.

See results:

C:\OpenMP\Test Timing>set OMP_NUM_THREADS=1

C:\OpenMP\Test Timing>Test_T
MP time 0.961 seconds
z100 WI MP 406.86585116386414
NO MP time 0.938 seconds
z100 NO MP 406.86585116386414
Percentage Time 102.500%

C:\OpenMP\Test Timing>set OMP_NUM_THREADS=2

C:\OpenMP\Test Timing>Test_T
MP time 0.500 seconds
z100 WI MP 406.86585116386414
NO MP time 1.000 seconds
z100 NO MP 406.86585116386414
Percentage Time 50.000%

C:\OpenMP\Test Timing>set OMP_NUM_THREADS=4

C:\OpenMP\Test Timing>Test_T
MP time 0.391 seconds
z100 WI MP 406.86585116386414
NO MP time 0.949 seconds
z100 NO MP 406.86585116386414
Percentage Time 41.152%

C:\OpenMP\Test Timing>set OMP_NUM_THREADS=8

C:\OpenMP\Test Timing>Test_T
MP time 0.422 seconds
z100 WI MP 406.86585116386414
NO MP time 0.953 seconds
z100 NO MP 406.86585116386414
Percentage Time 44.262%

C:\OpenMP\Test Timing>pause
Press any key to continue . . .

Thanks again for your help.
 
  • #35
Vanadium 50 said:
That's very surprising. Each array is only 8 MB in size. Wonder why that is.

I figured it out. The stack size is fixed in Windows. You can increase it by adding a linker option at compile time.
Example: gfortran -Wl,--stack,16000000 -O3 -fopenmp -static test.f -o Test

The stack size in bytes should be at least the total number of array elements X 8 (8 bytes per double-precision element). So if you had x(i) and y(i) with 1000000 elements each,
then the stack would need to be 1000000 X 2 X 8 = 16000000 bytes as a minimum (plus other overhead).
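The arithmetic above can be checked in a few lines, and the same sketch shows one possible alternative to enlarging the stack: allocatable arrays, which are normally placed on the heap. This is an assumption about how the arrays could be declared, not the code from test.f, and the names are made up. For background, the gfortran documentation states that -fopenmp implies -frecursive, which puts local arrays on the stack.

```fortran
program array_bytes
  implicit none
  integer, parameter :: n = 1000000    ! elements per array, as in the post
  integer, parameter :: n_arrays = 2   ! x and y
  integer :: bytes_per_elem, total_bytes
  double precision, allocatable :: x(:), y(:)

  ! storage_size returns the size in bits, so divide by 8 for bytes.
  bytes_per_elem = storage_size(1.0d0) / 8
  total_bytes = n * n_arrays * bytes_per_elem

  print *, 'bytes per element:        ', bytes_per_elem
  print *, 'minimum bytes for x and y:', total_bytes

  ! Allocatable arrays normally live on the heap, so they do not
  ! count against the fixed stack limit.
  allocate(x(n), y(n))
  x = 1.0d0
  y = 2.0d0 * x
  print *, 'y(n) =', y(n)
  deallocate(x, y)
end program array_bytes
```

With 8-byte double precision this reproduces the 16000000 figure used in the -Wl,--stack option.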

Jim
 
  • #36
I have written some code that shows the gains from OpenMP and verifies the output by comparing it with the serial result. Using my old laptop (2-core i5) I get at best about a 3X speedup. It is probably better on more modern machines. I have attached the code and an MSYS2 compile script.

http://www.chemroc.com/MISC/OpenMP/MP_Test.f

http://www.chemroc.com/MISC/OpenMP/MP_Test.sh

*************************************************************

This is output using my laptop:

***********************************************************
C:\OpenMP\Test Timing - random>set OMP_NUM_THREADS=1

C:\OpenMP\Test Timing - random>MP_Test
MP time 2.125 seconds
xM( 540) 0.480
xM( 540) 1.127
NO MP time 2.062 seconds
x( 540) 0.480
x( 540) 1.127
Percentage Time 103.030%

C:\OpenMP\Test Timing - random>set OMP_NUM_THREADS=2

C:\OpenMP\Test Timing - random>MP_Test
MP time 1.219 seconds
xM( 109) 0.821
xM( 109) 1.469
NO MP time 2.062 seconds
x( 109) 0.821
x( 109) 1.469
Percentage Time 59.091%

C:\OpenMP\Test Timing - random>set OMP_NUM_THREADS=4

C:\OpenMP\Test Timing - random>MP_Test
MP time 0.750 seconds
xM( 869) 0.151
xM( 869) 0.696
NO MP time 2.094 seconds
x( 869) 0.151
x( 869) 0.696
Percentage Time 35.821%

C:\OpenMP\Test Timing - random>set OMP_NUM_THREADS=8

C:\OpenMP\Test Timing - random>MP_Test
MP time 0.750 seconds
xM( 384) 0.952
xM( 384) 1.669
NO MP time 2.094 seconds
x( 384) 0.952
x( 384) 1.669
Percentage Time 35.821%

C:\OpenMP\Test Timing - random>set OMP_NUM_THREADS=16

C:\OpenMP\Test Timing - random>MP_Test
MP time 0.750 seconds
xM( 186) 0.869
xM( 186) 1.551
NO MP time 2.031 seconds
x( 186) 0.869
x( 186) 1.551
Percentage Time 36.923%

*******************************************************

Later,

Jim
 
  • #37
So you get a 2.8x improvement on a 2C/4T machine? You should be pretty happy about that. :smile:
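For reference, the 2.8x figure follows from the 4-thread Windows run in post #36 (MP time 0.750 s, NO MP time 2.094 s). The symbols below are just labels for those two printed times:

$$
\text{speedup} = \frac{T_{\text{NO MP}}}{T_{\text{MP}}} = \frac{2.094}{0.750} \approx 2.79,
\qquad
\text{Percentage Time} = 100 \times \frac{T_{\text{MP}}}{T_{\text{NO MP}}} = 100 \times \frac{0.750}{2.094} \approx 35.8\%
$$

So the speedup is simply 100 divided by the "Percentage Time" that the program prints.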
 
  • #38
Vanadium 50 said:
So you get a 2.8x improvement on a 2C/4T machine? You should be pretty happy about that. :smile:
I ran this on a Mac with the same processor. The percentage gains are similar, but the Mac/Unix speed is much better overall.

Results on macOS:

+ export OMP_NUM_THREADS=1
+ OMP_NUM_THREADS=1
+ ./MP_Test
MP time 1.125 seconds
xM( 376) 0.995
xM( 376) 1.605
NO MP time 1.125 seconds
x( 376) 0.995
x( 376) 1.605
Percentage Time 100.000%
+ export OMP_NUM_THREADS=2
+ OMP_NUM_THREADS=2
+ ./MP_Test
MP time 0.625 seconds
xM( 627) 0.348
xM( 627) 0.774
NO MP time 1.125 seconds
x( 627) 0.348
x( 627) 0.774
Percentage Time 55.556%
+ export OMP_NUM_THREADS=4
+ OMP_NUM_THREADS=4
+ ./MP_Test
MP time 0.375 seconds
xM( 677) 0.929
xM( 677) 1.739
NO MP time 1.125 seconds
x( 677) 0.929
x( 677) 1.739
Percentage Time 33.333%
+ export OMP_NUM_THREADS=8
+ OMP_NUM_THREADS=8
+ ./MP_Test
MP time 0.375 seconds
xM( 914) 0.882
xM( 914) 1.578
NO MP time 1.250 seconds
x( 914) 0.882
x( 914) 1.578
Percentage Time 30.000%
+ export OMP_NUM_THREADS=16
+ OMP_NUM_THREADS=16
+ ./MP_Test
MP time 0.375 seconds
xM( 423) 0.360
xM( 423) 1.078
NO MP time 1.125 seconds
x( 423) 0.360
x( 423) 1.078
Percentage Time 33.333%
+ export OMP_NUM_THREADS=32
+ OMP_NUM_THREADS=32
+ ./MP_Test
MP time 0.375 seconds
xM( 832) 0.834
xM( 832) 1.575
NO MP time 1.125 seconds
x( 832) 0.834
x( 832) 1.575
Percentage Time 33.333%

[Process completed]
 
