Why Is My OpenMP FORTRAN Program Slower in Parallel Than in Single Thread?

  • Context: Fortran
  • Thread starter: jelanier
  • Tags: Beginner, Fortran

Discussion Overview

The discussion centers around the performance of a FORTRAN program utilizing OpenMP for parallel processing. Participants are exploring why the program exhibits slower execution times in parallel compared to single-threaded execution. The focus includes aspects of code optimization, parallelization overhead, and the suitability of the test being conducted.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Jim reports that his OpenMP FORTRAN program runs slower in parallel than in single-threaded mode, providing specific timing results for different thread counts.
  • Some participants note that parallelization introduces overhead, which can lead to increased execution times unless the parallel workload is substantial enough to offset this overhead.
  • There is a suggestion to post the code directly in the thread for better analysis, as the provided link to the code is broken.
  • One participant emphasizes that the test may not be effective, as the loop could be optimized away, recommending the use of array operations instead.
  • Another participant points out that the output of the loop is not being utilized, which could lead to the compiler optimizing it out entirely, thus affecting timing results.
  • Jim expresses uncertainty about why OpenMP does not perform well on loops, particularly since he has many loops in his code that he wishes to parallelize.
  • There is a discussion about the number of iterations in loops and the potential impact of thread setup time relative to the workload of each thread.

Areas of Agreement / Disagreement

Participants generally agree that overhead from parallelization can lead to slower performance if the workload per thread is not significant. However, there is no consensus on the specific reasons for Jim's performance issues, and multiple viewpoints on how to address the problem are presented.

Contextual Notes

Some limitations are noted, such as the potential for the compiler to optimize away unused calculations, and the need for more details about the hardware and code structure to provide better assistance.

Who May Find This Useful

This discussion may be useful for programmers and researchers working with OpenMP in FORTRAN, particularly those interested in performance optimization and parallel processing challenges.

  • #31
jelanier said:
If the loop is not a function of i, it can't possibly run in parallel.

Close. If you have x = x * i, it is a function of i, but still can't run in parallel.

Probably the easiest way to imagine it is to think about giving half the loop to one thread and half to another. Would you get the same answer? No matter how you split it? I usually ask myself "if I had N CPUs, one for each of the N elements of the loop, would I get the right answer?"
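To make the dependence point concrete, here is a hypothetical sketch (not the thread's actual code): the first loop has fully independent iterations and splits across threads trivially, while the second carries a dependence through s from one iteration to the next and only parallelizes because summation is associative, which OpenMP exposes through the REDUCTION clause.

```fortran
program dependence_demo
  implicit none
  integer :: i
  real(8) :: s, x(1000000)

  ! Independent iterations: each x(i) depends only on i, so
  ! any split of the loop across threads gives the same answer.
  !$omp parallel do
  do i = 1, 1000000
     x(i) = sqrt(dble(i))
  end do
  !$omp end parallel do

  ! Loop-carried dependence: s needs the previous iteration's s.
  ! A naive split would be wrong; REDUCTION makes it safe by giving
  ! each thread a private partial sum and combining them at the end.
  s = 0.0d0
  !$omp parallel do reduction(+:s)
  do i = 1, 1000000
     s = s + x(i)
  end do
  !$omp end parallel do

  print *, 's =', s
end program dependence_demo
```

Compile with gfortran -fopenmp; without the flag the directives are ignored and the loops simply run serially.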

jelanier said:
The purpose of this code was to count time more reliably.

Well, you aren't doing that, exactly. You're measuring something, that's for sure, but it may be totally unrelated to any real code. In particular, since x is never used at the end, the compiler might optimize away part or all of the code involving x, including the code inside the PARALLEL region. That said, using wall clock time is a good idea: that's what you really care about, after all.
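For example (a hypothetical sketch, not the thread's code): if nothing reads x after the loop, an optimizing compiler is free to delete the loop entirely, and the timer then brackets an empty region. Consuming the result, even trivially, defeats that dead-code elimination.

```fortran
program dce_demo
  implicit none
  integer :: i
  real(8) :: x

  x = 0.0d0
  do i = 1, 100000000
     x = x + sin(dble(i))   ! with -O3, this whole loop could be
  end do                    ! deleted if x were never used afterwards

  ! Printing x forces the compiler to keep the computation,
  ! so any timing around the loop reflects real work.
  print *, 'x =', x
end program dce_demo
```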
 
  • #32
Understood. I think you took me too literally on "function"; I clarified when I said "each iteration requires knowledge of the previous one." The counter I'm using should work because it reads the system clock, so the code being timed should not interfere with it. I think you are saying it may read the system clock while a code process is still running. One of the timing procedures I used in another test shows the timing of each thread, and when multiplied by the number of threads the results were close to the system-clock result. Can you give me an example of wall clock time so I can compare? I thought the system clock was wall clock. A quote from another forum: "system_clock reports 'wall time' or elapsed time." Is this not correct? I would expect it to be conservative, because the computer is doing things other than the code I'm testing.

see https://gcc.gnu.org/onlinedocs/gfortran/SYSTEM_005fCLOCK.html
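For what it's worth, system_clock does report elapsed (wall) time, provided the tick difference is divided by count_rate. A minimal sketch (my own example, not from the thread), with omp_get_wtime() shown as the OpenMP alternative that returns seconds directly:

```fortran
program timing_demo
  use omp_lib          ! provides omp_get_wtime()
  implicit none
  integer(8) :: t0, t1, rate
  integer :: i
  real(8) :: w0, w1, s

  call system_clock(count_rate=rate)
  call system_clock(t0)
  w0 = omp_get_wtime()

  s = 0.0d0
  do i = 1, 10000000
     s = s + sqrt(dble(i))
  end do

  w1 = omp_get_wtime()
  call system_clock(t1)

  print *, 'checksum:          ', s   ! use s so the loop is not optimized away
  print *, 'system_clock (s):  ', dble(t1 - t0) / dble(rate)
  print *, 'omp_get_wtime (s): ', w1 - w0
end program timing_demo
```

Both timers should agree closely; compile with gfortran -fopenmp so omp_lib is available.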

Thanks
 
  • #33
You are using wall clock time now.
 
  • #34
OK, I thought you had seen my timing-post code using system_clock. Anyway, I wrote some code with a parallel loop, and the speed increase was verified: it is a little better than twice the speed. A variable, Z(100), is printed to confirm there are no errors in the parallel portion. I also compiled with and without optimization to compare results.

See results:

C:\OpenMP\Test Timing>set OMP_NUM_THREADS=1

C:\OpenMP\Test Timing>Test_T
MP time 0.961 seconds
z100 WI MP 406.86585116386414
NO MP time 0.938 seconds
z100 NO MP 406.86585116386414
Percentage Time 102.500%

C:\OpenMP\Test Timing>set OMP_NUM_THREADS=2

C:\OpenMP\Test Timing>Test_T
MP time 0.500 seconds
z100 WI MP 406.86585116386414
NO MP time 1.000 seconds
z100 NO MP 406.86585116386414
Percentage Time 50.000%

C:\OpenMP\Test Timing>set OMP_NUM_THREADS=4

C:\OpenMP\Test Timing>Test_T
MP time 0.391 seconds
z100 WI MP 406.86585116386414
NO MP time 0.949 seconds
z100 NO MP 406.86585116386414
Percentage Time 41.152%

C:\OpenMP\Test Timing>set OMP_NUM_THREADS=8

C:\OpenMP\Test Timing>Test_T
MP time 0.422 seconds
z100 WI MP 406.86585116386414
NO MP time 0.953 seconds
z100 NO MP 406.86585116386414
Percentage Time 44.262%

C:\OpenMP\Test Timing>pause
Press any key to continue . . .

Thanks again for your help.
 
  • #35
Vanadium 50 said:
That's very surprising. Each array is only 8 MB in size. Wonder why that is.

I figured it out. Windows fixes the stack size at link time; you can increase it by adding a linker flag when you compile.
Example: gfortran -Wl,--stack,16000000 -O3 -fopenmp -static test.f -o Test

The stack size should be at least (number of array elements) × (number of arrays) × 8 bytes for double precision. So if you had x(1000000) and y(1000000), the stack would need 1000000 × 2 × 8 = 16,000,000 bytes as a minimum (plus other overhead).
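An alternative to raising the stack limit, assuming the declarations can be changed: allocatable arrays live on the heap, so they are not subject to the linker's --stack setting at all. A minimal sketch:

```fortran
program heap_demo
  implicit none
  real(8), allocatable :: x(:), y(:)
  integer :: n

  n = 1000000
  allocate(x(n), y(n))   ! heap allocation; not limited by --stack

  x = 1.0d0
  y = 2.0d0
  print *, 'sum =', sum(x) + sum(y)

  deallocate(x, y)
end program heap_demo
```

This also lets the array sizes be decided at run time instead of being fixed at compile time.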

Jim
 
  • #36
I have written some code that demonstrates the gains from OpenMP and verifies the output by comparison. Using my old laptop (2-core i5) I get at best about a 3X speedup; it is probably better on more modern machines. I have attached the code and an msys2 compile shell script.

http://www.chemroc.com/MISC/OpenMP/MP_Test.f

http://www.chemroc.com/MISC/OpenMP/MP_Test.sh
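In case the links go stale: a hypothetical sketch of the kind of timed MP vs. no-MP comparison the output below reflects (the variable names and the work inside the loop are my assumptions, not necessarily what MP_Test.f contains):

```fortran
program mp_test_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  real(8) :: xm(n), x(n), t0, t_mp, t_ser

  ! Timed parallel version.
  t0 = omp_get_wtime()
  !$omp parallel do
  do i = 1, n
     xm(i) = sin(dble(i)) * cos(dble(i))
  end do
  !$omp end parallel do
  t_mp = omp_get_wtime() - t0

  ! Timed serial version of the same work.
  t0 = omp_get_wtime()
  do i = 1, n
     x(i) = sin(dble(i)) * cos(dble(i))
  end do
  t_ser = omp_get_wtime() - t0

  print '(a,f8.3,a)', ' MP time    ', t_mp,  ' seconds'
  print '(a,f8.3,a)', ' NO MP time ', t_ser, ' seconds'
  print '(a,f8.3,a)', ' Percentage Time ', 100.0d0 * t_mp / t_ser, '%'

  ! Spot-check one element to verify the two versions agree.
  print *, ' xM(540)', xm(540), '  x(540)', x(540)
end program mp_test_sketch
```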

*************************************************************

This is output using my laptop:

***********************************************************
C:\OpenMP\Test Timing - random>set OMP_NUM_THREADS=1

C:\OpenMP\Test Timing - random>MP_Test
MP time 2.125 seconds
xM( 540) 0.480
xM( 540) 1.127
NO MP time 2.062 seconds
x( 540) 0.480
x( 540) 1.127
Percentage Time 103.030%

C:\OpenMP\Test Timing - random>set OMP_NUM_THREADS=2

C:\OpenMP\Test Timing - random>MP_Test
MP time 1.219 seconds
xM( 109) 0.821
xM( 109) 1.469
NO MP time 2.062 seconds
x( 109) 0.821
x( 109) 1.469
Percentage Time 59.091%

C:\OpenMP\Test Timing - random>set OMP_NUM_THREADS=4

C:\OpenMP\Test Timing - random>MP_Test
MP time 0.750 seconds
xM( 869) 0.151
xM( 869) 0.696
NO MP time 2.094 seconds
x( 869) 0.151
x( 869) 0.696
Percentage Time 35.821%

C:\OpenMP\Test Timing - random>set OMP_NUM_THREADS=8

C:\OpenMP\Test Timing - random>MP_Test
MP time 0.750 seconds
xM( 384) 0.952
xM( 384) 1.669
NO MP time 2.094 seconds
x( 384) 0.952
x( 384) 1.669
Percentage Time 35.821%

C:\OpenMP\Test Timing - random>set OMP_NUM_THREADS=16

C:\OpenMP\Test Timing - random>MP_Test
MP time 0.750 seconds
xM( 186) 0.869
xM( 186) 1.551
NO MP time 2.031 seconds
x( 186) 0.869
x( 186) 1.551
Percentage Time 36.923%

*******************************************************

Later,

Jim
 
  • #37
So you get a 2.8x improvement on a 2C/4T machine? You should be pretty happy about that. :smile:
 
  • #38
Vanadium 50 said:
So you get a 2.8x improvement on a 2C/4T machine? You should be pretty happy about that. :smile:
I ran this on a Mac with the same processor. The percentage gains are similar, but the Mac/UNIX speed is much better overall.

Results on MACOS:

+ export OMP_NUM_THREADS=1
+ OMP_NUM_THREADS=1
+ ./MP_Test
MP time 1.125 seconds
xM( 376) 0.995
xM( 376) 1.605
NO MP time 1.125 seconds
x( 376) 0.995
x( 376) 1.605
Percentage Time 100.000%
+ export OMP_NUM_THREADS=2
+ OMP_NUM_THREADS=2
+ ./MP_Test
MP time 0.625 seconds
xM( 627) 0.348
xM( 627) 0.774
NO MP time 1.125 seconds
x( 627) 0.348
x( 627) 0.774
Percentage Time 55.556%
+ export OMP_NUM_THREADS=4
+ OMP_NUM_THREADS=4
+ ./MP_Test
MP time 0.375 seconds
xM( 677) 0.929
xM( 677) 1.739
NO MP time 1.125 seconds
x( 677) 0.929
x( 677) 1.739
Percentage Time 33.333%
+ export OMP_NUM_THREADS=8
+ OMP_NUM_THREADS=8
+ ./MP_Test
MP time 0.375 seconds
xM( 914) 0.882
xM( 914) 1.578
NO MP time 1.250 seconds
x( 914) 0.882
x( 914) 1.578
Percentage Time 30.000%
+ export OMP_NUM_THREADS=16
+ OMP_NUM_THREADS=16
+ ./MP_Test
MP time 0.375 seconds
xM( 423) 0.360
xM( 423) 1.078
NO MP time 1.125 seconds
x( 423) 0.360
x( 423) 1.078
Percentage Time 33.333%
+ export OMP_NUM_THREADS=32
+ OMP_NUM_THREADS=32
+ ./MP_Test
MP time 0.375 seconds
xM( 832) 0.834
xM( 832) 1.575
NO MP time 1.125 seconds
x( 832) 0.834
x( 832) 1.575
Percentage Time 33.333%

[Process completed]
 
