Why Is My OpenMP FORTRAN Program Slower in Parallel Than in Single Thread?

  • Thread starter: jelanier
  • Tags: Beginner, Fortran

Summary
OpenMP parallelization can lead to slower execution times in FORTRAN programs due to overhead associated with managing multiple threads. Users reported that their parallel implementations resulted in longer execution times compared to single-threaded runs, particularly when the workload per thread is minimal. It was suggested that the test program should involve operations on arrays rather than simple loops to better utilize parallel processing capabilities. Additionally, the effectiveness of OpenMP can vary significantly based on the nature of the code and the hardware used, emphasizing the importance of optimizing code for parallel execution. Efficient multithreading requires experience and careful consideration of workload distribution to achieve desired performance improvements.
jelanier
TL;DR: Not getting OpenMP results
I am trying to understand OpenMP, but my experiments are not paying off. I made a simple FORTRAN program to show me parallel timing results. It turns out that the parallel times are greater than the single-thread time. What am I doing wrong?

Results (changing the number of threads):
No OpenMP seconds= 1.46640897
THR=1 seconds= 1.57561016
THR=2 seconds= 2.55841684
THR=4 seconds= 5.53803492
THR=8 seconds= 5.74083805

This test program times a do loop.

Code and compile download from:
[Mentors deleted questionable ZIP link]

Thanks in advance

Jim
 
I couldn't look at the code since the link doesn't work, but this is something that often happens. There is quite some overhead related to the parallelization, so unless there is a huge gain due to the parallel processing, the overall execution time increases.

You should post your code (or at least parts of it if it is really big) in the thread. Also, some details of the hardware you are running on would help.
 
DrClaude said:
I couldn't look at the code since the link doesn't work, but this is something that often happens. There is quite some overhead related to the parallelization, so unless there is a huge gain due to the parallel processing, the overall execution time increases.

You should post your code (or at least parts of it if it is really big) in the thread. Also, some details of the hardware you are running on would help.
Sorry for bad link

Core i5 CPU, 8 GB memory

[Mentors deleted ZIP file link]
 
jelanier said:
Sorry for bad link

Core i5 CPU, 8 GB memory

[Mentors deleted ZIP file link]
Please do not post ZIP files -- they can cause virus/malware problems. Instead, post PDF copies or clear text file copies. You can use PrimoPDF or other free PDF writers to convert your files to safer versions. Thank you.
 
I agree with @berkeman, especially since your code is relatively small.

The test you are using is not very good, as the loop can be optimized away. Try doing an operation on an array instead.
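For instance, something along these lines (a schematic fragment, with a, b, c and n standing for whatever arrays and size you have), where each iteration writes a distinct array element:

Fortran:
!$omp parallel do
      do i = 1, n
         c(i) = a(i)*b(i) + sin(a(i))   ! real work whose result is kept
      end do
!$omp end parallel do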
 
berkeman said:
Instead, post PDF copies

Please, not for source code.
 
Vanadium 50 said:
Please, not for source code.
Of course not. :smile:
berkeman said:
or clear text file copies
... and include code tags around your text ...
 
You should start with the OpenMP example below.

Fortran:
SUBROUTINE SIMPLE(N, A, B)
  INTEGER I, N
  REAL B(N), A(N)
!$OMP PARALLEL DO !I is private by default
  DO I=2,N
    B(I) = (A(I) + A(I-1)) / 2.0
  ENDDO
!$OMP END PARALLEL DO
END SUBROUTINE SIMPLE

This was taken from the OpenMP API examples document. Your PROGRAM should set A, B, N and do the timing. N can be large.

You should not start with something you have whipped together. You should start with where OpenMP wants you to start.
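A minimal driver along these lines would do (my own sketch; the program name and N are illustrative). It assumes compilation with -fopenmp so omp_lib and omp_get_wtime are available, must be linked against the SIMPLE subroutine above, and prints an element of B so the compiler cannot optimize the work away:

Fortran:
program drive_simple
  use omp_lib                      ! provides omp_get_wtime()
  implicit none
  integer, parameter :: n = 10000000
  real, allocatable :: a(:), b(:)  ! allocatable avoids blowing the stack
  real(8) :: t0, t1
  integer :: i

  allocate(a(n), b(n))
  do i = 1, n                      ! fill A with something non-trivial
     a(i) = 0.5 + i
  end do

  t0 = omp_get_wtime()             ! wall-clock time, not summed CPU time
  call simple(n, a, b)
  t1 = omp_get_wtime()

  print *, 'wall time =', t1 - t0, ' s   b(7) =', b(7)
end program drive_simple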
 
DrClaude said:
I couldn't look at the code since the link doesn't work, but this is something that often happens. There is quite some overhead related to the parallelization, so unless there is a huge gain due to the parallel processing, the overall execution time increases.

You should post your code (or at least parts of it if it is really big) in the thread. Also, some details of the hardware you are running on would help.
http://www.chemroc.com/MISC/hel.f
http://www.chemroc.com/MISC/hcompilebat.txt
 
  • #10
@jelanier , please post your code here as text with code tags.

Also, please try the OpenMP example.
 
  • #11
Fortran:
      program hel
      implicit none
      integer :: i,j,a
      real :: start, finish,x
   
   
      x=1.01
      J=500000000
   

      OPEN (27,FILE = 'DATA01.TXT',ACCESS = 'APPEND',STATUS = 'unknown')
   
      call cpu_time(start)
      do i=1,J
            x=I
            x=x**2                            
      end do
      call cpu_time(finish)
      WRITE(27,*) "No OpenMP",' seconds= ',finish-start
      WRITE(*,*) "No OpenMP",' seconds= ',finish-start
   
   
      call cpu_time(start)
!$OMP PARALLEL NUM_THREADS(1)  
!$OMP DO
      do i=1,J
            x=I
            x=x**2                            
      end do   
!$OMP END DO
!$OMP END PARALLEL
      call cpu_time(finish)
      WRITE(27,*) "THR=1",' seconds= ',finish-start
      WRITE(*,*)   "THR=1",' seconds= ',finish-start   
   
      call cpu_time(start)
!$OMP PARALLEL NUM_THREADS(2)  
!$OMP DO
      do i=1,J
            x=I
            x=x**2                            
      end do   
!$OMP END DO
!$OMP END PARALLEL
      call cpu_time(finish)
      WRITE(27,*) "THR=2",' seconds= ',finish-start
      WRITE(*,*) "THR=2",' seconds= ',finish-start   

      call cpu_time(start)
!$OMP PARALLEL NUM_THREADS(4)  
!$OMP DO
      do i=1,J
            x=I
            x=x**2                            
      end do   
!$OMP END DO
!$OMP END PARALLEL
      call cpu_time(finish)
      WRITE(27,*) "THR=4",' seconds= ',finish-start
      WRITE(*,*) "THR=4",' seconds= ',finish-start
   
      call cpu_time(start)
!$OMP PARALLEL NUM_THREADS(8)  
!$OMP DO
      do i=1,J
            x=I
            x=x**2                            
      end do   
!$OMP END DO
!$OMP END PARALLEL
      call cpu_time(finish)
      WRITE(27,*) "THR=8",' seconds= ',finish-start
      WRITE(*,*) "THR=8",' seconds= ',finish-start
         
      CLOSE(27)

      end program hel
 
  • #12
I am not sure I understand why OpenMP doesn't work well on loops. The reason I am experimenting is that I have some code I want to parallelize that does most of its factoring and iteration in loops (about 370 loops).
I have already edited the code to use optimized LAPACK in some subroutines, which is already parallelized. I am trying to increase speed by applying OpenMP to the multitude of loops. I don't want to waste my time if I can't get these simple loops to work. So, is there something wrong with the code I posted?

Thanks,

Jim
 
  • #13
A. cpu_time may be reporting the time consumed by all threads added together: 2.6 seconds for two threads may be 1.3 seconds of wall time.
B. As mentioned, you never use the output of the loop, so the optimizer may eliminate some or all of it. This is why you should use the OpenMP example I posted and not roll your own.
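A quick way to convince yourself of (A) is to time the same region with both clocks. The sketch below is my own illustration (names are arbitrary); it assumes gfortran with -fopenmp, and with 2 threads you would expect the cpu_time result to be roughly twice the omp_get_wtime result:

Fortran:
program clock_demo
  use omp_lib                      ! for omp_get_wtime (compile with -fopenmp)
  implicit none
  integer, parameter :: n = 50000000
  real, allocatable :: a(:), b(:)
  real :: c0, c1
  real(8) :: w0, w1
  integer :: i

  allocate(a(n), b(n))
  a = 1.0

  call cpu_time(c0)                ! sums CPU time over all threads
  w0 = omp_get_wtime()             ! wall-clock time
!$omp parallel do
  do i = 2, n
     b(i) = (a(i) + a(i-1)) / 2.0
  end do
!$omp end parallel do
  call cpu_time(c1)
  w1 = omp_get_wtime()

  print *, 'cpu_time:', c1 - c0, '  wall:', w1 - w0, '  b(n):', b(n)
end program clock_demo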
 
  • #14
jelanier said:
I am not sure I understand why OpenMP doesn't work well on loops. The reason I am experimenting is that I have some code I want to parallelize that does most of its factoring and iteration in loops (about 370 loops).
Do you mean 370 iterations of a loop? That's not a very large number.
When code is parallelized, the work is shared among multiple threads. If the work assigned to each thread doesn't take very long, the overhead of setting up the threads and collecting the results can greatly exceed the time spent computing, so the program as a whole runs longer than it would with a single thread. As a rough example, spawning a thread team can cost on the order of microseconds, which is an eternity compared with a handful of arithmetic operations.

BTW, I edited your previous post - the icode tag is not the one to use, as it is intended for just a single line of code to be written inline (that's what the i stands for in icode). Instead use code tags, like so:

Fortran:
first line of code
second line of code
etc.
 
  • #15
Mark44 said:
Do you mean 370 iterations of a loop?

The code says

J=500000000
 
  • #16
Vanadium 50 said:
J=500000000
Saw that after I had posted. I reformatted the code, but didn't read it carefully. Is there a 370 in there somewhere? In any case, another person mentioned in this thread that, since the output of the loop to square the loop index wasn't used for anything, a smart compiler might just compile the loop away completely.
 
  • #17
Here is the code I wrote to drive the OpenMP SIMPLE example.

Fortran:
       program test
       real, dimension(1000000) :: a, b
       integer i,n

       n = 1000000
       do i = 1, n
         a(i)=0.5+i
       end do

       do i=1,10000
         a(8)=a(8)+1.0;
         call simple(n, a, b)
       end do
       print*,b(7)

       end program

It takes 38 seconds without OpenMP and 7.3 seconds with it (24 threads), so a 5.2x speedup.
 
  • #18
One thing to keep in mind is that while OpenMP is intended to be "easy to use", there is no shortcut to writing efficient multithreaded applications. It takes experience to get good at it. I got a factor of 5, sure, but it took 24 threads to do it: my code's efficiency has fallen to 21% of what it was.
 
  • #19
Vanadium 50 said:
One thing to keep in mind is that while OpenMP is intended to be "easy to use", there is no shortcut to writing efficient multithreaded applications. It takes experience to get good at it. I got a factor of 5, sure, but it took 24 threads to do it: my code's efficiency has fallen to 21% of what it was.
You are correct that the time was the total over all threads. My original program was fine: when I timed it with the OpenMP library calls, each thread showed as faster. Some loops can't be parallelized, though. If an iteration's result is not a function of the loop index alone, that is, if it depends on the result of the previous iteration, there is no way to split the loop across threads.
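For example, here is a schematic contrast between the two cases (my own fragment, not the thread's code):

Fortran:
! Each b(i) depends only on the index i, so the iterations are
! independent and the loop can be split across threads:
!$omp parallel do
      do i = 1, n
         b(i) = a(i)**2
      end do
!$omp end parallel do

! Each iteration needs the x produced by the previous one (a
! loop-carried dependence), so this loop cannot be distributed:
      do i = 1, n
         x = x**2
      end do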

Thanks,
Jim
 
  • #20
Vanadium 50 said:
One thing to keep in mind is that while OpenMP is intended to be "easy to use", there is no shortcut to writing efficient multithreaded applications. It takes experience to get good at it. I got a factor of 5, sure, but it took 24 threads to do it: my code's efficiency has fallen to 21% of what it was.
Also, here is how I looked at the thread times:

Code:
      use omp_lib                  ! declares omp_get_wtime, omp_get_max_threads
      real*8 tbegin, wtime

      tbegin = omp_get_wtime()

!$omp parallel do
      do i = 1, N
c        code here
      end do
!$omp end parallel do

      wtime = omp_get_wtime() - tbegin
      print "('Computing MP', i10, ' loops ', i2, ' threads took ',
     &f8.4, 's')", N, omp_get_max_threads(), wtime

BTW, inserting code on this forum is awful. The indenting and spacing are lost. Is there a way to insert and copy it correctly?

Jim
 
  • #21
Vanadium 50 said:
Here is the code I wrote to drive the OpenMP SIMPLE example.

Fortran:
       program test
       real, dimension(1000000) :: a, b
       integer i,n

       n = 1000000
       do i = 1, n
         a(i)=0.5+i
       end do

       do i=1,10000
         a(8)=a(8)+1.0;
         call simple(n, a, b)
       end do
       print*,b(7)

       end program

It takes 38 seconds without OpenMP and 7.3 seconds with it (24 threads), so a 5.2x speedup.
I ran your program on my laptop. The arrays were too large for this computer and it crashed. I edited it to decrease N to the point where it would run. This is a dual-core machine, and 4 threads was the most it would do. At best the parallel run took about 61 percent of the non-MP time on my laptop.
Results from a batch file changing the number of threads:

C:\OpenMP\Test>set OMP_NUM_THREADS=1

C:\OpenMP\Test>test
MP Time = 54.616 seconds.
NO MP Time = 43.571 seconds.
Percentage time 125.349083

C:\OpenMP\Test>set OMP_NUM_THREADS=2

C:\OpenMP\Test>test
MP Time = 29.281 seconds.
NO MP Time = 43.555 seconds.
Percentage time 67.2277908

C:\OpenMP\Test>set OMP_NUM_THREADS=4

C:\OpenMP\Test>test
MP Time = 26.403 seconds.
NO MP Time = 43.134 seconds.
Percentage time 61.2115822
 
  • #22
jelanier said:
I ran your program on my laptop. The arrays were too large for this computer and it crashed.

That's very surprising. Each array is only 8 MB in size. Wonder why that is.

But you have a 1.6x speedup on a machine with 2 physical cores, which surely is convincing that OpenMP is working on that computer, right?
 
  • #23
Vanadium 50 said:
That's very surprising. Each array is only 8 MB in size. Wonder why that is.

But you have a 1.6x speedup on a machine with 2 physical cores, which surely is convincing that OpenMP is working on that computer, right?

I am OK with it. I guess I am spoiled: when I edited some NEC (Numerical Electromagnetics Code) code to use optimized OpenBLAS libraries that I compiled for multiple architectures, I got a 6x improvement.
One question: when I use OpenMP statements in my GFORTRAN code, they have to start in column 1 or they are ignored. The code I see online does not show that. Is that the case with you?

Re the crash: I dunno. I will try it on some of my better computers, including my Mac. Later, and thanks,

Jim
 
  • #24
jelanier said:
When I use OpenMP statements in my GFORTRAN code, they have to start in column 1 or they are ignored. The code I see online does not show that

This is what the spec says:

In fixed form source files, the sentinel !$omp, c$omp, or *$omp introduce a directive and must start in column 1. An initial directive line must contain a space or zero in column 6, and continuation lines must have a character other than a space or zero in column 6. In free form source files, the sentinel !$omp introduces a directive and can appear in any column if preceded only by white space.
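So in a fixed-form .f file (which is what gfortran assumes for that extension), the sentinel must be flush against column 1, along these lines (a schematic fragment):

Fortran:
!$omp parallel do
      do i = 1, n
         b(i) = a(i)**2
      end do
!$omp end parallel do

In a free-form .f90 file, the same !$omp sentinel may be indented along with the rest of the code.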
 
  • #25
jelanier said:
Fortran:
      program hel
      ...
!$OMP PARALLEL NUM_THREADS(2)
!$OMP DO
      do i=1,J
            x=I
            x=x**2
      end do
!$OMP END DO
!$OMP END PARALLEL
      ...
      end program hel
(full program in post #11, trimmed here)
I'm not that familiar with Fortran, but one thing stands out: in your OpenMP version you are making a bunch of threads write to the same variable at the same time. That's never a good thing.

Also, cpu_time measures the net CPU time for all the cores added together. So if 1 second of wall-clock time was spent doing CPU work on 4 cores, the reported CPU time would be something like 4 seconds.

You should usually measure wall-clock time.
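If the intent is just to keep the benchmark loop from racing on x, one option is to give each thread its own copy with a private clause. A schematic fragment in the style of the posted code (note that x is undefined after the loop unless something like lastprivate is added):

Fortran:
!$OMP PARALLEL DO PRIVATE(x)        ! each thread gets its own x
      do i=1,J
            x=I
            x=x**2
      end do
!$OMP END PARALLEL DO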
 
  • #26
Some tips:

(1) Don't get greedy. A 2C/4T CPU is not going to give you a factor of 4, and depending on the code might not give you a factor of 2. You can spend a lot of effort trying to get a small incremental improvement.

(1B) If you want to use GPUs, OpenMP is not very efficient at it. You are usually better off using something else.

(2) Profile, profile, profile. You need to know where the code is spending its time. That may not be where you think it is spending its time. Find that spot and try and parallelize it. If the code is serial, there's no point in throwing OpenMP at it. Once you have sped it up, profile the code again and see if you have a new bottleneck. Repeat until exhausted.

(3) Compile, compile, compile. Debugging with OpenMP is hard. Make sure you have not introduced any bugs at every step. If you do 8 hours of work and now the answer is wrong, good luck finding exactly where. One missing barrier can make a mess - and good luck finding it. Similarly, save known good pieces of code. You may need to revert.

(4) Consider pthreads over OpenMP. OpenMP parallelizes small chunks of code, ideally with no or few branches. pthreads can parallelize large chunks of code with lots of branches. Depending on what you are doing, one may be much better than the other.
 
  • #27
Vanadium 50 said:
Some tips:

(1) Don't get greedy. A 2C/4T CPU is not going to give you a factor of 4, and depending on the code might not give you a factor of 2. You can spend a lot of effort trying to get a small incremental improvement.

(1B) If you want to use GPUs, OpenMP is not very efficient at it. You are usually better off using something else.

(2) Profile, profile, profile. You need to know where the code is spending its time. That may not be where you think it is spending its time. Find that spot and try and parallelize it. If the code is serial, there's no point in throwing OpenMP at it. Once you have sped it up, profile the code again and see if you have a new bottleneck. Repeat until exhausted.

(3) Compile, compile, compile. Debugging with OpenMP is hard. Make sure you have not introduced any bugs at every step. If you do 8 hours of work and now the answer is wrong, good luck finding exactly where. One missing barrier can make a mess - and good luck finding it. Similarly, save known good pieces of code. You may need to revert.

(4) Consider pthreads over OpenMP. OpenMP parallelizes small chunks of code, ideally with no or few branches. pthreads can parallelize large chunks of code with lots of branches. Depending on what you are doing, one may be much better than the other.
I am with you on that. I typically experiment with my code to maximize efficiency; back when I first started writing code (1974), that was required. I have revisited my first posted GFORTRAN program and came up with a better way of measuring time. It looks like I can improve the time by better than 2 to 1.

[CODE lang="fortran" title="hel"]
      program hel

      integer i,j,ct_0,ct_1,ct_rate,ct_max
      real*8 :: time_init,time_final,elapsed_t,elapsed_t_MP,x

      j=250000000 ! SET NUMBER OF LOOPS

**************************** USE MP *****************************

      x=1.0000001 ! SET INITIAL X

c     Starting time
      call system_clock(ct_0,ct_rate,ct_max)
      time_init=ct_0*1.0/ct_rate

!$OMP PARALLEL DO
      do i=1,J
         x=x**2
         x=((x-1)/2)+1
      end do
!$OMP END PARALLEL DO

c     Ending time
      call system_clock(ct_1,ct_rate,ct_max)
      time_final=ct_1*1.0/ct_rate
      elapsed_t_MP=time_final-time_init

      WRITE(*,10) elapsed_t_MP
 10   format("   MP time",F8.3," seconds")

**************************** NO MP ******************************

      x=1.0000001 ! SET INITIAL X

c     Starting time
      call system_clock(ct_0,ct_rate,ct_max)
      time_init=ct_0*1.0/ct_rate

      do i=1,J
         x=x**2
         x=((x-1)/2)+1
      end do

c     Ending time
      call system_clock(ct_1,ct_rate,ct_max)
      time_final=ct_1*1.0/ct_rate
      elapsed_t=time_final-time_init

      WRITE(*,11) elapsed_t
 11   format("No MP time",F8.3," seconds")

      WRITE(*,12) elapsed_t_MP*100/elapsed_t
 12   format("Percentage Time",F8.3)

      end program
[/CODE]
 
  • #28
results:

C:\OpenMP\Test elapsed HEL>set OMP_NUM_THREADS=8

C:\OpenMP\Test elapsed HEL>hel
MP time 1.092 seconds
No MP time 2.668 seconds
Percentage Time 40.929

C:\OpenMP\Test elapsed HEL>set OMP_NUM_THREADS=16

C:\OpenMP\Test elapsed HEL>hel
MP time 1.123 seconds
No MP time 2.652 seconds
Percentage Time 42.345

C:\OpenMP\Test elapsed HEL>set OMP_NUM_THREADS=32

C:\OpenMP\Test elapsed HEL>hel
MP time 1.107 seconds
No MP time 2.668 seconds
Percentage Time 41.492

C:\OpenMP\Test elapsed HEL>pause
Press any key to continue . . .
 
  • #29
You should print x at the end. Two reasons for this:
a. If you don't, the compiler might recognize that it is never used and optimize it away
b. It will show you that you get a different answer with OpenMP. Your code is serial (can you see why?) so if it is being sped up, it's making a mistake.
 
  • #30
Vanadium 50 said:
You should print x at the end. Two reasons for this:
a. If you don't, the compiler might recognize that it is never used and optimize it away
b. It will show you that you get a different answer with OpenMP. Your code is serial (can you see why?) so if it is being sped up, it's making a mistake.
I think it is because of what I commented earlier. If an iteration's result is not a function of i alone, the loop can't possibly run in parallel. In my loop, parallel execution would make a mess: each iteration requires knowledge of the previous one. I was going to rewrite something in the loop to make it practical. The purpose of this code was to measure time more reliably; cpu_time doesn't work the way I want with the OMP code.
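One way to rewrite such a timing loop so that it both parallelizes and cannot be optimized away is to accumulate a reduction, for example (a sketch, not the original computation):

Fortran:
      real*8 s
      s = 0.0d0
!$OMP PARALLEL DO REDUCTION(+:s)
      do i = 1, J
            s = s + sqrt(dble(i))   ! each term depends only on i
      end do
!$OMP END PARALLEL DO
      print *, s                    ! using s keeps the compiler honest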

thanks
 
