Parallel DO loop with OpenMP (fortran)

  • Context: Fortran
  • Thread starter: davidfur
  • Tags: Fortran, Loop, Parallel

Discussion Overview

The discussion revolves around the challenges of parallelizing a Fortran code segment using OpenMP, specifically focusing on the innermost DO loop that calculates distances between particles and a point in 3D space. Participants explore issues related to thread safety, variable scope, and performance implications of parallel execution.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Exploratory

Main Points Raised

  • One participant attempts to parallelize a DO loop but experiences a significant increase in execution time when using multiple threads, indicating potential issues with the implementation.
  • Another participant suggests that calling the subroutine dista3 within the parallelized loop may not be appropriate due to restrictions on variable data space in OpenMP.
  • A participant questions the validity of the approach, highlighting the potential for race conditions due to shared variable access in the dista3 subroutine.
  • Concerns are raised about the management of variables like dx, which may lead to inconsistent results across threads if not properly handled.
  • Some participants reference a similar implementation from a historical OpenMP Fortran code, suggesting that their approach should not inherently cause issues.
  • There are suggestions that using pthreads might be more suitable for the complexity of the code, as OpenMP may not handle the situation effectively without significant restructuring.
  • Options are discussed for modifying the subroutine to avoid variable conflicts, including moving code into the main block or changing the calling parameters to ensure thread safety.

Areas of Agreement / Disagreement

Participants express differing views on the appropriateness of the current implementation of OpenMP in the context of the dista3 subroutine. There is no consensus on the best approach to resolve the issues raised, and multiple competing views remain regarding the use of OpenMP versus pthreads.

Contextual Notes

Limitations include the potential for race conditions due to shared variable access, the need for proper variable scoping in parallel regions, and the varying support for threading models across Fortran compilers.

davidfur
TL;DR
I'm trying to parallelize a simple DO loop for the first time, but without success.
I need some basic help with a minimal code sample.
Hey guys,
I've started to read up on OpenMP programming and now I'm trying to parallelize a small part of a Fortran code.
The first thing I would like to do is parallelize the innermost DO loop. It loops over the number of particles (na) and calculates
the distance between a point in 3D space (pos) and each particle's position. At the end, the particle closest to the point should be identified.

Fortran:
!$omp parallel do private(i1,imol2,atomDis) default(shared)
            do i1=1,na
              imol1=iag(aid,3+mbond)
              imol2=iag(i1,3+mbond)
              !write(*,*) 'atom ',aid,' belongs to mol: ',imol1
              !write(*,*) 'atom ',i1,'  belongs to mol: ',imol2
              ! perform analysis only on same molecule
              if (imol1 .NE. imol2) then
                  !write(*,*) 'cycle at atom ',i1
                 cycle
              endif
              call dista3(i1,pos,atomDis,dx,dy,dz)
              ! pos is already in angstrom.
              ! convert atomDis back to bohr.
              atomDis=atomDis/bohr2ang
              !write(*,*) 'atomDis=',atomDis
              if (atomDis < shortestDis) then
                 !write(*,*) 'closest atom is: ',i1
                 closestAtm = i1
                 shortestDis = atomDis
              endif
            enddo
!$omp end parallel do

When I compile the program and run it on 1 thread, the execution time is 13 seconds (for the whole program). Running on 10 threads, the execution time jumps to over a minute. Clearly I have messed up somewhere, but my current understanding is still lacking...

Specifically, I would like to see the expected speed-up, and also make sure that all threads are aware of the most up-to-date shortestDis to compare to.

Can anybody guide me through this?
 
I'm actually surprised the time changed at all. You can't call dista3 inside the loop you are trying to parallelize.
 
Why not, Vanadium?
 
Old versions didn't allow this at all, but the basic problem is that the subroutine has a huge number of restrictions placed on it - you do not get N identical copies of the subroutine, each with its own data space. The odds are quite high that this will break (which is why it wasn't allowed earlier).

And to add, if you knew all this, I would have expected you to post the source of the called routine so we can discuss potential race conditions and such.
 
Thanks, but I don't think I understand the problem.

Here is the dista3 subroutine, which calculates the distance:
Fortran:
    subroutine dista3 (n1,pos,dista,dx,dy,dz)
    include 'cbka.blk'
    double precision :: pos(3)
!*********************************************************************
!     Determine distance between atom n1 and point pos               *
!*********************************************************************
    if (iortho == 1) then
        dx=c(n1,1)-pos(1)
        dy=c(n1,2)-pos(2)
        dz=c(n1,3)-pos(3)
        dx=dx-anint(dx/aaxh)*aaxh
        dy=dy-anint(dy/baxh)*baxh
        dz=dz-anint(dz/caxh)*caxh
        dista=dsqrt(dx*dx+dy*dy+dz*dz)
    else
        dismin=1d+30
        do k1=-1,1
            dx=(c(n1,1)-pos(1)+k1*tm11)
            do k2=-1,1
                dy=(c(n1,2)-pos(2)+k1*tm21+k2*tm22)
                do k3=-1,1
                    dz=(c(n1,3)-pos(3)+k1*tm31+k2*tm32+k3*tm33)
                    dis=dsqrt(dx*dx+dy*dy+dz*dz)
                    if (dis < dismin) then
                        dxmin=dx
                        dymin=dy
                        dzmin=dz
                        dismin=dis
                    end if
                end do
            end do
        end do
        dista=dismin
        dx=dxmin
        dy=dymin
        dz=dzmin
    end if
    return
    end subroutine dista3

In fact, a very similar thing was done in a reference OpenMP Fortran code I found on the web. It was written by Bill Magro of Kuck and Associates, Inc. (KAI) in 1998; he is now chief HPC technologist at Intel.

Fortran:
! The computation of forces and energies is fully parallel.
!$omp  parallel do
!$omp& default(shared)
!$omp& private(i,j,k,rij,d)
!$omp& reduction(+ : pot, kin)
      do i=1,np
        ! compute potential energy and forces
        f(1:nd,i) = 0.0
        do j=1,np
             if (i .ne. j) then
               call dist(nd,box,pos(1,i),pos(1,j),rij,d)
               ! attribute half of the potential energy to particle 'j'
               pot = pot + 0.5*v(d)
               do k=1,nd
                 f(k,i) = f(k,i) - rij(k)*dv(d)/d
               enddo
             endif
        enddo
        ! compute kinetic energy
        kin = kin + dotr8(nd,vel(1,i),vel(1,i))
      enddo
!$omp  end parallel do

And his dist subroutine is the following:

Fortran:
! Compute the displacement vector (and its norm) between two particles.
      subroutine dist(nd,box,r1,r2,dr,d)
      implicit none

      integer nd
      real*8 box(nd)
      real*8 r1(nd)
      real*8 r2(nd)
      real*8 dr(nd)
      real*8 d

      integer i

      d = 0.0
      do i=1,nd
        dr(i) = r1(i) - r2(i)
        d = d + dr(i)**2.
      enddo
      d = sqrt(d)

      return
      end

So, I don't see why my subroutine should cause any special problems. I'm just a beginner with OpenMP though so please bear with me
 
Thread one sets dx = 10.
Then thread two sets dx = 11.
Then thread one updates dx by subtracting off the anint stuff. It's subtracting off 11, not 10.
Then thread two updates dx by subtracting off the anint stuff. It's subtracting off neither 11 nor 10, but whatever was left over from the previous step.
Then thread three comes along and sets dx to something else, and so on.

There is a huge mess.

Your sample code from Magro solves this by declaring those variables private. Your code does not.

Hence the mess.
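To make the race concrete, here is a minimal sketch (not from the thread) of how the original loop could be restructured: every working variable is private, each thread tracks its own running minimum, and a critical section merges the per-thread results at the end. The names locClosest and locShortest are illustrative; the sketch also hoists the loop-invariant imol1 out of the loop, and it still assumes dista3 itself has been made thread-safe (no shared temporaries).

```fortran
      imol1 = iag(aid,3+mbond)          ! loop-invariant: compute once
!$omp parallel private(i1,imol2,atomDis,dx,dy,dz,locClosest,locShortest) default(shared)
      locShortest = shortestDis         ! per-thread running minimum
      locClosest  = closestAtm
!$omp do
      do i1 = 1, na
         imol2 = iag(i1,3+mbond)
         if (imol1 .ne. imol2) cycle    ! analyze same molecule only
         call dista3(i1,pos,atomDis,dx,dy,dz)
         atomDis = atomDis/bohr2ang     ! convert back to bohr
         if (atomDis < locShortest) then
            locClosest  = i1
            locShortest = atomDis
         endif
      enddo
!$omp end do
!$omp critical
      ! merge this thread's result into the shared answer, one thread at a time
      if (locShortest < shortestDis) then
         shortestDis = locShortest
         closestAtm  = locClosest
      endif
!$omp end critical
!$omp end parallel
```

A plain reduction(min:shortestDis) clause would not work here on its own, because the index closestAtm has to stay paired with the minimum distance; the critical-section merge handles both together.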
 
If you have complex code and not tight loops, why are you using OpenMP and not pthreads?
 
I know even less about pthreads...
 
I just realized that not all Fortran compilers support pthreads. XLF for sure does. However, pthreads fits what you are trying to do better than OpenMP.

If you insist on OpenMP, you have two icky options. One is to move all the subroutines into the main block of code. Then you can protect the variables. But your code will get very spaghetti-ish. You may even find that you need to duplicate a lot of code.

The other possibility is to change the calling parameters so that your subroutine declares no variables at all. Not a one. Then you can protect what needs to be private.

What you would like to do, but cannot, is declare a variable private inside the subroutine. That's not the right spot - it needs to be part of the omp parallel instruction block.
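For the calling-parameters option, a sketch of what a self-contained distance routine might look like. Every quantity it writes is either a dummy argument or a local variable, so concurrent calls cannot interfere. The new local names (tdis, tmin, ...) are illustrative, chosen to avoid clashing with anything cbka.blk may already declare, and the cell parameters (c, iortho, aaxh, ..., tm33) are assumed to be read-only during the loop; the caller must pass private variables for dx, dy, dz.

```fortran
      subroutine dista3_ts (n1, pos, dista, dx, dy, dz)
      include 'cbka.blk'   ! assumed read-only here: c, iortho, aaxh,
                           ! baxh, caxh, tm11..tm33
      integer :: n1, k1, k2, k3
      double precision :: pos(3), dista, dx, dy, dz
      ! Locals instead of include-file temporaries: each call gets its own copy
      double precision :: tdis, tmin, txmin, tymin, tzmin

      if (iortho == 1) then
         dx = c(n1,1) - pos(1)
         dy = c(n1,2) - pos(2)
         dz = c(n1,3) - pos(3)
         dx = dx - anint(dx/aaxh)*aaxh
         dy = dy - anint(dy/baxh)*baxh
         dz = dz - anint(dz/caxh)*caxh
         dista = dsqrt(dx*dx + dy*dy + dz*dz)
      else
         tmin = 1d+30
         do k1 = -1, 1
            dx = c(n1,1) - pos(1) + k1*tm11
            do k2 = -1, 1
               dy = c(n1,2) - pos(2) + k1*tm21 + k2*tm22
               do k3 = -1, 1
                  dz = c(n1,3) - pos(3) + k1*tm31 + k2*tm32 + k3*tm33
                  tdis = dsqrt(dx*dx + dy*dy + dz*dz)
                  if (tdis < tmin) then
                     txmin = dx
                     tymin = dy
                     tzmin = dz
                     tmin  = tdis
                  end if
               end do
            end do
         end do
         dista = tmin
         dx = txmin
         dy = tymin
         dz = tzmin
      end if
      end subroutine dista3_ts
```

Whether this is actually safe still depends on what else lives in cbka.blk; if that include file declares any of these names in a common block, the local declarations would have to win out or be renamed.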
 