Trouble with parrallelizing for loop with openMP, cpp

  • Thread starter Gullik
  • Start date
  • Tags
    Loop
In summary: prev_j = j - 1; prev_k = k - 1; //else // // // // // // // // // // // // // // // // // // // // // //
  • #1
Gullik
62
6
Hi,

I'm working on doing some slight parallelization of a for loop in the force calculation in a molecular dynamics simulation, using openMP. (probably not up for a proper MPI implementation yet)

It should only need to parallelize the force loop since 90% ish of the time is spent calculating forces.

The atoms in the simulations is put into several neighbour boxes, for each box it only checks the nearby boxes and calculates the inter atomic forces between the atoms in those boxes. So for each box it needs to cycle through all the nearby boxes and call a force calculation boxes (or only half since all the forces assigned to both atoms when when calculated so calculating box [0,0,0] and [1,0,0] is also calculates what box [1,0,0] and [0,0,0] would do)

I built a small testcase that reproduces the error it generates:

Code:
#include <iostream>
#include <omp.h>
#include <armadillo>
using namespace std;

void testNeighbourCycling();

int main()
{
    cout << "Hello World" << endl;

    for(int i = 0 ; i < 10 ; i++)
    {
        cout << "At timestep: " << i << endl;

        testNeighbourCycling();

    }}

void testNeighbourCycling()
{

//    cout << "Hello World" << endl;

    int noOfNeighbors = 3;
    int next_i,next_j,next_k;
    int prev_i, prev_j, prev_k;
    int i, j, k;
    int tid;

     arma::cube neighborVisits = arma::zeros( noOfNeighbors, noOfNeighbors, noOfNeighbors);#pragma omp parallel private( i, j, k, next_i, next_j, next_k, prev_i, prev_j, prev_k, tid) shared(neighborVisits)
{

    #pragma omp for schedule(auto)
    for(i = 0 ; i < noOfNeighbors ; i ++)
    {
        for(j=0; j < noOfNeighbors ; j++)
        {
            for(k = 0; k < noOfNeighbors; k++)
            {
//                    tid = omp_get_thread_num();
//                    cout << " Thread # " << tid << " did " << i << " " << j << " " << k << endl;
//                    calculateForcesInsideCell(vec3(i,j,k),system);

                //Now atoms should be appended with the increasing neighboring cells
                //Could not find any neat way to do it, this is confusing and slow as is. Should work to do it a better way.
                next_i = i + 1;
                next_j = j + 1;
                next_k = k + 1;

                prev_i = i -1;
                prev_j = j -1;
                prev_k = k -1;

                if(i == noOfNeighbors - 1)
                    next_i = 0;

                if(j == noOfNeighbors - 1)
                    next_j = 0;

                if(k == noOfNeighbors - 1)
                    next_k = 0;

                if(i == 0)
                    prev_i = noOfNeighbors - 1;

                if(j == 0)
                    prev_j = noOfNeighbors - 1;

                if(k == 0)
                    prev_k = noOfNeighbors - 1;

//                #pragma omp critical
                {
                //With itself
                neighborVisits( i, j, k) += 1;

                //The top layer, so it cycles through i and j while k is constant

                neighborVisits(i,j,k) += 1;   neighborVisits(prev_i,   prev_j, next_k) += 1;
                neighborVisits(i,j,k) += 1;   neighborVisits(i,        prev_j, next_k) += 1;
                neighborVisits(i,j,k) += 1;   neighborVisits(next_i,   prev_j, next_k) += 1;                neighborVisits(i,j,k) += 1;   neighborVisits(prev_i,   j,      next_k) += 1;
                neighborVisits(i,j,k) += 1;   neighborVisits(i,        j,      next_k) += 1;
                neighborVisits(i,j,k) += 1;   neighborVisits(next_i,   j,      next_k) += 1;

                neighborVisits(i,j,k) += 1;   neighborVisits(prev_i,   next_j, next_k) += 1;
                neighborVisits(i,j,k) += 1;   neighborVisits(i,        next_j, next_k) += 1;
                neighborVisits(i,j,k) += 1;   neighborVisits(next_i,   next_j, next_k) += 1;                //The front layer that is left                neighborVisits(i,j,k) += 1;   neighborVisits(next_i,   prev_j,      k) += 1;
                neighborVisits(i,j,k) += 1;   neighborVisits(next_i,   j,           k) += 1;
                neighborVisits(i,j,k) += 1;   neighborVisits(next_i,   next_j,      k) += 1;                //The rest of the left side

                neighborVisits(i,j,k) += 1;   neighborVisits(i,        next_j,      k) += 1;
                }            }
        }
    }
}

arma::cube testMatrix = 27 * arma::ones(noOfNeighbors,noOfNeighbors,noOfNeighbors);

//    cout << testMatrix << endl;

//    cout << neighborVisits << endl;

//    cout << neighborVisits - testMatrix << endl;

//    cout << arma::accu(testMatrix - neighborVisits) << endl;

if(arma::accu(testMatrix - neighborVisits) != 0)
{
    cout << "Test failed: wrong number of neighbourvisits" << endl;
    cout << neighborVisits << endl;
}

}

The tabs got messed up when pasting, but I'll attach the program as well :)

In this testcase it just counts how many times each neihgbour is visited and all of them should be visited 27 times.

If the #pragma parts are taken away the test never triggers and it reaches and all the neighbors reach 27 each time.

If #pragma critical is uncommented it works but is even slower than not parallelizing it.

If the #pragma pieces is left there it fails on a few of the loops giving outpul like below

Code:
At timestep: 7
At timestep: 8
Test failed: wrong number of neighbourvisits
[cube slice 0]
   27.0000   27.0000   26.0000
   26.0000   27.0000   26.0000
   25.0000   25.0000   25.0000
        
[cube slice 1]
   27.0000   27.0000   26.0000
   27.0000   26.0000   27.0000
   25.0000   27.0000   27.0000
        
[cube slice 2]
   27.0000   27.0000   25.0000
   27.0000   26.0000   27.0000
   25.0000   27.0000   27.0000

So it fails some of the for loops part, or fails to write properly to neighborVisits because of some memory writing conflicts or something.BTW, in the real program the "neighborVisits(i,j,k) += 1; neighborVisits(prev_i, prev_j, next_k) += 1;" parts are supposed to be a call to a function that calculates the force between the atoms in neighbour [i,j,k] and neighbour [i, j-1, k+1].
Link to the testProgram at github: https://github.com/Gullik/CompPhysGullik/tree/master/test/neighbour_openmp
 
Last edited:
Technology news on Phys.org
  • #2
You are doing it wrong. Just put "Pragma openmp parrallel for" on the first for then you absolutely need to delete the critical section becaus it will speed down the algorithm. You need to find a solution to delete it. If you delete it, in case of double acces, you will not have a segfault but just a false value in the box. It's not really a problem in atom simulation, because this error is statisticaly not probable.

Avoid "IF" instruction please.
if(i == noOfNeighbors - 1) next_i = 0;
if(j == noOfNeighbors - 1) next_j = 0;

Make a loop on 2 to size-2 without checking indices. Then make the computation on the border on a second step. It is faster and faster !
 
  • #3
kroni said:
You are doing it wrong. Just put "Pragma openmp parrallel for" on the first for then you absolutely need to delete the critical section becaus it will speed down the algorithm. You need to find a solution to delete it. If you delete it, in case of double acces, you will not have a segfault but just a false value in the box. It's not really a problem in atom simulation, because this error is statisticaly not probable.

In the full simulation it increases the variance of the total energy, which should be a conserved quantity, by approximately a factor 10^4 so I'm not that happy about it. But I would think that the error will decrease with a larger system.

Yeah, I realized that the critical part completely tanked performance :) I can probably fix the problem, given that it is a memory access problem, but that will need some major restructuring and I'll hoped to avoid that for now, since OpenMP uses a shared memory. I planned to save that until I have time to properly learn MPI.

kroni said:
Avoid "IF" instruction please.
if(i == noOfNeighbors - 1) next_i = 0;
if(j == noOfNeighbors - 1) next_j = 0;

Make a loop on 2 to size-2 without checking indices. Then make the computation on the border on a second step. It is faster and faster !

Yeah that should definitely be done. But is it that much faster? The calls to the calculatedForces functions will be the same and the actual time for the triple for loop should be negligible?Thanks
 
  • #4
Well, the IF statement is very bad because the CPU pipeline predict the value of the IF, true for example, if false is computed, the pipelines is purged and you loose a lot of CPU cycle (More than you imagine). I write two particle simulator for cosmologic reason using Hash Table like you and one with Barnes Hut Octree and i have proceed in a different way :

My atom are a Class like
Class Atom
Vector3d position
Vector3d speed
Vector3d sum_of_forces
... etc.

Then i use a discrétisation of space (your 3D tab) and insert pointers to atom inside. After, i compute force for each atom, find neighboor atom using the grid (step that you can parralelize) and then i put the sum of repulsive force in sum_of_forces. In an other step i update particle position, speed etc. In this way the Grid is in read only so you don't need to protect it when you parallelize.

ANOTHER REMARK :

"
neighborVisits(i,j,k) += 1; neighborVisits(prev_i, prev_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(i, prev_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(next_i, prev_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(prev_i, j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(i, j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(next_i, j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(prev_i, next_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(i, next_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(next_i, next_j, next_k) += 1;
"
Don't do this because this code is hard to understand, make three another "for" like

for(t = begin; t <begin +3; ++t )
for(l = begin; l <begin+3; ++l )
for(m = begin; m <begin+3; ++m )
neighborVisits(l,m,t) += 1;

BUT declare "begin" as a "const" so the compiler can understand that the loop can be unrolled at the compilation giving the same code as you write by hand. Don't forget -O3 optimisation flag on G++. Like that :

const int begin = ...
for (size_t m = begin...)
 
  • Like
Likes Gullik
  • #5
kroni said:
Well, the IF statement is very bad because the CPU pipeline predict the value of the IF, true for example, if false is computed, the pipelines is purged and you loose a lot of CPU cycle (More than you imagine). I write two particle simulator for cosmologic reason using Hash Table like you and one with Barnes Hut Octree and i have proceed in a different way :

I will remove the if parts and do some timing tests after a weeks vacation :)

My atom are a Class like
Class Atom
Vector3d position
Vector3d speed
Vector3d sum_of_forces
... etc.

Then i use a discrétisation of space (your 3D tab) and insert pointers to atom inside. After, i compute force for each atom, find neighboor atom using the grid (step that you can parralelize) and then i put the sum of repulsive force in sum_of_forces. In an other step i update particle position, speed etc. In this way the Grid is in read only so you don't need to protect it when you parallelize.

This is basically exactly what I have. Can it still not happen that two threads try to update atom->sum_of_forces at the same time?
ANOTHER REMARK :

"
neighborVisits(i,j,k) += 1; neighborVisits(prev_i, prev_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(i, prev_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(next_i, prev_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(prev_i, j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(i, j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(next_i, j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(prev_i, next_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(i, next_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(next_i, next_j, next_k) += 1;
"
Don't do this because this code is hard to understand, make three another "for" like
I was quite confused while writing it, so no wonder it is confusing.
for(t = begin; t <begin +3; ++t )
for(l = begin; l <begin+3; ++l )
for(m = begin; m <begin+3; ++m )
neighborVisits(l,m,t) += 1;

BUT declare "begin" as a "const" so the compiler can understand that the loop can be unrolled at the compilation giving the same code as you write by hand. Don't forget -O3 optimisation flag on G++. Like that :

const int begin = ...
for (size_t m = begin...)

The for loops you are proposing will cycle trough every neighbour and calculate every force twice. It is necessary to calculate a neighbour-pair once and assign the force to both atoms (one plus and one negative) when calculating the force between two atoms.

Some slight modifications to the loops should fix that.
 
  • #6
Well, i find one solution to avoid computing two time the force for each particle and use OpenMP. It's a computer science optimisation and not an algorithm modification.
Imagine you have two thread working on a 2D table of 10 X 10 size. You build a list (std::vector work well) where you list pointers to table elements in order to avoid collision. for example thread 1 treat collumn 1 to 5 and thread 2 treat collumn 6 to 10. So when thread 1 work on collumn 1, it can't update far atom from collumn 6. Saddly you can loose time if thread 1 work on empty volume and thread 2 work on all the atom. But i think that it's better than a openmp "critical section". You need to think more about it, i don't have an optimal solution.
Have nice holiday.
 

Related to Trouble with parrallelizing for loop with openMP, cpp

1. Can you explain what "parallelizing" means in the context of for loops and openMP in C++?

Parallelizing refers to the process of dividing a task into smaller subtasks that can be executed simultaneously on different processing units. In the context of for loops and openMP in C++, this means using openMP directives to split the iterations of a for loop among multiple threads, allowing for faster and more efficient execution of the loop.

2. How do I know if my for loop can be parallelized with openMP?

Not all for loops can be easily parallelized with openMP. To determine if your for loop is suitable for parallelization, you should consider the dependencies between loop iterations and whether the loop body can be executed independently on different threads. Loops with complex dependencies or data dependencies may not be suitable for parallelization.

3. Can you provide an example of parallelizing a for loop with openMP in C++?

Sure, here is a simple example of parallelizing a for loop using openMP in C++:

#pragma omp parallel forfor (int i = 0; i < n; i++) {// do some work}

This code will split the iterations of the for loop among multiple threads, allowing for parallel execution of the loop.

4. Are there any potential issues or challenges when parallelizing for loops with openMP in C++?

Yes, there are some potential issues to consider when parallelizing for loops with openMP in C++. These include race conditions, where multiple threads try to access and modify the same data at the same time, and load balancing, where some threads may finish their assigned iterations faster than others, leading to idle threads and inefficient use of resources. These issues can be addressed with careful coding and the use of synchronization mechanisms.

5. Are there any alternatives to using openMP for parallelizing for loops in C++?

Yes, there are other options for parallelizing for loops in C++, such as using thread libraries like std::thread or parallel algorithms in the C++ standard library. Additionally, other parallel programming models like MPI or CUDA can also be used for parallelizing loops. The choice of which method to use will depend on the specific requirements and characteristics of the loop and the application as a whole.

Similar threads

  • Programming and Computer Science
Replies
6
Views
8K
  • Programming and Computer Science
Replies
4
Views
5K
  • Programming and Computer Science
2
Replies
66
Views
4K
  • Programming and Computer Science
Replies
2
Views
3K
  • Programming and Computer Science
Replies
5
Views
2K
  • Programming and Computer Science
Replies
6
Views
1K
  • Programming and Computer Science
Replies
3
Views
1K
  • Programming and Computer Science
Replies
4
Views
1K
  • Programming and Computer Science
Replies
11
Views
1K
  • Programming and Computer Science
Replies
1
Views
1K
Back
Top