# Trouble with parrallelizing for loop with openMP, cpp

Hi,

I'm working on doing some slight parallelization of a for loop in the force calculation in a molecular dynamics simulation, using openMP. (probably not up for a proper MPI implementation yet)

It should only need to parallelize the force loop since 90% ish of the time is spent calculating forces.

The atoms in the simulations is put into several neighbour boxes, for each box it only checks the nearby boxes and calculates the inter atomic forces between the atoms in those boxes. So for each box it needs to cycle through all the nearby boxes and call a force calculation boxes (or only half since all the forces assigned to both atoms when when calculated so calculating box [0,0,0] and [1,0,0] is also calculates what box [1,0,0] and [0,0,0] would do)

I built a small testcase that reproduces the error it generates:

Code:
#include <iostream>
#include <omp.h>
#include <armadillo>

using namespace std;

void testNeighbourCycling();

int main()
{
cout << "Hello World" << endl;

for(int i = 0 ; i < 10 ; i++)
{
cout << "At timestep: " << i << endl;

testNeighbourCycling();

}

}

void testNeighbourCycling()
{

//    cout << "Hello World" << endl;

int noOfNeighbors = 3;
int next_i,next_j,next_k;
int prev_i, prev_j, prev_k;
int i, j, k;
int tid;

arma::cube neighborVisits = arma::zeros( noOfNeighbors, noOfNeighbors, noOfNeighbors);

#pragma omp parallel private( i, j, k, next_i, next_j, next_k, prev_i, prev_j, prev_k, tid) shared(neighborVisits)
{

#pragma omp for schedule(auto)
for(i = 0 ; i < noOfNeighbors ; i ++)
{
for(j=0; j < noOfNeighbors ; j++)
{
for(k = 0; k < noOfNeighbors; k++)
{
//                    tid = omp_get_thread_num();
//                    cout << " Thread # " << tid << " did " << i << " " << j << " " << k << endl;
//                    calculateForcesInsideCell(vec3(i,j,k),system);

//Now atoms should be appended with the increasing neighboring cells
//Could not find any neat way to do it, this is confusing and slow as is. Should work to do it a better way.
next_i = i + 1;
next_j = j + 1;
next_k = k + 1;

prev_i = i -1;
prev_j = j -1;
prev_k = k -1;

if(i == noOfNeighbors - 1)
next_i = 0;

if(j == noOfNeighbors - 1)
next_j = 0;

if(k == noOfNeighbors - 1)
next_k = 0;

if(i == 0)
prev_i = noOfNeighbors - 1;

if(j == 0)
prev_j = noOfNeighbors - 1;

if(k == 0)
prev_k = noOfNeighbors - 1;

//                #pragma omp critical
{
//With itself
neighborVisits( i, j, k) += 1;

//The top layer, so it cycles through i and j while k is constant

neighborVisits(i,j,k) += 1;   neighborVisits(prev_i,   prev_j, next_k) += 1;
neighborVisits(i,j,k) += 1;   neighborVisits(i,        prev_j, next_k) += 1;
neighborVisits(i,j,k) += 1;   neighborVisits(next_i,   prev_j, next_k) += 1;

neighborVisits(i,j,k) += 1;   neighborVisits(prev_i,   j,      next_k) += 1;
neighborVisits(i,j,k) += 1;   neighborVisits(i,        j,      next_k) += 1;
neighborVisits(i,j,k) += 1;   neighborVisits(next_i,   j,      next_k) += 1;

neighborVisits(i,j,k) += 1;   neighborVisits(prev_i,   next_j, next_k) += 1;
neighborVisits(i,j,k) += 1;   neighborVisits(i,        next_j, next_k) += 1;
neighborVisits(i,j,k) += 1;   neighborVisits(next_i,   next_j, next_k) += 1;

//The front layer that is left

neighborVisits(i,j,k) += 1;   neighborVisits(next_i,   prev_j,      k) += 1;
neighborVisits(i,j,k) += 1;   neighborVisits(next_i,   j,           k) += 1;
neighborVisits(i,j,k) += 1;   neighborVisits(next_i,   next_j,      k) += 1;

//The rest of the left side

neighborVisits(i,j,k) += 1;   neighborVisits(i,        next_j,      k) += 1;
}

}
}
}
}

arma::cube testMatrix = 27 * arma::ones(noOfNeighbors,noOfNeighbors,noOfNeighbors);

//    cout << testMatrix << endl;

//    cout << neighborVisits << endl;

//    cout << neighborVisits - testMatrix << endl;

//    cout << arma::accu(testMatrix - neighborVisits) << endl;

if(arma::accu(testMatrix - neighborVisits) != 0)
{
cout << "Test failed: wrong number of neighbourvisits" << endl;
cout << neighborVisits << endl;
}

}

The tabs got messed up when pasting, but I'll attach the program as well :)

In this testcase it just counts how many times each neihgbour is visited and all of them should be visited 27 times.

If the #pragma parts are taken away the test never triggers and it reaches and all the neighbors reach 27 each time.

If #pragma critical is uncommented it works but is even slower than not parallelizing it.

If the #pragma pieces is left there it fails on a few of the loops giving outpul like below

Code:
At timestep: 7
At timestep: 8
Test failed: wrong number of neighbourvisits
[cube slice 0]
27.0000   27.0000   26.0000
26.0000   27.0000   26.0000
25.0000   25.0000   25.0000

[cube slice 1]
27.0000   27.0000   26.0000
27.0000   26.0000   27.0000
25.0000   27.0000   27.0000

[cube slice 2]
27.0000   27.0000   25.0000
27.0000   26.0000   27.0000
25.0000   27.0000   27.0000

So it fails some of the for loops part, or fails to write properly to neighborVisits because of some memory writing conflicts or something.

BTW, in the real program the "neighborVisits(i,j,k) += 1; neighborVisits(prev_i, prev_j, next_k) += 1;" parts are supposed to be a call to a function that calculates the force between the atoms in neighbour [i,j,k] and neighbour [i, j-1, k+1].

Link to the testProgram at github: https://github.com/Gullik/CompPhysGullik/tree/master/test/neighbour_openmp

Last edited:

## Answers and Replies

You are doing it wrong. Just put "Pragma openmp parrallel for" on the first for then you absolutely need to delete the critical section becaus it will speed down the algorithm. You need to find a solution to delete it. If you delete it, in case of double acces, you will not have a segfault but just a false value in the box. It's not realy a problem in atom simulation, because this error is statisticaly not probable.

Avoid "IF" instruction please.
if(i == noOfNeighbors - 1) next_i = 0;
if(j == noOfNeighbors - 1) next_j = 0;

Make a loop on 2 to size-2 without checking indices. Then make the computation on the border on a second step. It is faster and faster !

You are doing it wrong. Just put "Pragma openmp parrallel for" on the first for then you absolutely need to delete the critical section becaus it will speed down the algorithm. You need to find a solution to delete it. If you delete it, in case of double acces, you will not have a segfault but just a false value in the box. It's not realy a problem in atom simulation, because this error is statisticaly not probable.

In the full simulation it increases the variance of the total energy, which should be a conserved quantity, by approximately a factor 10^4 so I'm not that happy about it. But I would think that the error will decrease with a larger system.

Yeah, I realized that the critical part completely tanked performance :) I can probably fix the problem, given that it is a memory access problem, but that will need some major restructuring and I'll hoped to avoid that for now, since OpenMP uses a shared memory. I planned to save that until I have time to properly learn MPI.

kroni said:
Avoid "IF" instruction please.
if(i == noOfNeighbors - 1) next_i = 0;
if(j == noOfNeighbors - 1) next_j = 0;

Make a loop on 2 to size-2 without checking indices. Then make the computation on the border on a second step. It is faster and faster !

Yeah that should definitely be done. But is it that much faster? The calls to the calculatedForces functions will be the same and the actual time for the triple for loop should be negligible?

Thanks

Well, the IF statement is very bad because the CPU pipeline predict the value of the IF, true for example, if false is computed, the pipelines is purged and you loose a lot of CPU cycle (More than you imagine). I write two particle simulator for cosmologic reason using Hash Table like you and one with Barnes Hut Octree and i have proceed in a different way :

My atom are a Class like
Class Atom
Vector3d position
Vector3d speed
Vector3d sum_of_forces
....... etc.

Then i use a discrétisation of space (your 3D tab) and insert pointers to atom inside. After, i compute force for each atom, find neighboor atom using the grid (step that you can parralelize) and then i put the sum of repulsive force in sum_of_forces. In an other step i update particle position, speed etc. In this way the Grid is in read only so you don't need to protect it when you parallelize.

ANOTHER REMARK :

"
neighborVisits(i,j,k) += 1; neighborVisits(prev_i, prev_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(i, prev_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(next_i, prev_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(prev_i, j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(i, j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(next_i, j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(prev_i, next_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(i, next_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(next_i, next_j, next_k) += 1;
"
Don't do this because this code is hard to understand, make three another "for" like

for(t = begin; t <begin +3; ++t )
for(l = begin; l <begin+3; ++l )
for(m = begin; m <begin+3; ++m )
neighborVisits(l,m,t) += 1;

BUT declare "begin" as a "const" so the compiler can understand that the loop can be unrolled at the compilation giving the same code as you write by hand. Don't forget -O3 optimisation flag on G++. Like that :

const int begin = ....
for (size_t m = begin...)

Gullik
Well, the IF statement is very bad because the CPU pipeline predict the value of the IF, true for example, if false is computed, the pipelines is purged and you loose a lot of CPU cycle (More than you imagine). I write two particle simulator for cosmologic reason using Hash Table like you and one with Barnes Hut Octree and i have proceed in a different way :

I will remove the if parts and do some timing tests after a weeks vacation :)

My atom are a Class like
Class Atom
Vector3d position
Vector3d speed
Vector3d sum_of_forces
....... etc.

Then i use a discrétisation of space (your 3D tab) and insert pointers to atom inside. After, i compute force for each atom, find neighboor atom using the grid (step that you can parralelize) and then i put the sum of repulsive force in sum_of_forces. In an other step i update particle position, speed etc. In this way the Grid is in read only so you don't need to protect it when you parallelize.

This is basically exactly what I have. Can it still not happen that two threads try to update atom->sum_of_forces at the same time?

ANOTHER REMARK :

"
neighborVisits(i,j,k) += 1; neighborVisits(prev_i, prev_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(i, prev_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(next_i, prev_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(prev_i, j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(i, j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(next_i, j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(prev_i, next_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(i, next_j, next_k) += 1; neighborVisits(i,j,k) += 1; neighborVisits(next_i, next_j, next_k) += 1;
"
Don't do this because this code is hard to understand, make three another "for" like
I was quite confused while writing it, so no wonder it is confusing.
for(t = begin; t <begin +3; ++t )
for(l = begin; l <begin+3; ++l )
for(m = begin; m <begin+3; ++m )
neighborVisits(l,m,t) += 1;

BUT declare "begin" as a "const" so the compiler can understand that the loop can be unrolled at the compilation giving the same code as you write by hand. Don't forget -O3 optimisation flag on G++. Like that :

const int begin = ....
for (size_t m = begin...)

The for loops you are proposing will cycle trough every neighbour and calculate every force twice. It is necessary to calculate a neighbour-pair once and assign the force to both atoms (one plus and one negative) when calculating the force between two atoms.

Some slight modifications to the loops should fix that.

Well, i find one solution to avoid computing two time the force for each particle and use OpenMP. It's a computer science optimisation and not an algorithm modification.
Imagine you have two thread working on a 2D table of 10 X 10 size. You build a list (std::vector work well) where you list pointers to table elements in order to avoid collision. for example thread 1 treat collumn 1 to 5 and thread 2 treat collumn 6 to 10. So when thread 1 work on collumn 1, it can't update far atom from collumn 6. Saddly you can loose time if thread 1 work on empty volume and thread 2 work on all the atom. But i think that it's better than a openmp "critical section". You need to think more about it, i don't have an optimal solution.
Have nice holiday.