davidfur
Hey guys,
I parallelized my C++ code with the MPI library. Unfortunately, the speed-up saturates at about 6x as I increase the number of cores, whereas in theory it should be nearly linear.
To understand where the bottleneck comes from, I drilled down into the code and isolated one specific subroutine that operates on NumP particles inside a loop. With serial code (or 1 MPI rank), this loop runs its set of commands once for each of the 20 particles. With an MPI run on 20 cores (i.e. mpirun -np 20 ./program), the loop body runs only once per rank (1 particle on each of the 20 cores). So the speed-up I expect for this loop is about 20x (maybe a little less because of communication overhead), but the speed-up I actually get is ~6x.
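Just to make sure the setup is clear, this is how I think about the decomposition (a minimal standalone sketch, not my actual code; TotalP is a placeholder name for the total particle count):
Code:
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int numcores, core;
    MPI_Comm_size(MPI_COMM_WORLD, &numcores); // total number of ranks
    MPI_Comm_rank(MPI_COMM_WORLD, &core);     // this rank's id
    int TotalP = 20;                          // total number of particles (placeholder)
    int NumP = TotalP / numcores;             // particles this rank loops over
    // with TotalP = 20 and "mpirun -np 20", NumP == 1 on every rank
    MPI_Finalize();
    return 0;
}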
Since I'm really a beginner with MPI, I haven't been able to solve this even after trying different things, and I would really appreciate the help of some more experienced gurus.
I attach the relevant piece of code below, and I apologize for the extra details it contains that are unrelated to the question, but I'm pretty sure the issue is basic enough that you could spot it easily (well, I hope).
The relevant piece of code (parallelized loop):
Code:
// MPI timing variables
double starttime, endtime, diff, sumdiff, avgdiff;
starttime = MPI_Wtime();
// Main loop over NumP particles (NumP = #particles/#numcores)
// Copy geo to geo.parID so each particle works with its own geo file
for (int p = 0; p < NumP; p++) {
    string parID = std::to_string(p);
    boost::filesystem::copy_file(pwd.string()+"/CPU."+str_core+"/geo", pwd.string()+"/CPU."+str_core+"/geo."+parID);
    Par NewPar;
    newSwarm.AddPar(NewPar);
    if (core == 0) {
        // If contff == y, then take force field's current values for the position of particle 0 (the rest are just random)
        if (contff == true) {
            vector<double> ffpos;
            ffpos.clear();
            for (int m = 0; m < dim; m++) {
                ffpos.push_back(stod(newSwarm.GetPar(0).ffieldmat.at(newSwarm.GetPar(0).ffline.at(m)).at(newSwarm.GetPar(0).ffcol.at(m))));
            }
            for (int m = 0; m < dim; m++) {
                newSwarm.GetPar(0).set_pos(ffpos);
            }
        }
        contff = false;
    }
    // Initialize global best fitness and corresponding global best position and personal best
    curfit = newSwarm.GetPar(p).eval_fitness(p);
    newSwarm.GetPar(p).set_fitness(curfit);
    newSwarm.GetPar(p).set_bfit(curfit);
    if (curfit < gbfit) {
        gbfit = curfit;
    }
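    // At this point gbfit holds the lowest fitness this rank knows about so far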
    // define a struct (pair) to hold the minimum fitness across processes and its rank for particle p
    struct {
        double tmp_fit;
        int tmp_cpu;
    } min_vals_in[1], min_vals_out[1];
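    // (this {double, int} layout matches the MPI_DOUBLE_INT datatype used with MPI_MINLOC below)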
    min_vals_in[0].tmp_fit = gbfit; // store current fit on each process
    min_vals_in[0].tmp_cpu = core;  // store core id of that current process
    // get minimum fitness across all processes and the corresponding core id and store them in min_vals_out
    MPI_Allreduce(&min_vals_in, &min_vals_out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);
    // global best fitness across all processes for particle p
    gbfit = min_vals_out[0].tmp_fit;
    // core rank the above fitness came from
    cpuid_gbfit = min_vals_out[0].tmp_cpu;
    // store particle id the above fitness came from in parid_gbfit
    parid_gbfit = p;
    // broadcast the global best fitness data: gbfit, cpuid_gbfit and parid_gbfit
    MPI_Bcast(&cpuid_gbfit, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&parid_gbfit, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&gbfit, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    gbpos.clear();
    gbpos = newSwarm.GetPar(parid_gbfit).get_pos_vec();
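    // cpuid_gbfit is the broadcast root: the rank that produced the minimum sends its gbpos to all ranks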
    MPI_Bcast(gbpos.data(), gbpos.size(), MPI_DOUBLE, cpuid_gbfit, MPI_COMM_WORLD);
} // done with all particles
endtime = MPI_Wtime();
diff = endtime - starttime;
// sum each rank's elapsed time on rank 0, then average over the ranks
MPI_Reduce(&diff, &sumdiff, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (core == 0) {
    avgdiff = sumdiff/numcores;
    cout << "inside time: " << avgdiff << " seconds";
}