MPI parallel code speed-up sucks

  • Thread starter: davidfur
  • Tags: C++, Code, Parallel

Discussion Overview

The discussion revolves around the performance issues encountered when parallelizing a C++ code using the MPI library. Participants are exploring the reasons for suboptimal speed-up, which saturates at a factor of 6 instead of the expected near-linear scaling with the number of cores. The focus is on identifying bottlenecks in the code, particularly within a specific subroutine that processes particles in a loop.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • The original poster (OP) expects a speed-up of 20x when running on 20 MPI cores but observes only ~6x, suggesting an inefficiency somewhere.
  • The OP shares the relevant piece of code and is unsure which basic issue is causing the performance bottleneck.
  • One participant asks the OP for a minimal code example that reproduces the observed behavior, along with how MPI is initialized.
  • Another participant asks what hardware the code runs on, since a mismatch between the number of MPI ranks and physical cores could limit performance.
  • The hardware question is raised twice, indicating that participants consider the underlying system a likely factor in the observed performance.

Areas of Agreement / Disagreement

Participants have not reached a consensus on the cause of the performance issue. Multiple viewpoints are presented regarding the potential factors affecting speed-up, including code structure and hardware limitations.

Contextual Notes

Participants have not yet addressed specific assumptions or dependencies in the code that could influence performance. The discussion remains open to further exploration of these factors.

davidfur
Hey guys,
I parallelized my code, written in C++, with the MPI library. Unfortunately, the speed-up I get saturates at about 6x as the number of cores increases, whereas in theory the speed-up should be nearly linear.

To understand where the bottlenecks come from, I drilled down into the code to isolate one specific subroutine, which performs operations on NumP particles inside a loop. With a serial run (or 1 MPI core), this loop runs a set of commands on all 20 particles. With 20 MPI cores (i.e. mpirun -np 20 ./program), the loop runs only once on each core (one particle per core). So the speed-up I expect for this loop is 20x (maybe a little less because of communication overhead), but the speed-up I get is ~6.
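As a sanity check on the numbers: by Amdahl's law, if a fraction ##s## of the work is effectively serial, the speed-up on ##N## cores is
$$S(N) = \frac{1}{s + (1-s)/N},$$
and setting ##S(20) = 6## gives ##s = 7/57 \approx 0.12##. So if only ~12% of this loop were effectively serial, the speed-up would already be capped around 6 on 20 cores; I just don't see where a fraction like that would come from.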

Since I'm really a beginner with MPI, I haven't been able to solve the issue despite trying different things, and I would really appreciate the help of some more experienced gurus.

I attach the relevant piece of code. I apologize for the extra details it contains that are unrelated to the question, but I suspect the issue is basic enough that you can spot it easily (well, I hope).

The relevant piece of code (parallelized loop):
Code:
        // MPI timing variables
        double starttime, endtime, diff, sumdiff, avgdiff;
        starttime = MPI_Wtime();

        // Main loop over NumP particles (NumP = #particles / #cores)
        // Copy geo to geo.parID so that each particle works with its own geo file
        for (int p = 0; p < NumP; p++) {
                string parID = std::to_string(p);
                boost::filesystem::copy_file( pwd.string()+"/CPU."+str_core+"/geo",pwd.string()+"/CPU."+str_core+"/geo."+parID );

                Par NewPar;
                newSwarm.AddPar(NewPar);
                if (core == 0) {
                        // If contff == y, then take force field's current values for the position of particle 0 (the rest are just random)
                        if (contff == true){
                                vector <double> ffpos;
                                ffpos.clear();
                                for (int m = 0; m < dim; m++){
                                        ffpos.push_back(stod(newSwarm.GetPar(0).ffieldmat.at(newSwarm.GetPar(0).ffline.at(m)).at(newSwarm.GetPar(0).ffcol.at(m))));
                                };
                                for (int m = 0; m < dim; m++){
                                        newSwarm.GetPar(0).set_pos(ffpos);
                                };
                        };
                        contff = false;
                };
                // Initialize global best fitness and corresponding global best position and personal best
                curfit = newSwarm.GetPar(p).eval_fitness(p);
                newSwarm.GetPar(p).set_fitness(curfit);
                newSwarm.GetPar(p).set_bfit(curfit);
                if (curfit < gbfit){
                        gbfit = curfit;
                };
                // define a struct (pair) to hold the minimum fitness across processes and its rank for particle p
                struct {
                        double tmp_fit;
                        int tmp_cpu;
                } min_vals_in[1], min_vals_out[1];

                min_vals_in[0].tmp_fit = gbfit;    // store current fit on each process
                min_vals_in[0].tmp_cpu = core;      // store core id of that current process

                // get minimum fitness across all processes and the corresponding core id and store them in min_vals_out
                MPI_Allreduce(min_vals_in, min_vals_out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);
                // global best fitness across all processes for particle p
                gbfit = min_vals_out[0].tmp_fit;

                // core rank the above fitness came from
                cpuid_gbfit = min_vals_out[0].tmp_cpu;

                // store particle id the above fitness came from in parid_gbfit
                parid_gbfit = p;

                // broadcast the global best fitness data: gbfit, cpuid_gbfit and parid_gbfit
                MPI_Bcast(&cpuid_gbfit, 1, MPI_INT, 0, MPI_COMM_WORLD);
                MPI_Bcast(&parid_gbfit, 1, MPI_INT, 0, MPI_COMM_WORLD);
                MPI_Bcast(&gbfit, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

                gbpos.clear();
                //vector <double> tmp_pos;
                gbpos = newSwarm.GetPar(parid_gbfit).get_pos_vec();
                MPI_Bcast(gbpos.data(), (int)gbpos.size(), MPI_DOUBLE, cpuid_gbfit, MPI_COMM_WORLD);
        }; // done with all particles

        endtime = MPI_Wtime();
        diff = endtime - starttime;
        MPI_Reduce(&diff, &sumdiff, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (core == 0){
                avgdiff = sumdiff/numcores;
                cout << "inside time: " << avgdiff << " seconds";
        };
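One more thing I'm unsure about: I report the average of the per-rank loop times, but since the slowest rank sets the actual wall-clock time, maybe the maximum is the more honest number. Something like this, reusing the diff variable from above:
Code:
        // The slowest rank determines the wall-clock time of the parallel loop
        double maxdiff;
        MPI_Reduce(&diff, &maxdiff, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (core == 0) {
                cout << "slowest rank time: " << maxdiff << " seconds";
        };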
 
Can you a) post the minimum code that shows this behavior and b) show us how you initialize MPI?
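For (a), something as bare-bones as this is what I mean (a sketch, not your actual code; the loop body is a placeholder for the per-particle work):
Code:
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, nprocs = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        // Each rank takes every nprocs-th particle out of the 20
        const int total_particles = 20;
        double t0 = MPI_Wtime();
        for (int p = rank; p < total_particles; p += nprocs) {
                // ... per-particle work would go here ...
        }
        printf("rank %d of %d: %f s\n", rank, nprocs, MPI_Wtime() - t0);

        MPI_Finalize();
        return 0;
}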
 
And what hardware are you running it on?
 
glappkaeft said:
And what hardware are you running it on?
This could be the problem if the hardware doesn't have the same number of physical cores as you (the OP) are attempting to run on.
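An easy way to check from inside the program (a sketch; note that hardware_concurrency() counts hardware threads, which can be twice the number of physical cores when hyper-threading is on):
Code:
#include <mpi.h>
#include <cstdio>
#include <thread>

int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, nprocs = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        char name[MPI_MAX_PROCESSOR_NAME];
        int len = 0;
        MPI_Get_processor_name(name, &len);

        // If nprocs exceeds the physical cores on a node, ranks time-share
        // the cores and scaling flattens out well below linear.
        unsigned hw = std::thread::hardware_concurrency();
        printf("rank %d of %d on %s: %u hardware threads\n", rank, nprocs, name, hw);

        MPI_Finalize();
        return 0;
}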
 
