MPI parallel code speed-up sucks

  • Thread starter: davidfur
  • Tags: C++, Code, Parallel

Discussion Overview

The discussion revolves around the performance issues encountered when parallelizing a C++ code using the MPI library. Participants are exploring the reasons for suboptimal speed-up, which saturates at a factor of 6 instead of the expected near-linear scaling with the number of cores. The focus is on identifying bottlenecks in the code, particularly within a specific subroutine that processes particles in a loop.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • The original poster (OP) expects a speed-up of 20x when running on 20 MPI cores but observes only ~6x, suggesting an inefficiency somewhere.
  • The OP shares the relevant piece of code and is unsure which basic issue is causing the performance bottleneck.
  • One participant asks the OP for a minimal code example that reproduces the observed behavior, along with how MPI is initialized.
  • Another participant asks what hardware the code runs on, since a mismatch between the number of MPI ranks and physical cores could limit performance.
  • The hardware question is raised twice, indicating that participants consider the underlying system a likely factor in the observed performance.

Areas of Agreement / Disagreement

Participants have not reached a consensus on the cause of the performance issue. Multiple viewpoints are presented regarding the potential factors affecting speed-up, including code structure and hardware limitations.

Contextual Notes

Participants have not yet addressed specific assumptions or dependencies in the code that could influence performance. The discussion remains open to further exploration of these factors.

davidfur
Hey guys,
I parallelized my code, written in C++, with the MPI library. Unfortunately, the speed-up I get saturates at about 6x as the number of cores increases, whereas in theory the speed-up should be nearly linear.

To understand where the bottlenecks come from, I drilled down into the code to isolate one specific subroutine, which performs operations on NumP particles inside a loop. With a serial run (or 1 MPI core), this loop runs a set of commands on all 20 particles. With 20 MPI cores (i.e. mpirun -np 20 ./program), the loop runs only once on each core (one particle per core). So the speed-up I expect for this loop is 20x (maybe a little less because of communication overhead), but the speed-up I get is ~6.
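As a sanity check on the numbers: by Amdahl's law, if a fraction ##s## of the work is effectively serial, the speed-up on ##N## cores is
$$S(N) = \frac{1}{s + (1-s)/N},$$
and setting ##S(20) = 6## gives ##s = 7/57 \approx 0.12##. So if only ~12% of this loop were effectively serial, the speed-up would already be capped around 6 on 20 cores; I just don't see where a fraction like that would come from.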

Since I'm really a beginner with MPI, I haven't been able to solve the issue despite trying different things, and I would really appreciate the help of some more experienced gurus.

I attach the relevant piece of code. I apologize for the extra details it contains that are unrelated to the question, but I suspect the issue is basic enough that you can spot it easily (well, I hope).

The relevant piece of code (parallelized loop):
Code:
        // MPI timing variables
        double starttime, endtime, diff, sumdiff, avgdiff;
        starttime = MPI_Wtime();

        // Main loop over NumP particles (NumP = #particles / #cores)
        // Copy geo to geo.parID so that each particle works with its own geo file
        for (int p = 0; p < NumP; p++) {
                string parID = std::to_string(p);
                boost::filesystem::copy_file( pwd.string()+"/CPU."+str_core+"/geo",pwd.string()+"/CPU."+str_core+"/geo."+parID );

                Par NewPar;
                newSwarm.AddPar(NewPar);
                if (core == 0) {
                        // If contff == y, then take force field's current values for the position of particle 0 (the rest are just random)
                        if (contff == true){
                                vector <double> ffpos;
                                ffpos.clear();
                                for (int m = 0; m < dim; m++){
                                        ffpos.push_back(stod(newSwarm.GetPar(0).ffieldmat.at(newSwarm.GetPar(0).ffline.at(m)).at(newSwarm.GetPar(0).ffcol.at(m))));
                                };
                                for (int m = 0; m < dim; m++){
                                        newSwarm.GetPar(0).set_pos(ffpos);
                                };
                        };
                        contff = false;
                };
                // Initialize global best fitness and corresponding global best position and personal best
                curfit = newSwarm.GetPar(p).eval_fitness(p);
                newSwarm.GetPar(p).set_fitness(curfit);
                newSwarm.GetPar(p).set_bfit(curfit);
                if (curfit < gbfit){
                        gbfit = curfit;
                };
                // define a struct (pair) to hold the minimum fitness across processes and its rank for particle p
                struct {
                        double tmp_fit;
                        int tmp_cpu;
                } min_vals_in[1], min_vals_out[1];

                min_vals_in[0].tmp_fit = gbfit;    // store current fit on each process
                min_vals_in[0].tmp_cpu = core;      // store core id of that current process

                // get minimum fitness across all processes and the corresponding core id and store them in min_vals_out
                MPI_Allreduce(min_vals_in, min_vals_out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);
                // global best fitness across all processes for particle p
                gbfit = min_vals_out[0].tmp_fit;

                // core rank the above fitness came from
                cpuid_gbfit = min_vals_out[0].tmp_cpu;

                // store particle id the above fitness came from in parid_gbfit
                parid_gbfit = p;

                // broadcast the global best fitness data: gbfit, cpuid_gbfit and parid_gbfit
                MPI_Bcast(&cpuid_gbfit, 1, MPI_INT, 0, MPI_COMM_WORLD);
                MPI_Bcast(&parid_gbfit, 1, MPI_INT, 0, MPI_COMM_WORLD);
                MPI_Bcast(&gbfit, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

                gbpos.clear();
                //vector <double> tmp_pos;
                gbpos = newSwarm.GetPar(parid_gbfit).get_pos_vec();
                MPI_Bcast(gbpos.data(), (int)gbpos.size(), MPI_DOUBLE, cpuid_gbfit, MPI_COMM_WORLD);
        }; // done with all particles

        endtime = MPI_Wtime();
        diff = endtime - starttime;
        MPI_Reduce(&diff, &sumdiff, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (core == 0){
                avgdiff = sumdiff/numcores;
                cout << "inside time: " << avgdiff << " seconds";
        };
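One more thing I'm unsure about: I report the average of the per-rank loop times, but since the slowest rank sets the actual wall-clock time, maybe the maximum is the more honest number. Something like this, reusing the diff variable from above:
Code:
        // The slowest rank determines the wall-clock time of the parallel loop
        double maxdiff;
        MPI_Reduce(&diff, &maxdiff, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (core == 0) {
                cout << "slowest rank time: " << maxdiff << " seconds";
        };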
 
Can you a) post the minimum code that shows this behavior and b) show us how you initialize MPI?
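For (a), something as bare-bones as this is what I mean (a sketch, not your actual code; the loop body is a placeholder for the per-particle work):
Code:
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, nprocs = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        // Each rank takes every nprocs-th particle out of the 20
        const int total_particles = 20;
        double t0 = MPI_Wtime();
        for (int p = rank; p < total_particles; p += nprocs) {
                // ... per-particle work would go here ...
        }
        printf("rank %d of %d: %f s\n", rank, nprocs, MPI_Wtime() - t0);

        MPI_Finalize();
        return 0;
}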
 
And what hardware are you running it on?
 
glappkaeft said:
And what hardware are you running it on?
This could be the problem if the hardware doesn't have the same number of physical cores as you (the OP) are attempting to run on.
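An easy way to check from inside the program (a sketch; note that hardware_concurrency() counts hardware threads, which can be twice the number of physical cores when hyper-threading is on):
Code:
#include <mpi.h>
#include <cstdio>
#include <thread>

int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, nprocs = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        char name[MPI_MAX_PROCESSOR_NAME];
        int len = 0;
        MPI_Get_processor_name(name, &len);

        // If nprocs exceeds the physical cores on a node, ranks time-share
        // the cores and scaling flattens out well below linear.
        unsigned hw = std::thread::hardware_concurrency();
        printf("rank %d of %d on %s: %u hardware threads\n", rank, nprocs, name, hw);

        MPI_Finalize();
        return 0;
}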
 
