MPI parallel code speed-up sucks

Summary: The user is getting only a 6x speed-up from their MPI-parallelized C++ code on 20 cores, instead of the expected near-linear scaling. They isolated a subroutine that loops over particles, where a 20x speed-up was anticipated, and suspect communication overhead. Respondents ask for a minimal reproducing example, the MPI initialization, and the hardware, since MPI communication costs and a mismatch between MPI ranks and physical cores can both limit parallel performance.
davidfur
Hey guys,
I parallelized my code, written in C++, with the MPI library. Unfortunately, the speed-up saturates at about 6x as I increase the number of cores, whereas in theory it should be nearly linear.

To understand where the bottlenecks come from, I drilled down into the code to isolate one specific subroutine, which operates on NumP particles inside a loop (NumP = #particles / #cores). With serial code (or 1 MPI rank), the loop over the 20 particles runs 20 times. With 20 MPI ranks (i.e. mpirun -np 20 ./program), the loop runs once on each rank (1 particle per core). So I expect close to a 20x speed-up for this loop (maybe a little less because of communication overhead), but I measure only about 6x.
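(For scale: Amdahl's law, S(N) = 1/((1 - p) + p/N) with parallel fraction p, says a 6x speed-up on N = 20 cores corresponds to p ≈ 0.88, i.e. roughly 12% of the runtime spent in serial or communication code. This is a rough back-of-the-envelope estimate, not a measurement from the code below.)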

Since I'm really a beginner with MPI, I haven't been able to solve this despite trying several things, and I would really appreciate the help of some more experienced gurus.

I've attached the relevant piece of code. I apologize for the extra details unrelated to the question, but I'm pretty sure the issue is basic enough that you can spot it easily (well, I hope).

The relevant piece of code (parallelized loop):
Code:
        // MPI timing variables
        double starttime, endtime, diff, sumdiff, avgdiff;
        starttime = MPI_Wtime();

        // Main loop over NumP particles (NumP = #particles / #cores)
        // Copy geo to geo.parID so each particle works with its own geo file
        for (int p = 0; p < NumP; p++) {
                string parID = std::to_string(p);
                boost::filesystem::copy_file(pwd.string()+"/CPU."+str_core+"/geo", pwd.string()+"/CPU."+str_core+"/geo."+parID);

                Par NewPar;
                newSwarm.AddPar(NewPar);
                if (core == 0) {
                        // If contff == true, seed particle 0's position with the force
                        // field's current values (the remaining particles stay random)
                        if (contff == true) {
                                vector<double> ffpos;
                                for (int m = 0; m < dim; m++) {
                                        ffpos.push_back(stod(newSwarm.GetPar(0).ffieldmat.at(newSwarm.GetPar(0).ffline.at(m)).at(newSwarm.GetPar(0).ffcol.at(m))));
                                }
                                newSwarm.GetPar(0).set_pos(ffpos); // set once; no need to repeat dim times
                        }
                        contff = false;
                }
                // Initialize personal best and candidate global best for particle p
                curfit = newSwarm.GetPar(p).eval_fitness(p);
                newSwarm.GetPar(p).set_fitness(curfit);
                newSwarm.GetPar(p).set_bfit(curfit);
                if (curfit < gbfit) {
                        gbfit = curfit;
                }
                // Pair holding the minimum fitness across processes and the rank it
                // came from; the {double, int} layout is what MPI_DOUBLE_INT expects
                struct {
                        double tmp_fit;
                        int tmp_cpu;
                } min_vals_in[1], min_vals_out[1];

                min_vals_in[0].tmp_fit = gbfit;   // current best fitness on this process
                min_vals_in[0].tmp_cpu = core;    // rank id of this process

                // Get the minimum fitness across all processes, and the rank it came from
                MPI_Allreduce(min_vals_in, min_vals_out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);

                // Global best fitness across all processes for particle p
                gbfit = min_vals_out[0].tmp_fit;
                // Rank the above fitness came from
                cpuid_gbfit = min_vals_out[0].tmp_cpu;
                // Particle id the above fitness came from
                parid_gbfit = p;

                // Broadcast the global best fitness data. (Note: MPI_Allreduce already
                // leaves gbfit and cpuid_gbfit identical on every rank, and parid_gbfit
                // is set locally on every rank, so these three broadcasts only add traffic.)
                MPI_Bcast(&cpuid_gbfit, 1, MPI_INT, 0, MPI_COMM_WORLD);
                MPI_Bcast(&parid_gbfit, 1, MPI_INT, 0, MPI_COMM_WORLD);
                MPI_Bcast(&gbfit, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

                // Broadcast the global best position from the rank that owns it
                gbpos = newSwarm.GetPar(parid_gbfit).get_pos_vec();
                MPI_Bcast(gbpos.data(), static_cast<int>(gbpos.size()), MPI_DOUBLE, cpuid_gbfit, MPI_COMM_WORLD);
        } // done with all particles

        endtime = MPI_Wtime();
        diff = endtime - starttime;
        MPI_Reduce(&diff, &sumdiff, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (core == 0) {
                avgdiff = sumdiff / numcores;
                cout << "inside time: " << avgdiff << " seconds" << endl;
        }
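One source of overhead stands out: each loop iteration performs a blocking MPI_Allreduce plus four MPI_Bcast calls, so the number of collective operations grows with NumP. Below is a minimal sketch of hoisting the collectives out of the loop. This is not the OP's code: it reuses the OP's variable names (newSwarm, gbfit, gbpos, core, dim, etc.) and assumes the global best only needs to be agreed on after all local particles are evaluated, which may not hold for every PSO variant.
Code:
        // Evaluate all local particles first, communicating nothing
        for (int p = 0; p < NumP; p++) {
                curfit = newSwarm.GetPar(p).eval_fitness(p);
                newSwarm.GetPar(p).set_fitness(curfit);
                newSwarm.GetPar(p).set_bfit(curfit);
                if (curfit < gbfit) {
                        gbfit = curfit;
                        parid_gbfit = p;   // remember which local particle produced the best fit
                }
        }

        // One reduction for the whole batch instead of one per particle
        struct { double fit; int cpu; } in = { gbfit, core }, out;
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);
        gbfit = out.fit;
        cpuid_gbfit = out.cpu;

        // Only the winning rank knows which of its particles won, so broadcast
        // the particle id and that particle's position from it, once
        MPI_Bcast(&parid_gbfit, 1, MPI_INT, cpuid_gbfit, MPI_COMM_WORLD);
        if (core == cpuid_gbfit) gbpos = newSwarm.GetPar(parid_gbfit).get_pos_vec();
        gbpos.resize(dim);
        MPI_Bcast(gbpos.data(), dim, MPI_DOUBLE, cpuid_gbfit, MPI_COMM_WORLD);
Even if the per-particle semantics must be kept, timing just the eval_fitness call separately from the collectives would show whether compute or communication is what stops scaling.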
 
Can you a) post the minimum code that shows this behavior and b) show us how you initialize MPI?
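(For reference, a minimal MPI initialization typically looks like the sketch below; the variable names core and numcores mirror the ones in the posted loop, but the OP's actual setup may differ.)
Code:
        #include <mpi.h>
        #include <iostream>

        int main(int argc, char** argv) {
                MPI_Init(&argc, &argv);                    // must precede any other MPI call

                int core = 0, numcores = 1;
                MPI_Comm_rank(MPI_COMM_WORLD, &core);      // this process's rank id
                MPI_Comm_size(MPI_COMM_WORLD, &numcores);  // total number of ranks

                std::cout << "rank " << core << " of " << numcores << std::endl;

                // ... parallel work goes here ...

                MPI_Finalize();
                return 0;
        }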
 
And what hardware are you running it on?
 
glappkaeft said:
And what hardware are you running it on?
This could be the problem if the hardware doesn't have as many physical cores as the OP is attempting to run on.
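A quick sanity check (a sketch, not from the thread): std::thread::hardware_concurrency() reports logical processors, which on a hyper-threaded CPU is typically twice the number of physical cores, so mpirun -np 20 on a 10-core/20-thread machine would oversubscribe the physical cores.
Code:
        #include <iostream>
        #include <thread>

        int main() {
                // Reports logical processors (hardware threads), not physical cores;
                // may return 0 if the count cannot be determined
                unsigned int n = std::thread::hardware_concurrency();
                std::cout << n << " hardware threads visible\n";
                return 0;
        }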
 