MPI parallel code speed-up sucks

AI Thread Summary
The user's MPI-parallelized C++ code reaches only about a 6x speed-up on 20 cores instead of the near-linear scaling expected. They isolated a subroutine that loops over particles, anticipating roughly a 20x speed-up there, but communication overhead appears to dominate. As an MPI beginner, they ask more experienced programmers to help find the bottleneck; replies request a minimal reproducer, details of how MPI is initialized, and what hardware the code runs on, since running more ranks than there are physical cores also limits scaling. The discussion centers on MPI's communication costs and hardware constraints on parallel performance.
davidfur
Hey guys,
I parallelized my C++ code with the MPI library. Unfortunately, the speed-up saturates at about 6x as I increase the number of cores, whereas in theory it should be nearly linear.

To understand where the bottleneck comes from, I drilled down into the code and isolated one subroutine that operates on NumP particles inside a loop (NumP = total particles / number of cores). With a serial run (or 1 MPI rank), the loop executes its body for all 20 particles. With 20 MPI ranks (i.e. mpirun -np 20 ./program), the loop executes only once per rank (1 particle per core). So I expect roughly a 20x speed-up for this loop (maybe a little less because of communication overhead), but the speed-up I actually get is ~6x.
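As a rough sanity check (just back-of-the-envelope, not part of my actual code), I worked out what effective serial/communication fraction would explain a 6x speed-up on 20 cores if I model everything with Amdahl's law:
Code:
#include <cstdio>

// Amdahl's law: S(N) = 1 / ((1 - f) + f / N), where f is the parallel fraction.
// Solving for f given an observed speed-up S on N cores: f = (1/S - 1) / (1/N - 1)
int main() {
    const double N = 20.0;   // cores used
    const double S = 6.0;    // observed speed-up
    const double f = (1.0 / S - 1.0) / (1.0 / N - 1.0);
    std::printf("parallel fraction f  = %.3f\n", f);        // ~0.877
    std::printf("serial/comm fraction = %.3f\n", 1.0 - f);  // ~0.123
    return 0;
}
So roughly 12% of the runtime would have to be serial or spent in communication to cap the speed-up at 6x, which is why I suspect the MPI calls in the loop below.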

Since I'm a beginner with MPI, I haven't been able to solve this despite trying different things, and I would really appreciate the help of some more experienced gurus.

I attach the relevant piece of code. I apologize for the extra details it contains that are unrelated to the question, but I suspect the issue is basic enough that you can spot it easily (well, I hope).

The relevant piece of code (parallelized loop):
Code:
// MPI timing variables
double starttime, endtime, diff, sumdiff, avgdiff;
starttime = MPI_Wtime();

// Main loop over NumP particles (NumP = #particles / #cores)
// Copy geo to geo.parID so each particle works with its own geo file
for (int p = 0; p < NumP; p++) {
    string parID = std::to_string(p);
    boost::filesystem::copy_file(pwd.string() + "/CPU." + str_core + "/geo",
                                 pwd.string() + "/CPU." + str_core + "/geo." + parID);

    Par NewPar;
    newSwarm.AddPar(NewPar);
    if (core == 0) {
        // If contff == y (true), take the force field's current values for the
        // position of particle 0 (the rest are just random)
        if (contff == true) {
            vector<double> ffpos;
            for (int m = 0; m < dim; m++) {
                ffpos.push_back(stod(newSwarm.GetPar(0).ffieldmat
                    .at(newSwarm.GetPar(0).ffline.at(m))
                    .at(newSwarm.GetPar(0).ffcol.at(m))));
            }
            for (int m = 0; m < dim; m++) {
                newSwarm.GetPar(0).set_pos(ffpos);
            }
        }
        contff = false;
    }

    // Initialize global best fitness, the corresponding global best position, and the personal best
    curfit = newSwarm.GetPar(p).eval_fitness(p);
    newSwarm.GetPar(p).set_fitness(curfit);
    newSwarm.GetPar(p).set_bfit(curfit);
    if (curfit < gbfit) {
        gbfit = curfit;
    }

    // Pair (value, rank) holding the minimum fitness across processes for particle p
    struct {
        double tmp_fit;
        int tmp_cpu;
    } min_vals_in[1], min_vals_out[1];

    min_vals_in[0].tmp_fit = gbfit;   // current best fitness on this process
    min_vals_in[0].tmp_cpu = core;    // rank of this process

    // Get the minimum fitness across all processes and the rank it came from
    MPI_Allreduce(min_vals_in, min_vals_out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);

    // Global best fitness across all processes for particle p
    gbfit = min_vals_out[0].tmp_fit;
    // Rank the above fitness came from
    cpuid_gbfit = min_vals_out[0].tmp_cpu;
    // Particle id the above fitness came from
    parid_gbfit = p;

    // Broadcast the global best data: gbfit, cpuid_gbfit and parid_gbfit
    MPI_Bcast(&cpuid_gbfit, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&parid_gbfit, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&gbfit, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    gbpos.clear();
    gbpos = newSwarm.GetPar(parid_gbfit).get_pos_vec();
    MPI_Bcast(gbpos.data(), (int)gbpos.size(), MPI_DOUBLE, cpuid_gbfit, MPI_COMM_WORLD);
} // done with all particles

endtime = MPI_Wtime();
diff = endtime - starttime;
MPI_Reduce(&diff, &sumdiff, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

if (core == 0) {
    avgdiff = sumdiff / numcores;
    cout << "inside time: " << avgdiff << " seconds";
}
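
One thing I've been wondering about (but haven't tested yet) is whether the MPI_Allreduce and the four MPI_Bcast calls inside the per-particle loop are what's limiting the scaling, since each one is a synchronizing collective. Below is only a rough sketch of the restructuring I have in mind, doing the local work first and calling the collectives once after the loop; the names FitRank, local_best and local_best_par are made up for illustration, and the sketch omits the geo-file and contff handling from my real code:
Code:
// Rough sketch only (reuses newSwarm, core, NumP, dim, gbpos, gbfit from above; needs <limits>).
// All per-particle work stays local; one MINLOC reduction and two broadcasts happen
// after the loop instead of a set of collectives per particle.
struct FitRank {
    double fit;   // fitness value
    int    rank;  // MPI rank that produced it (layout matches MPI_DOUBLE_INT)
};

FitRank local_best = { std::numeric_limits<double>::max(), core };
int local_best_par = -1;   // mirrors parid_gbfit above

for (int p = 0; p < NumP; p++) {
    double fit = newSwarm.GetPar(p).eval_fitness(p);   // purely local work
    newSwarm.GetPar(p).set_fitness(fit);
    newSwarm.GetPar(p).set_bfit(fit);
    if (fit < local_best.fit) {
        local_best.fit = fit;
        local_best_par = p;
    }
}

// One reduction for the whole loop: global minimum fitness and the rank that owns it
FitRank global_best;
MPI_Allreduce(&local_best, &global_best, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);

// The winning rank broadcasts which of its particles won and that particle's position
MPI_Bcast(&local_best_par, 1, MPI_INT, global_best.rank, MPI_COMM_WORLD);

gbpos.assign(dim, 0.0);
if (core == global_best.rank) {
    gbpos = newSwarm.GetPar(local_best_par).get_pos_vec();
}
MPI_Bcast(gbpos.data(), (int)gbpos.size(), MPI_DOUBLE, global_best.rank, MPI_COMM_WORLD);
gbfit = global_best.fit;
If that's the right direction, the number of collective calls no longer grows with NumP, so the latency cost should stay roughly constant. But I may be missing something, which is why I'm asking here.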
 
Can you a) post the minimum code that shows this behavior and b) show us how you initialize MPI?
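For (b), what I mean is the boilerplate where the communicator, rank, and process count are set up, something along the lines of the generic sketch below (not your code, just to show what I'm asking for), plus how NumP is derived from the total particle count and the number of ranks:
Code:
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // id of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes

    std::printf("rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}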
 
And what hardware are you running it on?
 
glappkaeft said:
And what hardware are you running it on?
This could be the problem if the hardware doesn't have as many physical cores as you (the OP) are trying to run on.
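One quick way to check (a generic sketch, not specific to your code) is to have every rank print which node it ended up on; if you're using Open MPI, mpirun --report-bindings also shows how ranks were pinned to cores:
Code:
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char name[MPI_MAX_PROCESSOR_NAME];
    int len = 0;
    MPI_Get_processor_name(name, &len);   // typically the node's hostname

    std::printf("rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
If 20 ranks end up on a machine with, say, 10 physical cores plus hyper-threading, a plateau well below 20x would not be surprising.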
 