MPI parallel code speed-up sucks

In summary, the conversation discusses an issue with parallelizing C++ code using the MPI library. The observed speed-up is lower than expected, and the problem has been narrowed down to a specific subroutine that loops over a set of particles. The code is shared for further analysis, and the issue may be related to the number of physical cores the hardware actually has. The conversation ends with requests for a minimal example that reproduces the behavior, for details of how MPI was initialized, and for a description of the hardware being used.
  • #1
davidfur
Hey guys,
I parallelized my code, written in C++, with the MPI library. Unfortunately, the speed-up I get saturates at about 6x as the number of cores increases, whereas in theory the speed-up should be nearly linear.

To understand where the bottlenecks come from, I drilled down into the code and isolated one specific subroutine that operates on NumP particles inside a loop. With serial code (or 1 MPI core), this loop runs the full set of commands for all 20 particles. With MPI on 20 cores (i.e. mpirun -np 20 ./program), the loop runs only once on each core (1 particle per core). Thus, the speed-up I expect for this loop is about 20x (maybe a little less because of communication overhead), but the speed-up I actually get is ~6x.

Since I'm really a beginner with MPI, I haven't been able to solve the issue despite trying different things, and I would really appreciate the help of some more experienced gurus.

I attach the relevant piece of code. I apologize for the extra details it contains that are unrelated to the question, but I suspect the issue is basic enough that you could spot it easily (well, I hope).

The relevant piece of code (parallelized loop):
Code:
        // MPI timing variables
        double starttime, endtime, diff, sumdiff, avgdiff;
        starttime = MPI_Wtime();

        // Main loop on NumP particles (NumP = #particles/#numcores)
        // cp geo to geo.parID so each particle works with its own geo file
        for (int p = 0; p < NumP; p++) {
                string parID = std::to_string(p);
                boost::filesystem::copy_file( pwd.string()+"/CPU."+str_core+"/geo",pwd.string()+"/CPU."+str_core+"/geo."+parID );

                Par NewPar;
                newSwarm.AddPar(NewPar);
                if (core == 0) {
                        // If contff == y, then take force field's current values for the position of particle 0 (the rest are just random)
                        if (contff == true){
                                vector <double> ffpos;
                                ffpos.clear();
                                for (int m = 0; m < dim; m++){
                                        ffpos.push_back(stod(newSwarm.GetPar(0).ffieldmat.at(newSwarm.GetPar(0).ffline.at(m)).at(newSwarm.GetPar(0).ffcol.at(m))));
                                };
                                for (int m = 0; m < dim; m++){
                                        newSwarm.GetPar(0).set_pos(ffpos);
                                };
                        };
                        contff = false;
                };
                // Initialize global best fitness and corresponding global best position and personal best
                curfit = newSwarm.GetPar(p).eval_fitness(p);
                newSwarm.GetPar(p).set_fitness(curfit);
                newSwarm.GetPar(p).set_bfit(curfit);
                if (curfit < gbfit){
                        gbfit = curfit;
                };
                // define a struct (pair) to hold the minimum fitness across processes and its rank for particle p
                struct {
                        double tmp_fit;
                        int tmp_cpu;
                } min_vals_in[1], min_vals_out[1];

                min_vals_in[0].tmp_fit = gbfit;    // store current fit on each process
                min_vals_in[0].tmp_cpu = core;      // store core id of that current process

                // get minimum fitness across all processes and the corresponding core id and store them in min_vals_out
                MPI_Allreduce(&min_vals_in, &min_vals_out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);
                // global best fitness across all processes for particle p
                gbfit = min_vals_out[0].tmp_fit;

                // core rank the above fitness came from
                cpuid_gbfit = min_vals_out[0].tmp_cpu;

                // store particle id the above fitness came from in parid_gbfit
                parid_gbfit = p;

                // broadcast the global best fitness data: gbfit, cpuid_gbfit and parid_gbfit
                MPI_Bcast(&cpuid_gbfit, 1, MPI_INT, 0, MPI_COMM_WORLD);
                MPI_Bcast(&parid_gbfit, 1, MPI_INT, 0, MPI_COMM_WORLD);
                MPI_Bcast(&gbfit, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

                gbpos.clear();
                //vector <double> tmp_pos;
                gbpos = newSwarm.GetPar(parid_gbfit).get_pos_vec();
                MPI_Bcast(gbpos.data(), gbpos.size(), MPI_DOUBLE, cpuid_gbfit, MPI_COMM_WORLD);
        }; // done with all particles

        endtime = MPI_Wtime();
        diff = endtime - starttime;
        MPI_Reduce(&diff, &sumdiff, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (core == 0){
                avgdiff = sumdiff/numcores;
                cout << "inside time: " << avgdiff << " seconds";
        };
 
  • #2
Can you a) post the minimum code that shows this behavior and b) show us how you initialize MPI?
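For reference, a typical setup looks something like the sketch below (the names core, numcores and str_core are my guesses based on the snippet in #1, so treat this only as an illustration of the part we need to see):
Code:
#include <mpi.h>
#include <string>

int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int core = 0, numcores = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &core);      // rank of this process
        MPI_Comm_size(MPI_COMM_WORLD, &numcores);  // total number of processes

        std::string str_core = std::to_string(core);

        // ... per-rank work goes here ...

        MPI_Finalize();
        return 0;
}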
 
  • #3
And what hardware are you running it on?
 
  • #4
glappkaeft said:
And what hardware are you running it on?
This could be the problem if the hardware doesn't have as many physical cores as the number of MPI processes you (the OP) are trying to run.
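For example, asking for 20 ranks on a machine with 8 physical cores (16 logical) oversubscribes the cores and will cap the speed-up far below 20x. A quick sanity check (my own sketch, not taken from the code above; note that hardware_concurrency() usually reports logical cores, which may be twice the physical count):
Code:
#include <mpi.h>
#include <thread>
#include <iostream>

int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank == 0) {
                // Compare the number of MPI processes with the logical core count
                std::cout << "MPI processes: " << size
                          << ", logical cores: " << std::thread::hardware_concurrency()
                          << std::endl;
        }
        MPI_Finalize();
        return 0;
}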
 

1. Why is my MPI parallel code running slower than expected?

There could be several reasons for this. One possibility is that your code is not properly structured for parallel processing: you may need to reorganize it or use different algorithms to take better advantage of the parallel architecture. Additionally, if your code does a lot of communication between processes, that communication overhead can dominate the overall run time.
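As a purely hypothetical illustration (not the code from post #1), calling a collective inside a tight loop pays its latency on every iteration, whereas batching the data and communicating once pays it a single time:
Code:
#include <mpi.h>
#include <vector>

// N small broadcasts: the collective's latency is paid on every iteration
void broadcast_per_iteration(std::vector<double>& data) {
        for (std::size_t i = 0; i < data.size(); ++i) {
                MPI_Bcast(&data[i], 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        }
}

// One large broadcast: the same data moves with a single collective call
void broadcast_once(std::vector<double>& data) {
        MPI_Bcast(data.data(), static_cast<int>(data.size()), MPI_DOUBLE, 0, MPI_COMM_WORLD);
}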

2. Can I improve the speed of my MPI parallel code?

Yes, there are several strategies you can use to improve the speed of your code. First, you can try optimizing your code for parallel processing as mentioned before. You can also try using a higher number of processes or using a more efficient parallel algorithm. Additionally, make sure you are using a high-performance computing system that can support the demands of your code.
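One such technique, if your MPI implementation supports MPI-3 non-blocking collectives, is to overlap communication with computation. The sketch below is only an outline; do_independent_work() is a hypothetical placeholder for work that does not depend on the reduction result:
Code:
#include <mpi.h>

double do_independent_work();  // hypothetical work that does not need global_sum

void overlap_example(double local_sum) {
        double global_sum = 0.0;
        MPI_Request req;
        // Start the reduction without blocking
        MPI_Iallreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

        double other = do_independent_work();  // proceeds while the reduction is in flight

        MPI_Wait(&req, MPI_STATUS_IGNORE);     // global_sum is valid only after the wait
        (void)other;
}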

3. How can I measure the efficiency of my MPI parallel code?

The most common measure of efficiency for parallel code is speed-up. This is calculated by comparing the execution time of the parallel code to the execution time of the same code running on a single processor. Ideally, the speed-up should be close to the number of processors used. Other measures of efficiency include scalability and throughput.
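In symbols, with T1 the serial run time and Tp the run time on p processors,
$$S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}.$$
For example, a loop that takes 60 s serially and 10 s on 20 ranks has a speed-up of 6 and an efficiency of 6/20 = 30%, which is the kind of saturation described in post #1.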

4. Is there a limit to how much I can speed up my MPI parallel code?

Yes, there are limitations to how much speed-up can be achieved with parallel processing. The most significant limiting factor is Amdahl's Law, which states that the speed-up of a parallel program is limited by the portion of the code that cannot be parallelized. This means that even with an infinite number of processors, there will always be a limit to the speed-up that can be achieved.
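Quantitatively, if P is the fraction of the run time that can be parallelized and N is the number of processors, Amdahl's Law gives
$$S(N) = \frac{1}{(1 - P) + P/N}, \qquad S(\infty) = \frac{1}{1 - P}.$$
Purely as an illustration, a speed-up that saturates near 6x corresponds under this model to a serial fraction of about 1/6, i.e. roughly 17% of the work not being parallelized.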

5. How can I troubleshoot performance issues with my MPI parallel code?

If your code is not performing as expected, there are several steps you can take to troubleshoot the issue. First, you can use profiling tools to identify which parts of your code are taking the most time to execute. This can help you pinpoint areas that may need optimization. You can also try running your code on a different system or with a different number of processors to see if that improves performance. Additionally, make sure you are using the most up-to-date version of MPI and that your system is properly configured for parallel processing.
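As one simple example (a sketch of a manual timing check, not a full profiler), timing a region on every rank and reducing with MPI_MAX and MPI_MIN shows whether some ranks are waiting on others; a large gap between the two indicates load imbalance rather than raw compute cost:
Code:
#include <mpi.h>
#include <iostream>

void timed_region_example(int rank) {
        double t0 = MPI_Wtime();
        // ... code region to measure ...
        double local = MPI_Wtime() - t0;

        double tmax = 0.0, tmin = 0.0;
        MPI_Reduce(&local, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&local, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);

        if (rank == 0) {
                std::cout << "region time: max " << tmax << " s, min " << tmin << " s" << std::endl;
        }
}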
