Comparing MKL vs GSL Speed for C++ Codes

  • Context: C/C++
  • Thread starter: TheCanadian
  • Tags: Speed
SUMMARY

This discussion compares the performance of Intel's Math Kernel Library (MKL) and the GNU Scientific Library (GSL) in C++ matrix operations using OpenMPI. The GSL implementation significantly outperforms the MKL version: for matrix sizes of 10000x10000, execution times are approximately 45 seconds for GSL versus several minutes for MKL. The user speculates that the performance discrepancy may stem from differences in how GSL and MKL are designed and integrated into the code, as well as from compiler optimizations. The discussion highlights the importance of memory management and compiler settings in achieving optimal performance.

PREREQUISITES
  • Familiarity with C++ programming
  • Understanding of OpenMPI for parallel processing
  • Knowledge of matrix operations and linear algebra
  • Experience with Intel MKL and GNU GSL libraries
NEXT STEPS
  • Investigate compiler optimization flags for Intel MKL and GSL
  • Learn about memory management techniques in C++
  • Explore performance profiling tools for C++ applications
  • Study the differences between Cblas and GSL BLAS implementations
USEFUL FOR

Developers and researchers working with high-performance computing, particularly those involved in numerical simulations, matrix computations, and parallel processing using C++ and MPI.

TheCanadian
I have two codes in C++ that both work, yet I cannot figure out why one is so much faster than the other. To my knowledge, BLAS with MKL (Intel) should be much faster than GSL (GNU), yet my code shows quite the opposite. Here are the codes themselves. In both, I simply create two matrices at the master node and then send different rows to different "slave" processors (with OpenMPI), which compute the final matrix elements and return them to the master node.

GSL example (the fast code):

Code:
    #include <iostream>
    #include <stdio.h>
    #include <iostream>
    #include <cmath>
    #include <mpi.h>
    #include <gsl/gsl_blas.h>
    using namespace std;
 
    int main(int argc, char** argv){
        int noprocs, nid;
        MPI_Status status;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &nid);
        MPI_Comm_size(MPI_COMM_WORLD, &noprocs);
        int master = 0;
     
        const int nsame = 1000; //must be same if matrices multiplied together = acols = brows
        const int arows = 1000;
        const int bcols = 1000;
        int rowsent;
        double * buff;
        buff = new double [nsame];
        double * b;
        b = new double [nsame*bcols];
   
        double** c = new double*[arows];
        for(int i = 0; i < arows; ++i)
            c[i] = new double[bcols];
    
        double * CC;
        CC = new double [1*bcols]; // one output row of length bcols
        for (int i = 0; i < bcols; i++){
            CC[i] = 0.;
        } // this is simply a 1-d array of zeros which will be updated and passed by processors
        // Master part
        if (nid == master ) {
        
            double** a = new double*[arows];
            for(int i = 0; i < arows; ++i){
                a[i] = new double[nsame];}
        
            for (int i = 0; i < arows; i++){
                for (int j = 0; j < nsame; j++){
                    if (i == j)
                        a[i][j] = 1.;
                    else
                        a[i][j] = 0.;
                }
            }
            for (int i = 0; i < (nsame*bcols); i++){
                b[i] = (10.*i + 3.)/(3.*i - 2.) ;
            }
            MPI_Bcast(b,nsame*bcols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD); // assumes b is stored as a contiguous block of memory
            // send one row to each slave tagged with row number, assume nprocs<nrows
            rowsent=0;
            for (int i=1; i < (noprocs); i++) { //must be equal to noprocs otherwise it will not send to 3
                MPI_Send(a[rowsent], nsame, MPI_DOUBLE_PRECISION,i,rowsent+1,MPI_COMM_WORLD);
                rowsent++;
            }
        
            for (int i=0; i<arows; i++) {
                MPI_Recv(CC, bcols, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &status);
                int sender = status.MPI_SOURCE;
                int anstype = status.MPI_TAG;            //row number+1
                int IND_I = 0;
                while (IND_I < bcols){
                    c[anstype - 1][IND_I] = CC[IND_I];
                    IND_I++;
                }
                if (rowsent < arows) {
                    MPI_Send(a[rowsent], nsame,MPI_DOUBLE_PRECISION,sender,rowsent+1,MPI_COMM_WORLD);
                    rowsent++;
                }
                else {       // tell sender no more work to do via a 0 TAG
                    MPI_Send(MPI_BOTTOM,0,MPI_DOUBLE_PRECISION,sender,0,MPI_COMM_WORLD);
                }
            }
        }
    
        // Slave part
        else {
            MPI_Bcast(b,nsame*bcols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD);
            MPI_Recv(buff,nsame,MPI_DOUBLE_PRECISION,master,MPI_ANY_TAG,MPI_COMM_WORLD,&status);
            while(status.MPI_TAG != 0) {
                int crow = status.MPI_TAG;
                gsl_matrix_view AAAA = gsl_matrix_view_array(buff, 1, nsame);
                gsl_matrix_view BBBB = gsl_matrix_view_array(b, nsame, bcols);
                gsl_matrix_view CCCC = gsl_matrix_view_array(CC, 1, bcols);
            
                /* Compute C = A B */
                gsl_blas_dgemm (CblasNoTrans, CblasNoTrans, 1.0, &AAAA.matrix, &BBBB.matrix,
                                0.0, &CCCC.matrix);
                MPI_Send(CC,bcols,MPI_DOUBLE_PRECISION, master, crow, MPI_COMM_WORLD);
                MPI_Recv(buff,nsame,MPI_DOUBLE_PRECISION,master,MPI_ANY_TAG,MPI_COMM_WORLD,&status);
                //            cout << ans << " OUTPUT \n";
            }
        }
      
        MPI_Finalize();
        return 0;
    };

MKL example (the slow code):

Code:
    #include <iostream>
    #include <stdio.h>
    #include <iostream>
    #include <cmath>
    #include <mpi.h>
    #include </opt/intel/compilers_and_libraries_2017.1.126/mac/mkl/include/mkl.h>
    using namespace std;

    int main(int argc, char** argv){ //THE IDENTITY MATRIX ONLY WORKS IF arows = nsame!
        int noprocs, nid;
        MPI_Status status;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &nid);
        MPI_Comm_size(MPI_COMM_WORLD, &noprocs);
        int master = 0;
       
        const int nsame = 1000;
        const int arows = 1000;
        const int bcols = 1000;
        int rowsent;
        double * buff;
        buff = new double [nsame];
        double * b;
        b = new double [nsame*bcols];
     
        double** c = new double*[arows];
        for(int i = 0; i < arows; ++i)
            c[i] = new double[bcols];
     
        double * CC;
        CC = new double [1*bcols]; // one output row of length bcols
        for (int i = 0; i < bcols; i++){
            CC[i] = 0.;
        }
        // Master part
        if (nid == master ) {
         
            double** a = new double*[arows];
            for(int i = 0; i < arows; ++i){
                a[i] = new double[nsame];}
         
            for (int i = 0; i < arows; i++){
                for (int j = 0; j < nsame; j++){
                    if (i == j)
                        a[i][j] = 1.;
                    else
                        a[i][j] = 0.;
                }
            }
            for (int i = 0; i < (nsame*bcols); i++){
                b[i] = (10.*i + 3.)/(3.*i - 2.) ; // = 1.*i as test value
            }
            MPI_Bcast(b,nsame*bcols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD); // assumes b is stored as a contiguous block of memory
            delete[] b;
            rowsent=0;
            for (int i=1; i < (noprocs); i++) { // one initial row per worker; assumes noprocs-1 <= arows
                MPI_Send(a[rowsent], nsame, MPI_DOUBLE_PRECISION,i,rowsent+1,MPI_COMM_WORLD);
                delete[] a[rowsent];
                rowsent++;
            }
         
            for (int i=0; i<arows; i++) {
                MPI_Recv(CC, bcols, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &status);
                int sender = status.MPI_SOURCE;
                int anstype = status.MPI_TAG;            //row number+1
                int IND_I = 0;
                while (IND_I < bcols){
                    c[anstype - 1][IND_I] = CC[IND_I];
                    IND_I++;
                }
                if (rowsent < arows) {
                    MPI_Send(a[rowsent], nsame,MPI_DOUBLE_PRECISION,sender,rowsent+1,MPI_COMM_WORLD);
                    delete[] a[rowsent];
                    rowsent++;
                }
                else {       // tell sender no more work to do via a 0 TAG
                    MPI_Send(MPI_BOTTOM,0,MPI_DOUBLE_PRECISION,sender,0,MPI_COMM_WORLD);
                }
            }
        }
     
        // Slave part
        else {
            MPI_Bcast(b,nsame*bcols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD);
            MPI_Recv(buff,nsame,MPI_DOUBLE_PRECISION,master,MPI_ANY_TAG,MPI_COMM_WORLD,&status);
            while(status.MPI_TAG != 0) {
                int crow = status.MPI_TAG;
             
                /* Compute C = A B */
                cblas_dgemm (CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, bcols, nsame, 1.0, buff, nsame, b, bcols,
                             0.0, CC, bcols);
               
                MPI_Send(CC,bcols,MPI_DOUBLE_PRECISION, master, crow, MPI_COMM_WORLD);
                MPI_Recv(buff,nsame,MPI_DOUBLE_PRECISION,master,MPI_ANY_TAG,MPI_COMM_WORLD,&status);
            }
        }
       
        MPI_Finalize();
        return 0;
    };

I was thinking it might be because I never delete any of the arrays created with `new`, although I use essentially the same approach to initialize the arrays in both codes. I even tried adding `delete[]` calls to the MKL code (as shown) in hopes of speeding it up, yet this appears to have little effect: the MKL example was already much slower than the GSL example when neither code deleted any arrays, and with or without the deletes it remains slower. When I increase the sizes from `nsame = arows = bcols = 1000` to `nsame = arows = bcols = 10000`, the time difference becomes obvious: the GSL code takes approximately 45 seconds while the MKL code takes quite a few minutes. So I am wondering whether this is simply inherent to the way GSL and MKL are designed and incorporated into my code, or whether something more subtle is going on.
 
Speed (which is not a precise description in itself) can also depend on how the code was compiled and linked. Compiler writers are generally able to take various hardware features into consideration, but the programmer can "break" this by doing things in code that the compiler cannot work around.

This white paper is a little dated as far as hardware is concerned, but the content and theory are not.
https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
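On the linking point specifically: a common cause of slow MKL `dgemm` in MPI codes is linking the threaded MKL layer and letting every rank spawn a full set of MKL threads, oversubscribing the cores. A hedged sketch of the two usual link configurations (flag names are the standard ones for recent MKL releases; the exact line for a given version and platform should be checked against Intel's Link Line Advisor):

```shell
# Sequential MKL: one thread per MPI rank, often the right choice when
# OpenMPI already places one rank per core.
mpicxx -O2 prog.cpp -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm

# Threaded MKL: cap the threads per rank explicitly, otherwise the
# ranks oversubscribe the cores and dgemm can run far slower.
mpicxx -O2 prog.cpp -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm
MKL_NUM_THREADS=1 mpirun -np 4 ./a.out
```

If the GSL build happens to be single-threaded while the MKL build spawns threads per rank, that alone could explain a difference of this magnitude.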
 