Comparing MKL vs GSL Speed for C++ Codes

  • Thread starter TheCanadian
  • Tags: Speed
  • #1
TheCanadian
I have two codes in C++ that both work, yet I cannot figure out why one is so much faster than the other. To my knowledge, BLAS with MKL (Intel) should be much faster than GSL (GNU), although my code is showing quite the opposite. Here are the codes themselves: I simply create two matrices at the master node and then send different rows to different "slave" processes (with OpenMPI), which compute the corresponding rows of the final matrix and return them to the master node.

GSL example (the fast code):

Code:
    #include <iostream>
    #include <stdio.h>
    #include <cmath>
    #include <mpi.h>
    #include <gsl/gsl_blas.h>
    using namespace std;
 
    int main(int argc, char** argv){
        int noprocs, nid;
        MPI_Status status;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &nid);
        MPI_Comm_size(MPI_COMM_WORLD, &noprocs);
        int master = 0;
     
        const int nsame = 1000; //must be same if matrices multiplied together = acols = brows
        const int arows = 1000;
        const int bcols = 1000;
        int rowsent;
        double * buff;
        buff = new double [nsame];
        double * b;
        b = new double [nsame*bcols];
   
        double** c = new double*[arows];
        for(int i = 0; i < arows; ++i)
            c[i] = new double[bcols];
    
        double * CC;
        CC = new double [1*bcols]; //here ncols corresponds to numbers of rows for matrix b
        for (int i = 0; i < bcols; i++){
            CC[i] = 0.;
        }; //this is simply a 1-d array of zeros which will be updated and passed by the processors
        // Master part
        if (nid == master ) {
        
            double** a = new double*[arows];
            for(int i = 0; i < arows; ++i){
                a[i] = new double[nsame];}
        
            for (int i = 0; i < arows; i++){
                for (int j = 0; j < nsame; j++){
                    if (i == j)
                        a[i][j] = 1.;
                    else
                        a[i][j] = 0.;
                }
            }
            for (int i = 0; i < (nsame*bcols); i++){
                b[i] = (10.*i + 3.)/(3.*i - 2.);
            }
            MPI_Bcast(b, nsame*bcols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD); //assumes b is stored as a contiguous block of memory
            // send one row to each slave tagged with row number, assume nprocs<nrows
            rowsent=0;
            for (int i=1; i < (noprocs); i++) { //must be equal to noprocs otherwise it will not send to 3
                MPI_Send(a[rowsent], nsame, MPI_DOUBLE_PRECISION,i,rowsent+1,MPI_COMM_WORLD);
                rowsent++;
            }
        
            for (int i=0; i<arows; i++) {
                MPI_Recv(CC, bcols, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &status);
                int sender = status.MPI_SOURCE;
                int anstype = status.MPI_TAG;            //row number+1
                int IND_I = 0;
                while (IND_I < bcols){
                    c[anstype - 1][IND_I] = CC[IND_I];
                    IND_I++;
                }
                if (rowsent < arows) {
                    MPI_Send(a[rowsent], nsame,MPI_DOUBLE_PRECISION,sender,rowsent+1,MPI_COMM_WORLD);
                    rowsent++;
                }
                else {       // tell sender no more work to do via a 0 TAG
                    MPI_Send(MPI_BOTTOM,0,MPI_DOUBLE_PRECISION,sender,0,MPI_COMM_WORLD);
                }
            }
        }
    
        // Slave part
        else {
            MPI_Bcast(b,nsame*bcols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD);
            MPI_Recv(buff,nsame,MPI_DOUBLE_PRECISION,master,MPI_ANY_TAG,MPI_COMM_WORLD,&status);
            while(status.MPI_TAG != 0) {
                int crow = status.MPI_TAG;
                gsl_matrix_view AAAA = gsl_matrix_view_array(buff, 1, nsame);
                gsl_matrix_view BBBB = gsl_matrix_view_array(b, nsame, bcols);
                gsl_matrix_view CCCC = gsl_matrix_view_array(CC, 1, bcols);
            
                /* Compute C = A B */
                gsl_blas_dgemm (CblasNoTrans, CblasNoTrans, 1.0, &AAAA.matrix, &BBBB.matrix,
                                0.0, &CCCC.matrix);
                MPI_Send(CC,bcols,MPI_DOUBLE_PRECISION, master, crow, MPI_COMM_WORLD);
                MPI_Recv(buff,nsame,MPI_DOUBLE_PRECISION,master,MPI_ANY_TAG,MPI_COMM_WORLD,&status);
                //            cout << ans << " OUTPUT \n";
            }
        }
      
        MPI_Finalize();
        return 0;
    };

MKL example (the slow code):

Code:
    #include <iostream>
    #include <stdio.h>
    #include <cmath>
    #include <mpi.h>
    #include </opt/intel/compilers_and_libraries_2017.1.126/mac/mkl/include/mkl.h>
    using namespace std;

    int main(int argc, char** argv){ //THE IDENTITY MATRIX ONLY WORKS IF arows = nsame!
        int noprocs, nid;
        MPI_Status status;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &nid);
        MPI_Comm_size(MPI_COMM_WORLD, &noprocs);
        int master = 0;
       
        const int nsame = 1000;
        const int arows = 1000;
        const int bcols = 1000;
        int rowsent;
        double * buff;
        buff = new double [nsame];
        double * b;
        b = new double [nsame*bcols];
     
        double** c = new double*[arows];
        for(int i = 0; i < arows; ++i)
            c[i] = new double[bcols];
     
        double * CC;
        CC = new double [1*bcols];
        for (int i = 0; i < bcols; i++){
            CC[i] = 0.;
        };
        // Master part
        if (nid == master ) {
         
            double** a = new double*[arows];
            for(int i = 0; i < arows; ++i){
                a[i] = new double[nsame];}
         
            for (int i = 0; i < arows; i++){
                for (int j = 0; j < nsame; j++){
                    if (i == j)
                        a[i][j] = 1.;
                    else
                        a[i][j] = 0.;
                }
            }
            for (int i = 0; i < (nsame*bcols); i++){
                b[i] = (10.*i + 3.)/(3.*i - 2.); // = 1.*i as test value
            }
            MPI_Bcast(b, nsame*bcols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD); //assumes b is stored as a contiguous block of memory; assume nprocs<nrows
            delete[] b;
            rowsent=0;
            for (int i=1; i < (noprocs); i++) { //must be equal to noprocs otherwise it will not send to 3
                MPI_Send(a[rowsent], nsame, MPI_DOUBLE_PRECISION,i,rowsent+1,MPI_COMM_WORLD);
                delete[] a[rowsent];
                rowsent++;
            }
         
            for (int i=0; i<arows; i++) {
                MPI_Recv(CC, bcols, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &status);
                int sender = status.MPI_SOURCE;
                int anstype = status.MPI_TAG;            //row number+1
                int IND_I = 0;
                while (IND_I < bcols){
                    c[anstype - 1][IND_I] = CC[IND_I];
                    IND_I++;
                }
                if (rowsent < arows) {
                    MPI_Send(a[rowsent], nsame,MPI_DOUBLE_PRECISION,sender,rowsent+1,MPI_COMM_WORLD);
                    delete[] a[rowsent];
                    rowsent++;
                }
                else {       // tell sender no more work to do via a 0 TAG
                    MPI_Send(MPI_BOTTOM,0,MPI_DOUBLE_PRECISION,sender,0,MPI_COMM_WORLD);
                }
            }
        }
     
        // Slave part
        else {
            MPI_Bcast(b,nsame*bcols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD);
            MPI_Recv(buff,nsame,MPI_DOUBLE_PRECISION,master,MPI_ANY_TAG,MPI_COMM_WORLD,&status);
            while(status.MPI_TAG != 0) {
                int crow = status.MPI_TAG;
             
                /* Compute C = A B */
                cblas_dgemm (CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, bcols, nsame, 1.0, buff, nsame, b, bcols,
                             0.0, CC, bcols);
               
                MPI_Send(CC,bcols,MPI_DOUBLE_PRECISION, master, crow, MPI_COMM_WORLD);
                MPI_Recv(buff,nsame,MPI_DOUBLE_PRECISION,master,MPI_ANY_TAG,MPI_COMM_WORLD,&status);
            }
        }
       
        MPI_Finalize();
        return 0;
    };

I was thinking it might be due to me not deleting any of the elements created with `new`, although I use essentially the same approach to initialize the arrays in both codes. I even tried deleting arrays in the MKL code (as shown), yet this appears to have little effect. To clarify, I added the `delete` statements to the MKL example in hopes of speeding it up, since it was already much slower than the GSL example even when neither code deleted any arrays; with or without those deletes, the MKL example is still slower. When I increase the size of the arrays from `nsame = arows = bcols = 1000` to `nsame = arows = bcols = 10000`, the time difference becomes obvious: the GSL code takes approximately 45 seconds while the MKL code takes several minutes. I am therefore wondering whether this is simply inherent to the way GSL and MKL are designed and incorporated into my code, or whether something more subtle is going on.
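
One way to narrow down where the time actually goes would be to time the dgemm call separately from the MPI traffic in each worker's loop. Below is a minimal, hedged sketch of the MKL worker loop with MPI_Wtime() timers added; all the variable names (buff, b, CC, nsame, bcols, master, nid, status) are the ones from the code above, and only the timing is new.

Code:
    // Hedged sketch: the MKL worker loop from above, with MPI_Wtime() timers
    // added to separate time spent inside dgemm from time spent on MPI traffic.
    double t_gemm = 0., t_comm = 0.;
    while (status.MPI_TAG != 0) {
        int crow = status.MPI_TAG;

        double t0 = MPI_Wtime();
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, bcols, nsame,
                    1.0, buff, nsame, b, bcols, 0.0, CC, bcols);
        double t1 = MPI_Wtime();

        MPI_Send(CC, bcols, MPI_DOUBLE_PRECISION, master, crow, MPI_COMM_WORLD);
        MPI_Recv(buff, nsame, MPI_DOUBLE_PRECISION, master, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        double t2 = MPI_Wtime();

        t_gemm += t1 - t0;   // time inside the BLAS call
        t_comm += t2 - t1;   // time waiting on the master
    }
    cout << "rank " << nid << ": dgemm " << t_gemm << " s, MPI " << t_comm << " s\n";

If the dgemm time dominates only in the MKL version, the difference really lies in how the two libraries handle this one-row-times-matrix shape; if the MPI time dominates, the libraries are not the bottleneck.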
 
  • #2
Speed (which is not a very precise description) can also depend on how the code was compiled and linked. Compiler writers are generally able to take various hardware features into consideration, but the programmer can "break" this by doing things in the code that the compiler cannot work around.

This white paper is a little dated as far as the hardware is concerned, but the content and the theory are not:
https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
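
As one concrete (and hedged) example of the linking point: if the threaded MKL library happens to be the one linked, every MPI rank may spawn its own team of BLAS threads for each dgemm call, oversubscribing the cores; linking the sequential MKL library, or forcing one thread per rank, avoids that. Whether this applies to the code above is an assumption, but it is easy to check:

Code:
    #include <cstdio>
    #include <mpi.h>
    #include <mkl.h>

    // Hedged check, not a fix: print how many threads MKL intends to use on
    // each MPI rank, then pin it to one thread per rank. mkl_get_max_threads()
    // and mkl_set_num_threads() are standard MKL service routines.
    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        printf("rank %d: MKL will use up to %d threads per BLAS call\n",
               rank, mkl_get_max_threads());

        mkl_set_num_threads(1); // one BLAS thread per MPI rank

        // ... the matrix-multiplication work would go here ...

        MPI_Finalize();
        return 0;
    }

With the sequential MKL libraries linked instead of the threaded ones, the mkl_set_num_threads call is unnecessary.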
 

1. What are MKL and GSL?

MKL (Intel's Math Kernel Library) and GSL (the GNU Scientific Library) are software libraries that provide collections of high-performance mathematical and statistical routines for scientific computing in C and C++.

2. How do MKL and GSL differ in terms of speed?

MKL generally has faster execution times than GSL, particularly for dense linear algebra, because MKL is heavily optimized for Intel processors while GSL is a portable, general-purpose library. In addition, MKL uses multithreading and vectorization to further improve performance.
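
As a rough illustration of those techniques, the sketch below runs a plain MKL dgemm on buffers allocated with MKL's own aligned allocator (mkl_malloc/mkl_free); the sizes and fill values are arbitrary placeholders, not anything taken from the thread.

Code:
    #include <mkl.h>

    // Hedged sketch of an MKL dgemm on 64-byte-aligned buffers, the kind of
    // setup MKL's vectorized (and, by default, multithreaded) kernels are
    // tuned for. Sizes and fill values are arbitrary examples.
    int main() {
        const int n = 1000;
        double* A = (double*) mkl_malloc((size_t)n * n * sizeof(double), 64);
        double* B = (double*) mkl_malloc((size_t)n * n * sizeof(double), 64);
        double* C = (double*) mkl_malloc((size_t)n * n * sizeof(double), 64);

        for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        // C = A * B, all n x n, row-major
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);

        mkl_free(A); mkl_free(B); mkl_free(C);
        return 0;
    }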

3. Which library is better for a specific type of code?

It ultimately depends on the type of code and the specific functions being used. If the code is heavily reliant on linear algebra operations, MKL may be a better choice. On the other hand, GSL has a wider range of functions for various mathematical and statistical operations.

4. Are there any notable differences in terms of functionality?

While both libraries offer similar functions, there are some differences in their implementations. For example, MKL ships its own highly tuned BLAS (Basic Linear Algebra Subprograms) implementation, while GSL provides an open-source reference CBLAS. This may result in small floating-point differences in the results of certain operations.
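
Both libraries also expose the standard CBLAS interface, so the same call can be built against either one; only the header and the link line differ. A minimal, hedged sketch (the USE_MKL macro is just an illustrative build switch, not something from the thread):

Code:
    // Hedged sketch: one cblas_dgemm call that can be linked against either
    // backend, since both MKL (mkl.h) and GSL (gsl/gsl_cblas.h, -lgslcblas)
    // declare it. USE_MKL is an illustrative compile-time switch.
    #ifdef USE_MKL
    #include <mkl.h>
    #else
    #include <gsl/gsl_cblas.h>
    #endif

    int main() {
        const int n = 3;
        double A[9] = {1, 0, 0,  0, 1, 0,  0, 0, 1}; // identity
        double B[9] = {1, 2, 3,  4, 5, 6,  7, 8, 9};
        double C[9] = {0};

        // C = A * B; with A = I, C should equal B with either backend
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        return 0;
    }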

5. Can MKL and GSL be used together?

Yes, MKL and GSL can be used in combination. In fact, some developers use MKL for its optimized linear algebra routines and GSL for its other mathematical and statistical functions. However, it is important to watch for link-time conflicts (both provide a CBLAS implementation) and for small discrepancies in results when using the two libraries together.
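
A hedged sketch of what that mixed usage might look like: GSL supplies a special function, MKL supplies the BLAS. The particular routines (gsl_sf_bessel_J0 and cblas_dgemm) and the sizes are arbitrary illustrative choices; at link time only one CBLAS implementation, here MKL's, should be pulled in.

Code:
    #include <cstdio>
    #include <gsl/gsl_sf_bessel.h>  // GSL special functions
    #include <mkl.h>                // MKL BLAS

    // Hedged sketch of mixing the two libraries: GSL fills a small matrix with
    // Bessel-function values and MKL multiplies it. Link against MKL's BLAS
    // rather than -lgslcblas to avoid duplicate cblas_* symbols.
    int main() {
        const int n = 4;
        double A[16], B[16], C[16];
        for (int i = 0; i < n * n; ++i) {
            A[i] = gsl_sf_bessel_J0(0.1 * i); // GSL
            B[i] = 1.0;
            C[i] = 0.0;
        }

        // C = A * B with MKL's dgemm
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);

        printf("C[0][0] = %f\n", C[0]);
        return 0;
    }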
