Using MKL to Switch Between Single and Double Precision

In summary, the thread discusses how to switch an MKL-based program between single and double precision using preprocessor macros. One commenter notes that because the OP's 128×25000 matrix far exceeds typical cache sizes, switching from doubles to floats should in principle improve performance, but still recommends doubles because the precision loss can be severe; the OP's measurements ultimately show only a modest speedup.
  • #1
swartzism
So I have a header file containing things like

Code:
#ifndef _params_h
#define _params_h

#define REAL double // precision of data
#define ALIGN 64    // byte size of alignment
#define DBG 1       // 1 prints out results
#define READIN 1    // 1 reads in white paper data, 0 randomizes data
#define NVAR 128      // ncols
#define NITER 25000    // nrows
#define MCITER 10000  // Monte Carlo iterations
#define RANDMIN -2
#define RANDMAX 2

#endif

and I am using a lot of MKL calls within my program. If I change REAL to float, I get a bunch of errors about those MKL calls being double and not float, so what I would like to do is something like this

Code:
#ifndef _params_h
#define _params_h

#define REAL double // precision of data
#define ALIGN 64    // byte size of alignment
#define DBG 1       // 1 prints out results
#define READIN 1    // 1 reads in white paper data, 0 randomizes data
#define NVAR 128      // ncols
#define NITER 25000    // nrows
#define MCITER 10000  // Monte Carlo iterations
#define RANDMIN -2
#define RANDMAX 2

#if REAL == double
   #define GEMM cblas_dgemm
   #define TRANS mkl_dimatcopy
   #define CHOL dpotrf
   #define INV LAPACKE_dtrtri
#else if REAL == float
   #define GEMM cblas_sgemm
   #define TRANS mkl_simatcopy
   #define CHOL spotrf
   #define INV LAPACKE_strtri
#endif

#endif

Would this work? Is there a better way to ubiquitously change from single to double precision?

In the meantime, I'm going to go ahead and try to implement this, but I doubt it will work. Ha!

Thanks in advance.
 
  • #2
This may be related to a common math library issue: the functions are typically written with double arguments, so changing the REAL define to float breaks the expected argument types. Floats are less precise, so library developers chose doubles for all arguments and return values.

Here's some discussion on the use of floats vs. doubles that may sway you to stay with doubles:

http://programmers.stackexchange.co...n-do-you-use-float-and-when-do-you-use-double

Here's a related article on pitfalls common to floating-point usage:

http://www.codeproject.com/Articles/29637/Five-Tips-for-Floating-Point-Programming
 
  • #3
A lot of the time I've found double is faster than float. That's probably due to implicit conversions somewhere in the codebase. Theoretically they both execute at the same speed inside the CPU (assumption: modern x86), so execution-time differences come down to data access (negligible if everything is in the L1 cache) or whether you can be more parallel with floats, e.g. you shove two floats into a single register and process both in a single operation.

I'd really recommend using double unless you have profiling data and you really know what you're doing.
 
  • #4
Carno Raar said:
Theoretically they both execute at the same speed inside the CPU (assumption: modern x86), so execution-time differences come down to data access (negligible if everything is in the L1 cache)
In this case, the OP has a 128×25000 matrix. On my computer, which has 6 MB of L3 cache, that matrix won't fit in L3 cache, let alone L1 cache. Given all the cache misses that will inevitably occur, changing from doubles to floats should result in significantly improved performance.

I'd really recommend using double unless you have profiling data and you really know what you're doing.
I can't disagree with you here. Switching from doubles to floats sometimes results in complete precision loss, with zero significant digits in the result; a mildly ill-conditioned matrix is all it takes.
 
  • #5
swartzism said:
So I have a header file containing things like
Code:
#define MCITER 10000  // Monte Carlo iterations
Are you turning off your browser, your email, your chat, *everything*, and spreading those Monte Carlo iterations over every available CPU, or are you running single threaded? If you have a decent laptop, a seven-fold increase in performance is possible. Double that again with a decent desktop machine. And if you have a blade server available, trying to bring it to its knees is always fun.
 
  • #6
D H said:
In this case, the OP has a 128×25000 matrix. On my computer, which has 6 MB of L3 cache, that matrix won't fit in L3 cache, let alone L1 cache. Given all the cache misses that will inevitably occur, changing from doubles to floats should result in significantly improved performance.

Yes, those numbers are probably the array bounds.

You would be entirely correct if the data were being manipulated in a single block. I didn't give numbers because I have no way of determining this. If the OP is processing individual rows, only 128 values need to be stored at a time.
 
  • #7
D H said:
In this case, the OP has a 128×25000 matrix. On my computer, which has 6 MB of L3 cache, that matrix won't fit in L3 cache, let alone L1 cache. Given all the cache misses that will inevitably occur, changing from doubles to floats should result in significantly improved performance.

Another difference this could make: if you (or your Sufficiently Smart Compiler™) can make effective use of SIMD, you can pack twice as many floats as doubles into a register and operate on them simultaneously (e.g. add four pairs of floats instead of two pairs of doubles with a single CPU instruction). (I haven't tested how much of a difference this makes.)
 
  • #8
Sorry for the late response; I was out of town for the 4th. My intent here is first to compare the precision of the answers and second to investigate speed increases due to exactly what you guys are talking about with fitting more data into cache. I don't believe my Monte Carlo loop can be effectively parallelized, since there is a dependence between the (i+1)th and (i)th iterations, and I fear the cost of communication would cancel out the benefit of parallelization. That's just something to test as well. The internal computations parallelize nicely, though.

My exact code listed in the first post works for the double case, but does not work for the float case. I get errors of the following type:

Code:
main.c(299): warning #167: argument of type "float *" is incompatible with parameter of type "double *"
     MKL_TRANS('R', 'T', m_dim, n_dim, 1.0, r_data, n_dim, m_dim);
                                            ^

main.c(342): warning #167: argument of type "float *" is incompatible with parameter of type "double *"
     MKL_CHOL(upper, &n_dim, p_data, &n_dim, &info); // Cholesky factorization of C*
                             ^

main.c(425): warning #167: argument of type "float *" is incompatible with parameter of type "double *"
        MKL_CHOL(upper, &n_dim, t_data, &n_dim, &info); // Cholesky factorization of T, stored in T
                                ^

main.c(461): warning #167: argument of type "float *" is incompatible with parameter of type "double *"
        MKL_INV(CblasRowMajor, 'L', 'N', n_dim, t_data, n_dim);
                                                ^

main.c(465): warning #167: argument of type "float *" is incompatible with parameter of type "const double *"
        MKL_GEMM(CblasRowMajor, CblasNoTrans, CblasNoTrans, n_dim, n_dim, n_dim, 1.0, p_data, n_dim, t_data, n_dim, 0.0, s_data, n_dim);
                                                                                      ^

main.c(465): warning #167: argument of type "float *" is incompatible with parameter of type "const double *"
        MKL_GEMM(CblasRowMajor, CblasNoTrans, CblasNoTrans, n_dim, n_dim, n_dim, 1.0, p_data, n_dim, t_data, n_dim, 0.0, s_data, n_dim);
                                                                                                     ^

main.c(465): warning #167: argument of type "float *" is incompatible with parameter of type "double *"
        MKL_GEMM(CblasRowMajor, CblasNoTrans, CblasNoTrans, n_dim, n_dim, n_dim, 1.0, p_data, n_dim, t_data, n_dim, 0.0, s_data, n_dim);
                                                                                                                         ^

main.c(477): warning #167: argument of type "float *" is incompatible with parameter of type "const double *"
        MKL_GEMM(CblasRowMajor, CblasNoTrans, CblasNoTrans, n_dim, m_dim, n_dim, 1.0, s_data, n_dim, r_data, m_dim, 1.0, rstarb_data, m_dim); // For (RS')'=SR'
                                                                                      ^

main.c(477): warning #167: argument of type "float *" is incompatible with parameter of type "const double *"
        MKL_GEMM(CblasRowMajor, CblasNoTrans, CblasNoTrans, n_dim, m_dim, n_dim, 1.0, s_data, n_dim, r_data, m_dim, 1.0, rstarb_data, m_dim); // For (RS')'=SR'
                                                                                                     ^

main.c(477): warning #167: argument of type "float *" is incompatible with parameter of type "double *"
        MKL_GEMM(CblasRowMajor, CblasNoTrans, CblasNoTrans, n_dim, m_dim, n_dim, 1.0, s_data, n_dim, r_data, m_dim, 1.0, rstarb_data, m_dim); // For (RS')'=SR'

I believe this may be due to the issue jedishrfu brought up.
 
  • #9
Try:
C:
#define USE_FLOAT        // Comment out this line to use double.

// ...

#ifndef USE_FLOAT
    #define REAL double
    #define GEMM cblas_dgemm
    #define TRANS mkl_dimatcopy
    #define CHOL dpotrf
    #define INV LAPACKE_dtrtri
#else
    #define REAL float
    #define GEMM cblas_sgemm
    #define TRANS mkl_simatcopy
    #define CHOL spotrf
    #define INV LAPACKE_strtri
#endif
I did some Googling to try to understand why the attempt in the OP wasn't working (and specifically, why it was failing in the weird way it was) and came across this. It seems that #if can only understand tests involving integers and integer arithmetic. If you do #define REAL float then my understanding is that the following things happen:
  1. The macro REAL in the directive #if REAL == double gets expanded to float, so the test becomes float == double.
  2. float and double aren't macros (or integer or character literals), so they're interpreted as identifiers.
  3. There's a rule saying that identifiers are treated as zero, so the test is treated as 0 == 0, which is 1 (true).
So it's the macros under #if REAL == double that are defined, even if you #define REAL float.

(Incidentally, there's no #else if directive. Use #elif for that.)
 
  • #10
wle,

Making the modifications you suggested ended up working. Thanks for the help! Now, I'd like to compare the runtimes, but my timing is just printing out zeros for the float case. My timer is

Code:
#include <time.h>
#include <sys/time.h>
#include <assert.h>
#include <sys/mman.h>
#include "params.h"

REAL dtime()
{
   REAL tseconds = 0.0;
   struct timeval mytime;
   gettimeofday(&mytime, (struct timezone *) 0);
   tseconds = (REAL)(mytime.tv_sec + (REAL)mytime.tv_usec*1.0e-6);
   return(tseconds);
}

The parameters header file now looks like

Code:
#ifndef _params_h
#define _params_h

#define USE_FLOAT
#define ALIGN 64    // byte size of alignment
#define DBG 0       // 1 prints out results
#define READIN 0    // 1 reads in white paper data, 0 randomizes data
#define NVAR 128      // ncols
#define NITER 25000    // nrows
#define MCITER 1  // Monte Carlo iterations
#define RANDMIN -2
#define RANDMAX 2

#ifndef USE_FLOAT
   #define LHP 0
   #define REAL double
   #define MKL_GEMM cblas_dgemm
   #define MKL_TRANS mkl_dimatcopy
   #define MKL_CHOL dpotrf
   #define MKL_INV LAPACKE_dtrtri
#else
   #define LHP 1
   #define REAL float
   #define MKL_GEMM cblas_sgemm
   #define MKL_TRANS mkl_simatcopy
   #define MKL_CHOL spotrf
   #define MKL_INV LAPACKE_strtri
#endif

#endif

and I'm printing from main with

Code:
   ...
   tstart = dtime();
   ...
   tfinish = dtime();

   printf("\nTotal run time:\nNumber of Monte Carlo iterations: %d\nNumber of iterations: %d\nNumber of variables: %d\nTotal time: %f (s)\n\n", MCITER, m_dim, n_dim, tfinish-tstart);
   printf("\nRun time of step 2: %f (s)\n", t2time);
   printf("\nRun time of step 3: %f (s)\n", t3time);
   printf("\nRun time of step 4a: %f (s)\n", t4atime);
   printf("\nRun time of step 4b: %f (s)\n", t4btime);
   printf("\nRun time of step 5: %f (s)\n", t5time);

Any ideas why I would be getting 0.00000000 for all of my timings?
 
  • #11
I had a quick look, then I saw `gettimeofday`.

You should almost certainly be using a higher-resolution, monotonic timer for this. `gettimeofday` reports wall-clock time, so it can jump forwards or backwards when the system clock is adjusted.

I can't remember C's standard library exactly, but I think you want `clock_gettime(CLOCK_MONOTONIC, ...)` from `time.h` for timing programs. It has nanosecond resolution and, importantly, monotonic clocks only ever tick upwards regardless of clock adjustments. I'm rusty with C, though, so check the library docs for the correct function on your platform. In C++ you would use `<chrono>`.
 
  • #13
Turns out that double versus single precision in this case does not make much of a difference.

Code:
Double:                                      Single:
Total run time: 12.210 (s)                      Total run time: 11.240 (s)
   Number of Monte Carlo iterations: 1             Number of Monte Carlo iterations: 1
   Number of iterations: 25000                   Number of iterations: 25000
   Number of variables: 128                      Number of variables: 128
   Data precision: double                         Data precision: single
   
Total run time: 112.20 (s)                      Total run time: 108.770 (s)
   Number of Monte Carlo iterations: 10          Number of Monte Carlo iterations: 10
   Number of iterations: 25000                   Number of iterations: 25000
   Number of variables: 128                      Number of variables: 128
   Data precision: double                         Data precision: single
   
Total run time: 1113.580 (s)                   Total run time: 1099.50 (s)
   Number of Monte Carlo iterations: 100          Number of Monte Carlo iterations: 100
   Number of iterations: 25000                   Number of iterations: 25000
   Number of variables: 128                      Number of variables: 128
   Data precision: double                         Data precision: single
 
  • #14
In my opinion, that is a rather insignificant performance improvement that comes at the expense of a very significant loss of precision.
 
  • #15
D H said:
In my opinion, that is a rather insignificant performance improvement that comes at the expense of a very significant loss of precision.

I agree. I was just interested in the experiment.
 
  • #16
You can/should look into compiler options.
For a course on Monte Carlo simulations I looked into this for a bit and got astonishing results.

If I'm not mistaken I even got a 10-fold decrease in runtime for a certain simulation.

Edit:
I looked into my files because I remember writing a readme for my partner.
I used the following command to compile my simulation (hard disks fluid in the plane)
Code:
gcc Disks.c -o Disks -lm -m64 -Ofast -flto -march=native -funroll-loops

-m64 targets a 64-bit system.
-Ofast is an aggressive optimisation setting (among other things it enables -ffast-math, which can change floating-point results, so check your answers).

You should find out what these flags do and test what they do for you
 
  • #17
JorisL said:
You can/should look into compiler options.
For a course on Monte Carlo simulations I looked into this for a bit and got astonishing results.

If I'm not mistaken I even got a 10-fold decrease in runtime for a certain simulation.

The current compiler options I am using are

Code:
icc -c -std=c99 -O3 -opt-streaming-stores auto -qopt-report=3 -opt-multi-version-aggressive -opt-assume-safe-padding -mkl

Do you have any examples of the compiler options you saw increase performance?
 
  • #18
I edited my post.
I must say I'm not exactly sure what gave the best results.
You have to try and find out what works best on your machine since it can be very machine-dependent.
 
  • #19
I would add -DNDEBUG just for good measure.
 
  • #20
newjerseyrunner said:
I would add -DNDEBUG just for good measure.

What does NDEBUG do? I just read that it strips out asserts during preprocessing, but I don't use any.
 
  • #21
swartzism said:
What does NDEBUG do? I just read that it preprocesses asserts, but I don't use any.
You may not, but some of the headers or libraries you include may. It's not a huge optimization, but it may remove some unnecessary code.
 

