1. Not finding help here? Sign up for a free 30min tutor trial with Chegg Tutors
    Dismiss Notice
Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

Cycles per instruction question, why different CPI when I change for loop order?

  1. Feb 8, 2007 #1
    Hello everyone, this is a computer science/computer engineering problem...
    I had the following file that multiplies 2 matrices together, with 100 elements and puts it into a 3rd 100 element matrix.

    Well I used Simple Scalar to anlayize the CPI generated by the code and it also records the miss rate on the level 1 cachee.

    But i'm confused on my findings...Here is what I found:
    Code (Text):

    /* This is a program for a trial on super
       scalar simulator , it multiples to matrices */
    #include <stdio.h>

    #define N 100

    void main()
    {
      int a1[N][N], a2[N][N], a3[N][N], i, j, k;
      for(i =0; i<N; i++)
        for(j =0; j<N; j++)
          {
        a1[i][j]= /*1;*/ i * j + 1;
        a2[i][j]= /*1;*/ i + j + 1;
        a3[i][j]= 0;
          }

       printf( "\nThe first matrix !!!\n");

       for(i =0; i<N; i++)
       {
         for(j =0; j<N; j++)
          {
        printf("\t %d", a1[i][j]);
          }
         printf("\n");
       }

       printf( "\nThe second matrix !!!\n");

       for(i =0; i<N; i++)
       {
         for(j =0; j<N; j++)
          {
        printf("\t %d", a2[i][j]);
          }
         printf("\n");
       }

       [b]for(i =0; i<N; i++)
         for(j =0; j<N; j++)
            for(k =0; k<N; k++)
          a3[i][j] += a1[i][k] * a2[k][j];[/b]


       
       printf( "\nThe result matrix !!!\n");

       for(i =0; i<N; i++)
       {
         for(j =0; j<N; j++)
          {
        printf("\t %d", a3[i][j]);
          }
         printf("\n");
       }

    }
    Code (Text):

    Runs:            mxm.c   mxm2.c   mxm3.c
    Loop Order: I,j,k      I,k,j    K,I,j
    Sim_cpi:    0.5832   0.5832     0.5799
    ill.miss_rate    0.0097   0.0097     0.0097
    dl1.miss_rate  0.0047   0.0047     0.0050
    ul2.miss_rate  0.0013   0.0013    0.0012
     
    Okay the Sim_CPI is the #of instructions per cycle
    il1.miss_rate is the instruction level 1 miss rate
    dl1.miss_rate is the data level 1 miss rate
    ul2.miss_rate i'm not quite sure what this is, i'm using a program called Simple Scalar and its not in the documentation :(

    But what i'm confused is, i know the greater the CPI the worse performance. But i'm confused on why the ul2.miss_rate is lower on the k,i,j order but the dl1.miss_rate is higher, but the CPI is still lower out of all of them. I'm also confused on why the I,j,k and the I,K,J are both the same with no change in CPI.

    Any help would be great!

    I was thinking it had to do with the way of ordering the loops has more cache hits than the other. i.e. a cache miss means waiting longer for the data to be fetched from the next level in the memory hierarchy

    But why?
    why the i,j,k and i,k,j are identical while k,i,j is more efficient.

    Thanks!
     
  2. jcsd
  3. Feb 9, 2007 #2

    AlephZero

    User Avatar
    Science Advisor
    Homework Helper

    I don't know anything about Simple Scalar but I do know a bit about optimizing numerical algorithms.

    I would have thought miss rates less than 0.01 were pretty much neglible. The total size of your 3 matrices is only about 120kbytes. If the cache is bigger than that, you are just just measuring the fact that the data has to be loaded into the cache once - that's unavoidable, and not very interesting.

    That doesn't explain why your numbers are slightly different, but from a practical point of view all the numbers are 0 so the small differences don't matter.

    Try a bigger matrix size, e.g. 1000 x 1000, or a smaller cache.
     
  4. Feb 9, 2007 #3
    Thanks for the responce...

    how do you know how big the cache size is? also do you have any idea what ul2 stands for? I was thinking maybe level 2 data cahce?


    our TA is making us use a 100x100, and the Unix labs are closed right now so i can't go down to test a new case, the 100x100 case took about 10 mins each for it to even output any data :(

    So i can imagine a 1000x1000 i would be there awhile.

    So from my results you really can't tell whats going on? The TA wanted us to explain why or why not there was a difference in CPI when you changed the order of the for loops around.

    Frankly I don't see why moving the order to k,i,j would make it CPI less
     
  5. Feb 9, 2007 #4
    I got a helpful hint from the TA, she said:


    So
    for e.g. in i,j,k
    u access a one same element of C, [i,j] invariant on k varies, so its
    always a hit,
    u access i th row of A A[i,k : k goes from 0 to N] all hits
    u access j th column of B[k,j : k goes from 0 to N] all misses BAD

    for e.g. in i,k,j
    you access one same elment of C, [i,k] invariant on j varies, so its always a hit,
    you access ith row of A, A[i,j: j goes from 0 to N] all hits
    you access kth row of B, B[j,k: j goes form 0 to N] all misses.

    As you can see changing the loop order from i,j,k to i,k,j won't change the cache hit or misses.

    But if you change the loop order to k,i,j
    you access on element of C, [k,i] invariant on j varies, so its always a hit, you access kth row of A, A[k,j: j goes from 0 to N] all hits
    you access ith row of B, B[j,i: j goes from 0 to N ] all misses which gives the better CPI.

    After this i'm still not seeing how this last ordering would give the best cpi...is the memory set up some odd way that in this order it would work betteR?
     
  6. Feb 9, 2007 #5

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    er, a1 = A, a2 = B, a3 = C. You're always indexing matrix C with [i, j].
     
  7. Feb 9, 2007 #6

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    I wanted to make a comment so you don't get the wrong idea -- using the L1 cache effectively is usually one of the most important things you can do to make your program run fast.

    When your TA says:
    he (hopefully) means that the L1 cache is too small to hope that a naïve implementation of matrix multiplication can use it effectively, so you should worry about the L2 cache instead.

    If you were really trying to optimize the heck out of this, you would do all sorts of nontrivial things to greatly reduce L1 cache misses.
     
  8. Feb 9, 2007 #7
    Thanks for the reply Hurkyl, i hope the TA thought that too becuase she's a PhD student in this stuff :\

    What I don't get is why if I change the loop order to k,i,j it reduces the L2 cache misses by alittle, so it reduces the CPI by a small amount.

    I understand what she is saying about the hits and misses but I don't see why k,i,j would reduce it, it looks like it would be the same as any of the orderings, your still going to get hits on the rows and misses on the columns
     
  9. Feb 9, 2007 #8

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    You notice you had all the indexing wrong in post #4, right?

    For example, no matter how you order the loops, you are always looking at the [i,j]-th element of the matrix C.

    Have you thought about it again since fixing that error?
     
  10. Feb 10, 2007 #9

    AlephZero

    User Avatar
    Science Advisor
    Homework Helper

    Yes, but when you get your indexing notation sorted out (as Hurkyl said) you have 3 possibilities for accessing the arrays in the inner loop:

    by row
    by column
    access the same element of n times.

    Also, you are reading data from a1 and a2, but you are reading and writing a3.

    Assuming the 100x100 size is a sensible size of test case, you should see some bigger differences between the 6 possible indexing options than the small differences in the 3 results you posted.
     
  11. Feb 12, 2007 #10
    Sorry for the delayed responce! I can't seem to figure out the correct indexing...

    Okay if i change the order of the orginal loops
    i,j,k shown below

    Code (Text):

    for(i =0; i<N; i++)
         for(j =0; j<N; j++)
            for(k =0; k<N; k++)
          a3[i][j] += a1[i][k] * a2[k][j];
     
    to the following:
    Code (Text):

      for(k =0; k<N; k++)
         for(j =0; j<N; j++)
             for(i =0; i<N; i++)
              a3[i][j] += a1[i][k] * a2[k][j];
     
    I think i understand why this one is more efficent...
    becuase after drawing it out, if u change the loop order to k, j, i

    its accessing a1[k] by row... wait one second, its accessing a1 by column, then going down to the next row so shouldn't this be worse CPI? becuase if all ROWS are hits, and all COLS are missess. its going to be missing 100 times in a row, then hitting once, then missing another 100 then hitting once, or am i not understanding something?


    when she said to change the loop order, thats it right? I don't change the indexing also do i? liike i don't touch this line:
    Code (Text):

     a3[i][j] += a1[i][k] * a2[k][j];
     

    EDIT:
    is it more efficent because when u change the loop order to k,j,i
    a3 is getting assigned the values by row, rather than by column?

    and if you had the loop order i,j,k it looks like its assigning it by column which is misses.

    I'm not sure why columns are misses though and rows are hits?

    also did you mean 3 instead of 6 indexing options Alpha?
     
    Last edited: Feb 12, 2007
  12. Feb 12, 2007 #11

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    Incidentally: (yay for google!)
    il1 = instruction level 1 cache
    dl1 = data level 1 cache
    ul2 = unified level 2 cache



    M[x][y] is the element in the x-th row and y-th column of M. As you vary y, you are traversing along a row. As you vary x, you are traversing along a column.



    There are 6 ways to order 3 objects:
    ijk, ikj, jik, jki, kij, kji



    The actual performance of the loops depends heavily on just how your computer's cache is set up.



    Oh, and I think you really ought to be looking at the total misses, not the miss rate. What do those tell you?
     
    Last edited: Feb 12, 2007
  13. Feb 12, 2007 #12
    Thanks! I recored some hits and misses

    Code (Text):

    Runs:   mxm.c   mxm2.c  mxm3.c
    Loop Order: I,j,k   I,k,j   K,I,j
    Sim_cpi:    0.5832  0.5832  0.5799
    ill.miss_rate   0.0097  0.0097  0.0097
    dl1.miss_rate                   0.0047  0.0047  0.0050
    dl1.misses                      135578  135627  144277
    ul2.miss_rate                   0.0013  0.0013  0.0012
    ul2.hits                        1993462 1993511 2125911
     
     
  14. Feb 12, 2007 #13

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    I guess I should add that the results heavily depend on just how the compiler implements your loop, and how the cache is implemented. I wrote a cache simulator, and ikj turns out to be the best loop, slightly better than kij. (but if the cache line is 4 bytes, all the loop orderings are essentially the same)



    I just noticed another problem with the analysis: CPI isn't the number you're interested in. Different orderings of the loop will have different numbers of instructions, so CPI isn't an indicator of the total number of cycles your processor spends on a matrix multiply.


    Oh, here are my results, with a cache line of 4 ints, an L1 cache of 32 lines, and an L2 cache of 512 lines.

    (reads are a multiple of 1 million, writes are a multiple of 1 thousand)

    ijk ordering:
    global memory: 0.255 reads, 2.5 writes.
    L2 cache: 1.25 reads, 10 writes
    L1 cache: 2.01 reads, 10 writes

    ikj ordering:
    global memory: 0.255 reads, 2.5 writes.
    L2 cache: 0.510 reads, 250 writes
    L1 cache: 2.01 reads, 1000 writes

    kij ordering:
    global memory: 0.263 reads, 250 writes.
    L2 cache: 0.510 reads, 250 writes
    L1 cache: 2.01 reads, 1000 writes
     
    Last edited: Feb 13, 2007
  15. Feb 12, 2007 #14
    Interesting,

    thanks for the help, I'll see what her answers are after she grades this assignment. I agree though she should have used a 1000x1000 instead of 100x100.
     
Know someone interested in this topic? Share this thread via Reddit, Google+, Twitter, or Facebook

Have something to add?



Similar Discussions: Cycles per instruction question, why different CPI when I change for loop order?
Loading...