# Cycles per instruction question: why different CPI when I change the for-loop order?

Hello everyone, this is a computer science/computer engineering problem...
I have the following file that multiplies two 100x100 matrices together and puts the result into a third 100x100 matrix.

I used SimpleScalar to analyze the CPI generated by the code; it also records the miss rate on the level 1 cache.

But I'm confused by my findings. Here is what I found:
Code:
/* This is a program for a trial on a superscalar
   simulator; it multiplies two matrices */
#include <stdio.h>

#define N 100

int main(void)
{
    int a1[N][N], a2[N][N], a3[N][N], i, j, k;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a1[i][j] = /*1;*/ i * j + 1;
            a2[i][j] = /*1;*/ i + j + 1;
            a3[i][j] = 0;
        }

    printf("\nThe first matrix !!!\n");
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++)
            printf("\t %d", a1[i][j]);
        printf("\n");
    }

    printf("\nThe second matrix !!!\n");
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++)
            printf("\t %d", a2[i][j]);
        printf("\n");
    }

    /* the loop nest whose order gets permuted: */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                a3[i][j] += a1[i][k] * a2[k][j];

    printf("\nThe result matrix !!!\n");
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++)
            printf("\t %d", a3[i][j]);
        printf("\n");
    }

    return 0;
}
Code:
Runs:           mxm.c    mxm2.c   mxm3.c
Loop order:     i,j,k    i,k,j    k,i,j
sim_CPI:        0.5832   0.5832   0.5799
il1.miss_rate:  0.0097   0.0097   0.0097
dl1.miss_rate:  0.0047   0.0047   0.0050
ul2.miss_rate:  0.0013   0.0013   0.0012
Okay: sim_CPI is the number of cycles per instruction,
il1.miss_rate is the instruction level 1 cache miss rate,
dl1.miss_rate is the data level 1 cache miss rate,
ul2.miss_rate I'm not quite sure about; I'm using a program called SimpleScalar and it's not in the documentation :(

What confuses me is this: I know that a higher CPI means worse performance. But why is the ul2.miss_rate lower for the k,i,j order while its dl1.miss_rate is higher, yet its CPI is still the lowest of all of them? I'm also confused about why i,j,k and i,k,j are exactly the same, with no change in CPI.

Any help would be great!

I was thinking it has to do with one loop ordering having more cache hits than the others, i.e. a cache miss means waiting longer for the data to be fetched from the next level of the memory hierarchy.

But why?
Why are i,j,k and i,k,j identical while k,i,j is more efficient?

Thanks!

AlephZero
Homework Helper
I don't know anything about Simple Scalar but I do know a bit about optimizing numerical algorithms.

I would have thought miss rates less than 0.01 were pretty much negligible. The total size of your 3 matrices is only about 120 kbytes. If the cache is bigger than that, you are just measuring the fact that the data has to be loaded into the cache once; that's unavoidable, and not very interesting.

That doesn't explain why your numbers are slightly different, but from a practical point of view all the miss rates are essentially zero, so the small differences don't matter.

Try a bigger matrix size, e.g. 1000 x 1000, or a smaller cache.

Thanks for the response...

How do you know how big the cache size is? Also, do you have any idea what ul2 stands for? I was thinking maybe a level 2 data cache?

Our TA is making us use 100x100, and the Unix labs are closed right now so I can't go down to test a new case; the 100x100 case took about 10 minutes each just to output any data :(

So I can imagine I'd be there a while with a 1000x1000.

So from my results you really can't tell what's going on? The TA wanted us to explain why or why not there was a difference in CPI when you changed the order of the for loops.

Frankly, I don't see why moving to the k,i,j order would make the CPI lower.

I got a helpful hint from the TA, she said:

dl1.miss_rate is not as important as ul2.miss_rate. L1 caches are very small and usually can't hold all the data of the program, so concentrate on ul2.

Next, your observations look okay; now you need to come up with a reason to justify them. For example: "X ordering looks like the best ordering because it will have the maximum number of cache hits." To count cache hits, look at the innermost loop operation, and remember that a matrix is laid out in row-major form.

When you access all the elements of the same row in the innermost loop: cache hits. When you access a column: cache misses.

For example, in i,j,k:
you access one and the same element of C, [i,j], invariant as k varies, so it's always a hit;
you access the i-th row of A, A[i,k : k goes from 0 to N]: all hits;
you access the j-th column of B, B[k,j : k goes from 0 to N]: all misses. BAD.

Similarly, come up with your own orderings and then see whether your observations match. Also, sometimes because of cache associativity, small size, or replacements, the observations might not completely match the analysis; don't worry about that as long as you clearly write up the analysis.

Reetuparna

So,
for example, in i,j,k:
you access one and the same element of C, [i,j], invariant as k varies, so it's always a hit;
you access the i-th row of A, A[i,k : k goes from 0 to N]: all hits;
you access the j-th column of B, B[k,j : k goes from 0 to N]: all misses. BAD.

For example, in i,k,j:
you access one and the same element of C, [i,k], invariant as j varies, so it's always a hit;
you access the i-th row of A, A[i,j : j goes from 0 to N]: all hits;
you access the k-th row of B, B[j,k : j goes from 0 to N]: all misses.

As you can see, changing the loop order from i,j,k to i,k,j won't change the cache hits or misses.

But if you change the loop order to k,i,j:
you access one element of C, [k,i], invariant as j varies, so it's always a hit; you access the k-th row of A, A[k,j : j goes from 0 to N]: all hits;
you access the i-th row of B, B[j,i : j goes from 0 to N]: all misses, which gives the better CPI.

After this I'm still not seeing how this last ordering would give the best CPI... is the memory set up in some odd way such that in this order it would work better?

Hurkyl
Staff Emeritus
Gold Member
er, a1 = A, a2 = B, a3 = C. You're always indexing matrix C with [i, j].

Hurkyl
Staff Emeritus
Gold Member
I wanted to make a comment so you don't get the wrong idea -- using the L1 cache effectively is usually one of the most important things you can do to make your program run fast.

TA said:
L1 caches are very small and usually can't hold all the data of the program, so concentrate on ul2.
he (hopefully) means that the L1 cache is too small to hope that a naïve implementation of matrix multiplication can use it effectively, so you should worry about the L2 cache instead.

If you were really trying to optimize the heck out of this, you would do all sorts of nontrivial things to greatly reduce L1 cache misses.

Thanks for the reply, Hurkyl. I hope the TA thought that too, because she's a PhD student in this stuff :\

What I don't get is why changing the loop order to k,i,j reduces the L2 cache misses by a little, and so reduces the CPI by a small amount.

I understand what she is saying about the hits and misses, but I don't see why k,i,j would reduce them; it looks like it would be the same as any of the orderings, you're still going to get hits on the rows and misses on the columns.

Hurkyl
Staff Emeritus
Gold Member
You notice you had all the indexing wrong in post #4, right?

For example, no matter how you order the loops, you are always looking at the [i,j]-th element of the matrix C.

Have you thought about it again since fixing that error?

AlephZero
Homework Helper
I understand what she is saying about the hits and misses, but I don't see why k,i,j would reduce them; it looks like it would be the same as any of the orderings, you're still going to get hits on the rows and misses on the columns.
Yes, but when you get your indexing notation sorted out (as Hurkyl said), you have 3 possibilities for accessing each array in the inner loop:

by row
by column
accessing the same element N times

Also, you are only reading data from a1 and a2, but you are reading and writing a3.

Assuming 100x100 is a sensible size of test case, you should see some bigger differences between the 6 possible loop orderings than the small differences in the 3 results you posted.

a1 = A, a2 = B, a3 = C. You're always indexing matrix C with [i, j].
Sorry for the delayed response! I can't seem to figure out the correct indexing...

Okay, if I change the order of the original loops,
i,j,k, shown below
Code:
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            a3[i][j] += a1[i][k] * a2[k][j];
to the following:
Code:
for (k = 0; k < N; k++)
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            a3[i][j] += a1[i][k] * a2[k][j];
I think I understand why this one is more efficient...
because after drawing it out, if you change the loop order to k,j,i,

it's accessing a1 by row... wait one second, it's accessing a1 by column, then going down to the next row, so shouldn't this give a worse CPI? Because if all ROWS are hits and all COLUMNS are misses, it's going to miss 100 times in a row, then hit once, then miss another 100, then hit once. Or am I not understanding something?

When she said to change the loop order, that's it, right? I don't change the indexing too, do I? Like, I don't touch this line:
Code:
a3[i][j] += a1[i][k] * a2[k][j];

EDIT:
Is it more efficient because when you change the loop order to k,j,i, a3 is getting assigned values by row rather than by column?

And if you had the loop order i,j,k, it looks like it's assigning by column, which is misses.

I'm not sure why columns are misses, though, and rows are hits?

Also, did you mean 3 instead of 6 indexing options, AlephZero?

Hurkyl
Staff Emeritus
Gold Member
il1 = instruction level 1 cache
dl1 = data level 1 cache
ul2 = unified level 2 cache

M[x][y] is the element in the x-th row and y-th column of M. As you vary y, you are traversing along a row. As you vary x, you are traversing along a column.

There are 6 ways to order 3 objects:
ijk, ikj, jik, jki, kij, kji

The actual performance of the loops depends heavily on just how your computer's cache is set up.

Oh, and I think you really ought to be looking at the total misses, not the miss rate. What do those tell you?

Thanks! I recorded some hits and misses:

Code:
Runs:           mxm.c     mxm2.c    mxm3.c
Loop order:     i,j,k     i,k,j     k,i,j
sim_CPI:        0.5832    0.5832    0.5799
il1.miss_rate:  0.0097    0.0097    0.0097
dl1.miss_rate:  0.0047    0.0047    0.0050
dl1.misses:     135578    135627    144277
ul2.miss_rate:  0.0013    0.0013    0.0012
ul2.hits:       1993462   1993511   2125911

Hurkyl
Staff Emeritus
Gold Member
I guess I should add that the results depend heavily on just how the compiler implements your loop, and how the cache is implemented. I wrote a cache simulator, and ikj turns out to be the best loop order, slightly better than kij. (But if the cache line is 4 bytes, all the loop orderings are essentially the same.)

I just noticed another problem with the analysis: CPI isn't the number you're interested in. Different orderings of the loop will have different numbers of instructions, so CPI isn't an indicator of the total number of cycles your processor spends on a matrix multiply.

Oh, here are my results, with a cache line of 4 ints, an L1 cache of 32 lines, and an L2 cache of 512 lines.

(reads are a multiple of 1 million, writes are a multiple of 1 thousand)

ijk ordering:
global memory: 0.255 reads, 2.5 writes.
L2 cache: 1.25 reads, 10 writes
L1 cache: 2.01 reads, 10 writes

ikj ordering:
global memory: 0.255 reads, 2.5 writes.
L2 cache: 0.510 reads, 250 writes
L1 cache: 2.01 reads, 1000 writes

kij ordering:
global memory: 0.263 reads, 250 writes.
L2 cache: 0.510 reads, 250 writes
L1 cache: 2.01 reads, 1000 writes

Interesting,

thanks for the help, I'll see what her answers are after she grades this assignment. I agree, though, that she should have used 1000x1000 instead of 100x100.