- #1

- 1,629

- 1

I had the following file that multiplies 2 matrices together, with 100 elements and puts it into a 3rd 100 element matrix.

Well I used Simple Scalar to anlayize the CPI generated by the code and it also records the miss rate on the level 1 cachee.

But i'm confused on my findings...Here is what I found:

Code:

```
/* This is a program for a trial on super
scalar simulator , it multiples to matrices */
#include <stdio.h>
#define N 100
void main()
{
int a1[N][N], a2[N][N], a3[N][N], i, j, k;
for(i =0; i<N; i++)
for(j =0; j<N; j++)
{
a1[i][j]= /*1;*/ i * j + 1;
a2[i][j]= /*1;*/ i + j + 1;
a3[i][j]= 0;
}
printf( "\nThe first matrix !!!\n");
for(i =0; i<N; i++)
{
for(j =0; j<N; j++)
{
printf("\t %d", a1[i][j]);
}
printf("\n");
}
printf( "\nThe second matrix !!!\n");
for(i =0; i<N; i++)
{
for(j =0; j<N; j++)
{
printf("\t %d", a2[i][j]);
}
printf("\n");
}
[b]for(i =0; i<N; i++)
for(j =0; j<N; j++)
for(k =0; k<N; k++)
a3[i][j] += a1[i][k] * a2[k][j];[/b]
printf( "\nThe result matrix !!!\n");
for(i =0; i<N; i++)
{
for(j =0; j<N; j++)
{
printf("\t %d", a3[i][j]);
}
printf("\n");
}
}
```

Code:

```
Runs: mxm.c mxm2.c mxm3.c
Loop Order: I,j,k I,k,j K,I,j
Sim_cpi: 0.5832 0.5832 0.5799
ill.miss_rate 0.0097 0.0097 0.0097
dl1.miss_rate 0.0047 0.0047 0.0050
ul2.miss_rate 0.0013 0.0013 0.0012
```

il1.miss_rate is the instruction level 1 miss rate

dl1.miss_rate is the data level 1 miss rate

ul2.miss_rate i'm not quite sure what this is, i'm using a program called Simple Scalar and its not in the documentation :(

But what i'm confused is, i know the greater the CPI the worse performance. But i'm confused on why the ul2.miss_rate is lower on the k,i,j order but the dl1.miss_rate is higher, but the CPI is still lower out of all of them. I'm also confused on why the I,j,k and the I,K,J are both the same with no change in CPI.

Any help would be great!

I was thinking it had to do with the way of ordering the loops has more cache hits than the other. i.e. a cache miss means waiting longer for the data to be fetched from the next level in the memory hierarchy

But why?

why the i,j,k and i,k,j are identical while k,i,j is more efficient.

Thanks!