How many cache misses occur when performing matrix multiplication?

  • Thread starter SpiffyEh
  • Start date
In summary, the blocked version of matrix-matrix multiplication using ijk will have 6 cache misses for every element in the matrix.
  • #1
SpiffyEh
194
0

Homework Statement


Suppose your data cache has 30 lines and each line can hold 10 doubles. You are performing a matrix-matrix multiplication (C=C+A*B) with square matrices of size 1000 and 10 respectively. Assume data caches are only used to cache matrix elements which are doubles. The cache replacement rule is least recently used first. One-dimensional arrays are used to represent matrices with the row major order.

(a). When matrix-matrix multiplication is performed using the simple triple-loop algorithm, there are 6 versions of the algorithm (ijk, ikj, jik, jki, kij, kji). Calculate the number of cache misses for each element in each matrix for each version of the algorithm when the sizes of the matrices are 1000 and 10 respectively.


Homework Equations



N/A

The Attempt at a Solution


I'm confused about the actual matrix multiplication. From the problem I'm getting matrix A: 1000x1000 and matrix b: 10x10. How can these two be multiplied? The number of columns and rows do not match up between them. Am I misreading the problem?

After the above is clarified I understand that not just one element of the matrix is transferred to the cache but how many are? since matrix A is 1000x1000 is the whole first row a0,0 to a0, 999 transferred into the cache when a0,0 is brought in? Or are just 10 element blocks brought into the cache? If only 10 are brought in at a time are they brought in sequentially or is part of matrix b brought in too?

Sorry for the overload of questions, I just don't understand the problem and can't progress without understanding it.
 
Physics news on Phys.org
  • #2
SpiffyEh said:
I'm confused about the actual matrix multiplication. From the problem I'm getting matrix A: 1000x1000 and matrix b: 10x10.
These are two difference cases, one where both A and B are 1000x1000, and the other where both A and B are 10 x 10.

SpiffyEh said:
After the above is clarified I understand that not just one element of the matrix is transferred to the cache but how many are? ... matrix multiply A ... B ... If only 10 are brought in at a time are they brought in sequentially or is part of matrix B brought in too?
Each cache miss results in 10 doubles being read into a line of cache from sequential addresses in memory, and reads from main memory into a line of cache will always start at a main memory address that is some exact multiple of 10 doubles (80 bytes).

The order of memory operations will depend on the algorithm to multiply A by B. Normally a row of A is multiplied with a column of B, and the products summed and stored in some output matrix, but for this problem I assume that writes will not affect the data cache.
 
Last edited:
  • #3
oh! That makes so much more sense! Thank you!
You assumed correctly, writes do not affect the data cache.

I'm pretty sure I figured this one out but now I'm having issues with part b which involves blocking.

(b). If matrices are partitioned into block matrices with each block being a 10 by 10 matrix, then the matrix-matrix multiplication can be performed using one of the 6 blocked version algorithms (ijk, ikj, jik, jki, kij, kji). Assume the multiplication of two blocks in the inner three loops uses the same loop order as the three outer loops in the blocked version algorithms. Calculate the number of cache misses for each element in each matrix for each version of the blocked algorithm when the size of the matrices is 1000.

I'm starting with ijk. Which is the algorithm:
Code:
for(int i = 0; i < n; i++) {
  for(int j = 0; j < n; j++) {
    for(int k = 0; k < n; k++) {
      c[i*n+j] = c[i*n+j] + a[i*n+k] * b[k*n+j];
    }
  }
}

So, I know that the whole block in all three matricies A, B, and C can fit in cache so the number of cache misses for the first block of A multiplied by the first block in B stored in the first block in C is 30 because there are 10 cache misses per matrix block. Here is the cache I got after the first multiplication (the one in the first position in cache is the most recently used).
0 c9,0…c9,9
1 a9,0…a9,9
2 b9,0…b9,9
3 b8,0…b8,9
4 b7,0…b7,9
5 b6,0…b6,9
6 b5,0…b5,9
7 b4,0…b4,9
8 b3,0…b3,9
9 b2,0…b2,9
10 b1,0…b1,9
11 b0,0…b0,9
12 c8,0…c8,9
13 a8,0…a8,9
14 c7,0…c7,9
15 a7,0…a7,9
16 c6,0…c6,9
17 a6,0…a6,9
18 c5,0…c5,9
19 a5,0…a5,9
20 c4,0…c4,9
21 a4,0…a4,9
22 c3,0…c3,9
23 a3,0…a3,9
24 c2,0…c2,9
25 a2,0…a2,9
26 c1,0…c1,9
27 a1,0…a1,9
28 c0,0…c0,9
29 a0,0…a0,9

Now, I'm having a hard time figuring out how many cache misses it will have for all subsequent blocks. I know there will probably be some values in cache that can be accessed without being pulled in again but I can't seem to figure out the pattern. Hopefully I at least have it right up to this point.
 
  • #4
SpiffyEh said:
I'm starting with ijk. Which is the algorithm:
Code:
for(int i = 0; i < n; i++) {
  for(int j = 0; j < n; j++) {
    for(int k = 0; k < n; k++) {
      c[i*n+j] = c[i*n+j] + a[i*n+k] * b[k*n+j];
    }
  }
}

Now, I'm having a hard time figuring out how many cache misses it will have for all subsequent blocks. I know there will probably be some values in cache that can be accessed without being pulled in again but I can't seem to figure out the pattern. Hopefully I at least have it right up to this point.
I'm not sure there's any easy to do this other than grind out a few cycles and look for a pattern of cache misses. For example, a sub-row of a[...] might get used repeatedly while multiplying with different columns of b[...], ...
 
Last edited:
  • #5
Yeah... I ended up doing that.
Lets say block 0,0 of A includes a00...a09 and a00..a99 (the whole 10x10 square)
block 0,0 of B includes b00...b09 and b00...99
same for block c..
so basically the r,c notation represents each 10x10 block in a matrix

For block 0,0 in a * block 0,0 in b stored in block 0,0 in c:
-all three of the blocks need to be loaded in so it is a cache miss for the 0th column of each

For block 0,0 in a * block 1,0 in b stored in block 0,0 in c:
-each row of b is loaded so there is a cache miss for each value in the 0th column
-the rows a00...a09 and c00...c09 are in cache and are bumped up to more recently used (for each multiplication with a block of b)
-as lines of b are brought in, the other rows of matricies a and c are pushed out of cache so there is a cache miss for each of those when they are loaded in again (the 0th column again)


I kind of understand what's going on and see a pattern but I'm having a really hard time coming up with actual values for the number of cache misses in the pattern for the ijk algorithm.
 

1. What is a cache miss?

A cache miss occurs when a requested data or instruction is not found in the cache memory and has to be retrieved from a slower memory location, such as main memory or disk. This results in a longer access time and can impact the overall performance of a system.

2. How are cache misses calculated?

Cache misses are typically calculated by monitoring the number of times a requested data or instruction is not found in the cache and has to be retrieved from a slower memory location. This can be done using specialized performance monitoring tools or by analyzing the code and predicting potential cache misses.

3. Why are cache misses important?

Cache misses are important because they can have a significant impact on the performance of a system. A high number of cache misses can result in longer access times, which can slow down the execution of programs and reduce the efficiency of a system.

4. How can cache misses be reduced?

There are several techniques that can be used to reduce the number of cache misses, such as increasing the size of the cache, using different cache replacement policies, and optimizing the code to improve spatial and temporal locality. Another approach is to use prefetching, which involves predicting future memory access patterns and loading data into the cache in advance.

5. What are the limitations of calculating cache misses?

One limitation of calculating cache misses is that it can be a complex and time-consuming process, especially for large and complex systems. Additionally, the accuracy of cache miss calculations may be affected by various factors, such as the type of cache, the cache size, and the efficiency of the cache replacement policy. Therefore, it is important to carefully analyze and interpret the results of cache miss calculations.

Similar threads

  • Programming and Computer Science
Replies
1
Views
2K
  • Programming and Computer Science
Replies
4
Views
1K
Replies
5
Views
1K
  • Programming and Computer Science
Replies
1
Views
2K
  • STEM Academic Advising
Replies
13
Views
2K
  • Calculus and Beyond Homework Help
Replies
1
Views
5K
  • MATLAB, Maple, Mathematica, LaTeX
Replies
3
Views
265
  • Set Theory, Logic, Probability, Statistics
Replies
2
Views
3K
  • General Math
Replies
1
Views
29K
Back
Top