How many cache misses occur when performing matrix multiplication?

SpiffyEh · Mar 11, 2012

Homework Statement

Suppose your data cache has 30 lines and each line can hold 10 doubles. You are performing a matrix-matrix multiplication (C=C+A*B) with square matrices of size 1000 and 10 respectively. Assume data caches are only used to cache matrix elements which are doubles. The cache replacement rule is least recently used first. One-dimensional arrays are used to represent matrices with the row major order.

(a). When matrix-matrix multiplication is performed using the simple triple-loop algorithm, there are 6 versions of the algorithm (ijk, ikj, jik, jki, kij, kji). Calculate the number of cache misses for each element in each matrix for each version of the algorithm when the sizes of the matrices are 1000 and 10 respectively.

Homework Equations

N/A

The Attempt at a Solution

I'm confused about the actual matrix multiplication. From the problem I'm getting matrix A: 1000x1000 and matrix b: 10x10. How can these two be multiplied? The number of columns and rows do not match up between them. Am I misreading the problem?

After the above is clarified I understand that not just one element of the matrix is transferred to the cache but how many are? since matrix A is 1000x1000 is the whole first row a0,0 to a0, 999 transferred into the cache when a0,0 is brought in? Or are just 10 element blocks brought into the cache? If only 10 are brought in at a time are they brought in sequentially or is part of matrix b brought in too?

Sorry for the overload of questions, I just don't understand the problem and can't progress without understanding it.

rcgldr · Mar 12, 2012

SpiffyEh said:

I'm confused about the actual matrix multiplication. From the problem I'm getting matrix A: 1000x1000 and matrix b: 10x10.

These are two difference cases, one where both A and B are 1000x1000, and the other where both A and B are 10 x 10.

SpiffyEh said:

After the above is clarified I understand that not just one element of the matrix is transferred to the cache but how many are? ... matrix multiply A ... B ... If only 10 are brought in at a time are they brought in sequentially or is part of matrix B brought in too?

Each cache miss results in 10 doubles being read into a line of cache from sequential addresses in memory, and reads from main memory into a line of cache will always start at a main memory address that is some exact multiple of 10 doubles (80 bytes).

The order of memory operations will depend on the algorithm to multiply A by B. Normally a row of A is multiplied with a column of B, and the products summed and stored in some output matrix, but for this problem I assume that writes will not affect the data cache.

SpiffyEh · Mar 15, 2012

oh! That makes so much more sense! Thank you!
You assumed correctly, writes do not affect the data cache.

I'm pretty sure I figured this one out but now I'm having issues with part b which involves blocking.

(b). If matrices are partitioned into block matrices with each block being a 10 by 10 matrix, then the matrix-matrix multiplication can be performed using one of the 6 blocked version algorithms (ijk, ikj, jik, jki, kij, kji). Assume the multiplication of two blocks in the inner three loops uses the same loop order as the three outer loops in the blocked version algorithms. Calculate the number of cache misses for each element in each matrix for each version of the blocked algorithm when the size of the matrices is 1000.

I'm starting with ijk. Which is the algorithm:

Code:

for(int i = 0; i < n; i++) {
  for(int j = 0; j < n; j++) {
    for(int k = 0; k < n; k++) {
      c[i*n+j] = c[i*n+j] + a[i*n+k] * b[k*n+j];
    }
  }
}

So, I know that the whole block in all three matricies A, B, and C can fit in cache so the number of cache misses for the first block of A multiplied by the first block in B stored in the first block in C is 30 because there are 10 cache misses per matrix block. Here is the cache I got after the first multiplication (the one in the first position in cache is the most recently used).
0 c9,0…c9,9
1 a9,0…a9,9
2 b9,0…b9,9
3 b8,0…b8,9
4 b7,0…b7,9
5 b6,0…b6,9
6 b5,0…b5,9
7 b4,0…b4,9
8 b3,0…b3,9
9 b2,0…b2,9
10 b1,0…b1,9
11 b0,0…b0,9
12 c8,0…c8,9
13 a8,0…a8,9
14 c7,0…c7,9
15 a7,0…a7,9
16 c6,0…c6,9
17 a6,0…a6,9
18 c5,0…c5,9
19 a5,0…a5,9
20 c4,0…c4,9
21 a4,0…a4,9
22 c3,0…c3,9
23 a3,0…a3,9
24 c2,0…c2,9
25 a2,0…a2,9
26 c1,0…c1,9
27 a1,0…a1,9
28 c0,0…c0,9
29 a0,0…a0,9

Now, I'm having a hard time figuring out how many cache misses it will have for all subsequent blocks. I know there will probably be some values in cache that can be accessed without being pulled in again but I can't seem to figure out the pattern. Hopefully I at least have it right up to this point.

rcgldr · Mar 15, 2012

SpiffyEh said:
I'm starting with ijk. Which is the algorithm:
Code:
for(int i = 0; i < n; i++) {
  for(int j = 0; j < n; j++) {
    for(int k = 0; k < n; k++) {
      c[i*n+j] = c[i*n+j] + a[i*n+k] * b[k*n+j];
    }
  }
}
Now, I'm having a hard time figuring out how many cache misses it will have for all subsequent blocks. I know there will probably be some values in cache that can be accessed without being pulled in again but I can't seem to figure out the pattern. Hopefully I at least have it right up to this point.

I'm not sure there's any easy to do this other than grind out a few cycles and look for a pattern of cache misses. For example, a sub-row of a[...] might get used repeatedly while multiplying with different columns of b[...], ...

SpiffyEh · Mar 15, 2012

Yeah... I ended up doing that.
Lets say block 0,0 of A includes a00...a09 and a00..a99 (the whole 10x10 square)
block 0,0 of B includes b00...b09 and b00...99
same for block c..
so basically the r,c notation represents each 10x10 block in a matrix

For block 0,0 in a * block 0,0 in b stored in block 0,0 in c:
-all three of the blocks need to be loaded in so it is a cache miss for the 0th column of each

For block 0,0 in a * block 1,0 in b stored in block 0,0 in c:
-each row of b is loaded so there is a cache miss for each value in the 0th column
-the rows a00...a09 and c00...c09 are in cache and are bumped up to more recently used (for each multiplication with a block of b)
-as lines of b are brought in, the other rows of matricies a and c are pushed out of cache so there is a cache miss for each of those when they are loaded in again (the 0th column again)

I kind of understand what's going on and see a pattern but I'm having a really hard time coming up with actual values for the number of cache misses in the pattern for the ijk algorithm.

How many cache misses occur when performing matrix multiplication?

Discussion Overview

Discussion Character

Main Points Raised

Areas of Agreement / Disagreement

Contextual Notes

Homework Statement

Homework Equations

The Attempt at a Solution

Similar threads

Four L-shaped members: Mechanical Analysis Problem

Engineering Joint and Marginal Distributions of a Randomly Selected Test Answer

Engineering Half wave voltage doubler

Truss analysis problem

Engineering Drawing Acceleration Diagram with Coriolis Acceleration

Insights Revisiting the Velocity-Time Function

Insights Remote Operated Gate Control System

Insights AI Enriched Problem Solving

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect