1. The problem statement, all variables and given/known data Suppose your data cache has 30 lines and each line can hold 10 doubles. You are performing a matrix-matrix multiplication (C=C+A*B) with square matrices of size 1000 and 10 respectively. Assume data caches are only used to cache matrix elements which are doubles. The cache replacement rule is least recently used first. One-dimensional arrays are used to represent matrices with the row major order. (a). When matrix-matrix multiplication is performed using the simple triple-loop algorithm, there are 6 versions of the algorithm (ijk, ikj, jik, jki, kij, kji). Calculate the number of cache misses for each element in each matrix for each version of the algorithm when the sizes of the matrices are 1000 and 10 respectively. 2. Relevant equations N/A 3. The attempt at a solution I'm confused about the actual matrix multiplication. From the problem I'm getting matrix A: 1000x1000 and matrix b: 10x10. How can these two be multiplied? The number of columns and rows do not match up between them. Am I misreading the problem? After the above is clarified I understand that not just one element of the matrix is transferred to the cache but how many are? since matrix A is 1000x1000 is the whole first row a0,0 to a0, 999 transfered into the cache when a0,0 is brought in? Or are just 10 element blocks brought into the cache? If only 10 are brought in at a time are they brought in sequentially or is part of matrix b brought in too? Sorry for the overload of questions, I just don't understand the problem and can't progress without understanding it.
These are two difference cases, one where both A and B are 1000x1000, and the other where both A and B are 10 x 10. Each cache miss results in 10 doubles being read into a line of cache from sequential addresses in memory, and reads from main memory into a line of cache will always start at a main memory address that is some exact multiple of 10 doubles (80 bytes). The order of memory operations will depend on the algorithm to multiply A by B. Normally a row of A is multiplied with a column of B, and the products summed and stored in some output matrix, but for this problem I assume that writes will not affect the data cache.
oh! That makes so much more sense! Thank you! You assumed correctly, writes do not affect the data cache. I'm pretty sure I figured this one out but now i'm having issues with part b which involves blocking. (b). If matrices are partitioned into block matrices with each block being a 10 by 10 matrix, then the matrix-matrix multiplication can be performed using one of the 6 blocked version algorithms (ijk, ikj, jik, jki, kij, kji). Assume the multiplication of two blocks in the inner three loops uses the same loop order as the three outer loops in the blocked version algorithms. Calculate the number of cache misses for each element in each matrix for each version of the blocked algorithm when the size of the matrices is 1000. I'm starting with ijk. Which is the algorithm: Code (Text): for(int i = 0; i < n; i++) { for(int j = 0; j < n; j++) { for(int k = 0; k < n; k++) { c[i*n+j] = c[i*n+j] + a[i*n+k] * b[k*n+j]; } } } So, I know that the whole block in all three matricies A, B, and C can fit in cache so the number of cache misses for the first block of A multiplied by the first block in B stored in the first block in C is 30 because there are 10 cache misses per matrix block. Here is the cache I got after the first multiplication (the one in the first position in cache is the most recently used). 0 c9,0…c9,9 1 a9,0…a9,9 2 b9,0…b9,9 3 b8,0…b8,9 4 b7,0…b7,9 5 b6,0…b6,9 6 b5,0…b5,9 7 b4,0…b4,9 8 b3,0…b3,9 9 b2,0…b2,9 10 b1,0…b1,9 11 b0,0…b0,9 12 c8,0…c8,9 13 a8,0…a8,9 14 c7,0…c7,9 15 a7,0…a7,9 16 c6,0…c6,9 17 a6,0…a6,9 18 c5,0…c5,9 19 a5,0…a5,9 20 c4,0…c4,9 21 a4,0…a4,9 22 c3,0…c3,9 23 a3,0…a3,9 24 c2,0…c2,9 25 a2,0…a2,9 26 c1,0…c1,9 27 a1,0…a1,9 28 c0,0…c0,9 29 a0,0…a0,9 Now, I'm having a hard time figuring out how many cache misses it will have for all subsequent blocks. I know there will probably be some values in cache that can be accessed without being pulled in again but I can't seem to figure out the pattern. Hopefully I at least have it right up to this point.
I'm not sure there's any easy to do this other than grind out a few cycles and look for a pattern of cache misses. For example, a sub-row of a[...] might get used repeatedly while multiplying with different columns of b[...], ...
Yeah... I ended up doing that. Lets say block 0,0 of A includes a00...a09 and a00..a99 (the whole 10x10 square) block 0,0 of B includes b00...b09 and b00...99 same for block c.. so basically the r,c notation represents each 10x10 block in a matrix For block 0,0 in a * block 0,0 in b stored in block 0,0 in c: -all three of the blocks need to be loaded in so it is a cache miss for the 0th column of each For block 0,0 in a * block 1,0 in b stored in block 0,0 in c: -each row of b is loaded so there is a cache miss for each value in the 0th column -the rows a00...a09 and c00...c09 are in cache and are bumped up to more recently used (for each multiplication with a block of b) -as lines of b are brought in, the other rows of matricies a and c are pushed out of cache so there is a cache miss for each of those when they are loaded in again (the 0th column again) I kind of understand what's going on and see a pattern but I'm having a really hard time coming up with actual values for the number of cache misses in the pattern for the ijk algorithm.