# Matrix times vector in Fortran

1. Jul 28, 2017

### Telemachus

Hi there. I wanted to ask a question about the efficiency of matrix-times-vector multiplication in Fortran. When I have some matrix $\hat A$ and vector $\vec{x}$, and I want to compute the product $\hat A \vec{x}=\vec{b}$ in Fortran, what I do is build the arrays $A(i,j)$ and $x(j)$ and then tell Fortran to do this:

Code (Fortran):

do i=1,msize
   rsumaux=rzero
   rsum=rzero
   do k=1,msize
      rsumaux=A(i,k)*x(k)
      rsum=rsum+rsumaux
   enddo
   b(i)=rsum
enddo

And this is just fine; it works and gives the result I want. But I have read that Fortran stores arrays in column-major order, which means that if $A(i,j)$ is, say, a 3x3 matrix, its elements are laid out in memory as: A(1,1) A(2,1) A(3,1) A(1,2) A(2,2) A(3,2) A(1,3) A(2,3) A(3,3).

So the way I am doing the matrix multiplication isn't the most efficient way of doing it. Should I fix it by using
$\left (\hat A \vec{x}\right)^T=\vec{b}^T=\vec{x}^T\hat A^T$? Would that be the right way to do it? Should I store the matrix transposed in order to make my code more efficient, and then code it this way?

Code (Fortran):

do i=1,msize
   rsumaux=rzero
   rsum=rzero
   do k=1,msize
      rsumaux=A(k,i)*x(k)
      rsum=rsum+rsumaux
   enddo
   b(i)=rsum
enddo

2. Jul 28, 2017

### jasonRF

I would try it both ways and see which one is faster. At least in modern dialects of Fortran, like Fortran 90, you can also use the built-in matmul intrinsic and see how that compares.
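For reference, a minimal sketch of the matmul approach (the sample matrix and vector values are illustrative; note that reshape fills the matrix in column-major order):

```fortran
program matvec_demo
  implicit none
  integer, parameter :: msize = 3
  real(8) :: A(msize, msize), x(msize), b(msize)

  ! reshape fills column by column: A = [[1,4,7],[2,5,8],[3,6,9]]
  A = reshape([1d0, 2d0, 3d0, 4d0, 5d0, 6d0, 7d0, 8d0, 9d0], [msize, msize])
  x = [1d0, 1d0, 1d0]

  ! matmul handles matrix-vector products directly
  b = matmul(A, x)
  print *, b   ! row sums of A: 12, 15, 18
end program matvec_demo
```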

jason

3. Jul 29, 2017

### willem2

If the matrix is indeed stored in column-major order, accessing it that way will be much faster.
The processor fetches a cache line of 64 bytes, i.e. 8 double-precision numbers, at once. If you scan the matrix by columns, you can use all 8 numbers in the cache line; if you scan it by rows, you can use only one of the 8.
Of course, if everything fits in the first-level cache there won't be any difference. The L1 cache is 32 KB on modern Intel/AMD processors, which holds 4096 double-precision values, so a 64x64 matrix is roughly where you should start to see a difference.

Another potential bottleneck is the rsum = rsum + rsumaux statement: the addition depends on the previous value of rsum, so it can't be started until the previous addition is done. This might run faster (all of b must be zeroed before the loops, since each pass over column j updates every element of b):
Code (Fortran):

b = 0
do j=1,msize
   do k=1,msize
      b(k) = b(k) + A(k,j) * x(j)
   enddo
enddo
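As a sanity check, here is a self-contained sketch of this column-oriented accumulation, compared against the matmul intrinsic (the array size is illustrative):

```fortran
program colmajor_check
  implicit none
  integer, parameter :: msize = 4
  real(8) :: A(msize,msize), x(msize), b(msize)
  integer :: j, k

  call random_number(A)
  call random_number(x)

  ! Column-oriented accumulation: b must be zeroed up front,
  ! since every pass over column j updates all of b.
  b = 0d0
  do j = 1, msize
     do k = 1, msize
        b(k) = b(k) + A(k,j) * x(j)
     enddo
  enddo

  ! Compare against the matmul intrinsic; should be ~0 up to rounding
  print *, 'max abs difference:', maxval(abs(b - matmul(A, x)))
end program colmajor_check
```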

4. Jul 29, 2017

### Staff: Mentor

What willem2 is discussing is, in part, called data locality: memory is much slower than the CPU, which is why the x86 architecture has L1, L2, etc. caches to speed up access. However, if the cache has to be flushed and refilled on every iteration of the loop because the "next" item is never in the cache, the reverse is true: caching activity can actually degrade the overall efficiency of the loop, because the CPU spends a lot of time idle, waiting on cache misses and re-fetches.

For large datasets the speed up can be dramatic when caching is present and used effectively.

https://en.wikipedia.org/wiki/Locality_of_reference
If you want to pursue this further, consider searching for "data locality algorithms".
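A quick way to see the effect yourself is to time the two loop orders on a matrix too large for the L1 cache. A sketch (the matrix size and timing approach are illustrative; on most machines the column-wise version should be noticeably faster):

```fortran
program locality_demo
  implicit none
  integer, parameter :: n = 2000
  real(8), allocatable :: A(:,:), x(:), b(:)
  real(8) :: t0, t1
  integer :: i, k

  allocate(A(n,n), x(n), b(n))
  call random_number(A)
  call random_number(x)

  ! Row-wise traversal: the inner loop strides across a row,
  ! jumping n*8 bytes between consecutive accesses to A.
  call cpu_time(t0)
  do i = 1, n
     b(i) = 0d0
     do k = 1, n
        b(i) = b(i) + A(i,k) * x(k)
     enddo
  enddo
  call cpu_time(t1)
  print *, 'row-wise   :', t1 - t0, 's'

  ! Column-wise traversal: the inner loop walks down a column,
  ! reading A contiguously so each cache line is fully used.
  call cpu_time(t0)
  b = 0d0
  do k = 1, n
     do i = 1, n
        b(i) = b(i) + A(i,k) * x(k)
     enddo
  enddo
  call cpu_time(t1)
  print *, 'column-wise:', t1 - t0, 's'
end program locality_demo
```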

For x86 this is a good read, and is still valid: