Is Transposing Matrix Efficient for Fortran Matrix-Vector Multiplication?

  • Context: Fortran 
  • Thread starter: Telemachus
  • Tags: Fortran, Matrix, Vector

Discussion Overview

The discussion revolves around the efficiency of matrix-vector multiplication in Fortran, specifically addressing the implications of column-major order on performance. Participants explore different coding approaches and the potential benefits of transposing matrices for improved computational efficiency.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant describes their current method for matrix-vector multiplication in Fortran and questions whether transposing the matrix would enhance efficiency.
  • Another participant suggests trying both methods to compare performance, noting that modern Fortran dialects offer built-in functions like matmul for efficiency.
  • A participant explains that accessing matrix columns in column-major order can be significantly faster due to cache utilization, emphasizing the importance of data locality.
  • Concerns are raised about the potential bottleneck caused by the accumulation of results in the summation process, suggesting an alternative approach to improve performance.
  • One participant expresses a preference for Fortran77 and raises questions about whether the matmul function would encounter similar efficiency issues as their current method.
  • The participant indicates plans to conduct performance comparisons of different methods and appreciates the insights shared by others in the discussion.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the most efficient approach, with multiple competing views on the best method for matrix-vector multiplication and the implications of data locality on performance.

Contextual Notes

Participants acknowledge the limitations of their current methods and the potential impact of cache sizes on performance, but do not resolve these issues within the discussion.

Who May Find This Useful

This discussion may be useful for programmers and researchers working with Fortran, particularly those interested in optimizing matrix operations and understanding the implications of memory access patterns on computational efficiency.

Telemachus
Hi there. I wanted to ask this question, which is about efficiency in matrix-times-vector multiplication in Fortran. When I have some matrix ##\hat A## and a vector ##\vec{x}##, and I want to compute the product ##\hat A \vec{x}=\vec{b}## in Fortran, what I do is build the arrays ##A(i,j)## and ##x(j)##, and then tell Fortran to do this:

Fortran:
      do i=1,msize
       rsumaux=rzero
       rsum=rzero
       do k=1,msize
        rsumaux=A(i,k)*x(k)
        rsum=rsum+rsumaux
       enddo
       b(i)=rsum
      enddo

And this is just fine; it works, it gives the result I want. But the thing is that I have read that Fortran stores arrays in column-major order, which means that for an array like ##A(i,j)## (let's suppose A is a 3x3 matrix), the elements are ordered in memory like: A(1,1) A(2,1) A(3,1) A(1,2) A(2,2) A(3,2) A(1,3) A(2,3) A(3,3).

So, the way I am doing the matrix multiplication isn't the most efficient way of doing it. Should I fix it by using ##\left(\hat A \vec{x}\right)^T=\vec{b}^T=\vec{x}^T\hat A^T##? Would this be the right way to do it? Should I store the matrix transposed in order to make my code more efficient, and then code it this way?

Fortran:
      do i=1,msize
       rsumaux=rzero
       rsum=rzero
       do k=1,msize
        rsumaux=A(k,i)*x(k)
        rsum=rsum+rsumaux
       enddo
       b(i)=rsum
      enddo

Thanks in advance.
 
I would try it both ways and see which one is faster. At least in 'modern' dialects of fortran, like fortran90, you can use the built-in matmul function and see how that compares.

jason
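The "try it both ways and time it" suggestion could be sketched like this (a minimal sketch, not code from the thread; it assumes a compiler supporting the Fortran 90/95 intrinsics `random_number` and `cpu_time`, and the size `msize = 2000` is an arbitrary choice large enough to exceed typical cache sizes):

```fortran
program timing
   implicit none
   integer, parameter :: msize = 2000
   double precision, allocatable :: A(:,:), x(:), b(:)
   double precision :: t0, t1, rsum
   integer :: i, k

   allocate(A(msize,msize), x(msize), b(msize))
   call random_number(A)
   call random_number(x)

   ! Row-wise inner loop: A(i,k) jumps msize elements per step
   call cpu_time(t0)
   do i = 1, msize
      rsum = 0d0
      do k = 1, msize
         rsum = rsum + A(i,k)*x(k)
      enddo
      b(i) = rsum
   enddo
   call cpu_time(t1)
   print *, 'row-wise:    ', t1 - t0, 'seconds'

   ! Column-wise inner loop: A(i,k) walks contiguous memory
   call cpu_time(t0)
   b = 0d0
   do k = 1, msize
      do i = 1, msize
         b(i) = b(i) + A(i,k)*x(k)
      enddo
   enddo
   call cpu_time(t1)
   print *, 'column-wise: ', t1 - t0, 'seconds'
end program timing
```

Note that a single run can be noisy; repeating each variant a few times and taking the best time gives a more reliable comparison.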
 
If the columns are indeed in column-major order, accessing them that way will be much faster.
The processor pulls in a whole cache line of 64 bytes, i.e. 8 double-precision numbers, at once. If you scan the matrix by columns, you can use all 8 numbers in the cache line; if you scan it by rows, you can use only one of the 8.
Of course, if everything fits in the first-level cache there won't be any difference. The L1 cache is 32 KB on modern Intel/AMD processors, so 4096 double-precision values; a 64x64 matrix is where you should start to see a difference.

Another potential bottleneck is the rsum = rsum + rsumaux statement: each addition depends on the previous value of rsum, so it can't be started until the previous addition is done. This might run faster (note that all of b must be zeroed before the loops, since every column now contributes to every element of b):
Code:
do k=1,msize
   b(k) = 0
enddo
do j=1,msize
   do k=1,msize
      b(k) = b(k) + A(k,j) * x(j)
   enddo
enddo
 
What willem2 is discussing is, in part, called data locality: memory is much slower than the CPU, which is why the x86 architecture has L1, L2, etc. caches to speed up access. However, if the cache has to be flushed and refilled on every iteration of the loop because the "next" item is never in the cache, the reverse is true: caching activity can actually degrade the overall efficiency of the loop, because the CPU idles much of the time waiting for cache flushes and re-fetches.

For large datasets the speedup can be dramatic when caching is present and used effectively: https://en.wikipedia.org/wiki/Locality_of_reference
If you want to pursue this further, consider googling for data locality algorithms.

For x86 this is a good read, and is still valid:
https://www.akkadia.org/drepper/cpumemory.pdf
 
jasonRF said:
I would try it both ways and see which one is faster. At least in 'modern' dialects of fortran, like fortran90, you can use the built-in matmul function and see how that compares.

jason
I'm using Fortran 77; I've never tried Fortran 90 or anything newer. I like the syntax of Fortran 77. I know it's pretty archaic, but one advantage is that I know exactly what the computer is doing. A question that arises is whether the same problem will show up when calling the matmul function; it would be nice to see how that works, and I expect to have the same kind of issue. I will certainly compare the different ways of computing the product and how much CPU time each takes, maybe this week (I am working on other things, but efficiency is important to me, so I will certainly do it).
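For reference, the matmul call the thread mentions would look like this (a minimal sketch requiring a Fortran 90 compiler; `matmul` is a standard intrinsic, and the compiler is free to choose a cache-friendly loop order internally, so it need not share the hand-written loop's access-pattern problem):

```fortran
program matvec
   implicit none
   integer, parameter :: msize = 3
   double precision :: A(msize,msize), x(msize), b(msize)

   call random_number(A)
   call random_number(x)

   ! With a rank-2 and a rank-1 argument, matmul returns
   ! the matrix-vector product A*x as a rank-1 array.
   b = matmul(A, x)
   print *, b
end program matvec
```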

When I have some results I'll let you know. Thank you all for the enriching comments. I have some technical doubts, but I think I have to read a bit from the links posted by Jim Mcnamara before asking. Thank you very much; I really enjoy being assisted by experts like you, and this help has tremendous value to me, intellectually, professionally, and at many other levels. So thanks, really.
 
