Is Transposing Matrix Efficient for Fortran Matrix-Vector Multiplication?

  • Context: Fortran 
  • Thread starter: Telemachus
  • Tags: Fortran, Matrix, Vector

Discussion Overview

The discussion revolves around the efficiency of matrix-vector multiplication in Fortran, specifically addressing the implications of column-major order on performance. Participants explore different coding approaches and the potential benefits of transposing matrices for improved computational efficiency.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant describes their current method for matrix-vector multiplication in Fortran and questions whether transposing the matrix would enhance efficiency.
  • Another participant suggests trying both methods to compare performance, noting that modern Fortran dialects offer built-in functions like matmul for efficiency.
  • A participant explains that accessing matrix columns in column-major order can be significantly faster due to cache utilization, emphasizing the importance of data locality.
  • Concerns are raised about the potential bottleneck caused by the accumulation of results in the summation process, suggesting an alternative approach to improve performance.
  • One participant expresses a preference for Fortran77 and raises questions about whether the matmul function would encounter similar efficiency issues as their current method.
  • The participant indicates plans to conduct performance comparisons of different methods and appreciates the insights shared by others in the discussion.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the most efficient approach, with multiple competing views on the best method for matrix-vector multiplication and the implications of data locality on performance.

Contextual Notes

Participants acknowledge the limitations of their current methods and the potential impact of cache sizes on performance, but do not resolve these issues within the discussion.

Who May Find This Useful

This discussion may be useful for programmers and researchers working with Fortran, particularly those interested in optimizing matrix operations and understanding the implications of memory access patterns on computational efficiency.

Telemachus
Hi there. I wanted to ask this question, which is about efficiency in matrix-times-vector multiplication in Fortran. When I have some matrix ##\hat A## and a vector ##\vec{x}##, and I want to compute the product ##\hat A \vec{x}=\vec{b}## in Fortran, what I do is build the arrays ##A(i,j)## and ##x(j)##, and then tell Fortran to do this:

Fortran:
      do i=1,msize
       rsumaux=rzero
       rsum=rzero
       do k=1,msize
        rsumaux=A(i,k)*x(k)
        rsum=rsum+rsumaux
       enddo
       b(i)=rsum
      enddo

And this is just fine; it works, it gives the result I want. But the thing is that I have read that Fortran stores arrays in column-major order, which means that for an array like ##A(i,j)## (let's suppose A is a 3x3 matrix), the elements are ordered in memory like: A(1,1) A(2,1) A(3,1) A(1,2) A(2,2) A(3,2) A(1,3) A(2,3) A(3,3).

So, the way I am doing the matrix multiplication isn't the most efficient way of doing it. Should I fix it by using ##\left(\hat A \vec{x}\right)^T=\vec{b}^T=\vec{x}^T\hat A^T##? Would this be the right way to do it? Should I store the matrix transposed in order to make my code more efficient, and then code it this way?

Fortran:
      do i=1,msize
       rsumaux=rzero
       rsum=rzero
       do k=1,msize
        rsumaux=A(k,i)*x(k)
        rsum=rsum+rsumaux
       enddo
       b(i)=rsum
      enddo

Thanks in advance.
 
I would try it both ways and see which one is faster. At least in 'modern' dialects of fortran, like fortran90, you can use the built-in matmul function and see how that compares.

jason
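The "try it both ways and time it" suggestion could be sketched like this (a minimal sketch, not code from the thread; it assumes a compiler supporting the Fortran 90/95 intrinsics `random_number` and `cpu_time`, and the size `msize = 2000` is an arbitrary choice large enough to exceed typical cache sizes):

```fortran
program timing
   implicit none
   integer, parameter :: msize = 2000
   double precision, allocatable :: A(:,:), x(:), b(:)
   double precision :: t0, t1, rsum
   integer :: i, k

   allocate(A(msize,msize), x(msize), b(msize))
   call random_number(A)
   call random_number(x)

   ! Row-wise inner loop: A(i,k) jumps msize elements per step
   call cpu_time(t0)
   do i = 1, msize
      rsum = 0d0
      do k = 1, msize
         rsum = rsum + A(i,k)*x(k)
      enddo
      b(i) = rsum
   enddo
   call cpu_time(t1)
   print *, 'row-wise:    ', t1 - t0, 'seconds'

   ! Column-wise inner loop: A(i,k) walks contiguous memory
   call cpu_time(t0)
   b = 0d0
   do k = 1, msize
      do i = 1, msize
         b(i) = b(i) + A(i,k)*x(k)
      enddo
   enddo
   call cpu_time(t1)
   print *, 'column-wise: ', t1 - t0, 'seconds'
end program timing
```

Note that a single run can be noisy; repeating each variant a few times and taking the best time gives a more reliable comparison.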
 
If the columns are indeed in column-major order, accessing them that way will be much faster.
The processor pulls in a whole cache line of 64 bytes, i.e. 8 double-precision numbers, at once. If you scan the matrix by columns, you can use all 8 numbers in the cache line; if you scan it by rows, you can use only one of the 8.
Of course, if everything fits in the first-level cache there won't be any difference. The L1 cache is 32 KB on modern Intel/AMD processors, so 4096 double-precision values; a 64x64 matrix is where you should start to see a difference.

Another potential bottleneck is the rsum = rsum + rsumaux statement: each addition depends on the previous value of rsum, so it can't be started until the previous addition is done. This might run faster (note that all of b must be zeroed before the loops, since every column now contributes to every element of b):
Code:
do k=1,msize
   b(k) = 0
enddo
do j=1,msize
   do k=1,msize
      b(k) = b(k) + A(k,j) * x(j)
   enddo
enddo
 
What willem2 is discussing is, in part, called data locality: memory is much slower than the CPU, which is why the x86 architecture has L1, L2, etc. caches to speed up access. However, if the cache has to be flushed and refilled on every iteration of the loop because the "next" item is never in the cache, the reverse is true: caching activity can actually degrade the overall efficiency of the loop, because the CPU idles much of the time waiting for cache flushes and re-fetches.

For large datasets the speedup can be dramatic when caching is present and used effectively: https://en.wikipedia.org/wiki/Locality_of_reference
If you want to pursue this further, consider googling for data locality algorithms.

For x86 this is a good read, and is still valid:
https://www.akkadia.org/drepper/cpumemory.pdf
 
jasonRF said:
I would try it both ways and see which one is faster. At least in 'modern' dialects of fortran, like fortran90, you can use the built-in matmul function and see how that compares.

jason
I'm using Fortran 77; I've never tried Fortran 90 or anything newer. I like the syntax of Fortran 77. I know it's pretty archaic, but one advantage is that I know exactly what the computer is doing. A question that arises is whether the same problem will show up when calling the matmul function; it would be nice to see how that works, and I expect to have the same kind of issue. I will certainly compare the different ways of computing the product and how much CPU time each takes, maybe this week (I am working on other things, but efficiency is important to me, so I will certainly do it).
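For reference, the matmul call the thread mentions would look like this (a minimal sketch requiring a Fortran 90 compiler; `matmul` is a standard intrinsic, and the compiler is free to choose a cache-friendly loop order internally, so it need not share the hand-written loop's access-pattern problem):

```fortran
program matvec
   implicit none
   integer, parameter :: msize = 3
   double precision :: A(msize,msize), x(msize), b(msize)

   call random_number(A)
   call random_number(x)

   ! With a rank-2 and a rank-1 argument, matmul returns
   ! the matrix-vector product A*x as a rank-1 array.
   b = matmul(A, x)
   print *, b
end program matvec
```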

When I have some results I'll let you know. Thank you all for the enriching comments. I have some technical doubts, but I think I have to read a bit from the links posted by Jim Mcnamara before asking. Thank you very much; I really enjoy being assisted by experts like you, and this help has tremendous value to me, intellectually, professionally, and at many other levels. So thanks, really.
 
