Fortran Is Transposing Matrix Efficient for Fortran Matrix-Vector Multiplication?

Click For Summary
The discussion focuses on optimizing matrix-vector multiplication in Fortran, particularly considering the language's column-major order for array storage. The original method of multiplication is deemed inefficient due to the way elements are accessed in memory. A proposed solution involves transposing the matrix to enhance memory access patterns, which could lead to performance improvements. It is noted that accessing matrix elements in column order can leverage CPU cache more effectively, as contiguous memory access reduces cache misses. The conversation also highlights the potential bottleneck of the accumulation step in the multiplication, suggesting an alternative approach that may improve performance. Additionally, there is mention of using the built-in `matmul` function in modern Fortran versions, although the user primarily works with Fortran77 and is interested in comparing execution times across different methods. The importance of data locality and its impact on performance is emphasized, with references to further reading on the topic. The user expresses appreciation for the insights shared in the discussion and plans to conduct performance tests.
Telemachus
Messages
820
Reaction score
30
Hi there. I wanted to ask this question, which is about efficiency in matrix times vector multiplication in fortran. When I have some matrix ##\hat A## and vector ##\vec{x}##, and I want to compute the matrix times vector ##\hat A \vec{x}=\vec{b}## in Fortran, what I do is, I build the array ##A(i,j)## and ##x(j)##, and then I tell Fortran to do this:

Fortran:
      do i=1,msize
       rsumaux=rzero
       rsum=rzero
       do k=1,msize
        rsumaux=A(i,k)*x(k)
        rsum=rsum+rsumaux
       enddo
       b(i)=rsum
      enddo

And this is just fine, it works, it give the result I want. But the thing is that I have read that Fortran orders the arrays in column-major order, which means that arrays like ##A(i,j)##, and let's suppose that A is a 3x3 matrix, then these are ordered in memory like: A(1,1) A(2,1) A(3,1) A(1,2) A(2,2) A(3,2) A(1,3) A(2,3) ,A(3,3).

So, the way I am doing the matrix multiplication isn't the most efficient way of doing it. Should I fix it by using that:
##\left (\hat A \vec{x}\right)^T=\vec{b}^T=\vec{x}^T\hat A^T##, would this be the right way to do it? should I store the matrix transposed in order to make my code more efficient and then code it this way:?

Fortran:
      do i=1,msize
       rsumaux=rzero
       rsum=rzero
       do k=1,msize
        rsumaux=A(k,i)*x(k)
        rsum=rsum+rsumaux
       enddo
       b(i)=rsum
      enddo

Thanks in advance.
 
Technology news on Phys.org
I would try it both ways and see which one is faster. At least in 'modern' dialects of fortran, like fortran90, you can use the built-in matmul function and see how that compares.

jason
 
  • Like
Likes Telemachus and FactChecker
If the colums are indeed in colum-major order, accessing them that way will be much faster.
The processor will put a cache line of 64 bytes with 8 double precision numbers at once. If you scan the matrix by colums, you can use all of the 8 numbers in the cache line. If you scan it by rows you can use only one of the 8 numbers.
Of course if everything fits in the 1st level cache there won't be any difference. The 1st level cache is 32 kb on modern intel/amd processors, so 4096 double precision values, or a 64x64 matrix is where you should start to see a difference.

Another potential bottleneck is the rsum = rsum + rsumaux statement. the addition depends on the previous value of rsum, so this can't be started until the previous addition is done. This might run faster:
Code:
do j=1,msize
  b(j) = 0
  do k=1,msize
     b(k) = b(k) + A(k,j) * x(j) 
  enddo
enddo
 
  • Like
Likes Telemachus
What Willem2 is discussing in part is called data locality: memory is much slower than the CPU, which is why in the X86 architecture there are L!, L2, etc. caches to speed up access. However, if the cache has to be flushed and rewritten every iteration of the loop because the "next" item is never in the cache, the reverse is true - caching activity can actually degrade the overall efficiency of the loop. Because the cpu idles a lot of time waiting for cache flushes and re-fetches,

For large datasets the speed up can be dramatic when caching is present and used effectively.https://en.wikipedia.org/wiki/Locality_of_reference
If you want to pursue this further consider: google for data locality algorithms

For x86 this is a good read, and is still valid:
https://www.akkadia.org/drepper/cpumemory.pdf
 
  • Like
Likes Telemachus
jasonRF said:
I would try it both ways and see which one is faster. At least in 'modern' dialects of fortran, like fortran90, you can use the built-in matmul function and see how that compares.

jason
I'm using fortran77, I've never tried Fortran90, nor anything newer. I like the syntaxes of fortran77, I know it's pretty archaic, but one advantage is that I know exactly what the computer is doing. A question that arises is if the same problem will arise when calling the matmul function, it would be nice to see how that works, I expect to have the same kind of issue. I will certainly compare the different ways of computing the matrix and how much cpu time it takes, maybe this week (I am working with other things, but efficiency is important to me, so I will certainly do it).

When I have some results I'll let you know. Thank you all for all the enriching comments. I have some technical doubts, but I think I have to read a little bit from the links posted by Jim Mcnamara before asking. Thank you very much, I really enjoy when assisted by experts like you, and this help has a tremendous value in so many senses to me, intellectually, professionally too, and at many other levels. So thanks, really.
 
Learn If you want to write code for Python Machine learning, AI Statistics/data analysis Scientific research Web application servers Some microcontrollers JavaScript/Node JS/TypeScript Web sites Web application servers C# Games (Unity) Consumer applications (Windows) Business applications C++ Games (Unreal Engine) Operating systems, device drivers Microcontrollers/embedded systems Consumer applications (Linux) Some more tips: Do not learn C++ (or any other dialect of C) as a...

Similar threads

  • · Replies 20 ·
Replies
20
Views
2K
  • · Replies 20 ·
Replies
20
Views
4K
  • · Replies 22 ·
Replies
22
Views
5K
  • · Replies 2 ·
Replies
2
Views
3K
  • · Replies 1 ·
Replies
1
Views
4K
  • · Replies 4 ·
Replies
4
Views
3K
  • · Replies 21 ·
Replies
21
Views
3K
  • · Replies 7 ·
Replies
7
Views
4K
  • · Replies 3 ·
Replies
3
Views
2K
  • · Replies 1 ·
Replies
1
Views
1K