Fortran: Trying to bypass significant slowdown

  • Thread starter: blbelson
  • Tags: Fortran
AI Thread Summary
The discussion centers around a significant slowdown in program execution linked to a specific calculation in a nested loop structure using double precision real variables. The execution time increases dramatically when adding a summation line within the innermost loop. Suggestions for optimization include loop unrolling, moving calculations outside the innermost loop, and using subroutines. However, calling a subroutine multiple times can introduce overhead, potentially negating performance gains. The compiler optimizations in use may also influence performance, and the order of loops can affect execution speed. Adjustments to the integration limits and ensuring the correct loop structure have led to improved execution times, with one participant achieving over 50% faster performance. The conversation also touches on the potential benefits of using C over Fortran, though opinions vary on its necessity and efficiency for this specific problem. Ultimately, the focus remains on refining the calculation process while addressing the constraints of the problem's structure.
blbelson
I have experienced a serious slowdown in program execution and traced its source to one calculation. Pseudocode below:

do a=0,360,10
  do b=0,85,5
    do c=0,360,10
      do d=0,85,5
        bunch of calculations involving double precision real variables (stored in variable ele)
        tot=tot + ele
      end do
    end do
  end do
end do

tot and ele are both double precision real variables. If I comment out "tot=tot+ele", the program takes 2m 04s to run, otherwise it takes 5m 55s to run.

I am using the ifort compiler with "-ipo -O3 -static -xP -no-prec-div -inline" flags set. Does anyone have an explanation as to why this occurs (compiler optimization issue for example) and if there is a way to prevent it?
 
With the loop bounds as written (both end points included), each statement in your innermost loop is running 37*18*37*18 times, or 443,556 times. If you can eliminate one or more of your loops, that might speed things up. For example, you can "unroll" the innermost loop by having a separate statement or block of statements for each of the values of your loop counter variable, d. The first one would be with d = 0, the second one with d = 5, and so on, up to 85.
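The iteration count is easy to check directly, because a Fortran DO loop includes its upper bound whenever the step lands on it exactly. A minimal sketch, assuming only the loop bounds from the opening post (the program name and the counter n are made up for illustration):

```fortran
program count_iterations
  implicit none
  integer :: a, b, c, d
  integer :: n

  n = 0
  ! Same bounds and steps as the pseudocode in the opening post
  do a = 0, 360, 10
    do b = 0, 85, 5
      do c = 0, 360, 10
        do d = 0, 85, 5
          n = n + 1        ! stands in for the body of the innermost loop
        end do
      end do
    end do
  end do

  print '(a,i0)', 'innermost body executed: ', n
end program count_iterations
```

The loops over 0 to 360 in steps of 10 run 37 times each and the loops over 0 to 85 in steps of 5 run 18 times each, so the counter comes out at 37*18*37*18 = 443,556.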
 
Thanks Mark 44, I believe the compiler simply skipped the loop when I commented out the line since the loop served no purpose in that instance. I was afraid I'd have to tackle the problem in the way you suggested above but I took a hybrid approach and wrote a subroutine since there are quite a few calculations in that block (realizing the performance boost may be less than writing each line of code in the main body). Doing so reduced program execution time to 4m 22s. Thanks for the kickstart...
 
Calling a subroutine 443,556 times might be slower than having the calculations inline in the innermost loop. There is some overhead associated with a subroutine call and return. You might compare the execution times with a subroutine call vs. having the code inline in the inner loop.
 
Perhaps you can move some of the calculations from the inner loop to the outer ones, and save intermediate results in temp variables? Although the compiler has most likely done this already.
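A rough sketch of what that hoisting could look like. The actual calculation was never posted, so the integrand here (a product of one trigonometric factor per angle) and the names sin_a, ab and abc are purely hypothetical; the point is only that anything depending on a, b or c alone is computed once per iteration of its own loop rather than once per innermost iteration.

```fortran
program hoist_invariants
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), parameter :: deg2rad = 3.141592653589793_dp / 180.0_dp
  integer :: a, b, c, d
  real(dp) :: sin_a, ab, abc, tot

  tot = 0.0_dp
  do a = 0, 360, 10
    sin_a = sin(a * deg2rad)**2          ! depends on a only
    do b = 0, 85, 5
      ab = sin_a * cos(b * deg2rad)      ! depends on a and b only
      do c = 0, 360, 10
        abc = ab * sin(c * deg2rad)**2   ! depends on a, b and c only
        do d = 0, 85, 5
          ! only the d-dependent factor is evaluated in the innermost loop
          tot = tot + abc * cos(d * deg2rad)
        end do
      end do
    end do
  end do

  print *, 'tot = ', tot
end program hoist_invariants
```

Whether this helps in practice depends on how much of the real calculation can be separated this way, and an optimizing compiler may already perform part of it.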
 
This looks like integration over two spheres. Can you reformulate the problem to eliminate some integrations? For example, if relevant degrees of freedom only depend on the angle between two vectors, you can get away with only one loop instead of 4.

And you shouldn't integrate to 360, you should integrate to 360-step.

Also consider doing this in C.
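The 360-step remark can be illustrated with a small sketch. It assumes a simple rectangle-rule sum and uses cos^2 as a stand-in integrand (the real integrand was not posted); 0 and 360 degrees are the same direction, so a loop that includes both counts that direction twice.

```fortran
program periodic_endpoint
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), parameter :: deg2rad = 3.141592653589793_dp / 180.0_dp
  integer, parameter :: step = 10
  integer :: a
  real(dp) :: with_endpoint, without_endpoint

  with_endpoint = 0.0_dp
  do a = 0, 360, step            ! 37 samples: 0 and 360 are both included
    with_endpoint = with_endpoint + cos(a * deg2rad)**2 * step
  end do

  without_endpoint = 0.0_dp
  do a = 0, 360 - step, step     ! 36 samples: each direction counted once
    without_endpoint = without_endpoint + cos(a * deg2rad)**2 * step
  end do

  ! The exact integral of cos^2 over one full turn is 180 (in degrees).
  print '(a,f8.3)', 'sum to 360:      ', with_endpoint
  print '(a,f8.3)', 'sum to 360-step: ', without_endpoint
end program periodic_endpoint
```

For this integrand the sum that stops at 360-step reproduces the exact value of 180, while the sum that runs to 360 is too large by one extra sample, giving 190.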
 
blbelson said:
If I comment out "tot=tot+ele", the program takes 2m 04s to run, otherwise it takes 5m 55s to run.
Your program probably ran out of registers to use, and perhaps the compiler didn't prioritize the inner-loop variables over the outer ones. It may help to declare the loop counters as integers, which use a different set of registers from the floating-point values.
 
> Also consider doing this in C.

Why? C is not intrinsically any faster.
 
Right, C is not faster. As a first thought, make sure that your DO loops are in the right "order". You want your innermost loop to run over the first index of your array, e.g.
Code:
DO k=1,kmax
 DO j=1,jmax
  DO i=1,imax
    array(i,j,k) = something
  END DO
 END DO
END DO
is much faster than having the DO loops in the opposite order.
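The reason is that Fortran stores arrays in column-major order, so consecutive values of the first index sit next to each other in memory. A rough way to see the effect is to time both orders; the sketch below assumes a hypothetical array size n = 200 and uses the standard system_clock intrinsic. Timings vary by machine, and an optimizing compiler may interchange the loops itself, in which case the two numbers will be close.

```fortran
program loop_order
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n = 200          ! hypothetical size, about 64 MB of data
  real(dp), allocatable :: array(:,:,:)
  integer :: i, j, k
  integer(i8) :: t0, t1, t2, rate

  allocate(array(n, n, n))
  array = 0.0_dp                         ! touch the memory before timing

  call system_clock(t0, rate)
  do k = 1, n                            ! first index innermost: contiguous access
    do j = 1, n
      do i = 1, n
        array(i, j, k) = real(i + j + k, dp)
      end do
    end do
  end do
  call system_clock(t1)
  do i = 1, n                            ! last index innermost: strided access
    do j = 1, n
      do k = 1, n
        array(i, j, k) = real(i + j + k, dp)
      end do
    end do
  end do
  call system_clock(t2)

  print '(a,f8.4,a)', 'first index innermost: ', real(t1 - t0, dp) / real(rate, dp), ' s'
  print '(a,f8.4,a)', 'last index innermost:  ', real(t2 - t1, dp) / real(rate, dp), ' s'
  print *, 'checksum: ', sum(array)      ! keeps the stores from being optimized away
end program loop_order
```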

Also, as a test (really not sure if it will be faster or not), you can try using the SUM array intrinsic, e.g.
Code:
DO a = ...
 DO b = ...
  DO c = ...
   DO d = ...
    elem(d,c,b,a) = whatever
   END DO
  END DO
 END DO
END DO

tot = SUM(elem)
 
  • #10
Appreciate all the responses. The portion of code in question was just one step in a very lengthy problem. I am quite pleased with the improvements you all have helped me achieve - over 50% faster on the one subroutine which was called over 12,000 times. The changes have moved my dissertation one step closer to completion, thanks!

I'll respond to each suggestion below:

1. I had already tried doing a "sum" outside the nest with no noticeable improvement in calculation time.
2. There is no way around four integrations, the integrations are not spheres but incident and reflected angles.
3. I sucked it up further and put the code in-line for comparison. It took 4m 22s just as it did with a subroutine call. I believe the compiler optimizations I chose may include making the subroutine in-line anyway. I returned to using the subroutine.
4. Great catch on the 360-step. That was a mistake that I did not catch. With that change, I am down to 4m 13s...and what's better, the calculation is now right!
5. Not sure what benefit "C" would have in this project. That change is too significant to test.
 
  • #11
> There is no way around four integrations, the integrations are not spheres but incident and reflected angles.

Does the system possess rotational symmetry? If it does, one integration over 0 to 360 can be eliminated.

> Not sure what benefit "C" would have in this project. That change is too significant to test.

Being a somewhat lower-level language, C is considerably faster than Fortran (up to 2-3 times, depending on task) and that makes it more suited for heavy numerical programming.

http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=gcc&lang2=ifc&box=1
 
  • #12
hamster143 said:
Being a somewhat lower-level language, C is considerably faster than Fortran.
It really depends on the compiler. Comparing a bad Fortran compiler with a good C compiler isn't fair. In the case of some Cray supercomputers (like the nearly extinct X1/X1E series vector processing machines), the Fortran compiler is faster than C, partly because some vendor-specific extensions were made to the Fortran language used on the Cray, and partly because that particular Fortran compiler optimizes very well on the Cray supercomputer. Newer Cray systems are supposed to combine Intel or AMD CPUs with specialized vector math units, but I don't know how many of these have been made.

I don't know what options there are in terms of Fortran compilers for PC-based systems, which of these, if any, do a really good job of optimizing code, or whether in this case the required floating-point calculations simply can't be optimized beyond some basic level.
 
  • #13
hamster143 said:
Does the system possesses rotational symmetry? If it does, one integration over 0 to 360 can be eliminated.

The system is not rotationally symmetric. The integrations are over all possible incident and reflected angles.
 
