FORTRAN ?: Trying to bypass significant slowdown

blbelson · Nov 5, 2009

I have experienced a serious slowdown in program execution and traced its source to one calculation. Pseudocode below:

do a=0,360,10
  do b=0,85,5
    do c=0,360,10
      do d=0,85,5
        bunch of calculations involving double precision real variables (stored in variable ele)
        tot=tot + ele
      end do
    end do
  end do
end do

tot and ele are both double precision real variables. If I comment out "tot=tot+ele", the program takes 2m 04s to run, otherwise it takes 5m 55s to run.

I am using the ifort compiler with "-ipo -O3 -static -xP -no-prec-div -inline" flags set. Does anyone have an explanation as to why this occurs (compiler optimization issue for example) and if there is a way to prevent it?

Mark44 · Nov 5, 2009

Each statement in your innermost loop runs 37*18*37*18 times, or 443,556 times (a loop from 0 to 360 in steps of 10 includes both endpoints, so it runs 37 times, and 0 to 85 in steps of 5 runs 18 times). If you can eliminate one or more of your loops, that might speed things up. For example, you can "unroll" the innermost loop by writing a separate statement or block of statements for each value of the loop counter d: the first with d = 0, the second with d = 5, and so on, up to 85.

blbelson · Nov 5, 2009

Thanks Mark44. I believe the compiler simply skipped the loop when I commented out the line, since the loop served no purpose in that instance. I was afraid I'd have to tackle the problem the way you suggested, but I took a hybrid approach and wrote a subroutine, since there are quite a few calculations in that block (realizing the performance boost may be less than writing each line of code in the main body). Doing so reduced program execution time to 4m 22s. Thanks for the kickstart...

Mark44 · Nov 5, 2009

Calling a subroutine hundreds of thousands of times might be slower than having the calculations inline in the innermost loop, because each subroutine call and return carries some overhead. You might compare the execution times with a subroutine call vs. having the code inline in the inner loop.

Borek · Nov 6, 2009

Perhaps you can move some of the calculations from the inner loop to the outer ones, and save intermediate results in temp variables? Although the compiler most likely does this already.
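A minimal sketch of that idea, assuming (purely for illustration) that the integrand splits into a factor depending only on a and b and a factor depending only on c and d. The names outer_part, inner_part, naive_total and hoisted_total are made up; the real calculations are not shown in this thread. The loop bounds are copied from the pseudocode in the first post.

```fortran
module hoist_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), parameter :: deg2rad = 3.141592653589793_dp / 180.0_dp
contains

  ! Hypothetical factor that depends only on the outer angles (a, b).
  pure function outer_part(a, b) result(f)
    integer, intent(in) :: a, b
    real(dp) :: f
    f = cos(b * deg2rad) * (1.0_dp + 0.5_dp * sin(a * deg2rad))
  end function outer_part

  ! Hypothetical factor that depends only on the inner angles (c, d).
  pure function inner_part(c, d) result(f)
    integer, intent(in) :: c, d
    real(dp) :: f
    f = cos(d * deg2rad)**2 + 0.1_dp * cos(c * deg2rad)
  end function inner_part

  ! Naive version: both factors are recomputed in the innermost loop.
  function naive_total() result(tot)
    real(dp) :: tot
    integer :: a, b, c, d
    tot = 0.0_dp
    do a = 0, 360, 10
      do b = 0, 85, 5
        do c = 0, 360, 10
          do d = 0, 85, 5
            tot = tot + outer_part(a, b) * inner_part(c, d)
          end do
        end do
      end do
    end do
  end function naive_total

  ! Hoisted version: the (a, b) factor is computed once per (a, b) pair
  ! and kept in the temporary fab while the inner loops run.
  function hoisted_total() result(tot)
    real(dp) :: tot, fab
    integer :: a, b, c, d
    tot = 0.0_dp
    do a = 0, 360, 10
      do b = 0, 85, 5
        fab = outer_part(a, b)
        do c = 0, 360, 10
          do d = 0, 85, 5
            tot = tot + fab * inner_part(c, d)
          end do
        end do
      end do
    end do
  end function hoisted_total

end module hoist_demo
```

Both functions should return the same total; the hoisted one evaluates outer_part once per (a, b) pair instead of once per innermost iteration. Whether this helps in practice depends on whether the optimizer has already done the same thing.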

hamster143 · Nov 6, 2009

This looks like integration over two spheres. Can you reformulate the problem to eliminate some integrations? For example, if relevant degrees of freedom only depend on the angle between two vectors, you can get away with only one loop instead of 4.

And you shouldn't integrate to 360, you should integrate to 360-step. 0 and 360 degrees are the same direction, so including both endpoints counts that direction twice.
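To check the endpoints, here is a small sketch that counts DO-loop iterations using the standard trip-count rule, max((last - first + step) / step, 0) with integer division. The function name trip_count is my own, not something from the original code.

```fortran
module trip_demo
  implicit none
contains

  ! Number of iterations of "do i = first, last, step" for integer bounds,
  ! following the Fortran rule for DO-loop iteration counts.
  pure function trip_count(first, last, step) result(n)
    integer, intent(in) :: first, last, step
    integer :: n
    n = max((last - first + step) / step, 0)
  end function trip_count

end module trip_demo
```

With this, trip_count(0, 360, 10) is 37 while trip_count(0, 350, 10) is 36, and trip_count(0, 85, 5) is 18. So the nest as posted runs 37*18*37*18 = 443,556 times, and 36*18*36*18 = 419,904 times once the upper limit is 360-step.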

Also consider doing this in C.

rcgldr · Nov 6, 2009

blbelson said:

If I comment out "tot=tot+ele", the program takes 2m 04s to run, otherwise it takes 5m 55s to run.

Your program probably ran out of registers to use, and perhaps the compiler didn't prioritize inner variables over outer variables. It should help to declare those loop counters as integers, which use a different set of registers from the floating point values.
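A sketch of what that could look like, assuming the angle is needed as a double precision value inside the loop: keep the counter an integer and derive the angle from it. The name angle_sum and the trivial loop body are placeholders, not the original calculation.

```fortran
module counter_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Sums the angles 0, step, 2*step, ... below 360 degrees using an
  ! integer loop counter; the double precision angle is derived from it.
  function angle_sum(step_deg) result(tot)
    integer, intent(in) :: step_deg
    real(dp) :: tot, angle
    integer :: i, n
    n = 360 / step_deg                 ! number of steps in one full turn
    tot = 0.0_dp
    do i = 0, n - 1
      angle = real(i * step_deg, dp)   ! derived from the integer counter
      tot = tot + angle
    end do
  end function angle_sum

end module counter_demo
```

Deriving the angle from the integer counter also avoids accumulating rounding error, which can happen if a real variable is incremented by the step on each pass.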

silverfrost · Nov 6, 2009

> Also consider doing this in C.

Why? C is not intrinsically any faster.

minger · Nov 6, 2009

Right, C is not faster. As a first thought, make sure that your DO loops are in the right "order". You want your innermost loop to run over the first index of your array, e.g.

Code:

DO k=1,kmax
 DO j=1,jmax
  DO i=1,imax
    array(i,j,k) = something
  END DO
 END DO
END DO

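This is much faster than having the DO loops in the opposite order, because Fortran stores arrays in column-major order, so consecutive values of the first index are adjacent in memory.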
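If you want to check this on your own machine, here is a rough sketch that fills the same array with both loop orders and times each one with the cpu_time intrinsic. The routine names and the fill formula are arbitrary choices of mine. Actual timings depend on the compiler, flags and cache sizes, and an optimizer may interchange the loops itself, in which case the difference disappears.

```fortran
module order_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Innermost loop runs over the first index: consecutive memory accesses.
  subroutine fill_first_index_inner(array)
    real(dp), intent(out) :: array(:,:,:)
    integer :: i, j, k
    do k = 1, size(array, 3)
      do j = 1, size(array, 2)
        do i = 1, size(array, 1)
          array(i, j, k) = real(i + j + k, dp)
        end do
      end do
    end do
  end subroutine fill_first_index_inner

  ! Innermost loop runs over the last index: large strides through memory.
  subroutine fill_last_index_inner(array)
    real(dp), intent(out) :: array(:,:,:)
    integer :: i, j, k
    do i = 1, size(array, 1)
      do j = 1, size(array, 2)
        do k = 1, size(array, 3)
          array(i, j, k) = real(i + j + k, dp)
        end do
      end do
    end do
  end subroutine fill_last_index_inner

  ! Reports the CPU time in seconds used by one fill in the chosen order.
  subroutine time_fill(array, inner_first, seconds)
    real(dp), intent(out) :: array(:,:,:)
    logical, intent(in) :: inner_first
    real(dp), intent(out) :: seconds
    real(dp) :: t0, t1
    call cpu_time(t0)
    if (inner_first) then
      call fill_first_index_inner(array)
    else
      call fill_last_index_inner(array)
    end if
    call cpu_time(t1)
    seconds = t1 - t0
  end subroutine time_fill

end module order_demo
```

Call time_fill once with inner_first set to .true. and once with .false. on a large array (a few hundred elements per dimension) and compare the two times. Both orders produce identical array contents; only the memory access pattern differs.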

Also, as a test (really not sure if it will be faster or not), you can try using the SUM array intrinsic, e.g.

Code:

DO a=1,amax
 DO b=1,bmax
  DO c=1,cmax
   DO d=1,dmax
    elem(d,c,b,a) = whatever
   END DO
  END DO
 END DO
END DO

tot = SUM(elem)
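For concreteness, a sketch of the two variants side by side. The function ele_value is a made-up stand-in for the real calculation, which isn't shown in this thread, and the loop counts of 36 and 18 are my assumption (0 to 350 in steps of 10, 0 to 85 in steps of 5).

```fortran
module sum_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: na = 36, nb = 18   ! assumed loop counts
contains

  ! Hypothetical stand-in for the "bunch of calculations".
  pure function ele_value(ia, ib, ic, id) result(ele)
    integer, intent(in) :: ia, ib, ic, id
    real(dp) :: ele
    ele = 1.0_dp / real(1 + ia + 2*ib + 3*ic + 4*id, dp)
  end function ele_value

  ! Variant 1: running total updated inside the loop nest.
  function total_running() result(tot)
    real(dp) :: tot
    integer :: ia, ib, ic, id
    tot = 0.0_dp
    do ia = 1, na
      do ib = 1, nb
        do ic = 1, na
          do id = 1, nb
            tot = tot + ele_value(ia, ib, ic, id)
          end do
        end do
      end do
    end do
  end function total_running

  ! Variant 2: store every element, then reduce once with SUM.
  ! The innermost loop index is the first array index.
  function total_with_sum() result(tot)
    real(dp) :: tot
    real(dp), allocatable :: elem(:,:,:,:)
    integer :: ia, ib, ic, id
    allocate(elem(nb, na, nb, na))
    do ia = 1, na
      do ib = 1, nb
        do ic = 1, na
          do id = 1, nb
            elem(id, ic, ib, ia) = ele_value(ia, ib, ic, id)
          end do
        end do
      end do
    end do
    tot = sum(elem)
  end function total_with_sum

end module sum_demo
```

The two totals should agree to within rounding. The second variant trades the running update for about 3.4 MB of temporary storage (419,904 double precision values), so it is not obvious in advance which one is faster.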

blbelson · Nov 6, 2009

Appreciate all the responses. The portion of code in question was just one step in a very lengthy problem. I am quite pleased with the improvements you all have helped me achieve - over 50% faster on the one subroutine which was called over 12,000 times. The changes have moved my dissertation one step closer to completion, thanks!

I'll respond to each suggestion below:

1. I had already tried doing a "sum" outside the nest with no noticeable improvement in calculation time.
2. There is no way around four integrations, the integrations are not spheres but incident and reflected angles.
3. I sucked it up further and put the code in-line for comparison. It took 4m 22s just as it did with a subroutine call. I believe the compiler optimizations I chose may include making the subroutine in-line anyway. I returned to using the subroutine.
4. Great catch on the 360-step. That was a mistake that I did not catch. With that change, I am down to 4m 13s...and what's better, the calculation is now right!
5. Not sure what benefit "C" would have in this project. That change is too significant to test.

hamster143 · Nov 6, 2009

blbelson said:

There is no way around four integrations, the integrations are not spheres but incident and reflected angles.

Does the system possess rotational symmetry? If it does, one integration over 0 to 360 can be eliminated.
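To illustrate what I mean, here is a sketch under the assumption (which may not hold for your problem) that the integrand depends on the two azimuth angles only through their difference. The function g is a made-up function of that difference, and the two elevation loops are left out to keep it short.

```fortran
module symmetry_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), parameter :: deg2rad = 3.141592653589793_dp / 180.0_dp
contains

  ! Hypothetical integrand that only sees the azimuth difference (degrees).
  pure function g(delta_deg) result(val)
    integer, intent(in) :: delta_deg
    real(dp) :: val
    val = 2.0_dp + cos(delta_deg * deg2rad) &
          + 0.3_dp * sin(2.0_dp * delta_deg * deg2rad)
  end function g

  ! Full double loop over both azimuth angles: 36 * 36 evaluations of g.
  function double_loop() result(tot)
    real(dp) :: tot
    integer :: a, c
    tot = 0.0_dp
    do a = 0, 350, 10
      do c = 0, 350, 10
        tot = tot + g(modulo(a - c, 360))
      end do
    end do
  end function double_loop

  ! Same total from a single loop over the difference: 36 evaluations.
  ! Each difference occurs once for every value of a, hence the factor 36.
  function single_loop() result(tot)
    real(dp) :: tot
    integer :: delta
    tot = 0.0_dp
    do delta = 0, 350, 10
      tot = tot + g(delta)
    end do
    tot = 36.0_dp * tot
  end function single_loop

end module symmetry_demo
```

The two functions agree to within rounding, but the second needs 36 evaluations of g instead of 1296. This only works when the loop covers exactly one full turn (0 to 360-step) and the integrand really is a function of the difference alone.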

blbelson said:

Not sure what benefit "C" would have in this project. That change is too significant to test.

Being a somewhat lower-level language, C is considerably faster than Fortran (up to 2-3 times, depending on task) and that makes it more suited for heavy numerical programming.

http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=gcc&lang2=ifc&box=1

rcgldr · Nov 6, 2009

hamster143 said:

Being a somewhat lower-level language, C is considerably faster than Fortran.

It really depends on the compiler. Comparing a bad Fortran compiler with a good C compiler isn't fair. In the case of some Cray supercomputers (like the nearly extinct X1 and X1E series vector processing machines), the Fortran compiler is faster than C, partly because some vendor-specific extensions were made to the Fortran language used on the Cray, and partly because that particular Fortran compiler optimizes very well on the Cray supercomputer. Newer Cray systems are supposed to combine Intel or AMD CPUs with specialized vector math units, but I don't know how many of these have been made.

I don't know what options there are in terms of Fortran compilers for PC-based systems, or which of these, if any, do a really good job of optimizing code, or if in this case the required floating point calculations simply can't be optimized beyond some basic level.

blbelson · Nov 9, 2009

hamster143 said:

Does the system possess rotational symmetry? If it does, one integration over 0 to 360 can be eliminated.

The system is not rotationally symmetric. The integrations are over all possible incident and reflected angles.