In line vs. In function produces different results in C++

maverick_starstrider · Aug 5, 2009

Hi, I'm working on a code where I essentially have two very large arrays, q and b, and a function Hamiltonian(double * v1, double * v2) which takes pointers to the two arrays and does all sorts of things with them. If I write my code using Hamiltonian(q,b), it runs FASTER than if I copy and paste everything that was in the function Hamiltonian and plunk it into my main function. Why should this be? Doesn't the compiler just put it inline anyway when it compiles to assembly?

mgb_phys · Aug 5, 2009

No, when it compiles the code it separates out the function and calls it when it's needed. Inlining only saves you the cost of putting the arguments on the stack, jumping to the function, and reading them back: basically a clock cycle on a modern CPU.

I can't see a reason that putting the code into the main function would speed things up. Did you only test it once? Could the contents just have been cached?

ps. Unless you are doing something dumb with the addresses you pass, like copying the values into another array one at a time.

maverick_starstrider · Aug 5, 2009

mgb_phys said:

No, when it compiles the code it separates out the function and calls it when it's needed. Inlining only saves you the cost of putting the arguments on the stack, jumping to the function, and reading them back: basically a clock cycle on a modern CPU.

I can't see a reason that putting the code into the main function would speed things up. Did you only test it once? Could the contents just have been cached?

ps. Unless you are doing something dumb with the addresses you pass, like copying the values into another array one at a time.

The gist of the relevant part of my main statement is:

void Hamiltonian(double *v1, double *v2) {
    // blah blah blah
}

int main() {
    // new[] returns a pointer, so q and b must be declared as double*
    double *q = new double[bigNumber];
    double *b = new double[bigNumber];
    Hamiltonian(q, b);
    // *garbage collection*
}

It runs faster than when I take blah blah blah and just plop it in the code (replacing all references to v1 and v2 with q and b, obviously). I've run both versions about 10 times and there is a noticeable speed-up.

P.S. it's when my code uses the function that it SPEEDS UP. When I just plop it in the code IT IS SLOWER. Thus my conundrum.

mXSCNT · Aug 5, 2009

I suspect that either there's something funny going on with your compiler's optimization (try using maximum optimization) or there's something going on in the "blah blah blah" part (maybe the inlined version does not do the same thing as the other version, due to some mistake).
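One way to rule out the second possibility is to run both variants on identical copies of the input and compare the outputs. The following is only a minimal sketch: `hamiltonian_stub`, `variants_agree`, and the loop body are hypothetical stand-ins for the real "blah blah blah", which the thread does not show.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real Hamiltonian body ("blah blah blah").
void hamiltonian_stub(double *v1, double *v2, std::size_t n) {
    assert(n == 0 || (v1 != nullptr && v2 != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        v2[i] += 2.0 * v1[i];
    }
}

// Runs the function version and a hand-pasted copy of the same body on
// identical inputs, and reports whether the results match exactly.
bool variants_agree(std::size_t n) {
    std::vector<double> q1(n), b1(n);
    for (std::size_t i = 0; i < n; ++i) {
        q1[i] = 0.5 * static_cast<double>(i);
        b1[i] = 1.0;
    }
    std::vector<double> q2 = q1;
    std::vector<double> b2 = b1;

    // Variant A: call the function.
    hamiltonian_stub(q1.data(), b1.data(), n);

    // Variant B: the same body pasted in place, with v1/v2 renamed to q/b.
    for (std::size_t i = 0; i < n; ++i) {
        b2[i] += 2.0 * q2[i];
    }

    return q1 == q2 && b1 == b2;
}
```

If the two variants disagree on real data, the timing difference is probably a symptom of a transcription mistake rather than of anything the compiler did.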

fleem · Aug 5, 2009

There are only a handful of CPU registers that can be used as variables. When you plop all the code inline in the same block, the compiler assumes all variables must be available to all statements, so it can't trade off the usage of CPU registers. So it runs out of CPU registers and has to start using RAM for variable space (which ends up being L1 cache unless you're messing with enough data to fill it up, then it goes to the L2 cache, then main storage). When you call a function, local variables are pushed on the stack once before the call, then the function has all those CPU registers free to use, then when the function returns the variables are popped off the stack back into the CPU registers. This doesn't mean functions are always faster! It's just that in your case the number of variables needed, and the amount of time it takes to push/pop etc., all worked out so that it happened to be faster to push the variables and give the function more CPU registers to use. That's why it's usually best to let the optimizer decide whether to inline functions.

mXSCNT · Aug 5, 2009

Actually, fleem, an optimizing compiler will detect when a variable does not need to be in a register. http://en.wikipedia.org/wiki/Register_allocation

Anyway, in this case that wouldn't be an issue even when no optimization is attempted, since (from what he has written) there does not appear to be a significant number of extra variables in the main() frame. However, he didn't even fill in the arguments to main, so I think there may be a lot he's not telling us.

rcgldr · Aug 5, 2009

Depends on the compiler and the cpu. In this case it appears that you end up with more variables in registers than memory if you use the function. You could try inlining the main function and putting the garbage collection stuff in another function.

maverick_starstrider · Aug 6, 2009

mXSCNT said:

Actually, fleem, an optimizing compiler will detect when a variable does not need to be in a register. http://en.wikipedia.org/wiki/Register_allocation

Anyway, in this case that wouldn't be an issue even when no optimization is attempted, since (from what he has written) there does not appear to be a significant number of extra variables in the main() frame. However, he didn't even fill in the arguments to main, so I think there may be a lot he's not telling us.

Oh ya. There are a lot of variables in my main. And I actually never create any new variables in my functions. All my functions just manipulate global variables and return void. I do this because when I said q and b were really large, I meant they're going to be as large as I can possibly allocate on a given node. Thus I can't have my functions need a bunch more space, because there might be no room and my code might throw a bad allocation error in the middle of its run (rather than at the beginning, like it does now).

And yes. My code is actually over 1000 lines so I just wrote down what was relevant to the specific problem.

Anywho, thanks for the help.

fleem · Aug 6, 2009

mXSCNT said:

Actually, fleem, an optimizing compiler will detect when a variable does not need to be in a register. http://en.wikipedia.org/wiki/Register_allocation

Anyway, in this case that wouldn't be an issue even when no optimization is attempted, since (from what he has written) there does not appear to be a significant number of extra variables in the main() frame. However, he didn't even fill in the arguments to main, so I think there may be a lot he's not telling us.

The optimizer often fails to do this reasonably when there are loops in the code for which the optimizer cannot know how many iterations there will be. Or the code is simply too complex (too much indirection, variable array indexes, etc.) for the optimizer to keep track of usage.

mXSCNT · Aug 6, 2009

Well, if any optimization is performed I'm sure the arguments to main() will not go in registers since they are not used at all, and then (assuming the "blah blah blah" part is accurately transcribed and there aren't any extra variables above that point which maverick_starstrider didn't mention) there is the same number of variables in the inlined version as there is in the function call.

maverick_starstrider · Aug 6, 2009

mXSCNT said:

Well, if any optimization is performed I'm sure the arguments to main() will not go in registers since they are not used at all, and then (assuming the "blah blah blah" part is accurately transcribed and there aren't any extra variables above that point which maverick_starstrider didn't mention) there is the same number of variables in the inlined version as there is in the function call.

Yes, all my functions only manipulate global variables that were declared at the very start of the code.

mathmate · Aug 7, 2009

Could it also be that when the code is plugged in the main, many temporary variables cannot be released, while when tucked into a method, garbage collection is sure to get rid of the unused local variables, and probably faster?

Hurkyl · Aug 8, 2009

On running out of memory...

Local variables are often not put into memory at all -- they are frequently completely ephemeral things whose lifespan is spent entirely in registers. Even when that doesn't happen, local variables are placed onto the stack, which may be a different part of memory than the one that new accesses. The same is true of array variables -- ones declared as

int foo[47]; // This goes on the stack, not the heap

Incidentally, this is why on many computers, the declaration

int bigarray[1 << 22]; // I overflow the stack! Haha!

will cause a segmentation fault (because the OS limits how big the stack can be, and this exceeds that limit), whereas

int *bigarray = new int[1 << 22]; // I live on the heap. It's very roomy here

will succeed.

Unless you are doing something very silly -- e.g. allocating lots of large arrays on the stack or extremely deeply nested function calls each with lots of local state -- you will never run out of memory because you allocated local variables. (Unless you intentionally decreased the amount of stack space your program allocates, or are on a peculiar architecture that likes to have tiny stacks)
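To make the stack/heap split concrete, here is a small sketch under stated assumptions: `try_allocate` and `sum_first_squares` are hypothetical names, not anything from the original program. The large array is requested from the heap once, up front, with `new (std::nothrow)` so that a failed request shows up as a null pointer at startup, while the small scratch array lives on the stack.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>

// Tries to allocate n doubles on the heap. Returns an empty pointer instead
// of throwing std::bad_alloc if the request cannot be satisfied.
std::unique_ptr<double[]> try_allocate(std::size_t n) {
    return std::unique_ptr<double[]>(new (std::nothrow) double[n]);
}

// Small fixed-size scratch lives on the stack and needs no heap allocation.
// Sums the squares of at most the first 8 elements of data.
double sum_first_squares(const double *data, std::size_t n) {
    assert(n == 0 || data != nullptr);
    double scratch[8] = {0.0};  // 64 bytes of stack, released on return
    const std::size_t m = (n < 8) ? n : 8;
    for (std::size_t i = 0; i < m; ++i) {
        scratch[i] = data[i] * data[i];
    }
    double total = 0.0;
    for (std::size_t i = 0; i < m; ++i) {
        total += scratch[i];
    }
    return total;
}
```

A caller would check the result of `try_allocate` once at the start of the run and bail out there; the stack scratch inside the function does not compete with that heap allocation.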

Aside from the programming drawbacks of global variables being used in that way, you will often get better performance by having your scalar variables -- or even small to medium sized arrays -- be local variables. If your function needs lots of scratch space and you really are concerned about functions allocating their own memory, you can pass in a buffer as an argument, such as this routine to multiply multi-precision integers:

// result and buffer must be arrays of xwords * ywords elements
void poorly_implemented_high_school_multiply(unsigned long *result, size_t *rwords, const unsigned long *x, size_t xwords, const unsigned long *y, size_t ywords, unsigned long *buffer);
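The same caller-provided-buffer pattern, shown on a deliberately simpler routine than the multiply above. This is a hypothetical sketch (`smooth_in_place` is an invented example, not part of the original code): the function needs n doubles of temporary space but never allocates, so the caller can reserve that space once at startup and reuse it on every call.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example: three-point moving average computed in place.
// scratch must point to at least n doubles owned by the caller; the
// function itself performs no heap allocation. End points reuse their
// own value for the missing neighbour.
void smooth_in_place(double *data, std::size_t n, double *scratch) {
    assert(n == 0 || (data != nullptr && scratch != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        const double left = (i > 0) ? data[i - 1] : data[i];
        const double right = (i + 1 < n) ? data[i + 1] : data[i];
        scratch[i] = (left + data[i] + right) / 3.0;
    }
    // Copy the smoothed values back over the input.
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = scratch[i];
    }
}
```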

Hurkyl · Aug 8, 2009

maverick_starstrider said:

And yes. My code is actually over 1000 lines so I just wrote down what was relevant to the specific problem.

Are you absolutely, positively sure? It's astonishingly easy to overlook relevant things when it comes to high-performance code... You won't believe how many times I've found errors, slow bits, or other problems in mine and others' code by checking for something that was seemingly impossible.

Possibly silly question -- you are compiling with optimization flags turned on, right?

Another possibly silly question -- how are you timing things? That can be surprisingly tricky to get right as well.

Hurkyl · Aug 8, 2009

Incidentally, the drawback (at least, the only one I really know) to inlining (large) functions is bloated code size -- among the negative effects it could potentially have are:

* The optimizer might have a more difficult time optimizing one giant function rather than several smaller functions.
(But do keep in mind the converse problem -- modularity can sometimes make it impossible for the optimizer to do things)

* More memory traffic -- more bandwidth to the L2 cache (or memory) to fetch instructions, and more opportunities for icache misses.
(But it's hard to imagine how this would happen if your function is called exactly once)

* Confuse the icache controller? I know very, very little about how the hardware on modern processors manage instruction data -- so I still find it plausible that some systems might have an easier time when dealing with a function call rather than with inlined instructions.
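If you want to test these effects rather than guess at them, you can ask the compiler to keep one copy of a routine out of line and compare it against a copy the optimizer is free to inline. The sketch below is an assumption-laden illustration: the attributes are compiler extensions (GCC/Clang and MSVC spellings), not standard C++, and `SKETCH_NOINLINE` and the two `dot_*` functions are hypothetical names. Whether a given compiler, such as the one behind the mpiCC wrapper, honours them would need checking.

```cpp
#include <cassert>
#include <cstddef>

// "Do not inline" marker. These are compiler extensions; on compilers that
// match neither branch the macro expands to nothing.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_NOINLINE __attribute__((noinline))
#elif defined(_MSC_VER)
#define SKETCH_NOINLINE __declspec(noinline)
#else
#define SKETCH_NOINLINE
#endif

// Hypothetical kernel kept out of line, so the call stays a real call.
SKETCH_NOINLINE double dot_out_of_line(const double *a, const double *b,
                                       std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        s += a[i] * b[i];
    }
    return s;
}

// Same body, left for the optimizer to inline or not as it sees fit.
inline double dot_maybe_inlined(const double *a, const double *b,
                                std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        s += a[i] * b[i];
    }
    return s;
}
```

Timing the two against each other on the real workload isolates the effect of inlining from any accidental differences introduced by copying and pasting the body.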

maverick_starstrider · Aug 8, 2009

Hurkyl said:

Are you absolutely, positively sure? It's astonishingly easy to overlook relevant things when it comes to high-performance code... You won't believe how many times I've found errors, slow bits, or other problems in mine and others' code by checking for something that was seemingly impossible.

Possibly silly question -- you are compiling with optimization flags turned on, right?

Another possibly silly question -- how are you timing things? That can be surprisingly tricky to get right as well.

The compiler arguments are actually hidden from me, although I assume optimization flags are turned on. This is because, in the environment my code runs in, all the standard compiler commands (for example mpiCC -o blah blah.cpp) are just wrappers for a proprietary compiler (PathScale, I think), and the contents of the wrapper are hidden from me.

I'm timing things by just using MPI's MPI_Wtime() function which according to the specification assures accuracy on all implementations. I just get the MPI_Wtime() at the beginning of my main and then subtract that from the value at the end of my main.
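Timing all of main once also counts allocation and start-up, and a single run says little about run-to-run noise. A rough sketch of an alternative is to time only the region of interest several times and look at the spread. Note the assumptions: `time_runs` is a hypothetical helper, and `std::chrono::steady_clock` is used here in place of MPI_Wtime() only so that the sketch runs without MPI; the same start/stop pattern applies with MPI_Wtime().

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Calls work() repeats times and returns the per-run wall-clock durations
// in seconds, sorted from fastest to slowest.
std::vector<double> time_runs(const std::function<void()> &work, int repeats) {
    assert(repeats > 0);
    std::vector<double> seconds;
    seconds.reserve(static_cast<std::size_t>(repeats));
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    std::sort(seconds.begin(), seconds.end());
    return seconds;
}
```

Comparing the fastest or median run of each variant, rather than one end-to-end number, makes it easier to tell a real difference from noise caused by caching, paging, or other jobs on the node.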

harborsparrow · Aug 20, 2009

Run this on another machine and see if the unexpected speed-up still occurs. My guess is that it has something to do with how the program gets loaded into memory and paged in and out during execution, especially since large arrays of data are involved. You might not see the same result on another machine, or even on the same machine if many programs are running rather than only one.
