# Which is faster? reading from memory or arithmetic?

1. Apr 10, 2015

### Superposed_Cat

Hey all, I was writing a program just now and wondered whether it would be easier for the CPU if I stored a number before dividing it by 2, or if I divided first and multiplied by 2 later to get that number back. I know it won't really affect speed at all, but I was just wondering. Any help appreciated.

2. Apr 10, 2015

### Staff: Mentor

There are a lot of factors that determine the answer to your question.

For the division part, what sort of number are you dividing? Integer division is a lot faster than floating point division. Back in the 80s, many CPUs (Intel 8088/8086 through 80386) didn't have any native hardware support for floating point division, so division was done in software rather than in the CPU. To combat the slowness of software division, people could purchase a floating point unit (FPU). Most motherboards had an empty FPU socket that you could plug one of these units into.

For the division of an integer value by 2, a quick way to do this is to shift all of the bits one place to the right.

For storing a number, the time it takes depends on where the number is stored, which could be in an unused register (very fast), in one of the caches, in RAM, or on disk. Modern computers have several caches, which provide quicker access than regular RAM and much faster access than disk storage.

If I understand what you're asking here, I think it would be faster to store the number in a variable, divide it by 2 to get your halved value, and then retrieve the undivided number from the variable. In this scenario you have one store, one division, and one read. In your alternative, you have one division and one multiplication. In both scenarios we have a division, which if done naively, takes a lot of processor cycles. What it boils down to is whether it takes fewer cycles to do a write followed by a read vs. a multiplication. To answer this question, you need to know a lot of details about the specific CPU involved.

3. Apr 10, 2015

### Superposed_Cat

On an Intel i7, which takes more cycles?

4. Apr 10, 2015

### D H

Staff Emeritus
As Mark mentioned, the answer is not easy.

However, there is an easy answer, which is to do the obvious, at least initially.

In modern software, the main bottleneck oftentimes is not performance. It's human understandability of your code. When you finish a task and turn it over to someone else, you want your code to be understandable to your stand-in. Is your stand-in going to be able to understand your cleverness? When you finish a task and don't turn it over to someone else, your code and your cleverness will come back to haunt you six months later when someone reports a problem. Are you yourself going to be able to understand your cleverness six months from now?

Do not obscure your code with what you think are performance optimizations. (Typically, what you think is an optimization is in fact a dis-optimization.) Do not even obscure your code with what you know from experience are performance optimizations. Optimize for performance only when testing and performance analyses say you need to optimize for performance.

I'll give a specific example. A numerical computing system I worked on involved a function with nested loops in which the loop indices were used both as indices into arrays (so ideally you would want integers) and as multipliers in floating point calculations (so ideally you would want floating point values). So which to use, integer or floating point as the base type of those indices?

The obvious answer was to use an integer because that's the natural thing to use for a loop index. Accessing an array (e.g., some_array[index]) and performing a floating point calculation (e.g., index*some_floating_point_value) both work because integers automatically convert to floating point. The conversion is performed in the CPU. This was the obvious approach because the math used in the source code looked very much like the math used in the underlying technical papers.

Testing showed that our function was a performance bottleneck. More detailed analyses showed that one of the problems was in the conversion from integer to floating point. So we switched to using floating point values as our loop variables and computing the integer index. Now that conversion was a bottleneck. Next we tried making the loop variable and its floating point counterpart independent loop variables. Now the act of incrementing became a bottleneck.

Finally we tried using a lookup table, where 0 (integer) mapped to 0.0 (double), 1 to 1.0, 2 to 2.0, etc. Then we arranged memory so that this table was very likely to be in the fastest cache. This was faster than any other approach.

We would never have thought ahead of time that a memory lookup would be faster than CPU calculations. But it was. We also never would have used this approach ahead of time, even if we had known how much faster it would be than the obvious solution.

Last edited: Apr 10, 2015
5. Apr 11, 2015

### FactChecker

You may not have as much control as you think. It's hard to know what an optimizing compiler will do and what effect that has given the hardware design. Long ago, I tried to avoid memory access. It sometimes had the opposite effect that I expected. You might look at the assembly code of each version and see if anything obvious stands out. I would be surprised if your choice made any noticeable difference.

6. Apr 11, 2015

### Staff: Mentor

To understand just a tiny part of your question, look at Ulrich Drepper's paper on how CPU and memory design impact performance. An important keyword here is locality, one of the concepts FactChecker and Mark44 are indirectly referring to. This is the stuff that keeps software engineers up at night.