Fast reciprocal square root algorithm

AI Thread Summary
The discussion centers around a fast algorithm for calculating the reciprocal square root, specifically the method used in Quake III, presented by John Carmack. This algorithm significantly outperforms the standard library function, achieving results within 1% accuracy while being over one hundred times faster. The conversation also touches on various numerical approximation techniques and the historical context of such algorithms, suggesting that many numerical functions can be optimized similarly. Participants share links to research and modern modifications of the algorithm, emphasizing the importance of understanding both speed and accuracy in computational methods. The potential for further improvements in algorithm efficiency remains a key interest among contributors.
Messages
38,043
Reaction score
10,529
I ran into an interesting video on Youtube yesterday, about a fast way to compute the reciprocal of the square root of a number. I.e., to compute this function:
##f(x)= \frac 1 {\sqrt x}##
The presenter is John Carmack. If you search for "Fast Inverse Square Root — A Quake III Algorithm" you'll find it.

The video shows a technique used in Quake III to speed up the calculations for lighting and reflections, to normalize vectors to a magnitude or length of 1.
The obvious way to do the calculation as a C function might be as follows:
C:
#include <math.h>
float RecipSqrt(float num) {
    return 1/sqrt(num);
}
Such a function would produce correct values, within the limits of the float data type, but would be extremely slow. A very clever programmer at Quake came up with the following version.
C:
float RecipSqrt(float number)
{
    long i;
    float x2, y;
    const float THREEHALVES = 1.5F;

    x2 = number * 0.5F;
    y = number;
    i = *(long*)&y;                            // Evil floating point bit hack
    i = 0x5f3759df - (i >> 1);                                                    
    y = *(float*)&i;
    y = y * (THREEHALVES - (x2 * y * y));    // 1st iteration
    return y;
}
I have modified the code shown in the video slightly, changing the name of the function and removing one comment that was an acronym I don't wish to write here. Most of the code is very arcane, with no comments that explain what the code is doing. For a full explanation of how the code works, take a look at the video, starting at about 10 minutes.

I put together a program with the above function. In it I compared the resulting times to calculate the reciprocal square roots of several numbers in an array of floats using both the obvious (but slow) approach I showed first, as well as the function shown just above. My results confirmed that the Quake III algorithm is very fast, producing results that were within 1%, and taking a little over one-hundredth of the time.

Here's the output I got with an array of 10 float values:
Output.png

The first column shows the numbers in the array, 5.0 through 50.0, in increments of 5.0. The second column shows the values produced by the Quake algorithm, and the third column shows the values produced using the standard library sqrt() function.
The two lines at the bottom show the elapsed time, in clock ticks (AKA "clocks") for each algorithm. My 8-year-old computer runs at 3.4 GHz, so by my calculation, 434 clocks works out to 127.6 nsecs.

I wrote the program in Visual Studio 2017, in x64 Release mode, with optimization turned on to Max Optimization favoring speed, but even so, the Quake algorithm outperformed the intuitively obvious algorithm by two orders of magnitude.

In case you're curious about how I did the timing, I have a small assembly routine that I include as a separate file in my project. The assembly routine uses 64-bit registers, which is why I compiled the program as a 64-bit program. Doing so also meant that I had to have a separate assembly file rather than using inline assembly, which can be done only for 32-bit programs.

Here's my timing routine:
C:
.code
readCurrentTime    PROC C
    ;; Returns the start time, clocks, in RAX.
    rdtsc
    shl rdx, 32
    or rax, rdx
    ret
readCurrentTime ENDP
end
The rdtsc instruction (Read Time Stamp Counter) gets the current time as a 64 bit number, spread out in the 32-bit EDX and EAX registers. The remaining instructions combine the two values into the 64-bit RAX register.
 
Last edited:
  • Like
  • Informative
Likes jim mcnamara, Filip Larsen, berkeman and 2 others
Technology news on Phys.org
Insight article?
 
  • Like
Likes jim mcnamara and anorlunda
jedishrfu said:
Insight article?
Maybe. I thought about it...
 
Really looks like a worthy topic, short and succinct and very useful. I wonder what other algorithms could be vastly improved using this scheme.
 
jedishrfu said:
Really looks like a worthy topic, short and succinct and very useful. I wonder what other algorithms could be vastly improved using this scheme.
There are quicker and dirtier approximations for every numerical function. That has been the case since before the first digital computer.

Your choice of approximation will depend on the time you have to find something that you can get away with. I have been using those techniques for over 40 years and still see no end to the variations.

It is certainly not a short and succinct topic.
 
  • Like
Likes Vanadium 50 and jedishrfu
In fact the "real" square root is probably also an approximation, just one chosen to always be less than 1/2 bit away from the correct value. Lots of research in the 50's to 70's on this. Look up CORDIC for a great example.
 
Last edited:
  • Like
Likes jedishrfu and anorlunda
  • Like
  • Informative
Likes DrClaude and jedishrfu
I would be curious to know if the function implementations in my favorite library make use of these clever fast approximations.

For example.
Python:
import math
x = math.sqrt(2)

It would be foolish of me to try to make a DIY sqrt smarter than the library function until I know how smart the library function is. Where would I look to find the documentation for the implementation of math.sqrt?
 
  • Like
Likes jedishrfu
  • #10
Oleh said:
Hello, there is a lot of research on this topic, e.g.,
- http://www.matrix67.com/data/InvSqrt.pdf
- https://cs.uwaterloo.ca/~m32rober/rsqrt.pdf.
The link immediately above contains much of the same information as in the Youtube video I found, but also says that John Carmack disclaimed the invention of the algorithm shown.
Oleh said:
Modern modifications to the algorithm include:
- https://www.mdpi.com/1099-4300/23/1/86
- https://ieeexplore.ieee.org/abstract/document/8924302
- https://www.mdpi.com/2079-3197/9/2/21
Also, you can calculate similarly the reciprocal, square root, cube root, reciprocal cube root functions.
Thanks for posting these links! I looked at several of them, but not the IEEE article, which is behind a paywall.
 
  • Like
Likes Oleh
  • #11
anorlunda said:
It would be foolish of me to try to make a DIY sqrt smarter than the library function until I know how smart the library function is.
You don't necessarily need to know how smart a library function is. It might suffice to know how fast it is. Assuming the homebrew routine is reasonably accurate over some range of inputs, you can compare the running times of your routine vs. the library routine.

Ever since I read "The Zen of Code Optimization," by Michael Abrash, I've attempted to incorporate his ideas into my code, especially in a chunk of code that will be executed many times. For example, if I want to square or cube some number, I'll use multiplication rather than the C library function pow(). Some of Abrash's ideas are obsolete now, given the advance in processor design and compiler capabilities since the book was written, but it's still possible once in a while to write code that is faster than what a compiler produces.
 
  • Like
Likes jedishrfu
  • #13
Oleh said:
Also, you can calculate similarly the reciprocal, square root, cube root, reciprocal cube root functions.
To compute y = 1 / √x you could take the time to evaluate; y = Exp( –Log( x ) / 2 ).
Alternatively, successive approximation of y = 1 / √x doubles the number of correct digits on each pass, so it requires a good guess for the initial seed which can eliminate the first one or two loops.
That good guess can be made by taking the exponent of the floating point representation, [which is the offset integer part of the Log2( x ) ], shift the exponent right to divide the Log by 2 [for the root], then negate it [to get the reciprocal]. Successive approximation then repairs the damage caused by the hack.

Generally speaking, any algorithm that is dependent on the floating point number representation is a hack. I like the following floating point hack for the computation of; y = Log2( x ).
The integer part of y is the floating point exponent of x. But what about the mantissa?
Square the floating point number 7 times, take the 7 LSBs of the exponent as the next 7 bits of the mantissa. Repeat until satisfied.
That is now over 50 years old. See HAKMEM, Item 159 160.

https://w3.pppl.gov/~hammett/work/2009/AIM-239-ocr.pdf
https://en.wikipedia.org/wiki/HAKMEM
 
Last edited:
  • Like
Likes Oleh, jedishrfu and anorlunda
  • #14
A collection long these clever ideas would make a great web page our a great Insights article.
 
  • Like
Likes jedishrfu
  • #16
Baluncore said:
Generally speaking, any algorithm that is dependent on the floating point number representation is a hack.

You make it sound like a bad thing.
Remember, if you use it once, it's a trick. If you use it more than once, it's a technique.

IEEE754 make a lot of tricks techniques easy.You can easily express a number as 2^{2n}(1+\epsilon) with |\epsilon|&lt;1, which leads to a good first guess for its square root of 2^{n}(1+\epsilon/2). The worst that ever does is just over 6%. That's a pretty goo start to an iterative algorithm. (I might be inclined to use a lookup table instead, though)
 
  • #17
I’ve always liked the Trachtenberg multiplication algorithms that were designed for human computing eliminating the multiplication table lookup.

As a kid I wondered if they could have sped up computer multiplication until I discovered that computer math was in binary.
 
  • #18
I have the Python math lib source thanks to @jim mcnamara . There are many implementations (written in C, Python, and C-Python combined) depending on data type. Here's two of them. Both appear to be using bit shifts as opposed to "ordinary" arithmetic.

Python:
/* Approximate square root of a large 64-bit integer.

   Given `n` satisfying `2**62 <= n < 2**64`, return `a`
   satisfying `(a - 1)**2 < n < (a + 1)**2`. */

static uint64_t
_approximate_isqrt(uint64_t n)
{
    uint32_t u;
    u = 1U + (n >> 62);
    u = (u << 1) + (n >> 59) / u;
    u = (u << 3) + (n >> 53) / u;
    u = (u << 7) + (n >> 41) / u;
    return (u << 15) + (n >> 17) / u;
}

Python:
#Here's Python code equivalent to the C implementation below:

    def isqrt(n):
        """
        Return the integer part of the square root of the input.
        """
        n = operator.index(n)

        if n < 0:
            raise ValueError("isqrt() argument must be nonnegative")
        if n == 0:
            return 0

        c = (n.bit_length() - 1) // 2
        a = 1
        d = 0
        for s in reversed(range(c.bit_length())):
            # Loop invariant: (a-1)**2 < (n >> 2*(c - d)) < (a+1)**2
            e = d
            d = c >> s
            a = (a << d - e - 1) + (n >> 2*c - e - d + 1) // a

        return a - (a*a > n)
 
  • #19
As a followup, I looked into seeing whether the Intel AVX (Advanced Vector eXtensions) assembly instructions for computing the reciprocal square root might speed things up even further. The instruction I used is vrsqrtps. The name of the instruction packs a lot of info very tersely.
v -- vector -- the instruction operates on 16-byte or 32-byte "vectors" (in 128-bit or 256-bit registers)
rsqrt -- reciprocal square root. I.e., it calculates the reciprocal of the square root of each number that makes up the vector.
p -- packed - indicates that several numbers are packed into a 128-bit or 256-bit register. In this case, there are either 4 floats in an XMM (16-byte) register or 8 floats packed in in a YMM (32-byte) register.
s - single; i.e., 4-byte float
The instruction I'm using operates on YMM registers, 32-byte or 256-bit registers that can hold 8 floats. There's a version of this instruction that can operate on the ZMM registers, 512-bit AVX-512 registers. Although I have a computer that supports many AVX-512 instructions, only Intel Xeon Phi processors support this particular AVX-512 instruction.

Here's a snippet of my assembly routine to calculate the reciprocal square root of 8 floats at a time. I'm omitting some of the details of prologue and epilogue and register initialization. I anyone is interested, I can provide the complete code by PM.

C:
ProcessRow:
     ; Read a row (8) of floats
     vmovdqu ymm1, ymmword ptr[r8]      ; Copy from source array to YMM1
     vrsqrtps ymm0, ymm1                ; Put reciprocal square root values in YMM0
     vmovdqu ymmword ptr[rdx], ymm0     ; Copy results from YMM0 to destination array
     add r8, rsi                        ; Increment addresses in array by 8 bytes
     add rdx, rdi 
     loop ProcessRow
The first three instructions move 8 floats from memory to a YMM register, calculate the reciprocal square roots of those numbers, and then move the results back to a different YMM register.
The two add instructions increment the addresses of the source and destination arrays (in the R8 and RDX registers, respectively) by 32 bytes.
The final loop instruction decrements the RCX register, and if it's still positive, transfers control back to the ProcessRow label.
In the vmovdqu instruction, v stands for vector, mov is short for move, dq is double quadword (a word is 2 bytes, a quadword is 8 bytes, and a double quadword is 16 bytes, an anachronism left over from when the instruction first operated on 128-bit registers. The final u stands for unaligned memory, as opposed to memory aligned on.

Here's a screen shot of a run with an array of 25 floats:
RSQRT25.png

The output shows selected values from the array. As you can see, the AVX version outperforms the Quake algorithm, running about 4.5 times as fast, and is nearly 120 times as fast as the version that uses the standard library sqrt() function. Again, what you're seeing is the release version, compiled with optimization set to highest speed.

But, and this really puzzles me, if the array size is a multiple of 8, we get significantly different results in times for the Quake and standard library versions.
RSQRT24.png

Notice that the standard library version is significantly faster than the Quake version, and is only slightly slower than the AVX version. I've noticed this discrepancy in quite a few tests -- if the array has a multiple of 8 elements, the library version speeds up dramatically. At the moment I have no idea why this happens.
 
Last edited:
  • Like
Likes jim mcnamara and anorlunda
  • #20
Mark44 said:
At the moment I have no idea why this happens.
Wouldn't it be fun to have a debugger that could step through in sufficient detail to see the bits change, and see the memory & cache accesses.
 
  • #21
Mark44 said:
At the moment I have no idea why this happens.

Depending on your code it may be VS' auto-vectorization optimization that kicks in. You should be able to test if this is the case by including a pragma mentioned below and see if the times go up again.

From https://docs.microsoft.com/en-us/cpp/parallel/auto-parallelization-and-auto-vectorization :
The Auto-Vectorizer analyzes loops in your code, and uses the vector registers and instructions on the target computer to execute them, if it can. This can improve the performance of your code. The compiler targets the SSE2, AVX, and AVX2 instructions in Intel or AMD processors, or the NEON instructions on ARM processors, according to the /arch switch.

The Auto-Vectorizer may generate different instructions than specified by the /arch switch. These instructions are guarded by a runtime check to make sure that code still runs correctly. For example, when you compile /arch:SSE2, SSE4.2 instructions may be emitted. A runtime check verifies that SSE4.2 is available on the target processor and jumps to a non-SSE4.2 version of the loop if the processor does not support those instructions.

By default, the Auto-Vectorizer is enabled. If you want to compare the performance of your code under vectorization, you can use #pragma loop(no_vector) to disable vectorization of any given loop.
 
  • #22
anorlunda said:
Wouldn't it be fun to have a debugger that could step through in sufficient detail to see the bits change, and see the memory & cache accesses.
My debugger won't show individual bits, but I can see if memory or variables change values. Can't see cache accesses, though.
 
  • #23
Filip Larsen said:
Depending on your code it may be VS' auto-vectorization optimization that kicks in.
I'm going to guess that this is exactly what happens. I'll look into this further.
 
  • #24
@Filip Larsen, apparently what happens is that, if the array size is a multiple of 8, vectorization occurs. That means that SIMD (single instruction multiple data) instructions can be used, with either 128-bit or 256-bit registers holding four or eight floats in a register, and a single operation being performed on the contents.
The exact sequence I see in my debugger is the following:
C:
movups      xmm0,xmmword ptr [rbp+rcx+50h] 
movaps      xmm2,xmm3 
sqrtps      xmm1,xmm0 
divps       xmm2,xmm1 
movups      xmmword ptr [rbp+rcx-20h],xmm2
Without going into too much detail, (1) the values to have square roots taken are copied from memory to XMM0, (2) four copies of 1.0 are copied to XMM2, (3) the values in XMM0 have square roots taken and stored in XMM1, (4) the reciprocals are found using the divps instruction and placed in XMM2, and finally, (5) the values in XMM2 are stored to memory.

OTOH, if the array size is not a multiple of 8, the compiler does not generate the SIMD instructions for the square root and division, but instead calls the standard library sqrt() function. The situation is probably similar if the array size is a multiple of 4.

I didn't try the pragma you mentioned, but I would be very surprised if using it does not alter the times for array sizes that are multiples of 8 or even 4.
 
  • Like
  • Informative
Likes Oleh, Tom.G and Filip Larsen
  • #25
Mark44 said:
Ever since I read "The Zen of Code Optimization," by Michael Abrash, I've attempted to incorporate his ideas into my code, especially in a chunk of code that will be executed many times. For example, if I want to square or cube some number, I'll use multiplication rather than the C library function pow(). Some of Abrash's ideas are obsolete now, given the advance in processor design and compiler capabilities since the book was written, but it's still possible once in a while to write code that is faster than what a compiler produces.

Some old optimization techniques are obsolete or less practical now, yes. But some have become far more important because of architecture changes and such.

Branchless programming techniques, for example, are far more important today than they were in the mid 90s, because long pipelines and branch prediction are in basically every powerful processor now.

This video has some examples:
 
  • Like
Likes Oleh
  • #26
The Bill said:
Some old optimization techniques are obsolete or less practical now, yes. But some have become far more important because of architecture changes and such.

Branchless programming techniques, for example, are far more important today than they were in the mid 90s, because long pipelines and branch prediction are in basically every powerful processor now.
It should be mentioned that every type of loop involves a branch, at the level of the assembly code generated by the compiler. Eliminating branches is the reason for so-called "loop unrolling," a technique used to help insure that the pipeline keeps chugging along. One improvement I could make to the AVX version of my reciprocal square root version, if I knew there would be a lot of data to process, would be to pull in more groups of 8 floats in each loop iteration, and taking full advantage of the YMM register set.
Processors these days are designed to predict whether a branch will be taken. If they predict correctly, not many cycles are lost, but if they don't correctly predict that a branch will be taken, the whole pipeline will be invalidated. With modern CISC (complex instruction set computing) processors having 14 or more pipeline stages, that's a lot of cycles being lost.
 
  • Informative
Likes Keith_McClary
  • #27
Mark44 said:
It should be mentioned that every type of loop involves a branch, at the level of the assembly code generated by the compiler. Eliminating branches is the reason for so-called "loop unrolling," a technique used to help insure that the pipeline keeps chugging along. One improvement I could make to the AVX version of my reciprocal square root version, if I knew there would be a lot of data to process, would be to pull in more groups of 8 floats in each loop iteration, and taking full advantage of the YMM register set.
Processors these days are designed to predict whether a branch will be taken. If they predict correctly, not many cycles are lost, but if they don't correctly predict that a branch will be taken, the whole pipeline will be invalidated. With modern CISC (complex instruction set computing) processors having 14 or more pipeline stages, that's a lot of cycles being lost.
Yes, that's exactly what motivates the techniques in the video I linked. The examples he gives involve using operators to remove the need to branch within a loop, not unrolling completely or removing all/most branches. Just removing some of the less predictable branches that are repeated a lot in some types of algorithms. Where possible.

He also points out at least one example where a compiler is smart enough to convert code into branchless machine code.
 
  • #28
The Bill said:
Yes, that's exactly what motivates the techniques in the video I linked. The examples he gives involve using operators to remove the need to branch within a loop, not unrolling completely or removing all/most branches. Just removing some of the less predictable branches that are repeated a lot in some types of algorithms. Where possible.

He also points out at least one example where a compiler is smart enough to convert code into branchless machine code.
I just took a closer look at the video you posted. Nice presentation, and well laid out from simple to more complex. In a sense, he "unrolls the loop" in his last example, in which he uses AVX (advanced vector extensions) and 32-byte YMM registers to operate on 32 characters at a time per iterations, instead of just a single character per iteration. This is similar in flavor to what I did in my example of post #19.
 
  • #29
From the book on "Numerical Analysis" (Fröberg 1962);

Given a positive number a, find the value of \frac{1}{\sqrt{a}}.
The algorithm uses fast-converging series.
Start with some value (for example x_{0}=1)
Generate the following approximations using x_{n+1}=\frac{x_{n}}{2}(3-ax_{n}^{2})

The reason this method is fast is because it does not use division (taking the half of a number is fast and easy in a binary-based system).
 
  • #30
Svein said:
The reason this method is fast is because it does not use division...
A different and faster form of that iterative solution was given in the OP.
y = y * ( THREEHALVES - ( x2 * y * y ) ); where x2 = x/2, was computed once only, by shifting a binary integer, or by decrementing the FP exponent.
It does not use the division or shift inside the iteration loop.
It doubles the number of correct digits with each iteration.

The question becomes; how do you find the optimum initial guess for y ?

If the guess is a constant 1, it will yield 2 correct digits, followed by 4, 8, then 16.
If you can estimate 3 digits then you get 6 digits with one iteration.
The "hack" used in the OP was to manipulate the input floating point number.
 
  • #31
I quoted Fröberg since it was on the required list for my Numerical Algorithms exam back in 1964. After I referred to it, I started to get very unsure of the validity of the algorithm - and, sure enough, it converges badly for a starting value of 1 if a is greater than 1. I have checked and found that the best way is to modify the start value according to this algorithm;
Code:
Start with x0=2
if a>1
   repeat
      x0=x0/2
      t=(3-x0*x0*a)
   until (t<2) and (t>0)
Then use the Fröberg algorithm with the resulting x0.
 
Last edited:
  • Like
Likes Baluncore
Back
Top