# Fast reciprocal square root algorithm

Mentor
I ran into an interesting video on Youtube yesterday, about a fast way to compute the reciprocal of the square root of a number. I.e., to compute this function:
##f(x)= \frac 1 {\sqrt x}##
The presenter is John Carmack. If you search for "Fast Inverse Square Root — A Quake III Algorithm" you'll find it.

The video shows a technique used in Quake III to speed up the calculations for lighting and reflections, to normalize vectors to a magnitude or length of 1.
The obvious way to do the calculation as a C function might be as follows:
C:
#include <math.h>
float RecipSqrt(float num) {
return 1/sqrt(num);
}
Such a function would produce correct values, within the limits of the float data type, but would be extremely slow. A very clever programmer at Quake came up with the following version.
C:
float RecipSqrt(float number)
{
long i;
float x2, y;
const float THREEHALVES = 1.5F;

x2 = number * 0.5F;
y = number;
i = *(long*)&y;                            // Evil floating point bit hack
i = 0x5f3759df - (i >> 1);
y = *(float*)&i;
y = y * (THREEHALVES - (x2 * y * y));    // 1st iteration
return y;
}
I have modified the code shown in the video slightly, changing the name of the function and removing one comment that was an acronym I don't wish to write here. Most of the code is very arcane, with no comments that explain what the code is doing. For a full explanation of how the code works, take a look at the video, starting at about 10 minutes.

I put together a program with the above function. In it I compared the resulting times to calculate the reciprocal square roots of several numbers in an array of floats using both the obvious (but slow) approach I showed first, as well as the function shown just above. My results confirmed that the Quake III algorithm is very fast, producing results that were within 1%, and taking a little over one-hundredth of the time.

Here's the output I got with an array of 10 float values:

The first column shows the numbers in the array, 5.0 through 50.0, in increments of 5.0. The second column shows the values produced by the Quake algorithm, and the third column shows the values produced using the standard library sqrt() function.
The two lines at the bottom show the elapsed time, in clock ticks (AKA "clocks") for each algorithm. My 8-year-old computer runs at 3.4 GHz, so by my calculation, 434 clocks works out to 127.6 nsecs.

I wrote the program in Visual Studio 2017, in x64 Release mode, with optimization turned on to Max Optimization favoring speed, but even so, the Quake algorithm outperformed the intuitively obvious algorithm by two orders of magnitude.

In case you're curious about how I did the timing, I have a small assembly routine that I include as a separate file in my project. The assembly routine uses 64-bit registers, which is why I compiled the program as a 64-bit program. Doing so also meant that I had to have a separate assembly file rather than using inline assembly, which can be done only for 32-bit programs.

Here's my timing routine:
C:
.code
;; Returns the start time, clocks, in RAX.
rdtsc
shl rdx, 32
or rax, rdx
ret
end
The rdtsc instruction (Read Time Stamp Counter) gets the current time as a 64 bit number, spread out in the 32-bit EDX and EAX registers. The remaining instructions combine the two values into the 64-bit RAX register.

Last edited:
jim mcnamara, Filip Larsen, berkeman and 2 others

jedishrfu
Mentor
Insight article?

jim mcnamara and anorlunda
Mentor
Insight article?

jedishrfu
Mentor
Really looks like a worthy topic, short and succinct and very useful. I wonder what other algorithms could be vastly improved using this scheme.

Baluncore
Really looks like a worthy topic, short and succinct and very useful. I wonder what other algorithms could be vastly improved using this scheme.
There are quicker and dirtier approximations for every numerical function. That has been the case since before the first digital computer.

Your choice of approximation will depend on the time you have to find something that you can get away with. I have been using those techniques for over 40 years and still see no end to the variations.

It is certainly not a short and succinct topic.

Staff Emeritus
In fact the "real" square root is probably also an approximation, just one chosen to always be less than 1/2 bit away from the correct value. Lots of research in the 50's to 70's on this. Look up CORDIC for a great example.

Last edited:
jedishrfu and anorlunda
DrClaude and jedishrfu
anorlunda
Staff Emeritus
I would be curious to know if the function implementations in my favorite library make use of these clever fast approximations.

For example.
Python:
import math
x = math.sqrt(2)

It would be foolish of me to try to make a DIY sqrt smarter than the library function until I know how smart the library function is. Where would I look to find the documentation for the implementation of math.sqrt?

jedishrfu
Mentor
Hello, there is a lot of research on this topic, e.g.,
- http://www.matrix67.com/data/InvSqrt.pdf
- https://cs.uwaterloo.ca/~m32rober/rsqrt.pdf.
The link immediately above contains much of the same information as in the Youtube video I found, but also says that John Carmack disclaimed the invention of the algorithm shown.
Oleh said:
Modern modifications to the algorithm include:
- https://www.mdpi.com/1099-4300/23/1/86
- https://ieeexplore.ieee.org/abstract/document/8924302
- https://www.mdpi.com/2079-3197/9/2/21
Also, you can calculate similarly the reciprocal, square root, cube root, reciprocal cube root functions.
Thanks for posting these links! I looked at several of them, but not the IEEE article, which is behind a paywall.

Oleh
Mentor
It would be foolish of me to try to make a DIY sqrt smarter than the library function until I know how smart the library function is.
You don't necessarily need to know how smart a library function is. It might suffice to know how fast it is. Assuming the homebrew routine is reasonably accurate over some range of inputs, you can compare the running times of your routine vs. the library routine.

Ever since I read "The Zen of Code Optimization," by Michael Abrash, I've attempted to incorporate his ideas into my code, especially in a chunk of code that will be executed many times. For example, if I want to square or cube some number, I'll use multiplication rather than the C library function pow(). Some of Abrash's ideas are obsolete now, given the advance in processor design and compiler capabilities since the book was written, but it's still possible once in a while to write code that is faster than what a compiler produces.

jedishrfu
Baluncore
Also, you can calculate similarly the reciprocal, square root, cube root, reciprocal cube root functions.
To compute y = 1 / √x you could take the time to evaluate; y = Exp( –Log( x ) / 2 ).
Alternatively, successive approximation of y = 1 / √x doubles the number of correct digits on each pass, so it requires a good guess for the initial seed which can eliminate the first one or two loops.
That good guess can be made by taking the exponent of the floating point representation, [which is the offset integer part of the Log2( x ) ], shift the exponent right to divide the Log by 2 [for the root], then negate it [to get the reciprocal]. Successive approximation then repairs the damage caused by the hack.

Generally speaking, any algorithm that is dependent on the floating point number representation is a hack. I like the following floating point hack for the computation of; y = Log2( x ).
The integer part of y is the floating point exponent of x. But what about the mantissa?
Square the floating point number 7 times, take the 7 LSBs of the exponent as the next 7 bits of the mantissa. Repeat until satisfied.
That is now over 50 years old. See HAKMEM, Item 159 160.

https://w3.pppl.gov/~hammett/work/2009/AIM-239-ocr.pdf
https://en.wikipedia.org/wiki/HAKMEM

Last edited:
Oleh, jedishrfu and anorlunda
anorlunda
Staff Emeritus
A collection long these clever ideas would make a great web page our a great Insights article.

jedishrfu
Staff Emeritus
Generally speaking, any algorithm that is dependent on the floating point number representation is a hack.

You make it sound like a bad thing.
Remember, if you use it once, it's a trick. If you use it more than once, it's a technique.

IEEE754 make a lot of tricks techniques easy.You can easily express a number as $2^{2n}(1+\epsilon)$ with $|\epsilon|<1$, which leads to a good first guess for its square root of $2^{n}(1+\epsilon/2)$. The worst that ever does is just over 6%. That's a pretty goo start to an iterative algorithm. (I might be inclined to use a lookup table instead, though)

jedishrfu
Mentor
I’ve always liked the Trachtenberg multiplication algorithms that were designed for human computing eliminating the multiplication table lookup.

As a kid I wondered if they could have sped up computer multiplication until I discovered that computer math was in binary.

anorlunda
Staff Emeritus
I have the Python math lib source thanks to @jim mcnamara . There are many implementations (written in C, Python, and C-Python combined) depending on data type. Here's two of them. Both appear to be using bit shifts as opposed to "ordinary" arithmetic.

Python:
/* Approximate square root of a large 64-bit integer.

Given n satisfying 2**62 <= n < 2**64, return a
satisfying (a - 1)**2 < n < (a + 1)**2. */

static uint64_t
_approximate_isqrt(uint64_t n)
{
uint32_t u;
u = 1U + (n >> 62);
u = (u << 1) + (n >> 59) / u;
u = (u << 3) + (n >> 53) / u;
u = (u << 7) + (n >> 41) / u;
return (u << 15) + (n >> 17) / u;
}

Python:
#Here's Python code equivalent to the C implementation below:

def isqrt(n):
"""
Return the integer part of the square root of the input.
"""
n = operator.index(n)

if n < 0:
raise ValueError("isqrt() argument must be nonnegative")
if n == 0:
return 0

c = (n.bit_length() - 1) // 2
a = 1
d = 0
for s in reversed(range(c.bit_length())):
# Loop invariant: (a-1)**2 < (n >> 2*(c - d)) < (a+1)**2
e = d
d = c >> s
a = (a << d - e - 1) + (n >> 2*c - e - d + 1) // a

return a - (a*a > n)

Mentor
As a followup, I looked into seeing whether the Intel AVX (Advanced Vector eXtensions) assembly instructions for computing the reciprocal square root might speed things up even further. The instruction I used is vrsqrtps. The name of the instruction packs a lot of info very tersely.
v -- vector -- the instruction operates on 16-byte or 32-byte "vectors" (in 128-bit or 256-bit registers)
rsqrt -- reciprocal square root. I.e., it calculates the reciprocal of the square root of each number that makes up the vector.
p -- packed - indicates that several numbers are packed into a 128-bit or 256-bit register. In this case, there are either 4 floats in an XMM (16-byte) register or 8 floats packed in in a YMM (32-byte) register.
s - single; i.e., 4-byte float
The instruction I'm using operates on YMM registers, 32-byte or 256-bit registers that can hold 8 floats. There's a version of this instruction that can operate on the ZMM registers, 512-bit AVX-512 registers. Although I have a computer that supports many AVX-512 instructions, only Intel Xeon Phi processors support this particular AVX-512 instruction.

Here's a snippet of my assembly routine to calculate the reciprocal square root of 8 floats at a time. I'm omitting some of the details of prologue and epilogue and register initialization. I anyone is interested, I can provide the complete code by PM.

C:
ProcessRow:
; Read a row (8) of floats
vmovdqu ymm1, ymmword ptr[r8]      ; Copy from source array to YMM1
vrsqrtps ymm0, ymm1                ; Put reciprocal square root values in YMM0
vmovdqu ymmword ptr[rdx], ymm0     ; Copy results from YMM0 to destination array
loop ProcessRow
The first three instructions move 8 floats from memory to a YMM register, calculate the reciprocal square roots of those numbers, and then move the results back to a different YMM register.
The two add instructions increment the addresses of the source and destination arrays (in the R8 and RDX registers, respectively) by 32 bytes.
The final loop instruction decrements the RCX register, and if it's still positive, transfers control back to the ProcessRow label.
In the vmovdqu instruction, v stands for vector, mov is short for move, dq is double quadword (a word is 2 bytes, a quadword is 8 bytes, and a double quadword is 16 bytes, an anachronism left over from when the instruction first operated on 128-bit registers. The final u stands for unaligned memory, as opposed to memory aligned on.

Here's a screen shot of a run with an array of 25 floats:

The output shows selected values from the array. As you can see, the AVX version outperforms the Quake algorithm, running about 4.5 times as fast, and is nearly 120 times as fast as the version that uses the standard library sqrt() function. Again, what you're seeing is the release version, compiled with optimization set to highest speed.

But, and this really puzzles me, if the array size is a multiple of 8, we get significantly different results in times for the Quake and standard library versions.

Notice that the standard library version is significantly faster than the Quake version, and is only slightly slower than the AVX version. I've noticed this discrepancy in quite a few tests -- if the array has a multiple of 8 elements, the library version speeds up dramatically. At the moment I have no idea why this happens.

Last edited:
jim mcnamara and anorlunda
anorlunda
Staff Emeritus
At the moment I have no idea why this happens.
Wouldn't it be fun to have a debugger that could step through in sufficient detail to see the bits change, and see the memory & cache accesses.

Filip Larsen
Gold Member
At the moment I have no idea why this happens.

Depending on your code it may be VS' auto-vectorization optimization that kicks in. You should be able to test if this is the case by including a pragma mentioned below and see if the times go up again.

From https://docs.microsoft.com/en-us/cpp/parallel/auto-parallelization-and-auto-vectorization :
The Auto-Vectorizer analyzes loops in your code, and uses the vector registers and instructions on the target computer to execute them, if it can. This can improve the performance of your code. The compiler targets the SSE2, AVX, and AVX2 instructions in Intel or AMD processors, or the NEON instructions on ARM processors, according to the /arch switch.

The Auto-Vectorizer may generate different instructions than specified by the /arch switch. These instructions are guarded by a runtime check to make sure that code still runs correctly. For example, when you compile /arch:SSE2, SSE4.2 instructions may be emitted. A runtime check verifies that SSE4.2 is available on the target processor and jumps to a non-SSE4.2 version of the loop if the processor does not support those instructions.

By default, the Auto-Vectorizer is enabled. If you want to compare the performance of your code under vectorization, you can use #pragma loop(no_vector) to disable vectorization of any given loop.

Mentor
Wouldn't it be fun to have a debugger that could step through in sufficient detail to see the bits change, and see the memory & cache accesses.
My debugger won't show individual bits, but I can see if memory or variables change values. Can't see cache accesses, though.

Mentor
Depending on your code it may be VS' auto-vectorization optimization that kicks in.
I'm going to guess that this is exactly what happens. I'll look into this further.

Mentor
@Filip Larsen, apparently what happens is that, if the array size is a multiple of 8, vectorization occurs. That means that SIMD (single instruction multiple data) instructions can be used, with either 128-bit or 256-bit registers holding four or eight floats in a register, and a single operation being performed on the contents.
The exact sequence I see in my debugger is the following:
C:
movups      xmm0,xmmword ptr [rbp+rcx+50h]
movaps      xmm2,xmm3
sqrtps      xmm1,xmm0
divps       xmm2,xmm1
movups      xmmword ptr [rbp+rcx-20h],xmm2
Without going into too much detail, (1) the values to have square roots taken are copied from memory to XMM0, (2) four copies of 1.0 are copied to XMM2, (3) the values in XMM0 have square roots taken and stored in XMM1, (4) the reciprocals are found using the divps instruction and placed in XMM2, and finally, (5) the values in XMM2 are stored to memory.

OTOH, if the array size is not a multiple of 8, the compiler does not generate the SIMD instructions for the square root and division, but instead calls the standard library sqrt() function. The situation is probably similar if the array size is a multiple of 4.

I didn't try the pragma you mentioned, but I would be very surprised if using it does not alter the times for array sizes that are multiples of 8 or even 4.

Oleh, Tom.G and Filip Larsen
Ever since I read "The Zen of Code Optimization," by Michael Abrash, I've attempted to incorporate his ideas into my code, especially in a chunk of code that will be executed many times. For example, if I want to square or cube some number, I'll use multiplication rather than the C library function pow(). Some of Abrash's ideas are obsolete now, given the advance in processor design and compiler capabilities since the book was written, but it's still possible once in a while to write code that is faster than what a compiler produces.

Some old optimization techniques are obsolete or less practical now, yes. But some have become far more important because of architecture changes and such.

Branchless programming techniques, for example, are far more important today than they were in the mid 90s, because long pipelines and branch prediction are in basically every powerful processor now.

This video has some examples:

Oleh