# XOR algorithm to find the missing number in a given array

## Summary:

I am having an issue understanding how the given algorithm, is actually working.

## Main Question or Discussion Point

Recently I have been reading up about the logical operations OR, AND and XOR. Whilst reading about XOR I cam across a method using XOR to find the missing number in a given array, but I just cant understand how or really why it works. The algorithm is displayed below.

Find missing number in array:
A = [2, 3, 1, 4, 6, 7]

x1 = A
x2 = 1

for i in range(1, len(A)):
x1 = x1 ^ A[i]

for i in range(2, len(A)+2):
x2 = x2 ^ i

print('--'*10)

print(x1 ^ x2)
What I can seem to understand is two things.
1. Why for the second loop has the array length been extend by 2?

2. In effect from what see is that the algorithm is comparing the last elements within each for loop, but I just understand how this gives the missing number? As I understand XOR compare the binary number of the given integer and if both integers are the same will result in 0 or if not the same will result in a new integer i.e 16 ^ 15 = 1. But why compare the array and update x1 and why compare x2 with i and update.

I have runt the print out of each line of the itterartion to see if I could gain some insight to what going on but to be honest confused me more. Could maybe some explain in laymen terms if possible.

Related Programming and Computer Science News on Phys.org
rcgldr
Homework Helper
The code is specific to the length of A and the values in A and assumes the values in A are sequential and go from 1 to 7, but is missing one value, so the length of A is 6 instead of 7. x1 is initialized to A, then the rest of A is xor'ed to x1 in the first loop. x2 is initialized to 1, the smallest value in A, then the values 2 through 7 are xor'ed to x2. After both loops, x1 ^ x2 will equal the missing number.

A more general version of this. mn is the minimum value in A.

Code:
A = [2, 3, 1, 4, 6, 7]

x1 = 0
x2 = 0
mn = min(A)

for i in range(0, len(A)):
x1 = x1 ^ A[i]
x2 = x2 ^ (i+mn)
x2 ^= len(A)+mn

print(x1 ^ x2)

Last edited:
• Taylor_1989
Filip Larsen
Gold Member
Allow me to add that the main point of this algorithm is to observe that if you XOR the same number twice it comes out zero. This also extends to a sequence of numbers where each number is somewhere in the list twice. This then means that if exactly one number where missing in such a list of double numbers then the total XOR would come out as that missing number.

Regarding code one can usually come a long way in understanding by using some descriptive names. For instance, below you do not see the +2 you were puzzled about because its been split into a max_value calculation and a +1 for the range argument (since the second argument to range is the stop value, not included in the range result):

Python:
actual_values = [2, 3, 1, 4, 6, 7]

min_value = min(actual_values)
max_value = min_value + len(actual_values)
desired_values = range(min_value, max_value + 1)

missing = 0
for x in actual_values:
missing ^= x
for x in desired_values:
missing ^= x

print(missing)

• • anorlunda and Taylor_1989
wle
Just for info, there's a common functional idiom for accumulating a value from a sequence. Python is one of the languages that allows it:
Python:
from functools import reduce    # Only needed in Python 3.

A = [2, 3, 1, 4, 6, 7]

xor = lambda x, y: x^y    # xor(x, y) returns x^y.

x1 = reduce(xor, A)
x2 = reduce(xor, range(1, len(A)+2))

print(x1^x2)    # Prints 5.
Of course, a lot of people still prefer plain loops for this sort of thing.

Last edited:
• Taylor_1989
This algorithm is an interesting trick. Say that X a sequence of any number of different numbers, and A is that same sequence except one (in this case, A is missing c}.

X = {a,b,c,d,e}
A = {a,b,d,e}

How this algorithm works? First it xors all values in A into a variable x; that's what the first loop is doing.

x = a ^ b ^ d ^ e

Then the second loop xors all values in X in the same variable, that is what the second loop is doing:

x = ( a ^ b ^ d ^ e ) ^ ( a ^ b ^ c ^ d ^ e )

then, rearranging:

x = ( a ^ a ) ^ ( b ^ b ) ^ ( c ) ^ ( d ^ d ) ^ ( e ^ e )

then, whatever w, w ^ w = 0, and w ^ 0 = w; therefore:

x = 0 ^ 0 ^ c ^ 0 ^ 0 = c = missing value

Per the question, why it's going to len(A)+2, that's because len(A) is 6, so range[2,8] = { 2, 3, 4, 5, 6, 7 }; now, why did it exclude 1? that's because it already included 1 from the very first assignment; so, by including 1, you're xoring with { 1, 2, 3, 4, 5, 6, 7 }, which is the list of all possible values from 1 to 7.

By the way, you don't really need to use xors for this; + and would work the same way as

x = ( sum of all values in X ) - ( sum of all values in A ) = missing value

or, in code:

 A = [2, 3, 1, 4, 6, 7] x1 = A for i in range(1, len(A)) begin x1 = x1 + A[i] end x2 = 1 for i in range(2, len(A)+2) begin x2 = x2 + i end print( x2 - x1 ) 

• Taylor_1989
Hmm... actually one could also use * to do the same thing...

x = ( multiplication of all values in X ) / ( multiplication of all values in A ) = missing value

I wonder what's the theory behind this... say you have two operators, op1 and op2... then if this holds true:

a == ( a op1 b ) op2 b

then the same trick works with these operators; that's why it works with xor/xor, +/- and */division... but which other operators would also work for this algorithm... interesting question; I guess rotation operators would also work, because they are associative and have an inverse. But then I'm no mathematician.

wle
Not a mathematician myself, but it seems the trick works if you have some subset ##S## of an abelian group (of numbers or some other type of object), you're given a subset ##S'## of ##S## with one element missing, and the problem is to identify that missing element.

Being a group means each object has an inverse (itself for xor and ##-x## and ##1/x## for addition and multiplication) and the group operation is associative (it doesn't matter where you put parentheses). Abelian means the group operation is also commutative (i.e., ##x \ast y = y \ast x##). Commutativity and associativity seem important so that you can move the elements around into any order and get them to cancel out.

Rotations in a 2D plane satisfy these conditions (a bit trivially, since it reduces to simple addition of the rotation angles modulo ##2 \pi## radians). Rotations in 3 or more dimensions don't since they're generally not commutative.

Last edited:
A-ha!! Abelian groups! That's the word, thank you!

I stand corrected on the 3D rotations - thank you!

Filip Larsen
Gold Member
Just a small comment that using plus/minus or similar in a a practical implementation quickly will bring on the issue of overflow.

The characteristic of xor being both symmetric and closed on its data type (i.e. no overflow) is very useful in both theoretic algorithms and practical implementations, and its use is quite common in fields like cryptography.

hmm.. good point... let's see if that holds true on additions with modulus; say modulus 10

ie, say I'm a complement-10 cpu that can only store numbers from 0 to 9, and I have

X= { 1, 2, 3, 4, 5, 6, 7, 8 }
A = { 1, 2, 3, 5, 6, 7, 8 }

so A is missing 4; then

sum of X mod 10 = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 = 6
sum of A mod 10 = 1 + 2 + 3 + 5 + 6 + 7 + 8 = 2

so in this case ( 6 - 2 ) mod 10 = 4, which is correct

but say the missing value was 7, then

sum of X mod 10 = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 = 6
sum of A mod 10 = 1 + 2 + 3 + 4 + 5 + 6 + 8 = 9

then ( 6 - 9 ) mod 10 = -3, but -3 does not exist in this CPU (it only holds 0..9), so it would underflow and give the complement-10 of it, which is 10-3 = 7

so I suspect that algorithm would still work (if one disables overflow/underflow) for integer numbers in a CPU that uses complement-N (say complement-2, which everybody uses, or complement-10 for my hypothetical cpu above); but I suspect you're right, that it would not work for cpus that use other representations, like BCD, signal/magnitude or complement-1, nor it would work for any of the common floating point representations (as they are usually in signal/magnitude, plus there's the pesky problem of the NaN overflow thingie). I suspect the xor algorithm would not work for any of those either.

wle
Just a small comment that using plus/minus or similar in a a practical implementation quickly will bring on the issue of overflow.
This depends on the language. Python integers are not machine register integers; their size is only limited by available memory. Still, this just substitutes one possible problem for others (efficiency, and in very very extreme cases maybe exhaustion of memory) that xor doesn't have.

On the other hand, bit fiddling with xor for maximum efficiency somehow seems a bit out of place in a high level language like Python.

hmm.. good point... let's see if that holds true on additions with modulus; say modulus 10
Well in general, the set ##\{0, \dotsc, N-1\}## with addition modulo ##N## is an abelian group, so yes, you could take all the additions modulo some number ##N## to keep the magnitudes of intermediate results under control and it would still work, as long as you're sure ##N## is larger than the numbers in the set you're dealing with.

The set ##\{1, \dotsc, N-1\}## with multiplication modulo ##N## is also an abelian group if ##N## is a prime number: https://en.wikipedia.org/wiki/Group_(mathematics)#Modular_arithmetic. In that case, for every integer in the set there's another integer that acts as its inverse.

rcgldr
Homework Helper
( 6 - 9 ) mod 10 = -3
In math, the sign of modulo is the same as the sign of the dividend (unless the remainder == 0), so (6-9) mod 10 = 7. This is different than the remainder with computers and/or programming languages where the sign of remainder is the same as the sign of the dividend. This won't be an issue with unsigned integers (which are always positive) either. Say the sum of the array is a 32 bit unsigned integer 0xFFFFFFFC, and the sum of the sequential values is 0x00000003, then 0x00000003 - 0xFFFFFFFC = 0x00000007.

In math, the sign of modulo is the same as the sign of the dividend (unless the remainder == 0), so (6-9) mod 10 = 7. This is different than the remainder with computers and/or programming languages where the sign of remainder is the same as the sign of the dividend. This won't be an issue with unsigned integers (which are always positive) either. Say the sum of the array is a 32 bit unsigned integer 0xFFFFFFFC, and the sum of the sequential values is 0x00000003, then 0x00000003 - 0xFFFFFFFC = 0x00000007.
Ah, got it - thank you!

pbuk