C++ swap routines: pointers vs. references

  • Thread starter Mark44
  • #1
@sysprog posted this code for a nifty swap function that uses XOR in another thread in this section.
sysprog said:
C:
void XorSwap( int* x, int* y )
{
    /* ensure that you're not swapping a location with itself
       (3 XORs would zero it out as surely as 1 XOR would) */
    if (x != y)
    {
        *x ^= *y; /* xor x with y and store the result at x */
        *y ^= *x; /* xor y with x and store the result at y */
        *x ^= *y; /* xor x with y and store the result at x */
        /* x and y have been swapped without a 3rd memory location */
    }
}
This put me in mind of a couple of swap routines that I wrote a while back: one with references and one with pointers. Unlike the example above, both routines do use a third memory location. Rather than wander off-topic, I thought I'd just start a new thread.

Here's the references version:
C++:
// Swap two numbers, using reference parameters
// x and y are references to the a and b variables in main()
void swapRefs(int& x, int& y)
{
    int temp;
    temp = x;
    x = y;
    y = temp;
}
This function is called like this:
C++:
int a = 5;
int b = 8;
swapRefs(a, b);
After swapRefs() returns, a's value will be 8 and b's value will be 5.

Here's the pointers version:
C++:
// Swap two numbers, using pointer parameters
// x and y are pointers to the a and b variables in main()
void swapPtrs(int* x, int* y)
{
    int temp;
    temp = *x;
    *x = *y;
    *y = temp;
}
This function is called like so:
C++:
int a = 5;
int b = 8;
swapPtrs(&a, &b);
After swapPtrs() returns, a's value will be 8 and b's value will be 5, which is what you would expect.

So far, this shouldn't be too surprising, and may even be a bit ho-hum.
What's very interesting to me, in a nerdy way, is that even though the calling semantics of the two swap functions and what happens in their bodies are quite different, the underlying assembly code (in the Microsoft Visual Studio implementation) is identical.

swapRefs() assembly code - the lines of C++ code are shown in comments.
Code:
//    int temp;
//    temp = x;
mov         eax,dword ptr [x]  
mov         ecx,dword ptr [eax]  
mov         dword ptr [temp],ecx  
//    x = y;
mov         eax,dword ptr [x]   
mov         ecx,dword ptr [y]  
mov         edx,dword ptr [ecx]  
mov         dword ptr [eax],edx  
//    y = temp;
mov         eax,dword ptr [y]  
mov         ecx,dword ptr [temp]  
mov         dword ptr [eax],ecx
swapPtrs() assembly code - the lines of C++ code are shown in comments.
Code:
//    int temp;
//    temp = *x;
mov         eax,dword ptr [x]  
mov         ecx,dword ptr [eax]  
mov         dword ptr [temp],ecx  
//    *x = *y;
mov         eax,dword ptr [x]  
mov         ecx,dword ptr [y]  
mov         edx,dword ptr [ecx]  
mov         dword ptr [eax],edx  
//    *y = temp;
mov         eax,dword ptr [y]  
mov         ecx,dword ptr [temp]  
mov         dword ptr [eax],ecx
 

Answers and Replies

  • #2
sysprog
Nice observation and commentary @Mark44.
 
  • #3
jtbell
Mentor
I've always assumed that references are generally implemented the same way as pointers (memory locations that contain the addresses of other memory locations), and I've probably seen statements to that effect in various places over the years, but I think this is the first time I've seen actual assembly language code demonstrating it.
 
  • #4
rcgldr
Homework Helper
One issue with using references is that, looking at a call, you can't tell whether the function takes its arguments by reference or by value; you have to hunt down the prototype or the function definition itself to find out.
 
  • #5
sysprog
rcgldr said:
One issue with using references is that looking at a call, you can't tell if the function being called is using references or not, without having to look for a prototype or the actual function to see if it is using references.
Why is that an issue? Even a very good disassembler can't reproduce the original high-level-language source code. If you have the original source code, it's not an issue; and if you have only the object code, it doesn't matter what technique was used in the high-level code that was compiled to produce it, because at that point you're working with machine language.
 
  • #6
pbuk
Science Advisor
Gold Member
Not a good idea to create a swap function; it makes it much harder for the compiler to optimise the code. In many cases an optimised swap will simply involve a single XCHG instruction or 3 register-to-register instructions.

In some cases a swap may not take any instructions but simply a relabelling of the registers by the compiler!
 
  • #7
Mark44
pbuk said:
Not a good idea to create a swap function, it makes it much harder for the compiler to optimise the code: in many cases an optimised swap will simply involve 3 register-to-register instructions.
If you do the swap with inline code, rather than calling a function, it still takes instructions to copy the values in variables to registers. Here's the generated assembly code (VS) for the usual technique of swapping the values in two variables. As before, the C/C++ code is in // comments, and each assembly line has a comment detailing what is happening.
Code:
//    int temp;
//    temp = b;
008E298E  mov         eax,dword ptr [b]               ; copy b to EAX
008E2991  mov         dword ptr [temp],eax       ; copy b to temp
//    b = a;
008E2994  mov         eax,dword ptr [a]              ; copy a to EAX
008E2997  mov         dword ptr [b],eax              ; copy a to b
//    a = temp;
008E299A  mov         eax,dword ptr [temp]      ; copy temp to EAX
008E299D  mov         dword ptr [a],eax             ; copy temp to a
Are you suggesting that some compilers can optimize away some of this code? Such as by the use of XOR, as @sysprog showed?
 
  • #8
pbuk
Science Advisor
Gold Member
Mark44 said:
If you do the swap with inline code, rather than calling a function, it takes instructions to copy the value in variables to registers.
In many cases at least one of the values will already be in a register, and at least one of the swapped results will be used again before the registers are used for another purpose. To see this in practice you need to look at a real example of a tight-loop construct (which is where optimisation makes a difference).
 
  • #9
rcgldr
Homework Helper
rcgldr said:
One issue with using references is that looking at a call, you can't tell if the function being called is using references or not, without having to look for a prototype or the actual function to see if it is using references.
sysprog said:
Why is that an issue? Even a very good disassembler can't reproduce the original high-level language source code. I think that if you have the original source code then it's not an issue, and that if you have only the object code then it doesn't matter what technique was used in the high-level language code that was compiled to produce it, as at that point you're working with the machine language.
That is not what I meant. The issue is having to look at two different places in the source code to see whether a function passes by reference or by value. C++ should have used a different syntax for the calling code, so that it would be clear at the call site that one or more parameters are passed by reference.
 
  • #10
willem2
Mark44 said:
Are you suggesting that some compilers can optimize away some of this code? Such as by the use of XOR, as @sysprog showed?
A compiler will at least be able to produce:
Code:
mov register1, [memory1]
mov register2, [memory2]
mov [memory1], register2
mov [memory2], register1
If you turn on all optimizations in Visual Studio, it doesn't matter if you use a function for swap; the compiler will eliminate the function anyway.

I tried it in a bubble sort program, and the swap consisted of only 2 stores, since the 2 numbers had just been loaded into registers to compare them.
Code:
            if (array[i] > array[i + 1]) {
mov         eax,dword ptr [ecx*4+0C94394h]   // eax = array[i+1]
mov         edx,dword ptr array (0C94390h)[ecx*4]  //edx = array[i]
cmp         edx,eax 
jle         main+92h (0C91092h)    // skip the swap if edx <= eax 
                Swap(array[i], array[i + 1]);
mov         dword ptr array (0C94390h)[ecx*4],eax  //array [i] = eax 
mov         dword ptr [ecx*4+0C94394h],edx   // array[i+1] = edx
The Xor method is horrible. You still need two loads and two stores to accomplish the swap, and you only add extra code with computations that depend on each other and can't be done in parallel.
The compiler should be able to see that it is equivalent to the other method, but I can't get it to optimize it away.
 
  • #11
sysprog
willem2 said:
The Xor method is horrible. You still need two loads and two stores to accomplish the swap, and you only add extra code with computations that depend on each other and can't be done in parallel.
Each of the 3 Xor instructions that accomplish the swap does a fetch and a store.

This is from the IBM System 370 Principles of Operation (1975) manual:
The sequence A EXCLUSIVE-ORed B, B EXCLUSIVE-ORed A, A EXCLUSIVE-ORed B results in the exchange of the contents of A and B without the use of an auxiliary buffer area.​

and:
The execution of XI and XC consists in fetching a first-operand byte from main storage and subsequently storing the updated value. These fetch and store accesses to a particular byte do not necessarily occur one immediately after the other. Thus, the instruction EXCLUSIVE OR cannot be safely used to update a shared location in main storage if the possibility exists that another CPU or a channel may also be updating the location. For XI, only one byte is stored.​

I think that it's not "horrible" − we used it to conserve memory back when that mattered even more than it does now − and we were fully aware of the multitasking and multiprocessing considerations …
 
  • #12
willem2
sysprog said:
I think that it's not "horrible" − we used it to conserve memory back when that mattered even more than it does now − and we were fully aware of the multitasking and multiprocessing considerations
I wasn't thinking about multitasking/multiprocessing, but about a single processor (or processor core) executing more than one instruction at the same time (superscalar execution).
This can happen with the two load/store pairs of the non-XOR swap: they are completely independent. It cannot happen with the XOR solution, because every XOR but the first depends on an earlier XOR, and the first XOR depends on two memory loads and can't start until both are done.
 
  • #13
anorlunda
Staff Emeritus
Insights Author
In today's world, both CPU speed and memory seem to be abundant. However, there can still be cases where saving memory is more important than saving time. My point is simply that we can't make a blanket statement saying that optimization must always favor speed over space.
 
  • #14
sysprog
willem2 said:
I wasn't thinking about multitasking/processing, but about a single processor (or processor core) executing more than one instruction at the same time (superscalar execution)
This can happen with two load/store pairs of the non-xor swap instruction. They completely independent. This can not happen with the xor solution, because all xor's but the first depend on earlier xor's. The first xor depends on two memory loads, and can't start until they are both done.
How are they "completely independent"? If you're using a swap method that requires a 3rd location, e.g. [copy A to C, copy B to A, copy C to B] then how does being able to run more than one instruction in a single clock cycle allow you to fetch from and store to a single location without having to do that sequentially? Don't you need to store A to C before you overwrite A from B? And don't you need to have already stored to C from A before you use the value from C to overwrite B?

According to the Intel manual https://www.intel.com/content/dam/w...4-ia-32-architectures-optimization-manual.pdf

The execution core of the Intel Core microarchitecture is superscalar and can process instructions out of order. When a dependency chain causes the machine to wait for a resource (such as a second-level data cache line), the execution core executes other instructions. This increases the overall rate of instructions executed per cycle (IPC).​

That's great, but I think that it cannot relieve us of the sequentiality of the swap process, whether we use a 3rd memory location or not.
 
  • #15
sysprog
pbuk said:
Not a good idea to create a swap function, it makes it much harder for the compiler to optimise the code: in many cases an optimised swap will simply involve a single XCHG instruction or 3 register-to-register instructions.

In some cases a swap may not take any instructions but simply a relabelling of the registers by the compiler!
I'm curious about both of those statements − it seems to me that swapping values between 2 memory locations always requires 3 elements, whether they are logic gates, or microinstructions sequenced within an instruction, or ordinary instructions − regarding the XCHG instruction:

from http://www.c-jump.com/CIS77/ASM/DataTypes/T77_0200_exchanging_integers.htm:

You can exchange data between registers or between registers and memory, but not from memory to memory.​

− they reference register pairs − it looks to me like you could use it for part of a tag sort, but not for a plain bubble sort on records longer than a register pair. As you presumably know, even with a tag sort you have to read each whole record once and write each whole record once, but to do the sort you get to swap just the tags …

On IBM mainframes, we have the Compare and Swap instruction, the microcode for which is proprietary, but inferable from its functions, which are documented:

https://www.ibm.com/support/knowledgecenter/ssw_ibm_i_72/rzahw/rzahwrzahwcascasco.htm
 
  • #16
willem2 said:
The Xor method is horrible. You still need two loads and two stores to accomplish the swap, and you only add extra code with computations that depend on each other and can't be done in parallel.
The compiler should be able to see that it is equivalent to the other method, but I can't get it to optimize it away.
The XOR variant is not equivalent to the regular swap. Consider what happens if the pointers are identical...
 
  • #17
sysprog
The XOR variant is not equivalent to the regular swap. Consider what happens if the pointers are identical...
Did you look at the C code in the quote in post #1 in this thread? − it checks for that via if (x != y) − so that the XORs happen only if the values are not equal − there's no need to swap them if they're equal …
 
  • #18
sysprog said:
Did you look at the C code in the quote in post #1 in this thread? − it checks for that via if (x != y) − so that the XORs happen only if the values are not equal − there's no need to swap them if they're equal …
Yes I did. The reason you check for the pointers not being equal is not that you don't need to swap in that case, but that the algorithm fails (it zeroes the value both pointers refer to).

Relevant to my reply is that I don't expect an optimizing compiler to dare optimize the XOR version by replacing it with a swap version, or to be smart enough to do so: partly because pointer assignment and XOR are very different operations, but also because of this corner case with aliasing.

BTW: I really like the XOR swap, but using it has been discouraged for a long time. Doing it with a tmp variable is something the compiler can understand and optimize (MOV elimination, out-of-order execution is possible, etc.). It is not uncommon that "clever" optimizations result in slower code because the compiler can't tell what is safe to do. On a modern computer the pointer swap should use the same amount of memory (edit: registers) and be faster than the XOR swap.
 
  • #19
sysprog
Yes I did. The reason you check for the pointers not being equal is not because you don't need to swap them in that case but that the algorithm fails (it zeros both pointers).
I think that the fact that you don't need to swap them if they're equal is why you can get away with not attempting the swap in that case; you therefore never hit the failure, and you successfully swap the 2 values without the 3rd memory location every time they're unequal and you want to swap them.

from https://en.wikipedia.org/wiki/XOR_swap_algorithm
However, the algorithm fails if x and y use the same storage location, since the value stored in that location will be zeroed out by the first XOR instruction, and then remain zero; it will not be "swapped with itself". This is not the same as if x and y have the same values. The trouble only comes when x and y use the same storage location, in which case their values must already be equal. That is, if x and y use the same storage location, then the line:​
X := X XOR Y​
sets x to zero (because x = y so X XOR Y is zero) and sets y to zero (since it uses the same storage location), causing x and y to lose their original values.​

I cited the instance of equality only as being a sufficient justification for not trying to do a swap in the first place; I think that it's likely that anyone who understands step-by-step how the 3-XOR swap works, will also recognize that XORing something against itself will zero it out.
Relevant to my reply is that I don't expect an optimizing compiler to dare optimize the XOR version by replacing it with a swap version or be smart enough to do so. Partially because pointer assignment and XOR are very different but also this corner case with aliasing.
IBM uses PL/S III for its system code. That compiler calls the assembler, which allows putting assembly language directly in line with the otherwise-PL/1-like code, so the programmer can more easily do his own post-auto-optimization 'manual' optimization. IBM doesn't (generally) provide PL/S III to its customers (there have been 'unofficial' versions cobbled together using the PL/1 compiler and the assembler). So when we're writing system code, want the capabilities of the PL/1 Optimizing Compiler (it's a procedural and imperative language), and still want to do assembly-language things, we do something like DCL SUB01 ENTRY EXTERNAL OPTIONS(ASM,INTER,NOMAP);, then separately assemble the subroutine, and then resolve the resulting subroutine address in the compiled code via the linkage editor. That's less convenient than the PL/S III capability to auto-call the assembler without resorting to a linked external subroutine, but it still works.
BTW: I really like the XOR swap but using has been discouraged for a long time. Doing it with a tmp pointer is something the compiler can understand and optimize (MOV-elimination, OoO is possible, etc). It is not uncommon that "clever" optimizations results in slower code due to the compiler not being able to understand what is safe to do. On a modern computer the pointer swap should use the same amount of memory (edit: registers) and be faster than the XOR swap.
I've seen it frequently in IBM mainframe assembly language and PL/S III code (mostly from pre-'Object Code Only' days) and in machine language when I'm reading dumps − only infrequently in Intel or Motorola or C code. The speed of the 3-XOR swap compared to that of the 3rd-memory-location swap depends in part on the specifics of the machine architecture.
 
