Mark44 said:
I'm very much not a fan of difficult-to-maintain code, but in certain rare instances its use is justified. Any such uses should be well-documented via code comments.
With regard to item 2 above, I've written assembly code (using AVX-512 instructions) that outperforms optimized C code. The code, both C and assembly, iterates through fairly large arrays of floats and picks out all of the array elements that are larger than some specified value. I've timed both code chunks, and my assembly code runs faster than the C code, even when the C code is fully optimized for speed.
Compilers usually produce well-optimized code, but if you know what you're doing, you can sometimes write code that is faster.
Completely agree: if you really need to speed up a tight loop, and you are willing to restrict your target processor, then assembly is the way to go. With a device driver this is always the case, although with, say, a numerical analysis routine you might want portability, so some horrible, hacky (but well-commented) C code is indeed best.
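For reference, the kind of loop described in the quote (picking out the array elements larger than some specified value) might look like this in portable C++. This is only a sketch: the name select_greater and the use of std::vector are illustrative assumptions, not Mark44's actual code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not Mark44's code): copy every element of `values`
// that is strictly greater than `threshold` into a new vector.
std::vector<float> select_greater(const std::vector<float>& values, float threshold) {
    std::vector<float> selected;
    selected.reserve(values.size());  // avoid reallocations inside the loop
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (values[i] > threshold) {
            selected.push_back(values[i]);
        }
    }
    return selected;
}
```

An AVX-512 version can handle 16 floats per instruction instead of one; that part is processor-specific and is not shown here.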
However, none of this applies to routine use of a[i++]. In fact, I have done some experimenting, and you might be surprised to find that this quick-to-type shortcut can actually produce SLOWER code!
Edit: as well as being off-topic, what follows is original research, but I hope it will be allowed to remain as a point of interest. Probably best not to follow up here, though!
[code lang="c" title="Source code (simple Fibonacci sequence)"]
#include <iostream>
using namespace std;

int main() {
    int a[100] = {0, 1};
    int k = 2;
    int f_k_minus_1 = 1;  // Matches a[1].
    int f_k_minus_2 = 0;  // Matches a[0].
    int f_k = 1;          // Initialised so the first loop test is well defined.
    while (f_k < 1000) {
        f_k = f_k_minus_1 + f_k_minus_2;
        f_k_minus_2 = f_k_minus_1;
        f_k_minus_1 = f_k;
        // Either (bad):
        a[k++] = f_k;
        // Or (good):
        // a[k] = f_k;
        // k++;
    }
    // Print the last value stored (the first term that is >= 1000).
    cout << a[k - 1];
    return 0;
}
[/code]
Compiled using g++ -S (i.e. no optimisation), g++ version 7.5.0 on x86_64.
The generated code saves all intermediate results to memory. The 'bad' code uses an additional register for the incremented value of k which it then saves to memory.
[code lang="asm" title="Using a[k++]"]
movl -432(%rbp), %eax # Load k into eax.
leal 1(%rax), %edx # Compute k + 1 in edx.
movl %edx, -432(%rbp) # Store edx back into k.
[/code]
The 'good' code simply increments k in place. This saves an instruction but adds an additional fetch from (cached) memory to reload k into the ALU.
[code lang="asm" title="Using a[k]; k++"]
movl -432(%rbp), %eax # Load k into eax.
# Do some other stuff.
addl $1, -432(%rbp) # Add 1 to k in memory.
[/code]
Compiled using g++ -S -O, g++ version 7.5.0 on x86_64.
Now that the compiler is not worried about producing an informative core dump for debugging, it uses registers for all intermediate values and comes up with almost the same object code for both the 'good' and 'bad' sources. Separating the increment from the array indexing still saves 1 instruction, although only on the first iteration of the loop. This time I am going to show the whole loop for each case.
[code lang="asm" title="Using a[k++]"]
.L3:
leal (%rsi,%rcx), %edx
movl %eax, %edi
movl %edx, -4(%rsp,%rax,4)
addq $1, %rax
movl %ecx, %esi
movl %edx, %ecx
cmpl $999, %edx
jle .L3
[/code]
[code lang="asm" title="Using a[k]; k++"]
jmp .L3
.L6:
movl %edx, %ecx
.L3:
leal (%rcx,%rsi), %edx
movl %edx, -4(%rsp,%rax,4)
movl %eax, %edi
addq $1, %rax
movl %ecx, %esi
cmpl $999, %edx
jle .L6
[/code]
Conclusion: a[i++]; may be quicker to type but is harder to maintain and may well run slower than a[i]; i++;.
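To repeat the comparison with another compiler or optimisation level, one option is to put each form in its own function, so that the g++ -S output can be compared function by function. The following is a minimal sketch; the function names, the limit parameter and the vector-based storage are illustrative assumptions, not the code measured above.

```cpp
#include <cstddef>
#include <vector>

// Both functions are hypothetical test helpers, not the code measured above.
// `limit` is assumed small enough that the int arithmetic does not overflow.

// Form 1: increment inside the subscript.
std::vector<int> fib_post_increment(int limit) {
    std::vector<int> a(100, 0);
    a[1] = 1;
    std::size_t k = 2;
    int prev = 1, prev2 = 0, f = 1;
    while (f < limit && k < a.size()) {
        f = prev + prev2;
        prev2 = prev;
        prev = f;
        a[k++] = f;
    }
    a.resize(k);  // keep only the entries that were written
    return a;
}

// Form 2: store first, then increment in a separate statement.
std::vector<int> fib_separate_increment(int limit) {
    std::vector<int> a(100, 0);
    a[1] = 1;
    std::size_t k = 2;
    int prev = 1, prev2 = 0, f = 1;
    while (f < limit && k < a.size()) {
        f = prev + prev2;
        prev2 = prev;
        prev = f;
        a[k] = f;
        k++;
    }
    a.resize(k);  // keep only the entries that were written
    return a;
}
```

Both functions should return the same sequence, so any difference in the generated assembly comes from how the compiler handles the two forms rather than from the program's behaviour.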