GCC asm block optimizations on x86_64

Tue Aug 28 22:03:00 GMT 2007

Rask Ingemann Lambertsen wrote:
> On Mon, Aug 27, 2007 at 06:11:04AM +0100, Darryl L. Miles wrote:
>> [1] This issue is in the way %edx is zero'ed, I would think zeroing out 
>> registers/memory/whatever would be a special optimization case in this 
>> code its clear that there is no useful value in the CPU condition flags, 
>> so "xorl %edx,%edx" would make most sense, instead of having to find 
>> another register to load with zero before then copying.  Interestingly 
>> enough -O generates a "mov $0,%r8d", while -O2 generates a "xor %r8d,%r8d".
> 
>    Peephole optimization isn't performed at -O.
> 
>    It is usually better to post asm output from "gcc -S -dp" than "objdump
> --disassemble" output because the former shows which instruction pattern GCC
> is using.

Thanks for the note on the peephole.  Can the peephole pass substitute 
sequences when the lifetimes of various processor features overlap?  
Take the 'flags' bits, for example: you can't peephole a sequence that 
does a compare (setting flag bits), then loads a register with zero 
(not affecting flag bits), then branches on the flag bits.  Replacing 
the load of zero with 'xor' on i386 would destroy the flags.
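
To make the flags hazard concrete, here is a minimal sketch (not taken from the code under discussion).  The function names less_than_with_mov and less_than_with_xor are hypothetical.  Both compare a with b and then zero the result register before reading the carry flag with setb; only the 'mov' form leaves the flags from the compare intact.  The #else branch is a plain C stand-in, assumed only so the sketch compiles on non-x86_64 targets.

```c
#include <stdint.h>

#if defined(__x86_64__)
/* cmp sets CF when a < b (unsigned); movl $0 does not touch the flags,
 * so setb still sees the result of the compare. */
static int less_than_with_mov(uint64_t a, uint64_t b)
{
    unsigned int r;
    __asm__ ("cmpq %2, %1\n\t"
             "movl $0, %0\n\t"
             "setb %b0"
             : "=r"(r) : "r"(a), "r"(b) : "cc");
    return (int)r;
}

/* Same sequence with the zeroing done by xor: xor clears CF, so the
 * compare result is lost and setb always stores 0. */
static int less_than_with_xor(uint64_t a, uint64_t b)
{
    unsigned int r;
    __asm__ ("cmpq %2, %1\n\t"
             "xorl %0, %0\n\t"
             "setb %b0"
             : "=r"(r) : "r"(a), "r"(b) : "cc");
    return (int)r;
}
#else
/* Stand-ins for other targets: they only model the x86_64 results. */
static int less_than_with_mov(uint64_t a, uint64_t b) { return a < b; }
static int less_than_with_xor(uint64_t a, uint64_t b) { (void)a; (void)b; return 0; }
#endif
```

So a peephole that turns the "mov $0" into an "xor" is only safe when nothing later reads the flags set before it.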

  0000000000000090 <u64_divide>:
    00:   49 89 d1                mov    %rdx,%r9    <<- [1] save %rdx in %r9 for arg-as-return
    03:   48 8b 07                mov    (%rdi),%rax
    06:   ?? ?? ??                xor    %edx,%edx   <<- implicit zero of high 32bits, would accept xorq %rdx,%rdx
    09:   ?? ??                   xor    %r8d,%r8d
    0b:   48 f7 36                divq   (%rsi)
    0e:   73 02                   jae    12 <u64_divide+0x12>
    10:   ?? ??                   inc    %r8d
    12:   49 89 01                mov    %rax,(%r9)  <<- [1] use saved %rdx to return argument
    15:   48 89 11                mov    %rdx,(%rcx)
    18:   ?? ??                   mov    %r8d,%eax
    1a:   c3                      retq

Oops, there were actually a few errors in the hand-optimized version, so 
the version above is fixed.  The function's return value is 32 bits wide, 
so %r8d is the correct register to select.  The insn at offset 0x18 should 
not have referenced %ebx but %r8/%r8d.  Also, the insn at offset 0x06 is 
probably only 2 bytes long.

I also did not say which version of GCC I was using.  It was 4.0.2, but 
I've just tried with 4.2.1 and the same code is generated, although -O6 
appears to try to inline things further, which led me to find an 
invalid constraint: "g" ((*divisor)) should be "r" ((*divisor)).  With 
"g" the compiler tried to use a constant, whereas only a register or 
memory via an indirected register is valid here.
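
For reference, a minimal sketch of what such an asm block might look like.  The original source of u64_divide is not shown in this thread, so the signature, the zero-divisor check and the 0/1 return convention below are assumptions.  The divisor operand uses "rm" (register or memory), following the remark above that both are valid; "g" would also permit an immediate, which divq cannot encode.  The #else branch is a plain C fallback assumed only for non-x86_64 targets.

```c
#include <stdint.h>

/* Hypothetical reconstruction, not the code from the original post. */
static int u64_divide(const uint64_t *dividend, const uint64_t *divisor,
                      uint64_t *quotient, uint64_t *remainder)
{
    uint64_t q, r;

    if (*divisor == 0)
        return 1;               /* assumption: report instead of trapping */

#if defined(__x86_64__)
    /* divq divides %rdx:%rax by the operand; quotient lands in %rax,
     * remainder in %rdx.  %rdx is preloaded with 0 through the "1"
     * matching constraint. */
    __asm__ ("divq %4"
             : "=a"(q), "=d"(r)
             : "0"(*dividend), "1"((uint64_t)0), "rm"(*divisor)
             : "cc");
#else
    q = *dividend / *divisor;
    r = *dividend % *divisor;
#endif

    *quotient = q;
    *remainder = r;
    return 0;
}
```

Letting the compiler load %rax and zero %rdx through constraints, rather than inside the asm text, also leaves it free to choose how to zero the register.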

Another concern that occurs to me: if the __asm__ constraints are not 
100% perfect, is there any way to test/permute every possible way the 
compiler might generate the code?

The main thing is that if I have given a register, memory or constant 
constraint, I'd like to know whether all 3 versions would assemble.  
The number of possible permutations would multiply up, but at least I 
could know for sure that the constraints are correct.
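
Short of compiler support, one manual approximation is to instantiate the same asm once per single-letter constraint, so that each alternative has to assemble on its own.  The sketch below is only an illustration under assumptions: the macro DEFINE_DIV_VARIANT and the names u64_div_reg and u64_div_mem are hypothetical, and the #else branch is a plain C fallback for non-x86_64 targets.

```c
#include <stdint.h>

#if defined(__x86_64__)
/* Stamp out one function per constraint string; the asm template is
 * identical, only the divisor's constraint differs. */
#define DEFINE_DIV_VARIANT(name, constraint)                              \
    static uint64_t name(uint64_t dividend, uint64_t divisor,             \
                         uint64_t *remainder)                             \
    {                                                                     \
        uint64_t q, r;                                                    \
        __asm__ ("divq %4"                                                \
                 : "=a"(q), "=d"(r)                                       \
                 : "0"(dividend), "1"((uint64_t)0), constraint(divisor)   \
                 : "cc");                                                 \
        *remainder = r;                                                   \
        return q;                                                         \
    }
#else
/* Fallback: the constraint is ignored and plain C division is used. */
#define DEFINE_DIV_VARIANT(name, constraint)                              \
    static uint64_t name(uint64_t dividend, uint64_t divisor,             \
                         uint64_t *remainder)                             \
    {                                                                     \
        *remainder = dividend % divisor;                                  \
        return dividend / divisor;                                        \
    }
#endif

DEFINE_DIV_VARIANT(u64_div_reg, "r")   /* divisor forced into a register */
DEFINE_DIV_VARIANT(u64_div_mem, "m")   /* divisor forced into memory     */
```

Adding a third variant with "i" would be expected to fail at assembly time, since divq takes no immediate operand; that is the kind of error the permutation mode would surface automatically.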

This would need GCC to run in a special mode, maybe I could give the 
name of the symbol/function which I wanted it to do its work on and the 
generated code would emit multiple instances of that symbol with a 
counter appended to the symbol name.

gcc -c -o /tmp/testit.o -fasm-block-permutate=u64_divide 
-fasm-block-depth=all testit.c

Where "-fasm-block-permutate=u64_divide" earmarks which code wants 
special treatment.

Where "-fasm-block-depth=all" is some way of describing how deep you 
want the permutations to go.  Possibly to the point of mathematical 
certainty.

Then in the generated /tmp/testit.o I would get symbols:

u64_divide  <-- this would be the default code gen
u64_divide_0000001  <-- this would be code gen for auto generated case 1
u64_divide_0000002  <-- this would be code gen for auto generated case 2

Then there would be annotated output, as with "-S" or "-S -dp", 
explaining what the criteria for each auto-generated case are.

Just a thought,

Darryl