GCC asm block optimizations on x86_64
Darryl Miles
darryl-mailinglists@netbauds.net
Tue Aug 28 22:03:00 GMT 2007
Rask Ingemann Lambertsen wrote:
> On Mon, Aug 27, 2007 at 06:11:04AM +0100, Darryl L. Miles wrote:
>> [1] This issue is in the way %edx is zero'ed, I would think zeroing out
>> registers/memory/whatever would be a special optimization case in this
>> code its clear that there is no useful value in the CPU condition flags,
>> so "xorl %edx,%edx" would make most sense, instead of having to find
>> another register to load with zero before then copying. Interestingly
>> enough -O generates a "mov $0,%r8d", while -O2 generates a "xor %r8d,%r8d".
>
> Peephole optimization isn't performed at -O.
>
> It is usually better to post asm output from "gcc -S -dp" than "objdump
> --disassemble" output because the former shows which instruction pattern GCC
> is using.
Thanks for the note on the peephole. Can the peephole pass substitute
sequences when the lifetimes of various processor features overlap?
Take the 'flags' bits, for example: you can't peephole a sequence that
does a compare (setting flag bits), then loads a register with zero
(not affecting flag bits), then branches based on the flag bits.
Replacing the load of zero with 'xor' on i386 would destroy the flags.
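To make the flags concern concrete, here is a minimal sketch in GCC inline asm. The function names are made up for illustration, and the non-x86_64 branches are only portable stand-ins so the file still builds elsewhere. Both functions compare, zero a scratch register, then read the carry flag; only the zeroing instruction differs.

```c
#include <stdint.h>

/* Hypothetical illustration: returns 1 if a < b (unsigned), read from the
   carry flag set by cmp. The zeroing in between uses mov, which leaves the
   flags untouched. */
static int below_with_mov(uint64_t a, uint64_t b)
{
#if defined(__x86_64__) && defined(__GNUC__)
    uint32_t zero;
    uint8_t carry;
    __asm__ ("cmpq %[b], %[a]\n\t"   /* computes a - b, CF = (a < b) */
             "movl $0, %[z]\n\t"     /* zero a register, flags preserved */
             "setc %[c]"             /* read CF */
             : [z] "=&r" (zero), [c] "=q" (carry)
             : [a] "r" (a), [b] "r" (b)
             : "cc");
    (void)zero;
    return carry;
#else
    return a < b;                    /* portable stand-in */
#endif
}

/* Same sequence, but the zeroing uses xor, which clears CF, so the result
   of the compare is lost before setc reads it. */
static int below_with_xor(uint64_t a, uint64_t b)
{
#if defined(__x86_64__) && defined(__GNUC__)
    uint32_t zero;
    uint8_t carry;
    __asm__ ("cmpq %[b], %[a]\n\t"
             "xorl %[z], %[z]\n\t"   /* zero a register, but CF is cleared */
             "setc %[c]"
             : [z] "=&r" (zero), [c] "=q" (carry)
             : [a] "r" (a), [b] "r" (b)
             : "cc");
    (void)zero;
    return carry;
#else
    (void)a; (void)b;
    return 0;                        /* stand-in: models the cleared CF */
#endif
}
```

So a substitution of mov by xor is only safe where the flags are dead at that point, which is the case in the u64_divide listing because the zeroing happens before the divq.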
0000000000000090 <u64_divide>:
00: 49 89 d1    mov %rdx,%r9      <<- [1] save %rdx in %r9 for arg-as-return
03: 48 8b 07    mov (%rdi),%rax
06: ?? ?? ??    xor %edx,%edx     <<- implicit zero of high 32bits, would accept xorq %rdx,%rdx
09: ?? ??       xor %r8d,%r8d
0b: 48 f7 36    divq (%rsi)
0e: 73 02       jae 12 <u64_divide+0x12>
10: ?? ??       inc %r8d
12: 49 89 01    mov %rax,(%r9)    <<- [1] use saved %rdx to return argument
15: 48 89 11    mov %rdx,(%rcx)
18: ?? ??       mov %r8d,%eax
1a: c3          retq
Oops, there were actually a few errors in the hand-optimized version, so
the above version is fixed. The return value of the function is 32 bits
wide, so %r8d is the correct register to select. The insn at offset 0x18
should not have referenced %ebx but %r8/%r8d. Also, the insn at offset
0x06 is probably only 2 bytes long.
I also did not say which version of GCC I was using: it was 4.0.2. I've
just tried 4.2.1 and the same code is generated, although -O6 appears to
inline things further, which led me to find an invalid constraint:
"g" ((*divisor)) should be "r" ((*divisor)). With "g" the compiler tried
to use a constant, whereas only a register or memory via an indirected
register is valid here.
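As a sketch of the corrected constraint (this is not the original source; the function name and signature below are assumptions for illustration), the divisor operand can be written like this. The "g" constraint also permits an immediate, and div has no immediate form, so once inlining makes the divisor a known constant the asm no longer assembles.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the function under discussion: divides
   dividend by divisor with divq and stores quotient and remainder. */
static void div_u64_sketch(uint64_t dividend, uint64_t divisor,
                           uint64_t *quotient, uint64_t *remainder)
{
    uint64_t q, r;

    assert(divisor != 0);
#if defined(__x86_64__) && defined(__GNUC__)
    /* divq divides %rdx:%rax by its operand; quotient goes to %rax and
       remainder to %rdx. "r" forces the divisor into a register; "g"
       here would also allow a constant, which divq cannot encode. */
    __asm__ ("divq %[div]"
             : "=a" (q), "=d" (r)
             : "a" (dividend), "d" ((uint64_t)0), [div] "r" (divisor)
             : "cc");
#else
    /* Portable stand-in so the sketch builds on other targets. */
    q = dividend / divisor;
    r = dividend % divisor;
#endif
    *quotient = q;
    *remainder = r;
}
```

Writing the constraint as "rm" instead of "r" would let the compiler pick either a register or a memory operand, both of which divq accepts.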
Another concern that occurs to me: if the __asm__ constraints are not
100% perfect, is there any way to test/permute every possible way the
compiler might generate the code?
The main things are that if I have given a register or memory or
constant constraint, I'd like to know if all 3 versions would assemble.
The number of possible permutations for selection would multiply up
but at least I could know for sure the constraints are correct.
This would need GCC to run in a special mode, maybe I could give the
name of the symbol/function which I wanted it to do its work on and the
generated code would emit multiple instances of that symbol with a
counter appended to the symbol name.
gcc -c -o /tmp/testit.o -fasm-block-permutate=u64_divide -fasm-block-depth=all testit.c
Where "-fasm-block-permutate=u64_divide" earmarks which code wants
special treatment, and "-fasm-block-depth=all" is some way of describing
how deep you want the permutations to go, possibly to the point of
mathematical certainty.
Then in the generated /tmp/testit.o I would get symbols:
u64_divide <-- this would be the default code gen
u64_divide_0000001 <-- this would be code gen for auto generated case 1
u64_divide_0000002 <-- this would be code gen for auto generated case 2
Annotated output, as with "-S" or "-S -dp", would then explain what the
criteria for each auto-generated case are.
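Short of such a compiler mode, the check can be approximated by hand: stamp out one copy of the asm per constraint letter and see whether each one assembles. The sketch below is only an illustration of that idea; the macro and function names are invented, and the non-x86_64 branch is a plain C stand-in.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
/* Defines one division helper whose divisor operand uses exactly the
   given constraint, so each operand form is assembled at least once. */
#define DEFINE_DIV_VARIANT(name, constraint)                          \
    static uint64_t name(uint64_t n, uint64_t d)                      \
    {                                                                 \
        uint64_t q, r;                                                \
        assert(d != 0);                                               \
        __asm__ ("divq %[div]"                                        \
                 : "=a" (q), "=d" (r)                                 \
                 : "a" (n), "d" ((uint64_t)0), [div] constraint (d)   \
                 : "cc");                                             \
        (void)r;                                                      \
        return q;                                                     \
    }
#else
/* Portable stand-in so the sketch builds on other targets. */
#define DEFINE_DIV_VARIANT(name, constraint)                          \
    static uint64_t name(uint64_t n, uint64_t d)                      \
    {                                                                 \
        assert(d != 0);                                               \
        return n / d;                                                 \
    }
#endif

DEFINE_DIV_VARIANT(div_variant_reg, "r")   /* register operand */
DEFINE_DIV_VARIANT(div_variant_mem, "m")   /* memory operand   */
/* A third variant with "i" and a literal divisor is the one expected to
   be rejected by the assembler, since divq has no immediate form. */
```

This only covers the operand forms one by one; it does not explore combinations across several operands, which is where the proposed mode would earn its keep.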
Just a thought,
Darryl