[Bug rtl-optimization/103550] 2 more instructions generated by gcc than clang

Sat Dec 4 07:32:38 GMT 2021

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103550

--- Comment #7 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #5)
> (In reply to cqwrteur from comment #4)
> > (In reply to Andrew Pinski from comment #2)
> > > Looks like it is a register allocation/scheduling issue. The extra
> > > instructions are mov.
> > 
> > Are there good algos that can allocate registers optimal?
> 
> note the move instructions might be "free" on most modern x86 machine, it
> just takes up icache space and decode time.
> having so little registers and having a 2 operand instruction set makes
> register allocation a hard problem really. Yes LLVM might get it right in
> this testcase but there are others where GCC might do a better job.

https://github.com/openssl/openssl/blob/38288f424faa0cf61bd705c497bb1a1657611da1/crypto/sha/asm/sha512-x86_64.pl#L18

OpenSSL's comments:

# 40% improvement over compiler-generated code on Opteron. On EM64T
# sha256 was observed to run >80% faster and sha512 - >40%. No magical
# tricks, just straight implementation... I really wonder why gcc
# [being armed with inline assembler] fails to generate as fast code.
# The only thing which is cool about this module is that it's very
# same instruction sequence used for both SHA-256 and SHA-512. In
# former case the instructions operate on 32-bit operands, while in
# latter - on 64-bit ones. All I had to do is to get one flavor right,
# the other one passed the test right away:-)