Who cares about performance (or Intel's CPU errata)?

Andrew Pinski pinskia@gmail.com
Sat May 27 23:30:50 GMT 2023


On Sat, May 27, 2023 at 3:54 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>
> "Andrew Pinski" <pinskia@gmail.com> wrote:
>
> > On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
> >>
> >> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:
> >>
> >> --- dontcare.c ---
> >> int ispowerof2(unsigned __int128 argument) {
> >>     return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1;
> >> }
> >> --- EOF ---
> >>
> >> GCC 13.3    gcc -march=haswell -O3
> >>
> >> https://gcc.godbolt.org/z/PPzYsPzMc
> >> ispowerof2(unsigned __int128):
> >>         popcnt  rdi, rdi
> >>         popcnt  rsi, rsi
> >>         add     esi, edi
> >>         xor     eax, eax
> >>         cmp     esi, 1
> >>         sete    al
> >>         ret
> >>
> >> OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output?
> >
> > Because the popcount is going to the same register, there is no false
> > dependency ....
> > The false dependency errata only applies if the result of the popcnt
> > is going to a different register, the processor thinks it depends on
> > the result in that register from a previous instruction but it does
> > not (which is why it is called a false dependency). In this case it
> > actually does depend on the previous result since the input is the
> > same as the input.
>
> OUCH, my fault; sorry for the confusion and the wrong accusation.
>
> Nevertheless GCC fails to optimise code properly:
>
> --- .c ---
> int ispowerof2(unsigned long long argument) {
>     return __builtin_popcountll(argument) == 1;
> }
> --- EOF ---
>
> GCC 13.3    gcc -m32 -mpopcnt -O3
>
> https://godbolt.org/z/fT7a7jP4e
> ispowerof2(unsigned long long):
>         xor     eax, eax
>         xor     edx, edx
>         popcnt  eax, [esp+4]
>         popcnt  edx, [esp+8]
>         add     eax, edx                 # eax is less than 64!
>         cmp     eax, 1    ->    dec eax  # 2 bytes shorter

dec eax is done for -Os already. -O2 means performance, it does not
mean decrease size. dec can be slower as it can create a false
dependency and it requires eax register to be not alive at the end of
the statement. and IIRC for x86 decode, it could cause 2 (not 1)
micro-ops.

>         sete    al
>         movzx   eax, al                  # superfluous

No it is not superfluous, well ok it is because of the context of eax
(besides the lower 8 bits) are already zero'd but keeping that track
is a hard problem and is turning problem really. And I suspect it
would cause another false dependency later on too.

For -Os -march=skylake (and -Oz instead of -Os) we get:
        popcnt  rdi, rdi
        popcnt  rsi, rsi
        add     esi, edi
        xor     eax, eax
        dec     esi
        sete    al

Which is exactly what you want right?

Thanks,
Andrew

>         ret
>
> 5 bytes and 1 instruction saved; 5 bytes here and there accumulate to
> kilo- or even megabytes, and they can extend code to cross a cache line
> or a 16-byte alignment boundary.
>
> JFTR: same for "__builtin_popcount(argument) == 1;" and 32-bit argument
>
> JFTR: GCC is notorious for generating superfluous MOVZX instructions
>       where its optimiser SHOULD be able see that the value is already
>       less than 256!
>
> Stefan


More information about the Gcc mailing list