This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug target/81614] New: x86 optimizer combines results of comparisons in a way that risks partial register stalls


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81614

            Bug ID: 81614
           Summary: x86 optimizer combines results of comparisons in a way
                    that risks partial register stalls
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cody at codygray dot com
  Target Milestone: ---
            Target: i?86-*-*

Consider the following code:

    bool foo(int a, int b, int c)
    {
        // It doesn't matter if this short-circuits  ('||' vs. '|')
        // because the optimizer treats them as equivalent.
        return (a == c || b == c);
    }

All versions of GCC (going back to at least 4.4.7 and forward to the current
8.0 preview) translate this to the following optimized assembly on x86 targets:

    foo(int, int, int):
        movl    12(%esp), %edx      # edx = c
        cmpl    %edx, 4(%esp)       # compare a with c
        sete    %al                 # al = (a == c)
        cmpl    8(%esp), %edx       # compare c with b
        sete    %dl                 # dl = (b == c)
        orl     %edx, %eax          # full 32-bit OR of partially-written registers
        ret

The problem here is the second-to-last instruction. It ORs together two full
32-bit registers, even though the preceding SETE instructions only set the low
8 bits of each register. Reading a full register immediately after writing only
part of it triggers a performance hazard on virtually all x86 processors called
a *partial register stall*.

(See http://www.agner.org/optimize/microarchitecture.pdf for details on exactly
how this is a performance problem on various implementations of x86. Although
there are differences in exactly *why* it is a speed penalty, it virtually
always is and *certainly* should be considered one when the output is tuned for
a generic x86 target.)

You get the same results at all optimization levels, including -Os (at least,
the relevant portion of the code is the same). You also see this for x86-64
targets:

    foo(int, int, int):
        cmpl    %edx, %edi
        sete    %al
        cmpl    %esi, %edx
        sete    %dl
        orl     %edx, %eax
        ret

One of two things should be done instead: either (A) perform the bitwise
operation *only* on the low bytes, or (B) pre-zero the entire 32-bit register
*before* setting its low byte to break dependencies.

Proposed Resolution A (use only low bytes):

    foo(int, int, int):
        movl    12(%esp), %edx
        cmpl    %edx, 4(%esp)
        sete    %al
        cmpl    8(%esp), %edx
        sete    %dl
        orb     %dl, %al
        ret

Proposed Resolution B (pre-zero to break dependencies):

    foo(int, int, int):
        movl    12(%esp), %edx
        xorl    %eax, %eax
        cmpl    %edx, 4(%esp)
        sete    %al
        xorl    %ecx, %ecx
        cmpl    8(%esp), %edx
        sete    %cl
        orl     %ecx, %eax
        ret

Approach A is the one used by Clang and MSVC. It solves the problem of partial
register stalls while avoiding the need for a third register as in Approach B.

The disadvantage of Approach A is that it creates only a byte-sized (8-bit)
result. This is perfectly fine if the function returns a bool, but doesn't work
if the function returns an integer type. There are two ways to solve that. What
GCC currently does if you change foo() to return int is add a MOVZBL
instruction between the OR and RET:

    foo(int, int, int):
        movl    12(%esp), %edx
        cmpl    %edx, 4(%esp)
        sete    %al
        cmpl    8(%esp), %edx
        sete    %dl
        orl     %edx, %eax
        movzbl  %al, %eax
        ret

This zero-extends the result in AL into EAX. (Notice that the partial register
stall hazard is still there.) This existing behavior could simply be
maintained. However, pre-zeroing the register as shown in Approach B would be
faster. (For details on why this holds across x86 microarchitectures, see
here: https://stackoverflow.com/a/33668295).
