[Bug target/96846] New: [x86] Prefer xor/test/setcc over test/setcc/movzx sequence
andysem at mail dot ru
gcc-bugzilla@gcc.gnu.org
Sat Aug 29 17:47:59 GMT 2020
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96846
Bug ID: 96846
Summary: [x86] Prefer xor/test/setcc over test/setcc/movzx
sequence
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: andysem at mail dot ru
Target Milestone: ---
This has already been reported in bug 86352, but that bug also describes a few
other issues, so I decided to create a separate bug focused on this one
particular issue.
When performing boolean operations, gcc often prefers the following pattern:
1. A test instruction (test/cmp).
2. "setcc %al" (where al can be any 8-bit register).
3. If the bool needs to be processed further, "movzx %al, %eax" (where al and
eax are the 8 and 32-bit versions of the register picked in #2).
The example code can be seen here:
https://gcc.godbolt.org/z/z3WGbq
For convenience I'm posting the code below:
bool narrow_boolean(int a) { return a != 5; }

unsigned int countmatch(unsigned int arr[], unsigned int key)
{
    unsigned count = 0;
    for (int i = 0; i < 1024; i++) {
        count += ((arr[i] & key) != 5);
    }
    return count;
}
narrow_boolean(int):
        cmp     edi, 5
        setne   al
        ret
countmatch(unsigned int*, unsigned int):
        lea     rcx, [rdi+4096]
        xor     eax, eax
.L4:
        mov     edx, DWORD PTR [rdi]
        and     edx, esi
        cmp     edx, 5
        setne   dl
        movzx   edx, dl
        add     rdi, 4
        add     eax, edx
        cmp     rcx, rdi
        jne     .L4
        ret
Command line parameters: -Wall -O3 -mtune=skylake -fno-tree-vectorize
The problem with the generated code is as follows:
- The setcc instruction only modifies the lower 8 bits of the full
architectural register. The upper bits remain unmodified and potentially
"dirty", meaning that subsequent instructions taking this register as input
may require merging the full register value, with a performance penalty.
- Since setcc preserves the upper bits, subsequent instructions consuming the
output register become dependent on whatever instruction previously wrote that
register. This creates a false dependency, which is especially a problem in the
case of narrow_boolean above.
- On Haswell and later, "movzx %al, %eax" is not eliminated at register rename;
it consumes an ALU uop and has non-zero latency. "movzx %al, %ebx" (i.e. with
an output register different from the input) is eliminated at the rename stage,
but requires an additional register. Yet gcc seems to prefer the former form
unless -frename-registers is specified.
See this excellent StackOverflow question and answers for details:
https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to
(BTW, the code in this bug originated from that question.)
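For illustration, here is what the rename-friendly variant of the loop body
could look like (a hand-written sketch, not actual GCC output; the choice of
ecx as the movzx destination is mine):

```asm
        mov     edx, DWORD PTR [rdi]
        and     edx, esi
        cmp     edx, 5
        setne   dl
        movzx   ecx, dl         ; dest != src, so eliminated at rename on Haswell+
        add     eax, ecx
```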
A better instruction pattern would be this:
1. Zero the destination register beforehand with "xor %eax, %eax".
2. Perform the test. (The xor must come first, since xor itself clobbers
EFLAGS.)
3. "setcc %al" to set the result.
If this pattern appears in a loop body, the initial xor can potentially be
hoisted out of the loop. The important part is that the xor eliminates the
dependency and doesn't cost an ALU uop (it is recognized as a zeroing idiom at
rename). Arguably, it also produces smaller code, since xor is only 2 bytes vs.
3 bytes for movzx.
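Applied to the examples above, the preferred sequences would look roughly like
this (again a hand-written sketch, not actual GCC output; the use of r8d in the
loop to free up the zeroed register is my own choice):

```asm
narrow_boolean(int):
        xor     eax, eax        ; zeroing idiom: no ALU uop, breaks dependency
        cmp     edi, 5
        setne   al              ; upper bits of eax are already zero
        ret

; countmatch inner loop:
.L4:
        xor     edx, edx        ; candidate for hoisting out of the loop
        mov     r8d, DWORD PTR [rdi]
        and     r8d, esi
        cmp     r8d, 5
        setne   dl
        add     rdi, 4
        add     eax, edx
        cmp     rcx, rdi
        jne     .L4
```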
The initial zeroing requires an additional register, meaning the destination
cannot be one of the registers involved in the test in #2. However, that is
often not a problem, as in the examples above. I suppose the compiler could
estimate whether there are spare registers, and use the xor/test/setcc sequence
if there are and test/setcc/movzx if not.