Bug 82666 - [11/12/13/14 regression]: sum += (x>128 ? x : 0) puts the cmov on the critical path (at -O2)
Summary: [11/12/13/14 regression]: sum += (x>128 ? x : 0) puts the cmov on the critical path (at -O2)
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.0
Importance: P2 normal
Target Milestone: 11.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: cmov
Reported: 2017-10-22 22:56 UTC by Peter Cordes
Modified: 2023-08-04 04:18 UTC
CC: 6 users

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2023-05-30 00:00:00


Attachments
Reproducer (micro benchmark) (1.10 KB, text/plain)
2017-10-23 07:54 UTC, Martin Liška

Description Peter Cordes 2017-10-22 22:56:48 UTC
long long sumarray(const int *data)
{
    data = (const int*)__builtin_assume_aligned(data, 64);
    long long sum = 0;
    for (int c=0 ; c<32768 ; c++)
        sum += (data[c] >= 128 ? data[c] : 0);

    return sum;
}

The loop body is written to encourage gcc to make the loop-carried dep chain just an ADD, with independent branchless zeroing of each input.  But unfortunately, gcc7 and gcc8 -O2 de-optimize it back to what we get with older gcc -O3 from:

        if (data[c] >= 128)  // doesn't auto-vectorize with gcc4, unlike the above
            sum += data[c];

See also https://stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-then-o2.


https://godbolt.org/g/GgVp7E
gcc8.0 8.0.0 20171022  -O2 -mtune=haswell  (slow)

        leaq    131072(%rdi), %rsi
        xorl    %eax, %eax
.L3:
        movslq  (%rdi), %rdx
        movq    %rdx, %rcx
        addq    %rax, %rdx      # mov+add could have been LEA
        cmpl    $127, %ecx
        cmovg   %rdx, %rax      # sum = (x>=128 ? sum+x : sum)
        addq    $4, %rdi
        cmpq    %rsi, %rdi
        jne     .L3
        ret

This version has a 3-cycle-latency loop-carried dep chain (addq %rax, %rdx and cmov).  It's also 8 fused-domain uops (1 more than older gcc), but using LEA would fix that.


gcc6.3 -O2 -mtune=haswell (last good version of gcc on Godbolt, for this test)

        leaq    131072(%rdi), %rsi
        xorl    %eax, %eax
        xorl    %ecx, %ecx          # extra zero constant for a cmov source
.L3:
        movslq  (%rdi), %rdx
        cmpl    $127, %edx
        cmovle  %rcx, %rdx          # rdx = 0 when rdx < 128
        addq    $4, %rdi
        addq    %rdx, %rax          # sum += ... critical path 1c latency
        cmpq    %rsi, %rdi
        jne     .L3
        ret

7 fused-domain uops in the loop (cmov is 2 with 2c latency before Broadwell).  Should run at 1.75 cycles per iter on Haswell (or slightly slower due to an odd number of uops in the loop buffer), bottlenecked on the front-end.  The latency bottleneck is only 1 cycle.  (Which Ryzen might come closer to.)

Anyway, on Haswell (with -mtune=haswell), the function should be more than 1.5x slower with gcc7/8 than with gcc6 and earlier: roughly 3 cycles per iteration (latency-bound) vs. 1.75 cycles (front-end-bound), i.e. about 1.7x.

Moreover, gcc should try to optimize something like this:

        if (data[c] >= 128)
            sum += data[c];

into conditionally zeroing a register instead of using a loop-carried cmov dep chain.
Comment 1 Martin Liška 2017-10-23 07:54:01 UTC
Created attachment 42437 [details]
Reproducer (micro benchmark)

Confirmed; I'm attaching the micro-benchmark that I ran on my Haswell machine.
Between GCC 6.x and current trunk, we first improved performance in r239414 from 1.392704s -> 1.228814s.  Then we regressed significantly in r242832, from 1.209454s -> 1.929302s.
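For readers who don't want to grab the attachment, a harness along these lines (a sketch, not the attached file; it assumes the sumarray() from the description is linked into the same program) shows the same effect:

/* Sketch of a reproducer -- NOT the attached benchmark. */
#include <stdio.h>
#include <stdlib.h>

long long sumarray(const int *data);   /* the function from the description */

int main(void)
{
    int *data = aligned_alloc(64, 32768 * sizeof(int));
    if (!data)
        return 1;
    for (int i = 0; i < 32768; i++)
        data[i] = rand() % 256;            /* roughly half the values pass the >= 128 test */

    long long total = 0;
    for (int rep = 0; rep < 100000; rep++) /* repeat so the loop dominates the runtime */
        total += sumarray(data);

    printf("%lld\n", total);               /* keep the result observable */
    free(data);
    return 0;
}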
Comment 2 Jeffrey A. Law 2017-12-21 06:33:09 UTC
This is another case where PRE does the right thing on a local basis, but the result isn't well handled later in the pipeline.

Prior to PRE the key blocks look like:

 <bb 3>
 [ ... ]
 if (_4 > 127)
    goto <bb 4>; [50.00%]
  else
    goto <bb 5>; [50.00%]

  <bb 4> [local count: 531502203]:

  <bb 5> [local count: 1063004407]:
  # _17 = PHI <_4(4), 0(3)>
  iftmp.0_8 = (long long int) _17;
  sum_13 = iftmp.0_8 + sum_19;

Not surprisingly, PRE realizes it can reduce the total number of expressions evaluated on the 3->5 path by moving the arithmetic into bb 4:

 <bb 3>
 [ ... ]
 if (_4 > 127)
    goto <bb 4>; [50.00%]
  else
    goto <bb 5>; [50.00%]

  <bb 4> [local count: 531502203]:
  _22 = (long long int) _4;
  _24 = sum_19 + _22;

  <bb 5> [local count: 1063004407]:
  # _17 = PHI <_4(4), 0(3)>
  # prephitmp_25 = PHI <_24(4), sum_19(3)>

We keep that form through the rest of the gimple optimizers and into RTL, where it turns into the problematic cmov during the first if-conversion pass.
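At the source level, the post-PRE form corresponds roughly to this (my paraphrase of the gimple above, not compiler output):

/* Before PRE: the select is on the addend; the loop-carried operation is
   just the add, 1 cycle of latency per iteration.  */
long long addend = data[c] >= 128 ? (long long) data[c] : 0;
sum = sum + addend;

/* After PRE: the add has been hoisted into the taken arm, so the select
   (and later the cmov) chooses between sum and sum + data[c], putting the
   add and the cmov on the loop-carried critical path.  */
long long tmp = sum + data[c];
sum = data[c] >= 128 ? tmp : sum;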

Compiling with -fno-tree-pre gives code that I believe is equivalent to gcc-6.x.
Comment 3 Aldy Hernandez 2018-01-16 20:29:17 UTC
I haven't looked deeply into this, but prior to Richi's PRE patch in r239414 we had a CMOV instruction.  So if anything, Richi's patch removes the problematic instruction.  It is with Bernd's patch tweaking costs in r242832 that we get the CMOV again.

Could Bernd or someone else take a look at the cost patch to see if it's something simple that backend experts can tackle?
Comment 4 Richard Biener 2018-01-25 08:27:15 UTC
GCC 7.3 is being released, adjusting target milestone.
Comment 5 Jeffrey A. Law 2018-01-30 05:55:02 UTC
Aldy,

It's less about whether or not there is a cmov (I get one with and without PRE) and more about what part of the loop we're using the cmov for.

Consider what we get in the .optimized dump at -O3 to be correct.  But the target *assembly* we want is what you'd get with -O3 -fno-tree-pre; i.e., PRE is making it harder for us to generate the cmov we want.
Comment 6 Richard Biener 2019-11-14 07:52:41 UTC
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
Comment 7 Jakub Jelinek 2020-03-04 09:44:50 UTC
GCC 8.4.0 has been released, adjusting target milestone.
Comment 8 Jakub Jelinek 2021-05-14 09:49:18 UTC
GCC 8 branch is being closed.
Comment 9 Richard Biener 2021-06-01 08:09:26 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
Comment 10 Richard Biener 2022-05-27 09:37:39 UTC
GCC 9 branch is being closed
Comment 11 Jakub Jelinek 2022-06-28 10:33:47 UTC
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Comment 12 Richard Biener 2023-07-07 10:32:36 UTC
GCC 10 branch is being closed.
Comment 13 Andrew Pinski 2023-08-04 00:40:03 UTC
I don't know if this could help here, but combine does produce:
Trying 37, 34 -> 38:
   37: flags:CCGC=cmp(r82:SI,0x7f)
      REG_DEAD r82:SI
   34: {r92:DI=r87:DI+r89:DI;clobber flags:CC;}
      REG_DEAD r89:DI
      REG_UNUSED flags:CC
   38: r87:DI={(flags:CCGC>0)?r92:DI:r87:DI}
      REG_DEAD r92:DI
      REG_DEAD flags:CCGC
Failed to match this instruction:
(set (reg/v:DI 87 [ <retval> ])
    (plus:DI (mult:DI (gt:DI (reg:SI 82 [ _4 ])
                (const_int 127 [0x7f]))
            (reg:DI 89 [ _4 ]))
        (reg/v:DI 87 [ <retval> ])))
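
In C terms, the unmatched pattern is roughly (my paraphrase; x and retval stand in for regs 82/89 and 87):

/* Multiply the 0/1 result of the comparison by the value, then accumulate. */
retval = (long long) (x > 127) * (long long) x + retval;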

I don't know if we could catch this in the backend and split it into 3 instructions, if that would do the right thing ...
Comment 14 Jeffrey A. Law 2023-08-04 04:18:53 UTC
A better approach might be to try to create COND_EXPRs for the conditional move in the gimple code.  The biggest problem I see with that is that the gimple->rtl converters aren't great at creating efficient code on targets without conditional moves.

Meaning that we could well end up improving x86, but making several other targets worse.  

I know this because I was recently poking at a similar problem.  We expressed a conditional move of 0 or C as a multiply of a boolean by C in gimple.  It really should just have been a COND_EXPR, but when we generate that form, targets without good conditional move expanders end up recreating branchy code :(
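For concreteness, the two shapes look roughly like this at the source level (C_VAL is just a placeholder constant, not anything from the testcase):

/* What we currently generate in gimple: a multiply of a boolean by C. */
long long by_multiply(int cond, long long C_VAL)
{
    return (long long) (cond != 0) * C_VAL;
}

/* The COND_EXPR form: better where conditional moves are cheap, but it
   risks branchy expansion on targets without good cmov expanders. */
long long by_cond_expr(int cond, long long C_VAL)
{
    return cond != 0 ? C_VAL : 0;
}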