long long sumarray(const int *data) {
    data = (const int*)__builtin_assume_aligned(data, 64);
    long long sum = 0;
    for (int c=0 ; c<32768 ; c++)
        sum += (data[c] >= 128 ? data[c] : 0);
    return sum;
}

The loop body is written to encourage gcc to make the loop-carried dep
chain just an ADD, with independent branchless zeroing of each input.
But unfortunately, gcc7 and gcc8 -O2 de-optimize it back to what we get
with older gcc -O3 from

        if (data[c] >= 128)   // doesn't auto-vectorize with gcc4, unlike the above
            sum += data[c];

See also https://stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-then-o2.

https://godbolt.org/g/GgVp7E
gcc8.0 8.0.0 20171022 -O2 -mtune=haswell  (slow)

        leaq    131072(%rdi), %rsi
        xorl    %eax, %eax
.L3:
        movslq  (%rdi), %rdx
        movq    %rdx, %rcx
        addq    %rax, %rdx        # mov+add could have been LEA
        cmpl    $127, %ecx
        cmovg   %rdx, %rax        # sum = (x>=128 ? sum+x : sum)
        addq    $4, %rdi
        cmpq    %rsi, %rdi
        jne     .L3
        ret

This version has a 3 cycle latency loop-carried dep chain (addq %rax, %rdx
and cmov).  It's also 8 fused-domain uops (1 more than older gcc), but
using LEA would fix that.

gcc6.3 -O2 -mtune=haswell  (last good version of gcc on Godbolt, for this test)

        leaq    131072(%rdi), %rsi
        xorl    %eax, %eax
        xorl    %ecx, %ecx        # extra zero constant for a cmov source
.L3:
        movslq  (%rdi), %rdx
        cmpl    $127, %edx
        cmovle  %rcx, %rdx        # rdx = 0 when rdx <= 127
        addq    $4, %rdi
        addq    %rdx, %rax        # sum += ...  critical path, 1c latency
        cmpq    %rsi, %rdi
        jne     .L3
        ret

7 fused-domain uops in the loop (cmov is 2 uops with 2c latency before
Broadwell).  Should run at 1.75 cycles per iter on Haswell (or slightly
slower due to an odd number of uops in the loop buffer), bottlenecked on
the front-end.  The latency bottleneck is only 1 cycle.  (Which Ryzen
might come closer to.)

Anyway, on Haswell (with -mtune=haswell), the function should be more
than 1.5x slower with gcc7/8 than with gcc6 and earlier.
Moreover, gcc should try to optimize something like this:

        if (data[c] >= 128)
            sum += data[c];

into conditionally zeroing a register instead of using a loop-carried cmov
dep chain.
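For reference, here is a C-level sketch of the two loop shapes discussed
above (hand-written for illustration, not compiler output; the function
names are made up):

```c
/* Desired shape: the select only zeroes the addend, so the
 * loop-carried dependency chain is a single add (1c latency). */
long long sum_zero_addend(const int *data, int n) {
    long long sum = 0;
    for (int c = 0; c < n; c++) {
        int x = data[c];
        int z = (x >= 128) ? x : 0;   /* independent of sum */
        sum += z;
    }
    return sum;
}

/* Shape gcc7/8 -O2 produce: the cmov selects between sum and
 * sum + x, putting add + cmov on the critical path (3c latency). */
long long sum_cmov_on_chain(const int *data, int n) {
    long long sum = 0;
    for (int c = 0; c < n; c++) {
        long long t = sum + data[c];
        sum = (data[c] >= 128) ? t : sum;
    }
    return sum;
}
```

Both compute the same result; only the shape of the dependency chain
differs.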
Created attachment 42437 [details]
Reproducer (micro-benchmark)

Confirmed; I'm attaching a micro-benchmark that I ran on my Haswell machine.
In the time between GCC 6.x and current trunk, we first improved performance
in r239414 from 1.392704s -> 1.228814s.  Then we significantly regressed in
r242832 from 1.209454s -> 1.929302s.
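The attachment itself is not reproduced here; a minimal sketch of such a
micro-benchmark (names, repetition count, and data distribution are my own
assumptions) could look like:

```c
#include <stdlib.h>
#include <time.h>

#define N 32768

/* Hot loop under test: branchless sum of elements >= 128. */
long long sumarray(const int *data) {
    long long sum = 0;
    for (int c = 0; c < N; c++)
        sum += (data[c] >= 128 ? data[c] : 0);
    return sum;
}

/* Run the loop `reps` times over pseudo-random bytes and return the
 * elapsed CPU time in seconds; *out receives the per-call sum. */
double run_benchmark(int reps, long long *out) {
    static int data[N];
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;      /* ~50/50 condition: worst case for a branch */
    volatile long long sink = 0;     /* keep calls from being optimized away */
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        sink += sumarray(data);
    clock_t t1 = clock();
    *out = sink / reps;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Built with different compiler versions (and with/without -fno-tree-pre),
the reported times should show the regression described above.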
This is another case where PRE does the right thing on a local basis, but
the result isn't well handled later in the pipeline.

Prior to PRE the key blocks look like:

  <bb 3> [ ... ]
  if (_4 > 127)
    goto <bb 4>; [50.00%]
  else
    goto <bb 5>; [50.00%]

  <bb 4> [local count: 531502203]:

  <bb 5> [local count: 1063004407]:
  # _17 = PHI <_4(4), 0(3)>
  iftmp.0_8 = (long long int) _17;
  sum_13 = iftmp.0_8 + sum_19;

Not surprisingly, PRE realizes it can reduce the total expressions
evaluated on the path 3->5 by moving the arithmetic into bb4:

  <bb 3> [ ... ]
  if (_4 > 127)
    goto <bb 4>; [50.00%]
  else
    goto <bb 5>; [50.00%]

  <bb 4> [local count: 531502203]:
  _22 = (long long int) _4;
  _24 = sum_19 + _22;

  <bb 5> [local count: 1063004407]:
  # _17 = PHI <_4(4), 0(3)>
  # prephitmp_25 = PHI <_24(4), sum_19(3)>

We keep that form through the rest of the gimple optimizers and into RTL,
where it turns into the problematic cmov during the first if-conversion
pass.  Compiling with -fno-tree-pre gives code that I believe is
equivalent to gcc-6.x.
I haven't looked deeply into this, but prior to Richi's PRE patch in r239414 we had a CMOV instruction. So if anything, Richi's patch removes the problematic instruction. It is with Bernd's patch tweaking costs in r242832 that we get the CMOV again. Could Bernd or someone else take a look at the cost patch to see if it's something simple that backend experts can tackle?
GCC 7.3 is being released, adjusting target milestone.
Aldy,

It's less about whether or not there is a cmov (I get one with and without
PRE) and more about which part of the loop we're using the cmov for.  I
consider what we get in the .optimized dump at -O3 correct, but the target
*assembly* we want is what you'd get with -O3 -fno-tree-pre.  I.e., PRE is
making it harder for us to generate the cmov we want.
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
GCC 8.4.0 has been released, adjusting target milestone.
GCC 8 branch is being closed.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed.
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.
I don't know if this could help here, but combine does produce:

Trying 37, 34 -> 38:
   37: flags:CCGC=cmp(r82:SI,0x7f)
      REG_DEAD r82:SI
   34: {r92:DI=r87:DI+r89:DI;clobber flags:CC;}
      REG_DEAD r89:DI
      REG_UNUSED flags:CC
   38: r87:DI={(flags:CCGC>0)?r92:DI:r87:DI}
      REG_DEAD r92:DI
      REG_DEAD flags:CCGC
Failed to match this instruction:
(set (reg/v:DI 87 [ <retval> ])
    (plus:DI (mult:DI (gt:DI (reg:SI 82 [ _4 ])
                (const_int 127 [0x7f]))
            (reg:DI 89 [ _4 ]))
        (reg/v:DI 87 [ <retval> ])))

I don't know if we could catch this in the backend and split this into 3
instructions, if that would do the right thing ...
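In C terms, the pattern combine built and one possible mask-based split
look like this (a sketch of the idea only; function names are
hypothetical, and whether the backend should actually do this split is
the open question above):

```c
/* The unmatched RTL pattern, as C: a boolean multiplied by the
 * value, added into the accumulator. */
long long add_if_gt127_mult(long long sum, int x) {
    return sum + (long long)(x > 127) * x;
}

/* A possible split into simple ops: setcc -> neg builds a 0 or
 * all-ones mask, AND zeroes the value, then one plain add.  The
 * add stays the only loop-carried operation. */
long long add_if_gt127_mask(long long sum, int x) {
    long long mask = -(long long)(x > 127);   /* 0 or ~0 */
    return sum + ((long long)x & mask);
}
```

Both forms keep the accumulator off the select, which is exactly the
property the gcc6 code generation had.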
A better approach might be to try to create COND_EXPRs for the conditional
move in the gimple code.  The biggest problem I see with that is that the
gimple->RTL converters aren't great at creating efficient code on targets
without conditional moves, meaning we could well end up improving x86 but
making several other targets worse.

I know this because I was recently poking at a similar problem: we
expressed a conditional move of 0 or C as a multiply of a boolean by C in
gimple.  It really should just have been a COND_EXPR, but when we generate
that form, targets without good conditional-move expanders end up
recreating branchy code :(
GCC 11 branch is being closed.