long long sumarray(const int *data) {
    data = (const int*)__builtin_assume_aligned(data, 64);
    long long sum = 0;
    for (int c=0 ; c<32768 ; c++)
        sum += (data[c] >= 128 ? data[c] : 0);
    return sum;
}

The loop body is written to encourage gcc to make the loop-carried dep
chain just an ADD, with independent branchless zeroing of each input.
But unfortunately, gcc7 and gcc8 -O2 de-optimize it back to what we get
with older gcc -O3 from

        if (data[c] >= 128)   // doesn't auto-vectorize with gcc4, unlike the above
            sum += data[c];

See also https://stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-then-o2.

https://godbolt.org/g/GgVp7E
gcc8.0 8.0.0 20171022 -O2 -mtune=haswell  (slow)

        leaq    131072(%rdi), %rsi
        xorl    %eax, %eax
.L3:
        movslq  (%rdi), %rdx
        movq    %rdx, %rcx
        addq    %rax, %rdx        # mov+add could have been LEA
        cmpl    $127, %ecx
        cmovg   %rdx, %rax        # sum = (x>=128 ? sum+x : sum)
        addq    $4, %rdi
        cmpq    %rsi, %rdi
        jne     .L3
        ret

This version has a 3 cycle latency loop-carried dep chain (addq %rax, %rdx
and cmov).  It's also 8 fused-domain uops (1 more than older gcc), but
using LEA would fix that.

gcc6.3 -O2 -mtune=haswell  (last good version of gcc on Godbolt, for this test)

        leaq    131072(%rdi), %rsi
        xorl    %eax, %eax
        xorl    %ecx, %ecx        # extra zero constant for a cmov source
.L3:
        movslq  (%rdi), %rdx
        cmpl    $127, %edx
        cmovle  %rcx, %rdx        # rdx = 0 when rdx <= 127
        addq    $4, %rdi
        addq    %rdx, %rax        # sum += ...  critical path, 1c latency
        cmpq    %rsi, %rdi
        jne     .L3
        ret

7 fused-domain uops in the loop (cmov is 2 uops with 2c latency before
Broadwell).  Should run at 1.75 cycles per iter on Haswell (or slightly
slower due to an odd number of uops in the loop buffer), bottlenecked on
the front-end.  The latency bottleneck is only 1 cycle.  (Which Ryzen
might come closer to.)

Anyway, on Haswell (with -mtune=haswell), the function should be more
than 1.5x slower with gcc7/8 than with gcc6 and earlier.
Moreover, gcc should try to optimize something like this:

        if (data[c] >= 128)
            sum += data[c];

into conditionally zeroing a register instead of using a loop-carried cmov
dep chain.
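For reference, here is a C-level sketch of the two loop shapes discussed
above (hand-written for illustration, not compiler output; the function
names are made up):

```c
/* Desired shape: the select only zeroes the addend, so the
 * loop-carried dependency chain is a single add (1c latency). */
long long sum_zero_addend(const int *data, int n) {
    long long sum = 0;
    for (int c = 0; c < n; c++) {
        int x = data[c];
        int z = (x >= 128) ? x : 0;   /* independent of sum */
        sum += z;
    }
    return sum;
}

/* Shape gcc7/8 -O2 produce: the cmov selects between sum and
 * sum + x, putting add + cmov on the critical path (3c latency). */
long long sum_cmov_on_chain(const int *data, int n) {
    long long sum = 0;
    for (int c = 0; c < n; c++) {
        long long t = sum + data[c];
        sum = (data[c] >= 128) ? t : sum;
    }
    return sum;
}
```

Both compute the same result; only the shape of the dependency chain
differs.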
Created attachment 42437 [details]
Reproducer (micro-benchmark)

Confirmed; I'm attaching a micro-benchmark that I ran on my Haswell machine.
In the time between GCC 6.x and current trunk, we first improved performance
in r239414 from 1.392704s -> 1.228814s.  Then we significantly regressed in
r242832 from 1.209454s -> 1.929302s.
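The attachment itself is not reproduced here; a minimal sketch of such a
micro-benchmark (names, repetition count, and data distribution are my own
assumptions) could look like:

```c
#include <stdlib.h>
#include <time.h>

#define N 32768

/* Hot loop under test: branchless sum of elements >= 128. */
long long sumarray(const int *data) {
    long long sum = 0;
    for (int c = 0; c < N; c++)
        sum += (data[c] >= 128 ? data[c] : 0);
    return sum;
}

/* Run the loop `reps` times over pseudo-random bytes and return the
 * elapsed CPU time in seconds; *out receives the per-call sum. */
double run_benchmark(int reps, long long *out) {
    static int data[N];
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;      /* ~50/50 condition: worst case for a branch */
    volatile long long sink = 0;     /* keep calls from being optimized away */
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        sink += sumarray(data);
    clock_t t1 = clock();
    *out = sink / reps;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Built with different compiler versions (and with/without -fno-tree-pre),
the reported times should show the regression described above.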
This is another case where PRE does the right thing on a local basis, but
the result isn't well handled later in the pipeline.

Prior to PRE the key blocks look like:

  <bb 3> [ ... ]
  if (_4 > 127)
    goto <bb 4>; [50.00%]
  else
    goto <bb 5>; [50.00%]

  <bb 4> [local count: 531502203]:

  <bb 5> [local count: 1063004407]:
  # _17 = PHI <_4(4), 0(3)>
  iftmp.0_8 = (long long int) _17;
  sum_13 = iftmp.0_8 + sum_19;

Not surprisingly, PRE realizes it can reduce the total expressions
evaluated on the path 3->5 by moving the arithmetic into bb4:

  <bb 3> [ ... ]
  if (_4 > 127)
    goto <bb 4>; [50.00%]
  else
    goto <bb 5>; [50.00%]

  <bb 4> [local count: 531502203]:
  _22 = (long long int) _4;
  _24 = sum_19 + _22;

  <bb 5> [local count: 1063004407]:
  # _17 = PHI <_4(4), 0(3)>
  # prephitmp_25 = PHI <_24(4), sum_19(3)>

We keep that form through the rest of the gimple optimizers and into RTL,
where it turns into the problematic cmov during the first if-conversion
pass.  Compiling with -fno-tree-pre gives code that I believe is
equivalent to gcc-6.x.
I haven't looked deeply into this, but prior to Richi's PRE patch in r239414 we had a CMOV instruction. So if anything, Richi's patch removes the problematic instruction. It is with Bernd's patch tweaking costs in r242832 that we get the CMOV again. Could Bernd or someone else take a look at the cost patch to see if it's something simple that backend experts can tackle?
GCC 7.3 is being released, adjusting target milestone.
Aldy,

It's less about whether or not there is a cmov (I get one with and without
PRE) and more about which part of the loop we're using the cmov for.  I
consider what we get in the .optimized dump at -O3 correct, but the target
*assembly* we want is what you'd get with -O3 -fno-tree-pre.  I.e., PRE is
making it harder for us to generate the cmov we want.
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
GCC 8.4.0 has been released, adjusting target milestone.
GCC 8 branch is being closed.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed.
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.
I don't know if this could help here, but combine does produce:

Trying 37, 34 -> 38:
   37: flags:CCGC=cmp(r82:SI,0x7f)
      REG_DEAD r82:SI
   34: {r92:DI=r87:DI+r89:DI;clobber flags:CC;}
      REG_DEAD r89:DI
      REG_UNUSED flags:CC
   38: r87:DI={(flags:CCGC>0)?r92:DI:r87:DI}
      REG_DEAD r92:DI
      REG_DEAD flags:CCGC
Failed to match this instruction:
(set (reg/v:DI 87 [ <retval> ])
    (plus:DI (mult:DI (gt:DI (reg:SI 82 [ _4 ])
                (const_int 127 [0x7f]))
            (reg:DI 89 [ _4 ]))
        (reg/v:DI 87 [ <retval> ])))

I don't know if we could catch this in the backend and split this into 3
instructions, if that would do the right thing ...
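In C terms, the pattern combine built and one possible mask-based split
look like this (a sketch of the idea only; function names are
hypothetical, and whether the backend should actually do this split is
the open question above):

```c
/* The unmatched RTL pattern, as C: a boolean multiplied by the
 * value, added into the accumulator. */
long long add_if_gt127_mult(long long sum, int x) {
    return sum + (long long)(x > 127) * x;
}

/* A possible split into simple ops: setcc -> neg builds a 0 or
 * all-ones mask, AND zeroes the value, then one plain add.  The
 * add stays the only loop-carried operation. */
long long add_if_gt127_mask(long long sum, int x) {
    long long mask = -(long long)(x > 127);   /* 0 or ~0 */
    return sum + ((long long)x & mask);
}
```

Both forms keep the accumulator off the select, which is exactly the
property the gcc6 code generation had.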
A better approach might be to try to create COND_EXPRs for the conditional
move in the gimple code.  The biggest problem I see with that is that the
gimple->RTL converters aren't great at creating efficient code on targets
without conditional moves, meaning we could well end up improving x86 but
making several other targets worse.

I know this because I was recently poking at a similar problem: we
expressed a conditional move of 0 or C as a multiply of a boolean by C in
gimple.  It really should just have been a COND_EXPR, but when we generate
that form, targets without good conditional-move expanders end up
recreating branchy code :(
GCC 11 branch is being closed.