Bug 97734

Summary: GCC using branches when a conditional move would be better
Product: gcc Reporter: Rafael Avila de Espindola <rafael>
Component: targetAssignee: Not yet assigned to anyone <unassigned>
Status: UNCONFIRMED ---    
Severity: normal CC: amonakov, crazylht, hjl.tools, mark.d.ryan
Priority: P3 Keywords: missed-optimization
Version: 11.0   
Target Milestone: ---   
Host: Target: x86_64-*-*
Build: Known to work:
Known to fail: Last reconfirmed:
Attachments: graph

Description Rafael Avila de Espindola 2020-11-05 17:18:53 UTC
Created attachment 49511 [details]
graph

Playing with the code in https://github.com/patmorin/arraylayout I noticed that I could not reproduce the results for the eytzinger layout. It turns out the issue was with current gcc selecting moves instead of conditional moves for that particular code.

A reduced testcase is

#include <stdint.h>
uint64_t branchfree_search(uint64_t x, uint64_t n, uint32_t *a) {
    uint64_t i = 0;
    while (i < n) {
        i = (x <= a[i]) ? (2*i + 1) : (2*i + 2);
    }
    uint64_t j = (i+1) >> __builtin_ffsl(~(i+1));
    return (j == 0) ? n : j-1;
}

I have placed it in

https://gcc.godbolt.org/z/Krqrz7

Results
* ICC: conditional move
* Clang: branches
* GCC 6.4: conditional move
* Newer GCCs with -O2:  branches
* GCC with -Os: conditional move

The attached graph shows how the conditional move is better for "small" array sizes.
Comment 1 Richard Biener 2020-11-06 07:23:47 UTC
I guess x86
Comment 2 Alexander Monakov 2020-11-06 09:38:03 UTC
By the time RTL ce2 pass runs the RTL is suitable for if-conversion, but the pass rejects the transform (probably due to costs, not visible in dumps so impossible to tell without gdb'ing the compiler).

Note that this cmov lengthens the loop-carried data dependency on 'i', so it's only beneficial on workloads where the control dependency it replaces corresponds to an unpredictable branch. And GCC has no way to know that.

For a recent counter-example (cmov dramatically slowing down a loop with a trivially predictable branch) please see https://stackoverflow.com/a/64285902/4755075

At risk of needlessly repeating myself: I think what people doing such research really need is __builtin_branchless_select that is properly guaranteed to be branchless (selection statement on GIMPLE and branchless sequence on RTL). Otherwise they spend time tweaking their code to make cmov appear where they need it.
Comment 3 Rafael Avila de Espindola 2020-11-06 22:15:01 UTC
I just realized I made a mistake when producing the reduced testcase. The argument x should have been an uint32_t. Unfortunately, with that fix gcc always produces a branch.

ICC still produces a cmov:

https://gcc.godbolt.org/z/zszrPK

I would be very happy with a __builtin_branchless_select.