Code:

#include <stdint.h>

int cmp(int64_t a, int64_t b)
{
  return a < b ? -1 : a > b;
}

The function above, compiled with gcc 4.5 or later at -O2, is branchless:

cmp(long, long):
        xor     eax, eax
        cmp     rdi, rsi
        mov     edx, -1
        setg    al
        cmovl   eax, edx
        ret

gcc 7.1 generates different code, with a branch:

cmp(long, long):
        cmp     rdi, rsi
        jl      .L3
        setg    al
        movzx   eax, al
        ret
.L3:
        mov     eax, -1
        ret
Started with r237185.
But are you sure the branchless variant is faster?
Yes, the branchless version is faster. Here are results for code compiled with gcc 4.8.5:

Benchmark       Time   CPU   Iterations
---------------------------------------
BM_memcmp       6 ns   6 ns   111001949
BM_int64cmp     4 ns   4 ns   183761752

And here for gcc 7.1.0:

Benchmark       Time   CPU   Iterations
---------------------------------------
BM_memcmp       6 ns   6 ns   113198940
BM_int64cmp     8 ns   8 ns    82036754

Note: the tested code accepted values passed via pointer instead of by value, to make it comparable with the memcmp function. The functions compared random values, generated before the tests and stored in arrays; srand was called with a constant value to get repeatable results.
Predictions for bb 2
  DS theory heuristics: 98.0%
  combined heuristics: 98.0%
  negative return heuristics of edge 2->4: 2.0%
Predictions for bb 3
  1 edges in bb 3 predicted to even probabilities
Predictions for bb 4
  1 edges in bb 4 predicted to even probabilities

cmp (long long int a, long long int b)
{
  _Bool _1;
  int iftmp.0_2;
  int iftmp.0_6;

  <bb 2> [local count: 1073741825]:
  if (a_3(D) >= b_4(D))
    goto <bb 3>; [98.00%]
  else
    goto <bb 4>; [2.00%]

Honza, I think it would be worth recognizing idioms like this (the <=> operator) and not applying Dempster-Shafer in that case, but rather predicting either equal probabilities for <, == and > (like before), or perhaps even a somewhat smaller probability for the == case and a slightly higher one for < or >.
Oops, it is most likely the PRED_NEGATIVE_RETURN stuff instead.
Another testcase, which has an even higher prediction of not returning -1:

int cmp (int a, int b)
{
  if (a < b) return -1;
  if (a > b) return 1;
  return 0;
}

In the #c0 case, we have in the IL:

  <bb 2> :
  if (a_3(D) >= b_4(D))
    goto <bb 3>; [INV]
  else
    goto <bb 4>; [INV]

  <bb 3> :
  _1 = a_3(D) > b_4(D);
  iftmp.0_6 = (int) _1;

  <bb 4> :
  # iftmp.0_2 = PHI <iftmp.0_6(3), -1(2)>
  return iftmp.0_2;

while in this one:

  <bb 2>:
  if (a_2(D) < b_3(D))
    goto <bb 5>;
  else
    goto <bb 3>;

  <bb 3>:
  if (a_2(D) > b_3(D))
    goto <bb 5>;
  else
    goto <bb 4>;

  <bb 4>:

  <bb 5>:
  # _1 = PHI <-1(2), 1(3), 0(4)>
  return _1;

In any case, I think it is worth recognizing these patterns. We should also consider other examples, e.g. from gcc itself:

static int
oecount_cmp (const void *p1, const void *p2)
{
  const oecount *c1 = (const oecount *)p1;
  const oecount *c2 = (const oecount *)p2;
  if (c1->cnt != c2->cnt)
    return c1->cnt > c2->cnt ? 1 : -1;
  else
    /* If counts are identical, use unique IDs to stabilize qsort.  */
    return c1->id > c2->id ? 1 : -1;
}

static int
tm_alias_pair_cmp (const void *x, const void *y)
{
  const tm_alias_pair *p1 = (const tm_alias_pair *) x;
  const tm_alias_pair *p2 = (const tm_alias_pair *) y;
  if (p1->uid < p2->uid)
    return -1;
  if (p1->uid > p2->uid)
    return 1;
  return 0;
}

int
type_warning_cmp (const void *p1, const void *p2)
{
  const odr_type_warn_count *t1 = (const odr_type_warn_count *)p1;
  const odr_type_warn_count *t2 = (const odr_type_warn_count *)p2;
  if (t1->dyn_count < t2->dyn_count)
    return 1;
  if (t1->dyn_count > t2->dyn_count)
    return -1;
  return t2->count - t1->count;
}

static int
stack_var_cmp (const void *a, const void *b)
{
  size_t ia = *(const size_t *)a;
  size_t ib = *(const size_t *)b;
  unsigned int aligna = stack_vars[ia].alignb;
  unsigned int alignb = stack_vars[ib].alignb;
  HOST_WIDE_INT sizea = stack_vars[ia].size;
  HOST_WIDE_INT sizeb = stack_vars[ib].size;
  tree decla = stack_vars[ia].decl;
  tree declb = stack_vars[ib].decl;
  bool largea, largeb;
  unsigned int uida, uidb;

  /* Primary compare on "large" alignment.  Large comes first.  */
  largea = (aligna * BITS_PER_UNIT > MAX_SUPPORTED_STACK_ALIGNMENT);
  largeb = (alignb * BITS_PER_UNIT > MAX_SUPPORTED_STACK_ALIGNMENT);
  if (largea != largeb)
    return (int)largeb - (int)largea;

  /* Secondary compare on size, decreasing.  */
  if (sizea > sizeb)
    return -1;
  if (sizea < sizeb)
    return 1;

  /* Tertiary compare on true alignment, decreasing.  */
  if (aligna < alignb)
    return -1;
  if (aligna > alignb)
    return 1;

  /* Final compare on ID for sort stability, increasing.
     Two SSA names are compared by their version, SSA names
     come before non-SSA names, and two normal decls are
     compared by their DECL_UID.  */
  if (TREE_CODE (decla) == SSA_NAME)
    {
      if (TREE_CODE (declb) == SSA_NAME)
        uida = SSA_NAME_VERSION (decla), uidb = SSA_NAME_VERSION (declb);
      else
        return -1;
    }
  else if (TREE_CODE (declb) == SSA_NAME)
    return 1;
  else
    uida = DECL_UID (decla), uidb = DECL_UID (declb);
  if (uida < uidb)
    return 1;
  if (uida > uidb)
    return -1;
  return 0;
}

etc.  Perhaps catching everything is too hard for heuristics, but at least recognizing a pattern of these <=> comparisons, or consecutive comparisons like:

  if (uida < uidb)
    return 1;
  if (uida > uidb)
    return -1;

is highly desirable.
Larger testcase with more cases. Some of them, e.g. in f3, are predicted roughly reasonably, but the > 95% predictions are just wrong in these cases.

int
f1 (long long a, long long b)
{
  return a < b ? -1 : a > b;
}

int
f2 (int a, int b)
{
  if (a < b) return -1;
  if (a > b) return 1;
  return 0;
}

int
f3 (int *a, int *b)
{
  if (a[0] != b[0]) return a[0] > b[0] ? 1 : -1;
  if (a[1] != b[1]) return a[1] > b[1] ? 1 : -1;
  return 0;
}

int
f4 (int *a, int *b)
{
  if (a[0] > b[0]) return -1;
  if (a[0] < b[0]) return 1;
  if (a[1] > b[1]) return -1;
  if (a[1] < b[1]) return 1;
  if (a[2] > b[2]) return -1;
  if (a[2] < b[2]) return 1;
  return 0;
}
Created attachment 42856 [details] gcc8-pr81914.patch Untested patch that handles all the cases in the #c7 testcase. Though it will not handle some comparison functions in GCC (e.g. those which sometimes just return a - b or similar). If it doesn't look like a completely dumb approach, we probably should test it on SPEC (unfortunately I don't have a reasonable setup for that right now).
In the meantime I found another case where gcc 7 inserts lots of jumps. I am not sure if your extra testcases cover it too:

#include <stdint.h>

int test(int data1[9][9], int data2[9][9])
{
    uint64_t b1 = 0, b2 = 0;
    for (int n = 0; n < 9; ++n) {
        for (int k = 0; k < 9; ++k) {
            int a = data1[n][k] * 9 + data2[n][k];
            (a < 64 ? b1 : b2) |= 1 << (a & 63);
        }
    }
    return __builtin_popcount(b1) + __builtin_popcount(b2);
}
For the particular case of <=> (-1, 0 or 1), I've seen code like (a>b)-(a<b), which is branchless (IIRC we don't generate optimal code for this either, we could use sbb or adc).
Author: jakub
Date: Tue Dec 19 16:43:04 2017
New Revision: 255829

URL: https://gcc.gnu.org/viewcvs?rev=255829&root=gcc&view=rev
Log:
	PR middle-end/81914
	* predict.c (zero_one_minusone): New function.
	(apply_return_prediction): Avoid return prediction for functions
	returning only -1, 0 and 1 values, unless they only return -1 and 0
	or 0 and 1.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/predict.c
One more test case. Code compiled with TEST defined is branchless; without it, there is a branch.

[code]
#include <stdint.h>

#define TEST

void test(uint64_t* a)
{
    uint64_t n = *a / 8;
    if (0 == n)
        n = 1;
#ifdef TEST
    *a += n;
#else
    *a += 1 << n;
#endif
}
[/code]
GCC 7.3 is being released, adjusting target milestone.
Jakub, I see you fixed the most interesting case; can we close this?
Hello, I noticed that the example below does not give a warning. I had expected something similar to the present -Wsign-conversion behaviour. Sharing my notes as follows:

A) is false+true maybe not treated as signed?
B) the return conversion from size_t to int.

https://godbolt.org/z/KoVPqd

#include <cstddef>

int main() {
    size_t j = false+true;
    return j;
}
Fixed in GCC 8.