#include <bits/c++config.h> #undef _GLIBCXX_EXTERN_TEMPLATE #define _GLIBCXX_EXTERN_TEMPLATE 0 #include<string> __attribute__((flatten)) std::string f(){ std::string s; s="hello"; return s; } Yes, I have to go through some lengths to convince gcc to at least try to optimize... With gcc-7, I get <bb 2> [14.44%]: _3 = &MEM[(struct basic_string *)s_2(D)].D.21635._M_local_buf; MEM[(struct _Alloc_hider *)s_2(D)]._M_p = _3; MEM[(size_type *)s_2(D) + 8B] = 0; MEM[(char_type &)s_2(D) + 16] = 0; if (_3 != "hello") goto <bb 3>; [75.00%] else goto <bb 4>; [25.00%] <bb 3> [1.43%]: __builtin_memcpy (_3, "hello", 5); goto <bb 5>; [100.00%] <bb 4> [0.97%]: __builtin_memcpy ("hello", &MEM[(void *)"hello" + 5B], 5); <bb 5> [14.43%]: MEM[(size_type *)s_2(D) + 8B] = 5; MEM[(char_type &)s_2(D) + 21] = 0; return s_2(D); which is kind of OK. It would be much better if we folded _3 != "hello", but it is already small enough. With gcc-9, I get something that starts with __x.7_6 = (long unsigned int) "hello"; __y.8_7 = (long unsigned int) _3; if (__x.7_6 < __y.8_7) goto <bb 4>; [50.00%] else goto <bb 3>; [50.00%] <bb 3> [local count: 38463891]: if (__x.7_6 > __y.8_7) goto <bb 4>; [50.00%] else goto <bb 5>; [50.00%] ifcombine would kindly turn this into __x.7_6 != __y.8_7, but it doesn't look like this yet when ifcombine runs. We also have equivalent blocks (reached by goto, not fallthrough) <bb 4> [local count: 19039626]: __builtin_memcpy (_3, "hello", 5); goto <bb 16>; [100.00%] <bb 6> [local count: 3173271]: __builtin_memcpy (_3, "hello", 5); goto <bb 16>; [100.00%] <bb 16> [local count: 114817586]: # prephitmp_14 = PHI <pretmp_16(13), pretmp_25(15), _3(4), _3(8), _3(6), _3(14)> that we fail to merge. In the end, we have 4 times more code than we used to... This is most likely caused by a change in libstdc++, but I am categorizing it as tree-optimization because I believe we need some improvements there, whatever libstdc++ decides to do.
Created attachment 45394 [details] gcc9-pr88775.patch Just as a first step, with this patch I get much better generated code at -O2, 139 bytes for _Z1fB5cxx11v instead of 259. It helps to have much shorter code already before inlining and that jump threading etc. don't try to deal with that. Jon, any value in having the __x < __y etc. for C++11 and C++98 when the methods aren't marked constexpr?
Probably not. I don't think sanitizers flag the unspecified comparisons, so we could just always do the uintptr_t comparisons for C++98/11. The patch looks good to me.
So the if (__x.7_6 < __y.8_7) goto <bb 4>; [50.00%] else goto <bb 3>; [50.00%] <bb 3> [local count: 38463891]: if (__x.7_6 > __y.8_7) goto <bb 4>; [50.00%] else goto <bb 5>; [50.00%] fails to merge to != because at the time we run ifcombine it still looks like _4 = _3 > "hello"; _5 = __builtin_constant_p (_4); if (_5 != 0) goto <bb 4>; [34.00%] else goto <bb 3>; [66.00%] <bb 3> [local count: 708669605]: __x.5_6 = (long unsigned int) "hello"; __y.6_7 = (long unsigned int) _3; _8 = __x.5_6 < __y.6_7; <bb 4> [local count: 1073741824]: # _9 = PHI <_4(2), _8(3)> if (_9 != 0) goto <bb 8>; [50.00%] else goto <bb 5>; [50.00%] <bb 5> [local count: 536870913]: _10 = _3 < "hello"; _11 = __builtin_constant_p (_10); if (_11 != 0) goto <bb 7>; [34.00%] else goto <bb 6>; [66.00%] <bb 6> [local count: 354334802]: __x.5_12 = (long unsigned int) _3; __y.6_13 = (long unsigned int) "hello"; _14 = __x.5_12 < __y.6_13; <bb 7> [local count: 536870913]: # _15 = PHI <_10(5), _14(6)> if (_15 != 0) goto <bb 8>; [50.00%] else goto <bb 9>; [50.00%] but maybe that is what Jakubs patch fixes.
(In reply to Richard Biener from comment #3) > _4 = _3 > "hello"; > _5 = __builtin_constant_p (_4); > if (_5 != 0) > goto <bb 4>; [34.00%] > else > goto <bb 3>; [66.00%] > > <bb 3> [local count: 708669605]: > __x.5_6 = (long unsigned int) "hello"; > __y.6_7 = (long unsigned int) _3; > _8 = __x.5_6 < __y.6_7; > > <bb 4> [local count: 1073741824]: > # _9 = PHI <_4(2), _8(3)> > if (_9 != 0) > goto <bb 8>; [50.00%] > else > goto <bb 5>; [50.00%] which is a somewhat awkward pattern. I think that VN should really be able to value-number _4 and _8 the same eliding this crap. Let me see to do that. OK, that helps _a lot_.
Created attachment 45401 [details] VN patch Turns f() into just <bb 2> [local count: 1073741824]: _3 = &MEM[(struct basic_string *)s_2(D)].D.18989._M_local_buf; MEM[(struct _Alloc_hider *)s_2(D)]._M_p = _3; MEM[(size_type *)s_2(D) + 8B] = 0; MEM[(char_type &)s_2(D) + 16] = 0; if (_3 != "hello") goto <bb 3>; [75.00%] else goto <bb 4>; [25.00%] <bb 3> [local count: 805306369]: __builtin_memcpy (_3, "hello", 5); goto <bb 5>; [100.00%] <bb 4> [local count: 134217728]: __builtin_memcpy ("hello", &MEM[(void *)"hello" + 5B], 5); <bb 5> [local count: 1073741824]: MEM[(size_type *)s_2(D) + 8B] = 5; MEM[(char_type &)s_2(D) + 21] = 0; return s_2(D);
Author: jakub Date: Thu Jan 10 10:56:56 2019 New Revision: 267800 URL: https://gcc.gnu.org/viewcvs?rev=267800&root=gcc&view=rev Log: PR tree-optimization/88775 * include/bits/stl_function.h (greater<_Tp*>::operator(), less<_Tp*>::operator(), greater_equal<_Tp*>::operator(), less_equal<_Tp*>::operator()): Use __builtin_is_constant_evaluated instead of __builtin_constant_p if available. Don't bother with the pointer comparison in C++11 and earlier. Modified: trunk/libstdc++-v3/ChangeLog trunk/libstdc++-v3/include/bits/stl_function.h
So after the patch we have __x.5_4 = (long unsigned int) "hello"; __y.6_5 = (long unsigned int) _3; if (__x.5_4 != __y.6_5) goto <bb 3>; [75.00%] else goto <bb 4>; [25.00%] <bb 3> [local count: 805306369]: __builtin_memcpy (_3, "hello", 5); goto <bb 14>; [100.00%] <bb 4> [local count: 268435456]: if (_3 >= &MEM[(void *)"hello" + 5B]) goto <bb 5>; [50.00%] else goto <bb 6>; [50.00%] where we fail to elide this condition. Well, in theory "hello" could be followed by "hello", thus in .str "hello\0hello\0" and thus the condition could be true. Somehow w/o the libstdc++ patch this condition either doesn't appear or is elided.
"hello" string literal surely can be followed by anything else, but don't we consider it UB? int foo (void) { int a = 0; for (int i = 0; i < 32; i++) a += "hello"[i]; return a; } warning: iteration 6 invokes undefined behavior [-Waggressive-loop-optimizations] a += "hello"[i]; ~~~~~~~^~~ So, __builtin_memcpy (xyz, &MEM[(void *)"hello" + 5B], 5); is the same kind of UB. Note, I don't see any: __builtin_memcpy ("hello", ..., 5); in my IL, that would be another UB unless string literals are writable.
Created attachment 45404 [details] VRP patch This makes VRP register asserts for the pointer variants. This doesn't help until after ifcombine because we throw the asserts away if they are not EQ/NE_EXPR but after ifcombine DOM then manages to optimize the testcase nicely again. After the libstdc++ patch and dom2: __attribute__((abi_tag ("cxx11"), flatten)) f () { const size_type __nleft; char[16] * _3; long unsigned int __x.5_4; long unsigned int __y.6_5; char * _12; <bb 2> [local count: 1073741824]: _3 = &MEM[(struct basic_string *)s_2(D)].D.18989._M_local_buf; MEM[(struct _Alloc_hider *)s_2(D)]._M_p = _3; MEM[(size_type *)s_2(D) + 8B] = 0; MEM[(char_type &)s_2(D) + 16] = 0; __x.5_4 = (long unsigned int) "hello"; __y.6_5 = (long unsigned int) _3; if (__x.5_4 != __y.6_5) goto <bb 3>; [75.00%] else goto <bb 4>; [25.00%] <bb 3> [local count: 805306369]: __builtin_memcpy (_3, "hello", 5); goto <bb 5>; [100.00%] <bb 4> [local count: 67108864]: __builtin_memcpy ("hello", &MEM[(void *)"hello" + 5B], 5); <bb 5> [local count: 1073741824]: MEM[(size_type *)s_2(D) + 8B] = 5; _12 = MEM[(char * *)s_2(D)]; MEM[(char_type &)_12 + 5] = 0; return s_2(D); }
I get pretty much the same thing with: --- gcc/match.pd.jj 2019-01-07 17:59:24.100931144 +0100 +++ gcc/match.pd 2019-01-10 14:45:31.870509916 +0100 @@ -1660,6 +1660,19 @@ (define_operator_list COND_TERNARY (if (TREE_INT_CST_LOW (@1) & 1) { constant_boolean_node (cmp == NE_EXPR, type); }))) +/* Turn (uintptr_t) ptr1 != (uintptr_t) ptr2 comparison into just + ptr1 != ptr2. Don't do that for non-equality comparisons. */ +(for cmp (eq ne) + (simplify + (cmp (convert@2 @0) (convert @1)) + (if (ANY_INTEGRAL_TYPE_P (TREE_TYPE (@2)) + && POINTER_TYPE_P (TREE_TYPE (@0)) + && POINTER_TYPE_P (TREE_TYPE (@1)) + && TYPE_PRECISION (TREE_TYPE (@2)) >= TYPE_PRECISION (TREE_TYPE (@0)) + && ((GIMPLE && types_compatible_p (TREE_TYPE (@0), TREE_TYPE (@1))) + || (GENERIC && TREE_TYPE (@0) == TREE_TYPE (@1)))) + (cmp @0 @1)))) + /* Arguments on which one can call get_nonzero_bits to get the bits possibly set. */ (match with_possible_nonzero_bits (well, even without the (unsigned long) casts). Isn't this better or do we want both patches?
Created attachment 45405 [details] gcc9-pr88775.patch Full patch.
Unfortunately the #c11 patch breaks the 20_util/function_objects/comparisons_pointer.cc testcase (wonder if your VRP patch would break it too), where the testcase does exactly what has been discussed on IRC: int b[8]; int a[8]; void test01() { int* p = a + 8; std::greater<int*> gt; std::stringstream ss; ss << gt(p, b) << ' ' << gt(b, p) << ' ' << (!gt(p, b) && !gt(b, p)); So, if we want to do this optimization, we probably need to make sure that whatever optimizes the a + 8 != b comparisons into false doesn't do that if one pointer is known or could point to the end of one of the objects and the other pointer points to the start of another object and those two objects could potentially live in the same section (though, maybe even different sections could be problematic if one is at the end of the whole section and another one is the first one in the next section). As the distinction is gone in RTL, this must be some GIMPLE or tree opt.
Seems it is the: /* When the addresses are not directly of decls compare base and offset. This implements some remaining parts of fold_comparison address comparisons but still no complete part of it. Still it is good enough to make fold_stmt not regress when not dispatching to fold_binary. */ opt in match.pd. So, we'd need to verify that the problematic case can't happen (one address being after the last byte in one of the objects and the other the beginning of another one).
Of course only for equality comparisons, for non-equality the code is ok as is.
Created attachment 45411 [details] gcc9-pr88775-2.patch The following incremental patch (untested except for this testcase and comparisons_pointer.cc) fixes that. Unfortunately there is still ptrs_compare_unequal routine that would need similar treatment, and I'm afraid it will result in less optimized code. This patch alone though could be useful even without the other patch, perhaps if we for pointers like before optimize always. The previous case where we optimized for integral equality comparisons of pointers only if the offsets are the same is both incorrect (for zero sized objects) and in many cases not optimizing enough (it is fine if both offsets are different, all we care is that the problematic cases where one pointer points to the beginning of one object and the other points to one past last byte of another one aren't optimized, everything else can). C99 says: "Tw o pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space." so I think we have to be conservative and need to treat pointer equality the same as equality of pointers cast to integral types, the question is if we are prepared for this for GCC9. Note, in ptrs_compare_unequal for the one obj, one ptr case (the only interesting one it handles), we could check if the pointer to the obj is known to be into the middle of the object (if size is constant and offset too, that is trivial, other cases might be harder and need more discussions) and in that case we can do whatever it does now. Otherwise, either punt, or e.g. check if obj is a global var and the other ptr points to only automatic vars (or if obj is automatic and ptr points to only global vars). Something else we could do?
Author: jakub Date: Tue Jan 15 08:11:00 2019 New Revision: 267931 URL: https://gcc.gnu.org/viewcvs?rev=267931&root=gcc&view=rev Log: PR tree-optimization/88775 * match.pd (cmp (convert1?@2 addr@0) (convert2? addr@1)): Optimize equal == 0 equality pointer comparisons some more if compared in integral types and either one points to an automatic var and the other to a global, or we can prove at least one points to the middle or both point to start or both point to end. * gcc.dg/tree-ssa/pr88775-1.c: New test. * gcc.dg/tree-ssa/pr88775-2.c: New test. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/pr88775-1.c trunk/gcc/testsuite/gcc.dg/tree-ssa/pr88775-2.c Modified: trunk/gcc/ChangeLog trunk/gcc/match.pd trunk/gcc/testsuite/ChangeLog
Without the #c11 patch (+ removal of the !INTEGRAL_TYPE_P special case from the above committed change + fixing up ptrs_compare_unequal, or something equivalent like the VRP change) I'm afraid there isn't much possibilities left to do, and those changes are too risky for GCC9. The problem with your testcase is that NRV is in place, so we don't even know if s is an automatic variable or a global variable, or a heap variable etc.
(In reply to Jakub Jelinek from comment #17) > Without the #c11 patch (+ removal of the !INTEGRAL_TYPE_P special case from > the above committed change + fixing up ptrs_compare_unequal, or something > equivalent like the VRP change) I am a bit lost with the current status there ;-) (I don't have much time now so I can't really follow anyway, no need to explain it to me) > I'm afraid there isn't much possibilities > left to do, and those changes are too risky for GCC9. I am not in a hurry. I don't personally write code where the performance of string matters, I was forwarding (the first part of) a report by some user. > The problem with your > testcase is that NRV is in place, so we don't even know if s is an automatic > variable or a global variable, or a heap variable etc. NRV complicates things, but even with NRV I think we should be able to say that s and "hello" cannot have the same address, since the caller is supposed to pass the address of a writable buffer as argument. Or can you think of cases where NRV would allow the return object and a string literal to have the same address? Maybe if we store the returned value in a const object and never compare the address of this const object to another one the compiler could skip constructing the object and use a reference to a constant object?
GCC 8.3 has been released.
came up on the mailing lists here: https://gcc.gnu.org/ml/gcc/2019-04/msg00202.html
GCC 8.4.0 has been released, adjusting target milestone.
GCC 8 branch is being closed.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.
GCC 11 branch is being closed.