Bug 88775

Summary: [10/11/12/13 Regression] Optimize std::string assignment
Product: gcc Reporter: Marc Glisse <glisse>
Component: tree-optimizationAssignee: Not yet assigned to anyone <unassigned>
Status: NEW ---    
Severity: normal CC: dimhen, egallager, fw, jakub, redi
Priority: P2 Keywords: missed-optimization
Version: 9.0   
Target Milestone: 10.5   
Host: Target:
Build: Known to work:
Known to fail: Last reconfirmed: 2019-01-10 00:00:00
Attachments: gcc9-pr88775.patch
VN patch
VRP patch
gcc9-pr88775.patch
gcc9-pr88775-2.patch

Description Marc Glisse 2019-01-09 20:48:17 UTC
#include <bits/c++config.h>
#undef _GLIBCXX_EXTERN_TEMPLATE
#define _GLIBCXX_EXTERN_TEMPLATE 0
#include<string>
__attribute__((flatten))
std::string f(){
  std::string s;
  s="hello";
  return s;
}

Yes, I have to go through some lengths to convince gcc to at least try to optimize...

With gcc-7, I get

  <bb 2> [14.44%]:
  _3 = &MEM[(struct basic_string *)s_2(D)].D.21635._M_local_buf;
  MEM[(struct _Alloc_hider *)s_2(D)]._M_p = _3;
  MEM[(size_type *)s_2(D) + 8B] = 0;
  MEM[(char_type &)s_2(D) + 16] = 0;
  if (_3 != "hello")
    goto <bb 3>; [75.00%]
  else
    goto <bb 4>; [25.00%]

  <bb 3> [1.43%]:
  __builtin_memcpy (_3, "hello", 5);
  goto <bb 5>; [100.00%]

  <bb 4> [0.97%]:
  __builtin_memcpy ("hello", &MEM[(void *)"hello" + 5B], 5);

  <bb 5> [14.43%]:
  MEM[(size_type *)s_2(D) + 8B] = 5;
  MEM[(char_type &)s_2(D) + 21] = 0;
  return s_2(D);

which is kind of OK. It would be much better if we folded _3 != "hello", but it is already small enough.

With gcc-9, I get something that starts with

  __x.7_6 = (long unsigned int) "hello";
  __y.8_7 = (long unsigned int) _3;
  if (__x.7_6 < __y.8_7)
    goto <bb 4>; [50.00%]
  else
    goto <bb 3>; [50.00%]

  <bb 3> [local count: 38463891]:
  if (__x.7_6 > __y.8_7)
    goto <bb 4>; [50.00%]
  else
    goto <bb 5>; [50.00%]

ifcombine would kindly turn this into __x.7_6 != __y.8_7, but it doesn't look like this yet when ifcombine runs. We also have equivalent blocks (reached by goto, not fallthrough)

  <bb 4> [local count: 19039626]:
  __builtin_memcpy (_3, "hello", 5);
  goto <bb 16>; [100.00%]

  <bb 6> [local count: 3173271]:
  __builtin_memcpy (_3, "hello", 5);
  goto <bb 16>; [100.00%]

  <bb 16> [local count: 114817586]:
  # prephitmp_14 = PHI <pretmp_16(13), pretmp_25(15), _3(4), _3(8), _3(6), _3(14)>

that we fail to merge. In the end, we have 4 times more code than we used to...

This is most likely caused by a change in libstdc++, but I am categorizing it as tree-optimization because I believe we need some improvements there, whatever libstdc++ decides to do.
Comment 1 Jakub Jelinek 2019-01-09 22:34:19 UTC
Created attachment 45394 [details]
gcc9-pr88775.patch

Just as a first step, with this patch I get much better generated code at -O2, 139 bytes for _Z1fB5cxx11v instead of 259.  It helps to have much shorter code already before inlining and that jump threading etc. don't try to deal with that.  Jon, any value in having the __x < __y etc. for C++11 and C++98 when the methods aren't marked constexpr?
Comment 2 Jonathan Wakely 2019-01-10 08:52:22 UTC
Probably not. I don't think sanitizers flag the unspecified comparisons, so we could just always do the uintptr_t comparisons for C++98/11. The patch looks good to me.
Comment 3 Richard Biener 2019-01-10 10:06:04 UTC
So the

  if (__x.7_6 < __y.8_7)
    goto <bb 4>; [50.00%]
  else
    goto <bb 3>; [50.00%]

  <bb 3> [local count: 38463891]:
  if (__x.7_6 > __y.8_7)
    goto <bb 4>; [50.00%]
  else
    goto <bb 5>; [50.00%]

fails to merge to != because at the time we run ifcombine it still looks like

  _4 = _3 > "hello";
  _5 = __builtin_constant_p (_4);
  if (_5 != 0)
    goto <bb 4>; [34.00%]
  else
    goto <bb 3>; [66.00%]

  <bb 3> [local count: 708669605]:
  __x.5_6 = (long unsigned int) "hello";
  __y.6_7 = (long unsigned int) _3;
  _8 = __x.5_6 < __y.6_7;

  <bb 4> [local count: 1073741824]:
  # _9 = PHI <_4(2), _8(3)>
  if (_9 != 0)
    goto <bb 8>; [50.00%]
  else
    goto <bb 5>; [50.00%]

  <bb 5> [local count: 536870913]:
  _10 = _3 < "hello";
  _11 = __builtin_constant_p (_10);
  if (_11 != 0)
    goto <bb 7>; [34.00%]
  else
    goto <bb 6>; [66.00%]

  <bb 6> [local count: 354334802]:
  __x.5_12 = (long unsigned int) _3;
  __y.6_13 = (long unsigned int) "hello";
  _14 = __x.5_12 < __y.6_13;

  <bb 7> [local count: 536870913]:
  # _15 = PHI <_10(5), _14(6)>
  if (_15 != 0)
    goto <bb 8>; [50.00%]
  else
    goto <bb 9>; [50.00%]

but maybe that is what Jakubs patch fixes.
Comment 4 Richard Biener 2019-01-10 10:36:30 UTC
(In reply to Richard Biener from comment #3)
>   _4 = _3 > "hello";
>   _5 = __builtin_constant_p (_4);
>   if (_5 != 0)
>     goto <bb 4>; [34.00%]
>   else
>     goto <bb 3>; [66.00%]
> 
>   <bb 3> [local count: 708669605]:
>   __x.5_6 = (long unsigned int) "hello";
>   __y.6_7 = (long unsigned int) _3;
>   _8 = __x.5_6 < __y.6_7;
> 
>   <bb 4> [local count: 1073741824]:
>   # _9 = PHI <_4(2), _8(3)>
>   if (_9 != 0)
>     goto <bb 8>; [50.00%]
>   else
>     goto <bb 5>; [50.00%]

which is a somewhat awkward pattern.  I think that VN should really be
able to value-number _4 and _8 the same eliding this crap.  Let me see
to do that.

OK, that helps _a lot_.
Comment 5 Richard Biener 2019-01-10 10:39:03 UTC
Created attachment 45401 [details]
VN patch

Turns f() into just

  <bb 2> [local count: 1073741824]:
  _3 = &MEM[(struct basic_string *)s_2(D)].D.18989._M_local_buf;
  MEM[(struct _Alloc_hider *)s_2(D)]._M_p = _3;
  MEM[(size_type *)s_2(D) + 8B] = 0;
  MEM[(char_type &)s_2(D) + 16] = 0;
  if (_3 != "hello")
    goto <bb 3>; [75.00%]
  else
    goto <bb 4>; [25.00%]

  <bb 3> [local count: 805306369]:
  __builtin_memcpy (_3, "hello", 5);
  goto <bb 5>; [100.00%]

  <bb 4> [local count: 134217728]:
  __builtin_memcpy ("hello", &MEM[(void *)"hello" + 5B], 5);

  <bb 5> [local count: 1073741824]:
  MEM[(size_type *)s_2(D) + 8B] = 5;
  MEM[(char_type &)s_2(D) + 21] = 0;
  return s_2(D);
Comment 6 Jakub Jelinek 2019-01-10 10:57:28 UTC
Author: jakub
Date: Thu Jan 10 10:56:56 2019
New Revision: 267800

URL: https://gcc.gnu.org/viewcvs?rev=267800&root=gcc&view=rev
Log:
	PR tree-optimization/88775
	* include/bits/stl_function.h (greater<_Tp*>::operator(),
	less<_Tp*>::operator(), greater_equal<_Tp*>::operator(),
	less_equal<_Tp*>::operator()): Use __builtin_is_constant_evaluated
	instead of __builtin_constant_p if available.  Don't bother with
	the pointer comparison in C++11 and earlier.

Modified:
    trunk/libstdc++-v3/ChangeLog
    trunk/libstdc++-v3/include/bits/stl_function.h
Comment 7 Richard Biener 2019-01-10 11:28:52 UTC
So after the patch we have

  __x.5_4 = (long unsigned int) "hello";
  __y.6_5 = (long unsigned int) _3;
  if (__x.5_4 != __y.6_5)
    goto <bb 3>; [75.00%]
  else
    goto <bb 4>; [25.00%]

  <bb 3> [local count: 805306369]:
  __builtin_memcpy (_3, "hello", 5);
  goto <bb 14>; [100.00%]

  <bb 4> [local count: 268435456]:
  if (_3 >= &MEM[(void *)"hello" + 5B])
    goto <bb 5>; [50.00%]
  else
    goto <bb 6>; [50.00%]

where we fail to elide this condition.  Well, in theory "hello"
could be followed by "hello", thus in .str "hello\0hello\0" and
thus the condition could be true.  Somehow w/o the libstdc++ patch
this condition either doesn't appear or is elided.
Comment 8 Jakub Jelinek 2019-01-10 11:40:32 UTC
"hello" string literal surely can be followed by anything else, but don't we consider it UB?
int
foo (void)
{
  int a = 0;
  for (int i = 0; i < 32; i++)
    a += "hello"[i];
  return a;
}
warning: iteration 6 invokes undefined behavior [-Waggressive-loop-optimizations]
     a += "hello"[i];
          ~~~~~~~^~~
So,
__builtin_memcpy (xyz, &MEM[(void *)"hello" + 5B], 5);
is the same kind of UB.
Note, I don't see any:
__builtin_memcpy ("hello", ..., 5);
in my IL, that would be another UB unless string literals are writable.
Comment 9 Richard Biener 2019-01-10 13:31:14 UTC
Created attachment 45404 [details]
VRP patch

This makes VRP register asserts for the pointer variants.  This doesn't help
until after ifcombine because we throw the asserts away if they are not
EQ/NE_EXPR but after ifcombine DOM then manages to optimize the testcase nicely
again.  After the libstdc++ patch and dom2:

__attribute__((abi_tag ("cxx11"), flatten))
f ()
{
  const size_type __nleft;
  char[16] * _3;
  long unsigned int __x.5_4;
  long unsigned int __y.6_5;
  char * _12;

  <bb 2> [local count: 1073741824]:
  _3 = &MEM[(struct basic_string *)s_2(D)].D.18989._M_local_buf;
  MEM[(struct _Alloc_hider *)s_2(D)]._M_p = _3;
  MEM[(size_type *)s_2(D) + 8B] = 0;
  MEM[(char_type &)s_2(D) + 16] = 0;
  __x.5_4 = (long unsigned int) "hello";
  __y.6_5 = (long unsigned int) _3;
  if (__x.5_4 != __y.6_5)
    goto <bb 3>; [75.00%]
  else
    goto <bb 4>; [25.00%]

  <bb 3> [local count: 805306369]:
  __builtin_memcpy (_3, "hello", 5);
  goto <bb 5>; [100.00%]

  <bb 4> [local count: 67108864]:
  __builtin_memcpy ("hello", &MEM[(void *)"hello" + 5B], 5);

  <bb 5> [local count: 1073741824]:
  MEM[(size_type *)s_2(D) + 8B] = 5;
  _12 = MEM[(char * *)s_2(D)];
  MEM[(char_type &)_12 + 5] = 0;
  return s_2(D);

}
Comment 10 Jakub Jelinek 2019-01-10 13:50:24 UTC
I get pretty much the same thing with:
--- gcc/match.pd.jj	2019-01-07 17:59:24.100931144 +0100
+++ gcc/match.pd	2019-01-10 14:45:31.870509916 +0100
@@ -1660,6 +1660,19 @@ (define_operator_list COND_TERNARY
   (if (TREE_INT_CST_LOW (@1) & 1)
    { constant_boolean_node (cmp == NE_EXPR, type); })))
 
+/* Turn (uintptr_t) ptr1 != (uintptr_t) ptr2 comparison into just
+   ptr1 != ptr2.  Don't do that for non-equality comparisons.  */
+(for cmp (eq ne)
+ (simplify
+  (cmp (convert@2 @0) (convert @1))
+  (if (ANY_INTEGRAL_TYPE_P (TREE_TYPE (@2))
+       && POINTER_TYPE_P (TREE_TYPE (@0))
+       && POINTER_TYPE_P (TREE_TYPE (@1))
+       && TYPE_PRECISION (TREE_TYPE (@2)) >= TYPE_PRECISION (TREE_TYPE (@0))
+       && ((GIMPLE && types_compatible_p (TREE_TYPE (@0), TREE_TYPE (@1)))
+	   || (GENERIC && TREE_TYPE (@0) == TREE_TYPE (@1))))
+   (cmp @0 @1))))
+
 /* Arguments on which one can call get_nonzero_bits to get the bits
    possibly set.  */
 (match with_possible_nonzero_bits

(well, even without the (unsigned long) casts).  Isn't this better or do we want both patches?
Comment 11 Jakub Jelinek 2019-01-10 14:42:56 UTC
Created attachment 45405 [details]
gcc9-pr88775.patch

Full patch.
Comment 12 Jakub Jelinek 2019-01-10 22:50:35 UTC
Unfortunately the #c11 patch breaks the
20_util/function_objects/comparisons_pointer.cc testcase (wonder if your VRP patch would break it too), where the testcase does exactly what has been discussed on IRC:
int b[8];
int a[8];

void
test01()
{
  int* p = a + 8;
  std::greater<int*> gt;

  std::stringstream ss;
  ss << gt(p, b) << ' ' << gt(b, p) << ' ' << (!gt(p, b) && !gt(b, p));

So, if we want to do this optimization, we probably need to make sure that whatever optimizes the a + 8 != b comparisons into false doesn't do that if one pointer is known or could point to the end of one of the objects and the other pointer points to the start of another object and those two objects could potentially live in the same section (though, maybe even different sections could be problematic if one is at the end of the whole section and another one is the first one in the next section).  As the distinction is gone in RTL, this must be some GIMPLE or tree opt.
Comment 13 Jakub Jelinek 2019-01-10 22:56:38 UTC
Seems it is the:
/* When the addresses are not directly of decls compare base and offset.
   This implements some remaining parts of fold_comparison address
   comparisons but still no complete part of it.  Still it is good
   enough to make fold_stmt not regress when not dispatching to fold_binary.  */
opt in match.pd.  So, we'd need to verify that the problematic case can't happen (one address being after the last byte in one of the objects and the other the beginning of another one).
Comment 14 Jakub Jelinek 2019-01-10 22:57:25 UTC
Of course only for equality comparisons, for non-equality the code is ok as is.
Comment 15 Jakub Jelinek 2019-01-11 08:50:52 UTC
Created attachment 45411 [details]
gcc9-pr88775-2.patch

The following incremental patch (untested except for this testcase and comparisons_pointer.cc) fixes that.  Unfortunately there is still ptrs_compare_unequal routine that would need similar treatment, and I'm afraid it will result in less optimized code.

This patch alone though could be useful even without the other patch, perhaps if we for pointers like before optimize always.  The previous case where we optimized for integral equality comparisons of pointers only if the offsets are the same is both incorrect (for zero sized objects) and in many cases not optimizing enough (it is fine if both offsets are different, all we care is that the problematic cases where one pointer points to the beginning of one object and the other points to one past last byte of another one aren't optimized, everything else can).

C99 says:
"Tw o pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space."
so I think we have to be conservative and need to treat pointer equality the same as equality of pointers cast to integral types, the question is if we are prepared for this for GCC9.

Note, in ptrs_compare_unequal for the one obj, one ptr case (the only interesting one it handles), we could check if the pointer to the obj is known to be into the middle of the object (if size is constant and offset too, that is trivial, other cases might be harder and need more discussions) and in that case we can do whatever it does now.  Otherwise, either punt, or e.g. check if obj is a global var and the other ptr points to only automatic vars (or if obj is automatic and ptr points to only global vars).  Something else we could do?
Comment 16 Jakub Jelinek 2019-01-15 08:11:32 UTC
Author: jakub
Date: Tue Jan 15 08:11:00 2019
New Revision: 267931

URL: https://gcc.gnu.org/viewcvs?rev=267931&root=gcc&view=rev
Log:
	PR tree-optimization/88775
	* match.pd (cmp (convert1?@2 addr@0) (convert2? addr@1)): Optimize
	equal == 0 equality pointer comparisons some more if compared in
	integral types and either one points to an automatic var and the
	other to a global, or we can prove at least one points to the middle
	or both point to start or both point to end.

	* gcc.dg/tree-ssa/pr88775-1.c: New test.
	* gcc.dg/tree-ssa/pr88775-2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr88775-1.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr88775-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/match.pd
    trunk/gcc/testsuite/ChangeLog
Comment 17 Jakub Jelinek 2019-01-16 14:36:55 UTC
Without the #c11 patch (+ removal of the !INTEGRAL_TYPE_P special case from the above committed change + fixing up ptrs_compare_unequal, or something equivalent like the VRP change) I'm afraid there isn't much possibilities left to do, and those changes are too risky for GCC9.  The problem with your testcase is that NRV is in place, so we don't even know if s is an automatic variable or a global variable, or a heap variable etc.
Comment 18 Marc Glisse 2019-01-16 17:17:39 UTC
(In reply to Jakub Jelinek from comment #17)
> Without the #c11 patch (+ removal of the !INTEGRAL_TYPE_P special case from
> the above committed change + fixing up ptrs_compare_unequal, or something
> equivalent like the VRP change)

I am a bit lost with the current status there ;-) (I don't have much time now so I can't really follow anyway, no need to explain it to me)

> I'm afraid there isn't much possibilities
> left to do, and those changes are too risky for GCC9.

I am not in a hurry. I don't personally write code where the performance of string matters, I was forwarding (the first part of) a report by some user.

> The problem with your
> testcase is that NRV is in place, so we don't even know if s is an automatic
> variable or a global variable, or a heap variable etc.

NRV complicates things, but even with NRV I think we should be able to say that s and "hello" cannot have the same address, since the caller is supposed to pass the address of a writable buffer as argument. Or can you think of cases where NRV would allow the return object and a string literal to have the same address? Maybe if we store the returned value in a const object and never compare the address of this const object to another one the compiler could skip constructing the object and use a reference to a constant object?
Comment 19 Jakub Jelinek 2019-02-22 15:22:00 UTC
GCC 8.3 has been released.
Comment 20 Eric Gallager 2019-04-20 18:31:48 UTC
came up on the mailing lists here: https://gcc.gnu.org/ml/gcc/2019-04/msg00202.html
Comment 21 Jakub Jelinek 2020-03-04 09:37:48 UTC
GCC 8.4.0 has been released, adjusting target milestone.
Comment 22 Jakub Jelinek 2021-05-14 09:51:06 UTC
GCC 8 branch is being closed.
Comment 23 Richard Biener 2021-06-01 08:12:43 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
Comment 24 Richard Biener 2022-05-27 09:40:00 UTC
GCC 9 branch is being closed
Comment 25 Jakub Jelinek 2022-06-28 10:36:22 UTC
GCC 10.4 is being released, retargeting bugs to GCC 10.5.