Bug 90348 - [7/8/9/10 Regression] Partition of char arrays is incorrect in some cases
Summary: [7/8/9/10 Regression] Partition of char arrays is incorrect in some cases
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 8.3.0
: P2 normal
Target Milestone: 8.4
Assignee: Not yet assigned to anyone
URL:
Keywords: wrong-code
Depends on:
Blocks: 61203
  Show dependency treegraph
 
Reported: 2019-05-04 18:57 UTC by Pieter Wuille
Modified: 2019-06-04 21:20 UTC (History)
8 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail: 7.3.0, 8.3.0, 9.0
Last reconfirmed: 2019-05-05 00:00:00


Attachments
Source file to reproduce the problem (4.66 KB, text/plain)
2019-05-04 18:57 UTC, Pieter Wuille
Details
Un-preprocessed test case C file (302 bytes, text/x-csrc)
2019-05-04 19:04 UTC, Pieter Wuille
Details
gcc10-pr90348.patch (2.18 KB, patch)
2019-05-07 11:11 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Pieter Wuille 2019-05-04 18:57:12 UTC
Created attachment 46289 [details]
Source file to reproduce the problem

In attachment is a small source file that seems to be miscompiled with "-O1 -finline-small-functions".

Tested versions/environments where the issue appears:
* GCC 7.3, Ubuntu 18.04.2, x86_64, "gcc -O1 -finline-small-functions test.i -o test"
* GCC 8.3, Ubuntu 18.04.2, x86_64, "gcc -O1 -finline-small-functions test.i -o test"
* GCC 8.3, Ubuntu 18.04.2, x86_64, "gcc -m32 -O1 -finline-small-functions test.i -o test"
* GCC 9.0, Fedora 30, i686, "gcc -O2 test.i -o test"

In a loop, a 4-byte (or larger) char array "in" is created, and then increasingly long prefixes initialized to zero. A small inlinable function "set_one_on_stack" is invoked during the loop that should have no effect (it sets a local variable "buf" to one in a roundabout way), but apparently the "buf" variable is given the same stack location as the caller's "in" variable, overwriting the latter. When compiled incorrectly, an unexpected assertion occurs.

The test file is a reduced version of an issue observed on some platforms in the Bitcoin Core unit tests. See https://github.com/bitcoin/bitcoin/issues/14580 for more details.
Comment 1 Pieter Wuille 2019-05-04 19:04:22 UTC
Created attachment 46290 [details]
Un-preprocessed test case C file

Adding the original .c file as well to demonstrate the problem on different platforms (the .i file is for GCC 8.3, x86_64, -m64).
Comment 2 MCCCS 2019-05-05 10:46:43 UTC
/*Below's the compiler testsuite code with no headers.

Here's the assembly difference (because the difference is very small. The one on the left works (-O1 only) while the one on the right has `-O1 -finline-small-functions`): https://www.diffchecker.com/hAYLeoPV

I've also tested it on macOS 10.14 with GCC 8, it failed too.

In addition, on Aarch64 - Raspberry/Raspbian using GCC 6 and GCC 9: GCC 9 also failed on ARM, but GCC 6 didn't cause assertion fail, thus it's clearly a 7/8/9/10 regression.*/

/* { dg-do run } */
/* { dg-options "-O2" } */

void __attribute__ ((noinline)) set_one (unsigned char* ptr)
{
    *ptr = 1;
}

int __attribute__ ((noinline)) check_nonzero (unsigned char const* in, unsigned int len)
{
    for (unsigned int i = 0; i < len; ++i) {
        if (in[i] != 0) return 1;
    }
    return 0;
}

void set_one_on_stack () {
    unsigned char buf[1];
    set_one (buf);
}

int main () {
    for (int i = 0; i < 5; ++i) {
        unsigned char in[4];
        for (int j = 0; j < i; ++j) {
            in[j] = 0;
            set_one_on_stack ();
        }
        if (check_nonzero (in, i)) {
            __builtin_abort ();
        }
    }
}


//~ MCCCS  (DesWurstes)
Comment 3 Andrew Pinski 2019-05-05 20:25:16 UTC
Partition 0: size 4 align 8
        in      buf

That is wrong, it should have been two seperate ones.
Confirmed with:
gcc version 9.0.1 20190310 (experimental) [master revision 449a19898aa:0239598e3c8:8fe074cf790f632b22e59c24f102e528407bb04e] (GCC)

On aarch64-linux-gnu, I don't see any changes to the expand sources after that point.

I could not reproduce it with Ubuntu's 7.3.0 but can with a non modified version of GCC 7.3.0 on mips64-linux-gnu:
Partition 0: size 4 align 8
        in      buf
Comment 4 Richard Biener 2019-05-06 08:37:13 UTC
Sth wrong with liveness analysis.  There's obvious liveness of IN from BB2:

;;   basic block 2, loop depth 0
;;    pred:       ENTRY
  _30 = (unsigned long) &in;
  ivtmp.30_29 = _30 + 1;
  goto <bb 5>; [100.00%]
;;    succ:       5

to

;;   basic block 7, loop depth 1
;;    pred:       5
  in ={v} {CLOBBER};
  i_10 = i_6 + 1;
  if (i_10 != 5)
    goto <bb 3>; [80.00%]
  else
    goto <bb 8>; [20.00%]
;;    succ:       3
;;                8

but maybe that being at different loop depth somehow confuses the algorithm
in fact having it there seems odd to me but the address-taking in BB2 is
the result of IVOPTs hoisting.  The CLOBBER doesn't effect hoisting
the address but RTL expansion liveness compute splits 'in' into multiple
logical instances at the CLOBBER which _does_ make the addresses effectively
different ...
Comment 5 Jakub Jelinek 2019-05-06 15:39:16 UTC
I think before ivopts this looks fine, we have:
  <bb 3> [local count: 858993460]:
  # j_20 = PHI <0(10), j_13(11)>
  in[j_20] = 0;
  set_one (&buf);
  buf ={v} {CLOBBER};
  j_13 = j_20 + 1;
  if (i_6 > j_20)
    goto <bb 11>; [80.00%]
  else
    goto <bb 4>; [20.00%]
in the inner loop and
  <bb 7> [local count: 214748365]:
  in ={v} {CLOBBER};
  i_10 = i_6 + 1;
  ivtmp_22 = ivtmp_4 - 1;
  if (ivtmp_22 != 0)
    goto <bb 8>; [80.00%]
  else
    goto <bb 9>; [20.00%]
near the end of the outer loop.  The problem is that ivopts turns the in[j_20]
access into
  <bb 2> [local count: 42959980]:
  _30 = (unsigned int) &in;
  ivtmp.27_29 = _30 + 1;
  goto <bb 5>; [100.00%]
  
  <bb 10> [local count: 171798691]:
  ivtmp.21_16 = (unsigned int) &in;
  _23 = (unsigned int) &in;
    
  <bb 3> [local count: 858993460]:
  # ivtmp.21_21 = PHI <ivtmp.21_16(10), ivtmp.21_3(11)>
  _18 = (void *) ivtmp.21_21;
  MEM[base: _18, offset: 0B] = 0;
  set_one (&buf);
  buf ={v} {CLOBBER};
which is still fine, for the life analysis we still see in as live in bb 10 and therefore in bb3 too.
But later on dom3 changes
   <bb 3> [local count: 171798691]:
-  ivtmp.21_16 = (unsigned int) &in;
-  _23 = (unsigned int) &in;
+  ivtmp.21_16 = _30;
+  _23 = _30;
and now &in is live clearly only in bb2 before the outer loop for the life analysis purposes.  Now, if we kill that
      if (gimple_clobber_p (stmt))
        {
          tree lhs = gimple_assign_lhs (stmt);
          size_t *v;
          /* Nested function lowering might introduce LHSs
             that are COMPONENT_REFs.  */
          if (!VAR_P (lhs))
            continue;
          if (DECL_RTL_IF_SET (lhs) == pc_rtx
              && (v = decl_to_stack_part->get (lhs)))
            bitmap_clear_bit (work, *v);
        }
you've mentioned, I'm afraid we lose all the stack slot sharing possibilities, or at least most of them.
So I'm afraid we need to do something smarter.
Either we need to track during the life analysis algorithm what variables SSA_NAMEs can point to (even for non-pointers, as in this testcase ivopts uses an integral value (unsigned int) &in and then later casts to integers), or could we use the alias analysis points to information for that?

For this testcase, it would be enough to walk for referenced pointer SSA_NAMEs through their points to variables and mark those variables as also seen in the IL.
Comment 6 Jakub Jelinek 2019-05-06 15:50:58 UTC
We'd probably need to change decl_to_stack_part from hash_map<tree, size_t> to hash_map<unsigned, size_t> that would just map DECL_UIDs which have DECL_RTL_IF_SET equal to pc_rtx to the partition numbers, rather than from decl, or have some other mapping from DECL_UIDs that appear in the points to info vars back to the decls.
Comment 7 Michael Matz 2019-05-06 16:28:24 UTC
No, this is not a problem in the stack slot sharing algorithm, but rather in
the input.  As presented to expand, and only showing the important parts,
and reordering some BBs to make the flow more obvious:

;;   basic block 2, loop depth 0
;;    pred:       ENTRY
  _30 = (unsigned long) &in;
  ivtmp.27_29 = _30 + 1;
  goto <bb 5>; [100.00%]

So, 'in' becomes "live" here, it's address in _30 and _29.  Fallthrough to bb5,
which also uses in, but otherwise is uninteresting, except that it can reach
only BBs 6 and 7:

;;   basic block 5, loop depth 1
  ...
  _2 = check_zero (&in, _31);
  if (_2 != 0)
    goto <bb 7>; [99.96%]
  else
    goto <bb 6>; [0.04%]

bb6 is a no-return block, hence uninteresting.  bb7 _is_ interesting in that
it clobbers in:

;;   basic block 7, loop depth 1
;;    pred:       5
  in ={v} {CLOBBER};
  if (i_11 != 5)
    goto <bb 8>; [83.33%]
  else
    goto <bb 9>; [16.67%]

Note that the semantics of the clobber is not only that the former contents
are destroyed, but rather that the very storage associated with the clobbered
entity is gone.  So, from now on, any pointers into 'in', and memory accesses
into 'in' are invalid.  Nevertheless the flow from bb7 goes to bb 8 and 9,
the latter being the return block, so:

;;   basic block 8, loop depth 1
;;    pred:       7
  if (i_11 > 0)
    goto <bb 3>; [100.00%]
  else
    goto <bb 4>; [0.00%]

and here we finally have a path into bb3, which is the other interesting one:

;;   basic block 3, loop depth 2
  # ivtmp.20_6 = PHI <_30(8), ivtmp.20_18(3)>

.... BOOM! .... Here _30 is used, and ...

  _4 = (void *) ivtmp.20_6;
  MEM[base: _4, offset: 0B] = 0;

... even written into ...  That's invalid.  _30 is associated with an
object that is clobbered and gone ...

  set_one (&buf);
  buf ={v} {CLOBBER};
  ivtmp.20_18 = ivtmp.20_6 + 1;

... and as the MEM[] write can't have possibly been into 'in' (as that is invalid, as 'in' didn't exist at the MEM access), it's okay and sensible to
allocate 'in' and 'buf' into the same memory.

It seems to be a common misconception of what the clobber is really about.
I designed it to mean what I wrote above, the storage associated with the
clobbered entities is gone after the clobber (not merely it's former contents!). 

But ivopts or dom extends the lifetime of 'in' (by moving an address-taken
earlier) and hence the lifetime of its storage, but without doing anything about the clobber (and hence not _really_ extending the lifetime).  That doesn't work.
It's basically a mirrored transformation of the usual take-address-of-local
and access it out of it's declared scope, just that here the _start_ of the
supposed lifetime is moved out of the declared scope, not the end.
Comment 8 Richard Biener 2019-05-07 09:26:30 UTC
(In reply to Michael Matz from comment #7)
> No, this is not a problem in the stack slot sharing algorithm, but rather in
> the input.  As presented to expand, and only showing the important parts,
> and reordering some BBs to make the flow more obvious:
> 
> ;;   basic block 2, loop depth 0
> ;;    pred:       ENTRY
>   _30 = (unsigned long) &in;
>   ivtmp.27_29 = _30 + 1;
>   goto <bb 5>; [100.00%]
> 
> So, 'in' becomes "live" here, it's address in _30 and _29.  Fallthrough to
> bb5,
> which also uses in, but otherwise is uninteresting, except that it can reach
> only BBs 6 and 7:
> 
> ;;   basic block 5, loop depth 1
>   ...
>   _2 = check_zero (&in, _31);
>   if (_2 != 0)
>     goto <bb 7>; [99.96%]
>   else
>     goto <bb 6>; [0.04%]
> 
> bb6 is a no-return block, hence uninteresting.  bb7 _is_ interesting in that
> it clobbers in:
> 
> ;;   basic block 7, loop depth 1
> ;;    pred:       5
>   in ={v} {CLOBBER};
>   if (i_11 != 5)
>     goto <bb 8>; [83.33%]
>   else
>     goto <bb 9>; [16.67%]
> 
> Note that the semantics of the clobber is not only that the former contents
> are destroyed, but rather that the very storage associated with the clobbered
> entity is gone.  So, from now on, any pointers into 'in', and memory accesses
> into 'in' are invalid.  Nevertheless the flow from bb7 goes to bb 8 and 9,
> the latter being the return block, so:
> 
> ;;   basic block 8, loop depth 1
> ;;    pred:       7
>   if (i_11 > 0)
>     goto <bb 3>; [100.00%]
>   else
>     goto <bb 4>; [0.00%]
> 
> and here we finally have a path into bb3, which is the other interesting one:
> 
> ;;   basic block 3, loop depth 2
>   # ivtmp.20_6 = PHI <_30(8), ivtmp.20_18(3)>
> 
> .... BOOM! .... Here _30 is used, and ...
> 
>   _4 = (void *) ivtmp.20_6;
>   MEM[base: _4, offset: 0B] = 0;
> 
> ... even written into ...  That's invalid.  _30 is associated with an
> object that is clobbered and gone ...
> 
>   set_one (&buf);
>   buf ={v} {CLOBBER};
>   ivtmp.20_18 = ivtmp.20_6 + 1;
> 
> ... and as the MEM[] write can't have possibly been into 'in' (as that is
> invalid, as 'in' didn't exist at the MEM access), it's okay and sensible to
> allocate 'in' and 'buf' into the same memory.
> 
> It seems to be a common misconception of what the clobber is really about.
> I designed it to mean what I wrote above, the storage associated with the
> clobbered entities is gone after the clobber (not merely it's former
> contents!). 
> 
> But ivopts or dom extends the lifetime of 'in' (by moving an address-taken
> earlier) and hence the lifetime of its storage, but without doing anything
> about the clobber (and hence not _really_ extending the lifetime).  That
> doesn't work.
> It's basically a mirrored transformation of the usual take-address-of-local
> and access it out of it's declared scope, just that here the _start_ of the
> supposed lifetime is moved out of the declared scope, not the end.

While I spotted this "issue" as well I respectfully disagree about
the semantics.  Not because they are not sound but because of
the implementation challenge resolving around address-takens being
values with no data dependence on the clobbers.

I thought the liveness algorithm would be able to handle this situation.
If not we need to prune clobbers that became "invalid" somehow.
Not sure how - the address value never "dies", so we'd need to compute
liveness of the things the address escapes to - SSA names are easy,
calls - well, we have to give up.  This possiby defeats the whole idea
of doing stack sharing analysis (I remember kernel/fortran testcases
passing the stack slots by reference to a function).

Now if we would want to try forcing the original semantics we need a
verifier verifying the IL state against bogus transforms - but then we
would have a way to prune the invalid ones.
Comment 9 Jakub Jelinek 2019-05-07 11:11:03 UTC
Created attachment 46312 [details]
gcc10-pr90348.patch

Untested patch that implements what was written in #c5.  I agree that without further changes to the IL, determining if one can hoist addresses of local variables or not is going to be hard, would require computing the variable life info in each pass that would do something similar.  On the other side, admittedly such hoisting results in worse code generation because if the address is hoisted earlier than where it used to be live before, then there will be more stack conflicts.
Comment 10 Richard Biener 2019-05-07 12:45:39 UTC
(In reply to Jakub Jelinek from comment #9)
> Created attachment 46312 [details]
> gcc10-pr90348.patch
> 
> Untested patch that implements what was written in #c5.  I agree that
> without further changes to the IL, determining if one can hoist addresses of
> local variables or not is going to be hard, would require computing the
> variable life info in each pass that would do something similar.  On the
> other side, admittedly such hoisting results in worse code generation
> because if the address is hoisted earlier than where it used to be live
> before, then there will be more stack conflicts.

Ick.  You need to handle pt->anything and pt->escaped (walk the escaped
solution) as well.  And !SSA_NAME_PTR_INFO (op) is of course the same
as pt->anything.

I see it quickly degrading given the last PTA compute is far far away...
Comment 11 Jakub Jelinek 2019-05-07 13:15:48 UTC
(In reply to Richard Biener from comment #10)
> (In reply to Jakub Jelinek from comment #9)
> > Created attachment 46312 [details]
> > gcc10-pr90348.patch
> > 
> > Untested patch that implements what was written in #c5.  I agree that
> > without further changes to the IL, determining if one can hoist addresses of
> > local variables or not is going to be hard, would require computing the
> > variable life info in each pass that would do something similar.  On the
> > other side, admittedly such hoisting results in worse code generation
> > because if the address is hoisted earlier than where it used to be live
> > before, then there will be more stack conflicts.
> 
> Ick.  You need to handle pt->anything and pt->escaped (walk the escaped
> solution) as well.  And !SSA_NAME_PTR_INFO (op) is of course the same
> as pt->anything.

Do I?
I thought I don't.  The thing is, I believe the partitioning only cares about where (originally in the source) the automatic variables are referenced.
Now, if I have say:
__attribute__((noipa)) type *foo (type *x) { return x; }
or similar, in order for a variable to be live before the clobber, it needs to escape in the area where the variable is live, we don't care what we do with whatever it returned, unless we actually hoist such escaping calls before that region.  If we can say for:
  for (...)
    {
      unsigned char v[10];
      unsigned char *p = foo (v);
      *p = 1;
      unsigned char w[10];
      bar (w);
    }
hoist the p = foo (v); call before the loop, then indeed we are in big trouble.
But if we can't, the attached patch was just meant as a shorthand for trying to do a dataflow propagation of the can a pointer SSA_NAME point to variable xyz, if we don't consider escaping (to other functions and to global state).
What I'm trying to "undo" is just the hoisting of the address loads, not hoisting of those address loads plus function calls in which it escapes or where that address is saved to memory.

If I have to consider pt->anything and pt->escaped, then it will be as useless for the variable conflicts as is removing the important clearing of the bitmap bit on clobber stmt, we won't share stack slots pretty much at all.
Comment 12 Michael Matz 2019-05-07 13:22:51 UTC
(In reply to Jakub Jelinek from comment #11)
> before that region.  If we can say for:
>   for (...)
>     {
>       unsigned char v[10];
>       unsigned char *p = foo (v);
>       *p = 1;
>       unsigned char w[10];
>       bar (w);
>     }
> hoist the p = foo (v); call before the loop, then indeed we are in big
> trouble.

This is effectively what the testcase is doing (just that 'foo' is no call, but a normal address expression), so yes, we can do that, and yes we are in big
trouble :-/

> If I have to consider pt->anything and pt->escaped, then it will be as
> useless for the variable conflicts as is removing the important clearing of
> the bitmap bit on clobber stmt, we won't share stack slots pretty much at
> all.

Yeah; if we don't want to patch the specific situation for this testcase
(which might be okayish, we haven't seen this problem very often over the
last years), but want to really fix it we might have to take more involved
means like doing stack slot sharing before gimplification and rewriting
the IL to reflect this.  Or give up on sharing (which isn't a good idea).
Gnah.
Comment 13 Richard Biener 2019-05-07 13:36:12 UTC
(In reply to Jakub Jelinek from comment #11)
> (In reply to Richard Biener from comment #10)
> > (In reply to Jakub Jelinek from comment #9)
> > > Created attachment 46312 [details]
> > > gcc10-pr90348.patch
> > > 
> > > Untested patch that implements what was written in #c5.  I agree that
> > > without further changes to the IL, determining if one can hoist addresses of
> > > local variables or not is going to be hard, would require computing the
> > > variable life info in each pass that would do something similar.  On the
> > > other side, admittedly such hoisting results in worse code generation
> > > because if the address is hoisted earlier than where it used to be live
> > > before, then there will be more stack conflicts.
> > 
> > Ick.  You need to handle pt->anything and pt->escaped (walk the escaped
> > solution) as well.  And !SSA_NAME_PTR_INFO (op) is of course the same
> > as pt->anything.
> 
> Do I?
> I thought I don't.  The thing is, I believe the partitioning only cares
> about where (originally in the source) the automatic variables are
> referenced.
> Now, if I have say:
> __attribute__((noipa)) type *foo (type *x) { return x; }
> or similar, in order for a variable to be live before the clobber, it needs
> to escape in the area where the variable is live, we don't care what we do
> with whatever it returned, unless we actually hoist such escaping calls
> before that region.  If we can say for:
>   for (...)
>     {
>       unsigned char v[10];
>       unsigned char *p = foo (v);
>       *p = 1;
>       unsigned char w[10];
>       bar (w);
>     }
> hoist the p = foo (v); call before the loop, then indeed we are in big
> trouble.
> But if we can't, the attached patch was just meant as a shorthand for trying
> to do a dataflow propagation of the can a pointer SSA_NAME point to variable
> xyz, if we don't consider escaping (to other functions and to global state).
> What I'm trying to "undo" is just the hoisting of the address loads, not
> hoisting of those address loads plus function calls in which it escapes or
> where that address is saved to memory.
> 
> If I have to consider pt->anything and pt->escaped, then it will be as
> useless for the variable conflicts as is removing the important clearing of
> the bitmap bit on clobber stmt, we won't share stack slots pretty much at
> all.

The point about !SSA_NAME_PTR_INFO is that passes might clear it, replace
such SSA name with a new one, forgetting to copy it, etc.  The point
about anything is that points-to analysis might just have given up
(consider the provenance thing where we might end up with anything for
going through some uintptr arithmetic), the point about escaped is that
this is just a placeholder for a set of variables and points-to tends
to shrink sets by removing bits also in escaped.  Also
global = ptr; ptr2 = global; will have escaped in the sets but not
necessarily the bit of the original decl.

So my comments are just about correctly interpreting points-to data.
Comment 14 Jakub Jelinek 2019-05-07 13:39:20 UTC
(In reply to Michael Matz from comment #12)
> (In reply to Jakub Jelinek from comment #11)
> > before that region.  If we can say for:
> >   for (...)
> >     {
> >       unsigned char v[10];
> >       unsigned char *p = foo (v);
> >       *p = 1;
> >       unsigned char w[10];
> >       bar (w);
> >     }
> > hoist the p = foo (v); call before the loop, then indeed we are in big
> > trouble.
> 
> This is effectively what the testcase is doing (just that 'foo' is no call,
> but a normal address expression), so yes, we can do that, and yes we are in
> big
> trouble :-/

Well, that p = foo (v); or store of memory is something that will have (unless const call, indeed) a vuse or even a vdef and the alias oracle should usually consider it to be conflicting with the clobber stmt, so I'd hope it isn't hoisted in that case, while for the pure address arithmetics there is nothing that can easily stop the hoisting.

Now, if there isn't really an easy way out of this and given how rarely this has been reported (I think we have this PR and PR61203, don't remember anything else), the options can be do nothing, or do something simple that isn't that expensive, will fix this testcase and while not perfect solution will still allow variable sharing in the general case almost as often as before, or change the IL representation so that such hoisting or sinking of addresses is not possible.
Comment 15 Richard Biener 2019-05-07 13:50:42 UTC
Educating people about -fstack-reuse is also a possibility, thus leave the issue to workarounds like that, experimenting with full rewrites that are obviously not back-portable.