Bug 103006

Summary: [11/12/13/14 Regression] wrong code at -O1 or -O2 on x86_64-linux-gnu by r7-7101
Product: gcc Reporter: Zhendong Su <zhendong.su>
Component: middle-endAssignee: Richard Biener <rguenth>
Status: ASSIGNED ---    
Severity: normal CC: dimhen, jakub, jakub, marxin, matz, msebor, rguenth, rsandifo
Priority: P2 Keywords: wrong-code
Version: 12.0   
Target Milestone: 11.5   
See Also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90348
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109967
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110115
Host: Target:
Build: Known to work: 6.3.1
Known to fail: 7.3.1, 8.1.0 Last reconfirmed: 2022-01-29 00:00:00
Bug Depends on: 111843    
Bug Blocks:    

Description Zhendong Su 2021-10-30 17:45:51 UTC
This is quite long-latent, affecting GCC 7.* and later.

[856] % gcctk -v
Using built-in specs.
COLLECT_GCC=gcctk
COLLECT_LTO_WRAPPER=/local/suz-local/software/local/gcc-trunk/libexec/gcc/x86_64-pc-linux-gnu/12.0.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-trunk/configure --disable-bootstrap --prefix=/local/suz-local/software/local/gcc-trunk --enable-languages=c,c++ --disable-werror --enable-multilib --with-system-zlib
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 12.0.0 20211030 (experimental) [master r12-4804-g75c9fa318e3] (GCC)
[857] %
[857] % gcctk -O1 small.c; ./a.out
0
[858] % gcctk -O2 small.c
[859] % ./a.out
0
Aborted
[860] %
[860] % cat small.c
int printf(const char *, ...);
int a, *b, c, e, f;
void g() {
  int *d[7];
  d[6] = b = (int *)d;
  printf("0\n");
}
int i() {
  for (c = 0; c < 2; c++) {
    long h[6][2];
    for (e = 0; e < 6; e++)
      for (f = 0; f < 2; f++)
        h[e][f] = 1;
    if (c) {
      g();
      return h[3][0];
    }
  }
  return 0;
}
int main() {
  if (i() != 1)
    __builtin_abort ();
  return 0;
}
Comment 1 Andrew Pinski 2021-10-30 22:12:18 UTC
Confirmed, here is one which is broken at -O1 and -O2 but ok at -O3:

__attribute__((noipa)) void ff(void){}
int a, *b, c, e, f;
__attribute__((always_inline))
static inline void g() {
  int *d[7];
  d[6] = b = (int *)d;
  ff();
}
__attribute__((noinline))
int i() {
  for (c = 0; c < 2; c++) {
    long h[6][2];
    for (e = 0; e < 6; e++)
      for (f = 0; f < 2; f++)
        h[e][f] = 1;
    if (c) {
      g();
      return h[3][0];
    }
  }
  return 0;
}
int main() {
  if (i() != 1)
    __builtin_abort ();
  return 0;
}
Comment 2 Andrew Pinski 2021-10-30 22:15:09 UTC
Since noipa does not work with GCC 7, the following fails all the back to GCC 7:

__attribute__((noipa,noinline,noclone)) void ff(void){asm("":::"memory");}
int a, *b, c, e, f;
__attribute__((always_inline))
static inline void g() {
  int *d[7];
  d[6] = b = (int *)d;
  ff();
}
__attribute__((noinline))
int i() {
  for (c = 0; c < 2; c++) {
    long h[6][2];
    for (e = 0; e < 6; e++)
      for (f = 0; f < 2; f++)
        h[e][f] = 1;
    if (c) {
      g();
      return h[3][0];
    }
  }
  return 0;
}
int main() {
  if (i() != 1)
    __builtin_abort ();
  return 0;
}
Comment 3 Andrew Pinski 2021-10-30 22:28:23 UTC
This looks like a Register allocator issue.
IN GCC 6 we have:
.L8:
        movq    %rsp, b(%rip)
        movq    %rsp, 48(%rsp)
        call    _Z2ffv
        movq    112(%rsp), %rax
        addq    $160, %rsp
        .cfi_def_cfa_offset 8
        ret

While in GCC 7+ we get:
        movl    $2, f(%rip)
        movl    $6, e(%rip)
        movq    %rsp, b(%rip)
        movq    %rsp, 48(%rsp)
        call    _Z2ffv
        movl    48(%rsp), %eax
        addq    $96, %rsp
        .cfi_def_cfa_offset 8
        ret
Comment 4 Jakub Jelinek 2021-11-01 11:13:51 UTC
Doesn't look to me like RA issue, but rather incorrect coalescing of temporary vars.
Optimized dump has:
  <bb 2> [local count: 1652516]:
  ivtmp.40_47 = (unsigned long) &h;
  _30 = ivtmp.40_47 + 96;

  <bb 3> [local count: 13370357]:
  # ivtmp.40_38 = PHI <ivtmp.40_47(2), ivtmp.40_44(3)>
  _21 = (void *) ivtmp.40_38;
  MEM[(long int *)_21] = 1;
  MEM[(long int *)_21 + 8B] = 1;
  ivtmp.40_44 = ivtmp.40_38 + 16;
  if (_30 != ivtmp.40_44)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 4> [local count: 1652516]:
  h ={v} {CLOBBER};

  <bb 5> [local count: 13370357]:
  # ivtmp.30_29 = PHI <ivtmp.30_48(5), ivtmp.40_47(4)>
  _31 = (void *) ivtmp.30_29;
  MEM[(long int *)_31] = 1;
  MEM[(long int *)_31 + 8B] = 1;
  ivtmp.30_48 = ivtmp.30_29 + 16;
  if (_30 != ivtmp.30_48)
    goto <bb 5>; [89.00%]
  else
    goto <bb 6>; [11.00%]

  <bb 6> [local count: 1652516]:
  f = 2;
  e = 6;
  c = 1;
  b = &d;
  d[6] = &d;
  ff ();
  d ={v} {CLOBBER};
  _5 = h[3][0];
  _18 = (int) _5;
  h ={v} {CLOBBER};
  return _18;

So, the code initializes the whole h array, then has h ={v} {CLOBBER};, then initializes it again, but unfortunately without mentioning the var in the IL - it reuses ivtmp.40_47 for that, then sets various vars, including d[6] initialization with &d escaping, clobbers d, reads from h and finally clobbers h again.  I guess the var partitioning code from the above thinks h isn't really live in between the first h clobber and h[3][0] load and so decides:
Partition 0: size 96 align 16
        h       d
and places both h and d into the same stack slot.
Comment 5 Jakub Jelinek 2021-11-01 11:29:50 UTC
What we see is an effect of 3 different optimizations, one is loop unrolling (cunroll in this case), another one is ivopts that moves the h array references out of the loops but still has separate:
  ivtmp.40_47 = (unsigned long) &h;
  _17 = (unsigned long) &h;
  _30 = _17 + 96;
...
  h ={v} {CLOBBER};
  ivtmp.30_33 = (unsigned long) &h;
  _41 = (unsigned long) &h;
  _36 = _41 + 96;
and finally dom3's VN which replaces ivtmp.30_33 initializer with ivtmp.40_47
and _36 with _30.
If what the cfg expand var partition code is as designed (I think other passes do it too, e.g. compute_live_vars/live_vars_at_stmt relies on it too), then we need to somehow avoid VN of &var across var ={v} {CLOBBER} stmt, but it isn't really clear to me how.
Unless we change loop unrolling so that the different loop iterations if there is a var clobber in the loop actually have different variables (the first iteration the original var and other iterations that var's copies; perhaps only for addressable vars?).  Then naturally VN couldn't merge those and the RTL partitioning code could decide to put them into the same or different partition and later RTL opts could CSE the addresses.
Comment 6 Richard Biener 2021-11-02 07:17:11 UTC
Isn't this another case of the still unsolved PR90348?
Comment 7 Jakub Jelinek 2021-11-02 07:57:57 UTC
Looks like that.
What do you think about unrolling making variable copies?  We'd need to be sure that the scope of the variable is the loop we are unrolling though (or something nested in it).
Comment 8 rguenther@suse.de 2021-11-02 08:10:57 UTC
On Tue, 2 Nov 2021, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103006
> 
> --- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Looks like that.
> What do you think about unrolling making variable copies?  We'd need to be sure
> that the scope of the variable is the loop we are unrolling though (or
> something nested in it).

Being able to determine that would solve the very issue we're trying
to fix with making the copy.  The problem is that we put in CLOBBERs
based on the original BINDs but later optimizers do not respect the
birth boundary.  If we can figure that out we could ignore the
respective CLOBBERs for the CFG expansion live compute as well.

I think we may be able to compute the CFG SCC a CLOBBER resides in
and in case the CLOBBERed variable is live-in into that SCC we cannot
prune it with that CLOBBER.  Or so.
Comment 9 Jakub Jelinek 2021-11-02 08:20:34 UTC
We don't have that many unrolling passes though, and perhaps we could use compute_live_vars and decide based on that.  Though I guess the addresses of the vars could be even then hoisted before such loops, though it is unclear why it would be done, it can't be VN because the vars don't exist outside of the loop.
Say LIM can do that though...
Comment 10 rguenther@suse.de 2021-11-02 13:55:55 UTC
On Tue, 2 Nov 2021, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103006
> 
> Jakub Jelinek <jakub at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>            Keywords|ra                          |
> 
> --- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> We don't have that many unrolling passes though, and perhaps we could use
> compute_live_vars and decide based on that.  Though I guess the addresses of
> the vars could be even then hoisted before such loops, though it is unclear why
> it would be done, it can't be VN because the vars don't exist outside of the
> loop.
> Say LIM can do that though...

Yes, LIM can do it as can loop header copying or jump threading that
happens to peel an iteration.  Also there's no dataflow barrier that
prevents addresses from being moved so that "no pass does this" isn't
a good excuse.

That said, I think the fix is to the code computing stack slot reuse.
Comment 11 Richard Biener 2022-01-31 10:52:46 UTC
(In reply to rguenther@suse.de from comment #8)
> On Tue, 2 Nov 2021, jakub at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103006
> > 
> > --- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> > Looks like that.
> > What do you think about unrolling making variable copies?  We'd need to be sure
> > that the scope of the variable is the loop we are unrolling though (or
> > something nested in it).
> 
> Being able to determine that would solve the very issue we're trying
> to fix with making the copy.  The problem is that we put in CLOBBERs
> based on the original BINDs but later optimizers do not respect the
> birth boundary.  If we can figure that out we could ignore the
> respective CLOBBERs for the CFG expansion live compute as well.
> 
> I think we may be able to compute the CFG SCC a CLOBBER resides in
> and in case the CLOBBERed variable is live-in into that SCC we cannot
> prune it with that CLOBBER.  Or so.

Tried that but while it solves PR97821 it doesn't fix the case in this bug
because there the CLOBBER that breaks things is not inside a SCC but the
issue in this case is that not all accesses to 'h' also mention h and thus
we miss to make 'h' live again after the CLOBBER.

I also fear the more variables we expose the easier it will be to run into
this issue.

It might be possible to (conservatively) track pointers from ADDR_EXPR
mentions, but that's going to give up in most of the interesting cases
(it has the same issue as ideas how to prevent "leakage" via pointers).

The other idea that we discussed past in time is to perform stack slot sharing
early when the CLOBBERs are still OK to use.  Possibly even w/o CLOBBERs
but with the GIMPLE binds we have even after gimplifying.  We'd make the
"stack slot" sharing explicit by replacing decls assigned to the same
stack slot with either the larges of the decls or anonymous stack memory
(a new decl).  Doing this before inlining is complete will obviously not
catch all important cases.  Doing it after inlining requires being careful
during early optimizations (jump threading is the one transform we do early
that can cause code duplication and followup CSE).
Comment 12 Richard Biener 2022-01-31 13:01:49 UTC
Oh, and I think address-takens are really not an issue but the accesses based on them which confuse the simplistic live analysis to not recognize those as births.

So we _can_ introduce explicit birth operations.  The simplest thing we probably
can do is to add clobbers there and set a special 'birth' flag on them
just for liveness analysis, the rest of the compiler can treat them like clobbers - besides cases where we remove clobbers.  We can't remove a birth without also removing all clobbers of a variable (even in copies of birth-death regions).  It might be tempting to somehow link birth and its clobbers (IIRC with cleanups and
so we can have multiple clobbers for one birth), like via SSA def and uses, but
when we copy a birth that breaks down.  So the alternative is probably to
mark a variable as not to be subject to stack slot sharing when removing a
birth clobber.

The initial birth clobber would be at a more conservative position than
the current way of treating the first mention as birth but we can sink
birth clobbers (even across address takens) and hoist clobbers to shrink
live ranges at some point.

Both birth and clobber act as optimization barrier for loads and stores
of the affected variable, that's good for the purpose but possibly bad
for optimization.  I checked and for example loop store motion does consider
clobbers inside a loop as reason to not optimize.

And with the current scheme we don't even optimize cases like

struct Foo { int i; int j; int a[24]; };

void bar(struct Foo f);

void baz()
{
  struct Foo f, g;
  f.i = 1;
  bar (f);
  g.j = 2;
  bar (g);
}

as nothing hoists the clobbers we only put at the end of the function and
thus f and g appear to conflict (we only use clobbers to compute live,
for not address taken vars we could rely on mentions only).

I don't think we can reasonably fix all of the issue on branches and I
have my doubts for GCC 12.
Comment 13 Richard Biener 2022-02-02 11:44:44 UTC
So I have a patch that adds explicit birth markers (using clobbers specially marked).  That works well sofar but it conflicts with clobbers (not marked as birth) that are added for clobbering at the start of variable lifetime like
C++ does at the beginning of CTORs.  I for example see

  inst ={v} {CLOBBER(birth)};
  inst ={v} {CLOBBER};  (**)
  inst.v = 42;
...
  inst ={v} {CLOBBER};

where (**) is inserted by the C++ frontend (with -flifetime-dse which is
the default).  Indeed my life analysis for the purpose of stack slot
sharing now only relies on the birth/death markers so it gets confused
by the extra clobber.

We now also have some use-after-free diagnostic that would likely trip
over this as it assumes that a CLOBBER ends lifetime of storage.

I guess disentangling both use-cases by also marking the end-of-storage-lifetime
clobbers specially would solve both issues.
Comment 14 Richard Biener 2022-02-04 13:12:02 UTC
There's an interesting case,

  a = BIRTH
loop:
  b = DEATH
  a = DEATH
  b = BIRTH
  goto loop;

where we end up having both a and b in the live-in set at the loop label
but a is removed before we see the BIRTH of b which is where we add
conflicts based on the current set of active vars.

In the case I'm running into this I have tail-recursion do

  a = BIRTH
  b = BIRTH
...
  a = DEATH
  b = DEATH

into

loop:
  a = BIRTH
  b = BIRTH
  goto loop;
  a = DEATH
  b = DEATH

leading to a similar issue.  The issue above can for example arise from
loop rotation.

In all cases live from backedges confuse the "optimization" done to only
record conflicts when we add a var to the live set (and it is not already set).

The previous code had

              /* If this is the first real instruction in this BB we need
                 to add conflicts for everything live at this point now.
                 Unlike classical liveness for named objects we can't
                 rely on seeing a def/use of the names we're interested in.
                 There might merely be indirect loads/stores.  We'd not add any
                 conflicts for such partitions.  */

and the easiest is to do the same here (we don't see the backedge "use"),
but we could possibly improve by noting which vars are only live from
a backedge here.
Comment 15 Richard Biener 2022-05-27 09:46:34 UTC
GCC 9 branch is being closed
Comment 16 Jakub Jelinek 2022-06-28 10:46:52 UTC
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Comment 17 Richard Biener 2023-07-07 10:41:23 UTC
GCC 10 branch is being closed.