Bug 52558 - write introduction incorrect wrt the C++11 memory model
Summary: write introduction incorrect wrt the C++11 memory model
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.7.0
: P3 normal
Target Milestone: ---
Assignee: Aldy Hernandez
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-03-12 10:25 UTC by Francesco Zappa Nardelli
Modified: 2012-07-02 18:25 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-03-12 00:00:00


Attachments
Testcase for simulate-threads (628 bytes, patch)
2012-03-12 15:24 UTC, Andrew Macleod
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Francesco Zappa Nardelli 2012-03-12 10:25:55 UTC
The program below:

int g_1 = 1;
int g_2 = 0;

int func_1(void) {
int l;
for (l = 0; (l != 4); l++) {
  if (g_1)
    return l;
  for (g_2 = 0; (g_2 >= 26); ++g_2)
    ;
}
}
int main (int argc, char* argv[]) {
func_1();
}

is miscompiled by 

  gcc -v :  gcc version 4.7.0 20120215 (experimental) (GCC)  

(a recent svn snapshot) on x86-64 when -O2 is passed.

Observe that the inner loop of func_1 is never executed, and this program should never perform any read/write to g_2.  This means that func_1 might be executed in a thread in parallel with another thread that performs:

 g_2 = 42;
 printf ("%d",g_2)

The resulting system is data-race free and the only value that should be printed is 42.

However gcc -O2 generates the following x86-64 assembler for func_1:

func_1:
      movl    g_1(%rip), %edx
      movl    g_2(%rip), %eax
      testl   %edx, %edx
      jne     .L2
      movl    $0, g_2(%rip)
      ret
.L2:
      movl    %eax, g_2(%rip)
      xorl    %eax, %eax
      ret

and this code always performs a write to g_2.  If this asm code runs in parallel with "g_2 = 42; printf g_2", then the system might also print 0: this behaviour is introduced by the compiler and should not have happened.

The command line to generate the assembler above is:

$ g++ -std=c++11 read_and_write_introduced.c -O2 -S

It might be the case that in the C++11 memory model it is safe for the compiler to introduce a write provided that there is an earlier write to the same location, but this testcase shows that introducing a write is unsafe whenever there are no previous writes.

I labelled this as g++, but the bug will also affect the (future) c11 memory model.

For reference:

$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/home/riob/zappanar/tools/gcc-bin/bin/../libexec/gcc/x86_64-unknown-linux-gnu/4.7.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../gcc-src/configure --prefix=/home/zappanar/tools/gcc-bin --with-gmp=/home/zappanar/tools/lib/ --with-mpfr=/home/zappanar/tools/lib/ --with-mpc=/home/zappanar/tools/lib/ --enable-languages=c,c++
Thread model: posix
gcc version 4.7.0 20120215 (experimental) (GCC)
Comment 1 Aldy Hernandez 2012-03-12 14:52:57 UTC
Richi, is this something that should also be fixed for 4.7 as well?  There is a write to g_2 that is introduced on paths that did not have it.  So this is not just a load/load data race.
Comment 2 Richard Biener 2012-03-12 15:01:36 UTC
(In reply to comment #1)
> Richi, is this something that should also be fixed for 4.7 as well?  There is a
> write to g_2 that is introduced on paths that did not have it.  So this is not
> just a load/load data race.

No, we don't want to fix this for 4.7 as this is not a regression.

Yes, LIM only avoids introducing traps, not data-races.  This was discussed
in the past already, btw, and we do not want to generally disallow this
optimization.  [The C++ memory model is stupid here, it should not treat
every variable raceable but only specially marked ones, oh well ...]

There will be very many other passes that are affected by this, and even more
very many passes that will be affected by load data-races.

You will for example slow down SPEC CPU 2006 quite a bit (though technically
it does not include C++11 benchmarks).
Comment 3 Andrew Macleod 2012-03-12 15:24:35 UTC
Created attachment 26881 [details]
Testcase for simulate-threads

I've modified the testcase so that it runs in gcc.dg/simulate-thread as a pass/fail. 

Clearly it currently fails, although its interesting because at -O3 it passes again!!  The second pass of DOM undoes the extra write!

It does fail the testsuite at -O2 as expected.
This diff can be applied to add it to simulate threads.
Comment 4 Aldy Hernandez 2012-03-12 15:29:06 UTC
> No, we don't want to fix this for 4.7 as this is not a regression.
> 
> Yes, LIM only avoids introducing traps, not data-races.  This was discussed
> in the past already, btw, and we do not want to generally disallow this
> optimization.  [The C++ memory model is stupid here, it should not treat
> every variable raceable but only specially marked ones, oh well ...]
> 
> There will be very many other passes that are affected by this, and even more
> very many passes that will be affected by load data-races.
> 
> You will for example slow down SPEC CPU 2006 quite a bit (though technically
> it does not include C++11 benchmarks).

I thought we ignored *load* data races, but still cared about introducing write data races.  This test case has both.  I don't understand why we would allow introducing writes on paths that did not have it, but I will defer to you.
Comment 5 rguenther@suse.de 2012-03-12 15:32:48 UTC
On Mon, 12 Mar 2012, aldyh at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52558
> 
> --- Comment #4 from Aldy Hernandez <aldyh at gcc dot gnu.org> 2012-03-12 15:29:06 UTC ---
> 
> > No, we don't want to fix this for 4.7 as this is not a regression.
> > 
> > Yes, LIM only avoids introducing traps, not data-races.  This was discussed
> > in the past already, btw, and we do not want to generally disallow this
> > optimization.  [The C++ memory model is stupid here, it should not treat
> > every variable raceable but only specially marked ones, oh well ...]
> > 
> > There will be very many other passes that are affected by this, and even more
> > very many passes that will be affected by load data-races.
> > 
> > You will for example slow down SPEC CPU 2006 quite a bit (though technically
> > it does not include C++11 benchmarks).
> 
> I thought we ignored *load* data races, but still cared about introducing write
> data races.  This test case has both.  I don't understand why we would allow
> introducing writes on paths that did not have it, but I will defer to you.

Why should we avoid them if we know they cannot cause problems?  This
will happen for every loop where we do not know if it iterates at least
once.  Store-motion is a very important optimization.

Richard.
Comment 6 Aldy Hernandez 2012-03-12 15:42:45 UTC
On 03/12/12 10:32, rguenther at suse dot de wrote:
es, but still cared about introducing write
>> data races.  This test case has both.  I don't understand why we would allow
>> introducing writes on paths that did not have it, but I will defer to you.
>
> Why should we avoid them if we know they cannot cause problems?  This
> will happen for every loop where we do not know if it iterates at least
> once.  Store-motion is a very important optimization.
>
> Richard.
>

They won't cause problems in a single threaded environment, but will 
cause problems in a multi threaded environment.  Even if you're writing 
the value g_2 originally had, another thread may have written to g_2 
right before.

Just to get this straight, am I to assume that the default code 
generation for GCC is a single threaded environment?  I just want to 
make sure I get these variants right in my head, and can properly 
separate what is and is not allowed in default GCC, and what only 
pertains to the C11 memory model (or transactional stuff).

Thanks.
Comment 7 rguenther@suse.de 2012-03-12 15:45:39 UTC
On Mon, 12 Mar 2012, aldyh at redhat dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52558
> 
> --- Comment #6 from Aldy Hernandez <aldyh at redhat dot com> 2012-03-12 15:42:45 UTC ---
> On 03/12/12 10:32, rguenther at suse dot de wrote:
> es, but still cared about introducing write
> >> data races.  This test case has both.  I don't understand why we would allow
> >> introducing writes on paths that did not have it, but I will defer to you.
> >
> > Why should we avoid them if we know they cannot cause problems?  This
> > will happen for every loop where we do not know if it iterates at least
> > once.  Store-motion is a very important optimization.
> >
> > Richard.
> >
> 
> They won't cause problems in a single threaded environment, but will 
> cause problems in a multi threaded environment.  Even if you're writing 
> the value g_2 originally had, another thread may have written to g_2 
> right before.

Yes, that's clear to me.

> Just to get this straight, am I to assume that the default code 
> generation for GCC is a single threaded environment?  I just want to 
> make sure I get these variants right in my head, and can properly 
> separate what is and is not allowed in default GCC, and what only 
> pertains to the C11 memory model (or transactional stuff).

It certainly is, though we are getting more careful about this stuff
in recent years and generally only read data-races are ok.

Richard.
Comment 8 Andrew Macleod 2012-03-12 15:50:13 UTC
We can still perform store motion out of a loop, we just can't put the store on
a path which is executed if the loop isn't executed.

In this case, we actually made the code *slower*.  Before LIM, there was a load
of g1, a compare and return.
        movl    g_1(%rip), %edx
        xorl    %eax, %eax
        testl   %edx, %edx
        jne     .L1
.L4:
        addl    $1, %eax
        movl    $0, g_2(%rip)
        cmpl    $4, %eax
        jne     .L4
.L1:
        rep
        ret


  LIM makes it have a load of g_1, a load of g_2 and a store to g_2 before
returning.

       .cfi_startproc
        movl    g_1(%rip), %edx
        movl    g_2(%rip), %eax
        testl   %edx, %edx
        jne     .L2
        movl    $0, g_2(%rip)
        ret
.L2:
        movl    %eax, g_2(%rip)
        xorl    %eax, %eax
        ret



-O3 corrects this mistake and returns it to the more optimal results.


I would argue this testcase shows LIM actually making the code worse in this
case as well.
Comment 9 rguenther@suse.de 2012-03-12 15:55:27 UTC
On Mon, 12 Mar 2012, amacleod at redhat dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52558
> 
> --- Comment #8 from Andrew Macleod <amacleod at redhat dot com> 2012-03-12 15:50:13 UTC ---
> We can still perform store motion out of a loop, we just can't put the store on
> a path which is executed if the loop isn't executed.
> 
> In this case, we actually made the code *slower*.  Before LIM, there was a load
> of g1, a compare and return.
>         movl    g_1(%rip), %edx
>         xorl    %eax, %eax
>         testl   %edx, %edx
>         jne     .L1
> .L4:
>         addl    $1, %eax
>         movl    $0, g_2(%rip)
>         cmpl    $4, %eax
>         jne     .L4
> .L1:
>         rep
>         ret
> 
> 
>   LIM makes it have a load of g_1, a load of g_2 and a store to g_2 before
> returning.
> 
>        .cfi_startproc
>         movl    g_1(%rip), %edx
>         movl    g_2(%rip), %eax
>         testl   %edx, %edx
>         jne     .L2
>         movl    $0, g_2(%rip)
>         ret
> .L2:
>         movl    %eax, g_2(%rip)
>         xorl    %eax, %eax
>         ret
> 
> 
> 
> -O3 corrects this mistake and returns it to the more optimal results.
> 
> 
> I would argue this testcase shows LIM actually making the code worse in this
> case as well.

Usually loops are executed ;)  At least that is what we assume if
we don't know better.  But yes, splitting the exit block(s)
is a solution, so, if you fix this please go this way.

Richard.
Comment 10 Aldy Hernandez 2012-03-12 15:56:30 UTC
On 03/12/12 10:45, rguenther at suse dot de wrote:

>> Just to get this straight, am I to assume that the default code
>> generation for GCC is a single threaded environment?  I just want to
>> make sure I get these variants right in my head, and can properly
>> separate what is and is not allowed in default GCC, and what only
>> pertains to the C11 memory model (or transactional stuff).
>
> It certainly is, though we are getting more careful about this stuff
> in recent years and generally only read data-races are ok.
>
> Richard.
>

Ah, ok.  Now it's clear.  I was confused because in a previous thread 
months ago you had mentioned that read data-races were ok, but write 
data-races were not, whereas here we are even allowing write data races.

Understood.

Thanks.
Comment 11 Francesco Zappa Nardelli 2012-03-13 07:29:51 UTC
Just one remark: in this case the write introduction is incorrect wrt the C++11 memory model because there are no previous write to the same location.  If there had been a previous write to the same location, then the discriminating context would have been racy and the code generated by -O2 sound.

I believe that the above argument generalises to all write introductions but I don't yet have a proof of this, so take it with a pinch of salt.
Comment 12 torvald 2012-03-13 21:17:48 UTC
(In reply to comment #11)
> Just one remark: in this case the write introduction is incorrect wrt the C++11
> memory model because there are no previous write to the same location.  If
> there had been a previous write to the same location, then the discriminating
> context would have been racy and the code generated by -O2 sound.
> 
> I believe that the above argument generalises to all write introductions but I
> don't yet have a proof of this, so take it with a pinch of salt.

We follow a similar line of reasoning when thinking about which transformations are allowed in transactional code (or other concurrent code).  In principle, this also applies to loads to some extent even though racy loads are supposed to be okay for as--if as long as a value obtained from a racy read is never used.  Having a more precise yet practical understanding of this would certainly help.
Comment 13 Aldy Hernandez 2012-03-28 19:04:35 UTC
mine
Comment 14 Aldy Hernandez 2012-04-10 17:07:22 UTC
Richard G., or perhaps another aliasing expert.  I am working on a patch for this problem.  Could you pontificate as to why no optimization pass has been able to figure out that g_2_lsm.6_12 == g_2 below?

  # VUSE <.MEM_9(D)>
  g_2_lsm.6_12 = g_2;            <-- g_2_lsm set to g_2
  if (pretmp.4_1 != 0)
    goto <bb 3>;
  else
    goto <bb 5>;

<bb 3>:
  # VUSE <.MEM_9(D)>
  D.1883_17 = g_2;
  if (g_2_lsm.6_12 != D.1883_17)    <-- g_2_lsm compared with g_2
    goto <bb 4>;

Why can't anyone figure out that g_2_lsm is g_2?  Am I building the conditions and stores incorrectly, is there a missing annotation, or is something else amok here?

Thanks.
Comment 15 Richard Biener 2012-04-11 08:15:40 UTC
(In reply to comment #14)
> Richard G., or perhaps another aliasing expert.  I am working on a patch for
> this problem.  Could you pontificate as to why no optimization pass has been
> able to figure out that g_2_lsm.6_12 == g_2 below?
> 
>   # VUSE <.MEM_9(D)>
>   g_2_lsm.6_12 = g_2;            <-- g_2_lsm set to g_2
>   if (pretmp.4_1 != 0)
>     goto <bb 3>;
>   else
>     goto <bb 5>;
> 
> <bb 3>:
>   # VUSE <.MEM_9(D)>
>   D.1883_17 = g_2;
>   if (g_2_lsm.6_12 != D.1883_17)    <-- g_2_lsm compared with g_2
>     goto <bb 4>;
> 
> Why can't anyone figure out that g_2_lsm is g_2?  Am I building the conditions
> and stores incorrectly, is there a missing annotation, or is something else
> amok here?

Both value-numbering (FRE/PRE, that do not run after store motion :/) and
DOM should figure this out.  DOM only in theory, but at least in this simple
case it should figure it out.  Do you have a testcase that does not require
your patches?

> Thanks.
Comment 16 Aldy Hernandez 2012-04-11 13:37:17 UTC
(In reply to comment #15)

> Both value-numbering (FRE/PRE, that do not run after store motion :/) and
> DOM should figure this out.  DOM only in theory, but at least in this simple
> case it should figure it out.  Do you have a testcase that does not require
> your patches?

On a pristine trunk I don't have the exact problem (since the equality in comment 14 was created by my WIP), but I do have a similar problem where no optimization can figure out that g_2_lsm == g_2.

Here is a variation of the PR testcase:

int g_1 = 1;
int g_2 = 0;

int func_1(void)
{
  int l;
  for (l = 0; l < 1234; l++)
  { 
    if (g_1)
      return l;
    else
      g_2 = 0;
  }
  return 999;
}

On pristine trunk, we get:

  # VUSE <.MEM_9(D)>
  g_2_lsm.6_12 = g_2;           <-- g_2_lsm = g_2
  if (pretmp.4_1 != 0)
    goto <bb 4>;
  else
    goto <bb 3>;

<bb 3>:
  # .MEM_8 = VDEF <.MEM_9(D)>
  g_2 = 0;
  goto <bb 5>;

<bb 4>:
  # .MEM_16 = VDEF <.MEM_9(D)>
  g_2 = g_2_lsm.6_12;           <-- g_2_lsm == g_2!!

This may actually be the problem in this PR.  If we could figure out that g_2_lsm == g_2, there would be no write to g_2 on the g_1!=0 arm of the conditional.

Should DOM be able to figure out that g_2_lsm == g_2 in this case as well?

Thanks.
Comment 17 rguenther@suse.de 2012-04-11 14:07:27 UTC
On Wed, 11 Apr 2012, aldyh at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52558
> 
> --- Comment #16 from Aldy Hernandez <aldyh at gcc dot gnu.org> 2012-04-11 13:37:17 UTC ---
> (In reply to comment #15)
> 
> > Both value-numbering (FRE/PRE, that do not run after store motion :/) and
> > DOM should figure this out.  DOM only in theory, but at least in this simple
> > case it should figure it out.  Do you have a testcase that does not require
> > your patches?
> 
> On a pristine trunk I don't have the exact problem (since the equality in
> comment 14 was created by my WIP), but I do have a similar problem where no
> optimization can figure out that g_2_lsm == g_2.
> 
> Here is a variation of the PR testcase:
> 
> int g_1 = 1;
> int g_2 = 0;
> 
> int func_1(void)
> {
>   int l;
>   for (l = 0; l < 1234; l++)
>   { 
>     if (g_1)
>       return l;
>     else
>       g_2 = 0;
>   }
>   return 999;
> }
> 
> On pristine trunk, we get:
> 
>   # VUSE <.MEM_9(D)>
>   g_2_lsm.6_12 = g_2;           <-- g_2_lsm = g_2
>   if (pretmp.4_1 != 0)
>     goto <bb 4>;
>   else
>     goto <bb 3>;
> 
> <bb 3>:
>   # .MEM_8 = VDEF <.MEM_9(D)>
>   g_2 = 0;
>   goto <bb 5>;
> 
> <bb 4>:
>   # .MEM_16 = VDEF <.MEM_9(D)>
>   g_2 = g_2_lsm.6_12;           <-- g_2_lsm == g_2!!
> 
> This may actually be the problem in this PR.  If we could figure out that
> g_2_lsm == g_2, there would be no write to g_2 on the g_1!=0 arm of the
> conditional.
> 
> Should DOM be able to figure out that g_2_lsm == g_2 in this case as well?

Ah, so its about the redundant store.  I only recently added the
ability to DOM to remove redundant stores, so it might not yet be
perfect.  And in this case it is because the store is still in a
loop, thus a dominator based approach does not really work well.
Comment 18 Aldy Hernandez 2012-05-31 19:46:49 UTC
Author: aldyh
Date: Thu May 31 19:46:43 2012
New Revision: 188081

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=188081
Log:
        PR tree-optimization/52558
        * cfg.c (alloc_aux_for_edge): Fix comment.
        (alloc_aux_for_edge): Remove static.
        * basic-block.h (alloc_aux_for_edge): Protoize.
        * tree-ssa-loop-im.c (execute_sm_if_changed): New.
        (execute_sm_if_changed_flag): New.
        (execute_sm_if_changed_flag_set): New.
        (execute_sm): Do not generate data races unless requested.
        (tree_ssa_lim_initialize): Call alloc_aux_for_edges.
        (tree_ssa_lim_finalize): Call free_aux_for_edges.
        * gimple.h (block_in_transaction): New.
        (gimple_in_transaction): Use block_in_transaction.


Added:
    trunk/gcc/testsuite/gcc.dg/pr52558-1.c
    trunk/gcc/testsuite/gcc.dg/pr52558-2.c
    trunk/gcc/testsuite/gcc.dg/tm/reg-promotion.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/basic-block.h
    trunk/gcc/cfg.c
    trunk/gcc/gimple.h
    trunk/gcc/tree-ssa-loop-im.c
Comment 19 Aldy Hernandez 2012-05-31 19:55:17 UTC
Fixed on mainline.  I doubt this will be fixed on 4.7, as this is too intrusive-- unless one of the RMs really wants it on 4.7.
Comment 20 torvald 2012-05-31 20:23:51 UTC
(In reply to comment #19)
> Fixed on mainline.  I doubt this will be fixed on 4.7, as this is too
> intrusive-- unless one of the RMs really wants it on 4.7.

OTOH, it would be good if we could avoid those races in 4.7 too.  Can you gather the RMs' opinions please?
Comment 21 Aldy Hernandez 2012-06-14 19:22:54 UTC
Author: aldyh
Date: Thu Jun 14 19:22:48 2012
New Revision: 188631

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=188631
Log:
        PR tree-optimization/52558
        Backport from mainline:
        2012-05-21  Aldy Hernandez  <aldyh@redhat.com>
        * gimple.h (gimple_set_in_transaction): Remove.
        (gimple_in_transaction): Look in BB instead.
        (gimple_statement_base): Remove in_transaction field.
        * basic-block.h (enum bb_flags): Add BB_IN_TRANSACTION.
        * trans-mem.c (compute_transaction_bits): Place transaction bit
        information into basic blocks.
        2012-05-31  Aldy Hernandez  <aldyh@redhat.com>
        PR tree-optimization/52558
        * cfg.c (alloc_aux_for_edge): Fix comment.
        (alloc_aux_for_edge): Remove static.
        * basic-block.h (alloc_aux_for_edge): Protoize.
        * tree-ssa-loop-im.c (execute_sm_if_changed): New.
        (execute_sm_if_changed_flag): New.
        (execute_sm_if_changed_flag_set): New.
        (execute_sm): Do not generate data races unless requested.
        (tree_ssa_lim_initialize): Call alloc_aux_for_edges.
        (tree_ssa_lim_finalize): Call free_aux_for_edges.
        * gimple.h (block_in_transaction): New.
        (gimple_in_transaction): Use block_in_transaction.


Added:
    branches/gcc-4_7-branch/gcc/testsuite/gcc.dg/pr52558-1.c
    branches/gcc-4_7-branch/gcc/testsuite/gcc.dg/pr52558-2.c
    branches/gcc-4_7-branch/gcc/testsuite/gcc.dg/tm/reg-promotion.c
Modified:
    branches/gcc-4_7-branch/gcc/ChangeLog
    branches/gcc-4_7-branch/gcc/basic-block.h
    branches/gcc-4_7-branch/gcc/cfg.c
    branches/gcc-4_7-branch/gcc/gimple.h
    branches/gcc-4_7-branch/gcc/trans-mem.c
    branches/gcc-4_7-branch/gcc/tree-ssa-loop-im.c
Comment 22 Aldy Hernandez 2012-06-14 21:15:10 UTC
FYI, backported to 4.7.
Comment 23 Aldy Hernandez 2012-07-02 18:25:42 UTC
BTW, for anyone still stuck on this, you need to use the following when compiling the testcase:

--param allow-store-data-races=0

Eventually, this will be the default for C++1x, or the C++ memory model.