Bug 31862 - Loop IM and other optimizations harmful for -fopenmp
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 4.2.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Keywords: missed-optimization, openmp, wrong-code
 
Reported: 2007-05-08 08:59 UTC by Jakub Jelinek
Modified: 2009-02-26 08:26 UTC
8 users

Last reconfirmed: 2009-02-25 13:25:54


Attachments

Description Jakub Jelinek 2007-05-08 08:59:00 UTC
See http://openmp.org/pipermail/omp/2007/000840.html
and the rest of the lengthy threads:
Memory consistency contradiction between 2.5 specification and GCC
OpenMP spec 2.5 seems to have incorrect flush example on page 12
Two simpler examples (Re: OpenMP spec 2.5 seems to have incorrect flush example on page 12)

Some GCC optimizations are harmful for threaded code, e.g. loop invariant motion
of global variables:

int var;
void
foo (int x)
{
  int i;
  for (i = 0; i < 100; i++)
    {
      if (i > x)
        var = i;
    }
}

When some other thread modifies var while foo (200) executes, the compiler has
inserted a race which doesn't exist in the original program: it emits
reg = var; ... var = reg; even though var is never modified.
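An illustrative sketch of the effect (hand-written to mirror the description, not GCC's actual output): foo_after_lim is roughly what store motion turns foo into, with the hoisted load and the now-unconditional store back.

```c
#include <assert.h>

int var;

/* Original: stores to var only on iterations where i > x.  */
void
foo_original (int x)
{
  int i;
  for (i = 0; i < 100; i++)
    if (i > x)
      var = i;
}

/* Roughly what loop-invariant motion produces: var is loaded once
   before the loop and stored back unconditionally after it, even
   when no iteration ever wrote to it.  */
void
foo_after_lim (int x)
{
  int reg = var;   /* hoisted load */
  int i;
  for (i = 0; i < 100; i++)
    if (i > x)
      reg = i;
  var = reg;       /* sunk store, now unconditional */
}
```

For x >= 99 the original never touches var at all, while the transformed version both reads and rewrites it, racing with any concurrent writer.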

Even if we can prove a variable is always written to within a loop, the loop may contain some kind of barrier (a function call which might contain a barrier, __sync_synchronize (), #pragma omp barrier, #pragma omp flush — does a volatile variable read or write count as one too?); if the variable is not written to after the barrier, the compiler is still introducing a race which didn't exist originally.

I wonder if we shouldn't disable some optimizations (like loop IM when the variable is a global) for -fopenmp (or perhaps also add a -fthread-safe option
implied by -fopenmp).  Does if-conversion cause similar problems?  What else?
Comment 1 Diego Novillo 2007-05-08 14:10:15 UTC
Subject: Re:  New: Loop IM and other optimizations harmful for -fopenmp

On 8 May 2007 07:59:03 -0000, jakub at gcc dot gnu dot org
<gcc-bugzilla@gcc.gnu.org> wrote:
> See http://openmp.org/pipermail/omp/2007/000840.html
> and the rest of the lengthy threads:

Yes, I've been following the thread and I agree that we need to do
something to avoid the problem.  The real solution is, unfortunately,
quite a bit of work.

In the meantime, we should probably annotate the shared variables and
consider them volatile.  This should prevent most optimizers from
messing things up inadvertently.

This should be done within OMP regions.  Orphaned functions may become
a problem, so this should be implemented as an IPA pass.
Comment 2 Andrew Pinski 2007-05-08 15:30:44 UTC
WTF, this is just sad: we have to disable optimizations because OpenMP folks don't know how to program threaded code.
Comment 3 Richard Biener 2007-05-08 15:37:02 UTC
OMP is not a good generic programming model for threaded code, exactly because of these issues.
Comment 4 Diego Novillo 2007-05-08 15:39:31 UTC
Subject: Re:  Loop IM and other optimizations harmful for -fopenmp

On 8 May 2007 14:30:45 -0000, pinskia at gcc dot gnu dot org
<gcc-bugzilla@gcc.gnu.org> wrote:

> WTF, this is just sad we have to disable optimizations because openmp folks
> don't know  how to program threaded code.

No need to be insulting.

Another possibility is to require shared variables to be declared
volatile.  It depends on the wording in the standard.
Comment 5 Diego Novillo 2007-05-08 15:44:15 UTC
Subject: Re:  Loop IM and other optimizations harmful for -fopenmp

On 8 May 2007 14:37:05 -0000, rguenth at gcc dot gnu dot org
<gcc-bugzilla@gcc.gnu.org> wrote:

> OMP is not a good generic programming model for threaded code.  Exactly because
> of this issues.

No.  This is wrong.  The model simply requires the compiler to be
smarter.  Sequential compilers are not aware of parallel semantics.

If the compiler was thread-aware, these issues would be transparent to
the user (as they should be).

The original code did not have a race condition.  The compiler
transformations introduced a race-condition.  This *is*  a compiler
bug.
Comment 6 pinskia@gmail.com 2007-05-08 15:59:30 UTC
Subject: Re:  Loop IM and other optimizations harmful for -fopenmp

On 8 May 2007 14:44:16 -0000, dnovillo at acm dot org
<gcc-bugzilla@gcc.gnu.org> wrote:
> The original code did not have a race condition.  The compiler
> transformations introduced a race-condition.  This *is*  a compiler
> bug.

Actually, the original code has a race condition if another thread is
reading from var, acting on that, and writing back.  This is
why threaded programming is considered hard.
Comment 7 Jakub Jelinek 2007-05-08 19:08:26 UTC
This really is not specific to OpenMP; I believe the following is a valid threaded program:
#define _XOPEN_SOURCE 600
#include <pthread.h>
#include <stdlib.h>

int v;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_barrier_t b;

void
foo (int x, int y)
{
  int i;
  for (i = 0; i < 100; i++)
    {
      if (i > x)
        v = y;
    }
}

void *
tf (void *arg)
{
  pthread_barrier_wait (&b);
  if (arg == NULL)
    {
      pthread_mutex_lock (&m);
      if (v != 0)
        abort ();
      foo (50, 10);
      pthread_mutex_unlock (&m);
      pthread_barrier_wait (&b);
      foo (120, 30);
    }
  else
    {
      foo (100, 20);
      pthread_barrier_wait (&b);
      pthread_mutex_lock (&m);
      if (v != 10)
        abort ();
      foo (80, 40);
      if (v != 40)
        abort ();
      pthread_mutex_unlock (&m);
    }
  return NULL;
}

int
main (void)
{
  pthread_t th[2];
  pthread_barrier_init (&b, NULL, 2);
  if (pthread_create (&th[0], NULL, tf, NULL)
      || pthread_create (&th[1], NULL, tf, (void *) 1L))
    return 1;
  pthread_join (th[0], NULL);
  pthread_join (th[1], NULL);
  return 0;
}

At -O0 it works just fine and has no races in it; it is quite easy to show that
the shared variable is only ever read or written inside the critical section.
But loop IM creates a race where there is none in the original code: if I compile this with -O2 (both 4.1.x and trunk) it aborts after a couple of attempts.

That's why I think we should have a generic option that disables optimizations
which are safe only in sequential programs (and -fopenmp would imply that option).
Comment 8 pinskia@gmail.com 2007-05-08 19:45:29 UTC
Subject: Re:  Loop IM and other optimizations harmful for -fopenmp

On 8 May 2007 18:08:26 -0000, jakub at gcc dot gnu dot org
<gcc-bugzilla@gcc.gnu.org> wrote:
> This really is not specific to OpenMP, I believe the following is valid
> threaded program:

This is not a valid program.  You have to introduce mutexes to make it
a valid program with threads.  This is a common misunderstanding
of thread programming.
Comment 9 Andrew Pinski 2007-05-09 03:19:25 UTC
> That's why I think we should have a generic option that disables optimizations
> which are safe only in sequential programs (and -fopenmp would imply that
> option).

So it sounds like you don't know anything about threaded programming.  People have to use mutexes and such to get safe stores to and loads from globals no matter what, even if it looks thread-safe, because of the way threads act.  I don't see how this is different from knowing when programs access memory in some random way.
Comment 10 Jakub Jelinek 2007-05-09 07:30:03 UTC
You are entitled to your opinion.
Comment 11 theodore.papadopoulo 2007-06-04 18:12:33 UTC
(In reply to comment #8)

I suspect (unless I've overlooked something) that this problem causes wrong
statistics when using the -fopenmp and -fprofile-* flags jointly.

I tried that and saw messages saying that the counters are corrupted (negative counters)... I still need to find a small testcase though.

If I'm right, at the very least the flags should error out when used in conjunction (for this I can do a patch, if this is the right solution).

Correcting this case should be easy (multiply the counters by the number of threads and join them at dump time), but that's probably beyond my
gcc hacking capabilities...
Comment 12 Tomash Brechko 2007-10-14 14:15:56 UTC
I wish I could reopen the bug, but the only option I see is "Leave as UNCONFIRMED".

(In reply to comment #9)
> So it sounds like you don't any thing about threading programming.  People have
> to use mutexes and such to get safe code storing/accessing in globals no matter
> what, even if it looks like it is thread safe or not because of the way threads
> act.  I don't see how this is different from knowning when programs access
> memory in some random way.

I would agree with Jakub, his example is perfectly thread-safe.  The point is not about reading memory in random ways, but about random writes.  Consider the following adjustment:

void
foo (int x, int y, int access_v)
{
  int i;
  for (i = 0; i < 100; i++)
    {
      if (access_v && i > x)
        v = y;
    }
}

Now I would say that the call foo(a, b, 0) doesn't require any mutex, no matter what.  The fact that v is mentioned _lexically_ doesn't change anything: for access_v == 0 (or for x >= 99, as in the original example), v = y; is simply dead code.  You don't have to hold a mutex for a variable just because it is _mentioned_ in the code.

And the essence of this bug report is that gcc chooses to unconditionally write to variables that are merely lexically mentioned but otherwise aren't accessed during execution.  This is wrong when foo(a, b, 0) and foo(a, b, 1) calls are mixed, and it may also harm SMP performance when there are only foo(a, b, 0) calls: storing the value back to v may invalidate the cache line if the hardware doesn't detect that the stored value is unchanged (I'm not sure whether there are architectures that do).

Please reconsider this report.
Comment 13 Jorn Wolfgang Rennecke 2007-10-22 23:17:28 UTC
(In reply to comment #0)
> When some other thread modifies var at the same time while foo (200) is
> executed, the compiler inserted a race which doesn't really exist in the
> original program, as it will do reg = var; ... var = reg; even when var
> was never modified.

This is not even a particularly good choice of transformation for performance
reasons: we incur memory read latency when there is no need to.

Better is:

  auto int dummy, *addr;
  
  addr = &dummy;
  if (i > x)
    addr = &var;
  *addr = i;

Or for the entire function:

void
foo (int x)
{
  int dummy;
  int *addr = &dummy;

  if (99 > x)
    addr = &var;
  *addr = 99;
}
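The whole-function variant above can be made compilable; a minimal sketch (the asserted final values follow from the original loop, whose last store, when any happens, is var = 99):

```c
#include <assert.h>

int var;

/* The dummy-address variant: the single unconditional store is
   redirected through a pointer, so the memory holding var is written
   only when the original loop would have written it (x < 99), and the
   last store of that loop wins (i == 99).  */
void
foo (int x)
{
  int dummy;
  int *addr = &dummy;

  if (99 > x)
    addr = &var;
  *addr = 99;
}
```

Because var is only stored through addr when the original program would have stored it, no write is introduced on the x >= 99 path and no race is created.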
Comment 14 Michael Matz 2007-11-01 03:15:08 UTC
I've checked in a patch for PR33961, which is similar to this one.  Can
somebody check if anything here is still broken with trunk?
Comment 15 Jakub Jelinek 2007-11-12 19:11:34 UTC
Sorry for the delay.

Unfortunately that doesn't help in this case:
on foo from comment #7 it is not the cselim pass but lim, which changes:
foo (x, y)
{
  int i;

<bb 2>:

<bb 3>:
  # i_12 = PHI <i_5(6), 0(2)>
  if (x_3(D) < i_12)
    goto <bb 4>;
  else
    goto <bb 5>;

<bb 4>:
  v = y_4(D);

<bb 5>:
  i_5 = i_12 + 1;
  if (i_5 <= 99)
    goto <bb 6>;
  else
    goto <bb 7>;

<bb 6>:
  goto <bb 3>;

<bb 7>:
  return;

}

into:

foo (x, y)
{
  int v_lsm.12;
  int i;

<bb 2>:
  v_lsm.12_11 = v;

<bb 3>:
  # v_lsm.12_1 = PHI <v_lsm.12_7(6), v_lsm.12_11(2)>
  # i_12 = PHI <i_5(6), 0(2)>
  if (x_3(D) < i_12)
    goto <bb 4>;
  else
    goto <bb 5>;

<bb 4>:
  v_lsm.12_10 = y_4(D);

<bb 5>:
  # v_lsm.12_7 = PHI <v_lsm.12_1(3), v_lsm.12_10(4)>
  i_5 = i_12 + 1;
  if (i_5 <= 99)
    goto <bb 6>;
  else
    goto <bb 7>;

<bb 6>:
  goto <bb 3>;

<bb 7>:
  # v_lsm.12_15 = PHI <v_lsm.12_7(5)>
  v = v_lsm.12_15;
  return;

}

which is unsafe, as there is no guarantee v is ever written within the loop.
Comment 16 Paolo Bonzini 2009-02-25 12:18:18 UTC
The upcoming C++0x memory model forbids this; see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2338.html (Concurrency memory model compiler consequences).  But it says that this is acceptable instead:

  tmp = var;
  modified = false;
  for (i = 0; i < 100; i++)
    {
      if (i > x)
        tmp = i, modified = true;
    }
  if (modified)
    var = tmp;

I think the bug can be confirmed, can't it?
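A compilable sketch of that flag-guarded form, wrapped in a function for illustration (the wrapper is an assumption; the body follows the snippet above):

```c
#include <assert.h>
#include <stdbool.h>

int var;

/* Flag-guarded store motion as permitted by the C++0x memory model:
   the candidate value is kept in tmp, and the store back to var
   happens only if some iteration actually assigned it.  */
void
foo (int x)
{
  int tmp = var;
  bool modified = false;
  int i;
  for (i = 0; i < 100; i++)
    if (i > x)
      {
        tmp = i;
        modified = true;
      }
  if (modified)
    var = tmp;
}
```

When no iteration stores (x >= 99), var is read but never written back, so the write-side race of plain store motion is gone; the remaining load is what later comments argue is itself avoidable here.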
Comment 17 Richard Biener 2009-02-25 13:25:54 UTC
Confirmed.
Comment 18 davids 2009-02-25 16:06:55 UTC
This is a real bug. There is simply no way to write correct threaded code if the compiler is free to read a variable and write it back when the code didn't specifically tell it to.

Optimizations on threaded code cannot add a write to a variable unless they can prove no other thread could know the location of the variable. (It's local, its address has never been passed to a function, and so on.)

In any event, the last time I looked into this, it seemed that such 'optimizations' were typically pessimizations on multi-threaded code anyway, as they added expensive cache misses.
Comment 19 dnovillo@google.com 2009-02-25 16:12:41 UTC
Subject: Re:  Loop IM and other optimizations harmful 
	for -fopenmp

On Wed, Feb 25, 2009 at 11:06, davids at webmaster dot com
<gcc-bugzilla@gcc.gnu.org> wrote:
> This is a real bug. There is simply no way to write correct threaded code if
> the compiler is free to read a variable and write it back when the code didn't
> specifically tell it to.

Yes.  Unless we build an actual concurrent data flow and adapt the
optimizers for these programs, we will never get it right.  The best
approach in these cases is to tell the optimizers to back off
completely.  After all, the code has completely different semantics
from what they are ready to handle.


Diego.
Comment 20 davids 2009-02-25 17:53:54 UTC
I don't like this either:

  tmp = var;
  modified = false;
  for (i = 0; i < 100; i++)
    {
      if (i > x)
        tmp = i, modified = true;
    }
  if (modified)
    var = tmp;

This can be a pessimization as well. If 'modified' is almost always false, the 'tmp = var;' can force a cache unshare for no reason. If this code is not performance critical, the optimization is pointless. If it is, the pessimization can be significant.

IMO, these kinds of optimizations are not worth doing. Almost any optimization that can introduce an additional memory access risks a net performance loss in some use cases. The compiler cannot determine the use case by static code inspection.

Now, in this case, the 'tmp = var;' is obviously superfluous, you can just eliminate it, so this isn't a good example of what I'm talking about. But in general, an optimization should not *add* a memory operation the code did not request unless you can show that it will always remove at least one operation of comparable cost. Otherwise it can be a consistent net performance loss in some use cases.
Comment 21 Michael Matz 2009-02-25 18:04:15 UTC
The question of whether such a transformation is worthwhile is independent of
the issue at hand (which is about its validity in light of the new
memory model).  If you find a clearly pessimizing but valid transformation,
create a bug report.
Comment 22 Andrew Pinski 2009-02-25 18:07:22 UTC
Really, to me this is still a valid transformation, even inside threads.  If you want to access a variable from different threads, use locks, heavyweight or lightweight ones.
Comment 23 davids 2009-02-25 18:35:18 UTC
"Really to me this is still a valid transformation even in the inside threads. 
If you want to be able to access via different threads, use locks, heavy or
light weight ones."

Absolutely: you do use locks everywhere you write to a variable that might be read by another thread. The problem is that the compiler introduces writes where you *don't* write to a variable.

It is simply impossible to use locks if the compiler can add a write where you didn't expect one.

The classic example:

if(can_write) acquire_lock();
for(int i=0; i<100; i++)
{
 some_stuff(i);
 if(can_write) shared_variable+=i;
}
if(can_write) release_lock();

Here this code does acquire a lock if it plans to modify the shared variable. But the optimizations discussed will break this code.

Also, you can have the same problem with this kind of code without threads. Imagine, for example, if the 'shared_variable' may be in read-only memory and 'can_write' indicates this case.
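The classic example above can be rendered as a self-contained compilable unit; the lock primitives here are stubs standing in for real ones, and single-threaded only the sequential semantics are checked:

```c
#include <assert.h>
#include <stdbool.h>

int shared_variable;

/* Stubs standing in for real lock primitives, so the pattern compiles
   stand-alone for illustration.  */
static void acquire_lock (void) {}
static void release_lock (void) {}
static void some_stuff (int i) { (void) i; }

void
run (bool can_write)
{
  if (can_write) acquire_lock ();
  for (int i = 0; i < 100; i++)
    {
      some_stuff (i);
      if (can_write) shared_variable += i;
    }
  if (can_write) release_lock ();
}
```

The point of the pattern: with can_write false, the code takes no lock and performs no store to shared_variable, so the compiler must not introduce one on that path.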
Comment 24 Paolo Bonzini 2009-02-25 18:43:03 UTC
Andrew, your comments #6, #8 and #9 clearly show that you haven't understood the issue and are just talking past others.

On the other hand, the transformation has been shown to be an optimization in single-threaded cases; if it is bad for multiprocessors, it probably should be guarded by a flag (but what if the load is not necessary?  Should it use the same flag, a separate flag, or be always enabled?).  In any case the IM code should be modified to use the flag as suggested by the C++ standard (and to avoid the load when it is not necessary, as it is in this case but not in the case explained by N2338).
Comment 25 Eric Botcazou 2009-02-26 08:26:03 UTC
> Also, you can have the same problem with this kind of code without threads.
> Imagine, for example, if the 'shared_variable' may be in read-only memory and
> 'can_write' indicates this case.

Then it must be declared 'volatile'.  This optimization is valid in ISO C if the
variable is not declared so.  Of course that's orthogonal to this PR.