This is GCC Bugzilla
This is GCC Bugzilla Version 2.20+
View Bug Activity | Format For Printing | Clone This Bug
See http://openmp.org/pipermail/omp/2007/000840.html and the rest of the lengthy threads: Memory consistency contradiction between 2.5 specification and GCC OpenMP spec 2.5 seems to have incorrect flush example on page 12 Two simpler examples (Re: OpenMP spec 2.5 seems to have incorrect flush example on page 12) Some GCC optimizations are harmful for threaded code, e.g. loop invariant motion of global variables: int var; void foo (int x) { int i; for (i = 0; i < 100; i++) { if (i > x) var = i; } } When some other thread modifies var at the same time while foo (200) is executed, the compiler inserted a race which doesn't really exist in the original program, as it will do reg = var; ... var = reg; even when var was never modified. Even if we prove a variable is always written to within a loop, but there is some kind of barrier (a function call which might contain a barrier, __sync_synchronize (), #pragma omp barrier, #pragma omp flush (does a volatile var read resp. write count as one too?)), if the variable is not written to after the barrier, the compiler is still introducing a race which didn't exist originally. I wonder if we shouldn't disable some optimizations (like loop IM if the var is a global variable) for -fopenmp (or perhaps add also a -fthread-safe option implied by -fopenmp). Does if conversion cause similar problems? What else?
Subject: Re: New: Loop IM and other optimizations harmful for -fopenmp On 8 May 2007 07:59:03 -0000, jakub at gcc dot gnu dot org <gcc-bugzilla@gcc.gnu.org> wrote: > See http://openmp.org/pipermail/omp/2007/000840.html > and the rest of the lengthy threads: Yes, I've been following the thread and I agree that we need to do something to avoid the problem. The real solution is, unfortunately, quite a bit of work. In the meantime, we should probably annotate the shared variables and consider them volatile. This should prevent most optimizers from messing things up inadvertently. This should be done within OMP regions. Orphaned functions may become a problem, so this should be implemented as an IPA pass.
WTF, this is just sad we have to disable optimizations because openmp folks don't know how to program threaded code.
OMP is not a good generic programming model for threaded code. Exactly because of this issues.
Subject: Re: Loop IM and other optimizations harmful for -fopenmp On 8 May 2007 14:30:45 -0000, pinskia at gcc dot gnu dot org <gcc-bugzilla@gcc.gnu.org> wrote: > ------- Comment #2 from pinskia at gcc dot gnu dot org 2007-05-08 15:30 ------- > WTF, this is just sad we have to disable optimizations because openmp folks > don't know how to program threaded code. No need to be insulting. Another possibility is to require shared variables to be declared volatile. It depends on the wording in the standard.
Subject: Re: Loop IM and other optimizations harmful for -fopenmp On 8 May 2007 14:37:05 -0000, rguenth at gcc dot gnu dot org <gcc-bugzilla@gcc.gnu.org> wrote: > OMP is not a good generic programming model for threaded code. Exactly because > of this issues. No. This is wrong. The model simply requires the compiler to be smarter. Sequential compilers are not aware of parallel semantics. If the compiler was thread-aware, these issues would be transparent to the user (as they should be). The original code did not have a race condition. The compiler transformations introduced a race-condition. This *is* a compiler bug.
Subject: Re: Loop IM and other optimizations harmful for -fopenmp On 8 May 2007 14:44:16 -0000, dnovillo at acm dot org <gcc-bugzilla@gcc.gnu.org> wrote: > The original code did not have a race condition. The compiler > transformations introduced a race-condition. This *is* a compiler > bug. Actually the original code has a race condition, if another thread is reading from var and then acting on that and writting back. This is why threading programming is considered hard.
This really is not specific to OpenMP, I believe the following is valid threaded program: #define _XOPEN_SOURCE 600 #include <pthread.h> #include <stdlib.h> int v; pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER; pthread_barrier_t b; void foo (int x, int y) { int i; for (i = 0; i < 100; i++) { if (i > x) v = y; } } void * tf (void *arg) { pthread_barrier_wait (&b); if (arg == NULL) { pthread_mutex_lock (&m); if (v != 0) abort (); foo (50, 10); pthread_mutex_unlock (&m); pthread_barrier_wait (&b); foo (120, 30); } else { foo (100, 20); pthread_barrier_wait (&b); pthread_mutex_lock (&m); if (v != 10) abort (); foo (80, 40); if (v != 40) abort (); pthread_mutex_unlock (&m); } return NULL; } int main (void) { pthread_t th[2]; pthread_barrier_init (&b, NULL, 2); if (pthread_create (&th[0], NULL, tf, NULL) || pthread_create (&th[1], NULL, tf, (void *) 1L)) return 1; pthread_join (th[0], NULL); pthread_join (th[1], NULL); return 0; } and at -O0 works just fine and has no races in it, it is quite easy to show the shared variable is only ever read or written inside of the critical section. But loop IM creates a race even when there is none in the code originally, if I compile this with -O2 (both 4.1.x and trunk) it will abort after a couple of attempts. That's why I think we should have a generic option that disables optimizations which are safe only in sequential programs (and -fopenmp would imply that option).
Subject: Re: Loop IM and other optimizations harmful for -fopenmp On 8 May 2007 18:08:26 -0000, jakub at gcc dot gnu dot org <gcc-bugzilla@gcc.gnu.org> wrote: > > > ------- Comment #7 from jakub at gcc dot gnu dot org 2007-05-08 19:08 ------- > This really is not specific to OpenMP, I believe the following is valid > threaded program: This is not a valid program. You have to introduce mutexes to get it to be a valid program with threads. This is a common misunderstanding of thread programming.
> That's why I think we should have a generic option that disables optimizations > which are safe only in sequential programs (and -fopenmp would imply that > option). So it sounds like you don't any thing about threading programming. People have to use mutexes and such to get safe code storing/accessing in globals no matter what, even if it looks like it is thread safe or not because of the way threads act. I don't see how this is different from knowning when programs access memory in some random way.
You are entitled to your opinion.
(In reply to comment #8) I suspect (unless I'm overlooked something) that this problem cause wrong statistics when using jointly the -fopenmp and -profile-* flags. I tried that and seen messages saying that the counters are corrupted (negative counters)... Still need to find a small testcase though. If I'm true, at the very least, the flags should error when used in conjunction (for this I can do a patch, if this is the right solution). Correcting this in this case should be easy (multiplicate the counters by the number of threads and join them at dump time), but that's probably beyond my gcc hacking capabilities...
I wish I could reopen the bug, but the only option I see is "Leave as UNCONFIRMED". (In reply to comment #9) > So it sounds like you don't any thing about threading programming. People have > to use mutexes and such to get safe code storing/accessing in globals no matter > what, even if it looks like it is thread safe or not because of the way threads > act. I don't see how this is different from knowning when programs access > memory in some random way. I would agree with Jakub, his example is perfectly thread-safe. The point is not about reading memory in random ways, but about random writes. Consider the following adjustment: void foo (int x, int y, int access_v) { int i; for (i = 0; i < 100; i++) { if (access_v && i > x) v = y; } } Now I would say that the call to foo(a, b, 0) doesn't require any mutex, no matter what. The fact that v is mentioned _lexically_ doesn't change anything: for access_v == 0 (or for x >= 99, as in the original example), v = y; is simply a dead code. You don't have to have a mutex for any variable just because it is _mentioned_ in the code. And the essence of this bug report is that gcc chooses to unconditionally write to variables that are simply lexically mentioned but otherwise aren't accessed during execution. This is wrong when both foo(a, b, 0) and foo(a, b, 1) calls are mixed, and also may harm SMP performance when there are only foo(a, b, 0) calls: storing the value back to v may flush the cache if the hardware doesn't detect that the stored value is the same (I'm not sure if there are such architectures). Please reconsider this report.
(In reply to comment #0) > See http://openmp.org/pipermail/omp/2007/000840.html > and the rest of the lengthy threads: > Memory consistency contradiction between 2.5 specification and GCC > OpenMP spec 2.5 seems to have incorrect flush example on page 12 > Two simpler examples (Re: OpenMP spec 2.5 seems to have incorrect flush example > on page 12) > > Some GCC optimizations are harmful for threaded code, e.g. loop invariant > motion > of global variables: > > int var; > void > foo (int x) > { > int i; > for (i = 0; i < 100; i++) > { > if (i > x) > var = i; > } > } > > When some other thread modifies var at the same time while foo (200) is > executed, > the compiler inserted a race which doesn't really exist in the original > program, > as it will do reg = var; ... var = reg; even when var was never modified. > This is not even a particular good choice of transformation for performance reasons. We incur memory read latency when there is no need to. Better is: auto int dummy, *addr; addr = &dummy; if (i > x) addr = &var; *addr = i; Or for the entire function: void foo (int x) { int dummy; int *addr = &dummy; if (99 > x) addr = &var; *addr = 99; }
I've checked in a patch for PR33961, which is similar to this one. Can somebody check if anything here is still broken with trunk?
Sorry for the delay. Unfortunately that doesn't help in this case: on foo from #c7 it is not cselim pass, but lim, which changes: foo (x, y) { int i; <bb 2>: <bb 3>: # i_12 = PHI <i_5(6), 0(2)> if (x_3(D) < i_12) goto <bb 4>; else goto <bb 5>; <bb 4>: v = y_4(D); <bb 5>: i_5 = i_12 + 1; if (i_5 <= 99) goto <bb 6>; else goto <bb 7>; <bb 6>: goto <bb 3>; <bb 7>: return; } into: foo (x, y) { int v_lsm.12; int i; <bb 2>: v_lsm.12_11 = v; <bb 3>: # v_lsm.12_1 = PHI <v_lsm.12_7(6), v_lsm.12_11(2)> # i_12 = PHI <i_5(6), 0(2)> if (x_3(D) < i_12) goto <bb 4>; else goto <bb 5>; <bb 4>: v_lsm.12_10 = y_4(D); <bb 5>: # v_lsm.12_7 = PHI <v_lsm.12_1(3), v_lsm.12_10(4)> i_5 = i_12 + 1; if (i_5 <= 99) goto <bb 6>; else goto <bb 7>; <bb 6>: goto <bb 3>; <bb 7>: # v_lsm.12_15 = PHI <v_lsm.12_7(5)> v = v_lsm.12_15; return; } which is unsafe, as there is no guarantee v is ever written within the loop.
The upcoming C++0x memory model forbids this; see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2338.html (Concurrency memory model compiler consequences). But it says that this is acceptable instead: tmp = var; modified = false; for (i = 0; i < 100; i++) { if (i > x) tmp = i, modified = true; } if (modified) var = tmp; I think the bug can be confirmed, can't it?
Confirmed.
This is a real bug. There is simply no way to write correct threaded code if the compiler is free to read a variable and write it back when the code didn't specifically tell it to. Optimizations on threaded code cannot add a write to a variable unless they can prove no other thread could know the location of the variable. (It's local, it's address has never been passed to a function, and so on.) In any event, the last time I looked into this, it seemed that such 'optimizations' were typically pessimizations on multi-threaded code anyway, as they added expensive cache misses.
Subject: Re: Loop IM and other optimizations harmful for -fopenmp On Wed, Feb 25, 2009 at 11:06, davids at webmaster dot com <gcc-bugzilla@gcc.gnu.org> wrote: > > > ------- Comment #18 from davids at webmaster dot com 2009-02-25 16:06 ------- > This is a real bug. There is simply no way to write correct threaded code if > the compiler is free to read a variable and write it back when the code didn't > specifically tell it to. Yes. Unless we build an actual concurrent data flow and adapt the optimizers for these programs, we will never get it right. The best approach in these cases is to tell the optimizers to back off completely. After all the code has completely different semantics from what they are ready to handle. Diego.
I don't like this either: tmp = var; modified = false; for (i = 0; i < 100; i++) { if (i > x) tmp = i, modified = true; } if (modified) var = tmp; This can be a pessimization as well. If 'modified' is almost always false, the 'tmp = var;' can force a cache unshare for no reason. If this code is not performance critical, the optimization is pointless. If it is, the pessimization can be significant. IMO, these kinds of optimizations are not worth doing. Almost any optimization that can introduce an additional memory access risks a net performance loss in some use cases. The compiler cannot determine the use case by static code inspection. Now, in this case, the 'tmp = var;' is obviously superfluous, you can just eliminate it, so this isn't a good example of what I'm talking about. But in general, an optimization should not *add* a memory operation the code did not request unless you can show that it will always remove at least one operation of comparable cost. Otherwise it can be a net performance loss all the time in some use cases.
The question if such transformation is or isn't worthwhile is independent from the issue at hand (which is about the validity in the light of the new memory model). If you find a clearly pessimizing but valid transformation create a bug report.
Really to me this is still a valid transformation even in the inside threads. If you want to be able to access via different threads, use locks, heavy or light weight ones.
"Really to me this is still a valid transformation even in the inside threads. If you want to be able to access via different threads, use locks, heavy or light weight ones." Absolutely, you do use locks everywhere you write to a variable that might be read from by another thread. The problem is that the compiler introduces locks where you *don't* write to a variable. It is simply impossible to use locks if the compiler can add a write where you didn't expect one. The classic example: if(can_write) acquire_lock(); for(int i=0; i<100; i++) { some_stuff(i); if(can_write) shared_variable+=i; } if(can_write) release_lock(); Here this code does acquire a lock if it plans to modify the shared variable. But the optimizations discussed will break this code. Also, you can have the same problem with this kind of code without threads. Imagine, for example, if the 'shared_variable' may be in read-only memory and 'can_write' indicates this case.
Andrew, your comments #6 #8 #9 clearly show that you haven't understood the issue and are just talking past others. The other hand the transformation has been shown to be an optimization on single-thread cases; if it is bad for multiprocessors, it means it probably should be guarded by a flag (but what if the load is not necessary? should it use the same flag, a separate flag, or be always enabled?). But the IM code should be modified to use the flag as suggested by the C++ standard (and to avoid the load if it is not necessary, as it is in this case but not in the case explained by N2238).
> Also, you can have the same problem with this kind of code without threads. > Imagine, for example, if the 'shared_variable' may be in read-only memory and > 'can_write' indicates this case. Then it must be declared 'volatile'. This optimization is valid in ISO C if the variable is not declared so. Of course that's orthogonal to this PR.