68616 – miscompilation in multi-threaded code

Bug 68616 - miscompilation in multi-threaded code

Summary: miscompilation in multi-threaded code

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	ipa (show other bugs)
Version:	6.0

Importance:	P3 major
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	wrong-code

Depends on:
Blocks:

Reported:	2015-11-30 13:32 UTC by torvald
Modified:	2023-06-02 00:11 UTC (History)
CC List:	5 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2015-11-30 00:00:00

Attachments
test case (451 bytes, text/plain) 2015-11-30 13:32 UTC, torvald	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description torvald 2015-11-30 13:32:28 UTC

Created attachment 36871 [details]
test case

In the attached case, the compiler seems to assume that in main_thread, ptr still points to x despite there being a call to arrive_and_wait that contains synchronization.  The compiled code on x86_64 with -O1 has just an unconditional store to x after the arrive_and_wait call.  It must keep the conditional because in fact the interference function waits for arrive_and_wait to increment barrier, then assigns zero to ptr, and the unblocks arrive_and_wait, which returns to main_thread.  So, in a correct execution, main_thread will always see ptr==0 and do not assign to x.

This may or may not be a duplicate of PR 68591; I don't know so I created a new PR.  Also, while just a transactional memory bug may be considered less severe, this one here is a severe problem for multi-threaded code.

Also note that in a variation of this I didn't see the miscompilation: In that case, main_thread() was a part of main() and the pthread_create call for interference() was directly before the arrive_and_wait.  The code generated for that case had the load and check of ptr before assigning to ptr.

Comment 1 Richard Biener 2015-11-30 13:56:34 UTC

arrive_and_wait is known to not clobber ptr and thus GCC can happily CSE ptr over the call.  IPA reference computes this for non-address-taken global statics.

It's hard to "fix" IPA reference here because it nowadays operates on the
IPA-ref list and IIRC there are no "may-ref-all" things we could inject here
for atomics.

OTOH IIRC nothing was "designed" to handle synchronization points but instead
the idea was to rely on aliasing barriers which as can be seen here is not
fully sufficient.  That said, do we have sth like a "barrier" detection?
Because a single atomic load or store certainly isn't one(?).

AFAICS only globals that don't have their address taken (thus statics without
LTO and some more with LTO) have this issue.  IPA-PTA might also compute
"interesting" mod/ref info if it were to handle the atomics in some way
(I never committed a patch that tried to improve things there because of
doubts of semantics...)

Comment 2 torvald 2015-12-01 15:11:17 UTC

I basically don't know anything about IPA, so I'll just assume that by "barrier" you mean conditions prohibiting transformations.  I'm also not sure whether just CSE is a problem here, so I'll try to give an unspecific but broad answer.

I'll assume we're looking at cases of nonatomic accesses to foo (such as ptr in the test case) and calls to the (new) atomic builtins on different variables bar1,bar2,... as synchronization points.  Mixed atomic/nonatomic accesses to the same memory location are in most cases a bug, and I believe we discourage it, but they might not generally be incorrect (e.g., if loading through an atomic op and observing a value that indicates we're the only thread accessing it, subsequent nonatomic accesses might be fine); maybe we should just let an atomic access to bar1 be a barrier for all movement/merging across/involving this access.  Any old atomic builtins (ie, __synch) can probably be handled like the equivalent calls to the new builtins.

I believe we don't need to do includes volatiles, because even if they are used in old-style code, they should have asm compiler barriers around them -- and I hope we're handling those correctly.

Because &foo != &bar, atomic stores to bar must be __ATOMIC_RELEASE or stronger, and atomic loads to bar must be __ATOMIC_ACQUIRE or stronger; otherwise, there's no implied ordering relationship between the foo and bar accesses.

A good way to find out what transformations are or are not allowed is to consider the data-race-freedom (DRF) requirement and which regions an implementation would be allowed to execute atomically.

For example, "foo = 1; bar.store(1, release); foo = 2;": The implementation is allowed to execute that in one atomic step, so there cannot be a nonatomic read of fooin another thread that's triggered by the store to bar because there would be a data race in this case.  So, foo=1 can be removed if we assume DRF.
(Note that old synchronization code may not be written to guarantee DRF; so perhaps this should be conditional on -std=c11 or such.)

"temp=foo; if (bar.load(acquire)) temp2=foo;" is similar.

However, this example here (basically the test case) doesn't allow for optimizing across the synchronization:
"foo = 1; temp=foo; bar.store(1, release); if (bar.load(acquire)==2) temp2=foo;"

This is because both loads of foo cannot be executed in one atomic step, as it needs another thread to increment bar to the value 2; also, there is no nonatomic access to foo between the two atomic accesses, so we cannot derive that those other threads that execute between the atomics don't access foo because of DRF.

I hope this helps, though I'm aware it's pretty general.  If you have a list of specific transformations, I could look through that and say which ones are correct or not.  Perhaps it's okay for now if we assume that all non-relaxed atomics are transformation barriers, as we do elsewhere AFAIK.