This is the mail archive of the
mailing list for the GCC project.
Re: volatile access optimization (C++ / x86_64)
- From: Matt Godbolt <matt at godbolt dot org>
- To: Andrew Haley <aph at redhat dot com>
- Cc: GCC Development <gcc at gcc dot gnu dot org>
- Date: Fri, 26 Dec 2014 18:02:24 -0600
- Subject: Re: volatile access optimization (C++ / x86_64)
- References: <CAFWXXN3quEdSnaoWuPcQn2k-F99Yaw+6=NqgFgcu9ABpv5ZD3Q at mail dot gmail dot com> <549DE09B dot 8060502 at redhat dot com> <CAFWXXN0V9yvNTpcz54DCK237KPURQs1XkaHcQZK5Eoj_VCj0OA at mail dot gmail dot com> <549DED1B dot 3070006 at redhat dot com>
On Fri, Dec 26, 2014 at 5:19 PM, Andrew Haley <email@example.com> wrote:
> On 26/12/14 22:49, Matt Godbolt wrote:
>> On Fri, Dec 26, 2014 at 4:26 PM, Andrew Haley <firstname.lastname@example.org> wrote:
>>> On 26/12/14 20:32, Matt Godbolt wrote:
>> I realise my understanding could be wrong here!
>> If not though, both clang and icc are taking a short-cut that may
>> put them into a non-compliant state.
> It's hard to be certain. The language used by the standard is very
> unhelpful: it requires all accesses to be as written, but does not
> define exactly what constitutes an access.
Thanks. My world is very x86-centric and so I find it hard to
understand why a single instruction's RMW is different from three
separate instructions; but I appreciate the standard is vague around
volatiles, and that atomics go some way to using more well-defined
>> Thanks. I realise I was unclear in my original email. I'm really
>> looking for a way to say "do a non-lock-prefixed increment".
Performance. The single-threaded writers do not need to use a lock
prefix: the atomicity of their read-add-write is guaranteed by my
knowing no other threads write to the value. Thus the bus lock they
take out unnecessarily slows down the instruction and potentially
causes extra coherency traffic. The order of stores (on x86) is
guaranteed and so provided I take a relaxed view in the consumer
there's not even a need for any other flush. The memory write will
necessarily "eventually" become visible to the reader. Within the
constraints of the architecture I'm working in, this is plenty enough
for a metric.
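
A minimal sketch of the single-writer idea described above, assuming C++11 atomics. The counter name and the helper functions are hypothetical and not from the original example. Splitting the increment into a relaxed load followed by a relaxed store keeps each access atomic, but lets the compiler avoid a lock-prefixed read-modify-write. It is only correct when exactly one thread ever writes the counter.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical metric counter; the name and type are illustrative only.
std::atomic<std::uint64_t> metric_counter{0};

// Safe with any number of writers. On x86 this is the form that
// produces a lock-prefixed add.
inline void increment_any_writer() {
    metric_counter.fetch_add(1, std::memory_order_relaxed);
}

// Correct ONLY if a single thread ever calls it: the load and the store
// are each atomic, but the pair is not an atomic read-modify-write, so
// no lock prefix is required.
inline void increment_single_writer() {
    const std::uint64_t current =
        metric_counter.load(std::memory_order_relaxed);
    metric_counter.store(current + 1, std::memory_order_relaxed);
}

// Reader side: a relaxed load sees some recent value, with no ordering
// guarantee relative to other variables.
inline std::uint64_t read_metric() {
    return metric_counter.load(std::memory_order_relaxed);
}
```

Whether the load/store pair is emitted as three instructions or folded into a single non-locked add is up to the compiler; the sketch only guarantees the absence of the lock prefix, not a particular instruction sequence.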
> You could just use a compiler barrier: asm volatile(""); But this is
> good only for x86 and a few others.
This may be all I need, but my worry is this will inhibit other valid
optimisations. I know that the "trick" used elsewhere as a barrier
(asm volatile("":::"memory");) has the effect of flushing
enregistered values to memory. Ideally this wouldn't be necessary.
I'll be honest; I don't know the semantics of an empty volatile asm(),
but I'm not sure how it could cause only the one write (metric++) to
be emitted without affecting other variables too.
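
One possible way to limit the effect to a single variable, sketched here with GCC/Clang extended asm (a compiler extension, not standard C++). The variable and function names are hypothetical. A "memory" clobber tells the compiler the asm may touch any memory, while a "+m" operand names just one object as read and written by the asm.

```cpp
// Hypothetical globals for illustration.
int g_metric = 0;
int g_unrelated = 0;

// Full compiler barrier: every value cached in a register that might
// live in memory must be written back before the asm and reloaded after.
inline void bump_full_barrier() {
    ++g_metric;
    asm volatile("" ::: "memory");
}

// Targeted form: the empty asm claims to read and write only g_metric,
// so the compiler must have stored the incremented value by this point,
// but it may keep g_unrelated (and anything else) in registers.
inline void bump_targeted() {
    ++g_metric;
    asm volatile("" : "+m"(g_metric));
}
```

Neither form emits any instruction of its own, and neither is a hardware barrier; both constrain only what the compiler may reorder or elide.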
> Everyone else needs a real store barrier.
This is certainly true if the writer needs to guarantee visibility to
other threads. But that's not the case for my use case.
> Well, that's the problem: do you want a barrier or not? With no
> barrier there is no guarantee that the data will ever be written to
> memory. Do you only care about x86 processors?
I appreciate your patience in understanding my case (given I'm not
explaining myself very well!) In this instance, yes, only x86
processors. I do not need an explicit ISA-level flush. I do need a
guarantee that the compiler cannot optimise the increment by
>> To give a concrete example:
>> By making the int
>> atomic and using relaxed, I get this guarantee but at the cost of a
>> "lock addl".
> Ok, I get that, but not why. If you care about a particular x86
> instruction, you can use it in an inline asm. I'm not at all sure what
> you want, really.
I hope my other comments at least help to explain the why! It's not so
much a particular instruction I want as a way of communicating to the compiler that
there's only one writer, and so the lock prefix is unnecessary (for
x86) as the write of the read-modify-write will not race with other
writers (as none exist) and the write will eventually become visible
to other threads in strict memory order (as the x86 guarantees). This
last stage I believe is consistent with a "relaxed" model, with an
optimisation that if no other writers exist, no bus lock is required
on the writer.
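
For comparison, the inline asm route suggested in the quoted text could look roughly like the sketch below. It assumes GCC or Clang, AT&T assembler syntax, and a 32-bit counter; the names are hypothetical. On non-x86 targets it falls back to a plain increment so that the example still compiles.

```cpp
#include <cstdint>

// Hypothetical counter, written by exactly one thread.
std::uint32_t g_count = 0;

inline void bump_nolock(std::uint32_t& counter) {
#if defined(__x86_64__) || defined(__i386__)
    // One read-modify-write instruction with no lock prefix.
    // "+m" makes the counter a memory operand that is read and written;
    // "cc" records that the flags register is modified.
    asm volatile("addl $1, %0" : "+m"(counter) : : "cc");
#else
    // Fallback for other architectures: an ordinary increment.
    counter = counter + 1;
#endif
}
```

This pins down the instruction, but the compiler can no longer see through it (no constant folding or merging of adjacent increments), and a concurrent reader of a plain std::uint32_t is still outside what the language standard defines.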
Again, thanks for the reply and the time taken thinking about the
issue especially at this festive time of year!
Best regards, Matt