This is the mail archive of the mailing list for the GCC project.

GCC libatomic ABI specification draft

I got an error from the alias, so I removed the PDF attachment and am re-sending this to the alias ...

On 11/14/2016 4:34 PM, Bin Fan wrote:
Hi All,

I have an updated version of the libatomic ABI specification draft. Please take a look to see if it matches the GCC implementation. The purpose of this document is to establish an official GCC libatomic ABI, and to allow compatible compiler and runtime implementations on the affected platforms.

Compared to the last version you reviewed, here are the major updates:

- Rewrite the notes in N2.3.2 to explicitly mention that the implementation of __atomic_compare_exchange follows memcmp/memcpy semantics, and the consequences of that.

- Rewrite section 3 to replace "lock-free" operations with "hardware-backed" instructions. In summary: 1) inlineable atomics must be implemented with hardware-backed atomic instructions; 2) for non-inlineable atomics, the compiler must generate a runtime call, and the runtime support function is free to use any implementation.

- The Rationale section in section 3 is also revised to remove the mention of "lock-free", but there is no major change of concept.

- Add note N3.1 to emphasize the assumption of generally hardware-supported atomic instructions

- Add note N3.2 to discuss the issues of cmpxchg16b

- Add a paragraph in section 4.1 to specify memory_order_consume must be implemented through memory_order_acquire. Section 4.2 emphasizes it again.

- The specification of each runtime function mostly maps to the corresponding generic function in the C11 standard. Two functions are worth noting: 1) C11 atomic_compare_exchange compares and updates the "value", while the __atomic_compare_exchange functions in this ABI compare and update the "memory", which implies memcmp and memcpy semantics. 2) The specification of __atomic_is_lock_free allows both a per-object result and a per-type result. A per-type implementation could pass NULL, or a fake address, as the address of the object. A per-object implementation could pass the actual address of the object.
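The per-type versus per-object distinction for __atomic_is_lock_free can be sketched with the GCC builtin of the same name. This is only an illustration: the two macro names are hypothetical, and the sketch assumes a GCC-compatible compiler on a target where the query for a naturally aligned int is resolved at compile time (otherwise the call goes to the libatomic runtime and needs -latomic).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: per-type query. A null object address means
   "an object of this size with typical alignment". */
#define TYPE_IS_LOCK_FREE(T) __atomic_is_lock_free(sizeof(T), 0)

/* Hypothetical helper: per-object query. The actual address is passed,
   so the answer may take this object's alignment into account. */
#define OBJECT_IS_LOCK_FREE(obj) __atomic_is_lock_free(sizeof(obj), &(obj))

/* Example object used for the per-object query. */
static int shared_counter;

/* Returns true when both styles of query agree for an int. */
static bool int_queries_agree(void)
{
    bool per_type = TYPE_IS_LOCK_FREE(int);
    bool per_object = OBJECT_IS_LOCK_FREE(shared_counter);
    assert(sizeof(shared_counter) == sizeof(int));
    return per_type == per_object;
}
```

For a naturally aligned int both forms are expected to give the same answer; they can differ only for objects whose alignment is weaker than the type's usual alignment.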

- Bin

On 8/10/2016 3:33 PM, Bin Fan wrote:
Hi Torvald,

Thanks a lot for your review. Please find my response inline...

On 8/5/2016 8:51 AM, Torvald Riegel wrote:
[CC'ing Andrew MacLeod, who has been working on the atomics too.]

On Tue, 2016-08-02 at 16:28 -0700, Bin Fan wrote:
I'm wondering if you have had a chance to review the revised libatomic ABI
draft. The email was rejected by the gcc alias once due to some HTML
content in the email text. Though I resent a plain-text version, I'm
not sure if it worked, so this time I have dropped the gcc alias.

If you do not have any issues, I'm wondering if this ABI draft could be
published in some GCC wiki or documentation? I'd be happy to prepare a
version without the "notes" part.

Because the padding of structure types is not affected by the _Atomic
modifier, the contents of any padding in an atomic structure object
are still undefined, therefore the atomic compare and exchange operation
on such objects may fail due to differences in the padding.
I think this isn't quite clear.
This paragraph is just to clarify that _Atomic does not change the padding bits (e.g. by zeroing them out). Their contents are undefined in the current SPARC and x86 ABI specifications, and will
still be undefined for _Atomic aggregates.

This paragraph is part of "notes" rather than the main body of the ABI draft. If it is not clear,
I will change it by mentioning the memcmp/memcpy-like semantics.
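To make the padding issue concrete, here is a small non-atomic model of the memcmp/memcpy-like semantics. It is only a sketch: model_compare_exchange and struct padded are hypothetical names, the model is not the libatomic implementation, and it assumes an ABI where int is aligned to more than one byte, so that byte 1 of the struct is padding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical struct: on typical ABIs there is padding between c and i. */
struct padded {
    char c;
    int i;
};

/* Non-atomic model of memcmp/memcpy-like compare-exchange semantics.
   It compares and copies raw memory, padding included. */
static bool model_compare_exchange(size_t size, void *object,
                                   void *expected, const void *desired)
{
    if (memcmp(object, expected, size) == 0) {
        memcpy(object, desired, size);   /* success: install desired */
        return true;
    }
    memcpy(expected, object, size);      /* failure: report what was seen */
    return false;
}

/* Returns true when the first attempt fails purely because of padding
   and the retry, with the refreshed expected bytes, succeeds. */
static bool first_attempt_fails_on_padding(void)
{
    assert(offsetof(struct padded, i) > 1);  /* byte 1 must be padding */

    struct padded object = { .c = 1, .i = 2 };
    struct padded expected = { .c = 1, .i = 2 };
    struct padded desired = { .c = 3, .i = 4 };

    /* Same member values, different padding bytes. */
    ((unsigned char *)&object)[1] = 0xAA;
    ((unsigned char *)&expected)[1] = 0x55;

    bool first = model_compare_exchange(sizeof object, &object,
                                        &expected, &desired);
    bool second = model_compare_exchange(sizeof object, &object,
                                         &expected, &desired);
    return !first && second && object.c == 3 && object.i == 4;
}
```

The failed attempt copies the object's bytes, padding included, into expected, which is why the usual retry loop still converges.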

Perhaps it's easier to describe it in
the way that C++ does, referring to the memcmp/memcpy-like semantics of
compare_exchange (e.g., see N4606 29.6.5p27).
C11 isn't quite clear about this, or I am misunderstanding what they
really mean by "value of the object" (see N1570
This is the subject of C11 Defect Report 431:
which has been fixed to align with the C++ standard and closed with a
Proposed Technical Corrigendum which will appear in the next revision
of the C standard (~2017).

Note that in section 4.2 of this ABI draft, the function description of
__atomic_compare_exchange uses "compares the memory pointed to by object" instead of "compares the value pointed to by object" as you quoted from N1570.

Since you asked whether you should review the function descriptions, this is one of the two cases worth noting. I will mention the other one later in this email.

Lock-free atomic operations do not require runtime support functions.
The compiler may generate inlined code for efficiency. This ABI
specification defines a few inlineable atomic types. An atomic type
being inlineable means the compiler may generate an inlined instruction
sequence for atomic operations on such types. The implementation of
the support functions for the inlineable atomic types must also be
lock free.
I think it's better to say that the support functions must be compatible
with what the compiler would generate.  That they are "lock-free" is
just a forward progress property. This also applies to later paragraphs in the draft. Maybe we need to use a different term here, so we can use
it for what we want (ie, a HW-backed, inlineable operation).
I agree that lock-free atomic operations are not equivalent to HW-backed atomic operations. I will think about how to express this in the ABI. My current thought is, as
you suggested, to change "lock-free" to "HW-backed".

So an example of the updated specification would be like this:
The implementation of the support functions for the inlineable atomic types must use HW-backed atomic instructions. For atomic operations on non-inlineable types, the compiler
must always generate support function calls.

On all affected platforms, atomic types whose size equals 1, 2, 4
or 8 bytes and whose alignment matches the size are inlineable.

On the 64-bit x86 platform which supports the cmpxchg16b instruction,
16-byte atomic types whose alignment matches the size are inlineable.
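The inlineability rule quoted above can be written down as a small predicate. This is a sketch of the draft text as quoted, not of GCC's internal logic; is_inlineable and its has_cmpxchg16b parameter are hypothetical names, with the flag standing for "64-bit x86 target known to support cmpxchg16b".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate mirroring the quoted draft rule:
   sizes 1, 2, 4 and 8 are inlineable when alignment matches size;
   size 16 additionally requires cmpxchg16b support. */
static bool is_inlineable(size_t size, size_t align, bool has_cmpxchg16b)
{
    /* Alignments are powers of two. */
    assert(align != 0 && (align & (align - 1)) == 0);

    if (align != size)
        return false;   /* under-aligned or over-sized: runtime call */

    switch (size) {
    case 1:
    case 2:
    case 4:
    case 8:
        return true;
    case 16:
        return has_cmpxchg16b;
    default:
        return false;   /* everything else goes through libatomic */
    }
}
```

Under this rule an 8-byte struct with 4-byte alignment is not inlineable, so the compiler would have to emit a support function call for it.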
I still think making 16-byte atomic types inlined / lock-free when all
we have is a wide cmpxchg is wrong.  AFAIK there is no atomic 16-byte
load instruction on x86 (or is there?), even though cmpxchg16b might be
At least GCC 6.1.0 still generates cmpxchg16b for an atomic load with -march=native
on my Haswell machine.
I'd prefer if we could fix this in GCC in some way instead
of requiring this by putting it into the ABI.  This also applies to the
double-wide CAS on i386.
IIRC, there is a BZ about this somewhere, but I can't find it.
Andrew, do you remember?

Basically, there is a correctness and a performance problem.
The atomic variable might be in a read-only-mapped page, which isn't
unreasonable given that the C/C++ standards explicitly require lock-free
atomics to be address-free too, which is a clear hint towards enabling
mapping memory to more than one place in the address space. So, if the
user does an atomic load on a 16-byte variable accessible through a
read-only page, we'll get a segfault.
One could argue that C/C++ don't provide any mmap feature, and thus you
can't expect this to work.  But this doesn't seem a good argument to
make from a user's perspective.
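The correctness problem comes from the shape of a load that is built out of a compare-and-swap. The sketch below shows that shape on a 64-bit object so that it compiles without 16-byte support; load_via_cas is a hypothetical name, and the real 16-byte case would use cmpxchg16b instead. As far as I know, a locked cmpxchg on x86 performs a write cycle to the destination even when the comparison fails, which is why the object must be in writable memory.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical illustration: an atomic "load" implemented with a
   compare-and-swap. If the object holds 0, the CAS succeeds and stores
   0 again; otherwise it fails and copies the current value into
   'expected'. Either way the instruction is a read-modify-write. */
static uint64_t load_via_cas(_Atomic uint64_t *object)
{
    assert(object != 0);
    uint64_t expected = 0;
    atomic_compare_exchange_strong(object, &expected, 0);
    return expected;
}
```

The returned value is correct in both cases and the stored value never changes, but the access still needs write permission on the page and exclusive ownership of the cache line.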

Second, I'd argue that the "lock-free" property is used by most users as
an indication of which atomics might be as fast as one would expect
typical HW to be -- not because they are interested in the forward
progress aspect or the address-free aspect.  If atomic loads do cause
writes, the performance of a load will be horrible because of the
contention in cases where many threads issue loads.
If the 16-byte atomic read is implemented in software, the current implementation still uses a lock/mutex, meaning a write will happen somewhere, maybe not directly on the object memory but somewhere else (a spinlock or a mutex). That resolves the read-only issue you mentioned, because the write is on the lock rather than on the
object, but there would still be the performance issue of contention.

There are some advanced software algorithms that can make this
mostly-readers, occasional-writer scenario more efficient. For example, the seqlock mentioned
here:
The performance of such algorithms depends heavily on the use case, so maybe users should implement their own algorithm instead of relying on the compiler/libatomic
library to provide the best performance in all cases.
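For reference, a minimal seqlock for a 16-byte payload could look roughly like the sketch below. It follows the commonly described fence-based pattern, assumes a single writer (or writers serialized by other means), and uses hypothetical names throughout; it is not what libatomic does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 16-byte payload guarded by a 64-bit sequence counter. */
struct seq_pair {
    _Atomic uint64_t seq;   /* even: stable, odd: write in progress */
    _Atomic uint64_t lo;
    _Atomic uint64_t hi;
};

/* Writer side. Assumption: only one writer runs at a time. */
static void seq_pair_write(struct seq_pair *p, uint64_t lo, uint64_t hi)
{
    uint64_t s = atomic_load_explicit(&p->seq, memory_order_relaxed);
    assert((s & 1) == 0);   /* no other write may be in progress */
    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->lo, lo, memory_order_relaxed);
    atomic_store_explicit(&p->hi, hi, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release);
}

/* One read attempt. Returns false if a write overlapped, in which
   case the outputs must be discarded and the read retried. */
static bool seq_pair_try_read(struct seq_pair *p, uint64_t *lo, uint64_t *hi)
{
    uint64_t s0 = atomic_load_explicit(&p->seq, memory_order_acquire);
    *lo = atomic_load_explicit(&p->lo, memory_order_relaxed);
    *hi = atomic_load_explicit(&p->hi, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    uint64_t s1 = atomic_load_explicit(&p->seq, memory_order_relaxed);
    return (s0 & 1) == 0 && s0 == s1;
}

/* Reader side: retries until a consistent snapshot is obtained. */
static void seq_pair_read(struct seq_pair *p, uint64_t *lo, uint64_t *hi)
{
    while (!seq_pair_try_read(p, lo, hi)) {
        /* a writer was active; try again */
    }
}
```

Readers only perform loads, so they work on read-only mappings and do not contend with each other. The cost is that a reader can be delayed for as long as writes keep overlapping its attempts.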
This is even more
unfortunate considering that if one has a 64b CAS, then one can
increment a 64b counter which can be considered to never overflow, which
allows one to build efficient atomic snapshots of larger atomic
OTOH, some people would like to use the GCC builtins to get access to

Irrespective of how we deal with this, we should at least document the
current state and the problems associated with it.  Maybe we should
consider providing separate builtins for cmpxchg16b.
I'm OK with the current GCC implementation, which I believe matches the ABI draft. And
we can document the current issues in an appendix or similar.
If GCC is willing to change, I'm also OK with specifying that 16-byte atomic types are
not inlineable.

"Inlineability" is a compile time property, which in most cases depends
only on the type. In a few cases it also depends on whether the target
ISA supports the cmpxchg16b instruction. A compiler may get the ISA
information by either compilation flags or inquiring the hardware
capabilities. When the hardware capabilities information is not available, the compiler should assume the cmpxchg16b instruction is not supported.
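One way to see the "compile-time property" in practice: to my knowledge GCC predefines __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16 when the x86-64 target is known to support cmpxchg16b (for example with -mcx16), and leaves it undefined otherwise. The sketch below uses that macro; TARGET_HAS_CAS16 and the helper function are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Without any ISA information the macro is undefined, which matches the
   draft's default of assuming cmpxchg16b is not supported. */
#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16)
#define TARGET_HAS_CAS16 1
#else
#define TARGET_HAS_CAS16 0
#endif

/* Hypothetical helper: would a 16-byte atomic type with the given
   alignment be inlineable for this translation unit? */
static bool sixteen_byte_inlineable(size_t align)
{
    /* Alignments are powers of two. */
    assert(align != 0 && (align & (align - 1)) == 0);
    return TARGET_HAS_CAS16 != 0 && align == 16;
}
```

Because the answer is fixed when the translation unit is compiled, objects built with different ISA flags can disagree about 16-byte types, which is part of why the ABI has to pin the rule down.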
I think that strictly speaking, it always depends on the target ISA,
because we assume that it provides 1-byte atomic operations, for
Right. The ABI specification itself is ISA-specific. For example, if we call it the SPARC V9 ABI amendment, then it is safe to assume that the ISA supports 1-, 2-, 4- and 8-byte atomic hardware instructions, and therefore safe to specify those sizes as "inlineable" in the ABI.

I'm not very familiar with x86 ISA versioning. I used to assume cmpxchg16b was available on all of today's mainstream x86 platforms until I found that Xeon Phi does not support it. That's
why the ABI says it depends on the target ISA.

    memory_order_consume = 1,
Refer to the C standard for the meaning of each enumeration constant of the
memory_order type.
Most of the functions listed in this section can be mapped to the generic
functions with the same semantics in the C standard. Refer to the C
standard for the description of the generic functions and how each memory
order works.
We need to say that memory_order_consume must be implemented through
memory_order_acquire.  The compiler can't preserve dependencies
correctly and will never be able to for the current specification of
consume.  Thus, we must fall back to acquire MO.
As far as I can tell, neither SPARC nor x86 has instructions that may benefit from the consume
order. So I'm happy to make this change.
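In code terms, "consume must be implemented through acquire" amounts to promoting the requested order before the operation is performed. A minimal sketch with hypothetical helper names, using the C11 <stdatomic.h> interface rather than the runtime functions of the draft:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical helper: promote consume to acquire and leave every
   other memory order unchanged. */
static memory_order effective_order(memory_order requested)
{
    return requested == memory_order_consume ? memory_order_acquire
                                             : requested;
}

/* Hypothetical load wrapper that applies the promotion. */
static int load_with(_Atomic int *object, memory_order requested)
{
    memory_order mo = effective_order(requested);
    /* Release orderings are not valid for loads. */
    assert(mo != memory_order_release && mo != memory_order_acq_rel);
    return atomic_load_explicit(object, mo);
}
```

Acquire is at least as strong as consume, so the promotion is always correct; it only gives up a potential optimization on architectures that could exploit dependency ordering.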

I haven't looked at the descriptions of the individual atomic operations
in detail.  Let me know if I should.
Above I said there are two places in the descriptions that may be interesting. I have already covered one (__atomic_compare_exchange). The other one is
__atomic_is_lock_free. This is based on Richard's comments.

Thanks again for your review, I will send a new draft based on your comments. Please send me
any further comments/suggestions.

- Bin


Attachment: libatomicABIdraft.txt
Description: Text document
