When targeting -march=rv32i or -march=rv64i (i.e. a RISC-V target without the A extension), GCC will produce a fence-based mapping for atomic load and store while calling the __atomic_* libcalls for other operations. As documented here <https://gcc.gnu.org/wiki/Atomic/GCCMM/LIbrary>, this is not legal. As stated there and in the equivalent LLVM documentation <https://llvm.org/docs/Atomics.html#atomics-and-codegen>, lock-based library calls and lock-free instructions should not be intermixed for the same object.

Below is a representative example, where I believe a call to __atomic_load_4 should be generated.

$ cat foo.c
int atomic(int *i) {
  int j = __atomic_add_fetch(i, 1, __ATOMIC_SEQ_CST);
  int k;
  __atomic_load(i, &k, __ATOMIC_SEQ_CST);
  return j+k;
}

$ ./riscv32-unknown-elf-gcc -march=rv32i foo.c -O1 -S -o -
        .file   "foo.c"
        .option nopic
        .text
        .align  2
        .globl  atomic
        .type   atomic, @function
atomic:
        addi    sp,sp,-16
        sw      ra,12(sp)
        sw      s0,8(sp)
        mv      s0,a0
        li      a2,5
        li      a1,1
        call    __atomic_fetch_add_4
        addi    a0,a0,1
        fence   iorw,iorw
        lw      a5,0(s0)
        fence   iorw,iorw
        add     a0,a5,a0
        lw      ra,12(sp)
        lw      s0,8(sp)
        addi    sp,sp,16
        jr      ra
        .size   atomic, .-atomic
        .ident  "GCC: (GNU) 9.0.0 20180530 (experimental)"
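For reference, a conforming lowering would route the load through the same libcall interface as the fetch-add. A hand-written sketch of the expected call sequence (not actual compiler output; a1 carries the memory model, where 5 is __ATOMIC_SEQ_CST, and the loaded value would return in a0):

        mv      a0,s0
        li      a1,5
        call    __atomic_load_4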
Actually I think this bug is wider in scope than I first thought. GCC will also intermix __atomic libcalls and inline instruction sequences with -march=rv32ia when values less than XLEN in size are accessed.

$ cat foo.c
char atomic(char *i) {
  char j = __atomic_add_fetch(i, 1, __ATOMIC_SEQ_CST);
  char k;
  __atomic_load(i, &k, __ATOMIC_SEQ_CST);
  return j+k;
}

$ ./riscv32-unknown-elf-gcc -march=rv32ia foo.c -O1 -S -o -
        .file   "foo.c"
        .option nopic
        .text
        .align  2
        .globl  atomic
        .type   atomic, @function
atomic:
        addi    sp,sp,-16
        sw      ra,12(sp)
        sw      s0,8(sp)
        mv      s0,a0
        li      a2,5
        li      a1,1
        call    __atomic_fetch_add_1
        addi    a0,a0,1
        andi    a0,a0,0xff
        fence   iorw,iorw
        lbu     a5,0(s0)
        fence   iorw,iorw
        add     a0,a0,a5
        andi    a0,a0,0xff
        lw      ra,12(sp)
        lw      s0,8(sp)
        addi    sp,sp,16
        jr      ra
        .size   atomic, .-atomic
        .ident  "GCC: (GNU) 9.0.0 20180530 (experimental)"
I realize the documentation doesn't concur with me, but as long as gcc and libgcc agree on the lock-freeness of the routines, I don't see the harm. (wrt. rv32ia, at least.)
(In reply to Andrew Waterman from comment #2)
> I realize the documentation doesn't concur with me, but as long as gcc
> and libgcc agree on the lock-freeness of the routines, I don't see the
> harm. (wrt. rv32ia, at least.)

Yes, for RV32IA it might be allowable - assuming the compiler can assume it knows that property of the __atomic_* implementation, and that the assumption won't/can't be invalidated by linking in another implementation which isn't lock-free.
I think if RISCV wants to use out-of-line lock-free atomic helpers, it should give those another function name (for example, __sync_fetch_and_##OP##_##WIDTH), and provide them in libgcc.a, the same as the other architectures which use such out-of-line helpers. (But also, why doesn't it implement __atomic_add_fetch inline?)

Since lock-free helper functions can be linked into a process as many times as you like without correctness issues, it's safe, and preferable, to provide them in the static runtime lib. That is quite unlike a potentially-locking __atomic_* function, which must share the same lock across the whole process, and thus must have only one implementation in a process, which is why they're provided separately by libatomic.so. It's not a good idea to conflate the two different things.
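To make that concrete, here is a minimal sketch of such a lock-free out-of-line helper: a byte-wide fetch-and-add built from a 32-bit compare-and-swap on the containing aligned word, assuming a little-endian target like RISC-V. The function name and the details are illustrative, not the actual libgcc implementation.

#include <stdint.h>

/* Hypothetical lock-free helper in the spirit of __sync_fetch_and_add_1:
   operate on the aligned 32-bit word containing the byte, retrying with
   compare-and-swap until the update lands.  Illustrative sketch only.  */
unsigned char
sync_fetch_and_add_1_sketch (unsigned char *p, unsigned char val)
{
  uint32_t *wp = (uint32_t *) ((uintptr_t) p & ~(uintptr_t) 3);
  unsigned shift = ((uintptr_t) p & 3) * 8;   /* little-endian byte lane */
  uint32_t mask = (uint32_t) 0xff << shift;
  uint32_t old = __atomic_load_n (wp, __ATOMIC_RELAXED);
  uint32_t new_word;

  do
    {
      uint32_t byte = (((old & mask) >> shift) + val) & 0xff;
      new_word = (old & ~mask) | (byte << shift);
    }
  while (!__atomic_compare_exchange_n (wp, &old, new_word, 1,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));

  return (unsigned char) ((old & mask) >> shift);
}

Because this uses only lock-free word-sized atomics, multiple copies can coexist in a process without breaking each other, which is exactly why it can safely live in a static runtime library.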
On Thu, 2018-05-31 at 05:40 +0000, asb at lowrisc dot org wrote:
> Actually I think this bug is wider in scope than I first thought. GCC
> will also intermix __atomic libcalls and inline instruction sequences
> with -march=rv32ia when values less than XLEN in size are accessed.

FYI I have a prototype patch to fix this. I copied the subword atomic support from the ppc port, and hacked it to work. I haven't done the fences yet, so it probably won't work if you have more than one thread, but it works well enough to handle the gcc/g++ testsuites. It also needs some cleanup to remove more ppc-specific stuff. I don't know when I will have time to finish it though.
On Thu, 2018-05-31 at 15:07 +0000, foom at fuhm dot net wrote:
> (But also, why doesn't it implement __atomic_add_fetch inline?)

If you don't have atomic instructions, then we call an out-of-line function that uses locks.

If you have atomic instructions, and it is a 32/64-bit operation, then we inline it.

If you have atomic instructions, and it is an 8/16-bit operation, then we call a lock-free out-of-line function that uses a sequence of instructions to implement it using 32/64-bit atomic instructions.

And as mentioned previously, I have an unfinished prototype patch to fix this by emitting the sequence of instructions inline.
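To make the three cases concrete, here is a minimal C illustration; the lowering noted in each comment is inferred from the examples earlier in this report and may differ across GCC versions and options.

/* Same source, three lowerings depending on -march (inferred from the
   examples above; illustrative, not guaranteed).  */

/* -march=rv32i (no A extension): a potentially lock-based libcall,
   e.g. call __atomic_fetch_add_4.  */
int add_int (int *p)    { return __atomic_add_fetch (p, 1, __ATOMIC_SEQ_CST); }

/* -march=rv32ia, word-sized operand: inlined using the A extension
   (an amoadd.w-based sequence).  */
int add_int_a (int *p)  { return __atomic_add_fetch (p, 1, __ATOMIC_SEQ_CST); }

/* -march=rv32ia, sub-word operand: at the time of this report, a call
   to the lock-free out-of-line __atomic_fetch_add_1.  */
char add_char_a (char *p) { return __atomic_add_fetch (p, 1, __ATOMIC_SEQ_CST); }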
(In reply to Jim Wilson from comment #6)
> On Thu, 2018-05-31 at 15:07 +0000, foom at fuhm dot net wrote:
> > (But also, why doesn't it implement __atomic_add_fetch inline?)
>
> If you don't have atomic instructions, then we call an out-of-line
> function that uses locks.

Except as mentioned in the initial message, it doesn't emit calls for atomic loads and stores too, but it needs to so that they also participate in the locking.

> If you have atomic instructions, and it is a 32/64-bit operation, then
> we inline it.
>
> If you have atomic instructions, and it is a 8/16-bit operation, then
> we call a lock free out-of-line function that uses a sequence of
> instructions to implement it using 32/64-bit atomic instructions.

If you're building for rv32ia, yes, you certainly know that libatomic must implement __atomic_fetch_add_1 lock-free. So, sure, you shouldn't have a correctness issue by mixing the calls.

Yet even so, if you actually needed to emit a call to a known-lock-free out-of-line function, it's better to give it a different name (e.g. __sync_*) and include that in libgcc.a. Emitting a call to the __atomic_* function means you require users to link against libatomic.so, which shouldn't be required here.

> And as mentioned previously, I have an unfinished prototype patch to fix
> this by emitting the sequence of instructions inline.

But doing that is fine, too, and likely preferable.
This looks like a generic GCC problem, not a RISC-V specific problem. For instance, if I build an armv6t2 compiler I get

        bl      __atomic_fetch_add_4
Oops, hitting tab doesn't work as expected. Trying again...

This looks like a generic GCC problem, not a RISC-V specific problem. Or perhaps, not a gcc bug at all. For instance, if I build an armv6t2 compiler I get

        bl      __atomic_fetch_add_4
        ...
        mcr     p15, 0, r0, c7, c10, 5
        ldr     r3, [r3]
        mcr     p15, 0, r0, c7, c10, 5

where the mcr is equivalent to the RISC-V fence. It looks like MIPS16 and a number of other targets have the same problem.

GCC has no support for calling __atomic_load_4 for this testcase. GCC assumes that loads smaller than or equal to the word size are always atomic, and will not call a library routine for them. It will emit memory barriers. If what gcc is doing is wrong, then it is broken for all targets that don't inline-expand every atomic function call, and/or don't have atomic instructions.

I can fix the rv32ia support by inline-expanding every atomic function call. I can't fix the rv32i support without target-independent optimizer changes.
I suppose since it doesn't affect most common platforms, nobody's noticed the brokenness yet? ARM is probably the most common architecture which is missing atomics on common CPU models, but when targeting common OSes (Linux, FreeBSD), you get to use the kernel atomic cmpxchg helpers, so the issue doesn't arise there.

I haven't tested, but I don't think MIPS16 would be broken, since MIPS has native atomics, just not in the MIPS16 ISA. So, it calls out-of-line libgcc __sync_* routines, implemented in MIPS32 and which use LL/SC instructions. There's no problem mixing those with native load/store instructions.

Anyways, for why mixing native load/store with __atomic_* functions is wrong, consider:

int value = 4;

Thread1:
  int oldval = __atomic_fetch_add(&value, 1, __ATOMIC_SEQ_CST);
Thread2:
  __atomic_store_n(&value, 0, __ATOMIC_SEQ_CST);

The only allowable results after executing both threads should be:
  oldval = 0, value = 1 -- if atomic_store executes first.
  oldval = 4, value = 0 -- if atomic_fetch_add executes first.

But, if you implement __atomic_fetch_add with a lock, and implement __atomic_store with a native store instruction, you may get the "impossible" result:
  oldval = 4, value = 5 -- if atomic_store executes *during* atomic_fetch_add, between the load and store instructions. Oops.
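Here is the same scenario as a runnable program (a sketch; whether the impossible result is actually observable depends on how the toolchain lowers each operation, and on thread timing):

#include <pthread.h>
#include <stdio.h>

static int value = 4;
static int oldval;

static void *t1_fn (void *arg)
{
  /* On rv32i, this becomes a potentially lock-based
     __atomic_fetch_add_4 libcall.  */
  oldval = __atomic_fetch_add (&value, 1, __ATOMIC_SEQ_CST);
  return NULL;
}

static void *t2_fn (void *arg)
{
  /* On rv32i, this gets the fence + native store mapping
     described in the report above.  */
  __atomic_store_n (&value, 0, __ATOMIC_SEQ_CST);
  return NULL;
}

int main (void)
{
  pthread_t t1, t2;
  pthread_create (&t1, NULL, t1_fn, NULL);
  pthread_create (&t2, NULL, t2_fn, NULL);
  pthread_join (t1, NULL);
  pthread_join (t2, NULL);

  /* Legal outcomes: oldval=0 value=1, or oldval=4 value=0.
     With the mixed lowering, oldval=4 value=5 becomes possible.  */
  printf ("oldval=%d value=%d\n", oldval, value);
  return 0;
}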
The char example from this issue is resolved on trunk since we added inline subword atomics:
https://inbox.sourceware.org/gcc-patches/20230418214124.2446642-1-patrick@rivosinc.com/

Char example: https://godbolt.org/z/h94PT7hfo

char atomic(char *i) {
  char j = __atomic_add_fetch(i, 1, __ATOMIC_SEQ_CST);
  char k;
  __atomic_load(i, &k, __ATOMIC_SEQ_CST);
  return j+k;
}

options: -O3 -march=rv64iad

atomic(char*):
        andi    a4,a0,3
        slliw   a4,a4,3
        li      a3,255
        sllw    a3,a3,a4
        li      a1,1
        andi    a2,a0,-4
        not     a6,a3
        sllw    a1,a1,a4
1:
        lr.w.aqrl  a5, 0(a2)
        add     a7, a5, a1
        and     a7, a7, a3
        and     t1, a5, a6
        or      t1, t1, a7
        sc.w.rl a7, t1, 0(a2)
        bnez    a7, 1b
        sraw    a5,a5,a4
        addiw   a5,a5,1
        andi    a5,a5,0xff
        fence   rw,rw
        lbu     a0,0(a0)
        fence   rw,rw
        add     a0,a0,a5
        andi    a0,a0,0xff
        ret

The initial example's behavior is still present.