This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug c++/71245] New: std::atomic<double> load/store bounces the data to the stack using fild/fistp
- From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Mon, 23 May 2016 18:53:22 +0000
- Subject: [Bug c++/71245] New: std::atomic<double> load/store bounces the data to the stack using fild/fistp
- Auto-submitted: auto-generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71245
Bug ID: 71245
Summary: std::atomic<double> load/store bounces the data to the
stack using fild/fistp
Product: gcc
Version: 6.1.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: i386-linux-gnu
Same result with gcc 4.8, gcc5, and gcc6.1; I didn't test exhaustively.
#include <atomic>

std::atomic<double> d(5.0);

void foo_d(void) {
    d = d + 1.0;
    // d += 1.0;   // unimplemented for std::atomic<double>
}
With gcc6.1 -m32 -O3 -march=i586 (https://godbolt.org/g/w3VKpG) this compiles to:
foo_d():
        subl    $20, %esp
        fildq   d
        fistpq  (%esp)          # copy `d`'s bits to the stack (with no loss)
        fldl    (%esp)
        fadds   .LC0
        fstpl   (%esp)          # store d + 1.0 to the stack
        fildq   (%esp)
        fistpq  d               # atomic store using fild/fistp
        lock orl $0, (%esp)     # mfence equivalent
        addl    $20, %esp
        ret
I assume fild/fistp is gcc's trick for implementing atomic 64-bit loads and
stores without resorting to cmpxchg8b. Clever: a 64-bit integer round-trips
through the 80-bit x87 format without loss, so the bit pattern can't be munged.
The fild/fistp pairs are of course not necessary in this case, where the data
is already a 64-bit float. The function should just fld / fstp to d directly.
With -march=i486 or lower, gcc correctly doesn't assume that 64bit FP
loads/stores are atomic, so it calls a library function to do the atomic load
and store. With SSE or SSE2 available, it uses an SSE load/store to copy to
the stack.
With -msse2 and -mfpmath=sse, we finally load/store directly from/to d, with
movq / addsd / movq. movq vs. movsd shouldn't make a performance difference, I
think.
--------
We don't need to allocate any stack space. We could implement the StoreLoad
barrier with lock orl $0, -4(%esp), below the stack pointer, instead of
reserving extra stack just to avoid doing it to our return address (which
would introduce extra store-forwarding delay before the ret could eventually
retire).
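At the source level, the lock or is only there because a seq_cst store needs a
trailing StoreLoad barrier; a weaker ordering drops it. A sketch contrasting
the two (the function names are mine, not from the test case):

```cpp
#include <atomic>

std::atomic<double> d(5.0);

// A seq_cst store needs a StoreLoad barrier after it; on i586 without
// SSE2, gcc uses `lock orl $0, (%esp)` as a cheaper mfence substitute.
void store_seq_cst(double v) {
    d.store(v, std::memory_order_seq_cst);
}

// A release store needs no StoreLoad barrier, so no `lock or` (or mfence)
// should be emitted for it.
void store_release(double v) {
    d.store(v, std::memory_order_release);
}
```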