This pop up in kernel code. We have: #define __cmpxchg_double(pfx, p1, p2, o1, o2, n1, n2) \ ({ \ bool __ret; \ __typeof__(*(p1)) __old1 = (o1), __new1 = (n1); \ __typeof__(*(p2)) __old2 = (o2), __new2 = (n2); \ asm volatile(pfx "cmpxchg%c4b %2; sete %0" \ : "=a" (__ret), "+d" (__old2), \ "+m" (*(p1)), "+m" (*(p2)) \ : "i" (2 * sizeof(long)), "a" (__old1), \ "b" (__new1), "c" (__new2)); \ __ret; \ }) #define arch_cmpxchg_double(p1, p2, o1, o2, n1, n2) \ __cmpxchg_double(LOCK_PREFIX, p1, p2, o1, o2, n1, n2) #define cmpxchg_double(p1, p2, o1, o2, n1, n2) \ ({ \ __typeof__(p1) ____p1 = (p1); \ kasan_check_write(____p1, 2 * sizeof(*____p1)); \ arch_cmpxchg_double(____p1, (p2), (o1), (o2), (n1), (n2)); \ }) And this is invoked as: if (cmpxchg_double(&page->freelist, &page->counters, freelist_old, counters_old, freelist_new, counters_new)) This fails with: error: ‘asm’ operand has impossible constraints However if I change cmpxchg_double as: - arch_cmpxchg_double(____p1, (p2), (o1), (o2), (n1), (n2)); \ + arch_cmpxchg_double((p1), (p2), (o1), (o2), (n1), (n2)); \ It works. I've tested with gcc version 7.0.1 20170307 (experimental) (GCC). Arnd Bergmann reported that the problem happens with 4.9 through 7.0.1 for him. I've also tried with gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3) and both versions work. It's unclear to me why introducing a local variable as "__typeof__(p1) ____p1 = (p1)" and then using it fails. Whereas using p1 works. Essentially we have: asm volatile("" "cmpxchg%c4b %2; sete %0" : "=a" (__ret), "+d" (__old2), "+m" (*((&page->freelist))), "+m" (*((&page->counters))) : "i" (2 * sizeof(long)), "a" (__old1), "b" (__new1), "c" (__new2)); vs: __typeof__(&page->freelist) ____p1 = (&page->freelist); ... asm volatile("" "cmpxchg%c4b %2; sete %0" : "=a" (__ret), "+d" (__old2), "+m" (*(____p1)), "+m" (*((&page->counters))) : "i" (2 * sizeof(long)), "a" (__old1), "b" (__new1), "c" (__new2)); It seems to me that both versions should work the same way. I will attach preprocessed sources for both versions. They can be compiled as: gcc slub1.c -fno-strict-aliasing -fno-common -Wno-format-security -std=gnu89 -fno-PIE -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m32 -msoft-float -mregparm=3 -freg-struct-return -fno-pic -mpreferred-stack-boundary=2 -march=i686 -mtune=pentium3 -Wa,-mtune=generic32 -ffreestanding -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1 -DCONFIG_AS_SSSE3=1 -DCONFIG_AS_CRC32=1 -DCONFIG_AS_AVX=1 -DCONFIG_AS_AVX2=1 -DCONFIG_AS_SHA1_NI=1 -DCONFIG_AS_SHA256_NI=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -fno-delete-null-pointer-checks -Wno-frame-address -O2 --param=allow-store-data-races=0 -Wframe-larger-than=1024 -fno-stack-protector -Wno-unused-but-set-variable -Wno-unused-const-variable -fno-omit-frame-pointer -fno-optimize-sibling-calls -fno-var-tracking-assignments -fno-inline-functions-called-once -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -fconserve-stack -Werror=implicit-int -Werror=strict-prototypes -Werror=date-time -Werror=incompatible-pointer-types -c
Created attachment 41016 [details] source code The archive contains slub2.c that works, and slub1.c that fails with the error.
With just -m32 slub1.c -O2 -mpreferred-stack-boundary=2 -march=i686 this started with r246059. Why doesn't the kernel use atomic builtins instead? The asm uses %eax/%ebx/%ecx/%edx and needs to use up to 2 other registers to hold the addresses of mems (unless any of them can be sp relative), if %ebp is used for frame pointer then there are no other registers left, but it still ought to be reloadable.
> Why doesn't the kernel use atomic builtins instead? There was a recent discussion here: https://groups.google.com/forum/#!topic/kasan-dev/3sNHjjb4GCI In short: --- Trivially, The C++ model doesn't feature I/O ordering [1]... + https://lwn.net/Articles/691295/ + rmb()/wmb() are not remotely similar to atomic_thread_fenc_{acquire,release}, even if you restrict ordering to coherent CPUs (i.e. the smp_* variants). Please don't do that :) + I'm also terrified of the optimisations that the compiler is theoretically allowed to make to C11 atomics given the assumptions of the language virtual machine, which are not necessarily valid in the kernel environment. We would at least need well-supported compiler options to disable these options, and also to allow data races with things like READ_ONCE.
Thank you for reporting this. Something is wrong with processing insns for reloads. The asm-insn hash 2 the same operands mem[r263+12]. R263 is spilled for a reload. The mem becomes invalid and r263 should be reloaded too. Instead, LRA goes to a spill sub-pass changing the operands on mem[mem[sp+offset]], then it generates 2 reloads for the both operands: p1=mem[sp+offset] p2=mem[sp+offset] LRA can not figure out that p1 and p2 do not conflict and should be the same (in LRA it is done on pseudo bases). So the solution would be a generation of reloads for R263 before the spill sub-pass. When I find why does not it happen, I could say how much time will it take to fix. If it is simple, the patch will be probably ready tomorrow.
The fix proposed by Bernd for PR80160 does not solve the problem. So I am continuing to work on the patch.
Author: vmakarov Date: Fri Mar 24 18:47:38 2017 New Revision: 246467 URL: https://gcc.gnu.org/viewcvs?rev=246467&root=gcc&view=rev Log: 2017-03-24 Vladimir Makarov <vmakarov@redhat.com> PR target/80148 * lra-assigns.c (assign_by_spills): Add spilled non-reload pseudos to consider in curr_insn_transform. Modified: trunk/gcc/ChangeLog trunk/gcc/lra-assigns.c
Fixed by Vlad's commit on the trunk.