Google ref: b/10151411

Reproduced with current trunk, but this has been broken since at least gcc-4.3.1.

On Linux/x86_64, libstdc++.so.6 __cxa_get_globals looks like so:

Dump of assembler code for function __cxa_get_globals:
   0x00000000000cb430 <+0>:     lea    0x233131(%rip),%rdi
   0x00000000000cb437 <+7>:     callq  0x4f570 <__tls_get_addr@plt>
   0x00000000000cb43c <+12>:    add    $0x0,%rax
   0x00000000000cb442 <+18>:    retq

This calls the external function __tls_get_addr with a misaligned stack.

__tls_get_addr may itself call malloc, and malloc is user-replaceable, and may assume that the stack is properly aligned (and crash when it isn't).

Trivial test case:

static __thread char ccc;
extern "C" void* __cxa_get_globals() throw() { return &ccc; }

g++ -fPIC -S -O2 t.cc

results in:

__cxa_get_globals:
        leaq    _ZL3ccc@tlsld(%rip), %rdi
        call    __tls_get_addr@PLT
        addq    $_ZL3ccc@dtpoff, %rax
        ret

Ian Lance Taylor says:

  There is code in the i386 backend that is designed to avoid this.
  However, it appears to have only been fully implemented for the GNU2 TLS
  descriptor style ...

  I suspect that the right fix is to add the line
    ix86_tls_descriptor_calls_expanded_in_cfun = true;
  to tls_global_dynamic_64_<mode> and tls_local_dynamic_base_64_<mode> in
  gcc/config/i386/i386.md.
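The misalignment in the dump above can be checked with simple arithmetic. The following is a minimal sketch, assuming the System V x86-64 ABI rule that %rsp must be a multiple of 16 immediately before a call instruction, so that it is 8 modulo 16 on function entry once the return address has been pushed. The function name call_site_is_aligned and its parameters are hypothetical and only model the arithmetic; they are not part of GCC or glibc.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: SysV x86-64 ABI, 16-byte stack alignment at every call site. */
enum { STACK_ALIGN = 16, RETADDR_SIZE = 8 };

/*
 * Hypothetical model of one function that makes a call.
 *   rsp_in_caller:  %rsp in the caller just before it called us (16-aligned).
 *   bytes_in_frame: bytes this function pushed or subtracted from %rsp
 *                   between its entry and its own call instruction.
 * Returns true if %rsp is 16-byte aligned at that call instruction.
 */
bool call_site_is_aligned(uint64_t rsp_in_caller, unsigned bytes_in_frame)
{
    /* The caller's call instruction pushed an 8-byte return address. */
    uint64_t rsp_on_entry = rsp_in_caller - RETADDR_SIZE;
    uint64_t rsp_at_call = rsp_on_entry - bytes_in_frame;
    return rsp_at_call % STACK_ALIGN == 0;
}
```

With an aligned caller value such as 0x7ffc00001000, call_site_is_aligned(rsp, 0) is false, which matches __cxa_get_globals above: it reaches the call to __tls_get_addr without touching %rsp. Any adjustment that is 8 modulo 16, such as a single push or "sub $8,%rsp", makes the check true.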
> However, it appears to have only been fully implemented for the GNU2 TLS
> descriptor style ...

Which most Linux distros default to anyways ...
(In reply to Andrew Pinski from comment #1)
> Which most Linux distro default to anyways ...

Ubuntu 12.04.1 LTS doesn't. Configuring trunk GCC on it doesn't default to GNU2 TLS either.

What is the way to turn it on?
(In reply to Paul Pluzhnikov from comment #2)
> What is the way to turn it on?

Compiling the test case with -mtls-dialect=gnu2 does appear to improve the picture:

g++ -fPIC -O2 -S t.cc -mtls-dialect=gnu2

__cxa_get_globals:
        leaq    _ZL3ccc@TLSDESC(%rip), %rax
        call    *_ZL3ccc@TLSCALL(%rax)
        addq    %fs:0, %rax
        ret

The indirect call goes to _dl_tlsdesc_dynamic in ld-linux-x86-64.so.2 with a misaligned stack, and the latter re-aligns it.
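For readers unfamiliar with what "re-aligns" means here: one common way for a callee to realign the stack is to mask off the low bits of the stack pointer. The sketch below only illustrates that arithmetic. Whether _dl_tlsdesc_dynamic does exactly this is an assumption, since the thread does not show its code, and align_stack_down is a hypothetical name.

```c
#include <stdint.h>

/*
 * Round a stack pointer value down to the next 16-byte boundary.
 * The stack grows toward lower addresses on x86-64, so rounding down
 * never overlaps data that is already on the stack.
 */
uint64_t align_stack_down(uint64_t rsp)
{
    return rsp & ~(uint64_t)15;
}
```

A value that is 8 modulo 16, such as 0x7ffc00000ff8, becomes 0x7ffc00000ff0; an already aligned value is returned unchanged.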
Created attachment 32341 [details]
A patch

This patch sets ix86_tls_descriptor_calls_expanded_in_cfun after reload is complete and checks it for stack boundary in ix86_frame_pointer_required.
Another problem:

[hjl@gnu-6 gcc]$ cat /tmp/c.i
static __thread char ccc;
void* __cxa_get_globals() { return &ccc; }
[hjl@gnu-6 gcc]$ ./xgcc -B./ -S -O2 -fPIC /tmp/c.i
[hjl@gnu-6 gcc]$ cat /tmp/c.i
static __thread char ccc;
void* __cxa_get_globals() { return &ccc; }
[hjl@gnu-6 gcc]$ ./xgcc -B./ -S -O2 -fPIC /tmp/c.i -m32
[hjl@gnu-6 gcc]$ cat c.s
        .file   "c.i"
        .section        .text.unlikely,"ax",@progbits
.LCOLDB0:
        .text
.LHOTB0:
        .p2align 4,,15
        .globl  __cxa_get_globals
        .type   __cxa_get_globals, @function
__cxa_get_globals:
.LFB0:
        .cfi_startproc
        pushl   %ebx
        .cfi_def_cfa_offset 8
        .cfi_offset 3, -8
        call    __x86.get_pc_thunk.bx
        addl    $_GLOBAL_OFFSET_TABLE_, %ebx
        subl    $8, %esp
        .cfi_def_cfa_offset 16
        addl    $8, %esp
        .cfi_def_cfa_offset 8
        leal    ccc@tlsgd(,%ebx,1), %eax
        call    ___tls_get_addr@PLT
        popl    %ebx
        .cfi_restore 3
        .cfi_def_cfa_offset 4
        ret
        .cfi_endproc
.LFE0:
        .size   __cxa_get_globals, .-__cxa_get_globals

sched2 doesn't know that

(insn:TI 15 25 13 2 (parallel [
            (set (reg:SI 0 ax [86])
                (unspec:SI [
                        (reg:SI 3 bx)
                        (symbol_ref:SI ("ccc") [flags 0x1a] <var_decl 0x7f2b2be5e980 ccc>)
                        (symbol_ref:SI ("___tls_get_addr"))
                    ] UNSPEC_TLS_GD))
            (clobber (reg:SI 1 dx [88]))
            (clobber (reg:SI 2 cx [89]))
            (clobber (reg:CC 17 flags))
        ]) /tmp/c.i:5 772 {*tls_global_dynamic_32_gnu}
     (expr_list:REG_DEAD (reg:SI 3 bx)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (expr_list:REG_UNUSED (reg:SI 2 cx [89])
                (expr_list:REG_UNUSED (reg:SI 1 dx [88])
                    (expr_list:REG_EQUIV (unspec:SI [
                                (reg:SI 3 bx)
                                (symbol_ref:SI ("ccc") [flags 0x1a] <var_decl 0x7f2b2be5e980 ccc>)
                                (symbol_ref:SI ("___tls_get_addr"))
                            ] UNSPEC_TLS_GD)
                        (nil)))))))

is a function call, and it moves the stack adjustment across it.
Author: wmi
Date: Thu May 8 16:44:52 2014
New Revision: 210222

URL: http://gcc.gnu.org/viewcvs?rev=210222&root=gcc&view=rev
Log:
gcc/
2014-05-08 Wei Mi <wmi@google.com>

    PR target/58066
    * config/i386/i386.c (ix86_compute_frame_layout): Update
    preferred_stack_boundary for call, expanded from tls descriptor.
    * config/i386/i386.md (*tls_global_dynamic_32_gnu): Update RTX
    to depend on SP register.
    (*tls_local_dynamic_base_32_gnu): Ditto.
    (*tls_local_dynamic_32_once): Ditto.
    (tls_global_dynamic_64_<mode>): Set
    ix86_tls_descriptor_calls_expanded_in_cfun.
    (tls_local_dynamic_base_64_<mode>): Ditto.
    (tls_global_dynamic_32): Set
    ix86_tls_descriptor_calls_expanded_in_cfun. Update RTX
    to depend on SP register.
    (tls_local_dynamic_base_32): Ditto.

gcc/testsuite/
2014-05-08 Wei Mi <wmi@google.com>

    PR target/58066
    * gcc.target/i386/pr58066.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr58066.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.md
    trunk/gcc/testsuite/ChangeLog
Author: wmi
Date: Mon May 19 05:25:45 2014
New Revision: 210601

URL: http://gcc.gnu.org/viewcvs?rev=210601&root=gcc&view=rev
Log:
2014-05-18 Wei Mi <wmi@google.com>

    PR target/58066
    * gcc.target/i386/pr58066.c: Replace pattern matching of .cfi
    directive with rtl insns. Add effective-target of fpic and
    tls_native.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/pr58066.c
Is there any progress on this? Is it fixed? I've hit this issue in ThreadSanitizer. It intercepts __tls_get_addr and then code that uses MOVDQA [rbp] crashes. I remember that I hit it previously in some other context as well.
__tls_get_addr is called with misaligned stack on x86-64. It crashes ld.so when it tries to save and restore XMM registers with aligned load/store: https://sourceware.org/ml/libc-alpha/2015-07/msg00365.html
Another testcase:

[hjl@gnu-tools-1 pr58066]$ cat x.i
struct in_addr {
  int s_addr;
};
typedef long unsigned int size_t;
extern void __snprintf (char *__restrict __s, size_t __maxlen,
                        const char *__restrict __format, ...)
  __attribute__ ((__format__ (__printf__, 3, 4)));
static __thread char buffer[18];
char *
inet_ntoa (struct in_addr in)
{
  unsigned char *bytes = (unsigned char *) &in;
  __snprintf (buffer, sizeof (buffer), "%d.%d.%d.%d",
              bytes[0], bytes[1], bytes[2], bytes[3]);
  return buffer;
}
[hjl@gnu-tools-1 pr58066]$ gcc -S -fPIC -O2 x.i
[hjl@gnu-tools-1 pr58066]$ cat x.s
        .file   "x.i"
        .section        .rodata.str1.1,"aMS",@progbits,1
.LC0:
        .string "%d.%d.%d.%d"
        .section        .text.unlikely,"ax",@progbits
.LCOLDB1:
        .text
.LHOTB1:
        .p2align 4,,15
        .globl  inet_ntoa
        .type   inet_ntoa, @function
inet_ntoa:
.LFB0:
        .cfi_startproc
        pushq   %r14
        .cfi_def_cfa_offset 16
        .cfi_offset 14, -16
        pushq   %r13
        .cfi_def_cfa_offset 24
        .cfi_offset 13, -24
        movzbl  %dil, %r13d
        pushq   %r12
        .cfi_def_cfa_offset 32
        .cfi_offset 12, -32
        pushq   %rbp
        .cfi_def_cfa_offset 40
        .cfi_offset 6, -40
        movl    %edi, %r12d
        pushq   %rbx
        .cfi_def_cfa_offset 48
        .cfi_offset 3, -48
        movl    %edi, %ebx
        shrl    $16, %r12d
        movzbl  %bh, %eax
        shrl    $24, %ebx
        movzbl  %r12b, %r12d
        subq    $8, %rsp
        .cfi_def_cfa_offset 56
        movl    %eax, %r14d
        leaq    buffer@tlsld(%rip), %rdi
        call    __tls_get_addr@PLT
        pushq   %rbx
        .cfi_def_cfa_offset 64
        leaq    .LC0(%rip), %rdx
        movl    %r12d, %r9d
        leaq    buffer@dtpoff(%rax), %rbp
        movl    %r14d, %r8d
        movl    %r13d, %ecx
        xorl    %eax, %eax
        movl    $18, %esi
        movq    %rbp, %rdi
        call    __snprintf@PLT
        popq    %rax
        .cfi_def_cfa_offset 56
        movq    %rbp, %rax
        popq    %rdx
        .cfi_def_cfa_offset 48
        popq    %rbx
        .cfi_def_cfa_offset 40
        popq    %rbp
        .cfi_def_cfa_offset 32
        popq    %r12
        .cfi_def_cfa_offset 24
        popq    %r13
        .cfi_def_cfa_offset 16
        popq    %r14
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
.LFE0:
        .size   inet_ntoa, .-inet_ntoa
        .section        .text.unlikely
.LCOLDE1:
        .text
.LHOTE1:
        .section        .tbss,"awT",@nobits
        .type   buffer, @object
        .size   buffer, 18
buffer:
        .zero   18
        .ident  "GCC: (GNU) 5.1.1 20150707 (Red Hat 5.1.1-5)"
        .section        .note.GNU-stack,"",@progbits
[hjl@gnu-tools-1 pr58066]$

__tls_get_addr is called with misaligned stack.
Please make the 64-bit TLS patterns dependent on SP_REG, in the same way as the 32-bit ones are.
(In reply to Uroš Bizjak from comment #11)
> Please make 64bit TLS patterns dependant on SP_REG, in the same way as 32bit
> are.

This won't fix this particular case, but this dependency would be nice to have.

The problem with the testcase from Comment #10 is caused by the stack anti-adjustment, emitted from calls.c:

    1: NOTE_INSN_DELETED
    4: NOTE_INSN_BASIC_BLOCK 2
    2: r96:SI=di:SI
    3: NOTE_INSN_FUNCTION_BEG
    6: {sp:DI=sp:DI-0x8;clobber flags:CC;}    <<--- *** here ***
      REG_ARGS_SIZE 0x8
    7: {r98:SI=r96:SI 0>>0x10;clobber flags:CC;}
    8: {r99:QI=r98:SI#0&0xffffffffffffffff;clobber flags:CC;}
    9: r100:SI=zero_extend(r99:QI)
   10: r101:QI#0=zero_extract(r96:SI,0x8,0x8)
   11: r102:SI=zero_extend(r101:QI)
   12: r103:SI=zero_extend(r96:SI#0)
   13: ax:DI=call [`__tls_get_addr'] argc:0
      REG_EH_REGION 0xffffffff80000000
   14: r105:DI=ax:DI
      REG_EQUAL unspec[0] 21
   15: {r106:DI=r105:DI+const(unspec[`buffer'] 6);clobber flags:CC;}
   16: r104:DI=r106:DI
      REG_EQUAL `buffer'
   17: {r108:SI=r96:SI 0>>0x18;clobber flags:CC;}
   18: r109:SI=zero_extend(r108:SI#0)
   19: [pre sp:DI+=0xfffffffffffffff8]=r109:SI
      REG_ARGS_SIZE 0x10
   20: r9:SI=r100:SI
   21: r8:SI=r102:SI
   22: cx:SI=r103:SI
   23: dx:DI=`*.LC0'
   24: si:DI=0x12
   25: di:DI=r104:DI
   26: ax:QI=0
   27: call [`__snprintf'] argc:0x10
      REG_CALL_DECL `__snprintf'
   28: ax:DI=call [`__tls_get_addr'] argc:0
      REG_EH_REGION 0xffffffff80000000
   29: r111:DI=ax:DI
      REG_EQUAL unspec[0] 21
   30: {r112:DI=r111:DI+const(unspec[`buffer'] 6);clobber flags:CC;}
   31: r95:DI=r112:DI
      REG_EQUAL `buffer'
   32: {sp:DI=sp:DI+0x10;clobber flags:CC;}
      REG_ARGS_SIZE 0
   36: ax:DI=r95:DI
   37: use ax:DI

Putting a breakpoint on anti_adjust_stack will show where it happens:

Breakpoint 1, anti_adjust_stack (adjust=0x2aaaae7b0500) at /home/uros/gcc-svn/trunk/gcc/explow.c:902
902       if (adjust == const0_rtx)
(gdb) bt
#0  anti_adjust_stack (adjust=0x2aaaae7b0500) at /home/uros/gcc-svn/trunk/gcc/explow.c:902
#1  0x000000000080f24c in expand_call (exp=0x2aaaae7b3680, target=0x0, ignore=1) at /home/uros/gcc-svn/trunk/gcc/calls.c:3165
#2  0x0000000000966084 in expand_expr_real_1 (exp=0x2aaaae7b3680, target=0x0, tmode=VOIDmode, modifier=EXPAND_NORMAL, alt_rtl=0x0, inner_reference_p=false) at /home/uros/gcc-svn/trunk/gcc/expr.c:10362

There is already a precompute_register_parameters function where:

      /* If the value is a non-legitimate constant, force it into a
         pseudo now.  TLS symbols sometimes need a call to resolve.  */
      if (CONSTANT_P (args[i].value)
          && !targetm.legitimate_constant_p (args[i].mode, args[i].value))
        args[i].value = force_reg (args[i].mode, args[i].value);

So, the core of the problem is in the call infrastructure, which should emit the precomputed register parameters before anti_adjust_stack is emitted.

After this infrastructure problem is fixed, the proposed SP_REG dependency will prevent the stack adjustment from being scheduled above the TLS patterns.

Re-confirmed as RTL-optimization problem.
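The ordering problem described above can be modelled with a few lines of arithmetic. This is a toy sketch, not GCC code: it assumes the System V x86-64 rule that the CFA offset must be a multiple of 16 at every call instruction, and it starts from the offset of 48 that the .cfi directives in the testcase show after the five register pushes. The type and function names (struct op, first_misaligned_call) and the two arrays are hypothetical.

```c
#include <stddef.h>

/* Toy model of a straight-line instruction sequence (not GCC internals). */
enum op_kind { OP_ADJUST, OP_CALL };

struct op {
    enum op_kind kind;
    unsigned bytes; /* bytes subtracted from %rsp; used by OP_ADJUST only */
};

/*
 * Walk the sequence, tracking the CFA offset, and return the index of the
 * first call made while the offset is not a multiple of 16, or -1 if every
 * call is aligned.
 */
int first_misaligned_call(unsigned cfa_offset, const struct op *ops, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ops[i].kind == OP_ADJUST)
            cfa_offset += ops[i].bytes;
        else if (cfa_offset % 16 != 0)
            return (int)i;
    }
    return -1;
}

/* Order in the testcase: the adjustment for __snprintf comes first. */
const struct op adjust_before_tls_call[] = {
    { OP_ADJUST, 8 }, /* subq $8, %rsp           -> offset 56 */
    { OP_CALL, 0 },   /* call __tls_get_addr@PLT    (misaligned) */
    { OP_ADJUST, 8 }, /* pushq %rbx              -> offset 64 */
    { OP_CALL, 0 },   /* call __snprintf@PLT */
};

/* Order if the TLS address is precomputed before the adjustment. */
const struct op tls_call_before_adjust[] = {
    { OP_CALL, 0 },   /* call __tls_get_addr@PLT    (offset 48) */
    { OP_ADJUST, 8 }, /* subq $8, %rsp           -> offset 56 */
    { OP_ADJUST, 8 }, /* pushq %rbx              -> offset 64 */
    { OP_CALL, 0 },   /* call __snprintf@PLT */
};
```

Starting from an offset of 48, the first sequence reports index 1 (the __tls_get_addr call) and the second reports -1. The 8-byte adjustment exists only to align the later __snprintf call, so any call emitted between the adjustment and the push sees a stack that is off by 8.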
Created attachment 35964 [details]
Combined middle/end/target patch

Patch in testing.
(In reply to Uroš Bizjak from comment #13)
> Patch in testing.

This patch fixes the testcase, now we get:

0000000000000000 <inet_ntoa>:
   0:   41 56                   push   %r14
   2:   41 55                   push   %r13
   4:   44 0f b6 ef             movzbl %dil,%r13d
   8:   41 54                   push   %r12
   a:   55                      push   %rbp
   b:   41 89 fc                mov    %edi,%r12d
   e:   53                      push   %rbx
   f:   89 fb                   mov    %edi,%ebx
  11:   41 c1 ec 10             shr    $0x10,%r12d
  15:   0f b6 c7                movzbl %bh,%eax
  18:   c1 eb 18                shr    $0x18,%ebx
  1b:   45 0f b6 e4             movzbl %r12b,%r12d
  1f:   41 89 c6                mov    %eax,%r14d
  22:   48 8d 3d 00 00 00 00    lea    0(%rip),%rdi        # 29 <inet_ntoa+0x29>
                        25: R_X86_64_TLSLD      buffer+0xfffffffffffffffc
  29:   e8 00 00 00 00          callq  2e <inet_ntoa+0x2e>
                        2a: R_X86_64_PLT32      __tls_get_addr+0xfffffffffffffffc
  2e:   48 83 ec 08             sub    $0x8,%rsp
  32:   48 8d 15 00 00 00 00    lea    0(%rip),%rdx        # 39 <inet_ntoa+0x39>
                        35: R_X86_64_PC32       .LC0+0xfffffffffffffffc
  39:   45 89 e1                mov    %r12d,%r9d
  3c:   48 8d a8 00 00 00 00    lea    0x0(%rax),%rbp
                        3f: R_X86_64_DTPOFF32   buffer
  43:   53                      push   %rbx
  44:   45 89 f0                mov    %r14d,%r8d
  47:   44 89 e9                mov    %r13d,%ecx
  4a:   31 c0                   xor    %eax,%eax
  4c:   be 12 00 00 00          mov    $0x12,%esi
  51:   48 89 ef                mov    %rbp,%rdi
  54:   e8 00 00 00 00          callq  59 <inet_ntoa+0x59>
                        55: R_X86_64_PLT32      __snprintf+0xfffffffffffffffc
  59:   58                      pop    %rax
  5a:   48 89 e8                mov    %rbp,%rax
  5d:   5a                      pop    %rdx
  5e:   5b                      pop    %rbx
  5f:   5d                      pop    %rbp
  60:   41 5c                   pop    %r12
  62:   41 5d                   pop    %r13
  64:   41 5e                   pop    %r14
  66:   c3                      retq

The difference between patched (+++) and unpatched (---) code is:

--- pr58066_.s  2015-07-13 11:58:23.000000000 +0200
+++ pr58066.s   2015-07-13 11:58:26.000000000 +0200
@@ -28,16 +28,16 @@
        movzbl  %bh, %eax
        shrl    $24, %ebx
        movzbl  %r12b, %r12d
-       subq    $8, %rsp
-.LCFI5:
        movl    %eax, %r14d
        leaq    buffer@tlsld(%rip), %rdi
        call    __tls_get_addr@PLT
-       pushq   %rbx
-.LCFI6:
+       subq    $8, %rsp
+.LCFI5:
        leaq    .LC0(%rip), %rdx
        movl    %r12d, %r9d
        leaq    buffer@dtpoff(%rax), %rbp
+       pushq   %rbx
+.LCFI6:
        movl    %r14d, %r8d
        movl    %r13d, %ecx
        xorl    %eax, %eax

HJ, can you please test whether the patch fixes your problem?
(In reply to Uroš Bizjak from comment #13)
> Created attachment 35964 [details]
> Combined middle/end/target patch
>
> Patch in testing.

I tried it on GCC 5 and it works on glibc. Thanks.
Author: uros
Date: Wed Jul 15 07:39:30 2015
New Revision: 225807

URL: https://gcc.gnu.org/viewcvs?rev=225807&root=gcc&view=rev
Log:
    PR rtl-optimization/58066
    * calls.c (expand_call): Precompute register parameters before stack
    alignment is performed.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/calls.c
Back to target component.
Author: uros
Date: Wed Jul 15 13:42:07 2015
New Revision: 225829

URL: https://gcc.gnu.org/viewcvs?rev=225829&root=gcc&view=rev
Log:
    PR target/58066
    * config/i386/i386.md (*tls_global_dynamic_64_<mode>): Depend on SP_REG.
    (*tls_local_dynamic_base_64_<mode>): Ditto.
    (*tls_local_dynamic_base_64_largepic): Ditto.
    (tls_global_dynamic_64_<mode>): Update expander pattern.
    (tls_local_dynamic_base_64_<mode>): Ditto.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md
Author: uros
Date: Thu Jul 23 18:51:56 2015
New Revision: 226119

URL: https://gcc.gnu.org/viewcvs?rev=226119&root=gcc&view=rev
Log:
    Backport from mainline:
    2015-07-17 Uros Bizjak <ubizjak@gmail.com>

    PR rtl-optimization/66891
    * calls.c (expand_call): Wrap precompute_register_parameters with
    NO_DEFER_POP/OK_DEFER_POP to prevent deferred pops.

    2015-07-15 Uros Bizjak <ubizjak@gmail.com>

    PR target/58066
    * config/i386/i386.md (*tls_global_dynamic_64_<mode>): Depend on SP_REG.
    (*tls_local_dynamic_base_64_<mode>): Ditto.
    (*tls_local_dynamic_base_64_largepic): Ditto.
    (tls_global_dynamic_64_<mode>): Update expander pattern.
    (tls_local_dynamic_base_64_<mode>): Ditto.

    2015-07-15 Uros Bizjak <ubizjak@gmail.com>

    PR rtl-optimization/58066
    * calls.c (expand_call): Precompute register parameters before stack
    alignment is performed.

testsuite/ChangeLog:

    Backport from mainline:
    2015-07-17 Uros Bizjak <ubizjak@gmail.com>

    PR target/66891
    * gcc.target/i386/pr66891.c: New test.

Added:
    branches/gcc-5-branch/gcc/testsuite/gcc.target/i386/pr66891.c
Modified:
    branches/gcc-5-branch/gcc/ChangeLog
    branches/gcc-5-branch/gcc/calls.c
    branches/gcc-5-branch/gcc/config/i386/i386.md
    branches/gcc-5-branch/gcc/testsuite/ChangeLog
Author: uros
Date: Thu Jul 30 08:53:48 2015
New Revision: 226389

URL: https://gcc.gnu.org/viewcvs?rev=226389&root=gcc&view=rev
Log:
    Backport from mainline:
    2015-07-17 Uros Bizjak <ubizjak@gmail.com>

    PR rtl-optimization/66891
    * calls.c (expand_call): Wrap precompute_register_parameters with
    NO_DEFER_POP/OK_DEFER_POP to prevent deferred pops.

    2015-07-15 Uros Bizjak <ubizjak@gmail.com>

    PR target/58066
    * config/i386/i386.md (*tls_global_dynamic_64_<mode>): Depend on SP_REG.
    (*tls_local_dynamic_base_64_<mode>): Ditto.
    (*tls_local_dynamic_base_64_largepic): Ditto.
    (tls_global_dynamic_64_<mode>): Update expander pattern.
    (tls_local_dynamic_base_64_<mode>): Ditto.

    2015-07-15 Uros Bizjak <ubizjak@gmail.com>

    PR rtl-optimization/58066
    * calls.c (expand_call): Precompute register parameters before stack
    alignment is performed.

    2014-05-08 Wei Mi <wmi@google.com>

    PR target/58066
    * config/i386/i386.c (ix86_compute_frame_layout): Update
    preferred_stack_boundary for call, expanded from tls descriptor.
    * config/i386/i386.md (*tls_global_dynamic_32_gnu): Update RTX
    to depend on SP register.
    (*tls_local_dynamic_base_32_gnu): Ditto.
    (*tls_local_dynamic_32_once): Ditto.
    (tls_global_dynamic_64_<mode>): Set
    ix86_tls_descriptor_calls_expanded_in_cfun.
    (tls_local_dynamic_base_64_<mode>): Ditto.
    (tls_global_dynamic_32): Set
    ix86_tls_descriptor_calls_expanded_in_cfun. Update RTX
    to depend on SP register.
    (tls_local_dynamic_base_32): Ditto.

testsuite/ChangeLog:

    Backport from mainline:
    2015-07-17 Uros Bizjak <ubizjak@gmail.com>

    PR target/66891
    * gcc.target/i386/pr66891.c: New test.

    2014-05-18 Wei Mi <wmi@google.com>

    PR target/58066
    * gcc.target/i386/pr58066.c: Replace pattern matching of .cfi
    directive with rtl insns. Add effective-target fpic and
    tls_native.

    2014-05-08 Wei Mi <wmi@google.com>

    PR target/58066
    * gcc.target/i386/pr58066.c: New test.

Added:
    branches/gcc-4_9-branch/gcc/testsuite/gcc.target/i386/pr58066.c
    branches/gcc-4_9-branch/gcc/testsuite/gcc.target/i386/pr66891.c
Modified:
    branches/gcc-4_9-branch/gcc/ChangeLog
    branches/gcc-4_9-branch/gcc/config/i386/i386.c
    branches/gcc-4_9-branch/gcc/config/i386/i386.md
    branches/gcc-4_9-branch/gcc/testsuite/ChangeLog
Fixed everywhere.