Bug 80881 - Implement Windows native TLS
Summary: Implement Windows native TLS
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 7.1.0
Importance: P2 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2017-05-25 16:57 UTC by Daniel Starke
Modified: 2025-02-19 14:33 UTC
CC List: 6 users

See Also:
Host: mingw64
Target: mingw64
Build: mingw64
Known to work:
Known to fail: 14.1.0, 15.0, 7.1.0
Last reconfirmed: 2017-11-30 00:00:00


Attachments
My latest gcc tls support patch (2.21 KB, patch)
2024-07-12 12:14 UTC, Alexandre Pereira Nunes
Newer patch for TLS support, incomplete (2.32 KB, patch)
2024-10-07 17:09 UTC, Julian Waters
Further progress (2.73 KB, patch)
2024-10-08 09:36 UTC, Julian Waters
WIP patch (2.64 KB, patch)
2024-10-08 17:11 UTC, Eric Botcazou
gcc 14 version, broken (2.62 KB, patch)
2024-10-09 05:35 UTC, Julian Waters
Minimal reproducer (221 bytes, text/plain)
2024-10-10 08:15 UTC, Eric Botcazou
WI (2.66 KB, patch)
2024-10-10 09:14 UTC, Eric Botcazou
WIP patch #2 (2.66 KB, patch)
2024-10-10 09:15 UTC, Eric Botcazou
WIP patch #3 (3.29 KB, patch)
2024-10-10 10:42 UTC, Eric Botcazou
Candidate patch (3.86 KB, patch)
2024-10-10 19:11 UTC, Eric Botcazou
Attempt to parallelize the load from gs/fs and load of _tls_index (2.68 KB, patch)
2024-10-11 08:35 UTC, Julian Waters
Attempt to parallelize the load from gs/fs and load of _tls_index (2.65 KB, patch)
2024-10-11 13:38 UTC, Julian Waters
Lastest TLS (2.63 KB, patch)
2024-10-11 18:48 UTC, Julian Waters
Latest TLS (3.30 KB, patch)
2024-11-01 18:36 UTC, Julian Waters
quote symbols for intel syntax (917 bytes, patch)
2024-11-20 05:59 UTC, LIU Hao
libstdc++ fix (1.17 KB, patch)
2024-11-22 09:01 UTC, LIU Hao
transitional patch for libstdc++ (1.65 KB, patch)
2024-11-29 09:22 UTC, LIU Hao
Latest TLS (3.36 KB, patch)
2025-01-15 06:35 UTC, Julian Waters
transitional patch for libstdc++ #2 (2.11 KB, patch)
2025-01-15 08:28 UTC, LIU Hao
Latest TLS (3.50 KB, patch)
2025-01-27 08:59 UTC, Julian Waters
Latest TLS (3.50 KB, patch)
2025-02-04 13:58 UTC, Julian Waters

Description Daniel Starke 2017-05-25 16:57:26 UTC
OpenMP applications compiled with GCC 7.1.0 fail with an invalid memory access at address 0.

Used configuration:
    ../../src/gcc-7.1.0/configure --host=x86_64-w64-mingw32 --enable-languages=c,c++ --enable-seh-exceptions --enable-threads=posix --enable-tls --disable-nls --disable-shared --enable-static --enable-fully-dynamic-string --enable-lto --enable-plugins --enable-libgomp --with-dwarf2 --disable-win32-registry --enable-version-specific-runtime-libs --prefix=/mingw64-64 --with-sysroot=/mingw64-64 --target=x86_64-w64-mingw32 --enable-targets=all --enable-checking=release --with-gmp=/usr/new-gcc/lib/gmp-5.0.5 --with-mpfr=/usr/new-gcc/lib/mpfr-2.4.2 --with-mpc=/usr/new-gcc/lib/mpc-0.9 --with-isl=/usr/new-gcc/lib/isl-0.18 --with-cloog=/usr/new-gcc/lib/cloog-0.18.4 --with-host-libstdcxx='-lstdc++ -lsupc++' --disable-cloog-version-check --enable-cloog-backend=isl
    Thread model: posix
    gcc version 7.1.0 (GCC)

Sample application:
    #include <stdlib.h>
    #include <stdio.h>
    
    #define N 1024
    
    int main() {
    	int i;
    	float var[N];
    	volatile float PI = 3.1415927;
    	
    #	pragma omp parallel for private(i)
    	for (i = 0; i < N; i++) {
    		var[i] = (1024.0f / PI) + 0.5f;
    	}
    	
    	return EXIT_SUCCESS;
    }

Error message from Dr.Memory (32-bit variant):
    Error #1: UNADDRESSABLE ACCESS: reading 0x00000000-0x00000004 4 byte(s)
    # 0 GOMP_parallel               [../../../../../src/gcc-7.1.0/libgomp/libgomp.h:677]
    # 1 main                        [h:\Temp\cpp017/test.c:11]
    Note: @0:00:00.642 in thread 6700
    Note: instruction: mov    %gs:0x00 -> %esi

Error message from Dr.Memory (64-bit variant):
    Error #1: UNADDRESSABLE ACCESS: reading 0x0000000000000000-0x0000000000000008 8 byte(s)
    # 0 GOMP_parallel               [../../../../src/gcc-7.1.0/libgomp/libgomp.h:677]
    # 1 main                        [h:\Temp\cpp017/test.c:11]
    Note: @0:00:00.170 in thread 1320
    Note: instruction: mov    %fs:0x00 -> %rdi
Comment 1 Jakub Jelinek 2017-05-25 17:05:22 UTC
That suggests TLS doesn't work at all on your platform (but then you should obviously not --enable-tls).
Comment 2 Daniel Starke 2017-05-26 10:59:44 UTC
True, I have rebuilt GCC without --enable-tls and the null pointer access is gone. So I guess there is still no native TLS support for mingw-w64 (even though Windows supports it as far as I know).
Comment 3 Jakub Jelinek 2017-05-26 11:18:10 UTC
There is always an emulated TLS support, but that is not what you ask for if you --enable-tls.  As for mingw TLS support, you need to ask the mingw maintainers, I don't have access to that target, nor sufficient knowledge about it.
Comment 4 Jakub Jelinek 2017-11-27 11:22:40 UTC
CCing Cygwin/Mingw maintainer.
Comment 5 jyong 2017-11-27 12:22:47 UTC
Can you post the full backtrace? Meanwhile, I'll set up gcc with --enable-tls and give this a try.
Comment 6 jyong 2017-11-28 10:15:39 UTC
Crash seems to be coming from the mingw-w64 runtime tls handler.
Comment 7 Daniel Starke 2017-11-28 19:49:49 UTC
Error report from Dr.Memory:
Error #1: UNADDRESSABLE ACCESS: reading 0x0000000000000000-0x0000000000000008 8 byte(s)
# 0 gomp_resolve_num_threads               [../../../../src/gcc-7.1.0/libgomp/libgomp.h:677]
# 1 GOMP_parallel                          [../../../../src/gcc-7.1.0/libgomp/parallel.c:166]
# 2 main                                   [h:\Temp\cpp017/test.c:11]
Note: @0:00:00.450 in thread 3376
Note: instruction: mov    %fs:0x00 -> %rax

Backtrace from SIGSEGV in GDB:
#0  gomp_resolve_num_threads (specified=specified@entry=0, count=count@entry=0) at ../../../../src/gcc-7.1.0/libgomp/parallel.c:47
        threads_requested = <optimized out>
        max_num_threads = <optimized out>
        num_threads = <optimized out>
        busy = <optimized out>
        pool = <optimized out>
#1  0x000000000040184f in GOMP_parallel (fn=fn@entry=0x401520 <main._omp_fn.0>, data=data@entry=0x22fe60, num_threads=num_threads@entry=0, 
    flags=flags@entry=0) at ../../../../src/gcc-7.1.0/libgomp/parallel.c:166
No locals.
#2  0x0000000000401604 in main () at test.c:11
        var = {3.72983052e-039, 0, 7.3767739e+033, 0, 7.34706519e+033, 0, 3.20827844e-039, 0, 9.03661843e-038, 0, 3.20798697e-039, 0, 
          3.67341985e-039, 0, 6.86636248e-044, 0, 1.40129846e-045, 0, 7.53898574e-043, 0, 2, 0, 3.67341985e-039, 0, 3.67341985e-039, 0, 
          1.07899982e-043, 0, 2.75506488e-040, 0, 7.67135411e+033, 0, 0, 0, 8.59029811e+009, 0, 0, 0, 3.67390189e-039, 0, 0, 0, 5.60519386e-045, 0, 
          7.53898574e-043, 0, 2.00002337, 0, 1.07899982e-043, 0, 4.20389539e-045, 0, 1.77964905e-043, 0, 7.41472914e+033, 0, 3.71850803e-039, 0, 
          8.59029811e+009, 0, 3.67420457e-039, 0, 4.20389539e-043, 0, 3.20836812e-039, 0, 1.8758415e-012, 0, 3.72900095e-039, 0, 1.40129846e-045, 
          0, 3.67420457e-039, 0, 0, 0, 3.67390189e-039, 0, 0, 0, 1.07899982e-043, 0, 4.48415509e-044, 4.20389539e-045, 3.67420457e-039, 0, 0, 0, 
          2.80259693e-045, 0, 3.67420457e-039, 0, 0, 0, 1.40129846e-045, 0, 0, 0, 1.56945428e-043, 0, 0, 0, 0, 0, 3.72904579e-039, 0, 
          5.60519386e-044, 0, 3.20930979e-039, 0, 8.51989466e-043, 0, 3.20865959e-039, 0, 9.82653682e-039, 4.49998415e-039, 5.87344331e+022, 
          2.67781571e+020, 0, 0, 6.74539118e-039, 0, 2.38775653e-039, 0, -2.81029619e+037, 2.86705666e-042, 0, 0, 1.83673515e-039, 0, 0, 0, 
          1.56945428e-043, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9.93057972e-035, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9.00001315e-039, 0, 3.21108104e-039, 0, 
          3.20936585e-039, 0, 7.34316878e+033, 0, 3.20935604e-039, 0, 3.72902337e-039, 0, 0, 0, 3.67341985e-039, 0, 0, 1.34524653e-043, 
          7.18866112e-043, 2.80259693e-045, 3.67341985e-039, 0, 7.63497105e+033, 0, 3.67341985e-039, 0, 3.72900095e-039, 0, 3.21121556e-039, 0, 
          1.34524653e-043, 0, 0, 0, 0, 0, -nan(0x7dd000), 2.86845796e-042, 1.82959882e-018, 0, 3.67341985e-039, 0, 7.48008123e+033, 0, 0, 0, 
          3.72902337e-039, 0...}
        PI = 3.14159274

Stack level 0, frame at 0x22edd0:
 rip = 0x401629 in gomp_resolve_num_threads (../../../../src/gcc-7.1.0/libgomp/parallel.c:47); saved rip 0x40184f
 called by frame at 0x22ee30
 source language c.
 Arglist at 0x22ed88, args: specified=specified@entry=0, count=count@entry=0
 Locals at 0x22ed88, Previous frame's sp is 0x22edd0
 Saved registers:
  rbx at 0x22edb0, rsi at 0x22edb8, rdi at 0x22edc0, rip at 0x22edc8, xmm15 at 0x22edc8

Used mingw-w64-v5.0.2.
Comment 8 LIU Hao 2017-11-29 12:57:21 UTC
I cannot reproduce this problem on either i686-w64-mingw32 or x86_64-w64-mingw32 with --enable-tls.


On line 677 in libgomp.h there is a call to `gomp_thread()`, which is supposed to return a pointer to a __thread object. However, from your first post, it is weird that for i686 GCC generates code referring to the GS segment register, since GS is unused by x86 Windows.

On x86 Windows, TLS is indirected from the FS register. It is Microsoft's rocket science, hence GCC still relies on the emutls solution. The GS register is known to be utilized by x64 Windows and Linux AFAICT.

I presume that your GCC generated Linux code for Windows targets. If you are cross-compiling, for example, it may be caused by GCC scripts mistaking the host for the build. This still requires investigation.

Reference: <https://en.wikipedia.org/wiki/Win32_Thread_Information_Block>
Comment 9 Daniel Starke 2017-11-29 16:52:51 UTC
This was a native build.
I have added the GCC build in question to https://sourceforge.net/projects/gcc-win64/files/7.1.0/gcc-7.1.0-debug-broken-tls.7z
Comment 10 LIU Hao 2017-11-30 01:39:07 UTC
Compiling this rather simple program using your gcc:

```
__thread int a = 1;

int get_a(void){
   return a;
}
```

resulted in wrong assembly:

```
E:\Desktop\gcc-7.1.0-debug-broken-tls\bin>gcc E:\Desktop\test.c -S -masm=intel -O2 -o -
        .file   "test.c"
        .intel_syntax noprefix
        .text
        .p2align 4,,15
        .globl  get_a
        .def    get_a;  .scl    2;      .type   32;     .endef
        .seh_proc       get_a
get_a:
        .seh_endprologue
        mov     rax, QWORD PTR fs:0
        mov     eax, DWORD PTR a@tpoff[rax]
        ret
        .seh_endproc
        .globl  a
        .data
        .align 4
a:
        .long   1
        .ident  "GCC: (GNU) 7.1.0"
```

With my working GCC it resulted in:

```
E:\Desktop>gcc E:\Desktop\test.c -S -masm=intel -O2 -o -
        .file   "test.c"
        .intel_syntax noprefix
        .text
        .globl  get_a
        .def    get_a;  .scl    2;      .type   32;     .endef
        .seh_proc       get_a
get_a:
        sub     rsp, 40
        .seh_stackalloc 40
        .seh_endprologue
        lea     rcx, __emutls_v.a[rip]
        call    __emutls_get_address
        mov     eax, DWORD PTR [rax]
        add     rsp, 40
        ret
        .seh_endproc
        .section .rdata,"dr"
        .align 4
__emutls_t.a:
        .long   1
        .globl  __emutls_v.a
        .data
        .align 32
__emutls_v.a:
        .quad   4
        .quad   4
        .quad   0
        .quad   __emutls_t.a
        .ident  "GCC: (gcc-7-branch HEAD with MCF thread model, built by LH_Mouse.) 7.2.1 20171119"
        .def    __emutls_get_address;   .scl    2;      .type   32;     .endef
```
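For readers unfamiliar with emulated TLS, the working output above can be read as follows: `get_a` passes the address of a control object (`__emutls_v.a`) to `__emutls_get_address`, which returns the address of the variable's storage. Below is a minimal, single-threaded sketch of that idea in C. All names prefixed with `model_` are hypothetical, the field names are assumptions based on the four `.quad` entries, and the real runtime keeps separate storage per thread, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the four .quad fields of __emutls_v.a.
   Field names are illustrative assumptions, not the libgcc definition. */
struct model_emutls_object {
    size_t size;       /* .quad 4: object size in bytes */
    size_t align;      /* .quad 4: requested alignment (ignored here) */
    void *storage;     /* .quad 0: nothing allocated yet */
    const void *templ; /* .quad __emutls_t.a: initial value image */
};

/* Single-threaded model of the lookup: allocate on first use, copy the
   initial image, then keep returning the same address. */
void *model_emutls_get_address(struct model_emutls_object *obj)
{
    if (obj->storage == NULL) {
        obj->storage = malloc(obj->size);
        assert(obj->storage != NULL);
        if (obj->templ != NULL)
            memcpy(obj->storage, obj->templ, obj->size);
        else
            memset(obj->storage, 0, obj->size);
    }
    return obj->storage;
}

/* Counterparts of __emutls_t.a and __emutls_v.a for `__thread int a = 1;`. */
const int model_template_a = 1;
struct model_emutls_object model_emutls_v_a = {
    sizeof(int), sizeof(int), NULL, &model_template_a
};

/* Counterpart of get_a(): one call, then one load through the result. */
int model_get_a(void)
{
    return *(int *)model_emutls_get_address(&model_emutls_v_a);
}
```

The point of the sketch is the cost difference: every access goes through a function call, whereas native TLS would need only a few loads.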
Comment 11 LIU Hao 2017-11-30 01:50:06 UTC
Diff'ing configure options used to build both GCC produces the following result:

```
E:\Desktop>gcc-7.1.0-debug-broken-tls\bin\gcc.exe -v 2>&1 | sed "s/ --/\n&/g" > yours.txt

E:\Desktop>gcc -v 2>&1 | sed "s/ --/\n&/g" > mine.txt

E:\Desktop>gcc-7.1.0-debug-broken-tls\bin\gcc.exe -v 2>&1 | sed "s/ --/\n&/g" > yours.txt

E:\Desktop>diff --color -U1 mine.txt yours.txt
--- mine.txt    2017-11-30 09:42:33.612869600 +0800
+++ yours.txt   2017-11-30 09:42:35.493977200 +0800
@@ -1,47 +1,35 @@
 Using built-in specs.
-COLLECT_GCC=gcc
-COLLECT_LTO_WRAPPER=C:/MinGW/MSYS2/mingw64/lib/gcc/x86_64-w64-mingw32/7.2.1/lto-wrapper.exe
+COLLECT_GCC=gcc-7.1.0-debug-broken-tls\bin\gcc.exe
+COLLECT_LTO_WRAPPER=e:/desktop/gcc-7.1.0-debug-broken-tls/bin/../libexec/gcc/x86_64-w64-mingw32/7.1.0/lto-wrapper.exe
 Target: x86_64-w64-mingw32
-Configured with: ../gcc/configure
- --prefix=/mingw64
- --with-local-prefix=/mingw64/local
- --build=x86_64-w64-mingw32
+Configured with: ../../src/gcc-7.1.0/configure
  --host=x86_64-w64-mingw32
- --target=x86_64-w64-mingw32
- --with-native-system-header-dir=/mingw64/x86_64-w64-mingw32/include
- --libexecdir=/mingw64/lib
- --enable-bootstrap
- --with-arch=x86-64
- --with-tune=nocona
- --enable-languages=c,lto,c++
- --enable-shared
+ --enable-languages=c,c++
+ --enable-seh-exceptions
+ --enable-threads=posix
+ --enable-tls
+ --disable-nls
+ --disable-shared
  --enable-static
- --enable-threads=mcf
- --enable-graphite
  --enable-fully-dynamic-string
- --enable-libstdcxx-time=yes
- --disable-libstdcxx-pch
- --disable-libstdcxx-debug
- --disable-isl-version-check
  --enable-lto
+ --enable-plugins
  --enable-libgomp
- --disable-multilib
- --enable-checking=release
- --disable-rpath
+ --with-dwarf2
  --disable-win32-registry
- --enable-nls
- --disable-werror
- --disable-symvers
- --with-libiconv
- --with-system-zlib
- --with-gmp=/mingw64
- --with-mpfr=/mingw64
- --with-mpc=/mingw64
- --with-isl=/mingw64
- --with-pkgversion='gcc-7-branch HEAD with MCF thread model, built by LH_Mouse.'
- --with-bugurl=https://gcc-mcf.lhmouse.com/
- --with-gnu-as
- --with-gnu-ld
- --disable-tls
-Thread model: mcf
-gcc version 7.2.1 20171119 (gcc-7-branch HEAD with MCF thread model, built by LH_Mouse.)
+ --enable-version-specific-runtime-libs
+ --prefix=/mingw64-64
+ --with-sysroot=/mingw64-64
+ --target=x86_64-w64-mingw32
+ --enable-targets=all
+ --enable-checking=release
+ --with-gmp=/usr/new-gcc/lib/gmp-5.0.5
+ --with-mpfr=/usr/new-gcc/lib/mpfr-2.4.2
+ --with-mpc=/usr/new-gcc/lib/mpc-0.9
+ --with-isl=/usr/new-gcc/lib/isl-0.18
+ --with-cloog=/usr/new-gcc/lib/cloog-0.18.4
+ --with-host-libstdcxx='-lstdc++ -lsupc++'
+ --disable-cloog-version-check
+ --enable-cloog-backend=isl
+Thread model: posix
+gcc version 7.1.0 (GCC)

E:\Desktop>
```

I notice that:
0) You didn't specify `--build=`.
1) You specified `--enable-targets=all` but I think this does not affect mingw targets according to <https://gcc.gnu.org/install/configure.html> and should be removed.

Maybe you should try adding `--build=`?
Comment 12 Daniel Starke 2017-11-30 05:50:52 UTC
I am not an expert in this field, but your build does not use platform TLS support as mine is supposed to do. Furthermore, I was building everything under Windows. The only difference during the build process was the target architecture (x86/x64). Using --enable-targets=all produced a compiler able to build for both architectures. Not specifying --build= should just default to the base compiler's default target (which is, nevertheless, Windows). The only issue I could possibly see here is that the base compiler used to build GCC did not support platform TLS but GCC still assumed it was available, resulting in a wrong setup. In this sense I was cross-compiling (mingw x86 to mingw-w64 x64).
Nevertheless, building GCC without --enable-tls like you did produces a working executable for me too as mentioned on 2017-05-26.
Comment 13 LIU Hao 2017-11-30 06:17:28 UTC
Native TLS requires essential support from LD, which I don't think is going to be available in the foreseeable future.

Without native TLS GCC tries to use emulated TLS, and if it generates code attempting to use the native one (which does not exist), it is, of course, a bug.
Comment 14 jyong 2017-11-30 23:25:21 UTC
Doing some simple test cases, it looks like GCC generates:

    movl %gs:0, %eax
    movl _a@ntpoff(%eax), %eax

While MSVC does (Intel syntax):
    mov ecx, DWORD PTR __tls_index
    mov eax, DWORD PTR fs:__tls_array
    mov eax, DWORD PTR [eax+ecx*4]
    mov eax, DWORD PTR _a[eax]

For a statement "return a;" where a is a thread local integer.
I'm not sure how to modify the machine definition to emit this.
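As a reading aid, the four MSVC instructions above amount to three dependent lookups: read the module's `__tls_index`, read the per-thread array pointer at `fs:__tls_array`, pick that module's block out of the array, then load the variable at its offset inside the block. The following C sketch models those steps with ordinary arrays. It is only an illustration under stated assumptions: `model_thread`, `model_tls_load_int` and `model_demo` are hypothetical names, and nothing here touches the real Windows structures.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: each loaded module owns one TLS block per thread,
   and each thread keeps an array of pointers to those blocks (the thing
   fs:__tls_array points at in the sequence above). */
enum { MODEL_MAX_MODULES = 4, MODEL_BLOCK_SIZE = 64 };

struct model_thread {
    unsigned char *tls_array[MODEL_MAX_MODULES]; /* one slot per module */
};

/* Mirrors the instruction sequence:
     index = __tls_index
     array = fs:__tls_array
     block = array[index]
     value = *(int *)(block + offset of the variable)  */
int model_tls_load_int(const struct model_thread *thread,
                       uint32_t tls_index, uint32_t offset)
{
    assert(tls_index < MODEL_MAX_MODULES);
    assert(offset + sizeof(int) <= MODEL_BLOCK_SIZE);
    const unsigned char *block = thread->tls_array[tls_index];
    int value;
    memcpy(&value, block + offset, sizeof value);
    return value;
}

/* Two model threads share the module index and the variable offset but
   see different values. Returns 1 if both reads come back as stored. */
int model_demo(void)
{
    unsigned char block_a[MODEL_BLOCK_SIZE] = {0};
    unsigned char block_b[MODEL_BLOCK_SIZE] = {0};
    const uint32_t tls_index = 1, offset_of_a = 8;
    const int one = 1, two = 2;
    struct model_thread t1 = {{0}}, t2 = {{0}};

    memcpy(block_a + offset_of_a, &one, sizeof one);
    memcpy(block_b + offset_of_a, &two, sizeof two);
    t1.tls_array[tls_index] = block_a;
    t2.tls_array[tls_index] = block_b;

    return model_tls_load_int(&t1, tls_index, offset_of_a) == 1
        && model_tls_load_int(&t2, tls_index, offset_of_a) == 2;
}
```

In this model the index and the offset are link-time or load-time constants of the module, and only the array pointer differs per thread, which is why the segment-relative load is the one thread-specific step.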
Comment 15 Jakub Jelinek 2017-12-01 08:57:53 UTC
(In reply to jyong from comment #14)
> Doing some simple test cases, it looks like GCC generates:
> 
>     movl %gs:0, %eax
>     movl _a@ntpoff(%eax), %eax
> 
> While MSVC does (Intel syntax):
>     mov ecx, DWORD PTR __tls_index
>     mov eax, DWORD PTR fs:__tls_array
>     mov eax, DWORD PTR [eax+ecx*4]
>     mov eax, DWORD PTR _a[eax]
> 
> For a statement "return a;" where a is a thread local integer.
> I'm not sure how to modify the machine definition to emit this.

Do Windows/mingw have multiple TLS models, e.g. different for shared libraries vs. executables, and different cases for static vs. exported variables, or is everything done the same way? In the latter case, is it sufficient to do the &{fs/gs}:__tls_array[__tls_index] computation once in the whole function that needs TLS, yielding a pointer against which the .tls section relative symbols are then used?

All I could find quickly is:
http://lists.llvm.org/pipermail/llvm-dev/2011-December/045886.html
http://www.nynaeve.net/?p=185

In any case, to implement it I think you'd want TARGET_WIN_TLS (or some better name next to TARGET_SUN_TLS, TARGET_GNU_TLS and TARGET_GNU2_TLS), associated command line switches and option handling setting the default, and then do something with it in legitimize_tls_address and ix86_delegitimize_tls_address.

I also fail to see why this is tracked as 7/8 Regression, given that the Windows TLS really isn't implemented, --enable-tls is just a user error, this can be turned into an enhancement request to implement Windows TLS.

In any case, I fail to
Comment 16 Daniel Starke 2017-12-02 07:59:10 UTC
Sorry, the wrong title came from me mistaking wrong configuration options in a newer GCC build for a regression. I have removed the "known to work" version.
Comment 17 Richard Biener 2018-01-25 08:21:03 UTC
GCC 7.3 is being released, adjusting target milestone.
Comment 18 Alexandre Pereira Nunes 2020-06-23 18:29:54 UTC
I'm working on native TLS for Windows targets. The goal is to match clang's output (the latter has had native TLS implemented for quite a while).
Comment 19 Julian Waters 2024-07-11 10:54:37 UTC
Jakub, the Windows .tls support to my knowledge only has 1 model. The following code:

_Thread_local int local = 1;

int get(void) {
    return local;
}

is equivalent to the following (handwritten) assembly:

    .section .tls$, "dw"
    .p2align 2, 0x0
local:
    .long 1
    .text
    .globl get
get:
    movl _tls_index(%rip), %eax
    movq %gs:88, %rcx
    movq (%rcx, %rax, 8), %rax
    movl local@SECREL32(%rax), %eax
    ret

Where rax and rcx can be substituted for any 64 bit scratch register, and .p2align 2 and .long should be replaced with the appropriate values/directives depending on the size of the TLS variable (For instance, changing to an 8 byte long long means .p2align 3 and .quad should be used instead). I am willing to step up to implement this, but am new to the gcc codebase and am having trouble finding out how to plug it into the compiler so it can emit the assembly required for TLS support. You mentioned briefly about how to implement it, could you run me through the steps required to get the compiler to emit the assembly above?
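The size rule in the paragraph above (4 bytes gives `.p2align 2` and `.long`, 8 bytes gives `.p2align 3` and `.quad`) can be sketched as a small lookup. This is only an illustration: `tls_directive_name` and `tls_p2align_for_size` are hypothetical helpers, and the 1-byte and 2-byte rows (`.byte`, `.short`) are an extrapolation that the comment does not state.

```c
#include <assert.h>
#include <stddef.h>

/* Data directive for a scalar TLS variable of the given size in bytes,
   or NULL for sizes this sketch does not cover. Only the 4- and 8-byte
   rows come from the comment above; the others are assumptions. */
const char *tls_directive_name(size_t size)
{
    switch (size) {
    case 1:  return ".byte";
    case 2:  return ".short";
    case 4:  return ".long";
    case 8:  return ".quad";
    default: return NULL;
    }
}

/* Argument for .p2align, i.e. log2(size), or -1 for unsupported sizes. */
int tls_p2align_for_size(size_t size)
{
    int shift = 0;

    if (tls_directive_name(size) == NULL)
        return -1;
    while (((size_t)1 << shift) < size)
        shift++;
    assert(((size_t)1 << shift) == size); /* supported sizes are powers of two */
    return shift;
}
```

For example, `tls_p2align_for_size(sizeof(long long))` yields 3 on targets where `long long` is 8 bytes, matching the `.p2align 3` and `.quad` pair mentioned above.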

Once this is implemented in the compiler, 2 bugs in binutils need to be fixed. The first is that the assembly fails horribly in Intel syntax, as gas cannot recognize @SECREL32 as a directive and instead thinks the entire local@SECREL32 is a symbol name in Intel mode. The other is that an executable linked with ld crashes with a mysterious SIGSEGV on movl local@SECREL32(%rax), %eax. Nothing is wrong with the assembly, as assembling it with gas and then linking with clang results in a working exe; instead, this is a bug in ld that I have yet to decipher.

Alexandre, how's your progress on Windows TLS going? Could we collaborate to get this into gcc somehow?
Comment 20 Julian Waters 2024-07-11 10:56:25 UTC
Could the version for this be bumped to either 14 or 15 too? Thanks
Comment 21 Andrew Pinski 2024-07-11 16:12:58 UTC
(In reply to Julian Waters from comment #20)
> Could the version for this be bumped to either 14 or 15 too? Thanks

The version entry just mentions what version the original report was against, it does not say anything else really. I updated the "known to fail" field to include 15.0 and 14.1.0 which is the field that says where it does not work (still).
Comment 22 Alexandre Pereira Nunes 2024-07-12 12:12:11 UTC
(In reply to Julian Waters from comment #19)
> 
> Alexandre, how's your progress on Windows TLS going? Could we collaborate to
> get this into gcc somehow?


Hi Julian,

I used to manage a collection of cross-compiled libraries and tools, including the gcc compiler. I managed to get it to work partially by using this patch as a starting point: https://github.com/venix1/MinGW-GDC/blob/master/patches/mingw-tls-gcc-4.8.patch

(Perhaps it wasn't exactly this one, but very similar)

It compiled and linked fine with the whole collection of libraries I used to manage (here: https://build.opensuse.org/project/show/home:polesapart:win64)


From some version on, binutils started to complain about bad relocations in x86_64 code. I suspect it's correct about this and previous versions simply didn't catch it. So I suspect the code generation for x86_64 needs a fix, or otherwise binutils needs a patch to tell these apart from invalid cases.

Anyway, I haven't worked on this for quite some time. I'll attach the last non-published version of the patch I was applying to gcc. I can provide some help, but it has been quite a while, so my memory of it is not all that clear right now.

When I posted I was trying to write this from scratch (and having a bad time understanding gcc internals), I even registered as a gcc contributor for this. But since I found the patch and it seemed to work for my collection (which I used professionally for 32-bit code at the time), I got lazy.

AFAIK, for that work to be merged into gcc mainline, you'd have to track down the patch's owner and get them to transfer rights to you, or convince them to become a gcc contributor, if not one already. Or rewrite the logic from scratch.
Comment 23 Alexandre Pereira Nunes 2024-07-12 12:14:33 UTC
Created attachment 58642 [details]
My latest gcc tls support patch
Comment 24 Julian Waters 2024-07-17 11:38:03 UTC
Thanks for the patch, I've been looking through it these past few days. While the simpler parts of it I can manage, I'm struggling terribly with understanding the RTL shifting code in legitimize_tls_address and the RTL templates in the machine definitions file (i386.md to be specific). Do you happen to know how to read the RTL code in the patch? I definitely need some help with figuring out how it works mechanically
Comment 25 Julian Waters 2024-10-07 17:09:39 UTC
Created attachment 59290 [details]
Newer patch for TLS support, incomplete
Comment 26 LIU Hao 2024-10-08 01:10:30 UTC
Comment on attachment 59290 [details]
Newer patch for TLS support, incomplete

> +  "mov{l}\t{_tls_index(%%rip), %k0|%k0, DWORD PTR [rip+_tls_index]}\;mov{q}\t{%%gs:88, %1|%1, QWORD PTR gs:[88]}\;mov{q}\t{(%1,%0,8), %0|%0, QWORD PTR [%1+%0*8]}"

For i686 this would be (untested):

```
"mov{l}\t{_tls_index, %k0|%k0, DWORD PTR [_tls_index]}\;mov{l}\t{%%fs:44, %1|%1, DWORD PTR fs:[44]}\;mov{l}\t{(%1,%0,4), %0|%0, DWORD PTR [%1+%0*4]}"
```

i.e. pointer size is 4 (instead of 8), TLS segment is FS (instead of GS), and addresses of global symbols are absolute (instead of being RIP-relative).
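The three differences listed above can be collected into one small table. The C sketch below does that, using only the values that appear in the two templates (offset 88 via GS with scale 8, offset 44 via FS with scale 4). The struct and function names are hypothetical, and the i686 values inherit the "untested" caveat from the comment.

```c
#include <assert.h>
#include <limits.h>

/* Per-target parameters of the TLS base computation, as read off the two
   insn templates above. Names are illustrative only. */
struct win_tls_params {
    const char *segment;   /* segment register used to reach the TLS array */
    unsigned array_offset; /* offset of the TLS array pointer in that segment */
    unsigned pointer_size; /* scale applied to _tls_index */
    int rip_relative;      /* nonzero if _tls_index is addressed via RIP */
};

struct win_tls_params win_tls_params_for(int is_64bit)
{
    struct win_tls_params p;

    if (is_64bit) {
        p.segment = "gs";
        p.array_offset = 88;
        p.pointer_size = 8;
        p.rip_relative = 1;
    } else {
        p.segment = "fs";
        p.array_offset = 44;
        p.pointer_size = 4;
        p.rip_relative = 0;
    }
    return p;
}

/* Byte offset of a module's slot inside the TLS array, i.e. the
   index*scale part of the final (%1,%0,scale) address. */
unsigned win_tls_slot_offset(int is_64bit, unsigned tls_index)
{
    struct win_tls_params p = win_tls_params_for(is_64bit);

    assert(tls_index <= UINT_MAX / p.pointer_size); /* no overflow */
    return tls_index * p.pointer_size;
}
```

With these values, module index 3 lands at byte offset 24 in the 64-bit array and at byte offset 12 in the 32-bit one.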
Comment 27 Julian Waters 2024-10-08 04:59:21 UTC
(In reply to LIU Hao from comment #26)
> Comment on attachment 59290 [details]
> Newer patch for TLS support, incomplete
> 
> > +  "mov{l}\t{_tls_index(%%rip), %k0|%k0, DWORD PTR [rip+_tls_index]}\;mov{q}\t{%%gs:88, %1|%1, QWORD PTR gs:[88]}\;mov{q}\t{(%1,%0,8), %0|%0, QWORD PTR [%1+%0*8]}"
> 
> For i686 this would be (untested):
> 
> ```
> "mov{l}\t{_tls_index, %k0|%k0, DWORD PTR [_tls_index]}\;mov{l}\t{%%fs:44,
> %1|%1, DWORD PTR fs:[44]}\;mov{l}\t{(%1,%0,4), %0|%0, DWORD PTR [%1+%0*4]}"
> ```
> 
> i.e. pointer size is 4 (instead of 8), TLS segment is FS (instead of GS),
> and addresses of global symbols are absolute (instead of being RIP-relative).

I think I remember clang using __tls_index instead of _tls_index for 32 bit as well, but that's the only difference I remember. On another note, Cygwin doesn't support TLS natively, right? Eric modified the stopgap patch above and he put some definitions in cygming.h, since he expects it to support Cygwin as well, but I vaguely remember you saying something about Cygwin not having the support for this
Comment 28 LIU Hao 2024-10-08 05:12:57 UTC
(In reply to Julian Waters from comment #27)
> I think I remember clang using __tls_index instead of _tls_index for 32 bit
> as well, but that's the only difference I remember. On another note, Cygwin

Yes, you are right. Solely for i686, external symbols have to be prefixed by an underscore.


> doesn't support TLS natively, right? Eric modified the stopgap patch above
> and he put some definitions in cygming.h, since he expects it to support
> Cygwin as well, but I vaguely remember you saying something about Cygwin not
> having the support for this

Correct, because the Cygwin CRT doesn't have a TLS directory. You can use `objdump -h` to print PE headers of a Cygwin executable, and there is no `.tls` section.

An application may provide its own TLS directory, but it's not default.
Comment 29 Julian Waters 2024-10-08 09:36:18 UTC
Created attachment 59295 [details]
Further progress
Comment 30 Eric Botcazou 2024-10-08 10:29:37 UTC
AFAICT the last missing piece is the configure check for the linker.
Comment 31 Julian Waters 2024-10-08 13:10:12 UTC
(In reply to Eric Botcazou from comment #30)
> AFAICT the last missing piece is the configure check for the linker.

It's a bit of a shame I couldn't figure out how to make the zero extend approach work correctly. That aside, I'm concerned that this patch still isn't correct, because it doesn't seem to be using the parallel rtx correctly. From what I can gather parallel is meant for multiple operations to run at the same time, which crucially means results from one operation cannot be assumed to be available for the next operation inside a parallel, which is exactly what this patch is doing, since it's using the results from the first 2 instructions to calculate the base thread pointer. I've tried to do this the "correct" way by splitting the part of the thread pointer load that can be done in parallel into one insn (In particular trying to use UNSPEC_PCREL) and the actual calculation of the base pointer into another, but both insns are not recognized by gcc. I don't know how to circumvent this issue at the moment
Comment 32 Eric Botcazou 2024-10-08 17:09:35 UTC
> It's a bit of a shame I couldn't figure out how to make the zero extend
> approach work correctly. That aside, I'm concerned that this patch still
> isn't correct, because it doesn't seem to be using the parallel rtx
> correctly.

No worries, it's the standard way of requesting a scratch register, and nothing will try to use the result of a CLOBBER on it.  That being said, we could indeed try and split the instructions for better scheduling, although the TLS pattern for the Sun linker is multi-insn too, see tls_initial_exec_64_sun.

I'm attaching a minor update which uses named insns to simplify the code.
Comment 33 Eric Botcazou 2024-10-08 17:11:28 UTC
Created attachment 59298 [details]
WIP patch
Comment 34 Julian Waters 2024-10-09 05:34:14 UTC
(In reply to Eric Botcazou from comment #32)
> > It's a bit of a shame I couldn't figure out how to make the zero extend
> > approach work correctly. That aside, I'm concerned that this patch still
> > isn't correct, because it doesn't seem to be using the parallel rtx
> > correctly.
> 
> No worries, it's the standard way of requesting a scratch register, and
> nothing will try to use the result of a CLOBBER on it.  That being said, we
> could indeed try and split the instructions for better scheduling, although
> the TLS pattern for the Sun linker is multi-insn too, see
> tls_initial_exec_64_sun.
> 
> I'm attaching a minor update which uses named insns to simplify the code.

Perhaps splitting the 3 instructions that make up the thread pointer load into the 2 instructions that can be done in parallel and the last one that depends on the 2 is an enhancement that can be done, yes. Right now it seems like gcc cannot compile libgomp again though, this time when done with bootstrap enabled

../../../gcc-14.2.0/libgomp/team.c: In function 'gomp_team_start':
../../../gcc-14.2.0/libgomp/team.c:940:1: error: unrecognizable insn:
  940 | }
      | ^
(insn 290 289 291 12 (set (reg:DI 406)
        (const:DI (plus:DI (unspec:DI [
                        (symbol_ref:DI ("gomp_tls_data") [flags 0x2a] <var_decl 0000000003e98120 gomp_tls_data>)
                    ] UNSPEC_SECREL32)
                (const_int 16 [0x10])))) "../../../gcc-14.2.0/libgomp/team.c":354:17 -1
     (nil))
during RTL pass: vregs

The corresponding command line was

libtool: compile:  /c/Users/vertig0/Downloads/eclipse-committers-2023-12-R-win32-x86_64/Workspace/MINGW-packages/mingw-w64-gcc/src/build-UCRT64/./gcc/xgcc -B/c/Users/vertig0/Downloads/eclipse-committers-2023-12-R-win32-x86_64/Workspace/MINGW-packages/mingw-w64-gcc/src/build-UCRT64/./gcc/ -L/c/Users/vertig0/Downloads/eclipse-committers-2023-12-R-win32-x86_64/Workspace/MINGW-packages/mingw-w64-gcc/src/build-UCRT64/./gcc -isystem /ucrt64/x86_64-w64-mingw32/include -isystem /ucrt64/include -B/ucrt64/x86_64-w64-mingw32/bin/ -B/ucrt64/x86_64-w64-mingw32/lib/ -isystem /ucrt64/x86_64-w64-mingw32/include -isystem /ucrt64/x86_64-w64-mingw32/sys-include -fno-checking -DHAVE_CONFIG_H -I. -I../../../gcc-14.2.0/libgomp -I../../../gcc-14.2.0/libgomp/config/mingw32 -I../../../gcc-14.2.0/libgomp/config/posix -I../../../gcc-14.2.0/libgomp -I../../../gcc-14.2.0/libgomp/../include -pthread -Wall -g -march=nocona -msahf -mtune=generic -O2 -pipe -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong -Wp,-D__USE_MINGW_ANSI_STDIO=1 -MT team.lo -MD -MP -MF .deps/team.Tpo -c ../../../gcc-14.2.0/libgomp/team.c  -DDLL_EXPORT -DPIC -o .libs/team.o

I will work on trying to fix this and creating a reproducer for the issue. The patch used here is slightly different because I'm testing it on gcc 14
Comment 35 Julian Waters 2024-10-09 05:35:46 UTC
Created attachment 59300 [details]
gcc 14 version, broken
Comment 36 Julian Waters 2024-10-10 05:22:50 UTC
No luck in creating a reproducer or even figuring out why the plus is happening directly on the UNSPEC_SECREL32 unfortunately. There's no hint as to why this is happening at all
Comment 37 Eric Botcazou 2024-10-10 08:15:46 UTC
Created attachment 59304 [details]
Minimal reproducer

To be compiled with -O:

team.c: In function 'gomp_team_start':
team.c:22:1: error: unrecognizable insn:
   22 | }
      | ^
(insn 8 7 9 2 (set (reg:DI 101)
        (const:DI (plus:DI (unspec:DI [
                        (symbol_ref:DI ("gomp_tls_data") [flags 0x2a] <var_decl 0x7ffb0f019cf0 gomp_tls_data>)
                    ] UNSPEC_SECREL32)
                (const_int 8 [0x8])))) "team.c":21:24 -1
     (nil))
during RTL pass: vregs
team.c:22:1: internal compiler error: in extract_insn, at recog.cc:2882
Comment 38 Eric Botcazou 2024-10-10 09:14:40 UTC
It's the legitimate_pic_operand_p hunk that I dropped earlier...
Comment 39 Eric Botcazou 2024-10-10 09:14:58 UTC
Created attachment 59305 [details]
WI
Comment 40 Eric Botcazou 2024-10-10 09:15:30 UTC
Created attachment 59306 [details]
WIP patch #2
Comment 41 Uroš Bizjak 2024-10-10 10:05:17 UTC
(In reply to Eric Botcazou from comment #32)
> > It's a bit of a shame I couldn't figure out how to make the zero extend
> > approach work correctly. That aside, I'm concerned that this patch still
> > isn't correct, because it doesn't seem to be using the parallel rtx
> > correctly.
> 
> No worries, it's the standard way of requesting a scratch register, and
> nothing will try to use the result of a CLOBBER on it.  That being said, we
> could indeed try and split the instructions for better scheduling, although
> the TLS pattern for the Sun linker is multi-insn too, see
> tls_initial_exec_64_sun.

No, please don't look there, it is just an old forgotten insn pattern, intended for an old quirky Sun linker (where nobody knows the details of its quirks anymore). It should be converted to split the asm sequence, or (even better) removed altogether by raising the minimum linker version.

Multi insn can be used as an interim solution, but please use modern approach to create TLS sequences. All infrastructure is already there.

Looking at Comment #19 testcase:

a) Create _tls_index symbol in a similar way to ix86_tls_get_addr or ix86_tls_module_base (i386.cc). Using this RTX in your sequence will take care of (%rip) suffix and different number of _ prefixes automatically

b) Create thread pointer using get_thread_pointer<mode> expander. Please note how initial RTX is split to use generic moves using memory location with address space with an absolute offset (const0_rtx in case of ELF).

c) Emit complete TLS sequence in legitimize_tls_address (i386.cc). Please see TLS_MODEL_INITIAL_EXEC part for your case.

Using the above approach, you won't need any special move instructions because generic moves should be able to handle all specifics of the TLS sequence. You will have to add your UNSPEC to various predicate functions; it is probably easiest to grep for UNSPEC_GOTTPOFF and add your new unspec nearby.
Comment 42 Sam James 2024-10-10 10:22:15 UTC
(In reply to Julian Waters from comment #36)
> No luck in creating a reproducer or even figuring out why the plus is
> happening directly on the UNSPEC_SECREL32 unfortunately. There's no hint as
> to why this is happening at all

I'd really strongly recommend you do development on trunk and then backport it later, partly by checking for any changes to the affected files between 14 and trunk.

Anyway, while Eric has reduced it for you, I recommend you consider trying to use cvise to reduce it yourself as an exercise: https://wiki.gentoo.org/wiki/GCC_ICE_reporting_guide (there's also https://gcc.gnu.org/wiki/A_guide_to_testcase_reduction but it's less accessible IMO).
Comment 43 Eric Botcazou 2024-10-10 10:42:20 UTC
Created attachment 59307 [details]
WIP patch #3

Add the configure check for the PE linker.
Comment 44 Eric Botcazou 2024-10-10 10:43:24 UTC
> Multi insn can be used as an interim solution, but please use modern
> approach to create TLS sequences. All infrastructure is already there.

OK, thanks for the detailed instructions.
Comment 45 Eric Botcazou 2024-10-10 11:25:23 UTC
> c) Emit complete TLS sequence in legitimize_tls_address (i386.cc). Please
> see TLS_MODEL_INITIAL_EXEC part for your case.

Note that there is no indirection on the offset for TARGET_WIN32_TLS so it's similar to TLS_MODEL_LOCAL_EXEC rather than to TLS_MODEL_INITIAL_EXEC.
Comment 46 Eric Botcazou 2024-10-10 19:09:28 UTC
> Note that there is no indirection on the offset for TARGET_WIN32_TLS so it's
> similar to TLS_MODEL_LOCAL_EXEC rather than to TLS_MODEL_INITIAL_EXEC.

It's more of a TLS_MODEL_LOCAL_DYNAMIC model in the end, but I think we'd rather keep the implementation separate for the sake of clarity.
Comment 47 Eric Botcazou 2024-10-10 19:11:09 UTC
Created attachment 59315 [details]
Candidate patch
Comment 48 Uroš Bizjak 2024-10-10 21:43:12 UTC
Comment on attachment 59315 [details]
Candidate patch

>+static rtx
>+ix86_tls_index (void)
>+{
>+  if (!ix86_tls_index_symbol)
>+    ix86_tls_index_symbol = gen_rtx_SYMBOL_REF (Pmode, "_tls_index");
>+
>+  if (flag_pic)
>+    {
>+      rtx unspec = gen_rtx_UNSPEC (Pmode, gen_rtvec (1, ix86_tls_index_symbol),
>+				   UNSPEC_PCREL);
>+      return gen_rtx_CONST (Pmode, unspec);

Please note that RIP-relative addresses are one byte shorter than absolute addresses and are interchangeable on x86_64 Linux. If this is also true on Windows (UNSPEC_PCREL was introduced for PE linkers) then the above should also be emitted without flag_pic. Looking through i386.cc, there are some (flag_pic) asserts with UNSPEC_PCREL, but perhaps these can be relaxed to also support RIP-relative addresses without -fPIC.
Comment 49 LIU Hao 2024-10-11 01:25:24 UTC
On Windows x64 almost all symbols in the flat address space are to be referenced by RIP-relative addressing. I don't know whether things would work otherwise.

This corresponds to GCC's `-mcmodel=medium` and Clang's `-mcmodel=small` (both default). GCC uses RIP-relative addressing for all code models, while Clang does the same for `small` and emits absolute addresses for `medium` and `large`.
Comment 50 Eric Botcazou 2024-10-11 06:28:27 UTC
> Please note that RIP-relative addresses are one byte shorter than absolute
> addresses and are interchangeable on x86_64 Linux. If this is also true on
> Windows (UNSPEC_PCREL was introduced for PE linkers) then the above should
> also be emitted without flag_pic.

The crucial bits are these in config/i386/cygming.h:

/* Don't allow flag_pic to propagate since gas may produce invalid code
   otherwise.  */

#undef  SUBTARGET_OVERRIDE_OPTIONS
#define SUBTARGET_OVERRIDE_OPTIONS					\
do {									\
  flag_pic = TARGET_64BIT ? 1 : 0;                                      \
} while (0)

so 64-bit code is always PIC whereas 32-bit code is never PIC.
Comment 51 Julian Waters 2024-10-11 08:35:57 UTC
Created attachment 59318 [details]
Attempt to parallelize the load from gs/fs and load of _tls_index

I've written a slightly different version of the patch, with the following differences

- The configure check is not implemented yet, but will be soon. I was busy focusing on getting other parts right first
- I've added a 'd' section flag to the .tls section to be emitted. clang does this and according to as documentation this is required to mark a section as a data section
- I've attempted to parallelize the part where the 2 loads can happen in parallel. This doesn't seem to work, since it fails with an unrecognizable insn error. My lack of knowledge of how RTL functions is really showing here
- I didn't touch the load_tp code in i386.md for this one, since that one seems to be for Linux and would break Linux if it were changed like the candidate patch did?
- I know it's been mentioned that $ after .tls isn't required, but from my memory code like .tls$XXX can be emitted, making the $ necessary. Someone correct me if I'm wrong
- Similarly, the !DECL_P was said to not be required, but upon closer examination, the ELF select_section seems to do that, so I've left it in since it does look like it is used for some purpose

A question: Does gen_rtx_SYMBOL_REF take care of the preceding _ of _tls_index on 32 bit?
Comment 52 Eric Botcazou 2024-10-11 08:58:20 UTC
> - Similarly, the !DECL_P was said to not be required, but upon closer
> examination, the ELF select_section seems to do that, so I've left it in
> since it does look like it is used for some purpose

No, it's plain dead code since a VAR_DECL is a DECL_P.

> A question: Does gen_rtx_SYMBOL_REF take care of the preceding _ of
> _tls_index on 32 bit?

Yes, it does.
Comment 53 Julian Waters 2024-10-11 09:03:12 UTC
Alright, will remove the DECL_P
Comment 54 Uroš Bizjak 2024-10-11 10:05:51 UTC
(In reply to Julian Waters from comment #51)
> Created attachment 59318 [details]
> Attempt to parallelize the load from gs/fs and load of _tls_index
> 
> I've written a slightly different version of the patch, with the following
> differences
> 
> - The configure check is not implemented yet, but will be soon. I was busy
> focusing on getting other parts right first
> - I've added a 'd' section flag to the .tls section to be emitted. clang
> does this and according to as documentation this is required to mark a
> section as a data section
> - I've attempted to parallelize the part where the 2 loads can happen in
> parallel. This doesn't seem to work, since it fails with an unrecognizable
> insn error. My lack of knowledge of how RTL functions is really showing here

It is better to emit the loads separately, as the candidate patch does. These loads decay to standard insn patterns. Standard patterns provide additional information to the compiler (various insn attributes), so the compiler can make better decisions (e.g. in instruction scheduling).

> - I didn't touch the load_tp code in i386.md for this one, since that one
> seems to be for Linux and would break Linux if it were changed like the
> candidate patch did?

No, these changes in the candidate patch are OK. In addition to the TLS segment register, they generalize the TLS offset.

> - I know it's been mentioned that $ after .tls isn't required, but from my
> memory code like .tls$XXX can be emitted, making the $ necessary. Someone
> correct me if I'm wrong
> - Similarly, the !DECL_P was said to not be required, but upon closer
> examination, the ELF select_section seems to do that, so I've left it in
> since it does look like it is used for some purpose
> 
> A question: Does gen_rtx_SYMBOL_REF take care of the preceding _ of
> _tls_index on 32 bit?
Comment 55 Julian Waters 2024-10-11 13:38:29 UTC
Created attachment 59319 [details]
Attempt to parallelize the load from gs/fs and load of _tls_index

Deleted !DECL_P
Comment 56 Julian Waters 2024-10-11 15:53:40 UTC
Ah, I see. I had been under the impression that gcc would see the parallel and realize that the 2 loads could be done at the same time. Since it can see that without the parallel anyway, and doing so allows gcc to emit more efficient code, I'll remove it (Not like the parallel approach worked anyway, it resulted in an unrecognizable insn...)
Comment 57 Julian Waters 2024-10-11 18:34:55 UTC
Just a heads up, the minimal reproducer seems to be getting garbage movabsq instructions emitted again with the first stage gcc in the bootstrap phase

	.file	"tls.c"
	.text
	.section	.text.unlikely,"x"
	.globl	gomp_team_start
	.def	gomp_team_start;	.scl	2;	.type	32;	.endef
	.seh_proc	gomp_team_start
gomp_team_start:
	pushq	%rdi
	.seh_pushreg	%rdi
	pushq	%rsi
	.seh_pushreg	%rsi
	subq	$40, %rsp
	.seh_stackalloc	40
	.seh_endprologue
	leaq	gomp_team_start_team(%rip), %rdi
	movabsq	$8+gomp_tls_data@secrel32, %rsi
	call	gomp_display_affinity_thread
	movl	_tls_index(%rip), %edx
	movl	$10, %ecx
	movq	%gs:88, %rax
	addq	(%rax,%rdx,8), %rsi
	rep movsl
	addq	$40, %rsp
	popq	%rsi
	popq	%rdi
	ret
	.seh_endproc
	.globl	gomp_tls_data
	.section	.tls$,"dw"
	.align 8
gomp_tls_data:
	.space 48
	.globl	gomp_team_start_team
	.bss
	.align 32
gomp_team_start_team:
	.space 40
	.ident	"GCC: (Rev1, Built by MSYS2 project) 14.2.0"
	.def	gomp_display_affinity_thread;	.scl	2;	.type	32;	.endef

This was caught when team.c started to fail in the assembler
Comment 58 Julian Waters 2024-10-11 18:48:32 UTC
Created attachment 59322 [details]
Lastest TLS
Comment 59 Uroš Bizjak 2024-10-11 19:17:12 UTC
(In reply to Julian Waters from comment #57)
> Just a heads up, the minimal reproducer seems to be getting garbage movabsq
> instructions emitted again with the first stage gcc in the bootstrap phase

You probably need to adjust pic_32bit_operand predicate (called via x86_64_movabs_operand).
Comment 60 Uroš Bizjak 2024-10-11 19:29:27 UTC
(In reply to Uroš Bizjak from comment #59)
> (In reply to Julian Waters from comment #57)
> > Just a heads up, the minimal reproducer seems to be getting garbage movabsq
> > instructions emitted again with the first stage gcc in the bootstrap phase
> 
> You probably need to adjust pic_32bit_operand predicate (called via
> x86_64_movabs_operand).

Please also note symbolic_operand predicate (called from pic_32bit_operand) and perhaps other relevant predicates in i386/predicates.md
Comment 61 Eric Botcazou 2024-10-15 09:07:18 UTC
> Just a heads up, the minimal reproducer seems to be getting garbage movabsq
> instructions emitted again with the first stage gcc in the bootstrap phase

I cannot reproduce on the mainline though.
Comment 62 Julian Waters 2024-10-15 09:31:22 UTC
That's gonna be a problem, sigh. The only noteworthy difference I can see between the 2 patches that is related to the secrel32 unspec is one of the GET_CODE == SYMBOL_REF is enclosed in brackets. Unless this significantly changes the true/false evaluation in that branch the reason for this broken assembly eludes me. I'll keep digging in the meantime
Comment 63 Julian Waters 2024-10-15 09:32:57 UTC
(I know the predicates have been brought up as a potential cause for this, but if it cannot be replicated with the candidate patch the problem may lie elsewhere)
Comment 64 Julian Waters 2024-10-15 10:43:19 UTC
Just tried it again, it emits broken assembly on both master and gcc 14 with the "Latest TLS" patch
Comment 65 Eric Botcazou 2024-10-15 10:53:07 UTC
> Just tried it again, it emits broken assembly on both master and gcc 14 with
> the "Latest TLS" patch

What command line do you use to compile the minimal reproducer?
Comment 66 Eric Botcazou 2024-10-15 10:59:34 UTC
> That's gonna be a problem, sigh. The only noteworthy difference I can see
> between the 2 patches that is related to the secrel32 unspec is one of the
> GET_CODE == SYMBOL_REF is enclosed in brackets. Unless this significantly
> changes the true/false evaluation in that branch the reason for this broken
> assembly eludes me. I'll keep digging in the meantime

Well, your version is substantially different than mine, so no wonder that they behave differently...  We should probably use mine at this point.
Comment 67 Julian Waters 2024-10-15 14:46:45 UTC
Command line used in compiling the reproducer: xgcc -O2 -S -std=c11 -pedantic -Wpedantic tls.c

The thing that has me puzzled is that the main differences between both patches are in the load of the primary thread pointer. The PLUS on the relocation, which is the faulty part, is pretty much exactly the same between both patches, even down to the special casing of the relocation in all the checking code. Oh well, I guess it's some internal gcc quirk
Comment 68 Julian Waters 2024-10-17 07:49:00 UTC
I don't know why but apparently using force_reg and copy_to_mode_reg to load into registers instead of using gen_rtx_SET and emit_insn fixes the garbage movabsq instructions. I was going to ask why, but I suspect the reason is buried so deeply in gcc's source code and is such an edge case that others probably don't know why this is happening either
Comment 69 Julian Waters 2024-10-23 08:34:27 UTC
I apologize for vanishing suddenly and not giving progress reports, I am currently busy with some JDK work. The only thing left missing is the configure check. I will return to finishing TLS support once and for all when my JDK fixes have been completed
Comment 70 Eric Botcazou 2024-10-30 08:41:15 UTC
> I apologize for vanishing suddenly and not giving progress reports, I am
> currently busy with some JDK work. The only thing left missing is the
> configure check. I will return to finishing TLS support once and for all
> when my JDK fixes have been completed

The configure check is in the candidate patch though.
Comment 71 Julian Waters 2024-11-01 18:36:01 UTC
Created attachment 59516 [details]
Latest TLS
Comment 72 Sam James 2024-11-02 02:12:33 UTC
Note that your patch doesn't include the changes to the generated configure. While Eric knows that and likely decided not to include them in his, I'm pointing it out because, for your own patch, your changes might not actually be tested if you haven't regenerated configure locally.
Comment 73 Julian Waters 2024-11-05 00:55:54 UTC
Thanks for the reminder, I did choose to leave that out since I was under the impression that regenerating configure should be done in a separate commit. I chose a different approach for the configure check since it seems neater and sets the appropriate flag for linker support (HAVE_AS_TLS, despite what the name suggests, represents support for both the assembler and linker) instead of introducing a new one. Is this one good to go? If there are any logic errors in the new code feel free to point them out to me
Comment 74 Julian Waters 2024-11-12 13:55:38 UTC
Sorry for the noise, any feedback on the new patch?
Comment 75 Julian Waters 2024-11-18 05:34:49 UTC
Any feedback on the new patch?
Comment 76 LIU Hao 2024-11-20 02:53:27 UTC
I can include this patch for some testing on GCC 14 now.
Comment 77 LIU Hao 2024-11-20 04:38:27 UTC
../../gcc/gcc/config/i386/i386.cc: In function 'rtx_def* legitimize_tls_address(rtx, tls_model, bool)':
../../gcc/gcc/config/i386/i386.cc:12196:27: error: 'GOT_ALIAS_SET' was not declared in this scope; did you mean 'MEM_ALIAS_SET'?
12196 |   set_mem_alias_set (off, GOT_ALIAS_SET);
      |                           ^~~~~~~~~~~~~
      |                           MEM_ALIAS_SET
Comment 78 LIU Hao 2024-11-20 05:53:26 UTC
I changed it to `ix86_GOT_alias_set()` and checked output assembly. The patch should be fine for these setups:

  * x86_64-w64-mingw32 (-O0, -O1, -O2, -Os)
  * i686-w64-mingw32 (-O0, -O1, -O2, -Os)

Simple test program:

```
#include <assert.h>

struct Data
  {
    int value;
    Data() { value = 42; }
    ~Data() { }
  };

thread_local Data ecx;
thread_local Data esp;

int
accumulate(int r)
  {
    int t = ecx.value;
    ecx.value += r;
    t += esp.value;
    esp.value ++;
    return t;
  }

int
main(void)
  {
    assert(ecx.value == 42);
    assert(esp.value == 42);

    assert(accumulate(10) == 42 + 42);
    assert(accumulate(11) == 52 + 43);
    assert(accumulate(12) == 63 + 44);
  }

```
Comment 79 LIU Hao 2024-11-20 05:59:09 UTC
Created attachment 59639 [details]
quote symbols for intel syntax

This patch is necessary for TLS in Intel syntax to work with GNU AS. Details follow.

This is actually a bit more complicated, and lacks a check for assembler support in configure. I will not care about those assemblers; please update the patch as necessary.



[00:48:32] <lh_mouse> theshermantanker:   am going to bed soon so I would like to provide some details about the GAS parser.
[00:49:45] <lh_mouse> for AT&T syntax:   Control flow enters `i386_displacement`, where you will find this:
    gotfree_input_line = lex_got (&i.reloc[this_operand], NULL, &types);
    if (gotfree_input_line)
      input_line_pointer = gotfree_input_line;
    expr_mode = expr_operator_none;
    exp_seg = expression (exp);
[00:50:46] <lh_mouse> `lex_got` might stand for 'lexically parse for GOT'  but anyway,  it strips `@SECREL32` from the symbol and returns a malloc'd buffer of the symbol without it.
[00:51:39] <lh_mouse> this unsuffixed symbol is assigned to `input_line_pointer` (static variable), then `expression`  is invoked to parse it as an expression. 
[00:52:14] <lh_mouse> the `@SECREL32` thing affects what will be stored into `&i.reloc[this_operand]`, the first argument to `lex_got`.
[00:52:24] <lh_mouse> -- end AT&T suffix]  
[00:53:17] <lh_mouse> for Intel syntax:  Notice the macro `md_operator` which is only defined for x86.  In 'expr.c'  there is:
    if (is_name_beginner (c) || c == '"') /* Here if did not begin with a digit.  */
      {
        /* Identifier begins here.
           This is kludged for speed, so code is repeated.  */
      isname:
        -- input_line_pointer;
        c = get_symbol_name (&name);
[00:54:40] <lh_mouse> Control flow enters `md_operand` where you will find a `case '[':`;   this is how Intel syntax is parsed;  the stuff in [] is parsed as an expression.
[00:55:59] <lh_mouse>  when an identifier is encountered, in 'expr.c',   `get_symbol_name` is used to parse it.  but it knows nothing about the `@SECREL32` thing  and mistakes it as the symbol name.
[00:56:28] <lh_mouse> (note how  this path differs from AT&T where `expression` is called on the result of `lex_got`.)
[00:57:29] <lh_mouse> so, later `md_operator` will not see the @ character, because it has already been mistaken as part of the symbol.
[00:58:22] <lh_mouse> this can be worked around for GAS by inserting a space before @;  Clang does not accept this workaround, so please only play with it at home.
[01:01:14] <lh_mouse> hope you know how to fix it now.  :)  good night.
Comment 80 LIU Hao 2024-11-20 10:58:28 UTC
In libstdc++ <mutex> there is:

```
  /// @cond undocumented
# ifdef _GLIBCXX_HAVE_TLS
  // If TLS is available use thread-local state for the type-erased callable
  // that is being run by std::call_once in the current thread.
  extern __thread void* __once_callable;
  extern __thread void (*__once_call)();
```

On Windows `_tls_index` is a module-specific variable and is not exported. There is no way to export a `__thread` variable, nor to access it from a different module.

As native TLS is already an ABI break, here are two possible fixes:

1. Change these to getter/setter functions.
2. Get rid of this hack and call `__cxa_guard_acquire`, `__cxa_guard_abort` and `__cxa_guard_release` instead.
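
As a rough illustration of option 1, the thread-local variable can stay private to its module while only accessor functions cross the module boundary. This is a minimal sketch in plain C; the names `once_callable_slot`, `get_once_callable` and `set_once_callable` are hypothetical and are not the actual libstdc++ symbols.

```c
/* Hypothetical names; these are not the actual libstdc++ symbols.
   The variable has internal linkage, so only the two accessor
   functions would need to be exported from a DLL.  */
static _Thread_local void *once_callable_slot;

/* Returns the calling thread's value of the slot.  */
void *get_once_callable(void)
{
    return once_callable_slot;
}

/* Stores a new value in the calling thread's slot.  */
void set_once_callable(void *callable)
{
    once_callable_slot = callable;
}
```

Because callers in other modules only ever call these functions, they never need the defining module's `_tls_index`.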
Comment 81 Uroš Bizjak 2024-11-21 08:43:50 UTC
(In reply to Julian Waters from comment #75)
> Any feedback on the new patch?

I would propose you legitimize TLS address using get_thread_pointer (as is the case with Eric's patch). Generic optimizers are then able to optimize the access to the symbol and later rewrite the address to a TLS named address space.

Please consider this testcase (very relevant on linux, I don't know about Windows):

--cut here--
extern __thread int i[8];

int foo (void)
{
  return i[2] + i[4];
}
--cut here--

Using get_thread_pointer, the above is expanded into:

(insn 5 2 6 2 (set (reg:DI 102)
        (mem/u/c:DI (const:DI (unspec:DI [
                        (symbol_ref:DI ("i") [flags 0x60]  <var_decl 0x7fbe95c10c60 i>)
                    ] UNSPEC_GOTNTPOFF)) [2  S8 A8])) "tls.c":5:11 95 {*movdi_internal}
     (nil))
(insn 6 5 7 2 (set (reg:DI 103)
        (mem/u/c:DI (const:DI (unspec:DI [
                        (symbol_ref:DI ("i") [flags 0x60]  <var_decl 0x7fbe95c10c60 i>)
                    ] UNSPEC_GOTNTPOFF)) [2  S8 A8])) "tls.c":5:18 95 {*movdi_internal}
     (nil))
(insn 7 6 8 2 (set (reg:SI 104)
        (mem/c:SI (plus:DI (plus:DI (unspec:DI [
                            (const_int 0 [0])
                        ] UNSPEC_TP)
                    (reg:DI 102))
                (const_int 8 [0x8])) [1 i[2]+0 S4 A32])) "tls.c":5:15 96 {*movsi_internal}
     (nil))
(insn 8 7 9 2 (set (reg:SI 105)
        (mem/c:SI (plus:DI (plus:DI (unspec:DI [
                            (const_int 0 [0])
                        ] UNSPEC_TP)
                    (reg:DI 103))
                (const_int 16 [0x10])) [1 i[4]+0 S4 A32])) "tls.c":5:15 96 {*movsi_internal}
     (nil))
(insn 9 8 10 2 (parallel [
            (set (reg:SI 101 [ _4 ])
                (plus:SI (reg:SI 104)
                    (reg:SI 105)))
            (clobber (reg:CC 17 flags))
        ]) "tls.c":5:15 283 {*addsi_1}
     (expr_list:REG_EQUAL (plus:SI (mem/c:SI (plus:DI (plus:DI (unspec:DI [
                                (const_int 0 [0])
                            ] UNSPEC_TP)
                        (reg:DI 102))
                    (const_int 8 [0x8])) [1 i[2]+0 S4 A32])
            (mem/c:SI (plus:DI (plus:DI (unspec:DI [
                                (const_int 0 [0])
                            ] UNSPEC_TP)
                        (reg:DI 103))
                    (const_int 16 [0x10])) [1 i[4]+0 S4 A32]))
        (nil)))

Please note how UNSPEC_TP forms a legitimate address in (insn 9). Generic optimizers optimize the above to the following RTX sequence:

(insn 5 2 7 2 (set (reg:DI 102)
        (mem/u/c:DI (const:DI (unspec:DI [
                        (symbol_ref:DI ("i") [flags 0x60]  <var_decl 0x7fbe95c10c60 i>)
                    ] UNSPEC_GOTNTPOFF)) [2  S8 A8])) "tls.c":5:11 95 {*movdi_internal}
     (nil))
(note 7 5 8 2 NOTE_INSN_DELETED)
(insn 8 7 9 2 (set (reg:SI 105 [ i[4] ])
        (mem/c:SI (plus:DI (plus:DI (unspec:DI [
                            (const_int 0 [0])
                        ] UNSPEC_TP)
                    (reg:DI 102))
                (const_int 16 [0x10])) [1 i[4]+0 S4 A32])) "tls.c":5:15 96 {*movsi_internal}
     (nil))
(insn 9 8 14 2 (parallel [
            (set (reg:SI 101 [ _4 ])
                (plus:SI (mem/c:SI (plus:DI (plus:DI (unspec:DI [
                                        (const_int 0 [0])
                                    ] UNSPEC_TP)
                                (reg:DI 102))
                            (const_int 8 [0x8])) [1 i[2]+0 S4 A32])
                    (reg:SI 105 [ i[4] ])))
            (clobber (reg:CC 17 flags))
        ]) "tls.c":5:15 283 {*addsi_1}
     (expr_list:REG_DEAD (reg:DI 102)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (expr_list:REG_DEAD (reg:SI 105 [ i[4] ])
                (nil)))))

And the above sequence is later rewritten to use TLS named address space (please note AS1 in the address):

(insn 5 2 7 2 (set (reg:DI 102)
        (mem/u/c:DI (const:DI (unspec:DI [
                        (symbol_ref:DI ("i") [flags 0x60]  <var_decl 0x7fbe95c10c60 i>)
                    ] UNSPEC_GOTNTPOFF)) [2  S8 A8])) "tls.c":5:11 95 {*movdi_internal}
     (nil))
(note 7 5 18 2 NOTE_INSN_DELETED)
(insn 18 7 19 2 (set (reg:SI 105 [ i[4] ])
        (mem/c:SI (plus:DI (reg:DI 102)
                (const_int 16 [0x10])) [1 i[4]+0 S4 A32 AS1])) "tls.c":5:15 -1
     (nil))
(insn 19 18 14 2 (parallel [
            (set (reg:SI 101 [ _4 ])
                (plus:SI (mem/c:SI (plus:DI (reg:DI 102)
                            (const_int 8 [0x8])) [1 i[2]+0 S4 A32 AS1])
                    (reg:SI 105 [ i[4] ])))
            (clobber (reg:CC 17 flags))
        ]) "tls.c":5:15 -1
     (nil))

and this results in the optimal assembly:

foo:
        movq    i@gottpoff(%rip), %rdx
        movl    %fs:16(%rdx), %eax
        addl    %fs:8(%rdx), %eax
        ret

BTW, adding -mno-tls-direct-seg-refs to compile flags (that avoids optimizations with segment register in the address) results in:

foo:
        movq    %fs:0, %rcx
        movq    i@gottpoff(%rip), %rdx
        movl    16(%rcx,%rdx), %eax
        addl    8(%rcx,%rdx), %eax
        ret
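
For anyone who wants to inspect the generated code locally, a self-contained variant of the testcase may be handy. This is only a sketch: the original declares `i` as `extern`, whereas the definition and initializer below are assumptions added so that the file links on its own, and they may change which TLS model the compiler picks.

```c
/* Definition with an arbitrary initializer, added so the example
   links by itself; the testcase above only has an extern declaration.  */
__thread int i[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };

/* Same body as the testcase: two accesses to one TLS array.  */
int foo(void)
{
    return i[2] + i[4];
}
```

Compiling this with `-O2 -S` should show whether the two accesses share a single thread-pointer computation.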
Comment 82 LIU Hao 2024-11-21 12:07:49 UTC
(In reply to Uroš Bizjak from comment #81)
> (In reply to Julian Waters from comment #75)
> > Any feedback on the new patch?
> 
> I would propose you legitimize TLS address using get_thread_pointer (as is
> the case with Eric's patch). Generic optimizers are then able to optimize
> the access to the symbol and later rewrite the address to a TLS named
> address space.
> 
> Please consider this testcase (very relevant on linux, I don't know about
> Windows):
> 
> --cut here--
> extern __thread int i[8];
> 
> int foo (void)
> {
>   return i[2] + i[4];
> }
> --cut here--
> 

This also compiles fine with this patch (backported to GCC 14) which produces:

(x86-32)
```
	mov	eax, DWORD PTR "__tls_index"
	mov	edx, DWORD PTR fs:44
	mov	edx, DWORD PTR [edx+eax*4]
	mov	eax, DWORD PTR "_i"@secrel32[edx+16]
	add	eax, DWORD PTR "_i"@secrel32[edx+8]
	ret
```

(x86-64)
```
	mov	eax, DWORD PTR "_tls_index"[rip]
	mov	rdx, QWORD PTR gs:88
	mov	rdx, QWORD PTR [rdx+rax*8]
	mov	eax, DWORD PTR "i"@secrel32[rdx+16]
	add	eax, DWORD PTR "i"@secrel32[rdx+8]
	ret
```
Comment 83 Julian Waters 2024-11-22 06:46:29 UTC
Liu Hao: The registers it's using seem to be all over the place. Prior it was using rdx for the gs:[88] load and rax for everything else, now it's either using any register it can find, or using rdx to store the result of rdx+rax*8. I have no idea why the resulting assembly is so different, but this could mean the resulting program runs less efficiently

EDIT: Never mind, it was because rax is the return value register and the thread local is an array

extern _Thread_local int local;

int get(void) {
    return local;
}

movl	_tls_index(%rip), %eax
movq	%gs:88, %rdx
movq	(%rdx,%rax,8), %rax
movl	local@secrel32(%rax), %eax

extern _Thread_local int local[8];

int get(void) {
    return local[2] + local[4];
}

movl	_tls_index(%rip), %eax
movq	%gs:88, %rdx
movq	(%rdx,%rax,8), %rdx
movl	16+local@secrel32(%rdx), %eax
addl	8+local@secrel32(%rdx), %eax
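
The address computation behind these sequences can be modelled in portable C. This is a sketch of the arithmetic only, not of the real Windows data structures: `struct thread_model` and `tls_address` are hypothetical names, and the array of block pointers stands in for what the real code loads from `%gs:88`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-thread array of TLS block
   pointers that the real sequence loads from %gs:88.  */
struct thread_model {
    unsigned char **tls_slots;  /* one TLS block pointer per module */
    size_t n_slots;
};

/* Mirrors the listing: load _tls_index, load the array pointer,
   index the array, then add the variable's section-relative offset.  */
void *tls_address(const struct thread_model *t,
                  uint32_t tls_index,  /* the module's _tls_index */
                  uint32_t secrel32)   /* offset of the variable in .tls */
{
    assert(tls_index < t->n_slots);
    unsigned char *block = t->tls_slots[tls_index];
    return block + secrel32;
}
```

In this model, `local[2]` from the second example is `tls_address(t, index, offset_of_local + 8)`, which matches the `8+local@secrel32(%rdx)` operand.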

Uros: I see, I'll try to do so. I was mainly avoiding that to break less code (I have a habit of doing that to anything I touch). Although, the resulting assembly (Barring the register selection) already seems to be as compact as possible for Windows, I'm not sure how using get_thread_pointer could make it any more optimal. This is a genuinely curious question, not placing doubt on whether get_thread_pointer can help optimize the resulting assembly
Comment 84 LIU Hao 2024-11-22 07:03:12 UTC
(In reply to Julian Waters from comment #83)
> Liu Hao: The registers it's using seem to be all over the place. Prior it
> was using rdx for the gs:[88] load and rax for everything else, now it's
> either using any register it can find, or using rdx to store the result of
> rdx+rax*8. I have no idea why the resulting assembly is so different, but
> this could mean the resulting program runs less efficiently

For EAX, ECX, EDX, EBX, ESI, EDI there's usually no difference, except that callee-saved ones (EBX, EBP, ESI and EDI) have to be pushed and popped. Accessing a 64-bit register, as well as R8D ~ R15D, requires an REX prefix, so such an instruction is one byte longer.
Comment 85 LIU Hao 2024-11-22 09:01:15 UTC
Created attachment 59666 [details]
libstdc++ fix

It is not possible to export a `__thread` variable with native TLS, because it's not possible to access `_tls_used` of a different module.

The linker (GNU LD) shall not export such a symbol. It is exporting an unusable symbol now, so it's a bug.

libstdc++ shall not export and import `__thread` variables. Attached is a patch for MCF thread model about a different PR, but I think this is a good opportunity to fix it forever, as native TLS is a huuuuge ABI break already.
Comment 86 Uroš Bizjak 2024-11-22 10:36:00 UTC
(In reply to Julian Waters from comment #83)

> Uros: I see, I'll try to do so. I was mainly avoiding that to break less
> code (I have a habit of doing that to anything I touch). Although, the
> resulting assembly (Barring the register selection) already seems to be as
> compact as possible for Windows, I'm not sure how using get_thread_pointer
> could make it any more optimal. This is a genuinely curious question, not
> placing doubt on whether get_thread_pointer can help optimize the resulting
> assembly

I can speak from the Linux perspective: when the thread pointer is modelled as an UNSPEC, the generic part of the compiler can optimize access to the location as shown in Comment #81. Many optimizations are performed, and following the current implementation ensures that your target won't be left behind when a new generic optimization is introduced.

That said, and looking at your code in Comment #83, it looks like on Windows, TLS access can't use a gs: prefixed address (similar to Linux with -mno-tls-direct-seg-refs). If this is the case, then generating UNSPEC via get_thread_pointer is not beneficial, since UNSPEC can't be combined into the address.

Your thread pointer is generated with:

+  tp = gen_const_mem (Pmode, GEN_INT (TARGET_64BIT ? 88 : 44));
+  set_mem_addr_space (tp, DEFAULT_TLS_SEG_REG);

which is in fact what UNSPEC_TP will be split to in the split1 pass.
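
The per-target constants involved can be summarised in a small C helper. This is a sketch with hypothetical names (`struct win_tls_layout`, `win_tls_layout_for`); the values are the ones visible in the listings in Comment #82 (`fs:44` with a `*4` scale on x86-32, `gs:88` with a `*8` scale on x86-64).

```c
/* Hypothetical summary of the constants used by the access sequence.  */
struct win_tls_layout {
    const char *segment;     /* segment register used for the load */
    unsigned tls_array_off;  /* offset of the TLS array pointer */
    unsigned slot_size;      /* scale applied to _tls_index */
};

/* Values taken from the x86-32 and x86-64 listings in this thread.  */
struct win_tls_layout win_tls_layout_for(int target_64bit)
{
    struct win_tls_layout l;
    if (target_64bit) {
        l.segment = "gs";
        l.tls_array_off = 88;
        l.slot_size = 8;
    } else {
        l.segment = "fs";
        l.tls_array_off = 44;
        l.slot_size = 4;
    }
    return l;
}
```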
Comment 87 Julian Waters 2024-11-27 08:06:19 UTC
Eric, I've just come to realize that the configure check might not be needed, because the intention is to only allow native TLS on Windows when --enable-tls is forcefully enabled, similar to win32 thread model in libstdc++ requiring --enable-libstdcxx-threads before it is enabled properly. This in effect restricts native TLS only to vendors who know what they're doing, and they would already know that a specific binutils version without the linker bug is required. By default, the lack of a configure check for TLS achieves just that, and --enable-tls bypasses the check anyway. Anyone have any objections if I back out the (Admittedly also poorly written) check from configure? And anyone have any further objections on the patch in general, minus configure?
Comment 88 Eric Botcazou 2024-11-27 08:33:07 UTC
> Eric, I've just come to realize that the configure check might not be
> needed, because the intention is to only allow native TLS on Windows when
> --enable-tls is forcefully enabled, similar to win32 thread model in
> libstdc++ requiring --enable-libstdcxx-threads before it is enabled
> properly. This in effect restricts native TLS only to vendors who know what
> they're doing, and they would already know that a specific binutils version
> without the linker bug is required. By default, the lack of a configure
> check for TLS achieves just that, and --enable-tls bypasses the check
> anyway. Anyone have any objections if I back out the (Admittedly also poorly
> written) check from configure?

Yes, in my opinion the check is an extra safety net and should be kept now that it has been written.  TLS binutils support is really bleeding edge.
Comment 89 Julian Waters 2024-11-27 09:29:39 UTC
Understood. I will try to improve the check in that case. A side effect of this is that native TLS will be enabled by default for Windows unless --disable-tls is passed; avoiding that would require rewriting the check extensively, which doesn't seem desirable. But besides the configure thing, any other objections?
Comment 90 LIU Hao 2024-11-27 10:00:54 UTC
As mentioned yesterday, the libstdc++ `call_once` has to be reimplemented to get rid of `__once_callable` and `__once_call` which can't be exported from libstdc++ DLL.

You might want to introduce a builtin macro for native TLS, so that when it is defined, 'src/c++11/mutex.cc' will be mostly empty.
Comment 91 LIU Hao 2024-11-29 09:22:24 UTC
Created attachment 59742 [details]
transitional patch for libstdc++
Comment 92 Julian Waters 2025-01-15 06:35:09 UTC
Created attachment 60160 [details]
Latest TLS
Comment 93 Julian Waters 2025-01-15 06:39:23 UTC
I've revised the patch to implement the desired configure semantics. If anyone wants to test the Latest TLS patch, I'd appreciate it, to ensure that it works on more than just my own laptop. Please do check that gcc with the patch applied actually emits native TLS access sequences and doesn't fall back to emutls; a fallback would mean the configure check has a bug in it. If there are no more objections and tests on other people's devices succeed, then I would say this is ready for integration and we could get Jonathan to commit it.
Comment 94 LIU Hao 2025-01-15 08:28:40 UTC
Created attachment 60161 [details]
transitional patch for libstdc++ #2

https://gcc.gnu.org/pipermail/gcc-patches/2024-November/670434.html
Comment 95 LIU Hao 2025-01-16 14:47:41 UTC
To be honest, I think `set_have_as_tls=no` is a rather horrible idea. It changes the output ABI for thread-local storage in an unpredictable way, depending on whichever linker happens to be there and be used.

If the linker can't be used then configure should fail with an eye-catching error that suggests users either upgrade their linker or disable TLS.
Comment 96 Julian Waters 2025-01-27 08:59:29 UTC
Created attachment 60287 [details]
Latest TLS

Changed to hard error on broken @secrel32 support. Do help me check if the configure patch is ok. If it is, it's probably finally ready for integration
Comment 97 LIU Hao 2025-01-31 06:25:25 UTC
(In reply to Julian Waters from comment #96)
> Created attachment 60287 [details]
> Latest TLS
> 
> Changed to hard error on broken @secrel32 support. Do help me check if the
> configure patch is ok. If it is, it's probably finally ready for integration

In the configure check there is `grep '.reloc'`, where the dot matches any character, which is probably not desired. It should be `grep '\.reloc\>'`.

The idea behind that check looks correct.
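
To illustrate the dot issue, here is a small sketch using POSIX regcomp/regexec with basic regular expressions, which is the syntax grep uses by default. The helper name bre_matches is made up for this example. The trailing `\>` is a GNU extension and is left out of the sketch so that it stays portable.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the POSIX basic regex PATTERN matches
   anywhere in TEXT, 0 otherwise.  */
int bre_matches (const char *pattern, const char *text)
{
  regex_t re;
  int rc = regcomp (&re, pattern, REG_NOSUB);
  assert (rc == 0);             /* the patterns used here are known valid */
  rc = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return rc == 0;
}
```

With this, bre_matches (".reloc", "xreloc") returns 1 because the unescaped dot accepts the 'x', while bre_matches ("\\.reloc", "xreloc") returns 0 and bre_matches ("\\.reloc", ".reloc") returns 1.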
Comment 98 Julian Waters 2025-02-04 13:58:43 UTC
Created attachment 60375 [details]
Latest TLS

Hopefully ready for integration now that the grep check is fixed
Comment 99 LIU Hao 2025-02-05 06:31:06 UTC
binutils 2.44 is available in MSYS2 now.
Comment 100 Julian Waters 2025-02-05 06:39:15 UTC
Great! This means it should be able to compile on MSYS2 now. Ready for integration?
Comment 101 LIU Hao 2025-02-05 15:21:05 UTC
I have bootstrapped GCC 15 (master) with this patch applied, but without `--enable-tls`, on {i686,x86_64}-w64-mingw32; and have checked that TLS is emulated as before so it's not an ABI break by default.

I'm preparing more builds with `--enable-tls` now.
Comment 102 LIU Hao 2025-02-06 04:26:35 UTC
I have bootstrapped GCC 15 with native TLS on {i686,x86_64}-w64-mingw32, and have rebuilt these packages; no issues have been observed so far:

* mingw-w64
* mcfgthread
* binutils
* gdb
* mpdecimal
* mpfr
* icu
* iconv
* python
* cmake
* openblas
* boost
Comment 103 LIU Hao 2025-02-19 14:33:31 UTC
New test results on master:

```
UCRT64 ~/Desktop
$ cat test.c
extern __thread int i[8];

int foo (void)
{
  return i[2] + i[4];
}

UCRT64 ~/Desktop
$ x86_64-w64-mingw32-gcc --version
x86_64-w64-mingw32-gcc.exe (GCC with MCF thread model, built by LH_Mouse) 15.0.1 20250205 (experimental)
Copyright (C) 2025 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


UCRT64 ~/Desktop
$ x86_64-w64-mingw32-gcc test.c -O2 -c && objdump -dr test.o

test.o:     file format pe-x86-64


Disassembly of section .text:

0000000000000000 <foo>:
   0:   8b 05 00 00 00 00       mov    eax,DWORD PTR [rip+0x0]        # 6 <foo+0x6>
                        2: IMAGE_REL_AMD64_REL32        _tls_index
   6:   65 48 8b 14 25 58 00    mov    rdx,QWORD PTR gs:0x58
   d:   00 00
   f:   48 8b 14 c2             mov    rdx,QWORD PTR [rdx+rax*8]
  13:   8b 82 10 00 00 00       mov    eax,DWORD PTR [rdx+0x10]
                        15: IMAGE_REL_AMD64_SECREL      i
  19:   03 82 08 00 00 00       add    eax,DWORD PTR [rdx+0x8]
                        1b: IMAGE_REL_AMD64_SECREL      i
  1f:   c3                      ret

MINGW32 ~/Desktop
$ i686-w64-mingw32-gcc --version
i686-w64-mingw32-gcc.exe (GCC with MCF thread model, built by LH_Mouse) 15.0.1 20250205 (experimental)
Copyright (C) 2025 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


MINGW32 ~/Desktop
$ i686-w64-mingw32-gcc test.c -O2 -c && objdump -dr test.o

test.o:     file format pe-i386


Disassembly of section .text:

00000000 <_foo>:
   0:   a1 00 00 00 00          mov    eax,ds:0x0
                        1: dir32        __tls_index
   5:   64 8b 15 2c 00 00 00    mov    edx,DWORD PTR fs:0x2c
   c:   8b 14 82                mov    edx,DWORD PTR [edx+eax*4]
   f:   8b 82 10 00 00 00       mov    eax,DWORD PTR [edx+0x10]
                        11: secrel32    _i
  15:   03 82 08 00 00 00       add    eax,DWORD PTR [edx+0x8]
                        17: secrel32    _i
  1b:   c3                      ret
  1c:   90                      nop
  1d:   90                      nop
  1e:   90                      nop
  1f:   90                      nop

```
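
Read as C, both listings do the same thing: load _tls_index, load the per-thread array of TLS block pointers from gs:0x58 (or fs:0x2c), index that array to get this module's TLS block, then read i at its section-relative offset. A rough portable model follows. struct model_thread, model_tls_load_int and model_foo are made-up names for illustration, and the array is passed in explicitly rather than read through a segment register, so this is only a sketch of the addressing, not Windows code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for what gs:0x58 / fs:0x2c points to: one TLS
   block pointer per loaded module.  */
struct model_thread
{
  unsigned char **tls_slots;
  size_t slot_count;
};

/* Load an int located SECREL bytes into the TLS block of the module
   whose index is TLS_INDEX.  */
int model_tls_load_int (const struct model_thread *t, uint32_t tls_index,
                        size_t secrel)
{
  assert (tls_index < t->slot_count);
  unsigned char *block = t->tls_slots[tls_index]; /* mov rdx, [rdx+rax*8] */
  int value;
  memcpy (&value, block + secrel, sizeof value);  /* mov eax, [rdx+secrel] */
  return value;
}

/* Model of foo () from test.c: i[2] + i[4], where I_SECREL is the
   section-relative offset of i that the linker would fill in.  */
int model_foo (const struct model_thread *t, uint32_t tls_index,
               size_t i_secrel)
{
  return model_tls_load_int (t, tls_index, i_secrel + 4 * sizeof (int))
         + model_tls_load_int (t, tls_index, i_secrel + 2 * sizeof (int));
}
```

The 0x10 and 0x8 displacements in the disassembly correspond to the 4 * sizeof (int) and 2 * sizeof (int) terms here; the SECREL relocation adds the offset of i within the TLS section on top of them.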