Bug 55354 - [asan] by default, the asan run-time should be linked statically, not dynamically
Summary: [asan] by default, the asan run-time should be linked statically, not dynamic...
Status: RESOLVED INVALID
Alias: None
Product: gcc
Classification: Unclassified
Component: sanitizer (show other bugs)
Version: 4.8.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-11-16 17:02 UTC by Konstantin Serebryany
Modified: 2012-11-23 11:14 UTC (History)
7 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Konstantin Serebryany 2012-11-16 17:02:40 UTC
Today, -faddress-sanitizer (to be changed to -fsanitize=address) causes the 
asan run-time library to be linked dynamically. 
It needs to be changed to static linkage because 
  - it matches clang behavior 
  - causes less confusion for users (where is my libasan.so???)
  - better for tsan performance (we'll need to link tsan statically too)


Note that on MacOS we probably want to keep using dynamic linkage
since the future library-interposition-based run-time will require it.
Comment 1 Andrew Pinski 2012-11-16 17:15:31 UTC
>  - it matches clang behavior 

But it is inconsistent with the rest of the target libraries of GCC.

>  - causes less confusion for users (where is my libasan.so???)

So this is a well documented issue already and easy fixed.  

>  - better for tsan performance (we'll need to link tsan statically too)

Not much better performance.
Comment 2 Dmitry Vyukov 2012-11-16 17:20:43 UTC
> Not much better performance.

Sole -fPIE vs -fPIC gives us 20% speedup on real programs. Indirect call will add another 10%.
Comment 3 Jakub Jelinek 2012-11-16 17:22:03 UTC
This has been discussed already.  And Diego agreed Google can have a different default in google branches if desirable for that kind of usage scenarios, but otherwise it is undesirable.
Comment 4 Konstantin Serebryany 2012-11-16 20:28:34 UTC
You have been warned (especially about tsan performance. tsan run-time heavily depends on TLS, and TLS is much slower with -fPIC than with -fPIE).

Do we have a flag today which will cause libasan.a to be linked statically, 
while not forcing anything else to link statically?
Comment 5 Jakub Jelinek 2012-11-16 20:46:35 UTC
For TLS, you can just use -ftls-model=initial-exec or __attribute__((tls_model ("initial-exec"))).  libasan from what I can see doesn't use TLS at all, and you really can't just have -fno-pic libasan.a only anyway, any time you want to instrument a shared library, if you linked non-pic libasan.a into it, it wouldn't link on many architectures and on others would fail SELinux restrictions.
If IE TLS mode is used, there is no problem if an app is linked against it or if libraries it depends on are linked against it, there could be a problem if its TLS usage is too large and app isn't linked against the library, and you only dlopen some -fsanitize=address compiled/linked shared library.  Then dlopen could fail (unless you e.g. LD_PRELOAD=libasan.so.0 or otherwise make sure the app is linked against it).  Other GCC shared libraries (e.g. libgomp.so.1) are also using the IE model.
Comment 6 Konstantin Serebryany 2012-11-16 20:54:40 UTC
Answering my own question: we can get static linking with 
 -Wl,-Bstatic -lasan -Wl,-Bdynamic -ldl -lpthread 

>> For TLS, you can just use -ftls-model=initial-exec 
I did not know, thanks. 
Sounds like this could be a good solution for tsan. Will check.
Comment 7 H.J. Lu 2012-11-17 20:35:57 UTC
(In reply to comment #6)
> Answering my own question: we can get static linking with 
>  -Wl,-Bstatic -lasan -Wl,-Bdynamic -ldl -lpthread 
> 

The -static-libasan option was added by

http://gcc.gnu.org/ml/gcc-cvs/2012-11/msg00536.html
Comment 8 Markus Trippelsdorf 2012-11-17 21:08:21 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > Answering my own question: we can get static linking with 
> >  -Wl,-Bstatic -lasan -Wl,-Bdynamic -ldl -lpthread 
> > 
> 
> The -static-libasan option was added by
> 
> http://gcc.gnu.org/ml/gcc-cvs/2012-11/msg00536.html

Can you please add "-ldl -lpthread" to it?
Otherwise the user has to set this by hand:

markus@x4 ~ % g++ -faddress-sanitizer -static-libasan -g test.cpp
/home/markus/gcc/libsanitizer/sanitizer_common/sanitizer_linux.cc:135: error: undefined reference to 'pthread_getattr_np'
/home/markus/gcc/libsanitizer/sanitizer_common/sanitizer_linux.cc:138: error: undefined reference to 'pthread_attr_getstack'
/home/markus/gcc/libsanitizer/asan/asan_posix.cc:103: error: undefined reference to 'pthread_key_create'
/home/markus/gcc/libsanitizer/asan/asan_posix.cc:108: error: undefined reference to 'pthread_getspecific'
/home/markus/gcc/libsanitizer/asan/asan_posix.cc:113: error: undefined reference to 'pthread_setspecific'
/home/markus/gcc/libsanitizer/interception/interception_linux.cc:22: error: undefined reference to 'dlsym'
collect2: error: ld returned 1 exit status
Comment 9 Konstantin Serebryany 2012-11-18 19:35:43 UTC
As dvyuokv@ pointed out, 
-ftls-model=initial-exec improves the situation, but does not fully help. 

Experiment: 

% cat x.c 
__thread int a;
int foo() {
  return a;
}


HORRIBLE: -fPIC -shared
% gcc x.c -O2 -fPIC -shared -o x.so  ; objdump -d x.so  | grep foo.: -A 5 
00000000000006e0 <foo>:
 6e0:   66 48 8d 3d f0 08 20    lea    0x2008f0(%rip),%rdi        # 200fd8 <_DYNAMIC+0x1b8>
 6e7:   00 
 6e8:   66 66 48 e8 10 ff ff    callq  600 <__tls_get_addr@plt>
 6ef:   ff 
 6f0:   8b 00                   mov    (%rax),%eax


NOT-SO-BAD: -fPIC -shared  -ftls-model=initial-exec
% gcc x.c -O2 -fPIC -shared -o x.so  -ftls-model=initial-exec ; objdump -d x.so  | grep foo.: -A 5 
0000000000000630 <foo>:
 630:   48 8b 05 a9 09 20 00    mov    0x2009a9(%rip),%rax        # 200fe0 <_DYNAMIC+0x1b8>
 637:   64 8b 00                mov    %fs:(%rax),%eax
 63a:   c3                      retq   


GOOD: -fPIE 
% gcc -c x.c -O2 -fPIE -o x.o  ; objdump -d x.o  | grep foo.: -A 5 
0000000000000000 <foo>:
   0:   64 8b 04 25 00 00 00    mov    %fs:0x0,%eax
   7:   00 
   8:   c3                      retq   


So, while -ftls-model=initial-exec improves the TLS performance, it is still 
2x slower than -fPIE. 

For tsan, which does this for *every* memory access in the original program, 
this will cost 5%-10% slowdown. 

For our users this is a big deal, so they will link the static library whenever possible. Which default is used in gcc -- I don't care that much.
Comment 10 Jakub Jelinek 2012-11-18 19:54:37 UTC
(In reply to comment #9)
> NOT-SO-BAD: -fPIC -shared  -ftls-model=initial-exec
> % gcc x.c -O2 -fPIC -shared -o x.so  -ftls-model=initial-exec ; objdump -d x.so
>  | grep foo.: -A 5 
> 0000000000000630 <foo>:
>  630:   48 8b 05 a9 09 20 00    mov    0x2009a9(%rip),%rax        # 200fe0
> <_DYNAMIC+0x1b8>
>  637:   64 8b 00                mov    %fs:(%rax),%eax
>  63a:   c3                      retq   
> 
> 
> GOOD: -fPIE 
> % gcc -c x.c -O2 -fPIE -o x.o  ; objdump -d x.o  | grep foo.: -A 5 
> 0000000000000000 <foo>:
>    0:   64 8b 04 25 00 00 00    mov    %fs:0x0,%eax
>    7:   00 
>    8:   c3                      retq   
> 
> 
> So, while -ftls-model=initial-exec improves the TLS performance, it is still 
> 2x slower than -fPIE. 

Except obviously you can't use the last code sequence if you want to link it into a shared library.  The extra indirection is the standard cost of relocatable code, especially if there are just a few TLS vars in libtsan and they are accessed a lot, that memory (the .got section entry) is in caches most likely and so the indirection can be just a cycle or at most a few of them.
No idea how would you plan to compile libtsan with -fPIE flag, for libtsan.so.0 you obviously can't, it would fail to link or load, and for libtsan.a it would make the shared library
only usable in executables or PIEs, not from shared libraries.
Comment 11 Konstantin Serebryany 2012-11-18 19:59:42 UTC
The above comment is correct. 
-fPIE is only applicable if we build libtsan.a and link it statically to the pie executable. 
This mode however, works pretty well and most our users pay this price for performance. 

On every memory access tsan touches a few (two or three) extra cache lines. 
Not using -fPIE makes it touch one extra cache line. 
Even if that line is in L1, it still has a non-zero cost. 

We will try to provide benchmark numbers next week.
Comment 12 Jakub Jelinek 2012-11-18 20:09:39 UTC
That would effectively require building libtsan as libtsan.so.0, libtsan.a (both -fPIC built) and libtsan_pie.a (-fPIE built), where the gcc driver would do:
%{static-libtsan:!shared:-ltsan_pie;:-ltsan} or so.  Or there could be even libtsan_nonpic.a for !shared:!pie, of course everything would need to be done only given appropriate benchmarks of real-world programs.
Comment 13 Konstantin Serebryany 2012-11-19 04:13:23 UTC
>> of course everything would need to be done only given appropriate benchmarks of real-world programs.

We have a synthetic benchmark which perfectly reflects the only major hot spot
in tsan: the set of functions __tsan_{read,write}{1,2,4,8} that are called on every memory access.


When building libtsan as a shared library (for which I had to hack our assembly blobs a bit) we get two sources of slowdown: 
  1. __tsan_read8 and friends are called through PLT
  2. __tsan_read8 and friends use one extra load to get to TLS

The result is > 10% slowdown.
Comment 14 Jakub Jelinek 2012-11-19 08:54:47 UTC
I bet 9.5% or more of that is due to the PLT call.  The thing is, even when you have initial-exec TLS model code, if you link it into an executable and the referenced TLS variable is in the executable, the linker TLS transitions optimization changes the IE model into LE model, so instead of something like:
  mov    0x2009a9(%rip),%rax
  mov    %fs:(%rax),%eax
you'll end up with
  mov    $-4,%rax
  mov    %fs:(%rax),%eax
or so (compared to
  mov    %fs:-4,%eax
if it was local-exec model from the beginning).  Given the amount of code in __tsan_read8, I seriously doubt it is noticeable.  So, please compare libtsan built with -fPIC -ftls-model=initial-exec with libtsan built with -fPIE (-ftls-model=local-exec).  The former will not require any special hacks and will work just fine even with shared libraries, the latter won't.
Comment 15 Konstantin Serebryany 2012-11-19 09:03:35 UTC
You are right that "-fPIC -ftls-model=initial-exec" does not affect performance 
if we link libtsan statically (I checked). 
As you say, the linker nukes one of the loads. 

But if we link libtsan.so dynamically, we still get both sources of overhead.
Comment 16 Konstantin Serebryany 2012-11-19 09:06:26 UTC
So, using "-fPIC -ftls-model=initial-exec" is a great idea, it will allow 
to build the files once and have both static and dynamic library.
But we need to agree that the dynamic library is noticeably slower.
Comment 17 Dmitry Vyukov 2012-11-19 10:53:04 UTC
>When building libtsan as a shared library (for which I had to hack our assembly
>blobs a bit) we get two sources of slowdown: 
>  1. __tsan_read8 and friends are called through PLT
>  2. __tsan_read8 and friends use one extra load to get to TLS

> I bet 9.5% or more of that is due to the PLT call.

That's not the overhead you are looking for, Luke.

We currently compile with -fPIC and link statically, linker inserts only 1 memory dereference in this case. However, -fPIC affects code generation in compiler, it has to reserve more registers for tls access code and has to allocate stack frame because of the potential call. Only that causes *20%* slowdown on a real application (not a synthetic benchmark).

Kostya, to evaluate initial-exec you need to insure that code characteristics of __tsan_read/write are not affected, i.e. 0 stack spills and analyze script passes. Everything else we have w/o initial-exec.
Comment 18 Dmitry Vyukov 2012-11-21 07:45:20 UTC
(In reply to comment #17)
> >When building libtsan as a shared library (for which I had to hack our assembly
> >blobs a bit) we get two sources of slowdown: 
> >  1. __tsan_read8 and friends are called through PLT
> >  2. __tsan_read8 and friends use one extra load to get to TLS
> 
> > I bet 9.5% or more of that is due to the PLT call.
> 
> That's not the overhead you are looking for, Luke.
> 
> We currently compile with -fPIC and link statically, linker inserts only 1
> memory dereference in this case. However, -fPIC affects code generation in
> compiler, it has to reserve more registers for tls access code and has to
> allocate stack frame because of the potential call. Only that causes *20%*
> slowdown on a real application (not a synthetic benchmark).
> 
> Kostya, to evaluate initial-exec you need to insure that code characteristics
> of __tsan_read/write are not affected, i.e. 0 stack spills and analyze script
> passes. Everything else we have w/o initial-exec.

For actual ThreadSanitizer runtime -fPIC -ftls-model=initial-exec causes degradation of generated code. Linker emits the same tls access code in all cases, but the compiler generates worse code.
The table below show various stats about generated code for the hot functions. We are mostly interested in stack access instructions (rsp/push/pop):

gcc -fPIE
    write1 tot 325; size 1316; rsp 1; push 0; pop 0; call 2; load 16; store  5; sh  52; mov  68; lea   6; cmp  46
    write2 tot 326; size 1340; rsp 1; push 0; pop 0; call 2; load 16; store  5; sh  52; mov  68; lea   6; cmp  46
    write4 tot 325; size 1316; rsp 1; push 0; pop 0; call 2; load 16; store  5; sh  52; mov  68; lea   6; cmp  46
    write8 tot 325; size 1316; rsp 1; push 0; pop 0; call 2; load 16; store  5; sh  52; mov  68; lea   6; cmp  46
     read1 tot 355; size 1476; rsp 1; push 0; pop 0; call 2; load 17; store  5; sh  52; mov  71; lea   6; cmp  52
     read2 tot 344; size 1428; rsp 1; push 0; pop 0; call 2; load 16; store  5; sh  52; mov  71; lea   6; cmp  51
     read4 tot 344; size 1436; rsp 1; push 0; pop 0; call 2; load 16; store  5; sh  52; mov  71; lea   6; cmp  51
     read8 tot 344; size 1436; rsp 1; push 0; pop 0; call 2; load 16; store  5; sh  52; mov  71; lea   6; cmp  51
func_entry tot  28; size  116; rsp 0; push 0; pop 0; call 1; load  3; store  1; sh   2; mov   8; lea   0; cmp   1
 func_exit tot  25; size  100; rsp 0; push 0; pop 0; call 1; load  1; store  0; sh   2; mov   5; lea   0; cmp   1

gcc -fPIC -ftls-model=initial-exec
    write1 tot 323; size 1268; rsp 1; push 1; pop 5; call 2; load 17; store  7; sh  52; mov  64; lea   6; cmp  46
    write2 tot 321; size 1275; rsp 1; push 1; pop 5; call 2; load 17; store  7; sh  52; mov  64; lea   6; cmp  46
    write4 tot 323; size 1268; rsp 1; push 1; pop 5; call 2; load 17; store  7; sh  52; mov  64; lea   6; cmp  46
    write8 tot 323; size 1268; rsp 1; push 1; pop 5; call 2; load 17; store  7; sh  52; mov  64; lea   6; cmp  46
     read1 tot 342; size 1380; rsp 1; push 1; pop 4; call 2; load 18; store  7; sh  52; mov  67; lea   6; cmp  52
     read2 tot 331; size 1331; rsp 1; push 1; pop 3; call 2; load 17; store  7; sh  52; mov  67; lea   6; cmp  51
     read4 tot 334; size 1356; rsp 1; push 1; pop 3; call 2; load 17; store  7; sh  52; mov  67; lea   6; cmp  51
     read8 tot 334; size 1356; rsp 1; push 1; pop 3; call 2; load 17; store  7; sh  52; mov  67; lea   6; cmp  51
func_entry tot   7; size   24; rsp 0; push 0; pop 0; call 0; load  0; store  1; sh   0; mov   2; lea   0; cmp   0
 func_exit tot   6; size   21; rsp 0; push 0; pop 0; call 0; load  0; store  1; sh   0; mov   1; lea   0; cmp   0

gcc -fPIC
    write1 tot 379; size 1571; rsp 23; push 0; pop 0; call 2; load 25; store 20; sh  52; mov 100; lea   6; cmp  42
    write2 tot 383; size 1603; rsp 23; push 0; pop 0; call 2; load 25; store 20; sh  52; mov 100; lea   6; cmp  42
    write4 tot 379; size 1571; rsp 23; push 0; pop 0; call 2; load 25; store 20; sh  52; mov 100; lea   6; cmp  42
    write8 tot 379; size 1571; rsp 23; push 0; pop 0; call 2; load 25; store 20; sh  52; mov 100; lea   6; cmp  42
     read1 tot 402; size 1715; rsp 23; push 0; pop 0; call 2; load 26; store 20; sh  52; mov 103; lea   6; cmp  48
     read2 tot 393; size 1659; rsp 23; push 0; pop 0; call 2; load 25; store 20; sh  52; mov 103; lea   6; cmp  47
     read4 tot 391; size 1659; rsp 23; push 0; pop 0; call 2; load 25; store 20; sh  52; mov 103; lea   6; cmp  47
     read8 tot 391; size 1659; rsp 23; push 0; pop 0; call 2; load 25; store 20; sh  52; mov 103; lea   6; cmp  47
func_entry tot   9; size   32; rsp 0; push 1; pop 1; call 0; load  0; store  0; sh   0; mov   4; lea   0; cmp   0
 func_exit tot   7; size   32; rsp 0; push 0; pop 0; call 0; load  0; store  0; sh   0; mov   2; lea   0; cmp   0

clang -fPIE
    write1 tot 326; size 1194; rsp 1; push 1; pop 5; call 2; load 16; store  5; sh  37; mov  77; lea   4; cmp  42
    write2 tot 330; size 1210; rsp 1; push 1; pop 5; call 2; load 16; store  5; sh  37; mov  77; lea   4; cmp  46
    write4 tot 330; size 1210; rsp 1; push 1; pop 5; call 2; load 16; store  5; sh  37; mov  77; lea   4; cmp  46
    write8 tot 330; size 1210; rsp 1; push 1; pop 5; call 2; load 16; store  5; sh  37; mov  77; lea   4; cmp  46
     read1 tot 350; size 1304; rsp 1; push 0; pop 0; call 2; load 17; store  5; sh  38; mov  85; lea   3; cmp  47
     read2 tot 355; size 1311; rsp 1; push 1; pop 5; call 2; load 16; store  5; sh  37; mov  83; lea   4; cmp  50
     read4 tot 355; size 1311; rsp 1; push 1; pop 5; call 2; load 16; store  5; sh  37; mov  83; lea   4; cmp  50
     read8 tot 355; size 1311; rsp 1; push 1; pop 5; call 2; load 16; store  5; sh  37; mov  83; lea   4; cmp  50
func_entry tot  26; size  108; rsp 0; push 0; pop 0; call 1; load  2; store  0; sh   1; mov   7; lea   0; cmp   1
 func_exit tot  21; size   93; rsp 0; push 0; pop 0; call 1; load  1; store  0; sh   1; mov   4; lea   0; cmp   1

clang -fPIC -ftls-model=initial-exec
    write1 tot 315; size 1173; rsp 1; push 2; pop 2; call 2; load 20; store  6; sh  37; mov  78; lea   4; cmp  42
    write2 tot 319; size 1189; rsp 1; push 2; pop 2; call 2; load 20; store  6; sh  37; mov  78; lea   4; cmp  46
    write4 tot 319; size 1189; rsp 1; push 2; pop 2; call 2; load 20; store  6; sh  37; mov  78; lea   4; cmp  46
    write8 tot 319; size 1189; rsp 1; push 2; pop 2; call 2; load 20; store  6; sh  37; mov  78; lea   4; cmp  46
     read1 tot 347; size 1271; rsp 1; push 1; pop 5; call 2; load 21; store  6; sh  38; mov  86; lea   3; cmp  47
     read2 tot 345; size 1268; rsp 1; push 2; pop 2; call 2; load 20; store  6; sh  37; mov  84; lea   4; cmp  50
     read4 tot 345; size 1268; rsp 1; push 2; pop 2; call 2; load 20; store  6; sh  37; mov  84; lea   4; cmp  50
     read8 tot 345; size 1268; rsp 1; push 2; pop 2; call 2; load 20; store  6; sh  37; mov  84; lea   4; cmp  50
func_entry tot  21; size   96; rsp 0; push 0; pop 0; call 1; load  4; store  1; sh   1; mov   8; lea   0; cmp   1
 func_exit tot  19; size   82; rsp 0; push 0; pop 0; call 1; load  2; store  0; sh   1; mov   5; lea   0; cmp   1

clang -fPIC
    write1 tot 409; size 1625; rsp 17; push 6; pop 6; call 2; load 17; store 25; sh  37; mov 131; lea  20; cmp  42
    write2 tot 413; size 1641; rsp 17; push 6; pop 6; call 2; load 17; store 25; sh  37; mov 131; lea  20; cmp  46
    write4 tot 413; size 1641; rsp 17; push 6; pop 6; call 2; load 17; store 25; sh  37; mov 131; lea  20; cmp  46
    write8 tot 413; size 1641; rsp 17; push 6; pop 6; call 2; load 17; store 25; sh  37; mov 131; lea  20; cmp  46
     read1 tot 423; size 1654; rsp 15; push 6; pop 6; call 2; load 13; store 26; sh  38; mov 123; lea  19; cmp  47
     read2 tot 446; size 1742; rsp 19; push 6; pop 6; call 2; load 19; store 27; sh  37; mov 140; lea  20; cmp  50
     read4 tot 446; size 1742; rsp 19; push 6; pop 6; call 2; load 19; store 27; sh  37; mov 140; lea  20; cmp  50
     read8 tot 446; size 1742; rsp 19; push 6; pop 6; call 2; load 19; store 27; sh  37; mov 140; lea  20; cmp  50
func_entry tot  36; size  129; rsp 0; push 3; pop 3; call 1; load  4; store  1; sh   1; mov  10; lea   2; cmp   1
 func_exit tot  30; size  114; rsp 0; push 3; pop 2; call 1; load  2; store  0; sh   1; mov   6; lea   2; cmp   1

clang -fPIE project-stream
    write1 tot 293; size 1071; rsp 1; push 0; pop 0; call 2; load 17; store  5; sh  33; mov  72; lea   1; cmp  38
    write2 tot 297; size 1087; rsp 1; push 0; pop 0; call 2; load 17; store  5; sh  33; mov  72; lea   1; cmp  42
    write4 tot 297; size 1087; rsp 1; push 0; pop 0; call 2; load 17; store  5; sh  33; mov  72; lea   1; cmp  42
    write8 tot 297; size 1087; rsp 1; push 0; pop 0; call 2; load 17; store  5; sh  33; mov  72; lea   1; cmp  42
     read1 tot 334; size 1252; rsp 1; push 0; pop 0; call 2; load 17; store  5; sh  33; mov  78; lea   1; cmp  38
     read2 tot 341; size 1283; rsp 1; push 0; pop 0; call 2; load 17; store  5; sh  33; mov  79; lea   1; cmp  42
     read4 tot 341; size 1283; rsp 1; push 0; pop 0; call 2; load 17; store  5; sh  33; mov  79; lea   1; cmp  42
     read8 tot 341; size 1283; rsp 1; push 0; pop 0; call 2; load 17; store  5; sh  33; mov  79; lea   1; cmp  42
func_entry tot  26; size  102; rsp 0; push 0; pop 0; call 1; load  2; store  0; sh   0; mov   7; lea   0; cmp   1
 func_exit tot  21; size   87; rsp 0; push 0; pop 0; call 1; load  1; store  0; sh   0; mov   4; lea   0; cmp   1

clang -fPIC -ftls-model=initial-exec project-stream
    write1 tot 293; size 1051; rsp 1; push 1; pop 5; call 2; load 21; store  6; sh  33; mov  73; lea   1; cmp  38
    write2 tot 297; size 1067; rsp 1; push 1; pop 5; call 2; load 21; store  6; sh  33; mov  73; lea   1; cmp  42
    write4 tot 297; size 1067; rsp 1; push 1; pop 5; call 2; load 21; store  6; sh  33; mov  73; lea   1; cmp  42
    write8 tot 297; size 1067; rsp 1; push 1; pop 5; call 2; load 21; store  6; sh  33; mov  73; lea   1; cmp  42
     read1 tot 327; size 1231; rsp 1; push 0; pop 0; call 2; load 21; store  6; sh  33; mov  79; lea   1; cmp  38
     read2 tot 340; size 1263; rsp 1; push 1; pop 5; call 2; load 21; store  6; sh  33; mov  80; lea   1; cmp  42
     read4 tot 340; size 1263; rsp 1; push 1; pop 5; call 2; load 21; store  6; sh  33; mov  80; lea   1; cmp  42
     read8 tot 340; size 1263; rsp 1; push 1; pop 5; call 2; load 21; store  6; sh  33; mov  80; lea   1; cmp  42
func_entry tot  21; size   90; rsp 0; push 0; pop 0; call 1; load  4; store  2; sh   0; mov   8; lea   0; cmp   1
 func_exit tot  17; size   76; rsp 0; push 0; pop 0; call 1; load  2; store  1; sh   0; mov   5; lea   0; cmp   1

clang -fPIC project-stream
    write1 tot 341; size 1289; rsp 7; push 6; pop 6; call 2; load 22; store  7; sh  33; mov  91; lea  12; cmp  38
    write2 tot 345; size 1305; rsp 7; push 6; pop 6; call 2; load 22; store  7; sh  33; mov  91; lea  12; cmp  42
    write4 tot 345; size 1305; rsp 7; push 6; pop 6; call 2; load 22; store  7; sh  33; mov  91; lea  12; cmp  42
    write8 tot 345; size 1305; rsp 7; push 6; pop 6; call 2; load 22; store  7; sh  33; mov  91; lea  12; cmp  42
     read1 tot 379; size 1489; rsp 7; push 6; pop 6; call 2; load 21; store  6; sh  33; mov  93; lea  12; cmp  38
     read2 tot 388; size 1533; rsp 9; push 6; pop 6; call 2; load 22; store  7; sh  33; mov  97; lea  12; cmp  42
     read4 tot 388; size 1533; rsp 9; push 6; pop 6; call 2; load 22; store  7; sh  33; mov  97; lea  12; cmp  42
     read8 tot 388; size 1533; rsp 9; push 6; pop 6; call 2; load 22; store  7; sh  33; mov  97; lea  12; cmp  42
func_entry tot  33; size  123; rsp 0; push 3; pop 3; call 1; load  4; store  2; sh   0; mov  10; lea   2; cmp   1
 func_exit tot  28; size  108; rsp 0; push 3; pop 2; call 1; load  2; store  1; sh   0; mov   6; lea   2; cmp   1
Comment 19 Jakub Jelinek 2012-11-21 08:40:52 UTC
(In reply to comment #18)
> For actual ThreadSanitizer runtime -fPIC -ftls-model=initial-exec causes
> degradation of generated code. Linker emits the same tls access code in all
> cases, but the compiler generates worse code.

-fPIC -ftls-model=initial-exec is by definition almost equivalent to -fPIE, the only exceptions are:
1) -fPIE code is allowed to assume globally visible symbols aren't interposed
2) if TLS vars are defined locally (or hidden visibility), then local-exec
   model can be used instead of initial-exec (one less dereference)

As for 2), I've explained already that by linking -fPIC code into the executable if the TLS var is defined in the executable, linker TLS transition transform all other TLS models (even global and local dynamic) into local-exec, just might result in some nops or for IE->LE setting of a register to an immediate and using that register as opposed to just using the immediate in the %fs: prefixed insn.

And for 1), for the fast path, for any symbols on the fast path that shouldn't be interposeable and that are defined in libtsan, you should be able to just use visibility attributes and get the same effect.

-fPIE flag simply isn't usable for a library that is to be used also by shared libraries.  How do you link -fsanitize=thread shared libraries anyway?  Just don't link libtsan in for -static-libtsan, and rely on the executable being linked against it?  Such libraries will fail to link with -Wl,-z,defs ...
Of course, having multiple tsan TLS roots in the same process isn't a good idea either (which is why I think we can't default to -static-libtsan).
Comment 20 Dmitry Vyukov 2012-11-21 09:04:07 UTC
(In reply to comment #19)
> (In reply to comment #18)
> > For actual ThreadSanitizer runtime -fPIC -ftls-model=initial-exec causes
> > degradation of generated code. Linker emits the same tls access code in all
> > cases, but the compiler generates worse code.
> 
> -fPIC -ftls-model=initial-exec is by definition almost equivalent to -fPIE, the
> only exceptions are:
> 1) -fPIE code is allowed to assume globally visible symbols aren't interposed
> 2) if TLS vars are defined locally (or hidden visibility), then local-exec
>    model can be used instead of initial-exec (one less dereference)


What I see is that it also affect code generation (register allocation). Do we need to file a bug on that?


> As for 2), I've explained already that by linking -fPIC code into the
> executable if the TLS var is defined in the executable, linker TLS transition
> transform all other TLS models (even global and local dynamic) into local-exec,
> just might result in some nops or for IE->LE setting of a register to an
> immediate and using that register as opposed to just using the immediate in the
> %fs: prefixed insn.
> 
> And for 1), for the fast path, for any symbols on the fast path that shouldn't
> be interposeable and that are defined in libtsan, you should be able to just
> use visibility attributes and get the same effect.
> 
> -fPIE flag simply isn't usable for a library that is to be used also by shared
> libraries.  How do you link -fsanitize=thread shared libraries anyway?  Just
> don't link libtsan in for -static-libtsan, and rely on the executable being
> linked against it?


Yes, we rely on the library being linked into the executable, because we want the runtime be linked statically.
For dynamic libraries that are loaded into a non-instrumented executable (e.g. swig so preloaded into python process), we statically link the tsan runtime into the so.



> Such libraries will fail to link with -Wl,-z,defs ...
> Of course, having multiple tsan TLS roots in the same process isn't a good idea
> either (which is why I think we can't default to -static-libtsan).
Comment 21 Jakub Jelinek 2012-11-21 09:23:56 UTC
(In reply to comment #20)
> What I see is that it also affect code generation (register allocation). Do we
> need to file a bug on that?

If you see a code generation difference even with -ftls-model=local-exec -fPIC vs. -fPIE, then it must mean you don't have visibility attributes on the symbols used in the fast path.  For initial-exec, the RA effects should be minimal, the TLS offset load from got is usually very close to the actual TLS memory load (or lea), and thus it will just pick up some short lived scratch register.  Generally in GCC, -fPIE sets flag_pic and not flag_shlib, while
-fPIC sets flag_pic and flag_shlib.  flag_pic is about whether position independent code needs to be generated, flag_shlib is about whether locally defined symbols can be interposed (plus it affects TLS model default choice).

> For dynamic libraries that are loaded into a non-instrumented executable (e.g.
> swig so preloaded into python process), we statically link the tsan runtime
> into the so.

And you don't get linker errors from that?  That must be by pure luck.
Comment 22 Dmitry Vyukov 2012-11-23 07:16:14 UTC
> > For dynamic libraries that are loaded into a non-instrumented executable (e.g.
> > swig so preloaded into python process), we statically link the tsan runtime
> > into the so.
> 
> And you don't get linker errors from that?  That must be by pure luck.

Why must I get errors in this case?
Comment 23 Dmitry Vyukov 2012-11-23 07:27:27 UTC
(In reply to comment #21)
> (In reply to comment #20)
> > What I see is that it also affect code generation (register allocation). Do we
> > need to file a bug on that?
> 
> If you see a code generation difference even with -ftls-model=local-exec -fPIC
> vs. -fPIE, then it must mean you don't have visibility attributes on the
> symbols used in the fast path.  For initial-exec, the RA effects should be
> minimal, the TLS offset load from got is usually very close to the actual TLS
> memory load (or lea), and thus it will just pick up some short lived scratch
> register.  Generally in GCC, -fPIE sets flag_pic and not flag_shlib, while
> -fPIC sets flag_pic and flag_shlib.  flag_pic is about whether position
> independent code needs to be generated, flag_shlib is about whether locally
> defined symbols can be interposed (plus it affects TLS model default choice).

When I compile with -fvisibility=hidden, it does not affect generated code. It's not that we access a lot of symbols in the function, there is one thread-local and one static global var.

That "minimal" RA effects do have effect in our case. We don't have a reserve to squeeze another register for tls access:

// -fPIE
000000000009ca30 <__tsan_write2>:
   9ca30:       64 48 8b 04 25 40 1f    mov    %fs:0xffffffffffeb1f40,%rax
   9ca37:       eb ff 
   9ca39:       48 8b 0c 24             mov    (%rsp),%rcx
   9ca3d:       a8 01                   test   $0x1,%al
   9ca3f:       0f 85 d3 00 00 00       jne    9cb18 <__tsan_write2+0xe8>
   9ca45:       48 83 e8 80             sub    $0xffffffffffffff80,%rax
   9ca49:       48 89 fe                mov    %rdi,%rsi
   9ca4c:       48 89 c2                mov    %rax,%rdx
   9ca4f:       64 48 89 04 25 40 1f    mov    %rax,%fs:0xffffffffffeb1f40
   9ca56:       eb ff 

// -fPIC -ftls-model=initial-exec
00000000000969f0 <__tsan_write2>:
   969f0:       48 c7 c2 40 1f eb ff    mov    $0xffffffffffeb1f40,%rdx
   969f7:       53                      push   %rbx
   969f8:       48 8b 4c 24 08          mov    0x8(%rsp),%rcx
   969fd:       64 48 8b 02             mov    %fs:(%rdx),%rax
   96a01:       a8 01                   test   $0x1,%al
   96a03:       0f 85 c7 00 00 00       jne    96ad0 <__tsan_write2+0xe0>
Comment 24 Jakub Jelinek 2012-11-23 08:13:11 UTC
When I rebuild libtsan with -fPIE instead of -fPIC in the Makefile, and
g++ -shared -Wl,--whole-archive -o libtsanx.so libtsan.a -ldl -lpthread
(note that .libs/*.o are still built with -fPIC because libtool overrides those flags and adds its -fPIC at the end), then this link fails with:
/usr/bin/ld: libtsan.a(tsan_rtl.o): relocation R_X86_64_TPOFF32 against `_ZN6__tsan22cur_thread_placeholderE' can not be used when making a shared object; recompile with -fPIC
libtsan.a(tsan_rtl.o): could not read symbols: Bad value
collect2: error: ld returned 1 exit status
(obviously, local-exec model would need to patch the insn at dynamic linking time, making the library DT_TEXTREL (not acceptable for SELinux, not allowed for x86_64 at all)).

Yes, I see the code generation differences, but the functions are huge anyway, is it really so crucial that you'd want to make another libtsan_pie.a for it?

BTW, the tsan that was added to GCC yesterday doesn't have -ftls-model=initial-exec even, so it is even slower.
-ftls-model=initial-exec can only be done for *-*-linux* targets btw, I don't think other dynamic linkers support dlopening IE model shared libraries.
E.g. libgomp has in its configure.tgt
if test $gcc_cv_have_tls = yes ; then
  case "${target}" in

    *-*-linux*)
        XCFLAGS="${XCFLAGS} -ftls-model=initial-exec"
        ;;
  esac
fi

so libsanitizer/configure.tgt would need to add it to say TSAN_CXXFLAGS var that would be substituted by configure.

Would be nice if you could check the numerous warnings from tsan build, e.g. it seems the ALWAYS_INLINE macro doesn't include the inline keyword, and you are using it on functions that don't have inline keyword, which gives a warning and if it is inlined, it is by pure luck.  Either you should add inline keywords manually, or put inline keyword into ALWAYS_INLINE macro.  There are other warnings...
Comment 25 Jakub Jelinek 2012-11-23 08:20:40 UTC
Have you considered using __builtin_expect to help the compiler in the performance sensitive code?  Or even better would be profile-feedback, build libtsan once, run some benchmark, build libtsan again.
Comment 26 Dmitry Vyukov 2012-11-23 08:34:52 UTC
(In reply to comment #25)
> Have you considered using __builtin_expect to help the compiler in the
> performance sensitive code?  Or even better would be profile-feedback, build
> libtsan once, run some benchmark, build libtsan again.

There are few of them in the code in the obvious cases. The performance is very shaky and depends on compiler, compiler version and options. I remember I get worse results when I added some other likily/unlikely. I didn't have time for more comprehensive investigation. FDO may be a good idea for prebuild libraries.
Comment 27 Konstantin Serebryany 2012-11-23 10:47:14 UTC
>> is it really so crucial that you'd want to make another libtsan_pie.a for it?
I would actually prefer to have *only* libtsan_pie.a, and not bother with .so version at all. This is what we have in clang land.
Comment 28 Konstantin Serebryany 2012-11-23 11:14:58 UTC
Note that today's upstream tsan sources don't link with -fPIC at all
because our assembly blobs are not PIC-friendly. 

/usr/bin/ld: .libs/tsan_rtl_thread.o: relocation R_X86_64_PC32 against
undefined symbol `__tsan_trace_switch_thunk' can not be used when making a
shared objec

Apparently, Wei hacked this around by enabling pure-C++ code, which is slower.