Bug 55309 - gcc's address-sanitizer 66% slower than clang's
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: sanitizer
Version: 4.8.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2012-11-13 10:00 UTC by Markus Trippelsdorf
Modified: 2014-01-27 08:22 UTC (History)

Attachments
Candidate patch to avoid duplicated intra bb instrumentation (6.61 KB, patch)
2013-02-06 10:55 UTC, Dodji Seketeli
Candidate patch to avoid duplicated intra bb instrumentation (6.55 KB, patch)
2013-02-06 15:02 UTC, Dodji Seketeli

Description Markus Trippelsdorf 2012-11-13 10:00:00 UTC
Comparing gcc build times:
CC="clang -fsanitize=address -w" CXX="clang++ -fsanitize=address -w" ~/gcc/configure --disable-bootstrap --disable-werror --disable-multilib --enable-languages=c,c++
with
CC="gcc -faddress-sanitizer" CXX="g++ -faddress-sanitizer" ...
and
CC="gcc -fno-var-tracking -faddress-sanitizer" CXX="g++ -fno-var-tracking -faddress-sanitizer" ...

Clang : nice -n 19 make -j4  1173.74s user 104.73s system 325% cpu 6:32.18 total
gcc   : nice -n 19 make -j4  3653.30s user 122.27s system 369% cpu 17:00.77 total
gcc_no: nice -n 19 make -j4  2925.20s user 116.42s system 357% cpu 14:11.52 total

"perf top" shows references_value_p() and value_member() on top.
Comment 1 Konstantin Serebryany 2012-11-13 21:10:23 UTC
While this is an interesting comparison, I should note that
the typical use of asan is with -O1 or -O2,
so it might make more sense to compare the asan implementations at -O1/-O2.
Comment 2 Markus Trippelsdorf 2012-11-13 21:31:08 UTC
gcc uses "-O2 -g" by default for --disable-bootstrap.

Also, to be fair, if one uses a profiledbootstrapped gcc configured with
--enable-checking=release to build, it only takes 12:58.77 on the same machine.
Comment 3 Jakub Jelinek 2012-11-14 07:51:44 UTC
Note that GCC doesn't perform any ASAN optimizations yet (if the same address is written or read several times in the same bb, it doesn't optimize away the redundant tests).  We plan to switch to first expanding the shadow memory checks as simple builtins without control flow, performing optimizations on them, and only later on (in the fab pass?) expanding them to the longer sequences with control flow in them.
Comment 4 Jakub Jelinek 2012-11-14 16:37:36 UTC
Also, this comparison doesn't have numbers for pure clang without -fsanitize=address and gcc without -faddress-sanitizer, so likely most of the speed differences can't be attributed just to asan.
Comment 5 Markus Trippelsdorf 2012-11-14 17:02:54 UTC
(In reply to comment #4)
> Also, this comparison doesn't have numbers for pure clang without
> -fsanitize=address and gcc without -faddress-sanitizer, so likely most of the
> speed differences can't be attributed just to asan.

Yes. It was meant as a rough estimate.

Here is a more complete result (clang trunk build with Release mode,
gcc trunk with --enable-checking=release and lto/profiledbootstraped):

  clang (with-without):  1278.47-662.47 = 616
  gcc   (with-without):  2733.02-1168.07= 1564.95

That's still a 60.64% slowdown.
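
For anyone who wants to recheck the arithmetic, here is a minimal sketch in C. The helper names are made up for illustration; the inputs are the user times quoted above, and the percentage is read as "clang's extra time is that much smaller than gcc's".

```c
/* Extra user time (seconds) caused by the sanitizer flag:
   build time with the flag minus build time without it. */
static double asan_extra_seconds(double with_asan, double without_asan)
{
    return with_asan - without_asan;
}

/* Percentage by which overhead `a` is smaller than overhead `b`. */
static double percent_smaller(double a, double b)
{
    return 100.0 * (1.0 - a / b);
}
```

Calling percent_smaller(asan_extra_seconds(1278.47, 662.47), asan_extra_seconds(2733.02, 1168.07)) reproduces the figure above, under that reading.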
Comment 6 Kostya Serebryany 2013-02-05 09:21:59 UTC
I am slightly confused. Are we discussing compile time or test-run-time? 
I've just built SPEC 2006 with -fsanitize=address -O2
gcc: r195706
clang: r174324
Measured on Intel(R) Xeon(R) CPU W3690  @ 3.47GHz

                             clang,          gcc,    gcc/clang
       400.perlbench,      1209.00,        -1.00,        -0.00
           401.bzip2,       885.00,      1187.00,         1.34
             403.gcc,       739.00,       756.00,         1.02
             429.mcf,       602.00,       612.00,         1.02
           445.gobmk,       840.00,      1191.00,         1.42
           456.hmmer,      1304.00,      1838.00,         1.41
           458.sjeng,       923.00,      1326.00,         1.44
      462.libquantum,       543.00,       481.00,         0.89
         464.h264ref,      1271.00,        -1.00,        -0.00
         471.omnetpp,       631.00,       624.00,         0.99
           473.astar,       672.00,       765.00,         1.14
       483.xalancbmk,       500.00,       521.00,         1.04
            433.milc,       710.00,       629.00,         0.89
            444.namd,       637.00,       539.00,         0.85
          447.dealII,       650.00,       714.00,         1.10
          450.soplex,       389.00,       419.00,         1.08
          453.povray,       459.00,       432.00,         0.94
             470.lbm,       388.00,       409.00,         1.05
         482.sphinx3,       998.00,      1335.00,         1.34


400.perlbench fails with a real asan-ish warning
(clang can use a blacklist file and disable instrumentation for the buggy function;
see https://code.google.com/p/address-sanitizer/wiki/FoundBugs#Spec_CPU_2006 and
https://code.google.com/p/address-sanitizer/wiki/AddressSanitizer#Turning_off_instrumentation).

464.h264ref with gcc loops forever, I did not investigate why. 

So, on average clang+asan is faster than gcc-asan (up to 40%!), 
but in some cases (mostly, FP code) gcc is faster (up to 15%)
Comment 7 Kostya Serebryany 2013-02-05 09:43:11 UTC
If we are talking about compile time, I observe 2x difference in favor of clang: 
building 483.xalancbmk
gcc+asan+O2:   564 seconds
clang+asan+O2: 243 seconds

gcc is built with default options
clang is built with -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON
Comment 8 Jakub Jelinek 2013-02-05 09:56:17 UTC
"464.h264ref with gcc loops forever, I did not investigate why."
is PR53073; you can use -fno-aggressive-loop-optimizations to work around the invalid code in SPEC.
As for runtime performance of gcc -fsanitize=address code, it would be interesting to also try Dodji's patchset and see how much it improves things.

And, for compile time, you want to be testing with --enable-checking=release built gcc, that is what people will actually use if they aren't developing gcc.
Comment 9 Kostya Serebryany 2013-02-05 10:30:16 UTC
> And, for compile time, you want to be testing with --enable-checking=release
Thanks! 
With --enable-checking=release gcc's compile time drops to 374 seconds.
That's much better, but still 50% slower than clang (built with asserts)
Comment 10 Kostya Serebryany 2013-02-05 10:41:20 UTC
(In reply to comment #8)
> "464.h264ref with gcc loops forever, I did not investigate why."
> is PR53073 , you can use -fno-aggressive-loop-optimizations to workaround the
> invalid code in SPEC.
Thanks. But then again we hit another bug in 464.h264ref.
So, if we want to run h264ref and perlbench w/o changing sources under gcc+asan
we need to have the blacklist functionality
(see the same links as above).

==8720== ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fff625736a0 at pc 0x4e2a98 bp 0x7fff62573600 sp 0x7fff625735f8
READ of size 4 at 0x7fff625736a0 thread T0
    #0 0x4e2a97 in SATD (benchspec/CPU2006/464.h264ref/run/run_base_ref_z.0000/h264ref_base.z+0x4e2a97)
    #1 0x4e47c0 in SubPelBlockMotionSearch (benchspec/CPU2006/464.h264ref/run/run_base_ref_z.0000/h264ref_base.z+0x4e47c0)
...
Address 0x7fff625736a0 is located at offset 96 in frame <SATD> of T0's stack:
  This frame has 1 object(s):
    [32, 96) 'd'
Comment 11 Jakub Jelinek 2013-02-05 10:54:46 UTC
I really don't like the blacklist hack; such changes belong in the source, not outside of it.  If you want to disable instrumentation of SATD, I think modification of the source is preferable, or I guess you can
use
cat > buggy-spec-workarounds.h <<\EOF
extern int SATD (int *, int) __attribute__((__no_address_safety_analysis__));
EOF
and use -include .../buggy-spec-workarounds.h, though of course if it is a real bug in SPEC, it would be much better to just report it to SPEC and hope they fix it up.  Though given http://www.spec.org/cpu2006/Docs/faq.html#Run.05 I don't have much hope they will (when they don't even see it as a C89 violation).
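
For reference, the source-level alternative could look roughly like the sketch below. It assumes the attribute spelling no_sanitize_address, which GCC documents as the preferred name for the no_address_safety_analysis attribute used above. sum_first_n is a hypothetical stand-in, not the real SATD from 464.h264ref.

```c
#include <stddef.h>

/* Hypothetical stand-in for a function whose accesses should not be
   instrumented; the real SATD body is not reproduced here. */
__attribute__((no_sanitize_address))
int sum_first_n(const int *d, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += d[i];   /* with -fsanitize=address, this load is left unchecked */
    return sum;
}
```

The function behaves the same with or without -fsanitize=address; only the instrumentation of its own memory accesses is suppressed.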
Comment 12 Markus Trippelsdorf 2013-02-05 11:17:42 UTC
(In reply to comment #9)
> > And, for compile time, you want to be testing with --enable-checking=release
> Thanks! 
> With --enable-checking=release gcc's compile time drops to 374 seconds.
> That's much better, but still 50% slower than clang (built with asserts)

Hmm, that means gcc is 35% slower (374 vs. 243). That is exactly the
slowdown that I see in all my tests. (So switching to clang is like
moving from a 4-core to a 6-core machine from a compile time perspective.)
Comment 13 Jakub Jelinek 2013-02-05 11:24:23 UTC
Please, let's not make this PR into a general gcc vs. clang compile time comparison (see e.g. Vlad Makarov's mails on this topic; if you care more about compile time than runtime, supposedly e.g. -O1 might be better than -O2).  For this particular PR I think it matters what relative slowdown -fsanitize=address causes on compile time and runtime for both compilers, and whether Dodji's changes help here.  If not, it is time to look at testcases and figure out what is going on.  Without Dodji's patch we know what's going on and what could make the difference.
Comment 14 Jakub Jelinek 2013-02-05 11:26:05 UTC
(In reply to comment #11)
> bug in SPEC, it would be much better to just report it to SPEC and hope they
> fix it up.  Though given http://www.spec.org/cpu2006/Docs/faq.html#Run.05 I
> don't have much hope they will (when they even don't see it as C89 violation).

Ah, in this case it is actually the same bug.
Comment 15 Kostya Serebryany 2013-02-05 12:22:56 UTC
Well, I of course can change the SPEC code
         464.h264ref,      1271.00,        1879.00,        1.47

As for Dodji's patch: can someone attach it here? 
Let me benchmark it too, although if that's just optimizing within one BB
I don't expect more than 5% difference (based on my experiments in llvm). 

Dodji, what are your numbers?
Comment 16 Dodji Seketeli 2013-02-06 10:55:38 UTC
Created attachment 29366 [details]
Candidate patch to avoid duplicated intra bb instrumentation

> As for Dodji's patch: can someone attach it here?

Here is the attachment of what I currently have.

> Let me benchmark it too,

Thank you, that would be much appreciated.

> although if that's just optimizing within one BB I don't expect more
> than 5% difference (based on my experiments in llvm).

That would be what I'd expect too, based on my experiments on GCC.
But then I'd be very curious to hear about your findings.
Comment 17 Kostya Serebryany 2013-02-06 11:18:28 UTC
Trying this patch: 
% cat inc.cc
void foo(int *a) {
  (*a)++;
}
% gcc -fsanitize=address -O2 inc.cc -S -o - | grep __asan_report
        call    __asan_report_load4
        call    __asan_report_store4
% clang -fsanitize=address -O2 inc.cc -S -o - | grep __asan_report 
        callq   __asan_report_load4
% 

Is this test expected to work (have one __asan_error call) with this patch?

(I've checked that the patch is applied correctly, on 
gcc/testsuite/c-c++-common/asan/no-redundant-instrumentation-1.c 
it reduces the number of calls from 16 to 5)
Comment 18 Kostya Serebryany 2013-02-06 12:24:51 UTC
First results with the patch (c-only tests, train data):
                              orig,      patched,  patched/orig
           401.bzip2,        89.60,        90.10,         1.01
             429.mcf,        23.50,        23.90,         1.02
           456.hmmer,       181.00,       145.00,         0.80
      462.libquantum,         1.64,         1.64,         1.00
         464.h264ref,       249.00,       249.00,         1.00
            433.milc,        20.10,        20.00,         1.00
             470.lbm,        37.20,        37.20,         1.00
         482.sphinx3,        17.50,        17.50,         1.00

significant speedup on 456.hmmer, no difference elsewhere. 
3 benchmarks fail to build: 
Error: 1x403.gcc 1x445.gobmk 1x458.sjeng
resource.c:431:1: internal compiler error: in update_mem_ref_hash_table, at asan.c:460
 find_dead_or_set_registers (target, res, jump_target, jump_count, set, needed)
 ^
0x7d0c74 update_mem_ref_hash_table
        ../../gcc/gcc/asan.c:460
0x7d15ab maybe_instrument_assignment
        ../../gcc/gcc/asan.c:1799
0x7d15ab transform_statements
        ../../gcc/gcc/asan.c:1870
0x7d15ab asan_instrument
        ../../gcc/gcc/asan.c:2209
Comment 19 Richard Biener 2013-02-06 12:39:21 UTC
(In reply to comment #17)
> Trying this patch: 
> % cat inc.cc
> void foo(int *a) {
>   (*a)++;
> }
> % gcc -fsanitize=address -O2 inc.cc -S -o - | grep __asan_report
>         call    __asan_report_load4
>         call    __asan_report_store4
> % clang -fsanitize=address -O2 inc.cc -S -o - | grep __asan_report 
>         callq   __asan_report_load4
> % 

The clang variant looks incorrect to me - if asan distinguishes between
loads and stores the __asan_report_load4 should have been promoted to
a __asan_report_store4.  Consider a pointing to read-only memory.
Or rather asan would need a __asan_report_load_store4 to be really correct.

> Is this test expected to work (have one __asan_error call) with this patch?
> 
> (I've checked that the patch is applied correctly, on 
> gcc/testsuite/c-c++-common/asan/no-redundant-instrumentation-1.c 
> it reduces the number of calls from 16 to 5)
Comment 20 Kostya Serebryany 2013-02-06 12:43:09 UTC
> The clang variant looks incorrect to me - if asan distinguishes between
> loads and stores
It doesn't.
The only reason why we have two callbacks is that asan 
prints a message containing "READ" or "WRITE"
In this case we can report a bad read or a bad write -- doesn't matter.
Comment 21 Jakub Jelinek 2013-02-06 12:48:39 UTC
As the shadow memory doesn't have information about what locations are read-only, it only has info whether the relevant bytes are valid, or invalid (or some invalid, some valid), and for all invalid a few magic values for more detailed reporting.  So, if you have a RMW statement, without any asan optimization it will first check the read, and already fail on the read, so even with the optimization if you just check the read and not the write, the user visible behavior will be exactly the same.
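
To make that concrete, here is a toy model in C of the check itself. It is only a sketch: the shadow encoding (0 = whole 8-byte granule valid, k in 1..7 = only the first k bytes valid, negative = granule invalid) is the usual asan convention as I understand it, and toy_shadow, access_is_bad and rmw_is_bad are made-up names operating on a fake 64-byte address space, not on real libasan state.

```c
#include <stddef.h>
#include <stdint.h>

enum { SHADOW_SCALE = 3, GRANULE = 1 << SHADOW_SCALE };

/* Toy shadow for a 64-byte "address space": one shadow byte per 8 bytes. */
static int8_t toy_shadow[8];

/* Returns 1 if an access of `size` bytes at toy address `addr` would be
   reported.  Assumes size <= 8 and that the access stays within one
   granule.  Note that nothing here depends on read vs. write. */
static int access_is_bad(size_t addr, size_t size)
{
    int8_t s = toy_shadow[addr >> SHADOW_SCALE];
    if (s == 0)
        return 0;                       /* whole granule is addressable */
    /* Offset of the last byte touched inside the granule. */
    int last = (int)(addr & (GRANULE - 1)) + (int)size - 1;
    return last >= s;                   /* also true for negative s */
}

/* A read-modify-write such as (*a)++ : checking the store as well can
   never change the outcome, because it repeats the same shadow test. */
static int rmw_is_bad(size_t addr, size_t size, int also_check_write)
{
    int bad = access_is_bad(addr, size);        /* check for the load */
    if (!bad && also_check_write)
        bad = access_is_bad(addr, size);        /* check for the store */
    return bad;
}
```

In this model rmw_is_bad gives the same answer whether or not the store is checked, which is the point about user-visible behavior made above.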
Comment 22 Dodji Seketeli 2013-02-06 15:02:44 UTC
Created attachment 29370 [details]
Candidate patch to avoid duplicated intra bb instrumentation

> Trying this patch: 
> % cat inc.cc
> void foo(int *a) {
>   (*a)++;
> }
> % gcc -fsanitize=address -O2 inc.cc -S -o - | grep __asan_report
>         call    __asan_report_load4
>         call    __asan_report_store4
> % clang -fsanitize=address -O2 inc.cc -S -o - | grep __asan_report 
>         callq   __asan_report_load4
> % 
> 
> Is this test expected to work (have one __asan_error call) with this patch?

The patch indeed (naively) considers read and write accesses as being
different, you are right.  I am attaching a patch that does not, and
that generates just one __asan_report call here.

It'd be nice to know if that makes any change to ...

> First results with the patch (c-only tests, train data):
>                              orig          patched
>            401.bzip2,        89.60,        90.10,         1.01
>              429.mcf,        23.50,        23.90,         1.02
>            456.hmmer,       181.00,       145.00,         0.80
>       462.libquantum,         1.64,         1.64,         1.00
>          464.h264ref,       249.00,       249.00,         1.00
>             433.milc,        20.10,        20.00,         1.00
>              470.lbm,        37.20,        37.20,         1.00
>          482.sphinx3,        17.50,        17.50,         1.00
> 
> significant speedup on 456.hmmer, no difference elsewhere. 

... this.  Hopefully, if subsequent instrumentations of reads/writes in the
same BB are considered redundant now, we should see some speed
difference on more tests.

> 3 benchmarks fail to build: 

> Error: 1x403.gcc 1x445.gobmk 1x458.sjeng
> resource.c:431:1: internal compiler error: in
> update_mem_ref_hash_table, at
> asan.c:460

The updated patch hopefully addresses that too.

Thank you for doing this!
Comment 23 Kostya Serebryany 2013-02-07 05:01:53 UTC
with the patch from comment 22 (all benchmarks, ref data): 
                              orig,      patched,  patched/orig
       400.perlbench,        -1.00,      1244.00,     -1244.00
           401.bzip2,      1189.00,      1137.00,         0.96
             403.gcc,       754.00,       750.00,         0.99
             429.mcf,       611.00,       610.00,         1.00
           445.gobmk,      1211.00,      1167.00,         0.96
           456.hmmer,      1834.00,      1501.00,         0.82
           458.sjeng,      1353.00,      1288.00,         0.95
      462.libquantum,       478.00,       480.00,         1.00
         464.h264ref,      1880.00,      1836.00,         0.98
         471.omnetpp,       621.00,       621.00,         1.00
           473.astar,       766.00,       763.00,         1.00
       483.xalancbmk,       515.00,       517.00,         1.00
            433.milc,       631.00,       625.00,         0.99
            444.namd,       538.00,       538.00,         1.00
          447.dealII,       716.00,       719.00,         1.00
          450.soplex,       421.00,       415.00,         0.99
          453.povray,       433.00,       429.00,         0.99
             470.lbm,       415.00,       411.00,         0.99
         482.sphinx3,      1377.00,      1343.00,         0.98

The average speedup is similar to what we saw with equivalent optimization in clang. Strangely, 400.perlbench fails with a warning when built with trunk but passes with this patch. I did not investigate this further yet.

If we are looking for greater speedup we need to perform more comprehensive 
research. I have two wild guesses (not supported by any data). 

#1 afaict, the asan pass happens in the middle of the gcc optimization flow. imho it should happen as late as possible so that the instrumentation 
happens on fully optimized code. 
#2 asan speed is very sensitive to quality of regalloc. It would be interesting
(and useful anyway) to implement zero-offset-shadow
(https://code.google.com/p/address-sanitizer/wiki/ZeroBasedShadow)
and see how much it helps with performance. 
If more than clang's 5% -- we have issues with regalloc, otherwise see #1
Comment 24 Jakub Jelinek 2013-02-07 17:00:17 UTC
(In reply to comment #23)
> #1 afaict, the asan pass happens in the middle of the gcc optimization flow.
> imho it should happen as late as possible so that the instrumentation 
> happens on fully optimized code. 

Our current plan for 4.9 is to add a __builtin_asan_mem_test (address, length, is_write) or similar builtin, where the current asan pass would just insert these builtins.  Then, we'd teach the alias oracle and other code about these builtins (that they shouldn't be optimized away, unless dominated by a similar test on the same address with the same or bigger length, without an intervening call that could free memory, and that they on the other hand don't modify any memory), teach the vectorizer how to vectorize these builtins and look at other passes where it might prevent some optimizations (I guess vectorization will be the most important though).  And, finally, have some later pass that will do the optimization Dodji just wrote, but on the builtins in the IL, with some propagation etc. (and could handle tsan builtins too), and then lower this special asan builtin to the shadow memory load + test + __asan_report*.

> #2 asan speed is very sensitive to quality of regalloc. It would be interesting
> (and useful anyway) to implement zero-offset-shadow
> (https://code.google.com/p/address-sanitizer/wiki/ZeroBasedShadow)
> and see how much it helps with performance. 
> If more than clang's 5% -- we have issues with regalloc, otherwise see #1
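
The elimination rule for the planned builtins can be sketched for the simplest case, a single straight-line basic block, where "dominated by" just means "appears earlier". This is a toy model in C, not GCC code: struct event, addr_id and count_needed_checks are made-up names, every call is conservatively assumed to be able to free memory, and is_write is ignored.

```c
#include <stddef.h>

enum ev_kind { EV_CHECK, EV_CALL };

struct event {
    enum ev_kind kind;
    int addr_id;    /* symbolic identity of the address (EV_CHECK only) */
    size_t len;     /* checked length in bytes (EV_CHECK only) */
};

/* Counts the checks that must be kept in one basic block.  A check is
   dropped if an earlier check on the same address with the same or a
   bigger length exists and no call sits between the two. */
static size_t count_needed_checks(const struct event *ev, size_t n)
{
    size_t kept = 0;
    size_t window_start = 0;    /* first event after the most recent call */

    for (size_t i = 0; i < n; i++) {
        if (ev[i].kind == EV_CALL) {
            window_start = i + 1;   /* earlier checks no longer count */
            continue;
        }
        int covered = 0;
        for (size_t j = window_start; j < i && !covered; j++)
            covered = ev[j].addr_id == ev[i].addr_id
                      && ev[j].len >= ev[i].len;
        if (!covered)
            kept++;
    }
    return kept;
}
```

For the sequence check(a,4), check(a,4), check(a,8), call, check(a,4) the model keeps three checks: the second is covered by the first, the 8-byte one is not covered by a 4-byte one, and the call invalidates everything before it.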
Comment 25 Dmitry Vyukov 2013-02-07 17:18:05 UTC
On Thu, Feb 7, 2013 at 9:00 PM, jakub at gcc dot gnu.org
<gcc-bugzilla@gcc.gnu.org> wrote:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55309
>
> --- Comment #24 from Jakub Jelinek <jakub at gcc dot gnu.org> 2013-02-07 17:00:17 UTC ---
> (In reply to comment #23)
>> #1 afaict, the asan pass happens in the middle of the gcc optimization flow.
>> imho it should happen as late as possible so that the instrumentation
>> happens on fully optimized code.
>
> Our current plan for 4.9 is add __builtin_asan_mem_test (address, length,
> is_write) or similar builtin, where the current asan pass would just insert
> these builtins.  Then, we'd teach the alias oracle and other code about these
> builtins (that they shouldn't be optimized away, unless dominated by similar
> test on the same address with same or bigger length, without an intervening
> call that could free memory, and that they on the other side don't modify any
> memory), teach the vectorizer how to vectorize these builtins and look at other
> passes where it might prevent some optimizations (I guess vectorization will be
> the most important though).  And, finally have some later pass that will do the
> optimization Dodji just wrote, but on the builtins in the IL, with some
> propagation etc. (and could handle tsan builtins too), and then lower this
> special asan builtin to the shadow memory load + test + __asan_report*.

If a memory access is *post* dominated by a memory access to the same
location, then the first one can be eliminated even if there are
intervening function calls, because it's impossible to make an
unaddressable variable addressable again.
This is not true for tsan, though.
Comment 26 Kostya Serebryany 2013-02-08 06:31:26 UTC
FTR: here is the perf data for zero-based offset (clang)
https://code.google.com/p/address-sanitizer/wiki/ZeroBasedShadow#Performance
Comment 27 Jakub Jelinek 2013-02-08 09:02:23 UTC
Zero based offset has the big disadvantage of imposing big requirements on the executable.
Could we on x86_64 think about mem_to_shadow(x) = (x >> 3) + 0x7fff8000 (note, not |, but +)?
Then instead of something like:
        movq    %rdi, %rdx
        movabsq $17592186044416, %rax
        shrq    $3, %rdx
        cmpb    $0, (%rdx,%rax)
        jne     .L5
        movq    (%rdi), %rax
        ret
.L5:
        pushq   %rax
        call    __asan_report_load8
we could emit:
        movq    %rdi, %rdx
        shrq    $3, %rdx
        cmpb    $0, 0x7fff8000(%rdx)
        jne     .L5
        movq    (%rdi), %rax
        ret
.L5:
        pushq   %rax
        call    __asan_report_load8
which is a 7 bytes shorter sequence, without the need for an extra register and the not so cheap movabs insn.  By forcing PIE for everything, you are forcing the PIC overhead of unnecessary extra indirections in many places (and, on non-x86_64 usually it is even much more expensive).
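
The two mappings can be written down in C as a small sketch. The macro and function names are made up for illustration; the constants are the ones from the two asm sequences above (17592186044416 is 1 << 44).

```c
#include <stdint.h>

#define SHADOW_SCALE  3
#define OFFSET_OLD    (UINT64_C(1) << 44)    /* needs movabsq + a register */
#define OFFSET_SHORT  UINT64_C(0x7fff8000)   /* proposed short offset */

/* Shadow address for application address `addr`.  With OFFSET_SHORT the
   addition cannot be replaced by a bitwise OR, since the offset has low
   bits set that overlap with (addr >> 3). */
static uint64_t mem_to_shadow(uint64_t addr, uint64_t offset)
{
    return (addr >> SHADOW_SCALE) + offset;
}

/* Can the offset be encoded as a sign-extended 32-bit displacement in
   an x86_64 addressing mode, as in "cmpb $0, 0x7fff8000(%rdx)"? */
static int fits_in_disp32(uint64_t offset)
{
    return offset <= (uint64_t)INT32_MAX;
}
```

fits_in_disp32 holds for OFFSET_SHORT but not for OFFSET_OLD, which is why the second sequence can fold the offset into the cmpb.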
Comment 28 Kostya Serebryany 2013-02-08 09:13:27 UTC
> Could we on x86_64 think about mem_to_shadow(x) (x >> 3) + 0x7fff8000 (note,
> not |, but +)?

That sounds compelling, but I'm afraid we may have binaries with 2G of text+globals. (!!)
Still, worth investigating. 

I agree with your arguments about not everyone willing to use -pie, 
but many large projects already do this anyway (e.g. Chrome)
Comment 29 Jakub Jelinek 2013-02-08 09:25:22 UTC
I think not in the default memory model; it can support only the first 2GB of code+data.  Otherwise you couldn't call from the start of the executable to a function at the end of it (if the text segment is bigger than 2GB) or reference data located at the end of the data segment from a function at the start of the executable.
So, with the zero offset model, your restriction on programs would essentially be that non-PIE executables (i.e. -mcmodel={small,medium,large}) are unsupported;
with 0x7fff8000 (or perhaps even 0x7ffff000), non-PIE executables with -mcmodel=medium are unsupported and those with -mcmodel=large are unsupported unless linked to an address above the shadow memory end, while -mcmodel=small is supported.
Comment 30 Kostya Serebryany 2013-02-11 14:42:43 UTC
> Could we on x86_64 think about mem_to_shadow(x) (x >> 3) + 0x7fff8000

Committed http://llvm.org/viewvc/llvm-project?rev=174886&view=rev
which adds an optional flag -mllvm -asan-short-64bit-mapping-offset=1

On bzip2/train it gives us ~ 2/3 of the zero-base-offset benefits:
                  orig          0x7fff8000    zero      
401.bzip2,        68.80,        64.80,        62.70

Measuring the rest. 

Note that with clang this did not require any change in the run-time
(since recently we switched to ASAN_FLEXIBLE_MAPPING_AND_OFFSET=1)
Comment 31 Jakub Jelinek 2013-02-11 15:02:25 UTC
If the mapping is so flexible, how can you detect mismatches?  Different scale or shadow offsets are ABI incompatible...
Comment 32 Kostya Serebryany 2013-02-12 06:47:56 UTC
Good news, 0x7fff8000 seems great: 

t0: orig
t1: short offset (0x7fff8000)
t2: zero offset + pie

                  t0       t1     t1/t0   t2    t2/t0  t2/t1
-----------------------------------------------------------
 400.perlbench, 1206.00, 1151.00, 0.95, 1192.00, 0.99, 1.04
     401.bzip2,  884.00,  842.00, 0.95,  821.00, 0.93, 0.98
       403.gcc,  738.00,  722.00, 0.98,  716.00, 0.97, 0.99
       429.mcf,  609.00,  596.00, 0.98,  586.00, 0.96, 0.98
     445.gobmk,  844.00,  804.00, 0.95,  809.00, 0.96, 1.01
     456.hmmer, 1304.00, 1223.00, 0.94, 1235.00, 0.95, 1.01
     458.sjeng,  916.00,  868.00, 0.95,  897.00, 0.98, 1.03
462.libquantum,  547.00,  535.00, 0.98,  534.00, 0.98, 1.00
   464.h264ref, 1328.00, 1313.00, 0.99, 1265.00, 0.95, 0.96
   471.omnetpp,  628.00,  601.00, 0.96,  596.00, 0.95, 0.99
     473.astar,  665.00,  646.00, 0.97,  657.00, 0.99, 1.02
 483.xalancbmk,  480.00,  449.00, 0.94,  445.00, 0.93, 0.99
      433.milc,  709.00,  655.00, 0.92,  656.00, 0.93, 1.00
      444.namd,  636.00,  594.00, 0.93,  593.00, 0.93, 1.00
    447.dealII,  649.00,  615.00, 0.95,  637.00, 0.98, 1.04
    450.soplex,  390.00,  374.00, 0.96,  370.00, 0.95, 0.99
    453.povray,  452.00,  402.00, 0.89,  421.00, 0.93, 1.05
       470.lbm,  389.00,  378.00, 0.97,  387.00, 0.99, 1.02
   482.sphinx3,  980.00,  930.00, 0.95,  926.00, 0.94, 1.00

So, 0x7fff8000 seems to be a win, even compared to pie+zerobase. 
We'll do some more testing and flip the switch in clang.

There is another suggestion (from dvyukov) to use -Wl,-Ttext-segment=0x40000000
together with zerobase (pie is not required) which is worth investigating.
Comment 33 Kostya Serebryany 2013-02-12 07:02:40 UTC
(In reply to comment #31)
> If the mapping is so flexible, how can you detect mismatches?  Different scale
> or shadow offsets are ABI incompatible...
We don't detect mismatches. 
This has never been a problem for our users (who build everything from scratch)
but we do see it as a coming problem as asan is becoming more popular. 

(in reply to comment from another bug)
> Perhaps instead of global vars defined outside of libasan (which e.g. requires
> GOT accesses to those vars in libasan)

Accessing these vars was never a perf problem (we run asan with perf regularly)

> , it might be better to have the scale
> and offset as arguments of __asan_init?  

We did this in the very early version, but it did not work in general. 
Consider you are linking your program with a third-party object 
not built with asan. It may have constructor functions called before main and
before __asan_init, and those functions call malloc which has to 
call __asan_init, but can not pass arguments. 
In some cases we can use .preinit_array to call __asan_init there, but 
that is not always available (?). 

We were (and still are) thinking about encoding the abi version in the name
of the  init function, e.g. __asan_init_v_123. 
It will help us detect abi mismatches when two objects are instrumented with
different generations of asan. 
This doesn't solve the problem of using different offsets though. 



> Then you could easily test at runtime,
> whether all compilation units agree on the same offset/scale, and complain if
> they don't.  Then __asan_mapping_offset and __asan_mapping_scale or how are the
> vars called could be hidden attribute, used with PC relative addressing and
> avoid one extra indirection, and more importantly have better runtime checking
> of mismatches.
Comment 34 Jakub Jelinek 2013-02-12 08:39:33 UTC
(In reply to comment #32)
> Good news, 0x7fff8000 seems great: 
> There is another suggestion (from dvyukov) to use -Wl,-Ttext-segment=0x40000000
> together with zerobase (pie is not required) which is worth investigating.

Glad to hear that.  The disadvantage of
-Wl,-Ttext-segment=0x40000000 is that it requires a special command line option for building the executable, i.e. you can't e.g. just build some shared library with -fsanitize=address and leave the main executable non-instrumented.
Plus, I don't see how
-Wl,-Ttext-segment=0x40000000 can be used for x86_64, where you need 16TB of shadow memory for >> 3 scale.  For zero shadow offset you'd need to place the executable above 16TB, and that implies non-small model.
If -Ttext-segment is meant for 32-bit programs, then it could allow zero shadow offset, but with the disadvantage of special building of executables, and on i?86 the offset already fits into the immediates, so it is basically the 0x7fff8000 case for x86_64 already.

(In reply to comment #33)
> > , it might be better to have the scale
> > and offset as arguments of __asan_init?  
>
> We did this in the very early version, but it did not work in general. 
> Consider you are linking your program with a third-party object 
> not built with asan. It may have constructor functions called before main and
> before __asan_init, and those functions call malloc which has to 
> call __asan_init, but can not pass arguments.

I see, but then you could use the global vars (perhaps weak ones in libasan with some default), combined together with arguments to __asan_init (or some alternative name of the same function for compatibility).  All that it would do beyond normal initialization would be to complain if the requested scale/offset pair is different from the chosen one.
Comment 35 Dmitry Vyukov 2013-02-12 08:47:21 UTC
On Tue, Feb 12, 2013 at 12:39 PM, jakub at gcc dot gnu.org
<gcc-bugzilla@gcc.gnu.org> wrote:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55309
>
> --- Comment #34 from Jakub Jelinek <jakub at gcc dot gnu.org> 2013-02-12 08:39:33 UTC ---
> (In reply to comment #32)
>> Good news, 0x7fff8000 seems great:
>> There is another suggestion (from dvyukov) to use -Wl,-Ttext-segment=0x40000000
>> together with zerobase (pie is not required) which is worth investigating.
>
> Glad to hear that.  The disadvantage of
> -Wl,-Ttext-segment=0x40000000 is that it requires special command line option
> for building the executable, i.e. you can't e.g. just build some shared library
> with -fsanitize=address and leave the main executable non-instrumented.
> Plus, I don't see how can
> -Wl,-Ttext-segment=0x40000000 be used for x86_64, where you need 16TB of shadow
> memory for >> 3 scale.  For zero shadow offset you'd need to place the
> executable above 16TB, and that implies non-small model.

It is intended for x86_64. The binary is situated at 0x40000000 and
its shadow is at 0x10000000-0x3fffffff (MAP_32BIT can live here as
well).
Dynamic libraries and mmap live either at 0x7fxxxxxxxxxx or at
0x55xxxxxxxxxx, that is, mapped way above the executable. So there are
no overlaps.





> If -Ttext-segment is meant for 32-bit programs, then it could allow zero shadow
> offset, but with the disadvantage of special building of executables, and on
> i?86 the offset already fits into the immediates, so it is basically the
> 0x7fff8000 case for x86_64 already.
>
> (In reply to comment #33)
>> > , it might be better to have the scale
>> > and offset as arguments of __asan_init?
>>
>> We did this in the very early version, but it did not work in general.
>> Consider you are linking your program with a third-party object
>> not built with asan. It may have constructor functions called before main and
>> before __asan_init, and those functions call malloc which has to
>> call __asan_init, but can not pass arguments.
>
> I see, but then you could use the global vars (perhaps weak ones in libasan
> with some default), combined together with arguments to __asan_init (or some
> alternative name of the same function for compatibility).  All that it would do
> beyond normal initialization would be complain if the requested scale/offset
> pair is different from the chosen one.
>
Comment 36 Kostya Serebryany 2013-02-12 08:58:56 UTC
> I see, but then you could use the global vars (perhaps weak ones in libasan
> with some default), combined together with arguments to __asan_init (or some
> alternative name of the same function for compatibility).  All that it would do
> beyond normal initialization would be complain if the requested scale/offset
> pair is different from the chosen one.

Maybe we could add calls to e.g. 
  __asan_check_abi_mismatch(uptr a1, uptr a2, uptr a3, uptr a4, uptr a5, uptr a6)
after every call to __asan_init
(a1 == offset, a2 == shift, a3 == something_else, etc)

if any of a1..a6 is different between the calls to __asan_check_abi_mismatch -- fire an error.
WDYT?
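
The check proposed above could be sketched roughly as follows. This is only an illustration, not code from either runtime: the name check_abi_mismatch_sketch and the return-code convention are hypothetical (a real runtime would presumably report and abort), and the meaning of a3..a6 is left open, as in the proposal.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uintptr_t uptr;

enum { ABI_PARAM_COUNT = 6 };

/* Parameters recorded by the first caller; later callers are compared
   against them. */
static uptr first_seen[ABI_PARAM_COUNT];
static int have_first_seen = 0;

/* Hypothetical stand-in for the proposed __asan_check_abi_mismatch.
   Returns 0 when the parameters match the first call (or this is the
   first call) and 1 on a mismatch. */
int check_abi_mismatch_sketch(uptr a1, uptr a2, uptr a3,
                              uptr a4, uptr a5, uptr a6) {
    const uptr now[ABI_PARAM_COUNT] = { a1, a2, a3, a4, a5, a6 };
    assert(sizeof now == sizeof first_seen);

    if (!have_first_seen) {
        /* First instrumented module: remember its parameters. */
        memcpy(first_seen, now, sizeof now);
        have_first_seen = 1;
        return 0;
    }
    for (int i = 0; i < ABI_PARAM_COUNT; i++) {
        if (now[i] != first_seen[i]) {
            fprintf(stderr, "ABI mismatch in a%d: %#llx vs %#llx\n", i + 1,
                    (unsigned long long)first_seen[i],
                    (unsigned long long)now[i]);
            return 1;
        }
    }
    return 0;
}
```

Each instrumented module would call this right after its __asan_init call, passing the offset and shift it was compiled with.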
Comment 37 Kostya Serebryany 2013-02-12 11:17:45 UTC
http://llvm.org/viewvc/llvm-project?rev=174957&view=rev (and r174958)
change the default offset for x86_64 to 0x7fff8000
and rename __asan_init to __asan_init_v1
Comment 38 Kostya Serebryany 2013-02-12 11:31:20 UTC
Unfortunately, this does not work on Mac, so we will have to keep the old 
mapping on Mac. grrrrr
Comment 39 Jakub Jelinek 2013-02-12 11:42:33 UTC
So, if Darwin keeps the old 1ULL << 44, then the corresponding gcc change (to be applied together with asan merge) would be something like (untested):
--- gcc/sanitizer.def	2013-01-11 09:02:37.879637130 +0100
+++ gcc/sanitizer.def	2013-02-12 12:39:12.743272092 +0100
@@ -27,7 +27,7 @@ along with GCC; see the file COPYING3.
    for other FEs by asan.c.  */
 
 /* Address Sanitizer */
-DEF_SANITIZER_BUILTIN(BUILT_IN_ASAN_INIT, "__asan_init",
+DEF_SANITIZER_BUILTIN(BUILT_IN_ASAN_INIT, "__asan_init_v1",
 		      BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
 /* Do not reorder the BUILT_IN_ASAN_REPORT* builtins, e.g. cfgcleanup.c
    relies on this order.  */
--- gcc/config/i386/i386.c	2013-02-12 11:23:35.400193705 +0100
+++ gcc/config/i386/i386.c	2013-02-12 12:38:30.775503155 +0100
@@ -5436,7 +5436,9 @@ ix86_legitimate_combined_insn (rtx insn)
 static unsigned HOST_WIDE_INT
 ix86_asan_shadow_offset (void)
 {
-  return (unsigned HOST_WIDE_INT) 1 << (TARGET_LP64 ? 44 : 29);
+  return TARGET_LP64 ? (TARGET_MACHO ? (HOST_WIDE_INT_1 << 44)
+				     : HOST_WIDE_INT_C (0x7fff8000))
+		     : (HOST_WIDE_INT_1 << 29);
 }
 

 /* Argument support functions.  */
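
For reference, a minimal sketch of what the two x86_64 offsets mean for the shadow address computation, assuming the usual mapping shadow = (addr >> 3) + offset. The helper names are made up for illustration; the point is only that 0x7fff8000 fits into a sign-extended 32-bit immediate while 1 << 44 does not.

```c
#include <assert.h>
#include <stdint.h>

enum { SHADOW_SCALE = 3 };  /* one shadow byte covers 8 application bytes */

/* Previous x86_64 default (kept on Darwin) and the new Linux default. */
static const uint64_t OFFSET_OLD = UINT64_C(1) << 44;
static const uint64_t OFFSET_NEW = UINT64_C(0x7fff8000);

/* Assumed mapping: shadow = (addr >> scale) + offset. */
uint64_t mem_to_shadow(uint64_t addr, uint64_t offset) {
    assert((addr >> SHADOW_SCALE) <= UINT64_MAX - offset);  /* no wraparound */
    return (addr >> SHADOW_SCALE) + offset;
}

/* Nonzero if the value can be encoded as a sign-extended 32-bit
   immediate/displacement, i.e. folded into the memory operand. */
int fits_in_simm32(uint64_t value) {
    return value <= (uint64_t)INT32_MAX;
}
```

With the new offset the addition can be part of the addressing mode of the shadow load; with the old one the constant has to be materialized in a register first.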
Comment 40 Jack Howarth 2013-02-12 14:00:15 UTC
(In reply to comment #23)

> #1 afaict, the asan pass happens in the middle of the gcc optimization flow.
> imho it should happen as late as possible so that the instrumentation 
> happens on fully optimized code. 

I can confirm this is the case from my experiments compiling xplor-nih with -fsanitize=address. This code is habitually miscompiled by gfortran at the higher optimization levels. Adding the -fsanitize=address flag to the build suppresses most of the xplor-nih testsuite failures, indicating that it has changed the code optimization in gfortran. Is there any chance of moving the asan pass or is that definitely stage 1 material?
Comment 41 Jakub Jelinek 2013-02-12 14:11:28 UTC
That is definitely stage1 material, and a lot of work, especially to teach the vectorizer how to deal with these.  And, we don't want to introduce the asan instrumentation too late, e.g. vectorization often reads even from memory outside of what the source code actually accesses, when it e.g. knows it is sufficiently aligned and won't cause crashes.  That would be false positives for asan.
Comment 42 Jack Howarth 2013-02-12 14:41:56 UTC
(In reply to comment #41)

FYI, most of the codegen issues with xplor-nih compiled with gfortran can be suppressed with -fno-tree-vectorize at -O3 (hence my interest in a functional libasan on darwin).
Comment 43 Kostya Serebryany 2013-02-22 07:11:06 UTC
gcc r196201:  -O2 -fno-aggressive-loop-optimizations
clang r175735: -O2

x86_64 linux, both are using the new 7fff8000 shadow offset

                           clang          gcc            diff
       400.perlbench,      1136.00,        -1.00,        -0.00
           401.bzip2,       838.00,      1154.00,         1.38
             403.gcc,       716.00,       742.00,         1.04
             429.mcf,       582.00,       578.00,         0.99
           445.gobmk,       801.00,      1138.00,         1.42
           456.hmmer,      1277.00,      1515.00,         1.19
           458.sjeng,       869.00,      1258.00,         1.45
      462.libquantum,       532.00,       469.00,         0.88
         464.h264ref,      1303.00,      4395.00,         3.37
         471.omnetpp,       568.00,       585.00,         1.03
           473.astar,       647.00,       748.00,         1.16
       483.xalancbmk,       460.00,       534.00,         1.16
            433.milc,       659.00,       614.00,         0.93
            444.namd,       592.00,       531.00,         0.90
          447.dealII,       614.00,       706.00,         1.15
          450.soplex,       367.00,       406.00,         1.11
          453.povray,       423.00,       410.00,         0.97
             470.lbm,       377.00,       401.00,         1.06
         482.sphinx3,       958.00,      1325.00,         1.38

400.perlbench fails with a global-buffer-overflow which clang does not detect.
I did not investigate why. It could be a gcc false positive or clang false negative.

464.h264ref is VERY slow; I have not looked into why.
Comment 44 Joost VandeVondele 2013-02-22 08:31:11 UTC
(In reply to comment #43)
> 400.perlbench fails with a global-buffer-overflow which clang does not detect.

I'm wondering if the failure goes away compiled with -O0 instead ?
Comment 45 Kostya Serebryany 2013-02-22 08:36:14 UTC
> I'm wondering if the failure goes away compiled with -O0 instead ?
No, the failure is still present with -O0
Comment 46 Jakub Jelinek 2013-02-22 13:09:10 UTC
(In reply to comment #43)
> 400.perlbench fails with a global-buffer-overflow which clang does not detect.
> I did not investigate why. It could be a gcc false positive or clang false
> negative.

In which file/function was the global-buffer-overflow?  Can you send me the asan diagnostics?

> 464.h264ref is VERY slow, I did not look why.

And it didn't fail on that:
    for (dd=d[k=0]; k<16; dd=d[++k])
    {
      satd += (dd < 0 ? -dd : dd);
    }
or have you fixed that up in your SPEC sources?
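
To spell out why that loop is expected to trip asan: the increment expression reads d[++k] before the k<16 test, so the last read is d[16]. The sketch below mirrors the loop shape with an instrumented read; read_at, highest_index_read and both function names are invented for the demo, and it assumes d has 16 elements in the SPEC source (the demo itself must be given 17 so that it stays in bounds).

```c
#include <assert.h>
#include <stddef.h>

/* Highest index that read_at() has been asked for so far. */
static int highest_index_read = -1;

static int read_at(const int *d, int k) {
    if (k > highest_index_read) highest_index_read = k;
    return d[k];
}

/* Same loop shape as the quoted snippet. The caller must pass at least
   17 readable elements, because index 16 is read on the final step. */
int satd_quoted_shape(const int *d) {
    int satd = 0, k, dd;
    assert(d != NULL);
    for (dd = read_at(d, k = 0); k < 16; dd = read_at(d, ++k)) {
        satd += (dd < 0 ? -dd : dd);
    }
    return satd;
}

/* Rewritten so that an element is read only after the bound check. */
int satd_in_bounds(const int *d) {
    int satd = 0;
    assert(d != NULL);
    for (int k = 0; k < 16; k++) {
        int dd = d[k];
        satd += (dd < 0 ? -dd : dd);
    }
    return satd;
}
```

Both variants compute the same sum; only the first one touches the element one past the 16 that are summed.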
Comment 47 Kostya Serebryany 2013-02-22 13:52:12 UTC
(In reply to comment #46)
> (In reply to comment #43)
> > 400.perlbench fails with a global-buffer-overflow which clang does not detect.
> > I did not investigate why. It could be a gcc false positive or clang false
> > negative.
> 
> On which file/function the global-buffer-overflow was?  Can you send me the
> asan diagnostics?

Interestingly, the symbolization/debuginfo seems to be completely broken :( 

% g++ -g -fsanitize=address ./use-after-free.cc -static-libasan ; ./a.out  2>&1 | grep '#0'
    #0 0x4179c2 (/home/kcc/tmp/a.out+0x4179c2)
    #0 0x40f18a (/home/kcc/tmp/a.out+0x40f18a)
    #0 0x40f26a (/home/kcc/tmp/a.out+0x40f26a)
% addr2line -f -e ./a.out 0x4179c2 0x40f18a 0x40f26a 
main
??:0
free
??:0
malloc
??:0
% 

==580== ERROR: AddressSanitizer: global-buffer-overflow on address 0x00000078e2a5 at pc 0x4e47d7 bp 0x7fffa2fbc7b0 sp 0x7fffa2fbc7a8
READ of size 1 at 0x00000078e2a5 thread T0
    #0 0x4e47d6 in PerlIO_find_layer (benchspec/CPU2006/400.perlbench/run/run_base_train_z.0000/perlbench_base.z+0x4e47d6)
    #1 0x4e63e6 in PerlIO_default_buffer (benchspec/CPU2006/400.perlbench/run/run_base_train_z.0000/perlbench_base.z+0x4e63e6)
    #2 0x4e678e in PerlIO_default_layers (benchspec/CPU2006/400.perlbench/run/run_base_train_z.0000/perlbench_base.z+0x4e678e)
    #3 0x4e7a41 in PerlIO_resolve_layers (benchspec/CPU2006/400.perlbench/run/run_base_train_z.0000/perlbench_base.z+0x4e7a41)
    #4 0x4e8145 in PerlIO_openn (benchspec/CPU2006/400.perlbench/run/run_base_train_z.0000/perlbench_base.z+0x4e8145)
    #5 0x4f5d32 in PerlIO_open (benchspec/CPU2006/400.perlbench/run/run_base_train_z.0000/perlbench_base.z+0x4f5d32)
    #6 0x4dd808 in S_open_script (benchspec/CPU2006/400.perlbench/run/run_base_train_z.0000/perlbench_base.z+0x4dd808)
    #7 0x4d3be6 in S_parse_body (benchspec/CPU2006/400.perlbench/run/run_base_train_z.0000/perlbench_base.z+0x4d3be6)
    #8 0x4d2a4b in perl_parse (benchspec/CPU2006/400.perlbench/run/run_base_train_z.0000/perlbench_base.z+0x4d2a4b)
    #9 0x4f6ee8 in main (benchspec/CPU2006/400.perlbench/run/run_base_train_z.0000/perlbench_base.z+0x4f6ee8)
    #10 0x7fd3a245376c (/lib/x86_64-linux-gnu/libc.so.6+0x2176c)
    #11 0x4037d8 (benchspec/CPU2006/400.perlbench/run/run_base_train_z.0000/perlbench_base.z+0x4037d8)
0x00000078e2a5 is located 0 bytes to the right of global variable '*.LC50 (perlio.c)' (0x78e2a0) of size 5
  '*.LC50 (perlio.c)' is ascii string 'unix'
SUMMARY: AddressSanitizer: global-buffer-overflow ??:0 PerlIO_find_layer

> 
> > 464.h264ref is VERY slow, I did not look why.
> 
> And it didn't fail on that:
>     for (dd=d[k=0]; k<16; dd=d[++k])
>     {
>       satd += (dd < 0 ? -dd : dd);
>     }
> or have you fixed that up in your SPEC sources?

Interestingly, no. I haven't touched SPEC sources here. 
Maybe gcc fully unrolls the loop, thus eliminating the buggy read (I did not check).
Comment 48 Joost VandeVondele 2013-02-22 13:55:16 UTC
(In reply to comment #47)
> 
> Interestingly, the symbolization/debuginfo seems to be completely broken :( 
> 
I've tried compiling with -gdwarf-3, with some luck
Comment 49 Kostya Serebryany 2013-02-22 14:29:27 UTC
with -gdwarf-3: 
==11621== ERROR: AddressSanitizer: global-buffer-overflow on address 0x00000078e2a5 at pc 0x4e47d7 bp 0x7fff553d4cc0 sp 0x7fff553d4cb8
READ of size 1 at 0x00000078e2a5 thread T0
    #0 0x4e47d6 in PerlIO_find_layer perlio.c:751
    #1 0x4e63e6 in PerlIO_default_buffer perlio.c:1015
    #2 0x4e678e in PerlIO_default_layers perlio.c:1113
    #3 0x4e7a41 in PerlIO_resolve_layers perlio.c:1433
    #4 0x4e8145 in PerlIO_openn perlio.c:1519
    #5 0x4f5c08 in PerlIO_fdopen perlio.c:4745
    #6 0x4e68a3 in PerlIO_stdstreams perlio.c:1150
    #7 0x4f5b46 in Perl_PerlIO_stdin perlio.c:4686
    #8 0x4dd7ee in S_open_script perl.c:3348
    #9 0x4d3be6 in S_parse_body perl.c:1718
    #10 0x4d2a4b in perl_parse perl.c:1312
    #11 0x4f6ee8 in main perlmain.c:96
    #12 0x7f686f32576c in __libc_start_main libc-start.c:226
    #13 0x4037d8 in _start ??:0
0x00000078e2a5 is located 0 bytes to the right of global variable '*.LC50 (perlio.c)' (0x78e2a0) of size 5
  '*.LC50 (perlio.c)' is ascii string 'unix'
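
The "0 bytes to the right" wording is plain address arithmetic on the numbers in the report: the global starts at 0x78e2a0 and is 5 bytes long ("unix" plus the terminating NUL), so 0x78e2a5 is the first byte past it. A small sketch, with a made-up helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Distance in bytes from the end of an object to the accessed address.
   0 means the access hit the first byte past the object.
   Precondition: addr is at or beyond the end of the object. */
uint64_t bytes_to_the_right(uint64_t addr, uint64_t object_start,
                            uint64_t object_size) {
    uint64_t object_end = object_start + object_size;  /* one past last byte */
    assert(addr >= object_end);
    return addr - object_end;
}
```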
Comment 50 Kostya Serebryany 2013-02-22 14:54:24 UTC
reproducer: 

#include <string.h>
#include <stdio.h>
int foo(const char *x, const char *y, int len) {
  return memcmp(x, y, len);
}
int main() {
  printf("%d\n", foo("perlio", "unix", 6));
}

clang does not report a warning here, but gcc does. 
This is a gray area for me; I am not sure whether we should treat this as buggy code. 

on one hand, memcmp gets size=6, while one of the buffers is smaller. 
otoh, the first bytes of the strings are different and memcmp should not read the rest. 

I vaguely remember some similar case where we decided that the code is correct. 
Anyone?
Comment 51 Jakub Jelinek 2013-02-22 15:01:08 UTC
Looks like a real SPEC bug to me.

PerlIO_funcs *
PerlIO_find_layer(pTHX_ const char *name, STRLEN len, int load)
{
    IV i;
    if ((SSize_t) len <= 0)
        len = strlen(name);
    for (i = 0; i < PL_known_layers->cur; i++) {
        PerlIO_funcs *f = PL_known_layers->array[i].funcs;
        if (memEQ(f->name, name, len) && f->name[len] == 0) {
            PerlIO_debug("%.*s => %p\n", (int) len, name, (void*)f);
            return f;
        }
    }

memEQ is memcmp, and my reading of ISO C99 or http://pubs.opengroup.org/onlinepubs/9699919799/functions/memcmp.html is that it is a bug to call memcmp ("abcdef", "defg", 6).  A valid memcmp implementation could preread all bytes from both arrays (of the given length) and only then compare.  And at least some implementations (e.g. glibc string/memcmp.c) do that if the two strings aren't starting at the same address modulo size of word.
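
As an illustration of the point about prereading (this is not glibc's code, just a hypothetical implementation that would still be conforming under this reading): a memcmp written without an early exit touches all n bytes of both arguments even when the first bytes already differ, so both objects really need to be at least n bytes long.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical memcmp variant: walks all n bytes of both buffers and
   remembers only the first difference, instead of stopping there. */
int memcmp_no_early_exit(const void *a, const void *b, size_t n) {
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    int result = 0;
    assert(n == 0 || (pa != NULL && pb != NULL));
    for (size_t i = 0; i < n; i++) {
        int diff = (int)pa[i] - (int)pb[i];
        if (result == 0) result = diff;  /* keep the first nonzero difference */
    }
    return result;
}
```

Called with valid buffers it returns a value with the same sign as memcmp; called with a 5-byte object and n == 6 it reads one byte past the end.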
Comment 52 Jakub Jelinek 2013-02-22 15:03:31 UTC
CCing Joseph for expert opinion on whether memcmp ("abcdef", "qrst", 6); is valid C99.
Comment 53 Kostya Serebryany 2013-02-22 15:06:25 UTC
The interceptor we have is conservative: 

INTERCEPTOR(int, memcmp, const void *a1, const void *a2, uptr size) {
  if (!asan_inited) return internal_memcmp(a1, a2, size);
  ENSURE_ASAN_INITED();
  unsigned char c1 = 0, c2 = 0;
  const unsigned char *s1 = (const unsigned char*)a1;
  const unsigned char *s2 = (const unsigned char*)a2;
  uptr i;
  for (i = 0; i < size; i++) {
    c1 = s1[i];
    c2 = s2[i];
    if (c1 != c2) break;
  }
  ASAN_READ_RANGE(s1, Min(i + 1, size));
  ASAN_READ_RANGE(s2, Min(i + 1, size));
  return CharCmp(c1, c2);
} 

looks like gcc partially inlines memcmp and 
bypasses our conservative interceptor.

We could make the interceptor more strict (ASAN_READ_RANGE(s2, size);).
I am trying to remember why we didn't do this...
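
To make the difference concrete, here is a small sketch of how many bytes each variant would pass to ASAN_READ_RANGE. The helper names are invented; the first function just repeats the Min(i + 1, size) computation from the interceptor above.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes checked by the conservative interceptor: up to and including
   the first differing byte, capped at size. */
size_t conservative_checked_len(const void *a1, const void *a2, size_t size) {
    const unsigned char *s1 = a1;
    const unsigned char *s2 = a2;
    size_t i;
    assert(size == 0 || (s1 != NULL && s2 != NULL));
    for (i = 0; i < size; i++) {
        if (s1[i] != s2[i]) break;
    }
    return (i + 1 < size) ? i + 1 : size;
}

/* Bytes checked by the stricter variant: always the full size. */
size_t strict_checked_len(size_t size) {
    return size;
}
```

For the reproducer, foo("perlio", "unix", 6), the conservative variant checks 1 byte of each argument and finds nothing, while the strict one checks 6 bytes of the 5-byte "unix" literal and reports.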
Comment 54 Jakub Jelinek 2013-02-22 15:13:34 UTC
gcc instruments many of the builtins inline, on the assumption that the builtins are often expanded inline and thus the interceptor might not be called at all.  Either the interceptor isn't called, or it is, and then the gcc instrumentation is done in addition to the interceptor's instrumentation.
Comment 55 jsm-csl@polyomino.org.uk 2013-02-22 16:10:49 UTC
I believe the arguments to memcmp must point to objects with at least the 
given number of bytes.  (For strcmp, they must point to NUL-terminated 
strings.  For strncmp, they must point to objects that either have at 
least the given number of bytes or have bytes present up to a NUL within 
that number of bytes - there's no guarantee that comparison stops early 
when characters differ except for not reading after a NUL.  By comparison, 
the array passed to memchr may be shorter than the given length if a 
matching character is found early - see the wording added in C11 for 
memchr for alignment with POSIX.  But memcmp has no such special rule.)
Comment 56 Kostya Serebryany 2013-02-26 07:43:19 UTC
http://llvm.org/viewvc/llvm-project?rev=176078&view=rev
makes memcmp interceptor more aggressive, so that clang finds this bug 
in perlbmk too.
Comment 57 Kostya Serebryany 2013-02-28 11:31:54 UTC
I've created a page that describes how I run SPEC with asan. 
There is also a patch that works around the known SPEC bugs.

https://code.google.com/p/address-sanitizer/wiki/RunningSpecBenchmarks
https://code.google.com/p/address-sanitizer/source/browse/trunk/spec/spec2006-asan.patch
Comment 58 Kostya Serebryany 2014-01-27 08:11:33 UTC
FTR, here are the new numbers; except for 464.h264ref looks good.
clang r199888, gcc r207025
flags: -O2 -fsanitize=address
machine: Dell 3500 (Intel(R) Xeon(R) CPU W3690  @ 3.47GHz)

                           clang          gcc            diff
       400.perlbench,      1286.00,        -1.00,        -0.00
           401.bzip2,       857.00,       940.00,         1.10
             403.gcc,       621.00,       606.00,         0.98
             429.mcf,       578.00,       574.00,         0.99
           445.gobmk,       860.00,       850.00,         0.99
           456.hmmer,       880.00,      1149.00,         1.31
           458.sjeng,       992.00,       996.00,         1.00
      462.libquantum,       492.00,       483.00,         0.98
         464.h264ref,      1274.00,      3998.00,         3.14
         471.omnetpp,       566.00,       569.00,         1.01
           473.astar,       661.00,       647.00,         0.98
       483.xalancbmk,       478.00,       491.00,         1.03
            433.milc,       620.00,       611.00,         0.99
            444.namd,       601.00,       528.00,         0.88
          447.dealII,       624.00,       670.00,         1.07
          450.soplex,       366.00,       389.00,         1.06
          453.povray,       430.00,       374.00,         0.87
             470.lbm,       355.00,       452.00,         1.27
         482.sphinx3,       926.00,      1108.00,         1.20
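
The diff column in the table appears to be the gcc time divided by the clang time, rounded to two decimals (e.g. 940/857 gives 1.10). A sketch of that computation in integer arithmetic, with a made-up function name:

```c
#include <assert.h>

/* gcc time divided by clang time, in hundredths, rounded to nearest.
   A result of 110 corresponds to 1.10 in the table. */
long ratio_hundredths(long clang_time, long gcc_time) {
    assert(clang_time > 0 && gcc_time >= 0);
    return (gcc_time * 100 + clang_time / 2) / clang_time;
}
```

By this measure only 464.h264ref (3.14) is far from parity; 456.hmmer (1.31) and 470.lbm (1.27) are the next largest gaps.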
Comment 59 Markus Trippelsdorf 2014-01-27 08:22:09 UTC
(In reply to Kostya Serebryany from comment #58)
> FTR, here are the new numbers; except for 464.h264ref looks good.

Thanks. Let's close this bug then.