Bug 53726 - [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
Summary: [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
Status: RESOLVED WORKSFORME
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.8.0
Importance: P3 normal
Target Milestone: 4.8.0
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-06-20 06:10 UTC by Vladimir Yakovlev
Modified: 2012-06-22 23:04 UTC
CC List: 4 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-06-20 00:00:00


Attachments:
Test case and assemblers (6.42 KB, application/x-zip-compressed), 2012-06-20 06:13 UTC, Vladimir Yakovlev
Executable test case (830.15 KB, application/x-zip-compressed), 2012-06-20 10:50 UTC, Vladimir Yakovlev

Description Vladimir Yakovlev 2012-06-20 06:10:34 UTC
After fix

r188261 | rguenth | 2012-06-06 13:45:27 +0400 (Wed, 06 Jun 2012) | 23 lines

2012-06-06  Richard Guenther  <rguenther@suse.de>

        PR tree-optimization/53081
        * tree-data-ref.h (adjacent_store_dr_p): Rename to ...
        (adjacent_dr_p): ... this and make it work for reads, too.
        * tree-loop-distribution.c (enum partition_kind): Add PKIND_MEMCPY.
        (struct partition_s): Change main_stmt to main_dr, add
        secondary_dr member.
        (build_size_arg_loc): Change to take data-reference and not
        gimplify here.
        (build_addr_arg_loc): New function split out from ...
        (generate_memset_builtin): ... here.  Use it and simplify.
        (generate_memcpy_builtin): New function.
        (generate_code_for_partition): Adjust.
        (classify_partition): Streamline pattern detection.  Detect
        memcpy.
        (ldist_gen): Adjust.
        (tree_loop_distribution): Adjust seed statements for memcpy
        recognition.

        * gcc.dg/tree-ssa/ldist-20.c: New testcase.
        * gcc.dg/tree-ssa/loop-19.c: Add -fno-tree-loop-distribute-patterns.

Regression: 11% on Atom, 30% on Sandy Bridge. The fix led to unrecognition of memcpy. A reduced test case and assembler files are attached. Command line to reproduce:

gcc -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static  -march=corei7 -mtune=corei7   test.c
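For orientation, the copy loop at issue has roughly the following shape. This is a hypothetical reconstruction, since the real test.c is only in the attachment; the variable names and bound are inferred from the later comments about the 8-byte tem array:

```c
/* Hypothetical reconstruction of the kind of loop in the reduced
   testcase: a byte-wise copy whose length c is variable at runtime
   but bounded (0..8) by the size of a small local array.
   -ftree-loop-distribute-patterns (extended by r188261) recognizes
   this loop and replaces it with a memcpy call, which is where the
   slowdown for tiny sizes comes from. */
void copy_block(unsigned char *dst, const unsigned char *src, int c)
{
    int i;
    for (i = 0; i < c; i++)      /* c is at most 8 in the testcase */
        dst[i] = src[i];
}
```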
Comment 1 Vladimir Yakovlev 2012-06-20 06:13:26 UTC
Created attachment 27658 [details]
Test case and assemblers
Comment 2 Richard Biener 2012-06-20 09:27:52 UTC
You mean the fix led to recognition of memcpy?  At least I see memcpy
calls in the bad assembly.

There is always a cost consideration for memcpy - does performance recover
with -minline-all-stringops?  I suppose BC is actually very small?

The testcase does not include a runtime part so I can't check myself.

Definitely a byte-wise copy loop as in the .good assembly variant,

 .L5:
-       .loc 1 14 0 is_stmt 1 discriminator 2
-       movzbl  16(%esp,%eax), %edx
-       movb    %dl, (%esi,%eax)
-       leal    1(%eax), %eax
-.LVL5:
-       cmpl    %ebx, %eax
-       jl      .L5

does not look good - even a rep movb should be faster, no?
Comment 3 Vladimir Yakovlev 2012-06-20 10:48:11 UTC
I added an executable testcase. Command line to compile:

gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static  -march=corei7 -mtune=corei7   test.c m.c

Run results

Wed Jun 20 14:39:05: /gnumnt/msticlxl25_users/vbyakovl/1020/test$ time ./test.corei7.bad.exe                                                                                                                                                
 
real    0m6.317s
user    0m6.290s
sys     0m0.002s
Wed Jun 20 14:39:24: /gnumnt/msticlxl25_users/vbyakovl/1020/test$ time ./test.corei7.good.exe                                                                                                                                               
 
real    0m4.815s
user    0m4.713s
sys     0m0.000s
Comment 4 Vladimir Yakovlev 2012-06-20 10:50:28 UTC
Created attachment 27664 [details]
Executable test case
Comment 5 Richard Biener 2012-06-20 11:48:13 UTC
Ok.  A rep movsb; is as slow as a memcpy call (-mstringop-strategy=rep_byte -minline-all-stringops).  -minline-all-stringops itself is nearly as fast
as -fno-tree-loop-distribute-patterns.

To answer my own question, BC is between zero and 7.

But I really wonder why the rep movsb is slower than the explicit byte-copy
loop ...

We do seem to seriously hose the CFG though - with PGO we get a nice
loop nest CFG and the speed from before the patch - even when it uses
a memcpy call.
Comment 6 Richard Biener 2012-06-20 12:31:22 UTC
Btw, I cannot reproduce the slowdown on 64bit and the 32bit memcpy in glibc
simply does a rep movsb; for any size lower than 20 bytes ... but as I have
been told rep movsb; setup cost is prohibitively high on most Intel CPUs ...

Thus, I suppose you should look at improving the memcpy implementation for
small sizes on 32bits.
Comment 7 H.J. Lu 2012-06-20 12:46:57 UTC
(In reply to comment #6)
> Btw, I cannot reproduce the slowdown on 64bit and the 32bit memcpy in glibc

Which glibc are you using?

> simply does a rep movsb; for any size lower than 20 bytes ... but as I have
> been told rep movsb; setup cost is prohibitively high on most Intel CPUs ...
> 

I didn't see rep movsb in 32-bit memcpy in glibc 2.14.
Comment 8 rguenther@suse.de 2012-06-20 13:30:21 UTC
On Wed, 20 Jun 2012, hjl.tools at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
> 
> --- Comment #7 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 12:46:57 UTC ---
> (In reply to comment #6)
> > Btw, I cannot reproduce the slowdown on 64bit and the 32bit memcpy in glibc
> 
> Which glibc are you using?
> 
> > simply does a rep movsb; for any size lower than 20 bytes ... but as I have
> > been told rep movsb; setup cost is prohibitively high on most Intel CPUs ...
> > 
> 
> I didn't see rep movsb in 32-bit memcpy in glibc 2.14.

The one from openSUSE 12.1.  Btw, this is with static linking, so I
suppose IFUNC and friends do not apply(?)
Comment 9 H.J. Lu 2012-06-20 14:19:29 UTC
(In reply to comment #3)
> I added executable testcase. Command line to compile
> 
> gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static  -march=corei7
> -mtune=corei7   test.c m.c
> 

Please compare results without -static.
Comment 10 Vladimir Yakovlev 2012-06-20 14:34:32 UTC
I've tried without -static. Runtimes are still the same.
Comment 11 Richard Biener 2012-06-20 14:36:00 UTC
(In reply to comment #9)
> (In reply to comment #3)
> > I added executable testcase. Command line to compile
> > 
> > gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static  -march=corei7
> > -mtune=corei7   test.c m.c
> > 
> 
> Please compare results without -static.

Same results.  memcpy ends up here:

Dump of assembler code for function memcpy:
=> 0xf7ed0cc0 <+0>:     push   %edi
   0xf7ed0cc1 <+1>:     push   %esi
   0xf7ed0cc2 <+2>:     mov    0xc(%esp),%edi
   0xf7ed0cc6 <+6>:     mov    0x10(%esp),%esi
   0xf7ed0cca <+10>:    mov    0x14(%esp),%ecx
   0xf7ed0cce <+14>:    mov    %edi,%eax
   0xf7ed0cd0 <+16>:    cld    
   0xf7ed0cd1 <+17>:    cmp    $0x20,%ecx
   0xf7ed0cd4 <+20>:    jbe    0xf7ed0d2c <memcpy+108>
...
   0xf7ed0d2c <+108>:   rep movsb %ds:(%esi),%es:(%edi)
   0xf7ed0d2e <+110>:   pop    %esi
   0xf7ed0d2f <+111>:   pop    %edi
   0xf7ed0d30 <+112>:   ret    

this seems to be sysdeps/i386/i586/memcpy.S
Comment 12 H.J. Lu 2012-06-20 14:43:11 UTC
(In reply to comment #11)
> (In reply to comment #9)
> > (In reply to comment #3)
> > > I added executable testcase. Command line to compile
> > > 
> > > gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static  -march=corei7
> > > -mtune=corei7   test.c m.c
> > > 
> > 
> > Please compare results without -static.
> 
> Same results.  memcpy ends up here:
> 
> 
> this seems to be sysdeps/i386/i586/memcpy.S

Your libc.so doesn't support IFUNC optimization.
Comment 13 H.J. Lu 2012-06-20 14:46:55 UTC
(In reply to comment #10)
> I've tried without static. Runtimes is still the same.

It doesn't match what I saw.  On Atom D510:

/export/gnu/import/git/gcc-regression/master/188261/usr/bin/gcc -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32   -march=atom m.c test.c -o new
time ./new
./new  58.46s user 0.00s system 99% cpu 58.479 total
/export/gnu/import/git/gcc-regression/master/188259/usr/bin/gcc -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32   -march=atom m.c test.c -o old
time ./old
./old  58.38s user 0.00s system 99% cpu 58.490 total
Comment 14 H.J. Lu 2012-06-20 14:51:31 UTC
With -static, there is a regression:

./new.static  61.99s user 0.00s system 99% cpu 1:02.14 total
./old.static  58.25s user 0.00s system 99% cpu 58.261 total

The problem is the slow memcpy, not GCC.  Please find out why
it behaves differently for you.
Comment 15 rguenther@suse.de 2012-06-20 14:53:58 UTC
On Wed, 20 Jun 2012, hjl.tools at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
> 
> --- Comment #12 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 14:43:11 UTC ---
> (In reply to comment #11)
> > (In reply to comment #9)
> > > (In reply to comment #3)
> > > > I added executable testcase. Command line to compile
> > > > 
> > > > gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static  -march=corei7
> > > > -mtune=corei7   test.c m.c
> > > > 
> > > 
> > > Please compare results without -static.
> > 
> > Same results.  memcpy ends up here:
> > 
> > 
> > this seems to be sysdeps/i386/i586/memcpy.S
> 
> Your libc.so doesn't support IFUNC optimization.

Quite possible (for whatever reason).
Comment 16 Richard Biener 2012-06-20 15:09:46 UTC
What we could do for the case in question is look at the maximum possible
value of c, derived from number-of-iteration analysis which should tell
us 8 because of the size of the tem array.

But I am not sure if a good library implementation shouldn't be always
preferable to a byte-wise copy.  We could, at least try to envision a way
to retain and use the knowledge that the size is at most 8 when expanding
the memcpy (with AVX we could use a masked store for example - quite fancy).
Comment 17 H.J. Lu 2012-06-20 15:36:09 UTC
(In reply to comment #16)
> But I am not sure if a good library implementation shouldn't be always
> preferable to a byte-wise copy.  We could, at least try to envision a way
> to retain and use the knowledge that the size is at most 8 when expanding
> the memcpy (with AVX we could use a masked store for example - quite fancy).

string/memory functions in libc can be much faster than the ones generated
by GCC unless the size is very small, PR 43052.
Comment 18 rguenther@suse.de 2012-06-21 08:46:11 UTC
On Wed, 20 Jun 2012, hjl.tools at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
> 
> --- Comment #17 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 15:36:09 UTC ---
> (In reply to comment #16)
> > But I am not sure if a good library implementation shouldn't be always
> > preferable to a byte-wise copy.  We could, at least try to envision a way
> > to retain and use the knowledge that the size is at most 8 when expanding
> > the memcpy (with AVX we could use a masked store for example - quite fancy).
> 
> string/memory functions in libc can be much faster than the ones generated
> by GCC unless the size is very small, PR 43052.

Yes.  The question is what is "very small" and how can we possibly
detect "very small".  For this testcase we can derive an upper bound
of the size, which is 8, but the size is not constant.  I think unless
we know we can expand the variable-size memcpy with, say, three
CPU instructions inline there is no reason to not call memcpy.

Thus if the CPU could do

  tem = unaligned-load-8-bytes-from-src-and-ignore-faults;
  mask = generate mask from size
  store-unaligned-8-bytes-with-mask

then expanding the memcpy call inline would be a win I suppose.
AVX has VMASKMOV, but I'm not sure using that for sizes <= 16
bytes is profitable?  Note that from the specs
of VMASKMOV it seems the memory operands need to be aligned and
the mask does not support byte-granularity.

Which would leave us to inline expanding the case of at most 2 byte
memcpy.  Of course currently there is no way to record an upper
bound for the size (we do not retain value-range information - but
we of course should).
Comment 19 Vladimir Yakovlev 2012-06-21 12:46:11 UTC
(In reply to comment #13)
> (In reply to comment #10)
> > I've tried without static. Runtimes is still the same.
> 
> It doesn't match what I saw.  On Atom D510:
> 
> /export/gnu/import/git/gcc-regression/master/188261/usr/bin/gcc -ansi -O3
> -ffast-math -msse2 -mfpmath=sse -m32   -march=atom m.c test.c -o new
> time ./new
> ./new  58.46s user 0.00s system 99% cpu 58.479 total
> /export/gnu/import/git/gcc-regression/master/188259/usr/bin/gcc -ansi -O3
> -ffast-math -msse2 -mfpmath=sse -m32   -march=atom m.c test.c -o old
> time ./old
> ./old  58.38s user 0.00s system 99% cpu 58.490 total

I rechecked: there is no regression without -static on Sandy Bridge or Atom.
Comment 20 H.J. Lu 2012-06-21 12:50:21 UTC
(In reply to comment #19)
> 
> I rechecked: there is no regression without -static on Sandy Bridge or Atom.

Great.  Please verify that -static isn't used in the eembc and SPEC
CPU 2000/2006 runs.
Comment 21 H.J. Lu 2012-06-21 12:55:33 UTC
(In reply to comment #18)
> > string/memory functions in libc can be much faster than the ones generated
> > by GCC unless the size is very small, PR 43052.
> 
> Yes.  The question is what is "very small" and how can we possibly
> detect "very small".  For this testcase we can derive an upper bound
> of the size, which is 8, but the size is not constant.  I think unless
> we know we can expand the variable-size memcpy with, say, three
> CPU instructions inline there is no reason to not call memcpy.

It is OK to call memcpy if the size isn't constant.

> 
> Which would leave us to inline expanding the case of at most 2 byte
> memcpy.  Of course currently there is no way to record an upper
> bound for the size (we do not retain value-range information - but
> we of course should).

It would be nice to have.  We can open another bug for this.
Comment 22 Jan Hubicka 2012-06-22 22:45:35 UTC
> Yes.  The question is what is "very small" and how can we possibly

What is "very small" is defined in i386.c in the cost tables.
I simply ran a small benchmark testing library & GCC implementations to
fill them in.  With new glibcs these tables may need updating.  I updated
them for some CPUs to match the glibc in SUSE 11.x.

PR 43052 is about memcmp.  Memcpy/memset should behave more or less sanely.
(That also reminds me that I should look again at the SSE memcpy/memset
implementation for 4.8.)

> detect "very small".  For this testcase we can derive an upper bound
> of the size, which is 8, but the size is not constant.  I think unless
> we know we can expand the variable-size memcpy with, say, three
> CPU instructions inline there is no reason to not call memcpy.
> 
> Thus if the CPU could do
> 
>   tem = unaligned-load-8-bytes-from-src-and-ignore-faults;
>   mask = generate mask from size
>   store-unaligned-8-bytes-with-mask
> 
> then expanding the memcpy call inline would be a win I suppose.
> AVX has VMASKMOV, but I'm not sure using that for sizes <= 16
> bytes is profitable?  Note that from the specs
> of VMASKMOV it seems the memory operands need to be aligned and
> the mask does not support byte-granularity.
> 
> Which would leave us to inline expanding the case of at most 2 byte
> memcpy.  Of course currently there is no way to record an upper
> bound for the size (we do not retain value-range information - but
> we of course should).

My secret plan was to make VRP produce value profiling histogram
when value is known to be with small range.  Should be quite easy
to implement.

Honza
Comment 23 Jan Hubicka 2012-06-22 23:04:21 UTC
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
> 
> --- Comment #22 from Jan Hubicka <hubicka at ucw dot cz> 2012-06-22 22:45:35 UTC ---
> > Yes.  The question is what is "very small" and how can we possibly
> 
> What is "very small" is defined in i386.c in the cost tables.
> I simply ran a small benchmark testing library & GCC implementations to
> fill them in.  With new glibcs these tables may need updating.  I updated
> them for some CPUs to match the glibc in SUSE 11.x.
> 
> PR 43052 is about memcmp.  Memcpy/memset should behave more or less sanely.
> (That also reminds me that I should look again at the SSE memcpy/memset
> implementation for 4.8.)

That also reminds me that this tuning was mostly reverted with the SSE work.
I will look into those patches and push out the safe bits for 4.8.

Honza