After fix

r188261 | rguenth | 2012-06-06 13:45:27 +0400 (Wed, 06 Jun 2012) | 23 lines

2012-06-06  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/53081
	* tree-data-ref.h (adjacent_store_dr_p): Rename to ...
	(adjacent_dr_p): ... this and make it work for reads, too.
	* tree-loop-distribution.c (enum partition_kind): Add PKIND_MEMCPY.
	(struct partition_s): Change main_stmt to main_dr, add
	secondary_dr member.
	(build_size_arg_loc): Change to take a data-reference and not
	gimplify here.
	(build_addr_arg_loc): New function split out from ...
	(generate_memset_builtin): ... here.  Use it and simplify.
	(generate_memcpy_builtin): New function.
	(generate_code_for_partition): Adjust.
	(classify_partition): Streamline pattern detection.  Detect memcpy.
	(ldist_gen): Adjust.
	(tree_loop_distribution): Adjust seed statements for memcpy
	recognition.

	* gcc.dg/tree-ssa/ldist-20.c: New testcase.
	* gcc.dg/tree-ssa/loop-19.c: Add -fno-tree-loop-distribute-patterns.

there is a regression of 11% on Atom and 30% on Sandy Bridge.  The fix led to unrecognition of memcpy.  A reduced test case and the assemblies are attached.

Command line to reproduce:

gcc -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7 -mtune=corei7 test.c
Created attachment 27658 [details] Test case and assemblers
You mean the fix led to recognition of memcpy?  At least I see memcpy calls in the bad assembly.  There is always a cost consideration for memcpy - does performance recover with -minline-all-stringops?  I suppose BC is actually very small?  The testcase does not include a runtime part so I can't check myself.

Definitely a byte-wise copy loop as in the .good assembly variant,

.L5:
-	.loc 1 14 0 is_stmt 1 discriminator 2
-	movzbl	16(%esp,%eax), %edx
-	movb	%dl, (%esi,%eax)
-	leal	1(%eax), %eax
-.LVL5:
-	cmpl	%ebx, %eax
-	jl	.L5

does not look good - even a rep movsb should be faster, no?
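For context, a loop of roughly this shape - my own reconstruction, not the attached test case, with made-up identifiers - is the kind of pattern that -ftree-loop-distribute-patterns now replaces with a memcpy call:

```c
#include <stddef.h>

/* Hypothetical illustration of the pattern at issue: a byte-wise copy
   loop with a small, variable trip count.  At -O3, loop distribution's
   pattern recognition classifies the loop body as a memcpy and emits a
   library call instead of the copy loop shown in the .good assembly.  */
static void copy_bytes(char *dst, const char *src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}
```

When n is tiny (here at most 8, per the discussion below), the call overhead can outweigh the copy itself, which is the regression being reported.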
I added an executable testcase.  Command line to compile:

gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7 -mtune=corei7 test.c m.c

Run results:

Wed Jun 20 14:39:05: /gnumnt/msticlxl25_users/vbyakovl/1020/test$ time ./test.corei7.bad.exe

real	0m6.317s
user	0m6.290s
sys	0m0.002s

Wed Jun 20 14:39:24: /gnumnt/msticlxl25_users/vbyakovl/1020/test$ time ./test.corei7.good.exe

real	0m4.815s
user	0m4.713s
sys	0m0.000s
Created attachment 27664 [details] Executable test case
Ok.  A rep movsb is as slow as a memcpy call (-mstringop-strategy=rep_byte -minline-all-stringops).  -minline-all-stringops itself is nearly as fast as -fno-tree-loop-distribute-patterns.  To answer my own question, BC is between zero and 7.  But I really wonder why the rep movsb is slower than the explicit byte-copy loop ...

We do seem to seriously hose the CFG though - with PGO we get a nice loop-nest CFG and the speed of before the patch - even when it uses a memcpy call.
Btw, I cannot reproduce the slowdown on 64-bit, and the 32-bit memcpy in glibc simply does a rep movsb for any size lower than 20 bytes ... but as I have been told, rep movsb setup cost is prohibitively high on most Intel CPUs ...

Thus, I suppose you should look at improving the memcpy implementation for small sizes on 32-bit.
(In reply to comment #6)
> Btw, I cannot reproduce the slowdown on 64bit and the 32bit memcpy in glibc

Which glibc are you using?

> simply does a rep movsb; for any size lower than 20 bytes ... but as I have
> been told rep movsb; setup cost is prohibitively high on most Intel CPUs ...

I didn't see rep movsb in the 32-bit memcpy in glibc 2.14.
On Wed, 20 Jun 2012, hjl.tools at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
>
> --- Comment #7 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 12:46:57 UTC ---
> (In reply to comment #6)
> > Btw, I cannot reproduce the slowdown on 64bit and the 32bit memcpy in glibc
>
> Which glibc are you using?
>
> > simply does a rep movsb; for any size lower than 20 bytes ... but as I have
> > been told rep movsb; setup cost is prohibitively high on most Intel CPUs ...
>
> I didn't see rep movsb in 32-bit memcpy in glibc 2.14.

The one from openSUSE 12.1.  Btw, this is with static linking, so I suppose IFUNC and friends do not apply(?)
(In reply to comment #3)
> I added executable testcase.  Command line to compile
>
> gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7
> -mtune=corei7 test.c m.c

Please compare results without -static.
I've tried without -static.  Runtime is still the same.
(In reply to comment #9)
> (In reply to comment #3)
> > I added executable testcase.  Command line to compile
> >
> > gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7
> > -mtune=corei7 test.c m.c
>
> Please compare results without -static.

Same results.  memcpy ends up here:

Dump of assembler code for function memcpy:
=> 0xf7ed0cc0 <+0>:	push   %edi
   0xf7ed0cc1 <+1>:	push   %esi
   0xf7ed0cc2 <+2>:	mov    0xc(%esp),%edi
   0xf7ed0cc6 <+6>:	mov    0x10(%esp),%esi
   0xf7ed0cca <+10>:	mov    0x14(%esp),%ecx
   0xf7ed0cce <+14>:	mov    %edi,%eax
   0xf7ed0cd0 <+16>:	cld
   0xf7ed0cd1 <+17>:	cmp    $0x20,%ecx
   0xf7ed0cd4 <+20>:	jbe    0xf7ed0d2c <memcpy+108>
...
   0xf7ed0d2c <+108>:	rep movsb %ds:(%esi),%es:(%edi)
   0xf7ed0d2e <+110>:	pop    %esi
   0xf7ed0d2f <+111>:	pop    %edi
   0xf7ed0d30 <+112>:	ret

this seems to be sysdeps/i386/i586/memcpy.S
(In reply to comment #11)
> (In reply to comment #9)
> > (In reply to comment #3)
> > > I added executable testcase.  Command line to compile
> > >
> > > gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7
> > > -mtune=corei7 test.c m.c
> >
> > Please compare results without -static.
>
> Same results.  memcpy ends up here:
>
> this seems to be sysdeps/i386/i586/memcpy.S

Your libc.so doesn't support the IFUNC optimization.
(In reply to comment #10)
> I've tried without static.  Runtimes is still the same.

It doesn't match what I saw.  On Atom D510:

/export/gnu/import/git/gcc-regression/master/188261/usr/bin/gcc -ansi -O3
-ffast-math -msse2 -mfpmath=sse -m32 -march=atom m.c test.c -o new
time ./new
./new  58.46s user 0.00s system 99% cpu 58.479 total

/export/gnu/import/git/gcc-regression/master/188259/usr/bin/gcc -ansi -O3
-ffast-math -msse2 -mfpmath=sse -m32 -march=atom m.c test.c -o old
time ./old
./old  58.38s user 0.00s system 99% cpu 58.490 total
With -static, there is a regression:

./new.static  61.99s user 0.00s system 99% cpu 1:02.14 total
./old.static  58.25s user 0.00s system 99% cpu 58.261 total

The problem is the slow memcpy, not GCC.  Please find out why it is different for you.
On Wed, 20 Jun 2012, hjl.tools at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
>
> --- Comment #12 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 14:43:11 UTC ---
> (In reply to comment #11)
> > (In reply to comment #9)
> > > (In reply to comment #3)
> > > > I added executable testcase.  Command line to compile
> > > >
> > > > gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7
> > > > -mtune=corei7 test.c m.c
> > >
> > > Please compare results without -static.
> >
> > Same results.  memcpy ends up here:
> >
> > this seems to be sysdeps/i386/i586/memcpy.S
>
> Your libc.so doesn't support IFUNC optimization.

Quite possible (for whatever reason).
What we could do for the case in question is look at the maximum possible value of c, derived from number-of-iterations analysis, which should tell us 8 because of the size of the tem array.

But I am not sure that a good library implementation shouldn't always be preferable to a byte-wise copy.  We could at least try to envision a way to retain and use the knowledge that the size is at most 8 when expanding the memcpy (with AVX we could use a masked store, for example - quite fancy).
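One way such an upper bound could be exploited at expansion time is the classic overlapping-chunk trick.  This is a sketch of the idea only - the helper name is invented, and GCC does not emit this here:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: copy n bytes where analysis proved n <= 8.  For n >= 4, two
   possibly-overlapping 4-byte moves cover the whole range without a
   loop or a library call; the 0..3-byte remainder falls back to a byte
   loop.  The memcpy(&tmp, p, 4) idiom is the portable way to express
   an unaligned 32-bit load/store; compilers lower it to one move.  */
static void copy_upto_8(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (n >= 4) {
        uint32_t head, tail;
        memcpy(&head, s, 4);             /* first 4 bytes                */
        memcpy(&tail, s + n - 4, 4);     /* last 4 bytes (may overlap)   */
        memcpy(d, &head, 4);
        memcpy(d + n - 4, &tail, 4);
    } else {
        while (n--)                      /* 0..3 bytes: plain byte copy  */
            *d++ = *s++;
    }
}
```

The point is that once n is known to be bounded, the variable-size copy reduces to a handful of straight-line instructions, sidestepping both the library-call overhead and the rep movsb setup cost discussed above.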
(In reply to comment #16)
> But I am not sure if a good library implementation shouldn't be always
> preferable to a byte-wise copy.  We could, at least try to envision a way
> to retain and use the knowledge that the size is at most 8 when expanding
> the memcpy (with AVX we could use a masked store for example - quite fancy).

string/memory functions in libc can be much faster than the ones generated by GCC unless the size is very small, PR 43052.
On Wed, 20 Jun 2012, hjl.tools at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
>
> --- Comment #17 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 15:36:09 UTC ---
> (In reply to comment #16)
> > But I am not sure if a good library implementation shouldn't be always
> > preferable to a byte-wise copy.  We could, at least try to envision a way
> > to retain and use the knowledge that the size is at most 8 when expanding
> > the memcpy (with AVX we could use a masked store for example - quite fancy).
>
> string/memory functions in libc can be much faster than the ones generated
> by GCC unless the size is very small, PR 43052.

Yes.  The question is what is "very small" and how we can possibly detect "very small".  For this testcase we can derive an upper bound on the size, which is 8, but the size is not constant.  I think unless we know we can expand the variable-size memcpy with, say, three CPU instructions inline, there is no reason not to call memcpy.

Thus if the CPU could do

  tem  = unaligned-load-8-bytes-from-src-and-ignore-faults;
  mask = generate mask from size;
  store-unaligned-8-bytes-with-mask;

then expanding the memcpy call inline would be a win, I suppose.  AVX has VMASKMOV, but I'm not sure using that for sizes <= 16 bytes is profitable?  Note that from the specs of VMASKMOV it seems the memory operands need to be aligned and the mask does not support byte granularity.

Which would leave us inline-expanding only the case of an at most 2-byte memcpy.  Of course currently there is no way to record an upper bound for the size (we do not retain value-range information - but we of course should).
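The proposed three-step sequence can be illustrated in portable C.  This is a sketch of the mask logic only: it assumes little-endian byte order and that 8 bytes are readable and writable at both pointers - exactly the over-read/over-write that the fault-ignoring load and the masked store would make safe in hardware:

```c
#include <stdint.h>
#include <string.h>

/* Illustration of the proposed inline expansion for n <= 8:
     tem  = 8-byte unaligned load from src;
     mask = low n*8 bits set;
     dst  = (tem & mask) | (old & ~mask);
   The memcpy calls stand in for the unaligned load and masked store;
   real hardware support would need a fault-suppressing load and a
   byte-granular masked store, which SSE/AVX of this era lack.
   Assumes little-endian, so the low bits of the mask select the
   first bytes in memory.  */
static void masked_copy_upto_8(uint8_t *dst, const uint8_t *src, unsigned n)
{
    uint64_t tem, old, mask;
    memcpy(&tem, src, 8);                /* over-reads past src + n      */
    memcpy(&old, dst, 8);                /* old dst bytes to preserve    */
    mask = n >= 8 ? ~UINT64_C(0) : (UINT64_C(1) << (8 * n)) - 1;
    tem = (tem & mask) | (old & ~mask);
    memcpy(dst, &tem, 8);                /* over-writes past dst + n
                                            with the preserved bytes     */
}
```

The whole copy is a load, a couple of ALU ops, and a store - no loop and no call - which is roughly the "three CPU instructions" budget mentioned above, if the hardware provided the two missing primitives.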
(In reply to comment #13)
> (In reply to comment #10)
> > I've tried without static.  Runtimes is still the same.
>
> It doesn't match what I saw.  On Atom D510:
>
> /export/gnu/import/git/gcc-regression/master/188261/usr/bin/gcc -ansi -O3
> -ffast-math -msse2 -mfpmath=sse -m32 -march=atom m.c test.c -o new
> time ./new
> ./new  58.46s user 0.00s system 99% cpu 58.479 total
> /export/gnu/import/git/gcc-regression/master/188259/usr/bin/gcc -ansi -O3
> -ffast-math -msse2 -mfpmath=sse -m32 -march=atom m.c test.c -o old
> time ./old
> ./old  58.38s user 0.00s system 99% cpu 58.490 total

I rechecked: there is no regression without -static on Sandy Bridge or Atom.
(In reply to comment #19)
>
> I rechecked there is no regression without static on Sandy Bridge or Atom.

Great.  Please verify that -static isn't used on eembc and SPEC CPU 2000/2006 runs.
(In reply to comment #18)
> > string/memory functions in libc can be much faster than the ones generated
> > by GCC unless the size is very small, PR 43052.
>
> Yes.  The question is what is "very small" and how can we possibly
> detect "very small".  For this testcase we can derive an upper bound
> of the size, which is 8, but the size is not constant.  I think unless
> we know we can expand the variable-size memcpy with, say, three
> CPU instructions inline there is no reason to not call memcpy.

It is OK to call memcpy if the size isn't constant.

> Which would leave us to inline expanding the case of at most 2 byte
> memcpy.  Of course currently there is no way to record an upper
> bound for the size (we do not retain value-range information - but
> we of course should).

It is nice to have.  We can open another bug for this.
> Yes.  The question is what is "very small" and how can we possibly

What is "very small" is defined in i386.c in the cost tables.  I simply ran a small benchmark testing the library and GCC implementations to fill them in.  With new glibcs these tables may need updating; I updated some of them to match glibc in SUSE 11.x.

PR 43052 is about memcmp.  memcpy/memset should behave more or less sanely.  (That also reminds me that I should look again at the SSE memcpy/memset implementation for 4.8.)

> detect "very small".  For this testcase we can derive an upper bound
> of the size, which is 8, but the size is not constant.  I think unless
> we know we can expand the variable-size memcpy with, say, three
> CPU instructions inline there is no reason to not call memcpy.
>
> Thus if the CPU could do
>
>   tem = unaligned-load-8-bytes-from-src-and-ignore-faults;
>   mask = generate mask from size
>   store-unaligned-8-bytes-with-mask
>
> then expanding the memcpy call inline would be a win I suppose.
> AVX has VMASKMOV, but I'm not sure using that for sizes <= 16
> bytes is profitable?  Note that from the specs
> of VMASKMOV it seems the memory operands need to be aligned and
> the mask does not support byte-granularity.
>
> Which would leave us to inline expanding the case of at most 2 byte
> memcpy.  Of course currently there is no way to record an upper
> bound for the size (we do not retain value-range information - but
> we of course should).

My secret plan was to make VRP produce a value-profiling histogram when the value is known to be within a small range.  Should be quite easy to implement.

Honza
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
>
> --- Comment #22 from Jan Hubicka <hubicka at ucw dot cz> 2012-06-22 22:45:35 UTC ---
> > Yes.  The question is what is "very small" and how can we possibly
>
> As what is very small is defined in the i386.c in the cost tables.
> I simply run a small benchmark testing library&GCC implementations to
> fill it in.  With new glibcs these tables may need upating.  I updated them
> on some to make glibc in SUSE 11.x.
>
> PR 43052 is about memcmp.  Memcpy/memset should behave more or less sanely.
> (that also reminds me that I should look again at the SSE memcpy/memset
> implementation for 4.8)

That also reminds me that this tuning was mostly reverted with the SSE work.  I will look into those patches and push out the safe bits for 4.8.

Honza