PATCH: Turn on x86_rep_movl_optimal for m_GENERIC64

Sat Nov 18 17:53:00 GMT 2006

> On Fri, Nov 17, 2006 at 07:19:51AM -0800, H. J. Lu wrote:
> > 
> > I found the following on Core 2 Duo:
> > 
> > 1. The optimized memory functions don't help SPEC CPU 2K FP much.
> > 2. The optimized memory functions help SPEC CPU 2K INT:
> > 			-O2 + optimized memory vs -O2
> > 164.gzip                         1.44928%
> > 175.vpr                          -0.522952%
> > 176.gcc                          23.6236%
> > 181.mcf                          -1.30276%
> > 186.crafty                       -0.576258%
> > 197.parser                       0.64%
> > 252.eon                          0.480769%
> > 253.perlbmk                      -1.60281%
> > 254.gap                          -0.406504%
> > 255.vortex                       10.6028%
> > 256.bzip2                        0.0993542%
> > 300.twolf                        -0.0783392%
> > Est. SPECint_base2000            2.457%
> > 
> > 3. rep_movl_optimal + optimized memory functions don't help SPEC CPU
> > 2K FP much.
> > 4. rep_movl_optimal + optimized memory functions is a mixed bag on
> > SPEC CPU 2K INT:
> > 
> > 		-O2 + optimized memory + rep_movl_optimal vs -O2 + optimized memory
> > 164.gzip                         0%
> > 175.vpr                          -0.233645%
> > 176.gcc                          -2.18623%
> > 181.mcf                          1.10876%
> > 186.crafty                       0.772798%
> > 197.parser                       -0.317965%
> > 252.eon                          -0.438596%
> > 253.perlbmk                      1.58919%
> > 254.gap                          0.272109%
> > 255.vortex                       0.231929%
> > 256.bzip2                        0%
> > 300.twolf                        0.117601%
> > Est. SPECint_base2000            0.0959233%
> > 
> > Given that rep_movl_optimal improves 176.gcc significantly with the
> > old memory functions and has no significant negative impact
> > with optimized memory functions, I think we should turn it on for
> > m_GENERIC64. We can always fine tune rep_movl_optimal later.
> > 
> > 
> 
> Here is the patch.
> 
> 
> H.J.
> ----
> 2006-11-17  H.J. Lu  <hongjiu.lu@intel.com>
> 
> 	* config/i386/i386.c (x86_rep_movl_optimal): Turn it on for
> 	m_GENERIC64.

I must say that this seems like a mistake to me in the longer run.  The
effect of the flag is to make the x86 backend always inline
memcpy/memset, believing that there is nothing better than the
"rep;movl" sequence.  This throws away any opportunity to have an
optimized memset/memcpy implementation in a library.  This flexibility
is IMO very useful, allowing distributions to ship optimized string
functions on a per-CPU basis.

Your runs show that with a not completely poor glibc implementation the
libcalls to string functions are neutral for SPECint, and it is
definitely not difficult to show that the non-temporal and prefetching
bits in CPUs are important for copying/memsetting larger memory blocks,
giving a measurable speedup.

It is definitely not that difficult to find a non-SPECint benchmark that
does depend on copying blocks larger than the cache, where rep/movl
would be a loss.

In my stringop rewrite I have a simple benchmark testing stringop
performance on different block sizes and alignments with a pseudo-random
access pattern.  All basic codegen methods I can think of that make
sense to inline (i.e. the rep;movs of various sizes, loop and unrolled
loop) are tested with and without an alignment prologue and compared to
the libcall.
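
A minimal sketch of what such a harness could look like, not the actual
benchmark from the stringop rewrite.  All names here (copy_fn,
copy_byte_loop, lcg_next, time_copy) are hypothetical, and the byte loop
merely stands in for one inlined codegen variant.  It copies fixed-size
blocks between pseudo-random offsets of two arenas and reports elapsed
CPU time:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature shared by every copy strategy under test (hypothetical). */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* A plain byte loop, standing in for one inlined codegen variant. */
void *copy_byte_loop(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Small linear congruential generator so that runs are reproducible. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Copy about total_bytes in chunks of block bytes between pseudo-random
   offsets of two pool-byte arenas.  Returns elapsed CPU seconds, or -1.0
   if the arenas cannot be allocated. */
double time_copy(copy_fn fn, size_t block, size_t pool, size_t total_bytes)
{
    assert(block > 0 && pool >= block);

    unsigned char *src = malloc(pool);
    unsigned char *dst = malloc(pool);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0x5a, pool);
    memset(dst, 0, pool);

    uint32_t rng = 12345u;
    size_t span = pool - block + 1;    /* number of valid start offsets */
    volatile unsigned char sink = 0;   /* keeps the copies observable */

    clock_t start = clock();
    for (size_t done = 0; done + block <= total_bytes; done += block) {
        size_t so = lcg_next(&rng) % span;
        size_t dp = lcg_next(&rng) % span;
        fn(dst + dp, src + so, block);
        sink ^= dst[dp];
    }
    clock_t end = clock();

    free(src);
    free(dst);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_copy(memcpy, 64, 1 << 20, 1 << 28) and
time_copy(copy_byte_loop, 64, 1 << 20, 1 << 28) would give one libcall
cell and one inlined cell of a table like the ones in this mail.  A real
harness would also vary alignment and keep the compiler from replacing
the loop with a memcpy call.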

Running it on Core 2 with SUSE 10.1 stringops I get the following
memcpy results:

                   libcall   rep1   noalg    rep4   noalg    rep8   noalg    loop   noalg    unrl   noalg    byte dynamic
block size 81920000 0:00.77 0:01.04 0:01.03 0:01.01 0:01.18 0:01.01 0:01.14 0:01.08 0:01.14 0:01.10 0:01.09 0:01.25 0:00.72 best: 0:00.77 libcall
block size 8192000 0:00.61 0:00.90 0:00.90 0:00.85 0:00.99 0:00.82 0:00.96 0:00.89 0:00.96 0:00.89 0:00.90 0:01.04 0:00.61 best: 0:00.61 libcall
block size  819200 0:00.85 0:00.90 0:00.89 0:00.85 0:00.96 0:00.81 0:00.94 0:00.88 0:00.94 0:00.88 0:00.89 0:01.02 0:00.85 best: 0:00.81 rep8
block size   81920 0:00.13 0:00.52 0:00.52 0:00.17 0:00.78 0:00.14 0:00.76 0:00.14 0:00.75 0:00.15 0:00.38 0:00.96 0:00.13 best: 0:00.13 libcall
block size   20480 0:00.15 0:00.55 0:00.55 0:00.18 0:00.81 0:00.15 0:00.80 0:00.15 0:00.78 0:00.15 0:00.40 0:01.01 0:00.15 best: 0:00.15 libcall
block size    8192 0:00.13 0:00.48 0:00.48 0:00.16 0:00.73 0:00.13 0:00.70 0:00.13 0:00.70 0:00.13 0:00.36 0:00.90 0:00.13 best: 0:00.13 libcall
block size    4096 0:00.12 0:00.46 0:00.46 0:00.15 0:00.71 0:00.11 0:00.68 0:00.13 0:00.68 0:00.12 0:00.36 0:00.89 0:00.12 best: 0:00.11 rep8
block size    2048 0:00.12 0:00.45 0:00.45 0:00.15 0:00.71 0:00.11 0:00.68 0:00.14 0:00.68 0:00.12 0:00.38 0:00.90 0:00.12 best: 0:00.11 rep8
block size    1024 0:00.17 0:00.48 0:00.48 0:00.16 0:00.72 0:00.12 0:00.69 0:00.16 0:00.69 0:00.15 0:00.41 0:00.92 0:00.17 best: 0:00.12 rep8
block size     512 0:00.27 0:00.53 0:00.53 0:00.19 0:00.73 0:00.14 0:00.70 0:00.19 0:00.72 0:00.20 0:00.45 0:00.95 0:00.24 best: 0:00.14 rep8
block size     256 0:00.59 0:00.62 0:00.62 0:00.23 0:00.76 0:00.17 0:00.72 0:00.25 0:00.75 0:00.29 0:00.54 0:01.01 0:00.47 best: 0:00.17 rep8
block size     128 0:00.78 0:00.80 0:00.80 0:00.32 0:00.81 0:00.25 0:00.76 0:00.36 0:00.82 0:00.47 0:00.69 0:01.12 0:00.32 best: 0:00.25 rep8
block size      64 0:01.03 0:01.11 0:01.11 0:00.48 0:00.92 0:00.43 0:00.87 0:00.58 0:00.94 0:00.80 0:00.95 0:01.35 0:00.48 best: 0:00.43 rep8
block size      48 0:01.17 0:01.34 0:01.33 0:00.60 0:01.02 0:00.55 0:00.94 0:00.64 0:01.02 0:01.04 0:01.10 0:01.42 0:00.61 best: 0:00.55 rep8
block size      32 0:01.37 0:01.72 0:01.73 0:00.79 0:01.15 0:00.80 0:01.14 0:00.91 0:01.14 0:01.44 0:01.55 0:01.73 0:00.81 best: 0:00.79 rep4
block size      24 0:01.63 0:02.17 0:02.17 0:01.03 0:01.36 0:01.06 0:01.33 0:00.96 0:01.25 0:01.73 0:01.75 0:01.93 0:01.07 best: 0:00.96 loop
block size      16 0:01.91 0:02.85 0:02.86 0:01.39 0:01.65 0:01.46 0:01.73 0:01.33 0:01.57 0:02.62 0:02.47 0:02.55 0:01.40 best: 0:01.33 loop
block size      14 0:02.02 0:03.27 0:03.27 0:01.64 0:01.86 0:01.65 0:01.78 0:01.47 0:01.55 0:02.78 0:02.78 0:02.49 0:01.68 best: 0:01.47 loop
block size      12 0:02.24 0:03.68 0:03.67 0:01.92 0:02.14 0:01.89 0:02.01 0:01.64 0:01.84 0:02.94 0:02.88 0:02.71 0:02.01 best: 0:01.64 loop
block size      10 0:02.42 0:04.21 0:04.21 0:02.23 0:02.32 0:02.27 0:02.22 0:01.92 0:02.00 0:03.16 0:03.08 0:03.02 0:02.31 best: 0:01.92 loop
block size       8 0:03.11 0:04.83 0:04.83 0:02.65 0:02.79 0:02.92 0:02.69 0:02.48 0:02.78 0:03.89 0:03.62 0:03.62 0:02.70 best: 0:02.48 loop
block size       6 0:03.26 0:06.13 0:06.14 0:03.63 0:03.45 0:03.39 0:03.14 0:02.98 0:03.01 0:04.20 0:04.21 0:03.95 0:03.78 best: 0:02.98 loop
block size       4 0:04.80 0:07.43 0:07.42 0:04.89 0:04.37 0:04.75 0:04.67 0:04.78 0:04.83 0:05.81 0:05.51 0:05.21 0:04.89 best: 0:04.37 rep4noalign
block size       1 0:07.60 0:09.33 0:09.33 0:06.41 0:06.39 0:06.59 0:06.37 0:06.67 0:06.38 0:08.04 0:06.45 0:09.83 0:06.44 best: 0:06.37 rep8noalign

So in short, for blocks starting from 8MB and above, the library call
with non-temporal hints is a significant (30%) win.  I am not sure about
Intel chips, but for AMD chips the hints also become more important in
a multicore environment, as they reduce the amount of cache
synchronization traffic on the system bus.  Inlining seems to be a win
up to a block size of 8K; it is neutral up to 80K, then it starts losing.
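
For readers unfamiliar with the hints, here is a rough sketch of a copy
that uses non-temporal stores.  It assumes the SSE2 intrinsics from
<emmintrin.h> and falls back to plain memcpy where SSE2 is not enabled.
The function name copy_nontemporal is hypothetical, and this is not how
glibc or any particular library implements it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copy n bytes, writing the bulk with stores that bypass the cache. */
void *copy_nontemporal(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    unsigned char *d = dst;
    const unsigned char *s = src;
#if defined(__SSE2__)
    /* Streaming stores need a 16-byte aligned destination, so copy
       bytes until the destination pointer is aligned. */
    while (n > 0 && ((uintptr_t)d & 15) != 0) {
        *d++ = *s++;
        n--;
    }
    /* Main loop: unaligned 16-byte load, non-temporal 16-byte store. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
        d += 16;
        s += 16;
        n -= 16;
    }
    /* Order the streaming stores before any later stores. */
    _mm_sfence();
#endif
    /* Tail, or the whole copy when SSE2 is not available. */
    memcpy(d, s, n);
    return dst;
}
```

Such stores only pay off when the block is larger than the cache, which
is consistent with the libcall winning only for the largest block sizes
in the table above.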

For very small blocks, the loop is a win since rep has an expensive setup.

For Opteron, the scores are surprisingly similar.  In fact best scores
for Opteron are:

loop  0....48 bytes
rep   ....8192 bytes

For Nocona the preferred algorithms are:
loop  0....32 bytes
rep   ... 20000 bytes
unrolled loop ....100000 bytes
The library functions are obviously lacking here a bit more than on
Core and Opteron.

I would suggest for generic to actually use the loop up to 32 bytes (note
that those very small blocks matter only for profile feedback - when the
size is known to be 32 bytes at compile time, we inline the unrolled
version anyway) and rep up to 8192 bytes, libcall afterwards.

For blocks of unknown size we have three meaningful options:
 - always libcall
 - always do rep mov
 - runtime check block size and do rep mov for blocks smaller than 8k

I've implemented the first by default (it is the winner for code size
too) and the last as a command line option that can be used for
applications bound by copying small blocks (such as GCC).
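
The third option can be sketched in C as follows.  This only
illustrates the idea and is not the code the patch emits: the names
rep_copy, dynamic_copy and INLINE_COPY_LIMIT are hypothetical, the 8192
threshold is taken from the numbers above, and the inline assembly
assumes a GCC-compatible compiler on x86 (other targets get a byte
loop):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed crossover point, taken from the 8k figure in the text. */
#define INLINE_COPY_LIMIT 8192

/* Copy n bytes with "rep movsl" plus a "rep movsb" tail on x86; fall
   back to a byte loop on other targets.  The direction flag is assumed
   clear, as the ABI guarantees on function entry. */
static inline void rep_copy(void *dst, const void *src, size_t n)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    size_t words = n / 4;   /* 4-byte units for rep movsl */
    size_t tail = n % 4;    /* leftover bytes for rep movsb */
    __asm__ volatile ("rep; movsl"
                      : "+D" (dst), "+S" (src), "+c" (words)
                      :
                      : "memory");
    __asm__ volatile ("rep; movsb"
                      : "+D" (dst), "+S" (src), "+c" (tail)
                      :
                      : "memory");
#else
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
}

/* Runtime check: small blocks are copied inline, large ones go to the
   library, which may use non-temporal stores and prefetching. */
void *dynamic_copy(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    if (n < INLINE_COPY_LIMIT)
        rep_copy(dst, src, n);
    else
        memcpy(dst, src, n);
    return dst;
}
```

The price of the check is one compare and branch per call plus the
extra code size at each call site, which is why always using the libcall
remains the smallest option.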

Memset tells a similar story:

                   libcall   rep1   noalg    rep4   noalg    rep8   noalg    loop   noalg    unrl   noalg    byte dynamic
block size 81920000 0:00.36 0:00.62 0:00.62 0:00.48 0:00.78 0:00.48 0:00.84 0:00.85 0:00.86 0:00.84 0:00.86 0:01.10 0:00.36 best: 0:00.36 libcall
block size 8192000 0:00.28 0:00.60 0:00.59 0:00.43 0:00.59 0:00.34 0:00.62 0:00.63 0:00.64 0:00.63 0:00.64 0:00.92 0:00.28 best: 0:00.28 libcall
block size  819200 0:00.37 0:00.67 0:00.67 0:00.52 0:00.68 0:00.38 0:00.69 0:00.70 0:00.72 0:00.71 0:00.72 0:00.90 0:00.37 best: 0:00.37 libcall
block size   81920 0:00.09 0:00.46 0:00.46 0:00.13 0:00.25 0:00.09 0:00.19 0:00.12 0:00.17 0:00.10 0:00.17 0:00.96 0:00.09 best: 0:00.09 libcall
block size   20480 0:00.10 0:00.49 0:00.49 0:00.13 0:00.27 0:00.10 0:00.20 0:00.13 0:00.18 0:00.10 0:00.18 0:01.00 0:00.10 best: 0:00.10 libcall
block size    8192 0:00.09 0:00.44 0:00.44 0:00.12 0:00.24 0:00.09 0:00.18 0:00.11 0:00.16 0:00.10 0:00.17 0:00.90 0:00.09 best: 0:00.09 libcall
block size    4096 0:00.10 0:00.44 0:00.43 0:00.12 0:00.24 0:00.09 0:00.19 0:00.12 0:00.17 0:00.10 0:00.17 0:00.88 0:00.10 best: 0:00.09 rep8
block size    2048 0:00.11 0:00.45 0:00.45 0:00.13 0:00.25 0:00.10 0:00.20 0:00.13 0:00.18 0:00.11 0:00.18 0:00.89 0:00.11 best: 0:00.10 rep8
block size    1024 0:00.15 0:00.47 0:00.47 0:00.15 0:00.27 0:00.12 0:00.22 0:00.15 0:00.20 0:00.14 0:00.21 0:00.90 0:00.15 best: 0:00.12 rep8
block size     512 0:00.21 0:00.50 0:00.51 0:00.17 0:00.30 0:00.14 0:00.23 0:00.18 0:00.23 0:00.18 0:00.25 0:00.92 0:00.19 best: 0:00.14 rep8
block size     256 0:00.31 0:00.57 0:00.57 0:00.22 0:00.35 0:00.17 0:00.27 0:00.22 0:00.28 0:00.23 0:00.32 0:00.96 0:00.26 best: 0:00.17 rep8
block size     128 0:00.53 0:00.72 0:00.72 0:00.31 0:00.42 0:00.24 0:00.31 0:00.33 0:00.38 0:00.35 0:00.40 0:01.03 0:00.31 best: 0:00.24 rep8
block size      64 0:00.64 0:01.01 0:01.01 0:00.46 0:00.53 0:00.38 0:00.45 0:00.52 0:00.59 0:00.55 0:00.50 0:01.12 0:00.45 best: 0:00.38 rep8
block size      48 0:00.68 0:01.23 0:01.23 0:00.58 0:00.63 0:00.52 0:00.55 0:00.65 0:00.71 0:00.57 0:00.58 0:01.36 0:00.58 best: 0:00.52 rep8
block size      32 0:00.71 0:01.57 0:01.57 0:00.74 0:00.79 0:00.71 0:00.73 0:00.81 0:00.83 0:00.77 0:00.71 0:01.43 0:00.77 best: 0:00.71 libcall
block size      24 0:00.82 0:02.00 0:02.00 0:00.97 0:01.00 0:00.96 0:00.95 0:00.96 0:00.89 0:00.89 0:00.84 0:01.75 0:00.99 best: 0:00.82 libcall
block size      16 0:01.18 0:02.57 0:02.57 0:01.32 0:01.36 0:01.31 0:01.20 0:01.14 0:01.13 0:01.06 0:01.14 0:01.95 0:01.32 best: 0:01.06 unrl
block size      14 0:01.39 0:03.33 0:03.32 0:01.60 0:01.58 0:01.57 0:01.54 0:01.52 0:01.22 0:01.32 0:01.20 0:02.60 0:01.61 best: 0:01.20 unrlnoalign
block size      12 0:01.61 0:03.41 0:03.41 0:01.84 0:01.83 0:01.73 0:01.69 0:01.55 0:01.35 0:01.47 0:01.35 0:02.55 0:01.87 best: 0:01.35 loopnoalign
block size      10 0:01.91 0:03.95 0:03.95 0:02.14 0:02.14 0:02.06 0:01.96 0:01.83 0:01.51 0:01.61 0:01.54 0:02.91 0:02.19 best: 0:01.51 loopnoalign
block size       8 0:02.08 0:04.44 0:04.46 0:02.53 0:02.39 0:02.42 0:02.16 0:01.95 0:01.75 0:01.81 0:01.82 0:03.12 0:02.54 best: 0:01.75 loopnoalign
block size       6 0:03.40 0:05.91 0:05.91 0:03.45 0:03.31 0:02.94 0:02.74 0:02.56 0:02.37 0:02.41 0:02.39 0:04.22 0:03.60 best: 0:02.37 loopnoalign
block size       4 0:04.04 0:07.01 0:07.01 0:04.34 0:03.88 0:03.12 0:03.13 0:03.12 0:03.00 0:03.12 0:03.00 0:04.95 0:04.20 best: 0:03.00 loopnoalign
block size       1 0:11.74 0:09.24 0:09.23 0:06.34 0:06.28 0:06.43 0:06.42 0:06.35 0:06.33 0:06.97 0:06.33 0:07.03 0:06.31 best: 0:06.28 rep4noalign

There seems to be a little suboptimality in my memset setup code (it is
actually not testing zeroing, but setting to an arbitrary value, and I do
use imul to compute the value to memset; I will check).

K8 is relatively similar here: up to 48 bytes the unrolled aligned loop
is preferred, and up to 8k it is rep mov.  Core seems special in not
worrying about alignment for small blocks.

Overall I would suggest dropping all the existing string operation
generation hints (such as REP_MOVL_OPTIMAL) and just trying to collect
such tables from all chips of interest, produce a list of algorithms to
use based on operand size, and feed them into our cost tables.
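
As an illustration of what such per-chip tables could look like, here
is a small sketch filled with the memcpy thresholds quoted earlier in
this mail.  The type and function names are hypothetical and this is
not the layout of the real cost tables.  Treating each threshold as
inclusive, and falling back to a libcall beyond the last threshold, are
assumptions:

```c
#include <assert.h>
#include <stddef.h>

enum stringop_alg { ALG_LOOP, ALG_REP, ALG_UNROLLED_LOOP, ALG_LIBCALL };

/* Use `alg` for block sizes up to and including `max_size`. */
struct size_rule {
    size_t max_size;
    enum stringop_alg alg;
};

#define MAX_RULES 4

struct stringop_table {
    const char *cpu;
    size_t nrules;
    struct size_rule rules[MAX_RULES];   /* sorted by ascending max_size */
};

/* Thresholds as quoted in the text above. */
static const struct stringop_table opteron_table = {
    "opteron", 2, { { 48, ALG_LOOP }, { 8192, ALG_REP } }
};
static const struct stringop_table nocona_table = {
    "nocona", 3,
    { { 32, ALG_LOOP }, { 20000, ALG_REP }, { 100000, ALG_UNROLLED_LOOP } }
};
static const struct stringop_table generic_table = {
    "generic", 2, { { 32, ALG_LOOP }, { 8192, ALG_REP } }
};

/* Return the first algorithm whose size limit covers `size`; anything
   larger than the last limit goes to the library. */
enum stringop_alg choose_alg(const struct stringop_table *t, size_t size)
{
    assert(t != NULL && t->nrules <= MAX_RULES);

    for (size_t i = 0; i < t->nrules; i++)
        if (size <= t->rules[i].max_size)
            return t->rules[i].alg;
    return ALG_LIBCALL;
}
```

With this layout, choose_alg(&generic_table, 4096) selects ALG_REP, and
adding a new chip only means adding another table rather than another
tuning flag.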

I would also like to assume that the library implementation is sane.  We
should try to fix glibc by default, or instruct distributions using
vanilla glibc to hack GCC to change the defaults, rather than hacking GCC
to work around it by default.  Using a GCC-specific memset/memcpy
implementation is also an option, especially once Richard's math
function plans converge.

I now have the memcpy patch updated to mainline and will send it in a
separate mail once testing converges.

Honza