This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Builtin expansion versus headers optimization: Reductions


On Fri, Jun 05, 2015 at 09:26:33AM +0400, Mikhail Maltsev wrote:
> > The compiler has much more information than the headers.
> > - It can do alias analysis, so to avoid needing to handle overlap
> > and similar.
> > - It can (sometimes) determine alignment, which is important
> > information for tuning.
> > - With profile feedback it can use value histograms to determine the
> > best code.
> Value range information is also passed to, for example, memcpy expander.
> Even if exact length of data being copied is not known at compile time,
> the known range can help the compiler to select a subset of available
> algorithms a choose one of them at run time without using several
> branches and a huge jumptable (which is unavoidable in library code,
> because it has to deal with general case).
>
You can't as there is no such thing as optimum algorithm. I did simple
memset benchmark that tests 8byte aligned memset(x, 0, c) with constant
argument on L1 cache resident data.

You as compiler simply cannot do that as you would need to expand
memset(x, 0, 256) into

a) rep stosb. On haswell thats fastest which glibc memset-avx2 does.
b) sequence of movdqa %xmm0, x(%rdi) Thats optimal on ivy bridge.
Thats a table lookup exported from libc that I talked about.
c) use -mstringop-strategy=unrolled_loop. Thats a fastest implementation on core2

Now tell me how you as a compiler would do selection?

Also data clearly shows that memset suffers same problem as memcmp and
using loop should be disabled by default. Its slower than glibc one
since 128 where you stop unrolling it into movq sequence for all tested
processors except core2 where treshold is at 256 bytes (read below).

>From variants only unrolled_loop is sometimes faster but thats slower
than a full unrolling that I proposed.

As using jumptable its only way to do that as you need to map each size
to appropriate address in completely unrolled implementation. Also we
could use unrolled implementation using vmovdqa %ymm0, (%rdi) on
haswell.
Without that I could use jumptable with powers of 2.

 
> BUT comparison functions (like str[n]cmp, memcmp) expand to something
> like "repz cmpsb", which is definitely suboptimal. IMHO, a quick and
> dirty (but working) solution would be to use library call instead of
> relying on HAVE_cmpstr[n]si, (except short strings). A better solution
> would be to implement something like movmem pattern (which is aware of
> value ranges, alignment, etc.) but for comparison patterns. I could try
> to write at least some pro

I wrote a recursive always_inline function how it should be done in
sibling thread.

Main point is for gcc to recognize streq and memeq when you dont care
about strcmp sign which makes implementation almost trivial.


haswell

size 64

glibc.s	0.41
loop.s	0.64
memset-avx2.s	0.52
memset_tbl.s	0.43
rep8.s	0.91
unroll.s	0.44

size 96

glibc.s	0.63
loop.s	0.82
memset-avx2.s	0.52
memset_tbl.s	0.55
rep8.s	0.91
unroll.s	0.56

size 128

glibc.s	0.80
loop.s	1.02
memset-avx2.s	0.77
memset_tbl.s	0.64
rep8.s	0.92
unroll.s	0.64

size 192

glibc.s	1.06
loop.s	1.46
memset-avx2.s	0.95
memset_tbl.s	0.80
rep8.s	1.26
unroll.s	0.88

size 256

glibc.s	1.34
loop.s	1.87
memset-avx2.s	0.70
memset_tbl.s	0.93
rep8.s	1.46
unroll.s	1.10

size 384

glibc.s	1.64
loop.s	3.05
memset-avx2.s	0.85
memset_tbl.s	1.21
rep8.s	1.88
unroll.s	1.54

size 512

glibc.s	1.86
loop.s	3.59
memset-avx2.s	1.07
memset_tbl.s	1.45
rep8.s	2.28
unroll.s	2.00

size 1024

glibc.s	2.94
loop.s	5.80
memset-avx2.s	1.83
memset_tbl.s	2.54
rep8.s	2.81
unroll.s	3.78

size 2048

glibc.s	4.91
loop.s	10.44
memset-avx2.s	3.60
memset_tbl.s	4.65
rep8.s	4.91
unroll.s	7.50


i7_ivy_bridge



size 64

glibc.s	0.44
loop.s	0.99
memset-avx2.s	0.57
memset_tbl.s	0.44
rep8.s	0.96
unroll.s	0.48

size 96

glibc.s	0.64
loop.s	1.14
memset-avx2.s	0.56
memset_tbl.s	0.59
rep8.s	1.00
unroll.s	0.62

size 128

glibc.s	0.80
loop.s	1.45
memset-avx2.s	0.84
memset_tbl.s	0.66
rep8.s	1.14
unroll.s	0.69

size 192

glibc.s	1.04
loop.s	2.24
memset-avx2.s	0.97
memset_tbl.s	0.82
rep8.s	1.38
unroll.s	0.93

size 256

glibc.s	1.31
loop.s	2.83
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	0.95
rep8.s	1.61
unroll.s	1.17

size 384

glibc.s	1.61
loop.s	4.13
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	1.26
rep8.s	2.01
unroll.s	1.72

size 512

glibc.s	1.89
loop.s	4.99
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	1.48
rep8.s	2.39
unroll.s	2.05

size 1024

glibc.s	3.01
loop.s	8.26
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	3.31
rep8.s	3.43
unroll.s	3.84

size 2048

glibc.s	5.01
loop.s	14.89
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	5.05
rep8.s	5.75
unroll.s	7.83


core2:



size 64

glibc.s	1.04
loop.s	1.48
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	Command terminated by signal 4 0.00
rep8.s	1.53
unroll.s	0.85

size 96

glibc.s	1.81
loop.s	1.73
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	Command terminated by signal 4 0.00
rep8.s	1.53
unroll.s	1.11

size 128

glibc.s	1.87
loop.s	2.21
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	Command terminated by signal 4 0.00
rep8.s	1.73
unroll.s	1.29

size 192

glibc.s	2.25
loop.s	3.42
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	Command terminated by signal 4 0.00
rep8.s	2.14
unroll.s	1.86

size 256

glibc.s	2.59
loop.s	4.38
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	Command terminated by signal 4 0.00
rep8.s	2.46
unroll.s	2.24

size 384

glibc.s	3.18
loop.s	6.32
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	Command terminated by signal 4 0.00
rep8.s	3.35
unroll.s	3.21

size 512

glibc.s	3.80
loop.s	8.25
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	Command terminated by signal 4 0.00
rep8.s	4.19
unroll.s	4.07

size 1024

glibc.s	6.27
loop.s	12.92
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	5.95
rep8.s	7.97
unroll.s	7.64

size 2048

glibc.s	11.37
loop.s	24.11
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	11.01
rep8.s	14.56
unroll.s	14.97


fx10:



size 64

glibc.s	0.65
loop.s	1.06
memset-avx2.s	0.83
memset_tbl.s	0.64
rep8.s	0.92
unroll.s	0.67

size 96

glibc.s	1.15
loop.s	1.25
memset-avx2.s	0.97
memset_tbl.s	0.87
rep8.s	1.14
unroll.s	0.88

size 128

glibc.s	1.19
loop.s	1.91
memset-avx2.s	1.50
memset_tbl.s	0.91
rep8.s	1.20
unroll.s	1.00

size 192

glibc.s	1.59
loop.s	2.78
memset-avx2.s	1.61
memset_tbl.s	1.22
rep8.s	1.58
unroll.s	1.29

size 256

glibc.s	1.76
loop.s	3.38
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	1.69
rep8.s	1.97
unroll.s	1.78

size 384

glibc.s	2.33
loop.s	4.59
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	2.12
rep8.s	2.64
unroll.s	2.37

size 512

glibc.s	3.01
loop.s	5.76
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	2.70
rep8.s	4.45
unroll.s	3.36

size 1024

glibc.s	5.72
loop.s	10.43
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	5.48
rep8.s	6.56
unroll.s	6.06

size 2048

glibc.s	10.32
loop.s	19.63
memset-avx2.s	Command terminated by signal 4 0.00
memset_tbl.s	10.15
rep8.s	11.17
unroll.s	11.12

Attachment: memset_bt.tar.bz2
Description: Binary data


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]