This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[PATCH] Fix a bunch of i?86 memcpy/memset expansion bugs (PR target/38686, PR target/38708)


Hi!

This patch fixes a bunch of bugs in i386.c memset and memcpy expanders.
The patch has been bootstrapped on x86_64-linux, i686-linux,
i686-linux --with-tune=pentium-m, i686-linux --with-arch=pentium-m
and regtested on the first tree with:
make -j48 -k check RUNTESTFLAGS="--target_board=unix/'{`echo -mstringop-strategy={libcall,rep_{,4,8}byte,{byte_,,unrolled_}loop} | sed s0\ 0,0g`,-m64}'"
on the second tree with:
make -j48 -k check RUNTESTFLAGS="--target_board=unix/'{`echo -mstringop-strategy={libcall,rep_{,4}byte,{byte_,,unrolled_}loop} | sed s0\ 0,0g`,-m32}'"
and on the third tree with:
make -j48 -k check RUNTESTFLAGS="--target_board=unix/-mtune=pentium-m/'{`echo -mstringop-strategy={libcall,rep_{,4}byte,{byte_,,unrolled_}loop} | sed s0\ 0,0g`,-m32}'"

Compared to normal x86_64-linux or i686-linux checking with no special options
the following extra FAILs still appear:
{{-m64,-m32,-m32/-mtune=pentium-m}/-mstringop-strategy={{byte_,,unrolled_}loop,rep_{,4}_byte},-m64/-mstringop-strategy=rep_8byte}
FAIL: gcc.dg/visibility-11.c scan-assembler memcpy@PLT
{-m64,-m32,-m32/-mtune=pentium-m}/-mstringop-strategy={byte_loop,libcall}
FAIL: gcc.target/i386/cold-attribute-1.c scan-assembler stosb
{-m64,-m32,-m32/-mtune=pentium-m}/-mstringop-strategy=libcall
FAIL: gcc.target/i386/inline-mcpy.c scan-assembler-not memcpy
{-m64,-m32,-m32/-mtune=pentium-m}/-mstringop-strategy={,unrolled_}loop
FAIL: gcc.target/i386/pr11001-memset-1.c (internal compiler error)   
FAIL: gcc.target/i386/pr11001-memset-1.c (test for excess errors) 
FAIL: gcc.target/i386/pr11001-memset-3.c (internal compiler error)
FAIL: gcc.target/i386/pr11001-memset-3.c (test for excess errors) 
{-m32,-m32/-mtune=pentium-m}/-mstringop-strategy={{byte_,,unrolled_}loop,libcall}
FAIL: gcc.target/i386/memcpy-1.c scan-assembler rep 
FAIL: gcc.target/i386/memcpy-1.c scan-assembler movs
-m32/-mtune=pentium-m{,/-mstringop-strategy={{byte_,,unrolled_}loop,rep_{,4}_byte}}
FAIL: gcc.dg/tree-ssa/loop-28.c scan-tree-dump-times final_cleanup "MEM" 65 
FAIL: gcc.dg/tree-ssa/prefetch-3.c scan-tree-dump-times aprefetch "unroll factor 4" 1
FAIL: gcc.misc-tests/i386-pf-none-1.c scan-assembler-not fetch 
FAIL: gcc.misc-tests/i386-pf-none-1.c scan-assembler-not fetch 
FAIL: gcc.target/i386/addr-sel-1.c scan-assembler a\\\\+1 
FAIL: gcc.target/i386/addr-sel-1.c scan-assembler b\\\\+1 
FAIL: gcc.target/i386/bt-1.c scan-assembler btl[ \\t] 
FAIL: gcc.target/i386/bt-2.c scan-assembler btl[ \\t] 
FAIL: gcc.target/i386/bt-mask-1.c scan-assembler-not and[lq][ \\t] 
FAIL: gcc.target/i386/bt-mask-2.c scan-assembler-not and[lq][ \\t]  
FAIL: gcc.target/i386/cmov1.c scan-assembler sar[^\\\\n]*magic_namec 
FAIL: gcc.target/i386/cmov1.c scan-assembler shr[^\\\\n]*magic_namef
FAIL: gcc.target/i386/movq-2.c (test for excess errors) 
FAIL: gcc.target/i386/movq.c (test for excess errors)   
FAIL: gcc.target/i386/mul.c scan-assembler and[^\\\\n]*magic 
FAIL: gcc.target/i386/pr30315.c scan-assembler-times cmp 4
FAIL: gcc.target/i386/xchg-1.c scan-assembler rol  
FAIL: gcc.target/i386/xchg-2.c scan-assembler xchgb
but they don't seem to be regressions. pr11001-memset-{1,3}.c is discussed
above, the rest of -mstringop-strategy=something only match failures,
unless it can be reproduced with just some -mtune= option, can be IMHO ignored
and the -m32 -mtune=pentium-m specific failures also look just like
testcases not expecting -mtune=pentium-m.

The bugs fixed:
1) -m32 -mstringop-strategy=rep_8byte ICEd on all non-libcall memcpy/memset
   expansions (obviously, as rep; stosq isn't a 32-bit insn)
2) if chosen algorithm need_zero_guard, desired alignment was bigger than
   actual and count was constant, but forced into register, we could store/copy
   too many bytes, as need_zero_guard comparison was only done if !count.
   Say for count 9, size_needed 8, desired_align 8, align 1 (and, not known
   at compile time, (dst & 7) == 1, we would store 7 bytes in the prologue
   and then jump into the main loop, so store another 8 bytes, instead of
   jumping directly to the epilogue
3) for size_needed 1, desired_align 8, and very small sizes movmem jumps
   around prologue (aligning of dst), but unlike size_needed > 2 where it
   jumps to the epilogue, it jumps to the main algorithm.  But as
   epilogue_size_needed wasn't reset to 1, epilogue was executed afterwards,
   storing extra bytes
4) for setmem, size_needed 1, desired_align 8, non-constant value, very small
   sizes, we'd jump around prologue and promoted_val initialization directly
   to the main algorithm, which used (subreg:QI (promoted_val) 0).  But as
   promoted_val wasn't initialized in that codepath, we'd store uninitialized
   value
5) setmem epilogue for desired_align - align > size_needed was called
   with not a power of 2 value, using that value - 1 as a mask for loops
   resulted in wrong epilogues
6) movmem for constant small sizes, desired_align > align and size_needed 1
   would jump directly to the epilogue, not the main algorithm, which is shorter.
   When count is smaller than epilogue_size_needed, we just want to
   do rep; stosb immediately without aligning (or the 1 byte storing loop).
7) setmem for constant sizes could ICE (regression introduced by my recent
   setmem changes) trying to and a CONST_INT count_exp with a constant mask.

Other bugs, not fixed by this patch:
1) pr11001-memset-{1,3}.c ICE with certain -mstring-strategy= switches
   (see above) ICEs, because decide_alg only considers the main algorithm
   when deciding if an algorithm is usable when some needed hard regs are
   fixed.  But even for algorithms that don't use fixed regs, the epilogue
   still uses gen_strset or gen_strmov and those on some architectures still
   use stos* or movs* insns.  If we care about it, I guess we'd need to
   force loopy epilogue in that case, or avoid using strmov/strset.
2) in some cases for -Os and small sizes memset by pieces was rejected
   as too large, but the actual expanded sequence looked much bigger.
   Unfortunately I haven't noted which exact parameters that was, will try
   to see if I can reproduce it again.

Attached are also two testcases I've used to verify it (e.g. offset and
length both known at compile time isn't exhaustively tested for small
sizes), but they are perhaps too expensive for make check, even on a fast
box they can take up to 16 minutes of compile time (runtime is almost
instant) with some options.  I've tested them both with
-O{0,1,2,3,s} {-m32 {,-mtune=pentiumpro},-m64} for all valid
-mstringop-strategy= options (plus without that option).

Ok for trunk?

	Jakub

Attachment: Y191
Description: Text document

Attachment: memset-4.c
Description: Text document

Attachment: memcpy-4.c
Description: Text document


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]