[Bug target/26658] [4.0/4.1/4.2/4.3 Regression] memcpy/memset are not inlining with -march=athlon-xp and size of 128

Wed Nov 7 09:15:00 GMT 2007

------- Comment #8 from jakub at gcc dot gnu dot org  2007-11-07 09:15 -------
I'd stress that this is an essentially worthless "benchmark", because it makes
no attempt to ensure the calls are really done and not optimized away, which
happens in the 3.4.x -march=athlon-xp case.  At expand time GCC decides which
of the forms of memcpy/memset is fastest, and 4.x believes for -march=athlon-xp
it is rep; stosl resp. rep; movsl, while 3.4.x believed it is 32 individual
stores resp. 32 reads + 32 stores; another alternative is calling an optimized
memcpy library routine.
Try changing the definition of T to
#define T memcpy(mb1, mb2, Block_Size); memset(mb2, i, Block_Size); \
  asm volatile ("" : : "r" (mb1), "r" (mb2) : "memory");
which makes sure the memcpy/memsets can't be optimized away, and you'll see very
different results.
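
For illustration, here is a minimal sketch of how such a T could sit in a
benchmark loop.  The buffer declarations, the value of Block_Size (taken from
the size of 128 in the bug title) and the helper run_steps are assumptions,
not the reporter's actual test case; __asm__ is used instead of asm only so
the sketch also compiles in strict ISO mode.

```c
#include <assert.h>
#include <string.h>

enum { Block_Size = 128 };          /* assumed from the bug title */

static char mb1[Block_Size];        /* assumed buffer declarations */
static char mb2[Block_Size];

/* One benchmark step.  The empty asm takes both buffer addresses as
   inputs and clobbers memory, so the compiler must assume the buffer
   contents are read and cannot delete the memcpy/memset as dead. */
#define T \
    memcpy(mb1, mb2, Block_Size); \
    memset(mb2, i, Block_Size); \
    __asm__ volatile ("" : : "r" (mb1), "r" (mb2) : "memory");

/* Hypothetical driver: runs T `iterations` times and returns the last
   byte of mb1 so the result can be inspected by the caller. */
static int run_steps(int iterations)
{
    assert(iterations > 0);
    for (int i = 0; i < iterations; i++) {
        T
    }
    return (unsigned char) mb1[Block_Size - 1];
}
```

The asm emits no instructions itself; it only acts as a barrier for the
optimizers, so the timing still measures just the memcpy/memset.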

The thing is just that we are able to DSE only the memcpy/memset expanded to
individual instructions.  What we perhaps should have is a tree pass which
analyzes all the usual string operations, knows exactly what they are doing and
tracks what they do with memory (e.g. how long a zero-terminated string in some
buffer is and what values it contains: these len1 bytes are copied from bufx,
these len2 bytes are 0), and changes, say, calls like strcat where we know
where the destination string ends into strcpy (or memcpy if we even know the
length, etc.).
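
A source-level sketch of the kind of rewrite meant here, written by hand; it
is not the output of any existing pass, and the function names are
hypothetical.  Once the length of the string already in the buffer is known,
the strcat no longer needs to rescan the destination and can become a memcpy
to a known offset.

```c
#include <assert.h>
#include <string.h>

/* What the programmer writes: strcat has to walk buf again to find
   the terminating zero that strcpy just stored. */
static size_t build_naive(char *buf, const char *a, const char *b)
{
    assert(buf != NULL && a != NULL && b != NULL);
    strcpy(buf, a);
    strcat(buf, b);
    return strlen(buf);
}

/* What a pass tracking string lengths could turn it into: the end of
   the destination string is buf + len_a, so both calls become memcpy. */
static size_t build_tracked(char *buf, const char *a, const char *b)
{
    assert(buf != NULL && a != NULL && b != NULL);
    size_t len_a = strlen(a);
    size_t len_b = strlen(b);
    memcpy(buf, a, len_a);              /* first len_a bytes come from a */
    memcpy(buf + len_a, b, len_b + 1);  /* copy b including its zero */
    return len_a + len_b;               /* length known without rescanning */
}
```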

Plus perhaps teach tree DSE about memcpy/memset.
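
As a hand-written sketch of what that would buy (the names and the block size
are assumptions): a memset whose bytes are all overwritten by a following
memcpy before anything reads them is a dead store, and a DSE that understood
the two calls could drop it.

```c
#include <assert.h>
#include <string.h>

enum { BLOCK = 128 };   /* assumed size, matching the bug title */

/* Before: the memset is dead, because the memcpy overwrites all BLOCK
   bytes of dst and nothing reads dst in between. */
static void fill_with_dead_store(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    memset(dst, 0, BLOCK);
    memcpy(dst, src, BLOCK);
}

/* After: what remains once the dead memset has been removed. */
static void fill_after_dse(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, BLOCK);
}
```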

-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26658