Bug 43052

Summary: [4.9/5/6 Regression] Inline memcmp is *much* slower than glibc's, no longer expanded inline
Product:          gcc
Reporter:         Björn Stenberg <bjorn>
Component:        target
Assignee:         Bernd Schmidt <bernds>
Status:           RESOLVED FIXED
Severity:         normal
CC:               bernds, bjorn, dimhen, dodji, dushistov, enkovich.gnu, gcc-bugs, hjl.tools, hubicka, izamyatin, jakub, jlp.bugs, justin.lebar+bug, manu, mh+gcc, nickc, ojab, rth, sergos.gnu, thayer-public, tkoenig, uros, wbrana, xunxun1982
Priority:         P2
Version:          4.4.3
Target Milestone: 4.9.4
Host:             i486-linux-gnu
Build:            i486-linux-gnu
Target:           x86_64-*-*, i?86-*-*
Known to work:
Known to fail:
Last reconfirmed: 2010-02-12 15:45:19
Bug Depends on:   52171
Bug Blocks:       95458
Attachments:      memcpy/memset testing script; Test results from my core i7

Description Björn Stenberg 2010-02-12 14:51:36 UTC
memcmp-intensive code becomes up to six times slower when compiled with -O3 than with -g or -O0. The reason is that the inline memcmp expansion is *much* slower than glibc's memcmp.

Here's a simple test case:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/times.h>

void* list[1024 * 1024];

int main(void)
{
    int count = sizeof(list) / sizeof(char*);
    int i;

    /* Allocate 1M identical zero-filled 1 KB blocks. */
    for (i=0; i < count; i++)
        list[i] = calloc(1024, 1);

    /* Every pair of blocks matches, so each memcmp scans all 1024 bytes. */
    int dupes = 0;
    int start = times(NULL);
    for (i=0; i<count-1; i++)
        if (!memcmp(list[i], list[i+1], 1024))
            dupes++;

    int ticks = times(NULL) - start;
    printf("Time: %d ticks (%d memcmp/tick)\n", ticks, dupes/ticks);

    return 0;
}


# gcc -O3 -o test test.c
# ./test
Time: 188 ticks (5577 memcmp/tick)
# gcc -O0 -o test test.c
# ./test
Time: 30 ticks (34952 memcmp/tick)

System is Debian testing with libc package 2.10.1-7.
Comment 1 Alexander Monakov 2010-02-12 15:45:19 UTC
Confirmed. GCC simply emits repz cmpsb.  There was even an e-mail with benchmark results and a patch (never applied):

http://gcc.gnu.org/ml/gcc-patches/2009-09/msg02129.html
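
[Editorial note: for context, repz cmpsb compares a single byte per iteration. A minimal C sketch of the equivalent loop — illustrative only, not GCC's actual expansion; the helper name is hypothetical:]

#include <stddef.h>

/* Byte-at-a-time compare, roughly what "repz cmpsb" does: one byte per
   iteration, which is why it loses badly to glibc's word-at-a-time and
   SSE implementations on large buffers. */
static int memcmp_bytewise(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a, *q = b;
    for (; n; n--, p++, q++)
        if (*p != *q)
            return *p - *q;
    return 0;
}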
Comment 2 Michael Thayer 2010-10-01 11:08:11 UTC
Unless I read those benchmarks wrong, they say that for unaligned memory the inline byte-by-byte version is almost always faster.  So surely a multi-byte inline version should at least do an alignment check before proceeding.  I don't know if it would still be worth inlining then.
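
[Editorial note: a minimal sketch of the alignment-gated expansion suggested above, assuming word-sized chunks; the helper name and the fall-back-to-glibc strategy are hypothetical:]

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical expansion: only use a word-at-a-time loop when both
   pointers are word-aligned, otherwise fall back to the library call. */
static int memcmp_aligned(const void *a, const void *b, size_t n)
{
    if (((uintptr_t)a | (uintptr_t)b) % sizeof(unsigned long))
        return memcmp(a, b, n);           /* unaligned: let glibc handle it */

    const unsigned char *p = a, *q = b;
    while (n >= sizeof(unsigned long)) {
        unsigned long x, y;
        memcpy(&x, p, sizeof x);          /* word-sized loads */
        memcpy(&y, q, sizeof y);
        if (x != y)                       /* differing word: byte scan for sign */
            return memcmp(p, q, sizeof x);
        p += sizeof x;
        q += sizeof x;
        n -= sizeof x;
    }
    return n ? memcmp(p, q, n) : 0;
}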
Comment 3 Jan Hubicka 2010-11-10 21:17:09 UTC
Mine.  I think we should go for the rep movq. Ideally we should track the alignment from the calloc call to the use, perhaps by fusing the loops, but I am just daydreaming with that ;))

 I will take a look.

Honza
Comment 4 Jan Hubicka 2010-11-10 21:36:07 UTC
Hmm, I should read testcases more carefully. It is memcmp. GCC is not really smart about inlining this; I guess we should just disable the inline expansion unless we optimize for size, since we don't really have enough info to inline it well by default.
Comment 5 H.J. Lu 2010-11-11 02:24:28 UTC
String/memory functions in the current glibc are VERY FAST,
especially for larger data sizes, in all cases. GCC should
inline only for small/known data sizes. We are investigating this.
Comment 6 Richard Biener 2010-11-11 10:39:35 UTC
(In reply to comment #4)
> Hmm, I should read testcases curefully. It is memcmp. GCC is not really smart
> on inlining this; I guess we should just disable the inline unless we optimize
> for size since we don't really have enough info to inline it well by default.

We can still inline for very small known sizes, where the call overhead
would outweigh any optimization in glibc.
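
[Editorial note: for instance, memcmp with a known size of 4 could be expanded as two loads plus a compare. A sketch assuming a little-endian target; the helper name is hypothetical, and GCC's __builtin_bswap32 is used for the byte-order fixup:]

#include <stdint.h>
#include <string.h>

/* Hypothetical inline expansion of memcmp(s, q, 4) on a little-endian
   target: two 4-byte loads; the byte swaps make the unsigned word
   comparison match memcmp's lexicographic (big-endian) order. */
static int memcmp4(const void *s, const void *q)
{
    uint32_t a, b;
    memcpy(&a, s, 4);                  /* safe unaligned loads */
    memcpy(&b, q, 4);
    a = __builtin_bswap32(a);
    b = __builtin_bswap32(b);
    return a < b ? -1 : a > b;         /* -1, 0 or 1, as memcmp allows */
}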
Comment 7 Jakub Jelinek 2010-11-11 11:46:43 UTC
Speaking of memcmp/strcmp, perhaps we should, on the other hand, inline
commonly used comparisons for !-Os, like
if (strcmp (p, ".") == 0 || strcmp (p, "..") == 0)
  continue;

glibc has a macro for this, but I think it would be preferable to do it in gcc where we know about -Os, cold basic blocks etc.

Currently we only fold if (strcmp (p, "")) ...
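
[Editorial note: the folding Jakub describes amounts to expanding the short string literals into direct byte tests; a sketch with hypothetical helper names:]

/* What the folding could expand to: comparing against a short literal
   needs no call at all, just direct byte tests including the NUL. */
static int is_dot(const char *p)        /* strcmp (p, ".") == 0 */
{
    return p[0] == '.' && p[1] == '\0';
}

static int is_dotdot(const char *p)     /* strcmp (p, "..") == 0 */
{
    return p[0] == '.' && p[1] == '.' && p[2] == '\0';
}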
Comment 8 Justin Lebar 2011-06-13 18:09:07 UTC
I just did some tests, and on my machine, glibc's memcmp is faster even when the size of the thing we're comparing is 4 bytes.  I can't point to a case that this optimization speeds up on my machine.
Comment 9 H.J. Lu 2011-06-13 18:14:26 UTC
The basic string/memory functions in glibc 2.13 or above are
much faster in all cases.  GCC can't beat glibc if the function
call overhead is low.
Comment 10 Justin Lebar 2011-06-13 18:18:13 UTC
Can I force gcc not to use its inlined version?
Comment 11 Jan Hubicka 2011-07-04 10:11:03 UTC
H.J.,
if the glibc implementation beats gcc even for size 4, I guess we could just drop the pattern or enable it at -Os only.
Or are there easy cases we want to inline, like we do for memcpy?

Unlike memcpy, memcmp/strcmp is more difficult to handle because the amount of memory it will process is harder to estimate. I guess that with known alignment and/or a small upper bound on the object size, inline code would still be a win.
Comment 12 Jan Hubicka 2011-07-04 10:49:18 UTC
Created attachment 24670 [details]
memcpy/memset testing script

HJ,
can you please run the attached script with new glibc as 
sh test_stringop 64 640000000 gcc -march=native | tee out

In my quick testing with glibc 2.11 on Core i5 and AMD machines, inline memcpy/memset is still a win on the i5 for all block sizes (our optimization table is, however, wrong, since it is inherited from the generic one). For blocks of 512 bytes and above, the inline code is about as fast as the glibc code but obviously longer.

On the AMD machine, the libcall is a win for blocks of 1k to 8k. For large blocks, inline seems to be a win again, for whatever reason. Probably the prefetch logic is wrong in the older glibc.

If glibc's string operations have finally been made sane, we ought to revisit the tables we generate inline versions from.
Comment 13 Justin Lebar 2011-07-04 14:40:40 UTC
(In reply to comment #12)
> Created attachment 24670 [details]
> memcpy/memset testing script
> 
> HJ,
> can you please run the attached script with new glibc as 
> sh test_stringop 64 640000000 gcc -march=native | tee out

Do you think you could spin a script which also tests memcmp?
Comment 14 Justin Lebar 2011-07-04 15:00:36 UTC
Created attachment 24676 [details]
Test results from my core i7
Comment 15 Jan Hubicka 2011-07-05 11:08:50 UTC
> 
> Do you think you could spin a script which also tests memcmp?
memcmp is a different story.  A few years back I rewrote the memcpy/memset code generation to allow
choosing from several basic implementations based on size. The script I attached basically
tests the individual algorithms, so for each CPU target you can set up a table choosing the
best-performing one for a given size.

memcmp is still expanded in a very stupid way.  Sane inline implementations of
memcmp are harder than memcpy/memset.  I will try to look into it and into how
the currently recommended code sequences look for AMD/Intel chips.

Honza
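
[Editorial note: conceptually, the per-target table Honza describes maps size ranges to expansion strategies; a hypothetical sketch — names and thresholds invented for illustration, not GCC's actual tables:]

#include <stddef.h>

/* Hypothetical per-CPU strategy table: pick an expansion algorithm
   based on the (known or estimated) block size. */
enum stringop_alg { ALG_UNROLLED, ALG_REP_PREFIX, ALG_LIBCALL };

struct stringop_entry {
    size_t max_size;               /* entry applies to sizes <= max_size */
    enum stringop_alg alg;
};

static const struct stringop_entry memcpy_strategy[] = {
    { 16,         ALG_UNROLLED   },   /* tiny: straight-line moves */
    { 512,        ALG_REP_PREFIX },   /* small: rep movs           */
    { (size_t)-1, ALG_LIBCALL    },   /* large: call glibc         */
};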
Comment 16 Richard Biener 2011-08-24 10:56:00 UTC
Since

2011-08-12  Nick Clifton  <nickc@redhat.com>

        * builtins.c (expand_builtin_memcmp): Do not use cmpstrnsi pattern.
        * doc/md.texi (cmpstrn): Note that the comparison stops if both
        fetched bytes are zero.
        (cmpstr): Likewise.
        (cmpmem): Note that the comparison does not stop if both of the
        fetched bytes are zero.

we no longer expand memcmp inline as x86 does not have a cmpmemsi pattern.

Testcase:

#include <string.h>

int foo(char *s, char *q)
{
  return memcmp (s, q, 4);
}

the 1-size case is folded to *s == *q.

This is now a regression; we have to do _something_ about it for 4.7,
at least for small constant sizes.
Comment 17 Jan Hubicka 2011-08-24 14:20:29 UTC
Hmm,
I guess ideally the middle-end should know how to inline the simple loop (for both strlen and memcmp) and do so when the object size is known to be small (probably bounded by a target-specific value). Or does anyone think that is a bad idea?

We could then make the i386 backend inline the rep instruction and/or the new string extensions again.
Honza
Comment 18 Richard Biener 2011-08-24 14:36:45 UTC
(In reply to comment #17)
> Hmm,
> I guess ideally the middle-end should know how to inline the simple loop (for
> both strlen and memcmp) and do so when object size is known to be small
> (probably by target specific value). Or does anyone think it is a bad idea?

I think that's a bad idea, yes.  Apart from maybe the simple cases we already
handle for memcpy -> assignment simplifications (thus, aligned compares up to
word_mode size), but presumably only if the result is checked against
==/!= 0(?).
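
[Editorial note: when the caller only tests the result against zero, the expansion is simpler still, since no byte-order fixup is needed; a sketch with a hypothetical helper name:]

#include <stdint.h>
#include <string.h>

/* Hypothetical expansion for the ==/!= 0 case: when only equality
   matters, a plain word compare suffices and no byte swap is needed. */
static int memcmp4_eq(const void *s, const void *q)
{
    uint32_t a, b;
    memcpy(&a, s, 4);
    memcpy(&b, q, 4);
    return a != b;                 /* nonzero iff the 4 bytes differ */
}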
Comment 20 Richard Biener 2012-03-22 08:27:12 UTC
GCC 4.7.0 is being released, adjusting target milestone.
Comment 21 Richard Biener 2012-06-14 08:36:17 UTC
GCC 4.7.1 is being released, adjusting target milestone.
Comment 22 Jakub Jelinek 2012-09-20 10:19:50 UTC
GCC 4.7.2 has been released.
Comment 23 Steven Bosscher 2013-03-13 20:13:32 UTC
Honza, ping?
Comment 24 Richard Biener 2013-04-11 07:59:26 UTC
GCC 4.7.3 is being released, adjusting target milestone.
Comment 25 Richard Biener 2014-06-12 13:47:04 UTC
The 4.7 branch is being closed, moving target milestone to 4.8.4.
Comment 26 Jakub Jelinek 2014-12-19 13:30:13 UTC
GCC 4.8.4 has been released.
Comment 27 Jan Hubicka 2015-03-20 20:43:27 UTC
I won't be able to do anything with this for 5.0, so I am unassigning myself. I will try to keep this on my radar for next stage1, though.
Comment 28 Manuel López-Ibáñez 2015-06-03 06:34:01 UTC
*** Bug 59048 has been marked as a duplicate of this bug. ***
Comment 29 Manuel López-Ibáñez 2015-06-03 06:34:51 UTC
This is not assigned to anyone.
Comment 30 Richard Biener 2015-06-23 08:19:16 UTC
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
Comment 31 Jakub Jelinek 2015-06-26 19:55:18 UTC
GCC 4.9.3 has been released.
Comment 32 Nick Clifton 2016-01-08 13:13:38 UTC
Is this PR still a problem?

I tried building the testcase from comment #1 today and found that memcmp is being called even at -O3.  The run times are approximately the same for -O3 and -O0 as well.

Cheers
  Nick
Comment 33 Bernd Schmidt 2016-01-14 13:41:37 UTC
I see memcmp being called at -O3 in compilers going all the way back to 4.5. That's expected given Nick's patch to disable cmpstrnM. I guess I'll look into making a cmpmem pattern.

Also, the test program needs some additional repetition to give meaningful times.
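
[Editorial note: a sketch of the kind of repetition Bernd means, as a reworked timing loop for the original testcase; REPS and the helper name are invented for illustration:]

#include <stdio.h>
#include <string.h>
#include <sys/times.h>

#define REPS 100   /* invented repeat count */

/* Hypothetical rework of the original timing loop: repeat the whole
   comparison pass so the interval is well above clock-tick granularity. */
static void time_memcmp(void *list[], int count)
{
    long dupes = 0;
    clock_t start = times(NULL);
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < count - 1; i++)
            if (!memcmp(list[i], list[i + 1], 1024))
                dupes++;
    clock_t ticks = times(NULL) - start;
    printf("Time: %ld ticks (%ld memcmp/tick)\n",
           (long)ticks, ticks ? dupes / (long)ticks : 0);
}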
Comment 34 Björn Stenberg 2016-01-14 14:04:28 UTC
Yes, confirmed. -O3 and -O0 now both run at the same speed, so this bug is fixed.

Sorry for being potentially several years late in confirming this.
Comment 35 Bernd Schmidt 2016-01-18 13:57:09 UTC
Closing since PR52171 tracks the other problem.