47000 – [4.5 Regression] Failure to inline SSE intrinsics

Bug 47000 - [4.5 Regression] Failure to inline SSE intrinsics

Summary: [4.5 Regression] Failure to inline SSE intrinsics

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	4.5.1

Importance:	P2 major
Target Milestone:	4.6.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2010-12-18 07:45 UTC by Jeff Garzik
Modified:	2012-07-02 10:27 UTC (History)
CC List:	4 users (show)

See Also:
Host:
Target:	x86_64-unknown-linux-gnu
Build:
Known to work:	4.4.2, 4.4.5, 4.6.0
Known to fail:	4.5.0, 4.5.1, 4.5.2
Last reconfirmed:	2010-12-18 12:39:26

Attachments
4-way SHA256 implementation, whose performance decreases markedly 4.4.x -> 4.5.x (20.30 KB, application/x-gzip) 2010-12-18 07:45 UTC, Jeff Garzik	Details
A new patch (376 bytes, patch) 2010-12-18 19:35 UTC, H.J. Lu	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Jeff Garzik 2010-12-18 07:45:54 UTC

Created attachment 22805 [details]
4-way SHA256 implementation, whose performance decreases markedly 4.4.x -> 4.5.x

OS: Fedora 14

My "cpuminer" open source project is -very- sensitive to performance of generated code, and experiences a severe performance regression going from gcc 4.4.x to 4.5.x.

Our program core is essentially
     for (n = 0; n < 0xffffff; n++)
          sha256( sha256( data ) )      /* one iteration of inner loop */

Building with gcc 4.4.5 -or- Fedora 13 gcc (4.4.x derivative), we achieve
     1850.85 kilo-iterations per second

Building with gcc 4.5.1 -or- Fedora 14 gcc (4.5.x derivative), we achieve
     1389.82 kilo-iterations per second

This is a significant performance decrease, and the only variable is the compiler.  I have presented x86_64 data below, but similar slowdowns are seen on i686-mingw in Fedora 13 (fast gcc 4.4.x) or Fedora 14 (slow gcc 4.5.x).

This interesting variant of the standard SHA256 algorithm is implemented using Intel/AMD SSE2-specific operations, effectively running four (4) SHA256 iterations in parallel, generating four (4) SHA256 hashes on four distinct datasets.

See attachment sha256_4way.i.

--------------------------------------------------------------------------
fast, working gcc -v:
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../src/gcc-4.4.5/configure --prefix=/garz/gcc44 --enable-languages=c
Thread model: posix
gcc version 4.4.5 (GCC) 

--------------------------------------------------------------------------
slow, broken gcc -v:
Using built-in specs.
COLLECT_GCC=/garz/gcc45/bin/gcc
COLLECT_LTO_WRAPPER=/garz/gcc45/libexec/gcc/x86_64-unknown-linux-gnu/4.5.1/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../src/gcc-4.5.1/configure --prefix=/garz/gcc45 --enable-languages=c
Thread model: posix
gcc version 4.5.1 (GCC)

Comment 1 Jeff Garzik 2010-12-18 07:48:23 UTC

Besides the attached sha256_4way.i, the full source code is at http://yyz.us/bitcoin/cpuminer-0.2.2.tar.gz  It's really quite small and easy to build and use.

A sample RPC destination, usable for free, is http://mining.bitcoin.cz/

Comment 2 Steven Bosscher 2010-12-18 12:27:37 UTC

What compiler options are you using?

Comment 3 Steven Bosscher 2010-12-18 12:39:26 UTC

Compiled like so:
$ gcc-4.4.2 -S -O2 sha256_4way.i -o sha256_4way-44.s
$ gcc-4.5.0 -S -O2 sha256_4way.i -o sha256_4way-45.s

$ grep -c call *.s
sha256_4way-44.s:0
sha256_4way-45.s:484
$ grep call *.s|head
sha256_4way-45.s:	call	ROTR
sha256_4way-45.s:	call	ROTR
sha256_4way-45.s:	call	ROTR
sha256_4way-45.s:	call	ROTR
sha256_4way-45.s:	call	ROTR
sha256_4way-45.s:	call	ROTR
sha256_4way-45.s:	call	ROTR
sha256_4way-45.s:	call	ROTR
sha256_4way-45.s:	call	ROTR
sha256_4way-45.s:	call	ROTR
$ 

ROTR should have been inlined:

static inline __m128i ROTR(__m128i x, const int n) {
    return _mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n);
}

This probably explains the slowdown.

Comment 4 Steven Bosscher 2010-12-18 12:49:12 UTC

GCC 4.6 (trunk revision 167996) also inlines ROTR. Is it possible for the reporter to measure the number of k-iters with a recent snapshot of the trunk?

Comment 5 H.J. Lu 2010-12-18 15:36:30 UTC

(In reply to comment #3)
> Compiled like so:
> $ gcc-4.4.2 -S -O2 sha256_4way.i -o sha256_4way-44.s
> $ gcc-4.5.0 -S -O2 sha256_4way.i -o sha256_4way-45.s
> 
> $ grep -c call *.s
> sha256_4way-44.s:0
> sha256_4way-45.s:484
> $ grep call *.s|head
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> $ 
> 
> ROTR should have been inlined:
> 
> static inline __m128i ROTR(__m128i x, const int n) {
>     return _mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n);
> }
> 
> This probably explains the slowdown.

This is caused by revision 151511:

http://gcc.gnu.org/ml/gcc-cvs/2009-09/msg00257.html

Comment 6 H.J. Lu 2010-12-18 15:40:38 UTC

(In reply to comment #5)
> (In reply to comment #3)
> > Compiled like so:
> > $ gcc-4.4.2 -S -O2 sha256_4way.i -o sha256_4way-44.s
> > $ gcc-4.5.0 -S -O2 sha256_4way.i -o sha256_4way-45.s
> > 
> > $ grep -c call *.s
> > sha256_4way-44.s:0
> > sha256_4way-45.s:484
> > $ grep call *.s|head
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > $ 
> > 
> > ROTR should have been inlined:
> > 
> > static inline __m128i ROTR(__m128i x, const int n) {
> >     return _mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n);
> > }
> > 
> > This probably explains the slowdown.
> 
> This is caused by revision 151511:
> 
> http://gcc.gnu.org/ml/gcc-cvs/2009-09/msg00257.html

It is fixed by revision 166517:

http://gcc.gnu.org/ml/gcc-cvs/2010-11/msg00405.html

Comment 7 H.J. Lu 2010-12-18 15:43:06 UTC

It may be fixed by the patch for PR 40436.

Comment 8 H.J. Lu 2010-12-18 16:03:44 UTC

Can you try

--
diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
index af1adf4..dd00de6 100644
--- a/gcc/tree-inline.c
+++ b/gcc/tree-inline.c
@@ -3342,7 +3342,11 @@ estimate_num_insns (gimple stmt, eni_weights *weights)
 	if (POINTER_TYPE_P (funtype))
 	  funtype = TREE_TYPE (funtype);
 
-	if (decl && DECL_BUILT_IN_CLASS (decl) == BUILT_IN_MD)
+	/* Do not special case builtins where we see the body.
+	   This just confuse inliner.  */
+	if (!decl || cgraph_node (decl)->analyzed)
+	  ;
+	else if (decl && DECL_BUILT_IN_CLASS (decl) == BUILT_IN_MD)
 	  cost = weights->target_builtin_call_cost;
 	else
 	  cost = weights->call_cost;
---

Comment 9 Jeff Garzik 2010-12-18 18:24:09 UTC

(In reply to comment #2)
> What compiler options are you using?

Pretty basic:  -O3 -Wall -msse2 -g

Sometimes -O3 -Wall -g -march=native, on a quad core Intel box

Results are the same: performance regression.

Will try HJ's patch next...

Comment 10 Jeff Garzik 2010-12-18 18:49:35 UTC

(In reply to comment #8)
> Can you try
> 
> --
> diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
> index af1adf4..dd00de6 100644
> --- a/gcc/tree-inline.c
> +++ b/gcc/tree-inline.c
> @@ -3342,7 +3342,11 @@ estimate_num_insns (gimple stmt, eni_weights *weights)
>      if (POINTER_TYPE_P (funtype))
>        funtype = TREE_TYPE (funtype);
> 
> -    if (decl && DECL_BUILT_IN_CLASS (decl) == BUILT_IN_MD)
> +    /* Do not special case builtins where we see the body.
> +       This just confuse inliner.  */
> +    if (!decl || cgraph_node (decl)->analyzed)
> +      ;
> +    else if (decl && DECL_BUILT_IN_CLASS (decl) == BUILT_IN_MD)
>        cost = weights->target_builtin_call_cost;
>      else
>        cost = weights->call_cost;
> ---

ICE appears when attempting to make with gcc 4.5.1 + your patch:

/garz/gcc45/src/gcc-4.5.1.patch1/libgcc/config/libbid/bid128_fma.c:4460:1: internal compiler error: in estimate_function_body_sizes, at ipa-inline.c:1920
Please submit a full bug report,

Comment 11 Jeff Garzik 2010-12-18 19:08:45 UTC

(In reply to comment #4)
> GCC 4.6 (trunk revision 167996) also inlines ROTR. Is it possible for the
> reporter to measure the number of k-iters with a recent snapshot of the trunk?

Latest SVN trunk (r168027) is very nice, and does not show the performance regression:

Performance is all the way up to 1977.31 kilo-iterations per second.

Comment 12 Jeff Garzik 2010-12-18 19:09:25 UTC

Any other patches for me to try, for gcc 4.5.1?

Comment 13 Steven Bosscher 2010-12-18 19:21:05 UTC

I'd like to wait for Honza's opinion before we just start trying random patches. 

But if you feel like trying some other things, perhaps you can see if backporting all changes of http://gcc.gnu.org/viewcvs?view=revision&revision=166517 helps.

Comment 14 H.J. Lu 2010-12-18 19:35:24 UTC

Created attachment 22813 [details]
A new patch

Try this.

Comment 15 H.J. Lu 2010-12-18 19:38:47 UTC

(In reply to comment #13)
> I'd like to wait for Honza's opinion before we just start trying random
> patches. 
> 
> But if you feel like trying some other things, perhaps you can see if
> backporting all changes of
> http://gcc.gnu.org/viewcvs?view=revision&revision=166517 helps.

This checkin depends on is_simple_builtin and is_inexpensive_builtin,
which are new in 4.6.

Comment 16 Jakub Jelinek 2010-12-18 20:26:29 UTC

I don't think it is a good idea to change inliner heuristics in 4.5 at this point.  If it is always a good idea to inline that function, it should be __attribute__((always_inline)).

Comment 17 Jeff Garzik 2010-12-18 21:15:28 UTC

(In reply to comment #8)
> -    if (decl && DECL_BUILT_IN_CLASS (decl) == BUILT_IN_MD)
> +    /* Do not special case builtins where we see the body.
> +       This just confuse inliner.  */
> +    if (!decl || cgraph_node (decl)->analyzed)
> +      ;
> +    else if (decl && DECL_BUILT_IN_CLASS (decl) == BUILT_IN_MD)
>        cost = weights->target_builtin_call_cost;
>      else
>        cost = weights->call_cost;

This patch successfully fixes the performance regression in 4.5.1.

Comment 18 Jeff Garzik 2010-12-18 21:16:31 UTC

argh, please ignore comment #17.  misquote.

Comment 19 Jeff Garzik 2010-12-18 21:17:09 UTC

(In reply to comment #14)
> Created attachment 22813 [details]
> A new patch
> 
> Try this.

This patch successfully fixes the performance regression in 4.5.1.

Thanks!

Comment 20 Jeff Garzik 2010-12-18 21:25:46 UTC

(In reply to comment #16)
> I don't think it is a good idea to change inliner heuristics in 4.5 at this
> point.  If it is always a good idea to inline that function, it should be
> __attribute__((always_inline)).

I confirm that replacing 'inline' with '__attribute__((always_inline))' also resolves the regression.

It is a bit disappointing to leave such a major performance diff (-26%!) in latest stable compiler release without resolution (if the decision is to leave the inliner alone).

Comment 21 Jan Hubicka 2010-12-19 11:49:53 UTC

> I'd like to wait for Honza's opinion before we just start trying random
> patches. 
Well, if H.J.'s proposed backport of the builtin cost sizes helps, I guess it is sane
way to fix this.  I will take a look why main inliner don't do the job when early inliner
ignores the call.

I am not sure how much of heuristics changes makes sense to backport to 4.5. Depends on importance
of the regression I guess.

Honza

Comment 22 Jan Hubicka 2010-12-19 11:53:32 UTC

  freq:  8000 size:  2 time:  2 D.13088_5480 = VIEW_CONVERT_EXPR<vector int>(D.8004_729);
  freq:  8000 size:  2 time:  2 D.13087_5481 = VIEW_CONVERT_EXPR<vector int>(a_5271);
  freq:  8000 size:  7 time: 16 D.13086_5482 = __builtin_ia32_paddd128 (D.13087_5481, D.13088_5480);

obviously we also should count V_C_E as free like other conversions.  Will test patch for that.

Comment 23 Jan Hubicka 2010-12-19 11:58:35 UTC

sha256_4way.c:287:78: warning: called from here
sha256_4way.c:50:23: warning: inlining failed in call to ‘ROTR’: --param inline-unit-growth limit reached

so you could also workaround with --param inline-unit-growth=<some sufficiently large number>.
Otherwise H.J.'s proposed backport seems like most sane way to solve the problem.  I guess it can be backported.


I am testing
Index: tree-inline.c
===================================================================
--- tree-inline.c       (revision 168047)
+++ tree-inline.c       (working copy)
@@ -3281,6 +3281,7 @@ estimate_operator_cost (enum tree_code c
     CASE_CONVERT:
     case COMPLEX_EXPR:
     case PAREN_EXPR:
+    case VIEW_CONVERT_EXPR:
       return 0;
 
     /* Assign cost of 1 to usual operations.

to solve the V_C_E problems.

Comment 24 Jakub Jelinek 2010-12-20 08:32:10 UTC

VCE is often very expensive though (often a memory store followed by memory load into a different register, etc.), so 0 unconditionally is IMHO wrong.
Perhaps for some TYPE_MODE combinations at most.

Comment 25 Jan Hubicka 2010-12-21 10:30:36 UTC

Author: hubicka
Date: Tue Dec 21 10:30:33 2010
New Revision: 168108

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=168108
Log:

	PR middle-end/47000
	* tree-inline.c (estimate_operator_cost): Handle VIEW_CONVERT_EXPR.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-inline.c

Comment 26 Jan Hubicka 2010-12-21 10:39:42 UTC

Hi,
I read the comment only after comiting the patch.  We generally believe conversions to be free even if this is not always the case. FP->int conversions tends to be expensive, too.  I don't think it is serious problem since the conversions tends to be dominated by real work elsewhere and there is good chance for conversions to combine and optimize when code is duplicated by inlining or peeling or so.

For non-registers V_C_Es are already counted as all non-register accesses are believed to be read/writes.  So all we get wrong are those int<->fp V_C_Es. I don't think they are terribly common and it is very target specific on how expensive they really are... SSE intrincics and SRA are nowdays both quite good source of V_C_Es that are cheap so I would guess that wast majority of them is cheap anyway.

Honza

Comment 27 Richard Biener 2010-12-28 14:57:37 UTC

(In reply to comment #24)
> VCE is often very expensive though (often a memory store followed by memory
> load into a different register, etc.), so 0 unconditionally is IMHO wrong.
> Perhaps for some TYPE_MODE combinations at most.

I think assuming VCE is zero-cost on the tree level makes sense though,
as they tend to get away usually (that is, when they appear in regular
code, not as a result of weird type punnings).

Comment 28 Richard Biener 2011-04-28 14:51:41 UTC

GCC 4.5.3 is being released, adjusting target milestone.

Comment 29 Richard Biener 2012-07-02 10:27:51 UTC

Fixed in 4.6.0, the 4.5 branch is being closed.