Bug 59511 - [4.9 Regression] FAIL: gcc.target/i386/pr36222-1.c scan-assembler-not movdqa with -mtune=corei7
Summary: [4.9 Regression] FAIL: gcc.target/i386/pr36222-1.c scan-assembler-not movdqa with -mtune=corei7
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 4.9.0
Importance: P1 normal
Target Milestone: 4.9.0
Assignee: Not yet assigned to anyone
URL:
Keywords: ra
Depends on:
Blocks:
 
Reported: 2013-12-15 12:11 UTC by Uroš Bizjak
Modified: 2016-06-02 17:37 UTC
CC List: 3 users

See Also:
Host:
Target: x86_64-pc-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2013-12-15 00:00:00


Attachments
extra-movdqa-with-gcc5-not-4.9.cpp (1.76 KB, text/plain)
2016-06-02 17:36 UTC, Peter Cordes

Description Uroš Bizjak 2013-12-15 12:11:27 UTC
Hello!

The gcc.target/i386/pr36222-1.c testcase compiles for x86_64-linux-gnu with "-O2 -mno-sse3 -mtune=corei7" to:

_mm_set_epi32:
        movd    %ecx, %xmm1
        movd    %edx, %xmm4
        movd    %esi, %xmm0
        movd    %edi, %xmm3
        punpckldq       %xmm4, %xmm1
        movdqa  %xmm1, %xmm2
        punpckldq       %xmm3, %xmm0
        punpcklqdq      %xmm0, %xmm2
        movdqa  %xmm2, %xmm0
        ret

However, the 4.8 branch compiles it to:

_mm_set_epi32:
        movd    %esi, %xmm1
        movd    %edi, %xmm2
        movd    %ecx, %xmm0
        movd    %edx, %xmm3
        punpckldq       %xmm2, %xmm1
        punpckldq       %xmm3, %xmm0
        punpcklqdq      %xmm1, %xmm0
        ret
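
For reference, a minimal reduction along these lines (not the actual testcase source, which lives in gcc/testsuite/gcc.target/i386/pr36222-1.c; the function name here is made up) exercises the same build-a-4x-int-vector-from-GPRs pattern that the assembler output above comes from, and the test's scan-assembler-not check expects it to need no movdqa:

/* Hypothetical reduction of the pr36222-1.c pattern; compile with
   -O2 -mno-sse3 -mtune=corei7.  Ideally the vector is assembled with
   four movd plus punpckldq/punpcklqdq and no register copies.  */
#include <emmintrin.h>

__m128i
build_v4si (int e3, int e2, int e1, int e0)
{
  return _mm_set_epi32 (e3, e2, e1, e0);
}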
Comment 1 Uroš Bizjak 2013-12-15 12:12:46 UTC
Confirmed as an RA regression.
Comment 2 Jakub Jelinek 2013-12-16 11:44:20 UTC
One movdqa started appearing with r204212, and the second with r204752.  Vlad, can you please have a look?
Comment 3 Vladimir Makarov 2013-12-17 16:37:36 UTC
(In reply to Jakub Jelinek from comment #2)
> One movdqa started appearing with r204212, and the second with r204752.
> Vlad, can you please have a look?

It seems the changes triggered a bug in the register move cost calculations.  I have a patch to fix it, but I need more time to check its effect on performance.  So the fix should be ready at the end of the week if everything is OK.
Comment 4 Vladimir Makarov 2014-01-15 17:33:18 UTC
Author: vmakarov
Date: Wed Jan 15 17:32:47 2014
New Revision: 206636

URL: http://gcc.gnu.org/viewcvs?rev=206636&root=gcc&view=rev
Log:
2014-01-15  Vladimir Makarov  <vmakarov@redhat.com>

	PR rtl-optimization/59511
	* ira.c (ira_init_register_move_cost): Use memory costs for some
	cases of register move cost calculations.
	* lra-constraints.c (lra_constraints): Use REG_FREQ_FROM_BB
	instead of BB frequency.
	* lra-coalesce.c (move_freq_compare_func, lra_coalesce): Ditto.
	* lra-assigns.c (find_hard_regno_for): Ditto.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ira.c
    trunk/gcc/lra-assigns.c
    trunk/gcc/lra-coalesce.c
    trunk/gcc/lra-constraints.c
Comment 5 Jakub Jelinek 2014-01-15 17:36:48 UTC
Fixed, thanks.
Comment 6 Peter Cordes 2016-06-02 17:36:23 UTC
Created attachment 38629 [details]
extra-movdqa-with-gcc5-not-4.9.cpp
Comment 7 Peter Cordes 2016-06-02 17:37:39 UTC
I'm seeing the same symptom, affecting gcc 4.9 through 5.3.  It is not present in 6.1.

I don't know if the cause is the same.

(code from an improvement to the horizontal_add functions in Agner Fog's vector class library)

#include <immintrin.h>

int hsum16_gccmovdqa (__m128i const a) {
    __m128i lo   = _mm_cvtepi16_epi32(a);         // sign-extended a0, a1, a2, a3
    __m128i hi   = _mm_unpackhi_epi64(a, a);      // gcc 4.9 through 5.3 wastes a movdqa on this
            hi   = _mm_cvtepi16_epi32(hi);
    __m128i sum1 = _mm_add_epi32(lo, hi);         // add sign-extended upper / lower halves
    // return horizontal_add(sum1);  // manually inlined.
    // Shortening the code below can avoid the movdqa.
    __m128i shuf = _mm_shuffle_epi32(sum1, 0xEE);
    __m128i sum2 = _mm_add_epi32(shuf, sum1);     // 2 sums
            shuf = _mm_shufflelo_epi16(sum2, 0xEE);
    __m128i sum4 = _mm_add_epi32(shuf, sum2);
    return          _mm_cvtsi128_si32(sum4);      // 32-bit sum
}

gcc 4.9 through gcc 5.3 output (-O3 -mtune=generic -msse4.1):

        movdqa  %xmm0, %xmm1
        pmovsxwd        %xmm0, %xmm2
        punpckhqdq      %xmm0, %xmm1
        pmovsxwd        %xmm1, %xmm0
        paddd   %xmm2, %xmm0
        ...

gcc6.1 output:

        pmovsxwd        %xmm0, %xmm1
        punpckhqdq      %xmm0, %xmm0
        pmovsxwd        %xmm0, %xmm0
        paddd   %xmm0, %xmm1
        ...
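
The two listings correspond to just the first four intrinsics: gcc 4.9 through 5.3 copy %xmm0 to %xmm1 before the destructive punpckhqdq instead of reusing %xmm0 once the input is dead, the way gcc 6.1 does.  A cut-down sketch isolating that part (hypothetical function name, not taken from the attachment, and I haven't re-checked whether it shows the extra copy in isolation):

/* Hypothetical reduction: just widen the two halves and add them.
   Compile with -O3 -msse4.1 -mtune=generic.  */
#include <immintrin.h>

__m128i hsum16_head (__m128i a)
{
    __m128i lo = _mm_cvtepi16_epi32(a);      // pmovsxwd of the low half
    __m128i hi = _mm_unpackhi_epi64(a, a);   // punpckhqdq, last use of a
            hi = _mm_cvtepi16_epi32(hi);
    return  _mm_add_epi32(lo, hi);           // paddd
}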

In a more complicated case, depending on whether this code gets inlined or not, there's actually a difference between gcc 4.9 and 5.x: gcc 5 has the extra movdqa in more cases.  See my attachment, copied from https://godbolt.org/g/e8iQsj