Bug 59511 - [4.9 Regression] FAIL: gcc.target/i386/pr36222-1.c scan-assembler-not movdqa with -mtune=corei7
Summary: [4.9 Regression] FAIL: gcc.target/i386/pr36222-1.c scan-assembler-not movdqa with -mtune=corei7
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 4.9.0
Importance: P1 normal
Target Milestone: 4.9.0
Assignee: Not yet assigned to anyone
URL:
Keywords: ra
Depends on:
Blocks:
 
Reported: 2013-12-15 12:11 UTC by Uroš Bizjak
Modified: 2016-06-02 17:37 UTC
CC List: 3 users

See Also:
Host:
Target: x86_64-pc-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2013-12-15 00:00:00


Attachments
extra-movdqa-with-gcc5-not-4.9.cpp (1.76 KB, text/plain)
2016-06-02 17:36 UTC, Peter Cordes

Description Uroš Bizjak 2013-12-15 12:11:27 UTC
Hello!

The gcc.target/i386/pr36222-1.c testcase compiles for x86_64-linux-gnu with "-O2 -mno-sse3 -mtune=corei7" to:

_mm_set_epi32:
        movd    %ecx, %xmm1
        movd    %edx, %xmm4
        movd    %esi, %xmm0
        movd    %edi, %xmm3
        punpckldq       %xmm4, %xmm1
        movdqa  %xmm1, %xmm2
        punpckldq       %xmm3, %xmm0
        punpcklqdq      %xmm0, %xmm2
        movdqa  %xmm2, %xmm0
        ret

However, the 4.8 branch compiles it to:

_mm_set_epi32:
        movd    %esi, %xmm1
        movd    %edi, %xmm2
        movd    %ecx, %xmm0
        movd    %edx, %xmm3
        punpckldq       %xmm2, %xmm1
        punpckldq       %xmm3, %xmm0
        punpcklqdq      %xmm1, %xmm0
        ret
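
For reference, a minimal reduction along these lines (not the actual testcase source, which lives in gcc/testsuite/gcc.target/i386/pr36222-1.c; the function name here is made up) exercises the same build-a-4x-int-vector-from-GPRs pattern that the assembler output above comes from, and the test's scan-assembler-not check expects it to need no movdqa:

/* Hypothetical reduction of the pr36222-1.c pattern; compile with
   -O2 -mno-sse3 -mtune=corei7.  Ideally the vector is assembled with
   four movd plus punpckldq/punpcklqdq and no register copies.  */
#include <emmintrin.h>

__m128i
build_v4si (int e3, int e2, int e1, int e0)
{
  return _mm_set_epi32 (e3, e2, e1, e0);
}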
Comment 1 Uroš Bizjak 2013-12-15 12:12:46 UTC
Confirmed as an RA regression.
Comment 2 Jakub Jelinek 2013-12-16 11:44:20 UTC
One movdqa started appearing with r204212, and the second with r204752.  Vlad, can you please have a look?
Comment 3 Vladimir Makarov 2013-12-17 16:37:36 UTC
(In reply to Jakub Jelinek from comment #2)
> One movdqa started appearing with r204212, and the second with r204752.
> Vlad, can you please have a look?

It seems the changes triggered a bug in the register move cost calculations.  I have a patch to fix it, but I need more time to check its effect on performance.  So the fix should be ready at the end of the week if everything is OK.
Comment 4 Vladimir Makarov 2014-01-15 17:33:18 UTC
Author: vmakarov
Date: Wed Jan 15 17:32:47 2014
New Revision: 206636

URL: http://gcc.gnu.org/viewcvs?rev=206636&root=gcc&view=rev
Log:
2014-01-15  Vladimir Makarov  <vmakarov@redhat.com>

	PR rtl-optimization/59511
	* ira.c (ira_init_register_move_cost): Use memory costs for some
	cases of register move cost calculations.
	* lra-constraints.c (lra_constraints): Use REG_FREQ_FROM_BB
	instead of BB frequency.
	* lra-coalesce.c (move_freq_compare_func, lra_coalesce): Ditto.
	* lra-assigns.c (find_hard_regno_for): Ditto.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ira.c
    trunk/gcc/lra-assigns.c
    trunk/gcc/lra-coalesce.c
    trunk/gcc/lra-constraints.c
Comment 5 Jakub Jelinek 2014-01-15 17:36:48 UTC
Fixed, thanks.
Comment 6 Peter Cordes 2016-06-02 17:36:23 UTC
Created attachment 38629 [details]
extra-movdqa-with-gcc5-not-4.9.cpp
Comment 7 Peter Cordes 2016-06-02 17:37:39 UTC
I'm seeing the same symptom, affecting gcc 4.9 through 5.3.  It is not present in 6.1.

I don't know if the cause is the same.

(code from an improvement to the horizontal_add functions in Agner Fog's vector class library)

#include <immintrin.h>

int hsum16_gccmovdqa (__m128i const a) {
    __m128i lo   = _mm_cvtepi16_epi32(a);         // sign-extended a0, a1, a2, a3
    __m128i hi   = _mm_unpackhi_epi64(a, a);      // gcc 4.9 through 5.3 wastes a movdqa on this
            hi   = _mm_cvtepi16_epi32(hi);
    __m128i sum1 = _mm_add_epi32(lo, hi);         // add sign-extended upper / lower halves
    // return horizontal_add(sum1);  // manually inlined.
    // Shortening the code below can avoid the movdqa.
    __m128i shuf = _mm_shuffle_epi32(sum1, 0xEE);
    __m128i sum2 = _mm_add_epi32(shuf, sum1);     // 2 sums
            shuf = _mm_shufflelo_epi16(sum2, 0xEE);
    __m128i sum4 = _mm_add_epi32(shuf, sum2);
    return          _mm_cvtsi128_si32(sum4);      // 32-bit sum
}

gcc 4.9 through gcc 5.3 output (-O3 -mtune=generic -msse4.1):

        movdqa  %xmm0, %xmm1
        pmovsxwd        %xmm0, %xmm2
        punpckhqdq      %xmm0, %xmm1
        pmovsxwd        %xmm1, %xmm0
        paddd   %xmm2, %xmm0
        ...

gcc6.1 output:

        pmovsxwd        %xmm0, %xmm1
        punpckhqdq      %xmm0, %xmm0
        pmovsxwd        %xmm0, %xmm0
        paddd   %xmm0, %xmm1
        ...
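
The two listings correspond to just the first four intrinsics: gcc 4.9 through 5.3 copy %xmm0 to %xmm1 before the destructive punpckhqdq instead of reusing %xmm0 once the input is dead, the way gcc 6.1 does.  A cut-down sketch isolating that part (hypothetical function name, not taken from the attachment, and I haven't re-checked whether it shows the extra copy in isolation):

/* Hypothetical reduction: just widen the two halves and add them.
   Compile with -O3 -msse4.1 -mtune=generic.  */
#include <immintrin.h>

__m128i hsum16_head (__m128i a)
{
    __m128i lo = _mm_cvtepi16_epi32(a);      // pmovsxwd of the low half
    __m128i hi = _mm_unpackhi_epi64(a, a);   // punpckhqdq, last use of a
            hi = _mm_cvtepi16_epi32(hi);
    return  _mm_add_epi32(lo, hi);           // paddd
}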

In a more complicated case, depending on whether this code gets inlined or not, there's actually a difference between gcc 4.9 and 5.x: gcc 5 has the extra movdqa in more cases.  See my attachment, copied from https://godbolt.org/g/e8iQsj