22432 – [4.0/4.1 Regression] Wrong code generation using MMX intrinsics on amd64

Bug 22432 - [4.0/4.1 Regression] Wrong code generation using MMX intrinsics on amd64

Summary: [4.0/4.1 Regression] Wrong code generation using MMX intrinsics on amd64

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	4.0.1

Importance:	P1 normal
Target Milestone:	4.0.3
Assignee:	Not yet assigned to anyone

URL:
Keywords:	patch, ssemmx, wrong-code

Depends on:
Blocks:

Reported:	2005-07-12 09:27 UTC by Lars Knoll
Modified:	2005-11-06 06:41 UTC (History)
CC List:	4 users (show)

See Also:
Host:
Target:	x86_64-linux-gnu
Build:
Known to work:	3.4.5 3.3.6
Known to fail:	4.1.0 4.0.3
Last reconfirmed:	2005-08-04 22:49:31

Attachments
Proposed patch (659 bytes, patch) 2005-11-05 01:00 UTC, Ian Lance Taylor	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Lars Knoll 2005-07-12 09:27:45 UTC

The following example generates wrong code when compiled with -O2 on amd64. It 
doesn't seem to happen when compiling in 32bit mode though. 
 
#include <mmintrin.h> 
#include <assert.h> 
 
typedef unsigned int CARD32; 
 
static void 
mmxCombineAddU (CARD32 *dest, const CARD32 *src, int width) 
{ 
    const __m64 mmx_0 = _mm_setzero_si64(); 
 
    const CARD32 *end = dest + width; 
    while (dest < end) { 
        __m64 s, d; 
        s = _mm_unpacklo_pi8 (_mm_cvtsi32_si64(*src), mmx_0); 
        d = _mm_unpacklo_pi8 (_mm_cvtsi32_si64(*dest), mmx_0); 
        s = _mm_add_pi16(s, d); 
        *dest = (CARD32)_mm_cvtsi64_si32(_mm_packs_pu16(s, mmx_0)); 
        ++dest; 
        ++src; 
    } 
    _mm_empty(); 
} 
 
int main() 
{ 
    CARD32 a = 0xffffffff; 
    CARD32 b = 0x10101010; 
 
    mmxCombineAddU(&a,  &b, 1); 
    assert(a == 0xffffffff); 
    return 0; 
} 
 
The compiled assembler for the mmx instructions looks like this: 
 
  400555:       0f 6e 00                movd   (%rax),%mm0 
  400558:       0f 60 ca                punpcklbw %mm2,%mm1 
  40055b:       0f 60 c2                punpcklbw %mm2,%mm0 
  40055e:       0f fc c1                paddb  %mm1,%mm0 
  400561:       0f 67 c3                packuswb %mm3,%mm0 
  400564:       0f 7e 00                movd   %mm0,(%rax) 
 
It should use paddw instead of paddb here. 
 
best regards, 
Lars

Comment 1 Andrew Pinski 2005-08-04 22:49:30 UTC

Confirmed.

Comment 2 Uroš Bizjak 2005-08-17 09:54:31 UTC

The code produced by crosscompiling from i686 for x86_86, I got functionally 
equal asm for 32bit and 64bit mode:

gcc -O2 -m64:

.LFB128:
	subq	$24, %rsp
.LCFI0:
	leaq	20(%rsp), %rax
	leaq	24(%rsp), %rdx
	cmpq	%rax, %rdx
	jbe	.L7
	movq	.LC5(%rip), %mm0
	pxor	%mm1, %mm1
	packuswb	%mm1, %mm0
	movq	%mm0, (%rsp)
	movq	(%rsp), %rax
	incl	%eax
	jne	.L4
	emms
...

gcc -m32 -mmmx:

	pushl	%ebp
	movl	%esp, %ebp
	subl	$40, %esp
	leal	-4(%ebp), %eax
	andl	$-16, %esp
	subl	$16, %esp
	cmpl	%eax, %ebp
	jbe	.L7
	movq	.LC6, %mm0
	pxor	%mm1, %mm1
	packuswb	%mm1, %mm0
	movd	%mm0, -20(%ebp)
	movl	-20(%ebp), %eax
	incl	%eax
	jne	.L4
	emms
...

Tested with gcc version 4.1.0 20050716 (experimental)

Comment 3 Andrew Pinski 2005-10-27 14:44:55 UTC

A regression from 3.4.x and 3.3.x.

Comment 4 Pawel Sikora 2005-10-27 16:00:00 UTC

With my patched gcc-4.1.0-20051019 the testcase works.
Applied patches: PR7776, PR20297, PR22429, PR22533, PR23948,
PR19505, PR20606, PR24069, PR24419, PR24172, PR24295, PR20928

[builder2@estel BUILD]$ gcc -Wall -O2 pr22432.c -S -march=x86-64
[builder2@estel BUILD]$ cat pr22432.s
        .file   "pr22432.c"
        .section        .rodata.str1.1,"aMS",@progbits,1
.LC0:   .string "pr22432.c"
.LC1:   .string "a == 0xffffffff"
        .section        .rodata.cst8,"aM",@progbits,8
        .align 8
.LC2:   .long   269488144
        .long   0
        .align 8
.LC3:   .long   -1
        .long   0
        .text
        .p2align 4,,15
.globl main
        .type   main, @function
main:   pushl   %ebp
        movl    %esp, %ebp
        subl    $56, %esp
        leal    -4(%ebp), %eax
        andl    $-16, %esp
        subl    $16, %esp
        cmpl    %ebp, %eax
        jae     .L9
        movl    $0, -24(%ebp)
        movl    $0, -20(%ebp)
        movq    -24(%ebp), %mm2
        movq    .LC2, %mm0
        movq    .LC3, %mm1
        punpcklbw       %mm2, %mm0
        punpcklbw       %mm2, %mm1
        paddw   %mm1, %mm0
        packuswb        %mm2, %mm0
        movd    %mm0, -28(%ebp)
        emms
        cmpl    $-1, -28(%ebp)
        jne     .L10
        leave
        xorl    %eax, %eax
        ret
.L9:    emms
        leave
        xorl    %eax, %eax
        ret
.L10:   movl    $__PRETTY_FUNCTION__.2544, 12(%esp)
        movl    $30, 8(%esp)
        movl    $.LC0, 4(%esp)
        movl    $.LC1, (%esp)
        call    __assert_fail
        .size   main, .-main
        .section        .rodata
        .type   __PRETTY_FUNCTION__.2544, @object
        .size   __PRETTY_FUNCTION__.2544, 5
__PRETTY_FUNCTION__.2544:
        .string "main"
        .ident  "GCC: (GNU) 4.1.0 20051019 (experimental)"
        .section        .note.GNU-stack,"",@progbits

[builder2@estel BUILD]$ gcc -Wall -O2 pr22432.c -save-temps -march=x86-64
[builder2@estel BUILD]$ ./a.out ; echo $?
0

Comment 5 Mark Mitchell 2005-10-31 04:04:07 UTC

This is a showstopper; wrong code on a primary platform.

Comment 6 Steven Bosscher 2005-11-04 23:17:17 UTC

In the .life1 dump we have:

(insn 41 38 42 2 (set (reg:V2SI 79 [ D.2609 ])
        (subreg:V2SI (reg:V8QI 76) 0)) 998 {*movv2si_internal_rex64 
    (insn_list:REG_DEP_TRUE 32 (nil))
        (expr_list:REG_DEAD (reg:V8QI 76)
            (nil)))

(insn 42 41 43 2 (set (reg:V2SI 80 [ D.2619 ])
        (subreg:V2SI (reg:V8QI 78) 0)) 998 {*movv2si_internal_rex64} 
    (insn_list:REG_DEP_TRUE 38 (nil))
        (expr_list:REG_DEAD (reg:V8QI 78)
            (nil)))

(insn 43 42 46 2 (set (reg:V4HI 81)
        (plus:V4HI (subreg:V4HI (reg:V2SI 79 [ D.2609 ]) 0)
            (subreg:V4HI (reg:V2SI 80 [ D.2619 ]) 0))) 1031 {mmx_addv4hi3} 
    (insn_list:REG_DEP_TRUE 41 (insn_list:REG_DEP_TRUE 42 (nil)))
        (expr_list:REG_DEAD (reg:V2SI 79 [ D.2609 ])
            (expr_list:REG_DEAD (reg:V2SI 80 [ D.2619 ])
                (nil))))



After combine we have:
(note 41 38 42 2 NOTE_INSN_DELETED)

(note 42 41 43 2 NOTE_INSN_DELETED)

(insn 43 42 46 2 (set (reg:V8QI 81)
        (plus:V8QI (reg:V8QI 76)
            (reg:V8QI 78))) 1030 {mmx_addv8qi3}
    (insn_list:REG_DEP_TRUE 38 (insn_list:REG_DEP_TRUE 32 (nil)))
        (expr_list:REG_DEAD (reg:V8QI 78)
            (expr_list:REG_DEAD (reg:V8QI 76)
                (nil))))

Is that really the same???

Comment 7 Steven Bosscher 2005-11-04 23:26:09 UTC

According to rth on IRC this is indeed absolutely wrong.

So it looks like we may have a combine bug here.

Comment 8 Ian Lance Taylor 2005-11-04 23:27:21 UTC

No, doing the add in v4hi mode is not the same as doing the add in v8qi mode.  The carry bits will be handled differently.

It's also rather odd that register 81 changed from V4HImode to V8QImode.  Normally a pseudo-register always has the same mode, and accessing it in a different mode requires a subreg.

Comment 9 Ian Lance Taylor 2005-11-05 01:00:00 UTC

Created attachment 10151 [details]
Proposed patch

This patch fixes the problem.  I'm running tests now.

Comment 10 ian@gcc.gnu.org 2005-11-06 05:34:43 UTC

Subject: Bug 22432

Author: ian
Date: Sun Nov  6 05:34:38 2005
New Revision: 106555

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=106555
Log:
./:
	PR target/22432
	* combine.c (apply_distributive_law): Don't distribute across a
	vector mode subreg.
testsuite/:
	PR target/22432
	* gcc.target/i386/pr22432.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr22432.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/combine.c
    trunk/gcc/testsuite/ChangeLog

Comment 11 ian@gcc.gnu.org 2005-11-06 06:38:17 UTC

Subject: Bug 22432

Author: ian
Date: Sun Nov  6 06:38:14 2005
New Revision: 106556

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=106556
Log:
./:
	PR target/22432
	* combine.c (apply_distributive_law): Don't distribute across a
	vector mode subreg.
testsuite/:
2005-11-05  Ian Lance Taylor  <ian@airs.com>

	PR target/22432
	* gcc.target/i386/pr22432.c: New test.

Added:
    branches/gcc-4_0-branch/gcc/testsuite/gcc.target/i386/pr22432.c
Modified:
    branches/gcc-4_0-branch/gcc/ChangeLog
    branches/gcc-4_0-branch/gcc/combine.c
    branches/gcc-4_0-branch/gcc/testsuite/ChangeLog

Comment 12 Ian Lance Taylor 2005-11-06 06:41:46 UTC

Fixed for 4.0.3 and 4.1.