52908 – xop-mul-1:f9 miscompiled on bulldozer (-mxop)

Bug 52908 - xop-mul-1:f9 miscompiled on bulldozer (-mxop)

Summary: xop-mul-1:f9 miscompiled on bulldozer (-mxop)

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	4.8.0

Importance:	P3 normal
Target Milestone:	4.7.2
Assignee:	Venkataramanan

URL:	http://gcc.gnu.org/ml/gcc-patches/201...
Keywords:	wrong-code

Depends on:
Blocks:	53071
	Show dependency tree / graph

Reported:	2012-04-08 22:47 UTC by Michael Matz
Modified:	2024-03-17 02:33 UTC (History)
CC List:	1 user (show)

See Also:	52910
Host:
Target:	x86
Build:
Known to work:	4.7.1, 4.8.0
Known to fail:	4.7.0
Last reconfirmed:	2012-04-09 00:00:00

Attachments
Proposed patch (1.08 KB, patch) 2012-04-09 11:48 UTC, Uroš Bizjak	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Michael Matz 2012-04-08 22:47:53 UTC

FAIL: gcc.target/i386/xop-mul-1.c execution test

This is because f9 is miscompiled with -O3 -mxop (f13 is as well,
but this report is only about f9).  f9 looks like so:

__attribute__((noinline, noclone)) void
f9 (void)
{
  int i;
  for (i = 0; i < 512; ++i)
    e1[i] = (long long) c2[i] * (long long) c3[i];
}

This is compiled into this asm:

f9:
        movl    $0, %eax
.L29:
        vpshufd $216, c2(%rax), %xmm1
        vpshufd $216, c3(%rax), %xmm0
        vpxor   %xmm2, %xmm2, %xmm2
        vpmacsdql       %xmm2, %xmm0, %xmm1, %xmm2
        vmovdqa %xmm2, e1(%rax,%rax)
        vpxor   %xmm0, %xmm0, %xmm0         <<< overwrite xmm0
        vpmacsdqh       %xmm0, %xmm0, %xmm1, %xmm0
        vmovdqa %xmm0, e1+16(%rax,%rax)
        addq    $16, %rax
        cmpq    $2048, %rax
        jne     .L29
        rep
        ret

The marked instruction zeroing xmm0 before the second vpmacsdqh overwrites
xmm0 although it's still used as input.  That insn is generated by post-reload
splitting and is caused by a conflict between two pattern in sse.md:

(define_insn "*sse4_1_mulv2siv2di3"
  [(set (match_operand:V2DI 0 "register_operand" "=x,x")
        (mult:V2DI
          (sign_extend:V2DI
            (vec_select:V2SI
              (match_operand:V4SI 1 "nonimmediate_operand" "%0,x")
              (parallel [(const_int 0) (const_int 2)])))
          (sign_extend:V2DI
            (vec_select:V2SI
              (match_operand:V4SI 2 "nonimmediate_operand" "xm,xm")
              (parallel [(const_int 0) (const_int 2)])))))]
  "TARGET_SSE4_1 && ix86_binary_operator_ok (MULT, V4SImode, operands)"
...

(note how operand 0 is not marked early-clobber) and the next xop pattern:

;; We don't have a straight 32-bit parallel multiply and extend on XOP, so
;; fake it with a multiply/add.  In general, we expect the define_split to
;; occur before register allocation, so we have to handle the corner case where
;; the target is the same as either operands[1] or operands[2]
(define_insn_and_split "xop_mulv2div2di3_high"
  [(set (match_operand:V2DI 0 "register_operand" "=&x")
        (mult:V2DI
          (sign_extend:V2DI
            (vec_select:V2SI
              (match_operand:V4SI 1 "register_operand" "%x")
              (parallel [(const_int 0)
                         (const_int 2)])))
          (sign_extend:V2DI
            (vec_select:V2SI
              (match_operand:V4SI 2 "nonimmediate_operand" "xm")
              (parallel [(const_int 0)
                         (const_int 2)])))))]
  "TARGET_XOP"
  "#"
  "&& reload_completed"
  [(set (match_dup 0)
        (match_dup 3))
  ...

Note in particular how the comment about being careful about dest == op0/op1
is taken care of by the early-clobber of op0.

Now the problem is, the first (sse 4.1) pattern matches the insn in question:

(insn 31 27 32 3 (set (reg:V2DI 84)
        (mult:V2DI (sign_extend:V2DI (vec_select:V2SI (reg:V4SI 81)
                    (parallel [
                            (const_int 0 [0])
                            (const_int 2 [0x2])
                        ])))
            (sign_extend:V2DI (vec_select:V2SI (reg:V4SI 82)
                    (parallel [
                            (const_int 0 [0])
                            (const_int 2 [0x2])
                        ]))))) /matz/trunk/gcc/gcc/testsuite/gcc.target/i386/sse
2-mul-1.c:99 1484 {*sse4_1_mulv2siv2di3}
     (expr_list:REG_DEAD (reg:V4SI 82)
        (expr_list:REG_DEAD (reg:V4SI 81)
            (nil))))

(sse4_1_mulv2siv2di3 matched).  The reg allocator hence doesn't see any
early-clobber, allocates pseudo 84 and 81 to xmm0, and when the pattern
then is split (using the xop_mulv2div2di3_high splitter, used because
the sse4_1_mulv2siv2di3 pattern is no splitter) op1 is overwritten.

I've no idea how best to solve this, either deactivating the sse4.1 pattern
when TARGET_XOP, or rewriting the latter to not early-clobber result, or
something else entirely.  I've tested that the first way fixes this problem.

Comment 1 Uroš Bizjak 2012-04-09 08:16:26 UTC

(In reply to comment #0)

> I've no idea how best to solve this, either deactivating the sse4.1 pattern
> when TARGET_XOP, or rewriting the latter to not early-clobber result, or
> something else entirely.  I've tested that the first way fixes this problem.

We have attribute enabled for these case. I can take this bug, if you don't mind.

Comment 2 Uroš Bizjak 2012-04-09 11:48:05 UTC

Created attachment 27117 [details]
Proposed patch

There are indeed two problems with XOP patterns:

a) duplication of *sse4_1_mulv2siv2di3 pattern
b) wrong order of operands in all (!!!) XOP patterns. XOP patterns consider element 0 as MSB.

Attached patch solves this by simply removing fake xop_mulv2div2di3_{low,high} patterns, expanding to (fixed) xop_pmacsdq{h,l} patterns directly. There is simply no need to use vpmacsdql instead of vpmuldq. For consistency, the patch expands to xop_pmacsdql pattern, but gcc figures out that addition of 0 is unneeded and substitutes MAC insn with plain MUL.

Attached patch does not even try to fix other intrinsics. Someone familiar with AMD documentation should review all these, since the documentation (43479.pdf) is somehow inconsistent (i.e. the figure that explains VPMADCSSWD is inconsistent with the description).

Since I don't have XOP processor, I can only eyeball the asm, in this case:

        vpxor   %xmm3, %xmm3, %xmm3
        xorl    %eax, %eax
.L3:
        vpshufd $216, c2(%rax), %xmm1
        vpshufd $216, c3(%rax), %xmm0
        vpmuldq %xmm0, %xmm1, %xmm2
        vpmacsdqh       %xmm3, %xmm0, %xmm1, %xmm0
        vmovdqa %xmm2, e1(%rax,%rax)
        vmovdqa %xmm0, e1+16(%rax,%rax)
        addq    $16, %rax
        cmpq    $2048, %rax
        jne     .L3

Please also note hoisting of constant load.

Comment 3 Uroš Bizjak 2012-04-10 13:49:02 UTC

(In reply to comment #2)
> Created attachment 27117 [details]
> Proposed patch

Michael, can you please test this patch?

Comment 4 Venkataramanan 2012-05-04 10:42:43 UTC

Hi Uros,

I have not yet explored the patches.

A Quick make check on i386.exp result is shown below: 

sh compare_tests  ../../build_186236 ../../build_186236_patch/
# Comparing directories
## Dir1=../../build_186236: 7 sum files
## Dir2=../../build_186236_patch/: 7 sum files

# Comparing 7 common sum files
## /bin/sh compare_tests  /tmp/gxx-sum1.21222 /tmp/gxx-sum2.21222
Tests that now fail, but worked before:

gcc.target/i386/xop-imul32widen-vector.c scan-assembler vpmacsdql

Tests that now work, but didn't before:

gcc.target/i386/xop-mul-1.c execution test

## Differences found:
# 1 differences in 7 common sum files found

Comment 5 Uroš Bizjak 2012-05-04 12:51:49 UTC

(In reply to comment #4)

> A Quick make check on i386.exp result is shown below: 
> 
> Tests that now fail, but worked before:
> 
> gcc.target/i386/xop-imul32widen-vector.c scan-assembler vpmacsdql

This is expected, there is no need to emit vpmacsdql, IMO vpmuldq works as good (if not even better, since we don't have to preload accumulator with zero). The test should be either changed to really emit vpmacsdql, or asm scan should be adjusted.

> Tests that now work, but didn't before:
> 
> gcc.target/i386/xop-mul-1.c execution test

Yes this one should be fixed with the patch.

If the change vmpacsdql -> vpmuldq is OK with you, I can commit the patch, but I'd ask you for full bootstrap/regression test.

Comment 6 Venkataramanan 2012-05-09 03:13:01 UTC

(In reply to comment #5)
> (In reply to comment #4)
> > A Quick make check on i386.exp result is shown below: 
> > 
> > Tests that now fail, but worked before:
> > 
> > gcc.target/i386/xop-imul32widen-vector.c scan-assembler vpmacsdql
> This is expected, there is no need to emit vpmacsdql, IMO vpmuldq works as good
> (if not even better, since we don't have to preload accumulator with zero). The
> test should be either changed to really emit vpmacsdql, or asm scan should be
> adjusted.
> > Tests that now work, but didn't before:
> > 
> > gcc.target/i386/xop-mul-1.c execution test
> Yes this one should be fixed with the patch.
> If the change vmpacsdql -> vpmuldq is OK with you, I can commit the patch, but
> I'd ask you for full bootstrap/regression test.

Hi Uros, 

I did a complete boot strap test and Ok with the patch. We will adjust the test case xop-imul32widen-vector.c to scan for vmuldq.

Comment 7 uros 2012-05-09 20:41:23 UTC

Author: uros
Date: Wed May  9 20:41:08 2012
New Revision: 187354

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=187354
Log:
	PR target/52908
	* config/i386/sse.md (vec_widen_smult_hi_v4si): Expand using
	xop_pmacsdqh insn pattern instead of xop_mulv2div2di3_high.
	(vec_widen_smult_lo_v4si): Expand using xop_pmacsdql insn pattern
	instead of xop_mulv2div2di3_low.
	(xop_p<macs>dql): Fix vec_select selector.
	(xop_p<macs>dqh): Ditto.
	(xop_mulv2div2di3_low): Remove insn_and_split pattern.
	(xop_mulv2div2di3_high): Ditto.

testsuite/ChangeLog:

	PR target/52908
	* gcc.target/i386/xop-imul32widen-vector.c: Update scan-assembler
	directive to Scan for vpmuldq, not vpmacsdql.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/sse.md
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/xop-imul32widen-vector.c

Comment 8 Uroš Bizjak 2012-05-09 20:45:46 UTC

Fixed for 4.8, patch needs to be backported to release branches.

Comment 9 vekumar 2012-06-18 15:10:51 UTC

Author: vekumar
Date: Mon Jun 18 15:10:45 2012
New Revision: 188736

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=188736
Log:
Back port Fix PR 52908 - xop-mul-1:f9 miscompiled on bulldozer (-mxop) to 4.7 branch

Modified:
    branches/gcc-4_7-branch/gcc/ChangeLog
    branches/gcc-4_7-branch/gcc/config/i386/sse.md
    branches/gcc-4_7-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_7-branch/gcc/testsuite/gcc.target/i386/xop-imul32widen-vector.c

Comment 10 Sam James 2024-03-17 02:33:07 UTC

Long fixed, the backport landed in 4.7.2.