This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

[Bug target/52910] New: xop-mul-1:f13 miscompiled on bulldozer (-mxop)

From: "matz at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: Sun, 08 Apr 2012 23:21:57 +0000
Subject: [Bug target/52910] New: xop-mul-1:f13 miscompiled on bulldozer (-mxop)
Auto-submitted: auto-generated

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52910

             Bug #: 52910
           Summary: xop-mul-1:f13 miscompiled on bulldozer (-mxop)
    Classification: Unclassified
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: matz@gcc.gnu.org


Once PR52908 (miscompile of f9 in the same testcase) would be fixed by
the method of deactivating the sse4.1 pattern with TARGET_XOP this one
will be exposed; f13 is miscompiled too.  The function is a straight
mul-add-reduce:

__attribute__((noinline, noclone)) long long
f13 (void)
{
  int i;
  long long r = 0;
  for (i = 0; i < 512; ++i)
    r += (long long) c2[i] * (long long) c3[i];
  return r;
}

This is compiled into (with -O3 -mxop):

f13:
        movl    $0, %eax
        vpxor   %xmm2, %xmm2, %xmm2
.L41:
        vmovdqa c2(%rax), %xmm1
        vmovdqa c3(%rax), %xmm0
        vpmacsdqh       %xmm2, %xmm0, %xmm1, %xmm2
        vpsrldq $4, %xmm1, %xmm1
        vpsrldq $4, %xmm0, %xmm0
        vpmacsdqh       %xmm2, %xmm0, %xmm1, %xmm2
        addq    $16, %rax
        cmpq    $2048, %rax
        jne     .L41
        vpsrldq $8, %xmm2, %xmm0
        vpaddq  %xmm2, %xmm0, %xmm2
        vpextrq $0, %xmm2, %rax
        ret

I think the problem is confusion between how the vector components are numbered
and a mismatch between the sse/avx patterns and the xop patterns.  What
the above loop clearly tries to do when looking at the patterns is to
mult-acc components X and X+2 in the first vpmacsdqh and then components
X+1 and X+3 in the second one.  For that it uses a right bit shift by 32 bits
(the input component size).  For reference here the rough patterns in the
last RTL dump for the two mult-accs and the shifts:

(insn 13 12 14 3 (set (reg:V2DI 23 xmm2 [75])
        (plus:V2DI (mult:V2DI (sign_extend:V2DI (vec_select:V2SI (reg:V4SI 22
xm
m1 [73])
                        (parallel [
                                (const_int 0 [0])
                                (const_int 2 [0x2])
                            ])))
                (sign_extend:V2DI (vec_select:V2SI (reg:V4SI 21 xmm0 [74])
                        (parallel [
                                (const_int 0 [0])
                                (const_int 2 [0x2])
                            ]))))
            (reg:V2DI 23 xmm2 [orig:65 vect_var_.370 ] [65]))) 1794
{xop_pmacsdq
h}
     (nil))

(insn 14 13 15 3 (set (reg:V1TI 22 xmm1 [76])
        (lshiftrt:V1TI (reg:V1TI 22 xmm1 [73])
            (const_int 32 [0x20]))) 1511 {sse2_lshrv1ti3}
     (nil))

(insn 15 14 16 3 (set (reg:V1TI 21 xmm0 [77])
        (lshiftrt:V1TI (reg:V1TI 21 xmm0 [74])
            (const_int 32 [0x20]))) 1511 {sse2_lshrv1ti3}
     (nil))

(insn 17 16 18 3 (set (reg:V2DI 23 xmm2 [orig:65 vect_var_.370 ] [65])
        (plus:V2DI (mult:V2DI (sign_extend:V2DI (vec_select:V2SI (reg:V4SI 22
xm
m1 [76])
                        (parallel [
                                (const_int 0 [0])
                                (const_int 2 [0x2])
                            ])))
                (sign_extend:V2DI (vec_select:V2SI (reg:V4SI 21 xmm0 [77])
                        (parallel [
                                (const_int 0 [0])
                                (const_int 2 [0x2])
                            ]))))
            (reg:V2DI 23 xmm2 [75]))) 1794 {xop_pmacsdqh}
     (expr_list:REG_DEAD (reg:V4SI 22 xmm1 [76])
        (expr_list:REG_DEAD (reg:V4SI 21 xmm0 [77])
            (nil))))

So, for this to do the right thing, the mult-acc patterns idea of component
0 and 2 must be so that components 1 and 3 are transformed into 0 and 2 by
a right shift of 32 bits.

Tracing the whole thing in gdb reveals some miscommunication in that idea.
The two input values at start of loop:

% $xmm0.v4_int32 {1215505350, 1491885311, -676627251, -515498245}
% $xmm1.v4_int32 {-1737931417, -1807033263, 1488592681, -1444724238}
% $xmm2.v2_int64 {0, 0}

The inputs from the arrays:

% c2[0]@4 {-1737931417, -1807033263, 1488592681, -1444724238}
% c3[0]@4 {1215505350, 1491885311, -676627251, -515498245}

Now after the first mult-acc:

% $xmm2.v2_int64 {-2695886381558099793, 744752809197962310}

That is, it multiplied components 1 and 3, not 0 and 2 (that is,
c2[1]*c3[1] and c2[3]*c3[3]).  Now the two right shifts come and we have:

% $xmm0.v4_int32 {1491885311, -676627251, -515498245, 0}
% $xmm1.v4_int32 {-1807033263, 1488592681, -1444724238, 0}

Doing again the mult-acc (using component 1 and 3) now leads to wrong results.
The original component 0 is shifted out already, and component 3 is the zero
that was shifted in.

>From the description of the xop instruction (that talk about second and fourth
component for vpmacsdqh) it seems that the vpmacsdqh pattern should refer
to components 1 and 3 (and hence not be used in this case when the shift stay).

Actually with xop the shifts aren't necessary and both {0,2} and {1,3}
mult-accs could be done with vpmacsdql and vpmacsdqh.

Follow-Ups:
- [Bug target/52910] xop-mul-1:f13 miscompiled on bulldozer (-mxop)
  - From: agner at agner dot org

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]