This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug target/52910] New: xop-mul-1:f13 miscompiled on bulldozer (-mxop)
- From: "matz at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Sun, 08 Apr 2012 23:21:57 +0000
- Subject: [Bug target/52910] New: xop-mul-1:f13 miscompiled on bulldozer (-mxop)
- Auto-submitted: auto-generated
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52910
Bug #: 52910
Summary: xop-mul-1:f13 miscompiled on bulldozer (-mxop)
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned@gcc.gnu.org
ReportedBy: matz@gcc.gnu.org
Once PR52908 (miscompile of f9 in the same testcase) would be fixed by
the method of deactivating the sse4.1 pattern with TARGET_XOP this one
will be exposed; f13 is miscompiled too. The function is a straight
mul-add-reduce:
__attribute__((noinline, noclone)) long long
f13 (void)
{
int i;
long long r = 0;
for (i = 0; i < 512; ++i)
r += (long long) c2[i] * (long long) c3[i];
return r;
}
This is compiled into (with -O3 -mxop):
f13:
movl $0, %eax
vpxor %xmm2, %xmm2, %xmm2
.L41:
vmovdqa c2(%rax), %xmm1
vmovdqa c3(%rax), %xmm0
vpmacsdqh %xmm2, %xmm0, %xmm1, %xmm2
vpsrldq $4, %xmm1, %xmm1
vpsrldq $4, %xmm0, %xmm0
vpmacsdqh %xmm2, %xmm0, %xmm1, %xmm2
addq $16, %rax
cmpq $2048, %rax
jne .L41
vpsrldq $8, %xmm2, %xmm0
vpaddq %xmm2, %xmm0, %xmm2
vpextrq $0, %xmm2, %rax
ret
I think the problem is confusion between how the vector components are numbered
and a mismatch between the sse/avx patterns and the xop patterns. What
the above loop clearly tries to do when looking at the patterns is to
mult-acc components X and X+2 in the first vpmacsdqh and then components
X+1 and X+3 in the second one. For that it uses a right bit shift by 32 bits
(the input component size). For reference here the rough patterns in the
last RTL dump for the two mult-accs and the shifts:
(insn 13 12 14 3 (set (reg:V2DI 23 xmm2 [75])
(plus:V2DI (mult:V2DI (sign_extend:V2DI (vec_select:V2SI (reg:V4SI 22
xm
m1 [73])
(parallel [
(const_int 0 [0])
(const_int 2 [0x2])
])))
(sign_extend:V2DI (vec_select:V2SI (reg:V4SI 21 xmm0 [74])
(parallel [
(const_int 0 [0])
(const_int 2 [0x2])
]))))
(reg:V2DI 23 xmm2 [orig:65 vect_var_.370 ] [65]))) 1794
{xop_pmacsdq
h}
(nil))
(insn 14 13 15 3 (set (reg:V1TI 22 xmm1 [76])
(lshiftrt:V1TI (reg:V1TI 22 xmm1 [73])
(const_int 32 [0x20]))) 1511 {sse2_lshrv1ti3}
(nil))
(insn 15 14 16 3 (set (reg:V1TI 21 xmm0 [77])
(lshiftrt:V1TI (reg:V1TI 21 xmm0 [74])
(const_int 32 [0x20]))) 1511 {sse2_lshrv1ti3}
(nil))
(insn 17 16 18 3 (set (reg:V2DI 23 xmm2 [orig:65 vect_var_.370 ] [65])
(plus:V2DI (mult:V2DI (sign_extend:V2DI (vec_select:V2SI (reg:V4SI 22
xm
m1 [76])
(parallel [
(const_int 0 [0])
(const_int 2 [0x2])
])))
(sign_extend:V2DI (vec_select:V2SI (reg:V4SI 21 xmm0 [77])
(parallel [
(const_int 0 [0])
(const_int 2 [0x2])
]))))
(reg:V2DI 23 xmm2 [75]))) 1794 {xop_pmacsdqh}
(expr_list:REG_DEAD (reg:V4SI 22 xmm1 [76])
(expr_list:REG_DEAD (reg:V4SI 21 xmm0 [77])
(nil))))
So, for this to do the right thing, the mult-acc patterns idea of component
0 and 2 must be so that components 1 and 3 are transformed into 0 and 2 by
a right shift of 32 bits.
Tracing the whole thing in gdb reveals some miscommunication in that idea.
The two input values at start of loop:
% $xmm0.v4_int32 {1215505350, 1491885311, -676627251, -515498245}
% $xmm1.v4_int32 {-1737931417, -1807033263, 1488592681, -1444724238}
% $xmm2.v2_int64 {0, 0}
The inputs from the arrays:
% c2[0]@4 {-1737931417, -1807033263, 1488592681, -1444724238}
% c3[0]@4 {1215505350, 1491885311, -676627251, -515498245}
Now after the first mult-acc:
% $xmm2.v2_int64 {-2695886381558099793, 744752809197962310}
That is, it multiplied components 1 and 3, not 0 and 2 (that is,
c2[1]*c3[1] and c2[3]*c3[3]). Now the two right shifts come and we have:
% $xmm0.v4_int32 {1491885311, -676627251, -515498245, 0}
% $xmm1.v4_int32 {-1807033263, 1488592681, -1444724238, 0}
Doing again the mult-acc (using component 1 and 3) now leads to wrong results.
The original component 0 is shifted out already, and component 3 is the zero
that was shifted in.
>From the description of the xop instruction (that talk about second and fourth
component for vpmacsdqh) it seems that the vpmacsdqh pattern should refer
to components 1 and 3 (and hence not be used in this case when the shift stay).
Actually with xop the shifts aren't necessary and both {0,2} and {1,3}
mult-accs could be done with vpmacsdql and vpmacsdqh.