This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [PATCH] PR/67682, break SLP groups up if only some elements match
- From: Alan Lawrence <alan dot lawrence at arm dot com>
- To: Christophe Lyon <christophe dot lyon at linaro dot org>
- Cc: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, Tejas Belagod <Tejas dot Belagod at arm dot com>, Richard Earnshaw <Richard dot Earnshaw at arm dot com>, Ramana Radhakrishnan <Ramana dot Radhakrishnan at arm dot com>
- Date: Tue, 17 Nov 2015 16:46:53 +0000
- Subject: Re: [PATCH] PR/67682, break SLP groups up if only some elements match
- Authentication-results: sourceware.org; auth=none
- References: <CAFiYyc3GXFmREtzjLP+m1LBju2okEEzDwcX6wHn2xcuSdrh4wg at mail dot gmail dot com> <1447431522-4695-1-git-send-email-alan dot lawrence at arm dot com> <CAKdteOYnXq87nCGxcxfyPSSy4TN-rP_3HJ2u4wRgLpa_o7ZEZw at mail dot gmail dot com>
On 16/11/15 14:42, Christophe Lyon wrote:
> Hi Alan,
>
> I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c)
> fails for armeb targets.
>
> I haven't had time to look at more details yet, but I guess you can
> reproduce it quickly enough.
Thanks - yes I see it now.
-fdump-tree-optimized looks sensible:
__attribute__((noinline))
test ()
{
  vector(4) int vect__14.21;
  vector(4) int vect__2.20;
  vector(4) int vect__2.19;
  vector(4) int vect__3.13;
  vector(4) int vect__2.12;

  <bb 2>:
  vect__2.12_24 = MEM[(int *)&b];
  vect__3.13_27 = vect__2.12_24 + { 1, 2, 3, 4 };
  MEM[(int *)&a] = vect__3.13_27;
  vect__2.19_31 = MEM[(int *)&b + 16B];
  vect__2.20_33 = VEC_PERM_EXPR <vect__2.12_24, vect__2.19_31, { 0, 2, 4, 6 }>;
  vect__14.21_35 = vect__2.20_33 * { 3, 4, 5, 7 };
  MEM[(int *)&a + 16B] = vect__14.21_35;
  return;
}
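For reference, this is roughly the scalar computation the dump corresponds to. It's a reconstruction from the dump above, not the actual bb-slp-subgroups-3.c source, and b's contents are inferred from the observed results:

```c
/* Sketch of the computation, reconstructed from the -fdump-tree-optimized
   output above -- not the actual bb-slp-subgroups-3.c source. */
int a[8], b[8];

__attribute__((noinline)) void test(void)
{
  int addc[4] = {1, 2, 3, 4};
  int mulc[4] = {3, 4, 5, 7};
  for (int i = 0; i < 4; i++)
    {
      a[i] = b[i] + addc[i];         /* vect__3.13 = vect__2.12 + {1,2,3,4} */
      a[4 + i] = b[2 * i] * mulc[i]; /* even lanes of b via VEC_PERM_EXPR   */
    }
}
```

With b[i] = i + 4 (consistent with the results quoted next), a correct compilation gives a = {5, 7, 9, 11, 12, 24, 40, 70}.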
but while a[0..3] end up containing 5 7 9 11 as expected, a[4..7] end up
with 30 32 30 28 rather than the expected 12 24 40 70. That is, we end up
with (10 8 6 4), rather than the expected (4 6 8 10), being multiplied by
{3,4,5,7}. Looking at the RTL, those values come from a UZP1/2 pair that
should extract elements {0,2,4,6} of b. Here's the assembler, with my
workings as to what's in each register:
test:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
movw r2, #:lower16:b
movt r2, #:upper16:b
vldr d22, .L11
vldr d23, .L11+8
;; So d22 = (3 4), d23 = (5 7), q11 = (5 7 3 4)
movw r3, #:lower16:a
movt r3, #:upper16:a
vld1.64 {d16-d17}, [r2:64]
;; So d16 = (b[0] b[1]), d17 = (b[2] b[3]), q8 = (b[2] b[3] b[0] b[1])
vmov q9, q8 @ v4si
;; q9 = (b[2] b[3] b[0] b[1])
vldr d20, [r2, #16]
vldr d21, [r2, #24]
;; So d20 = (b[4] b[5]), d21 = (b[6] b[7]), q10 = (b[6] b[7] b[4] b[5])
vuzp.32 q10, q9
;; So q10 = (b[3] b[1] b[7] b[5]), i.e. d20 = (b[7] b[5]) and d21 = (b[3] b[1])
;; and q9 = (b[2] b[0] b[6] b[4]), i.e. d18 = (b[6] b[4]) and d19 = (b[2] b[0])
vldr d20, .L11+16
vldr d21, .L11+24
;; d20 = (1 2), d21 = (3 4), q10 = (3 4 1 2)
vmul.i32 q9, q9, q11
;; q9 = (b[2]*5 b[0]*7 b[6]*3 b[4]*4)
;; i.e. d18 = (b[6]*3 b[4]*4) and d19 = (b[2]*5 b[0]*7)
vadd.i32 q8, q8, q10
;; q8 = (b[2]+3 b[3]+4 b[0]+1 b[1]+2)
;; i.e. d16 = (b[0]+1 b[1]+2), d17 = (b[2]+3 b[3]+4)
vst1.64 {d16-d17}, [r3:64]
;; a[0] = b[0]+1, a[1] = b[1]+2, a[2] = b[2]+3, a[3]=b[3]+4 all ok
vstr d18, [r3, #16]
;; a[4] = b[6]*3, a[5] = b[4]*4
vstr d19, [r3, #24]
;; a[6] = b[2]*5, a[7] = b[0]*7
bx lr
.L12:
.align 3
.L11:
.word 3
.word 4
.word 5
.word 7
.word 1
.word 2
.word 3
.word 4
Which is to say: the lane order in the q-registers is neither big- nor
little-endian, but the elements get stored back to memory in an order
consistent with how they were loaded, so we're OK as long as there are no
permutes. Unfortunately, UZP does not cancel out this lane-ordering mixup
the way a matching store does, and messes everything up...
Hmmm. I feel that "properly" fixing this testcase amounts to fixing armeb's
whole register-numbering/lane-flipping scheme, which might be quite a large
task. OTOH it might also fix the significant number of failing vectorizer
tests. A simpler solution might be to disable some part of vector support
on armeb, but I'm not yet sure which part would be best.
Thoughts (CC maintainers)?
--Alan