[PATCH] PR/67682, break SLP groups up if only some elements match

Alan Lawrence alan.lawrence@arm.com
Tue Nov 17 16:47:00 GMT 2015


On 16/11/15 14:42, Christophe Lyon wrote:
>
> Hi Alan,
>
> I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c)
> fails for armeb targets.
> I haven't had time to look at more details yet, but I guess you can
> reproduce it quickly enough.
>

Thanks - yes I see it now.

-fdump-tree-optimized looks sensible:

__attribute__((noinline))
test ()
{
   vector(4) int vect__14.21;
   vector(4) int vect__2.20;
   vector(4) int vect__2.19;
   vector(4) int vect__3.13;
   vector(4) int vect__2.12;

   <bb 2>:
   vect__2.12_24 = MEM[(int *)&b];
   vect__3.13_27 = vect__2.12_24 + { 1, 2, 3, 4 };
   MEM[(int *)&a] = vect__3.13_27;
   vect__2.19_31 = MEM[(int *)&b + 16B];
   vect__2.20_33 = VEC_PERM_EXPR <vect__2.12_24, vect__2.19_31, { 0, 2, 4, 6 }>;
   vect__14.21_35 = vect__2.20_33 * { 3, 4, 5, 7 };
   MEM[(int *)&a + 16B] = vect__14.21_35;
   return;
}
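
For reference, here is a scalar sketch of what that block computes. The
actual source of bb-slp-subgroups-3.c and the initialization of b are my
assumptions, inferred from the expected values quoted below:

/* Scalar equivalent of the vectorized block above (assumed, not the
   verbatim testcase).  */
int a[8], b[8];

void
test_scalar (void)
{
  /* First subgroup: a[0..3] = b[0..3] + {1, 2, 3, 4}.  */
  a[0] = b[0] + 1;
  a[1] = b[1] + 2;
  a[2] = b[2] + 3;
  a[3] = b[3] + 4;
  /* Second subgroup: a[4..7] = b[{0,2,4,6}] * {3, 4, 5, 7}; this is the
     VEC_PERM_EXPR <..., ..., { 0, 2, 4, 6 }> in the dump.  */
  a[4] = b[0] * 3;
  a[5] = b[2] * 4;
  a[6] = b[4] * 5;
  a[7] = b[6] * 7;
}

/* With b[i] = i + 4, this should give a[0..3] = {5, 7, 9, 11} and
   a[4..7] = {12, 24, 40, 70}.  */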

However, while a[0..3] end up containing 5 7 9 11 as expected,
a[4..7] end up with 30 32 30 28 rather than the expected 12 24 40 70.
That is, we end up with (10 8 6 4), rather than the expected (4 6 8 10), being
multiplied by {3,4,5,7}. Looking at the RTL, those values come from a UZP1/2
pair that should extract elements {0,2,4,6} of b. Here is the assembler, with
my workings as to what is in each register:

test:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	movw	r2, #:lower16:b
	movt	r2, #:upper16:b
	vldr	d22, .L11
	vldr	d23, .L11+8
;; So d22 = (3 4), d23 = (5 7), q11 = (5 7 3 4)
	movw	r3, #:lower16:a
	movt	r3, #:upper16:a
	vld1.64	{d16-d17}, [r2:64]
;; So d16 = (b[0] b[1]), d17 = (b[2] b[3]), q8 = (b[2] b[3] b[0] b[1])
	vmov	q9, q8  @ v4si
;; q9 = (b[2] b[3] b[0] b[1])
	vldr	d20, [r2, #16]
	vldr	d21, [r2, #24]
;; So d20 = (b[4] b[5]), d21 = (b[6] b[7]), q10 = (b[6] b[7] b[4] b[5])
	vuzp.32	q10, q9
;; So  q10 = (b[3] b[1] b[7] b[5]), i.e. d20 = (b[7] b[5]) and d21 = (b[3] b[1])
;; and q9 = (b[2] b[0] b[6] b[4]), i.e. d18 = (b[6] b[4]) and d19 = (b[2] b[0])
	vldr	d20, .L11+16
	vldr	d21, .L11+24
;; d20 = (1 2), d21 = (3 4), q10 = (3 4 1 2)
	vmul.i32	q9, q9, q11
;; q9 = (b[2]*5 b[0]*7 b[6]*3 b[4]*4)
;; i.e. d18 = (b[6]*3 b[4]*4) and d19 = (b[2]*5 b[0]*7)
	vadd.i32	q8, q8, q10
;; q8 = (b[2]+3 b[3]+4 b[0]+1 b[1]+2)
;; i.e. d16 = (b[0]+1 b[1]+2), d17 = (b[2]+3 b[3]+4)
	vst1.64	{d16-d17}, [r3:64]
;; a[0] = b[0]+1, a[1] = b[1]+2, a[2] = b[2]+3, a[3]=b[3]+4 all ok
	vstr	d18, [r3, #16]
;; a[4] = b[6]*3, a[5] = b[4]*4
	vstr	d19, [r3, #24]
;; a[6] = b[2]*5, a[7] = b[0]*7
	bx	lr
.L12:
	.align	3
.L11:
	.word	3
	.word	4
	.word	5
	.word	7
	.word	1
	.word	2
	.word	3
	.word	4

Which is to say: the lane order in the q-registers is neither big- nor
little-endian, but the elements get stored back to memory in an order
consistent with how they were loaded, so we're OK as long as there are no
permutes. Unfortunately, a UZP operates on the lanes as the hardware numbers
them, so this lane-ordering mixup no longer cancels out between load and
store, and everything gets mixed up...
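
As a concrete cross-check, here is a minimal C model of the mismatch. The
values are taken from my workings above, b's initializer is the same
assumption as in the sketch earlier, and the whole thing is illustrative
rather than a model of the hardware:

#include <stdio.h>

int
main (void)
{
  int b[8] = { 4, 5, 6, 7, 8, 9, 10, 11 };  /* assumed contents of b */
  int mult[4] = { 3, 4, 5, 7 };
  int expected[4], actual[4];

  for (int i = 0; i < 4; i++)
    {
      /* What VEC_PERM_EXPR <.., .., {0,2,4,6}> should feed the multiply.  */
      expected[i] = b[2 * i] * mult[i];
      /* What the VUZP sequence above actually feeds it: the even elements
         of b in reverse order.  */
      actual[i] = b[6 - 2 * i] * mult[i];
    }

  for (int i = 0; i < 4; i++)
    printf ("a[%d]: expected %d, got %d\n", i + 4, expected[i], actual[i]);
  /* expected: 12 24 40 70; got: 30 32 30 28, matching the failure above.  */
  return 0;
}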

Hmmm. My feeling is that "properly" fixing this testcase amounts to fixing
armeb's whole register-numbering/lane-flipping scheme, which might be quite a
large task. OTOH it might also fix the significant number of vectorizer tests
already failing on armeb. A simpler solution might be to disable some part of
vector support on armeb, but I'm not sure yet which part would be best.

Thoughts (CC maintainers)?

--Alan


