
Re: [PATCH] PR/67682, break SLP groups up if only some elements match


On 16/11/15 14:42, Christophe Lyon wrote:

> Hi Alan,
>
> I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c)
> fails for armeb targets.
> I haven't had time to look at more details yet, but I guess you can
> reproduce it quickly enough.


Thanks - yes I see it now.

-fdump-tree-optimized looks sensible:

__attribute__((noinline))
test ()
{
  vector(4) int vect__14.21;
  vector(4) int vect__2.20;
  vector(4) int vect__2.19;
  vector(4) int vect__3.13;
  vector(4) int vect__2.12;

  <bb 2>:
  vect__2.12_24 = MEM[(int *)&b];
  vect__3.13_27 = vect__2.12_24 + { 1, 2, 3, 4 };
  MEM[(int *)&a] = vect__3.13_27;
  vect__2.19_31 = MEM[(int *)&b + 16B];
  vect__2.20_33 = VEC_PERM_EXPR <vect__2.12_24, vect__2.19_31, { 0, 2, 4, 6 }>;
  vect__14.21_35 = vect__2.20_33 * { 3, 4, 5, 7 };
  MEM[(int *)&a + 16B] = vect__14.21_35;
  return;
}
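
In scalar terms that block is computing (my reconstruction from the dump
above - the actual gcc.dg/vect/bb-slp-subgroups-3.c source may differ in
detail):

int a[8], b[8];

__attribute__((noinline)) void
test (void)
{
  a[0] = b[0] + 1;
  a[1] = b[1] + 2;
  a[2] = b[2] + 3;
  a[3] = b[3] + 4;
  a[4] = b[0] * 3;
  a[5] = b[2] * 4;
  a[6] = b[4] * 5;
  a[7] = b[6] * 7;
}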

But while a[0..3] end up containing 5 7 9 11 as expected, a[4..7] end up
with 30 32 30 28 rather than the expected 12 24 40 70. That is, we end up
with (10 8 6 4), rather than the expected (4 6 8 10), being multiplied by
{3,4,5,7}. Looking at the RTL, those values come from a UZP1/2 pair that
should extract elements {0,2,4,6} of b.

Assembler, with my workings as to what's in each register:

test:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	movw	r2, #:lower16:b
	movt	r2, #:upper16:b
	vldr	d22, .L11
	vldr	d23, .L11+8
;; So d22 = (3 4), d23 = (5 7), q11 = (5 7 3 4)
	movw	r3, #:lower16:a
	movt	r3, #:upper16:a
	vld1.64	{d16-d17}, [r2:64]
;; So d16 = (b[0] b[1]), d17 = (b[2] b[3]), q8 = (b[2] b[3] b[0] b[1])
	vmov	q9, q8  @ v4si
;; q9 = (b[2] b[3] b[0] b[1])
	vldr	d20, [r2, #16]
	vldr	d21, [r2, #24]
;; So d20 = (b[4] b[5]), d21 = (b[6] b[7]), q10 = (b[6] b[7] b[4] b[5])
	vuzp.32	q10, q9
;; So  q10 = (b[3] b[1] b[7] b[5]), i.e. d20 = (b[7] b[5]) and d21 = (b[3] b[1])
;; and q9 = (b[2] b[0] b[6] b[4]), i.e. d18 = (b[6] b[4]) and d19 = (b[2] b[0])
	vldr	d20, .L11+16
	vldr	d21, .L11+24
;; d20 = (1 2), d21 = (3 4), q10 = (3 4 1 2)
	vmul.i32	q9, q9, q11
;; q9 = (b[2]*5 b[0]*7 b[6]*3 b[4]*4)
;; i.e. d18 = (b[6]*3 b[4]*4) and d19 = (b[2]*5 b[0]*7)
	vadd.i32	q8, q8, q10
;; q8 = (b[2]+3 b[3]+4 b[0]+1 b[1]+2)
;; i.e. d16 = (b[0]+1 b[1]+2), d17 = (b[2]+3 b[3]+4)
	vst1.64	{d16-d17}, [r3:64]
;; a[0] = b[0]+1, a[1] = b[1]+2, a[2] = b[2]+3, a[3]=b[3]+4 all ok
	vstr	d18, [r3, #16]
;; a[4] = b[6]*3, a[5] = b[4]*4
	vstr	d19, [r3, #24]
;; a[6] = b[2]*5, a[7] = b[0]*7
	bx	lr
.L12:
	.align	3
.L11:
	.word	3
	.word	4
	.word	5
	.word	7
	.word	1
	.word	2
	.word	3
	.word	4

Which is to say: the bit order in the q-registers is neither big- nor little-endian, but the elements get stored back to memory in an order consistent with how they were loaded, so we're OK as long as there are no permutes. Unfortunately a permute like UZP works on the in-register lane numbers, so the lane-ordering mixup is no longer cancelled out and everything gets messed up...
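
To make that concrete, here's a rough scalar model of the effect - my own
illustration, not code from the arm backend.  load_q/store_q mimic the
element order shown in my workings above, and the b[] values are inferred
from the results (the odd elements of b[4..7] don't matter here):

#include <stdio.h>

/* One "q register" modelled as four ints, in the order used in the
   workings above.  */
static void load_q (int q[4], const int *mem)
{
  /* vld1.64: memory (m0 m1 m2 m3) ends up as (m2 m3 m0 m1).  */
  q[0] = mem[2]; q[1] = mem[3]; q[2] = mem[0]; q[3] = mem[1];
}

static void store_q (int *mem, const int q[4])
{
  /* The store undoes the same swap, so a plain load+store is fine.  */
  mem[0] = q[2]; mem[1] = q[3]; mem[2] = q[0]; mem[3] = q[1];
}

int main (void)
{
  int b[8] = { 4, 5, 6, 7, 8, 9, 10, 11 };
  int lo[4], hi[4], perm[4], out[4];

  load_q (lo, b);        /* (b[2] b[3] b[0] b[1]), as for q8 above */
  load_q (hi, b + 4);    /* (b[6] b[7] b[4] b[5]), as for q10 above */

  /* The even-position extract the vuzp performs, which was supposed to
     implement VEC_PERM <..., {0,2,4,6}>, i.e. (b[0] b[2] b[4] b[6]).  */
  perm[0] = lo[0]; perm[1] = lo[2]; perm[2] = hi[0]; perm[3] = hi[2];

  store_q (out, perm);
  for (int i = 0; i < 4; i++)
    printf ("%d ", out[i]);    /* prints 10 8 6 4, not 4 6 8 10 */
  printf ("\n");
  return 0;
}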

Hmmm. I'm feeling that "properly" fixing this testcase amounts to fixing armeb's whole register-numbering/lane-flipping scheme, and might be quite a large task. OTOH it might also fix the significant number of failing vectorizer tests. A simpler solution might be to disable... some part of vector support... on armeb, but I'm not sure which part would be best yet.

Thoughts (CC maintainers)?

--Alan

