This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal


On Fri, Oct 24, 2014 at 01:06:41PM +0100, Alan Lawrence wrote:
> This migrates the reduction patterns in altivec.md and vector.md to
> the new names. I've not touched paired.md as I wasn't really sure
> how to fix that (how do I vec_extractv2sf ?), moreover the testing I
> did didn't seem to exercise any of those patterns (iow: I'm not sure
> what would be an appropriate target machine?).
> 
> I note the reduc_uplus_v16qi (which I've removed, as unsigned and
> signed addition should be equivalent) differed from
> reduc_splus_v16qi in using gen_altivec_vsum4ubs rather than
> gen_altivec_vsum4sbs.  Testcases gcc.dg/vect/{slp-24-big-array.c,slp-24.c,vect-reduc-1char-big-array.c,vert-reduc-1char.c}
> thus produce assembly which differs from previously (only) in that
> "vsum4ubs" becomes "vsum4sbs". These tests are still passing so I
> assume this is OK.
> 
> The combining of signed and unsigned addition also improves gcc.dg/vect/{vect-outer-4i.c,vect-reduc-1short.c,vect-reduc-dot-u8b.c,vect-reduc-pattern-1c-big-array.c,vect-reduc-pattern-1c.c}
> : these are now reduced using direct vector reduction, rather than
> with shifts as previously (because there was only a reduc_splus
> rather than the reduc_uplus these tests looked for).

I checked the integer vector add reductions, and it seems to generate the same
value with old/new code, and I like eliminating the vector shift.

> ((Side note: the RTL changes to vector.md are to match the combine
> patterns in vsx.md; now that we now longer depend upon combine to
> generate those patterns (as the optab outputs them directly), one
> might wish to remove the smaller pattern from vsx.md, and/or
> simplify the RTL. I theorize that a reduction of a two-element
> vector is just adding the first element to the second, so maybe to
> something like
> 
>   [(parallel [(set (match_operand:DF 0 "vfloat_operand" "")
> 		   (VEC_reduc:V2DF
> 		    (vec_select:DF
> 		     (match_operand:V2DF 1 "vfloat_operand" "")
> 		     (parallel [(const_int 1)]))
> 		    (vec_select:DF
> 		     (match_dup 1)
> 		     (parallel [(const_int 0)]))))
> 	      (clobber (match_scratch:V2DF 2 ""))])]
> 
> but I think it's best for me to leave that to the port maintainers.))
> 
> Bootstrapped and check-gcc on powerpc64-none-linux-gnu
> (gcc110.fsffrance.org, with thanks to the GCC Compile Farm).

However, the double pattern is completely broken.  This cannot go in.

Consider this source:

#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#ifndef TYPE
#define TYPE double
#endif

#ifndef OTYPE
#define OTYPE TYPE
#endif

#ifndef SIZE
#define SIZE 1024
#endif

#ifndef ALIGN
#define ALIGN 32
#endif

TYPE a[SIZE] __attribute__((__aligned__(ALIGN)));

OTYPE sum (void) __attribute__((__noinline__));

OTYPE
sum (void)
{
  size_t i;
  OTYPE s = (OTYPE) 0;

  for (i = 0; i < SIZE; i++)
    s += a[i];

  return s;
}

If I compile with today's trunk, and -mcpu=power8 -ffast-math -O3, I get code
that I expect (though it could xxpermdi instead of xxsldwi):

sum:
	.quad	.L.sum,.TOC.@tocbase,0
	.previous
	.type	sum, @function
.L.sum:
	li 10,512
	addis 9,2,.LC1@toc@ha		# gpr load fusion, type long
	ld 9,.LC1@toc@l(9)
	xxlxor 0,0,0
	mtctr 10
	.p2align 4,,15
.L2:
	lxvd2x 12,0,9
	addi 9,9,16
	xvadddp 0,0,12
	bdnz .L2
	xxsldwi 12,0,0,2
	xvadddp 1,12,0
	xxpermdi 1,1,1,2
	blr
	.long 0

However, the code produced by the patches gives:

sum:
	.quad	.L.sum,.TOC.@tocbase,0
	.previous
	.type	sum, @function
.L.sum:
	xxlxor 0,0,0
	addi 10,1,-16
	li 8,512
	addis 9,2,.LC1@toc@ha		# gpr load fusion, type long
	ld 9,.LC1@toc@l(9)
	mtctr 8
	stxvd2x 0,0,10
	.p2align 5,,31
.L2:
	addi 10,1,-16
	lxvd2x 0,0,9
	addi 9,9,16
	lxvd2x 12,0,10
	xvadddp 12,12,0
	stxvd2x 12,0,10
	bdnz .L2
	lfd 0,-16(1)
	xxpermdi 1,12,12,2
	fadd 1,0,1
	blr
	.long 0

It is unacceptable to have to do the inner loop doing a load, vector add, and
store in the loop.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]