50596 – Problems in vectorization of condition expression

Bug 50596 - Problems in vectorization of condition expression

Summary: Problems in vectorization of condition expression

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	4.7.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2011-10-03 08:08 UTC by vincenzo Innocente
Modified:	2011-10-25 08:23 UTC (History)
CC List:	4 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2011-10-04 00:00:00

Attachments
gcc47-vect-condexpr-mixed.patch (3.13 KB, patch) 2011-10-06 11:57 UTC, Jakub Jelinek	Details \| Diff
/tmp/gcc47-vect-condexpr-mixed.patch (3.73 KB, patch) 2011-10-06 13:30 UTC, Jakub Jelinek	Details \| Diff
Show Obsolete (1) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description vincenzo Innocente 2011-10-03 08:08:33 UTC

I've the need to vectorize something like this
const int N=1024;
float a[1024];
float b[1024];
float c[1024];
float d[1024];
bool z[1024];
// not vectorized: control flow in loop
void ori() {
  for (int i=0; i!=N; ++i)
    z[i] = a[i]<b[i] && c[i]<d[i];
}

The only equivalent function that vectorize is
float r[1024];
void bar() {
  for (int i=0; i!=N; ++i)
    r[i] = (a[i]<b[i] ? 1.f : 0.f) *  ( c[i]<d[i] ? 1.f : 0.f);
}
(not exactly a highly optimized version…)

All other variants below do not vectorize for the reason in the comment that precede each

// not vectorized: no vectype for stmt: z[i_17] = D.2199_10;
// scalar_type: bool
void ori2() {
  for (int i=0; i!=N; ++i)
    z[i] = a[i]<b[i] & c[i]<d[i];
}


// not vectorized: control flow in loop.
int j[1024];
void foo1() {
  for (int i=0; i!=N; ++i)
    j[i] = a[i]<b[i] && c[i]<d[i];
}

// not vectorized: unsupported data-type bool
void foo2() {
  for (int i=0; i!=N; ++i)
    j[i] = int(a[i]<b[i]) & int(c[i]<d[i]);
}

// not vectorized: unsupported data-type bool
void foo3() {
  for (int i=0; i!=N; ++i)
    j[i] = int(a[i]<b[i]);
}

// not vectorized: unsupported data-type bool
void foo4() {
  for (int i=0; i!=N; ++i)
    j[i] = a[i]<b[i] ? 1L : 0L ;
}


any chance to make at least "foo2" or equivalent vectorized?

Comment 1 vincenzo Innocente 2011-10-03 08:40:53 UTC

manage to vectorize this

int j[1024];
void foo5() {
  for (int i=0; i!=N; ++i)
    j[i] = (a[i]<b[i] ? -1 : 0) & (c[i]<d[i] ? -1 : 0);
}


which is not bad, still a funny syntax (at least for those who are not used to code in native SSE)

Comment 2 Jakub Jelinek 2011-10-04 07:04:25 UTC

The first problem with vectorization of ori function is similar to why the
first loop below is not vectorized and second is:
float a[1024], b[1024], c[1024], d[1024], e[1024];
void foo (void)
{
  for (int i = 0; i < 1024; i++)
    a[i] = b[i] < c[i] ? d[i] : e[i];
}
void bar (void)
{
  for (int i = 0; i < 1024; i++)
    {
      float d_ = d[i], e_ = e[i];
      a[i] = b[i] < c[i] ? d_ : e_;
    }
}

gcc doesn't think it is ok to load d[i] resp. e[i] unconditionally.  In this exact case where the loop bound is known and it is an static array of at least that size it is probably fine, but if d or e was a pointer which might point to a smaller array, d[i] or e[i] accesses might segfault.

That said, we still have control flow that even ifcvt doesn't fix up even with:
void
f2 ()
{
  for (int i = 0; i != N; ++i)
    {
      float c_ = c[i], d_ = d[i];
      z[i] = a[i] < b[i] && c_ < d_;
    }
}

void
f3 ()
{
  for (int i = 0; i != N; ++i)
    {
      float a_ = a[i], b_ = b[i], c_ = c[i], d_ = d[i];
      z[i] = a_ < b_ && c_ < d_;
    }
}

Note even if there would be no control flow, we'd still give up on bool not being vectorized.  Bool is problematic, we'd have to use an unsigned char vector instead (if bool is QImode) for vcond.  But it would be a vcond with different
datamode and cmpmode size, we'd either need to do it using a V4SFmode/V8SFmode vcond, then VEC_PACK_TRUNC_EXPR them into V16QImode/V32QImode.

Anyway, I think handling _Bool/bool somehow is now much more urgent than it has been before, given Kai's/Richard's change to use _Bool/bool much more often in GIMPLE.  If a bool SSA_NAME just feeds some COND_EXPR, we could just use some wider type, or we could use wider vcond etc.

Comment 3 vincenzo Innocente 2011-10-04 09:11:53 UTC

for (int i = 0; i < 1024; i++)
    a[i] = b[i] < c[i] ? d[i] : e[i];
DOES vectorize with
 -ftree-loop-if-convert-stores
even with 
float * a; float * b; float * c; float * d; float * e;

Comment 4 Richard Biener 2011-10-04 11:09:58 UTC

I agree with the need to at least support vectorizing loads and stores of
1-bit unsigned precision values.  We need to be careful with arithmetic
and conversions though (which is why we reject bools right now).

Comment 5 Jakub Jelinek 2011-10-04 11:13:58 UTC

(In reply to comment #4)
> I agree with the need to at least support vectorizing loads and stores of
> 1-bit unsigned precision values.  We need to be careful with arithmetic
> and conversions though (which is why we reject bools right now).

We could represent the arithmetic and conversions (or at least subset thereof) using *COND_EXPRs etc.  In any case, the bool representation is desirable for the scalar loop, so this isn't something we should be doing in ifcvt, it needs to be done in the vectorizer itself.

Comment 6 rguenther@suse.de 2011-10-04 11:26:51 UTC

On Tue, 4 Oct 2011, jakub at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50596
> 
> --- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-10-04 11:13:58 UTC ---
> (In reply to comment #4)
> > I agree with the need to at least support vectorizing loads and stores of
> > 1-bit unsigned precision values.  We need to be careful with arithmetic
> > and conversions though (which is why we reject bools right now).
> 
> We could represent the arithmetic and conversions (or at least subset thereof)
> using *COND_EXPRs etc.  In any case, the bool representation is desirable for
> the scalar loop, so this isn't something we should be doing in ifcvt, it needs
> to be done in the vectorizer itself.

Sure.  Note that in GIMPLE

 bool = (bool) int;

isn't equivalent to bool = int != 0 but to a truncation to 1-bit
precision.  Thus for the truncation a BIT_AND is enough.  I'm just
worried about N-precision signed to mode-precision sign-extension
(for the 1-bit case we can use a COND_EXPR, but for more bits
it gets more difficult).

Richard.

Comment 7 rguenther@suse.de 2011-10-04 11:28:18 UTC

On Tue, 4 Oct 2011, jakub at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50596
> 
> --- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-10-04 11:13:58 UTC ---
> (In reply to comment #4)
> > I agree with the need to at least support vectorizing loads and stores of
> > 1-bit unsigned precision values.  We need to be careful with arithmetic
> > and conversions though (which is why we reject bools right now).
> 
> We could represent the arithmetic and conversions (or at least subset thereof)
> using *COND_EXPRs etc.  In any case, the bool representation is desirable for
> the scalar loop, so this isn't something we should be doing in ifcvt, it needs
> to be done in the vectorizer itself.

Oh, and we don't handle expanding N-bit precision arithmetic on
vector types properly - for scalars we do the necessary truncation
at RTL expansion time.  So I think we should give up for that case
for now.

Comment 8 Jakub Jelinek 2011-10-05 07:10:56 UTC

Until http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=176563
float a[1024], b[1024], c[1024], d[1024];
int j[1024];

void
foo (void)
{
  int i;
  for (i = 0; i < 1024; ++i)
    {
      int x = a[i] < b[i];
      int y = c[i] < d[i];
      j[i] = x & c[i] < y;
    }
}

didn't use any bool types, just int and float, still it couldn't vectorize:
pr50596-2.c:8: note: not vectorized: relevant stmt not supported: x_5 = D.2699_3 < D.2700_4;

I think we could use VECT_COND_EXPR <vect1 < vect2, { 1, 1, ...}, { 0, 0, ... }>
for that (and hopefully the backends optimize that well, e.g. into
anding the comparison mask with { 1, 1, ... } or doing per-element right shift
by element width - 1 on the mask.

With bool it would be nice if at least for non-stores we would pick the best suitable wider integer vector type (in this case where the bools are set by comparison operation and feed & that is afterwards cast to int the best is obviously int vector).

Comment 9 Jakub Jelinek 2011-10-06 11:57:54 UTC

Created attachment 25428 [details]
gcc47-vect-condexpr-mixed.patch

I believe at least some simple case of bool could be handled in tree-vect-patterns.c by transforming bool lhs assignments with comparison on rhs
into COND_EXPRs on char/short/int/long (depending on the comparison operand size), &/|/^ could be handled too and finally either cast to some integer type or memory store).  Before trying to write it, I tried to write something simpler, in particular a pattern recognizer that allows to vectorize mixed size type COND_EXPRs (so far only with INTEGER_CST then/else).  For the case where COND_EXPR lhs type is wider than comparison type I think it must be INTEGER_CSTs, otherwise we can't ensure that they fit into the narrower integer type.  But for lhs type narrower than comparison type
  lhs = cmp0 < cmp1 ? val1 : val2;
(where sizeof (lhs) < sizeof (cmp0)) the above in theory could be transformed into (for itype an integer type with the same sign as val1's type, but size of cmp0) into:
  val1' = (itype) val1;
  val2' = (itype) val2;
  lhs' = cmp0 < cmp1 ? val1' : val2';
  lhs = (__typeof (lhs)) lhs';
but we'd need more than one def_stmt for that.

This patch allows e.g. vectorization of:
float a[1024], b[1024];
unsigned char k[1024];

void
foo (void)
{
  int i;
  for (i = 0; i < 1024; ++i)
    k[i] = a[i] < b[i] ? -1 : 0;
}
on i?86/x86_64 which couldn't be previously vectorized.

Ira, does this sound reasonable?  How should a testcase look like (I think it will be currently only vectorized on i?86/x86_64, as it needs mixed mode vcond support, which, while probably implementable for e.g. altivec, is currently i386 backend only feature)?  If this makes sense, I'll try to do the bool pattern recognition next.

Comment 10 Ira Rosen 2011-10-06 12:31:29 UTC

(In reply to comment #9)

> 
> Ira, does this sound reasonable?  

Looks good to me. (You can probably use build_nonstandard_integer_type() instead of lang_hooks.types.type_for_mode).

> How should a testcase look like (I think it
> will be currently only vectorized on i?86/x86_64, as it needs mixed mode vcond
> support, which, while probably implementable for e.g. altivec, is currently
> i386 backend only feature)?  

I am not sure I understand the question. Are you asking how to check that it gets vectorized only on i?86/x86_64? If so, you need a new proc in lib/target-supports.exp (something like vect_cond_mixed_types).

Comment 11 Jakub Jelinek 2011-10-06 13:30:36 UTC

Created attachment 25429 [details]
/tmp/gcc47-vect-condexpr-mixed.patch

Thanks, here is an updated patch.

Comment 12 Jakub Jelinek 2011-10-06 17:49:43 UTC

Author: jakub
Date: Thu Oct  6 17:49:36 2011
New Revision: 179626

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=179626
Log:
	PR tree-optimization/50596
	* tree-vectorizer.h (vect_is_simple_cond): New prototype.
	(NUM_PATTERNS): Change to 6.
	* tree-vect-patterns.c (vect_recog_mixed_size_cond_pattern): New
	function.
	(vect_vect_recog_func_ptrs): Add vect_recog_mixed_size_cond_pattern.
	(vect_mark_pattern_stmts): Don't create stmt_vinfo for def_stmt
	if it already has one, and don't set STMT_VINFO_VECTYPE in it
	if it is already set.
	* tree-vect-stmts.c (vect_mark_stmts_to_be_vectorized): Handle
	COND_EXPR in pattern stmts.
	(vect_is_simple_cond): No longer static.

	* lib/target-supports.exp (check_effective_target_vect_cond_mixed):
	New.
	* gcc.dg/vect/vect-cond-8.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/vect-cond-8.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/lib/target-supports.exp
    trunk/gcc/tree-vect-patterns.c
    trunk/gcc/tree-vect-stmts.c
    trunk/gcc/tree-vectorizer.h

Comment 13 vincenzo Innocente 2011-10-07 07:35:40 UTC

is not PR50649 caused by your changes?

Comment 14 vincenzo Innocente 2011-10-07 10:15:03 UTC

signed char k[1024];
void foo6() {
  for (int i=0; i!=N; ++i)
    k[i] = (a[i]<b[i] ? -1 : 0) & (c[i]<d[i] ? -1 : 0);
}

requires -fno-tree-pre to vectorize
w/o I get
not vectorized: relevant stmt not supported: prephitmp.214_16 = D.2173_10 < D.2174_11 ? iftmp.2_2 : 0;



btw in almost all code of mine -fno-tree-pre produces always faster code when vectorization matters!

Comment 15 Jakub Jelinek 2011-10-07 10:31:13 UTC

float a[1024], b[1024], c[1024], d[1024];
int j[1024];

void
f1 (void)
{
  int i;
  for (i = 0; i < 1024; ++i)
    {
      unsigned int x = a[i] < b[i] ? -1 : 0;
      unsigned int y = c[i] < d[i] ? -1 : 0;
      j[i] = (x & y) >> 31;
    }
}

vectorizes fine and generates quite good code IMHO.  Something similar I'd like to achieve with the vect_recog_bool_pattern I'm working on even for some of your testcases.

Comment 16 Jakub Jelinek 2011-10-16 13:10:26 UTC

Author: jakub
Date: Sun Oct 16 13:10:20 2011
New Revision: 180057

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=180057
Log:
	PR tree-optimization/50596
	* tree-vectorizer.h (NUM_PATTERNS): Increase to 7.
	* tree-vect-patterns.c (vect_vect_recog_func_ptrs): Add
	vect_recog_bool_pattern.
	(check_bool_pattern, adjust_bool_pattern_cast,
	adjust_bool_pattern, vect_recog_bool_pattern): New functions.

	* gcc.dg/vect/vect-cond-9.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/vect-cond-9.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-patterns.c
    trunk/gcc/tree-vectorizer.h

Comment 17 vincenzo Innocente 2011-10-16 13:47:22 UTC

cool!
even
signed char k[1024];
    61	void foo6() {
    62	  for (int i=0; i!=N; ++i)
    63	    k[i] = (a[i]<b[i]) & (c[i]<d[i]);
vectorize!
with 
bool k[1024];
does not. I can survive though.
I will have to measure performance. I suspect that using
int k[1024];
will be faster…
Anyhow great achievement

Comment 18 Jakub Jelinek 2011-10-25 08:02:16 UTC

Author: jakub
Date: Tue Oct 25 08:02:08 2011
New Revision: 180424

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=180424
Log:
	PR tree-optimization/50596
	* tree-vect-stmts.c (vect_mark_relevant): Only use
	FOR_EACH_IMM_USE_FAST if lhs is SSA_NAME.
	(vectorizable_store): If is_pattern_stmt_p look through
	VIEW_CONVERT_EXPR on lhs.
	* tree-vect-patterns.c (check_bool_pattern, adjust_bool_pattern):
	Use unsigned type instead of signed.
	(vect_recog_bool_pattern): Optimize also stores into bool memory in
	addition to casts from bool to integral types.
	(vect_mark_pattern_stmts): If pattern_stmt already has vinfo
	created, don't create it again.

	* gcc.dg/vect/vect-cond-10.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/vect-cond-10.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-patterns.c
    trunk/gcc/tree-vect-stmts.c

Comment 19 Jakub Jelinek 2011-10-25 08:23:32 UTC

Bool stores are handled now too.