Bug 80198 - [10/11/12/13 Regression] does not vectorize generic inplace integer operation
Summary: [10/11/12/13 Regression] does not vectorize generic inplace integer operation
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 6.3.0
: P2 normal
Target Milestone: 10.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer 65917
  Show dependency treegraph
 
Reported: 2017-03-26 14:44 UTC by Julian Taylor
Modified: 2023-03-24 07:35 UTC (History)
7 users (show)

See Also:
Host:
Target:
Build:
Known to work: 10.1.0
Known to fail:
Last reconfirmed: 2023-03-23 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Julian Taylor 2017-03-26 14:44:36 UTC
In the following code the generic function does not vectorize when provided with arguments that do an inplace operation. GCC does emit vector code but the code is only executed in out of place operations where |a - b| > 16.
This is despite the explicit hint to the compiler that the two pointers are the same.

As the explicit inplace variant does execute the same vectorized code there should be no issue with the method chosen and GCC should allow the inplace operations to choose the vector path.

void __attribute__((noinline)) fun(int * a, int * b, int c)
{
    int i;
    if (a == b) {
	for (i=0; i < 256; i++) {
	    a[i] = b[i] | c;
	}
    }
    else {
	for (i=0; i < 256; i++) {
	    a[i] = b[i] | c;
	}
    }
}


void __attribute__((noinline)) inplace(int * a, int c)
{
    int i;
    for (i=0; i < 256; i++) {
        a[i] = a[i] | c;
    }
}

int main()
{
    int a[256];
    generic(a, a, 0);
    inplace(a, 0);
}


$ gcc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/6/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 6.2.0-5ubuntu12' --with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-6 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 6.2.0 20161005 (Ubuntu 6.2.0-5ubuntu12) 

$ gcc test.c  -ftree-vectorize -g -O2
Comment 1 Julian Taylor 2017-03-26 15:32:30 UTC
actually this seems to be a regression.
Compiled with gcc 4.6 it does chose the vectorized path for the inplace function arguments.
Comment 2 Julian Taylor 2017-03-26 18:37:26 UTC
the regression occurred in 07c0f1caa31079d8f4edd7ff4a06656e2a4441c5 svn 233207
current trunk/master is still affected
Comment 3 Andrew Pinski 2017-03-26 18:42:47 UTC
r233207
Comment 4 Richard Biener 2017-03-27 08:33:57 UTC
Another fallout of that questionable change...

Note it's true that the vectorizer doesn't special-case zero dependence distance when doing runtime checking for aliasing, that's a missed feature.  Not sure
if we can cheaply add that case.  If we know the difference we are looking at
is >= 0 we can bias by -1 (unsigned).
Comment 5 Julian Taylor 2017-03-27 12:48:48 UTC
I have been searching for a workaround as this has pretty bad performance impact in my usecase. I found that putting this into the inplace conditional
#pragma GCC ivdep
allows it to be vectorized again with gcc 6 and 7. It should be fine as inplace is no dependence in my case.

I'll adapt my code to use this, but it would still be great when the regression could get fixed.
Comment 6 Richard Biener 2017-03-27 12:49:53 UTC
So this is another case where the conditional propagation happens so that

 if (a == b)
   {
      ... = deref a;
      ... = deref b;
   }

turns into

 if (a == b)
   {
      ... = deref b;
      ... = deref a;
   }

which is of course not sensible.  Having both a = b and b = a recorded
as copies breaks the lattices canonical value.

The original bug would need to have been fixed by recording equivalency
sets or by doing more expensive lookups and knowing a backmapping,
SSA value -> SSA name list.
Comment 7 Jakub Jelinek 2017-04-05 09:26:33 UTC
How would building equivalences help?
Looking at the gcc.dg/tree-ssa/20030922-2.c testcase, the reason it works with
Jeff's patch is that we are lucky enough on that exact testcase starting with:
  if (_6 != _11)
    goto <bb 3>; [69.50%]
  else
    goto <bb 6>; [30.50%]

  <bb 3> [69.50%]:
  target_bb.3_12 = target_bb;
  if (_11 == target_bb.3_12)
    goto <bb 4>; [37.68%]
  else
    goto <bb 6>; [62.32%]

  <bb 4> [26.19%]:
  if (_6 != target_bb.3_12)
and first replace the target_bb.3_12 due to the extra added SSA_NAME_VALUE to
_11 in the last GIMPLE_COND above during cprop_operand and then eliminate_redundant_computations do
avail_exprs_stack->lookup_avail_expr (stmt, insert, true);
on that if (_6 != _11) and find it.  If we are unlucky enough and say have
  if (_6 != _11)
in the bb4 and replace that with if (_6 != target_bb.3_12), then we won't find it.  Consider say -O1 -fno-tree-fre:
struct rtx_def;
typedef struct rtx_def *rtx;
struct rtx_def
{
  int bb;
};
int *block_to_bb;
int target_bb;

int
rgn_rank (rtx insn1, rtx insn2)
{
  int a = block_to_bb[insn1->bb];
  int b = block_to_bb[insn2->bb];
  if (a != b)
    if (b == target_bb && a != b)
      return 1;
}
Here dom2 with the r233207 change reverted is optimized into just two GIMPLE_CONDs, while current trunk does not (manages to do it only during dom3).
As the first if (_6 != _11) is entered before we have any equivalences recorded (and the equivalences are only temporary anyway, might not be in effect in all the places where the if (_6 != _11) is in effect), I think there is no hope for some canonicalization of the stmt we want to look up at the place we have some equivalences for it temporarily recorded can help in doing just a single hash table lookup.
Perhaps we should revert r233207 and in addition to the SSA_NAME_VALUE record the temporary equivalences some way in another data structure (e.g. forward and backward chains to other SSA_NAME versions, chains could be just unsigned int SSA_NAME_VERSIONs or whatever).  Then e.g.
avail_exprs_stack::lookup_avail_expr
if it detects one or both arguments of the stmt have temporary equivalences recorded could just loop through all the equivalence combinations (perhaps with some upper bound), trying to (!insert) find the available expression.  And if everything else fails do the insert lookup (if requested) on the SSA_NAMEs as in the stmt last.  Thoughts on this?
Comment 8 Jeffrey A. Law 2017-04-12 16:20:56 UTC
Note that I've got POC changes that would allow us to distinguish between normal equivalences and those created by conditionals.  That in turn would allow DOM to avoid ping-ponging copy propagations -- only propagating when doing so allows real simplifications.  But that is gcc-8 material.  Explicitly deferring.
Comment 9 Jakub Jelinek 2017-11-27 11:42:44 UTC
Jeff, any progress with this?
Comment 10 Jeffrey A. Law 2017-11-27 16:19:54 UTC
Yea.  The code that was recording NAME = NAME conditional equivalences was largely disabled back in August.  They'll only be recorded now if one name is cheaper to compute than the other.

So if the conditional equivalences are still a problem with this BZ, then we need to look at the costing of the SSA_NAME's defining statement.
Comment 11 Jakub Jelinek 2017-12-01 11:31:29 UTC
(In reply to Jeffrey A. Law from comment #10)
> Yea.  The code that was recording NAME = NAME conditional equivalences was
> largely disabled back in August.  They'll only be recorded now if one name
> is cheaper to compute than the other.
> 
> So if the conditional equivalences are still a problem with this BZ, then we
> need to look at the costing of the SSA_NAME's defining statement.

Yes, they are still the problem.
Before r233207, dom changed:
   if (a_6(D) == b_7(D))
     goto <bb 3>;
...
   <bb 6>:
   # i_40 = PHI <i_36(3), i_18(8)>
   # DEBUG i => i_40
   _9 = (long unsigned int) i_40;
   _10 = _9 * 4;
-  _11 = a_6(D) + _10;
-  _13 = b_7(D) + _10;
-  _14 = *_13;
+  _11 = b_7(D) + _10;
+  _13 = _11;
+  _14 = *_11;
   _16 = _14 | c_15(D);
   *_11 = _16;
But that doesn't happen even with current trunk, we still end up with both a and b in the bb dominated by a == b test.

Another part of this PR is being able to determine in the vectorizer if we can allow identical pointers or not (in this case we can, so we could adjust the alias check by 1).
Comment 12 Jeffrey A. Law 2017-12-18 23:23:49 UTC
Given that B and A have the same cost to compute, DOM, per design decision, leaves them alone.  It makes no attempt to canonicalize on one or the other.  While we know there's an equivalence between A and B, we have no information to guide selection of one over the other.
Comment 13 Jakub Jelinek 2018-05-02 10:04:09 UTC
GCC 8.1 has been released.
Comment 14 Jakub Jelinek 2018-07-26 10:59:03 UTC
GCC 8.2 has been released.
Comment 15 Jakub Jelinek 2019-02-22 15:19:18 UTC
GCC 8.3 has been released.
Comment 16 Jeffrey A. Law 2020-02-24 15:50:53 UTC
What we'd need to do here is capture the temporary equivalence in another structure, then have lookup_avail_expr do multiple lookups using substitutions from that alternate structure.  It's not terribly hard conceptually.

We really want to avoid polluting const_and_copies and SSA_NAME_VALUE with these temporary equivalences when we can't prove one form is clearly better than the other.  That has caused all kinds of problems.
Comment 17 Jeffrey A. Law 2020-02-24 19:43:22 UTC
I've got something limping along.  It seems to allow us to capture the redundancy when a == b.  I'm poking at it some more to see how often it applies across a wider codebase.
Comment 18 Jeffrey A. Law 2020-02-24 23:30:05 UTC
So AFAICT opportunities to simplify as a result of conditional equivalences that aren't used for propagation are pretty rare.  Call it a dozen across a bootstrap.  While I only implemented binary operators, I'd be surprised if adding other operators would help all that much. Maybe we'd get something from INDIRECT_REF and MEM_REF.  Given we're well into stage4, it's hard to justify adding this code right now.  I'll queue it for reevlauation in gcc-11 though -- the implementation isn't terribly complex, though I do worry about its compile-time cost.
Comment 19 Richard Biener 2021-01-27 11:49:18 UTC
So I think when you consider

void __attribute__((noinline)) fun(int * a, int * b, int c)
{
  int i;
  for (i=0; i < 256; i++) {
     a[i] = b[i] | c;
  }
}

we can improve the versioning condition to allow a dependence distance
of zero.  Likewise with

void __attribute__((noipa)) generic(int * a, int * b, int c)
{
  int i;
  a = __builtin_assume_aligned (a, 16);
  b = __builtin_assume_aligned (b, 16);
  for (i=0; i < 256; i++) {
      a[i] = b[i] | c;
  }
}

we fail to realize no versioning check is required - the distance is
either zero or a multiple of 16.

Richard - ISTR you added some alignment considerations to the alias
versioning code, but it doesn't seem to help?
Comment 20 Richard Sandiford 2021-01-27 12:02:18 UTC
(In reply to Richard Biener from comment #19)
> So I think when you consider
> 
> void __attribute__((noinline)) fun(int * a, int * b, int c)
> {
>   int i;
>   for (i=0; i < 256; i++) {
>      a[i] = b[i] | c;
>   }
> }
> 
> we can improve the versioning condition to allow a dependence distance
> of zero.
This one was fixed by r10-4803.  E.g. for aarch64 we now have:

        add     x3, x1, 4
        sub     x3, x0, x3
        cmp     x3, 8
        bls     .L5

> Likewise with
> 
> void __attribute__((noipa)) generic(int * a, int * b, int c)
> {
>   int i;
>   a = __builtin_assume_aligned (a, 16);
>   b = __builtin_assume_aligned (b, 16);
>   for (i=0; i < 256; i++) {
>       a[i] = b[i] | c;
>   }
> }
> 
> we fail to realize no versioning check is required - the distance is
> either zero or a multiple of 16.
> 
> Richard - ISTR you added some alignment considerations to the alias
> versioning code, but it doesn't seem to help?
I don't remember adding anything for that, but yeah, I agree it looks
like we need it.
Comment 21 rguenther@suse.de 2021-01-27 12:43:48 UTC
On Wed, 27 Jan 2021, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80198
> 
> --- Comment #20 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #19)
> > So I think when you consider
> > 
> > void __attribute__((noinline)) fun(int * a, int * b, int c)
> > {
> >   int i;
> >   for (i=0; i < 256; i++) {
> >      a[i] = b[i] | c;
> >   }
> > }
> > 
> > we can improve the versioning condition to allow a dependence distance
> > of zero.
> This one was fixed by r10-4803.  E.g. for aarch64 we now have:
> 
>         add     x3, x1, 4
>         sub     x3, x0, x3
>         cmp     x3, 8
>         bls     .L5

Ah, yeah - I failed to decipher the generated check:

  _7 = b_12 + 4;
  _22 = a_10 - _7;
  _23 = (sizetype) _22;
  if (_23 > 8)

the difference is -4U and thus > 8 when a == b.

> > Likewise with
> > 
> > void __attribute__((noipa)) generic(int * a, int * b, int c)
> > {
> >   int i;
> >   a = __builtin_assume_aligned (a, 16);
> >   b = __builtin_assume_aligned (b, 16);
> >   for (i=0; i < 256; i++) {
> >       a[i] = b[i] | c;
> >   }
> > }
> > 
> > we fail to realize no versioning check is required - the distance is
> > either zero or a multiple of 16.
> > 
> > Richard - ISTR you added some alignment considerations to the alias
> > versioning code, but it doesn't seem to help?
> I don't remember adding anything for that, but yeah, I agree it looks
> like we need it.
Comment 22 Jakub Jelinek 2021-04-27 11:37:58 UTC
GCC 11.1 has been released, retargeting bugs to GCC 11.2.
Comment 23 Richard Biener 2021-07-28 07:04:28 UTC
GCC 11.2 is being released, retargeting bugs to GCC 11.3
Comment 24 Richard Biener 2022-04-21 07:47:42 UTC
GCC 11.3 is being released, retargeting bugs to GCC 11.4.
Comment 25 Richard Biener 2023-03-24 07:35:48 UTC
I think the regression is fixed - we perform alias versioning unnecessarily
but we are now entering the vectorized loop for a == b (in fact RTL opts
eventually elide the versioning check, so the resulting code is optimal).

This was fixed in GCC 10.

I've split out remaining missed optimization to PR109271.