In the following code, the generic function does not vectorize when called with arguments that make the operation in place. GCC does emit vector code, but that code is only executed for out-of-place operations where |a - b| > 16. This is despite the explicit hint to the compiler that the two pointers may be the same. Since the explicit in-place variant executes the same vectorized code, there should be no issue with the method chosen, and GCC should allow the in-place call to take the vector path.

void __attribute__((noinline)) generic(int * a, int * b, int c)
{
  int i;
  if (a == b) {
    for (i = 0; i < 256; i++) {
      a[i] = b[i] | c;
    }
  } else {
    for (i = 0; i < 256; i++) {
      a[i] = b[i] | c;
    }
  }
}

void __attribute__((noinline)) inplace(int * a, int c)
{
  int i;
  for (i = 0; i < 256; i++) {
    a[i] = a[i] | c;
  }
}

int main()
{
  int a[256];
  generic(a, a, 0);
  inplace(a, 0);
}

$ gcc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/6/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 6.2.0-5ubuntu12' --with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-6 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 6.2.0 20161005 (Ubuntu 6.2.0-5ubuntu12)

$ gcc test.c -ftree-vectorize -g -O2
Actually, this seems to be a regression: compiled with GCC 4.6, it does choose the vectorized path for the in-place function arguments.
The regression occurred in 07c0f1caa31079d8f4edd7ff4a06656e2a4441c5 (svn r233207); current trunk/master is still affected.
r233207
Another fallout of that questionable change... Note it's true that the vectorizer doesn't special-case a zero dependence distance when doing runtime alias checking; that's a missed feature. I'm not sure whether we can add that case cheaply. If we know the difference we are looking at is >= 0, we can bias it by -1 (unsigned).
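The biasing trick can be sketched in plain C (my own illustration, not GCC's internal code; the vector factor and function name are assumptions). For a[i] = b[i] | c, a write to a[i] can only clobber a not-yet-read b[j] when b < a < b + VF*sizeof(int), so a distance of zero is safe; subtracting one element before the unsigned comparison makes a == b wrap around and pass:

```c
#include <stddef.h>
#include <stdint.h>

enum { VF = 4 };  /* assumed vector factor: 4 ints = 16 bytes */

/* Return nonzero when the scalar fallback loop is needed for
   a[i] = b[i] | c.  Distances of 0 (in place) or >= VF elements are
   safe; only forward overlaps of 1 .. VF-1 elements are not.  The
   one-element bias makes a == b wrap to a huge unsigned value, so the
   in-place case takes the vector path. */
static int need_scalar_fallback(const int *a, const int *b)
{
    uintptr_t diff = (uintptr_t)a - (uintptr_t)b - sizeof(int);
    return diff <= (VF - 2) * sizeof(int);
}
```

With VF = 4 this has the same shape as a check of the form (a - (b + 4)) > 8: equal pointers pass, while a = b + 1 .. b + 3 elements correctly falls back to scalar code.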
I have been searching for a workaround, as this has a pretty bad performance impact in my use case. I found that putting #pragma GCC ivdep on the loop in the in-place conditional allows it to be vectorized again with GCC 6 and 7. It should be fine, as the in-place case carries no dependence for me. I'll adapt my code to use this, but it would still be great if the regression could be fixed.
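For reference, the workaround looks like this (a minimal sketch of the usage described above; the function name and loop bounds are just illustrative). The pragma asserts to GCC that the loop carries no loop-carried dependences, so the runtime alias check and its scalar fallback are skipped:

```c
void __attribute__((noinline)) generic(int *a, int *b, int c)
{
    int i;
    /* Promise GCC there are no loop-carried dependences: with a == b
       each iteration reads and writes only its own element, so the
       promise holds for the in-place case too. */
#pragma GCC ivdep
    for (i = 0; i < 256; i++)
        a[i] = b[i] | c;
}
```

Note the pragma is a promise, not a request: if the pointers could actually overlap with a nonzero distance, using it would be wrong.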
So this is another case where conditional propagation turns if (a == b) { ... = deref a; ... = deref b; } into if (a == b) { ... = deref b; ... = deref a; }, which is of course not sensible. Having both a = b and b = a recorded as copies breaks the lattice's canonical value. The original bug would need to have been fixed by recording equivalence sets, or by doing more expensive lookups with a known back-mapping from SSA value to SSA name list.
How would building equivalences help? Looking at the gcc.dg/tree-ssa/20030922-2.c testcase, the reason it works with Jeff's patch is that we are lucky enough on that exact testcase, which starts with:

  if (_6 != _11)
    goto <bb 3>; [69.50%]
  else
    goto <bb 6>; [30.50%]

  <bb 3> [69.50%]:
  target_bb.3_12 = target_bb;
  if (_11 == target_bb.3_12)
    goto <bb 4>; [37.68%]
  else
    goto <bb 6>; [62.32%]

  <bb 4> [26.19%]:
  if (_6 != target_bb.3_12)

We first replace target_bb.3_12 during cprop_operand, due to the extra SSA_NAME_VALUE added to _11 by the last GIMPLE_COND above, and then eliminate_redundant_computations does avail_exprs_stack->lookup_avail_expr (stmt, insert, true); on that if (_6 != _11) and finds it. If we are unlucky and instead have if (_6 != _11) in bb 4 and replace it with if (_6 != target_bb.3_12), then we won't find it. Consider, say, -O1 -fno-tree-fre:

struct rtx_def;
typedef struct rtx_def *rtx;
struct rtx_def { int bb; };
int *block_to_bb;
int target_bb;

int
rgn_rank (rtx insn1, rtx insn2)
{
  int a = block_to_bb[insn1->bb];
  int b = block_to_bb[insn2->bb];
  if (a != b)
    if (b == target_bb && a != b)
      return 1;
}

Here dom2 with the r233207 change reverted optimizes this into just two GIMPLE_CONDs, while current trunk does not (it manages to do so only during dom3). Since the first if (_6 != _11) is entered before we have any equivalences recorded (and the equivalences are only temporary anyway, and might not be in effect everywhere the if (_6 != _11) result is in effect), I think there is no hope that canonicalizing the statement we want to look up, at the point where equivalences are temporarily recorded for it, can get this down to a single hash table lookup. Perhaps we should revert r233207 and, in addition to SSA_NAME_VALUE, record the temporary equivalences in another data structure (e.g. forward and backward chains to other SSA_NAME versions; the chains could be just unsigned int SSA_NAME_VERSIONs or whatever). Then avail_exprs_stack::lookup_avail_expr, if it detects that one or both arguments of the statement have temporary equivalences recorded, could loop through all the equivalence combinations (perhaps with some upper bound), trying the (!insert) lookup of the available expression. And if everything else fails, do the insert lookup (if requested) on the SSA_NAMEs as written in the statement last. Thoughts on this?
Note that I've got POC changes that would allow us to distinguish between normal equivalences and those created by conditionals. That in turn would allow DOM to avoid ping-ponging copy propagations -- only propagating when doing so allows real simplifications. But that is gcc-8 material. Explicitly deferring.
Jeff, any progress with this?
Yea. The code that was recording NAME = NAME conditional equivalences was largely disabled back in August. They'll only be recorded now if one name is cheaper to compute than the other. So if the conditional equivalences are still a problem with this BZ, then we need to look at the costing of the SSA_NAME's defining statement.
(In reply to Jeffrey A. Law from comment #10)
> Yea. The code that was recording NAME = NAME conditional equivalences was
> largely disabled back in August. They'll only be recorded now if one name
> is cheaper to compute than the other.
>
> So if the conditional equivalences are still a problem with this BZ, then we
> need to look at the costing of the SSA_NAME's defining statement.

Yes, they are still the problem. Before r233207, dom changed:

  if (a_6(D) == b_7(D))
    goto <bb 3>;
  ...
  <bb 6>:
  # i_40 = PHI <i_36(3), i_18(8)>
  # DEBUG i => i_40
  _9 = (long unsigned int) i_40;
  _10 = _9 * 4;
- _11 = a_6(D) + _10;
- _13 = b_7(D) + _10;
- _14 = *_13;
+ _11 = b_7(D) + _10;
+ _13 = _11;
+ _14 = *_11;
  _16 = _14 | c_15(D);
  *_11 = _16;

But that doesn't happen even with current trunk; we still end up with both a and b in the block dominated by the a == b test. Another part of this PR is being able to determine in the vectorizer whether we can allow identical pointers or not (in this case we can, so we could adjust the alias check by 1).
Given that B and A have the same cost to compute, DOM, per design decision, leaves them alone. It makes no attempt to canonicalize on one or the other. While we know there's an equivalence between A and B, we have no information to guide selection of one over the other.
GCC 8.1 has been released.
GCC 8.2 has been released.
GCC 8.3 has been released.
What we'd need to do here is capture the temporary equivalence in another structure, then have lookup_avail_expr do multiple lookups using substitutions from that alternate structure. It's not terribly hard conceptually. We really want to avoid polluting const_and_copies and SSA_NAME_VALUE with these temporary equivalences when we can't prove one form is clearly better than the other. That has caused all kinds of problems.
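A toy model of that scheme might look like the following (my own sketch; none of these names or data structures are GCC's actual implementation): conditional equivalences live in a side table instead of the copy lattice, and the available-expression lookup retries with each substitution before giving up.

```c
#define MAX_EXPRS 16
#define MAX_EQUIV 16

/* A recorded expression OP(lhs, rhs) known to compute `value`;
   operands are SSA version numbers in this toy model. */
struct expr { int op, lhs, rhs, value; };

static struct expr table[MAX_EXPRS];
static int n_exprs;

/* Temporary equivalences from dominating conditions, e.g. "_6 == _11",
   kept out of the main lattice so they never drive propagation. */
static int equiv[MAX_EQUIV][2];
static int n_equiv;

void record_expr(int op, int lhs, int rhs, int value)
{
    struct expr e = { op, lhs, rhs, value };
    table[n_exprs++] = e;
}

void record_equiv(int a, int b)
{
    equiv[n_equiv][0] = a;
    equiv[n_equiv][1] = b;
    n_equiv++;
}

static int lookup_1(int op, int l, int r)
{
    for (int i = 0; i < n_exprs; i++)
        if (table[i].op == op && table[i].lhs == l && table[i].rhs == r)
            return table[i].value;
    return -1;  /* not found */
}

/* Try the expression as written first, then retry with each recorded
   equivalence substituted into either operand. */
int lookup_avail_expr(int op, int l, int r)
{
    int v = lookup_1(op, l, r);
    if (v >= 0)
        return v;
    for (int i = 0; i < n_equiv; i++) {
        int a = equiv[i][0], b = equiv[i][1];
        if (l == a && (v = lookup_1(op, b, r)) >= 0) return v;
        if (l == b && (v = lookup_1(op, a, r)) >= 0) return v;
        if (r == a && (v = lookup_1(op, l, b)) >= 0) return v;
        if (r == b && (v = lookup_1(op, l, a)) >= 0) return v;
    }
    return -1;
}
```

The compile-time concern mentioned above shows up directly here: each equivalence multiplies the number of candidate lookups, which is why a real version would want an upper bound on the combinations tried.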
I've got something limping along. It seems to allow us to capture the redundancy when a == b. I'm poking at it some more to see how often it applies across a wider codebase.
So AFAICT, opportunities to simplify as a result of conditional equivalences that aren't used for propagation are pretty rare: call it a dozen across a bootstrap. While I only implemented binary operators, I'd be surprised if adding other operators helped all that much; maybe we'd get something from INDIRECT_REF and MEM_REF. Given we're well into stage4, it's hard to justify adding this code right now. I'll queue it for reevaluation in gcc-11, though; the implementation isn't terribly complex, but I do worry about its compile-time cost.
So I think when you consider

void __attribute__((noinline)) fun(int * a, int * b, int c)
{
  int i;
  for (i = 0; i < 256; i++) {
    a[i] = b[i] | c;
  }
}

we can improve the versioning condition to allow a dependence distance of zero. Likewise with

void __attribute__((noipa)) generic(int * a, int * b, int c)
{
  int i;
  a = __builtin_assume_aligned (a, 16);
  b = __builtin_assume_aligned (b, 16);
  for (i = 0; i < 256; i++) {
    a[i] = b[i] | c;
  }
}

we fail to realize no versioning check is required: the distance is either zero or a multiple of 16. Richard - ISTR you added some alignment considerations to the alias versioning code, but it doesn't seem to help?
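The alignment argument can be checked with a tiny sketch (my own illustration; the function name is made up): if a and b are both 16-byte aligned, their byte distance is a multiple of 16, so the dependence distance in ints is either 0 or at least one full 16-byte vector, and no versioning check is needed.

```c
#include <stdint.h>

/* Nonzero when mutual 16-byte alignment already rules out any partial
   overlap: the wrapped unsigned byte distance is then a multiple of
   16, i.e. 0 elements or >= one full vector. */
int distance_safe_for_16byte_vectors(const int *a, const int *b)
{
    uintptr_t d = (uintptr_t)a - (uintptr_t)b;  /* wraps if a < b */
    return d % 16 == 0;
}
```

In other words, once __builtin_assume_aligned guarantees both pointers 16-byte alignment, every reachable distance is in the safe set, so the runtime alias check is dead code.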
(In reply to Richard Biener from comment #19)
> we can improve the versioning condition to allow a dependence distance
> of zero.

This one was fixed by r10-4803. E.g. for aarch64 we now have:

  add x3, x1, 4
  sub x3, x0, x3
  cmp x3, 8
  bls .L5

> we fail to realize no versioning check is required - the distance is
> either zero or a multiple of 16.
>
> Richard - ISTR you added some alignment considerations to the alias
> versioning code, but it doesn't seem to help?

I don't remember adding anything for that, but yeah, I agree it looks like we need it.
On Wed, 27 Jan 2021, rsandifo at gcc dot gnu.org wrote:
> This one was fixed by r10-4803. E.g. for aarch64 we now have:
>
>   add x3, x1, 4
>   sub x3, x0, x3
>   cmp x3, 8
>   bls .L5

Ah, yeah - I failed to decipher the generated check:

  _7 = b_12 + 4;
  _22 = a_10 - _7;
  _23 = (sizetype) _22;
  if (_23 > 8)

the difference is -4U and thus > 8 when a == b.
GCC 11.1 has been released, retargeting bugs to GCC 11.2.
GCC 11.2 is being released, retargeting bugs to GCC 11.3.
GCC 11.3 is being released, retargeting bugs to GCC 11.4.
I think the regression is fixed - we perform alias versioning unnecessarily but we are now entering the vectorized loop for a == b (in fact RTL opts eventually elide the versioning check, so the resulting code is optimal). This was fixed in GCC 10. I've split out remaining missed optimization to PR109271.