Bug 100756 - [12 Regression] vect: Superfluous epilog created on s390x
Summary: [12 Regression] vect: Superfluous epilog created on s390x
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 12.0
: P2 normal
Target Milestone: 13.0
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on: 107499
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2021-05-25 16:22 UTC by rdapp
Modified: 2024-06-20 11:39 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Known to work: 13.0
Known to fail:
Last reconfirmed: 2022-10-21 00:00:00


Attachments
incomplete patch (2.06 KB, patch)
2021-05-26 07:39 UTC, Richard Biener
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description rdapp 2021-05-25 16:22:26 UTC
Since g:d846f225c25c5885250c303c8d118caa08c447ab we create an epilog loop on s390 for the following test case:

/* { dg-do compile } */
/* { dg-options "-O3 -mzarch -march=z13" } */
/* { dg-require-effective-target s390_vx } */

int
foo (int * restrict a, int n)
{
  int i, result = 0;

  for (i = 0; i < n * 4; i++)
    result += a[i];
  return result;
}

vec.c:10:17: note:  epilog loop required

The following check in tree-vect-loop.c:vect_need_peeling_or_partial_vectors_p() is now true:

               || ((tree_ctz (LOOP_VINFO_NITERS (loop_vinfo))                                                                                                                       
                     < (unsigned) exact_log2 (const_vf)) 

We now have LOOP_VINFO_NITERS (loop_vinfo) = _15 > 0 ? (unsigned int) _15 : 1 as compared to (unsigned int) _15 before. tree_ctz() returns 0 for the conditional and 2 before which did not trigger the epilog requirement.

may_be_zero is _15 > 0 so it looks to me like we rather want to check the not-may_be_zero part of niter for alignment. Not sure if this is the right/safe thing to do, though.
Comment 1 Richard Biener 2021-05-26 07:39:25 UTC
Created attachment 50870 [details]
incomplete patch

Yes, the simplest thing might be to push may_be_zero to assumptions, but that way we're going to force versioning for the loop.  That said, recording the
not maybe_zero_part separately and using it for this check should be easy.

Also confirmed on x86_64.

Incomplete patch attached - it would need auditing of all _NITER[M1] uses.
Comment 2 rdapp 2022-03-25 09:45:25 UTC
I did not get back to this until now.  The patch works, of course and a testsuite run looks good so far.  I assume we're too late in the cycle to still get this in, right?
Comment 3 Richard Biener 2022-03-25 09:53:34 UTC
(In reply to rdapp from comment #2)
> I did not get back to this until now.  The patch works, of course and a
> testsuite run looks good so far.  I assume we're too late in the cycle to
> still get this in, right?

Yes.  That's clearly stage1 material now.
Comment 4 Robin Dapp 2022-10-20 12:04:40 UTC
Anything that can/should be done here in case we'd still want to tackle it in this P1 cycle?
Comment 5 Richard Biener 2022-10-21 12:21:18 UTC
I think the issue is that we now have

  <bb 2> [local count: 118111600]:
  _15 = n_8(D) * 4;
  if (n_8(D) > 0)
    goto <bb 5>; [89.00%]
  else
    goto <bb 7>; [11.00%]

while we probably had

  <bb 2> [local count: 118111600]:
  _15 = n_8(D) * 4;
  if (_15 > 0)
    goto <bb 5>; [89.00%]
  else
    goto <bb 7>; [11.00%]

before the change.  Loop header copying applies VN to the copied blocks:

Processing block 0: BB6
Value numbering stmt = i_9 = PHI <0(2)>
Setting value number of i_9 to 0 (changed)
Replaced redundant PHI node defining i_9 with 0
Value numbering stmt = result_14 = PHI <0(2)>
Setting value number of result_14 to 0 (changed)
Replaced redundant PHI node defining result_14 with 0
Value numbering stmt = _15 = n_8(D) * 4;
Setting value number of _15 to _15 (changed)
Making available beyond BB6 _15 for value _15
Value numbering stmt = if (_15 > i_9)
Recording on edge 6->7 _15 gt_expr 0 == true
Recording on edge 6->7 _15 le_expr 0 == false
Recording on edge 6->7 _15 ne_expr 0 == true
Recording on edge 6->7 _15 ge_expr 0 == true
Recording on edge 6->7 _15 lt_expr 0 == false
Recording on edge 6->7 _15 eq_expr 0 == false
marking outgoing edge 6 -> 7 executable
gimple_simplified to if (n_8(D) > 0)

with

  <bb 2> [local count: 118111600]:
  _15 = n_8(D) * 4;
  if (n_8(D) > 0)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 3> [local count: 955630225]:
  # i_16 = PHI <i_13(3), 0(2)>
  # result_17 = PHI <result_12(3), 0(2)>
  _1 = (long unsigned int) i_16;
  _2 = _1 * 4;
  _3 = a_11(D) + _2;
  _4 = *_3;
  result_12 = _4 + result_17;
  i_13 = i_16 + 1;
  _5 = n_8(D) * 4;
  if (_5 > i_13)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

so it was a single use in the compare (because CSE only later introduces
more uses through DOM).

The niter code then ends up with maybe-zero as _15 <= 0 and a condition
of n_8(D) > 0 it tries to simplify with tree_simplify_using_condition
(called from simplify_using_initial_conditions).  That old machinery
would be a perfect candidate to be rewritten using path ranger, but
in a somewhat extended mode that can "skip" diamonds, aka, the path
just contains dominators of the loop entry edge on which we want to
evaluate the _15 <= 0 condition.

To make the old simplification code work we can do the following:

diff --git a/gcc/tree-ssa-loop-niter.cc b/gcc/tree-ssa-loop-niter.cc
index 1e0f609d8b6..4ffcef4f4ff 100644
--- a/gcc/tree-ssa-loop-niter.cc
+++ b/gcc/tree-ssa-loop-niter.cc
@@ -2216,6 +2216,7 @@ expand_simple_operations (tree expr, tree stop, hash_map<tree, tree> &cache)
 
     case PLUS_EXPR:
     case MINUS_EXPR:
+    case MULT_EXPR:
       if (ANY_INTEGRAL_TYPE_P (TREE_TYPE (expr))
          && TYPE_OVERFLOW_TRAPS (TREE_TYPE (expr)))
        return expr;

but that can of course have unintended side-effects elsewhere
(this function is also used by IVOPTs).
Comment 6 GCC Commits 2022-10-25 08:03:16 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:19295e8607da2f743368fe6f5708146616aafa91

commit r13-3476-g19295e8607da2f743368fe6f5708146616aafa91
Author: Richard Biener <rguenther@suse.de>
Date:   Mon Oct 24 09:51:32 2022 +0200

    tree-optimization/100756 - niter analysis and folding
    
    niter analysis, specifically the part trying to simplify the computed
    maybe_zero condition against the loop header copying condition, is
    confused by us now simplifying
    
      _15 = n_8(D) * 4;
      if (_15 > 0)
    
    to
    
      _15 = n_8(D) * 4;
      if (n_8(D) > 0)
    
    which is perfectly sound at the point we do this transform.  One
    solution might be to involve ranger in this simplification, another
    is to be more aggressive when expanding expressions - the condition
    we try to simplify is _15 > 0, so all we need is expanding that
    to n_8(D) * 4 > 0.
    
    The following does just that.
    
            PR tree-optimization/100756
            * tree-ssa-loop-niter.cc (expand_simple_operations): Also
            expand multiplications by invariants.
    
            * gcc.dg/vect/pr100756.c: New testcase.
Comment 7 Richard Biener 2022-10-25 08:03:43 UTC
Fixed on trunk - lets see if there's any fallout.
Comment 8 Robin Dapp 2023-02-01 09:59:02 UTC
For completeness: haven't observed any fallout on s390 since and the regression is fixed.
Comment 9 Richard Biener 2023-05-08 12:22:00 UTC
GCC 12.3 is being released, retargeting bugs to GCC 12.4.
Comment 10 Richard Biener 2024-06-20 08:56:37 UTC
GCC 12.4 is being released, retargeting bugs to GCC 12.5.
Comment 11 Robin Dapp 2024-06-20 10:22:37 UTC
Just noticed this is still open due to the retargeting message.
IMHO this can be closed.  I'm pretty sure I erroneously used the GCC 12 target when opening the bug when it should have been trunk/GCC 13.

I suppose we're not expecting a backport for GCC 12 anyway and Stefan confirmed that the bug is fixed since.  Therefore I'd suggest to close it.
Comment 12 Richard Biener 2024-06-20 11:39:16 UTC
Fixed then.