
[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #24 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to amker from comment #23)
> (In reply to Richard Biener from comment #21)
> > So after r257453 we improve the situation pre-IVOPTs to just
> > 6 IVs (duplicated but trivially equivalent) plus one counting IV.  But then
> > when SLP is enabled, IVOPTs comes along and adds another 4 IVs, which makes us
> > spill... (for AVX256, so you need -march=core-avx2 for example).
> > 
> > Bin, any chance you can take a look?  In the IVO dump I see
> > 
> >   target_avail_regs 15
> >   target_clobbered_regs 9
> >   target_reg_cost 4
> >   target_spill_cost 8
> >   regs_used 3
> > ^^^
> > 
> > and regs_used looks awfully low to me.  The loop has even more IVs initially
> > plus variable steps for those IVs, which means we need two regs per IV.
> > 
> > There doesn't seem to be a way to force IVOPTs to use the minimal set of IVs?
> > Or just use the original set, removing the obvious redundancies?  There is
> > a microarchitectural issue left with the vectorization, but the spilling
> > obscures the picture quite a bit :/
> 
> Sure, I will have a look based on your commit.  Thanks
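
To illustrate the "two regs per IV" point above with a C analogue (a
purely invented sketch, not the benchmark loop - all names are made
up): with a runtime stride, each pointer IV keeps two values live, the
pointer itself and its step.

  /* Illustrative only; names and parameters are invented.  With a
     variable stride each pointer IV needs two registers - the pointer
     itself and its step - so a handful of such IVs plus the vector
     temporaries quickly exhausts the 15 available registers.  */
  void
  strided_sum (float *a, const float *e, long sa, long se, long n)
  {
    for (long i = 0; i < n; i++)
      {
        *a += *e;
        a += sa;  /* IV 1: keeps both a and sa live */
        e += se;  /* IV 2: keeps both e and se live */
      }
  }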

Note the loop in question is the one starting at line 551; it gets
inlined multiple times, but the issue is visible with -fno-inline as well.
-mavx2 makes things worse (compared to -mavx2 -mprefer-avx128) because
for the strided accesses we choose to compute extra invariants for the
two strides of A and E.  For SSE we keep stride and stride * 3, while
for AVX we additionally compute stride * 5, stride * 6 and stride * 7
(in the cases where we don't choose another base IV).  At least computing
stride * 6 can be avoided by reusing stride * 3 with a step of 2 - but
that's probably too hard to see within the current IVO model?  I'm not
sure avoiding an invariant in exchange for an extra IV is ever a good
idea.  Spilling an invariant should be cheaper than spilling an IV - a
spilled invariant only needs a reload in the loop body while a spilled
IV needs a load plus a store each iteration - but yes, the addressing
mode can possibly absorb any bias we apply there as an offset.
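
As a hand-written C sketch of that stride * 6 point (invented names,
not IVOPTs output): the stride * 3 value can double as the step of a
chained pointer, so stride * 6 falls out as two steps instead of
needing its own invariant.

  /* Variant 1: a separate invariant per multiple of the runtime
     stride; stride * 3 and stride * 6 each occupy a register.  */
  void
  loads_with_invariants (const float *base, long stride, float out[2])
  {
    long s3 = stride * 3, s6 = stride * 6;
    out[0] = base[s3];
    out[1] = base[s6];
  }

  /* Variant 2: step a pointer by stride * 3 twice, so stride * 6
     never needs a register of its own.  */
  void
  loads_with_chained_iv (const float *base, long stride, float out[2])
  {
    const float *p = base + stride * 3;
    out[0] = *p;
    p += stride * 3;  /* now at base + stride * 6 */
    out[1] = *p;
  }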

Note the vectorizer itself tries to avoid computing stride * N by
strength-reducing it:

  _711 = (sizetype) iftmp.472_91;
  _712 = _711 * 64;
  _715 = (sizetype) iftmp.472_91;
  _716 = _715 * 8;
...
  # ivtmp_891 = PHI <ivtmp_892(28), _710(44)>
...
  _893 = MEM[(real(kind=4) *)ivtmp_891];
  ivtmp_894 = ivtmp_891 + _716;
  _895 = MEM[(real(kind=4) *)ivtmp_894];
  ivtmp_896 = ivtmp_894 + _716;
  _897 = MEM[(real(kind=4) *)ivtmp_896];
  ivtmp_898 = ivtmp_896 + _716;
  _899 = MEM[(real(kind=4) *)ivtmp_898];
  ivtmp_900 = ivtmp_898 + _716;
  _901 = MEM[(real(kind=4) *)ivtmp_900];
  ivtmp_902 = ivtmp_900 + _716;
  _903 = MEM[(real(kind=4) *)ivtmp_902];
  ivtmp_904 = ivtmp_902 + _716;
  _905 = MEM[(real(kind=4) *)ivtmp_904];
  ivtmp_906 = ivtmp_904 + _716;
  _907 = MEM[(real(kind=4) *)ivtmp_906];
  vect_cst__909 = {_893, _895, _897, _899, _901, _903, _905, _907};
...
  ivtmp_892 = ivtmp_891 + _712;

Note how it advances the IV in a single step at the end, though.  I'm
not sure whether IVO is confused by that or by the way we compute _716
vs. _712.
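
Rendered as C for readability (a sketch with invented names, not actual
vectorizer output; elt_step stands for _716 and vec_step for
_712 = 8 * _716): one running pointer is stepped per lane instead of
recomputing base + k * stride for each load.

  /* Sketch of the GIMPLE above; identifiers are invented.  */
  void
  gather_lanes (const char **iv, long elt_step, long vec_step,
                float lane[8])
  {
    const char *p = *iv;               /* ivtmp_891 */
    for (int k = 0; k < 8; k++)
      {
        lane[k] = *(const float *) p;  /* the eight scalar loads */
        p += elt_step;                 /* ivtmp_894 ... ivtmp_906 */
      }
    /* vect_cst__909 = {lane[0], ..., lane[7]} */
    *iv += vec_step;                   /* ivtmp_892 = ivtmp_891 + _712 */
  }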

That said, the summary is that IVO's behavior on unrolled loop bodies
with variable strides isn't helping here ;)
