This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Thu, 08 Feb 2018 09:32:32 +0000
- Subject: [Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644
- Auto-submitted: auto-generated
- References: <bug-84037-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037
--- Comment #24 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to amker from comment #23)
> (In reply to Richard Biener from comment #21)
> > So after r257453 we improve the situation pre-IVOPTs to just
> > 6 IVs (duplicated but trivially equivalent) plus one counting IV. But then,
> > when SLP is enabled, IVOPTs comes along and adds another 4 IVs, which makes
> > us spill... (for AVX256, so you need -march=core-avx2 for example).
> >
> > Bin, any chance you can take a look? In the IVO dump I see
> >
> > target_avail_regs 15
> > target_clobbered_regs 9
> > target_reg_cost 4
> > target_spill_cost 8
> > regs_used 3
> > ^^^
> >
> > and regs_used looks awfully low to me. The loop has even more IVs initially,
> > plus variable steps for those IVs, which means we need two regs per IV.
> >
> > There doesn't seem to be a way to force IVOPTs to use the minimal set of IVs?
> > Or just use the original set, removing the obvious redundancies? There is
> > a microarchitectural issue left with the vectorization but the spilling
> > obscures the look quite a bit :/
>
> Sure, I will have a look based on your commit. Thanks
Note the loop in question is the one starting at line 551; it gets inlined
multiple times, but the issue is visible with -fno-inline as well.
-mavx2 makes things worse (compared to -mavx2 -mprefer-avx128) because
for the strided accesses we choose to compute extra invariants for the
two strides of A and E. For SSE we keep stride and stride * 3 while
for AVX we additionally compute stride * 5, stride * 6 and stride * 7
(in the cases we don't choose another base IV). At least computing stride * 6
can be avoided by using stride * 3 with step 2 - but it's probably too
hard to see that within the current IVO model? I'm not sure avoiding
an invariant in exchange for an extra IV is ever a good idea? Spilling
an invariant should be cheaper than spilling an IV - but yes, the
addressing mode can possibly absorb, via its offset, any bias we apply there.
Note the vectorizer itself tries to avoid computing stride * N by
strength-reducing it:
_711 = (sizetype) iftmp.472_91;
_712 = _711 * 64;
_715 = (sizetype) iftmp.472_91;
_716 = _715 * 8;
...
# ivtmp_891 = PHI <ivtmp_892(28), _710(44)>
...
_893 = MEM[(real(kind=4) *)ivtmp_891];
ivtmp_894 = ivtmp_891 + _716;
_895 = MEM[(real(kind=4) *)ivtmp_894];
ivtmp_896 = ivtmp_894 + _716;
_897 = MEM[(real(kind=4) *)ivtmp_896];
ivtmp_898 = ivtmp_896 + _716;
_899 = MEM[(real(kind=4) *)ivtmp_898];
ivtmp_900 = ivtmp_898 + _716;
_901 = MEM[(real(kind=4) *)ivtmp_900];
ivtmp_902 = ivtmp_900 + _716;
_903 = MEM[(real(kind=4) *)ivtmp_902];
ivtmp_904 = ivtmp_902 + _716;
_905 = MEM[(real(kind=4) *)ivtmp_904];
ivtmp_906 = ivtmp_904 + _716;
_907 = MEM[(real(kind=4) *)ivtmp_906];
vect_cst__909 = {_893, _895, _897, _899, _901, _903, _905, _907};
...
ivtmp_892 = ivtmp_891 + _712;
Note how it advances the IV in one step at the end, though. I'm not sure whether
IVO is confused by that or by the way we compute _716 vs. _712.
That said, the summary is that IVO's behavior on unrolled loop bodies with a
variable stride isn't helping here ;)