[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

rguenth at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Fri Sep 25 13:05:28 GMT 2020


--- Comment #21 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #18)
> (In reply to Richard Biener from comment #10)
> > (In reply to Kewen Lin from comment #9)
> > > (In reply to Richard Biener from comment #8)
> > > > (In reply to Kewen Lin from comment #7)
> > > > > Two questions in mind, need to dig into it further:
> > > > >   1) from the assembly of scalar/vector code, I don't see any stores needed
> > > > > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > > > > consider the stores.
> > > > 
> > > > Because when modeling they are still there.  There's no good way around this.
> > > > 
> > > 
> > > I noticed the stores get eliminated during FRE.  Can we consider running FRE
> > > once just before SLP? a bad idea due to compilation time?
> > 
> > Yeah, we already run FRE a lot and it is one of the more expensive passes.
> > 
> > Note there's one point we could do better which is the embedded SESE FRE
> > run from cunroll which is only run before we consider peeling an outer loop
> > and thus not for the outermost unrolled/peeled code (but the question would
> > be from where / up to what to apply FRE to).  On x86_64 this would apply to
> > the unvectorized but then unrolled outer loop from pixel_sub_wxh which feeds
> > quite bad IL to the SLP pass (but that shouldn't matter too much, maybe it
> > matters for costing though).
> By following this idea, to release the restriction on loop_outer
> (loop_father) when setting the father_bbs, I can see FRE works as
> expectedly.  But it actually does the rpo_vn from cfun's entry to its exit.

Yeah, that's the reason we do not do it.  We could possibly restrict it
to a containing loop, or if the containing loop is the whole function,
restrict it to the original preheader block to the loop exits (which are
no longer there, we'd need to pre-record those I think)

> If it's taken as costly, we probably can guard it to take effects only when
> all its inner loops are unrolled, for this case, all of its three loops are
> unrolled.

The original reason VN is done on unrolled bodies is to improve cost estimates
on unrolling nests - since we do not unroll the non-existing outer loop of the
outermost loop applying VN there for this reason is pointless and the
reasoning is to simply wait for the next scheduled VN pass (which is too late
for SLP)

> Besides, when SLP happens, FRE gen the bit_field_ref and remove array d, but
> for scalar codes it needs one more time dse run after cunroll to get array d
> eliminated. I guess it's not costly? Can one pass be run or not controlled
> by something in another pass? via global variable and add one parameter in
> passes.def seems weird. If it's costly, probably we can go by factoring out
> one routine to be called instead of running a pass, like do_rpo_vn?

No, we don't have a good way to schedule passes from other passes.  And yes,
the way forward is to support key transforms on regions.  Oh, and every
pass that does memory alias analysis (DSE, DCE, VN, etc.) is costly.

What we could eventually do is move the non-loop SLP pass later and
run the loop SLP pass only on loop bodies (so overall every BB is just
SLP vectorized once).  The

      PUSH_INSERT_PASSES_WITHIN (pass_tree_no_loop)
          NEXT_PASS (pass_slp_vectorize);

pass would then be unconditional and only run on non-loop BBs.  Note
that SLP still leaves us with FRE opportunities so this will probably
not solve the issue fully.  Pass reorderings are also always tricky...
(adding passes is easy but piling these up over the time is bad).

More information about the Gcc-bugs mailing list