V3 [PATCH] i386: Add pass_remove_partial_avx_dependency

Mon Jan 21 21:27:00 GMT 2019

On Mon, Jan 21, 2019 at 10:54 AM Jeff Law <law@redhat.com> wrote:
>
> On 1/7/19 6:55 AM, H.J. Lu wrote:
> > On Sun, Dec 30, 2018 at 8:40 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >> On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <law@redhat.com> wrote:
> >>> On 11/28/18 12:48 PM, H.J. Lu wrote:
> >>>> On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote:
> >>>>>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> >>>>>>>> Did you mean "the nearest common dominator"?
> >>>>>>> If the nearest common dominator appears in the loop while all uses are
> >>>>>>> out of loops, this will result in suboptimal xor placement.
> >>>>>>> In this case you want to split edges out of the loop.
> >>>>>>>
> >>>>>>> In general this is what the LCM framework will do for you if the problem
> >>>>>>> is modelled siimlar way as in mode_swtiching.  At entry function mode is
> >>>>>>> "no zero register needed" and all conversions need mode "zero register
> >>>>>>> needed".  Mode switching should then do the correct placement decisions
> >>>>>>> (reaching minimal number of executions of xor).
> >>>>>>>
> >>>>>>> Jeff, whan is your optinion on the approach taken by the patch?
> >>>>>>> It seems like a special case of more general issue, but I do not see
> >>>>>>> very elegant way to solve it at least in the GCC 9 horisont, so if
> >>>>>>> the placement is correct we can probalby go either with new pass or
> >>>>>>> making this part of mode swithcing (which is anyway run by x86 backend)
> >>>>>> So I haven't followed this discussion at all, but did touch on this
> >>>>>> issue with some patch a month or two ago with a target patch that was
> >>>>>> trying to avoid the partial stalls.
> >>>>>>
> >>>>>> My assumption is that we're trying to find one or more places to
> >>>>>> initialize the upper half of an avx register so as to avoid partial
> >>>>>> register stall at existing sites that set the upper half.
> >>>>>>
> >>>>>> This sounds like a classic PRE/LCM style problem (of which mode
> >>>>>> switching is just another variant).   A common-dominator approach is
> >>>>>> closer to a classic GCSE and is going to result is more initializations
> >>>>>> at sub-optimal points than a PRE/LCM style.
> >>>>> yes, it is usual code placement problem. It is special case because the
> >>>>> zero register is not modified by the conversion (just we need to have
> >>>>> zero somewhere).  So basically we do not have kills to the zero except
> >>>>> for entry block.
> >>>>>
> >>>> Do you have  testcase to show thatf the nearest common dominator
> >>>> in the loop, while all uses areout of loops, leads to suboptimal xor
> >>>> placement?
> >>> I don't have a testcase, but it's all but certain nearest common
> >>> dominator is going to be a suboptimal placement.  That's going to create
> >>> paths where you're going to emit the xor when it's not used.
> >>>
> >>> The whole point of the LCM algorithms is they are optimal in terms of
> >>> expression evaluations.
> >> We tried LCM and it didn't work well for this case.  LCM places a single
> >> VXOR close to the location where it is needed, which can be inside a
> >> loop.  There is nothing wrong with the LCM algorithms.   But this doesn't
> >> solve
> >>
> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007
> >>
> >> where VXOR is executed multiple times inside of a function, instead of
> >> just once.   We are investigating to generate a single VXOR at entry of the
> >> nearest dominator for basic blocks with SF/DF conversions, which is in
> >> the the fake loop that contains the whole function:
> >>
> >>       bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
> >>                                              convert_bbs);
> >>       while (bb->loop_father->latch
> >>              != EXIT_BLOCK_PTR_FOR_FN (cfun))
> >>         bb = get_immediate_dominator (CDI_DOMINATORS,
> >>                                       bb->loop_father->header);
> >>
> >>       insn = BB_HEAD (bb);
> >>       if (!NONDEBUG_INSN_P (insn))
> >>         insn = next_nonnote_nondebug_insn (insn);
> >>       set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
> >>       set_insn = emit_insn_before (set, insn);
> >>
> > Here is the updated patch.  OK for trunk?
> >
> > Thanks.
> >
> > -- H.J.
> >
> >
> > 0001-i386-Add-pass_remove_partial_avx_dependency.patch
> >
> > From 6eca7dbf282d7e2a5cde41bffeca66195d72d48e Mon Sep 17 00:00:00 2001
> > From: "H.J. Lu" <hjl.tools@gmail.com>
> > Date: Mon, 7 Jan 2019 05:44:59 -0800
> > Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency
> >
> > With -mavx, for
> >
> > $ cat foo.i
> > extern float f;
> > extern double d;
> > extern int i;
> >
> > void
> > foo (void)
> > {
> >   d = f;
> >   f = i;
> > }
> >
> > we need to generate
> >
> >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> >       ...
> >       vcvtss2sd       f(%rip), %xmmN, %xmmX
> >       ...
> >       vcvtsi2ss       i(%rip), %xmmN, %xmmY
> >
> > to avoid partial XMM register stall.  This patch adds a pass to generate
> > a single
> >
> >       vxorps          %xmmN, %xmmN, %xmmN
> >
> > at entry of the nearest dominator for basic blocks with SF/DF conversions,
> > which is in the fake loop that contains the whole function, instead of
> > generating one
> >
> >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> >
> > for each SF/DF conversion.
> >
> > NB: The LCM algorithm isn't appropriate here since it may place a vxorps
> > inside the loop.  Simple testcase show this:
> >
> > $ cat badcase.c
> >
> > extern float f;
> > extern double d;
> >
> > void
> > foo (int n, int k)
> > {
> >   for (int j = 0; j != n; j++)
> >     if (j < k)
> >       d = f;
> > }
> >
> > It generates
> >
> >     ...
> >     loop:
> >       if(j < k)
> >         vxorps  %xmm0, %xmm0, %xmm0
> >         vcvtss2sd  %xmm1, %xmm0, %xmm0
> >       ...
> >     loopend
> >     ...
> >
> > This is because LCM only works when there is a certain benifit.  But for
> > conditional branch, LCM wouldn't move
> >
> >    vxorps  %xmm0, %xmm0, %xmm0
> It works this way for a reason.  There are obviously paths through the
> loop where the conversion does not happen and thus the vxor is not
> needed or desirable on those paths.
>
> That's a fundamental property of the LCM algorithm -- it never inserts
> an evaluation on a path through the CFG where it will not be used.
>
> Your algorithm of inserting into the dominator block will introduce
> runtime executions of the vxor on paths where it is not needed.
>
> It's well known that relaxing that property of LCM can result in better
> code generation in some circumstances.  Block copying and loop
> restructuring are the gold standard for dealing with this kind of problem.
>
> In this case you could split the iteration space so that you have two
> loops.  one for 0..k and the other for k..n.  Note that GCC has support
> for this kind of loop restructuring.  This has the advantage that the j
> < k test does not happen each iteration of the loop and the vxor stuff
> via LCM would be optimal.
>
> There's many other cases where copying and restructuring results in
> better common subexpression elimination (which is what you're doing).
> Probably the best work I've seen in this space is Bodik's thesis.
> Click's work from '95 touches on some of this as well, but isn't as
> relevant to this specific instance.
>
> Anyway, whether or not the patch should move forward is really up to Jan
> (and Uros if he wants to be involved) I think.  I'm not fundamentally
> opposed to HJ's approach as I'm aware of the different tradeoffs.
>
> HJ's approach of pulling into the dominator block can result in
> unnecessary evaluations.  But it can also reduce the number of
> evaluations in other cases.  It really depends on the runtime behavior
> of the code.   One could argue that the vxor stuff we're talking about
> is most likely happening in loops, and probably not in deeply nested
> control structures within those loops.  Thus pulling them out more
> aggressively ala LICM may be better than LCM.

True, there is a trade-off.   My approach inserts a vxorps at the last possible
position.  Yes, vxorps will always be executed even if it may not be required.
Since it is executed only once in all cases, it is a win overall.

> But where to throttle that aggressiveness isn't obvious.  A hybrid of
> LICM + LCM would probably be better than either individually.
>

Thanks.

-- 
H.J.