This is the mail archive of the
mailing list for the GCC project.
Re: V2 [PATCH] i386: Add pass_remove_partial_avx_dependency
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: Jeff Law <law at redhat dot com>
- Cc: Jan Hubicka <hubicka at ucw dot cz>, GCC Patches <gcc-patches at gcc dot gnu dot org>, "Pandey, Sunil K" <sunil dot k dot pandey at intel dot com>, Uros Bizjak <ubizjak at gmail dot com>, Hongtao Liu <hongtao dot liu at intel dot com>
- Date: Sun, 30 Dec 2018 08:40:50 -0800
- Subject: Re: V2 [PATCH] i386: Add pass_remove_partial_avx_dependency
- References: <CAMe9rOryQB=OT=RAff6BkLbbOrx9tDB1by6M_tnuYPBxmXQ5mQ@mail.gmail.com> <email@example.com> <firstname.lastname@example.org> <email@example.com> <CAMe9rOpB3nj4LQ_yuFpB25bHhiv=1YEUL210f0s9DH6D3nS-EA@mail.gmail.com> <firstname.lastname@example.org>
On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <email@example.com> wrote:
> On 11/28/18 12:48 PM, H.J. Lu wrote:
> > On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <firstname.lastname@example.org> wrote:
> >>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> >>>>> Did you mean "the nearest common dominator"?
> >>>> If the nearest common dominator appears in the loop while all uses are
> >>>> out of loops, this will result in suboptimal xor placement.
> >>>> In this case you want to split edges out of the loop.
> >>>> In general this is what the LCM framework will do for you if the problem
> >>>> is modelled in a similar way as in mode_switching. At function entry the mode is
> >>>> "no zero register needed" and all conversions need mode "zero register
> >>>> needed". Mode switching should then do the correct placement decisions
> >>>> (reaching minimal number of executions of xor).
> >>>> Jeff, what is your opinion on the approach taken by the patch?
> >>>> It seems like a special case of a more general issue, but I do not see
> >>>> a very elegant way to solve it, at least in the GCC 9 horizon, so if
> >>>> the placement is correct we can probably go either with a new pass or
> >>>> with making this part of mode switching (which is run by the x86 backend anyway).
> >>> So I haven't followed this discussion at all, but did touch on this
> >>> issue with some patch a month or two ago with a target patch that was
> >>> trying to avoid the partial stalls.
> >>> My assumption is that we're trying to find one or more places to
> >>> initialize the upper half of an avx register so as to avoid partial
> >>> register stall at existing sites that set the upper half.
> >>> This sounds like a classic PRE/LCM style problem (of which mode
> >>> switching is just another variant). A common-dominator approach is
> >>> closer to a classic GCSE and is going to result in more initializations
> >>> at sub-optimal points than a PRE/LCM style.
> >> Yes, it is the usual code placement problem. It is a special case because the
> >> zero register is not modified by the conversion (we just need to have a
> >> zero somewhere). So basically we do not have kills of the zero except
> >> for the entry block.
> > Do you have a testcase to show that the nearest common dominator
> > in the loop, while all uses are out of loops, leads to suboptimal xor
> > placement?
> I don't have a testcase, but it's all but certain nearest common
> dominator is going to be a suboptimal placement. That's going to create
> paths where you're going to emit the xor when it's not used.
> The whole point of the LCM algorithms is they are optimal in terms of
> expression evaluations.
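
To make the placement trade-off under discussion concrete, here is a toy count in C. It is only a sketch under stated assumptions: the five-block CFG, the block names and the helper count_xor_executions are all hypothetical and are not part of the patch or of GCC. The model simply counts how often an execution path visits the block where the xor would be placed.

```c
#include <stddef.h>

/* Hypothetical CFG, for illustration only:
   B_ENTRY -> B_DOM, B_DOM -> B_HEADER, B_DOM -> B_EXIT,
   B_HEADER -> B_USE, B_USE -> B_HEADER, B_HEADER -> B_EXIT.
   B_DOM dominates the only block that needs the zero register (B_USE),
   and B_USE sits inside a loop.  */
enum { B_ENTRY = 0, B_DOM = 1, B_HEADER = 2, B_USE = 3, B_EXIT = 4 };

/* Count how many times PATH visits BLOCK.  In this toy model that is
   the number of times an xor placed at the head of BLOCK executes.  */
static size_t
count_xor_executions (int block, const int *path, size_t len)
{
  size_t n = 0;
  for (size_t i = 0; i < len; i++)
    if (path[i] == block)
      n++;
  return n;
}

/* A path that bypasses the loop, so the conversion never runs.  */
static const int bypass_path[] = { B_ENTRY, B_DOM, B_EXIT };

/* A path that runs the loop body three times.  */
static const int loop_path[] = {
  B_ENTRY, B_DOM, B_HEADER, B_USE, B_HEADER, B_USE,
  B_HEADER, B_USE, B_HEADER, B_EXIT
};
```

On bypass_path, placing the xor in B_DOM executes it once although it is never used, while placing it in B_USE executes it zero times. On loop_path the picture flips: the B_DOM placement still executes once, while the B_USE placement executes once per iteration.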
We tried LCM and it didn't work well for this case. LCM places a single
VXOR close to the location where it is needed, which can be inside a
loop. There is nothing wrong with the LCM algorithms, but they don't help
in the case where VXOR is executed multiple times inside a function instead of
just once. We are investigating generating a single VXOR at the entry of the
nearest dominator for basic blocks with SF/DF conversions, which is in
the fake loop that contains the whole function:
bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
!= EXIT_BLOCK_PTR_FOR_FN (cfun))
bb = get_immediate_dominator (CDI_DOMINATORS,
insn = BB_HEAD (bb);
if (!NONDEBUG_INSN_P (insn))
insn = next_nonnote_nondebug_insn (insn);
set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
set_insn = emit_insn_before (set, insn);
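
The idea of taking the nearest common dominator and then moving out of any enclosing loop can be sketched on a toy CFG in plain C. This is a standalone illustration, not the GCC implementation: the arrays toy_idom and toy_loop_header and the helpers below are hypothetical stand-ins for GCC's dominator and loop data structures, and the hoisting rule (step to the immediate dominator of the loop header until no real loop encloses the block) is an assumption about how such a walk could look.

```c
#include <assert.h>

/* Hypothetical CFG with one loop (header 2, body 3 and 4, latch 5):
   0 -> 1 -> 2, 2 -> 3, 2 -> 4, 3 -> 5, 4 -> 5, 5 -> 2, 5 -> 6.
   toy_idom[b] is the immediate dominator of block b (-1 for entry).  */
static const int toy_idom[] = { -1, 0, 1, 2, 2, 2, 5 };

/* toy_loop_header[b] is the header of the innermost real loop that
   contains b, or -1 if b is only in the whole-function pseudo loop.  */
static const int toy_loop_header[] = { -1, -1, 2, 2, 2, 2, -1 };

/* Depth of block B in the dominator tree.  */
static int
dom_depth (const int *idom, int b)
{
  int d = 0;
  while (idom[b] >= 0)
    {
      b = idom[b];
      d++;
    }
  return d;
}

/* Nearest common dominator of blocks A and B.  */
static int
nearest_common_dom (const int *idom, int a, int b)
{
  int da = dom_depth (idom, a);
  int db = dom_depth (idom, b);
  while (da > db) { a = idom[a]; da--; }
  while (db > da) { b = idom[b]; db--; }
  while (a != b) { a = idom[a]; b = idom[b]; }
  return a;
}

/* Nearest common dominator of the N blocks in BLOCKS.  */
static int
nearest_common_dom_for_set (const int *idom, const int *blocks, int n)
{
  assert (n > 0);
  int bb = blocks[0];
  for (int i = 1; i < n; i++)
    bb = nearest_common_dom (idom, bb, blocks[i]);
  return bb;
}

/* Walk BB out of every real loop by stepping to the immediate
   dominator of the enclosing loop's header.  */
static int
hoist_out_of_loops (const int *idom, const int *loop_header, int bb)
{
  while (loop_header[bb] >= 0)
    bb = idom[loop_header[bb]];
  return bb;
}
```

With conversions in blocks 3 and 4, the nearest common dominator is the loop header, block 2; hoisting out of the loop then selects block 1, which executes once per function invocation regardless of the loop trip count.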