V3 [PATCH] i386: Add pass_remove_partial_avx_dependency

Mon Jan 21 18:54:00 GMT 2019

On 1/7/19 6:55 AM, H.J. Lu wrote:
> On Sun, Dec 30, 2018 at 8:40 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <law@redhat.com> wrote:
>>> On 11/28/18 12:48 PM, H.J. Lu wrote:
>>>> On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubicka@ucw.cz> wrote:
>>>>>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
>>>>>>>> Did you mean "the nearest common dominator"?
>>>>>>> If the nearest common dominator appears in the loop while all uses are
>>>>>>> out of loops, this will result in suboptimal xor placement.
>>>>>>> In this case you want to split edges out of the loop.
>>>>>>>
>>>>>>> In general this is what the LCM framework will do for you if the problem
>>>>>>> is modelled siimlar way as in mode_swtiching.  At entry function mode is
>>>>>>> "no zero register needed" and all conversions need mode "zero register
>>>>>>> needed".  Mode switching should then do the correct placement decisions
>>>>>>> (reaching minimal number of executions of xor).
>>>>>>>
>>>>>>> Jeff, whan is your optinion on the approach taken by the patch?
>>>>>>> It seems like a special case of more general issue, but I do not see
>>>>>>> very elegant way to solve it at least in the GCC 9 horisont, so if
>>>>>>> the placement is correct we can probalby go either with new pass or
>>>>>>> making this part of mode swithcing (which is anyway run by x86 backend)
>>>>>> So I haven't followed this discussion at all, but did touch on this
>>>>>> issue with some patch a month or two ago with a target patch that was
>>>>>> trying to avoid the partial stalls.
>>>>>>
>>>>>> My assumption is that we're trying to find one or more places to
>>>>>> initialize the upper half of an avx register so as to avoid partial
>>>>>> register stall at existing sites that set the upper half.
>>>>>>
>>>>>> This sounds like a classic PRE/LCM style problem (of which mode
>>>>>> switching is just another variant).   A common-dominator approach is
>>>>>> closer to a classic GCSE and is going to result is more initializations
>>>>>> at sub-optimal points than a PRE/LCM style.
>>>>> yes, it is usual code placement problem. It is special case because the
>>>>> zero register is not modified by the conversion (just we need to have
>>>>> zero somewhere).  So basically we do not have kills to the zero except
>>>>> for entry block.
>>>>>
>>>> Do you have  testcase to show thatf the nearest common dominator
>>>> in the loop, while all uses areout of loops, leads to suboptimal xor
>>>> placement?
>>> I don't have a testcase, but it's all but certain nearest common
>>> dominator is going to be a suboptimal placement.  That's going to create
>>> paths where you're going to emit the xor when it's not used.
>>>
>>> The whole point of the LCM algorithms is they are optimal in terms of
>>> expression evaluations.
>> We tried LCM and it didn't work well for this case.  LCM places a single
>> VXOR close to the location where it is needed, which can be inside a
>> loop.  There is nothing wrong with the LCM algorithms.   But this doesn't
>> solve
>>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007
>>
>> where VXOR is executed multiple times inside of a function, instead of
>> just once.   We are investigating to generate a single VXOR at entry of the
>> nearest dominator for basic blocks with SF/DF conversions, which is in
>> the the fake loop that contains the whole function:
>>
>>       bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
>>                                              convert_bbs);
>>       while (bb->loop_father->latch
>>              != EXIT_BLOCK_PTR_FOR_FN (cfun))
>>         bb = get_immediate_dominator (CDI_DOMINATORS,
>>                                       bb->loop_father->header);
>>
>>       insn = BB_HEAD (bb);
>>       if (!NONDEBUG_INSN_P (insn))
>>         insn = next_nonnote_nondebug_insn (insn);
>>       set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
>>       set_insn = emit_insn_before (set, insn);
>>
> Here is the updated patch.  OK for trunk?
> 
> Thanks.
> 
> -- H.J.
> 
> 
> 0001-i386-Add-pass_remove_partial_avx_dependency.patch
> 
> From 6eca7dbf282d7e2a5cde41bffeca66195d72d48e Mon Sep 17 00:00:00 2001
> From: "H.J. Lu" <hjl.tools@gmail.com>
> Date: Mon, 7 Jan 2019 05:44:59 -0800
> Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency
> 
> With -mavx, for
> 
> $ cat foo.i
> extern float f;
> extern double d;
> extern int i;
> 
> void
> foo (void)
> {
>   d = f;
>   f = i;
> }
> 
> we need to generate
> 
> 	vxorp[ds]	%xmmN, %xmmN, %xmmN
> 	...
> 	vcvtss2sd	f(%rip), %xmmN, %xmmX
> 	...
> 	vcvtsi2ss	i(%rip), %xmmN, %xmmY
> 
> to avoid partial XMM register stall.  This patch adds a pass to generate
> a single
> 
> 	vxorps		%xmmN, %xmmN, %xmmN
> 
> at entry of the nearest dominator for basic blocks with SF/DF conversions,
> which is in the fake loop that contains the whole function, instead of
> generating one
> 
> 	vxorp[ds]	%xmmN, %xmmN, %xmmN
> 
> for each SF/DF conversion.
> 
> NB: The LCM algorithm isn't appropriate here since it may place a vxorps
> inside the loop.  Simple testcase show this:
> 
> $ cat badcase.c
> 
> extern float f;
> extern double d;
> 
> void
> foo (int n, int k)
> {
>   for (int j = 0; j != n; j++)
>     if (j < k)
>       d = f;
> }
> 
> It generates
> 
>     ...
>     loop:
>       if(j < k)
>         vxorps  %xmm0, %xmm0, %xmm0
>         vcvtss2sd  %xmm1, %xmm0, %xmm0
>       ...
>     loopend
>     ...
> 
> This is because LCM only works when there is a certain benifit.  But for
> conditional branch, LCM wouldn't move
> 
>    vxorps  %xmm0, %xmm0, %xmm0
It works this way for a reason.  There are obviously paths through the
loop where the conversion does not happen and thus the vxor is not
needed or desirable on those paths.

That's a fundamental property of the LCM algorithm -- it never inserts
an evaluation on a path through the CFG where it will not be used.

Your algorithm of inserting into the dominator block will introduce
runtime executions of the vxor on paths where it is not needed.

It's well known that relaxing that property of LCM can result in better
code generation in some circumstances.  Block copying and loop
restructuring are the gold standard for dealing with this kind of problem.

In this case you could split the iteration space so that you have two
loops.  one for 0..k and the other for k..n.  Note that GCC has support
for this kind of loop restructuring.  This has the advantage that the j
< k test does not happen each iteration of the loop and the vxor stuff
via LCM would be optimal.

There's many other cases where copying and restructuring results in
better common subexpression elimination (which is what you're doing).
Probably the best work I've seen in this space is Bodik's thesis.
Click's work from '95 touches on some of this as well, but isn't as
relevant to this specific instance.

Anyway, whether or not the patch should move forward is really up to Jan
(and Uros if he wants to be involved) I think.  I'm not fundamentally
opposed to HJ's approach as I'm aware of the different tradeoffs.

HJ's approach of pulling into the dominator block can result in
unnecessary evaluations.  But it can also reduce the number of
evaluations in other cases.  It really depends on the runtime behavior
of the code.   One could argue that the vxor stuff we're talking about
is most likely happening in loops, and probably not in deeply nested
control structures within those loops.  Thus pulling them out more
aggressively ala LICM may be better than LCM.

But where to throttle that aggressiveness isn't obvious.  A hybrid of
LICM + LCM would probably be better than either individually.

jeff