This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: V2 [PATCH] i386: Add pass_remove_partial_avx_dependency


On 11/5/18 7:21 AM, Jan Hubicka wrote:
>>
>> Did you mean "the nearest common dominator"?
> 
> If the nearest common dominator appears in the loop while all uses are
> out of loops, this will result in suboptimal xor placement.
> In this case you want to split edges out of the loop.
> 
> In general this is what the LCM framework will do for you if the problem
> is modelled in a similar way as in mode_switching.  At function entry the mode is
> "no zero register needed" and all conversions need mode "zero register
> needed".  Mode switching should then do the correct placement decisions
> (reaching minimal number of executions of xor).
> 
> Jeff, what is your opinion on the approach taken by the patch?
> It seems like a special case of a more general issue, but I do not see
> a very elegant way to solve it, at least in the GCC 9 horizon, so if
> the placement is correct we can probably go either with a new pass or
> making this part of mode switching (which is anyway run by the x86 backend).
So I haven't followed this discussion at all, but I did touch on this
issue a month or two ago with a target patch that was trying to avoid
the partial stalls.

My assumption is that we're trying to find one or more places to
initialize the upper half of an avx register so as to avoid a partial
register stall at existing sites that set the upper half.

This sounds like a classic PRE/LCM style problem (of which mode
switching is just another variant).   A common-dominator approach is
closer to a classic GCSE and is going to result in more initializations
at sub-optimal points than a PRE/LCM style.

The only advantage a common-dominator approach would have that I could
think of would be potentially further separating the initialization from
the subsequent use points, which might avoid store-store stalls or somesuch.  I
doubt this effect would be enough to overcome the inherent advantages of
a PRE/LCM approach.
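
To illustrate the loop concern raised in the quoted text, here is a minimal sketch that computes the nearest common dominator of two use blocks. The CFG, the block names and both helper functions are hypothetical and are not taken from the patch or from GCC.

```python
def dominators(cfg, entry):
    """Iterative dominator sets: dom[b] = {b} | intersection of dom[p] over preds p."""
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def nearest_common_dominator(dom, blocks):
    common = set.intersection(*(dom[b] for b in blocks))
    # The nearest common dominator is dominated by all the others,
    # so it is the one with the largest dominator set.
    return max(common, key=lambda b: len(dom[b]))


# Hypothetical CFG: "loop" iterates on itself, then exits to one of two use blocks.
cfg = {
    "entry": ["loop"],
    "loop": ["loop", "use1", "use2"],
    "use1": ["exit"],
    "use2": ["exit"],
    "exit": [],
}
dom = dominators(cfg, "entry")
ncd = nearest_common_dominator(dom, ["use1", "use2"])
```

In this toy CFG the nearest common dominator of the two use blocks is the loop block itself, so an initialization placed there would execute on every iteration even though both uses sit outside the loop.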


ISTM that if we were to scan the RTL noting which instructions set the
upper part of the avx register, which instructions have potential stalls
and which instructions reset the upper half to an indeterminate state
(calls), then we have the local properties.  Then we feed that into a
traditional LCM solver and we get back the optimal points.
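
A rough sketch of what that per-block scan could look like. The instruction kinds, the `LocalProps` record and the exact definitions of the three properties are assumptions made for illustration; they are not GCC RTL codes and not the interface of GCC's LCM code.

```python
from dataclasses import dataclass

# Hypothetical instruction kinds for this sketch (not GCC RTL codes).
SET_UPPER = "set_upper"  # writes the whole register, so the upper part is defined
PARTIAL = "partial"      # partial write that may stall unless the upper part is defined
CALL = "call"            # leaves the upper part in an indeterminate state
OTHER = "other"          # anything irrelevant to the problem


@dataclass(frozen=True)
class LocalProps:
    antloc: bool  # block needs a defined upper part on entry
    comp: bool    # upper part is defined at block exit
    transp: bool  # block never destroys the state (no call)


def local_properties(insns):
    antloc = False
    defined = False  # has the state been established inside the block so far?
    transp = True
    for kind in insns:
        if kind == CALL:
            defined = False
            transp = False
        elif kind == SET_UPPER:
            defined = True
        elif kind == PARTIAL:
            # The need is visible at block entry only if nothing earlier
            # in the block either satisfied it or destroyed the state.
            if not defined and transp:
                antloc = True
            # Assumption: the need is met here one way or another, so the
            # state counts as defined afterwards.
            defined = True
    return LocalProps(antloc, defined, transp)
```

For example, `local_properties([CALL, PARTIAL])` reports a block that is neither anticipatable nor transparent, which under these assumptions means the initialization has to go inside the block after the call rather than being hoisted above it.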

The only weirdness is that we don't want to move existing instructions
that set the upper bits.  Those are essentially fixed.  So maybe this is
actually better modeled by Click's algorithm which has the concept of
pinned instructions.  But Click's algorithm assumes SSA.  Ugh.

I'd probably have to sit down with it for a while -- it might be
possible to handle the fixed instructions using some of the ideas from
Click (essentially exposing the earliest/latest results from LCM, then
picking a point on the dominator path between earliest and latest).
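
A very rough sketch of that last step, assuming the earliest and latest blocks are already known and that earliest dominates latest. The `idom` and `loop_depth` maps, the function names and the selection heuristic (shallowest loop nest, ties going to the block closest to latest) are all assumptions for illustration, not something the patch or LCM provides.

```python
def dominator_path(idom, earliest, latest):
    """Blocks from `latest` up to `earliest` along the immediate-dominator chain."""
    path = [latest]
    while path[-1] != earliest:
        parent = idom.get(path[-1])
        if parent is None:
            raise ValueError("earliest does not dominate latest")
        path.append(parent)
    return path


def pick_placement(idom, loop_depth, earliest, latest):
    # Assumed heuristic: prefer the shallowest loop nest. min() returns the
    # first minimum, and the path starts at `latest`, so ties stay as late
    # as possible.
    path = dominator_path(idom, earliest, latest)
    return min(path, key=lambda b: loop_depth[b])


# Hypothetical dominator tree: entry -> pre -> header -> body.
idom = {"pre": "entry", "header": "pre", "body": "header"}
loop_depth = {"entry": 0, "pre": 0, "header": 1, "body": 1}
choice = pick_placement(idom, loop_depth, "entry", "body")
```

Here the candidates are `body`, `header`, `pre` and `entry`, and the heuristic settles on `pre`, the latest block that is outside the loop.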



jeff

