This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Fixing inconsistent uses of address costs


On Fri, Mar 27, 2015 at 5:43 PM, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
>
> On 27/03/15 03:29, Bin.Cheng wrote:
>>
>> On Thu, Mar 26, 2015 at 11:40 PM, Kyrill Tkachov <kyrylo.tkachov@arm.com>
>> wrote:
>>>
>>> Hi all,
>>>
>>> I'd like to attempt to make GCC's usage of costs in the backends
>>> consistent. We have a lot of different types: rtx costs, address
>>> costs, regmove costs, vector costs etc. Some of them are used in
>>> different units, compared against different types of costs and in
>>> general are a bit of a mess. For now I'd like to clean up address
>>> costs since they are not used as much as some other costs and seem to
>>> me to be a slightly simpler task.
>>>
>>>  From what I can see, address costs, i.e. the TARGET_ADDRESS_COST
>>> hook (its current shape is sketched just after this list), are used
>>> in 5 places overall:
>>> * fwprop.c: forward propagation tries to propagate an rtx into an
>>> address, compares the address cost of the old address with that of
>>> the new one, and picks the most profitable.
>>>
>>> * postreload.c: Again, tries to replace an address with a new one,
>>> compares the two address costs and picks the more profitable.
>>>
>>> * tree-ssa-loop-ivopts.c: A bit more involved here. From what I can
>>> tell it is used to assign a cost to various addressing modes, but in
>>> the computation_cost function it ends up adding/mixing (and thus
>>> comparing) the address cost with rtx costs.
>>>
>>> * Some targets use address costs as part of their calculation of rtx
>>> costs
>>> for, say, a memory access instruction.
>>>
>>> * The final and worst part is in loop-invariant.c in the
>>> create_new_invariant function that needs to get a feel for how
>>> 'cheap' an address is and for that it compares the address cost to a
>>> magic number 3. See the thread that introduced it at
>>> http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html.
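>>>
>>> (For reference, the hook we're talking about currently has roughly
>>> this shape -- paraphrased from the docs, not quoted verbatim:
>>>
>>>   /* Return the cost of ADDRESS used as a memory address in MODE and
>>>      address space AS, optimising for speed or size depending on
>>>      SPEED.  */
>>>   int TARGET_ADDRESS_COST (rtx address, machine_mode mode,
>>>                            addr_space_t as, bool speed);
>>>
>>> with no constraint on what units the returned number is in.)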
>>>
>>> There are a few issues with the above usages:
>>> - The magic number in loop-invariant.c was picked to help bwaves on
>>> x86 and makes no analytical sense for any other target.
>>>
>>> - Some targets use COSTS_N_INSNS units for their address costs, others
>>> use normal scalar units (every target that doesn't define the address
>>> costs hook ends up going through rtx costs, which use COSTS_N_INSNS).
>>> I think this rather undermines the usage in tree-ssa-loop-ivopts.c,
>>> which uses rtx costs in COSTS_N_INSNS and address costs
>>> interchangeably.
>>>
>>> An insight here is that address costs have two types of usage:
>>> 1. To compare two addresses and choose whether to replace one by the
>>> other (as in fwprop and postreload).
>>> 2. To give a metric of cheapness for use in loop optimisations.
>>>
>>> So I propose to split the two uses up. For the fwprop and postreload
>>> case, introduce a hook that will take two addresses 'x' and 'y' and
>>> determine whether to substitute 'x' by 'y'. This can use address
>>> costs and allow the backend to do as deep a comparison between two
>>> addresses as it could possibly want, unconstrained by the requirement
>>> to assign just a single number to each one individually and having
>>> the midend just pick the smallest number. We could call it something
>>> like TARGET_ADDRESS_PREFERABLE_P. Then get fwprop and postreload to
>>> use that to determine whether to substitute one address for another.
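>>>
>>> To make the idea concrete, the sketch I have in mind is something
>>> like the following (only a sketch: the name, the signature and the
>>> default implementation are all up for discussion):
>>>
>>>   /* Hypothetical new hook: return true if address Y should be
>>>      preferred over address X in MODE / address space AS.  A default
>>>      implementation could simply fall back to the existing
>>>      per-address costs.  */
>>>   static bool
>>>   default_address_preferable_p (rtx x, rtx y, machine_mode mode,
>>>                                 addr_space_t as, bool speed)
>>>   {
>>>     return (address_cost (y, mode, as, speed)
>>>             < address_cost (x, mode, as, speed));
>>>   }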
>>>
>>> What to do with the remaining address costs hook and its uses in
>>> tree-ssa-loop-ivopts and loop-invariant? Since tree-ssa-loop-ivopts
>>> uses the address costs interchangeably with rtx costs, I think we
>>> should at least convert all address costs to use the same units:
>>> COSTS_N_INSNS. This raises a question: could we not just use rtx
>>> costs for the addresses? Someone with more familiarity with the code
>>> can correct me, but is the address cost usage in that file supposed
>>> to reflect the cost of computing the address outside of an addressing
>>> mode? If so, then shouldn't rtx costs be used? Conversely, if address
>>> costs are a measure of something different, surely we shouldn't be
>>> adding or comparing them with rtx costs, but rather create some
>>> two-dimensional structure to compare two rtx/address entities
>>> (similarly to the rtx cost/latency calculations in expmed for the
>>> mult synthesis code)?
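>>>
>>> To illustrate the unit I mean: a target hook written in COSTS_N_INSNS
>>> units would look roughly like the sketch below (hypothetical target,
>>> not taken from any real backend), so that its return values are
>>> directly comparable with rtx costs:
>>>
>>>   static int
>>>   example_address_cost (rtx x, machine_mode mode ATTRIBUTE_UNUSED,
>>>                         addr_space_t as ATTRIBUTE_UNUSED,
>>>                         bool speed ATTRIBUTE_UNUSED)
>>>   {
>>>     /* Plain register or reg+const addresses cost nothing extra.  */
>>>     if (REG_P (x)
>>>         || (GET_CODE (x) == PLUS
>>>             && REG_P (XEXP (x, 0))
>>>             && CONST_INT_P (XEXP (x, 1))))
>>>       return COSTS_N_INSNS (0);
>>>     /* Anything fancier is charged as roughly one extra instruction,
>>>        in the same unit rtx costs use.  */
>>>     return COSTS_N_INSNS (1);
>>>   }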
>>>
>>> Then there's the usage in loop-invariant.c where the address cost
>>> number is compared with '3'. This is really not portable across
>>> targets. The code wants some measure of whether an address is
>>> 'cheap'. There could be various approaches on how to provide that.
>>> Could we enumerate across an rtx array of representatives of all
>>> possible addresses for a target sorted by the
>>> TARGET_ADDRESS_PREFERABLE_P hook above? Should we be asking the
>>> target to tell us which address type is cheap? In any case, it seems
>>> to me that the cheapest address in a target is something that only
>>> the target can tell, unless we agree on a universal meaning and unit
>>> of measurement for address costs.
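>>>
>>> For reference, the pattern I'm objecting to has roughly this shape
>>> (paraphrased, not the exact code in create_new_invariant):
>>>
>>>   /* An address is considered 'cheap' iff its target-reported cost
>>>      is below a fixed threshold -- the magic '3' mentioned above.  */
>>>   bool cheap_address
>>>     = address_cost (XEXP (mem, 0), GET_MODE (mem),
>>>                     MEM_ADDR_SPACE (mem), speed) < 3;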
>>>
>>> What do people think? Would this approach be worth pursuing if the
>>> above questions can be answered? From having a quick look at config/
>>> I think that converting the targets to use COSTS_N_INSNS units would
>>> not be a controversial task as long as the midend usage of address
>>> costs is consistent.
>>
>> I totally agree that the use of address cost should be cleaned up.  My
>> point is that the use is inconsistent at the RTL level, maybe even
>> worse than in the midend.  Look at arm_arm_address_cost: it defines
>> costs that are meaningless with respect to both rtx cost and address
>> cost.  It makes costs up for RTL passes like fwprop_addr (maybe
>> postreload too) so that these passes can propagate as many operations
>> as possible into address expressions and save alu instructions (the
>> fwprop_addr pass may have other more important uses, please correct me
>> here).
>
> Hi Bin,
>
> fwprop does try to propagate rtxes into addresses and it has a
> heuristic by which it checks the address costs. If one is clearly
> cheaper than the other according to these costs it picks that. If the
> costs are equal it picks the one with the *highest* rtx cost in the
> hope that it will help eliminate more costly expressions.
> I agree that the arm address costs are just a hack that was picked to
> get the least bad behaviour from fwprop and postreload.
> I'm proposing to make a hook that takes two addresses and returns the
> preferable one. That way, the backend doesn't have to assign numbers
> to addresses and can do as deep a relative analysis as it wants.
As far as fwprop itself is concerned, IMHO the problem is with how the
address cost interface is used, rather than how the interface is
implemented.

>
>>    IMHO, this does the right optimization but
>> with the wrong method, and it has fallout too.  For example, the code
>> transformation below has been reported more than once before:
>>      add  r5, r6, r7
>>      ldr r8, [r5]
>>      ldr r8, [r5, 4]
>>
>> changed into:
>>
>>      add  r5, r6, r7
>>      ldr r8, [r6, r7]
>>      ldr r8, [r5, 4]
>>
>> In this case, it corrupts a load-pair opportunity and doesn't give any
>> benefit.  The same problem holds for auto-inc-dec optimization.
>>
>> So, here fwprop_addr is doing context-sensitive optimization with
>> respect to address cost, which isn't always right.  I once tried to
>> re-implement the pass so that it first checks the saved rtx costs and
>> then the address cost benefit.  I believe others have also tried the
>> same thing with different cases.
>
>
> I had looked at this as well. I was trying to get fwprop to propagate a
> sign-extend
> into the addressing mode in aarch64 and hit something similar.
>
>>    Anyway, it never concluded because
>> there was no obvious performance change.  One or two alu instructions
>> saved don't matter that much on advanced processors, not to mention it
>> might disturb the combiner.
>
>
> I think it would be worth pursuing. My first stab at this in fwprop
> would be to pick the preferable address using the
> TARGET_ADDRESS_PREFERABLE_P hook above.
> The problem comes in replicating the current behaviour. If the hook
> just returns true/false, then we lose the heuristic we currently use
> when the addresses are of equal cost. A possible way around this that
> I'm experimenting with is to make this new hook return {-1,0,1} rather
> than a boolean, with -1/+1 representing a definite preference and 0
> representing no preference. That way fwprop can tell when there's no
> clear preference and pick the more expensive one rtx-costs-wise.
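>
> Roughly, fwprop could then consume the three-way answer like this (a
> sketch only; the hook name and the set_src_cost call are placeholders
> for whatever we end up with):
>
>   /* -1: keep OLD_ADDR, +1: prefer NEW_ADDR, 0: no clear preference.  */
>   int pref = address_preference (old_addr, new_addr, mode, as, speed);
>   bool should_replace;
>   if (pref != 0)
>     should_replace = (pref > 0);
>   else
>     /* No clear winner: keep today's heuristic and prefer the address
>        with the higher rtx cost, hoping to eliminate the more
>        expensive computation elsewhere.  */
>     should_replace = (set_src_cost (new_addr, speed)
>                       > set_src_cost (old_addr, speed));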
>
>>
>> As for tree ivopts, address cost is used in both ways.  For any
>> address computation that's invalid, it tries to legitimize it into
>> two parts: the first part results in alu instructions, the second
>> part results in an address expression using one of the addressing
>> modes.  Right now the rtx cost (for the first part) and the address
>> cost (for the second part) are accumulated and compared altogether.
>> I do like the idea of splitting costs into different types because
>> all the elements are entangled in a single cost model, and it's hard
>> to tune for a specific case.
>
>
> Thanks for explaining.
> I think it would be possible to make the comparisons there a bit more
> sane. If an address computation is split into alu instructions and a
> legitimate address, then carry around the rtx cost of the alu part and
> the address. When the time comes to compare two computations, we
> create a more involved way of comparing. For example (would need
> benchmarking, of course):
> * Compare the rtx costs of the alu components and check the address
> preference for the legitimate address components using the new hook.
> * If the alu parts are of equal rtx cost, pick the one which has the
> preferable legitimate address.
> * If expression 'a' has a more expensive alu component than 'b' but a
> more preferable address component, then use some tie breaker. Perhaps
> apply rtx costs to the address expressions and compare those, as in
> the sketch below.
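>
> Something along these lines (a sketch only; alu_cost_* are the
> accumulated rtx costs of the alu parts, addr_* the legitimate address
> parts, address_preferable_p is the hypothetical hook from above, and
> set_src_cost stands in for whatever rtx cost query we settle on):
>
>   /* Return true if computation B should be preferred over A.  */
>   static bool
>   computation_preferable_p (int alu_cost_a, rtx addr_a,
>                             int alu_cost_b, rtx addr_b,
>                             machine_mode mode, addr_space_t as,
>                             bool speed)
>   {
>     bool b_addr_better = address_preferable_p (addr_a, addr_b, mode,
>                                                as, speed);
>     if (alu_cost_a == alu_cost_b)
>       return b_addr_better;
>     bool b_alu_cheaper = alu_cost_b < alu_cost_a;
>     if (b_alu_cheaper == b_addr_better)
>       return b_alu_cheaper;
>     /* Conflicting verdicts: tie-break on the rtx cost of the address
>        expressions themselves.  */
>     return (set_src_cost (addr_b, speed)
>             < set_src_cost (addr_a, speed));
>   }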
>
>
> I think that if we fixed up the usage in tree-ssa-loop-ivopts.c the
> way described above, and figured out what to do for the 'cheap
> address' heuristic in loop-invariant.c, we could eliminate the use of
> address costs altogether. Or at least make them a backend-specific
> measure that each backend can use in the process of determining a
> preferable address, but that would not have to be mixed with rtx costs
> or magic constants in the midend.

Forgot to mention.  Just like Sandra commented, I think what is really
important is how legitimize_address works and how IVOPT mimics the
legitimization behavior and generates the final code, rather than the
cost for different addressing modes.  Actually aarch64 and some other
targets return exactly the same cost (0) for all addressing modes, and
I think it really doesn't matter very much.
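
For illustration, a hook of that flavour has just this shape (not the
actual aarch64 implementation, only a sketch):

  static int
  flat_address_cost (rtx x ATTRIBUTE_UNUSED,
                     machine_mode mode ATTRIBUTE_UNUSED,
                     addr_space_t as ATTRIBUTE_UNUSED,
                     bool speed ATTRIBUTE_UNUSED)
  {
    /* Every legitimate address is treated as equally cheap.  */
    return 0;
  }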

Thanks,
bin
>
> Thanks,
> Kyrill
>
>
>>
>> Please correct me if I was wrong.
>>
>> Thanks,
>> bin
>>
>

