This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Fixing inconsistent uses of address costs



On 27/03/15 03:29, Bin.Cheng wrote:
On Thu, Mar 26, 2015 at 11:40 PM, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
Hi all,

I'd like to attempt to make GCC's usage of costs in the backends consistent.
We have a lot of different types: rtx costs, address costs, regmove costs,
vector costs etc. Some of them are use in different units, compared against
different types of costs and in general are a bit of a mess. For now I'd
like
to clean up address costs since they are actually not used as much as some
other costs and seem to me to be a slightly simpler task.

 From what I can see, address costs i.e. the TARGET_ADDRESS_COST hook are
used
in 5 places overall:
* fwprop.c: forward propagation tries to propagate an rtx into an address
and compare the address cost of the old address and the one it wants to
propagate into it and picks the most profitable.

* postreload.c: Again, tries to replace an address with a new one and
compares
the two address costs and picks the most profitable.

* tree-ssa-loop-ivopts.c: A bit more involved here. From what I can tell it
is
used to assign a cost to various addressing modes, but ends up comparing but
in the computation_cost function adds/mixes the address cost with rtx costs.

* Some targets use address costs as part of their calculation of rtx costs
for, say, a memory access instruction.

* The final and worst part is in loop-invariant.c in the
create_new_invariant
function that needs to get a feel for how 'cheap' an address is and for that
it compares the address cost to a magic number 3. See the thread that
introduced it at http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html.

There are a few issues wiht the above usages:
- The magic number in loop-invariant.c was picked to help bwaves on x86 and
makes no analytical sense for any other target.

- Some targets use COSTS_N_INSNS units for their address costs, others use
normal scalar units (every target that doesn't define the address costs hook
ends up going through rtx costs which use COSTS_N_INSNS).
I think this rather undermines the usage in tree-ssa-loop-ivopts.c which
uses
rtx costs in COSTS_N_INSNS and address costs interchangeably.

An insight here is that address costs have two types of usage:
1. To compare two addresses and choose whether to replace one by the other
(as in fwprop and postreload)
2. To give a metric of cheapness for use in loop optimisations.

So I propose to split the two uses up. For the fwprop and postreload case
introduce a hook that will take two addresses 'x' and 'y' and determine
whether to substitute 'x' by 'y'. This can use address costs and allow the
backend to do as deep comparison between two addresses as it could possibly
want, unconstrained by the requirement to assign just a single number to
each one individually and having the midend just pick the smallest number.
We can call it something like TARGET_ADDRESS_PREFERABLE_P or something like
that. Then get fwprop and postreload to use that to determine whether to
substitute one address for another.

What to do with the remaining address costs hook and its uses in
tree-ssa-loop-ivopts and loop-invariant? Since tree-ssa-loop-ivopts uses
the address costs interchangeably with rtx costs I think we should at least
convert all address costs to use the same units: COSTS_N_INSNS. This raises
a question: could we not just use rtx costs for the addresses? Someone with
more familiarity with the code can correct me, but is the address cost usage
in that file supposed to reflect the cost of computing the address outside
of
an addressing mode? If so, then shouldn't rtx costs be used? Conversely, if
address costs are a measure of something different, surely we shouldn't be
adding or comparing them with rtx costs, but rather create some
two-dimensional
structure to compare two rtx/address entities? (similarly to the
rtx cost/latency calculations in expmed for the mult synthesis code).

Then there's the usage in loop-invariant.c where the address cost number is
compared with '3'. This is really not portable across targets. The code
wants
some measure of whether an address is 'cheap'. There could be various
approaches on how to provide that.
Could we enumerate across an rtx array of representatives of
all possible addresses for a target sorted by the
TARGET_ADDRESS_PREFERABLE_P
hook above? Should we be asking the target to tell us which address type is
cheap? In any case, it seems to me that the cheapest address in a target is
something that only the target can tell, unless we agree on a universal
meaning and unit of measurement for address costs.

What do people think? Would this approach be worth pursuing if the above
questions can be answered? From having a quick look at config/ I think that
converting the targets to use COSTS_N_INSNS units would not be a
controversial
task as long as the midend usage of address costs is consistent.
I totally agree that use of address cost should be cleaned up.  My
point is the use is inconsistent on RTL level, maybe even worse than
midend.  Look at arm_arm_address_cost, it defines meaningless costs
regarding to both rtx cost and address cost.  It makes up to RTL
passes like fprop_addr (maybe postreload too) thus these passes can
propagate as many operations into address expressions and save alu
instructions (fprop_addr pass may have other more important uses,
please correct me here).
Hi Bin,

fwprop does try to propagate rtxes into addresses and it has a heuristic by which it checks the address costs. If one is clearly cheaper than the other according to these costs it picks that. If the costs are equal it picks the one with the *highest*
rtx cost in the hope that it will help eliminate more costly expressions.
I agree that the arm address costs are just a hack that were just picked to get the
least bad behaviour from fwprop and postreload.
I'm proposing to make a hook that takes two addresses and returns the preferable one. That way, the backend doesn't have to assign numbers to addresses and can do
as deep relative analysis as it wants.

   IMHO, this does the right optimization but
with wrong method, and it has falls out too.  For example, below code
transformation is reported not only once before:

     add  r5, r6, r7
     ldr r8, [r5]
     ldr r8, [r5, 4]

changed into:

     add  r5, r6, r7
     ldr r8, [r6, r7]
     ldr r8, [r5, 4]

In this case, it corrupts load-pair opportunity and doesn't give any
benefit.  Save problem holds for anto-inc-dec optimization.

So, here fprop_addr is doing context-sensitive optimization with
respect to address cost, which isn't always right.  I once tried to
re-implement the pass so that it can firstly checks saved rtx costs
and then the address costs benefit.  I believe others also tried the
same thing with different case.

I had looked at this as well. I was trying to get fwprop to propagate a sign-extend
into the addressing mode in aarch64 and hit something similar.

   Anyway, it never concluded because
there is no obvious performance change.  One or two alu instructions
saved doesn't matter that much on advanced processors, not mention it
might disturb combiner.

I think it would be worth pursuing. My first stab at this in fwprop would be to pick the preferable address using the TARGET_ADDRESS_PREFERABLE_P hook above. The problem comes in replicating the current behaviour. If the hook just returns true/false, then we lose the heuristic we currently use when the addresses are of equal cost. A possible way around this that I'm experimenting with is to make this new hook return {-1,0,1} rather than a boolean with -1,+1 representing a definite preference and 0 having no preference. That way fwprop can tell when there's no
clear preference and pick the more expensive one rtx-costs-wise.


As for tree ivopts, address cost is used in both ways.  For any
address computation that's invalid, it tries to legitimize it into two
parts, the first part results in alu instructions, the second part
results in address expression of different addressing modes.  Right
now the rtx cost (for the first part) and the address cost (for the
second part) are accumulated and compared altogether.  I do like the
idea split costs into different types because all elements are
entangled in single cost model, and it's hard to tune for specific
case.

Thanks for explaining.
I think it would be possible to make the comparisons there a bit more sane.
If an address computation is split into alu instructions and a legitimate address then carry around the rtx cost of the alu part and the address. When the time comes to compare two computations, we create a more involved way of comparing.
For example (would need benchmarking, of course):
* Compare the rtx costs of the alu components and check the address preference for
the legitimate address components using the new hook.
* If the alu part costs are of equal rtx cost, pick the one which has the preferable legitimate address.
* If expression 'a' has a more expensive alu component than 'b' but a more
preferable address component, then use some tie breaker. Perhaps apply rtx costs
on the address expression and compare those...


I think that if we fixed up the usage in tree-ssa-loop-ivopts.c the way described above, and figure out what to do for the 'cheap address' heuristic in loop-invariant.c, we could eliminate the use of address costs altogether. Or at least make them a backend-specific measure that each backend can use in the process of determining a preferable address but would not have to be mixed with rtx costs or magic
constants in the midend.

Thanks,
Kyrill


Please correct me if I was wrong.

Thanks,
bin



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]