This is the mail archive of the mailing list for the GCC project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Stack Reorganization Patch

"Naveen Sharma, Noida" wrote:
> Hi Everyone,
> Some time back we had a discussion on layout of locals
> on stack at (
> With that in mind, here is a patch which delays assignment of hard
> stack slots till after register allocation. It works as follows.
> A call to assign_stack_local_1 is intercepted and we return a rtx
> of the form (mem:mode reg/f/c:Pmode regno) for each requested stack slot
> instead of normal form (mem:mode (plus:Pmode fp const_int offset)).
> Note that the special flag /c is used to tell that this is
> stack address pseudo. The register allocator should not try to
> allocate any hard reg for this because it is already a known
> stack slot.After register allocation, we sort the allocated stack
> slots by size and number of references and convert it to normal
> "fp + offset" form. We also have to mark them live at end of
> each basic block.
Your comparison function for qsort never returns 0.  I see nothing in the
man page for qsort that says that an implementation must not compare a value
with itself.  The name kind of implies that it is doing a quicksort, but
that is not guaranteed.

A reason why you see performance decrease on some benchmarks, apart from
the usual optimizer noise, could be that you don't seem to consider
spatial and temporal locality.  When you look at the definition for
LEGITIMIZE_RELOAD_ADDRESS for the SH, you will see that for 32 bit integer
values, there are 16 slots that can be addressed directly, and the other
addresses come in groups of 16 that can be addressed with a common base
register.  If there are multiple stack slots in the same group in
succession, reload typically needs to load the base register just once.
Some codes tend to use stack slots with related purpose in close sucession,
and so with allocation of stack slots in order of encountering the pseudo,
we have a good chance of using the same base register several times of
related accesses.  However, with your sorting, if the number of accesses
doesn't match, the pseudos will end up in quite different locations in the
stack frame.

On the other hand, offset size is not linear with cost increase.  For the
SH, there are basically three classes:
 - directly addressable - 32 bit integers in the rage 0..60
 - accessible with an immediate constant load (isolated access)
   or a move / add immediate (multiple adjacent accesses) for the -128..+127
 - anything above needs a constant load from memory (isolated access)
   or a constant load from memory / add (multiple adjacent accesses).
   If the offset fits into a 16 bit signed integer, the constant table
   entry is smaller than if it doesn't, but that makes hardly a difference
   to execution speed.

While I don't know of an algorithm that does a really good job at finding a
nice solution in reasonable time, an obvious possibility to improve things
a bit would be to first use your current sort to put the stack slots into
their respective buckets of 0..60 / -128..+128 / beyond ranges, and then
sort withing these ranges again strictly according to id, thus restoring the
original order within.
Obviously, it does no good to hard-code the numbers for one processor into
the target-independent code, so we could have a target hook that translates
mode/size and tentative offset into cost equivalence classes.
Then you re-sort based on access cost and id.
SuperH (UK) Ltd.
2410 Aztec West / Almondsbury / BRISTOL / BS32 4QX
T:+44 1454 465658

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]