This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [patch] do not always push DImode constants to memory before loading them to registers on ARM


On Wed, 2004-11-17 at 21:59, Nicolas Pitre wrote:
> This patch lets the ARM backend build DImode constants with immediate
> values when worth it.  This makes for faster and often smaller code.
> 
> [date]  Nicolas Pitre <nico@cam.org>
> 
> 	* config/arm/arm.c (const_double_needs_minipool): New function
> 	determining if a CONST_DOUBLE should be pushed to the minipool.
> 	(note_invalid_constants): Use it.
> 

This is OK.

With respect to your comment on optimal sequences for time/space
trade-offs, I think for -Os the best value would probably be around
3 insns.  It's not common to see constant pool entries shared,
particularly double-word values -- the compiler can often CSE them
anyway.  So I would *estimate* that setting the level at 3 would produce
a 1 word saving in ~90% of cases, but would cost at least 1 word in the
remaining 10% (it might be a bit more expensive than that -- there's a
very high upper bound in theory -- but it's unlikely to be significant).

For -O and -O2, it's harder to be sure.  On cores without load delay
slots, 4 ALU ops will certainly be faster than 2 LDR operations,
but on cores with a single load delay slot it is likely that
anything more than 2 ALU ops will be slower (note that there's really
only one delay slot to fill for the two insns, since the second load
will fill the delay slot of the first insn).  However, balanced against
this is the fact that constant pools sit in the middle of code
sections and cause cache pollution (their lines can end up in both the
instruction and data caches/TLBs).

I suspect the balance changes yet again for cores with a 2-cycle load
delay slot, since then there can be severe scheduling difficulties, so
the balance will shift back towards using ALU operations, which can also
then be used to fill delay slots of other load instructions.

So I suspect that on balance, the optimum is likely to be around 3
instructions regardless of the optimization level, except on cores with
no load delay slots (arm7 or earlier), where it should be 4.
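To make the threshold idea concrete, here is a rough sketch in C.  It is
not the code from the patch: every function name below is hypothetical,
and the instruction count is only an estimate that assumes each word is
built from ARM data-processing immediates (an 8-bit value rotated right
by an even amount), starting with MOV or MVN and continuing with ORR or
BIC.  The limits of 3 and 4 are the estimates given above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value right by s bits (0 <= s < 32). */
static uint32_t ror32(uint32_t v, unsigned s)
{
    return (v >> s) | (v << ((32u - s) & 31u));
}

/* Estimate how many 8-bit fields at even bit positions are needed to
   cover all set bits of v.  Each field stands for one immediate
   operand.  Every even rotation is tried so that a field may wrap
   around the top of the word.  */
static int count_fields(uint32_t v)
{
    int best = 4;               /* four byte-aligned fields always suffice */

    if (v == 0)
        return 1;               /* a single MOV #0 (or MVN #0 for ~v) */

    for (unsigned s = 0; s < 32; s += 2) {
        uint32_t w = ror32(v, s);
        int n = 0;
        unsigned i = 0;

        while (i < 32) {
            if (((w >> i) & 3u) == 0) {
                i += 2;         /* nothing to cover here, stay even */
            } else {
                n++;            /* one immediate covers bits i..i+7 */
                i += 8;
            }
        }
        if (n < best)
            best = n;
    }
    return best;
}

/* Hypothetical estimate for one word: build v directly (MOV + ORRs)
   or build it from its complement (MVN + BICs), whichever is shorter.  */
int estimate_insns32(uint32_t v)
{
    int direct = count_fields(v);
    int inverted = count_fields((uint32_t)~v);

    return direct < inverted ? direct : inverted;
}

/* Hypothetical estimate for a DImode constant: both halves separately. */
int estimate_insns64(uint64_t v)
{
    return estimate_insns32((uint32_t)v)
           + estimate_insns32((uint32_t)(v >> 32));
}

/* Suggested limit: 4 insns with no load delay slots, otherwise 3.  */
int inline_insn_limit(int load_delay_slots)
{
    return load_delay_slots == 0 ? 4 : 3;
}

/* True if the constant looks cheaper to load from the constant pool. */
bool prefers_pool(uint64_t v, int load_delay_slots)
{
    return estimate_insns64(v) > inline_insn_limit(load_delay_slots);
}
```

Under these assumptions a value such as 0x0000FFFF00FF00FF is estimated
at 4 insns, so it would be built inline only for a core with no load
delay slots, while 0x00000001000000FF (2 insns) would be built inline in
either case.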

None of this has been tested experimentally.

R.

