This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: PATCH: fine-tuning for can_store_by_pieces


Sandra Loosemore <sandra@codesourcery.com> writes:
> I did some fiddling around with CSiBE to further improve the cost
> estimate for a call.  Setting the base cost to 4, rather than 3 as in
> the original patch, produced smaller code both with and without
> -mabicalls.  With o32 abicalls, which Richard indicated would be the
> worst case scenario for call overhead, base cost of 5 gave similar
> size results to 4 (overall a tiny decrease, but it looked like just as
> many individual tests went up as down), but code size went up for
> everything else.  So I think 4 is probably a reasonable value without
> having to conditionalize it for -mabicalls.

Thanks for the testing.  In that case, I agree 4 is fine for everything.
If you still have the results, could you post the totals?  I'm curious
what kind of figures we're talking about here.

> + /* The base cost of a memcpy call, for MOVE_RATIO and friends.  This
> +    was determined experimentally by benchmarking with CSiBE.  In theory,
> +    the call overhead is higher for TARGET_ABICALLS (especially for o32
> +    where we have to restore $gp afterwards as well as make an indirect
> +    call), but in practice, bumping this up higher for TARGET_ABICALLS 
> +    doesn't make much difference to code size.  */
> + 
> + #define MIPS_CALL_RATIO 4

I think the number you use in CLEAR_RATIO (MIPS_CALL_RATIO + 2)
is effectively estimating the number of instructions for a call.
ISTM CLEAR_RATIO is basically being compared against an estimate of
the number of zero stores, and zero stores are 1 instruction on MIPS.
(Also, nothing really explained why CLEAR_RATIO adds a magic 2 to the
ratio.)

So I think this should really be 6 and that CLEAR_RATIO should be:

#define CLEAR_RATIO	(optimize_size ? MIPS_CALL_RATIO : 15)

Then...

> + #define MOVE_RATIO ((TARGET_MIPS16 || TARGET_MEMCPY) ? MIPS_CALL_RATIO : 2)

...a comment in the original patch said that MOVE_RATIO effectively
counted memory-to-memory moves.  I think that was a useful comment,
and that the use of the old MIPS_CALL_RATIO above should be the new
MIPS_CALL_RATIO / 2.  Conveniently, that gives us the 3 that you had
in the original patch.  (You didn't say whether you'd benchmarked
-mips16 or -mmemcpy; if so, did you see any difference between a
MOVE_RATIO of 3 and a MOVE_RATIO of 4?)

> + /* This is similar to CLEAR_RATIO, but for a non-zero constant, so when
> +    optimizing for size adjust the ratio to account for the overhead of
> +    loading the constant and replicating it across the word.  In fact in
> +    that case it is never worth inlining, since calling memset should
> +    always be smaller.  */
> + 
> + #define SET_RATIO	(optimize_size ? MIPS_CALL_RATIO : 15)

I still find this comment a little confusing.  If this macro were
ever used for anything other than 1-byte copies when optimize_size,
MIPS_CALL_RATIO _would_ be big enough to cause SET_BY_PIECES_P to be
true for some of them.  And, as already discussed, I think "always"
is too strong anyway.

How about:

/* This is similar to CLEAR_RATIO, but for a non-zero constant.
   At the time this definition was added, optimize_size compilations
   only used this macro for single-byte stores.  Although it might be
   worthwhile doing larger stores inline too -- if the target-independent
   code ever considered it -- the boundaries are not very clear.
   Just use a value of 2 for now so that all single-byte copies are
   done inline.  */

#define SET_RATIO	(optimize_size ? 2 : 15)

But I don't like having to make excuses like this.  I wonder if we
should instead remove the optimize_size checks above the calls to
store_by_pieces_p and define the default SET_RATIO to something like:

#define SET_RATIO (optimize_size ? 2 : MOVE_RATIO)

?  Then the target would be able to control the otherwise arbitrary
optimize_size limit, just like it already can for CLEAR_RATIO.

> + /* STORE_BY_PIECES_P can be used when copying a constant string, but
> +    in that case each word takes 3 insns (lui, ori, sw), or more in
> +    64-bit mode, instead of 2 (lw, sw). So better to always fail this
> +    and let the move_by_pieces code copy the string from read-only
> +    memory.  */
> + 
> + #define STORE_BY_PIECES_P(SIZE, ALIGN) 0

You asked when lui/ori/sw might be faster.  Consider a three-word
store on a typical 2-way superscalar target:

  Cycle 1:    lui     lui
        2:    ori     ori
        3:    sw             lui
        4:            sw     ori
        5:                   sw

That's 5 cycles.  The equivalent lw/sw version is at least 6 cycles
(more if the read-only string is not in cache).

Richard
