This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: PATCH: fine-tuning for can_store_by_pieces
- From: Richard Sandiford <richard at codesourcery dot com>
- To: Sandra Loosemore <sandra at codesourcery dot com>
- Cc: GCC Patches <gcc-patches at gcc dot gnu dot org>, Nigel Stephens <nigel at mips dot com>, Guy Morrogh <guym at mips dot com>, David Ung <davidu at mips dot com>, Thiemo Seufer <ths at mips dot com>, Mark Mitchell <mark at codesourcery dot com>
- Date: Mon, 20 Aug 2007 09:20:51 +0100
- Subject: Re: PATCH: fine-tuning for can_store_by_pieces
- References: <46C3343A.5080407@codesourcery.com> <87ps1nop2x.fsf@firetop.home> <46C778D6.5060808@codesourcery.com>
Sandra Loosemore <sandra@codesourcery.com> writes:
> I did some fiddling around with CSiBE to further improve the cost
> estimate for a call. Setting the base cost to 4, rather than 3 as in
> the original patch, produced smaller code both with and without
> -mabicalls. With o32 abicalls, which Richard indicated would be the
> worst case scenario for call overhead, base cost of 5 gave similar
> size results to 4 (overall a tiny decrease, but it looked like just as
> many individual tests went up as down), but code size went up for
> everything else. So I think 4 is probably a reasonable value without
> having to conditionalize it for -mabicalls.
Thanks for the testing. In that case, I agree 4 is fine for everything.
If you still have the results, could you post the totals? I'm curious
what kind of figures we're talking about here.
> + /* The base cost of a memcpy call, for MOVE_RATIO and friends. This
> + was determined experimentally by benchmarking with CSiBE. In theory,
> + the call overhead is higher for TARGET_ABICALLS (especially for o32
> + where we have to restore $gp afterwards as well as make an indirect
> + call), but in practice, bumping this up higher for TARGET_ABICALLS
> + doesn't make much difference to code size. */
> +
> + #define MIPS_CALL_RATIO 4
I think the number you use in CLEAR_RATIO (MIPS_CALL_RATIO + 2)
is effectively estimating the number of instructions for a call.
ISTM CLEAR_RATIO is basically being compared against an estimate of
the number of zero stores, and zero stores are 1 instruction on MIPS.
(Also, nothing really explained why CLEAR_RATIO adds a magic 2 to the
ratio.)
So I think this should really be 6 and that CLEAR_RATIO should be:
#define CLEAR_RATIO (optimize_size ? MIPS_CALL_RATIO : 15)
Then...
> + #define MOVE_RATIO ((TARGET_MIPS16 || TARGET_MEMCPY) ? MIPS_CALL_RATIO : 2)
...a comment in the original patch said that MOVE_RATIO effectively
counted memory-to-memory moves. I think that was a useful comment,
and that the use of the old MIPS_CALL_RATIO above should be the new
MIPS_CALL_RATIO / 2. Conveniently, that gives us the 3 that you had
in the original patch. (You didn't say whether you'd benchmarked
-mips16 or -mmemcpy; if so, did you see any difference between a
MOVE_RATIO of 3 and a MOVE_RATIO of 4?)
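In other words, the combined suggestion would look something like this (an untested sketch, using the names from your patch):

```c
/* The base cost of a memcpy call, in instructions.  CLEAR_RATIO can
   use it directly, since zero stores are one instruction each, while
   MOVE_RATIO, which counts memory-to-memory moves, halves it.  */
#define MIPS_CALL_RATIO 6

#define CLEAR_RATIO (optimize_size ? MIPS_CALL_RATIO : 15)

#define MOVE_RATIO \
  ((TARGET_MIPS16 || TARGET_MEMCPY) ? MIPS_CALL_RATIO / 2 : 2)
```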
> + /* This is similar to CLEAR_RATIO, but for a non-zero constant, so when
> + optimizing for size adjust the ratio to account for the overhead of
> + loading the constant and replicating it across the word. In fact in
> + that case it is never worth inlining, since calling memset should
> + always be smaller. */
> +
> + #define SET_RATIO (optimize_size ? MIPS_CALL_RATIO : 15)
I still find this comment a little confusing. If this macro were
ever used for anything other than 1-byte copies when optimize_size,
MIPS_CALL_RATIO _would_ be big enough to cause SET_BY_PIECES_P to be
true for some of them. And, as already discussed, I think "always"
is too strong anyway.
How about:
/* This is similar to CLEAR_RATIO, but for a non-zero constant.
At the time this definition was added, optimize_size compilations
only used this macro for single-byte stores. Although it might be
worthwhile doing larger stores inline too -- if the target-independent
code ever considered it -- the boundaries are not very clear.
Just use a value of 2 for now so that all single-byte copies are
done inline. */
#define SET_RATIO (optimize_size ? 2 : 15)
But I don't like having to make excuses like this. I wonder if we
should instead remove the optimize_size checks above the calls to
store_by_pieces_p and define the default SET_RATIO to something like:
#define SET_RATIO (optimize_size ? 2 : MOVE_RATIO)
? Then the target would be able to control the otherwise arbitrary
optimize_size limit, just like it already can for CLEAR_RATIO.
> + /* STORE_BY_PIECES_P can be used when copying a constant string, but
> + in that case each word takes 3 insns (lui, ori, sw), or more in
> + 64-bit mode, instead of 2 (lw, sw). So better to always fail this
> + and let the move_by_pieces code copy the string from read-only
> + memory. */
> +
> + #define STORE_BY_PIECES_P(SIZE, ALIGN) 0
You asked when lui/ori/sw might be faster. Consider a three-word
store on a typical 2-way superscalar target:
Cycle 1: lui lui
2: ori ori
3: sw lui
4: sw ori
5: sw
That's 5 cycles. The equivalent lw/sw version is at least 6 cycles
(more if the read-only string is not in cache).
Richard