TARGET_RTX_COSTS and pipeline latency vs. variable-latency instructions (was Re: [PATCH] RISC-V: Add XiangShan Nanhu microarchitecture.)

Mon Mar 25 19:59:14 GMT 2024

On 3/25/24 1:48 PM, Xi Ruoyao wrote:
> On Mon, 2024-03-18 at 20:54 -0600, Jeff Law wrote:
>>> +/* Costs to use when optimizing for xiangshan nanhu.  */
>>> +static const struct riscv_tune_param xiangshan_nanhu_tune_info = {
>>> +  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},	/* fp_add */
>>> +  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},	/* fp_mul */
>>> +  {COSTS_N_INSNS (10), COSTS_N_INSNS (20)},	/* fp_div */
>>> +  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},	/* int_mul */
>>> +  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)},	/* int_div */
>>> +  6,						/* issue_rate */
>>> +  3,						/* branch_cost */
>>> +  3,						/* memory_cost */
>>> +  3,						/* fmv_cost */
>>> +  true,						/* slow_unaligned_access */
>>> +  false,					/* use_divmod_expansion */
>>> +  RISCV_FUSE_ZEXTW | RISCV_FUSE_ZEXTH,          /* fusible_ops */
>>> +  NULL,						/* vector cost */
> 
>> Is your integer division really that fast?  The table above essentially
>> says that your cpu can do integer division in 6 cycles.
> 
> Hmm, I just noticed that I've coded some even smaller values for
> LoongArch CPUs, so forgive me for "hijacking" this thread...
> 
> The problem seems to be that integer division may take a different
> number of cycles for different inputs: on LoongArch LA664 I've observed
> 5 cycles for some inputs and 39 cycles for others.
> 
> So should we use the minimum value, the maximum value, or something
> in between for TARGET_RTX_COSTS and pipeline descriptions?

Yeah, early outs are relatively common in actual hardware
implementations.

The biggest reason to refine the cost of a division is so that we've got
a reasonably accurate cost for division by a constant -- which can often
be done with a multiplication-by-reciprocal sequence.  That sequence
will use mult, add, sub, and shadd insns, and you need a reasonable cost
model for those so you can compare against the cost of a hardware
division.
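
As a rough illustration of such a sequence (a sketch, not code from the
patch or from GCC itself), unsigned 32-bit division by the constant 10
can be written as a widening multiply followed by a shift.  The function
name is hypothetical; the constant 0xCCCCCCCD is ceil(2^35 / 10).

```c
#include <stdint.h>

/* Compute x / 10 for any 32-bit unsigned x without a divide insn.
   The 64-bit product of x and ceil(2^35 / 10) is shifted right by 35,
   which yields the same quotient as the hardware division would.  */
static inline uint32_t
div10_by_reciprocal (uint32_t x)
{
  return (uint32_t) (((uint64_t) x * UINT64_C (0xCCCCCCCD)) >> 35);
}
```

Other divisors may need an extra add, sub, or shadd step on top of the
multiply and shift, which is why the costs of those insns matter too.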

So to answer your question: choose something sensible.  You probably
don't want the fastest case and you may not want the slowest case.
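
To see why the choice matters, here is a deliberately simplified sketch
of the comparison.  COSTS_N_INSNS is redefined locally with the same
scaling GCC uses in rtl.h; prefer_reciprocal and the cycle counts passed
to it are hypothetical, and the real decision in the middle end involves
more than a single comparison.

```c
#include <stdbool.h>

/* Same scaling as GCC's rtl.h: one "insn" of cost is 4 units.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical helper: return true when a reciprocal sequence costing
   seq_cycles is cheaper than a hardware divide costing div_cycles.  */
static bool
prefer_reciprocal (int div_cycles, int seq_cycles)
{
  return COSTS_N_INSNS (seq_cycles) < COSTS_N_INSNS (div_cycles);
}
```

Assuming a reciprocal sequence of about 6 cycles, a divide cost of 5
(the fastest case quoted above) would make the compiler keep the
hardware divide, while 39 (the slowest case) or a middle value such as
20 would favor the sequence.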

Jeff