[patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

Fri Jun 7 10:41:39 GMT 2024

Hello Richard:

On 07/06/24 1:52 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>> +
>>>>>>>> +  df_ref use;
>>>>>>>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (info->rtl ());
>>>>>>>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>>>>>>>> +    {
>>>>>>>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>>>>>>>> +
>>>>>>>> +      if (!def_link || !def_link->ref
>>>>>>>> +	  || DF_REF_IS_ARTIFICIAL (def_link->ref))
>>>>>>>> +	continue;
>>>>>>>> +
>>>>>>>> +      while (def_link && def_link->ref)
>>>>>>>> +	{
>>>>>>>> +	  rtx_insn *insn = DF_REF_INSN (def_link->ref);
>>>>>>>> +	  if (GET_CODE (PATTERN (insn)) == PARALLEL)
>>>>>>>> +	    return false;
>>>>>>>
>>>>>>> Why do you need to skip PARALLELs?
>>>>>>>
>>>>>>
>>>>>> vec_select with parallel give failures final.cc "can't split-up with subreg 128 (reg OO"
>>>>>> Thats why I have added this.
>>>>>
>>>>> But in (vec_select ... (parallel ...)), the parallel won't be the 
>>>>> PATTERN (insn).  It'll instead be a suboperand of the vec_select.
>>>>>
>>>>> Here too it's important to understand why the final.cc failure occurs
>>>>> and what the correct fix is.
>>>>>
>>>>
>>>> subreg with vec_select operand already exists before fusion pass.
>>>> We overwrite them with subreg 128 bits from 256 OO mode operand.
>>>
>>> But why is that wrong?  What was the full rtl of the subreg before the
>>> pass runs, what did the subreg look like after the pass, and why is the
>>> change not correct?
>>>
>>> In general, there are two main ways that an rtl change can be incorrect:
>>>
>>> (1) The new rtl isn't well-formed (such as (subreg (subreg X A) B)).
>>>     In this case, the new rtl makes no inherent sense when viewed
>>>     in isolation: it isn't necessary to see the old rtl to tell that
>>>     the new rtl is wrong.
>>>
>>> (2) The new rtl is well-formed (i.e. makes inherent sense when viewed in
>>>     isolation) but it does not have the same semantics as the old rtl.
>>>     In other words, the new rtl is describing a different operation
>>>     from the old rtl.
>>>
>>> I think we need to talk about it in those terms, rather than where
>>> the eventual ICE occurs.
>>>
>> Before the fusion.
>> old rtl looks like this:
>>
>> (vec_select:HI (subreg:V8HI (reg:V16QI 125 [ vect__29.38 ]) 0)
>>
>> After the fusion
>> new rtl looks like this:
>>
>> (vec_select:HI (subreg:V16QI (reg:OO 125 [ vect__29.38 ]) 16)
>>
>> new rtl is not well formed.
>>
>> Thats why its failing.
>>
>> reg:v16QI 125 is the destination of the load that needs to be fused.
> 
> This indicates that there's a bug in the substitution code.
> 
> It's probably better to create a fresh OO register, rather than
> change an existing 128-bit register to 256 bits.  If we do that,
> and if reg:V16QI 125 is the destination of the second load
> (which I assume it is from the 16 offset in the subreg),
> then the new RTL should be:
> 
>   (vec_select:HI (subreg:V8HI (reg:OO NEW_REG) 16) ...)
> 
> It's possible to get this by using insn_propagation to replace
> (reg:V16QI 125) with (subreg:V16QI (reg:OO NEW_REG) 16).
> insn_propagation should then take care of the rest.
> 
> There are no existing rtl-ssa routines for handling new registers
> though.  (The idea was to add things as the need arose.)
> 

Sure I will do that. Thanks.

>>>> Due to this in final.cc we couldnt splt at line 2807 and bails
>>>> out fatal_insn.
>>>>
>>>> Currently we dont support already existing subreg vector operand
>>>> to generate register pairs.
>>>> We should bail out from fusion pass in this case.
>>>>>>>> +
>>>>>>>> +	  rtx set = single_set (insn);
>>>>>>>> +	  if (set == NULL_RTX)
>>>>>>>> +	    return false;
>>>>>>>> +
>>>>>>>> +	  rtx op0 = SET_SRC (set);
>>>>>>>> +	  rtx_code code = GET_CODE (op0);
>>>>>>>> +
>>>>>>>> +	  // This check is added as register pairs are not generated
>>>>>>>> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>> +	  //                  (reg2)
>>>>>>>> +	  //                  (neg:V2DF (reg3)))
>>>>>>>> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>> +	    return false;
>>>>>>>
>>>>>>> What's special about (neg (fma ...))?
>>>>>>>
>>>>>>
>>>>>> I am not sure why register allocator fails allocating register pairs with
>>>>>> NEG Unary operation with fma operand. I have not debugged register allocator why the NEG
>>>>>> Unary operation with fma operand. 
>>>>>
>>>>
>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits are
>>>> set correctly. 
>>>> IRA marked them spill candidates as spill priority is zero.
>>>>
>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>
>>> I think this is just restating the symptom though.  I suppose the same
>>> kind of questions apply here too: what was the instruction before the
>>> pass runs, what was the instruction after the pass runs, and why is
>>> the rtl change incorrect (by the meaning above)?
>>>
>>
>> Original case where we dont do load fusion, spill happens, in that
>> case we dont require sequential register pairs to be generated for 2 loads
>> for. Hence it worked.
>>
>> rtl change is correct and there is no error.
>>
>> for load fusion spill happens and we dont generate sequential register pairs
>> because pf spill candidate and lxvp gives incorrect results as sequential register
>> pairs are required for lxvp.
> 
> Can you go into more detail?  How is the lxvp represented?  And how do
> we end up not getting a sequential register pair?  What does the rtl
> look like (before and after things have gone wrong)?
> 
> It seems like either the rtl is not describing the result of the fusion
> correctly or there is some problem in the .md description of lxvp.
> 

After fusion pass:

(insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
        (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
                (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
     (nil))
(insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
        (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
                (reg:V2DF 44 12 [3119])
                (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
     (nil))

In LRA reload.

(insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
        (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) "shell_lam.fppized.f":238:72 2187 {*movoo}
     (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])
        (nil)))
(insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
        (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
                (reg:V2DF 4283 [3119])
                (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 16)))))  {*vsx_nfmsv2df4}
     (nil))

In LRA reload sequential registers are not generated as r2572 is splled and move to spill location
in stack and subsequent uses loads from stack. Hence sequential registers pairs are not generated.

lxvp vsx0, 0(r1).

It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use sequential register pairs.

Without load fusion since 2 loads exists and 2 loads need not require sequential registers
hence it worked but with load fusion and using lxvp it requires sequential register pairs.

> Thanks,
> Richard

Thanks & regards
Ajit