[PATCH PR94442] [AArch64] Redundant ldp/stp instructions emitted at -O3

Richard Sandiford richard.sandiford@arm.com
Thu Aug 20 08:55:12 GMT 2020

xiezhiheng <xiezhiheng@huawei.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford [mailto:richard.sandiford@arm.com]
>> Sent: Wednesday, August 19, 2020 6:06 PM
>> To: xiezhiheng <xiezhiheng@huawei.com>
>> Cc: Richard Biener <richard.guenther@gmail.com>; gcc-patches@gcc.gnu.org
>> Subject: Re: [PATCH PR94442] [AArch64] Redundant ldp/stp instructions
>> emitted at -O3
>> xiezhiheng <xiezhiheng@huawei.com> writes:
>> > As a first attempt, I have added FLAGS for some of the intrinsics in
>> > aarch64-simd-builtins.def, including all the add/sub arithmetic intrinsics.
>> >
>> > For an intrinsic like faddp, which only handles floating-point operations,
>> > both the FP and NONE flags are suitable, because FLAG_FP will be added
>> > later if the intrinsic handles floating-point operations.  I prefer FP
>> > since it would be clearer.
>> Sounds good to me.
>> > But the qadd intrinsics modify the FPSR register, which is a scenario
>> > I missed before.  I am considering adding an additional flag
>> > to represent it.
>> I don't think we make any attempt to guarantee that the Q flag is
>> meaningful after saturating intrinsics.  To do that, we'd need to model
>> the modification of the flag in the .md patterns too.
>> So my preference would be to leave this out and just use NONE for the
>> saturating forms too.
> The problem is that the test case in the attachment has different results under -O0 and -O2.

Right.  But my point was that I don't think that use case is supported.
If you want to use saturating instructions and read the Q flag afterwards,
the saturating instructions need to be inline asm too.
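
As a rough illustration of that suggestion (a sketch under stated
assumptions, not something taken from the patch), the attached test
could issue the saturating add as a volatile asm instead of going
through the intrinsic.  The helper name uqadd_sets_qc is hypothetical,
the QC position (bit 27) follows the bitfield layout in the attached
test, and the #else branch is only a portable stand-in so that the
sketch builds on other targets:

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: return 1 if a lane-wise unsigned saturating add
   of A and B saturates, 0 otherwise.  */
static int
uqadd_sets_qc (uint32_t a, uint32_t b)
{
#if defined(__aarch64__)
  uint64_t fpsr;
  uint32x2_t x = vdup_n_u32 (a);
  uint32x2_t y = vdup_n_u32 (b);
  uint32x2_t res;

  /* Clear the sticky QC bit (assumed to be FPSR bit 27).  */
  __asm__ volatile ("mrs %0, fpsr" : "=r" (fpsr));
  fpsr &= ~(UINT64_C (1) << 27);
  __asm__ volatile ("msr fpsr, %0" : : "r" (fpsr));

  /* The saturating add itself is a volatile asm, so it is not treated
     as dead code even though RES is never used.  */
  __asm__ volatile ("uqadd %0.2s, %1.2s, %2.2s"
                    : "=w" (res) : "w" (x), "w" (y));
  (void) res;

  /* Read the flag back.  */
  __asm__ volatile ("mrs %0, fpsr" : "=r" (fpsr));
  return (int) ((fpsr >> 27) & 1);
#else
  /* Stand-in for non-AArch64 builds: unsigned overflow means the
     saturating form would have saturated.  */
  return (uint32_t) (a + b) < a;
#endif
}
```

With the operands from the attached test, uqadd_sets_qc (0xfffffff0, 0x20)
would be expected to return 1 at any optimisation level, since the
compiler is not free to delete the volatile asm.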

> In the gimple phase, the statement:
>   _9 = __builtin_aarch64_uqaddv2si_uuu (op0_4, op1_6);
> would be treated as dead code if we set the NONE flag for saturating intrinsics.
> Adding FLAG_WRITE_FPSR would help fix this problem.
> Even when we set FLAG_WRITE_FPSR, the uqadd insn:
>   (insn 11 10 12 2 (set (reg:V2SI 97)
>         (us_plus:V2SI (reg:V2SI 98)
>             (reg:V2SI 99))) {aarch64_uqaddv2si}
>      (nil))
> could also be eliminated in the RTL phase because it will be treated as a dead insn.
> So I think we might also need to modify the saturating instruction patterns to add the side effect of setting the FPSR register.

The problem is that FPSR is global state and we don't in general
know who might read it.  So if we modelled the modification of the FPSR,
we'd never be able to fold away saturating arithmetic that is known at
compile time to saturate, because we'd never know whether the program
wanted the effect on the Q flag to be visible (perhaps to another
function that the compiler can't see).  We'd also be unable to remove
results that really are dead.
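
To make the trade-off concrete, here is a minimal software model in
plain C.  It is only an analogy, not GCC internals: q_flag,
sat_add_with_flag and sat_add_pure are names invented for this sketch,
with q_flag standing in for the sticky Q bit in FPSR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the sticky Q flag.  It is global state, so
   code that the compiler cannot see might read it.  */
bool q_flag;

/* Model in which the flag update is an observable side effect.  */
static uint32_t
sat_add_with_flag (uint32_t a, uint32_t b)
{
  uint32_t sum = a + b;
  if (sum < a)
    {
      /* Unsigned wraparound: record the saturation.  */
      q_flag = true;
      return UINT32_MAX;
    }
  return sum;
}

/* Model in which only the returned value matters, analogous to
   giving the intrinsic the NONE flag.  */
static uint32_t
sat_add_pure (uint32_t a, uint32_t b)
{
  uint32_t sum = a + b;
  return sum < a ? UINT32_MAX : sum;
}
```

A call to sat_add_pure whose result is unused can simply be deleted.
A call to sat_add_with_flag cannot be dropped entirely, because the
store to q_flag has to be kept unless the compiler can prove that
nobody reads it afterwards.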

So I think this is one of those situations in which we can't keep all
constituents happy.  Catering for people who want to read the Q flag
would make things worse for those who want saturating arithmetic to be
optimised as aggressively as possible.  And the same holds in reverse.


> So if we use the NONE flag for saturating intrinsics, the descriptions of both the function attributes and the patterns are incorrect.
> I think I can propose another patch to fix the patterns, if you agree?
> Thanks,
> Xie Zhiheng
> #include <arm_neon.h>
> #include <stdlib.h>
> typedef union {
>   struct {
>     int _xxx:24;
>     unsigned int FZ:1;
>     unsigned int DN:1;
>     unsigned int AHP:1;
>     unsigned int QC:1;
>     int V:1;
>     int C:1;
>     int Z:1;
>     int N:1;
>   } b;
>   unsigned int word;
> } _ARM_FPSCR;
> static volatile int __read_neon_cumulative_sat (void) {
>     _ARM_FPSCR _afpscr_for_qc;
>     asm volatile ("mrs %0,fpsr" : "=r" (_afpscr_for_qc));
>     return _afpscr_for_qc.b.QC;
> }
> int main()
> {
>   uint32x2_t op0, op1, res;
>   op0 = vdup_n_u32 ((uint32_t)0xfffffff0);
>   op1 = vdup_n_u32 ((uint32_t)0x20);
>   _ARM_FPSCR _afpscr_for_qc;
>   asm volatile ("mrs %0,fpsr" : "=r" (_afpscr_for_qc));
>   _afpscr_for_qc.b.QC = (0);
>   asm volatile ("msr fpsr,%0" :  : "r" (_afpscr_for_qc));
>   res = vqadd_u32 (op0, op1);
>   if (__read_neon_cumulative_sat () != 1)
>     abort ();
>   return 0;
> }
