[PATCH PR94442] [AArch64] Redundant ldp/stp instructions emitted at -O3

Tue Jul 7 14:07:48 GMT 2020

xiezhiheng <xiezhiheng@huawei.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford [mailto:richard.sandiford@arm.com]
>> Sent: Monday, July 6, 2020 5:31 PM
>> To: xiezhiheng <xiezhiheng@huawei.com>
>> Cc: Richard Biener <richard.guenther@gmail.com>; gcc-patches@gcc.gnu.org
>> Subject: Re: [PATCH PR94442] [AArch64] Redundant ldp/stp instructions
>> emitted at -O3
>> 
>> No, this is unfortunately a known bug.  See:
>> 
>>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95964
>> 
>> (Although the PR is recent, it's been a known bug for longer.)
>> 
>> As you say, the difficulty is that the correct attributes depend on what
>> the built-in function does.  Most integer arithmetic is “const”, but things
>> get more complicated for floating-point arithmetic.
>> 
>> The SVE intrinsics use a three stage process:
>> 
>> - each function is classified into one of several groups
>> - each group has a set of flags that describe what functions in the
>>   group can do
>> - these flags get converted into attributes based on the current
>>   command-line options
>> 
>> I guess we should have something similar for the arm_neon.h built-ins.
>> 
>> If you're willing to help fix this, that'd be great.  I think a first
>> step would be to agree a design.
>> 
>> Thanks,
>> Richard
>
> I'd like to have a try.

Great!

> I have checked the steps in SVE intrinsics.
> It defines a base class "function_base" and derives different classes
> to describe several intrinsics for each.  And each class may
> have its own unique flags described in virtual function "call_properties".
> The specific attributes will be converted from these flags in
> "get_attributes" later.
>
> I find that there are more than 100 classes in total and if I only
> need to classify them into different groups by attributes, maybe
> we does not need so many classes?

Yeah, I agree.

Long term, there might be value in defining arm_neon.h in a similar
way to arm_sve.h: i.e. have arm_neon.h defer most of the work to
a special compiler pragma.  But that's going to be a lot of work.

I think it's possible to make incremental improvements to the current
arm_neon.h implementation without that work being thrown away if we ever
did switch to a pragma in future.  And the incremental approach seems
more practical.

> The difficult thing I think is how to classify neon intrinsics into
> different groups.  I'm going to follow up the way in SVE intrinsics
> first now.

For now I'd suggest just giving a name to each combination of flags
that the intrinsics need, rather than splitting instructions in a
more fine-grained way.  (It's not at all obvious from the final state
of the SVE code, but even there, the idea was to have as few groups as
possible.  I.e. the groups were supposedly only split where necessary.
As you say, there still ended up being a lot of groups in the end…)

It'd be easier to review if the work was split up into smaller steps.
E.g. maybe one way would be this, with each number being a single
patch:

(1) (a) Add a flags field to the built-in function definitions
        that for now is always zero.
    (b) Pick a name N to describe the most conservative set of flags.
    (c) Make every built-in function definition use N.

(2) (a) Pick one type of function that cannot yet be described properly.
    (b) Pick a name N for that type of function.
    (c) Add whichever new flags are needed.
    (d) Add the appropriate attributes when the flags are set,
        possibly based on command-line options.
    (e) Make (exactly) one built-in function definition use N.

(3) (a) Pick some functions that all need the same attributes and
        that can already be described properly
    (b) Update all of their built-in function definitions accordingly,
        as a single change.

So after (1), filling out the table is an iterative process of (2) and
(3), in any order that's convenient (although it might help to order the
(2) patches so that each one adds as few flags as possible).  Each patch
would then be fairly small and self-contained.

That's just a suggestion though.  Please let me know if you have
any other suggestions.

I guess there are two obvious ways of adding the flags field:

- add a new parameter to every built-in function macro, e.g.
  BUILTIN_VSDQ_I and VAR1.

- wrap the definitions in a new macro, e.g.
  MY_NEW_GROUP (BUILTIN_VSDQ_I (BINOP, sqshl, 0))

I don't really have a preference, and I guess all other things being
equal, the first one wins by being more obvious than the second.
Just thought I'd mention the second way in case anyone preferred it.

Thanks,
Richard