[PATCH] constructor: Elide expand_constructor when can_move_by_pieces is true

Bernd Edlinger <bernd.edlinger@hotmail.de>
Fri May 21 07:30:13 GMT 2021



On 5/21/21 8:57 AM, Richard Biener wrote:
> On Thu, May 20, 2021 at 4:04 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Thu, May 20, 2021 at 12:51 AM Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>>
>>> On Wed, May 19, 2021 at 3:22 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>
>>>> On Wed, May 19, 2021 at 2:33 AM Richard Biener
>>>> <richard.guenther@gmail.com> wrote:
>>>>>
>>>>> On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>>
>>>>>> When expanding a constant constructor, don't call expand_constructor if
>>>>>> it is more efficient to load the data from memory via move by pieces.
>>>>>>
>>>>>> gcc/
>>>>>>
>>>>>>         PR middle-end/90773
>>>>>>         * expr.c (expand_expr_real_1): Don't call expand_constructor if
>>>>>>         it is more efficient to load the data from memory.
>>>>>>
>>>>>> gcc/testsuite/
>>>>>>
>>>>>>         PR middle-end/90773
>>>>>>         * gcc.target/i386/pr90773-24.c: New test.
>>>>>>         * gcc.target/i386/pr90773-25.c: Likewise.
>>>>>> ---
>>>>>>  gcc/expr.c                                 | 10 ++++++++++
>>>>>>  gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++
>>>>>>  gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++
>>>>>>  3 files changed, 52 insertions(+)
>>>>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
>>>>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c
>>>>>>
>>>>>> diff --git a/gcc/expr.c b/gcc/expr.c
>>>>>> index d09ee42e262..80e01ea1cbe 100644
>>>>>> --- a/gcc/expr.c
>>>>>> +++ b/gcc/expr.c
>>>>>> @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
>>>>>>                 unsigned HOST_WIDE_INT ix;
>>>>>>                 tree field, value;
>>>>>>
>>>>>> +               /* Check if it is more efficient to load the data from
>>>>>> +                  the memory directly.  FIXME: How many stores do we
>>>>>> +                  need here if not moved by pieces?  */
>>>>>> +               unsigned HOST_WIDE_INT bytes
>>>>>> +                 = tree_to_uhwi (TYPE_SIZE_UNIT (type));
>>>>>
>>>>> that's prone to fail - it could be a VLA.
>>>>
>>>> What do you mean by fail?  Is it an ICE or a missed optimization?
>>>> Do you have a testcase?
>>>>
>>>>>
>>>>>> +               if ((bytes / UNITS_PER_WORD) > 2
>>>>>> +                   && MOVE_MAX_PIECES > UNITS_PER_WORD
>>>>>> +                   && can_move_by_pieces (bytes, TYPE_ALIGN (type)))
>>>>>> +                 goto normal_inner_ref;
>>>>>> +
>>>>>
>>>>> It looks like you're concerned about aggregate copies but this also handles
>>>>> non-aggregates (which on GIMPLE might already be optimized of course).
>>>>
>>>> Here I check if we copy more than 2 words and we can move more than
>>>> a word in a single instruction.
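>>>> Concretely, on x86-64 (assuming UNITS_PER_WORD == 8 and, with
>>>> SSE, MOVE_MAX_PIECES == 16 - that's my reading of the target,
>>>> so take the numbers as a sketch): the 64-byte array in the
>>>> testcase gives bytes / UNITS_PER_WORD == 8 > 2 and
>>>> MOVE_MAX_PIECES > UNITS_PER_WORD, so we go to normal_inner_ref
>>>> and load the data from memory.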
>>>>
>>>>> Also you say "if it's cheaper" but I see no cost considerations.  How do
>>>>> we generally handle immed const vs. load from constant pool costs?
>>>>
>>>> This trades 2 (up to 8) stores for one load plus one store.  Is there
>>>> a way to check which one is faster?
>>>
>>> I'm not sure - it depends on whether the target can do stores from immediates
>>> at all or what restrictions apply, what the immediate value actually is
>>> (zero or all-ones should be way cheaper than sth arbitrary) and how the
>>> pressure on the load unit is.  can_move_by_pieces (bytes, TYPE_ALIGN (type))
>>> also does not guarantee it will actually move pieces larger than UNITS_PER_WORD;
>>> that might depend on alignment.  There's by_pieces_ninsns that might provide
>>> some hint here.
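>>> E.g. sth like (just an untested sketch on my side)
>>>
>>>   unsigned HOST_WIDE_INT ninsns
>>>     = by_pieces_ninsns (bytes, TYPE_ALIGN (type),
>>>                         MOVE_MAX_PIECES + 1, MOVE_BY_PIECES);
>>>
>>> compared against the number of immediate stores we'd otherwise emit.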
>>>
>>> I'm sure it works well for x86.
>>>
>>> I wonder if the existing code is in the appropriate place and we
>>> shouldn't instead
>>> handle this somewhere upthread where we ask to copy 'exp' into some other
>>> memory location.  For your testcase that's expand_assignment but I can
>>> imagine passing array[0] by value to a function resulting in similar copying.
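>>> For example sth like the following (a minimal sketch - the struct
>>> layout here is a guess, not the exact testcase):
>>>
>>>   struct S { long w[8]; };      /* 64 bytes on LP64.  */
>>>   extern const struct S array[];
>>>   extern void bar (struct S);
>>>
>>>   void foo (void) { bar (array[0]); }
>>>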
>>> Testing that shows we get
>>>
>>>         pushq   array+56(%rip)
>>>         .cfi_def_cfa_offset 24
>>>         pushq   array+48(%rip)
>>>         .cfi_def_cfa_offset 32
>>>         pushq   array+40(%rip)
>>>         .cfi_def_cfa_offset 40
>>>         pushq   array+32(%rip)
>>>         .cfi_def_cfa_offset 48
>>>         pushq   array+24(%rip)
>>>         .cfi_def_cfa_offset 56
>>>         pushq   array+16(%rip)
>>>         .cfi_def_cfa_offset 64
>>>         pushq   array+8(%rip)
>>>         .cfi_def_cfa_offset 72
>>>         pushq   array(%rip)
>>>         .cfi_def_cfa_offset 80
>>>         call    bar
>>>
>>> for that.  We do have the by-pieces infrastructure to generally do this kind of
>>> copying but in both of these cases we do not seem to use it.  I also wonder
>>> if the by-pieces infrastructure can pick up constant initializers automagically
>>> (we could native_encode the initializer part and feed the by-pieces
>>> infrastructure with an array of bytes).  There for example might be easy to
>>> immediate-store byte parts and difficult ones where we could decide on a
>>> case-by-case basis whether to load+store or immediate-store them.
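>>> A very rough sketch of that idea (untested;
>>> native_encode_initializer is the existing encoder, the glue
>>> around it is made up):
>>>
>>>   unsigned char buf[64];
>>>   /* Serialize the constant CTOR to target byte order.  */
>>>   int len = native_encode_initializer (ctor, buf, sizeof buf);
>>>   if (len > 0)
>>>     {
>>>       /* Walk BUF; emit immediate stores for the easy pieces
>>>          (all-zero / all-ones) and load+store for the rest.  */
>>>     }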
>>
>> I opened:
>>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100704
>>
>>> For example if I change your testcase to have the array[] initializer
>>> all-zero we currently emit
>>>
>>>         pxor    %xmm0, %xmm0
>>>         movups  %xmm0, (%rdi)
>>>         movups  %xmm0, 16(%rdi)
>>>         movups  %xmm0, 32(%rdi)
>>>         movups  %xmm0, 48(%rdi)
>>>         ret
>>>
>>> will your patch cause us to emit 4 loads?  OTOH if I do
>>>
>>> const struct S array[] = {
>>>   { 0, 0, 0, 7241, 124764, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
>>> };
>>>
>>> we get
>>>
>>>         movq    $0, (%rdi)
>>>         movl    $0, 8(%rdi)
>>>         movl    $0, 12(%rdi)
>>>         movl    $7241, 16(%rdi)
>>> ...
>>>
>>> ideally we'd have sth like
>>>
>>>     pxor %xmm0, %xmm0
>>>     movups  %xmm0, (%rdi)
>>>     movaps array+16(%rip), %xmm0
>>>     movups %xmm0, 16(%rdi)
>>> ...
>>>
>>> thus have the zeros written as immediates and the remaining pieces
>>> with load+stores.
>>>
>>> The by-pieces infrastructure eventually gets to see
>>>
>>> (mem/u/c:BLK (symbol_ref:DI ("array") [flags 0x2] <var_decl
>>> 0x7ffff7ff5b40 array>) [1 array+0 S64 A256])
>>>
>>> where the MEM_EXPR should provide a way to access the constant initializer.
>>>
>>> That said, I do agree the current code is a bit of a premature optimization
>>> - but maybe
>>> it should be fended off in expand_constructor, which has the cheap clear_storage
>>> first and which already does check can_move_by_pieces with some heuristics,
>>> but that seems to be guarded by
>>>
>>>            || (tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
>>>                && (! can_move_by_pieces
>>>                    (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
>>>                     TYPE_ALIGN (type)))
>>>                && ! mostly_zeros_p (exp))))
>>>
>>> which is odd (we _can_ move by pieces, but how does this apply to
>>> TREE_CONSTANT CTORs and avoid_temp_mem?).
>>>
>>> That said, I wonder if we want to elide expand_constructor when the
>>> CTOR is TREE_STATIC && TREE_CONSTANT and !mostly_zeros_p
>>> and we can_move_by_pieces.
>>>
>>> So sth like
>>>
>>> diff --git a/gcc/expr.c b/gcc/expr.c
>>> index 7139545d543..76b3bdf0c01 100644
>>> --- a/gcc/expr.c
>>> +++ b/gcc/expr.c
>>> @@ -8504,6 +8504,12 @@ expand_constructor (tree exp, rtx target, enum expand_modifier modifier,
>>>                && (! can_move_by_pieces
>>>                    (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
>>>                     TYPE_ALIGN (type)))
>>> +              && ! mostly_zeros_p (exp))
>>> +          || (TREE_CONSTANT (exp)
>>> +              && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
>>> +              && (can_move_by_pieces
>>> +                  (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
>>> +                   TYPE_ALIGN (type)))
>>>                && ! mostly_zeros_p (exp))))
>>>        || ((modifier == EXPAND_INITIALIZER || modifier == EXPAND_CONST_ADDRESS)
>>>           && TREE_CONSTANT (exp)))
>>>
>>> which handles your initializer and the all-zero one optimally?
>>>
>>
>> It works.  Here is the updated patch.
> 
> So just looking at the code again I think we probably want to add
> && avoid_temp_mem here, at least that's the case we're looking
> at.  Not sure if we ever arrive with TREE_CONSTANT CTORs
> and !avoid_temp_mem but if so we'd create a temporary here
> which of course would be pointless.
> 
> So maybe it's then clearer to split the condition out as
> 
> diff --git a/gcc/expr.c b/gcc/expr.c
> index 7139545d543..ee8f25f9abd 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -8523,6 +8523,19 @@ expand_constructor (tree exp, rtx target, enum expand_modifier modifier,
>        return constructor;
>      }
> 
> +  /* If the CTOR is available in static storage and not mostly
> +     zeros and we can move it by pieces prefer to do so since
> +     that's usually more efficient than performing a series of
> +     stores from immediates.  */
> +  if (avoid_temp_mem
> +      && TREE_STATIC (exp)
> +      && TREE_CONSTANT (exp)
> +      && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
> +      && can_move_by_pieces (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> +                            TYPE_ALIGN (type))
> +      && ! mostly_zeros_p (exp))
> +    return NULL_RTX;
> +
>    /* Handle calls that pass values in multiple non-contiguous
>       locations.  The Irix 6 ABI has examples of this.  */
>    if (target == 0 || ! safe_from_p (target, exp, 1)
> 
> 
> OK with that change.
> 

Note however (I've been playing with the previous version)
that the test case 

FAIL: gcc.target/i386/pr90773-25.c scan-assembler-times vmovdqu[\\\\t ]%ymm[0-9]+, \\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-25.c scan-assembler-times vmovdqu[\\\\t ]%ymm[0-9]+, 32\\\\(%[^,]+\\\\) 1

fails for --target_board=unix

$ grep movdqu pr90773-25.s
	vmovdqu	%xmm0, (%rdi)
	vmovdqu	%xmm1, 16(%rdi)
	vmovdqu	%xmm2, 32(%rdi)
	vmovdqu	%xmm3, 48(%rdi)

while the test expects %ymm
/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */

and

FAIL: gcc.target/i386/pr90773-24.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, \\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-24.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 16\\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-24.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 32\\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-24.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 48\\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-25.c scan-assembler-times vmovdqu[\\\\t ]%ymm[0-9]+, \\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-25.c scan-assembler-times vmovdqu[\\\\t ]%ymm[0-9]+, 32\\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times pxor[\\\\t ]%xmm[0-9]+, %xmm[0-9]+ 1
FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, \\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 16\\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 32\\\\(%[^,]+\\\\) 1
FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 48\\\\(%[^,]+\\\\) 1

fails for --target_board=unix/-m32
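
Presumably the scans either need to accept %xmm stores as well, or
the tests need to pin the preferred vector width, e.g. something like

/* { dg-options "-O2 -mavx2 -mprefer-vector-width=256" } */

in pr90773-25.c (just a guess on my side, I have not tried that).
The -m32 failures suggest the tests should be restricted to lp64,
e.g. with

/* { dg-require-effective-target lp64 } */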


Bernd.

> Thanks,
> Richard.
> 
>> Thanks.
>>
>> --
>> H.J.

