[RFC] middle-end: Extend CSE to understand vector extracts.
Richard Biener
rguenther@suse.de
Mon Jan 4 13:33:21 GMT 2021
On Mon, 4 Jan 2021, Tamar Christina wrote:
> Hi All,
>
> I am trying to get CSE to re-use constants already inside a vector rather than
> re-materializing the constant again.
>
> Basically consider the following case:
>
> #include <stdint.h>
> #include <arm_neon.h>
>
> uint64_t
> test (uint64_t a, uint64x2_t b, uint64x2_t* rt)
> {
> uint64_t arr[2] = { 0x0942430810234076UL, 0x0942430810234076UL};
> uint64_t res = a | arr[0];
> uint64x2_t val = vld1q_u64 (arr);
> *rt = vaddq_u64 (val, b);
> return res;
> }
>
> The actual behavior is inconsequential however notice that the same constants
> are used in the vector (arr and later val) and in the calculation of res.
>
> The code we generate for this however is quite sub-optimal:
>
> test:
> adrp x2, .LC0
> sub sp, sp, #16
> ldr q1, [x2, #:lo12:.LC0]
> mov x2, 16502
> movk x2, 0x1023, lsl 16
> movk x2, 0x4308, lsl 32
> add v1.2d, v1.2d, v0.2d
> movk x2, 0x942, lsl 48
> orr x0, x0, x2
> str q1, [x1]
> add sp, sp, 16
> ret
> .LC0:
> .xword 667169396713799798
> .xword 667169396713799798
>
> Essentially we materialize the same constant twice. The reason for this is
> because the front-end lowers the constant extracted from arr[0] quite early on.
> If you look into the result of fre you'll find
>
> <bb 2> :
> arr[0] = 667169396713799798;
> arr[1] = 667169396713799798;
> res_7 = a_6(D) | 667169396713799798;
> _16 = __builtin_aarch64_ld1v2di (&arr);
> _17 = VIEW_CONVERT_EXPR<uint64x2_t>(_16);
> _11 = b_10(D) + _17;
> *rt_12(D) = _11;
> arr ={v} {CLOBBER};
> return res_7;
>
> Which makes sense for further optimization. However come expand time if the
> constant isn't representable in the target arch it will be assigned to a
> register again.
>
> (insn 8 5 9 2 (set (reg:V2DI 99)
> (const_vector:V2DI [
> (const_int 667169396713799798 [0x942430810234076]) repeated x2
> ])) "cse.c":7:12 -1
> (nil))
> ...
> (insn 14 13 15 2 (set (reg:DI 103)
> (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
> (nil))
> (insn 15 14 16 2 (set (reg:DI 102 [ res ])
> (ior:DI (reg/v:DI 96 [ a ])
> (reg:DI 103))) "cse.c":8:12 -1
> (nil))
>
> And since it's out of the immediate range of the scalar instruction used
> combine won't be able to do anything here.
>
> This will then trigger the re-materialization of the constant twice.
>
> So I figured the best place to handle this is in CSE since in some uArch it's
> far cheaper to extract a constant from a vector than to materialize it.
>
> Particularly doing it pre-RA has the benefit of allowing RA to decide whether it
> needs to move the constant between register files or not as some uArch can
> perform scalar operation both on the SIMD and GENREG side.
>
> The issue is I don't know that much about CSE. I have been reading through the
> source and think I have a basic understanding of how it works but this email is
> to see if I'm on the right track or not (to something that is acceptable
> upstream).
>
> My current patch for CSE is:
>
> diff --git a/gcc/cse.c b/gcc/cse.c
> index 36bcfc354d8..3cee53bed85 100644
> --- a/gcc/cse.c
> +++ b/gcc/cse.c
> @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3. If not see
> #include "rtl-iter.h"
> #include "regs.h"
> #include "function-abi.h"
> +#include "expr.h"
>
> /* The basic idea of common subexpression elimination is to go
> through the code, keeping a record of expressions that would
> @@ -4306,6 +4307,20 @@ find_sets_in_insn (rtx_insn *insn, struct set **psets)
> someplace else, so it isn't worth cse'ing. */
> else if (GET_CODE (SET_SRC (x)) == CALL)
> ;
> + else if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
> + {
> + /* First register the vector itself. */
> + sets[n_sets++].rtl = x;
> + rtx src = SET_SRC (x);
> + machine_mode elem_mode = GET_MODE_INNER (GET_MODE (src));
> + /* Go over the constants of the CONST_VECTOR in forward order, to
> + put them in the same order in the SETS array. */
> + for (unsigned i = 0; i < const_vector_encoded_nelts (src) ; i++)
> + {
> + rtx y = gen_rtx_SUBREG (elem_mode, SET_DEST (x), i);
> + sets[n_sets++].rtl = PATTERN (gen_move_insn (y, CONST_VECTOR_ELT (src, i)));
> + }
> + }
> else
> sets[n_sets++].rtl = x;
> }
> @@ -4545,7 +4560,14 @@ cse_insn (rtx_insn *insn)
> struct set *sets = (struct set *) 0;
>
> if (GET_CODE (x) == SET)
> - sets = XALLOCA (struct set);
> + {
> + /* For CONST_VECTOR we wants to be able to CSE the vector itself along with
> + elements inside the vector if the target says it's cheap. */
> + if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
> + sets = XALLOCAVEC (struct set, const_vector_encoded_nelts (SET_SRC (x)) + 1);
> + else
> + sets = XALLOCA (struct set);
> + }
> else if (GET_CODE (x) == PARALLEL)
> sets = XALLOCAVEC (struct set, XVECLEN (x, 0));
>
> --
>
> This extends the sets that CSE uses to perform CSE to not only contain the
> CONST_VECTOR but also the individual elements of the vector.
>
> For each element I generate new RTL which models them as a constant being set
> into a subreg of the original vector at the index of the element in the vector.
>
> This so that the SRC is the constant we want to CSE and DEST contains the
> SUBREG to extract from the vector.
>
> It works as expected, the testcase above generates:
>
> test:
> adrp x2, .LC0
> sub sp, sp, #16
> ldr q1, [x2, #:lo12:.LC0]
> add v0.2d, v1.2d, v0.2d
> fmov x2, d1
> str q0, [x1]
> orr x0, x0, x2
> add sp, sp, 16
> ret
> .LC0:
> .xword 667169396713799798
> .xword 667169396713799798
>
> The problem is that this is somewhat accidental. CSE is single pass, presumably
> because it currently only tracks SETs of constants where any of the duplicates
> can be replaced by any alternative (it does pick the cheapest, but all the
> alternatives are valid.).
>
> This breaks with vectors because vectors can only be used as a SRC. The code
> does validate that the resulting CSE is valid, so this does not break.
>
> but if the INSN are flipped in RTL:
>
> (insn 14 13 15 2 (set (reg:DI 103)
> (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
> (nil))
> ...
> (insn 8 5 9 2 (set (reg:V2DI 99)
> (const_vector:V2DI [
> (const_int 667169396713799798 [0x942430810234076]) repeated x2
> ])) "cse.c":7:12 -1
> (nil))
>
> This no longer works, because it sees the constant version in insn 14 before it
> sees insn 8. When we find insn 8 we can tell that there is an instruction that
> can be replaced by insn 8, but we don't know the original insn and so as a
> consequence we can't update it.
>
> so questions:
>
> 1) Does what I'm doing make sense?
> 2) Is there anyway to go from a SET to an insn?
> 3) If not, can I store the insn in table_elt and have cse_insn produce a worklist
> of additional insn that need to be re-examined?
Without being able to comment on RTL or the CSE implementation the
issue at hand (optimizing constant generation / placement) doesn't
fit CSE well but it's more a global LCM/PRE problem. There's also
the issue that while on x86 many constants _are_ valid as immediates
CSEing them into a register (if one is available!) is still
profitable but RTL passes generally propagate / duplicate them
back into the instructions where they are valid (so "fixing" things
on GIMPLE generally doesn't work).
Also IIRC targets can delegitmize constants late (during reload/LRA)
which might cause extra complication.
Richard.
> Thanks,
> Tamar
>
> --- inline copy of patch --
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
More information about the Gcc-patches
mailing list