[PATCH 2/5] AArch64 sve: combine nested if predicates
Richard Sandiford
richard.sandiford@arm.com
Tue Nov 2 15:04:41 GMT 2021
Tamar Christina <Tamar.Christina@arm.com> writes:
> Hi All,
>
> Here’s a respin of the patch.
>
> The following example
>
> void f5(float * restrict z0, float * restrict z1, float *restrict x,
>         float * restrict y, float c, int n)
> {
>   for (int i = 0; i < n; i++) {
>     float a = x[i];
>     float b = y[i];
>     if (a > b) {
>       z0[i] = a + b;
>       if (a > c) {
>         z1[i] = a - b;
>       }
>     }
>   }
> }
>
> generates currently:
>
> ptrue p3.b, all
> ld1w z1.s, p1/z, [x2, x5, lsl 2]
> ld1w z2.s, p1/z, [x3, x5, lsl 2]
> fcmgt p0.s, p3/z, z1.s, z0.s
> fcmgt p2.s, p1/z, z1.s, z2.s
> fcmgt p0.s, p0/z, z1.s, z2.s
> and p0.b, p0/z, p1.b, p1.b
>
> The conditions for a > b and a > c become separate comparisons.
>
> After this patch we generate:
>
> ld1w z1.s, p0/z, [x2, x5, lsl 2]
> ld1w z2.s, p0/z, [x3, x5, lsl 2]
> fcmgt p1.s, p0/z, z1.s, z2.s
> fcmgt p1.s, p1/z, z1.s, z0.s
>
> Where the conditions a > b && a > c are folded by using the predicate result of
> the previous compare, which allows the removal of one of the compares.
>
> Whenever a mask is generated from a BIT_AND we mask the operands of
> the AND instead and then just AND the result.
>
> This allows us to CSE the masks and generate the right combination.
> However, because re-assoc will try to re-order the masks in the &, we now have to
> perform a small local CSE on the vectorized loop if vectorization is successful.
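Just to restate the intent in pseudo-gimple, to check we're thinking of the
same thing (the SSA names here are invented):

    loop_mask = ...                      /* mask for this iteration */
    mask_ab   = (a > b) & loop_mask      /* outer if; also used by the store to z0 */
    mask_ac   = (a > c) & loop_mask      /* inner condition, masked at the source */
    mask_both = mask_ab & mask_ac        /* already covered by loop_mask */

so the masked store to z1 can use mask_both as-is, and because the loop mask is
applied to the operands rather than to each use of the combined mask, the shared
sub-masks can be CSEd (which is what lets the second fcmgt be predicated on the
result of the first).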
>
> Note: This patch series is working incrementally towards generating the most
> efficient code for this and other loops in small steps.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu and x86_64-linux-gnu with no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> * tree-vect-stmts.c (prepare_load_store_mask): When combining two masks
> mask the operands instead of the combined operation.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch64/sve/pred-combine-and.c: New test.
>
> --- inline copy of patch ---
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred-combine-and.c b/gcc/testsuite/gcc.target/aarch64/sve/pred-combine-and.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..ed7fb591ec69dbdafe27fc9aa08a0b0910c94003
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pred-combine-and.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 --save-temps" } */
> +
> +void f5(float * restrict z0, float * restrict z1, float *restrict x, float * restrict y, float c, int n)
> +{
> +  for (int i = 0; i < n; i++) {
> +    float a = x[i];
> +    float b = y[i];
> +    if (a > b) {
> +      z0[i] = a + b;
> +      if (a > c) {
> +        z1[i] = a - b;
> +      }
> +    }
> +  }
> +}
> +
> +/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-9]+/z, z[0-9]+\.s, z[0-9]+\.s} 2 } } */
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index 1f56e10709e8f27d768c04f7ef914e2cd9347c36..27ee48aea429810a37777d907435a92b8fd1817d 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -6302,10 +6302,39 @@ vectorizable_operation (vec_info *vinfo,
>          }
>        else
>          {
> +          /* When combining two masks check is either of them has already been
> +             combined with a loop mask, if that's the case we can mark that the
I think something like:
s/has already been combined with/is elsewhere combined with/
might be clearer. The other statement that ANDs with the loop mask might
come before or after this one. In the latter case it won't have been
vectorised yet.
> +             new combined mask doesn't need to be combined with a loop mask. */
The way I imagined this last part working is that, after vectorising
a BIT_AND_EXPR like this:

    vop0' = vop0 & loop_mask_N
    vres = vop0' & vop1

we should record that vres is effectively already ANDed with loop_mask_N,
and so there's no need to create:

    vres' = vres & loop_mask_N

when vectorising the innermost IFN_MASK_STORE.
This could be a new hash_set in the loop_vec_info, containing
(condition, loop_mask) pairs for which condition & loop_mask == condition.
prepare_load_store_mask could check whether (vec_mask, loop_mask) is in
this set and return vec_mask unaltered if so.
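To make that more concrete, something like the following untested sketch is
what I have in mind (tree_cond_mask_hash, vec_cond_masked_set and the renamed
prepare_vec_mask are made-up names here, not existing code):

    /* Sketch only: a set of (vec_mask, loop_mask) pairs for which we know
       vec_mask & loop_mask == vec_mask.  It would live in the loop_vec_info,
       alongside scalar_cond_masked_set.  */
    typedef pair_hash <tree_operand_hash, tree_operand_hash> tree_cond_mask_hash;
    typedef hash_set <tree_cond_mask_hash> vec_cond_masked_set_type;

    /* prepare_load_store_mask, renamed and taught to skip redundant ANDs.
       Returns VEC_MASK unchanged if it is already known to be ANDed with
       LOOP_MASK, otherwise emits VEC_MASK & LOOP_MASK before GSI.  */
    static tree
    prepare_vec_mask (loop_vec_info loop_vinfo, tree mask_type,
                      tree loop_mask, tree vec_mask, gimple_stmt_iterator *gsi)
    {
      gcc_assert (useless_type_conversion_p (mask_type, TREE_TYPE (vec_mask)));
      if (!loop_mask)
        return vec_mask;

      gcc_assert (TREE_TYPE (loop_mask) == mask_type);
      if (loop_vinfo->vec_cond_masked_set.contains ({ vec_mask, loop_mask }))
        return vec_mask;

      tree and_res = make_temp_ssa_name (mask_type, NULL, "vec_mask_and");
      gimple *and_stmt = gimple_build_assign (and_res, BIT_AND_EXPR,
                                              vec_mask, loop_mask);
      gsi_insert_before (gsi, and_stmt, GSI_SAME_STMT);
      return and_res;
    }

The body is basically prepare_load_store_mask as it stands, with the contains
check and the loop_vinfo parameter added; other callers that AND with the loop
mask by hand could then be funnelled through the same helper.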
This new set would in a sense have the opposite effect of
scalar_cond_masked_key. scalar_cond_masked_key encourages statements
to create ANDs with the loop mask in cases where they might not have
done otherwise. The new set would instead tell statements (and
specifically prepare_load_store_mask) that such ANDs aren't necessary.
I guess we should also pass other ANDs with the loop mask through
prepare_load_store_mask (and rename it), so that they get the benefit too.
It looks like the patch tried to do that by entering scalar_dest
into the scalar_cond_masked_set. I think that's likely to be
counter-productive though, since it would tell other statements
in the loop that they should AND the scalar dest with the loop mask
even if they wouldn't have done normally.
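Concretely, the BIT_AND_EXPR handling could then drop the
scalar_cond_masked_set.add (cond) calls and instead record the vector result
once it has been generated, along these lines (again only a sketch, reusing the
made-up set above and assuming the loop mask has been hoisted into a local
loop_mask):

    /* new_temp = vop0 & vop1, where at least one operand has already been
       ANDed with loop_mask, so new_temp & loop_mask == new_temp.  Record
       that, so that e.g. the inner masked store doesn't AND with the loop
       mask again, rather than telling other statements to mask
       scalar_dest.  */
    if (masked_loop_p && code == BIT_AND_EXPR && loop_mask)
      loop_vinfo->vec_cond_masked_set.add ({ new_temp, loop_mask });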
> +          if (masked_loop_p && code == BIT_AND_EXPR)
> +            {
> +              scalar_cond_masked_key cond1 (op0, ncopies);
Nit, but: calling them cond0 and cond1 might be better, so that the names
match op0 and op1 (at the moment cond2 describes op1 rather than op2).
The patch looks good otherwise. I think we're there in terms of
deciding when to generate the ANDs. It's just a case of what
to do with the result.
Thanks,
Richard
> +              if (loop_vinfo->scalar_cond_masked_set.contains (cond1))
> +                {
> +                  tree mask = vect_get_loop_mask (gsi, masks, vec_num * ncopies,
> +                                                  vectype, i);
> +
> +                  vop0 = prepare_load_store_mask (TREE_TYPE (mask), mask, vop0, gsi);
> +                  scalar_cond_masked_key cond (scalar_dest, ncopies);
> +                  loop_vinfo->scalar_cond_masked_set.add (cond);
> +                }
> +
> +              scalar_cond_masked_key cond2 (op1, ncopies);
> +              if (loop_vinfo->scalar_cond_masked_set.contains (cond2))
> +                {
> +                  tree mask = vect_get_loop_mask (gsi, masks, vec_num * ncopies,
> +                                                  vectype, i);
> +
> +                  vop1 = prepare_load_store_mask (TREE_TYPE (mask), mask, vop1, gsi);
> +                  scalar_cond_masked_key cond (scalar_dest, ncopies);
> +                  loop_vinfo->scalar_cond_masked_set.add (cond);
> +                }
> +            }
> +
>            new_stmt = gimple_build_assign (vec_dest, code, vop0, vop1, vop2);
>            new_temp = make_ssa_name (vec_dest, new_stmt);
>            gimple_assign_set_lhs (new_stmt, new_temp);
>            vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> +
>            if (vec_cvt_dest)
>              {
>                new_temp = build1 (VIEW_CONVERT_EXPR, vectype_out, new_temp);