[PATCH][ivopts]: perform affine fold on unsigned addressing modes known not to overflow. [PR114932]
Richard Biener
rguenther@suse.de
Wed Jun 19 12:13:52 GMT 2024
On Fri, 14 Jun 2024, Tamar Christina wrote:
> Hi All,
>
> When the patch for PR114074 was applied we saw a good boost in exchange2.
>
> This boost was partially caused by a simplification of the addressing modes.
> With the patch applied IV opts saw the following form for the base addressing;
>
> Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) *
> 324) + 36)
>
> vs what we normally get:
>
> Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
> * 81) + 9) * 4
>
> This is because the patch promoted multiplies where one operand is a constant
> from a signed multiply to an unsigned one, to attempt to fold away the constant.
>
> This patch attempts the same but due to the various problems with SCEV and
> niters not being able to analyze the resulting forms (i.e. PR114322) we can't
> do it during SCEV or in the general form like in fold-const like extract_muldiv
> attempts.
>
> Instead this applies the simplification during IVopts initialization when we
> create the IV. Essentially when we know the IV won't overflow with regards to
> niters then we perform an affine fold which gets it to simplify the internal
> computation, even if this is signed because we know that for IVOPTs uses the
> IV won't ever overflow. This allows IV opts to see the simplified form
> without influencing the rest of the compiler.
>
> as mentioned in PR114074 it would be good to fix the missed optimization in the
> other passes so we can perform this in general.
>
> The reason this has a big impact on fortran code is that fortran doesn't seem to
> have unsigned integer types. As such all it's addressing are created with
> signed types and folding does not happen on them due to the possible overflow.
>
> concretely on AArch64 this changes the results from generation:
>
> mov x27, -108
> mov x24, -72
> mov x23, -36
> add x21, x1, x0, lsl 2
> add x19, x20, x22
> .L5:
> add x0, x22, x19
> add x19, x19, 324
> ldr d1, [x0, x27]
> add v1.2s, v1.2s, v15.2s
> str d1, [x20, 216]
> ldr d0, [x0, x24]
> add v0.2s, v0.2s, v15.2s
> str d0, [x20, 252]
> ldr d31, [x0, x23]
> add v31.2s, v31.2s, v15.2s
> str d31, [x20, 288]
> bl digits_20_
> cmp x21, x19
> bne .L5
>
> into:
>
> .L5:
> ldr d1, [x19, -108]
> add v1.2s, v1.2s, v15.2s
> str d1, [x20, 216]
> ldr d0, [x19, -72]
> add v0.2s, v0.2s, v15.2s
> str d0, [x20, 252]
> ldr d31, [x19, -36]
> add x19, x19, 324
> add v31.2s, v31.2s, v15.2s
> str d31, [x20, 288]
> bl digits_20_
> cmp x21, x19
> bne .L5
>
> The two patches together results in a 10% performance increase in exchange2 in
> SPECCPU 2017 and a 4% reduction in binary size and a 5% improvement in compile
> time. There's also a 5% performance improvement in fotonik3d and similar
> reduction in binary size.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> PR tree-optimization/114932
> * tree-scalar-evolution.cc (alloc_iv): Perform affine unsigned fold.
>
> gcc/testsuite/ChangeLog:
>
> PR tree-optimization/114932
> * gfortran.dg/addressing-modes_1.f90: New test.
>
> ---
> diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90 b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> new file mode 100644
> index 0000000000000000000000000000000000000000..334d5bc47a16e53e9168bb1f90dfeff584b4e494
> --- /dev/null
> +++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> @@ -0,0 +1,37 @@
> +! { dg-do compile { target aarch64-*-* } }
> +! { dg-additional-options "-w -Ofast" }
> +
> + module brute_force
> + integer, parameter :: r=9
> + integer block(r, r, 0)
> + contains
> + subroutine brute
> + do
> + do
> + do
> + do
> + do
> + do
> + do i7 = l0, 1
> + select case(1 )
> + case(1)
> + block(:2, 7:, 1) = block(:2, 7:, i7) - 1
> + end select
> + do i8 = 1, 1
> + do i9 = 1, 1
> + if(1 == 1) then
> + call digits_20
> + end if
> + end do
> + end do
> + end do
> + end do
> + end do
> + end do
> + end do
> + end do
> + end do
> + end
> + end
> +
> +! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
> diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> index 4338d7b64a6c2df6404b8d5e51c7f62c23006e72..f621e4ee924b930e1e1d68e35f3d3a0d52470811 100644
> --- a/gcc/tree-ssa-loop-ivopts.cc
> +++ b/gcc/tree-ssa-loop-ivopts.cc
> @@ -1216,6 +1216,18 @@ alloc_iv (struct ivopts_data *data, tree base, tree step,
> base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> }
>
> + /* If we know the IV won't overflow wrt niters and the type is an unsigned
> + type then fold using affine unsigned arithmetic to allow more folding of
> + constants. */
> + if (no_overflow
> + && TYPE_UNSIGNED (TREE_TYPE (expr)))
> + {
> + aff_tree comb;
> + tree utype = unsigned_type_for (TREE_TYPE (expr));
> + tree_to_aff_combination (expr, utype, &comb);
> + base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> + }
> +
So right above we already do
/* Lower address expression in base except ones with DECL_P as operand.
By doing this:
1) More accurate cost can be computed for address expressions;
2) Duplicate candidates won't be created for bases in different
forms, like &a[0] and &a. */
STRIP_NOPS (expr);
if ((TREE_CODE (expr) == ADDR_EXPR && !DECL_P (TREE_OPERAND (expr, 0)))
|| contain_complex_addr_expr (expr))
{
aff_tree comb;
tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
base = fold_convert (TREE_TYPE (base), aff_combination_to_tree
(&comb));
}
and if I read correctly 'expr' is
(integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
* 81) + 9) * 4
in your interesting case which means it doesn't satisfy
contain_complex_addr_expr.
I don't quite get why rewriting the base into (T)(unsigned)... is
only valid when no_overflow - no_overflow is about {base, +, step},
not about any overflow contained in 'base'.
I wonder if we maybe want to record an "original" iv->base
to be used for code generation (and there only the unexpanded form)
and a variant used for the various sorts of canonicalization/compare
(I see we eventually add/subtract step and then compare against
sth else). And then apply this normalization always to the not
"original" form.
The above STRIP_NOPS (expr) + expand might turn an unsigned
affine combination into a signed one which might be problematic.
So what happens if you change the above to simply always
unsigned expand?
Richard.
> iv->base = base;
> iv->base_object = determine_base_object (data, base);
> iv->step = step;
>
>
>
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
More information about the Gcc-patches
mailing list