[PATCH] Vectorization of BB reductions

Richard Sandiford richard.sandiford@arm.com
Wed Jun 16 13:13:33 GMT 2021


Richard Biener <rguenther@suse.de> writes:
> This adds a simple reduction vectorization capability to the
> non-loop vectorizer.  Simple meaning it lacks any of the fancy
> ways to generate the reduction epilogue but only supports
> those we can handle via a direct internal function reducing
> a vector to a scalar.  One of the main reasons is to avoid
> massive refactoring at this point but also that more complex
> epilogue operations are hardly profitable.
>
> Mixed sign reductions are fended off for now and I have not
> finally settled whether we want an explicit SLP node for the
> reduction epilogue operation.  Handling mixed signs could be
> done by multiplying with a { 1, -1, .. } vector.  Also fended
> off are reductions with non-internal operands (constants
> or register parameters for example).
>
> Costing is done by accounting the original scalar participating
> stmts for the scalar cost and log2 permutes and operations for
> the vectorized epilogue.
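
To make the { 1, -1, .. } idea above concrete, here is a minimal plain-C
sketch.  It is not taken from the patch (which rejects mixed-sign chains);
the function names and the fixed four-lane width are assumptions chosen
for illustration.  It only shows that a mixed-sign chain equals a plain
PLUS reduction over the lanes multiplied by a sign vector.  Integers are
used so that reassociation cannot change the result.

```c
#include <assert.h>
#include <stddef.h>

#define NLANES 4  /* hypothetical lane count for the illustration */

/* The straight-line, mixed-sign chain as it would appear in a
   basic block.  */
int
mixed_sign_scalar (const int *a)
{
  assert (a != NULL);
  return a[0] - a[1] + a[2] - a[3];
}

/* The same value computed as a uniform PLUS reduction: multiply each
   lane by { 1, -1, 1, -1 } first, then add all lanes.  */
int
mixed_sign_via_multiply (const int *a)
{
  static const int sign[NLANES] = { 1, -1, 1, -1 };
  int sum = 0;

  assert (a != NULL);
  for (size_t i = 0; i < NLANES; i++)
    sum += a[i] * sign[i];
  return sum;
}
```

The multiply is an extra vector operation, so whether this is ever
profitable would still be up to the cost model.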

It would be good if we had a standard way of asking for this
cost for both loops and SLP, perhaps based on the internal function.
E.g. for aarch64 we have a cost table that gives a more precise cost
(and log2 of the scalar op isn't always it :-)).

I don't have any specific suggestion for how, though.  And I guess it
can be a follow-on patch anyway.
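
For reference, the "log2 permutes and operations" accounting can be
modelled roughly as below.  This is only a sketch of the counting, not
GCC code: the function names are made up, the lane count is assumed to be
a power of two, and every step is treated as one permute plus one
operation, which (as noted above) is not how every target actually
implements the reduction.

```c
#include <assert.h>

/* Number of halving steps needed to reduce NLANES lanes to one,
   assuming NLANES is a power of two.  */
unsigned
epilogue_steps (unsigned nlanes)
{
  unsigned steps = 0;

  assert (nlanes != 0 && (nlanes & (nlanes - 1)) == 0);
  while (nlanes > 1)
    {
      nlanes >>= 1;
      steps++;
    }
  return steps;
}

/* Reduce V[0..NLANES-1] with PLUS by repeatedly adding the upper half
   onto the lower half.  Each outer iteration stands for one permute
   plus one vector add.  The input is overwritten.  Returns the sum and
   stores the number of steps performed in *STEPS_OUT.  */
int
tree_reduce_plus (int *v, unsigned nlanes, unsigned *steps_out)
{
  unsigned steps = 0;

  assert (nlanes != 0 && (nlanes & (nlanes - 1)) == 0);
  for (unsigned half = nlanes / 2; half >= 1; half /= 2)
    {
      for (unsigned i = 0; i < half; i++)
        v[i] += v[i + half];
      steps++;
    }
  *steps_out = steps;
  return v[0];
}
```

For eight lanes that is three steps, against seven scalar additions in
the original chain.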

> SPEC CPU 2017 FP with rate workload measurements show (picked
> fastest runs of three) regressions for 507.cactuBSSN_r (1.5%),
> 508.namd_r (2.5%), 511.povray_r (2.5%), 526.blender_r (0.5%) and
> 527.cam4_r (2.5%) and improvements for 510.parest_r (5%) and
> 538.imagick_r (1.5%).  This is with -Ofast -march=znver2 on a Zen2.
>
> Statistics on CPU 2017 show that the overwhelming number of seeds
> we find are reductions of two lanes (well - that's basically every
> associative operation).  That means we put quite high pressure
> on the SLP discovery process this way.
>
> In total we find 583218 seeds we put to SLP discovery out of which
> 66205 pass that and only 6185 of those make it through
> code generation checks. 796 of those are discarded because the reduction
> is part of a larger SLP instance.  4195 of the remaining
> are deemed not profitable to vectorize and 1194 are finally
> vectorized.  That's a poor 0.2% rate.

Oof.
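
The quoted numbers are at least self-consistent.  A small sketch of the
arithmetic, with helper names that are purely illustrative:

```c
#include <assert.h>

/* PART expressed as a percentage of WHOLE.  */
double
percent_of (unsigned long part, unsigned long whole)
{
  assert (whole != 0);
  return 100.0 * (double) part / (double) whole;
}

/* Seeds that end up vectorized: those passing the code generation
   checks, minus the ones covered by a larger SLP instance, minus the
   ones the cost model rejects.  */
unsigned long
seeds_vectorized (unsigned long passed_codegen, unsigned long covered,
                  unsigned long unprofitable)
{
  assert (covered + unprofitable <= passed_codegen);
  return passed_codegen - covered - unprofitable;
}
```

With the figures above, seeds_vectorized (6185, 796, 4195) gives 1194,
and percent_of (1194, 583218) is a little over 0.2.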

> Of the 583218 seeds 486826 (83%) have two lanes, 60912 have three (10%),
> 28181 four (5%), 4808 five, 909 six and there are instances up to 120
> lanes.
>
> There's a set of 54086 candidate seeds we reject because
> they contain a constant or invariant (not implemented yet) but still
> have two or more lanes that could be put to SLP discovery.

It looks like the patch doesn't explicitly forbid 2-element reductions
and instead relies on the cost model.  Is that right?

> Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
> built and tested SPEC CPU 2017 with -Ofast -march=znver2 successfully.
>
> I do think this is good enough(TM) at this point, please speak up
> if you disagree and/or would like to see changes.

No objection from me FWIW.  Looks like a nice feature :-)

Thanks,
Richard

>
> Thanks,
> Richard.
>
> 2021-06-16  Richard Biener   <rguenther@suse.de>
>
> 	PR tree-optimization/54400
> 	* tree-vectorizer.h (enum slp_instance_kind): Add
> 	slp_inst_kind_bb_reduc.
> 	(reduction_fn_for_scalar_code): Declare.
> 	* tree-vect-data-refs.c (vect_slp_analyze_instance_dependence):
> 	Check SLP_INSTANCE_KIND instead of looking at the
> 	representative.
> 	(vect_slp_analyze_instance_alignment): Likewise.
> 	* tree-vect-loop.c (reduction_fn_for_scalar_code): Export.
> 	* tree-vect-slp.c (vect_slp_linearize_chain): Split out
> 	chain linearization from vect_build_slp_tree_2 and generalize
> 	for the use of BB reduction vectorization.
> 	(vect_build_slp_tree_2): Adjust accordingly.
> 	(vect_optimize_slp): Elide permutes at the root of BB reduction
> 	instances.
> 	(vectorizable_bb_reduc_epilogue): New function.
> 	(vect_slp_prune_covered_roots): Likewise.
> 	(vect_slp_analyze_operations): Use them.
> 	(vect_slp_check_for_constructors): Recognize associatable
> 	chains for BB reduction vectorization.
> 	(vectorize_slp_instance_root_stmt): Generate code for the
> 	BB reduction epilogue.
>
> 	* gcc.dg/vect/bb-slp-pr54400.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c |  43 +++
>  gcc/tree-vect-data-refs.c                  |   9 +-
>  gcc/tree-vect-loop.c                       |   2 +-
>  gcc/tree-vect-slp.c                        | 383 +++++++++++++++++----
>  gcc/tree-vectorizer.h                      |   2 +
>  5 files changed, 367 insertions(+), 72 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c

