[PATCH PR96053] Add "#pragma GCC no_reduc_chain"

Mon Jul 27 13:06:33 GMT 2020

Sorry for the late reply!

-----Original Message-----
From: Richard Biener [mailto:rguenther@suse.de]
Sent: Wednesday, July 22, 2020 3:02 PM

> First of all I think giving users more control over vectorization is 
> good.  Now as for "#pragma GCC no_reduc_chain" I'd like to avoid 
> negatives and terms internal to GCC.  I also would like to see 
> vectorization pragmas to be grouped somehow, also to avoid bit 
> explosion in struct loop.  There's already annot_expr_no_vector_kind 
> and annot_expr_vector_kind both only used by the fortran FE at the 
> moment.  Note ANNOATE_EXPR already allows an extra argument thus only 
> annot_expr_vector_kind should prevail with its argument specifying a 
> bitmask of vectorizer hints.  We'd have an extra enum for those like
> 
> enum annot_vector_subkind {
>   annot_vector_never = 0,
>   annot_vector_auto = 1, // this is the default
>   annot_vector_always = 3,
>   your new flag
> };
> 
> and the user would specify it via
> 
> #pragma GCC vect [(never|always|auto)] [your new flag]

I'll add the code to complete the above modifications.

> now, I honestly have a difficulty in suggesting a better name than 
> no_reduc_chain.  Quoting the testcase:
> 
> +double f(double *a, double *b)
> +{
> +  double res1 = 0;
> +  double res0 = 0;
> +#pragma GCC no_reduc_chain
> +  for (int i = 0 ; i < 1000; i+=4) {
> +    res0 += a[i] * b[i];
> +    res1 += a[i+1] * b[i*1];
> +    res0 += a[i+2] * b[i+2];
> +    res1 += a[i+3] * b[i+3];
> +  }
> +  return res0 + res1;
> +}
> 
> for your case with IIRC V2DF vectors using reduction chains will 
> result in a vectorization factor of two while with a SLP reduction the 
> vectorization factor is one.

I reconfirmed the situation.  Using reduction chains also result in a vectorization factor of one as the same as using reductions.
Related logs are followed:

1.using reduction chains
pr96053.c:6:3: note:   Final SLP tree for instance:
pr96053.c:6:3: note:   node 0x1a4f4ba0 (max_nunits=2, refcnt=2)
pr96053.c:6:3: note:    stmt 0 res1_37 = _14 + res1_48;
pr96053.c:6:3: note:    stmt 1 res1_39 = _28 + res1_37;
...
pr96053.c:6:3: note:   === vect_make_slp_decision ===
pr96053.c:6:3: note:   Decided to SLP 2 instances. Unrolling factor 1
pr96053.c:6:3: note:   === vect_detect_hybrid_slp ===
pr96053.c:6:3: note:   === vect_update_vf_for_slp ===
pr96053.c:6:3: note:   Loop contains only SLP stmts
pr96053.c:6:3: note:   Updating vectorization factor to 1.
pr96053.c:6:3: note:  vectorization_factor = 1, niters = 250

2.using reductions
pr96053.c:6:3: note:   Final SLP tree for instance:
pr96053.c:6:3: note:   node 0x3a7f2cb0 (max_nunits=2, refcnt=2)
pr96053.c:6:3: note:    stmt 0 res0_38 = _21 + res0_36;
pr96053.c:6:3: note:    stmt 1 res1_39 = _28 + res1_37;
...
pr96053.c:6:3: note:   === vect_make_slp_decision ===
pr96053.c:6:3: note:   Decided to SLP 1 instances. Unrolling factor 1
pr96053.c:6:3: note:   === vect_detect_hybrid_slp ===
pr96053.c:6:3: note:   === vect_update_vf_for_slp ===
pr96053.c:6:3: note:   Loop contains only SLP stmts
pr96053.c:6:3: note:   Updating vectorization factor to 1.
pr96053.c:6:3: note:  vectorization_factor = 1, niters = 250

> So maybe it is better to give the user control over the vectorization factor?
> That's desirable in other cases where the user wants to force a larger 
> VF to get extra unrolling for example.  For the testcase above you'd 
> use
> 
> #pragma GCC vect vf(1)
> 
> or so (syntax to be discussed).  The side-effect would be that with a 
> reduction chain the VF request cannot be fulfilled but with a SLP 
> reduction it can.  Of course no_reduc_chain is much easier to actually 
> implement in a strict way while specifying VF will likely need to be 
> documented as a hint (with an eventual diagnostic if it wasn't 
> fulfilled)
> 
> Richard/Jakub, any thoughts?
> 
> Thanks,
> Richard.

Thanks,
Kaipeng Zhou