This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH GCC]Improve alias check code generation in vectorizer


On Mon, Jun 13, 2016 at 12:06 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> Hi,
> Take subroutine "DACOP" from spec2k/200.sixtrack as an example, the loop needs to be versioned for vectorization because of possibly alias.  The alias check for data-references are like:
>
> //pair 1
> dr_a:
> (Data Ref:
>   bb: 8
>   stmt: _92 = da.cc[_27];
>   ref: da.cc[_27];
> )
> dr_b:
> (Data Ref:
>   bb: 8
>   stmt: da.cc[_93] = _92;
>   ref: da.cc[_93];
> )
> //pair 2
> dr_a:
> (Data Ref:
>   bb: 8
>   stmt: pretmp_29 = da.i2[_27];
>   ref: da.i2[_27];
> )
> dr_b:
> (Data Ref:
>   bb: 8
>   stmt: da.i2[_93] = pretmp_29;
>   ref: da.i2[_93];
> )
> //pair 3
> dr_a:
> (Data Ref:
>   bb: 8
>   stmt: pretmp_28 = da.i1[_27];
>   ref: da.i1[_27];
> )
> dr_b:
> (Data Ref:
>   bb: 8
>   stmt: da.i1[_93] = pretmp_28;
>   ref: da.i1[_93];
> )
>
> The code generated for alias checks are as below:
>
>   <bb 23>:
>   # iy_186 = PHI <_413(22), 2(2)>
>   # ivtmp_1050 = PHI <ivtmp_1049(22), 512(2)>
>   _155 = iy_186 + -2;
>   _156 = _155 * 516;
>   _241 = iy_186 + -1;
>   _242 = _241 * 516;
>   _328 = iy_186 * 516;
>   _413 = iy_186 + 1;
>   _414 = _413 * 516;
>   _499 = iy_186 + 2;
>   _500 = _499 * 516;
>   _998 = iy_186 * 516;
>   _997 = (sizetype) _998;
>   _996 = _997 + 6;
>   _995 = _996 * 4;
>   _994 = global_Output.2_16 + _995;
>   _993 = iy_186 * 516;
>   _992 = (long unsigned int) _993;
>   _991 = _992 * 4;
>   _990 = _991 + 18446744073709547488;
>   _989 = global_Input.0_153 + _990;
>   _886 = _989 >= _994;
>   _885 = iy_186 * 516;
>   _884 = (sizetype) _885;
>   _883 = _884 + 1040;
>   _882 = _883 * 4;
>   _881 = global_Input.0_153 + _882;
>   _880 = (sizetype) _998;
>   _879 = _880 + 2;
>   _878 = _879 * 4;
>   _877 = global_Output.2_16 + _878;
>   _876 = _877 >= _881;
>   _875 = _876 | _886;
>   _874 = iy_186 * 516;
>   _873 = (sizetype) _874;
>   _872 = _873 + 514;
>   _871 = _872 * 4;
>   _870 = global_Output.2_16 + _871;
>   _869 = local_Filter_33 >= _870;
>   _868 = local_Filter_33 + 100;
>   _867 = (sizetype) _874;
>   _866 = _867 + 2;
>   _865 = _866 * 4;
>   _864 = global_Output.2_16 + _865;
>   _863 = _864 >= _868;
>   _862 = _863 | _869;
>   _861 = _862 & _875;
>   if (_861 != 0)
>     goto <bb 7>;
>   else
>     goto <bb 4>;
>
> It contains quite a lot redundant computations.  Root cause is vectorizer simply translates alias checks into full address expressions comparison, and CSE opportunities are covered by foler.  This patch improves function vect_create_cond_for_alias_checks by simplifying the comparison by comparing DR_BASE_ADDRESS/DR_INIT of both data-reference at compilation time.  It also simplifies conditions:
>   (addr_a_min + addr_a_length) <= addr_b_min || (addr_b_min + addr_b_length) <= addr_a_min
> into below form:
>   cond_expr = addr_b_min - addr_a_min
>   cond_expr >= addr_a_length || cond_expr <= -addr_b_length
> if the comparison is done in signed type.  And this can be further simplified by folder if addr_a_length and addr_b_lengnth are equal/const, which is quite common.
> I looked into generated assembly, this patch does introduces small regression in some cases, but overall I think it's good.  Bootstrap and test on x86_64 and AArch64.  Is it OK?

Hmm.  I think it's obviously good to make the generated expressions
smaller.  Ideally we'd know the dependence distance
here (but have added them to the runtime checks because the distance
was symbolic - dependence analysis doesn't give us this information).
This means we often can't optimize the check to i0 - i1 >= 4 but
retain useless widening/multiplications.  Does your patch help with
those?

int cmp = tree_int_cst_compare (init_a, init_b);

+      /* No need to consider init parts if they are equal in both DRs.  */
+      if (wi::to_widest (init_a) != wi::to_widest (init_b))

cmp != 0

+       {
+         if (wi::to_widest (init_a) < wi::to_widest (init_b))

cmp < 0

+           {
....

this code seems to mimic some comparison folding we have?  Does that
not trigger for some reason?

That said, not adding the address-base will make that not trigger if
the remainder of ops are unsigned.

Doing intermittent force_gimple_operand will make fold not see the
whole trees and thus simplifications get lost.
In fact you're mixing gimple stmt building and generic expr building -
ideally you'd use gimple_build (...) for
all of it (but that doesn't work as all the DR_ components are GENERIC).

So I think we have to be careful with this kind of micro-optimization
patch.  If you'd extracted the "indices" to
compare somehow that would be great.

Another option would be to use tree-affine to simplify the compares.

Richard.



> Thanks,
> bin
>
> 2016-06-08  Bin Cheng  <bin.cheng@arm.com>
>
>         * tree-vect-loop-manip.c (vect_create_cond_for_alias_checks): New
>         Parameter.  Simplify alias check conditions at compilation time.
>         (vect_loop_versioning): Pass new argument to above function.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]