This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH GCC]Improve alias check code generation in vectorizer
- From: Richard Biener <richard dot guenther at gmail dot com>
- To: Bin Cheng <Bin dot Cheng at arm dot com>
- Cc: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, nd <nd at arm dot com>
- Date: Tue, 14 Jun 2016 14:42:23 +0200
- Subject: Re: [PATCH GCC]Improve alias check code generation in vectorizer
- Authentication-results: sourceware.org; auth=none
- References: <DB5PR08MB1144E0756813A571266140FFE7530 at DB5PR08MB1144 dot eurprd08 dot prod dot outlook dot com>
On Mon, Jun 13, 2016 at 12:06 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> Hi,
> Take subroutine "DACOP" from spec2k/200.sixtrack as an example, the loop needs to be versioned for vectorization because of possibly alias. The alias check for data-references are like:
>
> //pair 1
> dr_a:
> (Data Ref:
> bb: 8
> stmt: _92 = da.cc[_27];
> ref: da.cc[_27];
> )
> dr_b:
> (Data Ref:
> bb: 8
> stmt: da.cc[_93] = _92;
> ref: da.cc[_93];
> )
> //pair 2
> dr_a:
> (Data Ref:
> bb: 8
> stmt: pretmp_29 = da.i2[_27];
> ref: da.i2[_27];
> )
> dr_b:
> (Data Ref:
> bb: 8
> stmt: da.i2[_93] = pretmp_29;
> ref: da.i2[_93];
> )
> //pair 3
> dr_a:
> (Data Ref:
> bb: 8
> stmt: pretmp_28 = da.i1[_27];
> ref: da.i1[_27];
> )
> dr_b:
> (Data Ref:
> bb: 8
> stmt: da.i1[_93] = pretmp_28;
> ref: da.i1[_93];
> )
>
> The code generated for alias checks are as below:
>
> <bb 23>:
> # iy_186 = PHI <_413(22), 2(2)>
> # ivtmp_1050 = PHI <ivtmp_1049(22), 512(2)>
> _155 = iy_186 + -2;
> _156 = _155 * 516;
> _241 = iy_186 + -1;
> _242 = _241 * 516;
> _328 = iy_186 * 516;
> _413 = iy_186 + 1;
> _414 = _413 * 516;
> _499 = iy_186 + 2;
> _500 = _499 * 516;
> _998 = iy_186 * 516;
> _997 = (sizetype) _998;
> _996 = _997 + 6;
> _995 = _996 * 4;
> _994 = global_Output.2_16 + _995;
> _993 = iy_186 * 516;
> _992 = (long unsigned int) _993;
> _991 = _992 * 4;
> _990 = _991 + 18446744073709547488;
> _989 = global_Input.0_153 + _990;
> _886 = _989 >= _994;
> _885 = iy_186 * 516;
> _884 = (sizetype) _885;
> _883 = _884 + 1040;
> _882 = _883 * 4;
> _881 = global_Input.0_153 + _882;
> _880 = (sizetype) _998;
> _879 = _880 + 2;
> _878 = _879 * 4;
> _877 = global_Output.2_16 + _878;
> _876 = _877 >= _881;
> _875 = _876 | _886;
> _874 = iy_186 * 516;
> _873 = (sizetype) _874;
> _872 = _873 + 514;
> _871 = _872 * 4;
> _870 = global_Output.2_16 + _871;
> _869 = local_Filter_33 >= _870;
> _868 = local_Filter_33 + 100;
> _867 = (sizetype) _874;
> _866 = _867 + 2;
> _865 = _866 * 4;
> _864 = global_Output.2_16 + _865;
> _863 = _864 >= _868;
> _862 = _863 | _869;
> _861 = _862 & _875;
> if (_861 != 0)
> goto <bb 7>;
> else
> goto <bb 4>;
>
> It contains quite a lot redundant computations. Root cause is vectorizer simply translates alias checks into full address expressions comparison, and CSE opportunities are covered by foler. This patch improves function vect_create_cond_for_alias_checks by simplifying the comparison by comparing DR_BASE_ADDRESS/DR_INIT of both data-reference at compilation time. It also simplifies conditions:
> (addr_a_min + addr_a_length) <= addr_b_min || (addr_b_min + addr_b_length) <= addr_a_min
> into below form:
> cond_expr = addr_b_min - addr_a_min
> cond_expr >= addr_a_length || cond_expr <= -addr_b_length
> if the comparison is done in signed type. And this can be further simplified by folder if addr_a_length and addr_b_lengnth are equal/const, which is quite common.
> I looked into generated assembly, this patch does introduces small regression in some cases, but overall I think it's good. Bootstrap and test on x86_64 and AArch64. Is it OK?
Hmm. I think it's obviously good to make the generated expressions
smaller. Ideally we'd know the dependence distance
here (but have added them to the runtime checks because the distance
was symbolic - dependence analysis doesn't give us this information).
This means we often can't optimize the check to i0 - i1 >= 4 but
retain useless widening/multiplications. Does your patch help with
those?
int cmp = tree_int_cst_compare (init_a, init_b);
+ /* No need to consider init parts if they are equal in both DRs. */
+ if (wi::to_widest (init_a) != wi::to_widest (init_b))
cmp != 0
+ {
+ if (wi::to_widest (init_a) < wi::to_widest (init_b))
cmp < 0
+ {
....
this code seems to mimic some comparison folding we have? Does that
not trigger for some reason?
That said, not adding the address-base will make that not trigger if
the remainder of ops are unsigned.
Doing intermittent force_gimple_operand will make fold not see the
whole trees and thus simplifications get lost.
In fact you're mixing gimple stmt building and generic expr building -
ideally you'd use gimple_build (...) for
all of it (but that doesn't work as all the DR_ components are GENERIC).
So I think we have to be careful with this kind of micro-optimization
patch. If you'd extracted the "indices" to
compare somehow that would be great.
Another option would be to use tree-affine to simplify the compares.
Richard.
> Thanks,
> bin
>
> 2016-06-08 Bin Cheng <bin.cheng@arm.com>
>
> * tree-vect-loop-manip.c (vect_create_cond_for_alias_checks): New
> Parameter. Simplify alias check conditions at compilation time.
> (vect_loop_versioning): Pass new argument to above function.