[gomp4 simd, RFC] Simple fix to override vectorization cost estimation.

Fri Nov 15 15:24:00 GMT 2013

On Fri, 15 Nov 2013, Sergey Ostanevich wrote:

> Richard,
> 
> here's an example that causes trigger for the cost model.

I hardly believe that (AVX2)

.L9:
        vmovups (%rsi), %xmm3
        addl    $1, %r8d
        addq    $256, %rsi
        vinsertf128     $0x1, -240(%rsi), %ymm3, %ymm1
        vmovups -224(%rsi), %xmm3
        vinsertf128     $0x1, -208(%rsi), %ymm3, %ymm3
        vshufps $136, %ymm3, %ymm1, %ymm3
        vperm2f128      $3, %ymm3, %ymm3, %ymm2
        vshufps $68, %ymm2, %ymm3, %ymm1
        vshufps $238, %ymm2, %ymm3, %ymm2
        vmovups -192(%rsi), %xmm3
        vinsertf128     $1, %xmm2, %ymm1, %ymm2
        vinsertf128     $0x1, -176(%rsi), %ymm3, %ymm1
        vmovups -160(%rsi), %xmm3
        vinsertf128     $0x1, -144(%rsi), %ymm3, %ymm3
        vshufps $136, %ymm3, %ymm1, %ymm3
        vperm2f128      $3, %ymm3, %ymm3, %ymm1
        vshufps $68, %ymm1, %ymm3, %ymm4
        vshufps $238, %ymm1, %ymm3, %ymm1
        vmovups -128(%rsi), %xmm3
        vinsertf128     $1, %xmm1, %ymm4, %ymm1
        vshufps $136, %ymm1, %ymm2, %ymm1
        vperm2f128      $3, %ymm1, %ymm1, %ymm2
        vshufps $68, %ymm2, %ymm1, %ymm4
        vshufps $238, %ymm2, %ymm1, %ymm2
        vinsertf128     $0x1, -112(%rsi), %ymm3, %ymm1
        vmovups -96(%rsi), %xmm3
        vinsertf128     $1, %xmm2, %ymm4, %ymm4
        vinsertf128     $0x1, -80(%rsi), %ymm3, %ymm3
        vshufps $136, %ymm3, %ymm1, %ymm3
        vperm2f128      $3, %ymm3, %ymm3, %ymm2
        vshufps $68, %ymm2, %ymm3, %ymm1
        vshufps $238, %ymm2, %ymm3, %ymm2
        vmovups -64(%rsi), %xmm3
        vinsertf128     $1, %xmm2, %ymm1, %ymm2
        vinsertf128     $0x1, -48(%rsi), %ymm3, %ymm1
        vmovups -32(%rsi), %xmm3
        vinsertf128     $0x1, -16(%rsi), %ymm3, %ymm3
        cmpl    %r8d, %edi
        vshufps $136, %ymm3, %ymm1, %ymm3
        vperm2f128      $3, %ymm3, %ymm3, %ymm1
        vshufps $68, %ymm1, %ymm3, %ymm5
        vshufps $238, %ymm1, %ymm3, %ymm1
        vinsertf128     $1, %xmm1, %ymm5, %ymm1
        vshufps $136, %ymm1, %ymm2, %ymm1
        vperm2f128      $3, %ymm1, %ymm1, %ymm2
        vshufps $68, %ymm2, %ymm1, %ymm3
        vshufps $238, %ymm2, %ymm1, %ymm2
        vinsertf128     $1, %xmm2, %ymm3, %ymm1
        vshufps $136, %ymm1, %ymm4, %ymm1
        vperm2f128      $3, %ymm1, %ymm1, %ymm2
        vshufps $68, %ymm2, %ymm1, %ymm3
        vshufps $238, %ymm2, %ymm1, %ymm2
        vinsertf128     $1, %xmm2, %ymm3, %ymm2
        vaddps  %ymm2, %ymm0, %ymm0
        ja      .L9

is more efficient than

.L3:
        vaddss  (%rcx,%rax), %xmm0, %xmm0
        addq    $32, %rax
        cmpq    %rdx, %rax
        jne     .L3

;)

> As soon as
> elemental functions will appear and we update the vectorizer so it can accept
> an elemental function inside the loop - we will have the same
> situation as we have
> it now: cost model will bail out with profitability estimation.

Yes.

> Still we have no chance to get info on how efficient the bar() function when it
> is in vector form.

Well I assume you mean that the speedup when vectorizing the elemental
will offset whatever wreckage we cause with vectorizing the rest of the
statements.  I'd say you can at least compare to unrolling by
the vectorization factor, building the vector inputs to the elemental
from scalars, distributing the vector result from the elemental to
scalars.

> I believe I should repeat: #pragma omp simd is intended for introduction of an
> instruction-level parallel region on developer's request, hence should
> be treated
> in same manner as #pragma omp parallel. Vectorizer cost model is an obstacle
> here, not a help.

Surely not if there isn't an elemental call in it.  With it the
cost model of course will have not enough information to decide.

But still, what's the difference to the case where we cannot vectorize
the function?  What happens if we cannot vectorize the elemental?
Do we have to build scalar versions for all possible vector sizes?

Richard.

> Regards,
> Sergos
> 
> 
> On Fri, Nov 15, 2013 at 1:08 AM, Richard Biener <rguenther@suse.de> wrote:
> > Sergey Ostanevich <sergos.gnu@gmail.com> wrote:
> >>this is only for the whole file? I mean to have a particular loop
> >>vectorized in a
> >>file while all others - up to compiler's cost model. is there such a
> >>machinery?
> >
> > No, there is not.
> >
> > Richard.
> >
> >>Sergos
> >>
> >>On Thu, Nov 14, 2013 at 12:39 PM, Richard Biener <rguenther@suse.de>
> >>wrote:
> >>> On Wed, 13 Nov 2013, Sergey Ostanevich wrote:
> >>>
> >>>> I will get some tests.
> >>>> As for cost analysis - simply consider the pragma as a request to
> >>>> vectorize. How can I - as a developer - enforce it beyond the
> >>pragma?
> >>>
> >>> You can disable the cost model via -fvect-cost-model=unlimited
> >>>
> >>> Richard.
> >>>
> >>>> On Wed, Nov 13, 2013 at 12:55 PM, Richard Biener <rguenther@suse.de>
> >>wrote:
> >>>> > On Tue, 12 Nov 2013, Sergey Ostanevich wrote:
> >>>> >
> >>>> >> The reason patch was in its original state is because we want
> >>>> >> to notify user that his assumption of profitability may be wrong.
> >>>> >> This is not a part of any spec and as far as I know ICC does not
> >>>> >> notify user about the case. Still it can be a good hint for those
> >>>> >> users who tries to get as much as possible performance.
> >>>> >>
> >>>> >> Richard's comment on the vectorization problems is about the same
> >>-
> >>>> >> to inform user that his attempt to force vectorization is failed.
> >>>> >>
> >>>> >> As for profitable or not - sometimes I believe it's impossible to
> >>be
> >>>> >> precise. For OMP we have case of a vector version of a function
> >>>> >> and we have no chance to figure out whether it is profitable to
> >>use
> >>>> >> it or to loose it. If we can't map the loop for any vector length
> >>>> >> other than 1 - I believe in this case we have to bail out and
> >>report.
> >>>> >> Is it about 'never profitable'?
> >>>> >
> >>>> > For example.  I think we should report non-vectorized loops
> >>>> > that are marked with force_vect anyway, with
> >>-Wdisabled-optimization.
> >>>> > Another case is that a loop may be profitable to vectorize if
> >>>> > the ISA supports a gather instruction but otherwise not.  Or if
> >>the
> >>>> > ISA supports efficient vector construction from N not loop
> >>>> > invariant scalars (for vectorization of strided loads).
> >>>> >
> >>>> > Simply disregarding all of the cost analysis sounds completely
> >>>> > bogus to me.
> >>>> >
> >>>> > I'd simply go for the diagnostic for now, not changing anything
> >>else.
> >>>> > We want to have a good understanding about why the cost model is
> >>>> > so bad that we have to force to ignore it for #pragma simd - thus
> >>we
> >>>> > want testcases.
> >>>> >
> >>>> > Richard.
> >>>> >
> >>>> >>
> >>>> >> On Tue, Nov 12, 2013 at 6:35 PM, Richard Biener
> >><rguenther@suse.de> wrote:
> >>>> >> > On 11/12/13 3:16 PM, Jakub Jelinek wrote:
> >>>> >> >> On Tue, Nov 12, 2013 at 05:46:14PM +0400, Sergey Ostanevich
> >>wrote:
> >>>> >> >>> ivdep just substitutes all cross-iteration data analysis,
> >>>> >> >>> nothing related to cost model. ICC does not cancel its
> >>>> >> >>> cost model in case of #pragma ivdep
> >>>> >> >>>
> >>>> >> >>> as for the safelen - OMP standart treats it as a limitation
> >>>> >> >>> for the vector length. this means if no safelen is present
> >>>> >> >>> an arbitrary vector length can be used.
> >>>> >> >>
> >>>> >> >> I was talking about GCC loop->safelen, which is INT_MAX for
> >>#pragma omp simd
> >>>> >> >> without safelen clause or #pragma simd without vectorlength
> >>clause.
> >>>> >> >>
> >>>> >> >>> so I believe loop->force_vect is the only trigger to
> >>disregard
> >>>> >> >>> the cost model
> >>>> >> >>
> >>>> >> >> Anyway, in that case I think the originally posted patch is
> >>wrong,
> >>>> >> >> if we want to treat force_vect as disregard all the cost model
> >>and
> >>>> >> >> force vectorization (well, the name of the field already kind
> >>of suggest
> >>>> >> >> that), then IMHO we should treat it the same as
> >>-fvect-cost-model=unlimited
> >>>> >> >> for those loops.
> >>>> >> >
> >>>> >> > Err - the user may have a specific sub-architecture in mind
> >>when using
> >>>> >> > #pragma simd, if you say we should completely ignore the cost
> >>model
> >>>> >> > then should we also sorry () if we cannot vectorize the loop
> >>(either
> >>>> >> > because of GCC deficiencies or lack of sub-target support)?
> >>>> >> >
> >>>> >> > That said, at least in the cases that the cost model says the
> >>loop
> >>>> >> > is never profitable to vectorize we should follow its advice.
> >>>> >> >
> >>>> >> > Richard.
> >>>> >> >
> >>>> >> >> Thus (untested):
> >>>> >> >>
> >>>> >> >> 2013-11-12  Jakub Jelinek  <jakub@redhat.com>
> >>>> >> >>
> >>>> >> >>       * tree-vect-loop.c (vect_estimate_min_profitable_iters):
> >>Use
> >>>> >> >>       unlimited cost model also for force_vect loops.
> >>>> >> >>
> >>>> >> >> --- gcc/tree-vect-loop.c.jj   2013-11-12 12:09:40.000000000
> >>+0100
> >>>> >> >> +++ gcc/tree-vect-loop.c      2013-11-12 15:11:43.821404330
> >>+0100
> >>>> >> >> @@ -2702,7 +2702,7 @@ vect_estimate_min_profitable_iters (loop
> >>>> >> >>    void *target_cost_data = LOOP_VINFO_TARGET_COST_DATA
> >>(loop_vinfo);
> >>>> >> >>
> >>>> >> >>    /* Cost model disabled.  */
> >>>> >> >> -  if (unlimited_cost_model ())
> >>>> >> >> +  if (unlimited_cost_model () || LOOP_VINFO_LOOP
> >>(loop_vinfo)->force_vect)
> >>>> >> >>      {
> >>>> >> >>        dump_printf_loc (MSG_NOTE, vect_location, "cost model
> >>disabled.\n");
> >>>> >> >>        *ret_min_profitable_niters = 0;
> >>>> >> >>
> >>>> >> >>       Jakub
> >>>> >> >>
> >>>> >> >
> >>>> >>
> >>>> >>
> >>>> >
> >>>> > --
> >>>> > Richard Biener <rguenther@suse.de>
> >>>> > SUSE / SUSE Labs
> >>>> > SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> >>>> > GF: Jeff Hawn, Jennifer Guild, Felix Imend
> >>>>
> >>>>
> >>>
> >>> --
> >>> Richard Biener <rguenther@suse.de>
> >>> SUSE / SUSE Labs
> >>> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> >>> GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer
> >
> >
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer