Bug 96789 - x264: sub4x4_dct() improves when vectorization is disabled
Summary: x264: sub4x4_dct() improves when vectorization is disabled
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 11.0
Importance: P3 normal
Target Milestone: ---
Assignee: Kewen Lin
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
 
Reported: 2020-08-26 03:19 UTC by Kewen Lin
Modified: 2020-11-05 02:25 UTC
CC: 10 users

See Also:
Host:
Target: powerpc
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-09-16 00:00:00


Attachments
sub4x4_dct SLP dumping (26.35 KB, text/plain)
2020-08-26 07:13 UTC, Kewen Lin

Description Kewen Lin 2020-08-26 03:19:26 UTC
One of my workmates found that if we disable vectorization for the SPEC2017 525.x264_r function sub4x4_dct in source file x264_src/common/dct.c with the explicit function attribute __attribute__((optimize("no-tree-vectorize"))), it speeds up by 4%.

The options used are: -O3 -mcpu=power9 -fcommon -fno-strict-aliasing -fgnu89-inline

I confirmed this finding, and it can be further narrowed down to SLP vectorization with the attribute __attribute__((optimize("no-tree-slp-vectorize"))).

I also checked commit r11-0 for this particular file; the performance stays unchanged with or without the vectorization attribute. So I think it's a trunk regression, probably exposing an SLP flaw or a cost-modeling issue.
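
For reference, a minimal standalone sketch (not the x264 source; the function name and body are made up for illustration) of how the attribute in question is applied to one function while the rest of the file stays at -O3:

/* Illustrative only: per-function attribute used in the experiment to
   disable SLP vectorization for just this function.  */
__attribute__((optimize("no-tree-slp-vectorize")))
void sub_row (unsigned short *diff, const unsigned char *a,
              const unsigned char *b)
{
  for (int x = 0; x < 4; x++)
    diff[x] = (unsigned short) (a[x] - b[x]);
}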
Comment 1 Richard Biener 2020-08-26 06:53:05 UTC
Can you attach the relevant SLP vectorizer dump part (just for the function in question)?
Comment 2 Kewen Lin 2020-08-26 07:13:37 UTC
Created attachment 49124 [details]
sub4x4_dct SLP dumping
Comment 3 Kewen Lin 2020-08-27 03:28:17 UTC
Bisection shows the regression started with r11-205.
Comment 4 Richard Biener 2020-08-27 06:40:57 UTC
That commit delays some checks in order to eventually support vectorizing
part of the BB, which is what succeeds here.  I suspect that without
vectorization we manage to elide the tmp[] array, but with the partial
vectorization that occurs we fail to do that.

On the cost side a lot would be needed to make the vectorization
not profitable:

  Vector inside of basic block cost: 8
  Vector prologue cost: 36
  Vector epilogue cost: 0
  Scalar cost of basic block: 64

the thing to double-check is

0x123b1ff0 <unknown> 1 times vec_construct costs 17 in prologue

that is the cost of the V16QI construct

 _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565, _576, _125, _143, _161, _179}; 

Maybe you can extract a testcase for the function as well?
Comment 5 Richard Biener 2020-08-27 11:16:24 UTC
testcase from https://github.com/mirror/x264/blob/master/common/dct.c

where FENC_STRIDE is 16 and FDEC_STRIDE 32

pixel is unsigned char, dctcoef is unsigned short

static inline void pixel_sub_wxh( dctcoef *diff, int i_size,
                                  pixel *pix1, int i_pix1, pixel *pix2, int i_pix2 )
{
    for( int y = 0; y < i_size; y++ )
    {
        for( int x = 0; x < i_size; x++ )
            diff[x + y*i_size] = pix1[x] - pix2[x];
        pix1 += i_pix1;
        pix2 += i_pix2;
    }
}

static void sub4x4_dct( dctcoef dct[16], pixel *pix1, pixel *pix2 )
{
    dctcoef d[16];
    dctcoef tmp[16];

    pixel_sub_wxh( d, 4, pix1, FENC_STRIDE, pix2, FDEC_STRIDE );

    for( int i = 0; i < 4; i++ )
    {
        int s03 = d[i*4+0] + d[i*4+3];
        int s12 = d[i*4+1] + d[i*4+2];
        int d03 = d[i*4+0] - d[i*4+3];
        int d12 = d[i*4+1] - d[i*4+2];

        tmp[0*4+i] =   s03 +   s12;
        tmp[1*4+i] = 2*d03 +   d12;
        tmp[2*4+i] =   s03 -   s12;
        tmp[3*4+i] =   d03 - 2*d12;
    }

    for( int i = 0; i < 4; i++ )
    {
        int s03 = tmp[i*4+0] + tmp[i*4+3];
        int s12 = tmp[i*4+1] + tmp[i*4+2];
        int d03 = tmp[i*4+0] - tmp[i*4+3];
        int d12 = tmp[i*4+1] - tmp[i*4+2];

        dct[i*4+0] =   s03 +   s12;
        dct[i*4+1] = 2*d03 +   d12;
        dct[i*4+2] =   s03 -   s12;
        dct[i*4+3] =   d03 - 2*d12;
    }
}

One can see that vectorization is complicated by the non-homogeneous computations.
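
To compile the snippet above standalone, the definitions implied by the description (pixel, dctcoef and the two strides) would be something like:

/* Standalone definitions matching the description above: pixel is
   unsigned char, dctcoef is unsigned short, FENC_STRIDE is 16 and
   FDEC_STRIDE is 32.  */
typedef unsigned char  pixel;
typedef unsigned short dctcoef;
#define FENC_STRIDE 16
#define FDEC_STRIDE 32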
Comment 6 Kewen Lin 2020-08-31 04:05:29 UTC
(In reply to Richard Biener from comment #4)
> This delays some checks to eventually support part of the BB vectorization
> which is what succeeds here.  I suspect that w/o vectorization we manage
> to elide the tmp[] array but with the part vectorization that occurs we
> fail to do that.
> 
> On the cost side there would be a lot needed to make the vectorization
> not profitable:
> 
>   Vector inside of basic block cost: 8
>   Vector prologue cost: 36
>   Vector epilogue cost: 0
>   Scalar cost of basic block: 64
> 
> the thing to double-check is
> 
> 0x123b1ff0 <unknown> 1 times vec_construct costs 17 in prologue
> 
> that is the cost of the V16QI construct
> 
>  _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
> _576, _125, _143, _161, _179}; 
> 

Thanks Richard!  I did some cost adjustment experiments last year and the cost for V16QI did indeed look off, but at that time the SPEC performance didn't change with the cost tweaked; I guessed we just happened not to have this kind of case to run into.  I'll have a look and re-evaluate it for this.
Comment 7 Kewen Lin 2020-09-16 10:03:51 UTC
Two questions in mind, which need further digging:
  1) From the assembly of the scalar/vector code, I don't see any stores into the temporary array d (array diff in pixel_sub_wxh), but the cost modeling still counts them. On Power, two vector stores cost 2 while 16 scalar stores cost 16; it seems wrong to cost-model something that ends up being useless. Later, the vector version needs 16 halfword extractions from these two halfword vectors, while in the scalar version the values are already in GPRs, so the vector version looks inefficient.
  2) On Power, the conversion from unsigned char to unsigned short is a nop conversion, yet it is counted when computing the scalar cost, adding 32 in total to the scalar cost. Meanwhile, the conversion from unsigned short to signed short should be counted but isn't (I need to check why).  The nop-conversion costing looks like something we can handle in rs6000_adjust_vect_cost_per_stmt; I tried the generic function tree_nop_conversion_p, but it only covers same-mode/precision conversions.  I will look for something else.
Comment 8 Richard Biener 2020-09-16 11:17:55 UTC
(In reply to Kewen Lin from comment #7)
> Two questions in mind, need to dig into it further:
>   1) from the assembly of scalar/vector code, I don't see any stores needed
> into temp array d (array diff in pixel_sub_wxh), but when modeling we
> consider the stores.

Because when modeling they are still there.  There's no good way around this.

> On Power two vector stores take cost 2 while 16 scalar
> stores takes cost 16, it seems wrong to cost model something useless. Later,
> for the vector version we need 16 vector halfword extractions from these two
> halfword vectors, while scalar version the values are just in GPR register,
> vector version looks inefficient.
>   2) on Power, the conversion from unsigned char to unsigned short is nop
> conversion, when we counting scalar cost, it's counted, then add costs 32
> totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> signed short should be counted but it's not (need to check why further). 
> The nop conversion costing looks something we can handle in function
> rs6000_adjust_vect_cost_per_stmt, I tried to use the generic function
> tree_nop_conversion_p, but it's only for same mode/precision conversion.
> Will find/check something else.
Comment 9 Kewen Lin 2020-09-16 12:25:02 UTC
(In reply to Richard Biener from comment #8)
> (In reply to Kewen Lin from comment #7)
> > Two questions in mind, need to dig into it further:
> >   1) from the assembly of scalar/vector code, I don't see any stores needed
> > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > consider the stores.
> 
> Because when modeling they are still there.  There's no good way around this.
> 

I noticed the stores get eliminated during FRE.  Can we consider running FRE once just before SLP, or is that a bad idea due to compilation time?
Comment 10 Richard Biener 2020-09-16 13:04:49 UTC
(In reply to Kewen Lin from comment #9)
> (In reply to Richard Biener from comment #8)
> > (In reply to Kewen Lin from comment #7)
> > > Two questions in mind, need to dig into it further:
> > >   1) from the assembly of scalar/vector code, I don't see any stores needed
> > > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > > consider the stores.
> > 
> > Because when modeling they are still there.  There's no good way around this.
> > 
> 
> I noticed the stores get eliminated during FRE.  Can we consider running FRE
> once just before SLP? a bad idea due to compilation time?

Yeah, we already run FRE a lot and it is one of the more expensive passes.

Note there's one point we could do better which is the embedded SESE FRE
run from cunroll which is only run before we consider peeling an outer loop
and thus not for the outermost unrolled/peeled code (but the question would
be from where / up to what to apply FRE to).  On x86_64 this would apply to
the unvectorized but then unrolled outer loop from pixel_sub_wxh which feeds
quite bad IL to the SLP pass (but that shouldn't matter too much, maybe it
matters for costing though).

I think I looked at this or a related testcase some time ago and split out
some PRs (can't find those right now).  For example we are not considering
to simplify

  _318 = {_4, _14, _293, _30, _49, _251, _225, _248, _52, _70, _260, _284, _100, _117, _134, _151};
  vect__5.47_319 = (vector(16) short unsigned int) _318;
  _154 = MEM[(pixel *)pix2_58(D) + 99B];
  _320 = {_6, _16, _22, _32, _51, _255, _231, _243, _54, _68, _276, _286, _103, _120, _137, _154};
  vect__7.48_321 = (vector(16) short unsigned int) _320;
  vect__12.49_322 = vect__5.47_319 - vect__7.48_321;
  _317 = BIT_FIELD_REF <vect__12.49_322, 64, 0>;
  _315 = BIT_FIELD_REF <vect__12.49_322, 64, 64>;
  _313 = BIT_FIELD_REF <vect__12.49_322, 64, 128>;
  _311 = BIT_FIELD_REF <vect__12.49_322, 64, 192>;
  vect_perm_even_165 = VEC_PERM_EXPR <_317, _315, { 0, 2, 4, 6 }>;
  vect_perm_odd_164 = VEC_PERM_EXPR <_317, _315, { 1, 3, 5, 7 }>;
  vect_perm_even_163 = VEC_PERM_EXPR <_313, _311, { 0, 2, 4, 6 }>;
  vect_perm_odd_156 = VEC_PERM_EXPR <_313, _311, { 1, 3, 5, 7 }>;

down to smaller vectors.  Also, apparently the two vector CTORs are not
re-shuffled to vector load + shuffle.  In the SLP analysis we end up with

t2.c:12:32: note:   Final SLP tree for instance:
t2.c:12:32: note:   node 0x436e3c0 (max_nunits=16, refcnt=2)
t2.c:12:32: note:       stmt 0 *_11 = _12;
t2.c:12:32: note:       stmt 1 *_21 = _71;
...
t2.c:12:32: note:       stmt 15 *_160 = _161;
t2.c:12:32: note:       children 0x436de70
t2.c:12:32: note:   node 0x436de70 (max_nunits=16, refcnt=2)
t2.c:12:32: note:       stmt 0 _12 = _5 - _7;
t2.c:12:32: note:       stmt 1 _71 = _15 - _17;
...
.c:12:32: note:       stmt 15 _161 = _152 - _155;
t2.c:12:32: note:       children 0x436ebb0 0x4360b70
t2.c:12:32: note:   node 0x436ebb0 (max_nunits=16, refcnt=2)
t2.c:12:32: note:       stmt 0 _5 = (short unsigned int) _4;
...
t2.c:12:32: note:       stmt 15 _152 = (short unsigned int) _151;
t2.c:12:32: note:       children 0x42f1740
t2.c:12:32: note:   node 0x42f1740 (max_nunits=16, refcnt=2)
t2.c:12:32: note:       stmt 0 _4 = *pix1_57(D);
...
t2.c:12:32: note:       stmt 15 _151 = MEM[(pixel *)pix1_295 + 3B];
t2.c:12:32: note:       load permutation { 0 1 2 3 16 17 18 19 32 33 34 35 48 49 50 51 }
t2.c:12:32: note:   node 0x4360b70 (max_nunits=16, refcnt=2)
t2.c:12:32: note:       stmt 0 _7 = (short unsigned int) _6;
...
t2.c:12:32: note:       stmt 15 _155 = (short unsigned int) _154;
t2.c:12:32: note:       children 0x4360be0
t2.c:12:32: note:   node 0x4360be0 (max_nunits=16, refcnt=2)
t2.c:12:32: note:       stmt 0 _6 = *pix2_58(D);
...
t2.c:12:32: note:       stmt 15 _154 = MEM[(pixel *)pix2_296 + 3B];
t2.c:12:32: note:       load permutation { 0 1 2 3 32 33 34 35 64 65 66 67 96 97 98 99 }

the load permutations suggest that splitting the group into 4-lane pieces
would avoid doing permutes but then that would require target support
for V4QI and V4HI vectors.  At least the loads could be considered
to be vectorized with strided-SLP, yielding 'int' loads and a vector
build from 4 ints.  I'd need to analyze why we do not consider this.

t2.c:50:1: note:   Detected interleaving load of size 52
t2.c:50:1: note:        _4 = *pix1_57(D);
t2.c:50:1: note:        _14 = MEM[(pixel *)pix1_57(D) + 1B];
t2.c:50:1: note:        _293 = MEM[(pixel *)pix1_57(D) + 2B];
t2.c:50:1: note:        _30 = MEM[(pixel *)pix1_57(D) + 3B];
t2.c:50:1: note:        <gap of 12 elements>
t2.c:50:1: note:        _49 = *pix1_40;
t2.c:50:1: note:        _251 = MEM[(pixel *)pix1_40 + 1B];
t2.c:50:1: note:        _225 = MEM[(pixel *)pix1_40 + 2B];
t2.c:50:1: note:        _248 = MEM[(pixel *)pix1_40 + 3B];
t2.c:50:1: note:        <gap of 12 elements>
t2.c:50:1: note:        _52 = *pix1_264;
t2.c:50:1: note:        _70 = MEM[(pixel *)pix1_264 + 1B];
t2.c:50:1: note:        _260 = MEM[(pixel *)pix1_264 + 2B];
t2.c:50:1: note:        _284 = MEM[(pixel *)pix1_264 + 3B];
t2.c:50:1: note:        <gap of 12 elements>
t2.c:50:1: note:        _100 = *pix1_295;
t2.c:50:1: note:        _117 = MEM[(pixel *)pix1_295 + 1B];
t2.c:50:1: note:        _134 = MEM[(pixel *)pix1_295 + 2B];
t2.c:50:1: note:        _151 = MEM[(pixel *)pix1_295 + 3B];
t2.c:50:1: note:   Detected interleaving load of size 100
t2.c:50:1: note:        _6 = *pix2_58(D);
t2.c:50:1: note:        _16 = MEM[(pixel *)pix2_58(D) + 1B];
t2.c:50:1: note:        _22 = MEM[(pixel *)pix2_58(D) + 2B];
t2.c:50:1: note:        _32 = MEM[(pixel *)pix2_58(D) + 3B];
t2.c:50:1: note:        <gap of 28 elements>
t2.c:50:1: note:        _51 = *pix2_41;
t2.c:50:1: note:        _255 = MEM[(pixel *)pix2_41 + 1B];
t2.c:50:1: note:        _231 = MEM[(pixel *)pix2_41 + 2B];
t2.c:50:1: note:        _243 = MEM[(pixel *)pix2_41 + 3B];
t2.c:50:1: note:        <gap of 28 elements>
t2.c:50:1: note:        _54 = *pix2_272;
t2.c:50:1: note:        _68 = MEM[(pixel *)pix2_272 + 1B];
t2.c:50:1: note:        _276 = MEM[(pixel *)pix2_272 + 2B];
t2.c:50:1: note:        _286 = MEM[(pixel *)pix2_272 + 3B];
t2.c:50:1: note:        <gap of 28 elements>
t2.c:50:1: note:        _103 = *pix2_296;
t2.c:50:1: note:        _120 = MEM[(pixel *)pix2_296 + 1B];
t2.c:50:1: note:        _137 = MEM[(pixel *)pix2_296 + 2B];
t2.c:50:1: note:        _154 = MEM[(pixel *)pix2_296 + 3B];
Comment 11 Kewen Lin 2020-09-17 02:50:07 UTC
(In reply to Richard Biener from comment #10)
> (In reply to Kewen Lin from comment #9)
> > (In reply to Richard Biener from comment #8)
> > > (In reply to Kewen Lin from comment #7)
> > > > Two questions in mind, need to dig into it further:
> > > >   1) from the assembly of scalar/vector code, I don't see any stores needed
> > > > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > > > consider the stores.
> > > 
> > > Because when modeling they are still there.  There's no good way around this.
> > > 
> > 
> > I noticed the stores get eliminated during FRE.  Can we consider running FRE
> > once just before SLP? a bad idea due to compilation time?
> 
> Yeah, we already run FRE a lot and it is one of the more expensive passes.
> 
> Note there's one point we could do better which is the embedded SESE FRE
> run from cunroll which is only run before we consider peeling an outer loop
> and thus not for the outermost unrolled/peeled code (but the question would
> be from where / up to what to apply FRE to).  On x86_64 this would apply to
> the unvectorized but then unrolled outer loop from pixel_sub_wxh which feeds
> quite bad IL to the SLP pass (but that shouldn't matter too much, maybe it
> matters for costing though).

Thanks for the explanation! I'll look at it after checking 2). IIUC, the advantage of eliminating the stores here is that the statements feeding the stores and the stores' consumers get bundled together, so more things can be SLP-ed where possible?

> 
> I think I looked at this or a related testcase some time ago and split out
> some PRs (can't find those right now).  For example we are not considering
> to simplify
> 
....
> 
> the load permutations suggest that splitting the group into 4-lane pieces
> would avoid doing permutes but then that would require target support
> for V4QI and V4HI vectors.  At least the loads could be considered
> to be vectorized with strided-SLP, yielding 'int' loads and a vector
> build from 4 ints.  I'd need to analyze why we do not consider this.

Good idea! I'm curious: is there some port where an int load cannot work well on a 1-byte-aligned address, e.g. it traps?
Comment 12 Kewen Lin 2020-09-17 05:06:03 UTC
> Thanks for the explanation! I'll look at it after checking 2). IIUC, the
> advantage to eliminate stores here looks able to get those things which is
> fed to stores and stores' consumers bundled, then get more things SLP-ed if
> available?

Hmm, I think I was wrong: if both the feeding chain and the consuming chain of the stores are SLP-ed, a later FRE would be able to fuse them.
Comment 13 Kewen Lin 2020-09-18 09:11:22 UTC
>   2) on Power, the conversion from unsigned char to unsigned short is nop
> conversion, when we counting scalar cost, it's counted, then add costs 32
> totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> signed short should be counted but it's not (need to check why further). 

vect_nop_conversion_p returns true for the UH-to-SH conversion, so it's not even put into the cost vector.

tree_nop_conversion_p's comment says:

/* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
   no instruction.  */

I may be missing something here, but the UH-to-SH conversion does need one explicit extend instruction (extsh); the precision/mode equality check looks wrong for this conversion.
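
A small source-level illustration of the conversion in question (a made-up example, not from the testcase): the cast itself has the same precision and mode, so GIMPLE treats it as a nop, but once the signed result is consumed as a wider value the backend has to sign-extend it, e.g. with extsh on Power.

/* Hypothetical example: the (short) cast is a GIMPLE nop conversion,
   but consuming the result as an int requires a sign extension
   instruction such as extsh on Power.  */
int use_widened (unsigned short u)
{
  short s = (short) u;   /* same precision/mode: "no instruction" on GIMPLE */
  return s + 1;          /* needs the sign extend when used as int */
}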
Comment 14 rguenther@suse.de 2020-09-18 10:30:42 UTC
On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> 
> --- Comment #13 from Kewen Lin <linkw at gcc dot gnu.org> ---
> >   2) on Power, the conversion from unsigned char to unsigned short is nop
> > conversion, when we counting scalar cost, it's counted, then add costs 32
> > totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> > signed short should be counted but it's not (need to check why further). 
> 
> UH to SH conversion is true when calling vect_nop_conversion_p, so it's not
> even put into the cost vector. 
> 
> tree_nop_conversion_p's comments saying:
> 
> /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
>    no instruction.  */
> 
> I may miss something here, but UH to SH conversion does need one explicit
> extend instruction *extsh*, the precision/mode equality check looks wrong for
> this conversion.

Well, it isn't an RTL predicate, and it only needs an extension because
there's never a HImode pseudo but always SImode subregs.
Comment 15 Kewen Lin 2020-09-18 13:40:52 UTC
(In reply to rguenther@suse.de from comment #14)
> On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > 
> > --- Comment #13 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > >   2) on Power, the conversion from unsigned char to unsigned short is nop
> > > conversion, when we counting scalar cost, it's counted, then add costs 32
> > > totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> > > signed short should be counted but it's not (need to check why further). 
> > 
> > UH to SH conversion is true when calling vect_nop_conversion_p, so it's not
> > even put into the cost vector. 
> > 
> > tree_nop_conversion_p's comments saying:
> > 
> > /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
> >    no instruction.  */
> > 
> > I may miss something here, but UH to SH conversion does need one explicit
> > extend instruction *extsh*, the precision/mode equality check looks wrong for
> > this conversion.
> 
> Well, it isn't a RTL predicate and it only needs extension because
> there's never a HImode pseudo but always SImode subregs.

Thanks Richi! Should we take care of this case, or treat this kind of extension as "no instruction"? I intended to handle it in target-specific code, but it isn't recorded into the cost vector, and it seems too heavy to revisit bb_info's slp_instances in finish_cost.
Comment 16 Hongtao.liu 2020-09-19 15:45:26 UTC
I noticed that

0x5561dc0 _36 * 2 1 times scalar_stmt costs 16 in body
0x5561dc0 _38 * 2 1 times scalar_stmt costs 16 in body
0x5562df0 _36 * 2 1 times vector_stmt costs 16 in body
0x5562df0 _38 * 2 1 times vector_stmt costs 16 in body

ix86_multiplication_cost would be called for the cost estimate, but in pass_expand, synth_mult will transform the multiplication into a shift. So the shift cost should be used in this case, not the multiply cost.
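
A tiny made-up example of that point: a multiply by 2 at the GIMPLE level is normally expanded by synth_mult as a shift or an add, so charging the full multiplication cost overstates the scalar cost.

/* Illustrative: this multiply by 2 is typically emitted as a shift
   (or an add), not a multiply instruction, so using the multiplication
   cost here is pessimistic.  */
int times_two (int d03)
{
  return 2 * d03;   /* usually becomes d03 << 1 or d03 + d03 */
}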
Comment 17 rguenther@suse.de 2020-09-21 07:14:22 UTC
On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> 
> --- Comment #15 from Kewen Lin <linkw at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #14)
> > On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > > 
> > > --- Comment #13 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > > >   2) on Power, the conversion from unsigned char to unsigned short is nop
> > > > conversion, when we counting scalar cost, it's counted, then add costs 32
> > > > totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> > > > signed short should be counted but it's not (need to check why further). 
> > > 
> > > UH to SH conversion is true when calling vect_nop_conversion_p, so it's not
> > > even put into the cost vector. 
> > > 
> > > tree_nop_conversion_p's comments saying:
> > > 
> > > /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
> > >    no instruction.  */
> > > 
> > > I may miss something here, but UH to SH conversion does need one explicit
> > > extend instruction *extsh*, the precision/mode equality check looks wrong for
> > > this conversion.
> > 
> > Well, it isn't a RTL predicate and it only needs extension because
> > there's never a HImode pseudo but always SImode subregs.
> 
> Thanks Richi! Should we take care of this case? or neglect this kind of
> extension as "no instruction"? I was intent to handle it in target specific
> code, but it isn't recorded into cost vector while it seems too heavy to do the
> bb_info slp_instances revisits in finish_cost.

I think it's not something we should handle on GIMPLE.
Comment 18 Kewen Lin 2020-09-25 12:46:54 UTC
(In reply to Richard Biener from comment #10)
> (In reply to Kewen Lin from comment #9)
> > (In reply to Richard Biener from comment #8)
> > > (In reply to Kewen Lin from comment #7)
> > > > Two questions in mind, need to dig into it further:
> > > >   1) from the assembly of scalar/vector code, I don't see any stores needed
> > > > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > > > consider the stores.
> > > 
> > > Because when modeling they are still there.  There's no good way around this.
> > > 
> > 
> > I noticed the stores get eliminated during FRE.  Can we consider running FRE
> > once just before SLP? a bad idea due to compilation time?
> 
> Yeah, we already run FRE a lot and it is one of the more expensive passes.
> 
> Note there's one point we could do better which is the embedded SESE FRE
> run from cunroll which is only run before we consider peeling an outer loop
> and thus not for the outermost unrolled/peeled code (but the question would
> be from where / up to what to apply FRE to).  On x86_64 this would apply to
> the unvectorized but then unrolled outer loop from pixel_sub_wxh which feeds
> quite bad IL to the SLP pass (but that shouldn't matter too much, maybe it
> matters for costing though).

Following this idea, by relaxing the restriction on loop_outer (loop_father) when setting father_bbs, I can see FRE working as expected.  But it actually runs rpo_vn from cfun's entry to its exit.  If that is considered costly, we can probably guard it to take effect only when all the inner loops have been unrolled; for this case, all three of its loops are unrolled.
Besides, when SLP happens, FRE generates the BIT_FIELD_REFs and removes array d, but for the scalar code it takes one more DSE run after cunroll to get array d eliminated.  I guess that's not costly?  Can whether one pass runs be controlled by something in another pass?  Doing it via a global variable plus one extra parameter in passes.def seems weird.  If it's costly, we can probably factor out one routine to be called instead of running a pass, like do_rpo_vn?
Comment 19 Kewen Lin 2020-09-25 12:52:14 UTC
(In reply to rguenther@suse.de from comment #17)
> On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > 
> > --- Comment #15 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > (In reply to rguenther@suse.de from comment #14)
> > > On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > > > 
> > > > --- Comment #13 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > > > >   2) on Power, the conversion from unsigned char to unsigned short is nop
> > > > > conversion, when we counting scalar cost, it's counted, then add costs 32
> > > > > totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> > > > > signed short should be counted but it's not (need to check why further). 
> > > > 
> > > > UH to SH conversion is true when calling vect_nop_conversion_p, so it's not
> > > > even put into the cost vector. 
> > > > 
> > > > tree_nop_conversion_p's comments saying:
> > > > 
> > > > /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
> > > >    no instruction.  */
> > > > 
> > > > I may miss something here, but UH to SH conversion does need one explicit
> > > > extend instruction *extsh*, the precision/mode equality check looks wrong for
> > > > this conversion.
> > > 
> > > Well, it isn't a RTL predicate and it only needs extension because
> > > there's never a HImode pseudo but always SImode subregs.
> > 
> > Thanks Richi! Should we take care of this case? or neglect this kind of
> > extension as "no instruction"? I was intent to handle it in target specific
> > code, but it isn't recorded into cost vector while it seems too heavy to do the
> > bb_info slp_instances revisits in finish_cost.
> 
> I think it's not something we should handle on GIMPLE.

Got it! For 

	  else if (vect_nop_conversion_p (stmt_info))
	    continue;

Is it a good idea to change it to call record_stmt_cost like the others?
  1) introduce a new vect_cost_for_stmt enum value, e.g. nop_stmt
  2) have builtin_vectorization_cost return zero for it by default, as before
  3) let targets adjust the cost according to their needs
Comment 20 Richard Biener 2020-09-25 12:57:26 UTC
(In reply to Kewen Lin from comment #19)
> (In reply to rguenther@suse.de from comment #17)
> > On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > > 
> > > --- Comment #15 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > > (In reply to rguenther@suse.de from comment #14)
> > > > On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> > > > 
> > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > > > > 
> > > > > --- Comment #13 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > > > > >   2) on Power, the conversion from unsigned char to unsigned short is nop
> > > > > > conversion, when we counting scalar cost, it's counted, then add costs 32
> > > > > > totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> > > > > > signed short should be counted but it's not (need to check why further). 
> > > > > 
> > > > > UH to SH conversion is true when calling vect_nop_conversion_p, so it's not
> > > > > even put into the cost vector. 
> > > > > 
> > > > > tree_nop_conversion_p's comments saying:
> > > > > 
> > > > > /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
> > > > >    no instruction.  */
> > > > > 
> > > > > I may miss something here, but UH to SH conversion does need one explicit
> > > > > extend instruction *extsh*, the precision/mode equality check looks wrong for
> > > > > this conversion.
> > > > 
> > > > Well, it isn't a RTL predicate and it only needs extension because
> > > > there's never a HImode pseudo but always SImode subregs.
> > > 
> > > Thanks Richi! Should we take care of this case? or neglect this kind of
> > > extension as "no instruction"? I was intent to handle it in target specific
> > > code, but it isn't recorded into cost vector while it seems too heavy to do the
> > > bb_info slp_instances revisits in finish_cost.
> > 
> > I think it's not something we should handle on GIMPLE.
> 
> Got it! For 
> 
> 	  else if (vect_nop_conversion_p (stmt_info))
> 	    continue;
> 
> Is it a good idea to change it to call record_stmt_cost like the others? 
>   1) introduce one vect_cost_for_stmt enum type eg: nop_stmt
>   2) builtin_vectorization_cost return zero for it by default as before.
>   3) targets can adjust the cost according to its need

I think this early-out was added for the case where there was no cost but
the target costed it.  So at least go back and look at what target that was
and see if it can be adjusted.
Comment 21 Richard Biener 2020-09-25 13:05:28 UTC
(In reply to Kewen Lin from comment #18)
> (In reply to Richard Biener from comment #10)
> > (In reply to Kewen Lin from comment #9)
> > > (In reply to Richard Biener from comment #8)
> > > > (In reply to Kewen Lin from comment #7)
> > > > > Two questions in mind, need to dig into it further:
> > > > >   1) from the assembly of scalar/vector code, I don't see any stores needed
> > > > > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > > > > consider the stores.
> > > > 
> > > > Because when modeling they are still there.  There's no good way around this.
> > > > 
> > > 
> > > I noticed the stores get eliminated during FRE.  Can we consider running FRE
> > > once just before SLP? a bad idea due to compilation time?
> > 
> > Yeah, we already run FRE a lot and it is one of the more expensive passes.
> > 
> > Note there's one point we could do better which is the embedded SESE FRE
> > run from cunroll which is only run before we consider peeling an outer loop
> > and thus not for the outermost unrolled/peeled code (but the question would
> > be from where / up to what to apply FRE to).  On x86_64 this would apply to
> > the unvectorized but then unrolled outer loop from pixel_sub_wxh which feeds
> > quite bad IL to the SLP pass (but that shouldn't matter too much, maybe it
> > matters for costing though).
> 
> By following this idea, to release the restriction on loop_outer
> (loop_father) when setting the father_bbs, I can see FRE works as
> expectedly.  But it actually does the rpo_vn from cfun's entry to its exit.

Yeah, that's the reason we do not do it.  We could possibly restrict it
to a containing loop, or if the containing loop is the whole function,
restrict it to the original preheader block to the loop exits (which are
no longer there, we'd need to pre-record those I think)

> If it's taken as costly, we probably can guard it to take effects only when
> all its inner loops are unrolled, for this case, all of its three loops are
> unrolled.

The original reason VN is done on unrolled bodies is to improve cost estimates
when unrolling nests - since we do not unroll the (non-existing) outer loop of
the outermost loop, applying VN there for this reason is pointless, and the
reasoning is to simply wait for the next scheduled VN pass (which is too late
for SLP).

> Besides, when SLP happens, FRE gen the bit_field_ref and remove array d, but
> for scalar codes it needs one more time dse run after cunroll to get array d
> eliminated. I guess it's not costly? Can one pass be run or not controlled
> by something in another pass? via global variable and add one parameter in
> passes.def seems weird. If it's costly, probably we can go by factoring out
> one routine to be called instead of running a pass, like do_rpo_vn?

No, we don't have a good way to schedule passes from other passes.  And yes,
the way forward is to support key transforms on regions.  Oh, and every
pass that does memory alias analysis (DSE, DCE, VN, etc.) is costly.

What we could eventually do is move the non-loop SLP pass later and
run the loop SLP pass only on loop bodies (so overall every BB is just
SLP vectorized once).  The

      PUSH_INSERT_PASSES_WITHIN (pass_tree_no_loop)
          NEXT_PASS (pass_slp_vectorize);
      POP_INSERT_PASSES ()

pass would then be unconditional and only run on non-loop BBs.  Note
that SLP still leaves us with FRE opportunities so this will probably
not solve the issue fully.  Pass reorderings are also always tricky...
(adding passes is easy but piling these up over time is bad).
Comment 22 Hongtao.liu 2020-09-27 02:56:43 UTC
>One of my workmates found that if we disable vectorization for SPEC2017 525.x264_r function sub4x4_dct in source file x264_src/common/dct.c with explicit function attribute __attribute__((optimize("no-tree-vectorize"))), it can speed up by 4%.

For CLX, if we disable SLP vectorization in sub4x4_dct with __attribute__((optimize("no-tree-slp-vectorize"))), it also speeds up by 4%.

> Thanks Richi! Should we take care of this case? or neglect this kind of
> extension as "no instruction"? I was intent to handle it in target specific
> code, but it isn't recorded into cost vector while it seems too heavy to do
> the bb_info slp_instances revisits in finish_cost.

For the i386 backend, unsigned char --> unsigned short is not "no instruction", but in this case
---
1033  _134 = MEM[(pixel *)pix1_295 + 2B];                                                                                                                                                                    
1034  _135 = (short unsigned int) _134;
---

It could be combined and optimized to 
---
movzbl  19(%rcx), %r8d
---

So, if "unsigned char" variable is loaded from memory, then the convertion would also be "no instruction", i'm not sure if backend cost model could handle such situation.
Comment 23 Hongtao.liu 2020-09-27 03:07:06 UTC
>  _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
> _576, _125, _143, _161, _179}; 

The cost of vec_construct in the i386 backend is 64, calculated as 16 x 4.

cut from i386.c
---
/* N element inserts into SSE vectors.  */ 
int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
---

From the perspective of pipeline latency it seems OK, but from the perspective of rtx_cost it seems inaccurate, since the vector would be initialized as
---
        vmovd   %eax, %xmm0
        vpinsrb $1, 1(%rsi), %xmm0, %xmm0
        vmovd   %eax, %xmm7
        vpinsrb $1, 3(%rsi), %xmm7, %xmm7
        vmovd   %eax, %xmm3
        vpinsrb $1, 17(%rsi), %xmm3, %xmm3
        vmovd   %eax, %xmm6
        vpinsrb $1, 19(%rsi), %xmm6, %xmm6
        vmovd   %eax, %xmm1
        vpinsrb $1, 33(%rsi), %xmm1, %xmm1
        vmovd   %eax, %xmm5
        vpinsrb $1, 35(%rsi), %xmm5, %xmm5
        vmovd   %eax, %xmm2
        vpinsrb $1, 49(%rsi), %xmm2, %xmm2
        vmovd   %eax, %xmm4
        vpinsrb $1, 51(%rsi), %xmm4, %xmm4
        vpunpcklwd      %xmm6, %xmm3, %xmm3
        vpunpcklwd      %xmm4, %xmm2, %xmm2
        vpunpcklwd      %xmm7, %xmm0, %xmm0
        vpunpcklwd      %xmm5, %xmm1, %xmm1
        vpunpckldq      %xmm2, %xmm1, %xmm1
        vpunpckldq      %xmm3, %xmm0, %xmm0
        vpunpcklqdq     %xmm1, %xmm0, %xmm0
---

it's 16 "vector insert" + (4 + 2 + 1) "vector concat/permutation", so cost should be 92(23 * 4).
Comment 24 rguenther@suse.de 2020-09-27 06:48:09 UTC
On September 27, 2020 4:56:43 AM GMT+02:00, crazylht at gmail dot com <gcc-bugzilla@gcc.gnu.org> wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
>
>--- Comment #22 from Hongtao.liu <crazylht at gmail dot com> ---
>>One of my workmates found that if we disable vectorization for SPEC2017
>>525.x264_r function sub4x4_dct in source file x264_src/common/dct.c with
>>explicit function attribute __attribute__((optimize("no-tree-vectorize"))),
>>it can speed up by 4%.
>
>For CLX, if we disable slp vectorization in sub4x4_dct by 
>__attribute__((optimize("no-tree-slp-vectorize"))), it can also speed
>up by 4%.
>
>> Thanks Richi! Should we take care of this case? or neglect this kind
>of
>> extension as "no instruction"? I was intent to handle it in target
>specific
>> code, but it isn't recorded into cost vector while it seems too heavy
>to do
>> the bb_info slp_instances revisits in finish_cost.
>
>For i386 backend unsigned char --> unsigned short is no "no
>instruction", but
>in this case
>---
>1033  _134 = MEM[(pixel *)pix1_295 + 2B];                              
>        
>1034  _135 = (short unsigned int) _134;
>---
>
>It could be combined and optimized to 
>---
>movzbl  19(%rcx), %r8d
>---
>
>So, if "unsigned char" variable is loaded from memory, then the
>convertion
>would also be "no instruction", i'm not sure if backend cost model
>could handle
>such situation.

I think all attempts to address this from the side of the scalar cost are going to be difficult and fragile.
Comment 25 Kewen Lin 2020-09-27 10:20:46 UTC
> > 
> > Got it! For 
> > 
> > 	  else if (vect_nop_conversion_p (stmt_info))
> > 	    continue;
> > 
> > Is it a good idea to change it to call record_stmt_cost like the others? 
> >   1) introduce one vect_cost_for_stmt enum type eg: nop_stmt
> >   2) builtin_vectorization_cost return zero for it by default as before.
> >   3) targets can adjust the cost according to its need
> 
> I think this early-out was added for the case where there was no cost but
> the target costed it.  So at least go back and look what target that was
> and see if it can be adjusted.

OK, thanks.  The change is from commit r10-4592.  I think most of the objects it handles have no cost on all targets, like VIEW_CONVERT_EXPR, UL/SL bidirectional conversions, etc., so it's good to differentiate them from scalar_stmt, but the "continue" makes it hard for targets to tweak the cost of some tree_nop_conversion_p stmts.
Comment 26 Kewen Lin 2020-09-27 10:36:07 UTC
> > By following this idea, to release the restriction on loop_outer
> > (loop_father) when setting the father_bbs, I can see FRE works as
> > expectedly.  But it actually does the rpo_vn from cfun's entry to its exit.
> 
> Yeah, that's the reason we do not do it.  We could possibly restrict it
> to a containing loop, or if the containing loop is the whole function,
> restrict it to the original preheader block to the loop exits (which are
> no longer there, we'd need to pre-record those I think)

Thanks for the suggestion! 

I tried the idea of restricting it to run from the original preheader block to the loop exits (pre-recording both as you said), but that isn't enough to get array d eliminated in the end; unfortunately this case requires VN to run across the boundary between the original loops.  Now I ended up running whole-function VN once when there aren't any loops left after unrolling.  I guess that if there are no loops, the CFG should be simple most of the time and thus not so costly?

> > Besides, when SLP happens, FRE gen the bit_field_ref and remove array d, but
> > for scalar codes it needs one more time dse run after cunroll to get array d
> > eliminated. I guess it's not costly? Can one pass be run or not controlled
> > by something in another pass? via global variable and add one parameter in
> > passes.def seems weird. If it's costly, probably we can go by factoring out
> > one routine to be called instead of running a pass, like do_rpo_vn?
> 
> No, we don't have a good way to schedule passes from other passes.  And yes,
> the way forward is to support key transforms on regions.  Oh, and every
> pass that does memory alias analysis (DSE, DCE, VN, etc.) is costly.
> 

OK, I'll have a look at DSE and try to get it to support running on a region, although it may not help this case since it needs to operate across the loop boundary.
Comment 27 Kewen Lin 2020-09-27 10:42:31 UTC
(In reply to Hongtao.liu from comment #22)
> >One of my workmates found that if we disable vectorization for SPEC2017 525.x264_r function sub4x4_dct in source file x264_src/common/dct.c with explicit function attribute __attribute__((optimize("no-tree-vectorize"))), it can speed up by 4%.
> 
> For CLX, if we disable slp vectorization in sub4x4_dct by 
> __attribute__((optimize("no-tree-slp-vectorize"))), it can also speed up by
> 4%.
> 
> > Thanks Richi! Should we take care of this case? or neglect this kind of
> > extension as "no instruction"? I was intent to handle it in target specific
> > code, but it isn't recorded into cost vector while it seems too heavy to do
> > the bb_info slp_instances revisits in finish_cost.
> 
> For i386 backend unsigned char --> unsigned short is no "no instruction",

Thanks for the information; it means it's target-specific.

> but in this case
> ---
> 1033  _134 = MEM[(pixel *)pix1_295 + 2B];                                   
> 
> 1034  _135 = (short unsigned int) _134;
> ---
> 
> It could be combined and optimized to 
> ---
> movzbl  19(%rcx), %r8d
> ---
> 
> So, if "unsigned char" variable is loaded from memory, then the convertion
> would also be "no instruction", i'm not sure if backend cost model could
> handle such situation.

Probably you can try to tweak it in ix86_add_stmt_cost: when the statement is a UB-to-UH conversion statement, further check whether the def of the input UB is a MEM.
Comment 28 Hongtao.liu 2020-09-28 05:45:37 UTC
> Probably you can try to tweak it in ix86_add_stmt_cost? when the statement

Yes, it's the place.

> is UB to UH conversion statement, further check if the def of the input UB
> is MEM.

Only if there's no multi-use of the UB value.  More generally, it's quite difficult to guess later optimizations for the purpose of a more accurate vectorization cost model :(.
Comment 29 Kewen Lin 2020-09-28 06:40:31 UTC
(In reply to Hongtao.liu from comment #28)
> > Probably you can try to tweak it in ix86_add_stmt_cost? when the statement
> 
> Yes, it's the place.
> 
> > is UB to UH conversion statement, further check if the def of the input UB
> > is MEM.
> 
> Only if there's no multi-use for UB. More generally, it's quite difficult to
> guess later optimizations for the purpose of more accurate vectorization
> cost model, :(.

Yeah, it's hard, sadly.  The generic cost modeling is rough; ix86_add_stmt_cost is more fine-grained (at least compared to what we have on Power :)).  If you want to check it further, it seems doable in the target-specific hook finish_cost, where you can get the whole vinfo object, but it could end up requiring very heavy analysis and might not be worthwhile.

Do you mind checking whether running FRE and DSE just after cunroll can also fix this degradation on x86?  I found it worked for Power and hoped it could help there too.
Comment 30 Richard Biener 2020-09-28 06:48:06 UTC
(In reply to Hongtao.liu from comment #23)
> >  _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
> > _576, _125, _143, _161, _179}; 
> 
> The cost of vec_construct in i386 backend is 64, calculated as 16 x 4
> 
> cut from i386.c
> ---
> /* N element inserts into SSE vectors.  */ 
> int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> ---
> 
> From perspective of pipeline latency, is seems ok, but from perspective of
> rtx_cost, it seems inaccurate since it would be initialized as
> ---
>         vmovd   %eax, %xmm0
>         vpinsrb $1, 1(%rsi), %xmm0, %xmm0
>         vmovd   %eax, %xmm7
>         vpinsrb $1, 3(%rsi), %xmm7, %xmm7
>         vmovd   %eax, %xmm3
>         vpinsrb $1, 17(%rsi), %xmm3, %xmm3
>         vmovd   %eax, %xmm6
>         vpinsrb $1, 19(%rsi), %xmm6, %xmm6
>         vmovd   %eax, %xmm1
>         vpinsrb $1, 33(%rsi), %xmm1, %xmm1
>         vmovd   %eax, %xmm5
>         vpinsrb $1, 35(%rsi), %xmm5, %xmm5
>         vmovd   %eax, %xmm2
>         vpinsrb $1, 49(%rsi), %xmm2, %xmm2
>         vmovd   %eax, %xmm4
>         vpinsrb $1, 51(%rsi), %xmm4, %xmm4
>         vpunpcklwd      %xmm6, %xmm3, %xmm3
>         vpunpcklwd      %xmm4, %xmm2, %xmm2
>         vpunpcklwd      %xmm7, %xmm0, %xmm0
>         vpunpcklwd      %xmm5, %xmm1, %xmm1
>         vpunpckldq      %xmm2, %xmm1, %xmm1
>         vpunpckldq      %xmm3, %xmm0, %xmm0
>         vpunpcklqdq     %xmm1, %xmm0, %xmm0
> ---
> 
> it's 16 "vector insert" + (4 + 2 + 1) "vector concat/permutation", so cost
> should be 92(23 * 4).

So the important part for any target is that it makes the scalar and
vector costs apples to apples because they end up being compared
against each other.  For loops the most important metric tends to be
latency which is also the only thing that can be reasonably costed
when looking at a single statement at a time.  For all other factors
coming in there's (in theory) the finish_cost hook where, after
gathering individual stmt data from add_stmt_cost, a target hook can
apply adjustments based on say functional unit allocation (IIRC
the powerpc backend looks whether there are "many" shifts and
disparages vectorization in that case).

For the vector construction the x86 backend does a reasonable job
in costing - the only thing that's not very well modeled is the
extra cost of constructing from values in GPRs compared to
values in XMM regs (on some CPU archs that even has extra penalties).
But as seen above "GPR" values can also come from memory where
the difference vanishes (for AVX, not for SSE).
Comment 31 Richard Biener 2020-09-28 06:59:30 UTC
(In reply to Kewen Lin from comment #29)
> (In reply to Hongtao.liu from comment #28)
> > > Probably you can try to tweak it in ix86_add_stmt_cost? when the statement
> > 
> > Yes, it's the place.
> > 
> > > is UB to UH conversion statement, further check if the def of the input UB
> > > is MEM.
> > 
> > Only if there's no multi-use for UB. More generally, it's quite difficult to
> > guess later optimizations for the purpose of more accurate vectorization
> > cost model, :(.
> 
> Yeah, it's hard sadly. The generic cost modeling is rough,
> ix86_add_stmt_cost is more fine-grain (at least than what we have on Power
> :)), if you want to check it more, it seems doable in target specific hook
> finish_cost where you can get the whole vinfo object, but it could end up
> with very heavy analysis and might not be worthy.
> 
> Do you mind to check if it can also fix this degradation on x86 to run FRE
> and DSE just after cunroll? I found it worked for Power, hoped it can help
> there too.

Btw, we could try sth like adding a TODO_force_next_scalar_cleanup to be
returned from passes that see cleanup opportunities and have the pass
manager queue that up, looking for a special marked pass and enabling
that so we could have

          NEXT_PASS (pass_predcom);
          NEXT_PASS (pass_complete_unroll);
          NEXT_PASS (pass_scalar_cleanup);
          PUSH_INSERT_PASSES_WITHIN (pass_scalar_cleanup);
            NEXT_PASS (pass_fre, false /* may_iterate */);
            NEXT_PASS (pass_dse);
          POP_INSERT_PASSES ();

with pass_scalar_cleanup gate() returning false otherwise.  Eventually
pass properties would match this better, or sth else.

That said, running a cleanup on the whole function should be done via
a separate pass - running a cleanup on a sub-CFG can be done from
within another pass.  But mind that sub-CFG cleanup really has to be
of O(size-of-sub-CFG), otherwise it doesn't help.
Comment 32 Kewen Lin 2020-09-28 12:54:23 UTC
(In reply to Richard Biener from comment #31)
> (In reply to Kewen Lin from comment #29)
> > (In reply to Hongtao.liu from comment #28)
> > > > Probably you can try to tweak it in ix86_add_stmt_cost? when the statement
> > > 
> > > Yes, it's the place.
> > > 
> > > > is UB to UH conversion statement, further check if the def of the input UB
> > > > is MEM.
> > > 
> > > Only if there's no multi-use for UB. More generally, it's quite difficult to
> > > guess later optimizations for the purpose of more accurate vectorization
> > > cost model, :(.
> > 
> > Yeah, it's hard sadly. The generic cost modeling is rough,
> > ix86_add_stmt_cost is more fine-grain (at least than what we have on Power
> > :)), if you want to check it more, it seems doable in target specific hook
> > finish_cost where you can get the whole vinfo object, but it could end up
> > with very heavy analysis and might not be worthy.
> > 
> > Do you mind to check if it can also fix this degradation on x86 to run FRE
> > and DSE just after cunroll? I found it worked for Power, hoped it can help
> > there too.
> 
> Btw, we could try sth like adding a TODO_force_next_scalar_cleanup to be
> returned from passes that see cleanup opportunities and have the pass
> manager queue that up, looking for a special marked pass and enabling
> that so we could have
> 
>           NEXT_PASS (pass_predcom);
>           NEXT_PASS (pass_complete_unroll);
>           NEXT_PASS (pass_scalar_cleanup);
>           PUSH_INSERT_PASSES_WITHIN (pass_scalar_cleanup);
>             NEXT_PASS (pass_fre, false /* may_iterate */);
>             NEXT_PASS (pass_dse);
>           POP_INSERT_PASSES ();
> 
> with pass_scalar_cleanup gate() returning false otherwise.  Eventually
> pass properties would match this better, or sth else.
> 

Thanks for the suggestion! Before cooking the patch, I have one question: it looks like updating a function property would be enough, e.g. some pass sets the property PROP_ok_for_cleanup and pass_scalar_cleanup later only runs for functions with this property (checked in its gate).  I'm not quite sure of the reason for the TODO flag TODO_force_next_scalar_cleanup.
Comment 33 Richard Biener 2020-09-28 13:27:50 UTC
(In reply to Kewen Lin from comment #32)
> (In reply to Richard Biener from comment #31)
> > (In reply to Kewen Lin from comment #29)
> > > (In reply to Hongtao.liu from comment #28)
> > > > > Probably you can try to tweak it in ix86_add_stmt_cost? when the statement
> > > > 
> > > > Yes, it's the place.
> > > > 
> > > > > is UB to UH conversion statement, further check if the def of the input UB
> > > > > is MEM.
> > > > 
> > > > Only if there's no multi-use for UB. More generally, it's quite difficult to
> > > > guess later optimizations for the purpose of more accurate vectorization
> > > > cost model, :(.
> > > 
> > > Yeah, it's hard sadly. The generic cost modeling is rough,
> > > ix86_add_stmt_cost is more fine-grain (at least than what we have on Power
> > > :)), if you want to check it more, it seems doable in target specific hook
> > > finish_cost where you can get the whole vinfo object, but it could end up
> > > with very heavy analysis and might not be worthy.
> > > 
> > > Do you mind to check if it can also fix this degradation on x86 to run FRE
> > > and DSE just after cunroll? I found it worked for Power, hoped it can help
> > > there too.
> > 
> > Btw, we could try sth like adding a TODO_force_next_scalar_cleanup to be
> > returned from passes that see cleanup opportunities and have the pass
> > manager queue that up, looking for a special marked pass and enabling
> > that so we could have
> > 
> >           NEXT_PASS (pass_predcom);
> >           NEXT_PASS (pass_complete_unroll);
> >           NEXT_PASS (pass_scalar_cleanup);
> >           PUSH_INSERT_PASSES_WITHIN (pass_scalar_cleanup);
> >             NEXT_PASS (pass_fre, false /* may_iterate */);
> >             NEXT_PASS (pass_dse);
> >           POP_INSERT_PASSES ();
> > 
> > with pass_scalar_cleanup gate() returning false otherwise.  Eventually
> > pass properties would match this better, or sth else.
> > 
> 
> Thanks for the suggestion! Before cooking the patch, I have one question
> that it looks to only update function property is enough, eg: some pass sets
> property PROP_ok_for_cleanup and later pass_scalar_cleanup only goes for the
> func with this property (checking in gate), I'm not quite sure the reason
> for the TODO_flag TODO_force_next_scalar_cleanup.

properties are not an easy fit since they are static in the pass
description while we want to trigger the cleanup only if we unrolled
an outermost loop for example.  Returning TODO_force_next_scalar_cleanup
from cunroll is the natural way to signal this.

The hackish way then would be to "queue" this TODO_force_next_scalar_cleanup
inside the pass manager itself until it runs into a
"scalar cleanup pass" (either somehow marked or just hard-matched), forcing
its gate() to evaluate to true (just skipping its evaluation).

As said, there's no "nice" way for the information flow at the moment.

One could expose the "pending TODO" (TODO not handled by the pass manager
itself) in a global variable (like current_pass) so the cleanup pass
gate() could check

  gate () { return pending_todo & TODO_force_next_scalar_cleanup; }

and in its execute clear this bit from pending_todo.
Comment 34 Hongtao.liu 2020-09-29 02:05:07 UTC
(In reply to Kewen Lin from comment #29)
> (In reply to Hongtao.liu from comment #28)
> > > Probably you can try to tweak it in ix86_add_stmt_cost? when the statement
> > 
> > Yes, it's the place.
> > 
> > > is UB to UH conversion statement, further check if the def of the input UB
> > > is MEM.
> > 
> > Only if there's no multi-use for UB. More generally, it's quite difficult to
> > guess later optimizations for the purpose of more accurate vectorization
> > cost model, :(.
> 
> Yeah, it's hard sadly. The generic cost modeling is rough,
> ix86_add_stmt_cost is more fine-grain (at least than what we have on Power
> :)), if you want to check it more, it seems doable in target specific hook
> finish_cost where you can get the whole vinfo object, but it could end up
> with very heavy analysis and might not be worthy.
> 
> Do you mind to check if it can also fix this degradation on x86 to run FRE
> and DSE just after cunroll? I found it worked for Power, hoped it can help
> there too.

No, it's not working for CLX; the problem in the i386 backend is a bit different.
Comment 35 rsandifo@gcc.gnu.org 2020-09-29 12:27:34 UTC
(In reply to rguenther@suse.de from comment #24)
> On September 27, 2020 4:56:43 AM GMT+02:00, crazylht at gmail dot com
> <gcc-bugzilla@gcc.gnu.org> wrote:
> >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> >
> >--- Comment #22 from Hongtao.liu <crazylht at gmail dot com> ---
> >>One of my workmates found that if we disable vectorization for SPEC2017
> >>525.x264_r function sub4x4_dct in source file x264_src/common/dct.c with
> >>explicit function attribute __attribute__((optimize("no-tree-vectorize"))),
> >>it can speed up by 4%.
> >
> >For CLX, if we disable slp vectorization in sub4x4_dct by 
> >__attribute__((optimize("no-tree-slp-vectorize"))), it can also speed
> >up by 4%.
> >
> >> Thanks Richi! Should we take care of this case? or neglect this kind
> >of
> >> extension as "no instruction"? I was intent to handle it in target
> >specific
> >> code, but it isn't recorded into cost vector while it seems too heavy
> >to do
> >> the bb_info slp_instances revisits in finish_cost.
> >
> >For i386 backend unsigned char --> unsigned short is no "no
> >instruction", but
> >in this case
> >---
> >1033  _134 = MEM[(pixel *)pix1_295 + 2B];                              
> >        
> >1034  _135 = (short unsigned int) _134;
> >---
> >
> >It could be combined and optimized to 
> >---
> >movzbl  19(%rcx), %r8d
> >---
> >
> >So, if "unsigned char" variable is loaded from memory, then the
> >convertion
> >would also be "no instruction", i'm not sure if backend cost model
> >could handle
> >such situation.
> 
> I think all attempts to address this from the side of the scalar cost is
> going to be difficult and fragile..
Agreed FWIW.  Even in rtl, the kinds of conversion we're talking
about could be removed, such as by proving that the upper bits are
already correct, by combining the extension with other instructions
so that it becomes “free” again, or by ree.  Proving that the upper
bits are already correct isn't uncommon: gimple has to make a choice
between signed and unsigned types even if both choices would be
correct, whereas rtl is sign-agnostic for storage.

So it's not obvious to me that trying model things at this level is
going to be right more often than it's wrong.
Comment 36 CVS Commits 2020-11-03 06:28:56 UTC
The master branch has been updated by Kewen Lin <linkw@gcc.gnu.org>:

https://gcc.gnu.org/g:f5e18dd9c7dacc9671044fc669bd5c1b26b6bdba

commit r11-4637-gf5e18dd9c7dacc9671044fc669bd5c1b26b6bdba
Author: Kewen Lin <linkw@gcc.gnu.org>
Date:   Tue Nov 3 02:51:47 2020 +0000

    pass: Run cleanup passes before SLP [PR96789]
    
    As discussed in PR96789, we found some scalar stmts which can be
    eliminated by passes after SLP but whose costs we still modeled
    when trying to SLP, which could impact the vectorizer's decision.
    One typical case is the case in PR96789 on target Power.
    
    As Richard suggested there, this patch is to introduce one pass
    called pre_slp_scalar_cleanup which has some secondary clean up
    passes, for now they are FRE and DSE.  It introduces one new
    TODO flags group called pending TODO flags, unlike normal TODO
    flags, the pending TODO flags are passed down in the pipeline
    until one of its consumers can perform the requested action.
    Consumers should then clear the flags for the actions that they
    have taken.
    
    Some compilation time statistics on all SPEC2017 INT bmks were
    collected on one Power9 machine for several option sets below:
      A1: -Ofast -funroll-loops
      A2: -O1
      A3: -O1 -funroll-loops
      A4: -O2
      A5: -O2 -funroll-loops
    
    the corresponding increment rate is trivial:
      A1       A2       A3        A4        A5
      0.08%    0.00%    -0.38%    -0.10%    -0.05%
    
    Bootstrapped/regtested on powerpc64le-linux-gnu P8.
    
    gcc/ChangeLog:
    
            PR tree-optimization/96789
            * function.h (struct function): New member unsigned pending_TODOs.
            * passes.c (class pass_pre_slp_scalar_cleanup): New class.
            (make_pass_pre_slp_scalar_cleanup): New function.
            (pass_data_pre_slp_scalar_cleanup): New pass data.
            * passes.def: (pass_pre_slp_scalar_cleanup): New pass, add
            pass_fre and pass_dse as its children.
            * timevar.def (TV_SCALAR_CLEANUP): New timevar.
            * tree-pass.h (PENDING_TODO_force_next_scalar_cleanup): New
            pending TODO flag.
            (make_pass_pre_slp_scalar_cleanup): New declare.
            * tree-ssa-loop-ivcanon.c (tree_unroll_loops_completely_1):
            Once any outermost loop gets unrolled, flag cfun pending_TODOs
            PENDING_TODO_force_next_scalar_cleanup on.
    
    gcc/testsuite/ChangeLog:
    
            PR tree-optimization/96789
            * gcc.dg/tree-ssa/ssa-dse-28.c: Adjust.
            * gcc.dg/tree-ssa/ssa-dse-29.c: Likewise.
            * gcc.dg/vect/bb-slp-41.c: Likewise.
            * gcc.dg/tree-ssa/pr96789.c: New test.
Comment 37 Kewen Lin 2020-11-05 02:25:50 UTC
The degradation should be gone on Power now.