Bug 50374 - Support vectorization of min/max location pattern
Summary: Support vectorization of min/max location pattern
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.7.0
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2011-09-13 07:48 UTC by vincenzo Innocente
Modified: 2016-08-27 23:23 UTC (History)
5 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail: 4.9.1
Last reconfirmed: 2014-08-24 00:00:00


Attachments
gcc47-pr50374-wip.patch (13.70 KB, patch)
2011-09-19 15:59 UTC, Jakub Jelinek
Details | Diff
fix (1.19 KB, patch)
2011-09-20 11:45 UTC, Ira Rosen
Details | Diff
complete patch including my fix (14.00 KB, patch)
2011-09-20 11:47 UTC, Ira Rosen
Details | Diff
sorry, fixed it now (1.19 KB, patch)
2011-09-20 12:17 UTC, Ira Rosen
Details | Diff
complete patch (14.00 KB, patch)
2011-09-20 12:18 UTC, Ira Rosen
Details | Diff
gcc47-pr50374.patch (16.24 KB, patch)
2011-09-21 12:10 UTC, Jakub Jelinek
Details | Diff
gcc47-pr50374.patch (17.98 KB, patch)
2011-09-21 16:08 UTC, Jakub Jelinek
Details | Diff
gcc47-pr50374.patch (19.60 KB, patch)
2011-09-22 15:10 UTC, Jakub Jelinek
Details | Diff
gcc47-pr50374.patch (19.22 KB, patch)
2011-11-29 14:46 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description vincenzo Innocente 2011-09-13 07:48:19 UTC
in reference to http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00631.html
and the discussion in PR47860:
any chance to see "min/max location pattern" vectorization in 4.7?
Comment 1 Ira Rosen 2011-09-13 07:55:20 UTC
I am not planning to work on this.

Ira
Comment 2 Jakub Jelinek 2011-09-13 08:29:24 UTC
With the recent vcond* changes doing float comparison and having integral vectors after ? and : is fine.
Comment 3 vincenzo Innocente 2011-09-13 08:45:40 UTC
with
gcc version 4.7.0 20110910 (experimental) (GCC) 


int lmin(float const * __restrict__ c, int N) {
  int k=0;
  for (int i=1; i!=N; ++i)
     k =  (c[k] > c[i]) ? i : k;
  return k;
}

does not vectorize.
any suggestion how to modify it?
Comment 4 Jakub Jelinek 2011-09-19 12:51:59 UTC
I've looked briefly at the http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00631.html
patch, it seems the UNSPEC_REDUC* is unnecessary there (only used in the expanders ending with DONE, in that case just a list of the operands is enough),
and I don't see any reason for the new RTL codes either (why not just use
(gt:V4SI (reg:V4SF r1) (reg:V4SF r2)) etc. in the patterns instead?).
Furthermore, the floating vs. integral operands in vcond* should now be handled on the trunk (just supported on i386/x86_64, but there is no reason why it can't be added for rs6000 too).
The patch unfortunately doesn't apply cleanly, am trying to resolve the rejects now.
Comment 5 Jakub Jelinek 2011-09-19 15:59:06 UTC
Created attachment 25320 [details]
gcc47-pr50374-wip.patch

Here is the ported patch with rejects hopefully resolved and a few bugfixes for the 4.6 -> 4.7 changes.
I've omitted the vcondc/vcondcu stuff because the trunk handles mixed float/integral vcond/u already, the rs6000 stuff (because IMHO the first step for rs6000 should be to add support for the mixed float/integral vcond/u for rs6000 and I'm not familiar with rs6000 enough), GTF/EQF etc. dropped (I think that it is not needed) and testcases separated out of it for now.
I had to make changes beyond rejects as COND_EXPR assignments are now represented differently and the pattern stmts are no longer emitted into the basic blocks.  I see no reason why this should be limited to floating point comparisons and integral indexes, IMHO e.g. finding location of smallest integer is common too.

That said, I'm now stuck with it, on the fast-math-no-pre-minmax-loc-1.c (on x86_64 -O2 -ftree-vectorize -fno-tree-pre) testcase the compound pattern recognition seems to work, but still the vectorizer gives up:
17: Unsupported pattern.
17: not vectorized: unsupported use in stmt.
17: unexpected pattern.

Ira, do you think you could have a quick look at this?
Comment 6 Ira Rosen 2011-09-20 08:27:08 UTC
Thanks for working on this!
It looks like the problem is with the way the stmts are marked. We don't insert pattern stmts now, so the things are more tricky. I'll try to fix this.
Comment 7 Ira Rosen 2011-09-20 11:45:53 UTC
Created attachment 25322 [details]
fix

Here is the fix (it's a diff relative to your patch).
Comment 8 Ira Rosen 2011-09-20 11:47:00 UTC
Created attachment 25323 [details]
complete patch including my fix
Comment 9 vincenzo Innocente 2011-09-20 12:05:01 UTC
does not compile to me

../.././gcc/tree-vect-loop.c: In function 'vect_is_simple_reduction_1':
../.././gcc/tree-vect-loop.c:2237:35: warning: suggest parentheses around '&&' within '||' [-Wparentheses]
../.././gcc/tree-vect-loop.c:2238:5: error: expected ')' before '{' token
../.././gcc/tree-vect-loop.c:2402:1: error: expected expression before '}' token
../.././gcc/tree-vect-loop.c:2021:33: warning: unused variable 'def2' [-Wunused-variable]
../.././gcc/tree-vect-loop.c: At top level:
../.././gcc/tree-vect-loop.c:1798:1: warning: 'vect_is_slp_reduction' defined but not used [-Wunused-function]
../.././gcc/tree-vect-loop.c: In function 'vect_is_simple_reduction_1':
../.././gcc/tree-vect-loop.c:2402:1: warning: control reaches end of non-void function [-Wreturn-type]
make[3]: *** [tree-vect-loop.o] Error 1

looks like some parentheses is missing in the if statement
Comment 10 Ira Rosen 2011-09-20 12:17:46 UTC
Created attachment 25324 [details]
sorry, fixed it now
Comment 11 Ira Rosen 2011-09-20 12:18:50 UTC
Created attachment 25325 [details]
complete patch
Comment 12 vincenzo Innocente 2011-09-20 13:46:16 UTC
I'm getting these errors

../.././gcc/optabs.c: In function 'optab_d* optab_for_tree_code(tree_code, const_tree, optab_subtype)':
../.././gcc/optabs.c:470:9: error: cannot convert 'convert_optab_d*' to 'optab' in return
../.././gcc/optabs.c:474:9: error: cannot convert 'convert_optab_d*' to 'optab' in return
../.././gcc/optabs.c:478:9: error: cannot convert 'convert_optab_d*' to 'optab' in return
../.././gcc/optabs.c:482:9: error: cannot convert 'convert_optab_d*' to 'optab' in return

any idea what could be wrong?
Comment 13 Jakub Jelinek 2011-09-20 14:01:55 UTC
Ira, thanks for your help, I'll continue working on this now.
To vincenzo, this patch isn't finished, so it is premature to test it.
Comment 14 Jakub Jelinek 2011-09-21 12:10:28 UTC
Created attachment 25331 [details]
gcc47-pr50374.patch

Updated patch, not very heavily tested yet.
The optabs stuff needs to change, because currently i?86 has 480 gen_reduc*loc*
expanders, that's way too much.

Thus I think it will be better to only iterate on the two modes, and pass
the comparison operator as another argument like we pass to vcond{,u}, from which the expansion can determine what kind of comparison is supposed to happen (signed vs. unsigned, min vs. max and first vs. last).  I'll change it now.

Another thing is that this really ought to work even with -ftree-pre, having a vectorization that requires users to disable PRE would be weird.  I believe there is one extra assignment, will look at that.

Then there is the question if ifcvt should, as richi asked, fold the COND_EXPR into MIN/MAX and vectorizer be adjusted for that.  Or, if we just should it vectorize it using two VECT_COND_EXPRs instead (with the same condition).

Then, can also vectorize e.g. double comparisons, but only int indexes?
Comment 15 Ira Rosen 2011-09-21 12:20:47 UTC
(In reply to comment #14)

> Another thing is that this really ought to work even with -ftree-pre, having a
> vectorization that requires users to disable PRE would be weird.  I believe
> there is one extra assignment, will look at that.

It was discussed in pr 44711 and http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02982.html.
Comment 16 Jakub Jelinek 2011-09-21 16:08:29 UTC
Created attachment 25333 [details]
gcc47-pr50374.patch

Updated patch which changes the optab as I mentioned, and adds the testcases (both original Ira's and two new testcases, one which tries various type combinations).
Currently on both x86_64 and i686 the big new testcase (-12.c) fails, apparently some issue with float comparison and unsigned int index, likely a backend issue, debugging, and -3.c fails with an ICE in the vectorizer, Ira, could you look at that?  Let's leave the PRE issue for PRE for now, I'll try tomorrow to adjust ifcvt for the min/max folding.
Comment 17 Jakub Jelinek 2011-09-21 16:57:08 UTC
Seems my reduction patterns just give some extreme's index, but not necessarily the first or last extreme's index.  It was just narrowing down the index vector
together with the comparison vector the same way, picking the values the comparison picked.  Seems Ira's patch does the right thing, in particular,
finding the minimum resp. maximum (then at least on i?86 I'll need to broadcast that value to all vector members as the shifts have zero filled instead of kept the upper bits as is), then compare that to the original comparison vector (i.e. get mask where the minimum values were), mask the index vector with that mask, find the minimum (for *FIRST_LOC*) resp. maximum (for *LAST_LOC*) in the index vector (for FIRST_LOC ~mask actually needs to be ored in).
I'll fix that up tomorrow.

Now, does the generic vectorizer code ensure that the biv has the right properties (doesn't wrap in the loop and is strictly increasing)?  And, if it is signed that it is never negative?
Comment 18 Ira Rosen 2011-09-22 07:51:35 UTC
(In reply to comment #16)
> and -3.c fails with an ICE in the vectorizer, Ira,
> could you look at that?  

--- tree-vect-stmts.c   2011-09-22 09:48:34.000000000 +0200
+++ tree-vect-stmts.c.new       2011-09-22 09:48:14.000000000 +0200
@@ -4806,7 +4806,8 @@ vectorizable_condition (gimple stmt, gim
     return false;

   gcc_assert (ncopies >= 1);
-  if (reduc_index && ncopies > 1)
+  if ((reduc_index || STMT_VINFO_COMPOUND_PATTERN (stmt_info))
+      && (ncopies > 1))
     return false; /* FORNOW */

   if (!STMT_VINFO_RELEVANT_P (stmt_info)
Comment 19 Jakub Jelinek 2011-09-22 15:10:34 UTC
Created attachment 25341 [details]
gcc47-pr50374.patch

Thanks.  Here is an updated patch, with hopefully fixed backend part, which passes the whole newly added testsuite.
Unfortunately, even with -fno-tree-pre -fno-vect-cost-model, on the *-12.c testcase it vectorizes just 12 loops (f_*_[fiu]_u), not even with -mavx2 where e.g. I'd expect f_*_{d,ll,ull}_ull to be vectorized too, or e.g. the [fiu]_i etc.  Seems the pattern recognizer is just too restrictive in finding the IV.
On the other side as I wrote earlier, the check whether the index is strigly increasing through the whole loop is missing (if the loop bounds are known, I guess we could check its POLYNOMIAL_CHREC whether it has the expected form and
whether it won't wrap/overflow, and if the loop bound is unknown, if there is addition done in a signed type, assume it won't wrap, and for unsigned if the increment is 1 and it has the size bigger or equal to pointer size and init 0/1, then it won't wrap either.

BTW, I think on i?86/x86_64 we could in theory support even mixed size reductions, e.g. when the index is long long (64-bit) and comparison int or float, then I think we could use {,v}pmovsxdq instruction to extend the mask where extremes are present from vector of 4 ints or floats to a vector of 4 long longs.
Comment 20 Jakub Jelinek 2011-09-22 15:52:11 UTC
The reason why many of the loops in *-12.c aren't vectorized is that for idxtype other than unsigned int which is the type of i there is a cast from i to the type of pos.

OT, one big disadvantage of the latest patch is that the code to compute
the extreme is different between normal smin/smax/umin/umax reduction and between the first part of the min/max_loc reduction - the former is faster and only computes the extreme in the lowest element of the vector, while the latter computes it duplicated in all vector elements.  Which means if a loop wants to compute both the extreme and position of the extreme, it now results in unnecessary extra instructions.  We don't support currently two results operations in GIMPLE, so I wonder how to express that.  Perhaps use different min/max reduction expression in that case?  Say REDUC_MIN_LOC_VALUE_EXPR etc.,
which would use reduc_min_loc_* optab to expand as well?  That optab would have
two targets, and if we wanted to expand REDUC_MIN_LOC_EXPR, we'd use one of the targets and pass a gen_reg_rtx () to the other one, for REDUC_MIN_LOC_VALUE_EXPR the other way around.  Thoughts?
Comment 21 vincenzo Innocente 2011-11-29 09:54:49 UTC
My understanding is that we cannot count to see a "min/max location pattern" vectorization in 4.7, isn't it?
Comment 22 Jakub Jelinek 2011-11-29 09:58:40 UTC
I gave up on it, because it still needed lots of changes in the generic vectorizer.  Guess I can update the patch so that it at least applies, but there is more work on it.
Comment 23 vincenzo Innocente 2011-11-29 10:10:06 UTC
a working patch could be useful to check performance and quality of the generated ASM w.r.t. manually crafted soutions
Comment 24 Jakub Jelinek 2011-11-29 14:46:59 UTC
Created attachment 25945 [details]
gcc47-pr50374.patch

This updated patch applies (had to remove various parts from the older patch that have been upstreamed independently and adjust e.g. to the vectorizable_condition SLP changes), and passes vect.exp=\*minmax\* testing, haven't tested anything beyond that.  The above written issues still apply of course.
Comment 25 vincenzo Innocente 2011-11-29 15:24:02 UTC
Thanks Jakub for the new patch.
Using it I get this: any Idea how to fix/avoid?

home/data/newsoft/gcc-trunk/host-x86_64-unknown-linux-gnu/gcc/xgcc -B/home/data/newsoft/gcc-trunk/host-x86_64-unknown-linux-gnu/gcc/ -B/afs/cern.ch/user/i/innocent/w2/x86_64-unknown-linux-gnu/bin/ -B/afs/cern.ch/user/i/innocent/w2/x86_64-unknown-linux-gnu/lib/ -isystem /afs/cern.ch/user/i/innocent/w2/x86_64-unknown-linux-gnu/include -isystem /afs/cern.ch/user/i/innocent/w2/x86_64-unknown-linux-gnu/sys-include    -g -O2 -ftree-vectorize -fPIC -O2  -g -O2 -ftree-vectorize -fPIC -DIN_GCC   -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition  -isystem ./include   -fPIC -g -DIN_LIBGCC2 -fbuilding-libgcc -fno-stack-protector   -fPIC -I. -I. -I../../host-x86_64-unknown-linux-gnu/gcc -I../.././libgcc -I../.././libgcc/. -I../.././libgcc/../gcc -I../.././libgcc/../include -I../.././libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT -DHAVE_CC_TLS  -DUSE_TLS -o bid_binarydecimal.o -MT bid_binarydecimal.o -MD -MP -MF bid_binarydecimal.dep -c ../.././libgcc/config/libbid/bid_binarydecimal.c
../.././libgcc/config/libbid/bid_binarydecimal.c: In function '__bid32_to_binary128':
../.././libgcc/config/libbid/bid_binarydecimal.c:145115:1: internal compiler error: vector VEC(vec_void_p,base) index domain error, in vinfo_for_stmt at tree-vectorizer.h:636
Comment 26 vincenzo Innocente 2011-11-30 09:13:08 UTC
using a minimal configuration like
./configure --prefix=/something --enable-languages=c,c++ --disable-multilib --disable-bootstrap --enable-lto -enable-gold=yes -disable-libitm
I manage to build with the patch

int lmin(float const * __restrict__ c, int N) {
  int k=0;
  for (int i=1; i!=N; ++i)
     k =  (c[k] > c[i]) ? i : k;
  return k;
}
indeed vectorizes (sse, not avx).
The amount of code emitted is quite large though, it will most probably worth only in a long loop
with more instructions to vectorize: the effective speedup need to be properly evaluated case by case
Comment 27 vincenzo Innocente 2014-08-23 17:36:57 UTC
coming back to this old issue.
Any chance to see it implemented in the auto-vectorizer soon?

using "extended vectors" I manage to vectorize "min_element" as below.
In principle the auto-vectorizer should be able to do the same starting from the loop in comment 3


typedef float __attribute__( ( vector_size( 16 ) ) ) float32x4_t;
typedef float __attribute__( ( vector_size( 16 ) , aligned(4) ) ) float32x4a4_t;
typedef int __attribute__( ( vector_size( 16 ) ) ) int32x4_t;


inline
float32x4_t load(float const * x) {
   return *(float32x4a4_t const *)(x);
}


int minloc(float const * x, int N) {
  float32x4_t v0;
  int32x4_t index;

  auto M = 4*(N/4);
  for (int i=M; i<N; ++i) {
    v0[i-M] = x[i];
    index[i]=i;
  }
  for (int i=N; i<M+4;++i) {
    v0[i-M] = x[0];
    index[i]=0;
  }
  int32x4_t j = {0,1,2,3};
  for (int i=0; i<M; i+=4) {
    decltype(auto) v = load(x+i);
    index =  (v<v0) ? j : index;
    v0 = (v<v0) ? v : v0;
    j+=4;
  }
  auto k = 0;
  for (int i=1;i<4; ++i) if (v0[i]<v0[k]) k=i;
  return index[k];
}


#include<iostream>
#include<algorithm>
#include <x86intrin.h>
unsigned int taux=0;
inline unsigned long long rdtscp() {
 return __rdtscp(&taux);
}

int main() {

  float x[1024];
  for (int i=0; i<1024; ++i) x[i]= i%2 ? i : -i;
  for (int i = 0; i<10; ++i) {
   std::random_shuffle(x,x+1024);
   long long ts = -rdtscp();
   int l1 = std::min_element(x+i,x+1024) - (x+i);
   ts +=rdtscp();
   long long tv = -rdtscp();	
   int l2 = minloc(x+i,1024-i);
   tv +=rdtscp();

    std::cout << "min is at " << l1 << ' ' << ts << std::endl;
    std::cout << "minloc " << l2 << ' ' << tv << std::endl;
  }
  return 0;

}


which result in a pretty good speed up
c++ -std=c++1y -Ofast minloc.cc -march=nehalem
./a.out
./a.out 
min is at 959 13780
minloc 959 2380
min is at 536 13680
minloc 536 4972
min is at 513 13648
minloc 513 1848
min is at 825 13640
minloc 825 1924
min is at 885 13628
minloc 885 1644
min is at 636 11252
minloc 636 1536
min is at 982 11240
minloc 982 1416
min is at 382 11228
minloc 382 1392
min is at 271 11216
minloc 271 1340
min is at 50 11204
minloc 50 1384
Comment 28 Joost VandeVondele 2014-08-24 10:30:24 UTC
actually, minloc/maxloc are also Fortran intrinsics, and while expanded inline, don't vectorize either. The following testcase runs 5x faster with ifort than with gfortran:

INTEGER FUNCTION ana(a)
 REAL :: a(1024)
 ana=maxloc(a,1)
END FUNCTION


INTEGER :: NRUN=1000000
REAL :: a(1024)
INTEGER :: ana,I
a=(/(I,I=1,1024)/)
DO I=1,NRUN
   j=j+ana(a)
ENDDO
WRITE(6,*) NRUN,j,ana(a)
END

> ifort -fast -no-ipo -fno-inline-functions t.f90 ; time ./a.out
     1000000  1024000000        1024

real	0m0.233s
user	0m0.228s
sys	0m0.004s

> gfortran -Ofast -fno-inline-functions t.f90 ; time ./a.out
     1000000  1024000000        1024

real	0m1.193s
user	0m1.188s
sys	0m0.000s