in reference to http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00631.html and the discussion in PR47860: any chance to see "min/max location pattern" vectorization in 4.7?
I am not planning to work on this. Ira
With the recent vcond* changes doing float comparison and having integral vectors after ? and : is fine.
with gcc version 4.7.0 20110910 (experimental) (GCC) int lmin(float const * __restrict__ c, int N) { int k=0; for (int i=1; i!=N; ++i) k = (c[k] > c[i]) ? i : k; return k; } does not vectorize. any suggestion how to modify it?
I've looked briefly at the http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00631.html patch, it seems the UNSPEC_REDUC* is unnecessary there (only used in the expanders ending with DONE, in that case just a list of the operands is enough), and I don't see any reason for the new RTL codes either (why not just use (gt:V4SI (reg:V4SF r1) (reg:V4SF r2)) etc. in the patterns instead?). Furthermore, the floating vs. integral operands in vcond* should now be handled on the trunk (just supported on i386/x86_64, but there is no reason why it can't be added for rs6000 too). The patch unfortunately doesn't apply cleanly, am trying to resolve the rejects now.
Created attachment 25320 [details] gcc47-pr50374-wip.patch Here is the ported patch with rejects hopefully resolved and a few bugfixes for the 4.6 -> 4.7 changes. I've omitted the vcondc/vcondcu stuff because the trunk handles mixed float/integral vcond/u already, the rs6000 stuff (because IMHO the first step for rs6000 should be to add support for the mixed float/integral vcond/u for rs6000 and I'm not familiar with rs6000 enough), GTF/EQF etc. dropped (I think that it is not needed) and testcases separated out of it for now. I had to make changes beyond rejects as COND_EXPR assignments are now represented differently and the pattern stmts are no longer emitted into the basic blocks. I see no reason why this should be limited to floating point comparisons and integral indexes, IMHO e.g. finding location of smallest integer is common too. That said, I'm now stuck with it, on the fast-math-no-pre-minmax-loc-1.c (on x86_64 -O2 -ftree-vectorize -fno-tree-pre) testcase the compound pattern recognition seems to work, but still the vectorizer gives up: 17: Unsupported pattern. 17: not vectorized: unsupported use in stmt. 17: unexpected pattern. Ira, do you think you could have a quick look at this?
Thanks for working on this! It looks like the problem is with the way the stmts are marked. We don't insert pattern stmts now, so the things are more tricky. I'll try to fix this.
Created attachment 25322 [details] fix Here is the fix (it's a diff relative to your patch).
Created attachment 25323 [details] complete patch including my fix
does not compile to me ../.././gcc/tree-vect-loop.c: In function 'vect_is_simple_reduction_1': ../.././gcc/tree-vect-loop.c:2237:35: warning: suggest parentheses around '&&' within '||' [-Wparentheses] ../.././gcc/tree-vect-loop.c:2238:5: error: expected ')' before '{' token ../.././gcc/tree-vect-loop.c:2402:1: error: expected expression before '}' token ../.././gcc/tree-vect-loop.c:2021:33: warning: unused variable 'def2' [-Wunused-variable] ../.././gcc/tree-vect-loop.c: At top level: ../.././gcc/tree-vect-loop.c:1798:1: warning: 'vect_is_slp_reduction' defined but not used [-Wunused-function] ../.././gcc/tree-vect-loop.c: In function 'vect_is_simple_reduction_1': ../.././gcc/tree-vect-loop.c:2402:1: warning: control reaches end of non-void function [-Wreturn-type] make[3]: *** [tree-vect-loop.o] Error 1 looks like some parentheses is missing in the if statement
Created attachment 25324 [details] sorry, fixed it now
Created attachment 25325 [details] complete patch
I'm getting these errors ../.././gcc/optabs.c: In function 'optab_d* optab_for_tree_code(tree_code, const_tree, optab_subtype)': ../.././gcc/optabs.c:470:9: error: cannot convert 'convert_optab_d*' to 'optab' in return ../.././gcc/optabs.c:474:9: error: cannot convert 'convert_optab_d*' to 'optab' in return ../.././gcc/optabs.c:478:9: error: cannot convert 'convert_optab_d*' to 'optab' in return ../.././gcc/optabs.c:482:9: error: cannot convert 'convert_optab_d*' to 'optab' in return any idea what could be wrong?
Ira, thanks for your help, I'll continue working on this now. To vincenzo, this patch isn't finished, so it is premature to test it.
Created attachment 25331 [details] gcc47-pr50374.patch Updated patch, not very heavily tested yet. The optabs stuff needs to change, because currently i?86 has 480 gen_reduc*loc* expanders, that's way too much. Thus I think it will be better to only iterate on the two modes, and pass the comparison operator as another argument like we pass to vcond{,u}, from which the expansion can determine what kind of comparison is supposed to happen (signed vs. unsigned, min vs. max and first vs. last). I'll change it now. Another thing is that this really ought to work even with -ftree-pre, having a vectorization that requires users to disable PRE would be weird. I believe there is one extra assignment, will look at that. Then there is the question if ifcvt should, as richi asked, fold the COND_EXPR into MIN/MAX and vectorizer be adjusted for that. Or, if we just should it vectorize it using two VECT_COND_EXPRs instead (with the same condition). Then, can also vectorize e.g. double comparisons, but only int indexes?
(In reply to comment #14) > Another thing is that this really ought to work even with -ftree-pre, having a > vectorization that requires users to disable PRE would be weird. I believe > there is one extra assignment, will look at that. It was discussed in pr 44711 and http://gcc.gnu.org/ml/gcc-patches/2010-06/msg02982.html.
Created attachment 25333 [details] gcc47-pr50374.patch Updated patch which changes the optab as I mentioned, and adds the testcases (both original Ira's and two new testcases, one which tries various type combinations). Currently on both x86_64 and i686 the big new testcase (-12.c) fails, apparently some issue with float comparison and unsigned int index, likely a backend issue, debugging, and -3.c fails with an ICE in the vectorizer, Ira, could you look at that? Let's leave the PRE issue for PRE for now, I'll try tomorrow to adjust ifcvt for the min/max folding.
Seems my reduction patterns just give some extreme's index, but not necessarily the first or last extreme's index. It was just narrowing down the index vector together with the comparison vector the same way, picking the values the comparison picked. Seems Ira's patch does the right thing, in particular, finding the minimum resp. maximum (then at least on i?86 I'll need to broadcast that value to all vector members as the shifts have zero filled instead of kept the upper bits as is), then compare that to the original comparison vector (i.e. get mask where the minimum values were), mask the index vector with that mask, find the minimum (for *FIRST_LOC*) resp. maximum (for *LAST_LOC*) in the index vector (for FIRST_LOC ~mask actually needs to be ored in). I'll fix that up tomorrow. Now, does the generic vectorizer code ensure that the biv has the right properties (doesn't wrap in the loop and is strictly increasing)? And, if it is signed that it is never negative?
(In reply to comment #16) > and -3.c fails with an ICE in the vectorizer, Ira, > could you look at that? --- tree-vect-stmts.c 2011-09-22 09:48:34.000000000 +0200 +++ tree-vect-stmts.c.new 2011-09-22 09:48:14.000000000 +0200 @@ -4806,7 +4806,8 @@ vectorizable_condition (gimple stmt, gim return false; gcc_assert (ncopies >= 1); - if (reduc_index && ncopies > 1) + if ((reduc_index || STMT_VINFO_COMPOUND_PATTERN (stmt_info)) + && (ncopies > 1)) return false; /* FORNOW */ if (!STMT_VINFO_RELEVANT_P (stmt_info)
Created attachment 25341 [details] gcc47-pr50374.patch Thanks. Here is an updated patch, with hopefully fixed backend part, which passes the whole newly added testsuite. Unfortunately, even with -fno-tree-pre -fno-vect-cost-model, on the *-12.c testcase it vectorizes just 12 loops (f_*_[fiu]_u), not even with -mavx2 where e.g. I'd expect f_*_{d,ll,ull}_ull to be vectorized too, or e.g. the [fiu]_i etc. Seems the pattern recognizer is just too restrictive in finding the IV. On the other side as I wrote earlier, the check whether the index is strigly increasing through the whole loop is missing (if the loop bounds are known, I guess we could check its POLYNOMIAL_CHREC whether it has the expected form and whether it won't wrap/overflow, and if the loop bound is unknown, if there is addition done in a signed type, assume it won't wrap, and for unsigned if the increment is 1 and it has the size bigger or equal to pointer size and init 0/1, then it won't wrap either. BTW, I think on i?86/x86_64 we could in theory support even mixed size reductions, e.g. when the index is long long (64-bit) and comparison int or float, then I think we could use {,v}pmovsxdq instruction to extend the mask where extremes are present from vector of 4 ints or floats to a vector of 4 long longs.
The reason why many of the loops in *-12.c aren't vectorized is that for idxtype other than unsigned int which is the type of i there is a cast from i to the type of pos. OT, one big disadvantage of the latest patch is that the code to compute the extreme is different between normal smin/smax/umin/umax reduction and between the first part of the min/max_loc reduction - the former is faster and only computes the extreme in the lowest element of the vector, while the latter computes it duplicated in all vector elements. Which means if a loop wants to compute both the extreme and position of the extreme, it now results in unnecessary extra instructions. We don't support currently two results operations in GIMPLE, so I wonder how to express that. Perhaps use different min/max reduction expression in that case? Say REDUC_MIN_LOC_VALUE_EXPR etc., which would use reduc_min_loc_* optab to expand as well? That optab would have two targets, and if we wanted to expand REDUC_MIN_LOC_EXPR, we'd use one of the targets and pass a gen_reg_rtx () to the other one, for REDUC_MIN_LOC_VALUE_EXPR the other way around. Thoughts?
My understanding is that we cannot count to see a "min/max location pattern" vectorization in 4.7, isn't it?
I gave up on it, because it still needed lots of changes in the generic vectorizer. Guess I can update the patch so that it at least applies, but there is more work on it.
a working patch could be useful to check performance and quality of the generated ASM w.r.t. manually crafted soutions
Created attachment 25945 [details] gcc47-pr50374.patch This updated patch applies (had to remove various parts from the older patch that have been upstreamed independently and adjust e.g. to the vectorizable_condition SLP changes), and passes vect.exp=\*minmax\* testing, haven't tested anything beyond that. The above written issues still apply of course.
Thanks Jakub for the new patch. Using it I get this: any Idea how to fix/avoid? home/data/newsoft/gcc-trunk/host-x86_64-unknown-linux-gnu/gcc/xgcc -B/home/data/newsoft/gcc-trunk/host-x86_64-unknown-linux-gnu/gcc/ -B/afs/cern.ch/user/i/innocent/w2/x86_64-unknown-linux-gnu/bin/ -B/afs/cern.ch/user/i/innocent/w2/x86_64-unknown-linux-gnu/lib/ -isystem /afs/cern.ch/user/i/innocent/w2/x86_64-unknown-linux-gnu/include -isystem /afs/cern.ch/user/i/innocent/w2/x86_64-unknown-linux-gnu/sys-include -g -O2 -ftree-vectorize -fPIC -O2 -g -O2 -ftree-vectorize -fPIC -DIN_GCC -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition -isystem ./include -fPIC -g -DIN_LIBGCC2 -fbuilding-libgcc -fno-stack-protector -fPIC -I. -I. -I../../host-x86_64-unknown-linux-gnu/gcc -I../.././libgcc -I../.././libgcc/. -I../.././libgcc/../gcc -I../.././libgcc/../include -I../.././libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT -DHAVE_CC_TLS -DUSE_TLS -o bid_binarydecimal.o -MT bid_binarydecimal.o -MD -MP -MF bid_binarydecimal.dep -c ../.././libgcc/config/libbid/bid_binarydecimal.c ../.././libgcc/config/libbid/bid_binarydecimal.c: In function '__bid32_to_binary128': ../.././libgcc/config/libbid/bid_binarydecimal.c:145115:1: internal compiler error: vector VEC(vec_void_p,base) index domain error, in vinfo_for_stmt at tree-vectorizer.h:636
using a minimal configuration like ./configure --prefix=/something --enable-languages=c,c++ --disable-multilib --disable-bootstrap --enable-lto -enable-gold=yes -disable-libitm I manage to build with the patch int lmin(float const * __restrict__ c, int N) { int k=0; for (int i=1; i!=N; ++i) k = (c[k] > c[i]) ? i : k; return k; } indeed vectorizes (sse, not avx). The amount of code emitted is quite large though, it will most probably worth only in a long loop with more instructions to vectorize: the effective speedup need to be properly evaluated case by case
coming back to this old issue. Any chance to see it implemented in the auto-vectorizer soon? using "extended vectors" I manage to vectorize "min_element" as below. In principle the auto-vectorizer should be able to do the same starting from the loop in comment 3 typedef float __attribute__( ( vector_size( 16 ) ) ) float32x4_t; typedef float __attribute__( ( vector_size( 16 ) , aligned(4) ) ) float32x4a4_t; typedef int __attribute__( ( vector_size( 16 ) ) ) int32x4_t; inline float32x4_t load(float const * x) { return *(float32x4a4_t const *)(x); } int minloc(float const * x, int N) { float32x4_t v0; int32x4_t index; auto M = 4*(N/4); for (int i=M; i<N; ++i) { v0[i-M] = x[i]; index[i]=i; } for (int i=N; i<M+4;++i) { v0[i-M] = x[0]; index[i]=0; } int32x4_t j = {0,1,2,3}; for (int i=0; i<M; i+=4) { decltype(auto) v = load(x+i); index = (v<v0) ? j : index; v0 = (v<v0) ? v : v0; j+=4; } auto k = 0; for (int i=1;i<4; ++i) if (v0[i]<v0[k]) k=i; return index[k]; } #include<iostream> #include<algorithm> #include <x86intrin.h> unsigned int taux=0; inline unsigned long long rdtscp() { return __rdtscp(&taux); } int main() { float x[1024]; for (int i=0; i<1024; ++i) x[i]= i%2 ? i : -i; for (int i = 0; i<10; ++i) { std::random_shuffle(x,x+1024); long long ts = -rdtscp(); int l1 = std::min_element(x+i,x+1024) - (x+i); ts +=rdtscp(); long long tv = -rdtscp(); int l2 = minloc(x+i,1024-i); tv +=rdtscp(); std::cout << "min is at " << l1 << ' ' << ts << std::endl; std::cout << "minloc " << l2 << ' ' << tv << std::endl; } return 0; } which result in a pretty good speed up c++ -std=c++1y -Ofast minloc.cc -march=nehalem ./a.out ./a.out min is at 959 13780 minloc 959 2380 min is at 536 13680 minloc 536 4972 min is at 513 13648 minloc 513 1848 min is at 825 13640 minloc 825 1924 min is at 885 13628 minloc 885 1644 min is at 636 11252 minloc 636 1536 min is at 982 11240 minloc 982 1416 min is at 382 11228 minloc 382 1392 min is at 271 11216 minloc 271 1340 min is at 50 11204 minloc 50 1384
actually, minloc/maxloc are also Fortran intrinsics, and while expanded inline, don't vectorize either. The following testcase runs 5x faster with ifort than with gfortran: INTEGER FUNCTION ana(a) REAL :: a(1024) ana=maxloc(a,1) END FUNCTION INTEGER :: NRUN=1000000 REAL :: a(1024) INTEGER :: ana,I a=(/(I,I=1,1024)/) DO I=1,NRUN j=j+ana(a) ENDDO WRITE(6,*) NRUN,j,ana(a) END > ifort -fast -no-ipo -fno-inline-functions t.f90 ; time ./a.out 1000000 1024000000 1024 real 0m0.233s user 0m0.228s sys 0m0.004s > gfortran -Ofast -fno-inline-functions t.f90 ; time ./a.out 1000000 1024000000 1024 real 0m1.193s user 0m1.188s sys 0m0.000s