Hello, this is really just a follow-up to PR52568. The permutations {0,2,1,3} and {1,3,0,2} can be realized with a very similar technique. Starting from 0123: vpermilpd+vperm2f128->3210 vblendpd(0123,3210)->0213 or: vpermilpd->1032 vperm2f128->2301 vblendpd(1032,2301)->1302 I am not sure if there is a nice way to generalize this or if the function expand_vec_perm_vperm2f128_vblend should be cloned a few times and slightly modified. (these permutations are less important to me than 1230 was)
Note that {1,2,0,3} seems harder, I need one extra vpermilpd. Actually, it looks like every v4df shuffle can be realized as a vblendpd of a vpermilpd and a vpermilpd+vperm2f128. For v8sf, it also seems true but may require the version of vpermilps that takes its controls from a register/memory.
Created attachment 26908 [details] copy-paste patch for 0213 and 1302 This seems to handle 0213 and 1302 (I only vaguely looked at the generated code, can't do proper testing). It is really a copy-paste of the function that handles 1230. I didn't try to understand everything, so there may be things that made sense in the original function but don't anymore here. It should be possible to merge the 2 new functions, but merging them with the previous one looks harder.
Uh. I feel silly, but it looks like vshufpd could replace vpermilpd+vblendpd in many cases, including the original 1230 from PR52568...
Created attachment 26909 [details] patch Here is a try. Again, I just looked at the generated code on a couple examples, which isn't very reliable... expand_vec_perm_vperm2f128_vblend0 is already covered by expand_vec_perm_vperm2f128_vblend1, but it is confusing to have a 3-instruction function generate only 2. I didn't do generic permutations with 4 instructions. There is probably more that can be done with vshufp[sd].
Created attachment 26911 [details] patch With this one, all V4DF shuffles on one vector are done in at most 3 instructions (and are correct). Doing V8SF at the same time was getting confusing so I dropped it for the last 2 functions, which end up looking almost like: "if the pattern is 0112 do this, if it is 0130 do that, etc". I didn't check if all the functions are still used by at least one pattern... Note: my access to an avx machine is not sufficient to submit a patch, so feel free to take pieces of this and modify/test/submit them (I have a copyright assignment).
Created attachment 26912 [details] generic shuffle of a single v8sf An additional function (I should find better names...) to handle generic shuffles of a single v8sf in 4 instructions. Only tested on {6,2,3,3,5,2,3,7}. By the way, expand_vec_perm_vperm2f128_vblend2 does vpermilpd+vperm2f128 in this order, but it would be better to do it in the reverse order (adapting the mask), because it is common to need several __builtin_shuffle(x,*) and the vperm2f128 can then be shared. I also noticed while experimenting that -mavx2 generates vpermd instead of vpermps (the vpermq->vpermpd change didn't affect that).
Created attachment 26915 [details] gcc48-pr52607.patch AVX2 changes. This improves: 1) for {x, x, x, x} V4DFmode permutations we now don't generate vpermpd+vperm2f128, and just emit vpermpd (for all other V4DFmode permutations we were already emitting just vpermpd) 2) for most of V8SFmode permutations we now emit code like vmovdqa .LC0(%rip), %ymm1; vpermps %ymm0, %ymm1, %ymm0 instead of vperm2f128 $0, %ymm0, %ymm0, %ymm0; vpermilps .LC0(%rip), %ymm0, %ymm0 3) broadcast permutations improvements for V8SFmode (using vbroadcastss) as well as some integer modes
I'm not very keen on having too many different routines, the more generic they are, the better. So IMHO e.g. the two insn sequence, vperm2[if]128 + some one insn shuffle could look like: /* A subroutine of ix86_expand_vec_perm_builtin_1. Try to expand a vector permutation using two instructions, vperm2f128 resp. vperm2i128 followed by any single in-lane permutation. */ static bool expand_vec_perm_vperm2f128 (struct expand_vec_perm_d *d) { struct expand_vec_perm_d dfirst, dsecond; unsigned i, j, nelt = d->nelt, nelt2 = nelt / 2, perm; bool ok; if (!TARGET_AVX || GET_MODE_SIZE (d->vmode) != 32 || (d->vmode != V8SFmode && d->vmode != V4DFmode && !TARGET_AVX2)) return false; dsecond = *d; if (d->op0 == d->op1) dsecond.op1 = gen_reg_rtx (d->vmode); dsecond.testing_p = true; /* ((perm << 2)|perm) & 0x33 is the vperm2[fi]128 immediate. For perm < 16 the second permutation uses d->op0 as first operand, for perm >= 16 it uses d->op1 as first operand. The second operand is the result of vperm2[fi]128. */ for (perm = 0; perm < 32; perm++) { /* Ignore permutations which do not move anything cross-lane. */ if (perm < 16) { if ((perm & 0xc) == (1 << 2)) continue; if ((perm & 3) == 0) continue; if ((perm & 0xf) == ((3 << 2) | 2)) continue; } else { if ((perm & 0xc) == (3 << 2)) continue; if ((perm & 3) == 2) continue; if ((perm & 0xf) == (1 << 2)) continue; } for (i = 0; i < nelt; i++) { j = d->perm[i] / nelt2; if (j == ((perm >> (2 * (i >= nelt2))) & 3)) dsecond.perm[i] = nelt + (i & nelt2) + (d->perm[i] & (nelt2 - 1)); else if (j == (unsigned) (i >= nelt2) + 2 * (perm >= 16)) dsecond.perm[i] = d->perm[i] & (nelt - 1); else break; } if (i == nelt) { start_sequence (); ok = expand_vec_perm_1 (&dsecond); end_sequence (); } else ok = false; if (ok) { if (d->testing_p) return true; dsecond.testing_p = false; dfirst = *d; if (d->op0 == d->op1) dfirst.target = dsecond.op1; else dfirst.target = gen_reg_rtx (d->vmode); for (i = 0; i < nelt; i++) dfirst.perm[i] = (i & (nelt2 - 1)) + ((perm >> (2 * (i >= nelt2))) & 3) * nelt2; ok = expand_vec_perm_1 (&dfirst); gcc_assert (ok); dsecond.op1 = dfirst.target; if (perm >= 16) dsecond.op0 = dfirst.op1; ok = expand_vec_perm_1 (&dsecond); gcc_assert (ok); return true; } /* For d->op0 == d->op1 the only useful vperm2f128 permutation is 0x10. */ if (d->op0 == d->op1) return false; } return false; } This will handle e.g. vperm2f128 + {vshufpd,vblendpd,vunpcklpd,vunpckhpd} etc. But with the current expand_vselect implementation it might be too costly, at least memory-wise.
(In reply to comment #8) > I'm not very keen on having too many different routines, the more generic they > are, the better. Agreed, that was one of my concerns from the first message in this bug, but to experiment it was easier to have separate functions. > So IMHO e.g. the two insn sequence, vperm2[if]128 + some one > insn shuffle could look like: > > /* A subroutine of ix86_expand_vec_perm_builtin_1. Try to expand > a vector permutation using two instructions, vperm2f128 resp. > vperm2i128 followed by any single in-lane permutation. */ I haven't yet looked at it closely enough to understand what it does (those functions are surprisingly confusing when you don't write them yourself), but that looks interesting. My first idea in order to make things more generic was to tentatively turn __builtin_shuffle(x,m) into __builtin_shuffle(x,vperm2f128(x,x,33),mm) where mm avoids any cross-lane. The 2-vector no-cross-lane shuffle should take at most 3 instructions in v4df or v8sf (I haven't checked if it works now) and that's where most of the work would happen (instead of having many routines for single-vector shuffles that almost all start with vperm2f128). Then you would probably want to check how many instructions it used, since it could be more or less than one of the few instruction sequences that don't start with vperm2f128. From a quick look, it looks like you may be doing something even more generic... > This will handle e.g. vperm2f128 + {vshufpd,vblendpd,vunpcklpd,vunpckhpd} etc. Cool!
Created attachment 26925 [details] gcc48-vselect.patch Patch to decrease memory cost of expand_vselect{,_vconcat}, by keeping around a skeleton of the vselect/vconcat insn.
The vselect patch looks pretty good. + if (icode >= 0 && !testing_p) + x = copy_rtx (PATTERN (vselect_insn)); ... + if (!testing_p) + emit_insn (x); could be merged for clarity. The patch in comment #8 looks plausible. It could stand to have some comments added though.
Testing the 3 patches now (AVX2 improvements, expand_vselect and #c8 with further comments). For 3/4 insn sequences, I agree with the proposal to attempt to handle d->op0 == d->op1 cross-lane shuffles as two operand in-lane shuffles after vperm2f128 swapping the lanes. Two insn expanders could be groupped into expand_vec_perm_2 and three insn expanders into expand_vec_perm_3. We need to write some further 2 and 3 insn in-lane expanders though, as shown by: typedef double V4DF __attribute__((vector_size (4 * sizeof (double)))); typedef long V4DI __attribute__((vector_size (4 * sizeof (long)))); #define A(a, b, c, d) \ __attribute__((noinline, noclone)) V4DF \ f##a##b##c##d (V4DF x, V4DF y) \ {\ V4DI m = { a, b, c, d }; \ return __builtin_shuffle (x, y, m); \ } #define B(b, c, d) A(0, b, c, d) A(1, b, c, d) A(4, b, c, d) A(5, b, c, d) #define C(c, d) B(0, c, d) B(1, c, d) B(4, c, d) B(5, c, d) #define D(d) C(2, d) C(3, d) C(6, d) C(7, d) #define E D(2) D(3) D(6) D(7) E int main () { V4DF x = { 0.5, 1.5, 2.5, 3.5 }, y = { 4.5, 5.5, 6.5, 7.5 }, z; #undef A #define A(a, b, c, d) \ z = f##a##b##c##d (x, y); \ if (z[0] != a + .5 || z[1] != b + .5 || z[2] != c + .5 || z[3] != d + .5) \ __builtin_abort (); E return 0; }
Author: jakub Date: Tue Mar 20 16:25:54 2012 New Revision: 185577 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=185577 Log: PR target/52607 * config/i386/i386.md ("isa" attribute): Add avx2 and noavx2. ("enabled" attribute): Handle avx2 and noavx2 isas. * config/i386/sse.md (avx2_vec_dupv8sf_1, avx2_pbroadcast<mode>_1): New insns. (vec_dup<mode>): Add avx2 =x,x alternative. (vec_dup<mode> splitter): Don't split if TARGET_AVX2. (*avx_vperm_broadcast_<mode>): Don't split V4DFmode if TARGET_AVX2. For TARGET_AVX2, V8SFmode and elt == 0 split into vbroadcastss. * config/i386/i386.c (expand_vec_perm_pshufb): Emit also vpermps for V8SFmode. (expand_vec_perm_1): For broadcasts, use avx2_pbroadcast<mode>_1 if possible, handle also V8SFmode. Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/i386.c trunk/gcc/config/i386/i386.md trunk/gcc/config/i386/sse.md
Author: jakub Date: Tue Mar 20 16:51:41 2012 New Revision: 185579 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=185579 Log: PR target/52607 * config/i386/i386.c (expand_vec_perm_vperm2f128): New function. (ix86_expand_vec_perm_const_1): Call it. Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/i386.c
If I am not mistaken, the V8SF shuffle 22022246 is doable by a vperm2f128 that takes 01234567 to 01230123, followed by a vshufps (mask 138 maybe). Was your patch supposed to handle it?
(In reply to comment #15) > If I am not mistaken, the V8SF shuffle 22022246 is doable by a vperm2f128 that > takes 01234567 to 01230123, followed by a vshufps (mask 138 maybe). Was your > patch supposed to handle it? Uh, no it isn't supposed to handle it (there would be redundancy and it wouldn't know where to take elements from), sorry, forget that comment.
Created attachment 26938 [details] intra-lane shuffle in 3 insn This (mostly untested) patch is a reformulation of the generic v8sf single vector shuffle in 4 insn as a generic intra-lane 2 vector shuffle in at most 3 insn. Reformulating __builtin_shuffle(x,m) as __builtin_shuffle(x,vperm2f128(x,1),mm) would then guarantee a maximum size of 4. Note that the strategy of doing a 2-vector shuffle by shuffling (not restricted to one vpermilp*) each vector and blending the results gives a maximum of 9 insn, whereas the current code often generates twice that number. By the way, I have trouble understanding this comment: /* For d->op0 == d->op1 the only useful vperm2f128 permutation is 0x10. */ Is it really 0x10, or is there a stray 0 at the end and it is really just 1?
Created attachment 26979 [details] default case An updated version of this simple, generic-case shuffle (do note that I didn't run the generated code, just checked that it compiled and the instructions generated looked roughly ok). With the patch, we have (concerning v4df and v8sf): - no single-vector shuffle takes more than 4 insn, - no 2-vector shuffle takes more than 9 insn (or 3 (+ 2 movs for constants...) with AVX2). I think the current code already guarantees than anything that can be done in a single instruction is. Some possible goals (making everything optimal may be a bit hard) would be: - everything that can be done in 2 insn is, - no single-vector v4df takes more than 3 insn, - one or two extra optimizations, if they are generic enough. I do wonder occasionally about allowing wild indexes (jokers, places where you can put anything) in shuffles, whether it is exposed to users or just an internal tool.
(In reply to comment #17) > By the way, I have trouble understanding this comment: > /* For d->op0 == d->op1 the only useful vperm2f128 permutation > is 0x10. */ > Is it really 0x10, or is there a stray 0 at the end and it is really just 1? Yeah, you are right, perm 1 (which actually is emitted as $33 immediate anyway, as vperm2f128 insn is used with the same register operand twice.
Thanks for working on this. I don't like much the calls to ix86_expand_vec_perm_const_1, if you are looking for exactly two insn permutations, then really the two insn permutation functions should be groupped together into expand_vec_perm_2 and you should call that instead, or if it is 1 or 2, then expand_vec_perm_1 || expand_vec_perm_2. expand_vec_perm_vperm2f128_merge has probably swapped the meaning of dfirst and dsecond permutations when it first performs the dsecond permutation. Lastly for each routine it is desirable to think whether it might be useful for other vector modes (likely 32-byte only) for TARGET_AVX2.
(In reply to comment #20) > I don't like much the calls to ix86_expand_vec_perm_const_1, if you are looking > for exactly two insn permutations, Actually, it isn't just 2 insn. The call in expand_vec_perm_vperm2f128_merge can take 3, and the calls in expand_vec_perm_perm_blend(...,true) up to 4 (this is how I get a maximum of 9 insn, 1+2*4). But some more splits of ix86_expand_vec_perm_const_1 to avoid recursive calls should be doable, if you don't like the recursion. > then really the two insn permutation > functions should be groupped together into expand_vec_perm_2 and you should > call that instead, or if it is 1 or 2, then expand_vec_perm_1 || > expand_vec_perm_2. Yes, this grouping by size makes sense, whether it ends up being used or not. Although there are expanders in the "3" category that occasionally get lucky and generate only 2 :-) > expand_vec_perm_vperm2f128_merge has probably swapped the meaning of dfirst and > dsecond permutations when it first performs the dsecond permutation. If you are just talking of the naming of the variables, yes, I completely agree they should be swapped (or given more explicit names, like swap_lanes and dintra).
(In reply to comment #20) > Lastly for each routine it is desirable to think whether it might be useful for > other vector modes (likely 32-byte only) for TARGET_AVX2. I am not very familiar with the integer versions, so I tried: #include <x86intrin.h> __v32qi f(__v32qi x){ __v32qi m={0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}; return __builtin_shuffle(x,m); } $ gcc r.c -S -O1 -mavx && cat r.s r.c: In function 'f': r.c:2:9: error: invalid position or size operand to BIT_FIELD_REF BIT_FIELD_REF <x_2(D), 8, -128> r.c:2:9: note: in statement D.5992_24 = BIT_FIELD_REF <x_2(D), 8, -128>; r.c:2:9: error: invalid position or size operand to BIT_FIELD_REF BIT_FIELD_REF <x_2(D), 8, -120> r.c:2:9: note: in statement D.5993_25 = BIT_FIELD_REF <x_2(D), 8, -120>; [...] r.c:2:9: internal compiler error: verify_gimple failed (with -mavx2 it works fine)
(In reply to comment #18) + if (!d->testing_p) + dsecond.target = gen_reg_rtx (dsecond.vmode); + dfirst.op1 = dsecond.target; This bit has a problem with testing_p in that we'll have op0==op1 while testing and not when expanding. Which means that testing_p will be checking something else. I've been meaning to convert i386 from op0==op1 to one_operand_p, like I used in targets I converted later, like ia64. I'll see about making this change this afternoon, and then you can update your patch to match. + ok = expand_vec_perm_1 (&dsecond); + ok &= ix86_expand_vec_perm_const_1 (&dfirst); + + if (!ok) + return false; + + return true; Better with a short-circuit to avoid extra work: return (expand_vec_perm_1 (&dsecond) && ix86_expand_vec_perm_const_1 (&dfirst)); Otherwise the patch looks pretty good.
(In reply to comment #23) > (In reply to comment #18) > > + if (!d->testing_p) > + dsecond.target = gen_reg_rtx (dsecond.vmode); > + dfirst.op1 = dsecond.target; > > This bit has a problem with testing_p in that we'll have op0==op1 > while testing and not when expanding. Which means that testing_p > will be checking something else. Unless d->target==d->op0 (is that the case? I was kind of assuming it wasn't), it looks ok, but I agree that it should be improved. From other code, it looks like using gen_reg_rtx in testing is fine and avoiding it is just an optimization. On the other hand, if I remember correctly, the function could just return true early when testing (like the other function does) and assert during expansion, since it is not supposed to fail (except for the initial mode/target check), that would document the intent better. > I've been meaning to convert i386 from op0==op1 to one_operand_p, > like I used in targets I converted later, like ia64. I'll see about > making this change this afternoon, and then you can update your > patch to match. ok (no promise timewise). > + ok = expand_vec_perm_1 (&dsecond); > + ok &= ix86_expand_vec_perm_const_1 (&dfirst); > + > + if (!ok) > + return false; > + > + return true; > > Better with a short-circuit to avoid extra work: > > return (expand_vec_perm_1 (&dsecond) > && ix86_expand_vec_perm_const_1 (&dfirst)); Indeed! Thanks for the comments.
The test for AVX2 in expand_vec_perm_interleave2 might be too strict. For the V4DF shuffle 4,0,2,6, removing that check lets the compiler generate a nice vunpcklpd+vpermilpd (as opposed to 3 insn with my patch and 5+ without). The expansion of dfinal is already protected (so the function returns false for 4,2,0,6), I haven't checked whether something else (dremap?) needs protecting, but it doesn't look like it.
Created attachment 27052 [details] default case Updated with your comments, still can't properly test.
Notes to self (or other): - Intel's SDE makes it possible to test without appropriate hardware; - for V4DF shuffles, there seems to be a very simple generic solution that performs two vperm2f128 and then one vshufpd. permutation (a,b,c,d), input (x,y): t1 = vperm2f128(x,y,(a/2)+16*(c/2)); t2 = vperm2f128(x,y,(b/2)+16*(d/2)); return vshufpd(t1,t2,(a%2)+2*(b%2)+4*(c%2)+8*(d%2)); (when t1 or t2 is equal to x or y, it generates only 2 insn in cases that the current code doesn't detect, like {3,1,2,2})
A difficulty I hadn't foreseen is that the code that canonicalizes permutations (and in particular checks if one of the operands is unused) is in ix86_expand_vec_perm_const. So if I ask expand_vec_perm_1 to generate the 2-operand 0,1,2,3 permutation, it will happily generate vperm2f128 with immediate 16 without noticing that it is the identity on the first operand. I should probably move that code into its own function so I can call it before expand_vec_perm_1.
Created attachment 27136 [details] V4DF generic shuffle A patch (independent from the others) implementing what is explained in the last 2 comments. It is simple and works really well, all V4DF shuffles (even with 2 vectors) take only 3 insn (and often just 2). It only requires AVX, but also improves a lot on the current AVX2 code which casts to vectors of integers and uses up to 9 insn (although my "default case" patch also goes down to 3 insn on AVX2). The drawback is that it is limited to V4DF. vshufps is a different enough beast from vshufpd that it would require a different code, which wouldn't even apply that often. For V8SF, my "default case" patch seems more interesting. Integer vectors have different instructions again... By the way, I tested all V4DF permutations (there are only 2^12 of them) in the simulator. I also have a file (400K) with the code for each permutation, that looks like the following: 0,0,0,0 vpermilpd $0, %ymm0, %ymm0 vperm2f128 $0, %ymm0, %ymm0, %ymm0 [...] 1,7,6,3 vperm2f128 $48, %ymm1, %ymm0, %ymm2 vperm2f128 $19, %ymm1, %ymm0, %ymm0 vshufpd $11, %ymm0, %ymm2, %ymm0 1,7,6,4 vperm2f128 $48, %ymm1, %ymm0, %ymm0 vperm2f128 $33, %ymm1, %ymm1, %ymm1 vshufpd $3, %ymm1, %ymm0, %ymm0 [...] If anyone wants to take a look, tell me and I'll attach it.
Author: glisse Date: Mon May 14 20:19:30 2012 New Revision: 187479 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=187479 Log: 2012-05-14 Marc Glisse <marc.glisse@inria.fr> PR target/52607 * config/i386/i386.c (ix86_expand_vec_perm_const): Move code to ... (canonicalize_perm): ... new function. (expand_vec_perm_2vperm2f128_vshuf): New function. (ix86_expand_vec_perm_const_1): Call it. Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/i386.c