52607 – v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}

Bug 52607 - v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}

Summary: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	4.7.0

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:	52750
	Show dependency tree / graph

Reported:	2012-03-17 00:38 UTC by Marc Glisse
Modified:	2017-09-17 19:49 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:	x86_64--
Build:
Known to work:
Known to fail:
Last reconfirmed:	2012-03-29 00:00:00

Attachments
copy-paste patch for 0213 and 1302 (900 bytes, patch) 2012-03-17 19:20 UTC, Marc Glisse	Details \| Diff
patch (1.26 KB, patch) 2012-03-17 22:03 UTC, Marc Glisse	Details \| Diff
patch (1.79 KB, patch) 2012-03-18 12:53 UTC, Marc Glisse	Details \| Diff
generic shuffle of a single v8sf (639 bytes, text/x-csrc) 2012-03-18 18:58 UTC, Marc Glisse	Details
gcc48-pr52607.patch (2.46 KB, patch) 2012-03-19 11:13 UTC, Jakub Jelinek	Details \| Diff
gcc48-vselect.patch (2.02 KB, patch) 2012-03-19 19:18 UTC, Jakub Jelinek	Details \| Diff
intra-lane shuffle in 3 insn (878 bytes, patch) 2012-03-20 21:50 UTC, Marc Glisse	Details \| Diff
default case (1.13 KB, patch) 2012-03-25 13:52 UTC, Marc Glisse	Details \| Diff
default case (1.77 KB, patch) 2012-03-31 14:02 UTC, Marc Glisse	Details \| Diff
V4DF generic shuffle (1.64 KB, patch) 2012-04-11 20:35 UTC, Marc Glisse	Details \| Diff
Show Obsolete (5) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Marc Glisse 2012-03-17 00:38:22 UTC

Hello,

this is really just a follow-up to PR52568. The permutations {0,2,1,3} and {1,3,0,2} can be realized with a very similar technique.

Starting from 0123:
vpermilpd+vperm2f128->3210
vblendpd(0123,3210)->0213

or:
vpermilpd->1032
vperm2f128->2301
vblendpd(1032,2301)->1302

I am not sure if there is a nice way to generalize this or if the function expand_vec_perm_vperm2f128_vblend should be cloned a few times and slightly modified.

(these permutations are less important to me than 1230 was)

Comment 1 Marc Glisse 2012-03-17 01:05:57 UTC

Note that {1,2,0,3} seems harder, I need one extra vpermilpd. Actually, it looks like every v4df shuffle can be realized as a vblendpd of a vpermilpd and a vpermilpd+vperm2f128. For v8sf, it also seems true but may require the version of vpermilps that takes its controls from a register/memory.

Comment 2 Marc Glisse 2012-03-17 19:20:36 UTC

Created attachment 26908 [details]
copy-paste patch for 0213 and 1302

This seems to handle 0213 and 1302 (I only vaguely looked at the generated code, can't do proper testing).

It is really a copy-paste of the function that handles 1230. I didn't try to understand everything, so there may be things that made sense in the original function but don't anymore here. It should be possible to merge the 2 new functions, but merging them with the previous one looks harder.

Comment 3 Marc Glisse 2012-03-17 19:55:18 UTC

Uh. I feel silly, but it looks like vshufpd could replace vpermilpd+vblendpd in many cases, including the original 1230 from PR52568...

Comment 4 Marc Glisse 2012-03-17 22:03:08 UTC

Created attachment 26909 [details]
patch

Here is a try. Again, I just looked at the generated code on a couple examples, which isn't very reliable...

expand_vec_perm_vperm2f128_vblend0 is already covered by expand_vec_perm_vperm2f128_vblend1, but it is confusing to have a 3-instruction function generate only 2.

I didn't do generic permutations with 4 instructions.

There is probably more that can be done with vshufp[sd].

Comment 5 Marc Glisse 2012-03-18 12:53:13 UTC

Created attachment 26911 [details]
patch

With this one, all V4DF shuffles on one vector are done in at most 3 instructions (and are correct). Doing V8SF at the same time was getting confusing so I dropped it for the last 2 functions, which end up looking almost like: "if the pattern is 0112 do this, if it is 0130 do that, etc".

I didn't check if all the functions are still used by at least one pattern...

Note: my access to an avx machine is not sufficient to submit a patch, so feel free to take pieces of this and modify/test/submit them (I have a copyright assignment).

Comment 6 Marc Glisse 2012-03-18 18:58:44 UTC

Created attachment 26912 [details]
generic shuffle of a single v8sf

An additional function (I should find better names...) to handle generic shuffles of a single v8sf in 4 instructions. Only tested on {6,2,3,3,5,2,3,7}.

By the way, expand_vec_perm_vperm2f128_vblend2 does vpermilpd+vperm2f128 in this order, but it would be better to do it in the reverse order (adapting the mask), because it is common to need several __builtin_shuffle(x,*) and the vperm2f128 can then be shared.

I also noticed while experimenting that -mavx2 generates vpermd instead of vpermps (the vpermq->vpermpd change didn't affect that).

Comment 7 Jakub Jelinek 2012-03-19 11:13:52 UTC

Created attachment 26915 [details]
gcc48-pr52607.patch

AVX2 changes.
This improves:
1) for {x, x, x, x} V4DFmode permutations we now don't generate vpermpd+vperm2f128, and just emit vpermpd (for all other V4DFmode permutations we were already emitting just vpermpd)
2) for most of V8SFmode permutations we now emit code like
vmovdqa .LC0(%rip), %ymm1; vpermps %ymm0, %ymm1, %ymm0 instead of
vperm2f128 $0, %ymm0, %ymm0, %ymm0; vpermilps .LC0(%rip), %ymm0, %ymm0
3) broadcast permutations improvements for V8SFmode (using vbroadcastss) as well
as some integer modes

Comment 8 Jakub Jelinek 2012-03-19 17:13:53 UTC

I'm not very keen on having too many different routines, the more generic they are, the better.  So IMHO e.g. the two insn sequence, vperm2[if]128 + some one insn shuffle could look like:

/* A subroutine of ix86_expand_vec_perm_builtin_1.  Try to expand
   a vector permutation using two instructions, vperm2f128 resp.
   vperm2i128 followed by any single in-lane permutation.  */

static bool
expand_vec_perm_vperm2f128 (struct expand_vec_perm_d *d)
{
  struct expand_vec_perm_d dfirst, dsecond;
  unsigned i, j, nelt = d->nelt, nelt2 = nelt / 2, perm;
  bool ok;

  if (!TARGET_AVX
      || GET_MODE_SIZE (d->vmode) != 32
      || (d->vmode != V8SFmode && d->vmode != V4DFmode && !TARGET_AVX2))
    return false;

  dsecond = *d;
  if (d->op0 == d->op1)
    dsecond.op1 = gen_reg_rtx (d->vmode);
  dsecond.testing_p = true;

  /* ((perm << 2)|perm) & 0x33 is the vperm2[fi]128
     immediate.  For perm < 16 the second permutation uses
     d->op0 as first operand, for perm >= 16 it uses d->op1
     as first operand.  The second operand is the result of
     vperm2[fi]128.  */
  for (perm = 0; perm < 32; perm++)
    {
      /* Ignore permutations which do not move anything cross-lane.  */
      if (perm < 16)
        {
          if ((perm & 0xc) == (1 << 2)) continue;
          if ((perm & 3) == 0) continue;
          if ((perm & 0xf) == ((3 << 2) | 2)) continue;
        }
      else
        {
          if ((perm & 0xc) == (3 << 2)) continue;
          if ((perm & 3) == 2) continue;
          if ((perm & 0xf) == (1 << 2)) continue;
        }

      for (i = 0; i < nelt; i++)
        {
          j = d->perm[i] / nelt2;
          if (j == ((perm >> (2 * (i >= nelt2))) & 3))
            dsecond.perm[i] = nelt + (i & nelt2) + (d->perm[i] & (nelt2 - 1));
          else if (j == (unsigned) (i >= nelt2) + 2 * (perm >= 16))
            dsecond.perm[i] = d->perm[i] & (nelt - 1);
          else
            break;
        }

      if (i == nelt)
        {
          start_sequence ();
          ok = expand_vec_perm_1 (&dsecond);
          end_sequence ();
        }
      else
        ok = false;

      if (ok)
        {
          if (d->testing_p)
            return true;

          dsecond.testing_p = false;
          dfirst = *d;
          if (d->op0 == d->op1)
            dfirst.target = dsecond.op1;
          else
            dfirst.target = gen_reg_rtx (d->vmode);
          for (i = 0; i < nelt; i++)
            dfirst.perm[i] = (i & (nelt2 - 1))
                             + ((perm >> (2 * (i >= nelt2))) & 3) * nelt2;

          ok = expand_vec_perm_1 (&dfirst);
          gcc_assert (ok);

          dsecond.op1 = dfirst.target;
          if (perm >= 16)
            dsecond.op0 = dfirst.op1;

          ok = expand_vec_perm_1 (&dsecond);
          gcc_assert (ok);

          return true;
        }

      /* For d->op0 == d->op1 the only useful vperm2f128 permutation
         is 0x10.  */
      if (d->op0 == d->op1)
        return false;
    }

  return false;
}

This will handle e.g. vperm2f128 + {vshufpd,vblendpd,vunpcklpd,vunpckhpd} etc.
But with the current expand_vselect implementation it might be too costly, at least memory-wise.

Comment 9 Marc Glisse 2012-03-19 18:29:50 UTC

(In reply to comment #8)
> I'm not very keen on having too many different routines, the more generic they
> are, the better.

Agreed, that was one of my concerns from the first message in this bug, but to experiment it was easier to have separate functions.

> So IMHO e.g. the two insn sequence, vperm2[if]128 + some one
> insn shuffle could look like:
> 
> /* A subroutine of ix86_expand_vec_perm_builtin_1.  Try to expand
>    a vector permutation using two instructions, vperm2f128 resp.
>    vperm2i128 followed by any single in-lane permutation.  */

I haven't yet looked at it closely enough to understand what it does (those functions are surprisingly confusing when you don't write them yourself), but that looks interesting.

My first idea in order to make things more generic was to tentatively turn __builtin_shuffle(x,m) into __builtin_shuffle(x,vperm2f128(x,x,33),mm) where mm avoids any cross-lane. The 2-vector no-cross-lane shuffle should take at most 3 instructions in v4df or v8sf (I haven't checked if it works now) and that's where most of the work would happen (instead of having many routines for single-vector shuffles that almost all start with vperm2f128). Then you would probably want to check how many instructions it used, since it could be more or less than one of the few instruction sequences that don't start with vperm2f128.

From a quick look, it looks like you may be doing something even more generic...

> This will handle e.g. vperm2f128 + {vshufpd,vblendpd,vunpcklpd,vunpckhpd} etc.

Cool!

Comment 10 Jakub Jelinek 2012-03-19 19:18:33 UTC

Created attachment 26925 [details]
gcc48-vselect.patch

Patch to decrease memory cost of expand_vselect{,_vconcat}, by keeping around
a skeleton of the vselect/vconcat insn.

Comment 11 Richard Henderson 2012-03-19 19:36:03 UTC

The vselect patch looks pretty good.

+  if (icode >= 0 && !testing_p)
+    x = copy_rtx (PATTERN (vselect_insn));
...
+  if (!testing_p)
+    emit_insn (x);

could be merged for clarity.

The patch in comment #8 looks plausible.  It could
stand to have some comments added though.

Comment 12 Jakub Jelinek 2012-03-20 09:17:41 UTC

Testing the 3 patches now (AVX2 improvements, expand_vselect and #c8 with further comments).  For 3/4 insn sequences, I agree with the proposal to attempt to handle d->op0 == d->op1 cross-lane shuffles as two operand in-lane shuffles
after vperm2f128 swapping the lanes.  Two insn expanders could be groupped into
expand_vec_perm_2 and three insn expanders into expand_vec_perm_3.
We need to write some further 2 and 3 insn in-lane expanders though, as shown by:
typedef double V4DF __attribute__((vector_size (4 * sizeof (double))));
typedef long V4DI __attribute__((vector_size (4 * sizeof (long))));

#define A(a, b, c, d) \
__attribute__((noinline, noclone)) V4DF \
f##a##b##c##d (V4DF x, V4DF y) \
{\
  V4DI m = { a, b, c, d }; \
  return __builtin_shuffle (x, y, m); \
}
#define B(b, c, d) A(0, b, c, d) A(1, b, c, d) A(4, b, c, d) A(5, b, c, d)
#define C(c, d) B(0, c, d) B(1, c, d) B(4, c, d) B(5, c, d)
#define D(d) C(2, d) C(3, d) C(6, d) C(7, d)
#define E D(2) D(3) D(6) D(7)
E

int
main ()
{
  V4DF x = { 0.5, 1.5, 2.5, 3.5 }, y = { 4.5, 5.5, 6.5, 7.5 }, z;
#undef A
#define A(a, b, c, d) \
  z = f##a##b##c##d (x, y); \
  if (z[0] != a + .5 || z[1] != b + .5 || z[2] != c + .5 || z[3] != d + .5) \
    __builtin_abort ();
  E
  return 0;
}

Comment 13 Jakub Jelinek 2012-03-20 16:26:08 UTC

Author: jakub
Date: Tue Mar 20 16:25:54 2012
New Revision: 185577

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=185577
Log:
	PR target/52607
	* config/i386/i386.md ("isa" attribute): Add avx2 and noavx2.
	("enabled" attribute): Handle avx2 and noavx2 isas.
	* config/i386/sse.md (avx2_vec_dupv8sf_1, avx2_pbroadcast<mode>_1):
	New insns.
	(vec_dup<mode>): Add avx2 =x,x alternative.
	(vec_dup<mode> splitter): Don't split if TARGET_AVX2.
	(*avx_vperm_broadcast_<mode>): Don't split V4DFmode if TARGET_AVX2.
	For TARGET_AVX2, V8SFmode and elt == 0 split into vbroadcastss.
	* config/i386/i386.c (expand_vec_perm_pshufb): Emit also vpermps
	for V8SFmode.
	(expand_vec_perm_1): For broadcasts, use avx2_pbroadcast<mode>_1
	if possible, handle also V8SFmode.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.md
    trunk/gcc/config/i386/sse.md

Comment 14 Jakub Jelinek 2012-03-20 16:51:50 UTC

Author: jakub
Date: Tue Mar 20 16:51:41 2012
New Revision: 185579

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=185579
Log:
	PR target/52607
	* config/i386/i386.c (expand_vec_perm_vperm2f128): New function.
	(ix86_expand_vec_perm_const_1): Call it.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c

Comment 15 Marc Glisse 2012-03-20 19:00:32 UTC

If I am not mistaken, the V8SF shuffle 22022246 is doable by a vperm2f128 that takes 01234567 to 01230123, followed by a vshufps (mask 138 maybe). Was your patch supposed to handle it?

Comment 16 Marc Glisse 2012-03-20 19:05:22 UTC

(In reply to comment #15)
> If I am not mistaken, the V8SF shuffle 22022246 is doable by a vperm2f128 that
> takes 01234567 to 01230123, followed by a vshufps (mask 138 maybe). Was your
> patch supposed to handle it?

Uh, no it isn't supposed to handle it (there would be redundancy and it wouldn't know where to take elements from), sorry, forget that comment.

Comment 17 Marc Glisse 2012-03-20 21:50:40 UTC

Created attachment 26938 [details]
intra-lane shuffle in 3 insn

This (mostly untested) patch is a reformulation of the generic v8sf single vector shuffle in 4 insn as a generic intra-lane 2 vector shuffle in at most 3 insn. Reformulating __builtin_shuffle(x,m) as __builtin_shuffle(x,vperm2f128(x,1),mm) would then guarantee a maximum size of 4.

Note that the strategy of doing a 2-vector shuffle by shuffling (not restricted to one vpermilp*) each vector and blending the results gives a maximum of 9 insn, whereas the current code often generates twice that number.


By the way, I have trouble understanding this comment:
      /* For d->op0 == d->op1 the only useful vperm2f128 permutation
         is 0x10.  */
Is it really 0x10, or is there a stray 0 at the end and it is really just 1?

Comment 18 Marc Glisse 2012-03-25 13:52:09 UTC

Created attachment 26979 [details]
default case

An updated version of this simple, generic-case shuffle (do note that I didn't run the generated code, just checked that it compiled and the instructions generated looked roughly ok). With the patch, we have (concerning v4df and v8sf):

- no single-vector shuffle takes more than 4 insn,
- no 2-vector shuffle takes more than 9 insn (or 3 (+ 2 movs for constants...) with AVX2).

I think the current code already guarantees than anything that can be done in a single instruction is.

Some possible goals (making everything optimal may be a bit hard) would be:

- everything that can be done in 2 insn is,
- no single-vector v4df takes more than 3 insn,
- one or two extra optimizations, if they are generic enough.

I do wonder occasionally about allowing wild indexes (jokers, places where you can put anything) in shuffles, whether it is exposed to users or just an internal tool.

Comment 19 Jakub Jelinek 2012-03-27 16:53:24 UTC

(In reply to comment #17)
> By the way, I have trouble understanding this comment:
>       /* For d->op0 == d->op1 the only useful vperm2f128 permutation
>          is 0x10.  */
> Is it really 0x10, or is there a stray 0 at the end and it is really just 1?

Yeah, you are right, perm 1 (which actually is emitted as $33 immediate anyway,
as vperm2f128 insn is used with the same register operand twice.

Comment 20 Jakub Jelinek 2012-03-27 17:10:00 UTC

Thanks for working on this.

I don't like much the calls to ix86_expand_vec_perm_const_1, if you are looking for exactly two insn permutations, then really the two insn permutation functions should be groupped together into expand_vec_perm_2 and you should call that instead, or if it is 1 or 2, then expand_vec_perm_1 || expand_vec_perm_2.

expand_vec_perm_vperm2f128_merge has probably swapped the meaning of dfirst and dsecond permutations when it first performs the dsecond permutation.

Lastly for each routine it is desirable to think whether it might be useful for other vector modes (likely 32-byte only) for TARGET_AVX2.

Comment 21 Marc Glisse 2012-03-27 18:21:39 UTC

(In reply to comment #20)
> I don't like much the calls to ix86_expand_vec_perm_const_1, if you are looking
> for exactly two insn permutations,

Actually, it isn't just 2 insn. The call in expand_vec_perm_vperm2f128_merge can take 3, and the calls in expand_vec_perm_perm_blend(...,true) up to 4 (this is how I get a maximum of 9 insn, 1+2*4). But some more splits of ix86_expand_vec_perm_const_1 to avoid recursive calls should be doable, if you don't like the recursion.

> then really the two insn permutation
> functions should be groupped together into expand_vec_perm_2 and you should
> call that instead, or if it is 1 or 2, then expand_vec_perm_1 ||
> expand_vec_perm_2.

Yes, this grouping by size makes sense, whether it ends up being used or not. Although there are expanders in the "3" category that occasionally get lucky and generate only 2 :-)

> expand_vec_perm_vperm2f128_merge has probably swapped the meaning of dfirst and
> dsecond permutations when it first performs the dsecond permutation.

If you are just talking of the naming of the variables, yes, I completely agree they should be swapped (or given more explicit names, like swap_lanes and dintra).

Comment 22 Marc Glisse 2012-03-27 20:57:16 UTC

(In reply to comment #20)
> Lastly for each routine it is desirable to think whether it might be useful for
> other vector modes (likely 32-byte only) for TARGET_AVX2.

I am not very familiar with the integer versions, so I tried:
#include <x86intrin.h>
__v32qi f(__v32qi x){
  __v32qi m={0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31};
  return __builtin_shuffle(x,m);
}

$ gcc r.c -S -O1 -mavx && cat r.s
r.c: In function 'f':
r.c:2:9: error: invalid position or size operand to BIT_FIELD_REF
BIT_FIELD_REF <x_2(D), 8, -128>
r.c:2:9: note: in statement
D.5992_24 = BIT_FIELD_REF <x_2(D), 8, -128>;

r.c:2:9: error: invalid position or size operand to BIT_FIELD_REF
BIT_FIELD_REF <x_2(D), 8, -120>
r.c:2:9: note: in statement
D.5993_25 = BIT_FIELD_REF <x_2(D), 8, -120>;

[...]

r.c:2:9: internal compiler error: verify_gimple failed

(with -mavx2 it works fine)

Comment 23 Richard Henderson 2012-03-29 13:36:23 UTC

(In reply to comment #18)

+  if (!d->testing_p)
+    dsecond.target = gen_reg_rtx (dsecond.vmode);
+  dfirst.op1 = dsecond.target;

This bit has a problem with testing_p in that we'll have op0==op1
while testing and not when expanding.  Which means that testing_p
will be checking something else.

I've been meaning to convert i386 from op0==op1 to one_operand_p,
like I used in targets I converted later, like ia64.  I'll see about
making this change this afternoon, and then you can update your
patch to match.

+  ok = expand_vec_perm_1 (&dsecond);
+  ok &= ix86_expand_vec_perm_const_1 (&dfirst);
+
+  if (!ok)
+    return false;
+
+  return true;

Better with a short-circuit to avoid extra work:

  return (expand_vec_perm_1 (&dsecond)
          && ix86_expand_vec_perm_const_1 (&dfirst));

Otherwise the patch looks pretty good.

Comment 24 Marc Glisse 2012-03-29 14:19:11 UTC

(In reply to comment #23)
> (In reply to comment #18)
> 
> +  if (!d->testing_p)
> +    dsecond.target = gen_reg_rtx (dsecond.vmode);
> +  dfirst.op1 = dsecond.target;
> 
> This bit has a problem with testing_p in that we'll have op0==op1
> while testing and not when expanding.  Which means that testing_p
> will be checking something else.

Unless d->target==d->op0 (is that the case? I was kind of assuming it wasn't), it looks ok, but I agree that it should be improved. From other code, it looks like using gen_reg_rtx in testing is fine and avoiding it is just an optimization.

On the other hand, if I remember correctly, the function could just return true early when testing (like the other function does) and assert during expansion, since it is not supposed to fail (except for the initial mode/target check), that would document the intent better.

> I've been meaning to convert i386 from op0==op1 to one_operand_p,
> like I used in targets I converted later, like ia64.  I'll see about
> making this change this afternoon, and then you can update your
> patch to match.

ok (no promise timewise).

> +  ok = expand_vec_perm_1 (&dsecond);
> +  ok &= ix86_expand_vec_perm_const_1 (&dfirst);
> +
> +  if (!ok)
> +    return false;
> +
> +  return true;
> 
> Better with a short-circuit to avoid extra work:
> 
>   return (expand_vec_perm_1 (&dsecond)
>           && ix86_expand_vec_perm_const_1 (&dfirst));

Indeed!

Thanks for the comments.

Comment 25 Marc Glisse 2012-03-31 09:37:51 UTC

The test for AVX2 in expand_vec_perm_interleave2 might be too strict. For the V4DF shuffle 4,0,2,6, removing that check lets the compiler generate a nice vunpcklpd+vpermilpd (as opposed to 3 insn with my patch and 5+ without). The expansion of dfinal is already protected (so the function returns false for 4,2,0,6), I haven't checked whether something else (dremap?) needs protecting, but it doesn't look like it.

Comment 26 Marc Glisse 2012-03-31 14:02:54 UTC

Created attachment 27052 [details]
default case

Updated with your comments, still can't properly test.

Comment 27 Marc Glisse 2012-04-09 16:50:47 UTC

Notes to self (or other):
- Intel's SDE makes it possible to test without appropriate hardware;
- for V4DF shuffles, there seems to be a very simple generic solution that performs two vperm2f128 and then one vshufpd.

permutation (a,b,c,d), input (x,y):
t1 = vperm2f128(x,y,(a/2)+16*(c/2));
t2 = vperm2f128(x,y,(b/2)+16*(d/2));
return vshufpd(t1,t2,(a%2)+2*(b%2)+4*(c%2)+8*(d%2));

(when t1 or t2 is equal to x or y, it generates only 2 insn in cases that the current code doesn't detect, like {3,1,2,2})

Comment 28 Marc Glisse 2012-04-11 16:48:47 UTC

A difficulty I hadn't foreseen is that the code that canonicalizes permutations (and in particular checks if one of the operands is unused) is in ix86_expand_vec_perm_const. So if I ask expand_vec_perm_1 to generate the 2-operand 0,1,2,3 permutation, it will happily generate vperm2f128 with immediate 16 without noticing that it is the identity on the first operand. I should probably move that code into its own function so I can call it before expand_vec_perm_1.

Comment 29 Marc Glisse 2012-04-11 20:35:00 UTC

Created attachment 27136 [details]
V4DF generic shuffle

A patch (independent from the others) implementing what is explained in the last 2 comments. It is simple and works really well, all V4DF shuffles (even with 2 vectors) take only 3 insn (and often just 2). It only requires AVX, but also improves a lot on the current AVX2 code which casts to vectors of integers and uses up to 9 insn (although my "default case" patch also goes down to 3 insn on AVX2).

The drawback is that it is limited to V4DF. vshufps is a different enough beast from vshufpd that it would require a different code, which wouldn't even apply that often. For V8SF, my "default case" patch seems more interesting. Integer vectors have different instructions again...

By the way, I tested all V4DF permutations (there are only 2^12 of them) in the simulator. I also have a file (400K) with the code for each permutation, that looks like the following:
0,0,0,0
	vpermilpd	$0, %ymm0, %ymm0
	vperm2f128	$0, %ymm0, %ymm0, %ymm0
[...]
1,7,6,3
        vperm2f128      $48, %ymm1, %ymm0, %ymm2
        vperm2f128      $19, %ymm1, %ymm0, %ymm0
        vshufpd $11, %ymm0, %ymm2, %ymm0
1,7,6,4
        vperm2f128      $48, %ymm1, %ymm0, %ymm0
        vperm2f128      $33, %ymm1, %ymm1, %ymm1
        vshufpd $3, %ymm1, %ymm0, %ymm0
[...]
If anyone wants to take a look, tell me and I'll attach it.

Comment 30 Marc Glisse 2012-05-14 20:19:39 UTC

Author: glisse
Date: Mon May 14 20:19:30 2012
New Revision: 187479

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=187479
Log:
2012-05-14  Marc Glisse  <marc.glisse@inria.fr>

	PR target/52607
	* config/i386/i386.c (ix86_expand_vec_perm_const): Move code to ...
	(canonicalize_perm): ... new function.
	(expand_vec_perm_2vperm2f128_vshuf): New function.
	(ix86_expand_vec_perm_const_1): Call it.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c