Bug 52252 - An opportunity for x86 gcc vectorizer (gain up to 3 times)
Summary: An opportunity for x86 gcc vectorizer (gain up to 3 times)
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.7.0
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2012-02-14 22:41 UTC by Stupachenko Evgeny
Modified: 2014-06-18 07:46 UTC (History)
1 user (show)

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-02-15 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Stupachenko Evgeny 2012-02-14 22:41:45 UTC
This is an example of byte conversion from RGB (Red Green Blue) to CMYK (Cyan Magenta Yellow blacK):

#define byte unsigned char
#define MIN(a, b) ((a) > (b)?(b):(a))

void convert_image(byte *in, byte *out, int size) {
    int i;
    for(i = 0; i < size; i++) {
        byte r = in[0];
        byte g = in[1];
        byte b = in[2];
        byte c, m, y, k, tmp;
        c = 255 - r;
        m = 255 - g;
        y = 255 - b;
        tmp = MIN(m, y);
        k = MIN(c, tmp);
        out[0] = c - k;
        out[1] = m - k;
        out[2] = y - k;
        out[3] = k;
        in += 3;
        out += 4;
    }
}

Here trunk gcc for Arm unrolls the loop by 2 and vectorizes it using neon; gcc for x86 does not vectorize it.

There are 2  tricky moments in this loop:
1)	It converts 3 bytes into 4
2)	We need to shuffle bytes after load:
Let 0123456789ABCDF be 16 bytes in “in” array (first rgb is 012, next 345…)
To count vector minimum we need to place 0,1,2 bytes into 3 different vectors.

Gcc for Arm does this by 2 special loads:
  vld3.8  {d16, d18, d20}, [r2]!
  vld3.8  {d17, d19, d21}, [r2]
putting 0 and 3 bytes into q8(d16, d17)
        1 and 4 bytes into q9(d18, d19)
        2 and 5 bytes into q10(d20, d21)

And after all vector transformations it stores by 2 special stores:

  vst4.8  {d8, d10, d12, d14}, [r3]!
  vst4.8  {d9, d11, d13, d15}, [r3]

However x86 gcc can do the same loads:
  movq (%edi),%mm5
  movq %mm5,%mm7
  movq %mm5,%mm6
  pshufb %mm3,%mm5 /*0x00ffffff03ffffff*/
  pshufb %mm2,%mm6 /*0x01ffffff04ffffff*/
  pshufb %mm1,%mm7 /*0x02ffffff05ffffff*/
  /* %mm5 – r, %mm6 – g, %mm7 – b */

And same stores:
  pslld  $0x8,%mm6
  pslld  $0x10,%mm7
  pslld  $0x18,%mm4
  pxor   %mm5,%mm6 
  pxor   %mm7,%mm4
  pxor   %mm6,%mm4
  pshufb %mm0,%mm4 /*0x000102030405060708*/ /*here redundant*/
  movq %mm4,(%esi)
  /* %mm5 – c, %mm6 – m, %mm7 – y, %mm4 - k */

pshufb here does not do anything, so could be removed, only in case we store less than 4 bytes we will need to shuffle them

Moreover x86 gcc can do unroll not only by 2, but by 4:
With the following loads:

  movdqu (%edi),%xmm5
  movdqa %xmm5,%xmm7
  movdqa %xmm5,%xmm6
  pshufb %xmm3,%xmm5 /*0x00ffffff03ffffff06ffffff09ffffff*/
  pshufb %xmm2,%xmm6 /*0x01ffffff04ffffff07ffffff0affffff*/
  pshufb %xmm1,%xmm7 /*0x02ffffff05ffffff08ffffff0bffffff*/
  /* %xmm5 – r, %xmm6 – g, %xmm7 – b */

And stores:
  pslld  $0x8,%xmm6
  pslld  $0x10,%xmm7
  pslld  $0x18,%xmm4
  pxor   %xmm5,%xmm6
  pxor   %xmm7,%xmm4
  pxor   %xmm6,%xmm4
  pshufb %xmm0,%xmm4 /*0x000102030405060708090a0b0c0d0e0f*/ /*here redundant*/
  movdqa %xmm4,(%esi)
  /* %xmm5 – c, %xmm6 – m, %xmm7 – y, %xmm4 - k */
Comment 1 Richard Biener 2012-02-15 11:53:58 UTC
We fail to SLP vectorize this because of

6: Build SLP failed: different operation in stmt k_15 = MIN_EXPR <tmp_14, y_13>;

thus,

        out[0] = c - k;
        out[1] = m - k;
        out[2] = y - k;
        out[3] = k;

isn't detected as equivalent to

        out[0] = c - k;
        out[1] = m - k;
        out[2] = y - k;
        out[3] = <magic> - k;

or

        out[3] = k - 0;

whatever would be more suitable (the latter would fail to be detected as
induction I guess, the former would fail with a similar issue for the
definition of <magic>).

With

        out[3] = y - k;

we fail with

6: Load permutation 0 1 2 2 1 1 1 1 0 0 0 0 2 2 2 2
6: Build SLP failed: unsupported load permutation *out_37 = D.1721_16;

we can vectorize

void convert_image(byte *in, byte *out, int size) {
    int i;
    for(i = 0; i < size; i++) {
        byte r = in[0];
        byte g = in[1];
        byte b = in[2];
        byte a = in[3];
        byte c, m, y, k, z, tmp;
        c = 255 - r;
        m = 255 - g;
        y = 255 - b;
        z = 255 - a;
        tmp = MIN(m, y);
        k = MIN(c, tmp);
        out[0] = c - k;
        out[1] = m - k;
        out[2] = y - k;
        out[3] = z - k;
        in += 4;
        out += 4;
    }
}

though.
Comment 2 Stupachenko Evgeny 2012-02-29 12:32:20 UTC
The difference of 2 dumps from

Arm: gcc -O3 -mfpu=neon test.c -S -ftree-vectorizer-verbose=12
X86: gcc -O3 -m32 -msse3 test.c -S -ftree-vectorizer-verbose=12

Starts at:

For Arm (can use vec_load_lanes):

6: === vect_make_slp_decision === 
6: === vect_detect_hybrid_slp ===
6: === vect_analyze_loop_operations ===
6: examining phi: in_35 = PHI <in_22(7), in_5(D)(4)>

……

6: can use vec_load_lanes<CI><V16QI> 
6: vect_model_load_cost: unaligned supported by hardware. 
6: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .

For x86 (no array mode for V16QI[3]):

6: === vect_make_slp_decision === 
6: === vect_detect_hybrid_slp === 
6: === vect_analyze_loop_operations === 
6: examining phi: in_35 = PHI <in_22(7), in_5(D)(4)> 

.……
 
6: no array mode for V16QI[3] 
6: the size of the group of strided accesses is not a power of 2 
6: not vectorized: relevant stmt not supported: r_8 = *in_35; 
 
As I mentioned before, there is an ability for x86 to handle this (Arm can shuffle than loads, x86 can use pshufb).
Comment 3 Richard Biener 2012-07-13 08:48:18 UTC
Link to vectorizer missed-optimization meta-bug.
Comment 4 Stupachenko Evgeny 2014-02-11 14:27:33 UTC
The patch giving an expected 3 times gain submitted for a discussion at:
http://gcc.gnu.org/ml/gcc-patches/2014-02/msg00670.html
Comment 5 Kirill Yukhin 2014-05-07 12:10:53 UTC
Author: kyukhin
Date: Wed May  7 12:10:22 2014
New Revision: 210155

URL: http://gcc.gnu.org/viewcvs?rev=210155&root=gcc&view=rev
Log:
gcc/
	* tree-vect-data-refs.c (vect_grouped_load_supported): New
	check for loads group of length 3.
	(vect_permute_load_chain): New permutations for loads group of
	length 3.
	* tree-vect-stmts.c (vect_model_load_cost): Change cost
	of vec_perm_shuffle for the new permutations.

gcc/testsuite/
	PR tree-optimization/52252
	* gcc.dg/vect/pr52252-ld.c: Test on loads group of size 3.


Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-data-refs.c
    trunk/gcc/tree-vect-stmts.c
Comment 6 Kirill Yukhin 2014-06-11 08:38:25 UTC
Author: kyukhin
Date: Wed Jun 11 08:37:53 2014
New Revision: 211439

URL: http://gcc.gnu.org/viewcvs?rev=211439&root=gcc&view=rev
Log:
gcc/
	* tree-vect-data-refs.c (vect_grouped_store_supported): New
	check for stores group of length 3.
	(vect_permute_store_chain): New permutations for stores group of
	length 3.
	* tree-vect-stmts.c (vect_model_store_cost): Change cost
	of vec_perm_shuffle for the new permutations.

gcc/testsuite/
	PR tree-optimization/52252
	* gcc.dg/vect/pr52252-st.c: Test on stores group of size 3.


Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr52252-st.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-data-refs.c
    trunk/gcc/tree-vect-stmts.c
Comment 7 Kirill Yukhin 2014-06-18 07:46:50 UTC
Author: kyukhin
Date: Wed Jun 18 07:46:18 2014
New Revision: 211769

URL: https://gcc.gnu.org/viewcvs?rev=211769&root=gcc&view=rev
Log:
gcc/
	* config/i386/i386.c (ix86_reassociation_width): Add alternative for
	vector case.
	* config/i386/i386.h (TARGET_VECTOR_PARALLEL_EXECUTION): New.
	* config/i386/x86-tune.def (X86_TUNE_VECTOR_PARALLEL_EXECUTION): New.
	* tree-vect-data-refs.c (vect_shift_permute_load_chain): New.
	Introduces alternative way of loads group permutaions.
	(vect_transform_grouped_load): Try alternative way of permutations.

gcc/testsuite/
	PR tree-optimization/52252
	* gcc.target/i386/pr52252-atom.c: Test on loads group of size 3.
	* gcc.target/i386/pr52252-core.c: Ditto.

	PR tree-optimization/61403
	* gcc.target/i386/pr61403.c: Test on loads and stores group of size 3.


Added:
    trunk/gcc/testsuite/gcc.target/i386/pr52252-atom.c
    trunk/gcc/testsuite/gcc.target/i386/pr52252-core.c
    trunk/gcc/testsuite/gcc.target/i386/pr61403.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.h
    trunk/gcc/config/i386/x86-tune.def
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-data-refs.c