This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug c/52252] New: An opportunity for x86 gcc vectorizer (gain up to 3 times)
- From: "evstupac at gmail dot com" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Tue, 14 Feb 2012 22:41:45 +0000
- Subject: [Bug c/52252] New: An opportunity for x86 gcc vectorizer (gain up to 3 times)
- Auto-submitted: auto-generated
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252
Bug #: 52252
Summary: An opportunity for x86 gcc vectorizer (gain up to 3 times)
Classification: Unclassified
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassigned@gcc.gnu.org
ReportedBy: evstupac@gmail.com
This is an example of byte-wise conversion from RGB (Red, Green, Blue) to CMYK (Cyan,
Magenta, Yellow, blacK):
#define byte unsigned char
#define MIN(a, b) ((a) > (b) ? (b) : (a))

void convert_image(byte *in, byte *out, int size) {
    int i;
    for (i = 0; i < size; i++) {
        byte r = in[0];
        byte g = in[1];
        byte b = in[2];
        byte c, m, y, k, tmp;
        c = 255 - r;
        m = 255 - g;
        y = 255 - b;
        tmp = MIN(m, y);
        k = MIN(c, tmp);
        out[0] = c - k;
        out[1] = m - k;
        out[2] = y - k;
        out[3] = k;
        in += 3;
        out += 4;
    }
}
Here trunk GCC for ARM unrolls the loop by 2 and vectorizes it using NEON; GCC
for x86 does not vectorize it at all.
There are 2 tricky aspects to this loop:
1) It converts 3 input bytes into 4 output bytes.
2) We need to shuffle the bytes after the load:
Let 0123456789ABCDEF be 16 bytes of the "in" array (the first RGB triple is bytes 012, the next 345, and so on).
To compute the vector minimum we need to place bytes 0, 1, 2 into 3 different vectors.
GCC for ARM does this with 2 deinterleaving loads:
vld3.8 {d16, d18, d20}, [r2]!
vld3.8 {d17, d19, d21}, [r2]
putting bytes 0, 3, ... into q8 (d16, d17),
bytes 1, 4, ... into q9 (d18, d19),
and bytes 2, 5, ... into q10 (d20, d21).
After all the vector transformations it stores the result with 2 interleaving stores:
vst4.8 {d8, d10, d12, d14}, [r3]!
vst4.8 {d9, d11, d13, d15}, [r3]
However, x86 gcc could do equivalent loads:
movq (%edi),%mm5
movq %mm5,%mm7
movq %mm5,%mm6
pshufb %mm3,%mm5 /*0x00ffffff03ffffff*/
pshufb %mm2,%mm6 /*0x01ffffff04ffffff*/
pshufb %mm1,%mm7 /*0x02ffffff05ffffff*/
/* %mm5 → r, %mm6 → g, %mm7 → b */
And equivalent stores:
pslld $0x8,%mm6
pslld $0x10,%mm7
pslld $0x18,%mm4
pxor %mm5,%mm6
pxor %mm7,%mm4
pxor %mm6,%mm4
pshufb %mm0,%mm4 /*0x0001020304050607*/ /*identity, redundant here*/
movq %mm4,(%esi)
/* %mm5 → c, %mm6 → m, %mm7 → y, %mm4 → k */
The pshufb here is an identity shuffle and does nothing, so it could be removed;
only if we stored fewer than 4 bytes would we need to shuffle them.
Moreover, x86 gcc can unroll not only by 2 but by 4, with the following loads:
movdqu (%edi),%xmm5
movdqa %xmm5,%xmm7
movdqa %xmm5,%xmm6
pshufb %xmm3,%xmm5 /*0x00ffffff03ffffff06ffffff09ffffff*/
pshufb %xmm2,%xmm6 /*0x01ffffff04ffffff07ffffff0affffff*/
pshufb %xmm1,%xmm7 /*0x02ffffff05ffffff08ffffff0bffffff*/
/* %xmm5 → r, %xmm6 → g, %xmm7 → b */
And stores:
pslld $0x8,%xmm6
pslld $0x10,%xmm7
pslld $0x18,%xmm4
pxor %xmm5,%xmm6
pxor %xmm7,%xmm4
pxor %xmm6,%xmm4
pshufb %xmm0,%xmm4 /*0x000102030405060708090a0b0c0d0e0f*/ /*here redundant*/
movdqa %xmm4,(%esi)
/* %xmm5 → c, %xmm6 → m, %xmm7 → y, %xmm4 → k */