Bug 88547 - missed optimization for vector comparisons
Summary: missed optimization for vector comparisons
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 9.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2018-12-18 22:13 UTC by Richard Henderson
Modified: 2019-01-30 17:09 UTC
CC: 1 user

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-12-19 00:00:00


Attachments
gcc9-pr88547-1.patch (2.33 KB, patch)
2018-12-19 16:38 UTC, Jakub Jelinek
gcc9-pr88547.patch (3.57 KB, patch)
2018-12-20 17:02 UTC, Jakub Jelinek

Description Richard Henderson 2018-12-18 22:13:15 UTC
typedef signed svec __attribute__((vector_size(16)));
typedef unsigned uvec __attribute__((vector_size(16)));

svec les(svec x, svec y) {
    return x <= y;
}

uvec leu(uvec x, uvec y) {
    return x <= y;
}

These currently assemble to:

les:
        vpcmpgtd  %xmm1, %xmm0, %xmm0
        vpcmpeqd  %xmm1, %xmm1, %xmm1
        vpandn    %xmm1, %xmm0, %xmm0

leu:
        vmovdqa64 .LC0(%rip), %xmm2
        vpsubd    %xmm2, %xmm1, %xmm1
        vpsubd    %xmm2, %xmm0, %xmm0
        vpcmpgtd  %xmm1, %xmm0, %xmm0
        vpcmpeqd  %xmm1, %xmm1, %xmm1
        vpandn    %xmm1, %xmm0, %xmm0

By rewriting x <= y as min(x, y) == x we can produce

les:
        vpminsd   %xmm1, %xmm0, %xmm1
        vpcmpeqd  %xmm1, %xmm0, %xmm0

leu:
        vpminud   %xmm1, %xmm0, %xmm1
        vpcmpeqd  %xmm0, %xmm1, %xmm0

This also handles unsigned comparisons without requiring a constant bias
vector, at least when the corresponding min instruction is available on
the target.
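
For illustration, the desired sequences correspond to the following hand-written
intrinsics equivalent (a minimal sketch, assuming SSE4.1 so that pminsd/pminud
are available, e.g. compiled with -msse4.1; the function names les_min/leu_min
are made up here):

#include <smmintrin.h>

/* x <= y on signed 32-bit lanes:  min (x, y) == x  */
static __m128i
les_min (__m128i x, __m128i y)
{
  return _mm_cmpeq_epi32 (_mm_min_epi32 (x, y), x);
}

/* x <= y on unsigned 32-bit lanes:  min (x, y) == x, no bias vector needed  */
static __m128i
leu_min (__m128i x, __m128i y)
{
  return _mm_cmpeq_epi32 (_mm_min_epu32 (x, y), x);
}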
Comment 1 Richard Biener 2018-12-19 08:29:26 UTC
Nice.  Patch?
Comment 2 Jakub Jelinek 2018-12-19 11:26:33 UTC
More complete testcase:
typedef signed char v16qi __attribute__((vector_size(16)));
typedef unsigned char v16uqi __attribute__((vector_size(16)));
typedef short v8hi __attribute__((vector_size(16)));
typedef unsigned short v8uhi __attribute__((vector_size(16)));
typedef int v4si __attribute__((vector_size(16)));
typedef unsigned v4usi __attribute__((vector_size(16)));
typedef long long v2di __attribute__((vector_size(16)));
typedef unsigned long long v2udi __attribute__((vector_size(16)));

v16qi
f1 (v16qi x, v16qi y)
{
  return x <= y;
}

v16qi
f1a (v16qi x, v16qi y)
{
  return x < y;
}

v16uqi
f2 (v16uqi x, v16uqi y)
{
  return x <= y;
}

v16qi
f3 (v16qi x, v16qi y)
{
  return x >= y;
}

v16uqi
f4 (v16uqi x, v16uqi y)
{
  return x >= y;
}

v8hi
f5 (v8hi x, v8hi y)
{
  return x <= y;
}

v8uhi
f6 (v8uhi x, v8uhi y)
{
  return x <= y;
}

v8hi
f7 (v8hi x, v8hi y)
{
  return x >= y;
}

v8uhi
f8 (v8uhi x, v8uhi y)
{
  return x >= y;
}

v4si
f9 (v4si x, v4si y)
{
  return x <= y;
}

v4usi
f10 (v4usi x, v4usi y)
{
  return x <= y;
}

v4si
f11 (v4si x, v4si y)
{
  return x >= y;
}

v4usi
f12 (v4usi x, v4usi y)
{
  return x >= y;
}

v2di
f13 (v2di x, v2di y)
{
  return x <= y;
}

v2udi
f14 (v2udi x, v2udi y)
{
  return x <= y;
}

v2di
f15 (v2di x, v2di y)
{
  return x >= y;
}

v2udi
f16 (v2udi x, v2udi y)
{
  return x >= y;
}

Plus of course we need 32-byte and 64-byte vector variants, and to test with -msse4.1 (the first one to have pmin{s,u}b), -mavx, -mavx2 and -mavx512*.

I think it could be done in ix86_expand_int_sse_cmp or in ix86_expand_int_vcond,
perhaps only for the cases where one of the vcond operands is all ones and the
other one is zero.  Note that depending on which operand is which, the negation
costs 2 instructions (though only if we don't hoist the constant load, e.g.
before a loop), and that for TARGET_SSE4_1 we can use the minimum or maximum;
a scalar sketch of the identities involved follows below.
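
As a scalar sanity check of the identities being relied on (purely illustrative,
not the actual expander change):

#include <assert.h>

/* x <= y  <->  min (x, y) == x
   x >= y  <->  max (x, y) == x
   and the all-ones/zero vcond form x > y ? 0 : -1 is just
   min (x, y) == x ? -1 : 0.  */
static void
check (unsigned x, unsigned y)
{
  unsigned min = x < y ? x : y;
  unsigned max = x > y ? x : y;
  assert ((x <= y) == (min == x));
  assert ((x >= y) == (max == x));
  assert ((x > y ? 0 : -1) == (min == x ? -1 : 0));
}

int
main (void)
{
  check (0, 0);
  check (1, 2);
  check (2, 1);
  check (0u - 1, 5);   /* large unsigned value vs. small one */
  return 0;
}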
Comment 3 Jakub Jelinek 2018-12-19 15:01:42 UTC
For 64-byte vectors, we emit
        vpcmpgtb        %zmm1, %zmm0, %k1
        vpxor   %xmm1, %xmm1, %xmm1
        vpternlogd      $0xFF, %zmm0, %zmm0, %zmm0
        vmovdqu8        %zmm1, %zmm0{%k1}
for f1, perhaps it would be better to emit:
        vpcmpgtb        %zmm1, %zmm0, %k1
        knotq           %k1, %k1
        vpmovm2b        %k1, %zmm0
?
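
Written as intrinsics just to illustrate the proposed shorter sequence (a sketch
assuming AVX512BW, e.g. -mavx512bw; the name f1_alt is made up here):

#include <immintrin.h>

/* x <= y on 64 signed byte lanes: compare for >, invert the mask,
   then expand the mask register into a vector.  */
static __m512i
f1_alt (__m512i x, __m512i y)
{
  __mmask64 k = _mm512_cmpgt_epi8_mask (x, y);  /* vpcmpgtb */
  k = _knot_mask64 (k);                         /* knotq    */
  return _mm512_movm_epi8 (k);                  /* vpmovm2b */
}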
Comment 4 Jakub Jelinek 2018-12-19 16:38:14 UTC
Created attachment 45264 [details]
gcc9-pr88547-1.patch

Untested patch to improve the avx512* sse_movcc.
Comment 5 Jakub Jelinek 2018-12-20 07:58:35 UTC
Author: jakub
Date: Thu Dec 20 07:58:02 2018
New Revision: 267293

URL: https://gcc.gnu.org/viewcvs?rev=267293&root=gcc&view=rev
Log:
	PR target/88547
	* config/i386/i386.c (ix86_expand_sse_movcc): For maskcmp, try to
	emit vpmovm2? instruction perhaps after knot?.  Reorganize code
	so that it doesn't have to test !maskcmp in almost every conditional.

	* gcc.target/i386/pr88547-1.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr88547-1.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
Comment 6 Jakub Jelinek 2018-12-20 17:02:13 UTC
Created attachment 45274 [details]
gcc9-pr88547.patch

Untested patch for the rest.  Richard, is that what you had in mind?
Comment 7 Jakub Jelinek 2018-12-21 10:37:42 UTC
Author: jakub
Date: Fri Dec 21 10:37:11 2018
New Revision: 267322

URL: https://gcc.gnu.org/viewcvs?rev=267322&root=gcc&view=rev
Log:
	PR target/88547
	* config/i386/i386.c (ix86_expand_int_sse_cmp): Optimize
	x > y ? 0 : -1 into min (x, y) == x ? -1 : 0.

	* gcc.target/i386/pr88547-1.c: Expect only 2 knotb and 2 knotw
	insns instead of 4, check for vpminud, vpminuq and no vpsubd or
	vpsubq.
	* gcc.target/i386/sse2-pr88547-1.c: New test.
	* gcc.target/i386/sse2-pr88547-2.c: New test.
	* gcc.target/i386/sse4_1-pr88547-1.c: New test.
	* gcc.target/i386/sse4_1-pr88547-2.c: New test.
	* gcc.target/i386/avx2-pr88547-1.c: New test.
	* gcc.target/i386/avx2-pr88547-2.c: New test.
	* gcc.target/i386/avx512f-pr88547-2.c: New test.
	* gcc.target/i386/avx512vl-pr88547-1.c: New test.
	* gcc.target/i386/avx512vl-pr88547-2.c: New test.
	* gcc.target/i386/avx512vl-pr88547-3.c: New test.
	* gcc.target/i386/avx512f_cond_move.c (y): Change from unsigned int
	array to int array.

Added:
    trunk/gcc/testsuite/gcc.target/i386/avx2-pr88547-1.c
    trunk/gcc/testsuite/gcc.target/i386/avx2-pr88547-2.c
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88547-2.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88547-1.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88547-2.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88547-3.c
    trunk/gcc/testsuite/gcc.target/i386/sse2-pr88547-1.c
    trunk/gcc/testsuite/gcc.target/i386/sse2-pr88547-2.c
    trunk/gcc/testsuite/gcc.target/i386/sse4_1-pr88547-1.c
    trunk/gcc/testsuite/gcc.target/i386/sse4_1-pr88547-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/avx512f_cond_move.c
    trunk/gcc/testsuite/gcc.target/i386/pr88547-1.c
Comment 8 Jakub Jelinek 2019-01-30 17:09:05 UTC
Fixed.