96861 – Integer min/max optimization failed under -march=skylake-avx512

Bug 96861 - Integer min/max optimization failed under -march=skylake-avx512

Summary: Integer min/max optimization failed under -march=skylake-avx512

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	11.0

Importance:	P3 normal
Target Milestone:	11.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2020-08-31 09:33 UTC by Hongtao.liu
Modified:	2020-09-21 11:36 UTC (History)
CC List:	1 user (show)

See Also:
Host:
Target:	i386, x86-64
Build:
Known to work:
Known to fail:
Last reconfirmed:	2020-08-31 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Hongtao.liu 2020-08-31 09:33:45 UTC

cat minmax-10.c

---
#define max(a,b) (((a) > (b))) ? (a) : (b)
#define min(a,b) (((a) < (b))) ? (a) : (b)

int smax1(int x)
{
    return max(x,1);
}
---

gcc -O2 -march=skylake-avx512 -S

---
smax1(int):
        vmovdqa xmm1, XMMWORD PTR .LC0[rip]
        vmovd   xmm0, edi
        vpmaxsd xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret
---

gcc -O2 -march=x86-64 -S
---
smax1(int):
        test    edi, edi
        mov     eax, 1
        cmovg   eax, edi
        ret
---

It seems cost model need to be adjusted for skylake_cost.

Comment 1 Richard Biener 2020-08-31 13:11:58 UTC

Collected chain #1...
  insns: 6
  defs to convert: r84, r85
Computing gain for chain #1...
  Instruction gain 8 for     6: {r84:SI=smax(r85:SI,0x1);clobber flags:CC;}
      REG_DEAD r85:SI
      REG_UNUSED flags:CC
  Instruction conversion gain: 8
  Registers conversion cost: 4
  Total gain: 4
Converting chain #1...

      else if (GET_CODE (src) == SMAX
               || GET_CODE (src) == SMIN
               || GET_CODE (src) == UMAX
               || GET_CODE (src) == UMIN)
        {
          /* We do not have any conditional move cost, estimate it as a
             reg-reg move.  Comparisons are costed as adds.  */
          igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
          /* Integer SSE ops are all costed the same.  */
          igain -= ix86_cost->sse_op;
        }

is a bit of hand-waving, esp. since costs in general have not been
very consistent between the GPR and vector parts of the cores
(that also hurts vector cost estimates btw which also compare those
apples vs. oranges)

Comment 2 Hongtao.liu 2020-09-03 02:32:20 UTC

Comment 3 Hongtao.liu 2020-09-03 03:45:08 UTC

(In reply to Hongtao.liu from comment #2)
> .

Difference comes from

  /* Cost the integer to sse and sse to integer moves.  */
  cost += n_sse_to_integer * ix86_cost->sse_to_integer;
  /* ???  integer_to_sse but we only have that in the RA cost table.
     Assume sse_to_integer/integer_to_sse are the same which they
     are at the moment.  */
  cost += n_integer_to_sse * ix86_cost->sse_to_integer;

Maybe need to increase ix86->sse_to_integer, i'm test the following.

--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1769,7 +1769,7 @@ struct processor_costs skylake_cost = {
   {6, 6, 6, 10, 20},                   /* cost of unaligned loads.  */
   {8, 8, 8, 8, 16},                    /* cost of unaligned stores.  */
   2, 2, 4,                             /* cost of moving XMM,YMM,ZMM register */
-  2,                                   /* cost of moving SSE register to integer.  */
+  6,                                   /* cost of moving SSE register to integer.  */
   20, 8,                               /* Gather load static, per_elt.  */
   22, 10,                              /* Gather store static, per_elt.  */
   64,                                  /* size of l1 cache.  */

Comment 4 Hongtao.liu 2020-09-07 07:55:56 UTC

Closed by mistake.

Comment 5 GCC Commits 2020-09-19 14:55:39 UTC

The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:83858ba1db31cd83667592a41d71315090459da4

commit r11-3294-g83858ba1db31cd83667592a41d71315090459da4
Author: liuhongt <hongtao.liu@intel.com>
Date:   Wed Sep 16 10:53:52 2020 +0800

    Increase rtx cost of sse_to_integer in skylake_cost.
    
    As https://gcc.gnu.org/pipermail/gcc-patches/2019-August/528839.html
    indicates, movement between SSE and gpr should be much expensive than
    movement inside gpr(which is 2 as default).
    
    gcc/ChangeLog
    
            PR target/96861
            * config/i386/x86-tune-costs.h (skylake_cost): increase rtx
            cost of sse_to_integer from 2 to 6.
    
    gcc/testsuite
    
            * gcc.target/i386/pr95021-3.c: Add -mtune=generic.

Comment 6 Hongtao.liu 2020-09-19 15:05:51 UTC

Fixed in GCC11.

Comment 7 GCC Commits 2020-09-19 17:15:02 UTC

The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:c66c004ad610f4f02d5f065fe29c4c10f05ae13f

commit r11-3297-gc66c004ad610f4f02d5f065fe29c4c10f05ae13f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sat Sep 19 09:57:16 2020 -0700

    x86: Add a testcase for PR target/96861
    
    Add a testcase to verify that -march=skylake-avx512 -mtune=skylake-avx512
    generates desired code sequence.
    
            PR target/96861
            * gcc.target/i386/pr96861.c: New test.