https://gcc.opensuse.org/gcc-old/SPEC/CINT/sb-czerny-head-64-2006/456_hmmer_big.png shows two recent regressions, one around June 13th and one around Jul 3rd.
Jul 1st regression: last known good is r272839, first known bad r272955. June 13th regression: last known good is r272230, first known bad r272280.
Bisecting the Jul 1st regression now.
The Jul 1st regression looks like it is r272922 (mine :/). Still need to bisect the older one.
For the Jul 1st regression the DOM change causes us to recognize MIN/MAX patterns way earlier (they are exposed by conditional store elimination). While before the change we had to if-convert 5 loops of P7Viterbi, after the change only one remains. This naturally affects the scalar fallback.

perf shows (base is r272921, peak is r272922)

  41.37%    611991  hmmer_peak.amd6  hmmer_peak.amd64-m64-gcc42-nn  [.] P7Viterbi
  29.84%    441484  hmmer_base.amd6  hmmer_base.amd64-m64-gcc42-nn  [.] P7Viterbi
  15.32%    226646  hmmer_base.amd6  hmmer_base.amd64-m64-gcc42-nn  [.] P7Viterbi.cold
   6.74%     99771  hmmer_peak.amd6  hmmer_peak.amd64-m64-gcc42-nn  [.] P7Viterbi.cold

so the cold part is way faster but somehow the hot part is quite a bit slower. I suspect profile changes in the end; the sums are 668130 vs. 711762 samples. -fopt-info-loop doesn't show any changes, and vectorizer cost-model estimates (and thus runtime niter checks) are the same.

Note the Jul 1st regression is in the range where PR90911 was present, which should be the observed Jun 13th regression (r272239).

perf points to a sequence of two cmovs being slow compared to one cmov and a conditional branch.

Fast:

       │        dc[k] = dc[k-1] + tpdd[k-1];
  7.42 │ fd0:   mov    -0x80(%rbp),%rbx
  0.08 │        add    -0x8(%rbx,%rax,4),%ecx
       │        if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  0.24 │        mov    -0xa0(%rbp),%rbx
       │        dc[k] = dc[k-1] + tpdd[k-1];
  0.07 │        mov    %ecx,-0x4(%rdx,%rax,4)
       │        if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  7.41 │        mov    -0x8(%r15,%rax,4),%esi
  2.69 │        add    -0x8(%rbx,%rax,4),%esi
  3.31 │        cmp    %ecx,%esi
  2.85 │        cmovge %esi,%ecx
 12.17 │        mov    %ecx,-0x4(%rdx,%rax,4)
       │        if (dc[k] < -INFTY) dc[k] = -INFTY;
  7.43 │        cmp    $0xc521974f,%ecx
       │        jge    1060
  0.01 │        movl   $0xc521974f,-0x4(%rdx,%rax,4)
       │        for (k = 1; k <= M; k++) {
  0.02 │        mov    %eax,%esi
  0.02 │        inc    %rax
  0.01 │        cmp    %rax,%rdi
       │        je     106e
       │        if (dc[k] < -INFTY) dc[k] = -INFTY;
  0.01 │        mov    $0xc521974f,%ecx
  0.01 │        jmp    fd0
  ...
  0.00 │1060:   mov    %eax,%esi
  0.03 │        inc    %rax
  0.00 │        cmp    %rdi,%rax
  0.00 │        jne    fd0
  0.06 │106e:   cmp    -0x64(%rbp),%esi
       │        jg     1508

slow:

       │        dc[k] = dc[k-1] + tpdd[k-1];
  0.06 │13b0:   add    -0x8(%r9,%rcx,4),%eax
  0.07 │        mov    %eax,-0x4(%r13,%rcx,4)
       │        if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  6.81 │        mov    -0x8(%r8,%rcx,4),%esi
  0.04 │        add    -0x8(%rdx,%rcx,4),%esi
  1.46 │        cmp    %eax,%esi
  0.29 │        cmovge %esi,%eax
       │        for (k = 1; k <= M; k++) {
 13.43 │        mov    %ecx,%esi
  0.05 │        cmp    $0xc521974f,%eax
  6.78 │        cmovl  %ebx,%eax
 13.53 │        mov    %eax,-0x4(%r13,%rcx,4)
  6.81 │        inc    %rcx
  0.03 │        cmp    %rcx,%rdi
       │        jne    13b0

where on GIMPLE we have mostly MAX_EXPR <..., -987654321> for the slow case and a conditional statement in the fast case. Of course(?) a jump should be well predicted (and can be speculated [correctly]) while the data dependence on the earlier access is not.

Thus in the end this is an RTL expansion / if-conversion or backend costing issue. The following heuristic^Whack fixes the regression for me:

Index: gcc/expr.c
===================================================================
--- gcc/expr.c  (revision 272922)
+++ gcc/expr.c  (working copy)
@@ -9159,7 +9159,9 @@ expand_expr_real_2 (sepops ops, rtx targ
        }

       /* Use a conditional move if possible.  */
-      if (can_conditionally_move_p (mode))
+      if ((TREE_CODE (treeop1) != INTEGER_CST
+           || !wi::eq_p (wi::to_widest (treeop1), -987654321))
+          && can_conditionally_move_p (mode))
        {
          rtx insn;

Index: gcc/ifcvt.c
===================================================================
--- gcc/ifcvt.c (revision 272922)
+++ gcc/ifcvt.c (working copy)
@@ -3570,6 +3570,13 @@ noce_process_if_block (struct noce_if_in
      which do a good enough job these days.  */
     return FALSE;

+  rtx_insn *earliest;
+  cond = noce_get_alt_condition (if_info, if_info->a, &earliest);
+  if (cond
+      && CONST_INT_P (XEXP (cond, 1))
+      && INTVAL (XEXP (cond, 1)) == -987654321)
+    return FALSE;
+
   if (noce_try_move (if_info))
     goto success;
   if (noce_try_ifelse_collapse (if_info))
One x86 option is to peephole

(insn 635 632 637 17 (set (reg:CCGC 17 flags)
        (compare:CCGC (reg:SI 0 ax [1735])
            (const_int -987654321 [0xffffffffc521974f]))) 11 {*cmpsi_1}
     (nil))
(insn 637 635 638 17 (set (reg:SI 0 ax [1737])
        (if_then_else:SI (ge (reg:CCGC 17 flags)
                (const_int 0 [0]))
            (reg:SI 0 ax [1735])
            (reg:SI 3 bx [2395]))) 947 {*movsicc_noc}
     (expr_list:REG_DEAD (reg:CCGC 17 flags)
        (expr_list:REG_EQUAL (if_then_else:SI (ge (reg:CCGC 17 flags)
                    (const_int 0 [0]))
                (reg:SI 0 ax [1735])
                (const_int -987654321 [0xffffffffc521974f]))
            (nil))))

into a compare-and-jump sequence. Benchmarking on a more recent microarchitecture is probably also in order.
So pretty much undo all ifcvt into conditional moves (aka pretend the target doesn't have cmov) or something else? The problem is that cmov isn't unconditionally bad or unconditionally good and moving in either direction usually results in significant regressions. cmov is good if the branch will be often badly predicted, so when the branch prediction is close to 50%, and is bad when it is usually well predicted.
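To make the prediction argument concrete, here is a minimal C sketch (not from the benchmark sources; the function names are made up) of the two situations being contrasted:

/* Hypothetical illustration of the cmov-vs-branch trade-off described above.  */

int clamp_rarely_taken (int x)
{
  /* In 456.hmmer the value almost never drops below -987654321, so a branch
     here is predicted correctly nearly all the time; a cmov only adds the
     result to the loop's data-dependence chain.  */
  if (x < -987654321)
    x = -987654321;
  return x;
}

int max_unpredictable (int a, int b)
{
  /* When a > b is close to a coin flip, a branch mispredicts often and a
     cmov (or a real min/max instruction) is the better choice.  */
  return a > b ? a : b;
}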
This is PR 56309, again and again.
(In reply to Jakub Jelinek from comment #6)
> So pretty much undo all ifcvt into conditional moves (aka pretend the target
> doesn't have cmov) or something else?
> The problem is that cmov isn't unconditionally bad or unconditionally good
> and moving in either direction usually results in significant regressions.
> cmov is good if the branch will be often badly predicted, so when the branch
> prediction is close to 50%, and is bad when it is usually well predicted.

I'd only do it for MAX<small-constant, x> and MIN<large-constant, x>. There's also another conditional move a few instructions up; not sure if cmov density would be another thing to key on.

If target costing generally favors the cmp + if-then-else then we may also simply advertise [us]{min,max} patterns here?

That said, RTL expansion doesn't look at cost but only asks can_conditionally_move_p (mode); hopefully if-conversion considers actual rtx cost.

Other ideas welcome of course.
(In reply to Richard Biener from comment #8)
> I'd only do it for MAX<small-constant, x> and MIN<large-constant, x>. There's
> also another conditional move a few instructions up; not sure if cmov density
> would be another thing to key on.

And if I read the referenced PR56309 correctly then the issue might be that for

        cmpl    %eax, %esi
        cmovge  %esi, %eax
        movl    %ecx, %esi
        cmpl    %ebx, %eax
        cmovl   %ebx, %eax

the second compare uses the cmov destination. But today's microarchitectures are too complex to really tell what the issue is (still, constraining a peephole quite strictly could be possible, on either or both of the patterns that appear here).
Note that on Haswell the conditional moves are two uops while on Broadwell and up they are only one uop (overall loop 16 uops vs. 18 uops). IACA doesn't show any particular issue (the iterations should neatly interleave without inter-iteration dependences), but it says the throughput bottleneck is dependency chains (not sure if it models conditional moves very well).
One could in some awkward way also rewrite the loop to use integer SSE (so we have access to min/max), relying on zero-filling scalar moves into %xmm and then using vector integer operations. We need to split the loads from the arithmetic for that, of course. With that IACA is happier (14 uops for Haswell, throughput 3.5 cycles vs. 4 before). But this is quite aggressive "STV".

        vmovd   %eax, %xmm10
        vmovd   %ebx, %xmm12
        .p2align 4,,10
        .p2align 3
.L34:
#       addl    -8(%r9,%rcx,4), %eax
        vmovd   -8(%r9,%rcx,4), %xmm13
        vpaddd  %xmm13, %xmm10, %xmm10
#       movl    %eax, -4(%r13,%rcx,4)
        vmovd   %xmm10, -4(%r13,%rcx,4)
#       movl    -8(%r8,%rcx,4), %esi
        vmovd   -8(%r8,%rcx,4), %xmm11
#       addl    -8(%rdx,%rcx,4), %esi
        vmovd   -8(%rdx,%rcx,4), %xmm13
        vpaddd  %xmm13, %xmm11, %xmm11
#       cmpl    %eax, %esi
#       cmovge  %esi, %eax
        vpmaxsd %xmm11, %xmm10, %xmm10
        movl    %ecx, %esi
#       cmpl    %ebx, %eax
#       cmovl   %ebx, %eax
        vpmaxsd %xmm12, %xmm10, %xmm10
#       movl    %eax, -4(%r13,%rcx,4)
        vmovd   %xmm10, -4(%r13,%rcx,4)
        incq    %rcx
        cmpq    %rcx, %rdi
        jne     .L34

An interesting fact is that doing this _improves_ 456.hmmer by 9% beyond fixing the original regression...!

Note I carefully avoided crossing the integer/vector boundary and thus didn't try to just do the max operation in the vector domain. At least on AMD CPUs moving data between integer and FP/vector is slow (IIRC Intel doesn't care). With AVX512 can we even do vpadd %eax, %xmm0, %xmm0 (don't care if %eax is splat or just in lane zero)? IIRC there was support for 'scalar' operands on some ops.

The above experiment also clearly shows integer max/min operations are desperately missing, and cmp + cmov or cmp + branch + move isn't a good substitute.

Now - isn't STV supposed to handle exactly cases like this? Well, it seems it only looks for TImode operations.

Note ICC when tuning for Haswell produces exactly the code we are producing now (which is slow):

..B1.80:                        # Preds ..B1.80 ..B1.79
                                # Execution count [1.25e+01]
        movl      4(%r15,%rdi,4), %eax                          #146.10
        movl      4(%rsi,%rdi,4), %edx                          #147.12
        addl      4(%r13,%rdi,4), %eax                          #146.19
        addl      4(%r11,%rdi,4), %edx                          #147.20
        cmpl      %edx, %eax                                    #149.2
        cmovge    %eax, %edx                                    #149.2
        addl      4(%rbx,%rdi,4), %edx                          #148.2
        cmpl      $-987654321, %edx                             #149.2
        cmovl     %r14d, %edx                                   #149.2
        movl      %edx, 4(%r8,%rdi,4)                           #149.26
        incq      %rdi                                          #133.5
        cmpq      %rcx, %rdi                                    #133.5
        jb        ..B1.80       # Prob 82%                      #133.5

It seems to use cmov quite aggressively everywhere...
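As a concrete illustration of the lane-zero idea above, a hand-written intrinsics version of the inner loop could look as follows. This is an illustrative sketch only - the point of the PR is to have STV generate such code automatically; the function name is made up and SSE4.1 is assumed for pmaxsd (compile with -msse4.1):

#include <smmintrin.h>   /* SSE4.1 intrinsics, for _mm_max_epi32 (pmaxsd) */

void foo_sse (int *dc, int *mc, int *tpdd, int *tpmd, int M)
{
  __m128i ninfty = _mm_cvtsi32_si128 (-987654321);
  __m128i d = _mm_cvtsi32_si128 (dc[0]);      /* running dc[k-1] kept in lane 0 */
  for (int k = 1; k <= M; k++)
    {
      /* dc[k] = dc[k-1] + tpdd[k-1] */
      d = _mm_add_epi32 (d, _mm_cvtsi32_si128 (tpdd[k-1]));
      /* sc = mc[k-1] + tpmd[k-1]; dc[k] = max (dc[k], sc) */
      __m128i sc = _mm_add_epi32 (_mm_cvtsi32_si128 (mc[k-1]),
                                  _mm_cvtsi32_si128 (tpmd[k-1]));
      d = _mm_max_epi32 (d, sc);
      /* if (dc[k] < -987654321) dc[k] = -987654321, i.e. max with -INFTY */
      d = _mm_max_epi32 (d, ninfty);
      dc[k] = _mm_cvtsi128_si32 (d);
    }
}

All values stay in the XMM domain across iterations; only the final store leaves it, which is the property that made the hand-edited assembly fast.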
On Skylake (Coffee Lake actually) with the same binary (built for Haswell), the improvement is down to 5%.

On Haswell, when I just replace the second conditional move like

        vmovd   %ebx, %xmm12
        .p2align 4,,10
        .p2align 3
.L34:
        ...
        cmpl    %eax, %esi
        cmovge  %esi, %eax
        movl    %ecx, %esi
#       cmpl    %ebx, %eax
#       cmovl   %ebx, %eax
        vmovd   %eax, %xmm10
        vpmaxsd %xmm12, %xmm10, %xmm10
        vmovd   %xmm10, %eax
        movl    %eax, -4(%r13,%rcx,4)
        ...

this doesn't make a difference. Replacing both, like

        movl    -8(%r8,%rcx,4), %esi
        addl    -8(%rdx,%rcx,4), %esi
#       cmpl    %eax, %esi
#       cmovge  %esi, %eax
        vmovd   %eax, %xmm10
        vmovd   %esi, %xmm11
        vpmaxsd %xmm11, %xmm10, %xmm10
        movl    %ecx, %esi
#       cmpl    %ebx, %eax
#       cmovl   %ebx, %eax
        vpmaxsd %xmm12, %xmm10, %xmm10
        vmovd   %xmm10, %eax
        movl    %eax, -4(%r13,%rcx,4)

makes runtime improve to within 1% of fixing the regression. I guess that's the best an insn-localized "fix" (provide a smax pattern for SImode) would get to here.

As expected, on Zen this localized "fix" is a loss (an additional 11% regression, tested with the same Haswell-tuned binary) while the full SSE variant is also a lot faster - 35% compared to the code with two cmovs and 17% compared to the good r272921 code. So it seems important to avoid crossing the gpr/xmm domain here. When I go through the stack for all three GPR<->XMM moves the situation gets even (much) worse.

Overall this means enhancing STV to seed itself on conditional moves that match max/min _and_ that can avoid GPR <-> XMM moves would be quite a big win here.
Testcase for the loop in question:

void foo (int *dc, int *mc, int *tpdd, int *tpmd, int M)
{
  int sc;
  int k;
  for (k = 1; k <= M; k++)
    {
      dc[k] = dc[k-1] + tpdd[k-1];
      if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
      if (dc[k] < -987654321)
        dc[k] = -987654321;
    }
}
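Written with the two conditionals folded into max operations (which is what the slow-case GIMPLE effectively contains, per the MAX_EXPR <..., -987654321> noted above), the same loop body is equivalent to the following hand-written C; this is the shape an integer smax pattern would have to catch:

/* Equivalent formulation with explicit max operations (hand-written
   equivalent for illustration, not a compiler dump).  */
static inline int max_int (int a, int b) { return a > b ? a : b; }

void foo_max (int *dc, int *mc, int *tpdd, int *tpmd, int M)
{
  for (int k = 1; k <= M; k++)
    {
      dc[k] = max_int (dc[k-1] + tpdd[k-1], mc[k-1] + tpmd[k-1]);
      dc[k] = max_int (dc[k], -987654321);
    }
}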
With

Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md     (revision 273567)
+++ gcc/config/i386/i386.md     (working copy)
@@ -17681,6 +17681,23 @@
    (set_attr "type" "sseadd")
    (set_attr "mode" "<MODE>")])

+(define_expand "smaxsi3"
+  [(set (match_operand:SI 0 "register_operand")
+        (smax:SI
+          (match_operand:SI 1 "register_operand")
+          (match_operand:SI 2 "register_operand")))]
+  ""
+{
+  rtx vop1 = gen_reg_rtx (V4SImode);
+  rtx vop2 = gen_reg_rtx (V4SImode);
+  emit_insn (gen_vec_setv4si_0 (vop1, CONST0_RTX (V4SImode), operands[1]));
+  emit_insn (gen_vec_setv4si_0 (vop2, CONST0_RTX (V4SImode), operands[2]));
+  rtx tem = gen_reg_rtx (V4SImode);
+  emit_insn (gen_smaxv4si3 (tem, vop1, vop2));
+  emit_move_insn (operands[0], lowpart_subreg (SImode, tem, V4SImode));
+  DONE;
+})
+
 ;; These versions of the min/max patterns implement exactly the operations
 ;;   min = (op1 < op2 ? op1 : op2)
 ;;   max = (!(op1 < op2) ? op1 : op2)

we generate

.L3:
        addl    (%rdx,%r8,4), %r9d
        movl    (%rcx,%r8,4), %eax
        addl    (%rsi,%r8,4), %eax
        vmovd   %r9d, %xmm1
        vmovd   %eax, %xmm0
        movq    %r8, %rax
        vpmaxsd %xmm1, %xmm0, %xmm0
        vinsertps       $0xe, %xmm0, %xmm0, %xmm0
        vpmaxsd %xmm2, %xmm0, %xmm0
        vmovd   %xmm0, 4(%rdi,%r8,4)
        vmovd   %xmm0, %r9d
        incq    %r8
        cmpq    %rax, %r10
        jne     .L3

so we manage to catch the store as well but somehow

(insn:TI 35 27 37 4 (set (reg:V4SI 20 xmm0 [114])
        (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 20 xmm0 [107]))
            (const_vector:V4SI [
                    (const_int 0 [0]) repeated x4
                ])
            (const_int 1 [0x1]))) 2740 {vec_setv4si_0}
     (nil))

fails to be elided. Maybe vec_setv4si_0 isn't the optimal representation choice. Ah, of course the zeros might end up invalidated by the earlier max... we can't say in RTL that we actually do not care about the upper bits - can we?

Anyhow, while the above would fix the regression on Haswell we'd degrade on Zen, and in more isolated cmov cases it's clearly not going to be a win either.
So another idea would be to provide [us]{min,max} patterns for integer modes that split after reload into a compare&cmov or jumpy sequence if allocated using GPR regs, but also allow SSE reg alternatives which would fit into existing SSE code if we'd allow SI-WITH-SSE similar to how we do TARGET_MMX_WITH_SSE operations. We already seem to have movsi patterns {r,m}<->v so that part is done; for the testcase that would leave addsi3 plus appropriate costing of the min/max case.

The smaxsi3 "splitter" (well ;)) is

Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md     (revision 273592)
+++ gcc/config/i386/i386.md     (working copy)
@@ -1881,6 +1881,27 @@ (define_expand "mov<mode>"
   ""
   "ix86_expand_move (<MODE>mode, operands); DONE;")

+(define_insn "smaxsi3"
+  [(set (match_operand:SI 0 "register_operand" "=r,x")
+        (smax:SI (match_operand:SI 1 "register_operand" "%0,x")
+                 (match_operand:SI 2 "register_operand" "r,x")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_AVX2"
+{
+  switch (get_attr_type (insn))
+    {
+    case TYPE_SSEADD:
+      return "vpmaxsd\t{%2, %1, %0|%0, %1, %2}";
+    case TYPE_ICMOV:
+      return "cmpl\t{%2, %0|%0, %2}\n"
+             "cmovl\t{%2, %0|%0, %2}";
+    default:
+      gcc_unreachable ();
+    }
+}
+  [(set_attr "isa" "noavx,avx")
+   (set_attr "type" "icmov,sseadd")])
+
 (define_insn "*mov<mode>_xor"
   [(set (match_operand:SWI48 0 "register_operand" "=r")
         (match_operand:SWI48 1 "const0_operand"))

with that we get the elision of the zeroing between the vpmaxsd, but even -mtune=bdver2 doesn't disparage the cross-unit moves enough for the RA to choose the first alternative. Huh. But it then goes through the stack...

.L3:
        vmovd   %xmm0, %r8d
        addl    (%rdx,%rax,4), %r8d
        movl    %r8d, 4(%rdi,%rax,4)
        movl    (%rcx,%rax,4), %r9d
        addl    (%rsi,%rax,4), %r9d
        movl    %r9d, -4(%rsp)
        vmovd   -4(%rsp), %xmm2
        movl    %r8d, -4(%rsp)
        movq    %rax, %r8
        vmovd   -4(%rsp), %xmm3
        vpmaxsd %xmm3, %xmm2, %xmm0
        vpmaxsd %xmm1, %xmm0, %xmm0
        vmovd   %xmm0, 4(%rdi,%rax,4)
        incq    %rax
        cmpq    %r8, %r10
        jne     .L3

not sure why RA selected the 2nd alternative at all... but yes, we'd want to slightly prefer it. But I expected the inter-unit moves to push us to the first alternative.

As you can see we can automatically handle stores of SImode in SSE regs, and I verified we can also handle loads by slightly altering the testcase. That leaves implementing the addsi3 alternative for SSE regs, like the following quite incomplete hack

Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md     (revision 273592)
+++ gcc/config/i386/i386.md     (working copy)
@@ -5368,10 +5389,10 @@ (define_insn_and_split "*add<dwi>3_doubl
 })

 (define_insn "*add<mode>_1"
-  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r")
+  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r,v")
        (plus:SWI48
-         (match_operand:SWI48 1 "nonimmediate_operand" "%0,0,r,r")
-         (match_operand:SWI48 2 "x86_64_general_operand" "re,m,0,le")))
+         (match_operand:SWI48 1 "nonimmediate_operand" "%0,0,r,r,v")
+         (match_operand:SWI48 2 "x86_64_general_operand" "re,m,0,le,v")))
    (clobber (reg:CC FLAGS_REG))]
   "ix86_binary_operator_ok (PLUS, <MODE>mode, operands)"
 {
@@ -5390,6 +5411,9 @@ (define_insn "*add<mode>_1"
          return "dec{<imodesuffix>}\t%0";
        }

+    case TYPE_SSEADD:
+      return "vpaddd\t{%2, %1, %0|%0, %1, %2}";
+
     default:
       /* For most processors, ADD is faster than LEA.  This alternative
          was added to use ADD as much as possible.  */
@@ -5406,6 +5430,8 @@ (define_insn "*add<mode>_1"
   [(set (attr "type")
      (cond [(eq_attr "alternative" "3")
               (const_string "lea")
+           (eq_attr "alternative" "4")
+             (const_string "sseadd")
            (match_operand:SWI48 2 "incdec_operand")
              (const_string "incdec")
           ]

but somehow that doesn't trigger for me.
Ah, because x86_64_general_operand allows memory but the v alternative does not, and reloading that is apparently more expensive than not doing that and reloading the general reg later. Fun. Changing that to x86_64_nonmemory_operand makes the whole thing work nearly fully (for this testcase, breaking everything else of course); there's one gpr op remaining again because we get memory, this time in the first operand which I kept as nonimmediate_operand. Not sure how we make RA happier to reload a memory operand for the v,v,v alternative without doing that elsewhere.

        movl    $-987654321, %r10d
        vmovd   (%rdi), %xmm0
        leal    -1(%r8), %r9d
        xorl    %eax, %eax
        vmovd   %r10d, %xmm1
        .p2align 4,,10
        .p2align 3
.L3:
        vmovd   (%rdx,%rax,4), %xmm2
        vpaddd  %xmm2, %xmm0, %xmm0
        vmovd   %xmm0, 4(%rdi,%rax,4)
        movl    (%rcx,%rax,4), %r8d
        addl    (%rsi,%rax,4), %r8d
        vmovd   %r8d, %xmm3
        movq    %rax, %r8
        vpmaxsd %xmm0, %xmm3, %xmm0
        vpmaxsd %xmm1, %xmm0, %xmm0
        vmovd   %xmm0, 4(%rdi,%rax,4)
        addq    $1, %rax
        cmpq    %r9, %r8
        jne     .L3
We're going to need a fix for Power as well. We see a 12% drop in 456.hmmer around July 2. We lose vectorization and end up with a couple of isel's in the loop (equivalent to x86 cmov).
Richi corrected me -- this is not vectorization, but use of SSE on lane zero to do scalar computation.
So how does this cause 12% degradation (20% by some other measurements) on power9? Pretty much everything is the *opposite* way around for us: we do have cheap conditional moves, we do prefer integer registers.
On Wed, 7 Aug 2019, segher at gcc dot gnu.org wrote:
> So how does this cause 12% degradation (20% by some other measurements)
> on power9? Pretty much everything is the *opposite* way around for us:
> we do have cheap conditional moves, we do prefer integer registers.

Might be that speculating the jump (which is very very well predicted) is still a lot faster than the conditional move.
Author: rguenth
Date: Wed Aug 14 08:31:54 2019
New Revision: 274422

URL: https://gcc.gnu.org/viewcvs?rev=274422&root=gcc&view=rev
Log:
2019-08-14  Richard Biener  <rguenther@suse.de>

        PR target/91154
        * config/i386/i386-features.c
        (dimode_scalar_chain::compute_convert_gain): Compute and dump
        individual instruction gain.  Fix reg-reg copy GRP cost.  Use
        ix86_cost->sse_op for vector instruction costs.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386-features.c
Author: rguenth
Date: Wed Aug 14 12:04:05 2019
New Revision: 274481

URL: https://gcc.gnu.org/viewcvs?rev=274481&root=gcc&view=rev
Log:
2019-08-14  Richard Biener  <rguenther@suse.de>
            Uroš Bizjak  <ubizjak@gmail.com>

        PR target/91154
        * config/i386/i386-features.h (scalar_chain::scalar_chain): Add
        mode arguments.
        (scalar_chain::smode): New member.
        (scalar_chain::vmode): Likewise.
        (dimode_scalar_chain): Rename to...
        (general_scalar_chain): ... this.
        (general_scalar_chain::general_scalar_chain): Take mode arguments.
        (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
        base with TImode and V1TImode.
        * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
        (general_scalar_chain::vector_const_cost): Adjust for SImode chains.
        (general_scalar_chain::compute_convert_gain): Likewise.  Add
        {S,U}{MIN,MAX} support.
        (general_scalar_chain::replace_with_subreg): Use vmode/smode.
        (general_scalar_chain::make_vector_copies): Likewise.  Handle
        non-DImode chains appropriately.
        (general_scalar_chain::convert_reg): Likewise.
        (general_scalar_chain::convert_op): Likewise.
        (general_scalar_chain::convert_insn): Likewise.  Add
        fatal_insn_not_found if the result is not recognized.
        (convertible_comparison_p): Pass in the scalar mode and use that.
        (general_scalar_to_vector_candidate_p): Likewise.  Rename from
        dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
        (scalar_to_vector_candidate_p): Remove by inlining into single
        caller.
        (general_remove_non_convertible_regs): Rename from
        dimode_remove_non_convertible_regs.
        (remove_non_convertible_regs): Remove by inlining into single
        caller.
        (convert_scalars_to_vector): Handle SImode and DImode chains
        in addition to TImode chains.
        * config/i386/i386.md (<maxmin><MAXMIN_IMODE>3): New expander.
        (*<maxmin><MAXMIN_IMODE>3_1): New insn-and-split.
        (*<maxmin>di3_doubleword): Likewise.

        * gcc.target/i386/pr91154.c: New testcase.
        * gcc.target/i386/minmax-3.c: Likewise.
        * gcc.target/i386/minmax-4.c: Likewise.
        * gcc.target/i386/minmax-5.c: Likewise.
        * gcc.target/i386/minmax-6.c: Likewise.
        * gcc.target/i386/minmax-1.c: Add -mno-stv.
        * gcc.target/i386/minmax-2.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.target/i386/minmax-3.c
    trunk/gcc/testsuite/gcc.target/i386/minmax-4.c
    trunk/gcc/testsuite/gcc.target/i386/minmax-5.c
    trunk/gcc/testsuite/gcc.target/i386/minmax-6.c
    trunk/gcc/testsuite/gcc.target/i386/pr91154.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386-features.c
    trunk/gcc/config/i386/i386-features.h
    trunk/gcc/config/i386/i386.md
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/minmax-1.c
    trunk/gcc/testsuite/gcc.target/i386/minmax-2.c
Fixed for {x86_64,i?86}-*-* (hopefully).
It looks like the patch introduced a (small?) runtime regression of 5% in SPEC2000 300.twolf on Haswell [1]. Maybe worth looking at. [1] https://gcc.opensuse.org/gcc-old/SPEC/CINT/sb-czerny-head-64/300_twolf_recent_big.png
On Sat, 17 Aug 2019, ubizjak at gmail dot com wrote:
> It looks like the patch introduced a (small?) runtime regression of 5% in
> SPEC2000 300.twolf on Haswell [1]. Maybe worth looking at.

Biggest changes when benchmarking -mno-stv (base) against -mstv (peak):

     7.28%   37789  twolf_peak.none  twolf_peak.none  [.] ucxx2
     4.21%   25709  twolf_base.none  twolf_base.none  [.] ucxx2
     3.72%   22584  twolf_base.none  twolf_base.none  [.] new_dbox
     2.48%   22364  twolf_peak.none  twolf_peak.none  [.] new_dbox
     1.49%    8270  twolf_base.none  twolf_base.none  [.] sub_penal
     1.12%    7576  twolf_peak.none  twolf_peak.none  [.] sub_penal
     1.36%    9314  twolf_peak.none  twolf_peak.none  [.] old_assgnto_new2
     1.11%    5257  twolf_base.none  twolf_base.none  [.] old_assgnto_new2

and in ucxx2 I see

  0.17 │       mov    %eax,(%rsp)
  3.55 │       vpmins (%rsp),%xmm0,%xmm1
       │       test   %eax,%eax
  0.22 │       vmovd  %xmm1,%r8d
  0.80 │       cmovs  %esi,%r8d

This is from code like

  a1LoBin = Trybin/binWidth < 0 ? 0 : (Trybin>numBins ? numBins : Trybin)

with only the inner one recognized as MIN because 'numBins' is only ever loaded conditionally and we don't speculate it. So we expand from

  _41 = _40 / binWidth.15_36;
  if (_41 >= 0)
    goto <bb 5>; [59.00%]
  else
    goto <bb 6>; [41.00%]

bb5:
  numBins.26_42 = numBins;
  iftmp.19_315 = MIN_EXPR <_41, numBins.26_42>;

bb6:
  # iftmp.19_267 = PHI <iftmp.19_315(5), 0(4)>

ending up with

        movl    %r9d, %eax
        cltd
        idivl   %ecx
        movl    %eax, (%rsp)
        vpminsd (%rsp), %xmm0, %xmm1
        testl   %eax, %eax
        vmovd   %xmm1, %r11d
        cmovs   %esi, %r11d

and STV converting single-instruction 'chains':

Collected chain #40...
  insns: 381
  defs to convert: r463, r465
Computing gain for chain #40...
  Instruction gain 8 for   381: {r465:SI=smin(r463:SI,[`numBins']);clobber flags:CC;}
      REG_DEAD r463:SI
      REG_UNUSED flags:CC
  Instruction conversion gain: 8
  Registers conversion cost: 4
  Total gain: 4
Converting chain #40...

To me the "spill" to (%rsp) looks suspicious, and even more so the vector(!) memory use in vpminsd. RA could have used

        movd    %eax, %xmm1
        vpminsd %xmm1, %xmm0, %xmm1

no? IRA allocates the pseudo to memory. Testcase:

extern int numBins;
extern int binOffst;
extern int binWidth;
extern int Trybin;
void foo (int);

void bar (int aleft, int axcenter)
{
  int a1LoBin = (((Trybin=((axcenter + aleft)-binOffst)/binWidth)<0)
                 ? 0 : ((Trybin>numBins) ? numBins : Trybin));
  foo (a1LoBin);
}

STV had emitted

(insn 10 9 38 2 (parallel [
            (set (reg:SI 93)
                (div:SI (reg:SI 92)
...
(insn 38 10 12 2 (set (subreg:V4SI (reg:SI 98) 0)
        (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 93))
            (const_vector:V4SI [
                    (const_int 0 [0]) repeated x4
                ])
            (const_int 1 [0x1]))) "t.c":9:56 -1
     (nil))
...
(insn 39 31 32 2 (set (reg:SI 99)
        (mem/c:SI (symbol_ref:DI ("numBins") [flags 0x40]  <var_decl 0x7f9a4bfdbab0 numBins>) [1 numBins+0 S4 A32])) "t.c":9:75 -1
     (nil))
(insn 32 39 37 2 (set (subreg:V4SI (reg:SI 95) 0)
        (smin:V4SI (subreg:V4SI (reg:SI 98) 0)
            (subreg:V4SI (reg:SI 99) 0))) "t.c":9:75 3657 {*sse4_1_sminv4si3}
     (nil))

but then combine eliminated the copy...

Trying 38 -> 32:
   38: r98:SI#0=vec_merge(vec_duplicate(r93:SI),const_vector,0x1)
   32: r95:SI#0=smin(r98:SI#0,r99:SI#0)
      REG_DEAD r99:SI
      REG_DEAD r98:SI
Successfully matched this instruction:
(set (subreg:V4SI (reg:SI 95) 0)
    (smin:V4SI (subreg:V4SI (reg:SI 93) 0)
        (subreg:V4SI (reg:SI 99 [ numBins ]) 0)))
allowing combination of insns 38 and 32
original costs 4 + 40 = 44
replacement cost 40

...running into the issue I tried to fix with making the live-range split copy more "explicit" ...

So it looks like STV "forgot" to convert 'reg:SI 99', which is a memory load it split out. But even when fixing that, combine now forwards both ops, eliminating the LR split again :/  Disabling combine results in the expected variant (not going through the stack):

        idivl   binWidth(%rip)
        vmovd   %eax, %xmm0
        testl   %eax, %eax
        movl    %eax, Trybin(%rip)
        vpminsd %xmm1, %xmm0, %xmm0
        vmovd   %xmm0, %eax
        cmovns  %eax, %edi
        jmp     foo

so while

Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c     (revision 274666)
+++ gcc/config/i386/i386-features.c     (working copy)
@@ -910,7 +910,9 @@ general_scalar_chain::convert_op (rtx *o
     {
       rtx tmp = gen_reg_rtx (GET_MODE (*op));

-      emit_insn_before (gen_move_insn (tmp, *op), insn);
+      emit_insn_before (gen_rtx_SET (gen_rtx_SUBREG (vmode, tmp, 0),
+                                     gen_gpr_to_xmm_move_src (vmode, *op)),
+                        insn);
       *op = gen_rtx_SUBREG (vmode, tmp, 0);

       if (dump_file)

will help to fence off regprop, it doesn't help against combine's power :/ (or IRA's failure to split live-ranges)
This is the power of simplify_subreg, I guess. We're lucky it doesn't do this to arbitrary arithmetic.

So we need to really change all defs we introduce to vector modes instead of making our life easy and using paradoxical subregs all over the place.
(In reply to rguenther@suse.de from comment #25)
> and STV converting single-instruction 'chains':
>
> Collected chain #40...
>   insns: 381
>   defs to convert: r463, r465
> Computing gain for chain #40...
>   Instruction gain 8 for   381: {r465:SI=smin(r463:SI,[`numBins']);clobber flags:CC;}
>   Instruction conversion gain: 8
>   Registers conversion cost: 4
>   Total gain: 4
> Converting chain #40...

Is this in the STV1 pass? This (pre-combine) pass should be enabled only for TImode conversion, a semi-hack where 64-bit targets convert memory accesses to TImode. General STV should not be run before combine.

> To me the "spill" to (%rsp) looks suspicious, and even more so the vector(!)
> memory use in vpminsd. RA could have used
>
>         movd    %eax, %xmm1
>         vpminsd %xmm1, %xmm0, %xmm1
>
> no? IRA allocates the pseudo to memory.

This is how IRA handles subregs. Please note that the memory is correctly aligned, so the vector load does not trip an alignment trap. However, on x86 this approach triggers a store-forwarding stall.
(In reply to Richard Biener from comment #26)
> This is the power of simplify_subreg, I guess. We're lucky it doesn't do
> this to arbitrary arithmetic.
>
> So we need to really change all defs we introduce to vector modes instead of
> making our life easy and using paradoxical subregs all over the place.

No, IMO IRA should be "fixed" to avoid the stack temporary and (based on some cost metric) use a direct move for paradoxical subregs.
(In reply to Uroš Bizjak from comment #27)
> Is this in the STV1 pass? This (pre-combine) pass should be enabled only for
> TImode conversion, a semi-hack where 64-bit targets convert memory accesses
> to TImode. General STV should not be run before combine.

Yes, this is STV1. My patch to enable SImode and DImode chains didn't change where the pass runs or enable the 2nd run, out of compile-time concerns.

Indeed, changing this fixes the issue. I'm going to benchmark it on 300.twolf.
(In reply to Richard Biener from comment #29)
> Indeed, changing this fixes the issue. I'm going to benchmark it on 300.twolf.

Hmm, it regresses gcc.target/i386/minmax-6.c though, and thus cactusADM (IIRC).
(In reply to Uroš Bizjak from comment #28)
> No, IMO IRA should be "fixed" to avoid the stack temporary and (based on some
> cost metric) use a direct move for paradoxical subregs.

The problem is

      /* Moves between SSE and integer units are expensive.  */
      if (SSE_CLASS_P (class1) != SSE_CLASS_P (class2))

        /* ??? By keeping returned value relatively high, we limit the number
           of moves between integer and SSE registers for all targets.
           Additionally, high value prevents problem with x86_modes_tieable_p(),
           where integer modes in SSE registers are not tieable
           because of missing QImode and HImode moves to, from or between
           MMX/SSE registers.  */
        return MAX (8, SSE_CLASS_P (class1)
                    ? ix86_cost->hard_register.sse_to_integer
                    : ix86_cost->hard_register.integer_to_sse);

The minimum cost of moves between SSE and integer units is 8.
(In reply to H.J. Lu from comment #31)
> The minimum cost of moves between SSE and integer units is 8.

I guess this should be reviewed. This is from reload times; nowadays we never actually disable sse <-> int moves, we use preferred_for_{speed,size} attributes. Also, at least for TARGET_MMX_WITH_SSE targets, we can change the penalty for MMX moves.

These sorts of changes should be backed by runtime benchmark results.

Thanks for the heads-up, but let's take this cost issue elsewhere.
(In reply to Uroš Bizjak from comment #32)
> Thanks for the heads-up, but let's take this cost issue elsewhere.

I think it's related, but since we were asked to keep this PR open also for other targets I've created PR91498 for the 300.twolf regression and this RA/costing issue.
Some of the new testcases FAIL on 32-bit x86:

+FAIL: gcc.target/i386/minmax-4.c scan-assembler-times pmaxsd 1
+FAIL: gcc.target/i386/minmax-4.c scan-assembler-times pmaxud 1
+FAIL: gcc.target/i386/minmax-4.c scan-assembler-times pminsd 1
+FAIL: gcc.target/i386/minmax-4.c scan-assembler-times pminud 1
+FAIL: gcc.target/i386/minmax-5.c scan-assembler-times vpmaxsd 1
+FAIL: gcc.target/i386/minmax-5.c scan-assembler-times vpmaxud 1
+FAIL: gcc.target/i386/minmax-5.c scan-assembler-times vpminsd 1
+FAIL: gcc.target/i386/minmax-5.c scan-assembler-times vpminud 1
+FAIL: gcc.target/i386/minmax-6.c scan-assembler pmaxsd

This group is only seen on i386-pc-solaris2.11, while ...

+FAIL: gcc.target/i386/pr91154.c scan-assembler-not cmov
+FAIL: gcc.target/i386/pr91154.c scan-assembler-times paddd 2
+FAIL: gcc.target/i386/pr91154.c scan-assembler-times pmaxsd 2

... for this one there are also a couple of gcc-testresults reports on x86_64-pc-linux-gnu.
Between 20190819 (r274678) and 20190820 (r274749), a new failure was introduced:

+FAIL: gcc.target/i386/minmax-6.c scan-assembler-not rsp

Seen on both i386-pc-solaris2.11 with -m64 and x86_64-pc-linux-gnu (oldest report for r274708).
(In reply to Richard Biener from comment #30)
> Hmm, it regresses gcc.target/i386/minmax-6.c though, and thus cactusADM (IIRC).

I was looking a bit into the minmax-6.c failure. Apparently, despite the spill/fill, cactusADM doesn't regress [1]. By avoiding subregs, spill and fill are both in V4SImode, so store forwarding mitigates the effects of the memory access. Thus the current testsuite failure is benign, and the scan-asm-not should be adjusted to only check for an SImode spill.

OTOH, a spill in SImode (32-bit) and a fill in V4SImode (128-bit), which is the consequence of using a V4SImode paradoxical subreg of an SImode value, disables store forwarding, and a big runtime regression in a hot loop is observed.

Based on this observation, your approach of using special patterns to convert SImode to V4SImode is the correct approach, and I retract my patch that reintroduces subregs instead.

[1] https://gcc.opensuse.org/gcc-old/SPEC/CFP/sb-czerny-head-64-2006/436_cactusADM_recent_big.png
On September 1, 2019 12:05:52 PM GMT+02:00, ubizjak at gmail dot com wrote:
> I was looking a bit into the minmax-6.c failure. Apparently, despite the
> spill/fill, cactusADM doesn't regress [1]. By avoiding subregs, spill and
> fill are both in V4SImode, so store forwarding mitigates the effects of the
> memory access. Thus the current testsuite failure is benign, and the
> scan-asm-not should be adjusted to only check for an SImode spill.
>
> OTOH, a spill in SImode (32-bit) and a fill in V4SImode (128-bit), which is
> the consequence of using a V4SImode paradoxical subreg of an SImode value,
> disables store forwarding, and a big runtime regression in a hot loop is
> observed.

Yeah, that might very well be the original issue. Still, reg<->xmm moves should be better, so not sure if we should adjust the testcase. Maybe we can make two of them and assert we don't see the SImode spill with any tuning?
Just for the record, the same can be seen on -march=znver1: https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=248.180.0&plot.1=27.180.0&
So this bug turned into a mess of random benchmarks. The 456.hmmer regression is fixed. Please open new bugs for any other things found. cactusADM jumps up and down on my tester and shows no clear regression.