Bug 66489 - combine fails to merge insns if some are reused later on
Summary: combine fails to merge insns if some are reused later on
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 6.0
: P3 enhancement
Target Milestone: 9.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: spec
  Show dependency treegraph
 
Reported: 2015-06-10 14:30 UTC by ktkachov
Modified: 2023-06-02 04:53 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2015-12-25 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description ktkachov 2015-06-10 14:30:31 UTC
Consider code:
int f2(int a, int b, int c, int d)
{
  a = -a;
  b += a * c;
  c += a * d;
  return b ^ c;
}

On aarch64 this could be written with to msub instructions, RTL code:

(set (reg/i:SI 0 x0)
     (minus:SI
       (reg:SI 1 x1 [ b ])
       (mult:SI (reg:SI 0 x0 [ a ])
                (reg:SI 2 x2 [ c ])))) 

However, combine doesn't merge the neg and the multiply-adds and generates:
f2:
        neg     w4, w0
        madd    w0, w4, w2, w1
        madd    w3, w4, w3, w2
        eor     w0, w0, w3
        ret


If we modify the code to only do a single multiply-accumulate:
int f2(int a, int b, int c, int d)
{
  a = -a;
  b += a * c;
  return b;
}

Then they the expected single msub instruction is generated.
I think this is due to combine being scared of the negated 'a' being used in two places.
Comment 1 Andrew Pinski 2015-06-10 14:37:15 UTC
This was the reason why the pass on the tree level to fused multiply add was added.  Maybe we can do the same.

Do you know how often this shows up?
Comment 2 ktkachov 2015-06-10 14:41:35 UTC
(In reply to Andrew Pinski from comment #1)
> This was the reason why the pass on the tree level to fused multiply add was
> added.  Maybe we can do the same.
> 
> Do you know how often this shows up?

My understanding is that it occurs in SPEC2006 and leslie3d in particular, but with FP code there (just substitute int for double in the testcase and pick something other than eor).

Don't have any numbers on how often or how much of a bottleneck it is to performance...
Comment 3 Andrew Pinski 2015-06-10 14:53:59 UTC
For floating point, tree-ssa-math-opts.c  handles the FMA and then expand is supposed to handle the neg inside the FMA_EXPR but for some reason on this testcase it is not. 
        def0 = get_def_for_expr (treeop0, NEGATE_EXPR);
        /* The multiplication is commutative - look at its 2nd operand
           if the first isn't fed by a negate.  */
        if (!def0)
          {
            def0 = get_def_for_expr (treeop1, NEGATE_EXPR);
            /* Swap operands if the 2nd operand is fed by a negate.  */
            if (def0)
              {
                tree tem = treeop0;
                treeop0 = treeop1;
                treeop1 = tem;
              }
          }
        def2 = get_def_for_expr (treeop2, NEGATE_EXPR);

        op0 = op2 = NULL;

        if (def0 && def2
            && optab_handler (fnms_optab, mode) != CODE_FOR_nothing)
          {
            opt = fnms_optab;
            op0 = expand_normal (gimple_assign_rhs1 (def0));
            op2 = expand_normal (gimple_assign_rhs1 (def2));
          }
        else if (def0
                 && optab_handler (fnma_optab, mode) != CODE_FOR_nothing)
          {
            opt = fnma_optab;
            op0 = expand_normal (gimple_assign_rhs1 (def0));
          }
        else if (def2
                 && optab_handler (fms_optab, mode) != CODE_FOR_nothing)
          {
            opt = fms_optab;
            op2 = expand_normal (gimple_assign_rhs1 (def2));
          }

So you are going to need to debug why expr.c is not handling this case for floating point.

Again for integer, it is a different story.  And I doubt for integer it shows up that often and if it did, the neg could most likely schedule with some other instruction.
Comment 4 Segher Boessenkool 2015-06-10 19:17:02 UTC
Combine does not do this because it would need to combine 2 insns into 2.
This *can* be safely done (carefully).  As a result combine can do more
work (which is good), but lifetime of registers can get longer too (but
also shorter in some cases).  I don't yet know if it is a win on average.

diff --git a/gcc/combine.c b/gcc/combine.c
index 9861291..599d383 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -3992,8 +3992,8 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
 	   && XVECLEN (newpat, 0) == 2
 	   && GET_CODE (XVECEXP (newpat, 0, 0)) == SET
 	   && GET_CODE (XVECEXP (newpat, 0, 1)) == SET
-	   && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
-		  || set_noop_p (XVECEXP (newpat, 0, 1)))
+//	   && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
+//		  || set_noop_p (XVECEXP (newpat, 0, 1)))
 	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
 	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
 	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT
Comment 5 Andrew Pinski 2015-12-25 21:41:02 UTC
Confirmed.
Comment 6 Andrew Pinski 2023-06-02 04:53:08 UTC
Fixed in GCC 9 with 2->2 combing being done now (r9-2064-gc4c5ad1d6d1e1e1fe7a):

Trying 11 -> 12:
   11: r96:SI=-r115:SI
      REG_DEAD r115:SI
   12: r108:SI=r96:SI*r103:SI
Failed to match this instruction:
(parallel [
        (set (reg:SI 108)
            (mult:SI (neg:SI (reg:SI 115))
                (reg/v:SI 103 [ cD.4384 ])))
        (set (reg/v:SI 96 [ aD.4382 ])
            (neg:SI (reg:SI 115)))
    ])
Failed to match this instruction:
(parallel [
        (set (reg:SI 108)
            (mult:SI (neg:SI (reg:SI 115))
                (reg/v:SI 103 [ cD.4384 ])))
        (set (reg/v:SI 96 [ aD.4382 ])
            (neg:SI (reg:SI 115)))
    ])
Successfully matched this instruction:
(set (reg/v:SI 96 [ aD.4382 ])
    (neg:SI (reg:SI 115)))
Successfully matched this instruction:
(set (reg:SI 108)
    (mult:SI (neg:SI (reg:SI 115))
        (reg/v:SI 103 [ cD.4384 ])))
allowing combination of insns 11 and 12
original costs 4 + 12 = 16
replacement costs 4 + 12 = 16
modifying insn i2    11: r96:SI=-r115:SI
deferring rescan insn with uid = 11.
modifying insn i3    12: r108:SI=-r115:SI*r103:SI
      REG_DEAD r115:SI
deferring rescan insn with uid = 12.