Bug 54589 - struct offset add should be folded into address calculation
Summary: struct offset add should be folded into address calculation
Status: REOPENED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.8.0
: P3 normal
Target Milestone: ---
Assignee: Jakub Jelinek
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2012-09-15 13:36 UTC by Steinar H. Gunderson
Modified: 2022-01-10 08:13 UTC (History)
6 users (show)

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail: 4.6.3, 4.7.2, 4.8.0
Last reconfirmed: 2012-09-17 00:00:00


Attachments
gcc9-pr54589.patch (1.46 KB, patch)
2018-11-29 17:31 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Steinar H. Gunderson 2012-09-15 13:36:51 UTC
Hi,

I found this in 4.4 (Ubuntu 10.04), and have confirmed it's still there in

  gcc (Debian 20120820-1) 4.8.0 20120820 (experimental) [trunk revision 190537]

This code:

  #include <emmintrin.h>

  struct param {
          int a, b, c, d;
          __m128i array[256];
  };

  void func(struct param *p, unsigned char *src, int *dst)
  {
          __m128i x = p->array[*src];
          *dst = _mm_cvtsi128_si32(x);
  }

compiles with -O2 on x86-64 to this assembler:

  0000000000000000 <func>:
     0:	0f b6 06             	movzbl (%rsi),%eax
     3:	48 83 c0 01          	add    $0x1,%rax
     7:	48 c1 e0 04          	shl    $0x4,%rax
     b:	8b 04 07             	mov    (%rdi,%rax,1),%eax
     e:	89 02                	mov    %eax,(%rdx)
    10:	c3                   	retq   

The add should be folded into the address calculation here. (The shl can't, because it's too big.) Curiously enough, if I misalign the struct element by removing c and d, and declaring the struct __attribute__((packed)), GCC will do that; the mov will then be from $8(%rdi,%rax,1),%eax and there is no redundant add.
Comment 1 Richard Biener 2012-09-17 08:49:25 UTC
As there is no loop involved it is RTL fwprop or combines job to eventually
do this.  I see

func:
.LFB517:
        .cfi_startproc
        movzbl  (%rsi), %eax
        addq    $1, %rax
        salq    $4, %rax
        movl    (%rdi,%rax), %eax
        movl    %eax, (%rdx)
        ret

for both 4.7 and 4.8.  i?86 is even worse.
Comment 2 Steinar H. Gunderson 2012-09-17 09:18:16 UTC
FWIW, in my original code, func() is a part of a loop body (it keeps reading values from src in a loop). It doesn't really change anything in the generated code, though.
Comment 3 Steinar H. Gunderson 2017-11-03 10:56:59 UTC
Still there in GCC 7.2.1 (exact same assembler output), and in 8.0 snapshot 20171017.
Comment 4 Uroš Bizjak 2018-08-22 07:33:13 UTC
clang generates for x86_64:

        movzbl  (%rsi), %eax
        shlq    $4, %rax
        movl    16(%rdi,%rax), %eax
        movl    %eax, (%rdx)
        retq

and for i?86:

        movl    8(%esp), %edx
        movl    4(%esp), %ecx
        movl    12(%esp), %eax
        movzbl  (%edx), %edx
        shll    $4, %edx
        movl    16(%ecx,%edx), %ecx
        movl    %ecx, (%eax)
        retl

for the later case (-m32), gcc generates:

        movl    8(%esp), %eax
        movzbl  (%eax), %eax
        addl    $1, %eax
        sall    $4, %eax
        addl    4(%esp), %eax
        movl    (%eax), %edx
        movl    12(%esp), %eax
        movl    %edx, (%eax)
        ret

so, two extra additions.
Comment 5 Jakub Jelinek 2018-11-29 15:50:12 UTC
Note, in other cases combine is sucessful.  E.g. consider:
struct S { int a, b, c, d; };
struct T { struct S e[16]; struct S f[1024]; } t;

int
foo (unsigned long x)
{
  return t.f[x + 5].a;
}

int
bar (struct T *x, unsigned long y)
{
  return x->f[y + 5].b;
}

On x86_64-linux with -O2, we have before combine in both cases
(idx + 21) << 4 and in foo combine says:
Trying 7, 8 -> 10:
    7: {r87:DI=r91:DI+0x15;clobber flags:CC;}
      REG_DEAD r91:DI
      REG_UNUSED flags:CC
    8: {r88:DI=r87:DI<<0x4;clobber flags:CC;}
      REG_DEAD r87:DI
      REG_UNUSED flags:CC
   10: r90:SI=[r88:DI+`t']
      REG_DEAD r88:DI
Failed to match this instruction:
(set (reg:SI 90 [ t.f[_1].a ])
    (mem:SI (plus:DI (mult:DI (reg:DI 91)
                (const_int 16 [0x10]))
            (const:DI (plus:DI (symbol_ref:DI ("t") [flags 0x2]  <var_decl 0x7f837b94cab0 t>)
                    (const_int 336 [0x150])))) [1 t.f[_1].a+0 S4 A32]))
Successfully matched this instruction:
(set (reg:DI 88)
    (ashift:DI (reg:DI 91)
        (const_int 4 [0x4])))
Successfully matched this instruction:
(set (reg:SI 90 [ t.f[_1].a ])
    (mem:SI (plus:DI (reg:DI 88)
            (const:DI (plus:DI (symbol_ref:DI ("t") [flags 0x2]  <var_decl 0x7f837b94cab0 t>)
                    (const_int 336 [0x150])))) [1 t.f[_1].a+0 S4 A32]))
and it is merged the way we want:
        salq    $4, %rdi
        movl    t+336(%rdi), %eax
while in bar:
Failed to match this instruction:
(set (reg:SI 91 [ x_4(D)->f[_1].b ])
    (mem:SI (plus:DI (plus:DI (mult:DI (reg:DI 93)
                    (const_int 16 [0x10]))
                (reg:DI 92))
            (const_int 340 [0x154])) [1 x_4(D)->f[_1].b+0 S4 A32]))
Failed to match this instruction:
(set (reg:DI 88)
    (plus:DI (ashift:DI (reg:DI 93)
            (const_int 4 [0x4]))
        (reg:DI 92)))
and it is not merged:
        addq    $21, %rsi
        salq    $4, %rsi
        movl    4(%rsi,%rdi), %eax
when it could be:
	salq	$4, %rsi
	movl	340(%rsi,%rdi), %eax

It isn't x86 specific though, e.g. on powerpc64-linux we end up with:
        addi 4,4,21
        sldi 4,4,4
        add 4,3,4
        lwa 3,4(4)
for bar, where I guess:
        sldi 4,4,4
        add 4,3,4
	lwa 3,340(4)
would work too.
Comment 6 Jakub Jelinek 2018-11-29 17:04:54 UTC
--- gcc/combine.c.jj	2018-11-21 19:57:26.229726485 +0100
+++ gcc/combine.c	2018-11-29 17:57:48.069423874 +0100
@@ -4945,7 +4945,7 @@ find_split_point (rtx *loc, rtx_insn *in
 	}
 
       /* If we have a PLUS whose second operand is a constant and the
-	 address is not valid, perhaps will can split it up using
+	 address is not valid, perhaps we can split it up using
 	 the machine-specific way to split large constants.  We use
 	 the first pseudo-reg (one of the virtual regs) as a placeholder;
 	 it will not remain in the result.  */
@@ -4960,7 +4960,7 @@ find_split_point (rtx *loc, rtx_insn *in
 
 	  /* This should have produced two insns, each of which sets our
 	     placeholder.  If the source of the second is a valid address,
-	     we can make put both sources together and make a split point
+	     we can put both sources together and make a split point
 	     in the middle.  */
 
 	  if (seq
@@ -5001,14 +5001,45 @@ find_split_point (rtx *loc, rtx_insn *in
 		}
 	    }
 
+	  /* If that didn't work and we have a nested plus, like:
+	     ((REG1 * CONST1) + REG2) + CONST2 and (REG1 + REG2) + CONST2
+	     is valid address, try to split (REG1 * CONST1).  */
+	  if (GET_CODE (XEXP (XEXP (x, 0), 0)) == PLUS
+	      && !OBJECT_P (XEXP (XEXP (XEXP (x, 0), 0), 0))
+	      && OBJECT_P (XEXP (XEXP (XEXP (x, 0), 0), 1)))
+	    {
+	      rtx tem = XEXP (XEXP (XEXP (x, 0), 0), 0);
+	      XEXP (XEXP (XEXP (x, 0), 0), 0) = reg;
+	      if (memory_address_addr_space_p (GET_MODE (x), XEXP (x, 0),
+					       MEM_ADDR_SPACE (x)))
+		{
+		  XEXP (XEXP (XEXP (x, 0), 0), 0) = tem;
+		  return &XEXP (XEXP (XEXP (x, 0), 0), 0);
+		}
+	      XEXP (XEXP (XEXP (x, 0), 0), 0) = tem;
+	    }
+	  else if (GET_CODE (XEXP (XEXP (x, 0), 0)) == PLUS
+		   && OBJECT_P (XEXP (XEXP (XEXP (x, 0), 0), 0))
+		   && !OBJECT_P (XEXP (XEXP (XEXP (x, 0), 0), 1)))
+	    {
+	      rtx tem = XEXP (XEXP (XEXP (x, 0), 0), 1);
+	      XEXP (XEXP (XEXP (x, 0), 0), 1) = reg;
+	      if (memory_address_addr_space_p (GET_MODE (x), XEXP (x, 0),
+					       MEM_ADDR_SPACE (x)))
+		{
+		  XEXP (XEXP (XEXP (x, 0), 0), 1) = tem;
+		  return &XEXP (XEXP (XEXP (x, 0), 0), 1);
+		}
+	      XEXP (XEXP (XEXP (x, 0), 0), 1) = tem;
+	    }
+
 	  /* If that didn't work, perhaps the first operand is complex and
 	     needs to be computed separately, so make a split point there.
 	     This will occur on machines that just support REG + CONST
 	     and have a constant moved through some previous computation.  */
-
-	  else if (!OBJECT_P (XEXP (XEXP (x, 0), 0))
-		   && ! (GET_CODE (XEXP (XEXP (x, 0), 0)) == SUBREG
-			 && OBJECT_P (SUBREG_REG (XEXP (XEXP (x, 0), 0)))))
+	  if (!OBJECT_P (XEXP (XEXP (x, 0), 0))
+	      && ! (GET_CODE (XEXP (XEXP (x, 0), 0)) == SUBREG
+		    && OBJECT_P (SUBREG_REG (XEXP (XEXP (x, 0), 0)))))
 	    return &XEXP (XEXP (x, 0), 0);
 	}
 

fixes this for me.
Comment 7 Jakub Jelinek 2018-11-29 17:31:21 UTC
Created attachment 45125 [details]
gcc9-pr54589.patch

Untested fix.
Comment 8 Jakub Jelinek 2018-12-01 07:28:29 UTC
Author: jakub
Date: Sat Dec  1 07:27:58 2018
New Revision: 266707

URL: https://gcc.gnu.org/viewcvs?rev=266707&root=gcc&view=rev
Log:
	PR target/54589
	* combine.c (find_split_point): For invalid memory address
	nonobj + obj + const, if reg + obj + const is valid addressing
	mode, split at nonobj.  Use if rather than else if for the
	fallback.  Comment fixes.

	* gcc.target/i386/pr54589.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr54589.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/combine.c
    trunk/gcc/testsuite/ChangeLog
Comment 9 Jakub Jelinek 2018-12-01 07:37:31 UTC
Fixed for GCC9 and later.
Comment 10 Jaydeep Chauhan 2018-12-07 11:12:34 UTC
 Hi,
 
 For the below case fix is not working.
 
 #include <emmintrin.h>

  struct param {
          int a, b, c, d;
          __m128i array[256];
  };

  struct param *p1;
  void func(struct param *p, unsigned char *src, int *dst)
  {
          __m128i x = p1->array[*src];
          *dst = _mm_cvtsi128_si32(x);
  }

  gcc generates for x86-64 with -O2:
  
  func:
        movzbl  (%rsi), %eax
        addq    $1, %rax
        salq    $4, %rax
        addq    p1(%rip), %rax
        movl    (%rax), %eax
        movl    %eax, (%rdx)
        ret
		
  clang generates for x86_64 with -O2:	
		
  func:                                  
        movq    p1(%rip), %rax
        movzbl  (%rsi), %ecx
        shlq    $4, %rcx
        movl    16(%rax,%rcx), %eax
        movl    %eax, (%rdx)
        retq
Comment 11 Jakub Jelinek 2018-12-07 12:29:51 UTC
That is because e.g. on:
struct S { int a, b, c, d; };
struct T { struct S a; struct S b[256]; } *v;

int
foo (unsigned char *x)
{
  return v->b[*x].a;
}
we have:
(insn 6 3 7 2 (set (reg/f:DI 88 [ v ])
        (mem/f/c:DI (symbol_ref:DI ("v") [flags 0x2]  <var_decl 0x7f5f67c7aab0 v>) [1 v+0 S8 A64])) "pr54589-3.c":7:18 66 {*movdi_internal}
     (nil))
(insn 7 6 8 2 (set (reg:SI 89 [ *x_5(D) ])
        (zero_extend:SI (mem:QI (reg/v/f:DI 86 [ x ]) [0 *x_5(D)+0 S1 A8]))) "pr54589-3.c":7:15 119 {*zero_extendqisi2}
     (expr_list:REG_DEAD (reg/v/f:DI 86 [ x ])
        (nil)))
(insn 8 7 9 2 (set (reg:DI 90 [ *x_5(D) ])
        (sign_extend:DI (reg:SI 89 [ *x_5(D) ]))) "pr54589-3.c":7:18 128 {*extendsidi2_rex64}
     (expr_list:REG_DEAD (reg:SI 89 [ *x_5(D) ])
        (nil)))
(insn 9 8 10 2 (parallel [
            (set (reg:DI 91)
                (plus:DI (reg:DI 90 [ *x_5(D) ])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) "pr54589-3.c":7:18 191 {*adddi_1}
     (expr_list:REG_DEAD (reg:DI 90 [ *x_5(D) ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 10 9 11 2 (parallel [
            (set (reg:DI 92)
                (ashift:DI (reg:DI 91)
                    (const_int 4 [0x4])))
            (clobber (reg:CC 17 flags))
        ]) "pr54589-3.c":7:18 518 {*ashldi3_1}
     (expr_list:REG_DEAD (reg:DI 91)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 11 10 12 2 (parallel [
            (set (reg/f:DI 93)
                (plus:DI (reg/f:DI 88 [ v ])
                    (reg:DI 92)))
            (clobber (reg:CC 17 flags))
        ]) "pr54589-3.c":7:18 191 {*adddi_1}
     (expr_list:REG_DEAD (reg:DI 92)
        (expr_list:REG_DEAD (reg/f:DI 88 [ v ])
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil)))))
(insn 12 11 17 2 (set (reg:SI 94 [ v.0_1->b[_3].a ])
        (mem:SI (reg/f:DI 93) [2 v.0_1->b[_3].a+0 S4 A32])) "pr54589-3.c":7:18 67 {*movsi_internal}
     (expr_list:REG_DEAD (reg/f:DI 93)
        (nil)))
and combine would handle that fine if it tried to combine 9, 10, 11 -> 12
first, but unfortunately for this case it tries 6 -> 11 first and, if the combination of everything doesn't simplify, we try to split it only into two parts, which in this case isn't enough, we'd need to split those 4 instructions into 3 (one to do the shift, one to do the v load and last to do the mem read with the 16(reg1, reg2) addressing.

Unless the combiner grows the possibility to split into 3 functions, I'm afraid this would need to be solved in machine reorg or something similar.
Comment 12 Jakub Jelinek 2018-12-07 12:30:37 UTC
(In reply to Jakub Jelinek from comment #11)
> Unless the combiner grows the possibility to split into 3 functions, I'm

I mean 3 instructions when trying to combine 4.

> afraid this would need to be solved in machine reorg or something similar.
Comment 13 Segher Boessenkool 2018-12-07 17:59:19 UTC
Yeah, that's not going to happen.

Will it help to do some define_split or define_insn_and_split for this?
Comment 14 Jaydeep Chauhan 2018-12-17 15:08:40 UTC
(In reply to Jaydeep Chauhan from comment #10)


I have tried solve this case using define_split or define_insn_and_split 
  but i am facing is some issue as per below:

1.It is not possible to combine below four instruction 

        addq    $1, %rax
        salq    $4, %rax
        addq    p1(%rip), %rax
        movl    (%rax), %eax
		
into the this
        shlq    $4, %rcx
        movl    16(%rax,%rcx), %eax 
		
using define_split or define_insn_and_split
	
To add offset "16" and "p1"  it is creating problem because "p1" is symbol ref.

2.Also to optimize 
        addq    p1(%rip), %rax
        movl    (%rax), %eax
into a single instruction it should need to be seperate define_split or define_insn_and_split.
So it should need to be seperate 

3.I have also tried this case with peephole but i am facing same problem with this also


(define_peephole2
  [(parallel [(set (match_operand:DI 0 "register_operand")
                 (plus:DI (match_dup 0)
                     (match_operand:DI 1 "const_int_operand")))
                 (clobber (reg:CC FLAGS_REG))])
   (parallel [(set (match_dup 0)
                 (ashift:DI (match_dup 0)
                         (match_operand 2 "const_int_operand")))
                 (clobber (reg:CC FLAGS_REG))])
   (parallel [(set (match_dup 0)
        (plus:DI (match_operand:DI 4 "nonimmediate_operand")
                 (mem:DI (match_operand:SDWIM 5 "<general_hilo_operand>"))))
                 (clobber (reg:CC FLAGS_REG))])]
  ""
  [(parallel [(set (match_dup 0)
                   (ashift:DI (match_dup 0)(match_dup 2)))
                   (clobber (reg:CC FLAGS_REG))])
   (parallel [(set (match_dup 0)
               (plus:DI (match_dup 4)
                    (mem:DI (match_dup 5))))
                    (clobber (reg:CC FLAGS_REG))])]
{
   int scale = 8 << INTVAL (operands[1]);
   operands[4] = gen_rtx_PLUS (DImode,operands[4], GEN_INT (scale));
})


Please share your comment/suggestion on this.