Hi, I found this in 4.4 (Ubuntu 10.04), and have confirmed it's still there in gcc (Debian 20120820-1) 4.8.0 20120820 (experimental) [trunk revision 190537] This code: #include <emmintrin.h> struct param { int a, b, c, d; __m128i array[256]; }; void func(struct param *p, unsigned char *src, int *dst) { __m128i x = p->array[*src]; *dst = _mm_cvtsi128_si32(x); } compiles with -O2 on x86-64 to this assembler: 0000000000000000 <func>: 0: 0f b6 06 movzbl (%rsi),%eax 3: 48 83 c0 01 add $0x1,%rax 7: 48 c1 e0 04 shl $0x4,%rax b: 8b 04 07 mov (%rdi,%rax,1),%eax e: 89 02 mov %eax,(%rdx) 10: c3 retq The add should be folded into the address calculation here. (The shl can't, because it's too big.) Curiously enough, if I misalign the struct element by removing c and d, and declaring the struct __attribute__((packed)), GCC will do that; the mov will then be from $8(%rdi,%rax,1),%eax and there is no redundant add.
As there is no loop involved it is RTL fwprop or combines job to eventually do this. I see func: .LFB517: .cfi_startproc movzbl (%rsi), %eax addq $1, %rax salq $4, %rax movl (%rdi,%rax), %eax movl %eax, (%rdx) ret for both 4.7 and 4.8. i?86 is even worse.
FWIW, in my original code, func() is a part of a loop body (it keeps reading values from src in a loop). It doesn't really change anything in the generated code, though.
Still there in GCC 7.2.1 (exact same assembler output), and in 8.0 snapshot 20171017.
clang generates for x86_64: movzbl (%rsi), %eax shlq $4, %rax movl 16(%rdi,%rax), %eax movl %eax, (%rdx) retq and for i?86: movl 8(%esp), %edx movl 4(%esp), %ecx movl 12(%esp), %eax movzbl (%edx), %edx shll $4, %edx movl 16(%ecx,%edx), %ecx movl %ecx, (%eax) retl for the later case (-m32), gcc generates: movl 8(%esp), %eax movzbl (%eax), %eax addl $1, %eax sall $4, %eax addl 4(%esp), %eax movl (%eax), %edx movl 12(%esp), %eax movl %edx, (%eax) ret so, two extra additions.
Note, in other cases combine is sucessful. E.g. consider: struct S { int a, b, c, d; }; struct T { struct S e[16]; struct S f[1024]; } t; int foo (unsigned long x) { return t.f[x + 5].a; } int bar (struct T *x, unsigned long y) { return x->f[y + 5].b; } On x86_64-linux with -O2, we have before combine in both cases (idx + 21) << 4 and in foo combine says: Trying 7, 8 -> 10: 7: {r87:DI=r91:DI+0x15;clobber flags:CC;} REG_DEAD r91:DI REG_UNUSED flags:CC 8: {r88:DI=r87:DI<<0x4;clobber flags:CC;} REG_DEAD r87:DI REG_UNUSED flags:CC 10: r90:SI=[r88:DI+`t'] REG_DEAD r88:DI Failed to match this instruction: (set (reg:SI 90 [ t.f[_1].a ]) (mem:SI (plus:DI (mult:DI (reg:DI 91) (const_int 16 [0x10])) (const:DI (plus:DI (symbol_ref:DI ("t") [flags 0x2] <var_decl 0x7f837b94cab0 t>) (const_int 336 [0x150])))) [1 t.f[_1].a+0 S4 A32])) Successfully matched this instruction: (set (reg:DI 88) (ashift:DI (reg:DI 91) (const_int 4 [0x4]))) Successfully matched this instruction: (set (reg:SI 90 [ t.f[_1].a ]) (mem:SI (plus:DI (reg:DI 88) (const:DI (plus:DI (symbol_ref:DI ("t") [flags 0x2] <var_decl 0x7f837b94cab0 t>) (const_int 336 [0x150])))) [1 t.f[_1].a+0 S4 A32])) and it is merged the way we want: salq $4, %rdi movl t+336(%rdi), %eax while in bar: Failed to match this instruction: (set (reg:SI 91 [ x_4(D)->f[_1].b ]) (mem:SI (plus:DI (plus:DI (mult:DI (reg:DI 93) (const_int 16 [0x10])) (reg:DI 92)) (const_int 340 [0x154])) [1 x_4(D)->f[_1].b+0 S4 A32])) Failed to match this instruction: (set (reg:DI 88) (plus:DI (ashift:DI (reg:DI 93) (const_int 4 [0x4])) (reg:DI 92))) and it is not merged: addq $21, %rsi salq $4, %rsi movl 4(%rsi,%rdi), %eax when it could be: salq $4, %rsi movl 340(%rsi,%rdi), %eax It isn't x86 specific though, e.g. on powerpc64-linux we end up with: addi 4,4,21 sldi 4,4,4 add 4,3,4 lwa 3,4(4) for bar, where I guess: sldi 4,4,4 add 4,3,4 lwa 3,340(4) would work too.
--- gcc/combine.c.jj 2018-11-21 19:57:26.229726485 +0100 +++ gcc/combine.c 2018-11-29 17:57:48.069423874 +0100 @@ -4945,7 +4945,7 @@ find_split_point (rtx *loc, rtx_insn *in } /* If we have a PLUS whose second operand is a constant and the - address is not valid, perhaps will can split it up using + address is not valid, perhaps we can split it up using the machine-specific way to split large constants. We use the first pseudo-reg (one of the virtual regs) as a placeholder; it will not remain in the result. */ @@ -4960,7 +4960,7 @@ find_split_point (rtx *loc, rtx_insn *in /* This should have produced two insns, each of which sets our placeholder. If the source of the second is a valid address, - we can make put both sources together and make a split point + we can put both sources together and make a split point in the middle. */ if (seq @@ -5001,14 +5001,45 @@ find_split_point (rtx *loc, rtx_insn *in } } + /* If that didn't work and we have a nested plus, like: + ((REG1 * CONST1) + REG2) + CONST2 and (REG1 + REG2) + CONST2 + is valid address, try to split (REG1 * CONST1). */ + if (GET_CODE (XEXP (XEXP (x, 0), 0)) == PLUS + && !OBJECT_P (XEXP (XEXP (XEXP (x, 0), 0), 0)) + && OBJECT_P (XEXP (XEXP (XEXP (x, 0), 0), 1))) + { + rtx tem = XEXP (XEXP (XEXP (x, 0), 0), 0); + XEXP (XEXP (XEXP (x, 0), 0), 0) = reg; + if (memory_address_addr_space_p (GET_MODE (x), XEXP (x, 0), + MEM_ADDR_SPACE (x))) + { + XEXP (XEXP (XEXP (x, 0), 0), 0) = tem; + return &XEXP (XEXP (XEXP (x, 0), 0), 0); + } + XEXP (XEXP (XEXP (x, 0), 0), 0) = tem; + } + else if (GET_CODE (XEXP (XEXP (x, 0), 0)) == PLUS + && OBJECT_P (XEXP (XEXP (XEXP (x, 0), 0), 0)) + && !OBJECT_P (XEXP (XEXP (XEXP (x, 0), 0), 1))) + { + rtx tem = XEXP (XEXP (XEXP (x, 0), 0), 1); + XEXP (XEXP (XEXP (x, 0), 0), 1) = reg; + if (memory_address_addr_space_p (GET_MODE (x), XEXP (x, 0), + MEM_ADDR_SPACE (x))) + { + XEXP (XEXP (XEXP (x, 0), 0), 1) = tem; + return &XEXP (XEXP (XEXP (x, 0), 0), 1); + } + XEXP (XEXP (XEXP (x, 0), 0), 1) = tem; + } + /* If that didn't work, perhaps the first operand is complex and needs to be computed separately, so make a split point there. This will occur on machines that just support REG + CONST and have a constant moved through some previous computation. */ - - else if (!OBJECT_P (XEXP (XEXP (x, 0), 0)) - && ! (GET_CODE (XEXP (XEXP (x, 0), 0)) == SUBREG - && OBJECT_P (SUBREG_REG (XEXP (XEXP (x, 0), 0))))) + if (!OBJECT_P (XEXP (XEXP (x, 0), 0)) + && ! (GET_CODE (XEXP (XEXP (x, 0), 0)) == SUBREG + && OBJECT_P (SUBREG_REG (XEXP (XEXP (x, 0), 0))))) return &XEXP (XEXP (x, 0), 0); } fixes this for me.
Created attachment 45125 [details] gcc9-pr54589.patch Untested fix.
Author: jakub Date: Sat Dec 1 07:27:58 2018 New Revision: 266707 URL: https://gcc.gnu.org/viewcvs?rev=266707&root=gcc&view=rev Log: PR target/54589 * combine.c (find_split_point): For invalid memory address nonobj + obj + const, if reg + obj + const is valid addressing mode, split at nonobj. Use if rather than else if for the fallback. Comment fixes. * gcc.target/i386/pr54589.c: New test. Added: trunk/gcc/testsuite/gcc.target/i386/pr54589.c Modified: trunk/gcc/ChangeLog trunk/gcc/combine.c trunk/gcc/testsuite/ChangeLog
Fixed for GCC9 and later.
Hi, For the below case fix is not working. #include <emmintrin.h> struct param { int a, b, c, d; __m128i array[256]; }; struct param *p1; void func(struct param *p, unsigned char *src, int *dst) { __m128i x = p1->array[*src]; *dst = _mm_cvtsi128_si32(x); } gcc generates for x86-64 with -O2: func: movzbl (%rsi), %eax addq $1, %rax salq $4, %rax addq p1(%rip), %rax movl (%rax), %eax movl %eax, (%rdx) ret clang generates for x86_64 with -O2: func: movq p1(%rip), %rax movzbl (%rsi), %ecx shlq $4, %rcx movl 16(%rax,%rcx), %eax movl %eax, (%rdx) retq
That is because e.g. on: struct S { int a, b, c, d; }; struct T { struct S a; struct S b[256]; } *v; int foo (unsigned char *x) { return v->b[*x].a; } we have: (insn 6 3 7 2 (set (reg/f:DI 88 [ v ]) (mem/f/c:DI (symbol_ref:DI ("v") [flags 0x2] <var_decl 0x7f5f67c7aab0 v>) [1 v+0 S8 A64])) "pr54589-3.c":7:18 66 {*movdi_internal} (nil)) (insn 7 6 8 2 (set (reg:SI 89 [ *x_5(D) ]) (zero_extend:SI (mem:QI (reg/v/f:DI 86 [ x ]) [0 *x_5(D)+0 S1 A8]))) "pr54589-3.c":7:15 119 {*zero_extendqisi2} (expr_list:REG_DEAD (reg/v/f:DI 86 [ x ]) (nil))) (insn 8 7 9 2 (set (reg:DI 90 [ *x_5(D) ]) (sign_extend:DI (reg:SI 89 [ *x_5(D) ]))) "pr54589-3.c":7:18 128 {*extendsidi2_rex64} (expr_list:REG_DEAD (reg:SI 89 [ *x_5(D) ]) (nil))) (insn 9 8 10 2 (parallel [ (set (reg:DI 91) (plus:DI (reg:DI 90 [ *x_5(D) ]) (const_int 1 [0x1]))) (clobber (reg:CC 17 flags)) ]) "pr54589-3.c":7:18 191 {*adddi_1} (expr_list:REG_DEAD (reg:DI 90 [ *x_5(D) ]) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil)))) (insn 10 9 11 2 (parallel [ (set (reg:DI 92) (ashift:DI (reg:DI 91) (const_int 4 [0x4]))) (clobber (reg:CC 17 flags)) ]) "pr54589-3.c":7:18 518 {*ashldi3_1} (expr_list:REG_DEAD (reg:DI 91) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil)))) (insn 11 10 12 2 (parallel [ (set (reg/f:DI 93) (plus:DI (reg/f:DI 88 [ v ]) (reg:DI 92))) (clobber (reg:CC 17 flags)) ]) "pr54589-3.c":7:18 191 {*adddi_1} (expr_list:REG_DEAD (reg:DI 92) (expr_list:REG_DEAD (reg/f:DI 88 [ v ]) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil))))) (insn 12 11 17 2 (set (reg:SI 94 [ v.0_1->b[_3].a ]) (mem:SI (reg/f:DI 93) [2 v.0_1->b[_3].a+0 S4 A32])) "pr54589-3.c":7:18 67 {*movsi_internal} (expr_list:REG_DEAD (reg/f:DI 93) (nil))) and combine would handle that fine if it tried to combine 9, 10, 11 -> 12 first, but unfortunately for this case it tries 6 -> 11 first and, if the combination of everything doesn't simplify, we try to split it only into two parts, which in this case isn't enough, we'd need to split those 4 instructions into 3 (one to do the shift, one to do the v load and last to do the mem read with the 16(reg1, reg2) addressing. Unless the combiner grows the possibility to split into 3 functions, I'm afraid this would need to be solved in machine reorg or something similar.
(In reply to Jakub Jelinek from comment #11) > Unless the combiner grows the possibility to split into 3 functions, I'm I mean 3 instructions when trying to combine 4. > afraid this would need to be solved in machine reorg or something similar.
Yeah, that's not going to happen. Will it help to do some define_split or define_insn_and_split for this?
(In reply to Jaydeep Chauhan from comment #10) I have tried solve this case using define_split or define_insn_and_split but i am facing is some issue as per below: 1.It is not possible to combine below four instruction addq $1, %rax salq $4, %rax addq p1(%rip), %rax movl (%rax), %eax into the this shlq $4, %rcx movl 16(%rax,%rcx), %eax using define_split or define_insn_and_split To add offset "16" and "p1" it is creating problem because "p1" is symbol ref. 2.Also to optimize addq p1(%rip), %rax movl (%rax), %eax into a single instruction it should need to be seperate define_split or define_insn_and_split. So it should need to be seperate 3.I have also tried this case with peephole but i am facing same problem with this also (define_peephole2 [(parallel [(set (match_operand:DI 0 "register_operand") (plus:DI (match_dup 0) (match_operand:DI 1 "const_int_operand"))) (clobber (reg:CC FLAGS_REG))]) (parallel [(set (match_dup 0) (ashift:DI (match_dup 0) (match_operand 2 "const_int_operand"))) (clobber (reg:CC FLAGS_REG))]) (parallel [(set (match_dup 0) (plus:DI (match_operand:DI 4 "nonimmediate_operand") (mem:DI (match_operand:SDWIM 5 "<general_hilo_operand>")))) (clobber (reg:CC FLAGS_REG))])] "" [(parallel [(set (match_dup 0) (ashift:DI (match_dup 0)(match_dup 2))) (clobber (reg:CC FLAGS_REG))]) (parallel [(set (match_dup 0) (plus:DI (match_dup 4) (mem:DI (match_dup 5)))) (clobber (reg:CC FLAGS_REG))])] { int scale = 8 << INTVAL (operands[1]); operands[4] = gen_rtx_PLUS (DImode,operands[4], GEN_INT (scale)); }) Please share your comment/suggestion on this.