Created attachment 40809 [details]
C source from rust-lang/rust#39446

The linked Rust issue originally compares rustc and clang, but I found that
GCC also generates worse code than clang.  Profiling pointed me to these two
statements:

    x_x = (x * x) / 200;
    y_y = (y * y) / 200;

Clang optimizes each of these down to just 4 instructions, like:

    movl    %edx, %ebp
    imull   %ebp, %ebp
    imulq   $1374389535, %rbp, %rbp        # imm = 0x51EB851F
    shrq    $38, %rbp

But clang -fwrapv generates 8 instructions each, the same as rustc, and it
turns out GCC uses a similarly long sequence regardless of -fwrapv.  GCC 6.3
gives:

    movl    %edi, %r13d
    imull   %edi, %r13d
    movl    %r13d, %eax
    sarl    $31, %r13d
    imull   %ebx
    sarl    $6, %edx
    movl    %edx, %ecx
    subl    %r13d, %ecx

GCC 7 is the same, although it interleaves the two x and y lines.  Here they
are on Compiler Explorer: https://godbolt.org/g/Fy8Fiy

I believe what's happening is that clang identifies that signed (x*x) is
never negative, thanks to signed overflow being UB, so (x*x)/200 is
effectively an unsigned division, which needs less fixup code than signed
division by a constant.
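For reference, here is a minimal C sketch of the reciprocal-multiply form
clang emits; the constant and shift count are taken from the disassembly
above, and the helper name is made up:

    #include <stdint.h>

    /* Hypothetical helper: divide a non-negative 32-bit value by 200 using
       the multiply-high sequence from the clang output above.  2^38 / 200
       rounds up to 1374389535 (0x51EB851F), and the rounding error is small
       enough that the result is exact for every uint32_t input. */
    static uint32_t div200(uint32_t n)
    {
        return (uint32_t)(((uint64_t)n * 1374389535u) >> 38);
    }

The signed expansion additionally needs the sign fixup (the sarl $31 / subl
pair in the GCC output above), which is where the extra instructions come
from.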
What options are you using? Did you try -march=native?
What we can do IMHO is, in expand_divmod, if we have a division or modulo by
a constant and the first operand is known to have a clear MSB (from
get_range_info or get_nonzero_bits), expand it both as a signed and as an
unsigned division/modulo, regardless of what unsignedp is.  Then we can
compare the costs of the two expansions and pick the faster one.
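Just to spell out the equivalence this relies on, here is a trivial
source-level sketch (illustrative only, not the proposed expr.c change; the
function name is mine):

    #include <assert.h>

    /* If range/nonzero-bits info proves the dividend is non-negative, signed
       and unsigned division by a positive constant produce the same value,
       so the expander is free to emit whichever sequence is cheaper. */
    void check(int n)
    {
        if (n >= 0)
            assert(n / 200 == (int)((unsigned)n / 200u));
    }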
I'm using just -O3, and then I compared effects with and without -fwrapv to figure out what's going on. Clang is only faster without -fwrapv. With -march=native on my Sandy Bridge, a few instructions are placed in a different order, but it's otherwise identical.
Created attachment 40811 [details]
gcc7-pr79665.patch

Untested fix.
Psst! GCC 7 was already 1.75x faster than Clang 3.8 on my aarch64 machine when I benchmarked this code 3 weeks ago, but with this patch, it seems to take a 5% hit.
But that's related to -mcpu=cortex-a53 again, so never mind I guess.
Author: jakub
Date: Thu Feb 23 07:49:06 2017
New Revision: 245676

URL: https://gcc.gnu.org/viewcvs?rev=245676&root=gcc&view=rev
Log:
	PR middle-end/79665
	* internal-fn.c (get_range_pos_neg): Moved to ...
	* tree.c (get_range_pos_neg): ... here.  No longer static.
	* tree.h (get_range_pos_neg): New prototype.
	* expr.c (expand_expr_real_2) <case TRUNC_DIV_EXPR>: If both
	arguments are known to be in between 0 and signed maximum
	inclusive, try to expand both unsigned and signed divmod and
	use the cheaper one from those.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/expr.c
    trunk/gcc/internal-fn.c
    trunk/gcc/tree.c
    trunk/gcc/tree.h
Fixed.  If this makes -mcpu=cortex-a53 slower, then it doesn't have the right
cost function.
Hmm, for -mcpu=cortex-a53 one of the divisions ends up not being expanded and
generates an sdiv with this patch, whereas before it used to be expanded.
You can put a breakpoint at the new expr.c code, look at what the unsigned
and signed sequences are, and see what seq_cost GCC computed for each.  If
the costs don't match the hardware, then it can of course choose the worse
sequence.
Looks like the sdiv comes from the % 300 expansion (an sdiv followed by a
multiply-subtract).  Need to figure out why MOD expressions are affected by
this change.
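For reference, that expansion shape corresponds to something like this at the
source level (an illustrative sketch only; the helper name is mine, and the
actual generated code may strength-reduce the division):

    /* Sketch of the sdiv + multiply-subtract shape of a modulo expansion:
       compute the quotient, then subtract quotient * 300 from the dividend. */
    int mod300(int n)
    {
        int q = n / 300;      /* the sdiv */
        return n - q * 300;   /* the multiply-subtract (msub on AArch64) */
    }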
Huh, never mind.  That sdiv was there even before this change; it is
unrelated.  I don't see how there could be a slowdown from this change.
Still, the 5% regression must have happened very recently. The fast gcc was built on 20170220 and the slow one yesterday, using the original patch. Once again, switching away from Cortex-A53 codegen restores the expected performance.
(In reply to PeteVine from comment #13)
> Still, the 5% regression must have happened very recently. The fast gcc was
> built on 20170220 and the slow one yesterday, using the original patch. Once
> again, switching away from Cortex-A53 codegen restores the expected
> performance.

The issue is due to inefficient code generated for unsigned modulo:

        umull   x0, w0, w4
        umull   x1, w1, w4
        lsr     x0, x0, 32
        lsr     x1, x1, 32
        lsr     w0, w0, 6
        lsr     w1, w1, 6

It seems the Cortex-A53 scheduler isn't modelling this correctly.  When I
manually remove the redundant shifts I get a 15% speedup.  I'll have a look.
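The redundancy is that the back-to-back right shifts can fold into a single
one.  A sketch in C of what the folded sequence computes (M stands in for
whatever reciprocal constant is held in w4; the helper name is mine):

    #include <stdint.h>

    /* The two shifts above compute (((uint64_t)n * M) >> 32) >> 6, which for
       unsigned values is the same as a single shift by 38, so one lsr per
       value should suffice. */
    static uint32_t high_part_shifted(uint32_t n, uint32_t M)
    {
        return (uint32_t)(((uint64_t)n * M) >> 38);
    }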
Author: tnfchris
Date: Thu Apr 27 09:58:27 2017
New Revision: 247307

URL: https://gcc.gnu.org/viewcvs?rev=247307&root=gcc&view=rev
Log:
2017-04-26  Tamar Christina  <tamar.christina@arm.com>

	PR middle-end/79665
	* expr.c (expand_expr_real_2): Move TRUNC_MOD_EXPR, FLOOR_MOD_EXPR,
	CEIL_MOD_EXPR, ROUND_MOD_EXPR cases.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/expr.c
(In reply to wilco from comment #14)
> (In reply to PeteVine from comment #13)
> > Still, the 5% regression must have happened very recently. The fast gcc was
> > built on 20170220 and the slow one yesterday, using the original patch. Once
> > again, switching away from Cortex-A53 codegen restores the expected
> > performance.
>
> The issue is due to inefficient code generated for unsigned modulo:
>
>         umull   x0, w0, w4
>         umull   x1, w1, w4
>         lsr     x0, x0, 32
>         lsr     x1, x1, 32
>         lsr     w0, w0, 6
>         lsr     w1, w1, 6
>
> It seems the Cortex-A53 scheduler isn't modelling this correctly. When I
> manually remove the redundant shifts I get a 15% speedup. I'll have a look.

See https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01415.html
(In reply to wilco from comment #16)
> (In reply to wilco from comment #14)
> > (In reply to PeteVine from comment #13)
> > > Still, the 5% regression must have happened very recently. The fast gcc was
> > > built on 20170220 and the slow one yesterday, using the original patch. Once
> > > again, switching away from Cortex-A53 codegen restores the expected
> > > performance.
> >
> > The issue is due to inefficient code generated for unsigned modulo:
> >
> >         umull   x0, w0, w4
> >         umull   x1, w1, w4
> >         lsr     x0, x0, 32
> >         lsr     x1, x1, 32
> >         lsr     w0, w0, 6
> >         lsr     w1, w1, 6
> >
> > It seems the Cortex-A53 scheduler isn't modelling this correctly. When I
> > manually remove the redundant shifts I get a 15% speedup. I'll have a look.
>
> See https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01415.html

The redundant LSRs and SDIV are removed on latest trunk.  Although my patch
above hasn't gone in, I get a 15% speedup on Cortex-A53 with -mcpu=cortex-a53
and 8% with -mcpu=cortex-a72.
Author: tnfchris
Date: Mon May 8 09:45:46 2017
New Revision: 247734

URL: https://gcc.gnu.org/viewcvs?rev=247734&root=gcc&view=rev
Log:
2017-05-08  Tamar Christina  <tamar.christina@arm.com>

	PR middle-end/79665
	* expr.c (expand_expr_real_2): Move TRUNC_MOD_EXPR, FLOOR_MOD_EXPR,
	CEIL_MOD_EXPR, ROUND_MOD_EXPR cases.

Modified:
    branches/gcc-7-branch/gcc/ChangeLog
    branches/gcc-7-branch/gcc/expr.c
Author: wilco
Date: Wed Jun 28 14:13:02 2017
New Revision: 249740

URL: https://gcc.gnu.org/viewcvs?rev=249740&root=gcc&view=rev
Log:
Improve Cortex-A53 shift bypass

The aarch_forward_to_shift_is_not_shifted_reg bypass always returns true
on AArch64 shifted instructions.  This causes the bypass to activate in
too many cases, resulting in slower execution on Cortex-A53 like reported
in PR79665.

This patch uses the arm_no_early_alu_shift_dep condition instead which
improves the example in PR79665 by ~7%.  Given it is no longer used,
remove aarch_forward_to_shift_is_not_shifted_reg.  Also remove an
unnecessary REG_P check.

gcc/
	PR target/79665
	* config/arm/aarch-common.c (arm_no_early_alu_shift_dep):
	Remove redundant if.
	(aarch_forward_to_shift_is_not_shifted_reg): Remove.
	* config/arm/aarch-common-protos.h
	(aarch_forward_to_shift_is_not_shifted_reg): Remove.
	* config/arm/cortex-a53.md: Use arm_no_early_alu_shift_dep in
	bypass.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/aarch-common-protos.h
    trunk/gcc/config/arm/aarch-common.c
    trunk/gcc/config/arm/cortex-a53.md