Bug 79665 - gcc's signed (x*x)/200 is slower than clang's
Summary: gcc's signed (x*x)/200 is slower than clang's
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 7.0
Importance: P3 normal
Target Milestone: ---
Assignee: Jakub Jelinek
URL: https://github.com/rust-lang/rust/iss...
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2017-02-21 21:50 UTC by Josh Stone
Modified: 2017-06-28 14:13 UTC
CC List: 3 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-02-22 00:00:00


Attachments
C source from rust-lang/rust#39446 (498 bytes, text/x-csrc)
2017-02-21 21:50 UTC, Josh Stone
gcc7-pr79665.patch (2.20 KB, patch)
2017-02-22 09:32 UTC, Jakub Jelinek

Description Josh Stone 2017-02-21 21:50:12 UTC
Created attachment 40809
C source from rust-lang/rust#39446

The linked Rust issue initially compares rustc and clang, but I found that GCC is also generating worse code than clang.  Profiling pointed me to these lines:

          x_x = (x * x) / 200;
          y_y = (y * y) / 200;

Clang optimizes these to just 4 instructions each, like:

        movl    %edx, %ebp
        imull   %ebp, %ebp
        imulq   $1374389535, %rbp, %rbp # imm = 0x51EB851F
        shrq    $38, %rbp

But clang with -fwrapv generates 8 instructions each, the same as rustc, and it turns out GCC uses a similarly long sequence regardless of -fwrapv.  GCC 6.3 gives:

        movl    %edi, %r13d
        imull   %edi, %r13d
        movl    %r13d, %eax
        sarl    $31, %r13d
        imull   %ebx
        sarl    $6, %edx
        movl    %edx, %ecx
        subl    %r13d, %ecx

And GCC 7 is the same, although it interleaves the two x and y lines.

Here they are on Compiler Explorer: https://godbolt.org/g/Fy8Fiy

I believe what's happening is that clang identifies that signed (x*x) is never negative, thanks to signed-overflow UB, and then (x*x)/200 is essentially an unsigned division, which requires less fixup than signed division by a constant.
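
A minimal stand-alone C sketch of the two expansions above (an illustration only, not code from either compiler): the magic constant 1374389535 and the total shift of 38 are taken from the clang sequence, and the signed form assumes GCC loads the same constant into %ebx.  For a non-negative dividend both forms compute n / 200; only the signed one needs the extra sign fixup.

    /* Signed vs. unsigned expansion of n / 200 by multiplication with a
       magic constant; values taken from the assembly quoted above.  */
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* GCC-style signed expansion: wide multiply, arithmetic shift by 38,
       then subtract the dividend's sign bit as a fixup (the fixup only
       matters for negative n).  */
    static int32_t sdiv200(int32_t n)
    {
        int32_t hi = (int32_t)(((int64_t)n * 1374389535) >> 38);
        return hi - (n >> 31);
    }

    /* Clang-style unsigned expansion: no fixup, because the dividend is
       known to be non-negative.  */
    static uint32_t udiv200(uint32_t n)
    {
        return (uint32_t)(((uint64_t)n * 1374389535u) >> 38);
    }

    int main(void)
    {
        for (int32_t n = 0; n <= 10000000; n++)
            assert(sdiv200(n) == n / 200
                   && udiv200((uint32_t)n) == (uint32_t)(n / 200));
        puts("both expansions agree with n / 200 for non-negative n");
        return 0;
    }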
Comment 1 Andrew Pinski 2017-02-21 21:53:14 UTC
What options are you using?

Did you try -march=native ?
Comment 2 Jakub Jelinek 2017-02-21 22:06:13 UTC
What we can do IMHO is, in expand_divmod: if we have a division or modulo by a constant and the first operand is known to have a zero MSB (from get_range_info or get_nonzero_bits), then we can expand it as both a signed and an unsigned division/modulo, regardless of what unsignedp is.  Then we could compare the costs of both expansions and pick the faster one.
Comment 3 Josh Stone 2017-02-21 22:12:26 UTC
I'm using just -O3, and then I compared effects with and without -fwrapv to figure out what's going on.  Clang is only faster without -fwrapv.

With -march=native on my Sandy Bridge, a few instructions are placed in a different order, but it's otherwise identical.
Comment 4 Jakub Jelinek 2017-02-22 09:32:14 UTC
Created attachment 40811
gcc7-pr79665.patch

Untested fix.
Comment 5 PeteVine 2017-02-22 21:53:35 UTC
Psst! GCC 7 was already 1.75x faster than Clang 3.8 on my aarch64 machine when I benchmarked this code 3 weeks ago, but with this patch, it seems to take a 5% hit.
Comment 6 PeteVine 2017-02-22 22:04:51 UTC
But that's related to -mcpu=cortex-a53 again, so never mind I guess.
Comment 7 Jakub Jelinek 2017-02-23 07:49:38 UTC
Author: jakub
Date: Thu Feb 23 07:49:06 2017
New Revision: 245676

URL: https://gcc.gnu.org/viewcvs?rev=245676&root=gcc&view=rev
Log:
	PR middle-end/79665
	* internal-fn.c (get_range_pos_neg): Moved to ...
	* tree.c (get_range_pos_neg): ... here.  No longer static.
	* tree.h (get_range_pos_neg): New prototype.
	* expr.c (expand_expr_real_2) <case TRUNC_DIV_EXPR>: If both arguments
	are known to be in between 0 and signed maximum inclusive, try to
	expand both unsigned and signed divmod and use the cheaper one from
	those.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/expr.c
    trunk/gcc/internal-fn.c
    trunk/gcc/tree.c
    trunk/gcc/tree.h
Comment 8 Jakub Jelinek 2017-02-23 07:53:54 UTC
Fixed.  If this makes -mcpu=cortex-a53 slower, then it doesn't have the right cost function.
Comment 9 ktkachov 2017-02-23 09:23:27 UTC
Hmm, for -mcpu=cortex-a53 one of the divisions ends up not being expanded and generates an sdiv with this patch, whereas before it used to be expanded.
Comment 10 Jakub Jelinek 2017-02-23 09:41:24 UTC
You can put a breakpoint at the new expr.c code and see what both the unsigned and signed sequences are, and what seq_cost GCC computed for them.  If the costs don't match the hardware, then it can of course choose a worse sequence.
Comment 11 ktkachov 2017-02-23 09:43:59 UTC
Looks like the sdiv comes from the % 300 expansion (an sdiv followed by a multiply-subtract).  I need to figure out why MOD expressions are affected by this change.
Comment 12 ktkachov 2017-02-23 10:03:16 UTC
Huh, never mind.  That sdiv was there even before this change; it is unrelated.  I don't see how there could be a slowdown from this change.
Comment 13 PeteVine 2017-02-23 15:28:48 UTC
Still, the 5% regression must have happened very recently. The fast gcc was built on 20170220 and the slow one yesterday, using the original patch. Once again, switching away from Cortex-A53 codegen restores the expected performance.
Comment 14 Wilco 2017-04-18 18:59:41 UTC
(In reply to PeteVine from comment #13)
> Still, the 5% regression must have happened very recently. The fast gcc was
> built on 20170220 and the slow one yesterday, using the original patch. Once
> again, switching away from Cortex-A53 codegen restores the expected
> performance.

The issue is due to inefficient code generated for unsigned modulo:

        umull   x0, w0, w4
        umull   x1, w1, w4
        lsr     x0, x0, 32
        lsr     x1, x1, 32
        lsr     w0, w0, 6
        lsr     w1, w1, 6

It seems the Cortex-A53 scheduler isn't modelling this correctly. When I manually remove the redundant shifts I get a 15% speedup. I'll have a look.
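
For reference, the "redundant shifts" here are the back-to-back lsr #32 and lsr #6 on each umull result, which together are just a single logical shift by 38.  A tiny C check of that equivalence (an illustration, not code from the bug):

    /* Two successive unsigned right shifts on a 64-bit product
       (first by 32, then by 6) equal a single shift by 38.  */
    #include <assert.h>
    #include <stdint.h>

    static uint32_t two_shifts(uint64_t prod) { return (uint32_t)((prod >> 32) >> 6); }
    static uint32_t one_shift(uint64_t prod)  { return (uint32_t)(prod >> 38); }

    int main(void)
    {
        for (uint64_t prod = 0; prod < (1ULL << 62); prod += 0x10000000001ULL)
            assert(two_shifts(prod) == one_shift(prod));
        return 0;
    }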
Comment 15 Tamar Christina 2017-04-27 09:58:59 UTC
Author: tnfchris
Date: Thu Apr 27 09:58:27 2017
New Revision: 247307

URL: https://gcc.gnu.org/viewcvs?rev=247307&root=gcc&view=rev
Log:
2017-04-26  Tamar Christina  <tamar.christina@arm.com>

	PR middle-end/79665
	* expr.c (expand_expr_real_2): Move TRUNC_MOD_EXPR, FLOOR_MOD_EXPR,
	CEIL_MOD_EXPR, ROUND_MOD_EXPR cases.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/expr.c
Comment 16 Wilco 2017-04-27 17:40:58 UTC
(In reply to wilco from comment #14)
> (In reply to PeteVine from comment #13)
> > Still, the 5% regression must have happened very recently. The fast gcc was
> > built on 20170220 and the slow one yesterday, using the original patch. Once
> > again, switching away from Cortex-A53 codegen restores the expected
> > performance.
> 
> The issue is due to inefficient code generated for unsigned modulo:
> 
>         umull   x0, w0, w4
>         umull   x1, w1, w4
>         lsr     x0, x0, 32
>         lsr     x1, x1, 32
>         lsr     w0, w0, 6
>         lsr     w1, w1, 6
> 
> It seems the Cortex-A53 scheduler isn't modelling this correctly. When I
> manually remove the redundant shifts I get a 15% speedup. I'll have a look.

See https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01415.html
Comment 17 Wilco 2017-05-05 11:21:07 UTC
(In reply to wilco from comment #16)
> (In reply to wilco from comment #14)
> > (In reply to PeteVine from comment #13)
> > > Still, the 5% regression must have happened very recently. The fast gcc was
> > > built on 20170220 and the slow one yesterday, using the original patch. Once
> > > again, switching away from Cortex-A53 codegen restores the expected
> > > performance.
> > 
> > The issue is due to inefficient code generated for unsigned modulo:
> > 
> >         umull   x0, w0, w4
> >         umull   x1, w1, w4
> >         lsr     x0, x0, 32
> >         lsr     x1, x1, 32
> >         lsr     w0, w0, 6
> >         lsr     w1, w1, 6
> > 
> > It seems the Cortex-A53 scheduler isn't modelling this correctly. When I
> > manually remove the redundant shifts I get a 15% speedup. I'll have a look.
> 
> See https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01415.html

The redundant LSRs and SDIV are removed on latest trunk. Although my patch above hasn't gone in, I get a 15% speedup on Cortex-A53 with -mcpu=cortex-a53 and 8% with -mcpu=cortex-a72.
Comment 18 Tamar Christina 2017-05-08 09:46:19 UTC
Author: tnfchris
Date: Mon May  8 09:45:46 2017
New Revision: 247734

URL: https://gcc.gnu.org/viewcvs?rev=247734&root=gcc&view=rev
Log:
2017-05-08  Tamar Christina  <tamar.christina@arm.com>

        PR middle-end/79665
        * expr.c (expand_expr_real_2): Move TRUNC_MOD_EXPR, FLOOR_MOD_EXPR,
        CEIL_MOD_EXPR, ROUND_MOD_EXPR cases.


Modified:
    branches/gcc-7-branch/gcc/ChangeLog
    branches/gcc-7-branch/gcc/expr.c
Comment 19 Wilco 2017-06-28 14:13:34 UTC
Author: wilco
Date: Wed Jun 28 14:13:02 2017
New Revision: 249740

URL: https://gcc.gnu.org/viewcvs?rev=249740&root=gcc&view=rev
Log:
Improve Cortex-A53 shift bypass

The aarch_forward_to_shift_is_not_shifted_reg bypass always returns true
on AArch64 shifted instructions.  This causes the bypass to activate in
too many cases, resulting in slower execution on Cortex-A53 like reported
in PR79665.

This patch uses the arm_no_early_alu_shift_dep condition instead which
improves the example in PR79665 by ~7%.  Given it is no longer used,
remove aarch_forward_to_shift_is_not_shifted_reg.  Also remove an
unnecessary REG_P check.

    gcc/
	PR target/79665
	* config/arm/aarch-common.c (arm_no_early_alu_shift_dep):
	Remove redundant if.
	(aarch_forward_to_shift_is_not_shifted_reg): Remove.
	* config/arm/aarch-common-protos.h
	(aarch_forward_to_shift_is_not_shifted_reg): Remove.
	* config/arm/cortex-a53.md: Use arm_no_early_alu_shift_dep in bypass.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/aarch-common-protos.h
    trunk/gcc/config/arm/aarch-common.c
    trunk/gcc/config/arm/cortex-a53.md