Bug 79665 - gcc's signed (x*x)/200 is slower than clang's
Summary: gcc's signed (x*x)/200 is slower than clang's
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 7.0
Importance: P3 normal
Target Milestone: ---
Assignee: Jakub Jelinek
URL: https://github.com/rust-lang/rust/iss...
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2017-02-21 21:50 UTC by Josh Stone
Modified: 2017-06-28 14:13 UTC
CC List: 3 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-02-22 00:00:00


Attachments
C source from rust-lang/rust#39446 (498 bytes, text/x-csrc)
2017-02-21 21:50 UTC, Josh Stone
gcc7-pr79665.patch (2.20 KB, patch)
2017-02-22 09:32 UTC, Jakub Jelinek

Description Josh Stone 2017-02-21 21:50:12 UTC
Created attachment 40809
C source from rust-lang/rust#39446

The linked Rust issue initially compares rustc and clang, but I found that GCC is also generating worse code than clang.  Profiling pointed me to these lines:

          x_x = (x * x) / 200;
          y_y = (y * y) / 200;

Clang optimizes these to just 4 instructions each, like:

        movl    %edx, %ebp
        imull   %ebp, %ebp
        imulq   $1374389535, %rbp, %rbp # imm = 0x51EB851F
        shrq    $38, %rbp

But clang with -fwrapv generates 8 instructions each, the same as rustc, and it turns out GCC uses a similarly long sequence regardless of -fwrapv.  GCC 6.3 gives:

        movl    %edi, %r13d
        imull   %edi, %r13d
        movl    %r13d, %eax
        sarl    $31, %r13d
        imull   %ebx
        sarl    $6, %edx
        movl    %edx, %ecx
        subl    %r13d, %ecx

And GCC 7 is the same, although it interleaves the two x and y lines.

Here they are on Compiler Explorer: https://godbolt.org/g/Fy8Fiy

I believe what's happening is that clang identifies that signed (x*x) is never negative, thanks to signed-overflow UB, and then (x*x)/200 is essentially an unsigned division, which requires less fixup than signed division by a constant.
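
A minimal stand-alone C sketch of the two expansions above (an illustration only, not code from either compiler): the magic constant 1374389535 and the total shift of 38 are taken from the clang sequence, and the signed form assumes GCC loads the same constant into %ebx.  For a non-negative dividend both forms compute n / 200; only the signed one needs the extra sign fixup.

    /* Signed vs. unsigned expansion of n / 200 by multiplication with a
       magic constant; values taken from the assembly quoted above.  */
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* GCC-style signed expansion: wide multiply, arithmetic shift by 38,
       then subtract the dividend's sign bit as a fixup (the fixup only
       matters for negative n).  */
    static int32_t sdiv200(int32_t n)
    {
        int32_t hi = (int32_t)(((int64_t)n * 1374389535) >> 38);
        return hi - (n >> 31);
    }

    /* Clang-style unsigned expansion: no fixup, because the dividend is
       known to be non-negative.  */
    static uint32_t udiv200(uint32_t n)
    {
        return (uint32_t)(((uint64_t)n * 1374389535u) >> 38);
    }

    int main(void)
    {
        for (int32_t n = 0; n <= 10000000; n++)
            assert(sdiv200(n) == n / 200
                   && udiv200((uint32_t)n) == (uint32_t)(n / 200));
        puts("both expansions agree with n / 200 for non-negative n");
        return 0;
    }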
Comment 1 Andrew Pinski 2017-02-21 21:53:14 UTC
What options are you using?

Did you try -march=native ?
Comment 2 Jakub Jelinek 2017-02-21 22:06:13 UTC
What we can do IMHO is, in expand_divmod: if we have a division or modulo by a constant and the first operand is known to have a zero MSB (from get_range_info or get_nonzero_bits), then we can expand it as both a signed and an unsigned division/modulo, regardless of what unsignedp is.  Then we could compare the costs of both expansions and pick the faster one.
Comment 3 Josh Stone 2017-02-21 22:12:26 UTC
I'm using just -O3, and then I compared effects with and without -fwrapv to figure out what's going on.  Clang is only faster without -fwrapv.

With -march=native on my Sandy Bridge, a few instructions are placed in a different order, but it's otherwise identical.
Comment 4 Jakub Jelinek 2017-02-22 09:32:14 UTC
Created attachment 40811
gcc7-pr79665.patch

Untested fix.
Comment 5 PeteVine 2017-02-22 21:53:35 UTC
Psst! GCC 7 was already 1.75x faster than Clang 3.8 on my aarch64 machine when I benchmarked this code 3 weeks ago, but with this patch, it seems to take a 5% hit.
Comment 6 PeteVine 2017-02-22 22:04:51 UTC
But that's related to -mcpu=cortex-a53 again, so never mind I guess.
Comment 7 Jakub Jelinek 2017-02-23 07:49:38 UTC
Author: jakub
Date: Thu Feb 23 07:49:06 2017
New Revision: 245676

URL: https://gcc.gnu.org/viewcvs?rev=245676&root=gcc&view=rev
Log:
	PR middle-end/79665
	* internal-fn.c (get_range_pos_neg): Moved to ...
	* tree.c (get_range_pos_neg): ... here.  No longer static.
	* tree.h (get_range_pos_neg): New prototype.
	* expr.c (expand_expr_real_2) <case TRUNC_DIV_EXPR>: If both arguments
	are known to be in between 0 and signed maximum inclusive, try to
	expand both unsigned and signed divmod and use the cheaper one from
	those.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/expr.c
    trunk/gcc/internal-fn.c
    trunk/gcc/tree.c
    trunk/gcc/tree.h
Comment 8 Jakub Jelinek 2017-02-23 07:53:54 UTC
Fixed.  If this makes -mcpu=cortex-a53 slower, then it doesn't have the right cost function.
Comment 9 ktkachov 2017-02-23 09:23:27 UTC
Hmm, for -mcpu=cortex-a53 one of the divisions ends up not being expanded and generates an sdiv with this patch, whereas before it used to be expanded.
Comment 10 Jakub Jelinek 2017-02-23 09:41:24 UTC
You can put a breakpoint at the new expr.c code and see what both the unsigned and signed sequences are, and what seq_cost GCC computed for them.  If the costs don't match the hardware, then it can of course choose a worse sequence.
Comment 11 ktkachov 2017-02-23 09:43:59 UTC
Looks like the sdiv comes from the % 300 expansion (an sdiv followed by a multiply-subtract).  I need to figure out why MOD expressions are affected by this change.
Comment 12 ktkachov 2017-02-23 10:03:16 UTC
Huh, never mind.  That sdiv was there even before this change; it is unrelated.  I don't see how there could be a slowdown from this change.
Comment 13 PeteVine 2017-02-23 15:28:48 UTC
Still, the 5% regression must have happened very recently. The fast gcc was built on 20170220 and the slow one yesterday, using the original patch. Once again, switching away from Cortex-A53 codegen restores the expected performance.
Comment 14 Wilco 2017-04-18 18:59:41 UTC
(In reply to PeteVine from comment #13)
> Still, the 5% regression must have happened very recently. The fast gcc was
> built on 20170220 and the slow one yesterday, using the original patch. Once
> again, switching away from Cortex-A53 codegen restores the expected
> performance.

The issue is due to inefficient code generated for unsigned modulo:

        umull   x0, w0, w4
        umull   x1, w1, w4
        lsr     x0, x0, 32
        lsr     x1, x1, 32
        lsr     w0, w0, 6
        lsr     w1, w1, 6

It seems the Cortex-A53 scheduler isn't modelling this correctly. When I manually remove the redundant shifts I get a 15% speedup. I'll have a look.
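
For reference, the "redundant shifts" here are the back-to-back lsr #32 and lsr #6 on each umull result, which together are just a single logical shift by 38.  A tiny C check of that equivalence (an illustration, not code from the bug):

    /* Two successive unsigned right shifts on a 64-bit product
       (first by 32, then by 6) equal a single shift by 38.  */
    #include <assert.h>
    #include <stdint.h>

    static uint32_t two_shifts(uint64_t prod) { return (uint32_t)((prod >> 32) >> 6); }
    static uint32_t one_shift(uint64_t prod)  { return (uint32_t)(prod >> 38); }

    int main(void)
    {
        for (uint64_t prod = 0; prod < (1ULL << 62); prod += 0x10000000001ULL)
            assert(two_shifts(prod) == one_shift(prod));
        return 0;
    }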
Comment 15 Tamar Christina 2017-04-27 09:58:59 UTC
Author: tnfchris
Date: Thu Apr 27 09:58:27 2017
New Revision: 247307

URL: https://gcc.gnu.org/viewcvs?rev=247307&root=gcc&view=rev
Log:
2017-04-26  Tamar Christina  <tamar.christina@arm.com>

	PR middle-end/79665
	* expr.c (expand_expr_real_2): Move TRUNC_MOD_EXPR, FLOOR_MOD_EXPR,
	CEIL_MOD_EXPR, ROUND_MOD_EXPR cases.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/expr.c
Comment 16 Wilco 2017-04-27 17:40:58 UTC
(In reply to wilco from comment #14)
> (In reply to PeteVine from comment #13)
> > Still, the 5% regression must have happened very recently. The fast gcc was
> > built on 20170220 and the slow one yesterday, using the original patch. Once
> > again, switching away from Cortex-A53 codegen restores the expected
> > performance.
> 
> The issue is due to inefficient code generated for unsigned modulo:
> 
>         umull   x0, w0, w4
>         umull   x1, w1, w4
>         lsr     x0, x0, 32
>         lsr     x1, x1, 32
>         lsr     w0, w0, 6
>         lsr     w1, w1, 6
> 
> It seems the Cortex-A53 scheduler isn't modelling this correctly. When I
> manually remove the redundant shifts I get a 15% speedup. I'll have a look.

See https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01415.html
Comment 17 Wilco 2017-05-05 11:21:07 UTC
(In reply to wilco from comment #16)
> (In reply to wilco from comment #14)
> > (In reply to PeteVine from comment #13)
> > > Still, the 5% regression must have happened very recently. The fast gcc was
> > > built on 20170220 and the slow one yesterday, using the original patch. Once
> > > again, switching away from Cortex-A53 codegen restores the expected
> > > performance.
> > 
> > The issue is due to inefficient code generated for unsigned modulo:
> > 
> >         umull   x0, w0, w4
> >         umull   x1, w1, w4
> >         lsr     x0, x0, 32
> >         lsr     x1, x1, 32
> >         lsr     w0, w0, 6
> >         lsr     w1, w1, 6
> > 
> > It seems the Cortex-A53 scheduler isn't modelling this correctly. When I
> > manually remove the redundant shifts I get a 15% speedup. I'll have a look.
> 
> See https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01415.html

The redundant LSRs and SDIV are removed on latest trunk. Although my patch above hasn't gone in, I get a 15% speedup on Cortex-A53 with -mcpu=cortex-a53 and 8% with -mcpu=cortex-a72.
Comment 18 Tamar Christina 2017-05-08 09:46:19 UTC
Author: tnfchris
Date: Mon May  8 09:45:46 2017
New Revision: 247734

URL: https://gcc.gnu.org/viewcvs?rev=247734&root=gcc&view=rev
Log:
2017-05-08  Tamar Christina  <tamar.christina@arm.com>

        PR middle-end/79665
        * expr.c (expand_expr_real_2): Move TRUNC_MOD_EXPR, FLOOR_MOD_EXPR,
        CEIL_MOD_EXPR, ROUND_MOD_EXPR cases.


Modified:
    branches/gcc-7-branch/gcc/ChangeLog
    branches/gcc-7-branch/gcc/expr.c
Comment 19 Wilco 2017-06-28 14:13:34 UTC
Author: wilco
Date: Wed Jun 28 14:13:02 2017
New Revision: 249740

URL: https://gcc.gnu.org/viewcvs?rev=249740&root=gcc&view=rev
Log:
Improve Cortex-A53 shift bypass

The aarch_forward_to_shift_is_not_shifted_reg bypass always returns true
on AArch64 shifted instructions.  This causes the bypass to activate in
too many cases, resulting in slower execution on Cortex-A53 like reported
in PR79665.

This patch uses the arm_no_early_alu_shift_dep condition instead which
improves the example in PR79665 by ~7%.  Given it is no longer used,
remove aarch_forward_to_shift_is_not_shifted_reg.  Also remove an
unnecessary REG_P check.

    gcc/
	PR target/79665
	* config/arm/aarch-common.c (arm_no_early_alu_shift_dep):
	Remove redundant if.
	(aarch_forward_to_shift_is_not_shifted_reg): Remove.
	* config/arm/aarch-common-protos.h
	(aarch_forward_to_shift_is_not_shifted_reg): Remove.
	* config/arm/cortex-a53.md: Use arm_no_early_alu_shift_dep in bypass.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/aarch-common-protos.h
    trunk/gcc/config/arm/aarch-common.c
    trunk/gcc/config/arm/cortex-a53.md