Bug 59393 - [6/7/8 regression] mips16 code size
Summary: [6/7/8 regression] mips16 code size
Status: RESOLVED WONTFIX
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 4.9.0
Importance: P2 normal
Target Milestone: 6.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2013-12-05 00:56 UTC by Sandra Loosemore
Modified: 2018-02-04 19:28 UTC
CC: 3 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2014-01-31 00:00:00


Attachments
bf_enc.i (677 bytes, text/plain)
2013-12-05 00:56 UTC, Sandra Loosemore

Description Sandra Loosemore 2013-12-05 00:56:53 UTC
Created attachment 31383 [details]
bf_enc.i

The attached test case (part of blowfish) is compiled for MIPS16 with

-Os -DNDEBUG -fno-schedule-insns2 -mno-check-zero-division -fno-common -fsection-anchors -fno-shrink-wrap -ffunction-sections -mips16

In GCC 4.7, this produced 2152 bytes of code.

In GCC 4.8, it produces 2396 bytes.  On mainline head, it's 2384.  In both cases, the code growth is coming from reload blowing up.

I tracked the 4.8 regression down to two distinct changes:

(1) This patch for PR54109
http://gcc.gnu.org/ml/gcc-patches/2012-08/msg00617.html
removed forward-propagation of constant offsets into address expressions.  No subsequent pass performs that optimization, so by the time the code reaches reload it requires one extra register that is live through essentially the whole function.
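
For context, the shape of the code at issue is roughly this (a minimal
sketch modeled on blowfish's key structure; the field sizes are
illustrative, chosen so that the struct offsets match the x86_64 dumps
below, not the exact bf_enc.i layout):

/* The base &key->S[0] is CSEd into one pseudo, and every table access
   then adds a constant offset plus a scaled index, so that pseudo
   stays live across essentially the whole function.  */
struct bf_key
{
  unsigned long P[18];          /* illustrative: offsetof (S) == 144 */
  unsigned long S[4 * 256];
};

unsigned long
lookup (const struct bf_key *key, unsigned i, unsigned j)
{
  const unsigned long *s = &key->S[0];  /* key + offsetof (struct bf_key, S) */
  return s[256 + i] ^ s[512 + j];       /* constant offsets no longer
                                           propagated into the addresses */
}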

(2) The hoist pass seems to be significantly underestimating register pressure and adds 9 more pseudos with long lifetimes.  (This might be a red herring, but why is it only considering GR_REGS as a pressure class and not M16_REGS?)

By reverting the above patch and disabling the hoist pass, I was able to get the same code size on 4.8 as on 4.7.  On mainline head there's something else going on as well, as those two changes brought the code size down only halfway, to 2280 bytes.  I haven't yet analyzed where the remaining bad code is coming from, but if I had to make a wild stab in the dark, I'd guess that if the register-pressure calculation is wrong in hoist, it may be wrong in other places as well.

In any case, reverting a patch that fixes a correctness bug and disabling the hoist pass is clearly not an acceptable solution.  Any suggestions on the "right" way to fix the two already-identified problems?
Comment 1 Richard Biener 2013-12-05 10:51:03 UTC
  <bb 2>:
  s_4 = &key_3(D)->S[0];
...
  _15 = _14 * 8;
  _16 = s_4 + _15;
  _17 = *_16;
...
  _21 = _20 * 8;
  _22 = s_4 + _21;
  _23 = *_22;
...

formerly we'd have created

  _17 = MEM[key_3].S[_14];
...
  _23 = MEM[key_3].S[_20];

which isn't a valid transform.  That eventually gets us better addressing-mode
selection?  At RTL this probably (I didn't verify) re-associates the
key_3 + offsetof(S) + index * 8 expression into a more suitable form and
by-passes the multiple-use restriction of combine (forwprop here un-CSEs
key_3 + offsetof(S)).

In a loop IVOPTs would be the one to utilize target addressing mode
information and eventually generate a TARGET_MEM_REF.  In non-loops
we have SLSR (gimple-ssa-strength-reduction.c) that could serve as a
vehicle to generate TARGET_MEM_REFs (it doesn't).

In the end I would point at RTL forwprop, which is supposed to improve
addressing-mode selection.  At least on x86_64 I see

        leaq    144(%rsi), %rax
...
        xorq    4096(%rax,%rbx,8), %r8
        addl    6144(%rax,%r9,8), %r8d

as well (and %rsi stays live too), instead of folding the 144 into
the dereference offset.
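
Folding the 144 into the dereference offsets would instead give
something like this (hand-written, not a compiler dump; 144 + 4096 =
4240 and 144 + 6144 = 6288), making the leaq and the extra live
register unnecessary:

        xorq    4240(%rsi,%rbx,8), %r8
        addl    6288(%rsi,%r9,8), %r8d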

forwprop sees

(insn 8 5 9 2 (parallel [
            (set (reg/v/f:DI 85 [ s ])
                (plus:DI (reg/v/f:DI 991 [ key ])
                    (const_int 144 [0x90])))
            (clobber (reg:CC 17 flags))
...
(insn 20 19 21 3 (set (reg:DI 998 [ *_22 ])
        (mem:DI (plus:DI (mult:DI (reg:DI 995)
                    (const_int 8 [0x8]))
                (reg/v/f:DI 85 [ s ])) [2 *_22+0 S8 A64]))
...

and then combine folds in an additional addition:

Trying 18 -> 20:
Successfully matched this instruction:
(set (reg:DI 998 [ *_22 ])
    (mem:DI (plus:DI (plus:DI (mult:DI (reg:DI 994 [ D.1883 ])
                    (const_int 8 [0x8]))
                (reg/v/f:DI 85 [ s ]))
            (const_int 2048 [0x800])) [2 *_22+0 S8 A64]))

but of course doesn't consider insn 8 (it's cross basic-block and it has
multiple uses).

Now there isn't any further forwprop pass after combine (which would maybe
fold in the addition at that point - not sure).  Certainly ira/lra/reload do
not consider materializing the def in place instead of spilling it for you,
either.

Not sure what the situation is on mips16, but in the end the RTL optimizers
are supposed to fix up anything related to addressing-mode selection.
Comment 2 Richard Biener 2014-01-31 11:15:51 UTC
Confirmed - mips16 is not the only target affected.
Comment 3 Richard Biener 2014-05-22 09:01:19 UTC
GCC 4.8.3 is being released, adjusting target milestone.
Comment 4 Jakub Jelinek 2014-12-19 13:27:58 UTC
GCC 4.8.4 has been released.
Comment 5 Richard Biener 2015-06-23 08:17:36 UTC
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
Comment 6 Jakub Jelinek 2015-06-26 19:53:49 UTC
GCC 4.9.3 has been released.
Comment 7 Jeffrey A. Law 2016-03-31 23:48:13 UTC
I was looking at this and noticed we have several sequences like

  _18 = l_11 >> 16;
  _19 = _18 & 255;
  _20 = _19 + 256;
  _21 = _20 * 8;

There are variations in the constants, but the pattern repeats regularly.  My first thought was to rewrite that as

 _18 = l_11 >> 13;
 _19 = _18 & 0x7f8;
 _20 = _19 + 0x800;

That seemed to be slightly worse on x86_64.  I'd already noticed that the addition was setting bits we knew to be zero, so it could be rewritten using an IOR like this:


 _18 = l_11 >> 13;
 _19 = _18 & 0x7f8;
 _20 = _19 | 0x800;
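
For the record, here's a minimal standalone check (mine, not from the
testcase) that the three forms compute the same value, assuming 32-bit
unsigned arithmetic as in the GIMPLE above:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int
main (void)
{
  /* Sample the 32-bit input space.  The identity holds for every input:
     ((v >> 16) & 255) * 8 == (v >> 13) & 0x7f8, and since 0x7f8 and
     0x800 share no bits, + and | agree on the final step.  */
  for (uint64_t l = 0; l <= UINT32_MAX; l += 9973)
    {
      uint32_t a = ((((uint32_t) l >> 16) & 255) + 256) * 8;
      uint32_t b = (((uint32_t) l >> 13) & 0x7f8) + 0x800;
      uint32_t c = (((uint32_t) l >> 13) & 0x7f8) | 0x800;
      assert (a == b && b == c);
    }
  puts ("all three forms agree");
  return 0;
}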

In isolation, that looked good on x86_64.  So my thought was that we might have a gcc-7 improvement that could be made for this code.  But then I coded up a quick pattern in match.pd, tested it, and the resulting assembly code was considerably worse on x86_64 for the benchmark code.

There are a couple of things in play here on x86_64.  In the benchmark code these are address computations.  The *8 and +256 in the original sequence can be part of the effective address in the memory reference.  Furthermore, the masking is a 2-byte movzbl in the original sequence, but a 3-byte and-immediate in the later sequences.  This negates all the gain from using IOR instead of PLUS, which was shorter for x86_64.

mips16 does slightly better with the second sequence, saving ~76 bytes on the included testcase.

However, given how highly dependent this is on the target's addressing modes, match.pd is probably not the place to attack this problem.  Combine is likely a better place, using either a generic splitting sequence that self-tunes via rtx_cost or a target-specific splitter.

The closest we get right now is this combine attempt:

(set (reg:SI 1077)
    (plus:SI (ashift:SI (and:SI (lshiftrt:SI (reg:SI 1073)
                    (const_int 8 [0x8]))
                (reg:SI 1074))
            (const_int 2 [0x2]))
        (const_int 1024 [0x400])))


reg:SI 1074 is (const_int 255), but we can't blindly substitute it in because reg 1074 has other uses, as seen by this attempt:

(parallel [
        (set (reg:SI 1077)
            (plus:SI (and:SI (ashift:SI (reg:SI 1072)
                        (const_int 2 [0x2]))
                    (const_int 1020 [0x3fc]))
                (const_int 1024 [0x400])))
        (set (reg:SI 1074)
            (const_int 255 [0xff]))
    ])
Comment 8 rguenther@suse.de 2016-04-01 07:26:16 UTC
On Thu, 31 Mar 2016, law at redhat dot com wrote:

> [...]  Furthermore, the masking is a 2-byte movzbl in the original
> sequence, but a 3-byte and-immediate in the later sequences.  This
> negates all the gain from using IOR instead of PLUS, which was shorter
> for x86_64.

Yeah, I think we have several fold-const.c pieces that try to make sure
to preserve / create shifts / ands that match mode widths.
 
> However, given how highly dependent this is on the target's addressing
> modes, match.pd is probably not the place to attack this problem.
> Combine is likely a better place, using either a generic splitting
> sequence that self-tunes via rtx_cost or a target-specific splitter.

True, though the idea of having target-specific match.pd bits is still
on the table - we'd have something like config/$arch/$arch.pd which we
could include from match.pd, and we could guard those patterns with
something like compile-phase == pre-RTL-expand so they get enabled only
in late GIMPLE (after loop opts).  We'd add those mainly to remove
complexity from expand and its reliance on TER to see complex
expressions for better initial instruction selection.

> [...]  reg:SI 1074 is (const_int 255), but we can't blindly substitute
> it in because reg 1074 has other uses.

Yeah, the multi-use restriction in combine is a serious limitation.
OTOH we face a similar issue in GIMPLE forwprop and all those
"artificial" single_use tests in match.pd - to do better, the pattern
detection / replacement would need to be done with a cost model that
covers all pattern applications (at least all that have uses in
common).
Comment 9 Richard Biener 2016-08-03 11:32:33 UTC
GCC 4.9 branch is being closed
Comment 10 Jakub Jelinek 2017-10-10 13:26:59 UTC
GCC 5 branch is being closed
Comment 11 Aldy Hernandez 2018-02-04 12:31:08 UTC
I have chatted with Sandra (the original reporter), who has confirmed that the code size for the original testcase on current mainline is 2296 bytes on MIPS16.  This is better than the reported regression, but still not as good as 4.7.

The reporter is OK with closing the PR since they no longer care about MIPS16.  However, as far as I understand, this may affect other architectures.

Is this at all on our radar to fix, or should we either (a) close it as WONTFIX or (b) move the target milestone to GCC 9?

Thanks.
Comment 12 Jeffrey A. Law 2018-02-04 15:50:31 UTC
One more tidbit here.  I noted in c#7 that we got "reasonably" close to having enough state in the combiner to attack this.  The problem is that there's a REG object that we really need to be a CONST_INT.

It turns out that we're failing to substitute the value from a REG_EQUAL note as combine builds up that hunk of RTL.  So if we were to substitute in the REG_EQUAL value, we'd have a fighting chance to address this in the combiner.
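
Concretely, with (const_int 255) substituted for reg 1074, the failed
attempt from c#7 would no longer need the parallel that preserves reg
1074's other uses and could collapse to something like this
(hand-written sketch, not a compiler dump):

(set (reg:SI 1077)
    (plus:SI (and:SI (ashift:SI (reg:SI 1072)
                (const_int 2 [0x2]))
            (const_int 1020 [0x3fc]))
        (const_int 1024 [0x400])))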
Comment 13 Aldy Hernandez 2018-02-04 16:08:22 UTC
Do we care, though?  Does this bug pose a big enough problem on
non-MIPS16 targets that we would like it addressed?  Just curious...

Comment 14 Jeffrey A. Law 2018-02-04 19:21:19 UTC
I suspect we could likely show a similar code-size and performance regression on other targets.  ppc & arm come to mind.  aarch64 as well, though it wouldn't be a regression there, since the port didn't exist when this was first reported.

mips16?  Na, I don't care too much about that :-)
Comment 15 Aldy Hernandez 2018-02-04 19:28:07 UTC
If we care about this on non-MIPS16 targets, we can open a new PR and reference this one, perhaps with a less target-specific title :).