Bug 70341 - [9/10/11/12 Regression] cost model for addresses is incorrect, slsr is using reg + reg + CST for arm
Summary: [9/10/11/12 Regression] cost model for addresses is incorrect, slsr is using ...
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 5.3.0
: P2 normal
Target Milestone: 9.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2016-03-21 16:57 UTC by Fredrik Hederstierna
Modified: 2021-06-01 08:07 UTC (History)
5 users (show)

See Also:
Host:
Target: arm
Build:
Known to work: 4.8.5
Known to fail: 4.9.3, 5.3.0, 6.0
Last reconfirmed: 2016-03-23 00:00:00


Attachments
test_switch.c (217 bytes, text/plain)
2016-03-21 16:57 UTC, Fredrik Hederstierna
Details
switch.c (274 bytes, text/plain)
2016-03-21 17:11 UTC, Fredrik Hederstierna
Details
gcc9-pr70341.patch (1.44 KB, patch)
2019-02-21 14:41 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Fredrik Hederstierna 2016-03-21 16:57:41 UTC
Created attachment 38049 [details]
test_switch.c

Starting with GCC-4.9.x suboptimal code is generated for this switch-statement.
Toolchain arm-none-eabi.

extern void handle_case_1(int name);
extern void handle_case_2(int name);
extern void handle_case_3(int name);
extern void handle_case_4(int name);

struct item_s
{
  int index;
  int type;
  int name;
  int data;
};

struct table_s
{
  struct item_s items[1];
};

void test(struct table_s *table, int xi)
{
  struct item_s *item = &(table->items[xi]);
  switch (item->type)
    {
    case 1:
      handle_case_1(item->name);
      break;
    case 2:
      handle_case_2(item->name);
      break;
    case 3:
      handle_case_3(item->name);
      break;
    case 4:
      handle_case_4(item->name);
      break;
    }
}

Compiled with gcc-4.6.x, gcc-4.7.x, gcc-4.8.x:

00000000 <test>:
   0:   eb00 1101       add.w   r1, r0, r1, lsl #4
   4:   684b            ldr     r3, [r1, #4]
   6:   3b01            subs    r3, #1
   8:   2b03            cmp     r3, #3
   a:   d80f            bhi.n   2c <test+0x2c>
   c:   e8df f003       tbb     [pc, r3]
  10:   0b080502        bleq    201420 <test+0x201420>
  14:   6888            ldr     r0, [r1, #8]
  16:   f7ff bffe       b.w     0 <handle_case_1>
  1a:   6888            ldr     r0, [r1, #8]
  1c:   f7ff bffe       b.w     0 <handle_case_2>
  20:   6888            ldr     r0, [r1, #8]
  22:   f7ff bffe       b.w     0 <handle_case_3>
  26:   6888            ldr     r0, [r1, #8]
  28:   f7ff bffe       b.w     0 <handle_case_4>
  2c:   4770            bx      lr

Compiled with 4.9.x, 5.3.0, and with current master:

00000000 <test>:
   0:   0109            lsls    r1, r1, #4
   2:   1843            adds    r3, r0, r1
   4:   685b            ldr     r3, [r3, #4]
   6:   3b01            subs    r3, #1
   8:   2b03            cmp     r3, #3
   a:   d813            bhi.n   34 <test+0x34>
   c:   e8df f003       tbb     [pc, r3]
  10:   0e0a0602        cfmadd32eq      mvax0, mvfx0, mvfx10, mvfx2
  14:   4408            add     r0, r1
  16:   6880            ldr     r0, [r0, #8]
  18:   f7ff bffe       b.w     0 <handle_case_1>
  1c:   4408            add     r0, r1
  1e:   6880            ldr     r0, [r0, #8]
  20:   f7ff bffe       b.w     0 <handle_case_2>
  24:   4408            add     r0, r1
  26:   6880            ldr     r0, [r0, #8]
  28:   f7ff bffe       b.w     0 <handle_case_3>
  2c:   4408            add     r0, r1
  2e:   6880            ldr     r0, [r0, #8]
  30:   f7ff bffe       b.w     0 <handle_case_4>
  34:   4770            bx      lr

Flags: -mcpu=cortex-m3
Both -Os and -O2 gives increased code size.
Comment 1 Fredrik Hederstierna 2016-03-21 17:11:49 UTC
Created attachment 38051 [details]
switch.c

Added additional slightly larger test case that also show problems for cortex-m0 and thumb1.
Comment 2 Richard Biener 2016-03-22 08:57:30 UTC
Possibly caused by performing switch lowering in GIMPLE switch conversion.
Comment 3 Ramana Radhakrishnan 2016-03-23 13:38:12 UTC
Confirmed.
Comment 4 Andrew Pinski 2016-07-31 02:17:46 UTC
(In reply to Richard Biener from comment #2)
> Possibly caused by performing switch lowering in GIMPLE switch conversion.

It can't be because the resulting assembly still uses a case table.
The problem is the add happens on each of the cases rather than before.
  14:   4408            add     r0, r1
...
  1c:   4408            add     r0, r1
...
  24:   4408            add     r0, r1
...
  2c:   4408            add     r0, r1


The correct thing happens on aarch64-linux-gnu (that is the add does not show up in all of the cases but hoisted before the switch).
Comment 5 Andrew Pinski 2016-07-31 02:21:56 UTC
I am suspecting what is happening on arm is the cost model for addresses is incorrect and allowing [r1 + r2 + const] so slsr is picking that.

This is why it works correctly on aarch64 because the cost model is correct and rejects that mode.
Comment 6 Richard Biener 2016-08-03 11:56:53 UTC
GCC 4.9 branch is being closed
Comment 7 Fredrik Hederstierna 2016-08-28 16:20:37 UTC
Tested with GCC 6.2 and still same behaviour.
Comment 8 Fredrik Hederstierna 2016-09-04 20:48:28 UTC
Could it be something in tree-ssa-forwprop pass?

This pass is executed 4 times in -Os, and starting with GCC-4.9 it seems like statements that seems to generate instructions that are hard to eliminate later in compilation could be propagated?


Checking 4.8.5 (good result)

;; Function test (test, funcdef_no=0, decl_uid=4075, cgraph_uid=0)

test (struct table_s * table, int xi)
{
  struct item_s * item;
  int _6;
  int _7;
  int _9;
  int _11;
  int _13;

  <bb 2>:
  item_4 = &table_2(D)->items[xi_3(D)];
  _6 = item_4->type;
  switch (_6) <default: <L4>, case 1: <L0>, case 2: <L1>, case 3: <L2>, case 4: <L3>>

<L0>:
  _7 = item_4->name;
  handle_case_1 (_7);
  goto <bb 7> (<L4>);

<L1>:
  _9 = item_4->name;
  handle_case_2 (_9);
  goto <bb 7> (<L4>);

<L2>:
  _11 = item_4->name;
  handle_case_3 (_11);
  goto <bb 7> (<L4>);

<L3>:
  _13 = item_4->name;
  handle_case_4 (_13);

<L4>:
  return;

}

-------------------

Same code GCC 4.9.3 (bad result)

;; Function test (test, funcdef_no=0, decl_uid=4090, symbol_order=0)

test (struct table_s * table, int xi)
{
  struct item_s * item;
  int _6;
  int _7;
  int _9;
  int _11;
  int _13;

  <bb 2>:
  _6 = MEM[(struct item_s *)table_2(D)].items[xi_3(D)].type;
  switch (_6) <default: <L4>, case 1: <L0>, case 2: <L1>, case 3: <L2>, case 4: <L3>>

<L0>:
  _7 = MEM[(struct item_s *)table_2(D)].items[xi_3(D)].name;
  handle_case_1 (_7);
  goto <bb 7> (<L4>);

<L1>:
  _9 = MEM[(struct item_s *)table_2(D)].items[xi_3(D)].name;
  handle_case_2 (_9);
  goto <bb 7> (<L4>);

<L2>:
  _11 = MEM[(struct item_s *)table_2(D)].items[xi_3(D)].name;
  handle_case_3 (_11);
  goto <bb 7> (<L4>);

<L3>:
  _13 = MEM[(struct item_s *)table_2(D)].items[xi_3(D)].name;
  handle_case_4 (_13);

<L4>:
  return;

}


When comparing the two versions of "tree-ssa-forwprop.c" it seems like sometimes there are removed checks for invariant code, eg. in forward_propagate_addr_expr_1() there are no checks for is_gimple_min_invariant(), but I'm not familiar with this code is it might just fine, it was just an observation, since it seems like invariant code are propagated here, but maybe I'm wrong.

When compiling for speed there is less downside in copying statements, its like loop unrolling or peeling, but when optimizing for size its always bad if later passes cannot eliminate these extra copied instructions again I think. Maybe propagation should be more restrictive when optimizing for size?
I did not see any checks for optimization type in forwardpropagation, but maybe its intended to only do zero-cost propagations?

Thank & BR/Fredrik
Comment 9 Andrew Pinski 2016-09-05 07:23:03 UTC
(In reply to Fredrik Hederstierna from comment #8)
> Could it be something in tree-ssa-forwprop pass?
> 
> This pass is executed 4 times in -Os, and starting with GCC-4.9 it seems
> like statements that seems to generate instructions that are hard to
> eliminate later in compilation could be propagated?

No that is correct for this pass, this was done on purpose to allow for better optimizations and better aliasing info.

Anyways if we change case 4: to default: for GCC 7, we get even better code on aarch64 than previous versions (even at -O2):
        add     x0, x0, x1, sxtw 4
        ldp     w2, w0, [x0, 4]
        cmp     w2, 2
        beq     .L3
        cmp     w2, 3
        beq     .L4
        cmp     w2, 1
        beq     .L8
        b       handle_case_4
        .p2align 3
.L8:
        b       handle_case_1
        .p2align 3
.L4:
        b       handle_case_3
        .p2align 3
.L3:
        b       handle_case_2
Comment 10 Fredrik Hederstierna 2016-09-05 08:32:45 UTC
Can it be related to some missing code hoisting in eg. combiner or gcse?
I found this old PR 11832 on similar issue, can it be related, or have a common solution?

Bug 11832 - Optimization of common stores in switch statements
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=11832
BR/Fredrik
Comment 11 Jeffrey A. Law 2017-02-05 16:27:48 UTC
I wonder why the old GCSE code hoisting pass doesn't pick this up and pull those loads up the CFG.   This seems like exactly what it was designed to handle.
Comment 12 Jakub Jelinek 2017-10-10 13:25:25 UTC
GCC 5 branch is being closed
Comment 13 Jakub Jelinek 2018-10-26 10:08:17 UTC
GCC 6 branch is being closed
Comment 14 Jakub Jelinek 2019-02-21 12:51:40 UTC
With additional default: __builtin_unreachable (); this gets somewhat optimized 
into:
        add     r0, r0, r1
        cmp     r3, #3
        ldrls   pc, [pc, r3, asl #2]
        b       .L3
.L4:
        .word   .L7
        .word   .L6
        .word   .L5
        .word   .L3
.L5:
        ldr     r0, [r0, #8]
        b       handle_case_3
.L6:
        ldr     r0, [r0, #8]
        b       handle_case_2
.L7:
        ldr     r0, [r0, #8]
        b       handle_case_1
.L3:
        ldr     r0, [r0, #8]
        b       handle_case_4
so add r0, r0, r1 is hoisted by the jump2 pass, but strangely it doesn't happen for the ldr r0, [r0, #8] instruction.
On x86_64-linux with -O2, we also get:
        movl    8(%rsi), %edi
        jmp     handle_case_2
...
        movl    8(%rsi), %edi
        jmp     handle_case_4
etc. without the default: and the jump2 pass hoists those movl 8(%rsi), %edi instructions before the switch.
Guess the reason why jump2 doesn't hoist anything without default: __builtin_unreachable (); is that the instruction(s) are not common to all the paths, so it would be a speculative execution, which still is a win for -Os.
Though, on x86_64 for -Os and 5 such cases instead of 4 (for 4 there is no switch, but a series of conditional branches) the RA actually does hoist it.
Comment 15 Jakub Jelinek 2019-02-21 14:41:35 UTC
Created attachment 45789 [details]
gcc9-pr70341.patch

Untested fix for e.g.:
#define A(N) void foo##N (int);
#define B(N) A(N##0) A(N##1) A(N##2) A(N##3)
#define C(N) B(N##0) B(N##1) B(N##2) B(N##3)
#define D C(1) C(2)
D

struct E { int a, b, c, d; };
struct F { struct E e[1]; };

void
test (struct F *x, int y)
{
  struct E *p = &x->e[y];
#undef A
#define A(N) case N: foo##N (p->c); break;
  switch (p->b)
    {
    D
    default: __builtin_unreachable ();
    }
}

at -O2 on both arm and aarch64 - by making sure the casesi MEM load is mem/u/c try_head_merge_bb is willing to move over the loads across the casesi.
Comment 16 Jakub Jelinek 2019-02-27 14:51:07 UTC
Author: jakub
Date: Wed Feb 27 14:50:35 2019
New Revision: 269255

URL: https://gcc.gnu.org/viewcvs?rev=269255&root=gcc&view=rev
Log:
	PR target/70341
	* config/arm/arm.md (arm_casesi_internal): New define_expand.  Rename
	old define_insn to ...
	(*arm_casesi_internal): ... this.  Add mode to LABEL_REFs.
	* config/arm/thumb2.md (thumb2_casesi_internal): New define_expand.
	Rename old define_insn to ...
	(*thumb2_casesi_internal): ... this.  Add mode to LABEL_REFs.
	(thumb2_casesi_internal_pic): New define_expand.  Rename old
	define_insn to ...
	(*thumb2_casesi_internal_pic): ... this.  Add mode to LABEL_REFs.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.md
    trunk/gcc/config/arm/thumb2.md
Comment 17 Jakub Jelinek 2019-02-27 18:11:57 UTC
Author: jakub
Date: Wed Feb 27 18:11:26 2019
New Revision: 269260

URL: https://gcc.gnu.org/viewcvs?rev=269260&root=gcc&view=rev
Log:
	PR target/70341
	* config/aarch64/aarch64.md (casesi): Create the casesi_dispatch
	MEM manually here, set MEM_READONLY_P and MEM_NOTRAP_P on it.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/aarch64/aarch64.md
Comment 18 Richard Biener 2019-11-14 07:54:59 UTC
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
Comment 19 Jakub Jelinek 2020-03-04 09:49:13 UTC
GCC 8.4.0 has been released, adjusting target milestone.
Comment 20 Jakub Jelinek 2021-05-14 09:48:07 UTC
GCC 8 branch is being closed.
Comment 21 Richard Biener 2021-06-01 08:07:41 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.