Bug 82365 - stack locations are consolidated if noreturn function is on the path
Summary: stack locations are consolidated if noreturn function is on the path
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 7.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2017-09-29 19:16 UTC by Christophe Lyon
Modified: 2018-02-22 13:22 UTC
6 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-09-30 00:00:00


Attachments
testcase (793 bytes, text/plain)
2017-09-29 19:16 UTC, Christophe Lyon

Description Christophe Lyon 2017-09-29 19:16:34 UTC
Created attachment 42264 [details]
testcase

We have noticed that gcc fails to reuse stack locations in the presence of a noreturn attribute in the call graph.

Basically, the attached testcase has
case 1: { struct XXX localvar; bla1; break; }
case 2: { struct XXX localvar; bla2; break; }
case 3: { struct XXX localvar; bla3; break; }

With noreturn attribute:

aarch64-linux-gnu-gcc -Wall -O2 -S bz-3265.c --param asan-stack=1 -Wframe-larger-than=1 
bz-3265.c: In function ‘em28xx_dvb_init’:
bz-3265.c:99:1: warning: the frame size of 480 bytes is larger than 1 bytes [-Wframe-larger-than=]


Without noreturn attribute:
aarch64-linux-gnu-gcc -Wall -O2 -S bz-3265.c --param asan-stack=1 -Wframe-larger-than=1 -DNONORETURN
bz-3265.c: In function ‘em28xx_dvb_init’:
bz-3265.c:99:1: warning: the frame size of 128 bytes is larger than 1 bytes [-Wframe-larger-than=]


The code fragment is extracted from the Linux kernel, where the problem is worse when building with -fsanitize=kernel-address, causing excessive stack usage.

I used an aarch64 compiler here, but Arnd observed similar problems on x86_64 too.
Comment 1 Andrew Pinski 2017-09-30 21:49:12 UTC
Here is a reduced testcase that shows the issue more clearly:
#define __noreturn __attribute__((noreturn))

struct i2c_board_info {
        char type[20];
        char pad[100];
};

#ifdef NONORETURN
void fortify_panic();
#else
void fortify_panic() __noreturn;
#endif


int f(int a)
{
  if (a)
    fortify_panic();
}


void i2c_new_device(struct i2c_board_info *);
int em28xx_dvb_init(int model, int a, int b, int c, int d)
{
        switch (model) {
        case 1:{
                        struct i2c_board_info info = {};
                        f(a);
                        i2c_new_device(&info);
                        break;
                }
        case 2:{
                        struct i2c_board_info info = {};
                        f(b);
                        i2c_new_device(&info);
                        break;
                }
        case 3:{
                        struct i2c_board_info info = { };
                        f(c);
                        i2c_new_device(&info);
                        break;
                }
        case 4:{
                        struct i2c_board_info info = { };
                        f(d);
                        i2c_new_device(&info);
                        break;
                }
        }
        return 0;
}

Basically, the noreturn function is not considered a barrier for the info variables, so the middle-end thinks the scopes of all of them can overlap.
Comment 2 Andrew Pinski 2017-09-30 21:54:24 UTC
Basically someone needs to look into add_scope_conflicts (in cfgexpand.c) to see why the scope conflict is being added.
Comment 3 Arnd Bergmann 2017-10-02 07:49:06 UTC
I tried replacing the call to a noreturn function with a __builtin_unreachable() statement, same result.

However, adding an empty assembler statement before the noreturn statement or the __builtin_unreachable() appears to work around the problem:

void __fortify_panic() __noreturn;
extern inline __attribute__((always_inline)) void fortify_panic(void)
{
      asm volatile("");
      __fortify_panic();
}

I need to look at the resulting object code some more to see if that has other downsides or if we can use that trick in the kernel to work around compilers without a fix.
Comment 4 Jakub Jelinek 2017-10-05 15:53:42 UTC
(In reply to Andrew Pinski from comment #2)
> Basically someone needs to look into add_scope_conflicts (in cfgexpand.c) to
> see why the scope conflict is being added.

??  Implying clobber stmts for all variables after noreturn calls wouldn't change anything.  The problem is that all the variables are addressable/escaping and all the fortify_panic () calls are threaded into one.  So, on the coalesced fortify_panic () call the stack conflict code thinks all 4 escaped info variables can be live during that call.

The current add_scope_conflicts is a fairly simple algorithm.  First we walk the basic blocks: whenever we see any kind of reference to a stack variable under consideration, we set its bit in a bitmap; whenever we see a clobber stmt for such a variable, we clear it.  At the start of a bb we OR into the bitmap the bitmaps of all predecessor bbs, and we remember the bitmap at the end of the bb, propagating until no further changes happen.
And finally we just walk all bbs again: on the first non-debug stmt we add conflicts for all vars currently live at that point, and then for any stmt we add conflicts for the vars directly referenced in it.

Now, for this testcase to be optimized, one step would be to be more selective about which stmts we do this on:
          if (for_conflict
              && visit == visit_op)
            {
              /* If this is the first real instruction in this BB we need
                 to add conflicts for everything live at this point now.
                 Unlike classical liveness for named objects we can't
                 rely on seeing a def/use of the names we're interested in.
                 There might merely be indirect loads/stores.  We'd not add any
                 conflicts for such partitions.  */
              bitmap_iterator bi;
              unsigned i;
              EXECUTE_IF_SET_IN_BITMAP (work, 0, i, bi)
                {
                  struct stack_var *a = &stack_vars[i];
                  if (!a->conflicts)
                    a->conflicts = BITMAP_ALLOC (&stack_var_bitmap_obstack);
                  bitmap_ior_into (a->conflicts, work);
                }
              visit = visit_conflict;
            }
(well, obviously we'd need to visit = visit_conflict; for for_conflict from the first stmt and use some bool to track the first "indirect" stmt).
I think it needs to be done solely on statements that could have any indirect loads/stores (so calls, loads, stores, asm stmts, anything else?), but not say on arithmetics with SSA_NAME operands only etc., there we care only about explicitly referenced vars.

This wouldn't help here because fortify_panic is a call.

Another step could be to track which of the live variables have had their address taken, in another set of bitmaps (don't track just whether a var is addressable, that is the TREE_ADDRESSABLE bit after all, but track the point where its address was taken).  In statements where a var has been seen live already but its address not yet taken, we wouldn't need to record conflicts on possibly-indirect stmts; for those, only explicit references would imply a conflict (in particular, only conflicts between the addressable work set and the addressable work set).  I bet a clobber should still clear both bitmaps: even if the address is live across a clobber, any dereference through it would be invalid.  The last argument of walk_stmt_load_store_addr_ops is the callback for operands that had their address taken, so I think we could use that.  We'd need to take care to either handle all the possibly-indirect stmts the above way, or track separately which vars were marked addressable since the last possibly-indirect stmt.

Richard/Micha, thoughts on that?
Comment 5 Jakub Jelinek 2017-10-05 16:06:22 UTC
The reason why that would work in this case is that we have
  info = {};
as the first stmt that mentions it, but doesn't take address of info.
Then there is
  if (something)
    fortify_panic ();
The condition isn't a possibly-indirect stmt; fortify_panic is (it is a call), but none of the info vars is addressable at that point.  Then there is:
  i2c_new_device(&info);
which first takes info's address and is also (afterwards) a possibly indirect call (so, we first need to walk explicit references and then handle the possibly indirect, not the other way around).  But, in this spot only one of the info vars is live/addressable, not all of them (on fortify_panic () all 4 info vars are live, but none is addressable).
  info ={v} {CLOBBER};
Finally, the var is no longer live nor addressable.

Now, if you change the code so that 
i2c_new_device(&info);
comes first and then
f(a);
this would no longer be optimized, because in the common fortify_panic call all 4 info vars would be addressable (even escaped to another function).

That said, if it is the same type of variable and it can't be shared, I wonder why the kernel just doesn't use the same variable.
Like:
        switch (model) {
                struct i2c_board_info info;
        case 1:
                info = (struct i2c_board_info) {};
                f(a);
                i2c_new_device(&info);
                break;
        case 2:
                info = (struct i2c_board_info) {};
                f(b);
                i2c_new_device(&info);
                break;
...
        }
(or memset or whatever).
Comment 6 Arnd Bergmann 2017-10-05 19:48:22 UTC
(In reply to Jakub Jelinek from comment #5)

> Now, if you change the code so that 
> i2c_new_device(&info);
> comes first and then
> f(a);
> this would no longer be optimized, because in the common fortify_panic call
> all 4 info vars would be addressable (even escaped to another function).
> 
> That said, if it is the same type of variable and it can't be shared, I
> wonder why the kernel just doesn't use the same variable.

That is one of several workarounds I have already implemented; the others are:

1. split out the long case statements into separate functions (much too large to be backported into stable kernel releases, but reduces the stack frames more than the other approaches).
2. have only one variable (in the original driver the variables don't all have the same name, so this patch also ends up being larger than the usual limit for backported patches)
3. use strncpy() rather than strlcpy() (the fortified strncpy doesn't use
   strnlen()).
4. add the empty asm statement in strnlen()

All of the above work fine; we just need to decide which one to use. I'm currently leaning toward 3, followed by 1.

We will need to work around older compilers anyway, I mainly reported this because it seemed like something that gcc should be able to handle better and it could in theory cause problems elsewhere. I have only observed the problem in one file in the kernel (an older device driver that probably has very few users), so if this turns out hard to avoid in gcc, it won't be a big deal.
Comment 7 Michael Matz 2017-10-06 12:53:45 UTC
(In reply to Jakub Jelinek from comment #4)
> Richard/Micha, thoughts on that?

None better than what you came up with.  It'd solve this specific instance
of the problem (and not the one with swapped call statements).  I don't think
we should complicate and slow down the clobber algorithm for this partial solution, though.

One problem with "not address taken up to here" is this situation:

  char a[100];
  char *p = &somethingelse;
  for () {
     *p = ...;
     p = a;
  }

So, you'd need to iterate address-takenness until fixpoint, like normal liveness.  As the current algorithm already does this for a crude approximation of liveness, you might be able to implement the idea without seriously complicating it or slowing it down.  If so, it'd make sense; otherwise I wouldn't worry.

IMHO the real problem is rather the phase ordering between joining the
four no-return calls into one (and thereby merging scopes) and scope-based
analyses.  Possibly that joining should happen later (but it'd mean at RTL
level only, which is also meh), or the stack sharing should happen earlier
(which would possibly mean introducing type aliasing problems at GIMPLE level,
which is also meh).
Comment 8 Arnd Bergmann 2017-12-01 09:38:52 UTC
I noticed that I never resubmitted my workaround for the kernel for this problem, and nothing happened on the gcc side either. To make sure I capture the situation correctly in the kernel patch changelog, could I get a clarification on whether this is something the gcc developers think might get addressed before the gcc-8 release, or whether I should assume it is too minor to take up anyone's time?
Comment 9 rguenther@suse.de 2017-12-01 12:39:27 UTC
On Fri, 1 Dec 2017, arnd at linaro dot org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82365
> 
> --- Comment #8 from Arnd Bergmann <arnd at linaro dot org> ---
> I noticed that I never resubmitted my workaround for the kernel for this
> problem, and nothing happened on the gcc side either. To make sure I capture
> the situation correctly in the kernel patch changelog, could I get a
> clarification on whether this is something the gcc developers think might get
> addressed before the gcc-8 release, or whether I should assume it is too minor
> to take up anyone's time?

Too complicated at this point for questionable gain (easy source 
workaround possible).
Comment 10 Arnd Bergmann 2017-12-13 16:54:27 UTC
I have now observed the problem in another file, this time one that is more commonly used, but the effect is less drastic, as the stack usage is only around 1000 bytes.

The original source code is in https://elixir.free-electrons.com/linux/v4.14/source/include/net/ip_vs.h#L1243
This reduces to

int a, b, c, d, e;
inline char *ip_vs_dbg_addr(char *p1, int *p2) {
  __builtin_snprintf(&p1[*p2], a, "[%pI6c]", &b);
  if ((a)) {
    __builtin_trap();
        }
  return &p1[1];
}
void ip_vs_control_add() {
  do {
    char f[160];
    ip_vs_dbg_addr(f, &e);
  } while (0);
  {
    int g = d;
    char h[160];
    ip_vs_dbg_addr(h, &c);
    {
      char i[160];
      ip_vs_dbg_addr(i, &g);
    }
    {
      char j[160];
      ip_vs_dbg_addr(j, &e);
    }
  }
  char k[160];
  ip_vs_dbg_addr(k, &e);
}

which uses 800 bytes of stack space. Adding an 'asm volatile("");' statement before __builtin_trap() again solves the problem.
Comment 11 Arnd Bergmann 2017-12-15 13:34:07 UTC
More testing reveals that a handful of files in the kernel are affected by this bug through the BUG() definition on architectures that do not use an inline assembly statement to trap during an assertion (around half the supported architectures). This kernel patch

diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index 963b755d19b0..23c6a2a6a3d6 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -52,6 +52,7 @@ struct bug_entry {
 #ifndef HAVE_ARCH_BUG
 #define BUG() do { \
        printk("BUG: failure at %s:%d/%s()!\n", __FILE__, __LINE__, __func__); \
+       barrier(); \
        panic("BUG!"); \
 } while (0)
 #endif

can work around the following set of overly large stack frames:

fs/ext4/inode.c:82:1: warning: the frame size of 1672 bytes is larger than 800 bytes [-Wframe-larger-than=]
fs/ext4/namei.c:434:1: warning: the frame size of 904 bytes is larger than 800 bytes [-Wframe-larger-than=]
fs/ext4/super.c:2279:1: warning: the frame size of 1160 bytes is larger than 800 bytes [-Wframe-larger-than=]
fs/ext4/xattr.c:146:1: warning: the frame size of 1168 bytes is larger than 800 bytes [-Wframe-larger-than=]
fs/f2fs/inode.c:152:1: warning: the frame size of 1424 bytes is larger than 800 bytes [-Wframe-larger-than=]
net/netfilter/ipvs/ip_vs_core.c:1195:1: warning: the frame size of 1068 bytes is larger than 800 bytes [-Wframe-larger-than=]
net/netfilter/ipvs/ip_vs_core.c:395:1: warning: the frame size of 1084 bytes is larger than 800 bytes [-Wframe-larger-than=]
net/netfilter/ipvs/ip_vs_ftp.c:298:1: warning: the frame size of 928 bytes is larger than 800 bytes [-Wframe-larger-than=]
net/netfilter/ipvs/ip_vs_ftp.c:418:1: warning: the frame size of 908 bytes is larger than 800 bytes [-Wframe-larger-than=]
net/netfilter/ipvs/ip_vs_lblcr.c:718:1: warning: the frame size of 960 bytes is larger than 800 bytes [-Wframe-larger-than=]
drivers/net/xen-netback/netback.c:1500:1: warning: the frame size of 1088 bytes is larger than 800 bytes [-Wframe-larger-than=]

and similar patches can be created for architectures not using the generic implementation. For reference, the above was tested on all architectures that are supported by mainline versions of both linux and gcc using an 'allmodconfig' build, and the same symptoms were visible on all architectures using the generic BUG(). I only looked at files that had any functions with frame sizes over 800 bytes (1000 bytes for 64-bit architectures), 89 files in total out of 31841 source files that were built, down to 78 with my workaround.

In a single build, around 100 files had functions that get a (mostly minor) reduction in frame size with my patch; in a few cases the frame sizes appear to get slightly larger due to different inlining decisions, and in some other cases, including the ones listed above, there is a drastic reduction in frame size by a factor of two to five.

I have submitted a workaround for the kernel for the original case (involving strncpy()) and plan to submit another workaround for BUG() now. However, I'd still like to see this addressed in gcc as well, since that will cover those instances in other code. I would hope that a simple workaround such as the patch for PR81715 is possible. This seems to be a related issue with very similar symptoms.
Comment 12 Arnd Bergmann 2017-12-19 11:45:14 UTC
The first partial workaround for strncpy() got merged into Linux and stable backports: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=146734b091430

Submitted a second partial workaround for BUG():
https://patchwork.kernel.org/patch/10123109/