Bug 82365 - stack locations are consolidated if noreturn function is on the path
Summary: stack locations are consolidated if noreturn function is on the path
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 7.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2017-09-29 19:16 UTC by Christophe Lyon
Modified: 2017-10-06 12:53 UTC
CC: 5 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-09-30 00:00:00


Attachments
testcase (793 bytes, text/plain)
2017-09-29 19:16 UTC, Christophe Lyon

Description Christophe Lyon 2017-09-29 19:16:34 UTC
Created attachment 42264 [details]
testcase

We have noticed that gcc fails to reuse stack locations in presence of noreturn attribute in the call graph.

Basically, the attached testcase has
case 1: { struct XXX localvar; bla1; break; }
case 2: { struct XXX localvar; bla2; break; }
case 3: { struct XXX localvar; bla3; break; }

With noreturn attribute:

aarch64-linux-gnu-gcc -Wall -O2 -S bz-3265.c --param asan-stack=1 -Wframe-larger-than=1 
bz-3265.c: In function ‘em28xx_dvb_init’:
bz-3265.c:99:1: warning: the frame size of 480 bytes is larger than 1 bytes [-Wframe-larger-than=]


Without noreturn attribute:
aarch64-linux-gnu-gcc -Wall -O2 -S bz-3265.c --param asan-stack=1 -Wframe-larger-than=1 -DNONORETURN
bz-3265.c: In function ‘em28xx_dvb_init’:
bz-3265.c:99:1: warning: the frame size of 128 bytes is larger than 1 bytes [-Wframe-larger-than=]


The code fragment is extracted from the Linux kernel, where the problem is worse with -fsanitize=kernel-address because the failure to share stack slots leads to excessive stack usage.

I used an aarch64 compiler here, but Arnd observed similar problems on x86_64 too.
Comment 1 Andrew Pinski 2017-09-30 21:49:12 UTC
Here is a reduced testcase that shows the issue more clearly:
#define __noreturn __attribute__((noreturn))

struct i2c_board_info {
        char type[20];
        char pad[100];
};

#ifdef NONORETURN
void fortify_panic();
#else
void fortify_panic() __noreturn;
#endif


int f(int a)
{
  if (a)
    fortify_panic();
}


void i2c_new_device(struct i2c_board_info *);
int em28xx_dvb_init(int model, int a, int b, int c, int d)
{
        switch (model) {
        case 1:{
                        struct i2c_board_info info = {};
                        f(a);
                        i2c_new_device(&info);
                        break;
                }
        case 2:{
                        struct i2c_board_info info = {};
                        f(b);
                        i2c_new_device(&info);
                        break;
                }
        case 3:{
                        struct i2c_board_info info = { };
                        f(c);
                        i2c_new_device(&info);
                        break;
                }
        case 4:{
                        struct i2c_board_info info = { };
                        f(d);
                        i2c_new_device(&info);
                        break;
                }
        }
        return 0;
}

Basically, the noreturn function is not considered a barrier for the info variables, so the middle-end thinks the scopes of all of them can overlap (and therefore that none of their stack slots can be shared).
Comment 2 Andrew Pinski 2017-09-30 21:54:24 UTC
Basically someone needs to look into add_scope_conflicts (in cfgexpand.c) to see why the scope conflict is being added.
Comment 3 Arnd Bergmann 2017-10-02 07:49:06 UTC
I tried replacing the call to a noreturn function with a __builtin_unreachable() statement, same result.

However, adding an empty assembler statement before the noreturn statement or the __builtin_unreachable() appears to work around the problem:

void __fortify_panic() __noreturn;
extern inline __attribute__((always_inline)) void fortify_panic(void)
{
      asm volatile("");
      __fortify_panic();
}

I need to look at the resulting object code some more to see if that has other downsides or if we can use that trick in the kernel to work around compilers without a fix.
Comment 4 Jakub Jelinek 2017-10-05 15:53:42 UTC
(In reply to Andrew Pinski from comment #2)
> Basically someone needs to look into add_scope_conflicts (in gimplecfg.c) to
> see the scope conflict is being added.

??  Implying clobber stmts for all variables after noreturn calls wouldn't change anything.  The problem is that all the variables are addressable/escape and all the fortify_panic () calls are threaded.  So, on the coalesced fortify_panic () call the stack conflict code thinks all the 4 info escaped variables can be live during that call.

The current add_scope_conflicts is a fairly simple algorithm.  First we walk the basic blocks: whenever we see any kind of reference to a stack variable under consideration, we set its bit in a bitmap; whenever we see a clobber stmt for such a variable, we clear it.  At the start of a bb we OR into the bitmap the bitmaps from all the predecessor bbs, we remember the bitmap at the end of the bb, and we keep propagating until no further changes happen.
And finally we just walk all bbs again: on the first real (non-debug) stmt we add conflicts for all vars live at that point, and then for any stmt we add conflicts for the vars directly referenced in it.

Now, for this testcase to be optimized, one step would be to be more conservative on which stmt we do this:
          if (for_conflict
              && visit == visit_op)
            {
              /* If this is the first real instruction in this BB we need
                 to add conflicts for everything live at this point now.
                 Unlike classical liveness for named objects we can't
                 rely on seeing a def/use of the names we're interested in.
                 There might merely be indirect loads/stores.  We'd not add any
                 conflicts for such partitions.  */
              bitmap_iterator bi;
              unsigned i;
              EXECUTE_IF_SET_IN_BITMAP (work, 0, i, bi)
                {
                  struct stack_var *a = &stack_vars[i];
                  if (!a->conflicts)
                    a->conflicts = BITMAP_ALLOC (&stack_var_bitmap_obstack);
                  bitmap_ior_into (a->conflicts, work);
                }
              visit = visit_conflict;
            }
(well, obviously we'd need to visit = visit_conflict; for for_conflict from the first stmt and use some bool to track the first "indirect" stmt).
I think it needs to be done solely on statements that could have any indirect loads/stores (so calls, loads, stores, asm stmts, anything else?), but not say on arithmetics with SSA_NAME operands only etc., there we care only about explicitly referenced vars.

This wouldn't help here because fortify_panic is a call.

Another step could be to track which of the live variables are address taken, in another set of bitmaps (don't track just whether the var is addressable — that is the TREE_ADDRESSABLE bit after all — but whether its address has been taken yet).  In statements where the var has been seen live already, but not yet address taken, we wouldn't need to record conflicts on possibly indirect stmts; for those, only explicit references would imply a conflict (in particular, only conflicts between the addressable work set and the addressable work set).  I bet a clobber should still clear both bitmaps: even if the address is live across a clobber, any dereference through it would be invalid.  The last argument of walk_stmt_load_store_addr_ops is the callback for operands that had their address taken, so I think we could use that.  We'd need to take care to either handle all the possibly indirect stmts the above way, or track separately which vars were marked addressable since the last possibly indirect stmt.

Richard/Micha, thoughts on that?
Comment 5 Jakub Jelinek 2017-10-05 16:06:22 UTC
The reason why that would work in this case is that we have
  info = {};
as the first stmt that mentions it, but doesn't take address of info.
Then there is
  if (something)
    fortify_panic ();
The condition isn't a possibly indirect stmt; fortify_panic is (it is a call), but none of the info vars are addressable at that point.  Then there is:
  i2c_new_device(&info);
which first takes info's address and is also (afterwards) a possibly indirect call (so, we first need to walk explicit references and then handle the possibly indirect, not the other way around).  But, in this spot only one of the info vars is live/addressable, not all of them (on fortify_panic () all 4 info vars are live, but none is addressable).
  info ={v} {CLOBBER};
Finally, the var is no longer live nor addressable.

Now, if you change the code so that 
i2c_new_device(&info);
comes first and then
f(a);
this would no longer be optimized, because in the common fortify_panic call all 4 info vars would be addressable (even escaped to another function).

That said, if it is the same type of variable and it can't be shared, I wonder why the kernel just doesn't use the same variable.
Like:
        switch (model) {
                struct i2c_board_info info;
        case 1:
                info = (struct i2c_board_info) {};
                f(a);
                i2c_new_device(&info);
                break;
        case 2:
                info = (struct i2c_board_info) {};
                f(b);
                i2c_new_device(&info);
                break;
...
        }
(or memset or whatever).
Comment 6 Arnd Bergmann 2017-10-05 19:48:22 UTC
(In reply to Jakub Jelinek from comment #5)

> Now, if you change the code so that 
> i2c_new_device(&info);
> comes first and then
> f(a);
> this would no longer be optimized, because in the common fortify_panic call
> all 4 info vars would be addressable (even escaped to another function).
> 
> That said, if it is the same type of variable and it can't be shared, I
> wonder why the kernel just doesn't use the same variable.

That is one of several workarounds I have already implemented; the others are:

1. split out the long case statements into separate functions (much too large to be backported into stable kernel releases, but reduces the stack frames more than the other approaches).
2. have only one variable (in the original driver the variables don't all have the same name, so this patch also ends up being larger than the usual limit for backported patches)
3. use strncpy() rather than strlcpy() (the fortified strncpy doesn't use
   strnlen()).
4. add the empty asm statement in strnlen()

Any of the above work fine, we just need to decide which one to use. I'm currently leaning toward 3, followed by 1.

We will need to work around older compilers anyway, I mainly reported this because it seemed like something that gcc should be able to handle better and it could in theory cause problems elsewhere. I have only observed the problem in one file in the kernel (an older device driver that probably has very few users), so if this turns out hard to avoid in gcc, it won't be a big deal.
Comment 7 Michael Matz 2017-10-06 12:53:45 UTC
(In reply to Jakub Jelinek from comment #4)
> Richard/Micha, thoughts on that?

None better than what you came up with.  It'd solve this specific instance
of the problem (and not the one with swapped call statements).  I don't think
we should complicate and slow down the clobber algorithm for this partial solution, though.

One problem with "not address taken up to here" is this situation:

  char a[100];
  char *p = &somethingelse;
  for () {
     *p = ...;
     p = a;
  }

So, you'd need to iterate address-takenness until a fixpoint, like normal liveness.  As the current algorithm already does this for a crude approximation of liveness, you might be able to implement the idea without seriously complicating it or slowing it down.  If so, it'd make sense.  Otherwise I wouldn't worry.

IMHO the real problem is rather the phase ordering between joining the
four no-return calls into one (and thereby merging scopes) and scope-based
analyses.  Possibly that joining should happen later (but it'd mean at RTL
level only, which is also meh), or the stack sharing should happen earlier
(which would possibly mean introducing type aliasing problems at GIMPLE level,
which is also meh).