Bug 26388 - Variable sized storage allocation should be promoted to stack allocation
Summary: Variable sized storage allocation should be promoted to stack allocation
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.2.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 26387
Blocks: std::vector
Reported: 2006-02-20 22:20 UTC by Richard Biener
Modified: 2024-12-28 14:52 UTC
CC List: 7 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-02-16 00:00:00


Description Richard Biener 2006-02-20 22:20:14 UTC
#include <vector>

void foo(int n)
{
        std::vector<int> v;
        for (int i=0; i<n; ++i)
                v.push_back(i);
}

Here the (single) variable-sized storage allocated on the heap can be promoted
to a stack allocation, where the (single) object can grow dynamically.  This
pattern shows up in quite a bit more libstdc++ code and is generally only
exposed after inlining.

Note that because C++ lacks a realloc primitive, growing is done as a
new + memmove pair, so detecting this case is somewhat harder.
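
Roughly, each growth step that the middle end sees after inlining looks like
the following sketch (illustrative only; the actual libstdc++ code goes
through std::allocator and differs in detail):

#include <cstddef>
#include <cstring>
#include <new>

// Illustrative sketch of a vector growth step after inlining: allocate a
// larger block, copy the live elements, release the old block.
int *grow(int *old_start, std::size_t old_size, std::size_t new_cap)
{
        int *new_start = static_cast<int *>(::operator new(new_cap * sizeof(int)));
        if (old_size != 0)
                std::memcpy(new_start, old_start, old_size * sizeof(int));
        ::operator delete(old_start);
        return new_start;
}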
Comment 1 Andrew Pinski 2006-02-20 22:21:47 UTC
Actually this is all dead code and should be removed.
Comment 2 Richard Biener 2006-02-20 22:27:36 UTC
Ok, here's a real example:

template <typename T>
std::vector<T> makeRBlocksFactor(T number)
{
  std::vector<T> factors;

  for (T f=2; f<number; f++) {
    while (number % f == 0) {
      factors.push_back(f);
      number /= f;
    }
  }
  if (number != 1)
    factors.push_back(number);
  return factors;
}

A stupid way of calculating the prime factors of number.  Of course here we
have NRV optimization (maybe).
Comment 3 Richard Biener 2006-02-20 22:34:38 UTC
make it return the largest prime, factors[factors.size()-1] instead.
Comment 4 Falk Hueffner 2006-02-20 22:35:58 UTC
This would be incredibly difficult to detect reliably. But the main
problem is, on most operating systems, stack space is limited,
typically to something tiny compared to main memory like 8M, so this
wouldn't really be a good idea.  Anyway, due to the amortization, I
suspect it wouldn't save all that much time.  Have you tried some
benchmarks?
Comment 5 rguenther@suse.de 2006-02-20 22:52:41 UTC
Subject: Re: Variable sized storage allocation should be promoted to stack allocation

> ------- Comment #4 from falk at debian dot org  2006-02-20 22:35 -------
> This would be incredibly difficult to detect reliably. But the main
> problem is, on most operating systems, stack space is limited,
> typically to something tiny compared to main memory like 8M, so this
> wouldn't really be a good idea.  Anyway, due to the amortization, I
> suspect it wouldn't save all that much time.  Have you tried some
> benchmarks?

No, I didn't do benchmarks.  But doing a lot of computation using
libstdc++ containers takes quite an amount of time in the tramp3d benchmark
when using the new allocator (operator new (unsigned int) is at the top of
the profile for small array sizes), while it improves with the
mt allocator (mainly due to less overhead and better cache locality
from the size pools).  Looking at the affected algorithms, they
all collect information in vectors, do some processing and then throw
the vector away again.  Of course this could all be addressed at
the algorithmic level by just not using libstdc++ containers but
local stack arrays - using libstdc++ is just more expressive for this
kind of stuff.

If this is really not feasible or will not work in any useful cases,
maybe exposing the stack to the user would be a nice thing to have.
Like

 foo()
 {
   int *p = __builtin_stack_top();
   *(p--) = i;
   ...
 }

where __builtin_stack_top() would be beyond the known stack usage of
foo(), erroring in case of function calls, of course (or even make
the stack decrements builtins).
Comment 6 Falk Hueffner 2006-02-20 23:06:38 UTC
(In reply to comment #5)

> If this is really not feasible or will not work in any useful cases,
> maybe exposing the stack to the user would be a nice thing to have.
> Like
> 
>  foo()
>  {
>    int *p = __builtin_stack_top();
>    *(p--) = i;
>    ...
>  }
> 
> where __builtin_stack_top() would be beyond the known stack usage of
> foo(), erroring in case of function calls, of course (or even make
> the stack decrements builtins).

This would be quite a finicky interface. For example, on the Alpha architecture,
the stack pointer needs to be 16-byte aligned at all times. Also, you must
not read at some position on the stack unless you've previously written within
4096 bytes (or something) of that position. I suppose other architectures
have similar peculiarities.
Comment 7 Andrew Pinski 2006-02-21 02:22:07 UTC
I am going to say there are better ways of optimizing this than using the stack.
The easy way to optimize this in your code is to reserve the required size before the loop; this is already done in GCC itself with VEC.
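
Applied to the original testcase, that would be something like (a minimal sketch):

#include <vector>

void foo(int n)
{
        std::vector<int> v;
        if (n > 0)
                v.reserve(n);           // one allocation up front, no regrowth
        for (int i = 0; i < n; ++i)
                v.push_back(i);
}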
Comment 8 Richard Biener 2011-07-27 10:56:13 UTC
*** Bug 43513 has been marked as a duplicate of this bug. ***
Comment 9 xiaoyuanbo 2012-02-22 12:43:25 UTC Comment hidden (spam)
Comment 10 Martin Sebor 2017-02-16 17:16:25 UTC
The implementation challenges aside, I think the transformation suggested here could break some valid (if rare) C++ programs.  std::vector is specified to use std::allocator to obtain storage, which in turn is specified to use operator new.  Operator new is a replaceable function, meaning that a program can define its own that keeps track of the number of calls made to it by vector (and other) member functions.  Such programs could break if GCC eliminated the call to the operator.
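
For example, a (hypothetical) program along these lines would observe the
number of allocations:

#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

static unsigned long new_calls;

// Replaceable global allocation function that counts its invocations.
void *operator new(std::size_t n)
{
        ++new_calls;
        if (void *p = std::malloc(n ? n : 1))
                return p;
        throw std::bad_alloc();
}

void operator delete(void *p) noexcept { std::free(p); }
void operator delete(void *p, std::size_t) noexcept { std::free(p); }

int main()
{
        { std::vector<int> v(1); }       // expected to call the replaced operator new
        return new_calls != 0 ? 0 : 1;   // nonzero exit if the allocation were elided
}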

Richard, in light of this and other concerns in the discussion of this bug, do you want to keep this bug open or do you agree with closing it?
Comment 11 Marc Glisse 2017-02-16 17:20:02 UTC
(In reply to Martin Sebor from comment #10)
> The implementation challenges aside, I think the transformation suggested
> here could break some valid (if rare) C++ programs.  std::vector is
> specified to use std::allocator to obtain storage which in turn is specified
> to use operator new.  Operator new is a replaceable function meaning that a
> program can define its own that keeps track of the number of calls to it by
> vector (and other) member functions.  Such programs could break by GCC
> eliminating the call to the operator.
> 
> Richard, in light of this and other concerns in the discussion of this bug,
> do you want to keep this bug open or do you agree with closing it?

LLVM developers got the C++ standard fixed so that this is explicitly permitted...
Comment 12 Jonathan Wakely 2017-02-16 18:02:38 UTC
(In reply to Martin Sebor from comment #10)
> The implementation challenges aside, I think the transformation suggested
> here could break some valid (if rare) C++ programs.  std::vector is
> specified to use std::allocator to obtain storage which in turn is specified
> to use operator new.

Eventually, but not for every allocation.

I think if std::allocator did this optimisation there is no way a conforming program could tell the difference.

Users can replace operator new (or malloc) but it's unspecified when or how often that would be called by std::allocator.

Users can specialize std::allocator for their own types, so can observe calls to its allocate/deallocate members, but if the optimisation is done by the std::allocator primary template then it wouldn't happen for user specializations of std::allocator, so we're still OK.
Comment 13 Richard Biener 2017-02-17 09:26:17 UTC
I think it's still important to optimize such temporary uses of STL containers, esp. std::vector.  And yes, I realize actually doing this will be quite a challenge.
Comment 14 Martin Sebor 2017-02-17 17:34:07 UTC
I agree with the goal of making containers efficient so please take this as constructive criticism.

Jonathan and I discussed this on IRC a bit yesterday and I've thought about it some more since then.  I'm unconvinced the suggested optimization would be valid (conforming).  To Marc's point in comment #11, the default vector was never intended to be stack-based and I'm not aware of any change in the standard that would make it valid to allocate it on the stack.  To Jonathan's point in comment #12, "eventually" means that creating one or more non-empty vectors can be expected by a conforming program to cause at least one call to operator new.

Beyond conformance, another, arguably even more important, point to consider is that vector is required to throw an exception when memory is exhausted.  Implementing it to live on the stack (and allocate memory via alloca) would either violate that requirement, letting it exhaust the stack and crash the program, or mean checking the amount of available stack space on every allocation (or just the first allocation).  Alternatively, it would mean gating the optimization on known sizes of the vector that were less than some small constant maximum (i.e., effectively turning std::vector into std::array.)

Jonathan also reminded me that vectors A and B must satisfy the "i = A.begin(); A.swap (B); assert (i == B.begin())" postcondition, with swap having constant complexity.  A vector allocated on the stack would not, in general, be able to satisfy these requirements.  It couldn't be swapped with or moved in constant time to another vector with a different lifetime, further constraining the optimization.
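
Spelled out as code (a sketch):

#include <cassert>
#include <vector>

void swap_keeps_iterators(std::vector<int> &A, std::vector<int> &B)
{
        if (A.empty())
                return;
        std::vector<int>::iterator i = A.begin();
        A.swap(B);                // must run in constant time, iterators stay valid
        assert(i == B.begin());   // i now refers into B's storage
}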

There may be clever ways to overcome these problems, either by making the implementation more complex, or by relaxing the standard requirements on implementations, or (likely both), but they would have costs of their own.  The former likely in efficiency and significant complexity, and the latter in constraining programs (i.e., breaking some that are now valid).

So in summary, while I would like to see STL containers be as efficient as they can be (and have, in fact, been thinking and talking with Jeff about how to improve them to this end), I don't think this optimization is viable.  It may be unfortunate but it is the price of generality, customizability, and conformance.
Comment 15 rguenther@suse.de 2017-02-17 18:29:40 UTC
Note this bug talks about a middle-end optimization, not a change in libstdc++.  Obviously such an optimization can only be applied when it is valid.

Eliding a new / delete pair is a valid optimization AFAIK.
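
For example (a minimal sketch), a compiler may fold this whole function to
"return 42;" without ever calling operator new:

int f()
{
        int *p = new int(42);   // [expr.new] allows omitting this allocation
        int v = *p;
        delete p;
        return v;               // the function may be folded to: return 42;
}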
Comment 16 Marc Glisse 2017-02-17 18:53:53 UTC
[expr.new] now provides some flexibility:
"A new-expression may obtain storage for the object by calling an allocation function" (not must)
"An implementation is allowed to omit a call to a replaceable global allocation function (18.6.2.1, 18.6.2.2). When it does so, the storage is instead provided by the implementation or provided by extending the allocation of another new-expression. The implementation may extend the allocation of a new-expression e1 to provide storage for a new-expression e2"
etc

The original example is a bit strange: we know the size n in advance, so the optimization that seems most appealing is to allocate at least n ints directly, essentially inserting the v.reserve(n) call that the user missed.  (For a small, constant n where we can unroll the whole loop, it may not even be too hard.)

This was filed against tree-optimization, so I think it was meant to optimize a sequence of new/delete where we can see all uses, not to replace std::vector with dynarray in the general case.

Now it seems pretty hard to find the right circumstances where this optimization is safely possible and to insert the right runtime checks to avoid overflowing the stack, while there are several simple ways for the user to improve performance (boost::small_vector if the vector is often very small, recycling the vector by making it thread_local if there is no reentrance and the program is not going to miss that memory, etc.), so this doesn't sound like it is going to happen any time soon.
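
The thread_local recycling idea would look roughly like this (a sketch; only
valid when the code is not reentered on the same thread):

#include <vector>

// Reuse one per-thread buffer instead of heap-allocating a fresh vector on
// every call; clear() keeps the capacity, so later calls allocate nothing.
std::vector<int> &scratch()
{
        thread_local std::vector<int> v;
        v.clear();
        return v;
}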
Comment 17 Martin Sebor 2017-02-17 20:52:04 UTC
The quoted text allows an implementation to merge calls made by two or more new expressions to the underlying allocation function (operator new).  It can call it just once, but it can't avoid calling it altogether.

There is no such permission for calls to the allocation function itself.

std::allocator doesn't use a new expression.  It's specified to call ::operator new(size_t), so the permission in [expr.new] doesn't apply to it.  It isn't specified when or how often the allocator calls the function, so, as Jonathan noted, for some programs it could call it just once and cache the result.  But it does have to call it and can't just get the memory from somewhere else.
Comment 18 Marc Glisse 2017-02-17 21:52:58 UTC
(In reply to Martin Sebor from comment #17)
> The quoted text allows an implementation to merge calls made by two or
> more new expressions to the underlying allocation function (operator new).
> It can call it just once, but it can't avoid calling it altogether.

Yes it can: "allowed to omit a call", "the storage is instead provided by the implementation".

> There is no such a permission for calls to the allocation function itself.

Ahhh, now I remember thinking that they had messed up by specifying it only for new expressions, which forced them to introduce a builtin in LLVM to differentiate calls to operator new coming from new expressions from direct ones (for instance by the allocator)...  Though of course that appears to have been a deliberate change between N3537 and N3664; I found on the Bristol wiki this single remark, "Better to have the merging apply only to new-expressions and not to explicit calls to operator new()", without justification.  Maybe I should ask CWG what the reason was.
Comment 19 Martin Sebor 2017-02-17 22:55:59 UTC
You're right, I had overlooked the "provided by the implementation" part.

I would not want implementations to have the latitude to omit explicit calls to operator new() unless the operator were the default.  That would largely defeat the purpose of the operators' replaceability: to be able to control the dynamic allocation policy of a program.

But if there were a way to detect whether or not the operator has been replaced (it seems there should be some trick), then dynamic memory allocated at local scope and known not to escape could be replaced with stack space without breaking anything.  I think that would be a useful optimization.
Comment 20 Jan Hubicka 2024-12-27 16:54:41 UTC
With __builtin_operator_new we could theoretically optimize the initial testcase to an empty function, but we don't, since we cannot handle the reallocation.

  _60 = operator new (_59);

  <bb 9> [local count: 166566348]:
  _46 = (long unsigned int) _38;
  _47 = _60 + _46;
  *_47 = i_56;
  if (_38 != 0)
    goto <bb 10>; [41.48%]
  else
    goto <bb 11>; [58.52%]

  <bb 10> [local count: 69091722]:
  __builtin_memcpy (_60, v$_M_start_73, _46);
  _51 = _46 + 4;
  __new_finish_54 = _60 + _51;
  goto <bb 12>; [100.00%]

So here we would first need to be able to handle the chain of memcpy copies and *_47 stores, probably in DSE, and once we get rid of the stores and the memcpy, DCE could theoretically optimize out the allocation/EH and other noise.
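
If all of that worked, the initial testcase would reduce to (sketch):

void foo(int)
{
        // nothing observable remains once the stores, memcpy and allocation are elided
}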