Bug 90634 - filesystem::path insane memory allocations
Summary: filesystem::path insane memory allocations
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++ (show other bugs)
Version: 8.3.0
: P3 normal
Target Milestone: 8.4
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-05-25 20:24 UTC by baltic
Modified: 2019-06-26 15:12 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description baltic 2019-05-25 20:24:55 UTC
std::filesystem::path allocates insane amounts of memory, even for short paths.
for example, for a 20 bytes path it allocates 813 bytes. which makes it practically unusable to store a lot of paths, while crawling through a filesystem for example. 

see the small test program:

#include <cstdlib>
#include <iostream>
#include <string>
#include <filesystem>

using namespace std;
using namespace std::filesystem;

size_t total = 0;

// replace operator new and delete to log allocations
void* operator new(size_t n) {
	cout << "allocating " << n << " bytes" << endl;
	size_t *p = (size_t*) malloc(n + sizeof(size_t));
	*p++ = n;
	total += n;
	return p;
}
void operator delete(void* p) throw() {
	size_t *sp = (size_t *) p;
	cout << "freed " <<  *--sp << " bytes" << endl;
	total -= *sp;
	free(sp);
}

int main() {
	path p("/test/test/test/test");
	cout << "about to quit. total allocated " << total << " bytes" << endl;
	return 0;
}

the output is:

allocating 21 bytes
allocating 72 bytes
allocating 144 bytes
freed 72 bytes
allocating 288 bytes
allocating 72 bytes
freed 144 bytes
allocating 576 bytes
allocating 72 bytes
allocating 72 bytes
allocating 72 bytes
freed 72 bytes
freed 288 bytes
about to quit. total allocated 813 bytes
Comment 1 Jonathan Wakely 2019-05-26 08:44:40 UTC
The path type was rewritten for GCC 9, and now prints:

allocating 21 bytes
allocating 248 bytes
about to quit.
total allocated 269 bytes
freed 248 bytes
freed 21 bytes

The 21 bytes is for the native() string, and 248 bytes is the sequence of path objects in the range [begin(),end()).
Comment 2 Jonathan Wakely 2019-05-26 10:32:14 UTC
The 813 bytes seen with GCC 8.x is due to reallocations within std::vector, as the sequence of path objects in  [begin(),end()) is populated. The new code counts the number of components first, so there's no need to keep growing as the path is parsed. The sequence no longer uses std::vector at all, which also reduces sizeof(path).

I'm not sure where the repeated 72 allocations come from, and can't check the code right now, but the new code doesn't do it anyway. I think this can be considered fixed.
Comment 3 baltic 2019-05-26 13:09:07 UTC
(In reply to Jonathan Wakely from comment #2)
 
> I'm not sure where the repeated 72 allocations come from, and can't check
> the code right now, but the new code doesn't do it anyway.

It comes from std::string creation, which then is passed to path constructor.

> I think this can be considered fixed.

hardly. the gcc 9.1 is even worse in that regard, as it allocates 2kB of memory for the same case
see the line:

about to quit. total allocated 2131 bytes

in here:
https://wandbox.org/permlink/u9dEfPh1Zc5pmJ34
Comment 4 baltic 2019-05-26 15:16:15 UTC
besides the 269 bytes you have mentioned, is still x10 overhead for a 20 chars string. and massively adds up, if you store a lot objects.
for example, when i go around home folder on my machine and save all the found paths to vector, it takes 2.7GB, while should take ~150MB!! quite an overhead on a simple task, for a language which strives for efficiency.

check out clang, for example:
https://wandbox.org/permlink/u9dEfPh1Zc5pmJ34
it's smart enough to allocate such short strings inplace:
about to quit. total allocated 0 bytes
Comment 5 Jonathan Wakely 2019-05-26 19:42:20 UTC
(In reply to baltic from comment #3)
> (In reply to Jonathan Wakely from comment #2)
>  
> > I'm not sure where the repeated 72 allocations come from, and can't check
> > the code right now, but the new code doesn't do it anyway.
> 
> It comes from std::string creation, which then is passed to path constructor.

No it doesn't, that's the 21 bytes.

With GCC 9.x the 72 bytes is sizeof(path), which comes from allocating a path. It comes from each component of the path creating a vector<path> during parsing, and then if there's only one component it empties the vector again. Those transient allocations could be avoided, so I'll change that for GCC 8.4.0, but there's nothing to do for GCC 9.x as it's already avoiding unnecessary vector<path> operations.

> > I think this can be considered fixed.
> 
> hardly. the gcc 9.1 is even worse in that regard, as it allocates 2kB of
> memory for the same case
> see the line:
> 
> about to quit. total allocated 2131 bytes
> 
> in here:
> https://wandbox.org/permlink/u9dEfPh1Zc5pmJ34

You should try using a real compiler and not only rely on wandbox. Maybe there's something wrong with the wandbox compiler, but that result only happens when using Boost (so you should take it up with Boost instead).

The correct results with 9.1.0 are:

allocating 21 bytes
allocating 248 bytes
about to quit. total allocated 269 bytes
freed 248 bytes
freed 21 bytes

(In reply to baltic from comment #4)
> besides the 269 bytes you have mentioned, is still x10 overhead for a 20
> chars string. and massively adds up, if you store a lot objects.

The overhead is linear in the number of components in the path, not the string length. If you have a path with a single filename and no directories then the overhead is nothing like 10x.

> for example, when i go around home folder on my machine and save all the
> found paths to vector, it takes 2.7GB, while should take ~150MB!! quite an
> overhead on a simple task, for a language which strives for efficiency.

So don't store them as filesystem::path objects then, store them as strings and create a path as needed.

> check out clang, for example:
> https://wandbox.org/permlink/u9dEfPh1Zc5pmJ34
> it's smart enough to allocate such short strings inplace:
> about to quit. total allocated 0 bytes

GCC will also create short strings in place, but with a different limit for what is considered "short".

GCC's implementation creates the path components eagerly, so that path::iterator meets the requirements of a forward iterator, whereas the libc++ implementation creates them lazily during iteration, and so is not a valid forward iterator. This fails with libc++:

https://wandbox.org/permlink/eas3ZqA2CojecrQJ

The implementations have different trade-offs. That is not a bug.
Comment 6 Jonathan Wakely 2019-05-26 19:52:50 UTC
(In reply to Jonathan Wakely from comment #5)
> The correct results with 9.1.0 are:
> 
> allocating 21 bytes
> allocating 248 bytes
> about to quit. total allocated 269 bytes
> freed 248 bytes
> freed 21 bytes

Proof: https://wandbox.org/permlink/zJVe39jAZtL1p6je
Comment 7 baltic 2019-05-27 09:58:12 UTC
(In reply to Jonathan Wakely from comment #5)

> So don't store them as filesystem::path objects then, store them as strings
> and create a path as needed.

Type system is here for a reason. And it has advantages to type something, which is naturally a path, as such, not as an arbitrary string.

> GCC's implementation creates the path components eagerly, so that
> path::iterator meets the requirements of a forward iterator, whereas the
> libc++ implementation creates them lazily during iteration, and so is not a
> valid forward iterator. 

It doesn't have to be. See standard 30.11.7.5:

A path::iterator is a constant iterator satisfying all the requirements of a bidirectional iterator (27.2.6)
except that, for dereferenceable iterators a and b of type path::iterator with a == b, there is no requirement
that *a and *b are bound to the same object. Its value_type is path.

> The implementations have different trade-offs. That is not a bug.

Poorly considered trade-offs are called "bad design" ;)
Besides, your implementation breaks the fundamental C++ design principle "you don't pay for what you don't use". now everyone has to pay x10 memory overhead even if they are not ever going to iterate over a path object. I agree, that is not a bug, in C++ world, its a pure crime ;)
Comment 8 Jonathan Wakely 2019-05-27 23:29:21 UTC
Yes, I'm well aware what the standard says, I reported https://wg21.link/lwg2674

I consider it a mistake for the standard to say "these are bidirectional iterators, except they're not, but it's OK because at some point in the future a subset of algorithms won't care about that requirement any more". GCC's implementation provides real bidirectional iterators that meet the C++17 requirements and the C++2a BidirectionalIterator requirement that:

"Pointers and references obtained from a forward iterator into a range [i, s) shall remain valid while [i, s) continues to denote a range."

The libc++ implementation also fails this test:
https://wandbox.org/permlink/RwiFphMn1Hh2iis4
It's OK though, because you don't pay for what you don't use!

The design trade off was discussed in Bug 71044 comment 6. Populating the range lazily would still be possible (as long as it was thread-safe) but I'm not convinced it would be an improvement. In theory the range could also be shared by copies of the same path (again, as long as it was thread-safe). Those optimizations can be considered later, but for now the implementation aims to be correct and not produce dangling pointers.

As I said, I'll reduce the number of allocations done by the GCC 8.x implementation, but I'm not planning on any other changes for now.
Comment 9 baltic 2019-05-28 08:32:18 UTC
(In reply to Jonathan Wakely from comment #8)

> The libc++ implementation also fails this test:

As i've shown before, neither of those are failures. By the current C++ standard at least.

So, long story short: "I am not going to fix the x10 overhead, because I believe the standard is wrong."

Ok.
But damn! how good clang looks on the same test:

about to quit. total allocated 0 bytes
Comment 10 Jonathan Wakely 2019-05-28 14:58:06 UTC
Author: redi
Date: Tue May 28 14:57:35 2019
New Revision: 271710

URL: https://gcc.gnu.org/viewcvs?rev=271710&root=gcc&view=rev
Log:
PR libstdc++/90634 reduce allocations in filesystem::path construction

	PR libstdc++/90634
	* include/bits/fs_path.h (path::path(path&&)): Only call
	_M_split_cmpts() for a path with multiple components.
	(path::_S_is_dir_sep()): Add missing 'static' keyword to function.
	* include/experimental/bits/fs_path.h: Likewise.
	* src/filesystem/path.cc (path::_M_split_cmpts()): Count number of
	components and reserve space in vector. Return early when there is
	only one component.
	* src/filesystem/std-path.cc (path::_M_split_cmpts()): Likewise.

Modified:
    branches/gcc-8-branch/libstdc++-v3/ChangeLog
    branches/gcc-8-branch/libstdc++-v3/include/bits/fs_path.h
    branches/gcc-8-branch/libstdc++-v3/include/experimental/bits/fs_path.h
    branches/gcc-8-branch/libstdc++-v3/src/filesystem/path.cc
    branches/gcc-8-branch/libstdc++-v3/src/filesystem/std-path.cc
Comment 11 Jonathan Wakely 2019-05-28 16:09:22 UTC
Author: redi
Date: Tue May 28 16:08:51 2019
New Revision: 271712

URL: https://gcc.gnu.org/viewcvs?rev=271712&root=gcc&view=rev
Log:
Fix check for root-directory path and add tests

	PR libstdc++/90634
	* src/filesystem/path.cc (path::_M_split_cmpts()): Fix check for "/".
	* testsuite/27_io/filesystem/path/construct/90634.cc: New test.
	* testsuite/experimental/filesystem/path/construct/90634.cc: New test.

Added:
    branches/gcc-8-branch/libstdc++-v3/testsuite/27_io/filesystem/path/construct/90634.cc
    branches/gcc-8-branch/libstdc++-v3/testsuite/experimental/filesystem/path/construct/90634.cc
Modified:
    branches/gcc-8-branch/libstdc++-v3/ChangeLog
    branches/gcc-8-branch/libstdc++-v3/src/filesystem/path.cc
Comment 12 Jonathan Wakely 2019-05-28 19:40:19 UTC
Author: redi
Date: Tue May 28 19:39:48 2019
New Revision: 271717

URL: https://gcc.gnu.org/viewcvs?rev=271717&root=gcc&view=rev
Log:
PR libstdc++/90634 reduce allocations in filesystem::path construction

	PR libstdc++/90634
	* include/experimental/bits/fs_path.h (path::path(path&&)): Only call
	_M_split_cmpts() for a path with multiple components.
	(path::_S_is_dir_sep()): Add missing 'static' keyword to function.
	* src/filesystem/path.cc (path::_M_split_cmpts()): Count number of
	components and reserve space in vector. Return early when there is
	only one component.
	* testsuite/27_io/filesystem/path/construct/90634.cc: New test.
	* testsuite/experimental/filesystem/path/construct/90634.cc: New test.

Added:
    trunk/libstdc++-v3/testsuite/27_io/filesystem/path/construct/90634.cc
    trunk/libstdc++-v3/testsuite/experimental/filesystem/path/construct/90634.cc
Modified:
    trunk/libstdc++-v3/ChangeLog
    trunk/libstdc++-v3/include/experimental/bits/fs_path.h
    trunk/libstdc++-v3/src/filesystem/path.cc
Comment 13 Jonathan Wakely 2019-05-28 20:49:03 UTC
Author: redi
Date: Tue May 28 20:48:31 2019
New Revision: 271719

URL: https://gcc.gnu.org/viewcvs?rev=271719&root=gcc&view=rev
Log:
PR libstdc++/90634 reduce allocations in filesystem::path construction

Backport from mainline
2019-05-28  Jonathan Wakely  <jwakely@redhat.com>

	PR libstdc++/90634
	* include/experimental/bits/fs_path.h (path::path(path&&)): Only call
	_M_split_cmpts() for a path with multiple components.
	(path::_S_is_dir_sep()): Add missing 'static' keyword to function.
	* src/filesystem/path.cc (path::_M_split_cmpts()): Count number of
	components and reserve space in vector. Return early when there is
	only one component.
	* testsuite/27_io/filesystem/path/construct/90634.cc: New test.
	* testsuite/experimental/filesystem/path/construct/90634.cc: New test.

Added:
    branches/gcc-9-branch/libstdc++-v3/testsuite/27_io/filesystem/path/construct/90634.cc
    branches/gcc-9-branch/libstdc++-v3/testsuite/experimental/filesystem/path/construct/90634.cc
Modified:
    branches/gcc-9-branch/libstdc++-v3/ChangeLog
    branches/gcc-9-branch/libstdc++-v3/include/experimental/bits/fs_path.h
    branches/gcc-9-branch/libstdc++-v3/src/filesystem/path.cc
Comment 14 Jonathan Wakely 2019-05-29 14:42:19 UTC
(In reply to baltic from comment #9)
> (In reply to Jonathan Wakely from comment #8)
> 
> > The libc++ implementation also fails this test:
> 
> As i've shown before, neither of those are failures. By the current C++
> standard at least.

It's true that the standard does not require path::iterator to be usable in generic algorithms that expect forward iterators or bidirectional iterators.

I think the standard is wrong.

> So, long story short: "I am not going to fix the x10 overhead, because I
> believe the standard is wrong."

Yes.

The vector reallocations during path construction are gone for gcc 8.4 (and also in experimental::filesystem::path in all branches except gcc-7-branch).
Comment 15 baltic 2019-05-30 09:28:52 UTC
(In reply to Jonathan Wakely from comment #14)
 
> It's true that the standard does not require path::iterator to be usable in
> generic algorithms that expect forward iterators or bidirectional iterators.

Indeed! Ability to run a modifying algorithm on a path's elements is extremely useful. Sorting path's components makes a lot of practical sense. So does std::random_shuffle them. And ability to heapify them (std::make_heap) is basically unavoidable in any app, which has to deal with filesystem::path. 
But what weirdo would want to store hundreds of thousands of paths without x10 memory overhead, right?

So i agree. Just screw the C++ standard and do what you think is best for ppl out there.

Thanks for your great work!
Comment 16 Jonathan Wakely 2019-06-26 15:12:46 UTC
Author: redi
Date: Wed Jun 26 15:12:15 2019
New Revision: 272697

URL: https://gcc.gnu.org/viewcvs?rev=272697&root=gcc&view=rev
Log:
PR libstdc++/90634 reduce allocations in filesystem::path construction

Backport from mainline
2019-05-28  Jonathan Wakely  <jwakely@redhat.com>

	PR libstdc++/90634
	* include/experimental/bits/fs_path.h (path::path(path&&)): Only call
	_M_split_cmpts() for a path with multiple components.
	(path::_S_is_dir_sep()): Add missing 'static' keyword to function.
	* src/filesystem/path.cc (path::_M_split_cmpts()): Count number of
	components and reserve space in vector. Return early when there is
	only one component.
	* testsuite/experimental/filesystem/path/construct/90634.cc: New test.

Added:
    branches/gcc-7-branch/libstdc++-v3/testsuite/experimental/filesystem/path/construct/90634.cc
Modified:
    branches/gcc-7-branch/libstdc++-v3/ChangeLog
    branches/gcc-7-branch/libstdc++-v3/include/experimental/bits/fs_path.h
    branches/gcc-7-branch/libstdc++-v3/src/filesystem/path.cc