WIP: Implement Filesystem TS

Jonathan Wakely jwakely@redhat.com
Mon Aug 4 22:36:00 GMT 2014


On 04/08/14 21:40 +0200, Marc Glisse wrote:
>On Mon, 4 Aug 2014, Jonathan Wakely wrote:
>
>>On 04/08/14 14:50 +0100, Jonathan Wakely wrote:
>>>This is a 99% complete implementation of the Filesystem TS as defined
>>>by the N4099 draft,
>>>http://open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4099.html
>>
>>The missing 1% relates to code conversions from wide character
>>strings, which will be easier with the C++11 codecvt facets.
>>
>>And that 99% only refers to POSIX systems, there are large chunks
>>missing for Windows, but as I don't have access to a Windows PC and
>>don't know the API someone else will have to provide those missing
>>pieces. I hope no major redesign will be needed to add Windows support
>>though.
>>
>>I plan to document the design, but I'm not sure when that will happen,
>>so for now:
>>
>>* filesystem::path is implemented as a basic_string containing the
>>native path, an enumeration describing the type of path represented
>>(either an individual component such as root-name, root-dir,
>>directory, filename, or a composite path of multiple components) and
>>a std::list containing the separate components of a composite path
>>(which is empty unless the path has multiple components).
>
>Any reason not to make it a vector?

No reason.

I started with a std::list because I thought I might want to splice
elements from one list to another when concatenating paths, but never
implemented that (instead I just re-parse the native string with
_M_split_cmpts() as needed). I considered changing it to a std::vector
then never got around to it. It would probably be an improvement.

>(N4099 writes --end() in a few places, I don't remember seeing text 
>explaining that this is ok even if end() returns a pointer, while we 
>do have text explaining what end()-1 means)

The synopsis in [class.path] declares path::iterator as a class type,
so it can't be a pointer.

>>The list exists so that dereferencing a filesystem::path::iterator 
>>can return a reference to an object that is stored somewhere outside 
>>the iterator, as required for forward iterators.
>
>I find it a bit scary that this wart in the standard (the iterator 
>concepts mix traversal and access properties) has such an impact on 
>the design of the rest of the library. We might still have kept a 
>vector with the indices of the '/' or something, but having never 
>looked at the FS proposals I was expecting iterators to return 
>something similar to a string_view. Now I agree you have little choice 
>with the current wording (I didn't check the status of the LWG issue 
>about iterators returning references to themselves but you nicely 
>added a reminder of what the conclusion was :-).

Yes, while I was doing the final editorial review on the TS I realised
the design meant path objects must contain other path objects. I
decided to implement the TS to make sure I wasn't missing anything
else non-obvious. I struggled for a while to find some way to
implement the path::iterator type without having paths contain other
paths and couldn't find a solution. The enumeration identifying the
path type is needed so that the child paths don't also try to fill a
std::list of child paths and so on forever until either the stack or
heap runs out of space!

Boost's path::iterator uses the Boost bidirectional_traversal_tag and
claims its category is bidirectional iterator, but as it returns a
reference to a path object inside the iterator it fails to meet the
forward iterator requirements (specifically [forward.iterators]/6).

>>* It might be possible to optimize path by lazily populating the
>>std::list, so that copying paths and passing them around by value
>>just copies the basic_string containing the native path, and it only
>>gets parsed to find the individual components as needed.
>
>Or we could parse eagerly and not need to store the full string, but 
>that's probably less efficient if we are going to need the string soon 
>to pass it to the system.

I believe the generic_string() members defined in
http://cplusplus.github.io/filesystem-ts/working-draft.html#path-generic-obs
could be reconstructed from the individual components, but not the
native() string.

http://cplusplus.github.io/filesystem-ts/working-draft.html#path-append 
defines operations in terms of concatenating the native() strings, not 
on normalized forms.

IIUC the full string must be kept around, it might contain information
not in the parsed version, e.g. path("foo//bar") and path("foo/bar")
have the same individual components, but different values for
native().

>>* directory_iterator holds a shared_ptr<_Dir> where _Dir is a pimpl
>>class containing a DIR* returned by opendir(), a path object
>>containing the path the dir was opened with, and a directory_entry
>>object that gets returned by dereferencing the iterator. It also
>>contained a file_type enumeration, which gets used on GNU and BSD
>>platforms where the dirent struct contains the file type, which
>>means no stat() system call is needed to find out whether the
>>current entry is a directory and should be recursed into.
>>
>>* recursive_directory_iterator holds a shared_ptr to a
>>stack<pair<_Dir, directory_iterator>> representing each directory
>>recursed into and the position within that directory. The
>>shared_ptr<_Dir>s belong to the directory_iterator objects in the
>>stack alias the shared_ptr held by the parent
>>recursive_directory_iterator, so the reference counts are shared by
>>the whole stack.
>
>I don't understand the last sentence of this paragraph.

Sorry, it should have said "... belonging to the directory_iterator
objects in the stack ..."

> I don't know 
>what a parent recursive_directory_iterator is, and from what I 
>understand each directory_iterator in the stack iterates on a 
>different directory so there is nothing to share.

Every directory_iterator contains a shared_ptr<_Dir>. This is good for
users working with directory_iterator explicitly, because copies of
the same directory_iterator refer to the same underlying "sequence",
and all the state is stored in the _Dir object (directory_iterators
are single-pass input iterators, like the DIR struct operated on by
readdir()).

A recursive_directory_iterator contains a shared_ptr<_Dir_stack>,
where the stack contains N directory_iterators (one for each directory
level that has been recursed into). Again, sharing the underlying
state between different recursive_directory_iterator objects is
desirable, and having the underlying state live external to the
recursive_directory_iterator objects themselves is necessary.

A naive recursive_directory_iterator implementation such as
std::shared_ptr<std::stack<directory_iterator>> would have N+1
different shared_ptr control blocks with separate reference counts.

However, those directory_iterators in the stack owned by a
recursive_directory_iterator will never be shared, because they are
not exposed directly to users, so their shared_ptr members would
always have use_count()==1. Having N reference counts that are always
equal to 1 is wasteful.

One solution would be for recursive_directory_iterator to not use
directory_iterator, but work with a stack of unique_ptr<_Dir> objects
instead, re-implementing the required directory_iterator members that
operate on the _Dir object. I didn't try that, but I think it would
work.

My recursive_directory_iterator implementation instead uses
shared_ptr<stack<pair<_Dir, directory_iterator>>>, where the
shared_ptr<_Dir> held by each directory_iterator shares ownership with
the shared_ptr<stack<...>> (using the shared_ptr aliasing constructor)
so there is only a single control block for the whole group of
objects.

So the stack at *m_dirs has the following invariants:

  if (m_dirs && !m_dirs->empty())
    {
      // for any element in the stack
      pair<_Dir, directory_iterator>& x = m_dirs->top();

      // the directory_iterator uses the _Dir in the stack
      assert( x.second.m_dir.get() == &x.first );

      // the directory iterator shares ownership with the stack
      assert( !m_dirs.owner_before(x.second.m_dir) );
      assert( !x.second.m_dir.owner_before(m_dirs) );
    }

Does that make it clearer?



More information about the Libstdc++ mailing list