Bug 90281 - utf-8 encoded std::filesystem::path can not be converted to utf-16.
Summary: utf-8 encoded std::filesystem::path can not be converted to utf-16.
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++ (show other bugs)
Version: 8.3.0
: P3 normal
Target Milestone: 9.0
Assignee: Jonathan Wakely
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2019-04-29 15:14 UTC by Steffen Schuemann
Modified: 2020-02-26 09:37 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-04-29 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Steffen Schuemann 2019-04-29 15:14:15 UTC
During tests on an implementation of a std::fs compatible backport for C++11/C++14 I found the following issue when running my checks against GCCs C++17 std::fs code. The following example

#include <filesystem>

int main()
{
    auto p = std::filesystem::u8path("\xf0\x9d\x84\x9e").u16string();
    return 0;
}

fails with:

terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
  what():  filesystem error: Cannot convert character sequence: Invalid or incomplete multibyte or wide character

Tested with GCC 8.1.0 and 8.2.0 on Ubuntu 18.04 and macOS, and GCC 8.3.0 on Wandbox.

The UTF-8 sequence is the musical symbol "clef" (U+1D11E) and on both systems I can create a file with that name, e.g. from the shell. Even when then iterating over the directory and calling u16string() on the filename of the directory_entry, this exception will be thrown, but the example code is simpler.

(Just as additional info: \xf0\x9d\x84\x9e is the correct UTF8 encoding for U+1D11E and this works with clang and its libc++/libc++fs.)
Comment 1 Jonathan Wakely 2019-04-29 15:43:15 UTC
The problem is that I'm using codecvt_utf8<char16_t>, which converts between UTF-8 and UCS-2 (not UTF-16). The U+1D11E is outside the basic multilingual plane, so is not valid UCS-2.

I need to use a different codecvt facet for UTF-16.
Comment 2 Jonathan Wakely 2019-04-30 22:22:12 UTC
Patch posted: https://gcc.gnu.org/ml/gcc-patches/2019-04/msg01242.html
Comment 3 Jonathan Wakely 2019-06-17 14:19:36 UTC
Author: redi
Date: Mon Jun 17 14:19:04 2019
New Revision: 272385

URL: https://gcc.gnu.org/viewcvs?rev=272385&root=gcc&view=rev
Log:
PR libstdc++/90281 Fix string conversions for filesystem::path

Fix several bugs in the encoding conversions for filesystem::path that
prevent conversion of Unicode characters outside the Basic Multilingual
Plane, and prevent returning basic_string specializations with
alternative allocator types.

The std::codecvt_utf8 class template is not suitable for UTF-16
conversions because it uses UCS-2 instead. For conversions between UTF-8
and UTF-16 either std::codecvt<C, char, mbstate> or
codecvt_utf8_utf16<C> must be used.

The __str_codecvt_in and __str_codecvt_out utilities do not
return false on a partial conversion (e.g. for invalid or incomplete
Unicode input). Add new helpers that treat partial conversions as
errors, and use them for all filesystem::path conversions.

	PR libstdc++/90281 Fix string conversions for filesystem::path
	* include/bits/fs_path.h (u8path) [_GLIBCXX_FILESYSTEM_IS_WINDOWS]:
	Use codecvt_utf8_utf16 instead of codecvt_utf8. Use
	__str_codecvt_in_all to fail for partial conversions and throw on
	error.
	[!_GLIBCXX_FILESYSTEM_IS_WINDOWS && _GLIBCXX_USE_CHAR8_T]
	(path::_Cvt<char8_t>): Add explicit specialization.
	[_GLIBCXX_FILESYSTEM_IS_WINDOWS] (path::_Cvt::_S_wconvert): Remove
	overloads.
	[_GLIBCXX_FILESYSTEM_IS_WINDOWS] (path::_Cvt::_S_convert): Use
	if-constexpr instead of dispatching to _S_wconvert. Use codecvt
	instead of codecvt_utf8. Use __str_codecvt_in_all and
	__str_codecvt_out_all.
	[!_GLIBCXX_FILESYSTEM_IS_WINDOWS] (path::_Cvt::_S_convert): Use
	codecvt instead of codecvt_utf8. Use __str_codecvt_out_all.
	(path::_S_str_convert) [_GLIBCXX_FILESYSTEM_IS_WINDOWS]: Use
	codecvt_utf8_utf16 instead of codecvt_utf8. Construct return values
	with allocator. Use __str_codecvt_out_all. Fallthrough to POSIX code
	after converting to UTF-8.
	(path::_S_str_convert): Use codecvt instead of codecvt_utf8. Use
	__str_codecvt_in_all.
	(path::string): Fix initialization of string types with different
	allocators.
	(path::u8string) [_GLIBCXX_FILESYSTEM_IS_WINDOWS]: Use
	codecvt_utf8_utf16 instead of codecvt_utf8. Use __str_codecvt_out_all.
	* include/bits/locale_conv.h (__do_str_codecvt): Reorder static and
	runtime conditions.
	(__str_codecvt_out_all, __str_codecvt_in_all): New functions that
	return false for partial conversions.
	* include/experimental/bits/fs_path.h (u8path):
	[_GLIBCXX_FILESYSTEM_IS_WINDOWS]: Implement correctly for mingw.
	[_GLIBCXX_FILESYSTEM_IS_WINDOWS] (path::_Cvt::_S_wconvert): Add
	missing handling for char8_t. Use codecvt and codecvt_utf8_utf16
	instead of codecvt_utf8. Use __str_codecvt_in_all and
	__str_codecvt_out_all.
	[!_GLIBCXX_FILESYSTEM_IS_WINDOWS] (path::_Cvt::_S_convert): Use
	codecvt instead of codecvt_utf8. Use __str_codecvt_out_all.
	(path::string) [_GLIBCXX_FILESYSTEM_IS_WINDOWS]: Use
	codecvt_utf8_utf16 instead of codecvt_utf8. Construct return values
	with allocator. Use __str_codecvt_out_all and __str_codecvt_in_all.
	(path::string) [!_GLIBCXX_FILESYSTEM_IS_WINDOWS]: Use
	__str_codecvt_in_all.
	(path::u8string) [_GLIBCXX_FILESYSTEM_IS_WINDOWS]: Use
	codecvt_utf8_utf16 instead of codecvt_utf8. Use __str_codecvt_out_all.
	* src/c++17/fs_path.cc (path::_S_convert_loc): Use
	__str_codecvt_in_all.
	* src/filesystem/path.cc (path::_S_convert_loc): Likewise.
	* testsuite/27_io/filesystem/path/construct/90281.cc: New test.
	* testsuite/27_io/filesystem/path/factory/u8path.cc: New test.
	* testsuite/27_io/filesystem/path/native/string.cc: Test with empty
	strings and with Unicode characters outside the basic multilingual
	plane.
	* testsuite/27_io/filesystem/path/native/alloc.cc: New test.
	* testsuite/experimental/filesystem/path/construct/90281.cc: New test.
	* testsuite/experimental/filesystem/path/factory/u8path.cc: New test.
	* testsuite/experimental/filesystem/path/native/alloc.cc: New test.
	* testsuite/experimental/filesystem/path/native/string.cc: Test with
	empty strings and with Unicode characters outside the basic
	multilingual plane.

Added:
    trunk/libstdc++-v3/testsuite/27_io/filesystem/path/construct/90281.cc
    trunk/libstdc++-v3/testsuite/27_io/filesystem/path/factory/
    trunk/libstdc++-v3/testsuite/27_io/filesystem/path/factory/u8path.cc
      - copied, changed from r272381, trunk/libstdc++-v3/testsuite/27_io/filesystem/path/native/string.cc
    trunk/libstdc++-v3/testsuite/27_io/filesystem/path/native/alloc.cc
    trunk/libstdc++-v3/testsuite/experimental/filesystem/path/construct/90281.cc
    trunk/libstdc++-v3/testsuite/experimental/filesystem/path/factory/
    trunk/libstdc++-v3/testsuite/experimental/filesystem/path/factory/u8path.cc
    trunk/libstdc++-v3/testsuite/experimental/filesystem/path/native/alloc.cc
Modified:
    trunk/libstdc++-v3/ChangeLog
    trunk/libstdc++-v3/include/bits/fs_path.h
    trunk/libstdc++-v3/include/bits/locale_conv.h
    trunk/libstdc++-v3/include/experimental/bits/fs_path.h
    trunk/libstdc++-v3/src/c++17/fs_path.cc
    trunk/libstdc++-v3/src/filesystem/path.cc
    trunk/libstdc++-v3/testsuite/27_io/filesystem/path/native/string.cc
    trunk/libstdc++-v3/testsuite/experimental/filesystem/path/native/string.cc
Comment 4 Jonathan Wakely 2019-06-17 15:04:18 UTC
Author: redi
Date: Mon Jun 17 15:03:46 2019
New Revision: 272389

URL: https://gcc.gnu.org/viewcvs?rev=272389&root=gcc&view=rev
Log:
PR libstdc++/90281 Fix string conversions for filesystem::path

Fix several bugs in the encoding conversions for filesystem::path that
prevent conversion of Unicode characters outside the Basic Multilingual
Plane, and prevent returning basic_string specializations with
alternative allocator types.

The std::codecvt_utf8 class template is not suitable for UTF-16
conversions because it uses UCS-2 instead. For conversions between UTF-8
and UTF-16 either std::codecvt<C, char, mbstate> or
codecvt_utf8_utf16<C> must be used.

The __str_codecvt_in and __str_codecvt_out utilities do not
return false on a partial conversion (e.g. for invalid or incomplete
Unicode input). Add new helpers that treat partial conversions as
errors, and use them for all filesystem::path conversions.

	PR libstdc++/90281 Fix string conversions for filesystem::path
	* include/bits/fs_path.h (u8path) [_GLIBCXX_FILESYSTEM_IS_WINDOWS]:
	Use codecvt_utf8_utf16 instead of codecvt_utf8. Use
	__str_codecvt_in_all to fail for partial conversions and throw on
	error.
	[!_GLIBCXX_FILESYSTEM_IS_WINDOWS && _GLIBCXX_USE_CHAR8_T]
	(path::_Cvt<char8_t>): Add explicit specialization.
	[_GLIBCXX_FILESYSTEM_IS_WINDOWS] (path::_Cvt::_S_wconvert): Remove
	overloads.
	[_GLIBCXX_FILESYSTEM_IS_WINDOWS] (path::_Cvt::_S_convert): Use
	if-constexpr instead of dispatching to _S_wconvert. Use codecvt
	instead of codecvt_utf8. Use __str_codecvt_in_all and
	__str_codecvt_out_all.
	[!_GLIBCXX_FILESYSTEM_IS_WINDOWS] (path::_Cvt::_S_convert): Use
	codecvt instead of codecvt_utf8. Use __str_codecvt_out_all.
	(path::_S_str_convert) [_GLIBCXX_FILESYSTEM_IS_WINDOWS]: Use
	codecvt_utf8_utf16 instead of codecvt_utf8. Construct return values
	with allocator. Use __str_codecvt_out_all. Fallthrough to POSIX code
	after converting to UTF-8.
	(path::_S_str_convert): Use codecvt instead of codecvt_utf8. Use
	__str_codecvt_in_all.
	(path::string): Fix initialization of string types with different
	allocators.
	(path::u8string) [_GLIBCXX_FILESYSTEM_IS_WINDOWS]: Use
	codecvt_utf8_utf16 instead of codecvt_utf8. Use __str_codecvt_out_all.
	* include/bits/locale_conv.h (__do_str_codecvt): Reorder static and
	runtime conditions.
	(__str_codecvt_out_all, __str_codecvt_in_all): New functions that
	return false for partial conversions.
	* include/experimental/bits/fs_path.h (u8path):
	[_GLIBCXX_FILESYSTEM_IS_WINDOWS]: Implement correctly for mingw.
	[_GLIBCXX_FILESYSTEM_IS_WINDOWS] (path::_Cvt::_S_wconvert): Add
	missing handling for char8_t. Use codecvt and codecvt_utf8_utf16
	instead of codecvt_utf8. Use __str_codecvt_in_all and
	__str_codecvt_out_all.
	[!_GLIBCXX_FILESYSTEM_IS_WINDOWS] (path::_Cvt::_S_convert): Use
	codecvt instead of codecvt_utf8. Use __str_codecvt_out_all.
	(path::string) [_GLIBCXX_FILESYSTEM_IS_WINDOWS]: Use
	codecvt_utf8_utf16 instead of codecvt_utf8. Construct return values
	with allocator. Use __str_codecvt_out_all and __str_codecvt_in_all.
	(path::string) [!_GLIBCXX_FILESYSTEM_IS_WINDOWS]: Use
	__str_codecvt_in_all.
	(path::u8string) [_GLIBCXX_FILESYSTEM_IS_WINDOWS]: Use
	codecvt_utf8_utf16 instead of codecvt_utf8. Use __str_codecvt_out_all.
	* src/c++17/fs_path.cc (path::_S_convert_loc): Use
	__str_codecvt_in_all.
	* src/filesystem/path.cc (path::_S_convert_loc): Likewise.
	* testsuite/27_io/filesystem/path/construct/90281.cc: New test.
	* testsuite/27_io/filesystem/path/factory/u8path.cc: New test.
	* testsuite/27_io/filesystem/path/native/string.cc: Test with empty
	strings and with Unicode characters outside the basic multilingual
	plane.
	* testsuite/27_io/filesystem/path/native/alloc.cc: New test.
	* testsuite/experimental/filesystem/path/construct/90281.cc: New test.
	* testsuite/experimental/filesystem/path/factory/u8path.cc: New test.
	* testsuite/experimental/filesystem/path/native/alloc.cc: New test.
	* testsuite/experimental/filesystem/path/native/string.cc: Test with
	empty strings and with Unicode characters outside the basic
	multilingual plane.

Added:
    branches/gcc-9-branch/libstdc++-v3/testsuite/27_io/filesystem/path/construct/90281.cc
    branches/gcc-9-branch/libstdc++-v3/testsuite/27_io/filesystem/path/factory/
    branches/gcc-9-branch/libstdc++-v3/testsuite/27_io/filesystem/path/factory/u8path.cc
      - copied, changed from r272374, branches/gcc-9-branch/libstdc++-v3/testsuite/27_io/filesystem/path/native/string.cc
    branches/gcc-9-branch/libstdc++-v3/testsuite/27_io/filesystem/path/native/alloc.cc
    branches/gcc-9-branch/libstdc++-v3/testsuite/experimental/filesystem/path/construct/90281.cc
    branches/gcc-9-branch/libstdc++-v3/testsuite/experimental/filesystem/path/factory/
    branches/gcc-9-branch/libstdc++-v3/testsuite/experimental/filesystem/path/factory/u8path.cc
    branches/gcc-9-branch/libstdc++-v3/testsuite/experimental/filesystem/path/native/alloc.cc
Modified:
    branches/gcc-9-branch/libstdc++-v3/ChangeLog
    branches/gcc-9-branch/libstdc++-v3/include/bits/fs_path.h
    branches/gcc-9-branch/libstdc++-v3/include/bits/locale_conv.h
    branches/gcc-9-branch/libstdc++-v3/include/experimental/bits/fs_path.h
    branches/gcc-9-branch/libstdc++-v3/src/c++17/fs_path.cc
    branches/gcc-9-branch/libstdc++-v3/src/filesystem/path.cc
    branches/gcc-9-branch/libstdc++-v3/testsuite/27_io/filesystem/path/native/string.cc
    branches/gcc-9-branch/libstdc++-v3/testsuite/experimental/filesystem/path/native/string.cc
Comment 5 Jonathan Wakely 2020-02-26 09:37:56 UTC
I no longer plan to backport this to the gcc-8 branch, so closing as fixed.