[PATCH] Various fixes for <codecvt> facets

Jonathan Wakely jwakely@redhat.com
Fri Mar 17 19:29:00 GMT 2017


On 16/03/17 15:23 +0000, Jonathan Wakely wrote:
>On 14/03/17 18:46 +0000, Jonathan Wakely wrote:
>>On 13/03/17 19:35 +0000, Jonathan Wakely wrote:
>>>This is a series of patches to fix various bugs in the Unicode
>>>character conversion facets.
>>>
>>>The first patch fixes a silly < versus <= bug that meant that 0xffff
>>>got written as a surrogate pair instead of as simply 0xffff, and an
>>>endianness bug for the internal representation of UTF-16 code units
>>>stored in char32_t or wchar_t values. That's PR 79511.
>>>
>>>The second patch fixes some incorrect bitwise operations (because I
>>>confused & and |) and some incorrect limits (because I confused max
>>>and min). That fixes determining the endianness of the external
>>>representation bytes when they start with a Byte Order Mark, and
>>>correctly reports errors on invalid UCS2. It also fixes
>>>wstring_convert so that it reports the number of characters that were
>>>converted prior to an error. That's PR 79980.
>>>
>>>The third patch fixes the output of the encoding() and max_length()
>>>member functions on the codecvt facets, because I wasn't correctly
>>>accounting for a BOM or for the differences between UTF-16 and UCS2.
>>>
>>>I plan to commit these for all branches, but I'll wait until after GCC
>>>7.1 is released, and fix it for 7.2 instead. These bugs aren't
>>>important enough to rush into trunk now.
>>
>>One more patch for a problem found by the libc++ testsuite. Now we
>>pass all the libc++ tests, and we even pass a test that libc++ fails.
>>With this, I hope our <codecvt> is 100% conforming. Just in time to be
>>deprecated for C++17 :-)
>
>I've committed these to trunk, on the basis that they're intended to
>be backported to all branches anyway (fixing features that are
>currently broken in all branches). There's no point waiting if we plan
>to commit them anyway; it would just mean doing an extra backport (5,
>6, 7 *and* 8).
>
>Backports will be done soon.
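
To illustrate the < versus <= boundary described for the first patch,
here is a minimal sketch of encoding one code point as UTF-16 code
units. It is not the libstdc++ code: the name encode_utf16 and its
signature are hypothetical, chosen only for this example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a libstdc++ function): encode one code point
// as UTF-16 code units. Returns the number of units written (1 or 2),
// or 0 if the code point is not a valid Unicode scalar value.
inline std::size_t encode_utf16(char32_t c, char16_t out[2])
{
    assert(out != nullptr);
    // Reject values beyond U+10FFFF and lone surrogates.
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    // Must be '<=': with '<', U+FFFF would fall through to the
    // surrogate-pair branch below even though it fits in one unit.
    if (c <= 0xFFFF) {
        out[0] = static_cast<char16_t>(c);
        return 1;
    }
    // Supplementary plane: split the 20-bit offset into two 10-bit halves.
    c -= 0x10000;
    out[0] = static_cast<char16_t>(0xD800 + (c >> 10));
    out[1] = static_cast<char16_t>(0xDC00 + (c & 0x3FF));
    return 2;
}
```

With this check, 0xffff is written as a single code unit and 0x10000
is the first value written as a surrogate pair.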

I backported all the recent <codecvt> fixes to gcc-6-branch and it was
failing one test, due to unaligned reads in std::codecvt_utf16. That
type reads UTF-16 data from a const char* (Why narrow characters when
we have char16_t? Because <codecvt> likes to be awkward) and I was
doing that by casting the const char* to const char16_t*. That isn't
safe when the first char isn't aligned correctly for a char16_t.

This patch fixes all the unaligned accesses by abstracting the
operations on the pointers to use new overloaded operators on the
range<Elem> type. A new partial specialization range<Elem, false>
uses memcpy to read/write char16_t values from the char*, avoiding
alignment problems. The primary template (range<Elem, true>) just
dereferences the pointers directly.
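
The memcpy technique can be sketched as follows. This is only an
illustration, not the code in the patch: load_u16 and store_u16 are
hypothetical free functions, whereas the patch wraps the same idea in
operators on range<Elem, false>. Values are copied in native byte
order; any endianness handling would happen separately.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper (not the name used in the patch): read one
// char16_t from a byte pointer that may not be suitably aligned.
// Casting p to const char16_t* and dereferencing would be undefined
// behaviour when p is misaligned; memcpy has no such requirement.
inline char16_t load_u16(const char* p)
{
    assert(p != nullptr);
    char16_t value;
    std::memcpy(&value, p, sizeof(value));
    return value;
}

// Hypothetical counterpart: write one char16_t to a possibly
// unaligned byte pointer, again in native byte order.
inline void store_u16(char* p, char16_t value)
{
    assert(p != nullptr);
    std::memcpy(p, &value, sizeof(value));
}
```

Compilers typically turn a fixed-size memcpy like this into a plain
load or store on targets that allow unaligned access, so the safe
form need not cost anything there.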

Tested x86_64-linux, powerpc64le-linux, powerpc64-linux,
powerpc-ibm-aix7.2.0.0 (which has 2-byte wchar_t).

Also tested with ubsan to confirm the unaligned accesses are gone.

Committed to trunk, gcc-6-branch, gcc-5-branch.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch.txt
Type: text/x-patch
Size: 36289 bytes
Desc: not available
URL: <http://gcc.gnu.org/pipermail/libstdc++/attachments/20170317/31309ddc/attachment.bin>

