This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: Selectable execution character set (and a whole bunch of side effects)
"Joseph S. Myers" <jsm28@cam.ac.uk> writes:
> On Fri, 4 Jul 2003, Zack Weinberg wrote:
>
>> * It was necessary to fix string constant concatenation, which now
>> works the way the standard says it does. A consequence is that we
>
> Meaning what exactly, bearing in mind the confusion in the standard in
> this area (as discussed in
> <http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/n951.htm>)?
I meant that conversion to the execution character set (translation
phase 5) is now aware of whether any string literal in the series to
be concatenated (translation phase 6) has an L prefix. Formerly,
un-prefixed literals were converted to the narrow execution character
set and then padded to the width of wchar_t, which is wrong.
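To make the concatenation case concrete, here is a minimal sketch of
the kind of source this affects. The names mixed, wide, mixed_units
and mixed_equals_wide are made up for illustration and are not part of
the patch. It only uses ASCII, where the old pad-to-width behavior
happens to give the same values; the difference shows up once the
un-prefixed literal contains characters that take more than one byte
in the narrow execution character set.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* One literal in the series has an L prefix, so the concatenated result
   is a wide string. The un-prefixed "cd" therefore has to be converted
   directly to the wide execution character set. */
static const wchar_t mixed[] = L"ab" "cd";
static const wchar_t wide[]  = L"abcd";

/* Number of wchar_t elements in the mixed literal, including the
   terminating null. */
size_t mixed_units(void)
{
    return sizeof mixed / sizeof mixed[0];
}

/* Returns nonzero when the mixed concatenation is identical to the
   literal written with a single L prefix. */
int mixed_equals_wide(void)
{
    assert(sizeof mixed == sizeof wide);
    return wcscmp(mixed, wide) == 0;
}
```

Calling mixed_equals_wide() from a test program should report that the
two spellings produce the same wide string.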
Reading through that document, I see that I mostly implemented the
model described there. It's easier to explain the deviations in terms
of the list of principles:
P1, P2, P3, P4 are all honored exactly.
P5 is honored if and only if GCC's assumption about the encoding
produced by mbstowcs is correct - the user can override with
-fwide-exec-charset if GCC gets it wrong. Unfortunately, as far as
I know there is no way to find out what the encoding produced by
mbstowcs is. (mbstowcs is not used, nor can it be used, as host
and target may not agree.)
P6 is honored if and only if the encoding that GCC assumes is produced
by mbstowcs has this property. In particular, UTF-16 does NOT have
this property. There is nothing we can do about that, if the
target machine's ABI specifies 16-bit wchar_t.
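A short sketch of why UTF-16 is the problem case, assuming (from the
wording of this message, not quoted from N951) that P6 amounts to
"every character fits in a single wide character". The function
utf16_encode is a hypothetical helper written for this illustration,
not GCC code. Code points above U+FFFF need a surrogate pair, so with
a 16-bit wchar_t one character becomes two wide characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 (hypothetical helper).
   Writes 1 or 2 code units to out and returns how many were written,
   or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        /* Basic Multilingual Plane: one code unit. */
        out[0] = (uint16_t)cp;
        return 1;
    }
    /* Supplementary planes: split the 20-bit offset into a high and a
       low surrogate. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

For example, U+20AC comes out as the single unit 0x20AC, while U+1D11E
comes out as the pair 0xD834 0xDD1E.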
P7 is honored with several caveats:
- UCNs can produce one or more bytes (this is a simple oversight in
the list, or else they were lumping UCNs with multibyte source
characters)
- the guarantee that mbstowcs applied to an individual multibyte
sequence always produces a single wide character holds only if
P6 holds
- GCC will make no special efforts to preserve the original byte
sequences from the input file.
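To illustrate the first caveat, here is a sketch of how a single UCN
can expand to several bytes, assuming the narrow execution character
set is UTF-8 (other charsets will differ). The function utf8_encode is
a hypothetical helper for this illustration only and is not how GCC
performs the conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 (hypothetical helper).
   Writes 1 to 4 bytes to out and returns the count, or 0 if cp is a
   surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Under that assumption, the UCN \u00e9 in a narrow string literal
corresponds to the two bytes 0xC3 0xA9, and \u20ac to the three bytes
0xE2 0x82 0xAC.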
P8 is presently honored vacuously, as the only acceptable input
encoding is UTF-8, which has no shift sequences. However, my
plan for implementing selectable input encodings will get this
right automatically.
P9: again, GCC will not go to any special effort to preserve the
original byte sequences from the input file.
zw