This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: Selectable execution character set (and a whole bunch of sideeffects)


"Joseph S. Myers" <jsm28@cam.ac.uk> writes:

> On Fri, 4 Jul 2003, Zack Weinberg wrote:
>
>> * It was necessary to fix string constant concatenation, which now
>>   works the way the standard says it does.  A consequence is that we
>
> Meaning what exactly, bearing in mind the confusion in the standard in
> this area (as discussed in
> <http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/n951.htm>)?

I meant that conversion to the execution character set (translation
phase 5) is now aware of whether any string literal in the series to
be concatenated (translation phase 6) has an L prefix.  Formerly,
un-prefixed literals would get converted to the narrow execution
character set and then padded to the width of wchar_t, which is wrong.
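
A small sketch of what this means for user code, assuming a C99 or
later compiler (where mixing prefixed and un-prefixed literals is
defined to yield a wide string).  The names `mixed` and `mixed_length`
are hypothetical and exist only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Adjacent literals are concatenated in translation phase 6.  Because
   one of them carries an L prefix, the whole result is a wide string,
   so "abc" must be converted as wide characters too, not converted to
   the narrow execution character set and widened afterwards. */
static const wchar_t mixed[] = "abc" L"def";

/* Number of wide characters before the terminating null
   (illustrative helper, not part of GCC). */
static size_t mixed_length(void)
{
    return wcslen(mixed);
}
```

Every element of `mixed` is a wchar_t, including the ones that came
from the un-prefixed "abc".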

Reading through that document, I see that I mostly implemented the
model described there.  It's easier to explain the deviations in terms
of the list of principles:

P1, P2, P3, P4 are all honored exactly.

P5 is honored if and only if GCC's assumption about the encoding
   produced by mbstowcs is correct - the user can override with
   -fwide-exec-charset if GCC gets it wrong.  Unfortunately, as far as
   I know there is no way to find out what encoding the target's
   mbstowcs produces.  (GCC does not call mbstowcs itself, nor could
   it, as host and target may not agree.)

P6 is honored if and only if the encoding that GCC assumes is produced
   by mbstowcs has this property.  In particular, UTF-16 does NOT have
   this property.  There is nothing we can do about that, if the
   target machine's ABI specifies 16-bit wchar_t.
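
A minimal sketch of why UTF-16 is a problem, assuming the property in
question is that each character occupies exactly one wchar_t (a reading
based on the P7 caveat below; the principle's text is not quoted in
this message).  The function `utf16_encode` is a hypothetical helper,
not GCC code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode scalar value as UTF-16.
   Returns the number of 16-bit units written (1 or 2), or 0 if cp is
   not a valid scalar value (out of range or a surrogate). */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        /* Basic Multilingual Plane: one unit per character. */
        out[0] = (uint16_t)cp;
        return 1;
    }
    /* Supplementary planes: a surrogate pair, i.e. two units for a
       single character. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
    return 2;
}
```

With a 16-bit wchar_t, any character above U+FFFF therefore takes two
array elements.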

P7 is honored with several caveats:
   - UCNs can produce one or more bytes (this is a simple oversight in
     the list, or else they were lumping UCNs with multibyte source
     characters)
   - mbstowcs applied to an individual multibyte sequence will
     produce a single wide character only if P6 holds
   - GCC will make no special efforts to preserve the original byte
     sequences from the input file.
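
The first caveat can be sketched as follows, assuming a UTF-8 narrow
execution character set (an assumption for this example only; the
actual set is selectable).  `utf8_encode` is a hypothetical helper, not
GCC's implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode the code point named by a UCN (\uXXXX or \UXXXXXXXX) as
   UTF-8.  Returns the number of bytes written (1 to 4), or 0 if cp is
   not a valid Unicode scalar value. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Under that assumption, a single UCN such as \u00E9 contributes two
bytes to a narrow string, and \u20AC contributes three.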

P8 is presently honored vacuously, as the only acceptable input
   encoding is UTF-8, which has no shift sequences.  However, my
   plan for implementing selectable input encodings will get this
   right automatically.

P9: again, GCC will not go to special effort to preserve the original
   byte sequences from the input file.

zw

