Implementation-defined behavior or not?

esoteric escape manips88@gmail.com
Mon Jun 3 09:49:00 GMT 2019


Thanks! I see. Yes, I am speaking of C++17. Just to make sure I have
grasped it, here is how I understand it:

1. In the std::string case, we only care about the bits, regardless of the
values of the chars inside the std::string, and because the mapping is
one-to-one, that makes it well-defined.
2. In the case of char, the underlying bit representation changes if the
value is outside the representable range, and it's implementation-defined.

Say, I decide to manually do this:

std::string s = "\xE2\x82\xAC";

Or,

char c[3];
c[0] = 0xE2;
c[1] = 0x82;
c[2] = 0xAC;

Then, I suppose these cases will be more like my #2 above than #1, true?
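
One way to check what actually ends up stored in the two hand-written cases
is to read every char back as unsigned char and compare the bytes. The
following is only a minimal sketch, assuming CHAR_BIT == 8 and a two's
complement implementation such as GCC; the helper names (code_units,
euro_from_literal, euro_from_array, check_euro) are made up for illustration
and do not come from any library.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: view each char of s as the 8-bit code unit it stores.
// Converting char to unsigned char is modular, so it is well-defined even
// when char is signed and the stored value is negative.
std::vector<unsigned char> code_units(const std::string& s) {
    std::vector<unsigned char> out;
    for (char ch : s) {
        out.push_back(static_cast<unsigned char>(ch));
    }
    return out;
}

// Case 1: the bytes written as hex escapes in an ordinary string literal.
std::string euro_from_literal() {
    return "\xE2\x82\xAC";
}

// Case 2: the bytes assigned one by one. The explicit casts only make the
// int-to-char conversions visible; before C++20 the resulting char value is
// implementation-defined when char is signed.
std::string euro_from_array() {
    char c[3];
    c[0] = static_cast<char>(0xE2);
    c[1] = static_cast<char>(0x82);
    c[2] = static_cast<char>(0xAC);
    return std::string(c, 3);
}

// Assumption: on the implementation described above, both cases hold the
// same three code units.
void check_euro() {
    const std::vector<unsigned char> expected{0xE2, 0x82, 0xAC};
    assert(code_units(euro_from_literal()) == expected);
    assert(code_units(euro_from_array()) == expected);
}
```

Calling check_euro() from main() exercises both cases. The comparison is done
on unsigned char precisely so that it does not depend on whether plain char
is signed.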

On Mon, Jun 3, 2019 at 2:52 PM Jonathan Wakely <jwakely.gcc@gmail.com>
wrote:

> On Mon, 3 Jun 2019 at 10:07, esoteric escape wrote:
> >
> > Hello, I am on Windows OS where CHAR_BIT == 8.
> >
> > I am trying to understand whether this behavior is implementation-defined
> > or not.
> >
> > I have this string in UTF-8, and I am trying to understand if it is
> > implementation-defined:
> >
> > std::string s = u8"€";
> >
> > It's clear to me that char c = 0xC8 is implementation-defined for these
> > reasons:
> > 1. char's signedness depends on the compiler.
> > 2. If the value is beyond the representable range of char, say -128 to
> > 127, then again it is implementation-defined.
>
> The char value is implementation-defined, but there will be some
> unique value that corresponds to 0xC8 and can be unambiguously
> converted to (unsigned char)0xC8. For GCC (and in C++20) the
> conversion to char is the obvious two's complement one, producing
> (char)-56.
>
>
> >
> > In the same way, I am trying to understand how the std::string case is
> > handled because it also uses char.
> >
> > So, € in UTF-8 is the byte sequence E2 82 AC in hex. If std::string
> > uses, say, the signed version of char, don't these bytes fall beyond the
> > representable range, and isn't their value therefore
> > implementation-defined in std::string?
> > Or, is this case actually well-defined?
>
> Both. The code is perfectly well-defined in C++11, C++14 and C++17.
> The precise numerical values are implementation-defined, but there is
> a one-to-one mapping from 8-bit UTF-8 code units to char values, and
> back again.
>
> N.B. In C++20 the code is ill-formed and won't compile, because the
> type of u8"€" is const char8_t[4] which cannot be used to initialize a
> std::string. You'd need to cast it to (const char*) or use
> std::u8string instead.
>
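
The cast mentioned in the quoted C++20 note could look roughly like the
sketch below. It assumes CHAR_BIT == 8; euro_utf8 and check_euro_utf8 are
hypothetical names used only for illustration. The literal is written as
u8"\u20AC" so that the result does not depend on the source file encoding.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: copy the bytes of a u8 literal into a std::string.
// In C++17 the literal is const char[4], so the cast changes nothing.
// In C++20 the literal is const char8_t[4], and the cast is what allows
// the std::string to be constructed from it.
std::string euro_utf8() {
    return std::string(reinterpret_cast<const char*>(u8"\u20AC"));
}

// The string holds the three UTF-8 code units of U+20AC; they are read back
// through unsigned char so the check does not depend on char's signedness.
void check_euro_utf8() {
    const std::string s = euro_utf8();
    assert(s.size() == 3);
    assert(static_cast<unsigned char>(s[0]) == 0xE2);
    assert(static_cast<unsigned char>(s[1]) == 0x82);
    assert(static_cast<unsigned char>(s[2]) == 0xAC);
}
```

The other option named in the note, std::u8string, keeps the char8_t type
instead of casting it away, but it only exists from C++20 onwards.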


