g++ off-by-one bug in utf16 conversion

John Schmerge jbschmerge@gmail.com
Sun Oct 26 06:22:00 GMT 2014


Hey guys,

I came across this bug earlier today while implementing some
unit tests for UTF-8/UTF-16 conversions. The following C++
fragment gives the wrong result:

#include <iostream>

int main() {
  char16_t s[] = u"\uffff";
  std::cout << std::hex << s[0] << " " << s[1] << std::endl;
}

it prints:
  d7ff dfff
whereas it should print:
  ffff 0
For those unfamiliar with UTF-16: all Unicode code points less
than or equal to 0xffff are stored as a single 16-bit value with
no conversion. Code points greater than 0xffff are converted
to a pair of 16-bit values, where the first is in the range
0xd800-0xdbff and the second is in the range 0xdc00-0xdfff.
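
To make the rule above concrete, here is a minimal sketch of a
UTF-16 encoder. The name encode_utf16 is made up for illustration
and is not taken from libcpp; the sketch assumes the input is a
valid code point (at most 0x10ffff and not itself a surrogate).

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, not libcpp code: encode one code point as UTF-16.
// Assumes cp is a valid Unicode scalar value.
std::vector<char16_t> encode_utf16(char32_t cp) {
  assert(cp <= 0x10ffff);
  if (cp <= 0xffff)  // the whole 16-bit range, 0xffff included
    return { static_cast<char16_t>(cp) };
  cp -= 0x10000;     // 20 significant bits remain
  return { static_cast<char16_t>(0xd800 + (cp >> 10)),     // high surrogate
           static_cast<char16_t>(0xdc00 + (cp & 0x3ff)) }; // low surrogate
}
```

With this sketch, 0xffff yields the single unit ffff, and 0x10000
is the first code point that yields a pair (d800 dc00).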

Clearly this is an off-by-one issue. I traced it to the use of
a less-than operator where a less-than-or-equal operator is
needed in libcpp/charset.c.
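
The effect of that one comparison can be reproduced in isolation.
The following is a standalone sketch, not the libcpp code: the
function to_utf16 and its strict flag are invented here purely to
switch between the two comparisons, and the surrogate arithmetic
is the usual textbook formula.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

using Units = std::pair<std::uint16_t, std::uint16_t>;

// Illustration only. strict == true uses the off-by-one `<` test.
Units to_utf16(std::uint32_t cp, bool strict) {
  assert(cp <= 0x10ffff);
  bool one_unit = strict ? (cp < 0xffff) : (cp <= 0xffff);
  if (one_unit)
    return Units(static_cast<std::uint16_t>(cp), 0);
  // For cp == 0xffff the subtraction wraps around (unsigned), and the
  // truncation to 16 bits leaves a bogus surrogate pair.
  std::uint32_t v = cp - 0x10000;
  return Units(static_cast<std::uint16_t>(0xd800 + (v >> 10)),
               static_cast<std::uint16_t>(0xdc00 + (v & 0x3ff)));
}
```

Under these assumptions, to_utf16(0xffff, true) gives d7ff dfff,
matching the wrong output above, while to_utf16(0xffff, false)
gives ffff 0.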

I have verified that the bug is present in versions 4.4.7 (rhel 6.5),
4.8.2 (linaro/ubuntu/mint) and g++ (GCC) 5.0.0 20141025.
I am a bit surprised that this has gone unnoticed, or at least
unresolved, for so many years.

Attached is a patch to $gcc-root/libcpp/charset.c, made against the
gcc 4.8.2 sources from the gcc website, that fixes the issue in my tests.

Thanks,
John
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gcc-utf16.patch
Type: text/x-patch
Size: 250 bytes
Desc: not available
URL: <http://gcc.gnu.org/pipermail/gcc-bugs/attachments/20141026/5b738516/attachment.bin>

