libgcj/9802: Bug in surrogate handling in Unicode to UTF-8 conversion

Mark Wielaard
Sat Feb 22 13:39:00 GMT 2003

Thanks for the bug report.
Your suggested fix seems obviously correct, and I verified that making
sure that avail is always decremented makes String.getBytes("UTF-8")
work (read: it no longer throws an ArrayIndexOutOfBoundsException).

But while creating a test case I noticed that for your example we return
two bytes, {0xf0, 0x90}, but other implementations return four bytes,
{0xf0, 0x90, 0x8c, 0x80}. I don't know enough about Unicode and the
UTF-8 encoding to know which is correct or why.
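
For comparison, here is a minimal sketch of the encoding as the UTF-8
definition (RFC 3629) describes it: a UTF-16 surrogate pair is first
combined into a single code point, which is then written as one
four-byte sequence. This is not the libgcj converter code. The class and
method names are made up for illustration, and the input pair
0xD800 0xDF00 (U+10300) is an assumption, inferred from the four bytes
quoted above rather than taken from the original report.

```java
// Hypothetical illustration only; not the libgcj implementation.
class SurrogateUtf8Sketch {
    // Combine a UTF-16 high/low surrogate pair into one code point.
    public static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Encode a supplementary code point (U+10000..U+10FFFF)
    // as a four-byte UTF-8 sequence.
    public static byte[] encodeSupplementary(int cp) {
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),          // 11110xxx
            (byte) (0x80 | ((cp >> 12) & 0x3F)), // 10xxxxxx
            (byte) (0x80 | ((cp >> 6) & 0x3F)),  // 10xxxxxx
            (byte) (0x80 | (cp & 0x3F))          // 10xxxxxx
        };
    }

    public static void main(String[] args) {
        // Assumed input: the surrogate pair for U+10300.
        int cp = toCodePoint((char) 0xD800, (char) 0xDF00);
        for (byte b : encodeSupplementary(cp)) {
            System.out.printf("%02x ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Under that assumption the sketch yields f0 90 8c 80, which matches the
four bytes the other implementations return. If so, the two-byte result
would be a truncated sequence.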

If someone has a quick reference to the relevant definitions and/or a
test suite for this kind of thing, that would be highly appreciated.
