I have created a bunch of test cases, and on some it fails unexpectedly. I'll post the code as attachment, the lines with the bug have the word "bug".
Created attachment 44359 [details] test cases that trigger the bug
Created attachment 44360 [details] better test cases with proper asserts In the previous file the asserts were accustomed to the bugged behavior, had only comments. in this file the asserts are made as the expected behavior.
Thanks, this is still present in the latest version of the code too. This just includes the failing cases: #include <cassert> #include <locale> using namespace std; // 2 code points, both are 4 byte in UTF-8. // in UTF-16 both are 2 unit i.e. surrogate pairs const char* u8in = u8"\U0010FFFF\U0010AAAA"; //tests .in() function of codecvt<char16_t, char, mbstate> auto test_u16_in() { char16_t u16out[4]; auto& cvt = use_facet<codecvt<char16_t, char, mbstate_t>>(locale::classic()); auto state = mbstate_t{}; auto in_ptr = u8in; auto out_ptr = u16out; state = {}; in_ptr = nullptr; out_ptr = nullptr; auto res = cvt.in(state, u8in, u8in + 6, in_ptr, u16out, u16out + 2, out_ptr); // actual output assert(res == cvt.partial); // BUG assert(out_ptr == u16out + 2); assert(in_ptr == u8in + 4); // expected output // assert(res == cvt.partial); // assert(out_ptr == u16str+2); // assert(in_ptr == u8str+4); state = {}; in_ptr = nullptr; out_ptr = nullptr; res = cvt.in(state, u8in, u8in + 8, in_ptr, u16out, u16out + 2, out_ptr); // actual output assert(res == cvt.partial); // BUG assert(out_ptr == u16out + 2); assert(in_ptr == u8in + 4); // expected output // assert(res == cvt.partial); // assert(out_ptr == u16str+2); // assert(in_ptr == u8str+4); } //tests .out() function of codecvt<char16_t, char, mbstate> auto test_u16_out() { const char16_t* u16in = u"\U0010FFFF\U0010AAAA"; char u8out[8]; auto& cvt = use_facet<codecvt<char16_t, char, mbstate_t>>(locale::classic()); auto state = mbstate_t{}; auto in_ptr = u16in; auto out_ptr = u8out; state = {}; in_ptr = nullptr; out_ptr = nullptr; auto res = cvt.out(state, u16in, u16in + 1, in_ptr, u8out, u8out + 3, out_ptr); assert(res == cvt.partial); // BUG assert(in_ptr == u16in); assert(out_ptr == u8out); state = {}; in_ptr = nullptr; out_ptr = nullptr; res = cvt.out(state, u16in, u16in + 1, in_ptr, u8out, u8out + 4, out_ptr); assert(res == cvt.partial); // BUG assert(in_ptr == u16in); assert(out_ptr == u8out); state = {}; in_ptr = nullptr; out_ptr = nullptr; res = cvt.out(state, u16in, u16in + 3, in_ptr, u8out, u8out + 4, out_ptr); assert(res == cvt.partial); // BUG assert(in_ptr == u16in + 2); assert(out_ptr == u8out + 4); state = {}; in_ptr = nullptr; out_ptr = nullptr; res = cvt.out(state, u16in, u16in + 3, in_ptr, u8out, u8out + 8, out_ptr); assert(res == cvt.partial); // BUG assert(in_ptr == u16in + 2); assert(out_ptr == u8out + 4); } int main() { test_u16_in(); test_u16_out(); }
I think i found where the bug lies. It lies in 1. line 557 of the file c++11/codecvt.cc https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/src/c%2B%2B11/codecvt.cc;h=0311b15177d0439757e0347f7934b5a09b78f8e3;hb=HEAD#l557 . The return of the function utf16_in() should be: return from.size() ? codecvt_base::partial : codecvt_base::ok; The bug is triggered because the loop exists because t.size() is zero. from.size() should be checked. 2.
I think I found where the bug lies. It lies in 1. line 557 of the file c++11/codecvt.cc https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/src/c%2B%2B11/codecvt.cc;h=0311b15177d0439757e0347f7934b5a09b78f8e3;hb=HEAD#l557 . The return of the function utf16_in() should be: return from.size() ? codecvt_base::partial : codecvt_base::ok; The bug is triggered because the loop exists because t.size() is zero. from.size() should be checked. 2. line 579 of the same file https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/src/c%2B%2B11/codecvt.cc;h=0311b15177d0439757e0347f7934b5a09b78f8e3;hb=HEAD#l579 578 if (from.size() < 2) 579 return codecvt_base::ok; // stop converting at this point Should be 578 if (from.size() < 2) 579 return codecvt_base::partial; // stop converting at this point
Thanks for the analysis.
I think a found a related bug in the UTF8 to UCS2 codecvt, codecvt_utf8<char16_t>. It can be tested with the following example: #include <codecvt> auto test_u8_ucs2_in() { // 2 code points, one is 3 bytes and the other is 4 bytes in UTF-8. // in UTF-16 the first is sinlge unit, the second is surrogate pair // in UCS2 only the first CP is allowed. const char* in = u8"\uAAAA\U0010AAAA"; char16_t out[2] = { 'y' , 'y' }; auto cvt_ptr = make_unique<codecvt_utf8<char16_t>>(); auto& cvt = *cvt_ptr; auto state = mbstate_t{}; auto in_ptr = in; auto out_ptr = out; state = {}; in_ptr = nullptr; out_ptr = nullptr; auto res = cvt.in(state, in, in + 2, in_ptr, out, out, out_ptr); assert(res == cvt.partial); //BUG, returns OK, should be Partial assert(out_ptr == out); assert(in_ptr == in); state = {}; in_ptr = nullptr; out_ptr = nullptr; res = cvt.in(state, in, in + 2, in_ptr, out, out + 1, out_ptr); assert(res == cvt.partial); // BUG, returns ERROR, should be Partial assert(out_ptr == out); assert(in_ptr == in); state = {}; in_ptr = nullptr; out_ptr = nullptr; res = cvt.in(state, in, in + 3, in_ptr, out, out, out_ptr); assert(res == cvt.partial); //BUG, return OK, should be Partial assert(out_ptr == out); assert(in_ptr == in); state = {}; in_ptr = nullptr; out_ptr = nullptr; res = cvt.in(state, in, in + 3, in_ptr, out, out + 1, out_ptr); assert(res == cvt.ok); assert(out_ptr == out + 1); assert(in_ptr == in + 3); cout << "UCS2 sequence: " << hex << out[0] << ' ' << out[1] << '\n'; state = {}; in_ptr = nullptr; out_ptr = nullptr; res = cvt.in(state, in, in + 6, in_ptr, out, out + 1, out_ptr); assert(res == cvt.partial); // BUG, return OK, should be Partial assert(out_ptr == out + 1); assert(in_ptr == in + 3); state = {}; in_ptr = nullptr; out_ptr = nullptr; res = cvt.in(state, in, in + 6, in_ptr, out, out + 2, out_ptr); assert(res == cvt.partial); // BUG, returns ERROR, should be Partial assert(out_ptr == out + 1); assert(in_ptr == in + 3); state = {}; in_ptr = nullptr; out_ptr = nullptr; res = cvt.in(state, in, in + 7, in_ptr, out, out + 1, out_ptr); assert(res == cvt.partial); // BUG, returns OK, should be Partial assert(out_ptr == out + 1); assert(in_ptr == in + 3); state = {}; in_ptr = nullptr; out_ptr = nullptr; res = cvt.in(state, in, in + 7, in_ptr, out, out + 2, out_ptr); assert(res == cvt.error); assert(out_ptr == out + 1); assert(in_ptr == in + 3); } The bug lies in the same function utf16_in() I mentioned in comment #5, in lines 544-547 https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/src/c%2B%2B11/codecvt.cc;h=0311b15177d0439757e0347f7934b5a09b78f8e3;hb=HEAD#l544 Those lines: 544 if (s == surrogates::allowed) 545 return codecvt_base::partial; 546 else 547 return codecvt_base::error; // No surrogates in UCS2 Should simply be one line: 544 return codecvt_base::partial;
Looking again at my proposed fix in comment #6, i concluded it is not the best fix. It will fix the testsuite in the same comment #6, but I discovered another class of errors related to the lines I am touching in that proposed fix. The error is when we have an incomplete sequence which is in the middle of the from range, and not at the end. In such cases codecvt_base::error should be returned. The bug exists in UTF8->UTF16, UTF8->UCS4 and UTF16->UCS4. I guess some more test need to be written about returning error.
Ignore my last comment, here is it fixed. Looking again at my proposed fix in comment #7, i concluded it is not the best fix. It will fix the testsuite in the same comment #7, but I discovered another class of errors related to the lines I am touching in that proposed fix. The error is when we have an incomplete sequence which is in the middle of the from range, and not at the end. In such cases codecvt_base::error should be returned. The bug exists in UTF8->UTF16, UTF8->UCS4 and UTF16->UCS4. I guess some more test need to be written about returning error.
I was wrong in comment #9. The bug and the proposed fix are ok in comment #7. While writing some tests for error I discovered yet another bug in UTF-8 decoding. See the example: // 2 code points, both are 4 byte in UTF-8. const char u8in[] = u8"\U0010FFFF\U0010AAAA"; const char32_t u32in[] = U"\U0010FFFF\U0010AAAA"; void utf8_to_utf32_in_error_7 (const codecvt<char32_t, char, mbstate_t> &cvt) { char in[7] = {}; char32_t out[3] = {}; char_traits<char>::copy (in, u8in, 7); in[5] = 'z'; // Last CP has two errors. Its second code unit is malformed and it // misses its last code unit. Because it misses its last CU, the // decoder return too early that it is incomplete. // It should return invalid. auto state = mbstate_t{}; auto in_next = (const char *) nullptr; auto out_next = (char32_t *) nullptr; auto res = codecvt_base::result (); res = cvt.in (state, in, in + 7, in_next, out, out + 3, out_next); VERIFY (res == cvt.error); //incorrectly returns partial VERIFY (in_next == in + 4); VERIFY (out_next == out + 1); VERIFY (out[0] == u32in[0] && out[1] == 0 && out[2] == 0); } I published the full testsuite on Github, licensed under GPL v3+ of course. https://github.com/dimztimz/codecvt_test/blob/master/codecvt.cpp . I was thinking of sending a patch, but after this last bug, 4th, I see this needs more time. Maybe a testsuite from another library like ICU can be incorporated? Well, whatever, I will pause my work on this.
(In reply to Dimitrij Mijoski from comment #10) > Well, whatever, I will pause my work on this. Thanks again. There are a few code conversion issues in my TODO list, so I'll get around to properly reviewing everything above next time I get a chance to work on it.
Hello Jonathan. I posted a patch for this bug which I hope you'll find it useful once you start working on this. https://gcc.gnu.org/pipermail/libstdc++/2020-September/051073.html
Thanks!
The master branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>: https://gcc.gnu.org/g:02dab998665dda0f6df31740e8897c42de3d740f commit r13-5144-g02dab998665dda0f6df31740e8897c42de3d740f Author: Dimitrij Mijoski <dmjpp@hotmail.com> Date: Tue Jan 10 13:58:59 2023 +0100 libstdc++: Fix Unicode codecvt and add tests [PR86419] Fixes the conversion from UTF-8 to UTF-16 to properly return partial instead ok. Fixes the conversion from UTF-16 to UTF-8 to properly return partial instead ok. Fixes the conversion from UTF-8 to UCS-2 to properly return partial instead error. Fixes the conversion from UTF-8 to UCS-2 to treat 4-byte UTF-8 sequences as error just by seeing the leading byte. Fixes UTF-8 decoding for all codecvts so they detect error at the end of the input range when the last code point is also incomplete. libstdc++-v3/ChangeLog: PR libstdc++/86419 * src/c++11/codecvt.cc (read_utf8_code_point): Correctly detect errors in incomplete multibyte sequences. (utf16_in): Remove surrogates parameter. Fix conditions for returning partial. (utf16_out): Fix condition for returning partial. (ucs2_in): Do not pass surrogates argument to utf16_in. * testsuite/22_locale/codecvt/codecvt_unicode.cc: New test. * testsuite/22_locale/codecvt/codecvt_unicode.h: New header for tests. * testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc: New test.
Fixed for GCC 13. I will consider backporting this to the branches later, as it doesn't change anything observable by users except for fixing these bugs. Thanks for the report and the patch.
The releases/gcc-12 branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>: https://gcc.gnu.org/g:12c193e5723f08694c8784457200112bae117063 commit r12-10246-g12c193e5723f08694c8784457200112bae117063 Author: Dimitrij Mijoski <dmjpp@hotmail.com> Date: Tue Jan 10 13:58:59 2023 +0100 libstdc++: Fix Unicode codecvt and add tests [PR86419] Fixes the conversion from UTF-8 to UTF-16 to properly return partial instead ok. Fixes the conversion from UTF-16 to UTF-8 to properly return partial instead ok. Fixes the conversion from UTF-8 to UCS-2 to properly return partial instead error. Fixes the conversion from UTF-8 to UCS-2 to treat 4-byte UTF-8 sequences as error just by seeing the leading byte. Fixes UTF-8 decoding for all codecvts so they detect error at the end of the input range when the last code point is also incomplete. libstdc++-v3/ChangeLog: PR libstdc++/86419 * src/c++11/codecvt.cc (read_utf8_code_point): Correctly detect errors in incomplete multibyte sequences. (utf16_in): Remove surrogates parameter. Fix conditions for returning partial. (utf16_out): Fix condition for returning partial. (ucs2_in): Do not pass surrogates argument to utf16_in. * testsuite/22_locale/codecvt/codecvt_unicode.cc: New test. * testsuite/22_locale/codecvt/codecvt_unicode.h: New header for tests. * testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc: New test. (cherry picked from commit 02dab998665dda0f6df31740e8897c42de3d740f)