Bug 86419 - codecvt<char16_t, ...>::in() and out() incorrectly return ok in some cases.
Summary: codecvt<char16_t, ...>::in() and out() incorrectly return ok in some cases.
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++ (show other bugs)
Version: 7.3.0
: P3 normal
Target Milestone: 12.4
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-07-06 12:18 UTC by Dimitrij Mijoski
Modified: 2024-03-18 14:12 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-07-06 00:00:00


Attachments
test cases that trigger the bug (940 bytes, text/plain)
2018-07-06 12:20 UTC, Dimitrij Mijoski
Details
better test cases with proper asserts (938 bytes, text/plain)
2018-07-06 12:33 UTC, Dimitrij Mijoski
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Dimitrij Mijoski 2018-07-06 12:18:56 UTC
I have created a bunch of test cases, and on some it fails unexpectedly. I'll post the code as attachment, the lines with the bug have the word "bug".
Comment 1 Dimitrij Mijoski 2018-07-06 12:20:58 UTC
Created attachment 44359 [details]
test cases that trigger the bug
Comment 2 Dimitrij Mijoski 2018-07-06 12:33:24 UTC
Created attachment 44360 [details]
better test cases with proper asserts

In the previous file the asserts were accustomed to the bugged behavior, had only comments.

in this file the asserts are made as the expected behavior.
Comment 3 Jonathan Wakely 2018-07-06 12:43:38 UTC
Thanks, this is still present in the latest version of the code too.

This just includes the failing cases:

#include <cassert>
#include <locale>

using namespace std;

// 2 code points, both are 4 byte in UTF-8.
// in UTF-16 both are 2 unit i.e. surrogate pairs
const char* u8in = u8"\U0010FFFF\U0010AAAA";

//tests .in() function of codecvt<char16_t, char, mbstate>
auto test_u16_in()
{
	char16_t u16out[4];

	auto& cvt =
	    use_facet<codecvt<char16_t, char, mbstate_t>>(locale::classic());
	auto state = mbstate_t{};
	auto in_ptr = u8in;
	auto out_ptr = u16out;

	state = {};
	in_ptr = nullptr;
	out_ptr = nullptr;
	auto res =
	    cvt.in(state, u8in, u8in + 6, in_ptr, u16out, u16out + 2, out_ptr);
	// actual output
	assert(res == cvt.partial); // BUG
	assert(out_ptr == u16out + 2);
	assert(in_ptr == u8in + 4);
	// expected output
	// assert(res == cvt.partial);
	// assert(out_ptr == u16str+2);
	// assert(in_ptr == u8str+4);

	state = {};
	in_ptr = nullptr;
	out_ptr = nullptr;
	res =
	    cvt.in(state, u8in, u8in + 8, in_ptr, u16out, u16out + 2, out_ptr);
	// actual output
	assert(res == cvt.partial); // BUG
	assert(out_ptr == u16out + 2);
	assert(in_ptr == u8in + 4);
	// expected output
	// assert(res == cvt.partial);
	// assert(out_ptr == u16str+2);
	// assert(in_ptr == u8str+4);
}

//tests .out() function of codecvt<char16_t, char, mbstate>
auto test_u16_out()
{
	const char16_t* u16in = u"\U0010FFFF\U0010AAAA";
	char u8out[8];

	auto& cvt =
	    use_facet<codecvt<char16_t, char, mbstate_t>>(locale::classic());
	auto state = mbstate_t{};
	auto in_ptr = u16in;
	auto out_ptr = u8out;

	state = {};
	in_ptr = nullptr;
	out_ptr = nullptr;
	auto res =
	    cvt.out(state, u16in, u16in + 1, in_ptr, u8out, u8out + 3, out_ptr);
	assert(res == cvt.partial); // BUG
	assert(in_ptr == u16in);
	assert(out_ptr == u8out);

	state = {};
	in_ptr = nullptr;
	out_ptr = nullptr;
	res =
	    cvt.out(state, u16in, u16in + 1, in_ptr, u8out, u8out + 4, out_ptr);
	assert(res == cvt.partial); // BUG
	assert(in_ptr == u16in);
	assert(out_ptr == u8out);

	state = {};
	in_ptr = nullptr;
	out_ptr = nullptr;
	res =
	    cvt.out(state, u16in, u16in + 3, in_ptr, u8out, u8out + 4, out_ptr);
	assert(res == cvt.partial); // BUG
	assert(in_ptr == u16in + 2);
	assert(out_ptr == u8out + 4);

	state = {};
	in_ptr = nullptr;
	out_ptr = nullptr;
	res =
	    cvt.out(state, u16in, u16in + 3, in_ptr, u8out, u8out + 8, out_ptr);
	assert(res == cvt.partial); // BUG
	assert(in_ptr == u16in + 2);
	assert(out_ptr == u8out + 4);
}

int main()
{
	test_u16_in();
	test_u16_out();
}
Comment 4 Dimitrij Mijoski 2020-09-02 00:39:04 UTC Comment hidden (obsolete)
Comment 5 Dimitrij Mijoski 2020-09-02 00:41:17 UTC
I think I found where the bug lies. It lies in 

1. line 557 of the file c++11/codecvt.cc https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/src/c%2B%2B11/codecvt.cc;h=0311b15177d0439757e0347f7934b5a09b78f8e3;hb=HEAD#l557 .

  The return of the function utf16_in() should be:

      return from.size() ? codecvt_base::partial : codecvt_base::ok;

  The bug is triggered because the loop exists because t.size() is zero. from.size() should be checked.

2. line 579 of the same file https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/src/c%2B%2B11/codecvt.cc;h=0311b15177d0439757e0347f7934b5a09b78f8e3;hb=HEAD#l579

 578             if (from.size() < 2)
 579               return codecvt_base::ok; // stop converting at this point

Should be

 578             if (from.size() < 2)
 579               return codecvt_base::partial; // stop converting at this point
Comment 6 Jonathan Wakely 2020-09-02 16:59:17 UTC
Thanks for the analysis.
Comment 7 Dimitrij Mijoski 2020-09-03 16:26:05 UTC
I think a found a related bug in the UTF8 to UCS2 codecvt, codecvt_utf8<char16_t>. It can be tested with the following example:

#include <codecvt>

auto test_u8_ucs2_in()
{
	// 2 code points, one is 3 bytes and the other is 4 bytes in UTF-8.
	// in UTF-16 the first is sinlge unit, the second is surrogate pair
	// in UCS2 only the first CP is allowed.
	const char* in = u8"\uAAAA\U0010AAAA";
	char16_t out[2] = { 'y' , 'y' };

	auto cvt_ptr = make_unique<codecvt_utf8<char16_t>>();
	auto& cvt = *cvt_ptr;
	auto state = mbstate_t{};
	auto in_ptr = in;
	auto out_ptr = out;

	state = {};
	in_ptr = nullptr;
	out_ptr = nullptr;
	auto res = cvt.in(state, in, in + 2, in_ptr, out, out, out_ptr);
	assert(res == cvt.partial); //BUG, returns OK, should be Partial 
	assert(out_ptr == out);
	assert(in_ptr == in);

	state = {};
	in_ptr = nullptr;
	out_ptr = nullptr;
	res = cvt.in(state, in, in + 2, in_ptr, out, out + 1, out_ptr);
	assert(res == cvt.partial); // BUG, returns ERROR, should be Partial
	assert(out_ptr == out);
	assert(in_ptr == in);

	state = {};
	in_ptr = nullptr;
	out_ptr = nullptr;
	res = cvt.in(state, in, in + 3, in_ptr, out, out, out_ptr);
	assert(res == cvt.partial); //BUG, return OK, should be Partial
	assert(out_ptr == out);
	assert(in_ptr == in);
	
	
	state = {};
	in_ptr = nullptr;
	out_ptr = nullptr;
	res = cvt.in(state, in, in + 3, in_ptr, out, out + 1, out_ptr);
	assert(res == cvt.ok);
	assert(out_ptr == out + 1);
	assert(in_ptr == in + 3);
	cout << "UCS2 sequence: " << hex << out[0] << ' ' << out[1] << '\n';
	
	state = {};
	in_ptr = nullptr;
	out_ptr = nullptr;
	res = cvt.in(state, in, in + 6, in_ptr, out, out + 1, out_ptr);
	assert(res == cvt.partial); // BUG, return OK, should be Partial
	assert(out_ptr == out + 1);
	assert(in_ptr == in + 3);
	
	state = {};
	in_ptr = nullptr;
	out_ptr = nullptr;
	res = cvt.in(state, in, in + 6, in_ptr, out, out + 2, out_ptr);
	assert(res == cvt.partial); // BUG, returns ERROR, should be Partial
	assert(out_ptr == out + 1);
	assert(in_ptr == in + 3);
	
	state = {};
	in_ptr = nullptr;
	out_ptr = nullptr;
	res = cvt.in(state, in, in + 7, in_ptr, out, out + 1, out_ptr);
	assert(res == cvt.partial); // BUG, returns OK, should be Partial
	assert(out_ptr == out + 1);
	assert(in_ptr == in + 3);
	
	state = {};
	in_ptr = nullptr;
	out_ptr = nullptr;
	res = cvt.in(state, in, in + 7, in_ptr, out, out + 2, out_ptr);
	assert(res == cvt.error);
	assert(out_ptr == out + 1);
	assert(in_ptr == in + 3);
}


The bug lies in the same function utf16_in() I mentioned in comment #5, in lines 544-547 https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/src/c%2B%2B11/codecvt.cc;h=0311b15177d0439757e0347f7934b5a09b78f8e3;hb=HEAD#l544

Those lines:

 544             if (s == surrogates::allowed)
 545               return codecvt_base::partial;
 546             else
 547               return codecvt_base::error; // No surrogates in UCS2

Should simply be one line: 

 544               return codecvt_base::partial;
Comment 8 Dimitrij Mijoski 2020-09-04 08:50:32 UTC
Looking again at my proposed fix in comment #6, i concluded it is not the best fix. It will fix the testsuite in the same comment #6, but I discovered another class of errors related to the lines I am touching in that proposed fix.

The error is when we have an incomplete sequence which is in the middle of the from range, and not at the end. In such cases codecvt_base::error should be returned. The bug exists in UTF8->UTF16, UTF8->UCS4 and UTF16->UCS4.

I guess some more test need to be written about returning error.
Comment 9 Dimitrij Mijoski 2020-09-04 08:53:18 UTC
Ignore my last comment, here is it fixed.

Looking again at my proposed fix in comment #7, i concluded it is not the best fix. It will fix the testsuite in the same comment #7, but I discovered another class of errors related to the lines I am touching in that proposed fix.

The error is when we have an incomplete sequence which is in the middle of the from range, and not at the end. In such cases codecvt_base::error should be returned. The bug exists in UTF8->UTF16, UTF8->UCS4 and UTF16->UCS4.

I guess some more test need to be written about returning error.
Comment 10 Dimitrij Mijoski 2020-09-05 21:42:04 UTC
I was wrong in comment #9. The bug and the proposed fix are ok in comment #7.

While writing some tests for error I discovered yet another bug in UTF-8 decoding. See the example:

// 2 code points, both are 4 byte in UTF-8.
const char u8in[] = u8"\U0010FFFF\U0010AAAA";
const char32_t u32in[] = U"\U0010FFFF\U0010AAAA";

void
utf8_to_utf32_in_error_7 (const codecvt<char32_t, char, mbstate_t> &cvt)
{
  char in[7] = {};
  char32_t out[3] = {};
  char_traits<char>::copy (in, u8in, 7);
  in[5] = 'z';
  // Last CP has two errors. Its second code unit is malformed and it
  // misses its last code unit. Because it misses  its last CU, the
  // decoder return too early that it is incomplete.
  // It should return invalid.

  auto state = mbstate_t{};
  auto in_next = (const char *) nullptr;
  auto out_next = (char32_t *) nullptr;
  auto res = codecvt_base::result ();

  res = cvt.in (state, in, in + 7, in_next, out, out + 3, out_next);
  VERIFY (res == cvt.error); //incorrectly returns partial
  VERIFY (in_next == in + 4);
  VERIFY (out_next == out + 1);
  VERIFY (out[0] == u32in[0] && out[1] == 0 && out[2] == 0);
}

I published the full testsuite on Github, licensed under GPL v3+ of course. https://github.com/dimztimz/codecvt_test/blob/master/codecvt.cpp . I was thinking of sending a patch, but after this last bug, 4th, I see this needs more time. Maybe a testsuite from another library like ICU can be incorporated? Well, whatever, I will pause my work on this.
Comment 11 Jonathan Wakely 2020-09-07 15:32:55 UTC
(In reply to Dimitrij Mijoski from comment #10)
> Well, whatever, I will pause my work on this.

Thanks again. There are a few code conversion issues in my TODO list, so I'll get around to properly reviewing everything above next time I get a chance to work on it.
Comment 12 Dimitrij Mijoski 2020-09-24 14:22:46 UTC
Hello Jonathan. I posted a patch for this bug which I hope you'll find it useful once you start working on this. https://gcc.gnu.org/pipermail/libstdc++/2020-September/051073.html
Comment 13 Jonathan Wakely 2020-09-24 14:43:24 UTC
Thanks!
Comment 14 GCC Commits 2023-01-13 13:42:50 UTC
The master branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>:

https://gcc.gnu.org/g:02dab998665dda0f6df31740e8897c42de3d740f

commit r13-5144-g02dab998665dda0f6df31740e8897c42de3d740f
Author: Dimitrij Mijoski <dmjpp@hotmail.com>
Date:   Tue Jan 10 13:58:59 2023 +0100

    libstdc++: Fix Unicode codecvt and add tests [PR86419]
    
    Fixes the conversion from UTF-8 to UTF-16 to properly return partial
    instead ok.
    Fixes the conversion from UTF-16 to UTF-8 to properly return partial
    instead ok.
    Fixes the conversion from UTF-8 to UCS-2 to properly return partial
    instead error.
    Fixes the conversion from UTF-8 to UCS-2 to treat 4-byte UTF-8 sequences
    as error just by seeing the leading byte.
    Fixes UTF-8 decoding for all codecvts so they detect error at the end of
    the input range when the last code point is also incomplete.
    
    libstdc++-v3/ChangeLog:
    
            PR libstdc++/86419
            * src/c++11/codecvt.cc (read_utf8_code_point): Correctly detect
            errors in incomplete multibyte sequences.
            (utf16_in): Remove surrogates parameter. Fix conditions for
            returning partial.
            (utf16_out): Fix condition for returning partial.
            (ucs2_in): Do not pass surrogates argument to utf16_in.
            * testsuite/22_locale/codecvt/codecvt_unicode.cc: New test.
            * testsuite/22_locale/codecvt/codecvt_unicode.h: New header for
            tests.
            * testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc: New
            test.
Comment 15 Jonathan Wakely 2023-01-13 13:44:24 UTC
Fixed for GCC 13. I will consider backporting this to the branches later, as it doesn't change anything observable by users except for fixing these bugs.

Thanks for the report and the patch.
Comment 16 GCC Commits 2024-03-18 14:05:33 UTC
The releases/gcc-12 branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>:

https://gcc.gnu.org/g:12c193e5723f08694c8784457200112bae117063

commit r12-10246-g12c193e5723f08694c8784457200112bae117063
Author: Dimitrij Mijoski <dmjpp@hotmail.com>
Date:   Tue Jan 10 13:58:59 2023 +0100

    libstdc++: Fix Unicode codecvt and add tests [PR86419]
    
    Fixes the conversion from UTF-8 to UTF-16 to properly return partial
    instead ok.
    Fixes the conversion from UTF-16 to UTF-8 to properly return partial
    instead ok.
    Fixes the conversion from UTF-8 to UCS-2 to properly return partial
    instead error.
    Fixes the conversion from UTF-8 to UCS-2 to treat 4-byte UTF-8 sequences
    as error just by seeing the leading byte.
    Fixes UTF-8 decoding for all codecvts so they detect error at the end of
    the input range when the last code point is also incomplete.
    
    libstdc++-v3/ChangeLog:
    
            PR libstdc++/86419
            * src/c++11/codecvt.cc (read_utf8_code_point): Correctly detect
            errors in incomplete multibyte sequences.
            (utf16_in): Remove surrogates parameter. Fix conditions for
            returning partial.
            (utf16_out): Fix condition for returning partial.
            (ucs2_in): Do not pass surrogates argument to utf16_in.
            * testsuite/22_locale/codecvt/codecvt_unicode.cc: New test.
            * testsuite/22_locale/codecvt/codecvt_unicode.h: New header for
            tests.
            * testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc: New
            test.
    
    (cherry picked from commit 02dab998665dda0f6df31740e8897c42de3d740f)