This is the mail archive of the
libstdc++@gcc.gnu.org
mailing list for the libstdc++ project.
Bug in 'codecvt<_InternT, _ExternT, encoding_state>::do_unshift' (non-standard extension)
- From: "Kristian Spangsege" <kristian dot spangsege at gmail dot com>
- To: libstdc++ at gcc dot gnu dot org
- Date: Sat, 12 Jan 2008 12:33:14 +0100
The 'unshift' method of the 'iconv' specialization of 'codecvt' in
"ext/codecvt_specializations.h" does not work correctly.
For a stateful character encoding this method is supposed to output a
sequence of bytes that resets the stream state. Let's check out an
example using the stateful encoding 'ISO-2022-JP' and the Japanese
character U+3076 'HIRAGANA LETTER BU', whose UTF-8 encoding is
0xE3,0x81,0xB6:
    echo -e -n '\xE3\x81\xB6' | iconv -f UTF-8 -t ISO-2022-JP | hexdump -C
    00000000  1b 24 42 24 56 1b 28 42  |.$B$V.(B|
'ESC $ B' switches from ASCII to a Japanese character set and 'ESC ( B'
switches back to ASCII (the reset sequence).
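As a sanity check on the input side, the UTF-8 bytes quoted above for U+3076 can be reproduced by hand. The following is only a minimal sketch: the helper name 'encode_utf8' is made up for this illustration, and it does no validation of surrogates or out-of-range values.

```cpp
#include <string>

// Hypothetical helper (not part of libstdc++): encodes a single Unicode
// code point as UTF-8. No validation is done for surrogates or for
// values above U+10FFFF.
std::string encode_utf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

U+3076 falls in the three-byte range, so 'encode_utf8(0x3076)' yields the bytes 0xE3,0x81,0xB6 that are fed to iconv above.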
With 'codecvt<_InternT, _ExternT, encoding_state>' the difference is apparent:
    ./a.out | hexdump -C
    00000000  1b 24 42 24 56  |.$B$V|
The reset sequence is missing (see the test code below).
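The difference between the two dumps can also be checked mechanically. The sketch below is an illustration only: 'ends_in_ascii' is a hypothetical helper, and it assumes that every escape sequence in the stream is exactly three bytes long, as the ISO-2022-JP designations 'ESC $ B' and 'ESC ( B' are. It reports whether the last designation seen leaves the stream in ASCII.

```cpp
#include <string>

// Hypothetical helper (not part of libstdc++): scans ISO-2022-JP bytes and
// reports whether the stream ends in the ASCII state. Assumes every escape
// sequence is three bytes long (ESC plus two designator bytes).
bool ends_in_ascii(const std::string& bytes)
{
    bool ascii = true;  // the stream starts out in ASCII
    for (std::string::size_type i = 0; i + 2 < bytes.size(); ++i) {
        if (bytes[i] == '\x1b') {
            // Only 'ESC ( B' designates ASCII; any other escape leaves it.
            ascii = (bytes[i + 1] == '(' && bytes[i + 2] == 'B');
            i += 2;  // skip the two designator bytes
        }
    }
    return ascii;
}
```

Applied to the two dumps, the iconv output (ending in 1b 28 42) ends in ASCII, while the five bytes produced through codecvt do not.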
From a quick inspection of the codecvt implementation, it appears that
the problem is due to 'unshift' using the input state
('__state.in_descriptor()') rather than the output state when
determining the reset sequence. A fairly clear-cut error, if I'm right.
From the codecvt implementation ("ext/codecvt_specializations.h"):
> template<typename _InternT, typename _ExternT>
>   codecvt_base::result
>   codecvt<_InternT, _ExternT, encoding_state>::
>   do_unshift(state_type& __state, extern_type* __to,
>              extern_type* __to_end, extern_type*& __to_next) const
>   {
>     // ...
>
>     const descriptor_type& __desc = __state.in_descriptor();
>
>     // ...
>
>     size_t __conv = __iconv_adaptor(iconv,__desc, NULL, NULL,
>                                     &__cto, &__tlen);
>
>     // ...
>   }
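For comparison, here is a minimal sketch of the same kind of reset request made directly against POSIX iconv, outside libstdc++. Calling iconv() with a null input buffer and a valid output buffer asks the converter to write whatever bytes return the output to its initial shift state, so the descriptor passed in has to be the one that converts *to* the external encoding. The helper name 'convert_from_utf8' and the 64-byte buffer are assumptions made for this illustration, and whether 'ISO-2022-JP' is available depends on the installed iconv.

```cpp
#include <iconv.h>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of libstdc++): converts UTF-8 text to
// 'to_code' with POSIX iconv. If emit_reset is true, a second iconv() call
// with a null input buffer requests the reset (unshift) sequence.
// Returns false if the conversion is unavailable or fails.
bool convert_from_utf8(const char* to_code, std::string input,
                       bool emit_reset, std::string& output)
{
    iconv_t cd = iconv_open(to_code, "UTF-8");
    if (cd == (iconv_t)-1)
        return false;

    char buffer[64];  // assumed large enough for this short example
    char* in = &input[0];
    std::size_t in_left = input.size();
    char* out = buffer;
    std::size_t out_left = sizeof(buffer);

    // Convert the characters themselves.
    bool ok = iconv(cd, &in, &in_left, &out, &out_left) != (std::size_t)-1;

    // Null input, valid output: write the bytes that restore the
    // initial shift state.
    if (ok && emit_reset)
        ok = iconv(cd, NULL, NULL, &out, &out_left) != (std::size_t)-1;

    iconv_close(cd);
    if (ok)
        output.assign(buffer, out);
    return ok;
}
```

With a descriptor opened for output to ISO-2022-JP, I would expect the second call to append 'ESC ( B', matching the iconv command-line output above. A descriptor opened for the opposite direction has no pending output shift state to undo, which would be consistent with the missing reset sequence.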
My test code:
> #include <cwchar>
> #include <string>
> #include <locale>
> #include <iostream>
> #include <ext/codecvt_specializations.h>
>
> using namespace std;
>
> int main()
> {
>   typedef codecvt<wchar_t, char, encoding_state> cvt_type;
>   locale l(locale::classic(), new cvt_type);
>   cvt_type const &cvt = use_facet<cvt_type>(l);
>   wstring s = L"\u3076";
>   char buffer[32];
>   wchar_t const *from=s.data(), *from_end=from+s.size(), *from_next;
>   char *to=buffer, *to_end=to+sizeof(buffer), *to_next;
>   encoding_state state("UCS-4LE", "ISO-2022-JP");
>
>   if(cvt.out(state,from,from_end,from_next,to,to_end,to_next) != codecvt_base::ok)
>     return 1;
>   cout << string(to,to_next);
>
>   // Output reset sequence
>   if(cvt.unshift(state,to,to_end,to_next) != codecvt_base::ok)
>     return 1;
>   cout << string(to,to_next);
>   return 0;
> }
References:
http://gcc.gnu.org/viewcvs/trunk/libstdc%2B%2B-v3/include/ext/codecvt_specializations.h?revision=130805&view=markup
http://en.wikipedia.org/wiki/ISO_2022
http://www.unicode.org/charts/PDF/U3040.pdf
Regards,
Kristian Spangsege