This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.



Bug in 'codecvt<_InternT, _ExternT, encoding_state>::do_unshift' (non-standard extension)


The 'unshift' method of the 'iconv' specialization of 'codecvt' in
"ext/codecvt_specializations.h" does not work correctly.

For a stateful character encoding, this method is supposed to output a
sequence of bytes that resets the stream to its initial state. Consider
an example using the stateful encoding 'ISO-2022-JP' and the Japanese
character U+3076 'HIRAGANA LETTER BU', whose UTF-8 encoding is
0xE3,0x81,0xB6:

echo -e -n '\xE3\x81\xB6' | iconv -f UTF-8 -t ISO-2022-JP | hexdump -C
00000000  1b 24 42 24 56 1b 28 42                           |.$B$V.(B|

'ESC $ B' switches from ASCII to a Japanese code set and 'ESC ( B'
switches back (the reset sequence).
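
To make the two escape sequences concrete, here is a minimal sketch of a
check for whether a byte sequence leaves the stream in the initial ASCII
state. The function name 'ends_in_ascii_state' is hypothetical, and it
recognises only the two sequences mentioned above ('ESC $ B', which selects
JIS X 0208, and 'ESC ( B'), so it is not a full ISO-2022-JP parser.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `bytes` leaves the stream in the
// initial (ASCII) state. Only ESC $ B and ESC ( B are recognised.
bool ends_in_ascii_state(const std::string& bytes)
{
  bool ascii = true;  // an ISO-2022-JP stream starts in ASCII
  for (std::size_t i = 0; i + 2 < bytes.size(); ++i) {
    if (bytes[i] != '\x1b')
      continue;
    if (bytes[i + 1] == '$' && bytes[i + 2] == 'B') {
      ascii = false;  // ESC $ B: switch to the Japanese code set
      i += 2;
    } else if (bytes[i + 1] == '(' && bytes[i + 2] == 'B') {
      ascii = true;   // ESC ( B: back to ASCII (the reset sequence)
      i += 2;
    }
  }
  return ascii;
}
```

Applied to the 'iconv' output above (1b 24 42 24 56 1b 28 42) this returns
true; for the same bytes without the trailing 'ESC ( B' it returns false.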


With 'codecvt<_InternT, _ExternT, encoding_state>' the difference is apparent:

./a.out | hexdump -C
00000000  1b 24 42 24 56                                    |.$B$V|

The reset sequence is missing (see the test code below).


From a quick inspection of the codecvt implementation, it appears that
the problem is due to 'unshift' using the input descriptor rather than
the output descriptor when determining the reset sequence. If I'm right,
this is a straightforward error.


From the codecvt implementation ("ext/codecvt_specializations.h"):

>   template<typename _InternT, typename _ExternT>
>     codecvt_base::result
>     codecvt<_InternT, _ExternT, encoding_state>::
>     do_unshift(state_type& __state, extern_type* __to,
>                extern_type* __to_end, extern_type*& __to_next) const
>     {
>       // ...
>
>       const descriptor_type& __desc = __state.in_descriptor();
>
>       // ...
>
>       size_t __conv = __iconv_adaptor(iconv,__desc, NULL, NULL,
>                                           &__cto, &__tlen);
>
>       // ...
>     }
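
The same situation can be reproduced with plain POSIX 'iconv', outside of
libstdc++. The sketch below is not the library code: 'descriptor_pair',
'reset_bytes' and 'encode_and_flush' are hypothetical names, and it assumes
a glibc-style 'iconv' prototype (char** input) with ISO-2022-JP and UCS-4LE
support installed. It opens one descriptor per direction, encodes some
internal bytes, and then requests the reset sequence (NULL input, non-NULL
output) from each descriptor, so the two results can be compared.

```cpp
#include <iconv.h>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the two descriptors kept per conversion state.
struct descriptor_pair {
  iconv_t in;   // external -> internal (reading)
  iconv_t out;  // internal -> external (writing)
};

// Asks `cd` for its reset sequence: NULL input, non-NULL output buffer.
// Returns the bytes written, or an empty string if iconv reports an error.
std::string reset_bytes(iconv_t cd)
{
  char buf[16];
  char* out = buf;
  size_t left = sizeof buf;
  if (iconv(cd, NULL, NULL, &out, &left) == (size_t)-1)
    return std::string();
  return std::string(buf, out);
}

// Encodes UCS-4LE bytes to ISO-2022-JP, then flushes both descriptors.
// Returns false if the encodings are unavailable or the conversion fails.
bool encode_and_flush(const std::string& internal_bytes, std::string& body,
                      std::string& in_flush, std::string& out_flush)
{
  descriptor_pair d;
  d.in = iconv_open("UCS-4LE", "ISO-2022-JP");
  d.out = iconv_open("ISO-2022-JP", "UCS-4LE");
  bool ok = d.in != (iconv_t)-1 && d.out != (iconv_t)-1;
  if (ok) {
    std::string copy(internal_bytes);  // iconv wants a non-const pointer
    char buf[64];
    char* from = &copy[0];
    size_t from_left = copy.size();
    char* to = buf;
    size_t to_left = sizeof buf;
    ok = iconv(d.out, &from, &from_left, &to, &to_left) != (size_t)-1;
    if (ok) {
      body.assign(buf, to);
      in_flush = reset_bytes(d.in);    // descriptor used by do_unshift above
      out_flush = reset_bytes(d.out);  // descriptor that holds the shift state
    }
  }
  if (d.in != (iconv_t)-1) iconv_close(d.in);
  if (d.out != (iconv_t)-1) iconv_close(d.out);
  return ok;
}
```

Calling 'encode_and_flush' with the four bytes 76 30 00 00 (U+3076 in
UCS-4LE) and hex-dumping 'body', 'in_flush' and 'out_flush' shows which
descriptor actually yields 'ESC ( B'. Only the output descriptor has seen
the 'ESC $ B' shift, so that is where I would expect the reset sequence to
come from.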


My test code:

> #include <cwchar>
> #include <string>
> #include <locale>
> #include <iostream>
> #include <ext/codecvt_specializations.h>
>
> using namespace std;
>
> int main()
> {
>   typedef codecvt<wchar_t, char, encoding_state> cvt_type;
>   locale l(locale::classic(), new cvt_type);
>   cvt_type const &cvt = use_facet<cvt_type>(l);
>   wstring s = L"\u3076";
>   char buffer[32];
>   wchar_t const *from=s.data(), *from_end=from+s.size(), *from_next;
>   char *to=buffer, *to_end=to+sizeof(buffer), *to_next;
>   encoding_state state("UCS-4LE", "ISO-2022-JP");
>
>   if(cvt.out(state,from,from_end,from_next,to,to_end,to_next) != codecvt_base::ok) return 1;
>   cout << string(to,to_next);
>
>   // Output reset sequence
>   if(cvt.unshift(state,to,to_end,to_next) != codecvt_base::ok) return 1;
>   cout << string(to,to_next);
>   return 0;
> }


References:

http://gcc.gnu.org/viewcvs/trunk/libstdc%2B%2B-v3/include/ext/codecvt_specializations.h?revision=130805&view=markup

http://en.wikipedia.org/wiki/ISO_2022

http://www.unicode.org/charts/PDF/U3040.pdf



Regards,
Kristian Spangsege

