It seems that Finnish numbers use a non-breaking space as the thousands separator. This character (0xA0, I believe) is converted incorrectly to UTF-8 when outputting numbers under a UTF-8 locale. This program demonstrates the problem:

#include <clocale>
#include <cstdio>
#include <iostream>
#include <locale>

int main()
{
  std::cout.imbue(std::locale("fi_FI.UTF-8"));
  std::cout << 1224 << std::endl;
  std::setlocale(LC_ALL, "fi_FI.UTF-8");
  std::printf("%'d\n", 1224);
}

Compile it with "g++ tmp.cpp -o tmp -Wall". Then run "./tmp". It will output

ole:~/tmp$ ./tmp
1Â224
1Â 224

Note that libstdc++ converts the non-breaking space (you can see this character by using .ISO-8859-1 instead of .UTF-8 in the program) into one character, which is obviously not UTF-8, whereas libc converts the space into two. I'm not from Finland myself, but a user of one of my programs had mysterious crashes due to this problem - obviously they only occur when the numbers reach 1,000 or more.
Hi. First, a comment about your C lines: what do you mean by "%'d"? This format string looks definitely incorrect to me, and if I change it to just "%d" the expected 1224 is produced. About the C++ lines: that particular capital A is just the thousands separator in the locale at issue. The output also seems OK (taking into account the character set of your shell, and mine ;)).
The %'d is to make it output the thousands separator. Look in the glibc manual:

`''
     Separate the digits into groups as specified by the locale
     specified for the `LC_NUMERIC' category; *note General Numeric::.
     This flag is a GNU extension.

I'm not sure how you do it otherwise in C.

But about the bug. You are wrong - the output is _not_ OK. It is not UTF-8. Run the program with .ISO-8859-1 instead of .UTF-8, and you get the non-breaking space in ISO-8859-1. Then put that character through iconv from ISO-8859-1 to UTF-8 and you get _two_ characters, not one (in fact it could not possibly be just one character when it's UTF-8). So glibc is right (it produces a correct UTF-8 non-breaking space) and libstdc++ is wrong (it produces an incorrect UTF-8 non-breaking space). The invalid UTF-8 from libstdc++ makes my GTK+ program die horribly.
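The claim that a non-breaking space cannot be a single byte in UTF-8 can be checked by hand. The following is only a minimal sketch: `encode_utf8` is a hypothetical helper written for this illustration, not part of libstdc++ or glibc, and it does no validation of surrogates or out-of-range values.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value (no surrogate or range checks).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 7-bit ASCII is stored as a single byte.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx. U+00A0 falls in this range.
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx.
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling `encode_utf8(0xA0)` gives the two bytes 0xC2 0xA0. A lone 0xA0 byte has the bit pattern of a UTF-8 continuation byte, so on its own it is not a valid UTF-8 sequence.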
> This flag is a GNU extension.

So, we are in the realm of -extensions-, not of Standard C. That's OK if you want to use it, but beware: no consistency with the C++ Standard is guaranteed.

> You are wrong -
> the output is _not_ OK. It is not UTF-8. Run the program with .ISO-8859-1
> instead of .UTF-8, and you get the non-breaking space in .ISO-8859-1. Then
> put that character through iconv from ISO-8859-1 to UTF-8 and you get _two_
> characters, not one (in fact it could not possibly be just one character when
> it's UTF-8).

In the ISO Standard the thousands separator is a -single- char_type of the -internal- encoding. Therefore, in general, in order to accomplish what you want, you have to use a sufficiently wide internal encoding (cout -> wcout), and you also have to call std::ios::sync_with_stdio(false) before any other I/O operation; otherwise no encoding to UTF-8 (from the internal representation) will take place, despite the imbue.

Thanks, Paolo.
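To illustrate the point about the separator being a single char_type: with wchar_t as the internal character type, U+00A0 fits into one element. The sketch below is an assumption-laden stand-in, not the real Finnish locale data: `nbsp_grouping` and `format_grouped` are hypothetical names, and the facet hard-codes the separator and grouping so that the example does not depend on fi_FI.UTF-8 being installed.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet standing in for the fi_FI numpunct data:
// a non-breaking space as separator, digits grouped in threes.
struct nbsp_grouping : std::numpunct<wchar_t> {
protected:
    // The separator is exactly one wchar_t in the internal encoding.
    wchar_t do_thousands_sep() const override { return L'\u00A0'; }
    std::string do_grouping() const override { return "\3"; }
};

// Format an integer through a wide stream that uses the facet above.
std::wstring format_grouped(long value) {
    std::wostringstream out;
    // The locale takes ownership of the facet and deletes it when done.
    out.imbue(std::locale(std::locale::classic(), new nbsp_grouping));
    out << value;
    return out.str();
}
```

Here `format_grouped(1224)` returns a five-element wide string whose second element is U+00A0. Writing such a string out as UTF-8 is then the job of the stream's codecvt, which is why the advice above is to imbue std::wcout and to call std::ios::sync_with_stdio(false) first.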
... forgot to add: complete support for UTF-8 is available only in 3.4.0, so don't even try with 3.3.x ;-)
*** Bug 29379 has been marked as a duplicate of this bug. ***
Let's reopen this report as an enhancement request. In fact, we should implement this: http://gcc.gnu.org/ml/libstdc++/2004-06/msg00256.html probably by using iconv to implement the relevant char <-> char codecvt_byname, like here: http://gcc.gnu.org/ml/libstdc++/2004-06/msg00252.html Note that this is not going to happen very soon, it will be quite a bit of work: for now people requiring UTF-8 output should rely as a rule on wchar streams.
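For reference, the char <-> char conversion step itself can be sketched with POSIX iconv. This is only an illustration of the conversion, not a codecvt_byname implementation, and `convert` is a hypothetical helper name. It assumes a platform whose iconv knows the "ISO-8859-1" and "UTF-8" encoding names (glibc does).

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a byte string from one encoding to another
// using POSIX iconv. Throws if the conversion cannot be set up or fails.
std::string convert(const std::string& input, const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);  // note the argument order: to, from
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string in = input;                        // iconv needs a mutable buffer
    std::string out(input.size() * 4 + 4, '\0');   // generous output buffer
    char* src = &in[0];
    char* dst = &out[0];
    size_t src_left = in.size();
    size_t dst_left = out.size();

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed");

    out.resize(out.size() - dst_left);             // keep only the bytes written
    return out;
}
```

With this helper, converting the ISO-8859-1 bytes "1", 0xA0, "224" to UTF-8 turns the single 0xA0 byte into the pair 0xC2 0xA0, which is the output that glibc's printf produces in the original report.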
*** Bug 39243 has been marked as a duplicate of this bug. ***
Unassigning, since Benjamin has not been active in GCC development for over 8 years now.