It seems that the Finnish locale uses a non-breaking space as the thousands
separator. This character (0xA0 in ISO-8859-1, I believe) is converted
incorrectly to UTF-8 when numbers are output under a UTF-8 locale. This program demonstrates
std::cout << 1224 << std::endl;
Compile it with "g++ tmp.cpp -o tmp -Wall". Then run "./tmp". It will output
Note that libstdc++ converts the non-breaking space (you can see this character
by using .ISO-8859-1 instead of .UTF-8 in the program) into a single byte, which
is obviously not valid UTF-8, whereas libc converts the space into two bytes.
I'm not from Finland myself, but a user of one of my programs had mysterious
crashes due to this problem. Obviously they only occur once the numbers reach
1,000 and need a separator.
Hi. First, a comment about your C lines: what do you mean by "%'d"? This format
string definitely looks incorrect to me, and if I change it to just "%d" the
About the C++ lines: that particular capital A is just the thousands separator
in the locale at issue. The output also seems OK (taking into account the
character set of your (and my ;) shell).
The %'d is to make it output the thousands separator. Look in the glibc manual:
Separate the digits into groups as specified by the locale
specified for the `LC_NUMERIC' category; *note General Numeric::.
This flag is a GNU extension.
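
For illustration, here is a minimal sketch of that flag called from C++ through
<cstdio>. The helper name format_grouped_c is made up for this example, and the
flag is a glibc extension, so other C libraries may ignore or reject it. In the
"C" locale no grouping is defined, so the digits come out unchanged.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats `value` with the glibc "'" grouping flag under
// the named LC_NUMERIC locale. Returns an empty string if the locale is not
// installed. Not thread-safe, because setlocale() changes global state.
std::string format_grouped_c(int value, const char* locale_name) {
    // Remember the current LC_NUMERIC setting so it can be restored afterwards.
    const char* current = std::setlocale(LC_NUMERIC, nullptr);
    std::string saved = current ? current : "C";

    if (std::setlocale(LC_NUMERIC, locale_name) == nullptr) {
        return std::string();  // locale not available on this system
    }

    char buf[64];
    int len = std::snprintf(buf, sizeof buf, "%'d", value);

    // Restore the previous locale before returning.
    std::setlocale(LC_NUMERIC, saved.c_str());

    if (len < 0 || static_cast<std::size_t>(len) >= sizeof buf) {
        return std::string();  // formatting failed or output was truncated
    }
    return std::string(buf, static_cast<std::size_t>(len));
}
```

Passing a Finnish locale name instead (assumed here to be "fi_FI.UTF-8", if it
is installed) should make glibc insert the locale's separator between the digit
groups, encoded in that locale's character set.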
I'm not sure how you do it otherwise in C. But about the bug. You are wrong -
the output is _not_ OK. It is not UTF-8. Run the program with .ISO-8859-1
instead of .UTF-8, and you get the non-breaking space in .ISO-8859-1. Then put
that character through iconv from ISO-8859-1 to UTF-8 and you get _two_
characters, not one (in fact it could not possibly be just one character when
So glibc is right (produces correct UTF-8 non-breaking space) and libstdc++ is
wrong (produces incorrect UTF-8 non-breaking space). The invalid UTF-8 from
libstdc++ makes my GTK+ program die horribly.
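
To make the byte counts concrete, here is a small sketch. The function names
encode_utf8 and looks_like_utf8 are invented for this example, and the checker
is only a structural test (lead byte plus continuation bytes), not a complete
validator. It shows that U+00A0 needs the two bytes 0xC2 0xA0 in UTF-8, and
that a lone 0xA0 byte between ASCII digits cannot be UTF-8.

```cpp
#include <cstddef>
#include <string>

// Encode one Unicode code point as UTF-8.
// Assumes cp <= 0x10FFFF and that cp is not a surrogate.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Structural check: every lead byte must be followed by the right number of
// continuation bytes (10xxxxxx). Overlong 3- and 4-byte forms and surrogates
// are not rejected here.
bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (b < 0x80) extra = 0;
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;
        else return false;  // stray continuation byte or invalid lead byte

        if (i + extra >= s.size()) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}
```

With these, encode_utf8(0xA0) yields a two-byte string, and looks_like_utf8
rejects the byte sequence '1', 0xA0, '2', '2', '4' while accepting the same
digits with 0xC2 0xA0 in the middle.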
> This flag is a GNU extension.
So, we are in the realm of -extensions-, not of Standard C. That is fine if you
want to use it, but beware: no consistency with the C++ Standard is guaranteed.
> You are wrong -
> the output is _not_ OK. It is not UTF-8. Run the program with .ISO-8859-1
> instead of .UTF-8, and you get the non-breaking space in .ISO-8859-1. Then
> put that character through iconv from ISO-8859-1 to UTF-8 and you get _two_
> characters, not one (in fact it could not possible be just one character when
> it's UTF-8).
In the ISO Standard the thousands separator is a -single- char_type of the
-internal- encoding. Therefore, in general, in order to accomplish what you
want, you have to use an internal encoding sufficiently wide (cout -> wcout)
and also you have to call std::ios::sync_with_stdio(false) before any other
I/O operation, otherwise no encoding to UTF-8 (from the internal
representation) will take place (despite the imbue).
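
A rough sketch of the above, not a tested recipe. NbspGrouping is a made-up
facet that stands in for a Finnish-style locale so that the example does not
depend on installed locales. It shows the separator as a single wchar_t in the
internal encoding. print_grouped follows the wcout advice and takes the locale
name as a parameter. A name such as "fi_FI.UTF-8" is an assumption about what
the system provides.

```cpp
#include <iostream>
#include <locale>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical facet: groups digits in threes and separates the groups with
// U+00A0 (no-break space), which is one wchar_t internally.
class NbspGrouping : public std::numpunct<wchar_t> {
protected:
    wchar_t do_thousands_sep() const override { return L'\u00A0'; }
    std::string do_grouping() const override { return "\3"; }
};

// Formats `value` into a wide string using the facet above.
std::wstring format_grouped(long value) {
    std::wostringstream out;
    // The locale takes ownership of the facet and deletes it when done.
    out.imbue(std::locale(std::locale::classic(), new NbspGrouping));
    out << value;
    return out.str();
}

// Writes `value` to std::wcout under a named system locale. Returns false if
// that locale is not installed. Intended to be called before any other I/O on
// the standard streams, as described above.
bool print_grouped(long value, const char* locale_name) {
    std::ios::sync_with_stdio(false);
    try {
        std::wcout.imbue(std::locale(locale_name));
    } catch (const std::runtime_error&) {
        return false;
    }
    std::wcout << value << L'\n';
    return true;
}
```

format_grouped(1224) returns five wide characters: '1', U+00A0, '2', '2', '4'.
Turning U+00A0 into two UTF-8 bytes is then left to the codecvt facet of the
locale imbued on the wide stream.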
... I forgot to add: complete support for UTF-8 is available only in 3.4.0,
so do not even try with 3.3.x ;-)
*** Bug 29379 has been marked as a duplicate of this bug. ***
Let's reopen this report as an enhancement request. In fact, we should implement this:
probably by using iconv to implement the relevant char <-> char codecvt_byname, like here:
Note that this is not going to happen very soon, as it will be quite a bit of work. For now, people who need UTF-8 output should, as a rule, rely on wchar_t streams.
*** Bug 39243 has been marked as a duplicate of this bug. ***