Bug 16006 - Conversions of numbers in fi_FI.UTF-8 produces incorrect UTF-8
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++
Version: 3.3.4
Importance: P2 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
Duplicates: 29379 39243
Reported: 2004-06-15 16:48 UTC by Ole Laursen
Modified: 2023-05-16 19:41 UTC

Last reconfirmed: 2006-11-06 11:24:16


Description Ole Laursen 2004-06-15 16:48:43 UTC
It seems that Finnish numbers use a non-breaking space as the thousands
separator. This character (0xA0 I believe) is converted incorrectly to UTF-8
when using the UTF-8 locale and outputting numbers. This program demonstrates
the problem:

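#include <clocale>
#include <cstdio>
#include <iostream>
#include <locale>

int main()
{
  // C++ stream output
  std::cout.imbue(std::locale("fi_FI.UTF-8"));
  std::cout << 1224 << std::endl;

  // C stdio output, for comparison
  std::setlocale(LC_ALL, "fi_FI.UTF-8");
  std::printf("%'d\n", 1224);
}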

Compile it with "g++ tmp.cpp -o tmp -Wall". Then run "./tmp". It will output

ole:~/tmp$ ./tmp
1Â224
1Â 224

Note that libstdc++ converts the non-breaking space (you can see this character
by using .ISO-8859-1 instead of .UTF-8 in the program) into one character, which
is obviously not UTF-8, whereas libc converts the space into two.
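
To make the one-byte versus two-byte difference concrete, here is a minimal sketch of a UTF-8 well-formedness check. The names is_valid_utf8 and check_examples are made up for illustration. The sample strings assume the lone byte emitted by libstdc++ is 0xC2, which is what the Â in the output above suggests when viewed as ISO-8859-1. U+00A0 itself is encoded in UTF-8 as the two bytes 0xC2 0xA0.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8: correct lead/continuation
// structure, no overlong forms, no surrogates, nothing above U+10FFFF.
bool is_valid_utf8(const std::string& s)
{
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t extra;      // continuation bytes expected after the lead
    unsigned long cp;       // code point being decoded
    unsigned long min_cp;   // smallest code point allowed for this length
    if (lead < 0x80)                { extra = 0; cp = lead;        min_cp = 0; }
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
    else return false;      // stray continuation byte or invalid lead byte

    if (s.size() - i - 1 < extra) return false;   // sequence is truncated
    for (std::size_t k = 1; k <= extra; ++k) {
      const unsigned char c = static_cast<unsigned char>(s[i + k]);
      if ((c & 0xC0) != 0x80) return false;       // not a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return false;
    i += extra + 1;
  }
  return true;
}

// Checks the two byte sequences discussed in this report.
// (Assumption: the single byte written by libstdc++ is 0xC2.)
void check_examples()
{
  assert(is_valid_utf8("1\xC2\xA0" "224"));   // complete two-byte sequence
  assert(!is_valid_utf8("1\xC2" "224"));      // lead byte followed by '2'
}
```

A consumer that validates its input, as GTK+ does for UTF-8 text, would reject the second string because the byte after 0xC2 is an ASCII digit rather than a continuation byte.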

I'm not from Finland myself, but a user of one of my programs had mysterious
crashes due to this problem - obviously they only occur when the numbers reach
1,000 or more.
Comment 1 Paolo Carlini 2004-06-16 09:13:17 UTC
Hi. First, a comment about your C lines: what do you mean by "%'d"? This format
string looks definitely incorrect to me, and if I change it to just "%d" the
expected
1224
is produced.
About the C++ lines: the particular capital A is just the thousands separator
in the locale at issue. The output also seems OK (taking into account the
character set of your (and my ;) shell).
Comment 2 Ole Laursen 2004-06-16 14:17:51 UTC
The %'d is to make it output the thousands separator. Look in the glibc manual:

`''
     Separate the digits into groups as specified by the locale
     specified for the `LC_NUMERIC' category; *note General Numeric::.
     This flag is a GNU extension.

I'm not sure how you do it otherwise in C. But about the bug. You are wrong -
the output is _not_ OK. It is not UTF-8. Run the program with .ISO-8859-1
instead of .UTF-8, and you get the non-breaking space in .ISO-8859-1. Then put
that character through iconv from ISO-8859-1 to UTF-8 and you get _two_
characters, not one (in fact it could not possibly be just one character when
it's UTF-8).

So glibc is right (produces correct UTF-8 non-breaking space) and libstdc++ is
wrong (produces incorrect UTF-8 non-breaking space). The invalid UTF-8 from
libstdc++ makes my GTK+ program die horribly.
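
The ISO-8859-1 to UTF-8 step can be sketched by hand. This is not iconv itself, only a small stand-in for that one conversion, and the names latin1_to_utf8 and check_nbsp are invented for the example. It relies on the fact that ISO-8859-1 byte values are the code points U+0000 to U+00FF, so every byte at or above 0x80 becomes exactly two bytes in UTF-8.

```cpp
#include <cassert>
#include <string>

// Converts ISO-8859-1 text to UTF-8. Bytes below 0x80 are copied unchanged;
// bytes 0x80..0xFF are written as a two-byte sequence 110000xx 10xxxxxx.
std::string latin1_to_utf8(const std::string& in)
{
  std::string out;
  out.reserve(in.size() * 2);
  for (std::string::size_type i = 0; i < in.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(in[i]);
    if (c < 0x80) {
      out += static_cast<char>(c);
    } else {
      out += static_cast<char>(0xC0 | (c >> 6));     // lead byte
      out += static_cast<char>(0x80 | (c & 0x3F));   // continuation byte
    }
  }
  return out;
}

// The non-breaking space 0xA0 must come out as the two bytes 0xC2 0xA0.
void check_nbsp()
{
  const std::string utf8 = latin1_to_utf8("\xA0");
  assert(utf8.size() == 2);
  assert(utf8 == "\xC2\xA0");
}
```

No input byte at or above 0x80 can map to a single output byte, which is why a one-byte separator cannot be a correctly encoded non-breaking space.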
Comment 3 Paolo Carlini 2004-06-16 15:57:45 UTC
> This flag is a GNU extension.

So, we are in the realm of -extensions-, not of Standard C. That's OK if you
want to use it, but beware: no consistency with the C++ Standard is guaranteed.

> You are wrong -
> the output is _not_ OK. It is not UTF-8. Run the program with .ISO-8859-1
> instead of .UTF-8, and you get the non-breaking space in .ISO-8859-1. Then
> put that character through iconv from ISO-8859-1 to UTF-8 and you get _two_
> characters, not one (in fact it could not possibly be just one character when
> it's UTF-8).

In the ISO Standard the thousands separator is a -single- char_type of the
-internal- encoding. Therefore, in general, in order to accomplish what you
want, you have to use an internal encoding sufficiently wide (cout -> wcout)
and also you have to call std::ios::sync_with_stdio(false) before any other
I/O operation, otherwise no encoding to UTF-8 (from the internal
representation) will take place (despite the imbue).

Thanks, Paolo.
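
A minimal sketch of the wide-stream approach described above, assuming libstdc++ 3.4 or later and an installed fi_FI.UTF-8 locale. The helper names write_grouped and run_demo are hypothetical; only the standard calls (sync_with_stdio, imbue, the std::locale constructor) come from the library. The locale constructor throws std::runtime_error when the named locale is missing, so the sketch reports that instead of terminating.

```cpp
#include <cassert>
#include <iostream>
#include <locale>
#include <stdexcept>

// Imbues `out` with the named locale and writes `value` plus a newline.
// Returns false, leaving `out` untouched, if the locale is not installed.
bool write_grouped(std::wostream& out, const char* locale_name, long value)
{
  assert(locale_name != 0);
  try {
    out.imbue(std::locale(locale_name));
  } catch (const std::runtime_error&) {
    return false;
  }
  out << value << L'\n';
  return true;
}

// Meant to be called first thing from main(), before any other I/O.
int run_demo()
{
  // Without this call the imbued locale's codecvt facet is not used
  // to convert the wide characters to UTF-8 on output.
  std::ios::sync_with_stdio(false);

  if (!write_grouped(std::wcout, "fi_FI.UTF-8", 1224)) {
    std::cerr << "locale fi_FI.UTF-8 is not installed\n";
    return 1;
  }
  return 0;
}
```

Here the thousands separator is a single wchar_t internally, and the conversion to a multibyte UTF-8 sequence happens when the stream buffer writes to the file.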
Comment 4 Paolo Carlini 2004-06-16 16:07:29 UTC
... forgot to add: complete support for UTF-8 is available only in 3.4.0,
so don't even try with 3.3.x ;-)
Comment 5 Paolo Carlini 2006-10-07 19:48:31 UTC
*** Bug 29379 has been marked as a duplicate of this bug. ***
Comment 6 Paolo Carlini 2006-10-08 11:04:21 UTC
Let's reopen this report as an enhancement request. In fact, we should implement this:

  http://gcc.gnu.org/ml/libstdc++/2004-06/msg00256.html

probably by using iconv to implement the relevant char <-> char codecvt_byname, like here:

  http://gcc.gnu.org/ml/libstdc++/2004-06/msg00252.html

Note that this is not going to happen very soon, since it will be quite a bit of work. For now, people requiring UTF-8 output should as a rule rely on wide-character (wchar_t) streams.
Comment 7 Paolo Carlini 2010-01-08 18:47:48 UTC
*** Bug 39243 has been marked as a duplicate of this bug. ***
Comment 8 Andrew Pinski 2023-05-16 19:41:13 UTC
Unassigning, since Benjamin has not been active in GCC development for over 8 years now.