Bug 34733 - numpunct::grouping() doesn't match libc value for Bulgarian (bg_BG) locale
Summary: numpunct::grouping() doesn't match libc value for Bulgarian (bg_BG) locale
Status: SUSPENDED
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++ (show other bugs)
Version: 4.1.2
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-01-11 01:02 UTC by Martin Sebor
Modified: 2016-01-16 14:02 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2008-01-12 18:06:26


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Sebor 2008-01-11 01:02:27 UTC
I came across this while investigating (most likely) a related problem in Apache stdcxx. Btw., I suspect the bg_BG locale is incorrect in defining thousands_sep to NUL and filed http://sources.redhat.com/bugzilla/show_bug.cgi?id=5599

$ cat t.cpp && g++ -dumpversion && g++ t.cpp && LC_NUMERIC=bg_BG locale -ck LC_NUMERIC && ./a.out bg_BG | od -c
#include <cassert>
#include <clocale>
#include <cstdio>
#include <locale>
#include <string>

int main (int argc, char *argv[])
{
    const char* const name = 1 < argc ? argv [1] : "bg_BG";

    const std::locale bg (name);

    const std::numpunct<char> &np =
        std::use_facet<std::numpunct<char> >(bg);
   
    const char ts = np.thousands_sep ();
    const std::string grp = np.grouping ();

    std::setlocale (LC_ALL, name);
    const lconv* const pconv = std::localeconv ();

    std::printf ("'\\x%02x' == '\\x%02x'\n", ts, *pconv->thousands_sep);
    std::printf ("\"%s\" == \"%s\"\n", grp.c_str (), pconv->grouping);

    assert (ts == *pconv->thousands_sep);
    assert (grp == pconv->grouping);
}

4.1.2
LC_NUMERIC
decimal_point=","
thousands_sep=""
grouping=3;3
numeric-decimal-point-wc=44
numeric-thousands-sep-wc=0
numeric-codeset="CP1251"
a.out: t.cpp:26: int main(int, char**): Assertion `grp == pconv->grouping' failed.
0000000   '   \   x   0   0   '       =   =       '   \   x   0   0   '
0000020  \n   "   "       =   =       " 003 003   "  \n
0000034
Comment 1 Paolo Carlini 2008-01-11 01:54:11 UTC
Yes, I think we want to suspend this, waiting for the resolution of the glibc issue. Because, I clearly remember adjusting the code basing on feedback from people working on glibc: I learned that an empty thousand separator implies no grouping (i.e., the langinfo entry for GROUPING is not relevant in that case) and we even added a comment about that in the code. I think Andreas Schwab may remember something...
Comment 2 Martin Sebor 2008-01-11 02:09:33 UTC
Right, in C it does mean that (because thousands_sep is a multibyte string, and so the value is really ""). The problem is that in C++ a NUL thousands_sep is a perfectly valid single-byte character, i.e., '\0'. IMO, the way to fix it is by somehow distinguishing a NUL thousands_sep that comes from the system locale (e.g., via localeconv()) and a NUL thousands_sep intentionally supplied by the user (read "test suite writer" ;-) because they want to see it on output.
Comment 3 Paolo Carlini 2008-01-11 02:18:49 UTC
(In reply to comment #2)
> Right, in C it does mean that (because thousands_sep is a multibyte string, and
> so the value is really ""). The problem is that in C++ a NUL thousands_sep is a
> perfectly valid single-byte character, i.e., '\0'. IMO, the way to fix it is by
> somehow distinguishing a NUL thousands_sep that comes from the system locale
> (e.g., via localeconv()) and a NUL thousands_sep intentionally supplied by the
> user (read "test suite writer" ;-) because they want to see it on output.

I do not understand. When v3 sees a NUL thousand separator in a system locale (like bg_BG or what else), it understands no grouping, consistently with C, as you are saying. In that case, what it would see as GROUPING with langinfo is simply not relevant, I would say random garbage. Therefore, there is no point in comparing anything. What should we change?
Comment 4 Paolo Carlini 2008-01-11 02:27:42 UTC
In other terms, on the v3 side, we are not grouping anything in such cases, therefore grouping() can only be the empty string, consistently with 22.2.3.1.2. As for the thousands separator, a '\0' seems a good character as anything else, in that case. I don't see what could v3 do in order to improve the conformance.
Comment 5 Martin Sebor 2008-01-11 03:09:13 UTC
It's irrelevant to the implementation but it could be relevant to user-defined formatting (or parsing) code that bypasses num_put (or num_get) but uses numpunct to get the expected formats.

IMO, the improvement in any Linux implementation (libstdc++ or stdcxx) lies in figuring out how to preserve the "\3\3" grouping and NUL thousands_sep for bg_BG (and expose each via numpunct::grouping() and numpunct::thousdands_sep()) without causing num_put to insert NUL every three digits.
Comment 6 Paolo Carlini 2008-01-11 03:17:22 UTC
(In reply to comment #5)
> It's irrelevant to the implementation but it could be relevant to user-defined
> formatting (or parsing) code that bypasses num_put (or num_get) but uses
> numpunct to get the expected formats.

I'm not questioning that.

> IMO, the improvement in any Linux implementation (libstdc++ or stdcxx) lies in
> figuring out how to preserve the "\3\3" grouping and NUL thousands_sep for
> bg_BG (and expose each via numpunct::grouping() and numpunct::thousdands_sep())
> without causing num_put to insert NUL every three digits.

Frankly, I think it's impossible. Because the underlying glibc info says the C++ library to not group. The locale does not group. In that case, I don't think grouping() can be anything but the empty string. And I think 22.2.3.1.2/3 (and the rest of 22.2) almost enforces that explicitly. If you disagree with that, I think we should file a DR. 

Comment 7 Martin Sebor 2008-01-11 03:37:08 UTC
But that's just the libstdc++ interpretation of grouping and thousands_sep (no
offense). The C library paints a different picture.

If I want to write my own formatter/parser for numbers in the Bulgarian locale
that inserts '@' as my own thousands separator every 3 digits I can do it but
I have to drop down to C and get the grouping data from localeconv(). Not that
there's anything wrong with it but it would be nice if I could also do it in
C++ as well.

What I'm saying is that if the C library says localeconv()->grouping is "\3\3"
the C++ numpunct::grouping() shouldn't say it's something else.

I'm not sure this is something the C++ standard can do anything about except
acknowledge it as a limitation. But that doesn't mean that implementations
couldn't do better. For instance, numpunct could append some harmless special
character to its internal representation of grouping (say set it to "\3\3\0\0")
when thousands_sep is NUL. num_put would be instructed to look for the two
consecutive NULs in grouping and avoid inserting thousands_sep when NUL. Since
user-defined numpunct would almost certainly not return grouping terminated by
two NULs if such a facet with a grouping of "\3\3" and NUL for thousands_sep
num_put would insert the NUL as expected. Yes, it's a hack but it's one that
would work in the vast majority of cases: the implementation would pass anal
tests while behaving sanely in ordinary situations.
Comment 8 Paolo Carlini 2008-01-11 03:59:53 UTC
(In reply to comment #7)
> What I'm saying is that if the C library says localeconv()->grouping is "\3\3"
> the C++ numpunct::grouping() shouldn't say it's something else.

Why not? If we agree that when the thousand separator returned by the C library is NUL, then localeconv()->grouping **is meaningless**, garbage, uninitialized memory (I think we previously agreed about that), I don't see why people can expect that match. In particular, I don't see that because, in C++, due probably to a limitation in the standard, there is no separate boolean in numpunct to tell "no-grouping": I maintain that essentially the only meaningful way to espress "no-grouping" in C++, in the grouping() string, is leaving the string empty. I think my reading of the standard is pretty straightforward, I mentioned a specific section, there is also a note, saying how "3" would be interpreted. 

Note, I'm not interested in num_put, here, of course in principle you can store "somewhere" the values returned by the underlying C library and make num_put happy, I do not dispute that you can play one of tricks you mentioned, for that.

What I'm saying is that, a numpunct conveying to its user the meaning "no-grouping" is a numpunct returning an empty string from grouping(). Any other return value, even if for the sake of consistency of an hypothetical - note, I'm saying hypothetical, see above, I don't think localeconv()->grouping is anything but junk when the thousand separator is NUL, in C - localedata seems wrong to me.
Comment 9 Martin Sebor 2008-01-11 05:44:42 UTC
I don't agree that localeconv()->grouping is garbage just because thousands_sep
is NUL. I'm not aware of anything in C or POSIX that says that. In the case of
bg_BG, the grouping is clearly correct. What's questionable in this case is the
value of thousands_sep. But that's probably just a bug in the Bulgarian locale
definition. Bugs aside, there may be a perfectly valid locale (a user-defined
one) in which there is no thsousands_sep (i.e., it's NUL) but where grouping
is non-empty. It's trivial to create one as an exercise. The C++ library
numpunct should be able to represent such a locale without change.
Comment 10 Paolo Carlini 2008-01-11 10:18:30 UTC
(In reply to comment #9)
> I don't agree that localeconv()->grouping is garbage just because thousands_sep
> is NUL. I'm not aware of anything in C or POSIX that says that.

We are working on this assumption. You agreed to it, at the beginning of this thread. You agreed that, in C, a NUL as thousand separator in the localedata means no grouping in C. The grouping must be not looked at, in that case (this point is exactly what has been clarified to me, time ago, by glibc people) And that makes perfect sense, because C strings cannot have a \0 in the middle. In C, there is no meaningful way to group with \0.

 In the case of
> bg_BG, the grouping is clearly correct. What's questionable in this case is the
> value of thousands_sep. But that's probably just a bug in the Bulgarian locale
> definition.

That is another, glibc, issue, probably.

 Bugs aside, there may be a perfectly valid locale (a user-defined
> one) in which there is no thsousands_sep (i.e., it's NUL) but where grouping
> is non-empty. It's trivial to create one as an exercise. The C++ library
> numpunct should be able to represent such a locale without change.

Nobody disagrees with that! Maybe that was not clear in the previous messages. Of course the user can do that, in C++, and the rationale is that basic_string can have \0 in the middle. That seems obvious to me. No problem in v3, AFAIK, but any issue here it's just a bug that will be fixed ASAP.

Comment 11 Martin Sebor 2008-01-11 15:56:52 UTC
I think the disconnect might be in how each of us is looking at the LC_NUMERIC
data. To me, it's just a bunch of values that are independent of one another.
You, OTOH, seem to view it more as a set of rules described by the data (if
thousands_sep == NUL then grouping doesn't matter).

The reason why I don't look at it that way is because I think the rules for
the interpretation of the data are separate from the data: in C they are
in sprintf and C++ in num_put.

So while I agree that "NUL thousands_sep means no grouping" to stdio and to
iostreams in C++, it's just one possible interpretation of the data for the
specific purposes of the two libraries. There are other possible and equally
valid interpretations/rules that can be drawn from it (e.g., in third party
code).

Finally, I'm not suggesting or even hoping that anything be done about this
in libstdc++ ASAP, or necessarily at any point. I mostly just wanted to make
a record of the issue (whether you consider it valid or not) and get your
perspective on it, that's all :)
Comment 12 Martin Sebor 2008-01-11 15:59:03 UTC
(In reply to comment #11)
[...]
> So while I agree that "NUL thousands_sep means no grouping" to stdio and to
> iostreams in C++,

I should clarify that in C++ it means that only if the NUL comes from libc (e.g.,
via localeconv), but not from a user-defined numpunct facet.
Comment 13 Paolo Carlini 2008-01-12 18:06:25 UTC
Anyway... The issue, after all, seems pretty academic to me, for the simple reason that, per your report, the only known example of a named locale showing '\0' as thousands separator is bg_BG, and most likely the data isn't even ok, per your glibc report. Therefore, unless the data turns out to be ok and the glibc people tell us, contrary to our understanding, that a \0 must be interpreted differently, there is really nothing to do. Let's suspend the PR, in the meantime.
Comment 14 Martin Sebor 2008-01-12 22:45:44 UTC
bg_BG is the only known example on Linux.

The original bug report we got was for a fr_FR locale on Tru64. I haven't
gone through other locales on Tru64 or any other platforms except for Linux
to see how pervasive such locales might be. I'll keep you posted once I have
more data.

I agree that if the locale is buggy and the bug is rare there's not much
point in spending time on this issue.
Comment 15 Paolo Carlini 2008-01-12 22:57:26 UTC
(In reply to comment #14)
> bg_BG is the only known example on Linux.

To be clear, for occasional readers: we are supporting named locales only on GNU / Linux systems.