Bug 39168 - Incorrect interpretation of CHAR_MAX inside grouping string in monetary and numeric facets.
Summary: Incorrect interpretation of CHAR_MAX inside grouping string in monetary and n...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++ (show other bugs)
Version: unknown
: P3 normal
Target Milestone: 4.4.0
Assignee: Paolo Carlini
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-02-12 15:26 UTC by Andrey Tsyvarev
Modified: 2009-02-15 16:49 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2009-02-12 15:58:48


Attachments
test for num_get<>::get() method (288 bytes, text/plain)
2009-02-12 15:30 UTC, Andrey Tsyvarev
Details
test for money_put<>::put method (296 bytes, text/plain)
2009-02-12 15:33 UTC, Andrey Tsyvarev
Details
Unified money_get and num_get test case and results. (677 bytes, text/plain)
2009-02-14 21:21 UTC, Martin Sebor
Details
Corrected unified demo. (802 bytes, text/plain)
2009-02-14 21:58 UTC, Martin Sebor
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Andrey Tsyvarev 2009-02-12 15:26:53 UTC
Description of numpunct<>::do_grouping() (22.2.3.1.2, p3): 

Returns: A basic_string<char> vec used as a vector of integer values, in which each element vec[i] represents the number of digits in the group at position i, starting with position 0 as the rightmost group. If vec.size() <= i, the number is the same as group (i-1); if (i<0 || vec[i]<=0 || vec[i]==CHAR_MAX), the size of the digit group is unlimited. 

But num_put and num_get implementations don't interpret condition (vec[i]==CHAR_MAX) as condition of group with unlimited size, but interpret value CHAR_MAX as regular size of digits group.

According to the code, condition (static_cast<signed char>(vec[i]) > 0) is used as an indicator of regular group size. This means range 1..127 for vec[i] values.

But on architectures with char ~ signed char(e.g. x86), CHAR_MAX is 127, so the range shouldn't include value 127.

In tests, string(1, CHAR_MAX) is used as grouping string. Like string() and string(1, -1), this should disable grouping at all.
But on architectures with char ~ signed char tests failed. On architectures with char ~ unsigned char or when CHAR_MAX replaced with -1, all became ok.
Comment 1 Andrey Tsyvarev 2009-02-12 15:30:37 UTC
Created attachment 17289 [details]
test for num_get<>::get() method

When grouping is disabled, thousands separator should not be read by get() methods.
Comment 2 Andrey Tsyvarev 2009-02-12 15:33:32 UTC
Created attachment 17290 [details]
test for money_put<>::put method

When grouping is disabled, money_put<>::put() method for digits should write these digits as is, without thousands separator.
Comment 3 Paolo Carlini 2009-02-12 15:58:48 UTC
Ok, thanks.
Comment 4 Martin Sebor 2009-02-12 16:49:27 UTC
(In reply to comment #0)

I'm not sure I understand your rationale or I agree that this is a bug. IIUC,
string(1, CHAR_MAX) indicates that groups may be of arbitrary length, which
includes "123,456" This behavior is the same regardless of whether char is
a signed or unsigned type.

As a data point WRT existing practice: all implementations I've tried (Apache
stdcxx on HP-UX/IPF, HP aCC 6.16, Sun C++ 5.9 with both libCstd and STLport,
and IBM XLC++ 9.1) behave the same as libstdc++: they extract 123456 from the
stream and set eofbit.
Comment 5 Paolo Carlini 2009-02-12 17:05:55 UTC
Hi. Maybe the testcases should be amended, but surely when char is signed, just checking that vec[i] > 0 as an indication of regular group size instead of vec[i] > 0 && vec[i] != CHAR_MAX cannot be right...

Please, Andrey, double check the testcases. Thanks in advance!
Comment 6 Paolo Carlini 2009-02-12 17:53:10 UTC
Actually, libstdc++ stores 123456, which is indeed fine, and sets failbit | eofbit, failbit exactly because of the issue discussed here.
Comment 7 Andrey Tsyvarev 2009-02-13 11:21:07 UTC
(In reply to comment #4)
> 
> I'm not sure I understand your rationale or I agree that this is a bug. IIUC,
> string(1, CHAR_MAX) indicates that groups may be of arbitrary length, which
> includes "123,456" This behavior is the same regardless of whether char is
> a signed or unsigned type.

"Arbitrary length" is not quite correct here - "123,456" violates grouping, given with string(1, CHAR_MAX). Standard use term "unlimited length", which means(as I understand) that all other digits should incorporate in only one group - only "123456" is correct.

(In reply to comment #6)
>Actually, libstdc++ stores 123456, which is indeed fine, and sets failbit |
>eofbit, failbit exactly because of the issue discussed here.

The thing is that, according to the standard, CHAR_MAX should be treats similar as -1. But implementation treats string(1, -1) as no grouping at all, and stops read, when has encountered symbol ','. So only "123" is accumulated.
This behaviour seems correct for me (though standard treats only string() as no grouping at all, 22.2.2.1.2, p8).
So with string(1, CHAR_MAX) only "123" should be accumulated, not "123,456".

In other words, test is passed when CHAR_MAX is replaced with -1. Inside grouping string, CHAR_MAX means same as -1(according to the standard). So test should be passed with original text.

The same is about the second test.
Comment 8 Paolo Carlini 2009-02-13 11:31:41 UTC
(In reply to comment #7)
                                  Standard use term "unlimited length", which
> means(as I understand) that all other digits should incorporate in only one
> group - only "123456" is correct.

I don't read that anywhere in the Standard.

> The thing is that, according to the standard, CHAR_MAX should be treats similar
> as -1. But implementation treats string(1, -1) as no grouping at all, and stops
> read, when has encountered symbol ','. So only "123" is accumulated.
> This behaviour seems correct for me (though standard treats only string() as no
> grouping at all, 22.2.2.1.2, p8).
> So with string(1, CHAR_MAX) only "123" should be accumulated, not "123,456".

No, no, it is not correct, it is a bug in the current implementation. See where bool discard is defined, in Stage 2. (1, -1) (or (1, CHAR_MAX)) are not special, are normal groupings, that must be enforced.
Comment 9 Paolo Carlini 2009-02-13 11:39:15 UTC
In other terms, as usual about those matters, Martin is right ;) Only he has got a wrong data point about libstdc++, he believes we are setting eofbit.

Thanks anyway for pointing our attention to this CHAR_MAX issue, I'm going to fix it together with the bool discard thing (and its consequences).
Comment 10 Andrey Tsyvarev 2009-02-13 13:41:23 UTC
(In reply to comment #8)
> (In reply to comment #7)
>                                   Standard use term "unlimited length", which
> > means(as I understand) that all other digits should incorporate in only one
> > group - only "123456" is correct.
> 
> I don't read that anywhere in the Standard.

Could you clarify this a bit?
Do you mean that when reading "123,456" with string(1, -1), failbit shouldn't be set?
Comment 11 Paolo Carlini 2009-02-13 13:43:17 UTC
(In reply to comment #10)
> Do you mean that when reading "123,456" with string(1, -1), failbit shouldn't
> be set?

Right, as Martin said. 

Comment 12 Andrey Tsyvarev 2009-02-13 15:04:39 UTC
Let's consider the following situation (seems lifelike to me). Suppose one needs a representation of numbers in which only the last 3 digits are separated from all other digits (grouped), like "1234,567" or "1234567,890". Other separators shouldn't appear.

Grouping string "\003" doesn't fit for this purpose as it separates all 3-digits groups: "1,234,567".
Before this PR, I thought that "\003\177" is sufficient for this purpose. But, as I understand now, the representations like "12,34,567" are acceptable in this case, which is not what I would like to have.

Could you suggest which grouping string should be used to do such number representation? Or is this unachievable?

I investigated locale support in POSIX, such number representation is achieved there with "\003\177". It seems strange for me that similar mechanism is not work with C++ locales.
Comment 13 Paolo Carlini 2009-02-13 15:11:34 UTC
(In reply to comment #12)
> Could you suggest which grouping string should be used to do such number
> representation? Or is this unachievable?

Actually \003\176 seems perfectly practical to me. In particular considering the actual size of real numbers.

> I investigated locale support in POSIX, such number representation is achieved
> there with "\003\177". It seems strange for me that similar mechanism is not
> work with C++ locales.

POSIX seems a good point, probably you should have mentioned it much earlier. Let's hear Martin again, then. Certainly, however, I'm concerned about having a behavior different from all the other implementations mentioned by Martin.

Comment 14 Paolo Carlini 2009-02-13 15:21:29 UTC
If I understand correctly, in order to implement the POSIX behavior for C++, assuming we want it, the Standard should be clarified to explain that values <= 0 or CHAR_MAX are different in that do no admit repetitions, thus saying explicitly that such group is effectively the last, arbitrarily long, one.

I must say, that is what we would do when formatting anyway.
Comment 15 Andrey Tsyvarev 2009-02-13 15:38:52 UTC
(In reply to comment #14)
> If I understand correctly, in order to implement the POSIX behavior for C++,
> assuming we want it, the Standard should be clarified to explain that values <=
> 0 or CHAR_MAX are different in that do no admit repetitions, thus saying
> explicitly that such group is effectively the last, arbitrarily long, one.

Yes, I meen exactly this.
Also, current implementation follows this strategy - accroding to the tests, and according to it's source code.

So it is strange for me, that the standard for num_get<>::do_get() say, that only string() is indicator of no grouping at all.
While string(1,-1) and string(1,CHAR_MAX) also force number representation to have only one group.
Comment 16 Paolo Carlini 2009-02-13 15:49:41 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > If I understand correctly, in order to implement the POSIX behavior for C++,
> > assuming we want it, the Standard should be clarified to explain that values <=
> > 0 or CHAR_MAX are different in that do no admit repetitions, thus saying
> > explicitly that such group is effectively the last, arbitrarily long, one.
> 
> Yes, I meen exactly this.
> Also, current implementation follows this strategy - according to the tests,
> and according to it's source code.

Yes, I know that ;) And I also know that implementing the exact behaviour you want would be trivial and would bring consistency between signed and unsigned char platforms. Probably we want to implement it anyway in v3, but that doesn't mean by itself that it's obviously the only meaningful interpretation of the standard if so many other implementations differ, as pointed out by Martin. 

Thus let's be clear about that with Martin and probably at the same time also take actions to open a DR for C++0x.
Comment 17 Martin Sebor 2009-02-14 21:21:29 UTC
Created attachment 17300 [details]
Unified money_get and num_get test case and results.

Attached is a unified test case with test results on popular platforms. The
assertions reflect the behavior described in comment #4 which may actually
turn out to be incorrect in the end.
Comment 18 Martin Sebor 2009-02-14 21:41:51 UTC
I was too hasty -- the attached test case is buggy: it's missing a seek to
the beginning of the stream after the first extraction, making the results
for the num_get part meaningless.

(In reply to comment #7)
> "Arbitrary length" is not quite correct here - "123,456" violates grouping,
> given with string(1, CHAR_MAX). Standard use term "unlimited length", which
> means(as I understand) that all other digits should incorporate in only one
> group - only "123456" is correct.

That seems like a reasonable interpretation but others appear to be possible
as well. Looks like this needs to be clarified.

(In reply to comment #12)
> Let's consider the following situation (seems lifelike to me). Suppose one
> needs a representation of numbers in which only the last 3 digits are separated
> from all other digits (grouped), like "1234,567" or "1234567,890". Other
> separators shouldn't appear.

It seems that "\003\000" should do that, and unless I'm mistaken, does with
libstc++ (but not other implementations).

(In reply to comment #13)
> POSIX seems a good point, probably you should have mentioned it much earlier.
> Let's hear Martin again, then. Certainly, however, I'm concerned about having a
> behavior different from all the other implementations mentioned by Martin.

I agree. It would be good to reconcile any accidental differences between
C++ and POSIX.
Comment 19 Paolo Carlini 2009-02-14 21:54:54 UTC
Thanks Martin. Then, as requested by submitter, for now I'm just going to tweak a bit the behaviour of libstdc++ to bring consistency between unsigned and signed char platforms, in other terms, value <= 0 and value == CHAR_MAX will lead to the same behavior everywhere.
Comment 20 Martin Sebor 2009-02-14 21:58:49 UTC
Created attachment 17301 [details]
Corrected unified demo.

Attached a corrected unified demo with assertions removed and with output on popular implementations for reference.
Comment 21 paolo@gcc.gnu.org 2009-02-15 16:48:12 UTC
Subject: Bug 39168

Author: paolo
Date: Sun Feb 15 16:47:57 2009
New Revision: 144190

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=144190
Log:
2009-02-15  Paolo Carlini  <paolo.carlini@oracle.com>

	PR libstdc++/39168
	* src/locale_facets.cc (__verify_grouping(const char*, size_t,
	const string&)): Also check that the value != CHAR_MAX.
	* include/bits/locale_facets.tcc (__numpunct_cache<>::
	_M_cache(const locale&), __add_grouping(_CharT*, _CharT,
	const char*, size_t, const _CharT*, const _CharT*)): Likewise.
	* include/bits/locale_facets_nonio.tcc (__moneypunct_cache<>::
	_M_cache(const locale&)): Likewise.
	* testsuite/22_locale/money_put/put/wchar_t/39168.cc: New.
	* testsuite/22_locale/money_put/put/char/39168.cc: Likewise.
	* testsuite/22_locale/money_get/get/wchar_t/39168.cc: Likewise.
	* testsuite/22_locale/money_get/get/char/39168.cc: Likewise.
	* testsuite/22_locale/num_get/get/wchar_t/39168.cc: Likewise.
	* testsuite/22_locale/num_get/get/char/39168.cc: Likewise.


Added:
    trunk/libstdc++-v3/testsuite/22_locale/money_get/get/char/39168.cc
    trunk/libstdc++-v3/testsuite/22_locale/money_get/get/wchar_t/39168.cc
    trunk/libstdc++-v3/testsuite/22_locale/money_put/put/char/39168.cc
    trunk/libstdc++-v3/testsuite/22_locale/money_put/put/wchar_t/39168.cc
    trunk/libstdc++-v3/testsuite/22_locale/num_get/get/char/39168.cc
    trunk/libstdc++-v3/testsuite/22_locale/num_get/get/wchar_t/39168.cc
Modified:
    trunk/libstdc++-v3/ChangeLog
    trunk/libstdc++-v3/include/bits/locale_facets.tcc
    trunk/libstdc++-v3/include/bits/locale_facets_nonio.tcc
    trunk/libstdc++-v3/src/locale_facets.cc

Comment 22 Paolo Carlini 2009-02-15 16:49:57 UTC
Fixed for 4.4.0.