[Bug libstdc++/71500] regex::icase only works on first character in a range

Mon Jun 13 03:23:00 GMT 2016

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71500

--- Comment #9 from Michael Duggan <mwd at md5i dot com> ---
"timshen at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org> writes:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71500
>
> --- Comment #8 from Tim Shen <timshen at gcc dot gnu.org> ---
> (In reply to Michael Duggan from comment #7)
>> Hmm...  Okay.  For the sake of argument, I am going to make the
>> following claims: 
>
> Yeah, thanks for the arguments, we should at least get this clear.
>
>> a) Ignoring case _requires_ a locale.  Without a locale, how do you
>>    determine the case of a character anyway?  Especially if the
>>    character is not a char?  Even std::toupper and std::towuppwer are
>>    bound to the "C" locale.
>> 
>>    As such, you can't say you're not to use the locale when collate is
>>    off.  Otherwise icase wouldn't work ever unless collate was
>>    specified.
>
> Not necessarily. My interpretation of "icase" is whether regex should call
> traits_type::translate or traits_type::translate_nocase when comparing two
> characters. No locale is directly involved - a user-defined regex traits type
> may ignore the locale and inject its own weird stuff into translate_nocase.
> It's just, not surprisingly, std::regex_traits<T> "happens to" use locale to
> implement translate_nocase.

Yeah, I think I finally see what you are getting at.

>> b) You can still use regex_traits::transform in an icase scenario.  The
>>    rule for [A-B], matching X should be:
>> 
>>      transform(a) <= transform(x) <= transform(b)
>> 
>>    In the icase scenario, I would posit that the result of this _or_
>> 
>>      transform(a) <= transform(toggle_case(x)) <= transform(b)
>> 
>>    is correct.
>
> +1.
>
> I looked at boost source, it seems that "collate" is interpreted as whether
> regex should call traits_type::transform when doing range matching:
> https://github.com/boostorg/regex/blob/9059bfb5c6287b0c579bfa4be5160b44c8cc2957/include/boost/regex/v4/perl_matcher.hpp#L198
>
> I don't like the way the documentation put it, but the definition above seems
> accurate and clear. transform/collate seems to be an unrelated concern.
>
>
> With above two claims hold, my point is, the way I can think of to implement
> toggle_case:
>   char_type basic_regex<...>::toggle_case(char_type c) {
>     const auto& ctype =
> std::use_facet<std::ctype<char_type>>(this->_M_traits.getloc());
>     auto lower = ctype.tolower(c);
>     if (lower == c) return ctype.toupper(c);
>     return lower;
>   }
>
> it uses locale's definition of uppercase/lowercase, not translate_nocase's
> definition. This result might surprise a picky user.
>
> There is a correct way to implement this:
>   if (icase)
>     c = traits.translate_nocase(c);
>   else
>     c=  traits.translate(c);
>   for (pair<char_type, char_type> range : regex.range_set)
>     for (char_type e = range.first; e != range.second+1; ++e) {
>       if (icase)
>         e = traits.translate_nocase(e);
>       else
>         e = traits.translate(e);
>       if (collate) {
>         if (traits.transform(e) == traits.transform(c))
>           return true;
>       } else {
>         if (e == c)
>           return true;
>       }
>     }
>   }
>   return false;
>
> But it's terribly inefficient and not practical at all.

I will make two suggestions.  The initial suggestion is simple enough:
Given that regex_traits is mandated to be implemented by the library for
char and wchar_t, it would be reasonable, I think, to make the
specializations that handled these that go ahead and use the locale.
This falls into the "as if" rule, where an implementation is fine as
long as it works "as if" the more correct way was implemented.

The other suggestion is this.  Given that the more correct is "terribly
inefficient and impractical", I posit that you have three options: Use
the "correct", but "bad", implementation, leave the "incorrect in all
cases" implementation in place, or just go ahead and use the locale to
implement toggle_case().  Of these three, I think the "incorrect in all
cases" option is the worst of all worlds.

(It wouldn't be a bad idea to poke the original authors of the regex
functionality to see what they think.  Apologies in advance if you were
one of the authors.)