[Bug libstdc++/63776] [C++11] Regex collate matching not working

redi at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Wed Oct 3 10:48:00 GMT 2018


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63776

Jonathan Wakely <redi at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |INVALID

--- Comment #11 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Tim Shen from comment #8)
> I don't think std::regex_match<BiIter, Alloc, char, RegexTraits> should care
> about decoding a char string to wchar_t string and call
> std::regex_match<AnotherBiIter, AnotherAlloc, wchar_t,
> std::regex_traits<wchar_t>>, leaving user defined RegexTraits potentially
> unused.

I agree.

> Instead, user can manually decode the utf-8 string (I'm sad we don't have a
> standard char iterator adaptor which converts a utf-8 char iterator to
> char32_t iterator) and call std::regex_match<..., wchar_t, ...>.

Agreed.

> These are my understanding, so it's surely possible that I may miss
> something.
> 
> Thoughts?

Having looked through this again, I think you're right.

So this reduced test case is not expected to pass:

#include <regex>
#include <cassert>

int main()
{
  std::locale::global(std::locale("en_US.UTF-8"));
  std::string s = "joão méroço";
  std::regex r{"[[:alpha:]]{4} [[:alpha:]]{6}"};
  assert( regex_match(s, r) );
}

But this is (assuming wchar_t uses a Unicode encoding):

#include <regex>
#include <cassert>

int main()
{
  std::locale::global(std::locale("en_US.UTF-8"));
  std::wstring s = L"joão méroço";
  std::wregex r{L"[[:alpha:]]{4} [[:alpha:]]{6}"};
  assert( regex_match(s, r) );
}
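
For input that arrives as a UTF-8 std::string, the manual decoding step discussed above could look roughly like the sketch below. It is only an illustration: decode_utf8 and match_utf8 are hypothetical helper names, not standard library facilities, and the code assumes a 32-bit wchar_t holding Unicode code points (as on GNU/Linux) and well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper (not a standard facility): decode UTF-8 into one
// wchar_t per code point. Assumes well-formed input; overlong or invalid
// sequences are not detected.
std::wstring decode_utf8(const std::string& in)
{
  static_assert(sizeof(wchar_t) >= 4, "sketch assumes a 32-bit wchar_t");
  std::wstring out;
  for (std::size_t i = 0; i < in.size(); )
  {
    const unsigned char lead = static_cast<unsigned char>(in[i]);
    // Number of continuation bytes implied by the lead byte.
    const std::size_t extra =
      lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
    assert(i + extra < in.size() && "truncated UTF-8 sequence");
    // Payload bits of the lead byte, then 6 bits per continuation byte.
    unsigned long cp = extra == 0 ? lead : lead & (0x3F >> extra);
    for (std::size_t k = 1; k <= extra; ++k)
      cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
    out.push_back(static_cast<wchar_t>(cp));
    i += extra + 1;
  }
  return out;
}

// Match UTF-8 text against a wide regex by decoding first, so that one
// regex "character" corresponds to one code point rather than one byte.
bool match_utf8(const std::string& utf8, const std::wregex& re)
{
  const std::wstring wide = decode_utf8(utf8);
  return std::regex_match(wide, re);
}
```

With these helpers, match_utf8(s, std::wregex(L".{4} .{6}")) matches the UTF-8 bytes of "joão méroço", whereas the same pattern as a narrow std::regex does not, because the narrow regex sees 5 and 8 bytes rather than 4 and 6 characters. Whether [[:alpha:]] accepts the decoded characters still depends on the locale, and std::regex_traits picks up the global locale when the regex is constructed, so the locale would need to be set first.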

