98723 – On Windows with CP936 encoding, regex compiles very slowly.

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	libstdc++
Version:	10.2.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:	std::regex

Reported:	2021-01-18 10:37 UTC by goughost
Modified:	2023-12-09 13:52 UTC
CC List:	7 users

See Also:	85824 94409
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2021-01-18 00:00:00

Description goughost 2021-01-18 10:37:34 UTC

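Example code:

#include <clocale>   // std::setlocale
#include <iostream>
#include <regex>

int main() {
   // Adopt the user's default locale (CP936 on the affected system).
   std::setlocale(LC_ALL, "");
   // Constructing the regex is the slow step.
   std::regex rgx{"[a-z][a-z][a-z]"};
   std::cerr << rgx.mark_count() << std::endl;
   return 0;
}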

Build and run this in a mingw64 environment (gcc 10.2.0): the program blocks for a long time while compiling the regex.

My findings:

compiling '[a-z]' builds a cache covering all 256 char values;
for each char, std::collate<char>::do_transform() is called;
do_transform() uses the return value of strxfrm() as the size of the buffer to allocate;
on Windows, strxfrm() returns INT_MAX to indicate an error;
when the system encoding is CP936, strxfrm() fails for any char > 0x7f;
thus, compiling '[a-z]' repeatedly allocates very large buffers.
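
The chain of findings above can be checked with a small probe. This is only a sketch: requested_size() and count_suspicious_bytes() are hypothetical helper names, not libstdc++ functions, and treating a result of INT_MAX or more as a failure relies on the Windows behaviour described in this report.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstddef>
#include <cstring>

// Hypothetical helper: ask strxfrm() how large a buffer it wants for the
// one-byte string {c, '\0'} in the current C locale.
std::size_t requested_size(unsigned char c) {
    const char input[2] = {static_cast<char>(c), '\0'};
    errno = 0;
    // With a size of 0 the destination may be null; only the length is reported.
    return std::strxfrm(nullptr, input, 0);
}

// Count the byte values 1..255 whose reported size looks like an error value.
// Assumption: a result >= INT_MAX signals failure, as reported for Windows.
int count_suspicious_bytes() {
    int suspicious = 0;
    for (int c = 1; c < 256; ++c) {
        const std::size_t n = requested_size(static_cast<unsigned char>(c));
        if (n >= static_cast<std::size_t>(INT_MAX))
            ++suspicious;
    }
    return suspicious;
}
```

Calling std::setlocale(LC_ALL, "") before count_suspicious_bytes() on an affected system should show how many of the 256 cache entries would trigger an oversized allocation; in the plain "C" locale the count is expected to be zero.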

Issues:

1. Regex compilation depends on the current locale, through the call to strxfrm(), even when std::regex::collate is not set.

2. The code in bits/locale_classes.tcc should handle the documented error return of strxfrm() on Windows:

         size_t __res = _M_transform(__c, __p, __len); //*** calls strxfrm()
         // If the buffer was not large enough, try again with the
         // correct size.
         if (__res >= __len)
           {
             __len = __res + 1;
             delete [] __c, __c = 0;
             __c = new _CharT[__len];
             __res = _M_transform(__c, __p, __len);
           }

Comment 1 Jonathan Wakely 2021-01-18 11:39:42 UTC

The Windows behaviour fails to conform to the C and C++ standards. I think _M_transform should check errno and throw an exception on error (which means removing the non-throwing exception specification from that function).
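
A rough sketch of what "check errno and throw" could look like, written as a free function rather than as a patch to libstdc++. checked_transform() is a hypothetical name, the INT_MAX test relies on the Windows behaviour described in this report, and the sketch ignores embedded null characters, which the real do_transform() has to handle.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a checked transform; not the libstdc++ code.
std::string checked_transform(const std::string& in) {
    // First call: query the required length without writing anything.
    errno = 0;
    const std::size_t needed = std::strxfrm(nullptr, in.c_str(), 0);
    if (errno != 0 || needed >= static_cast<std::size_t>(INT_MAX))
        throw std::runtime_error("strxfrm failed while sizing the buffer");

    // Second call: transform into a buffer of exactly the reported size.
    std::string out(needed + 1, '\0');
    errno = 0;
    const std::size_t written = std::strxfrm(&out[0], in.c_str(), out.size());
    if (errno != 0 || written >= out.size())
        throw std::runtime_error("strxfrm failed while transforming");

    out.resize(written);  // drop the terminating null
    return out;
}
```

The point of the sketch is that the error value is rejected before it is ever used as an allocation size, so a failing input costs one cheap call instead of a huge buffer.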

Comment 2 goughost 2021-01-18 14:31:31 UTC

That may be acceptable for issue 2.
But additional fixes are needed; otherwise, users cannot use regex after calling setlocale(LC_ALL,"") in such a situation.
Can the regex compiler work without calling _M_transform (at least when std::regex::collate is not set)?
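
Until the library changes, one possible user-side mitigation is to compile the regex while LC_COLLATE is temporarily set to "C". This is an untested assumption, not something confirmed in this report: it presumes that the slow path goes through strxfrm() and that strxfrm() does not fail under "C" collation. ScopedCCollate and compile_with_c_collation() are hypothetical names.

```cpp
#include <cassert>
#include <clocale>
#include <regex>
#include <string>

// Hypothetical RAII helper: switches LC_COLLATE to "C" and restores the
// previous setting on destruction. setlocale() is process-wide, so this is
// not safe if other threads use locale-dependent functions concurrently.
class ScopedCCollate {
public:
    ScopedCCollate() {
        // Copy the name first: a later setlocale() call may overwrite it.
        if (const char* current = std::setlocale(LC_COLLATE, nullptr))
            saved_ = current;
        std::setlocale(LC_COLLATE, "C");
    }
    ~ScopedCCollate() {
        if (!saved_.empty())
            std::setlocale(LC_COLLATE, saved_.c_str());
    }
    ScopedCCollate(const ScopedCCollate&) = delete;
    ScopedCCollate& operator=(const ScopedCCollate&) = delete;

private:
    std::string saved_;
};

// Compile a pattern while "C" collation is in effect.
std::regex compile_with_c_collation(const std::string& pattern) {
    ScopedCCollate guard;
    return std::regex(pattern);
}
```

Only the collation category is touched, so the other categories selected by setlocale(LC_ALL, "") stay as they were.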

On the other hand, maybe the error condition can be handled by the regex compiler code.
To some extent, the bug is in the regex compiler.
Building the cache entry for '\xee' calls strxfrm() with "\xee\x00", which is not a valid string if the current encoding is UTF-8.
Also, on GNU/Linux, the strings returned by such (successful) calls might not help in building the cache.

Examples of calling strxfrm() on GNU/Linux with various locales.
(Note that in the cases where Windows fails, Linux gives trivial results.)

// C
input 61 00, errno 0, res 1, outbuf:  61
input 62 00, errno 0, res 1, outbuf:  62
input aa 00, errno 0, res 1, outbuf:  aa
input bb 00, errno 0, res 1, outbuf:  bb

// C.UTF-8
input 61 00, errno 0, res 1, outbuf:  63
input 62 00, errno 0, res 1, outbuf:  64
input aa 00, errno 0, res 1, outbuf:  03
input bb 00, errno 0, res 1, outbuf:  03

// en_US.UTF-8
input 61 00, errno 0, res 10, outbuf:  51 01 02 01 02 01 00 00 00 00
input 62 00, errno 0, res 10, outbuf:  5e 01 02 01 02 01 00 00 00 00
input aa 00, errno 0, res 5, outbuf:  01 01 01 01 03
input bb 00, errno 0, res 5, outbuf:  01 01 01 01 03

// zh_CN.GB2312 
input 61 00, errno 0, res 11, outbuf:  e1 a9 bd 01 02 01 02 01 00 00 61
input 62 00, errno 0, res 11, outbuf:  e1 a9 be 01 02 01 02 01 00 00 62
input aa 00, errno 0, res 5, outbuf:  01 01 01 01 03
input bb 00, errno 0, res 5, outbuf:  01 01 01 01 03
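
The program that produced the listings above was not posted. The following is a guess at a helper that prints one line in the same format; dump_strxfrm() is a hypothetical name and the 64-byte output buffer is an arbitrary choice.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: run strxfrm() on the one-byte string {c, '\0'} in the
// current C locale and format the outcome like the listings above.
std::string dump_strxfrm(unsigned char c) {
    const char input[2] = {static_cast<char>(c), '\0'};
    char outbuf[64] = {};
    errno = 0;
    const std::size_t res = std::strxfrm(outbuf, input, sizeof outbuf);
    const int err = errno;

    char head[64];
    std::snprintf(head, sizeof head, "input %02x 00, errno %d, res %zu, outbuf: ",
                  static_cast<unsigned>(c), err, res);
    std::string line = head;

    // Print the transformed bytes only when the result fit into the buffer;
    // an error value such as INT_MAX leaves the byte list empty.
    if (res < sizeof outbuf) {
        for (std::size_t i = 0; i < res; ++i) {
            char hex[8];
            std::snprintf(hex, sizeof hex, " %02x",
                          static_cast<unsigned>(static_cast<unsigned char>(outbuf[i])));
            line += hex;
        }
    }
    return line;
}
```

Selecting a locale with std::setlocale(LC_ALL, "zh_CN.GB2312") or similar before calling the helper for 0x61, 0x62, 0xaa and 0xbb would be the way to compare against the tables.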

Comment 3 cqwrteur 2021-01-24 19:17:05 UTC

This should be reported as a CVE.

Comment 4 Eric Gallager 2021-11-26 13:34:28 UTC

This is affecting The Battle for Wesnoth: https://github.com/wesnoth/wesnoth/issues/6291

Comment 5 cqwrteur 2021-11-26 16:40:20 UTC

(In reply to Eric Gallager from comment #4)
> This is affecting The Battle for Wesnoth:
> https://github.com/wesnoth/wesnoth/issues/6291

C++ std::regex is just terrible and is highly likely to be deprecated in a future standard. I think you had better switch to some third-party implementation.

Comment 6 Jeroen Ooms 2022-05-29 11:13:57 UTC

This bug has become more problematic because it also affects any program running under recent versions of Windows UCRT in a UTF-8 locale [1], and therefore all users of the R programming language.

The only solution right now seems to be to avoid std::regex, e.g.: https://github.com/tesseract-ocr/tesseract/issues/3830


[1] https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
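
For a pattern as simple as the one in the description, avoiding std::regex can mean a few lines of hand-written matching. A minimal sketch, assuming an ASCII-compatible execution character set and whole-string matching; is_ascii_lower() and matches_three_lowercase() are hypothetical names.

```cpp
#include <cassert>
#include <string>

// True for the ASCII letters 'a'..'z' only; deliberately locale-independent.
bool is_ascii_lower(char c) {
    return c >= 'a' && c <= 'z';
}

// Hand-written counterpart of regex_match() with "[a-z][a-z][a-z]".
bool matches_three_lowercase(const std::string& s) {
    return s.size() == 3 && is_ascii_lower(s[0]) && is_ascii_lower(s[1]) &&
           is_ascii_lower(s[2]);
}
```

Unlike a locale-aware bracket expression, this never consults the locale, so it cannot reach strxfrm() at all.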

Comment 7 cqwrteur 2022-05-30 16:24:05 UTC

Well, the right solution is to write the regex yourself. C++ regex might be deprecated in the future.

Comment 8 Luca Bacci 2023-12-09 13:52:01 UTC

(In reply to Jonathan Wakely from comment #1)
> The Windows behaviour fails to conform to the C and C++ standards. I think
> _M_transform should check errno and throw an exception on error (which means
> removing the non-throwing exception specification from that function).

Hi Jonathan! I'm giving it a go, but I have one question: which encoding are the strings passed to _M_transform() / _M_compare() (in libstdc++-v3/config/locale/generic/collate_members.cc) in? Is it the execution character set, or is it always UTF-8?

I am asking because we have to convert to UTF-16 and call wcsxfrm().

Many thanks,
Luca
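
The "convert to wide characters, then call wcsxfrm()" idea can be sketched with standard functions only. The sketch assumes the input is in the multibyte encoding of the current C locale, which is exactly the open question above, and it treats INT_MAX as an error value on the assumption that wcsxfrm() behaves like strxfrm() in that respect on Windows. to_wide() and wide_sort_key() are hypothetical names. On Windows, wchar_t holds UTF-16 code units.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdlib>
#include <cwchar>
#include <stdexcept>
#include <string>

// Convert a multibyte string (current C locale encoding) to wide characters.
std::wstring to_wide(const std::string& in) {
    // A string never converts to more wide characters than it has bytes.
    std::wstring wide(in.size() + 1, L'\0');
    const std::size_t n = std::mbstowcs(&wide[0], in.c_str(), wide.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");
    wide.resize(n);
    return wide;
}

// Build a collation key from the wide form of the input.
std::wstring wide_sort_key(const std::string& in) {
    const std::wstring wide = to_wide(in);
    // With a size of 0 the destination may be null; only the length is reported.
    const std::size_t needed = std::wcsxfrm(nullptr, wide.c_str(), 0);
    if (needed >= static_cast<std::size_t>(INT_MAX))
        throw std::runtime_error("wcsxfrm failed while sizing the buffer");

    std::wstring key(needed + 1, L'\0');
    const std::size_t written = std::wcsxfrm(&key[0], wide.c_str(), key.size());
    if (written >= key.size())
        throw std::runtime_error("wcsxfrm failed while transforming");
    key.resize(written);
    return key;
}
```

Invalid input is rejected at the conversion step, before any collation call, which is where a single byte such as "\xee" would be caught in a UTF-8 locale.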