[Bug c++/93502] New: std::regex_match uses stack space proportional to input string

nyh at math dot technion.ac.il gcc-bugzilla@gcc.gnu.org
Wed Jan 29 21:21:00 GMT 2020


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93502

            Bug ID: 93502
           Summary: std::regex_match uses stack space proportional to
                    input string
           Product: gcc
           Version: 9.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nyh at math dot technion.ac.il
  Target Milestone: ---

Created attachment 47736
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47736&action=edit
Simple test code demonstrating the bug

As we all know from Automata Theory course, the finite automaton used to match
a regular expression takes a constant amount of memory, unrelated to the size
of the input string.

The attached test program creates a string with N (the first command-line
parameter) spaces, and then checks whether it matches the trivial regular
expression " *" using std::regex_match(). Obviously it should match, and take
constant amount of memory unrelated to the input string length (N).

But it turns out that std::regex_match() does take more memory on the stack the
bigger N is (the longer the input string is). For example:

$ c++ -fsanitize=address z.cc
$ ./a.out 10000
Matched!
$ ./a.out 100000
...
SUMMARY: AddressSanitizer: stack-overflow (/tmp/a.out+0x42a88b) in
std::function<bool (char)>::operator()(char) const

We have seen this problem in a real application - which crashed with a stack
overflow because it tried to match a simple regular expression to a long input
string. The attached program is of course only a simplified example.

Trying to figure out where this excessive stack use comes from, I discovered
that the boolean std::regex_match() actually calls a different overload, which
returns not just the boolean, but also the match itself:

regex_match(_Bi_iter __first, _Bi_iter __last,
                const basic_regex<_Ch_type, _Rx_traits>& __re,
                regex_constants::match_flag_type __flags
                = regex_constants::match_default)
    {
      match_results<_Bi_iter> __what;
      return regex_match(__first, __last, __what, __re, __flags);
    }

However, I'm not sure if or why this explains the excessive stack suage -
shouldn't match_results contain only the beginning and end pointers of the
match - and not a full copy of the match?

But whether or not this implementation explains the bug, it's definitely a bug.
Even if for some reason (?) the regex_match overload with match_results cannot
be implemented with constant memory usage, the one I am complaining about - the
one which does not need to return the match results, just a boolean -
definitely shouldn't take more stack memory the bigger the input string is.


More information about the Gcc-bugs mailing list