Bug 86164 - std::regex crashes when matching long lines
Summary: std::regex crashes when matching long lines
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++ (show other bugs)
Version: 7.1.0
: P3 normal
Target Milestone: ---
Assignee: Patrick Palka
URL:
Keywords:
: 84738 84865 86163 86165 93502 (view as bug list)
Depends on:
Blocks: std::regex
  Show dependency treegraph
 
Reported: 2018-06-15 10:48 UTC by Holger Seelig
Modified: 2023-04-09 16:02 UTC (History)
9 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-01-22 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Holger Seelig 2018-06-15 10:48:15 UTC
std::regex crashes when matching long lines.

Here is an example:


#include <regex>
#include <iostream>
 
int main()
{
    std::string s (100'000, '*');
    std::smatch m;
    std::regex r ("^(.*?)$");

    std::regex_search (s, m, r);

    std::cout << s .substr (0, 10) << std::endl;
    std::cout << m  .str (1) .substr (0, 10) << std::endl;
}


It turns out that std::regex_search operator .* is implemented recursively which result in this example in a stack overflow.
Comment 1 ktkachov 2018-06-15 11:12:42 UTC
*** Bug 86163 has been marked as a duplicate of this bug. ***
Comment 2 ktkachov 2018-06-15 11:13:53 UTC
*** Bug 86165 has been marked as a duplicate of this bug. ***
Comment 3 Vadim Zeitlin 2018-11-13 23:46:30 UTC
BTW, this is unrelated to using grouping in the regex, searching for something as simple as "A.*B" also crashes for input longer than ~27KiB on Linux amd64 with g++ 8.2.0. This makes std::regex simply unusable.
Comment 4 Jonathan Wakely 2018-11-14 00:08:18 UTC
(In reply to Vadim Zeitlin from comment #3)
> This makes std::regex simply unusable.

Yes, because there are no uses with inputs below 27KiB.
Comment 5 Vadim Zeitlin 2018-11-14 00:12:49 UTC
I obviously meant that it makes it unusable in my use case when I can't guarantee that the input is bounded by this (smallish) size.
Comment 6 Nils Gladitz 2018-12-05 07:33:02 UTC
I think I am hitting this issue somewhat earlier on an ARM system with a more limited stack size.

Was able to reproduce it on Desktop x86_64 Linux with e.g.:

#include <regex>
  
int main()
{
    std::regex_match(
        std::string(2000, 'a'),
        std::regex(".*")
    );
}

$ ulimit -s 256 # 256kb stack; which is what have by default on the ARM system
$ g++ test.cpp -o regex_test
$ ./regex_test
Segmentation fault (core dumped)
Comment 7 Giuliano Belinassi 2019-03-29 12:48:29 UTC
It seems that the issue is the backtracking required by the NFA, as it enters in a deep recursion when calling _M_dfs in _M_main_dispatch (regex_executor.tcc).

Maybe moving the DFS stack from the recursion stack to the heap and use an iterative DFS could fix this, but converting the NFA to DFA may be a better choice, as it removes the backtracking requirement when iterating with the string.
Comment 8 Jonathan Wakely 2019-03-29 20:24:25 UTC
I started working on a patch to replace the recursion with iteration, but didn't get it working yet.
Comment 9 Boris Kolpackov 2021-09-14 07:33:10 UTC
Any progress on this?

I get the segfault (due to stack overflow) with the following trivial regex:

  regex re ("#+",);
  regex_search (string (32 * 1024, '#'), re);

In comparison, MSVC's implementation crashes on much larger input (in the above test it is still able to match 4MB string) while libc++ doesn't seem to have any stack-related limits (I was able to match 40MB).

I see two issues here:

1. It would have been nice if implementation-related limits were reported with an exception rather than a crash.

2. The limits seem to be really low, both practically (matching 32KB doesn't feel unreasonable) and compared to other implementations.
Comment 10 Jonathan Wakely 2021-09-22 09:08:33 UTC
*** Bug 84865 has been marked as a duplicate of this bug. ***
Comment 11 Jonathan Wakely 2021-12-16 00:05:36 UTC
*** Bug 93502 has been marked as a duplicate of this bug. ***
Comment 12 Jonathan Wakely 2021-12-16 00:08:03 UTC
*** Bug 84738 has been marked as a duplicate of this bug. ***
Comment 13 Maarten L. Hekkelman 2022-08-18 17:11:08 UTC
Too bad this bug has still not been dealt with. And it is even worse that simply running out of stack space seems to be acceptable. And no, I'm not using inputs in the form of 27kB, more like just a few hundred characters at most with quite complex expressions.

Fortunately, it is now very easy to use the boost::regex as a standalone library as a replacement. But alas, that's still a dependency.
Comment 14 Jonathan Wakely 2022-08-22 11:06:11 UTC
Running out of stack space is not acceptable, that's why this is considered a bug. As already stated in comment 8, I started work on fixing it, but the rewritten code had bugs that I haven't had time to resolve yet.
Comment 15 Nadav Har'El 2023-04-09 16:02:58 UTC
More than 5 years later, more and more projects are discovering this bug the hard way, and moving from std::regex to boost::regex which doesn't have this bug - boost::regex defaults to BOOST_REGEX_NON_RECURSIVE mode, which uses a stack on the heap instead of recursion (but I don't know if the specific examples shown the various duplicates all need this stack in practice, for example it's unfortunate if matching " *" needs to copy the entire input string in a stack). The latest example of this exodus is https://github.com/scylladb/scylladb/pull/13452. 
So I think it's about time this issue is solved. Maybe even the Boost implementation can studied for inspiration and implementation ideas?