std::regex crashes when matching long lines. Here is an example:
    #include <iostream>
    #include <regex>
    #include <string>
    int main()
    {
        std::string s(100'000, '*');
        std::smatch m;
        std::regex r("^(.*?)$");
        std::regex_search(s, m, r);
        std::cout << s.substr(0, 10) << std::endl;
        std::cout << m.str(1).substr(0, 10) << std::endl;
    }
It turns out that the .* operator in std::regex_search is implemented recursively, which results in a stack overflow in this example.
*** Bug 86163 has been marked as a duplicate of this bug. ***
*** Bug 86165 has been marked as a duplicate of this bug. ***
BTW, this is unrelated to using grouping in the regex: searching for something as simple as "A.*B" also crashes for input longer than ~27KiB on Linux amd64 with g++ 8.2.0. This makes std::regex simply unusable.
(In reply to Vadim Zeitlin from comment #3) > This makes std::regex simply unusable. Yes, because there are no uses with inputs below 27KiB.
I obviously meant that it makes it unusable in my use case when I can't guarantee that the input is bounded by this (smallish) size.
I think I am hitting this issue somewhat earlier on an ARM system with a more limited stack size. I was able to reproduce it on desktop x86_64 Linux with e.g.:
    #include <regex>
    #include <string>
    int main()
    {
        std::regex_match(std::string(2000, 'a'), std::regex(".*"));
    }
    $ ulimit -s 256   # 256 KB stack, which is what I have by default on the ARM system
    $ g++ test.cpp -o regex_test
    $ ./regex_test
    Segmentation fault (core dumped)
It seems that the issue is the backtracking required by the NFA: the executor enters deep recursion when calling _M_dfs from _M_main_dispatch (regex_executor.tcc). Moving the DFS stack from the call stack to the heap and using an iterative DFS could fix this, but converting the NFA to a DFA may be a better choice, as it removes the need for backtracking while iterating over the string.
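To illustrate the first idea (keeping the DFS stack on the heap), here is a minimal sketch of an iterative backtracking matcher. It is a toy and not libstdc++ code: the function name toy_match and its tiny pattern language (literal characters, '.', and a postfix '*') are assumptions made only for this example, and it matches the whole input the way std::regex_match does.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy matcher, NOT the libstdc++ executor.
// Supported syntax: literal characters, '.' (any character) and a
// postfix '*' (greedy, zero or more of the preceding atom).
// Returns true if the whole of `text` matches `pattern`.
bool toy_match(const std::string& pattern, const std::string& text)
{
    // One pending DFS state: position in the pattern and in the text.
    struct State {
        std::size_t p;
        std::size_t t;
    };

    // The explicit stack lives on the heap, so the backtracking depth is
    // limited by available memory rather than by the call stack.
    std::vector<State> pending;
    pending.push_back(State{0, 0});

    while (!pending.empty()) {
        const State s = pending.back();
        pending.pop_back();

        if (s.p == pattern.size()) {
            if (s.t == text.size())
                return true;   // pattern and text both consumed
            continue;          // dead end, backtrack
        }

        const bool starred =
            s.p + 1 < pattern.size() && pattern[s.p + 1] == '*';
        const bool char_ok =
            s.t < text.size() &&
            (pattern[s.p] == '.' || pattern[s.p] == text[s.t]);

        if (starred) {
            // Pushed first, so tried last: skip the starred atom.
            pending.push_back(State{s.p + 2, s.t});
            // Pushed last, so tried first (greedy): consume one character
            // and stay on the same starred atom.
            if (char_ok)
                pending.push_back(State{s.p, s.t + 1});
        } else if (char_ok) {
            pending.push_back(State{s.p + 1, s.t + 1});
        }
    }
    return false;  // every alternative was explored without success
}
```

With this structure, a call such as toy_match(".*", std::string(1000000, 'a')) only grows the vector instead of nesting function calls. The sketch still backtracks, so its running time can be poor for some patterns; it only removes the dependence on the call stack depth.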
I started working on a patch to replace the recursion with iteration, but didn't get it working yet.
Any progress on this? I get the segfault (due to stack overflow) with the following trivial regex:
    regex re ("#+");
    regex_search (string (32 * 1024, '#'), re);
In comparison, MSVC's implementation crashes only on much larger input (in the above test it is still able to match a 4MB string), while libc++ doesn't seem to have any stack-related limits (I was able to match 40MB). I see two issues here:
1. It would have been nice if implementation-related limits were reported with an exception rather than a crash.
2. The limits seem to be really low, both practically (matching 32KB doesn't feel unreasonable) and compared to other implementations.
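Until the library reports such limits itself, point 1 can be approximated on the caller's side. The following is only a sketch of a workaround: guarded_regex_search and its max_len parameter are hypothetical names introduced here, and a safe value for max_len depends on the stack size and the pattern, so it has to be determined empirically. std::regex_error and std::regex_constants::error_stack are standard.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper, not part of any standard library.
// Refuses inputs longer than `max_len` by throwing std::regex_error
// instead of letting the recursive matcher overflow the stack.
// `max_len` is an application-chosen bound (an assumption of this sketch).
bool guarded_regex_search(const std::string& input,
                          const std::regex& re,
                          std::size_t max_len)
{
    if (input.size() > max_len) {
        // error_stack: insufficient memory to determine whether the
        // expression could match the given character sequence.
        throw std::regex_error(std::regex_constants::error_stack);
    }
    return std::regex_search(input, re);
}
```

A caller can then catch std::regex_error and fall back to another strategy (for example, splitting the input) rather than crashing. The length check is a crude proxy, since the actual recursion depth also depends on the pattern.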
*** Bug 84865 has been marked as a duplicate of this bug. ***
*** Bug 93502 has been marked as a duplicate of this bug. ***
*** Bug 84738 has been marked as a duplicate of this bug. ***
Too bad this bug has still not been dealt with. And it is even worse that simply running out of stack space seems to be acceptable. And no, my inputs are nowhere near 27kB; they are a few hundred characters at most, matched against quite complex expressions. Fortunately, it is now very easy to use boost::regex as a standalone library as a replacement. But alas, that's still a dependency.
Running out of stack space is not acceptable, that's why this is considered a bug. As already stated in comment 8, I started work on fixing it, but the rewritten code had bugs that I haven't had time to resolve yet.