Bug 61582 - C++11 regex memory corruption
Summary: C++11 regex memory corruption
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++ (show other bugs)
Version: unknown
: P3 normal
Target Milestone: ---
Assignee: Jonathan Wakely
URL:
Keywords:
: 66456 67212 68688 70411 70459 (view as bug list)
Depends on:
Blocks: std::regex
  Show dependency treegraph
 
Reported: 2014-06-23 00:05 UTC by Maksymilian Arciemowicz
Modified: 2021-12-16 23:40 UTC (History)
6 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-12-16 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Maksymilian Arciemowicz 2014-06-23 00:05:26 UTC
Hi,

Tested on GCC 4.8.1

----------
#include <regex>

using namespace std;

int main (int argc, char *argv[])
{
      regex r(argv[1]);
      return 0;
}
----------

# g++ c11RE.cpp -o c11RE -std=c++11 -Wall 
# ./c11RE '.*'
# ./c11RE '(|'
Segmentation fault (core dumped)
# ./c11RE '((x|'
*** Error in `./c11RE': malloc(): memory corruption: 0x00007fffa0cb8670 ***

Expected (regex_error):
# ./c11RE '(x}' 
terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error
Aborted (core dumped)

------------
(gdb) r '(|'
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/cx/c11RE '(|'

Program received signal SIGSEGV, Segmentation fault.
0x00000000004022cc in std::__detail::_StateSeq::_StateSeq(std::__detail::_StateSeq const&) ()
(gdb) bt
#0  0x00000000004022cc in std::__detail::_StateSeq::_StateSeq(std::__detail::_StateSeq const&) ()
#1  0x0000000000404a05 in std::__detail::_Compiler<char const*, std::regex_traits<char> >::_M_disjunction() ()
#2  0x0000000000407901 in std::__detail::_Compiler<char const*, std::regex_traits<char> >::_M_atom() ()
#3  0x00000000004069cb in std::__detail::_Compiler<char const*, std::regex_traits<char> >::_M_term() ()
#4  0x000000000040567e in std::__detail::_Compiler<char const*, std::regex_traits<char> >::_M_alternative() ()
#5  0x00000000004049c8 in std::__detail::_Compiler<char const*, std::regex_traits<char> >::_M_disjunction() ()
#6  0x0000000000403ef2 in std::__detail::_Compiler<char const*, std::regex_traits<char> >::_Compiler(char const* const&, char const* const&, std::regex_traits<char>&, unsigned int) ()
#7  0x0000000000403297 in std::shared_ptr<std::__detail::_Automaton> std::__detail::__compile<char const*, std::regex_traits<char> >(char const* const&, char const* const&, std::regex_traits<char>&, unsigned int) ()
#8  0x0000000000402abb in std::basic_regex<char, std::regex_traits<char> >::basic_regex(char const*, unsigned int) ()
#9  0x0000000000401767 in main ()
(gdb) x/i $rip
=> 0x4022cc <_ZNSt8__detail9_StateSeqC2ERKS0_+16>:	mov    (%rax),%rdx
(gdb) x/x $rax
0xffffffffffffffe8:	Cannot access memory at address 0xffffffffffffffe8
(gdb) x/x $rdx
0xffffffffffffffe8:	Cannot access memory at address 0xffffffffffffffe8
------------

BR,
Maksymilian
http://cxsecurity.com/
Comment 1 Jonathan Wakely 2014-06-23 08:13:44 UTC
*sigh* <regex> is not implemented prior to GCC 4.9.0, I thought the whole world was aware of that by now.
Comment 2 Maksymilian Arciemowicz 2014-06-24 19:37:40 UTC
Sorry for mistake.
Could you check this again ?

cx@cx:~/REstd11/kozak5$ ~/gcc49/bin/g++ -v
Using built-in specs.
COLLECT_GCC=/home/cx/gcc49/bin/g++
COLLECT_LTO_WRAPPER=/home/cx/gcc49/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: /home/cx/gcc49/source/gcc-4.9.0/configure --disable-multilib --prefix=/home/cx/gcc49
Thread model: posix
gcc version 4.9.0 (GCC) 
cx@cx:~/REstd11/kozak5$ cat c11re.c
#include <iostream>
#include <string>
#include <regex>

using namespace std;

int main (int argc, char *argv[])
{
	if (std::regex_match ("GNUj", std::regex(argv[1]) ))
		std::cout << "ELO\n";
	 return 0;
}
cx@cx:~/REstd11/kozak5$ ~/gcc49/bin/g++ -o c11re c11re.c -std=c++11
cx@cx:~/REstd11/kozak5$ ./c11re '((x|'
terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error
Przerwane (core dumped)
cx@cx:~/REstd11/kozak5$ ./c11re '((.*)()?*{100})'
Naruszenie ochrony pamięci (core dumped)
cx@cx:~/REstd11/kozak5$

(gdb) r '((.*)()?*{100})'
Starting program: /home/cx/REstd11/kozak5/./c11re '((.*)()?*{100})'

Program received signal SIGSEGV, Segmentation fault.
0x0000000000402f15 in std::_Bit_reference::operator bool() const
    ()
(gdb) x/i $rip
=> 0x402f15 <_ZNKSt14_Bit_referencecvbEv+15>:	
    mov    (%rax),%rdx
(gdb) i r
rax            0x200000000063a128	2305843009220223272
rbx            0xffffffffffffffff	-1
rcx            0x200000000063a128	2305843009220223272
rdx            0x8000000000000000	-9223372036854775808
rsi            0x200000000063a128	2305843009220223272
rdi            0x7fffffffd350	140737488343888
rbp            0x7fffffffd310	0x7fffffffd310
rsp            0x7fffffffd310	0x7fffffffd310
r8             0x2	2
r9             0x20	32
r10            0x3	3
r11            0x7ffff75b5798	140737343346584
r12            0x402880	4204672
r13            0x7fffffffe260	140737488347744
r14            0x0	0
r15            0x0	0
=> 0x402f15 <_ZNKSt14_Bit_referencecvbEv+15>:
rip            0x402f15	0x402f15 <std::_Bit_reference::operator bool() const+15>

...

#0  0x0000000000402f15 in std::_Bit_reference::operator bool() const ()
#1  0x000000000040a1bc in void std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, false>::_M_dfs<true>(long) ()
#2  0x000000000040a275 in void std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, false>::_M_dfs<true>(long) ()
#3  0x000000000040a493 in void std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, false>::_M_dfs<true>(long) ()
#4  0x000000000040a28f in void std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, false>::_M_dfs<true>(long) ()
#5  0x000000000040a3a5 in void std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, false>::_M_dfs<true>(long) ()
#6  0x000000000040a3a5 in void std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits---Type <return> to continue, or q <return> to quit---
<char>, false>::_M_dfs<true>(long) ()
#7  0x000000000040a3a5 in void std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, false>::_M_dfs<true>(long) ()
#8  0x0000000000407ee0 in bool std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, false>::_M_main<true>() ()
#9  0x0000000000406172 in std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, false>::_M_match() ()
#10 0x0000000000404cf5 in bool std::__detail::__regex_algo_impl<char const*, std::allocator<std::sub_match<char const*> >, char, std::regex_traits<char>, (std::__detail::_RegexExecutorPolicy)0, true>(char const*, char const*, std::match_results<char const*, std::allocator<std::sub_match<char const*> > >&, std::basic_regex<char, std::regex_traits<char> > const&, std::regex_constants::match_flag_type) ()
#11 0x000000000040449e in bool std::regex_match<char const*, std::allocator<std::sub_match<char const*> >, char, std::regex_traits<c---Type <return> to continue, or q <return> to quit---
har> >(char const*, char const*, std::match_results<char const*, std::allocator<std::sub_match<char const*> > >&, std::basic_regex<char, std::regex_traits<char> > const&, std::regex_constants::match_flag_type) ()
#12 0x000000000040405c in bool std::regex_match<char const*, char, std::regex_traits<char> >(char const*, char const*, std::basic_regex<char, std::regex_traits<char> > const&, std::regex_constants::match_flag_type) ()
#13 0x0000000000403d4c in bool std::regex_match<char, std::regex_traits<char> >(char const*, std::basic_regex<char, std::regex_traits<char> > const&, std::regex_constants::match_flag_type) ()
#14 0x0000000000402a5f in main ()
Comment 3 Jonathan Wakely 2014-06-25 09:54:19 UTC
(In reply to Maksymilian A from comment #2)
> cx@cx:~/REstd11/kozak5$ ./c11re '((x|'
> terminate called after throwing an instance of 'std::regex_error'
>   what():  regex_error
> Przerwane (core dumped)

I think this is by design.

> cx@cx:~/REstd11/kozak5$ ./c11re '((.*)()?*{100})'
> Naruszenie ochrony pamięci (core dumped)

That's a bug.

(It would be helpful if you didn't put C11 in the subject, this has nothing to do with C)
Comment 4 Jonathan Wakely 2014-06-25 18:01:15 UTC
That segfault is already fixed on trunk, although possibly just latent
Comment 5 Maksymilian Arciemowicz 2014-06-25 19:15:39 UTC
Thanks for feedback. I'm going verify this on trunk
Comment 6 Maksymilian Arciemowicz 2014-06-25 23:31:34 UTC
@Jonathan: true but check this case

cx@cx:~/REtrunk/kozak5$ ~/gccTRUNK/bin/g++ -v
Using built-in specs.
COLLECT_GCC=/home/cx/gccTRUNK/bin/g++
COLLECT_LTO_WRAPPER=/home/cx/gccTRUNK/libexec/gcc/x86_64-unknown-linux-gnu/4.10.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../trunk/configure --prefix=/home/cx/gccTRUNK/ --disable-multilib
Thread model: posix
gcc version 4.10.0 20140625 (experimental) (GCC) 
cx@cx:~/REtrunk/kozak5$ ~/gccTRUNK/bin/g++ c11re.c -o c11re -std=c++11
cx@cx:~/REtrunk/kozak5$ ./c11re '(.*{100}{100}{100})'
Naruszenie ochrony pamięci (core dumped)

Program received signal SIGSEGV, Segmentation fault.
0x000000000041014e in std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, true>::_State_info<std::integral_constant<bool, true>, std::vector<std::sub_match<char const*>, std::allocator<std::sub_match<char const*> > > >::_M_visited(long) const ()

BR,
Maksymilian
http://cxsecurity.com/
Comment 7 Tim Shen 2014-06-26 06:14:59 UTC
"(.*{100}{100}{100})" seems to be a stack overflow. It's because regex executor uses recursion. It could be fixed (not segfault but memory exhaustion) by using a std::stack and simulate recursion; IMH, however, directly throwing regex_error::error_space is the right thing here to do.
Comment 8 Maksymilian Arciemowicz 2014-06-26 07:11:19 UTC
(In reply to Tim Shen from comment #7)
> "(.*{100}{100}{100})" seems to be a stack overflow. It's because regex
> executor uses recursion. It could be fixed (not segfault but memory
> exhaustion) by using a std::stack and simulate recursion; IMH, however,
> directly throwing regex_error::error_space is the right thing here to do.

Yeap it's stack overflow. Why regex_error::error_space? Not better regex_error::error_stack?
Comment 9 Tim Shen 2014-06-26 07:16:59 UTC
(In reply to Maksymilian Arciemowicz from comment #8)
> (In reply to Tim Shen from comment #7)
> > "(.*{100}{100}{100})" seems to be a stack overflow. It's because regex
> > executor uses recursion. It could be fixed (not segfault but memory
> > exhaustion) by using a std::stack and simulate recursion; IMH, however,
> > directly throwing regex_error::error_space is the right thing here to do.
> 
> Yeap it's stack overflow. Why regex_error::error_space? Not better
> regex_error::error_stack?

Sorry for not clarify that: I prefer throwing error_space when constructing (complaining about too many states) instead of throwing error_stack when matching. To solve the latter problem, as I said, we can use a std::stack or something to avoid a stack overflow.
Comment 10 Maksymilian Arciemowicz 2014-06-26 07:59:32 UTC
There is also one other alternative like this

http://cvsweb.netbsd.org/bsdweb.cgi/src/lib/libc/regex/regcomp.c.diff?r1=1.29&r2=1.30&f=h
Comment 11 Tim Shen 2014-07-01 03:06:18 UTC
Author: timshen
Date: Tue Jul  1 03:05:45 2014
New Revision: 212185

URL: https://gcc.gnu.org/viewcvs?rev=212185&root=gcc&view=rev
Log:
	PR libstdc++/61061
	PR libstdc++/61582
	* include/bits/regex_automaton.h (_NFA<>::_M_insert_state): Add
	a NFA state limit. If it's exceeded, regex_constants::error_space
	will be throwed.
	* include/bits/regex_automaton.tcc (_StateSeq<>::_M_clone): Use
	map (which is sparse) instead of vector. This reduce n times clones'
	cost from O(n^2) to O(n).
	* include/std/regex: Add map dependency.
	* testsuite/28_regex/algorithms/regex_match/ecma/char/61601.cc: New
	testcase.


Added:
    trunk/libstdc++-v3/testsuite/28_regex/algorithms/regex_match/ecma/char/61601.cc
Modified:
    trunk/libstdc++-v3/ChangeLog
    trunk/libstdc++-v3/include/bits/regex_automaton.h
    trunk/libstdc++-v3/include/bits/regex_automaton.tcc
    trunk/libstdc++-v3/include/std/regex
Comment 12 Maksymilian Arciemowicz 2014-07-01 18:54:41 UTC
Ups. Check this (.*{100}{300})

gcc version 4.10.0 20140701 (experimental) (GCC)
--------
Starting program: /home/cx/REtrunk/kozak5/t3 '(.*{100}{300})'

Program received signal SIGSEGV, Segmentation fault.
0x000000000040c22a in std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, true>::_M_dfs(std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, true>::_Match_mode, long) ()
--------
Comment 13 Maksymilian Arciemowicz 2014-07-04 10:25:22 UTC
@Tim: do you need help?
Comment 14 Tim Shen 2014-07-04 18:00:15 UTC
(In reply to Maksymilian Arciemowicz from comment #13)
> @Tim: do you need help?

This is what I'm going to do: https://gcc.gnu.org/ml/libstdc++/2014-07/msg00008.html

Please send to libstdc++ ml if you have any ideas.
Comment 15 Tim Shen 2015-08-14 17:00:17 UTC
*** Bug 66456 has been marked as a duplicate of this bug. ***
Comment 16 Tim Shen 2015-08-14 17:01:00 UTC
*** Bug 67212 has been marked as a duplicate of this bug. ***
Comment 17 Tim Shen 2015-12-04 07:58:14 UTC
*** Bug 68688 has been marked as a duplicate of this bug. ***
Comment 18 Tim Shen 2016-03-26 01:50:33 UTC
*** Bug 70411 has been marked as a duplicate of this bug. ***
Comment 19 Tim Shen 2016-03-31 04:27:16 UTC
*** Bug 70459 has been marked as a duplicate of this bug. ***
Comment 20 Pádraig Brady 2017-02-11 00:44:29 UTC
Any status update on this. GCC7 is looming...
Thanks.
Comment 21 Tim Shen 2017-02-11 01:05:35 UTC
(In reply to Pádraig Brady from comment #20)
> Any status update on this. GCC7 is looming...
> Thanks.

Unfortunately I haven't get a chance to work on this. I plan to put up a one-line tweak on the internal state limit to make the library throwing an exception, instead of crash. That's probably a strict improvement.
Comment 22 M Welinder 2019-02-21 20:17:42 UTC
FWIW, there is an excellent overview of regular expression engine pitfalls
and methods here:

https://swtch.com/~rsc/regexp/regexp1.html
https://swtch.com/~rsc/regexp/regexp2.html
https://swtch.com/~rsc/regexp/regexp3.html

Those are about 10 years old, but not outdated.

The TL;DR is "use NFAs and DFAs, not back-tracking".
Comment 23 Jonathan Wakely 2021-12-15 23:57:52 UTC
(In reply to M Welinder from comment #22)
> FWIW, there is an excellent overview of regular expression engine pitfalls
> and methods here:
> 
> https://swtch.com/~rsc/regexp/regexp1.html
> https://swtch.com/~rsc/regexp/regexp2.html
> https://swtch.com/~rsc/regexp/regexp3.html

Yes, there have been links to the first one in libstdc++ headers since 2013.
Comment 24 Jonathan Wakely 2021-12-16 23:40:38 UTC
(In reply to Maksymilian Arciemowicz from comment #12)
> Ups. Check this (.*{100}{300})

This one still results in a stack overflow on trunk, with an 8MB stack. That is:

std::regex_match("a", std::regex("(.*{100}{300})"));

I have a proof-of-concept patch replacing the recursion in _Executor. The example above runs successfully with a 16k stack limit.