Regex backend refactoring/rewriting?

Tim Shen timshen@google.com
Thu Feb 19 05:25:00 GMT 2015


There're three unresolved bugs for C++11 <regex>:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61424: std::regex matches
right to left, not leftmost longest

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61601: C++11 regex
resource exhaustion

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63776: Regex collate
matching not working

For the third one, I've put my thought there, since I'm not quite sure
what should be done;

Additionally, I noticed that our POSIX executer is not conforming:

    #include <regex>
    #include <iostream>

    using namespace std;

    int main() {
        cmatch m;
        cout << regex_search("aaaaaaaa", m, regex("a*(a*)",
regex_constants::extended)) << "\n";
        for (const auto& it : m) {
            cout << it << "\n";
        }
    }

it prints:
1
aaaaaaaa
/* here is an empty string */
but it should print:
1
aaaaaaaa
aaaaaaaa

since POSIX requires leftmost longest match
(http://pubs.opengroup.org/onlinepubs/7908799/xbd/re.html), and the
latter captured (a*) should take all 'a's.

Our ECMAScript executer is not conforming for at least one case:
repeating once more a capture group won't clear it's sub captures:

    #include <regex>
    #include <iostream>

    using namespace std;

    int main() {
        cmatch m;
        cout << regex_search("ab", m, regex("((b)|(a))*")) << "\n";
        for (const auto& it : m) {
            cout << it << "\n";
        }
    }

It prints:
1
1 ab
1 b
1 b
1 a
but should be:
1
1 ab
1 b
1 b
0 /* here is an empty string */
see ECMAScript 262 [15.10.2.3.3].


To fix these problems, I suggest to split our current _Executor class
two 3~4 classes:
a) ECMAScript backtracking executer;
b) POSIX backtracking executer;
c) POSIX Thompson NFA executer;
d) (optional) ECMAScript Thompson NFA executer.

To clarify, in libstdc++, backtracking is called DFS, Thompson NFA is
called BFS.

Splitting them may introduce a little bit duplication, but it makes
code easier to maintain.

a) b) c) will be straightforward, but I need to investigate more on
d). But we don't have d) today anyway :)

To solve 2), std::stack will be used instead of calling stack
(currently they are implemented using recursion).

Also, Thompson NFA will be enabled by using a flag extension
regex_constants::syntax_option_type::__polynomial, and there may be
one regex_error extension too. With this flag, the user can be ensured
both space (NFA -> DFA optimization, which we don't have, may consume
exponentially large space) and time (the whole reason why we do
Thompson NFA) complexity is polynomial.

Also, GSoC is approaching, I'm not sure if this deserves a student project.

Thoughts?

Thanks! :)


-- 
Regards,
Tim Shen



More information about the Libstdc++ mailing list