[Bug libstdc++/85472] Regex match bug

haberg-1 at telia dot com gcc-bugzilla@gcc.gnu.org
Wed Apr 25 12:54:00 GMT 2018


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472

--- Comment #8 from Hans Åberg <haberg-1 at telia dot com> ---
(In reply to Tim Shen from comment #5)
> I'm not following the meaning of "action number" and "the partial reverse
> NFA is recorded".
> 
> How many actions numbers are recorded? for regex_match(s, regex(re)), is it
> on the order of O(s.size()), or O(re.size())? The goal is to allocate
> O(re.size()) of memory exactly once.

Let's check the OP example, to see if it is useful, which will also explain the
action numbers. So for regex '(z)((a+)?(b+)?(c))*', I used '(z)(a*b*(c))*' with
action numbers same as in the C++ library numbering. Here they are written out
with # in the NFA:
  ex = ({0}, {0, 3},
        {0: {z ↦ {1, ∅}}[#1];    1: {a ↦ {1, 2, 3}}[#2#3];
         2: {b ↦ {2, 3}}[#2#4];  3: {c ↦ {3, 2, 1, ∅}}[#2#5]}, {})
with start set {0}, and set of last (pointing to a final state ∅) states {0,
3}. (The last {} is the not yet computed DFA.)

Then lexing:

> zaacbbbcac
Full string is a valid match!
Matches: {zaacbbbcac, [@0@3@7@9]; 
[#1]
[#1][#2#3][#2#3][#2#5]
[#1][#2#3][#2#3][#2#5][#2#4][#2#4][#2#4][#2#5]
[#1][#2#3][#2#3][#2#5][#2#4][#2#4][#2#4][#2#5][#2#3][#2#5]
}
Time: 0.000555 s.

Each line gives the action numbers for the string positions for the successive
strings z, zaac, zaacbbbc, zaacbbbcac with final string positions @0, @3, @7,
@9.

The last line gives for #1 the sub-string "z" , and for #2 "aacbbbcac". For #3
"a", and for #5 "c". But #4 is missing, indication there is no match. So there
might be problem here, as there are earlier matches:

Perhaps the intent is that it should be implemented as a loop, only retaining
the last #4, but the problem is that that is not what the underlying theory
says.


More information about the Gcc-bugs mailing list