Bug 85472 - Regex match bug
Summary: Regex match bug
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++ (show other bugs)
Version: 7.3.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks: std::regex
  Show dependency treegraph
 
Reported: 2018-04-19 20:11 UTC by Hans Åberg
Modified: 2021-09-22 08:34 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-04-19 00:00:00


Attachments
The regex example mentioned in the report. (655 bytes, text/x-csrc)
2018-04-19 20:11 UTC, Hans Åberg
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Hans Åberg 2018-04-19 20:11:45 UTC
Created attachment 43993 [details]
The regex example mentioned in the report.

In the regex "(z)((a+)?(b+)?(c))*" example at [1], also attached, applied to the string "zaacbbbcac", g++ produces a pattern #4 match as "bbb", whereas it says there it should be empty (the latter which clang++ does).

1. http://en.cppreference.com/w/cpp/regex/ecmascript
Comment 1 Tim Shen 2018-04-25 08:14:35 UTC
As I cross checked, boost 1.66, libstdc++, and python3 have the same behavior, only libc++ is different from the rest.

I haven't checked the ECMAScript spec, but I suspect that it describes the boost/libstdc++/python behavior, only because it's more efficient, not less confusing.

I know exactly how libc++ produces this result. It creates an empty match_result during each repetition ("*" operator). It's less confusing but much slower.
Comment 2 Tim Shen 2018-04-25 08:39:19 UTC
My bad, I didn't realize that "(z)((a+)?(b+)?(c))*" is exactly an example described in ECMAScript third edition 15.10.2.5. Therefore libstdc++ is not conforming.
Comment 3 Tim Shen 2018-04-25 08:43:19 UTC
Conclusively, yes it is a bug, but it is hard to fix without regressing normal case performance.
Comment 4 Hans Åberg 2018-04-25 08:45:14 UTC
(In reply to Tim Shen from comment #1)
> I know exactly how libc++ produces this result. It creates an empty
> match_result during each repetition ("*" operator). It's less confusing but
> much slower.

I wrote an NFA/DFA that computes all matches, it seems, in an efficient manner: Action numbers are marked on the states of the sub-NFAs one is interested in. While lexing a string, the partial reverse NFA is recorded, which is searched when a final state is encountered. The string position action numbers, as marked on the corresponding states, are recorded, and for each action number, the contiguous string sequences are the possible submatches.
Comment 5 Tim Shen 2018-04-25 09:10:45 UTC
(In reply to Hans Åberg from comment #4)
> (In reply to Tim Shen from comment #1)
> > I know exactly how libc++ produces this result. It creates an empty
> > match_result during each repetition ("*" operator). It's less confusing but
> > much slower.
> 
> I wrote an NFA/DFA that computes all matches, it seems, in an efficient
> manner: Action numbers are marked on the states of the sub-NFAs one is
> interested in. While lexing a string, the partial reverse NFA is recorded,
> which is searched when a final state is encountered. The string position
> action numbers, as marked on the corresponding states, are recorded, and for
> each action number, the contiguous string sequences are the possible
> submatches.

I'm not following the meaning of "action number" and "the partial reverse NFA is recorded".

How many actions numbers are recorded? for regex_match(s, regex(re)), is it on the order of O(s.size()), or O(re.size())? The goal is to allocate O(re.size()) of memory exactly once.
Comment 6 Hans Åberg 2018-04-25 10:55:58 UTC
(In reply to Tim Shen from comment #5)
> (In reply to Hans Åberg from comment #4)
> > I wrote an NFA/DFA that computes all matches, it seems, in an efficient
> > manner: Action numbers are marked on the states of the sub-NFAs one is
> > interested in. While lexing a string, the partial reverse NFA is recorded,
> > which is searched when a final state is encountered. The string position
> > action numbers, as marked on the corresponding states, are recorded, and for
> > each action number, the contiguous string sequences are the possible
> > submatches.
> 
> I'm not following the meaning of "action number"...

Arbitrary numbers used identify the sub-automatons; in the C++ library, the number of left parentheses in the regex expression from which it derives. 

> ...and "the partial reverse
> NFA is recorded".

The lexing states are sets of NFA states, which are also the DFA states. If a lexing character c in the state p = {p_0, ...} has transition p_i -> {q_0, ...}, the reverse state q_0 -> {r_0, ...} gets p_i added at this position in the string. This is to record the states which c actually uses.

> How many actions numbers are recorded?

They are sets, both in the states and the string matches. Each final state position in the string produces a separate match, containing information about the submatches via the action numbers.

> for regex_match(s, regex(re)), is it
> on the order of O(s.size()), or O(re.size())? The goal is to allocate
> O(re.size()) of memory exactly once.

The action numbers are just stamped onto the NFA states when it is created, and then carried along. So it is the same as the number of states in the NFA. When a string is lexed, one just gets the action numbers for each position that produced a character match.
Comment 7 Hans Åberg 2018-04-25 10:57:00 UTC
(In reply to Tim Shen from comment #5)
> (In reply to Hans Åberg from comment #4)
> > I wrote an NFA/DFA that computes all matches, it seems, in an efficient
> > manner: Action numbers are marked on the states of the sub-NFAs one is
> > interested in. While lexing a string, the partial reverse NFA is recorded,
> > which is searched when a final state is encountered. The string position
> > action numbers, as marked on the corresponding states, are recorded, and for
> > each action number, the contiguous string sequences are the possible
> > submatches.
> 
> I'm not following the meaning of "action number"...

Arbitrary numbers used identify the sub-automatons; in the C++ library, the number of left parentheses in the regex expression from which it derives. 

> ...and "the partial reverse
> NFA is recorded".

The lexing states are sets of NFA states, which are also the DFA states. If a lexing character c in the state p = {p_0, ...} has transition p_i -> {q_0, ...}, the reverse state q_0 -> {r_0, ...} gets p_i added at this position in the string. This is to record the states which c actually uses.

> How many actions numbers are recorded?

They are sets, both in the states and the string matches. Each final state position in the string produces a separate match, containing information about the submatches via the action numbers.

> for regex_match(s, regex(re)), is it
> on the order of O(s.size()), or O(re.size())? The goal is to allocate
> O(re.size()) of memory exactly once.

The action numbers are just stamped onto the NFA states when it is created, and then carried along. So it is the same as the number of states in the NFA. When a string is lexed, one just gets the action numbers for each position that produced a character match.
Comment 8 Hans Åberg 2018-04-25 12:54:03 UTC
(In reply to Tim Shen from comment #5)
> I'm not following the meaning of "action number" and "the partial reverse
> NFA is recorded".
> 
> How many actions numbers are recorded? for regex_match(s, regex(re)), is it
> on the order of O(s.size()), or O(re.size())? The goal is to allocate
> O(re.size()) of memory exactly once.

Let's check the OP example, to see if it is useful, which will also explain the action numbers. So for regex '(z)((a+)?(b+)?(c))*', I used '(z)(a*b*(c))*' with action numbers same as in the C++ library numbering. Here they are written out with # in the NFA:
  ex = ({0}, {0, 3},
        {0: {z ↦ {1, ∅}}[#1];    1: {a ↦ {1, 2, 3}}[#2#3];
         2: {b ↦ {2, 3}}[#2#4];  3: {c ↦ {3, 2, 1, ∅}}[#2#5]}, {})
with start set {0}, and set of last (pointing to a final state ∅) states {0, 3}. (The last {} is the not yet computed DFA.)

Then lexing:

> zaacbbbcac
Full string is a valid match!
Matches: {zaacbbbcac, [@0@3@7@9]; 
[#1]
[#1][#2#3][#2#3][#2#5]
[#1][#2#3][#2#3][#2#5][#2#4][#2#4][#2#4][#2#5]
[#1][#2#3][#2#3][#2#5][#2#4][#2#4][#2#4][#2#5][#2#3][#2#5]
}
Time: 0.000555 s.

Each line gives the action numbers for the string positions for the successive strings z, zaac, zaacbbbc, zaacbbbcac with final string positions @0, @3, @7, @9.

The last line gives for #1 the sub-string "z" , and for #2 "aacbbbcac". For #3 "a", and for #5 "c". But #4 is missing, indication there is no match. So there might be problem here, as there are earlier matches:

Perhaps the intent is that it should be implemented as a loop, only retaining the last #4, but the problem is that that is not what the underlying theory says.
Comment 9 Tim Shen 2018-04-25 17:37:31 UTC
Ah with the example it's clear, thanks!

> The last line gives for #1 the sub-string "z" , and for #2 "aacbbbcac".

This is not what ECMAScript produces either. for capture #2, ECMAScriptn produces "ac", the last match of the loop.

Think about the difference between

  (z)((a+)?(b+)?(c))*

and

  (z)((?:(a+)?(b+)?(c))*)

Your #2 seems to capture the second case, which is different.

> For
> #3 "a", and for #5 "c". But #4 is missing, indication there is no match. So
> there might be problem here, as there are earlier matches:
> 
> Perhaps the intent is that it should be implemented as a loop, only
> retaining the last #4, 

That's what the implementations (boost, libstdc++, python) actually do. That's not ECMAScript's intention. ECMAScript's intention is to leave #4 undefined (*not* retaining the last non-empty #4), as in the last iteration of the loop, #4 (b+)? doesn't match any sub-string.

> but the problem is that that is not what the
> underlying theory says.

I'm not sure if there is any theory around caputring groups. If we are about to create one, be aware that there are multiple plausible definitions.
Comment 10 Hans Åberg 2018-04-25 18:08:27 UTC
(In reply to Tim Shen from comment #9)
> Ah with the example it's clear, thanks!

You are welcome.

> > The last line gives for #1 the sub-string "z" , and for #2 "aacbbbcac".
> 
> This is not what ECMAScript produces either. for capture #2, ECMAScriptn
> produces "ac", the last match of the loop.
> 
> Think about the difference between
> 
>   (z)((a+)?(b+)?(c))*
> 
> and
> 
>   (z)((?:(a+)?(b+)?(c))*)
> 
> Your #2 seems to capture the second case, which is different.

Yes, I thought about this difference in an earlier version that marked the start and last states only, but switched to this, because I wanted to ignore empty matches:

I think though that one might mark the transitions with action numbers and capture it that way (see below).

> > For
> > #3 "a", and for #5 "c". But #4 is missing, indication there is no match. So
> > there might be problem here, as there are earlier matches:
> > 
> > Perhaps the intent is that it should be implemented as a loop, only
> > retaining the last #4, 
> 
> That's what the implementations (boost, libstdc++, python) actually do.
> That's not ECMAScript's intention. ECMAScript's intention is to leave #4
> undefined (*not* retaining the last non-empty #4), as in the last iteration
> of the loop, #4 (b+)? doesn't match any sub-string.

The problem is that #4 has an earlier capture, making it impossible to see that it is left undefined later. So it would be better to capture it as an empty string here:

In the example I gave, if the transitions are marked (ignoring #2)
  1: {a ↦ {1#3, 2#3, 3#3$4}};
  2: {b ↦ {2#4, 3#4}};
  3: {c ↦ {3#5, 2#5, 1#5, ∅#2}.
Then in the "ac" part, 'a' in state 1 goes to 3 and gets a #3 match 'a' plus an empty which I have written $4. That is the idea, though.

> > but the problem is that that is not what the
> > underlying theory says.
> 
> I'm not sure if there is any theory around caputring groups. If we are about
> to create one, be aware that there are multiple plausible definitions.

I have though not implemented the counted repetitions A{m, n}, which I think usually are implemented as loops, so I do not know how that fits into the picture.
Comment 11 Tim Shen 2018-04-25 18:35:16 UTC
> The problem is that #4 has an earlier capture, making it impossible to see
> that it is left undefined later.

I wouldn't say it's impossible. libc++ implements it correctly at a cost.

Specifically, see https://github.com/llvm-mirror/libcxx/blob/e54a22f809bc741760e39ac78444f57f06117652/include/regex#L5566.

> So it would be better to capture it as an
> empty string here:
> 
> In the example I gave, if the transitions are marked (ignoring #2)
>   1: {a ↦ {1#3, 2#3, 3#3$4}};
>   2: {b ↦ {2#4, 3#4}};
>   3: {c ↦ {3#5, 2#5, 1#5, ∅#2}.
> Then in the "ac" part, 'a' in state 1 goes to 3 and gets a #3 match 'a' plus
> an empty which I have written $4. That is the idea, though.

I'm feeling that this is going to a quadratic space complexity here. Does "(((...((a_0)|(a_1)|(a_2)|...(a_n))...)))" if the regex expands to n surrounding parenthesis, and n characters in the middle? Would this regex match taking up to n^2 time, given a n-sized input string?

If you think it's time and space efficiently doable, maybe you can develop a prototype first.

> I have though not implemented the counted repetitions A{m, n}, which I think
> usually are implemented as loops, so I do not know how that fits into the
> picture.

It might also be helpful to think about back-references like "(a+)\1".
Comment 12 Hans Åberg 2018-04-25 18:42:16 UTC
(In reply to Tim Shen from comment #11)
> > The problem is that #4 has an earlier capture, making it impossible to see
> > that it is left undefined later.
> 
> I wouldn't say it's impossible. libc++ implements it correctly at a cost.
>
> Specifically, see
> https://github.com/llvm-mirror/libcxx/blob/
> e54a22f809bc741760e39ac78444f57f06117652/include/regex#L5566. 

I meant the way I do it now.

> > So it would be better to capture it as an
> > empty string here:
> > 
> > In the example I gave, if the transitions are marked (ignoring #2)
> >   1: {a ↦ {1#3, 2#3, 3#3$4}};
> >   2: {b ↦ {2#4, 3#4}};
> >   3: {c ↦ {3#5, 2#5, 1#5, ∅#2}.
> > Then in the "ac" part, 'a' in state 1 goes to 3 and gets a #3 match 'a' plus
> > an empty which I have written $4. That is the idea, though.
> 
> I'm feeling that this is going to a quadratic space complexity here. Does
> "(((...((a_0)|(a_1)|(a_2)|...(a_n))...)))" if the regex expands to n
> surrounding parenthesis, and n characters in the middle? Would this regex
> match taking up to n^2 time, given a n-sized input string?

It is all linear, in one traverse, just pick down the action form the transitions instead of the states. Both are computed when checking the forward transition for a character: I just record them.

> If you think it's time and space efficiently doable, maybe you can develop a
> prototype first.

Yes. I will have to think about it, and report back here.

> > I have though not implemented the counted repetitions A{m, n}, which I think
> > usually are implemented as loops, so I do not know how that fits into the
> > picture.
> 
> It might also be helpful to think about back-references like "(a+)\1".

What's this?
Comment 13 Hans Åberg 2018-05-01 17:07:27 UTC
(In reply to Tim Shen from comment #11)
> If you think it's time and space efficiently doable, maybe you can develop a
> prototype first.

I put action number lists on the NFA transitions (no markup of the states). But then the things I described before work.

Theoretically it may be simplest to start with a NFA with empty (ε) transitions: For the sub-automatons one is interested in to find matches for, put a unique action number on the start and final state transitions, marked begin resp. end match. When compacted to a NFA without empty transitions, which is what I use, they become lists.

I use bra-ket notation <k|A|k> to indicate the action number k begin-end match of the automaton A; parentheses indicate grouping. Then the original example can be written
  <1|z|1>(<2|<3|a*|3><4|b*|4><5|c|5>|2>)*

My program then produces the following result, first the NFA, and then the matches:

ex = (ε ↦ {<1|0}, {
  0: {z ↦ {|1><2|<3||3><4||4><5|3, |1><2|<3||3><4|2, |1><2|<3|1, |1>∅}}
  1: {a ↦ {1, |3><4|2, |3><4||4><5|3}}
  2: {b ↦ {2, |4><5|3}}
  3: {c ↦ {|5>|2><2|<3||3><4||4><5|3, |5>|2><2|<3||3><4|2, |5>|2><2|<3|1, |5>|2>∅}}}, {})

> zaacbbbcac
Full string is a valid match!
Matches: {zaacbbbcac, [@1@4@8@10];
<1|z|1>
<1|z|1> <2|<3|aa|3><4||4><5|c|5>|2>
<1|z|1> <2|<3|aa|3><4||4><5|c|5>|2> <2|<3||3><4|bbb|4><5|c|5>|2>
<1|z|1> <2|<3|aa|3><4||4><5|c|5>|2> <2|<3||3><4|bbb|4><5|c|5>|2> <2|<3|a|3><4||4><5|c|5>|2>
}

In the first NFA description, transition actions are written after the action that one pass over. So, for example, the original start state ε ↦ {<1|0} means that one opens the match <1| to arrive at state 1; then all possibilities in state 1 closes it with |1>, then opening some other matches.

In the matches description, after the matched string follows, the final state match string lengths follows, so that eventual empty matches get number 0. The longest match last then have at the end the sequence
  <2|<3|a|3><4||4><5|c|5>|2>
which is what is asked for: #2 "ac, #3 "a", #4 "", #5 "c", in addition to # 1 "z" first.
Comment 14 Tim Shen 2018-05-01 18:30:00 UTC
How fast does your prototype run on the following example?
  ((10)|(11)|(12)|...|(98)|(99))* matching "10111213141516...9899"
how about this:
  ((100)|(101)|(102)|...|(998)|(999))* matching "100101102...998999"

where "..." are the skipped strings that follow the surrounding pattern.

For the record, for the 100-999 case, libstdc++ takes 0.112 seconds given a proper stack limit. libc++ takes 6.386 seconds.

I have a C++ test case for this:

#include <regex>
#include <string>
#include <iostream>

constexpr int kStart = 100;
constexpr int kEnd = 1000;

int main() {
  std::regex re;
  {
    std::string re_string;
    re_string += '(';
    re_string += std::string("(") + std::to_string(kStart) + ")";
    for (int i = kStart + 1; i < kEnd; i++) {
      re_string += '|';
      re_string += '(';
      re_string += std::to_string(i);
      re_string += ')';
    }
    re_string += ')';
    re_string += '*';
    re.assign(re_string);
  }
  std::string s;
  for (int i = kStart; i < kEnd; i++) {
    s += std::to_string(i);
  }
  std::smatch m;
  std::cout << std::regex_match(s, m, re) << "\n";
  for (const auto p : m) {
    std::cout << p << " ";
  }
  std::cout << "\n";
  return 0;
}
Comment 15 Hans Åberg 2018-05-01 18:50:09 UTC
(In reply to Tim Shen from comment #14)
> How fast does your prototype run on the following example?
>   ((10)|(11)|(12)|...|(98)|(99))* matching "10111213141516...9899"
> how about this:
>   ((100)|(101)|(102)|...|(998)|(999))* matching "100101102...998999"
> 
> where "..." are the skipped strings that follow the surrounding pattern.

My program does not choke on limit cases, like the regex '(a?)^k a^k' on the string string "a"^k, mentioned at [1]; for k = 100, it takes 0.393 s on the NFA and 0.168 on the DFA on a not so powerful computer. Otherwise, it is not optimized and fast in absolute terms: I use standard C++ containers such as unordered_set for DFA state lookups, and not a number.

1. https://swtch.com/~rsc/regexp/regexp1.html
Comment 16 Tim Shen 2018-05-01 19:10:32 UTC
(In reply to Hans Åberg from comment #15)
> (In reply to Tim Shen from comment #14)
> > How fast does your prototype run on the following example?
> >   ((10)|(11)|(12)|...|(98)|(99))* matching "10111213141516...9899"
> > how about this:
> >   ((100)|(101)|(102)|...|(998)|(999))* matching "100101102...998999"
> > 
> > where "..." are the skipped strings that follow the surrounding pattern.
> 
> My program does not choke on limit cases, like the regex '(a?)^k a^k' on the
> string string "a"^k, mentioned at [1]; for k = 100, it takes 0.393 s on the
> NFA and 0.168 on the DFA on a not so powerful computer. Otherwise, it is not
> optimized and fast in absolute terms: I use standard C++ containers such as
> unordered_set for DFA state lookups, and not a number.
> 
> 1. https://swtch.com/~rsc/regexp/regexp1.html

Sorry, I meant to observe whether the example triggers a O(k^2) behavior. It should be less than quadratic. To measure this, you don't have to compare the time of your prototype against libstdc++.

I set (kStart, kEnd) to be (1000, 2000), (1000, 3000), ..., (1000, 9000) to see how the time grows. I changed my pattern a bit to make it k lg k. I'm more curious about the time complexity your program takes, as I was not able to understand your algorithm description.

Here are my numbers for end ranging 2000 to 9000, in ms: [79,150,262,427,620,868,1078,1310].

The difference array: [71, 112, 165, 193, 248, 210, 232].

I think it's vaguely k lg k, but I'm not quite sure yet.

Here is the updated example, to re-arrange the expression to a tree, not a chain of "|" operators.

#include <regex>
#include <string>
#include <iostream>
#include <cstdlib>
#include <cassert>

void gen_pattern(int start, int end, std::string& acc) {
  if (!(start < end)) {
    std::abort();
  }
  if (start + 1 == end) {
    acc += '(';
    acc += std::to_string(start);
    acc += ')';
    return;
  }
  int mid = (start + end) / 2;
  if (start + 1 == mid) {
    gen_pattern(start, mid, acc);
  } else {
    acc += "(?:";
    gen_pattern(start, mid, acc);
    acc += ")";
  }

  acc += "|";

  if (mid + 1 == end) {
    gen_pattern(mid, end, acc);
  } else {
    acc += "(?:";
    gen_pattern(mid, end, acc);
    acc += ")";
  }
}

int main(int argc, char** argv) {
  if (argc < 3) {
    std::abort();
  }
  int start = atoi(argv[1]);
  int end = atoi(argv[2]);

  std::regex re;
  {
    std::string pattern;
    gen_pattern(start, end, pattern);
    re.assign("(" + pattern + ")*");
  }
  std::string s;
  for (int i = start; i < end; i++) {
    s += std::to_string(i);
  }
  std::smatch m;
  std::cout << std::regex_match(s, m, re) << "\n";
  //for (const auto p : m) {
  //  std::cout << p << " ";
  //}
  //std::cout << "\n";
  return 0;
}
Comment 17 Hans Åberg 2018-05-01 19:44:20 UTC
(In reply to Tim Shen from comment #16)
> (In reply to Hans Åberg from comment #15)
> > (In reply to Tim Shen from comment #14)
> > > How fast does your prototype run on the following example?
> > >   ((10)|(11)|(12)|...|(98)|(99))* matching "10111213141516...9899"
> > > how about this:
> > >   ((100)|(101)|(102)|...|(998)|(999))* matching "100101102...998999"
> > > 
> > > where "..." are the skipped strings that follow the surrounding pattern.
> > 
> > My program does not choke on limit cases, like the regex '(a?)^k a^k' on the
> > string string "a"^k, mentioned at [1]; for k = 100, it takes 0.393 s on the
> > NFA and 0.168 on the DFA on a not so powerful computer. Otherwise, it is not
> > optimized and fast in absolute terms: I use standard C++ containers such as
> > unordered_set for DFA state lookups, and not a number.
> > 
> > 1. https://swtch.com/~rsc/regexp/regexp1.html
> 
> Sorry, I meant to observe whether the example triggers a O(k^2) behavior. It
> should be less than quadratic. To measure this, you don't have to compare
> the time of your prototype against libstdc++.

I'll have to get back on this. But it shouldn't choke:

During the lexing, I just record the DFA states as function of the string positions. When a final state appears, the reverse NFA data can be used to compute the NFA states this match actually uses, and therefore, as the characters are known, also the NFA transitions, and any data that depends on it.
Comment 18 Hans Åberg 2018-05-01 19:59:18 UTC
(In reply to Tim Shen from comment #14)
> I have a C++ test case for this:

I get segmentation fault on the regex_match line with g++ 7.3.0.
Comment 19 Hans Åberg 2018-05-01 20:09:58 UTC
(In reply to Tim Shen from comment #16)
> I set (kStart, kEnd) to be (1000, 2000), (1000, 3000), ..., (1000, 9000) to
> see how the time grows. ...

> Here are my numbers for end ranging 2000 to 9000, in ms:
> [79,150,262,427,620,868,1078,1310].

I get segmentation fault beyond 4000, but for this value, it takes 1.130 s, so your computer is about 4 times faster it seems.
Comment 20 Tim Shen 2018-05-01 20:15:04 UTC
(In reply to Hans Åberg from comment #18)
> (In reply to Tim Shen from comment #14)
> > I have a C++ test case for this:
> 
> I get segmentation fault on the regex_match line with g++ 7.3.0.

On Linux I set "ulimit -s unlimited". The stack overflow is a known issue because the implementation is recursive.
Comment 21 Hans Åberg 2018-05-01 20:25:51 UTC
(In reply to Tim Shen from comment #20)
> (In reply to Hans Åberg from comment #18)
> > (In reply to Tim Shen from comment #14)
> > > I have a C++ test case for this:
> > 
> > I get segmentation fault on the regex_match line with g++ 7.3.0.
> 
> On Linux I set "ulimit -s unlimited".

On MacOS, I can change the value, default 8192, but not to unlimited.

> The stack overflow is a known issue
> because the implementation is recursive.

I have a loop, as in Flex and Bison generated parsers.
Comment 22 Hans Åberg 2018-05-02 21:47:57 UTC
(In reply to Tim Shen from comment #16)
> ...I meant to observe whether the example triggers a O(k^2) behavior. It
> should be less than quadratic. ...

> Here is the updated example, to re-arrange the expression to a tree, not a
> chain of "|" operators.

The matching itself that I do is linear O(k): I used a version of your example for my program, timing the part in question for various values, dividing with N and N^2, where N is the difference of the input numbers, making it easy to see if it is O(k) or O(k^2).

However, in your example, the DFA lexing is quadratic, probably because I use unordered_set objects that here gets size proportional to N. Thus this ought to be optimized. So you might check if that might be an issue in the C++ library.