26166 – Matcher.find mis-behaviour

Status:	RESOLVED FIXED

Alias:	None

Product:	classpath
Classification:	Unclassified
Component:	classpath
Version:	unspecified

Importance:	P3 normal
Target Milestone:	0.90
Assignee:	Ito Kazumitsu

Reported:	2006-02-07 21:53 UTC by Andrew Overholt
Modified:	2006-02-13 16:52 UTC
CC List:	3 users

Last reconfirmed:	2006-02-08 04:41:13

Attachments
test case (469 bytes, text/x-java) 2006-02-07 21:53 UTC, Andrew Overholt

Description Andrew Overholt 2006-02-07 21:53:02 UTC

Eclipse's "hippie completion" (similar to word completion in vim or Emacs) does not work in Fedora due to a bug in our regular expression code.  I believe this is a GNU Classpath regex issue.  I have made a test case (soon to be attached).

javac TestHippieRegex.java
java TestHippieRegex

With the Sun JVM, I get the following:

++++++++++++++ trying fFindReplaceMatcher.find(100)
fFindReplaceMatcher.pattern().pattern() = [\p{L}[\p{Mn}[\p{Pc}[\p{Nd}[\p{Nl}[\p{Sc}]]]]]]+
++++++++++++++ found =  true

but with gij (Fedora rawhide's 4.1.0-0.20), I get:

++++++++++++++ trying fFindReplaceMatcher.find(100)
fFindReplaceMatcher.pattern().pattern() = [\p{L}[\p{Mn}[\p{Pc}[\p{Nd}[\p{Nl}[\p{Sc}]]]]]]+
++++++++++++++ found =  false

I'm investigating this because of https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=178648

Comment 1 Andrew Overholt 2006-02-07 21:53:25 UTC

Created attachment 10799
test case

Comment 2 Mark Wielaard 2006-02-07 22:11:21 UTC

Named property support was added recently:

2006-01-31  Ito Kazumitsu  <kaz@maczuka.gcd.org>

        Fixes bug #26002
        * gnu/regexp/gnu/regexp/RE.java(initialize): Parse /\p{prop}/.
        (NamedProperty): New inner class.
        (getNamedProperty): New method.
        (getRETokenNamedProperty): New method.
        * gnu/regexp/RESyntax.java(RE_NAMED_PROPERTY): New syntax flag.
        * gnu/regexp/RETokenNamedProperty.java: New file.

But the attached test program still fails.

Comment 3 Mark Wielaard 2006-02-07 22:26:55 UTC

This might be because we do not handle character class unions of the form [<class>[<another-class>]] correctly. In this case it seems the regular expression can be rewritten using alternation (logical OR) as follows:

Pattern pattern= Pattern.compile("(\\p{L}|\\p{Mn}|\\p{Pc}|\\p{Nd}|\\p{Nl}|\\p{Sc})+", patternFlags);

In that case the test program does return true.

The above is just a fancy way of saying you want to match a string of one or more letters (L), nonspacing marks (Mn), connector punctuation (Pc), decimal digit numbers (Nd), letter numbers (Nl) and currency symbols (Sc).
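
For anyone who wants to compare the two forms side by side, here is a minimal sketch. It assumes a java.util.regex implementation that already supports nested unions; the class name, the helper firstMatch and the sample input are made up for illustration, and the patternFlags argument from the test case is left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedUnionDemo {
    // Nested-union form taken from the report.
    static final String NESTED =
        "[\\p{L}[\\p{Mn}[\\p{Pc}[\\p{Nd}[\\p{Nl}[\\p{Sc}]]]]]]+";
    // Alternation form suggested in this comment.
    static final String ALTERNATION =
        "(\\p{L}|\\p{Mn}|\\p{Pc}|\\p{Nd}|\\p{Nl}|\\p{Sc})+";

    // Hypothetical helper: returns the first match of regex in input,
    // or null if find() fails.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Made-up sample: letters, '_' (Pc), '1' (Nd), '$' (Sc),
        // surrounded by characters outside all six categories.
        String input = "  word_1$ !";
        System.out.println(firstMatch(NESTED, input));
        System.out.println(firstMatch(ALTERNATION, input));
    }
}
```

On an implementation that handles nested unions, both calls should report the same run of characters. An implementation affected by this bug would be expected to differ on the first call only.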

Comment 4 Ito Kazumitsu 2006-02-08 04:41:13 UTC

While working on the gnu.regexp package recently, I had hoped that
the day when nested character class expressions such as [aaa[xyz]]
were needed would not come so soon.

This syntax is not in Perl, but Sun's JDK introduced it.

[X[Y[^Z]]] would not be so difficult.  It is equivalent to X|Y|[^Z].
I think I can manage to do this.

But Sun's JDK introduced another syntax like [X&&[Y[^Z]]] meaning
X and (Y or not Z).  Supporting "&&" will require a serious design change.
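
To make the "&&" reading concrete, here is a small sketch that checks one such class against hand-written boolean logic. It assumes a java.util.regex implementation with intersection support; the choice of X = [a-z], Y = [def], Z = [a-m] and all names in the code are only illustrative.

```java
import java.util.regex.Pattern;

class IntersectionDemo {
    // X = [a-z], Y = [def], Z = [a-m]: "X and (Y or not Z)".
    static final Pattern P = Pattern.compile("[a-z&&[def[^a-m]]]");

    // What the regex engine says about a single character.
    static boolean regexSays(char c) {
        return P.matcher(String.valueOf(c)).matches();
    }

    // The same condition written out as plain boolean logic.
    static boolean logicSays(char c) {
        boolean x = c >= 'a' && c <= 'z';
        boolean y = c == 'd' || c == 'e' || c == 'f';
        boolean z = c >= 'a' && c <= 'm';
        return x && (y || !z);
    }

    public static void main(String[] args) {
        for (char c : new char[] {'b', 'e', 'q', 'Q'}) {
            System.out.println(c + ": regex=" + regexSays(c)
                + " logic=" + logicSays(c));
        }
    }
}
```

Unlike the plain union case, this cannot be rewritten as a simple alternation of the parts, which is why it would need a different design.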

Comment 5 Ito Kazumitsu 2006-02-08 11:13:49 UTC

Before implementing this new syntax, could someone explain this
strange behavior of Sun's JDK?  The source of W.java used here
is shown at the end of the session below.

bash$ java -DFIND=1 W 'b' '[^b]'
false
bash$ java -DFIND=1 W 'b' '[^b[b]]'
true
G0 = b
bash$ java -DFIND=1 W 'b' '[^b[b]b]'
false
bash$ java -DFIND=1 W 'b' '[^b[b]b[b]]'
true
G0 = b
bash$ java -DFIND=1 W 'b' '[^b[b]b[b]b]'
false
bash$ java -DFIND=1 W 'b' '[^[b]]'
true
G0 = b
bash$ java -DFIND=1 W 'b' '[^[b]b]'
false
bash$ java -DFIND=1 W 'b' '[^[b]b[b]]'
true
G0 = b
bash$ java -DFIND=1 W 'b' '[^[b]b[b]b]'
false
bash$ cat W.java
import java.util.regex.*;
// Usage: java [-DFIND=1] [-DCASE_INSENSITIVE=1] W <input> <regex>
public class W {
  public static void main(String[] args) throws Exception {
    int flags = 0;
    // With -DFIND set, use find(); otherwise the whole input must match.
    boolean find = (System.getProperty("FIND") != null);
    if (System.getProperty("CASE_INSENSITIVE") != null) {
      flags |= Pattern.CASE_INSENSITIVE;
    }
    Pattern p = Pattern.compile(args[1], flags);
    Matcher m = p.matcher(args[0]);
    boolean b = (find ? m.find() : m.matches());
    System.out.println(b);
    if (b) {
      // Print every group; G0 is the whole match.
      int groups = m.groupCount();
      for (int i = 0; i <= groups; i++) {
        System.out.println("G" + i + " = " + m.group(i));
      }
    }
  }
}

I assume that

  [^X[Y][Z]] means (not X) or Y or Z,
     where X must not contain a subclass enclosed in [].

  [^[X]] and [^X[Y]Z] are invalid expressions whose matching results are
  meaningless, although Sun's JDK does not check their validity.
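
The assumed reading can be modelled without any regex engine at all. The sketch below is only a model of that assumption, not code from gnu.regexp or from Sun's JDK; the class and the helper negatedWithNested are hypothetical names.

```java
import java.util.function.IntPredicate;

class AssumedNegationModel {
    // Hypothetical helper: models "[^X[Y][Z]]" as (not X) or Y or Z.
    static IntPredicate negatedWithNested(IntPredicate x,
                                          IntPredicate... nested) {
        IntPredicate result = x.negate();
        for (IntPredicate n : nested) {
            result = result.or(n);
        }
        return result;
    }

    public static void main(String[] args) {
        IntPredicate isB = c -> c == 'b';
        // Model of "[^b]": no nested class, so just "not b".
        System.out.println(negatedWithNested(isB).test('b'));      // false
        // Model of "[^b[b]]": (not b) or b.
        System.out.println(negatedWithNested(isB, isB).test('b')); // true
    }
}
```

The two printed values agree with the first two results in the session above, which is what the assumption is based on.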

Comment 6 cvs-commit@developer.classpath.org 2006-02-13 15:22:49 UTC

Subject: Bug 26166

CVSROOT:	/cvsroot/classpath
Module name:	classpath
Branch: 	
Changes by:	Ito Kazumitsu <itokaz@savannah.gnu.org>	06/02/13 13:19:44

Modified files:
	.              : ChangeLog 
	gnu/regexp     : RE.java RESyntax.java RETokenOneOf.java 

Log message:
	2006-02-13  Ito Kazumitsu  <kaz@maczuka.gcd.org>
	
	Fixes bug #26166
	* gnu/regexp/RE.java(initialize): Parsing of character class expression
	was moved to a new method parseCharClass.
	(parseCharClass): New method originally in initialize. Added parsing
	of nested character classes.
	(ParseCharClassResult): New inner class used as a return value of
	parseCharClass.
	(getCharExpression),(getNamedProperty): Made static.
	* gnu/regexp/RESyntax.java(RE_NESTED_CHARCLASS): New syntax flag.
	* gnu/regexp/RETokenOneOf.java(addition): New Vector for storing
	nested character classes.
	(RETokenOneOf): New constructor accepting the Vector addition.
	(getMinimumLength), (getMaximumLength): Returns 1 if the token
	stands for only one character.
	(match): Added the processing of the Vector addition.
	(matchN), (matchP): Do not check next token if addition is used.

CVSWeb URLs:
http://cvs.savannah.gnu.org/viewcvs/classpath/classpath/ChangeLog.diff?tr1=1.6350&tr2=1.6351&r1=text&r2=text
http://cvs.savannah.gnu.org/viewcvs/classpath/classpath/gnu/regexp/RE.java.diff?tr1=1.17&tr2=1.18&r1=text&r2=text
http://cvs.savannah.gnu.org/viewcvs/classpath/classpath/gnu/regexp/RESyntax.java.diff?tr1=1.6&tr2=1.7&r1=text&r2=text
http://cvs.savannah.gnu.org/viewcvs/classpath/classpath/gnu/regexp/RETokenOneOf.java.diff?tr1=1.6&tr2=1.7&r1=text&r2=text

Comment 7 Ito Kazumitsu 2006-02-13 15:59:11 UTC

Fixed.

Comment 8 Tom Tromey 2006-02-13 22:58:40 UTC

Subject: Bug 26166

Author: tromey
Date: Mon Feb 13 22:58:37 2006
New Revision: 110937

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=110937
Log:
2006-02-13  Ito Kazumitsu  <kaz@maczuka.gcd.org>

	Fixes bug #26166
	* gnu/regexp/RE.java(initialize): Parsing of character class expression
	was moved to a new method parseCharClass.
	(parseCharClass): New method originally in initialize. Added parsing
	of nested character classes.
	(ParseCharClassResult): New inner class used as a return value of
	parseCharClass.
	(getCharExpression),(getNamedProperty): Made static.
	* gnu/regexp/RESyntax.java(RE_NESTED_CHARCLASS): New syntax flag.
	* gnu/regexp/RETokenOneOf.java(addition): New Vector for storing
	nested character classes.
	(RETokenOneOf): New constructor accepting the Vector addition.
	(getMinimumLength), (getMaximumLength): Returns 1 if the token
	stands for only one character.
	(match): Added the processing of the Vector addition.
	(matchN), (matchP): Do not check next token if addition is used.

Modified:
    branches/gcc-4_1-branch/libjava/classpath/ChangeLog.gcj
    branches/gcc-4_1-branch/libjava/classpath/gnu/regexp/RE.java
    branches/gcc-4_1-branch/libjava/classpath/gnu/regexp/RESyntax.java
    branches/gcc-4_1-branch/libjava/classpath/gnu/regexp/RETokenOneOf.java