Bug 26166 - Matcher.find mis-behaviour
Summary: Matcher.find mis-behaviour
Status: RESOLVED FIXED
Alias: None
Product: classpath
Classification: Unclassified
Component: classpath
Version: unspecified
Importance: P3 normal
Target Milestone: 0.90
Assignee: Ito Kazumitsu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-02-07 21:53 UTC by Andrew Overholt
Modified: 2006-02-13 16:52 UTC
CC List: 3 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2006-02-08 04:41:13


Attachments
test case (469 bytes, text/x-java)
2006-02-07 21:53 UTC, Andrew Overholt

Description Andrew Overholt 2006-02-07 21:53:02 UTC
Eclipse's "hippie completion" (similar to word completion in vim or Emacs) does not work in Fedora due to a bug in our regular expression code.  I believe this is a GNU Classpath regex issue.  I have made a test case (soon to be attached).

javac TestHippieRegex.java
java TestHippieRegex

With the Sun JVM, I get the following:

++++++++++++++ trying fFindReplaceMatcher.find(100)
fFindReplaceMatcher.pattern().pattern() = [\p{L}[\p{Mn}[\p{Pc}[\p{Nd}[\p{Nl}[\p{Sc}]]]]]]+
++++++++++++++ found =  true

but with gij (Fedora rawhide's 4.1.0-0.20), I get:

++++++++++++++ trying fFindReplaceMatcher.find(100)
fFindReplaceMatcher.pattern().pattern() = [\p{L}[\p{Mn}[\p{Pc}[\p{Nd}[\p{Nl}[\p{Sc}]]]]]]+
++++++++++++++ found =  false

I'm investigating this because of https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=178648
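
For readers without access to the attachment, here is a minimal sketch of the kind of check described above. It is not the attached TestHippieRegex.java; the class name, the helper names and the sample text are assumptions made for illustration. Only the pattern string and the find(100) call are taken from the output shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the attached test case; not the original file.
class FindFromIndexSketch {
    // Pattern string copied from the output shown in the description.
    static final String WORD =
        "[\\p{L}[\\p{Mn}[\\p{Pc}[\\p{Nd}[\\p{Nl}[\\p{Sc}]]]]]]+";

    // Builds 100 spaces followed by a word, so a word starts at index 100.
    static String sampleText() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append(' ');
        }
        return sb.append("completion").toString();
    }

    // Resets the matcher and searches for the pattern from the given index.
    static boolean findFrom(String text, int start) {
        Matcher m = Pattern.compile(WORD).matcher(text);
        return m.find(start);
    }

    public static void main(String[] args) {
        System.out.println("found = " + findFrom(sampleText(), 100));
    }
}
```

An implementation that supports nested character classes should report a match here, since a run of letters begins at index 100.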
Comment 1 Andrew Overholt 2006-02-07 21:53:25 UTC
Created attachment 10799
test case
Comment 2 Mark Wielaard 2006-02-07 22:11:21 UTC
Named property support was added recently:

2006-01-31  Ito Kazumitsu  <kaz@maczuka.gcd.org>

        Fixes bug #26002
        * gnu/regexp/RE.java(initialize): Parse /\p{prop}/.
        (NamedProperty): New inner class.
        (getNamedProperty): New method.
        (getRETokenNamedProperty): New Method.
        * gnu/regexp/RESyntax.java(RE_NAMED_PROPERTY): New syntax flag.
        * gnu/regexp/RETokenNamedProperty.java: New file.

But the attached test program still fails.
Comment 3 Mark Wielaard 2006-02-07 22:26:55 UTC
This might be because we are not handling character class union [<class>[<another-class>]] correctly. In this case it seems the regular expression can be rewritten with logical or as follows:

Pattern pattern= Pattern.compile("(\\p{L}|\\p{Mn}|\\p{Pc}|\\p{Nd}|\\p{Nl}|\\p{Sc})+", patternFlags);

In that case the test program does return true.

The above is just a fancy way of saying you want to match a string of one or more letters, non-spacing marks, connector punctuation, decimal digit numbers, letter numbers and currency symbols.
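
To make the comparison concrete, the following is a small sketch that runs both forms of the pattern against the same input on a java.util.regex implementation. The class name, the helper method and the sample input are assumptions for illustration; only the two pattern strings come from this report.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: compares the nested-union form with the alternation form.
class NestedUnionVsAlternation {
    // Nested character class union, as used by Eclipse in this report.
    static final String NESTED =
        "[\\p{L}[\\p{Mn}[\\p{Pc}[\\p{Nd}[\\p{Nl}[\\p{Sc}]]]]]]+";

    // Rewrite with logical or, as suggested in this comment.
    static final String ALTERNATION =
        "(\\p{L}|\\p{Mn}|\\p{Pc}|\\p{Nd}|\\p{Nl}|\\p{Sc})+";

    // Returns the first substring matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Letters, '_' (Pc), digits (Nd) and '$' (Sc) form a single run.
        String input = "  foo_bar42$ baz";
        System.out.println(firstMatch(NESTED, input));
        System.out.println(firstMatch(ALTERNATION, input));
    }
}
```

Where nested classes are supported, both calls should return the same run, "foo_bar42$". The alternation form adds a capturing group, which does not matter for a plain find().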
Comment 4 Ito Kazumitsu 2006-02-08 04:41:13 UTC
While playing with the gnu.regexp package these days, I had hoped
that the day when a nested character class expression such as
[aaa[xyz]] is needed would not come so soon.

This syntax is not in Perl, but Sun's JDK introduced it.

[X[Y[^Z]]] would not be so difficult.  It is equivalent to X|Y|[^Z].
I think I can manage to do this.

But Sun's JDK introduced another syntax like [X&&[Y[^Z]]] meaning
X and (Y or not Z).  Supporting "&&" will require a serious design change.
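
As a rough illustration of the two syntaxes, here is a sketch using simple, non-negated outer classes on a java.util.regex implementation. The class and method names are made up for this example; the patterns are of the kind documented for java.util.regex.Pattern (union and intersection), not taken from gnu.regexp.

```java
import java.util.regex.Pattern;

// Illustrative only: union via nesting and intersection via "&&".
class CharClassSetOps {
    // True if the whole one-character string matches the given class.
    static boolean matchesChar(String regex, char c) {
        return Pattern.matches(regex, String.valueOf(c));
    }

    public static void main(String[] args) {
        // Union: [a-c[x-z]] accepts what [a-c]|[x-z] accepts.
        System.out.println(matchesChar("[a-c[x-z]]", 'x'));   // in x-z
        System.out.println(matchesChar("[a-c[x-z]]", 'm'));   // in neither range

        // Intersection: [a-z&&[def]] accepts only d, e or f.
        System.out.println(matchesChar("[a-z&&[def]]", 'd'));
        System.out.println(matchesChar("[a-z&&[def]]", 'a'));

        // Subtraction via intersection with a negated class:
        // [a-z&&[^bc]] accepts a through z except b and c.
        System.out.println(matchesChar("[a-z&&[^bc]]", 'b'));
        System.out.println(matchesChar("[a-z&&[^bc]]", 'd'));
    }
}
```

The union case can be evaluated as "any member matches", which fits the X|Y|[^Z] reading above. The intersection case needs every operand to accept the same character, which is why it does not reduce to alternation.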
Comment 5 Ito Kazumitsu 2006-02-08 11:13:49 UTC
Before implementing this new syntax, could someone explain this
strange behavior of Sun's JDK?  The source of W.java used here
is attached below.

bash$ java -DFIND=1 W 'b' '[^b]'
false
bash$ java -DFIND=1 W 'b' '[^b[b]]'
true
G0 = b
bash$ java -DFIND=1 W 'b' '[^b[b]b]'
false
bash$ java -DFIND=1 W 'b' '[^b[b]b[b]]'
true
G0 = b
bash$ java -DFIND=1 W 'b' '[^b[b]b[b]b]'
false
bash$ java -DFIND=1 W 'b' '[^[b]]'
true
G0 = b
bash$ java -DFIND=1 W 'b' '[^[b]b]'
false
bash$ java -DFIND=1 W 'b' '[^[b]b[b]]'
true
G0 = b
bash$ java -DFIND=1 W 'b' '[^[b]b[b]b]'
false
bash$ cat W.java
import java.util.regex.*;
public class W {
  public static void main(String[] args) throws Exception {
    int flags = 0;
    boolean find = (System.getProperty("FIND") != null);
    if (System.getProperty("CASE_INSENSITIVE") != null) {
      flags |= Pattern.CASE_INSENSITIVE;
    }
    Pattern p = Pattern.compile(args[1], flags);
    Matcher m = p.matcher(args[0]);
    boolean b = (find ? m.find() : m.matches());
    System.out.println(b);
    if (b) {
      int groups = m.groupCount();
      for (int i = 0; i <= groups; i++) {
        System.out.println("G" + i + " = " + m.group(i));
      }
    }
  }
}

I assume

  [^X[Y][Z]] means (not X) or Y or Z
     where X must not contain a subclass enclosed by [].

  [^[X]] and [^X[Y]Z] are invalid expressions whose matching results are
  meaningless, although Sun's JDK does not check their validity.

Comment 6 cvs-commit@developer.classpath.org 2006-02-13 15:22:49 UTC
Subject: Bug 26166

CVSROOT:	/cvsroot/classpath
Module name:	classpath
Branch: 	
Changes by:	Ito Kazumitsu <itokaz@savannah.gnu.org>	06/02/13 13:19:44

Modified files:
	.              : ChangeLog 
	gnu/regexp     : RE.java RESyntax.java RETokenOneOf.java 

Log message:
	2006-02-13  Ito Kazumitsu  <kaz@maczuka.gcd.org>
	
	Fixes bug #26166
	* gnu/regexp/RE.java(initialize): Parsing of character class expression
	was moved to a new method parseCharClass.
	(parseCharClass): New method originally in initialize. Added parsing
	of nested character classes.
	(ParseCharClassResult): New inner class used as a return value of
	parseCharClass.
	(getCharExpression),(getNamedProperty): Made static.
	* gnu/regexp/RESyntax.java(RE_NESTED_CHARCLASS): New syntax flag.
	* gnu/regexp/RETokenOneOf.java(addition): New Vector for storing
	nested character classes.
	(RETokenOneOf): New constructor accepting the Vector addition.
	(getMinimumLength), (getMaximumLength): Returns 1 if the token
	stands for only one character.
	(match): Added the processing of the Vector addition.
	(matchN), (matchP): Do not check next token if addition is used.

CVSWeb URLs:
http://cvs.savannah.gnu.org/viewcvs/classpath/classpath/ChangeLog.diff?tr1=1.6350&tr2=1.6351&r1=text&r2=text
http://cvs.savannah.gnu.org/viewcvs/classpath/classpath/gnu/regexp/RE.java.diff?tr1=1.17&tr2=1.18&r1=text&r2=text
http://cvs.savannah.gnu.org/viewcvs/classpath/classpath/gnu/regexp/RESyntax.java.diff?tr1=1.6&tr2=1.7&r1=text&r2=text
http://cvs.savannah.gnu.org/viewcvs/classpath/classpath/gnu/regexp/RETokenOneOf.java.diff?tr1=1.6&tr2=1.7&r1=text&r2=text



Comment 7 Ito Kazumitsu 2006-02-13 15:59:11 UTC
Fixed. 
Comment 8 Tom Tromey 2006-02-13 22:58:40 UTC
Subject: Bug 26166

Author: tromey
Date: Mon Feb 13 22:58:37 2006
New Revision: 110937

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=110937
Log:
2006-02-13  Ito Kazumitsu  <kaz@maczuka.gcd.org>

	Fixes bug #26166
	* gnu/regexp/RE.java(initialize): Parsing of character class expression
	was moved to a new method parseCharClass.
	(parseCharClass): New method originally in initialize. Added parsing
	of nested character classes.
	(ParseCharClassResult): New inner class used as a return value of
	parseCharClass.
	(getCharExpression),(getNamedProperty): Made static.
	* gnu/regexp/RESyntax.java(RE_NESTED_CHARCLASS): New syntax flag.
	* gnu/regexp/RETokenOneOf.java(addition): New Vector for storing
	nested character classes.
	(RETokenOneOf): New constructor accepting the Vector addition.
	(getMinimumLength), (getMaximumLength): Returns 1 if the token
	stands for only one character.
	(match): Added the processing of the Vector addition.
	(matchN), (matchP): Do not check next token if addition is used.

Modified:
    branches/gcc-4_1-branch/libjava/classpath/ChangeLog.gcj
    branches/gcc-4_1-branch/libjava/classpath/gnu/regexp/RE.java
    branches/gcc-4_1-branch/libjava/classpath/gnu/regexp/RESyntax.java
    branches/gcc-4_1-branch/libjava/classpath/gnu/regexp/RETokenOneOf.java