This is the mail archive of the java-patches@gcc.gnu.org mailing list for the Java project.
[4.1] Patch: FYI: big regex merge
- From: Tom Tromey <tromey at redhat dot com>
- To: Java Patch List <java-patches at gcc dot gnu dot org>
- Date: 10 Feb 2006 12:43:33 -0700
- Subject: [4.1] Patch: FYI: big regex merge
- Reply-to: tromey at redhat dot com
I'm checking this in on the 4.1 branch.
This merges in all the recent regex fixes from GNU Classpath.
Ordinarily I would not like to put in a big patch like this at the
last minute, but:
* It affects several real applications
* It fixes a large number of Mauve tests (more than 2000)
* Our current regular expression code is broken enough that
this is unlikely to cause regressions
* It is pure Java and so relatively safe
Tested on x86 FC4, including Mauve. Also Mark tested this against
Eclipse.
Tom
Index: ChangeLog
from Tom Tromey <tromey@redhat.com>
* java/lang/Character.java: Merged from Classpath.
(start, end): Now 'int's.
(canonicalName): New field.
(CANONICAL_NAME, NO_SPACES_NAME, CONSTANT_NAME): New constants.
(UnicodeBlock): Added argument.
(of): New overload.
(forName): New method.
Updated unicode blocks.
(sets): Updated.
* sources.am, Makefile.in: Rebuilt.
2006-01-13 Tom Tromey <tromey@redhat.com>
* gnu/regexp/MessagesBundle_fr.properties: Removed.
* gnu/regexp/MessagesBundle.properties: Removed.
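
The Character.java entry above adds a code-point overload of UnicodeBlock.of and a new UnicodeBlock.forName. The following is a minimal usage sketch written against the public java.lang.Character.UnicodeBlock API as documented for 1.5, not code from the patch; the class and helper names are made up for illustration.

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical demo class; only UnicodeBlock.of and UnicodeBlock.forName
// come from the API touched by the patch.
class UnicodeBlockLookupSketch {
    // Returns the constant-style name of the block containing codePoint,
    // or "none" when the code point is valid but belongs to no block.
    static String blockNameOf(int codePoint) {
        UnicodeBlock block = UnicodeBlock.of(codePoint);
        return block == null ? "none" : block.toString();
    }

    // True when the canonical, no-spaces and constant spellings all
    // resolve to the same block object.
    static boolean sameBlock(String canonical, String noSpaces, String constant) {
        UnicodeBlock first = UnicodeBlock.forName(canonical);
        return first == UnicodeBlock.forName(noSpaces)
            && first == UnicodeBlock.forName(constant);
    }

    // True when of(int) rejects the value as too large to be a code point.
    static boolean rejects(int value) {
        try {
            UnicodeBlock.of(value);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(blockNameOf('a'));       // BASIC_LATIN
        System.out.println(blockNameOf(0x0416));    // CYRILLIC
        System.out.println(sameBlock("Basic Latin", "BasicLatin", "BASIC_LATIN"));
        System.out.println(rejects(Character.MAX_CODE_POINT + 1));
    }
}
```

The lookup by name is case-insensitive, so "basic latin" should resolve as well; a name that matches no block raises IllegalArgumentException.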
Index: classpath/ChangeLog.gcj
from Ito Kazumitsu <kaz@maczuka.gcd.org>
Fixes bug #26112
* gnu/regexp/RE.java(REG_REPLACE_USE_BACKSLASHESCAPE): New execution
flag which enables backslash escape in a replacement.
(getReplacement): New public static method.
(substituteImpl),(substituteAllImpl): Use getReplacement.
* gnu/regexp/REMatch.java(substituteInto): Replace $n even if n>=10.
* java/util/regex/Matcher.java(appendReplacement)
Use RE#getReplacement.
(replaceFirst),(replaceAll): Use RE.REG_REPLACE_USE_BACKSLASHESCAPE.
2006-02-06 Ito Kazumitsu <kaz@maczuka.gcd.org>
* java/util/regex/Matcher.java(matches):
set RE.REG_TRY_ENTIRE_MATCH as an execution flag of getMatch.
2006-02-06 Ito Kazumitsu <kaz@maczuka.gcd.org>
Fixes bug #25812
* gnu/regexp/CharIndexed.java(lookBehind),(length): New methods.
* gnu/regexp/CharIndexedCharArray.java
(lookBehind),(length): Implemented.
* gnu/regexp/CharIndexedInputStream.java: Likewise.
* gnu/regexp/CharIndexedString.java: Likewise.
* gnu/regexp/CharIndexedStringBuffer.java: Likewise.
* gnu/regexp/REToken.java(getMaximumLength): New method.
* gnu/regexp/RE.java(internal constructor RE): Added new argument
maxLength.
(initialize): Parse (?<=X), (?<!X), (?>X).
(getMaximumLength): Implemented.
* gnu/regexp/RETokenAny.java(getMaximumLength): Implemented.
* gnu/regexp/RETokenChar.java: Likewise.
* gnu/regexp/RETokenEnd.java: Likewise.
* gnu/regexp/RETokenEndSub.java: Likewise.
* gnu/regexp/RETokenLookAhead.java: Likewise.
* gnu/regexp/RETokenNamedProperty.java: Likewise.
* gnu/regexp/RETokenOneOf.java: Likewise.
* gnu/regexp/RETokenPOSIX.java: Likewise.
* gnu/regexp/RETokenRange.java: Likewise.
* gnu/regexp/RETokenRepeated.java: Likewise.
* gnu/regexp/RETokenStart.java: Likewise.
* gnu/regexp/RETokenWordBoundary.java: Likewise.
* gnu/regexp/RETokenIndependent.java: New file.
* gnu/regexp/RETokenLookBehind.java: New file.
2006-02-04 Ito Kazumitsu <kaz@maczuka.gcd.org>
* gnu/regexp/RETokenNamedProperty.java(getHandler): Check for
a Unicode block if the name starts with "In".
(UnicodeBlockHandler): New inner class.
2006-02-02 Ito Kazumitsu <kaz@maczuka.gcd.org>
* gnu/regexp/REMatch.java(REMatchList): New inner utility class
for making a list of REMatch instances.
* gnu/regexp/RETokenOneOf.java(match): Rewritten using REMatchList.
* gnu/regexp/RETokenRepeated.java(findDoables): New method.
(match): Rewritten using REMatchList.
(matchRest): Rewritten using REMatchList.
2006-02-01 Mark Wielaard <mark@klomp.org>
* gnu/regexp/RE.java (getRETokenNamedProperty): Chain exception.
* gnu/regexp/RETokenNamedProperty.java (LETTER, MARK, SEPARATOR,
SYMBOL, NUMBER, PUNCTUATION, OTHER): New final byte[] fields.
(getHandler): Check for grouped properties L, M, Z, S, N, P or C.
(UnicodeCategoriesHandler): New private static class.
2006-01-31 Mark Wielaard <mark@klomp.org>
* java/net/URI.java (getURIGroup): Check for null to see whether
group actually exists.
2006-01-31 Ito Kazumitsu <kaz@maczuka.gcd.org>
Fixes bug #22873
* gnu/regexp/REMatch.java(toString(int)): Throw IndexOutOfBoundsException
for an invalid index and return null for a skipped group.
2006-01-31 Ito Kazumitsu <kaz@maczuka.gcd.org>
Fixes bug #26002
* gnu/regexp/RE.java(initialize): Parse /\p{prop}/.
(NamedProperty): New inner class.
(getNamedProperty): New method.
(getRETokenNamedProperty): New Method.
* gnu/regexp/RESyntax.java(RE_NAMED_PROPERTY): New syntax flag.
* gnu/regexp/RETokenNamedProperty.java: New file.
2006-01-30 Ito Kazumitsu <kaz@maczuka.gcd.org>
Fixes bug #24876
* gnu/regexp/RE.java(REG_TRY_ENTIRE_MATCH):
New execution flag.
(getMatchImpl): if REG_TRY_ENTIRE_MATCH is set, add an
implicit RETokenEnd at the end of the regexp chain.
Do not select the longest match, but select the first match.
(match): Do not take care of REMatch.empty.
* gnu/regexp/REMatch.java(empty): To be used only in RETokenRepeated.
* gnu/regexp/RETokenOneOf.java: Corrected a typo in a comment.
* gnu/regexp/RETokenBackRef.java: Do not take care of REMatch.empty.
* gnu/regexp/RETokenRepeated.java (match): Rewrote stingy matching.
Do not take care of REMatch.empty. Set and check REMatch.empty
when trying to match the single token.
2006-01-24 Tom Tromey <tromey@redhat.com>
* java/util/regex/PatternSyntaxException.java: Added @since.
* java/util/regex/Matcher.java (Matcher): Implements MatchResult.
* java/util/regex/MatchResult.java: New file.
2006-01-23 Ito Kazumitsu <kaz@maczuka.gcd.org>
* gnu/regexp/REToken.java(empty): Made Cloneable.
* gnu/regexp/RETokenOneOf.java(match): RE.java(match):
Use separate methods matchN and matchP depending on the
boolean negative.
(matchN): New method used when negative. Done as before.
(matchP): New method used when not negative. Each token is
tried not by itself but by a clone of it.
2006-01-22 Ito Kazumitsu <kaz@maczuka.gcd.org>
Fixes bug #25837
* gnu/regexp/REMatch.java(empty): New boolean indicating
an empty string matched.
* gnu/regexp/RE.java(match): Sets empty flag when an empty
string matched.
(initialize): Support back reference \10, \11, and so on.
(parseInt): renamed from getEscapedChar and returns int.
* gnu/regexp/RETokenRepeated.java(match): Sets empty flag
when an empty string matched. Fixed a bug of the case where
an empty string matched. Added special handling of {0}.
* gnu/regexp/RETokenBackRef.java(match): Sets empty flag
when an empty string matched. Fixed the case insensitive matching.
2006-01-19 Ito Kazumitsu <kaz@maczuka.gcd.org>
Fixes bug #23212
* gnu/regexp/RE.java(initialize): Support escaped characters such as
\0123, \x1B, \u1234.
(getEscapedChar): New method.
(CharExpression): New inner class.
(getCharExpression): New Method.
* gnu/regexp/RESyntax.java(RE_OCTAL_CHAR, RE_HEX_CHAR,
RE_UNICODE_CHAR): New syntax bits.
2006-01-17 Ito Kazumitsu <kaz@maczuka.gcd.org>
Fixes bug #25817
* gnu/regexp/RETokenRange.java(constructor):
Keep lo and hi as they are.
(match): Changed the case insensitive comparison.
2006-01-17 Ito Kazumitsu <kaz@maczuka.gcd.org>
* gnu/regexp/RETokenChar.java(chain):
Do not concatenate tokens whose insens flags are different.
2006-01-16 Ito Kazumitsu <kaz@maczuka.gcd.org>
Fixes bug #22884
* gnu/regexp/RE.java(initialize): Parse embedded flags.
* gnu/regexp/RESyntax.java(RE_EMBEDDED_FLAGS): New syntax bit.
2006-01-13 Mark Wielaard <mark@klomp.org>
* java/util/regex/Pattern.java (Pattern): Chain REException.
2006-01-12 Ito Kazumitsu <kaz@maczuka.gcd.org>
Fixes bug #22802
* gnu/regexp/RE.java(initialize): Fixed the parsing of
character classes within a subexpression.
2006-01-08 Ito Kazumitsu <kaz@maczuka.gcd.org>
Fixes bug #25679
* gnu/regexp/RETokenRepeated.java(match): Optimized the case
when an empty string matched an empty token.
2006-01-06 Ito Kazumitsu <kaz@maczuka.gcd.org>
Fixes bug #25616
* gnu/regexp/RE.java(initialize): Allow repeat.empty.token.
* gnu/regexp/RETokenRepeated.java(match): Break the loop
when an empty string matched an empty token.
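
The ChangeLog entries above are terse, so here is a rough sketch of the pattern syntax they refer to: look-behind and independent groups, named properties, whole-input matching, replacement escapes, embedded flags and escaped characters. It goes through the public java.util.regex API rather than gnu.regexp directly, the class and helper names are made up, and the results noted in the comments are those specified for java.util.regex; they have not been checked against the 4.1 branch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class exercising syntax named in the ChangeLog.
class RegexMergeSketch {
    // Index of the first match of regex in input, or -1 when there is none.
    static int firstMatchIndex(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // Look-behind, (?<=X) and (?<!X): "bar" preceded / not preceded by "foo".
        System.out.println(firstMatchIndex("(?<=foo)bar", "foobar")); // 3
        System.out.println(firstMatchIndex("(?<!foo)bar", "foobar")); // -1

        // Independent group, (?>X): the group keeps all the 'a's it took,
        // so nothing is left for the trailing "ab".
        System.out.println(firstMatchIndex("(?>a+)ab", "aaab")); // -1
        System.out.println(firstMatchIndex("a+ab", "aaab"));     // 0

        // Named properties: a Unicode block ("In" prefix) and the grouped
        // categories L (letters) and N (numbers).
        System.out.println(Pattern.matches("\\p{InGreek}+", "\u03b1\u03b2\u03b3")); // true
        System.out.println(Pattern.matches("\\p{L}+\\p{N}+", "abc123"));            // true

        // matches() must cover the entire input, so the second alternative
        // is chosen even though the first one matches a prefix.
        System.out.println(Pattern.compile("a|ab").matcher("ab").matches()); // true

        // Replacement strings: a backslash escapes '$', and $10 refers to
        // the tenth group when that group exists.
        System.out.println("a.b".replaceAll("\\.", "\\$")); // a$b
        System.out.println("abcdefghij".replaceAll(
            "(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)", "$10"));      // j

        // Embedded flag (?i), plus octal, hex and Unicode character escapes.
        System.out.println(Pattern.matches("(?i)abc", "ABC"));             // true
        System.out.println(Pattern.matches("\\0101\\x42\\u0043", "ABC"));  // true
    }
}
```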
Index: gnu/regexp/MessagesBundle_fr.properties
===================================================================
--- gnu/regexp/MessagesBundle_fr.properties (revision 110832)
+++ gnu/regexp/MessagesBundle_fr.properties (working copy)
@@ -1,22 +0,0 @@
-# Localized error messages for gnu.regexp
-
-# Prefix for REException messages
-error.prefix=A l''index {0} dans le modèle d''expression régulière:
-
-# REException (parse error) messages
-repeat.assertion=l'élément répété est de largeur zéro
-repeat.chained=tentative de répétition d'un élément déjà répété
-repeat.no.token=quantifieur (?*+{}) sans élément précédent
-repeat.empty.token=l'élément répété peut être vide
-unmatched.brace=accolade inégalée
-unmatched.bracket=crochet inégalé
-unmatched.paren=parenthèse inégalée
-interval.no.end=fin d'interval attendue
-class.no.end=fin de classe de caractères attendue
-subexpr.no.end=fin de sous-expression attendue
-interval.order=l'interval minimum est supérieur à l'interval maximum
-interval.error=l'interval est vide ou contient des caractères illégaux
-ends.with.backslash=antislash à la fin du modèle
-
-# RESyntax message
-syntax.final=La syntaxe a été déclarée finale et ne peut pas être modifiée
Index: gnu/regexp/MessagesBundle.properties
===================================================================
--- gnu/regexp/MessagesBundle.properties (revision 110832)
+++ gnu/regexp/MessagesBundle.properties (working copy)
@@ -1,22 +0,0 @@
-# Localized error messages for gnu.regexp
-
-# Prefix for REException messages
-error.prefix=At position {0} in regular expression pattern:
-
-# REException (parse error) messages
-repeat.assertion=repeated token is zero-width assertion
-repeat.chained=attempted to repeat a token that is already repeated
-repeat.no.token=quantifier (?*+{}) without preceding token
-repeat.empty.token=repeated token may be empty
-unmatched.brace=unmatched brace
-unmatched.bracket=unmatched bracket
-unmatched.paren=unmatched parenthesis
-interval.no.end=expected end of interval
-class.no.end=expected end of character class
-subexpr.no.end=expected end of subexpression
-interval.order=interval minimum is greater than maximum
-interval.error=interval is empty or contains illegal characters
-ends.with.backslash=backslash at end of pattern
-
-# RESyntax message
-syntax.final=Syntax has been declared final and cannot be modified
Index: java/lang/Character.java
===================================================================
--- java/lang/Character.java (revision 110832)
+++ java/lang/Character.java (working copy)
@@ -48,6 +48,8 @@
package java.lang;
import java.io.Serializable;
+import java.text.Collator;
+import java.util.Locale;
/**
* Wrapper class for the primitive char data type. In addition, this class
@@ -150,11 +152,19 @@
public static final class UnicodeBlock extends Subset
{
/** The start of the subset. */
- private final char start;
+ private final int start;
/** The end of the subset. */
- private final char end;
+ private final int end;
+ /** The canonical name of the block according to the Unicode standard. */
+ private final String canonicalName;
+
+ /** Constants for the <code>forName()</code> method */
+ private static final int CANONICAL_NAME = 0;
+ private static final int NO_SPACES_NAME = 1;
+ private static final int CONSTANT_NAME = 2;
+
/**
* Constructor for strictly defined blocks.
*
@@ -162,24 +172,43 @@
* @param end the end character of the range
* @param name the block name
*/
- private UnicodeBlock(char start, char end, String name)
+ private UnicodeBlock(int start, int end, String name,
+ String canonicalName)
{
super(name);
this.start = start;
this.end = end;
+ this.canonicalName = canonicalName;
}
/**
* Returns the Unicode character block which a character belongs to.
+ * <strong>Note</strong>: This method does not support the use of
+ * supplementary characters. For such support, <code>of(int)</code>
+ * should be used instead.
*
* @param ch the character to look up
* @return the set it belongs to, or null if it is not in one
*/
public static UnicodeBlock of(char ch)
{
- // Special case, since SPECIALS contains two ranges.
- if (ch == '\uFEFF')
- return SPECIALS;
+ return of((int) ch);
+ }
+
+ /**
+ * Returns the Unicode character block which a code point belongs to.
+ *
+ * @param codePoint the character to look up
+ * @return the set it belongs to, or null if it is not in one.
+ * @throws IllegalArgumentException if the specified code point is
+ * invalid.
+ * @since 1.5
+ */
+ public static UnicodeBlock of(int codePoint)
+ {
+ if (codePoint > MAX_CODE_POINT)
+ throw new IllegalArgumentException("The supplied integer value is " +
+ "too large to be a codepoint.");
// Simple binary search for the correct block.
int low = 0;
int hi = sets.length - 1;
@@ -187,9 +216,9 @@
{
int mid = (low + hi) >> 1;
UnicodeBlock b = sets[mid];
- if (ch < b.start)
+ if (codePoint < b.start)
hi = mid - 1;
- else if (ch > b.end)
+ else if (codePoint > b.end)
low = mid + 1;
else
return b;
@@ -198,705 +227,1302 @@
}
/**
+ * <p>
+ * Returns the <code>UnicodeBlock</code> with the given name, as defined
+ * by the Unicode standard. The version of Unicode in use is defined by
+ * the <code>Character</code> class, and the names are given in the
+ * <code>Blocks-<version>.txt</code> file corresponding to that version.
+ * The name may be specified in one of three ways:
+ * </p>
+ * <ol>
+ * <li>The canonical, human-readable name used by the Unicode standard.
+ * This is the name with all spaces and hyphens retained. For example,
+ * `Basic Latin' retrieves the block, UnicodeBlock.BASIC_LATIN.</li>
+ * <li>The canonical name with all spaces removed e.g. `BasicLatin'.</li>
+ * <li>The name used for the constants specified by this class, which
+ * is the canonical name with all spaces and hyphens replaced with
+ * underscores e.g. `BASIC_LATIN'</li>
+ * </ol>
+ * <p>
+ * The names are compared case-insensitively using the case comparison
+ * associated with the U.S. English locale. The method recognises the
+ * previous names used for blocks as well as the current ones. At
+ * present, this simply means that the deprecated `SURROGATES_AREA'
+ * will be recognised by this method (the <code>of()</code> methods
+ * only return one of the three new surrogate blocks).
+ * </p>
+ *
+ * @param blockName the name of the block to look up.
+ * @return the specified block.
+ * @throws NullPointerException if the <code>blockName</code> is
+ * <code>null</code>.
+ * @throws IllegalArgumentException if the name does not match any Unicode
+ * block.
+ * @since 1.5
+ */
+ public static final UnicodeBlock forName(String blockName)
+ {
+ int type;
+ if (blockName.indexOf(' ') != -1)
+ type = CANONICAL_NAME;
+ else if (blockName.indexOf('_') != -1)
+ type = CONSTANT_NAME;
+ else
+ type = NO_SPACES_NAME;
+ Collator usCollator = Collator.getInstance(Locale.US);
+ usCollator.setStrength(Collator.PRIMARY);
+ /* Special case for deprecated blocks not in sets */
+ switch (type)
+ {
+ case CANONICAL_NAME:
+ if (usCollator.compare(blockName, "Surrogates Area") == 0)
+ return SURROGATES_AREA;
+ break;
+ case NO_SPACES_NAME:
+ if (usCollator.compare(blockName, "SurrogatesArea") == 0)
+ return SURROGATES_AREA;
+ break;
+ case CONSTANT_NAME:
+ if (usCollator.compare(blockName, "SURROGATES_AREA") == 0)
+ return SURROGATES_AREA;
+ break;
+ }
+ /* Other cases */
+ int setLength = sets.length;
+ switch (type)
+ {
+ case CANONICAL_NAME:
+ for (int i = 0; i < setLength; i++)
+ {
+ UnicodeBlock block = sets[i];
+ if (usCollator.compare(blockName, block.canonicalName) == 0)
+ return block;
+ }
+ break;
+ case NO_SPACES_NAME:
+ for (int i = 0; i < setLength; i++)
+ {
+ UnicodeBlock block = sets[i];
+ String nsName = block.canonicalName.replaceAll(" ","");
+ if (usCollator.compare(blockName, nsName) == 0)
+ return block;
+ }
+ break;
+ case CONSTANT_NAME:
+ for (int i = 0; i < setLength; i++)
+ {
+ UnicodeBlock block = sets[i];
+ if (usCollator.compare(blockName, block.toString()) == 0)
+ return block;
+ }
+ break;
+ }
+ throw new IllegalArgumentException("No Unicode block found for " +
+ blockName + ".");
+ }
+
+ /**
* Basic Latin.
- * '\u0000' - '\u007F'.
+ * 0x0000 - 0x007F.
*/
public static final UnicodeBlock BASIC_LATIN
- = new UnicodeBlock('\u0000', '\u007F',
- "BASIC_LATIN");
+ = new UnicodeBlock(0x0000, 0x007F,
+ "BASIC_LATIN",
+ "Basic Latin");
/**
* Latin-1 Supplement.
- * '\u0080' - '\u00FF'.
+ * 0x0080 - 0x00FF.
*/
public static final UnicodeBlock LATIN_1_SUPPLEMENT
- = new UnicodeBlock('\u0080', '\u00FF',
- "LATIN_1_SUPPLEMENT");
+ = new UnicodeBlock(0x0080, 0x00FF,
+ "LATIN_1_SUPPLEMENT",
+ "Latin-1 Supplement");
/**
* Latin Extended-A.
- * '\u0100' - '\u017F'.
+ * 0x0100 - 0x017F.
*/
public static final UnicodeBlock LATIN_EXTENDED_A
- = new UnicodeBlock('\u0100', '\u017F',
- "LATIN_EXTENDED_A");
+ = new UnicodeBlock(0x0100, 0x017F,
+ "LATIN_EXTENDED_A",
+ "Latin Extended-A");
/**
* Latin Extended-B.
- * '\u0180' - '\u024F'.
+ * 0x0180 - 0x024F.
*/
public static final UnicodeBlock LATIN_EXTENDED_B
- = new UnicodeBlock('\u0180', '\u024F',
- "LATIN_EXTENDED_B");
+ = new UnicodeBlock(0x0180, 0x024F,
+ "LATIN_EXTENDED_B",
+ "Latin Extended-B");
/**
* IPA Extensions.
- * '\u0250' - '\u02AF'.
+ * 0x0250 - 0x02AF.
*/
public static final UnicodeBlock IPA_EXTENSIONS
- = new UnicodeBlock('\u0250', '\u02AF',
- "IPA_EXTENSIONS");
+ = new UnicodeBlock(0x0250, 0x02AF,
+ "IPA_EXTENSIONS",
+ "IPA Extensions");
/**
* Spacing Modifier Letters.
- * '\u02B0' - '\u02FF'.
+ * 0x02B0 - 0x02FF.
*/
public static final UnicodeBlock SPACING_MODIFIER_LETTERS
- = new UnicodeBlock('\u02B0', '\u02FF',
- "SPACING_MODIFIER_LETTERS");
+ = new UnicodeBlock(0x02B0, 0x02FF,
+ "SPACING_MODIFIER_LETTERS",
+ "Spacing Modifier Letters");
/**
* Combining Diacritical Marks.
- * '\u0300' - '\u036F'.
+ * 0x0300 - 0x036F.
*/
public static final UnicodeBlock COMBINING_DIACRITICAL_MARKS
- = new UnicodeBlock('\u0300', '\u036F',
- "COMBINING_DIACRITICAL_MARKS");
+ = new UnicodeBlock(0x0300, 0x036F,
+ "COMBINING_DIACRITICAL_MARKS",
+ "Combining Diacritical Marks");
/**
* Greek.
- * '\u0370' - '\u03FF'.
+ * 0x0370 - 0x03FF.
*/
public static final UnicodeBlock GREEK
- = new UnicodeBlock('\u0370', '\u03FF',
- "GREEK");
+ = new UnicodeBlock(0x0370, 0x03FF,
+ "GREEK",
+ "Greek");
/**
* Cyrillic.
- * '\u0400' - '\u04FF'.
+ * 0x0400 - 0x04FF.
*/
public static final UnicodeBlock CYRILLIC
- = new UnicodeBlock('\u0400', '\u04FF',
- "CYRILLIC");
+ = new UnicodeBlock(0x0400, 0x04FF,
+ "CYRILLIC",
+ "Cyrillic");
/**
+ * Cyrillic Supplementary.
+ * 0x0500 - 0x052F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock CYRILLIC_SUPPLEMENTARY
+ = new UnicodeBlock(0x0500, 0x052F,
+ "CYRILLIC_SUPPLEMENTARY",
+ "Cyrillic Supplementary");
+
+ /**
* Armenian.
- * '\u0530' - '\u058F'.
+ * 0x0530 - 0x058F.
*/
public static final UnicodeBlock ARMENIAN
- = new UnicodeBlock('\u0530', '\u058F',
- "ARMENIAN");
+ = new UnicodeBlock(0x0530, 0x058F,
+ "ARMENIAN",
+ "Armenian");
/**
* Hebrew.
- * '\u0590' - '\u05FF'.
+ * 0x0590 - 0x05FF.
*/
public static final UnicodeBlock HEBREW
- = new UnicodeBlock('\u0590', '\u05FF',
- "HEBREW");
+ = new UnicodeBlock(0x0590, 0x05FF,
+ "HEBREW",
+ "Hebrew");
/**
* Arabic.
- * '\u0600' - '\u06FF'.
+ * 0x0600 - 0x06FF.
*/
public static final UnicodeBlock ARABIC
- = new UnicodeBlock('\u0600', '\u06FF',
- "ARABIC");
+ = new UnicodeBlock(0x0600, 0x06FF,
+ "ARABIC",
+ "Arabic");
/**
* Syriac.
- * '\u0700' - '\u074F'.
+ * 0x0700 - 0x074F.
* @since 1.4
*/
public static final UnicodeBlock SYRIAC
- = new UnicodeBlock('\u0700', '\u074F',
- "SYRIAC");
+ = new UnicodeBlock(0x0700, 0x074F,
+ "SYRIAC",
+ "Syriac");
/**
* Thaana.
- * '\u0780' - '\u07BF'.
+ * 0x0780 - 0x07BF.
* @since 1.4
*/
public static final UnicodeBlock THAANA
- = new UnicodeBlock('\u0780', '\u07BF',
- "THAANA");
+ = new UnicodeBlock(0x0780, 0x07BF,
+ "THAANA",
+ "Thaana");
/**
* Devanagari.
- * '\u0900' - '\u097F'.
+ * 0x0900 - 0x097F.
*/
public static final UnicodeBlock DEVANAGARI
- = new UnicodeBlock('\u0900', '\u097F',
- "DEVANAGARI");
+ = new UnicodeBlock(0x0900, 0x097F,
+ "DEVANAGARI",
+ "Devanagari");
/**
* Bengali.
- * '\u0980' - '\u09FF'.
+ * 0x0980 - 0x09FF.
*/
public static final UnicodeBlock BENGALI
- = new UnicodeBlock('\u0980', '\u09FF',
- "BENGALI");
+ = new UnicodeBlock(0x0980, 0x09FF,
+ "BENGALI",
+ "Bengali");
/**
* Gurmukhi.
- * '\u0A00' - '\u0A7F'.
+ * 0x0A00 - 0x0A7F.
*/
public static final UnicodeBlock GURMUKHI
- = new UnicodeBlock('\u0A00', '\u0A7F',
- "GURMUKHI");
+ = new UnicodeBlock(0x0A00, 0x0A7F,
+ "GURMUKHI",
+ "Gurmukhi");
/**
* Gujarati.
- * '\u0A80' - '\u0AFF'.
+ * 0x0A80 - 0x0AFF.
*/
public static final UnicodeBlock GUJARATI
- = new UnicodeBlock('\u0A80', '\u0AFF',
- "GUJARATI");
+ = new UnicodeBlock(0x0A80, 0x0AFF,
+ "GUJARATI",
+ "Gujarati");
/**
* Oriya.
- * '\u0B00' - '\u0B7F'.
+ * 0x0B00 - 0x0B7F.
*/
public static final UnicodeBlock ORIYA
- = new UnicodeBlock('\u0B00', '\u0B7F',
- "ORIYA");
+ = new UnicodeBlock(0x0B00, 0x0B7F,
+ "ORIYA",
+ "Oriya");
/**
* Tamil.
- * '\u0B80' - '\u0BFF'.
+ * 0x0B80 - 0x0BFF.
*/
public static final UnicodeBlock TAMIL
- = new UnicodeBlock('\u0B80', '\u0BFF',
- "TAMIL");
+ = new UnicodeBlock(0x0B80, 0x0BFF,
+ "TAMIL",
+ "Tamil");
/**
* Telugu.
- * '\u0C00' - '\u0C7F'.
+ * 0x0C00 - 0x0C7F.
*/
public static final UnicodeBlock TELUGU
- = new UnicodeBlock('\u0C00', '\u0C7F',
- "TELUGU");
+ = new UnicodeBlock(0x0C00, 0x0C7F,
+ "TELUGU",
+ "Telugu");
/**
* Kannada.
- * '\u0C80' - '\u0CFF'.
+ * 0x0C80 - 0x0CFF.
*/
public static final UnicodeBlock KANNADA
- = new UnicodeBlock('\u0C80', '\u0CFF',
- "KANNADA");
+ = new UnicodeBlock(0x0C80, 0x0CFF,
+ "KANNADA",
+ "Kannada");
/**
* Malayalam.
- * '\u0D00' - '\u0D7F'.
+ * 0x0D00 - 0x0D7F.
*/
public static final UnicodeBlock MALAYALAM
- = new UnicodeBlock('\u0D00', '\u0D7F',
- "MALAYALAM");
+ = new UnicodeBlock(0x0D00, 0x0D7F,
+ "MALAYALAM",
+ "Malayalam");
/**
* Sinhala.
- * '\u0D80' - '\u0DFF'.
+ * 0x0D80 - 0x0DFF.
* @since 1.4
*/
public static final UnicodeBlock SINHALA
- = new UnicodeBlock('\u0D80', '\u0DFF',
- "SINHALA");
+ = new UnicodeBlock(0x0D80, 0x0DFF,
+ "SINHALA",
+ "Sinhala");
/**
* Thai.
- * '\u0E00' - '\u0E7F'.
+ * 0x0E00 - 0x0E7F.
*/
public static final UnicodeBlock THAI
- = new UnicodeBlock('\u0E00', '\u0E7F',
- "THAI");
+ = new UnicodeBlock(0x0E00, 0x0E7F,
+ "THAI",
+ "Thai");
/**
* Lao.
- * '\u0E80' - '\u0EFF'.
+ * 0x0E80 - 0x0EFF.
*/
public static final UnicodeBlock LAO
- = new UnicodeBlock('\u0E80', '\u0EFF',
- "LAO");
+ = new UnicodeBlock(0x0E80, 0x0EFF,
+ "LAO",
+ "Lao");
/**
* Tibetan.
- * '\u0F00' - '\u0FFF'.
+ * 0x0F00 - 0x0FFF.
*/
public static final UnicodeBlock TIBETAN
- = new UnicodeBlock('\u0F00', '\u0FFF',
- "TIBETAN");
+ = new UnicodeBlock(0x0F00, 0x0FFF,
+ "TIBETAN",
+ "Tibetan");
/**
* Myanmar.
- * '\u1000' - '\u109F'.
+ * 0x1000 - 0x109F.
* @since 1.4
*/
public static final UnicodeBlock MYANMAR
- = new UnicodeBlock('\u1000', '\u109F',
- "MYANMAR");
+ = new UnicodeBlock(0x1000, 0x109F,
+ "MYANMAR",
+ "Myanmar");
/**
* Georgian.
- * '\u10A0' - '\u10FF'.
+ * 0x10A0 - 0x10FF.
*/
public static final UnicodeBlock GEORGIAN
- = new UnicodeBlock('\u10A0', '\u10FF',
- "GEORGIAN");
+ = new UnicodeBlock(0x10A0, 0x10FF,
+ "GEORGIAN",
+ "Georgian");
/**
* Hangul Jamo.
- * '\u1100' - '\u11FF'.
+ * 0x1100 - 0x11FF.
*/
public static final UnicodeBlock HANGUL_JAMO
- = new UnicodeBlock('\u1100', '\u11FF',
- "HANGUL_JAMO");
+ = new UnicodeBlock(0x1100, 0x11FF,
+ "HANGUL_JAMO",
+ "Hangul Jamo");
/**
* Ethiopic.
- * '\u1200' - '\u137F'.
+ * 0x1200 - 0x137F.
* @since 1.4
*/
public static final UnicodeBlock ETHIOPIC
- = new UnicodeBlock('\u1200', '\u137F',
- "ETHIOPIC");
+ = new UnicodeBlock(0x1200, 0x137F,
+ "ETHIOPIC",
+ "Ethiopic");
/**
* Cherokee.
- * '\u13A0' - '\u13FF'.
+ * 0x13A0 - 0x13FF.
* @since 1.4
*/
public static final UnicodeBlock CHEROKEE
- = new UnicodeBlock('\u13A0', '\u13FF',
- "CHEROKEE");
+ = new UnicodeBlock(0x13A0, 0x13FF,
+ "CHEROKEE",
+ "Cherokee");
/**
* Unified Canadian Aboriginal Syllabics.
- * '\u1400' - '\u167F'.
+ * 0x1400 - 0x167F.
* @since 1.4
*/
public static final UnicodeBlock UNIFIED_CANADIAN_ABORIGINAL_SYLLABICS
- = new UnicodeBlock('\u1400', '\u167F',
- "UNIFIED_CANADIAN_ABORIGINAL_SYLLABICS");
+ = new UnicodeBlock(0x1400, 0x167F,
+ "UNIFIED_CANADIAN_ABORIGINAL_SYLLABICS",
+ "Unified Canadian Aboriginal Syllabics");
/**
* Ogham.
- * '\u1680' - '\u169F'.
+ * 0x1680 - 0x169F.
* @since 1.4
*/
public static final UnicodeBlock OGHAM
- = new UnicodeBlock('\u1680', '\u169F',
- "OGHAM");
+ = new UnicodeBlock(0x1680, 0x169F,
+ "OGHAM",
+ "Ogham");
/**
* Runic.
- * '\u16A0' - '\u16FF'.
+ * 0x16A0 - 0x16FF.
* @since 1.4
*/
public static final UnicodeBlock RUNIC
- = new UnicodeBlock('\u16A0', '\u16FF',
- "RUNIC");
+ = new UnicodeBlock(0x16A0, 0x16FF,
+ "RUNIC",
+ "Runic");
/**
+ * Tagalog.
+ * 0x1700 - 0x171F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock TAGALOG
+ = new UnicodeBlock(0x1700, 0x171F,
+ "TAGALOG",
+ "Tagalog");
+
+ /**
+ * Hanunoo.
+ * 0x1720 - 0x173F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock HANUNOO
+ = new UnicodeBlock(0x1720, 0x173F,
+ "HANUNOO",
+ "Hanunoo");
+
+ /**
+ * Buhid.
+ * 0x1740 - 0x175F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock BUHID
+ = new UnicodeBlock(0x1740, 0x175F,
+ "BUHID",
+ "Buhid");
+
+ /**
+ * Tagbanwa.
+ * 0x1760 - 0x177F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock TAGBANWA
+ = new UnicodeBlock(0x1760, 0x177F,
+ "TAGBANWA",
+ "Tagbanwa");
+
+ /**
* Khmer.
- * '\u1780' - '\u17FF'.
+ * 0x1780 - 0x17FF.
* @since 1.4
*/
public static final UnicodeBlock KHMER
- = new UnicodeBlock('\u1780', '\u17FF',
- "KHMER");
+ = new UnicodeBlock(0x1780, 0x17FF,
+ "KHMER",
+ "Khmer");
/**
* Mongolian.
- * '\u1800' - '\u18AF'.
+ * 0x1800 - 0x18AF.
* @since 1.4
*/
public static final UnicodeBlock MONGOLIAN
- = new UnicodeBlock('\u1800', '\u18AF',
- "MONGOLIAN");
+ = new UnicodeBlock(0x1800, 0x18AF,
+ "MONGOLIAN",
+ "Mongolian");
/**
+ * Limbu.
+ * 0x1900 - 0x194F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock LIMBU
+ = new UnicodeBlock(0x1900, 0x194F,
+ "LIMBU",
+ "Limbu");
+
+ /**
+ * Tai Le.
+ * 0x1950 - 0x197F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock TAI_LE
+ = new UnicodeBlock(0x1950, 0x197F,
+ "TAI_LE",
+ "Tai Le");
+
+ /**
+ * Khmer Symbols.
+ * 0x19E0 - 0x19FF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock KHMER_SYMBOLS
+ = new UnicodeBlock(0x19E0, 0x19FF,
+ "KHMER_SYMBOLS",
+ "Khmer Symbols");
+
+ /**
+ * Phonetic Extensions.
+ * 0x1D00 - 0x1D7F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock PHONETIC_EXTENSIONS
+ = new UnicodeBlock(0x1D00, 0x1D7F,
+ "PHONETIC_EXTENSIONS",
+ "Phonetic Extensions");
+
+ /**
* Latin Extended Additional.
- * '\u1E00' - '\u1EFF'.
+ * 0x1E00 - 0x1EFF.
*/
public static final UnicodeBlock LATIN_EXTENDED_ADDITIONAL
- = new UnicodeBlock('\u1E00', '\u1EFF',
- "LATIN_EXTENDED_ADDITIONAL");
+ = new UnicodeBlock(0x1E00, 0x1EFF,
+ "LATIN_EXTENDED_ADDITIONAL",
+ "Latin Extended Additional");
/**
* Greek Extended.
- * '\u1F00' - '\u1FFF'.
+ * 0x1F00 - 0x1FFF.
*/
public static final UnicodeBlock GREEK_EXTENDED
- = new UnicodeBlock('\u1F00', '\u1FFF',
- "GREEK_EXTENDED");
+ = new UnicodeBlock(0x1F00, 0x1FFF,
+ "GREEK_EXTENDED",
+ "Greek Extended");
/**
* General Punctuation.
- * '\u2000' - '\u206F'.
+ * 0x2000 - 0x206F.
*/
public static final UnicodeBlock GENERAL_PUNCTUATION
- = new UnicodeBlock('\u2000', '\u206F',
- "GENERAL_PUNCTUATION");
+ = new UnicodeBlock(0x2000, 0x206F,
+ "GENERAL_PUNCTUATION",
+ "General Punctuation");
/**
* Superscripts and Subscripts.
- * '\u2070' - '\u209F'.
+ * 0x2070 - 0x209F.
*/
public static final UnicodeBlock SUPERSCRIPTS_AND_SUBSCRIPTS
- = new UnicodeBlock('\u2070', '\u209F',
- "SUPERSCRIPTS_AND_SUBSCRIPTS");
+ = new UnicodeBlock(0x2070, 0x209F,
+ "SUPERSCRIPTS_AND_SUBSCRIPTS",
+ "Superscripts and Subscripts");
/**
* Currency Symbols.
- * '\u20A0' - '\u20CF'.
+ * 0x20A0 - 0x20CF.
*/
public static final UnicodeBlock CURRENCY_SYMBOLS
- = new UnicodeBlock('\u20A0', '\u20CF',
- "CURRENCY_SYMBOLS");
+ = new UnicodeBlock(0x20A0, 0x20CF,
+ "CURRENCY_SYMBOLS",
+ "Currency Symbols");
/**
* Combining Marks for Symbols.
- * '\u20D0' - '\u20FF'.
+ * 0x20D0 - 0x20FF.
*/
public static final UnicodeBlock COMBINING_MARKS_FOR_SYMBOLS
- = new UnicodeBlock('\u20D0', '\u20FF',
- "COMBINING_MARKS_FOR_SYMBOLS");
+ = new UnicodeBlock(0x20D0, 0x20FF,
+ "COMBINING_MARKS_FOR_SYMBOLS",
+ "Combining Marks for Symbols");
/**
* Letterlike Symbols.
- * '\u2100' - '\u214F'.
+ * 0x2100 - 0x214F.
*/
public static final UnicodeBlock LETTERLIKE_SYMBOLS
- = new UnicodeBlock('\u2100', '\u214F',
- "LETTERLIKE_SYMBOLS");
+ = new UnicodeBlock(0x2100, 0x214F,
+ "LETTERLIKE_SYMBOLS",
+ "Letterlike Symbols");
/**
* Number Forms.
- * '\u2150' - '\u218F'.
+ * 0x2150 - 0x218F.
*/
public static final UnicodeBlock NUMBER_FORMS
- = new UnicodeBlock('\u2150', '\u218F',
- "NUMBER_FORMS");
+ = new UnicodeBlock(0x2150, 0x218F,
+ "NUMBER_FORMS",
+ "Number Forms");
/**
* Arrows.
- * '\u2190' - '\u21FF'.
+ * 0x2190 - 0x21FF.
*/
public static final UnicodeBlock ARROWS
- = new UnicodeBlock('\u2190', '\u21FF',
- "ARROWS");
+ = new UnicodeBlock(0x2190, 0x21FF,
+ "ARROWS",
+ "Arrows");
/**
* Mathematical Operators.
- * '\u2200' - '\u22FF'.
+ * 0x2200 - 0x22FF.
*/
public static final UnicodeBlock MATHEMATICAL_OPERATORS
- = new UnicodeBlock('\u2200', '\u22FF',
- "MATHEMATICAL_OPERATORS");
+ = new UnicodeBlock(0x2200, 0x22FF,
+ "MATHEMATICAL_OPERATORS",
+ "Mathematical Operators");
/**
* Miscellaneous Technical.
- * '\u2300' - '\u23FF'.
+ * 0x2300 - 0x23FF.
*/
public static final UnicodeBlock MISCELLANEOUS_TECHNICAL
- = new UnicodeBlock('\u2300', '\u23FF',
- "MISCELLANEOUS_TECHNICAL");
+ = new UnicodeBlock(0x2300, 0x23FF,
+ "MISCELLANEOUS_TECHNICAL",
+ "Miscellaneous Technical");
/**
* Control Pictures.
- * '\u2400' - '\u243F'.
+ * 0x2400 - 0x243F.
*/
public static final UnicodeBlock CONTROL_PICTURES
- = new UnicodeBlock('\u2400', '\u243F',
- "CONTROL_PICTURES");
+ = new UnicodeBlock(0x2400, 0x243F,
+ "CONTROL_PICTURES",
+ "Control Pictures");
/**
* Optical Character Recognition.
- * '\u2440' - '\u245F'.
+ * 0x2440 - 0x245F.
*/
public static final UnicodeBlock OPTICAL_CHARACTER_RECOGNITION
- = new UnicodeBlock('\u2440', '\u245F',
- "OPTICAL_CHARACTER_RECOGNITION");
+ = new UnicodeBlock(0x2440, 0x245F,
+ "OPTICAL_CHARACTER_RECOGNITION",
+ "Optical Character Recognition");
/**
* Enclosed Alphanumerics.
- * '\u2460' - '\u24FF'.
+ * 0x2460 - 0x24FF.
*/
public static final UnicodeBlock ENCLOSED_ALPHANUMERICS
- = new UnicodeBlock('\u2460', '\u24FF',
- "ENCLOSED_ALPHANUMERICS");
+ = new UnicodeBlock(0x2460, 0x24FF,
+ "ENCLOSED_ALPHANUMERICS",
+ "Enclosed Alphanumerics");
/**
* Box Drawing.
- * '\u2500' - '\u257F'.
+ * 0x2500 - 0x257F.
*/
public static final UnicodeBlock BOX_DRAWING
- = new UnicodeBlock('\u2500', '\u257F',
- "BOX_DRAWING");
+ = new UnicodeBlock(0x2500, 0x257F,
+ "BOX_DRAWING",
+ "Box Drawing");
/**
* Block Elements.
- * '\u2580' - '\u259F'.
+ * 0x2580 - 0x259F.
*/
public static final UnicodeBlock BLOCK_ELEMENTS
- = new UnicodeBlock('\u2580', '\u259F',
- "BLOCK_ELEMENTS");
+ = new UnicodeBlock(0x2580, 0x259F,
+ "BLOCK_ELEMENTS",
+ "Block Elements");
/**
* Geometric Shapes.
- * '\u25A0' - '\u25FF'.
+ * 0x25A0 - 0x25FF.
*/
public static final UnicodeBlock GEOMETRIC_SHAPES
- = new UnicodeBlock('\u25A0', '\u25FF',
- "GEOMETRIC_SHAPES");
+ = new UnicodeBlock(0x25A0, 0x25FF,
+ "GEOMETRIC_SHAPES",
+ "Geometric Shapes");
/**
* Miscellaneous Symbols.
- * '\u2600' - '\u26FF'.
+ * 0x2600 - 0x26FF.
*/
public static final UnicodeBlock MISCELLANEOUS_SYMBOLS
- = new UnicodeBlock('\u2600', '\u26FF',
- "MISCELLANEOUS_SYMBOLS");
+ = new UnicodeBlock(0x2600, 0x26FF,
+ "MISCELLANEOUS_SYMBOLS",
+ "Miscellaneous Symbols");
/**
* Dingbats.
- * '\u2700' - '\u27BF'.
+ * 0x2700 - 0x27BF.
*/
public static final UnicodeBlock DINGBATS
- = new UnicodeBlock('\u2700', '\u27BF',
- "DINGBATS");
+ = new UnicodeBlock(0x2700, 0x27BF,
+ "DINGBATS",
+ "Dingbats");
/**
+ * Miscellaneous Mathematical Symbols-A.
+ * 0x27C0 - 0x27EF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock MISCELLANEOUS_MATHEMATICAL_SYMBOLS_A
+ = new UnicodeBlock(0x27C0, 0x27EF,
+ "MISCELLANEOUS_MATHEMATICAL_SYMBOLS_A",
+ "Miscellaneous Mathematical Symbols-A");
+
+ /**
+ * Supplemental Arrows-A.
+ * 0x27F0 - 0x27FF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock SUPPLEMENTAL_ARROWS_A
+ = new UnicodeBlock(0x27F0, 0x27FF,
+ "SUPPLEMENTAL_ARROWS_A",
+ "Supplemental Arrows-A");
+
+ /**
* Braille Patterns.
- * '\u2800' - '\u28FF'.
+ * 0x2800 - 0x28FF.
* @since 1.4
*/
public static final UnicodeBlock BRAILLE_PATTERNS
- = new UnicodeBlock('\u2800', '\u28FF',
- "BRAILLE_PATTERNS");
+ = new UnicodeBlock(0x2800, 0x28FF,
+ "BRAILLE_PATTERNS",
+ "Braille Patterns");
/**
+ * Supplemental Arrows-B.
+ * 0x2900 - 0x297F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock SUPPLEMENTAL_ARROWS_B
+ = new UnicodeBlock(0x2900, 0x297F,
+ "SUPPLEMENTAL_ARROWS_B",
+ "Supplemental Arrows-B");
+
+ /**
+ * Miscellaneous Mathematical Symbols-B.
+ * 0x2980 - 0x29FF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock MISCELLANEOUS_MATHEMATICAL_SYMBOLS_B
+ = new UnicodeBlock(0x2980, 0x29FF,
+ "MISCELLANEOUS_MATHEMATICAL_SYMBOLS_B",
+ "Miscellaneous Mathematical Symbols-B");
+
+ /**
+ * Supplemental Mathematical Operators.
+ * 0x2A00 - 0x2AFF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock SUPPLEMENTAL_MATHEMATICAL_OPERATORS
+ = new UnicodeBlock(0x2A00, 0x2AFF,
+ "SUPPLEMENTAL_MATHEMATICAL_OPERATORS",
+ "Supplemental Mathematical Operators");
+
+ /**
+ * Miscellaneous Symbols and Arrows.
+ * 0x2B00 - 0x2BFF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock MISCELLANEOUS_SYMBOLS_AND_ARROWS
+ = new UnicodeBlock(0x2B00, 0x2BFF,
+ "MISCELLANEOUS_SYMBOLS_AND_ARROWS",
+ "Miscellaneous Symbols and Arrows");
+
+ /**
* CJK Radicals Supplement.
- * '\u2E80' - '\u2EFF'.
+ * 0x2E80 - 0x2EFF.
* @since 1.4
*/
public static final UnicodeBlock CJK_RADICALS_SUPPLEMENT
- = new UnicodeBlock('\u2E80', '\u2EFF',
- "CJK_RADICALS_SUPPLEMENT");
+ = new UnicodeBlock(0x2E80, 0x2EFF,
+ "CJK_RADICALS_SUPPLEMENT",
+ "CJK Radicals Supplement");
/**
* Kangxi Radicals.
- * '\u2F00' - '\u2FDF'.
+ * 0x2F00 - 0x2FDF.
* @since 1.4
*/
public static final UnicodeBlock KANGXI_RADICALS
- = new UnicodeBlock('\u2F00', '\u2FDF',
- "KANGXI_RADICALS");
+ = new UnicodeBlock(0x2F00, 0x2FDF,
+ "KANGXI_RADICALS",
+ "Kangxi Radicals");
/**
* Ideographic Description Characters.
- * '\u2FF0' - '\u2FFF'.
+ * 0x2FF0 - 0x2FFF.
* @since 1.4
*/
public static final UnicodeBlock IDEOGRAPHIC_DESCRIPTION_CHARACTERS
- = new UnicodeBlock('\u2FF0', '\u2FFF',
- "IDEOGRAPHIC_DESCRIPTION_CHARACTERS");
+ = new UnicodeBlock(0x2FF0, 0x2FFF,
+ "IDEOGRAPHIC_DESCRIPTION_CHARACTERS",
+ "Ideographic Description Characters");
/**
* CJK Symbols and Punctuation.
- * '\u3000' - '\u303F'.
+ * 0x3000 - 0x303F.
*/
public static final UnicodeBlock CJK_SYMBOLS_AND_PUNCTUATION
- = new UnicodeBlock('\u3000', '\u303F',
- "CJK_SYMBOLS_AND_PUNCTUATION");
+ = new UnicodeBlock(0x3000, 0x303F,
+ "CJK_SYMBOLS_AND_PUNCTUATION",
+ "CJK Symbols and Punctuation");
/**
* Hiragana.
- * '\u3040' - '\u309F'.
+ * 0x3040 - 0x309F.
*/
public static final UnicodeBlock HIRAGANA
- = new UnicodeBlock('\u3040', '\u309F',
- "HIRAGANA");
+ = new UnicodeBlock(0x3040, 0x309F,
+ "HIRAGANA",
+ "Hiragana");
/**
* Katakana.
- * '\u30A0' - '\u30FF'.
+ * 0x30A0 - 0x30FF.
*/
public static final UnicodeBlock KATAKANA
- = new UnicodeBlock('\u30A0', '\u30FF',
- "KATAKANA");
+ = new UnicodeBlock(0x30A0, 0x30FF,
+ "KATAKANA",
+ "Katakana");
/**
* Bopomofo.
- * '\u3100' - '\u312F'.
+ * 0x3100 - 0x312F.
*/
public static final UnicodeBlock BOPOMOFO
- = new UnicodeBlock('\u3100', '\u312F',
- "BOPOMOFO");
+ = new UnicodeBlock(0x3100, 0x312F,
+ "BOPOMOFO",
+ "Bopomofo");
/**
* Hangul Compatibility Jamo.
- * '\u3130' - '\u318F'.
+ * 0x3130 - 0x318F.
*/
public static final UnicodeBlock HANGUL_COMPATIBILITY_JAMO
- = new UnicodeBlock('\u3130', '\u318F',
- "HANGUL_COMPATIBILITY_JAMO");
+ = new UnicodeBlock(0x3130, 0x318F,
+ "HANGUL_COMPATIBILITY_JAMO",
+ "Hangul Compatibility Jamo");
/**
* Kanbun.
- * '\u3190' - '\u319F'.
+ * 0x3190 - 0x319F.
*/
public static final UnicodeBlock KANBUN
- = new UnicodeBlock('\u3190', '\u319F',
- "KANBUN");
+ = new UnicodeBlock(0x3190, 0x319F,
+ "KANBUN",
+ "Kanbun");
/**
* Bopomofo Extended.
- * '\u31A0' - '\u31BF'.
+ * 0x31A0 - 0x31BF.
* @since 1.4
*/
public static final UnicodeBlock BOPOMOFO_EXTENDED
- = new UnicodeBlock('\u31A0', '\u31BF',
- "BOPOMOFO_EXTENDED");
+ = new UnicodeBlock(0x31A0, 0x31BF,
+ "BOPOMOFO_EXTENDED",
+ "Bopomofo Extended");
/**
+ * Katakana Phonetic Extensions.
+ * 0x31F0 - 0x31FF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock KATAKANA_PHONETIC_EXTENSIONS
+ = new UnicodeBlock(0x31F0, 0x31FF,
+ "KATAKANA_PHONETIC_EXTENSIONS",
+ "Katakana Phonetic Extensions");
+
+ /**
* Enclosed CJK Letters and Months.
- * '\u3200' - '\u32FF'.
+ * 0x3200 - 0x32FF.
*/
public static final UnicodeBlock ENCLOSED_CJK_LETTERS_AND_MONTHS
- = new UnicodeBlock('\u3200', '\u32FF',
- "ENCLOSED_CJK_LETTERS_AND_MONTHS");
+ = new UnicodeBlock(0x3200, 0x32FF,
+ "ENCLOSED_CJK_LETTERS_AND_MONTHS",
+ "Enclosed CJK Letters and Months");
/**
* CJK Compatibility.
- * '\u3300' - '\u33FF'.
+ * 0x3300 - 0x33FF.
*/
public static final UnicodeBlock CJK_COMPATIBILITY
- = new UnicodeBlock('\u3300', '\u33FF',
- "CJK_COMPATIBILITY");
+ = new UnicodeBlock(0x3300, 0x33FF,
+ "CJK_COMPATIBILITY",
+ "CJK Compatibility");
/**
* CJK Unified Ideographs Extension A.
- * '\u3400' - '\u4DB5'.
+ * 0x3400 - 0x4DBF.
* @since 1.4
*/
public static final UnicodeBlock CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
- = new UnicodeBlock('\u3400', '\u4DB5',
- "CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A");
+ = new UnicodeBlock(0x3400, 0x4DBF,
+ "CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A",
+ "CJK Unified Ideographs Extension A");
/**
+ * Yijing Hexagram Symbols.
+ * 0x4DC0 - 0x4DFF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock YIJING_HEXAGRAM_SYMBOLS
+ = new UnicodeBlock(0x4DC0, 0x4DFF,
+ "YIJING_HEXAGRAM_SYMBOLS",
+ "Yijing Hexagram Symbols");
+
+ /**
* CJK Unified Ideographs.
- * '\u4E00' - '\u9FFF'.
+ * 0x4E00 - 0x9FFF.
*/
public static final UnicodeBlock CJK_UNIFIED_IDEOGRAPHS
- = new UnicodeBlock('\u4E00', '\u9FFF',
- "CJK_UNIFIED_IDEOGRAPHS");
+ = new UnicodeBlock(0x4E00, 0x9FFF,
+ "CJK_UNIFIED_IDEOGRAPHS",
+ "CJK Unified Ideographs");
/**
* Yi Syllables.
- * '\uA000' - '\uA48F'.
+ * 0xA000 - 0xA48F.
* @since 1.4
*/
public static final UnicodeBlock YI_SYLLABLES
- = new UnicodeBlock('\uA000', '\uA48F',
- "YI_SYLLABLES");
+ = new UnicodeBlock(0xA000, 0xA48F,
+ "YI_SYLLABLES",
+ "Yi Syllables");
/**
* Yi Radicals.
- * '\uA490' - '\uA4CF'.
+ * 0xA490 - 0xA4CF.
* @since 1.4
*/
public static final UnicodeBlock YI_RADICALS
- = new UnicodeBlock('\uA490', '\uA4CF',
- "YI_RADICALS");
+ = new UnicodeBlock(0xA490, 0xA4CF,
+ "YI_RADICALS",
+ "Yi Radicals");
/**
* Hangul Syllables.
- * '\uAC00' - '\uD7A3'.
+ * 0xAC00 - 0xD7AF.
*/
public static final UnicodeBlock HANGUL_SYLLABLES
- = new UnicodeBlock('\uAC00', '\uD7A3',
- "HANGUL_SYLLABLES");
+ = new UnicodeBlock(0xAC00, 0xD7AF,
+ "HANGUL_SYLLABLES",
+ "Hangul Syllables");
/**
- * Surrogates Area.
- * '\uD800' - '\uDFFF'.
+ * High Surrogates.
+ * 0xD800 - 0xDB7F.
+ * @since 1.5
*/
- public static final UnicodeBlock SURROGATES_AREA
- = new UnicodeBlock('\uD800', '\uDFFF',
- "SURROGATES_AREA");
+ public static final UnicodeBlock HIGH_SURROGATES
+ = new UnicodeBlock(0xD800, 0xDB7F,
+ "HIGH_SURROGATES",
+ "High Surrogates");
/**
+ * High Private Use Surrogates.
+ * 0xDB80 - 0xDBFF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock HIGH_PRIVATE_USE_SURROGATES
+ = new UnicodeBlock(0xDB80, 0xDBFF,
+ "HIGH_PRIVATE_USE_SURROGATES",
+ "High Private Use Surrogates");
+
+ /**
+ * Low Surrogates.
+ * 0xDC00 - 0xDFFF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock LOW_SURROGATES
+ = new UnicodeBlock(0xDC00, 0xDFFF,
+ "LOW_SURROGATES",
+ "Low Surrogates");
+
+ /**
* Private Use Area.
- * '\uE000' - '\uF8FF'.
+ * 0xE000 - 0xF8FF.
*/
public static final UnicodeBlock PRIVATE_USE_AREA
- = new UnicodeBlock('\uE000', '\uF8FF',
- "PRIVATE_USE_AREA");
+ = new UnicodeBlock(0xE000, 0xF8FF,
+ "PRIVATE_USE_AREA",
+ "Private Use Area");
/**
* CJK Compatibility Ideographs.
- * '\uF900' - '\uFAFF'.
+ * 0xF900 - 0xFAFF.
*/
public static final UnicodeBlock CJK_COMPATIBILITY_IDEOGRAPHS
- = new UnicodeBlock('\uF900', '\uFAFF',
- "CJK_COMPATIBILITY_IDEOGRAPHS");
+ = new UnicodeBlock(0xF900, 0xFAFF,
+ "CJK_COMPATIBILITY_IDEOGRAPHS",
+ "CJK Compatibility Ideographs");
/**
* Alphabetic Presentation Forms.
- * '\uFB00' - '\uFB4F'.
+ * 0xFB00 - 0xFB4F.
*/
public static final UnicodeBlock ALPHABETIC_PRESENTATION_FORMS
- = new UnicodeBlock('\uFB00', '\uFB4F',
- "ALPHABETIC_PRESENTATION_FORMS");
+ = new UnicodeBlock(0xFB00, 0xFB4F,
+ "ALPHABETIC_PRESENTATION_FORMS",
+ "Alphabetic Presentation Forms");
/**
* Arabic Presentation Forms-A.
- * '\uFB50' - '\uFDFF'.
+ * 0xFB50 - 0xFDFF.
*/
public static final UnicodeBlock ARABIC_PRESENTATION_FORMS_A
- = new UnicodeBlock('\uFB50', '\uFDFF',
- "ARABIC_PRESENTATION_FORMS_A");
+ = new UnicodeBlock(0xFB50, 0xFDFF,
+ "ARABIC_PRESENTATION_FORMS_A",
+ "Arabic Presentation Forms-A");
/**
+ * Variation Selectors.
+ * 0xFE00 - 0xFE0F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock VARIATION_SELECTORS
+ = new UnicodeBlock(0xFE00, 0xFE0F,
+ "VARIATION_SELECTORS",
+ "Variation Selectors");
+
+ /**
* Combining Half Marks.
- * '\uFE20' - '\uFE2F'.
+ * 0xFE20 - 0xFE2F.
*/
public static final UnicodeBlock COMBINING_HALF_MARKS
- = new UnicodeBlock('\uFE20', '\uFE2F',
- "COMBINING_HALF_MARKS");
+ = new UnicodeBlock(0xFE20, 0xFE2F,
+ "COMBINING_HALF_MARKS",
+ "Combining Half Marks");
/**
* CJK Compatibility Forms.
- * '\uFE30' - '\uFE4F'.
+ * 0xFE30 - 0xFE4F.
*/
public static final UnicodeBlock CJK_COMPATIBILITY_FORMS
- = new UnicodeBlock('\uFE30', '\uFE4F',
- "CJK_COMPATIBILITY_FORMS");
+ = new UnicodeBlock(0xFE30, 0xFE4F,
+ "CJK_COMPATIBILITY_FORMS",
+ "CJK Compatibility Forms");
/**
* Small Form Variants.
- * '\uFE50' - '\uFE6F'.
+ * 0xFE50 - 0xFE6F.
*/
public static final UnicodeBlock SMALL_FORM_VARIANTS
- = new UnicodeBlock('\uFE50', '\uFE6F',
- "SMALL_FORM_VARIANTS");
+ = new UnicodeBlock(0xFE50, 0xFE6F,
+ "SMALL_FORM_VARIANTS",
+ "Small Form Variants");
/**
* Arabic Presentation Forms-B.
- * '\uFE70' - '\uFEFE'.
+ * 0xFE70 - 0xFEFF.
*/
public static final UnicodeBlock ARABIC_PRESENTATION_FORMS_B
- = new UnicodeBlock('\uFE70', '\uFEFE',
- "ARABIC_PRESENTATION_FORMS_B");
+ = new UnicodeBlock(0xFE70, 0xFEFF,
+ "ARABIC_PRESENTATION_FORMS_B",
+ "Arabic Presentation Forms-B");
/**
* Halfwidth and Fullwidth Forms.
- * '\uFF00' - '\uFFEF'.
+ * 0xFF00 - 0xFFEF.
*/
public static final UnicodeBlock HALFWIDTH_AND_FULLWIDTH_FORMS
- = new UnicodeBlock('\uFF00', '\uFFEF',
- "HALFWIDTH_AND_FULLWIDTH_FORMS");
+ = new UnicodeBlock(0xFF00, 0xFFEF,
+ "HALFWIDTH_AND_FULLWIDTH_FORMS",
+ "Halfwidth and Fullwidth Forms");
/**
* Specials.
- * '\uFEFF', '\uFFF0' - '\uFFFD'.
+ * 0xFFF0 - 0xFFFF.
*/
public static final UnicodeBlock SPECIALS
- = new UnicodeBlock('\uFFF0', '\uFFFD',
- "SPECIALS");
+ = new UnicodeBlock(0xFFF0, 0xFFFF,
+ "SPECIALS",
+ "Specials");
/**
+ * Linear B Syllabary.
+ * 0x10000 - 0x1007F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock LINEAR_B_SYLLABARY
+ = new UnicodeBlock(0x10000, 0x1007F,
+ "LINEAR_B_SYLLABARY",
+ "Linear B Syllabary");
+
+ /**
+ * Linear B Ideograms.
+ * 0x10080 - 0x100FF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock LINEAR_B_IDEOGRAMS
+ = new UnicodeBlock(0x10080, 0x100FF,
+ "LINEAR_B_IDEOGRAMS",
+ "Linear B Ideograms");
+
+ /**
+ * Aegean Numbers.
+ * 0x10100 - 0x1013F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock AEGEAN_NUMBERS
+ = new UnicodeBlock(0x10100, 0x1013F,
+ "AEGEAN_NUMBERS",
+ "Aegean Numbers");
+
+ /**
+ * Old Italic.
+ * 0x10300 - 0x1032F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock OLD_ITALIC
+ = new UnicodeBlock(0x10300, 0x1032F,
+ "OLD_ITALIC",
+ "Old Italic");
+
+ /**
+ * Gothic.
+ * 0x10330 - 0x1034F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock GOTHIC
+ = new UnicodeBlock(0x10330, 0x1034F,
+ "GOTHIC",
+ "Gothic");
+
+ /**
+ * Ugaritic.
+ * 0x10380 - 0x1039F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock UGARITIC
+ = new UnicodeBlock(0x10380, 0x1039F,
+ "UGARITIC",
+ "Ugaritic");
+
+ /**
+ * Deseret.
+ * 0x10400 - 0x1044F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock DESERET
+ = new UnicodeBlock(0x10400, 0x1044F,
+ "DESERET",
+ "Deseret");
+
+ /**
+ * Shavian.
+ * 0x10450 - 0x1047F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock SHAVIAN
+ = new UnicodeBlock(0x10450, 0x1047F,
+ "SHAVIAN",
+ "Shavian");
+
+ /**
+ * Osmanya.
+ * 0x10480 - 0x104AF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock OSMANYA
+ = new UnicodeBlock(0x10480, 0x104AF,
+ "OSMANYA",
+ "Osmanya");
+
+ /**
+ * Cypriot Syllabary.
+ * 0x10800 - 0x1083F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock CYPRIOT_SYLLABARY
+ = new UnicodeBlock(0x10800, 0x1083F,
+ "CYPRIOT_SYLLABARY",
+ "Cypriot Syllabary");
+
+ /**
+ * Byzantine Musical Symbols.
+ * 0x1D000 - 0x1D0FF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock BYZANTINE_MUSICAL_SYMBOLS
+ = new UnicodeBlock(0x1D000, 0x1D0FF,
+ "BYZANTINE_MUSICAL_SYMBOLS",
+ "Byzantine Musical Symbols");
+
+ /**
+ * Musical Symbols.
+ * 0x1D100 - 0x1D1FF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock MUSICAL_SYMBOLS
+ = new UnicodeBlock(0x1D100, 0x1D1FF,
+ "MUSICAL_SYMBOLS",
+ "Musical Symbols");
+
+ /**
+ * Tai Xuan Jing Symbols.
+ * 0x1D300 - 0x1D35F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock TAI_XUAN_JING_SYMBOLS
+ = new UnicodeBlock(0x1D300, 0x1D35F,
+ "TAI_XUAN_JING_SYMBOLS",
+ "Tai Xuan Jing Symbols");
+
+ /**
+ * Mathematical Alphanumeric Symbols.
+ * 0x1D400 - 0x1D7FF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock MATHEMATICAL_ALPHANUMERIC_SYMBOLS
+ = new UnicodeBlock(0x1D400, 0x1D7FF,
+ "MATHEMATICAL_ALPHANUMERIC_SYMBOLS",
+ "Mathematical Alphanumeric Symbols");
+
+ /**
+ * CJK Unified Ideographs Extension B.
+ * 0x20000 - 0x2A6DF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
+ = new UnicodeBlock(0x20000, 0x2A6DF,
+ "CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B",
+ "CJK Unified Ideographs Extension B");
+
+ /**
+ * CJK Compatibility Ideographs Supplement.
+ * 0x2F800 - 0x2FA1F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT
+ = new UnicodeBlock(0x2F800, 0x2FA1F,
+ "CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT",
+ "CJK Compatibility Ideographs Supplement");
+
+ /**
+ * Tags.
+ * 0xE0000 - 0xE007F.
+ * @since 1.5
+ */
+ public static final UnicodeBlock TAGS
+ = new UnicodeBlock(0xE0000, 0xE007F,
+ "TAGS",
+ "Tags");
+
+ /**
+ * Variation Selectors Supplement.
+ * 0xE0100 - 0xE01EF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock VARIATION_SELECTORS_SUPPLEMENT
+ = new UnicodeBlock(0xE0100, 0xE01EF,
+ "VARIATION_SELECTORS_SUPPLEMENT",
+ "Variation Selectors Supplement");
+
+ /**
+ * Supplementary Private Use Area-A.
+ * 0xF0000 - 0xFFFFF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock SUPPLEMENTARY_PRIVATE_USE_AREA_A
+ = new UnicodeBlock(0xF0000, 0xFFFFF,
+ "SUPPLEMENTARY_PRIVATE_USE_AREA_A",
+ "Supplementary Private Use Area-A");
+
+ /**
+ * Supplementary Private Use Area-B.
+ * 0x100000 - 0x10FFFF.
+ * @since 1.5
+ */
+ public static final UnicodeBlock SUPPLEMENTARY_PRIVATE_USE_AREA_B
+ = new UnicodeBlock(0x100000, 0x10FFFF,
+ "SUPPLEMENTARY_PRIVATE_USE_AREA_B",
+ "Supplementary Private Use Area-B");
+
+ /**
+ * Surrogates Area.
+ * 0xD800 - 0xDFFF.
+ * @deprecated As of 1.5, the three areas,
+ * <a href="#HIGH_SURROGATES">HIGH_SURROGATES</a>,
+ * <a href="#HIGH_PRIVATE_USE_SURROGATES">HIGH_PRIVATE_USE_SURROGATES</a>
+ * and <a href="#LOW_SURROGATES">LOW_SURROGATES</a>, as defined
+ * by the Unicode standard, should be used in preference to
+ * this. These are also returned from calls to <code>of(int)</code>
+ * and <code>of(char)</code>.
+ */
+ public static final UnicodeBlock SURROGATES_AREA
+ = new UnicodeBlock(0xD800, 0xDFFF,
+ "SURROGATES_AREA",
+ "Surrogates Area");
+
+ /**
* The defined subsets.
*/
private static final UnicodeBlock sets[] = {
@@ -909,6 +1535,7 @@
COMBINING_DIACRITICAL_MARKS,
GREEK,
CYRILLIC,
+ CYRILLIC_SUPPLEMENTARY,
ARMENIAN,
HEBREW,
ARABIC,
@@ -935,8 +1562,16 @@
UNIFIED_CANADIAN_ABORIGINAL_SYLLABICS,
OGHAM,
RUNIC,
+ TAGALOG,
+ HANUNOO,
+ BUHID,
+ TAGBANWA,
KHMER,
MONGOLIAN,
+ LIMBU,
+ TAI_LE,
+ KHMER_SYMBOLS,
+ PHONETIC_EXTENSIONS,
LATIN_EXTENDED_ADDITIONAL,
GREEK_EXTENDED,
GENERAL_PUNCTUATION,
@@ -956,7 +1591,13 @@
GEOMETRIC_SHAPES,
MISCELLANEOUS_SYMBOLS,
DINGBATS,
+ MISCELLANEOUS_MATHEMATICAL_SYMBOLS_A,
+ SUPPLEMENTAL_ARROWS_A,
BRAILLE_PATTERNS,
+ SUPPLEMENTAL_ARROWS_B,
+ MISCELLANEOUS_MATHEMATICAL_SYMBOLS_B,
+ SUPPLEMENTAL_MATHEMATICAL_OPERATORS,
+ MISCELLANEOUS_SYMBOLS_AND_ARROWS,
CJK_RADICALS_SUPPLEMENT,
KANGXI_RADICALS,
IDEOGRAPHIC_DESCRIPTION_CHARACTERS,
@@ -967,24 +1608,49 @@
HANGUL_COMPATIBILITY_JAMO,
KANBUN,
BOPOMOFO_EXTENDED,
+ KATAKANA_PHONETIC_EXTENSIONS,
ENCLOSED_CJK_LETTERS_AND_MONTHS,
CJK_COMPATIBILITY,
CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A,
+ YIJING_HEXAGRAM_SYMBOLS,
CJK_UNIFIED_IDEOGRAPHS,
YI_SYLLABLES,
YI_RADICALS,
HANGUL_SYLLABLES,
- SURROGATES_AREA,
+ HIGH_SURROGATES,
+ HIGH_PRIVATE_USE_SURROGATES,
+ LOW_SURROGATES,
PRIVATE_USE_AREA,
CJK_COMPATIBILITY_IDEOGRAPHS,
ALPHABETIC_PRESENTATION_FORMS,
ARABIC_PRESENTATION_FORMS_A,
+ VARIATION_SELECTORS,
COMBINING_HALF_MARKS,
CJK_COMPATIBILITY_FORMS,
SMALL_FORM_VARIANTS,
ARABIC_PRESENTATION_FORMS_B,
HALFWIDTH_AND_FULLWIDTH_FORMS,
SPECIALS,
+ LINEAR_B_SYLLABARY,
+ LINEAR_B_IDEOGRAMS,
+ AEGEAN_NUMBERS,
+ OLD_ITALIC,
+ GOTHIC,
+ UGARITIC,
+ DESERET,
+ SHAVIAN,
+ OSMANYA,
+ CYPRIOT_SYLLABARY,
+ BYZANTINE_MUSICAL_SYMBOLS,
+ MUSICAL_SYMBOLS,
+ TAI_XUAN_JING_SYMBOLS,
+ MATHEMATICAL_ALPHANUMERIC_SYMBOLS,
+ CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B,
+ CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT,
+ TAGS,
+ VARIATION_SELECTORS_SUPPLEMENT,
+ SUPPLEMENTARY_PRIVATE_USE_AREA_A,
+ SUPPLEMENTARY_PRIVATE_USE_AREA_B,
};
} // class UnicodeBlock
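
[Illustration, not part of the patch.] The switch from 'char' to 'int' bounds is what lets the table describe blocks beyond U+FFFF. A minimal sketch of the resulting behaviour, written against the standard java.lang.Character.UnicodeBlock methods of(int) and forName(String) that this merge adds; the class name UnicodeBlockDemo and the helper blockOf are made up for the example, and the results assume a runtime whose block table agrees with the ranges above:

```java
// Illustrative only: UnicodeBlockDemo and blockOf are hypothetical names.
final class UnicodeBlockDemo {
  // Look up the block of a code point, including supplementary
  // code points that do not fit in a single 'char'.
  static Character.UnicodeBlock blockOf(int codePoint) {
    return Character.UnicodeBlock.of(codePoint);
  }

  public static void main(String[] args) {
    // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
    System.out.println(blockOf(0x1D11E));

    // forName accepts the canonical name, the name without spaces,
    // and the name of the constant.
    System.out.println(Character.UnicodeBlock.forName("Musical Symbols"));
    System.out.println(Character.UnicodeBlock.forName("MusicalSymbols"));
    System.out.println(Character.UnicodeBlock.forName("MUSICAL_SYMBOLS"));

    // Surrogate code units map to the finer-grained blocks rather
    // than to the deprecated SURROGATES_AREA.
    System.out.println(blockOf(0xD800));
    System.out.println(blockOf(0xDC00));
  }
}
```

Note that SURROGATES_AREA stays available as a constant but is left out of the lookup table, so of() should never return it.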
Index: classpath/gnu/regexp/CharIndexedStringBuffer.java
===================================================================
--- classpath/gnu/regexp/CharIndexedStringBuffer.java (revision 110832)
+++ classpath/gnu/regexp/CharIndexedStringBuffer.java (working copy)
@@ -1,5 +1,5 @@
/* gnu/regexp/CharIndexedStringBuffer.java
- Copyright (C) 1998-2001, 2004 Free Software Foundation, Inc.
+ Copyright (C) 1998-2001, 2004, 2006 Free Software Foundation, Inc.
This file is part of GNU Classpath.
@@ -59,4 +59,13 @@
public boolean move(int index) {
return ((anchor += index) < s.length());
}
+
+ public CharIndexed lookBehind(int index, int length) {
+ if (length > (anchor + index)) length = anchor + index;
+ return new CharIndexedStringBuffer(s, anchor + index - length);
+ }
+
+ public int length() {
+ return s.length() - anchor;
+ }
}
Index: classpath/gnu/regexp/RETokenChar.java
===================================================================
--- classpath/gnu/regexp/RETokenChar.java (revision 110832)
+++ classpath/gnu/regexp/RETokenChar.java (working copy)
@@ -52,6 +52,10 @@
return ch.length;
}
+ int getMaximumLength() {
+ return ch.length;
+ }
+
boolean match(CharIndexed input, REMatch mymatch) {
int z = ch.length;
char c;
@@ -68,7 +72,7 @@
// Overrides REToken.chain() to optimize for strings
boolean chain(REToken next) {
- if (next instanceof RETokenChar) {
+ if (next instanceof RETokenChar && ((RETokenChar)next).insens == insens) {
RETokenChar cnext = (RETokenChar) next;
// assume for now that next can only be one character
int newsize = ch.length + cnext.ch.length;
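
[Illustration, not part of the patch.] The extra 'insens' comparison in chain() appears to guard against merging a case-sensitive literal with a case-insensitive one, which can now be adjacent when flags change mid-pattern. The sketch below models that rule with a hypothetical LiteralToken class; it is not the gnu.regexp code, and its return convention (true when merged) is the sketch's own choice:

```java
// Hypothetical stand-in for RETokenChar; not the gnu.regexp class itself.
final class LiteralToken {
  char[] ch;              // literal characters matched by this token
  final boolean insens;   // true if matching ignores case
  LiteralToken next;      // following token when no merge happened

  LiteralToken(String s, boolean insens) {
    this.ch = s.toCharArray();
    this.insens = insens;
  }

  // Returns true if 'other' was absorbed into this token, false if it
  // was linked as a separate token because the case modes differ.
  boolean chain(LiteralToken other) {
    if (other.insens == insens) {
      char[] merged = new char[ch.length + other.ch.length];
      System.arraycopy(ch, 0, merged, 0, ch.length);
      System.arraycopy(other.ch, 0, merged, ch.length, other.ch.length);
      ch = merged;
      return true;
    }
    next = other;
    return false;
  }
}
```

Without the check, a merged token would apply the first token's case mode to the characters of the second.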
Index: classpath/gnu/regexp/CharIndexedString.java
===================================================================
--- classpath/gnu/regexp/CharIndexedString.java (revision 110832)
+++ classpath/gnu/regexp/CharIndexedString.java (working copy)
@@ -1,5 +1,5 @@
/* gnu/regexp/CharIndexedString.java
- Copyright (C) 1998-2001, 2004 Free Software Foundation, Inc.
+ Copyright (C) 1998-2001, 2004, 2006 Free Software Foundation, Inc.
This file is part of GNU Classpath.
@@ -61,4 +61,13 @@
public boolean move(int index) {
return ((anchor += index) < len);
}
+
+ public CharIndexed lookBehind(int index, int length) {
+ if (length > (anchor + index)) length = anchor + index;
+ return new CharIndexedString(s, anchor + index - length);
+ }
+
+ public int length() {
+ return len - anchor;
+ }
}
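
[Illustration, not part of the patch.] Both lookBehind implementations clamp the requested length so that the new view never starts before the beginning of the underlying text. A minimal sketch of that arithmetic, using a hypothetical AnchoredView class as a simplified stand-in for CharIndexedString:

```java
// Hypothetical, simplified stand-in for CharIndexedString.
final class AnchoredView {
  final String s;     // underlying text
  final int anchor;   // offset in 's' at which this view starts

  AnchoredView(String s, int anchor) {
    this.s = s;
    this.anchor = anchor;
  }

  // Characters remaining from the anchor to the end of the text.
  int length() {
    return s.length() - anchor;
  }

  // View starting up to 'length' characters before position
  // anchor + index, clamped at the start of the text.
  AnchoredView lookBehind(int index, int length) {
    if (length > anchor + index) length = anchor + index;
    return new AnchoredView(s, anchor + index - length);
  }
}
```

For a view anchored at 1 over "abcdef", lookBehind(2, 2) starts at offset 1, while lookBehind(2, 10) is clamped and starts at offset 0.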
Index: classpath/gnu/regexp/RETokenLookBehind.java
===================================================================
--- classpath/gnu/regexp/RETokenLookBehind.java (revision 0)
+++ classpath/gnu/regexp/RETokenLookBehind.java (revision 0)
@@ -0,0 +1,116 @@
+/* gnu/regexp/RETokenLookBehind.java
+ Copyright (C) 2006 Free Software Foundation, Inc.
+
+This file is part of GNU Classpath.
+
+GNU Classpath is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2, or (at your option)
+any later version.
+
+GNU Classpath is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GNU Classpath; see the file COPYING. If not, write to the
+Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+02110-1301 USA.
+
+Linking this library statically or dynamically with other modules is
+making a combined work based on this library. Thus, the terms and
+conditions of the GNU General Public License cover the whole
+combination.
+
+As a special exception, the copyright holders of this library give you
+permission to link this library with independent modules to produce an
+executable, regardless of the license terms of these independent
+modules, and to copy and distribute the resulting executable under
+terms of your choice, provided that you also meet, for each linked
+independent module, the terms and conditions of the license of that
+module. An independent module is a module which is not derived from
+or based on this library. If you modify this library, you may extend
+this exception to your version of the library, but you are not
+obligated to do so. If you do not wish to do so, delete this
+exception statement from your version. */
+
+package gnu.regexp;
+
+/**
+ * @author Ito Kazumitsu
+ */
+final class RETokenLookBehind extends REToken
+{
+ REToken re;
+ boolean negative;
+
+ RETokenLookBehind(REToken re, boolean negative) throws REException {
+ super(0);
+ this.re = re;
+ this.negative = negative;
+ }
+
+ int getMaximumLength() {
+ return 0;
+ }
+
+ boolean match(CharIndexed input, REMatch mymatch)
+ {
+ int max = re.getMaximumLength();
+ CharIndexed behind = input.lookBehind(mymatch.index, max);
+ REMatch trymatch = (REMatch)mymatch.clone();
+ REMatch trymatch1 = (REMatch)mymatch.clone();
+ REMatch newMatch = null;
+ int curIndex = trymatch.index + behind.length() - input.length();
+ trymatch.index = 0;
+ RETokenMatchHereOnly stopper = new RETokenMatchHereOnly(curIndex);
+ REToken re1 = (REToken) re.clone();
+ re1.chain(stopper);
+ if (re1.match(behind, trymatch)) {
+ if (negative) return false;
+ if (next(input, trymatch1))
+ newMatch = trymatch1;
+ }
+
+ if (newMatch != null) {
+ if (negative) return false;
+ //else
+ mymatch.assignFrom(newMatch);
+ return true;
+ }
+ else { // no match
+ if (negative)
+ return next(input, mymatch);
+ //else
+ return false;
+ }
+ }
+
+ void dump(StringBuffer os) {
+ os.append("(?<");
+ os.append(negative ? '!' : '=');
+ re.dumpAll(os);
+ os.append(')');
+ }
+
+ private static class RETokenMatchHereOnly extends REToken {
+
+ int getMaximumLength() { return 0; }
+
+ private int index;
+
+ RETokenMatchHereOnly(int index) {
+ super(0);
+ this.index = index;
+ }
+
+ boolean match(CharIndexed input, REMatch mymatch) {
+ return index == mymatch.index;
+ }
+
+ void dump(StringBuffer os) {}
+
+ }
+}
+
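
[Illustration, not part of the patch.] RETokenLookBehind implements the (?<=...) and (?<!...) constructs. The sketch below shows the expected user-visible behaviour through the standard java.util.regex API; LookBehindDemo and replace are made-up names, and it is an assumption that the merged gnu.regexp backend gives the same results as other java.util.regex implementations for these patterns:

```java
import java.util.regex.Pattern;

// Illustrative only: LookBehindDemo is a hypothetical class name.
final class LookBehindDemo {
  // Replace every match of 'regex' in 'input' with 'replacement'.
  static String replace(String regex, String input, String replacement) {
    return Pattern.compile(regex).matcher(input).replaceAll(replacement);
  }

  public static void main(String[] args) {
    String input = "foobar bazbar";
    // Positive look-behind: only the "bar" preceded by "foo" matches.
    System.out.println(replace("(?<=foo)bar", input, "X"));
    // Negative look-behind: only the "bar" not preceded by "foo" matches.
    System.out.println(replace("(?<!foo)bar", input, "X"));
  }
}
```

The look-behind itself consumes no input, which is why getMaximumLength() returns 0 for the token.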
Index: classpath/gnu/regexp/RE.java
===================================================================
--- classpath/gnu/regexp/RE.java (revision 110832)
+++ classpath/gnu/regexp/RE.java (working copy)
@@ -136,12 +136,13 @@
/** Minimum length, in characters, of any possible match. */
private int minimumLength;
+ private int maximumLength;
/**
* Compilation flag. Do not differentiate case. Subsequent
* searches using this RE will be case insensitive.
*/
- public static final int REG_ICASE = 2;
+ public static final int REG_ICASE = 0x02;
/**
* Compilation flag. The match-any-character operator (dot)
@@ -149,14 +150,14 @@
* bit RE_DOT_NEWLINE (see RESyntax for details). This is equivalent to
* the "/s" operator in Perl.
*/
- public static final int REG_DOT_NEWLINE = 4;
+ public static final int REG_DOT_NEWLINE = 0x04;
/**
* Compilation flag. Use multiline mode. In this mode, the ^ and $
* anchors will match based on newlines within the input. This is
* equivalent to the "/m" operator in Perl.
*/
- public static final int REG_MULTILINE = 8;
+ public static final int REG_MULTILINE = 0x08;
/**
* Execution flag.
@@ -185,14 +186,14 @@
* // m4.toString(): "fool"<BR>
* </CODE>
*/
- public static final int REG_NOTBOL = 16;
+ public static final int REG_NOTBOL = 0x10;
/**
* Execution flag.
* The match-end operator ($) does not match at the end
* of the input string. Useful for matching on substrings.
*/
- public static final int REG_NOTEOL = 32;
+ public static final int REG_NOTEOL = 0x20;
/**
* Execution flag.
@@ -206,7 +207,7 @@
* the example under REG_NOTBOL. It also affects the use of the \<
* and \b operators.
*/
- public static final int REG_ANCHORINDEX = 64;
+ public static final int REG_ANCHORINDEX = 0x40;
/**
* Execution flag.
@@ -215,8 +216,25 @@
* the corresponding subexpressions. For example, you may want to
* replace all matches of "one dollar" with "$1".
*/
- public static final int REG_NO_INTERPOLATE = 128;
+ public static final int REG_NO_INTERPOLATE = 0x80;
+ /**
+ * Execution flag.
+ * Try to match the whole input string. An implicit match-end operator
+ * is added to this regexp.
+ */
+ public static final int REG_TRY_ENTIRE_MATCH = 0x0100;
+
+ /**
+ * Execution flag.
+ * The substitute and substituteAll methods will treat the
+ * character '\' in the replacement as an escape to a literal
+ * character. In this case "\n", "\$", "\\", "\x40" and "\012"
+ * will become "n", "$", "\", "x40" and "012" respectively.
+ * This flag has no effect if REG_NO_INTERPOLATE is set.
+ */
+ public static final int REG_REPLACE_USE_BACKSLASHESCAPE = 0x0200;
+
/** Returns a string representing the version of the gnu.regexp package. */
public static final String version() {
return VERSION;
@@ -273,12 +291,13 @@
}
// internal constructor used for alternation
- private RE(REToken first, REToken last,int subs, int subIndex, int minLength) {
+ private RE(REToken first, REToken last,int subs, int subIndex, int minLength, int maxLength) {
super(subIndex);
firstToken = first;
lastToken = last;
numSubs = subs;
minimumLength = minLength;
+ maximumLength = maxLength;
addToken(new RETokenEndSub(subIndex));
}
@@ -333,6 +352,11 @@
char ch;
boolean quot = false;
+ // Saved syntax and flags.
+ RESyntax savedSyntax = null;
+ int savedCflags = 0;
+ boolean flagsSaved = false;
+
while (index < pLength) {
// read the next character unit (including backslash escapes)
index = getCharUnit(pattern,index,unit,quot);
@@ -359,8 +383,9 @@
&& !syntax.get(RESyntax.RE_LIMITED_OPS)) {
// make everything up to here be a branch. create vector if nec.
addToken(currentToken);
- RE theBranch = new RE(firstToken, lastToken, numSubs, subIndex, minimumLength);
+ RE theBranch = new RE(firstToken, lastToken, numSubs, subIndex, minimumLength, maximumLength);
minimumLength = 0;
+ maximumLength = 0;
if (branches == null) {
branches = new Vector();
}
@@ -374,6 +399,9 @@
//
// OPEN QUESTION:
// what is proper interpretation of '{' at start of string?
+ //
+ // This method used to raise "repeat.empty.token" to reject regexps such
+ // as "(a*){2,}", but such patterns are now allowed.
else if ((unit.ch == '{') && syntax.get(RESyntax.RE_INTERVALS) && (syntax.get(RESyntax.RE_NO_BK_BRACES) ^ (unit.bk || quot))) {
int newIndex = getMinMax(pattern,index,minMax,syntax);
@@ -386,8 +414,6 @@
throw new REException(getLocalizedMessage("repeat.chained"),REException.REG_BADRPT,newIndex);
if (currentToken instanceof RETokenWordBoundary || currentToken instanceof RETokenWordBoundary)
throw new REException(getLocalizedMessage("repeat.assertion"),REException.REG_BADRPT,newIndex);
- if ((currentToken.getMinimumLength() == 0) && (minMax.second == Integer.MAX_VALUE))
- throw new REException(getLocalizedMessage("repeat.empty.token"),REException.REG_BADRPT,newIndex);
index = newIndex;
currentToken = setRepeated(currentToken,minMax.first,minMax.second,index);
}
@@ -403,6 +429,8 @@
else if ((unit.ch == '[') && !(unit.bk || quot)) {
Vector options = new Vector();
boolean negative = false;
+ // FIXME: lastChar == 0 means lastChar is not set. But what if
+ // \u0000 is used as a meaningful character?
char lastChar = 0;
if (index == pLength) throw new REException(getLocalizedMessage("unmatched.bracket"),REException.REG_EBRACK,index);
@@ -426,6 +454,13 @@
options.addElement(new RETokenChar(subIndex,lastChar,insens));
lastChar = '-';
} else {
+ if ((ch == '\\') && syntax.get(RESyntax.RE_BACKSLASH_ESCAPE_IN_LISTS)) {
+ CharExpression ce = getCharExpression(pattern, index, pLength, syntax);
+ if (ce == null)
+ throw new REException("invalid escape sequence", REException.REG_ESCAPE, index);
+ ch = ce.ch;
+ index = index + ce.len - 1;
+ }
options.addElement(new RETokenRange(subIndex,lastChar,ch,insens));
lastChar = 0;
index++;
@@ -434,7 +469,10 @@
if (index == pLength) throw new REException(getLocalizedMessage("class.no.end"),REException.REG_EBRACK,index);
int posixID = -1;
boolean negate = false;
+ // FIXME: asciiEsc == 0 means asciiEsc is not set. But what if
+ // \u0000 is used as a meaningful character?
char asciiEsc = 0;
+ NamedProperty np = null;
if (("dswDSW".indexOf(pattern[index]) != -1) && syntax.get(RESyntax.RE_CHAR_CLASS_ESC_IN_LISTS)) {
switch (pattern[index]) {
case 'D':
@@ -454,23 +492,25 @@
break;
}
}
- else if ("nrt".indexOf(pattern[index]) != -1) {
- switch (pattern[index]) {
- case 'n':
- asciiEsc = '\n';
- break;
- case 't':
- asciiEsc = '\t';
- break;
- case 'r':
- asciiEsc = '\r';
- break;
- }
- }
+ if (("pP".indexOf(pattern[index]) != -1) && syntax.get(RESyntax.RE_NAMED_PROPERTY)) {
+ np = getNamedProperty(pattern, index - 1, pLength);
+ if (np == null)
+ throw new REException("invalid escape sequence", REException.REG_ESCAPE, index);
+ index = index - 1 + np.len - 1;
+ }
+ else {
+ CharExpression ce = getCharExpression(pattern, index - 1, pLength, syntax);
+ if (ce == null)
+ throw new REException("invalid escape sequence", REException.REG_ESCAPE, index);
+ asciiEsc = ce.ch;
+ index = index - 1 + ce.len - 1;
+ }
if (lastChar != 0) options.addElement(new RETokenChar(subIndex,lastChar,insens));
if (posixID != -1) {
options.addElement(new RETokenPOSIX(subIndex,posixID,insens,negate));
+ } else if (np != null) {
+ options.addElement(getRETokenNamedProperty(subIndex,np,insens,index));
} else if (asciiEsc != 0) {
lastChar = asciiEsc;
} else {
@@ -506,7 +546,10 @@
boolean pure = false;
boolean comment = false;
boolean lookAhead = false;
+ boolean lookBehind = false;
+ boolean independent = false;
boolean negativelh = false;
+ boolean negativelb = false;
if ((index+1 < pLength) && (pattern[index] == '?')) {
switch (pattern[index+1]) {
case '!':
@@ -524,6 +567,114 @@
index += 2;
}
break;
+ case '<':
+ // We assume that if the syntax supports look-ahead,
+ // it also supports look-behind.
+ if (syntax.get(RESyntax.RE_LOOKAHEAD)) {
+ index++;
+ switch (pattern[index +1]) {
+ case '!':
+ pure = true;
+ negativelb = true;
+ lookBehind = true;
+ index += 2;
+ break;
+ case '=':
+ pure = true;
+ lookBehind = true;
+ index += 2;
+ }
+ }
+ break;
+ case '>':
+ // We assume that if the syntax supports look-ahead,
+ // it also supports independent group.
+ if (syntax.get(RESyntax.RE_LOOKAHEAD)) {
+ pure = true;
+ independent = true;
+ index += 2;
+ }
+ break;
+ case 'i':
+ case 'd':
+ case 'm':
+ case 's':
+ // case 'u': not supported
+ // case 'x': not supported
+ case '-':
+ if (!syntax.get(RESyntax.RE_EMBEDDED_FLAGS)) break;
+ // Set or reset syntax flags.
+ int flagIndex = index + 1;
+ int endFlag = -1;
+ RESyntax newSyntax = new RESyntax(syntax);
+ int newCflags = cflags;
+ boolean negate = false;
+ while (flagIndex < pLength && endFlag < 0) {
+ switch(pattern[flagIndex]) {
+ case 'i':
+ if (negate)
+ newCflags &= ~REG_ICASE;
+ else
+ newCflags |= REG_ICASE;
+ flagIndex++;
+ break;
+ case 'd':
+ if (negate)
+ newSyntax.setLineSeparator(RESyntax.DEFAULT_LINE_SEPARATOR);
+ else
+ newSyntax.setLineSeparator("\n");
+ flagIndex++;
+ break;
+ case 'm':
+ if (negate)
+ newCflags &= ~REG_MULTILINE;
+ else
+ newCflags |= REG_MULTILINE;
+ flagIndex++;
+ break;
+ case 's':
+ if (negate)
+ newCflags &= ~REG_DOT_NEWLINE;
+ else
+ newCflags |= REG_DOT_NEWLINE;
+ flagIndex++;
+ break;
+ // case 'u': not supported
+ // case 'x': not supported
+ case '-':
+ negate = true;
+ flagIndex++;
+ break;
+ case ':':
+ case ')':
+ endFlag = pattern[flagIndex];
+ break;
+ default:
+ throw new REException(getLocalizedMessage("repeat.no.token"), REException.REG_BADRPT, index);
+ }
+ }
+ if (endFlag == ')') {
+ syntax = newSyntax;
+ cflags = newCflags;
+ insens = ((cflags & REG_ICASE) > 0);
+ // This can be treated as though it were a comment.
+ comment = true;
+ index = flagIndex - 1;
+ break;
+ }
+ if (endFlag == ':') {
+ savedSyntax = syntax;
+ savedCflags = cflags;
+ flagsSaved = true;
+ syntax = newSyntax;
+ cflags = newCflags;
+ insens = ((cflags & REG_ICASE) > 0);
+ index = flagIndex -1;
+ // Fall through to the next case.
+ }
+ else {
+ throw new REException(getLocalizedMessage("unmatched.paren"), REException.REG_ESUBREG,index);
+ }
case ':':
if (syntax.get(RESyntax.RE_PURE_GROUPING)) {
pure = true;
@@ -550,13 +701,50 @@
int nested = 0;
while ( ((nextIndex = getCharUnit(pattern,endIndex,unit,false)) > 0)
- && !(nested == 0 && (unit.ch == ')') && (syntax.get(RESyntax.RE_NO_BK_PARENS) ^ (unit.bk || quot))) )
+ && !(nested == 0 && (unit.ch == ')') && (syntax.get(RESyntax.RE_NO_BK_PARENS) ^ (unit.bk || quot))) ) {
if ((endIndex = nextIndex) >= pLength)
throw new REException(getLocalizedMessage("subexpr.no.end"),REException.REG_ESUBREG,nextIndex);
+ else if ((unit.ch == '[') && !(unit.bk || quot)) {
+ // I hate to do something similar to the LIST OPERATOR matters
+ // above, but ...
+ int listIndex = nextIndex;
+ if (listIndex < pLength && pattern[listIndex] == '^') listIndex++;
+ if (listIndex < pLength && pattern[listIndex] == ']') listIndex++;
+ int listEndIndex = -1;
+ int listNest = 0;
+ while (listIndex < pLength && listEndIndex < 0) {
+ switch(pattern[listIndex++]) {
+ case '\\':
+ listIndex++;
+ break;
+ case '[':
+ // Sun's API document says that regexp like "[a-d[m-p]]"
+ // is legal. Even something like "[[[^]]]]" is accepted.
+ listNest++;
+ if (listIndex < pLength && pattern[listIndex] == '^') listIndex++;
+ if (listIndex < pLength && pattern[listIndex] == ']') listIndex++;
+ break;
+ case ']':
+ if (listNest == 0)
+ listEndIndex = listIndex;
+ listNest--;
+ break;
+ }
+ }
+ if (listEndIndex >= 0) {
+ nextIndex = listEndIndex;
+ if ((endIndex = nextIndex) >= pLength)
+ throw new REException(getLocalizedMessage("subexpr.no.end"),REException.REG_ESUBREG,nextIndex);
+ else
+ continue;
+ }
+ throw new REException(getLocalizedMessage("subexpr.no.end"),REException.REG_ESUBREG,nextIndex);
+ }
else if (unit.ch == '(' && (syntax.get(RESyntax.RE_NO_BK_PARENS) ^ (unit.bk || quot)))
nested++;
else if (unit.ch == ')' && (syntax.get(RESyntax.RE_NO_BK_PARENS) ^ (unit.bk || quot)))
nested--;
+ }
// endIndex is now position at a ')','\)'
// nextIndex is end of string or position after ')' or '\)'
@@ -569,15 +757,28 @@
numSubs++;
}
- int useIndex = (pure || lookAhead) ? 0 : nextSub + numSubs;
+ int useIndex = (pure || lookAhead || lookBehind || independent) ?
+ 0 : nextSub + numSubs;
currentToken = new RE(String.valueOf(pattern,index,endIndex-index).toCharArray(),cflags,syntax,useIndex,nextSub + numSubs);
numSubs += ((RE) currentToken).getNumSubs();
if (lookAhead) {
currentToken = new RETokenLookAhead(currentToken,negativelh);
}
+ else if (lookBehind) {
+ currentToken = new RETokenLookBehind(currentToken,negativelb);
+ }
+ else if (independent) {
+ currentToken = new RETokenIndependent(currentToken);
+ }
index = nextIndex;
+ if (flagsSaved) {
+ syntax = savedSyntax;
+ cflags = savedCflags;
+ insens = ((cflags & REG_ICASE) > 0);
+ flagsSaved = false;
+ }
} // not a comment
} // subexpression
@@ -616,6 +817,9 @@
// ZERO-OR-MORE REPEAT OPERATOR
// *
+ //
+ // This method used to check "repeat.empty.token" to avoid such regexp
+ // as "(a*)*", but now "repeat.empty.token" is allowed.
else if ((unit.ch == '*') && !(unit.bk || quot)) {
if (currentToken == null)
@@ -624,14 +828,15 @@
throw new REException(getLocalizedMessage("repeat.chained"),REException.REG_BADRPT,index);
if (currentToken instanceof RETokenWordBoundary || currentToken instanceof RETokenWordBoundary)
throw new REException(getLocalizedMessage("repeat.assertion"),REException.REG_BADRPT,index);
- if (currentToken.getMinimumLength() == 0)
- throw new REException(getLocalizedMessage("repeat.empty.token"),REException.REG_BADRPT,index);
currentToken = setRepeated(currentToken,0,Integer.MAX_VALUE,index);
}
// ONE-OR-MORE REPEAT OPERATOR / POSSESSIVE MATCHING OPERATOR
// + | \+ depending on RE_BK_PLUS_QM
// not available if RE_LIMITED_OPS is set
+ //
+ // This method used to check "repeat.empty.token" to avoid such regexp
+ // as "(a*)+", but now "repeat.empty.token" is allowed.
else if ((unit.ch == '+') && !syntax.get(RESyntax.RE_LIMITED_OPS) && (!syntax.get(RESyntax.RE_BK_PLUS_QM) ^ (unit.bk || quot))) {
if (currentToken == null)
@@ -648,8 +853,6 @@
}
else if (currentToken instanceof RETokenWordBoundary || currentToken instanceof RETokenWordBoundary)
throw new REException(getLocalizedMessage("repeat.assertion"),REException.REG_BADRPT,index);
- else if (currentToken.getMinimumLength() == 0)
- throw new REException(getLocalizedMessage("repeat.empty.token"),REException.REG_BADRPT,index);
else
currentToken = setRepeated(currentToken,1,Integer.MAX_VALUE,index);
}
@@ -675,14 +878,45 @@
else
currentToken = setRepeated(currentToken,0,1,index);
}
+
+ // OCTAL CHARACTER
+ // \0377
+ else if (unit.bk && (unit.ch == '0') && syntax.get(RESyntax.RE_OCTAL_CHAR)) {
+ CharExpression ce = getCharExpression(pattern, index - 2, pLength, syntax);
+ if (ce == null)
+ throw new REException("invalid octal character", REException.REG_ESCAPE, index);
+ index = index - 2 + ce.len;
+ addToken(currentToken);
+ currentToken = new RETokenChar(subIndex,ce.ch,insens);
+ }
+
// BACKREFERENCE OPERATOR
- // \1 \2 ... \9
+ // \1 \2 ... \9 and \10 \11 \12 ...
// not available if RE_NO_BK_REFS is set
+ // Perl recognizes \10, \11, and so on only if enough
+ // parentheses have been opened before them; otherwise they are
+ // treated as aliases of \010, \011, ... (octal characters). In
+ // Sun's JDK, an octal character expression must always begin with \0.
+ // We do as the JDK does. But FIXME: consider "(a)(b)\29".
+ // The JDK treats \2 as a back reference to the 2nd group because
+ // there are only two groups. Our implementation, however,
+ // treats \29 as a back reference to the 29th group.
else if (unit.bk && Character.isDigit(unit.ch) && !syntax.get(RESyntax.RE_NO_BK_REFS)) {
addToken(currentToken);
- currentToken = new RETokenBackRef(subIndex,Character.digit(unit.ch,10),insens);
+ int numBegin = index - 1;
+ int numEnd = pLength;
+ for (int i = index; i < pLength; i++) {
+ if (! Character.isDigit(pattern[i])) {
+ numEnd = i;
+ break;
+ }
+ }
+ int num = parseInt(pattern, numBegin, numEnd-numBegin, 10);
+
+ currentToken = new RETokenBackRef(subIndex,num,insens);
+ index = numEnd;
}
// START OF STRING OPERATOR
@@ -804,6 +1038,32 @@
currentToken = new RETokenEnd(subIndex,null);
}
+ // HEX CHARACTER, UNICODE CHARACTER
+ // \x1B, \u1234
+
+ else if ((unit.bk && (unit.ch == 'x') && syntax.get(RESyntax.RE_HEX_CHAR)) ||
+ (unit.bk && (unit.ch == 'u') && syntax.get(RESyntax.RE_UNICODE_CHAR))) {
+ CharExpression ce = getCharExpression(pattern, index - 2, pLength, syntax);
+ if (ce == null)
+ throw new REException("invalid hex character", REException.REG_ESCAPE, index);
+ index = index - 2 + ce.len;
+ addToken(currentToken);
+ currentToken = new RETokenChar(subIndex,ce.ch,insens);
+ }
+
+ // NAMED PROPERTY
+ // \p{prop}, \P{prop}
+
+ else if ((unit.bk && (unit.ch == 'p') && syntax.get(RESyntax.RE_NAMED_PROPERTY)) ||
+ (unit.bk && (unit.ch == 'P') && syntax.get(RESyntax.RE_NAMED_PROPERTY))) {
+ NamedProperty np = getNamedProperty(pattern, index - 2, pLength);
+ if (np == null)
+ throw new REException("invalid escape sequence", REException.REG_ESCAPE, index);
+ index = index - 2 + np.len;
+ addToken(currentToken);
+ currentToken = getRETokenNamedProperty(subIndex,np,insens,index);
+ }
+
// NON-SPECIAL CHARACTER (or escape to make literal)
// c | \* for example
@@ -817,9 +1077,10 @@
addToken(currentToken);
if (branches != null) {
- branches.addElement(new RE(firstToken,lastToken,numSubs,subIndex,minimumLength));
+ branches.addElement(new RE(firstToken,lastToken,numSubs,subIndex,minimumLength, maximumLength));
branches.trimToSize(); // compact the Vector
minimumLength = 0;
+ maximumLength = 0;
firstToken = lastToken = null;
addToken(new RETokenOneOf(subIndex,branches,false));
}
@@ -838,7 +1099,177 @@
return index;
}
+ private static int parseInt(char[] input, int pos, int len, int radix) {
+ int ret = 0;
+ for (int i = pos; i < pos + len; i++) {
+ ret = ret * radix + Character.digit(input[i], radix);
+ }
+ return ret;
+ }
+
/**
+ * This class represents various expressions for a character.
+ * "a" : 'a' itself.
+ * "\0123" : Octal char 0123
+ * "\x1b" : Hex char 0x1b
+ * "\u1234" : Unicode char \u1234
+ */
+ private static class CharExpression {
+ /** character represented by this expression */
+ char ch;
+ /** String expression */
+ String expr;
+ /** length of this expression */
+ int len;
+ public String toString() { return expr; }
+ }
+
+ private CharExpression getCharExpression(char[] input, int pos, int lim,
+ RESyntax syntax) {
+ CharExpression ce = new CharExpression();
+ char c = input[pos];
+ if (c == '\\') {
+ if (pos + 1 >= lim) return null;
+ c = input[pos + 1];
+ switch(c) {
+ case 't':
+ ce.ch = '\t';
+ ce.len = 2;
+ break;
+ case 'n':
+ ce.ch = '\n';
+ ce.len = 2;
+ break;
+ case 'r':
+ ce.ch = '\r';
+ ce.len = 2;
+ break;
+ case 'x':
+ case 'u':
+ if ((c == 'x' && syntax.get(RESyntax.RE_HEX_CHAR)) ||
+ (c == 'u' && syntax.get(RESyntax.RE_UNICODE_CHAR))) {
+ int l = 0;
+ int expectedLength = (c == 'x' ? 2 : 4);
+ for (int i = pos + 2; i < pos + 2 + expectedLength; i++) {
+ if (i >= lim) break;
+ if (!((input[i] >= '0' && input[i] <= '9') ||
+ (input[i] >= 'A' && input[i] <= 'F') ||
+ (input[i] >= 'a' && input[i] <= 'f')))
+ break;
+ l++;
+ }
+ if (l != expectedLength) return null;
+ ce.ch = (char)(parseInt(input, pos + 2, l, 16));
+ ce.len = l + 2;
+ }
+ else {
+ ce.ch = c;
+ ce.len = 2;
+ }
+ break;
+ case '0':
+ if (syntax.get(RESyntax.RE_OCTAL_CHAR)) {
+ int l = 0;
+ for (int i = pos + 2; i < pos + 2 + 3; i++) {
+ if (i >= lim) break;
+ if (input[i] < '0' || input[i] > '7') break;
+ l++;
+ }
+ if (l == 3 && input[pos + 2] > '3') l--;
+ if (l <= 0) return null;
+ ce.ch = (char)(parseInt(input, pos + 2, l, 8));
+ ce.len = l + 2;
+ }
+ else {
+ ce.ch = c;
+ ce.len = 2;
+ }
+ break;
+ default:
+ ce.ch = c;
+ ce.len = 2;
+ break;
+ }
+ }
+ else {
+ ce.ch = input[pos];
+ ce.len = 1;
+ }
+ ce.expr = new String(input, pos, ce.len);
+ return ce;
+ }
+
+ /**
+ * This class represents a substring in a pattern string expressing
+ * a named property.
+ * "\pA" : Property named "A"
+ * "\p{prop}" : Property named "prop"
+ * "\PA" : Property named "A" (Negated)
+ * "\P{prop}" : Property named "prop" (Negated)
+ */
+ private static class NamedProperty {
+ /** Property name */
+ String name;
+ /** Negated or not */
+ boolean negate;
+ /** length of this expression */
+ int len;
+ }
+
+ private NamedProperty getNamedProperty(char[] input, int pos, int lim) {
+ NamedProperty np = new NamedProperty();
+ char c = input[pos];
+ if (c == '\\') {
+ if (++pos >= lim) return null;
+ c = input[pos++];
+ switch(c) {
+ case 'p':
+ np.negate = false;
+ break;
+ case 'P':
+ np.negate = true;
+ break;
+ default:
+ return null;
+ }
+ c = input[pos++];
+ if (c == '{') {
+ int p = -1;
+ for (int i = pos; i < lim; i++) {
+ if (input[i] == '}') {
+ p = i;
+ break;
+ }
+ }
+ if (p < 0) return null;
+ int len = p - pos;
+ np.name = new String(input, pos, len);
+ np.len = len + 4;
+ }
+ else {
+ np.name = new String(input, pos - 1, 1);
+ np.len = 3;
+ }
+ return np;
+ }
+ else return null;
+ }
+
+ private static RETokenNamedProperty getRETokenNamedProperty(
+ int subIndex, NamedProperty np, boolean insens, int index)
+ throws REException {
+ try {
+ return new RETokenNamedProperty(subIndex, np.name, insens, np.negate);
+ }
+ catch (REException e) {
+ REException ree;
+ ree = new REException(e.getMessage(), REException.REG_ESCAPE, index);
+ ree.initCause(e);
+ throw ree;
+ }
+ }
+
+ /**
* Checks if the regular expression matches the input in its entirety.
*
* @param input The input text.
@@ -918,6 +1349,10 @@
return minimumLength;
}
+ public int getMaximumLength() {
+ return maximumLength;
+ }
+
/**
* Returns an array of all matches found in the input.
*
@@ -985,7 +1420,9 @@
/* Implements abstract method REToken.match() */
boolean match(CharIndexed input, REMatch mymatch) {
- if (firstToken == null) return next(input, mymatch);
+ if (firstToken == null) {
+ return next(input, mymatch);
+ }
// Note the start of this subexpression
mymatch.start[subIndex] = mymatch.index;
@@ -1049,23 +1486,34 @@
}
REMatch getMatchImpl(CharIndexed input, int anchor, int eflags, StringBuffer buffer) {
+ boolean tryEntireMatch = ((eflags & REG_TRY_ENTIRE_MATCH) != 0);
+ RE re = (tryEntireMatch ? (RE) this.clone() : this);
+ if (tryEntireMatch) {
+ re.chain(new RETokenEnd(0, null));
+ }
// Create a new REMatch to hold results
REMatch mymatch = new REMatch(numSubs, anchor, eflags);
do {
// Optimization: check if anchor + minimumLength > length
if (minimumLength == 0 || input.charAt(minimumLength-1) != CharIndexed.OUT_OF_BOUNDS) {
- if (match(input, mymatch)) {
- // Find longest match of them all to observe leftmost longest
- REMatch longest = mymatch;
+ if (re.match(input, mymatch)) {
+ REMatch best = mymatch;
+ // We assume that the match that comes first is the best.
+ // And the following "The longer, the better" rule has
+ // been commented out. The longest is not necessarily
+ // the best. For example, "a" out of "aaa" is the best
+ // match for /a+?/.
+ /*
+ // Find best match of them all to observe leftmost longest
while ((mymatch = mymatch.next) != null) {
- if (mymatch.index > longest.index) {
- longest = mymatch;
+ if (mymatch.index > best.index) {
+ best = mymatch;
}
}
-
- longest.end[0] = longest.index;
- longest.finish(input);
- return longest;
+ */
+ best.end[0] = best.index;
+ best.finish(input);
+ return best;
}
}
mymatch.clear(++anchor);
@@ -1176,8 +1624,7 @@
StringBuffer buffer = new StringBuffer();
REMatch m = getMatchImpl(input,index,eflags,buffer);
if (m==null) return buffer.toString();
- buffer.append( ((eflags & REG_NO_INTERPOLATE) > 0) ?
- replace : m.substituteInto(replace) );
+ buffer.append(getReplacement(replace, m, eflags));
if (input.move(m.end[0])) {
do {
buffer.append(input.charAt(0));
@@ -1238,8 +1685,7 @@
StringBuffer buffer = new StringBuffer();
REMatch m;
while ((m = getMatchImpl(input,index,eflags,buffer)) != null) {
- buffer.append( ((eflags & REG_NO_INTERPOLATE) > 0) ?
- replace : m.substituteInto(replace) );
+ buffer.append(getReplacement(replace, m, eflags));
index = m.getEndIndex();
if (m.end[0] == 0) {
char ch = input.charAt(0);
@@ -1254,11 +1700,50 @@
}
return buffer.toString();
}
+
+ public static String getReplacement(String replace, REMatch m, int eflags) {
+ if ((eflags & REG_NO_INTERPOLATE) > 0)
+ return replace;
+ else {
+ if ((eflags & REG_REPLACE_USE_BACKSLASHESCAPE) > 0) {
+ StringBuffer sb = new StringBuffer();
+ int l = replace.length();
+ for (int i = 0; i < l; i++) {
+ char c = replace.charAt(i);
+ switch(c) {
+ case '\\':
+ i++;
+ // Let StringIndexOutOfBoundsException be thrown.
+ sb.append(replace.charAt(i));
+ break;
+ case '$':
+ int i1 = i + 1;
+ while (i1 < replace.length() &&
+ Character.isDigit(replace.charAt(i1))) i1++;
+ sb.append(m.substituteInto(replace.substring(i, i1)));
+ i = i1 - 1;
+ break;
+ default:
+ sb.append(c);
+ }
+ }
+ return sb.toString();
+ }
+ else
+ return m.substituteInto(replace);
+ }
+ }
/* Helper function for constructor */
private void addToken(REToken next) {
if (next == null) return;
minimumLength += next.getMinimumLength();
+ int nmax = next.getMaximumLength();
+ if (nmax < Integer.MAX_VALUE && maximumLength < Integer.MAX_VALUE)
+ maximumLength += nmax;
+ else
+ maximumLength = Integer.MAX_VALUE;
+
if (firstToken == null) {
lastToken = firstToken = next;
} else {
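Note on the RE.java hunks above: getReplacement scans the replacement string once, treating a backslash as "take the next character literally" and "$" followed by digits as a group reference. The following is a minimal standalone sketch of that scanning loop, not the committed code. The class ReplacementSketch, the method expand, and the String[] standing in for REMatch are hypothetical, and the handling of a bare "$" or an out-of-range group is an assumption, since REMatch.substituteInto is not shown in this part of the patch.

```java
final class ReplacementSketch {
    /**
     * Expands a replacement template. groups[0] is the whole match and
     * groups[n] is capture group n (a stand-in for REMatch).
     */
    static String expand(String replace, String[] groups) {
        StringBuilder sb = new StringBuilder();
        int l = replace.length();
        for (int i = 0; i < l; i++) {
            char c = replace.charAt(i);
            if (c == '\\') {
                // The next character is copied literally. As in the patch,
                // a trailing backslash throws StringIndexOutOfBoundsException.
                i++;
                sb.append(replace.charAt(i));
            } else if (c == '$') {
                int end = i + 1;
                while (end < l && Character.isDigit(replace.charAt(end))) end++;
                if (end == i + 1) {
                    // Assumption: a '$' without digits is copied unchanged.
                    sb.append('$');
                } else {
                    // Multi-digit references such as $10 are read as a whole.
                    int n = 0;
                    for (int k = i + 1; k < end && n < groups.length; k++)
                        n = n * 10 + Character.digit(replace.charAt(k), 10);
                    // Assumption: an unknown group expands to nothing.
                    if (n < groups.length && groups[n] != null) sb.append(groups[n]);
                }
                i = end - 1;
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] groups = { "ab", "a" };
        System.out.println(expand("[$1]", groups));
        System.out.println(expand("\\$1=$1", groups));
    }
}
```

With groups {"ab", "a"}, the template [$1] expands to [a], and an escaped dollar sign stays literal while the unescaped reference is still substituted.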
Index: classpath/gnu/regexp/REToken.java
===================================================================
--- classpath/gnu/regexp/REToken.java (revision 110832)
+++ classpath/gnu/regexp/REToken.java (working copy)
@@ -38,12 +38,21 @@
package gnu.regexp;
import java.io.Serializable;
-abstract class REToken implements Serializable {
+abstract class REToken implements Serializable, Cloneable {
protected REToken next = null;
protected REToken uncle = null;
protected int subIndex;
+ public Object clone() {
+ try {
+ REToken copy = (REToken) super.clone();
+ return copy;
+ } catch (CloneNotSupportedException e) {
+ throw new Error(); // doesn't happen
+ }
+ }
+
protected REToken(int subIndex) {
this.subIndex = subIndex;
}
@@ -52,6 +61,10 @@
return 0;
}
+ int getMaximumLength() {
+ return Integer.MAX_VALUE;
+ }
+
void setUncle(REToken anUncle) {
uncle = anUncle;
}
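Note on getMaximumLength: the default in REToken returns Integer.MAX_VALUE, which the rest of the patch treats as "unbounded" (see addToken in RE.java and RETokenRepeated.getMaximumLength). A minimal sketch of that saturating arithmetic follows. MaxLengthSketch and its method names are hypothetical, inputs are assumed to be non-negative, and the use of long arithmetic to guard against int overflow is my addition; the patch multiplies and adds plain ints.

```java
final class MaxLengthSketch {
    /** Sentinel meaning "no upper bound", as in the patch. */
    static final int UNBOUNDED = Integer.MAX_VALUE;

    /** Sum of two maximum lengths, saturating at UNBOUNDED. */
    static int add(int a, int b) {
        if (a == UNBOUNDED || b == UNBOUNDED) return UNBOUNDED;
        long sum = (long) a + b;
        return sum >= UNBOUNDED ? UNBOUNDED : (int) sum;
    }

    /** Maximum length of a token repeated at most max times. */
    static int repeat(int tokenMax, int max) {
        // Same order of checks as RETokenRepeated.getMaximumLength.
        if (max == UNBOUNDED) return UNBOUNDED;
        if (tokenMax == UNBOUNDED) return UNBOUNDED;
        long product = (long) tokenMax * max;
        return product >= UNBOUNDED ? UNBOUNDED : (int) product;
    }

    public static void main(String[] args) {
        System.out.println(add(1, 2));
        System.out.println(repeat(3, 4));
        System.out.println(repeat(1, UNBOUNDED) == UNBOUNDED);
    }
}
```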
Index: classpath/gnu/regexp/RETokenWordBoundary.java
===================================================================
--- classpath/gnu/regexp/RETokenWordBoundary.java (revision 110832)
+++ classpath/gnu/regexp/RETokenWordBoundary.java (working copy)
@@ -52,6 +52,11 @@
this.where = where;
this.negated = negated;
}
+
+ int getMaximumLength() {
+ return 0;
+ }
+
boolean match(CharIndexed input, REMatch mymatch) {
// Word boundary means input[index-1] was a word character
Index: classpath/gnu/regexp/RETokenEndSub.java
===================================================================
--- classpath/gnu/regexp/RETokenEndSub.java (revision 110832)
+++ classpath/gnu/regexp/RETokenEndSub.java (working copy)
@@ -41,6 +41,10 @@
RETokenEndSub(int subIndex) {
super(subIndex);
}
+
+ int getMaximumLength() {
+ return 0;
+ }
boolean match(CharIndexed input, REMatch mymatch) {
mymatch.end[subIndex] = mymatch.index;
Index: classpath/gnu/regexp/CharIndexedInputStream.java
===================================================================
--- classpath/gnu/regexp/CharIndexedInputStream.java (revision 110832)
+++ classpath/gnu/regexp/CharIndexedInputStream.java (working copy)
@@ -1,5 +1,5 @@
/* gnu/regexp/CharIndexedInputStream.java
- Copyright (C) 1998-2001, 2004 Free Software Foundation, Inc.
+ Copyright (C) 1998-2001, 2004, 2006 Free Software Foundation, Inc.
This file is part of GNU Classpath.
@@ -145,5 +145,15 @@
public boolean isValid() {
return (cached != OUT_OF_BOUNDS);
}
+
+ public CharIndexed lookBehind(int index, int length) {
+ throw new UnsupportedOperationException(
+ "difficult to look behind for an input stream");
+ }
+
+ public int length() {
+ throw new UnsupportedOperationException(
+ "difficult to tell the length for an input stream");
+ }
}
Index: classpath/gnu/regexp/CharIndexedCharArray.java
===================================================================
--- classpath/gnu/regexp/CharIndexedCharArray.java (revision 110832)
+++ classpath/gnu/regexp/CharIndexedCharArray.java (working copy)
@@ -1,5 +1,5 @@
/* gnu/regexp/CharIndexedCharArray.java
- Copyright (C) 1998-2001, 2004 Free Software Foundation, Inc.
+ Copyright (C) 1998-2001, 2004, 2006 Free Software Foundation, Inc.
This file is part of GNU Classpath.
@@ -59,4 +59,13 @@
public boolean move(int index) {
return ((anchor += index) < s.length);
}
+
+ public CharIndexed lookBehind(int index, int length) {
+ if (length > (anchor + index)) length = anchor + index;
+ return new CharIndexedCharArray(s, anchor + index - length);
+ }
+
+ public int length() {
+ return s.length - anchor;
+ }
}
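Note on lookBehind: the char-array version clamps the requested length so the window never starts before the beginning of the input. A minimal sketch of that clamping follows. LookBehindSketch is hypothetical, and it returns the window as a String for easy inspection, whereas the patch returns a new CharIndexedCharArray anchored at the start of the window. It assumes 0 <= anchor + index <= s.length and length >= 0.

```java
final class LookBehindSketch {
    /** Returns up to length characters that end just before anchor + index. */
    static String lookBehind(char[] s, int anchor, int index, int length) {
        int end = anchor + index;
        // Fewer characters than requested may be available on the left.
        if (length > end) length = end;
        return new String(s, end - length, length);
    }

    public static void main(String[] args) {
        char[] s = "abcdef".toCharArray();
        System.out.println(lookBehind(s, 2, 1, 2));
        System.out.println(lookBehind(s, 0, 1, 5));
    }
}
```

For the input abcdef, asking for 2 characters before position 3 gives bc, while asking for 5 characters before position 1 gives only a.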
Index: classpath/gnu/regexp/RESyntax.java
===================================================================
--- classpath/gnu/regexp/RESyntax.java (revision 110832)
+++ classpath/gnu/regexp/RESyntax.java (working copy)
@@ -202,9 +202,34 @@
*/
public static final int RE_POSSESSIVE_OPS = 25;
- private static final int BIT_TOTAL = 26;
+ /**
+ * Syntax bit. Allow embedded flags, (?is-x), as in Perl5.
+ */
+ public static final int RE_EMBEDDED_FLAGS = 26;
/**
+ * Syntax bit. Allow octal char (\0377), as in Perl5.
+ */
+ public static final int RE_OCTAL_CHAR = 27;
+
+ /**
+ * Syntax bit. Allow hex char (\x1b), as in Perl5.
+ */
+ public static final int RE_HEX_CHAR = 28;
+
+ /**
+ * Syntax bit. Allow Unicode char (\u1234), as in Java 1.4.
+ */
+ public static final int RE_UNICODE_CHAR = 29;
+
+ /**
+ * Syntax bit. Allow named property (\p{P}, \P{p}), as in Perl5.
+ */
+ public static final int RE_NAMED_PROPERTY = 30;
+
+ private static final int BIT_TOTAL = 31;
+
+ /**
* Predefined syntax.
* Emulates regular expression support in the awk utility.
*/
@@ -422,6 +447,10 @@
.set(RE_STRING_ANCHORS) // \A,\Z
.set(RE_CHAR_CLASS_ESC_IN_LISTS)// \d,\D,\w,\W,\s,\S within []
.set(RE_COMMENTS) // (?#)
+ .set(RE_EMBEDDED_FLAGS) // (?imsx-imsx)
+ .set(RE_OCTAL_CHAR) // \0377
+ .set(RE_HEX_CHAR) // \x1b
+ .set(RE_NAMED_PROPERTY) // \p{prop}, \P{prop}
.makeFinal();
RE_SYNTAX_PERL5_S = new RESyntax(RE_SYNTAX_PERL5)
@@ -431,6 +460,7 @@
RE_SYNTAX_JAVA_1_4 = new RESyntax(RE_SYNTAX_PERL5)
// XXX
.set(RE_POSSESSIVE_OPS) // *+,?+,++,{}+
+ .set(RE_UNICODE_CHAR) // \u1234
.makeFinal();
}
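Note on the new syntax bits: the escapes they enable can be tried against the JDK's own java.util.regex, which this patch aims to be compatible with. The sketch below uses only java.util.regex.Pattern from the JDK, not gnu.regexp, so it shows the reference behaviour rather than the patched code. EscapeSyntaxDemo and fullMatch are hypothetical names.

```java
final class EscapeSyntaxDemo {
    /** True if the whole input matches the regex under JDK semantics. */
    static boolean fullMatch(String regex, String input) {
        return java.util.regex.Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // RE_OCTAL_CHAR: octal 101 is 'A'.
        System.out.println(fullMatch("\\0101", "A"));
        // A three-digit octal value may not start above 3, so this is
        // octal 47 (an apostrophe) followed by a literal 7.
        System.out.println(fullMatch("\\0477", "'7"));
        // RE_HEX_CHAR: two hex digits.
        System.out.println(fullMatch("\\x41", "A"));
        // RE_UNICODE_CHAR: four hex digits.
        System.out.println(fullMatch("\\u0041", "A"));
        // RE_NAMED_PROPERTY: an uppercase letter, then anything else.
        System.out.println(fullMatch("\\p{Lu}\\P{Lu}", "Ab"));
        // The same escapes are accepted inside a character list.
        System.out.println(fullMatch("[\\x41\\t]+", "A\tA"));
    }
}
```

Each call above is expected to return true on a standard JDK.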
Index: classpath/gnu/regexp/CharIndexed.java
===================================================================
--- classpath/gnu/regexp/CharIndexed.java (revision 110832)
+++ classpath/gnu/regexp/CharIndexed.java (working copy)
@@ -1,5 +1,5 @@
/* gnu/regexp/CharIndexed.java
- Copyright (C) 1998-2001, 2004 Free Software Foundation, Inc.
+ Copyright (C) 1998-2001, 2004, 2006 Free Software Foundation, Inc.
This file is part of GNU Classpath.
@@ -81,4 +81,16 @@
* position at a valid position in the input.
*/
boolean isValid();
+
+ /**
+ * Returns another CharIndexed containing length characters to the left
+ * of the given index. The given length is an expected maximum, and
+ * the returned CharIndexed may contain fewer characters.
+ */
+ CharIndexed lookBehind(int index, int length);
+
+ /**
+ * Returns the effective length of this CharIndexed
+ */
+ int length();
}
Index: classpath/gnu/regexp/RETokenAny.java
===================================================================
--- classpath/gnu/regexp/RETokenAny.java (revision 110832)
+++ classpath/gnu/regexp/RETokenAny.java (working copy)
@@ -55,6 +55,10 @@
return 1;
}
+ int getMaximumLength() {
+ return 1;
+ }
+
boolean match(CharIndexed input, REMatch mymatch) {
char ch = input.charAt(mymatch.index);
if ((ch == CharIndexed.OUT_OF_BOUNDS)
Index: classpath/gnu/regexp/RETokenLookAhead.java
===================================================================
--- classpath/gnu/regexp/RETokenLookAhead.java (revision 110832)
+++ classpath/gnu/regexp/RETokenLookAhead.java (working copy)
@@ -52,6 +52,10 @@
this.negative = negative;
}
+ int getMaximumLength() {
+ return 0;
+ }
+
boolean match(CharIndexed input, REMatch mymatch)
{
REMatch trymatch = (REMatch)mymatch.clone();
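Note on the group constructs: besides look-ahead, this patch adds look-behind, independent groups and embedded flags to the parser in RE.java. Their intended behaviour can be checked against the JDK's java.util.regex, as in the sketch below. It exercises the JDK, not gnu.regexp, and GroupConstructDemo and indexOfMatch are hypothetical names.

```java
final class GroupConstructDemo {
    /** Start index of the first match of regex in input, or -1 if none. */
    static int indexOfMatch(String regex, String input) {
        java.util.regex.Matcher m =
            java.util.regex.Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // Positive look-behind: only the b preceded by a qualifies.
        System.out.println(indexOfMatch("(?<=a)b", "cbab"));
        // Negative look-behind: only the b not preceded by a qualifies.
        System.out.println(indexOfMatch("(?<!a)b", "abcb"));
        // Independent group: a+ keeps every a, so the trailing "ab" fails.
        System.out.println(indexOfMatch("(?>a+)ab", "aaab"));
        // An ordinary non-capturing group backtracks and succeeds.
        System.out.println(indexOfMatch("(?:a+)ab", "aaab"));
        // Embedded flag limited to the group: b stays case-sensitive.
        System.out.println(indexOfMatch("(?i:a)b", "AB"));
    }
}
```

On a standard JDK these print 3, 3, -1, 0 and -1 respectively.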
Index: classpath/gnu/regexp/RETokenRepeated.java
===================================================================
--- classpath/gnu/regexp/RETokenRepeated.java (revision 110832)
+++ classpath/gnu/regexp/RETokenRepeated.java (working copy)
@@ -45,12 +45,14 @@
private int min,max;
private boolean stingy;
private boolean possessive;
+ private boolean alwaysEmpty; // Special case of {0}
RETokenRepeated(int subIndex, REToken token, int min, int max) {
super(subIndex);
this.token = token;
this.min = min;
this.max = max;
+ alwaysEmpty = (min == 0 && max == 0);
}
/** Sets the minimal matching mode to true. */
@@ -82,6 +84,36 @@
return (min * token.getMinimumLength());
}
+ int getMaximumLength() {
+ if (max == Integer.MAX_VALUE) return Integer.MAX_VALUE;
+ int tmax = token.getMaximumLength();
+ if (tmax == Integer.MAX_VALUE) return tmax;
+ return (max * tmax);
+ }
+
+ boolean stopMatchingIfSatisfied = true;
+
+ private static REMatch findDoables(REToken tk,
+ CharIndexed input, REMatch mymatch) {
+
+ REMatch.REMatchList doables = new REMatch.REMatchList();
+
+ // try next repeat at all possible positions
+ for (REMatch current = mymatch;
+ current != null; current = current.next) {
+ REMatch recurrent = (REMatch) current.clone();
+ int origin = recurrent.index;
+ tk = (REToken) tk.clone();
+ tk.next = tk.uncle = null;
+ if (tk.match(input, recurrent)) {
+ if (recurrent.index == origin) recurrent.empty = true;
+ // add all items in current to doables array
+ doables.addTail(recurrent);
+ }
+ }
+ return doables.head;
+ }
+
// We do need to save every possible point, but the number of clone()
// invocations here is really a killer for performance on non-stingy
// repeat operators. I'm open to suggestions...
@@ -91,58 +123,34 @@
// the subexpression back-reference operator allow that?
boolean match(CharIndexed input, REMatch mymatch) {
- // number of times we've matched so far
- int numRepeats = 0;
-
// Possible positions for the next repeat to match at
REMatch newMatch = mymatch;
- REMatch last = null;
- REMatch current;
- // Add the '0-repeats' index
- // positions.elementAt(z) == position [] in input after <<z>> matches
- Vector positions = new Vector();
- positions.addElement(newMatch);
+ // {0} needs some special treatment.
+ if (alwaysEmpty) {
+ REMatch result = matchRest(input, newMatch);
+ if (result != null) {
+ mymatch.assignFrom(result);
+ return true;
+ }
+ else {
+ return false;
+ }
+ }
+
+ // number of times we've matched so far
+ int numRepeats = 0;
- // Declare variables used in loop
REMatch doables;
- REMatch doablesLast;
- REMatch recurrent;
+ int lastIndex = mymatch.index;
+ boolean emptyMatchFound = false;
- do {
- // Check for stingy match for each possibility.
- if (stingy && (numRepeats >= min)) {
- REMatch result = matchRest(input, newMatch);
- if (result != null) {
- mymatch.assignFrom(result);
- return true;
- }
- }
+ while (numRepeats < min) {
+ doables = findDoables(token, input, newMatch);
- doables = null;
- doablesLast = null;
-
- // try next repeat at all possible positions
- for (current = newMatch; current != null; current = current.next) {
- recurrent = (REMatch) current.clone();
- if (token.match(input, recurrent)) {
- // add all items in current to doables array
- if (doables == null) {
- doables = recurrent;
- doablesLast = recurrent;
- } else {
- // Order these from longest to shortest
- // Start by assuming longest (more repeats)
- doablesLast.next = recurrent;
- }
- // Find new doablesLast
- while (doablesLast.next != null) {
- doablesLast = doablesLast.next;
- }
- }
- }
- // if none of the possibilities worked out, break out of do/while
- if (doables == null) break;
+ // if none of the possibilities worked out,
+ // it means that minimum number of repeats could not be found.
+ if (doables == null) return false;
// reassign where the next repeat can match
newMatch = doables;
@@ -150,44 +158,92 @@
// increment how many repeats we've successfully found
++numRepeats;
- positions.addElement(newMatch);
- } while (numRepeats < max);
-
- // If there aren't enough repeats, then fail
- if (numRepeats < min) return false;
-
- // We're greedy, but ease off until a true match is found
- int posIndex = positions.size();
-
- // At this point we've either got too many or just the right amount.
- // See if this numRepeats works with the rest of the regexp.
- REMatch allResults = null;
- REMatch allResultsLast = null;
+ if (newMatch.empty) {
+ numRepeats = min;
+ emptyMatchFound = true;
+ break;
+ }
+ lastIndex = newMatch.index;
+ }
- REMatch results = null;
- while (--posIndex >= min) {
- newMatch = (REMatch) positions.elementAt(posIndex);
- results = matchRest(input, newMatch);
- if (results != null) {
- if (allResults == null) {
- allResults = results;
- allResultsLast = results;
- } else {
- // Order these from longest to shortest
- // Start by assuming longest (more repeats)
- allResultsLast.next = results;
+ Vector positions = new Vector();
+
+ while (numRepeats <= max) {
+ // We want to check something like
+ // if (stingy)
+ // and skip the further matching. But experience shows that
+ // skipping it may cause incomplete matching.
+ // For example, if we skip the seemingly unnecessary
+ // matching, /^(b+?|a){1,2}?c/ cannot match "bbc".
+ // On the other hand, if we do not stop the unnecessary
+ // matching, /(([a-c])b*?\2)*/ matches "ababbbcbc"
+ // entirely when we want to find only "ababb".
+ // In order to make regression tests pass, we do as we did.
+ if (stopMatchingIfSatisfied && stingy) {
+ REMatch results = matchRest(input, newMatch);
+ if (results != null) {
+ mymatch.assignFrom(results);
+ return true;
}
- // Find new doablesLast
- while (allResultsLast.next != null) {
- allResultsLast = allResultsLast.next;
+ }
+ positions.add(newMatch);
+ if (emptyMatchFound) break;
+
+ doables = findDoables(token, input, newMatch);
+ if (doables == null) break;
+
+ // doables.index == lastIndex occurs either
+ // (1) when an empty string was the longest
+ // that matched this token.
+ // or
+ // (2) when the same string matches this token many times.
+ // For example, "acbab" itself matches "a.*b" and
+ // its substrings "acb" and "ab" also match.
+ // In this case, we do not have to go further until
+ // numRepeats == max because the more numRepeats grows,
+ // the shorter the substring matching this token becomes.
+ // So the previous successful match must have been the best
+ // match. But this is not necessarily the case if stingy.
+ if (doables.index == lastIndex) {
+ if (doables.empty) {
+ emptyMatchFound = true;
+ }
+ else {
+ if (!stingy) break;
}
}
- // else did not match rest of the tokens, try again on smaller sample
- // or break out when performing possessive matching
- if (possessive) break;
+ numRepeats++;
+ newMatch = doables;
+ lastIndex = newMatch.index;
}
- if (allResults != null) {
- mymatch.assignFrom(allResults); // does this get all?
+
+ // We're greedy, but ease off until a true match is found.
+ // At this point we've either got too many or just the right amount.
+ // See if this numRepeats works with the rest of the regexp.
+
+ REMatch.REMatchList allResults = new REMatch.REMatchList();
+
+ int posCount = positions.size();
+ int posIndex = (stingy ? 0 : posCount - 1);
+
+ while (posCount-- > 0) {
+ REMatch m = (REMatch) positions.elementAt(posIndex);
+ if (stingy) posIndex++; else posIndex--;
+
+ REMatch results = matchRest(input, m);
+ if (results != null) {
+ // Order these from longest to shortest
+ // Start by assuming longest (more repeats)
+ // If stingy the order is shortest to longest.
+ allResults.addTail(results);
+ }
+ else {
+ if (possessive) break;
+ }
+ }
+
+ if (allResults.head != null) {
+ mymatch.assignFrom(allResults.head); // does this get all?
return true;
}
// If we fall out, no matches.
@@ -196,27 +252,17 @@
private REMatch matchRest(CharIndexed input, final REMatch newMatch) {
REMatch current, single;
- REMatch doneIndex = null;
- REMatch doneIndexLast = null;
+ REMatch.REMatchList doneIndex = new REMatch.REMatchList();
// Test all possible matches for this number of repeats
for (current = newMatch; current != null; current = current.next) {
// clone() separates a single match from the chain
single = (REMatch) current.clone();
if (next(input, single)) {
// chain results to doneIndex
- if (doneIndex == null) {
- doneIndex = single;
- doneIndexLast = single;
- } else {
- doneIndexLast.next = single;
- }
- // Find new doneIndexLast
- while (doneIndexLast.next != null) {
- doneIndexLast = doneIndexLast.next;
- }
+ doneIndex.addTail(single);
}
}
- return doneIndex;
+ return doneIndex.head;
}
void dump(StringBuffer os) {
Index: classpath/gnu/regexp/RETokenNamedProperty.java
===================================================================
--- classpath/gnu/regexp/RETokenNamedProperty.java (revision 0)
+++ classpath/gnu/regexp/RETokenNamedProperty.java (revision 0)
@@ -0,0 +1,301 @@
+/* gnu/regexp/RETokenNamedProperty.java
+ Copyright (C) 2006 Free Software Foundation, Inc.
+
+This file is part of GNU Classpath.
+
+GNU Classpath is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2, or (at your option)
+any later version.
+
+GNU Classpath is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GNU Classpath; see the file COPYING. If not, write to the
+Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+02110-1301 USA.
+
+Linking this library statically or dynamically with other modules is
+making a combined work based on this library. Thus, the terms and
+conditions of the GNU General Public License cover the whole
+combination.
+
+As a special exception, the copyright holders of this library give you
+permission to link this library with independent modules to produce an
+executable, regardless of the license terms of these independent
+modules, and to copy and distribute the resulting executable under
+terms of your choice, provided that you also meet, for each linked
+independent module, the terms and conditions of the license of that
+module. An independent module is a module which is not derived from
+or based on this library. If you modify this library, you may extend
+this exception to your version of the library, but you are not
+obligated to do so. If you do not wish to do so, delete this
+exception statement from your version. */
+
+
+package gnu.regexp;
+
+final class RETokenNamedProperty extends REToken {
+ String name;
+ boolean insens;
+ boolean negate;
+ Handler handler;
+
+ // Grouped properties
+ static final byte[] LETTER = new byte[]
+ { Character.LOWERCASE_LETTER,
+ Character.UPPERCASE_LETTER,
+ Character.TITLECASE_LETTER,
+ Character.MODIFIER_LETTER,
+ Character.OTHER_LETTER };
+
+ static final byte[] MARK = new byte[]
+ { Character.NON_SPACING_MARK,
+ Character.COMBINING_SPACING_MARK,
+ Character.ENCLOSING_MARK };
+
+ static final byte[] SEPARATOR = new byte[]
+ { Character.SPACE_SEPARATOR,
+ Character.LINE_SEPARATOR,
+ Character.PARAGRAPH_SEPARATOR };
+
+ static final byte[] SYMBOL = new byte[]
+ { Character.MATH_SYMBOL,
+ Character.CURRENCY_SYMBOL,
+ Character.MODIFIER_SYMBOL,
+ Character.OTHER_SYMBOL };
+
+ static final byte[] NUMBER = new byte[]
+ { Character.DECIMAL_DIGIT_NUMBER,
+ Character.LETTER_NUMBER,
+ Character.OTHER_NUMBER };
+
+ static final byte[] PUNCTUATION = new byte[]
+ { Character.DASH_PUNCTUATION,
+ Character.START_PUNCTUATION,
+ Character.END_PUNCTUATION,
+ Character.CONNECTOR_PUNCTUATION,
+ Character.OTHER_PUNCTUATION,
+ Character.INITIAL_QUOTE_PUNCTUATION,
+ Character.FINAL_QUOTE_PUNCTUATION};
+
+ static final byte[] OTHER = new byte[]
+ { Character.CONTROL,
+ Character.FORMAT,
+ Character.PRIVATE_USE,
+ Character.SURROGATE,
+ Character.UNASSIGNED };
+
+ RETokenNamedProperty(int subIndex, String name, boolean insens, boolean negate) throws REException {
+ super(subIndex);
+ this.name = name;
+ this.insens = insens;
+ this.negate = negate;
+ handler = getHandler(name);
+ }
+
+ int getMinimumLength() {
+ return 1;
+ }
+
+ int getMaximumLength() {
+ return 1;
+ }
+
+ boolean match(CharIndexed input, REMatch mymatch) {
+ char ch = input.charAt(mymatch.index);
+ if (ch == CharIndexed.OUT_OF_BOUNDS)
+ return false;
+
+ boolean retval = handler.includes(ch);
+ if (insens) {
+ retval = retval ||
+ handler.includes(Character.toUpperCase(ch)) ||
+ handler.includes(Character.toLowerCase(ch));
+ }
+
+ if (negate) retval = !retval;
+ if (retval) {
+ ++mymatch.index;
+ return next(input, mymatch);
+ }
+ else return false;
+ }
+
+ void dump(StringBuffer os) {
+ os.append("\\")
+ .append(negate ? "P" : "p")
+ .append("{" + name + "}");
+ }
+
+ private abstract static class Handler {
+ public abstract boolean includes(char c);
+ }
+
+ private Handler getHandler(String name) throws REException {
+ if (name.equals("Lower") ||
+ name.equals("Upper") ||
+ // name.equals("ASCII") ||
+ name.equals("Alpha") ||
+ name.equals("Digit") ||
+ name.equals("Alnum") ||
+ name.equals("Punct") ||
+ name.equals("Graph") ||
+ name.equals("Print") ||
+ name.equals("Blank") ||
+ name.equals("Cntrl") ||
+ name.equals("XDigit") ||
+ name.equals("Space") ) {
+ return new POSIXHandler(name);
+ }
+ if (name.startsWith("In")) {
+ try {
+ name = name.substring(2);
+ Character.UnicodeBlock block = Character.UnicodeBlock.forName(name);
+ return new UnicodeBlockHandler(block);
+ }
+ catch (IllegalArgumentException e) {
+ throw new REException("Invalid Unicode block name: " + name, REException.REG_ESCAPE, 0);
+ }
+ }
+ if (name.startsWith("Is")) {
+ name = name.substring(2);
+ }
+
+ // "grouped properties"
+ if (name.equals("L"))
+ return new UnicodeCategoriesHandler(LETTER);
+ if (name.equals("M"))
+ return new UnicodeCategoriesHandler(MARK);
+ if (name.equals("Z"))
+ return new UnicodeCategoriesHandler(SEPARATOR);
+ if (name.equals("S"))
+ return new UnicodeCategoriesHandler(SYMBOL);
+ if (name.equals("N"))
+ return new UnicodeCategoriesHandler(NUMBER);
+ if (name.equals("P"))
+ return new UnicodeCategoriesHandler(PUNCTUATION);
+ if (name.equals("C"))
+ return new UnicodeCategoriesHandler(OTHER);
+
+ if (name.equals("Mc"))
+ return new UnicodeCategoryHandler(Character.COMBINING_SPACING_MARK);
+ if (name.equals("Pc"))
+ return new UnicodeCategoryHandler(Character.CONNECTOR_PUNCTUATION);
+ if (name.equals("Cc"))
+ return new UnicodeCategoryHandler(Character.CONTROL);
+ if (name.equals("Sc"))
+ return new UnicodeCategoryHandler(Character.CURRENCY_SYMBOL);
+ if (name.equals("Pd"))
+ return new UnicodeCategoryHandler(Character.DASH_PUNCTUATION);
+ if (name.equals("Nd"))
+ return new UnicodeCategoryHandler(Character.DECIMAL_DIGIT_NUMBER);
+ if (name.equals("Me"))
+ return new UnicodeCategoryHandler(Character.ENCLOSING_MARK);
+ if (name.equals("Pe"))
+ return new UnicodeCategoryHandler(Character.END_PUNCTUATION);
+ if (name.equals("Pf"))
+ return new UnicodeCategoryHandler(Character.FINAL_QUOTE_PUNCTUATION);
+ if (name.equals("Cf"))
+ return new UnicodeCategoryHandler(Character.FORMAT);
+ if (name.equals("Pi"))
+ return new UnicodeCategoryHandler(Character.INITIAL_QUOTE_PUNCTUATION);
+ if (name.equals("Nl"))
+ return new UnicodeCategoryHandler(Character.LETTER_NUMBER);
+ if (name.equals("Zl"))
+ return new UnicodeCategoryHandler(Character.LINE_SEPARATOR);
+ if (name.equals("Ll"))
+ return new UnicodeCategoryHandler(Character.LOWERCASE_LETTER);
+ if (name.equals("Sm"))
+ return new UnicodeCategoryHandler(Character.MATH_SYMBOL);
+ if (name.equals("Lm"))
+ return new UnicodeCategoryHandler(Character.MODIFIER_LETTER);
+ if (name.equals("Sk"))
+ return new UnicodeCategoryHandler(Character.MODIFIER_SYMBOL);
+ if (name.equals("Mn"))
+ return new UnicodeCategoryHandler(Character.NON_SPACING_MARK);
+ if (name.equals("Lo"))
+ return new UnicodeCategoryHandler(Character.OTHER_LETTER);
+ if (name.equals("No"))
+ return new UnicodeCategoryHandler(Character.OTHER_NUMBER);
+ if (name.equals("Po"))
+ return new UnicodeCategoryHandler(Character.OTHER_PUNCTUATION);
+ if (name.equals("So"))
+ return new UnicodeCategoryHandler(Character.OTHER_SYMBOL);
+ if (name.equals("Zp"))
+ return new UnicodeCategoryHandler(Character.PARAGRAPH_SEPARATOR);
+ if (name.equals("Co"))
+ return new UnicodeCategoryHandler(Character.PRIVATE_USE);
+ if (name.equals("Zs"))
+ return new UnicodeCategoryHandler(Character.SPACE_SEPARATOR);
+ if (name.equals("Ps"))
+ return new UnicodeCategoryHandler(Character.START_PUNCTUATION);
+ if (name.equals("Cs"))
+ return new UnicodeCategoryHandler(Character.SURROGATE);
+ if (name.equals("Lt"))
+ return new UnicodeCategoryHandler(Character.TITLECASE_LETTER);
+ if (name.equals("Cn"))
+ return new UnicodeCategoryHandler(Character.UNASSIGNED);
+ if (name.equals("Lu"))
+ return new UnicodeCategoryHandler(Character.UPPERCASE_LETTER);
+ throw new REException("unsupported name " + name, REException.REG_ESCAPE, 0);
+ }
+
+ private static class POSIXHandler extends Handler {
+ private RETokenPOSIX retoken;
+ private REMatch mymatch = new REMatch(0,0,0);
+ private char[] chars = new char[1];
+ private CharIndexedCharArray ca = new CharIndexedCharArray(chars, 0);
+ public POSIXHandler(String name) {
+ int posixId = RETokenPOSIX.intValue(name.toLowerCase());
+ if (posixId != -1)
+ retoken = new RETokenPOSIX(0,posixId,false,false);
+ else
+ throw new RuntimeException("Unknown posix ID: " + name);
+ }
+ public boolean includes(char c) {
+ chars[0] = c;
+ mymatch.index = 0;
+ return retoken.match(ca, mymatch);
+ }
+ }
+
+ private static class UnicodeCategoryHandler extends Handler {
+ public UnicodeCategoryHandler(byte category) {
+ this.category = (int)category;
+ }
+ private int category;
+ public boolean includes(char c) {
+ return Character.getType(c) == category;
+ }
+ }
+
+ private static class UnicodeCategoriesHandler extends Handler {
+ public UnicodeCategoriesHandler(byte[] categories) {
+ this.categories = categories;
+ }
+ private byte[] categories;
+ public boolean includes(char c) {
+ int category = Character.getType(c);
+ for (int i = 0; i < categories.length; i++)
+ if (category == categories[i])
+ return true;
+ return false;
+ }
+ }
+
+ private static class UnicodeBlockHandler extends Handler {
+ public UnicodeBlockHandler(Character.UnicodeBlock block) {
+ this.block = block;
+ }
+ private Character.UnicodeBlock block;
+ public boolean includes(char c) {
+ Character.UnicodeBlock cblock = Character.UnicodeBlock.of(c);
+ return (cblock != null && cblock.equals(block));
+ }
+ }
+
+}
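For readers following the new \p{...} support, the category and block handlers above boil down to two standard java.lang.Character lookups. The sketch below is illustrative only and is not part of the patch; the class name CategorySketch and its helper methods are made up for the example, and it assumes a runtime where Character.UnicodeBlock.forName is available (this patch adds it to libgcj).

```java
// Illustrative sketch, not part of the patch. CategorySketch and its
// methods are hypothetical names; only java.lang.Character is used.
final class CategorySketch {
    // Same grouping as the LETTER array in RETokenNamedProperty (\p{L}).
    static final byte[] LETTER = {
        Character.LOWERCASE_LETTER,
        Character.UPPERCASE_LETTER,
        Character.TITLECASE_LETTER,
        Character.MODIFIER_LETTER,
        Character.OTHER_LETTER
    };

    // Mirrors UnicodeCategoriesHandler.includes: true if the general
    // category of c is one of the listed categories.
    static boolean includes(char c, byte[] categories) {
        int type = Character.getType(c);
        for (int i = 0; i < categories.length; i++) {
            if (type == categories[i]) return true;
        }
        return false;
    }

    // Mirrors UnicodeBlockHandler.includes: the name is what remains
    // after the "In" prefix has been stripped, e.g. "BasicLatin".
    static boolean inBlock(char c, String blockName) {
        Character.UnicodeBlock block = Character.UnicodeBlock.forName(blockName);
        return block.equals(Character.UnicodeBlock.of(c));
    }

    public static void main(String[] args) {
        System.out.println(includes('a', LETTER));
        System.out.println(includes('1', LETTER));
        System.out.println(inBlock('a', "BasicLatin"));
    }
}
```

Note that forName throws IllegalArgumentException for an unknown block name, which is why getHandler above converts that into an REException.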
Index: classpath/gnu/regexp/REMatch.java
===================================================================
--- classpath/gnu/regexp/REMatch.java (revision 110832)
+++ classpath/gnu/regexp/REMatch.java (working copy)
@@ -67,6 +67,8 @@
int[] start; // start positions (relative to offset) for each (sub)exp.
int[] end; // end positions for the same
REMatch next; // other possibility (to avoid having to use arrays)
+ boolean empty; // empty string matched. This flag is used only within
+ // RETokenRepeated.
public Object clone() {
try {
@@ -177,7 +179,9 @@
* @param sub Index of the subexpression.
*/
public String toString(int sub) {
- if ((sub >= start.length) || (start[sub] == -1)) return "";
+ if ((sub >= start.length) || sub < 0)
+ throw new IndexOutOfBoundsException("No group " + sub);
+ if (start[sub] == -1) return null;
return (matchedText.substring(start[sub],end[sub]));
}
@@ -242,6 +246,8 @@
* <code>$0</code> through <code>$9</code>. <code>$0</code> matches
* the full substring matched; <code>$<i>n</i></code> matches
* subexpression number <i>n</i>.
+ * <code>$10, $11, ...</code> may match the 10th, 11th, ... subexpressions
+ * if such subexpressions exist.
*
* @param input A string consisting of literals and <code>$<i>n</i></code> tokens.
*/
@@ -252,6 +258,16 @@
for (pos = 0; pos < input.length()-1; pos++) {
if ((input.charAt(pos) == '$') && (Character.isDigit(input.charAt(pos+1)))) {
int val = Character.digit(input.charAt(++pos),10);
+ int pos1 = pos + 1;
+ while (pos1 < input.length() &&
+ Character.isDigit(input.charAt(pos1))) {
+ int val1 = val*10 + Character.digit(input.charAt(pos1),10);
+ if (val1 >= start.length) break;
+ pos1++;
+ val = val1;
+ }
+ pos = pos1 - 1;
+
if (val < start.length) {
output.append(toString(val));
}
@@ -260,4 +276,42 @@
if (pos < input.length()) output.append(input.charAt(pos));
return output.toString();
}
+
+ static class REMatchList {
+ REMatch head;
+ REMatch tail;
+ REMatchList() {
+ head = tail = null;
+ }
+ /* Not used now. But we may need this some day?
+ void addHead(REMatch newone) {
+ if (head == null) {
+ head = newone;
+ tail = newone;
+ while (tail.next != null) {
+ tail = tail.next;
+ }
+ }
+ else {
+ REMatch tmp = newone;
+ while (tmp.next != null) tmp = tmp.next;
+ tmp.next = head;
+ head = newone;
+ }
+ }
+ */
+ void addTail(REMatch newone) {
+ if (head == null) {
+ head = newone;
+ tail = newone;
+ }
+ else {
+ tail.next = newone;
+ }
+ while (tail.next != null) {
+ tail = tail.next;
+ }
+ }
+ }
+
}
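The multi-digit $n handling added to substituteInto above can be tried in isolation. The following is a standalone sketch, not the patched method: SubstituteSketch is a hypothetical class, the groups array stands in for the start/end arrays of REMatch, and the else branch that copies literal characters is assumed from context because the hunk does not show it.

```java
// Standalone sketch of the $n parsing loop; names are hypothetical.
final class SubstituteSketch {
    // groups[0] is the whole match, groups[n] is subexpression n.
    static String substituteInto(String input, String[] groups) {
        StringBuilder output = new StringBuilder();
        int pos;
        for (pos = 0; pos < input.length() - 1; pos++) {
            if (input.charAt(pos) == '$'
                && Character.isDigit(input.charAt(pos + 1))) {
                int val = Character.digit(input.charAt(++pos), 10);
                // Keep consuming digits while the longer number still
                // names an existing group.
                int pos1 = pos + 1;
                while (pos1 < input.length()
                       && Character.isDigit(input.charAt(pos1))) {
                    int val1 = val * 10 + Character.digit(input.charAt(pos1), 10);
                    if (val1 >= groups.length) break;
                    pos1++;
                    val = val1;
                }
                pos = pos1 - 1;
                if (val < groups.length) output.append(groups[val]);
            } else {
                // Assumed: literal characters are copied through.
                output.append(input.charAt(pos));
            }
        }
        if (pos < input.length()) output.append(input.charAt(pos));
        return output.toString();
    }

    public static void main(String[] args) {
        String[] few = { "whole", "a", "b" };
        System.out.println(substituteInto("$1-$12", few));
    }
}
```

With only two subexpressions, "$12" is read as group 1 followed by a literal "2"; with twelve or more it is read as group 12, which appears to be the behaviour the ChangeLog entry "Replace $n even if n>=10" describes.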
Index: classpath/gnu/regexp/RETokenRange.java
===================================================================
--- classpath/gnu/regexp/RETokenRange.java (revision 110832)
+++ classpath/gnu/regexp/RETokenRange.java (working copy)
@@ -43,19 +43,32 @@
RETokenRange(int subIndex, char lo, char hi, boolean ins) {
super(subIndex);
- this.lo = (insens = ins) ? Character.toLowerCase(lo) : lo;
- this.hi = ins ? Character.toLowerCase(hi) : hi;
+ insens = ins;
+ this.lo = lo;
+ this.hi = hi;
}
int getMinimumLength() {
return 1;
}
+ int getMaximumLength() {
+ return 1;
+ }
+
boolean match(CharIndexed input, REMatch mymatch) {
char c = input.charAt(mymatch.index);
if (c == CharIndexed.OUT_OF_BOUNDS) return false;
- if (insens) c = Character.toLowerCase(c);
- if ((c >= lo) && (c <= hi)) {
+ boolean matches = (c >= lo) && (c <= hi);
+ if (! matches && insens) {
+ char c1 = Character.toLowerCase(c);
+ matches = (c1 >= lo) && (c1 <= hi);
+ if (!matches) {
+ c1 = Character.toUpperCase(c);
+ matches = (c1 >= lo) && (c1 <= hi);
+ }
+ }
+ if (matches) {
++mymatch.index;
return next(input, mymatch);
}
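The effect of the RETokenRange change is easier to see outside the token class. This is a sketch under the assumption that only the range test matters; RangeSketch and inRange are hypothetical names, and the bounds are kept exactly as written in the pattern, as the new constructor does.

```java
// Sketch of the case-insensitive range test; names are hypothetical.
final class RangeSketch {
    static boolean inRange(char c, char lo, char hi, boolean insens) {
        boolean matches = (c >= lo) && (c <= hi);
        if (!matches && insens) {
            // Try the lower-case form first, then the upper-case form.
            char c1 = Character.toLowerCase(c);
            matches = (c1 >= lo) && (c1 <= hi);
            if (!matches) {
                c1 = Character.toUpperCase(c);
                matches = (c1 >= lo) && (c1 <= hi);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(inRange('a', 'A', 'Z', true));
        System.out.println(inRange('a', 'A', 'Z', false));
    }
}
```

The old code lower-cased both bounds up front, which changes the set of characters a range such as A-Z covers; comparing the candidate in both cases against the untouched bounds avoids that.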
Index: classpath/gnu/regexp/RETokenBackRef.java
===================================================================
--- classpath/gnu/regexp/RETokenBackRef.java (revision 110832)
+++ classpath/gnu/regexp/RETokenBackRef.java (working copy)
@@ -51,13 +51,25 @@
// should implement getMinimumLength() -- any ideas?
boolean match(CharIndexed input, REMatch mymatch) {
+ if (num >= mymatch.start.length) return false;
+ if (num >= mymatch.end.length) return false;
int b,e;
b = mymatch.start[num];
e = mymatch.end[num];
if ((b==-1)||(e==-1)) return false; // this shouldn't happen, but...
for (int i=b; i<e; i++) {
- if (input.charAt(mymatch.index+i-b) != input.charAt(i)) {
- return false;
+ char c1 = input.charAt(mymatch.index+i-b);
+ char c2 = input.charAt(i);
+ if (c1 != c2) {
+ if (insens) {
+ if (c1 != Character.toLowerCase(c2) &&
+ c1 != Character.toUpperCase(c2)) {
+ return false;
+ }
+ }
+ else {
+ return false;
+ }
}
}
mymatch.index += e-b;
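At the java.util.regex level, the back-reference change should make a case-insensitive pattern accept a back-reference that differs only in case. The check below is a sketch that uses the public API only; BackRefSketch is a hypothetical name, and the stated behaviour is what a standard JDK does, which the patched gnu.regexp code is presumably meant to match.

```java
import java.util.regex.Pattern;

// Sketch using the public API only; BackRefSketch is a hypothetical name.
final class BackRefSketch {
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // "(a)\1" against "aA": the back-reference sees 'A' where
        // group 1 captured 'a'.
        System.out.println(matches("(a)\\1", Pattern.CASE_INSENSITIVE, "aA"));
        System.out.println(matches("(a)\\1", 0, "aA"));
    }
}
```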
Index: classpath/gnu/regexp/RETokenStart.java
===================================================================
--- classpath/gnu/regexp/RETokenStart.java (revision 110832)
+++ classpath/gnu/regexp/RETokenStart.java (working copy)
@@ -44,6 +44,10 @@
super(subIndex);
this.newline = newline;
}
+
+ int getMaximumLength() {
+ return 0;
+ }
boolean match(CharIndexed input, REMatch mymatch) {
// charAt(index-n) may be unknown on a Reader/InputStream. FIXME
Index: classpath/gnu/regexp/RETokenIndependent.java
===================================================================
--- classpath/gnu/regexp/RETokenIndependent.java (revision 0)
+++ classpath/gnu/regexp/RETokenIndependent.java (revision 0)
@@ -0,0 +1,76 @@
+/* gnu/regexp/RETokenIndependent.java
+ Copyright (C) 2006 Free Software Foundation, Inc.
+
+This file is part of GNU Classpath.
+
+GNU Classpath is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2, or (at your option)
+any later version.
+
+GNU Classpath is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GNU Classpath; see the file COPYING. If not, write to the
+Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+02110-1301 USA.
+
+Linking this library statically or dynamically with other modules is
+making a combined work based on this library. Thus, the terms and
+conditions of the GNU General Public License cover the whole
+combination.
+
+As a special exception, the copyright holders of this library give you
+permission to link this library with independent modules to produce an
+executable, regardless of the license terms of these independent
+modules, and to copy and distribute the resulting executable under
+terms of your choice, provided that you also meet, for each linked
+independent module, the terms and conditions of the license of that
+module. An independent module is a module which is not derived from
+or based on this library. If you modify this library, you may extend
+this exception to your version of the library, but you are not
+obligated to do so. If you do not wish to do so, delete this
+exception statement from your version. */
+
+package gnu.regexp;
+
+/**
+ * @author Ito Kazumitsu
+ */
+final class RETokenIndependent extends REToken
+{
+ REToken re;
+
+ RETokenIndependent(REToken re) throws REException {
+ super(0);
+ this.re = re;
+ }
+
+ int getMinimumLength() {
+ return re.getMinimumLength();
+ }
+
+ int getMaximumLength() {
+ return re.getMaximumLength();
+ }
+
+ boolean match(CharIndexed input, REMatch mymatch)
+ {
+ if (re.match(input, mymatch)) {
+ // Once we have found a match, we do not see other possible matches.
+ mymatch.next = null;
+ return next(input, mymatch);
+ }
+ return false;
+ }
+
+ void dump(StringBuffer os) {
+ os.append("(?>");
+ re.dumpAll(os);
+ os.append(')');
+ }
+}
+
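RETokenIndependent implements the (?>...) construct: once the inner expression has matched, the alternatives it could have produced are discarded (mymatch.next = null), so later tokens cannot force it to give characters back. A small sketch through the public java.util.regex API, with the hypothetical class name IndependentSketch, shows the difference from an ordinary group as a standard JDK behaves:

```java
import java.util.regex.Pattern;

// Sketch using the public API only; IndependentSketch is a hypothetical name.
final class IndependentSketch {
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Ordinary repetition backtracks: a+ gives back one 'a'.
        System.out.println(matches("a+ab", "aaab"));
        // Independent group keeps all three 'a's, so "ab" cannot follow.
        System.out.println(matches("(?>a+)ab", "aaab"));
    }
}
```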
Index: classpath/gnu/regexp/RETokenPOSIX.java
===================================================================
--- classpath/gnu/regexp/RETokenPOSIX.java (revision 110832)
+++ classpath/gnu/regexp/RETokenPOSIX.java (working copy)
@@ -81,6 +81,10 @@
return 1;
}
+ int getMaximumLength() {
+ return 1;
+ }
+
boolean match(CharIndexed input, REMatch mymatch) {
char ch = input.charAt(mymatch.index);
if (ch == CharIndexed.OUT_OF_BOUNDS)
Index: classpath/gnu/regexp/RETokenEnd.java
===================================================================
--- classpath/gnu/regexp/RETokenEnd.java (revision 110832)
+++ classpath/gnu/regexp/RETokenEnd.java (working copy)
@@ -49,6 +49,10 @@
this.newline = newline;
}
+ int getMaximumLength() {
+ return 0;
+ }
+
boolean match(CharIndexed input, REMatch mymatch) {
char ch = input.charAt(mymatch.index);
if (ch == CharIndexed.OUT_OF_BOUNDS)
Index: classpath/gnu/regexp/RETokenOneOf.java
===================================================================
--- classpath/gnu/regexp/RETokenOneOf.java (revision 110832)
+++ classpath/gnu/regexp/RETokenOneOf.java (working copy)
@@ -70,53 +70,67 @@
return min;
}
+
+ int getMaximumLength() {
+ int max = 0;
+ int x;
+ for (int i=0; i < options.size(); i++) {
+ if ((x = ((REToken) options.elementAt(i)).getMaximumLength()) > max)
+ max = x;
+ }
+ return max;
+ }
+
boolean match(CharIndexed input, REMatch mymatch) {
- if (negative && (input.charAt(mymatch.index) == CharIndexed.OUT_OF_BOUNDS))
+ return negative ? matchN(input, mymatch) : matchP(input, mymatch);
+ }
+
+ private boolean matchN(CharIndexed input, REMatch mymatch) {
+ if (input.charAt(mymatch.index) == CharIndexed.OUT_OF_BOUNDS)
return false;
REMatch newMatch = null;
REMatch last = null;
REToken tk;
- boolean isMatch;
for (int i=0; i < options.size(); i++) {
tk = (REToken) options.elementAt(i);
REMatch tryMatch = (REMatch) mymatch.clone();
if (tk.match(input, tryMatch)) { // match was successful
- if (negative) return false;
+ return false;
+ } // is a match
+ } // try next option
- if (next(input, tryMatch)) {
- // Add tryMatch to list of possibilities.
- if (last == null) {
- newMatch = tryMatch;
- last = tryMatch;
- } else {
- last.next = tryMatch;
- last = tryMatch;
- }
- } // next succeeds
+ ++mymatch.index;
+ return next(input, mymatch);
+ }
+
+ private boolean matchP(CharIndexed input, REMatch mymatch) {
+ REMatch.REMatchList newMatch = new REMatch.REMatchList();
+ REToken tk;
+ for (int i=0; i < options.size(); i++) {
+ // In order that the backtracking can work,
+ // each option must be chained to the next token.
+ // But the chain method has some side effect, so
+ // we use clones.
+ tk = (REToken)((REToken) options.elementAt(i)).clone();
+ tk.chain(this.next);
+ tk.setUncle(this.uncle);
+ tk.subIndex = this.subIndex;
+ REMatch tryMatch = (REMatch) mymatch.clone();
+ if (tk.match(input, tryMatch)) { // match was successful
+ newMatch.addTail(tryMatch);
} // is a match
} // try next option
- if (newMatch != null) {
- if (negative) {
- return false;
- } else {
- // set contents of mymatch equal to newMatch
+ if (newMatch.head != null) {
+ // set contents of mymatch equal to newMatch
- // try each one that matched
- mymatch.assignFrom(newMatch);
- return true;
- }
+ // try each one that matched
+ mymatch.assignFrom(newMatch.head);
+ return true;
} else {
- if (negative) {
- ++mymatch.index;
- return next(input, mymatch);
- } else {
- return false;
- }
+ return false;
}
-
- // index+1 works for [^abc] lists, not for generic lookahead (--> index)
}
void dump(StringBuffer os) {
Index: classpath/java/net/URI.java
===================================================================
--- classpath/java/net/URI.java (revision 110832)
+++ classpath/java/net/URI.java (working copy)
@@ -1,5 +1,5 @@
/* URI.java -- An URI class
- Copyright (C) 2002, 2004, 2005 Free Software Foundation, Inc.
+ Copyright (C) 2002, 2004, 2005, 2006 Free Software Foundation, Inc.
This file is part of GNU Classpath.
@@ -346,8 +346,15 @@
private static String getURIGroup(Matcher match, int group)
{
String matched = match.group(group);
- return matched.length() == 0
- ? ((match.group(group - 1).length() == 0) ? null : "") : matched;
+ if (matched == null || matched.length() == 0)
+ {
+ String prevMatched = match.group(group -1);
+ if (prevMatched == null || prevMatched.length() == 0)
+ return null;
+ else
+ return "";
+ }
+ return matched;
}
/**
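The URI change follows from REMatch.toString(int) now returning null for a group that did not take part in the match, so Matcher.group can return null as well. The sketch below repeats the null-safe logic against a made-up pattern; UriGroupSketch, QUERY and queryOf are hypothetical and are not the expressions URI.java actually uses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the null-safe group lookup; names and pattern are hypothetical.
final class UriGroupSketch {
    // Group 2 is the optional "?query" part, group 3 the query itself.
    static final Pattern QUERY = Pattern.compile("^([^?]*)(\\?(.*))?$");

    // Same shape as the patched getURIGroup: null when the component is
    // absent, "" when it is present but empty.
    static String uriGroup(Matcher match, int group) {
        String matched = match.group(group);
        if (matched == null || matched.length() == 0) {
            String prevMatched = match.group(group - 1);
            if (prevMatched == null || prevMatched.length() == 0)
                return null;
            else
                return "";
        }
        return matched;
    }

    static String queryOf(String s) {
        Matcher m = QUERY.matcher(s);
        return m.matches() ? uriGroup(m, 3) : null;
    }

    public static void main(String[] args) {
        System.out.println(queryOf("path?x=1"));
        System.out.println(queryOf("path"));
    }
}
```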
Index: classpath/java/util/regex/Matcher.java
===================================================================
--- classpath/java/util/regex/Matcher.java (revision 110832)
+++ classpath/java/util/regex/Matcher.java (working copy)
@@ -1,5 +1,5 @@
/* Matcher.java -- Instance of a regular expression applied to a char sequence.
- Copyright (C) 2002, 2004 Free Software Foundation, Inc.
+ Copyright (C) 2002, 2004, 2006 Free Software Foundation, Inc.
This file is part of GNU Classpath.
@@ -38,6 +38,7 @@
package java.util.regex;
+import gnu.regexp.RE;
import gnu.regexp.REMatch;
/**
@@ -45,7 +46,7 @@
*
* @since 1.4
*/
-public final class Matcher
+public final class Matcher implements MatchResult
{
private Pattern pattern;
private CharSequence input;
@@ -233,10 +234,15 @@
*/
public boolean matches ()
{
- if (lookingAt())
+ match = pattern.getRE().getMatch(input, 0, RE.REG_TRY_ENTIRE_MATCH);
+ if (match != null)
{
- if (position == input.length())
- return true;
+ if (match.getStartIndex() == 0)
+ {
+ position = match.getEndIndex();
+ if (position == input.length())
+ return true;
+ }
match = null;
}
return false;
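The reason matches() can no longer be built on lookingAt() shows up with alternations. For the pattern a|ab and the input "ab", a prefix match stops after "a", so the old position == input.length() test would presumably have failed even though the whole input does match the second alternative. The sketch below shows the expected results on a standard JDK; EntireMatchSketch is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch using the public API only; EntireMatchSketch is a hypothetical name.
final class EntireMatchSketch {
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // End index of the prefix match, or -1 if there is none.
    static int prefixEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        System.out.println(prefixEnd("a|ab", "ab"));
        System.out.println(matchesWhole("a|ab", "ab"));
    }
}
```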
Index: classpath/java/util/regex/PatternSyntaxException.java
===================================================================
--- classpath/java/util/regex/PatternSyntaxException.java (revision 110832)
+++ classpath/java/util/regex/PatternSyntaxException.java (working copy)
@@ -41,6 +41,7 @@
* Indicates illegal pattern for regular expression.
* Includes state to inspect the pattern and what and where the expression
* was not valid regular expression.
+ * @since 1.4
*/
public class PatternSyntaxException extends IllegalArgumentException
{
Index: classpath/java/util/regex/MatchResult.java
===================================================================
--- classpath/java/util/regex/MatchResult.java (revision 0)
+++ classpath/java/util/regex/MatchResult.java (revision 0)
@@ -0,0 +1,81 @@
+/* MatchResult.java -- Result of a regular expression match.
+ Copyright (C) 2006 Free Software Foundation, Inc.
+
+This file is part of GNU Classpath.
+
+GNU Classpath is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2, or (at your option)
+any later version.
+
+GNU Classpath is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GNU Classpath; see the file COPYING. If not, write to the
+Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+02110-1301 USA.
+
+Linking this library statically or dynamically with other modules is
+making a combined work based on this library. Thus, the terms and
+conditions of the GNU General Public License cover the whole
+combination.
+
+As a special exception, the copyright holders of this library give you
+permission to link this library with independent modules to produce an
+executable, regardless of the license terms of these independent
+modules, and to copy and distribute the resulting executable under
+terms of your choice, provided that you also meet, for each linked
+independent module, the terms and conditions of the license of that
+module. An independent module is a module which is not derived from
+or based on this library. If you modify this library, you may extend
+this exception to your version of the library, but you are not
+obligated to do so. If you do not wish to do so, delete this
+exception statement from your version. */
+
+
+package java.util.regex;
+
+/**
+ * This interface represents the result of a regular expression match.
+ * It can be used to query the contents of the match, but not to modify
+ * them.
+ * @since 1.5
+ */
+public interface MatchResult
+{
+ /** Returns the index just after the last matched character. */
+ int end();
+
+ /**
+ * Returns the index just after the last matched character of the
+ * given sub-match group.
+ * @param group the sub-match group
+ */
+ int end(int group);
+
+ /** Returns the substring of the input which was matched. */
+ String group();
+
+ /**
+ * Returns the substring of the input which was matched by the
+ * given sub-match group.
+ * @param group the sub-match group
+ */
+ String group(int group);
+
+ /** Returns the number of sub-match groups in the matching pattern. */
+ int groupCount();
+
+ /** Returns the index of the first character of the match. */
+ int start();
+
+ /**
+ * Returns the index of the first character of the given sub-match
+ * group.
+ * @param group the sub-match group
+ */
+ int start(int group);
+}
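Since Matcher now implements MatchResult, a successful Matcher can be handed to code that only needs to read the result. A short usage sketch follows; MatchResultSketch and the example pattern are made up for illustration.

```java
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Usage sketch; MatchResultSketch is a hypothetical name.
final class MatchResultSketch {
    // Returns the match viewed through the read-only interface,
    // or null if the whole input does not match.
    static MatchResult match(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) return null;
        return m; // Matcher implements MatchResult
    }

    public static void main(String[] args) {
        MatchResult r = match("(\\w+)@(\\w+)", "user@host");
        System.out.println(r.group(1) + " " + r.start(2) + " " + r.end());
    }
}
```

The returned object is still the live Matcher, so further find() or reset() calls on it would change what the MatchResult view reports.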
Index: classpath/java/util/regex/Pattern.java
===================================================================
--- classpath/java/util/regex/Pattern.java (revision 110832)
+++ classpath/java/util/regex/Pattern.java (working copy)
@@ -103,8 +103,11 @@
}
catch (REException e)
{
- throw new PatternSyntaxException(e.getMessage(),
+ PatternSyntaxException pse;
+ pse = new PatternSyntaxException(e.getMessage(),
regex, e.getPosition());
+ pse.initCause(e);
+ throw pse;
}
}
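The Pattern.java hunk above keeps the original REException reachable through getCause(). The same idiom in isolation, with a plain Exception standing in for REException (CauseSketch and wrap are hypothetical names):

```java
import java.util.regex.PatternSyntaxException;

// Sketch of the initCause idiom; CauseSketch and wrap are hypothetical.
final class CauseSketch {
    static PatternSyntaxException wrap(Exception lowLevel, String regex, int position) {
        PatternSyntaxException pse =
            new PatternSyntaxException(lowLevel.getMessage(), regex, position);
        // PatternSyntaxException has no constructor taking a cause,
        // so the cause is attached afterwards.
        pse.initCause(lowLevel);
        return pse;
    }

    public static void main(String[] args) {
        Exception e = new Exception("unmatched paren");
        System.out.println(wrap(e, "(", 0).getCause() == e);
    }
}
```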