This is the mail archive of the java-patches@gcc.gnu.org mailing list for the Java project.


Patch: Fix for PR 2319

To: Gcc Patch List <gcc-patches at gcc dot gnu dot org>
Subject: Patch: Fix for PR 2319
From: Tom Tromey <tromey at redhat dot com>
Date: 19 Jun 2001 14:12:29 -0600
Cc: Java Patch List <java-patches at gcc dot gnu dot org>
Reply-To: tromey at redhat dot com

This patch fixes PR 2319.  With it, the built-in UTF-8 decoder reports
an error when it sees an invalid or overlong sequence (one that uses
more bytes than the character needs).

Note that on systems with a working iconv() this decoder isn't used,
even if the "UTF-8" encoding is requested.  This means we're still at
the mercy of the system in some ways.  The UTF-8 decoder in the glibc
I'm using does not flag such sequences as errors :-(

Ok to commit?

2001-06-19  Tom Tromey  <tromey@redhat.com>

	* lex.c (java_read_char): Disallow invalid and overlong
	sequences.  Fixes PR java/2319.

Tom

Index: lex.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/java/lex.c,v
retrieving revision 1.65
diff -u -r1.65 lex.c
--- lex.c	2001/05/04 00:34:48	1.65
+++ lex.c	2001/06/19 19:53:15
@@ -454,15 +454,21 @@
       if (c == EOF)
 	return UEOF;
       if (c < 128)
-	return (unicode_t)c;
+	return (unicode_t) c;
       else
 	{
 	  if ((c & 0xe0) == 0xc0)
 	    {
 	      c1 = getc (lex->finput);
 	      if ((c1 & 0xc0) == 0x80)
-		return (unicode_t)(((c &0x1f) << 6) + (c1 & 0x3f));
-	      c = c1;
+		{
+		  unicode_t r = (unicode_t)(((c & 0x1f) << 6) + (c1 & 0x3f));
+		  /* Check for valid 2-byte characters.  We explicitly
+		     allow \0 because this encoding is common in the
+		     Java world.  */
+		  if (r == 0 || (r >= 0x80 && r <= 0x7ff))
+		    return r;
+		}
 	    }
 	  else if ((c & 0xf0) == 0xe0)
 	    {
@@ -471,16 +477,23 @@
 		{
 		  c2 = getc (lex->finput);
 		  if ((c2 & 0xc0) == 0x80)
-		    return (unicode_t)(((c & 0xf) << 12) + 
-				       (( c1 & 0x3f) << 6) + (c2 & 0x3f));
-		  else
-		    c = c2;
+		    {
+		      unicode_t r =  (unicode_t)(((c & 0xf) << 12) + 
+						 (( c1 & 0x3f) << 6)
+						 + (c2 & 0x3f));
+		      /* Check for valid 3-byte characters.
+			 Don't allow surrogate, \ufffe or \uffff.  */
+		      if (r >= 0x800 && r <= 0xffff
+			  && ! (r >= 0xd800 && r <= 0xdfff)
+			  && r != 0xfffe && r != 0xffff)
+			return r;
+		    }
 		}
-	      else
-		c = c1;
 	    }
 
-	  /* We simply don't support invalid characters.  */
+	  /* We simply don't support invalid characters.  We also
+	     don't support 4-, 5-, or 6-byte UTF-8 sequences, as these
+	     cannot be valid Java characters.  */
 	  java_lex_error ("malformed UTF-8 character", 0);
 	}
     }
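
For anyone who wants to try the new acceptance rules outside the
compiler, here is a rough standalone sketch of the same checks.  It is
not part of the patch: decode_utf8_bmp, its buffer-based interface, and
the -1 error return are all made up for illustration (the real code
reads from lex->finput and calls java_lex_error instead).

```c
#include <stddef.h>

/* Hypothetical helper, not part of lex.c: decode one character from
   buf[0..len) using the same acceptance rules as the patched
   java_read_char.  Returns the code point and stores the number of
   bytes consumed in *used, or returns -1 (with *used == 0) for a
   malformed, overlong, truncated, or unsupported sequence.  */
static long
decode_utf8_bmp (const unsigned char *buf, size_t len, size_t *used)
{
  unsigned int c, c1, c2;
  long r;

  *used = 0;
  if (len == 0)
    return -1;

  c = buf[0];
  if (c < 128)
    {
      *used = 1;
      return (long) c;
    }

  if ((c & 0xe0) == 0xc0 && len >= 2)
    {
      c1 = buf[1];
      if ((c1 & 0xc0) == 0x80)
        {
          r = (long) (((c & 0x1f) << 6) + (c1 & 0x3f));
          /* Valid 2-byte characters, plus the 2-byte form of \0.  */
          if (r == 0 || (r >= 0x80 && r <= 0x7ff))
            {
              *used = 2;
              return r;
            }
        }
    }
  else if ((c & 0xf0) == 0xe0 && len >= 3)
    {
      c1 = buf[1];
      c2 = buf[2];
      if ((c1 & 0xc0) == 0x80 && (c2 & 0xc0) == 0x80)
        {
          r = (long) (((c & 0xf) << 12) + ((c1 & 0x3f) << 6)
                      + (c2 & 0x3f));
          /* Valid 3-byte characters: no overlong forms, no
             surrogates, no \ufffe or \uffff.  */
          if (r >= 0x800 && r <= 0xffff
              && ! (r >= 0xd800 && r <= 0xdfff)
              && r != 0xfffe && r != 0xffff)
            {
              *used = 3;
              return r;
            }
        }
    }

  /* Everything else, including 4-, 5-, and 6-byte sequences.  */
  return -1;
}
```

With this sketch the bytes 0xc0 0x80 decode to 0, while 0xc1 0x81 (an
overlong encoding of 'A') and 0xed 0xa0 0x80 (the surrogate 0xd800) are
both rejected.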
