23008 – The charset decoder for UnicodeLittle gives wrong results

Bug 23008 - The charset decoder for UnicodeLittle gives wrong results

Summary: The charset decoder for UnicodeLittle gives wrong results

Status:	RESOLVED FIXED

Alias:	None

Product:	classpath
Classification:	Unclassified
Component:	classpath (show other bugs)
Version:	unspecified

Importance:	P3 normal
Target Milestone:	0.20
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2005-06-24 07:23 UTC by from-classpath
Modified:	2005-11-14 13:14 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2005-10-21 17:34:02

Attachments
Testfile (in UTF-16) and -program from lucene unit tests (159.71 KB, application/x-gzip) 2005-10-02 19:25 UTC, Wolfgang Baer	Details
Proposed patch, which is applied in Kaffe (681 bytes, patch) 2005-10-22 00:28 UTC, Ito Kazumitsu	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description from-classpath 2005-06-24 07:23:35 UTC

While running Java Excel API (http://www.andykhan.com/jexcelapi/),
which extracts character strings from an Excel worksheet using the
charset UnicodeLittle, I found a case where the charset decoder
returned broken strings.  I found two causes of this problem.

1. Which endian to use.

UnicodeLittle is little endian.  But if the data to be decoded does not
have a byte order mark, UTF-16Decoder assumes that it is big endian.
Although Sun's document says that UnicodeLittle is with byte-order mark,
the default byte order of UnicodeLittle should be little endian.
UnicodeLittle without byte order mark seems to be a common practice.

2. UTF-16Decoder's bug

UTF-16Decoder.java has something like (char) ((b1 << 8) | b2).
Let b1 be 0xA1 and b2 be 0xB1. Then,

   (b1 << 8) | b2 = 0xA100 | 0xFFB1 = 0xFFB1

This is not our expected result: 0xA1B1.

And my patch follows.

--- gnu/java/nio/charset/UnicodeLittle.java.orig	Tue Apr 19 19:12:23 2005
+++ gnu/java/nio/charset/UnicodeLittle.java	Fri Jun 24 14:36:33 2005
@@ -64,7 +64,7 @@
 
   public CharsetDecoder newDecoder ()
   {
-    return new UTF_16Decoder (this, UTF_16Decoder.UNKNOWN_ENDIAN);
+    return new UTF_16Decoder (this, UTF_16Decoder.LITTLE_ENDIAN);
   }
 
   public CharsetEncoder newEncoder ()

--- gnu/java/nio/charset/UTF_16Decoder.java.orig	Tue Apr 19 19:12:23 2005
+++ gnu/java/nio/charset/UTF_16Decoder.java	Fri Jun 24 15:44:04 2005
@@ -83,7 +83,7 @@
             // handle byte order mark
             if (byteOrder == UNKNOWN_ENDIAN)
               {
-                char c = (char) (((b1 & 0xFF) << 8) | (b2 & 0xFF));
+                char c = (char) (((b1 & 0x00FF) << 8) | (b2 & 0x00FF));
                 if (c == BYTE_ORDER_MARK)
                   {
                     byteOrder = BIG_ENDIAN;
@@ -105,8 +105,9 @@
               }
 
 	    // FIXME: Change so you only do a single comparison here.
-            char c = byteOrder == BIG_ENDIAN ? (char) ((b1 << 8) | b2)
-                                             : (char) ((b2 << 8) | b1);
+            char c = byteOrder == BIG_ENDIAN ?
+                 (char) (((b1 & 0x00FF) << 8) | (b2 & 0x00FF)) :
+                 (char) (((b2 & 0x00FF) << 8) | (b1 & 0X00FF));
 
             if (0xD800 <= c && c <= 0xDFFF)
               {

Comment 1 from-classpath 2005-06-27 06:34:03 UTC

> Although Sun's document says that UnicodeLittle is with byte-order mark,
> the default byte order of UnicodeLittle should be little endian.
> UnicodeLittle without byte order mark seems to be a common practice.


Seeing the behavior of Sun's JDK, UnicodeLittle with or without byte order
mark should be treated as follows:

  UnicodeLittle with correct byte order mark:
    Ignore the byte order mark and continue assuming the byte order
    to be little endian.

  UnicodeLittle with incorrect byte order mark:
    The byte sequence is malformed.

  UnicodeLittle without byte order mark:
    Continue assuming the byte order to be little endian.

Then the patch will be like this:

--- gnu/java/nio/charset/UnicodeLittle.java.orig	Tue Apr 19 19:12:23 2005
+++ gnu/java/nio/charset/UnicodeLittle.java	Mon Jun 27 14:44:27 2005
@@ -64,7 +64,7 @@
 
   public CharsetDecoder newDecoder ()
   {
-    return new UTF_16Decoder (this, UTF_16Decoder.UNKNOWN_ENDIAN);
+    return new UTF_16Decoder (this, UTF_16Decoder.MAYBE_LITTLE_ENDIAN);
   }
 
   public CharsetEncoder newEncoder ()

--- gnu/java/nio/charset/UTF_16Decoder.java.orig	Tue Apr 19 19:12:23 2005
+++ gnu/java/nio/charset/UTF_16Decoder.java	Mon Jun 27 14:55:04 2005
@@ -54,6 +54,8 @@
   static final int BIG_ENDIAN = 0;
   static final int LITTLE_ENDIAN = 1;
   static final int UNKNOWN_ENDIAN = 2;
+  static final int MAYBE_BIG_ENDIAN = 3;
+  static final int MAYBE_LITTLE_ENDIAN = 4;
 
   private static final char BYTE_ORDER_MARK = 0xFEFF;
   private static final char REVERSED_BYTE_ORDER_MARK = 0xFFFE;
@@ -81,32 +83,44 @@
             byte b2 = in.get ();
 
             // handle byte order mark
-            if (byteOrder == UNKNOWN_ENDIAN)
+            if (byteOrder == UNKNOWN_ENDIAN ||
+                byteOrder == MAYBE_BIG_ENDIAN ||
+                byteOrder == MAYBE_LITTLE_ENDIAN)
               {
-                char c = (char) (((b1 & 0xFF) << 8) | (b2 & 0xFF));
+                char c = (char) (((b1 & 0x00FF) << 8) | (b2 & 0x00FF));
                 if (c == BYTE_ORDER_MARK)
                   {
+                    if (byteOrder == MAYBE_LITTLE_ENDIAN)
+                      {
+                        return CoderResult.malformedForLength (2);
+                      }
                     byteOrder = BIG_ENDIAN;
                     inPos += 2;
                     continue;
                   }
                 else if (c == REVERSED_BYTE_ORDER_MARK)
                   {
+                    if (byteOrder == MAYBE_BIG_ENDIAN)
+                      {
+                        return CoderResult.malformedForLength (2);
+                      }
                     byteOrder = LITTLE_ENDIAN;
                     inPos += 2;
                     continue;
                   }
                 else
                   {
-                    // assume big endian, do not consume bytes,
+                    // assume big or little endian, do not consume bytes,
                     // continue with normal processing
-                    byteOrder = BIG_ENDIAN;
+                    byteOrder = (byteOrder == MAYBE_LITTLE_ENDIAN ?
+                                 LITTLE_ENDIAN : BIG_ENDIAN);
                   }
               }
 
 	    // FIXME: Change so you only do a single comparison here.
-            char c = byteOrder == BIG_ENDIAN ? (char) ((b1 << 8) | b2)
-                                             : (char) ((b2 << 8) | b1);
+            char c = byteOrder == BIG_ENDIAN ?
+                 (char) (((b1 & 0x00FF) << 8) | (b2 & 0x00FF)) :
+                 (char) (((b2 & 0x00FF) << 8) | (b1 & 0X00FF));
 
             if (0xD800 <= c && c <= 0xDFFF)
               {

Comment 2 cvs-commit@developer.classpath.org 2005-08-12 00:02:20 UTC

Subject: Bug 23008

CVSROOT:	/cvsroot/classpath
Module name:	classpath
Branch: 	
Changes by:	Tom Tromey <tromey@savannah.gnu.org>	05/08/11 23:51:30

Modified files:
	.              : ChangeLog 
	gnu/java/nio/charset: UTF_16Decoder.java 

Log message:
	For PR classpath/23008:
	* gnu/java/nio/charset/UTF_16Decoder.java (decodeLoop): Correctly
	mask bytes when constructing characters.

CVSWeb URLs:
http://savannah.gnu.org/cgi-bin/viewcvs/classpath/classpath/ChangeLog.diff?tr1=1.4392&tr2=1.4393&r1=text&r2=text
http://savannah.gnu.org/cgi-bin/viewcvs/classpath/classpath/gnu/java/nio/charset/UTF_16Decoder.java.diff?tr1=1.5&tr2=1.6&r1=text&r2=text

Comment 3 Andrew Pinski 2005-08-25 22:08:14 UTC

Fixed.

Comment 4 Tom Tromey 2005-09-06 18:54:22 UTC

This bug is not fully fixed.
Only the masking part went in, the other part has not yet.

Comment 5 Wolfgang Baer 2005-10-02 19:23:21 UTC

Hi all,

I have also problems with UTF-16 decoder. I tried the patch from comment #1 and
it solved the problem. My question now is why the whole patch wasn't applied the last time ?

I will attach a testcase extracted from the lucene unit tests which
confirms the correctness of the proposed patch.

The output has to be "Number of lines read 49673" - currently it treats
the file as with 868 lines.

BTW, the iconv Charset provider works correctly with the test file !

Thanks,
Wolfgang

Comment 6 Wolfgang Baer 2005-10-02 19:25:02 UTC

Created attachment 9862 [details]
Testfile (in UTF-16) and -program  from lucene unit tests

Comment 7 Andrew Pinski 2005-10-21 17:34:02 UTC

Confirmed.

Comment 8 Ito Kazumitsu 2005-10-22 00:28:35 UTC

Created attachment 10042 [details]
Proposed patch, which is applied in Kaffe

Comment 9 cvs-commit@developer.classpath.org 2005-11-13 22:19:56 UTC

Subject: Bug 23008

CVSROOT:	/cvsroot/classpath
Module name:	classpath
Branch: 	
Changes by:	Ito Kazumitsu <itokaz@savannah.gnu.org>	05/11/13 22:18:24

Modified files:
	.              : ChangeLog 
	gnu/java/nio/charset: UTF_16Decoder.java UnicodeLittle.java 

Log message:
	2005-11-13  Ito Kazumitsu  <kaz@maczuka.gcd.org>
	
	Fixes bug #23008
	* gnu/java/nio/charset/UTF_16Decoder.java
	MAYBE_BIG_ENDIAN, MAYBE_LITTLE_ENDIAN: New constants representing
	such endianness which is similar to UNKNOWN_ENDIAN but defaults
	to big/little endian without a byte order mark.
	(decodeLoop): Handle MAYBE_BIG_ENDIAN and MAYBE_LITTLE_ENDIAN.
	* gnu/java/nio/charset/UnicodeLittle.java
	(newDecoder): Set the endianness to MAYBE_LITTLE_ENDIAN.

CVSWeb URLs:
http://savannah.gnu.org/cgi-bin/viewcvs/classpath/classpath/ChangeLog.diff?tr1=1.5567&tr2=1.5568&r1=text&r2=text
http://savannah.gnu.org/cgi-bin/viewcvs/classpath/classpath/gnu/java/nio/charset/UTF_16Decoder.java.diff?tr1=1.6&tr2=1.7&r1=text&r2=text
http://savannah.gnu.org/cgi-bin/viewcvs/classpath/classpath/gnu/java/nio/charset/UnicodeLittle.java.diff?tr1=1.2&tr2=1.3&r1=text&r2=text

Comment 10 Ito Kazumitsu 2005-11-13 22:27:44 UTC

Fixed as reported in Comment #9.