This is the mail archive of the java-patches@gcc.gnu.org mailing list for the Java project.



Re: java/2313: Java SimpleDateFormat crash with non US locales (french...)


>>>>> "Bryce" == Bryce McKinlay <bryce@albatross.co.nz> writes:

Bryce>     System.out.println ("Liberté, égalité, fraternité !");

Bryce> works fine in the default mode, but with "--encoding=UTF-8" it
Bryce> produces incorrect output.

That's because the input file isn't actually in UTF-8, but it also
doesn't contain an incorrect (by our rules -- see below) UTF-8
sequence that would let us see it as erroneous.

The `é' is 0xe9.  This is a valid start byte for a 3-byte UTF-8
sequence.  That is why the following characters are also removed.

We ought to be noticing that the subsequent bytes in the sequence are
invalid.  That is what Unicode specifies, and there probably isn't a
good reason to allow incorrectly encoded characters.  However the code
wasn't originally written this way and I never updated it to do this.
I'll submit a PR.
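
To illustrate the kind of check described above, here is a minimal Java
sketch.  It is not the gcj lexer code; the class and method names are
hypothetical.  It classifies each lead byte, then verifies that the
expected number of continuation bytes (of the form 10xxxxxx) follows.
It deliberately ignores overlong forms and surrogates, which a complete
validator would also have to reject.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of gcj.
public class Utf8Check {
    // Length of the sequence introduced by lead byte b, or -1 if b
    // cannot start a UTF-8 sequence.
    public static int sequenceLength(int b) {
        b &= 0xFF;
        if (b < 0x80) return 1;                 // 0xxxxxxx: ASCII
        if (b >= 0xC2 && b <= 0xDF) return 2;   // 110xxxxx
        if (b >= 0xE0 && b <= 0xEF) return 3;   // 1110xxxx (0xE9 lands here)
        if (b >= 0xF0 && b <= 0xF4) return 4;   // 11110xxx
        return -1;                              // stray continuation or bad lead
    }

    // True if b has the continuation-byte form 10xxxxxx.
    public static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    // Index of the first byte that starts a malformed sequence, or -1
    // if every lead byte is followed by the right number of
    // continuation bytes.  Overlong forms and surrogates are NOT checked.
    public static int firstInvalid(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int len = sequenceLength(data[i]);
            if (len < 0 || i + len > data.length) return i;
            for (int k = 1; k < len; k++) {
                if (!isContinuation(data[i + k])) return i;
            }
            i += len;
        }
        return -1;
    }

    public static void main(String[] args) {
        // \u00e9 escapes keep this source file pure ASCII.
        byte[] latin1 = "Libert\u00e9, \u00e9galit\u00e9"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(firstInvalid(latin1));
    }
}
```

With Latin-1 input, the byte 0xe9 is taken as a 3-byte lead, but the
`,' that follows is not a continuation byte, so the sequence is flagged
instead of being silently consumed.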

Bryce> Unfortunately, I know very little about character
Bryce> encoding. Maybe Tom can suggest a fix or workaround. Perhaps
Bryce> it's possible to do something to convert the file to a UTF-8
Bryce> encoding before trying to compile it?

One fix would be to tell gcj the real encoding of the file:

    gcj --encoding=8859_1 ...

This works for me.  However, note that the encoding names are
system-dependent :-(.  Ideally we'd have a table of aliases mapping
the Java-specified names to the system-dependent ones.
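
Such an alias table could look roughly like the following Java sketch.
The class name, the method name, and the specific entries are all
assumptions chosen for illustration; the right-hand names are common
iconv spellings, but the names a given system actually accepts vary.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical alias table, for illustration only; not part of gcj.
public class EncodingAliases {
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        // Java-style name -> assumed system (iconv-style) name.
        ALIASES.put("8859_1", "ISO-8859-1");
        ALIASES.put("ISO8859_1", "ISO-8859-1");
        ALIASES.put("UTF8", "UTF-8");
        ALIASES.put("ASCII", "US-ASCII");
    }

    // Returns the mapped name, or the input unchanged when no alias
    // is known, so unrecognized names are passed straight through.
    public static String toSystemName(String javaName) {
        String mapped = ALIASES.get(javaName.toUpperCase(Locale.ROOT));
        return mapped != null ? mapped : javaName;
    }

    public static void main(String[] args) {
        System.out.println(toSystemName("8859_1"));
    }
}
```

Passing unknown names through unchanged means a user can still give the
system-dependent name directly when the table has no entry for it.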

Another fix would be to use the `iconv' or `recode' programs to
convert the file into UTF-8 before compiling.  This is a pain to do,
but might be the only recourse on systems with a losing (or no)
iconv() implementation.
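
Where neither program is available, the same conversion can be sketched
in a few lines of Java, run with any working Java runtime.  This
assumes the source file really is ISO-8859-1; the class name and the
command-line arguments are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical converter, for illustration only.
public class Latin1ToUtf8 {
    // Decode the bytes as ISO-8859-1, then re-encode them as UTF-8.
    public static byte[] convert(byte[] latin1Bytes) {
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java Latin1ToUtf8 <input> <output>");
            System.exit(1);
        }
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        Files.write(out, convert(Files.readAllBytes(in)));
    }
}
```

The single byte 0xe9 becomes the two bytes 0xc3 0xa9, which is the
UTF-8 encoding of `é', so the converted file can then be compiled with
"--encoding=UTF-8".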

Tom

