This is the mail archive of the java@gcc.gnu.org mailing list for the Java project.
About the encoding of libgcj?
- From: bbskill <bbkills at tom dot com>
- To: java at gcc dot gnu dot org
- Date: Fri, 02 Jun 2006 21:09:53 +0800
- Subject: About the encoding of libgcj?
Hi, all,
As we know, libgcj only works in a UTF-8-compatible locale. That is to
say, it assumes every machine uses a UTF-8-compatible locale. I am
wondering why libgcj doesn't support other locales, such as GB2312,
perhaps by using iconv to convert between encodings. For example, if my
locale is XXX and the file names on my machine are not valid UTF-8, then
listing a directory throws a NullPointerException (of course, I have
compiled the program to a native binary with gcj). I looked into the
libgcj source and finally found this code:
==================
libjava/java/io/natFilePosix.cc
--------------------------------
The jobjectArray java::io::File::(java::io::FilenameFilter *filter,
java::io::FileFilter *fileFilter, java::lang::Class *result_type) method:
#ifdef HAVE_READDIR_R
  while (readdir_r (dir, (struct dirent *) dbuf, &d) == 0 && d != NULL)
#else /* HAVE_READDIR_R */
  while ((d = readdir (dir)) != NULL)
#endif /* HAVE_READDIR_R */
    {
      // Omit "." and "..".
      if (d->d_name[0] == '.'
          && (d->d_name[1] == '\0'
              || (d->d_name[1] == '.' && d->d_name[2] == '\0')))
        continue;

      // The encoding of d->d_name is XXX, which is not compatible with
      // UTF-8, so name is NULL here.
      jstring name = JvNewStringUTF (d->d_name);
      if (filter && ! filter->accept(this, name))
        continue;

      if (result_type == &java::io::File::class$)
        {
          // This causes a NullPointerException because name is NULL.
          java::io::File *file = new java::io::File (this, name);
          if (fileFilter && ! fileFilter->accept(file))
            continue;
          list->add(file);
        }
      else
        list->add(name);
    }
  closedir (dir);
=============
As we can see, the cause is the JvNewStringUTF function, which is declared
in gcj/cni.h. It calls _Jv_NewStringUTF in java/lang/natString.cc.
_Jv_NewStringUTF converts a UTF-8 char[] to a jstring. The problem is
that the encoding of the char[] (d->d_name) is determined by the
machine's locale, which, as I said above, is not compatible with UTF-8.
So _Jv_strLengthUtf8 ((char *) p, size) returns -1 for such names, which
means _Jv_NewStringUTF returns NULL. Why shouldn't we convert the
d->d_name char[] to a UTF-8 char[] first, perhaps using iconv? I have
tried this, and it works fine.
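
For illustration, a conversion of the kind suggested above might be sketched with iconv as follows. This is only a sketch under assumptions, not the change that was actually tried: the helper name convert_encoding is hypothetical and not part of libgcj, and the caller is assumed to know the name of the source encoding.

```cpp
#include <iconv.h>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>

// Hypothetical helper (not a libgcj function): convert `in` from the
// encoding named `from` to the encoding named `to`.  Returns false if
// iconv does not support the pair or the input is not valid in `from`.
bool convert_encoding (const std::string &in, const char *from,
                       const char *to, std::string &out)
{
  assert (from != NULL && to != NULL);
  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return false;

  out.clear ();
  // iconv takes a non-const pointer but does not modify the input.
  char *src = const_cast<char *> (in.data ());
  size_t src_left = in.size ();
  char buf[256];
  bool ok = true;

  while (ok && src_left > 0)
    {
      char *dst = buf;
      size_t dst_left = sizeof buf;
      size_t r = iconv (cd, &src, &src_left, &dst, &dst_left);
      int err = errno;                  // save before any other call
      out.append (buf, sizeof buf - dst_left);
      // E2BIG only means buf was full; anything else is a real error.
      if (r == (size_t) -1 && err != E2BIG)
        ok = false;
    }

  if (ok)
    {
      // Flush any pending shift sequence (needed for stateful encodings).
      char *dst = buf;
      size_t dst_left = sizeof buf;
      if (iconv (cd, NULL, NULL, &dst, &dst_left) == (size_t) -1)
        ok = false;
      out.append (buf, sizeof buf - dst_left);
    }

  iconv_close (cd);
  return ok;
}
```

In the directory listing loop, a helper like this would be applied to d->d_name before the result is handed to JvNewStringUTF. The source encoding name could be taken from the locale, for example via nl_langinfo (CODESET).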
==============
jstring
_Jv_NewStringUTF (const char *bytes)
{
  int size = strlen (bytes);
  unsigned char *p = (unsigned char *) bytes;

  // For a name that is not valid UTF-8, the length is -1.
  int length = _Jv_strLengthUtf8 ((char *) p, size);
  if (length < 0)
    return NULL;

  jstring jstr = JvAllocString (length);
  jchar *chrs = JvGetStringChars (jstr);

  p = (unsigned char *) bytes;
  unsigned char *limit = p + size;
  while (p < limit)
    *chrs++ = UTF8_GET (p, limit);
  return jstr;
}
==============
prims.cc :
---------
int
_Jv_strLengthUtf8(char* str, int len)
{
  unsigned char* ptr;
  unsigned char* limit;
  int str_length;

  ptr = (unsigned char*) str;
  limit = ptr + len;
  str_length = 0;
  for (; ptr < limit; str_length++)
    {
      if (UTF8_GET (ptr, limit) < 0)
        return (-1);
    }
  return (str_length);
}
=============
include/jvm.h
------------------
/* Extract a character from a Java-style Utf8 string.
 * PTR points to the current character.
 * LIMIT points to the end of the Utf8 string.
 * PTR is incremented to point after the character that gets returned.
 * On an error, -1 is returned. */
#define UTF8_GET(PTR, LIMIT) \
  ((PTR) >= (LIMIT) ? -1 \
   : *(PTR) < 128 ? *(PTR)++ \
   : (*(PTR)&0xE0) == 0xC0 && ((PTR)+=2)<=(LIMIT) && ((PTR)[-1]&0xC0) == 0x80 \
   ? (((PTR)[-2] & 0x1F) << 6) + ((PTR)[-1] & 0x3F) \
   : (*(PTR) & 0xF0) == 0xE0 && ((PTR) += 3) <= (LIMIT) \
   && ((PTR)[-2] & 0xC0) == 0x80 && ((PTR)[-1] & 0xC0) == 0x80 \
   ? (((PTR)[-3]&0x0F) << 12) + (((PTR)[-2]&0x3F) << 6) + ((PTR)[-1]&0x3F) \
   : ((PTR)++, -1))
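
To make the failure concrete, here is a small stand-alone sketch that applies the same 1-, 2- and 3-byte rules as UTF8_GET to a byte string. It is not libgcj code: the function name utf8_length is hypothetical, and it is written as a plain function rather than a macro. The GB2312 bytes 0xD6 0xD0 (one Chinese character) are rejected because 0xD0 is not a valid continuation byte.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-alone check modelled on the rules in UTF8_GET:
// returns the number of characters, or -1 if the bytes are not valid
// 1-, 2- or 3-byte UTF-8 sequences.
int utf8_length (const unsigned char *p, size_t size)
{
  assert (p != NULL || size == 0);
  const unsigned char *limit = p + size;
  int count = 0;

  while (p < limit)
    {
      size_t need;                      // continuation bytes expected
      if (*p < 0x80)
        need = 0;                       // plain ASCII
      else if ((*p & 0xE0) == 0xC0)
        need = 1;                       // 110xxxxx: 2-byte sequence
      else if ((*p & 0xF0) == 0xE0)
        need = 2;                       // 1110xxxx: 3-byte sequence
      else
        return -1;                      // not a valid lead byte here

      if ((size_t) (limit - p) <= need)
        return -1;                      // sequence runs past the end

      for (size_t i = 1; i <= need; i++)
        if ((p[i] & 0xC0) != 0x80)
          return -1;                    // continuation must be 10xxxxxx

      p += need + 1;
      count++;
    }
  return count;
}
```

Some byte sequences in other encodings happen to pass this check, but any file name containing a sequence that fails it makes the string conversion return NULL.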
=================
Similar problems exist in many places, for example File.isFile(), which
ends up calling _stat (jint query).
=============
java/io/natFilePosix.cc
----------
jboolean
java::io::File::_stat (jint query)
{
  if (query == ISHIDDEN)
    return getName()->charAt(0) == '.';

#ifdef HAVE_STAT
  char *buf = (char *) __builtin_alloca (JvGetStringUTFLength (path) + 1);
  jsize total = JvGetStringUTFRegion (path, 0, path->length(), buf);
  buf[total] = '\0';

  struct stat sb;
  // buf is encoded as UTF-8 by default, so of course this does not work
  // on my XXX-locale machine. My suggestion is to convert buf to a new
  // char[] whose encoding is compatible with my locale.
  if (::stat (buf, &sb))
    return false;

  JvAssert (query == DIRECTORY || query == ISFILE);
  jboolean r = S_ISDIR (sb.st_mode);
  return query == DIRECTORY ? r : ! r;
#else
  return false;
#endif
}
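
The conversion suggested in the comment above, from UTF-8 to the file system's encoding before calling stat, could look roughly like this. It is a sketch under assumptions, not libgcj code: stat_utf8_path is a hypothetical name, the target encoding is passed in by the caller, and the 4096-byte buffer is an assumed upper bound on the converted path length.

```cpp
#include <iconv.h>
#include <sys/stat.h>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not a libgcj function): stat a path given in
// UTF-8 on a system whose file names use the encoding `fs_encoding`.
bool stat_utf8_path (const char *utf8_path, const char *fs_encoding,
                     struct stat *sb)
{
  assert (utf8_path != NULL && fs_encoding != NULL && sb != NULL);
  iconv_t cd = iconv_open (fs_encoding, "UTF-8");
  if (cd == (iconv_t) -1)
    return false;                       // encoding pair not supported

  char native[4096];                    // assumed bound on path length
  char *src = const_cast<char *> (utf8_path);
  size_t src_left = std::strlen (utf8_path);
  char *dst = native;
  size_t dst_left = sizeof native - 1;  // keep room for the '\0'

  size_t r = iconv (cd, &src, &src_left, &dst, &dst_left);
  if (r != (size_t) -1)
    r = iconv (cd, NULL, NULL, &dst, &dst_left);  // flush shift state
  iconv_close (cd);
  if (r == (size_t) -1)
    return false;                       // unconvertible or too long
  *dst = '\0';

  return ::stat (native, sb) == 0;
}
```

In _stat, a call like this would take the place of the direct ::stat (buf, &sb), with fs_encoding taken from the locale, for example via nl_langinfo (CODESET).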
=========
So what I want to ask is: why doesn't libgcj provide a way to convert
between UTF-8 and other encodings in the layer between libgcj and the
OS, perhaps outside the JVM?
May I know your thoughts? Thank you.
Best Regards,
jimmy