Filenames with accented characters

Tue Nov 18 06:12:00 GMT 2003

Tom Tromey wrote:
>>>when I compile a Java application on Windows with GCJ, that app
>>>can't deal with files whose names contain accented chars. For
>>>example, I can open "city.jpg", but not "città.jpg".
> 
> 
> Mohan> This is a known issue. João has solved this problem, but I
> Mohan> can't find his original post on java-patches.
> 
> Also see:
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9463
> 
> It looks like for the posix flavor we'll need platform-specific
> overrides.

Quite coincidentally, I've been working recently on
adding globalisation support to the native portions
of our application and I've been assigned to do it
for Windows and Solaris.

So what follows is what I've found to work on these
platforms - this might be useful to this discussion,
or it might not be - in the latter case, I apologise
in advance for the noise level.

BTW, all our message catalogues are UTF-8 encoded so
that's what my scope is limited to - I guess this
holds for GCJ as well.

The primary difficulty that we encounter in such
a scenario is correctly converting these UTF-8
encoded messages into the encoding of the user's
locale so that all the glyphs are displayed as intended.

The following also applies to the reverse process.

Solaris
=======

In Solaris, we honour the value of the
LC_MESSAGES environment variable as determined by
a call to "setlocale( LC_MESSAGES, NULL)".

Then we determine the encoding in effect by a
call to "nl_langinfo( CODESET)".
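
A minimal sketch of these two calls, in case it helps.
The helper name query_locale() is made up for this
example, and the initial setlocale( LC_ALL, "") is my
assumption about how the program adopts the environment
in the first place:

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from our code base): returns the
 * name of the LC_MESSAGES locale and stores the codeset name
 * in *codeset. Both strings live in static storage and may be
 * overwritten by later setlocale()/nl_langinfo() calls.
 */
const char *query_locale(const char **codeset)
{
    const char *messages;

    /* Assumption: adopt the user's environment settings;
     * without this the process stays in the "C" locale. */
    setlocale(LC_ALL, "");

    /* A NULL second argument only queries the category. */
    messages = setlocale(LC_MESSAGES, NULL);

    /* Strictly speaking CODESET follows LC_CTYPE, which is
     * normally set to the same locale as LC_MESSAGES. */
    *codeset = nl_langinfo(CODESET);

    return messages;
}
```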

This encoding is then used in a call to iconv( )
to convert the message from UTF-8 to the local
encoding - this requires you to have a valid
conversion descriptor opened with a call to
iconv_open( ).
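
Roughly, the conversion step could look like the sketch
below. The function name utf8_to_codeset() and the
fixed-size output buffer are assumptions for illustration
only; it also assumes the target is a byte-oriented
encoding where a single NUL byte terminates the string
(so not UTF-16 or UCS-4). Older Solaris headers declare
the input argument of iconv( ) as "const char **", so the
cast may need adjusting there.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: convert NUL-terminated UTF-8 text to
 * to_codeset (e.g. the value from nl_langinfo(CODESET)).
 * Returns 0 on success, -1 on any failure. The output is
 * always NUL-terminated if out_size > 0.
 */
int utf8_to_codeset(const char *to_codeset, const char *utf8,
                    char *out, size_t out_size)
{
    iconv_t cd;
    char *in_ptr = (char *)utf8;  /* iconv() wants char ** */
    char *out_ptr = out;
    size_t in_left, out_left;
    int rc = 0;

    if (out_size == 0)
        return -1;

    in_left = strlen(utf8);
    out_left = out_size - 1;      /* keep room for the NUL */

    /* Target encoding first, source encoding second. */
    cd = iconv_open(to_codeset, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left)
            == (size_t)-1) {
        rc = -1;                  /* bad input or buffer full */
    } else if (iconv(cd, NULL, NULL, &out_ptr, &out_left)
            == (size_t)-1) {
        rc = -1;                  /* could not flush shift state */
    }

    *out_ptr = '\0';
    iconv_close(cd);
    return rc;
}
```

The reverse direction only needs the two arguments of
iconv_open( ) swapped.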

Windows
=======

This is, as usual, an ugly beast. The primary
issue here is that Windows NT based OSs (NT4/2K/XP)
have the notion of both a System Locale and a
User Locale, which are almost, *but not quite*,
of the same status.

Specifically, console applications can only
display those glyphs that are supported by
the character set of the System Locale, irrespective
of what the User Locale is set to. GUI applications
fortunately do not share this problem.

This problem is not visible for Western European
languages, but is quite prominent for East Asian
languages like Japanese, Chinese and Korean.

In spite of the above, applications must still honour
the User Locale and list the above as a known
limitation of the OS itself.

To get the user locale, you have to call the Win32
function GetLocaleInfo( ) with LOCALE_USER_DEFAULT as
the first parameter.

To convert a message in UTF-8 encoding to the native
wide-character (UTF-16) form, you have to use
MultiByteToWideChar( ) with CP_UTF8 as the first
parameter.

This message can be written out to the console with
a call to WriteConsoleW( ) which requires the HANDLE
to the standard output - this can readily be retrieved
using a call to GetStdHandle( ).
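
A rough sketch of the conversion and console write,
assuming a Win32 compiler. The function names
utf8_to_wide() and write_wide_to_console() are invented
for this example, and the whole thing is compiled out on
other platforms. Note that WriteConsoleW( ) fails when
standard output is redirected to a file or pipe, which
the sketch merely reports as an error.

```c
#ifdef _WIN32
#include <windows.h>
#include <stdlib.h>

/*
 * Hypothetical helper: convert NUL-terminated UTF-8 to a
 * malloc()ed UTF-16 string. *out_len receives the length in
 * WCHARs, excluding the terminating NUL. Returns NULL on
 * failure; the caller frees the result.
 */
WCHAR *utf8_to_wide(const char *utf8, int *out_len)
{
    WCHAR *wide;
    /* With a length of -1 the input is taken as
     * NUL-terminated and the returned count includes
     * the NUL. A NULL/0 output asks for the size only. */
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);

    if (n == 0)
        return NULL;
    wide = malloc((size_t)n * sizeof *wide);
    if (wide == NULL)
        return NULL;
    if (MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, n) == 0) {
        free(wide);
        return NULL;
    }
    *out_len = n - 1;
    return wide;
}

/* Hypothetical helper: write len WCHARs to the console.
 * Returns 0 on success, -1 on failure. */
int write_wide_to_console(const WCHAR *wide, int len)
{
    DWORD written = 0;
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);

    if (out == NULL || out == INVALID_HANDLE_VALUE)
        return -1;
    /* Fails if the handle is not a real console. */
    if (!WriteConsoleW(out, wide, (DWORD)len, &written, NULL))
        return -1;
    return 0;
}
#endif /* _WIN32 */
```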

For Windows 95/98/ME, as has been pointed out, there is
no Unicode support at all and a separate UNICOWS library
is required. Fortunately for us, Win 9x/ME support
is not needed.

Note that we haven't used wchar_t at all: on Solaris,
its encoding is not clearly specified (it looks like
UCS-4) and also seems to vary with Solaris releases,
while on Windows it is very likely UTF-16 (LE).

Ranjit.

-- 
Ranjit Mathew          Email: rmathew AT hotmail DOT com

Bangalore, INDIA.      Web: http://ranjitmathew.tripod.com/