[Bug fortran/48972] OPEN with Unicode file name

jb at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Thu May 12 21:09:00 GMT 2011


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48972

--- Comment #8 from Janne Blomqvist <jb at gcc dot gnu.org> 2011-05-12 21:02:40 UTC ---
(In reply to comment #4)
> (In reply to comment #3)
> > - Specify that the default character set is UTF-8.
> 
> What do you mean by that? I know 1 byte and 4 byte character variables, but I
> do not see where UTF-8 fits in there. (One can place UTF-8 into
> character(kind=1) - and it also kind of works OK. But if one wants to use
> len(), string manipulation ("change 3 character to ..."), or tabulated I/O that
> will fail. But as quirky workaround, one can use UTF-8 file names with kind=1
> character variables - at least under Unix/Linux.)

Well, for backwards compatibility I strongly think we should keep kind=1 the
default. What I meant was that bytes whose values are not part of the 7-bit
ASCII character set can be interpreted as UTF-8, since UTF-8 is backwards
compatible with ASCII. In most cases this won't matter, but it does matter,
e.g., on mingw as discussed in this PR, where we need to convert the default
character filename to UTF-16.
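
To illustrate the idea, here is a rough sketch in Python (purely illustrative,
not libgfortran code; the function name default_char_to_utf16 is made up for
this example). It assumes the kind=1 bytes are valid UTF-8 and re-encodes them
as UTF-16 code units, which is the form a wide-character Windows API would
expect:

```python
def default_char_to_utf16(name_bytes: bytes) -> bytes:
    """Interpret kind=1 file-name bytes as UTF-8, re-encode as UTF-16LE.

    Hypothetical helper for illustration only. Raises UnicodeDecodeError
    if the bytes are not valid UTF-8 (e.g. a Latin-1 encoded name).
    """
    return name_bytes.decode("utf-8").encode("utf-16-le")


# Pure 7-bit ASCII is already valid UTF-8, so such names are unaffected
# by the choice to interpret the bytes as UTF-8.
ascii_name = b"data.txt"
ascii_wide = default_char_to_utf16(ascii_name)

# A name with a non-ASCII character: U+00E5 takes two bytes in UTF-8
# but a single 16-bit code unit in UTF-16.
utf8_name = "\u00e5.txt".encode("utf-8")
utf8_wide = default_char_to_utf16(utf8_name)
```

Note that a byte string which is not valid UTF-8 makes the decode step fail,
so a real implementation would need some fallback policy for that case.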

The other option, I suppose, would be to regard the default character set as
some locale-dependent charset, and then use some char->wchar_t conversion
routines from the MS libc, assuming such things exist.

FWIW, the issue that the length of a string does not equal the width when
printed is not unique to UTF-8. The same issue is seen with kind=4 (UTF-32) as
well, e.g. if one uses combining diacritics. So regardless of whether one uses
UTF-8, UTF-16 or UTF-32, with Unicode one needs to be prepared for the fact
that the number of code points in a string might not be the same as the width.
Fortran is not really prepared for this, so I suppose that making the string
intrinsics etc. consider bytes==characters (for kind=1) is the best we can do
in any case.
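
A small Python sketch of the length-versus-width point (illustrative only; the
helper name code_points_and_width is invented here, and the "width" is a crude
approximation that only discounts combining marks and ignores things like East
Asian wide characters):

```python
import unicodedata


def code_points_and_width(s: str) -> tuple:
    """Return (number of code points, approximate printed width).

    Assumption: every non-combining code point occupies one column.
    """
    width = sum(1 for ch in s if unicodedata.combining(ch) == 0)
    return len(s), width


# 'e' followed by U+0301 COMBINING ACUTE ACCENT: printed as one glyph.
decomposed = "e\u0301"

n_code_points, n_columns = code_points_and_width(decomposed)

# Storage size differs per encoding, but none of them equals the width.
n_utf8_bytes = len(decomposed.encode("utf-8"))
n_utf16_units = len(decomposed.encode("utf-16-le")) // 2
n_utf32_units = len(decomposed.encode("utf-32-le")) // 4
```

Here the string has two code points in every encoding, yet occupies a single
column when printed, so even a len() on a kind=4 string would not give the
width.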

> Regarding the ENCODING= specifier: That's already used for the encoding of the
> file content - one shan't use it to also modify the interpretation of the FILE
> string.

Yes, the point was not related to the FILE= issue. Rather, the point was that
if we make UTF-8 the default charset, then it makes sense to also make the
default file encoding UTF-8.

> I still think that the default character encoding should remain 1 byte
> (kind=1), which is simply passed as is to "open()". 

Yes, I agree, at least for Unix. What about mingw, then, if the string contains
characters not part of the 7-bit ASCII charset? Will MS libc convert it into
UTF-16 assuming the encoding is according to the current locale, or what?

> And UCS-4 as FILE= argument
> should simply be supported as vendor extension. One just needs to tell the
> library that the string is in UCS-4. 

I'm not convinced of the value of such an extension. Fortran already suffers
from too many vendor extensions.

> This wide string could then directly used
> for Windows' _wopen

Not really, since wchar_t on Windows is a 16-bit type (UTF-16), not a 32-bit
one.
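
To make the mismatch concrete, here is a sketch in Python of the conversion
that would be needed between a UCS-4 string and 16-bit code units (illustrative
only; ucs4_to_utf16_units is a made-up name, not an existing library routine).
Code points above U+FFFF do not fit in one 16-bit unit and have to be split
into a surrogate pair:

```python
def ucs4_to_utf16_units(code_points):
    """Convert a sequence of UCS-4 code points to UTF-16 code units.

    Hypothetical helper for illustration. Rejects values that are not
    valid Unicode scalar values (surrogates or anything above U+10FFFF).
    """
    units = []
    for cp in code_points:
        if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not a Unicode scalar value: %r" % (cp,))
        if cp < 0x10000:
            # Fits directly in a single 16-bit unit.
            units.append(cp)
        else:
            # Split into a high and a low surrogate.
            v = cp - 0x10000
            units.append(0xD800 + (v >> 10))
            units.append(0xDC00 + (v & 0x3FF))
    return units


bmp_units = ucs4_to_utf16_units([0x00E5])      # one unit
astral_units = ucs4_to_utf16_units([0x1F600])  # two units (surrogate pair)
```

So a kind=4 string cannot simply be handed over as an array of wchar_t on
Windows; at the very least each element has to be narrowed, and some have to
be expanded into two units.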

> or converted to UTF-8 for Unix/Linux.

Well, that is also a choice that needs to be made, analogous to how we convert
default char file names to UTF-16 on mingw. That is, do we convert the name to
UTF-8 or to whatever the charset of the current locale is (LC_CTYPE)?
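
A short Python sketch of why the choice matters (illustrative only; the helper
name encode_file_name is invented, and ISO-8859-1 is just an assumed example
of a non-UTF-8 locale charset). The same name yields different byte sequences,
and hence different files on disk, depending on the target charset, and some
names cannot be represented in a legacy charset at all:

```python
def encode_file_name(name: str, charset: str) -> bytes:
    """Encode a Unicode file name into the bytes passed to open().

    Hypothetical helper: 'charset' stands in for either a fixed UTF-8
    policy or the charset of the current locale (LC_CTYPE).
    """
    return name.encode(charset)


name = "\u00e5.txt"

as_utf8 = encode_file_name(name, "utf-8")         # b'\xc3\xa5.txt'
as_latin1 = encode_file_name(name, "iso-8859-1")  # b'\xe5.txt'

# A name outside the repertoire of the locale charset cannot be converted.
try:
    encode_file_name("\u4e2d.txt", "iso-8859-1")
    representable = True
except UnicodeEncodeError:
    representable = False
```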

So in one way it would be nice to make gfortran respect the current locale
charset, but OTOH Unicode was invented because the locale charset system is a
failure, and just using Unicode everywhere would in some respects be simpler.


