48972 – OPEN with Unicode file name

Summary: OPEN with Unicode file name

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	fortran
Version:	4.7.0

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	accepts-invalid, diagnostic

Depends on:
Blocks:

Reported:	2011-05-11 21:48 UTC by Tobias Burnus
Modified:	2011-11-07 22:35 UTC
CC List:	3 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2011-11-07 00:00:00

Attachments
Test case (781 bytes, text/plain) 2011-05-12 12:39 UTC, Tobias Burnus

Description Tobias Burnus 2011-05-11 21:48:58 UTC

This PR is motivated by the thread which started at
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=COMP-FORTRAN-90;59308f3c.1105


GNU Fortran happily accepts kind=4 character strings to the FILE= argument of the OPEN statement - and probably also to the other string arguments.

However, the Fortran 2008 standard has:

  R905 connect-spec is
            ...
                   or   FILE = file-name-expr
with
  R906 file-name-expr is scalar-default-char-expr

Thus, such strings should be rejected -- at least with -std=f2008.

 * * *

Independent of that, it would be convenient if passing a UCS-4 string were allowed as a vendor extension. The only open question is how the library should handle it.

For Unix systems, I think converting the UCS-4 to UTF-8 and using it in the normal file open should work.
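
A minimal sketch of such a conversion, for illustration only. The function name ucs4_to_utf8 and its interface are hypothetical; this is not the routine libgfortran uses. It encodes an array of UCS-4 code points as a NUL-terminated UTF-8 byte string that could then be handed to the normal file open:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a libgfortran function): encode N UCS-4 code
   points from SRC as UTF-8 into DST, which has room for CAP bytes
   including the terminating NUL.  Returns the number of bytes written,
   not counting the NUL, or (size_t) -1 if a code point is invalid or
   DST is too small.  */
size_t
ucs4_to_utf8 (const uint32_t *src, size_t n, char *dst, size_t cap)
{
  size_t out = 0;

  if (cap == 0)
    return (size_t) -1;

  for (size_t i = 0; i < n; i++)
    {
      uint32_t c = src[i];
      size_t len;

      /* Reject surrogates and values beyond the Unicode range.  */
      if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return (size_t) -1;

      len = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;

      /* Keep one byte free for the terminating NUL.  */
      if (out + len >= cap)
        return (size_t) -1;

      if (len == 1)
        dst[out++] = (char) c;
      else if (len == 2)
        {
          dst[out++] = (char) (0xC0 | (c >> 6));
          dst[out++] = (char) (0x80 | (c & 0x3F));
        }
      else if (len == 3)
        {
          dst[out++] = (char) (0xE0 | (c >> 12));
          dst[out++] = (char) (0x80 | ((c >> 6) & 0x3F));
          dst[out++] = (char) (0x80 | (c & 0x3F));
        }
      else
        {
          dst[out++] = (char) (0xF0 | (c >> 18));
          dst[out++] = (char) (0x80 | ((c >> 12) & 0x3F));
          dst[out++] = (char) (0x80 | ((c >> 6) & 0x3F));
          dst[out++] = (char) (0x80 | (c & 0x3F));
        }
    }

  dst[out] = '\0';
  return out;
}
```

For the two code points used in the example program, z'4F60' and z'597D', this yields the six bytes E4 BD A0 E5 A5 BD.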

Windows, however, probably needs a special solution, as it seems to use UTF-16 everywhere [1]. Thus, one should be able to pass the file name, converted to UTF-16, directly to CreateFileW [2].

[1] http://msdn.microsoft.com/en-us/library/dd374081%28v=vs.85%29.aspx
[2] http://msdn.microsoft.com/en-us/library/aa363858%28v=vs.85%29.aspx
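
As an illustration of the Windows side, the following sketch converts UCS-4 code points to UTF-16 code units. The name ucs4_to_utf16 and its interface are hypothetical. Code points above U+FFFF do not fit into one 16-bit unit and are written as a surrogate pair; the sketch uses uint16_t so that it does not depend on the platform's wchar_t width:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert N UCS-4 code points from SRC to UTF-16
   code units in DST, which has room for CAP units including the
   terminating zero unit.  Returns the number of units written, not
   counting the terminator, or (size_t) -1 on error.  */
size_t
ucs4_to_utf16 (const uint32_t *src, size_t n, uint16_t *dst, size_t cap)
{
  size_t out = 0;

  if (cap == 0)
    return (size_t) -1;

  for (size_t i = 0; i < n; i++)
    {
      uint32_t c = src[i];

      /* Reject surrogates and values beyond the Unicode range.  */
      if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return (size_t) -1;

      if (c < 0x10000)
        {
          /* Basic Multilingual Plane: one code unit.  */
          if (out + 1 >= cap)
            return (size_t) -1;
          dst[out++] = (uint16_t) c;
        }
      else
        {
          /* Supplementary planes: high and low surrogate.  */
          if (out + 2 >= cap)
            return (size_t) -1;
          c -= 0x10000;
          dst[out++] = (uint16_t) (0xD800 | (c >> 10));
          dst[out++] = (uint16_t) (0xDC00 | (c & 0x3FF));
        }
    }

  dst[out] = 0;
  return out;
}
```

On Windows, where wchar_t is 16 bits wide, the resulting buffer could presumably be passed as the file-name argument of CreateFileW; that call itself is not shown here.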



Example program. Sample usage:
  $ gfortran test.f90
  $ ./a.out 
  Enter filename: ファイル
  $

This should create "ファイル.dat" with the content "Hello World and Ni Hao -- 你好". The content is written correctly, but the file is created under the name "?" (= \343) instead of "ファイル.dat". If one enters "44", the created file is named just "4".
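
A likely explanation for the "44" case, which is an assumption and not confirmed anywhere in this report, is that the bytes of the kind=4 string are treated as an ordinary NUL-terminated byte string: on a little-endian machine the character "4" is stored as 34 00 00 00, so the name ends after the first byte. The hypothetical helper below lays the code points out in little-endian order explicitly and reports the length a C string function would see:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration: store N UCS-4 code points from SRC as
   little-endian bytes in BUF (which must hold 4 * N + 1 bytes) and
   return the length that strlen reports for that byte sequence.  */
size_t
c_string_length_of_ucs4_le (const uint32_t *src, size_t n, char *buf)
{
  for (size_t i = 0; i < n; i++)
    {
      buf[4 * i + 0] = (char) (src[i] & 0xFF);
      buf[4 * i + 1] = (char) ((src[i] >> 8) & 0xFF);
      buf[4 * i + 2] = (char) ((src[i] >> 16) & 0xFF);
      buf[4 * i + 3] = (char) ((src[i] >> 24) & 0xFF);
    }
  buf[4 * n] = '\0';

  /* strlen stops at the first zero byte, i.e. inside the first
     character whenever that character is below U+0100.  */
  return strlen (buf);
}
```

For the two code points of "44" the function returns 1, and the first byte of the buffer is '4'.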


use iso_fortran_env
implicit none
integer, parameter :: ucs4  = selected_char_kind ('ISO_10646')
character(len=30, kind=ucs4) :: str
integer :: unit

open(unit=INPUT_UNIT, encoding='utf-8')
write(*, '(a)', advance='no') 'Enter filename: '
read(*,*) str
open(newunit=unit, file=trim(str)//ucs4_'.dat', encoding='utf-8')
write(unit, '(a)') ucs4_'Hello World and Ni Hao -- ' &
                   // char (int (z'4F60'), ucs4)     &
                   // char (int (z'597D'), ucs4)
close(unit)
end

Comment 1 Tobias Burnus 2011-05-12 06:15:57 UTC

For the diagnostic, the following untested patch should do. For Unicode file-name support more work needs to be done ...

--- a/gcc/fortran/io.c
+++ b/gcc/fortran/io.c
@@ -1478,6 +1478,13 @@ resolve_tag (const io_tag *tag, gfc_expr *e)
       return FAILURE;
     }
 
+  if (e->ts.type == BT_CHARACTER && e->ts.kind != gfc_default_character_kind)
+    {
+      gfc_error ("%s tag at %L must be a character string of default kind",
+                tag->name, &e->where);
+      return FAILURE;
+    }
+
   if (e->rank != 0)
     {
       gfc_error ("%s tag at %L must be scalar", tag->name, &e->where);

Comment 2 Tobias Burnus 2011-05-12 12:39:32 UTC

Created attachment 24238
Test case

(In reply to comment #1)
> For the diagnostic, the following untested patch should do.

Well, almost. It fails for FORMAT/fmt=; I have to admit that I do not quite understand why io.c's resolve_tag_format tests for a default-kind character only when e->expr_type == EXPR_CONSTANT.

Jerry, could you have a look? I am a bit lost.

Comment 3 Janne Blomqvist 2011-05-12 12:56:07 UTC

Wouldn't a standard-conforming way to support Unicode file names be for
gfortran to 

- Specify that the default character set is UTF-8. 

- Then an internal read or write could be used to do a UTF-8 <->  UTF-32
conversion, if the user program uses kind=4 characters. Or if the user program
stuffs utf-8 data into default character variables, nothing needs to be done.

- When passing a filename in the open statement, on posix this can be passed
as-is to open(), on mingw the library would need to do a utf-8 -> utf-16
conversion, then call wopen(). And similarly for other syscalls where we pass
path names (e.g. stat(), access() and so on).
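
The utf-8 -> utf-16 step for mingw could look roughly like the sketch below. The function name utf8_to_utf16 and its interface are hypothetical; the sketch rejects malformed or truncated sequences but, to stay short, does not reject overlong encodings:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert the NUL-terminated UTF-8 string SRC to
   UTF-16 code units in DST, which has room for CAP units including the
   terminating zero unit.  Returns the number of units written, not
   counting the terminator, or (size_t) -1 on malformed input or
   insufficient space.  Overlong encodings are not detected.  */
size_t
utf8_to_utf16 (const char *src, uint16_t *dst, size_t cap)
{
  const unsigned char *p = (const unsigned char *) src;
  size_t out = 0;

  if (cap == 0)
    return (size_t) -1;

  while (*p)
    {
      uint32_t c;
      int extra;

      /* The lead byte determines how many continuation bytes follow.  */
      if (*p < 0x80)
        { c = *p; extra = 0; }
      else if ((*p & 0xE0) == 0xC0)
        { c = *p & 0x1F; extra = 1; }
      else if ((*p & 0xF0) == 0xE0)
        { c = *p & 0x0F; extra = 2; }
      else if ((*p & 0xF8) == 0xF0)
        { c = *p & 0x07; extra = 3; }
      else
        return (size_t) -1;
      p++;

      /* A NUL here fails the test, so truncated input is not overrun.  */
      for (; extra > 0; extra--, p++)
        {
          if ((*p & 0xC0) != 0x80)
            return (size_t) -1;
          c = (c << 6) | (*p & 0x3F);
        }

      if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return (size_t) -1;

      if (c < 0x10000)
        {
          if (out + 1 >= cap)
            return (size_t) -1;
          dst[out++] = (uint16_t) c;
        }
      else
        {
          if (out + 2 >= cap)
            return (size_t) -1;
          c -= 0x10000;
          dst[out++] = (uint16_t) (0xD800 | (c >> 10));
          dst[out++] = (uint16_t) (0xDC00 | (c & 0x3FF));
        }
    }

  dst[out] = 0;
  return out;
}
```

The resulting buffer would then be handed to the wide-character variant of the system call; on posix no conversion is needed at all.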

In any case, initially something like your patch in #c1 looks good; regardless of how/if we decide to support Unicode filenames, currently we don't do anything sensible for kind=4 file names.
And as you say, it's a standard violation.

Similarly to specifying the default character set as UTF-8, we could specify
the default encoding as UTF-8 (see ENCODING= in OPEN (9.5.6.9) and INQUIRE
(9.10.2.10)). That way we wouldn't need to handle the non-Unicode cases in
10.7.1 at all. I think we're mostly there already, really, what's lacking is
perhaps a "GFortran and Unicode" chapter in the manual.

Comment 4 Tobias Burnus 2011-05-12 13:37:34 UTC

(In reply to comment #3)
> Wouldn't a standard-conforming way to support Unicode file names be for
> gfortran to

I am admittedly a bit lost.

> - Specify that the default character set is UTF-8.

What do you mean by that? I know 1 byte and 4 byte character variables, but I do not see where UTF-8 fits in there. (One can place UTF-8 into character(kind=1) - and it also kind of works OK. But if one wants to use len(), string manipulation ("change 3 character to ..."), or tabulated I/O that will fail. But as quirky workaround, one can use UTF-8 file names with kind=1 character variables - at least under Unix/Linux.)

Regarding the ENCODING= specifier: That's already used for the encoding of the file content - one shan't use it to also modify the interpretation of the FILE string.

I still think that the default character encoding should remain 1 byte (kind=1), which is simply passed as is to "open()". And UCS-4 as FILE= argument should simply be supported as a vendor extension. One just needs to tell the library that the string is in UCS-4. This wide string could then be used directly for Windows' _wopen or converted to UTF-8 for Unix/Linux. (The conversion routine already exists for UCS-4 <-> UTF-8 I/O.)

Comment 5 Tobias Burnus 2011-05-12 17:40:32 UTC

Author: burnus
Date: Thu May 12 17:40:29 2011
New Revision: 173708

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=173708
Log:
2011-05-12  Tobias Burnus  <burnus@net-b.de>

        PR fortran/48972
        * resolve.c (resolve_intrinsic): Don't resolve module
        intrinsics multiple times.

2011-05-12  Tobias Burnus  <burnus@net-b.de>

        PR fortran/48972
        * gfortran.dg/iso_c_binding_compiler_3.f90: New.


Added:
    trunk/gcc/testsuite/gfortran.dg/iso_c_binding_compiler_3.f90
Modified:
    trunk/gcc/fortran/ChangeLog
    trunk/gcc/fortran/resolve.c
    trunk/gcc/testsuite/ChangeLog

Comment 6 Tobias Burnus 2011-05-12 17:44:21 UTC

(In reply to comment #5)
> New Revision: 173708

Wrong PR number - supposed to go to PR 45823

Comment 7 Jerry DeLisle 2011-05-12 18:39:19 UTC

In reply to comment #2: some tags are constants and some are variable expressions, so you have to resolve the correct one. I have not looked for a while, but I think there is a resolve_tag_e or such.

Comment 8 Janne Blomqvist 2011-05-12 21:02:40 UTC

(In reply to comment #4)
> (In reply to comment #3)
> > - Specify that the default character set is UTF-8.
> 
> What do you mean by that? I know 1 byte and 4 byte character variables, but I
> do not see where UTF-8 fits in there. (One can place UTF-8 into
> character(kind=1) - and it also kind of works OK. But if one wants to use
> len(), string manipulation ("change 3 character to ..."), or tabulated I/O that
> will fail. But as quirky workaround, one can use UTF-8 file names with kind=1
> character variables - at least under Unix/Linux.)

Well, for backwards compatibility I strongly think we should keep kind=1 the default. What I meant was that bytes whose values are not part of the 7-bit ASCII character set can be interpreted as UTF-8, since UTF-8 is backwards compatible with ASCII. In most cases this won't matter, but it does matter, e.g., as discussed in this PR, on mingw, where we need to convert the default character filename to utf-16.

The other option, I suppose, would be to regard the default character set as some locale-dependent charset, and then use some char->wchar_t conversion routines from the MS libc, assuming such things exist.

FWIW, the issue that the length of a string does not equal the width when printed is not unique to utf-8. The same issue is seen with kind=4 (utf-32) as well e.g. if one uses diacritic characters. So regardless of whether one uses UTF-8, UTF-16 or UTF-32, with unicode one needs to be prepared for the fact that the number of code points in a string might not be the same as the width. Fortran is not really prepared for this, so I suppose that making the string intrinsics etc. consider bytes==characters (for kind=1) is the best we can do in any case.
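
To make the distinction concrete, here is a small sketch with a hypothetical helper name that counts code points in a UTF-8 byte string by skipping continuation bytes. Neither the byte count nor the code point count is the display width:

```c
#include <stddef.h>

/* Hypothetical helper: count the code points in the NUL-terminated
   UTF-8 string S by counting every byte that is not a continuation
   byte (10xxxxxx).  The input is assumed to be well-formed.  */
size_t
utf8_code_points (const char *s)
{
  size_t n = 0;

  for (const unsigned char *p = (const unsigned char *) s; *p; p++)
    if ((*p & 0xC0) != 0x80)
      n++;

  return n;
}
```

For example, "e" followed by the combining acute accent U+0301 is the three bytes 65 CC 81: three bytes and two code points, but normally one column when printed. The same two code points stored in a kind=4 string still have length 2.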

> Regarding the ENCODING= specifier: That's already used for the encoding of the
> file content - one shan't use it to also modify the interpretation of the FILE
> string.

Yes, the point was not related to the FILE= issue. Rather, that if we make utf-8 the default charset then it makes sense to also make the default file encoding utf-8.

> I still think that the default character encoding should remain 1 byte
> (kind=1), which is simply passed as is to "open()". 

Yes, I agree, at least for Unix. What about mingw, then, if the string contains characters not part of the 7-bit ASCII charset? Will MS libc convert it into UTF-16 assuming the encoding is according to the current locale, or what?

> And UCS-4 as FILE= argument
> should simply be supported as vendor extension. One just needs to tell the
> library that the string is in UCS-4. 

I'm not convinced of the value of such an extension. Fortran already suffers from too many vendor extensions.

> This wide string could then directly used
> for Windows' _wopen

Not really, since wchar_t on windows is a 16-bit type (utf-16), not a 32-bit one.

> or converted to UTF-8 for Unix/Linux.

Well, that is also a choice that needs to be made, analogous on how to convert default char file names to utf-16 on mingw. That is, do we convert the name to UTF-8 or to whatever the charset of the current locale is (LC_CTYPE)?

So in one way it would be nice to make gfortran respect the current locale charset, but OTOH Unicode was invented because the locale charset system is a failure, and just using unicode everywhere would in some respects be simpler.

Comment 9 Tobias Burnus 2011-05-13 18:16:40 UTC

Author: burnus
Date: Fri May 13 18:16:37 2011
New Revision: 173736

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=173736
Log:
2011-05-12  Tobias Burnus  <burnus@net-b.de>

        PR fortran/48972
        * io.c (resolve_tag_format, resolve_tag): Make sure
        that the string is of default kind.
        (gfc_resolve_inquire): Also resolve decimal tag.

2011-05-12  Tobias Burnus  <burnus@net-b.de>

        PR fortran/48972
        * gfortran.dg/io_constraints_8.f90: New.
        * gfortran.dg/io_constraints_9.f90: New.


Added:
    trunk/gcc/testsuite/gfortran.dg/io_constraints_8.f90
    trunk/gcc/testsuite/gfortran.dg/io_constraints_9.f90
Modified:
    trunk/gcc/fortran/ChangeLog
    trunk/gcc/fortran/io.c
    trunk/gcc/testsuite/ChangeLog

Comment 10 Tobias Burnus 2011-05-13 20:59:09 UTC

Author: burnus
Date: Fri May 13 20:59:07 2011
New Revision: 173738

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=173738
Log:
2011-05-13  Tobias Burnus  <burnus@net-b.de>

        PR fortran/48972
        PR fortran/48991
        * gfortran.dg/assign_8.f90: Update dg-error.


Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gfortran.dg/assign_8.f90

Comment 11 Tobias Burnus 2011-05-14 11:55:14 UTC

Done: Constraint diagnostic of the Fortran standard.

To be done: Adding vendor extension to support UCS-4 arguments to OPEN's and INQUIRE's file argument.