Bug 35863 - [F2003] Implement ENCODING="UTF-8"
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: libfortran
Version: 4.4.0
Importance: P3 normal
Target Milestone: ---
Assignee: Jerry DeLisle
Reported: 2008-04-07 22:35 UTC by Jerry DeLisle
Modified: 2008-08-16 07:09 UTC
Last reconfirmed: 2008-06-07 20:18:41


Description Jerry DeLisle 2008-04-07 22:35:20 UTC
Front end and library are ready to handle this when implemented.
Comment 1 Janne Blomqvist 2008-04-14 18:55:40 UTC
Confirmed.

This could be a bit tricky to get right. OTOH Fortran is fortunate enough that there are real strings and not char arrays like in C, so from a user perspective it should be pretty transparent. But certainly the implementation can be tricky. Perhaps we should ask advice from e.g. Python developers, who have already implemented Unicode support in a language with a runtime library written in C?

http://www.cl.cam.ac.uk/~mgk25/unicode.html

Specifically

http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod

http://www.cl.cam.ac.uk/~mgk25/unicode.html#c
Comment 2 Francois-Xavier Coudert 2008-04-15 10:45:40 UTC
(In reply to comment #0)
> Front end and library are ready to handle this when implemented.

Front-end is ready? Is ENCODING="UTF-8" related to UCS-4 support? Because if it is, then the front-end is not ready, it only supports a single character kind.

(In reply to comment #1)
> This could be a bit tricky to get right. OTOH Fortran is fortunate enough that
> there are real strings and not char arrays like in C, so from a user
> perspective it should be pretty transparent.

Well, I'm not too sure it's hard. We are not required to support UTF-8 strings as a character kind (that would be really hard) but just UCS-4 strings (i.e. UTF-32), which is basically (as I see it):
  - remove limitations in the front-end that there is only one character kind
  - make a new character kind, as an array of 32-bit integers and a length
  - adjust library functions

Then, I/O with UTF-8 encoding just needs UTF-8 <--> UTF-32 conversions, which is only a few dozen lines of code (unless I'm confused).
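
For illustration, the UTF-32 to UTF-8 direction of that conversion can be sketched as below. This is a minimal sketch, not the code that was committed for this PR; the function name ucs4_to_utf8 is hypothetical, and rejecting surrogates and values above U+10FFFF is an assumption about the desired strictness.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one UCS-4 (UTF-32) code point as UTF-8.
   Writes up to 4 bytes to out and returns the number of bytes written,
   or 0 if cp is not a valid Unicode scalar value.
   (Hypothetical helper, not part of libgfortran.)  */
size_t ucs4_to_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx followed by 3 x 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

A formatted write of a 4-byte character string would call something like this once per character and copy the returned bytes to the output buffer.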
Comment 3 Tobias Burnus 2008-04-15 19:46:22 UTC
> > Front end and library are ready to handle this when implemented.
> Front-end is ready?
Yes, it is: ENCODING= is supported, and the rest is implemented neither in the library nor in the front end. Though I would not call this "ready".

> Is ENCODING="UTF-8" related to UCS-4 support?

I think it is in the end. You can easily use UTF-8 encoding already now, but '(a2)' might print one (non-ASCII) or two (ASCII) characters. To have something well-defined, only one-byte-wide characters can be used currently. For anything beyond that, UCS-4 is needed in the front end.
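
The '(a2)' ambiguity comes from counting bytes rather than characters. A minimal sketch of the difference, assuming well-formed UTF-8 input (the helper name utf8_codepoints is made up for this example):

```c
#include <stddef.h>

/* Count the code points in a NUL-terminated UTF-8 string by counting
   every byte that is not a continuation byte (bit pattern 10xxxxxx).
   Assumes the input is well-formed UTF-8; no validation is done.
   (Hypothetical helper for illustration only.)  */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the two-byte string "\xC2\xA7" (the section sign U+00A7 in UTF-8), strlen reports 2 while utf8_codepoints reports 1, which is exactly the mismatch a byte-oriented A2 edit descriptor runs into.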

Actually, I do not understand how to write things like 

   character(kind=myUCS4,len=20) :: foo = myUCS4_'Some UCS4 string'

(The problem is switching the encoding within the same file; good luck in finding an editor which supports this.)

If one does not need non-ascii character literals (i.e. reading from / writing to files), there is no problem.

Possible solutions?
a) Have a UCS-4 input file; then both default_'foo' and ucs4_'foo' work.
b) Expect that for myUCS4_'foo' literals the characters in the quotes are actually UTF-8.

I'm personally in favour of (b). I'm not quite sure whether this is really compatible with the Fortran standard, but I like the way of inputting the string.

Otherwise, I think Fortran misses a good way of inputting non-ascii characters in an ASCII file. C99 offers '\uXXXX' but unless I missed something in Fortran the equivalent would be:

I think (c) is what most programmers want, but I actually do not see how this should work syntax-wise; or should an ASCII literal automatically be handled as UTF-8? Then it would work: when assigning to a UCS-4 string, the UTF-8 gets properly converted and a non-ASCII character then has length one (len(char(), while if one assigns to an ASCII string, non-ASCII characters of course need more bytes and thus "len('§') == 2".

(b) is also an interesting problem. And (a) of course works, but it is quite cumbersome to use - Fortran misses the \uXXXX way of C for specifying a Unicode character; one can probably work with
   myUCS4string = char(int(z'A0FF'),kind=myUCS4)
but this is awful. (Actually, I think the standard does not even guarantee that it does this as "char" is processor dependent.)
Comment 4 Francois-Xavier Coudert 2008-04-15 20:53:22 UTC
(In reply to comment #3)
> Actually, I do not understand how to write things like 
>    character(kind=myUCS4,len=20) :: foo = myUCS4_'Some UCS4 string'

Ah, I'm glad that I'm not alone! I was thinking of asking advice on c.l.f when I get some time to write. I agree with you that it is not clear at all.

> (The problem is switching the encoding within the same file; good luck in
> finding an editor which supports this.)

I don't think there is such a thing as a file with multiple encodings, and we shouldn't create such a beast just for Fortran.

> a) Have a UCS-4 input file; then both default_'foo' and ucs4_'foo' work.

I'd suggest going for that.

> b) Expect that for myUCS4_'foo' literals the characters in the quotes are
> actually UTF-8.

See above, I don't think we want to mix encodings. But, we can support both (a) and (b): if the file is UCS4, go for (a), if the file is UTF-8, go for (b).

On a personal note, I would use (b) more than (a): UTF-8 is the way forward, and fixed-width encodings are a real pain for file representation (which is different than internal representation).

> Otherwise, I think Fortran misses a good way of inputting non-ascii characters
> in an ASCII file. C99 offers '\uXXXX'

We already have -fbackslash, I can see us accepting that kind of code with a given option; it would really be useful.
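
As a rough idea of what accepting a C99-style escape would involve, here is a minimal sketch of parsing "\uXXXX" into a code point. The function names are hypothetical, and nothing here is meant to describe how -fbackslash is actually implemented in the front end.

```c
#include <stdint.h>

/* Return the value of a hexadecimal digit, or -1 if c is not one. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse a "\uXXXX" escape at the start of s (backslash, 'u', then
   exactly four hex digits).  On success store the code point in *cp
   and return the number of characters consumed (always 6); return 0
   if s does not start with a well-formed escape.
   (Hypothetical helper for illustration only.)  */
int parse_u_escape(const char *s, uint32_t *cp)
{
    uint32_t value = 0;

    if (s[0] != '\\' || s[1] != 'u')
        return 0;
    for (int i = 2; i < 6; i++) {
        int d = hex_value(s[i]);
        if (d < 0)
            return 0;           /* also stops at a premature NUL */
        value = (value << 4) | (uint32_t)d;
    }
    *cp = value;
    return 6;
}
```

With this, the text `\u00A7` inside a literal would yield the code point 0xA7, which could then be stored directly in a UCS-4 string.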
Comment 5 Jerry DeLisle 2008-06-07 20:18:41 UTC
Working on this now.
Comment 6 Jerry DeLisle 2008-06-13 20:28:56 UTC
Subject: Bug 35863

Author: jvdelisle
Date: Fri Jun 13 20:28:08 2008
New Revision: 136763

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=136763
Log:
2008-06-13  Jerry DeLisle  <jvdelisle@gcc.gnu.org>

	PR fortran/35863
	* libgfortran.h: Change l8_to_l4_offset to big_endian and add endian_off.
	* runtime/main.c: Fix error in comment. Change l8_to_l4_offset to
	big_endian. (determine_endianness): Add endian_off and set its value
	according to big_endian.
	* gfortran.map: Add symbol for new _gfortran_transfer_character_wide.
	* io/io.h: Add prototype declarations for new functions.
	* io/list_read.c (list_formatted_read_scalar): Modify to handle kind=4.
	(list_formatted_read): Calculate stride based on kind for character type
	and use it when calling list_formatted_read_scalar.
	* io/inquire.c (inquire_via_unit): Change l8_to_l4_offset to big_endian.
	* io/open.c (st_open): Change l8_to_l4_offset to big_endian.
	* io/read.c (read_a_char4): New function to handle formatted read.
	* io/write.c: Define GFC_CHAR4(x) to improve readability of code.
	(write_a_char4): New function to handle formatted write.
	(write_character): Modify to accept the kind parameter and adjust for
	endianness of the machine. (list_formatted_write): Calculate the stride
	resulting from the kind and adjust the list_formatted_write_scalar call
	accordingly. (nml_write_obj): Adjust calls to write_character.
	(namelist_write): Likewise.
	* io/transfer.c (formatted_transfer_scalar): Rename 'len' argument to
	'kind' argument to better describe what it is. Add calls to new
	functions for kind == 4. (formatted_transfer): Modify to handle the case
	of type character and kind equals 4 to pass in the kind to the transfer
	routines. (transfer_character_wide): Add this new function.
	(transfer_array): Don't set kind to the character string length. Adjust
	strides based on character kind.
	(unformatted_read): Adjust size based on kind for character types.
	(unformatted_write): Likewise. (data_transfer_init): Change
	l8_to_l4_offset to big_endian. 

Modified:
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/gfortran.map
    trunk/libgfortran/io/fbuf.c
    trunk/libgfortran/io/inquire.c
    trunk/libgfortran/io/io.h
    trunk/libgfortran/io/list_read.c
    trunk/libgfortran/io/open.c
    trunk/libgfortran/io/read.c
    trunk/libgfortran/io/transfer.c
    trunk/libgfortran/io/write.c
    trunk/libgfortran/libgfortran.h
    trunk/libgfortran/runtime/main.c
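
Why endianness matters when writing a kind=4 character through a byte-oriented path can be sketched as follows. This is an assumption-laden illustration, not the committed code: the function names are hypothetical, and treating the log's endian_off as the offset of the low-order byte is a guess.

```c
#include <stdint.h>
#include <string.h>

/* Return 1 on a big-endian host and 0 on a little-endian one, by
   looking at the first byte in memory of a known 32-bit value. */
int host_is_big_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);
    return first == 0;
}

/* Offset of the least significant byte inside a 4-byte character:
   3 on big-endian hosts, 0 on little-endian ones.
   (Assumed to play the role the log calls endian_off.)  */
int low_byte_offset(void)
{
    return host_is_big_endian() ? 3 : 0;
}

/* Fetch the low-order byte of a 4-byte character through a byte
   pointer, as a byte-oriented output routine would have to. */
unsigned char low_byte(const uint32_t *c)
{
    return ((const unsigned char *)c)[low_byte_offset()];
}
```

For a character whose code point fits in one byte, low_byte returns the same value on either kind of host, which is what a default-encoding write of a kind=4 string needs.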

Comment 7 Jerry DeLisle 2008-06-13 20:31:34 UTC
Subject: Bug 35863

Author: jvdelisle
Date: Fri Jun 13 20:30:48 2008
New Revision: 136764

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=136764
Log:
2008-06-13  Jerry DeLisle  <jvdelisle@gcc.gnu.org>

	PR fortran/35863
	* trans-io.c (gfc_build_io_library_fndecls): Build declaration for
	transfer_character_wide which includes passing in the character kind to
	support wide character IO. (transfer_expr): If the kind == 4, create the
	argument and build the call.
	* gfortran.texi: Fix typo.

Modified:
    trunk/gcc/fortran/ChangeLog
    trunk/gcc/fortran/gfortran.texi
    trunk/gcc/fortran/trans-io.c

Comment 8 Jerry DeLisle 2008-06-13 20:35:56 UTC
Subject: Bug 35863

Author: jvdelisle
Date: Fri Jun 13 20:35:12 2008
New Revision: 136766

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=136766
Log:
2008-06-13  Jerry DeLisle  <jvdelisle@gcc.gnu.org>

	PR fortran/35863
	* gfortran.dg/widechar_IO_1.f90: New test.
	* gfortran.dg/widechar_IO_2.f90: New test.
	* gfortran.dg/widechar_IO_3.f90: New test.
	* gfortran.dg/widechar_IO_4.f90: New test.

Added:
    trunk/gcc/testsuite/gfortran.dg/widechar_IO_1.f90
    trunk/gcc/testsuite/gfortran.dg/widechar_IO_2.f90
    trunk/gcc/testsuite/gfortran.dg/widechar_IO_3.f90
    trunk/gcc/testsuite/gfortran.dg/widechar_IO_4.f90
Modified:
    trunk/gcc/testsuite/ChangeLog

Comment 9 Jerry DeLisle 2008-08-16 06:11:45 UTC
Subject: Bug 35863

Author: jvdelisle
Date: Sat Aug 16 03:38:31 2008
New Revision: 139147

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=139147
Log:
2008-08-15  Jerry DeLisle  <jvdelisle@gcc.gnu.org>

	PR libfortran/35863
	* intrinsics/selected_char_kind.c: Enable iso_10646.
	* io/read.c (typedef uchar): New type.
	(read_utf8): New function to read a single UTF-8 encoded character.
	(read_utf8_char1): New function to read UTF-8 into a KIND=1 string.
	(read_default_char1): New function to read default into KIND=1 string.
	(read_utf8_char4): New function to read UTF-8 into a KIND=4 string.
	(read_default_char4): New function to read default into a KIND=4 string.
	(read_a): Modify to use the new functions.
	(read_a_char4): Modify to use the new functions.
	* io/write.c (error.h): Add include. (typedef uchar): New type.
	(write_default_char4): New function to default write KIND=4 string.
	(write_utf8_char4): New function to UTF-8 write KIND=4 string.
	(write_a_char4): Modify to use new functions.
	(write_character): Modify to use new functions.

Modified:
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/intrinsics/selected_char_kind.c
    trunk/libgfortran/io/read.c
    trunk/libgfortran/io/write.c
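
Reading a single UTF-8 encoded character, as the log describes for read_utf8, amounts to decoding one byte sequence into a code point. The sketch below shows the general technique only; the name utf8_to_ucs4, its signature, and its strictness about overlong forms and surrogates are assumptions, not a copy of the library function.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from buf, of which len bytes are available.
   On success store the code point in *cp and return the number of bytes
   consumed (1 to 4); return 0 for an invalid or truncated sequence.
   (Hypothetical helper, not the libgfortran read_utf8.)  */
size_t utf8_to_ucs4(const unsigned char *buf, size_t len, uint32_t *cp)
{
    /* Smallest code point that legitimately needs n bytes. */
    static const uint32_t min_value[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t n;
    uint32_t value;

    if (len == 0)
        return 0;
    if (buf[0] < 0x80) {                    /* plain ASCII */
        *cp = buf[0];
        return 1;
    } else if ((buf[0] & 0xE0) == 0xC0) {   /* lead byte of 2 */
        n = 2; value = buf[0] & 0x1F;
    } else if ((buf[0] & 0xF0) == 0xE0) {   /* lead byte of 3 */
        n = 3; value = buf[0] & 0x0F;
    } else if ((buf[0] & 0xF8) == 0xF0) {   /* lead byte of 4 */
        n = 4; value = buf[0] & 0x07;
    } else {
        return 0;       /* stray continuation or invalid lead byte */
    }

    if (len < n)
        return 0;                           /* truncated sequence */
    for (size_t i = 1; i < n; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            return 0;                       /* not a continuation byte */
        value = (value << 6) | (buf[i] & 0x3F);
    }
    if (value < min_value[n])
        return 0;                           /* overlong encoding */
    if (value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF))
        return 0;                           /* outside Unicode scalar range */
    *cp = value;
    return n;
}
```

A formatted read into a KIND=4 string would store each decoded value as one character, while a read into a KIND=1 string has to decide what to do with values that do not fit in one byte.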

Comment 10 Jerry DeLisle 2008-08-16 06:11:48 UTC
Subject: Bug 35863

Author: jvdelisle
Date: Sat Aug 16 03:42:54 2008
New Revision: 139148

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=139148
Log:
2008-08-15  Jerry DeLisle  <jvdelisle@gcc.gnu.org>

	PR fortran/35863
	* gfortran.dg/utf8_1.f03: New test.
	* gfortran.dg/utf8_2.f03: New test.

Added:
    trunk/gcc/testsuite/gfortran.dg/utf8_1.f03
    trunk/gcc/testsuite/gfortran.dg/utf8_2.f03
Modified:
    trunk/gcc/testsuite/ChangeLog

Comment 11 Jerry DeLisle 2008-08-16 06:11:49 UTC
Subject: Bug 35863

Author: jvdelisle
Date: Sat Aug 16 03:36:32 2008
New Revision: 139146

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=139146
Log:
2008-08-15  Jerry DeLisle  <jvdelisle@gcc.gnu.org>

	PR fortran/35863
	* io.c (gfc_match_open): Enable UTF-8 in checks.
	* simplify.c (gfc_simplify_selected_char_kind): Enable iso_10646.

Modified:
    trunk/gcc/fortran/ChangeLog
    trunk/gcc/fortran/io.c
    trunk/gcc/fortran/simplify.c

Comment 12 Tobias Burnus 2008-08-16 07:09:55 UTC
FIXED on the trunk (4.4.0). There are still leftover PRs for UCS-4/UTF-8, but most things work.

TODO items:
PR 37077 character(kind=4) unit
PR 37076 concatenation of character(kind=4) literals
PR 37025 transfer to character(kind=4)
Comment 13 Tobias Burnus 2008-08-19 06:02:15 UTC
Subject: Bug 35863

Author: burnus
Date: Tue Aug 19 06:00:51 2008
New Revision: 139223

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=139223
Log:
2008-08-19  Tobias Burnus  <burnus@net-b.de>

       PR libfortran/35863
       * io/write.c (write_a_char4): Add missing variable declaration
       in HAVE_CRLF block.


Modified:
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/io/write.c