This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.

Re: RFC Wide Characters in I/O

From: Tobias Burnus <burnus at net-b dot de>
To: Jerry DeLisle <jvdelisle at verizon dot net>
Cc: Fortran List <fortran at gcc dot gnu dot org>
Date: Fri, 16 May 2008 23:01:44 +0200
Subject: Re: RFC Wide Characters in I/O
References: <482DD108.5080006@verizon.net>

Jerry DeLisle wrote:

> If we are given a wide character to transfer as unformatted, we simply transfer all the bytes as is. (I will confirm with the standard on this.)

I think this is OK, as the standard does not define the on-disk format; additionally, ENCODING= can only be given for formatted I/O. One should make sure that the endianness is properly honored. As a wide character is simply an INTEGER(4), one can write wide characters as what they actually are: an integer(4) [array].
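To make the endianness point concrete, here is a minimal sketch in C of writing kind=4 characters as raw 32-bit integers in a chosen byte order. The function name `put_ucs4_raw` and its `big_endian` flag are hypothetical and only illustrate the idea; this is not libgfortran code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of libgfortran: store n kind=4 characters
   (UCS-4 code points held as 32-bit integers) into dst, 4 bytes each,
   in the requested byte order. Returns the number of bytes written. */
size_t put_ucs4_raw(const uint32_t *src, size_t n, unsigned char *dst,
                    int big_endian)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        uint32_t c = src[i];
        unsigned char *p = dst + 4 * i;
        if (big_endian) {
            /* most significant byte first */
            p[0] = (unsigned char)(c >> 24);
            p[1] = (unsigned char)(c >> 16);
            p[2] = (unsigned char)(c >> 8);
            p[3] = (unsigned char)c;
        } else {
            /* least significant byte first */
            p[0] = (unsigned char)c;
            p[1] = (unsigned char)(c >> 8);
            p[2] = (unsigned char)(c >> 16);
            p[3] = (unsigned char)(c >> 24);
        }
    }
    return 4 * n;
}
```

Writing the bytes explicitly, rather than copying the in-memory representation, keeps the on-disk order independent of the host.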

> If the user has specified ENCODING="default" and the kind is 1, we do what we do now and transfer as 8-bit (mostly ASCII).

OK. This automatically supports 8-bit encodings and is backward compatible.

[still formatted and encoding='default':]

> If kind is greater than 1, I suggest we transfer each byte as is. So for kind=4, we would transfer 4 bytes. This would enable doing some packed 4x1 byte character stuff. I think the standard would allow this and that's why "default" is so loosely defined.

I think for I/O, almost everything is allowed by the standard. Note that the standard distinguishes between DEFAULT and ASCII characters. In case of gfortran, they are the same and thus the better-defined ASCII applies.

Somehow I think your suggestion is not necessarily the best one. For

write(foo,'(a)') ucs4_"Hello"

I expect that it writes 5 bytes and not 20 bytes to the unit "foo" or internal file "foo". Regarding internal files one finds in the standard:

"An input/output list shall not contain an item of nondefault character type if the input/output statement specifies an internal file of default character type. An input/output list shall not contain an item of nondefault character type other than ISO 10646 or ASCII character type if the input/output statement specifies an internal file of ISO 10646 character type. An input/output list shall not contain a character item of any character type other than ASCII character type if the input/output statement specifies an internal file of ASCII character type."

If I read this correctly, the standard requires the same behaviour. Thus I think it is OK to give an error if the character is not representable in a byte. (Thus char > 127 or char > 255 should be rejected.) Alternatively, one can convert these characters into a "?".

For files one is more flexible, but still I think being able to write a normal ASCII string without needing to assign the ucs4-kind string to a default-kind variable first is useful. Again either an error or conversion to "?" could be a solution.
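The two options (error or "?") can be sketched in C with a single routine. Everything here is an assumption for illustration: the name `narrow_ucs4`, the `limit` parameter (127 for ASCII only, 255 if Latin-1 is accepted) and the convention of returning the replacement count are not taken from libgfortran.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy n UCS-4 code points into a byte buffer,
   one byte per character. Code points above `limit` (127 for ASCII,
   255 if Latin-1 is accepted) become '?'. Returns the number of
   characters that had to be replaced; a caller preferring a run-time
   error can treat any nonzero result as failure. */
size_t narrow_ucs4(const uint32_t *src, size_t n, unsigned char *dst,
                   uint32_t limit)
{
    size_t replaced = 0;
    assert(src != NULL && dst != NULL && limit <= 255);
    for (size_t i = 0; i < n; i++) {
        if (src[i] <= limit) {
            dst[i] = (unsigned char)src[i];
        } else {
            dst[i] = '?';
            replaced++;
        }
    }
    return replaced;
}
```

With this, the five characters of ucs4_"Hello" produce five bytes and a replacement count of zero, so neither policy triggers.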

For the sake of completeness, the following can also be found in the standard:

"During input from a Unicode file, (1) characters in the record that correspond to an ASCII character variable shall have a position in the ISO 10646 character type collating sequence of 127 or less, and (2) characters in the record that correspond to a default character variable shall be representable in the default character type.

During input from a non-Unicode file, (1) characters in the record that correspond to a character variable shall have the kind of the character variable, and (2) characters in the record that correspond to a numeric or logical variable shall be of default character type.

During output to a Unicode file, all characters transmitted to the record are of ISO 10646 character type. If a character input/output list item or character string edit descriptor contains a character that is not representable in the ISO 10646 character type, the result is processor-dependent. During output to a non-Unicode file, characters transmitted to the record as a result of processing a character string edit descriptor or as a result of evaluating a numeric, logical, or default character data entity, are of type default character."

I think writing a wide-char variable into a default-kind internal file should give an error for characters which cannot fit into the variable. (One can argue whether only ASCII characters or also ISO-8859-1 (Latin1) characters fit.)

> (Note: It is my understanding, in reading about UTF-8, that it is internally represented by a fixed-width 4-byte hexadecimal number which, when transferred to an external file, is translated on the fly into a variable-width encoding. When reading back in, it is translated back to the 4-byte code.)

I'm not 100% sure whether I read this correctly, but the UTF-8 characters should be represented as UCS-4 (integer(4)) in the library until the record is finally written.
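As an illustration of the conversion that would happen when the record is written, here is a minimal C sketch that turns one UCS-4 value into its variable-width UTF-8 form. The name `utf8_encode` is hypothetical, and the sketch assumes the modern UTF-8 range (up to U+10FFFF, surrogates excluded); it is not the library's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one UCS-4 code point as UTF-8 into out
   (room for 4 bytes). Returns the number of bytes written, or 0 for
   values that are surrogates or lie above U+10FFFF. */
size_t utf8_encode(uint32_t c, unsigned char *out)
{
    assert(out != NULL);
    if (c < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {                   /* 3 bytes */
        if (c >= 0xD800 && c <= 0xDFFF)
            return 0;                    /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c <= 0x10FFFF) {                 /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }
    return 0;
}
```

ASCII characters come out as the same single byte, so a UTF-8 file containing only ASCII is byte-identical to the default-encoded one.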

> If the user has specified ENCODING="UTF-8" and the kind is less than 4, strictly speaking, that's an error, but we can give a warning and assume a truncation.

I guess you are talking about READING from a file; and (see quote above) this is valid as long as the characters are representable. Thus I would again go for "?" or for an error.
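For the reading direction, a rough C sketch of decoding UTF-8 into a kind=1 buffer with "?" substitution could look as follows. The names `utf8_decode` and `read_utf8_narrow`, the `limit` parameter and the (size_t)-1 error return are all assumptions made for this example; the decoder is deliberately simplified and does not reject overlong forms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the UTF-8 sequence starting at s (len bytes
   available) into *cp. Returns the number of bytes consumed, or 0 if the
   lead byte is invalid, the sequence is truncated, or a continuation
   byte is malformed. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need;
    uint32_t c;
    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;
    if (s[0] < 0x80)                { c = s[0];        need = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; need = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; need = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; need = 4; }
    else
        return 0;
    if (need > len)
        return 0;
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return need;
}

/* Read UTF-8 bytes into a one-byte-per-character buffer of capacity cap.
   Characters above `limit` (127 or 255) are stored as '?'. Returns the
   number of characters stored, or (size_t)-1 on malformed input. */
size_t read_utf8_narrow(const unsigned char *in, size_t len,
                        unsigned char *dst, size_t cap, uint32_t limit)
{
    size_t n = 0;
    assert(in != NULL && dst != NULL && limit <= 255);
    while (len > 0 && n < cap) {
        uint32_t c;
        size_t used = utf8_decode(in, len, &c);
        if (used == 0)
            return (size_t)-1;
        dst[n++] = (c <= limit) ? (unsigned char)c : (unsigned char)'?';
        in += used;
        len -= used;
    }
    return n;
}
```

A variant that raises a run-time error instead would simply return failure at the point where the sketch stores '?'.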

Tobias

PS: I think it is a matter of taste, but I slightly favour printing a "?" to giving a run-time error.

References:
- RFC Wide Characters in I/O
  - From: Jerry DeLisle
