This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.



Re: RFC Wide Characters in I/O


Janne Blomqvist wrote:
Jerry DeLisle wrote:
Modify the front-end in trans-io.c to build a call to a new function called transfer_wide_character. This function is used if we have an ENCODING= specifier or a wide character to transfer. This will retain the existing transfer_character call to maintain compatibility.

Repeating myself, the new function will be used for kind > 1, while transfer_character is retained as is for kind=1 and Hollerith.

This sounds good. As you imply, I agree that it's ok to restrict Hollerith to plain ASCII, as Hollerith was removed before any support for non-ASCII was added to the Fortran standard.


The downstream functions in the library such as list_formatted_read, list_formatted_write, formatted_transfer, unformatted_read, unformatted_write, already accept a kind parameter which transfer_character sets to 1.

Yes. But needless to say, there are probably a number of bugs lurking there, since so far kind has always been 1.

Always, always, always



If we are given a wide character to transfer as unformatted, we simply transfer all the bytes as is. (I will confirm with the standard on this.)

Yes, I think this is ok.
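As a rough illustration of "transfer all the bytes as is", here is a minimal Python sketch. The helper names are hypothetical, native byte order is an assumption, and the sketch ignores any record framing a real unformatted file would carry. It only shows that no encoding step is involved: each kind=4 character goes out as its four in-memory bytes and comes back unchanged.

```python
import io
import struct

def write_unformatted(stream, units):
    """Write 4-byte character units exactly as stored (hypothetical helper)."""
    # "=" selects native byte order with standard sizes, so "I" is 4 bytes.
    stream.write(struct.pack(f"={len(units)}I", *units))

def read_unformatted(stream, count):
    """Read back `count` 4-byte units with no translation (hypothetical helper)."""
    raw = stream.read(4 * count)
    return list(struct.unpack(f"={count}I", raw))

buf = io.BytesIO()
write_unformatted(buf, [0x41, 0x20AC])   # 'A' and the euro sign
buf.seek(0)
round_trip = read_unformatted(buf, 2)    # same two values, 8 bytes on "disk"
```

Because the bytes are copied verbatim, a file written this way is only readable on a machine with the same byte order.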


If the user has specified an ENCODING="default" [snip]... If kind is greater than 1, I suggest we transfer each byte as is. So for kind=4, we would transfer 4 bytes.

Like FX said on IRC, I think the sensible thing here would be to convert each external byte into a 4-byte internal representation. Anything else would be very confusing, IMHO.

Agreed
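The agreed behaviour for ENCODING="default" with kind=4 (one external byte becomes one 4-byte internal character) could be sketched like this. The function name is hypothetical and native byte order is an assumption; this is not the library's code.

```python
import struct

def widen_default(external):
    """Turn each external byte into one 4-byte internal unit (hypothetical helper)."""
    # Iterating over a bytes object yields integers 0..255; each one is
    # packed as an unsigned 32-bit value in native byte order.
    return struct.pack(f"={len(external)}I", *external)

internal = widen_default(b"abc")
# Three external bytes become twelve internal bytes, one 4-byte unit each.
units = struct.unpack("=3I", internal)
```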



This would enable doing some packed 4x1 byte character stuff.


Meaning, packing 4 ASCII characters into one kind=4 character and doing I/O with the kind=4 character. Let's just forget about this one. :)



(Note: It is my understanding, in reading about UTF-8, that it is internally represented by a fixed-width 4-byte hexadecimal number which, when transferred to an external file, is translated on the fly into a variable-width encoding. When reading back in, it is translated back to the 4-byte code.)

To clarify, Unicode is basically a mapping between integer numbers ("code points" in Unicode jargon) and characters. UTF-8 is a particular encoding for Unicode, encoding the code points into a variable number of 8-bit octets (with the nice property that the first 128 code points, 0 through 127, are encoded exactly as in ASCII). UTF-8 thus has no concept of internal, external, in-memory or on-disk formats.

The concept of internal and external is entirely mine for the sole purpose of explaining what we want to do. ;)

In practice, what we want to do is exactly what you describe. I.e. externally data is UTF-8 encoded, and is converted to a fixed width 4-byte representation in memory for easier handling.
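To make the external/internal distinction concrete, here is a small Python sketch that uses Python's built-in codecs as a stand-in for what the runtime library would do. The function name is hypothetical, and little-endian UTF-32 is just an assumed choice for the fixed-width in-memory form.

```python
def decode_external(data):
    """Decode UTF-8 bytes into a list of integer code points (hypothetical helper)."""
    return [ord(ch) for ch in data.decode("utf-8")]

text = "h\u00e9llo \u20ac"           # 7 characters, two of them non-ASCII
external = text.encode("utf-8")      # variable width: 1, 2 or 3 bytes per character here
internal = text.encode("utf-32-le")  # fixed width: 4 bytes per character, no BOM

code_points = decode_external(external)
# len(external) is 10, len(internal) is 28, len(code_points) is 7
```

The fixed-width form costs more memory, but indexing and substring operations no longer need to scan for character boundaries.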


--- snip ---

I think the usual thing is to use a "?" or something like that for a non-representable character, rather than a runtime error.



Looks like "?" is the consensus. Works for me.
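The "?" substitution could look roughly like the sketch below. The function name is hypothetical, and the cut-off of 0x7F (plain ASCII) is an assumption; the thread does not say which range the default encoding covers.

```python
def to_default(code_points, limit=0x7F):
    """Narrow code points to single bytes, using '?' for anything out of range.

    `limit` is an assumed cut-off (plain ASCII), not something the thread fixes.
    """
    return bytes(cp if 0 <= cp <= limit else ord("?") for cp in code_points)

narrowed = to_default([0x41, 0x20AC, 0x42])  # 'A', euro sign, 'B' -> b"A?B"
```

The substitution is lossy, so the original character cannot be recovered on a later read.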


For kind=4 and ENCODING="UTF-8", we do a complete conforming translation of the 4 byte hex to/from the variable width UTF-8 encoding.

Yeah; there are probably library functions available for doing a lot of these conversions back and forth. One problem here might be that wchar_t is not 32 bits on all platforms, so perhaps we can't rely on the libc wide char functions?


If there is something we can use, fine, but it tends to give us less control over what is happening. We will plan to look into it before we decide whether to roll our own or use someone else's.
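For a sense of what "rolling our own" would involve, here is a minimal sketch of the encoding direction in Python, one code point at a time. The function name is hypothetical. It follows the current UTF-8 definition (code points up to U+10FFFF, surrogates rejected); a real implementation would also need the decoding direction and a policy for malformed input.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 bytes (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:          # 1 byte: 0xxxxxxx, identical to ASCII
        return bytes([cp])
    if cp < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

encoded = b"".join(utf8_encode(cp) for cp in [0x41, 0xE9, 0x20AC])
```

The encoder itself is short; most of the effort in a hand-written version would go into validating input on the read side.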

Thanks everyone for comments.

Jerry

