This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Help : File does not end with newline


On Sun, Sep 30, 2001 at 08:43:29PM -0400, dewar@gnat.com wrote:
>
> << However, I can't remember seeing a Windows-based editor that did this, or
> a Windows-based textfile that had that 0x1A. >>
> 
> Well there are some, and another test is that in my experience *all* windows
> editors except those hailing from Unix sources, can always at least
> *recognize* the 1A. 
>
This is (mostly) not a problem of the editor implementation; it is more a
problem of the C standard library (fopen). The editor opens the file with
"r", then on saving writes the file with "w" to a tempfile and moves the
tempfile over the source. Other implementations are possible, but as long
as the editor uses the C library's FILE*-based I/O, the translation is done
by the C library.
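
A minimal sketch of that save cycle, just to show where the translation
happens. The function name resave_text_file and the tmp_path parameter are
made up for illustration; a real editor does much more in between.

```c
#include <assert.h>
#include <stdio.h>

/* Rewrite `path` through text-mode streams, roughly the way the editor
   described above saves a file. `tmp_path` is a hypothetical scratch
   file name. Returns 0 on success, -1 on failure. */
int resave_text_file(const char *path, const char *tmp_path)
{
    FILE *in, *out;
    int c, failed;

    assert(path != NULL && tmp_path != NULL);
    in = fopen(path, "r");        /* text mode: library translates input */
    if (in == NULL)
        return -1;
    out = fopen(tmp_path, "w");   /* text mode: library translates output */
    if (out == NULL) {
        fclose(in);
        return -1;
    }
    while ((c = fgetc(in)) != EOF)
        fputc(c, out);
    failed = ferror(in) || ferror(out);
    fclose(in);
    if (fclose(out) != 0)
        failed = 1;
    if (failed) {
        remove(tmp_path);
        return -1;
    }
    remove(path);   /* on some systems rename() fails if the target exists */
    return rename(tmp_path, path) == 0 ? 0 : -1;
}
```

With "rb"/"wb" instead of "r"/"w" the bytes would pass through untouched;
whatever ^Z or ^M handling the text modes perform is entirely up to the C
library in use.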

And there are multiple possible translation modes:

^Z read:
    - remove ^Z from the input stream
    - remove ^Z in the last file position from the input stream 
    - end the input stream at ^Z  (Turbo-C++ works this way)

^Z write:
    - write a ^Z to stream
    - do nothing
    - write a ^Z to stream, disable writing to this stream
    - disable writing to this stream (Turbo-C++ works this way)

fclose():
    - (read/write access) append a ^Z if the last character was a ^Z when
      the file was opened

^M read:
    - remove all ^M from the input stream
    - remove all ^M followed by ^J from the stream

^M write:
    - write a ^M before every ^J
    - write a ^M before ^J only if the character before the ^J is not
      already a ^M
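
The read-side modes above can be sketched as one in-place filter. This is
only an illustration under my own assumptions: the enum names and the
function text_read_filter are hypothetical, and it implements the "remove
^M followed by ^J" variant, not any particular library's behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy names, one per ^Z read mode listed above. */
enum ctrlz_mode {
    CTRLZ_REMOVE_ALL,   /* remove every ^Z from the input */
    CTRLZ_REMOVE_LAST,  /* remove a ^Z only in the last file position */
    CTRLZ_STOP          /* treat the first ^Z as end of input */
};

/* Filter buf[0..len) in place: drop each ^M that is followed by ^J and
   apply the chosen ^Z policy. Returns the new length. */
size_t text_read_filter(char *buf, size_t len, enum ctrlz_mode mode)
{
    size_t in, out = 0;

    assert(buf != NULL || len == 0);
    for (in = 0; in < len; in++) {
        char c = buf[in];

        if (c == 0x1A) {
            if (mode == CTRLZ_STOP)
                break;                       /* everything after ^Z is lost */
            if (mode == CTRLZ_REMOVE_ALL)
                continue;
            if (mode == CTRLZ_REMOVE_LAST && in == len - 1)
                continue;
        }
        if (c == '\r' && in + 1 < len && buf[in + 1] == '\n')
            continue;                        /* ^M^J becomes ^J */
        buf[out++] = c;
    }
    return out;
}
```

A ^M that is not followed by ^J is kept here; the "remove all ^M" mode
would drop it as well.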

Writing ^Z to text-mode streams died out in the mid-1980s; the last
program I know of that works this way is (C) 1986.

Nevertheless, ^Z read filtering was still done at least in Windows 95.

^M filtering is (C) Microsoft until the end of days.

Also note that some editors use low-level, non-translating I/O, for
instance the MS-VC++ IDE editor. If you work on Unix-style files with it,
all modified lines end in ^M^J while unmodified lines end in ^J.
Tabs are also confusing, because they have an arbitrary width, mostly 4 or 3.
But this is unimportant from a compiler's point of view.


> Yes, the 1A hailed from CPM, but was incorporated
> into DOS (which remember was derived from CPM), and this convention has
> continued in at least half baked form in the DOS/Windows environment
> ever since.
>
It was necessary up to and including MS-DOS 1.0, which had no handle-based
I/O, only FCB (file control block) based I/O.

Note that the PDP-8 also used ^Z as the text stream end character (the DEC
convention predates CP/M, which was of course influenced by it).


> << No, this is false. Standard Windows textfiles do not contain a terminating
> 0x1A. >>
> 
> I am not sure there is any defining standard that says what "Standard
> Windows textfiles" look like, but in practice all WIndows software
> recognizes either the hard EOF, or a 1A at the end of the file. 
>
My last tests (in 1996) showed that a ^Z at the end of a text file is kept
and hidden. So an old C source file keeps its ^Z at the end and you are not
able to remove it there. Because I don't like these ^Z characters, I removed
them with the Norton Commander.


> Quite a few software components will also recognize 1A in the middle of
> the file, though this is much less universal. 
>
The original behaviour is to stop reading at the ^Z.
MS-DOS works partly this way; in other cases only a single ^Z at the end is
removed.


> Indeed quite a lot of DOS
> based software is still widely used in the Windows environments (e.g. the
> SPITBOL/386 compiler :-)
> 
> Any software running on Windows that does NOT recognize a terminating 1A
> and ignore it seems hostile to me.
>
I will do some new tests with the current M$ development IDE.

 
> <<It is true, however, that various Windows libc variants tend to get upset
> at seeing a 0x1A on input, telling upper layers they saw EOF. These days,
> this should probably regarded as a bug.
> >>
> 
> I think a good compromise is to only allow 1A as the last character in the
> file, and then ignore this, reporting end of file. But 1A in any other
> position can be regarded as a valid character. Note that this is appropriate
> for text mode files only, not binary files, of course, but then text files
> are usually treated specially to deal with the CR/LF => NL translation
> for Unix purposes.
> 
> Certainly we found that making the GNAT compiler do this on source input
> files is useful in practice. We do this even in a Unix environment, since
> quite often these contaminated 1A files wander from DOS/Windows systems
> to Unix systems.
>
First, a very important remark: the ^Z is a small problem compared with the
following. A lot of 100% portable Windows sources can't be compiled with
typical Unix compilers. The problem is a short-sighted bug or flaw in the C
standard: multi-line macros are specified in a way that makes them
non-portable and that also maltreats programmers.

Say you have a multi-line macro in Windows:


#define MULTI_LINE_MACRO	"This is a " \
				"multi-line macro"

The standard says:

         2.  Each instance of a backslash character (\) immediately
             followed by a new-line character is deleted,  splicing
             physical  source  lines  to form logical source lines.
             If, as a result, a character sequence that matches the
             syntax  of a universal character name is produced, the
             behavior is undefined.  Only the last backslash on any
             physical  source line shall be eligible for being part
             of such a splice.  A source file  that  is  not  empty
             shall  end in a new-line character, which shall not be
             immediately preceded by a backslash  character  before
             any such splicing takes place.


but in Unix you see the following:

#define MULTI_LINE_MACRO	"This is a " \^M
				"multi-line macro"^M


which disables the deletion of \^J and results in a fatal error. From a
programmer's point of view, trailing whitespace also causes "invisible"
trouble from time to time:

#define MULTI_LINE_MACRO	"This is a " \ <-
				"multi-line macro"<-

"<-" is a mark which a lot of editors can insert to show trailing spaces and
tabs. Normally you don't see these ghosts. Both are problems which can be
eliminated by changing the standard to:


         2.  Each instance of a backslash character (\), followed by
             optional white spaces and then immediately followed by a
             new-line character is deleted, splicing physical source lines
             to form logical source lines.  If, as a result, a character
             sequence that matches the syntax of a universal character name
             is produced, the behavior is undefined.  Only the last
             backslash on any physical source line shall be eligible for
             being part of such a splice.  A source file that is not empty
             shall end in a new-line character, which shall not be
             immediately preceded by a backslash character before any such
             splicing takes place.

This is only a small change; a lot of compilers (AFAIK gcc too) already work
this way, and I have known about the problem for more than 13 years.
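
A minimal sketch of the relaxed splicing rule, to make the proposal
concrete. The function name splice_lines is made up, and I am assuming that
"white spaces" means space, tab and ^M here; this is not how any particular
preprocessor is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Splice lines in buf[0..len) in place, following the relaxed rule:
   a backslash, then optional blanks (space, tab, ^M), then a new-line
   are all deleted. Returns the new length. */
size_t splice_lines(char *buf, size_t len)
{
    size_t in = 0, out = 0;

    assert(buf != NULL || len == 0);
    while (in < len) {
        if (buf[in] == '\\') {
            size_t k = in + 1;

            /* skip the optional white space after the backslash */
            while (k < len &&
                   (buf[k] == ' ' || buf[k] == '\t' || buf[k] == '\r'))
                k++;
            if (k < len && buf[k] == '\n') {
                in = k + 1;   /* drop backslash, blanks and new-line */
                continue;
            }
        }
        buf[out++] = buf[in++];
    }
    return out;
}
```

With this rule both the \^M^J line from the Unix view and the line with
trailing blanks are spliced exactly like a plain \^J.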

So it is really strange that this was not changed in C99.

I haven't found any incompatibility, even when using the old way of string
concatenation (I mark spaces in strings as _):

char oldstring [] = "This_is_an_old_\_
string";

Note that this case does behave differently, but it is not allowed in
C89/K&R anyway (\_ is an illegal string escape sequence).


Back to the ^Z problem:

If you want to compile every source from CP/M, PDP-8, MS-DOS, Unix and
Windows, one way is the following:

  - remove up to 127 trailing ^Z from the end of a file
  - remove all ^M which are followed by ^J
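
The two steps above can be sketched as follows. The name normalize_source
and the constant MAX_TRAILING_CTRLZ are made up for illustration; the limit
of 127 presumably comes from CP/M padding the last 128-byte record with ^Z.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRAILING_CTRLZ 127   /* limit taken from the list above */

/* Normalize a source buffer in place and return its new length. */
size_t normalize_source(char *buf, size_t len)
{
    size_t in, out = 0, removed = 0;

    assert(buf != NULL || len == 0);

    /* Step 1: strip up to 127 trailing ^Z (0x1A) characters. */
    while (len > 0 && buf[len - 1] == 0x1A &&
           removed < MAX_TRAILING_CTRLZ) {
        len--;
        removed++;
    }

    /* Step 2: drop each ^M that is immediately followed by ^J. */
    for (in = 0; in < len; in++) {
        if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n')
            continue;
        buf[out++] = buf[in];
    }
    return out;
}
```

A ^Z in the middle of the file and a ^M without a following ^J are both
left alone, so they stay visible to the compiler as ordinary characters.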

-- 
Frank Klemm

