This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: Further diagnostic quoting cleanup patch


On Sun, 7 Nov 2004, Paul Schlie wrote:

> please consider that "pretty-printing" is most likely properly within
> the domain-of-responsibility of text formatting, and display programs,
> not core development tools.

pretty-print.[ch] are the part of GCC which I refer to as the 
pretty-printing code.  Likewise other such files such as 
c-pretty-print.[ch].  Diagnostic texts are for humans, in their native 
language and preferred character set as determined by the locale, though 
maybe one day GCC will add message codes which are more nearly fixed for 
machine parsing.  GCC is a text formatting program which produces 
diagnostics for users; a lot of care is taken, for example, to reformat 
GCC's data structures for expressions and types into something friendly to 
include in diagnostic messages.  GCC can also do line wrapping of long C++ 
diagnostics.

For a large proportion of users, their terminals are no longer 
artificially constrained to a 7-bit subset of the characters needed to 
display English properly.  It's well-established that terminals should 
handle ordinary single-width and double-width Unicode characters, at least 
for left-to-right languages and with only simple accents as combining 
characters; this is just a natural extension of the text terminal 
paradigm.  It is less clear where bidirectional processing should be done 
for right-to-left languages on text terminals, although the standard for 
text terminals, ECMA-48 (not generally implemented in full; instead 
subsets and extensions are used), does cover bidirectionality (as of the 
fifth edition, 1991).  For such alphabets as Arabic, Devanagari and Thai 
it seems clearer that simple terminals such as xterm won't handle them.
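
To illustrate what "single-width and double-width" means for something like 
line wrapping, here is a rough sketch in Python, not GCC code.  The function 
name display_width is made up for this example, and the rule it applies 
(wide and fullwidth East Asian characters take two columns, combining 
characters take none, everything else takes one) is a simplifying 
assumption rather than what any particular terminal does.

```python
import unicodedata


def display_width(text):
    """Estimate how many terminal columns `text` occupies.

    Simplifying assumption: East Asian wide ("W") and fullwidth ("F")
    characters take two columns, combining characters take none, and
    every other character takes one.
    """
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            # A combining accent is drawn over the preceding character.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            columns += 2
        else:
            columns += 1
    return columns
```

A line-wrapping routine would need to count columns in roughly this way 
rather than counting bytes or code points.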

The right approach for applications such as GCC would seem to be just to 
generate messages with Unicode characters in logical ordering, for all 
languages, with the user left to provide a terminal suitable for the 
locale specified (probably something more sophisticated than xterm for the 
trickier alphabets).  We do not *yet* have translations for such tricky 
languages to display, but hopefully it's just a matter of time until we 
do.

English should not be singled out for inferior diagnostics.  It is just 
one of the many native languages supported by GCC; others have had proper 
linguistically appropriate quotes for years.  Those places where '' are 
used as quotes because the text does not go through the pretty-printing 
infrastructure to interpret %< %q %> should eventually be fixed to use the 
same infrastructure, with other benefits; but for now we avoid the misuse 
of the grave accent ` as a left quote which in modern ASCII it is not.
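
As an illustration of the idea only, and not the actual pretty-print.c 
code, the following Python sketch expands %< and %> (and a %qs quoted 
string) into quotes chosen at output time.  The function name 
expand_quotes and its utf8 parameter are hypothetical; how the locale's 
character set is actually detected is left out, and the choice of U+2018 
and U+2019 for the UTF-8 case is an assumption made for the example.

```python
import re


def expand_quotes(template, args=(), utf8=True):
    """Expand %<, %>, %qs, %s and %% in a diagnostic template.

    `utf8` stands in for "the locale's character set can represent
    directional quotes"; detecting that is outside this sketch.
    """
    if utf8:
        open_q, close_q = "\u2018", "\u2019"  # directional single quotes
    else:
        open_q, close_q = "'", "'"  # plain apostrophe on both sides
    remaining = iter(args)

    def substitute(match):
        directive = match.group(0)
        if directive == "%<":
            return open_q
        if directive == "%>":
            return close_q
        if directive == "%qs":
            return open_q + str(next(remaining)) + close_q
        if directive == "%s":
            return str(next(remaining))
        return "%"  # the only other match is "%%"

    return re.sub(r"%(?:<|>|qs|s|%)", substitute, template)
```

With utf8=False, the template "expected %<;%> before %qs" and the 
argument "}" come out as: expected ';' before '}'.  The message text itself 
never contains a literal quote character, so the choice of quote can vary 
with the locale without touching the message.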

> Correspondingly, although I don't quite know what you mean by making
> corresponding changes within your new parser, I hope you don't mean
> that you've tweaked your parser to parse "pretty-quotes" as generic
> ones, as I don't suspect it's appropriate for GCC to assume the right
> to alter lexical definition of the language definitions it supports?

I changed the text of a diagnostic in c-parse.in.  So I made the 
corresponding change of the diagnostic in the new c-parser.c which needs 
to be kept in sync with the current parser.  The new parser accepts, by 
design, exactly the same language as the old one, with the same 
diagnostics on code that parses successfully.  By design, I have not 
changed the lexer at all.

Unicode quotes are not C language quotes, so they are not accepted in C.  
One day however we will accept letters and digits from non-ASCII alphabets 
in identifiers, as required by C++98 and C99.  We will then need to 
consider how to format those identifiers in diagnostics, likely providing 
a choice between displaying them in Unicode or the locale character set if 
different (friendly to users) and displaying them using \u escapes 
(reveals otherwise hidden information about whether "A" is a Latin, Greek 
or Cyrillic capital letter, which look identical but are distinct 
characters, and similar problems with combining characters).
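
To make the second option concrete, here is a small Python sketch, not 
anything GCC implements.  The helper names escape_identifier and 
describe_identifier are invented for this example.  The first rewrites 
non-ASCII characters in the \uXXXX / \UXXXXXXXX form of universal 
character names; the second lists the Unicode character names, which is 
the hidden information the escapes reveal.

```python
import unicodedata


def escape_identifier(name):
    """Rewrite non-ASCII characters in `name` as \\u or \\U escapes."""
    pieces = []
    for ch in name:
        code_point = ord(ch)
        if code_point < 0x80:
            pieces.append(ch)  # ASCII is shown unchanged
        elif code_point <= 0xFFFF:
            pieces.append("\\u%04X" % code_point)
        else:
            pieces.append("\\U%08X" % code_point)
    return "".join(pieces)


def describe_identifier(name):
    """Return the Unicode character name of each character in `name`."""
    return [unicodedata.name(ch, "<unnamed>") for ch in name]
```

For the three look-alike capitals, escape_identifier leaves the Latin 
letter as A and gives \u0391 for the Greek one and \u0410 for the Cyrillic 
one, while describe_identifier reports LATIN CAPITAL LETTER A, GREEK 
CAPITAL LETTER ALPHA and CYRILLIC CAPITAL LETTER A respectively.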

-- 
Joseph S. Myers               http://www.srcf.ucam.org/~jsm28/gcc/
    jsm@polyomino.org.uk (personal mail)
    joseph@codesourcery.com (CodeSourcery mail)
    jsm28@gcc.gnu.org (Bugzilla assignments and CCs)

