[PATCH][RFC] diagnostics: Add support for Unicode drawing characters

Thu Jul 23 23:03:06 GMT 2020

On Thu, Jul 23, 2020 at 05:47:28PM -0400, David Malcolm wrote:
> On Thu, 2020-07-23 at 12:28 -0400, Lewis Hyatt via Gcc-patches wrote:
> > Hello-
> > 
> > The attached patch is complete including docs, but I tagged as RFC
> > because I am not sure if anyone will like it, or if the general
> > reaction may
> > be closer to recoiling in horror :). Would appreciate your thoughts,
> > please...
> 
> Thanks for working on this.  I'm interested in other people's thoughts
> on this.  Various comments inline throughout below.
>

Thanks for the feedback! I made a few replies below as well.

> Currently, if a UTF-8 locale is detected, GCC changes the quote
> > characters
> > it outputs in diagnostics to Unicode directional quotes. I feel like
> > this is
> > a nice touch, so I was wondering whether GCC shouldn't do more along
> > these
> > lines. This patch adds support for using Unicode line drawing
> > characters and
> > similar things when outputting diagnostics. There is a new option
> > -fdiagnostics-unicode-drawing=[auto|never|always] to control it,
> > which
> > defaults to auto. "auto" will enable the feature under the same
> > circumstances that Unicode quotes get output, namely when the locale
> > is
> > determined by gcc_init_libintl() to support UTF-8. (The new option
> > does not
> > affect Unicode quote characters, which currently are not configurable
> > and
> > are determined solely by the locale.)
> 
> FWIW when I first started experimenting with location ranges back in
> 2015 my first patches had box-drawing characters for underlines; you
> can see this in some of the early examples here (and similar URLs from
> around then):
>
> https://dmalcolm.fedorapeople.org/gcc/2015-08-18/plugin.html
>   (this also has a different approach for labeling ranges, which I
> called "captions", putting them in a right margin)
> 
> https://dmalcolm.fedorapeople.org/gcc/2015-08-19/diagnostic-test-string-literals-1.html
> 
> https://dmalcolm.fedorapeople.org/gcc/2015-08-26/tree-expression-ranges.html
> 
> etc; the patch kits were:
> 
> https://gcc.gnu.org/legacy-ml/gcc-patches/2015-03/msg00837.html
> https://gcc.gnu.org/pipermail/gcc-patches/2015-September/428036.html
> https://gcc.gnu.org/legacy-ml/gcc-patches/2015-09/msg01696.html
> 
> In:
>   https://gcc.gnu.org/legacy-ml/gcc-patches/2015-09/msg01700.html
> I wrote:
> > * Eliminated UTF-8/box-drawing and captions.  Captions were cute but
> >   weren't "fully baked".  Without them, box-drawing isn't really
> >   needed, and I think I prefer the ASCII look, with the actual
> >   "caret" character, and '~' makes it easier to count characters
> >   compared to a box-drawing line, in my terminal's font, at least.
> >   Doing so greatly simplifies the new locus-printing code.
> 
> So I dropped the UTF-8 box drawing from that original kit for:
> (a) simplicity (the original patch kit was huge in scope, covering a
> bunch of ideas for diagnostics - ranges, labeling, fix-it hints,
> spelling suggestions, so I wanted to reduce the scope to something
> manageable)
> (b) I found it easier to count characters with "~"
>

Oh interesting, sorry I didn't realize this had already been considered. Well
perhaps it's a good time to revisit it anyway, if people find it appealing.

> 
> The thing I'm most nervous about with this patch is the potential for
> introducing mojibake when people copy and paste GCC output.
> 
> For example, looking at:
> https://gcc.gnu.org/legacy-ml/gcc-patches/2015-03/msg00837.html
> I see mojibake where the unicode line-drawing characters in my email
> are being displayed in the HTML mailing list archive via "&#xE2;" -
> something has gone wrong with encoding somewhere between the copy&paste
> from my terminal, the email, and the list archive.
> 
> That said, looking at your email in the archive here:
> https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550551.html
> I don't see any mojibake.
> 
> What happens if GCC's stderr is piped into "less"?
> What happens if GCC's stderr is saved in a build.log file, uploaded
> somewhere, and then viewed?
> etc.
>

pipe to less should presumably be no problem, unless user goes out of their
way to use different locale settings there. Transporting the data is certainly
a potential area of difficulty, as is also interaction with IDEs and other
tools that want to parse the data, but such tools are at least already
handling the UTF-8 quote characters OK.

By the way, I was wondering separately what you think about adding an option
like -fplain-diagnostics or something, which would achieve basically the same
thing you get in the testsuite right now (-fno-diagnostics-show-caret
-fno-diagnostics-show-line-numbers -fdiagnostics-color=never
-fdiagnostics-urls=never) but would change as necessary whenever diagnostics
evolve. It seems rather involved currently to add a new option like
-fdiagnostics-unicode-drawing but keep the testsuite working, in addition to
adding to prune.exp and to the libstdc++.exp, you also need to update the
compat.exp so that it can figure out to pass the option only to sufficiently
new compilers. With -fplain-diagnostics, this could just be part of the code
change and the testsuite could stay the same; this may also make it easier on
IDE type utilities since they could rely on a more stable format for the
diagnostics, assuming they don't already use JSON format.

> 
> > The elements implemented are:
> > 
> >     * Vertical lines, e.g. those indicating labels and those
> > separating the
> >       source lines from the line numbers, are changed to line drawing
> >       characters.
> > 
> >     * The diagnostic paths output by the static analyzer make use of
> > line
> >       drawing characters to output smooth corners etc.
> > 
> >     * The squiggly underline ~~~~~ used to highlight source locations
> > is
> >       changed to a double underline ═════. The main reason for this
> > is that
> >       it enables a seamless "tee" character to connect the underline
> > to a
> >       label line if one exists.
> > 
> >     * Carets (^) are changed to a slightly different character (∧). I
> > think
> >       the new one is a little nicer looking, although probably not
> > worth the
> >       trouble on its own. I wanted to implement the support in this
> > patch
> >       beause carets are harder to change than the rest of the
> > elements
> >       (front ends have an interface to override them, which currently
> >       Fortran makes use of), so I thought it worthwhile to get this
> > logic in
> >       place, so that it can easily be changed to a more superior
> > character
> >       in the future if one comes up. It would also be easy enough to
> > leave
> >       the Unicode support in place for carets, but keep the default
> > set to
> >       the plain one for now.
> 
> Some other ideas:
> 
> * fix-it hints
> 
> * maybe have a different character for separating the line numbers as
> opposed to those for labels and for showing interprocedural paths.
>

Something like that would be easy to add, sure, perhaps a double vertical line
instead:

diagnostic-ranges.c:196:28: warning: field width specifier ‘*’ expects argument of type ‘int’, but argument 3 has type ‘long int’ [-Wformat=]
  196 ║   __builtin_sprintf (d, " %*ld ", foo + bar, foo);
      ║                           ═∧══    ════╤════
      ║                            │          │
      ║                            int        long int

> > As an example, this diagnostic from gcc.dg/format/diagnostic-
> > ranges.c:
> > 
> > diagnostic-ranges.c:196:28: warning: field width specifier ‘*’
> > expects argument of type ‘int’, but argument 3 has type ‘long int’ [-
> > Wformat=]
> >   196 |   __builtin_sprintf (d, " %*ld ", foo + bar, foo);
> >       |                           ~^~~    ~~~~~~~~~
> >       |                            |          |
> >       |                            int        long int
> > 
> > would become instead:
> > 
> > diagnostic-ranges.c:196:28: warning: field width specifier ‘*’
> > expects argument of type ‘int’, but argument 3 has type ‘long int’ [-
> > Wformat=]
> >   196 │   __builtin_sprintf (d, " %*ld ", foo + bar, foo);
> >       │                           ═∧══    ════╤════
> >       │                            │          │
> >       │                            int        long int
> > 
> > Hopefully you are viewing this in a terminal that displays it
> > properly :), in
> > which case, hopefully you may find it to be an improvement?
> 
> I wonder if you can upload colorized examples somewhere?
> 
> e.g. using bin/gcc-color-to-html.py from our website repository:
>  https://gcc.gnu.org/git/?p=gcc-wwwdocs.git;a=blob;f=bin/gcc-color-to-html.py
> or one of the various ansi2html conversion scripts e.g.
>   http://www.pixelbeat.org/scripts/ansi2html.sh
>

Here are some .PNG screenshots, which I hope is OK for this purpose?
https://drive.google.com/file/d/1MN-_eok_gwk_hl5C8DgvQGHPCpHAn1RH/view?usp=sharing
https://drive.google.com/file/d/1siR-vh1osCvT8VB9uMrVuaS-pznswdBs/view?usp=sharing
https://drive.google.com/file/d/1xgq0F-zhXpwOi3zJyU6EZEaQL079DJD5/view?usp=sharing

...

> > 
> > Although probably premature, bootstrap and regtest were done on x86-
> > 64
> > linux, all tests the same before/after and new tests passing:
> > FAIL 96 96
> > PASS 479090 479239
> > UNSUPPORTED 11946 11946
> > UNTESTED 194 194
> > XFAIL 1839 1839
> > XPASS 36 36
> 
> I see the patch kit touches Fortran; was this with all frontends
> enabled?  (though I guess I'm likewise being premature here)
>

Yes, this was all languages except for jit.

> > I tried to set this up as a general framework, at least, it is easy
> > in one
> > place to change the characters that are used for various contexts, so
> > that
> > if people like the general idea, but not some of the specifics, the
> > patch is
> > easily modified for that now or in the future. Thanks for any
> > feedback!
> 
> Thanks again for the patch; let's see what others think.
> Dave
>

Thanks for the feedback. If there is interest in something like this, I'm
happy to implement any suggestions.

-Lewis