Bug 86904 - Column numbers ignore tab characters
Summary: Column numbers ignore tab characters
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: other (show other bugs)
Version: unknown
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: diagnostic
Depends on:
Blocks:
 
Reported: 2018-08-09 20:08 UTC by David Malcolm
Modified: 2018-11-10 03:58 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description David Malcolm 2018-08-09 20:08:35 UTC
As noted in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=19165#c21 :

/* Both gcc and emacs number source *lines* starting at 1, but
   they have differing conventions for *columns*.

   GCC uses a 1-based convention for source columns,
   whereas Emacs's M-x column-number-mode uses a 0-based convention.

   For example, an error in the initial, left-hand
   column of source line 3 is reported by GCC as:

      some-file.c:3:1: error: ...etc...

   On navigating to the location of that error in Emacs
   (e.g. via "next-error"),
   the locus is reported in the Mode Line
   (assuming M-x column-number-mode) as:

     some-file.c   10%   (3, 0)

   i.e. "3:1:" in GCC corresponds to "(3, 0)" in Emacs.  */

In terms of 0 vs 1, GCC complies with the GNU standards here:
https://www.gnu.org/prep/standards/html_node/Errors.html

However our "column numbers" are also simply a 1-based byte-count, so a tab character is treated by us as simply an increment of 1 right now.

(see also PR 49973, which covers the case of multibyte characters).

It turns out that we convert tab characters to *single* space characters when printing source code.

This behavior has been present since Manu first implemented -fdiagnostics-show-caret in r186305 (aka 5a9830842f69ebb059061e26f8b0699cbd85121e, PR 24985), where it was this logic (there in diagnostic.c's diagnostic_show_locus):
      char c = *line == '\t' ? ' ' : *line;
      pp_character (context->printer, c);

(that logic is now in diagnostic-show-locus.c in layout::print_source_line)

This is arguably a bug, but it's intimately linked to the way in which we track "column numbers".

Our "column numbers" are currently simply a 1-based byte-count, I believe, so a tab character is treated by us as simply an increment of 1 right now.

There are similar issues with encodings that aren't 1 byte per character (e.g. non-ASCII unicode characters), which are being tracked in PR 49973.

Presumably, when we print source lines containing tab characters, we should emit a number of spaces to reach a tab stop.

Consider a diagnostic with a multiline range covering the
following source (and beyond):

      indent: 6 (tabs: 0, spaces: 6)
       indent: 7 (tabs: 0, spaces: 7)
        indent: 8 (tabs: 1, spaces: 0)
         indent: 9 (tabs: 1, spaces: 1)

i.e.:

  "      indent: 6 (tabs: 0, spaces: 6)\n"
  "       indent: 7 (tabs: 0, spaces: 7)\n"
  "\tindent: 8 (tabs: 1, spaces: 0)\n"
  "\t indent: 9 (tabs: 1, spaces: 1)\n"

Currently diagnostic_show_locus prints:

       indent: 6 (tabs: 0, spaces: 6)
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        indent: 7 (tabs: 0, spaces: 7)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  indent: 8 (tabs: 1, spaces: 0)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   indent: 9 (tabs: 1, spaces: 1)
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

i.e:
  "       indent: 6 (tabs: 0, spaces: 6)\n"
  "       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "        indent: 7 (tabs: 0, spaces: 7)\n"
  "        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "  indent: 8 (tabs: 1, spaces: 0)\n"
  "  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "   indent: 9 (tabs: 1, spaces: 1)\n"
  "   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"

which misrepresents the indentation of the user's code.

It should respect tabstops, and print:

       indent: 6 (tabs: 0, spaces: 6)
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        indent: 7 (tabs: 0, spaces: 7)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         indent: 8 (tabs: 1, spaces: 0)
         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          indent: 9 (tabs: 1, spaces: 1)
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

i.e.:

  "       indent: 6 (tabs: 0, spaces: 6)\n"
  "       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "        indent: 7 (tabs: 0, spaces: 7)\n"
  "        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "         indent: 8 (tabs: 1, spaces: 0)\n"
  "         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "          indent: 9 (tabs: 1, spaces: 1)\n"
  "          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"

We should also handle erroneous leading spaces before a tab, so that e.g.

  "  \tfoo"

should be printed as if it were:

 "\tfoo"

(given that that's what the user's editor is probably printing it as).

Similarly, we should presumably print "8" for the column number for the 'f' of "foo".  However, IDEs are expecting GCC's existing behavior, so we should probably add a command-line option for controlling this.

Adding a left margin with line numbers (as of r263450) doesn't change this bug, but makes fixing it slightly more complicated.

Maybe:
  -fdiagnostics-x-coord=bytes : count of bytes
  -fdiagnostics-x-coord=characters : count of characters (not special-casing tab)
  -fdiagnostics-x-coord=columns : count of columns: as per characters, but with tabs doing tabstops
(currently we use "bytes"  Not sure if we need "characters")

I'm thinking that internally, we should continue to track byte offsets, but make the option affect the presentation of the number in diagnostics*.c.

(should it affect -fdiagnostics-parseable-fixits ?
see also the Emacs RFE for parseable fixits:
  https://debbugs.gnu.org/cgi/bugreport.cgi?bug=25987 )
Comment 1 richard.earnshaw 2018-08-10 09:44:11 UTC
On 09/08/18 21:08, dmalcolm at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86904
> 
>             Bug ID: 86904
>            Summary: Column numbers ignore tab characters
>            Product: gcc
>            Version: unknown
>             Status: UNCONFIRMED
>           Keywords: diagnostic
>           Severity: normal
>           Priority: P3
>          Component: other
>           Assignee: unassigned at gcc dot gnu.org
>           Reporter: dmalcolm at gcc dot gnu.org
>   Target Milestone: ---
> 
> As noted in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=19165#c21 :
> 
> /* Both gcc and emacs number source *lines* starting at 1, but
>    they have differing conventions for *columns*.
> 
>    GCC uses a 1-based convention for source columns,
>    whereas Emacs's M-x column-number-mode uses a 0-based convention.
> 
>    For example, an error in the initial, left-hand
>    column of source line 3 is reported by GCC as:
> 
>       some-file.c:3:1: error: ...etc...
> 
>    On navigating to the location of that error in Emacs
>    (e.g. via "next-error"),
>    the locus is reported in the Mode Line
>    (assuming M-x column-number-mode) as:
> 
>      some-file.c   10%   (3, 0)
> 
>    i.e. "3:1:" in GCC corresponds to "(3, 0)" in Emacs.  */
> 
> In terms of 0 vs 1, GCC complies with the GNU standards here:
> https://www.gnu.org/prep/standards/html_node/Errors.html
> 
> However our "column numbers" are also simply a 1-based byte-count, so a tab
> character is treated by us as simply an increment of 1 right now.
> 
> (see also PR 49973, which covers the case of multibyte characters).
> 
> It turns out that we convert tab characters to *single* space characters when
> printing source code.
> 
> This behavior has been present since Manu first implemented
> -fdiagnostics-show-caret in r186305 (aka
> 5a9830842f69ebb059061e26f8b0699cbd85121e, PR 24985), where it was this logic
> (there in diagnostic.c's diagnostic_show_locus):
>       char c = *line == '\t' ? ' ' : *line;
>       pp_character (context->printer, c);
> 
> (that logic is now in diagnostic-show-locus.c in layout::print_source_line)
> 
> This is arguably a bug, but it's intimately linked to the way in which we track
> "column numbers".
> 
> Our "column numbers" are currently simply a 1-based byte-count, I believe, so a
> tab character is treated by us as simply an increment of 1 right now.
> 
> There are similar issues with encodings that aren't 1 byte per character (e.g.
> non-ASCII unicode characters), which are being tracked in PR 49973.
> 
> Presumably, when we print source lines containing tab characters, we should
> emit a number of spaces to reach a tab stop.
> 
> Consider a diagnostic with a multiline range covering the
> following source (and beyond):
> 
>       indent: 6 (tabs: 0, spaces: 6)
>        indent: 7 (tabs: 0, spaces: 7)
>         indent: 8 (tabs: 1, spaces: 0)
>          indent: 9 (tabs: 1, spaces: 1)
> 
> i.e.:
> 
>   "      indent: 6 (tabs: 0, spaces: 6)\n"
>   "       indent: 7 (tabs: 0, spaces: 7)\n"
>   "\tindent: 8 (tabs: 1, spaces: 0)\n"
>   "\t indent: 9 (tabs: 1, spaces: 1)\n"
> 
> Currently diagnostic_show_locus prints:
> 
>        indent: 6 (tabs: 0, spaces: 6)
>        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>         indent: 7 (tabs: 0, spaces: 7)
>         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>   indent: 8 (tabs: 1, spaces: 0)
>   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    indent: 9 (tabs: 1, spaces: 1)
>    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> i.e:
>   "       indent: 6 (tabs: 0, spaces: 6)\n"
>   "       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
>   "        indent: 7 (tabs: 0, spaces: 7)\n"
>   "        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
>   "  indent: 8 (tabs: 1, spaces: 0)\n"
>   "  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
>   "   indent: 9 (tabs: 1, spaces: 1)\n"
>   "   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
> 
> which misrepresents the indentation of the user's code.
> 
> It should respect tabstops, and print:
> 
>        indent: 6 (tabs: 0, spaces: 6)
>        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>         indent: 7 (tabs: 0, spaces: 7)
>         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>          indent: 8 (tabs: 1, spaces: 0)
>          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>           indent: 9 (tabs: 1, spaces: 1)
>           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> i.e.:
> 
>   "       indent: 6 (tabs: 0, spaces: 6)\n"
>   "       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
>   "        indent: 7 (tabs: 0, spaces: 7)\n"
>   "        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
>   "         indent: 8 (tabs: 1, spaces: 0)\n"
>   "         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
>   "          indent: 9 (tabs: 1, spaces: 1)\n"
>   "          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
> 
> We should also handle erroneous leading spaces before a tab, so that e.g.
> 
>   "  \tfoo"
> 
> should be printed as if it were:
> 
>  "\tfoo"
> 
> (given that that's what the user's editor is probably printing it as).
> 
> Similarly, we should presumably print "8" for the column number for the 'f' of
> "foo".  However, IDEs are expecting GCC's existing behavior, so we should
> probably add a command-line option for controlling this.
> 
> Adding a left margin with line numbers (as of r263450) doesn't change this bug,
> but makes fixing it slightly more complicated.
> 
> Maybe:
>   -fdiagnostics-x-coord=bytes : count of bytes
>   -fdiagnostics-x-coord=characters : count of characters (not special-casing
> tab)
>   -fdiagnostics-x-coord=columns : count of columns: as per characters, but with
> tabs doing tabstops

how about -fdiagnostics-x-coord=visual-[n]

Where n is the size of a hard tab?  Some folks change the tab stop to 4,
for example.  Or maybe ...coord=tab[-n], where -n defaults to "-8".

R.

> (currently we use "bytes"  Not sure if we need "characters")
> 
> I'm thinking that internally, we should continue to track byte offsets, but
> make the option affect the presentation of the number in diagnostics*.c.
> 
> (should it affect -fdiagnostics-parseable-fixits ?
> see also the Emacs RFE for parseable fixits:
>   https://debbugs.gnu.org/cgi/bugreport.cgi?bug=25987 )
>
Comment 2 David Malcolm 2018-08-10 10:41:56 UTC
(In reply to richard.earnshaw from comment #1)
> On 09/08/18 21:08, dmalcolm at gcc dot gnu.org wrote:
[...snip...]
> > Maybe:
> >   -fdiagnostics-x-coord=bytes : count of bytes
> >   -fdiagnostics-x-coord=characters : count of characters (not special-casing
> > tab)
> >   -fdiagnostics-x-coord=columns : count of columns: as per characters, but with
> > tabs doing tabstops
> 
> how about -fdiagnostics-x-coord=visual-[n]
> 
> Where n is the size of a hard tab?  Some folks change the tab stop to 4,
> for example.  Or maybe ...coord=tab[-n], where -n defaults to "-8".

The C family of frontends already has:
  -ftabstop=<number>	Distance between tab stops for column reporting.
which IIRC is currently only used by -Wmisleading-indentation.  I guess it could be moved to common.opt.
Comment 3 Eric Gallager 2018-11-10 03:58:22 UTC
(In reply to David Malcolm from comment #2)
> (In reply to richard.earnshaw from comment #1)
> > On 09/08/18 21:08, dmalcolm at gcc dot gnu.org wrote:
> [...snip...]
> > > Maybe:
> > >   -fdiagnostics-x-coord=bytes : count of bytes
> > >   -fdiagnostics-x-coord=characters : count of characters (not special-casing
> > > tab)
> > >   -fdiagnostics-x-coord=columns : count of columns: as per characters, but with
> > > tabs doing tabstops
> > 
> > how about -fdiagnostics-x-coord=visual-[n]
> > 
> > Where n is the size of a hard tab?  Some folks change the tab stop to 4,
> > for example.  Or maybe ...coord=tab[-n], where -n defaults to "-8".
> 
> The C family of frontends already has:
>   -ftabstop=<number>	Distance between tab stops for column reporting.
> which IIRC is currently only used by -Wmisleading-indentation.  I guess it
> could be moved to common.opt.

Yeah, using -ftabstop makes sense