[PATCH] diagnostics: Support for -finput-charset [PR93067]

Fri Jan 29 15:46:30 GMT 2021

On Tue, Jan 26, 2021 at 04:02:52PM -0500, David Malcolm wrote:
> On Fri, 2020-12-18 at 18:03 -0500, Lewis Hyatt wrote:
> > Hello-
> > 
> > The attached patch addresses PR93067:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067#c0
> > 
> > This is similar to the patch I posted last year on the PR, with some
> tweaks
> > to make it a little simpler. Recapping some of the commentary on the
> PR:
> > 
> > When source lines are needed for diagnostics output, they are
> retrieved from
> > the source file by the fcache infrastructure in input.c, since libcpp
> has
> > generally already forgotten them (plus not all front ends are using
> > libcpp). This infrastructure does not read the files in the same way
> as
> > libcpp does; in particular, it does not translate the encoding as
> requested
> > by -finput-charset, and it does not strip a UTF-8 byte-order mark if
> > present. The patch adds this ability. My thinking in deciding how to
> do it
> > was the following:
> > 
> > - Use of -finput-charset is rare, and use of UTF-8 BOMs must be rarer
> still,
> >   so this patch should try hard not to introduce any worse
> performance
> >   unless these things are needed.
> > 
> > - It is desirable to reuse libcpp's encoding infrastructure from
> charset.c
> >   rather than repeat it in input.c. (Notably, libcpp uses iconv but
> it also
> >   has hand-coded routines for certain charsets to make sure they are
> >   available.)
> > 
> > - There is a performance degradation required in order to make use of
> libcpp
> >   directly, because the input.c infrastructure only reads as much of
> the
> >   source file as necessary, whereas libcpp interfaces as-is require
> to read
> >   the entire file into memory.
> > 
> > - It can't be quite as simple as just "only delegate to libcpp if
> >   -finput-charset was specified", because the stripping of the UTF-8
> BOM has
> >   to happen with or without this option.
> > 
> > - So it seemed a reasonable compromise to me, if -finput-charset is
> >   specified, then use libcpp to convert the file, otherwise, strip
> the BOM
> >   in input.c and then process the file the same way it is done now.
> There's
> >   a little bit of leakage of charset logic from libcpp this way (for
> the
> >   BOM), but it seems worthwhile, since otherwise, diagnostics would
> always
> >   be reading the entire file into memory, which is not a cost paid
> >   currently.
> 
> Thanks for the patch; sorry about the delay in reviewing it.
>

Thanks for the comments! Here is an updated patch that addresses your
feedback, plus some responses inline below.

Bootstrap + regtest all languages was done on x86-64 GNU/Linux. All tests
the same before and after, plus 6 new PASS.

FAIL 85 85
PASS 479191 479197
UNSUPPORTED 13664 13664
UNTESTED 129 129
XFAIL 2292 2292
XPASS 30 30


> This mostly seems good to me.
> 
> One aspect I'm not quite convinced about is the
> input_cpp_context.in_use flag.  The input.c machinery is used by
> diagnostics, and so could be used by middle-end warnings for frontends
> that don't use libcpp.  Presumably we'd still want to remove the UTF-8
> BOM for those, and do encoding fixups if necessary - is it more a case
> of initializing things to express what the expected input charset is?
> (since that is part of the cpp_options)
> 
> c.opt has:
>   finput-charset=
>   C ObjC C++ ObjC++ Joined RejectNegative
>   -finput-charset=<cset>	Specify the default character set for
> source files.
> 
> I believe that D and Go are the two frontends that don't use libcpp for
> parsing.  I believe Go source is required to be UTF-8 (unsurprisingly
> considering the heritage of both).  I don't know what source encodings
> D supports.
>

For this patch I was rather singularly focused on libcpp, so I looked
deeper at the other frontends now. It seems to me that there are basically
two questions to answer, and the three frontend styles answer this pair in
three different ways.

Q1: What is the input charset?
A1:

    libcpp: Whatever was passed to -finput-charset (note, for Fortran,
    -finput-charset is not supported though.)

    go: Assume UTF-8.

    D: UTF-8, UTF-16, or UTF-32 (the latter two in either
       endianness); determined by inspecting the first bytes of the file.

Q2: How should a UTF-8 BOM be handled?
A2:

    libcpp: Treat entirely the same, as if it was not present at all. So
    a diagnostic referring to the first non-BOM character in the file will
    point to column 1, not to column 4.

    go: Treat it like whitespace, ignored for parsing purposes, but still
    logically part of the file. A diagnostic referring to the first non-BOM
    character in the file will point to column 4.

    D: Same as libcpp.

So input.c's current behavior (with or without my patch) actually matches
the "go" frontend exactly and does the right thing for it. As you
thought, though, for D it would be desirable to skip over the UTF-8 BOM
too.

D adds an additional wrinkle in that, instead of using -finput-charset, it
uses its own detection logic -- based on knowledge that the first codepoint
in the file must be ASCII, it is able to deduce the encoding from the first
few bytes. This means that in D, each source file can have a different
encoding, which is not expressible in libcpp frontends currently, and hence
the notion of a global input_cpp_context with a constant charset is not
sufficient to capture this.

In this iteration of the patch, I replaced the input_cpp_context with a more
general input_context that can handle all of these cases. The context now
consists of a callback function and a bool, which the frontend is supposed
to configure. The bool determines whether or not to skip the BOM. The
callback, when passed a filename, returns the input charset needed to
convert that file. For libcpp, the filename is not used as the charset is
determined from -finput-charset for all files. For D, the callback function
is currently a stub, but it could be expanded to open the file, read the
first few bytes, and determine the charset to be used. I have not
implemented it for now, because it seems inelegant to duplicate the logic D
already has for this detection, but this logic is part of the dmd/ tree
which I think is maintained external to GCC, and so it didn't seem right to
attempt to refactor it. I figured there may not be much interest in this
feature (diagnostics are already unusable for UTF-16 and UTF-32 sources in
D), and this patch makes no change to it. This patch does fix the handling
of a UTF-8 BOM in D diagnostics, however.

> > Separate from the main patch are two testcases that both fail before
> this
> > patch and pass after. I attached them gzipped because they use non-
> standard
> > encodings that won't email well.
> > 
> > The main question I have about the patch is whether I chose a good
> way to
> > address the following complication. location_get_source_line() in
> input.c is
> > used to generate diagnostics for all front ends, whether they use
> libcpp to
> > read the files or not. So input.c needs some way to know whether
> libcpp is
> > in use or not. I ended up adding a new function
> input_initialize_cpp_context(),
> > which front ends have to call if they are using libcpp to read their
> > files. Currently that is c-family and fortran. I don't see a simpler
> way
> > to do it at least... Even if there were a better way for input.c to
> find out
> > the value passed to -finput-charset, it would still need to know
> whether
> > libcpp is being used or not.
> 
> At some point I want to convert the global state in input.c into a
> source_manager class, probably hung off the diagnostic_context, so the
> natural place to put that state would be in there (rather than the
> fcache being global state).  Perhaps have a source_reader policy class
> hung off of the source_manager, which would have responsibility for
> reading and converting the file, either piecemeal, or fully for the
> "need to use libcpp" case.
> 
> But that's not necessary for this fix.
> 
> > Please let me know if it looks OK (either now or for stage 1,
> whatever makes
> > sense...) bootstrap + regtest all languages on x86-64 GNU/Linux, all
> tests the
> > same before & after plus 6 new PASS from the new tests. Thanks!
> 
> Various comments inline below.  In particular, does the patch add a
> memory leak in _cpp_convert_input?
> 
> Thanks
> Dave
> 
> [...]
> 
> > libcpp/ChangeLog:
> > 
> >         PR other/93067
> >         * charset.c (init_iconv_desc): Adapt to permit PFILE argument
> to
> >         be NULL.
> >         (_cpp_convert_input): Likewise. Also move UTF-8 BOM logic
> to...
> >         (cpp_check_utf8_bom): ...here.  New function.
> >         (cpp_input_conversion_is_trivial): New function.
> >         * files.c (read_file_guts): Allow PFILE argument to be NULL.
> Add
> >         INPUT_CHARSET argument as an alternate source of this
> information.
> >         (cpp_get_converted_source): New function.
> >         * include/cpplib.h (struct cpp_converted_source): Declare.
> >         (cpp_get_converted_source): Declare.
> >         (cpp_input_conversion_is_trivial): Declare.
> >         (cpp_check_utf8_bom): Declare.
> > 
> > gcc/testsuite/ChangeLog:
> > 
> >         PR other/93067
> >         * gcc.dg/diagnostic-input-charset-1.c: New test.
> >         * gcc.dg/diagnostic-input-charset-2.c: New test.
> 
> Maybe rename the 2nd test to "diagnostic-input-utf8-bom.c" ?
>

Done.

> > 
> > diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
> > index 59cabd12407..d5aa7859cc1 100644
> > --- a/gcc/c-family/c-opts.c
> > +++ b/gcc/c-family/c-opts.c
> > @@ -1124,6 +1124,10 @@ c_common_post_options (const char **pfilename)
> >    cpp_post_options (parse_in);
> >    init_global_opts_from_cpp (&global_options, cpp_get_options
> (parse_in));
> >  
> > +  /* Let diagnostics infrastructure know we are using libcpp to read
> > +     the input.  */
> > +  input_initialize_cpp_context (parse_in);
> 
> As noted above I think the most significant thing here is getting the
> source encoding from parse_in, so maybe reword the comment accordingly.
> 
> [...]
> 
> > diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
> > index 51baf141711..2b12a98afc0 100644
> > --- a/gcc/fortran/cpp.c
> > +++ b/gcc/fortran/cpp.c
> > @@ -493,6 +493,10 @@ gfc_cpp_post_options (void)
> >  
> >    cpp_post_options (cpp_in);
> >  
> > +  /* Let diagnostics infrastructure know we are using libcpp to read
> > +     the input.  */
> > +  input_initialize_cpp_context (cpp_in);
> 
> Likewise.
>

I kept this in mind when redoing this part, hopefully it is better now.

> [...]
> 
> > diff --git a/gcc/input.c b/gcc/input.c
> > index 29d10f06b86..1dcdd464bc1 100644
> > --- a/gcc/input.c
> > +++ b/gcc/input.c
> 
> [...]
> 
> > @@ -394,6 +432,42 @@ add_file_to_cache_tab (const char *file_path)
> >    r->total_lines = total_lines_num (file_path);
> >    r->missing_trailing_newline = true;
> >  
> > +  /* If libcpp is managing the reading, then there are two cases we
> need to
> > +     consider.  If -finput-charset is not in effect, then we just
> need to
> > +     strip a UTF-8 BOM, so do that ourselves rather than calling
> into libcpp so
> > +     as to avoid paying the penalty of using libcpp, namely that the
> entire file
> > +     must be read at once.  In the (generally rare) case that a non-
> trivial
> > +     -finput-charset is needed, then go ahead and use libcpp to read
> the whole
> > +     file and do the conversion.  */
> > +  if (input_cpp_context.in_use)
> > +    {
> > +      if (input_cpp_context.conversion_is_trivial)
> > +       {
> > +         /* Strip the UTF8 BOM if present.  */
> > +         if (read_data (r))
> > +           {
> > +             const int offset = cpp_check_utf8_bom (r->data, r-
> >nb_read);
> > +             r->offset_buffer (offset);
> > +             r->nb_read -= offset;
> > +           }
> 
> This assumes that trivial conversions are always UTF8 to UTF8, which
> assumes that SOURCE_CHARSET is UTF8, which isn't the case for EBCDIC...
> 
> [...]
> 
> > +/* Utility to strip a UTF-8 byte order marking from the beginning
> > +   of a buffer.  Returns the number of bytes to skip, which
> currently
> > +   will be either 0 or 3.  */
> > +int
> > +cpp_check_utf8_bom (const char *data, size_t data_length)
> > +{
> > +
> > +#if HOST_CHARSET == HOST_CHARSET_ASCII
> > +  const unsigned char *udata = (const unsigned char *) data;
> > +  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
> > +      && udata[2] == 0xbf)
> > +    return 3;
> > +#endif
> > +
> > +  return 0;
> > +}
> 
> ...though the check on HOST_CHARSET == HOST_CHARSET_ASCII makes this a
> no-op for the EBCDIC case, so that's OK.
> 
> [...]
> 
> > @@ -2158,17 +2191,28 @@ _cpp_convert_input (cpp_reader *pfile, const
> char *input_charset,
> >        to.text = XNEWVEC (uchar, to.asize);
> >        to.len = 0;
> >  
> > -      if (!APPLY_CONVERSION (input_cset, input, len, &to))
> > -       cpp_error (pfile, CPP_DL_ERROR,
> > -                  "failure to convert %s to %s",
> > -                  CPP_OPTION (pfile, input_charset),
> SOURCE_CHARSET);
> > +      const bool ok = APPLY_CONVERSION (input_cset, input, len,
> &to);
> >  
> > -      free (input);
> > -    }
> > +      /* Handle conversion failure.  */
> > +      if (!ok)
> > +       {
> > +         free (input);
> > +         if (!pfile)
> > +           {
> > +             XDELETEVEC (to.text);
> > +             *buffer_start = NULL;
> > +             *st_size = 0;
> > +             return NULL;
> > +           }
> > +         cpp_error (pfile, CPP_DL_ERROR,
> > +                    "failure to convert %s to %s",
> > +                    CPP_OPTION (pfile, input_charset),
> SOURCE_CHARSET);
> > +       }
> > +    }
> >
> 
> Doesn't this lose the
>   free (input);
> for the case where the conversion is successful?  Is this a leak?
>

Ugh, sorry to waste your time with that mistake. Fixed.

> [...]
> 
> > diff --git a/libcpp/files.c b/libcpp/files.c
> > index 301b2379a23..178bb9ed1e6 100644
> > --- a/libcpp/files.c
> > +++ b/libcpp/files.c
> > @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile,
> _cpp_file *file,
> >  static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
> >                               bool *invalid_pch, location_t loc);
> >  static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
> > -                           location_t loc);
> > +                           location_t loc, const char *input_charset
> = NULL);
> 
> read_file_guts is only used in one place before the patch, so rather
> than add a default argument, I think it's simpler to pass in the
> input_charset at that call.
> 
> > @@ -671,18 +671,32 @@ _cpp_find_file (cpp_reader *pfile, const char
> *fname, cpp_dir *start_dir,
> >  
> >     Use LOC for any diagnostics.
> >  
> > +   The input charset may be specified in the INPUT_CHARSET argument,
> or
> > +   else it will be taken from PFILE.
> 
> ...and doing so means that input_charset will be non-NULL at all
> callers.
> 
> > +   PFILE may be NULL.  In this case, no diagnostics are issued, and
> the
> > +   input charset must be specified in the arguments.
> > +
> >     FIXME: Flush file cache and try again if we run out of memory. 
> */
> >  static bool
> > -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
> > +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
> > +               const char *input_charset)
> >  {
> >    ssize_t size, total, count;
> >    uchar *buf;
> >    bool regular;
> >  
> > +  if (!input_charset)
> > +    {
> > +      gcc_assert (pfile);
> > +      input_charset = CPP_OPTION (pfile, input_charset);
> > +    }
> 
> ...which would allow for this logic to be removed, moving this to the
> existing caller, I believe.
>

Much better your way, done.

Thanks for taking a look at it, sorry this version is a bit changed from the
previous one. FWIW, the libcpp pieces were not touched other than addressing
your comments; the new logic is in input.h/input.c plus the associated
tweaks to the frontends.

-Lewis
-------------- next part --------------
From: Lewis Hyatt <lhyatt@gmail.com>
Date: Wed, 27 Jan 2021 18:23:13 -0500
Subject: [PATCH] diagnostics: Support for -finput-charset [PR93067]

Adds the logic to handle -finput-charset in layout_get_source_line(), so that
source lines are converted from their input encodings prior to being output by
diagnostics machinery. Also adds the ability to strip a UTF-8 BOM similarly.

gcc/c-family/ChangeLog:

	PR other/93067
	* c-opts.c (c_common_input_charset_cb): New function.
	(c_common_post_options): Call new function input_initialize_context().

gcc/d/ChangeLog:

	PR other/93067
	* d-lang.cc (d_input_charset_callback): New function.
	(d_init): Call new function input_initialize_context().

gcc/fortran/ChangeLog:

	PR other/93067
	* cpp.c (gfc_cpp_post_options): Call new function
	input_initialize_cpp_context().

gcc/ChangeLog:

	PR other/93067
	* input.c (default_charset_callback): New function.
	(input_initialize_context): New function.
	(read_data): Add prototype.
	(add_file_to_cache_tab): Use libcpp to convert input encoding when
	needed. Strip UTF-8 BOM when needed.
	(class fcache): Add new members to track input encoding conversion.
	(fcache::fcache): Adapt for new members.
	(fcache::~fcache): Likewise.
	(maybe_grow): Likewise.
	(needs_read): Adapt to be aware that fp member may be NULL now.
	(get_next_line): Likewise.
	* input.h (input_initialize_context): Declare.

libcpp/ChangeLog:

	PR other/93067
	* charset.c (init_iconv_desc): Adapt to permit PFILE argument to
	be NULL.
	(_cpp_convert_input): Likewise. Also move UTF-8 BOM logic to...
	(cpp_check_utf8_bom): ...here.  New function.
	(cpp_input_conversion_is_trivial): New function.
	* files.c (read_file_guts): Allow PFILE argument to be NULL.  Add
	INPUT_CHARSET argument as an alternate source of this information.
	(read_file): Pass the new argument to read_file_guts.
	(cpp_get_converted_source): New function.
	* include/cpplib.h (struct cpp_converted_source): Declare.
	(cpp_get_converted_source): Declare.
	(cpp_input_conversion_is_trivial): Declare.
	(cpp_check_utf8_bom): Declare.

gcc/testsuite/ChangeLog:

	PR other/93067
	* gcc.dg/diagnostic-input-charset-1.c: New test.
	* gcc.dg/diagnostic-input-utf8-bom.c: New test.

diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
index bd15b9cd902..2f58e8413b9 100644
--- a/gcc/c-family/c-opts.c
+++ b/gcc/c-family/c-opts.c
@@ -188,6 +188,14 @@ c_common_diagnostics_set_defaults (diagnostic_context *context)
   context->opt_permissive = OPT_fpermissive;
 }
 
+/* Input charset configuration for diagnostics.  */
+static const char *
+c_common_input_charset_cb (const char * /*filename*/)
+{
+  const char *cs = cpp_opts->input_charset;
+  return cpp_input_conversion_is_trivial (cs) ? NULL : cs;
+}
+
 /* Whether options from all C-family languages should be accepted
    quietly.  */
 static bool accept_all_c_family_options = false;
@@ -1131,6 +1139,10 @@ c_common_post_options (const char **pfilename)
   cpp_post_options (parse_in);
   init_global_opts_from_cpp (&global_options, cpp_get_options (parse_in));
 
+  /* Let diagnostics infrastructure know how to convert input files the same
+     way libcpp will do it, namely using the configured input charset and
+     skipping a UTF-8 BOM if present.  */
+  input_initialize_context (c_common_input_charset_cb, true);
   input_location = UNKNOWN_LOCATION;
 
   *pfilename = this_input_filename
diff --git a/gcc/d/d-lang.cc b/gcc/d/d-lang.cc
index 0fd207da7f3..9f170c09886 100644
--- a/gcc/d/d-lang.cc
+++ b/gcc/d/d-lang.cc
@@ -50,6 +50,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "output.h"
 #include "print-tree.h"
 #include "debug.h"
+#include "input.h"
 
 #include "d-tree.h"
 #include "id.h"
@@ -362,6 +363,19 @@ d_option_lang_mask (void)
   return CL_D;
 }
 
+/* Implements input charset and BOM skipping configuration for
+   diagnostics.  */
+static const char *d_input_charset_callback (const char * /*filename*/)
+{
+  /* TODO: The input charset is automatically determined by code in
+     dmd/dmodule.c based on the contents of the file.  If this detection
+     logic were factored out and could be reused here, then we would be able
+     to return UTF-16 or UTF-32 as needed here.  For now, we return always
+     NULL, which means no conversion is necessary, i.e. the input is assumed
+     to be UTF-8 when diagnostics read this file.  */
+  return NULL;
+}
+
 /* Implements the lang_hooks.init routine for language D.  */
 
 static bool
@@ -373,6 +387,10 @@ d_init (void)
   Expression::_init ();
   Objc::_init ();
 
+  /* Diagnostics input init, to enable BOM skipping and
+     input charset conversion.  */
+  input_initialize_context (d_input_charset_callback, true);
+
   /* Back-end init.  */
   global_binding_level = ggc_cleared_alloc <binding_level> ();
   current_binding_level = global_binding_level;
diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
index 419cd6accbe..8b564888673 100644
--- a/gcc/fortran/cpp.c
+++ b/gcc/fortran/cpp.c
@@ -493,6 +493,12 @@ gfc_cpp_post_options (void)
 
   cpp_post_options (cpp_in);
 
+
+  /* Let diagnostics infrastructure know how to convert input files the same
+     way libcpp will do it, namely, with no charset conversion but with
+     skipping of a UTF-8 BOM if present.  */
+  input_initialize_context (NULL, true);
+
   gfc_cpp_register_include_paths ();
 }
 
diff --git a/gcc/input.c b/gcc/input.c
index 9e39e7df83c..f8f6dc7e4ca 100644
--- a/gcc/input.c
+++ b/gcc/input.c
@@ -30,6 +30,29 @@ along with GCC; see the file COPYING3.  If not see
 #define HAVE_ICONV 0
 #endif
 
+/* Input charset configuration.  */
+static const char *default_charset_callback (const char *)
+{
+  return NULL;
+}
+
+struct
+{
+  input_context_charset_callback ccb;
+  bool should_skip_bom;
+} static input_context =
+{
+  default_charset_callback,
+  false
+};
+
+void input_initialize_context (input_context_charset_callback ccb,
+			       bool should_skip_bom)
+{
+  input_context.ccb = (ccb ? ccb : default_charset_callback);
+  input_context.should_skip_bom = should_skip_bom;
+}
+
 /* This is a cache used by get_next_line to store the content of a
    file to be searched for file lines.  */
 class fcache
@@ -78,6 +101,10 @@ public:
      far.  */
   char *data;
 
+  /* The allocated buffer to be freed may start a little earlier than DATA,
+     e.g. if a UTF8 BOM was skipped at the beginning.  */
+  int alloc_offset;
+
   /*  The size of the DATA array above.*/
   size_t size;
 
@@ -118,6 +145,17 @@ public:
 
   fcache ();
   ~fcache ();
+
+  void offset_buffer (int offset)
+  {
+    gcc_assert (offset < 0 ? alloc_offset + offset >= 0
+		: (size_t) offset <= size);
+    gcc_assert (data);
+    alloc_offset += offset;
+    data += offset;
+    size -= offset;
+  }
+
 };
 
 /* Current position in real source file.  */
@@ -364,6 +402,9 @@ evicted_cache_tab_entry (unsigned *highest_use_count)
   return to_evict;
 }
 
+static bool
+read_data (fcache *c);
+
 /* Create the cache used for the content of a given file to be
    accessed by caret diagnostic.  This cache is added to an array of
    cache and can be retrieved by lookup_file_in_cache_tab.  This
@@ -384,6 +425,8 @@ add_file_to_cache_tab (const char *file_path)
   if (r->fp)
     fclose (r->fp);
   r->fp = fp;
+  if (r->alloc_offset)
+    r->offset_buffer (-r->alloc_offset);
   r->nb_read = 0;
   r->line_start_idx = 0;
   r->line_num = 0;
@@ -394,6 +437,33 @@ add_file_to_cache_tab (const char *file_path)
   r->total_lines = total_lines_num (file_path);
   r->missing_trailing_newline = true;
 
+  /* Check the frontend configuration to determine if we need to do any
+     transformations, such as charset conversion or BOM skipping.  */
+  if (const char *input_charset = input_context.ccb (file_path))
+    {
+      /* Need a full-blown conversion of the input charset.  */
+      fclose (r->fp);
+      r->fp = NULL;
+      const cpp_converted_source cs
+	= cpp_get_converted_source (file_path, input_charset);
+      if (!cs.data)
+	return NULL;
+      if (r->data)
+	XDELETEVEC (r->data);
+      r->data = cs.data;
+      r->nb_read = r->size = cs.len;
+      r->alloc_offset = cs.data - cs.to_free;
+    }
+  else if (input_context.should_skip_bom)
+    {
+      if (read_data (r))
+	{
+	  const int offset = cpp_check_utf8_bom (r->data, r->nb_read);
+	  r->offset_buffer (offset);
+	  r->nb_read -= offset;
+	}
+    }
+
   return r;
 }
 
@@ -415,7 +485,7 @@ lookup_or_add_file_to_cache_tab (const char *file_path)
    diagnostic.  */
 
 fcache::fcache ()
-: use_count (0), file_path (NULL), fp (NULL), data (0),
+: use_count (0), file_path (NULL), fp (NULL), data (0), alloc_offset (0),
   size (0), nb_read (0), line_start_idx (0), line_num (0),
   total_lines (0), missing_trailing_newline (true)
 {
@@ -433,6 +503,7 @@ fcache::~fcache ()
     }
   if (data)
     {
+      offset_buffer (-alloc_offset);
       XDELETEVEC (data);
       data = 0;
     }
@@ -447,9 +518,9 @@ fcache::~fcache ()
 static bool
 needs_read (fcache *c)
 {
-  return (c->nb_read == 0
-	  || c->nb_read == c->size
-	  || (c->line_start_idx >= c->nb_read - 1));
+  return c->fp && (c->nb_read == 0
+		   || c->nb_read == c->size
+		   || (c->line_start_idx >= c->nb_read - 1));
 }
 
 /*  Return TRUE iff the cache is full and thus needs to be
@@ -469,9 +540,20 @@ maybe_grow (fcache *c)
   if (!needs_grow (c))
     return;
 
-  size_t size = c->size == 0 ? fcache_buffer_size : c->size * 2;
-  c->data = XRESIZEVEC (char, c->data, size);
-  c->size = size;
+  if (!c->data)
+    {
+      gcc_assert (c->size == 0 && c->alloc_offset == 0);
+      c->size = fcache_buffer_size;
+      c->data = XNEWVEC (char, c->size);
+    }
+  else
+    {
+      const int offset = c->alloc_offset;
+      c->offset_buffer (-offset);
+      c->size *= 2;
+      c->data = XRESIZEVEC (char, c->data, c->size);
+      c->offset_buffer (offset);
+    }
 }
 
 /*  Read more data into the cache.  Extends the cache if need be.
@@ -570,7 +652,7 @@ get_next_line (fcache *c, char **line, ssize_t *line_len)
       c->missing_trailing_newline = false;
     }
 
-  if (ferror (c->fp))
+  if (c->fp && ferror (c->fp))
     return false;
 
   /* At this point, we've found the end of the of line.  It either
diff --git a/gcc/input.h b/gcc/input.h
index f1ef5d76cfd..8a793399958 100644
--- a/gcc/input.h
+++ b/gcc/input.h
@@ -214,4 +214,21 @@ class GTY(()) string_concat_db
   hash_map <location_hash, string_concat *> *m_table;
 };
 
+/* Because we read source files a second time after the frontend did it the
+   first time, we need to know how the frontend handled things like character
+   set conversion and UTF-8 BOM stripping, in order to make everything
+   consistent.  This function needs to be called by each frontend that requires
+   non-default behavior, to inform the diagnostics infrastructure how input is
+   to be processed.  The default behavior is to do no conversion and not to
+   strip a UTF-8 BOM.
+
+   The callback should return the input charset to be used to convert the given
+   file's contents to UTF-8, or it should return NULL if no conversion is needed
+   for this file.  SHOULD_SKIP_BOM only applies in case no conversion was
+   performed, and if true, it will cause a UTF-8 BOM to be skipped at the
+   beginning of the file.  (In case a conversion was performed, the BOM is
+   rather skipped as part of the conversion process.)  */
+typedef const char * (*input_context_charset_callback)(const char *);
+void input_initialize_context (input_context_charset_callback ccb,
+			       bool should_skip_bom);
 #endif
diff --git a/libcpp/charset.c b/libcpp/charset.c
index 99a9b73e5ab..61881f978a8 100644
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -630,7 +630,11 @@ static const struct cpp_conversion conversion_tab[] = {
    cset_converter structure for conversion from FROM to TO.  If
    iconv_open() fails, issue an error and return an identity
    converter.  Silently return an identity converter if FROM and TO
-   are identical.  */
+   are identical.
+
+   PFILE is only used for generating diagnostics; setting it to NULL
+   suppresses diagnostics.  */
+
 static struct cset_converter
 init_iconv_desc (cpp_reader *pfile, const char *to, const char *from)
 {
@@ -672,25 +676,31 @@ init_iconv_desc (cpp_reader *pfile, const char *to, const char *from)
 
       if (ret.cd == (iconv_t) -1)
 	{
-	  if (errno == EINVAL)
-	    cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
-		       "conversion from %s to %s not supported by iconv",
-		       from, to);
-	  else
-	    cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
-
+	  if (pfile)
+	    {
+	      if (errno == EINVAL)
+		cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
+			   "conversion from %s to %s not supported by iconv",
+			   from, to);
+	      else
+		cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
+	    }
 	  ret.func = convert_no_conversion;
 	}
     }
   else
     {
-      cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
-		 "no iconv implementation, cannot convert from %s to %s",
-		 from, to);
+      if (pfile)
+	{
+	  cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
+		     "no iconv implementation, cannot convert from %s to %s",
+		     from, to);
+	}
       ret.func = convert_no_conversion;
       ret.cd = (iconv_t) -1;
       ret.width = -1;
     }
+
   return ret;
 }
 
@@ -2122,6 +2132,25 @@ _cpp_interpret_identifier (cpp_reader *pfile, const uchar *id, size_t len)
 				  buf, bufp - buf, HT_ALLOC));
 }
 
+
+/* Utility to strip a UTF-8 byte order marking from the beginning
+   of a buffer.  Returns the number of bytes to skip, which currently
+   will be either 0 or 3.  */
+int
+cpp_check_utf8_bom (const char *data, size_t data_length)
+{
+
+#if HOST_CHARSET == HOST_CHARSET_ASCII
+  const unsigned char *udata = (const unsigned char *) data;
+  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
+      && udata[2] == 0xbf)
+    return 3;
+#endif
+
+  return 0;
+}
+
+
 /* Convert an input buffer (containing the complete contents of one
    source file) from INPUT_CHARSET to the source character set.  INPUT
    points to the input buffer, SIZE is its allocated size, and LEN is
@@ -2135,7 +2164,11 @@ _cpp_interpret_identifier (cpp_reader *pfile, const uchar *id, size_t len)
    INPUT is expected to have been allocated with xmalloc.  This
    function will either set *BUFFER_START to INPUT, or free it and set
    *BUFFER_START to a pointer to another xmalloc-allocated block of
-   memory.  */
+   memory.
+
+   PFILE is only used to generate diagnostics; setting it to NULL suppresses
+   diagnostics, and causes a return of NULL if there was any error instead.  */
+
 uchar * 
 _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
 		    uchar *input, size_t size, size_t len,
@@ -2158,17 +2191,27 @@ _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
       to.text = XNEWVEC (uchar, to.asize);
       to.len = 0;
 
-      if (!APPLY_CONVERSION (input_cset, input, len, &to))
-	cpp_error (pfile, CPP_DL_ERROR,
-		   "failure to convert %s to %s",
-		   CPP_OPTION (pfile, input_charset), SOURCE_CHARSET);
-
+      const bool ok = APPLY_CONVERSION (input_cset, input, len, &to);
       free (input);
-    }
 
-  /* Clean up the mess.  */
-  if (input_cset.func == convert_using_iconv)
-    iconv_close (input_cset.cd);
+      /* Clean up the mess.  */
+      if (input_cset.func == convert_using_iconv)
+	iconv_close (input_cset.cd);
+
+      /* Handle conversion failure.  */
+      if (!ok)
+	{
+	  if (!pfile)
+	    {
+	      XDELETEVEC (to.text);
+	      *buffer_start = NULL;
+	      *st_size = 0;
+	      return NULL;
+	    }
+	  cpp_error (pfile, CPP_DL_ERROR, "failure to convert %s to %s",
+		     input_charset, SOURCE_CHARSET);
+	}
+    }
 
   /* Resize buffer if we allocated substantially too much, or if we
      haven't enough space for the \n-terminator or following
@@ -2192,19 +2235,14 @@ _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
 
   buffer = to.text;
   *st_size = to.len;
-#if HOST_CHARSET == HOST_CHARSET_ASCII
-  /* The HOST_CHARSET test just above ensures that the source charset
-     is UTF-8.  So, ignore a UTF-8 BOM if we see one.  Note that
-     glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
+
+  /* Ignore a UTF-8 BOM if we see one and the source charset is UTF-8.  Note
+     that glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
      BOM -- however, even if it did, we would still need this code due
      to the 'convert_no_conversion' case.  */
-  if (to.len >= 3 && to.text[0] == 0xef && to.text[1] == 0xbb
-      && to.text[2] == 0xbf)
-    {
-      *st_size -= 3;
-      buffer += 3;
-    }
-#endif
+  const int bom_len = cpp_check_utf8_bom ((const char *) to.text, to.len);
+  *st_size -= bom_len;
+  buffer += bom_len;
 
   *buffer_start = to.text;
   return buffer;
@@ -2244,6 +2282,13 @@ _cpp_default_encoding (void)
   return current_encoding;
 }
 
+/* Check if the configured input charset requires no conversion, other than
+   possibly stripping a UTF-8 BOM.  */
+bool cpp_input_conversion_is_trivial (const char *input_charset)
+{
+  return !strcasecmp (input_charset, SOURCE_CHARSET);
+}
+
 /* Implementation of class cpp_string_location_reader.  */
 
 /* Constructor for cpp_string_location_reader.  */
diff --git a/libcpp/files.c b/libcpp/files.c
index 5ea3f8e1bf3..d91606c1532 100644
--- a/libcpp/files.c
+++ b/libcpp/files.c
@@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile, _cpp_file *file,
 static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
 			      bool *invalid_pch, location_t loc);
 static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
-			    location_t loc);
+			    location_t loc, const char *input_charset);
 static bool read_file (cpp_reader *pfile, _cpp_file *file,
 		       location_t loc);
 static struct cpp_dir *search_path_head (cpp_reader *, const char *fname,
@@ -671,9 +671,12 @@ _cpp_find_file (cpp_reader *pfile, const char *fname, cpp_dir *start_dir,
 
    Use LOC for any diagnostics.
 
+   PFILE may be NULL.  In this case, no diagnostics are issued.
+
    FIXME: Flush file cache and try again if we run out of memory.  */
 static bool
-read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
+read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
+		const char *input_charset)
 {
   ssize_t size, total, count;
   uchar *buf;
@@ -681,8 +684,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
 
   if (S_ISBLK (file->st.st_mode))
     {
-      cpp_error_at (pfile, CPP_DL_ERROR, loc,
-		    "%s is a block device", file->path);
+      if (pfile)
+	cpp_error_at (pfile, CPP_DL_ERROR, loc,
+		      "%s is a block device", file->path);
       return false;
     }
 
@@ -699,8 +703,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
 	 does not bite us.  */
       if (file->st.st_size > INTTYPE_MAXIMUM (ssize_t))
 	{
-	  cpp_error_at (pfile, CPP_DL_ERROR, loc,
-			"%s is too large", file->path);
+	  if (pfile)
+	    cpp_error_at (pfile, CPP_DL_ERROR, loc,
+			  "%s is too large", file->path);
 	  return false;
 	}
 
@@ -733,29 +738,29 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
 
   if (count < 0)
     {
-      cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
+      if (pfile)
+	cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
       free (buf);
       return false;
     }
 
-  if (regular && total != size && STAT_SIZE_RELIABLE (file->st))
+  if (pfile && regular && total != size && STAT_SIZE_RELIABLE (file->st))
     cpp_error_at (pfile, CPP_DL_WARNING, loc,
 	       "%s is shorter than expected", file->path);
 
   file->buffer = _cpp_convert_input (pfile,
-				     CPP_OPTION (pfile, input_charset),
+				     input_charset,
 				     buf, size + 16, total,
 				     &file->buffer_start,
 				     &file->st.st_size);
-  file->buffer_valid = true;
-
-  return true;
+  file->buffer_valid = file->buffer;
+  return file->buffer_valid;
 }
 
 /* Convenience wrapper around read_file_guts that opens the file if
    necessary and closes the file descriptor after reading.  FILE must
    have been passed through find_file() at some stage.  Use LOC for
-   any diagnostics.  */
+   any diagnostics.  Unlike read_file_guts(), PFILE may not be NULL.  */
 static bool
 read_file (cpp_reader *pfile, _cpp_file *file, location_t loc)
 {
@@ -773,7 +778,8 @@ read_file (cpp_reader *pfile, _cpp_file *file, location_t loc)
       return false;
     }
 
-  file->dont_read = !read_file_guts (pfile, file, loc);
+  file->dont_read = !read_file_guts (pfile, file, loc,
+				     CPP_OPTION (pfile, input_charset));
   close (file->fd);
   file->fd = -1;
 
@@ -2118,3 +2124,25 @@ _cpp_has_header (cpp_reader *pfile, const char *fname, int angle_brackets,
   return file->err_no != ENOENT;
 }
 
+/* Read a file and convert to input charset, the same as if it were being read
+   by a cpp_reader.  */
+
+cpp_converted_source
+cpp_get_converted_source (const char *fname, const char *input_charset)
+{
+  cpp_converted_source res = {};
+  _cpp_file file = {};
+  file.fd = -1;
+  file.name = lbasename (fname);
+  file.path = fname;
+  if (!open_file (&file))
+    return res;
+  const bool ok = read_file_guts (NULL, &file, 0, input_charset);
+  close (file.fd);
+  if (!ok)
+    return res;
+  res.to_free = (char *) file.buffer_start;
+  res.data = (char *) file.buffer;
+  res.len = file.st.st_size;
+  return res;
+}
diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
index 4467c73284d..d02ef4aa054 100644
--- a/libcpp/include/cpplib.h
+++ b/libcpp/include/cpplib.h
@@ -1369,6 +1369,20 @@ extern struct _cpp_file *cpp_get_file (cpp_buffer *);
 extern cpp_buffer *cpp_get_prev (cpp_buffer *);
 extern void cpp_clear_file_cache (cpp_reader *);
 
+/* cpp_get_converted_source returns the contents of the given file, as it exists
+   after cpplib has read it and converted it from the input charset to the
+   source charset.  Return struct will be zero-filled if the data could not be
+   read for any reason.  The data starts at the DATA pointer, but the TO_FREE
+   pointer is what should be passed to free(), as there may be an offset.  */
+struct cpp_converted_source
+{
+  char *to_free;
+  char *data;
+  size_t len;
+};
+cpp_converted_source cpp_get_converted_source (const char *fname,
+					       const char *input_charset);
+
 /* In pch.c */
 struct save_macro_data;
 extern int cpp_save_state (cpp_reader *, FILE *);
@@ -1439,6 +1453,7 @@ class cpp_display_width_computation {
 /* Convenience functions that are simple use cases for class
    cpp_display_width_computation.  Tab characters will be expanded to spaces
    as determined by TABSTOP.  */
+
 int cpp_byte_column_to_display_column (const char *data, int data_length,
 				       int column, int tabstop);
 inline int cpp_display_width (const char *data, int data_length,
@@ -1451,4 +1466,7 @@ int cpp_display_column_to_byte_column (const char *data, int data_length,
 				       int display_col, int tabstop);
 int cpp_wcwidth (cppchar_t c);
 
+bool cpp_input_conversion_is_trivial (const char *input_charset);
+int cpp_check_utf8_bom (const char *data, size_t data_length);
+
 #endif /* ! LIBCPP_CPPLIB_H */
-------------- next part --------------
A non-text attachment was scrubbed...
Name: diagnostics-input-charset-v2-part2.txt.gz
Type: application/x-gunzip
Size: 831 bytes
Desc: not available
URL: <https://gcc.gnu.org/pipermail/gcc-patches/attachments/20210129/d41ee419/attachment-0001.bin>