diff mbox series

diagnostics: Support for -finput-charset [PR93067]

Message ID 20201218230353.GA6439@ldh-imac.local
State New
Headers show
Series diagnostics: Support for -finput-charset [PR93067] | expand

Commit Message

Lewis Hyatt Dec. 18, 2020, 11:03 p.m. UTC
Hello-

The attached patch addresses PR93067:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067#c0

This is similar to the patch I posted last year on the PR, with some tweaks
to make it a little simpler. Recapping some of the commentary on the PR:

When source lines are needed for diagnostics output, they are retrieved from
the source file by the fcache infrastructure in input.c, since libcpp has
generally already forgotten them (plus not all front ends are using
libcpp). This infrastructure does not read the files in the same way as
libcpp does; in particular, it does not translate the encoding as requested
by -finput-charset, and it does not strip a UTF-8 byte-order mark if
present. The patch adds this ability. My thinking in deciding how to do it
was the following:

- Use of -finput-charset is rare, and use of UTF-8 BOMs must be rarer still,
  so this patch should try hard not to introduce any worse performance
  unless these things are needed.

- It is desirable to reuse libcpp's encoding infrastructure from charset.c
  rather than repeat it in input.c. (Notably, libcpp uses iconv but it also
  has hand-coded routines for certain charsets to make sure they are
  available.)

- There is a performance degradation required in order to make use of libcpp
  directly, because the input.c infrastructure only reads as much of the
  source file as necessary, whereas libcpp interfaces as-is require to read
  the entire file into memory.

- It can't be quite as simple as just "only delegate to libcpp if
  -finput-charset was specified", because the stripping of the UTF-8 BOM has
  to happen with or without this option.

- So it seemed a reasonable compromise to me, if -finput-charset is
  specified, then use libcpp to convert the file, otherwise, strip the BOM
  in input.c and then process the file the same way it is done now. There's
  a little bit of leakage of charset logic from libcpp this way (for the
  BOM), but it seems worthwhile, since otherwise, diagnostics would always
  be reading the entire file into memory, which is not a cost paid
  currently.

Separate from the main patch are two testcases that both fail before this
patch and pass after. I attached them gzipped because they use non-standard
encodings that won't email well.

The main question I have about the patch is whether I chose a good way to
address the following complication. location_get_source_line() in input.c is
used to generate diagnostics for all front ends, whether they use libcpp to
read the files or not. So input.c needs some way to know whether libcpp is
in use or not. I ended up adding a new function input_initialize_cpp_context(),
which front ends have to call if they are using libcpp to read their
files. Currently that is c-family and fortran. I don't see a simpler way
to do it at least... Even if there were a better way for input.c to find out
the value passed to -finput-charset, it would still need to know whether
libcpp is being used or not.

Please let me know if it looks OK (either now or for stage 1, whatever makes
sense...) bootstrap + regtest all languages on x86-64 GNU/Linux, all tests the
same before & after plus 6 new PASS from the new tests. Thanks!

-Lewis
Adds the logic to handle -finput-charset in layout_get_source_line(), so that
source lines are converted from their input encodings prior to being output by
diagnostics machinery.

gcc/c-family/ChangeLog:

	PR other/93067
	* c-opts.c (c_common_post_options): Call new function
	input_initialize_cpp_context().

gcc/fortran/ChangeLog:

	PR other/93067
	* cpp.c (gfc_cpp_post_options): Call new function
	input_initialize_cpp_context().

gcc/ChangeLog:

	PR other/93067
	* input.c (input_initialize_cpp_context): New function.
	(read_data): Add prototype.
	(add_file_to_cache_tab): Use libcpp to convert input encoding when
	needed.
	(class fcache): Add new members to track input encoding conversion
	via libcpp.
	(fcache::fcache): Adapt for new members.
	(fcache::~fcache): Likewise.
	(maybe_grow): Likewise.
	(needs_read): Adapt to be aware that fp member may be NULL now.
	(get_next_line): Likewise.
	* input.h (struct cpp_reader): Forward declare for use...
	(input_initialize_cpp_context): ...here.  Declare new function.

libcpp/ChangeLog:

	PR other/93067
	* charset.c (init_iconv_desc): Adapt to permit PFILE argument to
	be NULL.
	(_cpp_convert_input): Likewise. Also move UTF-8 BOM logic to...
	(cpp_check_utf8_bom): ...here.  New function.
	(cpp_input_conversion_is_trivial): New function.
	* files.c (read_file_guts): Allow PFILE argument to be NULL.  Add
	INPUT_CHARSET argument as an alternate source of this information.
	(cpp_get_converted_source): New function.
	* include/cpplib.h (struct cpp_converted_source): Declare.
	(cpp_get_converted_source): Declare.
	(cpp_input_conversion_is_trivial): Declare.
	(cpp_check_utf8_bom): Declare.

gcc/testsuite/ChangeLog:

	PR other/93067
	* gcc.dg/diagnostic-input-charset-1.c: New test.
	* gcc.dg/diagnostic-input-charset-2.c: New test.

Comments

David Malcolm Jan. 26, 2021, 9:02 p.m. UTC | #1
On Fri, 2020-12-18 at 18:03 -0500, Lewis Hyatt wrote:
> Hello-
> 
> The attached patch addresses PR93067:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067#c0
> 
> This is similar to the patch I posted last year on the PR, with some
tweaks
> to make it a little simpler. Recapping some of the commentary on the
PR:
> 
> When source lines are needed for diagnostics output, they are
retrieved from
> the source file by the fcache infrastructure in input.c, since libcpp
has
> generally already forgotten them (plus not all front ends are using
> libcpp). This infrastructure does not read the files in the same way
as
> libcpp does; in particular, it does not translate the encoding as
requested
> by -finput-charset, and it does not strip a UTF-8 byte-order mark if
> present. The patch adds this ability. My thinking in deciding how to
do it
> was the following:
> 
> - Use of -finput-charset is rare, and use of UTF-8 BOMs must be rarer
still,
>   so this patch should try hard not to introduce any worse
performance
>   unless these things are needed.
> 
> - It is desirable to reuse libcpp's encoding infrastructure from
charset.c
>   rather than repeat it in input.c. (Notably, libcpp uses iconv but
it also
>   has hand-coded routines for certain charsets to make sure they are
>   available.)
> 
> - There is a performance degradation required in order to make use of
libcpp
>   directly, because the input.c infrastructure only reads as much of
the
>   source file as necessary, whereas libcpp interfaces as-is require
to read
>   the entire file into memory.
> 
> - It can't be quite as simple as just "only delegate to libcpp if
>   -finput-charset was specified", because the stripping of the UTF-8
BOM has
>   to happen with or without this option.
> 
> - So it seemed a reasonable compromise to me, if -finput-charset is
>   specified, then use libcpp to convert the file, otherwise, strip
the BOM
>   in input.c and then process the file the same way it is done now.
There's
>   a little bit of leakage of charset logic from libcpp this way (for
the
>   BOM), but it seems worthwhile, since otherwise, diagnostics would
always
>   be reading the entire file into memory, which is not a cost paid
>   currently.

Thanks for the patch; sorry about the delay in reviewing it.

This mostly seems good to me.

One aspect I'm not quite convinced about is the
input_cpp_context.in_use flag.  The input.c machinery is used by
diagnostics, and so could be used by middle-end warnings for frontends
that don't use libcpp.  Presumably we'd still want to remove the UTF-8
BOM for those, and do encoding fixups if necessary - is it more a case
of initializing things to express what the expected input charset is?
(since that is part of the cpp_options)

c.opt has:
  finput-charset=
  C ObjC C++ ObjC++ Joined RejectNegative
  -finput-charset=<cset>	Specify the default character set for
source files.

I believe that D and Go are the two frontends that don't use libcpp for
parsing.  I believe Go source is required to be UTF-8 (unsurprisingly
considering the heritage of both).  I don't know what source encodings
D supports.

> Separate from the main patch are two testcases that both fail before
this
> patch and pass after. I attached them gzipped because they use non-
standard
> encodings that won't email well.
> 
> The main question I have about the patch is whether I chose a good
way to
> address the following complication. location_get_source_line() in
input.c is
> used to generate diagnostics for all front ends, whether they use
libcpp to
> read the files or not. So input.c needs some way to know whether
libcpp is
> in use or not. I ended up adding a new function
input_initialize_cpp_context(),
> which front ends have to call if they are using libcpp to read their
> files. Currently that is c-family and fortran. I don't see a simpler
way
> to do it at least... Even if there were a better way for input.c to
find out
> the value passed to -finput-charset, it would still need to know
whether
> libcpp is being used or not.

At some point I want to convert the global state in input.c into a
source_manager class, probably hung off the diagnostic_context, so the
natural place to put that state would be in there (rather than the
fcache being global state).  Perhaps have a source_reader policy class
hung off of the source_manager, which would have responsibility for
reading and converting the file, either piecemeal, or fully for the
"need to use libcpp" case.

But that's not necessary for this fix.

> Please let me know if it looks OK (either now or for stage 1,
whatever makes
> sense...) bootstrap + regtest all languages on x86-64 GNU/Linux, all
tests the
> same before & after plus 6 new PASS from the new tests. Thanks!

Various comments inline below.  In particular, does the patch add a
memory leak in _cpp_convert_input?

Thanks
Dave

[...]

> libcpp/ChangeLog:
> 
>         PR other/93067
>         * charset.c (init_iconv_desc): Adapt to permit PFILE argument
to
>         be NULL.
>         (_cpp_convert_input): Likewise. Also move UTF-8 BOM logic
to...
>         (cpp_check_utf8_bom): ...here.  New function.
>         (cpp_input_conversion_is_trivial): New function.
>         * files.c (read_file_guts): Allow PFILE argument to be NULL.
Add
>         INPUT_CHARSET argument as an alternate source of this
information.
>         (cpp_get_converted_source): New function.
>         * include/cpplib.h (struct cpp_converted_source): Declare.
>         (cpp_get_converted_source): Declare.
>         (cpp_input_conversion_is_trivial): Declare.
>         (cpp_check_utf8_bom): Declare.
> 
> gcc/testsuite/ChangeLog:
> 
>         PR other/93067
>         * gcc.dg/diagnostic-input-charset-1.c: New test.
>         * gcc.dg/diagnostic-input-charset-2.c: New test.

Maybe rename the 2nd test to "diagnostic-input-utf8-bom.c" ?

> 
> diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
> index 59cabd12407..d5aa7859cc1 100644
> --- a/gcc/c-family/c-opts.c
> +++ b/gcc/c-family/c-opts.c
> @@ -1124,6 +1124,10 @@ c_common_post_options (const char **pfilename)
>    cpp_post_options (parse_in);
>    init_global_opts_from_cpp (&global_options, cpp_get_options
(parse_in));
>  
> +  /* Let diagnostics infrastructure know we are using libcpp to read
> +     the input.  */
> +  input_initialize_cpp_context (parse_in);

As noted above I think the most significant thing here is getting the
source encoding from parse_in, so maybe reword the comment accordingly.

[...]

> diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
> index 51baf141711..2b12a98afc0 100644
> --- a/gcc/fortran/cpp.c
> +++ b/gcc/fortran/cpp.c
> @@ -493,6 +493,10 @@ gfc_cpp_post_options (void)
>  
>    cpp_post_options (cpp_in);
>  
> +  /* Let diagnostics infrastructure know we are using libcpp to read
> +     the input.  */
> +  input_initialize_cpp_context (cpp_in);

Likewise.

[...]

> diff --git a/gcc/input.c b/gcc/input.c
> index 29d10f06b86..1dcdd464bc1 100644
> --- a/gcc/input.c
> +++ b/gcc/input.c

[...]

> @@ -394,6 +432,42 @@ add_file_to_cache_tab (const char *file_path)
>    r->total_lines = total_lines_num (file_path);
>    r->missing_trailing_newline = true;
>  
> +  /* If libcpp is managing the reading, then there are two cases we
need to
> +     consider.  If -finput-charset is not in effect, then we just
need to
> +     strip a UTF-8 BOM, so do that ourselves rather than calling
into libcpp so
> +     as to avoid paying the penalty of using libcpp, namely that the
entire file
> +     must be read at once.  In the (generally rare) case that a non-
trivial
> +     -finput-charset is needed, then go ahead and use libcpp to read
the whole
> +     file and do the conversion.  */
> +  if (input_cpp_context.in_use)
> +    {
> +      if (input_cpp_context.conversion_is_trivial)
> +       {
> +         /* Strip the UTF8 BOM if present.  */
> +         if (read_data (r))
> +           {
> +             const int offset = cpp_check_utf8_bom (r->data, r-
>nb_read);
> +             r->offset_buffer (offset);
> +             r->nb_read -= offset;
> +           }

This assumes that trivial conversions are always UTF8 to UTF8, which
assumes that SOURCE_CHARSET is UTF8, which isn't the case for EBCDIC...

[...]

> +/* Utility to strip a UTF-8 byte order marking from the beginning
> +   of a buffer.  Returns the number of bytes to skip, which
currently
> +   will be either 0 or 3.  */
> +int
> +cpp_check_utf8_bom (const char *data, size_t data_length)
> +{
> +
> +#if HOST_CHARSET == HOST_CHARSET_ASCII
> +  const unsigned char *udata = (const unsigned char *) data;
> +  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
> +      && udata[2] == 0xbf)
> +    return 3;
> +#endif
> +
> +  return 0;
> +}

...though the check on HOST_CHARSET == HOST_CHARSET_ASCII makes this a
no-op for the EBCDIC case, so that's OK.

[...]

> @@ -2158,17 +2191,28 @@ _cpp_convert_input (cpp_reader *pfile, const
char *input_charset,
>        to.text = XNEWVEC (uchar, to.asize);
>        to.len = 0;
>  
> -      if (!APPLY_CONVERSION (input_cset, input, len, &to))
> -       cpp_error (pfile, CPP_DL_ERROR,
> -                  "failure to convert %s to %s",
> -                  CPP_OPTION (pfile, input_charset),
SOURCE_CHARSET);
> +      const bool ok = APPLY_CONVERSION (input_cset, input, len,
&to);
>  
> -      free (input);
> -    }
> +      /* Handle conversion failure.  */
> +      if (!ok)
> +       {
> +         free (input);
> +         if (!pfile)
> +           {
> +             XDELETEVEC (to.text);
> +             *buffer_start = NULL;
> +             *st_size = 0;
> +             return NULL;
> +           }
> +         cpp_error (pfile, CPP_DL_ERROR,
> +                    "failure to convert %s to %s",
> +                    CPP_OPTION (pfile, input_charset),
SOURCE_CHARSET);
> +       }
> +    }
>

Doesn't this lose the
  free (input);
for the case where the conversion is successful?  Is this a leak?

[...]

> diff --git a/libcpp/files.c b/libcpp/files.c
> index 301b2379a23..178bb9ed1e6 100644
> --- a/libcpp/files.c
> +++ b/libcpp/files.c
> @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile,
_cpp_file *file,
>  static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
>                               bool *invalid_pch, location_t loc);
>  static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
> -                           location_t loc);
> +                           location_t loc, const char *input_charset
= NULL);

read_file_guts is only used in one place before the patch, so rather
than add a default argument, I think it's simpler to pass in the
input_charset at that call.

> @@ -671,18 +671,32 @@ _cpp_find_file (cpp_reader *pfile, const char
*fname, cpp_dir *start_dir,
>  
>     Use LOC for any diagnostics.
>  
> +   The input charset may be specified in the INPUT_CHARSET argument,
or
> +   else it will be taken from PFILE.

...and doing so means that input_charset will be non-NULL at all
callers.

> +   PFILE may be NULL.  In this case, no diagnostics are issued, and
the
> +   input charset must be specified in the arguments.
> +
>     FIXME: Flush file cache and try again if we run out of memory. 
*/
>  static bool
> -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
> +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
> +               const char *input_charset)
>  {
>    ssize_t size, total, count;
>    uchar *buf;
>    bool regular;
>  
> +  if (!input_charset)
> +    {
> +      gcc_assert (pfile);
> +      input_charset = CPP_OPTION (pfile, input_charset);
> +    }

...which would allow for this logic to be removed, moving this to the
existing caller, I believe.

[...]

Hope this is constructive
Dave
Lewis Hyatt Jan. 29, 2021, 3:46 p.m. UTC | #2
On Tue, Jan 26, 2021 at 04:02:52PM -0500, David Malcolm wrote:
> On Fri, 2020-12-18 at 18:03 -0500, Lewis Hyatt wrote:
> > Hello-
> > 
> > The attached patch addresses PR93067:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067#c0
> > 
> > This is similar to the patch I posted last year on the PR, with some
> tweaks
> > to make it a little simpler. Recapping some of the commentary on the
> PR:
> > 
> > When source lines are needed for diagnostics output, they are
> retrieved from
> > the source file by the fcache infrastructure in input.c, since libcpp
> has
> > generally already forgotten them (plus not all front ends are using
> > libcpp). This infrastructure does not read the files in the same way
> as
> > libcpp does; in particular, it does not translate the encoding as
> requested
> > by -finput-charset, and it does not strip a UTF-8 byte-order mark if
> > present. The patch adds this ability. My thinking in deciding how to
> do it
> > was the following:
> > 
> > - Use of -finput-charset is rare, and use of UTF-8 BOMs must be rarer
> still,
> >   so this patch should try hard not to introduce any worse
> performance
> >   unless these things are needed.
> > 
> > - It is desirable to reuse libcpp's encoding infrastructure from
> charset.c
> >   rather than repeat it in input.c. (Notably, libcpp uses iconv but
> it also
> >   has hand-coded routines for certain charsets to make sure they are
> >   available.)
> > 
> > - There is a performance degradation required in order to make use of
> libcpp
> >   directly, because the input.c infrastructure only reads as much of
> the
> >   source file as necessary, whereas libcpp interfaces as-is require
> to read
> >   the entire file into memory.
> > 
> > - It can't be quite as simple as just "only delegate to libcpp if
> >   -finput-charset was specified", because the stripping of the UTF-8
> BOM has
> >   to happen with or without this option.
> > 
> > - So it seemed a reasonable compromise to me, if -finput-charset is
> >   specified, then use libcpp to convert the file, otherwise, strip
> the BOM
> >   in input.c and then process the file the same way it is done now.
> There's
> >   a little bit of leakage of charset logic from libcpp this way (for
> the
> >   BOM), but it seems worthwhile, since otherwise, diagnostics would
> always
> >   be reading the entire file into memory, which is not a cost paid
> >   currently.
> 
> Thanks for the patch; sorry about the delay in reviewing it.
>

Thanks for the comments! Here is an updated patch that addresses your
feedback, plus some responses inline below.

Bootstrap + regtest all languages was done on x86-64 GNU/Linux. All tests
the same before and after, plus 6 new PASS.

FAIL 85 85
PASS 479191 479197
UNSUPPORTED 13664 13664
UNTESTED 129 129
XFAIL 2292 2292
XPASS 30 30


> This mostly seems good to me.
> 
> One aspect I'm not quite convinced about is the
> input_cpp_context.in_use flag.  The input.c machinery is used by
> diagnostics, and so could be used by middle-end warnings for frontends
> that don't use libcpp.  Presumably we'd still want to remove the UTF-8
> BOM for those, and do encoding fixups if necessary - is it more a case
> of initializing things to express what the expected input charset is?
> (since that is part of the cpp_options)
> 
> c.opt has:
>   finput-charset=
>   C ObjC C++ ObjC++ Joined RejectNegative
>   -finput-charset=<cset>	Specify the default character set for
> source files.
> 
> I believe that D and Go are the two frontends that don't use libcpp for
> parsing.  I believe Go source is required to be UTF-8 (unsurprisingly
> considering the heritage of both).  I don't know what source encodings
> D supports.
>

For this patch I was rather singularly focused on libcpp, so I looked
deeper at the other frontends now. It seems to me that there are basically
two questions to answer, and the three frontend styles answer this pair in
three different ways.

Q1: What is the input charset?
A1:

    libcpp: Whatever was passed to -finput-charset (note, for Fortran,
    -finput-charset is not supported though.)

    go: Assume UTF-8.

    D: UTF-8, UTF-16, or UTF-32 (the latter two in either
       endianness); determined by inspecting the first bytes of the file.

Q2: How should a UTF-8 BOM be handled?
A2:

    libcpp: Treat entirely the same, as if it was not present at all. So
    a diagnostic referring to the first non-BOM character in the file will
    point to column 1, not to column 4.

    go: Treat it like whitespace, ignored for parsing purposes, but still
    logically part of the file. A diagnostic referring to the first non-BOM
    character in the file will point to column 4.

    D: Same as libcpp.

So input.c's current behavior (with or without my patch) actually matches
the "go" frontend exactly and does the right thing for it. As you
thought, though, for D it would be desirable to skip over the UTF-8 BOM
too.

D adds an additional wrinkle in that, instead of using -finput-charset, it
uses its own detection logic -- based on knowledge that the first codepoint
in the file must be ASCII, it is able to deduce the encoding from the first
few bytes. This means that in D, each source file can have a different
encoding, which is not expressible in libcpp frontends currently, and hence
the notion of a global input_cpp_context with a constant charset is not
sufficient to capture this.

In this iteration of the patch, I replaced the input_cpp_context with a more
general input_context that can handle all of these cases. The context now
consists of a callback function and a bool, which the frontend is supposed
to configure. The bool determines whether or not to skip the BOM. The
callback, when passed a filename, returns the input charset needed to
convert that file. For libcpp, the filename is not used as the charset is
determined from -finput-charset for all files. For D, the callback function
is currently a stub, but it could be expanded to open the file, read the
first few bytes, and determine the charset to be used. I have not
implemented it for now, because it seems inelegant to duplicate the logic D
already has for this detection, but this logic is part of the dmd/ tree
which I think is maintained external to GCC, and so it didn't seem right to
attempt to refactor it. I figured there may not be much interest in this
feature (diagnostics are already unusable for UTF-16 and UTF-32 sources in
D), and this patch makes no change to it. This patch does fix the handling
of a UTF-8 BOM in D diagnostics, however.

> > Separate from the main patch are two testcases that both fail before
> this
> > patch and pass after. I attached them gzipped because they use non-
> standard
> > encodings that won't email well.
> > 
> > The main question I have about the patch is whether I chose a good
> way to
> > address the following complication. location_get_source_line() in
> input.c is
> > used to generate diagnostics for all front ends, whether they use
> libcpp to
> > read the files or not. So input.c needs some way to know whether
> libcpp is
> > in use or not. I ended up adding a new function
> input_initialize_cpp_context(),
> > which front ends have to call if they are using libcpp to read their
> > files. Currently that is c-family and fortran. I don't see a simpler
> way
> > to do it at least... Even if there were a better way for input.c to
> find out
> > the value passed to -finput-charset, it would still need to know
> whether
> > libcpp is being used or not.
> 
> At some point I want to convert the global state in input.c into a
> source_manager class, probably hung off the diagnostic_context, so the
> natural place to put that state would be in there (rather than the
> fcache being global state).  Perhaps have a source_reader policy class
> hung off of the source_manager, which would have responsibility for
> reading and converting the file, either piecemeal, or fully for the
> "need to use libcpp" case.
> 
> But that's not necessary for this fix.
> 
> > Please let me know if it looks OK (either now or for stage 1,
> whatever makes
> > sense...) bootstrap + regtest all languages on x86-64 GNU/Linux, all
> tests the
> > same before & after plus 6 new PASS from the new tests. Thanks!
> 
> Various comments inline below.  In particular, does the patch add a
> memory leak in _cpp_convert_input?
> 
> Thanks
> Dave
> 
> [...]
> 
> > libcpp/ChangeLog:
> > 
> >         PR other/93067
> >         * charset.c (init_iconv_desc): Adapt to permit PFILE argument
> to
> >         be NULL.
> >         (_cpp_convert_input): Likewise. Also move UTF-8 BOM logic
> to...
> >         (cpp_check_utf8_bom): ...here.  New function.
> >         (cpp_input_conversion_is_trivial): New function.
> >         * files.c (read_file_guts): Allow PFILE argument to be NULL.
> Add
> >         INPUT_CHARSET argument as an alternate source of this
> information.
> >         (cpp_get_converted_source): New function.
> >         * include/cpplib.h (struct cpp_converted_source): Declare.
> >         (cpp_get_converted_source): Declare.
> >         (cpp_input_conversion_is_trivial): Declare.
> >         (cpp_check_utf8_bom): Declare.
> > 
> > gcc/testsuite/ChangeLog:
> > 
> >         PR other/93067
> >         * gcc.dg/diagnostic-input-charset-1.c: New test.
> >         * gcc.dg/diagnostic-input-charset-2.c: New test.
> 
> Maybe rename the 2nd test to "diagnostic-input-utf8-bom.c" ?
>

Done.

> > 
> > diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
> > index 59cabd12407..d5aa7859cc1 100644
> > --- a/gcc/c-family/c-opts.c
> > +++ b/gcc/c-family/c-opts.c
> > @@ -1124,6 +1124,10 @@ c_common_post_options (const char **pfilename)
> >    cpp_post_options (parse_in);
> >    init_global_opts_from_cpp (&global_options, cpp_get_options
> (parse_in));
> >  
> > +  /* Let diagnostics infrastructure know we are using libcpp to read
> > +     the input.  */
> > +  input_initialize_cpp_context (parse_in);
> 
> As noted above I think the most significant thing here is getting the
> source encoding from parse_in, so maybe reword the comment accordingly.
> 
> [...]
> 
> > diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
> > index 51baf141711..2b12a98afc0 100644
> > --- a/gcc/fortran/cpp.c
> > +++ b/gcc/fortran/cpp.c
> > @@ -493,6 +493,10 @@ gfc_cpp_post_options (void)
> >  
> >    cpp_post_options (cpp_in);
> >  
> > +  /* Let diagnostics infrastructure know we are using libcpp to read
> > +     the input.  */
> > +  input_initialize_cpp_context (cpp_in);
> 
> Likewise.
>

I kept this in mind when redoing this part, hopefully it is better now.

> [...]
> 
> > diff --git a/gcc/input.c b/gcc/input.c
> > index 29d10f06b86..1dcdd464bc1 100644
> > --- a/gcc/input.c
> > +++ b/gcc/input.c
> 
> [...]
> 
> > @@ -394,6 +432,42 @@ add_file_to_cache_tab (const char *file_path)
> >    r->total_lines = total_lines_num (file_path);
> >    r->missing_trailing_newline = true;
> >  
> > +  /* If libcpp is managing the reading, then there are two cases we
> need to
> > +     consider.  If -finput-charset is not in effect, then we just
> need to
> > +     strip a UTF-8 BOM, so do that ourselves rather than calling
> into libcpp so
> > +     as to avoid paying the penalty of using libcpp, namely that the
> entire file
> > +     must be read at once.  In the (generally rare) case that a non-
> trivial
> > +     -finput-charset is needed, then go ahead and use libcpp to read
> the whole
> > +     file and do the conversion.  */
> > +  if (input_cpp_context.in_use)
> > +    {
> > +      if (input_cpp_context.conversion_is_trivial)
> > +       {
> > +         /* Strip the UTF8 BOM if present.  */
> > +         if (read_data (r))
> > +           {
> > +             const int offset = cpp_check_utf8_bom (r->data, r-
> >nb_read);
> > +             r->offset_buffer (offset);
> > +             r->nb_read -= offset;
> > +           }
> 
> This assumes that trivial conversions are always UTF8 to UTF8, which
> assumes that SOURCE_CHARSET is UTF8, which isn't the case for EBCDIC...
> 
> [...]
> 
> > +/* Utility to strip a UTF-8 byte order marking from the beginning
> > +   of a buffer.  Returns the number of bytes to skip, which
> currently
> > +   will be either 0 or 3.  */
> > +int
> > +cpp_check_utf8_bom (const char *data, size_t data_length)
> > +{
> > +
> > +#if HOST_CHARSET == HOST_CHARSET_ASCII
> > +  const unsigned char *udata = (const unsigned char *) data;
> > +  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
> > +      && udata[2] == 0xbf)
> > +    return 3;
> > +#endif
> > +
> > +  return 0;
> > +}
> 
> ...though the check on HOST_CHARSET == HOST_CHARSET_ASCII makes this a
> no-op for the EBCDIC case, so that's OK.
> 
> [...]
> 
> > @@ -2158,17 +2191,28 @@ _cpp_convert_input (cpp_reader *pfile, const
> char *input_charset,
> >        to.text = XNEWVEC (uchar, to.asize);
> >        to.len = 0;
> >  
> > -      if (!APPLY_CONVERSION (input_cset, input, len, &to))
> > -       cpp_error (pfile, CPP_DL_ERROR,
> > -                  "failure to convert %s to %s",
> > -                  CPP_OPTION (pfile, input_charset),
> SOURCE_CHARSET);
> > +      const bool ok = APPLY_CONVERSION (input_cset, input, len,
> &to);
> >  
> > -      free (input);
> > -    }
> > +      /* Handle conversion failure.  */
> > +      if (!ok)
> > +       {
> > +         free (input);
> > +         if (!pfile)
> > +           {
> > +             XDELETEVEC (to.text);
> > +             *buffer_start = NULL;
> > +             *st_size = 0;
> > +             return NULL;
> > +           }
> > +         cpp_error (pfile, CPP_DL_ERROR,
> > +                    "failure to convert %s to %s",
> > +                    CPP_OPTION (pfile, input_charset),
> SOURCE_CHARSET);
> > +       }
> > +    }
> >
> 
> Doesn't this lose the
>   free (input);
> for the case where the conversion is successful?  Is this a leak?
>

Ugh, sorry to waste your time with that mistake. Fixed.

> [...]
> 
> > diff --git a/libcpp/files.c b/libcpp/files.c
> > index 301b2379a23..178bb9ed1e6 100644
> > --- a/libcpp/files.c
> > +++ b/libcpp/files.c
> > @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile,
> _cpp_file *file,
> >  static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
> >                               bool *invalid_pch, location_t loc);
> >  static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
> > -                           location_t loc);
> > +                           location_t loc, const char *input_charset
> = NULL);
> 
> read_file_guts is only used in one place before the patch, so rather
> than add a default argument, I think it's simpler to pass in the
> input_charset at that call.
> 
> > @@ -671,18 +671,32 @@ _cpp_find_file (cpp_reader *pfile, const char
> *fname, cpp_dir *start_dir,
> >  
> >     Use LOC for any diagnostics.
> >  
> > +   The input charset may be specified in the INPUT_CHARSET argument,
> or
> > +   else it will be taken from PFILE.
> 
> ...and doing so means that input_charset will be non-NULL at all
> callers.
> 
> > +   PFILE may be NULL.  In this case, no diagnostics are issued, and
> the
> > +   input charset must be specified in the arguments.
> > +
> >     FIXME: Flush file cache and try again if we run out of memory. 
> */
> >  static bool
> > -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
> > +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
> > +               const char *input_charset)
> >  {
> >    ssize_t size, total, count;
> >    uchar *buf;
> >    bool regular;
> >  
> > +  if (!input_charset)
> > +    {
> > +      gcc_assert (pfile);
> > +      input_charset = CPP_OPTION (pfile, input_charset);
> > +    }
> 
> ...which would allow for this logic to be removed, moving this to the
> existing caller, I believe.
>

Much better your way, done.

Thanks for taking a look at it, sorry this version is a bit changed from the
previous one. FWIW, the libcpp pieces were not touched other than addressing
your comments; the new logic is in input.h/input.c plus the associated
tweaks to the frontends.

-Lewis
From: Lewis Hyatt <lhyatt@gmail.com>
Date: Wed, 27 Jan 2021 18:23:13 -0500
Subject: [PATCH] diagnostics: Support for -finput-charset [PR93067]

Adds the logic to handle -finput-charset in layout_get_source_line(), so that
source lines are converted from their input encodings prior to being output by
diagnostics machinery. Also adds the ability to strip a UTF-8 BOM similarly.

gcc/c-family/ChangeLog:

	PR other/93067
	* c-opts.c (c_common_input_charset_cb): New function.
	(c_common_post_options): Call new function input_initialize_context().

gcc/d/ChangeLog:

	PR other/93067
	* d-lang.cc (d_input_charset_callback): New function.
	(d_init): Call new function input_initialize_context().

gcc/fortran/ChangeLog:

	PR other/93067
	* cpp.c (gfc_cpp_post_options): Call new function
	input_initialize_cpp_context().

gcc/ChangeLog:

	PR other/93067
	* input.c (default_charset_callback): New function.
	(input_initialize_context): New function.
	(read_data): Add prototype.
	(add_file_to_cache_tab): Use libcpp to convert input encoding when
	needed. Strip UTF-8 BOM when needed.
	(class fcache): Add new members to track input encoding conversion.
	(fcache::fcache): Adapt for new members.
	(fcache::~fcache): Likewise.
	(maybe_grow): Likewise.
	(needs_read): Adapt to be aware that fp member may be NULL now.
	(get_next_line): Likewise.
	* input.h (input_initialize_context): Declare.

libcpp/ChangeLog:

	PR other/93067
	* charset.c (init_iconv_desc): Adapt to permit PFILE argument to
	be NULL.
	(_cpp_convert_input): Likewise. Also move UTF-8 BOM logic to...
	(cpp_check_utf8_bom): ...here.  New function.
	(cpp_input_conversion_is_trivial): New function.
	* files.c (read_file_guts): Allow PFILE argument to be NULL.  Add
	INPUT_CHARSET argument as an alternate source of this information.
	(read_file): Pass the new argument to read_file_guts.
	(cpp_get_converted_source): New function.
	* include/cpplib.h (struct cpp_converted_source): Declare.
	(cpp_get_converted_source): Declare.
	(cpp_input_conversion_is_trivial): Declare.
	(cpp_check_utf8_bom): Declare.

gcc/testsuite/ChangeLog:

	PR other/93067
	* gcc.dg/diagnostic-input-charset-1.c: New test.
	* gcc.dg/diagnostic-input-utf8-bom.c: New test.

diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
index bd15b9cd902..2f58e8413b9 100644
--- a/gcc/c-family/c-opts.c
+++ b/gcc/c-family/c-opts.c
@@ -188,6 +188,14 @@ c_common_diagnostics_set_defaults (diagnostic_context *context)
   context->opt_permissive = OPT_fpermissive;
 }
 
+/* Input charset configuration for diagnostics.  */
+static const char *
+c_common_input_charset_cb (const char * /*filename*/)
+{
+  const char *cs = cpp_opts->input_charset;
+  return cpp_input_conversion_is_trivial (cs) ? NULL : cs;
+}
+
 /* Whether options from all C-family languages should be accepted
    quietly.  */
 static bool accept_all_c_family_options = false;
@@ -1131,6 +1139,10 @@ c_common_post_options (const char **pfilename)
   cpp_post_options (parse_in);
   init_global_opts_from_cpp (&global_options, cpp_get_options (parse_in));
 
+  /* Let diagnostics infrastructure know how to convert input files the same
+     way libcpp will do it, namely using the configured input charset and
+     skipping a UTF-8 BOM if present.  */
+  input_initialize_context (c_common_input_charset_cb, true);
   input_location = UNKNOWN_LOCATION;
 
   *pfilename = this_input_filename
diff --git a/gcc/d/d-lang.cc b/gcc/d/d-lang.cc
index 0fd207da7f3..9f170c09886 100644
--- a/gcc/d/d-lang.cc
+++ b/gcc/d/d-lang.cc
@@ -50,6 +50,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "output.h"
 #include "print-tree.h"
 #include "debug.h"
+#include "input.h"
 
 #include "d-tree.h"
 #include "id.h"
@@ -362,6 +363,19 @@ d_option_lang_mask (void)
   return CL_D;
 }
 
+/* Implements input charset and BOM skipping configuration for
+   diagnostics.  */
+static const char *d_input_charset_callback (const char * /*filename*/)
+{
+  /* TODO: The input charset is automatically determined by code in
+     dmd/dmodule.c based on the contents of the file.  If this detection
+     logic were factored out and could be reused here, then we would be able
+     to return UTF-16 or UTF-32 as needed here.  For now, we return always
+     NULL, which means no conversion is necessary, i.e. the input is assumed
+     to be UTF-8 when diagnostics read this file.  */
+  return NULL;
+}
+
 /* Implements the lang_hooks.init routine for language D.  */
 
 static bool
@@ -373,6 +387,10 @@ d_init (void)
   Expression::_init ();
   Objc::_init ();
 
+  /* Diagnostics input init, to enable BOM skipping and
+     input charset conversion.  */
+  input_initialize_context (d_input_charset_callback, true);
+
   /* Back-end init.  */
   global_binding_level = ggc_cleared_alloc <binding_level> ();
   current_binding_level = global_binding_level;
diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
index 419cd6accbe..8b564888673 100644
--- a/gcc/fortran/cpp.c
+++ b/gcc/fortran/cpp.c
@@ -493,6 +493,12 @@ gfc_cpp_post_options (void)
 
   cpp_post_options (cpp_in);
 
+
+  /* Let diagnostics infrastructure know how to convert input files the same
+     way libcpp will do it, namely, with no charset conversion but with
+     skipping of a UTF-8 BOM if present.  */
+  input_initialize_context (NULL, true);
+
   gfc_cpp_register_include_paths ();
 }
 
diff --git a/gcc/input.c b/gcc/input.c
index 9e39e7df83c..f8f6dc7e4ca 100644
--- a/gcc/input.c
+++ b/gcc/input.c
@@ -30,6 +30,29 @@ along with GCC; see the file COPYING3.  If not see
 #define HAVE_ICONV 0
 #endif
 
+/* Input charset configuration.  */
+static const char *default_charset_callback (const char *)
+{
+  return NULL;
+}
+
+struct
+{
+  input_context_charset_callback ccb;
+  bool should_skip_bom;
+} static input_context =
+{
+  default_charset_callback,
+  false
+};
+
+void input_initialize_context (input_context_charset_callback ccb,
+			       bool should_skip_bom)
+{
+  input_context.ccb = (ccb ? ccb : default_charset_callback);
+  input_context.should_skip_bom = should_skip_bom;
+}
+
 /* This is a cache used by get_next_line to store the content of a
    file to be searched for file lines.  */
 class fcache
@@ -78,6 +101,10 @@ public:
      far.  */
   char *data;
 
+  /* The allocated buffer to be freed may start a little earlier than DATA,
+     e.g. if a UTF8 BOM was skipped at the beginning.  */
+  int alloc_offset;
+
   /*  The size of the DATA array above.*/
   size_t size;
 
@@ -118,6 +145,17 @@ public:
 
   fcache ();
   ~fcache ();
+
+  void offset_buffer (int offset)
+  {
+    gcc_assert (offset < 0 ? alloc_offset + offset >= 0
+		: (size_t) offset <= size);
+    gcc_assert (data);
+    alloc_offset += offset;
+    data += offset;
+    size -= offset;
+  }
+
 };
 
 /* Current position in real source file.  */
@@ -364,6 +402,9 @@ evicted_cache_tab_entry (unsigned *highest_use_count)
   return to_evict;
 }
 
+static bool
+read_data (fcache *c);
+
 /* Create the cache used for the content of a given file to be
    accessed by caret diagnostic.  This cache is added to an array of
    cache and can be retrieved by lookup_file_in_cache_tab.  This
@@ -384,6 +425,8 @@ add_file_to_cache_tab (const char *file_path)
   if (r->fp)
     fclose (r->fp);
   r->fp = fp;
+  if (r->alloc_offset)
+    r->offset_buffer (-r->alloc_offset);
   r->nb_read = 0;
   r->line_start_idx = 0;
   r->line_num = 0;
@@ -394,6 +437,33 @@ add_file_to_cache_tab (const char *file_path)
   r->total_lines = total_lines_num (file_path);
   r->missing_trailing_newline = true;
 
+  /* Check the frontend configuration to determine if we need to do any
+     transformations, such as charset conversion or BOM skipping.  */
+  if (const char *input_charset = input_context.ccb (file_path))
+    {
+      /* Need a full-blown conversion of the input charset.  */
+      fclose (r->fp);
+      r->fp = NULL;
+      const cpp_converted_source cs
+	= cpp_get_converted_source (file_path, input_charset);
+      if (!cs.data)
+	return NULL;
+      if (r->data)
+	XDELETEVEC (r->data);
+      r->data = cs.data;
+      r->nb_read = r->size = cs.len;
+      r->alloc_offset = cs.data - cs.to_free;
+    }
+  else if (input_context.should_skip_bom)
+    {
+      if (read_data (r))
+	{
+	  const int offset = cpp_check_utf8_bom (r->data, r->nb_read);
+	  r->offset_buffer (offset);
+	  r->nb_read -= offset;
+	}
+    }
+
   return r;
 }
 
@@ -415,7 +485,7 @@ lookup_or_add_file_to_cache_tab (const char *file_path)
    diagnostic.  */
 
 fcache::fcache ()
-: use_count (0), file_path (NULL), fp (NULL), data (0),
+: use_count (0), file_path (NULL), fp (NULL), data (0), alloc_offset (0),
   size (0), nb_read (0), line_start_idx (0), line_num (0),
   total_lines (0), missing_trailing_newline (true)
 {
@@ -433,6 +503,7 @@ fcache::~fcache ()
     }
   if (data)
     {
+      offset_buffer (-alloc_offset);
       XDELETEVEC (data);
       data = 0;
     }
@@ -447,9 +518,9 @@ fcache::~fcache ()
 static bool
 needs_read (fcache *c)
 {
-  return (c->nb_read == 0
-	  || c->nb_read == c->size
-	  || (c->line_start_idx >= c->nb_read - 1));
+  return c->fp && (c->nb_read == 0
+		   || c->nb_read == c->size
+		   || (c->line_start_idx >= c->nb_read - 1));
 }
 
 /*  Return TRUE iff the cache is full and thus needs to be
@@ -469,9 +540,20 @@ maybe_grow (fcache *c)
   if (!needs_grow (c))
     return;
 
-  size_t size = c->size == 0 ? fcache_buffer_size : c->size * 2;
-  c->data = XRESIZEVEC (char, c->data, size);
-  c->size = size;
+  if (!c->data)
+    {
+      gcc_assert (c->size == 0 && c->alloc_offset == 0);
+      c->size = fcache_buffer_size;
+      c->data = XNEWVEC (char, c->size);
+    }
+  else
+    {
+      const int offset = c->alloc_offset;
+      c->offset_buffer (-offset);
+      c->size *= 2;
+      c->data = XRESIZEVEC (char, c->data, c->size);
+      c->offset_buffer (offset);
+    }
 }
 
 /*  Read more data into the cache.  Extends the cache if need be.
@@ -570,7 +652,7 @@ get_next_line (fcache *c, char **line, ssize_t *line_len)
       c->missing_trailing_newline = false;
     }
 
-  if (ferror (c->fp))
+  if (c->fp && ferror (c->fp))
     return false;
 
   /* At this point, we've found the end of the of line.  It either
diff --git a/gcc/input.h b/gcc/input.h
index f1ef5d76cfd..8a793399958 100644
--- a/gcc/input.h
+++ b/gcc/input.h
@@ -214,4 +214,21 @@ class GTY(()) string_concat_db
   hash_map <location_hash, string_concat *> *m_table;
 };
 
+/* Because we read source files a second time after the frontend did it the
+   first time, we need to know how the frontend handled things like character
+   set conversion and UTF-8 BOM stripping, in order to make everything
+   consistent.  This function needs to be called by each frontend that requires
+   non-default behavior, to inform the diagnostics infrastructure how input is
+   to be processed.  The default behavior is to do no conversion and not to
+   strip a UTF-8 BOM.
+
+   The callback should return the input charset to be used to convert the given
+   file's contents to UTF-8, or it should return NULL if no conversion is needed
+   for this file.  SHOULD_SKIP_BOM only applies in case no conversion was
+   performed, and if true, it will cause a UTF-8 BOM to be skipped at the
+   beginning of the file.  (In case a conversion was performed, the BOM is
+   rather skipped as part of the conversion process.)  */
+typedef const char * (*input_context_charset_callback)(const char *);
+void input_initialize_context (input_context_charset_callback ccb,
+			       bool should_skip_bom);
 #endif
diff --git a/libcpp/charset.c b/libcpp/charset.c
index 99a9b73e5ab..61881f978a8 100644
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -630,7 +630,11 @@ static const struct cpp_conversion conversion_tab[] = {
    cset_converter structure for conversion from FROM to TO.  If
    iconv_open() fails, issue an error and return an identity
    converter.  Silently return an identity converter if FROM and TO
-   are identical.  */
+   are identical.
+
+   PFILE is only used for generating diagnostics; setting it to NULL
+   suppresses diagnostics.  */
+
 static struct cset_converter
 init_iconv_desc (cpp_reader *pfile, const char *to, const char *from)
 {
@@ -672,25 +676,31 @@ init_iconv_desc (cpp_reader *pfile, const char *to, const char *from)
 
       if (ret.cd == (iconv_t) -1)
 	{
-	  if (errno == EINVAL)
-	    cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
-		       "conversion from %s to %s not supported by iconv",
-		       from, to);
-	  else
-	    cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
-
+	  if (pfile)
+	    {
+	      if (errno == EINVAL)
+		cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
+			   "conversion from %s to %s not supported by iconv",
+			   from, to);
+	      else
+		cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
+	    }
 	  ret.func = convert_no_conversion;
 	}
     }
   else
     {
-      cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
-		 "no iconv implementation, cannot convert from %s to %s",
-		 from, to);
+      if (pfile)
+	{
+	  cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
+		     "no iconv implementation, cannot convert from %s to %s",
+		     from, to);
+	}
       ret.func = convert_no_conversion;
       ret.cd = (iconv_t) -1;
       ret.width = -1;
     }
+
   return ret;
 }
 
@@ -2122,6 +2132,25 @@ _cpp_interpret_identifier (cpp_reader *pfile, const uchar *id, size_t len)
 				  buf, bufp - buf, HT_ALLOC));
 }
 
+
+/* Utility to strip a UTF-8 byte order marking from the beginning
+   of a buffer.  Returns the number of bytes to skip, which currently
+   will be either 0 or 3.  */
+int
+cpp_check_utf8_bom (const char *data, size_t data_length)
+{
+
+#if HOST_CHARSET == HOST_CHARSET_ASCII
+  const unsigned char *udata = (const unsigned char *) data;
+  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
+      && udata[2] == 0xbf)
+    return 3;
+#endif
+
+  return 0;
+}
+
+
 /* Convert an input buffer (containing the complete contents of one
    source file) from INPUT_CHARSET to the source character set.  INPUT
    points to the input buffer, SIZE is its allocated size, and LEN is
@@ -2135,7 +2164,11 @@ _cpp_interpret_identifier (cpp_reader *pfile, const uchar *id, size_t len)
    INPUT is expected to have been allocated with xmalloc.  This
    function will either set *BUFFER_START to INPUT, or free it and set
    *BUFFER_START to a pointer to another xmalloc-allocated block of
-   memory.  */
+   memory.
+
+   PFILE is only used to generate diagnostics; setting it to NULL suppresses
+   diagnostics, and causes a return of NULL if there was any error instead.  */
+
 uchar * 
 _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
 		    uchar *input, size_t size, size_t len,
@@ -2158,17 +2191,27 @@ _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
       to.text = XNEWVEC (uchar, to.asize);
       to.len = 0;
 
-      if (!APPLY_CONVERSION (input_cset, input, len, &to))
-	cpp_error (pfile, CPP_DL_ERROR,
-		   "failure to convert %s to %s",
-		   CPP_OPTION (pfile, input_charset), SOURCE_CHARSET);
-
+      const bool ok = APPLY_CONVERSION (input_cset, input, len, &to);
       free (input);
-    }
 
-  /* Clean up the mess.  */
-  if (input_cset.func == convert_using_iconv)
-    iconv_close (input_cset.cd);
+      /* Clean up the mess.  */
+      if (input_cset.func == convert_using_iconv)
+	iconv_close (input_cset.cd);
+
+      /* Handle conversion failure.  */
+      if (!ok)
+	{
+	  if (!pfile)
+	    {
+	      XDELETEVEC (to.text);
+	      *buffer_start = NULL;
+	      *st_size = 0;
+	      return NULL;
+	    }
+	  cpp_error (pfile, CPP_DL_ERROR, "failure to convert %s to %s",
+		     input_charset, SOURCE_CHARSET);
+	}
+    }
 
   /* Resize buffer if we allocated substantially too much, or if we
      haven't enough space for the \n-terminator or following
@@ -2192,19 +2235,14 @@ _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
 
   buffer = to.text;
   *st_size = to.len;
-#if HOST_CHARSET == HOST_CHARSET_ASCII
-  /* The HOST_CHARSET test just above ensures that the source charset
-     is UTF-8.  So, ignore a UTF-8 BOM if we see one.  Note that
-     glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
+
+  /* Ignore a UTF-8 BOM if we see one and the source charset is UTF-8.  Note
+     that glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
      BOM -- however, even if it did, we would still need this code due
      to the 'convert_no_conversion' case.  */
-  if (to.len >= 3 && to.text[0] == 0xef && to.text[1] == 0xbb
-      && to.text[2] == 0xbf)
-    {
-      *st_size -= 3;
-      buffer += 3;
-    }
-#endif
+  const int bom_len = cpp_check_utf8_bom ((const char *) to.text, to.len);
+  *st_size -= bom_len;
+  buffer += bom_len;
 
   *buffer_start = to.text;
   return buffer;
@@ -2244,6 +2282,13 @@ _cpp_default_encoding (void)
   return current_encoding;
 }
 
+/* Check if the configured input charset requires no conversion, other than
+   possibly stripping a UTF-8 BOM.  */
+bool cpp_input_conversion_is_trivial (const char *input_charset)
+{
+  return !strcasecmp (input_charset, SOURCE_CHARSET);
+}
+
 /* Implementation of class cpp_string_location_reader.  */
 
 /* Constructor for cpp_string_location_reader.  */
diff --git a/libcpp/files.c b/libcpp/files.c
index 5ea3f8e1bf3..d91606c1532 100644
--- a/libcpp/files.c
+++ b/libcpp/files.c
@@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile, _cpp_file *file,
 static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
 			      bool *invalid_pch, location_t loc);
 static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
-			    location_t loc);
+			    location_t loc, const char *input_charset);
 static bool read_file (cpp_reader *pfile, _cpp_file *file,
 		       location_t loc);
 static struct cpp_dir *search_path_head (cpp_reader *, const char *fname,
@@ -671,9 +671,12 @@ _cpp_find_file (cpp_reader *pfile, const char *fname, cpp_dir *start_dir,
 
    Use LOC for any diagnostics.
 
+   PFILE may be NULL.  In this case, no diagnostics are issued.
+
    FIXME: Flush file cache and try again if we run out of memory.  */
 static bool
-read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
+read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
+		const char *input_charset)
 {
   ssize_t size, total, count;
   uchar *buf;
@@ -681,8 +684,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
 
   if (S_ISBLK (file->st.st_mode))
     {
-      cpp_error_at (pfile, CPP_DL_ERROR, loc,
-		    "%s is a block device", file->path);
+      if (pfile)
+	cpp_error_at (pfile, CPP_DL_ERROR, loc,
+		      "%s is a block device", file->path);
       return false;
     }
 
@@ -699,8 +703,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
 	 does not bite us.  */
       if (file->st.st_size > INTTYPE_MAXIMUM (ssize_t))
 	{
-	  cpp_error_at (pfile, CPP_DL_ERROR, loc,
-			"%s is too large", file->path);
+	  if (pfile)
+	    cpp_error_at (pfile, CPP_DL_ERROR, loc,
+			  "%s is too large", file->path);
 	  return false;
 	}
 
@@ -733,29 +738,29 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
 
   if (count < 0)
     {
-      cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
+      if (pfile)
+	cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
       free (buf);
       return false;
     }
 
-  if (regular && total != size && STAT_SIZE_RELIABLE (file->st))
+  if (pfile && regular && total != size && STAT_SIZE_RELIABLE (file->st))
     cpp_error_at (pfile, CPP_DL_WARNING, loc,
 	       "%s is shorter than expected", file->path);
 
   file->buffer = _cpp_convert_input (pfile,
-				     CPP_OPTION (pfile, input_charset),
+				     input_charset,
 				     buf, size + 16, total,
 				     &file->buffer_start,
 				     &file->st.st_size);
-  file->buffer_valid = true;
-
-  return true;
+  file->buffer_valid = file->buffer;
+  return file->buffer_valid;
 }
 
 /* Convenience wrapper around read_file_guts that opens the file if
    necessary and closes the file descriptor after reading.  FILE must
    have been passed through find_file() at some stage.  Use LOC for
-   any diagnostics.  */
+   any diagnostics.  Unlike read_file_guts(), PFILE may not be NULL.  */
 static bool
 read_file (cpp_reader *pfile, _cpp_file *file, location_t loc)
 {
@@ -773,7 +778,8 @@ read_file (cpp_reader *pfile, _cpp_file *file, location_t loc)
       return false;
     }
 
-  file->dont_read = !read_file_guts (pfile, file, loc);
+  file->dont_read = !read_file_guts (pfile, file, loc,
+				     CPP_OPTION (pfile, input_charset));
   close (file->fd);
   file->fd = -1;
 
@@ -2118,3 +2124,25 @@ _cpp_has_header (cpp_reader *pfile, const char *fname, int angle_brackets,
   return file->err_no != ENOENT;
 }
 
+/* Read a file and convert to input charset, the same as if it were being read
+   by a cpp_reader.  */
+
+cpp_converted_source
+cpp_get_converted_source (const char *fname, const char *input_charset)
+{
+  cpp_converted_source res = {};
+  _cpp_file file = {};
+  file.fd = -1;
+  file.name = lbasename (fname);
+  file.path = fname;
+  if (!open_file (&file))
+    return res;
+  const bool ok = read_file_guts (NULL, &file, 0, input_charset);
+  close (file.fd);
+  if (!ok)
+    return res;
+  res.to_free = (char *) file.buffer_start;
+  res.data = (char *) file.buffer;
+  res.len = file.st.st_size;
+  return res;
+}
diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
index 4467c73284d..d02ef4aa054 100644
--- a/libcpp/include/cpplib.h
+++ b/libcpp/include/cpplib.h
@@ -1369,6 +1369,20 @@ extern struct _cpp_file *cpp_get_file (cpp_buffer *);
 extern cpp_buffer *cpp_get_prev (cpp_buffer *);
 extern void cpp_clear_file_cache (cpp_reader *);
 
+/* cpp_get_converted_source returns the contents of the given file, as it exists
+   after cpplib has read it and converted it from the input charset to the
+   source charset.  Return struct will be zero-filled if the data could not be
+   read for any reason.  The data starts at the DATA pointer, but the TO_FREE
+   pointer is what should be passed to free(), as there may be an offset.  */
+struct cpp_converted_source
+{
+  char *to_free;
+  char *data;
+  size_t len;
+};
+cpp_converted_source cpp_get_converted_source (const char *fname,
+					       const char *input_charset);
+
 /* In pch.c */
 struct save_macro_data;
 extern int cpp_save_state (cpp_reader *, FILE *);
@@ -1439,6 +1453,7 @@ class cpp_display_width_computation {
 /* Convenience functions that are simple use cases for class
    cpp_display_width_computation.  Tab characters will be expanded to spaces
    as determined by TABSTOP.  */
+
 int cpp_byte_column_to_display_column (const char *data, int data_length,
 				       int column, int tabstop);
 inline int cpp_display_width (const char *data, int data_length,
@@ -1451,4 +1466,7 @@ int cpp_display_column_to_byte_column (const char *data, int data_length,
 				       int display_col, int tabstop);
 int cpp_wcwidth (cppchar_t c);
 
+bool cpp_input_conversion_is_trivial (const char *input_charset);
+int cpp_check_utf8_bom (const char *data, size_t data_length);
+
 #endif /* ! LIBCPP_CPPLIB_H */
Lewis Hyatt April 30, 2021, 9:53 p.m. UTC | #3
Hi David-

With GCC 11 released now, I thought it might be a good time to ping this patch
that supports -finput-charset for diagnostics please?
[https://gcc.gnu.org/pipermail/gcc-patches/2021-January/564527.html]
Patch still applies to master, and tests still look good as before. If you
have any concerns about the approach to letting front ends configure how
the input is to be converted, I am happy to try other ways. Thanks!

-Lewis

On Fri, Jan 29, 2021 at 10:46:30AM -0500, Lewis Hyatt wrote:
> On Tue, Jan 26, 2021 at 04:02:52PM -0500, David Malcolm wrote:
> > On Fri, 2020-12-18 at 18:03 -0500, Lewis Hyatt wrote:
> > > Hello-
> > > 
> > > The attached patch addresses PR93067:
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067#c0
> > > 
> > > This is similar to the patch I posted last year on the PR, with some
> > tweaks
> > > to make it a little simpler. Recapping some of the commentary on the
> > PR:
> > > 
> > > When source lines are needed for diagnostics output, they are
> > retrieved from
> > > the source file by the fcache infrastructure in input.c, since libcpp
> > has
> > > generally already forgotten them (plus not all front ends are using
> > > libcpp). This infrastructure does not read the files in the same way
> > as
> > > libcpp does; in particular, it does not translate the encoding as
> > requested
> > > by -finput-charset, and it does not strip a UTF-8 byte-order mark if
> > > present. The patch adds this ability. My thinking in deciding how to
> > do it
> > > was the following:
> > > 
> > > - Use of -finput-charset is rare, and use of UTF-8 BOMs must be rarer
> > still,
> > >   so this patch should try hard not to introduce any worse
> > performance
> > >   unless these things are needed.
> > > 
> > > - It is desirable to reuse libcpp's encoding infrastructure from
> > charset.c
> > >   rather than repeat it in input.c. (Notably, libcpp uses iconv but
> > it also
> > >   has hand-coded routines for certain charsets to make sure they are
> > >   available.)
> > > 
> > > - There is a performance degradation required in order to make use of
> > libcpp
> > >   directly, because the input.c infrastructure only reads as much of
> > the
> > >   source file as necessary, whereas libcpp interfaces as-is require
> > to read
> > >   the entire file into memory.
> > > 
> > > - It can't be quite as simple as just "only delegate to libcpp if
> > >   -finput-charset was specified", because the stripping of the UTF-8
> > BOM has
> > >   to happen with or without this option.
> > > 
> > > - So it seemed a reasonable compromise to me, if -finput-charset is
> > >   specified, then use libcpp to convert the file, otherwise, strip
> > the BOM
> > >   in input.c and then process the file the same way it is done now.
> > There's
> > >   a little bit of leakage of charset logic from libcpp this way (for
> > the
> > >   BOM), but it seems worthwhile, since otherwise, diagnostics would
> > always
> > >   be reading the entire file into memory, which is not a cost paid
> > >   currently.
> > 
> > Thanks for the patch; sorry about the delay in reviewing it.
> >
> 
> Thanks for the comments! Here is an updated patch that addresses your
> feedback, plus some responses inline below.
> 
> Bootstrap + regtest all languages was done on x86-64 GNU/Linux. All tests
> the same before and after, plus 6 new PASS.
> 
> FAIL 85 85
> PASS 479191 479197
> UNSUPPORTED 13664 13664
> UNTESTED 129 129
> XFAIL 2292 2292
> XPASS 30 30
> 
> 
> > This mostly seems good to me.
> > 
> > One aspect I'm not quite convinced about is the
> > input_cpp_context.in_use flag.  The input.c machinery is used by
> > diagnostics, and so could be used by middle-end warnings for frontends
> > that don't use libcpp.  Presumably we'd still want to remove the UTF-8
> > BOM for those, and do encoding fixups if necessary - is it more a case
> > of initializing things to express what the expected input charset is?
> > (since that is part of the cpp_options)
> > 
> > c.opt has:
> >   finput-charset=
> >   C ObjC C++ ObjC++ Joined RejectNegative
> >   -finput-charset=<cset>	Specify the default character set for
> > source files.
> > 
> > I believe that D and Go are the two frontends that don't use libcpp for
> > parsing.  I believe Go source is required to be UTF-8 (unsurprisingly
> > considering the heritage of both).  I don't know what source encodings
> > D supports.
> >
> 
> For this patch I was rather singularly focused on libcpp, so I looked
> deeper at the other frontends now. It seems to me that there are basically
> two questions to answer, and the three frontend styles answer this pair in
> three different ways.
> 
> Q1: What is the input charset?
> A1:
> 
>     libcpp: Whatever was passed to -finput-charset (note, for Fortran,
>     -finput-charset is not supported though.)
> 
>     go: Assume UTF-8.
> 
>     D: UTF-8, UTF-16, or UTF-32 (the latter two in either
>        endianness); determined by inspecting the first bytes of the file.
> 
> Q2: How should a UTF-8 BOM be handled?
> A2:
> 
>     libcpp: Treat entirely the same, as if it was not present at all. So
>     a diagnostic referring to the first non-BOM character in the file will
>     point to column 1, not to column 4.
> 
>     go: Treat it like whitespace, ignored for parsing purposes, but still
>     logically part of the file. A diagnostic referring to the first non-BOM
>     character in the file will point to column 4.
> 
>     D: Same as libcpp.
> 
> So input.c's current behavior (with or without my patch) actually matches
> the "go" frontend exactly and does the right thing for it. As you
> thought, though, for D it would be desirable to skip over the UTF-8 BOM
> too.
> 
> D adds an additional wrinkle in that, instead of using -finput-charset, it
> uses its own detection logic -- based on knowledge that the first codepoint
> in the file must be ASCII, it is able to deduce the encoding from the first
> few bytes. This means that in D, each source file can have a different
> encoding, which is not expressible in libcpp frontends currently, and hence
> the notion of a global input_cpp_context with a constant charset is not
> sufficient to capture this.
> 
> In this iteration of the patch, I replaced the input_cpp_context with a more
> general input_context that can handle all of these cases. The context now
> consists of a callback function and a bool, which the frontend is supposed
> to configure. The bool determines whether or not to skip the BOM. The
> callback, when passed a filename, returns the input charset needed to
> convert that file. For libcpp, the filename is not used as the charset is
> determined from -finput-charset for all files. For D, the callback function
> is currently a stub, but it could be expanded to open the file, read the
> first few bytes, and determine the charset to be used. I have not
> implemented it for now, because it seems inelegant to duplicate the logic D
> already has for this detection, but this logic is part of the dmd/ tree
> which I think is maintained external to GCC, and so it didn't seem right to
> attempt to refactor it. I figured there may not be much interest in this
> feature (diagnostics are already unusable for UTF-16 and UTF-32 sources in
> D), and this patch makes no change to it. This patch does fix the handling
> of a UTF-8 BOM in D diagnostics, however.
> 
> > > Separate from the main patch are two testcases that both fail before
> > this
> > > patch and pass after. I attached them gzipped because they use non-
> > standard
> > > encodings that won't email well.
> > > 
> > > The main question I have about the patch is whether I chose a good
> > way to
> > > address the following complication. location_get_source_line() in
> > input.c is
> > > used to generate diagnostics for all front ends, whether they use
> > libcpp to
> > > read the files or not. So input.c needs some way to know whether
> > libcpp is
> > > in use or not. I ended up adding a new function
> > input_initialize_cpp_context(),
> > > which front ends have to call if they are using libcpp to read their
> > > files. Currently that is c-family and fortran. I don't see a simpler
> > way
> > > to do it at least... Even if there were a better way for input.c to
> > find out
> > > the value passed to -finput-charset, it would still need to know
> > whether
> > > libcpp is being used or not.
> > 
> > At some point I want to convert the global state in input.c into a
> > source_manager class, probably hung off the diagnostic_context, so the
> > natural place to put that state would be in there (rather than the
> > fcache being global state).  Perhaps have a source_reader policy class
> > hung off of the source_manager, which would have responsibility for
> > reading and converting the file, either piecemeal, or fully for the
> > "need to use libcpp" case.
> > 
> > But that's not necessary for this fix.
> > 
> > > Please let me know if it looks OK (either now or for stage 1,
> > whatever makes
> > > sense...) bootstrap + regtest all languages on x86-64 GNU/Linux, all
> > tests the
> > > same before & after plus 6 new PASS from the new tests. Thanks!
> > 
> > Various comments inline below.  In particular, does the patch add a
> > memory leak in _cpp_convert_input?
> > 
> > Thanks
> > Dave
> > 
> > [...]
> > 
> > > libcpp/ChangeLog:
> > > 
> > >         PR other/93067
> > >         * charset.c (init_iconv_desc): Adapt to permit PFILE argument
> > to
> > >         be NULL.
> > >         (_cpp_convert_input): Likewise. Also move UTF-8 BOM logic
> > to...
> > >         (cpp_check_utf8_bom): ...here.  New function.
> > >         (cpp_input_conversion_is_trivial): New function.
> > >         * files.c (read_file_guts): Allow PFILE argument to be NULL.
> > Add
> > >         INPUT_CHARSET argument as an alternate source of this
> > information.
> > >         (cpp_get_converted_source): New function.
> > >         * include/cpplib.h (struct cpp_converted_source): Declare.
> > >         (cpp_get_converted_source): Declare.
> > >         (cpp_input_conversion_is_trivial): Declare.
> > >         (cpp_check_utf8_bom): Declare.
> > > 
> > > gcc/testsuite/ChangeLog:
> > > 
> > >         PR other/93067
> > >         * gcc.dg/diagnostic-input-charset-1.c: New test.
> > >         * gcc.dg/diagnostic-input-charset-2.c: New test.
> > 
> > Maybe rename the 2nd test to "diagnostic-input-utf8-bom.c" ?
> >
> 
> Done.
> 
> > > 
> > > diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
> > > index 59cabd12407..d5aa7859cc1 100644
> > > --- a/gcc/c-family/c-opts.c
> > > +++ b/gcc/c-family/c-opts.c
> > > @@ -1124,6 +1124,10 @@ c_common_post_options (const char **pfilename)
> > >    cpp_post_options (parse_in);
> > >    init_global_opts_from_cpp (&global_options, cpp_get_options
> > (parse_in));
> > >  
> > > +  /* Let diagnostics infrastructure know we are using libcpp to read
> > > +     the input.  */
> > > +  input_initialize_cpp_context (parse_in);
> > 
> > As noted above I think the most significant thing here is getting the
> > source encoding from parse_in, so maybe reword the comment accordingly.
> > 
> > [...]
> > 
> > > diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
> > > index 51baf141711..2b12a98afc0 100644
> > > --- a/gcc/fortran/cpp.c
> > > +++ b/gcc/fortran/cpp.c
> > > @@ -493,6 +493,10 @@ gfc_cpp_post_options (void)
> > >  
> > >    cpp_post_options (cpp_in);
> > >  
> > > +  /* Let diagnostics infrastructure know we are using libcpp to read
> > > +     the input.  */
> > > +  input_initialize_cpp_context (cpp_in);
> > 
> > Likewise.
> >
> 
> I kept this in mind when redoing this part, hopefully it is better now.
> 
> > [...]
> > 
> > > diff --git a/gcc/input.c b/gcc/input.c
> > > index 29d10f06b86..1dcdd464bc1 100644
> > > --- a/gcc/input.c
> > > +++ b/gcc/input.c
> > 
> > [...]
> > 
> > > @@ -394,6 +432,42 @@ add_file_to_cache_tab (const char *file_path)
> > >    r->total_lines = total_lines_num (file_path);
> > >    r->missing_trailing_newline = true;
> > >  
> > > +  /* If libcpp is managing the reading, then there are two cases we
> > need to
> > > +     consider.  If -finput-charset is not in effect, then we just
> > need to
> > > +     strip a UTF-8 BOM, so do that ourselves rather than calling
> > into libcpp so
> > > +     as to avoid paying the penalty of using libcpp, namely that the
> > entire file
> > > +     must be read at once.  In the (generally rare) case that a non-
> > trivial
> > > +     -finput-charset is needed, then go ahead and use libcpp to read
> > the whole
> > > +     file and do the conversion.  */
> > > +  if (input_cpp_context.in_use)
> > > +    {
> > > +      if (input_cpp_context.conversion_is_trivial)
> > > +       {
> > > +         /* Strip the UTF8 BOM if present.  */
> > > +         if (read_data (r))
> > > +           {
> > > +             const int offset = cpp_check_utf8_bom (r->data, r-
> > >nb_read);
> > > +             r->offset_buffer (offset);
> > > +             r->nb_read -= offset;
> > > +           }
> > 
> > This assumes that trivial conversions are always UTF8 to UTF8, which
> > assumes that SOURCE_CHARSET is UTF8, which isn't the case for EBCDIC...
> > 
> > [...]
> > 
> > > +/* Utility to strip a UTF-8 byte order marking from the beginning
> > > +   of a buffer.  Returns the number of bytes to skip, which
> > currently
> > > +   will be either 0 or 3.  */
> > > +int
> > > +cpp_check_utf8_bom (const char *data, size_t data_length)
> > > +{
> > > +
> > > +#if HOST_CHARSET == HOST_CHARSET_ASCII
> > > +  const unsigned char *udata = (const unsigned char *) data;
> > > +  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
> > > +      && udata[2] == 0xbf)
> > > +    return 3;
> > > +#endif
> > > +
> > > +  return 0;
> > > +}
> > 
> > ...though the check on HOST_CHARSET == HOST_CHARSET_ASCII makes this a
> > no-op for the EBCDIC case, so that's OK.
> > 
> > [...]
> > 
> > > @@ -2158,17 +2191,28 @@ _cpp_convert_input (cpp_reader *pfile, const
> > char *input_charset,
> > >        to.text = XNEWVEC (uchar, to.asize);
> > >        to.len = 0;
> > >  
> > > -      if (!APPLY_CONVERSION (input_cset, input, len, &to))
> > > -       cpp_error (pfile, CPP_DL_ERROR,
> > > -                  "failure to convert %s to %s",
> > > -                  CPP_OPTION (pfile, input_charset),
> > SOURCE_CHARSET);
> > > +      const bool ok = APPLY_CONVERSION (input_cset, input, len,
> > &to);
> > >  
> > > -      free (input);
> > > -    }
> > > +      /* Handle conversion failure.  */
> > > +      if (!ok)
> > > +       {
> > > +         free (input);
> > > +         if (!pfile)
> > > +           {
> > > +             XDELETEVEC (to.text);
> > > +             *buffer_start = NULL;
> > > +             *st_size = 0;
> > > +             return NULL;
> > > +           }
> > > +         cpp_error (pfile, CPP_DL_ERROR,
> > > +                    "failure to convert %s to %s",
> > > +                    CPP_OPTION (pfile, input_charset),
> > SOURCE_CHARSET);
> > > +       }
> > > +    }
> > >
> > 
> > Doesn't this lose the
> >   free (input);
> > for the case where the conversion is successful?  Is this a leak?
> >
> 
> Ugh, sorry to waste your time with that mistake. Fixed.
> 
> > [...]
> > 
> > > diff --git a/libcpp/files.c b/libcpp/files.c
> > > index 301b2379a23..178bb9ed1e6 100644
> > > --- a/libcpp/files.c
> > > +++ b/libcpp/files.c
> > > @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile,
> > _cpp_file *file,
> > >  static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
> > >                               bool *invalid_pch, location_t loc);
> > >  static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
> > > -                           location_t loc);
> > > +                           location_t loc, const char *input_charset
> > = NULL);
> > 
> > read_file_guts is only used in one place before the patch, so rather
> > than add a default argument, I think it's simpler to pass in the
> > input_charset at that call.
> > 
> > > @@ -671,18 +671,32 @@ _cpp_find_file (cpp_reader *pfile, const char
> > *fname, cpp_dir *start_dir,
> > >  
> > >     Use LOC for any diagnostics.
> > >  
> > > +   The input charset may be specified in the INPUT_CHARSET argument,
> > or
> > > +   else it will be taken from PFILE.
> > 
> > ...and doing so means that input_charset will be non-NULL at all
> > callers.
> > 
> > > +   PFILE may be NULL.  In this case, no diagnostics are issued, and
> > the
> > > +   input charset must be specified in the arguments.
> > > +
> > >     FIXME: Flush file cache and try again if we run out of memory. 
> > */
> > >  static bool
> > > -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
> > > +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
> > > +               const char *input_charset)
> > >  {
> > >    ssize_t size, total, count;
> > >    uchar *buf;
> > >    bool regular;
> > >  
> > > +  if (!input_charset)
> > > +    {
> > > +      gcc_assert (pfile);
> > > +      input_charset = CPP_OPTION (pfile, input_charset);
> > > +    }
> > 
> > ...which would allow for this logic to be removed, moving this to the
> > existing caller, I believe.
> >
> 
> Much better your way, done.
> 
> Thanks for taking a look at it, sorry this version is a bit changed from the
> previous one. FWIW, the libcpp pieces were not touched other than addressing
> your comments; the new logic is in input.h/input.c plus the associated
> tweaks to the frontends.
> 
> -Lewis

> From: Lewis Hyatt <lhyatt@gmail.com>
> Date: Wed, 27 Jan 2021 18:23:13 -0500
> Subject: [PATCH] diagnostics: Support for -finput-charset [PR93067]
> 
> Adds the logic to handle -finput-charset in layout_get_source_line(), so that
> source lines are converted from their input encodings prior to being output by
> diagnostics machinery. Also adds the ability to strip a UTF-8 BOM similarly.
> 
> gcc/c-family/ChangeLog:
> 
> 	PR other/93067
> 	* c-opts.c (c_common_input_charset_cb): New function.
> 	(c_common_post_options): Call new function input_initialize_context().
> 
> gcc/d/ChangeLog:
> 
> 	PR other/93067
> 	* d-lang.cc (d_input_charset_callback): New function.
> 	(d_init): Call new function input_initialize_context().
> 
> gcc/fortran/ChangeLog:
> 
> 	PR other/93067
> 	* cpp.c (gfc_cpp_post_options): Call new function
> 	input_initialize_cpp_context().
> 
> gcc/ChangeLog:
> 
> 	PR other/93067
> 	* input.c (default_charset_callback): New function.
> 	(input_initialize_context): New function.
> 	(read_data): Add prototype.
> 	(add_file_to_cache_tab): Use libcpp to convert input encoding when
> 	needed. Strip UTF-8 BOM when needed.
> 	(class fcache): Add new members to track input encoding conversion.
> 	(fcache::fcache): Adapt for new members.
> 	(fcache::~fcache): Likewise.
> 	(maybe_grow): Likewise.
> 	(needs_read): Adapt to be aware that fp member may be NULL now.
> 	(get_next_line): Likewise.
> 	* input.h (input_initialize_context): Declare.
> 
> libcpp/ChangeLog:
> 
> 	PR other/93067
> 	* charset.c (init_iconv_desc): Adapt to permit PFILE argument to
> 	be NULL.
> 	(_cpp_convert_input): Likewise. Also move UTF-8 BOM logic to...
> 	(cpp_check_utf8_bom): ...here.  New function.
> 	(cpp_input_conversion_is_trivial): New function.
> 	* files.c (read_file_guts): Allow PFILE argument to be NULL.  Add
> 	INPUT_CHARSET argument as an alternate source of this information.
> 	(read_file): Pass the new argument to read_file_guts.
> 	(cpp_get_converted_source): New function.
> 	* include/cpplib.h (struct cpp_converted_source): Declare.
> 	(cpp_get_converted_source): Declare.
> 	(cpp_input_conversion_is_trivial): Declare.
> 	(cpp_check_utf8_bom): Declare.
> 
> gcc/testsuite/ChangeLog:
> 
> 	PR other/93067
> 	* gcc.dg/diagnostic-input-charset-1.c: New test.
> 	* gcc.dg/diagnostic-input-utf8-bom.c: New test.
> 
> diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
> index bd15b9cd902..2f58e8413b9 100644
> --- a/gcc/c-family/c-opts.c
> +++ b/gcc/c-family/c-opts.c
> @@ -188,6 +188,14 @@ c_common_diagnostics_set_defaults (diagnostic_context *context)
>    context->opt_permissive = OPT_fpermissive;
>  }
>  
> +/* Input charset configuration for diagnostics.  */
> +static const char *
> +c_common_input_charset_cb (const char * /*filename*/)
> +{
> +  const char *cs = cpp_opts->input_charset;
> +  return cpp_input_conversion_is_trivial (cs) ? NULL : cs;
> +}
> +
>  /* Whether options from all C-family languages should be accepted
>     quietly.  */
>  static bool accept_all_c_family_options = false;
> @@ -1131,6 +1139,10 @@ c_common_post_options (const char **pfilename)
>    cpp_post_options (parse_in);
>    init_global_opts_from_cpp (&global_options, cpp_get_options (parse_in));
>  
> +  /* Let diagnostics infrastructure know how to convert input files the same
> +     way libcpp will do it, namely using the configured input charset and
> +     skipping a UTF-8 BOM if present.  */
> +  input_initialize_context (c_common_input_charset_cb, true);
>    input_location = UNKNOWN_LOCATION;
>  
>    *pfilename = this_input_filename
> diff --git a/gcc/d/d-lang.cc b/gcc/d/d-lang.cc
> index 0fd207da7f3..9f170c09886 100644
> --- a/gcc/d/d-lang.cc
> +++ b/gcc/d/d-lang.cc
> @@ -50,6 +50,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "output.h"
>  #include "print-tree.h"
>  #include "debug.h"
> +#include "input.h"
>  
>  #include "d-tree.h"
>  #include "id.h"
> @@ -362,6 +363,19 @@ d_option_lang_mask (void)
>    return CL_D;
>  }
>  
> +/* Implements input charset and BOM skipping configuration for
> +   diagnostics.  */
> +static const char *d_input_charset_callback (const char * /*filename*/)
> +{
> +  /* TODO: The input charset is automatically determined by code in
> +     dmd/dmodule.c based on the contents of the file.  If this detection
> +     logic were factored out and could be reused here, then we would be able
> +     to return UTF-16 or UTF-32 as needed here.  For now, we return always
> +     NULL, which means no conversion is necessary, i.e. the input is assumed
> +     to be UTF-8 when diagnostics read this file.  */
> +  return NULL;
> +}
> +
>  /* Implements the lang_hooks.init routine for language D.  */
>  
>  static bool
> @@ -373,6 +387,10 @@ d_init (void)
>    Expression::_init ();
>    Objc::_init ();
>  
> +  /* Diagnostics input init, to enable BOM skipping and
> +     input charset conversion.  */
> +  input_initialize_context (d_input_charset_callback, true);
> +
>    /* Back-end init.  */
>    global_binding_level = ggc_cleared_alloc <binding_level> ();
>    current_binding_level = global_binding_level;
> diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
> index 419cd6accbe..8b564888673 100644
> --- a/gcc/fortran/cpp.c
> +++ b/gcc/fortran/cpp.c
> @@ -493,6 +493,12 @@ gfc_cpp_post_options (void)
>  
>    cpp_post_options (cpp_in);
>  
> +
> +  /* Let diagnostics infrastructure know how to convert input files the same
> +     way libcpp will do it, namely, with no charset conversion but with
> +     skipping of a UTF-8 BOM if present.  */
> +  input_initialize_context (NULL, true);
> +
>    gfc_cpp_register_include_paths ();
>  }
>  
> diff --git a/gcc/input.c b/gcc/input.c
> index 9e39e7df83c..f8f6dc7e4ca 100644
> --- a/gcc/input.c
> +++ b/gcc/input.c
> @@ -30,6 +30,29 @@ along with GCC; see the file COPYING3.  If not see
>  #define HAVE_ICONV 0
>  #endif
>  
> +/* Input charset configuration.  */
> +static const char *default_charset_callback (const char *)
> +{
> +  return NULL;
> +}
> +
> +struct
> +{
> +  input_context_charset_callback ccb;
> +  bool should_skip_bom;
> +} static input_context =
> +{
> +  default_charset_callback,
> +  false
> +};
> +
> +void input_initialize_context (input_context_charset_callback ccb,
> +			       bool should_skip_bom)
> +{
> +  input_context.ccb = (ccb ? ccb : default_charset_callback);
> +  input_context.should_skip_bom = should_skip_bom;
> +}
> +
>  /* This is a cache used by get_next_line to store the content of a
>     file to be searched for file lines.  */
>  class fcache
> @@ -78,6 +101,10 @@ public:
>       far.  */
>    char *data;
>  
> +  /* The allocated buffer to be freed may start a little earlier than DATA,
> +     e.g. if a UTF8 BOM was skipped at the beginning.  */
> +  int alloc_offset;
> +
>    /*  The size of the DATA array above.*/
>    size_t size;
>  
> @@ -118,6 +145,17 @@ public:
>  
>    fcache ();
>    ~fcache ();
> +
> +  void offset_buffer (int offset)
> +  {
> +    gcc_assert (offset < 0 ? alloc_offset + offset >= 0
> +		: (size_t) offset <= size);
> +    gcc_assert (data);
> +    alloc_offset += offset;
> +    data += offset;
> +    size -= offset;
> +  }
> +
>  };
>  
>  /* Current position in real source file.  */
> @@ -364,6 +402,9 @@ evicted_cache_tab_entry (unsigned *highest_use_count)
>    return to_evict;
>  }
>  
> +static bool
> +read_data (fcache *c);
> +
>  /* Create the cache used for the content of a given file to be
>     accessed by caret diagnostic.  This cache is added to an array of
>     cache and can be retrieved by lookup_file_in_cache_tab.  This
> @@ -384,6 +425,8 @@ add_file_to_cache_tab (const char *file_path)
>    if (r->fp)
>      fclose (r->fp);
>    r->fp = fp;
> +  if (r->alloc_offset)
> +    r->offset_buffer (-r->alloc_offset);
>    r->nb_read = 0;
>    r->line_start_idx = 0;
>    r->line_num = 0;
> @@ -394,6 +437,33 @@ add_file_to_cache_tab (const char *file_path)
>    r->total_lines = total_lines_num (file_path);
>    r->missing_trailing_newline = true;
>  
> +  /* Check the frontend configuration to determine if we need to do any
> +     transformations, such as charset conversion or BOM skipping.  */
> +  if (const char *input_charset = input_context.ccb (file_path))
> +    {
> +      /* Need a full-blown conversion of the input charset.  */
> +      fclose (r->fp);
> +      r->fp = NULL;
> +      const cpp_converted_source cs
> +	= cpp_get_converted_source (file_path, input_charset);
> +      if (!cs.data)
> +	return NULL;
> +      if (r->data)
> +	XDELETEVEC (r->data);
> +      r->data = cs.data;
> +      r->nb_read = r->size = cs.len;
> +      r->alloc_offset = cs.data - cs.to_free;
> +    }
> +  else if (input_context.should_skip_bom)
> +    {
> +      if (read_data (r))
> +	{
> +	  const int offset = cpp_check_utf8_bom (r->data, r->nb_read);
> +	  r->offset_buffer (offset);
> +	  r->nb_read -= offset;
> +	}
> +    }
> +
>    return r;
>  }
>  
> @@ -415,7 +485,7 @@ lookup_or_add_file_to_cache_tab (const char *file_path)
>     diagnostic.  */
>  
>  fcache::fcache ()
> -: use_count (0), file_path (NULL), fp (NULL), data (0),
> +: use_count (0), file_path (NULL), fp (NULL), data (0), alloc_offset (0),
>    size (0), nb_read (0), line_start_idx (0), line_num (0),
>    total_lines (0), missing_trailing_newline (true)
>  {
> @@ -433,6 +503,7 @@ fcache::~fcache ()
>      }
>    if (data)
>      {
> +      offset_buffer (-alloc_offset);
>        XDELETEVEC (data);
>        data = 0;
>      }
> @@ -447,9 +518,9 @@ fcache::~fcache ()
>  static bool
>  needs_read (fcache *c)
>  {
> -  return (c->nb_read == 0
> -	  || c->nb_read == c->size
> -	  || (c->line_start_idx >= c->nb_read - 1));
> +  return c->fp && (c->nb_read == 0
> +		   || c->nb_read == c->size
> +		   || (c->line_start_idx >= c->nb_read - 1));
>  }
>  
>  /*  Return TRUE iff the cache is full and thus needs to be
> @@ -469,9 +540,20 @@ maybe_grow (fcache *c)
>    if (!needs_grow (c))
>      return;
>  
> -  size_t size = c->size == 0 ? fcache_buffer_size : c->size * 2;
> -  c->data = XRESIZEVEC (char, c->data, size);
> -  c->size = size;
> +  if (!c->data)
> +    {
> +      gcc_assert (c->size == 0 && c->alloc_offset == 0);
> +      c->size = fcache_buffer_size;
> +      c->data = XNEWVEC (char, c->size);
> +    }
> +  else
> +    {
> +      const int offset = c->alloc_offset;
> +      c->offset_buffer (-offset);
> +      c->size *= 2;
> +      c->data = XRESIZEVEC (char, c->data, c->size);
> +      c->offset_buffer (offset);
> +    }
>  }
>  
>  /*  Read more data into the cache.  Extends the cache if need be.
> @@ -570,7 +652,7 @@ get_next_line (fcache *c, char **line, ssize_t *line_len)
>        c->missing_trailing_newline = false;
>      }
>  
> -  if (ferror (c->fp))
> +  if (c->fp && ferror (c->fp))
>      return false;
>  
>    /* At this point, we've found the end of the of line.  It either
> diff --git a/gcc/input.h b/gcc/input.h
> index f1ef5d76cfd..8a793399958 100644
> --- a/gcc/input.h
> +++ b/gcc/input.h
> @@ -214,4 +214,21 @@ class GTY(()) string_concat_db
>    hash_map <location_hash, string_concat *> *m_table;
>  };
>  
> +/* Because we read source files a second time after the frontend did it the
> +   first time, we need to know how the frontend handled things like character
> +   set conversion and UTF-8 BOM stripping, in order to make everything
> +   consistent.  This function needs to be called by each frontend that requires
> +   non-default behavior, to inform the diagnostics infrastructure how input is
> +   to be processed.  The default behavior is to do no conversion and not to
> +   strip a UTF-8 BOM.
> +
> +   The callback should return the input charset to be used to convert the given
> +   file's contents to UTF-8, or it should return NULL if no conversion is needed
> +   for this file.  SHOULD_SKIP_BOM only applies in case no conversion was
> +   performed, and if true, it will cause a UTF-8 BOM to be skipped at the
> +   beginning of the file.  (In case a conversion was performed, the BOM is
> +   rather skipped as part of the conversion process.)  */
> +typedef const char * (*input_context_charset_callback)(const char *);
> +void input_initialize_context (input_context_charset_callback ccb,
> +			       bool should_skip_bom);
>  #endif
> diff --git a/libcpp/charset.c b/libcpp/charset.c
> index 99a9b73e5ab..61881f978a8 100644
> --- a/libcpp/charset.c
> +++ b/libcpp/charset.c
> @@ -630,7 +630,11 @@ static const struct cpp_conversion conversion_tab[] = {
>     cset_converter structure for conversion from FROM to TO.  If
>     iconv_open() fails, issue an error and return an identity
>     converter.  Silently return an identity converter if FROM and TO
> -   are identical.  */
> +   are identical.
> +
> +   PFILE is only used for generating diagnostics; setting it to NULL
> +   suppresses diagnostics.  */
> +
>  static struct cset_converter
>  init_iconv_desc (cpp_reader *pfile, const char *to, const char *from)
>  {
> @@ -672,25 +676,31 @@ init_iconv_desc (cpp_reader *pfile, const char *to, const char *from)
>  
>        if (ret.cd == (iconv_t) -1)
>  	{
> -	  if (errno == EINVAL)
> -	    cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
> -		       "conversion from %s to %s not supported by iconv",
> -		       from, to);
> -	  else
> -	    cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
> -
> +	  if (pfile)
> +	    {
> +	      if (errno == EINVAL)
> +		cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
> +			   "conversion from %s to %s not supported by iconv",
> +			   from, to);
> +	      else
> +		cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
> +	    }
>  	  ret.func = convert_no_conversion;
>  	}
>      }
>    else
>      {
> -      cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
> -		 "no iconv implementation, cannot convert from %s to %s",
> -		 from, to);
> +      if (pfile)
> +	{
> +	  cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
> +		     "no iconv implementation, cannot convert from %s to %s",
> +		     from, to);
> +	}
>        ret.func = convert_no_conversion;
>        ret.cd = (iconv_t) -1;
>        ret.width = -1;
>      }
> +
>    return ret;
>  }
>  
> @@ -2122,6 +2132,25 @@ _cpp_interpret_identifier (cpp_reader *pfile, const uchar *id, size_t len)
>  				  buf, bufp - buf, HT_ALLOC));
>  }
>  
> +
> +/* Utility to strip a UTF-8 byte order marking from the beginning
> +   of a buffer.  Returns the number of bytes to skip, which currently
> +   will be either 0 or 3.  */
> +int
> +cpp_check_utf8_bom (const char *data, size_t data_length)
> +{
> +
> +#if HOST_CHARSET == HOST_CHARSET_ASCII
> +  const unsigned char *udata = (const unsigned char *) data;
> +  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
> +      && udata[2] == 0xbf)
> +    return 3;
> +#endif
> +
> +  return 0;
> +}
> +
> +
>  /* Convert an input buffer (containing the complete contents of one
>     source file) from INPUT_CHARSET to the source character set.  INPUT
>     points to the input buffer, SIZE is its allocated size, and LEN is
> @@ -2135,7 +2164,11 @@ _cpp_interpret_identifier (cpp_reader *pfile, const uchar *id, size_t len)
>     INPUT is expected to have been allocated with xmalloc.  This
>     function will either set *BUFFER_START to INPUT, or free it and set
>     *BUFFER_START to a pointer to another xmalloc-allocated block of
> -   memory.  */
> +   memory.
> +
> +   PFILE is only used to generate diagnostics; setting it to NULL suppresses
> +   diagnostics, and causes a return of NULL if there was any error instead.  */
> +
>  uchar * 
>  _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
>  		    uchar *input, size_t size, size_t len,
> @@ -2158,17 +2191,27 @@ _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
>        to.text = XNEWVEC (uchar, to.asize);
>        to.len = 0;
>  
> -      if (!APPLY_CONVERSION (input_cset, input, len, &to))
> -	cpp_error (pfile, CPP_DL_ERROR,
> -		   "failure to convert %s to %s",
> -		   CPP_OPTION (pfile, input_charset), SOURCE_CHARSET);
> -
> +      const bool ok = APPLY_CONVERSION (input_cset, input, len, &to);
>        free (input);
> -    }
>  
> -  /* Clean up the mess.  */
> -  if (input_cset.func == convert_using_iconv)
> -    iconv_close (input_cset.cd);
> +      /* Clean up the mess.  */
> +      if (input_cset.func == convert_using_iconv)
> +	iconv_close (input_cset.cd);
> +
> +      /* Handle conversion failure.  */
> +      if (!ok)
> +	{
> +	  if (!pfile)
> +	    {
> +	      XDELETEVEC (to.text);
> +	      *buffer_start = NULL;
> +	      *st_size = 0;
> +	      return NULL;
> +	    }
> +	  cpp_error (pfile, CPP_DL_ERROR, "failure to convert %s to %s",
> +		     input_charset, SOURCE_CHARSET);
> +	}
> +    }
>  
>    /* Resize buffer if we allocated substantially too much, or if we
>       haven't enough space for the \n-terminator or following
> @@ -2192,19 +2235,14 @@ _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
>  
>    buffer = to.text;
>    *st_size = to.len;
> -#if HOST_CHARSET == HOST_CHARSET_ASCII
> -  /* The HOST_CHARSET test just above ensures that the source charset
> -     is UTF-8.  So, ignore a UTF-8 BOM if we see one.  Note that
> -     glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
> +
> +  /* Ignore a UTF-8 BOM if we see one and the source charset is UTF-8.  Note
> +     that glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
>       BOM -- however, even if it did, we would still need this code due
>       to the 'convert_no_conversion' case.  */
> -  if (to.len >= 3 && to.text[0] == 0xef && to.text[1] == 0xbb
> -      && to.text[2] == 0xbf)
> -    {
> -      *st_size -= 3;
> -      buffer += 3;
> -    }
> -#endif
> +  const int bom_len = cpp_check_utf8_bom ((const char *) to.text, to.len);
> +  *st_size -= bom_len;
> +  buffer += bom_len;
>  
>    *buffer_start = to.text;
>    return buffer;
> @@ -2244,6 +2282,13 @@ _cpp_default_encoding (void)
>    return current_encoding;
>  }
>  
> +/* Check if the configured input charset requires no conversion, other than
> +   possibly stripping a UTF-8 BOM.  */
> +bool cpp_input_conversion_is_trivial (const char *input_charset)
> +{
> +  return !strcasecmp (input_charset, SOURCE_CHARSET);
> +}
> +
>  /* Implementation of class cpp_string_location_reader.  */
>  
>  /* Constructor for cpp_string_location_reader.  */
> diff --git a/libcpp/files.c b/libcpp/files.c
> index 5ea3f8e1bf3..d91606c1532 100644
> --- a/libcpp/files.c
> +++ b/libcpp/files.c
> @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile, _cpp_file *file,
>  static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
>  			      bool *invalid_pch, location_t loc);
>  static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
> -			    location_t loc);
> +			    location_t loc, const char *input_charset);
>  static bool read_file (cpp_reader *pfile, _cpp_file *file,
>  		       location_t loc);
>  static struct cpp_dir *search_path_head (cpp_reader *, const char *fname,
> @@ -671,9 +671,12 @@ _cpp_find_file (cpp_reader *pfile, const char *fname, cpp_dir *start_dir,
>  
>     Use LOC for any diagnostics.
>  
> +   PFILE may be NULL.  In this case, no diagnostics are issued.
> +
>     FIXME: Flush file cache and try again if we run out of memory.  */
>  static bool
> -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
> +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
> +		const char *input_charset)
>  {
>    ssize_t size, total, count;
>    uchar *buf;
> @@ -681,8 +684,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
>  
>    if (S_ISBLK (file->st.st_mode))
>      {
> -      cpp_error_at (pfile, CPP_DL_ERROR, loc,
> -		    "%s is a block device", file->path);
> +      if (pfile)
> +	cpp_error_at (pfile, CPP_DL_ERROR, loc,
> +		      "%s is a block device", file->path);
>        return false;
>      }
>  
> @@ -699,8 +703,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
>  	 does not bite us.  */
>        if (file->st.st_size > INTTYPE_MAXIMUM (ssize_t))
>  	{
> -	  cpp_error_at (pfile, CPP_DL_ERROR, loc,
> -			"%s is too large", file->path);
> +	  if (pfile)
> +	    cpp_error_at (pfile, CPP_DL_ERROR, loc,
> +			  "%s is too large", file->path);
>  	  return false;
>  	}
>  
> @@ -733,29 +738,29 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
>  
>    if (count < 0)
>      {
> -      cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
> +      if (pfile)
> +	cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
>        free (buf);
>        return false;
>      }
>  
> -  if (regular && total != size && STAT_SIZE_RELIABLE (file->st))
> +  if (pfile && regular && total != size && STAT_SIZE_RELIABLE (file->st))
>      cpp_error_at (pfile, CPP_DL_WARNING, loc,
>  	       "%s is shorter than expected", file->path);
>  
>    file->buffer = _cpp_convert_input (pfile,
> -				     CPP_OPTION (pfile, input_charset),
> +				     input_charset,
>  				     buf, size + 16, total,
>  				     &file->buffer_start,
>  				     &file->st.st_size);
> -  file->buffer_valid = true;
> -
> -  return true;
> +  file->buffer_valid = file->buffer;
> +  return file->buffer_valid;
>  }
>  
>  /* Convenience wrapper around read_file_guts that opens the file if
>     necessary and closes the file descriptor after reading.  FILE must
>     have been passed through find_file() at some stage.  Use LOC for
> -   any diagnostics.  */
> +   any diagnostics.  Unlike read_file_guts(), PFILE may not be NULL.  */
>  static bool
>  read_file (cpp_reader *pfile, _cpp_file *file, location_t loc)
>  {
> @@ -773,7 +778,8 @@ read_file (cpp_reader *pfile, _cpp_file *file, location_t loc)
>        return false;
>      }
>  
> -  file->dont_read = !read_file_guts (pfile, file, loc);
> +  file->dont_read = !read_file_guts (pfile, file, loc,
> +				     CPP_OPTION (pfile, input_charset));
>    close (file->fd);
>    file->fd = -1;
>  
> @@ -2118,3 +2124,25 @@ _cpp_has_header (cpp_reader *pfile, const char *fname, int angle_brackets,
>    return file->err_no != ENOENT;
>  }
>  
> +/* Read a file and convert to input charset, the same as if it were being read
> +   by a cpp_reader.  */
> +
> +cpp_converted_source
> +cpp_get_converted_source (const char *fname, const char *input_charset)
> +{
> +  cpp_converted_source res = {};
> +  _cpp_file file = {};
> +  file.fd = -1;
> +  file.name = lbasename (fname);
> +  file.path = fname;
> +  if (!open_file (&file))
> +    return res;
> +  const bool ok = read_file_guts (NULL, &file, 0, input_charset);
> +  close (file.fd);
> +  if (!ok)
> +    return res;
> +  res.to_free = (char *) file.buffer_start;
> +  res.data = (char *) file.buffer;
> +  res.len = file.st.st_size;
> +  return res;
> +}
> diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
> index 4467c73284d..d02ef4aa054 100644
> --- a/libcpp/include/cpplib.h
> +++ b/libcpp/include/cpplib.h
> @@ -1369,6 +1369,20 @@ extern struct _cpp_file *cpp_get_file (cpp_buffer *);
>  extern cpp_buffer *cpp_get_prev (cpp_buffer *);
>  extern void cpp_clear_file_cache (cpp_reader *);
>  
> +/* cpp_get_converted_source returns the contents of the given file, as it exists
> +   after cpplib has read it and converted it from the input charset to the
> +   source charset.  Return struct will be zero-filled if the data could not be
> +   read for any reason.  The data starts at the DATA pointer, but the TO_FREE
> +   pointer is what should be passed to free(), as there may be an offset.  */
> +struct cpp_converted_source
> +{
> +  char *to_free;
> +  char *data;
> +  size_t len;
> +};
> +cpp_converted_source cpp_get_converted_source (const char *fname,
> +					       const char *input_charset);
> +
>  /* In pch.c */
>  struct save_macro_data;
>  extern int cpp_save_state (cpp_reader *, FILE *);
> @@ -1439,6 +1453,7 @@ class cpp_display_width_computation {
>  /* Convenience functions that are simple use cases for class
>     cpp_display_width_computation.  Tab characters will be expanded to spaces
>     as determined by TABSTOP.  */
> +
>  int cpp_byte_column_to_display_column (const char *data, int data_length,
>  				       int column, int tabstop);
>  inline int cpp_display_width (const char *data, int data_length,
> @@ -1451,4 +1466,7 @@ int cpp_display_column_to_byte_column (const char *data, int data_length,
>  				       int display_col, int tabstop);
>  int cpp_wcwidth (cppchar_t c);
>  
> +bool cpp_input_conversion_is_trivial (const char *input_charset);
> +int cpp_check_utf8_bom (const char *data, size_t data_length);
> +
>  #endif /* ! LIBCPP_CPPLIB_H */
Iain Buclaw May 18, 2021, 9:31 a.m. UTC | #4
Excerpts from Lewis Hyatt via Gcc-patches's message of January 29, 2021 4:46 pm:
> Q1: What is the input charset?
> A1:
> 
>     libcpp: Whatever was passed to -finput-charset (note, for Fortran,
>     -finput-charset is not supported though.)
> 
>     go: Assume UTF-8.
> 
>     D: UTF-8, UTF-16, or UTF-32 (the latter two in either
>        endianness); determined by inspecting the first bytes of the file.
> 
> Q2: How should a UTF-8 BOM be handled?
> A2:
> 
>     libcpp: Treat entirely the same, as if it was not present at all. So
>     a diagnostic referring to the first non-BOM character in the file will
>     point to column 1, not to column 4.
> 
>     go: Treat it like whitespace, ignored for parsing purposes, but still
>     logically part of the file. A diagnostic referring to the first non-BOM
>     character in the file will point to column 4.
> 
>     D: Same as libcpp.
> 
> So input.c's current behavior (with or without my patch) actually matches
> the "go" frontend exactly and does the right thing for it. As you
> thought, though, for D it would be desirable to skip over the UTF-8 BOM
> too.
> 
> D adds an additional wrinkle in that, instead of using -finput-charset, it
> uses its own detection logic -- based on knowledge that the first codepoint
> in the file must be ASCII, it is able to deduce the encoding from the first
> few bytes. This means that in D, each source file can have a different
> encoding, which is not expressible in libcpp frontends currently, and hence
> the notion of a global input_cpp_context with a constant charset is not
> sufficient to capture this.
> 

Yes, that is correct for D.

> In this iteration of the patch, I replaced the input_cpp_context with a more
> general input_context that can handle all of these cases. The context now
> consists of a callback function and a bool, which the frontend is supposed
> to configure. The bool determines whether or not to skip the BOM. The
> callback, when passed a filename, returns the input charset needed to
> convert that file. For libcpp, the filename is not used as the charset is
> determined from -finput-charset for all files. For D, the callback function
> is currently a stub, but it could be expanded to open the file, read the
> first few bytes, and determine the charset to be used. I have not
> implemented it for now, because it seems inelegant to duplicate the logic D
> already has for this detection, but this logic is part of the dmd/ tree
> which I think is maintained external to GCC, and so it didn't seem right to
> attempt to refactor it. I figured there may not be much interest in this
> feature (diagnostics are already unusable for UTF-16 and UTF-32 sources in
> D), and this patch makes no change to it. This patch does fix the handling
> of a UTF-8 BOM in D diagnostics, however.
> 

And yes, dmd/ is maintained in a separate upstream repository.

I wouldn't worry about UTF-16 and UTF-32.  I expect all the sources
written in those charsets are found only in the D testsuite.

> gcc/d/ChangeLog:
> 
> 	PR other/93067
> 	* d-lang.cc (d_input_charset_callback): New function.
> 	(d_init): Call new function input_initialize_context().
> 

> diff --git a/gcc/d/d-lang.cc b/gcc/d/d-lang.cc
> index 0fd207da7f3..9f170c09886 100644
> --- a/gcc/d/d-lang.cc
> +++ b/gcc/d/d-lang.cc
> @@ -50,6 +50,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "output.h"
>  #include "print-tree.h"
>  #include "debug.h"
> +#include "input.h"
>  
>  #include "d-tree.h"
>  #include "id.h"
> @@ -362,6 +363,19 @@ d_option_lang_mask (void)
>    return CL_D;
>  }
>  
> +/* Implements input charset and BOM skipping configuration for
> +   diagnostics.  */
> +static const char *d_input_charset_callback (const char * /*filename*/)
> +{
> +  /* TODO: The input charset is automatically determined by code in
> +     dmd/dmodule.c based on the contents of the file.  If this detection
> +     logic were factored out and could be reused here, then we would be able
> +     to return UTF-16 or UTF-32 as needed here.  For now, we return always
> +     NULL, which means no conversion is necessary, i.e. the input is assumed
> +     to be UTF-8 when diagnostics read this file.  */
> +  return NULL;
> +}
> +

OK from me.

I can have a look into this, but it will have to wait until after I do
an impending switch-over of the dmd/ sources from C++ to D.  Then I'd
have to convince upstream to be open to the change (refactoring has been
getting a bad rep as of late for sneaking regressions into upstream D).

Iain.
Lewis Hyatt May 24, 2021, 9:55 p.m. UTC | #5
On Tue, May 18, 2021 at 5:31 AM Iain Buclaw <ibuclaw@gdcproject.org> wrote:
>
> Excerpts from Lewis Hyatt via Gcc-patches's message of January 29, 2021 4:46 pm:
> > Q1: What is the input charset?
> > A1:
> >
> >     libcpp: Whatever was passed to -finput-charset (note, for Fortran,
> >     -finput-charset is not supported though.)
> >
> >     go: Assume UTF-8.
> >
> >     D: UTF-8, UTF-16, or UTF-32 (the latter two in either
> >        endianness); determined by inspecting the first bytes of the file.
> >
> > Q2: How should a UTF-8 BOM be handled?
> > A2:
> >
> >     libcpp: Treat entirely the same, as if it was not present at all. So
> >     a diagnostic referring to the first non-BOM character in the file will
> >     point to column 1, not to column 4.
> >
> >     go: Treat it like whitespace, ignored for parsing purposes, but still
> >     logically part of the file. A diagnostic referring to the first non-BOM
> >     character in the file will point to column 4.
> >
> >     D: Same as libcpp.
> >
> > So input.c's current behavior (with or without my patch) actually matches
> > the "go" frontend exactly and does the right thing for it. As you
> > thought, though, for D it would be desirable to skip over the UTF-8 BOM
> > too.
> >
> > D adds an additional wrinkle in that, instead of using -finput-charset, it
> > uses its own detection logic -- based on knowledge that the first codepoint
> > in the file must be ASCII, it is able to deduce the encoding from the first
> > few bytes. This means that in D, each source file can have a different
> > encoding, which is not expressible in libcpp frontends currently, and hence
> > the notion of a global input_cpp_context with a constant charset is not
> > sufficient to capture this.
> >
>
> Yes, that is correct for D.
>
> > In this iteration of the patch, I replaced the input_cpp_context with a more
> > general input_context that can handle all of these cases. The context now
> > consists of a callback function and a bool, which the frontend is supposed
> > to configure. The bool determines whether or not to skip the BOM. The
> > callback, when passed a filename, returns the input charset needed to
> > convert that file. For libcpp, the filename is not used as the charset is
> > determined from -finput-charset for all files. For D, the callback function
> > is currently a stub, but it could be expanded to open the file, read the
> > first few bytes, and determine the charset to be used. I have not
> > implemented it for now, because it seems inelegant to duplicate the logic D
> > already has for this detection, but this logic is part of the dmd/ tree
> > which I think is maintained external to GCC, and so it didn't seem right to
> > attempt to refactor it. I figured there may not be much interest in this
> > feature (diagnostics are already unusable for UTF-16 and UTF-32 sources in
> > D), and this patch makes no change to it. This patch does fix the handling
> > of a UTF-8 BOM in D diagnostics, however.
> >
>
> And yes, dmd/ is maintained in a separate upstream repository.
>
> I wouldn't worry about UTF-16 and UTF-32.  I expect all the sources
> written in those charsets are found only in the D testsuite.
>
> > gcc/d/ChangeLog:
> >
> >       PR other/93067
> >       * d-lang.cc (d_input_charset_callback): New function.
> >       (d_init): Call new function input_initialize_context().
> >
>
> > diff --git a/gcc/d/d-lang.cc b/gcc/d/d-lang.cc
> > index 0fd207da7f3..9f170c09886 100644
> > --- a/gcc/d/d-lang.cc
> > +++ b/gcc/d/d-lang.cc
> > @@ -50,6 +50,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "output.h"
> >  #include "print-tree.h"
> >  #include "debug.h"
> > +#include "input.h"
> >
> >  #include "d-tree.h"
> >  #include "id.h"
> > @@ -362,6 +363,19 @@ d_option_lang_mask (void)
> >    return CL_D;
> >  }
> >
> > +/* Implements input charset and BOM skipping configuration for
> > +   diagnostics.  */
> > +static const char *d_input_charset_callback (const char * /*filename*/)
> > +{
> > +  /* TODO: The input charset is automatically determined by code in
> > +     dmd/dmodule.c based on the contents of the file.  If this detection
> > +     logic were factored out and could be reused here, then we would be able
> > +     to return UTF-16 or UTF-32 as needed here.  For now, we return always
> > +     NULL, which means no conversion is necessary, i.e. the input is assumed
> > +     to be UTF-8 when diagnostics read this file.  */
> > +  return NULL;
> > +}
> > +
>
> OK from me.
>
> I can have a look into this, but it will have to wait until after I do
> an impending switch-over of the dmd/ sources from C++ to D.  Then I'd
> have to convince upstream to be open to the change (refactoring has been
> getting a bad rep as of late for sneaking regressions into upstream D).
>
> Iain.

Thank you very much for taking a look at it. I am not sure yet if
David likes the approach of this patch, but in case he does approve
and we go this way or similar, I'll keep the D part like this then.
I think you must be right that UTF-16 and UTF-32 are not widely used,
at least, anyone trying to use them now would receive garbled
diagnostics. But at least with this patch, almost all the needed
pieces to support it for diagnostics are in place.

-Lewis


-Lewis
Lewis Hyatt July 30, 2021, 8:13 p.m. UTC | #6
On Fri, Jan 29, 2021 at 10:46:30AM -0500, Lewis Hyatt wrote:
> On Tue, Jan 26, 2021 at 04:02:52PM -0500, David Malcolm wrote:
> > On Fri, 2020-12-18 at 18:03 -0500, Lewis Hyatt wrote:
> > > Hello-
> > > 
> > > The attached patch addresses PR93067:
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067#c0
> > > 
> > > This is similar to the patch I posted last year on the PR, with some
> > tweaks
> > > to make it a little simpler. Recapping some of the commentary on the
> > PR:
> > > 
> > > When source lines are needed for diagnostics output, they are
> > retrieved from
> > > the source file by the fcache infrastructure in input.c, since libcpp
> > has
> > > generally already forgotten them (plus not all front ends are using
> > > libcpp). This infrastructure does not read the files in the same way
> > as
> > > libcpp does; in particular, it does not translate the encoding as
> > requested
> > > by -finput-charset, and it does not strip a UTF-8 byte-order mark if
> > > present. The patch adds this ability. My thinking in deciding how to
> > do it
> > > was the following:
> > > 
> > > - Use of -finput-charset is rare, and use of UTF-8 BOMs must be rarer
> > still,
> > >   so this patch should try hard not to introduce any worse
> > performance
> > >   unless these things are needed.
> > > 
> > > - It is desirable to reuse libcpp's encoding infrastructure from
> > charset.c
> > >   rather than repeat it in input.c. (Notably, libcpp uses iconv but
> > it also
> > >   has hand-coded routines for certain charsets to make sure they are
> > >   available.)
> > > 
> > > - There is a performance degradation required in order to make use of
> > libcpp
> > >   directly, because the input.c infrastructure only reads as much of
> > the
> > >   source file as necessary, whereas libcpp interfaces as-is require
> > to read
> > >   the entire file into memory.
> > > 
> > > - It can't be quite as simple as just "only delegate to libcpp if
> > >   -finput-charset was specified", because the stripping of the UTF-8
> > BOM has
> > >   to happen with or without this option.
> > > 
> > > - So it seemed a reasonable compromise to me, if -finput-charset is
> > >   specified, then use libcpp to convert the file, otherwise, strip
> > the BOM
> > >   in input.c and then process the file the same way it is done now.
> > There's
> > >   a little bit of leakage of charset logic from libcpp this way (for
> > the
> > >   BOM), but it seems worthwhile, since otherwise, diagnostics would
> > always
> > >   be reading the entire file into memory, which is not a cost paid
> > >   currently.
> > 
> > Thanks for the patch; sorry about the delay in reviewing it.
> >
> 
> Thanks for the comments! Here is an updated patch that addresses your
> feedback, plus some responses inline below.
> 
> Bootstrap + regtest all languages was done on x86-64 GNU/Linux. All tests
> the same before and after, plus 6 new PASS.
> 
> FAIL 85 85
> PASS 479191 479197
> UNSUPPORTED 13664 13664
> UNTESTED 129 129
> XFAIL 2292 2292
> XPASS 30 30
> 
> 
> > This mostly seems good to me.
> > 
> > One aspect I'm not quite convinced about is the
> > input_cpp_context.in_use flag.  The input.c machinery is used by
> > diagnostics, and so could be used by middle-end warnings for frontends
> > that don't use libcpp.  Presumably we'd still want to remove the UTF-8
> > BOM for those, and do encoding fixups if necessary - is it more a case
> > of initializing things to express what the expected input charset is?
> > (since that is part of the cpp_options)
> > 
> > c.opt has:
> >   finput-charset=
> >   C ObjC C++ ObjC++ Joined RejectNegative
> >   -finput-charset=<cset>	Specify the default character set for
> > source files.
> > 
> > I believe that D and Go are the two frontends that don't use libcpp for
> > parsing.  I believe Go source is required to be UTF-8 (unsurprisingly
> > considering the heritage of both).  I don't know what source encodings
> > D supports.
> >
> 
> For this patch I was rather singularly focused on libcpp, so I looked
> deeper at the other frontends now. It seems to me that there are basically
> two questions to answer, and the three frontend styles answer this pair in
> three different ways.
> 
> Q1: What is the input charset?
> A1:
> 
>     libcpp: Whatever was passed to -finput-charset (note, for Fortran,
>     -finput-charset is not supported though.)
> 
>     go: Assume UTF-8.
> 
>     D: UTF-8, UTF-16, or UTF-32 (the latter two in either
>        endianness); determined by inspecting the first bytes of the file.
> 
> Q2: How should a UTF-8 BOM be handled?
> A2:
> 
>     libcpp: Treat entirely the same, as if it was not present at all. So
>     a diagnostic referring to the first non-BOM character in the file will
>     point to column 1, not to column 4.
> 
>     go: Treat it like whitespace, ignored for parsing purposes, but still
>     logically part of the file. A diagnostic referring to the first non-BOM
>     character in the file will point to column 4.
> 
>     D: Same as libcpp.
> 
> So input.c's current behavior (with or without my patch) actually matches
> the "go" frontend exactly and does the right thing for it. As you
> thought, though, for D it would be desirable to skip over the UTF-8 BOM
> too.
> 
> D adds an additional wrinkle in that, instead of using -finput-charset, it
> uses its own detection logic -- based on knowledge that the first codepoint
> in the file must be ASCII, it is able to deduce the encoding from the first
> few bytes. This means that in D, each source file can have a different
> encoding, which is not expressible in libcpp frontends currently, and hence
> the notion of a global input_cpp_context with a constant charset is not
> sufficient to capture this.
> 
> In this iteration of the patch, I replaced the input_cpp_context with a more
> general input_context that can handle all of these cases. The context now
> consists of a callback function and a bool, which the frontend is supposed
> to configure. The bool determines whether or not to skip the BOM. The
> callback, when passed a filename, returns the input charset needed to
> convert that file. For libcpp, the filename is not used as the charset is
> determined from -finput-charset for all files. For D, the callback function
> is currently a stub, but it could be expanded to open the file, read the
> first few bytes, and determine the charset to be used. I have not
> implemented it for now, because it seems inelegant to duplicate the logic D
> already has for this detection, but this logic is part of the dmd/ tree
> which I think is maintained external to GCC, and so it didn't seem right to
> attempt to refactor it. I figured there may not be much interest in this
> feature (diagnostics are already unusable for UTF-16 and UTF-32 sources in
> D), and this patch makes no change to it. This patch does fix the handling
> of a UTF-8 BOM in D diagnostics, however.
> 
> > > Separate from the main patch are two testcases that both fail before
> > this
> > > patch and pass after. I attached them gzipped because they use non-
> > standard
> > > encodings that won't email well.
> > > 
> > > The main question I have about the patch is whether I chose a good
> > way to
> > > address the following complication. location_get_source_line() in
> > input.c is
> > > used to generate diagnostics for all front ends, whether they use
> > libcpp to
> > > read the files or not. So input.c needs some way to know whether
> > libcpp is
> > > in use or not. I ended up adding a new function
> > input_initialize_cpp_context(),
> > > which front ends have to call if they are using libcpp to read their
> > > files. Currently that is c-family and fortran. I don't see a simpler
> > way
> > > to do it at least... Even if there were a better way for input.c to
> > find out
> > > the value passed to -finput-charset, it would still need to know
> > whether
> > > libcpp is being used or not.
> > 
> > At some point I want to convert the global state in input.c into a
> > source_manager class, probably hung off the diagnostic_context, so the
> > natural place to put that state would be in there (rather than the
> > fcache being global state).  Perhaps have a source_reader policy class
> > hung off of the source_manager, which would have responsibility for
> > reading and converting the file, either piecemeal, or fully for the
> > "need to use libcpp" case.
> > 
> > But that's not necessary for this fix.
> > 
> > > Please let me know if it looks OK (either now or for stage 1,
> > whatever makes
> > > sense...) bootstrap + regtest all languages on x86-64 GNU/Linux, all
> > tests the
> > > same before & after plus 6 new PASS from the new tests. Thanks!
> > 
> > Various comments inline below.  In particular, does the patch add a
> > memory leak in _cpp_convert_input?
> > 
> > Thanks
> > Dave
> > 
> > [...]
> > 
> > > libcpp/ChangeLog:
> > > 
> > >         PR other/93067
> > >         * charset.c (init_iconv_desc): Adapt to permit PFILE argument
> > to
> > >         be NULL.
> > >         (_cpp_convert_input): Likewise. Also move UTF-8 BOM logic
> > to...
> > >         (cpp_check_utf8_bom): ...here.  New function.
> > >         (cpp_input_conversion_is_trivial): New function.
> > >         * files.c (read_file_guts): Allow PFILE argument to be NULL.
> > Add
> > >         INPUT_CHARSET argument as an alternate source of this
> > information.
> > >         (cpp_get_converted_source): New function.
> > >         * include/cpplib.h (struct cpp_converted_source): Declare.
> > >         (cpp_get_converted_source): Declare.
> > >         (cpp_input_conversion_is_trivial): Declare.
> > >         (cpp_check_utf8_bom): Declare.
> > > 
> > > gcc/testsuite/ChangeLog:
> > > 
> > >         PR other/93067
> > >         * gcc.dg/diagnostic-input-charset-1.c: New test.
> > >         * gcc.dg/diagnostic-input-charset-2.c: New test.
> > 
> > Maybe rename the 2nd test to "diagnostic-input-utf8-bom.c" ?
> >
> 
> Done.
> 
> > > 
> > > diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
> > > index 59cabd12407..d5aa7859cc1 100644
> > > --- a/gcc/c-family/c-opts.c
> > > +++ b/gcc/c-family/c-opts.c
> > > @@ -1124,6 +1124,10 @@ c_common_post_options (const char **pfilename)
> > >    cpp_post_options (parse_in);
> > >    init_global_opts_from_cpp (&global_options, cpp_get_options
> > (parse_in));
> > >  
> > > +  /* Let diagnostics infrastructure know we are using libcpp to read
> > > +     the input.  */
> > > +  input_initialize_cpp_context (parse_in);
> > 
> > As noted above I think the most significant thing here is getting the
> > source encoding from parse_in, so maybe reword the comment accordingly.
> > 
> > [...]
> > 
> > > diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
> > > index 51baf141711..2b12a98afc0 100644
> > > --- a/gcc/fortran/cpp.c
> > > +++ b/gcc/fortran/cpp.c
> > > @@ -493,6 +493,10 @@ gfc_cpp_post_options (void)
> > >  
> > >    cpp_post_options (cpp_in);
> > >  
> > > +  /* Let diagnostics infrastructure know we are using libcpp to read
> > > +     the input.  */
> > > +  input_initialize_cpp_context (cpp_in);
> > 
> > Likewise.
> >
> 
> I kept this in mind when redoing this part, hopefully it is better now.
> 
> > [...]
> > 
> > > diff --git a/gcc/input.c b/gcc/input.c
> > > index 29d10f06b86..1dcdd464bc1 100644
> > > --- a/gcc/input.c
> > > +++ b/gcc/input.c
> > 
> > [...]
> > 
> > > @@ -394,6 +432,42 @@ add_file_to_cache_tab (const char *file_path)
> > >    r->total_lines = total_lines_num (file_path);
> > >    r->missing_trailing_newline = true;
> > >  
> > > +  /* If libcpp is managing the reading, then there are two cases we
> > need to
> > > +     consider.  If -finput-charset is not in effect, then we just
> > need to
> > > +     strip a UTF-8 BOM, so do that ourselves rather than calling
> > into libcpp so
> > > +     as to avoid paying the penalty of using libcpp, namely that the
> > entire file
> > > +     must be read at once.  In the (generally rare) case that a non-
> > trivial
> > > +     -finput-charset is needed, then go ahead and use libcpp to read
> > the whole
> > > +     file and do the conversion.  */
> > > +  if (input_cpp_context.in_use)
> > > +    {
> > > +      if (input_cpp_context.conversion_is_trivial)
> > > +       {
> > > +         /* Strip the UTF8 BOM if present.  */
> > > +         if (read_data (r))
> > > +           {
> > > +             const int offset = cpp_check_utf8_bom (r->data, r-
> > >nb_read);
> > > +             r->offset_buffer (offset);
> > > +             r->nb_read -= offset;
> > > +           }
> > 
> > This assumes that trivial conversions are always UTF8 to UTF8, which
> > assumes that SOURCE_CHARSET is UTF8, which isn't the case for EBCDIC...
> > 
> > [...]
> > 
> > > +/* Utility to strip a UTF-8 byte order marking from the beginning
> > > +   of a buffer.  Returns the number of bytes to skip, which
> > currently
> > > +   will be either 0 or 3.  */
> > > +int
> > > +cpp_check_utf8_bom (const char *data, size_t data_length)
> > > +{
> > > +
> > > +#if HOST_CHARSET == HOST_CHARSET_ASCII
> > > +  const unsigned char *udata = (const unsigned char *) data;
> > > +  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
> > > +      && udata[2] == 0xbf)
> > > +    return 3;
> > > +#endif
> > > +
> > > +  return 0;
> > > +}
> > 
> > ...though the check on HOST_CHARSET == HOST_CHARSET_ASCII makes this a
> > no-op for the EBCDIC case, so that's OK.
> > 
> > [...]
> > 
> > > @@ -2158,17 +2191,28 @@ _cpp_convert_input (cpp_reader *pfile, const
> > char *input_charset,
> > >        to.text = XNEWVEC (uchar, to.asize);
> > >        to.len = 0;
> > >  
> > > -      if (!APPLY_CONVERSION (input_cset, input, len, &to))
> > > -       cpp_error (pfile, CPP_DL_ERROR,
> > > -                  "failure to convert %s to %s",
> > > -                  CPP_OPTION (pfile, input_charset),
> > SOURCE_CHARSET);
> > > +      const bool ok = APPLY_CONVERSION (input_cset, input, len,
> > &to);
> > >  
> > > -      free (input);
> > > -    }
> > > +      /* Handle conversion failure.  */
> > > +      if (!ok)
> > > +       {
> > > +         free (input);
> > > +         if (!pfile)
> > > +           {
> > > +             XDELETEVEC (to.text);
> > > +             *buffer_start = NULL;
> > > +             *st_size = 0;
> > > +             return NULL;
> > > +           }
> > > +         cpp_error (pfile, CPP_DL_ERROR,
> > > +                    "failure to convert %s to %s",
> > > +                    CPP_OPTION (pfile, input_charset),
> > SOURCE_CHARSET);
> > > +       }
> > > +    }
> > >
> > 
> > Doesn't this lose the
> >   free (input);
> > for the case where the conversion is successful?  Is this a leak?
> >
> 
> Ugh, sorry to waste your time with that mistake. Fixed.
> 
> > [...]
> > 
> > > diff --git a/libcpp/files.c b/libcpp/files.c
> > > index 301b2379a23..178bb9ed1e6 100644
> > > --- a/libcpp/files.c
> > > +++ b/libcpp/files.c
> > > @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile,
> > _cpp_file *file,
> > >  static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
> > >                               bool *invalid_pch, location_t loc);
> > >  static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
> > > -                           location_t loc);
> > > +                           location_t loc, const char *input_charset
> > = NULL);
> > 
> > read_file_guts is only used in one place before the patch, so rather
> > than add a default argument, I think it's simpler to pass in the
> > input_charset at that call.
> > 
> > > @@ -671,18 +671,32 @@ _cpp_find_file (cpp_reader *pfile, const char
> > *fname, cpp_dir *start_dir,
> > >  
> > >     Use LOC for any diagnostics.
> > >  
> > > +   The input charset may be specified in the INPUT_CHARSET argument,
> > or
> > > +   else it will be taken from PFILE.
> > 
> > ...and doing so means that input_charset will be non-NULL at all
> > callers.
> > 
> > > +   PFILE may be NULL.  In this case, no diagnostics are issued, and
> > the
> > > +   input charset must be specified in the arguments.
> > > +
> > >     FIXME: Flush file cache and try again if we run out of memory. 
> > */
> > >  static bool
> > > -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
> > > +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
> > > +               const char *input_charset)
> > >  {
> > >    ssize_t size, total, count;
> > >    uchar *buf;
> > >    bool regular;
> > >  
> > > +  if (!input_charset)
> > > +    {
> > > +      gcc_assert (pfile);
> > > +      input_charset = CPP_OPTION (pfile, input_charset);
> > > +    }
> > 
> > ...which would allow for this logic to be removed, moving this to the
> > existing caller, I believe.
> >
> 
> Much better your way, done.
> 
> Thanks for taking a look at it, sorry this version is a bit changed from the
> previous one. FWIW, the libcpp pieces were not touched other than addressing
> your comments; the new logic is in input.h/input.c plus the associated
> tweaks to the frontends.
> 
> -Lewis

> From: Lewis Hyatt <lhyatt@gmail.com>
> Date: Wed, 27 Jan 2021 18:23:13 -0500
> Subject: [PATCH] diagnostics: Support for -finput-charset [PR93067]
> 
> Adds the logic to handle -finput-charset in layout_get_source_line(), so that
> source lines are converted from their input encodings prior to being output by
> diagnostics machinery. Also adds the ability to strip a UTF-8 BOM similarly.
> 
> gcc/c-family/ChangeLog:
> 
> 	PR other/93067
> 	* c-opts.c (c_common_input_charset_cb): New function.
> 	(c_common_post_options): Call new function input_initialize_context().
> 
> gcc/d/ChangeLog:
> 
> 	PR other/93067
> 	* d-lang.cc (d_input_charset_callback): New function.
> 	(d_init): Call new function input_initialize_context().
> 
> gcc/fortran/ChangeLog:
> 
> 	PR other/93067
> 	* cpp.c (gfc_cpp_post_options): Call new function
> 	input_initialize_cpp_context().
> 
> gcc/ChangeLog:
> 
> 	PR other/93067
> 	* input.c (default_charset_callback): New function.
> 	(input_initialize_context): New function.
> 	(read_data): Add prototype.
> 	(add_file_to_cache_tab): Use libcpp to convert input encoding when
> 	needed. Strip UTF-8 BOM when needed.
> 	(class fcache): Add new members to track input encoding conversion.
> 	(fcache::fcache): Adapt for new members.
> 	(fcache::~fcache): Likewise.
> 	(maybe_grow): Likewise.
> 	(needs_read): Adapt to be aware that fp member may be NULL now.
> 	(get_next_line): Likewise.
> 	* input.h (input_initialize_context): Declare.
> 
> libcpp/ChangeLog:
> 
> 	PR other/93067
> 	* charset.c (init_iconv_desc): Adapt to permit PFILE argument to
> 	be NULL.
> 	(_cpp_convert_input): Likewise. Also move UTF-8 BOM logic to...
> 	(cpp_check_utf8_bom): ...here.  New function.
> 	(cpp_input_conversion_is_trivial): New function.
> 	* files.c (read_file_guts): Allow PFILE argument to be NULL.  Add
> 	INPUT_CHARSET argument as an alternate source of this information.
> 	(read_file): Pass the new argument to read_file_guts.
> 	(cpp_get_converted_source): New function.
> 	* include/cpplib.h (struct cpp_converted_source): Declare.
> 	(cpp_get_converted_source): Declare.
> 	(cpp_input_conversion_is_trivial): Declare.
> 	(cpp_check_utf8_bom): Declare.
> 
> gcc/testsuite/ChangeLog:
> 
> 	PR other/93067
> 	* gcc.dg/diagnostic-input-charset-1.c: New test.
> 	* gcc.dg/diagnostic-input-utf8-bom.c: New test.
> 
> diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
> index bd15b9cd902..2f58e8413b9 100644
> --- a/gcc/c-family/c-opts.c
> +++ b/gcc/c-family/c-opts.c
> @@ -188,6 +188,14 @@ c_common_diagnostics_set_defaults (diagnostic_context *context)
>    context->opt_permissive = OPT_fpermissive;
>  }
>  
> +/* Input charset configuration for diagnostics.  */
> +static const char *
> +c_common_input_charset_cb (const char * /*filename*/)
> +{
> +  const char *cs = cpp_opts->input_charset;
> +  return cpp_input_conversion_is_trivial (cs) ? NULL : cs;
> +}
> +
>  /* Whether options from all C-family languages should be accepted
>     quietly.  */
>  static bool accept_all_c_family_options = false;
> @@ -1131,6 +1139,10 @@ c_common_post_options (const char **pfilename)
>    cpp_post_options (parse_in);
>    init_global_opts_from_cpp (&global_options, cpp_get_options (parse_in));
>  
> +  /* Let diagnostics infrastructure know how to convert input files the same
> +     way libcpp will do it, namely using the configured input charset and
> +     skipping a UTF-8 BOM if present.  */
> +  input_initialize_context (c_common_input_charset_cb, true);
>    input_location = UNKNOWN_LOCATION;
>  
>    *pfilename = this_input_filename
> diff --git a/gcc/d/d-lang.cc b/gcc/d/d-lang.cc
> index 0fd207da7f3..9f170c09886 100644
> --- a/gcc/d/d-lang.cc
> +++ b/gcc/d/d-lang.cc
> @@ -50,6 +50,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "output.h"
>  #include "print-tree.h"
>  #include "debug.h"
> +#include "input.h"
>  
>  #include "d-tree.h"
>  #include "id.h"
> @@ -362,6 +363,19 @@ d_option_lang_mask (void)
>    return CL_D;
>  }
>  
> +/* Implements input charset and BOM skipping configuration for
> +   diagnostics.  */
> +static const char *d_input_charset_callback (const char * /*filename*/)
> +{
> +  /* TODO: The input charset is automatically determined by code in
> +     dmd/dmodule.c based on the contents of the file.  If this detection
> +     logic were factored out and could be reused here, then we would be able
> +     to return UTF-16 or UTF-32 as needed here.  For now, we return always
> +     NULL, which means no conversion is necessary, i.e. the input is assumed
> +     to be UTF-8 when diagnostics read this file.  */
> +  return NULL;
> +}
> +
>  /* Implements the lang_hooks.init routine for language D.  */
>  
>  static bool
> @@ -373,6 +387,10 @@ d_init (void)
>    Expression::_init ();
>    Objc::_init ();
>  
> +  /* Diagnostics input init, to enable BOM skipping and
> +     input charset conversion.  */
> +  input_initialize_context (d_input_charset_callback, true);
> +
>    /* Back-end init.  */
>    global_binding_level = ggc_cleared_alloc <binding_level> ();
>    current_binding_level = global_binding_level;
> diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
> index 419cd6accbe..8b564888673 100644
> --- a/gcc/fortran/cpp.c
> +++ b/gcc/fortran/cpp.c
> @@ -493,6 +493,12 @@ gfc_cpp_post_options (void)
>  
>    cpp_post_options (cpp_in);
>  
> +
> +  /* Let diagnostics infrastructure know how to convert input files the same
> +     way libcpp will do it, namely, with no charset conversion but with
> +     skipping of a UTF-8 BOM if present.  */
> +  input_initialize_context (NULL, true);
> +
>    gfc_cpp_register_include_paths ();
>  }
>  
> diff --git a/gcc/input.c b/gcc/input.c
> index 9e39e7df83c..f8f6dc7e4ca 100644
> --- a/gcc/input.c
> +++ b/gcc/input.c
> @@ -30,6 +30,29 @@ along with GCC; see the file COPYING3.  If not see
>  #define HAVE_ICONV 0
>  #endif
>  
> +/* Input charset configuration.  */
> +static const char *default_charset_callback (const char *)
> +{
> +  return NULL;
> +}
> +
> +struct
> +{
> +  input_context_charset_callback ccb;
> +  bool should_skip_bom;
> +} static input_context =
> +{
> +  default_charset_callback,
> +  false
> +};
> +
> +void input_initialize_context (input_context_charset_callback ccb,
> +			       bool should_skip_bom)
> +{
> +  input_context.ccb = (ccb ? ccb : default_charset_callback);
> +  input_context.should_skip_bom = should_skip_bom;
> +}
> +
>  /* This is a cache used by get_next_line to store the content of a
>     file to be searched for file lines.  */
>  class fcache
> @@ -78,6 +101,10 @@ public:
>       far.  */
>    char *data;
>  
> +  /* The allocated buffer to be freed may start a little earlier than DATA,
> +     e.g. if a UTF8 BOM was skipped at the beginning.  */
> +  int alloc_offset;
> +
>    /*  The size of the DATA array above.*/
>    size_t size;
>  
> @@ -118,6 +145,17 @@ public:
>  
>    fcache ();
>    ~fcache ();
> +
> +  void offset_buffer (int offset)
> +  {
> +    gcc_assert (offset < 0 ? alloc_offset + offset >= 0
> +		: (size_t) offset <= size);
> +    gcc_assert (data);
> +    alloc_offset += offset;
> +    data += offset;
> +    size -= offset;
> +  }
> +
>  };
>  
>  /* Current position in real source file.  */
> @@ -364,6 +402,9 @@ evicted_cache_tab_entry (unsigned *highest_use_count)
>    return to_evict;
>  }
>  
> +static bool
> +read_data (fcache *c);
> +
>  /* Create the cache used for the content of a given file to be
>     accessed by caret diagnostic.  This cache is added to an array of
>     cache and can be retrieved by lookup_file_in_cache_tab.  This
> @@ -384,6 +425,8 @@ add_file_to_cache_tab (const char *file_path)
>    if (r->fp)
>      fclose (r->fp);
>    r->fp = fp;
> +  if (r->alloc_offset)
> +    r->offset_buffer (-r->alloc_offset);
>    r->nb_read = 0;
>    r->line_start_idx = 0;
>    r->line_num = 0;
> @@ -394,6 +437,33 @@ add_file_to_cache_tab (const char *file_path)
>    r->total_lines = total_lines_num (file_path);
>    r->missing_trailing_newline = true;
>  
> +  /* Check the frontend configuration to determine if we need to do any
> +     transformations, such as charset conversion or BOM skipping.  */
> +  if (const char *input_charset = input_context.ccb (file_path))
> +    {
> +      /* Need a full-blown conversion of the input charset.  */
> +      fclose (r->fp);
> +      r->fp = NULL;
> +      const cpp_converted_source cs
> +	= cpp_get_converted_source (file_path, input_charset);
> +      if (!cs.data)
> +	return NULL;
> +      if (r->data)
> +	XDELETEVEC (r->data);
> +      r->data = cs.data;
> +      r->nb_read = r->size = cs.len;
> +      r->alloc_offset = cs.data - cs.to_free;
> +    }
> +  else if (input_context.should_skip_bom)
> +    {
> +      if (read_data (r))
> +	{
> +	  const int offset = cpp_check_utf8_bom (r->data, r->nb_read);
> +	  r->offset_buffer (offset);
> +	  r->nb_read -= offset;
> +	}
> +    }
> +
>    return r;
>  }
>  
> @@ -415,7 +485,7 @@ lookup_or_add_file_to_cache_tab (const char *file_path)
>     diagnostic.  */
>  
>  fcache::fcache ()
> -: use_count (0), file_path (NULL), fp (NULL), data (0),
> +: use_count (0), file_path (NULL), fp (NULL), data (0), alloc_offset (0),
>    size (0), nb_read (0), line_start_idx (0), line_num (0),
>    total_lines (0), missing_trailing_newline (true)
>  {
> @@ -433,6 +503,7 @@ fcache::~fcache ()
>      }
>    if (data)
>      {
> +      offset_buffer (-alloc_offset);
>        XDELETEVEC (data);
>        data = 0;
>      }
> @@ -447,9 +518,9 @@ fcache::~fcache ()
>  static bool
>  needs_read (fcache *c)
>  {
> -  return (c->nb_read == 0
> -	  || c->nb_read == c->size
> -	  || (c->line_start_idx >= c->nb_read - 1));
> +  return c->fp && (c->nb_read == 0
> +		   || c->nb_read == c->size
> +		   || (c->line_start_idx >= c->nb_read - 1));
>  }
>  
>  /*  Return TRUE iff the cache is full and thus needs to be
> @@ -469,9 +540,20 @@ maybe_grow (fcache *c)
>    if (!needs_grow (c))
>      return;
>  
> -  size_t size = c->size == 0 ? fcache_buffer_size : c->size * 2;
> -  c->data = XRESIZEVEC (char, c->data, size);
> -  c->size = size;
> +  if (!c->data)
> +    {
> +      gcc_assert (c->size == 0 && c->alloc_offset == 0);
> +      c->size = fcache_buffer_size;
> +      c->data = XNEWVEC (char, c->size);
> +    }
> +  else
> +    {
> +      const int offset = c->alloc_offset;
> +      c->offset_buffer (-offset);
> +      c->size *= 2;
> +      c->data = XRESIZEVEC (char, c->data, c->size);
> +      c->offset_buffer (offset);
> +    }
>  }
>  
>  /*  Read more data into the cache.  Extends the cache if need be.
> @@ -570,7 +652,7 @@ get_next_line (fcache *c, char **line, ssize_t *line_len)
>        c->missing_trailing_newline = false;
>      }
>  
> -  if (ferror (c->fp))
> +  if (c->fp && ferror (c->fp))
>      return false;
>  
>    /* At this point, we've found the end of the of line.  It either
> diff --git a/gcc/input.h b/gcc/input.h
> index f1ef5d76cfd..8a793399958 100644
> --- a/gcc/input.h
> +++ b/gcc/input.h
> @@ -214,4 +214,21 @@ class GTY(()) string_concat_db
>    hash_map <location_hash, string_concat *> *m_table;
>  };
>  
> +/* Because we read source files a second time after the frontend did it the
> +   first time, we need to know how the frontend handled things like character
> +   set conversion and UTF-8 BOM stripping, in order to make everything
> +   consistent.  This function needs to be called by each frontend that requires
> +   non-default behavior, to inform the diagnostics infrastructure how input is
> +   to be processed.  The default behavior is to do no conversion and not to
> +   strip a UTF-8 BOM.
> +
> +   The callback should return the input charset to be used to convert the given
> +   file's contents to UTF-8, or it should return NULL if no conversion is needed
> +   for this file.  SHOULD_SKIP_BOM only applies in case no conversion was
> +   performed, and if true, it will cause a UTF-8 BOM to be skipped at the
> +   beginning of the file.  (In case a conversion was performed, the BOM is
> +   rather skipped as part of the conversion process.)  */
> +typedef const char * (*input_context_charset_callback)(const char *);
> +void input_initialize_context (input_context_charset_callback ccb,
> +			       bool should_skip_bom);
>  #endif
> diff --git a/libcpp/charset.c b/libcpp/charset.c
> index 99a9b73e5ab..61881f978a8 100644
> --- a/libcpp/charset.c
> +++ b/libcpp/charset.c
> @@ -630,7 +630,11 @@ static const struct cpp_conversion conversion_tab[] = {
>     cset_converter structure for conversion from FROM to TO.  If
>     iconv_open() fails, issue an error and return an identity
>     converter.  Silently return an identity converter if FROM and TO
> -   are identical.  */
> +   are identical.
> +
> +   PFILE is only used for generating diagnostics; setting it to NULL
> +   suppresses diagnostics.  */
> +
>  static struct cset_converter
>  init_iconv_desc (cpp_reader *pfile, const char *to, const char *from)
>  {
> @@ -672,25 +676,31 @@ init_iconv_desc (cpp_reader *pfile, const char *to, const char *from)
>  
>        if (ret.cd == (iconv_t) -1)
>  	{
> -	  if (errno == EINVAL)
> -	    cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
> -		       "conversion from %s to %s not supported by iconv",
> -		       from, to);
> -	  else
> -	    cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
> -
> +	  if (pfile)
> +	    {
> +	      if (errno == EINVAL)
> +		cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
> +			   "conversion from %s to %s not supported by iconv",
> +			   from, to);
> +	      else
> +		cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
> +	    }
>  	  ret.func = convert_no_conversion;
>  	}
>      }
>    else
>      {
> -      cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
> -		 "no iconv implementation, cannot convert from %s to %s",
> -		 from, to);
> +      if (pfile)
> +	{
> +	  cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
> +		     "no iconv implementation, cannot convert from %s to %s",
> +		     from, to);
> +	}
>        ret.func = convert_no_conversion;
>        ret.cd = (iconv_t) -1;
>        ret.width = -1;
>      }
> +
>    return ret;
>  }
>  
> @@ -2122,6 +2132,25 @@ _cpp_interpret_identifier (cpp_reader *pfile, const uchar *id, size_t len)
>  				  buf, bufp - buf, HT_ALLOC));
>  }
>  
> +
> +/* Utility to strip a UTF-8 byte order marking from the beginning
> +   of a buffer.  Returns the number of bytes to skip, which currently
> +   will be either 0 or 3.  */
> +int
> +cpp_check_utf8_bom (const char *data, size_t data_length)
> +{
> +
> +#if HOST_CHARSET == HOST_CHARSET_ASCII
> +  const unsigned char *udata = (const unsigned char *) data;
> +  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
> +      && udata[2] == 0xbf)
> +    return 3;
> +#endif
> +
> +  return 0;
> +}
> +
> +
>  /* Convert an input buffer (containing the complete contents of one
>     source file) from INPUT_CHARSET to the source character set.  INPUT
>     points to the input buffer, SIZE is its allocated size, and LEN is
> @@ -2135,7 +2164,11 @@ _cpp_interpret_identifier (cpp_reader *pfile, const uchar *id, size_t len)
>     INPUT is expected to have been allocated with xmalloc.  This
>     function will either set *BUFFER_START to INPUT, or free it and set
>     *BUFFER_START to a pointer to another xmalloc-allocated block of
> -   memory.  */
> +   memory.
> +
> +   PFILE is only used to generate diagnostics; setting it to NULL suppresses
> +   diagnostics, and causes a return of NULL if there was any error instead.  */
> +
>  uchar * 
>  _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
>  		    uchar *input, size_t size, size_t len,
> @@ -2158,17 +2191,27 @@ _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
>        to.text = XNEWVEC (uchar, to.asize);
>        to.len = 0;
>  
> -      if (!APPLY_CONVERSION (input_cset, input, len, &to))
> -	cpp_error (pfile, CPP_DL_ERROR,
> -		   "failure to convert %s to %s",
> -		   CPP_OPTION (pfile, input_charset), SOURCE_CHARSET);
> -
> +      const bool ok = APPLY_CONVERSION (input_cset, input, len, &to);
>        free (input);
> -    }
>  
> -  /* Clean up the mess.  */
> -  if (input_cset.func == convert_using_iconv)
> -    iconv_close (input_cset.cd);
> +      /* Clean up the mess.  */
> +      if (input_cset.func == convert_using_iconv)
> +	iconv_close (input_cset.cd);
> +
> +      /* Handle conversion failure.  */
> +      if (!ok)
> +	{
> +	  if (!pfile)
> +	    {
> +	      XDELETEVEC (to.text);
> +	      *buffer_start = NULL;
> +	      *st_size = 0;
> +	      return NULL;
> +	    }
> +	  cpp_error (pfile, CPP_DL_ERROR, "failure to convert %s to %s",
> +		     input_charset, SOURCE_CHARSET);
> +	}
> +    }
>  
>    /* Resize buffer if we allocated substantially too much, or if we
>       haven't enough space for the \n-terminator or following
> @@ -2192,19 +2235,14 @@ _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
>  
>    buffer = to.text;
>    *st_size = to.len;
> -#if HOST_CHARSET == HOST_CHARSET_ASCII
> -  /* The HOST_CHARSET test just above ensures that the source charset
> -     is UTF-8.  So, ignore a UTF-8 BOM if we see one.  Note that
> -     glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
> +
> +  /* Ignore a UTF-8 BOM if we see one and the source charset is UTF-8.  Note
> +     that glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
>       BOM -- however, even if it did, we would still need this code due
>       to the 'convert_no_conversion' case.  */
> -  if (to.len >= 3 && to.text[0] == 0xef && to.text[1] == 0xbb
> -      && to.text[2] == 0xbf)
> -    {
> -      *st_size -= 3;
> -      buffer += 3;
> -    }
> -#endif
> +  const int bom_len = cpp_check_utf8_bom ((const char *) to.text, to.len);
> +  *st_size -= bom_len;
> +  buffer += bom_len;
>  
>    *buffer_start = to.text;
>    return buffer;
> @@ -2244,6 +2282,13 @@ _cpp_default_encoding (void)
>    return current_encoding;
>  }
>  
> +/* Check if the configured input charset requires no conversion, other than
> +   possibly stripping a UTF-8 BOM.  */
> +bool cpp_input_conversion_is_trivial (const char *input_charset)
> +{
> +  return !strcasecmp (input_charset, SOURCE_CHARSET);
> +}
> +
>  /* Implementation of class cpp_string_location_reader.  */
>  
>  /* Constructor for cpp_string_location_reader.  */
> diff --git a/libcpp/files.c b/libcpp/files.c
> index 5ea3f8e1bf3..d91606c1532 100644
> --- a/libcpp/files.c
> +++ b/libcpp/files.c
> @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile, _cpp_file *file,
>  static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
>  			      bool *invalid_pch, location_t loc);
>  static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
> -			    location_t loc);
> +			    location_t loc, const char *input_charset);
>  static bool read_file (cpp_reader *pfile, _cpp_file *file,
>  		       location_t loc);
>  static struct cpp_dir *search_path_head (cpp_reader *, const char *fname,
> @@ -671,9 +671,12 @@ _cpp_find_file (cpp_reader *pfile, const char *fname, cpp_dir *start_dir,
>  
>     Use LOC for any diagnostics.
>  
> +   PFILE may be NULL.  In this case, no diagnostics are issued.
> +
>     FIXME: Flush file cache and try again if we run out of memory.  */
>  static bool
> -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
> +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
> +		const char *input_charset)
>  {
>    ssize_t size, total, count;
>    uchar *buf;
> @@ -681,8 +684,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
>  
>    if (S_ISBLK (file->st.st_mode))
>      {
> -      cpp_error_at (pfile, CPP_DL_ERROR, loc,
> -		    "%s is a block device", file->path);
> +      if (pfile)
> +	cpp_error_at (pfile, CPP_DL_ERROR, loc,
> +		      "%s is a block device", file->path);
>        return false;
>      }
>  
> @@ -699,8 +703,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
>  	 does not bite us.  */
>        if (file->st.st_size > INTTYPE_MAXIMUM (ssize_t))
>  	{
> -	  cpp_error_at (pfile, CPP_DL_ERROR, loc,
> -			"%s is too large", file->path);
> +	  if (pfile)
> +	    cpp_error_at (pfile, CPP_DL_ERROR, loc,
> +			  "%s is too large", file->path);
>  	  return false;
>  	}
>  
> @@ -733,29 +738,29 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
>  
>    if (count < 0)
>      {
> -      cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
> +      if (pfile)
> +	cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
>        free (buf);
>        return false;
>      }
>  
> -  if (regular && total != size && STAT_SIZE_RELIABLE (file->st))
> +  if (pfile && regular && total != size && STAT_SIZE_RELIABLE (file->st))
>      cpp_error_at (pfile, CPP_DL_WARNING, loc,
>  	       "%s is shorter than expected", file->path);
>  
>    file->buffer = _cpp_convert_input (pfile,
> -				     CPP_OPTION (pfile, input_charset),
> +				     input_charset,
>  				     buf, size + 16, total,
>  				     &file->buffer_start,
>  				     &file->st.st_size);
> -  file->buffer_valid = true;
> -
> -  return true;
> +  file->buffer_valid = file->buffer;
> +  return file->buffer_valid;
>  }
>  
>  /* Convenience wrapper around read_file_guts that opens the file if
>     necessary and closes the file descriptor after reading.  FILE must
>     have been passed through find_file() at some stage.  Use LOC for
> -   any diagnostics.  */
> +   any diagnostics.  Unlike read_file_guts(), PFILE may not be NULL.  */
>  static bool
>  read_file (cpp_reader *pfile, _cpp_file *file, location_t loc)
>  {
> @@ -773,7 +778,8 @@ read_file (cpp_reader *pfile, _cpp_file *file, location_t loc)
>        return false;
>      }
>  
> -  file->dont_read = !read_file_guts (pfile, file, loc);
> +  file->dont_read = !read_file_guts (pfile, file, loc,
> +				     CPP_OPTION (pfile, input_charset));
>    close (file->fd);
>    file->fd = -1;
>  
> @@ -2118,3 +2124,25 @@ _cpp_has_header (cpp_reader *pfile, const char *fname, int angle_brackets,
>    return file->err_no != ENOENT;
>  }
>  
> +/* Read a file and convert to input charset, the same as if it were being read
> +   by a cpp_reader.  */
> +
> +cpp_converted_source
> +cpp_get_converted_source (const char *fname, const char *input_charset)
> +{
> +  cpp_converted_source res = {};
> +  _cpp_file file = {};
> +  file.fd = -1;
> +  file.name = lbasename (fname);
> +  file.path = fname;
> +  if (!open_file (&file))
> +    return res;
> +  const bool ok = read_file_guts (NULL, &file, 0, input_charset);
> +  close (file.fd);
> +  if (!ok)
> +    return res;
> +  res.to_free = (char *) file.buffer_start;
> +  res.data = (char *) file.buffer;
> +  res.len = file.st.st_size;
> +  return res;
> +}
> diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
> index 4467c73284d..d02ef4aa054 100644
> --- a/libcpp/include/cpplib.h
> +++ b/libcpp/include/cpplib.h
> @@ -1369,6 +1369,20 @@ extern struct _cpp_file *cpp_get_file (cpp_buffer *);
>  extern cpp_buffer *cpp_get_prev (cpp_buffer *);
>  extern void cpp_clear_file_cache (cpp_reader *);
>  
> +/* cpp_get_converted_source returns the contents of the given file, as it exists
> +   after cpplib has read it and converted it from the input charset to the
> +   source charset.  Return struct will be zero-filled if the data could not be
> +   read for any reason.  The data starts at the DATA pointer, but the TO_FREE
> +   pointer is what should be passed to free(), as there may be an offset.  */
> +struct cpp_converted_source
> +{
> +  char *to_free;
> +  char *data;
> +  size_t len;
> +};
> +cpp_converted_source cpp_get_converted_source (const char *fname,
> +					       const char *input_charset);
> +
>  /* In pch.c */
>  struct save_macro_data;
>  extern int cpp_save_state (cpp_reader *, FILE *);
> @@ -1439,6 +1453,7 @@ class cpp_display_width_computation {
>  /* Convenience functions that are simple use cases for class
>     cpp_display_width_computation.  Tab characters will be expanded to spaces
>     as determined by TABSTOP.  */
> +
>  int cpp_byte_column_to_display_column (const char *data, int data_length,
>  				       int column, int tabstop);
>  inline int cpp_display_width (const char *data, int data_length,
> @@ -1451,4 +1466,7 @@ int cpp_display_column_to_byte_column (const char *data, int data_length,
>  				       int display_col, int tabstop);
>  int cpp_wcwidth (cppchar_t c);
>  
> +bool cpp_input_conversion_is_trivial (const char *input_charset);
> +int cpp_check_utf8_bom (const char *data, size_t data_length);
> +
>  #endif /* ! LIBCPP_CPPLIB_H */

Hello-

As discussed most recently here:

https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575243.html

Here is the updated patch to support -finput-charset in diagnostics. As I
mentioned in the above thread, I adapted the patch in input.h and input.c now
that class file_cache has been created. That change also enabled me to move
the input context state into the file_cache class, so that it doesn't need to
be global. Since global_dc is now in charge of creating the file_cache, it
seemed to make sense to me also to move the initialization function for
frontends to call into the diagnostic_context, so they can call it on the
global_dc. Please let me know if this makes sense, overall it is "less global"
than before, which I think is the direction everyone would prefer. Happy to
adjust as needed. Thanks!

As before, I put the new testcases in a separate gzipped file, because they
have non-standard encodings.

Bootstrap + testsuite looks good on x86-64 Linux, all tests the same before
and after except for the 6 new passes.

-Lewis
From: Lewis Hyatt <lhyatt@gmail.com>
Date: Fri, 30 Jul 2021 12:53:43 -0400
Subject: [PATCH] diagnostics: Support for -finput-charset [PR93067]

Adds the logic to handle -finput-charset in layout_get_source_line(), so that
source lines are converted from their input encodings prior to being output by
diagnostics machinery. Also adds the ability to strip a UTF-8 BOM similarly.

gcc/c-family/ChangeLog:

	PR other/93067
	* c-opts.c (c_common_input_charset_cb): New function.
	(c_common_post_options): Call new function
	diagnostic_initialize_input_context().

gcc/d/ChangeLog:

	PR other/93067
	* d-lang.cc (d_input_charset_callback): New function.
	(d_init): Call new function
	diagnostic_initialize_input_context().

gcc/fortran/ChangeLog:

	PR other/93067
	* cpp.c (gfc_cpp_post_options): Call new function
	diagnostic_initialize_input_context().

gcc/ChangeLog:

	PR other/93067
	* coretypes.h (typedef diagnostic_input_charset_callback): Declare.
	* diagnostic.c (diagnostic_initialize_input_context): New function.
	* diagnostic.h (diagnostic_initialize_input_context): Declare.
	* input.c (default_charset_callback): New function.
	(file_cache::initialize_input_context): New function.
	(file_cache_slot::create): Added ability to convert the input
	according to the input context.
	(file_cache::file_cache): Initialize the new input context.
	(class file_cache_slot): Added new m_alloc_offset member.
	(file_cache_slot::file_cache_slot): Initialize the new member.
	(file_cache_slot::~file_cache_slot): Handle potentially offset buffer.
	(file_cache_slot::maybe_grow): Likewise.
	(file_cache_slot::needs_read_p): Handle NULL fp, which is now possible.
	(file_cache_slot::get_next_line): Likewise.
	* input.h (class file_cache): Added input context member.

libcpp/ChangeLog:

	PR other/93067
	* charset.c (init_iconv_desc): Adapt to permit PFILE argument to
	be NULL.
	(_cpp_convert_input): Likewise. Also move UTF-8 BOM logic to...
	(cpp_check_utf8_bom): ...here.  New function.
	(cpp_input_conversion_is_trivial): New function.
	* files.c (read_file_guts): Allow PFILE argument to be NULL.  Add
	INPUT_CHARSET argument as an alternate source of this information.
	(read_file): Pass the new argument to read_file_guts.
	(cpp_get_converted_source): New function.
	* include/cpplib.h (struct cpp_converted_source): Declare.
	(cpp_get_converted_source): Declare.
	(cpp_input_conversion_is_trivial): Declare.
	(cpp_check_utf8_bom): Declare.

gcc/testsuite/ChangeLog:

	PR other/93067
	* gcc.dg/diagnostic-input-charset-1.c: New test.
	* gcc.dg/diagnostic-input-utf8-bom.c: New test.

diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
index 1c4e832c7ed..961e40c29e7 100644
--- a/gcc/c-family/c-opts.c
+++ b/gcc/c-family/c-opts.c
@@ -188,6 +188,14 @@ c_common_diagnostics_set_defaults (diagnostic_context *context)
   context->opt_permissive = OPT_fpermissive;
 }
 
+/* Input charset configuration for diagnostics.  */
+static const char *
+c_common_input_charset_cb (const char * /*filename*/)
+{
+  const char *cs = cpp_opts->input_charset;
+  return cpp_input_conversion_is_trivial (cs) ? nullptr : cs;
+}
+
 /* Whether options from all C-family languages should be accepted
    quietly.  */
 static bool accept_all_c_family_options = false;
@@ -1136,6 +1144,11 @@ c_common_post_options (const char **pfilename)
   cpp_post_options (parse_in);
   init_global_opts_from_cpp (&global_options, cpp_get_options (parse_in));
 
+  /* Let diagnostics infrastructure know how to convert input files the same
+     way libcpp will do it, namely using the configured input charset and
+     skipping a UTF-8 BOM if present.  */
+  diagnostic_initialize_input_context (global_dc,
+				       c_common_input_charset_cb, true);
   input_location = UNKNOWN_LOCATION;
 
   *pfilename = this_input_filename
diff --git a/gcc/coretypes.h b/gcc/coretypes.h
index 406572e947d..726fcaddda2 100644
--- a/gcc/coretypes.h
+++ b/gcc/coretypes.h
@@ -154,6 +154,7 @@ struct cl_option_handlers;
 struct diagnostic_context;
 class pretty_printer;
 class diagnostic_event_id_t;
+typedef const char * (*diagnostic_input_charset_callback)(const char *);
 
 template<typename T> struct array_traits;
 
diff --git a/gcc/d/d-lang.cc b/gcc/d/d-lang.cc
index 4386a489ff2..fa29a46ab1e 100644
--- a/gcc/d/d-lang.cc
+++ b/gcc/d/d-lang.cc
@@ -50,6 +50,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "output.h"
 #include "print-tree.h"
 #include "debug.h"
+#include "input.h"
 
 #include "d-tree.h"
 #include "id.h"
@@ -362,6 +363,19 @@ d_option_lang_mask (void)
   return CL_D;
 }
 
+/* Implements input charset and BOM skipping configuration for
+   diagnostics.  */
+static const char *d_input_charset_callback (const char * /*filename*/)
+{
+  /* TODO: The input charset is automatically determined by code in
+     dmd/dmodule.c based on the contents of the file.  If this detection
+     logic were factored out and could be reused here, then we would be able
+     to return UTF-16 or UTF-32 as needed here.  For now, we return always
+     NULL, which means no conversion is necessary, i.e. the input is assumed
+     to be UTF-8 when diagnostics read this file.  */
+  return nullptr;
+}
+
 /* Implements the lang_hooks.init routine for language D.  */
 
 static bool
@@ -373,6 +387,11 @@ d_init (void)
   Expression::_init ();
   Objc::_init ();
 
+  /* Diagnostics input init, to enable BOM skipping and
+     input charset conversion.  */
+  diagnostic_initialize_input_context (global_dc,
+				       d_input_charset_callback, true);
+
   /* Back-end init.  */
   global_binding_level = ggc_cleared_alloc <binding_level> ();
   current_binding_level = global_binding_level;
diff --git a/gcc/diagnostic.c b/gcc/diagnostic.c
index 8361f68aace..b3afbeae648 100644
--- a/gcc/diagnostic.c
+++ b/gcc/diagnostic.c
@@ -293,6 +293,17 @@ diagnostic_urls_init (diagnostic_context *context, int value /*= -1 */)
     = determine_url_format ((diagnostic_url_rule_t) value);
 }
 
+/* Create the file_cache, if not already created, and tell it how to
+   translate files on input.  */
+void diagnostic_initialize_input_context (diagnostic_context *context,
+					  diagnostic_input_charset_callback ccb,
+					  bool should_skip_bom)
+{
+  if (!context->m_file_cache)
+    context->m_file_cache = new file_cache;
+  context->m_file_cache->initialize_input_context (ccb, should_skip_bom);
+}
+
 /* Do any cleaning up required after the last diagnostic is emitted.  */
 
 void
diff --git a/gcc/diagnostic.h b/gcc/diagnostic.h
index 7227dae1b6b..f90d20a4f5e 100644
--- a/gcc/diagnostic.h
+++ b/gcc/diagnostic.h
@@ -446,6 +446,25 @@ extern void diagnostic_show_locus (diagnostic_context *,
 				   diagnostic_t diagnostic_kind);
 extern void diagnostic_show_any_path (diagnostic_context *, diagnostic_info *);
 
+/* Because we read source files a second time after the frontend did it the
+   first time, we need to know how the frontend handled things like character
+   set conversion and UTF-8 BOM stripping, in order to make everything
+   consistent.  This function needs to be called by each frontend that requires
+   non-default behavior, to inform the diagnostics infrastructure how input is
+   to be processed.  The default behavior is to do no conversion and not to
+   strip a UTF-8 BOM.
+
+   The callback should return the input charset to be used to convert the given
+   file's contents to UTF-8, or it should return NULL if no conversion is needed
+   for this file.  SHOULD_SKIP_BOM only applies in case no conversion was
+   performed, and if true, it will cause a UTF-8 BOM to be skipped at the
+   beginning of the file.  (In case a conversion was performed, the BOM is
+   rather skipped as part of the conversion process.)  */
+
+void diagnostic_initialize_input_context (diagnostic_context *context,
+					  diagnostic_input_charset_callback ccb,
+					  bool should_skip_bom);
+
 /* Force diagnostics controlled by OPTIDX to be kind KIND.  */
 extern diagnostic_t diagnostic_classify_diagnostic (diagnostic_context *,
 						    int /* optidx */,
diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
index 419cd6accbe..83c4517acdb 100644
--- a/gcc/fortran/cpp.c
+++ b/gcc/fortran/cpp.c
@@ -493,6 +493,12 @@ gfc_cpp_post_options (void)
 
   cpp_post_options (cpp_in);
 
+
+  /* Let diagnostics infrastructure know how to convert input files the same
+     way libcpp will do it, namely, with no charset conversion but with
+     skipping of a UTF-8 BOM if present.  */
+  diagnostic_initialize_input_context (global_dc, nullptr, true);
+
   gfc_cpp_register_include_paths ();
 }
 
diff --git a/gcc/input.c b/gcc/input.c
index de20d983d2c..4b809862e02 100644
--- a/gcc/input.c
+++ b/gcc/input.c
@@ -22,7 +22,6 @@ along with GCC; see the file COPYING3.  If not see
 #include "coretypes.h"
 #include "intl.h"
 #include "diagnostic.h"
-#include "diagnostic-core.h"
 #include "selftest.h"
 #include "cpplib.h"
 
@@ -30,6 +29,20 @@ along with GCC; see the file COPYING3.  If not see
 #define HAVE_ICONV 0
 #endif
 
+/* Input charset configuration.  */
+static const char *default_charset_callback (const char *)
+{
+  return nullptr;
+}
+
+void
+file_cache::initialize_input_context (diagnostic_input_charset_callback ccb,
+				      bool should_skip_bom)
+{
+  in_context.ccb = (ccb ? ccb : default_charset_callback);
+  in_context.should_skip_bom = should_skip_bom;
+}
+
 /* This is a cache used by get_next_line to store the content of a
    file to be searched for file lines.  */
 class file_cache_slot
@@ -51,7 +64,8 @@ public:
 
   void inc_use_count () { m_use_count++; }
 
-  void create (const char *file_path, FILE *fp, unsigned highest_use_count);
+  bool create (const file_cache::input_context &in_context,
+	       const char *file_path, FILE *fp, unsigned highest_use_count);
   void evict ();
 
  private:
@@ -110,6 +124,10 @@ public:
      far.  */
   char *m_data;
 
+  /* The allocated buffer to be freed may start a little earlier than DATA,
+     e.g. if a UTF8 BOM was skipped at the beginning.  */
+  int m_alloc_offset;
+
   /*  The size of the DATA array above.*/
   size_t m_size;
 
@@ -147,6 +165,17 @@ public:
      doesn't explode.  We thus scale total_lines down to
      line_record_size.  */
   vec<line_info, va_heap> m_line_record;
+
+  void offset_buffer (int offset)
+  {
+    gcc_assert (offset < 0 ? m_alloc_offset + offset >= 0
+		: (size_t) offset <= m_size);
+    gcc_assert (m_data);
+    m_alloc_offset += offset;
+    m_data += offset;
+    m_size -= offset;
+  }
+
 };
 
 /* Current position in real source file.  */
@@ -419,21 +448,25 @@ file_cache::add_file (const char *file_path)
 
   unsigned highest_use_count = 0;
   file_cache_slot *r = evicted_cache_tab_entry (&highest_use_count);
-  r->create (file_path, fp, highest_use_count);
+  if (!r->create (in_context, file_path, fp, highest_use_count))
+    return NULL;
   return r;
 }
 
 /* Populate this slot for use on FILE_PATH and FP, dropping any
    existing cached content within it.  */
 
-void
-file_cache_slot::create (const char *file_path, FILE *fp,
+bool
+file_cache_slot::create (const file_cache::input_context &in_context,
+			 const char *file_path, FILE *fp,
 			 unsigned highest_use_count)
 {
   m_file_path = file_path;
   if (m_fp)
     fclose (m_fp);
   m_fp = fp;
+  if (m_alloc_offset)
+    offset_buffer (-m_alloc_offset);
   m_nb_read = 0;
   m_line_start_idx = 0;
   m_line_num = 0;
@@ -443,6 +476,36 @@ file_cache_slot::create (const char *file_path, FILE *fp,
   m_use_count = ++highest_use_count;
   m_total_lines = total_lines_num (file_path);
   m_missing_trailing_newline = true;
+
+
+  /* Check the input configuration to determine if we need to do any
+     transformations, such as charset conversion or BOM skipping.  */
+  if (const char *input_charset = in_context.ccb (file_path))
+    {
+      /* Need a full-blown conversion of the input charset.  */
+      fclose (m_fp);
+      m_fp = NULL;
+      const cpp_converted_source cs
+	= cpp_get_converted_source (file_path, input_charset);
+      if (!cs.data)
+	return false;
+      if (m_data)
+	XDELETEVEC (m_data);
+      m_data = cs.data;
+      m_nb_read = m_size = cs.len;
+      m_alloc_offset = cs.data - cs.to_free;
+    }
+  else if (in_context.should_skip_bom)
+    {
+      if (read_data ())
+	{
+	  const int offset = cpp_check_utf8_bom (m_data, m_nb_read);
+	  offset_buffer (offset);
+	  m_nb_read -= offset;
+	}
+    }
+
+  return true;
 }
 
 /* file_cache's ctor.  */
@@ -450,6 +513,7 @@ file_cache_slot::create (const char *file_path, FILE *fp,
 file_cache::file_cache ()
 : m_file_slots (new file_cache_slot[num_file_slots])
 {
+  initialize_input_context (nullptr, false);
 }
 
 /* file_cache's dtor.  */
@@ -478,8 +542,8 @@ file_cache::lookup_or_add_file (const char *file_path)
 
 file_cache_slot::file_cache_slot ()
 : m_use_count (0), m_file_path (NULL), m_fp (NULL), m_data (0),
-  m_size (0), m_nb_read (0), m_line_start_idx (0), m_line_num (0),
-  m_total_lines (0), m_missing_trailing_newline (true)
+  m_alloc_offset (0), m_size (0), m_nb_read (0), m_line_start_idx (0),
+  m_line_num (0), m_total_lines (0), m_missing_trailing_newline (true)
 {
   m_line_record.create (0);
 }
@@ -495,6 +559,7 @@ file_cache_slot::~file_cache_slot ()
     }
   if (m_data)
     {
+      offset_buffer (-m_alloc_offset);
       XDELETEVEC (m_data);
       m_data = 0;
     }
@@ -509,7 +574,7 @@ file_cache_slot::~file_cache_slot ()
 bool
 file_cache_slot::needs_read_p () const
 {
-  return (m_nb_read == 0
+  return m_fp && (m_nb_read == 0
 	  || m_nb_read == m_size
 	  || (m_line_start_idx >= m_nb_read - 1));
 }
@@ -531,9 +596,20 @@ file_cache_slot::maybe_grow ()
   if (!needs_grow_p ())
     return;
 
-  size_t size = m_size == 0 ? buffer_size : m_size * 2;
-  m_data = XRESIZEVEC (char, m_data, size);
-  m_size = size;
+  if (!m_data)
+    {
+      gcc_assert (m_size == 0 && m_alloc_offset == 0);
+      m_size = buffer_size;
+      m_data = XNEWVEC (char, m_size);
+    }
+  else
+    {
+      const int offset = m_alloc_offset;
+      offset_buffer (-offset);
+      m_size *= 2;
+      m_data = XRESIZEVEC (char, m_data, m_size);
+      offset_buffer (offset);
+    }
 }
 
 /*  Read more data into the cache.  Extends the cache if need be.
@@ -632,7 +708,7 @@ file_cache_slot::get_next_line (char **line, ssize_t *line_len)
       m_missing_trailing_newline = false;
     }
 
-  if (ferror (m_fp))
+  if (m_fp && ferror (m_fp))
     return false;
 
   /* At this point, we've found the end of the of line.  It either
diff --git a/gcc/input.h b/gcc/input.h
index bbcec84c521..e6881072c5f 100644
--- a/gcc/input.h
+++ b/gcc/input.h
@@ -111,6 +111,15 @@ class file_cache
   file_cache_slot *lookup_or_add_file (const char *file_path);
   void forcibly_evict_file (const char *file_path);
 
+  /* See comments in diagnostic.h about the input conversion context.  */
+  struct input_context
+  {
+    diagnostic_input_charset_callback ccb;
+    bool should_skip_bom;
+  };
+  void initialize_input_context (diagnostic_input_charset_callback ccb,
+				 bool should_skip_bom);
+
  private:
   file_cache_slot *evicted_cache_tab_entry (unsigned *highest_use_count);
   file_cache_slot *add_file (const char *file_path);
@@ -119,6 +128,7 @@ class file_cache
  private:
   static const size_t num_file_slots = 16;
   file_cache_slot *m_file_slots;
+  input_context in_context;
 };
 
 extern expanded_location
diff --git a/libcpp/charset.c b/libcpp/charset.c
index 99a9b73e5ab..61881f978a8 100644
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -630,7 +630,11 @@ static const struct cpp_conversion conversion_tab[] = {
    cset_converter structure for conversion from FROM to TO.  If
    iconv_open() fails, issue an error and return an identity
    converter.  Silently return an identity converter if FROM and TO
-   are identical.  */
+   are identical.
+
+   PFILE is only used for generating diagnostics; setting it to NULL
+   suppresses diagnostics.  */
+
 static struct cset_converter
 init_iconv_desc (cpp_reader *pfile, const char *to, const char *from)
 {
@@ -672,25 +676,31 @@ init_iconv_desc (cpp_reader *pfile, const char *to, const char *from)
 
       if (ret.cd == (iconv_t) -1)
 	{
-	  if (errno == EINVAL)
-	    cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
-		       "conversion from %s to %s not supported by iconv",
-		       from, to);
-	  else
-	    cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
-
+	  if (pfile)
+	    {
+	      if (errno == EINVAL)
+		cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
+			   "conversion from %s to %s not supported by iconv",
+			   from, to);
+	      else
+		cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
+	    }
 	  ret.func = convert_no_conversion;
 	}
     }
   else
     {
-      cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
-		 "no iconv implementation, cannot convert from %s to %s",
-		 from, to);
+      if (pfile)
+	{
+	  cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
+		     "no iconv implementation, cannot convert from %s to %s",
+		     from, to);
+	}
       ret.func = convert_no_conversion;
       ret.cd = (iconv_t) -1;
       ret.width = -1;
     }
+
   return ret;
 }
 
@@ -2122,6 +2132,25 @@ _cpp_interpret_identifier (cpp_reader *pfile, const uchar *id, size_t len)
 				  buf, bufp - buf, HT_ALLOC));
 }
 
+
+/* Utility to strip a UTF-8 byte order marking from the beginning
+   of a buffer.  Returns the number of bytes to skip, which currently
+   will be either 0 or 3.  */
+int
+cpp_check_utf8_bom (const char *data, size_t data_length)
+{
+
+#if HOST_CHARSET == HOST_CHARSET_ASCII
+  const unsigned char *udata = (const unsigned char *) data;
+  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
+      && udata[2] == 0xbf)
+    return 3;
+#endif
+
+  return 0;
+}
+
+
 /* Convert an input buffer (containing the complete contents of one
    source file) from INPUT_CHARSET to the source character set.  INPUT
    points to the input buffer, SIZE is its allocated size, and LEN is
@@ -2135,7 +2164,11 @@ _cpp_interpret_identifier (cpp_reader *pfile, const uchar *id, size_t len)
    INPUT is expected to have been allocated with xmalloc.  This
    function will either set *BUFFER_START to INPUT, or free it and set
    *BUFFER_START to a pointer to another xmalloc-allocated block of
-   memory.  */
+   memory.
+
+   PFILE is only used to generate diagnostics; setting it to NULL suppresses
+   diagnostics, and causes a return of NULL if there was any error instead.  */
+
 uchar * 
 _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
 		    uchar *input, size_t size, size_t len,
@@ -2158,17 +2191,27 @@ _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
       to.text = XNEWVEC (uchar, to.asize);
       to.len = 0;
 
-      if (!APPLY_CONVERSION (input_cset, input, len, &to))
-	cpp_error (pfile, CPP_DL_ERROR,
-		   "failure to convert %s to %s",
-		   CPP_OPTION (pfile, input_charset), SOURCE_CHARSET);
-
+      const bool ok = APPLY_CONVERSION (input_cset, input, len, &to);
       free (input);
-    }
 
-  /* Clean up the mess.  */
-  if (input_cset.func == convert_using_iconv)
-    iconv_close (input_cset.cd);
+      /* Clean up the mess.  */
+      if (input_cset.func == convert_using_iconv)
+	iconv_close (input_cset.cd);
+
+      /* Handle conversion failure.  */
+      if (!ok)
+	{
+	  if (!pfile)
+	    {
+	      XDELETEVEC (to.text);
+	      *buffer_start = NULL;
+	      *st_size = 0;
+	      return NULL;
+	    }
+	  cpp_error (pfile, CPP_DL_ERROR, "failure to convert %s to %s",
+		     input_charset, SOURCE_CHARSET);
+	}
+    }
 
   /* Resize buffer if we allocated substantially too much, or if we
      haven't enough space for the \n-terminator or following
@@ -2192,19 +2235,14 @@ _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
 
   buffer = to.text;
   *st_size = to.len;
-#if HOST_CHARSET == HOST_CHARSET_ASCII
-  /* The HOST_CHARSET test just above ensures that the source charset
-     is UTF-8.  So, ignore a UTF-8 BOM if we see one.  Note that
-     glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
+
+  /* Ignore a UTF-8 BOM if we see one and the source charset is UTF-8.  Note
+     that glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
      BOM -- however, even if it did, we would still need this code due
      to the 'convert_no_conversion' case.  */
-  if (to.len >= 3 && to.text[0] == 0xef && to.text[1] == 0xbb
-      && to.text[2] == 0xbf)
-    {
-      *st_size -= 3;
-      buffer += 3;
-    }
-#endif
+  const int bom_len = cpp_check_utf8_bom ((const char *) to.text, to.len);
+  *st_size -= bom_len;
+  buffer += bom_len;
 
   *buffer_start = to.text;
   return buffer;
@@ -2244,6 +2282,13 @@ _cpp_default_encoding (void)
   return current_encoding;
 }
 
+/* Check if the configured input charset requires no conversion, other than
+   possibly stripping a UTF-8 BOM.  */
+bool cpp_input_conversion_is_trivial (const char *input_charset)
+{
+  return !strcasecmp (input_charset, SOURCE_CHARSET);
+}
+
 /* Implementation of class cpp_string_location_reader.  */
 
 /* Constructor for cpp_string_location_reader.  */
diff --git a/libcpp/files.c b/libcpp/files.c
index 6e20fc5887f..c93a03c69ef 100644
--- a/libcpp/files.c
+++ b/libcpp/files.c
@@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile, _cpp_file *file,
 static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
 			      bool *invalid_pch, location_t loc);
 static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
-			    location_t loc);
+			    location_t loc, const char *input_charset);
 static bool read_file (cpp_reader *pfile, _cpp_file *file,
 		       location_t loc);
 static struct cpp_dir *search_path_head (cpp_reader *, const char *fname,
@@ -671,9 +671,12 @@ _cpp_find_file (cpp_reader *pfile, const char *fname, cpp_dir *start_dir,
 
    Use LOC for any diagnostics.
 
+   PFILE may be NULL.  In this case, no diagnostics are issued.
+
    FIXME: Flush file cache and try again if we run out of memory.  */
 static bool
-read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
+read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
+		const char *input_charset)
 {
   ssize_t size, total, count;
   uchar *buf;
@@ -681,8 +684,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
 
   if (S_ISBLK (file->st.st_mode))
     {
-      cpp_error_at (pfile, CPP_DL_ERROR, loc,
-		    "%s is a block device", file->path);
+      if (pfile)
+	cpp_error_at (pfile, CPP_DL_ERROR, loc,
+		      "%s is a block device", file->path);
       return false;
     }
 
@@ -699,8 +703,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
 	 does not bite us.  */
       if (file->st.st_size > INTTYPE_MAXIMUM (ssize_t))
 	{
-	  cpp_error_at (pfile, CPP_DL_ERROR, loc,
-			"%s is too large", file->path);
+	  if (pfile)
+	    cpp_error_at (pfile, CPP_DL_ERROR, loc,
+			  "%s is too large", file->path);
 	  return false;
 	}
 
@@ -733,29 +738,29 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
 
   if (count < 0)
     {
-      cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
+      if (pfile)
+	cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
       free (buf);
       return false;
     }
 
-  if (regular && total != size && STAT_SIZE_RELIABLE (file->st))
+  if (pfile && regular && total != size && STAT_SIZE_RELIABLE (file->st))
     cpp_error_at (pfile, CPP_DL_WARNING, loc,
 	       "%s is shorter than expected", file->path);
 
   file->buffer = _cpp_convert_input (pfile,
-				     CPP_OPTION (pfile, input_charset),
+				     input_charset,
 				     buf, size + 16, total,
 				     &file->buffer_start,
 				     &file->st.st_size);
-  file->buffer_valid = true;
-
-  return true;
+  file->buffer_valid = file->buffer;
+  return file->buffer_valid;
 }
 
 /* Convenience wrapper around read_file_guts that opens the file if
    necessary and closes the file descriptor after reading.  FILE must
    have been passed through find_file() at some stage.  Use LOC for
-   any diagnostics.  */
+   any diagnostics.  Unlike read_file_guts(), PFILE may not be NULL.  */
 static bool
 read_file (cpp_reader *pfile, _cpp_file *file, location_t loc)
 {
@@ -773,7 +778,8 @@ read_file (cpp_reader *pfile, _cpp_file *file, location_t loc)
       return false;
     }
 
-  file->dont_read = !read_file_guts (pfile, file, loc);
+  file->dont_read = !read_file_guts (pfile, file, loc,
+				     CPP_OPTION (pfile, input_charset));
   close (file->fd);
   file->fd = -1;
 
@@ -2145,3 +2151,25 @@ _cpp_has_header (cpp_reader *pfile, const char *fname, int angle_brackets,
   return file->err_no != ENOENT;
 }
 
+/* Read a file and convert to input charset, the same as if it were being read
+   by a cpp_reader.  */
+
+cpp_converted_source
+cpp_get_converted_source (const char *fname, const char *input_charset)
+{
+  cpp_converted_source res = {};
+  _cpp_file file = {};
+  file.fd = -1;
+  file.name = lbasename (fname);
+  file.path = fname;
+  if (!open_file (&file))
+    return res;
+  const bool ok = read_file_guts (NULL, &file, 0, input_charset);
+  close (file.fd);
+  if (!ok)
+    return res;
+  res.to_free = (char *) file.buffer_start;
+  res.data = (char *) file.buffer;
+  res.len = file.st.st_size;
+  return res;
+}
diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
index 7e840635a38..af1429171ea 100644
--- a/libcpp/include/cpplib.h
+++ b/libcpp/include/cpplib.h
@@ -1379,6 +1379,20 @@ extern struct _cpp_file *cpp_get_file (cpp_buffer *);
 extern cpp_buffer *cpp_get_prev (cpp_buffer *);
 extern void cpp_clear_file_cache (cpp_reader *);
 
+/* cpp_get_converted_source returns the contents of the given file, as it exists
+   after cpplib has read it and converted it from the input charset to the
+   source charset.  Return struct will be zero-filled if the data could not be
+   read for any reason.  The data starts at the DATA pointer, but the TO_FREE
+   pointer is what should be passed to free(), as there may be an offset.  */
+struct cpp_converted_source
+{
+  char *to_free;
+  char *data;
+  size_t len;
+};
+cpp_converted_source cpp_get_converted_source (const char *fname,
+					       const char *input_charset);
+
 /* In pch.c */
 struct save_macro_data;
 extern int cpp_save_state (cpp_reader *, FILE *);
@@ -1449,6 +1463,7 @@ class cpp_display_width_computation {
 /* Convenience functions that are simple use cases for class
    cpp_display_width_computation.  Tab characters will be expanded to spaces
    as determined by TABSTOP.  */
+
 int cpp_byte_column_to_display_column (const char *data, int data_length,
 				       int column, int tabstop);
 inline int cpp_display_width (const char *data, int data_length,
@@ -1461,4 +1476,7 @@ int cpp_display_column_to_byte_column (const char *data, int data_length,
 				       int display_col, int tabstop);
 int cpp_wcwidth (cppchar_t c);
 
+bool cpp_input_conversion_is_trivial (const char *input_charset);
+int cpp_check_utf8_bom (const char *data, size_t data_length);
+
 #endif /* ! LIBCPP_CPPLIB_H */
Lewis Hyatt Aug. 24, 2021, 12:17 p.m. UTC | #7
Hello-

I thought it might be a good time to check on this patch please? Thanks!
https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576449.html

-Lewis

On Fri, Jul 30, 2021 at 4:13 PM Lewis Hyatt <lhyatt@gmail.com> wrote:
>
> On Fri, Jan 29, 2021 at 10:46:30AM -0500, Lewis Hyatt wrote:
> > On Tue, Jan 26, 2021 at 04:02:52PM -0500, David Malcolm wrote:
> > > On Fri, 2020-12-18 at 18:03 -0500, Lewis Hyatt wrote:
> > > > Hello-
> > > >
> > > > The attached patch addresses PR93067:
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067#c0
> > > >
> > > > This is similar to the patch I posted last year on the PR, with some
> > > tweaks
> > > > to make it a little simpler. Recapping some of the commentary on the
> > > PR:
> > > >
> > > > When source lines are needed for diagnostics output, they are
> > > retrieved from
> > > > the source file by the fcache infrastructure in input.c, since libcpp
> > > has
> > > > generally already forgotten them (plus not all front ends are using
> > > > libcpp). This infrastructure does not read the files in the same way
> > > as
> > > > libcpp does; in particular, it does not translate the encoding as
> > > requested
> > > > by -finput-charset, and it does not strip a UTF-8 byte-order mark if
> > > > present. The patch adds this ability. My thinking in deciding how to
> > > do it
> > > > was the following:
> > > >
> > > > - Use of -finput-charset is rare, and use of UTF-8 BOMs must be rarer
> > > still,
> > > >   so this patch should try hard not to introduce any worse
> > > performance
> > > >   unless these things are needed.
> > > >
> > > > - It is desirable to reuse libcpp's encoding infrastructure from
> > > charset.c
> > > >   rather than repeat it in input.c. (Notably, libcpp uses iconv but
> > > it also
> > > >   has hand-coded routines for certain charsets to make sure they are
> > > >   available.)
> > > >
> > > > - There is a performance degradation required in order to make use of
> > > libcpp
> > > >   directly, because the input.c infrastructure only reads as much of
> > > the
> > > >   source file as necessary, whereas libcpp interfaces as-is require
> > > to read
> > > >   the entire file into memory.
> > > >
> > > > - It can't be quite as simple as just "only delegate to libcpp if
> > > >   -finput-charset was specified", because the stripping of the UTF-8
> > > BOM has
> > > >   to happen with or without this option.
> > > >
> > > > - So it seemed a reasonable compromise to me, if -finput-charset is
> > > >   specified, then use libcpp to convert the file, otherwise, strip
> > > the BOM
> > > >   in input.c and then process the file the same way it is done now.
> > > There's
> > > >   a little bit of leakage of charset logic from libcpp this way (for
> > > the
> > > >   BOM), but it seems worthwhile, since otherwise, diagnostics would
> > > always
> > > >   be reading the entire file into memory, which is not a cost paid
> > > >   currently.
> > >
> > > Thanks for the patch; sorry about the delay in reviewing it.
> > >
> >
> > Thanks for the comments! Here is an updated patch that addresses your
> > feedback, plus some responses inline below.
> >
> > Bootstrap + regtest all languages was done on x86-64 GNU/Linux. All tests
> > the same before and after, plus 6 new PASS.
> >
> > FAIL 85 85
> > PASS 479191 479197
> > UNSUPPORTED 13664 13664
> > UNTESTED 129 129
> > XFAIL 2292 2292
> > XPASS 30 30
> >
> >
> > > This mostly seems good to me.
> > >
> > > One aspect I'm not quite convinced about is the
> > > input_cpp_context.in_use flag.  The input.c machinery is used by
> > > diagnostics, and so could be used by middle-end warnings for frontends
> > > that don't use libcpp.  Presumably we'd still want to remove the UTF-8
> > > BOM for those, and do encoding fixups if necessary - is it more a case
> > > of initializing things to express what the expected input charset is?
> > > (since that is part of the cpp_options)
> > >
> > > c.opt has:
> > >   finput-charset=
> > >   C ObjC C++ ObjC++ Joined RejectNegative
> > >   -finput-charset=<cset>    Specify the default character set for
> > > source files.
> > >
> > > I believe that D and Go are the two frontends that don't use libcpp for
> > > parsing.  I believe Go source is required to be UTF-8 (unsurprisingly
> > > considering the heritage of both).  I don't know what source encodings
> > > D supports.
> > >
> >
> > For this patch I was rather singularly focused on libcpp, so I looked
> > deeper at the other frontends now. It seems to me that there are basically
> > two questions to answer, and the three frontend styles answer this pair in
> > three different ways.
> >
> > Q1: What is the input charset?
> > A1:
> >
> >     libcpp: Whatever was passed to -finput-charset (note, for Fortran,
> >     -finput-charset is not supported though.)
> >
> >     go: Assume UTF-8.
> >
> >     D: UTF-8, UTF-16, or UTF-32 (the latter two in either
> >        endianness); determined by inspecting the first bytes of the file.
> >
> > Q2: How should a UTF-8 BOM be handled?
> > A2:
> >
> >     libcpp: Treat entirely the same, as if it was not present at all. So
> >     a diagnostic referring to the first non-BOM character in the file will
> >     point to column 1, not to column 4.
> >
> >     go: Treat it like whitespace, ignored for parsing purposes, but still
> >     logically part of the file. A diagnostic referring to the first non-BOM
> >     character in the file will point to column 4.
> >
> >     D: Same as libcpp.
> >
> > So input.c's current behavior (with or without my patch) actually matches
> > the "go" frontend exactly and does the right thing for it. As you
> > thought, though, for D it would be desirable to skip over the UTF-8 BOM
> > too.
> >
> > D adds an additional wrinkle in that, instead of using -finput-charset, it
> > uses its own detection logic -- based on knowledge that the first codepoint
> > in the file must be ASCII, it is able to deduce the encoding from the first
> > few bytes. This means that in D, each source file can have a different
> > encoding, which is not expressible in libcpp frontends currently, and hence
> > the notion of a global input_cpp_context with a constant charset is not
> > sufficient to capture this.
> >
> > In this iteration of the patch, I replaced the input_cpp_context with a more
> > general input_context that can handle all of these cases. The context now
> > consists of a callback function and a bool, which the frontend is supposed
> > to configure. The bool determines whether or not to skip the BOM. The
> > callback, when passed a filename, returns the input charset needed to
> > convert that file. For libcpp, the filename is not used as the charset is
> > determined from -finput-charset for all files. For D, the callback function
> > is currently a stub, but it could be expanded to open the file, read the
> > first few bytes, and determine the charset to be used. I have not
> > implemented it for now, because it seems inelegant to duplicate the logic D
> > already has for this detection, but this logic is part of the dmd/ tree
> > which I think is maintained external to GCC, and so it didn't seem right to
> > attempt to refactor it. I figured there may not be much interest in this
> > feature (diagnostics are already unusable for UTF-16 and UTF-32 sources in
> > D), and this patch makes no change to it. This patch does fix the handling
> > of a UTF-8 BOM in D diagnostics, however.
> >
> > > > Separate from the main patch are two testcases that both fail before
> > > this
> > > > patch and pass after. I attached them gzipped because they use non-
> > > standard
> > > > encodings that won't email well.
> > > >
> > > > The main question I have about the patch is whether I chose a good
> > > way to
> > > > address the following complication. location_get_source_line() in
> > > input.c is
> > > > used to generate diagnostics for all front ends, whether they use
> > > libcpp to
> > > > read the files or not. So input.c needs some way to know whether
> > > libcpp is
> > > > in use or not. I ended up adding a new function
> > > input_initialize_cpp_context(),
> > > > which front ends have to call if they are using libcpp to read their
> > > > files. Currently that is c-family and fortran. I don't see a simpler
> > > way
> > > > to do it at least... Even if there were a better way for input.c to
> > > find out
> > > > the value passed to -finput-charset, it would still need to know
> > > whether
> > > > libcpp is being used or not.
> > >
> > > At some point I want to convert the global state in input.c into a
> > > source_manager class, probably hung off the diagnostic_context, so the
> > > natural place to put that state would be in there (rather than the
> > > fcache being global state).  Perhaps have a source_reader policy class
> > > hung off of the source_manager, which would have responsibility for
> > > reading and converting the file, either piecemeal, or fully for the
> > > "need to use libcpp" case.
> > >
> > > But that's not necessary for this fix.
> > >
> > > > Please let me know if it looks OK (either now or for stage 1,
> > > whatever makes
> > > > sense...) bootstrap + regtest all languages on x86-64 GNU/Linux, all
> > > tests the
> > > > same before & after plus 6 new PASS from the new tests. Thanks!
> > >
> > > Various comments inline below.  In particular, does the patch add a
> > > memory leak in _cpp_convert_input?
> > >
> > > Thanks
> > > Dave
> > >
> > > [...]
> > >
> > > > libcpp/ChangeLog:
> > > >
> > > >         PR other/93067
> > > >         * charset.c (init_iconv_desc): Adapt to permit PFILE argument
> > > to
> > > >         be NULL.
> > > >         (_cpp_convert_input): Likewise. Also move UTF-8 BOM logic
> > > to...
> > > >         (cpp_check_utf8_bom): ...here.  New function.
> > > >         (cpp_input_conversion_is_trivial): New function.
> > > >         * files.c (read_file_guts): Allow PFILE argument to be NULL.
> > > Add
> > > >         INPUT_CHARSET argument as an alternate source of this
> > > information.
> > > >         (cpp_get_converted_source): New function.
> > > >         * include/cpplib.h (struct cpp_converted_source): Declare.
> > > >         (cpp_get_converted_source): Declare.
> > > >         (cpp_input_conversion_is_trivial): Declare.
> > > >         (cpp_check_utf8_bom): Declare.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >         PR other/93067
> > > >         * gcc.dg/diagnostic-input-charset-1.c: New test.
> > > >         * gcc.dg/diagnostic-input-charset-2.c: New test.
> > >
> > > Maybe rename the 2nd test to "diagnostic-input-utf8-bom.c" ?
> > >
> >
> > Done.
> >
> > > >
> > > > diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
> > > > index 59cabd12407..d5aa7859cc1 100644
> > > > --- a/gcc/c-family/c-opts.c
> > > > +++ b/gcc/c-family/c-opts.c
> > > > @@ -1124,6 +1124,10 @@ c_common_post_options (const char **pfilename)
> > > >    cpp_post_options (parse_in);
> > > >    init_global_opts_from_cpp (&global_options, cpp_get_options
> > > (parse_in));
> > > >
> > > > +  /* Let diagnostics infrastructure know we are using libcpp to read
> > > > +     the input.  */
> > > > +  input_initialize_cpp_context (parse_in);
> > >
> > > As noted above I think the most significant thing here is getting the
> > > source encoding from parse_in, so maybe reword the comment accordingly.
> > >
> > > [...]
> > >
> > > > diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
> > > > index 51baf141711..2b12a98afc0 100644
> > > > --- a/gcc/fortran/cpp.c
> > > > +++ b/gcc/fortran/cpp.c
> > > > @@ -493,6 +493,10 @@ gfc_cpp_post_options (void)
> > > >
> > > >    cpp_post_options (cpp_in);
> > > >
> > > > +  /* Let diagnostics infrastructure know we are using libcpp to read
> > > > +     the input.  */
> > > > +  input_initialize_cpp_context (cpp_in);
> > >
> > > Likewise.
> > >
> >
> > I kept this in mind when redoing this part, hopefully it is better now.
> >
> > > [...]
> > >
> > > > diff --git a/gcc/input.c b/gcc/input.c
> > > > index 29d10f06b86..1dcdd464bc1 100644
> > > > --- a/gcc/input.c
> > > > +++ b/gcc/input.c
> > >
> > > [...]
> > >
> > > > @@ -394,6 +432,42 @@ add_file_to_cache_tab (const char *file_path)
> > > >    r->total_lines = total_lines_num (file_path);
> > > >    r->missing_trailing_newline = true;
> > > >
> > > > +  /* If libcpp is managing the reading, then there are two cases we
> > > need to
> > > > +     consider.  If -finput-charset is not in effect, then we just
> > > need to
> > > > +     strip a UTF-8 BOM, so do that ourselves rather than calling
> > > into libcpp so
> > > > +     as to avoid paying the penalty of using libcpp, namely that the
> > > entire file
> > > > +     must be read at once.  In the (generally rare) case that a non-
> > > trivial
> > > > +     -finput-charset is needed, then go ahead and use libcpp to read
> > > the whole
> > > > +     file and do the conversion.  */
> > > > +  if (input_cpp_context.in_use)
> > > > +    {
> > > > +      if (input_cpp_context.conversion_is_trivial)
> > > > +       {
> > > > +         /* Strip the UTF8 BOM if present.  */
> > > > +         if (read_data (r))
> > > > +           {
> > > > +             const int offset = cpp_check_utf8_bom (r->data, r-
> > > >nb_read);
> > > > +             r->offset_buffer (offset);
> > > > +             r->nb_read -= offset;
> > > > +           }
> > >
> > > This assumes that trivial conversions are always UTF8 to UTF8, which
> > > assumes that SOURCE_CHARSET is UTF8, which isn't the case for EBCDIC...
> > >
> > > [...]
> > >
> > > > +/* Utility to strip a UTF-8 byte order marking from the beginning
> > > > +   of a buffer.  Returns the number of bytes to skip, which
> > > currently
> > > > +   will be either 0 or 3.  */
> > > > +int
> > > > +cpp_check_utf8_bom (const char *data, size_t data_length)
> > > > +{
> > > > +
> > > > +#if HOST_CHARSET == HOST_CHARSET_ASCII
> > > > +  const unsigned char *udata = (const unsigned char *) data;
> > > > +  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
> > > > +      && udata[2] == 0xbf)
> > > > +    return 3;
> > > > +#endif
> > > > +
> > > > +  return 0;
> > > > +}
> > >
> > > ...though the check on HOST_CHARSET == HOST_CHARSET_ASCII makes this a
> > > no-op for the EBCDIC case, so that's OK.
> > >
> > > [...]
> > >
> > > > @@ -2158,17 +2191,28 @@ _cpp_convert_input (cpp_reader *pfile, const
> > > char *input_charset,
> > > >        to.text = XNEWVEC (uchar, to.asize);
> > > >        to.len = 0;
> > > >
> > > > -      if (!APPLY_CONVERSION (input_cset, input, len, &to))
> > > > -       cpp_error (pfile, CPP_DL_ERROR,
> > > > -                  "failure to convert %s to %s",
> > > > -                  CPP_OPTION (pfile, input_charset),
> > > SOURCE_CHARSET);
> > > > +      const bool ok = APPLY_CONVERSION (input_cset, input, len,
> > > &to);
> > > >
> > > > -      free (input);
> > > > -    }
> > > > +      /* Handle conversion failure.  */
> > > > +      if (!ok)
> > > > +       {
> > > > +         free (input);
> > > > +         if (!pfile)
> > > > +           {
> > > > +             XDELETEVEC (to.text);
> > > > +             *buffer_start = NULL;
> > > > +             *st_size = 0;
> > > > +             return NULL;
> > > > +           }
> > > > +         cpp_error (pfile, CPP_DL_ERROR,
> > > > +                    "failure to convert %s to %s",
> > > > +                    CPP_OPTION (pfile, input_charset),
> > > SOURCE_CHARSET);
> > > > +       }
> > > > +    }
> > > >
> > >
> > > Doesn't this lose the
> > >   free (input);
> > > for the case where the conversion is successful?  Is this a leak?
> > >
> >
> > Ugh, sorry to waste your time with that mistake. Fixed.
> >
> > > [...]
> > >
> > > > diff --git a/libcpp/files.c b/libcpp/files.c
> > > > index 301b2379a23..178bb9ed1e6 100644
> > > > --- a/libcpp/files.c
> > > > +++ b/libcpp/files.c
> > > > @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile,
> > > _cpp_file *file,
> > > >  static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
> > > >                               bool *invalid_pch, location_t loc);
> > > >  static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
> > > > -                           location_t loc);
> > > > +                           location_t loc, const char *input_charset
> > > = NULL);
> > >
> > > read_file_guts is only used in one place before the patch, so rather
> > > than add a default argument, I think it's simpler to pass in the
> > > input_charset at that call.
> > >
> > > > @@ -671,18 +671,32 @@ _cpp_find_file (cpp_reader *pfile, const char
> > > *fname, cpp_dir *start_dir,
> > > >
> > > >     Use LOC for any diagnostics.
> > > >
> > > > +   The input charset may be specified in the INPUT_CHARSET argument,
> > > or
> > > > +   else it will be taken from PFILE.
> > >
> > > ...and doing so means that input_charset will be non-NULL at all
> > > callers.
> > >
> > > > +   PFILE may be NULL.  In this case, no diagnostics are issued, and
> > > the
> > > > +   input charset must be specified in the arguments.
> > > > +
> > > >     FIXME: Flush file cache and try again if we run out of memory.
> > > */
> > > >  static bool
> > > > -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
> > > > +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
> > > > +               const char *input_charset)
> > > >  {
> > > >    ssize_t size, total, count;
> > > >    uchar *buf;
> > > >    bool regular;
> > > >
> > > > +  if (!input_charset)
> > > > +    {
> > > > +      gcc_assert (pfile);
> > > > +      input_charset = CPP_OPTION (pfile, input_charset);
> > > > +    }
> > >
> > > ...which would allow for this logic to be removed, moving this to the
> > > existing caller, I believe.
> > >
> >
> > Much better your way, done.
> >
> > Thanks for taking a look at it, sorry this version is a bit changed from the
> > previous one. FWIW, the libcpp pieces were not touched other than addressing
> > your comments; the new logic is in input.h/input.c plus the associated
> > tweaks to the frontends.
> >
> > -Lewis
>
> > From: Lewis Hyatt <lhyatt@gmail.com>
> > Date: Wed, 27 Jan 2021 18:23:13 -0500
> > Subject: [PATCH] diagnostics: Support for -finput-charset [PR93067]
> >
> > Adds the logic to handle -finput-charset in layout_get_source_line(), so that
> > source lines are converted from their input encodings prior to being output by
> > diagnostics machinery. Also adds the ability to strip a UTF-8 BOM similarly.
> >
> > gcc/c-family/ChangeLog:
> >
> >       PR other/93067
> >       * c-opts.c (c_common_input_charset_cb): New function.
> >       (c_common_post_options): Call new function input_initialize_context().
> >
> > gcc/d/ChangeLog:
> >
> >       PR other/93067
> >       * d-lang.cc (d_input_charset_callback): New function.
> >       (d_init): Call new function input_initialize_context().
> >
> > gcc/fortran/ChangeLog:
> >
> >       PR other/93067
> >       * cpp.c (gfc_cpp_post_options): Call new function
> >       input_initialize_cpp_context().
> >
> > gcc/ChangeLog:
> >
> >       PR other/93067
> >       * input.c (default_charset_callback): New function.
> >       (input_initialize_context): New function.
> >       (read_data): Add prototype.
> >       (add_file_to_cache_tab): Use libcpp to convert input encoding when
> >       needed. Strip UTF-8 BOM when needed.
> >       (class fcache): Add new members to track input encoding conversion.
> >       (fcache::fcache): Adapt for new members.
> >       (fcache::~fcache): Likewise.
> >       (maybe_grow): Likewise.
> >       (needs_read): Adapt to be aware that fp member may be NULL now.
> >       (get_next_line): Likewise.
> >       * input.h (input_initialize_context): Declare.
> >
> > libcpp/ChangeLog:
> >
> >       PR other/93067
> >       * charset.c (init_iconv_desc): Adapt to permit PFILE argument to
> >       be NULL.
> >       (_cpp_convert_input): Likewise. Also move UTF-8 BOM logic to...
> >       (cpp_check_utf8_bom): ...here.  New function.
> >       (cpp_input_conversion_is_trivial): New function.
> >       * files.c (read_file_guts): Allow PFILE argument to be NULL.  Add
> >       INPUT_CHARSET argument as an alternate source of this information.
> >       (read_file): Pass the new argument to read_file_guts.
> >       (cpp_get_converted_source): New function.
> >       * include/cpplib.h (struct cpp_converted_source): Declare.
> >       (cpp_get_converted_source): Declare.
> >       (cpp_input_conversion_is_trivial): Declare.
> >       (cpp_check_utf8_bom): Declare.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       PR other/93067
> >       * gcc.dg/diagnostic-input-charset-1.c: New test.
> >       * gcc.dg/diagnostic-input-utf8-bom.c: New test.
> >
> > diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
> > index bd15b9cd902..2f58e8413b9 100644
> > --- a/gcc/c-family/c-opts.c
> > +++ b/gcc/c-family/c-opts.c
> > @@ -188,6 +188,14 @@ c_common_diagnostics_set_defaults (diagnostic_context *context)
> >    context->opt_permissive = OPT_fpermissive;
> >  }
> >
> > +/* Input charset configuration for diagnostics.  */
> > +static const char *
> > +c_common_input_charset_cb (const char * /*filename*/)
> > +{
> > +  const char *cs = cpp_opts->input_charset;
> > +  return cpp_input_conversion_is_trivial (cs) ? NULL : cs;
> > +}
> > +
> >  /* Whether options from all C-family languages should be accepted
> >     quietly.  */
> >  static bool accept_all_c_family_options = false;
> > @@ -1131,6 +1139,10 @@ c_common_post_options (const char **pfilename)
> >    cpp_post_options (parse_in);
> >    init_global_opts_from_cpp (&global_options, cpp_get_options (parse_in));
> >
> > +  /* Let diagnostics infrastructure know how to convert input files the same
> > +     way libcpp will do it, namely using the configured input charset and
> > +     skipping a UTF-8 BOM if present.  */
> > +  input_initialize_context (c_common_input_charset_cb, true);
> >    input_location = UNKNOWN_LOCATION;
> >
> >    *pfilename = this_input_filename
> > diff --git a/gcc/d/d-lang.cc b/gcc/d/d-lang.cc
> > index 0fd207da7f3..9f170c09886 100644
> > --- a/gcc/d/d-lang.cc
> > +++ b/gcc/d/d-lang.cc
> > @@ -50,6 +50,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "output.h"
> >  #include "print-tree.h"
> >  #include "debug.h"
> > +#include "input.h"
> >
> >  #include "d-tree.h"
> >  #include "id.h"
> > @@ -362,6 +363,19 @@ d_option_lang_mask (void)
> >    return CL_D;
> >  }
> >
> > +/* Implements input charset and BOM skipping configuration for
> > +   diagnostics.  */
> > +static const char *d_input_charset_callback (const char * /*filename*/)
> > +{
> > +  /* TODO: The input charset is automatically determined by code in
> > +     dmd/dmodule.c based on the contents of the file.  If this detection
> > +     logic were factored out and could be reused here, then we would be able
> > +     to return UTF-16 or UTF-32 as needed here.  For now, we return always
> > +     NULL, which means no conversion is necessary, i.e. the input is assumed
> > +     to be UTF-8 when diagnostics read this file.  */
> > +  return NULL;
> > +}
> > +
> >  /* Implements the lang_hooks.init routine for language D.  */
> >
> >  static bool
> > @@ -373,6 +387,10 @@ d_init (void)
> >    Expression::_init ();
> >    Objc::_init ();
> >
> > +  /* Diagnostics input init, to enable BOM skipping and
> > +     input charset conversion.  */
> > +  input_initialize_context (d_input_charset_callback, true);
> > +
> >    /* Back-end init.  */
> >    global_binding_level = ggc_cleared_alloc <binding_level> ();
> >    current_binding_level = global_binding_level;
> > diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
> > index 419cd6accbe..8b564888673 100644
> > --- a/gcc/fortran/cpp.c
> > +++ b/gcc/fortran/cpp.c
> > @@ -493,6 +493,12 @@ gfc_cpp_post_options (void)
> >
> >    cpp_post_options (cpp_in);
> >
> > +
> > +  /* Let diagnostics infrastructure know how to convert input files the same
> > +     way libcpp will do it, namely, with no charset conversion but with
> > +     skipping of a UTF-8 BOM if present.  */
> > +  input_initialize_context (NULL, true);
> > +
> >    gfc_cpp_register_include_paths ();
> >  }
> >
> > diff --git a/gcc/input.c b/gcc/input.c
> > index 9e39e7df83c..f8f6dc7e4ca 100644
> > --- a/gcc/input.c
> > +++ b/gcc/input.c
> > @@ -30,6 +30,29 @@ along with GCC; see the file COPYING3.  If not see
> >  #define HAVE_ICONV 0
> >  #endif
> >
> > +/* Input charset configuration.  */
> > +static const char *default_charset_callback (const char *)
> > +{
> > +  return NULL;
> > +}
> > +
> > +struct
> > +{
> > +  input_context_charset_callback ccb;
> > +  bool should_skip_bom;
> > +} static input_context =
> > +{
> > +  default_charset_callback,
> > +  false
> > +};
> > +
> > +void input_initialize_context (input_context_charset_callback ccb,
> > +                            bool should_skip_bom)
> > +{
> > +  input_context.ccb = (ccb ? ccb : default_charset_callback);
> > +  input_context.should_skip_bom = should_skip_bom;
> > +}
> > +
> >  /* This is a cache used by get_next_line to store the content of a
> >     file to be searched for file lines.  */
> >  class fcache
> > @@ -78,6 +101,10 @@ public:
> >       far.  */
> >    char *data;
> >
> > +  /* The allocated buffer to be freed may start a little earlier than DATA,
> > +     e.g. if a UTF8 BOM was skipped at the beginning.  */
> > +  int alloc_offset;
> > +
> >    /*  The size of the DATA array above.*/
> >    size_t size;
> >
> > @@ -118,6 +145,17 @@ public:
> >
> >    fcache ();
> >    ~fcache ();
> > +
> > +  void offset_buffer (int offset)
> > +  {
> > +    gcc_assert (offset < 0 ? alloc_offset + offset >= 0
> > +             : (size_t) offset <= size);
> > +    gcc_assert (data);
> > +    alloc_offset += offset;
> > +    data += offset;
> > +    size -= offset;
> > +  }
> > +
> >  };
> >
> >  /* Current position in real source file.  */
> > @@ -364,6 +402,9 @@ evicted_cache_tab_entry (unsigned *highest_use_count)
> >    return to_evict;
> >  }
> >
> > +static bool
> > +read_data (fcache *c);
> > +
> >  /* Create the cache used for the content of a given file to be
> >     accessed by caret diagnostic.  This cache is added to an array of
> >     cache and can be retrieved by lookup_file_in_cache_tab.  This
> > @@ -384,6 +425,8 @@ add_file_to_cache_tab (const char *file_path)
> >    if (r->fp)
> >      fclose (r->fp);
> >    r->fp = fp;
> > +  if (r->alloc_offset)
> > +    r->offset_buffer (-r->alloc_offset);
> >    r->nb_read = 0;
> >    r->line_start_idx = 0;
> >    r->line_num = 0;
> > @@ -394,6 +437,33 @@ add_file_to_cache_tab (const char *file_path)
> >    r->total_lines = total_lines_num (file_path);
> >    r->missing_trailing_newline = true;
> >
> > +  /* Check the frontend configuration to determine if we need to do any
> > +     transformations, such as charset conversion or BOM skipping.  */
> > +  if (const char *input_charset = input_context.ccb (file_path))
> > +    {
> > +      /* Need a full-blown conversion of the input charset.  */
> > +      fclose (r->fp);
> > +      r->fp = NULL;
> > +      const cpp_converted_source cs
> > +     = cpp_get_converted_source (file_path, input_charset);
> > +      if (!cs.data)
> > +     return NULL;
> > +      if (r->data)
> > +     XDELETEVEC (r->data);
> > +      r->data = cs.data;
> > +      r->nb_read = r->size = cs.len;
> > +      r->alloc_offset = cs.data - cs.to_free;
> > +    }
> > +  else if (input_context.should_skip_bom)
> > +    {
> > +      if (read_data (r))
> > +     {
> > +       const int offset = cpp_check_utf8_bom (r->data, r->nb_read);
> > +       r->offset_buffer (offset);
> > +       r->nb_read -= offset;
> > +     }
> > +    }
> > +
> >    return r;
> >  }
> >
> > @@ -415,7 +485,7 @@ lookup_or_add_file_to_cache_tab (const char *file_path)
> >     diagnostic.  */
> >
> >  fcache::fcache ()
> > -: use_count (0), file_path (NULL), fp (NULL), data (0),
> > +: use_count (0), file_path (NULL), fp (NULL), data (0), alloc_offset (0),
> >    size (0), nb_read (0), line_start_idx (0), line_num (0),
> >    total_lines (0), missing_trailing_newline (true)
> >  {
> > @@ -433,6 +503,7 @@ fcache::~fcache ()
> >      }
> >    if (data)
> >      {
> > +      offset_buffer (-alloc_offset);
> >        XDELETEVEC (data);
> >        data = 0;
> >      }
> > @@ -447,9 +518,9 @@ fcache::~fcache ()
> >  static bool
> >  needs_read (fcache *c)
> >  {
> > -  return (c->nb_read == 0
> > -       || c->nb_read == c->size
> > -       || (c->line_start_idx >= c->nb_read - 1));
> > +  return c->fp && (c->nb_read == 0
> > +                || c->nb_read == c->size
> > +                || (c->line_start_idx >= c->nb_read - 1));
> >  }
> >
> >  /*  Return TRUE iff the cache is full and thus needs to be
> > @@ -469,9 +540,20 @@ maybe_grow (fcache *c)
> >    if (!needs_grow (c))
> >      return;
> >
> > -  size_t size = c->size == 0 ? fcache_buffer_size : c->size * 2;
> > -  c->data = XRESIZEVEC (char, c->data, size);
> > -  c->size = size;
> > +  if (!c->data)
> > +    {
> > +      gcc_assert (c->size == 0 && c->alloc_offset == 0);
> > +      c->size = fcache_buffer_size;
> > +      c->data = XNEWVEC (char, c->size);
> > +    }
> > +  else
> > +    {
> > +      const int offset = c->alloc_offset;
> > +      c->offset_buffer (-offset);
> > +      c->size *= 2;
> > +      c->data = XRESIZEVEC (char, c->data, c->size);
> > +      c->offset_buffer (offset);
> > +    }
> >  }
> >
> >  /*  Read more data into the cache.  Extends the cache if need be.
> > @@ -570,7 +652,7 @@ get_next_line (fcache *c, char **line, ssize_t *line_len)
> >        c->missing_trailing_newline = false;
> >      }
> >
> > -  if (ferror (c->fp))
> > +  if (c->fp && ferror (c->fp))
> >      return false;
> >
> >    /* At this point, we've found the end of the of line.  It either
> > diff --git a/gcc/input.h b/gcc/input.h
> > index f1ef5d76cfd..8a793399958 100644
> > --- a/gcc/input.h
> > +++ b/gcc/input.h
> > @@ -214,4 +214,21 @@ class GTY(()) string_concat_db
> >    hash_map <location_hash, string_concat *> *m_table;
> >  };
> >
> > +/* Because we read source files a second time after the frontend did it the
> > +   first time, we need to know how the frontend handled things like character
> > +   set conversion and UTF-8 BOM stripping, in order to make everything
> > +   consistent.  This function needs to be called by each frontend that requires
> > +   non-default behavior, to inform the diagnostics infrastructure how input is
> > +   to be processed.  The default behavior is to do no conversion and not to
> > +   strip a UTF-8 BOM.
> > +
> > +   The callback should return the input charset to be used to convert the given
> > +   file's contents to UTF-8, or it should return NULL if no conversion is needed
> > +   for this file.  SHOULD_SKIP_BOM only applies in case no conversion was
> > +   performed, and if true, it will cause a UTF-8 BOM to be skipped at the
> > +   beginning of the file.  (In case a conversion was performed, the BOM is
> > +   rather skipped as part of the conversion process.)  */
> > +typedef const char * (*input_context_charset_callback)(const char *);
> > +void input_initialize_context (input_context_charset_callback ccb,
> > +                            bool should_skip_bom);
> >  #endif
> > diff --git a/libcpp/charset.c b/libcpp/charset.c
> > index 99a9b73e5ab..61881f978a8 100644
> > --- a/libcpp/charset.c
> > +++ b/libcpp/charset.c
> > @@ -630,7 +630,11 @@ static const struct cpp_conversion conversion_tab[] = {
> >     cset_converter structure for conversion from FROM to TO.  If
> >     iconv_open() fails, issue an error and return an identity
> >     converter.  Silently return an identity converter if FROM and TO
> > -   are identical.  */
> > +   are identical.
> > +
> > +   PFILE is only used for generating diagnostics; setting it to NULL
> > +   suppresses diagnostics.  */
> > +
> >  static struct cset_converter
> >  init_iconv_desc (cpp_reader *pfile, const char *to, const char *from)
> >  {
> > @@ -672,25 +676,31 @@ init_iconv_desc (cpp_reader *pfile, const char *to, const char *from)
> >
> >        if (ret.cd == (iconv_t) -1)
> >       {
> > -       if (errno == EINVAL)
> > -         cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
> > -                    "conversion from %s to %s not supported by iconv",
> > -                    from, to);
> > -       else
> > -         cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
> > -
> > +       if (pfile)
> > +         {
> > +           if (errno == EINVAL)
> > +             cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
> > +                        "conversion from %s to %s not supported by iconv",
> > +                        from, to);
> > +           else
> > +             cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
> > +         }
> >         ret.func = convert_no_conversion;
> >       }
> >      }
> >    else
> >      {
> > -      cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
> > -              "no iconv implementation, cannot convert from %s to %s",
> > -              from, to);
> > +      if (pfile)
> > +     {
> > +       cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
> > +                  "no iconv implementation, cannot convert from %s to %s",
> > +                  from, to);
> > +     }
> >        ret.func = convert_no_conversion;
> >        ret.cd = (iconv_t) -1;
> >        ret.width = -1;
> >      }
> > +
> >    return ret;
> >  }
> >
> > @@ -2122,6 +2132,25 @@ _cpp_interpret_identifier (cpp_reader *pfile, const uchar *id, size_t len)
> >                                 buf, bufp - buf, HT_ALLOC));
> >  }
> >
> > +
> > +/* Utility to strip a UTF-8 byte order marking from the beginning
> > +   of a buffer.  Returns the number of bytes to skip, which currently
> > +   will be either 0 or 3.  */
> > +int
> > +cpp_check_utf8_bom (const char *data, size_t data_length)
> > +{
> > +
> > +#if HOST_CHARSET == HOST_CHARSET_ASCII
> > +  const unsigned char *udata = (const unsigned char *) data;
> > +  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
> > +      && udata[2] == 0xbf)
> > +    return 3;
> > +#endif
> > +
> > +  return 0;
> > +}
> > +
> > +
> >  /* Convert an input buffer (containing the complete contents of one
> >     source file) from INPUT_CHARSET to the source character set.  INPUT
> >     points to the input buffer, SIZE is its allocated size, and LEN is
> > @@ -2135,7 +2164,11 @@ _cpp_interpret_identifier (cpp_reader *pfile, const uchar *id, size_t len)
> >     INPUT is expected to have been allocated with xmalloc.  This
> >     function will either set *BUFFER_START to INPUT, or free it and set
> >     *BUFFER_START to a pointer to another xmalloc-allocated block of
> > -   memory.  */
> > +   memory.
> > +
> > +   PFILE is only used to generate diagnostics; setting it to NULL suppresses
> > +   diagnostics, and causes a return of NULL if there was any error instead.  */
> > +
> >  uchar *
> >  _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
> >                   uchar *input, size_t size, size_t len,
> > @@ -2158,17 +2191,27 @@ _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
> >        to.text = XNEWVEC (uchar, to.asize);
> >        to.len = 0;
> >
> > -      if (!APPLY_CONVERSION (input_cset, input, len, &to))
> > -     cpp_error (pfile, CPP_DL_ERROR,
> > -                "failure to convert %s to %s",
> > -                CPP_OPTION (pfile, input_charset), SOURCE_CHARSET);
> > -
> > +      const bool ok = APPLY_CONVERSION (input_cset, input, len, &to);
> >        free (input);
> > -    }
> >
> > -  /* Clean up the mess.  */
> > -  if (input_cset.func == convert_using_iconv)
> > -    iconv_close (input_cset.cd);
> > +      /* Clean up the mess.  */
> > +      if (input_cset.func == convert_using_iconv)
> > +     iconv_close (input_cset.cd);
> > +
> > +      /* Handle conversion failure.  */
> > +      if (!ok)
> > +     {
> > +       if (!pfile)
> > +         {
> > +           XDELETEVEC (to.text);
> > +           *buffer_start = NULL;
> > +           *st_size = 0;
> > +           return NULL;
> > +         }
> > +       cpp_error (pfile, CPP_DL_ERROR, "failure to convert %s to %s",
> > +                  input_charset, SOURCE_CHARSET);
> > +     }
> > +    }
> >
> >    /* Resize buffer if we allocated substantially too much, or if we
> >       haven't enough space for the \n-terminator or following
> > @@ -2192,19 +2235,14 @@ _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
> >
> >    buffer = to.text;
> >    *st_size = to.len;
> > -#if HOST_CHARSET == HOST_CHARSET_ASCII
> > -  /* The HOST_CHARSET test just above ensures that the source charset
> > -     is UTF-8.  So, ignore a UTF-8 BOM if we see one.  Note that
> > -     glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
> > +
> > +  /* Ignore a UTF-8 BOM if we see one and the source charset is UTF-8.  Note
> > +     that glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
> >       BOM -- however, even if it did, we would still need this code due
> >       to the 'convert_no_conversion' case.  */
> > -  if (to.len >= 3 && to.text[0] == 0xef && to.text[1] == 0xbb
> > -      && to.text[2] == 0xbf)
> > -    {
> > -      *st_size -= 3;
> > -      buffer += 3;
> > -    }
> > -#endif
> > +  const int bom_len = cpp_check_utf8_bom ((const char *) to.text, to.len);
> > +  *st_size -= bom_len;
> > +  buffer += bom_len;
> >
> >    *buffer_start = to.text;
> >    return buffer;
> > @@ -2244,6 +2282,13 @@ _cpp_default_encoding (void)
> >    return current_encoding;
> >  }
> >
> > +/* Check if the configured input charset requires no conversion, other than
> > +   possibly stripping a UTF-8 BOM.  */
> > +bool cpp_input_conversion_is_trivial (const char *input_charset)
> > +{
> > +  return !strcasecmp (input_charset, SOURCE_CHARSET);
> > +}
> > +
> >  /* Implementation of class cpp_string_location_reader.  */
> >
> >  /* Constructor for cpp_string_location_reader.  */
> > diff --git a/libcpp/files.c b/libcpp/files.c
> > index 5ea3f8e1bf3..d91606c1532 100644
> > --- a/libcpp/files.c
> > +++ b/libcpp/files.c
> > @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile, _cpp_file *file,
> >  static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
> >                             bool *invalid_pch, location_t loc);
> >  static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
> > -                         location_t loc);
> > +                         location_t loc, const char *input_charset);
> >  static bool read_file (cpp_reader *pfile, _cpp_file *file,
> >                      location_t loc);
> >  static struct cpp_dir *search_path_head (cpp_reader *, const char *fname,
> > @@ -671,9 +671,12 @@ _cpp_find_file (cpp_reader *pfile, const char *fname, cpp_dir *start_dir,
> >
> >     Use LOC for any diagnostics.
> >
> > +   PFILE may be NULL.  In this case, no diagnostics are issued.
> > +
> >     FIXME: Flush file cache and try again if we run out of memory.  */
> >  static bool
> > -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
> > +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
> > +             const char *input_charset)
> >  {
> >    ssize_t size, total, count;
> >    uchar *buf;
> > @@ -681,8 +684,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
> >
> >    if (S_ISBLK (file->st.st_mode))
> >      {
> > -      cpp_error_at (pfile, CPP_DL_ERROR, loc,
> > -                 "%s is a block device", file->path);
> > +      if (pfile)
> > +     cpp_error_at (pfile, CPP_DL_ERROR, loc,
> > +                   "%s is a block device", file->path);
> >        return false;
> >      }
> >
> > @@ -699,8 +703,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
> >        does not bite us.  */
> >        if (file->st.st_size > INTTYPE_MAXIMUM (ssize_t))
> >       {
> > -       cpp_error_at (pfile, CPP_DL_ERROR, loc,
> > -                     "%s is too large", file->path);
> > +       if (pfile)
> > +         cpp_error_at (pfile, CPP_DL_ERROR, loc,
> > +                       "%s is too large", file->path);
> >         return false;
> >       }
> >
> > @@ -733,29 +738,29 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
> >
> >    if (count < 0)
> >      {
> > -      cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
> > +      if (pfile)
> > +     cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
> >        free (buf);
> >        return false;
> >      }
> >
> > -  if (regular && total != size && STAT_SIZE_RELIABLE (file->st))
> > +  if (pfile && regular && total != size && STAT_SIZE_RELIABLE (file->st))
> >      cpp_error_at (pfile, CPP_DL_WARNING, loc,
> >              "%s is shorter than expected", file->path);
> >
> >    file->buffer = _cpp_convert_input (pfile,
> > -                                  CPP_OPTION (pfile, input_charset),
> > +                                  input_charset,
> >                                    buf, size + 16, total,
> >                                    &file->buffer_start,
> >                                    &file->st.st_size);
> > -  file->buffer_valid = true;
> > -
> > -  return true;
> > +  file->buffer_valid = file->buffer;
> > +  return file->buffer_valid;
> >  }
> >
> >  /* Convenience wrapper around read_file_guts that opens the file if
> >     necessary and closes the file descriptor after reading.  FILE must
> >     have been passed through find_file() at some stage.  Use LOC for
> > -   any diagnostics.  */
> > +   any diagnostics.  Unlike read_file_guts(), PFILE may not be NULL.  */
> >  static bool
> >  read_file (cpp_reader *pfile, _cpp_file *file, location_t loc)
> >  {
> > @@ -773,7 +778,8 @@ read_file (cpp_reader *pfile, _cpp_file *file, location_t loc)
> >        return false;
> >      }
> >
> > -  file->dont_read = !read_file_guts (pfile, file, loc);
> > +  file->dont_read = !read_file_guts (pfile, file, loc,
> > +                                  CPP_OPTION (pfile, input_charset));
> >    close (file->fd);
> >    file->fd = -1;
> >
> > @@ -2118,3 +2124,25 @@ _cpp_has_header (cpp_reader *pfile, const char *fname, int angle_brackets,
> >    return file->err_no != ENOENT;
> >  }
> >
> > +/* Read a file and convert to input charset, the same as if it were being read
> > +   by a cpp_reader.  */
> > +
> > +cpp_converted_source
> > +cpp_get_converted_source (const char *fname, const char *input_charset)
> > +{
> > +  cpp_converted_source res = {};
> > +  _cpp_file file = {};
> > +  file.fd = -1;
> > +  file.name = lbasename (fname);
> > +  file.path = fname;
> > +  if (!open_file (&file))
> > +    return res;
> > +  const bool ok = read_file_guts (NULL, &file, 0, input_charset);
> > +  close (file.fd);
> > +  if (!ok)
> > +    return res;
> > +  res.to_free = (char *) file.buffer_start;
> > +  res.data = (char *) file.buffer;
> > +  res.len = file.st.st_size;
> > +  return res;
> > +}
> > diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
> > index 4467c73284d..d02ef4aa054 100644
> > --- a/libcpp/include/cpplib.h
> > +++ b/libcpp/include/cpplib.h
> > @@ -1369,6 +1369,20 @@ extern struct _cpp_file *cpp_get_file (cpp_buffer *);
> >  extern cpp_buffer *cpp_get_prev (cpp_buffer *);
> >  extern void cpp_clear_file_cache (cpp_reader *);
> >
> > +/* cpp_get_converted_source returns the contents of the given file, as it exists
> > +   after cpplib has read it and converted it from the input charset to the
> > +   source charset.  Return struct will be zero-filled if the data could not be
> > +   read for any reason.  The data starts at the DATA pointer, but the TO_FREE
> > +   pointer is what should be passed to free(), as there may be an offset.  */
> > +struct cpp_converted_source
> > +{
> > +  char *to_free;
> > +  char *data;
> > +  size_t len;
> > +};
> > +cpp_converted_source cpp_get_converted_source (const char *fname,
> > +                                            const char *input_charset);
> > +
> >  /* In pch.c */
> >  struct save_macro_data;
> >  extern int cpp_save_state (cpp_reader *, FILE *);
> > @@ -1439,6 +1453,7 @@ class cpp_display_width_computation {
> >  /* Convenience functions that are simple use cases for class
> >     cpp_display_width_computation.  Tab characters will be expanded to spaces
> >     as determined by TABSTOP.  */
> > +
> >  int cpp_byte_column_to_display_column (const char *data, int data_length,
> >                                      int column, int tabstop);
> >  inline int cpp_display_width (const char *data, int data_length,
> > @@ -1451,4 +1466,7 @@ int cpp_display_column_to_byte_column (const char *data, int data_length,
> >                                      int display_col, int tabstop);
> >  int cpp_wcwidth (cppchar_t c);
> >
> > +bool cpp_input_conversion_is_trivial (const char *input_charset);
> > +int cpp_check_utf8_bom (const char *data, size_t data_length);
> > +
> >  #endif /* ! LIBCPP_CPPLIB_H */
>
> Hello-
>
> As discussed most recently here:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575243.html
>
> Here is the updated patch to support -finput-charset in diagnostics. As I
> mentioned in the above thread, I adapted the patch in input.h and input.c now
> that class file_cache has been created. That change also enabled me to move
> the input context state into the file_cache class, so that it doesn't need to
> be global. Since global_dc is now in charge of creating the file_cache, it
> seemed to make sense to me also to move the initialization function for
> frontends to call into the diagnostic_context, so they can call it on the
> global_dc. Please let me know if this makes sense, overall it is "less global"
> than before, which I think is the direction everyone would prefer. Happy to
> adjust as needed. Thanks!
>
> As before, I put the new testcases in a separate gzipped file, because they
> have non-standard encodings.
>
> Bootstrap + testsuite looks good on x86-64 Linux, all tests the same before
> and after except for the 6 new passes.
>
> -Lewis
David Malcolm Aug. 24, 2021, 10:51 p.m. UTC | #8
On Tue, 2021-08-24 at 08:17 -0400, Lewis Hyatt wrote:
> Hello-
> 
> I thought it might be a good time to check on this patch please?
> Thanks!
> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576449.html
> 
> -Lewis

I went through that latest version of the patch and have no further
suggestions - I like the changes you made to incorporate the changes I
had made to input.c.

The latest version of the patch is OK for trunk.

It might be an idea to rebase it and retest it before pushing it, to
make sure nothing significant has changed in the last few weeks.

Thanks for your work on this, and sorry again for the delay in
reviewing it.

Dave
Lewis Hyatt Aug. 24, 2021, 11:28 p.m. UTC | #9
On Tue, Aug 24, 2021 at 6:51 PM David Malcolm <dmalcolm@redhat.com> wrote:
>
> On Tue, 2021-08-24 at 08:17 -0400, Lewis Hyatt wrote:
> > Hello-
> >
> > I thought it might be a good time to check on this patch please?
> > Thanks!
> > https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576449.html
> >
> > -Lewis
>
> I went through that latest version of the patch and have no further
> suggestions - I like the changes you made to incorporate the changes I
> had made to input.c.
>
> The latest version of the patch is OK for trunk.
>
> It might be an idea to rebase it and retest it before pushing it, to
> make sure nothing significant has changed in the last few weeks.
>
> Thanks for your work on this, and sorry again for the delay in
> reviewing it.
>
> Dave
>
>

OK great, thanks for your time. I will push after retesting.

BTW, do you think it would be worthwhile to work on the other half of
encoding support, i.e. translating from UTF-8 to the user's locale,
when outputting diagnostics? I have probably 90% of a patch that does
this, however it complexifies things a bit and I am not sure if it is
really worth the trouble. What is rather manageable (that my patch in
progress does now) is to replace non-translatable characters with
something like UCN escapes. What is not so easy, is to do this and
preserve the alignment of carets and label lines and such... this
requires making the display width of a character also
locale-dependent, which concept doesn't exist currently. Adding that
feels like a lot of complication for what would be a little-used
feature... Anyway, if you think a patch that does the translation
without preserving the alignment would be useful, I could finish it up
and send it. Otherwise I was kinda inclined to forget about it.
Thanks!

-Lewis
David Malcolm Aug. 25, 2021, 1:45 p.m. UTC | #10
On Tue, 2021-08-24 at 19:28 -0400, Lewis Hyatt wrote:
> On Tue, Aug 24, 2021 at 6:51 PM David Malcolm <dmalcolm@redhat.com>
> wrote:
> > 
> > On Tue, 2021-08-24 at 08:17 -0400, Lewis Hyatt wrote:
> > > Hello-
> > > 
> > > I thought it might be a good time to check on this patch please?
> > > Thanks!
> > > https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576449.html
> > > 
> > > -Lewis
> > 
> > I went through that latest version of the patch and have no further
> > suggestions - I like the changes you made to incorporate the
> > changes I
> > had made to input.c.
> > 
> > The latest version of the patch is OK for trunk.
> > 
> > It might be an idea to rebase it and retest it before pushing it,
> > to
> > make sure nothing significant has changed in the last few weeks.
> > 
> > Thanks for your work on this, and sorry again for the delay in
> > reviewing it.
> > 
> > Dave
> > 
> > 
> 
> OK great, thanks for your time. I will push after retesting.
> 
> BTW, do you think it would be worthwhile to work on the other half of
> encoding support, i.e. translating from UTF-8 to the user's locale,
> when outputting diagnostics? I have probably 90% of a patch that does
> this, however it complexifies things a bit and I am not sure if it is
> really worth the trouble. What is rather manageable (that my patch in
> progress does now) is to replace non-translatable characters with
> something like UCN escapes. What is not so easy, is to do this and
> preserve the alignment of carets and label lines and such... this
> requires making the display width of a character also
> locale-dependent, which concept doesn't exist currently. Adding that
> feels like a lot of complication for what would be a little-used
> feature... Anyway, if you think a patch that does the translation
> without preserving the alignment would be useful, I could finish it
> up
> and send it. Otherwise I was kinda inclined to forget about it.
> Thanks!

Maybe post the patch you have so far, making clear that it's
unfinished/work-in-progress?  I think I'll find it easier to think
about this with a patch in hand rather than a description of a patch,
and also, that way the list archives will have a copy of your work in
case we do want to finish it at some later point (rather than it just
being on your hard drive).

Thanks
Dave
Lewis Hyatt Aug. 27, 2021, 3:16 p.m. UTC | #11
On Wed, Aug 25, 2021 at 9:45 AM David Malcolm <dmalcolm@redhat.com> wrote:
> > BTW, do you think it would be worthwhile to work on the other half of
> > encoding support, i.e. translating from UTF-8 to the user's locale,
> > when outputting diagnostics? I have probably 90% of a patch that does
> > this, however it complexifies things a bit and I am not sure if it is
> > really worth the trouble. What is rather manageable (that my patch in
> > progress does now) is to replace non-translatable characters with
> > something like UCN escapes. What is not so easy, is to do this and
> > preserve the alignment of carets and label lines and such... this
> > requires making the display width of a character also
> > locale-dependent, which concept doesn't exist currently. Adding that
> > feels like a lot of complication for what would be a little-used
> > feature... Anyway, if you think a patch that does the translation
> > without preserving the alignment would be useful, I could finish it
> > up
> > and send it. Otherwise I was kinda inclined to forget about it.
> > Thanks!
>
> Maybe post the patch you have so far, making clear that it's
> unfinished/work-in-progress?  I think I'll find it easier to think
> about this with a patch in hand rather than a description of a patch,
> and also, that way the list archives will have a copy of your work in
> case we do want to finish it at some later point (rather than it just
> being on your hard drive).

I created PR102099 about this
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102099) and posted the
patch there. It's pretty complete, although it has a couple issues I
mention in the PR. Thanks!

-Lewis
diff mbox series

Patch

diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
index 59cabd12407..d5aa7859cc1 100644
--- a/gcc/c-family/c-opts.c
+++ b/gcc/c-family/c-opts.c
@@ -1124,6 +1124,10 @@  c_common_post_options (const char **pfilename)
   cpp_post_options (parse_in);
   init_global_opts_from_cpp (&global_options, cpp_get_options (parse_in));
 
+  /* Let diagnostics infrastructure know we are using libcpp to read
+     the input.  */
+  input_initialize_cpp_context (parse_in);
+
   input_location = UNKNOWN_LOCATION;
 
   *pfilename = this_input_filename
diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
index 51baf141711..2b12a98afc0 100644
--- a/gcc/fortran/cpp.c
+++ b/gcc/fortran/cpp.c
@@ -493,6 +493,10 @@  gfc_cpp_post_options (void)
 
   cpp_post_options (cpp_in);
 
+  /* Let diagnostics infrastructure know we are using libcpp to read
+     the input.  */
+  input_initialize_cpp_context (cpp_in);
+
   gfc_cpp_register_include_paths ();
 }
 
diff --git a/gcc/input.c b/gcc/input.c
index 29d10f06b86..1dcdd464bc1 100644
--- a/gcc/input.c
+++ b/gcc/input.c
@@ -30,6 +30,24 @@  along with GCC; see the file COPYING3.  If not see
 #define HAVE_ICONV 0
 #endif
 
+/* If libcpp is being used to read the data, we need to note the configuration
+   so we can read files back in consistently in location_get_source_line().  */
+struct
+{
+  bool in_use;
+  bool conversion_is_trivial;
+  const char *charset;
+} static input_cpp_context;
+
+void input_initialize_cpp_context (cpp_reader *cpp)
+{
+  input_cpp_context.in_use = true;
+  const cpp_options *opts = cpp_get_options (cpp);
+  input_cpp_context.charset = opts->input_charset;
+  input_cpp_context.conversion_is_trivial
+    = cpp_input_conversion_is_trivial (input_cpp_context.charset);
+}
+
 /* This is a cache used by get_next_line to store the content of a
    file to be searched for file lines.  */
 class fcache
@@ -78,6 +96,10 @@  public:
      far.  */
   char *data;
 
+  /* The allocated buffer to be freed may start a little earlier than DATA,
+     e.g. if a UTF8 BOM was skipped at the beginning.  */
+  int alloc_offset;
+
   /*  The size of the DATA array above.*/
   size_t size;
 
@@ -118,6 +140,17 @@  public:
 
   fcache ();
   ~fcache ();
+
+  void offset_buffer (int offset)
+  {
+    gcc_assert (offset < 0 ? alloc_offset + offset >= 0
+		: (size_t) offset <= size);
+    gcc_assert (data);
+    alloc_offset += offset;
+    data += offset;
+    size -= offset;
+  }
+
 };
 
 /* Current position in real source file.  */
@@ -364,6 +397,9 @@  evicted_cache_tab_entry (unsigned *highest_use_count)
   return to_evict;
 }
 
+static bool
+read_data (fcache *c);
+
 /* Create the cache used for the content of a given file to be
    accessed by caret diagnostic.  This cache is added to an array of
    cache and can be retrieved by lookup_file_in_cache_tab.  This
@@ -384,6 +420,8 @@  add_file_to_cache_tab (const char *file_path)
   if (r->fp)
     fclose (r->fp);
   r->fp = fp;
+  if (r->alloc_offset)
+    r->offset_buffer (-r->alloc_offset);
   r->nb_read = 0;
   r->line_start_idx = 0;
   r->line_num = 0;
@@ -394,6 +432,42 @@  add_file_to_cache_tab (const char *file_path)
   r->total_lines = total_lines_num (file_path);
   r->missing_trailing_newline = true;
 
+  /* If libcpp is managing the reading, then there are two cases we need to
+     consider.  If -finput-charset is not in effect, then we just need to
+     strip a UTF-8 BOM, so do that ourselves rather than calling into libcpp so
+     as to avoid paying the penalty of using libcpp, namely that the entire file
+     must be read at once.  In the (generally rare) case that a non-trivial
+     -finput-charset is needed, then go ahead and use libcpp to read the whole
+     file and do the conversion.  */
+  if (input_cpp_context.in_use)
+    {
+      if (input_cpp_context.conversion_is_trivial)
+	{
+	  /* Strip the UTF8 BOM if present.  */
+	  if (read_data (r))
+	    {
+	      const int offset = cpp_check_utf8_bom (r->data, r->nb_read);
+	      r->offset_buffer (offset);
+	      r->nb_read -= offset;
+	    }
+	}
+      else
+	{
+	  /* Need a full-blown conversion of the input charset.  */
+	  fclose (r->fp);
+	  r->fp = NULL;
+	  const cpp_converted_source cs
+	    = cpp_get_converted_source (file_path, input_cpp_context.charset);
+	  if (!cs.data)
+	    return NULL;
+	  if (r->data)
+	    XDELETEVEC (r->data);
+	  r->data = cs.data;
+	  r->nb_read = r->size = cs.len;
+	  r->alloc_offset = cs.data - cs.to_free;
+	}
+    }
+
   return r;
 }
 
@@ -415,7 +489,7 @@  lookup_or_add_file_to_cache_tab (const char *file_path)
    diagnostic.  */
 
 fcache::fcache ()
-: use_count (0), file_path (NULL), fp (NULL), data (0),
+: use_count (0), file_path (NULL), fp (NULL), data (0), alloc_offset (0),
   size (0), nb_read (0), line_start_idx (0), line_num (0),
   total_lines (0), missing_trailing_newline (true)
 {
@@ -433,6 +507,7 @@  fcache::~fcache ()
     }
   if (data)
     {
+      offset_buffer (-alloc_offset);
       XDELETEVEC (data);
       data = 0;
     }
@@ -447,9 +522,9 @@  fcache::~fcache ()
 static bool
 needs_read (fcache *c)
 {
-  return (c->nb_read == 0
-	  || c->nb_read == c->size
-	  || (c->line_start_idx >= c->nb_read - 1));
+  return c->fp && (c->nb_read == 0
+		   || c->nb_read == c->size
+		   || (c->line_start_idx >= c->nb_read - 1));
 }
 
 /*  Return TRUE iff the cache is full and thus needs to be
@@ -469,9 +544,20 @@  maybe_grow (fcache *c)
   if (!needs_grow (c))
     return;
 
-  size_t size = c->size == 0 ? fcache_buffer_size : c->size * 2;
-  c->data = XRESIZEVEC (char, c->data, size);
-  c->size = size;
+  if (!c->data)
+    {
+      gcc_assert (c->size == 0 && c->alloc_offset == 0);
+      c->size = fcache_buffer_size;
+      c->data = XNEWVEC (char, c->size);
+    }
+  else
+    {
+      const int offset = c->alloc_offset;
+      c->offset_buffer (-offset);
+      c->size *= 2;
+      c->data = XRESIZEVEC (char, c->data, c->size);
+      c->offset_buffer (offset);
+    }
 }
 
 /*  Read more data into the cache.  Extends the cache if need be.
@@ -570,7 +656,7 @@  get_next_line (fcache *c, char **line, ssize_t *line_len)
       c->missing_trailing_newline = false;
     }
 
-  if (ferror (c->fp))
+  if (c->fp && ferror (c->fp))
     return false;
 
   /* At this point, we've found the end of the of line.  It either
diff --git a/gcc/input.h b/gcc/input.h
index 4790a571c6a..0f1c6dc1f27 100644
--- a/gcc/input.h
+++ b/gcc/input.h
@@ -214,4 +214,10 @@  class GTY(()) string_concat_db
   hash_map <location_hash, string_concat *> *m_table;
 };
 
+/* Because we may read files a 2nd time, after libcpp does, in order to emit
+   diagnostics, we need to be aware if libcpp is being used and how it has
+   been configured, e.g., to know the value of -finput-charset.  This function
+   needs to be called by any frontend that is using libcpp to read its data.  */
+struct cpp_reader;
+void input_initialize_cpp_context (cpp_reader *cpp);
 #endif
diff --git a/libcpp/charset.c b/libcpp/charset.c
index 3e5578b1390..d6e4e096d33 100644
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -630,7 +630,11 @@  static const struct cpp_conversion conversion_tab[] = {
    cset_converter structure for conversion from FROM to TO.  If
    iconv_open() fails, issue an error and return an identity
    converter.  Silently return an identity converter if FROM and TO
-   are identical.  */
+   are identical.
+
+   PFILE is only used for generating diagnostics; setting it to NULL
+   suppresses diagnostics.  */
+
 static struct cset_converter
 init_iconv_desc (cpp_reader *pfile, const char *to, const char *from)
 {
@@ -672,25 +676,31 @@  init_iconv_desc (cpp_reader *pfile, const char *to, const char *from)
 
       if (ret.cd == (iconv_t) -1)
 	{
-	  if (errno == EINVAL)
-	    cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
-		       "conversion from %s to %s not supported by iconv",
-		       from, to);
-	  else
-	    cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
-
+	  if (pfile)
+	    {
+	      if (errno == EINVAL)
+		cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
+			   "conversion from %s to %s not supported by iconv",
+			   from, to);
+	      else
+		cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
+	    }
 	  ret.func = convert_no_conversion;
 	}
     }
   else
     {
-      cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
-		 "no iconv implementation, cannot convert from %s to %s",
-		 from, to);
+      if (pfile)
+	{
+	  cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
+		     "no iconv implementation, cannot convert from %s to %s",
+		     from, to);
+	}
       ret.func = convert_no_conversion;
       ret.cd = (iconv_t) -1;
       ret.width = -1;
     }
+
   return ret;
 }
 
@@ -2122,6 +2132,25 @@  _cpp_interpret_identifier (cpp_reader *pfile, const uchar *id, size_t len)
 				  buf, bufp - buf, HT_ALLOC));
 }
 
+
+/* Utility to strip a UTF-8 byte order marking from the beginning
+   of a buffer.  Returns the number of bytes to skip, which currently
+   will be either 0 or 3.  */
+int
+cpp_check_utf8_bom (const char *data, size_t data_length)
+{
+
+#if HOST_CHARSET == HOST_CHARSET_ASCII
+  const unsigned char *udata = (const unsigned char *) data;
+  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
+      && udata[2] == 0xbf)
+    return 3;
+#endif
+
+  return 0;
+}
+
+
 /* Convert an input buffer (containing the complete contents of one
    source file) from INPUT_CHARSET to the source character set.  INPUT
    points to the input buffer, SIZE is its allocated size, and LEN is
@@ -2135,7 +2164,11 @@  _cpp_interpret_identifier (cpp_reader *pfile, const uchar *id, size_t len)
    INPUT is expected to have been allocated with xmalloc.  This
    function will either set *BUFFER_START to INPUT, or free it and set
    *BUFFER_START to a pointer to another xmalloc-allocated block of
-   memory.  */
+   memory.
+
+   PFILE is only used to generate diagnostics; setting it to NULL suppresses
+   diagnostics, and causes a return of NULL if there was any error instead.  */
+
 uchar * 
 _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
 		    uchar *input, size_t size, size_t len,
@@ -2158,17 +2191,28 @@  _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
       to.text = XNEWVEC (uchar, to.asize);
       to.len = 0;
 
-      if (!APPLY_CONVERSION (input_cset, input, len, &to))
-	cpp_error (pfile, CPP_DL_ERROR,
-		   "failure to convert %s to %s",
-		   CPP_OPTION (pfile, input_charset), SOURCE_CHARSET);
+      const bool ok = APPLY_CONVERSION (input_cset, input, len, &to);
 
-      free (input);
-    }
+      /* Clean up the mess.  */
+      if (input_cset.func == convert_using_iconv)
+	iconv_close (input_cset.cd);
 
-  /* Clean up the mess.  */
-  if (input_cset.func == convert_using_iconv)
-    iconv_close (input_cset.cd);
+      /* Handle conversion failure.  */
+      if (!ok)
+	{
+	  free (input);
+	  if (!pfile)
+	    {
+	      XDELETEVEC (to.text);
+	      *buffer_start = NULL;
+	      *st_size = 0;
+	      return NULL;
+	    }
+	  cpp_error (pfile, CPP_DL_ERROR,
+		     "failure to convert %s to %s",
+		     CPP_OPTION (pfile, input_charset), SOURCE_CHARSET);
+	}
+    }
 
   /* Resize buffer if we allocated substantially too much, or if we
      haven't enough space for the \n-terminator or following
@@ -2192,19 +2236,14 @@  _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
 
   buffer = to.text;
   *st_size = to.len;
-#if HOST_CHARSET == HOST_CHARSET_ASCII
-  /* The HOST_CHARSET test just above ensures that the source charset
-     is UTF-8.  So, ignore a UTF-8 BOM if we see one.  Note that
-     glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
+
+  /* Ignore a UTF-8 BOM if we see one and the source charset is UTF-8.  Note
+     that glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
      BOM -- however, even if it did, we would still need this code due
      to the 'convert_no_conversion' case.  */
-  if (to.len >= 3 && to.text[0] == 0xef && to.text[1] == 0xbb
-      && to.text[2] == 0xbf)
-    {
-      *st_size -= 3;
-      buffer += 3;
-    }
-#endif
+  const int bom_len = cpp_check_utf8_bom ((const char *) to.text, to.len);
+  *st_size -= bom_len;
+  buffer += bom_len;
 
   *buffer_start = to.text;
   return buffer;
@@ -2244,6 +2283,13 @@  _cpp_default_encoding (void)
   return current_encoding;
 }
 
+/* Check if the configured input charset requires no conversion, other than
+   possibly stripping a UTF-8 BOM.  */
+bool cpp_input_conversion_is_trivial (const char *input_charset)
+{
+  return !strcasecmp (input_charset, SOURCE_CHARSET);
+}
+
 /* Implementation of class cpp_string_location_reader.  */
 
 /* Constructor for cpp_string_location_reader.  */
diff --git a/libcpp/files.c b/libcpp/files.c
index 301b2379a23..178bb9ed1e6 100644
--- a/libcpp/files.c
+++ b/libcpp/files.c
@@ -173,7 +173,7 @@  static bool pch_open_file (cpp_reader *pfile, _cpp_file *file,
 static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
 			      bool *invalid_pch, location_t loc);
 static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
-			    location_t loc);
+			    location_t loc, const char *input_charset = NULL);
 static bool read_file (cpp_reader *pfile, _cpp_file *file,
 		       location_t loc);
 static struct cpp_dir *search_path_head (cpp_reader *, const char *fname,
@@ -671,18 +671,32 @@  _cpp_find_file (cpp_reader *pfile, const char *fname, cpp_dir *start_dir,
 
    Use LOC for any diagnostics.
 
+   The input charset may be specified in the INPUT_CHARSET argument, or
+   else it will be taken from PFILE.
+
+   PFILE may be NULL.  In this case, no diagnostics are issued, and the
+   input charset must be specified in the arguments.
+
    FIXME: Flush file cache and try again if we run out of memory.  */
 static bool
-read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
+read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
+		const char *input_charset)
 {
   ssize_t size, total, count;
   uchar *buf;
   bool regular;
 
+  if (!input_charset)
+    {
+      gcc_assert (pfile);
+      input_charset = CPP_OPTION (pfile, input_charset);
+    }
+
   if (S_ISBLK (file->st.st_mode))
     {
-      cpp_error_at (pfile, CPP_DL_ERROR, loc,
-		    "%s is a block device", file->path);
+      if (pfile)
+	cpp_error_at (pfile, CPP_DL_ERROR, loc,
+		      "%s is a block device", file->path);
       return false;
     }
 
@@ -699,8 +713,9 @@  read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
 	 does not bite us.  */
       if (file->st.st_size > INTTYPE_MAXIMUM (ssize_t))
 	{
-	  cpp_error_at (pfile, CPP_DL_ERROR, loc,
-			"%s is too large", file->path);
+	  if (pfile)
+	    cpp_error_at (pfile, CPP_DL_ERROR, loc,
+			  "%s is too large", file->path);
 	  return false;
 	}
 
@@ -733,29 +748,29 @@  read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
 
   if (count < 0)
     {
-      cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
+      if (pfile)
+	cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
       free (buf);
       return false;
     }
 
-  if (regular && total != size && STAT_SIZE_RELIABLE (file->st))
+  if (pfile && regular && total != size && STAT_SIZE_RELIABLE (file->st))
     cpp_error_at (pfile, CPP_DL_WARNING, loc,
 	       "%s is shorter than expected", file->path);
 
   file->buffer = _cpp_convert_input (pfile,
-				     CPP_OPTION (pfile, input_charset),
+				     input_charset,
 				     buf, size + 16, total,
 				     &file->buffer_start,
 				     &file->st.st_size);
-  file->buffer_valid = true;
-
-  return true;
+  file->buffer_valid = file->buffer;
+  return file->buffer_valid;
 }
 
 /* Convenience wrapper around read_file_guts that opens the file if
    necessary and closes the file descriptor after reading.  FILE must
    have been passed through find_file() at some stage.  Use LOC for
-   any diagnostics.  */
+   any diagnostics.  Unlike read_file_guts(), PFILE may not be NULL.  */
 static bool
 read_file (cpp_reader *pfile, _cpp_file *file, location_t loc)
 {
@@ -2118,3 +2133,25 @@  _cpp_has_header (cpp_reader *pfile, const char *fname, int angle_brackets,
   return file->err_no != ENOENT;
 }
 
+/* Read a file and convert to input charset, the same as if it were being read
+   by a cpp_reader.  */
+
+cpp_converted_source
+cpp_get_converted_source (const char *fname, const char *input_charset)
+{
+  cpp_converted_source res = {};
+  _cpp_file file = {};
+  file.fd = -1;
+  file.name = lbasename (fname);
+  file.path = fname;
+  if (!open_file (&file))
+    return res;
+  const bool ok = read_file_guts (NULL, &file, 0, input_charset);
+  close (file.fd);
+  if (!ok)
+    return res;
+  res.to_free = (char *) file.buffer_start;
+  res.data = (char *) file.buffer;
+  res.len = file.st.st_size;
+  return res;
+}
diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
index 50d28dc9d5a..d38dd040367 100644
--- a/libcpp/include/cpplib.h
+++ b/libcpp/include/cpplib.h
@@ -1368,6 +1368,20 @@  extern struct _cpp_file *cpp_get_file (cpp_buffer *);
 extern cpp_buffer *cpp_get_prev (cpp_buffer *);
 extern void cpp_clear_file_cache (cpp_reader *);
 
+/* cpp_get_converted_source returns the contents of the given file, as it exists
+   after cpplib has read it and converted it from the input charset to the
+   source charset.  Return struct will be zero-filled if the data could not be
+   read for any reason.  The data starts at the DATA pointer, but the TO_FREE
+   pointer is what should be passed to free(), as there may be an offset.  */
+struct cpp_converted_source
+{
+  char *to_free;
+  char *data;
+  size_t len;
+};
+cpp_converted_source cpp_get_converted_source (const char *fname,
+					       const char *input_charset);
+
 /* In pch.c */
 struct save_macro_data;
 extern int cpp_save_state (cpp_reader *, FILE *);
@@ -1438,6 +1452,7 @@  class cpp_display_width_computation {
 /* Convenience functions that are simple use cases for class
    cpp_display_width_computation.  Tab characters will be expanded to spaces
    as determined by TABSTOP.  */
+
 int cpp_byte_column_to_display_column (const char *data, int data_length,
 				       int column, int tabstop);
 inline int cpp_display_width (const char *data, int data_length,
@@ -1450,4 +1465,7 @@  int cpp_display_column_to_byte_column (const char *data, int data_length,
 				       int display_col, int tabstop);
 int cpp_wcwidth (cppchar_t c);
 
+bool cpp_input_conversion_is_trivial (const char *input_charset);
+int cpp_check_utf8_bom (const char *data, size_t data_length);
+
 #endif /* ! LIBCPP_CPPLIB_H */