diagnostics: Support for -finput-charset [PR93067]

From: Lewis Hyatt <lhyatt@gmail.com>

Hello-

The attached patch addresses PR93067:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067#c0

This is similar to the patch I posted last year on the PR, with some tweaks
to make it a little simpler. Recapping some of the commentary on the PR:

When source lines are needed for diagnostics output, they are retrieved from
the source file by the fcache infrastructure in input.c, since libcpp has
generally already forgotten them (plus not all front ends are using
libcpp). This infrastructure does not read the files in the same way as
libcpp does; in particular, it does not translate the encoding as requested
by -finput-charset, and it does not strip a UTF-8 byte-order mark if
present. The patch adds this ability. My thinking in deciding how to do it
was the following:

- Use of -finput-charset is rare, and use of UTF-8 BOMs must be rarer still,
  so this patch should try hard not to introduce any worse performance
  unless these things are needed.

- It is desirable to reuse libcpp's encoding infrastructure from charset.c
  rather than repeat it in input.c. (Notably, libcpp uses iconv but it also
  has hand-coded routines for certain charsets to make sure they are
  available.)

- Making use of libcpp directly comes with a performance cost, because the
  input.c infrastructure only reads as much of the source file as necessary,
  whereas libcpp's interfaces, as they stand, require reading the entire
  file into memory.

- It can't be quite as simple as just "only delegate to libcpp if
  -finput-charset was specified", because the stripping of the UTF-8 BOM has
  to happen with or without this option.

- So a reasonable compromise seemed to be this: if -finput-charset is
  specified, use libcpp to convert the file; otherwise, strip the BOM in
  input.c and then process the file the same way it is done now. A little
  charset logic leaks out of libcpp this way (for the BOM), but that seems
  worthwhile, since otherwise diagnostics would always read the entire file
  into memory, which is not a cost paid currently.
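
To make the compromise concrete, here is a minimal standalone sketch. It is
not the code from the patch: skip_utf8_bom(), choose_strategy() and the
read_strategy enum are hypothetical names used only for illustration, and a
NULL charset is assumed to mean that -finput-charset was not given.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the patch): return the number of leading
   bytes to skip when BUF starts with the UTF-8 byte-order mark EF BB BF,
   and 0 otherwise.  */
static size_t
skip_utf8_bom (const char *buf, size_t len)
{
  static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
  return (len >= 3 && memcmp (buf, bom, 3) == 0) ? 3 : 0;
}

/* Hypothetical names for the two ways of reading a source file.  */
enum read_strategy
{
  READ_INCREMENTAL,          /* Read lazily; only the BOM needs stripping.  */
  READ_WHOLE_FILE_CONVERTED  /* Read everything and convert the encoding.  */
};

/* Assumption: INPUT_CHARSET is NULL when -finput-charset was not given.
   Only pay for the whole-file read when a conversion was requested.  */
static enum read_strategy
choose_strategy (const char *input_charset)
{
  return input_charset ? READ_WHOLE_FILE_CONVERTED : READ_INCREMENTAL;
}
```

In the common case (no -finput-charset), only the three-byte comparison is
added on top of the existing incremental reads.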

Separate from the main patch are two testcases that both fail before this
patch and pass after. I attached them gzipped because they use non-standard
encodings that won't email well.

The main question I have about the patch is whether I chose a good way to
address the following complication. location_get_source_line() in input.c is
used to generate diagnostics for all front ends, whether they use libcpp to
read the files or not. So input.c needs some way to know whether libcpp is
in use or not. I ended up adding a new function input_initialize_cpp_context(),
which front ends have to call if they are using libcpp to read their
files. Currently those are c-family and fortran. I don't see a simpler way
to do it: even if there were a better way for input.c to find out the value
passed to -finput-charset, it would still need to know whether libcpp is
being used or not.
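
The registration idea can be sketched as follows. This only illustrates the
pattern and is not the patch's implementation: struct reader_ctx,
register_reader_context() and charset_for_diagnostics() are hypothetical
stand-ins, and the real input_initialize_cpp_context() may look quite
different.

```c
#include <stddef.h>

/* Hypothetical stand-in for the state a front end would hand over.  */
struct reader_ctx
{
  const char *input_charset;  /* NULL if no charset was requested.  */
};

/* Remains NULL for front ends that never register, i.e. ones that do not
   read their files through the shared preprocessor library.  */
static const struct reader_ctx *registered_ctx = NULL;

/* Called once by a front end that uses the preprocessor library.  */
void
register_reader_context (const struct reader_ctx *ctx)
{
  registered_ctx = ctx;
}

/* Used by the diagnostics side: report the charset to convert from, or
   NULL when no context was registered or no charset was requested.  */
const char *
charset_for_diagnostics (void)
{
  return registered_ctx ? registered_ctx->input_charset : NULL;
}
```

A front end that never calls the registration function keeps getting the
current behavior, which is the property wanted for the languages that do
not use libcpp.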

Please let me know if it looks OK (either now or for stage 1, whichever makes
sense). Bootstrapped and regtested for all languages on x86-64 GNU/Linux; all
test results are the same before and after, plus 6 new PASS from the new
tests. Thanks!

-Lewis

Adds the logic to handle -finput-charset in location_get_source_line(), so that
source lines are converted from their input encodings prior to being output by
the diagnostics machinery.

gcc/c-family/ChangeLog:

	PR other/93067
	* c-opts.c (c_common_post_options): Call new function
	input_initialize_cpp_context().

gcc/fortran/ChangeLog:

	PR other/93067
	* cpp.c (gfc_cpp_post_options): Call new function
	input_initialize_cpp_context().

gcc/ChangeLog:

	PR other/93067
	* input.c (input_initialize_cpp_context): New function.
	(read_data): Add prototype.
	(add_file_to_cache_tab): Use libcpp to convert input encoding when
	needed.
	(class fcache): Add new members to track input encoding conversion
	via libcpp.
	(fcache::fcache): Adapt for new members.
	(fcache::~fcache): Likewise.
	(maybe_grow): Likewise.
	(needs_read): Adapt to be aware that fp member may be NULL now.
	(get_next_line): Likewise.
	* input.h (struct cpp_reader): Forward declare for use...
	(input_initialize_cpp_context): ...here.  Declare new function.

libcpp/ChangeLog:

	PR other/93067
	* charset.c (init_iconv_desc): Adapt to permit PFILE argument to
	be NULL.
	(_cpp_convert_input): Likewise. Also move UTF-8 BOM logic to...
	(cpp_check_utf8_bom): ...here.  New function.
	(cpp_input_conversion_is_trivial): New function.
	* files.c (read_file_guts): Allow PFILE argument to be NULL.  Add
	INPUT_CHARSET argument as an alternate source of this information.
	(cpp_get_converted_source): New function.
	* include/cpplib.h (struct cpp_converted_source): Declare.
	(cpp_get_converted_source): Declare.
	(cpp_input_conversion_is_trivial): Declare.
	(cpp_check_utf8_bom): Declare.

gcc/testsuite/ChangeLog:

	PR other/93067
	* gcc.dg/diagnostic-input-charset-1.c: New test.
	* gcc.dg/diagnostic-input-charset-2.c: New test.

Message ID	20201218230353.GA6439@ldh-imac.local
State	New
Date	Fri, 18 Dec 2020 18:03:53 -0500