Ensure colorization doesn't corrupt multibyte sequences in diagnostics
diff mbox series

Message ID 20191212232104.GA49742@ldh.local
State New
Headers show
Series
  • Ensure colorization doesn't corrupt multibyte sequences in diagnostics
Related show

Commit Message

Lewis Hyatt Dec. 12, 2019, 11:21 p.m. UTC
Hello-

In the original discussion of implementing UTF-8 identifiers
( https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224#c26 ), I pointed out
that colorization would corrupt the appearance of certain diagnostics. For
example, this code, with -std=c99:

----------
int ٩x;
----------

Produces:

t2.cpp:1:5: error: extended character ٩ is not valid at the start of an identifier
    1 | int ٩x;
      |     ^

The diagnostic location contains only the first byte of the character, so
when colorization is enabled, the ANSI escapes are inserted in the middle
of the UTF-8 sequence and produce corrupted output on the terminal.

I feel like there are two separate issues here:

#1. diagnostic_show_locus() should be sure it will not corrupt output in
this way, regardless of what ranges it is given to work with.

#2. libcpp should probably generate a range that includes the whole UTF-8
character. Actually in other ways the range seems not ideal, for example
if an invalid character appears in the middle of the identifier, the
diagnostic still points to the first byte of the identifier.

The attached patch fixes #1. It's essentially a one-line change, plus a
new selftest. Would you please have a look at it sometime? bootstrap
and testsuite were done on linux x86-64.

Other questions that I have:

- I am not quite clear when a selftest is preferred vs a dejagnu test. In
  this case I stuck with the selftest because color diagnostics don't seem
  to work well with dg-error etc, and it didn't seem worth creating a new
  plugin-based test like g++.dg/plugin just for this. (I also considered
  using the existing g++.dg plugin, but it seems this test should run for
  gcc as well.)

- I wasn't sure if I should create a PR for an issue such as this, if
  there is already a patch readily available. And if I did create a PR,
  not sure if it's preferred to post the patch to gcc-patches, or as an
  attachment to the PR.

- Does it seem worth me looking into #2? I think the patch to address #1 is
  appropriate in any case, because it handles generically all potential
  cases where this may arise, but still perhaps the ranges coming out of
  libcpp could be improved?

Thanks...

-Lewis
gcc/ChangeLog:

2019-12-12  Lewis Hyatt  <lhyatt@gmail.com>

	* diagnostic-show-locus.c (layout::print_source_line): Do not emit
	color codes in the middle of a UTF-8 sequence.
	(test_one_liner_colorized_utf8): New test.
	(test_diagnostic_show_locus_one_liner_utf8): Call the new test.

Comments

Lewis Hyatt Jan. 15, 2020, 12:05 a.m. UTC | #1
Hello-

I thought I might ping this short patch please, just in case it may
make sense to include in GCC 10 along with the other UTF-8-related
fixes to diagnostics. Thanks!

https://gcc.gnu.org/ml/gcc-patches/2019-12/msg00915.html

-Lewis

On Thu, Dec 12, 2019 at 6:21 PM Lewis Hyatt <lhyatt@gmail.com> wrote:
>
> Hello-
>
> In the original discussion of implementing UTF-8 identifiers
> ( https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224#c26 ), I pointed out
> that colorization would corrupt the appearance of certain diagnostics. For
> example, this code, with -std=c99:
>
> ----------
> int ٩x;
> ----------
>
> Produces:
>
> t2.cpp:1:5: error: extended character ٩ is not valid at the start of an identifier
>     1 | int ٩x;
>       |     ^
>
> The diagnostic location contains only the first byte of the character, so
> when colorization is enabled, the ANSI escapes are inserted in the middle
> of the UTF-8 sequence and produce corrupted output on the terminal.
>
> I feel like there are two separate issues here:
>
> #1. diagnostic_show_locus() should be sure it will not corrupt output in
> this way, regardless of what ranges it is given to work with.
>
> #2. libcpp should probably generate a range that includes the whole UTF-8
> character. Actually in other ways the range seems not ideal, for example
> if an invalid character appears in the middle of the identifier, the
> diagnostic still points to the first byte of the identifier.
>
> The attached patch fixes #1. It's essentially a one-line change, plus a
> new selftest. Would you please have a look at it sometime? bootstrap
> and testsuite were done on linux x86-64.
>
> Other questions that I have:
>
> - I am not quite clear when a selftest is preferred vs a dejagnu test. In
>   this case I stuck with the selftest because color diagnostics don't seem
>   to work well with dg-error etc, and it didn't seem worth creating a new
>   plugin-based test like g++.dg/plugin just for this. (I also considered
>   using the existing g++.dg plugin, but it seems this test should run for
>   gcc as well.)
>
> - I wasn't sure if I should create a PR for an issue such as this, if
>   there is already a patch readily available. And if I did create a PR,
>   not sure if it's preferred to post the patch to gcc-patches, or as an
>   attachment to the PR.
>
> - Does it seem worth me looking into #2? I think the patch to address #1 is
>   appropriate in any case, because it handles generically all potential
>   cases where this may arise, but still perhaps the ranges coming out of
>   libcpp could be improved?
>
> Thanks...
>
> -Lewis

Patch
diff mbox series

diff --git a/gcc/diagnostic-show-locus.c b/gcc/diagnostic-show-locus.c
index c87603caf41..7385b8a1692 100644
--- a/gcc/diagnostic-show-locus.c
+++ b/gcc/diagnostic-show-locus.c
@@ -1482,6 +1482,8 @@  layout::print_source_line (linenum_type row, const char *line, int line_bytes,
   int last_non_ws = 0;
   for (int col_byte = 1 + x_offset_bytes; col_byte <= line_bytes; col_byte++)
     {
+      char c = *line;
+
       /* Assuming colorization is enabled for the caret and underline
 	 characters, we may also colorize the associated characters
 	 within the source line.
@@ -1493,8 +1495,13 @@  layout::print_source_line (linenum_type row, const char *line, int line_bytes,
 
 	 For frontends that only generate carets, we don't colorize the
 	 characters above them, since this would look strange (e.g.
-	 colorizing just the first character in a token).  */
-      if (m_colorize_source_p)
+	 colorizing just the first character in a token).
+
+	 We need to avoid inserting color codes ahead of UTF-8 continuation
+	 bytes to avoid corrupting the output, in case the location range
+	 points only to the first byte of a multibyte sequence.  */
+
+      if (m_colorize_source_p && (((unsigned int) c) & 0xC0) != 0x80)
 	{
 	  bool in_range_p;
 	  point_state state;
@@ -1507,7 +1514,6 @@  layout::print_source_line (linenum_type row, const char *line, int line_bytes,
 	  else
 	    m_colorizer.set_normal_text ();
 	}
-      char c = *line;
       if (c == '\0' || c == '\t' || c == '\r')
 	c = ' ';
       if (c != ' ')
@@ -3836,6 +3842,27 @@  test_one_liner_labels_utf8 ()
   }
 }
 
+/* Make sure that colorization codes don't interrupt a multibyte
+   sequence, which would corrupt it.  */
+static void
+test_one_liner_colorized_utf8 ()
+{
+  test_diagnostic_context dc;
+  dc.colorize_source_p = true;
+  diagnostic_color_init (&dc, DIAGNOSTICS_COLOR_YES);
+  const location_t pi = linemap_position_for_column (line_table, 12);
+  rich_location richloc (line_table, pi);
+  diagnostic_show_locus (&dc, &richloc, DK_ERROR);
+
+  /* In order to avoid having the test depend on exactly how the colorization
+     was effected, just confirm there are two pi characters in the output.  */
+  const char *result = pp_formatted_text (dc.printer);
+  const char *null_term = result + strlen (result);
+  const char *first_pi = strstr (result, "\xcf\x80");
+  ASSERT_TRUE (first_pi && first_pi <= null_term - 2);
+  ASSERT_STR_CONTAINS (first_pi + 2, "\xcf\x80");
+}
+
 /* Run the various one-liner tests.  */
 
 static void
@@ -3882,6 +3909,7 @@  test_diagnostic_show_locus_one_liner_utf8 (const line_table_case &case_)
   test_one_liner_many_fixits_1_utf8 ();
   test_one_liner_many_fixits_2_utf8 ();
   test_one_liner_labels_utf8 ();
+  test_one_liner_colorized_utf8 ();
 }
 
 /* Verify that gcc_rich_location::add_location_if_nearby works.  */