diff mbox series

Thoughts on applying a short diagnostics patch for GCC 10

Message ID 20200415201914.GA96820@ldh-imac.local
State New
Headers show
Series Thoughts on applying a short diagnostics patch for GCC 10 | expand

Commit Message

Lewis Hyatt April 15, 2020, 8:19 p.m. UTC
Hello David-

I would appreciate hearing your thoughts please on the following
(relatively minor) issue... There is a little bug with colorization in
diagnostics: in case a location only points to the first byte of a
multibyte sequence, the colorization produces corrupted output as it
interrupts the UTF-8 sequence. This seems to happen with errors that come
out of cpplib for instance. A test case is:

-------
int ٩x;
-------

compiled with -x c -std=c99.

The fix for this issue is essentially two lines plus a test + comments,
and I had originally submitted it here:

https://gcc.gnu.org/legacy-ml/gcc-patches/2019-12/msg00915.html

Subsequent to that, I also submitted patches to implement tab expansion in
diagnostics. Those patches naturally fixed the colorization issue already
in a different way, because they rewrote this whole function, so I
indicated we might as well forget the simpler older patch. But then we
decided to hold off on the tab expansion feature until GCC 11. That makes
sense for sure, but then I think regarding the two-line fix for the
colorization bug, there is still potentially a good case to push it now?
It is not strictly a regression, but because we added support for UTF-8
identifiers in GCC 10, it is going to be more apparent now than it was
before. Given that and the fact that it's a pretty small change, would it
make sense to go ahead and get this in?

For reference I attached the patch and ChangeLog again here as
well. Bootstrap all languages on x86-64 linux looks good with all reg
tests the same before + after:

FAIL 94 94
PASS 473413 473413
UNSUPPORTED 11503 11503
UNTESTED 195 195
XFAIL 1818 1818
XPASS 36 36

Thanks!

-Lewis
gcc/ChangeLog:

2020-04-15  Lewis Hyatt  <lhyatt@gmail.com>

	* diagnostic-show-locus.c (layout::print_source_line): Do not emit
	colorization codes in the middle of a UTF-8 sequence.
	(test_one_liner_colorized_utf8): New test.
	(test_diagnostic_show_locus_one_liner_utf8): Call the new test.
commit 5e2f6ec837fdf808ba20096bd632be9025f7526c
Author: Lewis Hyatt <lhyatt@gmail.com>
Date:   Wed Apr 15 11:17:33 2020 -0400

    diagnostics: Fix colorization interrupting UTF-8 sequences
diff mbox series

Patch

diff --git a/gcc/diagnostic-show-locus.c b/gcc/diagnostic-show-locus.c
index 4618b4edb7d..0aa42e236ee 100644
--- a/gcc/diagnostic-show-locus.c
+++ b/gcc/diagnostic-show-locus.c
@@ -1494,6 +1494,8 @@  layout::print_source_line (linenum_type row, const char *line, int line_bytes,
   int last_non_ws = 0;
   for (int col_byte = 1 + x_offset_bytes; col_byte <= line_bytes; col_byte++)
     {
+      char c = *line;
+
       /* Assuming colorization is enabled for the caret and underline
 	 characters, we may also colorize the associated characters
 	 within the source line.
@@ -1505,8 +1507,13 @@  layout::print_source_line (linenum_type row, const char *line, int line_bytes,
 
 	 For frontends that only generate carets, we don't colorize the
 	 characters above them, since this would look strange (e.g.
-	 colorizing just the first character in a token).  */
-      if (m_colorize_source_p)
+	 colorizing just the first character in a token).
+
+	 We need to avoid inserting color codes ahead of UTF-8 continuation
+	 bytes to avoid corrupting the output, in case the location range
+	 points only to the first byte of a multibyte sequence.  */
+
+      if (m_colorize_source_p && (((unsigned int) c) & 0xC0) != 0x80)
 	{
 	  bool in_range_p;
 	  point_state state;
@@ -1519,7 +1526,6 @@  layout::print_source_line (linenum_type row, const char *line, int line_bytes,
 	  else
 	    m_colorizer.set_normal_text ();
 	}
-      char c = *line;
       if (c == '\0' || c == '\t' || c == '\r')
 	c = ' ';
       if (c != ' ')
@@ -3854,6 +3860,27 @@  test_one_liner_labels_utf8 ()
   }
 }
 
+/* Make sure that colorization codes don't interrupt a multibyte
+   sequence, which would corrupt it.  */
+static void
+test_one_liner_colorized_utf8 ()
+{
+  test_diagnostic_context dc;
+  dc.colorize_source_p = true;
+  diagnostic_color_init (&dc, DIAGNOSTICS_COLOR_YES);
+  const location_t pi = linemap_position_for_column (line_table, 12);
+  rich_location richloc (line_table, pi);
+  diagnostic_show_locus (&dc, &richloc, DK_ERROR);
+
+  /* In order to avoid having the test depend on exactly how the colorization
+     was effected, just confirm there are two pi characters in the output.  */
+  const char *result = pp_formatted_text (dc.printer);
+  const char *null_term = result + strlen (result);
+  const char *first_pi = strstr (result, "\xcf\x80");
+  ASSERT_TRUE (first_pi && first_pi <= null_term - 2);
+  ASSERT_STR_CONTAINS (first_pi + 2, "\xcf\x80");
+}
+
 /* Run the various one-liner tests.  */
 
 static void
@@ -3900,6 +3927,7 @@  test_diagnostic_show_locus_one_liner_utf8 (const line_table_case &case_)
   test_one_liner_many_fixits_1_utf8 ();
   test_one_liner_many_fixits_2_utf8 ();
   test_one_liner_labels_utf8 ();
+  test_one_liner_colorized_utf8 ();
 }
 
 /* Verify that gcc_rich_location::add_location_if_nearby works.  */