diff mbox series

Ensure colorization doesn't corrupt multibyte sequences in diagnostics

Message ID 20191212232104.GA49742@ldh.local
State New
Headers show
Series Ensure colorization doesn't corrupt multibyte sequences in diagnostics | expand

Commit Message

Lewis Hyatt Dec. 12, 2019, 11:21 p.m. UTC
Hello-

In the original discussion of implementing UTF-8 identifiers
( https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224#c26 ), I pointed out
that colorization would corrupt the appearance of certain diagnostics. For
example, this code, with -std=c99:

----------
int ٩x;
----------

Produces:

t2.cpp:1:5: error: extended character ٩ is not valid at the start of an identifier
    1 | int ٩x;
      |     ^

The diagnostic location contains only the first byte of the character, so
when colorization is enabled, the ANSI escapes are inserted in the middle
of the UTF-8 sequence and produce corrupted output on the terminal.

I feel like there are two separate issues here:

#1. diagnostic_show_locus() should be sure it will not corrupt output in
this way, regardless of what ranges it is given to work with.

#2. libcpp should probably generate a range that includes the whole UTF-8
character. Actually in other ways the range seems not ideal, for example
if an invalid character appears in the middle of the identifier, the
diagnostic still points to the first byte of the identifier.

The attached patch fixes #1. It's essentially a one-line change, plus a
new selftest. Would you please have a look at it sometime? bootstrap
and testsuite were done on linux x86-64.

Other questions that I have:

- I am not quite clear when a selftest is preferred vs a dejagnu test. In
  this case I stuck with the selftest because color diagnostics don't seem
  to work well with dg-error etc, and it didn't seem worth creating a new
  plugin-based test like g++.dg/plugin just for this. (I also considered
  using the existing g++.dg plugin, but it seems this test should run for
  gcc as well.)

- I wasn't sure if I should create a PR for an issue such as this, if
  there is already a patch readily available. And if I did create a PR,
  not sure if it's preferred to post the patch to gcc-patches, or as an
  attachment to the PR.

- Does it seem worth me looking into #2? I think the patch to address #1 is
  appropriate in any case, because it handles generically all potential
  cases where this may arise, but still perhaps the ranges coming out of
  libcpp could be improved?

Thanks...

-Lewis
gcc/ChangeLog:

2019-12-12  Lewis Hyatt  <lhyatt@gmail.com>

	* diagnostic-show-locus.c (layout::print_source_line): Do not emit
	color codes in the middle of a UTF-8 sequence.
	(test_one_liner_colorized_utf8): New test.
	(test_diagnostic_show_locus_one_liner_utf8): Call the new test.

Comments

Lewis Hyatt Jan. 15, 2020, 12:05 a.m. UTC | #1
Hello-

I thought I might ping this short patch please, just in case it may
make sense to include in GCC 10 along with the other UTF-8-related
fixes to diagnostics. Thanks!

https://gcc.gnu.org/ml/gcc-patches/2019-12/msg00915.html

-Lewis

On Thu, Dec 12, 2019 at 6:21 PM Lewis Hyatt <lhyatt@gmail.com> wrote:
>
> Hello-
>
> In the original discussion of implementing UTF-8 identifiers
> ( https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224#c26 ), I pointed out
> that colorization would corrupt the appearance of certain diagnostics. For
> example, this code, with -std=c99:
>
> ----------
> int ٩x;
> ----------
>
> Produces:
>
> t2.cpp:1:5: error: extended character ٩ is not valid at the start of an identifier
>     1 | int ٩x;
>       |     ^
>
> The diagnostic location contains only the first byte of the character, so
> when colorization is enabled, the ANSI escapes are inserted in the middle
> of the UTF-8 sequence and produce corrupted output on the terminal.
>
> I feel like there are two separate issues here:
>
> #1. diagnostic_show_locus() should be sure it will not corrupt output in
> this way, regardless of what ranges it is given to work with.
>
> #2. libcpp should probably generate a range that includes the whole UTF-8
> character. Actually in other ways the range seems not ideal, for example
> if an invalid character appears in the middle of the identifier, the
> diagnostic still points to the first byte of the identifier.
>
> The attached patch fixes #1. It's essentially a one-line change, plus a
> new selftest. Would you please have a look at it sometime? bootstrap
> and testsuite were done on linux x86-64.
>
> Other questions that I have:
>
> - I am not quite clear when a selftest is preferred vs a dejagnu test. In
>   this case I stuck with the selftest because color diagnostics don't seem
>   to work well with dg-error etc, and it didn't seem worth creating a new
>   plugin-based test like g++.dg/plugin just for this. (I also considered
>   using the existing g++.dg plugin, but it seems this test should run for
>   gcc as well.)
>
> - I wasn't sure if I should create a PR for an issue such as this, if
>   there is already a patch readily available. And if I did create a PR,
>   not sure if it's preferred to post the patch to gcc-patches, or as an
>   attachment to the PR.
>
> - Does it seem worth me looking into #2? I think the patch to address #1 is
>   appropriate in any case, because it handles generically all potential
>   cases where this may arise, but still perhaps the ranges coming out of
>   libcpp could be improved?
>
> Thanks...
>
> -Lewis
Jeff Law Nov. 13, 2020, 10:27 p.m. UTC | #2
On 1/14/20 5:05 PM, Lewis Hyatt wrote:
> Hello-
>
> I thought I might ping this short patch please, just in case it may
> make sense to include in GCC 10 along with the other UTF-8-related
> fixes to diagnostics. Thanks!
>
> https://gcc.gnu.org/ml/gcc-patches/2019-12/msg00915.html

This is fine for the trunk.  Note that due to the changes to handle
tabs/control bytes will require this patch to be updated.  It may be as
simple as moving the c = dw.next_byte() statement up.


Go ahead and do the necessary update and retest & repost the patch for
archival purposes.  If you have commit privs, go ahead and commit the
updated patch, else indicate in the patch repost that someone needs to
apply it for you.


Thanks for your patience,

Jeff


>> #1. diagnostic_show_locus() should be sure it will not corrupt output in
>> this way, regardless of what ranges it is given to work with.

Yes.


>>
>> #2. libcpp should probably generate a range that includes the whole UTF-8
>> character. Actually in other ways the range seems not ideal, for example
>> if an invalid character appears in the middle of the identifier, the
>> diagnostic still points to the first byte of the identifier.

Probably.  We haven't traditionally worried a  lot about multitbyte
sequences, so I'm not surprised we're not handling them particularly well.


>>
>> The attached patch fixes #1. It's essentially a one-line change, plus a
>> new selftest. Would you please have a look at it sometime? bootstrap
>> and testsuite were done on linux x86-64.
>>
>> Other questions that I have:
>>
>> - I am not quite clear when a selftest is preferred vs a dejagnu test. In
>>   this case I stuck with the selftest because color diagnostics don't seem
>>   to work well with dg-error etc, and it didn't seem worth creating a new
>>   plugin-based test like g++.dg/plugin just for this. (I also considered
>>   using the existing g++.dg plugin, but it seems this test should run for
>>   gcc as well.)

It varies and there's cases that are fine in either and I suspect there
are many tests in the dejagnu suite that would be better as selftests --
selftests are a fairly new concept.


The guidance I would give is the more a particular test is tied to the
internals of the code, the more likely a selftest is the right
approach.  THe more the test needs an end-to-end run through passes of
the compiler, the more it belongs in the dejagnu suite.



>>
>> - I wasn't sure if I should create a PR for an issue such as this, if
>>   there is already a patch readily available. And if I did create a PR,
>>   not sure if it's preferred to post the patch to gcc-patches, or as an
>>   attachment to the PR.

We still prefer patches to go to gcc-patches -- I personally don't troll
BZ looking for attached patches.


>>
>> - Does it seem worth me looking into #2? I think the patch to address #1 is
>>   appropriate in any case, because it handles generically all potential
>>   cases where this may arise, but still perhaps the ranges coming out of
>>   libcpp could be improved?

I don't think it can hurt to look into the difficulty in addressing #2.


jeff
Lewis Hyatt Nov. 14, 2020, 8:33 p.m. UTC | #3
On Fri, Nov 13, 2020 at 5:27 PM Jeff Law <law@redhat.com> wrote:
>
>
> On 1/14/20 5:05 PM, Lewis Hyatt wrote:
> > Hello-
> >
> > I thought I might ping this short patch please, just in case it may
> > make sense to include in GCC 10 along with the other UTF-8-related
> > fixes to diagnostics. Thanks!
> >
> > https://gcc.gnu.org/ml/gcc-patches/2019-12/msg00915.html
>
> This is fine for the trunk.  Note that due to the changes to handle
> tabs/control bytes will require this patch to be updated.  It may be as
> simple as moving the c = dw.next_byte() statement up.
>
>
> Go ahead and do the necessary update and retest & repost the patch for
> archival purposes.  If you have commit privs, go ahead and commit the
> updated patch, else indicate in the patch repost that someone needs to
> apply it for you.
>
>
> Thanks for your patience,
>
> Jeff
>
>
> >> #1. diagnostic_show_locus() should be sure it will not corrupt output in
> >> this way, regardless of what ranges it is given to work with.
>
> Yes.
>
>
> >>
> >> #2. libcpp should probably generate a range that includes the whole UTF-8
> >> character. Actually in other ways the range seems not ideal, for example
> >> if an invalid character appears in the middle of the identifier, the
> >> diagnostic still points to the first byte of the identifier.
>
> Probably.  We haven't traditionally worried a  lot about multitbyte
> sequences, so I'm not surprised we're not handling them particularly well.
>
>
> >>
> >> The attached patch fixes #1. It's essentially a one-line change, plus a
> >> new selftest. Would you please have a look at it sometime? bootstrap
> >> and testsuite were done on linux x86-64.
> >>
> >> Other questions that I have:
> >>
> >> - I am not quite clear when a selftest is preferred vs a dejagnu test. In
> >>   this case I stuck with the selftest because color diagnostics don't seem
> >>   to work well with dg-error etc, and it didn't seem worth creating a new
> >>   plugin-based test like g++.dg/plugin just for this. (I also considered
> >>   using the existing g++.dg plugin, but it seems this test should run for
> >>   gcc as well.)
>
> It varies and there's cases that are fine in either and I suspect there
> are many tests in the dejagnu suite that would be better as selftests --
> selftests are a fairly new concept.
>
>
> The guidance I would give is the more a particular test is tied to the
> internals of the code, the more likely a selftest is the right
> approach.  THe more the test needs an end-to-end run through passes of
> the compiler, the more it belongs in the dejagnu suite.
>
>
>
> >>
> >> - I wasn't sure if I should create a PR for an issue such as this, if
> >>   there is already a patch readily available. And if I did create a PR,
> >>   not sure if it's preferred to post the patch to gcc-patches, or as an
> >>   attachment to the PR.
>
> We still prefer patches to go to gcc-patches -- I personally don't troll
> BZ looking for attached patches.
>
>
> >>
> >> - Does it seem worth me looking into #2? I think the patch to address #1 is
> >>   appropriate in any case, because it handles generically all potential
> >>   cases where this may arise, but still perhaps the ranges coming out of
> >>   libcpp could be improved?
>
> I don't think it can hurt to look into the difficulty in addressing #2.
>
>
> jeff
>

Thanks very much for the detailed comments, that's all very useful to
me. This particular patch was subsumed by r11-2092, which added the
support for tab expansion, since this whole function was redone and
now handles multibyte correctly. Sorry I probably should have updated
the thread for this old patch in addition to mentioning in the new
one, to save you some time. I will try to take a look sometime at the
ranges that libcpp outputs too. Thanks again!

-Lewis
Jeff Law Nov. 18, 2020, 9:41 p.m. UTC | #4
On 11/14/20 1:33 PM, Lewis Hyatt wrote:
> On Fri, Nov 13, 2020 at 5:27 PM Jeff Law <law@redhat.com> wrote:
>>
>> On 1/14/20 5:05 PM, Lewis Hyatt wrote:
>>> Hello-
>>>
>>> I thought I might ping this short patch please, just in case it may
>>> make sense to include in GCC 10 along with the other UTF-8-related
>>> fixes to diagnostics. Thanks!
>>>
>>> https://gcc.gnu.org/ml/gcc-patches/2019-12/msg00915.html
>> This is fine for the trunk.  Note that due to the changes to handle
>> tabs/control bytes will require this patch to be updated.  It may be as
>> simple as moving the c = dw.next_byte() statement up.
>>
>>
>> Go ahead and do the necessary update and retest & repost the patch for
>> archival purposes.  If you have commit privs, go ahead and commit the
>> updated patch, else indicate in the patch repost that someone needs to
>> apply it for you.
>>
>>
>> Thanks for your patience,
>>
>> Jeff
>>
>>
>>>> #1. diagnostic_show_locus() should be sure it will not corrupt output in
>>>> this way, regardless of what ranges it is given to work with.
>> Yes.
>>
>>
>>>> #2. libcpp should probably generate a range that includes the whole UTF-8
>>>> character. Actually in other ways the range seems not ideal, for example
>>>> if an invalid character appears in the middle of the identifier, the
>>>> diagnostic still points to the first byte of the identifier.
>> Probably.  We haven't traditionally worried a  lot about multitbyte
>> sequences, so I'm not surprised we're not handling them particularly well.
>>
>>
>>>> The attached patch fixes #1. It's essentially a one-line change, plus a
>>>> new selftest. Would you please have a look at it sometime? bootstrap
>>>> and testsuite were done on linux x86-64.
>>>>
>>>> Other questions that I have:
>>>>
>>>> - I am not quite clear when a selftest is preferred vs a dejagnu test. In
>>>>   this case I stuck with the selftest because color diagnostics don't seem
>>>>   to work well with dg-error etc, and it didn't seem worth creating a new
>>>>   plugin-based test like g++.dg/plugin just for this. (I also considered
>>>>   using the existing g++.dg plugin, but it seems this test should run for
>>>>   gcc as well.)
>> It varies and there's cases that are fine in either and I suspect there
>> are many tests in the dejagnu suite that would be better as selftests --
>> selftests are a fairly new concept.
>>
>>
>> The guidance I would give is the more a particular test is tied to the
>> internals of the code, the more likely a selftest is the right
>> approach.  THe more the test needs an end-to-end run through passes of
>> the compiler, the more it belongs in the dejagnu suite.
>>
>>
>>
>>>> - I wasn't sure if I should create a PR for an issue such as this, if
>>>>   there is already a patch readily available. And if I did create a PR,
>>>>   not sure if it's preferred to post the patch to gcc-patches, or as an
>>>>   attachment to the PR.
>> We still prefer patches to go to gcc-patches -- I personally don't troll
>> BZ looking for attached patches.
>>
>>
>>>> - Does it seem worth me looking into #2? I think the patch to address #1 is
>>>>   appropriate in any case, because it handles generically all potential
>>>>   cases where this may arise, but still perhaps the ranges coming out of
>>>>   libcpp could be improved?
>> I don't think it can hurt to look into the difficulty in addressing #2.
>>
>>
>> jeff
>>
> Thanks very much for the detailed comments, that's all very useful to
> me. This particular patch was subsumed by r11-2092, which added the
> support for tab expansion, since this whole function was redone and
> now handles multibyte correctly. Sorry I probably should have updated
> the thread for this old patch in addition to mentioning in the new
> one, to save you some time. I will try to take a look sometime at the
> ranges that libcpp outputs too. Thanks again!
No problem.  I'm slogging my way through lots of old stuff.  I can often
determine if something has been subsumed, but wasn't able to in this case.

Thanks,

Jeff
diff mbox series

Patch

diff --git a/gcc/diagnostic-show-locus.c b/gcc/diagnostic-show-locus.c
index c87603caf41..7385b8a1692 100644
--- a/gcc/diagnostic-show-locus.c
+++ b/gcc/diagnostic-show-locus.c
@@ -1482,6 +1482,8 @@  layout::print_source_line (linenum_type row, const char *line, int line_bytes,
   int last_non_ws = 0;
   for (int col_byte = 1 + x_offset_bytes; col_byte <= line_bytes; col_byte++)
     {
+      char c = *line;
+
       /* Assuming colorization is enabled for the caret and underline
 	 characters, we may also colorize the associated characters
 	 within the source line.
@@ -1493,8 +1495,13 @@  layout::print_source_line (linenum_type row, const char *line, int line_bytes,
 
 	 For frontends that only generate carets, we don't colorize the
 	 characters above them, since this would look strange (e.g.
-	 colorizing just the first character in a token).  */
-      if (m_colorize_source_p)
+	 colorizing just the first character in a token).
+
+	 We need to avoid inserting color codes ahead of UTF-8 continuation
+	 bytes to avoid corrupting the output, in case the location range
+	 points only to the first byte of a multibyte sequence.  */
+
+      if (m_colorize_source_p && (((unsigned int) c) & 0xC0) != 0x80)
 	{
 	  bool in_range_p;
 	  point_state state;
@@ -1507,7 +1514,6 @@  layout::print_source_line (linenum_type row, const char *line, int line_bytes,
 	  else
 	    m_colorizer.set_normal_text ();
 	}
-      char c = *line;
       if (c == '\0' || c == '\t' || c == '\r')
 	c = ' ';
       if (c != ' ')
@@ -3836,6 +3842,27 @@  test_one_liner_labels_utf8 ()
   }
 }
 
+/* Make sure that colorization codes don't interrupt a multibyte
+   sequence, which would corrupt it.  */
+static void
+test_one_liner_colorized_utf8 ()
+{
+  test_diagnostic_context dc;
+  dc.colorize_source_p = true;
+  diagnostic_color_init (&dc, DIAGNOSTICS_COLOR_YES);
+  const location_t pi = linemap_position_for_column (line_table, 12);
+  rich_location richloc (line_table, pi);
+  diagnostic_show_locus (&dc, &richloc, DK_ERROR);
+
+  /* In order to avoid having the test depend on exactly how the colorization
+     was effected, just confirm there are two pi characters in the output.  */
+  const char *result = pp_formatted_text (dc.printer);
+  const char *null_term = result + strlen (result);
+  const char *first_pi = strstr (result, "\xcf\x80");
+  ASSERT_TRUE (first_pi && first_pi <= null_term - 2);
+  ASSERT_STR_CONTAINS (first_pi + 2, "\xcf\x80");
+}
+
 /* Run the various one-liner tests.  */
 
 static void
@@ -3882,6 +3909,7 @@  test_diagnostic_show_locus_one_liner_utf8 (const line_table_case &case_)
   test_one_liner_many_fixits_1_utf8 ();
   test_one_liner_many_fixits_2_utf8 ();
   test_one_liner_labels_utf8 ();
+  test_one_liner_colorized_utf8 ();
 }
 
 /* Verify that gcc_rich_location::add_location_if_nearby works.  */