diff mbox series

print full STRING_CST in Gimple dumps (PR 87052)

Message ID 5ee56fe9-bf96-196f-6bd6-6ecbda2d1ca0@gmail.com
State New
Headers show
Series print full STRING_CST in Gimple dumps (PR 87052) | expand

Commit Message

Martin Sebor Aug. 22, 2018, 2:56 a.m. UTC
In the discussion of the fallout from the enhancement for pr71625
it was pointed out that STRING_CSts in Gimple dumps extend only
to the first nul and don't include any subsequent characters,
and that this makes the dumps harder to read and might give rise
to the question whether the code is correct.

In the attached patch I enhance the pretty_print_string() function
to print the full contents of a STRING_CST, including any embedded
nuls to make the dumps clearer.  I got rid of the single digit
escapes like '\1' since they make the string look ambiguous.
If TREE_STRING_LENGTH (node) is guaranteed to be non-zero these
days the test for it being so may be redundant but I figured it's
better to be safe than sorry.

A further enhancement might be to also distinguish the type of
the STRING_CST.

Martin

Comments

Richard Biener Aug. 22, 2018, 12:48 p.m. UTC | #1
On Wed, Aug 22, 2018 at 4:56 AM Martin Sebor <msebor@gmail.com> wrote:
>
> In the discussion of the fallout from the enhancement for pr71625
> it was pointed out that STRING_CSts in Gimple dumps extend only
> to the first nul and don't include any subsequent characters,
> and that this makes the dumps harder to read and might give rise
> to the question whether the code is correct.
>
> In the attached patch I enhance the pretty_print_string() function
> to print the full contents of a STRING_CST, including any embedded
> nuls to make the dumps clearer.  I got rid of the single digit
> escapes like '\1' since they make the string look ambiguous.
> If TREE_STRING_LENGTH (node) is guaranteed to be non-zero these
> days the test for it being so may be redundant but I figured it's
> better to be safe than sorry.
>
> A further enhancement might be to also distinguish the type of
> the STRING_CST.

And somehow indicate whether it is \0 terminated (just thinking of
the GIMPLE FE and how it parses string literals simply by relying
on libcpp).

Can you write a not \0 terminated string literal in C?

The patch is OK.

Thanks,
Richard.

>
> Martin
>
Martin Sebor Aug. 22, 2018, 3:05 p.m. UTC | #2
On 08/22/2018 06:48 AM, Richard Biener wrote:
> On Wed, Aug 22, 2018 at 4:56 AM Martin Sebor <msebor@gmail.com> wrote:
>>
>> In the discussion of the fallout from the enhancement for pr71625
>> it was pointed out that STRING_CSts in Gimple dumps extend only
>> to the first nul and don't include any subsequent characters,
>> and that this makes the dumps harder to read and might give rise
>> to the question whether the code is correct.
>>
>> In the attached patch I enhance the pretty_print_string() function
>> to print the full contents of a STRING_CST, including any embedded
>> nuls to make the dumps clearer.  I got rid of the single digit
>> escapes like '\1' since they make the string look ambiguous.
>> If TREE_STRING_LENGTH (node) is guaranteed to be non-zero these
>> days the test for it being so may be redundant but I figured it's
>> better to be safe than sorry.
>>
>> A further enhancement might be to also distinguish the type of
>> the STRING_CST.
>
> And somehow indicate whether it is \0 terminated (just thinking of
> the GIMPLE FE and how it parses string literals simply by relying
> on libcpp).
>
> Can you write a not \0 terminated string literal in C?

Yes: char a[2] = "12";

I briefly considered making the terminating nul visible when
there was one but there's at least one test that scans GIMPLE
for the expected string and it failed so I didn't pursue this
idea further.

The tests can certainly be changed to look for the new pattern
if we should decide to make a change here.  I'm not sure what
would be best.  Printing "12\x00" for an ordinary nul-terminated
literal looks like there might be an extra nul after the first
one.  Leaving it out doesn't distinguish it from the unterminated
array.  Using the braced-list notation (i.e.,  a = { '1', '2' }
for unterminated arrays doesn't capture the fact that it's
represented as STRING_CST).

I'm open to suggestions.

Martin
Richard Biener Aug. 23, 2018, 12:55 p.m. UTC | #3
On Wed, Aug 22, 2018 at 5:05 PM Martin Sebor <msebor@gmail.com> wrote:
>
> On 08/22/2018 06:48 AM, Richard Biener wrote:
> > On Wed, Aug 22, 2018 at 4:56 AM Martin Sebor <msebor@gmail.com> wrote:
> >>
> >> In the discussion of the fallout from the enhancement for pr71625
> >> it was pointed out that STRING_CSts in Gimple dumps extend only
> >> to the first nul and don't include any subsequent characters,
> >> and that this makes the dumps harder to read and might give rise
> >> to the question whether the code is correct.
> >>
> >> In the attached patch I enhance the pretty_print_string() function
> >> to print the full contents of a STRING_CST, including any embedded
> >> nuls to make the dumps clearer.  I got rid of the single digit
> >> escapes like '\1' since they make the string look ambiguous.
> >> If TREE_STRING_LENGTH (node) is guaranteed to be non-zero these
> >> days the test for it being so may be redundant but I figured it's
> >> better to be safe than sorry.
> >>
> >> A further enhancement might be to also distinguish the type of
> >> the STRING_CST.
> >
> > And somehow indicate whether it is \0 terminated (just thinking of
> > the GIMPLE FE and how it parses string literals simply by relying
> > on libcpp).
> >
> > Can you write a not \0 terminated string literal in C?
>
> Yes: char a[2] = "12";

I thought they are fully defined in translation phase #1 ...

> I briefly considered making the terminating nul visible when
> there was one but there's at least one test that scans GIMPLE
> for the expected string and it failed so I didn't pursue this
> idea further.

Yes, I think it would just confuse people at the moment.  I thought
of somehow marking non-terminated literals, like  "12"[nt] or so ...
or by using an alternate quote "12' (eh).

Anyway, not too important I guess if you for a moment exclude
the -fdump-XXX-gimple and re-parsing with the GIMPLE FE case.

Richard.

> The tests can certainly be changed to look for the new pattern
> if we should decide to make a change here.  I'm not sure what
> would be best.  Printing "12\x00" for an ordinary nul-terminated
> literal looks like there might be an extra nul after the first
> one.  Leaving it out doesn't distinguish it from the unterminated
> array.  Using the braced-list notation (i.e.,  a = { '1', '2' }
> for unterminated arrays doesn't capture the fact that it's
> represented as STRING_CST).
>
> I'm open to suggestions.
>
> Martin
Michael Matz Aug. 23, 2018, 2:07 p.m. UTC | #4
Hi,

On Thu, 23 Aug 2018, Richard Biener wrote:

> > > Can you write a not \0 terminated string literal in C?
> >
> > Yes: char a[2] = "12";
> 
> I thought they are fully defined in translation phase #1 ...

No, you can't write a string literal which is not zero terminated, because 
in translation phase 7 a zero code is appended to all character sequences 
resulting from string literals, which is then used to allocate and 
initialize a static (wide) character array of just the right size, 
including the zero code.

The above construct uses that static char[3] array from the string literal 
to initialize a char[2] array (which is explicitely allowed), and _that_ 
one is not zero terminated.  But it's also no string literal.

(Of course, due to as-if the intermediate char[3] array won't usually be 
explicitely constructed.)


Ciao,
Michael.
Martin Sebor Aug. 23, 2018, 3:13 p.m. UTC | #5
On 08/23/2018 08:07 AM, Michael Matz wrote:
> Hi,
>
> On Thu, 23 Aug 2018, Richard Biener wrote:
>
>>>> Can you write a not \0 terminated string literal in C?
>>>
>>> Yes: char a[2] = "12";
>>
>> I thought they are fully defined in translation phase #1 ...
>
> No, you can't write a string literal which is not zero terminated, because
> in translation phase 7 a zero code is appended to all character sequences
> resulting from string literals, which is then used to allocate and
> initialize a static (wide) character array of just the right size,
> including the zero code.
>
> The above construct uses that static char[3] array from the string literal
> to initialize a char[2] array (which is explicitely allowed), and _that_
> one is not zero terminated.  But it's also no string literal.

Yes, you're right.  I misinterpreted Richard's question as
asking if one could construct an unterminated array of chars
using a string literal.

The distinction that I thought would be useful to capture in
Gimple is between "12" that initializes an array of two elements
(and where the nul doesn't fit) vs "12" that initializes one of
three elements (and where the nul does fit).  It's probably not
terribly important when the array type also appears in Gimple
like in the case above but there are (perhaps corner) cases
where it doesn't, as in the second one below:

   char a[2] = "12";
   char *p = (char[2]){ "12" };

which ends up represented as:

   char a[2];
   char * p;
   char D.1910[2];

   try
     {
       a = "12";
       D.1910 = "12";
       p = &D.1910;
     }

Martin
Richard Biener Aug. 24, 2018, 9:30 a.m. UTC | #6
On Thu, Aug 23, 2018 at 5:13 PM Martin Sebor <msebor@gmail.com> wrote:
>
> On 08/23/2018 08:07 AM, Michael Matz wrote:
> > Hi,
> >
> > On Thu, 23 Aug 2018, Richard Biener wrote:
> >
> >>>> Can you write a not \0 terminated string literal in C?
> >>>
> >>> Yes: char a[2] = "12";
> >>
> >> I thought they are fully defined in translation phase #1 ...
> >
> > No, you can't write a string literal which is not zero terminated, because
> > in translation phase 7 a zero code is appended to all character sequences
> > resulting from string literals, which is then used to allocate and
> > initialize a static (wide) character array of just the right size,
> > including the zero code.
> >
> > The above construct uses that static char[3] array from the string literal
> > to initialize a char[2] array (which is explicitely allowed), and _that_
> > one is not zero terminated.  But it's also no string literal.
>
> Yes, you're right.  I misinterpreted Richard's question as
> asking if one could construct an unterminated array of chars
> using a string literal.
>
> The distinction that I thought would be useful to capture in
> Gimple is between "12" that initializes an array of two elements
> (and where the nul doesn't fit) vs "12" that initializes one of
> three elements (and where the nul does fit).  It's probably not
> terribly important when the array type also appears in Gimple
> like in the case above but there are (perhaps corner) cases
> where it doesn't, as in the second one below:
>
>    char a[2] = "12";
>    char *p = (char[2]){ "12" };
>
> which ends up represented as:
>
>    char a[2];
>    char * p;
>    char D.1910[2];
>
>    try
>      {
>        a = "12";
>        D.1910 = "12";
>        p = &D.1910;
>      }

Yeah, we currently require useless_type_conversion_p between the
LHS type and the RHS type (type of a vs. type of "12") which implies
TYPE_SIZE matches (if constant).  That means GIMPLE expects
at least the type of "12" to not include the NUL.

Richard.

> Martin
diff mbox series

Patch

PR middle-end/87052 - STRING_CST printing incomplete in Gimple dumps

gcc/testsuite/ChangeLog:

	PR middle-end/87052
	* gcc.dg/pr87052.c: New test.

gcc/ChangeLog:

	PR middle-end/87052
	* tree-pretty-print.c (pretty_print_string): Add argument.
	(dump_generic_node): Call to pretty_print_string with string size.

Index: gcc/tree-pretty-print.c
===================================================================
--- gcc/tree-pretty-print.c	(revision 263754)
+++ gcc/tree-pretty-print.c	(working copy)
@@ -37,7 +37,7 @@  along with GCC; see the file COPYING3.  If not see
 
 /* Local functions, macros and variables.  */
 static const char *op_symbol (const_tree);
-static void pretty_print_string (pretty_printer *, const char*);
+static void pretty_print_string (pretty_printer *, const char*, unsigned);
 static void newline_and_indent (pretty_printer *, int);
 static void maybe_init_pretty_print (FILE *);
 static void print_struct_decl (pretty_printer *, const_tree, int, dump_flags_t);
@@ -1800,10 +1800,13 @@  dump_generic_node (pretty_printer *pp, tree node,
       break;
 
     case STRING_CST:
-      pp_string (pp, "\"");
-      pretty_print_string (pp, TREE_STRING_POINTER (node));
-      pp_string (pp, "\"");
-      break;
+      {
+	pp_string (pp, "\"");
+	if (unsigned nbytes = TREE_STRING_LENGTH (node))
+	  pretty_print_string (pp, TREE_STRING_POINTER (node), nbytes);
+	pp_string (pp, "\"");
+	break;
+      }
 
     case VECTOR_CST:
       {
@@ -3865,15 +3868,16 @@  print_call_name (pretty_printer *pp, tree node, du
     }
 }
 
-/* Parses the string STR and replaces new-lines by '\n', tabs by '\t', ...  */
+/* Print the first N characters in the array STR, replacing non-printable
+   characters (including embedded nuls) with unambiguous escape sequences.  */
 
 static void
-pretty_print_string (pretty_printer *pp, const char *str)
+pretty_print_string (pretty_printer *pp, const char *str, unsigned n)
 {
   if (str == NULL)
     return;
 
-  while (*str)
+  for ( ; n; --n, ++str)
     {
       switch (str[0])
 	{
@@ -3913,48 +3917,20 @@  static void
 	  pp_string (pp, "\\'");
 	  break;
 
-	  /* No need to handle \0; the loop terminates on \0.  */
-
-	case '\1':
-	  pp_string (pp, "\\1");
-	  break;
-
-	case '\2':
-	  pp_string (pp, "\\2");
-	  break;
-
-	case '\3':
-	  pp_string (pp, "\\3");
-	  break;
-
-	case '\4':
-	  pp_string (pp, "\\4");
-	  break;
-
-	case '\5':
-	  pp_string (pp, "\\5");
-	  break;
-
-	case '\6':
-	  pp_string (pp, "\\6");
-	  break;
-
-	case '\7':
-	  pp_string (pp, "\\7");
-	  break;
-
 	default:
-	  if (!ISPRINT (str[0]))
+	  if (str[0] || n > 1)
 	    {
-	      char buf[5];
-	      sprintf (buf, "\\x%x", (unsigned char)str[0]);
-	      pp_string (pp, buf);
+	      if (!ISPRINT (str[0]))
+		{
+		  char buf[5];
+		  sprintf (buf, "\\x%02x", (unsigned char)str[0]);
+		  pp_string (pp, buf);
+		}
+	      else
+		pp_character (pp, str[0]);
+	      break;
 	    }
-	  else
-	    pp_character (pp, str[0]);
-	  break;
 	}
-      str++;
     }
 }
 
Index: gcc/testsuite/gcc.dg/pr87052.c
===================================================================
--- gcc/testsuite/gcc.dg/pr87052.c	(nonexistent)
+++ gcc/testsuite/gcc.dg/pr87052.c	(working copy)
@@ -0,0 +1,41 @@ 
+/* PR middle-end/87052 - STRING_CST printing incomplete in Gimple dumps
+   { dg-do compile }
+   { dg-options "-fdump-tree-gimple" } */
+
+void sink (const void*, ...);
+
+void test (void)
+{
+  const char a[3] = "\000ab";
+
+  /* Expect the following in the dump:
+     a = "\x00ab"; */
+
+  const char b[] = { 'a', 0, 'b', 'c' };
+
+  /* Expect the following:
+     b = "a\x00bc"; */
+
+  const char c[] = "";
+
+  /* Expect the following:
+     c = ""; */
+
+  const char d[0] = { };
+
+  /* Expect the following:
+     d = ""; */
+
+  const char e[0] = "";
+
+  /* Expect nothing.  */
+
+  sink (a, b, c, d, e);
+}
+
+/* { dg-final { scan-tree-dump-times "a = \"\\\\x00ab\";" 1 "gimple" } }
+   { dg-final { scan-tree-dump-times "b = \"a\\\\x00bc\";"  1 "gimple" } }
+   { dg-final { scan-tree-dump-times "c = \"\";"  1 "gimple" } }
+   { dg-final { scan-tree-dump-times "d = { *};"  1 "gimple" } }
+   { dg-final { scan-tree-dump-times "e = "  1 "gimple" } }
+   { dg-final { scan-tree-dump-times "e = {CLOBBER}"  1 "gimple" } }  */