diff mbox series

Check the STRING_CSTs in varasm.c

Message ID AM5PR0701MB265742C278E7D1859AF10E59E42D0@AM5PR0701MB2657.eurprd07.prod.outlook.com
State New
Headers show
Series Check the STRING_CSTs in varasm.c | expand

Commit Message

Bernd Edlinger Aug. 1, 2018, 11:35 a.m. UTC
Hi,

this completes the previous patches, and adds a check in varasm.c
that ensures that all string constants are NUL terminated,
And that varasm does not strip anything but _exactly_ one NUL
character.

Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
Is it OK for trunk?


Thanks
Bernd.

Comments

Jeff Law Aug. 3, 2018, 9:36 p.m. UTC | #1
On 08/01/2018 05:35 AM, Bernd Edlinger wrote:
> Hi,
> 
> this completes the previous patches, and adds a check in varasm.c
> that ensures that all string constants are NUL terminated,
> And that varasm does not strip anything but _exactly_ one NUL
> character.
> 
> Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
> Is it OK for trunk?
> 
> 
> Thanks
> Bernd.
> 
> 
> patch-varasm.diff
> 
> 
> 2018-08-01  Bernd Edlinger  <bernd.edlinger@hotmail.de>
> 
> 	* varasm.c (check_string_literal): New checking function.
> 	(output_constant): Use it.
My biggest concern here is that we've explicitly defined STRING_CST
nodes as not needing NUL termination.  See generic.texi.  That explicit
definition is almost certainly derived from existing practice that long
pre-dates the introduction of gimple/generic.

Even so I'm generally OK with the concept of tightening the rules here.
If we need to fault in more fixes, that's fine with me.  I'm assuming
that you and Ian have sorted out what to do WRT Go.

If we decide to go forward, you'll definitely want to update the
STRING_CST documentation in generic.texi.

jeff
Bernd Edlinger Aug. 3, 2018, 9:41 p.m. UTC | #2
On 08/03/18 23:36, Jeff Law wrote:
> On 08/01/2018 05:35 AM, Bernd Edlinger wrote:
>> Hi,
>>
>> this completes the previous patches, and adds a check in varasm.c
>> that ensures that all string constants are NUL terminated,
>> And that varasm does not strip anything but _exactly_ one NUL
>> character.
>>
>> Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
>> Is it OK for trunk?
>>
>>
>> Thanks
>> Bernd.
>>
>>
>> patch-varasm.diff
>>
>>
>> 2018-08-01  Bernd Edlinger  <bernd.edlinger@hotmail.de>
>>
>> 	* varasm.c (check_string_literal): New checking function.
>> 	(output_constant): Use it.
> My biggest concern here is that we've explicitly defined STRING_CST
> nodes as not needing NUL termination.  See generic.texi.  That explicit
> definition is almost certainly derived from existing practice that long
> pre-dates the introduction of gimple/generic.
> 
> Even so I'm generally OK with the concept of tightening the rules here.
> If we need to fault in more fixes, that's fine with me.  I'm assuming
> that you and Ian have sorted out what to do WRT Go.
> 
> If we decide to go forward, you'll definitely want to update the
> STRING_CST documentation in generic.texi.
> 

Yes, note however that STRING_CST have more uses that string literals,
for instance ASM constraints, and __attribute__ values, and these
do not have explicit NUL termination, and no TREE_TYPE i'd suppose.

I am open to discuss how the string constants should be handled
in the middle-end.


Bernd.
Bernd Edlinger Aug. 4, 2018, 5:55 a.m. UTC | #3
On 08/03/18 23:36, Jeff Law wrote:
> On 08/01/2018 05:35 AM, Bernd Edlinger wrote:
>> Hi,
>>
>> this completes the previous patches, and adds a check in varasm.c
>> that ensures that all string constants are NUL terminated,
>> And that varasm does not strip anything but _exactly_ one NUL
>> character.
>>
>> Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
>> Is it OK for trunk?
>>
>>
>> Thanks
>> Bernd.
>>
>>
>> patch-varasm.diff
>>
>>
>> 2018-08-01  Bernd Edlinger  <bernd.edlinger@hotmail.de>
>>
>> 	* varasm.c (check_string_literal): New checking function.
>> 	(output_constant): Use it.
> My biggest concern here is that we've explicitly defined STRING_CST
> nodes as not needing NUL termination.  See generic.texi.  That explicit
> definition is almost certainly derived from existing practice that long
> pre-dates the introduction of gimple/generic.
> 
> Even so I'm generally OK with the concept of tightening the rules here.
> If we need to fault in more fixes, that's fine with me.  I'm assuming
> that you and Ian have sorted out what to do WRT Go.
> 
> If we decide to go forward, you'll definitely want to update the
> STRING_CST documentation in generic.texi.
> 

This is the same patch with STRING_CST documentation updated.


Bernd.
2018-08-01  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	* doc/generic.texi (STRING_CST): Update.
	* varasm.c (check_string_literal): New checking function.
	(output_constant): Use it.

diff -pur gcc/varasm.c gcc/varasm.c
--- gcc/varasm.c	2018-07-17 11:19:27.000000000 +0200
+++ gcc/varasm.c	2018-07-31 10:16:12.058827505 +0200
@@ -4774,6 +4774,29 @@ initializer_constant_valid_for_bitfield_
   return false;
 }
 
+/* Check if a STRING_CST fits into the field.
+   Tolerate only the case when the NUL termination
+   does not fit into the field.   */
+
+bool
+check_string_literal (tree string, unsigned HOST_WIDE_INT size)
+{
+  tree eltype = TREE_TYPE (TREE_TYPE (string));
+  unsigned HOST_WIDE_INT elts = tree_to_uhwi (TYPE_SIZE_UNIT (eltype));
+  const char *p = TREE_STRING_POINTER (string);
+  int len = TREE_STRING_LENGTH (string);
+
+  if (elts != 1 && elts != 2 && elts != 4)
+    return false;
+  if (len <= 0 || len % elts != 0)
+    return false;
+  if ((unsigned)len != size && (unsigned)len != size + elts)
+    return false;
+  if (memcmp (p + len - elts, "\0\0\0\0", elts) != 0)
+    return false;
+  return true;
+}
+
 /* output_constructor outer state of relevance in recursive calls, typically
    for nested aggregate bitfields.  */
 
@@ -4942,6 +4965,7 @@ output_constant (tree exp, unsigned HOST
 	case STRING_CST:
 	  thissize
 	    = MIN ((unsigned HOST_WIDE_INT)TREE_STRING_LENGTH (exp), size);
+	  gcc_checking_assert (check_string_literal (exp, thissize));
 	  assemble_string (TREE_STRING_POINTER (exp), thissize);
 	  break;
 	case VECTOR_CST:
Index: gcc/doc/generic.texi
===================================================================
--- gcc/doc/generic.texi	(revision 263072)
+++ gcc/doc/generic.texi	(working copy)
@@ -1162,9 +1162,13 @@ These nodes represent string-constants.  The @code
 returns the length of the string, as an @code{int}.  The
 @code{TREE_STRING_POINTER} is a @code{char*} containing the string
 itself.  The string may not be @code{NUL}-terminated, and it may contain
-embedded @code{NUL} characters.  Therefore, the
-@code{TREE_STRING_LENGTH} includes the trailing @code{NUL} if it is
-present.
+embedded @code{NUL} characters.  However, the
+@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that
+is not part of the language string literal but appended by the front end.
+If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE}
+is one character shorter than @code{TREE_STRING_LENGTH}.
+Excess caracters other than one trailing @code{NUL} character are not
+permitted.
 
 For wide string constants, the @code{TREE_STRING_LENGTH} is the number
 of bytes in the string, and the @code{TREE_STRING_POINTER}
@@ -1173,6 +1177,11 @@ target system (that is, as integers in the target
 non-wide string constants are distinguished only by the @code{TREE_TYPE}
 of the @code{STRING_CST}.
 
+String constants may also be used for other purpose like assember
+constraints or attributes.  These have no @code{TREE_TYPE} and
+need no explicit @code{NUL}-termination.  Note there is always
+another @code{NUL}-byte beyond @code{TREE_STRING_LENGTH}.
+
 FIXME: The formats of string constants are not well-defined when the
 target system bytes are not the same width as host system bytes.
Bernd Edlinger Aug. 5, 2018, 10:28 a.m. UTC | #4
Hi,

I would like to do a minor tweak to the patch.
While staring at the other patch I realized that I should
better pass size and not thissize to the check
function, instead of making use of how thissize is
computed using MIN above.  This means that this condition

+  if ((unsigned)len != size && (unsigned)len != size + elts)
+    return false;

should better and more readable be:

+  if (size < (unsigned)len && size != len - elts)
+    return false;

Furthermore I wanted to have the check function static,
which I missed to do previously.

For completeness, these are again the precondition patches
for this patch:

[PATCH] Handle overlength strings in the C FE
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00142.html

[PATCH] Handle overlength strings in C++ FE
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00045.html
Approved: https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00105.html

[PATCH] Make GO string literals properly NUL terminated
https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01931.html

[PATCH] [Ada] Make middle-end string literals NUL terminated
https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01929.html

[PATCH] Create internally nul terminated string literals in fortan FE
https://gcc.gnu.org/ml/fortran/2018-08/msg00000.html


Bootstrapped, as usual.
Is it OK for trunk?


Thanks
Bernd.
2018-08-01  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	* doc/generic.texi (STRING_CST): Update.
	* varasm.c (check_string_literal): New checking function.
	(output_constant): Use it.

diff -pur gcc/varasm.c gcc/varasm.c
--- gcc/varasm.c	2018-07-17 11:19:27.000000000 +0200
+++ gcc/varasm.c	2018-07-31 10:16:12.058827505 +0200
@@ -4774,6 +4774,29 @@ initializer_constant_valid_for_bitfield_
   return false;
 }
 
+/* Check if a STRING_CST fits into the field.
+   Tolerate only the case when the NUL termination
+   does not fit into the field.   */
+
+static bool
+check_string_literal (tree string, unsigned HOST_WIDE_INT size)
+{
+  tree eltype = TREE_TYPE (TREE_TYPE (string));
+  unsigned HOST_WIDE_INT elts = tree_to_uhwi (TYPE_SIZE_UNIT (eltype));
+  const char *p = TREE_STRING_POINTER (string);
+  int len = TREE_STRING_LENGTH (string);
+
+  if (elts != 1 && elts != 2 && elts != 4)
+    return false;
+  if (len <= 0 || len % elts != 0)
+    return false;
+  if (size < (unsigned)len && size != len - elts)
+    return false;
+  if (memcmp (p + len - elts, "\0\0\0\0", elts) != 0)
+    return false;
+  return true;
+}
+
 /* output_constructor outer state of relevance in recursive calls, typically
    for nested aggregate bitfields.  */
 
@@ -4942,6 +4965,7 @@ output_constant (tree exp, unsigned HOST
 	case STRING_CST:
 	  thissize
 	    = MIN ((unsigned HOST_WIDE_INT)TREE_STRING_LENGTH (exp), size);
+	  gcc_checking_assert (check_string_literal (exp, size));
 	  assemble_string (TREE_STRING_POINTER (exp), thissize);
 	  break;
 	case VECTOR_CST:
Index: gcc/doc/generic.texi
===================================================================
--- gcc/doc/generic.texi	(revision 263072)
+++ gcc/doc/generic.texi	(working copy)
@@ -1162,9 +1162,13 @@ These nodes represent string-constants.  The @code
 returns the length of the string, as an @code{int}.  The
 @code{TREE_STRING_POINTER} is a @code{char*} containing the string
 itself.  The string may not be @code{NUL}-terminated, and it may contain
-embedded @code{NUL} characters.  Therefore, the
-@code{TREE_STRING_LENGTH} includes the trailing @code{NUL} if it is
-present.
+embedded @code{NUL} characters.  However, the
+@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that
+is not part of the language string literal but appended by the front end.
+If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE}
+is one character shorter than @code{TREE_STRING_LENGTH}.
+Excess caracters other than one trailing @code{NUL} character are not
+permitted.
 
 For wide string constants, the @code{TREE_STRING_LENGTH} is the number
 of bytes in the string, and the @code{TREE_STRING_POINTER}
@@ -1173,6 +1177,11 @@ target system (that is, as integers in the target
 non-wide string constants are distinguished only by the @code{TREE_TYPE}
 of the @code{STRING_CST}.
 
+String constants may also be used for other purpose like assember
+constraints or attributes.  These have no @code{TREE_TYPE} and
+need no explicit @code{NUL}-termination.  Note there is always
+another @code{NUL}-byte beyond @code{TREE_STRING_LENGTH}.
+
 FIXME: The formats of string constants are not well-defined when the
 target system bytes are not the same width as host system bytes.
Jeff Law Aug. 17, 2018, 4:46 a.m. UTC | #5
On 08/05/2018 04:28 AM, Bernd Edlinger wrote:
> Hi,
> 
> I would like to do a minor tweak to the patch.
> While staring at the other patch I realized that I should
> better pass size and not thissize to the check
> function, instead of making use of how thissize is
> computed using MIN above.  This means that this condition
> 
> +  if ((unsigned)len != size && (unsigned)len != size + elts)
> +    return false;
> 
> should better and more readable be:
> 
> +  if (size < (unsigned)len && size != len - elts)
> +    return false;
> 
> Furthermore I wanted to have the check function static,
> which I missed to do previously.
> 
> For completeness, these are again the precondition patches
> for this patch:
> 
> [PATCH] Handle overlength strings in the C FE
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00142.html
I believe Joseph ack'd this already:

https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00655.html


> 
> [PATCH] Make GO string literals properly NUL terminated
> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01931.html
Ian didn't object, deferring to the larger scoped review.  My
understanding is this should be a nop for Go.  Given it takes us closer
to have a canonical form, I'll go ahead and ACK for the trunk.


> 
> [PATCH] [Ada] Make middle-end string literals NUL terminated
> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01929.html
This is OK.


> 
> [PATCH] Create internally nul terminated string literals in fortan FE
> https://gcc.gnu.org/ml/fortran/2018-08/msg00000.html
This is OK too.  THough it is unclear why we're not fixing
gfc_build_string_const vs introducing gfc_build_hollerith_string_const.
Are hollerith strings naturally not terminated in the fortran front-end
and thus need special handling, while other calls to
gfc_build_string_const are always with naturally nul terminated strings?




> 
> 
> Bootstrapped, as usual.
> Is it OK for trunk?
So as a toplevel comment.    While there may be some interaction with
Martin's code to detect and warn for unterminated strings passed to
strlen and other cases, I think the overall goal of canonicalizing our
internal representation of STRING_CST along with a verification step is
a good thing and it may make some of Martin's work easier.

At this point are the prereqs of the varasm verification patch all ACK'd?

Jeff
Richard Biener Aug. 17, 2018, 9:33 a.m. UTC | #6
On Fri, 3 Aug 2018, Jeff Law wrote:

> On 08/01/2018 05:35 AM, Bernd Edlinger wrote:
> > Hi,
> > 
> > this completes the previous patches, and adds a check in varasm.c
> > that ensures that all string constants are NUL terminated,
> > And that varasm does not strip anything but _exactly_ one NUL
> > character.
> > 
> > Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
> > Is it OK for trunk?
> > 
> > 
> > Thanks
> > Bernd.
> > 
> > 
> > patch-varasm.diff
> > 
> > 
> > 2018-08-01  Bernd Edlinger  <bernd.edlinger@hotmail.de>
> > 
> > 	* varasm.c (check_string_literal): New checking function.
> > 	(output_constant): Use it.
> My biggest concern here is that we've explicitly defined STRING_CST
> nodes as not needing NUL termination.  See generic.texi.  That explicit
> definition is almost certainly derived from existing practice that long
> pre-dates the introduction of gimple/generic.
> 
> Even so I'm generally OK with the concept of tightening the rules here.
> If we need to fault in more fixes, that's fine with me.  I'm assuming
> that you and Ian have sorted out what to do WRT Go.
> 
> If we decide to go forward, you'll definitely want to update the
> STRING_CST documentation in generic.texi.

Note that I'm a little bit confused here given build_string
already appends a '\0' after TREE_STRING_LENGTH.  So it is safe
to call strlen() on all STRING_CSTs.  I think I raised this in
the review of one of Bernds patches but to be honest all the
various threads have become a bit convoluted (esp. if joining
in after two weeks of vacation).

So -- why is what we do currently not enough?

Quoting our single STRING_CST creator:

tree
build_string (int len, const char *str)
{
  tree s;
  size_t length;

  /* Do not waste bytes provided by padding of struct tree_string.  */
  length = len + offsetof (struct tree_string, str) + 1;

  record_node_allocation_statistics (STRING_CST, length);

  s = (tree) ggc_internal_alloc (length);

  memset (s, 0, sizeof (struct tree_typed));
  TREE_SET_CODE (s, STRING_CST);
  TREE_CONSTANT (s) = 1;
  TREE_STRING_LENGTH (s) = len;
  memcpy (s->string.str, str, len);
  s->string.str[len] = '\0';

  return s;
}


Richard.
Richard Biener Aug. 17, 2018, 9:38 a.m. UTC | #7
On Sat, 4 Aug 2018, Bernd Edlinger wrote:

> On 08/03/18 23:36, Jeff Law wrote:
> > On 08/01/2018 05:35 AM, Bernd Edlinger wrote:
> >> Hi,
> >>
> >> this completes the previous patches, and adds a check in varasm.c
> >> that ensures that all string constants are NUL terminated,
> >> And that varasm does not strip anything but _exactly_ one NUL
> >> character.
> >>
> >> Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
> >> Is it OK for trunk?
> >>
> >>
> >> Thanks
> >> Bernd.
> >>
> >>
> >> patch-varasm.diff
> >>
> >>
> >> 2018-08-01  Bernd Edlinger  <bernd.edlinger@hotmail.de>
> >>
> >> 	* varasm.c (check_string_literal): New checking function.
> >> 	(output_constant): Use it.
> > My biggest concern here is that we've explicitly defined STRING_CST
> > nodes as not needing NUL termination.  See generic.texi.  That explicit
> > definition is almost certainly derived from existing practice that long
> > pre-dates the introduction of gimple/generic.
> > 
> > Even so I'm generally OK with the concept of tightening the rules here.
> > If we need to fault in more fixes, that's fine with me.  I'm assuming
> > that you and Ian have sorted out what to do WRT Go.
> > 
> > If we decide to go forward, you'll definitely want to update the
> > STRING_CST documentation in generic.texi.
> > 
> 
> This is the same patch with STRING_CST documentation updated.

+embedded @code{NUL} characters.  However, the
+@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that
+is not part of the language string literal but appended by the front end.
+If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE}
+is one character shorter than @code{TREE_STRING_LENGTH}.
+Excess caracters other than one trailing @code{NUL} character are not
+permitted.

I find this very confusing and oppose to that change.  Can we get
back to the drawing board please?  If we want an easy way to
see whether a string is "properly" terminated then maybe we can
simply use a flag that gets set by build_string?

Richard.
Bernd Edlinger Aug. 17, 2018, 10:22 a.m. UTC | #8
Richard Biener wrote:

> Note that I'm a little bit confused here given build_string
> already appends a '\0' after TREE_STRING_LENGTH.  So it is safe
> to call strlen() on all STRING_CSTs.  I think I raised this in
> the review of one of Bernds patches but to be honest all the
> various threads have become a bit convoluted (esp. if joining
> in after two weeks of vacation).
>
> So -- why is what we do currently not enough?
>
> Quoting our single STRING_CST creator:
>
> tree
> build_string (int len, const char *str)
> {
>   tree s;
>   size_t length;
>
>  /* Do not waste bytes provided by padding of struct tree_string.  */
>  length = len + offsetof (struct tree_string, str) + 1;
>
>  record_node_allocation_statistics (STRING_CST, length);
>
>  s = (tree) ggc_internal_alloc (length);
>
>  memset (s, 0, sizeof (struct tree_typed));
>  TREE_SET_CODE (s, STRING_CST);
>  TREE_CONSTANT (s) = 1;
>  TREE_STRING_LENGTH (s) = len;
>  memcpy (s->string.str, str, len);
>  s->string.str[len] = '\0';
>
>  return s;
>}

Thanks for this question.

I have two issues with that.
1. it adds not a wide character nul, only a single byte '\0'.
2. the '\0' here is _guaranteed_ not to be assembled
    by varasm.
   Since it is at TREE_STRING_LENGTH+1.

That is fine for some string constants, like ASM constaraints.
it makes most of the time sense, as long as it is not
used for folding of memxxx functions.

Of course the whole patch series is dependent on the
consensus about how we want to handle string constants
in the middle-end, so no problem with going back to the
drawing board, that's the reason why I did not apply the
already approved bits, and I guess we are not in a hurry.

What I see, is that string constants are most of the time
handled like the C language strings, that is there
are different character wides, and the array_type on the
string constants, includes another NUL (wide) char which
is assembled along with the rest of the bytes, but it is
not written in the language string constant.
The front end appends this before giving the string
constant to build_string.
That means at least most of the time.

So I thought it might be good to have some more
easily checkable things regarding the zero termiation,
or what is allowed for excess precision.

It's possible that this shoots over the target, but I think
this hardening in the varasm is of some value.

This way how this patch works has one advantage:
That it is easy to check for "there must be one wide char zero",
if it is someting that can't be checked, like:
"there may be a nul or not", then it is impossible to be checked.

Well yeah, it's an idea.



Bernd.
Bernd Edlinger Aug. 17, 2018, 10:38 a.m. UTC | #9
Richard Biener wrote:
> +embedded @code{NUL} characters.  However, the
> +@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that
> +is not part of the language string literal but appended by the front end.
> +If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE}
> +is one character shorter than @code{TREE_STRING_LENGTH}.
> +Excess caracters other than one trailing @code{NUL} character are not
> +permitted.
> 
> I find this very confusing and oppose to that change.  Can we get
> back to the drawing board please?  If we want an easy way to
> see whether a string is "properly" terminated then maybe we can
> simply use a flag that gets set by build_string?
>

What I mean with that is the case like
char x[2] = "123456";

which is build_string(7, "123456"), but with a type char[2],
so varasm throws away "3456\0".

I want to say that this is not okay, the excess precision
should only be used to strip the nul termination, in cases
where it is intended to be a assembled as a not zero terminated
string.  But maybe the wording could be improved?


Bernd.
Bernd Edlinger Aug. 17, 2018, 12:13 p.m. UTC | #10
On 08/17/18 06:46, Jeff Law wrote:
> On 08/05/2018 04:28 AM, Bernd Edlinger wrote:
>> Hi,
>>
>> I would like to do a minor tweak to the patch.
>> While staring at the other patch I realized that I should
>> better pass size and not thissize to the check
>> function, instead of making use of how thissize is
>> computed using MIN above.  This means that this condition
>>
>> +  if ((unsigned)len != size && (unsigned)len != size + elts)
>> +    return false;
>>
>> should better and more readable be:
>>
>> +  if (size < (unsigned)len && size != len - elts)
>> +    return false;
>>
>> Furthermore I wanted to have the check function static,
>> which I missed to do previously.
>>
>> For completeness, these are again the precondition patches
>> for this patch:
>>
>> [PATCH] Handle overlength strings in the C FE
>> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00142.html
> I believe Joseph ack'd this already:
> 
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00655.html
> 
> 

Yes.

>>
>> [PATCH] Make GO string literals properly NUL terminated
>> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01931.html
> Ian didn't object, deferring to the larger scoped review.  My
> understanding is this should be a nop for Go.  Given it takes us closer
> to have a canonical form, I'll go ahead and ACK for the trunk.
> 
> 

Yes, it should be a no-op, eventually the strings will go
into the string merge section, with the follow-up patch.
Everything works nicely, as I always bootstrap/regtest with GO.


>>
>> [PATCH] [Ada] Make middle-end string literals NUL terminated
>> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01929.html
> This is OK.
> 
> 
>>
>> [PATCH] Create internally nul terminated string literals in fortan FE
>> https://gcc.gnu.org/ml/fortran/2018-08/msg00000.html
> This is OK too.  THough it is unclear why we're not fixing
> gfc_build_string_const vs introducing gfc_build_hollerith_string_const.
> Are hollerith strings naturally not terminated in the fortran front-end
> and thus need special handling, while other calls to
> gfc_build_string_const are always with naturally nul terminated strings?
> 
> 

I think all other strings are null terminated, except the hollerith strings.

> 
> 
>>
>>
>> Bootstrapped, as usual.
>> Is it OK for trunk?
> So as a toplevel comment.    While there may be some interaction with
> Martin's code to detect and warn for unterminated strings passed to
> strlen and other cases, I think the overall goal of canonicalizing our
> internal representation of STRING_CST along with a verification step is
> a good thing and it may make some of Martin's work easier.
> 
> At this point are the prereqs of the varasm verification patch all ACK'd?
> 

Yes, there is a JIT code gen bug, that was caught by the assertion in my
patch:

[PATCH] Fix not properly nul-terminated string constants in JIT
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00370.html


That is a simple bug, which means, that the JIT FE was never able to
assemble strings > 200 character length.

Fixing a bug like that makes me I wonder, if there is any test coverage
for JIT except the gcc test suite itself.


Thanks
Bernd.
Richard Biener Aug. 17, 2018, 12:14 p.m. UTC | #11
On Fri, 17 Aug 2018, Bernd Edlinger wrote:

> Richard Biener wrote:
> 
> > Note that I'm a little bit confused here given build_string
> > already appends a '\0' after TREE_STRING_LENGTH.  So it is safe
> > to call strlen() on all STRING_CSTs.  I think I raised this in
> > the review of one of Bernds patches but to be honest all the
> > various threads have become a bit convoluted (esp. if joining
> > in after two weeks of vacation).
> >
> > So -- why is what we do currently not enough?
> >
> > Quoting our single STRING_CST creator:
> >
> > tree
> > build_string (int len, const char *str)
> > {
> >   tree s;
> >   size_t length;
> >
> >  /* Do not waste bytes provided by padding of struct tree_string.  */
> >  length = len + offsetof (struct tree_string, str) + 1;
> >
> >  record_node_allocation_statistics (STRING_CST, length);
> >
> >  s = (tree) ggc_internal_alloc (length);
> >
> >  memset (s, 0, sizeof (struct tree_typed));
> >  TREE_SET_CODE (s, STRING_CST);
> >  TREE_CONSTANT (s) = 1;
> >  TREE_STRING_LENGTH (s) = len;
> >  memcpy (s->string.str, str, len);
> >  s->string.str[len] = '\0';
> >
> >  return s;
> >}
> 
> Thanks for this question.
> 
> I have two issues with that.
> 1. it adds not a wide character nul, only a single byte '\0'.
> 2. the '\0' here is _guaranteed_ not to be assembled
>     by varasm.
>    Since it is at TREE_STRING_LENGTH+1.
> 
> That is fine for some string constants, like ASM constaraints.
> it makes most of the time sense, as long as it is not
> used for folding of memxxx functions.
> 
> Of course the whole patch series is dependent on the
> consensus about how we want to handle string constants
> in the middle-end, so no problem with going back to the
> drawing board, that's the reason why I did not apply the
> already approved bits, and I guess we are not in a hurry.
> 
> What I see, is that string constants are most of the time
> handled like the C language strings, that is there
> are different character wides, and the array_type on the
> string constants, includes another NUL (wide) char which
> is assembled along with the rest of the bytes, but it is
> not written in the language string constant.
> The front end appends this before giving the string
> constant to build_string.
> That means at least most of the time.
> 
> So I thought it might be good to have some more
> easily checkable things regarding the zero termiation,
> or what is allowed for excess precision.
> 
> It's possible that this shoots over the target, but I think
> this hardening in the varasm is of some value.
> 
> This way how this patch works has one advantage:
> That it is easy to check for "there must be one wide char zero",
> if it is someting that can't be checked, like:
> "there may be a nul or not", then it is impossible to be checked.

What I most object to is the fact that this makes the requirement
of TREE_TYPE of a STRING_CST to be an ARRAY_TYPE a hard one
while the ARRAY_TYPE doesn't add any useful information.  Those
array types are quite heavy structures while the only thing
we really need is the type of the elements.  We could improve
memory use greatly here by using charT[] as TREE_TYPE of
STRING_CSTs.

Note that not all STRING_CSTs do get a type (probably all that
end up in the IL do).

So, why not add a TREE_STRING_NUL_TERMINATED flag instad that
is true when the string is NUL-terminated within TREE_STRING_LENGTH
and is false when we do not know (to make the transition easy).

Richard.

> Well yeah, it's an idea.
> 
> 
> 
> Bernd.
> 
>
Richard Biener Aug. 17, 2018, 12:19 p.m. UTC | #12
On Fri, 17 Aug 2018, Bernd Edlinger wrote:

> Richard Biener wrote:
> > +embedded @code{NUL} characters.  However, the
> > +@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that
> > +is not part of the language string literal but appended by the front end.
> > +If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE}
> > +is one character shorter than @code{TREE_STRING_LENGTH}.
> > +Excess caracters other than one trailing @code{NUL} character are not

characters btw.

I read the above that the string literal for

char x[2] = "1";

is actually "1\0\0" - there's one NUL that is not part of the language
string literal.  The second sentence then suggests that both \0
are removed because 2 is less than 3?

As said, having this extra semantics of a STRING_CST tied to
another tree node (its TREE_TYPE) looks ugly.

> > +permitted.
> > 
> > I find this very confusing and oppose to that change.  Can we get
> > back to the drawing board please?  If we want an easy way to
> > see whether a string is "properly" terminated then maybe we can
> > simply use a flag that gets set by build_string?
> >
> 
> What I mean with that is the case like
> char x[2] = "123456";
> 
> which is build_string(7, "123456"), but with a type char[2],
> so varasm throws away "3456\0".

I think varasm throws away chars not because of the type of
the STRING_CST but because of the available storage in x.

> I want to say that this is not okay, the excess precision
> should only be used to strip the nul termination, in cases
> where it is intended to be a assembled as a not zero terminated
> string.  But maybe the wording could be improved?

ISTR we always assemble a NUL in .strings to get string merging
working.

Richard.
Bernd Edlinger Aug. 17, 2018, 12:36 p.m. UTC | #13
On 08/17/18 14:19, Richard Biener wrote:
> On Fri, 17 Aug 2018, Bernd Edlinger wrote:
> 
>> Richard Biener wrote:
>>> +embedded @code{NUL} characters.  However, the
>>> +@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that
>>> +is not part of the language string literal but appended by the front end.
>>> +If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE}
>>> +is one character shorter than @code{TREE_STRING_LENGTH}.
>>> +Excess caracters other than one trailing @code{NUL} character are not
> 
> characters btw.
> 

thanks, updated.

> I read the above that the string literal for
> 
> char x[2] = "1";
> 
> is actually "1\0\0" - there's one NUL that is not part of the language
> string literal.  The second sentence then suggests that both \0
> are removed because 2 is less than 3?
> 

maybe 2 is a bad example, lets consider:
char x[2000000000] = "1";

That is a string_cst with STRING_LENGTH = 2, content = "2\0\0"
the array_type is used on both x, and the string_cst,
I was assuming that both tree objects refer to the same type object.
which is char[0..2000000000-1]

varasm assembles the bytes that are given by STRING_LENGTH
and appends zeros as appropriate.

> As said, having this extra semantics of a STRING_CST tied to
> another tree node (its TREE_TYPE) looks ugly.
> 
>>> +permitted.
>>>
>>> I find this very confusing and oppose to that change.  Can we get
>>> back to the drawing board please?  If we want an easy way to
>>> see whether a string is "properly" terminated then maybe we can
>>> simply use a flag that gets set by build_string?
>>>
>>
>> What I mean with that is the case like
>> char x[2] = "123456";
>>
>> which is build_string(7, "123456"), but with a type char[2],
>> so varasm throws away "3456\0".
> 
> I think varasm throws away chars not because of the type of
> the STRING_CST but because of the available storage in x.
> 

But at other places we look at the type of the string_cst, don't we?
Shouldn't those be the same?

>> I want to say that this is not okay, the excess precision
>> should only be used to strip the nul termination, in cases
>> where it is intended to be a assembled as a not zero terminated
>> string.  But maybe the wording could be improved?
> 
> ISTR we always assemble a NUL in .strings to get string merging
> working.
> 

String merging is not working when the string is not explicitly
NUL terminated, my followup patch here tries to fix that:

[PATCH] Handle not explicitly zero terminated strings in merge sections
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00481.html


Bernd.
Richard Biener Aug. 17, 2018, 1:38 p.m. UTC | #14
On Fri, 17 Aug 2018, Bernd Edlinger wrote:

> On 08/17/18 14:19, Richard Biener wrote:
> > On Fri, 17 Aug 2018, Bernd Edlinger wrote:
> > 
> >> Richard Biener wrote:
> >>> +embedded @code{NUL} characters.  However, the
> >>> +@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that
> >>> +is not part of the language string literal but appended by the front end.
> >>> +If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE}
> >>> +is one character shorter than @code{TREE_STRING_LENGTH}.
> >>> +Excess caracters other than one trailing @code{NUL} character are not
> > 
> > characters btw.
> > 
> 
> thanks, updated.
> 
> > I read the above that the string literal for
> > 
> > char x[2] = "1";
> > 
> > is actually "1\0\0" - there's one NUL that is not part of the language
> > string literal.  The second sentence then suggests that both \0
> > are removed because 2 is less than 3?
> > 
> 
> maybe 2 is a bad example, lets consider:
> char x[2000000000] = "1";
> 
> That is a string_cst with STRING_LENGTH = 2, content = "2\0\0"
> the array_type is used on both x, and the string_cst,
> I was assuming that both tree objects refer to the same type object.
> which is char[0..2000000000-1]

Oh, didn't realize we use char[200000000] for the STRING_CST.  Makes
my suggestion to use char[] instead not (very) much worse than the
existing practice then.

> varasm assembles the bytes that are given by STRING_LENGTH
> and appends zeros as appropriate.
> 
> > As said, having this extra semantics of a STRING_CST tied to
> > another tree node (its TREE_TYPE) looks ugly.
> > 
> >>> +permitted.
> >>>
> >>> I find this very confusing and oppose to that change.  Can we get
> >>> back to the drawing board please?  If we want an easy way to
> >>> see whether a string is "properly" terminated then maybe we can
> >>> simply use a flag that gets set by build_string?
> >>>
> >>
> >> What I mean with that is the case like
> >> char x[2] = "123456";
> >>
> >> which is build_string(7, "123456"), but with a type char[2],
> >> so varasm throws away "3456\0".
> > 
> > I think varasm throws away chars not because of the type of
> > the STRING_CST but because of the available storage in x.
> > 
> 
> But at other places we look at the type of the string_cst, don't we?
> Shouldn't those be the same?

I think most (all?) places look at TREE_TYPE (TREE_TYPE (string))
only.  I'm not aware of users of the array domain of the array type
of a string - but I'm far from knowing GCC inside-out ;)

> >> I want to say that this is not okay, the excess precision
> >> should only be used to strip the nul termination, in cases
> >> where it is intended to be a assembled as a not zero terminated
> >> string.  But maybe the wording could be improved?
> > 
> > ISTR we always assemble a NUL in .strings to get string merging
> > working.
> > 
> 
> String merging is not working when the string is not explicitly
> NUL terminated, my followup patch here tries to fix that:
> 
> [PATCH] Handle not explicitly zero terminated strings in merge sections
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00481.html

I'd have expected sth as simple as

  if (merge_strings && str[thissize - 1] != '\0')
    thissize++;

being appended in output_constant.

Richard.
Bernd Edlinger Aug. 17, 2018, 1:53 p.m. UTC | #15
On 08/17/18 15:38, Richard Biener wrote:
> On Fri, 17 Aug 2018, Bernd Edlinger wrote:
> 
>> On 08/17/18 14:19, Richard Biener wrote:
>>> On Fri, 17 Aug 2018, Bernd Edlinger wrote:
>>>
>>>> Richard Biener wrote:
>>>>> +embedded @code{NUL} characters.  However, the
>>>>> +@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that
>>>>> +is not part of the language string literal but appended by the front end.
>>>>> +If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE}
>>>>> +is one character shorter than @code{TREE_STRING_LENGTH}.
>>>>> +Excess caracters other than one trailing @code{NUL} character are not
>>>
>>> characters btw.
>>>
>>
>> thanks, updated.
>>
>>> I read the above that the string literal for
>>>
>>> char x[2] = "1";
>>>
>>> is actually "1\0\0" - there's one NUL that is not part of the language
>>> string literal.  The second sentence then suggests that both \0
>>> are removed because 2 is less than 3?
>>>
>>
>> maybe 2 is a bad example, lets consider:
>> char x[2000000000] = "1";
>>
>> That is a string_cst with STRING_LENGTH = 2, content = "2\0\0"
>> the array_type is used on both x, and the string_cst,
>> I was assuming that both tree objects refer to the same type object.
>> which is char[0..2000000000-1]
> 
> Oh, didn't realize we use char[200000000] for the STRING_CST.  Makes
> my suggestion to use char[] instead not (very) much worse than the
> existing practice then.
> 
>> varasm assembles the bytes that are given by STRING_LENGTH
>> and appends zeros as appropriate.
>>
>>> As said, having this extra semantics of a STRING_CST tied to
>>> another tree node (its TREE_TYPE) looks ugly.
>>>
>>>>> +permitted.
>>>>>
>>>>> I find this very confusing and oppose to that change.  Can we get
>>>>> back to the drawing board please?  If we want an easy way to
>>>>> see whether a string is "properly" terminated then maybe we can
>>>>> simply use a flag that gets set by build_string?
>>>>>
>>>>
>>>> What I mean with that is the case like
>>>> char x[2] = "123456";
>>>>
>>>> which is build_string(7, "123456"), but with a type char[2],
>>>> so varasm throws away "3456\0".
>>>
>>> I think varasm throws away chars not because of the type of
>>> the STRING_CST but because of the available storage in x.
>>>
>>
>> But at other places we look at the type of the string_cst, don't we?
>> Shouldn't those be the same?
> 
> I think most (all?) places look at TREE_TYPE (TREE_TYPE (string))
> only.  I'm not aware of users of the array domain of the array type
> of a string - but I'm far from knowing GCC inside-out ;)
> 

Yes, I know, that happens to me as well on the first day after my holidays ;)

>>>> I want to say that this is not okay, the excess precision
>>>> should only be used to strip the nul termination, in cases
>>>> where it is intended to be a assembled as a not zero terminated
>>>> string.  But maybe the wording could be improved?
>>>
>>> ISTR we always assemble a NUL in .strings to get string merging
>>> working.
>>>
>>
>> String merging is not working when the string is not explicitly
>> NUL terminated, my followup patch here tries to fix that:
>>
>> [PATCH] Handle not explicitly zero terminated strings in merge sections
>> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00481.html
> 
> I'd have expected sth as simple as
> 
>    if (merge_strings && str[thissize - 1] != '\0')
>      thissize++;
> 
> being appended in output_constant.
> 

Yes, but that can only be done in the .merge.str section,
otherwise it would happen in structure initializers as well.
And I would like to undo the case when Ada programs do

Process ("ABCD" & Ascii.NUL);

but not for embedded NUL in the string constant.
like:

Process ("ABCD" & Acii.NUL & "EFGH");


Bernd.
Jeff Law Aug. 18, 2018, 3:43 a.m. UTC | #16
On 08/17/2018 06:13 AM, Bernd Edlinger wrote:
>>
> 
> Yes, there is a JIT code gen bug, that was caught by the assertion in my
> patch:
> 
> [PATCH] Fix not properly nul-terminated string constants in JIT
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00370.html
THanks for the reference.  I didn't have it in my queue.

> 
> 
> That is a simple bug, which means, that the JIT FE was never able to
> assemble strings > 200 character length.
> 
> Fixing a bug like that makes me I wonder, if there is any test coverage
> for JIT except the gcc test suite itself.
Nothing I'm aware of except the gcc testsuite itself -- it was more of a
research project to find out what we'd need to fix in GCC if we ever
wanted to be able to use it as a JIT.

Jeff
Jeff Law Aug. 18, 2018, 3:47 a.m. UTC | #17
On 08/17/2018 07:53 AM, Bernd Edlinger wrote:
> On 08/17/18 15:38, Richard Biener wrote:
>> On Fri, 17 Aug 2018, Bernd Edlinger wrote:
>>
>>> On 08/17/18 14:19, Richard Biener wrote:
>>>> On Fri, 17 Aug 2018, Bernd Edlinger wrote:
>>>>
>>>>> Richard Biener wrote:
>>>>>> +embedded @code{NUL} characters.  However, the
>>>>>> +@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that
>>>>>> +is not part of the language string literal but appended by the front end.
>>>>>> +If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE}
>>>>>> +is one character shorter than @code{TREE_STRING_LENGTH}.
>>>>>> +Excess caracters other than one trailing @code{NUL} character are not
>>>>
>>>> characters btw.
>>>>
>>>
>>> thanks, updated.
>>>
>>>> I read the above that the string literal for
>>>>
>>>> char x[2] = "1";
>>>>
>>>> is actually "1\0\0" - there's one NUL that is not part of the language
>>>> string literal.  The second sentence then suggests that both \0
>>>> are removed because 2 is less than 3?
>>>>
>>>
>>> maybe 2 is a bad example, lets consider:
>>> char x[2000000000] = "1";
>>>
>>> That is a string_cst with STRING_LENGTH = 2, content = "2\0\0"
>>> the array_type is used on both x, and the string_cst,
>>> I was assuming that both tree objects refer to the same type object.
>>> which is char[0..2000000000-1]
>>
>> Oh, didn't realize we use char[200000000] for the STRING_CST.  Makes
>> my suggestion to use char[] instead not (very) much worse than the
>> existing practice then.
>>
>>> varasm assembles the bytes that are given by STRING_LENGTH
>>> and appends zeros as appropriate.
>>>
>>>> As said, having this extra semantics of a STRING_CST tied to
>>>> another tree node (its TREE_TYPE) looks ugly.
>>>>
>>>>>> +permitted.
>>>>>>
>>>>>> I find this very confusing and oppose to that change.  Can we get
>>>>>> back to the drawing board please?  If we want an easy way to
>>>>>> see whether a string is "properly" terminated then maybe we can
>>>>>> simply use a flag that gets set by build_string?
>>>>>>
>>>>>
>>>>> What I mean with that is the case like
>>>>> char x[2] = "123456";
>>>>>
>>>>> which is build_string(7, "123456"), but with a type char[2],
>>>>> so varasm throws away "3456\0".
>>>>
>>>> I think varasm throws away chars not because of the type of
>>>> the STRING_CST but because of the available storage in x.
>>>>
>>>
>>> But at other places we look at the type of the string_cst, don't we?
>>> Shouldn't those be the same?
>>
>> I think most (all?) places look at TREE_TYPE (TREE_TYPE (string))
>> only.  I'm not aware of users of the array domain of the array type
>> of a string - but I'm far from knowing GCC inside-out ;)
>>
> 
> Yes, I know, that happens to me as well on the first day after my holidays ;)
I'm not aware of anything using the array domain of the array type of a
string.  But there's still a lot of old code in tree.c and expr.c.  Who
knows what would be found with a thorough audit.

jeff
Bernd Edlinger Aug. 22, 2018, 2:27 p.m. UTC | #18
Hi,


this is an updated version of my STRING_CSTs checking in varasm.c
patch.

It tries to answer the questions that were raised in th GO string literals
thread.

The answers are:
a) strings with TYPE_SIZE_UNIT == NULL do exist, but only for STRING_CSTs
in constructors of flexible array struct members.  Not for string literals.
b) In all cases where the STRING_CSTs do have a TYPE_SIZE_UNIT, the
DECL_SIZE_UNIT has the same value.
c) When STRING_CSTs do not have a TYPE_SIZE_UNIT, that is in the constructor
of a flexible array member.  In that case the TREE_STRING_LENGTH
determines the flexible array size.


It changes varasm's get_constant_size to return the TYPE_SIZE_UNIT of
a STRING_CST literal as it is, without increasing the storage size
to TREE_STRING_LENGTH.  I added an assertion to make sure that
all STRING_CSTs have a type size; size == 0 can happen for empty Ada
strings.

Furthermore it adds code to compare_constant to also compare the
STRING_CSTs TYPE_SIZE_UNIT if available.

So I want to remove that from get_constant_size in order to not change
the memory layout of GO and Ada strings, while still having them look
mostly like C string literals.

Furthermore I added one more consistency check to check_string_literal,
that makes sure that all STRING_CSTs that do have a TYPE_SIZE_UNIT,
the size matches the DECL_SIZE_UNIT.

Those newly discovered properties of string literals make the code in
c_strlen and friends a lot simpler.



Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
Is it OK for trunk?


Thanks
Bernd.




On 08/17/18 15:53, Bernd Edlinger wrote:
> On 08/17/18 15:38, Richard Biener wrote:
>> On Fri, 17 Aug 2018, Bernd Edlinger wrote:
>>
>>> On 08/17/18 14:19, Richard Biener wrote:
>>>> On Fri, 17 Aug 2018, Bernd Edlinger wrote:
>>>>
>>>>> Richard Biener wrote:
>>>>>> +embedded @code{NUL} characters.  However, the
>>>>>> +@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that
>>>>>> +is not part of the language string literal but appended by the front end.
>>>>>> +If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE}
>>>>>> +is one character shorter than @code{TREE_STRING_LENGTH}.
>>>>>> +Excess caracters other than one trailing @code{NUL} character are not
>>>>
>>>> characters btw.
>>>>
>>>
>>> thanks, updated.
>>>
>>>> I read the above that the string literal for
>>>>
>>>> char x[2] = "1";
>>>>
>>>> is actually "1\0\0" - there's one NUL that is not part of the language
>>>> string literal.  The second sentence then suggests that both \0
>>>> are removed because 2 is less than 3?
>>>>
>>>
>>> maybe 2 is a bad example, lets consider:
>>> char x[2000000000] = "1";
>>>
>>> That is a string_cst with STRING_LENGTH = 2, content = "2\0\0"
>>> the array_type is used on both x, and the string_cst,
>>> I was assuming that both tree objects refer to the same type object.
>>> which is char[0..2000000000-1]
>>
>> Oh, didn't realize we use char[200000000] for the STRING_CST.  Makes
>> my suggestion to use char[] instead not (very) much worse than the
>> existing practice then.
>>
>>> varasm assembles the bytes that are given by STRING_LENGTH
>>> and appends zeros as appropriate.
>>>
>>>> As said, having this extra semantics of a STRING_CST tied to
>>>> another tree node (its TREE_TYPE) looks ugly.
>>>>
>>>>>> +permitted.
>>>>>>
>>>>>> I find this very confusing and oppose to that change.  Can we get
>>>>>> back to the drawing board please?  If we want an easy way to
>>>>>> see whether a string is "properly" terminated then maybe we can
>>>>>> simply use a flag that gets set by build_string?
>>>>>>
>>>>>
>>>>> What I mean with that is the case like
>>>>> char x[2] = "123456";
>>>>>
>>>>> which is build_string(7, "123456"), but with a type char[2],
>>>>> so varasm throws away "3456\0".
>>>>
>>>> I think varasm throws away chars not because of the type of
>>>> the STRING_CST but because of the available storage in x.
>>>>
>>>
>>> But at other places we look at the type of the string_cst, don't we?
>>> Shouldn't those be the same?
>>
>> I think most (all?) places look at TREE_TYPE (TREE_TYPE (string))
>> only.  I'm not aware of users of the array domain of the array type
>> of a string - but I'm far from knowing GCC inside-out ;)
>>
> 
> Yes, I know, that happens to me as well on the first day after my holidays ;)
> 
>>>>> I want to say that this is not okay, the excess precision
>>>>> should only be used to strip the nul termination, in cases
>>>>> where it is intended to be a assembled as a not zero terminated
>>>>> string.  But maybe the wording could be improved?
>>>>
>>>> ISTR we always assemble a NUL in .strings to get string merging
>>>> working.
>>>>
>>>
>>> String merging is not working when the string is not explicitly
>>> NUL terminated, my followup patch here tries to fix that:
>>>
>>> [PATCH] Handle not explicitly zero terminated strings in merge sections
>>> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00481.html
>>
>> I'd have expected sth as simple as
>>
>>    if (merge_strings && str[thissize - 1] != '\0')
>>      thissize++;
>>
>> being appended in output_constant.
>>
> 
> Yes, but that can only be done in the .merge.str section,
> otherwise it would happen in structure initializers as well.
> And I would like to undo the case when Ada programs do
> 
> Process ("ABCD" & Ascii.NUL);
> 
> but not for embedded NUL in the string constant.
> like:
> 
> Process ("ABCD" & Acii.NUL & "EFGH");
> 
> 
> Bernd.
2018-08-22  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	* doc/generic.texi (STRING_CST): Update.
	* varasm.c (compare_constant): Compare type size of STRING_CSTs.
	(get_constant_size): Don't make STRING_CSTs larget than they are.
	(check_string_literal): New check function for STRING_CSTs.
	(output_constant): Use it.

diff -Npur gcc/doc/generic.texi gcc/doc/generic.texi
--- gcc/doc/generic.texi	2018-08-16 17:28:11.000000000 +0200
+++ gcc/doc/generic.texi	2018-08-22 10:17:29.852681466 +0200
@@ -1162,9 +1162,13 @@ These nodes represent string-constants.
 returns the length of the string, as an @code{int}.  The
 @code{TREE_STRING_POINTER} is a @code{char*} containing the string
 itself.  The string may not be @code{NUL}-terminated, and it may contain
-embedded @code{NUL} characters.  Therefore, the
-@code{TREE_STRING_LENGTH} includes the trailing @code{NUL} if it is
-present.
+embedded @code{NUL} characters.  However, the
+@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that
+is not part of the language string literal but appended by the front end.
+If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE}
+is one character shorter than @code{TREE_STRING_LENGTH}.
+Excess characters other than one trailing @code{NUL} character are not
+permitted.
 
 For wide string constants, the @code{TREE_STRING_LENGTH} is the number
 of bytes in the string, and the @code{TREE_STRING_POINTER}
@@ -1173,6 +1177,11 @@ target system (that is, as integers in t
 non-wide string constants are distinguished only by the @code{TREE_TYPE}
 of the @code{STRING_CST}.
 
+String constants may also be used for other purpose like assember
+constraints or attributes.  These have no @code{TREE_TYPE} and
+need no explicit @code{NUL}-termination.  Note there is always
+another @code{NUL}-byte beyond @code{TREE_STRING_LENGTH}.
+
 FIXME: The formats of string constants are not well-defined when the
 target system bytes are not the same width as host system bytes.
 
diff -Npur gcc/varasm.c gcc/varasm.c
--- gcc/varasm.c	2018-08-16 17:28:11.000000000 +0200
+++ gcc/varasm.c	2018-08-22 11:42:13.259002521 +0200
@@ -3146,7 +3146,9 @@ compare_constant (const tree t1, const t
       return FIXED_VALUES_IDENTICAL (TREE_FIXED_CST (t1), TREE_FIXED_CST (t2));
 
     case STRING_CST:
-      if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2)))
+      if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2))
+	  || int_size_in_bytes (TREE_TYPE (t1))
+	     != int_size_in_bytes (TREE_TYPE (t2)))
 	return 0;
 
       return (TREE_STRING_LENGTH (t1) == TREE_STRING_LENGTH (t2)
@@ -3303,8 +3305,7 @@ get_constant_size (tree exp)
   HOST_WIDE_INT size;
 
   size = int_size_in_bytes (TREE_TYPE (exp));
-  if (TREE_CODE (exp) == STRING_CST)
-    size = MAX (TREE_STRING_LENGTH (exp), size);
+  gcc_checking_assert (size >= 0);
   return size;
 }
 
@@ -4774,6 +4775,32 @@ initializer_constant_valid_for_bitfield_
   return false;
 }
 
+/* Check if a STRING_CST fits into the field.
+   Tolerate only the case when the NUL termination
+   does not fit into the field.   */
+
+static bool
+check_string_literal (tree string, unsigned HOST_WIDE_INT size)
+{
+  tree eltype = TREE_TYPE (TREE_TYPE (string));
+  unsigned HOST_WIDE_INT elts = tree_to_uhwi (TYPE_SIZE_UNIT (eltype));
+  const char *p = TREE_STRING_POINTER (string);
+  int len = TREE_STRING_LENGTH (string);
+  HOST_WIDE_INT mems = int_size_in_bytes (TREE_TYPE (string));
+
+  if (elts != 1 && elts != 2 && elts != 4)
+    return false;
+  if (len <= 0 || len % elts != 0)
+    return false;
+  if (size < (unsigned)len && size != len - elts)
+    return false;
+  if (memcmp (p + len - elts, "\0\0\0\0", elts) != 0)
+    return false;
+  if (mems >= 0 && mems != (HOST_WIDE_INT)size)
+    return false;
+  return true;
+}
+
 /* output_constructor outer state of relevance in recursive calls, typically
    for nested aggregate bitfields.  */
 
@@ -4942,6 +4969,7 @@ output_constant (tree exp, unsigned HOST
 	case STRING_CST:
 	  thissize
 	    = MIN ((unsigned HOST_WIDE_INT)TREE_STRING_LENGTH (exp), size);
+	  gcc_checking_assert (check_string_literal (exp, size));
 	  assemble_string (TREE_STRING_POINTER (exp), thissize);
 	  break;
 	case VECTOR_CST:
Martin Sebor Aug. 22, 2018, 8:54 p.m. UTC | #19
On 08/22/2018 08:27 AM, Bernd Edlinger wrote:
> Hi,
>
>
> this is an updated version of my STRING_CSTs checking in varasm.c
> patch.
>
> It tries to answer the questions that were raised in th GO string literals
> thread.
>
> The answers are:
> a) strings with TYPE_SIZE_UNIT == NULL do exist, but only for STRING_CSTs
> in constructors of flexible array struct members.  Not for string literals.
> b) In all cases where the STRING_CSTs do have a TYPE_SIZE_UNIT, the
> DECL_SIZE_UNIT has the same value.
> c) When STRING_CSTs do not have a TYPE_SIZE_UNIT, that is in the constructor
> of a flexible array member.  In that case the TREE_STRING_LENGTH
> determines the flexible array size.
>
>
> It changes varasm's get_constant_size to return the TYPE_SIZE_UNIT of
> a STRING_CST literal as it is, without increasing the storage size
> to TREE_STRING_LENGTH.  I added an assertion to make sure that
> all STRING_CSTs have a type size; size == 0 can happen for empty Ada
> strings.
>
> Furthermore it adds code to compare_constant to also compare the
> STRING_CSTs TYPE_SIZE_UNIT if available.
>
> So I want to remove that from get_constant_size in order to not change
> the memory layout of GO and Ada strings, while still having them look
> mostly like C string literals.
>
> Furthermore I added one more consistency check to check_string_literal,
> that makes sure that all STRING_CSTs that do have a TYPE_SIZE_UNIT,
> the size matches the DECL_SIZE_UNIT.
>
> Those newly discovered properties of string literals make the code in
> c_strlen and friends a lot simpler.
>
>
>
> Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
> Is it OK for trunk?


@@ -1162,9 +1162,13 @@ These nodes represent string-constants.

  returns the length of the string, as an @code{int}.  The

  @code{TREE_STRING_POINTER} is a @code{char*} containing the string

  itself.  The string may not be @code{NUL}-terminated, and it may contain

-embedded @code{NUL} characters.  Therefore, the

-@code{TREE_STRING_LENGTH} includes the trailing @code{NUL} if it is

-present.

+embedded @code{NUL} characters.  However, the

+@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that

+is not part of the language string literal but appended by the front end.

+If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE}


If the string is not NUL-terminated (not shall not be -- that
makes it sound like it must not be nul-terminated).

+is one character shorter than @code{TREE_STRING_LENGTH}.


Presumably the phrase "the @code{TREE_TYPE} is shorter" means
that the type of the string is an array whose domain is
[0, TREE_STRING_LENGTH - 1], or something like that.  It would
help to make this clear in a less informal way, especially if
not all STRING_CST types have a domain (sounds like some don't
if they have a null TYPE_SIZE_UNIT).

+Excess characters other than one trailing @code{NUL} character are not

+permitted.



Does this mean that they can be counted on not to exist
because the front ends make sure they don't and the middle
end doesn't create them, or that should not be created but
that they might still exist?

You also mentioned a lot of detail in your answers above that
isn't captured here.  Can that be added?  E.g., the part about
TYPE_SIZE_UNIT of a STRING_CST being allowed to be null seems
especially important, as does the bit about flexible array
members.

Martin
Bernd Edlinger Aug. 22, 2018, 10:52 p.m. UTC | #20
On 08/22/18 22:54, Martin Sebor wrote:
> On 08/22/2018 08:27 AM, Bernd Edlinger wrote:
>> Hi,
>>
>>
>> this is an updated version of my STRING_CSTs checking in varasm.c
>> patch.
>>
>> It tries to answer the questions that were raised in th GO string literals
>> thread.
>>
>> The answers are:
>> a) strings with TYPE_SIZE_UNIT == NULL do exist, but only for STRING_CSTs
>> in constructors of flexible array struct members.  Not for string literals.
>> b) In all cases where the STRING_CSTs do have a TYPE_SIZE_UNIT, the
>> DECL_SIZE_UNIT has the same value.
>> c) When STRING_CSTs do not have a TYPE_SIZE_UNIT, that is in the constructor
>> of a flexible array member.  In that case the TREE_STRING_LENGTH
>> determines the flexible array size.
>>
>>
>> It changes varasm's get_constant_size to return the TYPE_SIZE_UNIT of
>> a STRING_CST literal as it is, without increasing the storage size
>> to TREE_STRING_LENGTH.  I added an assertion to make sure that
>> all STRING_CSTs have a type size; size == 0 can happen for empty Ada
>> strings.
>>
>> Furthermore it adds code to compare_constant to also compare the
>> STRING_CSTs TYPE_SIZE_UNIT if available.
>>
>> So I want to remove that from get_constant_size in order to not change
>> the memory layout of GO and Ada strings, while still having them look
>> mostly like C string literals.
>>
>> Furthermore I added one more consistency check to check_string_literal,
>> that makes sure that all STRING_CSTs that do have a TYPE_SIZE_UNIT,
>> the size matches the DECL_SIZE_UNIT.
>>
>> Those newly discovered properties of string literals make the code in
>> c_strlen and friends a lot simpler.
>>
>>
>>
>> Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
>> Is it OK for trunk?
> 
> 
> @@ -1162,9 +1162,13 @@ These nodes represent string-constants.
> 
>   returns the length of the string, as an @code{int}.  The
> 
>   @code{TREE_STRING_POINTER} is a @code{char*} containing the string
> 
>   itself.  The string may not be @code{NUL}-terminated, and it may contain
> 
> -embedded @code{NUL} characters.  Therefore, the
> 
> -@code{TREE_STRING_LENGTH} includes the trailing @code{NUL} if it is
> 
> -present.
> 
> +embedded @code{NUL} characters.  However, the
> 
> +@code{TREE_STRING_LENGTH} always includes a trailing @code{NUL} that
> 
> +is not part of the language string literal but appended by the front end.
> 
> +If the string shall not be @code{NUL}-terminated the @code{TREE_TYPE}
> 
> 
> If the string is not NUL-terminated (not shall not be -- that
> makes it sound like it must not be nul-terminated).
> 
> +is one character shorter than @code{TREE_STRING_LENGTH}.
> 
> 
> Presumably the phrase "the @code{TREE_TYPE} is shorter" means
> that the type of the string is an array whose domain is
> [0, TREE_STRING_LENGTH - 1], or something like that.  It would
> help to make this clear in a less informal way, especially if
> not all STRING_CST types have a domain (sounds like some don't
> if they have a null TYPE_SIZE_UNIT).
> 

Yes, good point.

> +Excess characters other than one trailing @code{NUL} character are not
> 
> +permitted.
> 
> 
> 
> Does this mean that they can be counted on not to exist
> because the front ends make sure they don't and the middle
> end doesn't create them, or that should not be created but
> that they might still exist?
> 
> You also mentioned a lot of detail in your answers above that
> isn't captured here.  Can that be added?  E.g., the part about
> TYPE_SIZE_UNIT of a STRING_CST being allowed to be null seems
> especially important, as does the bit about flexible array
> members.
>
Yes, initially I just saw the strings which are returned
from string_constant were much longer than the memory they lived in.
Therefore I wrote that patch that shortens the overlength
C strings.  I wanted to keep the property that the bare
string constants are null terminated, and not longer than they
should be, but that the type domain in the array type can
be used to strip off just the wide NUL termination, where the
frontend creates those properly, and the middle-end can
make use of these properties, and varasm.c checks that.

It was Richard's suggestion to add some checking code that the
string constants are actually how we want them to be in varasm.c,
so I did it and discovered a lot of things that are probably
not like we expected them to be.
Some background, and Richards comments are in this thread:
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01328.html

Richard does not think that it is safe to use the
TYPE_SIZE_UNIT like c_strlen, and other places does, but most
of the time the TYPE_SIZE_UNIT of the STRING_CST is valid.
So I used the assertions in this patch to locate places where
that assumption fails.

This revealed that string literals do always have a type domain,
and it is so far always such that
TYPE_SIZE_UNIT(TREE_TYPE) is something, most of the time = TREE_STRING_LENGTH.

Even if it is smaller than TREE_STRING_LENGTH that does not
matter for varasm, when these are actually streamed out they
are made as large as MAX(TREE_STRING_LENGTH, TYPE_SIZE_UNIT).

Thus code that looked at STRING_CST that were previously returned from
string_constant (before the initializer handling was implemented),
were in fact correct to use all bytes up to TREE_STRING_LENGTH-1.

Now for STRING_CST that are in DECL_INITIAL the rules are different.
Their memory size is MIN(TYPE_SIZE_UNIT,TREE_STRING_LENGTH) when
TYPE_SIZE_UNIT or the TYPE_DOMAIN MAX is != NULL.  If it is NULL
we usually have an initializer of a flexible array structure members.
The memory size is then simly TREE_STRING_LENGTH.  I do not know
if string_constant is able to return such incomplete strings or not,
maybe it is better not to return them.

However incomplete array initializers like
const char x[] = "abc";
are only incomplete while parsing (where you call braced_list_to_string)
and are made complete types a bit later (where I would like to move the
braced_list_to_string, after the array type domain is complete).
See: https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01361.html

So with the change to return the DECL_INITIAL STRING_CSTs
from string_constant, all places where the STRING_CSTs are used
are now broken.

This is an attempt to restore correctness on the STRING_CSTs.

So something has to change here.

I do currently think that I want to use the same rules for
string literals and string initializers.  That if they have
a type domain, that defines the memory size.  The Frontends
create always string constants that use a zero-termination.
And the array type domain for C literals is always
[0..TREE_STRING_LENGTH-1]
while Ada uses [0..TREE_STRING_LENGTH-2].

For sting_constant it is okay to use the array type domain,
at least when it is available, and varasm protects the
overall integrity.

Richard wants to use incomplete types everywhere.
But I cannot say if it is always easy to get a DECL_SIZE_UNIT
when one needs it.
I am also not sure that this is also an even larger change.

It is also possible to not do the NUL termination and
striping again for Ada/GO, However, even Ada (and Go) has a need to
get the strings in the string merge section, and that
plays nice with the additional NUL termination when the
string goes into the merge section.

And of course it is also possible to do everything without
any NUL termination, the termination could be simply encoded
in the array type domain.

My suggestions are mostly due to how one can make c_getstr
and c_strlen's life more easy.

But that is I would say details are still in discussion.


Thanks
Bernd.
Bernd Edlinger Aug. 24, 2018, 8:18 p.m. UTC | #21
Hi!


This is an alternative approach in making STRING_CST semantics
consistent.

This means it makes STRING_CST used for literals and for initializers
use exactly the same semantics.

It makes sure that all STRING_CST have a TYPE_SIZE_UNIT, and that it is
always larger than TREE_STRING_LENGTH, these and certain other properties
are checked in varasm.c, so the currently observed consistency issues
in the middle-end can be resolved.


It depends on the following pre-condition patches:

[PATCH] Use complete_array_type on flexible array member initializers
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01538.html

PATCHv2] Call braced_list_to_string after array size is fixed
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01565.html

[PATCHv2] Handle overlength strings in the C FE
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01566.html

[PATCHv2] Handle overlength strings in C++ FE
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01567.html

[PATCHv2] Handle overlength string literals in the fortan FE
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01568.html

The Ada and GO patches are no longer necessary.


Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
Is it OK for trunk?


Thanks
Bernd.
2018-08-24  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	* varasm.c (compare_constant): Compare type size of STRING_CSTs.
	(get_constant_size): Don't make STRING_CSTs larger than they are.
	(check_string_literal): New check function for STRING_CSTs.
	(output_constant): Use it.

Index: gcc/varasm.c
===================================================================
--- gcc/varasm.c	(revision 263807)
+++ gcc/varasm.c	(working copy)
@@ -3146,7 +3146,9 @@ compare_constant (const tree t1, const tree t2)
       return FIXED_VALUES_IDENTICAL (TREE_FIXED_CST (t1), TREE_FIXED_CST (t2));
 
     case STRING_CST:
-      if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2)))
+      if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2))
+	  || int_size_in_bytes (TREE_TYPE (t1))
+	     != int_size_in_bytes (TREE_TYPE (t2)))
 	return 0;
 
       return (TREE_STRING_LENGTH (t1) == TREE_STRING_LENGTH (t2)
@@ -3303,8 +3305,9 @@ get_constant_size (tree exp)
   HOST_WIDE_INT size;
 
   size = int_size_in_bytes (TREE_TYPE (exp));
-  if (TREE_CODE (exp) == STRING_CST)
-    size = MAX (TREE_STRING_LENGTH (exp), size);
+  gcc_checking_assert (size >= 0);
+  gcc_checking_assert (TREE_CODE (exp) != STRING_CST
+		       || size >= TREE_STRING_LENGTH (exp));
   return size;
 }
 
@@ -4774,6 +4777,30 @@ initializer_constant_valid_for_bitfield_p (tree va
   return false;
 }
 
+/* Check if a STRING_CST fits into the field.
+   Tolerate only the case when the NUL termination
+   does not fit into the field.   */
+
+static bool
+check_string_literal (tree string, unsigned HOST_WIDE_INT size)
+{
+  tree type = TREE_TYPE (string);
+  tree eltype = TREE_TYPE (type);
+  unsigned HOST_WIDE_INT elts = tree_to_uhwi (TYPE_SIZE_UNIT (eltype));
+  unsigned HOST_WIDE_INT mem_size = tree_to_uhwi (TYPE_SIZE_UNIT (type));
+  int len = TREE_STRING_LENGTH (string);
+
+  if (elts != 1 && elts != 2 && elts != 4)
+    return false;
+  if (len < 0 || len % elts != 0)
+    return false;
+  if (size < (unsigned)len)
+    return false;
+  if (mem_size != size)
+    return false;
+  return true;
+}
+
 /* output_constructor outer state of relevance in recursive calls, typically
    for nested aggregate bitfields.  */
 
@@ -4942,6 +4969,7 @@ output_constant (tree exp, unsigned HOST_WIDE_INT
 	case STRING_CST:
 	  thissize
 	    = MIN ((unsigned HOST_WIDE_INT)TREE_STRING_LENGTH (exp), size);
+	  gcc_checking_assert (check_string_literal (exp, size));
 	  assemble_string (TREE_STRING_POINTER (exp), thissize);
 	  break;
 	case VECTOR_CST:
Jeff Law Sept. 13, 2018, 6:42 p.m. UTC | #22
On 8/24/18 2:18 PM, Bernd Edlinger wrote:
> 
> [PATCHv2] Handle overlength string literals in the fortan FE
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01568.html
I've committed this patch to the trunk.

jeff
Jeff Law Sept. 13, 2018, 6:44 p.m. UTC | #23
On 8/24/18 2:18 PM, Bernd Edlinger wrote:
> Hi!
> 
> 
> This is an alternative approach in making STRING_CST semantics
> consistent.
> 
> This means it makes STRING_CST used for literals and for initializers
> use exactly the same semantics.
> 
> It makes sure that all STRING_CST have a TYPE_SIZE_UNIT, and that it is
> always larger than TREE_STRING_LENGTH, these and certain other properties
> are checked in varasm.c, so the currently observed consistency issues
> in the middle-end can be resolved.
> 
> 
> It depends on the following pre-condition patches:
[ ... ]

> 
> [PATCHv2] Handle overlength strings in C++ FE
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01567.html
I've committed this patch to the trunk.

jeff
Bernd Edlinger Sept. 13, 2018, 6:59 p.m. UTC | #24
On 09/13/18 20:42, Jeff Law wrote:
> On 8/24/18 2:18 PM, Bernd Edlinger wrote:
>>
>> [PATCHv2] Handle overlength string literals in the fortan FE
>> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01568.html
> I've committed this patch to the trunk.
> 


Hi Jeff,

I fixed the ChangeLog entry,
and committed as obvious:

Index: gcc/cp/ChangeLog
===================================================================
--- gcc/cp/ChangeLog	(Revision 264286)
+++ gcc/cp/ChangeLog	(Revision 264287)
@@ -1,8 +1,6 @@
  2018-09-13  Bernd Edlinger  <bernd.edlinger@hotmail.de>
  
  	* typeck2.c (digest_init_r): Fix overlength strings.
-	* vtable-class-hierarchy.c (build_key_buffer_arg): Make string literal
-	NUL terminated.
  
  2018-09-13  Ville Voutilainen  <ville.voutilainen@gmail.com>
  

Thanks
Bernd.
Bernd Edlinger Sept. 13, 2018, 7:41 p.m. UTC | #25
On 09/13/18 20:44, Jeff Law wrote:
> On 8/24/18 2:18 PM, Bernd Edlinger wrote:
>> Hi!
>>
>>
>> This is an alternative approach in making STRING_CST semantics
>> consistent.
>>
>> This means it makes STRING_CST used for literals and for initializers
>> use exactly the same semantics.
>>
>> It makes sure that all STRING_CST have a TYPE_SIZE_UNIT, and that it is
>> always larger than TREE_STRING_LENGTH, these and certain other properties
>> are checked in varasm.c, so the currently observed consistency issues
>> in the middle-end can be resolved.
>>
>>
>> It depends on the following pre-condition patches:
> [ ... ]
> 
>>
>> [PATCHv2] Handle overlength strings in C++ FE
>> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01567.html
> I've committed this patch to the trunk.
> 

Note, however this leaves pr87503.c semi-broken:

gcc -fpermissive -x c++ -O2 pr87053.c
pr87053.c:11:23: warning: initializer-string for array of chars is too long [-fpermissive]
11 | } u = {{"1234", "567"}};
    |                       ^
$ ./a.out
Aborted (core dumped)


This would be fixed by the patch I posted earlier today:

https://gcc.gnu.org/ml/gcc-patches/2018-09/msg00690.html



Thanks
Bernd.
Jeff Law Sept. 13, 2018, 8:43 p.m. UTC | #26
On 9/13/18 1:41 PM, Bernd Edlinger wrote:
> On 09/13/18 20:44, Jeff Law wrote:
>> On 8/24/18 2:18 PM, Bernd Edlinger wrote:
>>> Hi!
>>>
>>>
>>> This is an alternative approach in making STRING_CST semantics
>>> consistent.
>>>
>>> This means it makes STRING_CST used for literals and for initializers
>>> use exactly the same semantics.
>>>
>>> It makes sure that all STRING_CST have a TYPE_SIZE_UNIT, and that it is
>>> always larger than TREE_STRING_LENGTH, these and certain other properties
>>> are checked in varasm.c, so the currently observed consistency issues
>>> in the middle-end can be resolved.
>>>
>>>
>>> It depends on the following pre-condition patches:
>> [ ... ]
>>
>>>
>>> [PATCHv2] Handle overlength strings in C++ FE
>>> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01567.html
>> I've committed this patch to the trunk.
>>
> 
> Note, however this leaves pr87503.c semi-broken:
> 
> gcc -fpermissive -x c++ -O2 pr87053.c
> pr87053.c:11:23: warning: initializer-string for array of chars is too long [-fpermissive]
> 11 | } u = {{"1234", "567"}};
>     |                       ^
> $ ./a.out
> Aborted (core dumped)
> 
> 
> This would be fixed by the patch I posted earlier today:
> 
> https://gcc.gnu.org/ml/gcc-patches/2018-09/msg00690.html
I'm not done for the day :-)

jeff
Jeff Law Sept. 13, 2018, 9:44 p.m. UTC | #27
On 8/24/18 2:18 PM, Bernd Edlinger wrote:
> Hi!
> 
> 
> This is an alternative approach in making STRING_CST semantics
> consistent.
> 
> This means it makes STRING_CST used for literals and for initializers
> use exactly the same semantics.
> 
> It makes sure that all STRING_CST have a TYPE_SIZE_UNIT, and that it is
> always larger than TREE_STRING_LENGTH, these and certain other properties
> are checked in varasm.c, so the currently observed consistency issues
> in the middle-end can be resolved.
> 
> 
> It depends on the following pre-condition patches:
> 
> [PATCH] Use complete_array_type on flexible array member initializers
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01538.html
> 
> PATCHv2] Call braced_list_to_string after array size is fixed
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01565.html
> 
> [PATCHv2] Handle overlength strings in the C FE
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01566.html
> 
> [PATCHv2] Handle overlength strings in C++ FE
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01567.html
> 
> [PATCHv2] Handle overlength string literals in the fortan FE
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01568.html
> 
> The Ada and GO patches are no longer necessary.
> 
> 
> Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
> Is it OK for trunk?
> 
> 
> Thanks
> Bernd.
> 
> 
> patch-varasm-v2.diff
> 
> 2018-08-24  Bernd Edlinger  <bernd.edlinger@hotmail.de>
> 
> 	* varasm.c (compare_constant): Compare type size of STRING_CSTs.
> 	(get_constant_size): Don't make STRING_CSTs larger than they are.
> 	(check_string_literal): New check function for STRING_CSTs.
> 	(output_constant): Use it.
The prereqs above are all installed and I've installed this patch.
We've still got the pr87053 regression, but I'll commit a fix for that
momentarily.

Jeff
diff mbox series

Patch

2018-08-01  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	* varasm.c (check_string_literal): New checking function.
	(output_constant): Use it.

diff -pur gcc/varasm.c gcc/varasm.c
--- gcc/varasm.c	2018-07-17 11:19:27.000000000 +0200
+++ gcc/varasm.c	2018-07-31 10:16:12.058827505 +0200
@@ -4774,6 +4774,29 @@  initializer_constant_valid_for_bitfield_
   return false;
 }
 
+/* Check if a STRING_CST fits into the field.
+   Tolerate only the case when the NUL termination
+   does not fit into the field.   */
+
+bool
+check_string_literal (tree string, unsigned HOST_WIDE_INT size)
+{
+  tree eltype = TREE_TYPE (TREE_TYPE (string));
+  unsigned HOST_WIDE_INT elts = tree_to_uhwi (TYPE_SIZE_UNIT (eltype));
+  const char *p = TREE_STRING_POINTER (string);
+  int len = TREE_STRING_LENGTH (string);
+
+  if (elts != 1 && elts != 2 && elts != 4)
+    return false;
+  if (len <= 0 || len % elts != 0)
+    return false;
+  if ((unsigned)len != size && (unsigned)len != size + elts)
+    return false;
+  if (memcmp (p + len - elts, "\0\0\0\0", elts) != 0)
+    return false;
+  return true;
+}
+
 /* output_constructor outer state of relevance in recursive calls, typically
    for nested aggregate bitfields.  */
 
@@ -4942,6 +4965,7 @@  output_constant (tree exp, unsigned HOST
 	case STRING_CST:
 	  thissize
 	    = MIN ((unsigned HOST_WIDE_INT)TREE_STRING_LENGTH (exp), size);
+	  gcc_checking_assert (check_string_literal (exp, thissize));
 	  assemble_string (TREE_STRING_POINTER (exp), thissize);
 	  break;
 	case VECTOR_CST: