diff mbox series

Make GO string literals properly NUL terminated

Message ID AM5PR0701MB26570687DEEC36D3BE2F21F4E42E0@AM5PR0701MB2657.eurprd07.prod.outlook.com
State New
Headers show
Series Make GO string literals properly NUL terminated | expand

Commit Message

Bernd Edlinger July 31, 2018, 12:14 p.m. UTC
Hi,


could someone please review this patch and check it in into the GO FE?


Thanks
Bernd.

Comments

Ian Lance Taylor July 31, 2018, 2:40 p.m. UTC | #1
On Tue, Jul 31, 2018 at 5:14 AM, Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> could someone please review this patch and check it in into the GO FE?

I don't understand why the change is correct, and you didn't explain
it.  Go strings are not NUL terminated.  Go strings always have an
associated length.

Ian
Bernd Edlinger July 31, 2018, 4:19 p.m. UTC | #2
On 07/31/18 16:40, Ian Lance Taylor wrote:
> On Tue, Jul 31, 2018 at 5:14 AM, Bernd Edlinger
> <bernd.edlinger@hotmail.de> wrote:
>>
>> could someone please review this patch and check it in into the GO FE?
> 
> I don't understand why the change is correct, and you didn't explain
> it.  Go strings are not NUL terminated.  Go strings always have an
> associated length.
> 

Yes, sorry.  Effectively for go this change is a no-op.
I'll elaborate a bit.

This makes it easier for the middle-end to distinguish between nul-terminated
and not nul terminated strings.  Especially if wide character strings
may also may come along.

In C a not nul terminated string might be declared like
char x[2] = "12";
it is always a STRING_CST object of length 3, with value "12\0".
The array_type is char[0..1]

while a nul terminated string is declared like
char x[3] = "12"
it is also a STRING_CST object of length 3, with value "12\0"
The array_type is char[0..2]

Note however the array type is different.
So with this convention one only needs to compare the array type
size with the string length which is much easier than looking for
a terminating wide character, which is rarely done right.

At the end varasm.c filters the excess NUL byte away, but
I would like to add a checking assertion there that this does not
strip more than max. one wide character nul byte.


Bernd.
Ian Lance Taylor July 31, 2018, 8:18 p.m. UTC | #3
On Tue, Jul 31, 2018 at 9:19 AM, Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
> On 07/31/18 16:40, Ian Lance Taylor wrote:
>> On Tue, Jul 31, 2018 at 5:14 AM, Bernd Edlinger
>> <bernd.edlinger@hotmail.de> wrote:
>>>
>>> could someone please review this patch and check it in into the GO FE?
>>
>> I don't understand why the change is correct, and you didn't explain
>> it.  Go strings are not NUL terminated.  Go strings always have an
>> associated length.
>>
>
> Yes, sorry.  Effectively for go this change is a no-op.
> I'll elaborate a bit.
>
> This makes it easier for the middle-end to distinguish between nul-terminated
> and not nul terminated strings.  Especially if wide character strings
> may also may come along.
>
> In C a not nul terminated string might be declared like
> char x[2] = "12";
> it is always a STRING_CST object of length 3, with value "12\0".
> The array_type is char[0..1]
>
> while a nul terminated string is declared like
> char x[3] = "12"
> it is also a STRING_CST object of length 3, with value "12\0"
> The array_type is char[0..2]
>
> Note however the array type is different.
> So with this convention one only needs to compare the array type
> size with the string length which is much easier than looking for
> a terminating wide character, which is rarely done right.
>
> At the end varasm.c filters the excess NUL byte away, but
> I would like to add a checking assertion there that this does not
> strip more than max. one wide character nul byte.

Thanks, I think I should probably let this be reviewed by someone
reviewing the larger patch.  The go-gcc.cc file lives in the GCC repo
and changes to it can be approved and committed by any GCC middle-end
or global maintainer.  It's not part of the code copied from another
repo, which lives in gcc/go/gofrontend.

Ian
Richard Biener Aug. 1, 2018, 7:01 a.m. UTC | #4
On Tue, 31 Jul 2018, Ian Lance Taylor wrote:

> On Tue, Jul 31, 2018 at 9:19 AM, Bernd Edlinger
> <bernd.edlinger@hotmail.de> wrote:
> > On 07/31/18 16:40, Ian Lance Taylor wrote:
> >> On Tue, Jul 31, 2018 at 5:14 AM, Bernd Edlinger
> >> <bernd.edlinger@hotmail.de> wrote:
> >>>
> >>> could someone please review this patch and check it in into the GO FE?
> >>
> >> I don't understand why the change is correct, and you didn't explain
> >> it.  Go strings are not NUL terminated.  Go strings always have an
> >> associated length.
> >>
> >
> > Yes, sorry.  Effectively for go this change is a no-op.
> > I'll elaborate a bit.
> >
> > This makes it easier for the middle-end to distinguish between nul-terminated
> > and not nul terminated strings.  Especially if wide character strings
> > may also may come along.
> >
> > In C a not nul terminated string might be declared like
> > char x[2] = "12";
> > it is always a STRING_CST object of length 3, with value "12\0".
> > The array_type is char[0..1]
> >
> > while a nul terminated string is declared like
> > char x[3] = "12"
> > it is also a STRING_CST object of length 3, with value "12\0"
> > The array_type is char[0..2]
> >
> > Note however the array type is different.
> > So with this convention one only needs to compare the array type
> > size with the string length which is much easier than looking for
> > a terminating wide character, which is rarely done right.
> >
> > At the end varasm.c filters the excess NUL byte away, but
> > I would like to add a checking assertion there that this does not
> > strip more than max. one wide character nul byte.
> 
> Thanks, I think I should probably let this be reviewed by someone
> reviewing the larger patch.  The go-gcc.cc file lives in the GCC repo
> and changes to it can be approved and committed by any GCC middle-end
> or global maintainer.  It's not part of the code copied from another
> repo, which lives in gcc/go/gofrontend.

The change to have all STRING_CSTs NUL terminated (but that NUL
termination not necessarily inclided in STRING_LENGTH) is a good
one.

I'm not sure how we can reliably verify NUL termination after the
fact though and build_string already makes sure to NUL terminate
STRING_CSTs.  So if GO strings are not NUL terminated then
the STRING_CSTs still are.

Bernd?

Richard.

> Ian
> 
>
Bernd Edlinger Aug. 1, 2018, 7:44 a.m. UTC | #5
> The change to have all STRING_CSTs NUL terminated (but that NUL
> termination not necessarily inclided in STRING_LENGTH) is a good
> one.
> 
> I'm not sure how we can reliably verify NUL termination after the
> fact though and build_string already makes sure to NUL terminate
> STRING_CSTs.  So if GO strings are not NUL terminated then
> the STRING_CSTs still are.

The advantage is that there are less variations how string literals look
like in the middle end.  We will have a simple way to determine if
a string literal is NUL terminated or not.  And checking that property
in varasm.c is exactly the right thing to do.

String literals always have an array_type which may be shorter
than TREE_STRING_LENGTH, but that chops off only exactly
one wide character nul. Otherwise if the array_type is equal or larger,
we know for sure the value is nul terminated. In the middle-end
we can easily determine if a string is not NUL terminated by:

compare_tree_int (TYPE_SIZE_UNIT (TREE_TYPE (init)),
                       TREE_STRING_LENGTH (init)) < 0

I did use that already in my patch for pr86711.

Additionally not having oversize string constants produced
by the front ends, where the extra characters are good for nothing,
also helps to improve correctness.


Bernd.
Richard Biener Aug. 1, 2018, 7:53 a.m. UTC | #6
On Wed, 1 Aug 2018, Bernd Edlinger wrote:

> > The change to have all STRING_CSTs NUL terminated (but that NUL
> > termination not necessarily inclided in STRING_LENGTH) is a good
> > one.
> > 
> > I'm not sure how we can reliably verify NUL termination after the
> > fact though and build_string already makes sure to NUL terminate
> > STRING_CSTs.  So if GO strings are not NUL terminated then
> > the STRING_CSTs still are.
> 
> The advantage is that there are less variations how string literals look
> like in the middle end.  We will have a simple way to determine if
> a string literal is NUL terminated or not.  And checking that property
> in varasm.c is exactly the right thing to do.
> 
> String literals always have an array_type which may be shorter
> than TREE_STRING_LENGTH, but that chops off only exactly
> one wide character nul. Otherwise if the array_type is equal or larger,
> we know for sure the value is nul terminated. In the middle-end
> we can easily determine if a string is not NUL terminated by:
> 
> compare_tree_int (TYPE_SIZE_UNIT (TREE_TYPE (init)),
>                        TREE_STRING_LENGTH (init)) < 0
> 
> I did use that already in my patch for pr86711.

Hmm.  How does that tell you whether the string is NUL terminated?
TREE_STRING_LENGTH includes the NUL termination and if it is
a significant char then TYPE_SIZE_UNIT should as well.

Isn't a proper test to look at TREE_STRING_POINTER[TREE_STRING_LENGTH - 1]
(for HOST_CHAR_BITS strings)?

Relying on the type here looks somewhat fragile to me.

Abstracting a string_cst_nul_terminated_p () helper would be a good
idea I guess.

I realize using strlen(TREE_STRING_POINTER) doesn't work because
of embedded NULs.

> Additionally not having oversize string constants produced
> by the front ends, where the extra characters are good for nothing,
> also helps to improve correctness.
> 
> 
> Bernd.
>
Bernd Edlinger Aug. 1, 2018, 9:22 a.m. UTC | #7
>> > The change to have all STRING_CSTs NUL terminated (but that NUL
>> > termination not necessarily inclided in STRING_LENGTH) is a good
>> > one.
>> >
>> > I'm not sure how we can reliably verify NUL termination after the
>> > fact though and build_string already makes sure to NUL terminate
>> > STRING_CSTs.  So if GO strings are not NUL terminated then
>> > the STRING_CSTs still are.
>>
>> The advantage is that there are less variations how string literals look
>> like in the middle end.  We will have a simple way to determine if
>> a string literal is NUL terminated or not.  And checking that property
>> in varasm.c is exactly the right thing to do.
>>
>> String literals always have an array_type which may be shorter
>> than TREE_STRING_LENGTH, but that chops off only exactly
>> one wide character nul. Otherwise if the array_type is equal or larger,
>> we know for sure the value is nul terminated. In the middle-end
>> we can easily determine if a string is not NUL terminated by:
>>
>> compare_tree_int (TYPE_SIZE_UNIT (TREE_TYPE (init)),
>>                        TREE_STRING_LENGTH (init)) < 0
>>
>> I did use that already in my patch for pr86711.
>
> Hmm.  How does that tell you whether the string is NUL terminated?
> TREE_STRING_LENGTH includes the NUL termination and if it is
> a significant char then TYPE_SIZE_UNIT should as well.

I debugged that code a lot recently.
static const char x[2] = "ab"
gives a TREE_STRING_LENGTH of 3, the TREE_TYPE of that
beast is an array type for char[2]. and TYPE_SIZE_UNIT = 2.

An ordinary C string "ab" has TYPE_SIZE_UNIT(TREE_TYPE(x)) = 3.

Of course with wide caracher strings the TREE_STING_LENGTH
and TYPE_SIZE_UNIT of the ARRAY_TYPE are multiple of
the used wide character type.

So I would like to be able to assume that the STRING_CST objects
are internally always generated properly by the front end.
And that the ARRAY_TYPE of the string literal either has the
same length than the TREE_STRING_LENGTH or if it is shorter,
this is always exactly one (wide) character size less than TREE_STRING_LENGTH

The idea is to use this property of string literals where needed,
and check rigorously in varasm.c.

Does that make sense?

>
> Isn't a proper test to look at TREE_STRING_POINTER[TREE_STRING_LENGTH - 1]
> (for HOST_CHAR_BITS strings)?
>

There are also wide character strings, and all those test are broken everywhere
for wide characters right now.

Therefore checking the string constants at varasm.c revealed a lot of intersting
things.
I will post several patches in the afternoon.

> Relying on the type here looks somewhat fragile to me.
>
> Abstracting a string_cst_nul_terminated_p () helper would be a good
> idea I guess.

Yes. indeed.

> I realize using strlen(TREE_STRING_POINTER) doesn't work because
> of embedded NULs.
>
>> Additionally not having oversize string constants produced
>> by the front ends, where the extra characters are good for nothing,
>> also helps to improve correctness.
>>
Richard Biener Aug. 1, 2018, 9:29 a.m. UTC | #8
On Wed, 1 Aug 2018, Bernd Edlinger wrote:

> >> > The change to have all STRING_CSTs NUL terminated (but that NUL
> >> > termination not necessarily inclided in STRING_LENGTH) is a good
> >> > one.
> >> >
> >> > I'm not sure how we can reliably verify NUL termination after the
> >> > fact though and build_string already makes sure to NUL terminate
> >> > STRING_CSTs.  So if GO strings are not NUL terminated then
> >> > the STRING_CSTs still are.
> >>
> >> The advantage is that there are less variations how string literals look
> >> like in the middle end.  We will have a simple way to determine if
> >> a string literal is NUL terminated or not.  And checking that property
> >> in varasm.c is exactly the right thing to do.
> >>
> >> String literals always have an array_type which may be shorter
> >> than TREE_STRING_LENGTH, but that chops off only exactly
> >> one wide character nul. Otherwise if the array_type is equal or larger,
> >> we know for sure the value is nul terminated. In the middle-end
> >> we can easily determine if a string is not NUL terminated by:
> >>
> >> compare_tree_int (TYPE_SIZE_UNIT (TREE_TYPE (init)),
> >>                        TREE_STRING_LENGTH (init)) < 0
> >>
> >> I did use that already in my patch for pr86711.
> >
> > Hmm.  How does that tell you whether the string is NUL terminated?
> > TREE_STRING_LENGTH includes the NUL termination and if it is
> > a significant char then TYPE_SIZE_UNIT should as well.
> 
> I debugged that code a lot recently.
> static const char x[2] = "ab"
> gives a TREE_STRING_LENGTH of 3, the TREE_TYPE of that
> beast is an array type for char[2]. and TYPE_SIZE_UNIT = 2.

Hmm.  I think it would be nice if TREE_STRING_LENGTH would
match char[2] and TYPE_SIZE_UNIT even if that is inconvenient
for your check above.  Because the '\0' doesn't belong to the
string.  Then build_string internally appends a '\0' outside
of TREE_STRING_LENGTH.

> An ordinary C string "ab" has TYPE_SIZE_UNIT(TREE_TYPE(x)) = 3.
> 
> Of course with wide caracher strings the TREE_STING_LENGTH
> and TYPE_SIZE_UNIT of the ARRAY_TYPE are multiple of
> the used wide character type.

Sure.

> So I would like to be able to assume that the STRING_CST objects
> are internally always generated properly by the front end.

Yeah, I guess we need to define what "properly" is ;)

> And that the ARRAY_TYPE of the string literal either has the
> same length than the TREE_STRING_LENGTH or if it is shorter,
> this is always exactly one (wide) character size less than TREE_STRING_LENGTH

I think it should be always the same...

> The idea is to use this property of string literals where needed,
> and check rigorously in varasm.c.
> 
> Does that make sense?

So if it is not the same then the excess character needs to be
a (wide) NUL in your model?  ISTR your varasm.c patch didn't verify
that.

> >
> > Isn't a proper test to look at TREE_STRING_POINTER[TREE_STRING_LENGTH - 1]
> > (for HOST_CHAR_BITS strings)?
> >
> 
> There are also wide character strings, and all those test are broken everywhere
> for wide characters right now.
> 
> Therefore checking the string constants at varasm.c revealed a lot of intersting
> things.
> I will post several patches in the afternoon.
> 
> > Relying on the type here looks somewhat fragile to me.
> >
> > Abstracting a string_cst_nul_terminated_p () helper would be a good
> > idea I guess.
> 
> Yes. indeed.
> 
> > I realize using strlen(TREE_STRING_POINTER) doesn't work because
> > of embedded NULs.
> >
> >> Additionally not having oversize string constants produced
> >> by the front ends, where the extra characters are good for nothing,
> >> also helps to improve correctness.
> >>
> 
>
Bernd Edlinger Aug. 1, 2018, 1:05 p.m. UTC | #9
On 08/01/18 11:29, Richard Biener wrote:
> 
> Hmm.  I think it would be nice if TREE_STRING_LENGTH would
> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient
> for your check above.  Because the '\0' doesn't belong to the
> string.  Then build_string internally appends a '\0' outside
> of TREE_STRING_LENGTH.
> 

Hmm. Yes, but the outside-0 byte is just one byte, not a wide
character.  There are STRING_CSTs which are not string literals,
for instance attribute tags, Pragmas, asm constrants, etc.
They use the '\0' outside, and have probably no TREE_TYPE.

> 
>> So I would like to be able to assume that the STRING_CST objects
>> are internally always generated properly by the front end.
> 
> Yeah, I guess we need to define what "properly" is ;)
> 
Yes.

>> And that the ARRAY_TYPE of the string literal either has the
>> same length than the TREE_STRING_LENGTH or if it is shorter,
>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH
> 
> I think it should be always the same...
> 

One could not differentiate between "\0" without zero-termination
and "" with zero-termination, theoretically.
We also have char x[100] = "ab";
that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100.
Of course one could create it with a TREE_STRING_LENGTH = 100,
but imagine char x[100000000000] = "ab"

>> The idea is to use this property of string literals where needed,
>> and check rigorously in varasm.c.
>>
>> Does that make sense?
> 
> So if it is not the same then the excess character needs to be
> a (wide) NUL in your model?  ISTR your varasm.c patch didn't verify
> that.
> 

I think it does.


Bernd.
Richard Biener Aug. 20, 2018, 11:01 a.m. UTC | #10
On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>
>
>
> On 08/01/18 11:29, Richard Biener wrote:
> >
> > Hmm.  I think it would be nice if TREE_STRING_LENGTH would
> > match char[2] and TYPE_SIZE_UNIT even if that is inconvenient
> > for your check above.  Because the '\0' doesn't belong to the
> > string.  Then build_string internally appends a '\0' outside
> > of TREE_STRING_LENGTH.
> >
>
> Hmm. Yes, but the outside-0 byte is just one byte, not a wide
> character.

That could be fixed though (a wide 0 is just N 0s).  Add a elsz = 1
parameter to build_string and allocate as many extra 0s as needed.

  There are STRING_CSTs which are not string literals,
> for instance attribute tags, Pragmas, asm constrants, etc.
> They use the '\0' outside, and have probably no TREE_TYPE.
>
> >
> >> So I would like to be able to assume that the STRING_CST objects
> >> are internally always generated properly by the front end.
> >
> > Yeah, I guess we need to define what "properly" is ;)
> >
> Yes.
>
> >> And that the ARRAY_TYPE of the string literal either has the
> >> same length than the TREE_STRING_LENGTH or if it is shorter,
> >> this is always exactly one (wide) character size less than TREE_STRING_LENGTH
> >
> > I think it should be always the same...
> >
>
> One could not differentiate between "\0" without zero-termination
> and "" with zero-termination, theoretically.

Is that important?  Doesn't the C standard say how to parse string literals?

> We also have char x[100] = "ab";
> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100.
> Of course one could create it with a TREE_STRING_LENGTH = 100,
> but imagine char x[100000000000] = "ab"

The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I
hope matches "ab" and not 'x'.  If it matches 'x' then I'd rather have it
unconditionally be [], thus incomplete (because the literals "size" depends
on the context/LHS it is used on).

> >> The idea is to use this property of string literals where needed,
> >> and check rigorously in varasm.c.
> >>
> >> Does that make sense?
> >
> > So if it is not the same then the excess character needs to be
> > a (wide) NUL in your model?  ISTR your varasm.c patch didn't verify
> > that.
> >
>
> I think it does.
>
>
> Bernd.
Bernd Edlinger Aug. 20, 2018, 12:10 p.m. UTC | #11
On 08/20/18 13:01, Richard Biener wrote:
> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>
>>
>>
>> On 08/01/18 11:29, Richard Biener wrote:
>>>
>>> Hmm.  I think it would be nice if TREE_STRING_LENGTH would
>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient
>>> for your check above.  Because the '\0' doesn't belong to the
>>> string.  Then build_string internally appends a '\0' outside
>>> of TREE_STRING_LENGTH.
>>>
>>
>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide
>> character.
> 
> That could be fixed though (a wide 0 is just N 0s).  Add a elsz = 1
> parameter to build_string and allocate as many extra 0s as needed.
> 
>    There are STRING_CSTs which are not string literals,
>> for instance attribute tags, Pragmas, asm constrants, etc.
>> They use the '\0' outside, and have probably no TREE_TYPE.
>>
>>>
>>>> So I would like to be able to assume that the STRING_CST objects
>>>> are internally always generated properly by the front end.
>>>
>>> Yeah, I guess we need to define what "properly" is ;)
>>>
>> Yes.
>>
>>>> And that the ARRAY_TYPE of the string literal either has the
>>>> same length than the TREE_STRING_LENGTH or if it is shorter,
>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH
>>>
>>> I think it should be always the same...
>>>
>>
>> One could not differentiate between "\0" without zero-termination
>> and "" with zero-termination, theoretically.
> 
> Is that important?  Doesn't the C standard say how to parse string literals?
> 
>> We also have char x[100] = "ab";
>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100.
>> Of course one could create it with a TREE_STRING_LENGTH = 100,
>> but imagine char x[100000000000] = "ab"
> 
> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I
> hope matches "ab" and not 'x'.  If it matches 'x' then I'd rather have it
> unconditionally be [], thus incomplete (because the literals "size" depends
> on the context/LHS it is used on).
> 

Sorry, but I must say, it is not at all like that.

If I compile x.c:
const char x[100] = "ab";

and set a breakpoint at output_constant:

Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256,
     reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821
4821	  if (size == 0 || flag_syntax_only)
(gdb) p size
$1 = 100
(gdb) call debug(exp)
"ab"
(gdb) p *exp
$2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1,
(gdb) p exp->typed.type->type_common.size_unit
$5 = (tree) 0x7ffff6ff9d80
(gdb) call debug(exp->typed.type->type_common.size_unit)
100
(gdb) p exp->string.length
$6 = 3
(gdb) p exp->string.str[0]
$8 = 97 'a'
(gdb) p exp->string.str[1]
$9 = 98 'b'
(gdb) p exp->string.str[2]
$10 = 0 '\000'
(gdb) p exp->string.str[3]
$11 = 0 '\000'


This is an important property of string_cst objects, that is used in c_strlen:

It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH
is guaranteed to be zero up to the type size.

I would not have spent one thought on implementing an optimization like that,
but that's how it is right now.

All I want to do, is make sure that all string constants have the same look and feel
in the middle-end, and restrict the variations that are allowed by the current
implementation.


Bernd.


>>>> The idea is to use this property of string literals where needed,
>>>> and check rigorously in varasm.c.
>>>>
>>>> Does that make sense?
>>>
>>> So if it is not the same then the excess character needs to be
>>> a (wide) NUL in your model?  ISTR your varasm.c patch didn't verify
>>> that.
>>>
>>
>> I think it does.
>>
>>
>> Bernd.
Richard Biener Aug. 20, 2018, 1:19 p.m. UTC | #12
On Mon, 20 Aug 2018, Bernd Edlinger wrote:

> On 08/20/18 13:01, Richard Biener wrote:
> > On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> >>
> >>
> >>
> >> On 08/01/18 11:29, Richard Biener wrote:
> >>>
> >>> Hmm.  I think it would be nice if TREE_STRING_LENGTH would
> >>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient
> >>> for your check above.  Because the '\0' doesn't belong to the
> >>> string.  Then build_string internally appends a '\0' outside
> >>> of TREE_STRING_LENGTH.
> >>>
> >>
> >> Hmm. Yes, but the outside-0 byte is just one byte, not a wide
> >> character.
> > 
> > That could be fixed though (a wide 0 is just N 0s).  Add a elsz = 1
> > parameter to build_string and allocate as many extra 0s as needed.
> > 
> >    There are STRING_CSTs which are not string literals,
> >> for instance attribute tags, Pragmas, asm constrants, etc.
> >> They use the '\0' outside, and have probably no TREE_TYPE.
> >>
> >>>
> >>>> So I would like to be able to assume that the STRING_CST objects
> >>>> are internally always generated properly by the front end.
> >>>
> >>> Yeah, I guess we need to define what "properly" is ;)
> >>>
> >> Yes.
> >>
> >>>> And that the ARRAY_TYPE of the string literal either has the
> >>>> same length than the TREE_STRING_LENGTH or if it is shorter,
> >>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH
> >>>
> >>> I think it should be always the same...
> >>>
> >>
> >> One could not differentiate between "\0" without zero-termination
> >> and "" with zero-termination, theoretically.
> > 
> > Is that important?  Doesn't the C standard say how to parse string literals?
> > 
> >> We also have char x[100] = "ab";
> >> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100.
> >> Of course one could create it with a TREE_STRING_LENGTH = 100,
> >> but imagine char x[100000000000] = "ab"
> > 
> > The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I
> > hope matches "ab" and not 'x'.  If it matches 'x' then I'd rather have it
> > unconditionally be [], thus incomplete (because the literals "size" depends
> > on the context/LHS it is used on).
> > 
> 
> Sorry, but I must say, it is not at all like that.
> 
> If I compile x.c:
> const char x[100] = "ab";
> 
> and set a breakpoint at output_constant:
> 
> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256,
>      reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821
> 4821	  if (size == 0 || flag_syntax_only)
> (gdb) p size
> $1 = 100
> (gdb) call debug(exp)
> "ab"
> (gdb) p *exp
> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1,
> (gdb) p exp->typed.type->type_common.size_unit
> $5 = (tree) 0x7ffff6ff9d80
> (gdb) call debug(exp->typed.type->type_common.size_unit)
> 100
> (gdb) p exp->string.length
> $6 = 3
> (gdb) p exp->string.str[0]
> $8 = 97 'a'
> (gdb) p exp->string.str[1]
> $9 = 98 'b'
> (gdb) p exp->string.str[2]
> $10 = 0 '\000'
> (gdb) p exp->string.str[3]
> $11 = 0 '\000'
> 
> 
> This is an important property of string_cst objects, that is used in c_strlen:
> 
> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH
> is guaranteed to be zero up to the type size.
> 
> I would not have spent one thought on implementing an optimization like that,
> but that's how it is right now.

Huh.  So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka
they have zero-padding up to its type size.  I don't see this documented
anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0"
with appropriate TYPE_SIZE.

This is also a relatively new thing on trunk (coming out of the added
mem_size parameter of string_constant).  That this looks at the STRING_CST
type like

  if (TREE_CODE (array) == STRING_CST)
    {
      *ptr_offset = fold_convert (sizetype, offset);
      if (mem_size)
        *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
      return array;

I'd call simply a bug.  As said, interpretation of STRING_CSTs should
depend on the context.  And for

char a[4] = "abc";
char b[5] = "abc";

we should better be able to share STRING_CSTs [you can see LTO
sharing the nodes if you do b[4] but not when b[5] I suppose].

> All I want to do, is make sure that all string constants have the same look and feel
> in the middle-end, and restrict the variations that are allowed by the current
> implementation.

Sure, I understand that.  But I'd like to simplify things and not add
complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide
whether sth is 0-terminated.

Richard.

> 
> Bernd.
> 
> 
> >>>> The idea is to use this property of string literals where needed,
> >>>> and check rigorously in varasm.c.
> >>>>
> >>>> Does that make sense?
> >>>
> >>> So if it is not the same then the excess character needs to be
> >>> a (wide) NUL in your model?  ISTR your varasm.c patch didn't verify
> >>> that.
> >>>
> >>
> >> I think it does.
> >>
> >>
> >> Bernd.
>
Bernd Edlinger Aug. 20, 2018, 3:59 p.m. UTC | #13
On 08/20/18 15:19, Richard Biener wrote:
> On Mon, 20 Aug 2018, Bernd Edlinger wrote:
> 
>> On 08/20/18 13:01, Richard Biener wrote:
>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>>>
>>>>
>>>>
>>>> On 08/01/18 11:29, Richard Biener wrote:
>>>>>
>>>>> Hmm.  I think it would be nice if TREE_STRING_LENGTH would
>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient
>>>>> for your check above.  Because the '\0' doesn't belong to the
>>>>> string.  Then build_string internally appends a '\0' outside
>>>>> of TREE_STRING_LENGTH.
>>>>>
>>>>
>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide
>>>> character.
>>>
>>> That could be fixed though (a wide 0 is just N 0s).  Add a elsz = 1
>>> parameter to build_string and allocate as many extra 0s as needed.
>>>
>>>     There are STRING_CSTs which are not string literals,
>>>> for instance attribute tags, Pragmas, asm constrants, etc.
>>>> They use the '\0' outside, and have probably no TREE_TYPE.
>>>>
>>>>>
>>>>>> So I would like to be able to assume that the STRING_CST objects
>>>>>> are internally always generated properly by the front end.
>>>>>
>>>>> Yeah, I guess we need to define what "properly" is ;)
>>>>>
>>>> Yes.
>>>>
>>>>>> And that the ARRAY_TYPE of the string literal either has the
>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter,
>>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH
>>>>>
>>>>> I think it should be always the same...
>>>>>
>>>>
>>>> One could not differentiate between "\0" without zero-termination
>>>> and "" with zero-termination, theoretically.
>>>
>>> Is that important?  Doesn't the C standard say how to parse string literals?
>>>
>>>> We also have char x[100] = "ab";
>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100.
>>>> Of course one could create it with a TREE_STRING_LENGTH = 100,
>>>> but imagine char x[100000000000] = "ab"
>>>
>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I
>>> hope matches "ab" and not 'x'.  If it matches 'x' then I'd rather have it
>>> unconditionally be [], thus incomplete (because the literals "size" depends
>>> on the context/LHS it is used on).
>>>
>>
>> Sorry, but I must say, it is not at all like that.
>>
>> If I compile x.c:
>> const char x[100] = "ab";
>>
>> and set a breakpoint at output_constant:
>>
>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256,
>>       reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821
>> 4821	  if (size == 0 || flag_syntax_only)
>> (gdb) p size
>> $1 = 100
>> (gdb) call debug(exp)
>> "ab"
>> (gdb) p *exp
>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1,
>> (gdb) p exp->typed.type->type_common.size_unit
>> $5 = (tree) 0x7ffff6ff9d80
>> (gdb) call debug(exp->typed.type->type_common.size_unit)
>> 100
>> (gdb) p exp->string.length
>> $6 = 3
>> (gdb) p exp->string.str[0]
>> $8 = 97 'a'
>> (gdb) p exp->string.str[1]
>> $9 = 98 'b'
>> (gdb) p exp->string.str[2]
>> $10 = 0 '\000'
>> (gdb) p exp->string.str[3]
>> $11 = 0 '\000'
>>
>>
>> This is an important property of string_cst objects, that is used in c_strlen:
>>
>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH
>> is guaranteed to be zero up to the type size.
>>
>> I would not have spent one thought on implementing an optimization like that,
>> but that's how it is right now.
> 
> Huh.  So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka
> they have zero-padding up to its type size.  I don't see this documented
> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0"
> with appropriate TYPE_SIZE.
> 
> This is also a relatively new thing on trunk (coming out of the added
> mem_size parameter of string_constant).  That this looks at the STRING_CST
> type like
> 
>    if (TREE_CODE (array) == STRING_CST)
>      {
>        *ptr_offset = fold_convert (sizetype, offset);
>        if (mem_size)
>          *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
>        return array;
> 
> I'd call simply a bug.  As said, interpretation of STRING_CSTs should
> depend on the context.  And for
> 

This use of the TYPE_SIZE_UNIT was pre-existing to r263607
before that (but not in the gcc-8-branch) we had this in c_strlen:

   HOST_WIDE_INT maxelts = strelts;
   tree type = TREE_TYPE (src);
   if (tree size = TYPE_SIZE_UNIT (type))
     if (tree_fits_shwi_p (size))
       {
        maxelts = tree_to_uhwi (size);
        maxelts = maxelts / eltsize - 1;
       }

which I moved to string_constant and return the result through memsize.

It seems to be working somehow, and I'd bet removing that will immediately
break twenty or thirty of the strlenopt tests.

Obviously the tree string objects allow way too much variations,
and it has to be restricted in some way or another.

In the moment my approach is: look at what most string constants do
and add assertions to ensure that there are no exceptions.


> char a[4] = "abc";
> char b[5] = "abc";
> 
> we should better be able to share STRING_CSTs [you can see LTO
> sharing the nodes if you do b[4] but not when b[5] I suppose].
> 
>> All I want to do, is make sure that all string constants have the same look and feel
>> in the middle-end, and restrict the variations that are allowed by the current
>> implementation.
> 
> Sure, I understand that.  But I'd like to simplify things and not add
> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide
> whether sth is 0-terminated.
> 

This is not only about 0-terminated. (*)

It is also about when you get a STRING_CST with a TREE_STRING_LENGTH,
there are places in the middle-end that assume that the object contains
_all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT.
Those I want to protect.

Bernd.


*: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero,
the string is only returned when it is properly zero terminated.",
but maybe I should have written:
"If MEM_SIZE is zero, the string is only returned when the storage size
of the string object is at least TREE_STRING_LENGTH."
That's at least exactly what the check does.

> Richard.
> 
>>
>> Bernd.
>>
>>
>>>>>> The idea is to use this property of string literals where needed,
>>>>>> and check rigorously in varasm.c.
>>>>>>
>>>>>> Does that make sense?
>>>>>
>>>>> So if it is not the same then the excess character needs to be
>>>>> a (wide) NUL in your model?  ISTR your varasm.c patch didn't verify
>>>>> that.
>>>>>
>>>>
>>>> I think it does.
>>>>
>>>>
>>>> Bernd.
>>
>
Bernd Edlinger Aug. 21, 2018, 5:29 a.m. UTC | #14
On 08/20/18 17:59, Bernd Edlinger wrote:
> On 08/20/18 15:19, Richard Biener wrote:
>> On Mon, 20 Aug 2018, Bernd Edlinger wrote:
>>
>>> On 08/20/18 13:01, Richard Biener wrote:
>>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 08/01/18 11:29, Richard Biener wrote:
>>>>>>
>>>>>> Hmm.  I think it would be nice if TREE_STRING_LENGTH would
>>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient
>>>>>> for your check above.  Because the '\0' doesn't belong to the
>>>>>> string.  Then build_string internally appends a '\0' outside
>>>>>> of TREE_STRING_LENGTH.
>>>>>>
>>>>>
>>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide
>>>>> character.
>>>>
>>>> That could be fixed though (a wide 0 is just N 0s).  Add a elsz = 1
>>>> parameter to build_string and allocate as many extra 0s as needed.
>>>>
>>>>     There are STRING_CSTs which are not string literals,
>>>>> for instance attribute tags, Pragmas, asm constrants, etc.
>>>>> They use the '\0' outside, and have probably no TREE_TYPE.
>>>>>
>>>>>>
>>>>>>> So I would like to be able to assume that the STRING_CST objects
>>>>>>> are internally always generated properly by the front end.
>>>>>>
>>>>>> Yeah, I guess we need to define what "properly" is ;)
>>>>>>
>>>>> Yes.
>>>>>
>>>>>>> And that the ARRAY_TYPE of the string literal either has the
>>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter,
>>>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH
>>>>>>
>>>>>> I think it should be always the same...
>>>>>>
>>>>>
>>>>> One could not differentiate between "\0" without zero-termination
>>>>> and "" with zero-termination, theoretically.
>>>>
>>>> Is that important?  Doesn't the C standard say how to parse string literals?
>>>>
>>>>> We also have char x[100] = "ab";
>>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100.
>>>>> Of course one could create it with a TREE_STRING_LENGTH = 100,
>>>>> but imagine char x[100000000000] = "ab"
>>>>
>>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I
>>>> hope matches "ab" and not 'x'.  If it matches 'x' then I'd rather have it
>>>> unconditionally be [], thus incomplete (because the literals "size" depends
>>>> on the context/LHS it is used on).
>>>>
>>>
>>> Sorry, but I must say, it is not at all like that.
>>>
>>> If I compile x.c:
>>> const char x[100] = "ab";
>>>
>>> and set a breakpoint at output_constant:
>>>
>>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256,
>>>       reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821
>>> 4821      if (size == 0 || flag_syntax_only)
>>> (gdb) p size
>>> $1 = 100
>>> (gdb) call debug(exp)
>>> "ab"
>>> (gdb) p *exp
>>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1,
>>> (gdb) p exp->typed.type->type_common.size_unit
>>> $5 = (tree) 0x7ffff6ff9d80
>>> (gdb) call debug(exp->typed.type->type_common.size_unit)
>>> 100
>>> (gdb) p exp->string.length
>>> $6 = 3
>>> (gdb) p exp->string.str[0]
>>> $8 = 97 'a'
>>> (gdb) p exp->string.str[1]
>>> $9 = 98 'b'
>>> (gdb) p exp->string.str[2]
>>> $10 = 0 '\000'
>>> (gdb) p exp->string.str[3]
>>> $11 = 0 '\000'
>>>
>>>
>>> This is an important property of string_cst objects, that is used in c_strlen:
>>>
>>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH
>>> is guaranteed to be zero up to the type size.
>>>
>>> I would not have spent one thought on implementing an optimization like that,
>>> but that's how it is right now.
>>
>> Huh.  So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka
>> they have zero-padding up to its type size.  I don't see this documented
>> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0"
>> with appropriate TYPE_SIZE.
>>
>> This is also a relatively new thing on trunk (coming out of the added
>> mem_size parameter of string_constant).  That this looks at the STRING_CST
>> type like
>>
>>    if (TREE_CODE (array) == STRING_CST)
>>      {
>>        *ptr_offset = fold_convert (sizetype, offset);
>>        if (mem_size)
>>          *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
>>        return array;
>>
>> I'd call simply a bug.  As said, interpretation of STRING_CSTs should
>> depend on the context.  And for
>>
> 
> This use of the TYPE_SIZE_UNIT was pre-existing to r263607
> before that (but not in the gcc-8-branch) we had this in c_strlen:
> 
>    HOST_WIDE_INT maxelts = strelts;
>    tree type = TREE_TYPE (src);
>    if (tree size = TYPE_SIZE_UNIT (type))
>      if (tree_fits_shwi_p (size))
>        {
>         maxelts = tree_to_uhwi (size);
>         maxelts = maxelts / eltsize - 1;
>        }
> 
> which I moved to string_constant and return the result through memsize.
> 
> It seems to be working somehow, and I'd bet removing that will immediately
> break twenty or thirty of the strlenopt tests.
> 
> Obviously the tree string objects allow way too much variations,
> and it has to be restricted in some way or another.
> 
> In the moment my approach is: look at what most string constants do
> and add assertions to ensure that there are no exceptions.
> 
> 

I found some more places where TYPE_SIZE_UNIT (TREE_TYPE (str))
and TREE_STRING_LENGTH (str) are used together, in some very old code
and clearly Martin is not to blame for them:

varasm.c:

static HOST_WIDE_INT
get_constant_size (tree exp)
{
   HOST_WIDE_INT size;

   size = int_size_in_bytes (TREE_TYPE (exp));
   if (TREE_CODE (exp) == STRING_CST)
     size = MAX (TREE_STRING_LENGTH (exp), size);
   return size;
}

mergeable_string_section (tree decl ATTRIBUTE_UNUSED,
                           unsigned HOST_WIDE_INT align ATTRIBUTE_UNUSED,
                           unsigned int flags ATTRIBUTE_UNUSED)
{
   HOST_WIDE_INT len;

   if (HAVE_GAS_SHF_MERGE && flag_merge_constants
       && TREE_CODE (decl) == STRING_CST
       && TREE_CODE (TREE_TYPE (decl)) == ARRAY_TYPE
       && align <= 256
       && (len = int_size_in_bytes (TREE_TYPE (decl))) > 0
       && TREE_STRING_LENGTH (decl) >= len)

So, no sorry, I don't see how to change the STRING_CST from using the
TREE_TYPE in that way.


Bernd.

>> char a[4] = "abc";
>> char b[5] = "abc";
>>
>> we should better be able to share STRING_CSTs [you can see LTO
>> sharing the nodes if you do b[4] but not when b[5] I suppose].
>>
>>> All I want to do, is make sure that all string constants have the same look and feel
>>> in the middle-end, and restrict the variations that are allowed by the current
>>> implementation.
>>
>> Sure, I understand that.  But I'd like to simplify things and not add
>> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide
>> whether sth is 0-terminated.
>>
> 
> This is not only about 0-terminated. (*)
> 
> It is also about when you get a STRING_CST with a TREE_STRING_LENGTH,
> there are places in the middle-end that assume that the object contains
> _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT.
> Those I want to protect.
> 
> Bernd.
> 
> 
> *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero,
> the string is only returned when it is properly zero terminated.",
> but maybe I should have written:
> "If MEM_SIZE is zero, the string is only returned when the storage size
> of the string object is at least TREE_STRING_LENGTH."
> That's at least exactly what the check does.
> 
>> Richard.
>>
>>>
>>> Bernd.
>>>
>>>
>>>>>>> The idea is to use this property of string literals where needed,
>>>>>>> and check rigorously in varasm.c.
>>>>>>>
>>>>>>> Does that make sense?
>>>>>>
>>>>>> So if it is not the same then the excess character needs to be
>>>>>> a (wide) NUL in your model?  ISTR your varasm.c patch didn't verify
>>>>>> that.
>>>>>>
>>>>>
>>>>> I think it does.
>>>>>
>>>>>
>>>>> Bernd.
>>>
>>
Richard Biener Aug. 21, 2018, 8:33 a.m. UTC | #15
On Mon, 20 Aug 2018, Bernd Edlinger wrote:

> On 08/20/18 15:19, Richard Biener wrote:
> > On Mon, 20 Aug 2018, Bernd Edlinger wrote:
> > 
> >> On 08/20/18 13:01, Richard Biener wrote:
> >>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 08/01/18 11:29, Richard Biener wrote:
> >>>>>
> >>>>> Hmm.  I think it would be nice if TREE_STRING_LENGTH would
> >>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient
> >>>>> for your check above.  Because the '\0' doesn't belong to the
> >>>>> string.  Then build_string internally appends a '\0' outside
> >>>>> of TREE_STRING_LENGTH.
> >>>>>
> >>>>
> >>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide
> >>>> character.
> >>>
> >>> That could be fixed though (a wide 0 is just N 0s).  Add a elsz = 1
> >>> parameter to build_string and allocate as many extra 0s as needed.
> >>>
> >>>     There are STRING_CSTs which are not string literals,
> >>>> for instance attribute tags, Pragmas, asm constrants, etc.
> >>>> They use the '\0' outside, and have probably no TREE_TYPE.
> >>>>
> >>>>>
> >>>>>> So I would like to be able to assume that the STRING_CST objects
> >>>>>> are internally always generated properly by the front end.
> >>>>>
> >>>>> Yeah, I guess we need to define what "properly" is ;)
> >>>>>
> >>>> Yes.
> >>>>
> >>>>>> And that the ARRAY_TYPE of the string literal either has the
> >>>>>> same length than the TREE_STRING_LENGTH or if it is shorter,
> >>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH
> >>>>>
> >>>>> I think it should be always the same...
> >>>>>
> >>>>
> >>>> One could not differentiate between "\0" without zero-termination
> >>>> and "" with zero-termination, theoretically.
> >>>
> >>> Is that important?  Doesn't the C standard say how to parse string literals?
> >>>
> >>>> We also have char x[100] = "ab";
> >>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100.
> >>>> Of course one could create it with a TREE_STRING_LENGTH = 100,
> >>>> but imagine char x[100000000000] = "ab"
> >>>
> >>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I
> >>> hope matches "ab" and not 'x'.  If it matches 'x' then I'd rather have it
> >>> unconditionally be [], thus incomplete (because the literals "size" depends
> >>> on the context/LHS it is used on).
> >>>
> >>
> >> Sorry, but I must say, it is not at all like that.
> >>
> >> If I compile x.c:
> >> const char x[100] = "ab";
> >>
> >> and set a breakpoint at output_constant:
> >>
> >> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256,
> >>       reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821
> >> 4821	  if (size == 0 || flag_syntax_only)
> >> (gdb) p size
> >> $1 = 100
> >> (gdb) call debug(exp)
> >> "ab"
> >> (gdb) p *exp
> >> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1,
> >> (gdb) p exp->typed.type->type_common.size_unit
> >> $5 = (tree) 0x7ffff6ff9d80
> >> (gdb) call debug(exp->typed.type->type_common.size_unit)
> >> 100
> >> (gdb) p exp->string.length
> >> $6 = 3
> >> (gdb) p exp->string.str[0]
> >> $8 = 97 'a'
> >> (gdb) p exp->string.str[1]
> >> $9 = 98 'b'
> >> (gdb) p exp->string.str[2]
> >> $10 = 0 '\000'
> >> (gdb) p exp->string.str[3]
> >> $11 = 0 '\000'
> >>
> >>
> >> This is an important property of string_cst objects, that is used in c_strlen:
> >>
> >> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH
> >> is guaranteed to be zero up to the type size.
> >>
> >> I would not have spent one thought on implementing an optimization like that,
> >> but that's how it is right now.
> > 
> > Huh.  So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka
> > they have zero-padding up to its type size.  I don't see this documented
> > anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0"
> > with appropriate TYPE_SIZE.
> > 
> > This is also a relatively new thing on trunk (coming out of the added
> > mem_size parameter of string_constant).  That this looks at the STRING_CST
> > type like
> > 
> >    if (TREE_CODE (array) == STRING_CST)
> >      {
> >        *ptr_offset = fold_convert (sizetype, offset);
> >        if (mem_size)
> >          *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
> >        return array;
> > 
> > I'd call simply a bug.  As said, interpretation of STRING_CSTs should
> > depend on the context.  And for
> > 
> 
> This use of the TYPE_SIZE_UNIT was pre-existing to r263607
> before that (but not in the gcc-8-branch) we had this in c_strlen:
> 
>    HOST_WIDE_INT maxelts = strelts;
>    tree type = TREE_TYPE (src);
>    if (tree size = TYPE_SIZE_UNIT (type))
>      if (tree_fits_shwi_p (size))
>        {
>         maxelts = tree_to_uhwi (size);
>         maxelts = maxelts / eltsize - 1;
>        }
> 
> which I moved to string_constant and return the result through memsize.
> 
> It seems to be working somehow, and I'd bet removing that will immediately
> break twenty or thirty of the strlenopt tests.
> 
> Obviously the tree string objects allow way too much variations,
> and it has to be restricted in some way or another.
> 
> In the moment my approach is: look at what most string constants do
> and add assertions to ensure that there are no exceptions.
> 
> 
> > char a[4] = "abc";
> > char b[5] = "abc";
> > 
> > we should better be able to share STRING_CSTs [you can see LTO
> > sharing the nodes if you do b[4] but not when b[5] I suppose].
> > 
> >> All I want to do, is make sure that all string constants have the same look and feel
> >> in the middle-end, and restrict the variations that are allowed by the current
> >> implementation.
> > 
> > Sure, I understand that.  But I'd like to simplify things and not add
> > complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide
> > whether sth is 0-terminated.
> > 
> 
> This is not only about 0-terminated. (*)
> 
> It is also about when you get a STRING_CST with a TREE_STRING_LENGTH,
> there are places in the middle-end that assume that the object contains
> _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT.
> Those I want to protect.

Well, but string_constant handles &STRING_CST just fine but in that
context there's no "object" we look at.

IMHO whenever string_constant ends up with a DECL, looking at
ctor_for_folding and we end up with a STRING_CST that doesn't fit
in the DECL we looked at we have a bug (non-truncated STRING_CST)
and either should do the truncation or simply return NULL.

So we should look at DECL_SIZE_UNIT and compare that with 
TREE_STRING_LENGTH.  Otherwise you are relying on TYPE_SIZE_UNIT
of the STRING_CST being "correct" (fitting the decl context).

> Bernd.
> 
> 
> *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero,
> the string is only returned when it is properly zero terminated.",
> but maybe I should have written:
> "If MEM_SIZE is zero, the string is only returned when the storage size
> of the string object is at least TREE_STRING_LENGTH."
> That's at least exactly what the check does.

Well, as said above

  if (TREE_CODE (array) == STRING_CST)
    {
      *ptr_offset = fold_convert (sizetype, offset);
      if (mem_size)
        *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
      return array;
    }

simply assumes the "storage size" of a STRING_CST is determined
by its TYPE_SIZE_UNIT.  That may be true as you noted in the folloup,
but clearly in the above case there's nothing to protect?  And in
the case we pulled the STRING_CST from some DECL_INITIAL it doesn't
protect from overflow of the FIELD_DECL unless frontends never
generate "wrong" STRING_CSTs?

Btw, get_constant_size / mergeable_string_section suggsts that
we may view STRING_CST as general target-encoded byte blob.
That may be useful to compress CONSTRUCTORs memory use.

It looks like mergeable_string_section wrongly would reject
a char[] typed string because int_size_in_bytes returns -1
for incomplete types.  I still believe using char[] for most
STRING_CSTs generated by FEs would be a good thing for
IL memory use.  Shouldn't the mem_size initializations use
sth like get_constant_size as well?

Also

  if (mem_size)
    *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (init));
  else if (compare_tree_int (array_size, length + 1) < 0)
    return NULL_TREE;

the latter check doesn't seem to honor 'offset' and *mem_size
is again looking at the STRING_CST, not at the FIELD_DECL or
wherever we got the STRING_CST from.

Richard.


> 
> > Richard.
> > 
> >>
> >> Bernd.
> >>
> >>
> >>>>>> The idea is to use this property of string literals where needed,
> >>>>>> and check rigorously in varasm.c.
> >>>>>>
> >>>>>> Does that make sense?
> >>>>>
> >>>>> So if it is not the same then the excess character needs to be
> >>>>> a (wide) NUL in your model?  ISTR your varasm.c patch didn't verify
> >>>>> that.
> >>>>>
> >>>>
> >>>> I think it does.
> >>>>
> >>>>
> >>>> Bernd.
> >>
> > 
>
Bernd Edlinger Aug. 22, 2018, 4:57 a.m. UTC | #16
On 08/21/18 10:33, Richard Biener wrote:
> On Mon, 20 Aug 2018, Bernd Edlinger wrote:
> 
>> On 08/20/18 15:19, Richard Biener wrote:
>>> On Mon, 20 Aug 2018, Bernd Edlinger wrote:
>>>
>>>> On 08/20/18 13:01, Richard Biener wrote:
>>>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 08/01/18 11:29, Richard Biener wrote:
>>>>>>>
>>>>>>> Hmm.  I think it would be nice if TREE_STRING_LENGTH would
>>>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient
>>>>>>> for your check above.  Because the '\0' doesn't belong to the
>>>>>>> string.  Then build_string internally appends a '\0' outside
>>>>>>> of TREE_STRING_LENGTH.
>>>>>>>
>>>>>>
>>>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide
>>>>>> character.
>>>>>
>>>>> That could be fixed though (a wide 0 is just N 0s).  Add a elsz = 1
>>>>> parameter to build_string and allocate as many extra 0s as needed.
>>>>>
>>>>>      There are STRING_CSTs which are not string literals,
>>>>>> for instance attribute tags, Pragmas, asm constrants, etc.
>>>>>> They use the '\0' outside, and have probably no TREE_TYPE.
>>>>>>
>>>>>>>
>>>>>>>> So I would like to be able to assume that the STRING_CST objects
>>>>>>>> are internally always generated properly by the front end.
>>>>>>>
>>>>>>> Yeah, I guess we need to define what "properly" is ;)
>>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>> And that the ARRAY_TYPE of the string literal either has the
>>>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter,
>>>>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH
>>>>>>>
>>>>>>> I think it should be always the same...
>>>>>>>
>>>>>>
>>>>>> One could not differentiate between "\0" without zero-termination
>>>>>> and "" with zero-termination, theoretically.
>>>>>
>>>>> Is that important?  Doesn't the C standard say how to parse string literals?
>>>>>
>>>>>> We also have char x[100] = "ab";
>>>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100.
>>>>>> Of course one could create it with a TREE_STRING_LENGTH = 100,
>>>>>> but imagine char x[100000000000] = "ab"
>>>>>
>>>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I
>>>>> hope matches "ab" and not 'x'.  If it matches 'x' then I'd rather have it
>>>>> unconditionally be [], thus incomplete (because the literals "size" depends
>>>>> on the context/LHS it is used on).
>>>>>
>>>>
>>>> Sorry, but I must say, it is not at all like that.
>>>>
>>>> If I compile x.c:
>>>> const char x[100] = "ab";
>>>>
>>>> and set a breakpoint at output_constant:
>>>>
>>>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256,
>>>>        reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821
>>>> 4821	  if (size == 0 || flag_syntax_only)
>>>> (gdb) p size
>>>> $1 = 100
>>>> (gdb) call debug(exp)
>>>> "ab"
>>>> (gdb) p *exp
>>>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1,
>>>> (gdb) p exp->typed.type->type_common.size_unit
>>>> $5 = (tree) 0x7ffff6ff9d80
>>>> (gdb) call debug(exp->typed.type->type_common.size_unit)
>>>> 100
>>>> (gdb) p exp->string.length
>>>> $6 = 3
>>>> (gdb) p exp->string.str[0]
>>>> $8 = 97 'a'
>>>> (gdb) p exp->string.str[1]
>>>> $9 = 98 'b'
>>>> (gdb) p exp->string.str[2]
>>>> $10 = 0 '\000'
>>>> (gdb) p exp->string.str[3]
>>>> $11 = 0 '\000'
>>>>
>>>>
>>>> This is an important property of string_cst objects, that is used in c_strlen:
>>>>
>>>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH
>>>> is guaranteed to be zero up to the type size.
>>>>
>>>> I would not have spent one thought on implementing an optimization like that,
>>>> but that's how it is right now.
>>>
>>> Huh.  So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka
>>> they have zero-padding up to its type size.  I don't see this documented
>>> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0"
>>> with appropriate TYPE_SIZE.
>>>
>>> This is also a relatively new thing on trunk (coming out of the added
>>> mem_size parameter of string_constant).  That this looks at the STRING_CST
>>> type like
>>>
>>>     if (TREE_CODE (array) == STRING_CST)
>>>       {
>>>         *ptr_offset = fold_convert (sizetype, offset);
>>>         if (mem_size)
>>>           *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
>>>         return array;
>>>
>>> I'd call simply a bug.  As said, interpretation of STRING_CSTs should
>>> depend on the context.  And for
>>>
>>
>> This use of the TYPE_SIZE_UNIT was pre-existing to r263607
>> before that (but not in the gcc-8-branch) we had this in c_strlen:
>>
>>     HOST_WIDE_INT maxelts = strelts;
>>     tree type = TREE_TYPE (src);
>>     if (tree size = TYPE_SIZE_UNIT (type))
>>       if (tree_fits_shwi_p (size))
>>         {
>>          maxelts = tree_to_uhwi (size);
>>          maxelts = maxelts / eltsize - 1;
>>         }
>>
>> which I moved to string_constant and return the result through memsize.
>>
>> It seems to be working somehow, and I'd bet removing that will immediately
>> break twenty or thirty of the strlenopt tests.
>>
>> Obviously the tree string objects allow way too much variations,
>> and it has to be restricted in some way or another.
>>
>> In the moment my approach is: look at what most string constants do
>> and add assertions to ensure that there are no exceptions.
>>
>>
>>> char a[4] = "abc";
>>> char b[5] = "abc";
>>>
>>> we should better be able to share STRING_CSTs [you can see LTO
>>> sharing the nodes if you do b[4] but not when b[5] I suppose].
>>>
>>>> All I want to do, is make sure that all string constants have the same look and feel
>>>> in the middle-end, and restrict the variations that are allowed by the current
>>>> implementation.
>>>
>>> Sure, I understand that.  But I'd like to simplify things and not add
>>> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide
>>> whether sth is 0-terminated.
>>>
>>
>> This is not only about 0-terminated. (*)
>>
>> It is also about when you get a STRING_CST with a TREE_STRING_LENGTH,
>> there are places in the middle-end that assume that the object contains
>> _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT.
>> Those I want to protect.
> 
> Well, but string_constant handles &STRING_CST just fine but in that
> context there's no "object" we look at.
> 
> IMHO whenever string_constant ends up with a DECL, looking at
> ctor_for_folding and we end up with a STRING_CST that doesn't fit
> in the DECL we looked at we have a bug (non-truncated STRING_CST)
> and either should do the truncation or simply return NULL.
> 
> So we should look at DECL_SIZE_UNIT and compare that with
> TREE_STRING_LENGTH.  Otherwise you are relying on TYPE_SIZE_UNIT
> of the STRING_CST being "correct" (fitting the decl context).
> 
>> Bernd.
>>
>>
>> *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero,
>> the string is only returned when it is properly zero terminated.",
>> but maybe I should have written:
>> "If MEM_SIZE is zero, the string is only returned when the storage size
>> of the string object is at least TREE_STRING_LENGTH."
>> That's at least exactly what the check does.
> 
> Well, as said above
> 
>    if (TREE_CODE (array) == STRING_CST)
>      {
>        *ptr_offset = fold_convert (sizetype, offset);
>        if (mem_size)
>          *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
>        return array;
>      }
> 
> simply assumes the "storage size" of a STRING_CST is determined
> by its TYPE_SIZE_UNIT.  That may be true as you noted in the folloup,
> but clearly in the above case there's nothing to protect?  And in
> the case we pulled the STRING_CST from some DECL_INITIAL it doesn't
> protect from overflow of the FIELD_DECL unless frontends never
> generate "wrong" STRING_CSTs?
> 

Hmm, I digged some more in varasm.c

Normal STRING_CST use get_constant_size to allocate
MAX(TYPE_SIZE_UNIT, TREE_STRING_LENGTH)
thus they can have excess zero bytes.

while STRING_CST that are in DECL_INITIALIZERs
are shrinked to DECL_SIZE_UNIT.

At least that difference looks unnecessary to me.


But compare_constant does not compare the TYPE_SIZE_UNIT:

     case STRING_CST:
       if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2)))
         return 0;

       return (TREE_STRING_LENGTH (t1) == TREE_STRING_LENGTH (t2)
               && ! memcmp (TREE_STRING_POINTER (t1), TREE_STRING_POINTER (t2),
                          TREE_STRING_LENGTH (t1)));


This looks now like a bug.


> Btw, get_constant_size / mergeable_string_section suggsts that
> we may view STRING_CST as general target-encoded byte blob.
> That may be useful to compress CONSTRUCTORs memory use.
> 
> It looks like mergeable_string_section wrongly would reject
> a char[] typed string because int_size_in_bytes returns -1
> for incomplete types.  I still believe using char[] for most
> STRING_CSTs generated by FEs would be a good thing for
> IL memory use.  Shouldn't the mem_size initializations use
> sth like get_constant_size as well?
> 

I think incomplete types exist only in structs with variable
length:

struct {
   int x;
   char y[];
} s = { 1, "test" };

this is no candidate for a mergeable string.

but
char y[] = "test";

is fixed by the FE to:
char y[5] = "test";

So that does not make problems with mergeable_string_section.


> Also
> 
>    if (mem_size)
>      *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (init));
>    else if (compare_tree_int (array_size, length + 1) < 0)
>      return NULL_TREE;
> 
> the latter check doesn't seem to honor 'offset' and *mem_size
> is again looking at the STRING_CST, not at the FIELD_DECL or
> wherever we got the STRING_CST from.
> 

the caller compares offset with *mem_size, but we do not have
the FIELD_DECL at hand here (Or I did not know where it was).
I'd rather fail the cases when the TYPE_SIZE_UNIT is null,
or something non-constant.

All those are variations of vla with initializer and similar.
Once the STRING_CST has a type, I would like to use it,
when it doesn't the whole thing is probably not worth it
anyway.


Bernd.
Richard Biener Aug. 24, 2018, 10:52 a.m. UTC | #17
On Wed, Aug 22, 2018 at 6:57 AM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> On 08/21/18 10:33, Richard Biener wrote:
> > On Mon, 20 Aug 2018, Bernd Edlinger wrote:
> >
> >> On 08/20/18 15:19, Richard Biener wrote:
> >>> On Mon, 20 Aug 2018, Bernd Edlinger wrote:
> >>>
> >>>> On 08/20/18 13:01, Richard Biener wrote:
> >>>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 08/01/18 11:29, Richard Biener wrote:
> >>>>>>>
> >>>>>>> Hmm.  I think it would be nice if TREE_STRING_LENGTH would
> >>>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient
> >>>>>>> for your check above.  Because the '\0' doesn't belong to the
> >>>>>>> string.  Then build_string internally appends a '\0' outside
> >>>>>>> of TREE_STRING_LENGTH.
> >>>>>>>
> >>>>>>
> >>>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide
> >>>>>> character.
> >>>>>
> >>>>> That could be fixed though (a wide 0 is just N 0s).  Add a elsz = 1
> >>>>> parameter to build_string and allocate as many extra 0s as needed.
> >>>>>
> >>>>>      There are STRING_CSTs which are not string literals,
> >>>>>> for instance attribute tags, Pragmas, asm constrants, etc.
> >>>>>> They use the '\0' outside, and have probably no TREE_TYPE.
> >>>>>>
> >>>>>>>
> >>>>>>>> So I would like to be able to assume that the STRING_CST objects
> >>>>>>>> are internally always generated properly by the front end.
> >>>>>>>
> >>>>>>> Yeah, I guess we need to define what "properly" is ;)
> >>>>>>>
> >>>>>> Yes.
> >>>>>>
> >>>>>>>> And that the ARRAY_TYPE of the string literal either has the
> >>>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter,
> >>>>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH
> >>>>>>>
> >>>>>>> I think it should be always the same...
> >>>>>>>
> >>>>>>
> >>>>>> One could not differentiate between "\0" without zero-termination
> >>>>>> and "" with zero-termination, theoretically.
> >>>>>
> >>>>> Is that important?  Doesn't the C standard say how to parse string literals?
> >>>>>
> >>>>>> We also have char x[100] = "ab";
> >>>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100.
> >>>>>> Of course one could create it with a TREE_STRING_LENGTH = 100,
> >>>>>> but imagine char x[100000000000] = "ab"
> >>>>>
> >>>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I
> >>>>> hope matches "ab" and not 'x'.  If it matches 'x' then I'd rather have it
> >>>>> unconditionally be [], thus incomplete (because the literals "size" depends
> >>>>> on the context/LHS it is used on).
> >>>>>
> >>>>
> >>>> Sorry, but I must say, it is not at all like that.
> >>>>
> >>>> If I compile x.c:
> >>>> const char x[100] = "ab";
> >>>>
> >>>> and set a breakpoint at output_constant:
> >>>>
> >>>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256,
> >>>>        reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821
> >>>> 4821         if (size == 0 || flag_syntax_only)
> >>>> (gdb) p size
> >>>> $1 = 100
> >>>> (gdb) call debug(exp)
> >>>> "ab"
> >>>> (gdb) p *exp
> >>>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1,
> >>>> (gdb) p exp->typed.type->type_common.size_unit
> >>>> $5 = (tree) 0x7ffff6ff9d80
> >>>> (gdb) call debug(exp->typed.type->type_common.size_unit)
> >>>> 100
> >>>> (gdb) p exp->string.length
> >>>> $6 = 3
> >>>> (gdb) p exp->string.str[0]
> >>>> $8 = 97 'a'
> >>>> (gdb) p exp->string.str[1]
> >>>> $9 = 98 'b'
> >>>> (gdb) p exp->string.str[2]
> >>>> $10 = 0 '\000'
> >>>> (gdb) p exp->string.str[3]
> >>>> $11 = 0 '\000'
> >>>>
> >>>>
> >>>> This is an important property of string_cst objects, that is used in c_strlen:
> >>>>
> >>>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH
> >>>> is guaranteed to be zero up to the type size.
> >>>>
> >>>> I would not have spent one thought on implementing an optimization like that,
> >>>> but that's how it is right now.
> >>>
> >>> Huh.  So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka
> >>> they have zero-padding up to its type size.  I don't see this documented
> >>> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0"
> >>> with appropriate TYPE_SIZE.
> >>>
> >>> This is also a relatively new thing on trunk (coming out of the added
> >>> mem_size parameter of string_constant).  That this looks at the STRING_CST
> >>> type like
> >>>
> >>>     if (TREE_CODE (array) == STRING_CST)
> >>>       {
> >>>         *ptr_offset = fold_convert (sizetype, offset);
> >>>         if (mem_size)
> >>>           *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
> >>>         return array;
> >>>
> >>> I'd call simply a bug.  As said, interpretation of STRING_CSTs should
> >>> depend on the context.  And for
> >>>
> >>
> >> This use of the TYPE_SIZE_UNIT was pre-existing to r263607
> >> before that (but not in the gcc-8-branch) we had this in c_strlen:
> >>
> >>     HOST_WIDE_INT maxelts = strelts;
> >>     tree type = TREE_TYPE (src);
> >>     if (tree size = TYPE_SIZE_UNIT (type))
> >>       if (tree_fits_shwi_p (size))
> >>         {
> >>          maxelts = tree_to_uhwi (size);
> >>          maxelts = maxelts / eltsize - 1;
> >>         }
> >>
> >> which I moved to string_constant and return the result through memsize.
> >>
> >> It seems to be working somehow, and I'd bet removing that will immediately
> >> break twenty or thirty of the strlenopt tests.
> >>
> >> Obviously the tree string objects allow way too much variations,
> >> and it has to be restricted in some way or another.
> >>
> >> In the moment my approach is: look at what most string constants do
> >> and add assertions to ensure that there are no exceptions.
> >>
> >>
> >>> char a[4] = "abc";
> >>> char b[5] = "abc";
> >>>
> >>> we should better be able to share STRING_CSTs [you can see LTO
> >>> sharing the nodes if you do b[4] but not when b[5] I suppose].
> >>>
> >>>> All I want to do, is make sure that all string constants have the same look and feel
> >>>> in the middle-end, and restrict the variations that are allowed by the current
> >>>> implementation.
> >>>
> >>> Sure, I understand that.  But I'd like to simplify things and not add
> >>> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide
> >>> whether sth is 0-terminated.
> >>>
> >>
> >> This is not only about 0-terminated. (*)
> >>
> >> It is also about when you get a STRING_CST with a TREE_STRING_LENGTH,
> >> there are places in the middle-end that assume that the object contains
> >> _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT.
> >> Those I want to protect.
> >
> > Well, but string_constant handles &STRING_CST just fine but in that
> > context there's no "object" we look at.
> >
> > IMHO whenever string_constant ends up with a DECL, looking at
> > ctor_for_folding and we end up with a STRING_CST that doesn't fit
> > in the DECL we looked at we have a bug (non-truncated STRING_CST)
> > and either should do the truncation or simply return NULL.
> >
> > So we should look at DECL_SIZE_UNIT and compare that with
> > TREE_STRING_LENGTH.  Otherwise you are relying on TYPE_SIZE_UNIT
> > of the STRING_CST being "correct" (fitting the decl context).
> >
> >> Bernd.
> >>
> >>
> >> *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero,
> >> the string is only returned when it is properly zero terminated.",
> >> but maybe I should have written:
> >> "If MEM_SIZE is zero, the string is only returned when the storage size
> >> of the string object is at least TREE_STRING_LENGTH."
> >> That's at least exactly what the check does.
> >
> > Well, as said above
> >
> >    if (TREE_CODE (array) == STRING_CST)
> >      {
> >        *ptr_offset = fold_convert (sizetype, offset);
> >        if (mem_size)
> >          *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
> >        return array;
> >      }
> >
> > simply assumes the "storage size" of a STRING_CST is determined
> > by its TYPE_SIZE_UNIT.  That may be true as you noted in the folloup,
> > but clearly in the above case there's nothing to protect?  And in
> > the case we pulled the STRING_CST from some DECL_INITIAL it doesn't
> > protect from overflow of the FIELD_DECL unless frontends never
> > generate "wrong" STRING_CSTs?
> >
>
> Hmm, I digged some more in varasm.c
>
> Normal STRING_CST use get_constant_size to allocate
> MAX(TYPE_SIZE_UNIT, TREE_STRING_LENGTH)
> thus they can have excess zero bytes.
>
> while STRING_CST that are in DECL_INITIALIZERs
> are shrinked to DECL_SIZE_UNIT.
>
> At least that difference looks unnecessary to me.
>
>
> But compare_constant does not compare the TYPE_SIZE_UNIT:
>
>      case STRING_CST:
>        if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2)))
>          return 0;
>
>        return (TREE_STRING_LENGTH (t1) == TREE_STRING_LENGTH (t2)
>                && ! memcmp (TREE_STRING_POINTER (t1), TREE_STRING_POINTER (t2),
>                           TREE_STRING_LENGTH (t1)));
>
>
> This looks now like a bug.

Well.  I think that we emit excess zero bytes for "truncated" STRING_CSTs isn't
a designed feature.  I'd rather not have it that way because the length of a
STRING_CST becomes defined in multiple places if you consider STRING_CST
with [] or too small type then TREE_STRING_LENGTH is authorative while
if the type is larger then that is authorative.

The length of a STRING_CST if output should be only determined from
TREE_STRING_LENGTH.

>
> > Btw, get_constant_size / mergeable_string_section suggsts that
> > we may view STRING_CST as general target-encoded byte blob.
> > That may be useful to compress CONSTRUCTORs memory use.
> >
> > It looks like mergeable_string_section wrongly would reject
> > a char[] typed string because int_size_in_bytes returns -1
> > for incomplete types.  I still believe using char[] for most
> > STRING_CSTs generated by FEs would be a good thing for
> > IL memory use.  Shouldn't the mem_size initializations use
> > sth like get_constant_size as well?
> >
>
> I think incomplete types exist only in structs with variable
> length:
>
> struct {
>    int x;
>    char y[];
> } s = { 1, "test" };
>
> this is no candidate for a mergeable string.
>
> but
> char y[] = "test";
>
> is fixed by the FE to:
> char y[5] = "test";
>
> So that does not make problems with mergeable_string_section.
>
>
> > Also
> >
> >    if (mem_size)
> >      *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (init));
> >    else if (compare_tree_int (array_size, length + 1) < 0)
> >      return NULL_TREE;
> >
> > the latter check doesn't seem to honor 'offset' and *mem_size
> > is again looking at the STRING_CST, not at the FIELD_DECL or
> > wherever we got the STRING_CST from.
> >
>
> the caller compares offset with *mem_size, but we do not have
> the FIELD_DECL at hand here (Or I did not know where it was).
> I'd rather fail the cases when the TYPE_SIZE_UNIT is null,
> or something non-constant.
>
> All those are variations of vla with initializer and similar.
> Once the STRING_CST has a type, I would like to use it,
> when it doesn't the whole thing is probably not worth it
> anyway.

I would rather have STRING_CSTs only have a type for the
elements, not the whole thing...  you can get that
if you drop TYPE_DOMAIN on all STRING_CSTs array types.

Not sure what breaks if we change get_constant_size to
return TREE_STRING_LENGTH unconditionally.  At least
somehow the need to special-case STRING_CSTs looks
odd and I wonder if we do this in other functions as well...

Richard.

>
>
> Bernd.
Bernd Edlinger Aug. 24, 2018, 11:49 a.m. UTC | #18
On 08/24/18 12:52, Richard Biener wrote:
> On Wed, Aug 22, 2018 at 6:57 AM Bernd Edlinger
> <bernd.edlinger@hotmail.de> wrote:
>>
>> On 08/21/18 10:33, Richard Biener wrote:
>>> On Mon, 20 Aug 2018, Bernd Edlinger wrote:
>>>
>>>> On 08/20/18 15:19, Richard Biener wrote:
>>>>> On Mon, 20 Aug 2018, Bernd Edlinger wrote:
>>>>>
>>>>>> On 08/20/18 13:01, Richard Biener wrote:
>>>>>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 08/01/18 11:29, Richard Biener wrote:
>>>>>>>>>
>>>>>>>>> Hmm.  I think it would be nice if TREE_STRING_LENGTH would
>>>>>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient
>>>>>>>>> for your check above.  Because the '\0' doesn't belong to the
>>>>>>>>> string.  Then build_string internally appends a '\0' outside
>>>>>>>>> of TREE_STRING_LENGTH.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide
>>>>>>>> character.
>>>>>>>
>>>>>>> That could be fixed though (a wide 0 is just N 0s).  Add a elsz = 1
>>>>>>> parameter to build_string and allocate as many extra 0s as needed.
>>>>>>>
>>>>>>>       There are STRING_CSTs which are not string literals,
>>>>>>>> for instance attribute tags, Pragmas, asm constrants, etc.
>>>>>>>> They use the '\0' outside, and have probably no TREE_TYPE.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> So I would like to be able to assume that the STRING_CST objects
>>>>>>>>>> are internally always generated properly by the front end.
>>>>>>>>>
>>>>>>>>> Yeah, I guess we need to define what "properly" is ;)
>>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>>> And that the ARRAY_TYPE of the string literal either has the
>>>>>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter,
>>>>>>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH
>>>>>>>>>
>>>>>>>>> I think it should be always the same...
>>>>>>>>>
>>>>>>>>
>>>>>>>> One could not differentiate between "\0" without zero-termination
>>>>>>>> and "" with zero-termination, theoretically.
>>>>>>>
>>>>>>> Is that important?  Doesn't the C standard say how to parse string literals?
>>>>>>>
>>>>>>>> We also have char x[100] = "ab";
>>>>>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100.
>>>>>>>> Of course one could create it with a TREE_STRING_LENGTH = 100,
>>>>>>>> but imagine char x[100000000000] = "ab"
>>>>>>>
>>>>>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I
>>>>>>> hope matches "ab" and not 'x'.  If it matches 'x' then I'd rather have it
>>>>>>> unconditionally be [], thus incomplete (because the literals "size" depends
>>>>>>> on the context/LHS it is used on).
>>>>>>>
>>>>>>
>>>>>> Sorry, but I must say, it is not at all like that.
>>>>>>
>>>>>> If I compile x.c:
>>>>>> const char x[100] = "ab";
>>>>>>
>>>>>> and set a breakpoint at output_constant:
>>>>>>
>>>>>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256,
>>>>>>         reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821
>>>>>> 4821         if (size == 0 || flag_syntax_only)
>>>>>> (gdb) p size
>>>>>> $1 = 100
>>>>>> (gdb) call debug(exp)
>>>>>> "ab"
>>>>>> (gdb) p *exp
>>>>>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1,
>>>>>> (gdb) p exp->typed.type->type_common.size_unit
>>>>>> $5 = (tree) 0x7ffff6ff9d80
>>>>>> (gdb) call debug(exp->typed.type->type_common.size_unit)
>>>>>> 100
>>>>>> (gdb) p exp->string.length
>>>>>> $6 = 3
>>>>>> (gdb) p exp->string.str[0]
>>>>>> $8 = 97 'a'
>>>>>> (gdb) p exp->string.str[1]
>>>>>> $9 = 98 'b'
>>>>>> (gdb) p exp->string.str[2]
>>>>>> $10 = 0 '\000'
>>>>>> (gdb) p exp->string.str[3]
>>>>>> $11 = 0 '\000'
>>>>>>
>>>>>>
>>>>>> This is an important property of string_cst objects, that is used in c_strlen:
>>>>>>
>>>>>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH
>>>>>> is guaranteed to be zero up to the type size.
>>>>>>
>>>>>> I would not have spent one thought on implementing an optimization like that,
>>>>>> but that's how it is right now.
>>>>>
>>>>> Huh.  So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka
>>>>> they have zero-padding up to its type size.  I don't see this documented
>>>>> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0"
>>>>> with appropriate TYPE_SIZE.
>>>>>
>>>>> This is also a relatively new thing on trunk (coming out of the added
>>>>> mem_size parameter of string_constant).  That this looks at the STRING_CST
>>>>> type like
>>>>>
>>>>>      if (TREE_CODE (array) == STRING_CST)
>>>>>        {
>>>>>          *ptr_offset = fold_convert (sizetype, offset);
>>>>>          if (mem_size)
>>>>>            *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
>>>>>          return array;
>>>>>
>>>>> I'd call simply a bug.  As said, interpretation of STRING_CSTs should
>>>>> depend on the context.  And for
>>>>>
>>>>
>>>> This use of the TYPE_SIZE_UNIT was pre-existing to r263607
>>>> before that (but not in the gcc-8-branch) we had this in c_strlen:
>>>>
>>>>      HOST_WIDE_INT maxelts = strelts;
>>>>      tree type = TREE_TYPE (src);
>>>>      if (tree size = TYPE_SIZE_UNIT (type))
>>>>        if (tree_fits_shwi_p (size))
>>>>          {
>>>>           maxelts = tree_to_uhwi (size);
>>>>           maxelts = maxelts / eltsize - 1;
>>>>          }
>>>>
>>>> which I moved to string_constant and return the result through memsize.
>>>>
>>>> It seems to be working somehow, and I'd bet removing that will immediately
>>>> break twenty or thirty of the strlenopt tests.
>>>>
>>>> Obviously the tree string objects allow way too much variations,
>>>> and it has to be restricted in some way or another.
>>>>
>>>> In the moment my approach is: look at what most string constants do
>>>> and add assertions to ensure that there are no exceptions.
>>>>
>>>>
>>>>> char a[4] = "abc";
>>>>> char b[5] = "abc";
>>>>>
>>>>> we should better be able to share STRING_CSTs [you can see LTO
>>>>> sharing the nodes if you do b[4] but not when b[5] I suppose].
>>>>>
>>>>>> All I want to do, is make sure that all string constants have the same look and feel
>>>>>> in the middle-end, and restrict the variations that are allowed by the current
>>>>>> implementation.
>>>>>
>>>>> Sure, I understand that.  But I'd like to simplify things and not add
>>>>> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide
>>>>> whether sth is 0-terminated.
>>>>>
>>>>
>>>> This is not only about 0-terminated. (*)
>>>>
>>>> It is also about when you get a STRING_CST with a TREE_STRING_LENGTH,
>>>> there are places in the middle-end that assume that the object contains
>>>> _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT.
>>>> Those I want to protect.
>>>
>>> Well, but string_constant handles &STRING_CST just fine but in that
>>> context there's no "object" we look at.
>>>
>>> IMHO whenever string_constant ends up with a DECL, looking at
>>> ctor_for_folding and we end up with a STRING_CST that doesn't fit
>>> in the DECL we looked at we have a bug (non-truncated STRING_CST)
>>> and either should do the truncation or simply return NULL.
>>>
>>> So we should look at DECL_SIZE_UNIT and compare that with
>>> TREE_STRING_LENGTH.  Otherwise you are relying on TYPE_SIZE_UNIT
>>> of the STRING_CST being "correct" (fitting the decl context).
>>>
>>>> Bernd.
>>>>
>>>>
>>>> *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero,
>>>> the string is only returned when it is properly zero terminated.",
>>>> but maybe I should have written:
>>>> "If MEM_SIZE is zero, the string is only returned when the storage size
>>>> of the string object is at least TREE_STRING_LENGTH."
>>>> That's at least exactly what the check does.
>>>
>>> Well, as said above
>>>
>>>     if (TREE_CODE (array) == STRING_CST)
>>>       {
>>>         *ptr_offset = fold_convert (sizetype, offset);
>>>         if (mem_size)
>>>           *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
>>>         return array;
>>>       }
>>>
>>> simply assumes the "storage size" of a STRING_CST is determined
>>> by its TYPE_SIZE_UNIT.  That may be true as you noted in the folloup,
>>> but clearly in the above case there's nothing to protect?  And in
>>> the case we pulled the STRING_CST from some DECL_INITIAL it doesn't
>>> protect from overflow of the FIELD_DECL unless frontends never
>>> generate "wrong" STRING_CSTs?
>>>
>>
>> Hmm, I digged some more in varasm.c
>>
>> Normal STRING_CST use get_constant_size to allocate
>> MAX(TYPE_SIZE_UNIT, TREE_STRING_LENGTH)
>> thus they can have excess zero bytes.
>>
>> while STRING_CST that are in DECL_INITIALIZERs
>> are shrinked to DECL_SIZE_UNIT.
>>
>> At least that difference looks unnecessary to me.
>>
>>
>> But compare_constant does not compare the TYPE_SIZE_UNIT:
>>
>>       case STRING_CST:
>>         if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2)))
>>           return 0;
>>
>>         return (TREE_STRING_LENGTH (t1) == TREE_STRING_LENGTH (t2)
>>                 && ! memcmp (TREE_STRING_POINTER (t1), TREE_STRING_POINTER (t2),
>>                            TREE_STRING_LENGTH (t1)));
>>
>>
>> This looks now like a bug.
> 
> Well.  I think that we emit excess zero bytes for "truncated" STRING_CSTs isn't
> a designed feature.  I'd rather not have it that way because the length of a
> STRING_CST becomes defined in multiple places if you consider STRING_CST
> with [] or too small type then TREE_STRING_LENGTH is authorative while
> if the type is larger then that is authorative.
> 
> The length of a STRING_CST if output should be only determined from
> TREE_STRING_LENGTH.
> 

Hmm. I think about that.  It is possible that I change my mind....

Currently the semantic of STRING_CST objects differs for literals
and for initializer elements.

For literals the TREE_STRING_LENGTH is guaranteed to resemble
the runtime string object.

For initializer elements the type domain overrides the TREE_STRING_LENGTH
if it is available.  If it is a flexible array member the type domain
is not available, and the TREE_STRING_LENGTH determines the memory size.

The regression in tree-ssa-forwprop.c (PR 86714) is caused by
the fact that previously constant_string did always return string literals
while now it can as well return initializer elements.
But tree-ssa-forwprop expects string literal semantics.

I would like to remove the semantic differences.

I have a patch in the test, that fixes the TYPE_DOMAIN of flexible
array members.

Maybe the most important property is TYPE_DOMAIN of the STRING_CST should never
be shorter than the TREE_STRING_LENGTH. Then it the difference between
string literals and string initializers will disappear.

And that should probably take precedence over the zero-termination property of
the string_cst objects.

Which will of course put one more question mark on the correctness of the
string_constant, c_strlen etc.


>>
>>> Btw, get_constant_size / mergeable_string_section suggsts that
>>> we may view STRING_CST as general target-encoded byte blob.
>>> That may be useful to compress CONSTRUCTORs memory use.
>>>
>>> It looks like mergeable_string_section wrongly would reject
>>> a char[] typed string because int_size_in_bytes returns -1
>>> for incomplete types.  I still believe using char[] for most
>>> STRING_CSTs generated by FEs would be a good thing for
>>> IL memory use.  Shouldn't the mem_size initializations use
>>> sth like get_constant_size as well?
>>>
>>
>> I think incomplete types exist only in structs with variable
>> length:
>>
>> struct {
>>     int x;
>>     char y[];
>> } s = { 1, "test" };
>>
>> this is no candidate for a mergeable string.
>>
>> but
>> char y[] = "test";
>>
>> is fixed by the FE to:
>> char y[5] = "test";
>>
>> So that does not make problems with mergeable_string_section.
>>
>>
>>> Also
>>>
>>>     if (mem_size)
>>>       *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (init));
>>>     else if (compare_tree_int (array_size, length + 1) < 0)
>>>       return NULL_TREE;
>>>
>>> the latter check doesn't seem to honor 'offset' and *mem_size
>>> is again looking at the STRING_CST, not at the FIELD_DECL or
>>> wherever we got the STRING_CST from.
>>>
>>
>> the caller compares offset with *mem_size, but we do not have
>> the FIELD_DECL at hand here (Or I did not know where it was).
>> I'd rather fail the cases when the TYPE_SIZE_UNIT is null,
>> or something non-constant.
>>
>> All those are variations of vla with initializer and similar.
>> Once the STRING_CST has a type, I would like to use it,
>> when it doesn't the whole thing is probably not worth it
>> anyway.
> 
> I would rather have STRING_CSTs only have a type for the
> elements, not the whole thing...  you can get that
> if you drop TYPE_DOMAIN on all STRING_CSTs array types.
> 

This will break the string_constant code even more.
It depends on the TYPE_DOMAIN to mean something.

> Not sure what breaks if we change get_constant_size to
> return TREE_STRING_LENGTH unconditionally.  At least
> somehow the need to special-case STRING_CSTs looks
> odd and I wonder if we do this in other functions as well...
> 
I think as well this special handling in get_constant_size
is something that has to go away.

It is even worse, the DECL_SIZE_UNIT is also NULL in
flexible array members.

But I think I can make it work the other way round.
So that STRING_CST always have a TYPE_DOMAIN which
may only be larger than the TREE_STRING_LENGTH.



Bernd.

> Richard.
> 
>>
>>
>> Bernd.
Richard Biener Aug. 24, 2018, 12:57 p.m. UTC | #19
On Fri, Aug 24, 2018 at 1:49 PM Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
>
> On 08/24/18 12:52, Richard Biener wrote:
> > On Wed, Aug 22, 2018 at 6:57 AM Bernd Edlinger
> > <bernd.edlinger@hotmail.de> wrote:
> >>
> >> On 08/21/18 10:33, Richard Biener wrote:
> >>> On Mon, 20 Aug 2018, Bernd Edlinger wrote:
> >>>
> >>>> On 08/20/18 15:19, Richard Biener wrote:
> >>>>> On Mon, 20 Aug 2018, Bernd Edlinger wrote:
> >>>>>
> >>>>>> On 08/20/18 13:01, Richard Biener wrote:
> >>>>>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 08/01/18 11:29, Richard Biener wrote:
> >>>>>>>>>
> >>>>>>>>> Hmm.  I think it would be nice if TREE_STRING_LENGTH would
> >>>>>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient
> >>>>>>>>> for your check above.  Because the '\0' doesn't belong to the
> >>>>>>>>> string.  Then build_string internally appends a '\0' outside
> >>>>>>>>> of TREE_STRING_LENGTH.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide
> >>>>>>>> character.
> >>>>>>>
> >>>>>>> That could be fixed though (a wide 0 is just N 0s).  Add a elsz = 1
> >>>>>>> parameter to build_string and allocate as many extra 0s as needed.
> >>>>>>>
> >>>>>>>       There are STRING_CSTs which are not string literals,
> >>>>>>>> for instance attribute tags, Pragmas, asm constrants, etc.
> >>>>>>>> They use the '\0' outside, and have probably no TREE_TYPE.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> So I would like to be able to assume that the STRING_CST objects
> >>>>>>>>>> are internally always generated properly by the front end.
> >>>>>>>>>
> >>>>>>>>> Yeah, I guess we need to define what "properly" is ;)
> >>>>>>>>>
> >>>>>>>> Yes.
> >>>>>>>>
> >>>>>>>>>> And that the ARRAY_TYPE of the string literal either has the
> >>>>>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter,
> >>>>>>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH
> >>>>>>>>>
> >>>>>>>>> I think it should be always the same...
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> One could not differentiate between "\0" without zero-termination
> >>>>>>>> and "" with zero-termination, theoretically.
> >>>>>>>
> >>>>>>> Is that important?  Doesn't the C standard say how to parse string literals?
> >>>>>>>
> >>>>>>>> We also have char x[100] = "ab";
> >>>>>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100.
> >>>>>>>> Of course one could create it with a TREE_STRING_LENGTH = 100,
> >>>>>>>> but imagine char x[100000000000] = "ab"
> >>>>>>>
> >>>>>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I
> >>>>>>> hope matches "ab" and not 'x'.  If it matches 'x' then I'd rather have it
> >>>>>>> unconditionally be [], thus incomplete (because the literals "size" depends
> >>>>>>> on the context/LHS it is used on).
> >>>>>>>
> >>>>>>
> >>>>>> Sorry, but I must say, it is not at all like that.
> >>>>>>
> >>>>>> If I compile x.c:
> >>>>>> const char x[100] = "ab";
> >>>>>>
> >>>>>> and set a breakpoint at output_constant:
> >>>>>>
> >>>>>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256,
> >>>>>>         reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821
> >>>>>> 4821         if (size == 0 || flag_syntax_only)
> >>>>>> (gdb) p size
> >>>>>> $1 = 100
> >>>>>> (gdb) call debug(exp)
> >>>>>> "ab"
> >>>>>> (gdb) p *exp
> >>>>>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1,
> >>>>>> (gdb) p exp->typed.type->type_common.size_unit
> >>>>>> $5 = (tree) 0x7ffff6ff9d80
> >>>>>> (gdb) call debug(exp->typed.type->type_common.size_unit)
> >>>>>> 100
> >>>>>> (gdb) p exp->string.length
> >>>>>> $6 = 3
> >>>>>> (gdb) p exp->string.str[0]
> >>>>>> $8 = 97 'a'
> >>>>>> (gdb) p exp->string.str[1]
> >>>>>> $9 = 98 'b'
> >>>>>> (gdb) p exp->string.str[2]
> >>>>>> $10 = 0 '\000'
> >>>>>> (gdb) p exp->string.str[3]
> >>>>>> $11 = 0 '\000'
> >>>>>>
> >>>>>>
> >>>>>> This is an important property of string_cst objects, that is used in c_strlen:
> >>>>>>
> >>>>>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH
> >>>>>> is guaranteed to be zero up to the type size.
> >>>>>>
> >>>>>> I would not have spent one thought on implementing an optimization like that,
> >>>>>> but that's how it is right now.
> >>>>>
> >>>>> Huh.  So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka
> >>>>> they have zero-padding up to its type size.  I don't see this documented
> >>>>> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0"
> >>>>> with appropriate TYPE_SIZE.
> >>>>>
> >>>>> This is also a relatively new thing on trunk (coming out of the added
> >>>>> mem_size parameter of string_constant).  That this looks at the STRING_CST
> >>>>> type like
> >>>>>
> >>>>>      if (TREE_CODE (array) == STRING_CST)
> >>>>>        {
> >>>>>          *ptr_offset = fold_convert (sizetype, offset);
> >>>>>          if (mem_size)
> >>>>>            *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
> >>>>>          return array;
> >>>>>
> >>>>> I'd call simply a bug.  As said, interpretation of STRING_CSTs should
> >>>>> depend on the context.  And for
> >>>>>
> >>>>
> >>>> This use of the TYPE_SIZE_UNIT was pre-existing to r263607
> >>>> before that (but not in the gcc-8-branch) we had this in c_strlen:
> >>>>
> >>>>      HOST_WIDE_INT maxelts = strelts;
> >>>>      tree type = TREE_TYPE (src);
> >>>>      if (tree size = TYPE_SIZE_UNIT (type))
> >>>>        if (tree_fits_shwi_p (size))
> >>>>          {
> >>>>           maxelts = tree_to_uhwi (size);
> >>>>           maxelts = maxelts / eltsize - 1;
> >>>>          }
> >>>>
> >>>> which I moved to string_constant and return the result through memsize.
> >>>>
> >>>> It seems to be working somehow, and I'd bet removing that will immediately
> >>>> break twenty or thirty of the strlenopt tests.
> >>>>
> >>>> Obviously the tree string objects allow way too much variations,
> >>>> and it has to be restricted in some way or another.
> >>>>
> >>>> In the moment my approach is: look at what most string constants do
> >>>> and add assertions to ensure that there are no exceptions.
> >>>>
> >>>>
> >>>>> char a[4] = "abc";
> >>>>> char b[5] = "abc";
> >>>>>
> >>>>> we should better be able to share STRING_CSTs [you can see LTO
> >>>>> sharing the nodes if you do b[4] but not when b[5] I suppose].
> >>>>>
> >>>>>> All I want to do, is make sure that all string constants have the same look and feel
> >>>>>> in the middle-end, and restrict the variations that are allowed by the current
> >>>>>> implementation.
> >>>>>
> >>>>> Sure, I understand that.  But I'd like to simplify things and not add
> >>>>> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide
> >>>>> whether sth is 0-terminated.
> >>>>>
> >>>>
> >>>> This is not only about 0-terminated. (*)
> >>>>
> >>>> It is also about when you get a STRING_CST with a TREE_STRING_LENGTH,
> >>>> there are places in the middle-end that assume that the object contains
> >>>> _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT.
> >>>> Those I want to protect.
> >>>
> >>> Well, but string_constant handles &STRING_CST just fine but in that
> >>> context there's no "object" we look at.
> >>>
> >>> IMHO whenever string_constant ends up with a DECL, looking at
> >>> ctor_for_folding and we end up with a STRING_CST that doesn't fit
> >>> in the DECL we looked at we have a bug (non-truncated STRING_CST)
> >>> and either should do the truncation or simply return NULL.
> >>>
> >>> So we should look at DECL_SIZE_UNIT and compare that with
> >>> TREE_STRING_LENGTH.  Otherwise you are relying on TYPE_SIZE_UNIT
> >>> of the STRING_CST being "correct" (fitting the decl context).
> >>>
> >>>> Bernd.
> >>>>
> >>>>
> >>>> *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero,
> >>>> the string is only returned when it is properly zero terminated.",
> >>>> but maybe I should have written:
> >>>> "If MEM_SIZE is zero, the string is only returned when the storage size
> >>>> of the string object is at least TREE_STRING_LENGTH."
> >>>> That's at least exactly what the check does.
> >>>
> >>> Well, as said above
> >>>
> >>>     if (TREE_CODE (array) == STRING_CST)
> >>>       {
> >>>         *ptr_offset = fold_convert (sizetype, offset);
> >>>         if (mem_size)
> >>>           *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
> >>>         return array;
> >>>       }
> >>>
> >>> simply assumes the "storage size" of a STRING_CST is determined
> >>> by its TYPE_SIZE_UNIT.  That may be true as you noted in the folloup,
> >>> but clearly in the above case there's nothing to protect?  And in
> >>> the case we pulled the STRING_CST from some DECL_INITIAL it doesn't
> >>> protect from overflow of the FIELD_DECL unless frontends never
> >>> generate "wrong" STRING_CSTs?
> >>>
> >>
> >> Hmm, I digged some more in varasm.c
> >>
> >> Normal STRING_CST use get_constant_size to allocate
> >> MAX(TYPE_SIZE_UNIT, TREE_STRING_LENGTH)
> >> thus they can have excess zero bytes.
> >>
> >> while STRING_CST that are in DECL_INITIALIZERs
> >> are shrinked to DECL_SIZE_UNIT.
> >>
> >> At least that difference looks unnecessary to me.
> >>
> >>
> >> But compare_constant does not compare the TYPE_SIZE_UNIT:
> >>
> >>       case STRING_CST:
> >>         if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2)))
> >>           return 0;
> >>
> >>         return (TREE_STRING_LENGTH (t1) == TREE_STRING_LENGTH (t2)
> >>                 && ! memcmp (TREE_STRING_POINTER (t1), TREE_STRING_POINTER (t2),
> >>                            TREE_STRING_LENGTH (t1)));
> >>
> >>
> >> This looks now like a bug.
> >
> > Well.  I think that we emit excess zero bytes for "truncated" STRING_CSTs isn't
> > a designed feature.  I'd rather not have it that way because the length of a
> > STRING_CST becomes defined in multiple places if you consider STRING_CST
> > with [] or too small type then TREE_STRING_LENGTH is authorative while
> > if the type is larger then that is authorative.
> >
> > The length of a STRING_CST if output should be only determined from
> > TREE_STRING_LENGTH.
> >
>
> Hmm. I think about that.  It is possible that I change my mind....
>
> Currently the semantic of STRING_CST objects differs for literals
> and for initializer elements.
>
> For literals the TREE_STRING_LENGTH is guaranteed to resemble
> the runtime string object.
>
> For initializer elements the type domain overrides the TREE_STRING_LENGTH
> if it is available.  If it is a flexible array member the type domain
> is not available, and the TREE_STRING_LENGTH determines the memory size.
>
> The regression in tree-ssa-forwprop.c (PR 86714) is caused by
> the fact that previously constant_string did always return string literals
> while now it can as well return initializer elements.
> But tree-ssa-forwprop expects string literal semantics.
>
> I would like to remove the semantic differences.
>
> I have a patch in the test, that fixes the TYPE_DOMAIN of flexible
> array members.
>
> Maybe the most important property is TYPE_DOMAIN of the STRING_CST should never
> be shorter than the TREE_STRING_LENGTH. Then it the difference between
> string literals and string initializers will disappear.
>
> And that should probably take precedence over the zero-termination property of
> the string_cst objects.
>
> Which will of course put one more question mark on the correctness of the
> string_constant, c_strlen etc.
>
>
> >>
> >>> Btw, get_constant_size / mergeable_string_section suggsts that
> >>> we may view STRING_CST as general target-encoded byte blob.
> >>> That may be useful to compress CONSTRUCTORs memory use.
> >>>
> >>> It looks like mergeable_string_section wrongly would reject
> >>> a char[] typed string because int_size_in_bytes returns -1
> >>> for incomplete types.  I still believe using char[] for most
> >>> STRING_CSTs generated by FEs would be a good thing for
> >>> IL memory use.  Shouldn't the mem_size initializations use
> >>> sth like get_constant_size as well?
> >>>
> >>
> >> I think incomplete types exist only in structs with variable
> >> length:
> >>
> >> struct {
> >>     int x;
> >>     char y[];
> >> } s = { 1, "test" };
> >>
> >> this is no candidate for a mergeable string.
> >>
> >> but
> >> char y[] = "test";
> >>
> >> is fixed by the FE to:
> >> char y[5] = "test";
> >>
> >> So that does not make problems with mergeable_string_section.
> >>
> >>
> >>> Also
> >>>
> >>>     if (mem_size)
> >>>       *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (init));
> >>>     else if (compare_tree_int (array_size, length + 1) < 0)
> >>>       return NULL_TREE;
> >>>
> >>> the latter check doesn't seem to honor 'offset' and *mem_size
> >>> is again looking at the STRING_CST, not at the FIELD_DECL or
> >>> wherever we got the STRING_CST from.
> >>>
> >>
> >> the caller compares offset with *mem_size, but we do not have
> >> the FIELD_DECL at hand here (Or I did not know where it was).
> >> I'd rather fail the cases when the TYPE_SIZE_UNIT is null,
> >> or something non-constant.
> >>
> >> All those are variations of vla with initializer and similar.
> >> Once the STRING_CST has a type, I would like to use it,
> >> when it doesn't the whole thing is probably not worth it
> >> anyway.
> >
> > I would rather have STRING_CSTs only have a type for the
> > elements, not the whole thing...  you can get that
> > if you drop TYPE_DOMAIN on all STRING_CSTs array types.
> >
>
> This will break the string_constant code even more.
> It depends on the TYPE_DOMAIN to mean something.
>
> > Not sure what breaks if we change get_constant_size to
> > return TREE_STRING_LENGTH unconditionally.  At least
> > somehow the need to special-case STRING_CSTs looks
> > odd and I wonder if we do this in other functions as well...
> >
> I think as well this special handling in get_constant_size
> is something that has to go away.
>
> It is even worse, the DECL_SIZE_UNIT is also NULL in
> flexible array members.
>
> But I think I can make it work the other way round.
> So that STRING_CST always have a TYPE_DOMAIN which
> may only be larger than the TREE_STRING_LENGTH.

OK, that would then be at least clear semantics and follow
that of CONSTRUCTORs (missing "elements" are zero).

Richard.

>
>
> Bernd.
>
> > Richard.
> >
> >>
> >>
> >> Bernd.
diff mbox series

Patch

2018-07-31  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	* go-gcc.cc (Gcc_backend::string_constant_expression): Make string
	literal properly NUL terminated.

diff -pur gcc/go/go-gcc.cc gcc/go/go-gcc.cc
--- gcc/go/go-gcc.cc	2018-06-28 19:46:36.000000000 +0200
+++ gcc/go/go-gcc.cc	2018-07-31 12:52:24.690236476 +0200
@@ -1394,7 +1394,7 @@  Gcc_backend::string_constant_expression(
 					      TYPE_QUAL_CONST);
   tree string_type = build_array_type(const_char_type, index_type);
   TYPE_STRING_FLAG(string_type) = 1;
-  tree string_val = build_string(val.length(), val.data());
+  tree string_val = build_string(val.length() + 1, val.data());
   TREE_TYPE(string_val) = string_type;
 
   return this->make_expression(string_val);