Message ID | AM5PR0701MB26570687DEEC36D3BE2F21F4E42E0@AM5PR0701MB2657.eurprd07.prod.outlook.com |
---|---|
State | New |
Headers | show |
Series | Make GO string literals properly NUL terminated | expand |
On Tue, Jul 31, 2018 at 5:14 AM, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: > > could someone please review this patch and check it in into the GO FE? I don't understand why the change is correct, and you didn't explain it. Go strings are not NUL terminated. Go strings always have an associated length. Ian
On 07/31/18 16:40, Ian Lance Taylor wrote: > On Tue, Jul 31, 2018 at 5:14 AM, Bernd Edlinger > <bernd.edlinger@hotmail.de> wrote: >> >> could someone please review this patch and check it in into the GO FE? > > I don't understand why the change is correct, and you didn't explain > it. Go strings are not NUL terminated. Go strings always have an > associated length. > Yes, sorry. Effectively for go this change is a no-op. I'll elaborate a bit. This makes it easier for the middle-end to distinguish between nul-terminated and not nul terminated strings. Especially if wide character strings may also may come along. In C a not nul terminated string might be declared like char x[2] = "12"; it is always a STRING_CST object of length 3, with value "12\0". The array_type is char[0..1] while a nul terminated string is declared like char x[3] = "12" it is also a STRING_CST object of length 3, with value "12\0" The array_type is char[0..2] Note however the array type is different. So with this convention one only needs to compare the array type size with the string length which is much easier than looking for a terminating wide character, which is rarely done right. At the end varasm.c filters the excess NUL byte away, but I would like to add a checking assertion there that this does not strip more than max. one wide character nul byte. Bernd.
On Tue, Jul 31, 2018 at 9:19 AM, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: > On 07/31/18 16:40, Ian Lance Taylor wrote: >> On Tue, Jul 31, 2018 at 5:14 AM, Bernd Edlinger >> <bernd.edlinger@hotmail.de> wrote: >>> >>> could someone please review this patch and check it in into the GO FE? >> >> I don't understand why the change is correct, and you didn't explain >> it. Go strings are not NUL terminated. Go strings always have an >> associated length. >> > > Yes, sorry. Effectively for go this change is a no-op. > I'll elaborate a bit. > > This makes it easier for the middle-end to distinguish between nul-terminated > and not nul terminated strings. Especially if wide character strings > may also may come along. > > In C a not nul terminated string might be declared like > char x[2] = "12"; > it is always a STRING_CST object of length 3, with value "12\0". > The array_type is char[0..1] > > while a nul terminated string is declared like > char x[3] = "12" > it is also a STRING_CST object of length 3, with value "12\0" > The array_type is char[0..2] > > Note however the array type is different. > So with this convention one only needs to compare the array type > size with the string length which is much easier than looking for > a terminating wide character, which is rarely done right. > > At the end varasm.c filters the excess NUL byte away, but > I would like to add a checking assertion there that this does not > strip more than max. one wide character nul byte. Thanks, I think I should probably let this be reviewed by someone reviewing the larger patch. The go-gcc.cc file lives in the GCC repo and changes to it can be approved and committed by any GCC middle-end or global maintainer. It's not part of the code copied from another repo, which lives in gcc/go/gofrontend. Ian
On Tue, 31 Jul 2018, Ian Lance Taylor wrote: > On Tue, Jul 31, 2018 at 9:19 AM, Bernd Edlinger > <bernd.edlinger@hotmail.de> wrote: > > On 07/31/18 16:40, Ian Lance Taylor wrote: > >> On Tue, Jul 31, 2018 at 5:14 AM, Bernd Edlinger > >> <bernd.edlinger@hotmail.de> wrote: > >>> > >>> could someone please review this patch and check it in into the GO FE? > >> > >> I don't understand why the change is correct, and you didn't explain > >> it. Go strings are not NUL terminated. Go strings always have an > >> associated length. > >> > > > > Yes, sorry. Effectively for go this change is a no-op. > > I'll elaborate a bit. > > > > This makes it easier for the middle-end to distinguish between nul-terminated > > and not nul terminated strings. Especially if wide character strings > > may also may come along. > > > > In C a not nul terminated string might be declared like > > char x[2] = "12"; > > it is always a STRING_CST object of length 3, with value "12\0". > > The array_type is char[0..1] > > > > while a nul terminated string is declared like > > char x[3] = "12" > > it is also a STRING_CST object of length 3, with value "12\0" > > The array_type is char[0..2] > > > > Note however the array type is different. > > So with this convention one only needs to compare the array type > > size with the string length which is much easier than looking for > > a terminating wide character, which is rarely done right. > > > > At the end varasm.c filters the excess NUL byte away, but > > I would like to add a checking assertion there that this does not > > strip more than max. one wide character nul byte. > > Thanks, I think I should probably let this be reviewed by someone > reviewing the larger patch. The go-gcc.cc file lives in the GCC repo > and changes to it can be approved and committed by any GCC middle-end > or global maintainer. It's not part of the code copied from another > repo, which lives in gcc/go/gofrontend. The change to have all STRING_CSTs NUL terminated (but that NUL termination not necessarily inclided in STRING_LENGTH) is a good one. I'm not sure how we can reliably verify NUL termination after the fact though and build_string already makes sure to NUL terminate STRING_CSTs. So if GO strings are not NUL terminated then the STRING_CSTs still are. Bernd? Richard. > Ian > >
> The change to have all STRING_CSTs NUL terminated (but that NUL > termination not necessarily inclided in STRING_LENGTH) is a good > one. > > I'm not sure how we can reliably verify NUL termination after the > fact though and build_string already makes sure to NUL terminate > STRING_CSTs. So if GO strings are not NUL terminated then > the STRING_CSTs still are. The advantage is that there are less variations how string literals look like in the middle end. We will have a simple way to determine if a string literal is NUL terminated or not. And checking that property in varasm.c is exactly the right thing to do. String literals always have an array_type which may be shorter than TREE_STRING_LENGTH, but that chops off only exactly one wide character nul. Otherwise if the array_type is equal or larger, we know for sure the value is nul terminated. In the middle-end we can easily determine if a string is not NUL terminated by: compare_tree_int (TYPE_SIZE_UNIT (TREE_TYPE (init)), TREE_STRING_LENGTH (init)) < 0 I did use that already in my patch for pr86711. Additionally not having oversize string constants produced by the front ends, where the extra characters are good for nothing, also helps to improve correctness. Bernd.
On Wed, 1 Aug 2018, Bernd Edlinger wrote: > > The change to have all STRING_CSTs NUL terminated (but that NUL > > termination not necessarily inclided in STRING_LENGTH) is a good > > one. > > > > I'm not sure how we can reliably verify NUL termination after the > > fact though and build_string already makes sure to NUL terminate > > STRING_CSTs. So if GO strings are not NUL terminated then > > the STRING_CSTs still are. > > The advantage is that there are less variations how string literals look > like in the middle end. We will have a simple way to determine if > a string literal is NUL terminated or not. And checking that property > in varasm.c is exactly the right thing to do. > > String literals always have an array_type which may be shorter > than TREE_STRING_LENGTH, but that chops off only exactly > one wide character nul. Otherwise if the array_type is equal or larger, > we know for sure the value is nul terminated. In the middle-end > we can easily determine if a string is not NUL terminated by: > > compare_tree_int (TYPE_SIZE_UNIT (TREE_TYPE (init)), > TREE_STRING_LENGTH (init)) < 0 > > I did use that already in my patch for pr86711. Hmm. How does that tell you whether the string is NUL terminated? TREE_STRING_LENGTH includes the NUL termination and if it is a significant char then TYPE_SIZE_UNIT should as well. Isn't a proper test to look at TREE_STRING_POINTER[TREE_STRING_LENGTH - 1] (for HOST_CHAR_BITS strings)? Relying on the type here looks somewhat fragile to me. Abstracting a string_cst_nul_terminated_p () helper would be a good idea I guess. I realize using strlen(TREE_STRING_POINTER) doesn't work because of embedded NULs. > Additionally not having oversize string constants produced > by the front ends, where the extra characters are good for nothing, > also helps to improve correctness. > > > Bernd. >
>> > The change to have all STRING_CSTs NUL terminated (but that NUL >> > termination not necessarily inclided in STRING_LENGTH) is a good >> > one. >> > >> > I'm not sure how we can reliably verify NUL termination after the >> > fact though and build_string already makes sure to NUL terminate >> > STRING_CSTs. So if GO strings are not NUL terminated then >> > the STRING_CSTs still are. >> >> The advantage is that there are less variations how string literals look >> like in the middle end. We will have a simple way to determine if >> a string literal is NUL terminated or not. And checking that property >> in varasm.c is exactly the right thing to do. >> >> String literals always have an array_type which may be shorter >> than TREE_STRING_LENGTH, but that chops off only exactly >> one wide character nul. Otherwise if the array_type is equal or larger, >> we know for sure the value is nul terminated. In the middle-end >> we can easily determine if a string is not NUL terminated by: >> >> compare_tree_int (TYPE_SIZE_UNIT (TREE_TYPE (init)), >> TREE_STRING_LENGTH (init)) < 0 >> >> I did use that already in my patch for pr86711. > > Hmm. How does that tell you whether the string is NUL terminated? > TREE_STRING_LENGTH includes the NUL termination and if it is > a significant char then TYPE_SIZE_UNIT should as well. I debugged that code a lot recently. static const char x[2] = "ab" gives a TREE_STRING_LENGTH of 3, the TREE_TYPE of that beast is an array type for char[2]. and TYPE_SIZE_UNIT = 2. An ordinary C string "ab" has TYPE_SIZE_UNIT(TREE_TYPE(x)) = 3. Of course with wide caracher strings the TREE_STING_LENGTH and TYPE_SIZE_UNIT of the ARRAY_TYPE are multiple of the used wide character type. So I would like to be able to assume that the STRING_CST objects are internally always generated properly by the front end. And that the ARRAY_TYPE of the string literal either has the same length than the TREE_STRING_LENGTH or if it is shorter, this is always exactly one (wide) character size less than TREE_STRING_LENGTH The idea is to use this property of string literals where needed, and check rigorously in varasm.c. Does that make sense? > > Isn't a proper test to look at TREE_STRING_POINTER[TREE_STRING_LENGTH - 1] > (for HOST_CHAR_BITS strings)? > There are also wide character strings, and all those test are broken everywhere for wide characters right now. Therefore checking the string constants at varasm.c revealed a lot of intersting things. I will post several patches in the afternoon. > Relying on the type here looks somewhat fragile to me. > > Abstracting a string_cst_nul_terminated_p () helper would be a good > idea I guess. Yes. indeed. > I realize using strlen(TREE_STRING_POINTER) doesn't work because > of embedded NULs. > >> Additionally not having oversize string constants produced >> by the front ends, where the extra characters are good for nothing, >> also helps to improve correctness. >>
On Wed, 1 Aug 2018, Bernd Edlinger wrote: > >> > The change to have all STRING_CSTs NUL terminated (but that NUL > >> > termination not necessarily inclided in STRING_LENGTH) is a good > >> > one. > >> > > >> > I'm not sure how we can reliably verify NUL termination after the > >> > fact though and build_string already makes sure to NUL terminate > >> > STRING_CSTs. So if GO strings are not NUL terminated then > >> > the STRING_CSTs still are. > >> > >> The advantage is that there are less variations how string literals look > >> like in the middle end. We will have a simple way to determine if > >> a string literal is NUL terminated or not. And checking that property > >> in varasm.c is exactly the right thing to do. > >> > >> String literals always have an array_type which may be shorter > >> than TREE_STRING_LENGTH, but that chops off only exactly > >> one wide character nul. Otherwise if the array_type is equal or larger, > >> we know for sure the value is nul terminated. In the middle-end > >> we can easily determine if a string is not NUL terminated by: > >> > >> compare_tree_int (TYPE_SIZE_UNIT (TREE_TYPE (init)), > >> TREE_STRING_LENGTH (init)) < 0 > >> > >> I did use that already in my patch for pr86711. > > > > Hmm. How does that tell you whether the string is NUL terminated? > > TREE_STRING_LENGTH includes the NUL termination and if it is > > a significant char then TYPE_SIZE_UNIT should as well. > > I debugged that code a lot recently. > static const char x[2] = "ab" > gives a TREE_STRING_LENGTH of 3, the TREE_TYPE of that > beast is an array type for char[2]. and TYPE_SIZE_UNIT = 2. Hmm. I think it would be nice if TREE_STRING_LENGTH would match char[2] and TYPE_SIZE_UNIT even if that is inconvenient for your check above. Because the '\0' doesn't belong to the string. Then build_string internally appends a '\0' outside of TREE_STRING_LENGTH. > An ordinary C string "ab" has TYPE_SIZE_UNIT(TREE_TYPE(x)) = 3. > > Of course with wide caracher strings the TREE_STING_LENGTH > and TYPE_SIZE_UNIT of the ARRAY_TYPE are multiple of > the used wide character type. Sure. > So I would like to be able to assume that the STRING_CST objects > are internally always generated properly by the front end. Yeah, I guess we need to define what "properly" is ;) > And that the ARRAY_TYPE of the string literal either has the > same length than the TREE_STRING_LENGTH or if it is shorter, > this is always exactly one (wide) character size less than TREE_STRING_LENGTH I think it should be always the same... > The idea is to use this property of string literals where needed, > and check rigorously in varasm.c. > > Does that make sense? So if it is not the same then the excess character needs to be a (wide) NUL in your model? ISTR your varasm.c patch didn't verify that. > > > > Isn't a proper test to look at TREE_STRING_POINTER[TREE_STRING_LENGTH - 1] > > (for HOST_CHAR_BITS strings)? > > > > There are also wide character strings, and all those test are broken everywhere > for wide characters right now. > > Therefore checking the string constants at varasm.c revealed a lot of intersting > things. > I will post several patches in the afternoon. > > > Relying on the type here looks somewhat fragile to me. > > > > Abstracting a string_cst_nul_terminated_p () helper would be a good > > idea I guess. > > Yes. indeed. > > > I realize using strlen(TREE_STRING_POINTER) doesn't work because > > of embedded NULs. > > > >> Additionally not having oversize string constants produced > >> by the front ends, where the extra characters are good for nothing, > >> also helps to improve correctness. > >> > >
On 08/01/18 11:29, Richard Biener wrote: > > Hmm. I think it would be nice if TREE_STRING_LENGTH would > match char[2] and TYPE_SIZE_UNIT even if that is inconvenient > for your check above. Because the '\0' doesn't belong to the > string. Then build_string internally appends a '\0' outside > of TREE_STRING_LENGTH. > Hmm. Yes, but the outside-0 byte is just one byte, not a wide character. There are STRING_CSTs which are not string literals, for instance attribute tags, Pragmas, asm constrants, etc. They use the '\0' outside, and have probably no TREE_TYPE. > >> So I would like to be able to assume that the STRING_CST objects >> are internally always generated properly by the front end. > > Yeah, I guess we need to define what "properly" is ;) > Yes. >> And that the ARRAY_TYPE of the string literal either has the >> same length than the TREE_STRING_LENGTH or if it is shorter, >> this is always exactly one (wide) character size less than TREE_STRING_LENGTH > > I think it should be always the same... > One could not differentiate between "\0" without zero-termination and "" with zero-termination, theoretically. We also have char x[100] = "ab"; that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100. Of course one could create it with a TREE_STRING_LENGTH = 100, but imagine char x[100000000000] = "ab" >> The idea is to use this property of string literals where needed, >> and check rigorously in varasm.c. >> >> Does that make sense? > > So if it is not the same then the excess character needs to be > a (wide) NUL in your model? ISTR your varasm.c patch didn't verify > that. > I think it does. Bernd.
On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: > > > > On 08/01/18 11:29, Richard Biener wrote: > > > > Hmm. I think it would be nice if TREE_STRING_LENGTH would > > match char[2] and TYPE_SIZE_UNIT even if that is inconvenient > > for your check above. Because the '\0' doesn't belong to the > > string. Then build_string internally appends a '\0' outside > > of TREE_STRING_LENGTH. > > > > Hmm. Yes, but the outside-0 byte is just one byte, not a wide > character. That could be fixed though (a wide 0 is just N 0s). Add a elsz = 1 parameter to build_string and allocate as many extra 0s as needed. There are STRING_CSTs which are not string literals, > for instance attribute tags, Pragmas, asm constrants, etc. > They use the '\0' outside, and have probably no TREE_TYPE. > > > > >> So I would like to be able to assume that the STRING_CST objects > >> are internally always generated properly by the front end. > > > > Yeah, I guess we need to define what "properly" is ;) > > > Yes. > > >> And that the ARRAY_TYPE of the string literal either has the > >> same length than the TREE_STRING_LENGTH or if it is shorter, > >> this is always exactly one (wide) character size less than TREE_STRING_LENGTH > > > > I think it should be always the same... > > > > One could not differentiate between "\0" without zero-termination > and "" with zero-termination, theoretically. Is that important? Doesn't the C standard say how to parse string literals? > We also have char x[100] = "ab"; > that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100. > Of course one could create it with a TREE_STRING_LENGTH = 100, > but imagine char x[100000000000] = "ab" The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I hope matches "ab" and not 'x'. If it matches 'x' then I'd rather have it unconditionally be [], thus incomplete (because the literals "size" depends on the context/LHS it is used on). > >> The idea is to use this property of string literals where needed, > >> and check rigorously in varasm.c. > >> > >> Does that make sense? > > > > So if it is not the same then the excess character needs to be > > a (wide) NUL in your model? ISTR your varasm.c patch didn't verify > > that. > > > > I think it does. > > > Bernd.
On 08/20/18 13:01, Richard Biener wrote: > On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: >> >> >> >> On 08/01/18 11:29, Richard Biener wrote: >>> >>> Hmm. I think it would be nice if TREE_STRING_LENGTH would >>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient >>> for your check above. Because the '\0' doesn't belong to the >>> string. Then build_string internally appends a '\0' outside >>> of TREE_STRING_LENGTH. >>> >> >> Hmm. Yes, but the outside-0 byte is just one byte, not a wide >> character. > > That could be fixed though (a wide 0 is just N 0s). Add a elsz = 1 > parameter to build_string and allocate as many extra 0s as needed. > > There are STRING_CSTs which are not string literals, >> for instance attribute tags, Pragmas, asm constrants, etc. >> They use the '\0' outside, and have probably no TREE_TYPE. >> >>> >>>> So I would like to be able to assume that the STRING_CST objects >>>> are internally always generated properly by the front end. >>> >>> Yeah, I guess we need to define what "properly" is ;) >>> >> Yes. >> >>>> And that the ARRAY_TYPE of the string literal either has the >>>> same length than the TREE_STRING_LENGTH or if it is shorter, >>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH >>> >>> I think it should be always the same... >>> >> >> One could not differentiate between "\0" without zero-termination >> and "" with zero-termination, theoretically. > > Is that important? Doesn't the C standard say how to parse string literals? > >> We also have char x[100] = "ab"; >> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100. >> Of course one could create it with a TREE_STRING_LENGTH = 100, >> but imagine char x[100000000000] = "ab" > > The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I > hope matches "ab" and not 'x'. If it matches 'x' then I'd rather have it > unconditionally be [], thus incomplete (because the literals "size" depends > on the context/LHS it is used on). > Sorry, but I must say, it is not at all like that. If I compile x.c: const char x[100] = "ab"; and set a breakpoint at output_constant: Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256, reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821 4821 if (size == 0 || flag_syntax_only) (gdb) p size $1 = 100 (gdb) call debug(exp) "ab" (gdb) p *exp $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1, (gdb) p exp->typed.type->type_common.size_unit $5 = (tree) 0x7ffff6ff9d80 (gdb) call debug(exp->typed.type->type_common.size_unit) 100 (gdb) p exp->string.length $6 = 3 (gdb) p exp->string.str[0] $8 = 97 'a' (gdb) p exp->string.str[1] $9 = 98 'b' (gdb) p exp->string.str[2] $10 = 0 '\000' (gdb) p exp->string.str[3] $11 = 0 '\000' This is an important property of string_cst objects, that is used in c_strlen: It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH is guaranteed to be zero up to the type size. I would not have spent one thought on implementing an optimization like that, but that's how it is right now. All I want to do, is make sure that all string constants have the same look and feel in the middle-end, and restrict the variations that are allowed by the current implementation. Bernd. >>>> The idea is to use this property of string literals where needed, >>>> and check rigorously in varasm.c. >>>> >>>> Does that make sense? >>> >>> So if it is not the same then the excess character needs to be >>> a (wide) NUL in your model? ISTR your varasm.c patch didn't verify >>> that. >>> >> >> I think it does. >> >> >> Bernd.
On Mon, 20 Aug 2018, Bernd Edlinger wrote: > On 08/20/18 13:01, Richard Biener wrote: > > On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: > >> > >> > >> > >> On 08/01/18 11:29, Richard Biener wrote: > >>> > >>> Hmm. I think it would be nice if TREE_STRING_LENGTH would > >>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient > >>> for your check above. Because the '\0' doesn't belong to the > >>> string. Then build_string internally appends a '\0' outside > >>> of TREE_STRING_LENGTH. > >>> > >> > >> Hmm. Yes, but the outside-0 byte is just one byte, not a wide > >> character. > > > > That could be fixed though (a wide 0 is just N 0s). Add a elsz = 1 > > parameter to build_string and allocate as many extra 0s as needed. > > > > There are STRING_CSTs which are not string literals, > >> for instance attribute tags, Pragmas, asm constrants, etc. > >> They use the '\0' outside, and have probably no TREE_TYPE. > >> > >>> > >>>> So I would like to be able to assume that the STRING_CST objects > >>>> are internally always generated properly by the front end. > >>> > >>> Yeah, I guess we need to define what "properly" is ;) > >>> > >> Yes. > >> > >>>> And that the ARRAY_TYPE of the string literal either has the > >>>> same length than the TREE_STRING_LENGTH or if it is shorter, > >>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH > >>> > >>> I think it should be always the same... > >>> > >> > >> One could not differentiate between "\0" without zero-termination > >> and "" with zero-termination, theoretically. > > > > Is that important? Doesn't the C standard say how to parse string literals? > > > >> We also have char x[100] = "ab"; > >> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100. > >> Of course one could create it with a TREE_STRING_LENGTH = 100, > >> but imagine char x[100000000000] = "ab" > > > > The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I > > hope matches "ab" and not 'x'. If it matches 'x' then I'd rather have it > > unconditionally be [], thus incomplete (because the literals "size" depends > > on the context/LHS it is used on). > > > > Sorry, but I must say, it is not at all like that. > > If I compile x.c: > const char x[100] = "ab"; > > and set a breakpoint at output_constant: > > Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256, > reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821 > 4821 if (size == 0 || flag_syntax_only) > (gdb) p size > $1 = 100 > (gdb) call debug(exp) > "ab" > (gdb) p *exp > $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1, > (gdb) p exp->typed.type->type_common.size_unit > $5 = (tree) 0x7ffff6ff9d80 > (gdb) call debug(exp->typed.type->type_common.size_unit) > 100 > (gdb) p exp->string.length > $6 = 3 > (gdb) p exp->string.str[0] > $8 = 97 'a' > (gdb) p exp->string.str[1] > $9 = 98 'b' > (gdb) p exp->string.str[2] > $10 = 0 '\000' > (gdb) p exp->string.str[3] > $11 = 0 '\000' > > > This is an important property of string_cst objects, that is used in c_strlen: > > It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH > is guaranteed to be zero up to the type size. > > I would not have spent one thought on implementing an optimization like that, > but that's how it is right now. Huh. So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka they have zero-padding up to its type size. I don't see this documented anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0" with appropriate TYPE_SIZE. This is also a relatively new thing on trunk (coming out of the added mem_size parameter of string_constant). That this looks at the STRING_CST type like if (TREE_CODE (array) == STRING_CST) { *ptr_offset = fold_convert (sizetype, offset); if (mem_size) *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); return array; I'd call simply a bug. As said, interpretation of STRING_CSTs should depend on the context. And for char a[4] = "abc"; char b[5] = "abc"; we should better be able to share STRING_CSTs [you can see LTO sharing the nodes if you do b[4] but not when b[5] I suppose]. > All I want to do, is make sure that all string constants have the same look and feel > in the middle-end, and restrict the variations that are allowed by the current > implementation. Sure, I understand that. But I'd like to simplify things and not add complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide whether sth is 0-terminated. Richard. > > Bernd. > > > >>>> The idea is to use this property of string literals where needed, > >>>> and check rigorously in varasm.c. > >>>> > >>>> Does that make sense? > >>> > >>> So if it is not the same then the excess character needs to be > >>> a (wide) NUL in your model? ISTR your varasm.c patch didn't verify > >>> that. > >>> > >> > >> I think it does. > >> > >> > >> Bernd. >
On 08/20/18 15:19, Richard Biener wrote: > On Mon, 20 Aug 2018, Bernd Edlinger wrote: > >> On 08/20/18 13:01, Richard Biener wrote: >>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: >>>> >>>> >>>> >>>> On 08/01/18 11:29, Richard Biener wrote: >>>>> >>>>> Hmm. I think it would be nice if TREE_STRING_LENGTH would >>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient >>>>> for your check above. Because the '\0' doesn't belong to the >>>>> string. Then build_string internally appends a '\0' outside >>>>> of TREE_STRING_LENGTH. >>>>> >>>> >>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide >>>> character. >>> >>> That could be fixed though (a wide 0 is just N 0s). Add a elsz = 1 >>> parameter to build_string and allocate as many extra 0s as needed. >>> >>> There are STRING_CSTs which are not string literals, >>>> for instance attribute tags, Pragmas, asm constrants, etc. >>>> They use the '\0' outside, and have probably no TREE_TYPE. >>>> >>>>> >>>>>> So I would like to be able to assume that the STRING_CST objects >>>>>> are internally always generated properly by the front end. >>>>> >>>>> Yeah, I guess we need to define what "properly" is ;) >>>>> >>>> Yes. >>>> >>>>>> And that the ARRAY_TYPE of the string literal either has the >>>>>> same length than the TREE_STRING_LENGTH or if it is shorter, >>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH >>>>> >>>>> I think it should be always the same... >>>>> >>>> >>>> One could not differentiate between "\0" without zero-termination >>>> and "" with zero-termination, theoretically. >>> >>> Is that important? Doesn't the C standard say how to parse string literals? >>> >>>> We also have char x[100] = "ab"; >>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100. >>>> Of course one could create it with a TREE_STRING_LENGTH = 100, >>>> but imagine char x[100000000000] = "ab" >>> >>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I >>> hope matches "ab" and not 'x'. If it matches 'x' then I'd rather have it >>> unconditionally be [], thus incomplete (because the literals "size" depends >>> on the context/LHS it is used on). >>> >> >> Sorry, but I must say, it is not at all like that. >> >> If I compile x.c: >> const char x[100] = "ab"; >> >> and set a breakpoint at output_constant: >> >> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256, >> reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821 >> 4821 if (size == 0 || flag_syntax_only) >> (gdb) p size >> $1 = 100 >> (gdb) call debug(exp) >> "ab" >> (gdb) p *exp >> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1, >> (gdb) p exp->typed.type->type_common.size_unit >> $5 = (tree) 0x7ffff6ff9d80 >> (gdb) call debug(exp->typed.type->type_common.size_unit) >> 100 >> (gdb) p exp->string.length >> $6 = 3 >> (gdb) p exp->string.str[0] >> $8 = 97 'a' >> (gdb) p exp->string.str[1] >> $9 = 98 'b' >> (gdb) p exp->string.str[2] >> $10 = 0 '\000' >> (gdb) p exp->string.str[3] >> $11 = 0 '\000' >> >> >> This is an important property of string_cst objects, that is used in c_strlen: >> >> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH >> is guaranteed to be zero up to the type size. >> >> I would not have spent one thought on implementing an optimization like that, >> but that's how it is right now. > > Huh. So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka > they have zero-padding up to its type size. I don't see this documented > anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0" > with appropriate TYPE_SIZE. > > This is also a relatively new thing on trunk (coming out of the added > mem_size parameter of string_constant). That this looks at the STRING_CST > type like > > if (TREE_CODE (array) == STRING_CST) > { > *ptr_offset = fold_convert (sizetype, offset); > if (mem_size) > *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); > return array; > > I'd call simply a bug. As said, interpretation of STRING_CSTs should > depend on the context. And for > This use of the TYPE_SIZE_UNIT was pre-existing to r263607 before that (but not in the gcc-8-branch) we had this in c_strlen: HOST_WIDE_INT maxelts = strelts; tree type = TREE_TYPE (src); if (tree size = TYPE_SIZE_UNIT (type)) if (tree_fits_shwi_p (size)) { maxelts = tree_to_uhwi (size); maxelts = maxelts / eltsize - 1; } which I moved to string_constant and return the result through memsize. It seems to be working somehow, and I'd bet removing that will immediately break twenty or thirty of the strlenopt tests. Obviously the tree string objects allow way too much variations, and it has to be restricted in some way or another. In the moment my approach is: look at what most string constants do and add assertions to ensure that there are no exceptions. > char a[4] = "abc"; > char b[5] = "abc"; > > we should better be able to share STRING_CSTs [you can see LTO > sharing the nodes if you do b[4] but not when b[5] I suppose]. > >> All I want to do, is make sure that all string constants have the same look and feel >> in the middle-end, and restrict the variations that are allowed by the current >> implementation. > > Sure, I understand that. But I'd like to simplify things and not add > complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide > whether sth is 0-terminated. > This is not only about 0-terminated. (*) It is also about when you get a STRING_CST with a TREE_STRING_LENGTH, there are places in the middle-end that assume that the object contains _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT. Those I want to protect. Bernd. *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero, the string is only returned when it is properly zero terminated.", but maybe I should have written: "If MEM_SIZE is zero, the string is only returned when the storage size of the string object is at least TREE_STRING_LENGTH." That's at least exactly what the check does. > Richard. > >> >> Bernd. >> >> >>>>>> The idea is to use this property of string literals where needed, >>>>>> and check rigorously in varasm.c. >>>>>> >>>>>> Does that make sense? >>>>> >>>>> So if it is not the same then the excess character needs to be >>>>> a (wide) NUL in your model? ISTR your varasm.c patch didn't verify >>>>> that. >>>>> >>>> >>>> I think it does. >>>> >>>> >>>> Bernd. >> >
On 08/20/18 17:59, Bernd Edlinger wrote: > On 08/20/18 15:19, Richard Biener wrote: >> On Mon, 20 Aug 2018, Bernd Edlinger wrote: >> >>> On 08/20/18 13:01, Richard Biener wrote: >>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: >>>>> >>>>> >>>>> >>>>> On 08/01/18 11:29, Richard Biener wrote: >>>>>> >>>>>> Hmm. I think it would be nice if TREE_STRING_LENGTH would >>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient >>>>>> for your check above. Because the '\0' doesn't belong to the >>>>>> string. Then build_string internally appends a '\0' outside >>>>>> of TREE_STRING_LENGTH. >>>>>> >>>>> >>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide >>>>> character. >>>> >>>> That could be fixed though (a wide 0 is just N 0s). Add a elsz = 1 >>>> parameter to build_string and allocate as many extra 0s as needed. >>>> >>>>    There are STRING_CSTs which are not string literals, >>>>> for instance attribute tags, Pragmas, asm constrants, etc. >>>>> They use the '\0' outside, and have probably no TREE_TYPE. >>>>> >>>>>> >>>>>>> So I would like to be able to assume that the STRING_CST objects >>>>>>> are internally always generated properly by the front end. >>>>>> >>>>>> Yeah, I guess we need to define what "properly" is ;) >>>>>> >>>>> Yes. >>>>> >>>>>>> And that the ARRAY_TYPE of the string literal either has the >>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter, >>>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH >>>>>> >>>>>> I think it should be always the same... >>>>>> >>>>> >>>>> One could not differentiate between "\0" without zero-termination >>>>> and "" with zero-termination, theoretically. >>>> >>>> Is that important? Doesn't the C standard say how to parse string literals? >>>> >>>>> We also have char x[100] = "ab"; >>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100. >>>>> Of course one could create it with a TREE_STRING_LENGTH = 100, >>>>> but imagine char x[100000000000] = "ab" >>>> >>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I >>>> hope matches "ab" and not 'x'. If it matches 'x' then I'd rather have it >>>> unconditionally be [], thus incomplete (because the literals "size" depends >>>> on the context/LHS it is used on). >>>> >>> >>> Sorry, but I must say, it is not at all like that. >>> >>> If I compile x.c: >>> const char x[100] = "ab"; >>> >>> and set a breakpoint at output_constant: >>> >>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256, >>>      reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821 >>> 4821     if (size == 0 || flag_syntax_only) >>> (gdb) p size >>> $1 = 100 >>> (gdb) call debug(exp) >>> "ab" >>> (gdb) p *exp >>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1, >>> (gdb) p exp->typed.type->type_common.size_unit >>> $5 = (tree) 0x7ffff6ff9d80 >>> (gdb) call debug(exp->typed.type->type_common.size_unit) >>> 100 >>> (gdb) p exp->string.length >>> $6 = 3 >>> (gdb) p exp->string.str[0] >>> $8 = 97 'a' >>> (gdb) p exp->string.str[1] >>> $9 = 98 'b' >>> (gdb) p exp->string.str[2] >>> $10 = 0 '\000' >>> (gdb) p exp->string.str[3] >>> $11 = 0 '\000' >>> >>> >>> This is an important property of string_cst objects, that is used in c_strlen: >>> >>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH >>> is guaranteed to be zero up to the type size. >>> >>> I would not have spent one thought on implementing an optimization like that, >>> but that's how it is right now. >> >> Huh. So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka >> they have zero-padding up to its type size. I don't see this documented >> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0" >> with appropriate TYPE_SIZE. >> >> This is also a relatively new thing on trunk (coming out of the added >> mem_size parameter of string_constant). That this looks at the STRING_CST >> type like >> >>   if (TREE_CODE (array) == STRING_CST) >>     { >>       *ptr_offset = fold_convert (sizetype, offset); >>       if (mem_size) >>         *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); >>       return array; >> >> I'd call simply a bug. As said, interpretation of STRING_CSTs should >> depend on the context. And for >> > > This use of the TYPE_SIZE_UNIT was pre-existing to r263607 > before that (but not in the gcc-8-branch) we had this in c_strlen: > >  HOST_WIDE_INT maxelts = strelts; >  tree type = TREE_TYPE (src); >  if (tree size = TYPE_SIZE_UNIT (type)) >    if (tree_fits_shwi_p (size)) >      { >       maxelts = tree_to_uhwi (size); >       maxelts = maxelts / eltsize - 1; >      } > > which I moved to string_constant and return the result through memsize. > > It seems to be working somehow, and I'd bet removing that will immediately > break twenty or thirty of the strlenopt tests. > > Obviously the tree string objects allow way too much variations, > and it has to be restricted in some way or another. > > In the moment my approach is: look at what most string constants do > and add assertions to ensure that there are no exceptions. > > I found some more places where TYPE_SIZE_UNIT (TREE_TYPE (str)) and TREE_STRING_LENGTH (str) are used together, in some very old code and clearly Martin is not to blame for them: varasm.c: static HOST_WIDE_INT get_constant_size (tree exp) { HOST_WIDE_INT size; size = int_size_in_bytes (TREE_TYPE (exp)); if (TREE_CODE (exp) == STRING_CST) size = MAX (TREE_STRING_LENGTH (exp), size); return size; } mergeable_string_section (tree decl ATTRIBUTE_UNUSED, unsigned HOST_WIDE_INT align ATTRIBUTE_UNUSED, unsigned int flags ATTRIBUTE_UNUSED) { HOST_WIDE_INT len; if (HAVE_GAS_SHF_MERGE && flag_merge_constants && TREE_CODE (decl) == STRING_CST && TREE_CODE (TREE_TYPE (decl)) == ARRAY_TYPE && align <= 256 && (len = int_size_in_bytes (TREE_TYPE (decl))) > 0 && TREE_STRING_LENGTH (decl) >= len) So, no sorry, I don't see how to change the STRING_CST from using the TREE_TYPE in that way. Bernd. >> char a[4] = "abc"; >> char b[5] = "abc"; >> >> we should better be able to share STRING_CSTs [you can see LTO >> sharing the nodes if you do b[4] but not when b[5] I suppose]. >> >>> All I want to do, is make sure that all string constants have the same look and feel >>> in the middle-end, and restrict the variations that are allowed by the current >>> implementation. >> >> Sure, I understand that. But I'd like to simplify things and not add >> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide >> whether sth is 0-terminated. >> > > This is not only about 0-terminated. (*) > > It is also about when you get a STRING_CST with a TREE_STRING_LENGTH, > there are places in the middle-end that assume that the object contains > _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT. > Those I want to protect. > > Bernd. > > > *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero, > the string is only returned when it is properly zero terminated.", > but maybe I should have written: > "If MEM_SIZE is zero, the string is only returned when the storage size > of the string object is at least TREE_STRING_LENGTH." > That's at least exactly what the check does. > >> Richard. >> >>> >>> Bernd. >>> >>> >>>>>>> The idea is to use this property of string literals where needed, >>>>>>> and check rigorously in varasm.c. >>>>>>> >>>>>>> Does that make sense? >>>>>> >>>>>> So if it is not the same then the excess character needs to be >>>>>> a (wide) NUL in your model? ISTR your varasm.c patch didn't verify >>>>>> that. >>>>>> >>>>> >>>>> I think it does. >>>>> >>>>> >>>>> Bernd. >>> >>
On Mon, 20 Aug 2018, Bernd Edlinger wrote: > On 08/20/18 15:19, Richard Biener wrote: > > On Mon, 20 Aug 2018, Bernd Edlinger wrote: > > > >> On 08/20/18 13:01, Richard Biener wrote: > >>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: > >>>> > >>>> > >>>> > >>>> On 08/01/18 11:29, Richard Biener wrote: > >>>>> > >>>>> Hmm. I think it would be nice if TREE_STRING_LENGTH would > >>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient > >>>>> for your check above. Because the '\0' doesn't belong to the > >>>>> string. Then build_string internally appends a '\0' outside > >>>>> of TREE_STRING_LENGTH. > >>>>> > >>>> > >>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide > >>>> character. > >>> > >>> That could be fixed though (a wide 0 is just N 0s). Add a elsz = 1 > >>> parameter to build_string and allocate as many extra 0s as needed. > >>> > >>> There are STRING_CSTs which are not string literals, > >>>> for instance attribute tags, Pragmas, asm constrants, etc. > >>>> They use the '\0' outside, and have probably no TREE_TYPE. > >>>> > >>>>> > >>>>>> So I would like to be able to assume that the STRING_CST objects > >>>>>> are internally always generated properly by the front end. > >>>>> > >>>>> Yeah, I guess we need to define what "properly" is ;) > >>>>> > >>>> Yes. > >>>> > >>>>>> And that the ARRAY_TYPE of the string literal either has the > >>>>>> same length than the TREE_STRING_LENGTH or if it is shorter, > >>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH > >>>>> > >>>>> I think it should be always the same... > >>>>> > >>>> > >>>> One could not differentiate between "\0" without zero-termination > >>>> and "" with zero-termination, theoretically. > >>> > >>> Is that important? Doesn't the C standard say how to parse string literals? > >>> > >>>> We also have char x[100] = "ab"; > >>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100. > >>>> Of course one could create it with a TREE_STRING_LENGTH = 100, > >>>> but imagine char x[100000000000] = "ab" > >>> > >>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I > >>> hope matches "ab" and not 'x'. If it matches 'x' then I'd rather have it > >>> unconditionally be [], thus incomplete (because the literals "size" depends > >>> on the context/LHS it is used on). > >>> > >> > >> Sorry, but I must say, it is not at all like that. > >> > >> If I compile x.c: > >> const char x[100] = "ab"; > >> > >> and set a breakpoint at output_constant: > >> > >> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256, > >> reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821 > >> 4821 if (size == 0 || flag_syntax_only) > >> (gdb) p size > >> $1 = 100 > >> (gdb) call debug(exp) > >> "ab" > >> (gdb) p *exp > >> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1, > >> (gdb) p exp->typed.type->type_common.size_unit > >> $5 = (tree) 0x7ffff6ff9d80 > >> (gdb) call debug(exp->typed.type->type_common.size_unit) > >> 100 > >> (gdb) p exp->string.length > >> $6 = 3 > >> (gdb) p exp->string.str[0] > >> $8 = 97 'a' > >> (gdb) p exp->string.str[1] > >> $9 = 98 'b' > >> (gdb) p exp->string.str[2] > >> $10 = 0 '\000' > >> (gdb) p exp->string.str[3] > >> $11 = 0 '\000' > >> > >> > >> This is an important property of string_cst objects, that is used in c_strlen: > >> > >> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH > >> is guaranteed to be zero up to the type size. > >> > >> I would not have spent one thought on implementing an optimization like that, > >> but that's how it is right now. > > > > Huh. So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka > > they have zero-padding up to its type size. I don't see this documented > > anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0" > > with appropriate TYPE_SIZE. > > > > This is also a relatively new thing on trunk (coming out of the added > > mem_size parameter of string_constant). That this looks at the STRING_CST > > type like > > > > if (TREE_CODE (array) == STRING_CST) > > { > > *ptr_offset = fold_convert (sizetype, offset); > > if (mem_size) > > *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); > > return array; > > > > I'd call simply a bug. As said, interpretation of STRING_CSTs should > > depend on the context. And for > > > > This use of the TYPE_SIZE_UNIT was pre-existing to r263607 > before that (but not in the gcc-8-branch) we had this in c_strlen: > > HOST_WIDE_INT maxelts = strelts; > tree type = TREE_TYPE (src); > if (tree size = TYPE_SIZE_UNIT (type)) > if (tree_fits_shwi_p (size)) > { > maxelts = tree_to_uhwi (size); > maxelts = maxelts / eltsize - 1; > } > > which I moved to string_constant and return the result through memsize. > > It seems to be working somehow, and I'd bet removing that will immediately > break twenty or thirty of the strlenopt tests. > > Obviously the tree string objects allow way too much variations, > and it has to be restricted in some way or another. > > In the moment my approach is: look at what most string constants do > and add assertions to ensure that there are no exceptions. > > > > char a[4] = "abc"; > > char b[5] = "abc"; > > > > we should better be able to share STRING_CSTs [you can see LTO > > sharing the nodes if you do b[4] but not when b[5] I suppose]. > > > >> All I want to do, is make sure that all string constants have the same look and feel > >> in the middle-end, and restrict the variations that are allowed by the current > >> implementation. > > > > Sure, I understand that. But I'd like to simplify things and not add > > complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide > > whether sth is 0-terminated. > > > > This is not only about 0-terminated. (*) > > It is also about when you get a STRING_CST with a TREE_STRING_LENGTH, > there are places in the middle-end that assume that the object contains > _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT. > Those I want to protect. Well, but string_constant handles &STRING_CST just fine but in that context there's no "object" we look at. IMHO whenever string_constant ends up with a DECL, looking at ctor_for_folding and we end up with a STRING_CST that doesn't fit in the DECL we looked at we have a bug (non-truncated STRING_CST) and either should do the truncation or simply return NULL. So we should look at DECL_SIZE_UNIT and compare that with TREE_STRING_LENGTH. Otherwise you are relying on TYPE_SIZE_UNIT of the STRING_CST being "correct" (fitting the decl context). > Bernd. > > > *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero, > the string is only returned when it is properly zero terminated.", > but maybe I should have written: > "If MEM_SIZE is zero, the string is only returned when the storage size > of the string object is at least TREE_STRING_LENGTH." > That's at least exactly what the check does. Well, as said above if (TREE_CODE (array) == STRING_CST) { *ptr_offset = fold_convert (sizetype, offset); if (mem_size) *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); return array; } simply assumes the "storage size" of a STRING_CST is determined by its TYPE_SIZE_UNIT. That may be true as you noted in the folloup, but clearly in the above case there's nothing to protect? And in the case we pulled the STRING_CST from some DECL_INITIAL it doesn't protect from overflow of the FIELD_DECL unless frontends never generate "wrong" STRING_CSTs? Btw, get_constant_size / mergeable_string_section suggsts that we may view STRING_CST as general target-encoded byte blob. That may be useful to compress CONSTRUCTORs memory use. It looks like mergeable_string_section wrongly would reject a char[] typed string because int_size_in_bytes returns -1 for incomplete types. I still believe using char[] for most STRING_CSTs generated by FEs would be a good thing for IL memory use. Shouldn't the mem_size initializations use sth like get_constant_size as well? Also if (mem_size) *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (init)); else if (compare_tree_int (array_size, length + 1) < 0) return NULL_TREE; the latter check doesn't seem to honor 'offset' and *mem_size is again looking at the STRING_CST, not at the FIELD_DECL or wherever we got the STRING_CST from. Richard. > > > Richard. > > > >> > >> Bernd. > >> > >> > >>>>>> The idea is to use this property of string literals where needed, > >>>>>> and check rigorously in varasm.c. > >>>>>> > >>>>>> Does that make sense? > >>>>> > >>>>> So if it is not the same then the excess character needs to be > >>>>> a (wide) NUL in your model? ISTR your varasm.c patch didn't verify > >>>>> that. > >>>>> > >>>> > >>>> I think it does. > >>>> > >>>> > >>>> Bernd. > >> > > >
On 08/21/18 10:33, Richard Biener wrote: > On Mon, 20 Aug 2018, Bernd Edlinger wrote: > >> On 08/20/18 15:19, Richard Biener wrote: >>> On Mon, 20 Aug 2018, Bernd Edlinger wrote: >>> >>>> On 08/20/18 13:01, Richard Biener wrote: >>>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On 08/01/18 11:29, Richard Biener wrote: >>>>>>> >>>>>>> Hmm. I think it would be nice if TREE_STRING_LENGTH would >>>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient >>>>>>> for your check above. Because the '\0' doesn't belong to the >>>>>>> string. Then build_string internally appends a '\0' outside >>>>>>> of TREE_STRING_LENGTH. >>>>>>> >>>>>> >>>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide >>>>>> character. >>>>> >>>>> That could be fixed though (a wide 0 is just N 0s). Add a elsz = 1 >>>>> parameter to build_string and allocate as many extra 0s as needed. >>>>> >>>>> There are STRING_CSTs which are not string literals, >>>>>> for instance attribute tags, Pragmas, asm constrants, etc. >>>>>> They use the '\0' outside, and have probably no TREE_TYPE. >>>>>> >>>>>>> >>>>>>>> So I would like to be able to assume that the STRING_CST objects >>>>>>>> are internally always generated properly by the front end. >>>>>>> >>>>>>> Yeah, I guess we need to define what "properly" is ;) >>>>>>> >>>>>> Yes. >>>>>> >>>>>>>> And that the ARRAY_TYPE of the string literal either has the >>>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter, >>>>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH >>>>>>> >>>>>>> I think it should be always the same... >>>>>>> >>>>>> >>>>>> One could not differentiate between "\0" without zero-termination >>>>>> and "" with zero-termination, theoretically. >>>>> >>>>> Is that important? Doesn't the C standard say how to parse string literals? >>>>> >>>>>> We also have char x[100] = "ab"; >>>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100. >>>>>> Of course one could create it with a TREE_STRING_LENGTH = 100, >>>>>> but imagine char x[100000000000] = "ab" >>>>> >>>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I >>>>> hope matches "ab" and not 'x'. If it matches 'x' then I'd rather have it >>>>> unconditionally be [], thus incomplete (because the literals "size" depends >>>>> on the context/LHS it is used on). >>>>> >>>> >>>> Sorry, but I must say, it is not at all like that. >>>> >>>> If I compile x.c: >>>> const char x[100] = "ab"; >>>> >>>> and set a breakpoint at output_constant: >>>> >>>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256, >>>> reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821 >>>> 4821 if (size == 0 || flag_syntax_only) >>>> (gdb) p size >>>> $1 = 100 >>>> (gdb) call debug(exp) >>>> "ab" >>>> (gdb) p *exp >>>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1, >>>> (gdb) p exp->typed.type->type_common.size_unit >>>> $5 = (tree) 0x7ffff6ff9d80 >>>> (gdb) call debug(exp->typed.type->type_common.size_unit) >>>> 100 >>>> (gdb) p exp->string.length >>>> $6 = 3 >>>> (gdb) p exp->string.str[0] >>>> $8 = 97 'a' >>>> (gdb) p exp->string.str[1] >>>> $9 = 98 'b' >>>> (gdb) p exp->string.str[2] >>>> $10 = 0 '\000' >>>> (gdb) p exp->string.str[3] >>>> $11 = 0 '\000' >>>> >>>> >>>> This is an important property of string_cst objects, that is used in c_strlen: >>>> >>>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH >>>> is guaranteed to be zero up to the type size. >>>> >>>> I would not have spent one thought on implementing an optimization like that, >>>> but that's how it is right now. >>> >>> Huh. So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka >>> they have zero-padding up to its type size. I don't see this documented >>> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0" >>> with appropriate TYPE_SIZE. >>> >>> This is also a relatively new thing on trunk (coming out of the added >>> mem_size parameter of string_constant). That this looks at the STRING_CST >>> type like >>> >>> if (TREE_CODE (array) == STRING_CST) >>> { >>> *ptr_offset = fold_convert (sizetype, offset); >>> if (mem_size) >>> *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); >>> return array; >>> >>> I'd call simply a bug. As said, interpretation of STRING_CSTs should >>> depend on the context. And for >>> >> >> This use of the TYPE_SIZE_UNIT was pre-existing to r263607 >> before that (but not in the gcc-8-branch) we had this in c_strlen: >> >> HOST_WIDE_INT maxelts = strelts; >> tree type = TREE_TYPE (src); >> if (tree size = TYPE_SIZE_UNIT (type)) >> if (tree_fits_shwi_p (size)) >> { >> maxelts = tree_to_uhwi (size); >> maxelts = maxelts / eltsize - 1; >> } >> >> which I moved to string_constant and return the result through memsize. >> >> It seems to be working somehow, and I'd bet removing that will immediately >> break twenty or thirty of the strlenopt tests. >> >> Obviously the tree string objects allow way too much variations, >> and it has to be restricted in some way or another. >> >> In the moment my approach is: look at what most string constants do >> and add assertions to ensure that there are no exceptions. >> >> >>> char a[4] = "abc"; >>> char b[5] = "abc"; >>> >>> we should better be able to share STRING_CSTs [you can see LTO >>> sharing the nodes if you do b[4] but not when b[5] I suppose]. >>> >>>> All I want to do, is make sure that all string constants have the same look and feel >>>> in the middle-end, and restrict the variations that are allowed by the current >>>> implementation. >>> >>> Sure, I understand that. But I'd like to simplify things and not add >>> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide >>> whether sth is 0-terminated. >>> >> >> This is not only about 0-terminated. (*) >> >> It is also about when you get a STRING_CST with a TREE_STRING_LENGTH, >> there are places in the middle-end that assume that the object contains >> _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT. >> Those I want to protect. > > Well, but string_constant handles &STRING_CST just fine but in that > context there's no "object" we look at. > > IMHO whenever string_constant ends up with a DECL, looking at > ctor_for_folding and we end up with a STRING_CST that doesn't fit > in the DECL we looked at we have a bug (non-truncated STRING_CST) > and either should do the truncation or simply return NULL. > > So we should look at DECL_SIZE_UNIT and compare that with > TREE_STRING_LENGTH. Otherwise you are relying on TYPE_SIZE_UNIT > of the STRING_CST being "correct" (fitting the decl context). > >> Bernd. >> >> >> *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero, >> the string is only returned when it is properly zero terminated.", >> but maybe I should have written: >> "If MEM_SIZE is zero, the string is only returned when the storage size >> of the string object is at least TREE_STRING_LENGTH." >> That's at least exactly what the check does. > > Well, as said above > > if (TREE_CODE (array) == STRING_CST) > { > *ptr_offset = fold_convert (sizetype, offset); > if (mem_size) > *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); > return array; > } > > simply assumes the "storage size" of a STRING_CST is determined > by its TYPE_SIZE_UNIT. That may be true as you noted in the folloup, > but clearly in the above case there's nothing to protect? And in > the case we pulled the STRING_CST from some DECL_INITIAL it doesn't > protect from overflow of the FIELD_DECL unless frontends never > generate "wrong" STRING_CSTs? > Hmm, I digged some more in varasm.c Normal STRING_CST use get_constant_size to allocate MAX(TYPE_SIZE_UNIT, TREE_STRING_LENGTH) thus they can have excess zero bytes. while STRING_CST that are in DECL_INITIALIZERs are shrinked to DECL_SIZE_UNIT. At least that difference looks unnecessary to me. But compare_constant does not compare the TYPE_SIZE_UNIT: case STRING_CST: if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2))) return 0; return (TREE_STRING_LENGTH (t1) == TREE_STRING_LENGTH (t2) && ! memcmp (TREE_STRING_POINTER (t1), TREE_STRING_POINTER (t2), TREE_STRING_LENGTH (t1))); This looks now like a bug. > Btw, get_constant_size / mergeable_string_section suggsts that > we may view STRING_CST as general target-encoded byte blob. > That may be useful to compress CONSTRUCTORs memory use. > > It looks like mergeable_string_section wrongly would reject > a char[] typed string because int_size_in_bytes returns -1 > for incomplete types. I still believe using char[] for most > STRING_CSTs generated by FEs would be a good thing for > IL memory use. Shouldn't the mem_size initializations use > sth like get_constant_size as well? > I think incomplete types exist only in structs with variable length: struct { int x; char y[]; } s = { 1, "test" }; this is no candidate for a mergeable string. but char y[] = "test"; is fixed by the FE to: char y[5] = "test"; So that does not make problems with mergeable_string_section. > Also > > if (mem_size) > *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (init)); > else if (compare_tree_int (array_size, length + 1) < 0) > return NULL_TREE; > > the latter check doesn't seem to honor 'offset' and *mem_size > is again looking at the STRING_CST, not at the FIELD_DECL or > wherever we got the STRING_CST from. > the caller compares offset with *mem_size, but we do not have the FIELD_DECL at hand here (Or I did not know where it was). I'd rather fail the cases when the TYPE_SIZE_UNIT is null, or something non-constant. All those are variations of vla with initializer and similar. Once the STRING_CST has a type, I would like to use it, when it doesn't the whole thing is probably not worth it anyway. Bernd.
On Wed, Aug 22, 2018 at 6:57 AM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: > > On 08/21/18 10:33, Richard Biener wrote: > > On Mon, 20 Aug 2018, Bernd Edlinger wrote: > > > >> On 08/20/18 15:19, Richard Biener wrote: > >>> On Mon, 20 Aug 2018, Bernd Edlinger wrote: > >>> > >>>> On 08/20/18 13:01, Richard Biener wrote: > >>>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> On 08/01/18 11:29, Richard Biener wrote: > >>>>>>> > >>>>>>> Hmm. I think it would be nice if TREE_STRING_LENGTH would > >>>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient > >>>>>>> for your check above. Because the '\0' doesn't belong to the > >>>>>>> string. Then build_string internally appends a '\0' outside > >>>>>>> of TREE_STRING_LENGTH. > >>>>>>> > >>>>>> > >>>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide > >>>>>> character. > >>>>> > >>>>> That could be fixed though (a wide 0 is just N 0s). Add a elsz = 1 > >>>>> parameter to build_string and allocate as many extra 0s as needed. > >>>>> > >>>>> There are STRING_CSTs which are not string literals, > >>>>>> for instance attribute tags, Pragmas, asm constrants, etc. > >>>>>> They use the '\0' outside, and have probably no TREE_TYPE. > >>>>>> > >>>>>>> > >>>>>>>> So I would like to be able to assume that the STRING_CST objects > >>>>>>>> are internally always generated properly by the front end. > >>>>>>> > >>>>>>> Yeah, I guess we need to define what "properly" is ;) > >>>>>>> > >>>>>> Yes. > >>>>>> > >>>>>>>> And that the ARRAY_TYPE of the string literal either has the > >>>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter, > >>>>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH > >>>>>>> > >>>>>>> I think it should be always the same... > >>>>>>> > >>>>>> > >>>>>> One could not differentiate between "\0" without zero-termination > >>>>>> and "" with zero-termination, theoretically. > >>>>> > >>>>> Is that important? Doesn't the C standard say how to parse string literals? > >>>>> > >>>>>> We also have char x[100] = "ab"; > >>>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100. > >>>>>> Of course one could create it with a TREE_STRING_LENGTH = 100, > >>>>>> but imagine char x[100000000000] = "ab" > >>>>> > >>>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I > >>>>> hope matches "ab" and not 'x'. If it matches 'x' then I'd rather have it > >>>>> unconditionally be [], thus incomplete (because the literals "size" depends > >>>>> on the context/LHS it is used on). > >>>>> > >>>> > >>>> Sorry, but I must say, it is not at all like that. > >>>> > >>>> If I compile x.c: > >>>> const char x[100] = "ab"; > >>>> > >>>> and set a breakpoint at output_constant: > >>>> > >>>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256, > >>>> reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821 > >>>> 4821 if (size == 0 || flag_syntax_only) > >>>> (gdb) p size > >>>> $1 = 100 > >>>> (gdb) call debug(exp) > >>>> "ab" > >>>> (gdb) p *exp > >>>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1, > >>>> (gdb) p exp->typed.type->type_common.size_unit > >>>> $5 = (tree) 0x7ffff6ff9d80 > >>>> (gdb) call debug(exp->typed.type->type_common.size_unit) > >>>> 100 > >>>> (gdb) p exp->string.length > >>>> $6 = 3 > >>>> (gdb) p exp->string.str[0] > >>>> $8 = 97 'a' > >>>> (gdb) p exp->string.str[1] > >>>> $9 = 98 'b' > >>>> (gdb) p exp->string.str[2] > >>>> $10 = 0 '\000' > >>>> (gdb) p exp->string.str[3] > >>>> $11 = 0 '\000' > >>>> > >>>> > >>>> This is an important property of string_cst objects, that is used in c_strlen: > >>>> > >>>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH > >>>> is guaranteed to be zero up to the type size. > >>>> > >>>> I would not have spent one thought on implementing an optimization like that, > >>>> but that's how it is right now. > >>> > >>> Huh. So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka > >>> they have zero-padding up to its type size. I don't see this documented > >>> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0" > >>> with appropriate TYPE_SIZE. > >>> > >>> This is also a relatively new thing on trunk (coming out of the added > >>> mem_size parameter of string_constant). That this looks at the STRING_CST > >>> type like > >>> > >>> if (TREE_CODE (array) == STRING_CST) > >>> { > >>> *ptr_offset = fold_convert (sizetype, offset); > >>> if (mem_size) > >>> *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); > >>> return array; > >>> > >>> I'd call simply a bug. As said, interpretation of STRING_CSTs should > >>> depend on the context. And for > >>> > >> > >> This use of the TYPE_SIZE_UNIT was pre-existing to r263607 > >> before that (but not in the gcc-8-branch) we had this in c_strlen: > >> > >> HOST_WIDE_INT maxelts = strelts; > >> tree type = TREE_TYPE (src); > >> if (tree size = TYPE_SIZE_UNIT (type)) > >> if (tree_fits_shwi_p (size)) > >> { > >> maxelts = tree_to_uhwi (size); > >> maxelts = maxelts / eltsize - 1; > >> } > >> > >> which I moved to string_constant and return the result through memsize. > >> > >> It seems to be working somehow, and I'd bet removing that will immediately > >> break twenty or thirty of the strlenopt tests. > >> > >> Obviously the tree string objects allow way too much variations, > >> and it has to be restricted in some way or another. > >> > >> In the moment my approach is: look at what most string constants do > >> and add assertions to ensure that there are no exceptions. > >> > >> > >>> char a[4] = "abc"; > >>> char b[5] = "abc"; > >>> > >>> we should better be able to share STRING_CSTs [you can see LTO > >>> sharing the nodes if you do b[4] but not when b[5] I suppose]. > >>> > >>>> All I want to do, is make sure that all string constants have the same look and feel > >>>> in the middle-end, and restrict the variations that are allowed by the current > >>>> implementation. > >>> > >>> Sure, I understand that. But I'd like to simplify things and not add > >>> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide > >>> whether sth is 0-terminated. > >>> > >> > >> This is not only about 0-terminated. (*) > >> > >> It is also about when you get a STRING_CST with a TREE_STRING_LENGTH, > >> there are places in the middle-end that assume that the object contains > >> _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT. > >> Those I want to protect. > > > > Well, but string_constant handles &STRING_CST just fine but in that > > context there's no "object" we look at. > > > > IMHO whenever string_constant ends up with a DECL, looking at > > ctor_for_folding and we end up with a STRING_CST that doesn't fit > > in the DECL we looked at we have a bug (non-truncated STRING_CST) > > and either should do the truncation or simply return NULL. > > > > So we should look at DECL_SIZE_UNIT and compare that with > > TREE_STRING_LENGTH. Otherwise you are relying on TYPE_SIZE_UNIT > > of the STRING_CST being "correct" (fitting the decl context). > > > >> Bernd. > >> > >> > >> *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero, > >> the string is only returned when it is properly zero terminated.", > >> but maybe I should have written: > >> "If MEM_SIZE is zero, the string is only returned when the storage size > >> of the string object is at least TREE_STRING_LENGTH." > >> That's at least exactly what the check does. > > > > Well, as said above > > > > if (TREE_CODE (array) == STRING_CST) > > { > > *ptr_offset = fold_convert (sizetype, offset); > > if (mem_size) > > *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); > > return array; > > } > > > > simply assumes the "storage size" of a STRING_CST is determined > > by its TYPE_SIZE_UNIT. That may be true as you noted in the folloup, > > but clearly in the above case there's nothing to protect? And in > > the case we pulled the STRING_CST from some DECL_INITIAL it doesn't > > protect from overflow of the FIELD_DECL unless frontends never > > generate "wrong" STRING_CSTs? > > > > Hmm, I digged some more in varasm.c > > Normal STRING_CST use get_constant_size to allocate > MAX(TYPE_SIZE_UNIT, TREE_STRING_LENGTH) > thus they can have excess zero bytes. > > while STRING_CST that are in DECL_INITIALIZERs > are shrinked to DECL_SIZE_UNIT. > > At least that difference looks unnecessary to me. > > > But compare_constant does not compare the TYPE_SIZE_UNIT: > > case STRING_CST: > if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2))) > return 0; > > return (TREE_STRING_LENGTH (t1) == TREE_STRING_LENGTH (t2) > && ! memcmp (TREE_STRING_POINTER (t1), TREE_STRING_POINTER (t2), > TREE_STRING_LENGTH (t1))); > > > This looks now like a bug. Well. I think that we emit excess zero bytes for "truncated" STRING_CSTs isn't a designed feature. I'd rather not have it that way because the length of a STRING_CST becomes defined in multiple places if you consider STRING_CST with [] or too small type then TREE_STRING_LENGTH is authorative while if the type is larger then that is authorative. The length of a STRING_CST if output should be only determined from TREE_STRING_LENGTH. > > > Btw, get_constant_size / mergeable_string_section suggsts that > > we may view STRING_CST as general target-encoded byte blob. > > That may be useful to compress CONSTRUCTORs memory use. > > > > It looks like mergeable_string_section wrongly would reject > > a char[] typed string because int_size_in_bytes returns -1 > > for incomplete types. I still believe using char[] for most > > STRING_CSTs generated by FEs would be a good thing for > > IL memory use. Shouldn't the mem_size initializations use > > sth like get_constant_size as well? > > > > I think incomplete types exist only in structs with variable > length: > > struct { > int x; > char y[]; > } s = { 1, "test" }; > > this is no candidate for a mergeable string. > > but > char y[] = "test"; > > is fixed by the FE to: > char y[5] = "test"; > > So that does not make problems with mergeable_string_section. > > > > Also > > > > if (mem_size) > > *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (init)); > > else if (compare_tree_int (array_size, length + 1) < 0) > > return NULL_TREE; > > > > the latter check doesn't seem to honor 'offset' and *mem_size > > is again looking at the STRING_CST, not at the FIELD_DECL or > > wherever we got the STRING_CST from. > > > > the caller compares offset with *mem_size, but we do not have > the FIELD_DECL at hand here (Or I did not know where it was). > I'd rather fail the cases when the TYPE_SIZE_UNIT is null, > or something non-constant. > > All those are variations of vla with initializer and similar. > Once the STRING_CST has a type, I would like to use it, > when it doesn't the whole thing is probably not worth it > anyway. I would rather have STRING_CSTs only have a type for the elements, not the whole thing... you can get that if you drop TYPE_DOMAIN on all STRING_CSTs array types. Not sure what breaks if we change get_constant_size to return TREE_STRING_LENGTH unconditionally. At least somehow the need to special-case STRING_CSTs looks odd and I wonder if we do this in other functions as well... Richard. > > > Bernd.
On 08/24/18 12:52, Richard Biener wrote: > On Wed, Aug 22, 2018 at 6:57 AM Bernd Edlinger > <bernd.edlinger@hotmail.de> wrote: >> >> On 08/21/18 10:33, Richard Biener wrote: >>> On Mon, 20 Aug 2018, Bernd Edlinger wrote: >>> >>>> On 08/20/18 15:19, Richard Biener wrote: >>>>> On Mon, 20 Aug 2018, Bernd Edlinger wrote: >>>>> >>>>>> On 08/20/18 13:01, Richard Biener wrote: >>>>>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 08/01/18 11:29, Richard Biener wrote: >>>>>>>>> >>>>>>>>> Hmm. I think it would be nice if TREE_STRING_LENGTH would >>>>>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient >>>>>>>>> for your check above. Because the '\0' doesn't belong to the >>>>>>>>> string. Then build_string internally appends a '\0' outside >>>>>>>>> of TREE_STRING_LENGTH. >>>>>>>>> >>>>>>>> >>>>>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide >>>>>>>> character. >>>>>>> >>>>>>> That could be fixed though (a wide 0 is just N 0s). Add a elsz = 1 >>>>>>> parameter to build_string and allocate as many extra 0s as needed. >>>>>>> >>>>>>> There are STRING_CSTs which are not string literals, >>>>>>>> for instance attribute tags, Pragmas, asm constrants, etc. >>>>>>>> They use the '\0' outside, and have probably no TREE_TYPE. >>>>>>>> >>>>>>>>> >>>>>>>>>> So I would like to be able to assume that the STRING_CST objects >>>>>>>>>> are internally always generated properly by the front end. >>>>>>>>> >>>>>>>>> Yeah, I guess we need to define what "properly" is ;) >>>>>>>>> >>>>>>>> Yes. >>>>>>>> >>>>>>>>>> And that the ARRAY_TYPE of the string literal either has the >>>>>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter, >>>>>>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH >>>>>>>>> >>>>>>>>> I think it should be always the same... >>>>>>>>> >>>>>>>> >>>>>>>> One could not differentiate between "\0" without zero-termination >>>>>>>> and "" with zero-termination, theoretically. >>>>>>> >>>>>>> Is that important? Doesn't the C standard say how to parse string literals? >>>>>>> >>>>>>>> We also have char x[100] = "ab"; >>>>>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100. >>>>>>>> Of course one could create it with a TREE_STRING_LENGTH = 100, >>>>>>>> but imagine char x[100000000000] = "ab" >>>>>>> >>>>>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I >>>>>>> hope matches "ab" and not 'x'. If it matches 'x' then I'd rather have it >>>>>>> unconditionally be [], thus incomplete (because the literals "size" depends >>>>>>> on the context/LHS it is used on). >>>>>>> >>>>>> >>>>>> Sorry, but I must say, it is not at all like that. >>>>>> >>>>>> If I compile x.c: >>>>>> const char x[100] = "ab"; >>>>>> >>>>>> and set a breakpoint at output_constant: >>>>>> >>>>>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256, >>>>>> reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821 >>>>>> 4821 if (size == 0 || flag_syntax_only) >>>>>> (gdb) p size >>>>>> $1 = 100 >>>>>> (gdb) call debug(exp) >>>>>> "ab" >>>>>> (gdb) p *exp >>>>>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1, >>>>>> (gdb) p exp->typed.type->type_common.size_unit >>>>>> $5 = (tree) 0x7ffff6ff9d80 >>>>>> (gdb) call debug(exp->typed.type->type_common.size_unit) >>>>>> 100 >>>>>> (gdb) p exp->string.length >>>>>> $6 = 3 >>>>>> (gdb) p exp->string.str[0] >>>>>> $8 = 97 'a' >>>>>> (gdb) p exp->string.str[1] >>>>>> $9 = 98 'b' >>>>>> (gdb) p exp->string.str[2] >>>>>> $10 = 0 '\000' >>>>>> (gdb) p exp->string.str[3] >>>>>> $11 = 0 '\000' >>>>>> >>>>>> >>>>>> This is an important property of string_cst objects, that is used in c_strlen: >>>>>> >>>>>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH >>>>>> is guaranteed to be zero up to the type size. >>>>>> >>>>>> I would not have spent one thought on implementing an optimization like that, >>>>>> but that's how it is right now. >>>>> >>>>> Huh. So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka >>>>> they have zero-padding up to its type size. I don't see this documented >>>>> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0" >>>>> with appropriate TYPE_SIZE. >>>>> >>>>> This is also a relatively new thing on trunk (coming out of the added >>>>> mem_size parameter of string_constant). That this looks at the STRING_CST >>>>> type like >>>>> >>>>> if (TREE_CODE (array) == STRING_CST) >>>>> { >>>>> *ptr_offset = fold_convert (sizetype, offset); >>>>> if (mem_size) >>>>> *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); >>>>> return array; >>>>> >>>>> I'd call simply a bug. As said, interpretation of STRING_CSTs should >>>>> depend on the context. And for >>>>> >>>> >>>> This use of the TYPE_SIZE_UNIT was pre-existing to r263607 >>>> before that (but not in the gcc-8-branch) we had this in c_strlen: >>>> >>>> HOST_WIDE_INT maxelts = strelts; >>>> tree type = TREE_TYPE (src); >>>> if (tree size = TYPE_SIZE_UNIT (type)) >>>> if (tree_fits_shwi_p (size)) >>>> { >>>> maxelts = tree_to_uhwi (size); >>>> maxelts = maxelts / eltsize - 1; >>>> } >>>> >>>> which I moved to string_constant and return the result through memsize. >>>> >>>> It seems to be working somehow, and I'd bet removing that will immediately >>>> break twenty or thirty of the strlenopt tests. >>>> >>>> Obviously the tree string objects allow way too much variations, >>>> and it has to be restricted in some way or another. >>>> >>>> In the moment my approach is: look at what most string constants do >>>> and add assertions to ensure that there are no exceptions. >>>> >>>> >>>>> char a[4] = "abc"; >>>>> char b[5] = "abc"; >>>>> >>>>> we should better be able to share STRING_CSTs [you can see LTO >>>>> sharing the nodes if you do b[4] but not when b[5] I suppose]. >>>>> >>>>>> All I want to do, is make sure that all string constants have the same look and feel >>>>>> in the middle-end, and restrict the variations that are allowed by the current >>>>>> implementation. >>>>> >>>>> Sure, I understand that. But I'd like to simplify things and not add >>>>> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide >>>>> whether sth is 0-terminated. >>>>> >>>> >>>> This is not only about 0-terminated. (*) >>>> >>>> It is also about when you get a STRING_CST with a TREE_STRING_LENGTH, >>>> there are places in the middle-end that assume that the object contains >>>> _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT. >>>> Those I want to protect. >>> >>> Well, but string_constant handles &STRING_CST just fine but in that >>> context there's no "object" we look at. >>> >>> IMHO whenever string_constant ends up with a DECL, looking at >>> ctor_for_folding and we end up with a STRING_CST that doesn't fit >>> in the DECL we looked at we have a bug (non-truncated STRING_CST) >>> and either should do the truncation or simply return NULL. >>> >>> So we should look at DECL_SIZE_UNIT and compare that with >>> TREE_STRING_LENGTH. Otherwise you are relying on TYPE_SIZE_UNIT >>> of the STRING_CST being "correct" (fitting the decl context). >>> >>>> Bernd. >>>> >>>> >>>> *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero, >>>> the string is only returned when it is properly zero terminated.", >>>> but maybe I should have written: >>>> "If MEM_SIZE is zero, the string is only returned when the storage size >>>> of the string object is at least TREE_STRING_LENGTH." >>>> That's at least exactly what the check does. >>> >>> Well, as said above >>> >>> if (TREE_CODE (array) == STRING_CST) >>> { >>> *ptr_offset = fold_convert (sizetype, offset); >>> if (mem_size) >>> *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); >>> return array; >>> } >>> >>> simply assumes the "storage size" of a STRING_CST is determined >>> by its TYPE_SIZE_UNIT. That may be true as you noted in the folloup, >>> but clearly in the above case there's nothing to protect? And in >>> the case we pulled the STRING_CST from some DECL_INITIAL it doesn't >>> protect from overflow of the FIELD_DECL unless frontends never >>> generate "wrong" STRING_CSTs? >>> >> >> Hmm, I digged some more in varasm.c >> >> Normal STRING_CST use get_constant_size to allocate >> MAX(TYPE_SIZE_UNIT, TREE_STRING_LENGTH) >> thus they can have excess zero bytes. >> >> while STRING_CST that are in DECL_INITIALIZERs >> are shrinked to DECL_SIZE_UNIT. >> >> At least that difference looks unnecessary to me. >> >> >> But compare_constant does not compare the TYPE_SIZE_UNIT: >> >> case STRING_CST: >> if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2))) >> return 0; >> >> return (TREE_STRING_LENGTH (t1) == TREE_STRING_LENGTH (t2) >> && ! memcmp (TREE_STRING_POINTER (t1), TREE_STRING_POINTER (t2), >> TREE_STRING_LENGTH (t1))); >> >> >> This looks now like a bug. > > Well. I think that we emit excess zero bytes for "truncated" STRING_CSTs isn't > a designed feature. I'd rather not have it that way because the length of a > STRING_CST becomes defined in multiple places if you consider STRING_CST > with [] or too small type then TREE_STRING_LENGTH is authorative while > if the type is larger then that is authorative. > > The length of a STRING_CST if output should be only determined from > TREE_STRING_LENGTH. > Hmm. I think about that. It is possible that I change my mind.... Currently the semantic of STRING_CST objects differs for literals and for initializer elements. For literals the TREE_STRING_LENGTH is guaranteed to resemble the runtime string object. For initializer elements the type domain overrides the TREE_STRING_LENGTH if it is available. If it is a flexible array member the type domain is not available, and the TREE_STRING_LENGTH determines the memory size. The regression in tree-ssa-forwprop.c (PR 86714) is caused by the fact that previously constant_string did always return string literals while now it can as well return initializer elements. But tree-ssa-forwprop expects string literal semantics. I would like to remove the semantic differences. I have a patch in the test, that fixes the TYPE_DOMAIN of flexible array members. Maybe the most important property is TYPE_DOMAIN of the STRING_CST should never be shorter than the TREE_STRING_LENGTH. Then it the difference between string literals and string initializers will disappear. And that should probably take precedence over the zero-termination property of the string_cst objects. Which will of course put one more question mark on the correctness of the string_constant, c_strlen etc. >> >>> Btw, get_constant_size / mergeable_string_section suggsts that >>> we may view STRING_CST as general target-encoded byte blob. >>> That may be useful to compress CONSTRUCTORs memory use. >>> >>> It looks like mergeable_string_section wrongly would reject >>> a char[] typed string because int_size_in_bytes returns -1 >>> for incomplete types. I still believe using char[] for most >>> STRING_CSTs generated by FEs would be a good thing for >>> IL memory use. Shouldn't the mem_size initializations use >>> sth like get_constant_size as well? >>> >> >> I think incomplete types exist only in structs with variable >> length: >> >> struct { >> int x; >> char y[]; >> } s = { 1, "test" }; >> >> this is no candidate for a mergeable string. >> >> but >> char y[] = "test"; >> >> is fixed by the FE to: >> char y[5] = "test"; >> >> So that does not make problems with mergeable_string_section. >> >> >>> Also >>> >>> if (mem_size) >>> *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (init)); >>> else if (compare_tree_int (array_size, length + 1) < 0) >>> return NULL_TREE; >>> >>> the latter check doesn't seem to honor 'offset' and *mem_size >>> is again looking at the STRING_CST, not at the FIELD_DECL or >>> wherever we got the STRING_CST from. >>> >> >> the caller compares offset with *mem_size, but we do not have >> the FIELD_DECL at hand here (Or I did not know where it was). >> I'd rather fail the cases when the TYPE_SIZE_UNIT is null, >> or something non-constant. >> >> All those are variations of vla with initializer and similar. >> Once the STRING_CST has a type, I would like to use it, >> when it doesn't the whole thing is probably not worth it >> anyway. > > I would rather have STRING_CSTs only have a type for the > elements, not the whole thing... you can get that > if you drop TYPE_DOMAIN on all STRING_CSTs array types. > This will break the string_constant code even more. It depends on the TYPE_DOMAIN to mean something. > Not sure what breaks if we change get_constant_size to > return TREE_STRING_LENGTH unconditionally. At least > somehow the need to special-case STRING_CSTs looks > odd and I wonder if we do this in other functions as well... > I think as well this special handling in get_constant_size is something that has to go away. It is even worse, the DECL_SIZE_UNIT is also NULL in flexible array members. But I think I can make it work the other way round. So that STRING_CST always have a TYPE_DOMAIN which may only be larger than the TREE_STRING_LENGTH. Bernd. > Richard. > >> >> >> Bernd.
On Fri, Aug 24, 2018 at 1:49 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: > > On 08/24/18 12:52, Richard Biener wrote: > > On Wed, Aug 22, 2018 at 6:57 AM Bernd Edlinger > > <bernd.edlinger@hotmail.de> wrote: > >> > >> On 08/21/18 10:33, Richard Biener wrote: > >>> On Mon, 20 Aug 2018, Bernd Edlinger wrote: > >>> > >>>> On 08/20/18 15:19, Richard Biener wrote: > >>>>> On Mon, 20 Aug 2018, Bernd Edlinger wrote: > >>>>> > >>>>>> On 08/20/18 13:01, Richard Biener wrote: > >>>>>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> On 08/01/18 11:29, Richard Biener wrote: > >>>>>>>>> > >>>>>>>>> Hmm. I think it would be nice if TREE_STRING_LENGTH would > >>>>>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient > >>>>>>>>> for your check above. Because the '\0' doesn't belong to the > >>>>>>>>> string. Then build_string internally appends a '\0' outside > >>>>>>>>> of TREE_STRING_LENGTH. > >>>>>>>>> > >>>>>>>> > >>>>>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide > >>>>>>>> character. > >>>>>>> > >>>>>>> That could be fixed though (a wide 0 is just N 0s). Add a elsz = 1 > >>>>>>> parameter to build_string and allocate as many extra 0s as needed. > >>>>>>> > >>>>>>> There are STRING_CSTs which are not string literals, > >>>>>>>> for instance attribute tags, Pragmas, asm constrants, etc. > >>>>>>>> They use the '\0' outside, and have probably no TREE_TYPE. > >>>>>>>> > >>>>>>>>> > >>>>>>>>>> So I would like to be able to assume that the STRING_CST objects > >>>>>>>>>> are internally always generated properly by the front end. > >>>>>>>>> > >>>>>>>>> Yeah, I guess we need to define what "properly" is ;) > >>>>>>>>> > >>>>>>>> Yes. > >>>>>>>> > >>>>>>>>>> And that the ARRAY_TYPE of the string literal either has the > >>>>>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter, > >>>>>>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH > >>>>>>>>> > >>>>>>>>> I think it should be always the same... > >>>>>>>>> > >>>>>>>> > >>>>>>>> One could not differentiate between "\0" without zero-termination > >>>>>>>> and "" with zero-termination, theoretically. > >>>>>>> > >>>>>>> Is that important? Doesn't the C standard say how to parse string literals? > >>>>>>> > >>>>>>>> We also have char x[100] = "ab"; > >>>>>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100. > >>>>>>>> Of course one could create it with a TREE_STRING_LENGTH = 100, > >>>>>>>> but imagine char x[100000000000] = "ab" > >>>>>>> > >>>>>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I > >>>>>>> hope matches "ab" and not 'x'. If it matches 'x' then I'd rather have it > >>>>>>> unconditionally be [], thus incomplete (because the literals "size" depends > >>>>>>> on the context/LHS it is used on). > >>>>>>> > >>>>>> > >>>>>> Sorry, but I must say, it is not at all like that. > >>>>>> > >>>>>> If I compile x.c: > >>>>>> const char x[100] = "ab"; > >>>>>> > >>>>>> and set a breakpoint at output_constant: > >>>>>> > >>>>>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256, > >>>>>> reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821 > >>>>>> 4821 if (size == 0 || flag_syntax_only) > >>>>>> (gdb) p size > >>>>>> $1 = 100 > >>>>>> (gdb) call debug(exp) > >>>>>> "ab" > >>>>>> (gdb) p *exp > >>>>>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1, > >>>>>> (gdb) p exp->typed.type->type_common.size_unit > >>>>>> $5 = (tree) 0x7ffff6ff9d80 > >>>>>> (gdb) call debug(exp->typed.type->type_common.size_unit) > >>>>>> 100 > >>>>>> (gdb) p exp->string.length > >>>>>> $6 = 3 > >>>>>> (gdb) p exp->string.str[0] > >>>>>> $8 = 97 'a' > >>>>>> (gdb) p exp->string.str[1] > >>>>>> $9 = 98 'b' > >>>>>> (gdb) p exp->string.str[2] > >>>>>> $10 = 0 '\000' > >>>>>> (gdb) p exp->string.str[3] > >>>>>> $11 = 0 '\000' > >>>>>> > >>>>>> > >>>>>> This is an important property of string_cst objects, that is used in c_strlen: > >>>>>> > >>>>>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH > >>>>>> is guaranteed to be zero up to the type size. > >>>>>> > >>>>>> I would not have spent one thought on implementing an optimization like that, > >>>>>> but that's how it is right now. > >>>>> > >>>>> Huh. So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka > >>>>> they have zero-padding up to its type size. I don't see this documented > >>>>> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0" > >>>>> with appropriate TYPE_SIZE. > >>>>> > >>>>> This is also a relatively new thing on trunk (coming out of the added > >>>>> mem_size parameter of string_constant). That this looks at the STRING_CST > >>>>> type like > >>>>> > >>>>> if (TREE_CODE (array) == STRING_CST) > >>>>> { > >>>>> *ptr_offset = fold_convert (sizetype, offset); > >>>>> if (mem_size) > >>>>> *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); > >>>>> return array; > >>>>> > >>>>> I'd call simply a bug. As said, interpretation of STRING_CSTs should > >>>>> depend on the context. And for > >>>>> > >>>> > >>>> This use of the TYPE_SIZE_UNIT was pre-existing to r263607 > >>>> before that (but not in the gcc-8-branch) we had this in c_strlen: > >>>> > >>>> HOST_WIDE_INT maxelts = strelts; > >>>> tree type = TREE_TYPE (src); > >>>> if (tree size = TYPE_SIZE_UNIT (type)) > >>>> if (tree_fits_shwi_p (size)) > >>>> { > >>>> maxelts = tree_to_uhwi (size); > >>>> maxelts = maxelts / eltsize - 1; > >>>> } > >>>> > >>>> which I moved to string_constant and return the result through memsize. > >>>> > >>>> It seems to be working somehow, and I'd bet removing that will immediately > >>>> break twenty or thirty of the strlenopt tests. > >>>> > >>>> Obviously the tree string objects allow way too much variations, > >>>> and it has to be restricted in some way or another. > >>>> > >>>> In the moment my approach is: look at what most string constants do > >>>> and add assertions to ensure that there are no exceptions. > >>>> > >>>> > >>>>> char a[4] = "abc"; > >>>>> char b[5] = "abc"; > >>>>> > >>>>> we should better be able to share STRING_CSTs [you can see LTO > >>>>> sharing the nodes if you do b[4] but not when b[5] I suppose]. > >>>>> > >>>>>> All I want to do, is make sure that all string constants have the same look and feel > >>>>>> in the middle-end, and restrict the variations that are allowed by the current > >>>>>> implementation. > >>>>> > >>>>> Sure, I understand that. But I'd like to simplify things and not add > >>>>> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide > >>>>> whether sth is 0-terminated. > >>>>> > >>>> > >>>> This is not only about 0-terminated. (*) > >>>> > >>>> It is also about when you get a STRING_CST with a TREE_STRING_LENGTH, > >>>> there are places in the middle-end that assume that the object contains > >>>> _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT. > >>>> Those I want to protect. > >>> > >>> Well, but string_constant handles &STRING_CST just fine but in that > >>> context there's no "object" we look at. > >>> > >>> IMHO whenever string_constant ends up with a DECL, looking at > >>> ctor_for_folding and we end up with a STRING_CST that doesn't fit > >>> in the DECL we looked at we have a bug (non-truncated STRING_CST) > >>> and either should do the truncation or simply return NULL. > >>> > >>> So we should look at DECL_SIZE_UNIT and compare that with > >>> TREE_STRING_LENGTH. Otherwise you are relying on TYPE_SIZE_UNIT > >>> of the STRING_CST being "correct" (fitting the decl context). > >>> > >>>> Bernd. > >>>> > >>>> > >>>> *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero, > >>>> the string is only returned when it is properly zero terminated.", > >>>> but maybe I should have written: > >>>> "If MEM_SIZE is zero, the string is only returned when the storage size > >>>> of the string object is at least TREE_STRING_LENGTH." > >>>> That's at least exactly what the check does. > >>> > >>> Well, as said above > >>> > >>> if (TREE_CODE (array) == STRING_CST) > >>> { > >>> *ptr_offset = fold_convert (sizetype, offset); > >>> if (mem_size) > >>> *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); > >>> return array; > >>> } > >>> > >>> simply assumes the "storage size" of a STRING_CST is determined > >>> by its TYPE_SIZE_UNIT. That may be true as you noted in the folloup, > >>> but clearly in the above case there's nothing to protect? And in > >>> the case we pulled the STRING_CST from some DECL_INITIAL it doesn't > >>> protect from overflow of the FIELD_DECL unless frontends never > >>> generate "wrong" STRING_CSTs? > >>> > >> > >> Hmm, I digged some more in varasm.c > >> > >> Normal STRING_CST use get_constant_size to allocate > >> MAX(TYPE_SIZE_UNIT, TREE_STRING_LENGTH) > >> thus they can have excess zero bytes. > >> > >> while STRING_CST that are in DECL_INITIALIZERs > >> are shrinked to DECL_SIZE_UNIT. > >> > >> At least that difference looks unnecessary to me. > >> > >> > >> But compare_constant does not compare the TYPE_SIZE_UNIT: > >> > >> case STRING_CST: > >> if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2))) > >> return 0; > >> > >> return (TREE_STRING_LENGTH (t1) == TREE_STRING_LENGTH (t2) > >> && ! memcmp (TREE_STRING_POINTER (t1), TREE_STRING_POINTER (t2), > >> TREE_STRING_LENGTH (t1))); > >> > >> > >> This looks now like a bug. > > > > Well. I think that we emit excess zero bytes for "truncated" STRING_CSTs isn't > > a designed feature. I'd rather not have it that way because the length of a > > STRING_CST becomes defined in multiple places if you consider STRING_CST > > with [] or too small type then TREE_STRING_LENGTH is authorative while > > if the type is larger then that is authorative. > > > > The length of a STRING_CST if output should be only determined from > > TREE_STRING_LENGTH. > > > > Hmm. I think about that. It is possible that I change my mind.... > > Currently the semantic of STRING_CST objects differs for literals > and for initializer elements. > > For literals the TREE_STRING_LENGTH is guaranteed to resemble > the runtime string object. > > For initializer elements the type domain overrides the TREE_STRING_LENGTH > if it is available. If it is a flexible array member the type domain > is not available, and the TREE_STRING_LENGTH determines the memory size. > > The regression in tree-ssa-forwprop.c (PR 86714) is caused by > the fact that previously constant_string did always return string literals > while now it can as well return initializer elements. > But tree-ssa-forwprop expects string literal semantics. > > I would like to remove the semantic differences. > > I have a patch in the test, that fixes the TYPE_DOMAIN of flexible > array members. > > Maybe the most important property is TYPE_DOMAIN of the STRING_CST should never > be shorter than the TREE_STRING_LENGTH. Then it the difference between > string literals and string initializers will disappear. > > And that should probably take precedence over the zero-termination property of > the string_cst objects. > > Which will of course put one more question mark on the correctness of the > string_constant, c_strlen etc. > > > >> > >>> Btw, get_constant_size / mergeable_string_section suggsts that > >>> we may view STRING_CST as general target-encoded byte blob. > >>> That may be useful to compress CONSTRUCTORs memory use. > >>> > >>> It looks like mergeable_string_section wrongly would reject > >>> a char[] typed string because int_size_in_bytes returns -1 > >>> for incomplete types. I still believe using char[] for most > >>> STRING_CSTs generated by FEs would be a good thing for > >>> IL memory use. Shouldn't the mem_size initializations use > >>> sth like get_constant_size as well? > >>> > >> > >> I think incomplete types exist only in structs with variable > >> length: > >> > >> struct { > >> int x; > >> char y[]; > >> } s = { 1, "test" }; > >> > >> this is no candidate for a mergeable string. > >> > >> but > >> char y[] = "test"; > >> > >> is fixed by the FE to: > >> char y[5] = "test"; > >> > >> So that does not make problems with mergeable_string_section. > >> > >> > >>> Also > >>> > >>> if (mem_size) > >>> *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (init)); > >>> else if (compare_tree_int (array_size, length + 1) < 0) > >>> return NULL_TREE; > >>> > >>> the latter check doesn't seem to honor 'offset' and *mem_size > >>> is again looking at the STRING_CST, not at the FIELD_DECL or > >>> wherever we got the STRING_CST from. > >>> > >> > >> the caller compares offset with *mem_size, but we do not have > >> the FIELD_DECL at hand here (Or I did not know where it was). > >> I'd rather fail the cases when the TYPE_SIZE_UNIT is null, > >> or something non-constant. > >> > >> All those are variations of vla with initializer and similar. > >> Once the STRING_CST has a type, I would like to use it, > >> when it doesn't the whole thing is probably not worth it > >> anyway. > > > > I would rather have STRING_CSTs only have a type for the > > elements, not the whole thing... you can get that > > if you drop TYPE_DOMAIN on all STRING_CSTs array types. > > > > This will break the string_constant code even more. > It depends on the TYPE_DOMAIN to mean something. > > > Not sure what breaks if we change get_constant_size to > > return TREE_STRING_LENGTH unconditionally. At least > > somehow the need to special-case STRING_CSTs looks > > odd and I wonder if we do this in other functions as well... > > > I think as well this special handling in get_constant_size > is something that has to go away. > > It is even worse, the DECL_SIZE_UNIT is also NULL in > flexible array members. > > But I think I can make it work the other way round. > So that STRING_CST always have a TYPE_DOMAIN which > may only be larger than the TREE_STRING_LENGTH. OK, that would then be at least clear semantics and follow that of CONSTRUCTORs (missing "elements" are zero). Richard. > > > Bernd. > > > Richard. > > > >> > >> > >> Bernd.
2018-07-31 Bernd Edlinger <bernd.edlinger@hotmail.de> * go-gcc.cc (Gcc_backend::string_constant_expression): Make string literal properly NUL terminated. diff -pur gcc/go/go-gcc.cc gcc/go/go-gcc.cc --- gcc/go/go-gcc.cc 2018-06-28 19:46:36.000000000 +0200 +++ gcc/go/go-gcc.cc 2018-07-31 12:52:24.690236476 +0200 @@ -1394,7 +1394,7 @@ Gcc_backend::string_constant_expression( TYPE_QUAL_CONST); tree string_type = build_array_type(const_char_type, index_type); TYPE_STRING_FLAG(string_type) = 1; - tree string_val = build_string(val.length(), val.data()); + tree string_val = build_string(val.length() + 1, val.data()); TREE_TYPE(string_val) = string_type; return this->make_expression(string_val);