Message ID | 20171019155031.6441-1-krebbel@linux.vnet.ibm.com |
---|---|
State | New |
Headers | show |
Series | [RFC] New pragma exec_charset | expand |
On 10/19/2017 09:50 AM, Andreas Krebbel wrote: > The TPF operating system uses the GCC S/390 backend. They set an > EBCDIC exec charset for compilation using -fexec-charset. However, > certain libraries require ASCII strings instead. In order to be able > to put calls to that library into the normal code it is required to > switch the exec charset within a compilation unit. > > This is an attempt to implement it by adding a new pragma which could > be used like in the following example: > > int > foo () > { > call_with_utf8("hello world"); > > #pragma GCC exec_charset("UTF16") > call_with_utf16("hello world"); > > #pragma GCC exec_charset(pop) > call_with_utf8("hello world"); > } > > Does this look reasonable? I'm not an expert on this but at a high level it looks reasonable to me. But based on some small amount of work I did in this area I have a couple of questions. There are a few places in the compiler that already do or that should but don't yet handle different execution character sets. The former include built-ins like __bultin_isdigit() and __builtin_sprintf (in both builtins.c and gimple-ssa-sprintf.c) The latter is the -Wformat checking done by the C and C++ front ends. The missing support for the latter is the subject of bug 38308. According to bug 81686, LTO is apparently also missing support for exec-charset. I'm curious how the pragma might interact with these two areas, and whether the lack of support for it in the latter is a concern (and if not, why not). For the former, I'm also wondering about the interaction of inlining and other interprocedural optimizations with the pragma. Does it propagate through inlined calls as one would expect? Thanks Martin
On Thu, 19 Oct 2017, Andreas Krebbel wrote: > gcc/testsuite/gcc.dg/pragma-exec_charset-1.c | 26 +++++++++++++++ I'd expect a c-c++-common test rather than a C-specific one, given there are significant differences in how the C and C++ front ends handle lexing. > +#pragma GCC exec_charset("UTF16") > +call_with_utf16("hello world"); I don't think UTF16 makes sense as an example, as it's not a multibyte character set (characters are 16-bit, and with 8-bit bytes that means many characters contain NUL bytes, which is not allowed for multibyte character sets).
On Thu, Oct 19, 2017 at 7:13 PM, Martin Sebor <msebor@gmail.com> wrote: > On 10/19/2017 09:50 AM, Andreas Krebbel wrote: >> >> The TPF operating system uses the GCC S/390 backend. They set an >> EBCDIC exec charset for compilation using -fexec-charset. However, >> certain libraries require ASCII strings instead. In order to be able >> to put calls to that library into the normal code it is required to >> switch the exec charset within a compilation unit. >> >> This is an attempt to implement it by adding a new pragma which could >> be used like in the following example: >> >> int >> foo () >> { >> call_with_utf8("hello world"); >> >> #pragma GCC exec_charset("UTF16") >> call_with_utf16("hello world"); >> >> #pragma GCC exec_charset(pop) >> call_with_utf8("hello world"); >> } >> >> Does this look reasonable? > > > I'm not an expert on this but at a high level it looks reasonable > to me. But based on some small amount of work I did in this area > I have a couple of questions. > > There are a few places in the compiler that already do or that > should but don't yet handle different execution character sets. > The former include built-ins like __bultin_isdigit() and > __builtin_sprintf (in both builtins.c and gimple-ssa-sprintf.c) > The latter is the -Wformat checking done by the C and C++ front > ends. The missing support for the latter is the subject of bug > 38308. According to bug 81686, LTO is apparently also missing > support for exec-charset. > > I'm curious how the pragma might interact with these two areas, > and whether the lack of support for it in the latter is a concern > (and if not, why not). For the former, I'm also wondering about > the interaction of inlining and other interprocedural optimizations > with the pragma. Does it propagate through inlined calls as one > would expect? How does it work semantically to have different exec charsets? That is, if "strings" flow from a region with one -fexec-charset setting to a region with another one is that undefined behavior? Do we now require external function declarations to be in the proper region (declared under the appropriate exec charset flag)? This would mean that passing the exec charset in effect as additional argument isn't a possibility. Or do we have to treat -fexec-charset similar to -frounding-math, that is, we can't ever _interpret_ any string in the compiler? [unless -fexec-charset is the same everywhere] I think the -frounding-math route is probably the easiest (and wisest given the quite low test coverage we'll get) route. Thus, add a -fmixed-charset flag and reject any exec-charset attribute/pragma if that flag is not set? With LTO we could always add this and/or merge -fexec-charset flags appropriately, injecting -fmixed-charset in case TUs use different settings. Richard. > Thanks > Martin >
On Fri, Oct 20, 2017 at 09:48:38AM +0200, Richard Biener wrote: > How does it work semantically to have different exec charsets? That is, > if "strings" flow from a region with one -fexec-charset setting to a region > with another one is that undefined behavior? Do we now require > external function declarations to be in the proper region (declared under > the appropriate exec charset flag)? This would mean that passing > the exec charset in effect as additional argument isn't a possibility. > > Or do we have to treat -fexec-charset similar to -frounding-math, that is, > we can't ever _interpret_ any string in the compiler? [unless -fexec-charset > is the same everywhere] > > I think the -frounding-math route is probably the easiest (and wisest > given the quite low test coverage we'll get) route. Thus, add a -fmixed-charset > flag and reject any exec-charset attribute/pragma if that flag is not set? > With LTO we could always add this and/or merge -fexec-charset flags > appropriately, > injecting -fmixed-charset in case TUs use different settings. It wouldn't have to be an option, simply mark in cfun all functions that have more than one exec charset and give up on all optimizations/warnings that require to read the characters and merge that unknown exec_charset flag during inlining etc. Though, that might still not be enough, e.g. the whole function might have one exec charset, but a global const char [] variable might have another one and during optimization we might be looking at that. So perhaps it would need to be a per-TU flag merged during LTO. Jakub
On Fri, Oct 20, 2017 at 9:53 AM, Jakub Jelinek <jakub@redhat.com> wrote: > On Fri, Oct 20, 2017 at 09:48:38AM +0200, Richard Biener wrote: >> How does it work semantically to have different exec charsets? That is, >> if "strings" flow from a region with one -fexec-charset setting to a region >> with another one is that undefined behavior? Do we now require >> external function declarations to be in the proper region (declared under >> the appropriate exec charset flag)? This would mean that passing >> the exec charset in effect as additional argument isn't a possibility. >> >> Or do we have to treat -fexec-charset similar to -frounding-math, that is, >> we can't ever _interpret_ any string in the compiler? [unless -fexec-charset >> is the same everywhere] >> >> I think the -frounding-math route is probably the easiest (and wisest >> given the quite low test coverage we'll get) route. Thus, add a -fmixed-charset >> flag and reject any exec-charset attribute/pragma if that flag is not set? >> With LTO we could always add this and/or merge -fexec-charset flags >> appropriately, >> injecting -fmixed-charset in case TUs use different settings. > > It wouldn't have to be an option, simply mark in cfun all functions that > have more than one exec charset and give up on all optimizations/warnings > that require to read the characters and merge that unknown exec_charset > flag during inlining etc. Though, that might still not be enough, e.g. > the whole function might have one exec charset, but a global const char [] > variable might have another one and during optimization we might be looking > at that. So perhaps it would need to be a per-TU flag merged during LTO. There's also IPA flow of strings between functions so unless mixing exec charsets invokes undefined behavior I can't see how a per-function flag would help. But yes, if we can reliably detect whether multiple exec charsets are used in a TU we can make this a flag that doesn't have to be set by the user. But that means the pragma probably _always_ forces that flag given we have that forced pre-included file on some targest and the pragma token would occur after that... Richard. > Jakub
On 10/20/2017 10:28 AM, Richard Biener wrote: > On Fri, Oct 20, 2017 at 9:53 AM, Jakub Jelinek <jakub@redhat.com> wrote: >> On Fri, Oct 20, 2017 at 09:48:38AM +0200, Richard Biener wrote: >>> How does it work semantically to have different exec charsets? That is, >>> if "strings" flow from a region with one -fexec-charset setting to a region >>> with another one is that undefined behavior? Do we now require >>> external function declarations to be in the proper region (declared under >>> the appropriate exec charset flag)? This would mean that passing >>> the exec charset in effect as additional argument isn't a possibility. >>> >>> Or do we have to treat -fexec-charset similar to -frounding-math, that is, >>> we can't ever _interpret_ any string in the compiler? [unless -fexec-charset >>> is the same everywhere] >>> >>> I think the -frounding-math route is probably the easiest (and wisest >>> given the quite low test coverage we'll get) route. Thus, add a -fmixed-charset >>> flag and reject any exec-charset attribute/pragma if that flag is not set? >>> With LTO we could always add this and/or merge -fexec-charset flags >>> appropriately, >>> injecting -fmixed-charset in case TUs use different settings. >> >> It wouldn't have to be an option, simply mark in cfun all functions that >> have more than one exec charset and give up on all optimizations/warnings >> that require to read the characters and merge that unknown exec_charset >> flag during inlining etc. Though, that might still not be enough, e.g. >> the whole function might have one exec charset, but a global const char [] >> variable might have another one and during optimization we might be looking >> at that. So perhaps it would need to be a per-TU flag merged during LTO. > > There's also IPA flow of strings between functions so unless mixing > exec charsets > invokes undefined behavior I can't see how a per-function flag would help. > > But yes, if we can reliably detect whether multiple exec charsets are > used in a TU > we can make this a flag that doesn't have to be set by the user. But that means > the pragma probably _always_ forces that flag given we have that > forced pre-included > file on some targest and the pragma token would occur after that... Would it make sense to mark the string literals itself as not using the default charset? Then we could disable all interpretations only for these strings instead of disabling it for the entire TU? -Andreas-
On Fri, Oct 20, 2017 at 1:19 PM, Andreas Krebbel <krebbel@linux.vnet.ibm.com> wrote: > On 10/20/2017 10:28 AM, Richard Biener wrote: >> On Fri, Oct 20, 2017 at 9:53 AM, Jakub Jelinek <jakub@redhat.com> wrote: >>> On Fri, Oct 20, 2017 at 09:48:38AM +0200, Richard Biener wrote: >>>> How does it work semantically to have different exec charsets? That is, >>>> if "strings" flow from a region with one -fexec-charset setting to a region >>>> with another one is that undefined behavior? Do we now require >>>> external function declarations to be in the proper region (declared under >>>> the appropriate exec charset flag)? This would mean that passing >>>> the exec charset in effect as additional argument isn't a possibility. >>>> >>>> Or do we have to treat -fexec-charset similar to -frounding-math, that is, >>>> we can't ever _interpret_ any string in the compiler? [unless -fexec-charset >>>> is the same everywhere] >>>> >>>> I think the -frounding-math route is probably the easiest (and wisest >>>> given the quite low test coverage we'll get) route. Thus, add a -fmixed-charset >>>> flag and reject any exec-charset attribute/pragma if that flag is not set? >>>> With LTO we could always add this and/or merge -fexec-charset flags >>>> appropriately, >>>> injecting -fmixed-charset in case TUs use different settings. >>> >>> It wouldn't have to be an option, simply mark in cfun all functions that >>> have more than one exec charset and give up on all optimizations/warnings >>> that require to read the characters and merge that unknown exec_charset >>> flag during inlining etc. Though, that might still not be enough, e.g. >>> the whole function might have one exec charset, but a global const char [] >>> variable might have another one and during optimization we might be looking >>> at that. So perhaps it would need to be a per-TU flag merged during LTO. >> >> There's also IPA flow of strings between functions so unless mixing >> exec charsets >> invokes undefined behavior I can't see how a per-function flag would help. >> >> But yes, if we can reliably detect whether multiple exec charsets are >> used in a TU >> we can make this a flag that doesn't have to be set by the user. But that means >> the pragma probably _always_ forces that flag given we have that >> forced pre-included >> file on some targest and the pragma token would occur after that... > > Would it make sense to mark the string literals itself as not using the default charset? Then we > could disable all interpretations only for these strings instead of disabling it for the entire TU? I think that would work, too. Though I'd then rather explicitely state the charset the string literal is in (for efficiency we'd then need some mapping of charset id to actual charset we store globally somewhere and which we'd need to stream and merge for LTO - the "default" would then always get zero and the default charset being streamed to LTO). Looks like tree_base.u.bits is unused for STRING_CST in the middle-end, you'd have to check FEs if they use a lang-specific flag though. Then we could stick the exec charset number there (32bit index even - whoo). Bah, C++ of course uses a single lang flag (PAREN_STRING_LITERAL_P). Sticking it in the literals type would work as well but I find that a bit ugly. We could reuse bits.address_space for a max of 256 exec charsets, a special value of 255 could indicate 'unknown, too many charsets' also used in an initial implementation without providing the actual mapping just distinguishing default from non-default. The interesting part is of course libcpp/cc1 interaction and getting this all right. Richard. > -Andreas- >
On Fri, Oct 20, 2017 at 1:34 PM, Richard Biener <richard.guenther@gmail.com> wrote: > On Fri, Oct 20, 2017 at 1:19 PM, Andreas Krebbel > <krebbel@linux.vnet.ibm.com> wrote: >> On 10/20/2017 10:28 AM, Richard Biener wrote: >>> On Fri, Oct 20, 2017 at 9:53 AM, Jakub Jelinek <jakub@redhat.com> wrote: >>>> On Fri, Oct 20, 2017 at 09:48:38AM +0200, Richard Biener wrote: >>>>> How does it work semantically to have different exec charsets? That is, >>>>> if "strings" flow from a region with one -fexec-charset setting to a region >>>>> with another one is that undefined behavior? Do we now require >>>>> external function declarations to be in the proper region (declared under >>>>> the appropriate exec charset flag)? This would mean that passing >>>>> the exec charset in effect as additional argument isn't a possibility. >>>>> >>>>> Or do we have to treat -fexec-charset similar to -frounding-math, that is, >>>>> we can't ever _interpret_ any string in the compiler? [unless -fexec-charset >>>>> is the same everywhere] >>>>> >>>>> I think the -frounding-math route is probably the easiest (and wisest >>>>> given the quite low test coverage we'll get) route. Thus, add a -fmixed-charset >>>>> flag and reject any exec-charset attribute/pragma if that flag is not set? >>>>> With LTO we could always add this and/or merge -fexec-charset flags >>>>> appropriately, >>>>> injecting -fmixed-charset in case TUs use different settings. >>>> >>>> It wouldn't have to be an option, simply mark in cfun all functions that >>>> have more than one exec charset and give up on all optimizations/warnings >>>> that require to read the characters and merge that unknown exec_charset >>>> flag during inlining etc. Though, that might still not be enough, e.g. >>>> the whole function might have one exec charset, but a global const char [] >>>> variable might have another one and during optimization we might be looking >>>> at that. So perhaps it would need to be a per-TU flag merged during LTO. >>> >>> There's also IPA flow of strings between functions so unless mixing >>> exec charsets >>> invokes undefined behavior I can't see how a per-function flag would help. >>> >>> But yes, if we can reliably detect whether multiple exec charsets are >>> used in a TU >>> we can make this a flag that doesn't have to be set by the user. But that means >>> the pragma probably _always_ forces that flag given we have that >>> forced pre-included >>> file on some targest and the pragma token would occur after that... >> >> Would it make sense to mark the string literals itself as not using the default charset? Then we >> could disable all interpretations only for these strings instead of disabling it for the entire TU? > > I think that would work, too. Though I'd then rather explicitely > state the charset the string literal is in > (for efficiency we'd then need some mapping of charset id to actual > charset we store globally somewhere > and which we'd need to stream and merge for LTO - the "default" would > then always get zero and > the default charset being streamed to LTO). Looks like > tree_base.u.bits is unused for STRING_CST > in the middle-end, you'd have to check FEs if they use a lang-specific > flag though. Then we could > stick the exec charset number there (32bit index even - whoo). Bah, > C++ of course uses a single > lang flag (PAREN_STRING_LITERAL_P). Sticking it in the literals type > would work as well but > I find that a bit ugly. We could reuse bits.address_space for a max > of 256 exec charsets, > a special value of 255 could indicate 'unknown, too many charsets' > also used in an initial implementation > without providing the actual mapping just distinguishing default from > non-default. > > The interesting part is of course libcpp/cc1 interaction and getting > this all right. Oh, and there are plenty of bits unused for STRING_CST so if the C++ FE could stop using lang specific tree bits we could shrink tree_string by moving length to tree_base.u. Re-using address_space would block this improvement. Finding a single bit for default vs. non-default wouldn't. Richard. > Richard. > >> -Andreas- >>
On 10/19/2017 07:13 PM, Martin Sebor wrote: > On 10/19/2017 09:50 AM, Andreas Krebbel wrote: >> The TPF operating system uses the GCC S/390 backend. They set an >> EBCDIC exec charset for compilation using -fexec-charset. However, >> certain libraries require ASCII strings instead. In order to be able >> to put calls to that library into the normal code it is required to >> switch the exec charset within a compilation unit. >> >> This is an attempt to implement it by adding a new pragma which could >> be used like in the following example: >> >> int >> foo () >> { >> call_with_utf8("hello world"); >> >> #pragma GCC exec_charset("UTF16") >> call_with_utf16("hello world"); >> >> #pragma GCC exec_charset(pop) >> call_with_utf8("hello world"); >> } >> >> Does this look reasonable? > > I'm not an expert on this but at a high level it looks reasonable > to me. But based on some small amount of work I did in this area > I have a couple of questions. > > There are a few places in the compiler that already do or that > should but don't yet handle different execution character sets. > The former include built-ins like __bultin_isdigit() and > __builtin_sprintf (in both builtins.c and gimple-ssa-sprintf.c) > The latter is the -Wformat checking done by the C and C++ front > ends. The missing support for the latter is the subject of bug > 38308. According to bug 81686, LTO is apparently also missing > support for exec-charset. These probably are the areas Richard and Jakub were referring to as well?! These cases did not work properly with the -fexec-charset cmdline option and this does not change with the pragma. I'll try to look at what has been proposed in the discussion. Perhaps I can get it working somehow. > I'm curious how the pragma might interact with these two areas, > and whether the lack of support for it in the latter is a concern > (and if not, why not). For the former, I'm also wondering about > the interaction of inlining and other interprocedural optimizations > with the pragma. Does it propagate through inlined calls as one > would expect? The pragma does not apply to the callees of a function defined under the pragma regardless of whether it gets inlined or not. That matches the behavior of other pragmas. If it would apply to inlined callees the program semantics might change depending on optimization decisions i.e. whether a certain call got inlined or not. Callees marked as always_inline might be discussed separately. I remember this being a topic when looking at function attributes. Bye, -Andreas-
On 10/23/2017 04:55 AM, Andreas Krebbel wrote: > On 10/19/2017 07:13 PM, Martin Sebor wrote: >> On 10/19/2017 09:50 AM, Andreas Krebbel wrote: >>> The TPF operating system uses the GCC S/390 backend. They set an >>> EBCDIC exec charset for compilation using -fexec-charset. However, >>> certain libraries require ASCII strings instead. In order to be able >>> to put calls to that library into the normal code it is required to >>> switch the exec charset within a compilation unit. >>> >>> This is an attempt to implement it by adding a new pragma which could >>> be used like in the following example: >>> >>> int >>> foo () >>> { >>> call_with_utf8("hello world"); >>> >>> #pragma GCC exec_charset("UTF16") >>> call_with_utf16("hello world"); >>> >>> #pragma GCC exec_charset(pop) >>> call_with_utf8("hello world"); >>> } >>> >>> Does this look reasonable? >> >> I'm not an expert on this but at a high level it looks reasonable >> to me. But based on some small amount of work I did in this area >> I have a couple of questions. >> >> There are a few places in the compiler that already do or that >> should but don't yet handle different execution character sets. >> The former include built-ins like __bultin_isdigit() and >> __builtin_sprintf (in both builtins.c and gimple-ssa-sprintf.c) >> The latter is the -Wformat checking done by the C and C++ front >> ends. The missing support for the latter is the subject of bug >> 38308. According to bug 81686, LTO is apparently also missing >> support for exec-charset. > > These probably are the areas Richard and Jakub were referring to as well?! These cases did not work > properly with the -fexec-charset cmdline option and this does not change with the pragma. I'll try > to look at what has been proposed in the discussion. Perhaps I can get it working somehow. Right, the patch doesn't remove the known deficiencies. But by providing another knob to control the execution charset, at a fine grain level, it encourages users to make greater use of the (incomplete) exec-charset support and increases the odds that they will run afoul of them. It seems to me that before exposing a new mechanism to control the exec charset it would be prudent to a) plug at least the biggest holes to make the feature more reliable (in my mind, that's at least -Wformat), and b) make sure the pragma interacts correctly with existing features that work correctly with the -fexec-charset option. Where it doesn't and where it cannot be made to work correctly (i.e., is undefined), I would expect an effort to be made to detect and diagnose those undefined interactions if possible, or if that's too difficult, at a minimum document them. >> I'm curious how the pragma might interact with these two areas, >> and whether the lack of support for it in the latter is a concern >> (and if not, why not). For the former, I'm also wondering about >> the interaction of inlining and other interprocedural optimizations >> with the pragma. Does it propagate through inlined calls as one >> would expect? > > The pragma does not apply to the callees of a function defined under the pragma regardless of > whether it gets inlined or not. That matches the behavior of other pragmas. If it would apply to > inlined callees the program semantics might change depending on optimization decisions i.e. whether > a certain call got inlined or not. > > Callees marked as always_inline might be discussed separately. I remember this being a topic when > looking at function attributes. My concern with this pragma/attribute and inlining has to do with strings in one exec charset being propagated into functions that operate on strings in another charset. E.g., like in the test case below that's "miscompiled" with your patch -- the first test for n == 7 is eliminated and the buffer overflow is not detected. If this cannot be made to work then I think some effort should be made to detect this mixing and matching and existing optimizations that assume the same charset (like the sprintf one does) disabled. static inline int f (char *d, const char *fmt) { #pragma GCC exec_charset ("utf8") int n = __builtin_sprintf (d, fmt, 12345); #pragma GCC exec_charset (pop) if (n == 7) // incorrectly optimized away __builtin_abort (); return n; } int main (void) { char d[5]; #pragma GCC exec_charset ("EBCDIC-US") int n = f (d, "i=%i"); // buffer overflow not detected #pragma GCC exec_charset (pop) __builtin_printf ("%i (%lu): %s\n", n, __builtin_strlen (d), d); if (n != 7) // aborts at runtime __builtin_abort (); } Martin
On 10/23/2017 06:14 PM, Martin Sebor wrote: ... > It seems to me that before exposing a new mechanism to control > the exec charset it would be prudent to a) plug at least the > biggest holes to make the feature more reliable (in my mind, > that's at least -Wformat), and b) make sure the pragma interacts > correctly with existing features that work correctly with the > -fexec-charset option. Where it doesn't and where it cannot > be made to work correctly (i.e., is undefined), I would expect > an effort to be made to detect and diagnose those undefined > interactions if possible, or if that's too difficult, at > a minimum document them. Agreed. Getting a better understanding of the deficiencies with that mechanism and at least document them is an important step. ... > My concern with this pragma/attribute and inlining has to do with > strings in one exec charset being propagated into functions that > operate on strings in another charset. E.g., like in the test > case below that's "miscompiled" with your patch -- the first test > for n == 7 is eliminated and the buffer overflow is not detected. > If this cannot be made to work then I think some effort should be > made to detect this mixing and matching and existing optimizations > that assume the same charset (like the sprintf one does) disabled. > > static inline int > f (char *d, const char *fmt) > { > #pragma GCC exec_charset ("utf8") > int n = __builtin_sprintf (d, fmt, 12345); > #pragma GCC exec_charset (pop) > > if (n == 7) // incorrectly optimized away > __builtin_abort (); To my understanding the problem in your example isn't triggered through inlining. The check gets optimized away because sprintf is called with "i=%i" being converted to EBCDIC first. The sprintf implementation does not recognize the format letter and hence only prints the converted format string without inserting the integer. That's kinda expected I would say. You call a function which is expected to deal with a UTF-8 string with an EBCDIC string. GCC is reasoning about what the sprintf invocation would return and removes the condition since it would never be triggered. Compiling with -O0 makes the program behave the same way even without the inlining and the check not being removed. It would also happen without using the pragma but compiling with -fexec-charset=EBCDIC-US instead. E.g. compiling this program with -fexec-charset=EBCDIC-US does abort while it does not abort without using -fexec-charset: int main (void) { char d[5]; int n = __builtin_sprintf (d, "i=%i", 12345); if (n != 7) __builtin_abort (); } I'm not sure what kind of warning to expect for such code? Would it be helpful if the compiler tells that the exec_charset setting does not affect "fmt" for this code?! #pragma GCC exec_charset ("EBCDIC-US") int n = __builtin_sprintf (d, fmt, 12345); #pragma GCC exec_charset (pop) On the other hand, diagnostics like this might create false positives. One reason of course is that this might be intentional. But it might also be because in C it is difficult to detect whether a char* variable will eventually be interpreted as string or not. While I agree with you that the feature is not that easy to use and more diagnostics would be good I'm not really sure what to do about it. Bye, -Andreas-
>> My concern with this pragma/attribute and inlining has to do with >> strings in one exec charset being propagated into functions that >> operate on strings in another charset. E.g., like in the test >> case below that's "miscompiled" with your patch -- the first test >> for n == 7 is eliminated and the buffer overflow is not detected. >> If this cannot be made to work then I think some effort should be >> made to detect this mixing and matching and existing optimizations >> that assume the same charset (like the sprintf one does) disabled. >> >> static inline int >> f (char *d, const char *fmt) >> { >> #pragma GCC exec_charset ("utf8") >> int n = __builtin_sprintf (d, fmt, 12345); >> #pragma GCC exec_charset (pop) >> >> if (n == 7) // incorrectly optimized away >> __builtin_abort (); > > To my understanding the problem in your example isn't triggered through inlining. > The check gets optimized away because sprintf is called with "i=%i" being converted to EBCDIC first. It's optimized away because the sprintf pass interprets the constant contents of the format string, which it only does when the function is inlined into its caller. Otherwise, without inlining, it treats the format string as an unknown and so it cannot do the buffer overflow analysis or the optimization, and everything works fine at runtime. > The sprintf implementation does not recognize the format letter and hence only prints the converted > format string without inserting the integer. That's kinda expected I would say. You call a function > which is expected to deal with a UTF-8 string with an EBCDIC string. GCC is reasoning about what the > sprintf invocation would return and removes the condition since it would never be triggered. > Compiling with -O0 makes the program behave the same way even without the inlining and the check not > being removed. Right, at -O0 the function isn't inlined so it behaves sensibly. The buffer overflow analysis doesn't happen but that is expected and documented. > It would also happen without using the pragma but compiling with -fexec-charset=EBCDIC-US instead. > E.g. compiling this program with -fexec-charset=EBCDIC-US does abort while it does not abort without > using -fexec-charset: > > int main (void) > { > char d[5]; > > int n = __builtin_sprintf (d, "i=%i", 12345); > > if (n != 7) > __builtin_abort (); > } It aborts in GCC 7 because its sprintf pass doesn't support -fexec-charset. The sprintf pass in GCC 8 does support it so it's supposed to work correctly. It does with -fexec-charse=IBM1047, the only charset I tested the implementation with. The output is: y.c: In function ‘main’: y.c:7:34: warning: too many arguments for format [-Wformat-extra-args] int n = __builtin_sprintf (d, "i=%i", 12345); ^~~~~~ y.c:7:34: warning: ‘%i’ directive writing 5 bytes into a region of size 3 [-Wformat-overflow=] int n = __builtin_sprintf (d, "i=%i", 12345); ^~~~~~ y.c:7:12: note: ‘__builtin_sprintf’ output 8 bytes into a destination of size 5 int n = __builtin_sprintf (d, "i=%i", 12345); The first warning is wrong, caused by the missing support for -fexec-charset in -Wformat. But the other warning is expected and correct. The abort seems like a bug in the implementation, although since the buffer overflows the behavior is undefined so one could argue the abort is okay. I will look into what's going on and fix it. When there is no buffer overflow there is no abort, and that's the correct behavior that needs to be preserved with the pragma. For some reason, GCC 8 gets an ICE when -fexec-charset=EBCDIC-US. I opened pr82700 to investigate and fix this bug. > I'm not sure what kind of warning to expect for such code? Would it be helpful if the compiler tells > that the exec_charset setting does not affect "fmt" for this code?! > > #pragma GCC exec_charset ("EBCDIC-US") > int n = __builtin_sprintf (d, fmt, 12345); > #pragma GCC exec_charset (pop) Other than making it work as one would expect I think the only reasonable approach would be to disable the optimization and analysis. Losing the optimization wouldn't be a big deal, but it would be very unfortunate to have to trade off the pragma for the sprintf buffer overflow detection. But I'd like to think we don't have to make such a tradeoff. > On the other hand, diagnostics like this might create false positives. One reason of course is that > this might be intentional. But it might also be because in C it is difficult to detect whether a > char* variable will eventually be interpreted as string or not. > > While I agree with you that the feature is not that easy to use and more diagnostics would be good > I'm not really sure what to do about it. I think the suggestion to associate each string with its own charset is a good one. That should let the sprintf pass (and others like it) do the right thing even as strings get inlined and charsets are mixed and matched, e.g., like so: static inline void f (const char *f) { #pragma GCC exec_charset ("UTF-8") const char format[] = "fmt = %s\n"; // format in UTF-8 printf (format, f); // f interpreted in EBCDIC-US printf (f, 12345); // ditto here #pragma GCC exec_charset (pop) } void g (void) { #pragma GCC exec_charset ("EBCDIC-US") f ("i = %i\n"); // literal in EBCDIC-US #pragma GCC exec_charset (pop) } Martin
On 10/24/2017 05:33 PM, Martin Sebor wrote: >>> My concern with this pragma/attribute and inlining has to do with >>> strings in one exec charset being propagated into functions that >>> operate on strings in another charset. E.g., like in the test >>> case below that's "miscompiled" with your patch -- the first test >>> for n == 7 is eliminated and the buffer overflow is not detected. >>> If this cannot be made to work then I think some effort should be >>> made to detect this mixing and matching and existing optimizations >>> that assume the same charset (like the sprintf one does) disabled. >>> >>> static inline int >>> f (char *d, const char *fmt) >>> { >>> #pragma GCC exec_charset ("utf8") >>> int n = __builtin_sprintf (d, fmt, 12345); >>> #pragma GCC exec_charset (pop) >>> >>> if (n == 7) // incorrectly optimized away >>> __builtin_abort (); >> >> To my understanding the problem in your example isn't triggered through inlining. >> The check gets optimized away because sprintf is called with "i=%i" being converted to EBCDIC first. > > It's optimized away because the sprintf pass interprets the constant > contents of the format string, which it only does when the function > is inlined into its caller. Otherwise, without inlining, it treats > the format string as an unknown and so it cannot do the buffer > overflow analysis or the optimization, and everything works fine > at runtime. I assumed that GCC is supposed to emulate the format string parsing of sprintf as if it were compiled with the default execcharset. But instead it should emulate sprintf as if it was compiled with the same exec-charset as the currently compiled code - right?! So for an example like this: char d[3]; #pragma GCC exec_charset ("EBCDIC-US") char *s = "i=%i"; #pragma GCC exec_charset (pop) __builtin_sprintf (d, s, 1); It currently warns: t.c:6:13: warning: ‘~l’ directive writing 4 bytes into a region of size 3 [-Wformat-overflow=] char *s = "i=%i"; ^~~~~~ t.c:9:3: note: ‘__builtin_sprintf’ output 5 bytes into a destination of size 3 __builtin_sprintf (d, s, 1); ^~~~~~~~~~~~~~~~~~~~~~~~~~~ That's wrong and happens because the GCC sprintf format string parsing uses the default charset instead of the charset which has been used when originally parsing the string literal. Hence it does not recognize the %i format specifier and the output string remains 4 bytes in length. As Richard suggested we probably have to record the exec_charset in each string literal and use this when trying to interpret it. -Andreas-
diff --git a/gcc/c-family/c-pragma.c b/gcc/c-family/c-pragma.c index f7b59b3..db281b9 100644 --- a/gcc/c-family/c-pragma.c +++ b/gcc/c-family/c-pragma.c @@ -34,6 +34,8 @@ along with GCC; see the file COPYING3. If not see #include "opts.h" #include "plugin.h" +extern cpp_options *cpp_opts; + #define GCC_BAD(gmsgid) \ do { warning (OPT_Wpragmas, gmsgid); return; } while (0) #define GCC_BAD2(gmsgid, arg) \ @@ -1141,6 +1143,52 @@ handle_pragma_message (cpp_reader *ARG_UNUSED(dummy)) inform (input_location, "#pragma message: %s", TREE_STRING_POINTER (message)); } +static void +handle_pragma_exec_charset (cpp_reader *ARG_UNUSED(dummy)) +{ + enum cpp_ttype token; + tree x; + static const char* previous_charset = NULL; + + token = pragma_lex (&x); + if (token == CPP_OPEN_PAREN) + { + token = pragma_lex (&x); + if (token == CPP_STRING) + { + previous_charset = cpp_opts->narrow_charset; + cpp_opts->narrow_charset = TREE_STRING_POINTER (x); + } + else if (token == CPP_NAME + && strncmp (IDENTIFIER_POINTER (x), "pop", 3) == 0) + { + if (previous_charset == NULL) + { + warning (OPT_Wpragmas, + "pop without previous exec_charset use - ignored"); + return; + } + cpp_opts->narrow_charset = previous_charset; + previous_charset = NULL; + } + else + GCC_BAD ("expected a charset string or pop after %<#pragma exec_charset%>"); + + if (pragma_lex (&x) != CPP_CLOSE_PAREN) + GCC_BAD ("malformed %<#pragma exec_charset%>, ignored"); + } + else + GCC_BAD ("expected a string after %<#pragma exec_charset%>"); + + if (pragma_lex (&x) != CPP_EOF) + warning (OPT_Wpragmas, "junk at end of %<#pragma exec_charset%>"); + + inform (input_location, "switching to exec charset: %s", + cpp_opts->narrow_charset); + cpp_destroy_iconv (parse_in); + cpp_init_iconv (parse_in); +} + /* Mark whether the current location is valid for a STDC pragma. */ static bool valid_location_for_stdc_pragma; @@ -1571,6 +1619,8 @@ init_pragma (void) handle_pragma_redefine_extname); c_register_pragma_with_expansion (0, "message", handle_pragma_message); + c_register_pragma_with_expansion ("GCC", "exec_charset", + handle_pragma_exec_charset); #ifdef REGISTER_TARGET_PRAGMAS REGISTER_TARGET_PRAGMAS (); diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi index d9b7a54..b67993a 100644 --- a/gcc/doc/extend.texi +++ b/gcc/doc/extend.texi @@ -21611,6 +21611,7 @@ for further explanation. * Push/Pop Macro Pragmas:: * Function Specific Option Pragmas:: * Loop-Specific Pragmas:: +* Charset-Specific Pragmas:: @end menu @node AArch64 Pragmas @@ -22209,6 +22210,31 @@ void ignore_vec_dep (int *a, int k, int c, int m) @} @end smallexample +@node Charset-Specific Pragmas +@subsection Charset-Specific Pragmas + +@table @code +@item #pragma GCC exec_charset(@var{"charset"}) +@cindex pragma GCC exec_charset + +Set the execution character set, used for string and character +constants. The default is the exec charset specified with +@option{-fexec-charset} or UTF-8 if @option{-fexec-charset} isn't used. +charset can be any encoding supported by the system's "iconv" library +routine. The special value @var{pop} (without ``) can be +used to switch back to the exec charset before the last @code{#pragma +GCC exec_charset} setting. +@end table + +@smallexample +call_with_utf8("hello world"); + +#pragma GCC exec_charset("UTF16") +call_with_utf16("hello world"); + +#pragma GCC exec_charset(pop) +call_with_utf8("hello world"); +@end smallexample @node Unnamed Fields @section Unnamed Structure and Union Fields diff --git a/gcc/testsuite/gcc.dg/pragma-exec_charset-1.c b/gcc/testsuite/gcc.dg/pragma-exec_charset-1.c new file mode 100644 index 0000000..5c695aa --- /dev/null +++ b/gcc/testsuite/gcc.dg/pragma-exec_charset-1.c @@ -0,0 +1,26 @@ +/* { dg-do run } */ + +#include <stdio.h> + +char t1[] = "hello world"; +#pragma GCC exec_charset("EBCDIC-US") +char t2[] = "hello world"; +#pragma GCC exec_charset(pop) +char t3[] = "hello world"; + +char hello_world_utf8[12] = { 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x00 }; +char hello_world_ebcdic[12] = { 0x88, 0x85, 0x93, 0x93, 0x96, 0x40, 0xa6, 0x96, 0x99, 0x93, 0x84, 0x00 }; + +int +main () +{ + if (__builtin_memcmp (t1, hello_world_utf8, 12) != 0) + __builtin_abort (); + + if (__builtin_memcmp (t2, hello_world_ebcdic, 12) != 0) + __builtin_abort (); + + if (__builtin_memcmp (t3, hello_world_utf8, 12) != 0) + __builtin_abort (); + +} diff --git a/libcpp/charset.c b/libcpp/charset.c index 6a3bbbc..47fa406 100644 --- a/libcpp/charset.c +++ b/libcpp/charset.c @@ -737,7 +737,7 @@ cpp_init_iconv (cpp_reader *pfile) /* Destroy iconv(3) descriptors set up by cpp_init_iconv, if necessary. */ void -_cpp_destroy_iconv (cpp_reader *pfile) +cpp_destroy_iconv (cpp_reader *pfile) { if (HAVE_ICONV) { diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h index 804132a..acbdf5a 100644 --- a/libcpp/include/cpplib.h +++ b/libcpp/include/cpplib.h @@ -861,6 +861,9 @@ extern void cpp_post_options (cpp_reader *); /* Set up translation to the target character set. */ extern void cpp_init_iconv (cpp_reader *); +/* Cleanup translation to the target character set. */ +extern void cpp_destroy_iconv (cpp_reader *); + /* Call this to finish preprocessing. If you requested dependency generation, pass an open stream to write the information to, otherwise NULL. It is your responsibility to close the stream. */ diff --git a/libcpp/init.c b/libcpp/init.c index 16ff202..4e68645 100644 --- a/libcpp/init.c +++ b/libcpp/init.c @@ -314,7 +314,7 @@ cpp_destroy (cpp_reader *pfile) _cpp_destroy_hashtable (pfile); _cpp_cleanup_files (pfile); - _cpp_destroy_iconv (pfile); + cpp_destroy_iconv (pfile); _cpp_free_buff (pfile->a_buff); _cpp_free_buff (pfile->u_buff); diff --git a/libcpp/internal.h b/libcpp/internal.h index f24e85c..ce2d902 100644 --- a/libcpp/internal.h +++ b/libcpp/internal.h @@ -757,7 +757,6 @@ extern bool _cpp_valid_ucn (cpp_reader *, const unsigned char **, cppchar_t *, source_range *char_range, cpp_string_location_reader *loc_reader); -extern void _cpp_destroy_iconv (cpp_reader *); extern unsigned char *_cpp_convert_input (cpp_reader *, const char *, unsigned char *, size_t, size_t, const unsigned char **, off_t *);