diff mbox series

[RFC] New pragma exec_charset

Message ID 20171019155031.6441-1-krebbel@linux.vnet.ibm.com
State New
Headers show
Series [RFC] New pragma exec_charset | expand

Commit Message

Andreas Krebbel Oct. 19, 2017, 3:50 p.m. UTC
The TPF operating system uses the GCC S/390 backend.  They set an
EBCDIC exec charset for compilation using -fexec-charset.  However,
certain libraries require ASCII strings instead.  In order to be able
to put calls to that library into the normal code it is required to
switch the exec charset within a compilation unit.

This is an attempt to implement it by adding a new pragma which could
be used like in the following example:

int
foo ()
{
  call_with_utf8("hello world");

#pragma GCC exec_charset("UTF16")
  call_with_utf16("hello world");

#pragma GCC exec_charset(pop)
  call_with_utf8("hello world");
}

Does this look reasonable?

Bye,

-Andreas-
---
 gcc/c-family/c-pragma.c                      | 50 ++++++++++++++++++++++++++++
 gcc/doc/extend.texi                          | 26 +++++++++++++++
 gcc/testsuite/gcc.dg/pragma-exec_charset-1.c | 26 +++++++++++++++
 libcpp/charset.c                             |  2 +-
 libcpp/include/cpplib.h                      |  3 ++
 libcpp/init.c                                |  2 +-
 libcpp/internal.h                            |  1 -
 7 files changed, 107 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pragma-exec_charset-1.c

Comments

Martin Sebor Oct. 19, 2017, 5:13 p.m. UTC | #1
On 10/19/2017 09:50 AM, Andreas Krebbel wrote:
> The TPF operating system uses the GCC S/390 backend.  They set an
> EBCDIC exec charset for compilation using -fexec-charset.  However,
> certain libraries require ASCII strings instead.  In order to be able
> to put calls to that library into the normal code it is required to
> switch the exec charset within a compilation unit.
>
> This is an attempt to implement it by adding a new pragma which could
> be used like in the following example:
>
> int
> foo ()
> {
>   call_with_utf8("hello world");
>
> #pragma GCC exec_charset("UTF16")
>   call_with_utf16("hello world");
>
> #pragma GCC exec_charset(pop)
>   call_with_utf8("hello world");
> }
>
> Does this look reasonable?

I'm not an expert on this but at a high level it looks reasonable
to me.  But based on some small amount of work I did in this area
I have a couple of questions.

There are a few places in the compiler that already do or that
should but don't yet handle different execution character sets.
The former include built-ins like __bultin_isdigit() and
__builtin_sprintf (in both builtins.c and gimple-ssa-sprintf.c)
The latter is the -Wformat checking done by the C and C++ front
ends.  The missing support for the latter is the subject of bug
38308.  According to bug 81686, LTO is apparently also missing
support for exec-charset.

I'm curious how the pragma might interact with these two areas,
and whether the lack of support for it in the latter is a concern
(and if not, why not).  For the former, I'm also wondering about
the interaction of inlining and other interprocedural optimizations
with the pragma.  Does it propagate through inlined calls as one
would expect?

Thanks
Martin
Joseph Myers Oct. 19, 2017, 5:52 p.m. UTC | #2
On Thu, 19 Oct 2017, Andreas Krebbel wrote:

>  gcc/testsuite/gcc.dg/pragma-exec_charset-1.c | 26 +++++++++++++++

I'd expect a c-c++-common test rather than a C-specific one, given there 
are significant differences in how the C and C++ front ends handle lexing.

> +#pragma GCC exec_charset("UTF16")
> +call_with_utf16("hello world");

I don't think UTF16 makes sense as an example, as it's not a multibyte 
character set (characters are 16-bit, and with 8-bit bytes that means many 
characters contain NUL bytes, which is not allowed for multibyte character 
sets).
Richard Biener Oct. 20, 2017, 7:48 a.m. UTC | #3
On Thu, Oct 19, 2017 at 7:13 PM, Martin Sebor <msebor@gmail.com> wrote:
> On 10/19/2017 09:50 AM, Andreas Krebbel wrote:
>>
>> The TPF operating system uses the GCC S/390 backend.  They set an
>> EBCDIC exec charset for compilation using -fexec-charset.  However,
>> certain libraries require ASCII strings instead.  In order to be able
>> to put calls to that library into the normal code it is required to
>> switch the exec charset within a compilation unit.
>>
>> This is an attempt to implement it by adding a new pragma which could
>> be used like in the following example:
>>
>> int
>> foo ()
>> {
>>   call_with_utf8("hello world");
>>
>> #pragma GCC exec_charset("UTF16")
>>   call_with_utf16("hello world");
>>
>> #pragma GCC exec_charset(pop)
>>   call_with_utf8("hello world");
>> }
>>
>> Does this look reasonable?
>
>
> I'm not an expert on this but at a high level it looks reasonable
> to me.  But based on some small amount of work I did in this area
> I have a couple of questions.
>
> There are a few places in the compiler that already do or that
> should but don't yet handle different execution character sets.
> The former include built-ins like __bultin_isdigit() and
> __builtin_sprintf (in both builtins.c and gimple-ssa-sprintf.c)
> The latter is the -Wformat checking done by the C and C++ front
> ends.  The missing support for the latter is the subject of bug
> 38308.  According to bug 81686, LTO is apparently also missing
> support for exec-charset.
>
> I'm curious how the pragma might interact with these two areas,
> and whether the lack of support for it in the latter is a concern
> (and if not, why not).  For the former, I'm also wondering about
> the interaction of inlining and other interprocedural optimizations
> with the pragma.  Does it propagate through inlined calls as one
> would expect?

How does it work semantically to have different exec charsets?  That is,
if "strings" flow from a region with one -fexec-charset setting to a region
with another one is that undefined behavior?  Do we now require
external function declarations to be in the proper region (declared under
the appropriate exec charset flag)?  This would mean that passing
the exec charset in effect as additional argument isn't a possibility.

Or do we have to treat -fexec-charset similar to -frounding-math, that is,
we can't ever _interpret_ any string in the compiler?  [unless -fexec-charset
is the same everywhere]

I think the -frounding-math route is probably the easiest (and wisest
given the quite low test coverage we'll get) route.  Thus, add a -fmixed-charset
flag and reject any exec-charset attribute/pragma if that flag is not set?
With LTO we could always add this and/or merge -fexec-charset flags
appropriately,
injecting -fmixed-charset in case TUs use different settings.

Richard.


> Thanks
> Martin
>
Jakub Jelinek Oct. 20, 2017, 7:53 a.m. UTC | #4
On Fri, Oct 20, 2017 at 09:48:38AM +0200, Richard Biener wrote:
> How does it work semantically to have different exec charsets?  That is,
> if "strings" flow from a region with one -fexec-charset setting to a region
> with another one is that undefined behavior?  Do we now require
> external function declarations to be in the proper region (declared under
> the appropriate exec charset flag)?  This would mean that passing
> the exec charset in effect as additional argument isn't a possibility.
> 
> Or do we have to treat -fexec-charset similar to -frounding-math, that is,
> we can't ever _interpret_ any string in the compiler?  [unless -fexec-charset
> is the same everywhere]
> 
> I think the -frounding-math route is probably the easiest (and wisest
> given the quite low test coverage we'll get) route.  Thus, add a -fmixed-charset
> flag and reject any exec-charset attribute/pragma if that flag is not set?
> With LTO we could always add this and/or merge -fexec-charset flags
> appropriately,
> injecting -fmixed-charset in case TUs use different settings.

It wouldn't have to be an option, simply mark in cfun all functions that
have more than one exec charset and give up on all optimizations/warnings
that require to read the characters and merge that unknown exec_charset
flag during inlining etc.  Though, that might still not be enough, e.g.
the whole function might have one exec charset, but a global const char []
variable might have another one and during optimization we might be looking
at that.  So perhaps it would need to be a per-TU flag merged during LTO.

	Jakub
Richard Biener Oct. 20, 2017, 8:28 a.m. UTC | #5
On Fri, Oct 20, 2017 at 9:53 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Fri, Oct 20, 2017 at 09:48:38AM +0200, Richard Biener wrote:
>> How does it work semantically to have different exec charsets?  That is,
>> if "strings" flow from a region with one -fexec-charset setting to a region
>> with another one is that undefined behavior?  Do we now require
>> external function declarations to be in the proper region (declared under
>> the appropriate exec charset flag)?  This would mean that passing
>> the exec charset in effect as additional argument isn't a possibility.
>>
>> Or do we have to treat -fexec-charset similar to -frounding-math, that is,
>> we can't ever _interpret_ any string in the compiler?  [unless -fexec-charset
>> is the same everywhere]
>>
>> I think the -frounding-math route is probably the easiest (and wisest
>> given the quite low test coverage we'll get) route.  Thus, add a -fmixed-charset
>> flag and reject any exec-charset attribute/pragma if that flag is not set?
>> With LTO we could always add this and/or merge -fexec-charset flags
>> appropriately,
>> injecting -fmixed-charset in case TUs use different settings.
>
> It wouldn't have to be an option, simply mark in cfun all functions that
> have more than one exec charset and give up on all optimizations/warnings
> that require to read the characters and merge that unknown exec_charset
> flag during inlining etc.  Though, that might still not be enough, e.g.
> the whole function might have one exec charset, but a global const char []
> variable might have another one and during optimization we might be looking
> at that.  So perhaps it would need to be a per-TU flag merged during LTO.

There's also IPA flow of strings between functions so unless mixing
exec charsets
invokes undefined behavior I can't see how a per-function flag would help.

But yes, if we can reliably detect whether multiple exec charsets are
used in a TU
we can make this a flag that doesn't have to be set by the user.  But that means
the pragma probably _always_ forces that flag given we have that
forced pre-included
file on some targest and the pragma token would occur after that...

Richard.

>         Jakub
Andreas Krebbel Oct. 20, 2017, 11:19 a.m. UTC | #6
On 10/20/2017 10:28 AM, Richard Biener wrote:
> On Fri, Oct 20, 2017 at 9:53 AM, Jakub Jelinek <jakub@redhat.com> wrote:
>> On Fri, Oct 20, 2017 at 09:48:38AM +0200, Richard Biener wrote:
>>> How does it work semantically to have different exec charsets?  That is,
>>> if "strings" flow from a region with one -fexec-charset setting to a region
>>> with another one is that undefined behavior?  Do we now require
>>> external function declarations to be in the proper region (declared under
>>> the appropriate exec charset flag)?  This would mean that passing
>>> the exec charset in effect as additional argument isn't a possibility.
>>>
>>> Or do we have to treat -fexec-charset similar to -frounding-math, that is,
>>> we can't ever _interpret_ any string in the compiler?  [unless -fexec-charset
>>> is the same everywhere]
>>>
>>> I think the -frounding-math route is probably the easiest (and wisest
>>> given the quite low test coverage we'll get) route.  Thus, add a -fmixed-charset
>>> flag and reject any exec-charset attribute/pragma if that flag is not set?
>>> With LTO we could always add this and/or merge -fexec-charset flags
>>> appropriately,
>>> injecting -fmixed-charset in case TUs use different settings.
>>
>> It wouldn't have to be an option, simply mark in cfun all functions that
>> have more than one exec charset and give up on all optimizations/warnings
>> that require to read the characters and merge that unknown exec_charset
>> flag during inlining etc.  Though, that might still not be enough, e.g.
>> the whole function might have one exec charset, but a global const char []
>> variable might have another one and during optimization we might be looking
>> at that.  So perhaps it would need to be a per-TU flag merged during LTO.
> 
> There's also IPA flow of strings between functions so unless mixing
> exec charsets
> invokes undefined behavior I can't see how a per-function flag would help.
> 
> But yes, if we can reliably detect whether multiple exec charsets are
> used in a TU
> we can make this a flag that doesn't have to be set by the user.  But that means
> the pragma probably _always_ forces that flag given we have that
> forced pre-included
> file on some targest and the pragma token would occur after that...

Would it make sense to mark the string literals itself as not using the default charset? Then we
could disable all interpretations only for these strings instead of disabling it for the entire TU?

-Andreas-
Richard Biener Oct. 20, 2017, 11:34 a.m. UTC | #7
On Fri, Oct 20, 2017 at 1:19 PM, Andreas Krebbel
<krebbel@linux.vnet.ibm.com> wrote:
> On 10/20/2017 10:28 AM, Richard Biener wrote:
>> On Fri, Oct 20, 2017 at 9:53 AM, Jakub Jelinek <jakub@redhat.com> wrote:
>>> On Fri, Oct 20, 2017 at 09:48:38AM +0200, Richard Biener wrote:
>>>> How does it work semantically to have different exec charsets?  That is,
>>>> if "strings" flow from a region with one -fexec-charset setting to a region
>>>> with another one is that undefined behavior?  Do we now require
>>>> external function declarations to be in the proper region (declared under
>>>> the appropriate exec charset flag)?  This would mean that passing
>>>> the exec charset in effect as additional argument isn't a possibility.
>>>>
>>>> Or do we have to treat -fexec-charset similar to -frounding-math, that is,
>>>> we can't ever _interpret_ any string in the compiler?  [unless -fexec-charset
>>>> is the same everywhere]
>>>>
>>>> I think the -frounding-math route is probably the easiest (and wisest
>>>> given the quite low test coverage we'll get) route.  Thus, add a -fmixed-charset
>>>> flag and reject any exec-charset attribute/pragma if that flag is not set?
>>>> With LTO we could always add this and/or merge -fexec-charset flags
>>>> appropriately,
>>>> injecting -fmixed-charset in case TUs use different settings.
>>>
>>> It wouldn't have to be an option, simply mark in cfun all functions that
>>> have more than one exec charset and give up on all optimizations/warnings
>>> that require to read the characters and merge that unknown exec_charset
>>> flag during inlining etc.  Though, that might still not be enough, e.g.
>>> the whole function might have one exec charset, but a global const char []
>>> variable might have another one and during optimization we might be looking
>>> at that.  So perhaps it would need to be a per-TU flag merged during LTO.
>>
>> There's also IPA flow of strings between functions so unless mixing
>> exec charsets
>> invokes undefined behavior I can't see how a per-function flag would help.
>>
>> But yes, if we can reliably detect whether multiple exec charsets are
>> used in a TU
>> we can make this a flag that doesn't have to be set by the user.  But that means
>> the pragma probably _always_ forces that flag given we have that
>> forced pre-included
>> file on some targest and the pragma token would occur after that...
>
> Would it make sense to mark the string literals itself as not using the default charset? Then we
> could disable all interpretations only for these strings instead of disabling it for the entire TU?

I think that would work, too.  Though I'd then rather explicitely
state the charset the string literal is in
(for efficiency we'd then need some mapping of charset id to actual
charset we store globally somewhere
and which we'd need to stream and merge for LTO - the "default" would
then always get zero and
the default charset being streamed to LTO).  Looks like
tree_base.u.bits is unused for STRING_CST
in the middle-end, you'd have to check FEs if they use a lang-specific
flag though.  Then we could
stick the exec charset number there (32bit index even - whoo).  Bah,
C++ of course uses a single
lang flag (PAREN_STRING_LITERAL_P).  Sticking it in the literals type
would work as well but
I find that a bit ugly.  We could reuse bits.address_space for a max
of 256 exec charsets,
a special value of 255 could indicate 'unknown, too many charsets'
also used in an initial implementation
without providing the actual mapping just distinguishing default from
non-default.

The interesting part is of course libcpp/cc1 interaction and getting
this all right.

Richard.

> -Andreas-
>
Richard Biener Oct. 20, 2017, 11:38 a.m. UTC | #8
On Fri, Oct 20, 2017 at 1:34 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Oct 20, 2017 at 1:19 PM, Andreas Krebbel
> <krebbel@linux.vnet.ibm.com> wrote:
>> On 10/20/2017 10:28 AM, Richard Biener wrote:
>>> On Fri, Oct 20, 2017 at 9:53 AM, Jakub Jelinek <jakub@redhat.com> wrote:
>>>> On Fri, Oct 20, 2017 at 09:48:38AM +0200, Richard Biener wrote:
>>>>> How does it work semantically to have different exec charsets?  That is,
>>>>> if "strings" flow from a region with one -fexec-charset setting to a region
>>>>> with another one is that undefined behavior?  Do we now require
>>>>> external function declarations to be in the proper region (declared under
>>>>> the appropriate exec charset flag)?  This would mean that passing
>>>>> the exec charset in effect as additional argument isn't a possibility.
>>>>>
>>>>> Or do we have to treat -fexec-charset similar to -frounding-math, that is,
>>>>> we can't ever _interpret_ any string in the compiler?  [unless -fexec-charset
>>>>> is the same everywhere]
>>>>>
>>>>> I think the -frounding-math route is probably the easiest (and wisest
>>>>> given the quite low test coverage we'll get) route.  Thus, add a -fmixed-charset
>>>>> flag and reject any exec-charset attribute/pragma if that flag is not set?
>>>>> With LTO we could always add this and/or merge -fexec-charset flags
>>>>> appropriately,
>>>>> injecting -fmixed-charset in case TUs use different settings.
>>>>
>>>> It wouldn't have to be an option, simply mark in cfun all functions that
>>>> have more than one exec charset and give up on all optimizations/warnings
>>>> that require to read the characters and merge that unknown exec_charset
>>>> flag during inlining etc.  Though, that might still not be enough, e.g.
>>>> the whole function might have one exec charset, but a global const char []
>>>> variable might have another one and during optimization we might be looking
>>>> at that.  So perhaps it would need to be a per-TU flag merged during LTO.
>>>
>>> There's also IPA flow of strings between functions so unless mixing
>>> exec charsets
>>> invokes undefined behavior I can't see how a per-function flag would help.
>>>
>>> But yes, if we can reliably detect whether multiple exec charsets are
>>> used in a TU
>>> we can make this a flag that doesn't have to be set by the user.  But that means
>>> the pragma probably _always_ forces that flag given we have that
>>> forced pre-included
>>> file on some targest and the pragma token would occur after that...
>>
>> Would it make sense to mark the string literals itself as not using the default charset? Then we
>> could disable all interpretations only for these strings instead of disabling it for the entire TU?
>
> I think that would work, too.  Though I'd then rather explicitely
> state the charset the string literal is in
> (for efficiency we'd then need some mapping of charset id to actual
> charset we store globally somewhere
> and which we'd need to stream and merge for LTO - the "default" would
> then always get zero and
> the default charset being streamed to LTO).  Looks like
> tree_base.u.bits is unused for STRING_CST
> in the middle-end, you'd have to check FEs if they use a lang-specific
> flag though.  Then we could
> stick the exec charset number there (32bit index even - whoo).  Bah,
> C++ of course uses a single
> lang flag (PAREN_STRING_LITERAL_P).  Sticking it in the literals type
> would work as well but
> I find that a bit ugly.  We could reuse bits.address_space for a max
> of 256 exec charsets,
> a special value of 255 could indicate 'unknown, too many charsets'
> also used in an initial implementation
> without providing the actual mapping just distinguishing default from
> non-default.
>
> The interesting part is of course libcpp/cc1 interaction and getting
> this all right.

Oh, and there are plenty of bits unused for STRING_CST so if the C++
FE could stop using lang specific tree bits we could shrink tree_string
by moving length to tree_base.u.  Re-using address_space would block
this improvement.  Finding a single bit for default vs. non-default wouldn't.

Richard.

> Richard.
>
>> -Andreas-
>>
Andreas Krebbel Oct. 23, 2017, 10:55 a.m. UTC | #9
On 10/19/2017 07:13 PM, Martin Sebor wrote:
> On 10/19/2017 09:50 AM, Andreas Krebbel wrote:
>> The TPF operating system uses the GCC S/390 backend.  They set an
>> EBCDIC exec charset for compilation using -fexec-charset.  However,
>> certain libraries require ASCII strings instead.  In order to be able
>> to put calls to that library into the normal code it is required to
>> switch the exec charset within a compilation unit.
>>
>> This is an attempt to implement it by adding a new pragma which could
>> be used like in the following example:
>>
>> int
>> foo ()
>> {
>>   call_with_utf8("hello world");
>>
>> #pragma GCC exec_charset("UTF16")
>>   call_with_utf16("hello world");
>>
>> #pragma GCC exec_charset(pop)
>>   call_with_utf8("hello world");
>> }
>>
>> Does this look reasonable?
> 
> I'm not an expert on this but at a high level it looks reasonable
> to me.  But based on some small amount of work I did in this area
> I have a couple of questions.
> 
> There are a few places in the compiler that already do or that
> should but don't yet handle different execution character sets.
> The former include built-ins like __bultin_isdigit() and
> __builtin_sprintf (in both builtins.c and gimple-ssa-sprintf.c)
> The latter is the -Wformat checking done by the C and C++ front
> ends.  The missing support for the latter is the subject of bug
> 38308.  According to bug 81686, LTO is apparently also missing
> support for exec-charset.

These probably are the areas Richard and Jakub were referring to as well?!  These cases did not work
properly with the -fexec-charset cmdline option and this does not change with the pragma. I'll try
to look at what has been proposed in the discussion. Perhaps I can get it working somehow.

> I'm curious how the pragma might interact with these two areas,
> and whether the lack of support for it in the latter is a concern
> (and if not, why not).  For the former, I'm also wondering about
> the interaction of inlining and other interprocedural optimizations
> with the pragma.  Does it propagate through inlined calls as one
> would expect?

The pragma does not apply to the callees of a function defined under the pragma regardless of
whether it gets inlined or not.  That matches the behavior of other pragmas.  If it would apply to
inlined callees the program semantics might change depending on optimization decisions i.e. whether
a certain call got inlined or not.

Callees marked as always_inline might be discussed separately. I remember this being a topic when
looking at function attributes.

Bye,

-Andreas-
Martin Sebor Oct. 23, 2017, 4:14 p.m. UTC | #10
On 10/23/2017 04:55 AM, Andreas Krebbel wrote:
> On 10/19/2017 07:13 PM, Martin Sebor wrote:
>> On 10/19/2017 09:50 AM, Andreas Krebbel wrote:
>>> The TPF operating system uses the GCC S/390 backend.  They set an
>>> EBCDIC exec charset for compilation using -fexec-charset.  However,
>>> certain libraries require ASCII strings instead.  In order to be able
>>> to put calls to that library into the normal code it is required to
>>> switch the exec charset within a compilation unit.
>>>
>>> This is an attempt to implement it by adding a new pragma which could
>>> be used like in the following example:
>>>
>>> int
>>> foo ()
>>> {
>>>   call_with_utf8("hello world");
>>>
>>> #pragma GCC exec_charset("UTF16")
>>>   call_with_utf16("hello world");
>>>
>>> #pragma GCC exec_charset(pop)
>>>   call_with_utf8("hello world");
>>> }
>>>
>>> Does this look reasonable?
>>
>> I'm not an expert on this but at a high level it looks reasonable
>> to me.  But based on some small amount of work I did in this area
>> I have a couple of questions.
>>
>> There are a few places in the compiler that already do or that
>> should but don't yet handle different execution character sets.
>> The former include built-ins like __bultin_isdigit() and
>> __builtin_sprintf (in both builtins.c and gimple-ssa-sprintf.c)
>> The latter is the -Wformat checking done by the C and C++ front
>> ends.  The missing support for the latter is the subject of bug
>> 38308.  According to bug 81686, LTO is apparently also missing
>> support for exec-charset.
>
> These probably are the areas Richard and Jakub were referring to as well?!  These cases did not work
> properly with the -fexec-charset cmdline option and this does not change with the pragma. I'll try
> to look at what has been proposed in the discussion. Perhaps I can get it working somehow.

Right, the patch doesn't remove the known deficiencies.  But
by providing another knob to control the execution charset, at
a fine grain level, it encourages users to make greater use of
the (incomplete) exec-charset support and increases the odds
that they will run afoul of them.

It seems to me that before exposing a new mechanism to control
the exec charset it would be prudent to a) plug at least the
biggest holes to make the feature more reliable (in my mind,
that's at least -Wformat), and b) make sure the pragma interacts
correctly with existing features that work correctly with the
-fexec-charset option.  Where it doesn't and where it cannot
be made to work correctly (i.e., is undefined),  I would expect
an effort to be made to detect and diagnose those undefined
interactions if possible, or if that's too difficult, at
a minimum document them.

>> I'm curious how the pragma might interact with these two areas,
>> and whether the lack of support for it in the latter is a concern
>> (and if not, why not).  For the former, I'm also wondering about
>> the interaction of inlining and other interprocedural optimizations
>> with the pragma.  Does it propagate through inlined calls as one
>> would expect?
>
> The pragma does not apply to the callees of a function defined under the pragma regardless of
> whether it gets inlined or not.  That matches the behavior of other pragmas.  If it would apply to
> inlined callees the program semantics might change depending on optimization decisions i.e. whether
> a certain call got inlined or not.
>
> Callees marked as always_inline might be discussed separately. I remember this being a topic when
> looking at function attributes.

My concern with this pragma/attribute and inlining has to do with
strings in one exec charset being propagated into functions that
operate on strings in another charset.  E.g., like in the test
case below that's "miscompiled" with your patch -- the first test
for n == 7 is eliminated and the buffer overflow is not detected.
If this cannot be made to work then I think some effort should be
made to detect this mixing and matching and existing optimizations
that assume the same charset (like the sprintf one does) disabled.

static inline int
f (char *d, const char *fmt)
{
#pragma GCC exec_charset ("utf8")
   int n = __builtin_sprintf (d, fmt, 12345);
#pragma GCC exec_charset (pop)

   if (n == 7)   // incorrectly optimized away
     __builtin_abort ();

   return n;
}

int main (void)
{
   char d[5];

#pragma GCC exec_charset ("EBCDIC-US")
   int n = f (d, "i=%i");   // buffer overflow not detected
#pragma GCC exec_charset (pop)

   __builtin_printf ("%i (%lu): %s\n", n, __builtin_strlen (d), d);

   if (n != 7)   // aborts at runtime
     __builtin_abort ();
}

Martin
Andreas Krebbel Oct. 24, 2017, 10:52 a.m. UTC | #11
On 10/23/2017 06:14 PM, Martin Sebor wrote:
...
> It seems to me that before exposing a new mechanism to control
> the exec charset it would be prudent to a) plug at least the
> biggest holes to make the feature more reliable (in my mind,
> that's at least -Wformat), and b) make sure the pragma interacts
> correctly with existing features that work correctly with the
> -fexec-charset option.  Where it doesn't and where it cannot
> be made to work correctly (i.e., is undefined),  I would expect
> an effort to be made to detect and diagnose those undefined
> interactions if possible, or if that's too difficult, at
> a minimum document them.

Agreed. Getting a better understanding of the deficiencies with that mechanism and at least document
them is an important step.

...
> My concern with this pragma/attribute and inlining has to do with
> strings in one exec charset being propagated into functions that
> operate on strings in another charset.  E.g., like in the test
> case below that's "miscompiled" with your patch -- the first test
> for n == 7 is eliminated and the buffer overflow is not detected.
> If this cannot be made to work then I think some effort should be
> made to detect this mixing and matching and existing optimizations
> that assume the same charset (like the sprintf one does) disabled.
> 
> static inline int
> f (char *d, const char *fmt)
> {
> #pragma GCC exec_charset ("utf8")
>    int n = __builtin_sprintf (d, fmt, 12345);
> #pragma GCC exec_charset (pop)
> 
>    if (n == 7)   // incorrectly optimized away
>      __builtin_abort ();

To my understanding the problem in your example isn't triggered through inlining.
The check gets optimized away because sprintf is called with "i=%i" being converted to EBCDIC first.
 The sprintf implementation does not recognize the format letter and hence only prints the converted
format string without inserting the integer. That's kinda expected I would say.  You call a function
which is expected to deal with a UTF-8 string with an EBCDIC string. GCC is reasoning about what the
sprintf invocation would return and removes the condition since it would never be triggered.
Compiling with -O0 makes the program behave the same way even without the inlining and the check not
being removed.

It would also happen without using the pragma but compiling with -fexec-charset=EBCDIC-US instead.
E.g. compiling this program with -fexec-charset=EBCDIC-US does abort while it does not abort without
using -fexec-charset:

int main (void)
{
   char d[5];

   int n = __builtin_sprintf (d, "i=%i", 12345);

   if (n != 7)
     __builtin_abort ();
}

I'm not sure what kind of warning to expect for such code? Would it be helpful if the compiler tells
that the exec_charset setting does not affect "fmt" for this code?!

#pragma GCC exec_charset ("EBCDIC-US")
   int n = __builtin_sprintf (d, fmt, 12345);
#pragma GCC exec_charset (pop)

On the other hand, diagnostics like this might create false positives. One reason of course is that
this might be intentional. But it might also be because in C it is difficult to detect whether a
char* variable will eventually be interpreted as string or not.

While I agree with you that the feature is not that easy to use and more diagnostics would be good
I'm not really sure what to do about it.

Bye,

-Andreas-
Martin Sebor Oct. 24, 2017, 3:33 p.m. UTC | #12
>> My concern with this pragma/attribute and inlining has to do with
>> strings in one exec charset being propagated into functions that
>> operate on strings in another charset.  E.g., like in the test
>> case below that's "miscompiled" with your patch -- the first test
>> for n == 7 is eliminated and the buffer overflow is not detected.
>> If this cannot be made to work then I think some effort should be
>> made to detect this mixing and matching and existing optimizations
>> that assume the same charset (like the sprintf one does) disabled.
>>
>> static inline int
>> f (char *d, const char *fmt)
>> {
>> #pragma GCC exec_charset ("utf8")
>>    int n = __builtin_sprintf (d, fmt, 12345);
>> #pragma GCC exec_charset (pop)
>>
>>    if (n == 7)   // incorrectly optimized away
>>      __builtin_abort ();
>
> To my understanding the problem in your example isn't triggered through inlining.
> The check gets optimized away because sprintf is called with "i=%i" being converted to EBCDIC first.

It's optimized away because the sprintf pass interprets the constant
contents of the format string, which it only does when the function
is inlined into its caller.  Otherwise, without inlining, it treats
the format string as an unknown and so it cannot do the buffer
overflow analysis or the optimization, and everything works fine
at runtime.

>  The sprintf implementation does not recognize the format letter and hence only prints the converted
> format string without inserting the integer. That's kinda expected I would say.  You call a function
> which is expected to deal with a UTF-8 string with an EBCDIC string. GCC is reasoning about what the
> sprintf invocation would return and removes the condition since it would never be triggered.
> Compiling with -O0 makes the program behave the same way even without the inlining and the check not
> being removed.

Right, at -O0 the function isn't inlined so it behaves sensibly.
The buffer overflow analysis doesn't happen but that is expected
and documented.

> It would also happen without using the pragma but compiling with -fexec-charset=EBCDIC-US instead.
> E.g. compiling this program with -fexec-charset=EBCDIC-US does abort while it does not abort without
> using -fexec-charset:
>
> int main (void)
> {
>    char d[5];
>
>    int n = __builtin_sprintf (d, "i=%i", 12345);
>
>    if (n != 7)
>      __builtin_abort ();
> }

It aborts in GCC 7 because its sprintf pass doesn't support
-fexec-charset.  The sprintf pass in GCC 8 does support it so it's
supposed to work correctly.  It does with -fexec-charse=IBM1047,
the only charset I tested the implementation with.  The output is:

y.c: In function ‘main’:
y.c:7:34: warning: too many arguments for format [-Wformat-extra-args]
     int n = __builtin_sprintf (d, "i=%i", 12345);
                                   ^~~~~~
y.c:7:34: warning: ‘%i’ directive writing 5 bytes into a region of size 
3 [-Wformat-overflow=]
     int n = __builtin_sprintf (d, "i=%i", 12345);
                                   ^~~~~~
y.c:7:12: note: ‘__builtin_sprintf’ output 8 bytes into a destination of 
size 5
     int n = __builtin_sprintf (d, "i=%i", 12345);

The first warning is wrong, caused by the missing support for
-fexec-charset in -Wformat.  But the other warning is expected
and correct.

The abort seems like a bug in the implementation, although since
the buffer overflows the behavior is undefined so one could argue
the abort is okay.  I will look into what's going on and fix it.
  When there is no buffer overflow there is no abort, and that's
the correct behavior that needs to be preserved with the pragma.

For some reason, GCC 8 gets an ICE when -fexec-charset=EBCDIC-US.
I opened pr82700 to investigate and fix this bug.

> I'm not sure what kind of warning to expect for such code? Would it be helpful if the compiler tells
> that the exec_charset setting does not affect "fmt" for this code?!
>
> #pragma GCC exec_charset ("EBCDIC-US")
>    int n = __builtin_sprintf (d, fmt, 12345);
> #pragma GCC exec_charset (pop)

Other than making it work as one would expect I think the only
reasonable approach would be to disable the optimization and
analysis.  Losing the optimization wouldn't be a big deal, but
it would be very unfortunate to have to trade off the pragma for
the sprintf buffer overflow detection.  But I'd like to think
we don't have to make such a tradeoff.

> On the other hand, diagnostics like this might create false positives. One reason of course is that
> this might be intentional. But it might also be because in C it is difficult to detect whether a
> char* variable will eventually be interpreted as string or not.
>
> While I agree with you that the feature is not that easy to use and more diagnostics would be good
> I'm not really sure what to do about it.

I think the suggestion to associate each string with its own
charset is a good one.  That should let the sprintf pass (and
others like it) do the right thing even as strings get inlined
and charsets are mixed and matched, e.g., like so:

   static inline void f (const char *f)
   {
     #pragma GCC exec_charset ("UTF-8")
       const char format[] = "fmt = %s\n";   // format in UTF-8

       printf (format, f);         // f interpreted in EBCDIC-US

       printf (f, 12345);          // ditto here
     #pragma GCC exec_charset (pop)
   }

   void g (void)
   {
     #pragma GCC exec_charset ("EBCDIC-US")

     f ("i = %i\n");   // literal in EBCDIC-US

     #pragma GCC exec_charset (pop)
   }

Martin
Andreas Krebbel Oct. 24, 2017, 4:57 p.m. UTC | #13
On 10/24/2017 05:33 PM, Martin Sebor wrote:
>>> My concern with this pragma/attribute and inlining has to do with
>>> strings in one exec charset being propagated into functions that
>>> operate on strings in another charset.  E.g., like in the test
>>> case below that's "miscompiled" with your patch -- the first test
>>> for n == 7 is eliminated and the buffer overflow is not detected.
>>> If this cannot be made to work then I think some effort should be
>>> made to detect this mixing and matching and existing optimizations
>>> that assume the same charset (like the sprintf one does) disabled.
>>>
>>> static inline int
>>> f (char *d, const char *fmt)
>>> {
>>> #pragma GCC exec_charset ("utf8")
>>>    int n = __builtin_sprintf (d, fmt, 12345);
>>> #pragma GCC exec_charset (pop)
>>>
>>>    if (n == 7)   // incorrectly optimized away
>>>      __builtin_abort ();
>>
>> To my understanding the problem in your example isn't triggered through inlining.
>> The check gets optimized away because sprintf is called with "i=%i" being converted to EBCDIC first.
> 
> It's optimized away because the sprintf pass interprets the constant
> contents of the format string, which it only does when the function
> is inlined into its caller.  Otherwise, without inlining, it treats
> the format string as an unknown and so it cannot do the buffer
> overflow analysis or the optimization, and everything works fine
> at runtime.

I assumed that GCC is supposed to emulate the format string parsing of sprintf as if it were
compiled with the default execcharset. But instead it should emulate sprintf as if it was compiled
with the same exec-charset as the currently compiled code - right?!

So for an example like this:

  char d[3];
#pragma GCC exec_charset ("EBCDIC-US")
  char *s = "i=%i";
#pragma GCC exec_charset (pop)

  __builtin_sprintf (d, s, 1);

It currently warns:

t.c:6:13: warning: ‘~l’ directive writing 4 bytes into a region of size 3 [-Wformat-overflow=]
   char *s = "i=%i";
             ^~~~~~
t.c:9:3: note: ‘__builtin_sprintf’ output 5 bytes into a destination of size 3
   __builtin_sprintf (d, s, 1);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~

That's wrong and happens because the GCC sprintf format string parsing uses the default charset
instead of the charset which has been used when originally parsing the string literal.  Hence it
does not recognize the %i format specifier and the output string remains 4 bytes in length.

As Richard suggested we probably have to record the exec_charset in each string literal and use this
when trying to interpret it.

-Andreas-
diff mbox series

Patch

diff --git a/gcc/c-family/c-pragma.c b/gcc/c-family/c-pragma.c
index f7b59b3..db281b9 100644
--- a/gcc/c-family/c-pragma.c
+++ b/gcc/c-family/c-pragma.c
@@ -34,6 +34,8 @@  along with GCC; see the file COPYING3.  If not see
 #include "opts.h"
 #include "plugin.h"
 
+extern cpp_options *cpp_opts;
+
 #define GCC_BAD(gmsgid) \
   do { warning (OPT_Wpragmas, gmsgid); return; } while (0)
 #define GCC_BAD2(gmsgid, arg) \
@@ -1141,6 +1143,52 @@  handle_pragma_message (cpp_reader *ARG_UNUSED(dummy))
     inform (input_location, "#pragma message: %s", TREE_STRING_POINTER (message));
 }
 
+static void
+handle_pragma_exec_charset (cpp_reader *ARG_UNUSED(dummy))
+{
+  enum cpp_ttype token;
+  tree x;
+  static const char* previous_charset = NULL;
+
+  token = pragma_lex (&x);
+  if (token == CPP_OPEN_PAREN)
+    {
+      token = pragma_lex (&x);
+      if (token == CPP_STRING)
+	{
+	  previous_charset = cpp_opts->narrow_charset;
+	  cpp_opts->narrow_charset = TREE_STRING_POINTER (x);
+	}
+      else if (token == CPP_NAME
+	       && strncmp (IDENTIFIER_POINTER (x), "pop", 3) == 0)
+	{
+	  if (previous_charset == NULL)
+	    {
+	      warning (OPT_Wpragmas,
+		       "pop without previous exec_charset use - ignored");
+	      return;
+	    }
+	  cpp_opts->narrow_charset = previous_charset;
+	  previous_charset = NULL;
+	}
+      else
+	GCC_BAD ("expected a charset string or pop after %<#pragma exec_charset%>");
+
+      if (pragma_lex (&x) != CPP_CLOSE_PAREN)
+	GCC_BAD ("malformed %<#pragma exec_charset%>, ignored");
+    }
+  else
+    GCC_BAD ("expected a string after %<#pragma exec_charset%>");
+
+  if (pragma_lex (&x) != CPP_EOF)
+    warning (OPT_Wpragmas, "junk at end of %<#pragma exec_charset%>");
+
+  inform (input_location, "switching to exec charset: %s",
+	  cpp_opts->narrow_charset);
+  cpp_destroy_iconv (parse_in);
+  cpp_init_iconv (parse_in);
+}
+
 /* Mark whether the current location is valid for a STDC pragma.  */
 
 static bool valid_location_for_stdc_pragma;
@@ -1571,6 +1619,8 @@  init_pragma (void)
 				    handle_pragma_redefine_extname);
 
   c_register_pragma_with_expansion (0, "message", handle_pragma_message);
+  c_register_pragma_with_expansion ("GCC", "exec_charset",
+				    handle_pragma_exec_charset);
 
 #ifdef REGISTER_TARGET_PRAGMAS
   REGISTER_TARGET_PRAGMAS ();
diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index d9b7a54..b67993a 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -21611,6 +21611,7 @@  for further explanation.
 * Push/Pop Macro Pragmas::
 * Function Specific Option Pragmas::
 * Loop-Specific Pragmas::
+* Charset-Specific Pragmas::
 @end menu
 
 @node AArch64 Pragmas
@@ -22209,6 +22210,31 @@  void ignore_vec_dep (int *a, int k, int c, int m)
 @}
 @end smallexample
 
+@node Charset-Specific Pragmas
+@subsection Charset-Specific Pragmas
+
+@table @code
+@item #pragma GCC exec_charset(@var{"charset"})
+@cindex pragma GCC exec_charset
+
+Set the execution character set, used for string and character
+constants.  The default is the exec charset specified with
+@option{-fexec-charset} or UTF-8 if @option{-fexec-charset} isn't used.
+charset can be any encoding supported by the system's "iconv" library
+routine.  The special value @var{pop} (without ``) can be
+used to switch back to the exec charset before the last @code{#pragma
+GCC exec_charset} setting.
+@end table
+
+@smallexample
+call_with_utf8("hello world");
+
+#pragma GCC exec_charset("UTF16")
+call_with_utf16("hello world");
+
+#pragma GCC exec_charset(pop)
+call_with_utf8("hello world");
+@end smallexample
 
 @node Unnamed Fields
 @section Unnamed Structure and Union Fields
diff --git a/gcc/testsuite/gcc.dg/pragma-exec_charset-1.c b/gcc/testsuite/gcc.dg/pragma-exec_charset-1.c
new file mode 100644
index 0000000..5c695aa
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pragma-exec_charset-1.c
@@ -0,0 +1,26 @@ 
+/* { dg-do run } */
+
+#include <stdio.h>
+
+char t1[] = "hello world";
+#pragma GCC exec_charset("EBCDIC-US")
+char t2[] = "hello world";
+#pragma GCC exec_charset(pop)
+char t3[] = "hello world";
+
+char hello_world_utf8[12] = { 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x00 };
+char hello_world_ebcdic[12] = { 0x88, 0x85, 0x93, 0x93, 0x96, 0x40, 0xa6, 0x96, 0x99, 0x93, 0x84, 0x00 };
+
+int
+main ()
+{
+  if (__builtin_memcmp (t1, hello_world_utf8, 12) != 0)
+    __builtin_abort ();
+
+  if (__builtin_memcmp (t2, hello_world_ebcdic, 12) != 0)
+    __builtin_abort ();
+
+  if (__builtin_memcmp (t3, hello_world_utf8, 12) != 0)
+    __builtin_abort ();
+
+}
diff --git a/libcpp/charset.c b/libcpp/charset.c
index 6a3bbbc..47fa406 100644
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -737,7 +737,7 @@  cpp_init_iconv (cpp_reader *pfile)
 
 /* Destroy iconv(3) descriptors set up by cpp_init_iconv, if necessary.  */
 void
-_cpp_destroy_iconv (cpp_reader *pfile)
+cpp_destroy_iconv (cpp_reader *pfile)
 {
   if (HAVE_ICONV)
     {
diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
index 804132a..acbdf5a 100644
--- a/libcpp/include/cpplib.h
+++ b/libcpp/include/cpplib.h
@@ -861,6 +861,9 @@  extern void cpp_post_options (cpp_reader *);
 /* Set up translation to the target character set.  */
 extern void cpp_init_iconv (cpp_reader *);
 
+/* Cleanup translation to the target character set.  */
+extern void cpp_destroy_iconv (cpp_reader *);
+
 /* Call this to finish preprocessing.  If you requested dependency
    generation, pass an open stream to write the information to,
    otherwise NULL.  It is your responsibility to close the stream.  */
diff --git a/libcpp/init.c b/libcpp/init.c
index 16ff202..4e68645 100644
--- a/libcpp/init.c
+++ b/libcpp/init.c
@@ -314,7 +314,7 @@  cpp_destroy (cpp_reader *pfile)
 
   _cpp_destroy_hashtable (pfile);
   _cpp_cleanup_files (pfile);
-  _cpp_destroy_iconv (pfile);
+  cpp_destroy_iconv (pfile);
 
   _cpp_free_buff (pfile->a_buff);
   _cpp_free_buff (pfile->u_buff);
diff --git a/libcpp/internal.h b/libcpp/internal.h
index f24e85c..ce2d902 100644
--- a/libcpp/internal.h
+++ b/libcpp/internal.h
@@ -757,7 +757,6 @@  extern bool _cpp_valid_ucn (cpp_reader *, const unsigned char **,
 			    cppchar_t *,
 			    source_range *char_range,
 			    cpp_string_location_reader *loc_reader);
-extern void _cpp_destroy_iconv (cpp_reader *);
 extern unsigned char *_cpp_convert_input (cpp_reader *, const char *,
 					  unsigned char *, size_t, size_t,
 					  const unsigned char **, off_t *);