Message ID | 20211007130049.GT304296@tucnak |
---|---|
State | New |
Series | c++: Add testcase for C++23 P2316R2 - consistent character literal encoding [PR102615] |
On 10/7/21 09:00, Jakub Jelinek wrote:
> Hi!
>
> I believe we need no changes to the compiler for P2316R2, seems we treat
> character literals the same between preprocessor and C++ expressions,
> here is a testcase that should verify it.
>
> Tested on x86_64-linux, ok for trunk?
>
> Note, seems the internal charset for GCC can be either UTF-8 or UTF-EBCDIC,
> but I bet it is very hard (at least for me) to actually test the latter.
> I'd guess one needs all system headers to be in EBCDIC and the gcc sources too.
> But looking around the source, I'm a little bit worried about the UTF-EBCDIC
> case.
> One is:
> #if '\n' == 0x0A && ' ' == 0x20 && '0' == 0x30 \
>    && 'A' == 0x41 && 'a' == 0x61 && '!' == 0x21
> # define HOST_CHARSET HOST_CHARSET_ASCII
> #else
> # if '\n' == 0x15 && ' ' == 0x40 && '0' == 0xF0 \
>    && 'A' == 0xC1 && 'a' == 0x81 && '!' == 0x5A
> # define HOST_CHARSET HOST_CHARSET_EBCDIC
> # else
> # define HOST_CHARSET HOST_CHARSET_UNKNOWN
> # endif
> #endif
> in include/safe-ctype.h, does that mean we only support EBCDIC if -funsigned-char
> and otherwise fail to build gcc? Because with -fsigned-char, '0' is -0x10
> rather than 0xF0, 'A' is -0x3F rather than 0xC1 and 'a' is -0x7F rather than
> 0x81.
> And another thing, if HOST_CHARSET == HOST_CHARSET_EBCDIC, how does the libcpp/lex.c
>       static const cppchar_t utf8_signifier = 0xC0;
> ...
>       if (*buffer->cur >= utf8_signifier)
>         {
>           if (_cpp_valid_utf8 (pfile, &buffer->cur, buffer->rlimit, 1 + !first,
>                                state, &s))
>             return true;
>         }
> work? Because in UTF-EBCDIC, >= 0xC0 isn't the right test for start of
> multi-byte character, it is more complicated and seems _cpp_valid_utf8
> assumes UTF-8 as the host charset.

Are there any supported platforms that use UTF-EBCDIC?

> 2021-10-07 Jakub Jelinek <jakub@redhat.com>
>
>     PR c++/102615
>     * g++.dg/cpp23/charlit-encoding1.C: New testcase for C++23 P2316R2.
>
> --- gcc/testsuite/g++.dg/cpp23/charlit-encoding1.C.jj 2021-10-07 14:34:35.182132411 +0200
> +++ gcc/testsuite/g++.dg/cpp23/charlit-encoding1.C 2021-10-07 14:34:02.902583774 +0200
> @@ -0,0 +1,33 @@
> +// PR c++/102615 - P2316R2 - Consistent character literal encoding
> +// { dg-do compile }

Doesn't this need to run? OK with that change.

> +extern "C" void abort ();
> +
> +int
> +main ()
> +{
> +#if ' ' == 0x20
> +  if (' ' != 0x20)
> +    abort ();
> +#elif ' ' == 0x40
> +  if (' ' != 0x40)
> +    abort ();
> +#else
> +  if (' ' == 0x20 || ' ' == 0x40)
> +    abort ();
> +#endif
> +#if 'a' == 0x61
> +  if ('a' != 0x61)
> +    abort ();
> +#elif 'a' == 0x81
> +  if ('a' != 0x81)
> +    abort ();
> +#elif 'a' == -0x7F
> +  if ('a' != -0x7F)
> +    abort ();
> +#else
> +  if ('a' == 0x61 || 'a' == 0x81 || 'a' == -0x7F)
> +    abort ();
> +#endif
> +  return 0;
> +}
>
>     Jakub
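The signed-char values in the quoted question about include/safe-ctype.h can be checked at compile time. The following is only a sketch and not GCC code: `as_signed_char` is a hypothetical helper, and it assumes an 8-bit two's-complement `char`, which is what -fsigned-char would give on such a host.

```cpp
// Hypothetical helper, not part of GCC: maps an 8-bit code unit to the value
// a character literal would have when plain char is a signed 8-bit type
// (assumption: two's complement).
constexpr int
as_signed_char (unsigned byte)
{
  return byte >= 0x80u ? static_cast<int> (byte) - 0x100
                       : static_cast<int> (byte);
}

// EBCDIC code units taken from the safe-ctype.h check quoted in the thread.
static_assert (as_signed_char (0xF0) == -0x10, "'0' becomes -0x10");
static_assert (as_signed_char (0xC1) == -0x3F, "'A' becomes -0x3F");
static_assert (as_signed_char (0x81) == -0x7F, "'a' becomes -0x7F");

// Code units below 0x80 keep their value, so these three comparisons in the
// EBCDIC branch do not depend on the signedness of char.
static_assert (as_signed_char (0x15) == 0x15, "'\\n' is unchanged");
static_assert (as_signed_char (0x40) == 0x40, "' ' is unchanged");
static_assert (as_signed_char (0x5A) == 0x5A, "'!' is unchanged");
```

Under these assumptions only the '0', 'A' and 'a' comparisons of the EBCDIC branch would differ between -fsigned-char and -funsigned-char, which matches the three values named in the question.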
On Thu, Oct 07, 2021 at 09:12:15AM -0400, Jason Merrill wrote:
> > And another thing, if HOST_CHARSET == HOST_CHARSET_EBCDIC, how does the libcpp/lex.c
> >       static const cppchar_t utf8_signifier = 0xC0;
> > ...
> >       if (*buffer->cur >= utf8_signifier)
> >         {
> >           if (_cpp_valid_utf8 (pfile, &buffer->cur, buffer->rlimit, 1 + !first,
> >                                state, &s))
> >             return true;
> >         }
> > work? Because in UTF-EBCDIC, >= 0xC0 isn't the right test for start of
> > multi-byte character, it is more complicated and seems _cpp_valid_utf8
> > assumes UTF-8 as the host charset.
>
> Are there any supported platforms that use UTF-EBCDIC?

I have no idea. From the libcpp/charset.c code, seems there is no built-in
conversion for UTF-EBCDIC, the only internally supported conversions are
  { "UTF-8/UTF-32LE", convert_utf8_utf32, (iconv_t)0 },
  { "UTF-8/UTF-32BE", convert_utf8_utf32, (iconv_t)1 },
  { "UTF-8/UTF-16LE", convert_utf8_utf16, (iconv_t)0 },
  { "UTF-8/UTF-16BE", convert_utf8_utf16, (iconv_t)1 },
  { "UTF-32LE/UTF-8", convert_utf32_utf8, (iconv_t)0 },
  { "UTF-32BE/UTF-8", convert_utf32_utf8, (iconv_t)1 },
  { "UTF-16LE/UTF-8", convert_utf16_utf8, (iconv_t)0 },
  { "UTF-16BE/UTF-8", convert_utf16_utf8, (iconv_t)1 },
and identity, so unless the C library iconv supports conversion to
UTF-EBCDIC, the only case that could be supported is when -finput-charset=
is also UTF-EBCDIC. E.g. glibc iconv doesn't support that.
Never used z/VM nor OS/390 which I think are the only possible hosts that
could have UTF-EBCDIC.
CCing Andreas if he knows more...

> > --- gcc/testsuite/g++.dg/cpp23/charlit-encoding1.C.jj 2021-10-07 14:34:35.182132411 +0200
> > +++ gcc/testsuite/g++.dg/cpp23/charlit-encoding1.C 2021-10-07 14:34:02.902583774 +0200
> > @@ -0,0 +1,33 @@
> > +// PR c++/102615 - P2316R2 - Consistent character literal encoding
> > +// { dg-do compile }
>
> Doesn't this need to run? OK with that change.

Thanks for catching that, fixed, retested and committed.

    Jakub
On Thu, Oct 7, 2021 at 9:01 AM Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> And another thing, if HOST_CHARSET == HOST_CHARSET_EBCDIC, how does the libcpp/lex.c
>       static const cppchar_t utf8_signifier = 0xC0;
> ...
>       if (*buffer->cur >= utf8_signifier)
>         {
>           if (_cpp_valid_utf8 (pfile, &buffer->cur, buffer->rlimit, 1 + !first,
>                                state, &s))
>             return true;
>         }
> work? Because in UTF-EBCDIC, >= 0xC0 isn't the right test for start of
> multi-byte character, it is more complicated and seems _cpp_valid_utf8
> assumes UTF-8 as the host charset.

FWIW, here I was following Joseph's guidance from
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224#c21
("You can ignore anything claiming to handle UTF-EBCDIC.")

-Lewis
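To make the concern about the `>= 0xC0` comparison concrete, here is a small sketch. It is not libcpp code: `passes_utf8_signifier_test` is a hypothetical name that only mirrors the quoted comparison, and the EBCDIC code units are the ones from the include/safe-ctype.h check cited earlier in the thread.

```cpp
// Hypothetical stand-in for the quoted comparison against utf8_signifier;
// it is not the libcpp implementation.
constexpr bool
passes_utf8_signifier_test (unsigned char byte)
{
  return byte >= 0xC0;
}

// UTF-8 host charset: ASCII letters and continuation bytes (0x80..0xBF) are
// not flagged, while the lead byte of a multi-byte sequence is.
static_assert (!passes_utf8_signifier_test (0x41), "ASCII 'A'");
static_assert (!passes_utf8_signifier_test (0xBF), "continuation byte");
static_assert (passes_utf8_signifier_test (0xC3), "two-byte lead byte");

// EBCDIC host charset: 'A' (0xC1) and '0' (0xF0) are single-byte characters,
// yet the same comparison flags them as possible multi-byte starts.
static_assert (passes_utf8_signifier_test (0xC1), "EBCDIC 'A'");
static_assert (passes_utf8_signifier_test (0xF0), "EBCDIC '0'");
static_assert (!passes_utf8_signifier_test (0x81), "EBCDIC 'a'");
```

The sketch only shows that a plain threshold separates single-byte from multi-byte starts under UTF-8 but not under an EBCDIC-based encoding; it says nothing about what a correct UTF-EBCDIC test would look like.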
--- gcc/testsuite/g++.dg/cpp23/charlit-encoding1.C.jj 2021-10-07 14:34:35.182132411 +0200
+++ gcc/testsuite/g++.dg/cpp23/charlit-encoding1.C 2021-10-07 14:34:02.902583774 +0200
@@ -0,0 +1,33 @@
+// PR c++/102615 - P2316R2 - Consistent character literal encoding
+// { dg-do compile }
+
+extern "C" void abort ();
+
+int
+main ()
+{
+#if ' ' == 0x20
+  if (' ' != 0x20)
+    abort ();
+#elif ' ' == 0x40
+  if (' ' != 0x40)
+    abort ();
+#else
+  if (' ' == 0x20 || ' ' == 0x40)
+    abort ();
+#endif
+#if 'a' == 0x61
+  if ('a' != 0x61)
+    abort ();
+#elif 'a' == 0x81
+  if ('a' != 0x81)
+    abort ();
+#elif 'a' == -0x7F
+  if ('a' != -0x7F)
+    abort ();
+#else
+  if ('a' == 0x61 || 'a' == 0x81 || 'a' == -0x7F)
+    abort ();
+#endif
+  return 0;
+}
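For comparison, the same property can also be expressed as a compile-time check. This is a sketch rather than a replacement for the testcase above: the `PP_*` macros and `literals_agree` are hypothetical names, and only the first comparison for each literal is covered.

```cpp
// Record what the preprocessor computes for each character literal.
// The PP_* macro names are hypothetical.
#if ' ' == 0x20
# define PP_SPACE_IS_0X20 1
#else
# define PP_SPACE_IS_0X20 0
#endif
#if 'a' == 0x61
# define PP_A_IS_0X61 1
#else
# define PP_A_IS_0X61 0
#endif

// Evaluate the same comparisons as ordinary C++ constant expressions and
// require that both phases reach the same answer.
constexpr bool
literals_agree ()
{
  return ((' ' == 0x20) == (PP_SPACE_IS_0X20 == 1))
         && (('a' == 0x61) == (PP_A_IS_0X61 == 1));
}

static_assert (literals_agree (),
               "preprocessor and expression disagree on a character literal");
```

With this form a disagreement would show up as a compilation error instead of a call to `abort ()` at run time.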