[1/3]: C N2653 char8_t: Language support

Message ID f093d853-707c-af6f-70b2-06f9d90aa587@honermann.net
State New
Series: C N2653 char8_t implementation

Commit Message

Tom Honermann June 7, 2021, 2:32 a.m. UTC
This patch implements the core language and compiler-dependent library
changes proposed in WG14 N2653 [1] for C.  The changes include:
- Use of the existing -fchar8_t and -fno-char8_t options to opt in to
   (or opt out of) the following changes when compiling C code.
- Change of type for UTF-8 string literals from array of char to array
   of char8_t (unsigned char).
- A new atomic_char8_t typedef.
- A new ATOMIC_CHAR8_T_LOCK_FREE macro defined in terms of a new
   predefined __GCC_ATOMIC_CHAR8_T_LOCK_FREE macro.
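
As a rough illustration of the literal type change and of the initialization case handled in digest_init, the sketch below should compile both with and without char8_t support.  The names LITERAL_ELEM_TYPE, u8_literal_elem_type, and same_bytes are made up for this example and are not part of the patch.

```c
#include <assert.h>
#include <string.h>

/* Name the element type of a string literal.  _Generic sees the literal
   after array-to-pointer conversion, so the associations are pointers.
   LITERAL_ELEM_TYPE is an illustrative name, not part of the patch.  */
#define LITERAL_ELEM_TYPE(lit) _Generic((lit),  \
    char *: "char",                             \
    unsigned char *: "unsigned char",           \
    default: "other")

/* Expected to yield "char" without char8_t support and "unsigned char"
   with it (assumption: char8_t is unsigned char, as in N2653).  */
static const char *u8_literal_elem_type(void)
{
    return LITERAL_ELEM_TYPE(u8"text");
}

/* An array of any character type may be initialized from a UTF-8 string
   literal; with char8_t enabled, the first line is the case that needs
   the digest_init change.  */
static char          as_char[]  = u8"text";
static unsigned char as_uchar[] = u8"text";

/* Returns 1 if both arrays hold the same bytes, terminator included.  */
static int same_bytes(void)
{
    static_assert(sizeof as_char == sizeof as_uchar, "same length expected");
    return memcmp(as_char, as_uchar, sizeof as_char) == 0;
}
```

Calling u8_literal_elem_type() from a small main and building once with and once without -fchar8_t is one way to observe the difference.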

When -fchar8_t support is enabled for non-C++ modes, the _CHAR8_T_SOURCE
macro is predefined.  This is the mechanism proposed to glibc to opt in
to declarations of the char8_t typedef and the c8rtomb and mbrtoc8 functions
proposed in N2653.  See [2].

Tested on Linux x86_64.

gcc/ChangeLog:

2021-05-31  Tom Honermann  <tom@honermann.net>

          * ginclude/stdatomic.h (atomic_char8_t, ATOMIC_CHAR8_T_LOCK_FREE):
            New typedef and macro.

gcc/c/ChangeLog:

2021-05-31  Tom Honermann  <tom@honermann.net>

          * c-parser.c (c_parser_string_literal): Use char8_t as the type of
            CPP_UTF8STRING when char8_t support is enabled.
          * c-typeck.c (digest_init): Handle initialization of an array
            of character type by a string literal with type array of
            unsigned char.

gcc/c-family/ChangeLog:

2021-05-31  Tom Honermann  <tom@honermann.net>

          * c-cppbuiltin.c (c_cpp_builtins): Define _CHAR8_T_SOURCE if
            char8_t support is enabled in non-C++ language modes.
          * c-lex.c (lex_string): Use char8_t as the type of
            CPP_UTF8STRING when char8_t support is enabled.
          * c-opts.c (c_common_handle_option): Inform the preprocessor if
            char8_t support is enabled.
          * c.opt (fchar8_t): Enable for C language modes.

libcpp/ChangeLog:

2021-05-31  Tom Honermann  <tom@honermann.net>

          * include/cpplib.h (cpp_options): Add char8.

Tom.

[1]: WG14 N2653
      "char8_t: A type for UTF-8 characters and strings (Revision 1)"
      http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm

[2]: C++20 P0482R6 and C2X N2653: support for char8_t, mbrtoc8(), and c8rtomb().
      [Patch 0]: https://sourceware.org/pipermail/libc-alpha/2021-June/127230.html
      [Patch 1]: https://sourceware.org/pipermail/libc-alpha/2021-June/127231.html
      [Patch 2]: https://sourceware.org/pipermail/libc-alpha/2021-June/127232.html
      [Patch 3]: https://sourceware.org/pipermail/libc-alpha/2021-June/127233.html

Comments

Joseph Myers June 7, 2021, 9:11 p.m. UTC | #1
On Sun, 6 Jun 2021, Tom Honermann via Gcc-patches wrote:

> When -fchar8_t support is enabled for non-C++ modes, the _CHAR8_T_SOURCE macro
> is predefined.  This is the mechanism proposed to glibc to opt-in to
> declarations of the char8_t typedef and c8rtomb and mbrtoc8 functions proposed
> in N2653.  See [2].

I don't think glibc should have such a feature test macro, and I don't 
think GCC should define such feature test macros either - _*_SOURCE macros 
are generally for the *user* to define to decide what namespace they want 
visible, not for the compiler to define.  Without proliferating new 
language dialects, __STDC_VERSION__ ought to be sufficient to communicate 
from the compiler to the library (including to GCC's own headers such as 
stdatomic.h).

Joseph Myers June 7, 2021, 9:12 p.m. UTC | #2
Also, it seems odd to add a new field to cpp_options without any code in 
libcpp that uses the value of that field.

Tom Honermann June 11, 2021, 3:52 p.m. UTC | #3
On 6/7/21 5:11 PM, Joseph Myers wrote:
> On Sun, 6 Jun 2021, Tom Honermann via Gcc-patches wrote:
>
>> When -fchar8_t support is enabled for non-C++ modes, the _CHAR8_T_SOURCE macro
>> is predefined.  This is the mechanism proposed to glibc to opt-in to
>> declarations of the char8_t typedef and c8rtomb and mbrtoc8 functions proposed
>> in N2653.  See [2].
> I don't think glibc should have such a feature test macro, and I don't
> think GCC should define such feature test macros either - _*_SOURCE macros
> are generally for the *user* to define to decide what namespace they want
> visible, not for the compiler to define.  Without proliferating new
> language dialects, __STDC_VERSION__ ought to be sufficient to communicate
> from the compiler to the library (including to GCC's own headers such as
> stdatomic.h).
>
In general I agree, but I think an exception is warranted in this case 
for a few reasons:

 1. The feature includes both core language changes (the change of type
    for u8 string literals) and library changes.  The library changes
    are not actually dependent on the core language change, but they are
    intended to be used together.
 2. Existing use of the char8_t identifier can be found in existing open
    source projects and likely exists in some closed source projects as
    well.  An opt-in approach avoids conflict and the need to
    conditionalize code based on gcc version.
 3. An opt-in approach enables evaluation of the feature prior to any
    WG14 approval.

Tom.

Tom Honermann June 11, 2021, 3:54 p.m. UTC | #4
On 6/7/21 5:12 PM, Joseph Myers wrote:
> Also, it seems odd to add a new field to cpp_options without any code in
> libcpp that uses the value of that field.
>
Ah, thank you.  That appears to be leftover code from prior 
experimentation and I failed to identify it as such when preparing the 
patch.  I'll provide a revised patch.

Tom.

Jakub Jelinek June 11, 2021, 4:01 p.m. UTC | #5
On Fri, Jun 11, 2021 at 11:52:41AM -0400, Tom Honermann via Gcc-patches wrote:
> On 6/7/21 5:11 PM, Joseph Myers wrote:
> > On Sun, 6 Jun 2021, Tom Honermann via Gcc-patches wrote:
> > 
> > > When -fchar8_t support is enabled for non-C++ modes, the _CHAR8_T_SOURCE macro
> > > is predefined.  This is the mechanism proposed to glibc to opt-in to
> > > declarations of the char8_t typedef and c8rtomb and mbrtoc8 functions proposed
> > > in N2653.  See [2].
> > I don't think glibc should have such a feature test macro, and I don't
> > think GCC should define such feature test macros either - _*_SOURCE macros
> > are generally for the *user* to define to decide what namespace they want
> > visible, not for the compiler to define.  Without proliferating new
> > language dialects, __STDC_VERSION__ ought to be sufficient to communicate
> > from the compiler to the library (including to GCC's own headers such as
> > stdatomic.h).
> > 
> In general I agree, but I think an exception is warranted in this case for a
> few reasons:
> 
> 1. The feature includes both core language changes (the change of type
>    for u8 string literals) and library changes.  The library changes
>    are not actually dependent on the core language change, but they are
>    intended to be used together.
> 2. Existing use of the char8_t identifier can be found in existing open
>    source projects and likely exists in some closed source projects as
>    well.  An opt-in approach avoids conflict and the need to
>    conditionalize code based on gcc version.
> 3. An opt-in approach enables evaluation of the feature prior to any
>    WG14 approval.

But calling it _CHAR8_T_SOURCE is weird and inconsistent with everything
else.
In C++, there is __cpp_char8_t 201811L predefined macro for char8_t.
Using that in C is not right, sure.
Often we use __SIZEOF_type__ macros not just for sizeof(), but also for
presence check of the types, like
#ifdef __SIZEOF_INT128__
__int128 i;
#else
long long i;
#endif
etc., while char8_t has sizeof (char8_t) == 1, perhaps predefining
__SIZEOF_CHAR8_T__ 1
instead of _CHAR8_T_SOURCE would be better?

	Jakub

Tom Honermann June 11, 2021, 4:20 p.m. UTC | #6
On 6/11/21 12:01 PM, Jakub Jelinek wrote:
> On Fri, Jun 11, 2021 at 11:52:41AM -0400, Tom Honermann via Gcc-patches wrote:
>> On 6/7/21 5:11 PM, Joseph Myers wrote:
>>> On Sun, 6 Jun 2021, Tom Honermann via Gcc-patches wrote:
>>>
>>>> When -fchar8_t support is enabled for non-C++ modes, the _CHAR8_T_SOURCE macro
>>>> is predefined.  This is the mechanism proposed to glibc to opt-in to
>>>> declarations of the char8_t typedef and c8rtomb and mbrtoc8 functions proposed
>>>> in N2653.  See [2].
>>> I don't think glibc should have such a feature test macro, and I don't
>>> think GCC should define such feature test macros either - _*_SOURCE macros
>>> are generally for the *user* to define to decide what namespace they want
>>> visible, not for the compiler to define.  Without proliferating new
>>> language dialects, __STDC_VERSION__ ought to be sufficient to communicate
>>> from the compiler to the library (including to GCC's own headers such as
>>> stdatomic.h).
>>>
>> In general I agree, but I think an exception is warranted in this case for a
>> few reasons:
>>
>> 1. The feature includes both core language changes (the change of type
>>     for u8 string literals) and library changes.  The library changes
>>     are not actually dependent on the core language change, but they are
>>     intended to be used together.
>> 2. Existing use of the char8_t identifier can be found in existing open
>>     source projects and likely exists in some closed source projects as
>>     well.  An opt-in approach avoids conflict and the need to
>>     conditionalize code based on gcc version.
>> 3. An opt-in approach enables evaluation of the feature prior to any
>>     WG14 approval.
> But calling it _CHAR8_T_SOURCE is weird and inconsistent with everything
> else.
> In C++, there is __cpp_char8_t 201811L predefined macro for char8_t.
> Using that in C is not right, sure.
> Often we use __SIZEOF_type__ macros not just for sizeof(), but also for
> presence check of the types, like
> #ifdef __SIZEOF_INT128__
> __int128 i;
> #else
> long long i;
> #endif
> etc., while char8_t has sizeof (char8_t) == 1, perhaps predefining
> __SIZEOF_CHAR8_T__ 1
> instead of _CHAR8_T_SOURCE would be better?

I'm open to whatever signaling mechanism would be preferred.  It took me 
a while to settle on _CHAR8_T_SOURCE as the mechanism to propose as I 
didn't find much for other precedents.

I agree that having _CHAR8_T_SOURCE be implied by the -fchar8_t option 
is unusual with respect to other feature test macros.  Is that what you 
find to be weird and inconsistent?

Predefining __SIZEOF_CHAR8_T__ would be consistent with 
__SIZEOF_WCHAR_T__, but kind of strange too since the size is always 1.

Perhaps a better approach would be to follow the __CHAR16_TYPE__ and 
__CHAR32_TYPE__ precedent and define __CHAR8_TYPE__ to unsigned char.  
That is likewise a bit strange since the type would always be unsigned 
char, but it does provide a bit more symmetry.  That could potentially 
have some use as well; for C++, it could be defined as char8_t and 
thereby reflect the difference between the two languages.  Perhaps it 
could be useful in the future as well if WG14 were to add distinct 
char8_t, char16_t, and char32_t types as C++ did (I'm not offering any 
prediction regarding the likelihood of that happening).
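
A minimal sketch of how a header might key off __CHAR8_TYPE__, in the same way existing headers use __CHAR16_TYPE__ and __CHAR32_TYPE__.  The names my_char8_t, MY_COMPILER_HAS_CHAR8, and my_c8len are hypothetical, and the fallback branch exists only so the sketch builds when the macro is not predefined.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* my_char8_t and MY_COMPILER_HAS_CHAR8 are hypothetical names used only
   for this sketch.  */
#ifdef __CHAR8_TYPE__
typedef __CHAR8_TYPE__ my_char8_t;   /* the compiler advertised the type */
#define MY_COMPILER_HAS_CHAR8 1
#else
typedef unsigned char my_char8_t;    /* fallback so the sketch still builds */
#define MY_COMPILER_HAS_CHAR8 0
#endif

/* In C the type is expected to be unsigned char either way.  */
static_assert(sizeof(my_char8_t) == 1, "char8_t should be one byte");
static_assert((my_char8_t)-1 == UCHAR_MAX, "char8_t should be unsigned");

/* Number of code units before the terminating zero.  */
static size_t my_c8len(const my_char8_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```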

Tom.

>
> 	Jakub
>

Jakub Jelinek June 11, 2021, 4:53 p.m. UTC | #7
On Fri, Jun 11, 2021 at 12:20:48PM -0400, Tom Honermann wrote:
> I'm open to whatever signaling mechanism would be preferred.  It took me a
> while to settle on _CHAR8_T_SOURCE as the mechanism to propose as I didn't
> find much for other precedents.
> 
> I agree that having _CHAR8_T_SOURCE be implied by the -fchar8_t option is
> unusual with respect to other feature test macros.  Is that what you find to
> be weird and inconsistent?
> 
> Predefining __SIZEOF_CHAR8_T__ would be consistent with __SIZEOF_WCHAR_T__,
> but kind of strange too since the size is always 1.
> 
> Perhaps a better approach would be to follow the __CHAR16_TYPE__ and
> __CHAR32_TYPE__ precedent and define __CHAR8_TYPE__ to unsigned char.  That
> is likewise a bit strange since the type would always be unsigned char, but
> it does provide a bit more symmetry.  That could potentially have some use
> as well; for C++, it could be defined as char8_t and thereby reflect the
> difference between the two languages.  Perhaps it could be useful in the
> future as well if WG14 were to add distinct char8_t, char16_t, and char32_t
> types as C++ did (I'm not offering any prediction regarding the likelihood
> of that happening).

C++ already predefines
#define __CHAR8_TYPE__ unsigned char
#define __CHAR16_TYPE__ short unsigned int
#define __CHAR32_TYPE__ unsigned int
for -std={c,gnu}++2{0,a,3,b} or -fchar8_t (unless -fno-char8_t), so I agree
just making sure __CHAR8_TYPE__ is defined to unsigned char even for C
is best.
And you probably don't need to do anything in the C patch for it,
void
c_stddef_cpp_builtins(void)
{
  builtin_define_with_value ("__SIZE_TYPE__", SIZE_TYPE, 0);
...
  if (flag_char8_t)
    builtin_define_with_value ("__CHAR8_TYPE__", CHAR8_TYPE, 0);
  builtin_define_with_value ("__CHAR16_TYPE__", CHAR16_TYPE, 0);
  builtin_define_with_value ("__CHAR32_TYPE__", CHAR32_TYPE, 0);
will do that.

	Jakub

Tom Honermann June 13, 2021, 1:40 p.m. UTC | #8
On 6/11/21 12:53 PM, Jakub Jelinek wrote:
> On Fri, Jun 11, 2021 at 12:20:48PM -0400, Tom Honermann wrote:
>> I'm open to whatever signaling mechanism would be preferred.  It took me a
>> while to settle on _CHAR8_T_SOURCE as the mechanism to propose as I didn't
>> find much for other precedents.
>>
>> I agree that having _CHAR8_T_SOURCE be implied by the -fchar8_t option is
>> unusual with respect to other feature test macros.  Is that what you find to
>> be weird and inconsistent?
>>
>> Predefining __SIZEOF_CHAR8_T__ would be consistent with __SIZEOF_WCHAR_T__,
>> but kind of strange too since the size is always 1.
>>
>> Perhaps a better approach would be to follow the __CHAR16_TYPE__ and
>> __CHAR32_TYPE__ precedent and define __CHAR8_TYPE__ to unsigned char.  That
>> is likewise a bit strange since the type would always be unsigned char, but
>> it does provide a bit more symmetry.  That could potentially have some use
>> as well; for C++, it could be defined as char8_t and thereby reflect the
>> difference between the two languages.  Perhaps it could be useful in the
>> future as well if WG14 were to add distinct char8_t, char16_t, and char32_t
>> types as C++ did (I'm not offering any prediction regarding the likelihood
>> of that happening).
> C++ already predefines
> #define __CHAR8_TYPE__ unsigned char
> #define __CHAR16_TYPE__ short unsigned int
> #define __CHAR32_TYPE__ unsigned int
> for -std={c,gnu}++2{0,a,3,b} or -fchar8_t (unless -fno-char8_t), so I agree
> just making sure __CHAR8_TYPE__ is defined to unsigned char even for C
> is best.
> And you probably don't need to do anything in the C patch for it,
> void
> c_stddef_cpp_builtins(void)
> {
>    builtin_define_with_value ("__SIZE_TYPE__", SIZE_TYPE, 0);
> ...
>    if (flag_char8_t)
>      builtin_define_with_value ("__CHAR8_TYPE__", CHAR8_TYPE, 0);
>    builtin_define_with_value ("__CHAR16_TYPE__", CHAR16_TYPE, 0);
>    builtin_define_with_value ("__CHAR32_TYPE__", CHAR32_TYPE, 0);
> will do that.

Thank you; I had forgotten that I had already done that work.  I 
confirmed that the proposed changes result in __CHAR8_TYPE__ being 
defined (the tests included with the patch already enforced it).
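
For illustration only, the following sketch mirrors the way the stdatomic.h change builds atomic_char8_t from __CHAR8_TYPE__.  It uses a differently named typedef so it does not clash with a header that already declares atomic_char8_t; my_atomic_char8_t, bump, and bump_from are hypothetical names, and the fallback branch is an assumption that lets the sketch build when __CHAR8_TYPE__ is not predefined.

```c
#include <assert.h>
#include <stdatomic.h>

/* my_atomic_char8_t is a hypothetical stand-in for atomic_char8_t.  */
#ifdef __CHAR8_TYPE__
typedef _Atomic __CHAR8_TYPE__ my_atomic_char8_t;
#else
typedef _Atomic unsigned char my_atomic_char8_t;  /* fallback for the sketch */
#endif

/* Atomically increment the counter and return the new value, which wraps
   modulo 256 because the underlying type is unsigned char.  */
static unsigned char bump(my_atomic_char8_t *counter)
{
    return (unsigned char)(atomic_fetch_add(counter, 1) + 1);
}

/* Start a fresh counter at `start`, bump it `times` times, and return
   the final stored value.  */
static unsigned char bump_from(unsigned char start, unsigned times)
{
    my_atomic_char8_t counter = start;
    while (times-- > 0)
        bump(&counter);
    return atomic_load(&counter);
}
```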

Tom.

>
> 	Jakub
>

Patch

commit c4260c7c49822522945377cc2fb93ee9830cefc8
Author: Tom Honermann <tom@honermann.net>
Date:   Sat Feb 13 09:02:34 2021 -0500

    N2653 char8_t for C: Language support
    
    This patch implements the core language and compiler dependent library
    changes proposed in WG14 N2653 for C.  The changes include:
    - Use of the existing -fchar8_t and -fno-char8_t options to opt-in to
      (or opt-out of) the following changes when compiling C code.
    - Change of type for UTF-8 string literals from array of char to
      array of char8_t (unsigned char).
    - A new atomic_char8_t typedef.
    - A new ATOMIC_CHAR8_T_LOCK_FREE macro defined in terms of a new
      predefined __GCC_ATOMIC_CHAR8_T_LOCK_FREE macro.
    
    When -fchar8_t support is enabled for non-C++ modes, the _CHAR8_T_SOURCE
    macro is predefined.  This is the mechanism proposed to glibc to opt-in
    to declarations of the char8_t typedef and c8rtomb and mbrtoc8 functions
    proposed in N2653.

diff --git a/gcc/c-family/c-cppbuiltin.c b/gcc/c-family/c-cppbuiltin.c
index 42b7604c9ac..3e944ec2b86 100644
--- a/gcc/c-family/c-cppbuiltin.c
+++ b/gcc/c-family/c-cppbuiltin.c
@@ -1467,6 +1467,11 @@  c_cpp_builtins (cpp_reader *pfile)
   if (flag_iso)
     cpp_define (pfile, "__STRICT_ANSI__");
 
+  /* Express intent for char8_t support in C (not C++) to the C library if
+     requested.  */
+  if (!c_dialect_cxx () && flag_char8_t)
+    cpp_define (pfile, "_CHAR8_T_SOURCE");
+
   if (!flag_signed_char)
     cpp_define (pfile, "__CHAR_UNSIGNED__");
 
diff --git a/gcc/c-family/c-lex.c b/gcc/c-family/c-lex.c
index c44e7a13489..e30e44e9f5c 100644
--- a/gcc/c-family/c-lex.c
+++ b/gcc/c-family/c-lex.c
@@ -1335,7 +1335,14 @@  lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
 	default:
 	case CPP_STRING:
 	case CPP_UTF8STRING:
-	  value = build_string (1, "");
+	  if (type == CPP_UTF8STRING && flag_char8_t)
+	    {
+	      value = build_string (TYPE_PRECISION (char8_type_node)
+				    / TYPE_PRECISION (char_type_node),
+				    "");  /* char8_t is 8 bits */
+	    }
+	  else
+	    value = build_string (1, "");
 	  break;
 	case CPP_STRING16:
 	  value = build_string (TYPE_PRECISION (char16_type_node)
diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
index 60b5802722c..eefc607dac6 100644
--- a/gcc/c-family/c-opts.c
+++ b/gcc/c-family/c-opts.c
@@ -718,6 +718,10 @@  c_common_handle_option (size_t scode, const char *arg, HOST_WIDE_INT value,
     case OPT_v:
       verbose = true;
       break;
+
+    case OPT_fchar8_t:
+      cpp_opts->char8 = value;
+      break;
     }
 
   switch (c_language)
diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
index 91929706aff..eadb2468aa9 100644
--- a/gcc/c-family/c.opt
+++ b/gcc/c-family/c.opt
@@ -1451,8 +1451,8 @@  C ObjC C++ ObjC++
 Where shorter, use canonicalized paths to systems headers.
 
 fchar8_t
-C++ ObjC++ Var(flag_char8_t) Init(-1)
-Enable the char8_t fundamental type and use it as the type for UTF-8 string
+C ObjC C++ ObjC++ Var(flag_char8_t) Init(-1)
+Enable the char8_t type and use it as the type for UTF-8 string
 and character literals.
 
 fcheck-pointer-bounds
diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c
index d71fd0abe90..501253d0ffe 100644
--- a/gcc/c/c-parser.c
+++ b/gcc/c/c-parser.c
@@ -7425,7 +7425,14 @@  c_parser_string_literal (c_parser *parser, bool translate, bool wide_ok)
 	default:
 	case CPP_STRING:
 	case CPP_UTF8STRING:
-	  value = build_string (1, "");
+	  if (type == CPP_UTF8STRING && flag_char8_t)
+	    {
+	      value = build_string (TYPE_PRECISION (char8_type_node)
+				    / TYPE_PRECISION (char_type_node),
+				    "");  /* char8_t is 8 bits */
+	    }
+	  else
+	    value = build_string (1, "");
 	  break;
 	case CPP_STRING16:
 	  value = build_string (TYPE_PRECISION (char16_type_node)
@@ -7450,9 +7457,14 @@  c_parser_string_literal (c_parser *parser, bool translate, bool wide_ok)
     {
     default:
     case CPP_STRING:
-    case CPP_UTF8STRING:
       TREE_TYPE (value) = char_array_type_node;
       break;
+    case CPP_UTF8STRING:
+      if (flag_char8_t)
+	TREE_TYPE (value) = char8_array_type_node;
+      else
+	TREE_TYPE (value) = char_array_type_node;
+      break;
     case CPP_STRING16:
       TREE_TYPE (value) = char16_array_type_node;
       break;
diff --git a/gcc/c/c-typeck.c b/gcc/c/c-typeck.c
index 5f322874423..1fa95949919 100644
--- a/gcc/c/c-typeck.c
+++ b/gcc/c/c-typeck.c
@@ -7979,7 +7979,8 @@  digest_init (location_t init_loc, tree type, tree init, tree origtype,
 
 	  if (char_array)
 	    {
-	      if (typ2 != char_type_node)
+	      if (typ2 != char_type_node
+		  && typ2 != unsigned_char_type_node) /* char8_t literal */
 		incompat_string_cst = true;
 	    }
 	  else if (!comptypes (typ1, typ2))
diff --git a/gcc/ginclude/stdatomic.h b/gcc/ginclude/stdatomic.h
index 23c07be2a48..6629902a666 100644
--- a/gcc/ginclude/stdatomic.h
+++ b/gcc/ginclude/stdatomic.h
@@ -49,6 +49,9 @@  typedef _Atomic long atomic_long;
 typedef _Atomic unsigned long atomic_ulong;
 typedef _Atomic long long atomic_llong;
 typedef _Atomic unsigned long long atomic_ullong;
+#if defined(_CHAR8_T_SOURCE)
+typedef _Atomic __CHAR8_TYPE__ atomic_char8_t;
+#endif
 typedef _Atomic __CHAR16_TYPE__ atomic_char16_t;
 typedef _Atomic __CHAR32_TYPE__ atomic_char32_t;
 typedef _Atomic __WCHAR_TYPE__ atomic_wchar_t;
@@ -97,6 +100,9 @@  extern void atomic_signal_fence (memory_order);
 
 #define ATOMIC_BOOL_LOCK_FREE		__GCC_ATOMIC_BOOL_LOCK_FREE
 #define ATOMIC_CHAR_LOCK_FREE		__GCC_ATOMIC_CHAR_LOCK_FREE
+#if defined(_CHAR8_T_SOURCE)
+#define ATOMIC_CHAR8_T_LOCK_FREE	__GCC_ATOMIC_CHAR8_T_LOCK_FREE
+#endif
 #define ATOMIC_CHAR16_T_LOCK_FREE	__GCC_ATOMIC_CHAR16_T_LOCK_FREE
 #define ATOMIC_CHAR32_T_LOCK_FREE	__GCC_ATOMIC_CHAR32_T_LOCK_FREE
 #define ATOMIC_WCHAR_T_LOCK_FREE	__GCC_ATOMIC_WCHAR_T_LOCK_FREE
diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
index 7e840635a38..4c90f8bbbda 100644
--- a/libcpp/include/cpplib.h
+++ b/libcpp/include/cpplib.h
@@ -358,6 +358,9 @@  struct cpp_options
   /* Nonzero means process u8 prefixed character literals (UTF-8).  */
   unsigned char utf8_char_literals;
 
+  /* Nonzero means char8_t support is enabled.  */
+  unsigned char char8;
+
   /* Nonzero means process r/R raw strings.  If this is set, uliterals
      must be set as well.  */
   unsigned char rliterals;