[C++] PR c++/91370 - Implement P1041R4 and P1139R2 - Stronger Unicode reqs

Message ID 20191107153859.GI4650@tucnak
State New

Commit Message

Jakub Jelinek Nov. 7, 2019, 3:38 p.m. UTC
Hi!

GCC already uses UTF-16 and UTF-32 for char16_t and char32_t string literals,
so I believe P1041R4 is already implemented with no changes needed.

While going through P1139R2, I realized that we have not been handling two
rules that date back to C++11 correctly:
"If the value is not representable within 16 bits, the program is ill-formed. A char16_t
literal containing multiple c-chars is ill-formed."
and
"A char32_t literal containing multiple c-chars is ill-formed."
We only warned about these cases rather than emitting an error.  This
differs from C11, where the standard makes the behavior
implementation-defined.
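
To make the "representable within 16 bits" rule concrete, here is a minimal sketch of the check it implies.  The helper names are hypothetical and are not part of libcpp; the input is assumed to already be a valid Unicode scalar value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, for illustration only: number of UTF-16 code units
// needed for CP.  Assumes CP is a valid scalar value (not a surrogate,
// not above U+10FFFF).
constexpr int utf16_code_units(std::uint32_t cp)
{
  return cp <= 0xFFFF ? 1 : 2;  // values above U+FFFF need a surrogate pair
}

// A char16_t character literal with a single c-char is well-formed only
// when that c-char fits in one 16-bit code unit.
constexpr bool fits_char16_literal(std::uint32_t cp)
{
  return utf16_code_units(cp) == 1;
}

// Well-formed: U+2753 fits in 16 bits.
static_assert(u'\u2753' == 0x2753, "single code unit");

// Ill-formed under the quoted wording (left commented out on purpose):
// const char16_t bad1 = u'\U0001F914';  // needs two UTF-16 code units
// const char16_t bad2 = u'ab';          // multiple c-chars
// const char32_t bad3 = U'ab';          // multiple c-chars
```

Note that U'\U0001F914' stays valid, because any scalar value fits in a single char32_t.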

Furthermore, the C++17 wording:
"If the value is not representable with a single UTF-8 code unit,
the program is ill-formed. A UTF-8 character literal containing multiple c-chars is
ill-formed."
wasn't handled as an error either; instead u8'ab' was accepted as an int
with a warning, and similarly u8'\u00c0' etc.  u8 character literals exist
only in C++17 and later, not in C, so there is no need to worry about C here.
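
As an illustration of "representable with a single UTF-8 code unit", the following sketch counts the code units UTF-8 needs for a code point.  The function names are made up for this example and the input is assumed to be a valid scalar value; only values up to U+007F pass.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, for illustration only: number of UTF-8 code units
// needed for CP.  Assumes CP is a valid Unicode scalar value.
constexpr int utf8_code_units(std::uint32_t cp)
{
  return cp <= 0x7F ? 1
       : cp <= 0x7FF ? 2
       : cp <= 0xFFFF ? 3
       : 4;
}

// A u8 character literal is well-formed only for a single c-char that
// encodes to exactly one code unit.
constexpr bool fits_utf8_char_literal(std::uint32_t cp)
{
  return utf8_code_units(cp) == 1;
}

static_assert(fits_utf8_char_literal('a'), "ASCII fits in one code unit");
static_assert(utf8_code_units(0x00C0) == 2, "U+00C0 needs two code units");
```

So u8'a' is fine, while u8'\u00c0', u8'\u0124' and u8'\U00064321' each need more than one code unit and are rejected.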

And lastly, P1139R2 makes it clear that code points above U+10FFFF are
ill-formed, but that is something Eric already implemented in r276167.
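
For reference, the range checks that the new ucn2.C test exercises can be sketched as below.  This is a simplified model with hypothetical names, not the libcpp code from r276167; it covers only the surrogate range and the U+10FFFF upper bound.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, for illustration only.

// Surrogate code points (U+D800..U+DFFF) cannot be named by a UCN;
// the test expects "is not a valid universal character" for them.
constexpr bool is_surrogate(std::uint32_t cp)
{
  return cp >= 0xD800 && cp <= 0xDFFF;
}

// Values above U+10FFFF lie outside the Unicode codespace;
// the test expects "is outside the UCS codespace" for them.
constexpr bool in_codespace(std::uint32_t cp)
{
  return cp <= 0x10FFFF;
}

// Combined check for the two conditions above.
constexpr bool is_scalar_value(std::uint32_t cp)
{
  return in_codespace(cp) && !is_surrogate(cp);
}

static_assert(is_scalar_value(0x1F914), "U+1F914 is a valid scalar value");
static_assert(!is_scalar_value(0x110000), "first value past the codespace");
```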

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

I believe we can now claim to have both P1041R4 and P1139R2 implemented.

2019-11-07  Jakub Jelinek  <jakub@redhat.com>

	PR c++/91370 - Implement P1041R4 and P1139R2 - Stronger Unicode reqs
	* charset.c (narrow_str_to_charconst): Add TYPE argument.  For
	CPP_UTF8CHAR diagnose whenever number of chars is > 1, using
	CPP_DL_ERROR instead of CPP_DL_WARNING.
	(wide_str_to_charconst): For CPP_CHAR16 or CPP_CHAR32, use
	CPP_DL_ERROR instead of CPP_DL_WARNING when multiple char16_t
	or char32_t chars are needed.
	(cpp_interpret_charconst): Adjust narrow_str_to_charconst caller.

	* g++.dg/cpp1z/utf8-neg.C: Expect errors rather than -Wmultichar
	warnings.
	* g++.dg/ext/utf16-4.C: Expect errors rather than warnings.
	* g++.dg/ext/utf32-4.C: Likewise.
	* g++.dg/cpp2a/ucn2.C: New test.
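
The severity selection described in the charset.c entries above can be modelled in isolation roughly as follows.  This is only a sketch: LitKind, Severity and both functions are hypothetical stand-ins for the libcpp token types, diagnostic levels and the logic inside narrow_str_to_charconst and wide_str_to_charconst, and the wide case is simplified to a count of code units.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for libcpp's token types and diagnostic levels.
enum class LitKind { Char, Utf8Char, WChar, Char16, Char32 };
enum class Severity { None, Warning, Error };

// Narrow literals: CHARS_SEEN is the number of execution characters and
// MAX_CHARS is how many fit into an int (int_precision / char_precision).
inline Severity narrow_severity(LitKind kind, std::size_t chars_seen,
                                std::size_t max_chars, bool warn_multichar)
{
  if (kind == LitKind::Utf8Char)
    max_chars = 1;              // u8 literals may hold exactly one char
  if (chars_seen > max_chars)   // "character constant too long for its type"
    return kind == LitKind::Utf8Char ? Severity::Error : Severity::Warning;
  if (chars_seen > 1 && warn_multichar)
    return Severity::Warning;   // ordinary multi-character constant
  return Severity::None;
}

// Wide literals: more than one code unit always overflows the type; it is
// an error only for char16_t/char32_t literals in C++.
inline Severity wide_severity(LitKind kind, std::size_t units_seen,
                              bool cplusplus)
{
  if (units_seen <= 1)
    return Severity::None;
  const bool strict = cplusplus
                      && (kind == LitKind::Char16 || kind == LitKind::Char32);
  return strict ? Severity::Error : Severity::Warning;
}
```

With these definitions, narrow_severity (LitKind::Utf8Char, 2, 4, true) yields Severity::Error, while the same call with LitKind::Char still yields only a warning, which matches the updated test expectations.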


	Jakub

Comments

Jason Merrill Nov. 7, 2019, 8:02 p.m. UTC | #1
OK.


Patch

--- libcpp/charset.c.jj	2019-09-27 10:32:17.127641484 +0200
+++ libcpp/charset.c	2019-11-07 13:40:19.616040925 +0100
@@ -1881,10 +1881,11 @@  cpp_interpret_string_notranslate (cpp_re
 /* Subroutine of cpp_interpret_charconst which performs the conversion
    to a number, for narrow strings.  STR is the string structure returned
    by cpp_interpret_string.  PCHARS_SEEN and UNSIGNEDP are as for
-   cpp_interpret_charconst.  */
+   cpp_interpret_charconst.  TYPE is the token type.  */
 static cppchar_t
 narrow_str_to_charconst (cpp_reader *pfile, cpp_string str,
-			 unsigned int *pchars_seen, int *unsignedp)
+			 unsigned int *pchars_seen, int *unsignedp,
+			 enum cpp_ttype type)
 {
   size_t width = CPP_OPTION (pfile, char_precision);
   size_t max_chars = CPP_OPTION (pfile, int_precision) / width;
@@ -1913,10 +1914,12 @@  narrow_str_to_charconst (cpp_reader *pfi
 	result = c;
     }
 
+  if (type == CPP_UTF8CHAR)
+    max_chars = 1;
   if (i > max_chars)
     {
       i = max_chars;
-      cpp_error (pfile, CPP_DL_WARNING,
+      cpp_error (pfile, type == CPP_UTF8CHAR ? CPP_DL_ERROR : CPP_DL_WARNING,
 		 "character constant too long for its type");
     }
   else if (i > 1 && CPP_OPTION (pfile, warn_multichar))
@@ -1980,7 +1983,9 @@  wide_str_to_charconst (cpp_reader *pfile
      character exactly fills a wchar_t, so a multi-character wide
      character constant is guaranteed to overflow.  */
   if (str.len > nbwc * 2)
-    cpp_error (pfile, CPP_DL_WARNING,
+    cpp_error (pfile, (CPP_OPTION (pfile, cplusplus)
+		       && (type == CPP_CHAR16 || type == CPP_CHAR32))
+		      ? CPP_DL_ERROR : CPP_DL_WARNING,
 	       "character constant too long for its type");
 
   /* Truncate the constant to its natural width, and simultaneously
@@ -2038,7 +2043,8 @@  cpp_interpret_charconst (cpp_reader *pfi
     result = wide_str_to_charconst (pfile, str, pchars_seen, unsignedp,
 				    token->type);
   else
-    result = narrow_str_to_charconst (pfile, str, pchars_seen, unsignedp);
+    result = narrow_str_to_charconst (pfile, str, pchars_seen, unsignedp,
+				      token->type);
 
   if (str.text != token->val.str.text)
     free ((void *)str.text);
--- gcc/testsuite/g++.dg/cpp1z/utf8-neg.C.jj	2018-10-22 09:28:06.380657152 +0200
+++ gcc/testsuite/g++.dg/cpp1z/utf8-neg.C	2019-11-07 14:34:23.929317534 +0100
@@ -1,6 +1,6 @@ 
 /* { dg-do compile { target c++17 } } */
 
 const static char c0 = u8'';		// { dg-error "empty character" }
-const static char c1 = u8'ab';  	// { dg-warning "multi-character character constant" }
-const static char c2 = u8'\u0124';	// { dg-warning "multi-character character constant" }
-const static char c3 = u8'\U00064321';  // { dg-warning "multi-character character constant" }
+const static char c1 = u8'ab';  	// { dg-error "character constant too long for its type" }
+const static char c2 = u8'\u0124';	// { dg-error "character constant too long for its type" }
+const static char c3 = u8'\U00064321';  // { dg-error "character constant too long for its type" }
--- gcc/testsuite/g++.dg/ext/utf16-4.C.jj	2017-06-02 09:01:16.403322989 +0200
+++ gcc/testsuite/g++.dg/ext/utf16-4.C	2019-11-07 14:30:42.433643404 +0100
@@ -4,8 +4,8 @@ 
 
 
 const static char16_t	c0 = u'';		/* { dg-error "empty character" } */
-const static char16_t	c1 = u'ab';		/* { dg-warning "constant too long" } */
-const static char16_t	c2 = u'\U00064321';	/* { dg-warning "constant too long" } */
+const static char16_t	c1 = u'ab';		/* { dg-error "constant too long" } */
+const static char16_t	c2 = u'\U00064321';	/* { dg-error "constant too long" } */
 
 const static char16_t	c3 = 'a';
 const static char16_t	c4 = U'a';
--- gcc/testsuite/g++.dg/ext/utf32-4.C.jj	2014-03-10 10:49:55.292085832 +0100
+++ gcc/testsuite/g++.dg/ext/utf32-4.C	2019-11-07 14:31:19.745083152 +0100
@@ -3,13 +3,13 @@ 
 /* { dg-do compile { target c++11 } } */
 
 const static char32_t	c0 = U'';		/* { dg-error "empty character" } */
-const static char32_t	c1 = U'ab';		/* { dg-warning "constant too long" } */
+const static char32_t	c1 = U'ab';		/* { dg-error "constant too long" } */
 const static char32_t	c2 = U'\U00064321';
 
 const static char32_t	c3 = 'a';
 const static char32_t	c4 = u'a';
 const static char32_t	c5 = u'\u2029';
-const static char32_t	c6 = u'\U00064321';	/* { dg-warning "constant too long" } */
+const static char32_t	c6 = u'\U00064321';	/* { dg-error "constant too long" } */
 const static char32_t	c7 = L'a';
 const static char32_t	c8 = L'\u2029';
 const static char32_t	c9 = L'\U00064321';     /* { dg-warning "constant too long" "" { target { ! 4byte_wchar_t } } } */  
--- gcc/testsuite/g++.dg/cpp2a/ucn2.C.jj	2019-11-07 13:56:46.356219953 +0100
+++ gcc/testsuite/g++.dg/cpp2a/ucn2.C	2019-11-07 14:21:34.488871186 +0100
@@ -0,0 +1,30 @@ 
+// P1139R2
+// { dg-do compile { target c++11 } }
+// { dg-additional-options "-fchar8_t" { target c++17_down } }
+
+const char16_t *a = u"\U0001F914\u2753";
+const char32_t *b = U"\U0001F914\u2753";
+const char16_t *c = u"\uD802";		// { dg-error "is not a valid universal character" }
+const char16_t *d = u"\U0000DFF0";	// { dg-error "is not a valid universal character" }
+const char16_t *e = u"\U00110000";	// { dg-error "is outside the UCS codespace" "" { target c++2a } }
+					// { dg-error "converting UCN to execution character set" "" { target *-*-* } .-1 }
+const char32_t *f = U"\uD802";		// { dg-error "is not a valid universal character" }
+const char32_t *g = U"\U0000DFF0";	// { dg-error "is not a valid universal character" }
+const char32_t *h = U"\U00110001";	// { dg-error "is outside the UCS codespace" "" { target c++2a } }
+#if __cpp_unicode_characters >= 201411
+const char8_t i = u8'\u00C0';		// { dg-error "character constant too long for its type" "" { target c++17 } }
+#endif
+const char16_t j = u'\U0001F914';	// { dg-error "character constant too long for its type" }
+const char32_t k = U'\U0001F914';
+#if __cpp_unicode_characters >= 201411
+const char8_t l = u8'ab';		// { dg-error "character constant too long for its type" "" { target c++17 } }
+#endif
+const char16_t m = u'ab';		// { dg-error "character constant too long for its type" }
+const char32_t n = U'ab';		// { dg-error "character constant too long for its type" }
+#if __cpp_unicode_characters >= 201411
+const char8_t o = u8'\U00110002';	// { dg-error "is outside the UCS codespace" "" { target c++2a } }
+					// { dg-error "character constant too long for its type" "" { target c++17 } .-1 }
+#endif
+const char16_t p = u'\U00110003';	// { dg-error "is outside the UCS codespace" "" { target c++2a } }
+					// { dg-error "converting UCN to execution character set" "" { target *-*-* } .-1 }
+const char32_t q = U'\U00110004';	// { dg-error "is outside the UCS codespace" "" { target c++2a } }