diff mbox series

[libcpp] Issue a pedantic warning for UCNs outside UCS codespace

Message ID 46160747.VyTTuRmQVZ@polaris
State New
Headers show
Series [libcpp] Issue a pedantic warning for UCNs outside UCS codespace | expand

Commit Message

Eric Botcazou Sept. 24, 2019, 5:22 p.m. UTC
Hi,

the Universal Character Names accepted by the C family of compilers are mapped 
to those of ISO/IEC 10646, which defines the Universal Character Set codespace 
as the range 0-0x10FFFF inclusive.  The upper bound is already enforced for 
identifiers but not for literals, so the following code is accepted in C99:

#include <stddef.h>

wchar_t a = L'\U00110000';

whereas it is rejected with an error by other compilers (Clang, MSVC).

I'm not sure whether the compiler is really equired to issue a diagnostic in 
this case.  Moreover a few tests in the testsuite manipulate UCNs outside the 
UCS codespace.  That's why I suggest issuing a pedantic warning.

Tested on x86_64-suse-linux, OK for the mainline?


2019-09-24  Eric Botcazou  <ebotcazou@adacore.com>

libcpp/
	* charset.c (UCS_LIMIT): New macro.
	(ucn_valid_in_identifier): Use it instead of a hardcoded constant.
	(_cpp_valid_ucn): Issue a pedantic warning for UCNs larger than
	UCS_LIMIT outside of identifiers.


2019-09-24  Eric Botcazou  <ebotcazou@adacore.com>

gcc/testsuite/
	* gcc.dg/cpp/ucs.c: Add test for new warning and adjust.
	* gcc.dg/cpp/utf8-5byte-1.c: Add -w to the options.
	* gcc.dg/attr-alias-5.c: Likewise.

Comments

Florian Weimer Sept. 24, 2019, 6:41 p.m. UTC | #1
* Eric Botcazou:

> the Universal Character Names accepted by the C family of compilers
> are mapped to those of ISO/IEC 10646, which defines the Universal
> Character Set codespace as the range 0-0x10FFFF inclusive.  The
> upper bound is already enforced for identifiers but not for
> literals, so the following code is accepted in C99:
>
> #include <stddef.h>
>
> wchar_t a = L'\U00110000';
>
> whereas it is rejected with an error by other compilers (Clang, MSVC).
>
> I'm not sure whether the compiler is really equired to issue a diagnostic in 
> this case.  Moreover a few tests in the testsuite manipulate UCNs outside the 
> UCS codespace.  That's why I suggest issuing a pedantic warning.
>
> Tested on x86_64-suse-linux, OK for the mainline?

Since this is a pedantic warning …

I think this has to depend on the C standards version.  I think each C
standard needs to be read against the edition of ISO 10646 current at
the time of standards approval (the references are sadly not
versioned, so the version is implied).  Early versions of ISO 10646
definitely do not have the codespace restriction you mention.
Eric Botcazou Sept. 24, 2019, 8:06 p.m. UTC | #2
> I think this has to depend on the C standards version.  I think each C
> standard needs to be read against the edition of ISO 10646 current at
> the time of standards approval (the references are sadly not
> versioned, so the version is implied).  Early versions of ISO 10646
> definitely do not have the codespace restriction you mention.

Note the already existing hardcoded check in ucn_valid_in_identifier though.
Joseph Myers Sept. 25, 2019, 11:19 p.m. UTC | #3
On Tue, 24 Sep 2019, Florian Weimer wrote:

> I think this has to depend on the C standards version.  I think each C
> standard needs to be read against the edition of ISO 10646 current at
> the time of standards approval (the references are sadly not
> versioned, so the version is implied).  Early versions of ISO 10646
> definitely do not have the codespace restriction you mention.

Undated references aren't implicitly dated to the version when the 
standard was published.  The ISO/IEC Directives, Part 2 (Principles and 
rules for the structure and drafting of ISO and IEC documents) (2018 
edition, subclause 10.4) 
<https://www.iso.org/sites/directives/current/part2/index.xhtml#_idTextAnchor122> 
say:

  Undated references may be made:

  * only to a complete document;

  * if it will be possible to use all future changes of the referenced 
  document for the purposes of the referring document;

  * when it is understood that the reference will include all amendments 
  to and revisions of the referenced document.

I think that's clear that the latest version at the time the standard is 
used applies (so if the document in the undated normative reference is 
revised, that effectively changes the requirements of the standad version 
referencing it).
Joseph Myers Sept. 25, 2019, 11:33 p.m. UTC | #4
On Tue, 24 Sep 2019, Eric Botcazou wrote:

> Hi,
> 
> the Universal Character Names accepted by the C family of compilers are mapped 
> to those of ISO/IEC 10646, which defines the Universal Character Set codespace 
> as the range 0-0x10FFFF inclusive.  The upper bound is already enforced for 
> identifiers but not for literals, so the following code is accepted in C99:
> 
> #include <stddef.h>
> 
> wchar_t a = L'\U00110000';
> 
> whereas it is rejected with an error by other compilers (Clang, MSVC).
> 
> I'm not sure whether the compiler is really equired to issue a diagnostic in 
> this case.  Moreover a few tests in the testsuite manipulate UCNs outside the 
> UCS codespace.  That's why I suggest issuing a pedantic warning.

For C, I think such UCNs violate the Semantics but not the Constraints on 
UCNs, so no diagnostic is actually required in C, although it is permitted 
as a pedwarn / error.

However, while C++ doesn't have that Semantics / Constraints division, 
it's also the case that before C++2a, C++ only has a dated normative 
reference to ISO/IEC 10646-1:1993 (C++2a adds an undated reference and 
says the dated one is only for deprecated features, as well as explicitly 
making such UCNs outside the ISO 10646 code point range ill-formed).  So I 
think that for C++, this is only correct as an error / pedwarn in the 
C++2a case.
Joseph Myers Sept. 25, 2019, 11:37 p.m. UTC | #5
On Tue, 24 Sep 2019, Eric Botcazou wrote:

> > I think this has to depend on the C standards version.  I think each C
> > standard needs to be read against the edition of ISO 10646 current at
> > the time of standards approval (the references are sadly not
> > versioned, so the version is implied).  Early versions of ISO 10646
> > definitely do not have the codespace restriction you mention.
> 
> Note the already existing hardcoded check in ucn_valid_in_identifier though.

No C or C++ standard version allows characters outside that range in 
identifiers, so that check is independent of the ISO 10646 version being 
used.
Eric Botcazou Sept. 26, 2019, 11:19 a.m. UTC | #6
> For C, I think such UCNs violate the Semantics but not the Constraints on
> UCNs, so no diagnostic is actually required in C, although it is permitted
> as a pedwarn / error.
> 
> However, while C++ doesn't have that Semantics / Constraints division,
> it's also the case that before C++2a, C++ only has a dated normative
> reference to ISO/IEC 10646-1:1993 (C++2a adds an undated reference and
> says the dated one is only for deprecated features, as well as explicitly
> making such UCNs outside the ISO 10646 code point range ill-formed).  So I
> think that for C++, this is only correct as an error / pedwarn in the
> C++2a case.

OK, thanks for the exegesis. ;-)  Revision version attached.


2019-09-26  Eric Botcazou  <ebotcazou@adacore.com>

libcpp/
	* charset.c (UCS_LIMIT): New macro.
	(ucn_valid_in_identifier): Use it instead of a hardcoded constant.
	(_cpp_valid_ucn): Issue a pedantic warning for UCNs larger than
	UCS_LIMIT outside of identifiers in C and in C++2a.


2019-09-26  Eric Botcazou  <ebotcazou@adacore.com>

gcc/testsuite/
	* gcc.dg/cpp/ucs.c: Add test for new warning and adjust.
	* gcc.dg/cpp/utf8-5byte-1.c: Add -w to the options.
	* gcc.dg/attr-alias-5.c: Likewise.
	* g++.dg/cpp/ucn-1.C: Add test for new warning.
	* g++.dg/cpp2a/ucn1.C: New test.
Joseph Myers Sept. 26, 2019, 3:54 p.m. UTC | #7
On Thu, 26 Sep 2019, Eric Botcazou wrote:

> > For C, I think such UCNs violate the Semantics but not the Constraints on
> > UCNs, so no diagnostic is actually required in C, although it is permitted
> > as a pedwarn / error.
> > 
> > However, while C++ doesn't have that Semantics / Constraints division,
> > it's also the case that before C++2a, C++ only has a dated normative
> > reference to ISO/IEC 10646-1:1993 (C++2a adds an undated reference and
> > says the dated one is only for deprecated features, as well as explicitly
> > making such UCNs outside the ISO 10646 code point range ill-formed).  So I
> > think that for C++, this is only correct as an error / pedwarn in the
> > C++2a case.
> 
> OK, thanks for the exegesis. ;-)  Revision version attached.

Checking "CPP_OPTION (pfile, lang) == CLK_CXX2A" is problematic because 
future versions later than C++2a should be handled the same as C++2a.

The only place I see doing something similar (outside of init.c, most 
version conditionals are handled via language flags set there) does 
"CPP_OPTION (pfile, lang) > CLK_CXX11" (for "In C++14 and up these 
suffixes are in the standard library, so treat them as user-defined 
literals.", two places doing the same comparison).  So I think that 
"CPP_OPTION (pfile, lang) > CLK_CXX17" is the right thing to replace the 
comparisons against CLK_CXX2A and CLK_GNUCXX2A.

The patch is OK with that change.
diff mbox series

Patch

Index: libcpp/charset.c
===================================================================
--- libcpp/charset.c	(revision 275988)
+++ libcpp/charset.c	(working copy)
@@ -901,6 +901,9 @@  struct ucnrange {
 };
 #include "ucnid.h"
 
+/* ISO 10646 defines the UCS codespace as the range 0-0x10FFFF inclusive.  */
+#define UCS_LIMIT 0x10FFFF
+
 /* Returns 1 if C is valid in an identifier, 2 if C is valid except at
    the start of an identifier, and 0 if C is not valid in an
    identifier.  We assume C has already gone through the checks of
@@ -915,7 +918,7 @@  ucn_valid_in_identifier (cpp_reader *pfi
   int mn, mx, md;
   unsigned short valid_flags, invalid_start_flags;
 
-  if (c > 0x10FFFF)
+  if (c > UCS_LIMIT)
     return 0;
 
   mn = 0;
@@ -1016,6 +1019,9 @@  ucn_valid_in_identifier (cpp_reader *pfi
    whose short identifier is less than 00A0 other than 0024 ($), 0040 (@),
    or 0060 (`), nor one in the range D800 through DFFF inclusive.
 
+   If the hexadecimal value is larger than the upper bound of the UCS
+   codespace specified in ISO/IEC 10646, a pedantic warning is issued.
+
    *PSTR must be preceded by "\u" or "\U"; it is assumed that the
    buffer end is delimited by a non-hex digit.  Returns false if the
    UCN has not been consumed, true otherwise.
@@ -1135,6 +1141,10 @@  _cpp_valid_ucn (cpp_reader *pfile, const
    "universal character %.*s is not valid at the start of an identifier",
 		   (int) (str - base), base);
     }
+  else if (result > UCS_LIMIT)
+    cpp_error (pfile, CPP_DL_PEDWARN,
+	       "%.*s is outside the UCS codespace",
+	       (int) (str - base), base);
 
   *cp = result;
   return true;
Index: gcc/testsuite/gcc.dg/attr-alias-5.c
===================================================================
--- gcc/testsuite/gcc.dg/attr-alias-5.c	(revision 275988)
+++ gcc/testsuite/gcc.dg/attr-alias-5.c	(working copy)
@@ -1,7 +1,7 @@ 
 /* Verify diagnostics for aliases to strings containing extended
    identifiers or bad characters.  */
 /* { dg-do compile } */
-/* { dg-options "-std=gnu99" } */
+/* { dg-options "-std=gnu99 -w" } */
 /* { dg-require-alias "" } */
 /* { dg-require-ascii-locale "" } */
 /* { dg-skip-if "" { powerpc*-*-aix* } } */
Index: gcc/testsuite/gcc.dg/cpp/ucs.c
===================================================================
--- gcc/testsuite/gcc.dg/cpp/ucs.c	(revision 275988)
+++ gcc/testsuite/gcc.dg/cpp/ucs.c	(working copy)
@@ -39,7 +39,7 @@ 
 #endif
 
 #if WCHAR_MAX >= 0x7ffffff
-# if L'\U1234abcd' != 0x1234abcd
+# if L'\U1234abcd' != 0x1234abcd /* { dg-warning "outside" "" } */
 #  error bad long ucs	/* { dg-bogus "bad" "bad U1234abcd evaluation" } */
 # endif
 #endif
@@ -49,7 +49,7 @@  void foo ()
   int c;
 
   c = L'\ubad';		/* { dg-error "incomplete" "incomplete UCN 1" } */
-  c = L"\U1234"[0];	/* { dg-error "incomplete" "incompete UCN 2" } */
+  c = L"\U1234"[0];	/* { dg-error "incomplete" "incomplete UCN 2" } */
 
   c = L'\u000x';	/* { dg-error "incomplete" "non-hex digit in UCN" } */
   /* If sizeof(HOST_WIDE_INT) > sizeof(wchar_t), we can get a multi-character
@@ -64,4 +64,6 @@  void foo ()
   c = '\u0025';		/* { dg-error "not a valid" "0025 invalid UCN" } */
   c = L"\uD800"[0];	/* { dg-error "not a valid" "D800 invalid UCN" } */
   c = L'\U0000DFFF';	/* { dg-error "not a valid" "DFFF invalid UCN" } */
+
+  c = L'\U00110000';	/* { dg-warning "outside" "110000 outside UCS" } */
 }
Index: gcc/testsuite/gcc.dg/cpp/utf8-5byte-1.c
===================================================================
--- gcc/testsuite/gcc.dg/cpp/utf8-5byte-1.c	(revision 275988)
+++ gcc/testsuite/gcc.dg/cpp/utf8-5byte-1.c	(working copy)
@@ -1,7 +1,7 @@ 
 /* Test for bug in conversions from 5-byte UTF-8 sequences in
    cpplib.  */
 /* { dg-do run { target { 4byte_wchar_t } } } */
-/* { dg-options "-std=gnu99" } */
+/* { dg-options "-std=gnu99 -w" } */
 
 extern void abort (void);
 extern void exit (int);