From patchwork Thu Nov 14 20:20:19 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Joseph Myers
X-Patchwork-Id: 1195204
Date: Thu, 14 Nov 2019 20:20:19 +0000
From: Joseph Myers
Subject: Support UTF-8 character constants for C2x

C2x adds u8'' character constants to C. This patch adds the
corresponding GCC support. Most of the support was already present
for C++ and just needed enabling for C2x. However, in C2x these
constants have type unsigned char, which required corresponding
adjustments in the compiler and the preprocessor to give them that
type for C.

For C, it seems clear to me that having type unsigned char means the
constants are unsigned in the preprocessor (and thus treated as having
type uintmax_t in #if conditionals), so this patch implements that. I
included a conditional in the libcpp change to avoid affecting
signedness for C++, but I'm not sure if in fact these constants should
also be unsigned in the preprocessor for C++ in which case that
!CPP_OPTION (pfile, cplusplus) conditional would not be needed.

Bootstrapped with no regressions on x86_64-pc-linux-gnu. Applied to
mainline.

gcc/c:
2019-11-14  Joseph Myers

	* c-parser.c (c_parser_postfix_expression)
	(c_parser_check_literal_zero): Handle CPP_UTF8CHAR.
	* gimple-parser.c (c_parser_gimple_postfix_expression): Likewise.

gcc/c-family:
2019-11-14  Joseph Myers

	* c-lex.c (lex_charconst): Make CPP_UTF8CHAR constants unsigned
	char for C.

gcc/testsuite:
2019-11-14  Joseph Myers

	* gcc.dg/c11-utf8char-1.c, gcc.dg/c2x-utf8char-1.c,
	gcc.dg/c2x-utf8char-2.c, gcc.dg/c2x-utf8char-3.c,
	gcc.dg/gnu2x-utf8char-1.c: New tests.

libcpp:
2019-11-14  Joseph Myers

	* charset.c (narrow_str_to_charconst): Make CPP_UTF8CHAR constants
	unsigned for C.
	* init.c (lang_defaults): Set utf8_char_literals for GNUC2X and
	STDC2X.

Index: gcc/c/c-parser.c
===================================================================
--- gcc/c/c-parser.c (revision 278253)
+++ gcc/c/c-parser.c (working copy)
@@ -8783,6 +8783,7 @@ c_parser_postfix_expression (c_parser *parser)
     case CPP_CHAR:
     case CPP_CHAR16:
     case CPP_CHAR32:
+    case CPP_UTF8CHAR:
     case CPP_WCHAR:
       expr.value = c_parser_peek_token (parser)->value;
       /* For the purpose of warning when a pointer is compared with
@@ -10459,6 +10460,7 @@ c_parser_check_literal_zero (c_parser *parser, uns
     case CPP_WCHAR:
     case CPP_CHAR16:
     case CPP_CHAR32:
+    case CPP_UTF8CHAR:
       /* If a parameter is literal zero alone, remember it
          for -Wmemset-transposed-args warning. */
       if (integer_zerop (tok->value)
Index: gcc/c/gimple-parser.c
===================================================================
--- gcc/c/gimple-parser.c (revision 278253)
+++ gcc/c/gimple-parser.c (working copy)
@@ -1395,6 +1395,7 @@ c_parser_gimple_postfix_expression (gimple_parser
     case CPP_CHAR:
     case CPP_CHAR16:
     case CPP_CHAR32:
+    case CPP_UTF8CHAR:
     case CPP_WCHAR:
       expr.value = c_parser_peek_token (parser)->value;
       set_c_expr_source_range (&expr, tok_range);
Index: gcc/c-family/c-lex.c
===================================================================
--- gcc/c-family/c-lex.c (revision 278253)
+++ gcc/c-family/c-lex.c (working copy)
@@ -1376,7 +1376,9 @@ lex_charconst (const cpp_token *token)
     type = char16_type_node;
   else if (token->type == CPP_UTF8CHAR)
     {
-      if (flag_char8_t)
+      if (!c_dialect_cxx ())
+        type = unsigned_char_type_node;
+      else if (flag_char8_t)
         type = char8_type_node;
       else
         type = char_type_node;
Index: gcc/testsuite/gcc.dg/c11-utf8char-1.c
===================================================================
--- gcc/testsuite/gcc.dg/c11-utf8char-1.c (nonexistent)
+++ gcc/testsuite/gcc.dg/c11-utf8char-1.c (working copy)
@@ -0,0 +1,7 @@
+/* Test C2x UTF-8 characters. Test not accepted for C11. */
+/* { dg-do compile } */
+/* { dg-options "-std=c11 -pedantic-errors" } */
+
+#define z(x) 0
+#define u8 z(
+unsigned char a = u8'a');
Index: gcc/testsuite/gcc.dg/c2x-utf8char-1.c
===================================================================
--- gcc/testsuite/gcc.dg/c2x-utf8char-1.c (nonexistent)
+++ gcc/testsuite/gcc.dg/c2x-utf8char-1.c (working copy)
@@ -0,0 +1,29 @@
+/* Test C2x UTF-8 characters. Test valid usages. */
+/* { dg-do compile } */
+/* { dg-options "-std=c2x -pedantic-errors" } */
+
+unsigned char a = u8'a';
+_Static_assert (u8'a' == 97);
+
+unsigned char b = u8'\0';
+_Static_assert (u8'\0' == 0);
+
+unsigned char c = u8'\xff';
+_Static_assert (u8'\xff' == 255);
+
+unsigned char d = u8'\377';
+_Static_assert (u8'\377' == 255);
+
+_Static_assert (sizeof (u8'a') == 1);
+_Static_assert (sizeof (u8'\0') == 1);
+_Static_assert (sizeof (u8'\xff') == 1);
+_Static_assert (sizeof (u8'\377') == 1);
+
+_Static_assert (_Generic (u8'a', unsigned char: 1, default: 2) == 1);
+_Static_assert (_Generic (u8'\0', unsigned char: 1, default: 2) == 1);
+_Static_assert (_Generic (u8'\xff', unsigned char: 1, default: 2) == 1);
+_Static_assert (_Generic (u8'\377', unsigned char: 1, default: 2) == 1);
+
+#if u8'\0' - 1 < 0
+#error "UTF-8 constants not unsigned in preprocessor"
+#endif
Index: gcc/testsuite/gcc.dg/c2x-utf8char-2.c
===================================================================
--- gcc/testsuite/gcc.dg/c2x-utf8char-2.c (nonexistent)
+++ gcc/testsuite/gcc.dg/c2x-utf8char-2.c (working copy)
@@ -0,0 +1,8 @@
+/* Test C2x UTF-8 characters. Character values not affected by
+   different execution character set. */
+/* { dg-do compile } */
+/* { dg-require-iconv "IBM1047" } */
+/* { dg-options "-std=c2x -pedantic-errors -fexec-charset=IBM1047" } */
+
+_Static_assert (u8'a' == 97);
+_Static_assert (u8'a' != (unsigned char) 'a');
Index: gcc/testsuite/gcc.dg/c2x-utf8char-3.c
===================================================================
--- gcc/testsuite/gcc.dg/c2x-utf8char-3.c (nonexistent)
+++ gcc/testsuite/gcc.dg/c2x-utf8char-3.c (working copy)
@@ -0,0 +1,8 @@
+/* Test C2x UTF-8 characters. Test errors for invalid code. */
+/* { dg-do compile } */
+/* { dg-options "-std=c2x -pedantic-errors" } */
+
+unsigned char a = u8''; /* { dg-error "empty character constant" } */
+unsigned char b = u8'ab'; /* { dg-error "character constant too long for its type" } */
+unsigned char c = u8'\u00ff'; /* { dg-error "character constant too long for its type" } */
+unsigned char d = u8'\x100'; /* { dg-error "hex escape sequence out of range" } */
Index: gcc/testsuite/gcc.dg/gnu2x-utf8char-1.c
===================================================================
--- gcc/testsuite/gcc.dg/gnu2x-utf8char-1.c (nonexistent)
+++ gcc/testsuite/gcc.dg/gnu2x-utf8char-1.c (working copy)
@@ -0,0 +1,5 @@
+/* Test C2x UTF-8 characters. Test accepted with -std=gnu2x. */
+/* { dg-do compile } */
+/* { dg-options "-std=gnu2x" } */
+
+#include "c2x-utf8char-1.c"
Index: libcpp/charset.c
===================================================================
--- libcpp/charset.c (revision 278253)
+++ libcpp/charset.c (working copy)
@@ -1928,6 +1928,8 @@ narrow_str_to_charconst (cpp_reader *pfile, cpp_st
   /* Multichar constants are of type int and therefore signed. */
   if (i > 1)
     unsigned_p = 0;
+  else if (type == CPP_UTF8CHAR && !CPP_OPTION (pfile, cplusplus))
+    unsigned_p = 1;
   else
     unsigned_p = CPP_OPTION (pfile, unsigned_char);
 
Index: libcpp/init.c
===================================================================
--- libcpp/init.c (revision 278253)
+++ libcpp/init.c (working copy)
@@ -102,13 +102,13 @@ static const struct lang_flags lang_defaults[] =
   /* GNUC99 */ { 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0 },
   /* GNUC11 */ { 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0 },
   /* GNUC17 */ { 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0 },
-  /* GNUC2X */ { 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1 },
+  /* GNUC2X */ { 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1 },
   /* STDC89 */ { 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 },
   /* STDC94 */ { 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 },
   /* STDC99 */ { 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 },
   /* STDC11 */ { 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0 },
   /* STDC17 */ { 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0 },
-  /* STDC2X */ { 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1 },
+  /* STDC2X */ { 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1 },
   /* GNUCXX */ { 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0 },
   /* CXX98 */ { 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0 },
   /* GNUCXX11 */ { 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0 },