From patchwork Tue Jul 4 15:50:34 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Thomas Schwinge X-Patchwork-Id: 1803284 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org (client-ip=2620:52:3:1:0:246e:9693:128c; helo=server2.sourceware.org; envelope-from=gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org; receiver=) Received: from server2.sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-384) server-digest SHA384) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4QwS2326Tvz20Pf for ; Wed, 5 Jul 2023 01:51:07 +1000 (AEST) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 3709E385B515 for ; Tue, 4 Jul 2023 15:51:05 +0000 (GMT) X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from esa3.mentor.iphmx.com (esa3.mentor.iphmx.com [68.232.137.180]) by sourceware.org (Postfix) with ESMTPS id 05FB53858D35 for ; Tue, 4 Jul 2023 15:50:51 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 05FB53858D35 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=codesourcery.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=mentor.com X-IronPort-AV: E=Sophos;i="6.01,181,1684828800"; d="scan'208,223";a="10892121" Received: from orw-gwy-01-in.mentorg.com ([192.94.38.165]) by esa3.mentor.iphmx.com with ESMTP; 04 Jul 2023 07:50:48 -0800 IronPort-SDR: 8h3GbmOfC+mdETmIG8qca+OaL/C4WQewNRbNa/Ce1GV59vL2/nrVxnzc+UVNz8PjZBQDhblmcP tM/1xoOq/H45Gk2gdqVXLcn9hsaz6WmvsbzjropM2lsfpycCVNet93XMYMWNBY9SFSlxBYa/tH VnYY4VUgw4cdS2j2H0v0BnPWeuzBWPOYg3blKWLdl8uYu4SqEv4PRcLz9Z49oyX+eU6EgqkXEX ugeAQH9i8v5/pTYIsvETCWeOL19o3zY03n7yzMo5VVC8+kJgwe5hqzuenYFNd5K1OVECHJuPXk N48= From: Thomas Schwinge To: Lewis Hyatt , , "Richard Sandiford" , Jakub Jelinek , David Malcolm Subject: 'unsigned int len' field in 'libcpp/include/symtab.h:struct ht_identifier' (was: [PATCH] pch: Fix streaming of strings with embedded null bytes) In-Reply-To: References: User-Agent: Notmuch/0.29.3+94~g74c3f1b (https://notmuchmail.org) Emacs/28.2 (x86_64-pc-linux-gnu) Date: Tue, 4 Jul 2023 17:50:34 +0200 Message-ID: <87h6qjvfp1.fsf@euler.schwinge.homeip.net> MIME-Version: 1.0 X-Originating-IP: [137.202.0.90] X-ClientProxiedBy: svr-ies-mbx-12.mgc.mentorg.com (139.181.222.12) To svr-ies-mbx-10.mgc.mentorg.com (139.181.222.10) X-Spam-Status: No, score=-11.8 required=5.0 tests=BAYES_00, GIT_PATCH_0, HEADER_FROM_DIFFERENT_DOMAINS, KAM_DMARC_STATUS, SPF_HELO_PASS, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org Sender: "Gcc-patches" Hi! I came across this one here on my way working through another (somewhat related) GTY issue. I generally do understand the issue here, but do have a question about 'unsigned int len' field in 'libcpp/include/symtab.h:struct ht_identifier': On 2022-10-18T18:14:54-0400, Lewis Hyatt via Gcc-patches wrote: > When a GTY'ed struct is streamed to PCH, any plain char* pointers it contains > (whether they live in GC-controlled memory or not) will be marked for PCH > output by the routine gt_pch_note_object in ggc-common.cc. This routine > special-cases plain char* strings, and in particular it uses strlen() to get > their length. Oh, wow, this special casing for strings... 8-| > Thus it does not handle strings with embedded null bytes, but it > is possible for something PCH cares about (such as a string literal token in a > macro definition) to contain such embedded nulls. To fix that up, add a new > GTY option "string_length" so that gt_pch_note_object can be informed the > actual length it ought to use, and use it in the relevant libcpp structs > (cpp_string and ht_identifier) accordingly. For your test case: > --- /dev/null > +++ b/gcc/testsuite/g++.dg/pch/pch-string-nulls.C > @@ -0,0 +1,3 @@ > +// { dg-do compile { target c++11 } } > +#include "pch-string-nulls.H" > +static_assert (X[4] == '[' && X[5] == '!' && X[6] == ']', "error"); > --- /dev/null > +++ b/gcc/testsuite/g++.dg/pch/pch-string-nulls.Hs > @@ -0,0 +1,2 @@ > +/* Note that there is a null byte following "ABC". */ > +#define X R"(ABC^@[!])" ..., I understand how the following is necessary: > --- a/libcpp/include/cpplib.h > +++ b/libcpp/include/cpplib.h > @@ -179,7 +179,11 @@ enum c_lang {CLK_GNUC89 = 0, CLK_GNUC99, CLK_GNUC11, CLK_GNUC17, CLK_GNUC2X, > /* Payload of a NUMBER, STRING, CHAR or COMMENT token. */ > struct GTY(()) cpp_string { > unsigned int len; > - const unsigned char *text; > + > + /* TEXT is always null terminated (terminator not included in len); but this > + GTY markup arranges that PCH streaming works properly even if there is a > + null byte in the middle of the string. */ > + const unsigned char * GTY((string_length ("1 + %h.len"))) text; > }; (That is, the test case FAILs with that one reverted.) However, this one did confuse me: > --- a/libcpp/include/symtab.h > +++ b/libcpp/include/symtab.h > @@ -29,7 +29,10 @@ along with this program; see the file COPYING3. If not see > typedef struct ht_identifier ht_identifier; > typedef struct ht_identifier *ht_identifier_ptr; > struct GTY(()) ht_identifier { > - const unsigned char *str; > + /* This GTY markup arranges that the null-terminated identifier would still > + stream to PCH correctly, if a null byte were to make its way into an > + identifier somehow. */ > + const unsigned char * GTY((string_length ("1 + %h.len"))) str; > unsigned int len; > unsigned int hash_value; > }; I did wonder whether that's an actual or just a theoretical concern, to have 'ht_identifier's with embedded NULs? If an actual concern, can we get a test case constructed? Otherwise, should we revert this hunk, given that we have this auto-'strlen' handling, ignorant of embedded NULs, in a lot of other places? But then I realized that possibly we do maintain 'len' here not for correctness but as an optimization (trading an 'unsigned int' for repeated 'strlen' calls)? My quick testing with the attached "[RFC] Verify no embedded NULs in 'struct ht_identifier'" might confirm this theory: no regressions (..., but I didn't bootstrap, and ran only parts of the testsuite). (Not proposing that RFC for 'git push', of course.) If that's indeed the intention here, I shall change/add source code commentary to describe this rationale for this dedicated 'len' field (plus, that handling of embedded NULs falls out of that, automatically). For reference, this 'len' field has existed "forever". Before 'struct ht_identifier' was added in commit 2a967f3d3a45294640e155381ef549e0b8090ad4 (Subversion r42334), we had in 'gcc/cpplib.h:struct cpp_hashnode': 'unsigned short len', or earlier 'length', earlier in 'gcc/cpphash.h:struct hashnode': 'unsigned short length', earlier 'size_t length' with comment: "length of token, for quick comparison", erlier 'int length', ever since the 'gcc/cpp*' files were added in commit 7f2935c734c36f84ab62b20a04de465e19061333 (Subversion r9191). Grüße Thomas ----------------- Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955 From 75b54da2f30b7ffa0c69c1659e3f3538bfdbd339 Mon Sep 17 00:00:00 2001 From: Thomas Schwinge Date: Tue, 4 Jul 2023 14:07:47 +0200 Subject: [PATCH] [RFC] Verify no embedded NULs in 'struct ht_identifier' --- gcc/analyzer/pending-diagnostic.cc | 2 +- gcc/c-family/c-spellcheck.h | 2 +- gcc/c/c-decl.cc | 2 +- gcc/tree.h | 2 +- libcpp/include/symtab.h | 4 ++-- libcpp/symtab.cc | 2 +- 6 files changed, 7 insertions(+), 7 deletions(-) diff --git a/gcc/analyzer/pending-diagnostic.cc b/gcc/analyzer/pending-diagnostic.cc index e36ed4fd9c1..9452719a908 100644 --- a/gcc/analyzer/pending-diagnostic.cc +++ b/gcc/analyzer/pending-diagnostic.cc @@ -128,7 +128,7 @@ pending_diagnostic::same_tree_p (tree t1, tree t2) static bool ht_ident_eq (ht_identifier ident, const char *str) { - return (strlen (str) == ident.len + return (strlen (str) == HT_LEN (&ident) && 0 == strcmp (str, (const char *)ident.str)); } diff --git a/gcc/c-family/c-spellcheck.h b/gcc/c-family/c-spellcheck.h index e3748430b18..0f0c264ef01 100644 --- a/gcc/c-family/c-spellcheck.h +++ b/gcc/c-family/c-spellcheck.h @@ -31,7 +31,7 @@ struct edit_distance_traits { static size_t get_length (cpp_hashnode *hashnode) { - return hashnode->ident.len; + return HT_LEN (&hashnode->ident); } static const char *get_string (cpp_hashnode *hashnode) diff --git a/gcc/c/c-decl.cc b/gcc/c/c-decl.cc index 1af51c4acfc..c67ae5be514 100644 --- a/gcc/c/c-decl.cc +++ b/gcc/c/c-decl.cc @@ -4499,7 +4499,7 @@ lookup_name_fuzzy (tree name, enum lookup_name_fuzzy_kind kind, location_t loc) { const char *id = (const char *)best_macro->ident.str; tree macro_as_identifier - = get_identifier_with_length (id, best_macro->ident.len); + = get_identifier_with_length (id, HT_LEN (&best_macro->ident)); bm.set_best_so_far (macro_as_identifier, bmm.get_best_distance (), bmm.get_best_candidate_length ()); diff --git a/gcc/tree.h b/gcc/tree.h index 1854fe4a7d4..20169e7a467 100644 --- a/gcc/tree.h +++ b/gcc/tree.h @@ -1174,7 +1174,7 @@ extern void omp_clause_range_check_failed (const_tree, const char *, int, /* Unlike STRING_CST, in C terms this is strlen, not sizeof. */ #define IDENTIFIER_LENGTH(NODE) \ - (IDENTIFIER_NODE_CHECK (NODE)->identifier.id.len) + (HT_LEN (&IDENTIFIER_NODE_CHECK (NODE)->identifier.id)) #define IDENTIFIER_POINTER(NODE) \ ((const char *) IDENTIFIER_NODE_CHECK (NODE)->identifier.id.str) #define IDENTIFIER_HASH_VALUE(NODE) \ diff --git a/libcpp/include/symtab.h b/libcpp/include/symtab.h index c7ccc6db9f0..ea23e3dbedb 100644 --- a/libcpp/include/symtab.h +++ b/libcpp/include/symtab.h @@ -33,11 +33,11 @@ struct GTY(()) ht_identifier { stream to PCH correctly, if a null byte were to make its way into an identifier somehow. */ const unsigned char * GTY((string_length ("1 + %h.len"))) str; - unsigned int len; + unsigned int len_; unsigned int hash_value; }; -#define HT_LEN(NODE) ((NODE)->len) +#define HT_LEN(NODE) ({ gcc_assert ((NODE)->len_ == strlen ((const char *) HT_STR (NODE))); (NODE)->len_; }) #define HT_STR(NODE) ((NODE)->str) typedef struct ht cpp_hash_table; diff --git a/libcpp/symtab.cc b/libcpp/symtab.cc index ea9f26905a9..be5c1da9d67 100644 --- a/libcpp/symtab.cc +++ b/libcpp/symtab.cc @@ -155,7 +155,7 @@ ht_lookup_with_hash (cpp_hash_table *table, const unsigned char *str, node = (*table->alloc_node) (table); table->entries[index] = node; - HT_LEN (node) = (unsigned int) len; + node->len_ = (unsigned int) len; node->hash_value = hash; if (table->alloc_subobject) -- 2.34.1