From patchwork Thu Oct 10 20:27:46 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lewis Hyatt
X-Patchwork-Id: 1174794
Date: Thu, 10 Oct 2019 16:27:46 -0400
From: Lewis Hyatt
To: gcc-patches@gcc.gnu.org
Subject: [PATCH] Fix multibyte-related issues in pretty-print.c (PR 91843)
Message-ID: <20191010202746.GA53480@ldh.local>

Hello-

This short patch addresses https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91843
by adding the needed multibyte awareness to pretty-print.c. Together with my
other patch awaiting review
(https://gcc.gnu.org/ml/gcc-patches/2019-09/msg01627.html), this fixes all
issues that I am aware of regarding printing diagnostics with multibyte
characters in UTF-8 locales. Would you please have a look and see if it's OK?
Thanks very much.

Bootstrapped and tested on x86-64 Linux; all test results were identical
before and after:

     34 XPASS
    109 FAIL
   1490 XFAIL
   9470 UNSUPPORTED
 332971 PASS

-Lewis

gcc/ChangeLog:

2019-10-10  Lewis Hyatt

	PR 91843
	* pretty-print.c (pp_quoted_string): Avoid hex-escaping valid
	multibyte input.  Fix off-by-one bug printing the last byte before
	a hex-escaped output.
	(pp_character): Don't apply line wrapping in the middle of multibyte
	characters.
	(test_utf8): New test.
	(pretty_print_c_tests): Call the new test.

diff --git a/gcc/pretty-print.c b/gcc/pretty-print.c
index c57a3dbd887..742f3d23725 100644
--- a/gcc/pretty-print.c
+++ b/gcc/pretty-print.c
@@ -699,6 +699,8 @@ mingw_ansi_fputs (const char *str, FILE *fp)
 
 #endif /* __MINGW32__ */
 
+static int
+decode_utf8_char (const unsigned char *, size_t len, unsigned int *);
 static void pp_quoted_string (pretty_printer *, const char *, size_t = -1);
 
 /* Overwrite the given location/range within this text_info's rich_location.
@@ -1689,6 +1691,8 @@ void
 pp_character (pretty_printer *pp, int c)
 {
   if (pp_is_wrapping_line (pp)
+      /* If printing UTF-8, don't wrap in the middle of a sequence.  */
+      && (((unsigned int) c) & 0xC0) != 0x80
       && pp_remaining_character_count_for_line (pp) <= 0)
     {
       pp_newline (pp);
@@ -1729,8 +1733,22 @@ pp_quoted_string (pretty_printer *pp, const char *str, size_t n /* = -1 */)
       if (ISPRINT (*ps))
         continue;
 
+      /* Don't escape a valid UTF-8 extended char.  */
+      const unsigned char *ups = (const unsigned char *) ps;
+      if (*ups & 0x80)
+        {
+          unsigned int extended_char;
+          const int valid_utf8_len = decode_utf8_char (ups, n, &extended_char);
+          if (valid_utf8_len > 0)
+            {
+              ps += valid_utf8_len - 1;
+              n -= valid_utf8_len - 1;
+              continue;
+            }
+        }
+
       if (last < ps)
-        pp_maybe_wrap_text (pp, last, ps - 1);
+        pp_maybe_wrap_text (pp, last, ps);
 
       /* Append the hexadecimal value of the character.  Allocate a buffer
          that's large enough for a 32-bit char plus the hex prefix.  */
@@ -2374,6 +2392,46 @@ test_urls ()
   }
 }
 
+/* Test multibyte awareness.  */
+static void test_utf8 ()
+{
+
+  /* Check that pp_quoted_string leaves valid UTF-8 alone.  */
+  {
+    pretty_printer pp;
+    const char *s = "\xf0\x9f\x98\x82";
+    pp_quoted_string (&pp, s);
+    ASSERT_STREQ (pp_formatted_text (&pp), s);
+  }
+
+  /* Check that pp_quoted_string escapes non-UTF-8 nonprintable bytes.  */
+  {
+    pretty_printer pp;
+    pp_quoted_string (&pp, "\xf0!\x9f\x98\x82");
+    ASSERT_STREQ (pp_formatted_text (&pp),
+                  "\\xf0!\\x9f\\x98\\x82");
+  }
+
+  /* Check that pp_character will line-wrap at the beginning of a UTF-8
+     sequence, but not in the middle.  */
+  {
+    pretty_printer pp (3);
+    const char s[] = "---\xf0\x9f\x98\x82";
+    for (int i = 0; i != sizeof (s) - 1; ++i)
+      pp_character (&pp, s[i]);
+    pp_newline (&pp);
+    for (int i = 1; i != sizeof (s) - 1; ++i)
+      pp_character (&pp, s[i]);
+    pp_character (&pp, '-');
+    ASSERT_STREQ (pp_formatted_text (&pp),
+                  "---\n"
+                  "\xf0\x9f\x98\x82\n"
+                  "--\xf0\x9f\x98\x82\n"
+                  "-");
+  }
+
+}
+
 /* Run all of the selftests within this file.  */
 
 void
@@ -2383,6 +2441,7 @@ pretty_print_c_tests ()
   test_pp_format ();
   test_prefixes_and_wrapping ();
   test_urls ();
+  test_utf8 ();
 }
 
 } // namespace selftest
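
For reference, the mask test added to pp_character can be tried in
isolation. The following is only a minimal standalone sketch and is not part
of the patch: is_utf8_continuation and count_wrap_points are hypothetical
names that do not exist in GCC. It assumes the input is well-formed UTF-8,
where every byte after the first byte of a multibyte sequence has the bit
pattern 10xxxxxx, so a line break is allowed only before bytes that do not
match that pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Return nonzero if C is a UTF-8 continuation byte (bit pattern 10xxxxxx).
   This is the same mask test the patch adds to pp_character, written as a
   hypothetical helper for illustration.  The cast makes the test work even
   when C comes from a negative (signed) char.  */
int
is_utf8_continuation (int c)
{
  return (((unsigned int) c) & 0xC0) == 0x80;
}

/* Hypothetical helper, not part of GCC: count the bytes among the first LEN
   bytes of S before which a line break would be permitted, i.e. the bytes
   that do not continue a multibyte sequence.  */
size_t
count_wrap_points (const char *s, size_t len)
{
  assert (s != NULL);
  size_t count = 0;
  for (size_t i = 0; i < len; ++i)
    if (!is_utf8_continuation (s[i]))
      ++count;
  return count;
}
```

With the seven-byte string used in test_utf8, "---\xf0\x9f\x98\x82", this
sketch reports four permitted break positions: before each of the three
dashes and before the lead byte 0xf0, but never before 0x9f, 0x98 or 0x82.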