From patchwork Fri Mar 17 19:29:01 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jonathan Wakely
X-Patchwork-Id: 740495
Date: Fri, 17 Mar 2017 19:29:01 +0000
From: Jonathan Wakely
To: libstdc++@gcc.gnu.org, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Various fixes for facets
Message-ID: <20170317192901.GT4425@redhat.com>
References: <20170313193547.GW3501@redhat.com> <20170314184612.GC3501@redhat.com> <20170316152339.GP4425@redhat.com>
In-Reply-To: <20170316152339.GP4425@redhat.com>

On 16/03/17 15:23 +0000, Jonathan Wakely wrote:
>On 14/03/17 18:46 +0000, Jonathan Wakely wrote:
>>On 13/03/17 19:35 +0000, Jonathan Wakely wrote:
>>>This is a series of patches to fix various bugs in the Unicode
>>>character conversion facets.
>>>
>>>The first patch fixes a silly < versus <= bug that meant that 0xffff
>>>got written as a surrogate pair instead of as simply 0xffff, and an
>>>endianness bug for the internal representation of UTF-16 code units
>>>stored in char32_t or wchar_t values. That's PR 79511.
>>>
>>>The second patch fixes some incorrect bitwise operations (because I
>>>confused & and |) and some incorrect limits (because I confused max
>>>and min). That fixes determining the endianness of the external
>>>representation bytes when they start with a Byte Order Mark, and
>>>correctly reports errors on invalid UCS2. It also fixes
>>>wstring_convert so that it reports the number of characters that were
>>>converted prior to an error. That's PR 79980.
>>>
>>>The third patch fixes the output of the encoding() and max_length()
>>>member functions on the codecvt facets, because I wasn't correctly
>>>accounting for a BOM or for the differences between UTF-16 and UCS2.
>>>
>>>I plan to commit these for all branches, but I'll wait until after GCC
>>>7.1 is released, and fix it for 7.2 instead. These bugs aren't
>>>important enough to rush into trunk now.
>>
>>One more patch for a problem found by the libc++ testsuite. Now we
>>pass all the libc++ tests, and we even pass a test that libc++ fails.
>>With this, I hope our <codecvt> is 100% conforming. Just in time to be
>>deprecated for C++17 :-)
>
>I've committed these to trunk, on the basis that they're intended to
>be backported to all branches anyway (fixing features that are
>currently broken in all branches). There's no point waiting if we plan
>to commit them anyway, it would just mean doing an extra backport (5,
>6, 7 *and* 8).
>
>Backports will be done soon.

I backported all the recent fixes to gcc-6-branch and it was failing
one test, due to unaligned reads in std::codecvt_utf16. That type reads
UTF-16 data from a const char* (Why narrow characters when we have char16_t?
Because <codecvt> likes to be awkward) and I was doing that by casting
the const char* to const char16_t*. That isn't safe when the first char
isn't aligned correctly for a char16_t.

This patch fixes all the unaligned accesses by abstracting the
operations on the pointers to use new overloaded operators on the range
type. A new partial specialization range<Elem, false> uses memcpy to
read/write char16_t values from the char*, avoiding alignment problems.
The primary template (range<Elem, true>) just dereferences the pointers
directly.

Tested x86_64-linux, powerpc64le-linux, powerpc64-linux,
powerpc-ibm-aix7.2.0.0 (which has 2-byte wchar_t). Also tested with
ubsan to confirm the unaligned accesses are gone.

Committed to trunk, gcc-6-branch, gcc-5-branch.

commit 96ebc791ce1bd9cbba913d0b25b60ee4a09c41f1
Author: Jonathan Wakely
Date:   Fri Mar 17 13:00:00 2017 +0000

    Fix alignment bugs in std::codecvt_utf16

        * src/c++11/codecvt.cc (range): Add non-type template parameter and
        define overloaded operators for reading and writing code units.
        (range<Elem, false>): Define partial specialization for accessing
        wide characters in potentially unaligned byte ranges.
        (ucs2_span(const char16_t*, const char16_t*, ...))
        (ucs4_span(const char16_t*, const char16_t*, ...)): Change
        parameters to range<const char16_t, false> in order to avoid
        unaligned reads.
        (__codecvt_utf16_base<char16_t>::do_out)
        (__codecvt_utf16_base<char32_t>::do_out)
        (__codecvt_utf16_base<wchar_t>::do_out): Use range specialization
        for unaligned data to avoid unaligned writes.
        (__codecvt_utf16_base<char16_t>::do_in)
        (__codecvt_utf16_base<char32_t>::do_in)
        (__codecvt_utf16_base<wchar_t>::do_in): Likewise for reads. Return
        error if there are unprocessable trailing bytes.
        (__codecvt_utf16_base<char16_t>::do_length)
        (__codecvt_utf16_base<char32_t>::do_length)
        (__codecvt_utf16_base<wchar_t>::do_length): Pass arguments of type
        range<const char16_t, false> to span functions.
        * testsuite/22_locale/codecvt/codecvt_utf16/misaligned.cc: New test.
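
For readers who want the idea without the surrounding facet machinery,
here is a minimal sketch of the memcpy-based access described above. It
assumes a simplified, read-only view over a byte range; the names
unaligned_u16_reader and count_units are hypothetical and exist only for
illustration. This is not the range specialization from the patch, which
also handles writes, indexing and BOMs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical, simplified stand-in for an "unaligned" range: a read-only
// view over bytes that yields char16_t code units in native byte order.
struct unaligned_u16_reader
{
  const char* next;
  const char* end;

  // Number of whole code units remaining (a trailing odd byte is ignored).
  std::size_t size() const
  { return static_cast<std::size_t>(end - next) / sizeof(char16_t); }

  // Read the current code unit without assuming that `next` is suitably
  // aligned for char16_t: copy the bytes instead of casting the pointer.
  char16_t operator*() const
  {
    assert(size() > 0);
    char16_t unit;
    std::memcpy(&unit, next, sizeof(unit));
    return unit;
  }

  // Advance by one code unit, i.e. sizeof(char16_t) bytes.
  unaligned_u16_reader& operator++()
  {
    next += sizeof(char16_t);
    return *this;
  }
};

// Count the code units in [first, last), reading each one via memcpy.
inline std::size_t count_units(const char* first, const char* last)
{
  std::size_t n = 0;
  for (unaligned_u16_reader r{first, last}; r.size() != 0; ++r)
    {
      (void)*r; // the read is safe at any byte offset
      ++n;
    }
  return n;
}
```

Passing a pointer at an odd offset, such as src + 1 in the tests below,
is fine here because no char16_t* is ever formed from the byte pointer.
Compilers typically turn a fixed-size memcpy like this into an ordinary
load on targets that permit unaligned access, so the cost of the safe
version is usually small.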
diff --git a/libstdc++-v3/src/c++11/codecvt.cc b/libstdc++-v3/src/c++11/codecvt.cc index 02866ef..1187339 100644 --- a/libstdc++-v3/src/c++11/codecvt.cc +++ b/libstdc++-v3/src/c++11/codecvt.cc @@ -57,17 +57,104 @@ namespace const char32_t incomplete_mb_character = char32_t(-2); const char32_t invalid_mb_sequence = char32_t(-1); - template + // Utility type for reading and writing code units of type Elem from + // a range defined by a pair of pointers. + template struct range { Elem* next; Elem* end; + // Write a code unit. + range& operator=(Elem e) + { + *next++ = e; + return *this; + } + + // Read the next code unit. Elem operator*() const { return *next; } - range& operator++() { ++next; return *this; } + // Read the Nth code unit. + Elem operator[](size_t n) const { return next[n]; } + // Move to the next code unit. + range& operator++() + { + ++next; + return *this; + } + + // Move to the Nth code unit. + range& operator+=(size_t n) + { + next += n; + return *this; + } + + // The number of code units remaining. size_t size() const { return end - next; } + + // The number of bytes remaining. + size_t nbytes() const { return (const char*)end - (const char*)next; } + }; + + // This specialization is used when accessing char16_t values through + // pointers to char, which might not be correctly aligned for char16_t. + template + struct range + { + using value_type = typename remove_const::type; + + using char_pointer = typename + conditional::value, const char*, char*>::type; + + char_pointer next; + char_pointer end; + + // Write a code unit. + range& operator=(Elem e) + { + memcpy(next, &e, sizeof(Elem)); + ++*this; + return *this; + } + + // Read the next code unit. + Elem operator*() const + { + value_type e; + memcpy(&e, next, sizeof(Elem)); + return e; + } + + // Read the Nth code unit. + Elem operator[](size_t n) const + { + value_type e; + memcpy(&e, next + n * sizeof(Elem), sizeof(Elem)); + return e; + } + + // Move to the next code unit. 
+ range& operator++() + { + next += sizeof(Elem); + return *this; + } + + // Move to the Nth code unit. + range& operator+=(size_t n) + { + next += n * sizeof(Elem); + return *this; + } + + // The number of code units remaining. + size_t size() const { return nbytes() / sizeof(Elem); } + + // The number of bytes remaining. + size_t nbytes() const { return end - next; } }; // Multibyte sequences can have "header" consisting of Byte Order Mark @@ -75,17 +162,37 @@ namespace const unsigned char utf16_bom[2] = { 0xFE, 0xFF }; const unsigned char utf16le_bom[2] = { 0xFF, 0xFE }; - template - inline bool - write_bom(range& to, const unsigned char (&bom)[N]) + // Write a BOM (space permitting). + template + bool + write_bom(range& to, const unsigned char (&bom)[N]) { - if (to.size() < N) + static_assert( (N / sizeof(C)) != 0, "" ); + static_assert( (N % sizeof(C)) == 0, "" ); + + if (to.nbytes() < N) return false; memcpy(to.next, bom, N); - to.next += N; + to += (N / sizeof(C)); return true; } + // Try to read a BOM. + template + bool + read_bom(range& from, const unsigned char (&bom)[N]) + { + static_assert( (N / sizeof(C)) != 0, "" ); + static_assert( (N % sizeof(C)) == 0, "" ); + + if (from.nbytes() >= N && !memcmp(from.next, bom, N)) + { + from += (N / sizeof(C)); + return true; + } + return false; + } + // If generate_header is set in mode write out UTF-8 BOM. bool write_utf8_bom(range& to, codecvt_mode mode) @@ -97,32 +204,20 @@ namespace // If generate_header is set in mode write out the UTF-16 BOM indicated // by whether little_endian is set in mode. + template bool - write_utf16_bom(range& to, codecvt_mode mode) + write_utf16_bom(range& to, codecvt_mode mode) { if (mode & generate_header) { - if (!to.size()) - return false; - auto* bom = (mode & little_endian) ? 
utf16le_bom : utf16_bom; - std::memcpy(to.next, bom, 2); - ++to.next; + if (mode & little_endian) + return write_bom(to, utf16le_bom); + else + return write_bom(to, utf16_bom); } return true; } - template - inline bool - read_bom(range& from, const unsigned char (&bom)[N]) - { - if (from.size() >= N && !memcmp(from.next, bom, N)) - { - from.next += N; - return true; - } - return false; - } - // If consume_header is set in mode update from.next to after any BOM. void read_utf8_bom(range& from, codecvt_mode mode) @@ -135,21 +230,16 @@ namespace // Otherwise, if *from.next is a UTF-16 BOM increment from.next and then: // - if the UTF-16BE BOM was found unset little_endian in mode, or // - if the UTF-16LE BOM was found set little_endian in mode. + template void - read_utf16_bom(range& from, codecvt_mode& mode) + read_utf16_bom(range& from, codecvt_mode& mode) { - if (mode & consume_header && from.size()) + if (mode & consume_header) { - if (!memcmp(from.next, utf16_bom, 2)) - { - ++from.next; - mode &= ~little_endian; - } - else if (!memcmp(from.next, utf16le_bom, 2)) - { - ++from.next; - mode |= little_endian; - } + if (read_bom(from, utf16_bom)) + mode &= ~little_endian; + else if (read_bom(from, utf16le_bom)) + mode |= little_endian; } } @@ -162,11 +252,11 @@ namespace const size_t avail = from.size(); if (avail == 0) return incomplete_mb_character; - unsigned char c1 = from.next[0]; + unsigned char c1 = from[0]; // https://en.wikipedia.org/wiki/UTF-8#Sample_code if (c1 < 0x80) { - ++from.next; + ++from; return c1; } else if (c1 < 0xC2) // continuation or overlong 2-byte sequence @@ -175,51 +265,51 @@ namespace { if (avail < 2) return incomplete_mb_character; - unsigned char c2 = from.next[1]; + unsigned char c2 = from[1]; if ((c2 & 0xC0) != 0x80) return invalid_mb_sequence; char32_t c = (c1 << 6) + c2 - 0x3080; if (c <= maxcode) - from.next += 2; + from += 2; return c; } else if (c1 < 0xF0) // 3-byte sequence { if (avail < 3) return incomplete_mb_character; - 
unsigned char c2 = from.next[1]; + unsigned char c2 = from[1]; if ((c2 & 0xC0) != 0x80) return invalid_mb_sequence; if (c1 == 0xE0 && c2 < 0xA0) // overlong return invalid_mb_sequence; - unsigned char c3 = from.next[2]; + unsigned char c3 = from[2]; if ((c3 & 0xC0) != 0x80) return invalid_mb_sequence; char32_t c = (c1 << 12) + (c2 << 6) + c3 - 0xE2080; if (c <= maxcode) - from.next += 3; + from += 3; return c; } else if (c1 < 0xF5) // 4-byte sequence { if (avail < 4) return incomplete_mb_character; - unsigned char c2 = from.next[1]; + unsigned char c2 = from[1]; if ((c2 & 0xC0) != 0x80) return invalid_mb_sequence; if (c1 == 0xF0 && c2 < 0x90) // overlong return invalid_mb_sequence; if (c1 == 0xF4 && c2 >= 0x90) // > U+10FFFF return invalid_mb_sequence; - unsigned char c3 = from.next[2]; + unsigned char c3 = from[2]; if ((c3 & 0xC0) != 0x80) return invalid_mb_sequence; - unsigned char c4 = from.next[3]; + unsigned char c4 = from[3]; if ((c4 & 0xC0) != 0x80) return invalid_mb_sequence; char32_t c = (c1 << 18) + (c2 << 12) + (c3 << 6) + c4 - 0x3C82080; if (c <= maxcode) - from.next += 4; + from += 4; return c; } else // > U+10FFFF @@ -233,31 +323,31 @@ namespace { if (to.size() < 1) return false; - *to.next++ = code_point; + to = code_point; } else if (code_point <= 0x7FF) { if (to.size() < 2) return false; - *to.next++ = (code_point >> 6) + 0xC0; - *to.next++ = (code_point & 0x3F) + 0x80; + to = (code_point >> 6) + 0xC0; + to = (code_point & 0x3F) + 0x80; } else if (code_point <= 0xFFFF) { if (to.size() < 3) return false; - *to.next++ = (code_point >> 12) + 0xE0; - *to.next++ = ((code_point >> 6) & 0x3F) + 0x80; - *to.next++ = (code_point & 0x3F) + 0x80; + to = (code_point >> 12) + 0xE0; + to = ((code_point >> 6) & 0x3F) + 0x80; + to = (code_point & 0x3F) + 0x80; } else if (code_point <= 0x10FFFF) { if (to.size() < 4) return false; - *to.next++ = (code_point >> 18) + 0xF0; - *to.next++ = ((code_point >> 12) & 0x3F) + 0x80; - *to.next++ = ((code_point >> 6) & 0x3F) + 
0x80; - *to.next++ = (code_point & 0x3F) + 0x80; + to = (code_point >> 18) + 0xF0; + to = ((code_point >> 12) & 0x3F) + 0x80; + to = ((code_point >> 6) & 0x3F) + 0x80; + to = (code_point & 0x3F) + 0x80; } else return false; @@ -298,38 +388,39 @@ namespace // The sequence's endianness is indicated by (mode & little_endian). // Updates from.next if the codepoint is not greater than maxcode. // Returns invalid_mb_sequence, incomplete_mb_character or the code point. - char32_t - read_utf16_code_point(range& from, unsigned long maxcode, - codecvt_mode mode) - { - const size_t avail = from.size(); - if (avail == 0) - return incomplete_mb_character; - int inc = 1; - char32_t c = adjust_byte_order(from.next[0], mode); - if (is_high_surrogate(c)) - { - if (avail < 2) - return incomplete_mb_character; - const char16_t c2 = adjust_byte_order(from.next[1], mode); - if (is_low_surrogate(c2)) - { - c = surrogate_pair_to_code_point(c, c2); - inc = 2; - } - else - return invalid_mb_sequence; - } - else if (is_low_surrogate(c)) - return invalid_mb_sequence; - if (c <= maxcode) - from.next += inc; - return c; - } + template + char32_t + read_utf16_code_point(range& from, + unsigned long maxcode, codecvt_mode mode) + { + const size_t avail = from.size(); + if (avail == 0) + return incomplete_mb_character; + int inc = 1; + char32_t c = adjust_byte_order(from[0], mode); + if (is_high_surrogate(c)) + { + if (avail < 2) + return incomplete_mb_character; + const char16_t c2 = adjust_byte_order(from[1], mode); + if (is_low_surrogate(c2)) + { + c = surrogate_pair_to_code_point(c, c2); + inc = 2; + } + else + return invalid_mb_sequence; + } + else if (is_low_surrogate(c)) + return invalid_mb_sequence; + if (c <= maxcode) + from += inc; + return c; + } - template + template bool - write_utf16_code_point(range& to, char32_t codepoint, codecvt_mode mode) + write_utf16_code_point(range& to, char32_t codepoint, codecvt_mode mode) { static_assert(sizeof(C) >= 2, "a code unit must be at least 
16-bit"); @@ -337,8 +428,7 @@ namespace { if (to.size() > 0) { - *to.next = adjust_byte_order(codepoint, mode); - ++to.next; + to = adjust_byte_order(codepoint, mode); return true; } } @@ -348,9 +438,8 @@ namespace const char32_t LEAD_OFFSET = 0xD800 - (0x10000 >> 10); char16_t lead = LEAD_OFFSET + (codepoint >> 10); char16_t trail = 0xDC00 + (codepoint & 0x3FF); - to.next[0] = adjust_byte_order(lead, mode); - to.next[1] = adjust_byte_order(trail, mode); - to.next += 2; + to = adjust_byte_order(lead, mode); + to = adjust_byte_order(trail, mode); return true; } return false; @@ -369,7 +458,7 @@ namespace return codecvt_base::partial; if (codepoint > maxcode) return codecvt_base::error; - *to.next++ = codepoint; + to = codepoint; } return from.size() ? codecvt_base::partial : codecvt_base::ok; } @@ -383,19 +472,19 @@ namespace return codecvt_base::partial; while (from.size()) { - const char32_t c = from.next[0]; + const char32_t c = from[0]; if (c > maxcode) return codecvt_base::error; if (!write_utf8_code_point(to, c)) return codecvt_base::partial; - ++from.next; + ++from; } return codecvt_base::ok; } // utf16 -> ucs4 codecvt_base::result - ucs4_in(range& from, range& to, + ucs4_in(range& from, range& to, unsigned long maxcode = max_code_point, codecvt_mode mode = {}) { read_utf16_bom(from, mode); @@ -406,26 +495,26 @@ namespace return codecvt_base::partial; if (codepoint > maxcode) return codecvt_base::error; - *to.next++ = codepoint; + to = codepoint; } return from.size() ? 
codecvt_base::partial : codecvt_base::ok; } // ucs4 -> utf16 codecvt_base::result - ucs4_out(range& from, range& to, + ucs4_out(range& from, range& to, unsigned long maxcode = max_code_point, codecvt_mode mode = {}) { if (!write_utf16_bom(to, mode)) return codecvt_base::partial; while (from.size()) { - const char32_t c = from.next[0]; + const char32_t c = from[0]; if (c > maxcode) return codecvt_base::error; if (!write_utf16_code_point(to, c, mode)) return codecvt_base::partial; - ++from.next; + ++from; } return codecvt_base::ok; } @@ -443,7 +532,7 @@ namespace read_utf8_bom(from, mode); while (from.size() && to.size()) { - const char* const first = from.next; + auto orig = from; const char32_t codepoint = read_utf8_code_point(from, maxcode); if (codepoint == incomplete_mb_character) { @@ -456,7 +545,7 @@ namespace return codecvt_base::error; if (!write_utf16_code_point(to, codepoint, mode)) { - from.next = first; + from = orig; // rewind to previous position return codecvt_base::partial; } } @@ -474,7 +563,7 @@ namespace return codecvt_base::partial; while (from.size()) { - char32_t c = from.next[0]; + char32_t c = from[0]; int inc = 1; if (is_high_surrogate(c)) { @@ -484,7 +573,7 @@ namespace if (from.size() < 2) return codecvt_base::ok; // stop converting at this point - const char32_t c2 = from.next[1]; + const char32_t c2 = from[1]; if (is_low_surrogate(c2)) { c = surrogate_pair_to_code_point(c, c2); @@ -499,7 +588,7 @@ namespace return codecvt_base::error; if (!write_utf8_code_point(to, c)) return codecvt_base::partial; - from.next += inc; + from += inc; } return codecvt_base::ok; } @@ -548,27 +637,27 @@ namespace // ucs2 -> utf16 codecvt_base::result - ucs2_out(range& from, range& to, + ucs2_out(range& from, range& to, char32_t maxcode = max_code_point, codecvt_mode mode = {}) { if (!write_utf16_bom(to, mode)) return codecvt_base::partial; while (from.size() && to.size()) { - char16_t c = from.next[0]; + char16_t c = from[0]; if (is_high_surrogate(c)) return 
codecvt_base::error; if (c > maxcode) return codecvt_base::error; - *to.next++ = adjust_byte_order(c, mode); - ++from.next; + to = adjust_byte_order(c, mode); + ++from; } return from.size() == 0 ? codecvt_base::ok : codecvt_base::partial; } // utf16 -> ucs2 codecvt_base::result - ucs2_in(range& from, range& to, + ucs2_in(range& from, range& to, char32_t maxcode = max_code_point, codecvt_mode mode = {}) { read_utf16_bom(from, mode); @@ -581,23 +670,22 @@ namespace return codecvt_base::error; // UCS-2 only supports single units. if (c > maxcode) return codecvt_base::error; - *to.next++ = c; + to = c; } return from.size() == 0 ? codecvt_base::ok : codecvt_base::partial; } const char16_t* - ucs2_span(const char16_t* begin, const char16_t* end, size_t max, + ucs2_span(range& from, size_t max, char32_t maxcode, codecvt_mode mode) { - range from{ begin, end }; read_utf16_bom(from, mode); // UCS-2 only supports characters in the BMP, i.e. one UTF-16 code unit: maxcode = std::min(max_single_utf16_unit, maxcode); char32_t c = 0; while (max-- && c <= maxcode) c = read_utf16_code_point(from, maxcode, mode); - return from.next; + return reinterpret_cast(from.next); } const char* @@ -629,15 +717,14 @@ namespace // return pos such that [begin,pos) is valid UCS-4 string no longer than max const char16_t* - ucs4_span(const char16_t* begin, const char16_t* end, size_t max, + ucs4_span(range& from, size_t max, char32_t maxcode = max_code_point, codecvt_mode mode = {}) { - range from{ begin, end }; read_utf16_bom(from, mode); char32_t c = 0; while (max-- && c <= maxcode) c = read_utf16_code_point(from, maxcode, mode); - return from.next; + return reinterpret_cast(from.next); } } @@ -937,6 +1024,13 @@ __codecvt_utf8_base::do_max_length() const throw() } #ifdef _GLIBCXX_USE_WCHAR_T + +#if __SIZEOF_WCHAR_T__ == 2 +static_assert(sizeof(wchar_t) == sizeof(char16_t), ""); +#elif __SIZEOF_WCHAR_T__ == 4 +static_assert(sizeof(wchar_t) == sizeof(char32_t), ""); +#endif + // Define members of 
codecvt_utf8 base class implementation. // Converts from UTF-8 to UCS-2 or UCS-4 depending on sizeof(wchar_t). @@ -1057,10 +1151,7 @@ do_out(state_type&, const intern_type* __from, const intern_type* __from_end, extern_type*& __to_next) const { range from{ __from, __from_end }; - range to{ - reinterpret_cast(__to), - reinterpret_cast(__to_end) - }; + range to{ __to, __to_end }; auto res = ucs2_out(from, to, _M_maxcode, _M_mode); __from_next = from.next; __to_next = reinterpret_cast(to.next); @@ -1083,14 +1174,13 @@ do_in(state_type&, const extern_type* __from, const extern_type* __from_end, intern_type* __to, intern_type* __to_end, intern_type*& __to_next) const { - range from{ - reinterpret_cast(__from), - reinterpret_cast(__from_end) - }; + range from{ __from, __from_end }; range to{ __to, __to_end }; auto res = ucs2_in(from, to, _M_maxcode, _M_mode); __from_next = reinterpret_cast(from.next); __to_next = to.next; + if (res == codecvt_base::ok && __from_next != __from_end) + res = codecvt_base::error; return res; } @@ -1107,9 +1197,8 @@ __codecvt_utf16_base:: do_length(state_type&, const extern_type* __from, const extern_type* __end, size_t __max) const { - auto next = reinterpret_cast(__from); - next = ucs2_span(next, reinterpret_cast(__end), __max, - _M_maxcode, _M_mode); + range from{ __from, __end }; + const char16_t* next = ucs2_span(from, __max, _M_maxcode, _M_mode); return reinterpret_cast(next) - __from; } @@ -1137,10 +1226,7 @@ do_out(state_type&, const intern_type* __from, const intern_type* __from_end, extern_type*& __to_next) const { range from{ __from, __from_end }; - range to{ - reinterpret_cast(__to), - reinterpret_cast(__to_end) - }; + range to{ __to, __to_end }; auto res = ucs4_out(from, to, _M_maxcode, _M_mode); __from_next = from.next; __to_next = reinterpret_cast(to.next); @@ -1163,14 +1249,13 @@ do_in(state_type&, const extern_type* __from, const extern_type* __from_end, intern_type* __to, intern_type* __to_end, intern_type*& __to_next) const 
{ - range from{ - reinterpret_cast(__from), - reinterpret_cast(__from_end) - }; + range from{ __from, __from_end }; range to{ __to, __to_end }; auto res = ucs4_in(from, to, _M_maxcode, _M_mode); __from_next = reinterpret_cast(from.next); __to_next = to.next; + if (res == codecvt_base::ok && __from_next != __from_end) + res = codecvt_base::error; return res; } @@ -1187,9 +1272,8 @@ __codecvt_utf16_base:: do_length(state_type&, const extern_type* __from, const extern_type* __end, size_t __max) const { - auto next = reinterpret_cast(__from); - next = ucs4_span(next, reinterpret_cast(__end), __max, - _M_maxcode, _M_mode); + range from{ __from, __end }; + const char16_t* next = ucs4_span(from, __max, _M_maxcode, _M_mode); return reinterpret_cast(next) - __from; } @@ -1217,20 +1301,17 @@ do_out(state_type&, const intern_type* __from, const intern_type* __from_end, extern_type* __to, extern_type* __to_end, extern_type*& __to_next) const { - range to{ - reinterpret_cast(__to), - reinterpret_cast(__to_end) - }; + range to{ __to, __to_end }; #if __SIZEOF_WCHAR_T__ == 2 range from{ reinterpret_cast(__from), - reinterpret_cast(__from_end) + reinterpret_cast(__from_end), }; auto res = ucs2_out(from, to, _M_maxcode, _M_mode); #elif __SIZEOF_WCHAR_T__ == 4 range from{ reinterpret_cast(__from), - reinterpret_cast(__from_end) + reinterpret_cast(__from_end), }; auto res = ucs4_out(from, to, _M_maxcode, _M_mode); #else @@ -1257,20 +1338,17 @@ do_in(state_type&, const extern_type* __from, const extern_type* __from_end, intern_type* __to, intern_type* __to_end, intern_type*& __to_next) const { - range from{ - reinterpret_cast(__from), - reinterpret_cast(__from_end) - }; + range from{ __from, __from_end }; #if __SIZEOF_WCHAR_T__ == 2 range to{ reinterpret_cast(__to), - reinterpret_cast(__to_end) + reinterpret_cast(__to_end), }; auto res = ucs2_in(from, to, _M_maxcode, _M_mode); #elif __SIZEOF_WCHAR_T__ == 4 range to{ reinterpret_cast(__to), - reinterpret_cast(__to_end) + 
reinterpret_cast(__to_end), }; auto res = ucs4_in(from, to, _M_maxcode, _M_mode); #else @@ -1278,6 +1356,8 @@ do_in(state_type&, const extern_type* __from, const extern_type* __from_end, #endif __from_next = reinterpret_cast(from.next); __to_next = reinterpret_cast(to.next); + if (res == codecvt_base::ok && __from_next != __from_end) + res = codecvt_base::error; return res; } @@ -1294,13 +1374,11 @@ __codecvt_utf16_base:: do_length(state_type&, const extern_type* __from, const extern_type* __end, size_t __max) const { - auto next = reinterpret_cast(__from); + range from{ __from, __end }; #if __SIZEOF_WCHAR_T__ == 2 - next = ucs2_span(next, reinterpret_cast(__end), __max, - _M_maxcode, _M_mode); + const char16_t* next = ucs2_span(from, __max, _M_maxcode, _M_mode); #elif __SIZEOF_WCHAR_T__ == 4 - next = ucs4_span(next, reinterpret_cast(__end), __max, - _M_maxcode, _M_mode); + const char16_t* next = ucs4_span(from, __max, _M_maxcode, _M_mode); #endif return reinterpret_cast(next) - __from; } diff --git a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_utf16/79980.cc b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_utf16/79980.cc index 9383818..d8b9729 100644 --- a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_utf16/79980.cc +++ b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_utf16/79980.cc @@ -103,6 +103,31 @@ test07() VERIFY( conv.converted() == 5 ); } +void +test08() +{ + // Read/write UTF-16 code units from data not correctly aligned for char16_t + Conv conv; + const char src[] = "-\xFE\xFF\0\x61\xAB\xCD"; + auto out = conv.from_bytes(src + 1, src + 7); + VERIFY( out[0] == 0x0061 ); + VERIFY( out[1] == 0xabcd ); + auto bytes = conv.to_bytes(out); + VERIFY( bytes == std::string(src + 1, 6) ); +} + +void +test09() +{ + // Read/write UTF-16 code units from data not correctly aligned for char16_t + Conv conv; + const char src[] = "-\xFE\xFF\xD8\x08\xDF\x45"; + auto out = conv.from_bytes(src + 1, src + 7); + VERIFY( out == U"\U00012345" ); + auto bytes = 
conv.to_bytes(out);
+  VERIFY( bytes == std::string(src + 1, 6) );
+}
+
 int main()
 {
   test01();
@@ -112,4 +137,6 @@ int main()
   test05();
   test06();
   test07();
+  test08();
+  test09();
 }
diff --git a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_utf16/misaligned.cc b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_utf16/misaligned.cc
new file mode 100644
index 0000000..0179c18
--- /dev/null
+++ b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_utf16/misaligned.cc
@@ -0,0 +1,289 @@
+// Copyright (C) 2017 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License along
+// with this library; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+// { dg-do run { target c++11 } }
+
+#include <locale>
+#include <codecvt>
+#include <testsuite_hooks.h>
+
+using std::codecvt_base;
+using std::codecvt_mode;
+using std::codecvt_utf16;
+using std::wstring_convert;
+using std::mbstate_t;
+
+constexpr codecvt_mode
+operator|(codecvt_mode m1, codecvt_mode m2)
+{
+  using underlying = std::underlying_type<codecvt_mode>::type;
+  return static_cast<codecvt_mode>(static_cast<underlying>(m1) | m2);
+}
+
+// Read/write UTF-16 code units from data not correctly aligned for char16_t
+
+void
+test01()
+{
+  mbstate_t st;
+  constexpr codecvt_mode m = std::consume_header|std::generate_header;
+  codecvt_utf16<char16_t, 0x10FFFF, m> conv;
+  const char src[] = "-\xFE\xFF\0\x61\xAB\xCD";
+  const char* const src_end = src + 7;
+
+  int len = conv.length(st, src + 1, src_end, 1);
+  VERIFY( len == 4 );
+  len = conv.length(st, src + 1, src_end, 2);
+  VERIFY( len == 6 );
+
+  char16_t dst[2];
+  char16_t* const dst_end = dst + 2;
+  char16_t* dst_next;
+  const char* src_cnext;
+  auto res = conv.in(st, src + 1, src_end, src_cnext, dst, dst_end, dst_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( dst[0] == 0x0061 );
+  VERIFY( dst[1] == 0xabcd );
+  VERIFY( src_cnext == src_end );
+  VERIFY( dst_next == dst_end );
+
+  char out[sizeof(src)] = { src[0] };
+  char* const out_end = out + 7;
+  char* out_next;
+  const char16_t* dst_cnext;
+  res = conv.out(st, dst, dst_end, dst_cnext, out + 1, out_end, out_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( out_next == out_end );
+  VERIFY( dst_cnext == dst_end );
+  VERIFY( out[1] == src[1] );
+  VERIFY( out[2] == src[2] );
+  VERIFY( out[3] == src[3] );
+  VERIFY( out[4] == src[4] );
+  VERIFY( out[5] == src[5] );
+  VERIFY( out[6] == src[6] );
+
+  codecvt_utf16<char16_t, 0x10FFFF, m|std::little_endian> conv_le;
+
+  len = conv_le.length(st, src + 1, src_end, 1);
+  VERIFY( len == 4 );
+  len = conv_le.length(st, src + 1, src_end, 2);
+  VERIFY( len == 6 );
+
+  res = conv_le.in(st, src + 1, src_end, src_cnext, dst, dst_end, dst_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( dst[0] == 0x0061 );
+  VERIFY( dst[1] == 0xabcd );
+  VERIFY( src_cnext == src_end );
+  VERIFY( dst_next == dst_end );
+
+  res = conv_le.out(st, dst, dst_end, dst_cnext, out + 1, out_end, out_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( out_next == out_end );
+  VERIFY( dst_cnext == dst_end );
+  VERIFY( out[1] == src[2] );
+  VERIFY( out[2] == src[1] );
+  VERIFY( out[3] == src[4] );
+  VERIFY( out[4] == src[3] );
+  VERIFY( out[5] == src[6] );
+  VERIFY( out[6] == src[5] );
+}
+
+void
+test02()
+{
+  mbstate_t st;
+  constexpr codecvt_mode m = std::consume_header|std::generate_header;
+  codecvt_utf16<char32_t, 0x10FFFF, m> conv;
+  const char src[] = "-\xFE\xFF\0\x61\xAB\xCD\xD8\x08\xDF\x45";
+  const char* const src_end = src + 11;
+
+  int len = conv.length(st, src + 1, src_end, 1);
+  VERIFY( len == 4 );
+  len = conv.length(st, src + 1, src_end, 2);
+  VERIFY( len == 6 );
+  len = conv.length(st, src + 1, src_end, -1ul);
+  VERIFY( len == 10 );
+
+  char32_t dst[3];
+  char32_t* const dst_end = dst + 3;
+  char32_t* dst_next;
+  const char* src_cnext;
+  auto res = conv.in(st, src + 1, src_end, src_cnext, dst, dst_end, dst_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( dst[0] == 0x0061 );
+  VERIFY( dst[1] == 0xabcd );
+  VERIFY( dst[2] == 0x012345 );
+  VERIFY( src_cnext == src_end );
+  VERIFY( dst_next == dst_end );
+
+  char out[sizeof(src)] = { src[0] };
+  char* const out_end = out + 11;
+  char* out_next;
+  const char32_t* dst_cnext;
+  res = conv.out(st, dst, dst_end, dst_cnext, out + 1, out_end, out_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( out_next == out_end );
+  VERIFY( dst_cnext == dst_end );
+  VERIFY( out[1] == src[1] );
+  VERIFY( out[2] == src[2] );
+  VERIFY( out[3] == src[3] );
+  VERIFY( out[4] == src[4] );
+  VERIFY( out[5] == src[5] );
+  VERIFY( out[6] == src[6] );
+  VERIFY( out[7] == src[7] );
+  VERIFY( out[8] == src[8] );
+  VERIFY( out[9] == src[9] );
+  VERIFY( out[10] == src[10] );
+
+  codecvt_utf16<char32_t, 0x10FFFF, m|std::little_endian> conv_le;
+
+  len = conv_le.length(st, src + 1, src_end, 1);
+  VERIFY( len == 4 );
+  len = conv_le.length(st, src + 1, src_end, 2);
+  VERIFY( len == 6 );
+  len = conv.length(st, src + 1, src_end, -1ul);
+  VERIFY( len == 10 );
+
+  res = conv_le.in(st, src + 1, src_end, src_cnext, dst, dst_end, dst_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( dst[0] == 0x0061 );
+  VERIFY( dst[1] == 0xabcd );
+  VERIFY( dst[2] == 0x012345 );
+  VERIFY( src_cnext == src_end );
+  VERIFY( dst_next == dst_end );
+
+  res = conv_le.out(st, dst, dst_end, dst_cnext, out + 1, out_end, out_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( out_next == out_end );
+  VERIFY( dst_cnext == dst_end );
+  VERIFY( out[1] == src[2] );
+  VERIFY( out[2] == src[1] );
+  VERIFY( out[3] == src[4] );
+  VERIFY( out[4] == src[3] );
+  VERIFY( out[5] == src[6] );
+  VERIFY( out[6] == src[5] );
+  VERIFY( out[7] == src[8] );
+  VERIFY( out[8] == src[7] );
+  VERIFY( out[9] == src[10] );
+  VERIFY( out[10] == src[9] );
+}
+
+void
+test03()
+{
+#ifdef _GLIBCXX_USE_WCHAR_T
+  mbstate_t st;
+  constexpr codecvt_mode m = std::consume_header|std::generate_header;
+  codecvt_utf16<wchar_t, 0x10FFFF, m> conv;
+  const char src[] = "-\xFE\xFF\0\x61\xAB\xCD\xD8\x08\xDF\x45";
+  const size_t in_len = sizeof(wchar_t) == 4 ? 11 : 7;
+  const size_t out_len = sizeof(wchar_t) == 4 ? 3 : 2;
+  const char* const src_end = src + in_len;
+
+  int len = conv.length(st, src + 1, src_end, 1);
+  VERIFY( len == 4 );
+  len = conv.length(st, src + 1, src_end, 2);
+  VERIFY( len == 6 );
+  if (sizeof(wchar_t) == 4)
+    {
+      len = conv.length(st, src + 1, src_end, -1ul);
+      VERIFY( len == 10 );
+    }
+
+  wchar_t dst[out_len];
+  wchar_t* const dst_end = dst + out_len;
+  wchar_t* dst_next;
+  const char* src_cnext;
+  auto res = conv.in(st, src + 1, src_end, src_cnext, dst, dst_end, dst_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( dst[0] == 0x0061 );
+  VERIFY( dst[1] == 0xabcd );
+  if (sizeof(wchar_t) == 4)
+    VERIFY( dst[2] == 0x012345 );
+  VERIFY( src_cnext == src_end );
+  VERIFY( dst_next == dst_end );
+
+  char out[sizeof(src)] = { src[0] };
+  char* const out_end = out + in_len;
+  char* out_next;
+  const wchar_t* dst_cnext;
+  res = conv.out(st, dst, dst_end, dst_cnext, out + 1, out_end, out_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( out_next == out_end );
+  VERIFY( dst_cnext == dst_end );
+  VERIFY( out[1] == src[1] );
+  VERIFY( out[2] == src[2] );
+  VERIFY( out[3] == src[3] );
+  VERIFY( out[4] == src[4] );
+  VERIFY( out[5] == src[5] );
+  VERIFY( out[6] == src[6] );
+  if (sizeof(wchar_t) == 4)
+    {
+      VERIFY( out[7] == src[7] );
+      VERIFY( out[8] == src[8] );
+      VERIFY( out[9] == src[9] );
+      VERIFY( out[10] == src[10] );
+    }
+
+  codecvt_utf16<wchar_t, 0x10FFFF, m|std::little_endian> conv_le;
+
+  len = conv_le.length(st, src + 1, src_end, 1);
+  VERIFY( len == 4 );
+  len = conv_le.length(st, src + 1, src_end, 2);
+  VERIFY( len == 6 );
+  if (sizeof(wchar_t) == 4)
+    {
+      len = conv.length(st, src + 1, src_end, -1ul);
+      VERIFY( len == 10 );
+    }
+
+  res = conv_le.in(st, src + 1, src_end, src_cnext, dst, dst_end, dst_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( dst[0] == 0x0061 );
+  VERIFY( dst[1] == 0xabcd );
+  if (sizeof(wchar_t) == 4)
+    VERIFY( dst[2] == 0x012345 );
+  VERIFY( src_cnext == src_end );
+  VERIFY( dst_next == dst_end );
+
+  res = conv_le.out(st, dst, dst_end, dst_cnext, out + 1, out_end, out_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( out_next == out_end );
+  VERIFY( dst_cnext == dst_end );
+  VERIFY( out[1] == src[2] );
+  VERIFY( out[2] == src[1] );
+  VERIFY( out[3] == src[4] );
+  VERIFY( out[4] == src[3] );
+  VERIFY( out[5] == src[6] );
+  VERIFY( out[6] == src[5] );
+  if (sizeof(wchar_t) == 4)
+    {
+      VERIFY( out[7] == src[8] );
+      VERIFY( out[8] == src[7] );
+      VERIFY( out[9] == src[10] );
+      VERIFY( out[10] == src[9] );
+    }
+#endif
+}
+
+int
+main()
+{
+  test01();
+  test02();
+  test03();
+}