From patchwork Thu Apr 11 16:07:18 2013
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Markus Armbruster
X-Patchwork-Id: 235821
From: Markus Armbruster
To: qemu-devel@nongnu.org
Cc: blauwirbel@gmail.com, aliguori@us.ibm.com, lersek@redhat.com
Date: Thu, 11 Apr 2013 18:07:18 +0200
Message-Id: <1365696441-10696-2-git-send-email-armbru@redhat.com>
In-Reply-To: <1365696441-10696-1-git-send-email-armbru@redhat.com>
References: <1365696441-10696-1-git-send-email-armbru@redhat.com>
Subject: [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint()

Signed-off-by: Markus Armbruster
---
 include/qemu-common.h |   3 ++
 util/Makefile.objs    |   2 +-
 util/unicode.c        | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 104 insertions(+), 1 deletion(-)
 create mode 100644 util/unicode.c

diff --git a/include/qemu-common.h b/include/qemu-common.h
index 31fff22..3b1873e 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -442,6 +442,9 @@ int64_t pow2floor(int64_t value);
 int uleb128_encode_small(uint8_t *out, uint32_t n);
 int uleb128_decode_small(const uint8_t *in, uint32_t *n);
 
+/* unicode.c */
+int mod_utf8_codepoint(const char *s, size_t n, char **end);
+
 /*
  * Hexdump a buffer to a file. An optional string prefix is added to every line
  */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 557bda7..c5652f5 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -1,4 +1,4 @@
-util-obj-y = osdep.o cutils.o qemu-timer-common.o
+util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
 util-obj-$(CONFIG_WIN32) += oslib-win32.o qemu-thread-win32.o event_notifier-win32.o
 util-obj-$(CONFIG_POSIX) += oslib-posix.o qemu-thread-posix.o event_notifier-posix.o
 util-obj-y += envlist.o path.o host-utils.o cache-utils.o module.o
diff --git a/util/unicode.c b/util/unicode.c
new file mode 100644
index 0000000..d1c8658
--- /dev/null
+++ b/util/unicode.c
@@ -0,0 +1,100 @@
+/*
+ * Dealing with Unicode
+ *
+ * Copyright (C) 2013 Red Hat, Inc.
+ *
+ * Authors:
+ *  Markus Armbruster
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later. See the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+
+/**
+ * mod_utf8_codepoint:
+ * @s: string encoded in modified UTF-8
+ * @n: maximum number of bytes to read from @s, if less than 6
+ * @end: set to end of sequence on return
+ *
+ * Convert the modified UTF-8 sequence at the start of @s. Modified
+ * UTF-8 is exactly like UTF-8, except U+0000 is encoded as
+ * "\xC0\x80".
+ *
+ * If @n is zero or @s points to a zero byte, the sequence is invalid,
+ * and @end is set to @s.
+ *
+ * If @s points to an impossible byte (0xFE or 0xFF) or a continuation
+ * byte, the sequence is invalid, and @end is set to @s + 1.
+ *
+ * Else, the first byte determines how many continuation bytes are
+ * expected. If there are fewer, the sequence is invalid, and @end is
+ * set to @s + 1 + actual number of continuation bytes. Else, the
+ * sequence is well-formed, and @end is set to @s + 1 + expected
+ * number of continuation bytes.
+ *
+ * A well-formed sequence is valid unless it encodes a codepoint
+ * outside the Unicode range U+0000..U+10FFFF, one of Unicode's 66
+ * noncharacters, a surrogate codepoint, or is overlong. Except the
+ * overlong sequence "\xC0\x80" is valid.
+ *
+ * Conversion succeeds if and only if the sequence is valid.
+ *
+ * Returns: the Unicode codepoint on success, -1 on failure.
+ */
+int mod_utf8_codepoint(const char *s, size_t n, char **end)
+{
+    static int min_cp[5] = { 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
+    const unsigned char *p;
+    unsigned byte, mask, len, i;
+    int cp;
+
+    if (n == 0 || *s == 0) {
+        /* empty sequence */
+        *end = (char *)s;
+        return -1;
+    }
+
+    p = (const unsigned char *)s;
+    byte = *p++;
+    if (byte < 0x80) {
+        cp = byte;              /* one byte sequence */
+    } else if (byte >= 0xFE) {
+        cp = -1;                /* impossible bytes 0xFE, 0xFF */
+    } else if ((byte & 0x40) == 0) {
+        cp = -1;                /* unexpected continuation byte */
+    } else {
+        /* multi-byte sequence */
+        len = 0;
+        for (mask = 0x80; byte & mask; mask >>= 1) {
+            len++;
+        }
+        assert(len > 1 && len < 7);
+        cp = byte & (mask - 1);
+        for (i = 1; i < len; i++) {
+            byte = i < n ? *p : 0;
+            if ((byte & 0xC0) != 0x80) {
+                cp = -1;        /* continuation byte missing */
+                goto out;
+            }
+            p++;
+            cp <<= 6;
+            cp |= byte & 0x3F;
+        }
+        if (cp > 0x10FFFF) {
+            cp = -1;            /* beyond Unicode range */
+        } else if ((cp >= 0xFDD0 && cp <= 0xFDEF)
+                   || (cp & 0xFFFE) == 0xFFFE) {
+            cp = -1;            /* noncharacter */
+        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
+            cp = -1;            /* surrogate code point */
+        } else if (cp < min_cp[len - 2] && !(cp == 0 && len == 2)) {
+            cp = -1;            /* overlong, not \xC0\x80 */
+        }
+    }
+
+out:
+    *end = (char *)p;
+    return cp;
+}
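
The contract documented for mod_utf8_codepoint() can be exercised outside
QEMU with a small standalone sketch. The block below is not part of the
patch. It condenses the function from util/unicode.c (the four rejection
branches are merged into one condition, which should behave the same), and
count_codepoints() is a hypothetical helper introduced only for
illustration. It assumes a hosted C99 compiler and uses nothing from
qemu-common.h.

```c
#include <assert.h>
#include <stddef.h>

/* Condensed copy of mod_utf8_codepoint() from the patch (illustration only). */
static int mod_utf8_codepoint(const char *s, size_t n, char **end)
{
    static const int min_cp[5] = { 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    const unsigned char *p;
    unsigned byte, mask, len, i;
    int cp;

    if (n == 0 || *s == 0) {
        *end = (char *)s;               /* empty sequence */
        return -1;
    }
    p = (const unsigned char *)s;
    byte = *p++;
    if (byte < 0x80) {
        cp = byte;                      /* one byte sequence */
    } else if (byte >= 0xFE || (byte & 0x40) == 0) {
        cp = -1;                        /* impossible or continuation byte */
    } else {
        len = 0;
        for (mask = 0x80; byte & mask; mask >>= 1) {
            len++;                      /* count leading one bits */
        }
        assert(len > 1 && len < 7);
        cp = byte & (mask - 1);
        for (i = 1; i < len; i++) {
            byte = i < n ? *p : 0;
            if ((byte & 0xC0) != 0x80) {
                cp = -1;                /* continuation byte missing */
                goto out;
            }
            p++;
            cp <<= 6;
            cp |= byte & 0x3F;
        }
        if (cp > 0x10FFFF                                   /* out of range */
            || (cp >= 0xFDD0 && cp <= 0xFDEF)               /* noncharacter */
            || (cp & 0xFFFE) == 0xFFFE                      /* noncharacter */
            || (cp >= 0xD800 && cp <= 0xDFFF)               /* surrogate */
            || (cp < min_cp[len - 2] && !(cp == 0 && len == 2))) { /* overlong */
            cp = -1;
        }
    }
out:
    *end = (char *)p;
    return cp;
}

/*
 * Hypothetical helper (not in the patch): walk buf[0..n), stopping early at
 * a zero byte, and count valid codepoints and invalid sequences.
 */
static size_t count_codepoints(const char *buf, size_t n, size_t *invalid)
{
    const char *p = buf, *stop = buf + n;
    size_t valid = 0;
    char *end;

    *invalid = 0;
    while (p < stop && *p) {
        if (mod_utf8_codepoint(p, (size_t)(stop - p), &end) >= 0) {
            valid++;
        } else {
            (*invalid)++;
        }
        p = end;    /* end > p here, since n > 0 and *p != 0 */
    }
    return valid;
}
```

Because @end always moves past at least one byte when @n is nonzero and the
first byte is not zero, a caller can resynchronize after an invalid
sequence simply by continuing from @end, as the loop does. For example,
count_codepoints("a\xC3\xA9", 3, &bad) should return 2 with bad == 0, while
the overlong "\xC0\xAF" should count as one invalid sequence.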