From patchwork Mon Jan 28 21:32:13 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 1032245 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=collabora.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43pNBn1kjfz9sDX for ; Tue, 29 Jan 2019 08:32:33 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727838AbfA1Vcc (ORCPT ); Mon, 28 Jan 2019 16:32:32 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:58044 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726661AbfA1Vcc (ORCPT ); Mon, 28 Jan 2019 16:32:32 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id F271127FB76 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, sfrench@samba.org, darrick.wong@oracle.com, samba-technical@lists.samba.org, jlayton@kernel.org, bfields@fieldses.org, paulus@samba.org, Olaf Weber , Gabriel Krisman Bertazi Subject: [PATCH RFC v5 01/11] unicode: Add unicode character database files Date: Mon, 28 Jan 2019 16:32:13 -0500 Message-Id: <20190128213223.31512-2-krisman@collabora.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20190128213223.31512-1-krisman@collabora.com> References: <20190128213223.31512-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org From: Olaf Weber Add files from the Unicode Character Database, version 11.0, to the source. A helper program that generates a trie used for normalization from these files is part of a separate commit. - Notes on the update from 8.0.0 and 11.0: The structure of ucd files and special cases have not experienced any changes between versions 8.0.0 and 11.0.0. 8.0.0 saw the addition of Cherokee LC characters, which is an interesting case for case-folding. The update is accompanied by new tests on the test_ucd module to catch specific cases. No changes to mkutf8data script was required for the update. The actual files are not part of the commit submitted to the list because they are to big and would bounce. Still, they can be obtained by the following script: FILES="CaseFolding.txt DerivedAge.txt extracted/DerivedCombiningClass.txt DerivedCoreProperties.txt NormalizationCorrections.txt NormalizationTest.txt UnicodeData.txt" VERSION=11.0.0 BASE=http://www.unicode.org/Public/${VERSION}/ucd for i in ${FILES} ; do wget "${BASE}/$i" -O fs/unicode/ucd/$(basename ${i} .txt)-${VERSION}.txt done Signed-off-by: Olaf Weber Signed-off-by: Gabriel Krisman Bertazi [Move ucd directory to fs/unicode/] [Update to Unicode 11.0.0] --- fs/unicode/ucd/README | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) create mode 100644 fs/unicode/ucd/README diff --git a/fs/unicode/ucd/README b/fs/unicode/ucd/README new file mode 100644 index 000000000000..5f89017b35ee --- /dev/null +++ b/fs/unicode/ucd/README @@ -0,0 +1,33 @@ +The files in this directory are part of the Unicode Character Database +for version 11.0.0 of the Unicode standard. + +The full set of files can be found here: + + http://www.unicode.org/Public/11.0.0/ucd/ + +The latest released version of the UCD can be found here: + + http://www.unicode.org/Public/UCD/latest/ + +The files in this directory are identical, except that they have been +renamed with a suffix indicating the unicode version. + +Individual source links: + + http://www.unicode.org/Public/11.0.0/ucd/CaseFolding.txt + http://www.unicode.org/Public/11.0.0/ucd/DerivedAge.txt + http://www.unicode.org/Public/11.0.0/ucd/extracted/DerivedCombiningClass.txt + http://www.unicode.org/Public/11.0.0/ucd/DerivedCoreProperties.txt + http://www.unicode.org/Public/11.0.0/ucd/NormalizationCorrections.txt + http://www.unicode.org/Public/11.0.0/ucd/NormalizationTest.txt + http://www.unicode.org/Public/11.0.0/ucd/UnicodeData.txt + +md5sums + + 414436796cf097df55f798e1585448ee CaseFolding-11.0.0.txt + 6032a595fbb782694456491d86eecfac DerivedAge-11.0.0.txt + 3240997d671297ac754ab0d27577acf7 DerivedCombiningClass-11.0.0.txt + 2a4fe257d9d8184518e036194d2248ec DerivedCoreProperties-11.0.0.txt + 4e7d383fa0dd3cd9d49d64e5b7b7c9e0 NormalizationCorrections-11.0.0.txt + c9500c5b8b88e584469f056023ecc3f2 NormalizationTest-11.0.0.txt + acc291106c3758d2025f8d7bd5518bee UnicodeData-11.0.0.txt From patchwork Mon Jan 28 21:32:14 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 1032247 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=collabora.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43pNC01MtGz9sDX for ; Tue, 29 Jan 2019 08:32:44 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728071AbfA1Vcn (ORCPT ); Mon, 28 Jan 2019 16:32:43 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:58072 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727156AbfA1Vcm (ORCPT ); Mon, 28 Jan 2019 16:32:42 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 4905327FB77 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, sfrench@samba.org, darrick.wong@oracle.com, samba-technical@lists.samba.org, jlayton@kernel.org, bfields@fieldses.org, paulus@samba.org, Olaf Weber , Gabriel Krisman Bertazi Subject: [PATCH RFC v5 02/11] scripts: add trie generator for UTF-8 Date: Mon, 28 Jan 2019 16:32:14 -0500 Message-Id: <20190128213223.31512-3-krisman@collabora.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20190128213223.31512-1-krisman@collabora.com> References: <20190128213223.31512-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org From: Olaf Weber mkutf8data.c is the source for a program that generates utf8data.h, which contains the trie that utf8norm.c uses. The trie is generated from the Unicode 11.0.0 data files. The format of the utf8data[] table is described in utf8norm.c, which is added in the next patch. Signed-off-by: Olaf Weber Signed-off-by: Gabriel Krisman Bertazi [Rebase to mainline] [Fix out-of-tree-build] [Fix checkpatch warnings] [Merge back robustness fixes from original patch. Requested by Dave Chinner] [Update makefile to build 11.0.0 ucd files] [drop references to xfs] [Convert NFKD to NFD] --- fs/Kconfig | 1 + fs/Makefile | 1 + fs/unicode/Kconfig | 8 + fs/unicode/Makefile | 16 + scripts/Makefile | 1 + scripts/mkutf8data.c | 3189 ++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 3216 insertions(+) create mode 100644 fs/unicode/Kconfig create mode 100644 fs/unicode/Makefile create mode 100644 scripts/mkutf8data.c diff --git a/fs/Kconfig b/fs/Kconfig index ac474a61be37..a42e09c569da 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -313,5 +313,6 @@ endif # NETWORK_FILESYSTEMS source "fs/nls/Kconfig" source "fs/dlm/Kconfig" +source "fs/unicode/Kconfig" endmenu diff --git a/fs/Makefile b/fs/Makefile index 293733f61594..dca31ec4c072 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -90,6 +90,7 @@ obj-$(CONFIG_EXPORTFS) += exportfs/ obj-$(CONFIG_NFSD) += nfsd/ obj-$(CONFIG_LOCKD) += lockd/ obj-$(CONFIG_NLS) += nls/ +obj-$(CONFIG_UNICODE) += unicode/ obj-$(CONFIG_SYSV_FS) += sysv/ obj-$(CONFIG_CIFS) += cifs/ obj-$(CONFIG_HPFS_FS) += hpfs/ diff --git a/fs/unicode/Kconfig b/fs/unicode/Kconfig new file mode 100644 index 000000000000..f41520f57dff --- /dev/null +++ b/fs/unicode/Kconfig @@ -0,0 +1,8 @@ +# +# UTF-8 normalization +# +config UNICODE + bool "UTF-8 normalization and casefolding support" + help + Say Y here to enable UTF-8 NFD normalization and NFD+CF casefolding + support. diff --git a/fs/unicode/Makefile b/fs/unicode/Makefile new file mode 100644 index 000000000000..efc944a75b4b --- /dev/null +++ b/fs/unicode/Makefile @@ -0,0 +1,16 @@ +# SPDX-License-Identifier: GPL-2.0 + +UNICODE_VERSION=11.0.0 + +$(obj)/utf8data.h: $(srctree)/$(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE + $(call cmd,mkutf8data) +quiet_cmd_mkutf8data = MKUTF8DATA $@ + cmd_mkutf8data = $(objtree)/scripts/mkutf8data \ + -a $(srctree)/$(src)/ucd/DerivedAge-$(UNICODE_VERSION).txt \ + -c $(srctree)/$(src)/ucd/DerivedCombiningClass-$(UNICODE_VERSION).txt \ + -p $(srctree)/$(src)/ucd/DerivedCoreProperties-$(UNICODE_VERSION).txt \ + -d $(srctree)/$(src)/ucd/UnicodeData-$(UNICODE_VERSION).txt \ + -f $(srctree)/$(src)/ucd/CaseFolding-$(UNICODE_VERSION).txt \ + -n $(srctree)/$(src)/ucd/NormalizationCorrections-$(UNICODE_VERSION).txt \ + -t $(srctree)/$(src)/ucd/NormalizationTest-$(UNICODE_VERSION).txt \ + -o $@ diff --git a/scripts/Makefile b/scripts/Makefile index ece52ff20171..2e17484afe87 100644 --- a/scripts/Makefile +++ b/scripts/Makefile @@ -20,6 +20,7 @@ hostprogs-$(CONFIG_ASN1) += asn1_compiler hostprogs-$(CONFIG_MODULE_SIG) += sign-file hostprogs-$(CONFIG_SYSTEM_TRUSTED_KEYRING) += extract-cert hostprogs-$(CONFIG_SYSTEM_EXTRA_CERTIFICATE) += insert-sys-cert +hostprogs-$(CONFIG_UNICODE) += mkutf8data HOSTCFLAGS_sortextable.o = -I$(srctree)/tools/include HOSTCFLAGS_asn1_compiler.o = -I$(srctree)/include diff --git a/scripts/mkutf8data.c b/scripts/mkutf8data.c new file mode 100644 index 000000000000..4df1a799f73c --- /dev/null +++ b/scripts/mkutf8data.c @@ -0,0 +1,3189 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +/* Generator for a compact trie for unicode normalization */ + +#include +#include +#include +#include +#include +#include +#include +#include + +/* Default names of the in- and output files. */ + +#define AGE_NAME "DerivedAge.txt" +#define CCC_NAME "DerivedCombiningClass.txt" +#define PROP_NAME "DerivedCoreProperties.txt" +#define DATA_NAME "UnicodeData.txt" +#define FOLD_NAME "CaseFolding.txt" +#define NORM_NAME "NormalizationCorrections.txt" +#define TEST_NAME "NormalizationTest.txt" +#define UTF8_NAME "utf8data.h" + +const char *age_name = AGE_NAME; +const char *ccc_name = CCC_NAME; +const char *prop_name = PROP_NAME; +const char *data_name = DATA_NAME; +const char *fold_name = FOLD_NAME; +const char *norm_name = NORM_NAME; +const char *test_name = TEST_NAME; +const char *utf8_name = UTF8_NAME; + +int verbose = 0; + +/* An arbitrary line size limit on input lines. */ + +#define LINESIZE 1024 +char line[LINESIZE]; +char buf0[LINESIZE]; +char buf1[LINESIZE]; +char buf2[LINESIZE]; +char buf3[LINESIZE]; + +const char *argv0; + +#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0])) + +/* ------------------------------------------------------------------ */ + +/* + * Unicode version numbers consist of three parts: major, minor, and a + * revision. These numbers are packed into an unsigned int to obtain + * a single version number. + * + * To save space in the generated trie, the unicode version is not + * stored directly, instead we calculate a generation number from the + * unicode versions seen in the DerivedAge file, and use that as an + * index into a table of unicode versions. + */ +#define UNICODE_MAJ_SHIFT (16) +#define UNICODE_MIN_SHIFT (8) + +#define UNICODE_MAJ_MAX ((unsigned short)-1) +#define UNICODE_MIN_MAX ((unsigned char)-1) +#define UNICODE_REV_MAX ((unsigned char)-1) + +#define UNICODE_AGE(MAJ,MIN,REV) \ + (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \ + ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \ + ((unsigned int)(REV))) + +unsigned int *ages; +int ages_count; + +unsigned int unicode_maxage; + +static int age_valid(unsigned int major, unsigned int minor, + unsigned int revision) +{ + if (major > UNICODE_MAJ_MAX) + return 0; + if (minor > UNICODE_MIN_MAX) + return 0; + if (revision > UNICODE_REV_MAX) + return 0; + return 1; +} + +/* ------------------------------------------------------------------ */ + +/* + * utf8trie_t + * + * A compact binary tree, used to decode UTF-8 characters. + * + * Internal nodes are one byte for the node itself, and up to three + * bytes for an offset into the tree. The first byte contains the + * following information: + * NEXTBYTE - flag - advance to next byte if set + * BITNUM - 3 bit field - the bit number to tested + * OFFLEN - 2 bit field - number of bytes in the offset + * if offlen == 0 (non-branching node) + * RIGHTPATH - 1 bit field - set if the following node is for the + * right-hand path (tested bit is set) + * TRIENODE - 1 bit field - set if the following node is an internal + * node, otherwise it is a leaf node + * if offlen != 0 (branching node) + * LEFTNODE - 1 bit field - set if the left-hand node is internal + * RIGHTNODE - 1 bit field - set if the right-hand node is internal + * + * Due to the way utf8 works, there cannot be branching nodes with + * NEXTBYTE set, and moreover those nodes always have a righthand + * descendant. + */ +typedef unsigned char utf8trie_t; +#define BITNUM 0x07 +#define NEXTBYTE 0x08 +#define OFFLEN 0x30 +#define OFFLEN_SHIFT 4 +#define RIGHTPATH 0x40 +#define TRIENODE 0x80 +#define RIGHTNODE 0x40 +#define LEFTNODE 0x80 + +/* + * utf8leaf_t + * + * The leaves of the trie are embedded in the trie, and so the same + * underlying datatype, unsigned char. + * + * leaf[0]: The unicode version, stored as a generation number that is + * an index into utf8agetab[]. With this we can filter code + * points based on the unicode version in which they were + * defined. The CCC of a non-defined code point is 0. + * leaf[1]: Canonical Combining Class. During normalization, we need + * to do a stable sort into ascending order of all characters + * with a non-zero CCC that occur between two characters with + * a CCC of 0, or at the begin or end of a string. + * The unicode standard guarantees that all CCC values are + * between 0 and 254 inclusive, which leaves 255 available as + * a special value. + * Code points with CCC 0 are known as stoppers. + * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the + * start of a NUL-terminated string that is the decomposition + * of the character. + * The CCC of a decomposable character is the same as the CCC + * of the first character of its decomposition. + * Some characters decompose as the empty string: these are + * characters with the Default_Ignorable_Code_Point property. + * These do affect normalization, as they all have CCC 0. + * + * The decompositions in the trie have been fully expanded. + * + * Casefolding, if applicable, is also done using decompositions. + */ +typedef unsigned char utf8leaf_t; + +#define LEAF_GEN(LEAF) ((LEAF)[0]) +#define LEAF_CCC(LEAF) ((LEAF)[1]) +#define LEAF_STR(LEAF) ((const char*)((LEAF) + 2)) + +#define MAXGEN (255) + +#define MINCCC (0) +#define MAXCCC (254) +#define STOPPER (0) +#define DECOMPOSE (255) + +struct tree; +static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t); +static utf8leaf_t *utf8lookup(struct tree *, const char *); + +unsigned char *utf8data; +size_t utf8data_size; + +utf8trie_t *nfdi; +utf8trie_t *nfdicf; + +/* ------------------------------------------------------------------ */ + +/* + * UTF8 valid ranges. + * + * The UTF-8 encoding spreads the bits of a 32bit word over several + * bytes. This table gives the ranges that can be held and how they'd + * be represented. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * There is an additional requirement on UTF-8, in that only the + * shortest representation of a 32bit value is to be used. A decoder + * must not decode sequences that do not satisfy this requirement. + * Thus the allowed ranges have a lower bound. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * Actual unicode characters are limited to the range 0x0 - 0x10FFFF, + * 17 planes of 65536 values. This limits the sequences actually seen + * even more, to just the following. + * + * 0 - 0x7f: 0 0x7f + * 0x80 - 0x7ff: 0xc2 0x80 0xdf 0xbf + * 0x800 - 0xffff: 0xe0 0xa0 0x80 0xef 0xbf 0xbf + * 0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80 0xf4 0x8f 0xbf 0xbf + * + * Even within those ranges not all values are allowed: the surrogates + * 0xd800 - 0xdfff should never be seen. + * + * Note that the longest sequence seen with valid usage is 4 bytes, + * the same a single UTF-32 character. This makes the UTF-8 + * representation of Unicode strictly smaller than UTF-32. + * + * The shortest sequence requirement was introduced by: + * Corrigendum #1: UTF-8 Shortest Form + * It can be found here: + * http://www.unicode.org/versions/corrigendum1.html + * + */ + +#define UTF8_2_BITS 0xC0 +#define UTF8_3_BITS 0xE0 +#define UTF8_4_BITS 0xF0 +#define UTF8_N_BITS 0x80 +#define UTF8_2_MASK 0xE0 +#define UTF8_3_MASK 0xF0 +#define UTF8_4_MASK 0xF8 +#define UTF8_N_MASK 0xC0 +#define UTF8_V_MASK 0x3F +#define UTF8_V_SHIFT 6 + +static int utf8encode(char *str, unsigned int val) +{ + int len; + + if (val < 0x80) { + str[0] = val; + len = 1; + } else if (val < 0x800) { + str[1] = val & UTF8_V_MASK; + str[1] |= UTF8_N_BITS; + val >>= UTF8_V_SHIFT; + str[0] = val; + str[0] |= UTF8_2_BITS; + len = 2; + } else if (val < 0x10000) { + str[2] = val & UTF8_V_MASK; + str[2] |= UTF8_N_BITS; + val >>= UTF8_V_SHIFT; + str[1] = val & UTF8_V_MASK; + str[1] |= UTF8_N_BITS; + val >>= UTF8_V_SHIFT; + str[0] = val; + str[0] |= UTF8_3_BITS; + len = 3; + } else if (val < 0x110000) { + str[3] = val & UTF8_V_MASK; + str[3] |= UTF8_N_BITS; + val >>= UTF8_V_SHIFT; + str[2] = val & UTF8_V_MASK; + str[2] |= UTF8_N_BITS; + val >>= UTF8_V_SHIFT; + str[1] = val & UTF8_V_MASK; + str[1] |= UTF8_N_BITS; + val >>= UTF8_V_SHIFT; + str[0] = val; + str[0] |= UTF8_4_BITS; + len = 4; + } else { + printf("%#x: illegal val\n", val); + len = 0; + } + return len; +} + +static unsigned int utf8decode(const char *str) +{ + const unsigned char *s = (const unsigned char*)str; + unsigned int unichar = 0; + + if (*s < 0x80) { + unichar = *s; + } else if (*s < UTF8_3_BITS) { + unichar = *s++ & 0x1F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } else if (*s < UTF8_4_BITS) { + unichar = *s++ & 0x0F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } else { + unichar = *s++ & 0x0F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } + return unichar; +} + +static int utf32valid(unsigned int unichar) +{ + return unichar < 0x110000; +} + +#define NODE 1 +#define LEAF 0 + +struct tree { + void *root; + int childnode; + const char *type; + unsigned int maxage; + struct tree *next; + int (*leaf_equal)(void *, void *); + void (*leaf_print)(void *, int); + int (*leaf_mark)(void *); + int (*leaf_size)(void *); + int *(*leaf_index)(struct tree *, void *); + unsigned char *(*leaf_emit)(void *, unsigned char *); + int leafindex[0x110000]; + int index; +}; + +struct node { + int index; + int offset; + int mark; + int size; + struct node *parent; + void *left; + void *right; + unsigned char bitnum; + unsigned char nextbyte; + unsigned char leftnode; + unsigned char rightnode; + unsigned int keybits; + unsigned int keymask; +}; + +/* + * Example lookup function for a tree. + */ +static void *lookup(struct tree *tree, const char *key) +{ + struct node *node; + void *leaf = NULL; + + node = tree->root; + while (!leaf && node) { + if (node->nextbyte) + key++; + if (*key & (1 << (node->bitnum & 7))) { + /* Right leg */ + if (node->rightnode == NODE) { + node = node->right; + } else if (node->rightnode == LEAF) { + leaf = node->right; + } else { + node = NULL; + } + } else { + /* Left leg */ + if (node->leftnode == NODE) { + node = node->left; + } else if (node->leftnode == LEAF) { + leaf = node->left; + } else { + node = NULL; + } + } + } + + return leaf; +} + +/* + * A simple non-recursive tree walker: keep track of visits to the + * left and right branches in the leftmask and rightmask. + */ +static void tree_walk(struct tree *tree) +{ + struct node *node; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int indent = 1; + int nodes, singletons, leaves; + + nodes = singletons = leaves = 0; + + printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root); + if (tree->childnode == LEAF) { + assert(tree->root); + tree->leaf_print(tree->root, indent); + leaves = 1; + } else { + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + printf("%*snode @ %p bitnum %d nextbyte %d" + " left %p right %p mask %x bits %x\n", + indent, "", node, + node->bitnum, node->nextbyte, + node->left, node->right, + node->keymask, node->keybits); + nodes += 1; + if (!(node->left && node->right)) + singletons += 1; + + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + tree->leaf_print(node->left, + indent+1); + leaves += 1; + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + tree->leaf_print(node->right, + indent+1); + leaves += 1; + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + indent -= 1; + } + } + } + printf("nodes %d leaves %d singletons %d\n", + nodes, leaves, singletons); +} + +/* + * Allocate an initialize a new internal node. + */ +static struct node *alloc_node(struct node *parent) +{ + struct node *node; + int bitnum; + + node = malloc(sizeof(*node)); + node->left = node->right = NULL; + node->parent = parent; + node->leftnode = NODE; + node->rightnode = NODE; + node->keybits = 0; + node->keymask = 0; + node->mark = 0; + node->index = 0; + node->offset = -1; + node->size = 4; + + if (node->parent) { + bitnum = parent->bitnum; + if ((bitnum & 7) == 0) { + node->bitnum = bitnum + 7 + 8; + node->nextbyte = 1; + } else { + node->bitnum = bitnum - 1; + node->nextbyte = 0; + } + } else { + node->bitnum = 7; + node->nextbyte = 0; + } + + return node; +} + +/* + * Insert a new leaf into the tree, and collapse any subtrees that are + * fully populated and end in identical leaves. A nextbyte tagged + * internal node will not be removed to preserve the tree's integrity. + * Note that due to the structure of utf8, no nextbyte tagged node + * will be a candidate for removal. + */ +static int insert(struct tree *tree, char *key, int keylen, void *leaf) +{ + struct node *node; + struct node *parent; + void **cursor; + int keybits; + + assert(keylen >= 1 && keylen <= 4); + + node = NULL; + cursor = &tree->root; + keybits = 8 * keylen; + + /* Insert, creating path along the way. */ + while (keybits) { + if (!*cursor) + *cursor = alloc_node(node); + node = *cursor; + if (node->nextbyte) + key++; + if (*key & (1 << (node->bitnum & 7))) + cursor = &node->right; + else + cursor = &node->left; + keybits--; + } + *cursor = leaf; + + /* Merge subtrees if possible. */ + while (node) { + if (*key & (1 << (node->bitnum & 7))) + node->rightnode = LEAF; + else + node->leftnode = LEAF; + if (node->nextbyte) + break; + if (node->leftnode == NODE || node->rightnode == NODE) + break; + assert(node->left); + assert(node->right); + /* Compare */ + if (! tree->leaf_equal(node->left, node->right)) + break; + /* Keep left, drop right leaf. */ + leaf = node->left; + /* Check in parent */ + parent = node->parent; + if (!parent) { + /* root of tree! */ + tree->root = leaf; + tree->childnode = LEAF; + } else if (parent->left == node) { + parent->left = leaf; + parent->leftnode = LEAF; + if (parent->right) { + parent->keymask = 0; + parent->keybits = 0; + } else { + parent->keymask |= (1 << node->bitnum); + } + } else if (parent->right == node) { + parent->right = leaf; + parent->rightnode = LEAF; + if (parent->left) { + parent->keymask = 0; + parent->keybits = 0; + } else { + parent->keymask |= (1 << node->bitnum); + parent->keybits |= (1 << node->bitnum); + } + } else { + /* internal tree error */ + assert(0); + } + free(node); + node = parent; + } + + /* Propagate keymasks up along singleton chains. */ + while (node) { + parent = node->parent; + if (!parent) + break; + /* Nix the mask for parents with two children. */ + if (node->keymask == 0) { + parent->keymask = 0; + parent->keybits = 0; + } else if (parent->left && parent->right) { + parent->keymask = 0; + parent->keybits = 0; + } else { + assert((parent->keymask & node->keymask) == 0); + parent->keymask |= node->keymask; + parent->keymask |= (1 << parent->bitnum); + parent->keybits |= node->keybits; + if (parent->right) + parent->keybits |= (1 << parent->bitnum); + } + node = parent; + } + + return 0; +} + +/* + * Prune internal nodes. + * + * Fully populated subtrees that end at the same leaf have already + * been collapsed. There are still internal nodes that have for both + * their left and right branches a sequence of singletons that make + * identical choices and end in identical leaves. The keymask and + * keybits collected in the nodes describe the choices made in these + * singleton chains. When they are identical for the left and right + * branch of a node, and the two leaves comare identical, the node in + * question can be removed. + * + * Note that nodes with the nextbyte tag set will not be removed by + * this to ensure tree integrity. Note as well that the structure of + * utf8 ensures that these nodes would not have been candidates for + * removal in any case. + */ +static void prune(struct tree *tree) +{ + struct node *node; + struct node *left; + struct node *right; + struct node *parent; + void *leftleaf; + void *rightleaf; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int count; + + if (verbose > 0) + printf("Pruning %s_%x\n", tree->type, tree->maxage); + + count = 0; + if (tree->childnode == LEAF) + return; + if (!tree->root) + return; + + leftmask = rightmask = 0; + node = tree->root; + while (node) { + if (node->nextbyte) + goto advance; + if (node->leftnode == LEAF) + goto advance; + if (node->rightnode == LEAF) + goto advance; + if (!node->left) + goto advance; + if (!node->right) + goto advance; + left = node->left; + right = node->right; + if (left->keymask == 0) + goto advance; + if (right->keymask == 0) + goto advance; + if (left->keymask != right->keymask) + goto advance; + if (left->keybits != right->keybits) + goto advance; + leftleaf = NULL; + while (!leftleaf) { + assert(left->left || left->right); + if (left->leftnode == LEAF) + leftleaf = left->left; + else if (left->rightnode == LEAF) + leftleaf = left->right; + else if (left->left) + left = left->left; + else if (left->right) + left = left->right; + else + assert(0); + } + rightleaf = NULL; + while (!rightleaf) { + assert(right->left || right->right); + if (right->leftnode == LEAF) + rightleaf = right->left; + else if (right->rightnode == LEAF) + rightleaf = right->right; + else if (right->left) + right = right->left; + else if (right->right) + right = right->right; + else + assert(0); + } + if (! tree->leaf_equal(leftleaf, rightleaf)) + goto advance; + /* + * This node has identical singleton-only subtrees. + * Remove it. + */ + parent = node->parent; + left = node->left; + right = node->right; + if (parent->left == node) + parent->left = left; + else if (parent->right == node) + parent->right = left; + else + assert(0); + left->parent = parent; + left->keymask |= (1 << node->bitnum); + node->left = NULL; + while (node) { + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + if (node->leftnode == NODE && node->left) { + left = node->left; + free(node); + count++; + node = left; + } else if (node->rightnode == NODE && node->right) { + right = node->right; + free(node); + count++; + node = right; + } else { + node = NULL; + } + } + /* Propagate keymasks up along singleton chains. */ + node = parent; + /* Force re-check */ + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + for (;;) { + if (node->left && node->right) + break; + if (node->left) { + left = node->left; + node->keymask |= left->keymask; + node->keybits |= left->keybits; + } + if (node->right) { + right = node->right; + node->keymask |= right->keymask; + node->keybits |= right->keybits; + } + node->keymask |= (1 << node->bitnum); + node = node->parent; + /* Force re-check */ + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + } + advance: + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0 && + node->leftnode == NODE && + node->left) { + leftmask |= bitmask; + node = node->left; + } else if ((rightmask & bitmask) == 0 && + node->rightnode == NODE && + node->right) { + rightmask |= bitmask; + node = node->right; + } else { + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + } + } + if (verbose > 0) + printf("Pruned %d nodes\n", count); +} + +/* + * Mark the nodes in the tree that lead to leaves that must be + * emitted. + */ +static void mark_nodes(struct tree *tree) +{ + struct node *node; + struct node *n; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int marked; + + marked = 0; + if (verbose > 0) + printf("Marking %s_%x\n", tree->type, tree->maxage); + if (tree->childnode == LEAF) + goto done; + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + if (tree->leaf_mark(node->left)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->left) { + assert(node->leftnode == NODE); + node = node->left; + continue; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + if (tree->leaf_mark(node->right)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->right) { + assert(node->rightnode==NODE); + node = node->right; + continue; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + } + + /* second pass: left siblings and singletons */ + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + if (tree->leaf_mark(node->left)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->left) { + assert(node->leftnode == NODE); + node = node->left; + if (!node->mark && node->parent->mark) { + marked++; + node->mark = 1; + } + continue; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + if (tree->leaf_mark(node->right)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->right) { + assert(node->rightnode==NODE); + node = node->right; + if (!node->mark && node->parent->mark && + !node->parent->left) { + marked++; + node->mark = 1; + } + continue; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + } +done: + if (verbose > 0) + printf("Marked %d nodes\n", marked); +} + +/* + * Compute the index of each node and leaf, which is the offset in the + * emitted trie. These values must be pre-computed because relative + * offsets between nodes are used to navigate the tree. + */ +static int index_nodes(struct tree *tree, int index) +{ + struct node *node; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int count; + int indent; + + /* Align to a cache line (or half a cache line?). */ + while (index % 64) + index++; + tree->index = index; + indent = 1; + count = 0; + + if (verbose > 0) + printf("Indexing %s_%x: %d\n", tree->type, tree->maxage, index); + if (tree->childnode == LEAF) { + index += tree->leaf_size(tree->root); + goto done; + } + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + if (!node->mark) + goto skip; + count++; + if (node->index != index) + node->index = index; + index += node->size; +skip: + while (node) { + bitmask = 1 << node->bitnum; + if (node->mark && (leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + *tree->leaf_index(tree, node->left) = + index; + index += tree->leaf_size(node->left); + count++; + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if (node->mark && (rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + *tree->leaf_index(tree, node->right) = index; + index += tree->leaf_size(node->right); + count++; + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +done: + /* Round up to a multiple of 16 */ + while (index % 16) + index++; + if (verbose > 0) + printf("Final index %d\n", index); + return index; +} + +/* + * Compute the size of nodes and leaves. We start by assuming that + * each node needs to store a three-byte offset. The indexes of the + * nodes are calculated based on that, and then this function is + * called to see if the sizes of some nodes can be reduced. This is + * repeated until no more changes are seen. + */ +static int size_nodes(struct tree *tree) +{ + struct tree *next; + struct node *node; + struct node *right; + struct node *n; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + unsigned int pathbits; + unsigned int pathmask; + int changed; + int offset; + int size; + int indent; + + indent = 1; + changed = 0; + size = 0; + + if (verbose > 0) + printf("Sizing %s_%x\n", tree->type, tree->maxage); + if (tree->childnode == LEAF) + goto done; + + assert(tree->childnode == NODE); + pathbits = 0; + pathmask = 0; + node = tree->root; + leftmask = rightmask = 0; + while (node) { + if (!node->mark) + goto skip; + offset = 0; + if (!node->left || !node->right) { + size = 1; + } else { + if (node->rightnode == NODE) { + right = node->right; + next = tree->next; + while (!right->mark) { + assert(next); + n = next->root; + while (n->bitnum != node->bitnum) { + if (pathbits & (1<bitnum)) + n = n->right; + else + n = n->left; + } + n = n->right; + assert(right->bitnum == n->bitnum); + right = n; + next = next->next; + } + offset = right->index - node->index; + } else { + offset = *tree->leaf_index(tree, node->right); + offset -= node->index; + } + assert(offset >= 0); + assert(offset <= 0xffffff); + if (offset <= 0xff) { + size = 2; + } else if (offset <= 0xffff) { + size = 3; + } else { /* offset <= 0xffffff */ + size = 4; + } + } + if (node->size != size || node->offset != offset) { + node->size = size; + node->offset = offset; + changed++; + } +skip: + while (node) { + bitmask = 1 << node->bitnum; + pathmask |= bitmask; + if (node->mark && (leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if (node->mark && (rightmask & bitmask) == 0) { + rightmask |= bitmask; + pathbits |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + pathmask &= ~bitmask; + pathbits &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +done: + if (verbose > 0) + printf("Found %d changes\n", changed); + return changed; +} + +/* + * Emit a trie for the given tree into the data array. + */ +static void emit(struct tree *tree, unsigned char *data) +{ + struct node *node; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int offlen; + int offset; + int index; + int indent; + unsigned char byte; + + index = tree->index; + data += index; + indent = 1; + if (verbose > 0) + printf("Emitting %s_%x\n", tree->type, tree->maxage); + if (tree->childnode == LEAF) { + assert(tree->root); + tree->leaf_emit(tree->root, data); + return; + } + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + if (!node->mark) + goto skip; + assert(node->offset != -1); + assert(node->index == index); + + byte = 0; + if (node->nextbyte) + byte |= NEXTBYTE; + byte |= (node->bitnum & BITNUM); + if (node->left && node->right) { + if (node->leftnode == NODE) + byte |= LEFTNODE; + if (node->rightnode == NODE) + byte |= RIGHTNODE; + if (node->offset <= 0xff) + offlen = 1; + else if (node->offset <= 0xffff) + offlen = 2; + else + offlen = 3; + offset = node->offset; + byte |= offlen << OFFLEN_SHIFT; + *data++ = byte; + index++; + while (offlen--) { + *data++ = offset & 0xff; + index++; + offset >>= 8; + } + } else if (node->left) { + if (node->leftnode == NODE) + byte |= TRIENODE; + *data++ = byte; + index++; + } else if (node->right) { + byte |= RIGHTNODE; + if (node->rightnode == NODE) + byte |= TRIENODE; + *data++ = byte; + index++; + } else { + assert(0); + } +skip: + while (node) { + bitmask = 1 << node->bitnum; + if (node->mark && (leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + data = tree->leaf_emit(node->left, + data); + index += tree->leaf_size(node->left); + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if (node->mark && (rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + data = tree->leaf_emit(node->right, + data); + index += tree->leaf_size(node->right); + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +} + +/* ------------------------------------------------------------------ */ + +/* + * Unicode data. + * + * We need to keep track of the Canonical Combining Class, the Age, + * and decompositions for a code point. + * + * For the Age, we store the index into the ages table. Effectively + * this is a generation number that the table maps to a unicode + * version. + * + * The correction field is used to indicate that this entry is in the + * corrections array, which contains decompositions that were + * corrected in later revisions. The value of the correction field is + * the Unicode version in which the mapping was corrected. + */ +struct unicode_data { + unsigned int code; + int ccc; + int gen; + int correction; + unsigned int *utf32nfdi; + unsigned int *utf32nfdicf; + char *utf8nfdi; + char *utf8nfdicf; +}; + +struct unicode_data unicode_data[0x110000]; +struct unicode_data *corrections; +int corrections_count; + +struct tree *nfdi_tree; +struct tree *nfdicf_tree; + +struct tree *trees; +int trees_count; + +/* + * Check the corrections array to see if this entry was corrected at + * some point. + */ +static struct unicode_data *corrections_lookup(struct unicode_data *u) +{ + int i; + + for (i = 0; i != corrections_count; i++) + if (u->code == corrections[i].code) + return &corrections[i]; + return u; +} + +static int nfdi_equal(void *l, void *r) +{ + struct unicode_data *left = l; + struct unicode_data *right = r; + + if (left->gen != right->gen) + return 0; + if (left->ccc != right->ccc) + return 0; + if (left->utf8nfdi && right->utf8nfdi && + strcmp(left->utf8nfdi, right->utf8nfdi) == 0) + return 1; + if (left->utf8nfdi || right->utf8nfdi) + return 0; + return 1; +} + +static int nfdicf_equal(void *l, void *r) +{ + struct unicode_data *left = l; + struct unicode_data *right = r; + + if (left->gen != right->gen) + return 0; + if (left->ccc != right->ccc) + return 0; + if (left->utf8nfdicf && right->utf8nfdicf && + strcmp(left->utf8nfdicf, right->utf8nfdicf) == 0) + return 1; + if (left->utf8nfdicf && right->utf8nfdicf) + return 0; + if (left->utf8nfdicf || right->utf8nfdicf) + return 0; + if (left->utf8nfdi && right->utf8nfdi && + strcmp(left->utf8nfdi, right->utf8nfdi) == 0) + return 1; + if (left->utf8nfdi || right->utf8nfdi) + return 0; + return 1; +} + +static void nfdi_print(void *l, int indent) +{ + struct unicode_data *leaf = l; + + printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf, + leaf->code, leaf->ccc, leaf->gen); + if (leaf->utf8nfdi) + printf(" nfdi \"%s\"", (const char*)leaf->utf8nfdi); + printf("\n"); +} + +static void nfdicf_print(void *l, int indent) +{ + struct unicode_data *leaf = l; + + printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf, + leaf->code, leaf->ccc, leaf->gen); + if (leaf->utf8nfdicf) + printf(" nfdicf \"%s\"", (const char*)leaf->utf8nfdicf); + else if (leaf->utf8nfdi) + printf(" nfdi \"%s\"", (const char*)leaf->utf8nfdi); + printf("\n"); +} + +static int nfdi_mark(void *l) +{ + return 1; +} + +static int nfdicf_mark(void *l) +{ + struct unicode_data *leaf = l; + + if (leaf->utf8nfdicf) + return 1; + return 0; +} + +static int correction_mark(void *l) +{ + struct unicode_data *leaf = l; + + return leaf->correction; +} + +static int nfdi_size(void *l) +{ + struct unicode_data *leaf = l; + + int size = 2; + if (leaf->utf8nfdi) + size += strlen(leaf->utf8nfdi) + 1; + return size; +} + +static int nfdicf_size(void *l) +{ + struct unicode_data *leaf = l; + + int size = 2; + if (leaf->utf8nfdicf) + size += strlen(leaf->utf8nfdicf) + 1; + else if (leaf->utf8nfdi) + size += strlen(leaf->utf8nfdi) + 1; + return size; +} + +static int *nfdi_index(struct tree *tree, void *l) +{ + struct unicode_data *leaf = l; + + return &tree->leafindex[leaf->code]; +} + +static int *nfdicf_index(struct tree *tree, void *l) +{ + struct unicode_data *leaf = l; + + return &tree->leafindex[leaf->code]; +} + +static unsigned char *nfdi_emit(void *l, unsigned char *data) +{ + struct unicode_data *leaf = l; + unsigned char *s; + + *data++ = leaf->gen; + if (leaf->utf8nfdi) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfdi; + while ((*data++ = *s++) != 0) + ; + } else { + *data++ = leaf->ccc; + } + return data; +} + +static unsigned char *nfdicf_emit(void *l, unsigned char *data) +{ + struct unicode_data *leaf = l; + unsigned char *s; + + *data++ = leaf->gen; + if (leaf->utf8nfdicf) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfdicf; + while ((*data++ = *s++) != 0) + ; + } else if (leaf->utf8nfdi) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfdi; + while ((*data++ = *s++) != 0) + ; + } else { + *data++ = leaf->ccc; + } + return data; +} + +static void utf8_create(struct unicode_data *data) +{ + char utf[18*4+1]; + char *u; + unsigned int *um; + int i; + + u = utf; + um = data->utf32nfdi; + if (um) { + for (i = 0; um[i]; i++) + u += utf8encode(u, um[i]); + *u = '\0'; + data->utf8nfdi = strdup(utf); + } + u = utf; + um = data->utf32nfdicf; + if (um) { + for (i = 0; um[i]; i++) + u += utf8encode(u, um[i]); + *u = '\0'; + if (!data->utf8nfdi || strcmp(data->utf8nfdi, utf)) + data->utf8nfdicf = strdup(utf); + } +} + +static void utf8_init(void) +{ + unsigned int unichar; + int i; + + for (unichar = 0; unichar != 0x110000; unichar++) + utf8_create(&unicode_data[unichar]); + + for (i = 0; i != corrections_count; i++) + utf8_create(&corrections[i]); +} + +static void trees_init(void) +{ + struct unicode_data *data; + unsigned int maxage; + unsigned int nextage; + int count; + int i; + int j; + + /* Count the number of different ages. */ + count = 0; + nextage = (unsigned int)-1; + do { + maxage = nextage; + nextage = 0; + for (i = 0; i <= corrections_count; i++) { + data = &corrections[i]; + if (nextage < data->correction && + data->correction < maxage) + nextage = data->correction; + } + count++; + } while (nextage); + + /* Two trees per age: nfdi and nfdicf */ + trees_count = count * 2; + trees = calloc(trees_count, sizeof(struct tree)); + + /* Assign ages to the trees. */ + count = trees_count; + nextage = (unsigned int)-1; + do { + maxage = nextage; + trees[--count].maxage = maxage; + trees[--count].maxage = maxage; + nextage = 0; + for (i = 0; i <= corrections_count; i++) { + data = &corrections[i]; + if (nextage < data->correction && + data->correction < maxage) + nextage = data->correction; + } + } while (nextage); + + /* The ages assigned above are off by one. */ + for (i = 0; i != trees_count; i++) { + j = 0; + while (ages[j] < trees[i].maxage) + j++; + trees[i].maxage = ages[j-1]; + } + + /* Set up the forwarding between trees. */ + trees[trees_count-2].next = &trees[trees_count-1]; + trees[trees_count-1].leaf_mark = nfdi_mark; + trees[trees_count-2].leaf_mark = nfdicf_mark; + for (i = 0; i != trees_count-2; i += 2) { + trees[i].next = &trees[trees_count-2]; + trees[i].leaf_mark = correction_mark; + trees[i+1].next = &trees[trees_count-1]; + trees[i+1].leaf_mark = correction_mark; + } + + /* Assign the callouts. */ + for (i = 0; i != trees_count; i += 2) { + trees[i].type = "nfdicf"; + trees[i].leaf_equal = nfdicf_equal; + trees[i].leaf_print = nfdicf_print; + trees[i].leaf_size = nfdicf_size; + trees[i].leaf_index = nfdicf_index; + trees[i].leaf_emit = nfdicf_emit; + + trees[i+1].type = "nfdi"; + trees[i+1].leaf_equal = nfdi_equal; + trees[i+1].leaf_print = nfdi_print; + trees[i+1].leaf_size = nfdi_size; + trees[i+1].leaf_index = nfdi_index; + trees[i+1].leaf_emit = nfdi_emit; + } + + /* Finish init. */ + for (i = 0; i != trees_count; i++) + trees[i].childnode = NODE; +} + +static void trees_populate(void) +{ + struct unicode_data *data; + unsigned int unichar; + char keyval[4]; + int keylen; + int i; + + for (i = 0; i != trees_count; i++) { + if (verbose > 0) { + printf("Populating %s_%x\n", + trees[i].type, trees[i].maxage); + } + for (unichar = 0; unichar != 0x110000; unichar++) { + if (unicode_data[unichar].gen < 0) + continue; + keylen = utf8encode(keyval, unichar); + data = corrections_lookup(&unicode_data[unichar]); + if (data->correction <= trees[i].maxage) + data = &unicode_data[unichar]; + insert(&trees[i], keyval, keylen, data); + } + } +} + +static void trees_reduce(void) +{ + int i; + int size; + int changed; + + for (i = 0; i != trees_count; i++) + prune(&trees[i]); + for (i = 0; i != trees_count; i++) + mark_nodes(&trees[i]); + do { + size = 0; + for (i = 0; i != trees_count; i++) + size = index_nodes(&trees[i], size); + changed = 0; + for (i = 0; i != trees_count; i++) + changed += size_nodes(&trees[i]); + } while (changed); + + utf8data = calloc(size, 1); + utf8data_size = size; + for (i = 0; i != trees_count; i++) + emit(&trees[i], utf8data); + + if (verbose > 0) { + for (i = 0; i != trees_count; i++) { + printf("%s_%x idx %d\n", + trees[i].type, trees[i].maxage, trees[i].index); + } + } + + nfdi = utf8data + trees[trees_count-1].index; + nfdicf = utf8data + trees[trees_count-2].index; + + nfdi_tree = &trees[trees_count-1]; + nfdicf_tree = &trees[trees_count-2]; +} + +static void verify(struct tree *tree) +{ + struct unicode_data *data; + utf8leaf_t *leaf; + unsigned int unichar; + char key[4]; + int report; + int nocf; + + if (verbose > 0) + printf("Verifying %s_%x\n", tree->type, tree->maxage); + nocf = strcmp(tree->type, "nfdicf"); + + for (unichar = 0; unichar != 0x110000; unichar++) { + report = 0; + data = corrections_lookup(&unicode_data[unichar]); + if (data->correction <= tree->maxage) + data = &unicode_data[unichar]; + utf8encode(key,unichar); + leaf = utf8lookup(tree, key); + if (!leaf) { + if (data->gen != -1) + report++; + if (unichar < 0xd800 || unichar > 0xdfff) + report++; + } else { + if (unichar >= 0xd800 && unichar <= 0xdfff) + report++; + if (data->gen == -1) + report++; + if (data->gen != LEAF_GEN(leaf)) + report++; + if (LEAF_CCC(leaf) == DECOMPOSE) { + if (nocf) { + if (!data->utf8nfdi) { + report++; + } else if (strcmp(data->utf8nfdi, + LEAF_STR(leaf))) { + report++; + } + } else { + if (!data->utf8nfdicf && + !data->utf8nfdi) { + report++; + } else if (data->utf8nfdicf) { + if (strcmp(data->utf8nfdicf, + LEAF_STR(leaf))) + report++; + } else if (strcmp(data->utf8nfdi, + LEAF_STR(leaf))) { + report++; + } + } + } else if (data->ccc != LEAF_CCC(leaf)) { + report++; + } + } + if (report) { + printf("%X code %X gen %d ccc %d" + " nfdi -> \"%s\"", + unichar, data->code, data->gen, + data->ccc, + data->utf8nfdi); + if (leaf) { + printf(" gen %d ccc %d" + " nfdi -> \"%s\"", + LEAF_GEN(leaf), + LEAF_CCC(leaf), + LEAF_CCC(leaf) == DECOMPOSE ? + LEAF_STR(leaf) : ""); + } + printf("\n"); + } + } +} + +static void trees_verify(void) +{ + int i; + + for (i = 0; i != trees_count; i++) + verify(&trees[i]); +} + +/* ------------------------------------------------------------------ */ + +static void help(void) +{ + printf("Usage: %s [options]\n", argv0); + printf("\n"); + printf("This program creates an a data trie used for parsing and\n"); + printf("normalization of UTF-8 strings. The trie is derived from\n"); + printf("a set of input files from the Unicode character database\n"); + printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n"); + printf("\n"); + printf("The generated tree supports two normalization forms:\n"); + printf("\n"); + printf("\tnfdi:\n"); + printf("\t- Apply unicode normalization form NFD.\n"); + printf("\t- Remove any Default_Ignorable_Code_Point.\n"); + printf("\n"); + printf("\tnfdicf:\n"); + printf("\t- Apply unicode normalization form NFD.\n"); + printf("\t- Remove any Default_Ignorable_Code_Point.\n"); + printf("\t- Apply a full casefold (C + F).\n"); + printf("\n"); + printf("These forms were chosen as being most useful when dealing\n"); + printf("with file names: NFD catches most cases where characters\n"); + printf("should be considered equivalent. The ignorables are mostly\n"); + printf("invisible, making names hard to type.\n"); + printf("\n"); + printf("The options to specify the files to be used are listed\n"); + printf("below with their default values, which are the names used\n"); + printf("by version 11.0.0 of the Unicode Character Database.\n"); + printf("\n"); + printf("The input files:\n"); + printf("\t-a %s\n", AGE_NAME); + printf("\t-c %s\n", CCC_NAME); + printf("\t-p %s\n", PROP_NAME); + printf("\t-d %s\n", DATA_NAME); + printf("\t-f %s\n", FOLD_NAME); + printf("\t-n %s\n", NORM_NAME); + printf("\n"); + printf("Additionally, the generated tables are tested using:\n"); + printf("\t-t %s\n", TEST_NAME); + printf("\n"); + printf("Finally, the output file:\n"); + printf("\t-o %s\n", UTF8_NAME); + printf("\n"); +} + +static void usage(void) +{ + help(); + exit(1); +} + +static void open_fail(const char *name, int error) +{ + printf("Error %d opening %s: %s\n", error, name, strerror(error)); + exit(1); +} + +static void file_fail(const char *filename) +{ + printf("Error parsing %s\n", filename); + exit(1); +} + +static void line_fail(const char *filename, const char *line) +{ + printf("Error parsing %s:%s\n", filename, line); + exit(1); +} + +/* ------------------------------------------------------------------ */ + +static void print_utf32(unsigned int *utf32str) +{ + int i; + + for (i = 0; utf32str[i]; i++) + printf(" %X", utf32str[i]); +} + +static void print_utf32nfdi(unsigned int unichar) +{ + printf(" %X ->", unichar); + print_utf32(unicode_data[unichar].utf32nfdi); + printf("\n"); +} + +static void print_utf32nfdicf(unsigned int unichar) +{ + printf(" %X ->", unichar); + print_utf32(unicode_data[unichar].utf32nfdicf); + printf("\n"); +} + +/* ------------------------------------------------------------------ */ + +static void age_init(void) +{ + FILE *file; + unsigned int first; + unsigned int last; + unsigned int unichar; + unsigned int major; + unsigned int minor; + unsigned int revision; + int gen; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", age_name); + + file = fopen(age_name, "r"); + if (!file) + open_fail(age_name, errno); + count = 0; + + gen = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "# Age=V%d_%d_%d", + &major, &minor, &revision); + if (ret == 3) { + ages_count++; + if (verbose > 1) + printf(" Age V%d_%d_%d\n", + major, minor, revision); + if (!age_valid(major, minor, revision)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "# Age=V%d_%d", &major, &minor); + if (ret == 2) { + ages_count++; + if (verbose > 1) + printf(" Age V%d_%d\n", major, minor); + if (!age_valid(major, minor, 0)) + line_fail(age_name, line); + continue; + } + } + + /* We must have found something above. */ + if (verbose > 1) + printf("%d age entries\n", ages_count); + if (ages_count == 0 || ages_count > MAXGEN) + file_fail(age_name); + + /* There is a 0 entry. */ + ages_count++; + ages = calloc(ages_count + 1, sizeof(*ages)); + /* And a guard entry. */ + ages[ages_count] = (unsigned int)-1; + + rewind(file); + count = 0; + gen = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "# Age=V%d_%d_%d", + &major, &minor, &revision); + if (ret == 3) { + ages[++gen] = + UNICODE_AGE(major, minor, revision); + if (verbose > 1) + printf(" Age V%d_%d_%d = gen %d\n", + major, minor, revision, gen); + if (!age_valid(major, minor, revision)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "# Age=V%d_%d", &major, &minor); + if (ret == 2) { + ages[++gen] = UNICODE_AGE(major, minor, 0); + if (verbose > 1) + printf(" Age V%d_%d = %d\n", + major, minor, gen); + if (!age_valid(major, minor, 0)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "%X..%X ; %d.%d #", + &first, &last, &major, &minor); + if (ret == 4) { + for (unichar = first; unichar <= last; unichar++) + unicode_data[unichar].gen = gen; + count += 1 + last - first; + if (verbose > 1) + printf(" %X..%X gen %d\n", first, last, gen); + if (!utf32valid(first) || !utf32valid(last)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor); + if (ret == 3) { + unicode_data[unichar].gen = gen; + count++; + if (verbose > 1) + printf(" %X gen %d\n", unichar, gen); + if (!utf32valid(unichar)) + line_fail(age_name, line); + continue; + } + } + unicode_maxage = ages[gen]; + fclose(file); + + /* Nix surrogate block */ + if (verbose > 1) + printf(" Removing surrogate block D800..DFFF\n"); + for (unichar = 0xd800; unichar <= 0xdfff; unichar++) + unicode_data[unichar].gen = -1; + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(age_name); +} + +static void ccc_init(void) +{ + FILE *file; + unsigned int first; + unsigned int last; + unsigned int unichar; + unsigned int value; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", ccc_name); + + file = fopen(ccc_name, "r"); + if (!file) + open_fail(ccc_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value); + if (ret == 3) { + for (unichar = first; unichar <= last; unichar++) { + unicode_data[unichar].ccc = value; + count++; + } + if (verbose > 1) + printf(" %X..%X ccc %d\n", first, last, value); + if (!utf32valid(first) || !utf32valid(last)) + line_fail(ccc_name, line); + continue; + } + ret = sscanf(line, "%X ; %d #", &unichar, &value); + if (ret == 2) { + unicode_data[unichar].ccc = value; + count++; + if (verbose > 1) + printf(" %X ccc %d\n", unichar, value); + if (!utf32valid(unichar)) + line_fail(ccc_name, line); + continue; + } + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(ccc_name); +} + +static int ignore_compatibility_form(char *type) +{ + int i; + char *ignored_types[] = {"font", "noBreak", "initial", "medial", + "final", "isolated", "circle", "super", + "sub", "vertical", "wide", "narrow", + "small", "square", "fraction", "compat"}; + + for (i = 0 ; i < ARRAY_SIZE(ignored_types); i++) + if (strcmp(type, ignored_types[i]) == 0) + return 1; + return 0; +} + +static void nfdi_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + char *s; + char *type; + unsigned int *um; + int count; + int i; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", data_name); + file = fopen(data_name, "r"); + if (!file) + open_fail(data_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];", + &unichar, buf0); + if (ret != 2) + continue; + if (!utf32valid(unichar)) + line_fail(data_name, line); + + s = buf0; + /* skip over */ + if (*s == '<') { + type = ++s; + while (*++s != '>'); + *s++ = '\0'; + if(ignore_compatibility_form(type)) + continue; + } + /* decode the decomposition into UTF-32 */ + i = 0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(data_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfdi = um; + + if (verbose > 1) + print_utf32nfdi(unichar); + count++; + } + fclose(file); + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(data_name); +} + +static void nfdicf_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + char status; + char *s; + unsigned int *um; + int i; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", fold_name); + file = fopen(fold_name, "r"); + if (!file) + open_fail(fold_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0); + if (ret != 3) + continue; + if (!utf32valid(unichar)) + line_fail(fold_name, line); + /* Use the C+F casefold. */ + if (status != 'C' && status != 'F') + continue; + s = buf0; + if (*s == '<') + while (*s++ != ' ') + ; + i = 0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(fold_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfdicf = um; + + if (verbose > 1) + print_utf32nfdicf(unichar); + count++; + } + fclose(file); + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(fold_name); +} + +static void ignore_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int first; + unsigned int last; + unsigned int *um; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", prop_name); + file = fopen(prop_name, "r"); + if (!file) + open_fail(prop_name, errno); + assert(file); + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0); + if (ret == 3) { + if (strcmp(buf0, "Default_Ignorable_Code_Point")) + continue; + if (!utf32valid(first) || !utf32valid(last)) + line_fail(prop_name, line); + for (unichar = first; unichar <= last; unichar++) { + free(unicode_data[unichar].utf32nfdi); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfdi = um; + free(unicode_data[unichar].utf32nfdicf); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfdicf = um; + count++; + } + if (verbose > 1) + printf(" %X..%X Default_Ignorable_Code_Point\n", + first, last); + continue; + } + ret = sscanf(line, "%X ; %s # ", &unichar, buf0); + if (ret == 2) { + if (strcmp(buf0, "Default_Ignorable_Code_Point")) + continue; + if (!utf32valid(unichar)) + line_fail(prop_name, line); + free(unicode_data[unichar].utf32nfdi); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfdi = um; + free(unicode_data[unichar].utf32nfdicf); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfdicf = um; + if (verbose > 1) + printf(" %X Default_Ignorable_Code_Point\n", + unichar); + count++; + continue; + } + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(prop_name); +} + +static void corrections_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int major; + unsigned int minor; + unsigned int revision; + unsigned int age; + unsigned int *um; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + char *s; + int i; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", norm_name); + file = fopen(norm_name, "r"); + if (!file) + open_fail(norm_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #", + &unichar, buf0, buf1, + &major, &minor, &revision); + if (ret != 6) + continue; + if (!utf32valid(unichar) || !age_valid(major, minor, revision)) + line_fail(norm_name, line); + count++; + } + corrections = calloc(count, sizeof(struct unicode_data)); + corrections_count = count; + rewind(file); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #", + &unichar, buf0, buf1, + &major, &minor, &revision); + if (ret != 6) + continue; + if (!utf32valid(unichar) || !age_valid(major, minor, revision)) + line_fail(norm_name, line); + corrections[count] = unicode_data[unichar]; + assert(corrections[count].code == unichar); + age = UNICODE_AGE(major, minor, revision); + corrections[count].correction = age; + + i = 0; + s = buf0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(norm_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + corrections[count].utf32nfdi = um; + + if (verbose > 1) + printf(" %X -> %s -> %s V%d_%d_%d\n", + unichar, buf0, buf1, major, minor, revision); + count++; + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(norm_name); +} + +/* ------------------------------------------------------------------ */ + +/* + * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0) + * + * AC00;;Lo;0;L;;;;;N;;;;; + * D7A3;;Lo;0;L;;;;;N;;;;; + * + * SBase = 0xAC00 + * LBase = 0x1100 + * VBase = 0x1161 + * TBase = 0x11A7 + * LCount = 19 + * VCount = 21 + * TCount = 28 + * NCount = 588 (VCount * TCount) + * SCount = 11172 (LCount * NCount) + * + * Decomposition: + * SIndex = s - SBase + * + * LV (Canonical/Full) + * LIndex = SIndex / NCount + * VIndex = (Sindex % NCount) / TCount + * LPart = LBase + LIndex + * VPart = VBase + VIndex + * + * LVT (Canonical) + * LVIndex = (SIndex / TCount) * TCount + * TIndex = (Sindex % TCount) + * LVPart = SBase + LVIndex + * TPart = TBase + TIndex + * + * LVT (Full) + * LIndex = SIndex / NCount + * VIndex = (Sindex % NCount) / TCount + * TIndex = (Sindex % TCount) + * LPart = LBase + LIndex + * VPart = VBase + VIndex + * if (TIndex == 0) { + * d = + * } else { + * TPart = TBase + TIndex + * d = + * } + * + */ + +static void +hangul_decompose(void) +{ + unsigned int sb = 0xAC00; + unsigned int lb = 0x1100; + unsigned int vb = 0x1161; + unsigned int tb = 0x11a7; + /* unsigned int lc = 19; */ + unsigned int vc = 21; + unsigned int tc = 28; + unsigned int nc = (vc * tc); + /* unsigned int sc = (lc * nc); */ + unsigned int unichar; + unsigned int mapping[4]; + unsigned int *um; + int count; + int i; + + if (verbose > 0) + printf("Decomposing hangul\n"); + /* Hangul */ + count = 0; + for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) { + unsigned int si = unichar - sb; + unsigned int li = si / nc; + unsigned int vi = (si % nc) / tc; + unsigned int ti = si % tc; + + i = 0; + mapping[i++] = lb + li; + mapping[i++] = vb + vi; + if (ti) + mapping[i++] = tb + ti; + mapping[i++] = 0; + + assert(!unicode_data[unichar].utf32nfdi); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfdi = um; + + assert(!unicode_data[unichar].utf32nfdicf); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfdicf = um; + + if (verbose > 1) + print_utf32nfdi(unichar); + + count++; + } + if (verbose > 0) + printf("Created %d entries\n", count); +} + +static void nfdi_decompose(void) +{ + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + unsigned int *um; + unsigned int *dc; + int count; + int i; + int j; + int ret; + + if (verbose > 0) + printf("Decomposing nfdi\n"); + + count = 0; + for (unichar = 0; unichar != 0x110000; unichar++) { + if (!unicode_data[unichar].utf32nfdi) + continue; + for (;;) { + ret = 1; + i = 0; + um = unicode_data[unichar].utf32nfdi; + while (*um) { + dc = unicode_data[*um].utf32nfdi; + if (dc) { + for (j = 0; dc[j]; j++) + mapping[i++] = dc[j]; + ret = 0; + } else { + mapping[i++] = *um; + } + um++; + } + mapping[i++] = 0; + if (ret) + break; + free(unicode_data[unichar].utf32nfdi); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfdi = um; + } + /* Add this decomposition to nfdicf if there is no entry. */ + if (!unicode_data[unichar].utf32nfdicf) { + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfdicf = um; + } + if (verbose > 1) + print_utf32nfdi(unichar); + count++; + } + if (verbose > 0) + printf("Processed %d entries\n", count); +} + +static void nfdicf_decompose(void) +{ + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + unsigned int *um; + unsigned int *dc; + int count; + int i; + int j; + int ret; + + if (verbose > 0) + printf("Decomposing nfdicf\n"); + count = 0; + for (unichar = 0; unichar != 0x110000; unichar++) { + if (!unicode_data[unichar].utf32nfdicf) + continue; + for (;;) { + ret = 1; + i = 0; + um = unicode_data[unichar].utf32nfdicf; + while (*um) { + dc = unicode_data[*um].utf32nfdicf; + if (dc) { + for (j = 0; dc[j]; j++) + mapping[i++] = dc[j]; + ret = 0; + } else { + mapping[i++] = *um; + } + um++; + } + mapping[i++] = 0; + if (ret) + break; + free(unicode_data[unichar].utf32nfdicf); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfdicf = um; + } + if (verbose > 1) + print_utf32nfdicf(unichar); + count++; + } + if (verbose > 0) + printf("Processed %d entries\n", count); +} + +/* ------------------------------------------------------------------ */ + +int utf8agemax(struct tree *, const char *); +int utf8nagemax(struct tree *, const char *, size_t); +int utf8agemin(struct tree *, const char *); +int utf8nagemin(struct tree *, const char *, size_t); +ssize_t utf8len(struct tree *, const char *); +ssize_t utf8nlen(struct tree *, const char *, size_t); +struct utf8cursor; +int utf8cursor(struct utf8cursor *, struct tree *, const char *); +int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t); +int utf8byte(struct utf8cursor *); + +/* + * Use trie to scan s, touching at most len bytes. + * Returns the leaf if one exists, NULL otherwise. + * + * A non-NULL return guarantees that the UTF-8 sequence starting at s + * is well-formed and corresponds to a known unicode code point. The + * shorthand for this will be "is valid UTF-8 unicode". + */ +static utf8leaf_t *utf8nlookup(struct tree *tree, const char *s, size_t len) +{ + utf8trie_t *trie = utf8data + tree->index; + int offlen; + int offset; + int mask; + int node; + + if (!tree) + return NULL; + if (len == 0) + return NULL; + node = 1; + while (node) { + offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT; + if (*trie & NEXTBYTE) { + if (--len == 0) + return NULL; + s++; + } + mask = 1 << (*trie & BITNUM); + if (*s & mask) { + /* Right leg */ + if (offlen) { + /* Right node at offset of trie */ + node = (*trie & RIGHTNODE); + offset = trie[offlen]; + while (--offlen) { + offset <<= 8; + offset |= trie[offlen]; + } + trie += offset; + } else if (*trie & RIGHTPATH) { + /* Right node after this node */ + node = (*trie & TRIENODE); + trie++; + } else { + /* No right node. */ + return NULL; + } + } else { + /* Left leg */ + if (offlen) { + /* Left node after this node. */ + node = (*trie & LEFTNODE); + trie += offlen + 1; + } else if (*trie & RIGHTPATH) { + /* No left node. */ + return NULL; + } else { + /* Left node after this node */ + node = (*trie & TRIENODE); + trie++; + } + } + } + return trie; +} + +/* + * Use trie to scan s. + * Returns the leaf if one exists, NULL otherwise. + * + * Forwards to trie_nlookup(). + */ +static utf8leaf_t *utf8lookup(struct tree *tree, const char *s) +{ + return utf8nlookup(tree, s, (size_t)-1); +} + +/* + * Return the number of bytes used by the current UTF-8 sequence. + * Assumes the input points to the first byte of a valid UTF-8 + * sequence. + */ +static inline int utf8clen(const char *s) +{ + unsigned char c = *s; + return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0); +} + +/* + * Maximum age of any character in s. + * Return -1 if s is not valid UTF-8 unicode. + * Return 0 if only non-assigned code points are used. + */ +int utf8agemax(struct tree *tree, const char *s) +{ + utf8leaf_t *leaf; + int age = 0; + int leaf_age; + + if (!tree) + return -1; + while (*s) { + if (!(leaf = utf8lookup(tree, s))) + return -1; + leaf_age = ages[LEAF_GEN(leaf)]; + if (leaf_age <= tree->maxage && leaf_age > age) + age = leaf_age; + s += utf8clen(s); + } + return age; +} + +/* + * Minimum age of any character in s. + * Return -1 if s is not valid UTF-8 unicode. + * Return 0 if non-assigned code points are used. + */ +int utf8agemin(struct tree *tree, const char *s) +{ + utf8leaf_t *leaf; + int age; + int leaf_age; + + if (!tree) + return -1; + age = tree->maxage; + while (*s) { + if (!(leaf = utf8lookup(tree, s))) + return -1; + leaf_age = ages[LEAF_GEN(leaf)]; + if (leaf_age <= tree->maxage && leaf_age < age) + age = leaf_age; + s += utf8clen(s); + } + return age; +} + +/* + * Maximum age of any character in s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +int utf8nagemax(struct tree *tree, const char *s, size_t len) +{ + utf8leaf_t *leaf; + int age = 0; + int leaf_age; + + if (!tree) + return -1; + while (len && *s) { + if (!(leaf = utf8nlookup(tree, s, len))) + return -1; + leaf_age = ages[LEAF_GEN(leaf)]; + if (leaf_age <= tree->maxage && leaf_age > age) + age = leaf_age; + len -= utf8clen(s); + s += utf8clen(s); + } + return age; +} + +/* + * Maximum age of any character in s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +int utf8nagemin(struct tree *tree, const char *s, size_t len) +{ + utf8leaf_t *leaf; + int leaf_age; + int age; + + if (!tree) + return -1; + age = tree->maxage; + while (len && *s) { + if (!(leaf = utf8nlookup(tree, s, len))) + return -1; + leaf_age = ages[LEAF_GEN(leaf)]; + if (leaf_age <= tree->maxage && leaf_age < age) + age = leaf_age; + len -= utf8clen(s); + s += utf8clen(s); + } + return age; +} + +/* + * Length of the normalization of s. + * Return -1 if s is not valid UTF-8 unicode. + * + * A string of Default_Ignorable_Code_Point has length 0. + */ +ssize_t utf8len(struct tree *tree, const char *s) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!tree) + return -1; + while (*s) { + if (!(leaf = utf8lookup(tree, s))) + return -1; + if (ages[LEAF_GEN(leaf)] > tree->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + s += utf8clen(s); + } + return ret; +} + +/* + * Length of the normalization of s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +ssize_t utf8nlen(struct tree *tree, const char *s, size_t len) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!tree) + return -1; + while (len && *s) { + if (!(leaf = utf8nlookup(tree, s, len))) + return -1; + if (ages[LEAF_GEN(leaf)] > tree->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + len -= utf8clen(s); + s += utf8clen(s); + } + return ret; +} + +/* + * Cursor structure used by the normalizer. + */ +struct utf8cursor { + struct tree *tree; + const char *s; + const char *p; + const char *ss; + const char *sp; + unsigned int len; + unsigned int slen; + short int ccc; + short int nccc; + unsigned int unichar; +}; + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * s : string. + * len : length of s. + * u8c : pointer to cursor. + * trie : utf8trie_t to use for normalization. + * + * Returns -1 on error, 0 on success. + */ +int utf8ncursor(struct utf8cursor *u8c, struct tree *tree, const char *s, + size_t len) +{ + if (!tree) + return -1; + if (!s) + return -1; + u8c->tree = tree; + u8c->s = s; + u8c->p = NULL; + u8c->ss = NULL; + u8c->sp = NULL; + u8c->len = len; + u8c->slen = 0; + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + u8c->unichar = 0; + /* Check we didn't clobber the maximum length. */ + if (u8c->len != len) + return -1; + /* The first byte of s may not be an utf8 continuation. */ + if (len > 0 && (*s & 0xC0) == 0x80) + return -1; + return 0; +} + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * s : NUL-terminated string. + * u8c : pointer to cursor. + * trie : utf8trie_t to use for normalization. + * + * Returns -1 on error, 0 on success. + */ +int utf8cursor(struct utf8cursor *u8c, struct tree *tree, const char *s) +{ + return utf8ncursor(u8c, tree, s, (unsigned int)-1); +} + +/* + * Get one byte from the normalized form of the string described by u8c. + * + * Returns the byte cast to an unsigned char on succes, and -1 on failure. + * + * The cursor keeps track of the location in the string in u8c->s. + * When a character is decomposed, the current location is stored in + * u8c->p, and u8c->s is set to the start of the decomposition. Note + * that bytes from a decomposition do not count against u8c->len. + * + * Characters are emitted if they match the current CCC in u8c->ccc. + * Hitting end-of-string while u8c->ccc == STOPPER means we're done, + * and the function returns 0 in that case. + * + * Sorting by CCC is done by repeatedly scanning the string. The + * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at + * the start of the scan. The first pass finds the lowest CCC to be + * emitted and stores it in u8c->nccc, the second pass emits the + * characters with this CCC and finds the next lowest CCC. This limits + * the number of passes to 1 + the number of different CCCs in the + * sequence being scanned. + * + * Therefore: + * u8c->p != NULL -> a decomposition is being scanned. + * u8c->ss != NULL -> this is a repeating scan. + * u8c->ccc == -1 -> this is the first scan of a repeating scan. + */ +int utf8byte(struct utf8cursor *u8c) +{ + utf8leaf_t *leaf; + int ccc; + + for (;;) { + /* Check for the end of a decomposed character. */ + if (u8c->p && *u8c->s == '\0') { + u8c->s = u8c->p; + u8c->p = NULL; + } + + /* Check for end-of-string. */ + if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) { + /* There is no next byte. */ + if (u8c->ccc == STOPPER) + return 0; + /* End-of-string during a scan counts as a stopper. */ + ccc = STOPPER; + goto ccc_mismatch; + } else if ((*u8c->s & 0xC0) == 0x80) { + /* This is a continuation of the current character. */ + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Look up the data for the current character. */ + if (u8c->p) + leaf = utf8lookup(u8c->tree, u8c->s); + else + leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len); + + /* No leaf found implies that the input is a binary blob. */ + if (!leaf) + return -1; + + /* Characters that are too new have CCC 0. */ + if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) { + ccc = STOPPER; + } else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) { + u8c->len -= utf8clen(u8c->s); + u8c->p = u8c->s + utf8clen(u8c->s); + u8c->s = LEAF_STR(leaf); + /* Empty decomposition implies CCC 0. */ + if (*u8c->s == '\0') { + if (u8c->ccc == STOPPER) + continue; + ccc = STOPPER; + goto ccc_mismatch; + } + leaf = utf8lookup(u8c->tree, u8c->s); + ccc = LEAF_CCC(leaf); + } + u8c->unichar = utf8decode(u8c->s); + + /* + * If this is not a stopper, then see if it updates + * the next canonical class to be emitted. + */ + if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc) + u8c->nccc = ccc; + + /* + * Return the current byte if this is the current + * combining class. + */ + if (ccc == u8c->ccc) { + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Current combining class mismatch. */ + ccc_mismatch: + if (u8c->nccc == STOPPER) { + /* + * Scan forward for the first canonical class + * to be emitted. Save the position from + * which to restart. + */ + assert(u8c->ccc == STOPPER); + u8c->ccc = MINCCC - 1; + u8c->nccc = ccc; + u8c->sp = u8c->p; + u8c->ss = u8c->s; + u8c->slen = u8c->len; + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (ccc != STOPPER) { + /* Not a stopper, and not the ccc we're emitting. */ + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (u8c->nccc != MAXCCC + 1) { + /* At a stopper, restart for next ccc. */ + u8c->ccc = u8c->nccc; + u8c->nccc = MAXCCC + 1; + u8c->s = u8c->ss; + u8c->p = u8c->sp; + u8c->len = u8c->slen; + } else { + /* All done, proceed from here. */ + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + u8c->sp = NULL; + u8c->ss = NULL; + u8c->slen = 0; + } + } +} + +/* ------------------------------------------------------------------ */ + +static int normalize_line(struct tree *tree) +{ + char *s; + char *t; + int c; + struct utf8cursor u8c; + + /* First test: null-terminated string. */ + s = buf2; + t = buf3; + if (utf8cursor(&u8c, tree, s)) + return -1; + while ((c = utf8byte(&u8c)) > 0) + if (c != (unsigned char)*t++) + return -1; + if (c < 0) + return -1; + if (*t != 0) + return -1; + + /* Second test: length-limited string. */ + s = buf2; + /* Replace NUL with a value that will cause an error if seen. */ + s[strlen(s) + 1] = -1; + t = buf3; + if (utf8cursor(&u8c, tree, s)) + return -1; + while ((c = utf8byte(&u8c)) > 0) + if (c != (unsigned char)*t++) + return -1; + if (c < 0) + return -1; + if (*t != 0) + return -1; + + return 0; +} + +static void normalization_test(void) +{ + FILE *file; + unsigned int unichar; + struct unicode_data *data; + char *s; + char *t; + int ret; + int ignorables; + int tests = 0; + int failures = 0; + + if (verbose > 0) + printf("Parsing %s\n", test_name); + /* Step one, read data from file. */ + file = fopen(test_name, "r"); + if (!file) + open_fail(test_name, errno); + + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%[^;];%*[^;];%[^;];%*[^;];%*[^;];", + buf0, buf1); + if (ret != 2 || *line == '#') + continue; + s = buf0; + t = buf2; + while (*s) { + unichar = strtoul(s, &s, 16); + t += utf8encode(t, unichar); + } + *t = '\0'; + + ignorables = 0; + s = buf1; + t = buf3; + while (*s) { + unichar = strtoul(s, &s, 16); + data = &unicode_data[unichar]; + if (data->utf8nfdi && !*data->utf8nfdi) + ignorables = 1; + else + t += utf8encode(t, unichar); + } + *t = '\0'; + + tests++; + if (normalize_line(nfdi_tree) < 0) { + printf("Line %s -> %s", buf0, buf1); + if (ignorables) + printf(" (ignorables removed)"); + printf(" failure\n"); + failures++; + } + } + fclose(file); + if (verbose > 0) + printf("Ran %d tests with %d failures\n", tests, failures); + if (failures) + file_fail(test_name); +} + +/* ------------------------------------------------------------------ */ + +static void write_file(void) +{ + FILE *file; + int i; + int j; + int t; + int gen; + + if (verbose > 0) + printf("Writing %s\n", utf8_name); + file = fopen(utf8_name, "w"); + if (!file) + open_fail(utf8_name, errno); + + fprintf(file, "/* This file is generated code, do not edit. */\n"); + fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n"); + fprintf(file, "#error Only nls_utf8-norm.c should include this file.\n"); + fprintf(file, "#endif\n"); + fprintf(file, "\n"); + fprintf(file, "static const unsigned int utf8vers = %#x;\n", + unicode_maxage); + fprintf(file, "\n"); + fprintf(file, "static const unsigned int utf8agetab[] = {\n"); + for (i = 0; i != ages_count; i++) + fprintf(file, "\t%#x%s\n", ages[i], + ages[i] == unicode_maxage ? "" : ","); + fprintf(file, "};\n"); + fprintf(file, "\n"); + fprintf(file, "static const struct utf8data utf8nfdicfdata[] = {\n"); + t = 0; + for (gen = 0; gen < ages_count; gen++) { + fprintf(file, "\t{ %#x, %d }%s\n", + ages[gen], trees[t].index, + ages[gen] == unicode_maxage ? "" : ","); + if (trees[t].maxage == ages[gen]) + t += 2; + } + fprintf(file, "};\n"); + fprintf(file, "\n"); + fprintf(file, "static const struct utf8data utf8nfdidata[] = {\n"); + t = 1; + for (gen = 0; gen < ages_count; gen++) { + fprintf(file, "\t{ %#x, %d }%s\n", + ages[gen], trees[t].index, + ages[gen] == unicode_maxage ? "" : ","); + if (trees[t].maxage == ages[gen]) + t += 2; + } + fprintf(file, "};\n"); + fprintf(file, "\n"); + fprintf(file, "static const unsigned char utf8data[%zd] = {\n", + utf8data_size); + t = 0; + for (i = 0; i != utf8data_size; i += 16) { + if (i == trees[t].index) { + fprintf(file, "\t/* %s_%x */\n", + trees[t].type, trees[t].maxage); + if (t < trees_count-1) + t++; + } + fprintf(file, "\t"); + for (j = i; j != i + 16; j++) + fprintf(file, "0x%.2x%s", utf8data[j], + (j < utf8data_size -1 ? "," : "")); + fprintf(file, "\n"); + } + fprintf(file, "};\n"); + fclose(file); +} + +/* ------------------------------------------------------------------ */ + +int main(int argc, char *argv[]) +{ + unsigned int unichar; + int opt; + + argv0 = argv[0]; + + while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) { + switch (opt) { + case 'a': + age_name = optarg; + break; + case 'c': + ccc_name = optarg; + break; + case 'd': + data_name = optarg; + break; + case 'f': + fold_name = optarg; + break; + case 'n': + norm_name = optarg; + break; + case 'o': + utf8_name = optarg; + break; + case 'p': + prop_name = optarg; + break; + case 't': + test_name = optarg; + break; + case 'v': + verbose++; + break; + case 'h': + help(); + exit(0); + default: + usage(); + } + } + + if (verbose > 1) + help(); + for (unichar = 0; unichar != 0x110000; unichar++) + unicode_data[unichar].code = unichar; + age_init(); + ccc_init(); + nfdi_init(); + nfdicf_init(); + ignore_init(); + corrections_init(); + hangul_decompose(); + nfdi_decompose(); + nfdicf_decompose(); + utf8_init(); + trees_init(); + trees_populate(); + trees_reduce(); + trees_verify(); + /* Prevent "unused function" warning. */ + (void)lookup(nfdi_tree, " "); + if (verbose > 2) + tree_walk(nfdi_tree); + if (verbose > 2) + tree_walk(nfdicf_tree); + normalization_test(); + write_file(); + + return 0; +} From patchwork Mon Jan 28 21:32:15 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 1032246 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=collabora.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43pNBy2HBnz9sDL for ; Tue, 29 Jan 2019 08:32:42 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728049AbfA1Vck (ORCPT ); Mon, 28 Jan 2019 16:32:40 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:58100 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728033AbfA1Vck (ORCPT ); Mon, 28 Jan 2019 16:32:40 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 4CB8C27FB7A From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, sfrench@samba.org, darrick.wong@oracle.com, samba-technical@lists.samba.org, jlayton@kernel.org, bfields@fieldses.org, paulus@samba.org, Olaf Weber , Gabriel Krisman Bertazi Subject: [PATCH RFC v5 03/11] unicode: Introduce code for UTF-8 normalization Date: Mon, 28 Jan 2019 16:32:15 -0500 Message-Id: <20190128213223.31512-4-krisman@collabora.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20190128213223.31512-1-krisman@collabora.com> References: <20190128213223.31512-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org From: Olaf Weber Supporting functions for UTF-8 normalization are in utf8norm.c with the header utf8norm.h. Two normalization forms are supported: nfdi and nfdicf. nfdi: - Apply unicode normalization form NFD. - Remove any Default_Ignorable_Code_Point. nfdicf: - Apply unicode normalization form NFD. - Remove any Default_Ignorable_Code_Point. - Apply a full casefold (C + F). For the purposes of the code, a string is valid UTF-8 if: - The values encoded are 0x1..0x10FFFF. - The surrogate codepoints 0xD800..0xDFFFF are not encoded. - The shortest possible encoding is used for all values. The supporting functions work on null-terminated strings (utf8 prefix) and on length-limited strings (utf8n prefix). From the original SGI patch and for conformity with coding standards, the utf8data_t typedef was dropped, since it was just masking the struct keyword. On other occasions, namely utf8leaf_t and utf8trie_t, I decided to keep it, since they are simple pointers to memory buffers, and using uchars here wouldn't provide any more meaningful information. From the original submission, we also converted from the compatibility form to canonical. Changes since v4: - Convert to canonical decomposition form. Changes since RFC v1: - utf8_version_is_supported receives maj, min and rev as separate arguments. (Olaf Weber) Signed-off-by: Olaf Weber Signed-off-by: Gabriel Krisman Bertazi [Rebase to Mainline] [Fix up checkpatch.pl warnings] [Drop typedefs] [move out of libxfs] [Convert from NFKD to NFD] --- fs/unicode/Makefile | 3 + fs/unicode/utf8-norm.c | 640 +++++++++++++++++++++++++++++++++++++++++ fs/unicode/utf8n.h | 112 ++++++++ 3 files changed, 755 insertions(+) create mode 100644 fs/unicode/utf8-norm.c create mode 100644 fs/unicode/utf8n.h diff --git a/fs/unicode/Makefile b/fs/unicode/Makefile index efc944a75b4b..1ed10e40c30d 100644 --- a/fs/unicode/Makefile +++ b/fs/unicode/Makefile @@ -2,6 +2,9 @@ UNICODE_VERSION=11.0.0 +obj-$(CONFIG_UNICODE) += utf8-norm.o + +$(obj)/utf8-norm.o: $(obj)/utf8data.h $(obj)/utf8data.h: $(srctree)/$(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE $(call cmd,mkutf8data) quiet_cmd_mkutf8data = MKUTF8DATA $@ diff --git a/fs/unicode/utf8-norm.c b/fs/unicode/utf8-norm.c new file mode 100644 index 000000000000..4ed50f3fda6e --- /dev/null +++ b/fs/unicode/utf8-norm.c @@ -0,0 +1,640 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + */ + +#include "utf8n.h" + +struct utf8data { + unsigned int maxage; + unsigned int offset; +}; + +#define __INCLUDED_FROM_UTF8NORM_C__ +#include "utf8data.h" +#undef __INCLUDED_FROM_UTF8NORM_C__ + +int utf8version_is_supported(u8 maj, u8 min, u8 rev) +{ + int i = ARRAY_SIZE(utf8agetab) - 1; + unsigned int sb_utf8version = UNICODE_AGE(maj, min, rev); + + while (i >= 0 && utf8agetab[i] != 0) { + if (sb_utf8version == utf8agetab[i]) + return 1; + i--; + } + return 0; +} +EXPORT_SYMBOL(utf8version_is_supported); + +/* + * UTF-8 valid ranges. + * + * The UTF-8 encoding spreads the bits of a 32bit word over several + * bytes. This table gives the ranges that can be held and how they'd + * be represented. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * There is an additional requirement on UTF-8, in that only the + * shortest representation of a 32bit value is to be used. A decoder + * must not decode sequences that do not satisfy this requirement. + * Thus the allowed ranges have a lower bound. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * Actual unicode characters are limited to the range 0x0 - 0x10FFFF, + * 17 planes of 65536 values. This limits the sequences actually seen + * even more, to just the following. + * + * 0 - 0x7F: 0 - 0x7F + * 0x80 - 0x7FF: 0xC2 0x80 - 0xDF 0xBF + * 0x800 - 0xFFFF: 0xE0 0xA0 0x80 - 0xEF 0xBF 0xBF + * 0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF + * + * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed. + * + * Note that the longest sequence seen with valid usage is 4 bytes, + * the same a single UTF-32 character. This makes the UTF-8 + * representation of Unicode strictly smaller than UTF-32. + * + * The shortest sequence requirement was introduced by: + * Corrigendum #1: UTF-8 Shortest Form + * It can be found here: + * http://www.unicode.org/versions/corrigendum1.html + * + */ + +/* + * Return the number of bytes used by the current UTF-8 sequence. + * Assumes the input points to the first byte of a valid UTF-8 + * sequence. + */ +static inline int utf8clen(const char *s) +{ + unsigned char c = *s; + + return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0); +} + +/* + * utf8trie_t + * + * A compact binary tree, used to decode UTF-8 characters. + * + * Internal nodes are one byte for the node itself, and up to three + * bytes for an offset into the tree. The first byte contains the + * following information: + * NEXTBYTE - flag - advance to next byte if set + * BITNUM - 3 bit field - the bit number to tested + * OFFLEN - 2 bit field - number of bytes in the offset + * if offlen == 0 (non-branching node) + * RIGHTPATH - 1 bit field - set if the following node is for the + * right-hand path (tested bit is set) + * TRIENODE - 1 bit field - set if the following node is an internal + * node, otherwise it is a leaf node + * if offlen != 0 (branching node) + * LEFTNODE - 1 bit field - set if the left-hand node is internal + * RIGHTNODE - 1 bit field - set if the right-hand node is internal + * + * Due to the way utf8 works, there cannot be branching nodes with + * NEXTBYTE set, and moreover those nodes always have a righthand + * descendant. + */ +typedef const unsigned char utf8trie_t; +#define BITNUM 0x07 +#define NEXTBYTE 0x08 +#define OFFLEN 0x30 +#define OFFLEN_SHIFT 4 +#define RIGHTPATH 0x40 +#define TRIENODE 0x80 +#define RIGHTNODE 0x40 +#define LEFTNODE 0x80 + +/* + * utf8leaf_t + * + * The leaves of the trie are embedded in the trie, and so the same + * underlying datatype: unsigned char. + * + * leaf[0]: The unicode version, stored as a generation number that is + * an index into utf8agetab[]. With this we can filter code + * points based on the unicode version in which they were + * defined. The CCC of a non-defined code point is 0. + * leaf[1]: Canonical Combining Class. During normalization, we need + * to do a stable sort into ascending order of all characters + * with a non-zero CCC that occur between two characters with + * a CCC of 0, or at the begin or end of a string. + * The unicode standard guarantees that all CCC values are + * between 0 and 254 inclusive, which leaves 255 available as + * a special value. + * Code points with CCC 0 are known as stoppers. + * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the + * start of a NUL-terminated string that is the decomposition + * of the character. + * The CCC of a decomposable character is the same as the CCC + * of the first character of its decomposition. + * Some characters decompose as the empty string: these are + * characters with the Default_Ignorable_Code_Point property. + * These do affect normalization, as they all have CCC 0. + * + * The decompositions in the trie have been fully expanded. + * + * Casefolding, if applicable, is also done using decompositions. + * + * The trie is constructed in such a way that leaves exist for all + * UTF-8 sequences that match the criteria from the "UTF-8 valid + * ranges" comment above, and only for those sequences. Therefore a + * lookup in the trie can be used to validate the UTF-8 input. + */ +typedef const unsigned char utf8leaf_t; + +#define LEAF_GEN(LEAF) ((LEAF)[0]) +#define LEAF_CCC(LEAF) ((LEAF)[1]) +#define LEAF_STR(LEAF) ((const char *)((LEAF) + 2)) + +#define MINCCC (0) +#define MAXCCC (254) +#define STOPPER (0) +#define DECOMPOSE (255) + +/* + * Use trie to scan s, touching at most len bytes. + * Returns the leaf if one exists, NULL otherwise. + * + * A non-NULL return guarantees that the UTF-8 sequence starting at s + * is well-formed and corresponds to a known unicode code point. The + * shorthand for this will be "is valid UTF-8 unicode". + */ +static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s, + size_t len) +{ + utf8trie_t *trie = utf8data + data->offset; + int offlen; + int offset; + int mask; + int node; + + if (!data) + return NULL; + if (len == 0) + return NULL; + node = 1; + while (node) { + offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT; + if (*trie & NEXTBYTE) { + if (--len == 0) + return NULL; + s++; + } + mask = 1 << (*trie & BITNUM); + if (*s & mask) { + /* Right leg */ + if (offlen) { + /* Right node at offset of trie */ + node = (*trie & RIGHTNODE); + offset = trie[offlen]; + while (--offlen) { + offset <<= 8; + offset |= trie[offlen]; + } + trie += offset; + } else if (*trie & RIGHTPATH) { + /* Right node after this node */ + node = (*trie & TRIENODE); + trie++; + } else { + /* No right node. */ + node = 0; + trie = NULL; + } + } else { + /* Left leg */ + if (offlen) { + /* Left node after this node. */ + node = (*trie & LEFTNODE); + trie += offlen + 1; + } else if (*trie & RIGHTPATH) { + /* No left node. */ + node = 0; + trie = NULL; + } else { + /* Left node after this node */ + node = (*trie & TRIENODE); + trie++; + } + } + } + return trie; +} + +/* + * Use trie to scan s. + * Returns the leaf if one exists, NULL otherwise. + * + * Forwards to utf8nlookup(). + */ +static utf8leaf_t *utf8lookup(const struct utf8data *data, const char *s) +{ + return utf8nlookup(data, s, (size_t)-1); +} + +/* + * Maximum age of any character in s. + * Return -1 if s is not valid UTF-8 unicode. + * Return 0 if only non-assigned code points are used. + */ +int utf8agemax(const struct utf8data *data, const char *s) +{ + utf8leaf_t *leaf; + int age = 0; + int leaf_age; + + if (!data) + return -1; + while (*s) { + leaf = utf8lookup(data, s); + if (!leaf) + return -1; + + leaf_age = utf8agetab[LEAF_GEN(leaf)]; + if (leaf_age <= data->maxage && leaf_age > age) + age = leaf_age; + s += utf8clen(s); + } + return age; +} +EXPORT_SYMBOL(utf8agemax); + +/* + * Minimum age of any character in s. + * Return -1 if s is not valid UTF-8 unicode. + * Return 0 if non-assigned code points are used. + */ +int utf8agemin(const struct utf8data *data, const char *s) +{ + utf8leaf_t *leaf; + int age; + int leaf_age; + + if (!data) + return -1; + age = data->maxage; + while (*s) { + leaf = utf8lookup(data, s); + if (!leaf) + return -1; + leaf_age = utf8agetab[LEAF_GEN(leaf)]; + if (leaf_age <= data->maxage && leaf_age < age) + age = leaf_age; + s += utf8clen(s); + } + return age; +} +EXPORT_SYMBOL(utf8agemin); + +/* + * Maximum age of any character in s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +int utf8nagemax(const struct utf8data *data, const char *s, size_t len) +{ + utf8leaf_t *leaf; + int age = 0; + int leaf_age; + + if (!data) + return -1; + while (len && *s) { + leaf = utf8nlookup(data, s, len); + if (!leaf) + return -1; + leaf_age = utf8agetab[LEAF_GEN(leaf)]; + if (leaf_age <= data->maxage && leaf_age > age) + age = leaf_age; + len -= utf8clen(s); + s += utf8clen(s); + } + return age; +} +EXPORT_SYMBOL(utf8nagemax); + +/* + * Maximum age of any character in s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +int utf8nagemin(const struct utf8data *data, const char *s, size_t len) +{ + utf8leaf_t *leaf; + int leaf_age; + int age; + + if (!data) + return -1; + age = data->maxage; + while (len && *s) { + leaf = utf8nlookup(data, s, len); + if (!leaf) + return -1; + leaf_age = utf8agetab[LEAF_GEN(leaf)]; + if (leaf_age <= data->maxage && leaf_age < age) + age = leaf_age; + len -= utf8clen(s); + s += utf8clen(s); + } + return age; +} +EXPORT_SYMBOL(utf8nagemin); + +/* + * Length of the normalization of s. + * Return -1 if s is not valid UTF-8 unicode. + * + * A string of Default_Ignorable_Code_Point has length 0. + */ +ssize_t utf8len(const struct utf8data *data, const char *s) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!data) + return -1; + while (*s) { + leaf = utf8lookup(data, s); + if (!leaf) + return -1; + if (utf8agetab[LEAF_GEN(leaf)] > data->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + s += utf8clen(s); + } + return ret; +} +EXPORT_SYMBOL(utf8len); + +/* + * Length of the normalization of s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!data) + return -1; + while (len && *s) { + leaf = utf8nlookup(data, s, len); + if (!leaf) + return -1; + if (utf8agetab[LEAF_GEN(leaf)] > data->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + len -= utf8clen(s); + s += utf8clen(s); + } + return ret; +} +EXPORT_SYMBOL(utf8nlen); + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * u8c : pointer to cursor. + * data : const struct utf8data to use for normalization. + * s : string. + * len : length of s. + * + * Returns -1 on error, 0 on success. + */ +int utf8ncursor(struct utf8cursor *u8c, const struct utf8data *data, + const char *s, size_t len) +{ + if (!data) + return -1; + if (!s) + return -1; + u8c->data = data; + u8c->s = s; + u8c->p = NULL; + u8c->ss = NULL; + u8c->sp = NULL; + u8c->len = len; + u8c->slen = 0; + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + /* Check we didn't clobber the maximum length. */ + if (u8c->len != len) + return -1; + /* The first byte of s may not be an utf8 continuation. */ + if (len > 0 && (*s & 0xC0) == 0x80) + return -1; + return 0; +} +EXPORT_SYMBOL(utf8ncursor); + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * u8c : pointer to cursor. + * data : const struct utf8data to use for normalization. + * s : NUL-terminated string. + * + * Returns -1 on error, 0 on success. + */ +int utf8cursor(struct utf8cursor *u8c, const struct utf8data *data, + const char *s) +{ + return utf8ncursor(u8c, data, s, (unsigned int)-1); +} +EXPORT_SYMBOL(utf8cursor); + +/* + * Get one byte from the normalized form of the string described by u8c. + * + * Returns the byte cast to an unsigned char on succes, and -1 on failure. + * + * The cursor keeps track of the location in the string in u8c->s. + * When a character is decomposed, the current location is stored in + * u8c->p, and u8c->s is set to the start of the decomposition. Note + * that bytes from a decomposition do not count against u8c->len. + * + * Characters are emitted if they match the current CCC in u8c->ccc. + * Hitting end-of-string while u8c->ccc == STOPPER means we're done, + * and the function returns 0 in that case. + * + * Sorting by CCC is done by repeatedly scanning the string. The + * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at + * the start of the scan. The first pass finds the lowest CCC to be + * emitted and stores it in u8c->nccc, the second pass emits the + * characters with this CCC and finds the next lowest CCC. This limits + * the number of passes to 1 + the number of different CCCs in the + * sequence being scanned. + * + * Therefore: + * u8c->p != NULL -> a decomposition is being scanned. + * u8c->ss != NULL -> this is a repeating scan. + * u8c->ccc == -1 -> this is the first scan of a repeating scan. + */ +int utf8byte(struct utf8cursor *u8c) +{ + utf8leaf_t *leaf; + int ccc; + + for (;;) { + /* Check for the end of a decomposed character. */ + if (u8c->p && *u8c->s == '\0') { + u8c->s = u8c->p; + u8c->p = NULL; + } + + /* Check for end-of-string. */ + if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) { + /* There is no next byte. */ + if (u8c->ccc == STOPPER) + return 0; + /* End-of-string during a scan counts as a stopper. */ + ccc = STOPPER; + goto ccc_mismatch; + } else if ((*u8c->s & 0xC0) == 0x80) { + /* This is a continuation of the current character. */ + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Look up the data for the current character. */ + if (u8c->p) + leaf = utf8lookup(u8c->data, u8c->s); + else + leaf = utf8nlookup(u8c->data, u8c->s, u8c->len); + + /* No leaf found implies that the input is a binary blob. */ + if (!leaf) + return -1; + + ccc = LEAF_CCC(leaf); + /* Characters that are too new have CCC 0. */ + if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) { + ccc = STOPPER; + } else if (ccc == DECOMPOSE) { + u8c->len -= utf8clen(u8c->s); + u8c->p = u8c->s + utf8clen(u8c->s); + u8c->s = LEAF_STR(leaf); + /* Empty decomposition implies CCC 0. */ + if (*u8c->s == '\0') { + if (u8c->ccc == STOPPER) + continue; + ccc = STOPPER; + goto ccc_mismatch; + } + leaf = utf8lookup(u8c->data, u8c->s); + } + + /* + * If this is not a stopper, then see if it updates + * the next canonical class to be emitted. + */ + if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc) + u8c->nccc = ccc; + + /* + * Return the current byte if this is the current + * combining class. + */ + if (ccc == u8c->ccc) { + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Current combining class mismatch. */ +ccc_mismatch: + if (u8c->nccc == STOPPER) { + /* + * Scan forward for the first canonical class + * to be emitted. Save the position from + * which to restart. + */ + u8c->ccc = MINCCC - 1; + u8c->nccc = ccc; + u8c->sp = u8c->p; + u8c->ss = u8c->s; + u8c->slen = u8c->len; + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (ccc != STOPPER) { + /* Not a stopper, and not the ccc we're emitting. */ + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (u8c->nccc != MAXCCC + 1) { + /* At a stopper, restart for next ccc. */ + u8c->ccc = u8c->nccc; + u8c->nccc = MAXCCC + 1; + u8c->s = u8c->ss; + u8c->p = u8c->sp; + u8c->len = u8c->slen; + } else { + /* All done, proceed from here. */ + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + u8c->sp = NULL; + u8c->ss = NULL; + u8c->slen = 0; + } + } +} +EXPORT_SYMBOL(utf8byte); + +const struct utf8data *utf8nfdi(unsigned int maxage) +{ + int i = ARRAY_SIZE(utf8nfdidata) - 1; + + while (maxage < utf8nfdidata[i].maxage) + i--; + if (maxage > utf8nfdidata[i].maxage) + return NULL; + return &utf8nfdidata[i]; +} +EXPORT_SYMBOL(utf8nfdi); + +const struct utf8data *utf8nfdicf(unsigned int maxage) +{ + int i = ARRAY_SIZE(utf8nfdicfdata) - 1; + + while (maxage < utf8nfdicfdata[i].maxage) + i--; + if (maxage > utf8nfdicfdata[i].maxage) + return NULL; + return &utf8nfdicfdata[i]; +} +EXPORT_SYMBOL(utf8nfdicf); diff --git a/fs/unicode/utf8n.h b/fs/unicode/utf8n.h new file mode 100644 index 000000000000..696e52124296 --- /dev/null +++ b/fs/unicode/utf8n.h @@ -0,0 +1,112 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + */ + +#ifndef UTF8NORM_H +#define UTF8NORM_H + +#include +#include +#include +#include + +/* Encoding a unicode version number as a single unsigned int. */ +#define UNICODE_MAJ_SHIFT (16) +#define UNICODE_MIN_SHIFT (8) + +#define UNICODE_AGE(MAJ, MIN, REV) \ + (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \ + ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \ + ((unsigned int)(REV))) + +/* Highest unicode version supported by the data tables. */ +extern int utf8version_is_supported(u8 maj, u8 min, u8 rev); + +/* + * Look for the correct const struct utf8data for a unicode version. + * Returns NULL if the version requested is too new. + * + * Two normalization forms are supported: nfdi and nfdicf. + * + * nfdi: + * - Apply unicode normalization form NFD. + * - Remove any Default_Ignorable_Code_Point. + * + * nfdicf: + * - Apply unicode normalization form NFD. + * - Remove any Default_Ignorable_Code_Point. + * - Apply a full casefold (C + F). + */ +extern const struct utf8data *utf8nfdi(unsigned int maxage); +extern const struct utf8data *utf8nfdicf(unsigned int maxage); + +/* + * Determine the maximum age of any unicode character in the string. + * Returns 0 if only unassigned code points are present. + * Returns -1 if the input is not valid UTF-8. + */ +extern int utf8agemax(const struct utf8data *data, const char *s); +extern int utf8nagemax(const struct utf8data *data, const char *s, size_t len); + +/* + * Determine the minimum age of any unicode character in the string. + * Returns 0 if any unassigned code points are present. + * Returns -1 if the input is not valid UTF-8. + */ +extern int utf8agemin(const struct utf8data *data, const char *s); +extern int utf8nagemin(const struct utf8data *data, const char *s, size_t len); + +/* + * Determine the length of the normalized from of the string, + * excluding any terminating NULL byte. + * Returns 0 if only ignorable code points are present. + * Returns -1 if the input is not valid UTF-8. + */ +extern ssize_t utf8len(const struct utf8data *data, const char *s); +extern ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len); + +/* + * Cursor structure used by the normalizer. + */ +struct utf8cursor { + const struct utf8data *data; + const char *s; + const char *p; + const char *ss; + const char *sp; + unsigned int len; + unsigned int slen; + short int ccc; + short int nccc; +}; + +/* + * Initialize a utf8cursor to normalize a string. + * Returns 0 on success. + * Returns -1 on failure. + */ +extern int utf8cursor(struct utf8cursor *u8c, const struct utf8data *data, + const char *s); +extern int utf8ncursor(struct utf8cursor *u8c, const struct utf8data *data, + const char *s, size_t len); + +/* + * Get the next byte in the normalization. + * Returns a value > 0 && < 256 on success. + * Returns 0 when the end of the normalization is reached. + * Returns -1 if the string being normalized is not valid UTF-8. + */ +extern int utf8byte(struct utf8cursor *u8c); + +#endif /* UTF8NORM_H */ From patchwork Mon Jan 28 21:32:16 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 1032249 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=collabora.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43pNC548V0z9sDL for ; Tue, 29 Jan 2019 08:32:49 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728111AbfA1Vcr (ORCPT ); Mon, 28 Jan 2019 16:32:47 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:58124 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728033AbfA1Vcp (ORCPT ); Mon, 28 Jan 2019 16:32:45 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 14F0327FB78 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, sfrench@samba.org, darrick.wong@oracle.com, samba-technical@lists.samba.org, jlayton@kernel.org, bfields@fieldses.org, paulus@samba.org, Olaf Weber , Gabriel Krisman Bertazi Subject: [PATCH RFC v5 04/11] unicode: reduce the size of utf8data[] Date: Mon, 28 Jan 2019 16:32:16 -0500 Message-Id: <20190128213223.31512-5-krisman@collabora.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20190128213223.31512-1-krisman@collabora.com> References: <20190128213223.31512-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org From: Olaf Weber Remove the Hangul decompositions from the utf8data trie, and do algorithmic decomposition to calculate them on the fly. To store the decomposition the caller of utf8lookup()/utf8nlookup() must provide a 12-byte buffer, which is used to synthesize a leaf with the decomposition. Trie size is reduced from 245kB to 90kB. Signed-off-by: Olaf Weber Signed-off-by: Gabriel Krisman Bertazi [Rebase to mainline] [Fix checkpatch errors] [Extract robustness fixes and merge back to original mkutf8data.c patch] --- fs/unicode/utf8-norm.c | 191 ++++++++++++++++++++++--- fs/unicode/utf8n.h | 4 + scripts/mkutf8data.c | 307 +++++++++++++++++++++++++++++++++++------ 3 files changed, 443 insertions(+), 59 deletions(-) diff --git a/fs/unicode/utf8-norm.c b/fs/unicode/utf8-norm.c index 4ed50f3fda6e..845c0f300370 100644 --- a/fs/unicode/utf8-norm.c +++ b/fs/unicode/utf8-norm.c @@ -98,6 +98,38 @@ static inline int utf8clen(const char *s) return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0); } +/* + * Decode a 3-byte UTF-8 sequence. + */ +static unsigned int +utf8decode3(const char *str) +{ + unsigned int uc; + + uc = *str++ & 0x0F; + uc <<= 6; + uc |= *str++ & 0x3F; + uc <<= 6; + uc |= *str++ & 0x3F; + + return uc; +} + +/* + * Encode a 3-byte UTF-8 sequence. + */ +static int +utf8encode3(char *str, unsigned int val) +{ + str[2] = (val & 0x3F) | 0x80; + val >>= 6; + str[1] = (val & 0x3F) | 0x80; + val >>= 6; + str[0] = val | 0xE0; + + return 3; +} + /* * utf8trie_t * @@ -159,7 +191,8 @@ typedef const unsigned char utf8trie_t; * characters with the Default_Ignorable_Code_Point property. * These do affect normalization, as they all have CCC 0. * - * The decompositions in the trie have been fully expanded. + * The decompositions in the trie have been fully expanded, with the + * exception of Hangul syllables, which are decomposed algorithmically. * * Casefolding, if applicable, is also done using decompositions. * @@ -179,6 +212,105 @@ typedef const unsigned char utf8leaf_t; #define STOPPER (0) #define DECOMPOSE (255) +/* Marker for hangul syllable decomposition. */ +#define HANGUL ((char)(255)) +/* Size of the synthesized leaf used for Hangul syllable decomposition. */ +#define UTF8HANGULLEAF (12) + +/* + * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0) + * + * AC00;;Lo;0;L;;;;;N;;;;; + * D7A3;;Lo;0;L;;;;;N;;;;; + * + * SBase = 0xAC00 + * LBase = 0x1100 + * VBase = 0x1161 + * TBase = 0x11A7 + * LCount = 19 + * VCount = 21 + * TCount = 28 + * NCount = 588 (VCount * TCount) + * SCount = 11172 (LCount * NCount) + * + * Decomposition: + * SIndex = s - SBase + * + * LV (Canonical/Full) + * LIndex = SIndex / NCount + * VIndex = (Sindex % NCount) / TCount + * LPart = LBase + LIndex + * VPart = VBase + VIndex + * + * LVT (Canonical) + * LVIndex = (SIndex / TCount) * TCount + * TIndex = (Sindex % TCount) + * LVPart = SBase + LVIndex + * TPart = TBase + TIndex + * + * LVT (Full) + * LIndex = SIndex / NCount + * VIndex = (Sindex % NCount) / TCount + * TIndex = (Sindex % TCount) + * LPart = LBase + LIndex + * VPart = VBase + VIndex + * if (TIndex == 0) { + * d = + * } else { + * TPart = TBase + TIndex + * d = + * } + */ + +/* Constants */ +#define SB (0xAC00) +#define LB (0x1100) +#define VB (0x1161) +#define TB (0x11A7) +#define LC (19) +#define VC (21) +#define TC (28) +#define NC (VC * TC) +#define SC (LC * NC) + +/* Algorithmic decomposition of hangul syllable. */ +static utf8leaf_t * +utf8hangul(const char *str, unsigned char *hangul) +{ + unsigned int si; + unsigned int li; + unsigned int vi; + unsigned int ti; + unsigned char *h; + + /* Calculate the SI, LI, VI, and TI values. */ + si = utf8decode3(str) - SB; + li = si / NC; + vi = (si % NC) / TC; + ti = si % TC; + + /* Fill in base of leaf. */ + h = hangul; + LEAF_GEN(h) = 2; + LEAF_CCC(h) = DECOMPOSE; + h += 2; + + /* Add LPart, a 3-byte UTF-8 sequence. */ + h += utf8encode3((char *)h, li + LB); + + /* Add VPart, a 3-byte UTF-8 sequence. */ + h += utf8encode3((char *)h, vi + VB); + + /* Add TPart if required, also a 3-byte UTF-8 sequence. */ + if (ti) + h += utf8encode3((char *)h, ti + TB); + + /* Terminate string. */ + h[0] = '\0'; + + return hangul; +} + /* * Use trie to scan s, touching at most len bytes. * Returns the leaf if one exists, NULL otherwise. @@ -187,8 +319,8 @@ typedef const unsigned char utf8leaf_t; * is well-formed and corresponds to a known unicode code point. The * shorthand for this will be "is valid UTF-8 unicode". */ -static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s, - size_t len) +static utf8leaf_t *utf8nlookup(const struct utf8data *data, + unsigned char *hangul, const char *s, size_t len) { utf8trie_t *trie = utf8data + data->offset; int offlen; @@ -226,8 +358,7 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s, trie++; } else { /* No right node. */ - node = 0; - trie = NULL; + return NULL; } } else { /* Left leg */ @@ -237,8 +368,7 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s, trie += offlen + 1; } else if (*trie & RIGHTPATH) { /* No left node. */ - node = 0; - trie = NULL; + return NULL; } else { /* Left node after this node */ node = (*trie & TRIENODE); @@ -246,6 +376,14 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s, } } } + /* + * Hangul decomposition is done algorithmically. These are the + * codepoints >= 0xAC00 and <= 0xD7A3. Their UTF-8 encoding is + * always 3 bytes long, so s has been advanced twice, and the + * start of the sequence is at s-2. + */ + if (LEAF_CCC(trie) == DECOMPOSE && LEAF_STR(trie)[0] == HANGUL) + trie = utf8hangul(s - 2, hangul); return trie; } @@ -255,9 +393,10 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s, * * Forwards to utf8nlookup(). */ -static utf8leaf_t *utf8lookup(const struct utf8data *data, const char *s) +static utf8leaf_t *utf8lookup(const struct utf8data *data, + unsigned char *hangul, const char *s) { - return utf8nlookup(data, s, (size_t)-1); + return utf8nlookup(data, hangul, s, (size_t)-1); } /* @@ -270,11 +409,13 @@ int utf8agemax(const struct utf8data *data, const char *s) utf8leaf_t *leaf; int age = 0; int leaf_age; + unsigned char hangul[UTF8HANGULLEAF]; if (!data) return -1; + while (*s) { - leaf = utf8lookup(data, s); + leaf = utf8lookup(data, hangul, s); if (!leaf) return -1; @@ -297,12 +438,13 @@ int utf8agemin(const struct utf8data *data, const char *s) utf8leaf_t *leaf; int age; int leaf_age; + unsigned char hangul[UTF8HANGULLEAF]; if (!data) return -1; age = data->maxage; while (*s) { - leaf = utf8lookup(data, s); + leaf = utf8lookup(data, hangul, s); if (!leaf) return -1; leaf_age = utf8agetab[LEAF_GEN(leaf)]; @@ -323,11 +465,13 @@ int utf8nagemax(const struct utf8data *data, const char *s, size_t len) utf8leaf_t *leaf; int age = 0; int leaf_age; + unsigned char hangul[UTF8HANGULLEAF]; if (!data) return -1; + while (len && *s) { - leaf = utf8nlookup(data, s, len); + leaf = utf8nlookup(data, hangul, s, len); if (!leaf) return -1; leaf_age = utf8agetab[LEAF_GEN(leaf)]; @@ -349,12 +493,13 @@ int utf8nagemin(const struct utf8data *data, const char *s, size_t len) utf8leaf_t *leaf; int leaf_age; int age; + unsigned char hangul[UTF8HANGULLEAF]; if (!data) return -1; age = data->maxage; while (len && *s) { - leaf = utf8nlookup(data, s, len); + leaf = utf8nlookup(data, hangul, s, len); if (!leaf) return -1; leaf_age = utf8agetab[LEAF_GEN(leaf)]; @@ -377,11 +522,12 @@ ssize_t utf8len(const struct utf8data *data, const char *s) { utf8leaf_t *leaf; size_t ret = 0; + unsigned char hangul[UTF8HANGULLEAF]; if (!data) return -1; while (*s) { - leaf = utf8lookup(data, s); + leaf = utf8lookup(data, hangul, s); if (!leaf) return -1; if (utf8agetab[LEAF_GEN(leaf)] > data->maxage) @@ -404,11 +550,12 @@ ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len) { utf8leaf_t *leaf; size_t ret = 0; + unsigned char hangul[UTF8HANGULLEAF]; if (!data) return -1; while (len && *s) { - leaf = utf8nlookup(data, s, len); + leaf = utf8nlookup(data, hangul, s, len); if (!leaf) return -1; if (utf8agetab[LEAF_GEN(leaf)] > data->maxage) @@ -531,10 +678,12 @@ int utf8byte(struct utf8cursor *u8c) } /* Look up the data for the current character. */ - if (u8c->p) - leaf = utf8lookup(u8c->data, u8c->s); - else - leaf = utf8nlookup(u8c->data, u8c->s, u8c->len); + if (u8c->p) { + leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s); + } else { + leaf = utf8nlookup(u8c->data, u8c->hangul, + u8c->s, u8c->len); + } /* No leaf found implies that the input is a binary blob. */ if (!leaf) @@ -555,7 +704,9 @@ int utf8byte(struct utf8cursor *u8c) ccc = STOPPER; goto ccc_mismatch; } - leaf = utf8lookup(u8c->data, u8c->s); + + leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s); + ccc = LEAF_CCC(leaf); } /* diff --git a/fs/unicode/utf8n.h b/fs/unicode/utf8n.h index 696e52124296..b63a9091dc39 100644 --- a/fs/unicode/utf8n.h +++ b/fs/unicode/utf8n.h @@ -76,6 +76,9 @@ extern int utf8nagemin(const struct utf8data *data, const char *s, size_t len); extern ssize_t utf8len(const struct utf8data *data, const char *s); extern ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len); +/* Needed in struct utf8cursor below. */ +#define UTF8HANGULLEAF (12) + /* * Cursor structure used by the normalizer. */ @@ -89,6 +92,7 @@ struct utf8cursor { unsigned int slen; short int ccc; short int nccc; + unsigned char hangul[UTF8HANGULLEAF]; }; /* diff --git a/scripts/mkutf8data.c b/scripts/mkutf8data.c index 4df1a799f73c..12ce94b43be6 100644 --- a/scripts/mkutf8data.c +++ b/scripts/mkutf8data.c @@ -182,10 +182,14 @@ typedef unsigned char utf8leaf_t; #define MAXCCC (254) #define STOPPER (0) #define DECOMPOSE (255) +#define HANGUL ((char)(255)) + +#define UTF8HANGULLEAF (12) struct tree; -static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t); -static utf8leaf_t *utf8lookup(struct tree *, const char *); +static utf8leaf_t *utf8nlookup(struct tree *, unsigned char *, + const char *, size_t); +static utf8leaf_t *utf8lookup(struct tree *, unsigned char *, const char *); unsigned char *utf8data; size_t utf8data_size; @@ -333,6 +337,8 @@ static int utf32valid(unsigned int unichar) return unichar < 0x110000; } +#define HANGUL_SYLLABLE(U) ((U) >= 0xAC00 && (U) <= 0xD7A3) + #define NODE 1 #define LEAF 0 @@ -463,7 +469,7 @@ static void tree_walk(struct tree *tree) indent+1); leaves += 1; } else if (node->right) { - assert(node->rightnode==NODE); + assert(node->rightnode == NODE); indent += 1; node = node->right; break; @@ -857,7 +863,7 @@ static void mark_nodes(struct tree *tree) } } } else if (node->right) { - assert(node->rightnode==NODE); + assert(node->rightnode == NODE); node = node->right; continue; } @@ -909,7 +915,7 @@ static void mark_nodes(struct tree *tree) } } } else if (node->right) { - assert(node->rightnode==NODE); + assert(node->rightnode == NODE); node = node->right; if (!node->mark && node->parent->mark && !node->parent->left) { @@ -992,7 +998,7 @@ static int index_nodes(struct tree *tree, int index) index += tree->leaf_size(node->right); count++; } else if (node->right) { - assert(node->rightnode==NODE); + assert(node->rightnode == NODE); indent += 1; node = node->right; break; @@ -1013,6 +1019,25 @@ static int index_nodes(struct tree *tree, int index) return index; } +/* + * Mark the nodes in a subtree, helper for size_nodes(). + */ +static int mark_subtree(struct node *node) +{ + int changed; + + if (!node || node->mark) + return 0; + node->mark = 1; + node->index = node->parent->index; + changed = 1; + if (node->leftnode == NODE) + changed += mark_subtree(node->left); + if (node->rightnode == NODE) + changed += mark_subtree(node->right); + return changed; +} + /* * Compute the size of nodes and leaves. We start by assuming that * each node needs to store a three-byte offset. The indexes of the @@ -1031,6 +1056,7 @@ static int size_nodes(struct tree *tree) unsigned int bitmask; unsigned int pathbits; unsigned int pathmask; + unsigned int nbit; int changed; int offset; int size; @@ -1058,22 +1084,40 @@ static int size_nodes(struct tree *tree) size = 1; } else { if (node->rightnode == NODE) { + /* + * If the right node is not marked, + * look for a corresponding node in + * the next tree. Such a node need + * not exist. + */ right = node->right; next = tree->next; while (!right->mark) { assert(next); n = next->root; while (n->bitnum != node->bitnum) { - if (pathbits & (1<bitnum)) + nbit = 1 << n->bitnum; + if (!(pathmask & nbit)) + break; + if (pathbits & nbit) { + if (n->rightnode == LEAF) + break; n = n->right; - else + } else { + if (n->leftnode == LEAF) + break; n = n->left; + } } + if (n->bitnum != node->bitnum) + break; n = n->right; - assert(right->bitnum == n->bitnum); right = n; next = next->next; } + /* Make sure the right node is marked. */ + if (!right->mark) + changed += mark_subtree(right); offset = right->index - node->index; } else { offset = *tree->leaf_index(tree, node->right); @@ -1115,7 +1159,7 @@ static int size_nodes(struct tree *tree) if (node->rightnode == LEAF) { assert(node->right); } else if (node->right) { - assert(node->rightnode==NODE); + assert(node->rightnode == NODE); indent += 1; node = node->right; break; @@ -1148,8 +1192,15 @@ static void emit(struct tree *tree, unsigned char *data) int offset; int index; int indent; + int size; + int bytes; + int leaves; + int nodes[4]; unsigned char byte; + nodes[0] = nodes[1] = nodes[2] = nodes[3] = 0; + leaves = 0; + bytes = 0; index = tree->index; data += index; indent = 1; @@ -1158,7 +1209,10 @@ static void emit(struct tree *tree, unsigned char *data) if (tree->childnode == LEAF) { assert(tree->root); tree->leaf_emit(tree->root, data); - return; + size = tree->leaf_size(tree->root); + index += size; + leaves++; + goto done; } assert(tree->childnode == NODE); @@ -1185,6 +1239,7 @@ static void emit(struct tree *tree, unsigned char *data) offlen = 2; else offlen = 3; + nodes[offlen]++; offset = node->offset; byte |= offlen << OFFLEN_SHIFT; *data++ = byte; @@ -1197,12 +1252,14 @@ static void emit(struct tree *tree, unsigned char *data) } else if (node->left) { if (node->leftnode == NODE) byte |= TRIENODE; + nodes[0]++; *data++ = byte; index++; } else if (node->right) { byte |= RIGHTNODE; if (node->rightnode == NODE) byte |= TRIENODE; + nodes[0]++; *data++ = byte; index++; } else { @@ -1217,7 +1274,10 @@ static void emit(struct tree *tree, unsigned char *data) assert(node->left); data = tree->leaf_emit(node->left, data); - index += tree->leaf_size(node->left); + size = tree->leaf_size(node->left); + index += size; + bytes += size; + leaves++; } else if (node->left) { assert(node->leftnode == NODE); indent += 1; @@ -1231,9 +1291,12 @@ static void emit(struct tree *tree, unsigned char *data) assert(node->right); data = tree->leaf_emit(node->right, data); - index += tree->leaf_size(node->right); + size = tree->leaf_size(node->right); + index += size; + bytes += size; + leaves++; } else if (node->right) { - assert(node->rightnode==NODE); + assert(node->rightnode == NODE); indent += 1; node = node->right; break; @@ -1245,6 +1308,15 @@ static void emit(struct tree *tree, unsigned char *data) indent -= 1; } } +done: + if (verbose > 0) { + printf("Emitted %d (%d) leaves", + leaves, bytes); + printf(" %d (%d+%d+%d+%d) nodes", + nodes[0] + nodes[1] + nodes[2] + nodes[3], + nodes[0], nodes[1], nodes[2], nodes[3]); + printf(" %d total\n", index - tree->index); + } } /* ------------------------------------------------------------------ */ @@ -1346,8 +1418,12 @@ static void nfdi_print(void *l, int indent) printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf, leaf->code, leaf->ccc, leaf->gen); - if (leaf->utf8nfdi) + + if (leaf->utf8nfdi && leaf->utf8nfdi[0] == HANGUL) + printf(" nfdi \"%s\"", "HANGUL SYLLABLE"); + else if (leaf->utf8nfdi) printf(" nfdi \"%s\"", (const char*)leaf->utf8nfdi); + printf("\n"); } @@ -1357,8 +1433,11 @@ static void nfdicf_print(void *l, int indent) printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf, leaf->code, leaf->ccc, leaf->gen); + if (leaf->utf8nfdicf) printf(" nfdicf \"%s\"", (const char*)leaf->utf8nfdicf); + else if (leaf->utf8nfdi && leaf->utf8nfdi[0] == HANGUL) + printf(" nfdi \"%s\"", "HANGUL SYLLABLE"); else if (leaf->utf8nfdi) printf(" nfdi \"%s\"", (const char*)leaf->utf8nfdi); printf("\n"); @@ -1388,9 +1467,11 @@ static int correction_mark(void *l) static int nfdi_size(void *l) { struct unicode_data *leaf = l; - int size = 2; - if (leaf->utf8nfdi) + + if (HANGUL_SYLLABLE(leaf->code)) + size += 1; + else if (leaf->utf8nfdi) size += strlen(leaf->utf8nfdi) + 1; return size; } @@ -1398,9 +1479,11 @@ static int nfdi_size(void *l) static int nfdicf_size(void *l) { struct unicode_data *leaf = l; - int size = 2; - if (leaf->utf8nfdicf) + + if (HANGUL_SYLLABLE(leaf->code)) + size += 1; + else if (leaf->utf8nfdicf) size += strlen(leaf->utf8nfdicf) + 1; else if (leaf->utf8nfdi) size += strlen(leaf->utf8nfdi) + 1; @@ -1427,7 +1510,11 @@ static unsigned char *nfdi_emit(void *l, unsigned char *data) unsigned char *s; *data++ = leaf->gen; - if (leaf->utf8nfdi) { + + if (HANGUL_SYLLABLE(leaf->code)) { + *data++ = DECOMPOSE; + *data++ = HANGUL; + } else if (leaf->utf8nfdi) { *data++ = DECOMPOSE; s = (unsigned char*)leaf->utf8nfdi; while ((*data++ = *s++) != 0) @@ -1444,7 +1531,11 @@ static unsigned char *nfdicf_emit(void *l, unsigned char *data) unsigned char *s; *data++ = leaf->gen; - if (leaf->utf8nfdicf) { + + if (HANGUL_SYLLABLE(leaf->code)) { + *data++ = DECOMPOSE; + *data++ = HANGUL; + } else if (leaf->utf8nfdicf) { *data++ = DECOMPOSE; s = (unsigned char*)leaf->utf8nfdicf; while ((*data++ = *s++) != 0) @@ -1467,6 +1558,11 @@ static void utf8_create(struct unicode_data *data) unsigned int *um; int i; + if (data->utf8nfdi) { + assert(data->utf8nfdi[0] == HANGUL); + return; + } + u = utf; um = data->utf32nfdi; if (um) { @@ -1652,6 +1748,7 @@ static void verify(struct tree *tree) utf8leaf_t *leaf; unsigned int unichar; char key[4]; + unsigned char hangul[UTF8HANGULLEAF]; int report; int nocf; @@ -1665,7 +1762,8 @@ static void verify(struct tree *tree) if (data->correction <= tree->maxage) data = &unicode_data[unichar]; utf8encode(key,unichar); - leaf = utf8lookup(tree, key); + leaf = utf8lookup(tree, hangul, key); + if (!leaf) { if (data->gen != -1) report++; @@ -1679,7 +1777,10 @@ static void verify(struct tree *tree) if (data->gen != LEAF_GEN(leaf)) report++; if (LEAF_CCC(leaf) == DECOMPOSE) { - if (nocf) { + if (HANGUL_SYLLABLE(data->code)) { + if (data->utf8nfdi[0] != HANGUL) + report++; + } else if (nocf) { if (!data->utf8nfdi) { report++; } else if (strcmp(data->utf8nfdi, @@ -2323,8 +2424,7 @@ static void corrections_init(void) * */ -static void -hangul_decompose(void) +static void hangul_decompose(void) { unsigned int sb = 0xAC00; unsigned int lb = 0x1100; @@ -2368,6 +2468,15 @@ hangul_decompose(void) memcpy(um, mapping, i * sizeof(unsigned int)); unicode_data[unichar].utf32nfdicf = um; + /* + * Add a cookie as a reminder that the hangul syllable + * decompositions must not be stored in the generated + * trie. + */ + unicode_data[unichar].utf8nfdi = malloc(2); + unicode_data[unichar].utf8nfdi[0] = HANGUL; + unicode_data[unichar].utf8nfdi[1] = '\0'; + if (verbose > 1) print_utf32nfdi(unichar); @@ -2493,6 +2602,99 @@ int utf8cursor(struct utf8cursor *, struct tree *, const char *); int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t); int utf8byte(struct utf8cursor *); +/* + * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0) + * + * AC00;;Lo;0;L;;;;;N;;;;; + * D7A3;;Lo;0;L;;;;;N;;;;; + * + * SBase = 0xAC00 + * LBase = 0x1100 + * VBase = 0x1161 + * TBase = 0x11A7 + * LCount = 19 + * VCount = 21 + * TCount = 28 + * NCount = 588 (VCount * TCount) + * SCount = 11172 (LCount * NCount) + * + * Decomposition: + * SIndex = s - SBase + * + * LV (Canonical/Full) + * LIndex = SIndex / NCount + * VIndex = (Sindex % NCount) / TCount + * LPart = LBase + LIndex + * VPart = VBase + VIndex + * + * LVT (Canonical) + * LVIndex = (SIndex / TCount) * TCount + * TIndex = (Sindex % TCount) + * LVPart = SBase + LVIndex + * TPart = TBase + TIndex + * + * LVT (Full) + * LIndex = SIndex / NCount + * VIndex = (Sindex % NCount) / TCount + * TIndex = (Sindex % TCount) + * LPart = LBase + LIndex + * VPart = VBase + VIndex + * if (TIndex == 0) { + * d = + * } else { + * TPart = TBase + TIndex + * d = + * } + */ + +/* Constants */ +#define SB (0xAC00) +#define LB (0x1100) +#define VB (0x1161) +#define TB (0x11A7) +#define LC (19) +#define VC (21) +#define TC (28) +#define NC (VC * TC) +#define SC (LC * NC) + +/* Algorithmic decomposition of hangul syllable. */ +static utf8leaf_t *utf8hangul(const char *str, unsigned char *hangul) +{ + unsigned int si; + unsigned int li; + unsigned int vi; + unsigned int ti; + unsigned char *h; + + /* Calculate the SI, LI, VI, and TI values. */ + si = utf8decode(str) - SB; + li = si / NC; + vi = (si % NC) / TC; + ti = si % TC; + + /* Fill in base of leaf. */ + h = hangul; + LEAF_GEN(h) = 2; + LEAF_CCC(h) = DECOMPOSE; + h += 2; + + /* Add LPart, a 3-byte UTF-8 sequence. */ + h += utf8encode((char *)h, li + LB); + + /* Add VPart, a 3-byte UTF-8 sequence. */ + h += utf8encode((char *)h, vi + VB); + + /* Add TPart if required, also a 3-byte UTF-8 sequence. */ + if (ti) + h += utf8encode((char *)h, ti + TB); + + /* Terminate string. */ + h[0] = '\0'; + + return hangul; +} + /* * Use trie to scan s, touching at most len bytes. * Returns the leaf if one exists, NULL otherwise. @@ -2501,7 +2703,8 @@ int utf8byte(struct utf8cursor *); * is well-formed and corresponds to a known unicode code point. The * shorthand for this will be "is valid UTF-8 unicode". */ -static utf8leaf_t *utf8nlookup(struct tree *tree, const char *s, size_t len) +static utf8leaf_t *utf8nlookup(struct tree *tree, unsigned char *hangul, + const char *s, size_t len) { utf8trie_t *trie = utf8data + tree->index; int offlen; @@ -2557,6 +2760,14 @@ static utf8leaf_t *utf8nlookup(struct tree *tree, const char *s, size_t len) } } } + /* + * Hangul decomposition is done algorithmically. These are the + * codepoints >= 0xAC00 and <= 0xD7A3. Their UTF-8 encoding is + * always 3 bytes long, so s has been advanced twice, and the + * start of the sequence is at s-2. + */ + if (LEAF_CCC(trie) == DECOMPOSE && LEAF_STR(trie)[0] == HANGUL) + trie = utf8hangul(s - 2, hangul); return trie; } @@ -2566,9 +2777,10 @@ static utf8leaf_t *utf8nlookup(struct tree *tree, const char *s, size_t len) * * Forwards to trie_nlookup(). */ -static utf8leaf_t *utf8lookup(struct tree *tree, const char *s) +static utf8leaf_t *utf8lookup(struct tree *tree, unsigned char *hangul, + const char *s) { - return utf8nlookup(tree, s, (size_t)-1); + return utf8nlookup(tree, hangul, s, (size_t)-1); } /* @@ -2592,11 +2804,14 @@ int utf8agemax(struct tree *tree, const char *s) utf8leaf_t *leaf; int age = 0; int leaf_age; + unsigned char hangul[UTF8HANGULLEAF]; if (!tree) return -1; + while (*s) { - if (!(leaf = utf8lookup(tree, s))) + leaf = utf8lookup(tree, hangul, s); + if (!leaf) return -1; leaf_age = ages[LEAF_GEN(leaf)]; if (leaf_age <= tree->maxage && leaf_age > age) @@ -2616,12 +2831,14 @@ int utf8agemin(struct tree *tree, const char *s) utf8leaf_t *leaf; int age; int leaf_age; + unsigned char hangul[UTF8HANGULLEAF]; if (!tree) return -1; age = tree->maxage; while (*s) { - if (!(leaf = utf8lookup(tree, s))) + leaf = utf8lookup(tree, hangul, s); + if (!leaf) return -1; leaf_age = ages[LEAF_GEN(leaf)]; if (leaf_age <= tree->maxage && leaf_age < age) @@ -2640,11 +2857,14 @@ int utf8nagemax(struct tree *tree, const char *s, size_t len) utf8leaf_t *leaf; int age = 0; int leaf_age; + unsigned char hangul[UTF8HANGULLEAF]; if (!tree) return -1; + while (len && *s) { - if (!(leaf = utf8nlookup(tree, s, len))) + leaf = utf8nlookup(tree, hangul, s, len); + if (!leaf) return -1; leaf_age = ages[LEAF_GEN(leaf)]; if (leaf_age <= tree->maxage && leaf_age > age) @@ -2664,12 +2884,14 @@ int utf8nagemin(struct tree *tree, const char *s, size_t len) utf8leaf_t *leaf; int leaf_age; int age; + unsigned char hangul[UTF8HANGULLEAF]; if (!tree) return -1; age = tree->maxage; while (len && *s) { - if (!(leaf = utf8nlookup(tree, s, len))) + leaf = utf8nlookup(tree, hangul, s, len); + if (!leaf) return -1; leaf_age = ages[LEAF_GEN(leaf)]; if (leaf_age <= tree->maxage && leaf_age < age) @@ -2690,11 +2912,13 @@ ssize_t utf8len(struct tree *tree, const char *s) { utf8leaf_t *leaf; size_t ret = 0; + unsigned char hangul[UTF8HANGULLEAF]; if (!tree) return -1; while (*s) { - if (!(leaf = utf8lookup(tree, s))) + leaf = utf8lookup(tree, hangul, s); + if (!leaf) return -1; if (ages[LEAF_GEN(leaf)] > tree->maxage) ret += utf8clen(s); @@ -2715,11 +2939,13 @@ ssize_t utf8nlen(struct tree *tree, const char *s, size_t len) { utf8leaf_t *leaf; size_t ret = 0; + unsigned char hangul[UTF8HANGULLEAF]; if (!tree) return -1; while (len && *s) { - if (!(leaf = utf8nlookup(tree, s, len))) + leaf = utf8nlookup(tree, hangul, s, len); + if (!leaf) return -1; if (ages[LEAF_GEN(leaf)] > tree->maxage) ret += utf8clen(s); @@ -2747,6 +2973,7 @@ struct utf8cursor { short int ccc; short int nccc; unsigned int unichar; + unsigned char hangul[UTF8HANGULLEAF]; }; /* @@ -2854,10 +3081,12 @@ int utf8byte(struct utf8cursor *u8c) } /* Look up the data for the current character. */ - if (u8c->p) - leaf = utf8lookup(u8c->tree, u8c->s); - else - leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len); + if (u8c->p) { + leaf = utf8lookup(u8c->tree, u8c->hangul, u8c->s); + } else { + leaf = utf8nlookup(u8c->tree, u8c->hangul, + u8c->s, u8c->len); + } /* No leaf found implies that the input is a binary blob. */ if (!leaf) @@ -2877,7 +3106,7 @@ int utf8byte(struct utf8cursor *u8c) ccc = STOPPER; goto ccc_mismatch; } - leaf = utf8lookup(u8c->tree, u8c->s); + leaf = utf8lookup(u8c->tree, u8c->hangul, u8c->s); ccc = LEAF_CCC(leaf); } u8c->unichar = utf8decode(u8c->s); From patchwork Mon Jan 28 21:32:17 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 1032248 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=collabora.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43pNC4126nz9sDX for ; Tue, 29 Jan 2019 08:32:48 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728175AbfA1Vcr (ORCPT ); Mon, 28 Jan 2019 16:32:47 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:58142 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727156AbfA1Vcr (ORCPT ); Mon, 28 Jan 2019 16:32:47 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 2635927FB7B From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, sfrench@samba.org, darrick.wong@oracle.com, samba-technical@lists.samba.org, jlayton@kernel.org, bfields@fieldses.org, paulus@samba.org, Gabriel Krisman Bertazi Subject: [PATCH RFC v5 05/11] unicode: Implement higher level API for string handling Date: Mon, 28 Jan 2019 16:32:17 -0500 Message-Id: <20190128213223.31512-6-krisman@collabora.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20190128213223.31512-1-krisman@collabora.com> References: <20190128213223.31512-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org From: Gabriel Krisman Bertazi This patch integrates the utf8n patches with some higher level API to perform UTF-8 string comparison, normalization and casefolding operations. Implemented is a variation of NFD, and casefold is performed by doing full casefold on top of NFD. These algorithms are based on the core implemented by Olaf Weber from SGI. Changes since v4: - integrate in fs/unicode Changes since RFC v1: - Change error return code from EIO to EINVAL. (Olaf Weber) - Fix issues with strncmp/strcmp. (Olaf Weber) - Remove stack buffer in normalization/casefold. (Olaf Weber) - Include length parameter for second string on comparison functions. - Change length type to size_t. Signed-off-by: Gabriel Krisman Bertazi --- fs/unicode/Makefile | 4 +- fs/unicode/utf8-core.c | 183 ++++++++++++++++++++++++++++++++++++++++ fs/unicode/utf8-norm.c | 6 ++ fs/unicode/utf8n.h | 1 + include/linux/unicode.h | 30 +++++++ 5 files changed, 223 insertions(+), 1 deletion(-) create mode 100644 fs/unicode/utf8-core.c create mode 100644 include/linux/unicode.h diff --git a/fs/unicode/Makefile b/fs/unicode/Makefile index 1ed10e40c30d..9a9836fcf38b 100644 --- a/fs/unicode/Makefile +++ b/fs/unicode/Makefile @@ -2,7 +2,9 @@ UNICODE_VERSION=11.0.0 -obj-$(CONFIG_UNICODE) += utf8-norm.o +obj-$(CONFIG_UNICODE) += unicode.o + +unicode-y := utf8-norm.o utf8-core.o $(obj)/utf8-norm.o: $(obj)/utf8data.h $(obj)/utf8data.h: $(srctree)/$(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE diff --git a/fs/unicode/utf8-core.c b/fs/unicode/utf8-core.c new file mode 100644 index 000000000000..39f4b06dded6 --- /dev/null +++ b/fs/unicode/utf8-core.c @@ -0,0 +1,183 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#include +#include +#include +#include +#include +#include +#include + +#include "utf8n.h" + +int utf8_validate(const struct unicode_map *um, const struct qstr *str) +{ + const struct utf8data *data = utf8nfdi(um->version); + + if (utf8nlen(data, str->name, str->len) < 0) + return -1; + return 0; +} +EXPORT_SYMBOL(utf8_validate); + +int utf8_strncmp(const struct unicode_map *um, + const struct qstr *s1, const struct qstr *s2) +{ + const struct utf8data *data = utf8nfdi(um->version); + struct utf8cursor cur1, cur2; + int c1, c2; + + if (utf8ncursor(&cur1, data, s1->name, s1->len) < 0) + return -EINVAL; + + if (utf8ncursor(&cur2, data, s2->name, s2->len) < 0) + return -EINVAL; + + do { + c1 = utf8byte(&cur1); + c2 = utf8byte(&cur2); + + if (c1 < 0 || c2 < 0) + return -EINVAL; + if (c1 != c2) + return 1; + } while (c1); + + return 0; +} +EXPORT_SYMBOL(utf8_strncmp); + +int utf8_strncasecmp(const struct unicode_map *um, + const struct qstr *s1, const struct qstr *s2) +{ + const struct utf8data *data = utf8nfdicf(um->version); + struct utf8cursor cur1, cur2; + int c1, c2; + + if (utf8ncursor(&cur1, data, s1->name, s1->len) < 0) + return -EINVAL; + + if (utf8ncursor(&cur2, data, s2->name, s2->len) < 0) + return -EINVAL; + + do { + c1 = utf8byte(&cur1); + c2 = utf8byte(&cur2); + + if (c1 < 0 || c2 < 0) + return -EINVAL; + if (c1 != c2) + return 1; + } while (c1); + + return 0; +} +EXPORT_SYMBOL(utf8_strncasecmp); + +int utf8_casefold(const struct unicode_map *um, const struct qstr *str, + unsigned char *dest, size_t dlen) +{ + const struct utf8data *data = utf8nfdicf(um->version); + struct utf8cursor cur; + size_t nlen = 0; + + if (utf8ncursor(&cur, data, str->name, str->len) < 0) + return -EINVAL; + + for (nlen = 0; nlen < dlen; nlen++) { + dest[nlen] = utf8byte(&cur); + if (!dest[nlen]) + return nlen; + if (dest[nlen] == -1) + break; + } + return -EINVAL; +} + +EXPORT_SYMBOL(utf8_casefold); + +int utf8_normalize(const struct unicode_map *um, const struct qstr *str, + unsigned char *dest, size_t dlen) +{ + const struct utf8data *data = utf8nfdi(um->version); + struct utf8cursor cur; + ssize_t nlen = 0; + + if (utf8ncursor(&cur, data, str->name, str->len) < 0) + return -EINVAL; + + for (nlen = 0; nlen < dlen; nlen++) { + dest[nlen] = utf8byte(&cur); + if (!dest[nlen]) + return nlen; + if (dest[nlen] == -1) + break; + } + return -EINVAL; +} + +EXPORT_SYMBOL(utf8_normalize); + +static int utf8_parse_version(const char *version, unsigned int *maj, + unsigned int *min, unsigned int *rev) +{ + substring_t args[3]; + char version_string[12]; + const struct match_token token[] = { + {1, "%d.%d.%d"}, + {0, NULL} + }; + + strncpy(version_string, version, sizeof(version_string)); + + if (match_token(version_string, token, args) != 1) + return -EINVAL; + + if (match_int(&args[0], maj) || match_int(&args[1], min) || + match_int(&args[2], rev)) + return -EINVAL; + + return 0; +} + +struct unicode_map *utf8_load(const char *version) +{ + struct unicode_map *um = NULL; + int unicode_version; + + if (version) { + unsigned int maj, min, rev; + + if (utf8_parse_version(version, &maj, &min, &rev) < 0) + return ERR_PTR(-EINVAL); + + if (!utf8version_is_supported(maj, min, rev)) + return ERR_PTR(-EINVAL); + + unicode_version = UNICODE_AGE(maj, min, rev); + } else { + unicode_version = utf8version_latest(); + printk(KERN_WARNING"UTF-8 version not specified. " + "Assuming latest supported version (%d.%d.%d).", + (unicode_version >> 16) & 0xff, + (unicode_version >> 8) & 0xff, + (unicode_version & 0xff)); + } + + um = kzalloc(sizeof(struct unicode_map), GFP_KERNEL); + if (!um) + return ERR_PTR(-ENOMEM); + + um->charset = "UTF-8"; + um->version = unicode_version; + + return um; +} +EXPORT_SYMBOL(utf8_load); + +void utf8_unload(struct unicode_map *um) +{ + kfree(um); +} +EXPORT_SYMBOL(utf8_unload); + +MODULE_LICENSE("GPL v2"); diff --git a/fs/unicode/utf8-norm.c b/fs/unicode/utf8-norm.c index 845c0f300370..94e066be3ea6 100644 --- a/fs/unicode/utf8-norm.c +++ b/fs/unicode/utf8-norm.c @@ -38,6 +38,12 @@ int utf8version_is_supported(u8 maj, u8 min, u8 rev) } EXPORT_SYMBOL(utf8version_is_supported); +int utf8version_latest() +{ + return utf8vers; +} +EXPORT_SYMBOL(utf8version_latest); + /* * UTF-8 valid ranges. * diff --git a/fs/unicode/utf8n.h b/fs/unicode/utf8n.h index b63a9091dc39..a120638014c1 100644 --- a/fs/unicode/utf8n.h +++ b/fs/unicode/utf8n.h @@ -32,6 +32,7 @@ /* Highest unicode version supported by the data tables. */ extern int utf8version_is_supported(u8 maj, u8 min, u8 rev); +extern int utf8version_latest(void); /* * Look for the correct const struct utf8data for a unicode version. diff --git a/include/linux/unicode.h b/include/linux/unicode.h new file mode 100644 index 000000000000..aec2c6d800aa --- /dev/null +++ b/include/linux/unicode.h @@ -0,0 +1,30 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_UNICODE_H +#define _LINUX_UNICODE_H + +#include +#include + +struct unicode_map { + const char *charset; + int version; +}; + +int utf8_validate(const struct unicode_map *um, const struct qstr *str); + +int utf8_strncmp(const struct unicode_map *um, + const struct qstr *s1, const struct qstr *s2); + +int utf8_strncasecmp(const struct unicode_map *um, + const struct qstr *s1, const struct qstr *s2); + +int utf8_normalize(const struct unicode_map *um, const struct qstr *str, + unsigned char *dest, size_t dlen); + +int utf8_casefold(const struct unicode_map *um, const struct qstr *str, + unsigned char *dest, size_t dlen); + +struct unicode_map *utf8_load(const char *version); +void utf8_unload(struct unicode_map *um); + +#endif /* _LINUX_UNICODE_H */ From patchwork Mon Jan 28 21:32:18 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 1032250 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=collabora.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43pNC80jbxz9sDr for ; Tue, 29 Jan 2019 08:32:52 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728197AbfA1Vcv (ORCPT ); Mon, 28 Jan 2019 16:32:51 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:58170 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727156AbfA1Vcv (ORCPT ); Mon, 28 Jan 2019 16:32:51 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id ADE7827FB7D From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, sfrench@samba.org, darrick.wong@oracle.com, samba-technical@lists.samba.org, jlayton@kernel.org, bfields@fieldses.org, paulus@samba.org, Gabriel Krisman Bertazi Subject: [PATCH RFC v5 06/11] unicode: Introduce test module for normalized utf8 implementation Date: Mon, 28 Jan 2019 16:32:18 -0500 Message-Id: <20190128213223.31512-7-krisman@collabora.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20190128213223.31512-1-krisman@collabora.com> References: <20190128213223.31512-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org From: Gabriel Krisman Bertazi This implements a in-kernel sanity test module for the utf8 normalization core. At probe time, it will run basic sequences through the utf8n core, to identify problems will equivalent sequences and normalization/casefold code. This is supposed to be useful for regression testing when adding support for a new version of utf8 to linux. Changes since v4: - integrate with fs/unicode Changes since RFC v1: - Include comparison tests for matching strings with different lengths. - Include tests for characters included in unicode 8.0.0, 9.0.0 and 10.0.0. Signed-off-by: Gabriel Krisman Bertazi --- fs/unicode/Kconfig | 5 + fs/unicode/Makefile | 1 + fs/unicode/utf8-selftest.c | 320 +++++++++++++++++++++++++++++++++++++ 3 files changed, 326 insertions(+) create mode 100644 fs/unicode/utf8-selftest.c diff --git a/fs/unicode/Kconfig b/fs/unicode/Kconfig index f41520f57dff..b560a879edf7 100644 --- a/fs/unicode/Kconfig +++ b/fs/unicode/Kconfig @@ -6,3 +6,8 @@ config UNICODE help Say Y here to enable UTF-8 NFD normalization and NFD+CF casefolding support. + +config UNICODE_NORMALIZATION_SELFTEST + tristate "Test UTF-8 normalization support" + depends on UNICODE + default n diff --git a/fs/unicode/Makefile b/fs/unicode/Makefile index 9a9836fcf38b..b1560c4107bf 100644 --- a/fs/unicode/Makefile +++ b/fs/unicode/Makefile @@ -3,6 +3,7 @@ UNICODE_VERSION=11.0.0 obj-$(CONFIG_UNICODE) += unicode.o +obj-$(CONFIG_UNICODE_NORMALIZATION_SELFTEST) += utf8-selftest.o unicode-y := utf8-norm.o utf8-core.o diff --git a/fs/unicode/utf8-selftest.c b/fs/unicode/utf8-selftest.c new file mode 100644 index 000000000000..492d934d5c1c --- /dev/null +++ b/fs/unicode/utf8-selftest.c @@ -0,0 +1,320 @@ +/* + * Kernel module for testing utf-8 support. + * + * Copyright 2017 Collabora Ltd. + * + * This software is licensed under the terms of the GNU General Public + * License version 2, as published by the Free Software Foundation, and + * may be copied, distributed, and modified under those terms. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include + +#include "utf8n.h" + +unsigned int failed_tests; +unsigned int total_tests; + +/* Tests will be based on this version. */ +#define latest_maj 11 +#define latest_min 0 +#define latest_rev 0 + +#define _test(cond, func, line, fmt, ...) do { \ + total_tests++; \ + if (!cond) { \ + failed_tests++; \ + pr_err("test %s:%d Failed: %s%s", \ + func, line, #cond, (fmt?":":".")); \ + if (fmt) \ + pr_err(fmt, ##__VA_ARGS__); \ + } \ + } while (0) +#define test_f(cond, fmt, ...) _test(cond, __func__, __LINE__, fmt, ##__VA_ARGS__) +#define test(cond) _test(cond, __func__, __LINE__, "") + +const static struct { + /* UTF-8 strings in this vector _must_ be NULL-terminated. */ + unsigned char str[10]; + unsigned char dec[10]; +} nfdi_test_data[] = { + /* Trivial sequence */ + { + /* "ABba" decomposes to itself */ + .str = "aBba", + .dec = "aBba", + }, + /* Simple equivalent sequences */ + { + /* 'VULGAR FRACTION ONE QUARTER' cannot decompose to + 'NUMBER 1' + 'FRACTION SLASH' + 'NUMBER 4' on + canonical decomposition */ + .str = {0xc2, 0xbc, 0x00}, + .dec = {0xc2, 0xbc, 0x00}, + }, + { + /* 'LATIN SMALL LETTER A WITH DIAERESIS' decomposes to + 'LETTER A' + 'COMBINING DIAERESIS' */ + .str = {0xc3, 0xa4, 0x00}, + .dec = {0x61, 0xcc, 0x88, 0x00}, + }, + { + /* 'LATIN SMALL LETTER LJ' can't decompose to + 'LETTER L' + 'LETTER J' on canonical decomposition */ + .str = {0xC7, 0x89, 0x00}, + .dec = {0xC7, 0x89, 0x00}, + }, + { + /* GREEK ANO TELEIA decomposes to MIDDLE DOT */ + .str = {0xCE, 0x87, 0x00}, + .dec = {0xC2, 0xB7, 0x00} + }, + /* Canonical ordering */ + { + /* A + 'COMBINING ACUTE ACCENT' + 'COMBINING OGONEK' decomposes + to A + 'COMBINING OGONEK' + 'COMBINING ACUTE ACCENT' */ + .str = {0x41, 0xcc, 0x81, 0xcc, 0xa8, 0x0}, + .dec = {0x41, 0xcc, 0xa8, 0xcc, 0x81, 0x0}, + }, + { + /* 'LATIN SMALL LETTER A WITH DIAERESIS' + 'COMBINING OGONEK' + decomposes to + 'LETTER A' + 'COMBINING OGONEK' + 'COMBINING DIAERESIS' */ + .str = {0xc3, 0xa4, 0xCC, 0xA8, 0x00}, + + .dec = {0x61, 0xCC, 0xA8, 0xcc, 0x88, 0x00}, + }, + +}; + +const static struct { + /* UTF-8 strings in this vector _must_ be NULL-terminated. */ + unsigned char str[30]; + unsigned char ncf[30]; +} nfdicf_test_data[] = { + /* Trivial sequences */ + { + /* "ABba" folds to lowercase */ + .str = {0x41, 0x42, 0x62, 0x61, 0x00}, + .ncf = {0x61, 0x62, 0x62, 0x61, 0x00}, + }, + { + /* All ASCII folds to lower-case */ + .str = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0.1", + .ncf = "abcdefghijklmnopqrstuvwxyz0.1", + }, + { + /* LATIN SMALL LETTER SHARP S folds to + LATIN SMALL LETTER S + LATIN SMALL LETTER S */ + .str = {0xc3, 0x9f, 0x00}, + .ncf = {0x73, 0x73, 0x00}, + }, + { + /* LATIN CAPITAL LETTER A WITH RING ABOVE folds to + LATIN SMALL LETTER A + COMBINING RING ABOVE */ + .str = {0xC3, 0x85, 0x00}, + .ncf = {0x61, 0xcc, 0x8a, 0x00}, + }, + /* Introduced by UTF-8.0.0. */ + /* Cherokee letters are interesting test-cases because they fold + to upper-case. Before 8.0.0, Cherokee lowercase were + undefined, thus, the folding from LC is not stable between + 7.0.0 -> 8.0.0, but it is from UC. */ + { + /* CHEROKEE SMALL LETTER A folds to CHEROKEE LETTER A */ + .str = {0xea, 0xad, 0xb0, 0x00}, + .ncf = {0xe1, 0x8e, 0xa0, 0x00}, + }, + { + /* CHEROKEE SMALL LETTER YE folds to CHEROKEE LETTER YE */ + .str = {0xe1, 0x8f, 0xb8, 0x00}, + .ncf = {0xe1, 0x8f, 0xb0, 0x00}, + }, + { + /* OLD HUNGARIAN CAPITAL LETTER AMB folds to + OLD HUNGARIAN SMALL LETTER AMB */ + .str = {0xf0, 0x90, 0xb2, 0x83, 0x00}, + .ncf = {0xf0, 0x90, 0xb3, 0x83, 0x00}, + }, + /* Introduced by UTF-9.0.0. */ + { + /* OSAGE CAPITAL LETTER CHA folds to + OSAGE SMALL LETTER CHA */ + .str = {0xf0, 0x90, 0x92, 0xb5, 0x00}, + .ncf = {0xf0, 0x90, 0x93, 0x9d, 0x00}, + }, + { + /* LATIN CAPITAL LETTER SMALL CAPITAL I folds to + LATIN LETTER SMALL CAPITAL I */ + .str = {0xea, 0x9e, 0xae, 0x00}, + .ncf = {0xc9, 0xaa, 0x00}, + }, + /* Introduced by UTF-11.0.0. */ + { + /* GEORGIAN SMALL LETTER AN folds to GEORGIAN MTAVRULI + CAPITAL LETTER AN */ + .str = {0xe1, 0xb2, 0x90, 0x00}, + .ncf = {0xe1, 0x83, 0x90, 0x00}, + } +}; + +static void check_utf8_nfdi(void) +{ + int i; + struct utf8cursor u8c; + const struct utf8data *data; + + data = utf8nfdi(UNICODE_AGE(latest_maj, latest_min, latest_rev)); + if (!data) { + pr_err("%s: Unable to load utf8-%d.%d.%d. Skipping.\n", + __func__, latest_maj, latest_min, latest_rev); + return; + } + + for (i = 0; i < ARRAY_SIZE(nfdi_test_data); i++) { + int len = strlen(nfdi_test_data[i].str); + int nlen = strlen(nfdi_test_data[i].dec); + int j = 0; + unsigned char c; + + test((utf8len(data, nfdi_test_data[i].str) == nlen)); + test((utf8nlen(data, nfdi_test_data[i].str, len) == nlen)); + + if (utf8cursor(&u8c, data, nfdi_test_data[i].str) < 0) + pr_err("can't create cursor\n"); + + while ((c = utf8byte(&u8c)) > 0) { + test_f((c == nfdi_test_data[i].dec[j]), + "Unexpected byte 0x%x should be 0x%x\n", + c, nfdi_test_data[i].dec[j]); + j++; + } + + test((j == nlen)); + } +} + +static void check_utf8_nfdicf(void) +{ + int i; + struct utf8cursor u8c; + const struct utf8data *data; + + data = utf8nfdicf(UNICODE_AGE(latest_maj, latest_min, latest_rev)); + if (!data) { + pr_err("%s: Unable to load utf8-%d.%d.%d. Skipping.\n", + __func__, latest_maj, latest_min, latest_rev); + return; + } + + for (i = 0; i < ARRAY_SIZE(nfdicf_test_data); i++) { + int len = strlen(nfdicf_test_data[i].str); + int nlen = strlen(nfdicf_test_data[i].ncf); + int j = 0; + unsigned char c; + + test((utf8len(data, nfdicf_test_data[i].str) == nlen)); + test((utf8nlen(data, nfdicf_test_data[i].str, len) == nlen)); + + if (utf8cursor(&u8c, data, nfdicf_test_data[i].str) < 0) + pr_err("can't create cursor\n"); + + while ((c = utf8byte(&u8c)) > 0) { + test_f((c == nfdicf_test_data[i].ncf[j]), + "Unexpected byte 0x%x should be 0x%x\n", + c, nfdicf_test_data[i].ncf[j]); + j++; + } + + test((j == nlen)); + } +} + +static void check_utf8_comparisons(void) +{ + int i; + struct unicode_map *table = utf8_load("11.0.0"); + + if (IS_ERR(table)) { + pr_err("%s: Unable to load utf8 %d.%d.%d. Skipping.\n", + __func__, latest_maj, latest_min, latest_rev); + return; + } + + for (i = 0; i < ARRAY_SIZE(nfdi_test_data); i++) { + const struct qstr s1 = {.name = nfdi_test_data[i].str, + .len = sizeof(nfdi_test_data[i].str)}; + const struct qstr s2 = {.name = nfdi_test_data[i].dec, + .len = sizeof(nfdi_test_data[i].dec)}; + + test_f(!utf8_strncmp(table, &s1, &s2), + "%s %s comparison mismatch\n", s1.name, s2.name); + } + + for (i = 0; i < ARRAY_SIZE(nfdicf_test_data); i++) { + const struct qstr s1 = {.name = nfdicf_test_data[i].str, + .len = sizeof(nfdicf_test_data[i].str)}; + const struct qstr s2 = {.name = nfdicf_test_data[i].ncf, + .len = sizeof(nfdicf_test_data[i].ncf)}; + + test_f(!utf8_strncasecmp(table, &s1, &s2), + "%s %s comparison mismatch\n", s1.name, s2.name); + } + + utf8_unload(table); +} + +static void check_supported_versions(void) +{ + /* Unicode 7.0.0 should be supported. */ + test(utf8version_is_supported(7, 0, 0)); + + /* Unicode 9.0.0 should be supported. */ + test(utf8version_is_supported(9, 0, 0)); + + /* Unicode 1x.0.0 (the latest version) should be supported. */ + test(utf8version_is_supported(latest_maj, latest_min, latest_rev)); + + /* Next versions don't exist. */ + test(!utf8version_is_supported(12, 0, 0)); + test(!utf8version_is_supported(0, 0, 0)); + test(!utf8version_is_supported(-1, -1, -1)); +} + +static int __init init_test_ucd(void) +{ + failed_tests = 0; + total_tests = 0; + + check_supported_versions(); + check_utf8_nfdi(); + check_utf8_nfdicf(); + check_utf8_comparisons(); + + if (!failed_tests) + pr_info("All %u tests passed\n", total_tests); + else + pr_err("%u out of %u tests failed\n", failed_tests, + total_tests); + return 0; +} + +static void __exit exit_test_ucd(void) +{ +} + +module_init(init_test_ucd); +module_exit(exit_test_ucd); + +MODULE_AUTHOR("Gabriel Krisman Bertazi "); +MODULE_LICENSE("GPL"); From patchwork Mon Jan 28 21:32:19 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 1032251 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=collabora.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43pNCB1QJlz9sCX for ; Tue, 29 Jan 2019 08:32:54 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728234AbfA1Vcx (ORCPT ); Mon, 28 Jan 2019 16:32:53 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:58192 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727156AbfA1Vcw (ORCPT ); Mon, 28 Jan 2019 16:32:52 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id A819527FB80 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, sfrench@samba.org, darrick.wong@oracle.com, samba-technical@lists.samba.org, jlayton@kernel.org, bfields@fieldses.org, paulus@samba.org, Gabriel Krisman Bertazi Subject: [PATCH RFC v5 07/11] MAINTAINERS: Add Unicode subsystem entry Date: Mon, 28 Jan 2019 16:32:19 -0500 Message-Id: <20190128213223.31512-8-krisman@collabora.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20190128213223.31512-1-krisman@collabora.com> References: <20190128213223.31512-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org Signed-off-by: Gabriel Krisman Bertazi --- MAINTAINERS | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 380e43f585d3..ab1f04436ea2 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15337,6 +15337,12 @@ F: drivers/uwb/ F: include/linux/uwb.h F: include/linux/uwb/ +UNICODE SUBSYSTEM: +M: Gabriel Krisman Bertazi +L: linux-fsdevel@vger.kernel.org +S: Supported +F: fs/unicode/ + UNICORE32 ARCHITECTURE: M: Guan Xuetao W: http://mprc.pku.edu.cn/~guanxuetao/linux From patchwork Mon Jan 28 21:32:20 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 1032252 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=collabora.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43pNCF075Tz9sDL for ; Tue, 29 Jan 2019 08:32:57 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728248AbfA1Vc4 (ORCPT ); Mon, 28 Jan 2019 16:32:56 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:58218 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727156AbfA1Vc4 (ORCPT ); Mon, 28 Jan 2019 16:32:56 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id B12C727FB81 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, sfrench@samba.org, darrick.wong@oracle.com, samba-technical@lists.samba.org, jlayton@kernel.org, bfields@fieldses.org, paulus@samba.org, Gabriel Krisman Bertazi Subject: [PATCH RFC v5 08/11] ext4: Include encoding information in the superblock Date: Mon, 28 Jan 2019 16:32:20 -0500 Message-Id: <20190128213223.31512-9-krisman@collabora.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20190128213223.31512-1-krisman@collabora.com> References: <20190128213223.31512-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org From: Gabriel Krisman Bertazi Support for encoding is considered an incompatible feature, since it has potential to create collisions of file names in existing filesystems. If the feature flag is not enabled, the entire filesystem will operate on opaque byte sequences, respecting the original behavior. The s_encoding field stores a magic number indicating the encoding format and version used globally by file and directory names in the filesystem. The s_encoding_flags defines policies for using the charset encoding, like how to handle invalid sequences. The magic number is mapped to the exact charset table, but the mapping is specific to ext4. Since we don't have any commitment to support old encodings, the only encoding I am supporting right now is utf8-11.0. The current implementation prevents the user from enabling encoding and per-directory encryption on the same filesystem at the same time. The incompatibility between these features lies in how we do efficient directory searches when we cannot be sure the encryption of the user provided fname will match the actual hash stored in the disk without decrypting every directory entry, because of normalization cases. My quickest solution is to simply block the concurrent use of these features for now, and enable it later, once we have a better solution. Changes since v4: - Move from fs/nls to fs/unicode Changes since v2: - Split superblock bitfield reservation into another patch. - Rename s_ioencoding -> s_encoding - Remove encoding_info from in-memory superblock. Changes since v1: - Guard code with CONFIG_NLS. - Use 16 bits for s_ioencoding. - Split mount option from this patch Signed-off-by: Gabriel Krisman Bertazi --- fs/ext4/ext4.h | 21 +++++++++++- fs/ext4/super.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 105 insertions(+), 1 deletion(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 3f89d0ab08fc..683fa2fe4b38 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1311,7 +1311,9 @@ struct ext4_super_block { __u8 s_first_error_time_hi; __u8 s_last_error_time_hi; __u8 s_pad[2]; - __le32 s_reserved[96]; /* Padding to the end of the block */ + __le16 s_encoding; /* Filename charset encoding */ + __le16 s_encoding_flags; /* Filename charset encoding flags */ + __le32 s_reserved[95]; /* Padding to the end of the block */ __le32 s_checksum; /* crc32c(superblock) */ }; @@ -1336,6 +1338,16 @@ struct ext4_super_block { /* Number of quota types we support */ #define EXT4_MAXQUOTAS 3 +#define EXT4_ENC_UTF8_11_0 1 + +/* + * Flags for ext4_sb_info.s_encoding_flags. + */ +#define EXT4_ENC_STRICT_MODE_FL (1 << 0) + +#define ext4_has_strict_mode(sbi) \ + (sbi->s_encoding_flags & EXT4_ENC_STRICT_MODE_FL) + /* * fourth extended-fs super-block data in memory */ @@ -1385,6 +1397,10 @@ struct ext4_sb_info { struct kobject s_kobj; struct completion s_kobj_unregister; struct super_block *s_sb; +#ifdef CONFIG_UNICODE + struct unicode_map *s_encoding; + __u16 s_encoding_flags; +#endif /* Journaling */ struct journal_s *s_journal; @@ -1661,6 +1677,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei) #define EXT4_FEATURE_INCOMPAT_LARGEDIR 0x4000 /* >2GB or 3-lvl htree */ #define EXT4_FEATURE_INCOMPAT_INLINE_DATA 0x8000 /* data in inode */ #define EXT4_FEATURE_INCOMPAT_ENCRYPT 0x10000 +#define EXT4_FEATURE_INCOMPAT_FNAME_ENCODING 0x20000 #define EXT4_FEATURE_COMPAT_FUNCS(name, flagname) \ static inline bool ext4_has_feature_##name(struct super_block *sb) \ @@ -1749,6 +1766,7 @@ EXT4_FEATURE_INCOMPAT_FUNCS(csum_seed, CSUM_SEED) EXT4_FEATURE_INCOMPAT_FUNCS(largedir, LARGEDIR) EXT4_FEATURE_INCOMPAT_FUNCS(inline_data, INLINE_DATA) EXT4_FEATURE_INCOMPAT_FUNCS(encrypt, ENCRYPT) +EXT4_FEATURE_INCOMPAT_FUNCS(fname_encoding, FNAME_ENCODING) #define EXT2_FEATURE_COMPAT_SUPP EXT4_FEATURE_COMPAT_EXT_ATTR #define EXT2_FEATURE_INCOMPAT_SUPP (EXT4_FEATURE_INCOMPAT_FILETYPE| \ @@ -1776,6 +1794,7 @@ EXT4_FEATURE_INCOMPAT_FUNCS(encrypt, ENCRYPT) EXT4_FEATURE_INCOMPAT_MMP | \ EXT4_FEATURE_INCOMPAT_INLINE_DATA | \ EXT4_FEATURE_INCOMPAT_ENCRYPT | \ + EXT4_FEATURE_INCOMPAT_FNAME_ENCODING | \ EXT4_FEATURE_INCOMPAT_CSUM_SEED | \ EXT4_FEATURE_INCOMPAT_LARGEDIR) #define EXT4_FEATURE_RO_COMPAT_SUPP (EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER| \ diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 53ff6c2a26ed..bcc2bc192403 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -42,6 +42,7 @@ #include #include #include +#include #include #include @@ -1022,6 +1023,9 @@ static void ext4_put_super(struct super_block *sb) crypto_free_shash(sbi->s_chksum_driver); kfree(sbi->s_blockgroup_lock); fs_put_dax(sbi->s_daxdev); +#ifdef CONFIG_UNICODE + utf8_unload(sbi->s_encoding); +#endif kfree(sbi); } @@ -1716,6 +1720,36 @@ static const struct mount_opts { {Opt_err, 0, 0} }; +#ifdef CONFIG_UNICODE +static const struct ext4_sb_encodings { + __u16 magic; + char *name; + char *version; +} ext4_sb_encoding_map[] = { + {EXT4_ENC_UTF8_11_0, "utf8", "11.0.0"}, +}; + +static int ext4_sb_read_encoding(const struct ext4_super_block *es, + const struct ext4_sb_encodings **encoding, + __u16 *flags) +{ + __u16 magic = le16_to_cpu(es->s_encoding); + int i; + + for (i = 0; i < ARRAY_SIZE(ext4_sb_encoding_map); i++) + if (magic == ext4_sb_encoding_map[i].magic) + break; + + if (i >= ARRAY_SIZE(ext4_sb_encoding_map)) + return -EINVAL; + + *encoding = &ext4_sb_encoding_map[i]; + *flags = le16_to_cpu(es->s_encoding_flags); + + return 0; +} +#endif + static int handle_mount_opt(struct super_block *sb, char *opt, int token, substring_t *args, unsigned long *journal_devnum, unsigned int *journal_ioprio, int is_remount) @@ -2847,6 +2881,15 @@ static int ext4_feature_set_ok(struct super_block *sb, int readonly) return 0; } +#ifndef CONFIG_UNICODE + if (ext4_has_feature_fname_encoding(sb)) { + ext4_msg(sb, KERN_ERR, + "Filesystem with fname_encoding feature cannot be " + "mounted without CONFIG_UNICODE"); + return 0; + } +#endif + if (readonly) return 1; @@ -3706,6 +3749,43 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) &journal_ioprio, 0)) goto failed_mount; +#ifdef CONFIG_UNICODE + if (ext4_has_feature_fname_encoding(sb) && !sbi->s_encoding) { + const struct ext4_sb_encodings *encoding_info; + struct unicode_map *encoding; + __u16 encoding_flags; + + if (ext4_has_feature_encrypt(sb)) { + ext4_msg(sb, KERN_ERR, + "Can't mount with encoding and encryption"); + goto failed_mount; + } + + if (ext4_sb_read_encoding(es, &encoding_info, + &encoding_flags)) { + ext4_msg(sb, KERN_ERR, + "Encoding requested by superblock is unknown"); + goto failed_mount; + } + + encoding = utf8_load(encoding_info->version); + if (IS_ERR(encoding)) { + ext4_msg(sb, KERN_ERR, + "can't mount with superblock charset: %s-%s " + "not supported by the kernel. flags: 0x%x.", + encoding_info->name, encoding_info->version, + encoding_flags); + goto failed_mount; + } + ext4_msg(sb, KERN_INFO,"Using encoding defined by superblock: " + "%s-%s with flags 0x%hx", encoding_info->name, + encoding_info->version?:"\b", encoding_flags); + + sbi->s_encoding = encoding; + sbi->s_encoding_flags = encoding_flags; + } +#endif + if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) { printk_once(KERN_WARNING "EXT4-fs: Warning: mounting " "with data=journal disables delayed " @@ -4547,6 +4627,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) failed_mount: if (sbi->s_chksum_driver) crypto_free_shash(sbi->s_chksum_driver); + +#ifdef CONFIG_UNICODE + utf8_unload(sbi->s_encoding); +#endif + #ifdef CONFIG_QUOTA for (i = 0; i < EXT4_MAXQUOTAS; i++) kfree(sbi->s_qf_names[i]); From patchwork Mon Jan 28 21:32:21 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 1032253 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=collabora.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43pNCJ6GFCz9sDL for ; Tue, 29 Jan 2019 08:33:00 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728257AbfA1Vc7 (ORCPT ); Mon, 28 Jan 2019 16:32:59 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:58238 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727156AbfA1Vc7 (ORCPT ); Mon, 28 Jan 2019 16:32:59 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id D925827FB83 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, sfrench@samba.org, darrick.wong@oracle.com, samba-technical@lists.samba.org, jlayton@kernel.org, bfields@fieldses.org, paulus@samba.org, Gabriel Krisman Bertazi Subject: [PATCH RFC v5 09/11] ext4: Support encoding-aware file name lookups Date: Mon, 28 Jan 2019 16:32:21 -0500 Message-Id: <20190128213223.31512-10-krisman@collabora.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20190128213223.31512-1-krisman@collabora.com> References: <20190128213223.31512-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org From: Gabriel Krisman Bertazi This patch implements the actual support for encoding-aware file name lookups in ext4, based on the feature bit and the encoding stored in the superblock. A filesystem that has the fname_encoding feature set is able to find files even if the name used by userspace is not a byte per byte match, but is an equivalent unicode string. This operation will be called and inexact-match name search. * dcache handling: Ext4 only stores the first equivalent name dentry used in the dcache. This is done to prevent unintentional duplication of dentries in the dcache, while also allowing the VFS code to quickly find the right entry in the cache despite what equivalent string was used without resorting to ->lookup(). d_hash() is implemented as the hash of the normalized string, such that we always have a well-known bucket for all the equivalencies of the same string. d_compare uses the nls_strncmp() infrastructure, which should handle the comparison of equivalent names as well. If the filesystem's normalization type is PLAIN, though, we can just reuse the VFS hash. For now, negative lookups are not inserted in the dcache, since they would need to be invalidated anyway, because we can't trust missing file dentries. This is bad for performance but requires some leveraging of the vfs layer to fix. We can live without that for now, and so does everyone else. * on-disk data: Despite using a specific version of the name as the internal representation within the dcache, the name stored and fetch from the disk is a byte-per-byte match with what the user requested, making this implementation Name-preserving. i.e. no actual information is lost when writing to the storage. DX is supported by modifying the hashes to make them encoding-aware. The new disk hashes are also calculated as the hash of the normalized string, instead of the string directly. This allows us to efficiently search for file names in the htree without requiring the user to provide the exact name. * Dealing with invalid sequences: By default, when a invalid UTF-8 sequence is identified, ext4 will treat it as an opaque byte sequence, ignoring the encoding and reverting to the old behavior for that unique file. This means that inexact-match lookups and (future) casefold will not work for that file only. An optional bit can be set in the superblock telling the filesystem code and userspace tools to enforce the encoding. When that flag is set, any attempt to create a file name using an invalid UTF-8 sequence will fail. * Normalization algorithm: The UTF-8 algorithm used to lookup and compare strings in ext4 is implemented by fs/unicode, and is based on a previous developed by SGI. It implements the Canonical decomposition (NFD) algorithm described by the Unicode specification 11.0, or higher, combined with the elimination of ignorable code points (NFDi) as documented in fs/unicode/utf8_norm.c. NFD seems to be the best normalization method because: - It has a lower cost than NFC/NFKC (which requires decomposing to NFD as an intermediary step) - It doesn't eliminate important semantic meaning like NFKD. But: - It ignores valid equivalent sequences in some languages, which would be invalid on anothers: the ff ligature, in o_ff_i_c_e, and the usual spelling o_f_f_i_c_e, for instance, will no longer match. On the other hand, the ash grapheme (formed by a ligature of AE ) can be represented as a different letter than A plus E, which is correct in Danish, Norwegian, Icelandic, and Faroese. - This implementation is not linguistic accurate, because different languages have conflicting rules, which would require the specialization of the filesystem to a given locale, which brings all sorts of problems for removable media and for users who use more than one language. Changes since v4: - Migrate NFKD -> NFD Changes since v2: - Don't use d_add_ci. - Squash the dcache hooks into this patch. - Rename sbi->encoding -> sbi->s_encoding. Changes since v1: - Support normalized htree hashes. - Guard code with CONFIG_NLS. - Use qstr->len instead of strlen in dcache hookups. Signed-off-by: Gabriel Krisman Bertazi --- fs/ext4/dir.c | 40 +++++++++++++++++++ fs/ext4/ext4.h | 12 +++++- fs/ext4/hash.c | 35 ++++++++++++++++- fs/ext4/ialloc.c | 2 +- fs/ext4/inline.c | 2 +- fs/ext4/namei.c | 100 +++++++++++++++++++++++++++++++++++++++++------ fs/ext4/super.c | 6 +++ 7 files changed, 181 insertions(+), 16 deletions(-) diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c index f93f9881ec18..c8d2669e17dc 100644 --- a/fs/ext4/dir.c +++ b/fs/ext4/dir.c @@ -26,6 +26,7 @@ #include #include #include +#include #include "ext4.h" #include "xattr.h" @@ -662,3 +663,42 @@ const struct file_operations ext4_dir_operations = { .open = ext4_dir_open, .release = ext4_release_dir, }; + +#ifdef CONFIG_UNICODE +static int ext4_d_compare(const struct dentry *dentry, unsigned int len, + const char *str, const struct qstr *name) +{ + struct qstr qstr = {.name = str, .len = len }; + + return ext4_encoding_cmp(dentry->d_parent->d_inode, name, &qstr); +} + +static int ext4_d_hash(const struct dentry *dentry, struct qstr *str) +{ + const struct ext4_sb_info *sbi = EXT4_SB(dentry->d_sb); + const struct unicode_map *um = sbi->s_encoding; + unsigned char *norm; + int len, ret = 0; + + norm = kmalloc(PATH_MAX, GFP_ATOMIC); + if (!norm) + return -ENOMEM; + + len = utf8_normalize(um, str, norm, PATH_MAX); + + if (len < 0) { + if (ext4_has_strict_mode(sbi)) + ret = -EINVAL; + goto out; + } + str->hash = full_name_hash(dentry, norm, len); +out: + kfree(norm); + return ret; +} + +const struct dentry_operations ext4_dentry_ops = { + .d_hash = ext4_d_hash, + .d_compare = ext4_d_compare, +}; +#endif diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 683fa2fe4b38..009fab63c8e1 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2394,8 +2394,8 @@ extern int ext4_check_all_de(struct inode *dir, struct buffer_head *bh, extern int ext4_sync_file(struct file *, loff_t, loff_t, int); /* hash.c */ -extern int ext4fs_dirhash(const char *name, int len, struct - dx_hash_info *hinfo); +extern int ext4fs_dirhash(const struct inode *dir, const char *name, int len, + struct dx_hash_info *hinfo); /* ialloc.c */ extern struct inode *__ext4_new_inode(handle_t *, struct inode *, umode_t, @@ -2979,6 +2979,10 @@ static inline void ext4_unlock_group(struct super_block *sb, /* dir.c */ extern const struct file_operations ext4_dir_operations; +#ifdef CONFIG_UNICODE +extern const struct dentry_operations ext4_dentry_ops; +#endif + /* file.c */ extern const struct inode_operations ext4_file_inode_operations; extern const struct file_operations ext4_file_operations; @@ -3071,6 +3075,10 @@ extern void initialize_dirent_tail(struct ext4_dir_entry_tail *t, extern int ext4_handle_dirty_dirent_node(handle_t *handle, struct inode *inode, struct buffer_head *bh); +extern int ext4_encoding_cmp(const struct inode *parent, + const struct qstr *name, + const struct qstr *entry); + #define S_SHIFT 12 static const unsigned char ext4_type_by_mode[(S_IFMT >> S_SHIFT) + 1] = { [S_IFREG >> S_SHIFT] = EXT4_FT_REG_FILE, diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c index e22dcfab308b..12a7b0eda132 100644 --- a/fs/ext4/hash.c +++ b/fs/ext4/hash.c @@ -6,6 +6,7 @@ */ #include +#include #include #include #include "ext4.h" @@ -196,7 +197,8 @@ static void str2hashbuf_unsigned(const char *msg, int len, __u32 *buf, int num) * represented, and whether or not the returned hash is 32 bits or 64 * bits. 32 bit hashes will return 0 for the minor hash. */ -int ext4fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo) +static int __ext4fs_dirhash(const char *name, int len, + struct dx_hash_info *hinfo) { __u32 hash; __u32 minor_hash = 0; @@ -266,3 +268,34 @@ int ext4fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo) hinfo->minor_hash = minor_hash; return 0; } + +int ext4fs_dirhash(const struct inode *dir, const char *name, int len, + struct dx_hash_info *hinfo) +{ +#ifdef CONFIG_UNICODE + const struct unicode_map *um = EXT4_SB(dir->i_sb)->s_encoding; + int r, dlen; + unsigned char *buff; + struct qstr qstr = {.name = name, .len = len }; + + if (len && um) { + buff = kzalloc(sizeof(char) * PATH_MAX, GFP_KERNEL); + if (!buff) + return -1; + + dlen = utf8_normalize(um, &qstr, buff, PATH_MAX); + + if (dlen < 0) { + kfree(buff); + goto opaque_seq; + } + + r = __ext4fs_dirhash(buff, dlen, hinfo); + + kfree(buff); + return r; + } +opaque_seq: +#endif + return __ext4fs_dirhash(name, len, hinfo); +} diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index 014f6a698cb7..1ef355549abf 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -455,7 +455,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent, if (qstr) { hinfo.hash_version = DX_HASH_HALF_MD4; hinfo.seed = sbi->s_hash_seed; - ext4fs_dirhash(qstr->name, qstr->len, &hinfo); + ext4fs_dirhash(parent, qstr->name, qstr->len, &hinfo); grp = hinfo.hash; } else grp = prandom_u32(); diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c index 9c4bac18cc6c..10b9d3dcec4e 100644 --- a/fs/ext4/inline.c +++ b/fs/ext4/inline.c @@ -1404,7 +1404,7 @@ int htree_inlinedir_to_tree(struct file *dir_file, } } - ext4fs_dirhash(de->name, de->name_len, hinfo); + ext4fs_dirhash(dir, de->name, de->name_len, hinfo); if ((hinfo->hash < start_hash) || ((hinfo->hash == start_hash) && (hinfo->minor_hash < start_minor_hash))) diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 437f71fe83ae..99d936a1c58e 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -35,6 +35,7 @@ #include #include #include +#include #include "ext4.h" #include "ext4_jbd2.h" @@ -629,7 +630,7 @@ static struct stats dx_show_leaf(struct inode *dir, } if (!fscrypt_has_encryption_key(dir)) { /* Directory is not encrypted */ - ext4fs_dirhash(de->name, + ext4fs_dirhash(dir, de->name, de->name_len, &h); printk("%*.s:(U)%x.%u ", len, name, h.hash, @@ -662,8 +663,8 @@ static struct stats dx_show_leaf(struct inode *dir, name = fname_crypto_str.name; len = fname_crypto_str.len; } - ext4fs_dirhash(de->name, de->name_len, - &h); + ext4fs_dirhash(dir, de->name, + de->name_len, &h); printk("%*.s:(E)%x.%u ", len, name, h.hash, (unsigned) ((char *) de - base)); @@ -673,7 +674,7 @@ static struct stats dx_show_leaf(struct inode *dir, #else int len = de->name_len; char *name = de->name; - ext4fs_dirhash(de->name, de->name_len, &h); + ext4fs_dirhash(dir, de->name, de->name_len, &h); printk("%*.s:%x.%u ", len, name, h.hash, (unsigned) ((char *) de - base)); #endif @@ -762,7 +763,7 @@ dx_probe(struct ext4_filename *fname, struct inode *dir, hinfo->hash_version += EXT4_SB(dir->i_sb)->s_hash_unsigned; hinfo->seed = EXT4_SB(dir->i_sb)->s_hash_seed; if (fname && fname_name(fname)) - ext4fs_dirhash(fname_name(fname), fname_len(fname), hinfo); + ext4fs_dirhash(dir, fname_name(fname), fname_len(fname), hinfo); hash = hinfo->hash; if (root->info.unused_flags & 1) { @@ -1008,7 +1009,7 @@ static int htree_dirblock_to_tree(struct file *dir_file, /* silently ignore the rest of the block */ break; } - ext4fs_dirhash(de->name, de->name_len, hinfo); + ext4fs_dirhash(dir, de->name, de->name_len, hinfo); if ((hinfo->hash < start_hash) || ((hinfo->hash == start_hash) && (hinfo->minor_hash < start_minor_hash))) @@ -1197,7 +1198,7 @@ static int dx_make_map(struct inode *dir, struct ext4_dir_entry_2 *de, while ((char *) de < base + blocksize) { if (de->name_len && de->inode) { - ext4fs_dirhash(de->name, de->name_len, &h); + ext4fs_dirhash(dir, de->name, de->name_len, &h); map_tail--; map_tail->hash = h.hash; map_tail->offs = ((char *) de - base)>>2; @@ -1252,15 +1253,45 @@ static void dx_insert_block(struct dx_frame *frame, u32 hash, ext4_lblk_t block) dx_set_count(entries, count + 1); } +#ifdef CONFIG_UNICODE +int ext4_encoding_cmp(const struct inode *parent, const struct qstr *name, + const struct qstr *entry) +{ + const struct ext4_sb_info *sbi = EXT4_SB(parent->i_sb); + const struct unicode_map *um = sbi->s_encoding; + int ret; + + ret = utf8_strncmp(um, name, entry); + if (ret < 0) { + /* Handle invalid character sequence as either an error + * or as an opaque byte sequence. + */ + if (ext4_has_strict_mode(sbi)) + return -EINVAL; + + if (name->len != entry->len) + return 1; + + return !!memcmp(name->name, entry->name, name->len); + } + + return ret; +} +#endif + /* * Test whether a directory entry matches the filename being searched for. * * Return: %true if the directory entry matches, otherwise %false. */ -static inline bool ext4_match(const struct ext4_filename *fname, +static inline bool ext4_match(const struct inode *parent, + const struct ext4_filename *fname, const struct ext4_dir_entry_2 *de) { struct fscrypt_name f; +#ifdef CONFIG_UNICODE + const struct qstr entry = {.name = de->name, .len = de->name_len}; +#endif if (!de->inode) return false; @@ -1270,6 +1301,12 @@ static inline bool ext4_match(const struct ext4_filename *fname, #ifdef CONFIG_EXT4_FS_ENCRYPTION f.crypto_buf = fname->crypto_buf; #endif + +#ifdef CONFIG_UNICODE + if (EXT4_SB(parent->i_sb)->s_encoding) + return !ext4_encoding_cmp(parent, fname->usr_fname, &entry); +#endif + return fscrypt_match_name(&f, de->name, de->name_len); } @@ -1290,7 +1327,7 @@ int ext4_search_dir(struct buffer_head *bh, char *search_buf, int buf_size, /* this code is executed quadratically often */ /* do minimal checking `by hand' */ if ((char *) de + de->name_len <= dlimit && - ext4_match(fname, de)) { + ext4_match(dir, fname, de)) { /* found a match - just to be sure, do * a full check */ if (ext4_check_dir_entry(dir, NULL, de, bh, bh->b_data, @@ -1588,6 +1625,17 @@ static struct dentry *ext4_lookup(struct inode *dir, struct dentry *dentry, unsi return ERR_PTR(-EPERM); } } + +#ifdef CONFIG_UNICODE + if (EXT4_SB(dir->i_sb)->s_encoding && !inode) { + /* Eventually we want to call d_add_ci(dentry, NULL) + * for negative dentries in the encoding case as + * well. For now, prevent the negative dentry + * from being cached. + */ + return NULL; + } +#endif return d_splice_alias(inode, dentry); } @@ -1798,7 +1846,7 @@ int ext4_find_dest_de(struct inode *dir, struct inode *inode, if (ext4_check_dir_entry(dir, NULL, de, bh, buf, buf_size, offset)) return -EFSCORRUPTED; - if (ext4_match(fname, de)) + if (ext4_match(dir, fname, de)) return -EEXIST; nlen = EXT4_DIR_REC_LEN(de->name_len); rlen = ext4_rec_len_from_disk(de->rec_len, buf_size); @@ -1983,7 +2031,7 @@ static int make_indexed_dir(handle_t *handle, struct ext4_filename *fname, if (fname->hinfo.hash_version <= DX_HASH_TEA) fname->hinfo.hash_version += EXT4_SB(dir->i_sb)->s_hash_unsigned; fname->hinfo.seed = EXT4_SB(dir->i_sb)->s_hash_seed; - ext4fs_dirhash(fname_name(fname), fname_len(fname), &fname->hinfo); + ext4fs_dirhash(dir, fname_name(fname), fname_len(fname), &fname->hinfo); memset(frames, 0, sizeof(frames)); frame = frames; @@ -2036,6 +2084,7 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry, struct ext4_dir_entry_2 *de; struct ext4_dir_entry_tail *t; struct super_block *sb; + struct ext4_sb_info *sbi; struct ext4_filename fname; int retval; int dx_fallback=0; @@ -2047,10 +2096,17 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry, csum_size = sizeof(struct ext4_dir_entry_tail); sb = dir->i_sb; + sbi = EXT4_SB(sb); blocksize = sb->s_blocksize; if (!dentry->d_name.len) return -EINVAL; +#ifdef CONFIG_UNICODE + if (sbi->s_encoding_flags & EXT4_ENC_STRICT_MODE_FL && + utf8_validate(sbi->s_encoding, &dentry->d_name)) + return -EINVAL; +#endif + retval = ext4_fname_setup_filename(dir, &dentry->d_name, 0, &fname); if (retval) return retval; @@ -2975,6 +3031,17 @@ static int ext4_rmdir(struct inode *dir, struct dentry *dentry) ext4_update_dx_flag(dir); ext4_mark_inode_dirty(handle, dir); +#ifdef CONFIG_UNICODE + /* VFS negative dentries are incompatible with Encoding and + * Case-insensitiveness. Eventually we'll want avoid + * invalidating the dentries here, alongside with returning the + * negative dentries at ext4_lookup(), when it is better + * supported by the VFS for the CI case. + */ + if (EXT4_SB(dir->i_sb)->s_encoding) + d_invalidate(dentry); +#endif + end_rmdir: brelse(bh); if (handle) @@ -3044,6 +3111,17 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry) inode->i_ctime = current_time(inode); ext4_mark_inode_dirty(handle, inode); +#ifdef CONFIG_UNICODE + /* VFS negative dentries are incompatible with Encoding and + * Case-insensitiveness. Eventually we'll want avoid + * invalidating the dentries here, alongside with returning the + * negative dentries at ext4_lookup(), when it is better + * supported by the VFS for the CI case. + */ + if (EXT4_SB(dir->i_sb)->s_encoding) + d_invalidate(dentry); +#endif + end_unlink: brelse(bh); if (handle) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index bcc2bc192403..408099b9a15a 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4420,6 +4420,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) iput(root); goto failed_mount4; } + +#ifdef CONFIG_UNICODE + if (sbi->s_encoding) + sb->s_d_op = &ext4_dentry_ops; +#endif + sb->s_root = d_make_root(root); if (!sb->s_root) { ext4_msg(sb, KERN_ERR, "get root dentry failed"); From patchwork Mon Jan 28 21:32:22 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 1032254 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=collabora.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43pNCM4s4Rz9sCX for ; Tue, 29 Jan 2019 08:33:03 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728286AbfA1VdC (ORCPT ); Mon, 28 Jan 2019 16:33:02 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:58274 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727156AbfA1VdC (ORCPT ); Mon, 28 Jan 2019 16:33:02 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 54E0527FB86 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, sfrench@samba.org, darrick.wong@oracle.com, samba-technical@lists.samba.org, jlayton@kernel.org, bfields@fieldses.org, paulus@samba.org, Gabriel Krisman Bertazi Subject: [PATCH RFC v5 10/11] ext4: Implement EXT4_CASEFOLD_FL flag Date: Mon, 28 Jan 2019 16:32:22 -0500 Message-Id: <20190128213223.31512-11-krisman@collabora.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20190128213223.31512-1-krisman@collabora.com> References: <20190128213223.31512-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org From: Gabriel Krisman Bertazi Casefold is implemented as a special case of the fname_encoding feature, thus requiring it to be enabled. This design mechanism works semantically fine, because we consider that without an specified encoding the casefold operation cannot be defined. Despite the common core implementation, the Casefold setting can be applied to directories individually, making entire subtrees of the filesystem case-insensitive or not. The flag is set as an inode attribute applied to directories and inherited by its children which states that the directory requires case-insensitive searches. This flag can only be enabled on empty directories for filesystems that support the encoding feature, thus preventing collision of file names that only differ by case. Changes since v2: - Rename sbi->encoding -> sbi->s_encoding. Changes since v1: - Moved the CASEFOLD_FL to prevent collision with reserved verity flag. Signed-off-by: Gabriel Krisman Bertazi --- fs/ext4/dir.c | 5 ++++- fs/ext4/ext4.h | 9 +++++---- fs/ext4/hash.c | 5 ++++- fs/ext4/inode.c | 4 +++- fs/ext4/ioctl.c | 18 ++++++++++++++++++ fs/ext4/namei.c | 6 +++++- include/linux/fs.h | 2 ++ 7 files changed, 41 insertions(+), 8 deletions(-) diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c index c8d2669e17dc..8a9343811d48 100644 --- a/fs/ext4/dir.c +++ b/fs/ext4/dir.c @@ -684,7 +684,10 @@ static int ext4_d_hash(const struct dentry *dentry, struct qstr *str) if (!norm) return -ENOMEM; - len = utf8_normalize(um, str, norm, PATH_MAX); + if (!IS_CASEFOLDED(dentry->d_inode)) + len = utf8_normalize(um, str, norm, PATH_MAX); + else + len = utf8_casefold(um, str, norm, PATH_MAX); if (len < 0) { if (ext4_has_strict_mode(sbi)) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 009fab63c8e1..44b71e15e40b 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -400,10 +400,11 @@ struct flex_groups { #define EXT4_EOFBLOCKS_FL 0x00400000 /* Blocks allocated beyond EOF */ #define EXT4_INLINE_DATA_FL 0x10000000 /* Inode has inline data. */ #define EXT4_PROJINHERIT_FL 0x20000000 /* Create with parents projid */ +#define EXT4_CASEFOLD_FL 0x40000000 /* Casefolded file */ #define EXT4_RESERVED_FL 0x80000000 /* reserved for ext4 lib */ -#define EXT4_FL_USER_VISIBLE 0x304BDFFF /* User visible flags */ -#define EXT4_FL_USER_MODIFIABLE 0x204BC0FF /* User modifiable flags */ +#define EXT4_FL_USER_VISIBLE 0x704BDFFF /* User visible flags */ +#define EXT4_FL_USER_MODIFIABLE 0x604BC0FF /* User modifiable flags */ /* Flags we can manipulate with through EXT4_IOC_FSSETXATTR */ #define EXT4_FL_XFLAG_VISIBLE (EXT4_SYNC_FL | \ @@ -418,10 +419,10 @@ struct flex_groups { EXT4_SYNC_FL | EXT4_NODUMP_FL | EXT4_NOATIME_FL |\ EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\ EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL |\ - EXT4_PROJINHERIT_FL) + EXT4_PROJINHERIT_FL | EXT4_CASEFOLD_FL) /* Flags that are appropriate for regular files (all but dir-specific ones). */ -#define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL)) +#define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL | EXT4_CASEFOLD_FL)) /* Flags that are appropriate for non-directories/regular files. */ #define EXT4_OTHER_FLMASK (EXT4_NODUMP_FL | EXT4_NOATIME_FL) diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c index 12a7b0eda132..e106c4727961 100644 --- a/fs/ext4/hash.c +++ b/fs/ext4/hash.c @@ -283,7 +283,10 @@ int ext4fs_dirhash(const struct inode *dir, const char *name, int len, if (!buff) return -1; - dlen = utf8_normalize(um, &qstr, buff, PATH_MAX); + if (!IS_CASEFOLDED(dir)) + dlen = utf8_normalize(um, &qstr, buff, PATH_MAX); + else + dlen = utf8_casefold(um, &qstr, buff, PATH_MAX); if (dlen < 0) { kfree(buff); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 22a9d8159720..9908d7d98b6e 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4745,9 +4745,11 @@ void ext4_set_inode_flags(struct inode *inode) new_fl |= S_DAX; if (flags & EXT4_ENCRYPT_FL) new_fl |= S_ENCRYPTED; + if (flags & EXT4_CASEFOLD_FL) + new_fl |= S_CASEFOLD; inode_set_flags(inode, new_fl, S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX| - S_ENCRYPTED); + S_ENCRYPTED|S_CASEFOLD); } static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode, diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c index 0edee31913d1..ef4ffe681836 100644 --- a/fs/ext4/ioctl.c +++ b/fs/ext4/ioctl.c @@ -231,6 +231,7 @@ static int ext4_ioctl_setflags(struct inode *inode, struct ext4_iloc iloc; unsigned int oldflags, mask, i; unsigned int jflag; + struct super_block *sb = inode->i_sb; /* Is it quota file? Do not allow user to mess with it */ if (ext4_is_quota_file(inode)) @@ -275,6 +276,23 @@ static int ext4_ioctl_setflags(struct inode *inode, goto flags_out; } + if ((flags ^ oldflags) & EXT4_CASEFOLD_FL) { + if (!ext4_has_feature_fname_encoding(sb)) { + err = -EOPNOTSUPP; + goto flags_out; + } + + if (!S_ISDIR(inode->i_mode)) { + err = -ENOTDIR; + goto flags_out; + } + + if (!ext4_empty_dir(inode)) { + err = -ENOTEMPTY; + goto flags_out; + } + } + handle = ext4_journal_start(inode, EXT4_HT_INODE, 1); if (IS_ERR(handle)) { err = PTR_ERR(handle); diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 99d936a1c58e..66fbdebcb554 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -1261,7 +1261,11 @@ int ext4_encoding_cmp(const struct inode *parent, const struct qstr *name, const struct unicode_map *um = sbi->s_encoding; int ret; - ret = utf8_strncmp(um, name, entry); + if (!IS_CASEFOLDED(parent)) + ret = utf8_strncmp(um, name, entry); + else + ret = utf8_strncasecmp(um, name, entry); + if (ret < 0) { /* Handle invalid character sequence as either an error * or as an opaque byte sequence. diff --git a/include/linux/fs.h b/include/linux/fs.h index c95c0807471f..69abaca207c0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1947,6 +1947,7 @@ struct super_operations { #define S_DAX 0 /* Make all the DAX code disappear */ #endif #define S_ENCRYPTED 16384 /* Encrypted file (using fs/crypto/) */ +#define S_CASEFOLD 32768 /* Casefolded file */ /* * Note that nosuid etc flags are inode-specific: setting some file-system @@ -1987,6 +1988,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags #define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC) #define IS_DAX(inode) ((inode)->i_flags & S_DAX) #define IS_ENCRYPTED(inode) ((inode)->i_flags & S_ENCRYPTED) +#define IS_CASEFOLDED(inode) ((inode)->i_flags & S_CASEFOLD) #define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \ (inode)->i_rdev == WHITEOUT_DEV) From patchwork Mon Jan 28 21:32:23 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 1032255 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=collabora.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43pNCQ43L7z9sDX for ; Tue, 29 Jan 2019 08:33:06 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728295AbfA1VdF (ORCPT ); Mon, 28 Jan 2019 16:33:05 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:58304 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727156AbfA1VdF (ORCPT ); Mon, 28 Jan 2019 16:33:05 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 68A1527FB87 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, sfrench@samba.org, darrick.wong@oracle.com, samba-technical@lists.samba.org, jlayton@kernel.org, bfields@fieldses.org, paulus@samba.org, Gabriel Krisman Bertazi Subject: [PATCH RFC v5 11/11] docs: ext4.rst: Document encoding and case-insensitive Date: Mon, 28 Jan 2019 16:32:23 -0500 Message-Id: <20190128213223.31512-12-krisman@collabora.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20190128213223.31512-1-krisman@collabora.com> References: <20190128213223.31512-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org From: Gabriel Krisman Bertazi Introduces the encoding-awareness and case-insensitive features on ext4 for system administrators. Explain the minimum of design decisions that are important for sysadmins wanting to enable this feature. Signed-off-by: Gabriel Krisman Bertazi --- Documentation/admin-guide/ext4.rst | 41 ++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst index e506d3dae510..4e08d0309f1e 100644 --- a/Documentation/admin-guide/ext4.rst +++ b/Documentation/admin-guide/ext4.rst @@ -91,10 +91,51 @@ Currently Available * large block (up to pagesize) support * efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force the ordering) +* Encoding aware file names +* Case insensitive file name lookups [1] Filesystems with a block size of 1k may see a limit imposed by the directory hash tree having a maximum depth of two. +Encoding-aware file names and case-insensitive lookups +====================================================== + +Ext4 optionally supports filesystem-wide charset knowledge when handling +file names, which allows the user to perform file system lookups using +charset equivalent versions of the same file name, and optionally ensure +that no invalid names are held by the filesystem. charset encoding +awareness is also essential for performing case-insensitive lookups, +because it is what defines the casefold operation. + +The case-insensitive file name lookup feature is supported in a smaller +granularity, on a per-directory basis, allowing the user to mix +case-insensitive and case-sensitive directories in the same filesystem. +It is enabled by flipping a file attribute on an empty directory. For +the reason stated above, the filesystem must have encoding enabled to +use this feature. + +Both encoding-awareness and case-awareness are name-preserving on the +disk, meaning that the file name provided by userspace is a +byte-per-byte match to what is actually written in the disk. The +Unicode normalization format used by the kernel is thus an internal +representation, and not exposed to the userspace nor to the disk, with +the important exception of disk hashes, used on large directories with +DX feature. On DX directories, the hash must be calculated using the +normalized version of the filename, meaning that the normalization +format used actually has an impact on where the directory entry is +stored. + +When we change from viewing filenames as opaque byte sequences to seeing +them as encoded strings we need to address what happens when a program +tries to create a file with an invalid name. The Unicode subsystem +within the kernel leaves the decision of what to do in this case to the +filesystem, which select its preferred behavior by enabling/disabling +the strict mode. When Ext4 encounters one of those strings and the +filesystem did not require strict mode, it falls back to considering the +entire string as an opaque byte sequence, which still allows the user to +operate on that file but the case-insensitive and equivalent sequence +lookups won't work. + Options =======