[BZ,14094] Update LC_CTYPE character class data to Unicode 7.0.0

Message ID s9dh9xcu67z.fsf@ari.site
State New

Commit Message

Mike FABIAN Dec. 3, 2014, 3:02 p.m. UTC
1) 0001-Update-LC_CTYPE-character-class-data-to-Unicode-7.0..patch

   Patch to update the character class data in
   glibc/localedata/locales/i18n. The patch includes the 2 scripts
   gen-unicode-ctype.py and ctype-compatibility.py.

2) 0002-Fix-test-case-localedata-tst-ctype-de_DE.ISO-8859-1..patch

   After applying 3), building glibc and running “make check”,
   the test localedata/tst-ctype fails. See:
   
   https://sourceware.org/bugzilla/show_bug.cgi?id=14094#c34

   I believe the test is wrong. Therefore, this patch fixes the test.
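
   To make the change to the test input easier to review, here is a
   minimal sketch that decodes the old and new flag rows for the
   first “lower” line. It assumes (this is my reading of the test
   file, not something the test harness documents here) that the row
   covers 48 consecutive ISO-8859-1 code points starting at 0xA0 and
   that each 0/1 flag applies to the character in the same column.
   The helper name flagged is made up for this illustration.

```python
# Assumption: the character row starts at 0xA0 (NO-BREAK SPACE) and each
# flag in the bit row belongs to the code point in the same column.
ROW_START = 0xA0

# Flag rows copied from the "-" and "+" lines of the test-case diff.
OLD_BITS = "000000000000000000000100000000000000000000000000"
NEW_BITS = "000000000010000000000100001000000000000000000000"


def flagged(bits, start=ROW_START):
    """Return the characters whose flag is '1' in a bit row."""
    return [chr(start + i) for i, bit in enumerate(bits) if bit == "1"]


print(flagged(OLD_BITS))  # ['µ']
print(flagged(NEW_BITS))  # ['ª', 'µ', 'º']
```

   Under that assumption, the only characters that gain the “lower”
   flag are U+00AA and U+00BA, which matches the two Lowercase entries
   quoted in the commit message of the test fix.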

3) gen-unicode-ctype.py
   (Included in the above patch, attached separately here as well for easier
   review).
   
   Script to generate the new character class data for LC_CTYPE from
   the Unicode data.

   Usage of the script:
   
   python3 ./gen-unicode-ctype.py -u UnicodeData.txt -d DerivedCoreProperties.txt -i locales/i18n -o locales/i18n-new --unicode_version 7.0.0

   Everything in the original glibc/localedata/locales/i18n file (given
   with the -i option) except the "date" stamp and the LC_CTYPE
   character class data is preserved and copied unchanged into the new
   file (given with the -o option). The character class data is replaced
   with the data from UnicodeData.txt and DerivedCoreProperties.txt from
   Unicode 7.0.0.

   The script is based on Bruno Haible’s gen-unicode-ctype.c program,
   rewritten in Python 3 and extended to also use
   DerivedCoreProperties.txt for the character classes “alpha”,
   “lower”, and “upper”.

   It also considers all non-ASCII digits alphabetic, just like
   Bruno’s original gen-unicode-ctype.c, because ISO C 99 forbids us
   from having them in the category “digit” but we want “isalnum” to
   return true for them.

   It treats title case characters as both “upper” and
   “lower” (also the same as Bruno’s gen-unicode-ctype.c).
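
   The rules above can be sketched roughly as follows. This is not
   the code from gen-unicode-ctype.py; the names parse_properties and
   ctype_classes are invented for the illustration, and the general
   category comes from Python’s own unicodedata module (whatever
   Unicode version that Python ships) rather than from UnicodeData.txt
   7.0.0. The sample lines use the DerivedCoreProperties.txt layout
   “code point or range ; property # comment”.

```python
import re
import unicodedata

# Matches "XXXX ; Property" and "XXXX..YYYY ; Property" data lines.
LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_properties(lines):
    """Map each property name to the set of code points carrying it."""
    props = {}
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:  # comment or blank line
            continue
        first = int(match.group(1), 16)
        last = int(match.group(2) or match.group(1), 16)
        props.setdefault(match.group(3), set()).update(range(first, last + 1))
    return props


def ctype_classes(cp, props):
    """Apply the rules described above to one code point (illustrative)."""
    ch = chr(cp)
    category = unicodedata.category(ch)
    ascii_digit = "0" <= ch <= "9"
    classes = set()
    if ascii_digit:
        classes.add("digit")  # only ASCII digits may be "digit"
    # Non-ASCII decimal digits are counted as alphabetic instead.
    if cp in props.get("Alphabetic", ()) or (category == "Nd" and not ascii_digit):
        classes.add("alpha")
    # Title case characters (Lt) are both "lower" and "upper".
    if cp in props.get("Lowercase", ()) or category == "Lt":
        classes.add("lower")
    if cp in props.get("Uppercase", ()) or category == "Lt":
        classes.add("upper")
    return classes


SAMPLE = [
    "# comment lines are skipped",
    "0061..007A    ; Lowercase # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z",
    "00AA          ; Lowercase # Lo       FEMININE ORDINAL INDICATOR",
    "00BA          ; Lowercase # Lo       MASCULINE ORDINAL INDICATOR",
]
props = parse_properties(SAMPLE)
print(sorted(ctype_classes(0x00AA, props)))  # ['lower']
print(sorted(ctype_classes(0x01C5, props)))  # ['lower', 'upper']
print(sorted(ctype_classes(0x0661, props)))  # ['alpha']
```

   U+01C5 is a title case letter and U+0661 is ARABIC-INDIC DIGIT ONE;
   the sample deliberately contains no Alphabetic lines, so only the
   rule under discussion fires in each case.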

4) ctype-compatibility.py
   (Included in the above patch, attached separately here as well for easier
   review).

   A Python script to compare the old and the new i18n file and check
   for errors. A sort of test suite for gen-unicode-ctype.py.

   Currently this test reports 11 “errors” in the new file, see:
   
   https://sourceware.org/bugzilla/show_bug.cgi?id=14094#c29

   All these 11 “errors” are because of a disagreement between this
   part of Bruno’s gen-unicode-ctype.c:
   
        is_alpha (unsigned int ch)
        {
          return (unicode_attributes[ch].name != NULL
                  && ((unicode_attributes[ch].category[0] == 'L'
                       /* Theppitak Karoonboonyanan <thep@links.nectec.or.th> says
                          <U0E2F>, <U0E46> should belong to is_punct.  */
                       && (ch != 0x0E2F) && (ch != 0x0E46))
                      /* Theppitak Karoonboonyanan <thep@links.nectec.or.th> says
                         <U0E31>, <U0E34>..<U0E3A>, <U0E47>..<U0E4E> are is_alpha.  */
                      || (ch == 0x0E31)
                      || (ch >= 0x0E34 && ch <= 0x0E3A)
                      || (ch >= 0x0E47 && ch <= 0x0E4E)

   and Unicode’s DerivedCoreProperties.txt.
   According to DerivedCoreProperties.txt, <U0E2F>, <U0E46> are
   “Alphabetic”. And <U0E31>, <U0E34>..<U0E3A>, <U0E47>..<U0E4E> are
   *not* “Alphabetic” according to DerivedCoreProperties.txt.

   I tried to contact Bruno Haible and Theppitak Karoonboonyanan by
   mail but got no response.

   I assume DerivedCoreProperties.txt is more trustworthy.
   If we trust it, ctype-compatibility.py finds no remaining
   errors.
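
   For reviewers who want to reproduce this kind of comparison, a
   rough sketch follows. It is not taken from ctype-compatibility.py.
   quoted_is_alpha transliterates only the part of is_alpha quoted
   above (the original function continues beyond the excerpt),
   disagreements is a hypothetical helper, and the general category
   comes from Python’s unicodedata module rather than from
   UnicodeData.txt 7.0.0. The Alphabetic set has to be supplied by the
   caller, for example from a parsed DerivedCoreProperties.txt.

```python
import unicodedata


def quoted_is_alpha(cp):
    """Only the quoted portion of is_alpha; the C original has more cases."""
    ch = chr(cp)
    if unicodedata.name(ch, None) is None:  # unnamed: not alphabetic
        return False
    # Letters are alphabetic, except the two Thai characters that the
    # quoted comment moves to is_punct.
    if unicodedata.category(ch).startswith("L") and cp not in (0x0E2F, 0x0E46):
        return True
    # Thai characters that the quoted comment adds to is_alpha.
    return cp == 0x0E31 or 0x0E34 <= cp <= 0x0E3A or 0x0E47 <= cp <= 0x0E4E


def disagreements(alphabetic, codepoints):
    """Code points where the quoted rule and the Alphabetic set differ."""
    return [cp for cp in codepoints if quoted_is_alpha(cp) != (cp in alphabetic)]


# Tiny hand-written Alphabetic set, for illustration only.
alphabetic = {0x0E01, 0x0E2F, 0x0E46}
for cp in disagreements(alphabetic, [0x0E01, 0x0E2F, 0x0E46]):
    print("<U%04X>" % cp)
```

   With this input the sketch reports <U0E2F> and <U0E46>: the quoted
   rule excludes them, while the supplied Alphabetic set includes them.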

Patch

From 25c913674386011a44b6270579a894b2e8200d25 Mon Sep 17 00:00:00 2001
From: Mike FABIAN <mfabian@redhat.com>
Date: Wed, 3 Dec 2014 10:05:42 +0100
Subject: [PATCH 2/2] Fix test case localedata/tst-ctype-de_DE.ISO-8859-1.in
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

DerivedCoreProperties.txt from Unicode 7.0.0 lists
the characters U+00AA (ª) and U+00BA (º) as lower case:

00AA          ; Lowercase # Lo       FEMININE ORDINAL INDICATOR
00BA          ; Lowercase # Lo       MASCULINE ORDINAL INDICATOR
---
 localedata/tst-ctype-de_DE.ISO-8859-1.in | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/localedata/tst-ctype-de_DE.ISO-8859-1.in b/localedata/tst-ctype-de_DE.ISO-8859-1.in
index f71d76c..e124a52 100644
--- a/localedata/tst-ctype-de_DE.ISO-8859-1.in
+++ b/localedata/tst-ctype-de_DE.ISO-8859-1.in
@@ -1,5 +1,5 @@ 
 lower    ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
-        000000000000000000000100000000000000000000000000
+        000000000010000000000100001000000000000000000000
 lower   ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
         000000000000000111111111111111111111111011111111
 upper    ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
-- 
1.9.3