[BZ,14094] Update LC_CTYPE character class data to Unicode 7.0.0

1) 0001-Update-LC_CTYPE-character-class-data-to-Unicode-7.0..patch

   Patch to update the character class data in
   glibc/localedata/locales/i18n. The patch includes the 2 scripts
   gen-unicode-ctype.py and ctype-compatibility.py.

2) 0002-Fix-test-case-localedata-tst-ctype-de_DE.ISO-8859-1..patch

   After applying 1), building glibc, and running “make check”,
   the test localedata/tst-ctype fails. See:

   https://sourceware.org/bugzilla/show_bug.cgi?id=14094#c34

   I believe the test is wrong. Therefore, this patch fixes the test.

3) gen-unicode-ctype.py
   (Included in the above patch, attached separately here as well for easier
   review).

   Script to generate the new character class data for LC_CTYPE from
   the Unicode data.

   Usage of the script:

   python3 ./gen-unicode-ctype.py -u UnicodeData.txt -d DerivedCoreProperties.txt -i locales/i18n -o locales/i18n-new --unicode_version 7.0.0

   Everything in the original glibc/localedata/locales/i18n file (given
   with the -i option) except the "date" stamp and the LC_CTYPE
   character class data is preserved and copied unchanged into the new
   file (given with the -o option). The character class data is replaced
   with the data from UnicodeData.txt and DerivedCoreProperties.txt from
   Unicode 7.0.0.
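
   The command line above could be parsed along the following lines.
   This is only a sketch using Python's standard argparse module, not
   the option handling of gen-unicode-ctype.py itself: the flags are
   the ones shown in the usage line, while the function name, the
   "dest" names and the help texts are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the usage line (sketch)."""
    parser = argparse.ArgumentParser(
        description='Generate LC_CTYPE character class data (sketch).')
    # The "dest" names below are illustrative assumptions.
    parser.add_argument('-u', dest='unicode_data_file', required=True,
                        help='path to UnicodeData.txt')
    parser.add_argument('-d', dest='derived_core_properties_file',
                        required=True,
                        help='path to DerivedCoreProperties.txt')
    parser.add_argument('-i', dest='input_file', required=True,
                        help='existing locales/i18n file')
    parser.add_argument('-o', dest='output_file', required=True,
                        help='new i18n file to write')
    parser.add_argument('--unicode_version', required=True,
                        help='Unicode version string, e.g. 7.0.0')
    return parser
```

   Calling build_parser().parse_args() with the arguments from the
   usage line yields one attribute per input or output file plus the
   version string.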

   The script is based on Bruno Haible’s gen-unicode-ctype.c program,
   rewritten in Python 3 and extended to use DerivedCoreProperties.txt as
   well for the character classes “alpha”, “lower”, and “upper”.

   It also considers all non-ASCII digits as alphabetic, just like
   Bruno’s original gen-unicode-ctype.c, because ISO C 99 forbids us to
   have them in the category “digit” but we want “isalnum” to return
   true for them.
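
   The digit rule can be illustrated with a small sketch. It is not
   the code from gen-unicode-ctype.py: it uses Python's built-in
   unicodedata module (whose Unicode version depends on the Python
   release) instead of parsing UnicodeData.txt, it assumes that
   “non-ASCII digit” means general category Nd outside 0-9, and all
   function names are hypothetical.

```python
import unicodedata


def is_digit(code_point):
    # ISO C 99 only allows the ten ASCII digits in class "digit".
    return 0x30 <= code_point <= 0x39


def is_nonascii_decimal_digit(code_point):
    # Assumption: "non-ASCII digit" = general category Nd outside ASCII.
    return (unicodedata.category(chr(code_point)) == 'Nd'
            and not is_digit(code_point))


def is_alpha_sketch(code_point):
    # Simplified: letters (category L*) plus the non-ASCII digits, so
    # that an isalnum() built from alpha and digit accepts the latter.
    category = unicodedata.category(chr(code_point))
    return category.startswith('L') or is_nonascii_decimal_digit(code_point)


def is_alnum_sketch(code_point):
    return is_alpha_sketch(code_point) or is_digit(code_point)
```

   With this split, an ASCII digit is “digit” but not “alpha”, while
   a digit such as U+0663 is “alpha” but not “digit”; both are
   alphanumeric.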

   It treats title case characters as both “upper” and
   “lower” (also the same as Bruno’s gen-unicode-ctype.c).
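
   A minimal sketch of the title case rule, again based on Python's
   unicodedata module rather than on the real script (which also
   consults DerivedCoreProperties.txt for “lower” and “upper”). The
   function names are hypothetical and only general categories are
   used here.

```python
import unicodedata


def is_upper_sketch(code_point):
    # Uppercase letters (Lu) and title case letters (Lt).
    return unicodedata.category(chr(code_point)) in ('Lu', 'Lt')


def is_lower_sketch(code_point):
    # Lowercase letters (Ll) and title case letters (Lt).
    return unicodedata.category(chr(code_point)) in ('Ll', 'Lt')
```

   For a title case character such as U+01C5 both functions return
   True, whereas a plain “A” is only “upper”.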

4) ctype-compatibility.py
   (Included in the above patch, attached separately here as well for easier
   review).

   A Python script to compare the old and the new i18n file and check
   for errors. It serves as a sort of test suite for gen-unicode-ctype.py.

   Currently this test reports 11 “errors” in the new file, see:

   https://sourceware.org/bugzilla/show_bug.cgi?id=14094#c29

   All these 11 “errors” are because of a disagreement between this
   part of Bruno’s gen-unicode-ctype.c:

        is_alpha (unsigned int ch)
        {
          return (unicode_attributes[ch].name != NULL
                  && ((unicode_attributes[ch].category[0] == 'L'
                       /* Theppitak Karoonboonyanan <thep@links.nectec.or.th> says
                          <U0E2F>, <U0E46> should belong to is_punct.  */
                       && (ch != 0x0E2F) && (ch != 0x0E46))
                      /* Theppitak Karoonboonyanan <thep@links.nectec.or.th> says
                         <U0E31>, <U0E34>..<U0E3A>, <U0E47>..<U0E4E> are is_alpha.  */
                      || (ch == 0x0E31)
                      || (ch >= 0x0E34 && ch <= 0x0E3A)
                      || (ch >= 0x0E47 && ch <= 0x0E4E)

   and Unicode’s DerivedCoreProperties.txt.
   According to DerivedCoreProperties.txt, <U0E2F> and <U0E46> are
   “Alphabetic”, whereas <U0E31>, <U0E34>..<U0E3A>, and
   <U0E47>..<U0E4E> are *not* “Alphabetic”.

   I e-mailed Bruno Haible and Theppitak Karoonboonyanan about this
   but got no response.

   I assume DerivedCoreProperties.txt is the more trustworthy source.
   If we trust it, ctype-compatibility.py finds no remaining errors.
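
   To check such a disagreement by hand, the code points in question
   can be looked up in DerivedCoreProperties.txt directly. The sketch
   below is not taken from ctype-compatibility.py. It assumes the
   usual data line layout of that file (“XXXX ; Property # comment”
   or “XXXX..YYYY ; Property # comment”), and the function names are
   hypothetical.

```python
import re

# One data line: a code point or a range, then "; PropertyName".
_DATA_LINE = re.compile(
    r'^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)')


def parse_derived_core_properties(lines):
    """Map each property name to a list of inclusive (first, last) ranges."""
    ranges = {}
    for line in lines:
        match = _DATA_LINE.match(line)
        if not match:
            continue  # comment or blank line
        first = int(match.group(1), 16)
        last = int(match.group(2), 16) if match.group(2) else first
        ranges.setdefault(match.group(3), []).append((first, last))
    return ranges


def has_property(ranges, name, code_point):
    """Return True if code_point falls into any range listed for name."""
    return any(first <= code_point <= last
               for first, last in ranges.get(name, []))
```

   Passing an open DerivedCoreProperties.txt file object to
   parse_derived_core_properties() and then calling, for example,
   has_property(ranges, 'Alphabetic', 0x0E2F) shows how the Unicode
   data classifies each of the disputed Thai characters.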
