Message ID | ortwyig5xa.fsf@livre.home |
---|---|
State | New |
On 18 Feb 2015 21:23, Alexandre Oliva wrote:
> --- a/localedata/unicode-gen/ctype_compatibility.py
> +++ b/localedata/unicode-gen/ctype_compatibility.py
>
> -# Copyright (C) 2014 Free Software Foundation, Inc.
> +# Copyright (C) 2014, 2015 Free Software Foundation, Inc.

should be a date range (2014-2015)

> +# Auxiliary tables for Hangul syllable names, see the Unicode 3.0 book,
> +# sections 3.11 and 4.4.
> +
> +jamo_initial_short_name = [
> +    'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
> +    'C', 'K', 'T', 'P', 'H'
> +]

module level constants should really be in CAPS. and use a tuple to make it
const.
-mike
On Feb 18, 2015, Mike Frysinger <vapier@gentoo.org> wrote:

> should be a date range (2014-2015)

> module level constants should really be in CAPS. and use a tuple to make it
> const.

Thanks. Mind if we save these cosmetic changes to the scripts for a
follow-up patch?
There's also the matter of updating __STDC_ISO_10646__ in stdc-predef.h. Unicode 7.0 claims to correspond to ISO/IEC 10646:2012 plus amendments 1 and 2 (and one extra character). Unfortunately I can find no sign of amendment 2 ever having been published; it looks rather like it was subsumed into ISO/IEC 10646:2014. Wikipedia claims that corresponds to Unicode 7.0 (which would imply 201409L as version), but I can't find any authoritative information, either on the Unicode website or after looking through lots of SC2 documents, to confirm if there are indeed no characters in 10646:2014 that aren't in Unicode 7.0.
Mike Frysinger <vapier@gentoo.org> wrote:
>> +# Auxiliary tables for Hangul syllable names, see the Unicode 3.0 book,
>> +# sections 3.11 and 4.4.
>> +
>> +jamo_initial_short_name = [
>> +    'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
>> +    'C', 'K', 'T', 'P', 'H'
>> +]
>
> module level constants should really be in CAPS. and use a tuple to make it
> const.
> -mike

https://github.com/pravins/glibc-i18n/commit/53b81c58d220bfbb0e8faf8d4313c705826f4543
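For readers following the review comment, here is a minimal sketch of what "CAPS plus tuple" could look like for the first table. The upper-case name JAMO_INITIAL_SHORT_NAME and the helper try_to_modify are illustrative assumptions, not taken from the linked commit.

```python
# Hypothetical rename of the table from the patch: an upper-case name marks
# a module-level constant, and a tuple rejects item assignment.
JAMO_INITIAL_SHORT_NAME = (
    'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
    'C', 'K', 'T', 'P', 'H'
)


def try_to_modify(table):
    '''Return True if item assignment on the table succeeds.'''
    try:
        table[0] = 'X'
    except TypeError:
        # Tuples do not support item assignment.
        return False
    return True


# A list copy can still be changed in place; the tuple itself cannot.
print(try_to_modify(list(JAMO_INITIAL_SHORT_NAME)))
print(try_to_modify(JAMO_INITIAL_SHORT_NAME))
```

Note that a tuple only guards against in-place modification. Python does not stop the name from being rebound, so the upper-case spelling remains a convention rather than an enforced constant.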
On 02/18/2015 06:23 PM, Alexandre Oliva wrote:
> [BZ #17588]
> [BZ #13064]
> [BZ #14094]
> [BZ #17998]
> * unicode-gen/Makefile: New.
> * unicode-gen/unicode-license.txt: New, from Unicode.
> * unicode-gen/UnicodeData.txt: New, from Unicode.
> * unicode-gen/DerivedCoreProperties.txt: New, from Unicode.
> * unicode-gen/EastAsianWidth.txt: New, from Unicode.
> * unicode-gen/gen_unicode_ctype.py: New generator, from Mike
> FABIAN <mfabian@redhat.com>.
> * unicode-gen/ctype_compatibility.py: New verifier, from
> Pravin Satpute <psatpute@redhat.com> and Mike FABIAN.
> * unicode-gen/ctype_compatibility_test_cases.py: New verifier
> module, from Mike FABIAN.
> * unicode-gen/utf8_gen.py: New generator, from Pravin Satpute
> and Mike FABIAN.
> * unicode-gen/utf8_compatibility.py: New verifier, from Pravin
> Satpute and Mike FABIAN.
> * charmaps/UTF-8: Update.
> * locales/i18n: Update.
> * gen-unicode-ctype.c: Remove.
> * tst-ctype-de_DE.ISO-8859-1.in: Adjust, islower now returns
> true for ordinal indicators.

Looks good to me. Please feel free to commit.

One nit:

-% Character width according to Unicode 5.0.0.
+% Character width according to Unicode 7.0.0.
 % - Default width is 1.
 % - Double-width characters have width 2; generated from
 % "grep '^[^;]*;[WF]' EastAsianWidth.txt"
-% and "grep '^[^;]*;[^WF]' EastAsianWidth.txt"
 % - Non-spacing characters have width 0; generated from PropList.txt or
 % "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
 % - Format control characters have width 0; generated from
 % "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
-% - Zero width characters have width 0; generated from
-% "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"

Why even mention the `grep` to be used to generate this data? It should
just say to use the scripts. Nobody should be confused that this data was
actually generated by this method. Nor do I want anyone doing it this way
ever again.

Thus shouldn't `write_header_width` simply not output any of this stuff?

I understand we're trying to minimize the initial diff, but in cleanup, we
should remove all of this and just say:
"% Character width according to Unicode 7.0.0."

Thoughts?

Cheers,
Carlos.
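As an illustration of the cleanup proposed here, the following sketch writes only the single version comment. The function write_minimal_width_header and its unicode_version parameter are hypothetical; the real signature of `write_header_width` is not shown in this thread, so this is not a drop-in replacement.

```python
import io


def write_minimal_width_header(outfile, unicode_version):
    '''Hypothetical helper: emit only the version comment, no grep recipes.'''
    outfile.write('% Character width according to Unicode {:s}.\n'.format(
        unicode_version))


# Write into an in-memory buffer so the result can be inspected.
buffer = io.StringIO()
write_minimal_width_header(buffer, '7.0.0')
print(buffer.getvalue(), end='')
```

Passing the version as a parameter is a design choice of this sketch, so the header would not need editing by hand at the next Unicode update.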
On 02/18/2015 08:19 PM, Joseph Myers wrote:
> There's also the matter of updating __STDC_ISO_10646__ in stdc-predef.h.
>
> Unicode 7.0 claims to correspond to ISO/IEC 10646:2012 plus amendments 1
> and 2 (and one extra character). Unfortunately I can find no sign of
> amendment 2 ever having been published; it looks rather like it was
> subsumed into ISO/IEC 10646:2014. Wikipedia claims that corresponds to
> Unicode 7.0 (which would imply 201409L as version), but I can't find any
> authoritative information, either on the Unicode website or after looking
> through lots of SC2 documents, to confirm if there are indeed no
> characters in 10646:2014 that aren't in Unicode 7.0.

I have submitted a question to the Unicode Consortium to answer this.

Proving there are no characters in 10646:2014 that aren't in Unicode 7.0
is going to be a difficult slog. Someone from the relevant groups has to
answer the question for us.

I went through SC2 documents from the Canadian side and found that
10646:2012 amendment 2 did go to ITTF for FDAM and a summary of votes
shows it passed. However, it seems the secretariat changed at that point
and perhaps everything was delayed until the 2014 standard.

Cheers,
Carlos.
On 02/18/2015 08:19 PM, Joseph Myers wrote:
> There's also the matter of updating __STDC_ISO_10646__ in stdc-predef.h.
>
> Unicode 7.0 claims to correspond to ISO/IEC 10646:2012 plus amendments 1
> and 2 (and one extra character). Unfortunately I can find no sign of
> amendment 2 ever having been published; it looks rather like it was
> subsumed into ISO/IEC 10646:2014. Wikipedia claims that corresponds to
> Unicode 7.0 (which would imply 201409L as version), but I can't find any
> authoritative information, either on the Unicode website or after looking
> through lots of SC2 documents, to confirm if there are indeed no
> characters in 10646:2014 that aren't in Unicode 7.0.
>

The ISO never published amendment 2 for ISO/IEC 10646:2012.

The answer from the Unicode Consortium was (with some copy editing):
~~~
Version 7.0 of the Unicode Standard is synchronized with ISO/IEC
10646:2012, plus Amendments 1 and 2. Additionally, it includes the
accelerated publication of U+20BD RUBLE SIGN.

Unicode 8.0, due for publication in the summer of 2015, is in early draft
stage now, with a page here:
http://www.unicode.org/versions/Unicode8.0.0/#Summary

Is intended to synchronize with ISO 10646:2014, plus Amendment 1.
~~~

Therefore Unicode 7.0.0 is between 10646:2012 and 10646:2014.

The Wikipedia page is wrong and I have corrected it.

Thus __STDC_ISO_10646__ should be 201304L (the date that ISO/IEC 10646:2012
Amd.1 was published).

Thoughts?

Cheers,
Carlos.
On Fri, 20 Feb 2015, Carlos O'Donell wrote:

> Thus __STDC_ISO_10646__ should be 201304L (the date that ISO/IEC 10646:2012
> Amd.1 was published).
>
> Thoughts?

That accords with what I suggested as a safe value in
<https://sourceware.org/ml/libc-alpha/2014-06/msg00588.html> when the 2014
edition hadn't been published either.
On 02/20/2015 04:28 PM, Joseph Myers wrote:
> On Fri, 20 Feb 2015, Carlos O'Donell wrote:
>
>> Thus __STDC_ISO_10646__ should be 201304L (the date that ISO/IEC 10646:2012
>> Amd.1 was published).
>>
>> Thoughts?
>
> That accords with what I suggested as a safe value in
> <https://sourceware.org/ml/libc-alpha/2014-06/msg00588.html> when the 2014
> edition hadn't been published either.
>

Sounds like consensus.

Alex, could you please make sure __STDC_ISO_10646__ ends up as 201304L?

Cheers,
Carlos.
diff --git a/NEWS b/NEWS
index 0501d51..a59b68d 100644
--- a/NEWS
+++ b/NEWS
@@ -9,8 +9,15 @@ Version 2.22

 * The following bugs are resolved with this release:

-  4719, 15319, 15467, 15790, 16560, 17569, 17792, 17912, 17932, 17944,
-  17949, 17964, 17965, 17967, 17969, 17978, 17987, 17991, 17996.
+  4719, 13064, 14094, 15319, 15467, 15790, 16560, 17569, 17588, 17792,
+  17912, 17932, 17944, 17949, 17964, 17965, 17967, 17969, 17978, 17987,
+  17991, 17996, 17998.
+
+* Character encoding and ctype tables were updated to Unicode 7.0.0, using
+  new generator scripts contributed by Pravin Satpute and Mike FABIAN (Red
+  Hat). These updates cause user visible changes, such as the fix for bug
+  17998.
+

 Version 2.21

Incremental changes to the scripts:

diff --git a/localedata/unicode-gen/ctype_compatibility.py b/localedata/unicode-gen/ctype_compatibility.py
index 9535f81..19e9ee5 100755
--- a/localedata/unicode-gen/ctype_compatibility.py
+++ b/localedata/unicode-gen/ctype_compatibility.py
@@ -1,10 +1,7 @@
 #!/usr/bin/python3
 # -*- coding: utf-8 -*-
-# Copyright (C) 2014 Free Software Foundation, Inc.
+# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
-# Contributed by
-# Pravin Satpute <psatpute@redhat.com>, 2014.
-# Mike FABIAN <mfabian@redhat.com>, 2014.
 #
 # The GNU C Library is free software; you can redistribute it and/or
 # modify it under the terms of the GNU Lesser General Public
diff --git a/localedata/unicode-gen/ctype_compatibility_test_cases.py b/localedata/unicode-gen/ctype_compatibility_test_cases.py
index 09438d7..ab7f6dd 100644
--- a/localedata/unicode-gen/ctype_compatibility_test_cases.py
+++ b/localedata/unicode-gen/ctype_compatibility_test_cases.py
@@ -1,8 +1,6 @@
 # -*- coding: utf-8 -*-
-# Copyright (C) 2014 Free Software Foundation, Inc.
+# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
-# Contributed by
-# Mike FABIAN <mfabian@redhat.com>, 2014.
 #
 # The GNU C Library is free software; you can redistribute it and/or
 # modify it under the terms of the GNU Lesser General Public
diff --git a/localedata/unicode-gen/gen_unicode_ctype.py b/localedata/unicode-gen/gen_unicode_ctype.py
index 24155bd..559af79 100755
--- a/localedata/unicode-gen/gen_unicode_ctype.py
+++ b/localedata/unicode-gen/gen_unicode_ctype.py
@@ -1,9 +1,8 @@
 #!/usr/bin/python3
 #
 # Generate a Unicode conforming LC_CTYPE category from a UnicodeData file.
-# Copyright (C) 2014 Free Software Foundation, Inc.
+# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
-# Contributed by Mike FABIAN <maiku.fabian@gmail.com>, 2014.
 # Based on gen-unicode-ctype.c by Bruno Haible <haible@clisp.cons.org>, 2000.
 #
 # The GNU C Library is free software; you can redistribute it and/or
diff --git a/localedata/unicode-gen/utf8_compatibility.py b/localedata/unicode-gen/utf8_compatibility.py
index 4928e3e..e11327b 100755
--- a/localedata/unicode-gen/utf8_compatibility.py
+++ b/localedata/unicode-gen/utf8_compatibility.py
@@ -1,9 +1,7 @@
 #!/usr/bin/python3
 # -*- coding: utf-8 -*-
-# Copyright (C) 2014 Free Software Foundation, Inc.
+# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
-# Contributed by Pravin Satpute <psatpute@redhat.com>, 2014.
-# Mike FABIAN <mfabian@redhat.com>, 2014
 #
 # The GNU C Library is free software; you can redistribute it and/or
 # modify it under the terms of the GNU Lesser General Public
@@ -27,8 +25,6 @@ To see how this script is used, call it with the “-h” option:

     $ ./utf8_compatibility.py -h
     … prints usage message …
-
-For issues upstream https://github.com/pravins/glibc-i18n
 '''

 import sys
diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index 9ffb7f6..670a628 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -1,10 +1,7 @@
 #!/usr/bin/python3
 # -*- coding: utf-8 -*-
-# Copyright (C) 2014 Free Software Foundation, Inc.
+# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
-# Contributed by
-# Pravin Satpute <psatpute AT redhat DOT com> and
-# Mike Fabian <mfabian At redhat DOT com> - 2014
 #
 # The GNU C Library is free software; you can redistribute it and/or
 # modify it under the terms of the GNU Lesser General Public
@@ -28,13 +25,30 @@ from Unicode data.
 Usage: python3 utf8_gen.py UnicodeData.txt EastAsianWidth.txt

 It will output UTF-8 file
-
-For issues upstream https://github.com/pravins/glibc-i18n
 '''

 import sys
 import re

+# Auxiliary tables for Hangul syllable names, see the Unicode 3.0 book,
+# sections 3.11 and 4.4.
+
+jamo_initial_short_name = [
+    'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
+    'C', 'K', 'T', 'P', 'H'
+]
+
+jamo_medial_short_name = [
+    'A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O', 'WA', 'WAE', 'OE',
+    'YO', 'U', 'WEO', 'WE', 'WI', 'YU', 'EU', 'YI', 'I'
+]
+
+jamo_final_short_name = [
+    '', 'G', 'GG', 'GS', 'N', 'NI', 'NH', 'D', 'L', 'LG', 'LM', 'LB', 'LS',
+    'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS', 'NG', 'J', 'C', 'K', 'T',
+    'P', 'H'
+]
+
 def ucs_symbol(code_point):
     '''Return the UCS symbol string for a Unicode character.'''
     if code_point < 0x10000:
@@ -57,8 +71,15 @@ def process_range(start, end, outfile, name):
         #
         # So we expand the Hangul Syllables here:
         for i in range(int(start, 16), int(end, 16)+1 ):
-            outfile.write('{:s} {:s} {:s}\n'.format(
-                ucs_symbol(i), convert_to_hex(i), name))
+            index2, index3 = divmod(i - 0xaC00, 28)
+            index1, index2 = divmod(index2, 21)
+            hangul_syllable_name = 'HANGUL SYLLABLE ' \
+                                   + jamo_initial_short_name[index1] \
+                                   + jamo_medial_short_name[index2] \
+                                   + jamo_final_short_name[index3]
+            outfile.write('{:<11s} {:<12s} {:s}\n'.format(
+                ucs_symbol(i), convert_to_hex(i),
+                hangul_syllable_name))
         return
     # UnicodeData.txt file has contains code point ranges like this:
     #
@@ -73,13 +94,13 @@ def process_range(start, end, outfile, name):
     # <U4D80>..<U4DB5> /xe4/xb6/x80 <CJK Ideograph Extension A>
     for i in range(int(start, 16), int(end, 16), 64 ):
         if i > (int(end, 16)-64):
-            outfile.write('{:s}..{:s} {:s} {:s}\n'.format(
+            outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
                 ucs_symbol(i),
                 ucs_symbol(int(end,16)),
                 convert_to_hex(i),
                 name))
             break
-        outfile.write('{:s}..{:s} {:s} {:s}\n'.format(
+        outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
             ucs_symbol(i),
             ucs_symbol(i+63),
             convert_to_hex(i),
@@ -146,7 +167,7 @@ def process_charmap(flines, outfile):
             # the original UTF-8 file in glibc had them as
             # comments, so we keep these comment lines.
             outfile.write('%')
-        outfile.write('{:s} {:s} {:s}\n'.format(
+        outfile.write('{:<11s} {:<12s} {:s}\n'.format(
             ucs_symbol(int(fields[0], 16)),
             convert_to_hex(int(fields[0], 16)),
             fields[1]))
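To make the Hangul expansion in the patch easier to follow, here is a self-contained sketch of the same divmod arithmetic. The three tables are copied verbatim from the patch and have not been checked here against the Unicode data files. The functions hangul_syllable_name and charmap_line are hypothetical stand-ins written for this illustration; in particular, charmap_line only approximates what `ucs_symbol` and `convert_to_hex` produce for code points below 0x10000, based on the sample charmap line shown in the script's comments.

```python
# Tables copied verbatim from the patch (19 initials, 21 medials, 28 finals).
jamo_initial_short_name = [
    'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
    'C', 'K', 'T', 'P', 'H'
]
jamo_medial_short_name = [
    'A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O', 'WA', 'WAE', 'OE',
    'YO', 'U', 'WEO', 'WE', 'WI', 'YU', 'EU', 'YI', 'I'
]
jamo_final_short_name = [
    '', 'G', 'GG', 'GS', 'N', 'NI', 'NH', 'D', 'L', 'LG', 'LM', 'LB', 'LS',
    'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS', 'NG', 'J', 'C', 'K', 'T',
    'P', 'H'
]


def hangul_syllable_name(code_point):
    '''Return the name of a precomposed Hangul syllable (U+AC00..U+D7A3).'''
    if not 0xAC00 <= code_point <= 0xD7A3:
        raise ValueError('not a precomposed Hangul syllable')
    # The block is laid out as initial * (21 * 28) + medial * 28 + final.
    index2, index3 = divmod(code_point - 0xAC00, 28)
    index1, index2 = divmod(index2, 21)
    return ('HANGUL SYLLABLE '
            + jamo_initial_short_name[index1]
            + jamo_medial_short_name[index2]
            + jamo_final_short_name[index3])


def charmap_line(code_point):
    '''Hypothetical stand-in: format one charmap line for a BMP syllable.'''
    symbol = '<U{:04X}>'.format(code_point)
    utf8_hex = ''.join('/x{:02x}'.format(byte)
                       for byte in chr(code_point).encode('utf-8'))
    # Same column widths as the format string introduced by the patch.
    return '{:<11s} {:<12s} {:s}'.format(
        symbol, utf8_hex, hangul_syllable_name(code_point))


for cp in (0xAC00, 0xD55C, 0xD7A3):
    print(charmap_line(cp))
```

The two divisors are the lengths of the final and medial tables, so the 19 * 21 * 28 = 11172 combinations cover the range U+AC00 to U+D7A3 exactly, with no index falling outside a table.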