[BZ,17588,13064] Update UTF-8 charmap and width to Unicode 7.0.0

Message ID or8ufscg8p.fsf@livre.home
State New

Commit Message

Alexandre Oliva Feb. 20, 2015, 11:31 p.m. UTC
On Feb 20, 2015, "Carlos O'Donell" <carlos@redhat.com> wrote:

> Thus __STDC_ISO_10646__ should be 201304L (the date that ISO/IEC 10646:2012
> Amd.1 was published).

Fixed in the patch below.
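
For reference, the C standard defines __STDC_ISO_10646__ as an integer
constant of the form yyyymmL, so 201304L reads as April 2013.  A minimal
sketch of that decoding follows; the helper name is invented for the
example and is not part of the patch.

```python
def decode_iso_10646_date(value):
    """Split a yyyymm integer, as used by __STDC_ISO_10646__, into (year, month)."""
    year, month = divmod(value, 100)
    if not 1 <= month <= 12:
        raise ValueError('not a yyyymm value: {}'.format(value))
    return year, month


# Value introduced by the patch, then the value it replaces.
print(decode_iso_10646_date(201304))
print(decode_iso_10646_date(201103))
```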

On Feb 19, 2015, Mike FABIAN <mfabian@redhat.com> wrote:

> Mike Frysinger <vapier@gentoo.org> wrote:

>> module level constants should really be in CAPS.  and use a tuple to make it 
>> const.
>> -mike

> https://github.com/pravins/glibc-i18n/commit/53b81c58d220bfbb0e8faf8d4313c705826f4543

Thanks, integrated.  I also adjusted the copyright notices to use year
ranges, as requested.
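
The effect of the tuple suggestion can be sketched as follows: item
assignment works on a list but raises TypeError on a tuple.  The table
below is a shortened, hypothetical stand-in for the ones in utf8_gen.py,
and the helper exists only for this illustration.

```python
# Hypothetical, shortened stand-in for a module-level constant table.
JAMO_SAMPLE = ('G', 'GG', 'N')


def try_to_modify(table):
    """Return True if in-place item assignment on table succeeds."""
    try:
        table[0] = 'X'
    except TypeError:
        return False
    return True


print(try_to_modify(list(JAMO_SAMPLE)))  # a list copy accepts the change
print(try_to_modify(JAMO_SAMPLE))        # the tuple rejects it
```

A tuple only prevents in-place changes; the module-level name itself can
still be rebound, which is why the upper-case spelling is used to signal
that it should not be.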

On Feb 20, 2015, "Carlos O'Donell" <carlos@redhat.com> wrote:

> One nit:

> -% Character width according to Unicode 5.0.0.
> +% Character width according to Unicode 7.0.0.
>  % - Default width is 1.
>  % - Double-width characters have width 2; generated from
>  %        "grep '^[^;]*;[WF]' EastAsianWidth.txt"
> -%   and  "grep '^[^;]*;[^WF]' EastAsianWidth.txt"
>  % - Non-spacing characters have width 0; generated from PropList.txt or
>  %   "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
>  % - Format control characters have width 0; generated from
>  %   "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
> -% - Zero width characters have width 0; generated from
> -%   "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"

> Why even mention the `grep` to be used to generate this data?
> It should just say to use the scripts. Nobody should be confused
> that this data was actually generated by this method. Nor do I want
> anyone doing it this way ever again.

> Thus shouldn't `write_header_width` simply not output any of this
> stuff? I understand we're trying to minimize the initial diff, but
> in cleanup, we should remove all of this and just say:

> "% Character width according to Unicode 7.0.0."

I don't know enough about Unicode to tell whether we've extracted all of
the width information encoded in it, but I have verified that the behavior
encoded in the Python script is equivalent to what is described in the
comments, so I decided not to act on this right away.  I guess we might
want to tweak the comments to make what's going on clearer, instead of
just dropping the info, although I wouldn't oppose that either.

Does anyone else have thoughts to share on this?

Mike FABIAN, should you want to tackle this, would you please submit a
patch to this list, with a proper ChangeLog entry, so that it can be
installed as written by yourself?
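
As a rough illustration of the rules listed in the quoted comment, the
sketch below derives widths from text in the layout of EastAsianWidth.txt
and UnicodeData.txt.  It is not the logic of utf8_gen.py: the function
names and the inline sample data are made up for the example, and letting
the zero-width rules override the double-width rule is an assumption.

```python
# Tiny made-up excerpts in the semicolon-separated layout of the data files.
EAST_ASIAN_WIDTH_SAMPLE = """\
0041;Na
1100..115F;W
3000;F
"""

UNICODE_DATA_SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;;;;
200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
"""


def parse_code_points(field):
    """Expand 'XXXX' or 'XXXX..YYYY' into a range of integers."""
    first, _, last = field.partition('..')
    start = int(first, 16)
    end = int(last, 16) if last else start
    return range(start, end + 1)


def build_width_table(east_asian_width, unicode_data):
    """Return {code point: width} for every width other than the default 1."""
    widths = {}
    for line in east_asian_width.splitlines():
        line = line.split('#', 1)[0].strip()  # drop trailing comments
        if ';' not in line:
            continue
        code_points, prop = [f.strip() for f in line.split(';')[:2]]
        if prop in ('W', 'F'):  # wide or fullwidth: width 2
            for code_point in parse_code_points(code_points):
                widths[code_point] = 2
    for line in unicode_data.splitlines():
        fields = line.split(';')
        if len(fields) < 5:
            continue
        # Field 2 is the general category, field 4 the bidi class.
        if fields[4] == 'NSM' or fields[2] == 'Cf':
            widths[int(fields[0], 16)] = 0  # assumed to override width 2
    return widths


def width(code_point, table):
    """Look up a width, falling back to the default of 1."""
    return table.get(code_point, 1)


TABLE = build_width_table(EAST_ASIAN_WIDTH_SAMPLE, UNICODE_DATA_SAMPLE)
for cp in (0x41, 0x1100, 0x3000, 0x300, 0x200B):
    print('U+{:04X} width {}'.format(cp, width(cp, TABLE)))
```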


Here's the patch I'm testing.  Ok to install?


Amendments to Unicode 7 update.

From: Alexandre Oliva <aoliva@redhat.com>

for  ChangeLog

	* include/stdc-predef.h (__STDC_ISO_10646__): Update to
	201304L, for Unicode 7.

for  localedata/ChangeLog

	* unicode-gen/ctype_compatibility.py: Use date ranges in
	copyright notice.
	* unicode-gen/ctype_compatibility_test_cases.py: Likewise.
	* unicode-gen/gen_unicode_ctype.py: Likewise.
	* unicode-gen/utf8_compatibility.py: Likewise.
	* unicode-gen/utf8_gen.py: Likewise.  Use upper case for
	global variables, use tuples for global constant arrays.  From
	Mike FABIAN.  Suggested by Mike Frysinger <vapier@gentoo.org>.
---
 include/stdc-predef.h                              |   11 ++++++++---
 localedata/unicode-gen/ctype_compatibility.py      |    2 +-
 .../unicode-gen/ctype_compatibility_test_cases.py  |    2 +-
 localedata/unicode-gen/gen_unicode_ctype.py        |    2 +-
 localedata/unicode-gen/utf8_compatibility.py       |    2 +-
 localedata/unicode-gen/utf8_gen.py                 |   20 ++++++++++----------
 6 files changed, 22 insertions(+), 17 deletions(-)

Comments

Carlos O'Donell Feb. 23, 2015, 2:25 p.m. UTC | #1
On 02/20/2015 06:31 PM, Alexandre Oliva wrote:
> On Feb 20, 2015, "Carlos O'Donell" <carlos@redhat.com> wrote:
> 
>> Thus __STDC_ISO_10646__ should be 201304L (the date that ISO/IEC 10646:2012
>> Amd.1 was published).
> 
> Fixed in the patch below.

This change looks good to me. OK to commit.

> On Feb 19, 2015, Mike FABIAN <mfabian@redhat.com> wrote:
> 
>> Mike Frysinger <vapier@gentoo.org> wrote:
> 
>>> module level constants should really be in CAPS.  and use a tuple to make it 
>>> const.
>>> -mike
> 
>> https://github.com/pravins/glibc-i18n/commit/53b81c58d220bfbb0e8faf8d4313c705826f4543
> 
> Thanks, integrated.  I also adjusted the copyright notices to use year
> ranges, as requested.

Thanks.

> On Feb 20, 2015, "Carlos O'Donell" <carlos@redhat.com> wrote:
> 
>> One nit:
> 
>> -% Character width according to Unicode 5.0.0.
>> +% Character width according to Unicode 7.0.0.
>>  % - Default width is 1.
>>  % - Double-width characters have width 2; generated from
>>  %        "grep '^[^;]*;[WF]' EastAsianWidth.txt"
>> -%   and  "grep '^[^;]*;[^WF]' EastAsianWidth.txt"
>>  % - Non-spacing characters have width 0; generated from PropList.txt or
>>  %   "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
>>  % - Format control characters have width 0; generated from
>>  %   "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
>> -% - Zero width characters have width 0; generated from
>> -%   "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"
> 
>> Why even mention the `grep` to be used to generate this data?
>> It should just say to use the scripts. Nobody should be confused
>> that this data was actually generated by this method. Nor do I want
>> anyone doing it this way ever again.
> 
>> Thus shouldn't `write_header_width` simply not output any of this
>> stuff? I understand we're trying to minimize the initial diff, but
>> in cleanup, we should remove all of this and just say:
> 
>> "% Character width according to Unicode 7.0.0."
> 
> I don't know enough about Unicode to tell whether we've extracted all of
> the width information encoded in it, but I have verified that the behavior
> encoded in the Python script is equivalent to what is described in the
> comments, so I decided not to act on this right away.  I guess we might
> want to tweak the comments to make what's going on clearer, instead of
> just dropping the info, although I wouldn't oppose that either.
> 
> Does anyone else have thoughts to share on this?
> 
> Mike FABIAN, should you want to tackle this, would you please submit a
> patch to this list, with a proper ChangeLog entry, so that it can be
> installed as written by yourself?

Yes, please take this up with Mike and make sure we clean it up.
My preference is to remove the comment entirely. 

> Here's the patch I'm testing.  Ok to install?
 
Yes, OK to install.
 
> Amendments to Unicode 7 update.
> 
> From: Alexandre Oliva <aoliva@redhat.com>
> 
> for  ChangeLog
> 
> 	* include/stdc-predef.h (__STDC_ISO_10646__): Update to
> 	201304L, for Unicode 7.

OK.

> for  localedata/ChangeLog
> 
> 	* unicode-gen/ctype_compatibility.py: Use date ranges in
> 	copyright notice.
> 	* unicode-gen/ctype_compatibility_test_cases.py: Likewise.
> 	* unicode-gen/gen_unicode_ctype.py: Likewise.
> 	* unicode-gen/utf8_compatibility.py: Likewise.
> 	* unicode-gen/utf8_gen.py: Likewise.  Use upper case for
> 	global variables, use tuples for global constant arrays.  From
> 	Mike FABIAN.  Suggested by Mike Frysinger <vapier@gentoo.org>.
> ---
>  include/stdc-predef.h                              |   11 ++++++++---
>  localedata/unicode-gen/ctype_compatibility.py      |    2 +-
>  .../unicode-gen/ctype_compatibility_test_cases.py  |    2 +-
>  localedata/unicode-gen/gen_unicode_ctype.py        |    2 +-
>  localedata/unicode-gen/utf8_compatibility.py       |    2 +-
>  localedata/unicode-gen/utf8_gen.py                 |   20 ++++++++++----------
>  6 files changed, 22 insertions(+), 17 deletions(-)
> 
> diff --git a/include/stdc-predef.h b/include/stdc-predef.h
> index 1d6a4eb..e5f1139 100644
> --- a/include/stdc-predef.h
> +++ b/include/stdc-predef.h
> @@ -49,9 +49,14 @@
>  # define __STDC_IEC_559_COMPLEX__	1
>  #endif
>  
> -/* wchar_t uses ISO/IEC 10646 (2nd ed., published 2011-03-15) /
> -   Unicode 6.0.  */
> -#define __STDC_ISO_10646__		201103L
> +/* wchar_t uses Unicode 7.0.0.  Version 7.0 of the Unicode Standard is
> +   synchronized with ISO/IEC 10646:2012, plus Amendments 1 (published
> +   on April, 2013) and 2 (not yet published as of February, 2015).
> +   Additionally, it includes the accelerated publication of U+20BD
> +   RUBLE SIGN.  Therefore Unicode 7.0.0 is between 10646:2012 and
> +   10646:2014, and so we use the date ISO/IEC 10646:2012 Amd.1 was
> +   published.  */

OK. Excellent comment.

> +#define __STDC_ISO_10646__		201304L
>  
>  /* We do not support C11 <threads.h>.  */
>  #define __STDC_NO_THREADS__		1
> diff --git a/localedata/unicode-gen/ctype_compatibility.py b/localedata/unicode-gen/ctype_compatibility.py
> index 19e9ee5..0d67f29 100755
> --- a/localedata/unicode-gen/ctype_compatibility.py
> +++ b/localedata/unicode-gen/ctype_compatibility.py
> @@ -1,6 +1,6 @@
>  #!/usr/bin/python3
>  # -*- coding: utf-8 -*-
> -# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
> +# Copyright (C) 2014-2015 Free Software Foundation, Inc.
>  # This file is part of the GNU C Library.
>  #
>  # The GNU C Library is free software; you can redistribute it and/or
> diff --git a/localedata/unicode-gen/ctype_compatibility_test_cases.py b/localedata/unicode-gen/ctype_compatibility_test_cases.py
> index ab7f6dd..34e6de4 100644
> --- a/localedata/unicode-gen/ctype_compatibility_test_cases.py
> +++ b/localedata/unicode-gen/ctype_compatibility_test_cases.py
> @@ -1,5 +1,5 @@
>  # -*- coding: utf-8 -*-
> -# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
> +# Copyright (C) 2014-2015 Free Software Foundation, Inc.
>  # This file is part of the GNU C Library.
>  #
>  # The GNU C Library is free software; you can redistribute it and/or
> diff --git a/localedata/unicode-gen/gen_unicode_ctype.py b/localedata/unicode-gen/gen_unicode_ctype.py
> index 559af79..0c74f2a 100755
> --- a/localedata/unicode-gen/gen_unicode_ctype.py
> +++ b/localedata/unicode-gen/gen_unicode_ctype.py
> @@ -1,7 +1,7 @@
>  #!/usr/bin/python3
>  #
>  # Generate a Unicode conforming LC_CTYPE category from a UnicodeData file.
> -# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
> +# Copyright (C) 2014-2015 Free Software Foundation, Inc.
>  # This file is part of the GNU C Library.
>  # Based on gen-unicode-ctype.c by Bruno Haible <haible@clisp.cons.org>, 2000.
>  #
> diff --git a/localedata/unicode-gen/utf8_compatibility.py b/localedata/unicode-gen/utf8_compatibility.py
> index e11327b..b84a1eb 100755
> --- a/localedata/unicode-gen/utf8_compatibility.py
> +++ b/localedata/unicode-gen/utf8_compatibility.py
> @@ -1,6 +1,6 @@
>  #!/usr/bin/python3
>  # -*- coding: utf-8 -*-
> -# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
> +# Copyright (C) 2014-2015 Free Software Foundation, Inc.
>  # This file is part of the GNU C Library.
>  #
>  # The GNU C Library is free software; you can redistribute it and/or
> diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
> index 670a628..f1b88f5 100755
> --- a/localedata/unicode-gen/utf8_gen.py
> +++ b/localedata/unicode-gen/utf8_gen.py
> @@ -1,6 +1,6 @@
>  #!/usr/bin/python3
>  # -*- coding: utf-8 -*-
> -# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
> +# Copyright (C) 2014-2015 Free Software Foundation, Inc.
>  # This file is part of the GNU C Library.
>  #
>  # The GNU C Library is free software; you can redistribute it and/or
> @@ -33,21 +33,21 @@ import re
>  # Auxiliary tables for Hangul syllable names, see the Unicode 3.0 book,
>  # sections 3.11 and 4.4.
>  
> -jamo_initial_short_name = [
> +JAMO_INITIAL_SHORT_NAME = (
>      'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
>      'C', 'K', 'T', 'P', 'H'
> -]
> +)
>  
> -jamo_medial_short_name = [
> +JAMO_MEDIAL_SHORT_NAME = (
>      'A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O', 'WA', 'WAE', 'OE',
>      'YO', 'U', 'WEO', 'WE', 'WI', 'YU', 'EU', 'YI', 'I'
> -]
> +)
>  
> -jamo_final_short_name = [
> +JAMO_FINAL_SHORT_NAME = (
>      '', 'G', 'GG', 'GS', 'N', 'NI', 'NH', 'D', 'L', 'LG', 'LM', 'LB', 'LS',
>      'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS', 'NG', 'J', 'C', 'K', 'T',
>      'P', 'H'
> -]
> +)
>  
>  def ucs_symbol(code_point):
>      '''Return the UCS symbol string for a Unicode character.'''
> @@ -74,9 +74,9 @@ def process_range(start, end, outfile, name):
>              index2, index3 = divmod(i - 0xaC00, 28)
>              index1, index2 = divmod(index2, 21)
>              hangul_syllable_name = 'HANGUL SYLLABLE ' \
> -                                   + jamo_initial_short_name[index1] \
> -                                   + jamo_medial_short_name[index2] \
> -                                   + jamo_final_short_name[index3]
> +                                   + JAMO_INITIAL_SHORT_NAME[index1] \
> +                                   + JAMO_MEDIAL_SHORT_NAME[index2] \
> +                                   + JAMO_FINAL_SHORT_NAME[index3]
>              outfile.write('{:<11s} {:<12s} {:s}\n'.format(
>                  ucs_symbol(i), convert_to_hex(i),
>                  hangul_syllable_name))
> 
> 

OK.

Cheers,
Carlos.

Patch

diff --git a/include/stdc-predef.h b/include/stdc-predef.h
index 1d6a4eb..e5f1139 100644
--- a/include/stdc-predef.h
+++ b/include/stdc-predef.h
@@ -49,9 +49,14 @@ 
 # define __STDC_IEC_559_COMPLEX__	1
 #endif
 
-/* wchar_t uses ISO/IEC 10646 (2nd ed., published 2011-03-15) /
-   Unicode 6.0.  */
-#define __STDC_ISO_10646__		201103L
+/* wchar_t uses Unicode 7.0.0.  Version 7.0 of the Unicode Standard is
+   synchronized with ISO/IEC 10646:2012, plus Amendments 1 (published
+   on April, 2013) and 2 (not yet published as of February, 2015).
+   Additionally, it includes the accelerated publication of U+20BD
+   RUBLE SIGN.  Therefore Unicode 7.0.0 is between 10646:2012 and
+   10646:2014, and so we use the date ISO/IEC 10646:2012 Amd.1 was
+   published.  */
+#define __STDC_ISO_10646__		201304L
 
 /* We do not support C11 <threads.h>.  */
 #define __STDC_NO_THREADS__		1
diff --git a/localedata/unicode-gen/ctype_compatibility.py b/localedata/unicode-gen/ctype_compatibility.py
index 19e9ee5..0d67f29 100755
--- a/localedata/unicode-gen/ctype_compatibility.py
+++ b/localedata/unicode-gen/ctype_compatibility.py
@@ -1,6 +1,6 @@ 
 #!/usr/bin/python3
 # -*- coding: utf-8 -*-
-# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
+# Copyright (C) 2014-2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
 #
 # The GNU C Library is free software; you can redistribute it and/or
diff --git a/localedata/unicode-gen/ctype_compatibility_test_cases.py b/localedata/unicode-gen/ctype_compatibility_test_cases.py
index ab7f6dd..34e6de4 100644
--- a/localedata/unicode-gen/ctype_compatibility_test_cases.py
+++ b/localedata/unicode-gen/ctype_compatibility_test_cases.py
@@ -1,5 +1,5 @@ 
 # -*- coding: utf-8 -*-
-# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
+# Copyright (C) 2014-2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
 #
 # The GNU C Library is free software; you can redistribute it and/or
diff --git a/localedata/unicode-gen/gen_unicode_ctype.py b/localedata/unicode-gen/gen_unicode_ctype.py
index 559af79..0c74f2a 100755
--- a/localedata/unicode-gen/gen_unicode_ctype.py
+++ b/localedata/unicode-gen/gen_unicode_ctype.py
@@ -1,7 +1,7 @@ 
 #!/usr/bin/python3
 #
 # Generate a Unicode conforming LC_CTYPE category from a UnicodeData file.
-# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
+# Copyright (C) 2014-2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
 # Based on gen-unicode-ctype.c by Bruno Haible <haible@clisp.cons.org>, 2000.
 #
diff --git a/localedata/unicode-gen/utf8_compatibility.py b/localedata/unicode-gen/utf8_compatibility.py
index e11327b..b84a1eb 100755
--- a/localedata/unicode-gen/utf8_compatibility.py
+++ b/localedata/unicode-gen/utf8_compatibility.py
@@ -1,6 +1,6 @@ 
 #!/usr/bin/python3
 # -*- coding: utf-8 -*-
-# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
+# Copyright (C) 2014-2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
 #
 # The GNU C Library is free software; you can redistribute it and/or
diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index 670a628..f1b88f5 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -1,6 +1,6 @@ 
 #!/usr/bin/python3
 # -*- coding: utf-8 -*-
-# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
+# Copyright (C) 2014-2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
 #
 # The GNU C Library is free software; you can redistribute it and/or
@@ -33,21 +33,21 @@  import re
 # Auxiliary tables for Hangul syllable names, see the Unicode 3.0 book,
 # sections 3.11 and 4.4.
 
-jamo_initial_short_name = [
+JAMO_INITIAL_SHORT_NAME = (
     'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
     'C', 'K', 'T', 'P', 'H'
-]
+)
 
-jamo_medial_short_name = [
+JAMO_MEDIAL_SHORT_NAME = (
     'A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O', 'WA', 'WAE', 'OE',
     'YO', 'U', 'WEO', 'WE', 'WI', 'YU', 'EU', 'YI', 'I'
-]
+)
 
-jamo_final_short_name = [
+JAMO_FINAL_SHORT_NAME = (
     '', 'G', 'GG', 'GS', 'N', 'NI', 'NH', 'D', 'L', 'LG', 'LM', 'LB', 'LS',
     'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS', 'NG', 'J', 'C', 'K', 'T',
     'P', 'H'
-]
+)
 
 def ucs_symbol(code_point):
     '''Return the UCS symbol string for a Unicode character.'''
@@ -74,9 +74,9 @@  def process_range(start, end, outfile, name):
             index2, index3 = divmod(i - 0xaC00, 28)
             index1, index2 = divmod(index2, 21)
             hangul_syllable_name = 'HANGUL SYLLABLE ' \
-                                   + jamo_initial_short_name[index1] \
-                                   + jamo_medial_short_name[index2] \
-                                   + jamo_final_short_name[index3]
+                                   + JAMO_INITIAL_SHORT_NAME[index1] \
+                                   + JAMO_MEDIAL_SHORT_NAME[index2] \
+                                   + JAMO_FINAL_SHORT_NAME[index3]
             outfile.write('{:<11s} {:<12s} {:s}\n'.format(
                 ucs_symbol(i), convert_to_hex(i),
                 hangul_syllable_name))
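
The index arithmetic in the last hunk can be tried in isolation.  The
sketch below is a standalone reimplementation, not the code of
utf8_gen.py: the function and constant names are invented, the tables use
the Jamo short names from the Unicode Character Database (for example NJ
for U+11AC), and the result can be compared with Python's unicodedata
module.

```python
import unicodedata

# Jamo short names: 19 initial consonants, 21 vowels, 28 finals (incl. none).
JAMO_INITIAL = (
    'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
    'C', 'K', 'T', 'P', 'H'
)
JAMO_MEDIAL = (
    'A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O', 'WA', 'WAE', 'OE',
    'YO', 'U', 'WEO', 'WE', 'WI', 'YU', 'EU', 'YI', 'I'
)
JAMO_FINAL = (
    '', 'G', 'GG', 'GS', 'N', 'NJ', 'NH', 'D', 'L', 'LG', 'LM', 'LB', 'LS',
    'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS', 'NG', 'J', 'C', 'K', 'T',
    'P', 'H'
)

HANGUL_BASE = 0xAC00
HANGUL_COUNT = len(JAMO_INITIAL) * len(JAMO_MEDIAL) * len(JAMO_FINAL)


def hangul_syllable_name(code_point):
    """Compose the character name of a precomposed Hangul syllable."""
    offset = code_point - HANGUL_BASE
    if not 0 <= offset < HANGUL_COUNT:
        raise ValueError('not a Hangul syllable: U+{:04X}'.format(code_point))
    # Peel off the final consonant first, then split initial and vowel.
    rest, final_index = divmod(offset, len(JAMO_FINAL))
    initial_index, medial_index = divmod(rest, len(JAMO_MEDIAL))
    return ('HANGUL SYLLABLE '
            + JAMO_INITIAL[initial_index]
            + JAMO_MEDIAL[medial_index]
            + JAMO_FINAL[final_index])


for cp in (0xAC00, 0xD55C, 0xD7A3):
    print('U+{:04X} {}'.format(cp, hangul_syllable_name(cp)))
    print('        {}'.format(unicodedata.name(chr(cp))))
```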