[0/13,BZ,#14095] update collation data from Unicode / ISO 14651

Message ID s9d4ln8q4f0.fsf@taka.site

Message

Mike FABIAN Jan. 26, 2018, 10:51 a.m. UTC
This set of patches updates our
glibc/localedata/locales/iso14651_t1_common file to the latest
available version from ISO and adapts the collation rules of all
locales using “copy "iso14651_t1"” to the changes in the new file.

The ISO standard 14651:2016 is available here:

ISO/IEC 14651:2016: https://www.iso.org/standard/68309.html

And a POSIX style LC_COLLATE file is downloadable from:

http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
http://standards.iso.org/ittf/PubliclyAvailableStandards/c068309_ISO_IEC_14651_2016.zip

This .zip file contains an ISO14651_2017_TABLE1_en.txt which is in a
format similar to our current iso14651_t1_common and can be used as an
update.

That file is unfortunately up-to-date only with Unicode 8.0.0,
but that is already a huge improvement over what we have now.

Also, that file contained some errors which needed to be fixed.
That seems strange for a file released by ISO, but it really did
contain some errors.

And as the names for most collation symbols have been changed, all the
collation rules of locales using “copy "iso14651_t1"” needed to be
updated.

While doing that, I made the collation rules of all locales I touched
agree with the CLDR collation rules. glibc has several locales which are
not in CLDR; for these I just adapted the existing rules.
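
As a rough illustration of how the affected locales could be listed, the
following Python sketch scans a directory of locale source files for the
line copy "iso14651_t1". The function name and the idea of passing the
directory as an argument are assumptions made for this example; it is not
part of the patch series.

```python
import re
from pathlib import Path

# Matches a line consisting only of:  copy "iso14651_t1"
COPY_RE = re.compile(r'^\s*copy\s+"iso14651_t1"\s*$', re.MULTILINE)


def locales_copying_iso14651(locales_dir):
    """Return the sorted names of files that contain copy "iso14651_t1"."""
    hits = []
    for path in Path(locales_dir).iterdir():
        if not path.is_file():
            continue
        # Replace undecodable bytes instead of failing on an odd file.
        text = path.read_text(encoding="utf-8", errors="replace")
        if COPY_RE.search(text):
            hits.append(path.name)
    return sorted(hits)
```

Run against a checkout, the argument would be the
glibc/localedata/locales directory mentioned above.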

Comments

Carlos O'Donell Jan. 26, 2018, 6:03 p.m. UTC | #1
On 01/26/2018 02:51 AM, Mike FABIAN wrote:
> 
> This set of patches updates our
> glibc/localedata/locales/iso14651_t1_common file to the latest
> available version from ISO and adapts the collation rules of all
> locales using “copy "iso14651_t1"” to the changes in the new file.
> 
> The ISO standard 14651:2016 is available here:
> 
> ISO/IEC 14651:2016: https://www.iso.org/standard/68309.html
> 
> And a POSIX style LC_COLLATE file is downloadable from:
> 
> http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
> http://standards.iso.org/ittf/PubliclyAvailableStandards/c068309_ISO_IEC_14651_2016.zip
> 
> This .zip file contains a ISO14651_2017_TABLE1_en.txt which is in a
> similar format as our current iso14651_t1_common and can be used as an
> update.
> 
> That file is unfortunately up-to-date only with Unicode 8.0.0,
> but that is already a huge improvement over what we have now.
> 
> Also, that file contained some errors which needed to be fixed.
> Seems strange for a file release by ISO, but it really contained
> some errors.
> 
> And as the names for most collation symbols have been changed, all the
> collation rules of locales using “copy "iso14651_t1"” needed to be
> updated.
> 
> While doing that, I made the collation rules of all locales I touched
> agree with the CLDR collation rules. glibc has several locales which are
> not in CLDR, for these I just adapted the existing rules.
 
Thanks for doing this work!
Carlos O'Donell Jan. 26, 2018, 6:14 p.m. UTC | #2
On 01/26/2018 02:51 AM, Mike FABIAN wrote:
> 
> This set of patches updates our
> glibc/localedata/locales/iso14651_t1_common file to the latest
> available version from ISO and adapts the collation rules of all
> locales using “copy "iso14651_t1"” to the changes in the new file.
> 
> The ISO standard 14651:2016 is available here:

What about ISO/IEC 14651:2016/Amd.1:2017?

It looks like it updates things to Unicode 9.0?

In particular ISO14651_2017_TABLE1_en.txt matches Amd.1:2017, and
*not* the 2016 version.

> ISO/IEC 14651:2016: https://www.iso.org/standard/68309.html
> 
> And a POSIX style LC_COLLATE file is downloadable from:
> 
> http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
> http://standards.iso.org/ittf/PubliclyAvailableStandards/c068309_ISO_IEC_14651_2016.zip
> 
> This .zip file contains a ISO14651_2017_TABLE1_en.txt which is in a
> similar format as our current iso14651_t1_common and can be used as an
> update.
>

To be clear, the text file is not in the above zip; it is in the associated
"Electronic inserts" zip file, which is part of the published standard.

http://standards.iso.org/ittf/PubliclyAvailableStandards/c068309_ISO_IEC_14651_2016_Electronic_inserts.zip

With this additional zip file you can review the tabular data to make
comparisons and review the patches.

> That file is unfortunately up-to-date only with Unicode 8.0.0,
> but that is already a huge improvement over what we have now.

This doesn't seem correct given the data in Amd.1:2017:
~~~
The current Common Template Table reflects the repertoire of characters of Unicode 9.0, included in
ISO/IEC 10646:2014 plus its Amendments 1 and 2, plus 273 new characters that will be included in the
fifth edition of ISO/IEC 10646.
~~~

> Also, that file contained some errors which needed to be fixed.
> Seems strange for a file release by ISO, but it really contained
> some errors.
> 
> And as the names for most collation symbols have been changed, all the
> collation rules of locales using “copy "iso14651_t1"” needed to be
> updated.
> 
> While doing that, I made the collation rules of all locales I touched
> agree with the CLDR collation rules. glibc has several locales which are
> not in CLDR, for these I just adapted the existing rules.

In summary:

* Can we get clarification of exactly which standard we are updating to?
  Is it just ISO/IEC 14651:2016 or ISO/IEC 14651:2016/Amd.1:2017?
Carlos O'Donell Jan. 26, 2018, 6:18 p.m. UTC | #3
On 01/26/2018 02:51 AM, Mike FABIAN wrote:
> This .zip file contains a ISO14651_2017_TABLE1_en.txt which is in a
> similar format as our current iso14651_t1_common and can be used as an
> update.

Did you mean to write ISO14651_2016_TABLE1_en.txt?

% Autogenerated Common Template Table
%   created from unidata-9.0.0.txt

See:
http://standards.iso.org/iso-iec/14651/ed-4/

Again, this looks like it aligns with Amd.1:2017 and Unicode 9.
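
A small Python sketch of this check: it reads the "created from
unidata-X.Y.Z.txt" comment quoted above and returns the Unicode version
the table claims to be generated from. The function name is hypothetical,
and the sketch assumes the header comment always has the form shown above.

```python
import re

# Header comment as quoted above:  %   created from unidata-9.0.0.txt
HEADER_RE = re.compile(
    r"^%\s*created from unidata-(\d+(?:\.\d+)*)\.txt", re.MULTILINE
)


def unicode_version_of_table(text):
    """Return the version string from the header comment, or None."""
    match = HEADER_RE.search(text)
    return match.group(1) if match else None
```

A table without such a comment yields None, so the caller has to decide
how to treat files whose origin is not stated.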
Mike FABIAN Jan. 27, 2018, 9:03 a.m. UTC | #4
Carlos O'Donell <carlos@redhat.com> wrote:

> On 01/26/2018 02:51 AM, Mike FABIAN wrote:
>> 
>> This set of patches updates our
>> glibc/localedata/locales/iso14651_t1_common file to the latest
>> available version from ISO and adapts the collation rules of all
>> locales using “copy "iso14651_t1"” to the changes in the new file.
>> 
>> The ISO standard 14651:2016 is available here:
>
> What about ISO/IEC 14651:2016/Amd.1:2017?
>
> It looks like it updates things to Unicode 9.0?
>
> In particular ISO14651_2017_TABLE1_en.txt matches Amd.1:2017, and
> *not* the 2016 version.

I used ISO14651_2015_TABLE1_en.txt because I did not find
ISO14651_2017_TABLE1_en.txt. I’ll update to ISO14651_2017_TABLE1_en.txt
in the next version of my patch series.

>> ISO/IEC 14651:2016: https://www.iso.org/standard/68309.html
>> 
>> And a POSIX style LC_COLLATE file is downloadable from:
>> 
>> http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
>> http://standards.iso.org/ittf/PubliclyAvailableStandards/c068309_ISO_IEC_14651_2016.zip
>> 
>> This .zip file contains a ISO14651_2017_TABLE1_en.txt which is in a
>> similar format as our current iso14651_t1_common and can be used as an
>> update.
>>
>
> To be clear, the text file is not in the above zip, it is in the associated
> "Eletronic inserts" zip file which is part of the published standard.
>
> http://standards.iso.org/ittf/PubliclyAvailableStandards/c068309_ISO_IEC_14651_2016_Electronic_inserts.zip
>
> With this additional zip file you can review the tabular data to make
> comparisons and review the patches.
>
>> That file is unfortunately up-to-date only with Unicode 8.0.0,
>> but that is already a huge improvement over what we have now.
>
> This doesn't seem correct given the data in Amd.1:2017:
> ~~~
> The current Common Template Table reflects the repertoire of
> characters of Unicode 9.0, included in
> ISO/IEC 10646:2014 plus its Amendments 1 and 2, plus 273 new
> characters that will be included in the
> fifth edition of ISO/IEC 10646.
> ~~~

Yes, it was Unicode 8.0.0 because I used the older file
ISO14651_2015_TABLE1_en.txt. I’ll update to the newer
ISO14651_2017_TABLE1_en.txt file.

>> Also, that file contained some errors which needed to be fixed.
>> Seems strange for a file release by ISO, but it really contained
>> some errors.
>> 
>> And as the names for most collation symbols have been changed, all the
>> collation rules of locales using “copy "iso14651_t1"” needed to be
>> updated.
>> 
>> While doing that, I made the collation rules of all locales I touched
>> agree with the CLDR collation rules. glibc has several locales which are
>> not in CLDR, for these I just adapted the existing rules.
>
> In summary:
>
> * Can we get clarification of exactly which standard we are update to?
>   Is it just ISO/IEC 14651:2016 or ISO/IEC 14651:2016/Amd.1:2017?
Mike FABIAN Jan. 27, 2018, 9:05 a.m. UTC | #5
Carlos O'Donell <carlos@redhat.com> wrote:

> On 01/26/2018 02:51 AM, Mike FABIAN wrote:
>> This .zip file contains a ISO14651_2017_TABLE1_en.txt which is in a
>> similar format as our current iso14651_t1_common and can be used as an
>> update.
>
> Did you mean to write ISO14651_2016_TABLE1_en.txt?
>
> % Autogenerated Common Template Table
> %   created from unidata-9.0.0.txt
>
> See:
> http://standards.iso.org/iso-iec/14651/ed-4/
>
> Again, this looks like it aligns with Amd.1:2017 and Unicode 9.

Actually I used ISO14651_2015_TABLE1_en.txt. But I’ll update
in the next version of my patch series.
Mike FABIAN Jan. 27, 2018, 9:20 a.m. UTC | #6
Carlos O'Donell <carlos@redhat.com> wrote:

> On 01/26/2018 02:51 AM, Mike FABIAN wrote:
>> This .zip file contains a ISO14651_2017_TABLE1_en.txt which is in a
>> similar format as our current iso14651_t1_common and can be used as an
>> update.
>
> Did you mean to write ISO14651_2016_TABLE1_en.txt?
>
> % Autogenerated Common Template Table
> %   created from unidata-9.0.0.txt

Actually I used ISO14651_2015_TABLE1_en.txt.

> See:
> http://standards.iso.org/iso-iec/14651/ed-4/
>
> Again, this looks like it aligns with Amd.1:2017 and Unicode 9.

I’ll try again with ISO14651_2016_TABLE1_en.txt now, that
seems to be the latest version.
Carlos O'Donell Jan. 27, 2018, 5:28 p.m. UTC | #7
On 01/27/2018 01:20 AM, Mike FABIAN wrote:
> Carlos O'Donell <carlos@redhat.com> wrote:
> 
>> On 01/26/2018 02:51 AM, Mike FABIAN wrote:
>>> This .zip file contains a ISO14651_2017_TABLE1_en.txt which is in a
>>> similar format as our current iso14651_t1_common and can be used as an
>>> update.
>>
>> Did you mean to write ISO14651_2016_TABLE1_en.txt?
>>
>> % Autogenerated Common Template Table
>> %   created from unidata-9.0.0.txt
> 
> Actually I used ISO14651_2015_TABLE1_en.txt.
> 
>> See:
>> http://standards.iso.org/iso-iec/14651/ed-4/
>>
>> Again, this looks like it aligns with Amd.1:2017 and Unicode 9.
> 
> I’ll try again with ISO14651_2016_TABLE1_en.txt now, that
> seems to be the latest version.

OK, good! The 2016_TABLE1 seems to be for the Amd.1:2017 which matches
what I would expect and lines up with Unicode 9.

In which case we are only one Unicode revision behind.
Joseph Myers Jan. 29, 2018, 4:03 p.m. UTC | #8
On Sat, 27 Jan 2018, Carlos O'Donell wrote:

> > I’ll try again with ISO14651_2016_TABLE1_en.txt now, that
> > seems to be the latest version.
> 
> OK, good! The 2016_TABLE1 seems to be for the Amd.1:2017 which matches
> what I would expect and lines up with Unicode 9.
> 
> In which case we are only 1 unicode revision behind.

Since the tables are apparently generated in an automated way from the 
Unicode data (according to the comments on them), is the source for that 
automation available somewhere so we could use it and work from the latest 
Unicode data directly?
Carlos O'Donell Jan. 29, 2018, 5:31 p.m. UTC | #9
On 01/29/2018 08:03 AM, Joseph Myers wrote:
> On Sat, 27 Jan 2018, Carlos O'Donell wrote:
> 
>>> I’ll try again with ISO14651_2016_TABLE1_en.txt now, that
>>> seems to be the latest version.
>>
>> OK, good! The 2016_TABLE1 seems to be for the Amd.1:2017 which matches
>> what I would expect and lines up with Unicode 9.
>>
>> In which case we are only 1 unicode revision behind.
> 
> Since the tables are apparently generated in an automated way from the 
> Unicode data (according to the comments on them), is the source for that 
> automation available somewhere so we could use it and work from the latest 
> Unicode data directly?
 
The source of this automation is not publicly available. We would have to track
down those working on the standard and work with them to get the scripts.

Even if we had the latest set of scripts we could not use it to process the latest
Unicode data directly because it would not match any published version of the
ISO 14651 standard.

We could however have used the scripts to process Unicode 9 data to simplify our
own processes. Thus we would convert from raw Unicode data to our own internal
formats rather than through any indirect means via ISO 14651. Then all we would
need is a further verification pass to ensure that the published ISO 14651 matches
what we generated from the Unicode data.

So in summary:

Today we have:

* Automated glibc process to convert Unicode data into Unicode-based locale data.
* Manual glibc process to convert ISO 14651 data into locale data.

In the future it would be nice to have:

* Get automation scripts from ISO 14651 group to process Unicode data into ISO 14651
  format data.
* Unify glibc process to turn Unicode data (at two possible revisions) into our normal
  Unicode-based locale data, and our ISO 14651-based locale data.
* Add a verification pass to ensure the published ISO 14651 data table matches what we
  generated for our ISO 14651-based locale data.

Does that make sense?
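
The verification pass in the last bullet could be sketched roughly as
follows in Python. This is only an illustration under simplifying
assumptions: it treats every line starting with "<U" as a
"symbol weights" entry, takes "%" as the comment character, and
ignores everything else, including the order of the lines. The function
names are hypothetical.

```python
def parse_weights(text, comment_char="%"):
    """Map each <U....> symbol to its weight string, whitespace removed."""
    table = {}
    for line in text.splitlines():
        # Drop the trailing comment, then surrounding whitespace.
        line = line.split(comment_char, 1)[0].strip()
        if not line.startswith("<U"):
            continue
        parts = line.split(None, 1)
        weights = "".join(parts[1].split()) if len(parts) > 1 else ""
        table[parts[0]] = weights
    return table


def compare_tables(published, generated):
    """Report entries that are missing, extra, or changed in `generated`."""
    pub = parse_weights(published)
    gen = parse_weights(generated)
    return {
        "missing": sorted(pub.keys() - gen.keys()),
        "extra": sorted(gen.keys() - pub.keys()),
        "changed": sorted(
            key for key in pub.keys() & gen.keys() if pub[key] != gen[key]
        ),
    }
```

Because the position of a line in a collation table also carries meaning,
a real verification pass would need to compare ordering as well; this
sketch only compares the per-character entries.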
Joseph Myers Jan. 29, 2018, 5:42 p.m. UTC | #10
On Mon, 29 Jan 2018, Carlos O'Donell wrote:

> * Get automation scripts from ISO 14651 group to process Unicode data 
>   into ISO 14651 format data.

- Hopefully under a free software license such as the Unicode, Inc. 
License Agreement for Data Files and Software.

Ultimately the point is to have correct Unicode collation - and if the 
overall effect of the collation definitions in glibc is as intended, those 
definitions don't need to be textually close to those from ISO 14651, and 
the generators don't need to be the same, if there are different ways to 
achieve the same resulting ordering.
Carlos O'Donell Jan. 29, 2018, 6:41 p.m. UTC | #11
On 01/29/2018 09:42 AM, Joseph Myers wrote:
> On Mon, 29 Jan 2018, Carlos O'Donell wrote:
> 
>> * Get automation scripts from ISO 14651 group to process Unicode data 
>>   into ISO 14651 format data.
> 
> - Hopefully under a free software license such as the Unicode, Inc. 
> License Agreement for Data Files and Software.
> 
> Ultimately the point is to have correct Unicode collation - and if the 
> overall effect of the collation definitions in glibc is as intended, those 
> definitions don't need to be textually close to those from ISO 14651, and 
> the generators don't need to be the same, if there are different ways to 
> achieve the same resulting ordering.
 
Ultimately I think the goal of the project should be to harmonize as much
as possible with Unicode, CLDR, and ISO 14651 etc. This harmonization includes
collation, but only in so far as we *can* harmonize with Unicode.

Mike and I have talked about this on-and-off over the years, and we don't know
if the POSIX collation rules are semantically sufficient to match the Unicode
Collation Algorithm rules, particularly when it comes to complex Asian collations.
We don't know if glibc can actually sort all Japanese symbols correctly, but we
will endeavour to harmonize collation up to the point where we can document
the remaining collation failings.

Collation is certainly the most difficult update for glibc. The recent test cases
that Mike adds with the ISO 14651 update make a huge difference in providing
stability guarantees and rationale for the verification of correct sorting.
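
To illustrate the kind of check such test cases perform, here is a
minimal Python sketch that takes a list of strings in the expected order
and reports the first adjacent pair that a comparison function puts the
other way round. It is not the test harness from the patch series; the
function name and the calling convention are assumptions made for this
example.

```python
def first_misordered(expected, cmp):
    """Return the index of the first pair the comparison disagrees with.

    `expected` is a list of strings in the order they should sort.
    `cmp(a, b)` returns a negative, zero, or positive number, like strcoll.
    The result is i if expected[i] sorts after expected[i + 1], else None.
    """
    for i in range(len(expected) - 1):
        if cmp(expected[i], expected[i + 1]) > 0:
            return i
    return None
```

With Python's standard locale module, one would first call
locale.setlocale(locale.LC_COLLATE, name) for an installed locale and
then pass locale.strcoll as cmp; which locale names are available
depends on the system.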