Message ID | s9d4ln8q4f0.fsf@taka.site |
---|---|
Series | update collation data from Unicode / ISO 14651 |
On 01/26/2018 02:51 AM, Mike FABIAN wrote:
>
> This set of patches updates our
> glibc/localedata/locales/iso14651_t1_common file to the latest
> available version from ISO and adapts the collation rules of all
> locales using “copy "iso14651_t1"” to the changes in the new file.
>
> The ISO standard 14651:2016 is available here:
>
> ISO/IEC 14651:2016: https://www.iso.org/standard/68309.html
>
> And a POSIX style LC_COLLATE file is downloadable from:
>
> http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
> http://standards.iso.org/ittf/PubliclyAvailableStandards/c068309_ISO_IEC_14651_2016.zip
>
> This .zip file contains a ISO14651_2017_TABLE1_en.txt which is in a
> similar format as our current iso14651_t1_common and can be used as an
> update.
>
> That file is unfortunately up-to-date only with Unicode 8.0.0,
> but that is already a huge improvement over what we have now.
>
> Also, that file contained some errors which needed to be fixed.
> Seems strange for a file released by ISO, but it really contained
> some errors.
>
> And as the names for most collation symbols have been changed, all the
> collation rules of locales using “copy "iso14651_t1"” needed to be
> updated.
>
> While doing that, I made the collation rules of all locales I touched
> agree with the CLDR collation rules. glibc has several locales which are
> not in CLDR, for these I just adapted the existing rules.

Thanks for doing this work!
On 01/26/2018 02:51 AM, Mike FABIAN wrote:
>
> This set of patches updates our
> glibc/localedata/locales/iso14651_t1_common file to the latest
> available version from ISO and adapts the collation rules of all
> locales using “copy "iso14651_t1"” to the changes in the new file.
>
> The ISO standard 14651:2016 is available here:

What about ISO/IEC 14651:2016/Amd.1:2017?

It looks like it updates things to Unicode 9.0?

In particular ISO14651_2017_TABLE1_en.txt matches Amd.1:2017, and
*not* the 2016 version.

> ISO/IEC 14651:2016: https://www.iso.org/standard/68309.html
>
> And a POSIX style LC_COLLATE file is downloadable from:
>
> http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
> http://standards.iso.org/ittf/PubliclyAvailableStandards/c068309_ISO_IEC_14651_2016.zip
>
> This .zip file contains a ISO14651_2017_TABLE1_en.txt which is in a
> similar format as our current iso14651_t1_common and can be used as an
> update.
>

To be clear, the text file is not in the above zip, it is in the associated
"Electronic inserts" zip file which is part of the published standard.

http://standards.iso.org/ittf/PubliclyAvailableStandards/c068309_ISO_IEC_14651_2016_Electronic_inserts.zip

With this additional zip file you can review the tabular data to make
comparisons and review the patches.

> That file is unfortunately up-to-date only with Unicode 8.0.0,
> but that is already a huge improvement over what we have now.

This doesn't seem correct given the data in Amd.1:2017:
~~~
The current Common Template Table reflects the repertoire of
characters of Unicode 9.0, included in
ISO/IEC 10646:2014 plus its Amendments 1 and 2, plus 273 new
characters that will be included in the
fifth edition of ISO/IEC 10646.
~~~

> Also, that file contained some errors which needed to be fixed.
> Seems strange for a file released by ISO, but it really contained
> some errors.
>
> And as the names for most collation symbols have been changed, all the
> collation rules of locales using “copy "iso14651_t1"” needed to be
> updated.
>
> While doing that, I made the collation rules of all locales I touched
> agree with the CLDR collation rules. glibc has several locales which are
> not in CLDR, for these I just adapted the existing rules.

In summary:

* Can we get clarification of exactly which standard we are updating to?
  Is it just ISO/IEC 14651:2016 or ISO/IEC 14651:2016/Amd.1:2017?
On 01/26/2018 02:51 AM, Mike FABIAN wrote:
> This .zip file contains a ISO14651_2017_TABLE1_en.txt which is in a
> similar format as our current iso14651_t1_common and can be used as an
> update.

Did you mean to write ISO14651_2016_TABLE1_en.txt?

% Autogenerated Common Template Table
% created from unidata-9.0.0.txt

See:
http://standards.iso.org/iso-iec/14651/ed-4/

Again, this looks like it aligns with Amd.1:2017 and Unicode 9.
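The header check described above (reading the "created from unidata-..." comment to see which Unicode version a table was generated from) could be scripted roughly as follows. This is only a sketch: the function name is made up for illustration, and it assumes that `%` is the comment character and that the header comment has exactly the form quoted in this message.

```python
import re


def unicode_version_from_header(lines, comment_char="%"):
    """Return the Unicode version named in a table's header comments.

    Looks for a comment of the form '% created from unidata-X.Y.Z.txt'
    (the form quoted in this thread) and returns 'X.Y.Z', or None when
    no such comment is found.
    """
    pattern = re.compile(r"created from unidata-(\d+(?:\.\d+)*)\.txt")
    for line in lines:
        stripped = line.strip()
        # Only comment lines are considered part of the header.
        if not stripped.startswith(comment_char):
            continue
        match = pattern.search(stripped)
        if match:
            return match.group(1)
    return None


header = [
    "% Autogenerated Common Template Table",
    "% created from unidata-9.0.0.txt",
]
print(unicode_version_from_header(header))  # prints 9.0.0
```

Run against the first lines of each candidate TABLE1 file, this would show at a glance which file corresponds to which Unicode release.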
Carlos O'Donell <carlos@redhat.com> wrote:

> On 01/26/2018 02:51 AM, Mike FABIAN wrote:
>>
>> This set of patches updates our
>> glibc/localedata/locales/iso14651_t1_common file to the latest
>> available version from ISO and adapts the collation rules of all
>> locales using “copy "iso14651_t1"” to the changes in the new file.
>>
>> The ISO standard 14651:2016 is available here:
>
> What about ISO/IEC 14651:2016/Amd.1:2017?
>
> It looks like it updates things to Unicode 9.0?
>
> In particular ISO14651_2017_TABLE1_en.txt matches Amd.1:2017, and
> *not* the 2016 version.

I used ISO14651_2015_TABLE1_en.txt because I did not find
ISO14651_2017_TABLE1_en.txt. I’ll update to ISO14651_2017_TABLE1_en.txt
in the next version of my patch series.

>> ISO/IEC 14651:2016: https://www.iso.org/standard/68309.html
>>
>> And a POSIX style LC_COLLATE file is downloadable from:
>>
>> http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
>> http://standards.iso.org/ittf/PubliclyAvailableStandards/c068309_ISO_IEC_14651_2016.zip
>>
>> This .zip file contains a ISO14651_2017_TABLE1_en.txt which is in a
>> similar format as our current iso14651_t1_common and can be used as an
>> update.
>>
>
> To be clear, the text file is not in the above zip, it is in the associated
> "Electronic inserts" zip file which is part of the published standard.
>
> http://standards.iso.org/ittf/PubliclyAvailableStandards/c068309_ISO_IEC_14651_2016_Electronic_inserts.zip
>
> With this additional zip file you can review the tabular data to make
> comparisons and review the patches.
>
>> That file is unfortunately up-to-date only with Unicode 8.0.0,
>> but that is already a huge improvement over what we have now.
>
> This doesn't seem correct given the data in Amd.1:2017:
> ~~~
> The current Common Template Table reflects the repertoire of
> characters of Unicode 9.0, included in
> ISO/IEC 10646:2014 plus its Amendments 1 and 2, plus 273 new
> characters that will be included in the
> fifth edition of ISO/IEC 10646.
> ~~~

Yes, it was Unicode 8.0.0 because I used the older file
ISO14651_2015_TABLE1_en.txt.

I’ll update to the newer ISO14651_2017_TABLE1_en.txt file.

>> Also, that file contained some errors which needed to be fixed.
>> Seems strange for a file released by ISO, but it really contained
>> some errors.
>>
>> And as the names for most collation symbols have been changed, all the
>> collation rules of locales using “copy "iso14651_t1"” needed to be
>> updated.
>>
>> While doing that, I made the collation rules of all locales I touched
>> agree with the CLDR collation rules. glibc has several locales which are
>> not in CLDR, for these I just adapted the existing rules.
>
> In summary:
>
> * Can we get clarification of exactly which standard we are updating to?
>   Is it just ISO/IEC 14651:2016 or ISO/IEC 14651:2016/Amd.1:2017?
Carlos O'Donell <carlos@redhat.com> wrote:

> On 01/26/2018 02:51 AM, Mike FABIAN wrote:
>> This .zip file contains a ISO14651_2017_TABLE1_en.txt which is in a
>> similar format as our current iso14651_t1_common and can be used as an
>> update.
>
> Did you mean to write ISO14651_2016_TABLE1_en.txt?
>
> % Autogenerated Common Template Table
> % created from unidata-9.0.0.txt
>
> See:
> http://standards.iso.org/iso-iec/14651/ed-4/
>
> Again, this looks like it aligns with Amd.1:2017 and Unicode 9.

Actually I used ISO14651_2015_TABLE1_en.txt. But I’ll update
in the next version of my patch series.
Carlos O'Donell <carlos@redhat.com> wrote:

> On 01/26/2018 02:51 AM, Mike FABIAN wrote:
>> This .zip file contains a ISO14651_2017_TABLE1_en.txt which is in a
>> similar format as our current iso14651_t1_common and can be used as an
>> update.
>
> Did you mean to write ISO14651_2016_TABLE1_en.txt?
>
> % Autogenerated Common Template Table
> % created from unidata-9.0.0.txt

Actually I used ISO14651_2015_TABLE1_en.txt.

> See:
> http://standards.iso.org/iso-iec/14651/ed-4/
>
> Again, this looks like it aligns with Amd.1:2017 and Unicode 9.

I’ll try again with ISO14651_2016_TABLE1_en.txt now, that
seems to be the latest version.
On 01/27/2018 01:20 AM, Mike FABIAN wrote:
> Carlos O'Donell <carlos@redhat.com> wrote:
>
>> On 01/26/2018 02:51 AM, Mike FABIAN wrote:
>>> This .zip file contains a ISO14651_2017_TABLE1_en.txt which is in a
>>> similar format as our current iso14651_t1_common and can be used as an
>>> update.
>>
>> Did you mean to write ISO14651_2016_TABLE1_en.txt?
>>
>> % Autogenerated Common Template Table
>> % created from unidata-9.0.0.txt
>
> Actually I used ISO14651_2015_TABLE1_en.txt.
>
>> See:
>> http://standards.iso.org/iso-iec/14651/ed-4/
>>
>> Again, this looks like it aligns with Amd.1:2017 and Unicode 9.
>
> I’ll try again with ISO14651_2016_TABLE1_en.txt now, that
> seems to be the latest version.

OK, good! The 2016_TABLE1 seems to be for the Amd.1:2017 which matches
what I would expect and lines up with Unicode 9.

In which case we are only 1 Unicode revision behind.
On Sat, 27 Jan 2018, Carlos O'Donell wrote:

> > I’ll try again with ISO14651_2016_TABLE1_en.txt now, that
> > seems to be the latest version.
>
> OK, good! The 2016_TABLE1 seems to be for the Amd.1:2017 which matches
> what I would expect and lines up with Unicode 9.
>
> In which case we are only 1 Unicode revision behind.

Since the tables are apparently generated in an automated way from the
Unicode data (according to the comments on them), is the source for that
automation available somewhere so we could use it and work from the latest
Unicode data directly?
On 01/29/2018 08:03 AM, Joseph Myers wrote:
> On Sat, 27 Jan 2018, Carlos O'Donell wrote:
>
>>> I’ll try again with ISO14651_2016_TABLE1_en.txt now, that
>>> seems to be the latest version.
>>
>> OK, good! The 2016_TABLE1 seems to be for the Amd.1:2017 which matches
>> what I would expect and lines up with Unicode 9.
>>
>> In which case we are only 1 Unicode revision behind.
>
> Since the tables are apparently generated in an automated way from the
> Unicode data (according to the comments on them), is the source for that
> automation available somewhere so we could use it and work from the latest
> Unicode data directly?

The source of this automation is not publicly available. We would have
to track down those working on the standard and work with them to get
the scripts.

Even if we had the latest set of scripts we could not use them to process
the latest Unicode data directly because the result would not match any
published version of the ISO 14651 standard.

We could however have used the scripts to process Unicode 9 data to
simplify our own processes. Thus we would convert from raw Unicode data
to our own internal formats rather than through any indirect means via
ISO 14651. Then all we would need is a further verification pass to
ensure that the published ISO 14651 matches what we generated from the
Unicode data.

So in summary:

Today we have:

* Automated glibc process to convert Unicode data into Unicode-based
  locale data.

* Manual glibc process to convert ISO 14651 data into locale data.

In the future it would be nice to have:

* Get automation scripts from ISO 14651 group to process Unicode data
  into ISO 14651 format data.

* Unify glibc process to turn Unicode data (at two possible revisions)
  into our normal Unicode-based locale data, and our ISO 14651-based
  locale data.

* Add a verification pass to ensure the published ISO 14651 data table
  matches what we generated for our ISO 14651-based locale data.

Does that make sense?
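The verification pass proposed above might look something like the following sketch. It is not an existing glibc script. It assumes that each table entry is a line starting with one or more `<Uxxxx>` symbols, followed by whitespace and a weight string, and that `%` starts a comment. The function names and the sample entries are invented for illustration and are not real ISO 14651 data.

```python
import re

# One or more <Uxxxx> symbols, whitespace, then the weight string (assumed layout).
ENTRY = re.compile(r"^((?:<U[0-9A-Fa-f]{4,8}>)+)\s+(\S+)")


def parse_entries(lines, comment_char="%"):
    """Map each <Uxxxx> key to its weight string, ignoring comments."""
    entries = {}
    for line in lines:
        body = line.split(comment_char, 1)[0].strip()
        match = ENTRY.match(body)
        if match:
            entries[match.group(1)] = match.group(2)
    return entries


def compare_tables(published, generated):
    """Report keys that are missing, extra, or weighted differently."""
    pub = parse_entries(published)
    gen = parse_entries(generated)
    return {
        "missing": sorted(pub.keys() - gen.keys()),
        "extra": sorted(gen.keys() - pub.keys()),
        "different": sorted(k for k in pub.keys() & gen.keys() if pub[k] != gen[k]),
    }


# Hypothetical sample entries, not taken from any published table.
published = [
    "% hypothetical sample",
    "<U0041> <w1>;<BASE>;<CAP>;IGNORE % A",
    "<U0042> <w2>;<BASE>;<CAP>;IGNORE % B",
    "<U0043> <w3>;<BASE>;<CAP>;IGNORE % C",
]
generated = [
    "<U0041> <w1>;<BASE>;<CAP>;IGNORE",
    "<U0042> <w9>;<BASE>;<CAP>;IGNORE",
    "<U0044> <w4>;<BASE>;<CAP>;IGNORE",
]
print(compare_tables(published, generated))
```

An empty report in all three categories would mean the generated data matches the published table entry for entry. A textual comparison like this is stricter than necessary if only the resulting ordering matters.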
On Mon, 29 Jan 2018, Carlos O'Donell wrote:

> * Get automation scripts from ISO 14651 group to process Unicode data
>   into ISO 14651 format data.

- Hopefully under a free software license such as the Unicode, Inc.
  License Agreement for Data Files and Software.

Ultimately the point is to have correct Unicode collation - and if the
overall effect of the collation definitions in glibc is as intended, those
definitions don't need to be textually close to those from ISO 14651, and
the generators don't need to be the same, if there are different ways to
achieve the same resulting ordering.
On 01/29/2018 09:42 AM, Joseph Myers wrote:
> On Mon, 29 Jan 2018, Carlos O'Donell wrote:
>
>> * Get automation scripts from ISO 14651 group to process Unicode data
>>   into ISO 14651 format data.
>
> - Hopefully under a free software license such as the Unicode, Inc.
>   License Agreement for Data Files and Software.
>
> Ultimately the point is to have correct Unicode collation - and if the
> overall effect of the collation definitions in glibc is as intended, those
> definitions don't need to be textually close to those from ISO 14651, and
> the generators don't need to be the same, if there are different ways to
> achieve the same resulting ordering.

Ultimately I think the goal of the project should be to harmonize as much
as possible with Unicode, CLDR, ISO 14651, etc.

This harmonization includes collation, but only insofar as we *can*
harmonize with Unicode. Mike and I have talked about this on-and-off over
the years, and we don't know if the POSIX collation rules are semantically
sufficient to match the Unicode Collation Algorithm rules, particularly
when it comes to complex Asian collations. We don't know if glibc can
actually sort all Japanese symbols correctly, but we will endeavour to
harmonize collation up to the point where we document the collation
failings.

Collation is certainly the most difficult update for glibc. The recent
test cases that Mike is adding with the ISO 14651 update make a huge
difference in providing stability guarantees and rationale for the
verification of correct sorting.
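As a rough illustration of what a sorting test can check, the sketch below verifies that a list of strings is already in the collation order of a given locale, using `strxfrm` from Python's standard `locale` module. This is not the test harness from the patch series. The function name is hypothetical, and the example uses the "C" locale only because it is available everywhere.

```python
import locale


def is_sorted_under_locale(words, locale_name):
    """Return True if `words` is already in the collation order of `locale_name`."""
    previous = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm turns each string into a key that compares in collation order.
        keys = [locale.strxfrm(word) for word in words]
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)
    return all(a <= b for a, b in zip(keys, keys[1:]))


# In the "C" locale, strings compare by code point, so "B" sorts before "a".
print(is_sorted_under_locale(["A", "B", "a"], "C"))
```

A test for a real locale would pass a locale name installed on the system together with a list of words in the order that CLDR or the locale's maintainers expect.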