
[PATCHv2] Update the localedata/locales/translit_* files to Unicode 7.0.0

Message ID s9degldez4x.fsf@ari.site
State New

Commit Message

Mike FABIAN June 15, 2015, 4:04 p.m. UTC
This is an update to my earlier patches:

https://sourceware.org/ml/libc-alpha/2015-04/msg00361.html

Updates:

    - Added transliteration rules for the da, nb, nn, and sv locales
      so that, for example, "ö" is transliterated to "oe" in these
      locales. The "neutral" transliteration should map "ö" to "o"
      (for example, in English, "coöperation" as used in
      http://www.newyorker.com/humor/borowitz-report/obama-putin-agree-never-to-speak-to-each-other-again
      should be transliterated to "cooperation", not "cooeperation").
      This should fix [BZ #89].

    - lots of additions to translit_neutral
    - some more tweaks to the scripts that generate the translit
      files from Unicode data
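
The layering described in the first item can be sketched as follows.
This is only an illustration in Python, not glibc code: the two tables
and the transliterate() helper are hypothetical, and the real lookup is
done by iconv using the compiled locale data.

```python
# Hypothetical tables for illustration only; the real rules live in
# locales/translit_neutral and in the per-locale files (e.g. sv_SE).
NEUTRAL = {"ö": "o", "ä": "a", "å": "a"}
SWEDISH = {"ö": "oe", "ä": "ae", "å": "aa"}


def transliterate(text, locale_rules, neutral_rules=NEUTRAL):
    """Replace each non-ASCII character, preferring locale-specific rules."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in locale_rules:
            out.append(locale_rules[ch])
        else:
            # Fall back to the neutral rule, or "?" if there is none.
            out.append(neutral_rules.get(ch, "?"))
    return "".join(out)


print(transliterate("coöperation", {}))    # neutral rules only
print(transliterate("smörgås", SWEDISH))   # locale rules take precedence
```

With no locale rules the English word keeps a plain "o", while the
Swedish table produces the "oe"/"aa" spellings.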

I tested the patches on Fedora 22.

Can somebody review this please?

----------------------------------------------------------------------

The attached file updates these translit files to Unicode 7.0.0:

    locales/translit_circle
    locales/translit_cjk_compat
    locales/translit_combining
    locales/translit_compat
    locales/translit_font
    locales/translit_fraction

It also contains many manual updates to

    locales/translit_neutral

many of them taken from

    http://unicode.org/cldr/trac/browser/trunk/common/transforms/Latin-ASCII.xml

It does *not* update these translit files:

    locales/translit_cjk_variants
    locales/translit_hangul
    locales/translit_narrow
    locales/translit_small
    locales/translit_wide

because translit_cjk_variants is apparently not generated from Unicode
data.

The other files (translit_hangul, translit_narrow, translit_small, and
translit_wide) are generated, but they would not change with Unicode
7.0.0 data; nothing seems to have changed in Unicode that affects these
files. I could add scripts to generate these as well, but they would
just reproduce the current files.  Maybe I should do that nevertheless,
just to be able to see whether something changes in the future (quite
unlikely, I think).

Some code was duplicated in utf8_gen.py and utf8_compatibility.py;
Alexandre Oliva had already suggested splitting it out into a separate
file. As the new generator scripts added by this patch needed this code
again, I saw that Alexandre was right and moved the reusable code
into a separate file, unicode_utils.py.

Not everything in the generated translit_* files could be reproduced
exactly from Unicode data; there were some manual additions in the
files. These were not mentioned in the comments at the top of the files:
the “grep” and “sed” expressions given there reproduce most of the
contents of these files, but not everything.

Where the manual additions seemed to make sense, I added manual
hacks to the new generator scripts gen_translit_*.py to reproduce
these manual additions as well.
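
To illustrate why generated rules and manual additions both occur, here
is a minimal Python sketch of deriving a "strip the combining marks"
rule from Unicode data. It is not the gen_translit_*.py code from the
patch: format_ucs() and combining_rule() are hypothetical names, the
eight-digit form for code points above U+FFFF is an assumption, and
Python's unicodedata module uses whatever Unicode version ships with
the interpreter rather than 7.0.0 specifically.

```python
import unicodedata


def format_ucs(ch):
    """Format a character as a locale-source symbol, e.g. <U00C5>."""
    code_point = ord(ch)
    if code_point < 0x10000:
        return "<U{:04X}>".format(code_point)
    # Assumption: code points beyond the BMP are written with 8 digits.
    return "<U{:08X}>".format(code_point)


def combining_rule(ch):
    """Return a rule that drops combining marks, or None if not applicable."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = [c for c in decomposed if unicodedata.combining(c) == 0]
    if len(base) == len(decomposed) or not base:
        return None  # nothing to strip, or nothing left afterwards
    target = "".join(format_ucs(c) for c in base)
    if len(base) > 1:
        target = '"' + target + '"'
    return "% {}\n{} {}".format(unicodedata.name(ch), format_ucs(ch), target)


print(combining_rule("\u00c5"))   # A WITH RING ABOVE decomposes to A + U+030A
print(combining_rule("\u00d8"))   # O WITH STROKE has no decomposition: None
```

LATIN CAPITAL LETTER O WITH STROKE has no canonical decomposition, so a
rule for it cannot be derived this way and has to be added by hand.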

Comments

Marko Myllynen June 16, 2015, 1:24 p.m. UTC | #1
Hi Mike,

I reviewed the resulting transliteration and special decompose rules and
in general everything looks very good; a few minor comments below.

On 2015-06-15 19:04, Mike FABIAN wrote:
> 
> Subject: [PATCH 1/4] Remove duplicate transliterations for U+0152 and U+0153
>  from C-translit.h.in

this looks like an obvious fix.

> Subject: [PATCH 2/4] Addition and fixes for translit_neutral
> 
> +% LATIN CAPITAL LETTER ENG
> +<U014A> <U004E>
> +% LATIN SMALL LETTER ENG
> +<U014B> <U006E>

Hmm, I presume NG/ng would be more expected than N/n here, but reading
https://en.wikipedia.org/wiki/Eng_%28letter%29 doesn't give a clear
answer either way, what do you think?

> +% EURO-CURRENCY SIGN
> +% CRUZEIRO SIGN
> +% FRENCH FRANC SIGN
> +% LIRA SIGN
> +% PESETA SIGN
>  % DONG SIGN
> +% INDIAN RUPEE SIGN
> +% TURKISH LIRA SIGN

While at it, should we perhaps also add pound, ruble, drachma, won, and
hryvnia signs here?

> Subject: [PATCH 3/4] Update the translit files to Unicode 7.0.0

The generated files included in this patch look good.

> Subject: [PATCH 4/4] Add transliteration rules for da, nb, nn, and sv locales.

AFAICS these also look good.

Thanks,
Marko Myllynen June 16, 2015, 2:25 p.m. UTC | #2
Hi,

actually, one more note: after these patches some rules are now
duplicated (see below for a few examples). Is there some particular
reason for this, or could those duplicates be avoided?

localhost:~> grep '^<U00C6>' translit*
translit_combining:<U00C6> "<U0041><U0045>"
translit_neutral:<U00C6> "<U0041><U0045>"
localhost:~> grep '^<U00D8>' translit*
translit_combining:<U00D8> <U004F>
translit_neutral:<U00D8> <U004F>
localhost:~>
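
A check like the grep above could be automated along these lines. This
is a hedged sketch in Python, not part of the patch: find_duplicates()
is a hypothetical helper, and it only recognizes simple one-line rules
of the form shown in the grep output.

```python
import re
from collections import defaultdict

# A rule line: a <Uxxxx> source symbol, whitespace, then the replacement.
RULE = re.compile(r"^(<U[0-9A-F]{4,8}>)\s+(.*\S)\s*$")


def find_duplicates(files):
    """files maps a file name to its text; return rules found in 2+ files."""
    seen = defaultdict(list)
    for name, text in files.items():
        for line in text.splitlines():
            match = RULE.match(line)
            if match:
                seen[match.groups()].append(name)
    return {rule: names for rule, names in seen.items() if len(names) > 1}


# Sample input copied from the grep output above.
sample = {
    "translit_combining": '<U00C6> "<U0041><U0045>"\n<U00D8> <U004F>\n',
    "translit_neutral": '<U00C6> "<U0041><U0045>"\n<U00D8> <U004F>\n',
}
for (source, target), names in find_duplicates(sample).items():
    print(source, target, "in", ", ".join(names))
```

In practice the dictionary would be filled by reading the translit_*
files from localedata/locales instead of using inline strings.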


Thanks,

Marko Myllynen June 16, 2015, 2:27 p.m. UTC | #3
Hi,

On 2015-06-16 17:24, Mike FABIAN wrote:
> Marko Myllynen <myllynen@redhat.com> wrote:
> 
>>> Subject: [PATCH 2/4] Addition and fixes for translit_neutral
>>>
>>> +% LATIN CAPITAL LETTER ENG
>>> +<U014A> <U004E>
>>> +% LATIN SMALL LETTER ENG
>>> +<U014B> <U006E>
>>
>> Hmm, I presume NG/ng would be more expected than N/n here, but reading
>> https://en.wikipedia.org/wiki/Eng_%28letter%29 doesn't give a clear
>> answer either way, what do you think?
> 
> http://unicode.org/cldr/trac/browser/trunk/common/transforms/Latin-ASCII.xml#L54
> 
> has:
> 
> 54	                        <tRule>Ŋ → N ; # 014A;LATIN CAPITAL LETTER ENG</tRule>
> 55	                        <tRule>ŋ → n ; # 014B;LATIN SMALL LETTER ENG</tRule>
> 
> "ng" might be phonetically closer but the main spirit of the "neutral"
> transliteration to ASCII seems to be something like "drop the accents",
> not "approximate the pronunciation using ASCII".

I see, looks ok then.

Thanks,

Patch

From ef2a1022224d32989891f7a12f2170a1b3a7e7f9 Mon Sep 17 00:00:00 2001
From: Mike FABIAN <mfabian@redhat.com>
Date: Wed, 20 May 2015 11:16:30 +0200
Subject: [PATCH 4/4] Add transliteration rules for da, nb, nn, and sv locales.

for localedata/Changelog

    [BZ #89]
    * locales/da_DK add more transliteration rules
    * locales/nb_NO add transliteration rules
    * locales/sv_SE add transliteration rules
---
 localedata/locales/da_DK | 21 ++++++++++++++++++---
 localedata/locales/nb_NO | 22 ++++++++++++++++++++++
 localedata/locales/sv_SE | 22 ++++++++++++++++++++++
 3 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/localedata/locales/da_DK b/localedata/locales/da_DK
index c5024a4..d1d4087 100644
--- a/localedata/locales/da_DK
+++ b/localedata/locales/da_DK
@@ -137,11 +137,26 @@  translit_start
 
 include "translit_combining";""
 
-% Danish.
-% LATIN CAPITAL LETTER A WITH RING ABOVE.
+% LATIN CAPITAL LETTER A WITH DIAERESIS -> "AE"
+<U00C4> "<U0041><U0308>";"<U0041><U0045>"
+% LATIN CAPITAL LETTER A WITH RING ABOVE -> "AA"
 <U00C5> "<U0041><U030A>";"<U0041><U0041>"
-% LATIN SMALL LETTER A WITH RING ABOVE.
+% LATIN CAPITAL LETTER AE -> "AE"
+<U00C6> "<U0041><U0045>"
+% LATIN CAPITAL LETTER O WITH DIAERESIS -> "OE"
+<U00D6> "<U004F><U0308>";"<U004F><U0045>"
+% LATIN CAPITAL LETTER O WITH STROKE -> "OE"
+<U00D8> "<U004F><U0338>";"<U004F><U0045>"
+% LATIN SMALL LETTER A WITH DIAERESIS -> "ae"
+<U00E4> "<U0061><U0308>";"<U0061><U0065>"
+% LATIN SMALL LETTER A WITH RING ABOVE -> "aa"
 <U00E5> "<U0061><U030A>";"<U0061><U0061>"
+% LATIN SMALL LETTER AE -> "ae"
+<U00E6> "<U0061><U0065>"
+% LATIN SMALL LETTER O WITH DIAERESIS -> "oe"
+<U00F6> "<U006F><U0308>";"<U006F><U0065>"
+% LATIN SMALL LETTER O WITH STROKE -> "oe"
+<U00F8> "<U006F><U0338>";"<U006F><U0065>"
 
 translit_end
 
diff --git a/localedata/locales/nb_NO b/localedata/locales/nb_NO
index 513d50c..332092a 100644
--- a/localedata/locales/nb_NO
+++ b/localedata/locales/nb_NO
@@ -127,6 +127,28 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+
+% LATIN CAPITAL LETTER A WITH DIAERESIS -> "AE"
+<U00C4> "<U0041><U0308>";"<U0041><U0045>"
+% LATIN CAPITAL LETTER A WITH RING ABOVE -> "AA"
+<U00C5> "<U0041><U030A>";"<U0041><U0041>"
+% LATIN CAPITAL LETTER AE -> "AE"
+<U00C6> "<U0041><U0045>"
+% LATIN CAPITAL LETTER O WITH DIAERESIS -> "OE"
+<U00D6> "<U004F><U0308>";"<U004F><U0045>"
+% LATIN CAPITAL LETTER O WITH STROKE -> "OE"
+<U00D8> "<U004F><U0338>";"<U004F><U0045>"
+% LATIN SMALL LETTER A WITH DIAERESIS -> "ae"
+<U00E4> "<U0061><U0308>";"<U0061><U0065>"
+% LATIN SMALL LETTER A WITH RING ABOVE -> "aa"
+<U00E5> "<U0061><U030A>";"<U0061><U0061>"
+% LATIN SMALL LETTER AE -> "ae"
+<U00E6> "<U0061><U0065>"
+% LATIN SMALL LETTER O WITH DIAERESIS -> "oe"
+<U00F6> "<U006F><U0308>";"<U006F><U0065>"
+% LATIN SMALL LETTER O WITH STROKE -> "oe"
+<U00F8> "<U006F><U0338>";"<U006F><U0065>"
+
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/sv_SE b/localedata/locales/sv_SE
index ecf7858..92358b9 100644
--- a/localedata/locales/sv_SE
+++ b/localedata/locales/sv_SE
@@ -112,6 +112,28 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+
+% LATIN CAPITAL LETTER A WITH DIAERESIS -> "AE"
+<U00C4> "<U0041><U0308>";"<U0041><U0045>"
+% LATIN CAPITAL LETTER A WITH RING ABOVE -> "AA"
+<U00C5> "<U0041><U030A>";"<U0041><U0041>"
+% LATIN CAPITAL LETTER AE -> "AE"
+<U00C6> "<U0041><U0045>"
+% LATIN CAPITAL LETTER O WITH DIAERESIS -> "OE"
+<U00D6> "<U004F><U0308>";"<U004F><U0045>"
+% LATIN CAPITAL LETTER O WITH STROKE -> "OE"
+<U00D8> "<U004F><U0338>";"<U004F><U0045>"
+% LATIN SMALL LETTER A WITH DIAERESIS -> "ae"
+<U00E4> "<U0061><U0308>";"<U0061><U0065>"
+% LATIN SMALL LETTER A WITH RING ABOVE -> "aa"
+<U00E5> "<U0061><U030A>";"<U0061><U0061>"
+% LATIN SMALL LETTER AE -> "ae"
+<U00E6> "<U0061><U0065>"
+% LATIN SMALL LETTER O WITH DIAERESIS -> "oe"
+<U00F6> "<U006F><U0308>";"<U006F><U0065>"
+% LATIN SMALL LETTER O WITH STROKE -> "oe"
+<U00F8> "<U006F><U0338>";"<U006F><U0065>"
+
 translit_end
 END LC_CTYPE
 
-- 
2.4.2
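
Each rule added above names a source character followed by one or more
replacement candidates separated by ";". As far as I understand the
locale source format, the candidates are tried in order until one can
be represented in the target character set. The Python sketch below
illustrates that reading; decode(), parse_rule() and pick() are
hypothetical helpers, and the ASCII check merely stands in for "can be
encoded in the target charset".

```python
import re

UCS = re.compile(r"<U([0-9A-F]{4,8})>")


def decode(token):
    """Turn '"<U0041><U0045>"' or '<U004F>' into the string it denotes."""
    return "".join(chr(int(h, 16)) for h in UCS.findall(token))


def parse_rule(line):
    """Split a translit rule into (source, [candidates in order])."""
    source, rest = line.split(None, 1)
    return decode(source), [decode(tok) for tok in rest.strip().split(";")]


def pick(candidates, encodable):
    """Return the first candidate accepted by encodable(), else None."""
    return next((c for c in candidates if encodable(c)), None)


source, candidates = parse_rule('<U00D6> "<U004F><U0308>";"<U004F><U0045>"')
# First candidate is O + COMBINING DIAERESIS, second is plain "OE".
print(source, candidates, pick(candidates, str.isascii))
```

For an ASCII target the first candidate still contains U+0308 and is
skipped, so the rule yields "OE".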