[v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]

Message ID b82fe65b-b880-a2b5-c97d-2a6aae9c1165@kobylkin.com
State New

Commit Message

Diego (Egor) Kobylkin Nov. 14, 2018, 9:25 p.m. UTC
Changelog v9:
* Fixed formatting (trailing spaces etc.)
* Put commit summary in the patch file, now it is generated completely
by git format-patch

Changelog v8:
* Re-added missing translit_cyrillic in patch v7 (due to missing "git
add" in the script).

Changelog v7:
* Generated against git://sourceware.org/git/glibc.git master with git
format-patch.
* The 'include "translit_cyrillic";""' now immediately follows last
'include "translit_XXX";""' string (was inserted just before
translit_end previously.)
* Only the locales already having 'include .*translit.*;""' are patched
(see the list for manual exclusions below, full list of included locales
at the end of the email in the commit section.)
* Excluded az_AZ completely to avoid circular reference from tr_TR via
“copy "tr_TR"”.

Changelog v6:
* Locales removed from the patch: C and sd_PK.
* Added locales: az_AZ and ky_KG.
* Consistently transliterate single uppercase Cyrillic letters
  to sequences of all uppercase Latin letters in all languages (whenever
  a Cyrillic letter is transliterated to more than one Latin letter),
  for example "Ї" is now transliterated as "YI" rather than "Yi".

Dear locale maintainers,

to fix glibc bug 2872, "Transliteration Cyrillic -> ASCII fails",

https://sourceware.org/bugzilla/show_bug.cgi?id=2872 [1]

please add the Cyrillic transliteration table file translit_cyrillic

https://sourceware.org/bugzilla/attachment.cgi?id=11340 [7]

to localedata/locales/ and include it in all your locales going forward.

The patch is included inline below.

From this patch I have excluded locales that already mention cyrillic or
have a transliteration table for it:

mn_MN
sr_RS
tg_TJ
tk_TM
tt_RU
uk_UA
uz_UZ
uz_UZ@cyrillic

Their maintainers are requested to make an explicit decision on how and
whether at all to include this patch.

Current bug effect:

The glibc wiki explicitly lists this use case as the test example

https://sourceware.org/glibc/wiki/Locales#Testing_Locales :

LC_ALL=$LOCALE.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT \
  < translit-test-input.txt

Currently it fails on Cyrillic text in most locales, including ru_RU [1]
[8] [9]:

LC_ALL=ru_RU.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT \
  < translit-test-input.txt | grep CYRILLIC

CYRILLIC ????? ??? ???? ?????? ??????????? ?????, ?? ????? ?? ???.

 - It produces a string of question marks and spaces.

This is what it should produce, and what it does produce once the patch is applied:

CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe
chayu.


The root problem and the fix:

The root problem is the missing transliteration table that I am
supplying here. Furthermore, the table has to be included in the
active locale at compile time for iconv to use it.



COMMIT MESSAGE:
This translit_cyrillic table enables conversion (e.g. with iconv) from
UTF-8 encoded text based on the Cyrillic alphabet to ASCII//TRANSLIT text.

Examples: iconv -f UTF-8 -t ASCII//TRANSLIT will produce ASCII
compatible transcription and iconv -f UTF-8 -t ISO-8859-15//TRANSLIT |
iconv -f ISO-8859-15 -t UTF-8 will produce Latin transliteration as per
ISO 9.1995.

While UTF-8 encoded Cyrillic text requires Cyrillic fonts, the result of
a transliteration/transcription contains only Latin/ASCII characters yet
can still be read by a native speaker. Among other things, it is useful
for processing Cyrillic texts and filenames with programs or on systems
that are not specifically prepared to work with Cyrillic, don't have
the corresponding fonts installed, or can't handle UTF-8.

The transliteration table itself is attached as a file translit_cyrillic
[7]. Its content (mapping) is based on ISO 9.1995 standard [10] and its
derivative GOST 7.79-2000 official source (Federal Agency on Technical
Regulating and Metrology Of Russian Federation [2]). Technically an
independent but mostly identical source [3] was used and prepared in a
spreadsheet [6].

The documentation suggests that a transliteration table is included by
adding the line *include "translit_cyrillic";""* to the translit_start
section of LC_CTYPE:
http://man7.org/linux/man-pages/man5/locale.5.html [5]
In practice, I searched for all locales that already have an
'include .*translit.*;""' line and generated a patch for them.
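
The placement rule described in the v7 changelog (put the new include right after the last existing translit include, and only touch locales that already have one) can be sketched as follows. This is an illustrative Python sketch, not the script that generated the patch; the function name and the sample locale fragment are made up for the example.

```python
import re

# Matches an existing include of a transliteration table, e.g.
#   include "translit_combining";""
INCLUDE_RE = re.compile(r'^\s*include\s+"[^"]*translit[^"]*";""\s*$')

NEW_INCLUDE = 'include "translit_cyrillic";""'


def add_cyrillic_include(locale_source: str) -> str:
    """Insert NEW_INCLUDE after the last existing translit include.

    Sources without any translit include, or already containing the
    new line, are returned unchanged.
    """
    lines = locale_source.splitlines()
    if any(NEW_INCLUDE in line for line in lines):
        return locale_source
    last = None
    for i, line in enumerate(lines):
        if INCLUDE_RE.match(line):
            last = i
    if last is None:
        return locale_source
    lines.insert(last + 1, NEW_INCLUDE)
    return "\n".join(lines) + "\n"


# Hypothetical locale fragment, for illustration only.
sample = (
    "LC_CTYPE\n"
    'copy "i18n"\n'
    "translit_start\n"
    'include "translit_combining";""\n'
    "translit_end\n"
    "END LC_CTYPE\n"
)
patched = add_cyrillic_include(sample)
```

Running the function a second time on `patched` leaves it unchanged, which makes it safe to re-run over the whole locales directory.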

The Cyrillic transliteration of e.g. Russian text may have already
worked to some extent for mn_MN, sr_RS, tk_TM, uz_UZ, uk_UA locales that
have their transliteration tables included inline.

I am excluding these locales from this proposed patch. I have written
directly to locale maintainer emails listed in the files. Volodymyr
Lisivka <vlisivka@gmail.com>, Max Kutny <mkutny@gmail.com> (uk_UA),
Данило Шеган <danilo@gnome.org>  (sr_RS) have confirmed the
exclusion.

Links:

[1] This bug entry https://sourceware.org/bugzilla/show_bug.cgi?id=2872
[2] GOST 7.79-2000 official source
http://protect.gost.ru/document.aspx?control=7&id=130715 (is only
available in low quality gif format)
[3] http://transliteration.ru/gost-7-79-2000/ and
http://www.yfermer.ru/specifications/285821.html
[4] Wikipedia article on Cyrillic transliteration with Latin alphabet
https://ru.wikipedia.org/wiki/%D0%A2%D1%80%D0%B0%D0%BD%D1%81%D0%BB%D0%B8%D1%82%D0%B5%D1%80%D0%B0%D1%86%D0%B8%D1%8F_%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D0%B0%D0%BB%D1%84%D0%B0%D0%B2%D0%B8%D1%82%D0%B0_%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B8%D1%86%D0%B5%D0%B9
[5] http://man7.org/linux/man-pages/man5/locale.5.html
[6] Spreadsheet for generating translit_cyrillic
https://sourceware.org/bugzilla/attachment.cgi?id=11301
[7] translit_cyrillic
https://sourceware.org/bugzilla/attachment.cgi?id=11340
[8] https://sourceware.org/glibc/wiki/Locales#Testing_Locales
[9] translit-test-input.txt
https://sourceware.org/bugzilla/attachment.cgi?id=11304
[10] https://en.wikipedia.org/wiki/ISO_9#ISO_9:1995,_or_GOST_7.79_System_A

Best regards,
Egor Kobylkin

Comments

Rafal Luzynski Nov. 16, 2018, 10:17 p.m. UTC | #1
Thank you for working on this, Egor.

Before I start reviewing I would like to summarize the things which
I think are blocking for this patch.

1. I think we need tests for transliteration.  Currently there is only
   one test program which is similar to what we need,
   localedata/bug-iconv-trans.c.  It is old and it is not quite clear
   what bug it is trying to test.  Therefore I think we need a new
   framework to test transliteration.  Is it a good idea to base the
   test on the iconv(1) command line utility which is part of glibc?

2. I ran a few tests on the command line and it seems to me that the
   transliteration from "З" to "Z" (+ lowercase as well) in uk_UA does
   not work and has not been working for some time already because
   I've checked some older systems as well and the result is always
   the same.  I think that the reason is that uk_UA defines multiple
   transliteration rules for "З" depending on what is the letter following
   it.  It does not seem to work.  AFAIK the reason is that the syntax of
   transliteration rules says that a single non-Latin character may map to
   one or more Latin strings, each consisting of one or more characters.
   There cannot be a rule transliterating multiple source characters into
   one or multiple destination characters.  Is it a bug in transliteration
   implementation?  Or maybe in the specification, including POSIX standard?
   The definition of transliteration says that it is one-to-one mapping
   of graphemes while a grapheme may be one or multiple characters.
   It does not have to be always mapping one-to-one character.  Should we
   fix this bug first, make uk_UA transliteration work, and only then
   add a generic Cyrillic transliteration?  Egor's patch already contains
   transliteration of "У" + combining acute accent to "Ú" which most probably
   will not work.
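
A toy model of the limitation suspected in point 2, written in Python. It does not reproduce glibc's implementation; it only contrasts a lookup keyed on single code points with a longest-match lookup that accepts multi-character source sequences. The two-entry table is hypothetical.

```python
# Hypothetical two-entry table: one rule keyed on a two-code-point
# sequence (U+0423 U+0301), one keyed on the single letter.
RULES = {
    "\u0423\u0301": "\u00da",  # "У" + combining acute -> "Ú"
    "\u0423": "U",             # "У" -> "U"
}


def per_character(text: str) -> str:
    """Look up one code point at a time; multi-character keys never match."""
    return "".join(RULES.get(ch, ch) for ch in text)


def longest_match(text: str) -> str:
    """Try the longest source sequence first, then shorter ones."""
    max_len = max(len(key) for key in RULES)
    out = []
    i = 0
    while i < len(text):
        for size in range(max_len, 0, -1):
            chunk = text[i:i + size]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += len(chunk)
                break
        else:
            # No rule applies at this position: copy the character through.
            out.append(text[i])
            i += 1
    return "".join(out)
```

With the per-character lookup the combining accent is left behind as an untransliterated code point, while the longest-match variant consumes both code points at once.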

I still think that in the longer term all existing custom transliterations
of Cyrillic alphabets should be ported to a modification of your patch.

Egor, while at it, I was thinking about your idea to transliterate letters
like "Ш" (uppercase) to "SH" (always uppercase) in order to distinguish
between "Шема" (-> "SHema") and "Схема" (-> "Shema" or "Sxema").  Also
you include a rule to transliterate "Х" to "H" or "X" depending on which
destination characters are available. As I already told you, that will
not work because both "H" and "X" are always available and therefore only
the first rule will ever be used.  I still don't like the idea of
putting two uppercase letters at the beginning of a titlecase word only to
indicate that there was originally a single letter.  What if we:

* drop the rule of transliterating "Х" to "H" and always transliterate to "X",
* transliterate uppercase "Ш" to "Sh" (so it will work fine for titlecase
  words)?

As a result the Latin letter "h" will only appear as part of a digraph and
never as a transliteration of "Х" and therefore will never cause a conflict.
Examples:

* "Шема" -> "Shema",
* "Схема" -> "Sxema".

Will this solve the problem?
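
The proposal can be checked with a few lines of Python. The table below is a tiny illustrative subset containing only the letters needed for the two example words; it is not the contents of translit_cyrillic.

```python
# Illustrative subset implementing the proposal: "Х" always -> "X",
# uppercase "Ш" -> titlecase digraph "Sh".
PROPOSED = {
    "Ш": "Sh", "ш": "sh",
    "Х": "X", "х": "x",
    "С": "S", "с": "s",
    "е": "e", "м": "m", "а": "a",
}


def transliterate(word: str, table: dict) -> str:
    """Replace each character by its table entry, leaving others as-is."""
    return "".join(table.get(ch, ch) for ch in word)


shema = transliterate("Шема", PROPOSED)   # "Shema"
sxema = transliterate("Схема", PROPOSED)  # "Sxema"
```

Because "h" only ever appears as the second half of a digraph in this table, the two outputs differ.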

Regards,

Rafal
Diego (Egor) Kobylkin Nov. 17, 2018, 6:34 p.m. UTC | #2
Hi Rafal,
thanks for putting the SH/Sh problem into a clear issue statement. I'm
totally with you that this is worth discussing. It is orthogonal
to the tests, so let me focus on the SH/Sh and System A/B problems here.

Looks like we have three issues:
1. lack of explicit control which transformation to use (System A or
System B) via //TRANSLIT
2. possibility of collisions in System B if CAP/low transcription is used
for capital letters
3. Cyrillic 'Х'/'х' (ha) never transcribes to 'H'/'h' as it should per
System B because its equivalent 'X'/'x' from System A is always present
and takes precedence.

As a solution shouldn't we only keep System B in a new file
transcribe_cyrillic and put it in place as the explicit ASCII
transcription for targeted locales (as opposed to transliteration)?

We would keep System A as translit_cyrillic but won't include it into
this patch. Once you have resolved an issue of having two conflicting
rule-sets but only one key //TRANSLIT you could add the System A back.

The SH/Sh question can be decided either way; it seems like an easy change.

Please see more discussion on your excellent points below:

On 16.11.18 23:17, Rafal Luzynski wrote:

> Egor, while at this I was thinking about your idea to transliterate
> letters like "Ш" (uppercase) to "SH" (always uppercase) in order to
> distinguish between "Шема" (-> "SHema") and "Схема" (-> "Shema" or
> "Sxema").

To clarify, this SH/Sh collision issue relates only to iconv -f UTF-8 -t
ASCII//TRANSLIT (i.e. System B transcription).
But it's not only SH/Sh; the following combinations are used to
transcribe capital letters:

YO, DJ, YE, TSH, DH, ZH, CZ, CH, SH, SHH, YU, YA, FH, YH, GH, NG, TCZ

Arguably any of them (if not in that CAP/CAP form) could collide with
its CAP/low equivalent from a different word. (There may be language
grammar rules that in fact prevent some collisions, but we don't know
for sure.)

With transcription we are basically stripping information from the data,
mapping it into a smaller character set. The idea of keeping them in
CAP/CAP is to try to preserve as much information as possible.
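
The information-loss argument can be made concrete with a small collision check in Python. The tables are illustrative subsets that assume "х" is transcribed as "h"; the helper names are made up for this sketch.

```python
from collections import defaultdict

# Shared lowercase rules for both variants (illustrative subset).
BASE = {"С": "S", "с": "s", "х": "h", "е": "e", "м": "m", "а": "a", "ш": "sh"}
CAP_CAP = {**BASE, "Ш": "SH"}  # capital letter -> all-uppercase digraph
CAP_LOW = {**BASE, "Ш": "Sh"}  # capital letter -> titlecase digraph


def transcribe(word, table):
    """Replace each character by its table entry, leaving others as-is."""
    return "".join(table.get(ch, ch) for ch in word)


def collisions(words, table):
    """Group source words by output; keep only outputs shared by several."""
    groups = defaultdict(list)
    for word in words:
        groups[transcribe(word, table)].append(word)
    return {out: srcs for out, srcs in groups.items() if len(srcs) > 1}


words = ["Шема", "Схема"]
cap_low_clashes = collisions(words, CAP_LOW)  # {"Shema": ["Шема", "Схема"]}
cap_cap_clashes = collisions(words, CAP_CAP)  # {}
```

Note that in this toy table the all-lowercase forms "шема" and "схема" still collide in both variants, so CAP/CAP only protects the capital letters.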


> Also you include a rule to transliterate "Х" to "H" or "X" depending
> on which destination characters are available, which I told you
> already that will not work because both "H" and "X" are always
> available and therefore only the first rule will always be used.

Just to have this here for reference, the idea was to have both rules in
one file so

iconv -f UTF-8 -t ASCII//TRANSLIT
will produce ASCII compatible _transcription_ (System B)

iconv -f UTF-8 -t ISO-8859-15//TRANSLIT |
iconv -f ISO-8859-15 -t UTF-8
will produce Latin _transliteration_ as per ISO 9.1995. (System A)

So in fact we have two rules for each letter in the same file (System A
and System B), where System A takes precedence.
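
The precedence behaviour described here can be modelled as follows. This is a Python sketch of "take the first candidate that the target character set can represent", not glibc's code; the CANDIDATES table is a hypothetical excerpt, and the order of the two entries for "Х" is arbitrary, chosen only to show that a second ASCII candidate is never reached.

```python
# Candidate replacements in priority order (hypothetical excerpt).
CANDIDATES = {
    "Ш": ["Š", "SH"],
    "ш": ["š", "sh"],
    "Х": ["H", "X"],
}


def encodable(text: str, charset: str) -> bool:
    """True if every character of text exists in the target charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def pick(ch: str, charset: str) -> str:
    """Return the first candidate the target charset can represent."""
    if encodable(ch, charset):
        return ch
    for candidate in CANDIDATES.get(ch, []):
        if encodable(candidate, charset):
            return candidate
    return "?"
```

With this model, `pick("Ш", "ascii")` falls through to "SH" while `pick("Ш", "iso-8859-15")` stops at "Š". For "Х" both targets yield the first entry, because both candidates are plain ASCII.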

I have a question then: isn't this more of a hack than the right thing
to do?

Shouldn't we have two explicit rules for transcription and
transliteration not dependent on a destination character set?


> I still don't like the idea to
> put two uppercase letters in a beginning of a word in titlecase only
> to indicate that there was originally a single letter.  What if we:
> 
> * drop the rule of transliterating "Х" to "H" and transliterate
> always to "X",
This would contradict ISO 9.1995. (System A).
System A was added on Marko's request (so setting him on TO:) I am
neutral on keeping it or dropping it, just to be clear.

> * transliterate uppercase "Ш" to "Sh" (so it will work fine for
> titlecase words)?
> 
> As a result the Latin letter "h" will only appear as part of a
> digraph and never as a transliteration of "Х" and therefore will
> never cause a conflict. Examples:
> 
> * "Шема" -> "Shema", * "Схема" -> "Sxema".
> 
> Will this solve the problem?
This particular rule with h/x would make sense on its own.
But again, it would contradict the standards.
On the other hand, for my personal needs I care less about standards than
about current functionality and the data loss caused by the
transcription being missing altogether due to BZ #2872.

Bests,
Egor
Marko Myllynen Nov. 19, 2018, 7:13 a.m. UTC | #3
Hi,

On 17/11/2018 20.34, Egor Kobylkin wrote:
> 
> Looks like we have three issues:
> 1. lack of explicit control which transformation to use (System A or
> System B) via //TRANSLIT
> 2. possibility of collision for System B if used CAP/low transcription
> for capital letters
> 3. Cyrillic 'Х'/'х' (ha) never transcribes to 'H'/'h' as it should per
> System B because its equivalent 'X'/'x' from System A is always present
> and takes precedence.
> 
> As a solution shouldn't we only keep System B in a new file
> transcribe_cyrillic and put it in place as the explicit ASCII
> transcription for targeted locales (as opposed to transliteration)?
> 
> We would keep System A as translit_cyrillic but won't include it into
> this patch. Once you have resolved an issue of having two conflicting
> rule-sets but only one key //TRANSLIT you could add the System A back.
> 
> The SH/Sh can be decided on either way - seems like an easy change any way.
> 
> I have a question then: isn't this more like a hack than a right thing
> to do?
> 
> Shouldn't we have two explicit rules for transcription and
> transliteration not dependent on a destination character set?
> 
> This would contradict ISO 9.1995. (System A).
> System A was added on Marko's request (so setting him on TO:) I am
> neutral on keeping it or dropping it, just to be clear.
> 
> This particular rule with h/x would make sense on its own.
> But again - it would contradict the standards.
> On the other hand, for my personal needs I care less about standards but
> about current functionality and data loss because of missing
> transcription altogether due to the BZ #2872.

Given the number of questions above I think the way forward is to try
to follow the relevant standards as closely as possible and also check what
other implementations (e.g., uconv(1)) do. For example, checking the
case mentioned earlier may or may not give some hints:

$ echo Шема  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
Šema
$ echo Схема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
Shema
$ uconv -V
uconv v2.1  ICU 50.1.2

Thanks,
Diego (Egor) Kobylkin Nov. 19, 2018, 9:21 a.m. UTC | #4
On 19.11.18 08:13, Marko Myllynen wrote:
> Hi,
> 
> On 17/11/2018 20.34, Egor Kobylkin wrote:

>>
>> Shouldn't we have two explicit rules for transcription and
>> transliteration not dependent on a destination character set?
>>
>> This would contradict ISO 9.1995. (System A).
>> System A was added on Marko's request (so setting him on TO:) I am
>> neutral on keeping it or dropping it, just to be clear.
>>
>> This particular rule with h/x would make sense on its own.
>> But again - it would contradict the standards.
>> On the other hand, for my personal needs I care less about standards but
>> about current functionality and data loss because of missing
>> transcription altogether due to the BZ #2872.
> 
> Given the amount of questions above I think the way forward is to try
> follow the relevant standards as closely as possible and also check what
> the other implementations (i.e., uconv(1)) do. For example, checking the
> case earlier mentioned case may or may not give some hints:
> 
> $ echo Шема  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> Šema
> $ echo Схема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> Shema
> $ uconv -V
> uconv v2.1  ICU 50.1.2

Marko,

Your example only covers _transliteration_ to Latin with diacritics
iconv -f UTF-8 -t ISO-8859-15//TRANSLIT \
| iconv -f ISO-8859-15 -t UTF-8

while BZ #2872 is about _transcription_ to ASCII
iconv -f UTF-8 -t ASCII//TRANSLIT

The glibc wiki explicitly lists this use case (ASCII) as the test
example https://sourceware.org/glibc/wiki/Locales#Testing_Locales

So again, you are asking to have ISO 9.1995. System A but the bug is
about ISO 9.1995. System B (GOST 7.79-2000)


Bests,
Egor
Marko Myllynen Nov. 19, 2018, 7:35 p.m. UTC | #5
Hi,

On 19/11/2018 11.21, Egor Kobylkin wrote:
> On 19.11.18 08:13, Marko Myllynen wrote:
>> On 17/11/2018 20.34, Egor Kobylkin wrote:
> 
> Your example only covers _transliteration_ to Latin Diacritics
> iconv -f UTF-8 -t ISO-8859-15//TRANSLIT \
> | iconv -f ISO-8859-15 -t UTF-8
> 
> while BZ #2872 is about _transcription_ to ASCII
> iconv -f UTF-8 -t ASCII//TRANSLIT

AFAICS v9 (unlike v10) supported both of the above cases.

> The glibc wiki explicitly lists this use case (ASCII) as the test
> example https://sourceware.org/glibc/wiki/Locales#Testing_Locales

I wrote that section and I certainly wasn't considering Cyrillic aspects
at that time (IIRC it was written even before Mike did the major update
for transliteration rules at the end of 2015). The context back then was
mostly about handling Latin letters like Å, Ä, Ö, Ø, etc.

> So again, you are asking to have ISO 9.1995. System A but the bug is
> about ISO 9.1995. System B (GOST 7.79-2000)

We certainly can decide here what's the best course of action, we do not
have to slavishly follow some old bug report when deciding the direction
for the implementation. But I think I've made my position clear by now
so I'm not going to repeat it anymore.

In any case once your patch lands I'm going to submit a follow-up patch
for fi_FI to make it compliant with the applicable national standard
(SFS 4900) which defines how to do Cyrillic transliteration /
transcription in the context of Finnish.

Thanks,
Rafal Luzynski Dec. 1, 2018, 10:07 p.m. UTC | #6
19.11.2018 08:13 Marko Myllynen <myllynen@redhat.com> wrote:
> [...]
> Given the amount of questions above I think the way forward is to try
> follow the relevant standards as closely as possible and also check what
> the other implementations (i.e., uconv(1)) do. For example, checking the
> case earlier mentioned case may or may not give some hints:
> 
> $ echo Шема  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> Šema
> $ echo Схема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> Shema
> $ uconv -V
> uconv v2.1  ICU 50.1.2

I've played a little with uconv and unfortunately it does not look good
to me.

It does not have any fallback transliteration to plain ASCII.  When it says
that 'Ш' is transliterated to 'Š' then it always uses 'Š', and if the target
charset does not have this character the conversion fails with an error:

$ echo Шема  | uconv -f UTF-8 -t ASCII -x cyrillic-latin
Conversion from Unicode to codepage failed at output byte position 0.
Unicode: 0160 Error: Invalid character found
$ echo Шема  | uconv -f UTF-8 -t ISO-8859-1 -x cyrillic-latin
Conversion from Unicode to codepage failed at output byte position 0.
Unicode: 0160 Error: Invalid character found
$ echo Шема  | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic-latin
�ema
$ echo Шема  | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic-latin \
    | uconv -f ISO-8859-2 -t UTF-8
Šema
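
The results above can be cross-checked with Python's built-in codecs, independently of uconv. The sketch only tests which character sets contain U+0160 and shows why the raw ISO-8859-2 output appears as a replacement character on a UTF-8 terminal; the variable names are made up for the example.

```python
def can_encode(ch: str, charset: str) -> bool:
    """True if the charset has a code for the given character."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


S_CARON = "\u0160"  # "Š"

support = {
    name: can_encode(S_CARON, name)
    for name in ("ascii", "iso-8859-1", "iso-8859-2")
}

# The ISO-8859-2 bytes for "Šema" start with a single byte >= 0x80,
# which is not valid UTF-8 on its own.
raw = "\u0160ema".encode("iso-8859-2")
as_seen_on_utf8_terminal = raw.decode("utf-8", errors="replace")
round_trip = raw.decode("iso-8859-2")
```

Here `support` comes out False for ASCII and ISO-8859-1 and True for ISO-8859-2, matching the two errors and the one success in the transcript.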

It seems to follow ISO 9 (GOST 7.79) System A.  However, the transliteration
of the hard sign is rather strange:

$ echo нъе  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
nʺe

The above was correct but:

$ echo НЪЕ  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin          
Nʺ̱E
$ echo Ъ  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
ʺ̱
$ echo Ъ  | uconv -f UTF-8 -t UTF-16 -x cyrillic-latin| hexdump -x
0000000    feff    02ba    0331    000a                                
0000008

So this generates:
02BA  MODIFIER LETTER DOUBLE PRIME
0331  COMBINING MACRON BELOW
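
The same identification can be done without hexdump, using Python's unicodedata module. This is only a convenience sketch; the string literal stands for the two code points that uconv printed for "Ъ" in the run above.

```python
import unicodedata

# The two code points from the hexdump above (BOM and newline omitted).
output = "\u02ba\u0331"

# One "CODE  NAME" line per code point.
names = [f"{ord(ch):04X}  {unicodedata.name(ch)}" for ch in output]

# The same code points as big-endian UTF-16, in hexadecimal.
utf16_words = output.encode("utf-16-be").hex()

# A non-zero combining class marks a combining character.
second_is_combining = unicodedata.combining(output[1]) != 0
```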

There are more transliteration methods, for example Russian-Latin/BGN:

$ echo Шема  | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
Shema
$ echo Схема  | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
Skhema

Converting 'х' to 'kh' seems to be common in English transliteration but
it does not follow any ISO standard.

$ echo ХА ха | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
KHA kha

This means that the choice whether a digraph in the output should be
all uppercase or maybe upper+lower is context based, something which we
probably cannot implement.  But definitely a good thing.
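
A context-based choice of this kind could look like the Python sketch below. The rule used here (all caps if a neighbouring letter is uppercase, titlecase otherwise) is a guess that reproduces the outputs above; it is not taken from ICU, and the table is an illustrative subset.

```python
# Illustrative subset of a BGN-like table, keyed on lowercase letters.
TABLE = {"х": "kh", "а": "a", "т": "t"}


def transliterate(text: str) -> str:
    out = []
    for i, ch in enumerate(text):
        latin = TABLE.get(ch.lower())
        if latin is None:
            out.append(ch)              # not in the table: pass through
        elif not ch.isupper():
            out.append(latin)
        else:
            prev_upper = i > 0 and text[i - 1].isupper()
            next_upper = i + 1 < len(text) and text[i + 1].isupper()
            if prev_upper or next_upper:
                out.append(latin.upper())       # inside an all-caps run
            else:
                out.append(latin.capitalize())  # start of a titlecase word
    return "".join(out)
```

Under this assumed rule "ХА ха" becomes "KHA kha" and "Хата" becomes "Khata". The point of the sketch is that the function needs to look at neighbouring characters, which a plain per-character table cannot do.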

Two more tests:

$ echo Ещё | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
Yeshchë
$ echo Ещё | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
Conversion from Unicode to codepage failed at output byte position 6.
Unicode: 00eb Error: Invalid character found

So the output is not plain ASCII.

$ echo е же ле не | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
ye zhe le ne

Again this means that transliteration of 'е' is context based:
it is 'ye' in the beginning of a word and 'e' otherwise.
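
The word-initial rule observed here can be sketched in Python as follows. Only the behaviour seen in the output above is modelled (lowercase "е" at the start of a word); the actual BGN rules may cover more contexts, and the table is an illustrative subset.

```python
# Illustrative subset: just the letters used in the example above.
TABLE = {"е": "e", "ж": "zh", "л": "l", "н": "n"}


def transliterate(text: str) -> str:
    out = []
    prev_is_letter = False
    for ch in text:
        if ch == "е" and not prev_is_letter:
            out.append("ye")            # "е" at the start of a word
        else:
            out.append(TABLE.get(ch, ch))
        prev_is_letter = ch.isalpha()
    return "".join(out)


result = transliterate("е же ле не")  # "ye zhe le ne"
```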

The version which I've tested:

$ uconv -V
uconv v2.1  ICU 60.2

It seems that uconv will not be a good hint about transliterating
to plain ASCII.

Also, the difference between uconv and iconv is that we can provide
multiple transliterations for any source character but we can't group
them into standards so we can't tell iconv to use this or another
system.  It will just choose the best fitting the current output
character set and the only thing we can choose is the locale.

This makes me think: should we add a locale like ru_RU@SystemA or
ru_RU@SystemB?

Regards,

Rafal
Diego (Egor) Kobylkin Dec. 1, 2018, 10:53 p.m. UTC | #7
On 01.12.18 23:07, Rafal Luzynski wrote:
> 
> Also, the difference between uconv and iconv is that we can provide
> multiple transliterations for any source character but we can't group
> them into standards so we can't tell iconv to use this or another
> system.  It will just choose the best fitting the current output
> character set and the only thing we can choose is the locale.
> 
> This makes me think: should we add a locale like ru_RU@SystemA or
> ru_RU@SystemB?

Wouldn't that require creating three versions of every locale that
includes the translit_cyrillic file? I.e. en_US, en_US@SystemA,
en_US@SystemB etc.?

This in turn will make two of them optional (as cyrillic fonts are at
the moment). The highest value is in having the default locale being
able to transliterate, isn't it? So putting the transliteration to
optional locales kind of defeats the purpose.

An example from my experience as a user - a networked device or host
would often have the en_US as the default (only?) locale with no viable
way to change it or install cyrillic fonts. Anyway, this is the most
dire situation where the ASCII transliteration certainly helps most.
Having en_US@SystemA or en_US@SystemB theoretically available but not
compiled by the distributor wouldn't help here, would it?

So the only useful scenario here would be to ship your locales with the
transliteration already included by default in en_US. This way the
distributor won't have to get active to include transliteration as
en_US@SystemA or en_US@SystemB.

From my (however limited) point of view it is better to get System
B in first, then see whether some code needs to be changed to accommodate
the System A/System B problem. Again, System B is _transcription_ to
ASCII and System A is _transliteration_ to Latin, with different use cases.

It's insightful to see your comparison of the uconv vs. iconv!
Similar to your checks this is what I was using to see whether any
locale fails the transliteration for any cyrillic letter:

echo \
"ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍ
ҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’"|
LOCPATH=$workdir/compiled_locales/"$locale"/ LC_ALL="$locale".UTF-8 \
iconv -f UTF-8 -t ASCII//TRANSLIT

should give (can be asserted with bash string comparison):

AaOoUussYODJG`YeZ`IYiJL`N`TSHK`U`DhABVGDEZHZIJKLMNOPRSTUUFHCCHSHSHHA`Y`E`YUYAabvgdezhzijklmnoprstuufhcchshshh``y`e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FhfhYhyhE`e`
G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`T`t`UuH`h`TCZtczSH`SH`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`Y`y`'

And I am attaching another file that has the Unicode code points next to
the letters for easier identification of failures (like "U0401-Ё
U0402-Ђ U0403-Ѓ" etc.). I hope it will be helpful in creating the tests.
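
As a starting point for such tests, the comparison could be wrapped in a small Python harness. This is a sketch under assumptions: the function names are made up, the default locale name is only an example, and iconv_ascii simply shells out to the iconv command line shown above.

```python
import os
import subprocess


def iconv_ascii(text, locale="ru_RU.UTF-8"):
    """Pipe text through `iconv -f UTF-8 -t ASCII//TRANSLIT` in a locale."""
    env = dict(os.environ, LC_ALL=locale)
    result = subprocess.run(
        ["iconv", "-f", "UTF-8", "-t", "ASCII//TRANSLIT"],
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        env=env,
        check=False,
    )
    return result.stdout.decode("ascii", errors="replace")


def check_translit(convert, cases):
    """Return (source, expected, actual) for every case that mismatches."""
    failures = []
    for source, expected in cases:
        actual = convert(source)
        if actual != expected:
            failures.append((source, expected, actual))
    return failures


def labelled(text):
    """Format each character as 'U0401-Ё', like the attached test file."""
    return " ".join(f"U{ord(ch):04X}-{ch}" for ch in text)
```

Passing the converter in as a callable keeps check_translit independent of how iconv is invoked, so a per-letter case list makes a failure point at the exact code point rather than at one long string.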

Best regards,
Egor Kobylkin
CYRILLIC RUSSIAN Съешь ещё этих мягких французских булок, да выпей же чаю. СЪЕШЬ ЕЩЁ ЭТИХ МЯГКИХ ФРАНЦУЗСКИХ БУЛОК? ДА ВЫПЕЙ ЖЕ ЧАЮ!
CYRILLIC COMPLETE U0401-Ё U0402-Ђ U0403-Ѓ U0404-Є U0405-Ѕ U0406-І U0407-Ї U0408-Ј U0409-Љ U040A-Њ U040B-Ћ U040C-Ќ U040E-Ў U040F-Џ U0410-А U0411-Б U0412-В U0413-Г U0414-Д U0415-Е U0416-Ж U0417-З U0418-И U0419-Й U041A-К U041B-Л U041C-М U041D-Н U041E-О U041F-П U0420-Р U0421-С U0422-Т U0423-У U0423 0301-У́ U0424-Ф U0425-Х U0426-Ц U0427-Ч U0428-Ш U0429-Щ U042A-ъ U042B-Ы U042C-ь U042D-Э U042E-Ю U042F-Я U0430-а U0431-б U0432-в U0433-г U0434-д U0435-е U0436-ж U0437-з U0438-и U0439-й U043A-к U043B-л U043C-м U043D-н U043E-о U043F-п U0440-р U0441-с U0442-т U0443-у U0443 0301-у́ U0444-ф U0445-х U0446-ц U0447-ч U0448-ш U0449-щ U044A-Ъ U044B-ы U044C-Ь U044D-э U044E-ю U044F-я U0451-ё U0452-ђ U0453-ѓ U0454-є U0455-ѕ U0456-і U0457-ї U0458-ј U0459-љ U045A-њ U045B-ћ U045C-ќ U045E-ў U045F-џ U046A-Ѫ U046B-ѫ U0472-Ѳ U0473-ѳ U0474-Ѵ U0475-ѵ U048C-Ҍ U048D-ҍ U0490-Ґ U0491-ґ U0492-Ғ U0493-ғ U0494-Ҕ U0495-ҕ U0496-Җ U0497-җ U049A-Қ U049B-қ U049E-Ҟ U049F-ҟ U04A2-Ң U04A3-ң U04A4-Ҥ U04A5-ҥ U04A6-Ҧ U04A7-ҧ U04A8-Ҩ U04A9-ҩ U04AA-Ҫ U04AB-ҫ U04AC-Ҭ U04AD-ҭ U04AE-Ү U04AF-ү U04B2-Ҳ U04B3-ҳ U04B4-Ҵ U04B5-ҵ U04BA-Һ U04BB-һ U04BC-Ҽ U04BD-ҽ U04BE-Ҿ U04BF-ҿ U04C0-Ӏ U04C1-Ӂ U04C2-ӂ U04CB-Ӌ U04CC-ӌ U04D0-Ӑ U04D1-ӑ U04D2-Ӓ U04D3-ӓ U04D6-Ӗ U04D7-ӗ U04D8-Ә U04D9-ә U04DC-Ӝ U04DD-ӝ U04DE-Ӟ U04DF-ӟ U04E0-Ӡ U04E1-ӡ U04E4-Ӥ U04E5-ӥ U04E6-Ӧ U04E7-ӧ U04E8-Ө U04E9-ө U04F0-Ӱ U04F1-ӱ U04F2-Ӳ U04F3-ӳ U04F4-Ӵ U04F5-ӵ U04F8-Ӹ U04F9-ӹ U2019-’
GREEK Ελληνικό Ίδρυμα Ευρωπαϊκής και Εξωτερικής.
GERMAN Zwölf Boxkämpfer jagen Victor quer über den großen Sylter Deich.
FRENCH Dès Noël où un zéphyr haï me vêt de glaçons würmiens je dîne d’exquis rôtis de bœuf au kir à l’aÿ d’âge mûr \& cætera.
SPANISH El veloz murciélago hindú comía feliz cardillo y kiwi, la cigüeña tocaba el saxofón detrás del palenque de paja.
END
Diego (Egor) Kobylkin Dec. 3, 2018, 10:19 p.m. UTC | #8
Rafal,

Just to touch base on this, what is the best way forward? Did you get
any input/feedback on your questions below? Are you expecting input from
anyone but myself?

On the blocking issue #2: I really don’t see the connection to the uk_UA
locale that has its transliteration table inline and is explicitly
excluded from my patch. It may be revealing  another issue you have with
glibc but wouldn’t that be better addressed in a new bug?
Again, in the v10 of my patch I have removed multicharacter source
graphemes, so that issue is moot there.

If you’d like to overhaul the glibc translit system, wouldn’t it be
better to commit the simple text file with the Cyrillic
translit (transcription) table first, fix the bug from 2006, and
then proceed from there with all due diligence?

The same with having both System A and System B.  Initially I went along
with the suggestion to include the system A but it is clear now that it
doesn’t make fixing [BZ #2872] more straightforward. So I’d also propose
to set it aside for the moment and use the v10 without the system A.
That is the whole reason I have submitted it, to be superclear on that.

Now you saw that uconv is transcribing «ХА» as KHA (cap/cap/cap) that
should mitigate your concern about that issue too (somewhat, anyway).
Making it context based would also be about adding new code, see above.

Let me know if there’s anything I can help with to make progress
on the decision.

Bests,
Egor


On 16.11.18 23:17, Rafal Luzynski wrote:

> 2. I made few tests in the command line and it seems to me that the 
> transliteration from "З" to "Z" (+ lowercase as well) in uk_UA does 
> not work and has not been working for some time already because I've
> checked some older systems as well and the result is always the same.
> I think that the reason is that uk_UA defines multiple 
> transliteration rules for "З" depending on what is the letter
> following it.  It does not seem to work.  AFAIK the reason is that
> the syntax of transliteration rules says that a single non-Latin
> character may map one or more Latin strings, each consisting of one
> or more characters. There cannot be a rule transliterating multiple
> source characters into one or multiple destination characters.  Is it
> a bug in transliteration implementation?  Or maybe in the
> specification, including POSIX standard?
> The definition of transliteration says that it is one-to-one mapping 
> of graphemes while a grapheme may be one or multiple characters. It
> does not have to be always mapping one-to-one character.  Should we 
> fix this bug first, make uk_UA transliteration work, and only then 
> add a generic Cyrillic transliteration?  Egor's patch already
> contains transliteration of "У" + combining acute accent to "Ú" which
> most probably will not work.
> 
> I still think that in the longer term all existing custom
> transliterations of Cyrillic alphabets should be ported to a
> modification of your patch.

On 01.12.18 23:07, Rafal Luzynski wrote:
> 19.11.2018 08:13 Marko Myllynen <myllynen@redhat.com> wrote:
>> [...]
>> Given the amount of questions above I think the way forward is to try
>> follow the relevant standards as closely as possible and also check what
>> the other implementations (i.e., uconv(1)) do. For example, checking the
>> case earlier mentioned case may or may not give some hints:
>>
>> $ echo Шема  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
>> Šema
>> $ echo Схема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
>> Shema
>> $ uconv -V
>> uconv v2.1  ICU 50.1.2
> 
> I've played a little with uconv and unfortunately it does not look good
> to me.
> 
> It does not have any fallback transliteration to plain ASCII.  When it says
> that 'Ш' is transliterated to 'Š' then it always uses 'Š' and if the target
> charset does not have this character then crashes:
> 
> $ echo Шема  | uconv -f UTF-8 -t ASCII -x cyrillic-latin
> Conversion from Unicode to codepage failed at output byte position 0.
> Unicode: 0160 Error: Invalid character found
> $ echo Шема  | uconv -f UTF-8 -t ISO-8859-1 -x cyrillic-latin
> Conversion from Unicode to codepage failed at output byte position 0.
> Unicode: 0160 Error: Invalid character found
> $ echo Шема  | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic-latin
> �ema
> $ echo Шема  | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic-latin | uconv -f
> ISO-8859-2 -t UTF-8
> Šema
> 
> It seems to follow ISO 9 (GOST 7.79) System A.  However, the transliteration
> of the hard sign is rather strange:
> 
> $ echo нъе  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> nʺe
> 
> The above was correct but:
> 
> $ echo НЪЕ  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin          
> Nʺ̱E
> $ echo Ъ  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> ʺ̱
> $ echo Ъ  | uconv -f UTF-8 -t UTF-16 -x cyrillic-latin| hexdump -x
> 0000000    feff    02ba    0331    000a                                
> 0000008
> 
> So this generates:
> 02BA  MODIFIER LETTER DOUBLE PRIME
> 0331  COMBINING MACRON BELOW
> 
> There are more transliteration methods, for example Russian-Latin/BGN:
> 
> $ echo Шема  | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> Shema
> $ echo Схема  | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> Skhema
> 
> Converting 'х' to 'kh' seems to be common in English transliteration but
> it does not follow any ISO standard.
> 
> $ echo ХА ха | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> KHA kha
> 
> This means that the choice whether a digraph in the output should be
> all uppercase or maybe upper+lower is context based, something which we
> probably cannot implement.  But definitely a good thing.
> 
> Two more tests:
> 
> $ echo Ещё | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> Yeshchë
> $ echo Ещё | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
> Conversion from Unicode to codepage failed at output byte position 6.
> Unicode: 00eb Error: Invalid character found
> 
> So the output is not plain ASCII.
> 
> $ echo е же ле не | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
> ye zhe le ne
> 
> Again this means that transliteration of 'е' is context based:
> it is 'ye' in the beginning of a word and 'e' otherwise.
> 
> The version which I've tested:
> 
> $ uconv -V
> uconv v2.1  ICU 60.2
> 
> It seems that uconv will not be a good hint about transliterating
> to plain ASCII.
> 
> Also, the difference between uconv and iconv is that we can provide
> multiple transliterations for any source character but we can't group
> them into standards so we can't tell iconv to use this or another
> system.  It will just choose the best fitting the current output
> character set and the only thing we can choose is the locale.
> 
> This makes me think: should we add a locale like ru_RU@SystemA or
> ru_RU@SystemB?
> 
> Regards,
> 
> Rafal
>
Rafal Luzynski Dec. 8, 2018, 1:15 a.m. UTC | #9
17.11.2018 19:34 Egor Kobylkin <egor@kobylkin.com> wrote:
> [...]
> Looks like we have three issues:
> 1. lack of explicit control which transformation to use (System A or
> System B) via //TRANSLIT
> 2. possibility of collision for System B if used CAP/low transcription
> for capital letters
> 3. Cyrillic 'Х'/'х' (ha) never transcribes to 'H'/'h' as it should per
> System B because its equivalent 'X'/'x' from System A is always present
> and takes precedence.

True.

> As a solution shouldn't we only keep System B in a new file
> transcribe_cyrillic and put it in place as the explicit ASCII
> transcription for targeted locales (as opposed to transliteration)?
> 
> We would keep System A as translit_cyrillic but won't include it into
> this patch. Once you have resolved an issue of having two conflicting
> rule-sets but only one key //TRANSLIT you could add the System A back.

Sounds like a good idea to provide those two files:

* translit_cyrillic_system_a,
* translit_cyrillic_system_b,

(or any other pair of names) and let the individual locales choose whether
they want to include System A or System B.  For optimization, system_b
file could include system_a and modify it.

> The SH/Sh can be decided on either way - seems like an easy change any
> way.

I'm in favor of "Sh" because it will work fine for titlecased words
(where only the first letter is uppercase) but I'm aware it would be
a problem for uppercased words.  Unfortunately, I think we are unable
to satisfy both cases.

> On 16.11.18 23:17, Rafal Luzynski wrote:
> 
> > Egor, while at this I was thinking about your idea to transliterate
> > letters like "Ш" (uppercase) to "SH" (always uppercase) in order to
> > distinguish between "Шема" (-> "SHema") and "Схема" (-> "Shema" or
> > "Sxema").
> 
> to clarify, this SH/Sh collision issue relates only to iconv -f UTF-8 -t
> ASCII//TRANSLIT (i.e. System B transcription).

True.

> But it's not only SH/Sh, there are following combinations used to
> transcribe capital letters:
> 
> YO, DJ, YE, TSH, DH, ZH, CZ, CH, SH, SHH, YU, YA, FH, YH, GH, NG, TCZ

Absolutely true.  I skip the whole list only for the brevity: if we
find a solution for one letter the same solution will work fine for
all others.

> [...]
> With transcription we are basically stripping information from the data,
> mapping it into a smaller character set. The idea to keep them in
> CAP/CAP is to try to preserve as much information as possible.

I'm only afraid that things like "TWo CApitals" or "CamelCase" are
common among us computer geeks while they do not look great when
working with natural language and when displaying them to regular users
and even non-computer people.

> [...]
> So in fact we have two rules for each letter in the same file (System A
> and System B), where System A takes precedence.
> 
> I have a question then: isn't this more like a hack than a right thing
> to do?
> 
> Shouldn't we have two explicit rules for transcription and
> transliteration not dependent on a destination character set?

It's impossible with the current API of iconv.  Maybe it will become
possible at some point in the future but that's a greater amount of work
than what we are doing here now.  Again, for now a different set of rules
means a different locale.

I have another question: is it really a job of transliteration to preserve
all original information, to ensure no collisions and have the ability to
restore the original text?  I'm afraid that as long as plain ASCII is the
destination charset whatever system we provide it will always be possible
to provide a malicious combination of the Cyrillic characters proving that
the system generates collisions.

> > I still don't like the idea to
> > put two uppercase letters in a beginning of a word in titlecase only
> > to indicate that there was originally a single letter.  What if we:
> > 
> > * drop the rule of transliterating "Х" to "H" and transliterate
> > always to "X",
> This would contradict ISO 9.1995. (System A).

Yes, it would.  I'm trying to find solution here since I think we have
proved that we can't implement a system which will handle System A,
System B, and ensure no collisions at the same time.  At least one
requirement must be dropped (at least partially).

> System A was added on Marko's request (so setting him on TO:) I am
> neutral on keeping it or dropping it, just to be clear.

I think I didn't see this Marko's request but I'm in favor of keeping
System A, too.

Marko, it would be good to hear your opinion about System A vs. System B
again.

> [...]
> On the other hand, for my personal needs I care less about standards but
> about current functionality and data loss because of missing
> transcription altogether due to the BZ #2872.

I read this that you are open to a solution which is inspired by some
standards but does not implement them fully due to our technical
limitations.


19.11.2018 10:21 Egor Kobylkin <egor@kobylkin.com> wrote:
> [...]
> Marko,
> 
> Your example only covers _transliteration_ to Latin Diacritics
> [...]
> while BZ #2872 is about _transcription_ to ASCII
> [...]
> 
> So again, you are asking to have ISO 9.1995. System A but the bug is
> about ISO 9.1995. System B (GOST 7.79-2000)

It's hard to say what the original bug reporter meant but I think that the
problem is that there is no transliteration from Cyrillic to any variant of
Latin, except in few locales.  If System A was implemented but System B was
not then at least some characters would be handled correctly.  Currently no
Cyrillic characters are handled.


19.11.2018 20:35 Marko Myllynen <myllynen@redhat.com> wrote:
> [...]
> In any case once your patch lands I'm going to submit a follow-up patch
> for fi_FI to make it compliant with the applicable national standard
> (SFS 4900) which defines how to do Cyrillic transliteration /
> transcription in the context of Finnish.

I totally agree.  As far as I can see, SFS 4900 is more similar to
System A (ISO 9) rather than System B, that is, it transliterates to Latin
characters with diacritics rather than plain ASCII.  Marko, what is your
opinion about possible implementation of SFS 4900 in these cases:

* When the destination charset does not contain required Latin diacritic
  characters (e.g., it is plain ASCII)?
* When the output is ambiguous, that means, when two different Cyrillic
  strings produce the same Latin (or ASCII) output?

At the moment I am not curious about SFS 4900 but we are facing the same
problems now with ISO 9 and GOST 7.79.


1.12.2018 23:07 Rafal Luzynski <digitalfreak@lingonborough.com> wrote:
> [...]
> $ echo ХА ха | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> KHA kha
> 
> This means that the choice whether a digraph in the output should be
> all uppercase or maybe upper+lower is context based, something which we
> probably cannot implement.  But definitely a good thing.

I forgot to include this test which is really interesting:

$ echo ХА Ха ха | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN    
KHA Kha kha

which again confirms that the choice of all uppercase or just the first
letter uppercased is context based, a thing which we can't implement now.
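
For illustration only, here is a minimal Python sketch of what such
context-based casing could look like outside glibc.  The two-entry table
and the function name translit_context are hypothetical and cover only
the letters from the test above; this is not how uconv or glibc
implement it.

```python
# Hypothetical table: lowercase Cyrillic letter -> lowercase Latin output.
TABLE = {"х": "kh", "а": "a"}


def translit_context(text):
    """Transliterate, choosing the digraph case from the neighbouring letters."""
    out = []
    for i, ch in enumerate(text):
        latin = TABLE.get(ch.lower())
        if latin is None:
            out.append(ch)  # pass through anything not in the table
            continue
        if ch.isupper():
            nxt = text[i + 1] if i + 1 < len(text) else ""
            prev = text[i - 1] if i > 0 else ""
            # An adjacent uppercase letter means the word is all caps, so the
            # whole output is uppercased; otherwise only the first letter is.
            if nxt.isupper() or (not nxt.isalpha() and prev.isupper()):
                latin = latin.upper()
            else:
                latin = latin.capitalize()
        out.append(latin)
    return "".join(out)


print(translit_context("ХА Ха ха"))
```

The point of the sketch is that the rule needs to look at the characters
around the one being converted, which a per-character table cannot
express.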


1.12.2018 23:53 Egor Kobylkin <egor@kobylkin.com> wrote:
> 
> On 01.12.18 23:07, Rafal Luzynski wrote:
> > 
> > [...]
> > This makes me think: should we add a locale like ru_RU@SystemA or
> > ru_RU@SystemB?
> 
> Wouldn't it require to create 3 versions of every locale that would
> include the translit_cyrillic file then? I.e. en_US + en_US@SystemA,
> en_US@SystemB etc.?

OK, please read this as another brainstorming idea and let's just
forget it.

> [...]
> An example from my experience as a user - a networked device or host
> would often have the en_US as the default (only?) locale with no viable
> way to change it or install cyrillic fonts. Anyway, this is the most
> dire situation where the ASCII transliteration certainly helps most.
> Having en_US@SystemA or en_US@SystemB theoretically available but not
> compiled by the distributor wouldn't help here, would it?
> 
> So the only useful scenario here would be to ship your locales with the
> transliteration already included by default in en_US. This way the
> distributor won't have to get active to include transliteration as
> en_US@SystemA or en_US@SystemB.

Having the idea of "@SystemA" and "@SystemB" dropped I don't think
implementing any solution in glibc would be helpful for your use case.
Three reasons:

1. I believe that sooner or later someone will develop a transliteration
   system for en_US which will follow English transliteration of Russian
   instead of any standard we are discussing here.  That means, it would
   transliterate 'Х' as 'Kh' rather than 'H' or 'X'.
2. Currently there is a trend not to install even en_US locales and leave
   only C which is hardcoded into glibc binaries.  OTOH, I wouldn't mind
   if ISO 9 was hardcoded into C as well.
3. That's beyond Russian language but transliteration according to Serbian
   or Bulgarian or Ukrainian or Kazakh rules still requires installing their
   proper locales.  I think that requiring ru_RU to be installed could be
   reasonable especially if we end up with ru_RU somehow differing from
   the default "translit_cyrillic".

BTW you don't need Cyrillic fonts to be installed on your server in order
to process the Cyrillic text correctly unless your server renders the text.


3.12.2018 23:19 Egor Kobylkin <egor@kobylkin.com> wrote:
> 
> Rafal,
> 
> Just to touch base on this, what is the best way forward? Did you get
> any input/feedback on your questions below? Are you expecting input from
> anyone but myself?

Yes, I expected some input from more experienced maintainers about whether
and how to write the tests but I'd rather start another thread about it
because this one is too long already.

> On the blocking issue #2: I really don’t see the connection to the uk_UA
> locale that has its transliteration table inline and is explicitly
> excluded from my patch. It may be revealing  another issue you have with
> glibc but wouldn’t that be better addressed in a new bug?

OK, I was not precise enough (I'm sorry about it) so I'd like to explain
here:

1. In the long term goal I would like to convert those excluded locales
   to use your translit_cyrillic as well.
2. In order to ensure that change is not destructive for them I will need
   automatic tests to prove that their transliteration rules work
   equally well before and after the change.
3. It does not matter that converting those other locales is in a distant
   future because we need the same tests for Russian language now.
4. Even though I have not started writing any tests I can see they
   will be failing for uk_UA.  The reason is that glibc transliteration
   rules can handle transliterating single characters into single
   characters, single characters into multiple characters but not
   multiple characters into multiple (or even single) characters.
5. We can ignore uk_UA but we will face the same case in ru_RU where
   you had a case of 'У́ ' ('У' + 'COMBINING ACUTE ACCENT').
6. So the question was: how (and whether) to write the tests if we
   already know they would be failing?  Skip them?  Resolve the other
   issue first?  Mark them as XFAIL?
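
To illustrate point 5, a short Python check (not glibc code) showing that
'У' with an acute accent is a sequence of two code points with no
precomposed form, so a rule keyed on a single source character cannot
match it:

```python
import unicodedata

seq = "\u0423\u0301"  # CYRILLIC CAPITAL LETTER U + COMBINING ACUTE ACCENT

# NFC composes a base letter and an accent when a precomposed character exists.
nfc = unicodedata.normalize("NFC", seq)

for ch in nfc:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The sequence still consists of two code points after NFC normalization.
print(len(nfc))
```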

In the meantime, you have removed the controversial conversion rule
of 'У' with the acute accent:

> Again, in the v10 of my patch I have removed multicharacter source
> graphemes, so that issue is moot there.

so we can move to the next step.

> If you’d like to overhaul the glibc translit system wouldn’t it be
> better to commit the simple text file with the Cyrillic
> translit(transcription) table first, fix the bug from the year 2006 and
> then proceed from there with all due diligence?

I agree and we are now one step forward.

> The same with having both System A and System B.  Initially I went along
> with the suggestion to include the system A but it is clear now that it
> doesn’t make fixing [BZ #2872] more straightforward. So I’d also propose
> to set it aside for the moment and use the v10 without the system A.
> That is the whole reason I have submitted it, to be superclear on that.

OK, I think that now I understand your reason to drop System A better.
But still I'd like to rethink implementing System A somehow and drop
(or rather: implement only partially) System B.

> Now you saw that uconv is transcribing «ХА» as KHA (cap/cap/cap) that
> should mitigate your concern about that issue too (somewhat, anyway).
> Making it context based would also be about adding new code, see above.

It would also require the changes in the syntax of the source code
of locale data and possibly breaking the POSIX compatibility which
I think would be unacceptable.

> Let me know if there’s anything I can help with getting more progress
> with the decision

I'm afraid you can't help more.  I'd like to hear some feedback from other
people.  Due to some minor obstacles the two of us can't resolve this
issue alone.

Regards,

Rafal
Marko Myllynen Dec. 10, 2018, 9:20 p.m. UTC | #10
Hi,

On 08/12/2018 03.15, Rafal Luzynski wrote:
> 17.11.2018 19:34 Egor Kobylkin <egor@kobylkin.com> wrote:
>>
>> The SH/Sh can be decided on either way - seems like an easy change any
>> way.
> 
> I'm in favor of "Sh" because it will work fine for titlecased words
> (where only the first letter is uppercase) but I'm aware it would be
> a problem for uppercased words.  Unfortunately, I think we are unable
> to satisfy both cases.

I think I'm in favor of "Sh" as well, although not perfect I'd assume
it's probably going to be correct in more cases than SH.

>> System A was added on Marko's request (so setting him on TO:) I am
>> neutral on keeping it or dropping it, just to be clear.
> 
> I think I didn't see this Marko's request but I'm in favor of keeping
> System A, too.
> 
> Marko, it would be good to hear your opinion about System A vs. System B
> again.

I think System A is a better option as it should be the same as ISO 9
and perhaps also produces results in some cases which are more expected
than with System B (if the Wikipedia ISO 9 article is to be believed).

Wrt BZ #2872 I think it's good to keep it in mind but IMHO we can also
deviate from it if needed, however with System A + ASCII fallback
definitions the RFE should be satisfied as well?

> 19.11.2018 20:35 Marko Myllynen <myllynen@redhat.com> wrote:
>> [...]
>> In any case once your patch lands I'm going to submit a follow-up patch
>> for fi_FI to make it compliant with the applicable national standard
>> (SFS 4900) which defines how to do Cyrillic transliteration /
>> transcription in the context of Finnish.
> 
> I totally agree.  As far as I can see, SFS 4900 is more similar to
> System A (ISO 9) rather than System B, that is, it transliterates to Latin
> characters with diacritics rather than plain ASCII.  Marko, what is your
> opinion about possible implementation of SFS 4900 in these cases:
> 
> * When the destination charset does not contain required Latin diacritic
>   characters (e.g., it is plain ASCII)?

This would be according to http://jkorpela.fi/iso9.html8 so for example
zh instead of ž and shtsh instead of štš.

> * When the output is ambiguous, that means, when two different Cyrillic
>   strings produce the same Latin (or ASCII) output?

This is a good point and one I haven't considered but I'm not sure is
there anything we can do about this (at least without major locale
system internals work)? Do you have any rough idea how frequently this
could happen or is this more a theoretical issue? (Sorry if I've missed
earlier comments about this, it's been a long thread.)

>> The same with having both System A and System B.  Initially I went along
>> with the suggestion to include the system A but it is clear now that it
>> doesn’t make fixing [BZ #2872] more straightforward. So I’d also propose
>> to set it aside for the moment and use the v10 without the system A.
>> That is the whole reason I have submitted it, to be superclear on that.
> 
> OK, I think that now I understand your reason to drop System A better.
> But still I'd like to rethink implementing System A somehow and drop
> (or rather: implement only partially) System B.

Yes, I also think System A AKA ISO 9 would be a better choice but I'll
leave the final decision for you two (and others who might weigh in).

Thanks,
Rafal Luzynski Dec. 19, 2018, 10:25 p.m. UTC | #11
10.12.2018 22:20 Marko Myllynen <myllynen@redhat.com> wrote:
> 
> Hi,
> 
> On 08/12/2018 03.15, Rafal Luzynski wrote:
> > [...]
> > Marko, it would be good to hear your opinion about System A vs. System B
> > again.
> 
> I think System A is a better option as it should be the same as ISO 9
> and perhaps also produces results in some cases which are more expected
> than with System B (if the Wikipedia ISO 9 article is to be believed).
> 
> Wrt BZ #2872 I think it's good to keep it in mind but IMHO we can also
> deviate from it if needed, however with System A + ASCII fallback
> definitions the RFE should be satisfied as well?

That's exactly what I meant (sorry if it was not clear before).

> > [...]  Marko, what is your
> > opinion about possible implementation of SFS 4900 in these cases:
> > 
> > * When the destination charset does not contain required Latin diacritic
> >   characters (e.g., it is plain ASCII)?
> 
> This would be according to http://jkorpela.fi/iso9.html8 so for example
> zh instead of ž and shtsh instead of štš.

Agree.

> > * When the output is ambiguous, that means, when two different Cyrillic
> >   strings produce the same Latin (or ASCII) output?
> 
> This is a good point and one I haven't considered but I'm not sure is
> there anything we can do about this (at least without major locale
> system internals work)?

I agree with the suggestion that we can't do much about it.  I mean,
there are possibly solutions (like using more punctuation characters)
but they don't look natural to me.

> Do you have any rough idea how frequently this
> could happen or is this more a theoretical issue? (Sorry if I've missed
> earlier comments about this, it's been a long thread.)

Yes, Egor provided this example many times:

"схема" -> "shema" (if "с" -> "s" and "х" -> "h")
"шема"  -> "shema" (if "ш" -> "sh")

I don't think that it matters how frequent these cases are.  I think that
the question is if ambiguity is a bug because if yes then even one corner
case proves that the solution is wrong.
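
The collision can be reproduced with a few lines of Python.  The table
below is a hypothetical subset chosen only for this example (it is not
the table from the patch), assuming "х" -> "h" and "ш" -> "sh":

```python
# Hypothetical subset of an ASCII table where "х" -> "h" and "ш" -> "sh".
TABLE = {"с": "s", "х": "h", "ш": "sh", "е": "e", "м": "m", "а": "a"}


def to_ascii(word):
    """Map each Cyrillic letter independently, as a per-character table does."""
    return "".join(TABLE.get(ch, ch) for ch in word)


for word in ("схема", "шема"):
    print(word, "->", to_ascii(word))
```

Both words come out as the same ASCII string, so the original cannot be
restored from the output.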

> [...]
> Yes, I also think System A AKA ISO 9 would be a better choice but I'll
> leave the final decision for you two (and others who might weigh in).

Egor is a native speaker so I respect his opinion even if I'm not fully
convinced for technical reasons.  Sadly, nobody else provides any opinion
which could weigh.  I am going to write a separate email about it.

Regards,

Rafal
Diego (Egor) Kobylkin Dec. 19, 2018, 10:48 p.m. UTC | #12
On 19.12.18 23:25, Rafal Luzynski wrote:
> 10.12.2018 22:20 Marko Myllynen <myllynen@redhat.com> wrote:
> 
>> [...]
>> Yes, I also think System A AKA ISO 9 would be a better choice but I'll
>> leave the final decision for you two (and others who might weigh in).
> 
> Egor is a native speaker so I respect his opinion even if I'm not fully
> convinced for technical reasons.  Sadly, nobody else provides any opinion
> which could weigh.  I am going to write a separate email about it.
> 
> Regards,
> 
> Rafal
> 
It's not about which letter should be used for a particular
transliteration. I couldn't care less about that just to be clear.

Maybe I am missing something; could you tell me how exactly you want to
fit System A to ASCII?

Let's take the very first example from the table:
CyrillicUnicode	CyrillicLetter	CyrillicUnicodeName	LatinUnicode	System A Latin Letter	System B ASCII Letter
0401	Ё	CYRILLIC CAPITAL LETTER IO	00CB	Ë	YO

so:
Cyrillic Ё U0401
System A - Ë U00CB -  _not_ ASCII
System B - YO (or Yo) "<U0059><U004F>" - ASCII

Could you explain how we can make System A "Ë" be displayed or
processed somehow in a C locale? Or in a locale or program that doesn't
have "Ë" U00CB?

Bests,
Egor
Rafal Luzynski Dec. 19, 2018, 11:50 p.m. UTC | #13
19.12.2018 23:48 Egor Kobylkin <egor@kobylkin.com> wrote:
> [...]
> Maybe I am missing something; could you tell me how exactly you want to
> fit System A to ASCII?
> 
> Let's take the very first example from the table:
> CyrillicUnicode	CyrillicLetter	CyrillicUnicodeName	LatinUnicode	System A Latin Letter	System B ASCII Letter
> 0401	Ё	CYRILLIC CAPITAL LETTER IO	00CB	Ë	YO
> 
> so:
> Cyrillic Ё U0401
> System A - Ë U00CB -  _not_ ASCII
> System B - YO (or Yo) "<U0059><U004F>" - ASCII
> 
> Could you explain how we can make System A "Ë" be displayed or
> processed somehow in a C locale? Or in a locale or program that doesn't
> have "Ë" U00CB?

It should be "YO" (or "Yo").  Exactly as you provided in your previous
patches.

I am afraid that my description "Cyrillic -> Latin -> ASCII" was too
ambiguous, I am sorry about it.  Actually it is a list which says:
Convert Cyrillic "Ё" into Latin "Ë" if possible, otherwise to "YO" ("Yo").
We may stop using "Cyrillic -> Latin -> ASCII" picture as too ambiguous
and invent a better one.
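
As a rough illustration of that ordered list, here is a Python sketch
(not glibc code; the CANDIDATES table and the translit function are
hypothetical).  For each character it takes the first candidate which
the target charset can represent:

```python
# Hypothetical candidate list for one letter, in order of preference:
# the System A result first, then the ASCII fallback.
CANDIDATES = {"Ё": ["Ë", "YO"]}


def translit(text, charset):
    """Pick, per character, the first candidate the target charset can encode."""
    out = []
    for ch in text:
        # The original character is tried first, then its candidates in order.
        for cand in [ch] + CANDIDATES.get(ch, []):
            try:
                cand.encode(charset)
            except UnicodeEncodeError:
                continue  # not representable, try the next candidate
            out.append(cand)
            break
        else:
            out.append("?")  # no candidate fits the target charset
    return "".join(out)


print(translit("Ё", "utf-8"))       # the original letter is representable
print(translit("Ё", "iso-8859-1"))  # the Latin letter with diaeresis fits
print(translit("Ё", "ascii"))       # only the ASCII digraph fits
```

With this reading System A and the ASCII fallback do not conflict: the
target charset decides which candidate is used.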

Regards,

Rafal

Patch

From a8ae30e0bf7484f4c0f034480110c81dd059b69e Mon Sep 17 00:00:00 2001
From: Egor Kobylkin <egor@kobylkin.com>
Date: Wed, 14 Nov 2018 22:10:37 +0100
Subject: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]

	[BZ #2872]
	* localedata/locales/translit_cyrillic: New file. Supports
	ISO 9.1995, GOST 7.79 System A transliteration System B
	transcription table from Cyrillic to Latin/ASCII.
	* localedata/locales/aa_DJ: Add 'include "translit_cyrillic";""'
	to LC_CTYPE translit section.
	* localedata/locales/af_ZA: Likewise.
	* localedata/locales/ak_GH: Likewise.
	* localedata/locales/am_ET: Likewise.
	* localedata/locales/ar_EG: Likewise.
	* localedata/locales/be_BY: Likewise.
	* localedata/locales/bem_ZM: Likewise.
	* localedata/locales/ber_DZ: Likewise.
	* localedata/locales/ber_MA: Likewise.
	* localedata/locales/bg_BG: Likewise.
	* localedata/locales/bi_VU: Likewise.
	* localedata/locales/bn_BD: Likewise.
	* localedata/locales/bo_CN: Likewise.
	* localedata/locales/ca_ES: Likewise.
	* localedata/locales/ce_RU: Likewise.
	* localedata/locales/cmn_TW: Likewise.
	* localedata/locales/cs_CZ: Likewise.
	* localedata/locales/cv_RU: Likewise.
	* localedata/locales/cy_GB: Likewise.
	* localedata/locales/da_DK: Likewise.
	* localedata/locales/de_DE: Likewise.
	* localedata/locales/dv_MV: Likewise.
	* localedata/locales/dz_BT: Likewise.
	* localedata/locales/el_GR: Likewise.
	* localedata/locales/en_GB: Likewise.
	* localedata/locales/en_NG: Likewise.
	* localedata/locales/en_ZM: Likewise.
	* localedata/locales/es_CU: Likewise.
	* localedata/locales/es_ES: Likewise.
	* localedata/locales/et_EE: Likewise.
	* localedata/locales/fa_IR: Likewise.
	* localedata/locales/ff_SN: Likewise.
	* localedata/locales/fi_FI: Likewise.
	* localedata/locales/fr_FR: Likewise.
	* localedata/locales/ga_IE: Likewise.
	* localedata/locales/gd_GB: Likewise.
	* localedata/locales/gu_IN: Likewise.
	* localedata/locales/gv_GB: Likewise.
	* localedata/locales/he_IL: Likewise.
	* localedata/locales/hi_IN: Likewise.
	* localedata/locales/hif_FJ: Likewise.
	* localedata/locales/hr_HR: Likewise.
	* localedata/locales/ht_HT: Likewise.
	* localedata/locales/hu_HU: Likewise.
	* localedata/locales/hy_AM: Likewise.
	* localedata/locales/id_ID: Likewise.
	* localedata/locales/is_IS: Likewise.
	* localedata/locales/it_IT: Likewise.
	* localedata/locales/ja_JP: Likewise.
	* localedata/locales/kab_DZ: Likewise.
	* localedata/locales/kk_KZ: Likewise.
	* localedata/locales/km_KH: Likewise.
	* localedata/locales/kn_IN: Likewise.
	* localedata/locales/ko_KR: Likewise.
	* localedata/locales/ks_IN: Likewise.
	* localedata/locales/kw_GB: Likewise.
	* localedata/locales/ky_KG: Likewise.
	* localedata/locales/lb_LU: Likewise.
	* localedata/locales/lg_UG: Likewise.
	* localedata/locales/lij_IT: Likewise.
	* localedata/locales/ln_CD: Likewise.
	* localedata/locales/lo_LA: Likewise.
	* localedata/locales/lt_LT: Likewise.
	* localedata/locales/lv_LV: Likewise.
	* localedata/locales/mg_MG: Likewise.
	* localedata/locales/mhr_RU: Likewise.
	* localedata/locales/mk_MK: Likewise.
	* localedata/locales/ml_IN: Likewise.
	* localedata/locales/ms_MY: Likewise.
	* localedata/locales/mt_MT: Likewise.
	* localedata/locales/nan_TW@latin: Likewise.
	* localedata/locales/nb_NO: Likewise.
	* localedata/locales/ne_NP: Likewise.
	* localedata/locales/nhn_MX: Likewise.
	* localedata/locales/niu_NU: Likewise.
	* localedata/locales/niu_NZ: Likewise.
	* localedata/locales/nl_NL: Likewise.
	* localedata/locales/nr_ZA: Likewise.
	* localedata/locales/oc_FR: Likewise.
	* localedata/locales/om_KE: Likewise.
	* localedata/locales/or_IN: Likewise.
	* localedata/locales/os_RU: Likewise.
	* localedata/locales/pa_IN: Likewise.
	* localedata/locales/pa_PK: Likewise.
	* localedata/locales/pl_PL: Likewise.
	* localedata/locales/pt_PT: Likewise.
	* localedata/locales/quz_PE: Likewise.
	* localedata/locales/ro_RO: Likewise.
	* localedata/locales/ru_RU: Likewise.
	* localedata/locales/rw_RW: Likewise.
	* localedata/locales/sa_IN: Likewise.
	* localedata/locales/sd_IN: Likewise.
	* localedata/locales/sd_IN@devanagari: Likewise.
	* localedata/locales/se_NO: Likewise.
	* localedata/locales/sgs_LT: Likewise.
	* localedata/locales/shn_MM: Likewise.
	* localedata/locales/si_LK: Likewise.
	* localedata/locales/sk_SK: Likewise.
	* localedata/locales/sl_SI: Likewise.
	* localedata/locales/sm_WS: Likewise.
	* localedata/locales/so_SO: Likewise.
	* localedata/locales/sq_AL: Likewise.
	* localedata/locales/ss_ZA: Likewise.
	* localedata/locales/st_ZA: Likewise.
	* localedata/locales/sv_SE: Likewise.
	* localedata/locales/sw_KE: Likewise.
	* localedata/locales/ta_IN: Likewise.
	* localedata/locales/te_IN: Likewise.
	* localedata/locales/th_TH: Likewise.
	* localedata/locales/ti_ET: Likewise.
	* localedata/locales/tn_ZA: Likewise.
	* localedata/locales/to_TO: Likewise.
	* localedata/locales/tpi_PG: Likewise.
	* localedata/locales/tr_TR: Likewise.
	* localedata/locales/ts_ZA: Likewise.
	* localedata/locales/unm_US: Likewise.
	* localedata/locales/ur_IN: Likewise.
	* localedata/locales/ur_PK: Likewise.
	* localedata/locales/ve_ZA: Likewise.
	* localedata/locales/vi_VN: Likewise.
	* localedata/locales/wa_BE: Likewise.
	* localedata/locales/wo_SN: Likewise.
	* localedata/locales/xh_ZA: Likewise.
	* localedata/locales/yi_US: Likewise.
	* localedata/locales/yuw_PG: Likewise.
	* localedata/locales/zh_CN: Likewise.
	* localedata/locales/zu_ZA: Likewise.
---
 localedata/locales/aa_DJ             |   1 +
 localedata/locales/af_ZA             |   1 +
 localedata/locales/ak_GH             |   1 +
 localedata/locales/am_ET             |   1 +
 localedata/locales/ar_EG             |   1 +
 localedata/locales/be_BY             |   1 +
 localedata/locales/bem_ZM            |   1 +
 localedata/locales/ber_DZ            |   1 +
 localedata/locales/ber_MA            |   1 +
 localedata/locales/bg_BG             |   1 +
 localedata/locales/bi_VU             |   1 +
 localedata/locales/bn_BD             |   1 +
 localedata/locales/bo_CN             |   1 +
 localedata/locales/ca_ES             |   1 +
 localedata/locales/ce_RU             |   1 +
 localedata/locales/cs_CZ             |   1 +
 localedata/locales/cv_RU             |   1 +
 localedata/locales/cy_GB             |   1 +
 localedata/locales/da_DK             |   1 +
 localedata/locales/de_DE             |   1 +
 localedata/locales/dv_MV             |   1 +
 localedata/locales/dz_BT             |   1 +
 localedata/locales/el_GR             |   1 +
 localedata/locales/en_GB             |   1 +
 localedata/locales/en_NG             |   1 +
 localedata/locales/en_ZM             |   1 +
 localedata/locales/es_CU             |   1 +
 localedata/locales/es_ES             |   1 +
 localedata/locales/et_EE             |   1 +
 localedata/locales/fa_IR             |   1 +
 localedata/locales/ff_SN             |   1 +
 localedata/locales/fi_FI             |   1 +
 localedata/locales/fr_FR             |   1 +
 localedata/locales/ga_IE             |   1 +
 localedata/locales/gd_GB             |   1 +
 localedata/locales/gu_IN             |   1 +
 localedata/locales/gv_GB             |   1 +
 localedata/locales/he_IL             |   1 +
 localedata/locales/hi_IN             |   1 +
 localedata/locales/hif_FJ            |   1 +
 localedata/locales/hr_HR             |   1 +
 localedata/locales/ht_HT             |   1 +
 localedata/locales/hu_HU             |   1 +
 localedata/locales/hy_AM             |   1 +
 localedata/locales/id_ID             |   1 +
 localedata/locales/is_IS             |   1 +
 localedata/locales/it_IT             |   1 +
 localedata/locales/ja_JP             |   1 +
 localedata/locales/kab_DZ            |   1 +
 localedata/locales/kk_KZ             |   1 +
 localedata/locales/km_KH             |   1 +
 localedata/locales/kn_IN             |   1 +
 localedata/locales/ko_KR             |   1 +
 localedata/locales/ks_IN             |   1 +
 localedata/locales/kw_GB             |   1 +
 localedata/locales/ky_KG             |   1 +
 localedata/locales/lb_LU             |   1 +
 localedata/locales/lg_UG             |   1 +
 localedata/locales/lij_IT            |   1 +
 localedata/locales/ln_CD             |   1 +
 localedata/locales/lo_LA             |   1 +
 localedata/locales/lt_LT             |   1 +
 localedata/locales/lv_LV             |   1 +
 localedata/locales/mg_MG             |   1 +
 localedata/locales/mhr_RU            |   1 +
 localedata/locales/mk_MK             |   1 +
 localedata/locales/ml_IN             |   1 +
 localedata/locales/ms_MY             |   1 +
 localedata/locales/mt_MT             |   1 +
 localedata/locales/nan_TW@latin      |   1 +
 localedata/locales/nb_NO             |   1 +
 localedata/locales/ne_NP             |   1 +
 localedata/locales/nhn_MX            |   1 +
 localedata/locales/niu_NU            |   1 +
 localedata/locales/niu_NZ            |   1 +
 localedata/locales/nl_NL             |   1 +
 localedata/locales/nr_ZA             |   1 +
 localedata/locales/oc_FR             |   1 +
 localedata/locales/om_KE             |   1 +
 localedata/locales/or_IN             |   1 +
 localedata/locales/os_RU             |   1 +
 localedata/locales/pa_IN             |   1 +
 localedata/locales/pa_PK             |   1 +
 localedata/locales/pl_PL             |   1 +
 localedata/locales/pt_PT             |   1 +
 localedata/locales/quz_PE            |   1 +
 localedata/locales/ro_RO             |   1 +
 localedata/locales/ru_RU             |   1 +
 localedata/locales/rw_RW             |   1 +
 localedata/locales/sa_IN             |   1 +
 localedata/locales/sd_IN             |   1 +
 localedata/locales/sd_IN@devanagari  |   1 +
 localedata/locales/se_NO             |   1 +
 localedata/locales/sgs_LT            |   1 +
 localedata/locales/shn_MM            |   1 +
 localedata/locales/si_LK             |   1 +
 localedata/locales/sk_SK             |   1 +
 localedata/locales/sl_SI             |   1 +
 localedata/locales/sm_WS             |   1 +
 localedata/locales/so_SO             |   1 +
 localedata/locales/sq_AL             |   1 +
 localedata/locales/ss_ZA             |   1 +
 localedata/locales/st_ZA             |   1 +
 localedata/locales/sv_SE             |   1 +
 localedata/locales/sw_KE             |   1 +
 localedata/locales/ta_IN             |   1 +
 localedata/locales/te_IN             |   1 +
 localedata/locales/th_TH             |   1 +
 localedata/locales/ti_ET             |   1 +
 localedata/locales/tn_ZA             |   1 +
 localedata/locales/to_TO             |   1 +
 localedata/locales/tpi_PG            |   1 +
 localedata/locales/tr_TR             |   1 +
 localedata/locales/translit_cyrillic | 383 +++++++++++++++++++++++++++
 localedata/locales/ts_ZA             |   1 +
 localedata/locales/unm_US            |   1 +
 localedata/locales/ur_IN             |   1 +
 localedata/locales/ur_PK             |   1 +
 localedata/locales/ve_ZA             |   1 +
 localedata/locales/vi_VN             |   1 +
 localedata/locales/wa_BE             |   1 +
 localedata/locales/wo_SN             |   1 +
 localedata/locales/xh_ZA             |   1 +
 localedata/locales/yi_US             |   1 +
 localedata/locales/yuw_PG            |   1 +
 localedata/locales/zh_CN             |   1 +
 localedata/locales/zu_ZA             |   1 +
 127 files changed, 509 insertions(+)
 create mode 100644 localedata/locales/translit_cyrillic

diff --git a/localedata/locales/aa_DJ b/localedata/locales/aa_DJ
index fcb9af8abc..533e5b714e 100644
--- a/localedata/locales/aa_DJ
+++ b/localedata/locales/aa_DJ
@@ -68,6 +68,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/af_ZA b/localedata/locales/af_ZA
index 2f45ddad63..d16bbcf707 100644
--- a/localedata/locales/af_ZA
+++ b/localedata/locales/af_ZA
@@ -70,6 +70,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/ak_GH b/localedata/locales/ak_GH
index 926e4df343..d743ba48c7 100644
--- a/localedata/locales/ak_GH
+++ b/localedata/locales/ak_GH
@@ -54,6 +54,7 @@  LC_CTYPE
 copy "i18n"
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/am_ET b/localedata/locales/am_ET
index e5fe88a4cd..bee494be0a 100644
--- a/localedata/locales/am_ET
+++ b/localedata/locales/am_ET
@@ -96,6 +96,7 @@  copy "i18n"
 space <U1361>
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 
 % hoy-sadis followed by a vowel
 <U1205><U12A0>    <U0068><U0027><U0065>
diff --git a/localedata/locales/ar_EG b/localedata/locales/ar_EG
index c8cb3180bf..f2584cd7ad 100644
--- a/localedata/locales/ar_EG
+++ b/localedata/locales/ar_EG
@@ -44,6 +44,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/be_BY b/localedata/locales/be_BY
index 324379b65a..4fb16d3540 100644
--- a/localedata/locales/be_BY
+++ b/localedata/locales/be_BY
@@ -91,6 +91,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/bem_ZM b/localedata/locales/bem_ZM
index fa43ad1610..7a8c3c3b77 100644
--- a/localedata/locales/bem_ZM
+++ b/localedata/locales/bem_ZM
@@ -41,6 +41,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/ber_DZ b/localedata/locales/ber_DZ
index 79f3d289b1..137643873d 100644
--- a/localedata/locales/ber_DZ
+++ b/localedata/locales/ber_DZ
@@ -136,6 +136,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/ber_MA b/localedata/locales/ber_MA
index b9bd64868c..fd79bf11d6 100644
--- a/localedata/locales/ber_MA
+++ b/localedata/locales/ber_MA
@@ -83,6 +83,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/bg_BG b/localedata/locales/bg_BG
index 7a9cfa0a5d..504199a4d9 100644
--- a/localedata/locales/bg_BG
+++ b/localedata/locales/bg_BG
@@ -49,6 +49,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/bi_VU b/localedata/locales/bi_VU
index 88bf70a61b..81d717b2f6 100755
--- a/localedata/locales/bi_VU
+++ b/localedata/locales/bi_VU
@@ -39,6 +39,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/bn_BD b/localedata/locales/bn_BD
index 73efd1cbc3..bc82d611e0 100644
--- a/localedata/locales/bn_BD
+++ b/localedata/locales/bn_BD
@@ -61,6 +61,7 @@  map to_inpunct; /
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/bo_CN b/localedata/locales/bo_CN
index 90cbc7807b..7779d3d99b 100644
--- a/localedata/locales/bo_CN
+++ b/localedata/locales/bo_CN
@@ -43,6 +43,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/ca_ES b/localedata/locales/ca_ES
index 0ba74ccf33..af72a1ab86 100644
--- a/localedata/locales/ca_ES
+++ b/localedata/locales/ca_ES
@@ -57,6 +57,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/ce_RU b/localedata/locales/ce_RU
index 03e60f838a..75ef80498d 100644
--- a/localedata/locales/ce_RU
+++ b/localedata/locales/ce_RU
@@ -38,6 +38,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/cs_CZ b/localedata/locales/cs_CZ
index 41fbd2be93..9450d22f2f 100644
--- a/localedata/locales/cs_CZ
+++ b/localedata/locales/cs_CZ
@@ -215,6 +215,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/cv_RU b/localedata/locales/cv_RU
index e9247b39f8..253cbd63af 100644
--- a/localedata/locales/cv_RU
+++ b/localedata/locales/cv_RU
@@ -103,6 +103,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/cy_GB b/localedata/locales/cy_GB
index 5f6fd7c87f..6d35d7c27e 100644
--- a/localedata/locales/cy_GB
+++ b/localedata/locales/cy_GB
@@ -65,6 +65,7 @@  LC_CTYPE
 copy "i18n"
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/da_DK b/localedata/locales/da_DK
index 05a2681bef..1b38e8af17 100644
--- a/localedata/locales/da_DK
+++ b/localedata/locales/da_DK
@@ -147,6 +147,7 @@  copy "i18n"
 translit_start
 
 include "translit_combining";""
+include "translit_cyrillic";""
 
 % LATIN CAPITAL LETTER A WITH DIAERESIS -> "AE"
 <U00C4> "<U0041><U0308>";"<U0041><U0045>"
diff --git a/localedata/locales/de_DE b/localedata/locales/de_DE
index eaa9f7ff8e..85793437a5 100644
--- a/localedata/locales/de_DE
+++ b/localedata/locales/de_DE
@@ -44,6 +44,7 @@  copy "i18n"
 translit_start
 
 include "translit_combining";""
+include "translit_cyrillic";""
 
 % German umlauts.
 % LATIN CAPITAL LETTER A WITH DIAERESIS.
diff --git a/localedata/locales/dv_MV b/localedata/locales/dv_MV
index 0d7842f39f..f9c8de4a50 100644
--- a/localedata/locales/dv_MV
+++ b/localedata/locales/dv_MV
@@ -49,6 +49,7 @@  LC_CTYPE
 copy "i18n"
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 
 
 translit_end
diff --git a/localedata/locales/dz_BT b/localedata/locales/dz_BT
index 272fa7e78f..31d488ad0c 100644
--- a/localedata/locales/dz_BT
+++ b/localedata/locales/dz_BT
@@ -59,6 +59,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/el_GR b/localedata/locales/el_GR
index 7362492fbd..994a4a913d 100644
--- a/localedata/locales/el_GR
+++ b/localedata/locales/el_GR
@@ -58,6 +58,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/en_GB b/localedata/locales/en_GB
index 5b895574ac..2f1cc5904b 100644
--- a/localedata/locales/en_GB
+++ b/localedata/locales/en_GB
@@ -54,6 +54,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/en_NG b/localedata/locales/en_NG
index 109201c2fe..fa70ffe943 100644
--- a/localedata/locales/en_NG
+++ b/localedata/locales/en_NG
@@ -49,6 +49,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/en_ZM b/localedata/locales/en_ZM
index 8957d8e8aa..1fc5dfed65 100644
--- a/localedata/locales/en_ZM
+++ b/localedata/locales/en_ZM
@@ -41,6 +41,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/es_CU b/localedata/locales/es_CU
index d37d452b0f..90c714ea18 100644
--- a/localedata/locales/es_CU
+++ b/localedata/locales/es_CU
@@ -58,6 +58,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/es_ES b/localedata/locales/es_ES
index aa919a2626..534152d0a8 100644
--- a/localedata/locales/es_ES
+++ b/localedata/locales/es_ES
@@ -107,6 +107,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/et_EE b/localedata/locales/et_EE
index f5c47149a6..51e6a4ab13 100644
--- a/localedata/locales/et_EE
+++ b/localedata/locales/et_EE
@@ -113,6 +113,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/fa_IR b/localedata/locales/fa_IR
index 3714a30932..fdeaf6312e 100644
--- a/localedata/locales/fa_IR
+++ b/localedata/locales/fa_IR
@@ -78,6 +78,7 @@  map to_outpunct; /
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/ff_SN b/localedata/locales/ff_SN
index e4b18eba7b..32e2eb78d8 100644
--- a/localedata/locales/ff_SN
+++ b/localedata/locales/ff_SN
@@ -41,6 +41,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/fi_FI b/localedata/locales/fi_FI
index eeb278316b..57eda9bff1 100644
--- a/localedata/locales/fi_FI
+++ b/localedata/locales/fi_FI
@@ -177,6 +177,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/fr_FR b/localedata/locales/fr_FR
index a18c514f19..098be4906f 100644
--- a/localedata/locales/fr_FR
+++ b/localedata/locales/fr_FR
@@ -57,6 +57,7 @@  translit_start
 
 % In France, accents are simply omitted if they cannot be represented.
 include "translit_combining";""
+include "translit_cyrillic";""
 
 translit_end
 
diff --git a/localedata/locales/ga_IE b/localedata/locales/ga_IE
index 782adbaa5c..d430028b74 100644
--- a/localedata/locales/ga_IE
+++ b/localedata/locales/ga_IE
@@ -53,6 +53,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/gd_GB b/localedata/locales/gd_GB
index 8d54593113..aaa41a0bda 100644
--- a/localedata/locales/gd_GB
+++ b/localedata/locales/gd_GB
@@ -45,6 +45,7 @@  LC_CTYPE
 copy "i18n"
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/gu_IN b/localedata/locales/gu_IN
index cd7e23a4be..00f00d4f8d 100644
--- a/localedata/locales/gu_IN
+++ b/localedata/locales/gu_IN
@@ -62,6 +62,7 @@  map to_inpunct; /
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/gv_GB b/localedata/locales/gv_GB
index 473c043cba..3c6ba93629 100644
--- a/localedata/locales/gv_GB
+++ b/localedata/locales/gv_GB
@@ -56,6 +56,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/he_IL b/localedata/locales/he_IL
index 52b5a6bff0..82a0760c10 100644
--- a/localedata/locales/he_IL
+++ b/localedata/locales/he_IL
@@ -58,6 +58,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/hi_IN b/localedata/locales/hi_IN
index a94365519f..12a44e6689 100644
--- a/localedata/locales/hi_IN
+++ b/localedata/locales/hi_IN
@@ -61,6 +61,7 @@  map to_inpunct; /
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/hif_FJ b/localedata/locales/hif_FJ
index 5433bb4a2a..005ac6d308 100644
--- a/localedata/locales/hif_FJ
+++ b/localedata/locales/hif_FJ
@@ -37,6 +37,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/hr_HR b/localedata/locales/hr_HR
index 029a3794e2..8222d73ff0 100644
--- a/localedata/locales/hr_HR
+++ b/localedata/locales/hr_HR
@@ -46,6 +46,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 
 % Historicaly we used ISO-8869-2 and wrote digraphs
 % <U01C6> {dž}, <U01C9> {lj} and <U01CC> {nj}
diff --git a/localedata/locales/ht_HT b/localedata/locales/ht_HT
index 0e0a79d2f1..69688a401e 100644
--- a/localedata/locales/ht_HT
+++ b/localedata/locales/ht_HT
@@ -57,6 +57,7 @@  LC_CTYPE
 copy "i18n"
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/hu_HU b/localedata/locales/hu_HU
index 9d6bb85022..5e19e5b689 100644
--- a/localedata/locales/hu_HU
+++ b/localedata/locales/hu_HU
@@ -455,6 +455,7 @@  copy "i18n"
 translit_start
 
 include "translit_combining";""
+include "translit_cyrillic";""
 
 <U00C1> "<U0041><U0301>";"<U0041><U00B4>";"<U0041><U0027>"
 <U00C9> "<U0045><U0301>";"<U0045><U00B4>";"<U0045><U0027>"
diff --git a/localedata/locales/hy_AM b/localedata/locales/hy_AM
index 74e1b77efb..5973c85f33 100644
--- a/localedata/locales/hy_AM
+++ b/localedata/locales/hy_AM
@@ -75,6 +75,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/id_ID b/localedata/locales/id_ID
index 3ddd8d07da..af36159ca6 100644
--- a/localedata/locales/id_ID
+++ b/localedata/locales/id_ID
@@ -54,6 +54,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/is_IS b/localedata/locales/is_IS
index 8d59b468d6..f614fea728 100644
--- a/localedata/locales/is_IS
+++ b/localedata/locales/is_IS
@@ -149,6 +149,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/it_IT b/localedata/locales/it_IT
index 8a10545de0..7d4cda7fc6 100644
--- a/localedata/locales/it_IT
+++ b/localedata/locales/it_IT
@@ -58,6 +58,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/ja_JP b/localedata/locales/ja_JP
index 1fd2fee44b..34ed430947 100644
--- a/localedata/locales/ja_JP
+++ b/localedata/locales/ja_JP
@@ -1680,6 +1680,7 @@  translit_start
 
 include "translit_combining";""
 include "translit_cjk_variants";""
+include "translit_cyrillic";""
 
 translit_end
 
diff --git a/localedata/locales/kab_DZ b/localedata/locales/kab_DZ
index a165f53f01..4cf468c6a5 100644
--- a/localedata/locales/kab_DZ
+++ b/localedata/locales/kab_DZ
@@ -41,6 +41,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/kk_KZ b/localedata/locales/kk_KZ
index c29c84b46e..c4ceb28b27 100644
--- a/localedata/locales/kk_KZ
+++ b/localedata/locales/kk_KZ
@@ -99,6 +99,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/km_KH b/localedata/locales/km_KH
index 0d8c9ce78d..acd9291346 100644
--- a/localedata/locales/km_KH
+++ b/localedata/locales/km_KH
@@ -42,6 +42,7 @@  LC_CTYPE
 copy "i18n"
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/kn_IN b/localedata/locales/kn_IN
index b6443d12c8..cffa4e4544 100644
--- a/localedata/locales/kn_IN
+++ b/localedata/locales/kn_IN
@@ -63,6 +63,7 @@  map to_inpunct; /
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/ko_KR b/localedata/locales/ko_KR
index bd0d919218..31a8b105c5 100644
--- a/localedata/locales/ko_KR
+++ b/localedata/locales/ko_KR
@@ -6098,6 +6098,7 @@  translit_start
 
 include "translit_combining";""
 include "translit_hangul";""
+include "translit_cyrillic";""
 
 translit_end
 
diff --git a/localedata/locales/ks_IN b/localedata/locales/ks_IN
index 9ab8707922..0c1572b8fd 100644
--- a/localedata/locales/ks_IN
+++ b/localedata/locales/ks_IN
@@ -46,6 +46,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/kw_GB b/localedata/locales/kw_GB
index c0433b3f07..1eb4cfd1c1 100644
--- a/localedata/locales/kw_GB
+++ b/localedata/locales/kw_GB
@@ -57,6 +57,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/ky_KG b/localedata/locales/ky_KG
index 871b8a818b..f46b6979e2 100644
--- a/localedata/locales/ky_KG
+++ b/localedata/locales/ky_KG
@@ -82,6 +82,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/lb_LU b/localedata/locales/lb_LU
index 92f1e22e1a..992d0f677d 100644
--- a/localedata/locales/lb_LU
+++ b/localedata/locales/lb_LU
@@ -44,6 +44,7 @@  copy "i18n"
 translit_start
 
 include "translit_combining";""
+include "translit_cyrillic";""
 
 % German umlauts
 % LATIN CAPITAL LETTER A WITH DIAERESIS
diff --git a/localedata/locales/lg_UG b/localedata/locales/lg_UG
index 70dd1cad2e..57dd8c74e8 100644
--- a/localedata/locales/lg_UG
+++ b/localedata/locales/lg_UG
@@ -56,6 +56,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/lij_IT b/localedata/locales/lij_IT
index 2d6e5fcc5c..baec837196 100644
--- a/localedata/locales/lij_IT
+++ b/localedata/locales/lij_IT
@@ -47,6 +47,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/ln_CD b/localedata/locales/ln_CD
index ed6404a1e5..a91441809c 100644
--- a/localedata/locales/ln_CD
+++ b/localedata/locales/ln_CD
@@ -39,6 +39,7 @@  LC_CTYPE
 copy "i18n"
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/lo_LA b/localedata/locales/lo_LA
index d60d157167..2abd680a6a 100644
--- a/localedata/locales/lo_LA
+++ b/localedata/locales/lo_LA
@@ -50,6 +50,7 @@  LC_CTYPE
 copy "i18n"
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/lt_LT b/localedata/locales/lt_LT
index e9834bd200..a58168dc45 100644
--- a/localedata/locales/lt_LT
+++ b/localedata/locales/lt_LT
@@ -163,6 +163,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/lv_LV b/localedata/locales/lv_LV
index a20cbdde46..e3fb992562 100644
--- a/localedata/locales/lv_LV
+++ b/localedata/locales/lv_LV
@@ -125,6 +125,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/mg_MG b/localedata/locales/mg_MG
index 266ff17e7d..ee1ed56fed 100644
--- a/localedata/locales/mg_MG
+++ b/localedata/locales/mg_MG
@@ -53,6 +53,7 @@  translit_start
 
 % Accents are simply omitted if they cannot be represented.
 include "translit_combining";""
+include "translit_cyrillic";""
 
 translit_end
 
diff --git a/localedata/locales/mhr_RU b/localedata/locales/mhr_RU
index 85ac21b35a..b936253ebc 100644
--- a/localedata/locales/mhr_RU
+++ b/localedata/locales/mhr_RU
@@ -58,6 +58,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/mk_MK b/localedata/locales/mk_MK
index 87bae1dc7c..210cfce05c 100644
--- a/localedata/locales/mk_MK
+++ b/localedata/locales/mk_MK
@@ -48,6 +48,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/ml_IN b/localedata/locales/ml_IN
index d7a8f43f1e..794d59f923 100644
--- a/localedata/locales/ml_IN
+++ b/localedata/locales/ml_IN
@@ -60,6 +60,7 @@  map to_inpunct; /
 
 translit_start
 include     "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 %
diff --git a/localedata/locales/ms_MY b/localedata/locales/ms_MY
index 66b5dd98e9..4fa53adbc3 100644
--- a/localedata/locales/ms_MY
+++ b/localedata/locales/ms_MY
@@ -45,6 +45,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/mt_MT b/localedata/locales/mt_MT
index a6ab7b1dad..4b6a08f4e1 100644
--- a/localedata/locales/mt_MT
+++ b/localedata/locales/mt_MT
@@ -47,6 +47,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/nan_TW@latin b/localedata/locales/nan_TW@latin
index d4579a4cdf..99e2bd80ab 100644
--- a/localedata/locales/nan_TW@latin
+++ b/localedata/locales/nan_TW@latin
@@ -51,6 +51,7 @@  translit_start
 
 % accents are simply omitted if they cannot be represented.
 include "translit_combining";""
+include "translit_cyrillic";""
 
 translit_end
 
diff --git a/localedata/locales/nb_NO b/localedata/locales/nb_NO
index a8675b6104..4c90307366 100644
--- a/localedata/locales/nb_NO
+++ b/localedata/locales/nb_NO
@@ -144,6 +144,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 
 % LATIN CAPITAL LETTER A WITH DIAERESIS -> "AE"
 <U00C4> "<U0041><U0308>";"<U0041><U0045>"
diff --git a/localedata/locales/ne_NP b/localedata/locales/ne_NP
index eb80eabbd8..3aecda7fd7 100644
--- a/localedata/locales/ne_NP
+++ b/localedata/locales/ne_NP
@@ -43,6 +43,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/nhn_MX b/localedata/locales/nhn_MX
index 88a89765e8..a5e286bc4c 100644
--- a/localedata/locales/nhn_MX
+++ b/localedata/locales/nhn_MX
@@ -59,6 +59,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/niu_NU b/localedata/locales/niu_NU
index 553c5d9edc..e34f33e0c6 100644
--- a/localedata/locales/niu_NU
+++ b/localedata/locales/niu_NU
@@ -58,6 +58,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/niu_NZ b/localedata/locales/niu_NZ
index 560101b447..85acd3bc44 100644
--- a/localedata/locales/niu_NZ
+++ b/localedata/locales/niu_NZ
@@ -58,6 +58,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/nl_NL b/localedata/locales/nl_NL
index 1ab3277aa0..6284728fe7 100644
--- a/localedata/locales/nl_NL
+++ b/localedata/locales/nl_NL
@@ -56,6 +56,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/nr_ZA b/localedata/locales/nr_ZA
index 7de6420a6b..caf2aba2e4 100644
--- a/localedata/locales/nr_ZA
+++ b/localedata/locales/nr_ZA
@@ -64,6 +64,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/oc_FR b/localedata/locales/oc_FR
index 707927ee26..f347c8c4d8 100644
--- a/localedata/locales/oc_FR
+++ b/localedata/locales/oc_FR
@@ -54,6 +54,7 @@  LC_CTYPE
 copy "i18n"
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/om_KE b/localedata/locales/om_KE
index 66cdcf5c45..a75a623053 100644
--- a/localedata/locales/om_KE
+++ b/localedata/locales/om_KE
@@ -156,6 +156,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/or_IN b/localedata/locales/or_IN
index ef28b58895..5c7b9cf8ef 100644
--- a/localedata/locales/or_IN
+++ b/localedata/locales/or_IN
@@ -62,6 +62,7 @@  map to_inpunct; /
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/os_RU b/localedata/locales/os_RU
index 9a4ce037cd..7ab0b7a9bc 100644
--- a/localedata/locales/os_RU
+++ b/localedata/locales/os_RU
@@ -71,6 +71,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 
 END LC_CTYPE
diff --git a/localedata/locales/pa_IN b/localedata/locales/pa_IN
index ca28f21162..93e17fa848 100644
--- a/localedata/locales/pa_IN
+++ b/localedata/locales/pa_IN
@@ -60,6 +60,7 @@  map to_inpunct; /
 
 translit_start
 include     "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/pa_PK b/localedata/locales/pa_PK
index 1f49bdc90d..7782adb5d8 100644
--- a/localedata/locales/pa_PK
+++ b/localedata/locales/pa_PK
@@ -49,6 +49,7 @@  LC_CTYPE
 copy "i18n"
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 
 % those two lettes are not in cp1256...
 
diff --git a/localedata/locales/pl_PL b/localedata/locales/pl_PL
index 4c1b2a869d..8caa5e8579 100644
--- a/localedata/locales/pl_PL
+++ b/localedata/locales/pl_PL
@@ -130,6 +130,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/pt_PT b/localedata/locales/pt_PT
index 6225036edf..d52ac3ac26 100644
--- a/localedata/locales/pt_PT
+++ b/localedata/locales/pt_PT
@@ -58,6 +58,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/quz_PE b/localedata/locales/quz_PE
index f6b1956b93..018cd9a7e5 100644
--- a/localedata/locales/quz_PE
+++ b/localedata/locales/quz_PE
@@ -55,6 +55,7 @@  LC_CTYPE
 copy "i18n"
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/ro_RO b/localedata/locales/ro_RO
index 39c4d09a07..6443d66d6a 100644
--- a/localedata/locales/ro_RO
+++ b/localedata/locales/ro_RO
@@ -129,6 +129,7 @@  copy "i18n"
 %
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 
 % if t/scomma is not available, try first t/scedilla
 <U0218> "<U015E>";"<U0053>"
diff --git a/localedata/locales/ru_RU b/localedata/locales/ru_RU
index fdb2059fe7..1f6d2c6935 100644
--- a/localedata/locales/ru_RU
+++ b/localedata/locales/ru_RU
@@ -69,6 +69,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/rw_RW b/localedata/locales/rw_RW
index e0bc763c5a..e12a3d83a3 100644
--- a/localedata/locales/rw_RW
+++ b/localedata/locales/rw_RW
@@ -45,6 +45,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/sa_IN b/localedata/locales/sa_IN
index 4eaf6fe1fe..6ebb5e4f90 100644
--- a/localedata/locales/sa_IN
+++ b/localedata/locales/sa_IN
@@ -44,6 +44,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/sd_IN b/localedata/locales/sd_IN
index e5ab80b062..23b7424d3b 100644
--- a/localedata/locales/sd_IN
+++ b/localedata/locales/sd_IN
@@ -46,6 +46,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/sd_IN@devanagari b/localedata/locales/sd_IN@devanagari
index d57cea639b..0a122b95ac 100644
--- a/localedata/locales/sd_IN@devanagari
+++ b/localedata/locales/sd_IN@devanagari
@@ -44,6 +44,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/se_NO b/localedata/locales/se_NO
index b50001139a..b423d93531 100644
--- a/localedata/locales/se_NO
+++ b/localedata/locales/se_NO
@@ -221,6 +221,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/sgs_LT b/localedata/locales/sgs_LT
index 6b6ab1cac9..561c43b651 100644
--- a/localedata/locales/sgs_LT
+++ b/localedata/locales/sgs_LT
@@ -58,6 +58,7 @@  LC_CTYPE
 copy "i18n"
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/shn_MM b/localedata/locales/shn_MM
index 4212c50ec5..079506dafc 100644
--- a/localedata/locales/shn_MM
+++ b/localedata/locales/shn_MM
@@ -58,6 +58,7 @@  map to_inpunct; /
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/si_LK b/localedata/locales/si_LK
index dc4a9eb04d..4d2fc8b3f0 100644
--- a/localedata/locales/si_LK
+++ b/localedata/locales/si_LK
@@ -44,6 +44,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/sk_SK b/localedata/locales/sk_SK
index 94e6e12bb2..086499bb7e 100644
--- a/localedata/locales/sk_SK
+++ b/localedata/locales/sk_SK
@@ -67,6 +67,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/sl_SI b/localedata/locales/sl_SI
index 6157b26d4f..dd9b516111 100644
--- a/localedata/locales/sl_SI
+++ b/localedata/locales/sl_SI
@@ -2120,6 +2120,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/sm_WS b/localedata/locales/sm_WS
index 6058fbdc38..b9954ae30e 100644
--- a/localedata/locales/sm_WS
+++ b/localedata/locales/sm_WS
@@ -37,6 +37,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/so_SO b/localedata/locales/so_SO
index 713bf79608..9ed4d68ce9 100644
--- a/localedata/locales/so_SO
+++ b/localedata/locales/so_SO
@@ -68,6 +68,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/sq_AL b/localedata/locales/sq_AL
index b16a459c56..d9154d7f9e 100644
--- a/localedata/locales/sq_AL
+++ b/localedata/locales/sq_AL
@@ -45,6 +45,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/ss_ZA b/localedata/locales/ss_ZA
index 7532a1940b..31c45321ce 100644
--- a/localedata/locales/ss_ZA
+++ b/localedata/locales/ss_ZA
@@ -66,6 +66,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/st_ZA b/localedata/locales/st_ZA
index 706ef3e50a..b62f478f5f 100644
--- a/localedata/locales/st_ZA
+++ b/localedata/locales/st_ZA
@@ -62,6 +62,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/sv_SE b/localedata/locales/sv_SE
index aa28c23776..7443ee277c 100644
--- a/localedata/locales/sv_SE
+++ b/localedata/locales/sv_SE
@@ -151,6 +151,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 
 % LATIN CAPITAL LETTER A WITH DIAERESIS -> "AE"
 <U00C4> "<U0041><U0308>";"<U0041><U0045>"
diff --git a/localedata/locales/sw_KE b/localedata/locales/sw_KE
index 6c303da983..1e3f848e1d 100644
--- a/localedata/locales/sw_KE
+++ b/localedata/locales/sw_KE
@@ -43,6 +43,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/ta_IN b/localedata/locales/ta_IN
index 5a083d2658..ec08739ebd 100644
--- a/localedata/locales/ta_IN
+++ b/localedata/locales/ta_IN
@@ -63,6 +63,7 @@  map to_inpunct; /
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/te_IN b/localedata/locales/te_IN
index b70f320051..99ffb43bf5 100644
--- a/localedata/locales/te_IN
+++ b/localedata/locales/te_IN
@@ -63,6 +63,7 @@  map to_inpunct; /
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/th_TH b/localedata/locales/th_TH
index 7a10376e80..148a1c632b 100644
--- a/localedata/locales/th_TH
+++ b/localedata/locales/th_TH
@@ -57,6 +57,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/ti_ET b/localedata/locales/ti_ET
index 6c387604e9..2c2e32a702 100644
--- a/localedata/locales/ti_ET
+++ b/localedata/locales/ti_ET
@@ -864,6 +864,7 @@  translit_start
 <U137C>    <U0060><U0031><U0030><U0030><U0030><U0030>
 
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 %
 END LC_CTYPE
diff --git a/localedata/locales/tn_ZA b/localedata/locales/tn_ZA
index 8473426eab..274336c8d3 100644
--- a/localedata/locales/tn_ZA
+++ b/localedata/locales/tn_ZA
@@ -67,6 +67,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/to_TO b/localedata/locales/to_TO
index 7abe8685df..09e5e093d5 100644
--- a/localedata/locales/to_TO
+++ b/localedata/locales/to_TO
@@ -36,6 +36,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/tpi_PG b/localedata/locales/tpi_PG
index 3315c27633..e625543fcb 100644
--- a/localedata/locales/tpi_PG
+++ b/localedata/locales/tpi_PG
@@ -44,6 +44,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/tr_TR b/localedata/locales/tr_TR
index f7c13ddf4b..c751dc696a 100644
--- a/localedata/locales/tr_TR
+++ b/localedata/locales/tr_TR
@@ -2535,6 +2535,7 @@  class "combining_level3"; /
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 
 % TURKISH LIRA SIGN
 <U20BA> "<U0054><U004C>"
diff --git a/localedata/locales/translit_cyrillic b/localedata/locales/translit_cyrillic
new file mode 100644
index 0000000000..82d9749e08
--- /dev/null
+++ b/localedata/locales/translit_cyrillic
@@ -0,0 +1,383 @@ 
+escape_char /
+comment_char %
+
+% This file is part of the GNU C Library and contains locale data.
+% The Free Software Foundation does not claim any copyright interest
+% in the locale data contained in this file.  The foregoing does not
+% affect the license of the GNU C Library as a whole. It does not
+% exempt you from the conditions of the license if your use would
+% otherwise be governed by that license.
+
+% Transliterations of Cyrillic letters to Latin and/or ASCII symbols.
+% Inspired by ISO 9:1995 / GOST 7.79-2000.
+% Covers Unicode Range https://www.unicode.org/charts/PDF/U0400.pdf
+% i.e. [U0401-U04F9, U2019] but only the letters covered by ISO 9:1995.
+% It implements the GOST_7.79 System A (Latin Script) as a first
+% option and System B Cyrillic (ASCII) as a second option. Check
+% https://en.wikipedia.org/wiki/ISO_9 for reference.
+% The System B is extended from GOST_7.79-Russian using open sources
+% of the transliteration mappings and the "h/`" diacritics logic.
+
+% Usage examples:
+% iconv -f UTF-8 -t ISO-8859-15//TRANSLIT \
+%   | iconv -f ISO-8859-15 -t UTF-8 # System A
+% iconv -f UTF-8 -t ASCII//TRANSLIT # System B.
+
+% Contributions welcome for the rest of Cyrillic script in Unicode
+% https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode.
+% Bugfix for https://sourceware.org/bugzilla/show_bug.cgi?id=2872.
+% Generated from UnicodeData.txt with a spreadsheet referenced
+% in that bug's doclet
+
+LC_CTYPE
+
+translit_start
+
+% CYRILLIC CAPITAL LETTER IO
+<U0401> <U00CB>;"<U0059><U004F>"
+% CYRILLIC CAPITAL LETTER DJE
+<U0402> <U0110>;"<U0044><U004A>"
+% CYRILLIC CAPITAL LETTER GJE
+<U0403> <U01F4>;"<U0047><U0060>"
+% CYRILLIC CAPITAL LETTER UKRAINIAN IE
+<U0404> <U00CA>;"<U0059><U0045>"
+% CYRILLIC CAPITAL LETTER DZE
+<U0405> <U1E90>;"<U005A><U0060>"
+% CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
+<U0406> <U00CC>;<U0049>
+% CYRILLIC CAPITAL LETTER YI
+<U0407> <U00CF>;"<U0059><U0049>"
+% CYRILLIC CAPITAL LETTER JE
+<U0408> "<U004A><U030C>";<U004A>
+% CYRILLIC CAPITAL LETTER LJE
+<U0409> "<U004C><U0302>";"<U004C><U0060>"
+% CYRILLIC CAPITAL LETTER NJE
+<U040A> "<U004E><U0302>";"<U004E><U0060>"
+% CYRILLIC CAPITAL LETTER TSHE
+<U040B> <U0106>;"<U0054><U0053><U0048>"
+% CYRILLIC CAPITAL LETTER KJE
+<U040C> <U1E30>;"<U004B><U0060>"
+% CYRILLIC CAPITAL LETTER SHORT U
+<U040E> <U016C>;"<U0055><U0060>"
+% CYRILLIC CAPITAL LETTER DZHE
+<U040F> "<U0044><U0302>";"<U0044><U0048>"
+% CYRILLIC CAPITAL LETTER A
+<U0410> <U0041>
+% CYRILLIC CAPITAL LETTER BE
+<U0411> <U0042>
+% CYRILLIC CAPITAL LETTER VE
+<U0412> <U0056>
+% CYRILLIC CAPITAL LETTER GHE
+<U0413> <U0047>
+% CYRILLIC CAPITAL LETTER DE
+<U0414> <U0044>
+% CYRILLIC CAPITAL LETTER IE
+<U0415> <U0045>
+% CYRILLIC CAPITAL LETTER ZHE
+<U0416> <U017D>;"<U005A><U0048>"
+% CYRILLIC CAPITAL LETTER ZE
+<U0417> <U005A>
+% CYRILLIC CAPITAL LETTER I
+<U0418> <U0049>
+% CYRILLIC CAPITAL LETTER SHORT I
+<U0419> <U004A>
+% CYRILLIC CAPITAL LETTER KA
+<U041A> <U004B>
+% CYRILLIC CAPITAL LETTER EL
+<U041B> <U004C>
+% CYRILLIC CAPITAL LETTER EM
+<U041C> <U004D>
+% CYRILLIC CAPITAL LETTER EN
+<U041D> <U004E>
+% CYRILLIC CAPITAL LETTER O
+<U041E> <U004F>
+% CYRILLIC CAPITAL LETTER PE
+<U041F> <U0050>
+% CYRILLIC CAPITAL LETTER ER
+<U0420> <U0052>
+% CYRILLIC CAPITAL LETTER ES
+<U0421> <U0053>
+% CYRILLIC CAPITAL LETTER TE
+<U0422> <U0054>
+% CYRILLIC CAPITAL LETTER U
+<U0423> <U0055>
+% CYRILLIC CAPITAL LETTER U WITH COMBINING ACUTE ACCENT
+<U0423><U0301> <U00DA>;"<U0055><U0060>"
+% CYRILLIC CAPITAL LETTER EF
+<U0424> <U0046>
+% CYRILLIC CAPITAL LETTER HA
+<U0425> <U0048>;<U0058>
+% CYRILLIC CAPITAL LETTER TSE
+<U0426> <U0043>;"<U0043><U005A>"
+% CYRILLIC CAPITAL LETTER CHE
+<U0427> <U010C>;"<U0043><U0048>"
+% CYRILLIC CAPITAL LETTER SHA
+<U0428> <U0160>;"<U0053><U0048>"
+% CYRILLIC CAPITAL LETTER SHCHA
+<U0429> <U015C>;"<U0053><U0048><U0048>"
+% CYRILLIC CAPITAL LETTER HARD SIGN
+<U042A> <U02BA>;"<U0060><U0060>"
+% CYRILLIC CAPITAL LETTER YERU
+<U042B> <U0059>;"<U0059><U0060>"
+% CYRILLIC CAPITAL LETTER SOFT SIGN
+<U042C> <U02B9>;<U0060>
+% CYRILLIC CAPITAL LETTER E
+<U042D> <U00C8>;"<U0045><U0060>"
+% CYRILLIC CAPITAL LETTER YU
+<U042E> <U00DB>;"<U0059><U0055>"
+% CYRILLIC CAPITAL LETTER YA
+<U042F> <U00C2>;"<U0059><U0041>"
+% CYRILLIC SMALL LETTER A
+<U0430> <U0061>
+% CYRILLIC SMALL LETTER BE
+<U0431> <U0062>
+% CYRILLIC SMALL LETTER VE
+<U0432> <U0076>
+% CYRILLIC SMALL LETTER GHE
+<U0433> <U0067>
+% CYRILLIC SMALL LETTER DE
+<U0434> <U0064>
+% CYRILLIC SMALL LETTER IE
+<U0435> <U0065>
+% CYRILLIC SMALL LETTER ZHE
+<U0436> <U017E>;"<U007A><U0068>"
+% CYRILLIC SMALL LETTER ZE
+<U0437> <U007A>
+% CYRILLIC SMALL LETTER I
+<U0438> <U0069>
+% CYRILLIC SMALL LETTER SHORT I
+<U0439> <U006A>
+% CYRILLIC SMALL LETTER KA
+<U043A> <U006B>
+% CYRILLIC SMALL LETTER EL
+<U043B> <U006C>
+% CYRILLIC SMALL LETTER EM
+<U043C> <U006D>
+% CYRILLIC SMALL LETTER EN
+<U043D> <U006E>
+% CYRILLIC SMALL LETTER O
+<U043E> <U006F>
+% CYRILLIC SMALL LETTER PE
+<U043F> <U0070>
+% CYRILLIC SMALL LETTER ER
+<U0440> <U0072>
+% CYRILLIC SMALL LETTER ES
+<U0441> <U0073>
+% CYRILLIC SMALL LETTER TE
+<U0442> <U0074>
+% CYRILLIC SMALL LETTER U
+<U0443> <U0075>
+% CYRILLIC SMALL LETTER U WITH COMBINING ACUTE ACCENT
+<U0443><U0301> <U00FA>;"<U0075><U0060>"
+% CYRILLIC SMALL LETTER EF
+<U0444> <U0066>
+% CYRILLIC SMALL LETTER HA
+<U0445> <U0068>;<U0078>
+% CYRILLIC SMALL LETTER TSE
+<U0446> <U0063>;"<U0063><U007A>"
+% CYRILLIC SMALL LETTER CHE
+<U0447> <U010D>;"<U0063><U0068>"
+% CYRILLIC SMALL LETTER SHA
+<U0448> <U0161>;"<U0073><U0068>"
+% CYRILLIC SMALL LETTER SHCHA
+<U0449> <U015D>;"<U0073><U0068><U0068>"
+% CYRILLIC SMALL LETTER HARD SIGN
+<U044A> <U02BA>;"<U0060><U0060>"
+% CYRILLIC SMALL LETTER YERU
+<U044B> <U0079>;"<U0079><U0060>"
+% CYRILLIC SMALL LETTER SOFT SIGN
+<U044C> <U02B9>;<U0060>
+% CYRILLIC SMALL LETTER E
+<U044D> <U00E8>;"<U0065><U0060>"
+% CYRILLIC SMALL LETTER YU
+<U044E> <U00FB>;"<U0079><U0075>"
+% CYRILLIC SMALL LETTER YA
+<U044F> <U00E2>;"<U0079><U0061>"
+% CYRILLIC SMALL LETTER IO
+<U0451> <U00EB>;"<U0079><U006F>"
+% CYRILLIC SMALL LETTER DJE
+<U0452> <U0111>;"<U0064><U006A>"
+% CYRILLIC SMALL LETTER GJE
+<U0453> <U01F5>;"<U0067><U0060>"
+% CYRILLIC SMALL LETTER UKRAINIAN IE
+<U0454> <U00EA>;"<U0079><U0065>"
+% CYRILLIC SMALL LETTER DZE
+<U0455> <U1E91>;"<U007A><U0060>"
+% CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
+<U0456> <U00EC>;<U0069>
+% CYRILLIC SMALL LETTER YI
+<U0457> <U00EF>;"<U0079><U0069>"
+% CYRILLIC SMALL LETTER JE
+<U0458> <U01F0>;<U006A>
+% CYRILLIC SMALL LETTER LJE
+<U0459> "<U006C><U0302>";"<U006C><U0060>"
+% CYRILLIC SMALL LETTER NJE
+<U045A> "<U006E><U0302>";"<U006E><U0060>"
+% CYRILLIC SMALL LETTER TSHE
+<U045B> <U0107>;"<U0074><U0073><U0068>"
+% CYRILLIC SMALL LETTER KJE
+<U045C> <U1E31>;"<U006B><U0060>"
+% CYRILLIC SMALL LETTER SHORT U
+<U045E> <U016D>;"<U0075><U0060>"
+% CYRILLIC SMALL LETTER DZHE
+<U045F> "<U0064><U0302>";"<U0064><U0068>"
+% CYRILLIC CAPITAL LETTER BIG YUS
+<U046A> <U01CD>;"<U004F><U0060>"
+% CYRILLIC SMALL LETTER BIG YUS
+<U046B> <U01CE>;"<U006F><U0060>"
+% CYRILLIC CAPITAL LETTER FITA
+<U0472> "<U0046><U0300>";"<U0046><U0048>"
+% CYRILLIC SMALL LETTER FITA
+<U0473> "<U0066><U0300>";"<U0066><U0068>"
+% CYRILLIC CAPITAL LETTER IZHITSA
+<U0474> <U1EF2>;"<U0059><U0048>"
+% CYRILLIC SMALL LETTER IZHITSA
+<U0475> <U1EF3>;"<U0079><U0068>"
+% CYRILLIC CAPITAL LETTER SEMISOFT SIGN
+<U048C> <U011A>;"<U0045><U0060>"
+% CYRILLIC SMALL LETTER SEMISOFT SIGN
+<U048D> <U011B>;"<U0065><U0060>"
+% CYRILLIC CAPITAL LETTER GHE WITH UPTURN
+<U0490> "<U0047><U0300>";"<U0047><U0060>"
+% CYRILLIC SMALL LETTER GHE WITH UPTURN
+<U0491> "<U0067><U0300>";"<U0067><U0060>"
+% CYRILLIC CAPITAL LETTER GHE WITH STROKE
+<U0492> <U0120>;"<U0047><U0048>"
+% CYRILLIC SMALL LETTER GHE WITH STROKE
+<U0493> <U0121>;"<U0067><U0068>"
+% CYRILLIC CAPITAL LETTER GHE WITH MIDDLE HOOK
+<U0494> <U011E>;"<U0047><U0048>"
+% CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK
+<U0495> <U011F>;"<U0067><U0068>"
+% CYRILLIC CAPITAL LETTER ZHE WITH DESCENDER
+<U0496> "<U017D><U0327>";"<U005A><U0048><U0060>"
+% CYRILLIC SMALL LETTER ZHE WITH DESCENDER
+<U0497> "<U017E><U0327>";"<U007A><U0068><U0060>"
+% CYRILLIC CAPITAL LETTER KA WITH DESCENDER
+<U049A> <U0136>;"<U004B><U0060>"
+% CYRILLIC SMALL LETTER KA WITH DESCENDER
+<U049B> <U0137>;"<U006B><U0060>"
+% CYRILLIC CAPITAL LETTER KA WITH STROKE
+<U049E> "<U004B><U0304>";"<U004B><U0060>"
+% CYRILLIC SMALL LETTER KA WITH STROKE
+<U049F> "<U006B><U0304>";"<U006B><U0060>"
+% CYRILLIC CAPITAL LETTER EN WITH DESCENDER
+<U04A2> <U1E46>;"<U004E><U0060>"
+% CYRILLIC SMALL LETTER EN WITH DESCENDER
+<U04A3> <U1E47>;"<U006E><U0060>"
+% CYRILLIC CAPITAL LIGATURE EN GHE
+<U04A4> <U1E44>;"<U004E><U0047>"
+% CYRILLIC SMALL LIGATURE EN GHE
+<U04A5> <U1E45>;"<U006E><U0067>"
+% CYRILLIC CAPITAL LETTER PE WITH MIDDLE HOOK
+<U04A6> <U1E54>;"<U0050><U0060>"
+% CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK
+<U04A7> <U1E55>;"<U0070><U0060>"
+% CYRILLIC CAPITAL LETTER ABKHASIAN HA
+<U04A8> <U00D2>;"<U004F><U0060>"
+% CYRILLIC SMALL LETTER ABKHASIAN HA
+<U04A9> <U00F2>;"<U006F><U0060>"
+% CYRILLIC CAPITAL LETTER ES WITH DESCENDER
+<U04AA> <U00C7>;"<U0043><U0060>"
+% CYRILLIC SMALL LETTER ES WITH DESCENDER
+<U04AB> <U00E7>;"<U0063><U0060>"
+% CYRILLIC CAPITAL LETTER TE WITH DESCENDER
+<U04AC> <U0162>;"<U0054><U0060>"
+% CYRILLIC SMALL LETTER TE WITH DESCENDER
+<U04AD> <U0163>;"<U0074><U0060>"
+% CYRILLIC CAPITAL LETTER STRAIGHT U
+<U04AE> <U00D9>;<U0055>
+% CYRILLIC SMALL LETTER STRAIGHT U
+<U04AF> <U00F9>;<U0075>
+% CYRILLIC CAPITAL LETTER HA WITH DESCENDER
+<U04B2> <U1E28>;"<U0048><U0060>"
+% CYRILLIC SMALL LETTER HA WITH DESCENDER
+<U04B3> <U1E29>;"<U0068><U0060>"
+% CYRILLIC CAPITAL LIGATURE TE TSE
+<U04B4> "<U0043><U0304>";"<U0054><U0043><U005A>"
+% CYRILLIC SMALL LIGATURE TE TSE
+<U04B5> "<U0063><U0304>";"<U0074><U0063><U007A>"
+% CYRILLIC CAPITAL LETTER SHHA
+<U04BA> <U1E24>;"<U0053><U0048><U0060>"
+% CYRILLIC SMALL LETTER SHHA
+<U04BB> <U1E25>;"<U0073><U0068><U0060>"
+% CYRILLIC CAPITAL LETTER ABKHASIAN CHE
+<U04BC> "<U0043><U0306>";"<U0043><U0048><U0060>"
+% CYRILLIC SMALL LETTER ABKHASIAN CHE
+<U04BD> "<U0063><U0306>";"<U0063><U0068><U0060>"
+% CYRILLIC CAPITAL LETTER ABKHASIAN CHE WITH DESCENDER
+<U04BE> "<U00C7><U0306>";"<U0043><U0048><U0060>"
+% CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER
+<U04BF> "<U00E7><U0306>";"<U0063><U0068><U0060>"
+% CYRILLIC LETTER PALOCHKA
+<U04C0> <U2021>;<U0069>
+% CYRILLIC CAPITAL LETTER ZHE WITH BREVE
+<U04C1> "<U005A><U0306>";"<U005A><U0048><U0060>"
+% CYRILLIC SMALL LETTER ZHE WITH BREVE
+<U04C2> "<U007A><U0306>";"<U007A><U0068><U0060>"
+% CYRILLIC CAPITAL LETTER KHAKASSIAN CHE
+<U04CB> <U00C7>;"<U0043><U0048><U0060>"
+% CYRILLIC SMALL LETTER KHAKASSIAN CHE
+<U04CC> <U00E7>;"<U0063><U0068><U0060>"
+% CYRILLIC CAPITAL LETTER A WITH BREVE
+<U04D0> <U0102>;"<U0041><U0060>"
+% CYRILLIC SMALL LETTER A WITH BREVE
+<U04D1> <U0103>;"<U0061><U0060>"
+% CYRILLIC CAPITAL LETTER A WITH DIAERESIS
+<U04D2> <U00C4>;"<U0041><U0060>"
+% CYRILLIC SMALL LETTER A WITH DIAERESIS
+<U04D3> <U00E4>;"<U0061><U0060>"
+% CYRILLIC CAPITAL LETTER IE WITH BREVE
+<U04D6> <U0114>;"<U0045><U0060>"
+% CYRILLIC SMALL LETTER IE WITH BREVE
+<U04D7> <U0115>;"<U0065><U0060>"
+% CYRILLIC CAPITAL LETTER SCHWA
+<U04D8> "<U0041><U030B>";"<U0041><U0060>"
+% CYRILLIC SMALL LETTER SCHWA
+<U04D9> "<U0061><U030B>";"<U0061><U0060>"
+% CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS
+<U04DC> "<U005A><U0304>";"<U005A><U0048><U0060>"
+% CYRILLIC SMALL LETTER ZHE WITH DIAERESIS
+<U04DD> "<U007A><U0304>";"<U007A><U0068><U0060>"
+% CYRILLIC CAPITAL LETTER ZE WITH DIAERESIS
+<U04DE> "<U005A><U0308>";"<U005A><U0060>"
+% CYRILLIC SMALL LETTER ZE WITH DIAERESIS
+<U04DF> "<U007A><U0308>";"<U007A><U0060>"
+% CYRILLIC CAPITAL LETTER ABKHASIAN DZE
+<U04E0> <U0179>;"<U005A><U0060>"
+% CYRILLIC SMALL LETTER ABKHASIAN DZE
+<U04E1> <U017A>;"<U007A><U0060>"
+% CYRILLIC CAPITAL LETTER I WITH DIAERESIS
+<U04E4> <U00CE>;"<U0049><U0060>"
+% CYRILLIC SMALL LETTER I WITH DIAERESIS
+<U04E5> <U00EE>;"<U0069><U0060>"
+% CYRILLIC CAPITAL LETTER O WITH DIAERESIS
+<U04E6> <U00D6>;"<U004F><U0060>"
+% CYRILLIC SMALL LETTER O WITH DIAERESIS
+<U04E7> <U00F6>;"<U006F><U0060>"
+% CYRILLIC CAPITAL LETTER BARRED O
+<U04E8> <U00D4>;"<U004F><U0060>"
+% CYRILLIC SMALL LETTER BARRED O
+<U04E9> <U00F4>;"<U006F><U0060>"
+% CYRILLIC CAPITAL LETTER U WITH DIAERESIS
+<U04F0> <U00DC>;"<U0055><U0060>"
+% CYRILLIC SMALL LETTER U WITH DIAERESIS
+<U04F1> <U00FC>;"<U0075><U0060>"
+% CYRILLIC CAPITAL LETTER U WITH DOUBLE ACUTE
+<U04F2> <U0170>;"<U0055><U0060>"
+% CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE
+<U04F3> <U0171>;"<U0075><U0060>"
+% CYRILLIC CAPITAL LETTER CHE WITH DIAERESIS
+<U04F4> "<U0043><U0308>";"<U0043><U0048><U0060>"
+% CYRILLIC SMALL LETTER CHE WITH DIAERESIS
+<U04F5> "<U0063><U0308>";"<U0063><U0068><U0060>"
+% CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS
+<U04F8> <U0178>;"<U0059><U0060>"
+% CYRILLIC SMALL LETTER YERU WITH DIAERESIS
+<U04F9> <U00FF>;"<U0079><U0060>"
+% RIGHT SINGLE QUOTATION MARK
+<U2019> <U2035>;<U0027>
+
+translit_end
+
+END LC_CTYPE
diff --git a/localedata/locales/ts_ZA b/localedata/locales/ts_ZA
index 0256e42979..8e16fc02ae 100644
--- a/localedata/locales/ts_ZA
+++ b/localedata/locales/ts_ZA
@@ -62,6 +62,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/unm_US b/localedata/locales/unm_US
index 1e62c60443..66cb4f7210 100644
--- a/localedata/locales/unm_US
+++ b/localedata/locales/unm_US
@@ -48,6 +48,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/ur_IN b/localedata/locales/ur_IN
index 062cbf0937..38675b8c6b 100644
--- a/localedata/locales/ur_IN
+++ b/localedata/locales/ur_IN
@@ -46,6 +46,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/ur_PK b/localedata/locales/ur_PK
index aaf47fceb5..4ea9c56100 100644
--- a/localedata/locales/ur_PK
+++ b/localedata/locales/ur_PK
@@ -49,6 +49,7 @@  LC_CTYPE
 copy "i18n"
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 
 % those two lettes are not in cp1256...
 
diff --git a/localedata/locales/ve_ZA b/localedata/locales/ve_ZA
index 6b80455c98..1964162cc4 100644
--- a/localedata/locales/ve_ZA
+++ b/localedata/locales/ve_ZA
@@ -65,6 +65,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/vi_VN b/localedata/locales/vi_VN
index 7fac1fbbcc..8eac6f3ba9 100644
--- a/localedata/locales/vi_VN
+++ b/localedata/locales/vi_VN
@@ -53,6 +53,7 @@  copy "i18n"
 translit_start
 
 include  "translit_combining";""
+include "translit_cyrillic";""
 
 % dong sign -> d// -> dd
 <U20AB> "<U0111>";"<U0064><U0064>"
diff --git a/localedata/locales/wa_BE b/localedata/locales/wa_BE
index e97493089e..6349142ef7 100644
--- a/localedata/locales/wa_BE
+++ b/localedata/locales/wa_BE
@@ -54,6 +54,7 @@  LC_CTYPE
 copy "i18n"
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 
 % A-bole -> A-circonflecse -> AU
 <U00C5> "A<U030A>";"A";"AU"
diff --git a/localedata/locales/wo_SN b/localedata/locales/wo_SN
index 47263d2eab..bd466d934a 100644
--- a/localedata/locales/wo_SN
+++ b/localedata/locales/wo_SN
@@ -53,6 +53,7 @@  translit_start
 
 % Accents are simply omitted if they cannot be represented.
 include "translit_combining";""
+include "translit_cyrillic";""
 
 translit_end
 
diff --git a/localedata/locales/xh_ZA b/localedata/locales/xh_ZA
index 4564137e85..5bd3d5bd3c 100644
--- a/localedata/locales/xh_ZA
+++ b/localedata/locales/xh_ZA
@@ -64,6 +64,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
diff --git a/localedata/locales/yi_US b/localedata/locales/yi_US
index 95963830fc..edd55f77e9 100644
--- a/localedata/locales/yi_US
+++ b/localedata/locales/yi_US
@@ -60,6 +60,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 
 % if digraphs are not available (this is the case with iso-8859-8)
 % then use the single letters
diff --git a/localedata/locales/yuw_PG b/localedata/locales/yuw_PG
index 0cb3cadf4a..b9e393d354 100644
--- a/localedata/locales/yuw_PG
+++ b/localedata/locales/yuw_PG
@@ -40,6 +40,7 @@  copy "i18n"
 
 translit_start
 include "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 
 END LC_CTYPE
diff --git a/localedata/locales/zh_CN b/localedata/locales/zh_CN
index 62a46415c1..00f2332dde 100644
--- a/localedata/locales/zh_CN
+++ b/localedata/locales/zh_CN
@@ -58,6 +58,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 
 class	"hanzi"; /
diff --git a/localedata/locales/zu_ZA b/localedata/locales/zu_ZA
index cf93a63009..ab37a145b2 100644
--- a/localedata/locales/zu_ZA
+++ b/localedata/locales/zu_ZA
@@ -68,6 +68,7 @@  copy "i18n"
 
 translit_start
 include  "translit_combining";""
+include "translit_cyrillic";""
 translit_end
 END LC_CTYPE
 
-- 
2.17.1
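
For reviewers who want to spot-check individual entries of
translit_cyrillic without building the locales, the entry format used in
the table (a source sequence, then alternatives separated by ";", each
written as <Uxxxx> code points and optionally double-quoted) can be read
with a few lines of Python. This is only a rough sketch and not how
localedef parses locale sources: the helper names decode and parse_entry
are hypothetical, and the sketch assumes every character is written in
<Uxxxx> notation, as it is in this table.

```python
import re

# Matches one <Uxxxx> code point token as written in the table.
_CODEPOINT = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def decode(token):
    """Turn e.g. '"<U005A><U0048>"' into 'ZH' (surrounding quotes are ignored)."""
    return "".join(chr(int(h, 16)) for h in _CODEPOINT.findall(token))


def parse_entry(line):
    """Return (source, [alternatives]) for a table entry.

    Returns None for comments, blank lines and keywords such as
    translit_start, i.e. for anything that does not start with <U.
    """
    line = line.strip()
    if not line.startswith("<U"):
        return None
    source, _, targets = line.partition(" ")
    # Alternatives are tried in order: System A first, then System B.
    return decode(source), [decode(t) for t in targets.strip().split(";")]
```

Calling parse_entry on the CYRILLIC CAPITAL LETTER ZHE entry from the
table yields the source letter together with its two alternatives in the
order they are listed, which makes case mismatches between the capital
and small entries easy to see.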