[RFC,03/13] charsets: utf8: Add unicode character database files

Message ID 20180112071234.29470-4-krisman@collabora.co.uk
State Superseded, archived
Series
  • UTF-8 case insensitive lookups for EXT4

Commit Message

Gabriel Krisman Bertazi Jan. 12, 2018, 7:12 a.m.
From: Olaf Weber <olaf@sgi.com>

Add files from the Unicode Character Database, version 7.0.0, to the source.
A helper program that generates a trie used for normalization from these
files is part of a separate commit.

Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
  [Move ucd directory to lib/charsets]
---
 lib/charsets/ucd/README | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)
 create mode 100644 lib/charsets/ucd/README

Comments

Darrick J. Wong Jan. 12, 2018, 4:59 p.m. | #1
On Fri, Jan 12, 2018 at 05:12:24AM -0200, Gabriel Krisman Bertazi wrote:
> From: Olaf Weber <olaf@sgi.com>
> 
> Add files from the Unicode Character Database, version 7.0.0, to the source.
> A helper program that generates a trie used for normalization from these
> files is part of a separate commit.
> 
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
>   [Move ucd directory to lib/charsets]
> ---
>  lib/charsets/ucd/README | 33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
>  create mode 100644 lib/charsets/ucd/README
> 
> diff --git a/lib/charsets/ucd/README b/lib/charsets/ucd/README
> new file mode 100644
> index 000000000000..d713e663cdf9
> --- /dev/null
> +++ b/lib/charsets/ucd/README
> @@ -0,0 +1,33 @@
> +The files in this directory are part of the Unicode Character Database
> +for version 7.0.0 of the Unicode standard.
> +
> +The full set of files can be found here:
> +
> +  http://www.unicode.org/Public/7.0.0/ucd/
> +
> +The latest released version of the UCD can be found here:
> +
> +  http://www.unicode.org/Public/UCD/latest/
> +
> +The files in this directory are identical, except that they have been
> +renamed with a suffix indicating the unicode version.
> +
> +Individual source links:
> +
> +  http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
> +  http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
> +  http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
> +  http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
> +  http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
> +  http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
> +  http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
> +
> +md5sums
> +
> +  9a92b2bfe56c6719def926bab524fefd  CaseFolding-7.0.0.txt
> +  07b8b1027eb824cf0835314e94f23d2e  DerivedAge-7.0.0.txt
> +  90c3340b16821e2f2153acdbe6fc6180  DerivedCombiningClass-7.0.0.txt
> +  c41c0601f808116f623de47110ed4f93  DerivedCoreProperties-7.0.0.txt
> +  522720ddfc150d8e63a2518634829bce  NormalizationCorrections-7.0.0.txt
> +  1f35175eba4a2ad795db489f789ae352  NormalizationTest-7.0.0.txt
> +  c8355655731d75e6a3de8c20d7e601ba  UnicodeData-7.0.0.txt

Uh... are these files supposed to be attached to this patch?

--D

> -- 
> 2.15.1
>
Weber, Olaf (HPC Data Management & Storage) Jan. 12, 2018, 8:29 p.m. | #2
> -----Original Message-----
> From: Darrick J. Wong [mailto:darrick.wong@oracle.com]
> Sent: Friday, January 12, 2018 17:59
> To: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
> Cc: tytso@mit.edu; david@fromorbit.com; bpm@sgi.com; olaf@sgi.com;
> linux-ext4@vger.kernel.org; linux-fsdevel@vger.kernel.org;
> kernel@lists.collabora.co.uk; alvaro.soliverez@collabora.co.uk
> Subject: Re: [PATCH RFC 03/13] charsets: utf8: Add unicode character
> database files
> 
> On Fri, Jan 12, 2018 at 05:12:24AM -0200, Gabriel Krisman Bertazi wrote:
> > From: Olaf Weber <olaf@sgi.com>
> >
> > Add files from the Unicode Character Database, version 7.0.0, to the source.
> > A helper program that generates a trie used for normalization from these
> > files is part of a separate commit.
> >
> > Signed-off-by: Olaf Weber <olaf@sgi.com>
> > Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
> >   [Move ucd directory to lib/charsets]
> > ---
> >  lib/charsets/ucd/README | 33 +++++++++++++++++++++++++++++++++
> >  1 file changed, 33 insertions(+)
> >  create mode 100644 lib/charsets/ucd/README
> >
> > diff --git a/lib/charsets/ucd/README b/lib/charsets/ucd/README
> > new file mode 100644
> > index 000000000000..d713e663cdf9
> > --- /dev/null
> > +++ b/lib/charsets/ucd/README
> > @@ -0,0 +1,33 @@
> > +The files in this directory are part of the Unicode Character
> > +Database for version 7.0.0 of the Unicode standard.
> > +
> > +The full set of files can be found here:
> > +
> > +  http://www.unicode.org/Public/7.0.0/ucd/
> > +
> > +The latest released version of the UCD can be found here:
> > +
> > +  http://www.unicode.org/Public/UCD/latest/
> > +
> > +The files in this directory are identical, except that they have been
> > +renamed with a suffix indicating the unicode version.
> > +
> > +Individual source links:
> > +
> > +  http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
> > +  http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
> > +  http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
> > +  http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
> > +  http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
> > +  http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
> > +  http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
> > +
> > +md5sums
> > +
> > +  9a92b2bfe56c6719def926bab524fefd  CaseFolding-7.0.0.txt
> > +  07b8b1027eb824cf0835314e94f23d2e  DerivedAge-7.0.0.txt
> > +  90c3340b16821e2f2153acdbe6fc6180  DerivedCombiningClass-7.0.0.txt
> > +  c41c0601f808116f623de47110ed4f93  DerivedCoreProperties-7.0.0.txt
> > +  522720ddfc150d8e63a2518634829bce  NormalizationCorrections-7.0.0.txt
> > +  1f35175eba4a2ad795db489f789ae352  NormalizationTest-7.0.0.txt
> > +  c8355655731d75e6a3de8c20d7e601ba  UnicodeData-7.0.0.txt
> 
> Uh... are these files supposed to be attached to this patch?

Actually, no, as was explained in the first message of the series:

" Like the original submission from Ben, I excluded the commit that includes the
" generated header file and unicode files because they are too big and would
" bounce the list.  Instead, instructions on fetching and generating the files are
" documented in the commit message.

One issue we (SGI) anticipated is that we were proposing the inclusion of a large binary blob into
the kernel. And people here do dislike opaque binary blobs. So instead we proposed including the
program that generated the blob in question plus the source files it uses. On the one hand, a
sizable increase of the kernel source tree, on the other hand, no argument about the provenance
of the blob as both source and generator are right there.

An alternative might be to include the generated blob itself but retain the instructions so people can
verify it, provided they care to do so. If someone were really ambitious, they could even automate
grabbing the source files from unicode.org as part of a verification build. If they were even more
ambitious, they could add such a verification build as an option to the Linux kernel build system. (In
other words, I am not the one who's going to implement this if it turns out that people on this list
believe this to be a good idea.)
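
As a rough illustration of the verification step described above (this is
not part of the posted series), the Python sketch below checks a directory
of UCD files against the md5sums listed in the README. The names
parse_manifest, md5_of and verify are hypothetical, and the sketch assumes
the files have already been fetched and renamed with the version suffix.

```python
import hashlib
import re
from pathlib import Path

# Matches README checksum lines such as
#   "  9a92b2bfe56c6719def926bab524fefd  CaseFolding-7.0.0.txt"
# A leading "+" or ">" left over from a diff or a quoted mail is tolerated.
MANIFEST_LINE = re.compile(r"^[+>\s]*([0-9a-f]{32})\s+(\S+)\s*$")


def parse_manifest(readme_text):
    """Return {filename: md5 hex digest} for every checksum line found."""
    sums = {}
    for line in readme_text.splitlines():
        match = MANIFEST_LINE.match(line)
        if match:
            sums[match.group(2)] = match.group(1)
    return sums


def md5_of(path):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(ucd_dir, readme_text):
    """Return {filename: status}; status is 'ok', 'mismatch' or 'missing'."""
    results = {}
    for name, expected in parse_manifest(readme_text).items():
        path = Path(ucd_dir) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

A verification build could call verify() on lib/charsets/ucd with the README
text and fail if any status is not 'ok'. Downloading the files is left out
on purpose, so the check itself needs no network access.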

Olaf
Theodore Y. Ts'o Jan. 13, 2018, 12:24 a.m. | #3
On Fri, Jan 12, 2018 at 05:12:24AM -0200, Gabriel Krisman Bertazi wrote:
> From: Olaf Weber <olaf@sgi.com>
> 
> Add files from the Unicode Character Database, version 7.0.0, to the source.
> A helper program that generates a trie used for normalization from these
> files is part of a separate commit.

It looks like the latest version of Unicode is 10.0.0.  Once we pick a
Unicode version, changing will be painful; but in the absence of
interop requirements, is there a reason to stick with Unicode 7?  Why
not take the latest version of Unicode and then freeze on it?

    	     	    	       	       	   - Ted
Gabriel Krisman Bertazi Jan. 13, 2018, 4:28 a.m. | #4
Theodore Ts'o <tytso@mit.edu> writes:

> On Fri, Jan 12, 2018 at 05:12:24AM -0200, Gabriel Krisman Bertazi wrote:
>> From: Olaf Weber <olaf@sgi.com>
>> 
>> Add files from the Unicode Character Database, version 7.0.0, to the source.
>> A helper program that generates a trie used for normalization from these
>> files is part of a separate commit.
>
> It looks like the latest version of Unicode is 10.0.0.  Once we pick a
> Unicode version, changing will be painful; but in the absence of
> interop requirements, is there a reason to stick with Unicode 7?  Why
> not take the latest version of Unicode and then freeze on it?
>

Hi Ted,

No, there isn't a specific reason for Unicode 7, and I forgot to mention
this in my cover letter.  I have successfully generated the data file
for 10.0.0 with the mkutf8data script, but I haven't been able to fully
validate it yet.  I walked through the changelogs to make sure any relevant
changes were there, but I'm not done yet.  You can definitely expect
new versions of the patchset to support 10.0.0.
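
For anyone wanting to repeat this for another release, here is a small
Python sketch (an illustration only, not the mkutf8data script) that lists
the source URLs and the renamed local files for a given version, following
the naming convention in the README. It assumes the directory layout under
www.unicode.org/Public is the same as for 7.0.0; UCD_FILES, BASE and
ucd_sources are made-up names.

```python
# Paths below the versioned "ucd/" directory, as listed in the README.
UCD_FILES = [
    "CaseFolding.txt",
    "DerivedAge.txt",
    "extracted/DerivedCombiningClass.txt",
    "DerivedCoreProperties.txt",
    "NormalizationCorrections.txt",
    "NormalizationTest.txt",
    "UnicodeData.txt",
]

# Assumption: other versions use the same URL scheme as 7.0.0.
BASE = "http://www.unicode.org/Public/{version}/ucd/{path}"


def ucd_sources(version):
    """Return a list of (url, local_name) pairs for one Unicode version."""
    pairs = []
    for path in UCD_FILES:
        # Drop any subdirectory and the ".txt" extension to get the stem.
        stem = path.rsplit("/", 1)[-1][: -len(".txt")]
        url = BASE.format(version=version, path=path)
        # Local copies carry the version as a suffix, e.g. CaseFolding-7.0.0.txt.
        pairs.append((url, "%s-%s.txt" % (stem, version)))
    return pairs


if __name__ == "__main__":
    for url, local_name in ucd_sources("10.0.0"):
        print(local_name, "<-", url)
```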

Thanks,

Patch

diff --git a/lib/charsets/ucd/README b/lib/charsets/ucd/README
new file mode 100644
index 000000000000..d713e663cdf9
--- /dev/null
+++ b/lib/charsets/ucd/README
@@ -0,0 +1,33 @@ 
+The files in this directory are part of the Unicode Character Database
+for version 7.0.0 of the Unicode standard.
+
+The full set of files can be found here:
+
+  http://www.unicode.org/Public/7.0.0/ucd/
+
+The latest released version of the UCD can be found here:
+
+  http://www.unicode.org/Public/UCD/latest/
+
+The files in this directory are identical, except that they have been
+renamed with a suffix indicating the unicode version.
+
+Individual source links:
+
+  http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
+  http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
+  http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
+  http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
+  http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
+  http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
+  http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
+
+md5sums
+
+  9a92b2bfe56c6719def926bab524fefd  CaseFolding-7.0.0.txt
+  07b8b1027eb824cf0835314e94f23d2e  DerivedAge-7.0.0.txt
+  90c3340b16821e2f2153acdbe6fc6180  DerivedCombiningClass-7.0.0.txt
+  c41c0601f808116f623de47110ed4f93  DerivedCoreProperties-7.0.0.txt
+  522720ddfc150d8e63a2518634829bce  NormalizationCorrections-7.0.0.txt
+  1f35175eba4a2ad795db489f789ae352  NormalizationTest-7.0.0.txt
+  c8355655731d75e6a3de8c20d7e601ba  UnicodeData-7.0.0.txt