diff mbox series

[PATCH/next,v2,1/1] package/icu: Add support to generate a subset of ICU data

Message ID 20210601060608.5531-1-bernd.kuhls@t-online.de
State Changes Requested
Headers show
Series [PATCH/next,v2,1/1] package/icu: Add support to generate a subset of ICU data | expand

Commit Message

Bernd Kuhls June 1, 2021, 6:06 a.m. UTC
Recent versions of ICU (64+) provide a tool for configuring ICU locale
data file with finer granularity [1].

Default generated size for libicudata.so is ~27M, which is quite large
for embedded systems and all of them may not even need all locale data.

This patch adds support for a custom data filter file in json format to
reduce the size of libicudata.so, e.g.

{
  "localeFilter": {
    "filterType": "language",
    "includelist": [
      "en",
      "de",
      "it"
    ]
  }
}

would only generate the locale data for english/german/italian.

This would reduce the size of libicudata.so to 12M.

[1] https://github.com/unicode-org/icu/blob/master/docs/userguide/icu_data/buildtool.md

Signed-off-by: Bernd Kuhls <bernd.kuhls@t-online.de>
---
v2: switch ICU_COPY_CUSTOM_DATA to ICU_PRE_CONFIGURE_HOOKS to fix
    parallel build issues

 package/icu/Config.in |  9 +++++++++
 package/icu/icu.hash  |  1 +
 package/icu/icu.mk    | 24 +++++++++++++++++++++++-
 3 files changed, 33 insertions(+), 1 deletion(-)

Comments

Thomas Petazzoni July 18, 2021, 9:11 p.m. UTC | #1
Hello Bernd,

On Tue,  1 Jun 2021 08:06:08 +0200
Bernd Kuhls <bernd.kuhls@t-online.de> wrote:

> This would reduce the size of libicudata.so to 12M.
> 
> [1] https://github.com/unicode-org/icu/blob/master/docs/userguide/icu_data/buildtool.md

This is really some good and useful contribution, but (as usual), I
have a couple of questions/requests.

> +config BR2_PACKAGE_ICU_DATA_FILTER_FILE
> +	string "Path to custom data configuration file"
> +	help
> +	  The ICU Data Build Tool enables you to write a configuration
> +	  file that specifies what features and locales to include in a
> +	  custom data bundle:
> +	  https://github.com/unicode-org/icu/blob/main/docs/userguide/icu_data/buildtool.md
> +	  Leave empty to not use this functionality.

We already have BR2_PACKAGE_ICU_CUSTOM_DATA_PATH which was also meant
to allow building ICU with a smaller dataset, based on pre-compiled
.dat file. It points to using http://apps.icu-project.org/datacustom/,
which in fact seems to no longer exists, as it redirects to
https://unicode-org.atlassian.net/browse/ICU-12977/, which says that
the feature is no longer available since ICU 58.x.

So I guess we should first drop BR2_PACKAGE_ICU_CUSTOM_DATA_PATH which
is already unusable today.

Then the second question is whether we can provide/generate some kind
of default file in Buildroot that would in a default Buildroot
configuration generate a more minimal ICU dataset. For example by
building only the support for English ? Or based on the value of
BR2_GENERATE_LOCALE ?

I know this would break backward compatibility, but it would really
make ICU a lot more reasonable in size for most of our users. Indeed,
they are unlikely to know that they can reduce the size of ICU by using
this new feature.

> +ICU_DATA_FILTER_FILE = $(call qstrip,$(BR2_PACKAGE_ICU_DATA_FILTER_FILE))
> +
> +ifneq ($(ICU_DATA_FILTER_FILE),)
> +HOST_ICU_DATA_SOURCE = $(subst src.tgz,data.zip,$(ICU_SOURCE))
> +HOST_ICU_EXTRA_DOWNLOADS += $(HOST_ICU_SITE)/$(HOST_ICU_DATA_SOURCE)
> +
> +define HOST_ICU_EXTRACT_DATA
> +	rm -rf $(@D)/$(HOST_ICU_SUBDIR)/data
> +	$(UNZIP) $(ICU_DL_DIR)/$(HOST_ICU_DATA_SOURCE) -d $(@D)/$(HOST_ICU_SUBDIR)
> +endef
> +HOST_ICU_POST_EXTRACT_HOOKS += HOST_ICU_EXTRACT_DATA
> +
> +HOST_ICU_CONF_ENV = ICU_DATA_FILTER_FILE=$(ICU_DATA_FILTER_FILE)
> +HOST_ICU_CONF_OPTS += --with-data-packaging=archive
> +
> +define ICU_COPY_CUSTOM_DATA
> +	$(INSTALL) -D -m 644 $(HOST_ICU_DIR)/$(HOST_ICU_SUBDIR)/data/out/icudt$(ICU_VERSION_MAJOR)l.dat $(@D)/$(ICU_SUBDIR)/data/in/
> +endef
> +ICU_PRE_CONFIGURE_HOOKS += ICU_COPY_CUSTOM_DATA
> +endif

It took me quite some time to understand what was going on here. My
understanding is as follows:

 * In a normal build, the pre-compiled source/data/in/icudt69l.dat
   provided in the ICU tarball is used.

 * When a custom dataset needs to be generated thanks to your new
   option, we need to download the source of this dataset as an extra
   download for the host-icu package, replace the source/data/ subdir
   with this source data set, and ask the host-icu package to generate
   icudt69l.dat. Then the target icu package grabs this pre-compiled
   icudt69l.dat.

Is that correct ? If so, then I'd say it would be useful to add a
comment above this code, as I find the logic to not be that trivial.

Thanks!

Thomas
Bernd Kuhls July 19, 2021, 5:45 a.m. UTC | #2
Hello Thomas,

Am Sun, 18 Jul 2021 23:11:20 +0200 schrieb Thomas Petazzoni:

> We already have BR2_PACKAGE_ICU_CUSTOM_DATA_PATH which was also meant to
> allow building ICU with a smaller dataset, based on pre-compiled .dat
> file. It points to using http://apps.icu-project.org/datacustom/, which
> in fact seems to no longer exists, as it redirects to
> https://unicode-org.atlassian.net/browse/ICU-12977/, which says that the
> feature is no longer available since ICU 58.x.
> 
> So I guess we should first drop BR2_PACKAGE_ICU_CUSTOM_DATA_PATH which
> is already unusable today.

Yes, it is true that the upstream URL to create custum data files
http://apps.icu-project.org/datacustom/
no longer exists and no new data files can be created this way.

But our option BR2_PACKAGE_ICU_CUSTOM_DATA_PATH still works when using 
already existing data files to overwrite the tarball version of
icu/source/data/in/icudt69l.dat by ICU_POST_PATCH_HOOKS.

That's the reason why I did not touch this buildroot option because I 
focused on the new option. I do not have any objections to remove this 
option however.

> Then the second question is whether we can provide/generate some kind of
> default file in Buildroot that would in a default Buildroot
> configuration generate a more minimal ICU dataset. For example by
> building only the support for English ? Or based on the value of
> BR2_GENERATE_LOCALE ?

Another idea just came up: What about changing my patch to a comma-
separated string option like BR2_PACKAGE_ICU_DATA_LANGUAGES with a 
default of "en" and icu.mk creates the corresponding json file during the 
build:

{
  "localeFilter": {
    "filterType": "language",
    "includelist": [
      "en"
    ]
  }
}

Users could then use this option to filter more languages are change it 
to an empty option which will provide the full data file, like today.
This way we would lose the more detailed filter possibilities offered by 
the icu build system but I guess most users would be happy just being 
able to filter languages.

> It took me quite some time to understand what was going on here. My
> understanding is as follows:
> 
>  * In a normal build, the pre-compiled source/data/in/icudt69l.dat
>    provided in the ICU tarball is used.

Correct.

>  * When a custom dataset needs to be generated thanks to your new
>    option, we need to download the source of this dataset as an extra
>    download for the host-icu package, replace the source/data/ subdir
>    with this source data set, and ask the host-icu package to generate
>    icudt69l.dat. Then the target icu package grabs this pre-compiled
>    icudt69l.dat.

Correct, you need host tools, which are configured using --with-data-
packaging=archive, to build a new data file which is then injected into 
the target package by ICU_PRE_CONFIGURE_HOOKS.

> Is that correct ? If so, then I'd say it would be useful to add a
> comment above this code, as I find the logic to not be that trivial.

Ok.

Regards, Bernd
Thomas Petazzoni July 19, 2021, 7:46 a.m. UTC | #3
Hello Bernd,

On Mon, 19 Jul 2021 07:45:36 +0200
Bernd Kuhls <bernd.kuhls@t-online.de> wrote:

> Yes, it is true that the upstream URL to create custum data files
> http://apps.icu-project.org/datacustom/
> no longer exists and no new data files can be created this way.
> 
> But our option BR2_PACKAGE_ICU_CUSTOM_DATA_PATH still works when using 
> already existing data files to overwrite the tarball version of
> icu/source/data/in/icudt69l.dat by ICU_POST_PATCH_HOOKS.
> 
> That's the reason why I did not touch this buildroot option because I 
> focused on the new option. I do not have any objections to remove this 
> option however.

Since the website no longer exists and has never allowed creating
custom datasets for recent versions of ICU, I doubt anybody has such as
custom dataset pre-compiled, so I would be in favor of dropping the
option.

> Another idea just came up: What about changing my patch to a comma-
> separated string option like BR2_PACKAGE_ICU_DATA_LANGUAGES with a 
> default of "en" and icu.mk creates the corresponding json file during the 
> build:
> 
> {
>   "localeFilter": {
>     "filterType": "language",
>     "includelist": [
>       "en"
>     ]
>   }
> }
> 
> Users could then use this option to filter more languages are change it 
> to an empty option which will provide the full data file, like today.
> This way we would lose the more detailed filter possibilities offered by 
> the icu build system but I guess most users would be happy just being 
> able to filter languages.

This is unfortunately more "limiting" than the mechanism you propose
today, which really provides the full flexibility allowed by the ICU
custom dataset creation mechanism.

Should we have a "choice" between 3 possibilities:

 (1) A custom dataset created based on the list of locales enabled by
     the user in the Buildroot configuration. This would be the default
     possibility.

 (2) The full dataset.

 (3) A custom dataset based on a custom JSON file provided by the user,
     for advanced configuration.

What do you think ?

> >  * When a custom dataset needs to be generated thanks to your new
> >    option, we need to download the source of this dataset as an extra
> >    download for the host-icu package, replace the source/data/ subdir
> >    with this source data set, and ask the host-icu package to generate
> >    icudt69l.dat. Then the target icu package grabs this pre-compiled
> >    icudt69l.dat.  
> 
> Correct, you need host tools, which are configured using --with-data-
> packaging=archive, to build a new data file which is then injected into 
> the target package by ICU_PRE_CONFIGURE_HOOKS.

Indeed, that's what I understood. As I mentioned, would be good to have
this explained via a comment in the .mk file.

Thanks a lot!

Thomas
diff mbox series

Patch

diff --git a/package/icu/Config.in b/package/icu/Config.in
index b0c9eac83d..564e509fa0 100644
--- a/package/icu/Config.in
+++ b/package/icu/Config.in
@@ -25,6 +25,15 @@  config BR2_PACKAGE_ICU_CUSTOM_DATA_PATH
 	  provided by buildroot.
 	  Leave empty to not use this functionality.
 
+config BR2_PACKAGE_ICU_DATA_FILTER_FILE
+	string "Path to custom data configuration file"
+	help
+	  The ICU Data Build Tool enables you to write a configuration
+	  file that specifies what features and locales to include in a
+	  custom data bundle:
+	  https://github.com/unicode-org/icu/blob/main/docs/userguide/icu_data/buildtool.md
+	  Leave empty to not use this functionality.
+
 endif
 
 comment "icu needs a toolchain w/ C++, wchar, threads, gcc >= 4.9, host gcc >= 4.9"
diff --git a/package/icu/icu.hash b/package/icu/icu.hash
index 5ed7cc4217..ab0a642259 100644
--- a/package/icu/icu.hash
+++ b/package/icu/icu.hash
@@ -1,3 +1,4 @@ 
 # Locally computed
 sha256  4cba7b7acd1d3c42c44bb0c14be6637098c7faf2b330ce876bc5f3b915d09745  icu4c-69_1-src.tgz
+sha256  4fc2d8cfc3343673123586fca3967404abd4e346fba5515829204533b3bae4bf  icu4c-69_1-data.zip
 sha256  7915b19db903070778581ae05d8bf4ea241b34a05deb51ca4f5cbb15ea1cbba3  LICENSE
diff --git a/package/icu/icu.mk b/package/icu/icu.mk
index 0a17c61462..3acc0d266e 100644
--- a/package/icu/icu.mk
+++ b/package/icu/icu.mk
@@ -7,7 +7,8 @@ 
 # Git tags (and therefore versions on release-monitoring.org) use the
 # XX-Y format, but the tarballs are named XX_Y and the containing
 # directories XX.Y.
-ICU_VERSION = 69-1
+ICU_VERSION_MAJOR = 69
+ICU_VERSION = $(ICU_VERSION_MAJOR)-1
 ICU_SOURCE = icu4c-$(subst -,_,$(ICU_VERSION))-src.tgz
 ICU_SITE = \
 	https://github.com/unicode-org/icu/releases/download/release-$(ICU_VERSION)
@@ -56,6 +57,27 @@  endef
 ICU_POST_PATCH_HOOKS += ICU_COPY_CUSTOM_DATA
 endif
 
+ICU_DATA_FILTER_FILE = $(call qstrip,$(BR2_PACKAGE_ICU_DATA_FILTER_FILE))
+
+ifneq ($(ICU_DATA_FILTER_FILE),)
+HOST_ICU_DATA_SOURCE = $(subst src.tgz,data.zip,$(ICU_SOURCE))
+HOST_ICU_EXTRA_DOWNLOADS += $(HOST_ICU_SITE)/$(HOST_ICU_DATA_SOURCE)
+
+define HOST_ICU_EXTRACT_DATA
+	rm -rf $(@D)/$(HOST_ICU_SUBDIR)/data
+	$(UNZIP) $(ICU_DL_DIR)/$(HOST_ICU_DATA_SOURCE) -d $(@D)/$(HOST_ICU_SUBDIR)
+endef
+HOST_ICU_POST_EXTRACT_HOOKS += HOST_ICU_EXTRACT_DATA
+
+HOST_ICU_CONF_ENV = ICU_DATA_FILTER_FILE=$(ICU_DATA_FILTER_FILE)
+HOST_ICU_CONF_OPTS += --with-data-packaging=archive
+
+define ICU_COPY_CUSTOM_DATA
+	$(INSTALL) -D -m 644 $(HOST_ICU_DIR)/$(HOST_ICU_SUBDIR)/data/out/icudt$(ICU_VERSION_MAJOR)l.dat $(@D)/$(ICU_SUBDIR)/data/in/
+endef
+ICU_PRE_CONFIGURE_HOOKS += ICU_COPY_CUSTOM_DATA
+endif
+
 define ICU_REMOVE_DEV_FILES
 	rm -f $(addprefix $(TARGET_DIR)/usr/bin/,derb genbrk gencfu gencnval gendict genrb icuinfo makeconv uconv)
 	rm -f $(addprefix $(TARGET_DIR)/usr/sbin/,genccode gencmn gennorm2 gensprep icupkg)