localedef: check LC_IDENTIFICATION.category values

Message ID 1460587532-5278-1-git-send-email-vapier@gentoo.org
State New

Commit Message

Mike Frysinger April 13, 2016, 10:45 p.m. UTC
Currently localedef accepts any value for the category keyword.  This has
allowed bad values to propagate to the vast majority of locales (~90%).
Add some logic to only accept the 1993 POSIX and 2002 ISO-14652 standards.

2016-04-13  Mike Frysinger  <vapier@gentoo.org>

	* locale/programs/ld-identification.c (identification_finish): Check
	that the values in identification->category are only posix:1993 or
	i18n:2002.
---
 locale/programs/ld-identification.c | 42 ++++++++++++++++++++++++++++++-------
 1 file changed, 35 insertions(+), 7 deletions(-)
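
For illustration, the check described above can be sketched as a standalone helper. This is only a minimal sketch: the names known_standards and is_known_standard are hypothetical (the patch performs the comparison inline inside identification_finish), and the list holds just the two values named in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The two values the commit message says should be accepted.  */
static const char *const known_standards[] =
  {
    "posix:1993",
    "i18n:2002",
  };

/* Hypothetical helper: return true only if VALUE is an exact,
   case-sensitive match for one of the entries above.  */
static bool
is_known_standard (const char *value)
{
  assert (value != NULL);

  for (size_t i = 0;
       i < sizeof (known_standards) / sizeof (known_standards[0]); ++i)
    if (strcmp (value, known_standards[i]) == 0)
      return true;

  return false;
}
```

Under these assumptions, is_known_standard ("posix:1993") is true, while a value such as "en_US:2000" is rejected.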

Comments

Keld Simonsen April 14, 2016, 8:59 a.m. UTC | #1
Please also allow ISO 30112 categories.

best regards
keld

On Wed, Apr 13, 2016 at 06:45:32PM -0400, Mike Frysinger wrote:
> Currently localedef accepts any value for the category keyword.  This has
> allowed bad values to propagate to the vast majority of locales (~90%).
> Add some logic to only accept the 1993 POSIX and 2002 ISO-14652 standards.
> 
> 2016-04-13  Mike Frysinger  <vapier@gentoo.org>
> 
> 	* locale/programs/ld-identification.c (identification_finish): Check
> 	that the values in identification->category are only posix:1993 or
> 	i18n:2002.
> ---
>  locale/programs/ld-identification.c | 42 ++++++++++++++++++++++++++++++-------
>  1 file changed, 35 insertions(+), 7 deletions(-)
> 
> diff --git a/locale/programs/ld-identification.c b/locale/programs/ld-identification.c
> index 1e8fa84..eccb388 100644
> --- a/locale/programs/ld-identification.c
> +++ b/locale/programs/ld-identification.c
> @@ -164,14 +164,42 @@ No definition for %s category found"), "LC_IDENTIFICATION"));
>    TEST_ELEM (date);
>  
>    for (num = 0; num < __LC_LAST; ++num)
> -    if (num != LC_ALL && identification->category[num] == NULL)
> -      {
> -	if (verbose && ! nothing)
> -	  WITH_CUR_LOCALE (error (0, 0, _("\
> +    {
> +      /* We don't accept/parse this category, so skip it early.  */
> +      if (num == LC_ALL)
> +	continue;
> +
> +      if (identification->category[num] == NULL)
> +	{
> +	  if (verbose && ! nothing)
> +	    WITH_CUR_LOCALE (error (0, 0, _("\
>  %s: no identification for category `%s'"),
> -				  "LC_IDENTIFICATION", category_name[num]));
> -	identification->category[num] = "";
> -      }
> +				    "LC_IDENTIFICATION", category_name[num]));
> +	  identification->category[num] = "";
> +	}
> +      else
> +	{
> +	  /* Only list the standards we care about.  */
> +	  static const char * const standards[] =
> +	    {
> +	      "posix:1993",
> +	      "i18n:2002",
> +	    };
> +	  size_t i;
> +	  bool matched = false;
> +
> +	  for (i = 0; i < sizeof (standards) / sizeof (standards[0]); ++i)
> +	    if (strcmp (identification->category[num], standards[i]) == 0)
> +	      matched = true;
> +
> +	  if (matched != true)
> +	    WITH_CUR_LOCALE (error (0, 0, _("\
> +%s: unknown standard `%s' for category `%s'"),
> +				    "LC_IDENTIFICATION",
> +				    identification->category[num],
> +				    category_name[num]));
> +	}
> +    }
>  }
>  
>  
> -- 
> 2.7.4
Keld Simonsen April 14, 2016, 9:26 a.m. UTC | #2
Actually the standards 14652/30112 were set up so you could declare
what version of the locale category was used for the data.
POSIX is different from 14652 and again different from 30112.
30112 is the one that most closely corresponds to glibc implementations.


I also think that POSIX allows for more categories than the ones that the
9945 standard defines, and in that way 14652 and 30112 are compatible 
with POSIX. I would advise that this still be allowed, but then declared
in the LC_IDENTIFICATION section. Maybe we should use a specific version value like
"non-standard" to indicate that.

I would advise using the values for the locale versions
given in 30112. The values defined in 30112 are:
i18n:2004
i18n:2012
posix:1993

Best regards
Keld
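
As a rough sketch of what accepting those values could look like, assuming the three strings listed in this message are the complete set (that is an assumption taken from the message, not checked against the TR itself): the names tr30112_values, value_in_list and is_tr30112_value are hypothetical and do not exist in glibc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Values quoted in the message above as defined by 30112.  */
static const char *const tr30112_values[] =
  {
    "i18n:2004",
    "i18n:2012",
    "posix:1993",
  };

/* Hypothetical helper: exact string match of VALUE against the
   COUNT entries of LIST.  */
static bool
value_in_list (const char *value, const char *const *list, size_t count)
{
  assert (value != NULL && list != NULL);

  for (size_t i = 0; i < count; ++i)
    if (strcmp (value, list[i]) == 0)
      return true;

  return false;
}

/* Hypothetical wrapper that checks VALUE against the list above.  */
static bool
is_tr30112_value (const char *value)
{
  return value_in_list (value, tr30112_values,
                        sizeof (tr30112_values) / sizeof (tr30112_values[0]));
}
```

Keeping the list separate from the matching loop means a different set of values can be swapped in without touching the comparison logic.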


Mike Frysinger April 14, 2016, 1:50 p.m. UTC | #3
On 14 Apr 2016 11:26, keld@keldix.com wrote:
> Actually the standards 14652/30112 were set up so you could declare
> what version of the locale category was used for the data.
> POSIX is different from 14652 and again different from 30112.
> 30112 is the one that most closely corresponds to glibc implementations.

in general, for standards that are stuck behind ISO's dumb paywall (they
want to charge CHF198 for the pleasure of downloading what should be in
the public), you'll have to tell me what values to plug in, and/or what
it says.

although i have found this link:
	http://www.open-std.org/JTC1/SC35/WG5/docs/30112d10.pdf
is that the same ?

if it is, i would highlight that the examples provided in the spec do
not seem to line up with the spec itself ;).  the Danish example that
is embedded in the file tries to use "i18n:2000", and it doesn't use
double quotes like it says it should be.

> I also think that POSIX allows for more categories than the ones that the
> 9945 standard defines, and in that way 14652 and 30112 are compatible 

looks like ISO 9945 is just the combined POSIX standard (2003 edition).
the public 2004 edition [1] and 2013 edition [2] do not define the cat
LC_IDENTIFICATION, so they wouldn't have anything to say here.  also,
even if those allow for defining of arbitrary categories, that's kind
of orthogonal to glibc's localedef needs isn't it ?  the utility has
been rejecting all unknown categories for basically ever at this point.
[1] http://pubs.opengroup.org/onlinepubs/009695399/
[2] http://pubs.opengroup.org/onlinepubs/9699919799/

if you try to do:
LC_FOO
...
END LC_FOO
localedef will reject it as a syntax error.

if you try to do:
LC_IDENTIFICATION
...
category "en_US:2000";LC_FOO
...
END LC_IDENTIFICATION
localedef will reject it as a syntax error (ignoring the standard part).

are you referring to something else ?

> with POSIX. I would advise that this still be allowed, but then declared
> in the LC_IDENTIFICATION section. Maybe we should use a specific version value like
> "non-standard" to indicate that.

why do we need to support that ?  we're talking about what localedef
will accept, and localedef is entirely a glibc-specific utility.  the
binary format it produces is internal glibc ABI.  seems like accepting
other random values isn't useful to us.

> I would advise using the values for the locale versions
> given in 30112. The values defined in 30112 are:
> i18n:2004
> i18n:2012
> posix:1993

OK.  shall i update all the locale files then to use i18n:2012 ?
-mike
Keld Simonsen April 14, 2016, 3:04 p.m. UTC | #4
On Thu, Apr 14, 2016 at 09:50:33AM -0400, Mike Frysinger wrote:
> On 14 Apr 2016 11:26, keld@keldix.com wrote:
> > Actually the standards 14652/30112 were set up so you could declare
> > what version of the locale category was used for the data.
> > POSIX is different from 14652 and again different from 30112.
> > 30112 is the one that most closely corresponds to glibc implementations.
> 
> in general, for standards that are stuck behind ISO's dumb paywall (they
> want to charge CHF198 for the pleasure of downloading what should be in
> the public), you'll have to tell me what values to plug in, and/or what
> it says.

I agree.

> although i have found this link:
> 	http://www.open-std.org/JTC1/SC35/WG5/docs/30112d10.pdf
> is that the same ?

It is a new Working Draft for the revision of 30112, so it contains all of
the approved TR 30112 from 2014, plus some. But it is not a standard,
it is work in progress. That is why we are allowed to have it publicly available.

> if it is, i would highlight that the examples provided in the spec do
> not seem to line up with the spec itself ;).  the Danish example that
> is embedded in the file tries to use "i18n:2000", and it doesn't use
> double quotes like it says it should be.

There are errors everywhere. This is a draft, and not supposed to be error-free.
Anyway, the same inconsistency was probably in the approved TR.
I will see to it that this is corrected. Probably it should be marked with
the new standard's identifying value.

> > I also think that POSIX allows for more categories than the ones that the
> > 9945 standard defines, and in that way 14652 and 30112 are compatible 
> 
> looks like ISO 9945 is just the combined POSIX standard (2003 edition).
> the public 2004 edition [1] and 2013 edition [2] do not define the cat
> LC_IDENTIFICATION, so they wouldn't have anything to say here.  also,
> even if those allow for defining of arbitrary categories, that's kind
> of orthogonal to glibc's localedef needs isn't it ?  the utility has
> been rejecting all unknown categories for basically ever at this point.
> [1] http://pubs.opengroup.org/onlinepubs/009695399/
> [2] http://pubs.opengroup.org/onlinepubs/9699919799/

Well, yes, LC_IDENTIFICATION is a novelty of 14652. 
But 9945 - POSIX does allow implementation defined categories AFAIK.
There is one new category in 30112, namely LC_KEYBOARD. I am not sure whether
glibc supports LC_XLITERATE either, or the functionality is present only in
LC_CTYPE.

> 
> if you try to do:
> LC_FOO
> ...
> END LC_FOO
> localedef will reject it as a syntax error.
> 
> if you try to do:
> LC_IDENTIFICATION
> ...
> category "en_US:2000";LC_FOO
> ...
> END LC_IDENTIFICATION
> localedef will reject it as a syntax error (ignoring the standard part).
> 
> are you referring to something else ?

No. I would like your last example to not error, it could issue a warning,
or at least that LC_KEYBOARD be accepted. 
In that way one could use localedef to test new functionality.

> > with POSIX. I would advise that this still be allowed, but then declared
> > in the LC_IDENTIFICATION section. Maybe we should use a specific version value like
> > "non-standard" to indicate that.
> 
> why do we need to support that ?  we're talking about what localedef
> will accept, and localedef is entirely a glibc-specific utility.  the
> binary format it produces is internal glibc ABI.  seems like accepting
> other random values isn't useful to us.

Localedef is specified in POSIX, 
http://pubs.opengroup.org/onlinepubs/009696699/utilities/localedef.html

> > I would advise using the values for the locale versions
> > given in 30112. The values defined in 30112 are:
> > i18n:2004
> > i18n:2012
> > posix:1993
> 
> OK.  shall i update all the locale files then to use i18n:2012 ?

Yes, I think that this is the most appropriate.

Best regards
Keld
Mike Frysinger April 14, 2016, 5:49 p.m. UTC | #5
On 14 Apr 2016 17:04, keld@keldix.com wrote:
> On Thu, Apr 14, 2016 at 09:50:33AM -0400, Mike Frysinger wrote:
> > On 14 Apr 2016 11:26, keld@keldix.com wrote:
> > > I also think that POSIX allows for more categories than the ones that the
> > > 9945 standard defines, and in that way 14652 and 30112 are compatible 
> > 
> > looks like ISO 9945 is just the combined POSIX standard (2003 edition).
> > the public 2004 edition [1] and 2013 edition [2] do not define the cat
> > LC_IDENTIFICATION, so they wouldn't have anything to say here.  also,
> > even if those allow for defining of arbitrary categories, that's kind
> > of orthogonal to glibc's localedef needs isn't it ?  the utility has
> > been rejecting all unknown categories for basically ever at this point.
> > [1] http://pubs.opengroup.org/onlinepubs/009695399/
> > [2] http://pubs.opengroup.org/onlinepubs/9699919799/
> 
> Well, yes, LC_IDENTIFICATION is a novelty of 14652. 
> But 9945 - POSIX does allow implementation defined categories AFAIK.

sure -- see below

> There is one new category in 30112, namely LC_KEYBOARD. I am not sure whether
> glibc supports LC_XLITERATE either, or the functionality is present only in
> LC_CTYPE.

we don't support LC_KEYBOARD or LC_XLITERATE today.  i think any new
categories would need to be proposed including why glibc should carry
them at all.  i haven't read the standard, so i can't speak to either.

> > if you try to do:
> > LC_FOO
> > ...
> > END LC_FOO
> > localedef will reject it as a syntax error.
> > 
> > if you try to do:
> > LC_IDENTIFICATION
> > ...
> > category "en_US:2000";LC_FOO
> > ...
> > END LC_IDENTIFICATION
> > localedef will reject it as a syntax error (ignoring the standard part).
> > 
> > are you referring to something else ?
> 
> No. I would like your last example to not error, it could issue a warning,
> or at least that LC_KEYBOARD be accepted. 
> In that way one could use localedef to test new functionality.

we can have it warn.  localedef has precedent w/not warning about many
things or being fatal by default, but adding -v makes it more strict.
this seems to fall into that bucket.

i'm not keen on -v/--verbose being a hidden alias to also "exit non-zero
in many more cases", but that's a diff topic :).

> > > with POSIX. I would advise that this still be allowed, but then declared
> > > in the LC_IDENTIFICATION section. Maybe we should use a specific version value like
> > > "non-standard" to indicate that.
> > 
> > why do we need to support that ?  we're talking about what localedef
> > will accept, and localedef is entirely a glibc-specific utility.  the
> > binary format it produces is internal glibc ABI.  seems like accepting
> > other random values isn't useful to us.
> 
> Localedef is specified in POSIX, 
> http://pubs.opengroup.org/onlinepubs/009696699/utilities/localedef.html

on the frontend sure.  i was thinking of its output format which is not
specified by POSIX but is an internal glibc ABI detail.  it even says:
	The localedef utility shall convert source definitions for locale
	categories into a format usable by the functions and utilities ...
i.e. it doesn't specify that output format.

back to the frontend, what POSIX specifically says is:
	In addition, the input may contain source for implementation-defined
	categories.

so glibc's localedef is free to support as many more or few categories as
it sees fit.  that includes outright rejecting unknown ones.

also, if we want to speak strictly about POSIX, it also says:
	-u  code_set_name
	Specify the name of a codeset used as the target mapping of character
	symbols and collating element symbols whose encoding values are defined
	in terms of the ISO/IEC 10646-1:2000 standard position constant values.

pretty sure that says we aren't even permitted to support a newer standard
there.  whether it matters in practice i'm not sure (haven't done a diff on
the diff versions/standards).
-mike
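
To make the "warn, and only under -v" idea above concrete, here is a rough sketch. It is not glibc code: the enum, both function names and the message routing are hypothetical, and the policy shown (silent by default, a non-fatal warning when verbose) is just one reading of the discussion. Only the message wording is borrowed from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical policy: what to do about an unrecognized standard value.  */
enum diag_action { DIAG_SILENT, DIAG_WARN };

/* Stay quiet by default; warn only when verbose mode is requested.  */
static enum diag_action
unknown_value_action (bool verbose)
{
  return verbose ? DIAG_WARN : DIAG_SILENT;
}

/* Write a warning to OUT if the policy asks for one.  Returns the
   number of warnings written (0 or 1); never treated as fatal here.  */
static int
report_unknown_value (FILE *out, bool verbose,
                      const char *category, const char *value)
{
  assert (out != NULL && category != NULL && value != NULL);

  if (unknown_value_action (verbose) == DIAG_SILENT)
    return 0;

  fprintf (out, "LC_IDENTIFICATION: unknown standard `%s' for category `%s'\n",
           value, category);
  return 1;
}
```

Separating the policy decision from the printing keeps the question of whether such a warning should also change the exit status as an independent choice.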

Patch

diff --git a/locale/programs/ld-identification.c b/locale/programs/ld-identification.c
index 1e8fa84..eccb388 100644
--- a/locale/programs/ld-identification.c
+++ b/locale/programs/ld-identification.c
@@ -164,14 +164,42 @@  No definition for %s category found"), "LC_IDENTIFICATION"));
   TEST_ELEM (date);
 
   for (num = 0; num < __LC_LAST; ++num)
-    if (num != LC_ALL && identification->category[num] == NULL)
-      {
-	if (verbose && ! nothing)
-	  WITH_CUR_LOCALE (error (0, 0, _("\
+    {
+      /* We don't accept/parse this category, so skip it early.  */
+      if (num == LC_ALL)
+	continue;
+
+      if (identification->category[num] == NULL)
+	{
+	  if (verbose && ! nothing)
+	    WITH_CUR_LOCALE (error (0, 0, _("\
 %s: no identification for category `%s'"),
-				  "LC_IDENTIFICATION", category_name[num]));
-	identification->category[num] = "";
-      }
+				    "LC_IDENTIFICATION", category_name[num]));
+	  identification->category[num] = "";
+	}
+      else
+	{
+	  /* Only list the standards we care about.  */
+	  static const char * const standards[] =
+	    {
+	      "posix:1993",
+	      "i18n:2002",
+	    };
+	  size_t i;
+	  bool matched = false;
+
+	  for (i = 0; i < sizeof (standards) / sizeof (standards[0]); ++i)
+	    if (strcmp (identification->category[num], standards[i]) == 0)
+	      matched = true;
+
+	  if (matched != true)
+	    WITH_CUR_LOCALE (error (0, 0, _("\
+%s: unknown standard `%s' for category `%s'"),
+				    "LC_IDENTIFICATION",
+				    identification->category[num],
+				    category_name[num]));
+	}
+    }
 }