diff mbox

[v9,01/37] qobject: Document more shortcomings in our number handling

Message ID 1453219845-30939-2-git-send-email-eblake@redhat.com
State New
Headers show

Commit Message

Eric Blake Jan. 19, 2016, 4:10 p.m. UTC
We've already documented that our JSON parsing is locale dependent;
but we should also document that our JSON output has the same
problem.  Additionally, JSON requires finite values (you have to
upgrade to JSON5 to get support for Inf or NaN), and our output
risks truncating floating point numbers to the point of losing
significant precision.

Sadly, this series is not going to be the one that addresses these
problems.

Fix some trailing whitespace I noticed in the vicinity.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>

---
v9: no change
v8: no change
v7: new patch
---
 qobject/json-parser.c | 4 +++-
 qobject/qjson.c       | 8 +++++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

Comments

Markus Armbruster Jan. 20, 2016, 9:02 a.m. UTC | #1
Eric Blake <eblake@redhat.com> writes:

> We've already documented that our JSON parsing is locale dependent;
> but we should also document that our JSON output has the same
> problem.  Additionally, JSON requires finite values (you have to
> upgrade to JSON5 to get support for Inf or NaN), and our output
> risks truncating floating point numbers to the point of losing
> significant precision.
>
> Sadly, this series is not going to be the one that addresses these
> problems.
>
> Fix some trailing whitespace I noticed in the vicinity.
>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
>
> ---
> v9: no change
> v8: no change
> v7: new patch
> ---
>  qobject/json-parser.c | 4 +++-
>  qobject/qjson.c       | 8 +++++++-
>  2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/qobject/json-parser.c b/qobject/json-parser.c
> index 3c5d35d..6ab98a7 100644
> --- a/qobject/json-parser.c
> +++ b/qobject/json-parser.c
> @@ -1,5 +1,5 @@
>  /*
> - * JSON Parser 
> + * JSON Parser
>   *
>   * Copyright IBM, Corp. 2009
>   *
> @@ -519,6 +519,8 @@ static QObject *parse_literal(JSONParserContext *ctxt)
>      }
>      case JSON_FLOAT:
>          /* FIXME dependent on locale */
> +        /* FIXME our lexer matches RFC7159 in forbidding Inf or NaN,

For what it's worth, the RFC spells this "RFC 7159".

> +         * but those might be useful extensions beyond JSON */
>          return QOBJECT(qfloat_from_double(strtod(token->str, NULL)));
>      default:
>          abort();
> diff --git a/qobject/qjson.c b/qobject/qjson.c
> index a3e6a7c..41d9d65 100644
> --- a/qobject/qjson.c
> +++ b/qobject/qjson.c
> @@ -237,6 +237,12 @@ static void to_json(const QObject *obj, QString *str, int pretty, int indent)
>          char buffer[1024];
>          int len;
>
> +        /* FIXME: snprintf() is locale dependent; but JSON requires
> +         * numbers to be formatted as if in the C locale. */

The new FIXME matches the existing one in parse_literal(), visible
above.

However, dependence on C locale is a pervasive issue in QEMU.  These two
comments could give readers the idea that it's an issue just here.
Suggest to add something like "Dependence on C locale is a pervasive
issue in QEMU."

> +        /* FIXME: This risks printing Inf or NaN, which are not valid
> +         * JSON values. */
> +        /* FIXME: the default precision of %f may be insufficient to
> +         * tell this number apart from others. */

Yup.

The easy way to avoid loss of precision is %a, but of course that's way
too much sophistication for JSON.

Avoiding loss of precision with a decimal format is non-trivial; see
Steele, Jr., Guy L., and White, Jon L. How to print floating-point
numbers accurately, SIGPLAN ’90, and later improvements collected here:
http://stackoverflow.com/questions/7153979/algorithm-to-convert-an-ieee-754-double-to-a-string

>          len = snprintf(buffer, sizeof(buffer), "%f", qfloat_get_double(val));
>          while (len > 0 && buffer[len - 1] == '0') {
>              len--;
> @@ -247,7 +253,7 @@ static void to_json(const QObject *obj, QString *str, int pretty, int indent)
>          } else {
>              buffer[len] = 0;
>          }
> -        
> +
>          qstring_append(str, buffer);
>          break;
>      }
Eric Blake Jan. 20, 2016, 3:55 p.m. UTC | #2
On 01/20/2016 02:02 AM, Markus Armbruster wrote:

>> @@ -519,6 +519,8 @@ static QObject *parse_literal(JSONParserContext *ctxt)
>>      }
>>      case JSON_FLOAT:
>>          /* FIXME dependent on locale */
>> +        /* FIXME our lexer matches RFC7159 in forbidding Inf or NaN,
> 
> For what it's worth, the RFC spells this "RFC 7159".

Looks like we use space more often than not, but that we're
inconsistent.  For example:

slirp/tcp.h: * Per RFC 793, September, 1981.
slirp/tcp.h: * Per RFC793, September, 1981.

Will fix if I need to respin, otherwise I assume you can do it.


>> +        /* FIXME: snprintf() is locale dependent; but JSON requires
>> +         * numbers to be formatted as if in the C locale. */
> 
> The new FIXME matches the existing one in parse_literal(), visible
> above.
> 
> However, dependence on C locale is a pervasive issue in QEMU.  These two
> comments could give readers the idea that it's an issue just here.
> Suggest to add something like "Dependence on C locale is a pervasive
> issue in QEMU."

Good idea.

> 
>> +        /* FIXME: This risks printing Inf or NaN, which are not valid
>> +         * JSON values. */
>> +        /* FIXME: the default precision of %f may be insufficient to
>> +         * tell this number apart from others. */
> 
> Yup.
> 
> The easy way to avoid loss of precision is %a, but of course that's way
> too much sophistication for JSON.
> 
> Avoiding loss of precision with a decimal format is non-trivial; see
> Steele, Jr., Guy L., and White, Jon L. How to print floating-point
> numbers accurately, SIGPLAN ’90, and later improvements collected here:
> http://stackoverflow.com/questions/7153979/algorithm-to-convert-an-ieee-754-double-to-a-string

Ah, memories.  I read and implemented that paper when working on the
jikes compiler for the Java language back in the late nineties, as it is
the Java language specification which had the very neat property of
requiring the shortest decimal string that will unambiguously round back
to the same floating point pattern.

One alternative is to always output a guaranteed unambiguous decimal
string (although not necessarily the shortest), by using %.17f, using
<float.h> DBL_DECIMAL_DIG.  (Note that DBL_DIG of 15 is NOT sufficient -
it is the lower limit that says that a decimal->float->decimal will not
change the decimal; but we want the converse where a
float->decimal->float will not change the float.  There are stretches of
numbers where the pigeonhole principle applies; you can think of it this
way: there is no way to map all possible 2^10 (1024) binary values
inside 2^3 (1000) decimal digits without at least 24 of them needing one
more decimal digit.  But by the same arguments, DBL_DECIMAL_DIG is an
upper limit and usually more than you need.)

So, the question is whether we want to always output 17 digits, or
whether we want to do the poor-man's truncation scheme (easy to
understand, but not optimal use of the processor), or go all the way to
the algorithm of that paper (faster but lots harder to understand).  For
reference, here's the poor-man's algorithm in pseudocode:

if 0, inf, nan:
    special case
else:
    Obtain the DBL_DECIMAL_DIG string via sprintf %.17f
    i = 17;
    do {
        truncate the original string to i-1 decimal characters
        parse that with strtod()
        if the bit pattern differs:
            break;
    } while (--i);
    assert(i)
    use i digits of the string

As a separate patch, of course, but I have a pending patch that provides
a single place where we could drop in such an improvement:
https://lists.gnu.org/archive/html/qemu-devel/2015-12/msg03932.html
Markus Armbruster Jan. 21, 2016, 6:21 a.m. UTC | #3
Eric Blake <eblake@redhat.com> writes:

> On 01/20/2016 02:02 AM, Markus Armbruster wrote:
>
>>> @@ -519,6 +519,8 @@ static QObject *parse_literal(JSONParserContext *ctxt)
>>>      }
>>>      case JSON_FLOAT:
>>>          /* FIXME dependent on locale */
>>> +        /* FIXME our lexer matches RFC7159 in forbidding Inf or NaN,
>> 
>> For what it's worth, the RFC spells this "RFC 7159".
>
> Looks like we use space more often than not, but that we're
> inconsistent.  For example:
>
> slirp/tcp.h: * Per RFC 793, September, 1981.
> slirp/tcp.h: * Per RFC793, September, 1981.
>
> Will fix if I need to respin, otherwise I assume you can do it.

Okay.

>>> +        /* FIXME: snprintf() is locale dependent; but JSON requires
>>> +         * numbers to be formatted as if in the C locale. */
>> 
>> The new FIXME matches the existing one in parse_literal(), visible
>> above.
>> 
>> However, dependence on C locale is a pervasive issue in QEMU.  These two
>> comments could give readers the idea that it's an issue just here.
>> Suggest to add something like "Dependence on C locale is a pervasive
>> issue in QEMU."
>
> Good idea.
>
>> 
>>> +        /* FIXME: This risks printing Inf or NaN, which are not valid
>>> +         * JSON values. */
>>> +        /* FIXME: the default precision of %f may be insufficient to
>>> +         * tell this number apart from others. */
>> 
>> Yup.
>> 
>> The easy way to avoid loss of precision is %a, but of course that's way
>> too much sophistication for JSON.
>> 
>> Avoiding loss of precision with a decimal format is non-trivial; see
>> Steele, Jr., Guy L., and White, Jon L. How to print floating-point
>> numbers accurately, SIGPLAN ’90, and later improvements collected here:
>> http://stackoverflow.com/questions/7153979/algorithm-to-convert-an-ieee-754-double-to-a-string
>
> Ah, memories.  I read and implemented that paper when working on the
> jikes compiler for the Java language back in the late nineties, as it is
> the Java language specification which had the very neat property of
> requiring the shortest decimal string that will unambiguously round back
> to the same floating point pattern.
>
> One alternative is to always output a guaranteed unambiguous decimal
> string (although not necessarily the shortest), by using %.17f, using
> <float.h> DBL_DECIMAL_DIG.  (Note that DBL_DIG of 15 is NOT sufficient -
> it is the lower limit that says that a decimal->float->decimal will not
> change the decimal; but we want the converse where a
> float->decimal->float will not change the float.  There are stretches of
> numbers where the pigeonhole principle applies; you can think of it this
> way: there is no way to map all possible 2^10 (1024) binary values
> inside 2^3 (1000) decimal digits without at least 24 of them needing one
> more decimal digit.  But by the same arguments, DBL_DECIMAL_DIG is an
> upper limit and usually more than you need.)
>
> So, the question is whether we want to always output 17 digits, or
> whether we want to do the poor-man's truncation scheme (easy to
> understand, but not optimal use of the processor), or go all the way to
> the algorithm of that paper (faster but lots harder to understand).  For
> reference, here's the poor-man's algorithm in pseudocode:

I don't think we want to implement floating-point formatting ourselves.

> if 0, inf, nan:
>     special case
> else:
>     Obtain the DBL_DECIMAL_DIG string via sprintf %.17f
>     i = 17;
>     do {
>         truncate the original string to i-1 decimal characters
>         parse that with strtod()
>         if the bit pattern differs:
>             break;
>     } while (--i);
>     assert(i)
>     use i digits of the string

That's a lot of strtod()...  May not be noticable if we write the result
to a slowish sink.  Binary search could save a few.

Naive idea: chop off trailing '0'?

> As a separate patch, of course, but I have a pending patch that provides
> a single place where we could drop in such an improvement:
> https://lists.gnu.org/archive/html/qemu-devel/2015-12/msg03932.html

Definitely separate.
Eric Blake Jan. 21, 2016, 5:12 p.m. UTC | #4
On 01/20/2016 11:21 PM, Markus Armbruster wrote:
>> One alternative is to always output a guaranteed unambiguous decimal
>> string (although not necessarily the shortest), by using %.17f, using
>> <float.h> DBL_DECIMAL_DIG.  (Note that DBL_DIG of 15 is NOT sufficient -
>> it is the lower limit that says that a decimal->float->decimal will not
>> change the decimal; but we want the converse where a
>> float->decimal->float will not change the float.  There are stretches of
>> numbers where the pigeonhole principle applies; you can think of it this
>> way: there is no way to map all possible 2^10 (1024) binary values
>> inside 2^3 (1000) decimal digits without at least 24 of them needing one
>> more decimal digit.  But by the same arguments, DBL_DECIMAL_DIG is an
>> upper limit and usually more than you need.)
>>
>> So, the question is whether we want to always output 17 digits, or
>> whether we want to do the poor-man's truncation scheme (easy to
>> understand, but not optimal use of the processor), or go all the way to
>> the algorithm of that paper (faster but lots harder to understand).  For
>> reference, here's the poor-man's algorithm in pseudocode:
> 
> I don't think we want to implement floating-point formatting ourselves.

Well, we already have _some_ level of floating-point support built in
for TCG emulation of floating point on various architectures.  I don't
know how much would be easily reusable for this case, though.

> 
>> if 0, inf, nan:
>>     special case
>> else:
>>     Obtain the DBL_DECIMAL_DIG string via sprintf %.17f
>>     i = 17;
>>     do {
>>         truncate the original string to i-1 decimal characters
>>         parse that with strtod()
>>         if the bit pattern differs:
>>             break;
>>     } while (--i);
>>     assert(i)
>>     use i digits of the string
> 
> That's a lot of strtod()...  May not be noticable if we write the result
> to a slowish sink.  Binary search could save a few.
> 
> Naive idea: chop off trailing '0'?

That, and rounding trailing '9'.  Any other digits in positions 1-16 are
significant (other digits in position 17 might not matter, though), so I
guess a full 17 iterations is probably not strictly necessary.  Our
current code DOES chop trailing '0', but starting from the flawed
premise that 6 digits was enough precision.

> 
>> As a separate patch, of course, but I have a pending patch that provides
>> a single place where we could drop in such an improvement:
>> https://lists.gnu.org/archive/html/qemu-devel/2015-12/msg03932.html
> 
> Definitely separate.

Okay, all thoughts for a later day :)
Daniel P. Berrangé Jan. 21, 2016, 5:29 p.m. UTC | #5
On Thu, Jan 21, 2016 at 07:21:49AM +0100, Markus Armbruster wrote:
> Eric Blake <eblake@redhat.com> writes:
> 
> > On 01/20/2016 02:02 AM, Markus Armbruster wrote:
> >
> >>> @@ -519,6 +519,8 @@ static QObject *parse_literal(JSONParserContext *ctxt)
> >>>      }
> >>>      case JSON_FLOAT:
> >>>          /* FIXME dependent on locale */
> >>> +        /* FIXME our lexer matches RFC7159 in forbidding Inf or NaN,
> >> 
> >> For what it's worth, the RFC spells this "RFC 7159".
> >
> > Looks like we use space more often than not, but that we're
> > inconsistent.  For example:
> >
> > slirp/tcp.h: * Per RFC 793, September, 1981.
> > slirp/tcp.h: * Per RFC793, September, 1981.
> >
> > Will fix if I need to respin, otherwise I assume you can do it.
> 
> Okay.
> 
> >>> +        /* FIXME: snprintf() is locale dependent; but JSON requires
> >>> +         * numbers to be formatted as if in the C locale. */
> >> 
> >> The new FIXME matches the existing one in parse_literal(), visible
> >> above.
> >> 
> >> However, dependence on C locale is a pervasive issue in QEMU.  These two
> >> comments could give readers the idea that it's an issue just here.
> >> Suggest to add something like "Dependence on C locale is a pervasive
> >> issue in QEMU."
> >
> > Good idea.
> >
> >> 
> >>> +        /* FIXME: This risks printing Inf or NaN, which are not valid
> >>> +         * JSON values. */
> >>> +        /* FIXME: the default precision of %f may be insufficient to
> >>> +         * tell this number apart from others. */
> >> 
> >> Yup.
> >> 
> >> The easy way to avoid loss of precision is %a, but of course that's way
> >> too much sophistication for JSON.
> >> 
> >> Avoiding loss of precision with a decimal format is non-trivial; see
> >> Steele, Jr., Guy L., and White, Jon L. How to print floating-point
> >> numbers accurately, SIGPLAN ’90, and later improvements collected here:
> >> http://stackoverflow.com/questions/7153979/algorithm-to-convert-an-ieee-754-double-to-a-string
> >
> > Ah, memories.  I read and implemented that paper when working on the
> > jikes compiler for the Java language back in the late nineties, as it is
> > the Java language specification which had the very neat property of
> > requiring the shortest decimal string that will unambiguously round back
> > to the same floating point pattern.
> >
> > One alternative is to always output a guaranteed unambiguous decimal
> > string (although not necessarily the shortest), by using %.17f, using
> > <float.h> DBL_DECIMAL_DIG.  (Note that DBL_DIG of 15 is NOT sufficient -
> > it is the lower limit that says that a decimal->float->decimal will not
> > change the decimal; but we want the converse where a
> > float->decimal->float will not change the float.  There are stretches of
> > numbers where the pigeonhole principle applies; you can think of it this
> > way: there is no way to map all possible 2^10 (1024) binary values
> > inside 2^3 (1000) decimal digits without at least 24 of them needing one
> > more decimal digit.  But by the same arguments, DBL_DECIMAL_DIG is an
> > upper limit and usually more than you need.)
> >
> > So, the question is whether we want to always output 17 digits, or
> > whether we want to do the poor-man's truncation scheme (easy to
> > understand, but not optimal use of the processor), or go all the way to
> > the algorithm of that paper (faster but lots harder to understand).  For
> > reference, here's the poor-man's algorithm in pseudocode:
> 
> I don't think we want to implement floating-point formatting ourselves.

FWIW in libvirt we use per-thread locale switching, so that we can flip
into the C locale temporarily when formatting doubles. If those are not
available we have a bit of a nasty hack to post-process the locale
dependant string back into what we (hope) is C-locale formatting.
We don't do anything aobut the Inf/NaN problem though.

http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/util/virutil.c;h=bb9604a0c1ffb9c99e454e84878a8c376f773046;hb=HEAD#l454

Regards,
Daniel
diff mbox

Patch

diff --git a/qobject/json-parser.c b/qobject/json-parser.c
index 3c5d35d..6ab98a7 100644
--- a/qobject/json-parser.c
+++ b/qobject/json-parser.c
@@ -1,5 +1,5 @@ 
 /*
- * JSON Parser 
+ * JSON Parser
  *
  * Copyright IBM, Corp. 2009
  *
@@ -519,6 +519,8 @@  static QObject *parse_literal(JSONParserContext *ctxt)
     }
     case JSON_FLOAT:
         /* FIXME dependent on locale */
+        /* FIXME our lexer matches RFC7159 in forbidding Inf or NaN,
+         * but those might be useful extensions beyond JSON */
         return QOBJECT(qfloat_from_double(strtod(token->str, NULL)));
     default:
         abort();
diff --git a/qobject/qjson.c b/qobject/qjson.c
index a3e6a7c..41d9d65 100644
--- a/qobject/qjson.c
+++ b/qobject/qjson.c
@@ -237,6 +237,12 @@  static void to_json(const QObject *obj, QString *str, int pretty, int indent)
         char buffer[1024];
         int len;

+        /* FIXME: snprintf() is locale dependent; but JSON requires
+         * numbers to be formatted as if in the C locale. */
+        /* FIXME: This risks printing Inf or NaN, which are not valid
+         * JSON values. */
+        /* FIXME: the default precision of %f may be insufficient to
+         * tell this number apart from others. */
         len = snprintf(buffer, sizeof(buffer), "%f", qfloat_get_double(val));
         while (len > 0 && buffer[len - 1] == '0') {
             len--;
@@ -247,7 +253,7 @@  static void to_json(const QObject *obj, QString *str, int pretty, int indent)
         } else {
             buffer[len] = 0;
         }
-        
+
         qstring_append(str, buffer);
         break;
     }