Patchwork check-qjson: More thorough testing of UTF-8 in strings

login
register
mail settings
Submitter Markus Armbruster
Date Feb. 28, 2013, 7:42 p.m.
Message ID <87ip5cw7gl.fsf@blackfin.pond.sub.org>
Download mbox | patch
Permalink /patch/224165/
State New
Headers show

Comments

Markus Armbruster - Feb. 28, 2013, 7:42 p.m.
Blue Swirl <blauwirbel@gmail.com> writes:

> On Thu, Feb 28, 2013 at 8:14 AM, Markus Armbruster <armbru@redhat.com> wrote:
>> Blue Swirl <blauwirbel@gmail.com> writes:
>>
>>> On Mon, Feb 4, 2013 at 5:19 PM, Markus Armbruster <armbru@redhat.com> wrote:
>>>> Test cases are scraped from Markus Kuhn's UTF-8 decoder capability and
>>>> stress test at
>>>> http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>>>>
>>>> Unfortunately, both JSON parser and formatter misbehave right now.
>>>> This test expects current, incorrect results.  They're all clearly
>>>> marked, and are to be replaced by correct ones as the bugs get fixed.
>>>> See comments in new utf8_string() for details.
>>>>
>>>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>>>> ---
>>>>  tests/check-qjson.c | 625 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 625 insertions(+)
>>>>
>>>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c
>>>> index 32ffb43..4590b3a 100644
>>>> --- a/tests/check-qjson.c
>>>> +++ b/tests/check-qjson.c
>>>> @@ -1,8 +1,10 @@
>>>>  /*
>>>>   * Copyright IBM, Corp. 2009
>>>> + * Copyright (c) 2013 Red Hat Inc.
>>>>   *
>>>>   * Authors:
>>>>   *  Anthony Liguori   <aliguori@us.ibm.com>
>>>> + *  Markus Armbruster <armbru@redhat.com>,
>>>>   *
>>>>   * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
>>>>   * See the COPYING.LIB file in the top-level directory.
>>>> @@ -131,6 +133,628 @@ static void single_quote_string(void)
>>>>      }
>>>>  }
>>>>
>>>> +static void utf8_string(void)
>>>> +{
>>>> +    /*
>>>> +     * FIXME Current behavior for invalid UTF-8 sequences is
>>>> +     * incorrect.  This test expects current, incorrect results.
>>>> +     * They're all marked "bug:" below, and are to be replaced by
>>>> +     * correct ones as the bugs get fixed.
>>>> +     *
>>>> +     * The JSON parser rejects some invalid sequences, but accepts
>>>> +     * others without correcting the problem.
>>>> +     *
>>>> +     * The JSON formatter replaces some invalid sequences by U+FFFF (a
>>>> +     * noncharacter), and goes wonky for others.
>>>> +     *
>>>> +     * For both directions, we should either reject all invalid
>>>> +     * sequences, or minimize overlong sequences and replace all other
>>>> +     * invalid sequences by a suitable replacement character.  A
>>>> +     * common choice for replacement is U+FFFD.
>>>> +     *
>>>> +     * Problem: we can't easily deal with embedded U+0000.  Parsing
>>>> +     * the JSON string "this \\u0000" is fun" yields "this \0 is fun",
>>>> +     * which gets misinterpreted as NUL-terminated "this ".  We should
>>>> +     * consider using overlong encoding \xC0\x80 for U+0000 ("modified
>>>> +     * UTF-8").
>>>> +     *
>>>> +     * Tests are scraped from Markus Kuhn's UTF-8 decoder capability
>>>> +     * and stress test at
>>>> +     * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>>>> +     */
>>>> +    static const struct {
>>>> +        const char *json_in;
>>>> +        const char *utf8_out;
>>>> +        const char *json_out;   /* defaults to @json_in */
>>>> +        const char *utf8_in;    /* defaults to @utf8_out */
>>>> +    } test_cases[] = {
>>>> +        /*
>>>> +         * Bug markers used here:
>>>> +         * - bug: not corrected
>>>> +         *   JSON parser fails to correct invalid sequence(s)
>>>> +         * - bug: rejected
>>>> +         *   JSON parser rejects invalid sequence(s)
>>>> +         *   We may choose to define this as feature
>>>> +         * - bug: want "\"...\""
>>>> +         *   JSON formatter produces incorrect result, this is the
>>>> +         *   correct one, assuming replacement character U+FFFF
>>>> +         * - bug: want "..." (no \")
>>>> +         *   JSON parser produces incorrect result, this is the
>>>> +         *   correct one.
>>>> +         * Not marked explicitly, but trivial to find:
>>>> +         * - JSON formatter replacing invalid sequence by \\uFFFF is a
>>>> +         *   bug if we want it to fail for invalid sequences.
>>>> +         */
>> [...]
>>>> +        /* 2.1.4  4 bytes U+10000 */
>>>> +        {
>>>> +            "\"\xF0\x90\x80\x80\"",
>>>> +            "\xF0\x90\x80\x80",
>>>> +            "\"\\u0400\\uFFFF\"", /* bug: want "\"\\uD800\\uDC00\"" */
>>>> +        },
>> [...]
>>>> +        {}
>>>> +    };
>>>> +    int i;
>>>> +    QObject *obj;
>>>> +    QString *str;
>>>> +    const char *json_in, *utf8_out, *utf8_in, *json_out;
>>>> +
>>>> +    for (i = 0; test_cases[i].json_in; i++) {
>>>> +        json_in = test_cases[i].json_in;
>>>> +        utf8_out = test_cases[i].utf8_out;
>>>> +        utf8_in = test_cases[i].utf8_in ?: test_cases[i].utf8_out;
>>>> +        json_out = test_cases[i].json_out ?: test_cases[i].json_in;
>>>> +
>>>> +        obj = qobject_from_json(json_in);
>>>> +        if (utf8_out) {
>>>> +            g_assert(obj);
>>>> +            g_assert(qobject_type(obj) == QTYPE_QSTRING);
>>>> +            str = qobject_to_qstring(obj);
>>>> +            g_assert_cmpstr(qstring_get_str(str), ==, utf8_out);
>>>> +        } else {
>>>> +            g_assert(!obj);
>>>> +        }
>>>> +        qobject_decref(obj);
>>>> +
>>>> +        obj = QOBJECT(qstring_from_str(utf8_in));
>>>> +        str = qobject_to_json(obj);
>>>> +        if (json_out) {
>>>> +            g_assert(str);
>>>> +            g_assert_cmpstr(qstring_get_str(str), ==, json_out);
>>>
>>> This assertion trips on an ARM host (Debian stable):
>>>
>>> GTESTER tests/check-qjson
>>> **
>>> ERROR:/src/qemu/tests/check-qjson.c:775:utf8_string: assertion failed
>>> (qstring_get_str(str) == json_out): ("\"\\u0400\200\"" ==
>>> "\"\\u0400\\uFFFF\"")
>>> GTester: last random seed: R02S88b76f755e809e9024832d2ab6660afd
>>
>> Must be case 2.1.4, because that's where json_out is
>> "\"\\u0400\\uFFFF\"".
>>
>> We start by passing C string "\"\xF0\x90\x80\x80\"" to the JSON parser
>> qobject_from_json().  Yields "\xF0\x90\x80\x80", as expected.
>>
>> We then pass that to qobject_to_json().  Should yield
>> "\"\\uD800\\uDC00\"" (surrogate pair).  Does yield "\"\\u0400\\uFFFF\""
>> on my machine (known bug), and "\"\\u0400\200\"" on yours.
>>
>> Looks like the JSON formatter is not just broken (we knew that already),
>> it's broken in machine-dependent ways.  Good to know, thanks for
>> reporting.
>>
>> Obvious ways to get "make check" pass for you again *now*:
>>
>> * Disable check-qjson.  That's too big a hammer for me.
>>
>> * Disable test case 2.1.4 with a comment explaining why.
>>
>> * Suitable #ifdeffery around the expected value.
>>
>> Preferences?
>
> * Fix JSON formatter :-)

I want that too, but I'm afraid we can't have it *now* :)

> Disabling 2.1.4 only reveals the next problem:
> GTESTER tests/check-qjson
> GTester: last random seed: R02S6754f3523201dc81bb21de42e2ba843c
> **
> ERROR:/src/qemu/tests/check-qjson.c:777:utf8_string: assertion failed
> (qstring_get_str(str) == json_out): ("\"\\u8200\200\200\"" ==
> "\"\\u8200\\uFFFF\\uFFFF\"")

All right, I give up.  I can't fix to_json() tonight (I have maybe 30
minutes of usable brain left), but I can make it portably wrong.  Please
try the appended patch.
Blue Swirl - Feb. 28, 2013, 8:06 p.m.
On Thu, Feb 28, 2013 at 7:42 PM, Markus Armbruster <armbru@redhat.com> wrote:
> Blue Swirl <blauwirbel@gmail.com> writes:
>
>> On Thu, Feb 28, 2013 at 8:14 AM, Markus Armbruster <armbru@redhat.com> wrote:
>>> Blue Swirl <blauwirbel@gmail.com> writes:
>>>
>>>> On Mon, Feb 4, 2013 at 5:19 PM, Markus Armbruster <armbru@redhat.com> wrote:
>>>>> Test cases are scraped from Markus Kuhn's UTF-8 decoder capability and
>>>>> stress test at
>>>>> http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>>>>>
>>>>> Unfortunately, both JSON parser and formatter misbehave right now.
>>>>> This test expects current, incorrect results.  They're all clearly
>>>>> marked, and are to be replaced by correct ones as the bugs get fixed.
>>>>> See comments in new utf8_string() for details.
>>>>>
>>>>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>>>>> ---
>>>>>  tests/check-qjson.c | 625 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 625 insertions(+)
>>>>>
>>>>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c
>>>>> index 32ffb43..4590b3a 100644
>>>>> --- a/tests/check-qjson.c
>>>>> +++ b/tests/check-qjson.c
>>>>> @@ -1,8 +1,10 @@
>>>>>  /*
>>>>>   * Copyright IBM, Corp. 2009
>>>>> + * Copyright (c) 2013 Red Hat Inc.
>>>>>   *
>>>>>   * Authors:
>>>>>   *  Anthony Liguori   <aliguori@us.ibm.com>
>>>>> + *  Markus Armbruster <armbru@redhat.com>,
>>>>>   *
>>>>>   * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
>>>>>   * See the COPYING.LIB file in the top-level directory.
>>>>> @@ -131,6 +133,628 @@ static void single_quote_string(void)
>>>>>      }
>>>>>  }
>>>>>
>>>>> +static void utf8_string(void)
>>>>> +{
>>>>> +    /*
>>>>> +     * FIXME Current behavior for invalid UTF-8 sequences is
>>>>> +     * incorrect.  This test expects current, incorrect results.
>>>>> +     * They're all marked "bug:" below, and are to be replaced by
>>>>> +     * correct ones as the bugs get fixed.
>>>>> +     *
>>>>> +     * The JSON parser rejects some invalid sequences, but accepts
>>>>> +     * others without correcting the problem.
>>>>> +     *
>>>>> +     * The JSON formatter replaces some invalid sequences by U+FFFF (a
>>>>> +     * noncharacter), and goes wonky for others.
>>>>> +     *
>>>>> +     * For both directions, we should either reject all invalid
>>>>> +     * sequences, or minimize overlong sequences and replace all other
>>>>> +     * invalid sequences by a suitable replacement character.  A
>>>>> +     * common choice for replacement is U+FFFD.
>>>>> +     *
>>>>> +     * Problem: we can't easily deal with embedded U+0000.  Parsing
>>>>> +     * the JSON string "this \\u0000" is fun" yields "this \0 is fun",
>>>>> +     * which gets misinterpreted as NUL-terminated "this ".  We should
>>>>> +     * consider using overlong encoding \xC0\x80 for U+0000 ("modified
>>>>> +     * UTF-8").
>>>>> +     *
>>>>> +     * Tests are scraped from Markus Kuhn's UTF-8 decoder capability
>>>>> +     * and stress test at
>>>>> +     * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>>>>> +     */
>>>>> +    static const struct {
>>>>> +        const char *json_in;
>>>>> +        const char *utf8_out;
>>>>> +        const char *json_out;   /* defaults to @json_in */
>>>>> +        const char *utf8_in;    /* defaults to @utf8_out */
>>>>> +    } test_cases[] = {
>>>>> +        /*
>>>>> +         * Bug markers used here:
>>>>> +         * - bug: not corrected
>>>>> +         *   JSON parser fails to correct invalid sequence(s)
>>>>> +         * - bug: rejected
>>>>> +         *   JSON parser rejects invalid sequence(s)
>>>>> +         *   We may choose to define this as feature
>>>>> +         * - bug: want "\"...\""
>>>>> +         *   JSON formatter produces incorrect result, this is the
>>>>> +         *   correct one, assuming replacement character U+FFFF
>>>>> +         * - bug: want "..." (no \")
>>>>> +         *   JSON parser produces incorrect result, this is the
>>>>> +         *   correct one.
>>>>> +         * Not marked explicitly, but trivial to find:
>>>>> +         * - JSON formatter replacing invalid sequence by \\uFFFF is a
>>>>> +         *   bug if we want it to fail for invalid sequences.
>>>>> +         */
>>> [...]
>>>>> +        /* 2.1.4  4 bytes U+10000 */
>>>>> +        {
>>>>> +            "\"\xF0\x90\x80\x80\"",
>>>>> +            "\xF0\x90\x80\x80",
>>>>> +            "\"\\u0400\\uFFFF\"", /* bug: want "\"\\uD800\\uDC00\"" */
>>>>> +        },
>>> [...]
>>>>> +        {}
>>>>> +    };
>>>>> +    int i;
>>>>> +    QObject *obj;
>>>>> +    QString *str;
>>>>> +    const char *json_in, *utf8_out, *utf8_in, *json_out;
>>>>> +
>>>>> +    for (i = 0; test_cases[i].json_in; i++) {
>>>>> +        json_in = test_cases[i].json_in;
>>>>> +        utf8_out = test_cases[i].utf8_out;
>>>>> +        utf8_in = test_cases[i].utf8_in ?: test_cases[i].utf8_out;
>>>>> +        json_out = test_cases[i].json_out ?: test_cases[i].json_in;
>>>>> +
>>>>> +        obj = qobject_from_json(json_in);
>>>>> +        if (utf8_out) {
>>>>> +            g_assert(obj);
>>>>> +            g_assert(qobject_type(obj) == QTYPE_QSTRING);
>>>>> +            str = qobject_to_qstring(obj);
>>>>> +            g_assert_cmpstr(qstring_get_str(str), ==, utf8_out);
>>>>> +        } else {
>>>>> +            g_assert(!obj);
>>>>> +        }
>>>>> +        qobject_decref(obj);
>>>>> +
>>>>> +        obj = QOBJECT(qstring_from_str(utf8_in));
>>>>> +        str = qobject_to_json(obj);
>>>>> +        if (json_out) {
>>>>> +            g_assert(str);
>>>>> +            g_assert_cmpstr(qstring_get_str(str), ==, json_out);
>>>>
>>>> This assertion trips on an ARM host (Debian stable):
>>>>
>>>> GTESTER tests/check-qjson
>>>> **
>>>> ERROR:/src/qemu/tests/check-qjson.c:775:utf8_string: assertion failed
>>>> (qstring_get_str(str) == json_out): ("\"\\u0400\200\"" ==
>>>> "\"\\u0400\\uFFFF\"")
>>>> GTester: last random seed: R02S88b76f755e809e9024832d2ab6660afd
>>>
>>> Must be case 2.1.4, because that's where json_out is
>>> "\"\\u0400\\uFFFF\"".
>>>
>>> We start by passing C string "\"\xF0\x90\x80\x80\"" to the JSON parser
>>> qobject_from_json().  Yields "\xF0\x90\x80\x80", as expected.
>>>
>>> We then pass that to qobject_to_json().  Should yield
>>> "\"\\uD800\\uDC00\"" (surrogate pair).  Does yield "\"\\u0400\\uFFFF\""
>>> on my machine (known bug), and "\"\\u0400\200\"" on yours.
>>>
>>> Looks like the JSON formatter is not just broken (we knew that already),
>>> it's broken in machine-dependent ways.  Good to know, thanks for
>>> reporting.
>>>
>>> Obvious ways to get "make check" pass for you again *now*:
>>>
>>> * Disable check-qjson.  That's too big a hammer for me.
>>>
>>> * Disable test case 2.1.4 with a comment explaining why.
>>>
>>> * Suitable #ifdeffery around the expected value.
>>>
>>> Preferences?
>>
>> * Fix JSON formatter :-)
>
> I want that too, but I'm afraid we can't have it *now* :)
>
>> Disabling 2.1.4 only reveals the next problem:
>> GTESTER tests/check-qjson
>> GTester: last random seed: R02S6754f3523201dc81bb21de42e2ba843c
>> **
>> ERROR:/src/qemu/tests/check-qjson.c:777:utf8_string: assertion failed
>> (qstring_get_str(str) == json_out): ("\"\\u8200\200\200\"" ==
>> "\"\\u8200\\uFFFF\\uFFFF\"")
>
> All right, I give up.  I can't fix to_json() tonight (I have maybe 30
> minutes of usable brain left), but I can make it portably wrong.  Please
> try the appended patch.

GTESTER tests/check-qjson
GTester: last random seed: R02Scc9b8a0a880770aaee720c8f98cc953d
**
ERROR:/src/qemu/tests/check-qjson.c:775:utf8_string: assertion failed
(qstring_get_str(str) == json_out): ("\"\\u0400\200\"" ==
"\"\\u0400\\uFFFF\"")

>
>
> diff --git a/qobject/qjson.c b/qobject/qjson.c
> index 83a6b4f..195da1f 100644
> --- a/qobject/qjson.c
> +++ b/qobject/qjson.c
> @@ -187,7 +187,21 @@ static void to_json(const QObject *obj, QString *str, int pretty, int indent)
>                  default: {
>                      if (ptr[0] <= 0x1F) {
>                          char escape[7];
> -                        snprintf(escape, sizeof(escape), "\\u%04X", ptr[0]);
> +                        /*
> +                         * Portability band-aid
> +                         *
> +                         * We used to print ptr[0] here, but when
> +                         * plain char is signed, that prints \uFFFF
> +                         * for negative values.  The code here is crap
> +                         * (see utf8_string() in tests/check-qjson.c),
> +                         * and needs to be fixed.  Can't do that right
> +                         * now, and don't want to go from wrong to
> +                         * differently wrong, so I make the wrong we
> +                         * now get on the most common machines the
> +                         * wrong we get on all machines.
> +                         */
> +                        snprintf(escape, sizeof(escape), "\\u%04X",
> +                                 *(signed char *)ptr);
>                          qstring_append(str, escape);
>                      } else {
>                          char buf[2] = { ptr[0], 0 };
Markus Armbruster - March 1, 2013, 4:07 p.m.
Blue Swirl <blauwirbel@gmail.com> writes:

> On Thu, Feb 28, 2013 at 7:42 PM, Markus Armbruster <armbru@redhat.com> wrote:
>> Blue Swirl <blauwirbel@gmail.com> writes:
>>
>>> On Thu, Feb 28, 2013 at 8:14 AM, Markus Armbruster <armbru@redhat.com> wrote:
[...]
>>>> Looks like the JSON formatter is not just broken (we knew that already),
>>>> it's broken in machine-dependent ways.  Good to know, thanks for
>>>> reporting.
>>>>
>>>> Obvious ways to get "make check" pass for you again *now*:
>>>>
>>>> * Disable check-qjson.  That's too big a hammer for me.
>>>>
>>>> * Disable test case 2.1.4 with a comment explaining why.
>>>>
>>>> * Suitable #ifdeffery around the expected value.
>>>>
>>>> Preferences?
>>>
>>> * Fix JSON formatter :-)
>>
>> I want that too, but I'm afraid we can't have it *now* :)
>>
>>> Disabling 2.1.4 only reveals the next problem:
>>> GTESTER tests/check-qjson
>>> GTester: last random seed: R02S6754f3523201dc81bb21de42e2ba843c
>>> **
>>> ERROR:/src/qemu/tests/check-qjson.c:777:utf8_string: assertion failed
>>> (qstring_get_str(str) == json_out): ("\"\\u8200\200\200\"" ==
>>> "\"\\u8200\\uFFFF\\uFFFF\"")
>>
>> All right, I give up.  I can't fix to_json() tonight (I have maybe 30
>> minutes of usable brain left), but I can make it portably wrong.  Please
>> try the appended patch.
>
> GTESTER tests/check-qjson
> GTester: last random seed: R02Scc9b8a0a880770aaee720c8f98cc953d
> **
> ERROR:/src/qemu/tests/check-qjson.c:775:utf8_string: assertion failed
> (qstring_get_str(str) == json_out): ("\"\\u0400\200\"" ==
> "\"\\u0400\\uFFFF\"")

Working on a real fix, please be patient.

Patch

diff --git a/qobject/qjson.c b/qobject/qjson.c
index 83a6b4f..195da1f 100644
--- a/qobject/qjson.c
+++ b/qobject/qjson.c
@@ -187,7 +187,21 @@  static void to_json(const QObject *obj, QString *str, int pretty, int indent)
                 default: {
                     if (ptr[0] <= 0x1F) {
                         char escape[7];
-                        snprintf(escape, sizeof(escape), "\\u%04X", ptr[0]);
+                        /*
+                         * Portability band-aid
+                         *
+                         * We used to print ptr[0] here, but when
+                         * plain char is signed, that prints \uFFFF
+                         * for negative values.  The code here is crap
+                         * (see utf8_string() in tests/check-qjson.c),
+                         * and needs to be fixed.  Can't do that right
+                         * now, and don't want to go from wrong to
+                         * differently wrong, so I make the wrong we
+                         * now get on the most common machines the
+                         * wrong we get on all machines.
+                         */
+                        snprintf(escape, sizeof(escape), "\\u%04X",
+                                 *(signed char *)ptr);
                         qstring_append(str, escape);
                     } else {
                         char buf[2] = { ptr[0], 0 };