Message ID | 87ip5cw7gl.fsf@blackfin.pond.sub.org |
---|---|
State | New |
Headers | show |
On Thu, Feb 28, 2013 at 7:42 PM, Markus Armbruster <armbru@redhat.com> wrote: > Blue Swirl <blauwirbel@gmail.com> writes: > >> On Thu, Feb 28, 2013 at 8:14 AM, Markus Armbruster <armbru@redhat.com> wrote: >>> Blue Swirl <blauwirbel@gmail.com> writes: >>> >>>> On Mon, Feb 4, 2013 at 5:19 PM, Markus Armbruster <armbru@redhat.com> wrote: >>>>> Test cases are scraped from Markus Kuhn's UTF-8 decoder capability and >>>>> stress test at >>>>> http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt >>>>> >>>>> Unfortunately, both JSON parser and formatter misbehave right now. >>>>> This test expects current, incorrect results. They're all clearly >>>>> marked, and are to be replaced by correct ones as the bugs get fixed. >>>>> See comments in new utf8_string() for details. >>>>> >>>>> Signed-off-by: Markus Armbruster <armbru@redhat.com> >>>>> --- >>>>> tests/check-qjson.c | 625 ++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>> 1 file changed, 625 insertions(+) >>>>> >>>>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c >>>>> index 32ffb43..4590b3a 100644 >>>>> --- a/tests/check-qjson.c >>>>> +++ b/tests/check-qjson.c >>>>> @@ -1,8 +1,10 @@ >>>>> /* >>>>> * Copyright IBM, Corp. 2009 >>>>> + * Copyright (c) 2013 Red Hat Inc. >>>>> * >>>>> * Authors: >>>>> * Anthony Liguori <aliguori@us.ibm.com> >>>>> + * Markus Armbruster <armbru@redhat.com>, >>>>> * >>>>> * This work is licensed under the terms of the GNU LGPL, version 2.1 or later. >>>>> * See the COPYING.LIB file in the top-level directory. >>>>> @@ -131,6 +133,628 @@ static void single_quote_string(void) >>>>> } >>>>> } >>>>> >>>>> +static void utf8_string(void) >>>>> +{ >>>>> + /* >>>>> + * FIXME Current behavior for invalid UTF-8 sequences is >>>>> + * incorrect. This test expects current, incorrect results. >>>>> + * They're all marked "bug:" below, and are to be replaced by >>>>> + * correct ones as the bugs get fixed. >>>>> + * >>>>> + * The JSON parser rejects some invalid sequences, but accepts >>>>> + * others without correcting the problem. >>>>> + * >>>>> + * The JSON formatter replaces some invalid sequences by U+FFFF (a >>>>> + * noncharacter), and goes wonky for others. >>>>> + * >>>>> + * For both directions, we should either reject all invalid >>>>> + * sequences, or minimize overlong sequences and replace all other >>>>> + * invalid sequences by a suitable replacement character. A >>>>> + * common choice for replacement is U+FFFD. >>>>> + * >>>>> + * Problem: we can't easily deal with embedded U+0000. Parsing >>>>> + * the JSON string "this \\u0000" is fun" yields "this \0 is fun", >>>>> + * which gets misinterpreted as NUL-terminated "this ". We should >>>>> + * consider using overlong encoding \xC0\x80 for U+0000 ("modified >>>>> + * UTF-8"). >>>>> + * >>>>> + * Tests are scraped from Markus Kuhn's UTF-8 decoder capability >>>>> + * and stress test at >>>>> + * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt >>>>> + */ >>>>> + static const struct { >>>>> + const char *json_in; >>>>> + const char *utf8_out; >>>>> + const char *json_out; /* defaults to @json_in */ >>>>> + const char *utf8_in; /* defaults to @utf8_out */ >>>>> + } test_cases[] = { >>>>> + /* >>>>> + * Bug markers used here: >>>>> + * - bug: not corrected >>>>> + * JSON parser fails to correct invalid sequence(s) >>>>> + * - bug: rejected >>>>> + * JSON parser rejects invalid sequence(s) >>>>> + * We may choose to define this as feature >>>>> + * - bug: want "\"...\"" >>>>> + * JSON formatter produces incorrect result, this is the >>>>> + * correct one, assuming replacement character U+FFFF >>>>> + * - bug: want "..." (no \") >>>>> + * JSON parser produces incorrect result, this is the >>>>> + * correct one. >>>>> + * Not marked explicitly, but trivial to find: >>>>> + * - JSON formatter replacing invalid sequence by \\uFFFF is a >>>>> + * bug if we want it to fail for invalid sequences. >>>>> + */ >>> [...] >>>>> + /* 2.1.4 4 bytes U+10000 */ >>>>> + { >>>>> + "\"\xF0\x90\x80\x80\"", >>>>> + "\xF0\x90\x80\x80", >>>>> + "\"\\u0400\\uFFFF\"", /* bug: want "\"\\uD800\\uDC00\"" */ >>>>> + }, >>> [...] >>>>> + {} >>>>> + }; >>>>> + int i; >>>>> + QObject *obj; >>>>> + QString *str; >>>>> + const char *json_in, *utf8_out, *utf8_in, *json_out; >>>>> + >>>>> + for (i = 0; test_cases[i].json_in; i++) { >>>>> + json_in = test_cases[i].json_in; >>>>> + utf8_out = test_cases[i].utf8_out; >>>>> + utf8_in = test_cases[i].utf8_in ?: test_cases[i].utf8_out; >>>>> + json_out = test_cases[i].json_out ?: test_cases[i].json_in; >>>>> + >>>>> + obj = qobject_from_json(json_in); >>>>> + if (utf8_out) { >>>>> + g_assert(obj); >>>>> + g_assert(qobject_type(obj) == QTYPE_QSTRING); >>>>> + str = qobject_to_qstring(obj); >>>>> + g_assert_cmpstr(qstring_get_str(str), ==, utf8_out); >>>>> + } else { >>>>> + g_assert(!obj); >>>>> + } >>>>> + qobject_decref(obj); >>>>> + >>>>> + obj = QOBJECT(qstring_from_str(utf8_in)); >>>>> + str = qobject_to_json(obj); >>>>> + if (json_out) { >>>>> + g_assert(str); >>>>> + g_assert_cmpstr(qstring_get_str(str), ==, json_out); >>>> >>>> This assertion trips on an ARM host (Debian stable): >>>> >>>> GTESTER tests/check-qjson >>>> ** >>>> ERROR:/src/qemu/tests/check-qjson.c:775:utf8_string: assertion failed >>>> (qstring_get_str(str) == json_out): ("\"\\u0400\200\"" == >>>> "\"\\u0400\\uFFFF\"") >>>> GTester: last random seed: R02S88b76f755e809e9024832d2ab6660afd >>> >>> Must be case 2.1.4, because that's where json_out is >>> "\"\\u0400\\uFFFF\"". >>> >>> We start by passing C string "\"\xF0\x90\x80\x80\"" to the JSON parser >>> qobject_from_json(). Yields "\xF0\x90\x80\x80", as expected. >>> >>> We then pass that to qobject_to_json(). Should yield >>> "\"\\uD800\\uDC00\"" (surrogate pair). Does yield "\"\\u0400\\uFFFF\"" >>> on my machine (known bug), and "\"\\u0400\200\"" on yours. >>> >>> Looks like the JSON formatter is not just broken (we knew that already), >>> it's broken in machine-dependent ways. Good to know, thanks for >>> reporting. >>> >>> Obvious ways to get "make check" pass for you again *now*: >>> >>> * Disable check-qjson. That's too big a hammer for me. >>> >>> * Disable test case 2.1.4 with a comment explaining why. >>> >>> * Suitable #ifdeffery around the expected value. >>> >>> Preferences? >> >> * Fix JSON formatter :-) > > I want that too, but I'm afraid we can't have it *now* :) > >> Disabling 2.1.4 only reveals the next problem: >> GTESTER tests/check-qjson >> GTester: last random seed: R02S6754f3523201dc81bb21de42e2ba843c >> ** >> ERROR:/src/qemu/tests/check-qjson.c:777:utf8_string: assertion failed >> (qstring_get_str(str) == json_out): ("\"\\u8200\200\200\"" == >> "\"\\u8200\\uFFFF\\uFFFF\"") > > All right, I give up. I can't fix to_json() tonight (I have maybe 30 > minutes of usable brain left), but I can make it portably wrong. Please > try the appended patch. GTESTER tests/check-qjson GTester: last random seed: R02Scc9b8a0a880770aaee720c8f98cc953d ** ERROR:/src/qemu/tests/check-qjson.c:775:utf8_string: assertion failed (qstring_get_str(str) == json_out): ("\"\\u0400\200\"" == "\"\\u0400\\uFFFF\"") > > > diff --git a/qobject/qjson.c b/qobject/qjson.c > index 83a6b4f..195da1f 100644 > --- a/qobject/qjson.c > +++ b/qobject/qjson.c > @@ -187,7 +187,21 @@ static void to_json(const QObject *obj, QString *str, int pretty, int indent) > default: { > if (ptr[0] <= 0x1F) { > char escape[7]; > - snprintf(escape, sizeof(escape), "\\u%04X", ptr[0]); > + /* > + * Portability band-aid > + * > + * We used to print ptr[0] here, but when > + * plain char is signed, that prints \uFFFF > + * for negative values. The code here is crap > + * (see utf8_string() in tests/check-qjson.c), > + * and needs to be fixed. Can't do that right > + * now, and don't want to go from wrong to > + * differently wrong, so I make the wrong we > + * now get on the most common machines the > + * wrong we get on all machines. > + */ > + snprintf(escape, sizeof(escape), "\\u%04X", > + *(signed char *)ptr); > qstring_append(str, escape); > } else { > char buf[2] = { ptr[0], 0 };
Blue Swirl <blauwirbel@gmail.com> writes: > On Thu, Feb 28, 2013 at 7:42 PM, Markus Armbruster <armbru@redhat.com> wrote: >> Blue Swirl <blauwirbel@gmail.com> writes: >> >>> On Thu, Feb 28, 2013 at 8:14 AM, Markus Armbruster <armbru@redhat.com> wrote: [...] >>>> Looks like the JSON formatter is not just broken (we knew that already), >>>> it's broken in machine-dependent ways. Good to know, thanks for >>>> reporting. >>>> >>>> Obvious ways to get "make check" pass for you again *now*: >>>> >>>> * Disable check-qjson. That's too big a hammer for me. >>>> >>>> * Disable test case 2.1.4 with a comment explaining why. >>>> >>>> * Suitable #ifdeffery around the expected value. >>>> >>>> Preferences? >>> >>> * Fix JSON formatter :-) >> >> I want that too, but I'm afraid we can't have it *now* :) >> >>> Disabling 2.1.4 only reveals the next problem: >>> GTESTER tests/check-qjson >>> GTester: last random seed: R02S6754f3523201dc81bb21de42e2ba843c >>> ** >>> ERROR:/src/qemu/tests/check-qjson.c:777:utf8_string: assertion failed >>> (qstring_get_str(str) == json_out): ("\"\\u8200\200\200\"" == >>> "\"\\u8200\\uFFFF\\uFFFF\"") >> >> All right, I give up. I can't fix to_json() tonight (I have maybe 30 >> minutes of usable brain left), but I can make it portably wrong. Please >> try the appended patch. > > GTESTER tests/check-qjson > GTester: last random seed: R02Scc9b8a0a880770aaee720c8f98cc953d > ** > ERROR:/src/qemu/tests/check-qjson.c:775:utf8_string: assertion failed > (qstring_get_str(str) == json_out): ("\"\\u0400\200\"" == > "\"\\u0400\\uFFFF\"") Working on a real fix, please be patient.
diff --git a/qobject/qjson.c b/qobject/qjson.c index 83a6b4f..195da1f 100644 --- a/qobject/qjson.c +++ b/qobject/qjson.c @@ -187,7 +187,21 @@ static void to_json(const QObject *obj, QString *str, int pretty, int indent) default: { if (ptr[0] <= 0x1F) { char escape[7]; - snprintf(escape, sizeof(escape), "\\u%04X", ptr[0]); + /* + * Portability band-aid + * + * We used to print ptr[0] here, but when + * plain char is signed, that prints \uFFFF + * for negative values. The code here is crap + * (see utf8_string() in tests/check-qjson.c), + * and needs to be fixed. Can't do that right + * now, and don't want to go from wrong to + * differently wrong, so I make the wrong we + * now get on the most common machines the + * wrong we get on all machines. + */ + snprintf(escape, sizeof(escape), "\\u%04X", + *(signed char *)ptr); qstring_append(str, escape); } else { char buf[2] = { ptr[0], 0 };