diff mbox

[3/4] check-qjson: Test noncharacters other than U+FFFE, U+FFFF in strings

Message ID 1363283360-26220-4-git-send-email-armbru@redhat.com
State New
Headers show

Commit Message

Markus Armbruster March 14, 2013, 5:49 p.m. UTC
These are all broken, too.

A few test cases use noncharacters U+FFFF and U+10FFFF.  Risks testing
noncharacters some more instead of what they're supposed to test.  Use
U+FFFD and U+10FFFD instead.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 tests/check-qjson.c | 85 +++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 72 insertions(+), 13 deletions(-)

Comments

Laszlo Ersek March 21, 2013, 8:22 p.m. UTC | #1
On 03/14/13 18:49, Markus Armbruster wrote:
> These are all broken, too.

What are "these"? And how are they broken? And how does the patch fix them?

> 
> A few test cases use noncharacters U+FFFF and U+10FFFF.  Risks testing
> noncharacters some more instead of what they're supposed to test.  Use
> U+FFFD and U+10FFFD instead.
> 
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> ---
>  tests/check-qjson.c | 85 +++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 72 insertions(+), 13 deletions(-)

I'm confused about the commit message. There are three paragraphs in it
(the title, the first paragraph, and the 2nd paragraph). This patch
modifies different tests:

> diff --git a/tests/check-qjson.c b/tests/check-qjson.c
> index 852124a..efec1b2 100644
> --- a/tests/check-qjson.c
> +++ b/tests/check-qjson.c
> @@ -158,7 +158,7 @@ static void utf8_string(void)
>       * consider using overlong encoding \xC0\x80 for U+0000 ("modified
>       * UTF-8").
>       *
> -     * Test cases are scraped from Markus Kuhn's UTF-8 decoder
> +     * Most test cases are scraped from Markus Kuhn's UTF-8 decoder
>       * capability and stress test at
>       * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>       */
> @@ -256,11 +256,11 @@ static void utf8_string(void)
>              "\xDF\xBF",
>              "\"\\u07FF\"",
>          },
> -        /* 2.2.3  3 bytes U+FFFF */
> +        /* 2.2.3  3 bytes U+FFFD */
>          {
> -            "\"\xEF\xBF\xBF\"",
> -            "\xEF\xBF\xBF",
> -            "\"\\uFFFF\"",
> +            "\"\xEF\xBF\xBD\"",
> +            "\xEF\xBF\xBD",
> +            "\"\\uFFFD\"",
>          },

This is under "2.2  Last possible sequence of a certain length". I guess
this is where you say "last possible sequence of a certain length,
encoding a character (= non-noncharacter)". OK, p#2.


>          /* 2.2.4  4 bytes U+1FFFFF */
>          {
> @@ -303,10 +303,10 @@ static void utf8_string(void)
>              "\"\\uFFFD\"",
>          },
>          {
> -            /* U+10FFFF */
> -            "\"\xF4\x8F\xBF\xBF\"",
> -            "\xF4\x8F\xBF\xBF",
> -            "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFF\"" */
> +            /* U+10FFFD */
> +            "\"\xF4\x8F\xBF\xBD\"",
> +            "\xF4\x8F\xBF\xBD",
> +            "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFD\"" */
>          },
>          {
>              /* U+110000 */

Under "2.3  Other boundary conditions". Not a non-character any longer,
but also not a boundary condition. At least not the original one. Still
covered by the ...FFFD part of the commit message, p#2.


> @@ -584,9 +584,9 @@ static void utf8_string(void)
>              "\"\\u07FF\"",
>          },
>          {
> -            /* \U+FFFF */
> -            "\"\xF0\x8F\xBF\xBF\"",
> -            "\xF0\x8F\xBF\xBF",   /* bug: not corrected */
> +            /* \U+FFFD */
> +            "\"\xF0\x8F\xBF\xBD\"",
> +            "\xF0\x8F\xBF\xBD",   /* bug: not corrected */
>              "\"\\u03FF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
>          },
>          {

Under "4.2  Maximum overlong sequences". What does that even mean? "In
some sense maximum codepoints, all represented as overlong sequences"? P#2.

> @@ -731,6 +731,7 @@ static void utf8_string(void)
>              "\"\\uDBFF\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */
>          },
>          /* 5.3  Other illegal code positions */
> +        /* BMP noncharacters */
>          {
>              /* \U+FFFE */
>              "\"\xEF\xBF\xBE\"",
> @@ -741,7 +742,65 @@ static void utf8_string(void)
>              /* \U+FFFF */
>              "\"\xEF\xBF\xBF\"",
>              "\xEF\xBF\xBF",     /* bug: not corrected */
> -            "\"\\uFFFF\"",      /* bug: not corrected */
> +            "\"\\uFFFF\"",
> +        },
> +        {
> +            /* U+FDD0 */
> +            "\"\xEF\xB7\x90\"",
> +            "\xEF\xB7\x90",     /* bug: not corrected */
> +            "\"\\uFDD0\"",      /* bug: not corrected */
> +        },
> +        {
> +            /* U+FDEF */
> +            "\"\xEF\xB7\xAF\"",
> +            "\xEF\xB7\xAF",     /* bug: not corrected */
> +            "\"\\uFDEF\"",      /* bug: not corrected */
> +        },
> +        /* Plane 1 .. 16 noncharacters */
> +        {
> +            /* U+1FFFE U+1FFFF U+2FFFE U+2FFFF ... U+10FFFE U+10FFFF */
> +            "\"\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF"
> +            "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF"
> +            "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF"
> +            "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF"
> +            "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF"
> +            "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF"
> +            "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF"
> +            "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF"
> +            "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF"
> +            "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF"
> +            "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF"
> +            "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF"
> +            "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF"
> +            "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF"
> +            "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF"
> +            "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF\"",
> +            /* bug: not corrected */
> +            "\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF"
> +            "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF"
> +            "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF"
> +            "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF"
> +            "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF"
> +            "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF"
> +            "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF"
> +            "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF"
> +            "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF"
> +            "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF"
> +            "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF"
> +            "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF"
> +            "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF"
> +            "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF"
> +            "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF"
> +            "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF",
> +            /* bug: not corrected */
> +            "\"\\u07FF\\uFFFF\\u07FF\\uFFFF\\u0BFF\\uFFFF\\u0BFF\\uFFFF"
> +            "\\u0FFF\\uFFFF\\u0FFF\\uFFFF\\u13FF\\uFFFF\\u13FF\\uFFFF"
> +            "\\u17FF\\uFFFF\\u17FF\\uFFFF\\u1BFF\\uFFFF\\u1BFF\\uFFFF"
> +            "\\u1FFF\\uFFFF\\u1FFF\\uFFFF\\u23FF\\uFFFF\\u23FF\\uFFFF"
> +            "\\u27FF\\uFFFF\\u27FF\\uFFFF\\u2BFF\\uFFFF\\u2BFF\\uFFFF"
> +            "\\u2FFF\\uFFFF\\u2FFF\\uFFFF\\u33FF\\uFFFF\\u33FF\\uFFFF"
> +            "\\u37FF\\uFFFF\\u37FF\\uFFFF\\u3BFF\\uFFFF\\u3BFF\\uFFFF"
> +            "\\u3FFF\\uFFFF\\u3FFF\\uFFFF\\u43FF\\uFFFF\\u43FF\\uFFFF\"",
>          },
>          {}
>      };
> 

This is probably p#0 (the title).

Ah. Have you removed the noncharacters from the other tests, but made up
for them at the end with new noncharacter tests?

Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Markus Armbruster March 22, 2013, 2:37 p.m. UTC | #2
Laszlo Ersek <lersek@redhat.com> writes:

> On 03/14/13 18:49, Markus Armbruster wrote:
>> These are all broken, too.
>
> What are "these"? And how are they broken? And how does the patch fix them?

"These" refers to the subject: noncharacters other than U+FFFE, U+FFFF.

I agree that I should better explain how they're broken, and what the
patch does to fix them.  Will fix on respin.

>> 
>> A few test cases use noncharacters U+FFFF and U+10FFFF.  Risks testing
>> noncharacters some more instead of what they're supposed to test.  Use
>> U+FFFD and U+10FFFD instead.
>> 
>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> ---
>>  tests/check-qjson.c | 85
>> +++++++++++++++++++++++++++++++++++++++++++++--------
>>  1 file changed, 72 insertions(+), 13 deletions(-)
>
> I'm confused about the commit message. There are three paragraphs in it
> (the title, the first paragraph, and the 2nd paragraph). This patch
> modifies different tests:
>
>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c
>> index 852124a..efec1b2 100644
>> --- a/tests/check-qjson.c
>> +++ b/tests/check-qjson.c
>> @@ -158,7 +158,7 @@ static void utf8_string(void)
>>       * consider using overlong encoding \xC0\x80 for U+0000 ("modified
>>       * UTF-8").
>>       *
>> -     * Test cases are scraped from Markus Kuhn's UTF-8 decoder
>> +     * Most test cases are scraped from Markus Kuhn's UTF-8 decoder
>>       * capability and stress test at
>>       * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>>       */
>> @@ -256,11 +256,11 @@ static void utf8_string(void)
>>              "\xDF\xBF",
>>              "\"\\u07FF\"",
>>          },
>> -        /* 2.2.3  3 bytes U+FFFF */
>> +        /* 2.2.3  3 bytes U+FFFD */
>>          {
>> -            "\"\xEF\xBF\xBF\"",
>> -            "\xEF\xBF\xBF",
>> -            "\"\\uFFFF\"",
>> +            "\"\xEF\xBF\xBD\"",
>> +            "\xEF\xBF\xBD",
>> +            "\"\\uFFFD\"",
>>          },
>
> This is under "2.2  Last possible sequence of a certain length". I guess

Which is in turn under "2  Boundary condition test cases".

> this is where you say "last possible sequence of a certain length,
> encoding a character (= non-noncharacter)". OK, p#2.

Yes.

The test's purpose is testing the upper bound of 3-byte sequences is
decoded correctly.

The upper bound is U+FFFF.  Since that's a noncharacter, the parser
should reject it (or maybe replace), the formatter should replace it.
Trouble is it could be misdecoded and then rejected / replaced.

Besides, U+FFFF already gets tested along with the other noncharacters
under "5.3  Other illegal code positions".

Next in line is U+FFFE, also a noncharacter, also under 5.3.

Next in line is U+FFFD, which I picked.

But that gets tested under "2.3  Other boundary conditions"!  I guess I
either drop it there, or make this one U+FFFC.

I think testing U+FFFC here makes sense, because U+FFFD could be
misdecoded, then replaced by U+FFFD.

What do you think?

>>          /* 2.2.4  4 bytes U+1FFFFF */
>>          {
>> @@ -303,10 +303,10 @@ static void utf8_string(void)
>>              "\"\\uFFFD\"",
>>          },
>>          {
>> -            /* U+10FFFF */
>> -            "\"\xF4\x8F\xBF\xBF\"",
>> -            "\xF4\x8F\xBF\xBF",
>> -            "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFF\"" */
>> +            /* U+10FFFD */
>> +            "\"\xF4\x8F\xBF\xBD\"",
>> +            "\xF4\x8F\xBF\xBD",
>> +            "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFD\"" */
>>          },
>>          {
>>              /* U+110000 */
>
> Under "2.3  Other boundary conditions". Not a non-character any longer,
> but also not a boundary condition. At least not the original one. Still
> covered by the ...FFFD part of the commit message, p#2.

The test's purpose is testing the upper bound of the Unicode range gets
decoded correctly.

The upper bound is U+10FFFF.  Since that's a noncharacter... same
argument as above.

Next in line is U+10FFFE, also a noncharacter.

Next in line is U+10FFFD, which I picked.

>> @@ -584,9 +584,9 @@ static void utf8_string(void)
>>              "\"\\u07FF\"",
>>          },
>>          {
>> -            /* \U+FFFF */
>> -            "\"\xF0\x8F\xBF\xBF\"",
>> -            "\xF0\x8F\xBF\xBF",   /* bug: not corrected */
>> +            /* \U+FFFD */
>> +            "\"\xF0\x8F\xBF\xBD\"",
>> +            "\xF0\x8F\xBF\xBD",   /* bug: not corrected */
>>              "\"\\u03FF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
>>          },
>>          {
>
> Under "4.2  Maximum overlong sequences". What does that even mean? "In
> some sense maximum codepoints, all represented as overlong sequences"? P#2.

The headings are all stolen from the same source as the test cases.
Perhaps I should steal more of the explanatory text there as well.

The one for 4.2 is:

    Below you see the highest Unicode value that is still resulting in
    an overlong sequence if represented with the given number of bytes.
    This is a boundary test for safe UTF-8 decoders.  All five
    characters should be rejected like malformed UTF-8 sequences.

>> @@ -731,6 +731,7 @@ static void utf8_string(void)
>>              "\"\\uDBFF\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */
>>          },
>>          /* 5.3  Other illegal code positions */
>> +        /* BMP noncharacters */
>>          {
>>              /* \U+FFFE */
>>              "\"\xEF\xBF\xBE\"",
>> @@ -741,7 +742,65 @@ static void utf8_string(void)
>>              /* \U+FFFF */
>>              "\"\xEF\xBF\xBF\"",
>>              "\xEF\xBF\xBF",     /* bug: not corrected */
>> -            "\"\\uFFFF\"",      /* bug: not corrected */
>> +            "\"\\uFFFF\"",
>> +        },
>> +        {
>> +            /* U+FDD0 */
>> +            "\"\xEF\xB7\x90\"",
>> +            "\xEF\xB7\x90",     /* bug: not corrected */
>> +            "\"\\uFDD0\"",      /* bug: not corrected */
>> +        },
>> +        {
>> +            /* U+FDEF */
>> +            "\"\xEF\xB7\xAF\"",
>> +            "\xEF\xB7\xAF",     /* bug: not corrected */
>> +            "\"\\uFDEF\"",      /* bug: not corrected */
>> +        },
>> +        /* Plane 1 .. 16 noncharacters */
>> +        {
>> +            /* U+1FFFE U+1FFFF U+2FFFE U+2FFFF ... U+10FFFE U+10FFFF */
>> +            "\"\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF"
>> +            "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF"
>> +            "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF"
>> +            "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF"
>> +            "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF"
>> +            "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF"
>> +            "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF"
>> +            "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF"
>> +            "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF"
>> +            "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF"
>> +            "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF"
>> +            "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF"
>> +            "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF"
>> +            "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF"
>> +            "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF"
>> +            "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF\"",
>> +            /* bug: not corrected */
>> +            "\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF"
>> +            "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF"
>> +            "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF"
>> +            "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF"
>> +            "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF"
>> +            "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF"
>> +            "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF"
>> +            "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF"
>> +            "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF"
>> +            "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF"
>> +            "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF"
>> +            "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF"
>> +            "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF"
>> +            "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF"
>> +            "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF"
>> +            "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF",
>> +            /* bug: not corrected */
>> +            "\"\\u07FF\\uFFFF\\u07FF\\uFFFF\\u0BFF\\uFFFF\\u0BFF\\uFFFF"
>> +            "\\u0FFF\\uFFFF\\u0FFF\\uFFFF\\u13FF\\uFFFF\\u13FF\\uFFFF"
>> +            "\\u17FF\\uFFFF\\u17FF\\uFFFF\\u1BFF\\uFFFF\\u1BFF\\uFFFF"
>> +            "\\u1FFF\\uFFFF\\u1FFF\\uFFFF\\u23FF\\uFFFF\\u23FF\\uFFFF"
>> +            "\\u27FF\\uFFFF\\u27FF\\uFFFF\\u2BFF\\uFFFF\\u2BFF\\uFFFF"
>> +            "\\u2FFF\\uFFFF\\u2FFF\\uFFFF\\u33FF\\uFFFF\\u33FF\\uFFFF"
>> +            "\\u37FF\\uFFFF\\u37FF\\uFFFF\\u3BFF\\uFFFF\\u3BFF\\uFFFF"
>> +            "\\u3FFF\\uFFFF\\u3FFF\\uFFFF\\u43FF\\uFFFF\\u43FF\\uFFFF\"",
>>          },
>>          {}
>>      };
>> 
>
> This is probably p#0 (the title).
>
> Ah. Have you removed the noncharacters from the other tests, but made up
> for them at the end with new noncharacter tests?

I added tests to cover all 66 noncharacters.  Then I noticed some
duplicate test cases elsewhere, and realized that these don't really
fully test what I want to test there.

> Reviewed-by: Laszlo Ersek <lersek@redhat.com>

Thanks!
Laszlo Ersek March 22, 2013, 2:52 p.m. UTC | #3
On 03/22/13 15:37, Markus Armbruster wrote:
> Laszlo Ersek <lersek@redhat.com> writes:
> 
>> On 03/14/13 18:49, Markus Armbruster wrote:
>>> These are all broken, too.
>>
>> What are "these"? And how are they broken? And how does the patch fix them?
> 
> "These" refers to the subject: noncharacters other than U+FFFE, U+FFFF.
> 
> I agree that I should better explain how they're broken, and what the
> patch does to fix them.  Will fix on respin.
> 
>>>
>>> A few test cases use noncharacters U+FFFF and U+10FFFF.  Risks testing
>>> noncharacters some more instead of what they're supposed to test.  Use
>>> U+FFFD and U+10FFFD instead.
>>>
>>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>>> ---
>>>  tests/check-qjson.c | 85
>>> +++++++++++++++++++++++++++++++++++++++++++++--------
>>>  1 file changed, 72 insertions(+), 13 deletions(-)
>>
>> I'm confused about the commit message. There are three paragraphs in it
>> (the title, the first paragraph, and the 2nd paragraph). This patch
>> modifies different tests:
>>
>>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c
>>> index 852124a..efec1b2 100644
>>> --- a/tests/check-qjson.c
>>> +++ b/tests/check-qjson.c
>>> @@ -158,7 +158,7 @@ static void utf8_string(void)
>>>       * consider using overlong encoding \xC0\x80 for U+0000 ("modified
>>>       * UTF-8").
>>>       *
>>> -     * Test cases are scraped from Markus Kuhn's UTF-8 decoder
>>> +     * Most test cases are scraped from Markus Kuhn's UTF-8 decoder
>>>       * capability and stress test at
>>>       * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>>>       */
>>> @@ -256,11 +256,11 @@ static void utf8_string(void)
>>>              "\xDF\xBF",
>>>              "\"\\u07FF\"",
>>>          },
>>> -        /* 2.2.3  3 bytes U+FFFF */
>>> +        /* 2.2.3  3 bytes U+FFFD */
>>>          {
>>> -            "\"\xEF\xBF\xBF\"",
>>> -            "\xEF\xBF\xBF",
>>> -            "\"\\uFFFF\"",
>>> +            "\"\xEF\xBF\xBD\"",
>>> +            "\xEF\xBF\xBD",
>>> +            "\"\\uFFFD\"",
>>>          },
>>
>> This is under "2.2  Last possible sequence of a certain length". I guess
> 
> Which is in turn under "2  Boundary condition test cases".
> 
>> this is where you say "last possible sequence of a certain length,
>> encoding a character (= non-noncharacter)". OK, p#2.
> 
> Yes.
> 
> The test's purpose is testing the upper bound of 3-byte sequences is
> decoded correctly.
> 
> The upper bound is U+FFFF.  Since that's a noncharacter, the parser
> should reject it (or maybe replace), the formatter should replace it.
> Trouble is it could be misdecoded and then rejected / replaced.
> 
> Besides, U+FFFF already gets tested along with the other noncharacters
> under "5.3  Other illegal code positions".
> 
> Next in line is U+FFFE, also a noncharacter, also under 5.3.
> 
> Next in line is U+FFFD, which I picked.
> 
> But that gets tested under "2.3  Other boundary conditions"!  I guess I
> either drop it there, or make this one U+FFFC.
> 
> I think testing U+FFFC here makes sense, because U+FFFD could be
> misdecoded, then replaced by U+FFFD.
> 
> What do you think?

I think that we're extending Markus Kuhn's test suite, basically taking
random shots at where one specific parser's/formatter's weak spots might
be :)

That said, with intelligent fuzzing out of scope / capacity, U+FFFC
could be a good pick.

I also think I'm a quite a useless person to ask for thoughts in this
area :)

Thanks,
Laszlo
diff mbox

Patch

diff --git a/tests/check-qjson.c b/tests/check-qjson.c
index 852124a..efec1b2 100644
--- a/tests/check-qjson.c
+++ b/tests/check-qjson.c
@@ -158,7 +158,7 @@  static void utf8_string(void)
      * consider using overlong encoding \xC0\x80 for U+0000 ("modified
      * UTF-8").
      *
-     * Test cases are scraped from Markus Kuhn's UTF-8 decoder
+     * Most test cases are scraped from Markus Kuhn's UTF-8 decoder
      * capability and stress test at
      * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
      */
@@ -256,11 +256,11 @@  static void utf8_string(void)
             "\xDF\xBF",
             "\"\\u07FF\"",
         },
-        /* 2.2.3  3 bytes U+FFFF */
+        /* 2.2.3  3 bytes U+FFFD */
         {
-            "\"\xEF\xBF\xBF\"",
-            "\xEF\xBF\xBF",
-            "\"\\uFFFF\"",
+            "\"\xEF\xBF\xBD\"",
+            "\xEF\xBF\xBD",
+            "\"\\uFFFD\"",
         },
         /* 2.2.4  4 bytes U+1FFFFF */
         {
@@ -303,10 +303,10 @@  static void utf8_string(void)
             "\"\\uFFFD\"",
         },
         {
-            /* U+10FFFF */
-            "\"\xF4\x8F\xBF\xBF\"",
-            "\xF4\x8F\xBF\xBF",
-            "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFF\"" */
+            /* U+10FFFD */
+            "\"\xF4\x8F\xBF\xBD\"",
+            "\xF4\x8F\xBF\xBD",
+            "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFD\"" */
         },
         {
             /* U+110000 */
@@ -584,9 +584,9 @@  static void utf8_string(void)
             "\"\\u07FF\"",
         },
         {
-            /* \U+FFFF */
-            "\"\xF0\x8F\xBF\xBF\"",
-            "\xF0\x8F\xBF\xBF",   /* bug: not corrected */
+            /* \U+FFFD */
+            "\"\xF0\x8F\xBF\xBD\"",
+            "\xF0\x8F\xBF\xBD",   /* bug: not corrected */
             "\"\\u03FF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
         },
         {
@@ -731,6 +731,7 @@  static void utf8_string(void)
             "\"\\uDBFF\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */
         },
         /* 5.3  Other illegal code positions */
+        /* BMP noncharacters */
         {
             /* \U+FFFE */
             "\"\xEF\xBF\xBE\"",
@@ -741,7 +742,65 @@  static void utf8_string(void)
             /* \U+FFFF */
             "\"\xEF\xBF\xBF\"",
             "\xEF\xBF\xBF",     /* bug: not corrected */
-            "\"\\uFFFF\"",      /* bug: not corrected */
+            "\"\\uFFFF\"",
+        },
+        {
+            /* U+FDD0 */
+            "\"\xEF\xB7\x90\"",
+            "\xEF\xB7\x90",     /* bug: not corrected */
+            "\"\\uFDD0\"",      /* bug: not corrected */
+        },
+        {
+            /* U+FDEF */
+            "\"\xEF\xB7\xAF\"",
+            "\xEF\xB7\xAF",     /* bug: not corrected */
+            "\"\\uFDEF\"",      /* bug: not corrected */
+        },
+        /* Plane 1 .. 16 noncharacters */
+        {
+            /* U+1FFFE U+1FFFF U+2FFFE U+2FFFF ... U+10FFFE U+10FFFF */
+            "\"\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF"
+            "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF"
+            "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF"
+            "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF"
+            "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF"
+            "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF"
+            "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF"
+            "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF"
+            "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF"
+            "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF"
+            "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF"
+            "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF"
+            "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF"
+            "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF"
+            "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF"
+            "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF\"",
+            /* bug: not corrected */
+            "\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF"
+            "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF"
+            "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF"
+            "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF"
+            "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF"
+            "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF"
+            "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF"
+            "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF"
+            "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF"
+            "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF"
+            "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF"
+            "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF"
+            "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF"
+            "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF"
+            "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF"
+            "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF",
+            /* bug: not corrected */
+            "\"\\u07FF\\uFFFF\\u07FF\\uFFFF\\u0BFF\\uFFFF\\u0BFF\\uFFFF"
+            "\\u0FFF\\uFFFF\\u0FFF\\uFFFF\\u13FF\\uFFFF\\u13FF\\uFFFF"
+            "\\u17FF\\uFFFF\\u17FF\\uFFFF\\u1BFF\\uFFFF\\u1BFF\\uFFFF"
+            "\\u1FFF\\uFFFF\\u1FFF\\uFFFF\\u23FF\\uFFFF\\u23FF\\uFFFF"
+            "\\u27FF\\uFFFF\\u27FF\\uFFFF\\u2BFF\\uFFFF\\u2BFF\\uFFFF"
+            "\\u2FFF\\uFFFF\\u2FFF\\uFFFF\\u33FF\\uFFFF\\u33FF\\uFFFF"
+            "\\u37FF\\uFFFF\\u37FF\\uFFFF\\u3BFF\\uFFFF\\u3BFF\\uFFFF"
+            "\\u3FFF\\uFFFF\\u3FFF\\uFFFF\\u43FF\\uFFFF\\u43FF\\uFFFF\"",
         },
         {}
     };