Message ID | 1363283360-26220-4-git-send-email-armbru@redhat.com |
---|---|
State | New |
Headers | show |
On 03/14/13 18:49, Markus Armbruster wrote: > These are all broken, too. What are "these"? And how are they broken? And how does the patch fix them? > > A few test cases use noncharacters U+FFFF and U+10FFFF. Risks testing > noncharacters some more instead of what they're supposed to test. Use > U+FFFD and U+10FFFD instead. > > Signed-off-by: Markus Armbruster <armbru@redhat.com> > --- > tests/check-qjson.c | 85 +++++++++++++++++++++++++++++++++++++++++++++-------- > 1 file changed, 72 insertions(+), 13 deletions(-) I'm confused about the commit message. There are three paragraphs in it (the title, the first paragraph, and the 2nd paragraph). This patch modifies different tests: > diff --git a/tests/check-qjson.c b/tests/check-qjson.c > index 852124a..efec1b2 100644 > --- a/tests/check-qjson.c > +++ b/tests/check-qjson.c > @@ -158,7 +158,7 @@ static void utf8_string(void) > * consider using overlong encoding \xC0\x80 for U+0000 ("modified > * UTF-8"). > * > - * Test cases are scraped from Markus Kuhn's UTF-8 decoder > + * Most test cases are scraped from Markus Kuhn's UTF-8 decoder > * capability and stress test at > * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt > */ > @@ -256,11 +256,11 @@ static void utf8_string(void) > "\xDF\xBF", > "\"\\u07FF\"", > }, > - /* 2.2.3 3 bytes U+FFFF */ > + /* 2.2.3 3 bytes U+FFFD */ > { > - "\"\xEF\xBF\xBF\"", > - "\xEF\xBF\xBF", > - "\"\\uFFFF\"", > + "\"\xEF\xBF\xBD\"", > + "\xEF\xBF\xBD", > + "\"\\uFFFD\"", > }, This is under "2.2 Last possible sequence of a certain length". I guess this is where you say "last possible sequence of a certain length, encoding a character (= non-noncharacter)". OK, p#2. > /* 2.2.4 4 bytes U+1FFFFF */ > { > @@ -303,10 +303,10 @@ static void utf8_string(void) > "\"\\uFFFD\"", > }, > { > - /* U+10FFFF */ > - "\"\xF4\x8F\xBF\xBF\"", > - "\xF4\x8F\xBF\xBF", > - "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFF\"" */ > + /* U+10FFFD */ > + "\"\xF4\x8F\xBF\xBD\"", > + "\xF4\x8F\xBF\xBD", > + "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFD\"" */ > }, > { > /* U+110000 */ Under "2.3 Other boundary conditions". Not a non-character any longer, but also not a boundary condition. At least not the original one. Still covered by the ...FFFD part of the commit message, p#2. > @@ -584,9 +584,9 @@ static void utf8_string(void) > "\"\\u07FF\"", > }, > { > - /* \U+FFFF */ > - "\"\xF0\x8F\xBF\xBF\"", > - "\xF0\x8F\xBF\xBF", /* bug: not corrected */ > + /* \U+FFFD */ > + "\"\xF0\x8F\xBF\xBD\"", > + "\xF0\x8F\xBF\xBD", /* bug: not corrected */ > "\"\\u03FF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ > }, > { Under "4.2 Maximum overlong sequences". What does that even mean? "In some sense maximum codepoints, all represented as overlong sequences"? P#2. > @@ -731,6 +731,7 @@ static void utf8_string(void) > "\"\\uDBFF\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ > }, > /* 5.3 Other illegal code positions */ > + /* BMP noncharacters */ > { > /* \U+FFFE */ > "\"\xEF\xBF\xBE\"", > @@ -741,7 +742,65 @@ static void utf8_string(void) > /* \U+FFFF */ > "\"\xEF\xBF\xBF\"", > "\xEF\xBF\xBF", /* bug: not corrected */ > - "\"\\uFFFF\"", /* bug: not corrected */ > + "\"\\uFFFF\"", > + }, > + { > + /* U+FDD0 */ > + "\"\xEF\xB7\x90\"", > + "\xEF\xB7\x90", /* bug: not corrected */ > + "\"\\uFDD0\"", /* bug: not corrected */ > + }, > + { > + /* U+FDEF */ > + "\"\xEF\xB7\xAF\"", > + "\xEF\xB7\xAF", /* bug: not corrected */ > + "\"\\uFDEF\"", /* bug: not corrected */ > + }, > + /* Plane 1 .. 16 noncharacters */ > + { > + /* U+1FFFE U+1FFFF U+2FFFE U+2FFFF ... U+10FFFE U+10FFFF */ > + "\"\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF" > + "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF" > + "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF" > + "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF" > + "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF" > + "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF" > + "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF" > + "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF" > + "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF" > + "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF" > + "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF" > + "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF" > + "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF" > + "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF" > + "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF" > + "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF\"", > + /* bug: not corrected */ > + "\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF" > + "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF" > + "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF" > + "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF" > + "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF" > + "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF" > + "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF" > + "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF" > + "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF" > + "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF" > + "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF" > + "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF" > + "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF" > + "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF" > + "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF" > + "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF", > + /* bug: not corrected */ > + "\"\\u07FF\\uFFFF\\u07FF\\uFFFF\\u0BFF\\uFFFF\\u0BFF\\uFFFF" > + "\\u0FFF\\uFFFF\\u0FFF\\uFFFF\\u13FF\\uFFFF\\u13FF\\uFFFF" > + "\\u17FF\\uFFFF\\u17FF\\uFFFF\\u1BFF\\uFFFF\\u1BFF\\uFFFF" > + "\\u1FFF\\uFFFF\\u1FFF\\uFFFF\\u23FF\\uFFFF\\u23FF\\uFFFF" > + "\\u27FF\\uFFFF\\u27FF\\uFFFF\\u2BFF\\uFFFF\\u2BFF\\uFFFF" > + "\\u2FFF\\uFFFF\\u2FFF\\uFFFF\\u33FF\\uFFFF\\u33FF\\uFFFF" > + "\\u37FF\\uFFFF\\u37FF\\uFFFF\\u3BFF\\uFFFF\\u3BFF\\uFFFF" > + "\\u3FFF\\uFFFF\\u3FFF\\uFFFF\\u43FF\\uFFFF\\u43FF\\uFFFF\"", > }, > {} > }; > This is probably p#0 (the title). Ah. Have you removed the noncharacters from the other tests, but made up for them at the end with new noncharacter tests? Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Laszlo Ersek <lersek@redhat.com> writes: > On 03/14/13 18:49, Markus Armbruster wrote: >> These are all broken, too. > > What are "these"? And how are they broken? And how does the patch fix them? "These" refers to the subject: noncharacters other than U+FFFE, U+FFFF. I agree that I should better explain how they're broken, and what the patch does to fix them. Will fix on respin. >> >> A few test cases use noncharacters U+FFFF and U+10FFFF. Risks testing >> noncharacters some more instead of what they're supposed to test. Use >> U+FFFD and U+10FFFD instead. >> >> Signed-off-by: Markus Armbruster <armbru@redhat.com> >> --- >> tests/check-qjson.c | 85 >> +++++++++++++++++++++++++++++++++++++++++++++-------- >> 1 file changed, 72 insertions(+), 13 deletions(-) > > I'm confused about the commit message. There are three paragraphs in it > (the title, the first paragraph, and the 2nd paragraph). This patch > modifies different tests: > >> diff --git a/tests/check-qjson.c b/tests/check-qjson.c >> index 852124a..efec1b2 100644 >> --- a/tests/check-qjson.c >> +++ b/tests/check-qjson.c >> @@ -158,7 +158,7 @@ static void utf8_string(void) >> * consider using overlong encoding \xC0\x80 for U+0000 ("modified >> * UTF-8"). >> * >> - * Test cases are scraped from Markus Kuhn's UTF-8 decoder >> + * Most test cases are scraped from Markus Kuhn's UTF-8 decoder >> * capability and stress test at >> * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt >> */ >> @@ -256,11 +256,11 @@ static void utf8_string(void) >> "\xDF\xBF", >> "\"\\u07FF\"", >> }, >> - /* 2.2.3 3 bytes U+FFFF */ >> + /* 2.2.3 3 bytes U+FFFD */ >> { >> - "\"\xEF\xBF\xBF\"", >> - "\xEF\xBF\xBF", >> - "\"\\uFFFF\"", >> + "\"\xEF\xBF\xBD\"", >> + "\xEF\xBF\xBD", >> + "\"\\uFFFD\"", >> }, > > This is under "2.2 Last possible sequence of a certain length". I guess Which is in turn under "2 Boundary condition test cases". > this is where you say "last possible sequence of a certain length, > encoding a character (= non-noncharacter)". OK, p#2. Yes. The test's purpose is testing the upper bound of 3-byte sequences is decoded correctly. The upper bound is U+FFFF. Since that's a noncharacter, the parser should reject it (or maybe replace), the formatter should replace it. Trouble is it could be misdecoded and then rejected / replaced. Besides, U+FFFF already gets tested along with the other noncharacters under "5.3 Other illegal code positions". Next in line is U+FFFE, also a noncharacter, also under 5.3. Next in line is U+FFFD, which I picked. But that gets tested under "2.3 Other boundary conditions"! I guess I either drop it there, or make this one U+FFFC. I think testing U+FFFC here makes sense, because U+FFFD could be misdecoded, then replaced by U+FFFD. What do you think? >> /* 2.2.4 4 bytes U+1FFFFF */ >> { >> @@ -303,10 +303,10 @@ static void utf8_string(void) >> "\"\\uFFFD\"", >> }, >> { >> - /* U+10FFFF */ >> - "\"\xF4\x8F\xBF\xBF\"", >> - "\xF4\x8F\xBF\xBF", >> - "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFF\"" */ >> + /* U+10FFFD */ >> + "\"\xF4\x8F\xBF\xBD\"", >> + "\xF4\x8F\xBF\xBD", >> + "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFD\"" */ >> }, >> { >> /* U+110000 */ > > Under "2.3 Other boundary conditions". Not a non-character any longer, > but also not a boundary condition. At least not the original one. Still > covered by the ...FFFD part of the commit message, p#2. The test's purpose is testing the upper bound of the Unicode range gets decoded correctly. The upper bound is U+10FFFF. Since that's a noncharacter... same argument as above. Next in line is U+10FFFE, also a noncharacter. Next in line is U+10FFFD, which I picked. >> @@ -584,9 +584,9 @@ static void utf8_string(void) >> "\"\\u07FF\"", >> }, >> { >> - /* \U+FFFF */ >> - "\"\xF0\x8F\xBF\xBF\"", >> - "\xF0\x8F\xBF\xBF", /* bug: not corrected */ >> + /* \U+FFFD */ >> + "\"\xF0\x8F\xBF\xBD\"", >> + "\xF0\x8F\xBF\xBD", /* bug: not corrected */ >> "\"\\u03FF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ >> }, >> { > > Under "4.2 Maximum overlong sequences". What does that even mean? "In > some sense maximum codepoints, all represented as overlong sequences"? P#2. The headings are all stolen from the same source as the test cases. Perhaps I should steal more of the explanatory text there as well. The one for 4.2 is: Below you see the highest Unicode value that is still resulting in an overlong sequence if represented with the given number of bytes. This is a boundary test for safe UTF-8 decoders. All five characters should be rejected like malformed UTF-8 sequences. >> @@ -731,6 +731,7 @@ static void utf8_string(void) >> "\"\\uDBFF\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ >> }, >> /* 5.3 Other illegal code positions */ >> + /* BMP noncharacters */ >> { >> /* \U+FFFE */ >> "\"\xEF\xBF\xBE\"", >> @@ -741,7 +742,65 @@ static void utf8_string(void) >> /* \U+FFFF */ >> "\"\xEF\xBF\xBF\"", >> "\xEF\xBF\xBF", /* bug: not corrected */ >> - "\"\\uFFFF\"", /* bug: not corrected */ >> + "\"\\uFFFF\"", >> + }, >> + { >> + /* U+FDD0 */ >> + "\"\xEF\xB7\x90\"", >> + "\xEF\xB7\x90", /* bug: not corrected */ >> + "\"\\uFDD0\"", /* bug: not corrected */ >> + }, >> + { >> + /* U+FDEF */ >> + "\"\xEF\xB7\xAF\"", >> + "\xEF\xB7\xAF", /* bug: not corrected */ >> + "\"\\uFDEF\"", /* bug: not corrected */ >> + }, >> + /* Plane 1 .. 16 noncharacters */ >> + { >> + /* U+1FFFE U+1FFFF U+2FFFE U+2FFFF ... U+10FFFE U+10FFFF */ >> + "\"\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF" >> + "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF" >> + "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF" >> + "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF" >> + "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF" >> + "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF" >> + "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF" >> + "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF" >> + "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF" >> + "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF" >> + "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF" >> + "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF" >> + "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF" >> + "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF" >> + "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF" >> + "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF\"", >> + /* bug: not corrected */ >> + "\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF" >> + "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF" >> + "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF" >> + "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF" >> + "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF" >> + "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF" >> + "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF" >> + "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF" >> + "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF" >> + "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF" >> + "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF" >> + "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF" >> + "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF" >> + "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF" >> + "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF" >> + "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF", >> + /* bug: not corrected */ >> + "\"\\u07FF\\uFFFF\\u07FF\\uFFFF\\u0BFF\\uFFFF\\u0BFF\\uFFFF" >> + "\\u0FFF\\uFFFF\\u0FFF\\uFFFF\\u13FF\\uFFFF\\u13FF\\uFFFF" >> + "\\u17FF\\uFFFF\\u17FF\\uFFFF\\u1BFF\\uFFFF\\u1BFF\\uFFFF" >> + "\\u1FFF\\uFFFF\\u1FFF\\uFFFF\\u23FF\\uFFFF\\u23FF\\uFFFF" >> + "\\u27FF\\uFFFF\\u27FF\\uFFFF\\u2BFF\\uFFFF\\u2BFF\\uFFFF" >> + "\\u2FFF\\uFFFF\\u2FFF\\uFFFF\\u33FF\\uFFFF\\u33FF\\uFFFF" >> + "\\u37FF\\uFFFF\\u37FF\\uFFFF\\u3BFF\\uFFFF\\u3BFF\\uFFFF" >> + "\\u3FFF\\uFFFF\\u3FFF\\uFFFF\\u43FF\\uFFFF\\u43FF\\uFFFF\"", >> }, >> {} >> }; >> > > This is probably p#0 (the title). > > Ah. Have you removed the noncharacters from the other tests, but made up > for them at the end with new noncharacter tests? I added tests to cover all 66 noncharacters. Then I noticed some duplicate test cases elsewhere, and realized that these don't really fully test what I want to test there. > Reviewed-by: Laszlo Ersek <lersek@redhat.com> Thanks!
On 03/22/13 15:37, Markus Armbruster wrote: > Laszlo Ersek <lersek@redhat.com> writes: > >> On 03/14/13 18:49, Markus Armbruster wrote: >>> These are all broken, too. >> >> What are "these"? And how are they broken? And how does the patch fix them? > > "These" refers to the subject: noncharacters other than U+FFFE, U+FFFF. > > I agree that I should better explain how they're broken, and what the > patch does to fix them. Will fix on respin. > >>> >>> A few test cases use noncharacters U+FFFF and U+10FFFF. Risks testing >>> noncharacters some more instead of what they're supposed to test. Use >>> U+FFFD and U+10FFFD instead. >>> >>> Signed-off-by: Markus Armbruster <armbru@redhat.com> >>> --- >>> tests/check-qjson.c | 85 >>> +++++++++++++++++++++++++++++++++++++++++++++-------- >>> 1 file changed, 72 insertions(+), 13 deletions(-) >> >> I'm confused about the commit message. There are three paragraphs in it >> (the title, the first paragraph, and the 2nd paragraph). This patch >> modifies different tests: >> >>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c >>> index 852124a..efec1b2 100644 >>> --- a/tests/check-qjson.c >>> +++ b/tests/check-qjson.c >>> @@ -158,7 +158,7 @@ static void utf8_string(void) >>> * consider using overlong encoding \xC0\x80 for U+0000 ("modified >>> * UTF-8"). >>> * >>> - * Test cases are scraped from Markus Kuhn's UTF-8 decoder >>> + * Most test cases are scraped from Markus Kuhn's UTF-8 decoder >>> * capability and stress test at >>> * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt >>> */ >>> @@ -256,11 +256,11 @@ static void utf8_string(void) >>> "\xDF\xBF", >>> "\"\\u07FF\"", >>> }, >>> - /* 2.2.3 3 bytes U+FFFF */ >>> + /* 2.2.3 3 bytes U+FFFD */ >>> { >>> - "\"\xEF\xBF\xBF\"", >>> - "\xEF\xBF\xBF", >>> - "\"\\uFFFF\"", >>> + "\"\xEF\xBF\xBD\"", >>> + "\xEF\xBF\xBD", >>> + "\"\\uFFFD\"", >>> }, >> >> This is under "2.2 Last possible sequence of a certain length". I guess > > Which is in turn under "2 Boundary condition test cases". > >> this is where you say "last possible sequence of a certain length, >> encoding a character (= non-noncharacter)". OK, p#2. > > Yes. > > The test's purpose is testing the upper bound of 3-byte sequences is > decoded correctly. > > The upper bound is U+FFFF. Since that's a noncharacter, the parser > should reject it (or maybe replace), the formatter should replace it. > Trouble is it could be misdecoded and then rejected / replaced. > > Besides, U+FFFF already gets tested along with the other noncharacters > under "5.3 Other illegal code positions". > > Next in line is U+FFFE, also a noncharacter, also under 5.3. > > Next in line is U+FFFD, which I picked. > > But that gets tested under "2.3 Other boundary conditions"! I guess I > either drop it there, or make this one U+FFFC. > > I think testing U+FFFC here makes sense, because U+FFFD could be > misdecoded, then replaced by U+FFFD. > > What do you think? I think that we're extending Markus Kuhn's test suite, basically taking random shots at where one specific parser's/formatter's weak spots might be :) That said, with intelligent fuzzing out of scope / capacity, U+FFFC could be a good pick. I also think I'm a quite a useless person to ask for thoughts in this area :) Thanks, Laszlo
diff --git a/tests/check-qjson.c b/tests/check-qjson.c index 852124a..efec1b2 100644 --- a/tests/check-qjson.c +++ b/tests/check-qjson.c @@ -158,7 +158,7 @@ static void utf8_string(void) * consider using overlong encoding \xC0\x80 for U+0000 ("modified * UTF-8"). * - * Test cases are scraped from Markus Kuhn's UTF-8 decoder + * Most test cases are scraped from Markus Kuhn's UTF-8 decoder * capability and stress test at * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt */ @@ -256,11 +256,11 @@ static void utf8_string(void) "\xDF\xBF", "\"\\u07FF\"", }, - /* 2.2.3 3 bytes U+FFFF */ + /* 2.2.3 3 bytes U+FFFD */ { - "\"\xEF\xBF\xBF\"", - "\xEF\xBF\xBF", - "\"\\uFFFF\"", + "\"\xEF\xBF\xBD\"", + "\xEF\xBF\xBD", + "\"\\uFFFD\"", }, /* 2.2.4 4 bytes U+1FFFFF */ { @@ -303,10 +303,10 @@ static void utf8_string(void) "\"\\uFFFD\"", }, { - /* U+10FFFF */ - "\"\xF4\x8F\xBF\xBF\"", - "\xF4\x8F\xBF\xBF", - "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFF\"" */ + /* U+10FFFD */ + "\"\xF4\x8F\xBF\xBD\"", + "\xF4\x8F\xBF\xBD", + "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFD\"" */ }, { /* U+110000 */ @@ -584,9 +584,9 @@ static void utf8_string(void) "\"\\u07FF\"", }, { - /* \U+FFFF */ - "\"\xF0\x8F\xBF\xBF\"", - "\xF0\x8F\xBF\xBF", /* bug: not corrected */ + /* \U+FFFD */ + "\"\xF0\x8F\xBF\xBD\"", + "\xF0\x8F\xBF\xBD", /* bug: not corrected */ "\"\\u03FF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */ }, { @@ -731,6 +731,7 @@ static void utf8_string(void) "\"\\uDBFF\\uDFFF\"", /* bug: want "\"\\uFFFF\\uFFFF\"" */ }, /* 5.3 Other illegal code positions */ + /* BMP noncharacters */ { /* \U+FFFE */ "\"\xEF\xBF\xBE\"", @@ -741,7 +742,65 @@ static void utf8_string(void) /* \U+FFFF */ "\"\xEF\xBF\xBF\"", "\xEF\xBF\xBF", /* bug: not corrected */ - "\"\\uFFFF\"", /* bug: not corrected */ + "\"\\uFFFF\"", + }, + { + /* U+FDD0 */ + "\"\xEF\xB7\x90\"", + "\xEF\xB7\x90", /* bug: not corrected */ + "\"\\uFDD0\"", /* bug: not corrected */ + }, + { + /* U+FDEF */ + "\"\xEF\xB7\xAF\"", + "\xEF\xB7\xAF", /* bug: not corrected */ + "\"\\uFDEF\"", /* bug: not corrected */ + }, + /* Plane 1 .. 16 noncharacters */ + { + /* U+1FFFE U+1FFFF U+2FFFE U+2FFFF ... U+10FFFE U+10FFFF */ + "\"\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF" + "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF" + "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF" + "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF" + "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF" + "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF" + "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF" + "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF" + "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF" + "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF" + "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF" + "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF" + "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF" + "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF" + "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF" + "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF\"", + /* bug: not corrected */ + "\xF0\x9F\xBF\xBE\xF0\x9F\xBF\xBF" + "\xF0\xAF\xBF\xBE\xF0\xAF\xBF\xBF" + "\xF0\xBF\xBF\xBE\xF0\xBF\xBF\xBF" + "\xF1\x8F\xBF\xBE\xF1\x8F\xBF\xBF" + "\xF1\x9F\xBF\xBE\xF1\x9F\xBF\xBF" + "\xF1\xAF\xBF\xBE\xF1\xAF\xBF\xBF" + "\xF1\xBF\xBF\xBE\xF1\xBF\xBF\xBF" + "\xF2\x8F\xBF\xBE\xF2\x8F\xBF\xBF" + "\xF2\x9F\xBF\xBE\xF2\x9F\xBF\xBF" + "\xF2\xAF\xBF\xBE\xF2\xAF\xBF\xBF" + "\xF2\xBF\xBF\xBE\xF2\xBF\xBF\xBF" + "\xF3\x8F\xBF\xBE\xF3\x8F\xBF\xBF" + "\xF3\x9F\xBF\xBE\xF3\x9F\xBF\xBF" + "\xF3\xAF\xBF\xBE\xF3\xAF\xBF\xBF" + "\xF3\xBF\xBF\xBE\xF3\xBF\xBF\xBF" + "\xF4\x8F\xBF\xBE\xF4\x8F\xBF\xBF", + /* bug: not corrected */ + "\"\\u07FF\\uFFFF\\u07FF\\uFFFF\\u0BFF\\uFFFF\\u0BFF\\uFFFF" + "\\u0FFF\\uFFFF\\u0FFF\\uFFFF\\u13FF\\uFFFF\\u13FF\\uFFFF" + "\\u17FF\\uFFFF\\u17FF\\uFFFF\\u1BFF\\uFFFF\\u1BFF\\uFFFF" + "\\u1FFF\\uFFFF\\u1FFF\\uFFFF\\u23FF\\uFFFF\\u23FF\\uFFFF" + "\\u27FF\\uFFFF\\u27FF\\uFFFF\\u2BFF\\uFFFF\\u2BFF\\uFFFF" + "\\u2FFF\\uFFFF\\u2FFF\\uFFFF\\u33FF\\uFFFF\\u33FF\\uFFFF" + "\\u37FF\\uFFFF\\u37FF\\uFFFF\\u3BFF\\uFFFF\\u3BFF\\uFFFF" + "\\u3FFF\\uFFFF\\u3FFF\\uFFFF\\u43FF\\uFFFF\\u43FF\\uFFFF\"", }, {} };
These are all broken, too. A few test cases use noncharacters U+FFFF and U+10FFFF. Risks testing noncharacters some more instead of what they're supposed to test. Use U+FFFD and U+10FFFD instead. Signed-off-by: Markus Armbruster <armbru@redhat.com> --- tests/check-qjson.c | 85 +++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 72 insertions(+), 13 deletions(-)