Message ID | 20180808120334.10970-29-armbru@redhat.com |
---|---|
State | New |
Headers | show |
Series | json: Fixes, error reporting improvements, cleanups | expand |
On 08/08/2018 07:03 AM, Markus Armbruster wrote: > The JSON parser treats each half of a surrogate pair as unpaired > surrogate. Fix it to recognize surrogate pairs. > > Signed-off-by: Markus Armbruster <armbru@redhat.com> > --- > qobject/json-parser.c | 16 +++++++++++++++- > tests/check-qjson.c | 3 +-- > 2 files changed, 16 insertions(+), 3 deletions(-) > > @@ -168,6 +170,18 @@ static QString *parse_string(JSONParserContext *ctxt, JSONToken *token) > cp |= hex2decimal(*ptr); > } > > + if (cp >= 0xD800 && cp <= 0xDBFF && !leading_surrogate > + && ptr[1] == '\\' && ptr[2] == 'u') { > + ptr += 2; > + leading_surrogate = cp; > + goto hex; > + } > + if (cp >= 0xDC00 && cp <= 0xDFFF && leading_surrogate) { > + cp &= 0x3FF; > + cp |= (leading_surrogate & 0x3FF) << 10; > + cp += 0x010000; > + } > + > if (mod_utf8_encode(utf8_buf, sizeof(utf8_buf), cp) < 0) { > parse_error(ctxt, token, > "\\u%.4s is not a valid Unicode character", Consider "\\udbff\\udfff" - a valid surrogate pair (in terms of being in range), but which decodes to u+10ffff. Since is_valid_codepoint() (part of mod_utf8_encode()) rejects it due to (codepoint & 0xfffe) == 0xfffe, it means we end up printing this error message, but only using the second half of the surrogate pair. Is that okay? Otherwise, Reviewed-by: Eric Blake <eblake@redhat.com>
On 08/08/2018 14:03, Markus Armbruster wrote: > + if (cp >= 0xD800 && cp <= 0xDBFF && !leading_surrogate > + && ptr[1] == '\\' && ptr[2] == 'u') { > + ptr += 2; > + leading_surrogate = cp; > + goto hex; > + } > + if (cp >= 0xDC00 && cp <= 0xDFFF && leading_surrogate) { > + cp &= 0x3FF; > + cp |= (leading_surrogate & 0x3FF) << 10; > + cp += 0x010000; > + } > + The leading surrogate is discarded for \uD800\uCAFE, I think. Is this desired? Paolo
Eric Blake <eblake@redhat.com> writes: > On 08/08/2018 07:03 AM, Markus Armbruster wrote: >> The JSON parser treats each half of a surrogate pair as unpaired >> surrogate. Fix it to recognize surrogate pairs. >> >> Signed-off-by: Markus Armbruster <armbru@redhat.com> >> --- >> qobject/json-parser.c | 16 +++++++++++++++- >> tests/check-qjson.c | 3 +-- >> 2 files changed, 16 insertions(+), 3 deletions(-) >> > >> @@ -168,6 +170,18 @@ static QString *parse_string(JSONParserContext *ctxt, JSONToken *token) >> cp |= hex2decimal(*ptr); >> } >> + if (cp >= 0xD800 && cp <= 0xDBFF && !leading_surrogate >> + && ptr[1] == '\\' && ptr[2] == 'u') { >> + ptr += 2; >> + leading_surrogate = cp; >> + goto hex; >> + } >> + if (cp >= 0xDC00 && cp <= 0xDFFF && leading_surrogate) { >> + cp &= 0x3FF; >> + cp |= (leading_surrogate & 0x3FF) << 10; >> + cp += 0x010000; >> + } >> + >> if (mod_utf8_encode(utf8_buf, sizeof(utf8_buf), cp) < 0) { >> parse_error(ctxt, token, >> "\\u%.4s is not a valid Unicode character", > > Consider "\\udbff\\udfff" - a valid surrogate pair (in terms of being > in range), but which decodes to u+10ffff. Since is_valid_codepoint() > (part of mod_utf8_encode()) rejects it due to (codepoint & 0xfffe) == > 0xfffe, it means we end up printing this error message, but only using > the second half of the surrogate pair. Is that okay? It's not horrible, but I wouldn't call it okay. I'll try to improve it. > Otherwise, > Reviewed-by: Eric Blake <eblake@redhat.com> Thanks!
Paolo Bonzini <pbonzini@redhat.com> writes: > On 08/08/2018 14:03, Markus Armbruster wrote: >> + if (cp >= 0xD800 && cp <= 0xDBFF && !leading_surrogate >> + && ptr[1] == '\\' && ptr[2] == 'u') { >> + ptr += 2; >> + leading_surrogate = cp; >> + goto hex; >> + } >> + if (cp >= 0xDC00 && cp <= 0xDFFF && leading_surrogate) { >> + cp &= 0x3FF; >> + cp |= (leading_surrogate & 0x3FF) << 10; >> + cp += 0x010000; >> + } >> + > > The leading surrogate is discarded for \uD800\uCAFE, I think. Is this > desired? Certainly not. I'll fix it. Thanks!
diff --git a/qobject/json-parser.c b/qobject/json-parser.c index bb54886809..703065fa2b 100644 --- a/qobject/json-parser.c +++ b/qobject/json-parser.c @@ -115,7 +115,7 @@ static QString *parse_string(JSONParserContext *ctxt, JSONToken *token) const char *ptr = token->str; QString *str; char quote; - int cp, i; + int cp, i, leading_surrogate; char *end; ssize_t len; char utf8_buf[5]; @@ -156,6 +156,8 @@ static QString *parse_string(JSONParserContext *ctxt, JSONToken *token) qstring_append_chr(str, '\t'); break; case 'u': + leading_surrogate = 0; + hex: cp = 0; for (i = 0; i < 4; i++) { ptr++; @@ -168,6 +170,18 @@ static QString *parse_string(JSONParserContext *ctxt, JSONToken *token) cp |= hex2decimal(*ptr); } + if (cp >= 0xD800 && cp <= 0xDBFF && !leading_surrogate + && ptr[1] == '\\' && ptr[2] == 'u') { + ptr += 2; + leading_surrogate = cp; + goto hex; + } + if (cp >= 0xDC00 && cp <= 0xDFFF && leading_surrogate) { + cp &= 0x3FF; + cp |= (leading_surrogate & 0x3FF) << 10; + cp += 0x010000; + } + if (mod_utf8_encode(utf8_buf, sizeof(utf8_buf), cp) < 0) { parse_error(ctxt, token, "\\u%.4s is not a valid Unicode character", diff --git a/tests/check-qjson.c b/tests/check-qjson.c index 422697459f..3d3a3f105f 100644 --- a/tests/check-qjson.c +++ b/tests/check-qjson.c @@ -61,8 +61,7 @@ static void escaped_string(void) { "double byte utf-8 \\u00A2", "double byte utf-8 \xc2\xa2" }, { "triple byte utf-8 \\u20AC", "triple byte utf-8 \xe2\x82\xac" }, { "quadruple byte utf-8 \\uD834\\uDD1E", /* U+1D11E */ - /* bug: want \xF0\x9D\x84\x9E */ - NULL }, + "quadruple byte utf-8 \xF0\x9D\x84\x9E" }, { "\\z", NULL }, { "\\ux", NULL }, { "\\u1x", NULL },
The JSON parser treats each half of a surrogate pair as unpaired surrogate. Fix it to recognize surrogate pairs. Signed-off-by: Markus Armbruster <armbru@redhat.com> --- qobject/json-parser.c | 16 +++++++++++++++- tests/check-qjson.c | 3 +-- 2 files changed, 16 insertions(+), 3 deletions(-)