diff mbox series

[28/56] json: Fix \uXXXX for surrogate pairs

Message ID 20180808120334.10970-29-armbru@redhat.com
State New
Headers show
Series json: Fixes, error reporting improvements, cleanups | expand

Commit Message

Markus Armbruster Aug. 8, 2018, 12:03 p.m. UTC
The JSON parser treats each half of a surrogate pair as unpaired
surrogate.  Fix it to recognize surrogate pairs.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 qobject/json-parser.c | 16 +++++++++++++++-
 tests/check-qjson.c   |  3 +--
 2 files changed, 16 insertions(+), 3 deletions(-)

Comments

Eric Blake Aug. 10, 2018, 5:18 p.m. UTC | #1
On 08/08/2018 07:03 AM, Markus Armbruster wrote:
> The JSON parser treats each half of a surrogate pair as unpaired
> surrogate.  Fix it to recognize surrogate pairs.
> 
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> ---
>   qobject/json-parser.c | 16 +++++++++++++++-
>   tests/check-qjson.c   |  3 +--
>   2 files changed, 16 insertions(+), 3 deletions(-)
> 

> @@ -168,6 +170,18 @@ static QString *parse_string(JSONParserContext *ctxt, JSONToken *token)
>                       cp |= hex2decimal(*ptr);
>                   }
>   
> +                if (cp >= 0xD800 && cp <= 0xDBFF && !leading_surrogate
> +                    && ptr[1] == '\\' && ptr[2] == 'u') {
> +                    ptr += 2;
> +                    leading_surrogate = cp;
> +                    goto hex;
> +                }
> +                if (cp >= 0xDC00 && cp <= 0xDFFF && leading_surrogate) {
> +                    cp &= 0x3FF;
> +                    cp |= (leading_surrogate & 0x3FF) << 10;
> +                    cp += 0x010000;
> +                }
> +
>                   if (mod_utf8_encode(utf8_buf, sizeof(utf8_buf), cp) < 0) {
>                       parse_error(ctxt, token,
>                                   "\\u%.4s is not a valid Unicode character",

Consider "\\udbff\\udfff" - a valid surrogate pair (in terms of being in 
range), but which decodes to u+10ffff.  Since is_valid_codepoint() (part 
of mod_utf8_encode()) rejects it due to (codepoint & 0xfffe) == 0xfffe, 
it means we end up printing this error message, but only using the 
second half of the surrogate pair.  Is that okay?

Otherwise,
Reviewed-by: Eric Blake <eblake@redhat.com>
Paolo Bonzini Aug. 12, 2018, 9:52 a.m. UTC | #2
On 08/08/2018 14:03, Markus Armbruster wrote:
> +                if (cp >= 0xD800 && cp <= 0xDBFF && !leading_surrogate
> +                    && ptr[1] == '\\' && ptr[2] == 'u') {
> +                    ptr += 2;
> +                    leading_surrogate = cp;
> +                    goto hex;
> +                }
> +                if (cp >= 0xDC00 && cp <= 0xDFFF && leading_surrogate) {
> +                    cp &= 0x3FF;
> +                    cp |= (leading_surrogate & 0x3FF) << 10;
> +                    cp += 0x010000;
> +                }
> +

The leading surrogate is discarded for \uD800\uCAFE, I think.  Is this
desired?

Paolo
Markus Armbruster Aug. 13, 2018, 7:07 a.m. UTC | #3
Eric Blake <eblake@redhat.com> writes:

> On 08/08/2018 07:03 AM, Markus Armbruster wrote:
>> The JSON parser treats each half of a surrogate pair as unpaired
>> surrogate.  Fix it to recognize surrogate pairs.
>>
>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> ---
>>   qobject/json-parser.c | 16 +++++++++++++++-
>>   tests/check-qjson.c   |  3 +--
>>   2 files changed, 16 insertions(+), 3 deletions(-)
>>
>
>> @@ -168,6 +170,18 @@ static QString *parse_string(JSONParserContext *ctxt, JSONToken *token)
>>                      cp |= hex2decimal(*ptr);
>>                  }
>> +                if (cp >= 0xD800 && cp <= 0xDBFF && !leading_surrogate
>> +                    && ptr[1] == '\\' && ptr[2] == 'u') {
>> +                    ptr += 2;
>> +                    leading_surrogate = cp;
>> +                    goto hex;
>> +                }
>> +                if (cp >= 0xDC00 && cp <= 0xDFFF && leading_surrogate) {
>> +                    cp &= 0x3FF;
>> +                    cp |= (leading_surrogate & 0x3FF) << 10;
>> +                    cp += 0x010000;
>> +                }
>> +
>>                   if (mod_utf8_encode(utf8_buf, sizeof(utf8_buf), cp) < 0) {
>>                       parse_error(ctxt, token,
>>                                   "\\u%.4s is not a valid Unicode character",
>
> Consider "\\udbff\\udfff" - a valid surrogate pair (in terms of being
> in range), but which decodes to u+10ffff.  Since is_valid_codepoint()
> (part of mod_utf8_encode()) rejects it due to (codepoint & 0xfffe) ==
> 0xfffe, it means we end up printing this error message, but only using
> the second half of the surrogate pair.  Is that okay?

It's not horrible, but I wouldn't call it okay.  I'll try to improve it.

> Otherwise,
> Reviewed-by: Eric Blake <eblake@redhat.com>

Thanks!
Markus Armbruster Aug. 13, 2018, 7:12 a.m. UTC | #4
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 08/08/2018 14:03, Markus Armbruster wrote:
>> +                if (cp >= 0xD800 && cp <= 0xDBFF && !leading_surrogate
>> +                    && ptr[1] == '\\' && ptr[2] == 'u') {
>> +                    ptr += 2;
>> +                    leading_surrogate = cp;
>> +                    goto hex;
>> +                }
>> +                if (cp >= 0xDC00 && cp <= 0xDFFF && leading_surrogate) {
>> +                    cp &= 0x3FF;
>> +                    cp |= (leading_surrogate & 0x3FF) << 10;
>> +                    cp += 0x010000;
>> +                }
>> +
>
> The leading surrogate is discarded for \uD800\uCAFE, I think.  Is this
> desired?

Certainly not.  I'll fix it.  Thanks!
diff mbox series

Patch

diff --git a/qobject/json-parser.c b/qobject/json-parser.c
index bb54886809..703065fa2b 100644
--- a/qobject/json-parser.c
+++ b/qobject/json-parser.c
@@ -115,7 +115,7 @@  static QString *parse_string(JSONParserContext *ctxt, JSONToken *token)
     const char *ptr = token->str;
     QString *str;
     char quote;
-    int cp, i;
+    int cp, i, leading_surrogate;
     char *end;
     ssize_t len;
     char utf8_buf[5];
@@ -156,6 +156,8 @@  static QString *parse_string(JSONParserContext *ctxt, JSONToken *token)
                 qstring_append_chr(str, '\t');
                 break;
             case 'u':
+                leading_surrogate = 0;
+            hex:
                 cp = 0;
                 for (i = 0; i < 4; i++) {
                     ptr++;
@@ -168,6 +170,18 @@  static QString *parse_string(JSONParserContext *ctxt, JSONToken *token)
                     cp |= hex2decimal(*ptr);
                 }
 
+                if (cp >= 0xD800 && cp <= 0xDBFF && !leading_surrogate
+                    && ptr[1] == '\\' && ptr[2] == 'u') {
+                    ptr += 2;
+                    leading_surrogate = cp;
+                    goto hex;
+                }
+                if (cp >= 0xDC00 && cp <= 0xDFFF && leading_surrogate) {
+                    cp &= 0x3FF;
+                    cp |= (leading_surrogate & 0x3FF) << 10;
+                    cp += 0x010000;
+                }
+
                 if (mod_utf8_encode(utf8_buf, sizeof(utf8_buf), cp) < 0) {
                     parse_error(ctxt, token,
                                 "\\u%.4s is not a valid Unicode character",
diff --git a/tests/check-qjson.c b/tests/check-qjson.c
index 422697459f..3d3a3f105f 100644
--- a/tests/check-qjson.c
+++ b/tests/check-qjson.c
@@ -61,8 +61,7 @@  static void escaped_string(void)
         { "double byte utf-8 \\u00A2", "double byte utf-8 \xc2\xa2" },
         { "triple byte utf-8 \\u20AC", "triple byte utf-8 \xe2\x82\xac" },
         { "quadruple byte utf-8 \\uD834\\uDD1E", /* U+1D11E */
-          /* bug: want \xF0\x9D\x84\x9E */
-          NULL },
+          "quadruple byte utf-8 \xF0\x9D\x84\x9E" },
         { "\\z", NULL },
         { "\\ux", NULL },
         { "\\u1x", NULL },