
[39/56] json: Leave rejecting invalid interpolation to parser

Message ID 20180808120334.10970-40-armbru@redhat.com
State New
Series json: Fixes, error reporting improvements, cleanups

Commit Message

Markus Armbruster Aug. 8, 2018, 12:03 p.m. UTC
Both lexer and parser reject invalid interpolation specifications.
The parser's check is useless.

The lexer ends the token right after the first bad character.  This
tends to lead to suboptimal error reporting.  For instance, input

    [ %11d ]

produces the tokens

    JSON_LSQUARE  [
    JSON_ERROR    %1
    JSON_INTEGER  1
    JSON_KEYWORD  d
    JSON_RSQUARE  ]

The parser then yields an error, an object and two more errors:

    error: Invalid JSON syntax
    object: 1
    error: JSON parse error, invalid keyword
    error: JSON parse error, expecting value

Change the lexer to accept %[A-Za-z0-9]*.  It now produces

    JSON_LSQUARE  [
    JSON_INTERPOLATION %11d
    JSON_RSQUARE  ]

and the parser reports just

    JSON parse error, invalid interpolation '%11d'

Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 qobject/json-lexer.c  | 52 +++++++++----------------------------------
 qobject/json-parser.c |  1 +
 2 files changed, 11 insertions(+), 42 deletions(-)

Comments

Eric Blake Aug. 13, 2018, 4:12 p.m. UTC | #1
On 08/08/2018 07:03 AM, Markus Armbruster wrote:
> Both lexer and parser reject invalid interpolation specifications.
> The parser's check is useless.
> 
> The lexer ends the token right after the first bad character.  This
> tends to lead to suboptimal error reporting.  For instance, input
> 
>      [ %11d ]

With no context of other characters on this line, it took me a while to 
notice that this was '1' (one) not 'l' (ell), even though my current 
font renders the two quite distinctly. (And now you know why base32 
avoids 0/1 in its alphabet, to minimize confusion with O/l - see RFC 
4648).  A better example might be %22d.

> 
> produces the tokens
> 
>      JSON_LSQUARE  [
>      JSON_ERROR    %1
>      JSON_INTEGER  1

And that's in spite of the context you have here, which makes it obvious 
that the parser saw an integer.

>      JSON_KEYWORD  d
>      JSON_RSQUARE  ]
> 
> The parser then yields an error, an object and two more errors:
> 
>      error: Invalid JSON syntax
>      object: 1
>      error: JSON parse error, invalid keyword
>      error: JSON parse error, expecting value
> 
> Change the lexer to accept [A-Za-z0-9]*[duipsf].  It now produces

That regex doesn't match the code.

> 
>      JSON_LSQUARE  [
>      JSON_INTERPOLATION %11d
>      JSON_RSQUARE  ]
> 
> and the parser reports just
> 
>      JSON parse error, invalid interpolation '%11d'
> 
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> ---
>   qobject/json-lexer.c  | 52 +++++++++----------------------------------
>   qobject/json-parser.c |  1 +
>   2 files changed, 11 insertions(+), 42 deletions(-)
> 
> diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
> index 0ea1eae4aa..7a82aab88b 100644
> --- a/qobject/json-lexer.c
> +++ b/qobject/json-lexer.c
> @@ -93,7 +93,8 @@
>    *   (apostrophe) instead of %x22 (quotation mark), and can't contain
>    *   unescaped apostrophe, but can contain unescaped quotation mark.
>    * - Interpolation:
> - *   interpolation = %((l|ll|I64)[du]|[ipsf])
> + *   The lexer accepts [A-Za-z0-9]*, and leaves rejecting invalid ones
> + *   to the parser.

This comment is more apropos.  But is it worth spelling "The lexer 
accepts %[A-Za-z0-9]*", and/or documenting that recognizing 
interpolation during lexing is now optional (thanks to the previous patch)?

> @@ -278,6 +238,14 @@ static const uint8_t json_lexer[][256] =  {
>           ['\n'] = IN_WHITESPACE,
>       },
>   
> +    /* interpolation */
> +    [IN_INTERPOL] = {
> +        TERMINAL(JSON_INTERPOL),
> +        ['A' ... 'Z'] = IN_INTERPOL,
> +        ['a' ... 'z'] = IN_INTERPOL,
> +        ['0' ... '9'] = IN_INTERPOL,
> +    },
> +

Note that we still treat code like "'foo': %#x" as an invalid 
interpolation request (it would be a valid printf format), while at the 
same time passing "%1" on to the parser (which is not a valid printf 
format, so -Wformat would have flagged it).  In fact, "%d1" which is 
valid for printf as "%d" followed by a literal "1" gets thrown to the 
parser as one chunk - but it is not valid in JSON to have %d adjacent to 
another integer anyway, so rejecting "%d1" as an unknown sequence loses 
nothing.  I don't think that catering to the remaining printf 
sequence.  I don't think that catering to the remaining printf 
metacharacters (' ', '#', '\'', '-', '*', '+') nor demanding that things 
end on a letter is worth the effort, since it just makes the lexer more 
complicated when your goal was to do as little as possible to get things 
thrown over to the parser.

So even though your lexing is now somewhat different from printf, I 
don't see that as a serious drawback (the uses we care about are still 
-Wformat clean, and the uses we don't like are properly handled in the 
parser).

Perhaps you want to enhance the testsuite (earlier in the series) to 
cover "%22d", "%d1", "%1" as various unaccepted interpolation requests 
(if you didn't already).

Reviewed-by: Eric Blake <eblake@redhat.com>
Markus Armbruster Aug. 14, 2018, 7:23 a.m. UTC | #2
Eric Blake <eblake@redhat.com> writes:

> On 08/08/2018 07:03 AM, Markus Armbruster wrote:
>> Both lexer and parser reject invalid interpolation specifications.
>> The parser's check is useless.
>>
>> The lexer ends the token right after the first bad character.  This
>> tends to lead to suboptimal error reporting.  For instance, input
>>
>>      [ %11d ]
>
> With no context of other characters on this line, it took me a while
> to notice that this was '1' (one) not 'l' (ell), even though my
> current font renders the two quite distinctly. (And now you know why
> base32 avoids 0/1 in its alphabet, to minimize confusion with O/l -
> see RFC 4648).  A better example might be %22d.

I made up an example that could plausibly escape review.  I guess the
realism is too... realistic.  Also, not really helpful.

>> produces the tokens
>>
>>      JSON_LSQUARE  [
>>      JSON_ERROR    %1
>>      JSON_INTEGER  1
>
> And that's in spite of the context you have here, which makes it
> obvious that the parser saw an integer.
>
>>      JSON_KEYWORD  d
>>      JSON_RSQUARE  ]
>>
>> The parser then yields an error, an object and two more errors:
>>
>>      error: Invalid JSON syntax
>>      object: 1
>>      error: JSON parse error, invalid keyword
>>      error: JSON parse error, expecting value
>>
>> Change the lexer to accept [A-Za-z0-9]*[duipsf].  It now produces
>
> That regex doesn't match the code.

Left over from a previous, unpublished iteration.  I'll fix it.

>>
>>      JSON_LSQUARE  [
>>      JSON_INTERPOLATION %11d
>>      JSON_RSQUARE  ]
>>
>> and the parser reports just
>>
>>      JSON parse error, invalid interpolation '%11d'
>>
>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> ---
>>   qobject/json-lexer.c  | 52 +++++++++----------------------------------
>>   qobject/json-parser.c |  1 +
>>   2 files changed, 11 insertions(+), 42 deletions(-)
>>
>> diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
>> index 0ea1eae4aa..7a82aab88b 100644
>> --- a/qobject/json-lexer.c
>> +++ b/qobject/json-lexer.c
>> @@ -93,7 +93,8 @@
>>    *   (apostrophe) instead of %x22 (quotation mark), and can't contain
>>    *   unescaped apostrophe, but can contain unescaped quotation mark.
>>    * - Interpolation:
>> - *   interpolation = %((l|ll|I64)[du]|[ipsf])
>> + *   The lexer accepts [A-Za-z0-9]*, and leaves rejecting invalid ones
>> + *   to the parser.
>
> This comment is more apropos.  But is it worth spelling "The lexer
> accepts %[A-Za-z0-9]*",

Yes.

>                         and/or documenting that recognizing
> interpolation during lexing is now optional (thanks to the previous
> patch)?

Should be done right when I make it optional, in PATCH 37, perhaps like
this:

@@ -92,7 +92,7 @@
  *   Like double-quoted strings, except they're delimited by %x27
  *   (apostrophe) instead of %x22 (quotation mark), and can't contain
  *   unescaped apostrophe, but can contain unescaped quotation mark.
- * - Interpolation:
+ * - Interpolation, if enabled:
  *   interpolation = %((l|ll|I64)[du]|[ipsf])
  *
  * Note:

>> @@ -278,6 +238,14 @@ static const uint8_t json_lexer[][256] =  {
>>           ['\n'] = IN_WHITESPACE,
>>       },
>>   +    /* interpolation */
>> +    [IN_INTERPOL] = {
>> +        TERMINAL(JSON_INTERPOL),
>> +        ['A' ... 'Z'] = IN_INTERPOL,
>> +        ['a' ... 'z'] = IN_INTERPOL,
>> +        ['0' ... '9'] = IN_INTERPOL,
>> +    },
>> +
>
> Note that we still treat code like "'foo': %#x" as an invalid
> interpolation request (it would be a valid printf format), while at
> the same time passing "%1" on to the parser (which is not a valid
> printf format, so -Wformat would have flagged it).  In fact, "%d1"
> which is valid for printf as "%d" followed by a literal "1" gets
> thrown to the parser as one chunk - but it is not valid in JSON to
> have %d adjacent to another integer any more than it is to reject
> "%d1" as an unknown sequence.  I don't think that catering to the
> remaining printf metacharacters (' ', '#', '\'', '-', '*', '+') nor
> demanding that things end on a letter is worth the effort, since it
> just makes the lexer more complicated when your goal was to do as
> little as possible to get things thrown over to the parser.

Yes.

The goal is to catch all mistakes safely, and as many as practical at
compile time.

Invalid interpolation specifications get rejected at run time, by
parse_interpolation().  We can't check them at compile time.

We can't check that the variadic arguments match the interpolation
specifications at run time.  We can enlist gcc to check at compile time.
The only remaining hole is '%' in strings.  I intend to plug it.

> So even though your lexing is now somewhat different from printf, I
> don't see that as a serious drawback (the uses we care about are still
> -Wformat clean, and the uses we don't like are properly handled in the
> parser).

Exactly.

> Perhaps you want to enhance the testsuite (earlier in the series) to
> cover "%22d", "%d1", "%1" as various unaccepted interpolation requests
> (if you didn't already).

We don't have negative tests so far.

From a white box point of view, one negative test certainly makes sense,
to cover parse_interpolation()'s error path.  Multiple ones not so much;
it's all the same to parse_interpolation().

> Reviewed-by: Eric Blake <eblake@redhat.com>

Thanks!

Patch

diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
index 0ea1eae4aa..7a82aab88b 100644
--- a/qobject/json-lexer.c
+++ b/qobject/json-lexer.c
@@ -93,7 +93,8 @@ 
  *   (apostrophe) instead of %x22 (quotation mark), and can't contain
  *   unescaped apostrophe, but can contain unescaped quotation mark.
  * - Interpolation:
- *   interpolation = %((l|ll|I64)[du]|[ipsf])
+ *   The lexer accepts [A-Za-z0-9]*, and leaves rejecting invalid ones
+ *   to the parser.
  *
  * Note:
  * - Input must be encoded in modified UTF-8.
@@ -116,11 +117,6 @@  enum json_lexer_state {
     IN_NEG_NONZERO_NUMBER,
     IN_KEYWORD,
     IN_INTERPOL,
-    IN_INTERPOL_L,
-    IN_INTERPOL_LL,
-    IN_INTERPOL_I,
-    IN_INTERPOL_I6,
-    IN_INTERPOL_I64,
     IN_WHITESPACE,
     IN_START_INTERPOL,
     IN_START,
@@ -222,42 +218,6 @@  static const uint8_t json_lexer[][256] =  {
         ['\n'] = IN_WHITESPACE,
     },
 
-    /* interpolation */
-    [IN_INTERPOL_LL] = {
-        ['d'] = JSON_INTERPOL,
-        ['u'] = JSON_INTERPOL,
-    },
-
-    [IN_INTERPOL_L] = {
-        ['d'] = JSON_INTERPOL,
-        ['l'] = IN_INTERPOL_LL,
-        ['u'] = JSON_INTERPOL,
-    },
-
-    [IN_INTERPOL_I64] = {
-        ['d'] = JSON_INTERPOL,
-        ['u'] = JSON_INTERPOL,
-    },
-
-    [IN_INTERPOL_I6] = {
-        ['4'] = IN_INTERPOL_I64,
-    },
-
-    [IN_INTERPOL_I] = {
-        ['6'] = IN_INTERPOL_I6,
-    },
-
-    [IN_INTERPOL] = {
-        ['d'] = JSON_INTERPOL,
-        ['i'] = JSON_INTERPOL,
-        ['p'] = JSON_INTERPOL,
-        ['s'] = JSON_INTERPOL,
-        ['u'] = JSON_INTERPOL,
-        ['f'] = JSON_INTERPOL,
-        ['l'] = IN_INTERPOL_L,
-        ['I'] = IN_INTERPOL_I,
-    },
-
     /* top level rule */
     [IN_START] = {
         ['"'] = IN_DQ_STRING,
@@ -278,6 +238,14 @@  static const uint8_t json_lexer[][256] =  {
         ['\n'] = IN_WHITESPACE,
     },
 
+    /* interpolation */
+    [IN_INTERPOL] = {
+        TERMINAL(JSON_INTERPOL),
+        ['A' ... 'Z'] = IN_INTERPOL,
+        ['a' ... 'z'] = IN_INTERPOL,
+        ['0' ... '9'] = IN_INTERPOL,
+    },
+
     [IN_START_INTERPOL] = {
         ['"'] = IN_DQ_STRING,
         ['\''] = IN_SQ_STRING,
diff --git a/qobject/json-parser.c b/qobject/json-parser.c
index 848d469b2a..bd137399e5 100644
--- a/qobject/json-parser.c
+++ b/qobject/json-parser.c
@@ -448,6 +448,7 @@  static QObject *parse_interpolation(JSONParserContext *ctxt, va_list *ap)
     } else if (!strcmp(token->str, "%f")) {
         return QOBJECT(qnum_from_double(va_arg(*ap, double)));
     }
+    parse_error(ctxt, token, "invalid interpolation '%s'", token->str);
     return NULL;
 }