
[41/56] json: Nicer recovery from invalid leading zero

Message ID 20180808120334.10970-42-armbru@redhat.com
State New
Series json: Fixes, error reporting improvements, cleanups

Commit Message

Markus Armbruster Aug. 8, 2018, 12:03 p.m. UTC
For input 0123, the lexer produces the tokens

    JSON_ERROR    01
    JSON_INTEGER  23

Reporting an error is correct; 0123 is invalid according to RFC 7159.
But the error recovery isn't nice.

Make the finite state machine eat digits before going into the error
state.  The lexer now produces

    JSON_ERROR    0123

Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 qobject/json-lexer.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Eric Blake Aug. 13, 2018, 4:33 p.m. UTC | #1
On 08/08/2018 07:03 AM, Markus Armbruster wrote:
> For input 0123, the lexer produces the tokens
> 
>      JSON_ERROR    01
>      JSON_INTEGER  23
> 
> Reporting an error is correct; 0123 is invalid according to RFC 7159.
> But the error recovery isn't nice.
> 
> Make the finite state machine eat digits before going into the error
> state.  The lexer now produces
> 
>      JSON_ERROR    0123
> 
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> ---
>   qobject/json-lexer.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
> 

> @@ -158,10 +159,14 @@ static const uint8_t json_lexer[][256] =  {
>       /* Zero */
>       [IN_ZERO] = {
>           TERMINAL(JSON_INTEGER),
> -        ['0' ... '9'] = IN_ERROR,
> +        ['0' ... '9'] = IN_BAD_ZERO,
>           ['.'] = IN_MANTISSA,
>       },
>   
> +    [IN_BAD_ZERO] = {
> +        ['0' ... '9'] = IN_BAD_ZERO,
> +    },
> +

Should IN_BAD_ZERO also consume '.' and/or 'e' (after all, '01e2' is a 
valid C constant, but not a valid JSON literal)?  But I think your 
choice here is fine (again, add too much, and then the lexer has to 
track a lot of state; whereas this minimal addition catches the most 
obvious things with little effort).

Reviewed-by: Eric Blake <eblake@redhat.com>
Markus Armbruster Aug. 14, 2018, 8:24 a.m. UTC | #2
Eric Blake <eblake@redhat.com> writes:

> On 08/08/2018 07:03 AM, Markus Armbruster wrote:
>> For input 0123, the lexer produces the tokens
>>
>>      JSON_ERROR    01
>>      JSON_INTEGER  23
>>
>> Reporting an error is correct; 0123 is invalid according to RFC 7159.
>> But the error recovery isn't nice.
>>
>> Make the finite state machine eat digits before going into the error
>> state.  The lexer now produces
>>
>>      JSON_ERROR    0123
>>
>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> ---
>>   qobject/json-lexer.c | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>
>> @@ -158,10 +159,14 @@ static const uint8_t json_lexer[][256] =  {
>>       /* Zero */
>>       [IN_ZERO] = {
>>           TERMINAL(JSON_INTEGER),
>> -        ['0' ... '9'] = IN_ERROR,
>> +        ['0' ... '9'] = IN_BAD_ZERO,
>>           ['.'] = IN_MANTISSA,
>>       },
>>   +    [IN_BAD_ZERO] = {
>> +        ['0' ... '9'] = IN_BAD_ZERO,
>> +    },
>> +
>
> Should IN_BAD_ZERO also consume '.' and/or 'e' (after all, '01e2' is a
> valid C constant, but not a valid JSON literal)?  But I think your
> choice here is fine (again, add too much, and then the lexer has to
> track a lot of state; whereas this minimal addition catches the most
> obvious things with little effort).

My patch is of marginal value to begin with.  It improves error recovery
only for the "integer with redundant leading zero" case.  I guess that's
more common than "floating-point with redundant leading zero".

An obvious way to extend it to "number with redundant leading zero"
would be cloning the lexer states related to numbers.  Clean, but six
more states.  Meh.

Another way is to dumb down the lexer not to care about leading zero,
and catch it in parse_literal() instead.  Basically duplicates the lexer
state machine up to where leading zero is recognized.  Not much code,
but meh.

Yet another way is to have the lexer eat "digit salad" after redundant
leading zero:

    [IN_BAD_ZERO] = {
        ['0' ... '9'] = IN_BAD_ZERO,
        ['.'] = IN_BAD_ZERO,
        ['e'] = IN_BAD_ZERO,
        ['-'] = IN_BAD_ZERO,
        ['+'] = IN_BAD_ZERO,
    },

Eats even crap like 01e...  But then the same crap without leading zero
should also be eaten.  We'd have to add state transitions to IN_BAD_ZERO
to the six states related to numbers.  Perhaps clever use of gcc's range
initialization lets us do this compactly.  Otherwise, meh again.

Opinions?

I'd also be fine with dropping this patch.

> Reviewed-by: Eric Blake <eblake@redhat.com>

Thanks!
Eric Blake Aug. 14, 2018, 1:14 p.m. UTC | #3
On 08/14/2018 03:24 AM, Markus Armbruster wrote:
>> Should IN_BAD_ZERO also consume '.' and/or 'e' (after all, '01e2 is a
>> valid C constant, but not a valid JSON literal)?  But I think your
>> choice here is fine (again, add too much, and then the lexer has to
>> track a lot of state; whereas this minimal addition catches the most
>> obvious things with little effort).
> 
> My patch is of marginal value to begin with.  It improves error recovery
> only for the "integer with redundant leading zero" case.  I guess that's
> more common than "floating-point with redundant leading zero".

Yes, that was my thought.

> 
> An obvious way to extend it to "number with redundant leading zero"
> would be cloning the lexer states related to numbers.  Clean, but six
> more states.  Meh.
> 
> Another way is to dumb down the lexer not to care about leading zero,
> and catch it in parse_literal() instead.  Basically duplicates the lexer
> state machine up to where leading zero is recognized.  Not much code,
> but meh.
> 
> Yet another way is to have the lexer eat "digit salad" after redundant
> leading zero:
> 
>      [IN_BAD_ZERO] = {
>          ['0' ... '9'] = IN_BAD_ZERO,
>          ['.'] = IN_BAD_ZERO,
>          ['e'] = IN_BAD_ZERO,
>          ['-'] = IN_BAD_ZERO,
>          ['+'] = IN_BAD_ZERO,
>      },
> 
> Eats even crap like 01e...  But then the same crap without leading zero
> should also be eaten.  We'd have to add state transitions to IN_BAD_ZERO
> to the six states related to numbers.  Perhaps clever use of gcc's range
> initialization lets us do this compactly.  Otherwise, meh again.
> 
> Opinions?

Of the various options, the patch as written seems to be the most 
minimal. Maybe tweak the commit message to call out other cases that we 
discussed as not worth doing, because the likelihood of such input is 
dramatically less.

About the only other improvement that might be worth making is adding:

[IN_BAD_ZERO] = {
   ['x'] = IN_BAD_ZERO,
   ['X'] = IN_BAD_ZERO,
   ['a' ... 'f'] = IN_BAD_ZERO,
   ['A' ... 'F'] = IN_BAD_ZERO,
}

to catch people typing JSON by hand and thinking that a hex constant 
will work.  That way, '0x1a' gets parsed as a single JSON_ERROR token, 
rather than four separate tokens of JSON_INTEGER, JSON_KEYWORD, 
JSON_INTEGER, JSON_KEYWORD (where you get the semantic error due to 
JSON_KEYWORD unexpected).

> 
> I'd also be fine with dropping this patch.

No, I like it, especially if you also like my suggestion to make error 
reporting on hex numbers resemble error reporting on octal numbers.

> 
>> Reviewed-by: Eric Blake <eblake@redhat.com>
> 
> Thanks!
>

Patch

diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
index 7a82aab88b..f600cc732e 100644
--- a/qobject/json-lexer.c
+++ b/qobject/json-lexer.c
@@ -108,6 +108,7 @@  enum json_lexer_state {
     IN_SQ_STRING_ESCAPE,
     IN_SQ_STRING,
     IN_ZERO,
+    IN_BAD_ZERO,
     IN_DIGITS,
     IN_DIGIT,
     IN_EXP_E,
@@ -158,10 +159,14 @@  static const uint8_t json_lexer[][256] =  {
     /* Zero */
     [IN_ZERO] = {
         TERMINAL(JSON_INTEGER),
-        ['0' ... '9'] = IN_ERROR,
+        ['0' ... '9'] = IN_BAD_ZERO,
         ['.'] = IN_MANTISSA,
     },
 
+    [IN_BAD_ZERO] = {
+        ['0' ... '9'] = IN_BAD_ZERO,
+    },
+
     /* Float */
     [IN_DIGITS] = {
         TERMINAL(JSON_FLOAT),