
[v2,41/60] json: Nicer recovery from invalid leading zero

Message ID 20180817150559.16243-42-armbru@redhat.com
State New
Series json: Fixes, error reporting improvements, cleanups

Commit Message

Markus Armbruster Aug. 17, 2018, 3:05 p.m. UTC
For input 0123, the lexer produces the tokens

    JSON_ERROR    01
    JSON_INTEGER  23

Reporting an error is correct; 0123 is invalid according to RFC 7159.
But the error recovery isn't nice.
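For reference, RFC 7159 permits a leading zero only when it is the
entire integer part:

    int = zero / ( digit1-9 *DIGIT )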

Make the finite state machine eat digits before going into the error
state.  The lexer now produces

    JSON_ERROR    0123

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 qobject/json-lexer.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Eric Blake Aug. 17, 2018, 4:03 p.m. UTC | #1
On 08/17/2018 10:05 AM, Markus Armbruster wrote:
> For input 0123, the lexer produces the tokens
> 
>      JSON_ERROR    01
>      JSON_INTEGER  23
> 
> Reporting an error is correct; 0123 is invalid according to RFC 7159.
> But the error recovery isn't nice.
> 
> Make the finite state machine eat digits before going into the error
> state.  The lexer now produces
> 
>      JSON_ERROR    0123
> 
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Did you also want to reject invalid attempts at hex numbers, by adding 
[xXa-fA-F] to the set of characters eaten by IN_BAD_ZERO?
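
(With the patch as posted, input 0x123 would presumably still lex as
three tokens

    JSON_INTEGER  0
    JSON_ERROR    x
    JSON_INTEGER  123

because IN_ZERO terminates as JSON_INTEGER on 'x'.)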

>   
> +    [IN_BAD_ZERO] = {
> +        ['0' ... '9'] = IN_BAD_ZERO,
> +    },
> +
Markus Armbruster Aug. 20, 2018, 11:39 a.m. UTC | #2
Eric Blake <eblake@redhat.com> writes:

> On 08/17/2018 10:05 AM, Markus Armbruster wrote:
>> For input 0123, the lexer produces the tokens
>>
>>      JSON_ERROR    01
>>      JSON_INTEGER  23
>>
>> Reporting an error is correct; 0123 is invalid according to RFC 7159.
>> But the error recovery isn't nice.
>>
>> Make the finite state machine eat digits before going into the error
>> state.  The lexer now produces
>>
>>      JSON_ERROR    0123
>>
>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> Reviewed-by: Eric Blake <eblake@redhat.com>
>
> Did you also want to reject invalid attempts at hex numbers, by adding
> [xXa-fA-F] to the set of characters eaten by IN_BAD_ZERO?

I put one foot on a slippery slope with this patch...

In review of v1, we discussed whether to try matching non-integer
numbers with redundant leading zero.  Doing that tightly in the lexer
requires duplicating six states.  A simpler alternative is to have the
lexer eat "digit salad" after redundant leading zero: 0[0-9.eE+-]+.
Your suggestion for hexadecimal numbers is digit salad with different
digits: [0-9a-fA-FxX].  Another option is their union: [0-9a-fA-FxX.+-].
Even more radical would be eating anything but whitespace and structural
characters: [^][}{:, \t\n\r].  That idea pushed to the limit results in
a two-stage lexer: first stage finds token strings, where a token string
is a structural character or a sequence of non-structural,
non-whitespace characters, second stage rejects invalid token strings.
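
In the table style of this patch, the union variant would be a sketch
along these lines (illustrative only, not an applied change):

    [IN_BAD_ZERO] = {
        ['0' ... '9'] = IN_BAD_ZERO,
        ['a' ... 'f'] = IN_BAD_ZERO,   /* covers 'e' */
        ['A' ... 'F'] = IN_BAD_ZERO,   /* covers 'E' */
        ['x'] = IN_BAD_ZERO,
        ['X'] = IN_BAD_ZERO,
        ['.'] = IN_BAD_ZERO,
        ['+'] = IN_BAD_ZERO,
        ['-'] = IN_BAD_ZERO,
    },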

Hmm, we could try to recover from lexical errors more smartly in
general: instead of ending the JSON error token after the first
offending character, end it before the first whitespace or structural
character following the offending character.
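
As a rough standalone sketch of that rule (hypothetical helper, not
how the table-driven lexer would actually express it):

    #include <string.h>

    /*
     * Hypothetical: given a pointer to the first offending character,
     * return where the JSON_ERROR token would end under the smarter
     * rule, i.e. just before the next whitespace or structural
     * character.
     */
    static const char *error_token_end(const char *p)
    {
        while (*p && !strchr("[]{}:, \t\n\r", *p)) {
            p++;
        }
        return p;
    }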

I can try that, but I'd prefer to try it in a follow-up patch.

>>
>> +    [IN_BAD_ZERO] = {
>> +        ['0' ... '9'] = IN_BAD_ZERO,
>> +    },
>> +
Eric Blake Aug. 20, 2018, 6:36 p.m. UTC | #3
On 08/20/2018 06:39 AM, Markus Armbruster wrote:

> In review of v1, we discussed whether to try matching non-integer
> numbers with redundant leading zero.  Doing that tightly in the lexer
> requires duplicating six states.  A simpler alternative is to have the
> lexer eat "digit salad" after redundant leading zero: 0[0-9.eE+-]+.
> Your suggestion for hexadecimal numbers is digit salad with different
> digits: [0-9a-fA-FxX].  Another option is their union: [0-9a-fA-FxX.+-].
> Even more radical would be eating anything but whitespace and structural
> characters: [^][}{:, \t\n\r].  That idea pushed to the limit results in
> a two-stage lexer: first stage finds token strings, where a token string
> is a structural character or a sequence of non-structural,
> non-whitespace characters, second stage rejects invalid token strings.
> 
> Hmm, we could try to recover from lexical errors more smartly in
> general: instead of ending the JSON error token after the first
> offending character, end it before the first whitespace or structural
> character following the offending character.
> 
> I can try that, but I'd prefer to try it in a follow-up patch.

Indeed, that sounds like a valid approach. So, for this patch, I'm fine 
with just accepting ['0' ... '9'], then seeing if the later 
smarter-lexing change makes back-to-back non-structural tokens give 
saner error messages in general.
Markus Armbruster Aug. 21, 2018, 5:10 a.m. UTC | #4
Eric Blake <eblake@redhat.com> writes:

> On 08/20/2018 06:39 AM, Markus Armbruster wrote:
>
>> In review of v1, we discussed whether to try matching non-integer
>> numbers with redundant leading zero.  Doing that tightly in the lexer
>> requires duplicating six states.  A simpler alternative is to have the
>> lexer eat "digit salad" after redundant leading zero: 0[0-9.eE+-]+.
>> Your suggestion for hexadecimal numbers is digit salad with different
>> digits: [0-9a-fA-FxX].  Another option is their union: [0-9a-fA-FxX.+-].
>> Even more radical would be eating anything but whitespace and structural
>> characters: [^][}{:, \t\n\r].  That idea pushed to the limit results in
>> a two-stage lexer: first stage finds token strings, where a token string
>> is a structural character or a sequence of non-structural,
>> non-whitespace characters, second stage rejects invalid token strings.
>>
>> Hmm, we could try to recover from lexical errors more smartly in
>> general: instead of ending the JSON error token after the first
>> offending character, end it before the first whitespace or structural
>> character following the offending character.
>>
>> I can try that, but I'd prefer to try it in a follow-up patch.
>
> Indeed, that sounds like a valid approach. So, for this patch, I'm
> fine with just accepting ['0' ... '9'], then seeing if the later
> smarter-lexing change makes back-to-back non-structural tokens give
> saner error messages in general.

I think I'll drop this patch for now.  It's not useful enough to
justify applying it now only to revert it again once the more general
error recovery improvement lands.

Patch

diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
index ab2453a1e1..4028f39f28 100644
--- a/qobject/json-lexer.c
+++ b/qobject/json-lexer.c
@@ -108,6 +108,7 @@ enum json_lexer_state {
     IN_SQ_STRING_ESCAPE,
     IN_SQ_STRING,
     IN_ZERO,
+    IN_BAD_ZERO,
     IN_DIGITS,
     IN_DIGIT,
     IN_EXP_E,
@@ -159,10 +160,14 @@ static const uint8_t json_lexer[][256] =  {
     /* Zero */
     [IN_ZERO] = {
         TERMINAL(JSON_INTEGER),
-        ['0' ... '9'] = IN_ERROR,
+        ['0' ... '9'] = IN_BAD_ZERO,
         ['.'] = IN_MANTISSA,
     },
 
+    [IN_BAD_ZERO] = {
+        ['0' ... '9'] = IN_BAD_ZERO,
+    },
+
     /* Float */
     [IN_DIGITS] = {
         TERMINAL(JSON_FLOAT),
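
To read the table: each row maps the next input byte to a successor
state; a TERMINAL(tok) entry means that byte ends the current token,
emitted as tok; and bytes with no explicit entry take the lexer's
error path, which is how IN_BAD_ZERO finally yields JSON_ERROR.  For
the commit message's example input 0123, the patched table gives

    '0'  -> IN_ZERO
    '1'  -> IN_BAD_ZERO
    '2'  -> IN_BAD_ZERO
    '3'  -> IN_BAD_ZERO
    end  -> JSON_ERROR  0123

so the whole malformed number is reported as a single error token.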