Message ID: 20180817150559.16243-42-armbru@redhat.com
State: New
Series: json: Fixes, error reporting improvements, cleanups
On 08/17/2018 10:05 AM, Markus Armbruster wrote:
> For input 0123, the lexer produces the tokens
>
>     JSON_ERROR 01
>     JSON_INTEGER 23
>
> Reporting an error is correct; 0123 is invalid according to RFC 7159.
> But the error recovery isn't nice.
>
> Make the finite state machine eat digits before going into the error
> state.  The lexer now produces
>
>     JSON_ERROR 0123
>
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Did you also want to reject invalid attempts at hex numbers, by adding
[xXa-fA-F] to the set of characters eaten by IN_BAD_ZERO?

> +    [IN_BAD_ZERO] = {
> +        ['0' ... '9'] = IN_BAD_ZERO,
> +    },
> +
Eric Blake <eblake@redhat.com> writes:

> On 08/17/2018 10:05 AM, Markus Armbruster wrote:
>> For input 0123, the lexer produces the tokens
>>
>>     JSON_ERROR 01
>>     JSON_INTEGER 23
>>
>> Reporting an error is correct; 0123 is invalid according to RFC 7159.
>> But the error recovery isn't nice.
>>
>> Make the finite state machine eat digits before going into the error
>> state.  The lexer now produces
>>
>>     JSON_ERROR 0123
>>
>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> Reviewed-by: Eric Blake <eblake@redhat.com>
>
> Did you also want to reject invalid attempts at hex numbers, by adding
> [xXa-fA-F] to the set of characters eaten by IN_BAD_ZERO?

I put one foot on a slippery slope with this patch...

In review of v1, we discussed whether to try matching non-integer
numbers with a redundant leading zero.  Doing that tightly in the lexer
requires duplicating six states.  A simpler alternative is to have the
lexer eat "digit salad" after a redundant leading zero: 0[0-9.eE+-]+.
Your suggestion for hexadecimal numbers is digit salad with different
digits: [0-9a-fA-FxX].  Another option is their union: [0-9a-fA-FxX+-].

Even more radical would be eating anything but whitespace and
structural characters: [^][}{:, \t\n\r].  That idea pushed to the limit
results in a two-stage lexer: the first stage finds token strings,
where a token string is a structural character or a sequence of
non-structural, non-whitespace characters; the second stage rejects
invalid token strings.

Hmm, we could try to recover from lexical errors more smartly in
general: instead of ending the JSON error token after the first
offending character, end it before the first whitespace or structural
character following the offending character.

I can try that, but I'd prefer to try it in a follow-up patch.

>> +    [IN_BAD_ZERO] = {
>> +        ['0' ... '9'] = IN_BAD_ZERO,
>> +    },
>> +
On 08/20/2018 06:39 AM, Markus Armbruster wrote:
> In review of v1, we discussed whether to try matching non-integer
> numbers with redundant leading zero.  Doing that tightly in the lexer
> requires duplicating six states.  A simpler alternative is to have the
> lexer eat "digit salad" after redundant leading zero: 0[0-9.eE+-]+.
> Your suggestion for hexadecimal numbers is digit salad with different
> digits: [0-9a-fA-FxX].  Another option is their union: [0-9a-fA-FxX+-].
> Even more radical would be eating anything but whitespace and structural
> characters: [^][}{:, \t\n\r].  That idea pushed to the limit results in
> a two-stage lexer: first stage finds token strings, where a token string
> is a structural character or a sequence of non-structural,
> non-whitespace characters, second stage rejects invalid token strings.
>
> Hmm, we could try to recover from lexical errors more smartly in
> general: instead of ending the JSON error token after the first
> offending character, end it before the first whitespace or structural
> character following the offending character.
>
> I can try that, but I'd prefer to try it in a follow-up patch.

Indeed, that sounds like a valid approach.  So, for this patch, I'm
fine with just accepting ['0' ... '9'], then seeing if the later
smarter-lexing change makes back-to-back non-structural tokens give
saner error messages in general.
Eric Blake <eblake@redhat.com> writes:

> On 08/20/2018 06:39 AM, Markus Armbruster wrote:
>
>> In review of v1, we discussed whether to try matching non-integer
>> numbers with redundant leading zero.  Doing that tightly in the lexer
>> requires duplicating six states.  A simpler alternative is to have the
>> lexer eat "digit salad" after redundant leading zero: 0[0-9.eE+-]+.
>> Your suggestion for hexadecimal numbers is digit salad with different
>> digits: [0-9a-fA-FxX].  Another option is their union: [0-9a-fA-FxX+-].
>> Even more radical would be eating anything but whitespace and structural
>> characters: [^][}{:, \t\n\r].  That idea pushed to the limit results in
>> a two-stage lexer: first stage finds token strings, where a token string
>> is a structural character or a sequence of non-structural,
>> non-whitespace characters, second stage rejects invalid token strings.
>>
>> Hmm, we could try to recover from lexical errors more smartly in
>> general: instead of ending the JSON error token after the first
>> offending character, end it before the first whitespace or structural
>> character following the offending character.
>>
>> I can try that, but I'd prefer to try it in a follow-up patch.
>
> Indeed, that sounds like a valid approach.  So, for this patch, I'm
> fine with just accepting ['0' ... '9'], then seeing if the later
> smarter-lexing change makes back-to-back non-structural tokens give
> saner error messages in general.

I think I'll drop this patch for now.  It's not useful enough to apply
it now, then revert it when we have the more general error recovery
improvement.
diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
index ab2453a1e1..4028f39f28 100644
--- a/qobject/json-lexer.c
+++ b/qobject/json-lexer.c
@@ -108,6 +108,7 @@ enum json_lexer_state {
     IN_SQ_STRING_ESCAPE,
     IN_SQ_STRING,
     IN_ZERO,
+    IN_BAD_ZERO,
     IN_DIGITS,
     IN_DIGIT,
     IN_EXP_E,
@@ -159,10 +160,14 @@ static const uint8_t json_lexer[][256] = {
     /* Zero */
     [IN_ZERO] = {
         TERMINAL(JSON_INTEGER),
-        ['0' ... '9'] = IN_ERROR,
+        ['0' ... '9'] = IN_BAD_ZERO,
         ['.'] = IN_MANTISSA,
     },
 
+    [IN_BAD_ZERO] = {
+        ['0' ... '9'] = IN_BAD_ZERO,
+    },
+
     /* Float */
     [IN_DIGITS] = {
         TERMINAL(JSON_FLOAT),