
[v2,41/60] json: Nicer recovery from invalid leading zero

Message ID 20180817150559.16243-42-armbru@redhat.com
State New
Series json: Fixes, error reporting improvements, cleanups

Commit Message

Markus Armbruster Aug. 17, 2018, 3:05 p.m. UTC
For input 0123, the lexer produces the tokens

    JSON_ERROR    01
    JSON_INTEGER  23

Reporting an error is correct; 0123 is invalid according to RFC 7159.
But the error recovery isn't nice.
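For reference, RFC 7159 permits a leading zero only when it is the
entire integer part:

    int = zero / ( digit1-9 *DIGIT )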

Make the finite state machine eat digits before going into the error
state.  The lexer now produces

    JSON_ERROR    0123

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 qobject/json-lexer.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Eric Blake Aug. 17, 2018, 4:03 p.m. UTC | #1
On 08/17/2018 10:05 AM, Markus Armbruster wrote:
> For input 0123, the lexer produces the tokens
> 
>      JSON_ERROR    01
>      JSON_INTEGER  23
> 
> Reporting an error is correct; 0123 is invalid according to RFC 7159.
> But the error recovery isn't nice.
> 
> Make the finite state machine eat digits before going into the error
> state.  The lexer now produces
> 
>      JSON_ERROR    0123
> 
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Did you also want to reject invalid attempts at hex numbers, by adding 
[xXa-fA-F] to the set of characters eaten by IN_BAD_ZERO?
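
(With the patch as posted, input 0x123 would presumably still lex as
three tokens

    JSON_INTEGER  0
    JSON_ERROR    x
    JSON_INTEGER  123

because IN_ZERO terminates as JSON_INTEGER on 'x'.)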

>   
> +    [IN_BAD_ZERO] = {
> +        ['0' ... '9'] = IN_BAD_ZERO,
> +    },
> +
Markus Armbruster Aug. 20, 2018, 11:39 a.m. UTC | #2
Eric Blake <eblake@redhat.com> writes:

> On 08/17/2018 10:05 AM, Markus Armbruster wrote:
>> For input 0123, the lexer produces the tokens
>>
>>      JSON_ERROR    01
>>      JSON_INTEGER  23
>>
>> Reporting an error is correct; 0123 is invalid according to RFC 7159.
>> But the error recovery isn't nice.
>>
>> Make the finite state machine eat digits before going into the error
>> state.  The lexer now produces
>>
>>      JSON_ERROR    0123
>>
>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> Reviewed-by: Eric Blake <eblake@redhat.com>
>
> Did you also want to reject invalid attempts at hex numbers, by adding
> [xXa-fA-F] to the set of characters eaten by IN_BAD_ZERO?

I put one foot on a slippery slope with this patch...

In review of v1, we discussed whether to try matching non-integer
numbers with redundant leading zero.  Doing that tightly in the lexer
requires duplicating six states.  A simpler alternative is to have the
lexer eat "digit salad" after redundant leading zero: 0[0-9.eE+-]+.
Your suggestion for hexadecimal numbers is digit salad with different
digits: [0-9a-fA-FxX].  Another option is their union: [0-9a-fA-FxX.+-].
Even more radical would be eating anything but whitespace and structural
characters: [^][}{:, \t\n\r].  That idea pushed to the limit results in
a two-stage lexer: first stage finds token strings, where a token string
is a structural character or a sequence of non-structural,
non-whitespace characters, second stage rejects invalid token strings.
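
In the table style of this patch, the union variant would be a sketch
along these lines (illustrative only, not an applied change):

    [IN_BAD_ZERO] = {
        ['0' ... '9'] = IN_BAD_ZERO,
        ['a' ... 'f'] = IN_BAD_ZERO,   /* covers 'e' */
        ['A' ... 'F'] = IN_BAD_ZERO,   /* covers 'E' */
        ['x'] = IN_BAD_ZERO,
        ['X'] = IN_BAD_ZERO,
        ['.'] = IN_BAD_ZERO,
        ['+'] = IN_BAD_ZERO,
        ['-'] = IN_BAD_ZERO,
    },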

Hmm, we could try to recover from lexical errors more smartly in
general: instead of ending the JSON error token after the first
offending character, end it before the first whitespace or structural
character following the offending character.
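
As a rough standalone sketch of that rule (hypothetical helper, not
how the table-driven lexer would actually express it):

    #include <string.h>

    /*
     * Hypothetical: given a pointer to the first offending character,
     * return where the JSON_ERROR token would end under the smarter
     * rule, i.e. just before the next whitespace or structural
     * character.
     */
    static const char *error_token_end(const char *p)
    {
        while (*p && !strchr("[]{}:, \t\n\r", *p)) {
            p++;
        }
        return p;
    }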

I can try that, but I'd prefer to try it in a follow-up patch.

>>
>> +    [IN_BAD_ZERO] = {
>> +        ['0' ... '9'] = IN_BAD_ZERO,
>> +    },
>> +
Eric Blake Aug. 20, 2018, 6:36 p.m. UTC | #3
On 08/20/2018 06:39 AM, Markus Armbruster wrote:

> In review of v1, we discussed whether to try matching non-integer
> numbers with redundant leading zero.  Doing that tightly in the lexer
> requires duplicating six states.  A simpler alternative is to have the
> lexer eat "digit salad" after redundant leading zero: 0[0-9.eE+-]+.
> Your suggestion for hexadecimal numbers is digit salad with different
> digits: [0-9a-fA-FxX].  Another option is their union: [0-9a-fA-FxX.+-].
> Even more radical would be eating anything but whitespace and structural
> characters: [^][}{:, \t\n\r].  That idea pushed to the limit results in
> a two-stage lexer: first stage finds token strings, where a token string
> is a structural character or a sequence of non-structural,
> non-whitespace characters, second stage rejects invalid token strings.
> 
> Hmm, we could try to recover from lexical errors more smartly in
> general: instead of ending the JSON error token after the first
> offending character, end it before the first whitespace or structural
> character following the offending character.
> 
> I can try that, but I'd prefer to try it in a follow-up patch.

Indeed, that sounds like a valid approach. So, for this patch, I'm fine 
with just accepting ['0' ... '9'], then seeing if the later 
smarter-lexing change makes back-to-back non-structural tokens give 
saner error messages in general.
Markus Armbruster Aug. 21, 2018, 5:10 a.m. UTC | #4
Eric Blake <eblake@redhat.com> writes:

> On 08/20/2018 06:39 AM, Markus Armbruster wrote:
>
>> In review of v1, we discussed whether to try matching non-integer
>> numbers with redundant leading zero.  Doing that tightly in the lexer
>> requires duplicating six states.  A simpler alternative is to have the
>> lexer eat "digit salad" after redundant leading zero: 0[0-9.eE+-]+.
>> Your suggestion for hexadecimal numbers is digit salad with different
>> digits: [0-9a-fA-FxX].  Another option is their union: [0-9a-fA-FxX.+-].
>> Even more radical would be eating anything but whitespace and structural
>> characters: [^][}{:, \t\n\r].  That idea pushed to the limit results in
>> a two-stage lexer: first stage finds token strings, where a token string
>> is a structural character or a sequence of non-structural,
>> non-whitespace characters, second stage rejects invalid token strings.
>>
>> Hmm, we could try to recover from lexical errors more smartly in
>> general: instead of ending the JSON error token after the first
>> offending character, end it before the first whitespace or structural
>> character following the offending character.
>>
>> I can try that, but I'd prefer to try it in a follow-up patch.
>
> Indeed, that sounds like a valid approach. So, for this patch, I'm
> fine with just accepting ['0' ... '9'], then seeing if the later
> smarter-lexing change makes back-to-back non-structural tokens give
> saner error messages in general.

I think I'll drop this patch for now.  It's not useful enough to
justify applying it now only to revert it again once the more general
error recovery improvement lands.

Patch

diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
index ab2453a1e1..4028f39f28 100644
--- a/qobject/json-lexer.c
+++ b/qobject/json-lexer.c
@@ -108,6 +108,7 @@ enum json_lexer_state {
     IN_SQ_STRING_ESCAPE,
     IN_SQ_STRING,
     IN_ZERO,
+    IN_BAD_ZERO,
     IN_DIGITS,
     IN_DIGIT,
     IN_EXP_E,
@@ -159,10 +160,14 @@ static const uint8_t json_lexer[][256] =  {
     /* Zero */
     [IN_ZERO] = {
         TERMINAL(JSON_INTEGER),
-        ['0' ... '9'] = IN_ERROR,
+        ['0' ... '9'] = IN_BAD_ZERO,
         ['.'] = IN_MANTISSA,
     },
 
+    [IN_BAD_ZERO] = {
+        ['0' ... '9'] = IN_BAD_ZERO,
+    },
+
     /* Float */
     [IN_DIGITS] = {
         TERMINAL(JSON_FLOAT),
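
To read the table: each row maps the next input byte to a successor
state; a TERMINAL(tok) entry means that byte ends the current token,
emitted as tok; and bytes with no explicit entry take the lexer's
error path, which is how IN_BAD_ZERO finally yields JSON_ERROR.  For
the commit message's example input 0123, the patched table gives

    '0'  -> IN_ZERO
    '1'  -> IN_BAD_ZERO
    '2'  -> IN_BAD_ZERO
    '3'  -> IN_BAD_ZERO
    end  -> JSON_ERROR  0123

so the whole malformed number is reported as a single error token.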