[18/56] json: Revamp lexer documentation

Message ID 20180808120334.10970-19-armbru@redhat.com
State New
Series json: Fixes, error reporting improvements, cleanups

Commit Message

Markus Armbruster Aug. 8, 2018, 12:02 p.m. UTC
Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 qobject/json-lexer.c | 80 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 71 insertions(+), 9 deletions(-)

Comments

Eric Blake Aug. 9, 2018, 6:49 p.m. UTC | #1
On 08/08/2018 07:02 AM, Markus Armbruster wrote:
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> ---
>   qobject/json-lexer.c | 80 +++++++++++++++++++++++++++++++++++++++-----
>   1 file changed, 71 insertions(+), 9 deletions(-)
> 

> + *
> + * [Numbers:]

Worth also calling out:

[Objects:]
       object = begin-object [ member *( value-separator member ) ]
                end-object

       member = string name-separator value
[Arrays:]
    array = begin-array [ value *( value-separator value ) ] end-array

so as to completely cover the RFC grammar?

> + *
> + * Extensions over RFC 7159:
> + * - Extra escape sequence in strings:
> + *   0x27 (apostrophe) is recognized after escape, too
> + * - Single-quoted strings:
> + *   Like double-quoted strings, except they're delimited by %x27
> + *   (apostrophe) instead of %x22 (quotation mark), and can't contain
> + *   unescaped apostrophe, but can contain unescaped quotation mark.
> + * - Interpolation:
> + *   interpolation = %((l|ll|I64)[du]|[ipsf])

Not in your series, but we recently discussed adding %% (only inside 
strings); coupled with enforcing that all other interpolation occurs 
outside of strings.  I guess we can update this comment at that time.

> + *
> + * Note:
> + * - Input must be encoded in UTF-8.
> + * - Decoding and validating is left to the parser.
>    */
>   
>   enum json_lexer_state {
> 

Reviewed-by: Eric Blake <eblake@redhat.com>

Markus Armbruster Aug. 10, 2018, 2:31 p.m. UTC | #2
Eric Blake <eblake@redhat.com> writes:

> On 08/08/2018 07:02 AM, Markus Armbruster wrote:
>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> ---
>>   qobject/json-lexer.c | 80 +++++++++++++++++++++++++++++++++++++++-----
>>   1 file changed, 71 insertions(+), 9 deletions(-)
>>
>
>> + *
>> + * [Numbers:]
>
> Worth also calling out:
>
> [Objects:]
>       object = begin-object [ member *( value-separator member ) ]
>                end-object
>
>       member = string name-separator value
> [Arrays:]
>    array = begin-array [ value *( value-separator value ) ] end-array
>
> so as to completely cover the RFC grammar?

Should this go into json-parser.c?

>> + *
>> + * Extensions over RFC 7159:
>> + * - Extra escape sequence in strings:
>> + *   0x27 (apostrophe) is recognized after escape, too
>> + * - Single-quoted strings:
>> + *   Like double-quoted strings, except they're delimited by %x27
>> + *   (apostrophe) instead of %x22 (quotation mark), and can't contain
>> + *   unescaped apostrophe, but can contain unescaped quotation mark.
>> + * - Interpolation:
>> + *   interpolation = %((l|ll|I64)[du]|[ipsf])
>
> Not in your series, but we recently discussed adding %% (only inside
> strings); coupled with enforcing that all other interpolation occurs
> outside of strings.  I guess we can update this comment at that time.

Message-ID: <87bmaoszf0.fsf@dusky.pond.sub.org>
https://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg05844.html

I meant to do that in this series, but got overwhelmed by all the other
stuff, and forgot.  Thanks for the reminder.  I may still do it in v2.
If not, we can do it on top.

>> + *
>> + * Note:
>> + * - Input must be encoded in UTF-8.
>> + * - Decoding and validating is left to the parser.
>>    */
>>     enum json_lexer_state {
>>
>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Thanks!

Eric Blake Aug. 10, 2018, 3:02 p.m. UTC | #3
On 08/10/2018 09:31 AM, Markus Armbruster wrote:

>>> + *
>>> + * [Numbers:]
>>
>> Worth also calling out:
>>
>> [Objects:]
>>        object = begin-object [ member *( value-separator member ) ]
>>                 end-object
>>
>>        member = string name-separator value
>> [Arrays:]
>>     array = begin-array [ value *( value-separator value ) ] end-array
>>
>> so as to completely cover the RFC grammar?
> 
> Should this go into json-parser.c?

Perhaps. After all, the lexer does nothing special for any of those 
constructs; they are where we really have moved into the parser phase.


>>> + * - Interpolation:
>>> + *   interpolation = %((l|ll|I64)[du]|[ipsf])
>>
>> Not in your series, but we recently discussed adding %% (only inside
>> strings); coupled with enforcing that all other interpolation occurs
>> outside of strings.  I guess we can update this comment at that time.
> 
> Message-ID: <87bmaoszf0.fsf@dusky.pond.sub.org>
> https://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg05844.html
> 
> I meant to do that in this series, but got overwhelmed by all the other
> stuff, and forgot.  Thanks for the reminder.  I may still do it in v2.
> If not, we can do it on top.

Here's where I first attempted it, if it helps.

https://lists.nongnu.org/archive/html/qemu-devel/2017-08/msg00603.html

Markus Armbruster Aug. 13, 2018, 6:12 a.m. UTC | #4
Eric Blake <eblake@redhat.com> writes:

> On 08/10/2018 09:31 AM, Markus Armbruster wrote:
>
>>>> + *
>>>> + * [Numbers:]
>>>
>>> Worth also calling out:
>>>
>>> [Objects:]
>>>        object = begin-object [ member *( value-separator member ) ]
>>>                 end-object
>>>
>>>        member = string name-separator value
>>> [Arrays:]
>>>     array = begin-array [ value *( value-separator value ) ] end-array
>>>
>>> so as to completely cover the RFC grammar?
>>
>> Should this go into json-parser.c?
>
> Perhaps. After all, the lexer does nothing special for any of those
> constructs; they are where we really have moved into the parser phase.
>
>
>>>> + * - Interpolation:
>>>> + *   interpolation = %((l|ll|I64)[du]|[ipsf])
>>>
>>> Not in your series, but we recently discussed adding %% (only inside
>>> strings); coupled with enforcing that all other interpolation occurs
>>> outside of strings.  I guess we can update this comment at that time.
>>
>> Message-ID: <87bmaoszf0.fsf@dusky.pond.sub.org>
>> https://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg05844.html
>>
>> I meant to do that in this series, but got overwhelmed by all the other
>> stuff, and forgot.  Thanks for the reminder.  I may still do it in v2.
>> If not, we can do it on top.
>
> Here's where I first attempted it, if it helps.
>
> https://lists.nongnu.org/archive/html/qemu-devel/2017-08/msg00603.html

Thanks.  I'll see what I can steal from it.
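
[Editor's illustration, not part of the thread or of QEMU.] The
interpolation grammar discussed above, %((l|ll|I64)[du]|[ipsf]), can be
sketched as a small standalone matcher. The function name
interpolation_len is hypothetical. The sketch reads the grammar
literally, so a length modifier is required before d or u and a bare
"%d" is rejected; the lexer's actual state machine may behave
differently. The "%%" form mentioned in the review is not covered,
since it had not been added at this point.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not QEMU code: return the length of the
 * interpolation directive at the start of s, or 0 if there is none.
 * Grammar as written in the patch: %((l|ll|I64)[du]|[ipsf])
 */
static size_t interpolation_len(const char *s)
{
    const char *p = s;

    if (*p++ != '%') {
        return 0;
    }
    /* Single-character conversions: %i %p %s %f */
    if (*p && strchr("ipsf", *p)) {
        return 2;
    }
    /* Taking the grammar literally, d and u need l, ll, or I64 first */
    if (strncmp(p, "I64", 3) == 0) {
        p += 3;
    } else if (*p == 'l') {
        p += (p[1] == 'l') ? 2 : 1;
    } else {
        return 0;
    }
    if (*p == 'd' || *p == 'u') {
        return (size_t)(p + 1 - s);
    }
    return 0;
}
```

A caller would pass a pointer to the '%' character and treat a zero
return as "not an interpolation".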

Patch

diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
index e85e9a78ff..109a7d8bb8 100644
--- a/qobject/json-lexer.c
+++ b/qobject/json-lexer.c
@@ -18,21 +18,83 @@ 
 #define MAX_TOKEN_SIZE (64ULL << 20)
 
 /*
- * Required by JSON (RFC 7159):
+ * From RFC 7159 "The JavaScript Object Notation (JSON) Data
+ * Interchange Format", with [comments in brackets]:
  *
- * \"([^\\\"]|\\[\"'\\/bfnrt]|\\u[0-9a-fA-F]{4})*\"
- * -?(0|[1-9][0-9]*)(.[0-9]+)?([eE][-+]?[0-9]+)?
- * [{}\[\],:]
- * [a-z]+   # covers null, true, false
+ * The set of tokens includes six structural characters, strings,
+ * numbers, and three literal names.
  *
- * Extension of '' strings:
+ * These are the six structural characters:
  *
- * '([^\\']|\\[\"'\\/bfnrt]|\\u[0-9a-fA-F]{4})*'
+ *    begin-array     = ws %x5B ws  ; [ left square bracket
+ *    begin-object    = ws %x7B ws  ; { left curly bracket
+ *    end-array       = ws %x5D ws  ; ] right square bracket
+ *    end-object      = ws %x7D ws  ; } right curly bracket
+ *    name-separator  = ws %x3A ws  ; : colon
+ *    value-separator = ws %x2C ws  ; , comma
  *
- * Extension for vararg handling in JSON construction:
+ * Insignificant whitespace is allowed before or after any of the six
+ * structural characters.
+ * [This lexer accepts it before or after any token, which is actually
+ * the same, as the grammar always has structural characters between
+ * other tokens.]
  *
- * %((l|ll|I64)?d|[ipsf])
+ *    ws = *(
+ *           %x20 /              ; Space
+ *           %x09 /              ; Horizontal tab
+ *           %x0A /              ; Line feed or New line
+ *           %x0D )              ; Carriage return
  *
+ * [...] three literal names:
+ *    false null true
+ *  [This lexer accepts [a-z]+, and leaves rejecting unknown literal
+ *  names to the parser.]
+ *
+ * [Numbers:]
+ *
+ *    number = [ minus ] int [ frac ] [ exp ]
+ *    decimal-point = %x2E       ; .
+ *    digit1-9 = %x31-39         ; 1-9
+ *    e = %x65 / %x45            ; e E
+ *    exp = e [ minus / plus ] 1*DIGIT
+ *    frac = decimal-point 1*DIGIT
+ *    int = zero / ( digit1-9 *DIGIT )
+ *    minus = %x2D               ; -
+ *    plus = %x2B                ; +
+ *    zero = %x30                ; 0
+ *
+ * [Strings:]
+ *    string = quotation-mark *char quotation-mark
+ *
+ *    char = unescaped /
+ *        escape (
+ *            %x22 /          ; "    quotation mark  U+0022
+ *            %x5C /          ; \    reverse solidus U+005C
+ *            %x2F /          ; /    solidus         U+002F
+ *            %x62 /          ; b    backspace       U+0008
+ *            %x66 /          ; f    form feed       U+000C
+ *            %x6E /          ; n    line feed       U+000A
+ *            %x72 /          ; r    carriage return U+000D
+ *            %x74 /          ; t    tab             U+0009
+ *            %x75 4HEXDIG )  ; uXXXX                U+XXXX
+ *    escape = %x5C              ; \
+ *    quotation-mark = %x22      ; "
+ *    unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
+ *
+ *
+ * Extensions over RFC 7159:
+ * - Extra escape sequence in strings:
+ *   0x27 (apostrophe) is recognized after escape, too
+ * - Single-quoted strings:
+ *   Like double-quoted strings, except they're delimited by %x27
+ *   (apostrophe) instead of %x22 (quotation mark), and can't contain
+ *   unescaped apostrophe, but can contain unescaped quotation mark.
+ * - Interpolation:
+ *   interpolation = %((l|ll|I64)[du]|[ipsf])
+ *
+ * Note:
+ * - Input must be encoded in UTF-8.
+ * - Decoding and validating is left to the parser.
  */
 
 enum json_lexer_state {
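
[Editor's illustration, not part of the patch.] To make the number
grammar quoted in the new comment concrete, here is a minimal sketch of
a longest-prefix matcher for
"number = [ minus ] int [ frac ] [ exp ]". The name json_number_len is
hypothetical, and this is not how json-lexer.c is implemented (the lexer
is a table-driven state machine); it only shows what the ABNF accepts.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not QEMU code: return the length of the longest
 * prefix of s that matches the RFC 7159 number grammar, or 0 if none.
 *   number = [ minus ] int [ frac ] [ exp ]
 */
static size_t json_number_len(const char *s)
{
    const char *p = s;
    const char *q;

    if (*p == '-') {
        p++;
    }
    /* int = zero / ( digit1-9 *DIGIT ) */
    if (*p == '0') {
        p++;
    } else if (*p >= '1' && *p <= '9') {
        while (isdigit((unsigned char)*p)) {
            p++;
        }
    } else {
        return 0;
    }
    /* frac = decimal-point 1*DIGIT; taken only if a digit follows */
    if (*p == '.' && isdigit((unsigned char)p[1])) {
        p += 2;
        while (isdigit((unsigned char)*p)) {
            p++;
        }
    }
    /* exp = e [ minus / plus ] 1*DIGIT; taken only if complete */
    if (*p == 'e' || *p == 'E') {
        q = p + 1;
        if (*q == '-' || *q == '+') {
            q++;
        }
        if (isdigit((unsigned char)*q)) {
            while (isdigit((unsigned char)*q)) {
                q++;
            }
            p = q;
        }
    }
    return (size_t)(p - s);
}
```

Because it returns a prefix length, input such as "01" matches only the
leading "0"; deciding whether the leftover "1" is an error is a separate
step that this sketch does not attempt.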