[v2,4/4] json-streamer: Limit number of tokens in addition to total size

Message ID 1447946948-12489-5-git-send-email-armbru@redhat.com
State New

Commit Message

Markus Armbruster Nov. 19, 2015, 3:29 p.m. UTC
Commit 29c75dd "json-streamer: limit the maximum recursion depth and
maximum token count" attempts to guard against excessive heap usage by
limiting total token size (it says "token count", but that's a lie).

Total token size is a rather imprecise predictor of heap usage: many
small tokens use more space than few large tokens with the same input
size, because there's a constant per-token overhead.

Tighten this up: limit the token count to 128Ki.

If you think 128Ki is too stingy: check-qjson's large_dict test eats a
sweet 500MiB on my machine to parse ~100K tokens.  Absurdly wasteful.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 qobject/json-streamer.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Paolo Bonzini Nov. 19, 2015, 10:01 p.m. UTC | #1
On 19/11/2015 16:29, Markus Armbruster wrote:
> Commit 29c75dd "json-streamer: limit the maximum recursion depth and
> maximum token count" attempts to guard against excessive heap usage by
> limiting total token size (it says "token count", but that's a lie).
> 
> Total token size is a rather imprecise predictor of heap usage: many
> small tokens use more space than few large tokens with the same input
> size, because there's a constant per-token overhead.
> 
> Tighten this up: limit the token count to 128Ki.
> 
> If you think 128Ki is too stingy: check-qjson's large_dict test eats a
> sweet 500MiB on my machine to parse ~100K tokens.

How much of this is freed before the start of the parse?

> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> ---
>  qobject/json-streamer.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/qobject/json-streamer.c b/qobject/json-streamer.c
> index 2bd22a7..8752834 100644
> --- a/qobject/json-streamer.c
> +++ b/qobject/json-streamer.c
> @@ -19,6 +19,7 @@
>  #include "qapi/qmp/json-streamer.h"
>  
>  #define MAX_TOKEN_SIZE (64ULL << 20)
> +#define MAX_TOKEN_COUNT (128ULL << 10)
>  #define MAX_NESTING (1ULL << 10)
>  
>  static void json_message_process_token(JSONLexer *lexer, QString *token, JSONTokenType type, int x, int y)
> @@ -64,6 +65,7 @@ static void json_message_process_token(JSONLexer *lexer, QString *token, JSONTok
>           parser->bracket_count == 0)) {
>          goto out_emit;
>      } else if (parser->token_size > MAX_TOKEN_SIZE ||
> +               qlist_size(parser->tokens) > MAX_TOKEN_COUNT ||

This is O(n^2).  I'd rather skip this patch, fix the memory hog and
possibly decrease MAX_TOKEN_SIZE a bit.

Paolo

>                 parser->bracket_count + parser->brace_count > MAX_NESTING) {
>          /* Security consideration, we limit total memory allocated per object
>           * and the maximum recursion depth that a message can force.
>
Markus Armbruster Nov. 20, 2015, 6:13 a.m. UTC | #2
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 19/11/2015 16:29, Markus Armbruster wrote:
>> Commit 29c75dd "json-streamer: limit the maximum recursion depth and
>> maximum token count" attempts to guard against excessive heap usage by
>> limiting total token size (it says "token count", but that's a lie).
>> 
>> Total token size is a rather imprecise predictor of heap usage: many
>> small tokens use more space than few large tokens with the same input
>> size, because there's a constant per-token overhead.
>> 
>> Tighten this up: limit the token count to 128Ki.
>> 
>> If you think 128Ki is too stingy: check-qjson's large_dict test eats a
>> sweet 500MiB on my machine to parse ~100K tokens.
>
> How much of this is freed before the start of the parse?
>
>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> Reviewed-by: Eric Blake <eblake@redhat.com>
>> ---
>>  qobject/json-streamer.c | 2 ++
>>  1 file changed, 2 insertions(+)
>> 
>> diff --git a/qobject/json-streamer.c b/qobject/json-streamer.c
>> index 2bd22a7..8752834 100644
>> --- a/qobject/json-streamer.c
>> +++ b/qobject/json-streamer.c
>> @@ -19,6 +19,7 @@
>>  #include "qapi/qmp/json-streamer.h"
>>  
>>  #define MAX_TOKEN_SIZE (64ULL << 20)
>> +#define MAX_TOKEN_COUNT (128ULL << 10)
>>  #define MAX_NESTING (1ULL << 10)
>>  
>>  static void json_message_process_token(JSONLexer *lexer, QString *token, JSONTokenType type, int x, int y)
>> @@ -64,6 +65,7 @@ static void json_message_process_token(JSONLexer *lexer, QString *token, JSONTok
>>           parser->bracket_count == 0)) {
>>          goto out_emit;
>>      } else if (parser->token_size > MAX_TOKEN_SIZE ||
>> +               qlist_size(parser->tokens) > MAX_TOKEN_COUNT ||
>
> This is O(n^2).  I'd rather skip this patch, fix the memory hog and
> possibly decrease MAX_TOKEN_SIZE a bit.

I can't see the square right now.  Anyway, O() is unchanged by my patch,
only n is more limited.  See also commit 65c0f1e, which claims to have
fixed the quadratic behavior.

Regardless, the memory consumption is still ridiculous, and we certainly
need to fix that.  Whether we can have a fix for 2.5 I don't know.

Even with a fix, this patch has some value.  As explained in the commit
message, "total token size is a rather imprecise predictor of heap
usage: many small tokens use more space than few large tokens with the
same input size, because there's a constant per-token overhead."  As
long as we limit input to limit memory use, we better use a limit that's
connected to actual memory use with reasonable accuracy.  This patch
improves accuracy.

>>                 parser->bracket_count + parser->brace_count > MAX_NESTING) {
>>          /* Security consideration, we limit total memory allocated per object
>>           * and the maximum recursion depth that a message can force.
Paolo Bonzini Nov. 20, 2015, 8:50 a.m. UTC | #3
On 20/11/2015 07:13, Markus Armbruster wrote:
>>> @@ -64,6 +65,7 @@ static void json_message_process_token(JSONLexer *lexer, QString *token, JSONTok
>>>           parser->bracket_count == 0)) {
>>>          goto out_emit;
>>>      } else if (parser->token_size > MAX_TOKEN_SIZE ||
>>> +               qlist_size(parser->tokens) > MAX_TOKEN_COUNT ||
>>
>> This is O(n^2).  I'd rather skip this patch, fix the memory hog and
>> possibly decrease MAX_TOKEN_SIZE a bit.
> 
> I can't see the square right now.

It's hidden in qlist_size:

static void qlist_size_iter(QObject *obj, void *opaque)
{
    size_t *count = opaque;
    (*count)++;
}

size_t qlist_size(const QList *qlist)
{
    size_t count = 0;
    qlist_iter(qlist, qlist_size_iter, &count);
    return count;
}

You process the whole list on every token, which makes it O(n^2) where n
is the token count.

Paolo

> Anyway, O() is unchanged by my patch,
> only n is more limited.  See also commit 65c0f1e, which claims to have
> fixed the quadratic behavior.
> 
> Regardless, the memory consumption is still ridiculous, and we certainly
> need to fix that.  Whether we can have one for 2.5 I don't know.
> 
> Even with a fix, this patch has some value.  As explained in the commit
> message, "total token size is a rather imprecise predictor of heap
> usage: many small tokens use more space than few large tokens with the
> same input size, because there's a constant per-token overhead."  As
> long as we limit input to limit memory use, we better use a limit that's
> connected to actual memory use with reasonable accuracy.  This patch
> improves accuracy.
> 
>>>                 parser->bracket_count + parser->brace_count > MAX_NESTING) {
>>>          /* Security consideration, we limit total memory allocated per object
>>>           * and the maximum recursion depth that a message can force.
> 
>
Eric Blake Nov. 20, 2015, 5:32 p.m. UTC | #4
On 11/20/2015 01:50 AM, Paolo Bonzini wrote:
> 
> 
> On 20/11/2015 07:13, Markus Armbruster wrote:
>>>> @@ -64,6 +65,7 @@ static void json_message_process_token(JSONLexer *lexer, QString *token, JSONTok
>>>>           parser->bracket_count == 0)) {
>>>>          goto out_emit;
>>>>      } else if (parser->token_size > MAX_TOKEN_SIZE ||
>>>> +               qlist_size(parser->tokens) > MAX_TOKEN_COUNT ||
>>>
>>> This is O(n^2).  I'd rather skip this patch, fix the memory hog and
>>> possibly decrease MAX_TOKEN_SIZE a bit.
>>
>> I can't see the square right now.
> 
> It's hidden in qlist_size:
> 
> static void qlist_size_iter(QObject *obj, void *opaque)
> {
>     size_t *count = opaque;
>     (*count)++;
> }

Yuck - we don't track size independently?  Seems like it might make a
worthwhile addition, as well as convenience functions for efficiently
adding/removing a member at either end of the list.  (But we already
know that the QObject hierarchy is silly with some of the things it does).
Paolo Bonzini Nov. 23, 2015, 2:27 p.m. UTC | #5
On 20/11/2015 18:32, Eric Blake wrote:
> > static void qlist_size_iter(QObject *obj, void *opaque)
> > {
> >     size_t *count = opaque;
> >     (*count)++;
> > }
> 
> Yuck - we don't track size independently?  Seems like it might make a
> worthwhile addition,

Would you change your mind, if I told you that qlist_size is unused? :)

Paolo
Eric Blake Nov. 23, 2015, 4:03 p.m. UTC | #6
On 11/23/2015 07:27 AM, Paolo Bonzini wrote:
> 
> 
> On 20/11/2015 18:32, Eric Blake wrote:
>>> static void qlist_size_iter(QObject *obj, void *opaque)
>>> {
>>>     size_t *count = opaque;
>>>     (*count)++;
>>> }
>>
>> Yuck - we don't track size independently?  Seems like it might make a
>> worthwhile addition,
> 
> Would you change your mind, if I told you that qlist_size is unused? :)

Deleting dead code is a perfectly acceptable alternative to advertising
what should normally be an O(1) operation with an O(n) implementation. :)
Markus Armbruster Nov. 23, 2015, 5:09 p.m. UTC | #7
Eric Blake <eblake@redhat.com> writes:

> On 11/23/2015 07:27 AM, Paolo Bonzini wrote:
>> 
>> 
>> On 20/11/2015 18:32, Eric Blake wrote:
>>>> static void qlist_size_iter(QObject *obj, void *opaque)
>>>> {
>>>>     size_t *count = opaque;
>>>>     (*count)++;
>>>> }
>>>
>>> Yuck - we don't track size independently?  Seems like it might make a
>>> worthwhile addition,
>> 
>> Would you change your mind, if I told you that qlist_size is unused? :)
>
> Deleting dead code is a perfectly acceptable alternative to advertising
> what should normally be an O(1) operation with an O(n) implementation. :)

Well, "length of list" certainly isn't O(1) everywhere.  The Common Lisp
Hyperspec, for instance, gives an O(n) example implementation[*].
Generally just fine as long as callers are aware.


[*] http://www.lispworks.com/documentation/HyperSpec/Body/f_list_l.htm#list-length
Patch

diff --git a/qobject/json-streamer.c b/qobject/json-streamer.c
index 2bd22a7..8752834 100644
--- a/qobject/json-streamer.c
+++ b/qobject/json-streamer.c
@@ -19,6 +19,7 @@ 
 #include "qapi/qmp/json-streamer.h"
 
 #define MAX_TOKEN_SIZE (64ULL << 20)
+#define MAX_TOKEN_COUNT (128ULL << 10)
 #define MAX_NESTING (1ULL << 10)
 
 static void json_message_process_token(JSONLexer *lexer, QString *token, JSONTokenType type, int x, int y)
@@ -64,6 +65,7 @@  static void json_message_process_token(JSONLexer *lexer, QString *token, JSONTok
          parser->bracket_count == 0)) {
         goto out_emit;
     } else if (parser->token_size > MAX_TOKEN_SIZE ||
+               qlist_size(parser->tokens) > MAX_TOKEN_COUNT ||
                parser->bracket_count + parser->brace_count > MAX_NESTING) {
         /* Security consideration, we limit total memory allocated per object
          * and the maximum recursion depth that a message can force.