Patchwork [09/11] json-lexer: limit the maximum size of a given token

Submitter Anthony Liguori
Date March 11, 2011, 9 p.m.
Message ID <1299877249-13433-10-git-send-email-aliguori@us.ibm.com>
Permalink /patch/86454/
State New

Comments

Anthony Liguori - March 11, 2011, 9 p.m.
This is a security consideration.  We don't want a client to cause an arbitrary
amount of memory to be allocated in QEMU.  For now, we use a limit of 64MB
which should be large enough for any reasonably sized token.

This is important for parsing JSON from untrusted sources.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Luiz Capitulino - March 14, 2011, 7:25 p.m.
On Fri, 11 Mar 2011 15:00:47 -0600
Anthony Liguori <aliguori@us.ibm.com> wrote:

> This is a security consideration.  We don't want a client to cause an arbitrary
> amount of memory to be allocated in QEMU.  For now, we use a limit of 64MB
> which should be large enough for any reasonably sized token.
> 
> This is important for parsing JSON from untrusted sources.
> 
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
> 
> diff --git a/json-lexer.c b/json-lexer.c
> index 834d7af..3462c89 100644
> --- a/json-lexer.c
> +++ b/json-lexer.c
> @@ -18,6 +18,8 @@
>  #include "qemu-common.h"
>  #include "json-lexer.h"
>  
> +#define MAX_TOKEN_SIZE (64ULL << 20)
> +
>  /*
>   * \"([^\\\"]|(\\\"\\'\\\\\\/\\b\\f\\n\\r\\t\\u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]))*\"
>   * '([^\\']|(\\\"\\'\\\\\\/\\b\\f\\n\\r\\t\\u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]))*'
> @@ -312,6 +314,17 @@ static int json_lexer_feed_char(JSONLexer *lexer, char ch)
>          }
>          lexer->state = new_state;
>      } while (!char_consumed);
> +
> +    /* Do not let a single token grow to an arbitrarily large size,
> +     * this is a security consideration.
> +     */
> +    if (lexer->token->length > MAX_TOKEN_SIZE) {
> +        lexer->emit(lexer, lexer->token, lexer->state, lexer->x, lexer->y);
> +        QDECREF(lexer->token);
> +        lexer->token = qstring_new();
> +        lexer->state = IN_START;
> +    }

Entering an invalid token is an error; we should fail here. That brings
two benefits:

 1. Test code could trigger this condition and check for the specific
    error code

 2. Developers will know when they hit the limit. Although I don't expect
    this to happen, there was talk about adding base64 support to
    transfer something (I can't remember what, but we never know how the
    protocol will evolve).

Also, by testing this I found that the parser seems to get confused when
the limit is reached: it stops responding.
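
As a rough sketch of that failure path (the IN_ERROR state and the
-EINVAL return are assumptions for illustration, not existing QEMU API):

    /* Hypothetical: treat an oversized token as a hard error instead
     * of silently restarting the state machine. */
    if (lexer->token->length > MAX_TOKEN_SIZE) {
        lexer->state = IN_ERROR;   /* assumed dedicated error state */
        return -EINVAL;            /* caller can report it and bail out */
    }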

> +
>      return 0;
>  }
>
Anthony Liguori - March 14, 2011, 8:18 p.m.
On 03/14/2011 02:25 PM, Luiz Capitulino wrote:
> On Fri, 11 Mar 2011 15:00:47 -0600
> Anthony Liguori <aliguori@us.ibm.com> wrote:
>
>> This is a security consideration.  We don't want a client to cause an arbitrary
>> amount of memory to be allocated in QEMU.  For now, we use a limit of 64MB
>> which should be large enough for any reasonably sized token.
>>
>> This is important for parsing JSON from untrusted sources.
>>
>> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
>>
>> diff --git a/json-lexer.c b/json-lexer.c
>> index 834d7af..3462c89 100644
>> --- a/json-lexer.c
>> +++ b/json-lexer.c
>> @@ -18,6 +18,8 @@
>>   #include "qemu-common.h"
>>   #include "json-lexer.h"
>>
>> +#define MAX_TOKEN_SIZE (64ULL << 20)
>> +
>>   /*
>>    * \"([^\\\"]|(\\\"\\'\\\\\\/\\b\\f\\n\\r\\t\\u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]))*\"
>>    * '([^\\']|(\\\"\\'\\\\\\/\\b\\f\\n\\r\\t\\u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]))*'
>> @@ -312,6 +314,17 @@ static int json_lexer_feed_char(JSONLexer *lexer, char ch)
>>           }
>>           lexer->state = new_state;
>>       } while (!char_consumed);
>> +
>> +    /* Do not let a single token grow to an arbitrarily large size,
>> +     * this is a security consideration.
>> +     */
>> +    if (lexer->token->length > MAX_TOKEN_SIZE) {
>> +        lexer->emit(lexer, lexer->token, lexer->state, lexer->x, lexer->y);
>> +        QDECREF(lexer->token);
>> +        lexer->token = qstring_new();
>> +        lexer->state = IN_START;
>> +    }
> Entering an invalid token is an error; we should fail here.

It's not so clear to me.

I think of it like GCC.  GCC doesn't bail out on the first invalid 
character the lexer encounters.  Instead, it records that an error has 
occurred and tries its best to recover.

The result is that instead of getting an error message about the first 
error in your code, you'll get a long listing of all the mistakes you 
made (usually).

One thing that makes this more difficult in our case is that when you're
testing, we don't have a clear EOI to flush things out.  So a bad
sequence of inputs might make the message parser wait for a character
that you're not necessarily inputting, which makes the session appear
hung.  Usually, if you throw a couple extra brackets around, you'll get
back to valid input.
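
To sketch that recovery idea (JSON_ERROR is a hypothetical token type,
not something the lexer defines today): emit whatever has accumulated so
the client sees the problem, then reset the state machine so later input
still lexes, much like the hunk above already does:

    /* Hypothetical GCC-style recovery: report the bad token, then
     * resynchronize instead of wedging the whole session. */
    lexer->emit(lexer, lexer->token, JSON_ERROR, lexer->x, lexer->y);
    QDECREF(lexer->token);
    lexer->token = qstring_new();
    lexer->state = IN_START;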

>   That brings
> two benefits:
>
>   1. Test code could trigger this condition and check for the specific
>      error code
>
>   2. Developers will know when they hit the limit. Although I don't expect
>      this to happen, there was talk about adding base64 support to
>      transfer something (I can't remember what, but we never know how the
>      protocol will evolve).
>
> Also, by testing this I found that the parser seems to get confused when
> the limit is reached: it stops responding.

Actually, it does respond.  The lexer just takes an incredibly long time
to process a large token because qstring_append_ch is incredibly slow
:-)  If you drop the token size down to 64k instead of 64MB, it'll seem
a lot more reasonable.
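
For comparison, a minimal standalone sketch (not QEMU's actual QString
implementation) of a character append with amortized doubling, which
keeps the per-character cost constant even for multi-megabyte tokens:

    #include <stdlib.h>

    typedef struct {
        char *data;
        size_t length;
        size_t capacity;
    } Buffer;

    /* Append one character, doubling the capacity when full so that n
     * appends cost O(n) total; realloc error handling omitted for
     * brevity. */
    static void buffer_append_ch(Buffer *buf, char ch)
    {
        if (buf->length + 1 >= buf->capacity) {
            buf->capacity = buf->capacity ? buf->capacity * 2 : 16;
            buf->data = realloc(buf->data, buf->capacity);
        }
        buf->data[buf->length++] = ch;
        buf->data[buf->length] = '\0';
    }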

Regards,

Anthony Liguori

>> +
>>       return 0;
>>   }
>>
>
Luiz Capitulino - March 14, 2011, 8:33 p.m.
On Mon, 14 Mar 2011 15:18:57 -0500
Anthony Liguori <anthony@codemonkey.ws> wrote:

> On 03/14/2011 02:25 PM, Luiz Capitulino wrote:
> > On Fri, 11 Mar 2011 15:00:47 -0600
> > Anthony Liguori <aliguori@us.ibm.com> wrote:
> >
> >> This is a security consideration.  We don't want a client to cause an arbitrary
> >> amount of memory to be allocated in QEMU.  For now, we use a limit of 64MB
> >> which should be large enough for any reasonably sized token.
> >>
> >> This is important for parsing JSON from untrusted sources.
> >>
> >> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
> >>
> >> diff --git a/json-lexer.c b/json-lexer.c
> >> index 834d7af..3462c89 100644
> >> --- a/json-lexer.c
> >> +++ b/json-lexer.c
> >> @@ -18,6 +18,8 @@
> >>   #include "qemu-common.h"
> >>   #include "json-lexer.h"
> >>
> >> +#define MAX_TOKEN_SIZE (64ULL << 20)
> >> +
> >>   /*
> >>    * \"([^\\\"]|(\\\"\\'\\\\\\/\\b\\f\\n\\r\\t\\u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]))*\"
> >>    * '([^\\']|(\\\"\\'\\\\\\/\\b\\f\\n\\r\\t\\u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]))*'
> >> @@ -312,6 +314,17 @@ static int json_lexer_feed_char(JSONLexer *lexer, char ch)
> >>           }
> >>           lexer->state = new_state;
> >>       } while (!char_consumed);
> >> +
> >> +    /* Do not let a single token grow to an arbitrarily large size,
> >> +     * this is a security consideration.
> >> +     */
> >> +    if (lexer->token->length > MAX_TOKEN_SIZE) {
> >> +        lexer->emit(lexer, lexer->token, lexer->state, lexer->x, lexer->y);
> >> +        QDECREF(lexer->token);
> >> +        lexer->token = qstring_new();
> >> +        lexer->state = IN_START;
> >> +    }
> > Entering an invalid token is an error; we should fail here.
> 
> It's not so clear to me.
> 
> I think of it like GCC.  GCC doesn't bail out on the first invalid 
> character the lexer encounters.  Instead, it records that an error has 
> occurred and tries its best to recover.
> 
> The result is that instead of getting an error message about the first 
> error in your code, you'll get a long listing of all the mistakes you 
> made (usually).

I'm not a compiler expert, but I think compilers do this because it would be
very problematic, from a usability perspective, to report one error at a time:
by reporting as many errors as possible per run, the compiler lets the user
fix as many programming mistakes as possible without having to re-run.

Our target isn't humans, though.

> One thing that makes this more difficult in our case is that when you're
> testing, we don't have a clear EOI to flush things out.  So a bad
> sequence of inputs might make the message parser wait for a character
> that you're not necessarily inputting, which makes the session appear
> hung.  Usually, if you throw a couple extra brackets around, you'll get
> back to valid input.

Not sure if this is easy to do, but shouldn't we fail if we don't get
the expected character?

> >   That brings
> > two benefits:
> >
> >   1. Test code could trigger this condition and check for the specific
> >      error code
> >
> >   2. Developers will know when they hit the limit. Although I don't expect
> >      this to happen, there was talk about adding base64 support to
> >      transfer something (I can't remember what, but we never know how the
> >      protocol will evolve).
> >
> > Also, by testing this I found that the parser seems to get confused when
> > the limit is reached: it stops responding.
> 
> Actually, it does respond.  The lexer just takes an incredibly long time
> to process a large token because qstring_append_ch is incredibly slow
> :-)  If you drop the token size down to 64k instead of 64MB, it'll seem
> a lot more reasonable.

Actually, I dropped it down to 20 :) This is supposed to work like your 64k
suggestion, right?

> 
> Regards,
> 
> Anthony Liguori
> 
> >> +
> >>       return 0;
> >>   }
> >>
> >
>

Patch

diff --git a/json-lexer.c b/json-lexer.c
index 834d7af..3462c89 100644
--- a/json-lexer.c
+++ b/json-lexer.c
@@ -18,6 +18,8 @@ 
 #include "qemu-common.h"
 #include "json-lexer.h"
 
+#define MAX_TOKEN_SIZE (64ULL << 20)
+
 /*
  * \"([^\\\"]|(\\\"\\'\\\\\\/\\b\\f\\n\\r\\t\\u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]))*\"
  * '([^\\']|(\\\"\\'\\\\\\/\\b\\f\\n\\r\\t\\u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]))*'
@@ -312,6 +314,17 @@  static int json_lexer_feed_char(JSONLexer *lexer, char ch)
         }
         lexer->state = new_state;
     } while (!char_consumed);
+
+    /* Do not let a single token grow to an arbitrarily large size,
+     * this is a security consideration.
+     */
+    if (lexer->token->length > MAX_TOKEN_SIZE) {
+        lexer->emit(lexer, lexer->token, lexer->state, lexer->x, lexer->y);
+        QDECREF(lexer->token);
+        lexer->token = qstring_new();
+        lexer->state = IN_START;
+    }
+
     return 0;
 }