
[06/17] block/nbd-client: fix nbd_read_reply_entry

Message ID 20170804151440.320927-7-vsementsov@virtuozzo.com
State New

Commit Message

Vladimir Sementsov-Ogievskiy Aug. 4, 2017, 3:14 p.m. UTC
Set reply.handle to 0 on error path to prevent normal path of
nbd_co_receive_reply.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/nbd-client.c | 1 +
 1 file changed, 1 insertion(+)
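
For context, nbd_co_receive_reply decides between its normal and error
paths by comparing s->reply.handle (filled in by nbd_read_reply_entry)
against the request's own handle once it is woken up. A minimal sketch of
that check, paraphrased from the block/nbd-client.c of this era (exact
code may differ):

static void nbd_co_receive_reply(NBDClientSession *s, NBDRequest *request,
                                 NBDReply *reply, QEMUIOVector *qiov)
{
    /* Wait until we are woken up by nbd_read_reply_entry.  */
    qemu_coroutine_yield();
    *reply = s->reply;
    if (reply->handle != request->handle || !s->ioc) {
        /* Error path: no matching reply was received.  */
        reply->error = EIO;
    } else {
        /* Normal path: the payload (if any) is read here, and the reply
         * coroutine is told to fetch the next header.  */
        s->reply.handle = 0;
    }
}

The problem being fixed: nbd_receive_reply parses the handle out of the
reply header before it validates the magic, so a reply that fails
validation can still leave s->reply.handle equal to the request's handle,
and the comparison above then wrongly takes the normal path. Zeroing the
handle on the error exit of nbd_read_reply_entry forces the EIO branch
instead.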

Comments

Eric Blake Aug. 7, 2017, 11:52 a.m. UTC | #1
On 08/04/2017 10:14 AM, Vladimir Sementsov-Ogievskiy wrote:
> Set reply.handle to 0 on error path to prevent normal path of
> nbd_co_receive_reply.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/nbd-client.c | 1 +
>  1 file changed, 1 insertion(+)

Can you document a case where not fixing this would be an observable bug
(even if it requires using gdb and single-stepping between client and
server to make what is otherwise a racy situation easy to see)?  I'm
trying to figure out if this is 2.10 material.

> 
> diff --git a/block/nbd-client.c b/block/nbd-client.c
> index dc19894a7c..0c88d84de6 100644
> --- a/block/nbd-client.c
> +++ b/block/nbd-client.c
> @@ -107,6 +107,7 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>          qemu_coroutine_yield();
>      }
>  
> +    s->reply.handle = 0;
>      nbd_recv_coroutines_enter_all(s);
>      s->read_reply_co = NULL;
>  }
>
Vladimir Sementsov-Ogievskiy Aug. 7, 2017, 12:56 p.m. UTC | #2
07.08.2017 14:52, Eric Blake wrote:
> On 08/04/2017 10:14 AM, Vladimir Sementsov-Ogievskiy wrote:
>> Set reply.handle to 0 on error path to prevent normal path of
>> nbd_co_receive_reply.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/nbd-client.c | 1 +
>>   1 file changed, 1 insertion(+)
> Can you document a case where not fixing this would be an observable bug
> (even if it requires using gdb and single-stepping between client and
> server to make what is otherwise a racy situation easy to see)?  I'm
> trying to figure out if this is 2.10 material.

It is simple enough:

Run qemu-nbd under gdb, set a breakpoint on nbd_send_reply, and when it
fires, step with "next" until "stl_be_p(buf, NBD_REPLY_MAGIC);" has
executed, then do "call stl_be_p(buf, 1000)" to overwrite the
just-written magic.

Run qemu-io with some read under gdb and set breakpoints with

br block/nbd-client.c:83

(this is the "break;" after the failed nbd_receive_reply call) and

br block/nbd-client.c:170

(this is in nbd_co_receive_reply, after the yield).

At the first breakpoint we can be sure that nbd_receive_reply failed;
at the second,

(gdb) p s->reply
$1 = {handle = 93825000680144, error = 0}
(gdb) p request->handle
$2 = 93825000680144

shows that we are on the normal receiving path.
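
Condensed into a transcript (a sketch: the qemu-nbd invocation and the
export name "foo" are assumptions chosen to match the
nbd://localhost:10809/foo URL used later in this thread, and the
breakpoint line numbers vary by build):

# server: corrupt the magic of the next reply
$ gdb --args ./qemu-nbd -f raw -x foo image.raw
(gdb) break nbd_send_reply
(gdb) run
(gdb) next                      # repeat until stl_be_p(buf, NBD_REPLY_MAGIC); has run
(gdb) call stl_be_p(buf, 1000)  # overwrite the magic just written into buf
(gdb) continue

# client: confirm the failed parse, then the wrongly-taken normal path
$ gdb --args ./qemu-io -c 'r 0 1' -f raw nbd://localhost:10809/foo
(gdb) break block/nbd-client.c:83
(gdb) break block/nbd-client.c:170
(gdb) run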

>
>> diff --git a/block/nbd-client.c b/block/nbd-client.c
>> index dc19894a7c..0c88d84de6 100644
>> --- a/block/nbd-client.c
>> +++ b/block/nbd-client.c
>> @@ -107,6 +107,7 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>>           qemu_coroutine_yield();
>>       }
>>   
>> +    s->reply.handle = 0;
>>       nbd_recv_coroutines_enter_all(s);
>>       s->read_reply_co = NULL;
>>   }
>>
Eric Blake Aug. 7, 2017, 3:13 p.m. UTC | #3
On 08/07/2017 07:56 AM, Vladimir Sementsov-Ogievskiy wrote:
> 07.08.2017 14:52, Eric Blake wrote:
>> On 08/04/2017 10:14 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> Set reply.handle to 0 on error path to prevent normal path of
>>> nbd_co_receive_reply.

Side note: in general, our server must allow a handle of 0 as valid (as
the handle is something chosen by the client); but our particular client
always uses non-zero handles (and therefore using 0 as a sentinel for an
error path is okay).
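
For reference on why the client cannot produce a handle of 0: it derives
each handle from the in-flight request's slot index XORed with the heap
address of the client state. A sketch from memory (the macro names are
assumptions):

/* block/nbd-client.c (approximate) */
#define INDEX_TO_HANDLE(s, index)  ((index) ^ (uint64_t)(intptr_t)(s))
#define HANDLE_TO_INDEX(s, handle) ((handle) ^ (uint64_t)(intptr_t)(s))

/* in nbd_co_send_request(): i < MAX_NBD_REQUESTS while s is a heap
 * pointer, so index ^ pointer can never be 0 */
request->handle = INDEX_TO_HANDLE(s, i);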

>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>   block/nbd-client.c | 1 +
>>>   1 file changed, 1 insertion(+)
>> Can you document a case where not fixing this would be an observable bug
>> (even if it requires using gdb and single-stepping between client and
>> server to make what is otherwise a racy situation easy to see)?  I'm
>> trying to figure out if this is 2.10 material.
> 
> It is simple enough:
> 
> Run qemu-nbd under gdb, set a breakpoint on nbd_send_reply, and when it
> fires, step with "next" until "stl_be_p(buf, NBD_REPLY_MAGIC);" has
> executed, then do "call stl_be_p(buf, 1000)" to overwrite the
> just-written magic.

Okay, so you have set up a malicious server intentionally sending bad
magic...

> 
> Run qemu-io with some read under gdb and set breakpoints with
> 
> br block/nbd-client.c:83
> 
> (this is the "break;" after the failed nbd_receive_reply call) and
> 
> br block/nbd-client.c:170
> 
> (this is in nbd_co_receive_reply, after the yield).
> 
> At the first breakpoint we can be sure that nbd_receive_reply failed;
> at the second,
> 
> (gdb) p s->reply
> $1 = {handle = 93825000680144, error = 0}
> (gdb) p request->handle
> $2 = 93825000680144
> 
> shows that we are on the normal receiving path.

...and the client is recognizing that the server sent garbage, but then
proceeds to handle the packet anyway.  The ideal reaction is that the
client should disconnect from the server, rather than handle the packet.
 But because it didn't do that, the client is now unable to receive any
future messages from the server.  Compare the traces of:

$  ./qemu-io -c 'r 0 1' -f raw nbd://localhost:10809/foo --trace nbd_\*

against a working server:

29103@1502118291.281180:nbd_opt_go_success Export is good to go
29103@1502118291.281397:nbd_send_request Sending request to server: {
.from = 0, .len = 1, .handle = 93860726319200, .flags = 0x0, .type = 0
(read) }
29103@1502118291.281705:nbd_receive_reply Got reply: { magic =
0x67446698, .error =  0, handle = 93860726319200 }
read 1/1 bytes at offset 0
1 bytes, 1 ops; 0.0003 sec (2.822 KiB/sec and 2890.1734 ops/sec)
29103@1502118291.281773:nbd_send_request Sending request to server: {
.from = 0, .len = 0, .handle = 93860726319200, .flags = 0x0, .type = 3
(flush) }
29103@1502118291.281902:nbd_receive_reply Got reply: { magic =
0x67446698, .error =  0, handle = 93860726319200 }
29103@1502118291.281939:nbd_send_request Sending request to server: {
.from = 0, .len = 0, .handle = 93860726319200, .flags = 0x0, .type = 3
(flush) }
29103@1502118291.282064:nbd_receive_reply Got reply: { magic =
0x67446698, .error =  0, handle = 93860726319200 }
29103@1502118291.282078:nbd_send_request Sending request to server: {
.from = 0, .len = 0, .handle = 0, .flags = 0x0, .type = 2 (discard) }

followed by a clean disconnect; vs. the buggy server:

29148@1502118384.385133:nbd_opt_go_success Export is good to go
29148@1502118384.385321:nbd_send_request Sending request to server: {
.from = 0, .len = 1, .handle = 94152262559840, .flags = 0x0, .type = 0
(read) }
29148@1502118396.494643:nbd_receive_reply Got reply: { magic =
0x1446698, .error =  0, handle = 94152262559840 }
invalid magic (got 0x1446698)
read 1/1 bytes at offset 0
1 bytes, 1 ops; 0:00:12.10 (0.082581 bytes/sec and 0.0826 ops/sec)
29148@1502118396.494746:nbd_send_request Sending request to server: {
.from = 0, .len = 0, .handle = 94152262559840, .flags = 0x0, .type = 3
(flush) }

where the client is now hung.  Thankfully, the client is blocked in an
idle loop (not eating CPU), so I don't know if this counts as the
ability for a malicious server to cause a denial of service against qemu
as an NBD client (in general, being unable to read further data from the
server because client internal state is now botched is not that much
different from being unable to read further data from the server because
the client hung up on the invalid server), unless there is some way to
cause qemu to die from an assertion failure rather than just get stuck.

It also looks like the client acts on at most one bad packet from the
server before it gets stuck - but the question is whether a malicious
server could, in that one bad packet reply, cause qemu to misbehave in
some other way.

I'm adding Prasad, to analyze whether this needs a CVE.

We don't have a good protocol fuzzer to simulate a bad client and/or bad
server as the partner over the socket - if you can find any more such
interactions where a bad partner can cause a hang or crash, let's get
those fixed in 2.10.

Meanwhile, I'll probably include this patch in 2.10 (after first
figuring out if it works in isolation or needs other patches from your
series).
Eric Blake Aug. 7, 2017, 3:33 p.m. UTC | #4
On 08/07/2017 10:13 AM, Eric Blake wrote:
> On 08/07/2017 07:56 AM, Vladimir Sementsov-Ogievskiy wrote:
>> 07.08.2017 14:52, Eric Blake wrote:
>>> On 08/04/2017 10:14 AM, Vladimir Sementsov-Ogievskiy wrote:
>>>> Set reply.handle to 0 on error path to prevent normal path of
>>>> nbd_co_receive_reply.
> 

> ...and the client is recognizing that the server sent garbage, but then
> proceeds to handle the packet anyway.  The ideal reaction is that the
> client should disconnect from the server, rather than handle the packet.
>  But because it didn't do that, the client is now unable to receive any
> future messages from the server.  Compare the traces of:
> 

> followed by a clean disconnect; vs. the buggy server:
> 
> 29148@1502118384.385133:nbd_opt_go_success Export is good to go
> 29148@1502118384.385321:nbd_send_request Sending request to server: {
> .from = 0, .len = 1, .handle = 94152262559840, .flags = 0x0, .type = 0
> (read) }
> 29148@1502118396.494643:nbd_receive_reply Got reply: { magic =
> 0x1446698, .error =  0, handle = 94152262559840 }
> invalid magic (got 0x1446698)
> read 1/1 bytes at offset 0
> 1 bytes, 1 ops; 0:00:12.10 (0.082581 bytes/sec and 0.0826 ops/sec)
> 29148@1502118396.494746:nbd_send_request Sending request to server: {
> .from = 0, .len = 0, .handle = 94152262559840, .flags = 0x0, .type = 3
> (flush) }
> 
> where the client is now hung.  Thankfully, the client is blocked in an
> idle loop (not eating CPU), so I don't know if this counts as the
> ability for a malicious server to cause a denial of service against qemu
> as an NBD client (in general, being unable to read further data from the
> server because client internal state is now botched is not that much
> different from being unable to read further data from the server because
> the client hung up on the invalid server), unless there is some way to
> cause qemu to die from an assertion failure rather than just get stuck.

With just patch 6/17 applied, the client still hangs, but this time with
a different trace:

30053@1502119637.604092:nbd_opt_go_success Export is good to go
30053@1502119637.604256:nbd_send_request Sending request to server: {
.from = 0, .len = 1, .handle = 94716063746144, .flags = 0x0, .type = 0
(read) }
30053@1502119649.070156:nbd_receive_reply Got reply: { magic =
0x1446698, .error =  0, handle = 94716063746144 }
invalid magic (got 0x1446698)
read failed: Input/output error
30053@1502119649.070243:nbd_send_request Sending request to server: {
.from = 0, .len = 0, .handle = 94716063746144, .flags = 0x0, .type = 3
(flush) }

The client still tried to send a flush request to the server, when it
should REALLY quit talking to the server at all and just disconnect,
because the moment the client recognizes that the server has sent
garbage is the moment that the client can no longer assume that the
server will react correctly to any further commands.

So I don't think your patch is quite correct, even if you have
identified a shortfall in our client code.
Vladimir Sementsov-Ogievskiy Aug. 7, 2017, 4:09 p.m. UTC | #5
07.08.2017 18:33, Eric Blake wrote:
> On 08/07/2017 10:13 AM, Eric Blake wrote:
>> On 08/07/2017 07:56 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> 07.08.2017 14:52, Eric Blake wrote:
>>>> On 08/04/2017 10:14 AM, Vladimir Sementsov-Ogievskiy wrote:
>>>>> Set reply.handle to 0 on error path to prevent normal path of
>>>>> nbd_co_receive_reply.
>> ...and the client is recognizing that the server sent garbage, but then
>> proceeds to handle the packet anyway.  The ideal reaction is that the
>> client should disconnect from the server, rather than handle the packet.
>>   But because it didn't do that, the client is now unable to receive any
>> future messages from the server.  Compare the traces of:
>>
>> followed by a clean disconnect; vs. the buggy server:
>>
>> 29148@1502118384.385133:nbd_opt_go_success Export is good to go
>> 29148@1502118384.385321:nbd_send_request Sending request to server: {
>> .from = 0, .len = 1, .handle = 94152262559840, .flags = 0x0, .type = 0
>> (read) }
>> 29148@1502118396.494643:nbd_receive_reply Got reply: { magic =
>> 0x1446698, .error =  0, handle = 94152262559840 }
>> invalid magic (got 0x1446698)
>> read 1/1 bytes at offset 0
>> 1 bytes, 1 ops; 0:00:12.10 (0.082581 bytes/sec and 0.0826 ops/sec)
>> 29148@1502118396.494746:nbd_send_request Sending request to server: {
>> .from = 0, .len = 0, .handle = 94152262559840, .flags = 0x0, .type = 3
>> (flush) }
>>
>> where the client is now hung.  Thankfully, the client is blocked in an
>> idle loop (not eating CPU), so I don't know if this counts as the
>> ability for a malicious server to cause a denial of service against qemu
>> as an NBD client (in general, being unable to read further data from the
>> server because client internal state is now botched is not that much
>> different from being unable to read further data from the server because
>> the client hung up on the invalid server), unless there is some way to
>> cause qemu to die from an assertion failure rather than just get stuck.
> With just patch 6/17 applied, the client still hangs, but this time with
> a different trace:
>
> 30053@1502119637.604092:nbd_opt_go_success Export is good to go
> 30053@1502119637.604256:nbd_send_request Sending request to server: {
> .from = 0, .len = 1, .handle = 94716063746144, .flags = 0x0, .type = 0
> (read) }
> 30053@1502119649.070156:nbd_receive_reply Got reply: { magic =
> 0x1446698, .error =  0, handle = 94716063746144 }
> invalid magic (got 0x1446698)
> read failed: Input/output error
> 30053@1502119649.070243:nbd_send_request Sending request to server: {
> .from = 0, .len = 0, .handle = 94716063746144, .flags = 0x0, .type = 3
> (flush) }
>
> The client still tried to send a flush request to the server, when it
> should REALLY quit talking to the server at all and just disconnect,
> because the moment the client recognizes that the server has sent
> garbage is the moment that the client can no longer assume that the
> server will react correctly to any further commands.
>
> So I don't think your patch is quite correct, even if you have
> identified a shortfall in our client code.
>
Why do you think so? It just does what is stated in the commit message.

Patch 17 should fix the case (but I doubt that it can be taken separately).
Eric Blake Aug. 7, 2017, 4:18 p.m. UTC | #6
On 08/07/2017 11:09 AM, Vladimir Sementsov-Ogievskiy wrote:

>>>>> On 08/04/2017 10:14 AM, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> Set reply.handle to 0 on error path to prevent normal path of
>>>>>> nbd_co_receive_reply.

>>
>> The client still tried to send a flush request to the server, when it
>> should REALLY quit talking to the server at all and just disconnect,
>> because the moment the client recognizes that the server has sent
>> garbage is the moment that the client can no longer assume that the
>> server will react correctly to any further commands.
>>
>> So I don't think your patch is quite correct, even if you have
>> identified a shortfall in our client code.
>>
> Why do you think so? It just does what it states in commit message.

The commit message doesn't state much in the way of WHY.  It is the
subsequent discussion that says that the reason WHY we need to fix
something is to make the client robustly hang up when it encounters a
broken server - but once you couch it in those terms, this patch is NOT
making the client hang up gracefully (it makes the client skip ONE
invalid reply, but then immediately lets the client send another request
and get stuck waiting for a reply to that second request).  A full fix
for the issue would make sure the client hangs up as soon as it detects
a bogus server.
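
One possible shape for such a fix (a sketch with a hypothetical "quit"
flag on NBDClientSession; not necessarily what will be committed):

static coroutine_fn void nbd_read_reply_entry(void *opaque)
{
    NBDClientSession *s = opaque;

    for (;;) {
        if (nbd_receive_reply(s->ioc, &s->reply) < 0) {
            break;  /* bogus magic, short read, disconnect, ... */
        }
        /* ... wake the coroutine owning s->reply.handle, yield ... */
    }

    s->quit = true;  /* hypothetical: mark the server as untrustworthy */
    s->reply.handle = 0;
    nbd_recv_coroutines_enter_all(s);
    s->read_reply_co = NULL;
}

/* and in nbd_co_send_request(), before writing anything: */
if (s->quit) {
    return -EIO;
}

With that, the first bad packet poisons the whole session, so the flush in
the traces above fails immediately instead of waiting forever for a reply
that will never arrive.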

> 
> Patch 17 should fix the case (but I doubt that it can be taken separately).

It's okay if it takes more than one patch to fix an issue, if the
earlier patches document that more work is still needed.  Or maybe we
can squash the best parts of 17 and this one.  I'm still playing with
the best way to minimally address the issue for 2.10.

Patch

diff --git a/block/nbd-client.c b/block/nbd-client.c
index dc19894a7c..0c88d84de6 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -107,6 +107,7 @@  static coroutine_fn void nbd_read_reply_entry(void *opaque)
         qemu_coroutine_yield();
     }
 
+    s->reply.handle = 0;
     nbd_recv_coroutines_enter_all(s);
     s->read_reply_co = NULL;
 }