Message ID: 20170804151440.320927-7-vsementsov@virtuozzo.com
State: New
On 08/04/2017 10:14 AM, Vladimir Sementsov-Ogievskiy wrote:
> Set reply.handle to 0 on error path to prevent normal path of
> nbd_co_receive_reply.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/nbd-client.c | 1 +
>  1 file changed, 1 insertion(+)

Can you document a case where not fixing this would be an observable bug
(even if it requires using gdb and single-stepping between client and
server to make what is otherwise a racy situation easy to see)? I'm
trying to figure out if this is 2.10 material.

> diff --git a/block/nbd-client.c b/block/nbd-client.c
> index dc19894a7c..0c88d84de6 100644
> --- a/block/nbd-client.c
> +++ b/block/nbd-client.c
> @@ -107,6 +107,7 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>          qemu_coroutine_yield();
>      }
>
> +    s->reply.handle = 0;
>      nbd_recv_coroutines_enter_all(s);
>      s->read_reply_co = NULL;
>  }
07.08.2017 14:52, Eric Blake wrote:
> On 08/04/2017 10:14 AM, Vladimir Sementsov-Ogievskiy wrote:
>> Set reply.handle to 0 on error path to prevent normal path of
>> nbd_co_receive_reply.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>  block/nbd-client.c | 1 +
>>  1 file changed, 1 insertion(+)
> Can you document a case where not fixing this would be an observable bug
> (even if it requires using gdb and single-stepping between client and
> server to make what is otherwise a racy situation easy to see)? I'm
> trying to figure out if this is 2.10 material.

It is simple enough:

Run qemu-nbd under gdb, set a breakpoint on nbd_send_reply, and when it
fires, "next" up to "stl_be_p(buf, NBD_REPLY_MAGIC);"; after that line,
do "call stl_be_p(buf, 1000)".

Run qemu-io with some read under gdb, and set breakpoints with
  br block/nbd-client.c:83
(that is the "break;" after the failed nbd_receive_reply call) and
  br block/nbd-client.c:170
(in nbd_co_receive_reply, after the yield).

At the first breakpoint we can be sure that nbd_receive_reply failed;
at the second,
  (gdb) p s->reply
  $1 = {handle = 93825000680144, error = 0}
  (gdb) p request->handle
  $2 = 93825000680144
shows that we are on the normal receiving path.

>> diff --git a/block/nbd-client.c b/block/nbd-client.c
>> index dc19894a7c..0c88d84de6 100644
>> --- a/block/nbd-client.c
>> +++ b/block/nbd-client.c
>> @@ -107,6 +107,7 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>>          qemu_coroutine_yield();
>>      }
>>
>> +    s->reply.handle = 0;
>>      nbd_recv_coroutines_enter_all(s);
>>      s->read_reply_co = NULL;
>>  }
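Collected into gdb command files, the reproduction above looks roughly
like this (line numbers refer to the 2.10-era sources quoted in this
thread, so they may have drifted; the file names are illustrative):

```gdb
# server.gdb -- run as: gdb -x server.gdb --args qemu-nbd ...
# Corrupt the reply magic the server is about to send.
break nbd_send_reply
run
# When the breakpoint fires, "next" until just past:
#     stl_be_p(buf, NBD_REPLY_MAGIC);
# then overwrite the magic and let the reply go out:
#     (gdb) call stl_be_p(buf, 1000)
#     (gdb) continue

# client.gdb -- run as: gdb -x client.gdb --args qemu-io ...
# First stop proves nbd_receive_reply failed; second stop shows the
# stale s->reply.handle still matching request->handle.
break block/nbd-client.c:83
break block/nbd-client.c:170
run
```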
On 08/07/2017 07:56 AM, Vladimir Sementsov-Ogievskiy wrote:
> 07.08.2017 14:52, Eric Blake wrote:
>> On 08/04/2017 10:14 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> Set reply.handle to 0 on error path to prevent normal path of
>>> nbd_co_receive_reply.

Side note: in general, our server must allow a handle of 0 as valid (as
the handle is something chosen by the client); but our particular client
always uses non-zero handles (and therefore using 0 as a sentinel for an
error path is okay).

>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>  block/nbd-client.c | 1 +
>>>  1 file changed, 1 insertion(+)
>> Can you document a case where not fixing this would be an observable bug
>> (even if it requires using gdb and single-stepping between client and
>> server to make what is otherwise a racy situation easy to see)? I'm
>> trying to figure out if this is 2.10 material.
>
> It is simple enough:
>
> Run qemu-nbd under gdb, set a breakpoint on nbd_send_reply, and when it
> fires, "next" up to "stl_be_p(buf, NBD_REPLY_MAGIC);"; after that line,
> do "call stl_be_p(buf, 1000)".

Okay, so you have set up a malicious server intentionally sending bad
magic...

> Run qemu-io with some read under gdb, and set breakpoints with
>   br block/nbd-client.c:83
> (that is the "break;" after the failed nbd_receive_reply call) and
>   br block/nbd-client.c:170
> (in nbd_co_receive_reply, after the yield).
>
> At the first breakpoint we can be sure that nbd_receive_reply failed;
> at the second,
>   (gdb) p s->reply
>   $1 = {handle = 93825000680144, error = 0}
>   (gdb) p request->handle
>   $2 = 93825000680144
> shows that we are on the normal receiving path.

...and the client is recognizing that the server sent garbage, but then
proceeds to handle the packet anyway. The ideal reaction is that the
client should disconnect from the server, rather than handle the packet.
But because it didn't do that, the client is now unable to receive any
future messages from the server.
Compare the traces of:

$ ./qemu-io -c 'r 0 1' -f raw nbd://localhost:10809/foo --trace nbd_\*

against a working server:

29103@1502118291.281180:nbd_opt_go_success Export is good to go
29103@1502118291.281397:nbd_send_request Sending request to server: { .from = 0, .len = 1, .handle = 93860726319200, .flags = 0x0, .type = 0 (read) }
29103@1502118291.281705:nbd_receive_reply Got reply: { magic = 0x67446698, .error = 0, handle = 93860726319200 }
read 1/1 bytes at offset 0
1 bytes, 1 ops; 0.0003 sec (2.822 KiB/sec and 2890.1734 ops/sec)
29103@1502118291.281773:nbd_send_request Sending request to server: { .from = 0, .len = 0, .handle = 93860726319200, .flags = 0x0, .type = 3 (flush) }
29103@1502118291.281902:nbd_receive_reply Got reply: { magic = 0x67446698, .error = 0, handle = 93860726319200 }
29103@1502118291.281939:nbd_send_request Sending request to server: { .from = 0, .len = 0, .handle = 93860726319200, .flags = 0x0, .type = 3 (flush) }
29103@1502118291.282064:nbd_receive_reply Got reply: { magic = 0x67446698, .error = 0, handle = 93860726319200 }
29103@1502118291.282078:nbd_send_request Sending request to server: { .from = 0, .len = 0, .handle = 0, .flags = 0x0, .type = 2 (discard) }

followed by a clean disconnect; vs. the buggy server:

29148@1502118384.385133:nbd_opt_go_success Export is good to go
29148@1502118384.385321:nbd_send_request Sending request to server: { .from = 0, .len = 1, .handle = 94152262559840, .flags = 0x0, .type = 0 (read) }
29148@1502118396.494643:nbd_receive_reply Got reply: { magic = 0x1446698, .error = 0, handle = 94152262559840 }
invalid magic (got 0x1446698)
read 1/1 bytes at offset 0
1 bytes, 1 ops; 0:00:12.10 (0.082581 bytes/sec and 0.0826 ops/sec)
29148@1502118396.494746:nbd_send_request Sending request to server: { .from = 0, .len = 0, .handle = 94152262559840, .flags = 0x0, .type = 3 (flush) }

where the client is now hung.
Thankfully, the client is blocked in an idle loop (not eating CPU), so I
don't know if this counts as the ability for a malicious server to cause
a denial of service against qemu as an NBD client (in general, being
unable to read further data from the server because client internal
state is now botched is not that much different from being unable to
read further data from the server because the client hung up on the
invalid server), unless there is some way to cause qemu to die from an
assertion failure rather than just get stuck.

It also looks like the client acts on at most one bad packet from the
server before it gets stuck - but the question is whether a malicious
server could, in that one bad packet reply, cause qemu to misbehave in
some other way. I'm adding Prasad, to analyze whether this needs a CVE.

We don't have a good protocol fuzzer to simulate a bad client and/or bad
server as the partner over the socket - if you can find any more such
interactions where a bad partner can cause a hang or crash, let's get
those fixed in 2.10. Meanwhile, I'll probably include this patch in 2.10
(after first figuring out if it works in isolation or needs other
patches from your series).
On 08/07/2017 10:13 AM, Eric Blake wrote:
> On 08/07/2017 07:56 AM, Vladimir Sementsov-Ogievskiy wrote:
>> 07.08.2017 14:52, Eric Blake wrote:
>>> On 08/04/2017 10:14 AM, Vladimir Sementsov-Ogievskiy wrote:
>>>> Set reply.handle to 0 on error path to prevent normal path of
>>>> nbd_co_receive_reply.
>
> ...and the client is recognizing that the server sent garbage, but then
> proceeds to handle the packet anyway. The ideal reaction is that the
> client should disconnect from the server, rather than handle the packet.
> But because it didn't do that, the client is now unable to receive any
> future messages from the server. Compare the traces of:
> [working-server trace snipped]
>
> followed by a clean disconnect; vs. the buggy server:
>
> 29148@1502118384.385133:nbd_opt_go_success Export is good to go
> 29148@1502118384.385321:nbd_send_request Sending request to server: { .from = 0, .len = 1, .handle = 94152262559840, .flags = 0x0, .type = 0 (read) }
> 29148@1502118396.494643:nbd_receive_reply Got reply: { magic = 0x1446698, .error = 0, handle = 94152262559840 }
> invalid magic (got 0x1446698)
> read 1/1 bytes at offset 0
> 1 bytes, 1 ops; 0:00:12.10 (0.082581 bytes/sec and 0.0826 ops/sec)
> 29148@1502118396.494746:nbd_send_request Sending request to server: { .from = 0, .len = 0, .handle = 94152262559840, .flags = 0x0, .type = 3 (flush) }
>
> where the client is now hung. Thankfully, the client is blocked in an
> idle loop (not eating CPU), so I don't know if this counts as the
> ability for a malicious server to cause a denial of service against qemu
> as an NBD client (in general, being unable to read further data from the
> server because client internal state is now botched is not that much
> different from being unable to read further data from the server because
> the client hung up on the invalid server), unless there is some way to
> cause qemu to die from an assertion failure rather than just get stuck.
With just patch 6/17 applied, the client still hangs, but this time with
a different trace:

30053@1502119637.604092:nbd_opt_go_success Export is good to go
30053@1502119637.604256:nbd_send_request Sending request to server: { .from = 0, .len = 1, .handle = 94716063746144, .flags = 0x0, .type = 0 (read) }
30053@1502119649.070156:nbd_receive_reply Got reply: { magic = 0x1446698, .error = 0, handle = 94716063746144 }
invalid magic (got 0x1446698)
read failed: Input/output error
30053@1502119649.070243:nbd_send_request Sending request to server: { .from = 0, .len = 0, .handle = 94716063746144, .flags = 0x0, .type = 3 (flush) }

The client still tried to send a flush request to the server, when it
should REALLY quit talking to the server at all and just disconnect,
because the moment the client recognizes that the server has sent
garbage is the moment that the client can no longer assume that the
server will react correctly to any further commands.

So I don't think your patch is quite correct, even if you have
identified a shortfall in our client code.
07.08.2017 18:33, Eric Blake wrote:
> On 08/07/2017 10:13 AM, Eric Blake wrote:
>> On 08/07/2017 07:56 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> 07.08.2017 14:52, Eric Blake wrote:
>>>> On 08/04/2017 10:14 AM, Vladimir Sementsov-Ogievskiy wrote:
>>>>> Set reply.handle to 0 on error path to prevent normal path of
>>>>> nbd_co_receive_reply.
>> ...and the client is recognizing that the server sent garbage, but then
>> proceeds to handle the packet anyway. The ideal reaction is that the
>> client should disconnect from the server, rather than handle the packet.
>> But because it didn't do that, the client is now unable to receive any
>> future messages from the server.
>> [traces snipped]
>> where the client is now hung. Thankfully, the client is blocked in an
>> idle loop (not eating CPU), so I don't know if this counts as the
>> ability for a malicious server to cause a denial of service against qemu
>> as an NBD client, unless there is some way to cause qemu to die from an
>> assertion failure rather than just get stuck.
> With just patch 6/17 applied, the client still hangs, but this time with
> a different trace:
> [trace snipped]
>
> The client still tried to send a flush request to the server, when it
> should REALLY quit talking to the server at all and just disconnect,
> because the moment the client recognizes that the server has sent
> garbage is the moment that the client can no longer assume that the
> server will react correctly to any further commands.
>
> So I don't think your patch is quite correct, even if you have
> identified a shortfall in our client code.

Why do you think so? It just does what it states in the commit message.

Patch 17 should fix the case (but I doubt that it can be taken
separately).
On 08/07/2017 11:09 AM, Vladimir Sementsov-Ogievskiy wrote:
>>>>> On 08/04/2017 10:14 AM, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> Set reply.handle to 0 on error path to prevent normal path of
>>>>>> nbd_co_receive_reply.
>>
>> The client still tried to send a flush request to the server, when it
>> should REALLY quit talking to the server at all and just disconnect,
>> because the moment the client recognizes that the server has sent
>> garbage is the moment that the client can no longer assume that the
>> server will react correctly to any further commands.
>>
>> So I don't think your patch is quite correct, even if you have
>> identified a shortfall in our client code.
>
> Why do you think so? It just does what it states in the commit message.

The commit message doesn't state much in the way of WHY. It is the
subsequent discussion that says that the reason WHY we need to fix
something is to make the client robustly hang up when it encounters a
broken server - but once you couch it in those terms, this patch is NOT
making the client hang up gracefully (it is making the client skip ONE
invalid reply, but then immediately lets the client send another request
and gets stuck waiting for a reply to that second request). A full fix
for the issue would make sure the client hangs up as soon as it detects
a bogus server.

> Patch 17 should fix the case (but I doubt that it can be taken
> separately).

It's okay if it takes more than one patch to fix an issue, if the
earlier patches document that more work is still needed. Or maybe we can
squash the best parts of 17 and this one. I'm still playing with the
best way to minimally address the issue for 2.10.
diff --git a/block/nbd-client.c b/block/nbd-client.c
index dc19894a7c..0c88d84de6 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -107,6 +107,7 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
         qemu_coroutine_yield();
     }
 
+    s->reply.handle = 0;
     nbd_recv_coroutines_enter_all(s);
     s->read_reply_co = NULL;
 }
Set reply.handle to 0 on error path to prevent normal path of
nbd_co_receive_reply.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/nbd-client.c | 1 +
 1 file changed, 1 insertion(+)