| Message ID | cf837995a28824a17f65060e9515b9a8acf936fe.1350901963.git.nick@bytemark.co.uk |
|---|---|
| State | New |
On 22.10.2012 13:09, nick@bytemark.co.uk wrote:
> This is unlikely to come up now, but is a necessary prerequisite for
> reconnection behaviour.
>
> Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
> ---
>  block/nbd.c | 13 +++++++++++--
>  1 files changed, 11 insertions(+), 2 deletions(-)

What's the real requirement here? Silently ignoring a flush and
returning success for it feels wrong. Why is it correct?

Kevin
On Tue, 2012-10-23 at 12:33 +0200, Kevin Wolf wrote:
> What's the real requirement here? Silently ignoring a flush and
> returning success for it feels wrong. Why is it correct?
>
> Kevin

I just needed to avoid socket operations while s->sock == -1, and
extending the existing case of "can't do the command, so pretend I did
it" to "can't do the command right now, so pretend..." seemed like an
easy way out.

In the Bytemark case, the NBD server always opens the file O_SYNC, so
nbd_co_flush could check in_flight == 0 and return 0/1 based on that;
but I'd be surprised if that's true for all NBD servers. Should we be
returning 1 here for both "not supported" and "can't do it right now",
instead?

/Nick
On 23.10.2012 13:08, Nicholas Thomas wrote:
> I just needed to avoid socket operations while s->sock == -1, and
> extending the existing case of "can't do the command, so pretend I did
> it" to "can't do the command right now, so pretend..." seemed like an
> easy way out.

The difference is that NBD_FLAG_SEND_FLUSH is only clear if the server
is configured like that, i.e. it's completely in the control of the
user, and it ignores flushes consistently (for example because it
already uses a writethrough mode so that flushes aren't required).
Network outage usually isn't in the control of the user, and ignoring a
flush request when the server actually asked for flushes is completely
surprising and dangerous.

> In the Bytemark case, the NBD server always opens the file O_SYNC, so
> nbd_co_flush could check in_flight == 0 and return 0/1 based on that;
> but I'd be surprised if that's true for all NBD servers. Should we be
> returning 1 here for both "not supported" and "can't do it right now",
> instead?

The best return value is probably -ENOTCONN, or -EIO if you must return
an error. But maybe using the same logic as for writes would be
appropriate, i.e. try to reestablish the connection before you return an
error.

Kevin
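The policy Kevin suggests, try to reconnect first and only then fail with
-ENOTCONN, can be sketched as a tiny self-contained model. The function
name and the boolean inputs are illustrative stand-ins for the real
driver state, not QEMU's actual API:

```c
#include <errno.h>

/* Toy model of the suggested nbd_co_flush behaviour: if the server never
 * advertised NBD_FLAG_SEND_FLUSH, a flush is a legitimate no-op; if we
 * are disconnected, attempt a reconnect before giving up with -ENOTCONN.
 * The three flags stand in for real connection/negotiation state. */
int flush_policy(int server_sends_flush, int connected, int reconnect_ok)
{
    if (!server_sends_flush) {
        return 0;               /* user configured the server that way */
    }
    if (!connected && !reconnect_ok) {
        return -ENOTCONN;       /* disconnected and reconnect failed */
    }
    return 0;                   /* connected (again): issue NBD_CMD_FLUSH */
}
```

The important difference from the original patch is the last branch: a
flush while disconnected either succeeds after a real reconnect or
returns an error, but is never silently dropped.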
Nicholas Thomas wrote:
> I just needed to avoid socket operations while s->sock == -1, and
> extending the existing case of "can't do the command, so pretend I did
> it" to "can't do the command right now, so pretend..." seemed like an
> easy way out.

Hi Nicholas,

Ignoring a flush is another way of saying "corrupt my data" in some
circumstances. We have options in QEMU already to say whether flushes
are ignored on normal discs, but if someone's chosen the "I really care
about my database/filesystem" option, and verified that their NBD setup
really performs them (in normal circumstances), silently dropping
flushes from time to time isn't nice.

I would much rather the guest is forced to wait until reconnection and
then get a successful flush, if the problem is just that the server was
down briefly. Or, if that is too hard, that the flush is

Since the I/O _order_ before, and sometimes after, flush, is important
for data integrity, this needs to be maintained when I/Os are queued in
the disconnected state -- including those which were in flight at the
time disconnect was detected and then retried on reconnect.

Ignoring a discard is not too bad. However, if discard is retried, then
I/O order is important in relation to those as well.

> In the Bytemark case, the NBD server always opens the file O_SYNC, so
> nbd_co_flush could check in_flight == 0 and return 0/1 based on that;
> but I'd be surprised if that's true for all NBD servers. Should we be
> returning 1 here for both "not supported" and "can't do it right now",
> instead?

When the server is opening the file O_SYNC, wouldn't it make sense to
tell QEMU -- and the guest -- that there's no need to send flushes at
all, as it's equivalent to a disk with no write-cache (or disabled)?

Best,
-- Jamie
On Tue, 2012-10-23 at 16:02 +0100, Jamie Lokier wrote:
> Ignoring a flush is another way of saying "corrupt my data" in some
> circumstances. We have options in QEMU already to say whether flushes
> are ignored on normal discs, but if someone's chosen the "I really
> care about my database/filesystem" option, and verified that their NBD
> setup really performs them (in normal circumstances), silently
> dropping flushes from time to time isn't nice.
>
> I would much rather the guest is forced to wait until reconnection and
> then get a successful flush, if the problem is just that the server
> was down briefly. Or, if that is too hard, that the flush is

OK, that's certainly doable. Revised patches hopefully showing up this
week or (at a push) next.

> Since the I/O _order_ before, and sometimes after, flush, is important
> for data integrity, this needs to be maintained when I/Os are queued in
> the disconnected state -- including those which were in flight at the
> time disconnect was detected and then retried on reconnect.

Hmm, discussing this on IRC I was told that it wasn't necessary to
preserve order - although I forget the fine detail. Depending on the
implementation of qemu's coroutine mutexes, operations may not actually
be performed in order right now - it's not too easy to work out what's
happening.

I've also just noticed that flush & discard don't take the send_mutex
before writing to the socket. That can't be intentional, surely? Paolo?

I've got a proof-of-concept that makes flush, discard, readv_1 and
writev_1 call a common nbd_co_handle_request function. This retries
I/Os when a connection goes away and comes back, and lets me make this
patch unnecessary. It doesn't preserve I/O ordering, but it won't be too
hard to add that if it is actually necessary.

> Ignoring a discard is not too bad. However, if discard is retried,
> then I/O order is important in relation to those as well.
>
> > In the Bytemark case, the NBD server always opens the file O_SYNC, so
> > nbd_co_flush could check in_flight == 0 and return 0/1 based on that;
> > but I'd be surprised if that's true for all NBD servers. Should we be
> > returning 1 here for both "not supported" and "can't do it right now",
> > instead?
>
> When the server is opening the file O_SYNC, wouldn't it make sense to
> tell QEMU -- and the guest -- that there's no need to send flushes at
> all, as it's equivalent to a disk with no write-cache (or disabled)?

Probably; our NBD server still implements the old-style handshakes, as
far as I recall, and doesn't actually implement the flush or discard
commands as a result.
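The "common handler that retries after reconnect" idea Nick describes
can be sketched roughly as follows. Everything here, the function name
`handle_request`, the callback signatures, and the error conventions, is
a hypothetical stand-in for the proof-of-concept, not its actual code:

```c
#include <errno.h>

/* Sketch of a common request path: every request type funnels through
 * one helper that, if the connection is gone (sock < 0), tries to
 * reconnect and resend once before reporting an error to the caller. */
int handle_request(int *sock, int (*send_fn)(int sock),
                   int (*reconnect_fn)(void))
{
    for (int attempt = 0; attempt < 2; attempt++) {
        if (*sock < 0) {
            int s = reconnect_fn();
            if (s < 0) {
                return -ENOTCONN;   /* still disconnected: give up */
            }
            *sock = s;
        }
        int ret = send_fn(*sock);
        if (ret != -EPIPE) {
            return ret;             /* success, or a non-connection error */
        }
        *sock = -1;                 /* connection died mid-request: retry */
    }
    return -EIO;                    /* died twice in a row */
}

/* Demonstration stubs: a server that accepts reconnects on fd 3, and a
 * send that succeeds whenever it has a valid socket. */
static int demo_reconnect(void) { return 3; }
static int demo_send(int sock) { return sock >= 0 ? 0 : -EPIPE; }
```

Note this sketch makes no ordering promise between retried requests,
which is exactly the gap Jamie raises above.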
On 24.10.2012 14:16, Nicholas Thomas wrote:
> On Tue, 2012-10-23 at 16:02 +0100, Jamie Lokier wrote:
>> Since the I/O _order_ before, and sometimes after, flush, is important
>> for data integrity, this needs to be maintained when I/Os are queued in
>> the disconnected state -- including those which were in flight at the
>> time disconnect was detected and then retried on reconnect.
>
> Hmm, discussing this on IRC I was told that it wasn't necessary to
> preserve order - although I forget the fine detail. Depending on the
> implementation of qemu's coroutine mutexes, operations may not actually
> be performed in order right now - it's not too easy to work out what's
> happening.

It's possible to reorder, but it must be consistent with the order in
which completion is signalled to the guest. The semantics of flush is
that at the point that the flush completes, all writes to the disk that
have already completed successfully are stable. It doesn't say anything
about writes that are still in flight; they may or may not be flushed to
disk.

Kevin
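Kevin's rule, that a completed flush covers exactly the writes that had
already completed before it, can be captured in a toy model. The struct
and function names are illustrative only:

```c
#include <stdbool.h>

#define MAX_REQS 16

/* Toy model of flush semantics: when a flush completes, every write
 * whose completion was already signalled must be on stable storage;
 * writes still in flight at that point carry no such guarantee. */
struct disk_model {
    bool completed[MAX_REQS];  /* write N's completion signalled to guest */
    bool stable[MAX_REQS];     /* write N guaranteed on stable storage */
};

void write_complete(struct disk_model *d, int n)
{
    d->completed[n] = true;
}

void flush_complete(struct disk_model *d)
{
    /* Only writes completed at flush time are promised to be stable. */
    for (int n = 0; n < MAX_REQS; n++) {
        if (d->completed[n]) {
            d->stable[n] = true;
        }
    }
}
```

A write that completes after the flush was issued may or may not be
durable; the model leaves its `stable` flag untouched, matching "they
may or may not be flushed to disk".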
On 24/10/2012 14:16, Nicholas Thomas wrote:
> I've also just noticed that flush & discard don't take the send_mutex
> before writing to the socket. That can't be intentional, surely? Paolo?

No, it's a bug.

Paolo
On 24/10/2012 16:03, Paolo Bonzini wrote:
> On 24/10/2012 14:16, Nicholas Thomas wrote:
>> I've also just noticed that flush & discard don't take the send_mutex
>> before writing to the socket. That can't be intentional, surely? Paolo?
>
> No, it's a bug.

Hmm, why do you say it doesn't take the mutex? nbd_co_flush and
nbd_co_discard both call nbd_co_send_request, which takes the mutex.

Paolo
On Wed, 2012-10-24 at 16:10 +0200, Paolo Bonzini wrote:
> Hmm, why do you say it doesn't take the mutex? nbd_co_flush and
> nbd_co_discard both call nbd_co_send_request, which takes the mutex.
>
> Paolo

Sorry, brain fail. You're right: no bug :)

/Nick
Kevin Wolf wrote:
> On 24.10.2012 14:16, Nicholas Thomas wrote:
>> Hmm, discussing this on IRC I was told that it wasn't necessary to
>> preserve order - although I forget the fine detail. Depending on the
>> implementation of qemu's coroutine mutexes, operations may not actually
>> be performed in order right now - it's not too easy to work out what's
>> happening.
>
> It's possible to reorder, but it must be consistent with the order in
> which completion is signalled to the guest. The semantics of flush is
> that at the point that the flush completes, all writes to the disk that
> already have completed successfully are stable. It doesn't say anything
> about writes that are still in flight, they may or may not be flushed to
> disk.

I admit I wasn't thinking clearly how much ordering NBD actually
guarantees (or if there's ordering the guest depends on implicitly even
if it's not guaranteed in specification), and how that is related within
QEMU to virtio/FUA/NCQ/TCQ/SCSI-ORDERED ordering guarantees that the
guest expects for various emulated devices and their settings.

The ordering (if any) needed from the NBD driver (or any backend) is
going to depend on the assumptions baked into the interface between QEMU
device emulation <-> backend.

E.g. if every device emulation waited for all outstanding writes to
complete before sending a flush, then it wouldn't matter how the backend
reordered its requests, even getting the completions out of order.

Is that relationship documented (and conformed to)?

-- Jamie
On 24/10/2012 16:32, Jamie Lokier wrote:
> I admit I wasn't thinking clearly how much ordering NBD actually
> guarantees (or if there's ordering the guest depends on implicitly
> even if it's not guaranteed in specification),

Here NBD is used just as a transport on the host and is totally
invisible to the guest. So NBD pretty much has to implement whatever
ordering guarantees the guest needs.

> E.g. if every device emulation waited for all outstanding writes to
> complete before sending a flush, then it wouldn't matter how the
> backend reordered its requests, even getting the completions out of
> order.

The NBD implementation in QEMU (which Nick is using) is completely
asynchronous.

Paolo

> Is that relationship documented (and conformed to)?
>
> -- Jamie
On 24.10.2012 16:32, Jamie Lokier wrote:
> I admit I wasn't thinking clearly how much ordering NBD actually
> guarantees (or if there's ordering the guest depends on implicitly
> even if it's not guaranteed in specification), and how that is related
> within QEMU to virtio/FUA/NCQ/TCQ/SCSI-ORDERED ordering guarantees
> that the guest expects for various emulated devices and their settings.
>
> The ordering (if any) needed from the NBD driver (or any backend) is
> going to depend on the assumptions baked into the interface between
> QEMU device emulation <-> backend.
>
> E.g. if every device emulation waited for all outstanding writes to
> complete before sending a flush, then it wouldn't matter how the
> backend reordered its requests, even getting the completions out of
> order.
>
> Is that relationship documented (and conformed to)?

No, like so many other things in qemu it's not spelt out explicitly.
However, as I understand it, it's the same behaviour as real hardware
has, so device emulation at least for the common devices doesn't have
to implement anything special for it - if the hardware even supports
parallel requests; otherwise it would automatically only have a single
request in flight (like IDE).

Kevin
Kevin Wolf wrote:
> On 24.10.2012 16:32, Jamie Lokier wrote:
>> The ordering (if any) needed from the NBD driver (or any backend) is
>> going to depend on the assumptions baked into the interface between
>> QEMU device emulation <-> backend.
>>
>> E.g. if every device emulation waited for all outstanding writes to
>> complete before sending a flush, then it wouldn't matter how the
>> backend reordered its requests, even getting the completions out of
>> order.
>>
>> Is that relationship documented (and conformed to)?
>
> No, like so many other things in qemu it's not spelt out explicitly.
> However, as I understand it it's the same behaviour as real hardware
> has, so device emulation at least for the common devices doesn't have
> to implement anything special for it. If the hardware even supports
> parallel requests, otherwise it would automatically only have a single
> request in flight (like IDE).

That's why I mention virtio/FUA/NCQ/TCQ/SCSI-ORDERED, which are quite
common.

They are features of devices which support multiple parallel requests,
but with certain ordering constraints conveyed by or expected by the
guest, which has to be ensured when it's mapped onto a QEMU fully
asynchronous backend.

That means they are features of the hardware which device emulations
_do_ have to implement. If they don't, the storage is unreliable on
things like host power removal and virtual power removal.

If the backends are allowed to explicitly have no coupling between
different request types (even flush/discard and write), and ordering
constraints are being enforced by the order in which device emulations
submit and wait, that's fine.

I mention this, because POSIX aio_fsync() is _not_ fully decoupled
according to its specification.

So it might be that some device emulations are depending on the
semantics of aio_fsync() or the QEMU equivalent by now; and randomly
reordering in the NBD driver in unusual circumstances (or any other
backend) would break those semantics.

-- Jamie
On 25.10.2012 19:09, Jamie Lokier wrote:
> That's why I mention virtio/FUA/NCQ/TCQ/SCSI-ORDERED, which are quite
> common.
>
> They are features of devices which support multiple parallel requests,
> but with certain ordering constraints conveyed by or expected by the
> guest, which has to be ensured when it's mapped onto a QEMU fully
> asynchronous backend.
>
> That means they are features of the hardware which device emulations
> _do_ have to implement. If they don't, the storage is unreliable on
> things like host power removal and virtual power removal.

Yes, device emulations that need to maintain a given order must pay
attention to wait for completion of the previous requests.

> If the backends are allowed to explicitly have no coupling between
> different request types (even flush/discard and write), and ordering
> constraints are being enforced by the order in which device emulations
> submit and wait, that's fine.
>
> I mention this, because POSIX aio_fsync() is _not_ fully decoupled
> according to its specification.
>
> So it might be that some device emulations are depending on the
> semantics of aio_fsync() or the QEMU equivalent by now; and randomly
> reordering in the NBD driver in unusual circumstances (or any other
> backend) would break those semantics.

qemu AIO has always had these semantics since bdrv_aio_flush() was
introduced. It behaves the same way for image files. So I don't see any
problem with NBD making use of the same.

Kevin
diff --git a/block/nbd.c b/block/nbd.c
index 2bce47b..9536408 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -71,6 +71,11 @@ typedef struct BDRVNBDState {
     char *host_spec;
 } BDRVNBDState;
 
+static bool nbd_is_connected(BDRVNBDState *s)
+{
+    return s->sock >= 0;
+}
+
 static int nbd_config(BDRVNBDState *s, const char *filename, int flags)
 {
     char *file;
@@ -309,6 +314,7 @@ static void nbd_teardown_connection(BlockDriverState *bs)
 
     qemu_aio_set_fd_handler(s->sock, NULL, NULL, NULL, NULL);
     closesocket(s->sock);
+    s->sock = -1;
 }
 
 static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
@@ -316,6 +322,8 @@ static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
     BDRVNBDState *s = bs->opaque;
     int result;
 
+    s->sock = -1;
+
     qemu_co_mutex_init(&s->send_mutex);
     qemu_co_mutex_init(&s->free_sema);
 
@@ -431,7 +439,7 @@ static int nbd_co_flush(BlockDriverState *bs)
     struct nbd_reply reply;
     ssize_t ret;
 
-    if (!(s->nbdflags & NBD_FLAG_SEND_FLUSH)) {
+    if (!nbd_is_connected(s) || !(s->nbdflags & NBD_FLAG_SEND_FLUSH)) {
         return 0;
     }
 
@@ -462,7 +470,7 @@ static int nbd_co_discard(BlockDriverState *bs, int64_t sector_num,
     struct nbd_reply reply;
     ssize_t ret;
 
-    if (!(s->nbdflags & NBD_FLAG_SEND_TRIM)) {
+    if (!nbd_is_connected(s) || !(s->nbdflags & NBD_FLAG_SEND_TRIM)) {
         return 0;
     }
     request.type = NBD_CMD_TRIM;
@@ -515,3 +523,4 @@ static void bdrv_nbd_init(void)
 }
 
 block_init(bdrv_nbd_init);
+
This is unlikely to come up now, but is a necessary prerequisite for
reconnection behaviour.

Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
---
 block/nbd.c | 13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)