Message ID: 5030E5F2.7060903@redhat.com
On 19.08.2012 15:11, Paolo Bonzini wrote:
> No problem, my fault---I'm just back and I haven't really started again
> all my stuff, so the patch was not tested.
>
> This should fix it, though.

Booting works fine now. But the VM starts to hang after trying to unmap
large regions. No segfault or anything, it just stops responding.

Stefan
On 19/08/2012 21:22, Stefan Priebe - Profihost AG wrote:
>
>> No problem, my fault---I'm just back and I haven't really started again
>> all my stuff, so the patch was not tested.
>>
>> This should fix it, though.
>
> Booting works fine now. But the VM starts to hang after trying to unmap
> large regions. No segfault or anything, it just stops responding.

This is expected; unfortunately cancellation right now is a synchronous
operation in the block layer. SCSI is the first big user of
cancellation, and it would indeed benefit from asynchronous cancellation.

Without these three patches, you risk corruption in case the following
happens:

qemu                    target
-----------------------------------
send unmap   -------->
cancel unmap ------>
send write   -------->
             <---------------- complete write
                        <unmap just written sector>
             <---------------- complete unmap
             <---------------- cancellation done (unmap complete)

Paolo
On 20.08.2012 09:22, Paolo Bonzini wrote:
> On 19/08/2012 21:22, Stefan Priebe - Profihost AG wrote:
>>
>>> No problem, my fault---I'm just back and I haven't really started again
>>> all my stuff, so the patch was not tested.
>>>
>>> This should fix it, though.
>>
>> Booting works fine now. But the VM starts to hang after trying to unmap
>> large regions. No segfault or anything, it just stops responding.
>
> This is expected; unfortunately cancellation right now is a synchronous
> operation in the block layer. SCSI is the first big user of
> cancellation, and it would indeed benefit from asynchronous cancellation.
>
> Without these three patches, you risk corruption in case the following
> happens:
>
> qemu                    target
> -----------------------------------
> send unmap   -------->
> cancel unmap ------>
> send write   -------->
>              <---------------- complete write
>                         <unmap just written sector>
>              <---------------- complete unmap
>              <---------------- cancellation done (unmap complete)

Mhm, OK, that makes sense. But I cannot even log in via SSH, and I also
see no cancellation message in the kernel log. So what is the right way
to test these patches? With my version I can still work via SSH.

Stefan
On 20/08/2012 09:34, Stefan Priebe - Profihost AG wrote:
>>> Booting works fine now. But the VM starts to hang after trying to unmap
>>> large regions. No segfault or anything, it just stops responding.
>>
>> This is expected; unfortunately cancellation right now is a synchronous
>> operation in the block layer. SCSI is the first big user of
>> cancellation, and it would indeed benefit from asynchronous cancellation.
>>
>> Without these three patches, you risk corruption in case the following
>> happens:
>>
>> qemu                    target
>> -----------------------------------
>> send unmap   -------->
>> cancel unmap ------>
>> send write   -------->
>>              <---------------- complete write
>>                         <unmap just written sector>
>>              <---------------- complete unmap
>>              <---------------- cancellation done (unmap complete)
>
> Mhm, OK, that makes sense. But I cannot even log in via SSH

That's because the "big QEMU lock" is held by the thread that called
qemu_aio_cancel.

> and I also see
> no cancellation message in the kernel log.

And that's because the UNMAP actually ultimately succeeds. You'll
probably see soft lockup messages though.

The solution here is to bump the timeout of the UNMAP command (either in
the kernel or in libiscsi, I didn't really understand who's at fault).

Paolo
Hi Ronnie,

On 20.08.2012 10:08, Paolo Bonzini wrote:
> That's because the "big QEMU lock" is held by the thread that called
> qemu_aio_cancel.
>
>> and I also see
>> no cancellation message in the kernel log.
>
> And that's because the UNMAP actually ultimately succeeds. You'll
> probably see soft lockup messages though.
>
> The solution here is to bump the timeout of the UNMAP command (either in
> the kernel or in libiscsi, I didn't really understand who's at fault).

What's your suggestion / idea about that?

Greets,
Stefan
On Mon, Aug 20, 2012 at 6:12 PM, Stefan Priebe - Profihost AG
<s.priebe@profihost.ag> wrote:
> Hi Ronnie,
>
> On 20.08.2012 10:08, Paolo Bonzini wrote:
>
>> That's because the "big QEMU lock" is held by the thread that called
>> qemu_aio_cancel.
>>
>>> and I also see
>>> no cancellation message in the kernel log.
>>
>> And that's because the UNMAP actually ultimately succeeds. You'll
>> probably see soft lockup messages though.
>>
>> The solution here is to bump the timeout of the UNMAP command (either in
>> the kernel or in libiscsi, I didn't really understand who's at fault).
>
> What's your suggestion / idea about that?

There are no timeouts in libiscsi itself, but you can probably tweak it
through the guest kernel:

$ cat /sys/bus/scsi/devices/0\:0\:0\:0/timeout
30

should be the default SCSI timeout for this device, and

$ cat /sys/bus/scsi/devices/0\:0\:0\:0/queue_depth
31

would be how many concurrent I/Os the guest will drive to the device.

When performing the UNMAPs, maybe set the timeout to something really
big, and at the same time drop the queue depth to something really
small, so that the guest kernel will not be able to send so many UNMAPs
concurrently.

ronnie s
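[Editor's sketch] Ronnie's suggestion could be applied from inside the guest roughly as below. The H:C:T:L address `0:0:0:0` and the chosen values are only examples; pick values matching your device and workload. The writes are guarded so the snippet is harmless on a machine without that device.

```shell
# Tune the guest-side SCSI device before running large UNMAP batches.
# 0:0:0:0 is an example H:C:T:L address; adjust for your device.
DEV=/sys/bus/scsi/devices/0:0:0:0

# Raise the per-command timeout (in seconds) so long-running UNMAPs are
# not aborted by the guest kernel:
[ -w "$DEV/timeout" ] && echo 3600 > "$DEV/timeout"

# Lower the queue depth so fewer UNMAPs are in flight at the same time:
[ -w "$DEV/queue_depth" ] && echo 1 > "$DEV/queue_depth" || true
```

Restore the defaults (30 and 31 above) afterwards, since a tiny queue depth hurts normal I/O throughput.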
On 21.08.2012 00:36, ronnie sahlberg wrote:
> On Mon, Aug 20, 2012 at 6:12 PM, Stefan Priebe - Profihost AG
> <s.priebe@profihost.ag> wrote:
>> Hi Ronnie,
>>
>> On 20.08.2012 10:08, Paolo Bonzini wrote:
>>
>>> That's because the "big QEMU lock" is held by the thread that called
>>> qemu_aio_cancel.
>>>
>>>> and I also see
>>>> no cancellation message in the kernel log.
>>>
>>> And that's because the UNMAP actually ultimately succeeds. You'll
>>> probably see soft lockup messages though.
>>>
>>> The solution here is to bump the timeout of the UNMAP command (either in
>>> the kernel or in libiscsi, I didn't really understand who's at fault).
>>
>> What's your suggestion / idea about that?
>
> There are no timeouts in libiscsi itself, but you can probably tweak it
> through the guest kernel:
>
> $ cat /sys/bus/scsi/devices/0\:0\:0\:0/timeout
> 30
>
> should be the default SCSI timeout for this device, and
>
> $ cat /sys/bus/scsi/devices/0\:0\:0\:0/queue_depth
> 31
>
> would be how many concurrent I/Os the guest will drive to the device.
>
> When performing the UNMAPs, maybe set the timeout to something really
> big, and at the same time drop the queue depth to something really
> small, so that the guest kernel will not be able to send so many
> UNMAPs concurrently.

How is this relevant to the fact that I can't even work with SSH any
longer with Paolo's patch while UNMAPs are running? With my patchset you
can still work on the machine.

Greets,
Stefan
>> When performing the UNMAPs, maybe set the timeout to something really
>> big, and at the same time drop the queue depth to something really
>> small, so that the guest kernel will not be able to send so many
>> UNMAPs concurrently.
>
> How is this relevant to the fact that I can't even work with SSH any
> longer with Paolo's patch while UNMAPs are running? With my patchset
> you can still work on the machine.

It is relevant because if you tweak the timeouts you will not hit the
cancel at all.

I would like to get a trace of the commands that are sent, so that we
can attack the problem in the guest kernel. For example, if it's sending
UNMAPs that span 100GB or so, we could enforce a limit in the guest
kernel on the maximum number of discarded blocks per UNMAP. But if the
problem is the _number_ of UNMAPs rather than the size, it would need a
different heuristic.

Paolo
Hello list,

I wanted to use SCSI UNMAP with rbd. The rbd documentation says you need
to set discard_granularity=512 for the device. I'm using qemu 1.2.

If I set this and send an UNMAP command I get this kernel output:

Sense Key : Aborted Command [current]
sd 2:0:0:1: [sdb] Add. Sense: I/O process terminated
sd 2:0:0:1: [sdb] CDB: Write same(16): 93 08 00 00 00 00 00 00 00 00 00 7f ff ff 00 00
end_request: I/O error, dev sdb, sector 0
sd 2:0:0:1: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:1: [sdb] Sense Key : Aborted Command [current]
sd 2:0:0:1: [sdb] Add. Sense: I/O process terminated
sd 2:0:0:1: [sdb] CDB: Write same(16): 93 08 00 00 00 00 01 ff ff fc 00 7f ff ff 00 00
end_request: I/O error, dev sdb, sector 33554428
sd 2:0:0:1: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:1: [sdb] Sense Key : Aborted Command [current]
sd 2:0:0:1: [sdb] Add. Sense: I/O process terminated
sd 2:0:0:1: [sdb] CDB: Write same(16): 93 08 00 00 00 00 00 ff ff fe 00 7f ff ff 00 00
end_request: I/O error, dev sdb, sector 16777214

Greets,
Stefan
> I wanted to use SCSI UNMAP with rbd. The rbd documentation says you
> need to set discard_granularity=512 for the device. I'm using qemu 1.2.
>
> If I set this and send an UNMAP command I get this kernel output:

The discard request is failing. Please check why with a breakpoint on
rbd_aio_discard_wrapper or scsi_handle_rw_error, for example.

Paolo
On 06.11.2012 23:42, Paolo Bonzini wrote:
>
>> I wanted to use SCSI UNMAP with rbd. The rbd documentation says you
>> need to set discard_granularity=512 for the device. I'm using qemu 1.2.
>>
>> If I set this and send an UNMAP command I get this kernel output:
>
> The discard request is failing. Please check why with a breakpoint on
> rbd_aio_discard_wrapper or scsi_handle_rw_error, for example.

I have no idea how to set breakpoints and analyse what happens... I'm no
C coder, just Perl. I can certainly modify code and compile, but I don't
know how to do that.

Greets,
Stefan
Hi Paolo,

On 06.11.2012 23:42, Paolo Bonzini wrote:
>
>> I wanted to use SCSI UNMAP with rbd. The rbd documentation says you
>> need to set discard_granularity=512 for the device. I'm using qemu 1.2.
>>
>> If I set this and send an UNMAP command I get this kernel output:
>
> The discard request is failing. Please check why with a breakpoint on
> rbd_aio_discard_wrapper or scsi_handle_rw_error, for example.
>
> Paolo

I'm sorry, the discard requests aren't failing. Qemu / the block driver
starts to cancel a bunch of requests.

[   39.577194] sd 2:0:0:1: [sdb]
[   39.577194] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   39.577194] sd 2:0:0:1: [sdb]
[   39.577194] Sense Key : Aborted Command [current]
[   39.577194] sd 2:0:0:1: [sdb]
[   39.577194] Add. Sense: I/O process terminated
[   39.577194] sd 2:0:0:1: [sdb] CDB:
[   39.577194] Write same(16): 93 08 00 00 00 00 04 7f ff f7 00 62 00 09 00 00
[   39.577194] end_request: I/O error, dev sdb, sector 75497463

Greets,
Stefan
On 18/11/2012 23:00, Stefan Priebe wrote:
> Hi Paolo,
>
> On 06.11.2012 23:42, Paolo Bonzini wrote:
>>
>>> I wanted to use SCSI UNMAP with rbd. The rbd documentation says you
>>> need to set discard_granularity=512 for the device. I'm using qemu 1.2.
>>>
>>> If I set this and send an UNMAP command I get this kernel output:
>>
>> The discard request is failing. Please check why with a breakpoint on
>> rbd_aio_discard_wrapper or scsi_handle_rw_error, for example.
>>
>> Paolo
>
> I'm sorry, the discard requests aren't failing. Qemu / the block driver
> starts to cancel a bunch of requests.

That is being done in the kernel (the guest, I think) because the UNMAPs
are taking too long.

Paolo

> [   39.577194] sd 2:0:0:1: [sdb]
> [   39.577194] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [   39.577194] sd 2:0:0:1: [sdb]
> [   39.577194] Sense Key : Aborted Command [current]
> [   39.577194] sd 2:0:0:1: [sdb]
> [   39.577194] Add. Sense: I/O process terminated
> [   39.577194] sd 2:0:0:1: [sdb] CDB:
> [   39.577194] Write same(16): 93 08 00 00 00 00 04 7f ff f7 00 62 00 09 00 00
> [   39.577194] end_request: I/O error, dev sdb, sector 75497463
Hi Paolo,

On 19.11.2012 09:10, Paolo Bonzini wrote:
>> I'm sorry, the discard requests aren't failing. Qemu / the block driver
>> starts to cancel a bunch of requests.
>
> That is being done in the kernel (the guest, I think) because the UNMAPs
> are taking too long.

That makes sense. RBD handles discards as buffered I/O. When I do an
mkfs.xfs on a 30GB device I see around 900 pending discard requests; rbd
finishes them all before reporting success.

If it is correct that a 3.6.7 kernel sends that many discard requests, I
only see a solution in using unbuffered I/O for discards.

Do you know what is the correct way?

Greets,
Stefan
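[Editor's sketch] As a guest-side stopgap for the mkfs-time discard storm described above (this is not the kernel-side fix the thread converges on; the device and mount point below are examples):

```shell
# Skip the discard of the whole device that mkfs.xfs performs by default
# (-K = do not attempt to discard blocks at mkfs time):
mkfs.xfs -K /dev/sdb
mount /dev/sdb /mnt/testdisk

# Later, trim the filesystem in bounded chunks instead of all at once.
# -o/-l limit the byte range per invocation, -m skips small extents:
fstrim -v -o 0 -l 1G -m 16M /mnt/testdisk
```

This keeps the number of outstanding UNMAPs per batch small, so the guest kernel's command timeout is less likely to fire while rbd drains them.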
On 19/11/2012 10:36, Stefan Priebe - Profihost AG wrote:
> Hi Paolo,
>
> On 19.11.2012 09:10, Paolo Bonzini wrote:
>>> I'm sorry, the discard requests aren't failing. Qemu / the block driver
>>> starts to cancel a bunch of requests.
>>
>> That is being done in the kernel (the guest, I think) because the UNMAPs
>> are taking too long.
>
> That makes sense. RBD handles discards as buffered I/O. When I do an
> mkfs.xfs on a 30GB device I see around 900 pending discard requests; rbd
> finishes them all before reporting success.
>
> If it is correct that a 3.6.7 kernel sends that many discard requests, I
> only see a solution in using unbuffered I/O for discards.
>
> Do you know what is the correct way?

I think the correct fix is to serialize them in the kernel.

Paolo
On 19.11.2012 10:54, Paolo Bonzini wrote:
> On 19/11/2012 10:36, Stefan Priebe - Profihost AG wrote:
>> Hi Paolo,
>>
>> On 19.11.2012 09:10, Paolo Bonzini wrote:
>>>> I'm sorry, the discard requests aren't failing. Qemu / the block driver
>>>> starts to cancel a bunch of requests.
>>>
>>> That is being done in the kernel (the guest, I think) because the UNMAPs
>>> are taking too long.
>>
>> That makes sense. RBD handles discards as buffered I/O. When I do an
>> mkfs.xfs on a 30GB device I see around 900 pending discard requests; rbd
>> finishes them all before reporting success.
>>
>> If it is correct that a 3.6.7 kernel sends that many discard requests, I
>> only see a solution in using unbuffered I/O for discards.
>>
>> Do you know what is the correct way?
>
> I think the correct fix is to serialize them in the kernel.

So you mean this is not a bug in rbd or qemu; this is a general bug in
the Linux kernel since they implemented discard?

Greets,
Stefan
On 19/11/2012 10:59, Stefan Priebe - Profihost AG wrote:
>>>
>>>
>>> Do you know what is the correct way?
>>
>> I think the correct fix is to serialize them in the kernel.
>
> So you mean this is not a bug in rbd or qemu; this is a general bug in
> the Linux kernel since they implemented discard?

Yes.

Paolo
On 19.11.2012 11:06, Paolo Bonzini wrote:
> On 19/11/2012 10:59, Stefan Priebe - Profihost AG wrote:
>>>>
>>>>
>>>> Do you know what is the correct way?
>>>
>>> I think the correct fix is to serialize them in the kernel.
>>
>> So you mean this is not a bug in rbd or qemu; this is a general bug in
>> the Linux kernel since they implemented discard?
>
> Yes.

As you're known in the Linux dev community ;-) might you open a thread /
discussion on the Linux kernel mailing list, CC'ing me, regarding this
problem? I think when I open one, no one will listen ;-)

Greets,
Stefan
On 19/11/2012 11:13, Stefan Priebe - Profihost AG wrote:
>>>>
>>>
>>> So you mean this is not a bug in rbd or qemu; this is a general bug in
>>> the Linux kernel since they implemented discard?
>>
>> Yes.
>
> As you're known in the Linux dev community ;-) might you open a thread /
> discussion on the Linux kernel mailing list, CC'ing me, regarding this
> problem? I think when I open one, no one will listen ;-)

Yeah, I'll try making and sending a patch; that's usually a good way to
convince those guys to listen...

Paolo
On 19.11.2012 11:23, Paolo Bonzini wrote:
> On 19/11/2012 11:13, Stefan Priebe - Profihost AG wrote:
>>>>>
>>>>
>>>> So you mean this is not a bug in rbd or qemu; this is a general bug in
>>>> the Linux kernel since they implemented discard?
>>>
>>> Yes.
>>
>> As you're known in the Linux dev community ;-) might you open a thread /
>> discussion on the Linux kernel mailing list, CC'ing me, regarding this
>> problem? I think when I open one, no one will listen ;-)
>
> Yeah, I'll try making and sending a patch; that's usually a good way to
> convince those guys to listen...

Thanks. But do you have any idea why it works with an iscsi / libiscsi
backend? In that case the kernel does not cancel the I/O.

Greets,
Stefan
On 19/11/2012 11:30, Stefan Priebe - Profihost AG wrote:
>
>
> But do you have any idea why it works with an iscsi / libiscsi backend?
> In that case the kernel does not cancel the I/O.

It tries to, but the command ultimately succeeds and the success
response "undoes" the cancel.

http://thread.gmane.org/gmane.comp.emulators.qemu/166232/focus=166316

See also the whole thread (and I think there was a part of it offlist,
too).

Paolo
Yeah, that's my old thread regarding iscsi and unmap, but this works
fine now since you patched qemu.

Stefan

On 19.11.2012 11:36, Paolo Bonzini wrote:
> On 19/11/2012 11:30, Stefan Priebe - Profihost AG wrote:
>>
>>
>> But do you have any idea why it works with an iscsi / libiscsi backend?
>> In that case the kernel does not cancel the I/O.
>
> It tries to, but the command ultimately succeeds and the success
> response "undoes" the cancel.
>
> http://thread.gmane.org/gmane.comp.emulators.qemu/166232/focus=166316
>
> See also the whole thread (and I think there was a part of it offlist,
> too).
>
> Paolo
On 19/11/2012 11:57, Stefan Priebe - Profihost AG wrote:
> Yeah, that's my old thread regarding iscsi and unmap, but this works
> fine now since you patched qemu.

It still causes hangs, no? Though it works apart from that.

Paolo

> Stefan
>
> On 19.11.2012 11:36, Paolo Bonzini wrote:
>> On 19/11/2012 11:30, Stefan Priebe - Profihost AG wrote:
>>>
>>>
>>> But do you have any idea why it works with an iscsi / libiscsi backend?
>>> In that case the kernel does not cancel the I/O.
>>
>> It tries to, but the command ultimately succeeds and the success
>> response "undoes" the cancel.
>>
>> http://thread.gmane.org/gmane.comp.emulators.qemu/166232/focus=166316
>>
>> See also the whole thread (and I think there was a part of it offlist,
>> too).
>>
>> Paolo
On 19.11.2012 12:16, Paolo Bonzini wrote:
> On 19/11/2012 11:57, Stefan Priebe - Profihost AG wrote:
>> Yeah, that's my old thread regarding iscsi and unmap, but this works
>> fine now since you patched qemu.
>
> It still causes hangs, no? Though it works apart from that.

iscsi/libiscsi and discards work fine since your latest patches:

1bd075f29ea6d11853475c7c42734595720c3ac6
cfb3f5064af2d2e29c976e292c9472dfe9d61e31
27cbd828c617944c0f9603763fdf4fa87e7ad923
b20909195745c34a819aed14ae996b60ab0f591f

But in rbd it starts to cancel the discard requests, both using the same
guest and host kernel with the same QEMU version.

Greets,
Stefan
On 19/11/2012 12:49, Stefan Priebe - Profihost AG wrote:
>>
>> It still causes hangs, no? Though it works apart from that.
>
> iscsi/libiscsi and discards work fine since your latest patches:
>
> 1bd075f29ea6d11853475c7c42734595720c3ac6
> cfb3f5064af2d2e29c976e292c9472dfe9d61e31
> 27cbd828c617944c0f9603763fdf4fa87e7ad923
> b20909195745c34a819aed14ae996b60ab0f591f

Does the console still hang, though?

> But in rbd it starts to cancel the discard requests, both using the same
> guest and host kernel with the same QEMU version.

rbd's cancellation is broken like libiscsi's was. So the cancellation
apparently succeeds, but it is subject to the same race introduced in
commit 64e69e8 (iscsi: Fix NULL dereferences / races between task
completion and abort, 2012-08-15) for libiscsi.

It risks corruption in case the following happens:

guest                  qemu                    rbd server
=========================================================================
send write 1 -------->
                       send write 1 ------>
cancel write 1 ------>
                       cancel write 1 ---->
             <------------------ cancelled
send write 2 -------->
                       send write 2 ------>
                            <--------------- completed write 2
             <------------------ completed write 2
                            <--------------- completed write 1

Here, the guest would see write 2 superseding write 1, when in fact the
outcome could have been the opposite. The right behavior is to return
only after the target says whether the cancellation was done or not.
For libiscsi, it was implemented by the commits you mention.

Paolo
On 19.11.2012 13:24, Paolo Bonzini wrote:
> On 19/11/2012 12:49, Stefan Priebe - Profihost AG wrote:
>>>
>>> It still causes hangs, no? Though it works apart from that.
>>
>> iscsi/libiscsi and discards work fine since your latest patches:
>> 1bd075f29ea6d11853475c7c42734595720c3ac6
>> cfb3f5064af2d2e29c976e292c9472dfe9d61e31
>> 27cbd828c617944c0f9603763fdf4fa87e7ad923
>> b20909195745c34a819aed14ae996b60ab0f591f
>
> Does the console still hang, though?

No, everything is absolutely fine.

>> But in rbd it starts to cancel the discard requests, both using the same
>> guest and host kernel with the same QEMU version.
>
> rbd's cancellation is broken like libiscsi's was. So the cancellation
> apparently succeeds, but it is subject to the same race introduced in
> commit 64e69e8 (iscsi: Fix NULL dereferences / races between task
> completion and abort, 2012-08-15) for libiscsi.
>
> It risks corruption in case the following happens:
>
> guest                  qemu                    rbd server
> =========================================================================
> send write 1 -------->
>                        send write 1 ------>
> cancel write 1 ------>
>                        cancel write 1 ---->
>              <------------------ cancelled
> send write 2 -------->
>                        send write 2 ------>
>                             <--------------- completed write 2
>              <------------------ completed write 2
>                             <--------------- completed write 1
>
> Here, the guest would see write 2 superseding write 1, when in fact the
> outcome could have been the opposite. The right behavior is to return
> only after the target says whether the cancellation was done or not.
> For libiscsi, it was implemented by the commits you mention.

So is the whole bunch of changes needed for rbd? Or just 64e69e8? Or all
of them except 64e69e8, as 64e69e8 introduced the problem?

Stefan
On 19/11/2012 14:01, Stefan Priebe - Profihost AG wrote:
>> The right behavior is to return
>> only after the target says whether the cancellation was done or not.
>> For libiscsi, it was implemented by the commits you mention.
>
> So is the whole bunch of changes needed for rbd?

Something like the first three:

1bd075f29ea6d11853475c7c42734595720c3ac6
cfb3f5064af2d2e29c976e292c9472dfe9d61e31
27cbd828c617944c0f9603763fdf4fa87e7ad923

Paolo
Hi Paolo,
Hi Josh,

I started to port the iscsi fixes to rbd. The one thing I'm missing is:
how do I tell rbd to cancel I/O on the server side?

I grepped the librbd source code for abort / cancel but didn't find
anything.

Qemu must be able to tell rbd to cancel a specific I/O request.

Greets,
Stefan

On 19.11.2012 14:06, Paolo Bonzini wrote:
> On 19/11/2012 14:01, Stefan Priebe - Profihost AG wrote:
>>> The right behavior is to return
>>> only after the target says whether the cancellation was done or not.
>>> For libiscsi, it was implemented by the commits you mention.
>>
>> So is the whole bunch of changes needed for rbd?
>
> Something like the first three:
>
> 1bd075f29ea6d11853475c7c42734595720c3ac6
> cfb3f5064af2d2e29c976e292c9472dfe9d61e31
> 27cbd828c617944c0f9603763fdf4fa87e7ad923
>
> Paolo
On 19/11/2012 15:16, Stefan Priebe - Profihost AG wrote:
> Hi Paolo,
> Hi Josh,
>
> I started to port the iscsi fixes to rbd. The one thing I'm missing is:
> how do I tell rbd to cancel I/O on the server side?
>
> I grepped the librbd source code for abort / cancel but didn't find
> anything.
>
> Qemu must be able to tell rbd to cancel a specific I/O request.

Just don't. QEMU will wait for rbd to finish the operation, instead of
actually having it cancelled.

Paolo
diff --git a/block/iscsi.c b/block/iscsi.c
index 74ada64..0b96165 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -247,6 +247,7 @@ iscsi_aio_writev(BlockDriverState *bs, int64_t sector_num,
     acb->qiov = qiov;
 
     acb->canceled = 0;
+    acb->bh = NULL;
     acb->status = -EINPROGRESS;