Message ID | 51190352.2030708@redhat.com |
---|---|
State | New |
Headers | show |
Hi, thanks - i applied the patch to the latest master. I hope that this will solve my issue. Will this one get integrated in 1.4 final? Greets, Stefan Am 11.02.2013 15:42, schrieb Paolo Bonzini: > Il 11/02/2013 15:18, Stefan Priebe - Profihost AG ha scritto: >>>> Some trace that a request was actually cancelled, but I think I >>>> believe >> Ah but that must be in guest not on host right? How to grab that from >> client when it is crashing? > > Serial console could have something like "sda: aborting command". It is actually interesting to see what is causing commands to be aborted (typically a timeout, but what causes the timeout? :). > >>>> that. This seems to be the same issue as commits >>>> 1bd075f29ea6d11853475c7c42734595720c3ac6 (iSCSI) and >>>> 473c7f0255920bcaf37411990a3725898772817f (rbd), where the "cancelled" >>>> callback is called before the "complete" callback. >> If there is the same code in virtio-scsi it might be. > > No, virtio-scsi is relying on the backends (including scsi-disk) > doing it correctly. The RBD code looks okay, so it's still my > fault :) but not virtio-scsi's. > > I think this happens when a request is split into multiple parts, > and one of them is canceled. Then the next part is fired, but > virtio-scsi's cancellation callbacks have fired already. > > You can test this patch: > > diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c > index 07220e4..1d8289c 100644 > --- a/hw/scsi/scsi-disk.c > +++ b/hw/scsi/scsi-disk.c > @@ -221,6 +221,10 @@ static void scsi_write_do_fua(SCSIDiskReq *r) > { > SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev); > > + if (r->req.io_canceled) { > + return; > + } > + > if (scsi_is_cmd_fua(&r->req.cmd)) { > bdrv_acct_start(s->qdev.conf.bs, &r->acct, 0, BDRV_ACCT_FLUSH); > r->req.aiocb = bdrv_aio_flush(s->qdev.conf.bs, scsi_aio_complete, r); > @@ -352,6 +356,10 @@ static void scsi_read_data(SCSIRequest *req) > /* No data transfer may already be in progress */ > assert(r->req.aiocb == NULL); > > + if (r->req.io_canceled) { > + return; > + } > + > /* The request is used as the AIO opaque value, so add a ref. */ > scsi_req_ref(&r->req); > if (r->req.cmd.mode == SCSI_XFER_TO_DEV) { > @@ -455,6 +463,10 @@ static void scsi_write_data(SCSIRequest *req) > /* No data transfer may already be in progress */ > assert(r->req.aiocb == NULL); > > + if (r->req.io_canceled) { > + return; > + } > + > /* The request is used as the AIO opaque value, so add a ref. */ > scsi_req_ref(&r->req); > if (r->req.cmd.mode != SCSI_XFER_TO_DEV) { > > Paolo >
Il 12/02/2013 14:46, Stefan Priebe - Profihost AG ha scritto: > Hi, > > thanks - i applied the patch to the latest master. I hope that this will > solve my issue. Will this one get integrated in 1.4 final? No, only 1.4.1 and 1.5 unfortunately. Let's give it a week for you to test it. Paolo
Hi Paolo, sadly no luck. A VM crashed again. [ ~]# addr2line -e /usr/lib/debug/usr/bin/kvm -f 24040c virtio_scsi_command_complete hw/virtio-scsi.c:429 Same point like last time: static void virtio_scsi_command_complete(SCSIRequest *r, uint32_t status, size_t resid) { VirtIOSCSIReq *req = r->hba_private; uint32_t sense_len; 429 =====> req->resp.cmd->response = VIRTIO_SCSI_S_OK; req->resp.cmd->status = status; if (req->resp.cmd->status == GOOD) { req->resp.cmd->resid = tswap32(resid); } else { req->resp.cmd->resid = 0; sense_len = scsi_req_get_sense(r, req->resp.cmd->sense, VIRTIO_SCSI_SENSE_SIZE); req->resp.cmd->sense_len = tswap32(sense_len); } virtio_scsi_complete_req(req); Greets Stefan Am 12.02.2013 15:34, schrieb Paolo Bonzini: > Il 12/02/2013 14:46, Stefan Priebe - Profihost AG ha scritto: >> Hi, >> >> thanks - i applied the patch to the latest master. I hope that this will >> solve my issue. Will this one get integrated in 1.4 final?
Another hint: I've never seens this using qemu 1.3.1 Stefan Am 13.02.2013 08:49, schrieb Stefan Priebe - Profihost AG: > Hi Paolo, > > sadly no luck. A VM crashed again. > > [ ~]# addr2line -e /usr/lib/debug/usr/bin/kvm -f 24040c > virtio_scsi_command_complete > hw/virtio-scsi.c:429 > > Same point like last time: > static void virtio_scsi_command_complete(SCSIRequest *r, uint32_t status, > size_t resid) > { > VirtIOSCSIReq *req = r->hba_private; > uint32_t sense_len; > > 429 =====> req->resp.cmd->response = VIRTIO_SCSI_S_OK; > req->resp.cmd->status = status; > if (req->resp.cmd->status == GOOD) { > req->resp.cmd->resid = tswap32(resid); > } else { > req->resp.cmd->resid = 0; > sense_len = scsi_req_get_sense(r, req->resp.cmd->sense, > VIRTIO_SCSI_SENSE_SIZE); > req->resp.cmd->sense_len = tswap32(sense_len); > } > virtio_scsi_complete_req(req); > > Greets > Stefan > > Am 12.02.2013 15:34, schrieb Paolo Bonzini: >> Il 12/02/2013 14:46, Stefan Priebe - Profihost AG ha scritto: >>> Hi, >>> >>> thanks - i applied the patch to the latest master. I hope that this will >>> solve my issue. Will this one get integrated in 1.4 final?
Hi, could this be this one? commit 47a150a4bbb06e45ef439a8222e9f46a7c4cca3f Author: Paolo Bonzini <pbonzini@redhat.com> Date: Thu Jan 10 15:49:08 2013 +0100 virtio-scsi: abort in-flight I/O when the device is reset When the device is reset, the SCSI bus should also be reset so that in-flight I/O is cancelled. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com> Greets, Stefan Am 13.02.2013 09:01, schrieb Stefan Priebe - Profihost AG: > Another hint: I've never seens this using qemu 1.3.1 > > Stefan > > Am 13.02.2013 08:49, schrieb Stefan Priebe - Profihost AG: >> Hi Paolo, >> >> sadly no luck. A VM crashed again. >> >> [ ~]# addr2line -e /usr/lib/debug/usr/bin/kvm -f 24040c >> virtio_scsi_command_complete >> hw/virtio-scsi.c:429 >> >> Same point like last time: >> static void virtio_scsi_command_complete(SCSIRequest *r, uint32_t status, >> size_t resid) >> { >> VirtIOSCSIReq *req = r->hba_private; >> uint32_t sense_len; >> >> 429 =====> req->resp.cmd->response = VIRTIO_SCSI_S_OK; >> req->resp.cmd->status = status; >> if (req->resp.cmd->status == GOOD) { >> req->resp.cmd->resid = tswap32(resid); >> } else { >> req->resp.cmd->resid = 0; >> sense_len = scsi_req_get_sense(r, req->resp.cmd->sense, >> VIRTIO_SCSI_SENSE_SIZE); >> req->resp.cmd->sense_len = tswap32(sense_len); >> } >> virtio_scsi_complete_req(req); >> >> Greets >> Stefan >> >> Am 12.02.2013 15:34, schrieb Paolo Bonzini: >>> Il 12/02/2013 14:46, Stefan Priebe - Profihost AG ha scritto: >>>> Hi, >>>> >>>> thanks - i applied the patch to the latest master. I hope that this will >>>> solve my issue. Will this one get integrated in 1.4 final?
Il 13/02/2013 09:19, Stefan Priebe - Profihost AG ha scritto: > Hi, > > could this be this one? > > commit 47a150a4bbb06e45ef439a8222e9f46a7c4cca3f > Author: Paolo Bonzini <pbonzini@redhat.com> > Date: Thu Jan 10 15:49:08 2013 +0100 > > virtio-scsi: abort in-flight I/O when the device is reset > > When the device is reset, the SCSI bus should also be reset so > that in-flight I/O is cancelled. > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > Signed-off-by: Anthony Liguori <aliguori@us.ibm.com> You can certainly try reverting it, but this patch is fixing a real bug. Paolo > Greets, > Stefan > Am 13.02.2013 09:01, schrieb Stefan Priebe - Profihost AG: >> Another hint: I've never seens this using qemu 1.3.1 >> >> Stefan >> >> Am 13.02.2013 08:49, schrieb Stefan Priebe - Profihost AG: >>> Hi Paolo, >>> >>> sadly no luck. A VM crashed again. >>> >>> [ ~]# addr2line -e /usr/lib/debug/usr/bin/kvm -f 24040c >>> virtio_scsi_command_complete >>> hw/virtio-scsi.c:429 >>> >>> Same point like last time: >>> static void virtio_scsi_command_complete(SCSIRequest *r, uint32_t status, >>> size_t resid) >>> { >>> VirtIOSCSIReq *req = r->hba_private; >>> uint32_t sense_len; >>> >>> 429 =====> req->resp.cmd->response = VIRTIO_SCSI_S_OK; >>> req->resp.cmd->status = status; >>> if (req->resp.cmd->status == GOOD) { >>> req->resp.cmd->resid = tswap32(resid); >>> } else { >>> req->resp.cmd->resid = 0; >>> sense_len = scsi_req_get_sense(r, req->resp.cmd->sense, >>> VIRTIO_SCSI_SENSE_SIZE); >>> req->resp.cmd->sense_len = tswap32(sense_len); >>> } >>> virtio_scsi_complete_req(req); >>> >>> Greets >>> Stefan >>> >>> Am 12.02.2013 15:34, schrieb Paolo Bonzini: >>>> Il 12/02/2013 14:46, Stefan Priebe - Profihost AG ha scritto: >>>>> Hi, >>>>> >>>>> thanks - i applied the patch to the latest master. I hope that this will >>>>> solve my issue. Will this one get integrated in 1.4 final? > >
Hi, Am 13.02.2013 09:57, schrieb Paolo Bonzini: > Il 13/02/2013 09:19, Stefan Priebe - Profihost AG ha scritto: >> Hi, >> >> could this be this one? >> >> commit 47a150a4bbb06e45ef439a8222e9f46a7c4cca3f ... > You can certainly try reverting it, but this patch is fixing a real bug. Will try that. Yes but even if it fixes a bug and raises another one (kvm segfault) which is the worst one. It should be fixed. Greets, Stefan
Il 13/02/2013 10:07, Stefan Priebe - Profihost AG ha scritto: >>> >> >>> >> commit 47a150a4bbb06e45ef439a8222e9f46a7c4cca3f > ... >> > You can certainly try reverting it, but this patch is fixing a real bug. > Will try that. Yes but even if it fixes a bug and raises another one > (kvm segfault) which is the worst one. It should be fixed. The KVM segfault is exposing a potential consistency problem. What is worse is not obvious. Also, it is happening at reset time if this is the culprit. Reset usually happens at places where no data loss is caused. Can you find out what the VM was doing when it segfaulted? (Or even, can you place the corefile and kvm executable somewhere where I can download it?) I'll prepare a test program that resets the adapter while doing I/O and try to reproduce it myself, in the meanwhile: can you grep the VM's /var/log/messages with kernel messages regarding the storage ("aborting cmd" and other things after it)? If not, do your VMs reset themselves often for example? Can you reproduce it on non-rbd storage? Paolo
Hi, Am 13.02.2013 12:36, schrieb Paolo Bonzini: > Il 13/02/2013 10:07, Stefan Priebe - Profihost AG ha scritto: >>>>>> >>>>>> commit 47a150a4bbb06e45ef439a8222e9f46a7c4cca3f >> ... >>>> You can certainly try reverting it, but this patch is fixing a real bug. >> Will try that. Yes but even if it fixes a bug and raises another one >> (kvm segfault) which is the worst one. It should be fixed. > > The KVM segfault is exposing a potential consistency problem. What is > worse is not obvious. Also, it is happening at reset time if this is > the culprit. Reset usually happens at places where no data loss is caused. > > Can you find out what the VM was doing when it segfaulted? (Or even, > can you place the corefile and kvm executable somewhere where I can > download it?) Yes it was doing an fstrim -v / which resulted in: [45648.453698] end_request: I/O error, dev sda, sector 9066952 > I'll prepare a test program that resets the adapter while doing I/O and > try to reproduce it myself, in the meanwhile: can you grep the VM's > /var/log/messages with kernel messages regarding the storage ("aborting > cmd" and other things after it)? Sadly not as i don't have acore dump. The kvm processes are started through variuos Daemons and there seems no way to activate core dumps for an already running process and i don't know which VM will crash next. > If not, do your VMs reset themselves > often for example? No > Can you reproduce it on non-rbd storage? I don't have another storage type. ;-( Stefan
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c index 07220e4..1d8289c 100644 --- a/hw/scsi/scsi-disk.c +++ b/hw/scsi/scsi-disk.c @@ -221,6 +221,10 @@ static void scsi_write_do_fua(SCSIDiskReq *r) { SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev); + if (r->req.io_canceled) { + return; + } + if (scsi_is_cmd_fua(&r->req.cmd)) { bdrv_acct_start(s->qdev.conf.bs, &r->acct, 0, BDRV_ACCT_FLUSH); r->req.aiocb = bdrv_aio_flush(s->qdev.conf.bs, scsi_aio_complete, r); @@ -352,6 +356,10 @@ static void scsi_read_data(SCSIRequest *req) /* No data transfer may already be in progress */ assert(r->req.aiocb == NULL); + if (r->req.io_canceled) { + return; + } + /* The request is used as the AIO opaque value, so add a ref. */ scsi_req_ref(&r->req); if (r->req.cmd.mode == SCSI_XFER_TO_DEV) { @@ -455,6 +463,10 @@ static void scsi_write_data(SCSIRequest *req) /* No data transfer may already be in progress */ assert(r->req.aiocb == NULL); + if (r->req.io_canceled) { + return; + } + /* The request is used as the AIO opaque value, so add a ref. */ scsi_req_ref(&r->req); if (r->req.cmd.mode != SCSI_XFER_TO_DEV) {