
[2/2] virtio-scsi/virtio-blk: Disable poll handlers when stopping vq handler

Message ID 20180910145616.8598-3-famz@redhat.com
State New
Series virtio-scsi: Fix QEMU hang with vIOMMU and ATS

Commit Message

Fam Zheng Sept. 10, 2018, 2:56 p.m. UTC
We have this unwanted call stack:

  > ...
  > #13 0x00005586602b7793 in virtio_scsi_handle_cmd_vq
  > #14 0x00005586602b8d66 in virtio_scsi_data_plane_handle_cmd
  > #15 0x00005586602ddab7 in virtio_queue_notify_aio_vq
  > #16 0x00005586602dfc9f in virtio_queue_host_notifier_aio_poll
  > #17 0x00005586607885da in run_poll_handlers_once
  > #18 0x000055866078880e in try_poll_mode
  > #19 0x00005586607888eb in aio_poll
  > #20 0x0000558660784561 in aio_wait_bh_oneshot
  > #21 0x00005586602b9582 in virtio_scsi_dataplane_stop
  > #22 0x00005586605a7110 in virtio_bus_stop_ioeventfd
  > #23 0x00005586605a9426 in virtio_pci_stop_ioeventfd
  > #24 0x00005586605ab808 in virtio_pci_common_write
  > #25 0x0000558660242396 in memory_region_write_accessor
  > #26 0x00005586602425ab in access_with_adjusted_size
  > #27 0x0000558660245281 in memory_region_dispatch_write
  > #28 0x00005586601e008e in flatview_write_continue
  > #29 0x00005586601e01d8 in flatview_write
  > #30 0x00005586601e04de in address_space_write
  > #31 0x00005586601e052f in address_space_rw
  > #32 0x00005586602607f2 in kvm_cpu_exec
  > #33 0x0000558660227148 in qemu_kvm_cpu_thread_fn
  > #34 0x000055866078bde7 in qemu_thread_start
  > #35 0x00007f5784906594 in start_thread
  > #36 0x00007f5784639e6f in clone

Avoid it with the aio_disable_external/aio_enable_external pair, so that
no vq poll handlers can be called in aio_wait_bh_oneshot.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 hw/block/dataplane/virtio-blk.c | 2 ++
 hw/scsi/virtio-scsi-dataplane.c | 2 ++
 2 files changed, 4 insertions(+)

Comments

Paolo Bonzini Sept. 11, 2018, 11:32 a.m. UTC | #1
On 10/09/2018 16:56, Fam Zheng wrote:
> We have this unwanted call stack:
> 
>   > ...
>   > #13 0x00005586602b7793 in virtio_scsi_handle_cmd_vq
>   > #14 0x00005586602b8d66 in virtio_scsi_data_plane_handle_cmd
>   > #15 0x00005586602ddab7 in virtio_queue_notify_aio_vq
>   > #16 0x00005586602dfc9f in virtio_queue_host_notifier_aio_poll
>   > #17 0x00005586607885da in run_poll_handlers_once
>   > #18 0x000055866078880e in try_poll_mode
>   > #19 0x00005586607888eb in aio_poll
>   > #20 0x0000558660784561 in aio_wait_bh_oneshot
>   > #21 0x00005586602b9582 in virtio_scsi_dataplane_stop
>   > #22 0x00005586605a7110 in virtio_bus_stop_ioeventfd
>   > #23 0x00005586605a9426 in virtio_pci_stop_ioeventfd
>   > #24 0x00005586605ab808 in virtio_pci_common_write
>   > #25 0x0000558660242396 in memory_region_write_accessor
>   > #26 0x00005586602425ab in access_with_adjusted_size
>   > #27 0x0000558660245281 in memory_region_dispatch_write
>   > #28 0x00005586601e008e in flatview_write_continue
>   > #29 0x00005586601e01d8 in flatview_write
>   > #30 0x00005586601e04de in address_space_write
>   > #31 0x00005586601e052f in address_space_rw
>   > #32 0x00005586602607f2 in kvm_cpu_exec
>   > #33 0x0000558660227148 in qemu_kvm_cpu_thread_fn
>   > #34 0x000055866078bde7 in qemu_thread_start
>   > #35 0x00007f5784906594 in start_thread
>   > #36 0x00007f5784639e6f in clone
> 
> Avoid it with the aio_disable_external/aio_enable_external pair, so that
> no vq poll handlers can be called in aio_wait_bh_oneshot.

I don't understand.  We are in the vCPU thread, so not in the
AioContext's home thread.  Why is aio_wait_bh_oneshot polling rather
than going through the aio_wait_bh path?

Thanks,

Paolo
Fam Zheng Sept. 11, 2018, 2:12 p.m. UTC | #2
On Tue, 09/11 13:32, Paolo Bonzini wrote:
> On 10/09/2018 16:56, Fam Zheng wrote:
> > We have this unwanted call stack:
> > 
> >   > ...
> >   > #13 0x00005586602b7793 in virtio_scsi_handle_cmd_vq
> >   > #14 0x00005586602b8d66 in virtio_scsi_data_plane_handle_cmd
> >   > #15 0x00005586602ddab7 in virtio_queue_notify_aio_vq
> >   > #16 0x00005586602dfc9f in virtio_queue_host_notifier_aio_poll
> >   > #17 0x00005586607885da in run_poll_handlers_once
> >   > #18 0x000055866078880e in try_poll_mode
> >   > #19 0x00005586607888eb in aio_poll
> >   > #20 0x0000558660784561 in aio_wait_bh_oneshot
> >   > #21 0x00005586602b9582 in virtio_scsi_dataplane_stop
> >   > #22 0x00005586605a7110 in virtio_bus_stop_ioeventfd
> >   > #23 0x00005586605a9426 in virtio_pci_stop_ioeventfd
> >   > #24 0x00005586605ab808 in virtio_pci_common_write
> >   > #25 0x0000558660242396 in memory_region_write_accessor
> >   > #26 0x00005586602425ab in access_with_adjusted_size
> >   > #27 0x0000558660245281 in memory_region_dispatch_write
> >   > #28 0x00005586601e008e in flatview_write_continue
> >   > #29 0x00005586601e01d8 in flatview_write
> >   > #30 0x00005586601e04de in address_space_write
> >   > #31 0x00005586601e052f in address_space_rw
> >   > #32 0x00005586602607f2 in kvm_cpu_exec
> >   > #33 0x0000558660227148 in qemu_kvm_cpu_thread_fn
> >   > #34 0x000055866078bde7 in qemu_thread_start
> >   > #35 0x00007f5784906594 in start_thread
> >   > #36 0x00007f5784639e6f in clone
> > 
> > Avoid it with the aio_disable_external/aio_enable_external pair, so that
> > no vq poll handlers can be called in aio_wait_bh_oneshot.
> 
> I don't understand.  We are in the vCPU thread, so not in the
> AioContext's home thread.  Why is aio_wait_bh_oneshot polling rather
> than going through the aio_wait_bh path?

What do you mean by 'aio_wait_bh path'? Here is aio_wait_bh_oneshot:

void aio_wait_bh_oneshot(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
{
    AioWaitBHData data = {
        .cb = cb,
        .opaque = opaque,
    };

    assert(qemu_get_current_aio_context() == qemu_get_aio_context());

    aio_bh_schedule_oneshot(ctx, aio_wait_bh, &data);
    AIO_WAIT_WHILE(&data.wait, ctx, !data.done);
}

ctx is qemu_aio_context here, so there's no interaction with IOThread. This is
the full backtrace:

https://paste.ubuntu.com/p/Dm7zGzmmRG/

Fam
Paolo Bonzini Sept. 11, 2018, 3:30 p.m. UTC | #3
On 11/09/2018 16:12, Fam Zheng wrote:
> On Tue, 09/11 13:32, Paolo Bonzini wrote:
>> On 10/09/2018 16:56, Fam Zheng wrote:
>>> We have this unwanted call stack:
>>>
>>>   > ...
>>>   > #13 0x00005586602b7793 in virtio_scsi_handle_cmd_vq
>>>   > #14 0x00005586602b8d66 in virtio_scsi_data_plane_handle_cmd
>>>   > #15 0x00005586602ddab7 in virtio_queue_notify_aio_vq
>>>   > #16 0x00005586602dfc9f in virtio_queue_host_notifier_aio_poll
>>>   > #17 0x00005586607885da in run_poll_handlers_once
>>>   > #18 0x000055866078880e in try_poll_mode
>>>   > #19 0x00005586607888eb in aio_poll
>>>   > #20 0x0000558660784561 in aio_wait_bh_oneshot
>>>   > #21 0x00005586602b9582 in virtio_scsi_dataplane_stop
>>>   > #22 0x00005586605a7110 in virtio_bus_stop_ioeventfd
>>>   > #23 0x00005586605a9426 in virtio_pci_stop_ioeventfd
>>>   > #24 0x00005586605ab808 in virtio_pci_common_write
>>>   > #25 0x0000558660242396 in memory_region_write_accessor
>>>   > #26 0x00005586602425ab in access_with_adjusted_size
>>>   > #27 0x0000558660245281 in memory_region_dispatch_write
>>>   > #28 0x00005586601e008e in flatview_write_continue
>>>   > #29 0x00005586601e01d8 in flatview_write
>>>   > #30 0x00005586601e04de in address_space_write
>>>   > #31 0x00005586601e052f in address_space_rw
>>>   > #32 0x00005586602607f2 in kvm_cpu_exec
>>>   > #33 0x0000558660227148 in qemu_kvm_cpu_thread_fn
>>>   > #34 0x000055866078bde7 in qemu_thread_start
>>>   > #35 0x00007f5784906594 in start_thread
>>>   > #36 0x00007f5784639e6f in clone
>>>
>>> Avoid it with the aio_disable_external/aio_enable_external pair, so that
>>> no vq poll handlers can be called in aio_wait_bh_oneshot.
>>
>> I don't understand.  We are in the vCPU thread, so not in the
>> AioContext's home thread.  Why is aio_wait_bh_oneshot polling rather
>> than going through the aio_wait_bh path?
> 
> What do you mean by 'aio_wait_bh path'? Here is aio_wait_bh_oneshot:

Sorry, I meant the "atomic_inc(&wait_->num_waiters);" path.  But if this
backtrace is obtained without dataplane, that's the answer I was seeking.

> void aio_wait_bh_oneshot(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
> {
>     AioWaitBHData data = {
>         .cb = cb,
>         .opaque = opaque,
>     };
> 
>     assert(qemu_get_current_aio_context() == qemu_get_aio_context());
> 
>     aio_bh_schedule_oneshot(ctx, aio_wait_bh, &data);
>     AIO_WAIT_WHILE(&data.wait, ctx, !data.done);
> }
> 
> ctx is qemu_aio_context here, so there's no interaction with IOThread.

In this case, it should be okay to have the reentrancy, what is the bug
that this patch is fixing?

Thanks,

Paolo
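
For readers following this exchange: the two code paths being contrasted are the two branches of the AIO_WAIT_WHILE() macro. Below is a simplified sketch, paraphrased from the QEMU 3.0-era include/block/aio-wait.h; it is abridged and not the verbatim source (the real macro also tracks a "busy" flag and drops the AioContext lock around the blocking poll).

#define AIO_WAIT_WHILE(wait, ctx, cond) ({                               \
    AioWait *wait_ = (wait);                                             \
    AioContext *ctx_ = (ctx);                                            \
    if (in_aio_context_home_thread(ctx_)) {                              \
        /* Waiter already runs in ctx's home thread: poll ctx directly   \
         * until the condition clears.  Polling runs the registered      \
         * handlers, including the virtqueue poll handlers.              \
         */                                                              \
        while ((cond)) {                                                 \
            aio_poll(ctx_, true);                                        \
        }                                                                \
    } else {                                                             \
        /* Main loop waiting on another thread's ctx: register as a      \
         * waiter (the "atomic_inc(&wait_->num_waiters)" path) so that   \
         * completions in the other thread kick the main context, and    \
         * poll only the main context here.                              \
         */                                                              \
        atomic_inc(&wait_->num_waiters);                                 \
        while ((cond)) {                                                 \
            aio_poll(qemu_get_aio_context(), true);                      \
        }                                                                \
        atomic_dec(&wait_->num_waiters);                                 \
    }                                                                    \
})

In the backtrace above there is no iothread, so ctx is qemu_aio_context; the assert in aio_wait_bh_oneshot() passes even from the vCPU thread, in_aio_context_home_thread() evaluates to true, and the first branch's aio_poll() is what re-enters virtio_scsi_handle_cmd_vq.
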
Fam Zheng Sept. 12, 2018, 1:31 a.m. UTC | #4
On Tue, 09/11 17:30, Paolo Bonzini wrote:
> On 11/09/2018 16:12, Fam Zheng wrote:
> > On Tue, 09/11 13:32, Paolo Bonzini wrote:
> >> On 10/09/2018 16:56, Fam Zheng wrote:
> >>> We have this unwanted call stack:
> >>>
> >>>   > ...
> >>>   > #13 0x00005586602b7793 in virtio_scsi_handle_cmd_vq
> >>>   > #14 0x00005586602b8d66 in virtio_scsi_data_plane_handle_cmd
> >>>   > #15 0x00005586602ddab7 in virtio_queue_notify_aio_vq
> >>>   > #16 0x00005586602dfc9f in virtio_queue_host_notifier_aio_poll
> >>>   > #17 0x00005586607885da in run_poll_handlers_once
> >>>   > #18 0x000055866078880e in try_poll_mode
> >>>   > #19 0x00005586607888eb in aio_poll
> >>>   > #20 0x0000558660784561 in aio_wait_bh_oneshot
> >>>   > #21 0x00005586602b9582 in virtio_scsi_dataplane_stop
> >>>   > #22 0x00005586605a7110 in virtio_bus_stop_ioeventfd
> >>>   > #23 0x00005586605a9426 in virtio_pci_stop_ioeventfd
> >>>   > #24 0x00005586605ab808 in virtio_pci_common_write
> >>>   > #25 0x0000558660242396 in memory_region_write_accessor
> >>>   > #26 0x00005586602425ab in access_with_adjusted_size
> >>>   > #27 0x0000558660245281 in memory_region_dispatch_write
> >>>   > #28 0x00005586601e008e in flatview_write_continue
> >>>   > #29 0x00005586601e01d8 in flatview_write
> >>>   > #30 0x00005586601e04de in address_space_write
> >>>   > #31 0x00005586601e052f in address_space_rw
> >>>   > #32 0x00005586602607f2 in kvm_cpu_exec
> >>>   > #33 0x0000558660227148 in qemu_kvm_cpu_thread_fn
> >>>   > #34 0x000055866078bde7 in qemu_thread_start
> >>>   > #35 0x00007f5784906594 in start_thread
> >>>   > #36 0x00007f5784639e6f in clone
> >>>
> >>> Avoid it with the aio_disable_external/aio_enable_external pair, so that
> >>> no vq poll handlers can be called in aio_wait_bh_oneshot.
> >>
> >> I don't understand.  We are in the vCPU thread, so not in the
> >> AioContext's home thread.  Why is aio_wait_bh_oneshot polling rather
> >> than going through the aio_wait_bh path?
> > 
> > What do you mean by 'aio_wait_bh path'? Here is aio_wait_bh_oneshot:
> 
> Sorry, I meant the "atomic_inc(&wait_->num_waiters);" path.  But if this
> backtrace is obtained without dataplane, that's the answer I was seeking.
> 
> > void aio_wait_bh_oneshot(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
> > {
> >     AioWaitBHData data = {
> >         .cb = cb,
> >         .opaque = opaque,
> >     };
> > 
> >     assert(qemu_get_current_aio_context() == qemu_get_aio_context());
> > 
> >     aio_bh_schedule_oneshot(ctx, aio_wait_bh, &data);
> >     AIO_WAIT_WHILE(&data.wait, ctx, !data.done);
> > }
> > 
> > ctx is qemu_aio_context here, so there's no interaction with IOThread.
> 
> In this case, it should be okay to have the reentrancy, what is the bug
> that this patch is fixing?

The same symptom as in the previous patch: virtio_scsi_handle_cmd_vq hangs. The
reason it hangs is fixed by the previous patch, but I don't think it should be
invoked as we're in the middle of virtio_scsi_dataplane_stop(). Applying either
one of the two patches avoids the problem, but this one is more superficial.
What do you think?

Fam
Paolo Bonzini Sept. 12, 2018, 11:11 a.m. UTC | #5
On 12/09/2018 03:31, Fam Zheng wrote:
>>>
>>> ctx is qemu_aio_context here, so there's no interaction with IOThread.
>> In this case, it should be okay to have the reentrancy, what is the bug
>> that this patch is fixing?
> The same symptom as in the previous patch: virtio_scsi_handle_cmd_vq hangs. The
> reason it hangs is fixed by the previous patch, but I don't think it should be
> invoked as we're in the middle of virtio_scsi_dataplane_stop(). Applying either
> one of the two patches avoids the problem, but this one is more superficial.
> What do you think?

I think it's okay if it is invoked.  The sequence is first you stop the
vq, then you drain the BlockBackends, then you switch AioContext.  All
that matters is the outcome when virtio_scsi_dataplane_stop returns.

Paolo
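
As a rough outline, the sequence Paolo describes maps onto virtio_scsi_dataplane_stop() roughly as follows. This is a sketch pieced together from the patch context shown at the bottom of this page, not the complete function; host-notifier cleanup and error handling are omitted.

    /* 1. Stop the vq: run the stop BH in the device's AioContext and wait
     *    for it.  With this patch, external handlers (ioeventfd and the vq
     *    poll handlers) stay masked while we wait.
     */
    s->dataplane_stopping = true;

    aio_context_acquire(s->ctx);
    aio_disable_external(s->ctx);
    aio_wait_bh_oneshot(s->ctx, virtio_scsi_dataplane_stop_bh, s);
    aio_enable_external(s->ctx);
    aio_context_release(s->ctx);

    /* 2. Drain the BlockBackends: no requests may remain in flight. */
    blk_drain_all();

    /* 3. Switch the AioContext back to the main loop.  In the virtio-blk
     *    counterpart this is the blk_set_aio_context(...,
     *    qemu_get_aio_context()) call visible in the patch below.
     */
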
Fam Zheng Sept. 12, 2018, 11:50 a.m. UTC | #6
On Wed, 09/12 13:11, Paolo Bonzini wrote:
> On 12/09/2018 03:31, Fam Zheng wrote:
> >>>
> >>> ctx is qemu_aio_context here, so there's no interaction with IOThread.
> >> In this case, it should be okay to have the reentrancy, what is the bug
> >> that this patch is fixing?
> > The same symptom as in the previous patch: virtio_scsi_handle_cmd_vq hangs. The
> > reason it hangs is fixed by the previous patch, but I don't think it should be
> > invoked as we're in the middle of virtio_scsi_dataplane_stop(). Applying either
> > one of the two patches avoids the problem, but this one is more superficial.
> > What do you think?
> 
> I think it's okay if it is invoked.  The sequence is first you stop the
> vq, then you drain the BlockBackends, then you switch AioContext.  All
> that matters is the outcome when virtio_scsi_dataplane_stop returns.

Yes, but together with vIOMMU, it also effectively leads to a virtio_error(),
which is not clean. QEMU stderr when this call happens (with patch 1 but not
this patch):

2018-09-12T11:48:10.193023Z qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=02:00:00, iova=0x0)
2018-09-12T11:48:10.193044Z qemu-system-x86_64: New fault is not recorded due to compression of faults
2018-09-12T11:48:10.193061Z qemu-system-x86_64: virtio: zero sized buffers are not allowed

Fam
Paolo Bonzini Sept. 12, 2018, 12:42 p.m. UTC | #7
On 12/09/2018 13:50, Fam Zheng wrote:
>> I think it's okay if it is invoked.  The sequence is first you stop the
>> vq, then you drain the BlockBackends, then you switch AioContext.  All
>> that matters is the outcome when virtio_scsi_dataplane_stop returns.
> Yes, but together with vIOMMU, it also effectively leads to a virtio_error(),
> which is not clean. QEMU stderr when this call happens (with patch 1 but not
> this patch):
> 
> 2018-09-12T11:48:10.193023Z qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=02:00:00, iova=0x0)
> 2018-09-12T11:48:10.193044Z qemu-system-x86_64: New fault is not recorded due to compression of faults
> 2018-09-12T11:48:10.193061Z qemu-system-x86_64: virtio: zero sized buffers are not allowed

But with iothread, virtio_scsi_dataplane_stop runs in another thread
than the iothread; in that case you still have a race where the iothread
can process the vq before aio_disable_external and print the error.

IIUC the guest has cleared the IOMMU page tables _before_ clearing the
DRIVER_OK bit in the status field.  Could this be a guest bug?

Paolo
Fam Zheng Sept. 13, 2018, 6:03 a.m. UTC | #8
On Wed, 09/12 14:42, Paolo Bonzini wrote:
> On 12/09/2018 13:50, Fam Zheng wrote:
> >> I think it's okay if it is invoked.  The sequence is first you stop the
> >> vq, then you drain the BlockBackends, then you switch AioContext.  All
> >> that matters is the outcome when virtio_scsi_dataplane_stop returns.
> > Yes, but together with vIOMMU, it also effectively leads to a virtio_error(),
> > which is not clean. QEMU stderr when this call happens (with patch 1 but not
> > this patch):
> > 
> > 2018-09-12T11:48:10.193023Z qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=02:00:00, iova=0x0)
> > 2018-09-12T11:48:10.193044Z qemu-system-x86_64: New fault is not recorded due to compression of faults
> > 2018-09-12T11:48:10.193061Z qemu-system-x86_64: virtio: zero sized buffers are not allowed
> 
> But with iothread, virtio_scsi_dataplane_stop runs in another thread
> than the iothread; in that case you still have a race where the iothread
> can process the vq before aio_disable_external and print the error.
> 
> IIUC the guest has cleared the IOMMU page tables _before_ clearing the
> DRIVER_OK bit in the status field.  Could this be a guest bug?
> 

I'm not sure if it is a bug or not. I think what happens is the device is left
enabled by SeaBIOS, and then reset by the kernel.

Fam
Paolo Bonzini Sept. 13, 2018, 9:11 a.m. UTC | #9
On 13/09/2018 08:03, Fam Zheng wrote:
> On Wed, 09/12 14:42, Paolo Bonzini wrote:
>> On 12/09/2018 13:50, Fam Zheng wrote:
>>>> I think it's okay if it is invoked.  The sequence is first you stop the
>>>> vq, then you drain the BlockBackends, then you switch AioContext.  All
>>>> that matters is the outcome when virtio_scsi_dataplane_stop returns.
>>> Yes, but together with vIOMMU, it also effectively leads to a virtio_error(),
>>> which is not clean. QEMU stderr when this call happens (with patch 1 but not
>>> this patch):
>>>
>>> 2018-09-12T11:48:10.193023Z qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=02:00:00, iova=0x0)
>>> 2018-09-12T11:48:10.193044Z qemu-system-x86_64: New fault is not recorded due to compression of faults
>>> 2018-09-12T11:48:10.193061Z qemu-system-x86_64: virtio: zero sized buffers are not allowed
>>
>> But with iothread, virtio_scsi_dataplane_stop runs in another thread
>> than the iothread; in that case you still have a race where the iothread
>> can process the vq before aio_disable_external and print the error.
>>
>> IIUC the guest has cleared the IOMMU page tables _before_ clearing the
>> DRIVER_OK bit in the status field.  Could this be a guest bug?
> 
> I'm not sure if it is a bug or not. I think what happens is the device is left
> enabled by Seabios, and then reset by kernel.

That makes sense, though I'm not sure why QEMU needs to process a
request long after SeaBIOS has left control to Linux.  Maybe it's just
that the messages should not go to QEMU's stderr, and that a
tracepoint should be enough.

Paolo
Paolo Bonzini Sept. 13, 2018, 10:04 a.m. UTC | #10
On 13/09/2018 11:11, Paolo Bonzini wrote:
> On 13/09/2018 08:03, Fam Zheng wrote:
>> On Wed, 09/12 14:42, Paolo Bonzini wrote:
>>> On 12/09/2018 13:50, Fam Zheng wrote:
>>>>> I think it's okay if it is invoked.  The sequence is first you stop the
>>>>> vq, then you drain the BlockBackends, then you switch AioContext.  All
>>>>> that matters is the outcome when virtio_scsi_dataplane_stop returns.
>>>> Yes, but together with vIOMMU, it also effectively leads to a virtio_error(),
>>>> which is not clean. QEMU stderr when this call happens (with patch 1 but not
>>>> this patch):
>>>>
>>>> 2018-09-12T11:48:10.193023Z qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=02:00:00, iova=0x0)
>>>> 2018-09-12T11:48:10.193044Z qemu-system-x86_64: New fault is not recorded due to compression of faults
>>>> 2018-09-12T11:48:10.193061Z qemu-system-x86_64: virtio: zero sized buffers are not allowed
>>>
>>> But with iothread, virtio_scsi_dataplane_stop runs in another thread
>>> than the iothread; in that case you still have a race where the iothread
>>> can process the vq before aio_disable_external and print the error.
>>>
>>> IIUC the guest has cleared the IOMMU page tables _before_ clearing the
>>> DRIVER_OK bit in the status field.  Could this be a guest bug?
>>
>> I'm not sure if it is a bug or not. I think what happens is the device is left
>> enabled by Seabios, and then reset by kernel.
> 
> That makes sense, though I'm not sure why QEMU needs to process a
> request long after SeaBIOS has left control to Linux.  Maybe it's just
> that the messages should not go on QEMU stderr, and rather trace-point
> should be enough.

Aha, it's not that QEMU needs to poll, it's just that polling mode is
enabled, and it decides to do one last iteration.  In general the virtio
spec allows the hardware to poll whenever it wants, hence:

1) I'm not sure that translation failures should mark the device as
broken---definitely not when doing polling, possibly not even in
response to the guest "kicking" the virtqueue.  Alex, does the PCI spec
say anything about this?

2) translation failures should definitely not print messages to stderr.

Thanks,

Paolo
Alex Williamson Sept. 13, 2018, 4 p.m. UTC | #11
On Thu, 13 Sep 2018 12:04:34 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 13/09/2018 11:11, Paolo Bonzini wrote:
> > On 13/09/2018 08:03, Fam Zheng wrote:  
> >> On Wed, 09/12 14:42, Paolo Bonzini wrote:  
> >>> On 12/09/2018 13:50, Fam Zheng wrote:  
> >>>>> I think it's okay if it is invoked.  The sequence is first you stop the
> >>>>> vq, then you drain the BlockBackends, then you switch AioContext.  All
> >>>>> that matters is the outcome when virtio_scsi_dataplane_stop returns.  
> >>>> Yes, but together with vIOMMU, it also effectively leads to a virtio_error(),
> >>>> which is not clean. QEMU stderr when this call happens (with patch 1 but not
> >>>> this patch):
> >>>>
> >>>> 2018-09-12T11:48:10.193023Z qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=02:00:00, iova=0x0)
> >>>> 2018-09-12T11:48:10.193044Z qemu-system-x86_64: New fault is not recorded due to compression of faults
> >>>> 2018-09-12T11:48:10.193061Z qemu-system-x86_64: virtio: zero sized buffers are not allowed  
> >>>
> >>> But with iothread, virtio_scsi_dataplane_stop runs in another thread
> >>> than the iothread; in that case you still have a race where the iothread
> >>> can process the vq before aio_disable_external and print the error.
> >>>
> >>> IIUC the guest has cleared the IOMMU page tables _before_ clearing the
> >>> DRIVER_OK bit in the status field.  Could this be a guest bug?  
> >>
> >> I'm not sure if it is a bug or not. I think what happens is the device is left
> >> enabled by Seabios, and then reset by kernel.  
> > 
> > That makes sense, though I'm not sure why QEMU needs to process a
> > request long after SeaBIOS has left control to Linux.  Maybe it's just
> > that the messages should not go on QEMU stderr, and rather trace-point
> > should be enough.  
> 
> Aha, it's not that QEMU needs to poll, it's just that polling mode is
> enabled, and it decides to do one last iteration.  In general the virtio
> spec allows the hardware to poll whenever it wants, hence:
> 
> 1) I'm not sure that translation failures should mark the device as
> broken---definitely not when doing polling, possibly not even in
> response to the guest "kicking" the virtqueue.  Alex, does the PCI spec
> say anything about this?

AFAIK the PCI spec doesn't define anything about the IOMMU or response
to translation failures.  Depending on whether it's a read or write,
the device might see an unsupported request or not even be aware of the
error.  It's really a platform RAS question whether to have any more
significant response; most don't, but at least one tends to consider
IOMMU faults to be a data integrity issue worth bringing the system down.
We've struggled with handling ongoing DMA generating IOMMU faults
during kexec for a long time, so any sort of marking a device broken
for a fault should be thoroughly considered, especially when a device
could be assigned to a user who can trivially trigger a fault.
 
> 2) translation faliures should definitely not print messages to stderr.

Yep, easy DoS vector for a malicious guest, or malicious userspace
driver within the guest.  Thanks,

Alex
Peter Xu Sept. 14, 2018, 2:45 a.m. UTC | #12
On Thu, Sep 13, 2018 at 10:00:43AM -0600, Alex Williamson wrote:
> On Thu, 13 Sep 2018 12:04:34 +0200
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> > On 13/09/2018 11:11, Paolo Bonzini wrote:
> > > On 13/09/2018 08:03, Fam Zheng wrote:  
> > >> On Wed, 09/12 14:42, Paolo Bonzini wrote:  
> > >>> On 12/09/2018 13:50, Fam Zheng wrote:  
> > >>>>> I think it's okay if it is invoked.  The sequence is first you stop the
> > >>>>> vq, then you drain the BlockBackends, then you switch AioContext.  All
> > >>>>> that matters is the outcome when virtio_scsi_dataplane_stop returns.  
> > >>>> Yes, but together with vIOMMU, it also effectively leads to a virtio_error(),
> > >>>> which is not clean. QEMU stderr when this call happens (with patch 1 but not
> > >>>> this patch):
> > >>>>
> > >>>> 2018-09-12T11:48:10.193023Z qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=02:00:00, iova=0x0)
> > >>>> 2018-09-12T11:48:10.193044Z qemu-system-x86_64: New fault is not recorded due to compression of faults
> > >>>> 2018-09-12T11:48:10.193061Z qemu-system-x86_64: virtio: zero sized buffers are not allowed  
> > >>>
> > >>> But with iothread, virtio_scsi_dataplane_stop runs in another thread
> > >>> than the iothread; in that case you still have a race where the iothread
> > >>> can process the vq before aio_disable_external and print the error.
> > >>>
> > >>> IIUC the guest has cleared the IOMMU page tables _before_ clearing the
> > >>> DRIVER_OK bit in the status field.  Could this be a guest bug?  
> > >>
> > >> I'm not sure if it is a bug or not. I think what happens is the device is left
> > >> enabled by Seabios, and then reset by kernel.  
> > > 
> > > That makes sense, though I'm not sure why QEMU needs to process a
> > > request long after SeaBIOS has left control to Linux.  Maybe it's just
> > > that the messages should not go on QEMU stderr, and rather trace-point
> > > should be enough.  
> > 
> > Aha, it's not that QEMU needs to poll, it's just that polling mode is
> > enabled, and it decides to do one last iteration.  In general the virtio
> > spec allows the hardware to poll whenever it wants, hence:
> > 
> > 1) I'm not sure that translation failures should mark the device as
> > broken---definitely not when doing polling, possibly not even in
> > response to the guest "kicking" the virtqueue.  Alex, does the PCI spec
> > say anything about this?
> 
> AFAIK the PCI spec doesn't define anything about the IOMMU or response
> to translation failures.  Depending on whether it's a read or write,
> the device might see an unsupported request or not even be aware of the
> error.  It's really a platform RAS question whether to have any more
> significant response; most don't, but at least one tends to consider
> IOMMU faults to be a data integrity issue worth bringing the system down.
> We've struggled with handling ongoing DMA generating IOMMU faults
> during kexec for a long time, so any sort of marking a device broken
> for a fault should be thoroughly considered, especially when a device
> could be assigned to a user who can trivially trigger a fault.
>  
> > 2) translation failures should definitely not print messages to stderr.
> 
> Yep, easy DoS vector for a malicious guest, or malicious userspace
> driver within the guest.  Thanks,

Note that it's using error_report_once() upstream, so it'll only print
once for the whole lifetime of the QEMU process, and it's still a
tracepoint downstream, so no error will be dumped by default.  So AFAIU
it's not a DoS target in either case.

I would consider it a good hint for strange bugs, since AFAIU DMA errors
should never occur with well-behaved guests.  However, I'd also be fine
posting a patch to make it an explicit tracepoint again if any of us
would still like it to go away.

Thanks,
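
(For reference, error_report_once() is the rate-limited helper Peter refers to: it prints a given message at most once per call site for the lifetime of the process. A minimal, hypothetical usage sketch follows; the helper name and message text are made up for illustration and are not the actual vtd_iommu_translate() code.)

#include "qemu/osdep.h"
#include "qemu/error-report.h"

/* Hypothetical helper: report an IOMMU translation failure without giving
 * a misbehaving guest a way to flood stderr.  error_report_once() keeps a
 * per-call-site flag, so repeated faults produce a single line of output.
 */
static void report_translation_failure(uint16_t source_id, uint64_t iova)
{
    error_report_once("IOMMU translation failure (dev=%04x, iova=0x%" PRIx64 ")",
                      source_id, iova);
}
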

Patch

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 8c37bd314a..8ab54218c1 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -279,7 +279,9 @@  void virtio_blk_data_plane_stop(VirtIODevice *vdev)
     trace_virtio_blk_data_plane_stop(s);
 
     aio_context_acquire(s->ctx);
+    aio_disable_external(s->ctx);
     aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
+    aio_enable_external(s->ctx);
 
     /* Drain and switch bs back to the QEMU main loop */
     blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());
diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index b995bab3a2..1452398491 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -208,7 +208,9 @@  void virtio_scsi_dataplane_stop(VirtIODevice *vdev)
     s->dataplane_stopping = true;
 
     aio_context_acquire(s->ctx);
+    aio_disable_external(s->ctx);
     aio_wait_bh_oneshot(s->ctx, virtio_scsi_dataplane_stop_bh, s);
+    aio_enable_external(s->ctx);
     aio_context_release(s->ctx);
 
     blk_drain_all(); /* ensure there are no in-flight requests */