Patchwork kvm: deassign irqs in reset path

login
register
mail settings
Submitter Jason Baron
Date March 30, 2012, 7:18 p.m.
Message ID <201203301918.q2UJI63c005908@int-mx02.intmail.prod.int.phx2.redhat.com>
Download mbox | patch
Permalink /patch/149735/
State New
Headers show

Comments

Jason Baron - March 30, 2012, 7:18 p.m.
We've hit a kernel host panic, when issuing a 'system_reset' with a 82576 nic
assigned and a Windows guest. Host system is a PowerEdge R815.

[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
[Hardware Error]: APEI generic hardware error status
[Hardware Error]: severity: 1, fatal
[Hardware Error]: section: 0, severity: 1, fatal
[Hardware Error]: flags: 0x01
[Hardware Error]: primary
[Hardware Error]: section_type: PCIe error
[Hardware Error]: port_type: 0, PCIe end point
[Hardware Error]: version: 1.0
[Hardware Error]: command: 0x0000, status: 0x0010
[Hardware Error]: device_id: 0000:08:00.0
[Hardware Error]: slot: 1
[Hardware Error]: secondary_bus: 0x00
[Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
[Hardware Error]: class_code: 000002
[Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
[Hardware Error]: Unsupported Request
[Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
[Hardware Error]: aer_uncor_severity: 0x00067011
[Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
[Hardware Error]: section: 1, severity: 1, fatal
[Hardware Error]: flags: 0x01
[Hardware Error]: primary
[Hardware Error]: section_type: PCIe error
[Hardware Error]: port_type: 0, PCIe end point
[Hardware Error]: version: 1.0
[Hardware Error]: command: 0x0000, status: 0x0010
[Hardware Error]: device_id: 0000:08:00.0
[Hardware Error]: slot: 1
[Hardware Error]: secondary_bus: 0x00
[Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
[Hardware Error]: class_code: 000002
[Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
[Hardware Error]: Unsupported Request
[Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
[Hardware Error]: aer_uncor_severity: 0x00067011
[Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
Kernel panic - not syncing: Fatal hardware error!
Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
Call Trace:
 <NMI>  [<ffffffff814f2fe5>] ? panic+0xa0/0x168
 [<ffffffff812f919c>] ? ghes_notify_nmi+0x17c/0x180
 [<ffffffff814f91d5>] ? notifier_call_chain+0x55/0x80
 [<ffffffff814f923a>] ? atomic_notifier_call_chain+0x1a/0x20
 [<ffffffff8109667e>] ? notify_die+0x2e/0x30
 [<ffffffff814f6e81>] ? do_nmi+0x1a1/0x2b0
 [<ffffffff814f6760>] ? nmi+0x20/0x30
 [<ffffffff8103762b>] ? native_safe_halt+0xb/0x10
 <<EOE>>  [<ffffffff8101495d>] ? default_idle+0x4d/0xb0
 [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
 [<ffffffff814da63a>] ? rest_init+0x7a/0x80
 [<ffffffff81c1ff7b>] ? start_kernel+0x424/0x430
 [<ffffffff81c1f33a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c1f438>] ? x86_64_start_kernel+0xfa/0x109

The root cause of the problem is that the 'reset_assigned_device()' code
first writes a 0 to the command register. Then, when qemu subsequently does
a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
the kernel ends up calling '__msix_mask_irq()', which performs a write to
the memory mapped msi vector space. Since, we've explicitly told the device
to disallow mmio access (via the 0 write to the command register), we end
up with the above 'Unsupported Request'.

The fix here is to first call kvm_deassign_irq(), before doing the reset,
and then calling assign_irq() to put the device in an INTx mode. In this
way, the device is a known state after reset (INTx mode), and we avoid touching
msi memory mapped space on any subsequent 'kvm_deassign_irq()', since we're
in INTx mode.

Thanks to Michael S. Tsirkin for help in understanding what was going on here.

Signed-off-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/device-assignment.c |   27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)
Jan Kiszka - March 30, 2012, 7:29 p.m.
On 2012-03-30 21:18, Jason Baron wrote:
> We've hit a kernel host panic, when issuing a 'system_reset' with a 82576 nic
> assigned and a Windows guest. Host system is a PowerEdge R815.
> 
> [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
> [Hardware Error]: APEI generic hardware error status
> [Hardware Error]: severity: 1, fatal
> [Hardware Error]: section: 0, severity: 1, fatal
> [Hardware Error]: flags: 0x01
> [Hardware Error]: primary
> [Hardware Error]: section_type: PCIe error
> [Hardware Error]: port_type: 0, PCIe end point
> [Hardware Error]: version: 1.0
> [Hardware Error]: command: 0x0000, status: 0x0010
> [Hardware Error]: device_id: 0000:08:00.0
> [Hardware Error]: slot: 1
> [Hardware Error]: secondary_bus: 0x00
> [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> [Hardware Error]: class_code: 000002
> [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> [Hardware Error]: Unsupported Request
> [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> [Hardware Error]: aer_uncor_severity: 0x00067011
> [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> [Hardware Error]: section: 1, severity: 1, fatal
> [Hardware Error]: flags: 0x01
> [Hardware Error]: primary
> [Hardware Error]: section_type: PCIe error
> [Hardware Error]: port_type: 0, PCIe end point
> [Hardware Error]: version: 1.0
> [Hardware Error]: command: 0x0000, status: 0x0010
> [Hardware Error]: device_id: 0000:08:00.0
> [Hardware Error]: slot: 1
> [Hardware Error]: secondary_bus: 0x00
> [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> [Hardware Error]: class_code: 000002
> [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> [Hardware Error]: Unsupported Request
> [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> [Hardware Error]: aer_uncor_severity: 0x00067011
> [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> Kernel panic - not syncing: Fatal hardware error!
> Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
> Call Trace:
>  <NMI>  [<ffffffff814f2fe5>] ? panic+0xa0/0x168
>  [<ffffffff812f919c>] ? ghes_notify_nmi+0x17c/0x180
>  [<ffffffff814f91d5>] ? notifier_call_chain+0x55/0x80
>  [<ffffffff814f923a>] ? atomic_notifier_call_chain+0x1a/0x20
>  [<ffffffff8109667e>] ? notify_die+0x2e/0x30
>  [<ffffffff814f6e81>] ? do_nmi+0x1a1/0x2b0
>  [<ffffffff814f6760>] ? nmi+0x20/0x30
>  [<ffffffff8103762b>] ? native_safe_halt+0xb/0x10
>  <<EOE>>  [<ffffffff8101495d>] ? default_idle+0x4d/0xb0
>  [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
>  [<ffffffff814da63a>] ? rest_init+0x7a/0x80
>  [<ffffffff81c1ff7b>] ? start_kernel+0x424/0x430
>  [<ffffffff81c1f33a>] ? x86_64_start_reservations+0x125/0x129
>  [<ffffffff81c1f438>] ? x86_64_start_kernel+0xfa/0x109
> 
> The root cause of the problem is that the 'reset_assigned_device()' code
> first writes a 0 to the command register. Then, when qemu subsequently does
> a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
> the kernel ends up calling '__msix_mask_irq()', which performs a write to
> the memory mapped msi vector space. Since, we've explicitly told the device
> to disallow mmio access (via the 0 write to the command register), we end
> up with the above 'Unsupported Request'.
> 
> The fix here is to first call kvm_deassign_irq(), before doing the reset,

s/fix/workaround/. This is a kernel bug if userspace can crash the
system like this, no? Let's fix the kernel first and then look at what
needs to be changed here.

Jan

> and then calling assign_irq() to put the device in an INTx mode. In this
> way, the device is a known state after reset (INTx mode), and we avoid touching
> msi memory mapped space on any subsequent 'kvm_deassign_irq()', since we're
> in INTx mode.
> 
> Thanks to Michael S. Tsirkin for help in understanding what was going on here.
> 
> Signed-off-by: Jason Baron <jbaron@redhat.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/device-assignment.c |   27 +++++++++++++++++++++++++++
>  1 files changed, 27 insertions(+), 0 deletions(-)
> 
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index 89823f1..31aed17 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -1609,10 +1609,32 @@ static void reset_assigned_device(DeviceState *dev)
>  {
>      PCIDevice *pci_dev = DO_UPCAST(PCIDevice, qdev, dev);
>      AssignedDevice *adev = DO_UPCAST(AssignedDevice, dev, pci_dev);
> +    struct kvm_assigned_irq assigned_irq_data;
>      char reset_file[64];
>      const char reset[] = "1";
>      int fd, ret;
>  
> +    /*
> +     * Make sure the irq for the device is set to a consistent state of INTx
> +     * on reset. This also ensures that a subsequent deassign_irq/assign_irq
> +     * sequence (such as during 'system_reset'), does not touch memory
> +     * mapped msi space, since we are about to disallow that access via a
> +     * 0 write to the command register. In addition, the 'kvm_deassign_irq()'
> +     * clears the msi enable bit, thus preventing any unexpected MSIs.
> +     */
> +    memset(&assigned_irq_data, 0, sizeof assigned_irq_data);
> +    assigned_irq_data.assigned_dev_id  =
> +        calc_assigned_dev_id(adev);
> +    assigned_irq_data.flags = adev->irq_requested_type;
> +    free_dev_irq_entries(adev);
> +    ret = kvm_deassign_irq(kvm_state, &assigned_irq_data);
> +    /* -ENXIO means no assigned irq */
> +    if (ret && ret != -ENXIO) {
> +        perror("reset_assigned_device: deassign irq");
> +    }
> +
> +    adev->irq_requested_type = 0;
> +
>      snprintf(reset_file, sizeof(reset_file),
>               "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/reset",
>               adev->host.seg, adev->host.bus, adev->host.dev, adev->host.func);
> @@ -1635,6 +1657,11 @@ static void reset_assigned_device(DeviceState *dev)
>       * disconnected from the PCI bus. This avoids further DMA transfers.
>       */
>      assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2);
> +
> +    ret = assign_irq(adev);
> +    if (ret) {
> +        perror("reset_assigned_device: assign irq");
> +    }
>  }
>  
>  static int assigned_initfn(struct PCIDevice *pci_dev)
Jason Baron - March 30, 2012, 8:13 p.m.
On Fri, Mar 30, 2012 at 09:29:23PM +0200, Jan Kiszka wrote:
> On 2012-03-30 21:18, Jason Baron wrote:
> > We've hit a kernel host panic, when issuing a 'system_reset' with a 82576 nic
> > assigned and a Windows guest. Host system is a PowerEdge R815.
> > 
> > [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
> > [Hardware Error]: APEI generic hardware error status
> > [Hardware Error]: severity: 1, fatal
> > [Hardware Error]: section: 0, severity: 1, fatal
> > [Hardware Error]: flags: 0x01
> > [Hardware Error]: primary
> > [Hardware Error]: section_type: PCIe error
> > [Hardware Error]: port_type: 0, PCIe end point
> > [Hardware Error]: version: 1.0
> > [Hardware Error]: command: 0x0000, status: 0x0010
> > [Hardware Error]: device_id: 0000:08:00.0
> > [Hardware Error]: slot: 1
> > [Hardware Error]: secondary_bus: 0x00
> > [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> > [Hardware Error]: class_code: 000002
> > [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> > [Hardware Error]: Unsupported Request
> > [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> > [Hardware Error]: aer_uncor_severity: 0x00067011
> > [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> > [Hardware Error]: section: 1, severity: 1, fatal
> > [Hardware Error]: flags: 0x01
> > [Hardware Error]: primary
> > [Hardware Error]: section_type: PCIe error
> > [Hardware Error]: port_type: 0, PCIe end point
> > [Hardware Error]: version: 1.0
> > [Hardware Error]: command: 0x0000, status: 0x0010
> > [Hardware Error]: device_id: 0000:08:00.0
> > [Hardware Error]: slot: 1
> > [Hardware Error]: secondary_bus: 0x00
> > [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
> > [Hardware Error]: class_code: 000002
> > [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
> > [Hardware Error]: Unsupported Request
> > [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
> > [Hardware Error]: aer_uncor_severity: 0x00067011
> > [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
> > Kernel panic - not syncing: Fatal hardware error!
> > Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
> > Call Trace:
> >  <NMI>  [<ffffffff814f2fe5>] ? panic+0xa0/0x168
> >  [<ffffffff812f919c>] ? ghes_notify_nmi+0x17c/0x180
> >  [<ffffffff814f91d5>] ? notifier_call_chain+0x55/0x80
> >  [<ffffffff814f923a>] ? atomic_notifier_call_chain+0x1a/0x20
> >  [<ffffffff8109667e>] ? notify_die+0x2e/0x30
> >  [<ffffffff814f6e81>] ? do_nmi+0x1a1/0x2b0
> >  [<ffffffff814f6760>] ? nmi+0x20/0x30
> >  [<ffffffff8103762b>] ? native_safe_halt+0xb/0x10
> >  <<EOE>>  [<ffffffff8101495d>] ? default_idle+0x4d/0xb0
> >  [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
> >  [<ffffffff814da63a>] ? rest_init+0x7a/0x80
> >  [<ffffffff81c1ff7b>] ? start_kernel+0x424/0x430
> >  [<ffffffff81c1f33a>] ? x86_64_start_reservations+0x125/0x129
> >  [<ffffffff81c1f438>] ? x86_64_start_kernel+0xfa/0x109
> > 
> > The root cause of the problem is that the 'reset_assigned_device()' code
> > first writes a 0 to the command register. Then, when qemu subsequently does
> > a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
> > the kernel ends up calling '__msix_mask_irq()', which performs a write to
> > the memory mapped msi vector space. Since, we've explicitly told the device
> > to disallow mmio access (via the 0 write to the command register), we end
> > up with the above 'Unsupported Request'.
> > 
> > The fix here is to first call kvm_deassign_irq(), before doing the reset,
> 
> s/fix/workaround/. This is a kernel bug if userspace can crash the
> system like this, no? Let's fix the kernel first and then look at what
> needs to be changed here.
> 
> Jan
> 

But don't I need special privalege to run the device assignment bits?
For example, this crash is precipitated by a write of '0' to the pci
device config register from userspace. Surely, not every is allowed to
do that write. So it seems to me, that this patch is in keeping with the
current model of how things work.

Thanks,

-Jason
Jan Kiszka - March 30, 2012, 8:18 p.m.
On 2012-03-30 22:13, Jason Baron wrote:
> On Fri, Mar 30, 2012 at 09:29:23PM +0200, Jan Kiszka wrote:
>> On 2012-03-30 21:18, Jason Baron wrote:
>>> We've hit a kernel host panic, when issuing a 'system_reset' with a 82576 nic
>>> assigned and a Windows guest. Host system is a PowerEdge R815.
>>>
>>> [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
>>> [Hardware Error]: APEI generic hardware error status
>>> [Hardware Error]: severity: 1, fatal
>>> [Hardware Error]: section: 0, severity: 1, fatal
>>> [Hardware Error]: flags: 0x01
>>> [Hardware Error]: primary
>>> [Hardware Error]: section_type: PCIe error
>>> [Hardware Error]: port_type: 0, PCIe end point
>>> [Hardware Error]: version: 1.0
>>> [Hardware Error]: command: 0x0000, status: 0x0010
>>> [Hardware Error]: device_id: 0000:08:00.0
>>> [Hardware Error]: slot: 1
>>> [Hardware Error]: secondary_bus: 0x00
>>> [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
>>> [Hardware Error]: class_code: 000002
>>> [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
>>> [Hardware Error]: Unsupported Request
>>> [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
>>> [Hardware Error]: aer_uncor_severity: 0x00067011
>>> [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
>>> [Hardware Error]: section: 1, severity: 1, fatal
>>> [Hardware Error]: flags: 0x01
>>> [Hardware Error]: primary
>>> [Hardware Error]: section_type: PCIe error
>>> [Hardware Error]: port_type: 0, PCIe end point
>>> [Hardware Error]: version: 1.0
>>> [Hardware Error]: command: 0x0000, status: 0x0010
>>> [Hardware Error]: device_id: 0000:08:00.0
>>> [Hardware Error]: slot: 1
>>> [Hardware Error]: secondary_bus: 0x00
>>> [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
>>> [Hardware Error]: class_code: 000002
>>> [Hardware Error]: aer_status: 0x00100000, aer_mask: 0x00018000
>>> [Hardware Error]: Unsupported Request
>>> [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
>>> [Hardware Error]: aer_uncor_severity: 0x00067011
>>> [Hardware Error]: aer_tlp_header: 40001001 0020000f edbf800c 01000000
>>> Kernel panic - not syncing: Fatal hardware error!
>>> Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
>>> Call Trace:
>>>  <NMI>  [<ffffffff814f2fe5>] ? panic+0xa0/0x168
>>>  [<ffffffff812f919c>] ? ghes_notify_nmi+0x17c/0x180
>>>  [<ffffffff814f91d5>] ? notifier_call_chain+0x55/0x80
>>>  [<ffffffff814f923a>] ? atomic_notifier_call_chain+0x1a/0x20
>>>  [<ffffffff8109667e>] ? notify_die+0x2e/0x30
>>>  [<ffffffff814f6e81>] ? do_nmi+0x1a1/0x2b0
>>>  [<ffffffff814f6760>] ? nmi+0x20/0x30
>>>  [<ffffffff8103762b>] ? native_safe_halt+0xb/0x10
>>>  <<EOE>>  [<ffffffff8101495d>] ? default_idle+0x4d/0xb0
>>>  [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
>>>  [<ffffffff814da63a>] ? rest_init+0x7a/0x80
>>>  [<ffffffff81c1ff7b>] ? start_kernel+0x424/0x430
>>>  [<ffffffff81c1f33a>] ? x86_64_start_reservations+0x125/0x129
>>>  [<ffffffff81c1f438>] ? x86_64_start_kernel+0xfa/0x109
>>>
>>> The root cause of the problem is that the 'reset_assigned_device()' code
>>> first writes a 0 to the command register. Then, when qemu subsequently does
>>> a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
>>> the kernel ends up calling '__msix_mask_irq()', which performs a write to
>>> the memory mapped msi vector space. Since, we've explicitly told the device
>>> to disallow mmio access (via the 0 write to the command register), we end
>>> up with the above 'Unsupported Request'.
>>>
>>> The fix here is to first call kvm_deassign_irq(), before doing the reset,
>>
>> s/fix/workaround/. This is a kernel bug if userspace can crash the
>> system like this, no? Let's fix the kernel first and then look at what
>> needs to be changed here.
>>
>> Jan
>>
> 
> But don't I need special privalege to run the device assignment bits?

Yes, but even that might be moderated by a management component like
libvirt.

> For example, this crash is precipitated by a write of '0' to the pci
> device config register from userspace. Surely, not every is allowed to
> do that write. So it seems to me, that this patch is in keeping with the
> current model of how things work.

No user should needlessly be able to crash the host by issuing valid
commands in a special order.

Jan
Jason Baron - March 30, 2012, 8:31 p.m.
On Fri, Mar 30, 2012 at 10:18:31PM +0200, Jan Kiszka wrote:
> >>> The root cause of the problem is that the 'reset_assigned_device()' code
> >>> first writes a 0 to the command register. Then, when qemu subsequently does
> >>> a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
> >>> the kernel ends up calling '__msix_mask_irq()', which performs a write to
> >>> the memory mapped msi vector space. Since, we've explicitly told the device
> >>> to disallow mmio access (via the 0 write to the command register), we end
> >>> up with the above 'Unsupported Request'.
> >>>
> >>> The fix here is to first call kvm_deassign_irq(), before doing the reset,
> >>
> >> s/fix/workaround/. This is a kernel bug if userspace can crash the
> >> system like this, no? Let's fix the kernel first and then look at what
> >> needs to be changed here.
> >>
> >> Jan
> >>
> > 
> > But don't I need special privalege to run the device assignment bits?
> 
> Yes, but even that might be moderated by a management component like
> libvirt.
> 
> > For example, this crash is precipitated by a write of '0' to the pci
> > device config register from userspace. Surely, not every is allowed to
> > do that write. So it seems to me, that this patch is in keeping with the
> > current model of how things work.
> 
> No user should needlessly be able to crash the host by issuing valid
> commands in a special order.
> 
> Jan
> 

Right, but as I see device-assign.c, we are essentially programming the
pci device directly from userspace. Put another way, the kernel could
crash the system if it programmed a pci device in the wrong order. So I
don't see how this is different. But maybe I'm misunderstanding the
model here?

Thanks,

-Jason
Jan Kiszka - March 30, 2012, 8:35 p.m.
On 2012-03-30 22:31, Jason Baron wrote:
> On Fri, Mar 30, 2012 at 10:18:31PM +0200, Jan Kiszka wrote:
>>>>> The root cause of the problem is that the 'reset_assigned_device()' code
>>>>> first writes a 0 to the command register. Then, when qemu subsequently does
>>>>> a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
>>>>> the kernel ends up calling '__msix_mask_irq()', which performs a write to
>>>>> the memory mapped msi vector space. Since, we've explicitly told the device
>>>>> to disallow mmio access (via the 0 write to the command register), we end
>>>>> up with the above 'Unsupported Request'.
>>>>>
>>>>> The fix here is to first call kvm_deassign_irq(), before doing the reset,
>>>>
>>>> s/fix/workaround/. This is a kernel bug if userspace can crash the
>>>> system like this, no? Let's fix the kernel first and then look at what
>>>> needs to be changed here.
>>>>
>>>> Jan
>>>>
>>>
>>> But don't I need special privalege to run the device assignment bits?
>>
>> Yes, but even that might be moderated by a management component like
>> libvirt.
>>
>>> For example, this crash is precipitated by a write of '0' to the pci
>>> device config register from userspace. Surely, not every is allowed to
>>> do that write. So it seems to me, that this patch is in keeping with the
>>> current model of how things work.
>>
>> No user should needlessly be able to crash the host by issuing valid
>> commands in a special order.
>>
>> Jan
>>
> 
> Right, but as I see device-assign.c, we are essentially programming the
> pci device directly from userspace. Put another way, the kernel could
> crash the system if it programmed a pci device in the wrong order. So I
> don't see how this is different. But maybe I'm misunderstanding the
> model here?

The model is that the KVM device assignment subsystem in the kernel (or
vfio in the future) + the IOMMU confine what userspace (not the qemu
guest) can do with the device and prevent that any harm is caused to the
system. There might be practical holes in this model, but this is still
what we are striving for.

Jan
Alex Williamson - March 30, 2012, 9:09 p.m.
On Fri, 2012-03-30 at 22:35 +0200, Jan Kiszka wrote:
> On 2012-03-30 22:31, Jason Baron wrote:
> > On Fri, Mar 30, 2012 at 10:18:31PM +0200, Jan Kiszka wrote:
> >>>>> The root cause of the problem is that the 'reset_assigned_device()' code
> >>>>> first writes a 0 to the command register. Then, when qemu subsequently does
> >>>>> a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
> >>>>> the kernel ends up calling '__msix_mask_irq()', which performs a write to
> >>>>> the memory mapped msi vector space. Since, we've explicitly told the device
> >>>>> to disallow mmio access (via the 0 write to the command register), we end
> >>>>> up with the above 'Unsupported Request'.
> >>>>>
> >>>>> The fix here is to first call kvm_deassign_irq(), before doing the reset,
> >>>>
> >>>> s/fix/workaround/. This is a kernel bug if userspace can crash the
> >>>> system like this, no? Let's fix the kernel first and then look at what
> >>>> needs to be changed here.
> >>>>
> >>>> Jan
> >>>>
> >>>
> >>> But don't I need special privalege to run the device assignment bits?
> >>
> >> Yes, but even that might be moderated by a management component like
> >> libvirt.
> >>
> >>> For example, this crash is precipitated by a write of '0' to the pci
> >>> device config register from userspace. Surely, not every is allowed to
> >>> do that write. So it seems to me, that this patch is in keeping with the
> >>> current model of how things work.
> >>
> >> No user should needlessly be able to crash the host by issuing valid
> >> commands in a special order.
> >>
> >> Jan
> >>
> > 
> > Right, but as I see device-assign.c, we are essentially programming the
> > pci device directly from userspace. Put another way, the kernel could
> > crash the system if it programmed a pci device in the wrong order. So I
> > don't see how this is different. But maybe I'm misunderstanding the
> > model here?
> 
> The model is that the KVM device assignment subsystem in the kernel (or
> vfio in the future) + the IOMMU confine what userspace (not the qemu
> guest) can do with the device and prevent that any harm is caused to the
> system. There might be practical holes in this model, but this is still
> what we are striving for.

Jan,

It's possible to cause this same crash w/o device assignment involved.
Use setpci to zero the command register, then set smp_affinity on the
device using MSI-X.  It's the right thing to do for qemu reset to return
device interrupts to a known state, even if it does just happen to be a
workaround for this problem.  Given that we're using a generic interface
for config space, I'm not sure how practical it is to start arbitrarily
picking certain things as bad and making them inaccessible (there are
plenty of targets in PCI config space on x86).  Thanks,

Alex
Jan Kiszka - March 30, 2012, 10:15 p.m.
On 2012-03-30 23:09, Alex Williamson wrote:
> On Fri, 2012-03-30 at 22:35 +0200, Jan Kiszka wrote:
>> On 2012-03-30 22:31, Jason Baron wrote:
>>> On Fri, Mar 30, 2012 at 10:18:31PM +0200, Jan Kiszka wrote:
>>>>>>> The root cause of the problem is that the 'reset_assigned_device()' code
>>>>>>> first writes a 0 to the command register. Then, when qemu subsequently does
>>>>>>> a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
>>>>>>> the kernel ends up calling '__msix_mask_irq()', which performs a write to
>>>>>>> the memory mapped msi vector space. Since, we've explicitly told the device
>>>>>>> to disallow mmio access (via the 0 write to the command register), we end
>>>>>>> up with the above 'Unsupported Request'.
>>>>>>>
>>>>>>> The fix here is to first call kvm_deassign_irq(), before doing the reset,
>>>>>>
>>>>>> s/fix/workaround/. This is a kernel bug if userspace can crash the
>>>>>> system like this, no? Let's fix the kernel first and then look at what
>>>>>> needs to be changed here.
>>>>>>
>>>>>> Jan
>>>>>>
>>>>>
>>>>> But don't I need special privalege to run the device assignment bits?
>>>>
>>>> Yes, but even that might be moderated by a management component like
>>>> libvirt.
>>>>
>>>>> For example, this crash is precipitated by a write of '0' to the pci
>>>>> device config register from userspace. Surely, not every is allowed to
>>>>> do that write. So it seems to me, that this patch is in keeping with the
>>>>> current model of how things work.
>>>>
>>>> No user should needlessly be able to crash the host by issuing valid
>>>> commands in a special order.
>>>>
>>>> Jan
>>>>
>>>
>>> Right, but as I see device-assign.c, we are essentially programming the
>>> pci device directly from userspace. Put another way, the kernel could
>>> crash the system if it programmed a pci device in the wrong order. So I
>>> don't see how this is different. But maybe I'm misunderstanding the
>>> model here?
>>
>> The model is that the KVM device assignment subsystem in the kernel (or
>> vfio in the future) + the IOMMU confine what userspace (not the qemu
>> guest) can do with the device and prevent that any harm is caused to the
>> system. There might be practical holes in this model, but this is still
>> what we are striving for.
> 
> Jan,
> 
> It's possible to cause this same crash w/o device assignment involved.
> Use setpci to zero the command register, then set smp_affinity on the
> device using MSI-X.  It's the right thing to do for qemu reset to return
> device interrupts to a known state, even if it does just happen to be a
> workaround for this problem.

Looks like the right thing is first of all to catch hardware errors of
an untrusted device and handle them more gracefully than panicing the
whole system. I suppose the very same panic can be triggered by writing
to other MMIO bars after fiddling with the command register. And as we
are passing the command register through, this is even guest
triggerable, no? Can failing PIO access cause such panics as well?

>  Given that we're using a generic interface
> for config space, I'm not sure how practical it is to start arbitrarily
> picking certain things as bad and making them inaccessible (there are
> plenty of targets in PCI config space on x86).  Thanks,

This patch provides band-aid for a single symptom. But we rather need
cure for the disease you discovered.

Jan
Jan Kiszka - March 31, 2012, 8:54 a.m.
On 2012-03-30 21:18, Jason Baron wrote:
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index 89823f1..31aed17 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -1609,10 +1609,32 @@ static void reset_assigned_device(DeviceState *dev)
>  {
>      PCIDevice *pci_dev = DO_UPCAST(PCIDevice, qdev, dev);
>      AssignedDevice *adev = DO_UPCAST(AssignedDevice, dev, pci_dev);
> +    struct kvm_assigned_irq assigned_irq_data;
>      char reset_file[64];
>      const char reset[] = "1";
>      int fd, ret;
>  
> +    /*
> +     * Make sure the irq for the device is set to a consistent state of INTx
> +     * on reset. This also ensures that a subsequent deassign_irq/assign_irq
> +     * sequence (such as during 'system_reset'), does not touch memory
> +     * mapped msi space, since we are about to disallow that access via a
> +     * 0 write to the command register. In addition, the 'kvm_deassign_irq()'
> +     * clears the msi enable bit, thus preventing any unexpected MSIs.
> +     */
> +    memset(&assigned_irq_data, 0, sizeof assigned_irq_data);
> +    assigned_irq_data.assigned_dev_id  =
> +        calc_assigned_dev_id(adev);
> +    assigned_irq_data.flags = adev->irq_requested_type;
> +    free_dev_irq_entries(adev);
> +    ret = kvm_deassign_irq(kvm_state, &assigned_irq_data);
> +    /* -ENXIO means no assigned irq */
> +    if (ret && ret != -ENXIO) {
> +        perror("reset_assigned_device: deassign irq");
> +    }
> +
> +    adev->irq_requested_type = 0;
> +

What is actually missing here, independent of the kernel bug you saw, is
a proper reset of the MSI[-X] control flags followed by a call to our
update handlers. The day we stop open-coding MSI support here, that will
be triggered by the MSI layer, but for now we need a

if (requested_type == MSI)
	clear_msi_in_shadow_cap
	update_msi
else if (requested_type == MSIX)
	clear_msix_in_shadow_cap
	update_msix

Or, better, fix the update handlers to allow unconditional

clear_msi_in_shadow_cap
update_msi
clear_msix_in_shadow_cap
update_msix

because that is the pattern we will once get from a refactored version
anyway where msi[x]_reset will be invoked automatically.

But we also need to calm down the PCI error handling to not get hysteric
over untrusted devices.

Jan
Michael S. Tsirkin - April 1, 2012, 10:57 a.m.
On Sat, Mar 31, 2012 at 12:15:34AM +0200, Jan Kiszka wrote:
> On 2012-03-30 23:09, Alex Williamson wrote:
> > On Fri, 2012-03-30 at 22:35 +0200, Jan Kiszka wrote:
> >> On 2012-03-30 22:31, Jason Baron wrote:
> >>> On Fri, Mar 30, 2012 at 10:18:31PM +0200, Jan Kiszka wrote:
> >>>>>>> The root cause of the problem is that the 'reset_assigned_device()' code
> >>>>>>> first writes a 0 to the command register. Then, when qemu subsequently does
> >>>>>>> a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
> >>>>>>> the kernel ends up calling '__msix_mask_irq()', which performs a write to
> >>>>>>> the memory mapped msi vector space. Since, we've explicitly told the device
> >>>>>>> to disallow mmio access (via the 0 write to the command register), we end
> >>>>>>> up with the above 'Unsupported Request'.
> >>>>>>>
> >>>>>>> The fix here is to first call kvm_deassign_irq(), before doing the reset,
> >>>>>>
> >>>>>> s/fix/workaround/. This is a kernel bug if userspace can crash the
> >>>>>> system like this, no? Let's fix the kernel first and then look at what
> >>>>>> needs to be changed here.
> >>>>>>
> >>>>>> Jan
> >>>>>>
> >>>>>
> >>>>> But don't I need special privalege to run the device assignment bits?
> >>>>
> >>>> Yes, but even that might be moderated by a management component like
> >>>> libvirt.
> >>>>
> >>>>> For example, this crash is precipitated by a write of '0' to the pci
> >>>>> device config register from userspace. Surely, not every is allowed to
> >>>>> do that write. So it seems to me, that this patch is in keeping with the
> >>>>> current model of how things work.
> >>>>
> >>>> No user should needlessly be able to crash the host by issuing valid
> >>>> commands in a special order.
> >>>>
> >>>> Jan
> >>>>
> >>>
> >>> Right, but as I see device-assign.c, we are essentially programming the
> >>> pci device directly from userspace. Put another way, the kernel could
> >>> crash the system if it programmed a pci device in the wrong order. So I
> >>> don't see how this is different. But maybe I'm misunderstanding the
> >>> model here?
> >>
> >> The model is that the KVM device assignment subsystem in the kernel (or
> >> vfio in the future) + the IOMMU confine what userspace (not the qemu
> >> guest) can do with the device and prevent that any harm is caused to the
> >> system. There might be practical holes in this model, but this is still
> >> what we are striving for.
> > 
> > Jan,

I don't think the pci sysfs attributes or qemu device assignment
were designed with this in mind. For example, it's up to qemu to assign
the device to the iommu domain through kvm, is it not?
IMO, if you want to protect against untrusted userspace, you
would need to address this in pci sysfs as a first step. For example, add
more selinux hooks to control what qemu can do. Another alternative is
building a completely new interface like vfio tries to do.

> > It's possible to cause this same crash w/o device assignment involved.
> > Use setpci to zero the command register, then set smp_affinity on the
> > device using MSI-X.  It's the right thing to do for qemu reset to return
> > device interrupts to a known state, even if it does just happen to be a
> > workaround for this problem.

Basically the driver/kernel are in control of the command register.
If you clobber it from userspace, you might break things.
Another easy example would be resetting the device directly instead
of through the sysfs attribute.


> Looks like the right thing is first of all to catch hardware errors of
> an untrusted device and handle them more gracefully than panicing the
> whole system.

IMO it's ansolutely the right thing to do but the fixes probably can be
applied in any order :)
Also, ideally we'd pass the errors on to the guest.

> I suppose the very same panic can be triggered by writing
> to other MMIO bars after fiddling with the command register. And as we
> are passing the command register through, this is even guest
> triggerable, no?

I don't think we pass through command register writes,
if we did I think that would be a bug.
We do something for PCI-X command register, I didn't
look closely so I'm not sure whether what we do is correct.

> Can failing PIO access cause such panics as well?

I think they can.

Patch

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 89823f1..31aed17 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1609,10 +1609,32 @@  static void reset_assigned_device(DeviceState *dev)
 {
     PCIDevice *pci_dev = DO_UPCAST(PCIDevice, qdev, dev);
     AssignedDevice *adev = DO_UPCAST(AssignedDevice, dev, pci_dev);
+    struct kvm_assigned_irq assigned_irq_data;
     char reset_file[64];
     const char reset[] = "1";
     int fd, ret;
 
+    /*
+     * Make sure the irq for the device is set to a consistent state of INTx
+     * on reset. This also ensures that a subsequent deassign_irq/assign_irq
+     * sequence (such as during 'system_reset'), does not touch memory
+     * mapped msi space, since we are about to disallow that access via a
+     * 0 write to the command register. In addition, the 'kvm_deassign_irq()'
+     * clears the msi enable bit, thus preventing any unexpected MSIs.
+     */
+    memset(&assigned_irq_data, 0, sizeof assigned_irq_data);
+    assigned_irq_data.assigned_dev_id  =
+        calc_assigned_dev_id(adev);
+    assigned_irq_data.flags = adev->irq_requested_type;
+    free_dev_irq_entries(adev);
+    ret = kvm_deassign_irq(kvm_state, &assigned_irq_data);
+    /* -ENXIO means no assigned irq */
+    if (ret && ret != -ENXIO) {
+        perror("reset_assigned_device: deassign irq");
+    }
+
+    adev->irq_requested_type = 0;
+
     snprintf(reset_file, sizeof(reset_file),
              "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/reset",
              adev->host.seg, adev->host.bus, adev->host.dev, adev->host.func);
@@ -1635,6 +1657,11 @@  static void reset_assigned_device(DeviceState *dev)
      * disconnected from the PCI bus. This avoids further DMA transfers.
      */
     assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2);
+
+    ret = assign_irq(adev);
+    if (ret) {
+        perror("reset_assigned_device: assign irq");
+    }
 }
 
 static int assigned_initfn(struct PCIDevice *pci_dev)