diff mbox series

[{focal,groovy}:linux-azure] PCI: hv: Fix hibernation in case interrupts are not re-created

Message ID 20201023204428.2558110-1-marcelo.cerri@canonical.com
State New
Headers show
Series [{focal,groovy}:linux-azure] PCI: hv: Fix hibernation in case interrupts are not re-created | expand

Commit Message

Marcelo Henrique Cerri Oct. 23, 2020, 8:44 p.m. UTC
From: Dexuan Cui <decui@microsoft.com>

BugLink: https://bugs.launchpad.net/bugs/1894893

pci_restore_msi_state() directly writes the MSI/MSI-X related registers
via MMIO. On a physical machine, this works perfectly; for a Linux VM
running on a hypervisor, which typically enables IOMMU interrupt remapping,
the hypervisor usually should trap and emulate the MMIO accesses in order
to re-create the necessary interrupt remapping table entries in the IOMMU,
otherwise the interrupts can not work in the VM after hibernation.

Hyper-V is different from other hypervisors in that it does not trap and
emulate the MMIO accesses, and instead it uses a para-virtualized method,
which requires the VM to call hv_compose_msi_msg() to notify the hypervisor
of the info that would be passed to the hypervisor in the case of the
trap-and-emulate method. This is not an issue to a lot of PCI device
drivers, which destroy and re-create the interrupts across hibernation, so
hv_compose_msi_msg() is called automatically. However, some PCI device
drivers (e.g. the in-tree GPU driver nouveau and the out-of-tree Nvidia
proprietary GPU driver) do not destroy and re-create MSI/MSI-X interrupts
across hibernation, so hv_pci_resume() has to call hv_compose_msi_msg(),
otherwise the PCI device drivers can no longer receive interrupts after
the VM resumes from hibernation.

Hyper-V is also different in that chip->irq_unmask() may fail in a
Linux VM running on Hyper-V (on a physical machine, chip->irq_unmask()
can not fail because unmasking an MSI/MSI-X register just means an MMIO
write): during hibernation, when a CPU is offlined, the kernel tries
to move the interrupt to the remaining CPUs that haven't been offlined
yet. In this case, hv_irq_unmask() -> hv_do_hypercall() always fails
because the vmbus channel has been closed: here the early "return" in
hv_irq_unmask() means the pci_msi_unmask_irq() is not called, i.e. the
desc->masked remains "true", so later after hibernation, the MSI interrupt
always remains masked, which is incorrect. Refer to cpu_disable_common()
-> fixup_irqs() -> irq_migrate_all_off_this_cpu() -> migrate_one_irq():

static bool migrate_one_irq(struct irq_desc *desc)
{
...
        if (maskchip && chip->irq_mask)
                chip->irq_mask(d);
...
        err = irq_do_set_affinity(d, affinity, false);
...
        if (maskchip && chip->irq_unmask)
                chip->irq_unmask(d);

Fix the issue by calling pci_msi_unmask_irq() unconditionally in
hv_irq_unmask(). Also suppress the error message for hibernation because
the hypercall failure during hibernation does not matter (at this time
all the devices have been frozen). Note: the correct affinity info is
still updated into the irqdata data structure in migrate_one_irq() ->
irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
resumes, hv_pci_restore_msi_state() is able to correctly restore
the interrupt with the correct affinity.

Link: https://lore.kernel.org/r/20201002085158.9168-1-decui@microsoft.com
Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Jake Oshins <jakeo@microsoft.com>
(cherry picked from commit 915cff7f38c5e4d47f187f8049245afc2cb3e503)
Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
---

Clean cherry-pich from upstream with positive test results.

---
 drivers/pci/controller/pci-hyperv.c | 50 +++++++++++++++++++++++++++--
 1 file changed, 47 insertions(+), 3 deletions(-)

Comments

Colin Ian King Oct. 23, 2020, 9:55 p.m. UTC | #1
On 23/10/2020 21:44, Marcelo Henrique Cerri wrote:
> From: Dexuan Cui <decui@microsoft.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1894893
> 
> pci_restore_msi_state() directly writes the MSI/MSI-X related registers
> via MMIO. On a physical machine, this works perfectly; for a Linux VM
> running on a hypervisor, which typically enables IOMMU interrupt remapping,
> the hypervisor usually should trap and emulate the MMIO accesses in order
> to re-create the necessary interrupt remapping table entries in the IOMMU,
> otherwise the interrupts can not work in the VM after hibernation.
> 
> Hyper-V is different from other hypervisors in that it does not trap and
> emulate the MMIO accesses, and instead it uses a para-virtualized method,
> which requires the VM to call hv_compose_msi_msg() to notify the hypervisor
> of the info that would be passed to the hypervisor in the case of the
> trap-and-emulate method. This is not an issue to a lot of PCI device
> drivers, which destroy and re-create the interrupts across hibernation, so
> hv_compose_msi_msg() is called automatically. However, some PCI device
> drivers (e.g. the in-tree GPU driver nouveau and the out-of-tree Nvidia
> proprietary GPU driver) do not destroy and re-create MSI/MSI-X interrupts
> across hibernation, so hv_pci_resume() has to call hv_compose_msi_msg(),
> otherwise the PCI device drivers can no longer receive interrupts after
> the VM resumes from hibernation.
> 
> Hyper-V is also different in that chip->irq_unmask() may fail in a
> Linux VM running on Hyper-V (on a physical machine, chip->irq_unmask()
> can not fail because unmasking an MSI/MSI-X register just means an MMIO
> write): during hibernation, when a CPU is offlined, the kernel tries
> to move the interrupt to the remaining CPUs that haven't been offlined
> yet. In this case, hv_irq_unmask() -> hv_do_hypercall() always fails
> because the vmbus channel has been closed: here the early "return" in
> hv_irq_unmask() means the pci_msi_unmask_irq() is not called, i.e. the
> desc->masked remains "true", so later after hibernation, the MSI interrupt
> always remains masked, which is incorrect. Refer to cpu_disable_common()
> -> fixup_irqs() -> irq_migrate_all_off_this_cpu() -> migrate_one_irq():
> 
> static bool migrate_one_irq(struct irq_desc *desc)
> {
> ...
>         if (maskchip && chip->irq_mask)
>                 chip->irq_mask(d);
> ...
>         err = irq_do_set_affinity(d, affinity, false);
> ...
>         if (maskchip && chip->irq_unmask)
>                 chip->irq_unmask(d);
> 
> Fix the issue by calling pci_msi_unmask_irq() unconditionally in
> hv_irq_unmask(). Also suppress the error message for hibernation because
> the hypercall failure during hibernation does not matter (at this time
> all the devices have been frozen). Note: the correct affinity info is
> still updated into the irqdata data structure in migrate_one_irq() ->
> irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
> resumes, hv_pci_restore_msi_state() is able to correctly restore
> the interrupt with the correct affinity.
> 
> Link: https://lore.kernel.org/r/20201002085158.9168-1-decui@microsoft.com
> Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Reviewed-by: Jake Oshins <jakeo@microsoft.com>
> (cherry picked from commit 915cff7f38c5e4d47f187f8049245afc2cb3e503)
> Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
> ---
> 
> Clean cherry-pich from upstream with positive test results.
> 
> ---
>  drivers/pci/controller/pci-hyperv.c | 50 +++++++++++++++++++++++++++--
>  1 file changed, 47 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 5d074d78da97..b9bff3539918 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1312,11 +1312,25 @@ static void hv_irq_unmask(struct irq_data *data)
>  exit_unlock:
>  	spin_unlock_irqrestore(&hbus->retarget_msi_interrupt_lock, flags);
>  
> -	if (res) {
> +	/*
> +	 * During hibernation, when a CPU is offlined, the kernel tries
> +	 * to move the interrupt to the remaining CPUs that haven't
> +	 * been offlined yet. In this case, the below hv_do_hypercall()
> +	 * always fails since the vmbus channel has been closed:
> +	 * refer to cpu_disable_common() -> fixup_irqs() ->
> +	 * irq_migrate_all_off_this_cpu() -> migrate_one_irq().
> +	 *
> +	 * Suppress the error message for hibernation because the failure
> +	 * during hibernation does not matter (at this time all the devices
> +	 * have been frozen). Note: the correct affinity info is still updated
> +	 * into the irqdata data structure in migrate_one_irq() ->
> +	 * irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
> +	 * resumes, hv_pci_restore_msi_state() is able to correctly restore
> +	 * the interrupt with the correct affinity.
> +	 */
> +	if (res && hbus->state != hv_pcibus_removing)
>  		dev_err(&hbus->hdev->device,
>  			"%s() failed: %#llx", __func__, res);
> -		return;
> -	}
>  
>  	pci_msi_unmask_irq(data);
>  }
> @@ -3348,6 +3362,34 @@ static int hv_pci_suspend(struct hv_device *hdev)
>  	return 0;
>  }
>  
> +static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
> +{
> +	struct msi_desc *entry;
> +	struct irq_data *irq_data;
> +
> +	for_each_pci_msi_entry(entry, pdev) {
> +		irq_data = irq_get_irq_data(entry->irq);
> +		if (WARN_ON_ONCE(!irq_data))
> +			return -EINVAL;
> +
> +		hv_compose_msi_msg(irq_data, &entry->msg);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Upon resume, pci_restore_msi_state() -> ... ->  __pci_write_msi_msg()
> + * directly writes the MSI/MSI-X registers via MMIO, but since Hyper-V
> + * doesn't trap and emulate the MMIO accesses, here hv_compose_msi_msg()
> + * must be used to ask Hyper-V to re-create the IOMMU Interrupt Remapping
> + * Table entries.
> + */
> +static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
> +{
> +	pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
> +}
> +
>  static int hv_pci_resume(struct hv_device *hdev)
>  {
>  	struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
> @@ -3381,6 +3423,8 @@ static int hv_pci_resume(struct hv_device *hdev)
>  
>  	prepopulate_bars(hbus);
>  
> +	hv_pci_restore_msi_state(hbus);
> +
>  	hbus->state = hv_pcibus_installed;
>  	return 0;
>  out:
> 

Clean cherry pick. Looks good to me. Thanks Marcelo

Acked-by: Colin Ian King <colin.king@canonical.com>
Stefan Bader Oct. 26, 2020, 8:16 a.m. UTC | #2
On 23.10.20 22:44, Marcelo Henrique Cerri wrote:
> From: Dexuan Cui <decui@microsoft.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1894893
> 
> pci_restore_msi_state() directly writes the MSI/MSI-X related registers
> via MMIO. On a physical machine, this works perfectly; for a Linux VM
> running on a hypervisor, which typically enables IOMMU interrupt remapping,
> the hypervisor usually should trap and emulate the MMIO accesses in order
> to re-create the necessary interrupt remapping table entries in the IOMMU,
> otherwise the interrupts can not work in the VM after hibernation.
> 
> Hyper-V is different from other hypervisors in that it does not trap and
> emulate the MMIO accesses, and instead it uses a para-virtualized method,
> which requires the VM to call hv_compose_msi_msg() to notify the hypervisor
> of the info that would be passed to the hypervisor in the case of the
> trap-and-emulate method. This is not an issue to a lot of PCI device
> drivers, which destroy and re-create the interrupts across hibernation, so
> hv_compose_msi_msg() is called automatically. However, some PCI device
> drivers (e.g. the in-tree GPU driver nouveau and the out-of-tree Nvidia
> proprietary GPU driver) do not destroy and re-create MSI/MSI-X interrupts
> across hibernation, so hv_pci_resume() has to call hv_compose_msi_msg(),
> otherwise the PCI device drivers can no longer receive interrupts after
> the VM resumes from hibernation.
> 
> Hyper-V is also different in that chip->irq_unmask() may fail in a
> Linux VM running on Hyper-V (on a physical machine, chip->irq_unmask()
> can not fail because unmasking an MSI/MSI-X register just means an MMIO
> write): during hibernation, when a CPU is offlined, the kernel tries
> to move the interrupt to the remaining CPUs that haven't been offlined
> yet. In this case, hv_irq_unmask() -> hv_do_hypercall() always fails
> because the vmbus channel has been closed: here the early "return" in
> hv_irq_unmask() means the pci_msi_unmask_irq() is not called, i.e. the
> desc->masked remains "true", so later after hibernation, the MSI interrupt
> always remains masked, which is incorrect. Refer to cpu_disable_common()
> -> fixup_irqs() -> irq_migrate_all_off_this_cpu() -> migrate_one_irq():
> 
> static bool migrate_one_irq(struct irq_desc *desc)
> {
> ...
>         if (maskchip && chip->irq_mask)
>                 chip->irq_mask(d);
> ...
>         err = irq_do_set_affinity(d, affinity, false);
> ...
>         if (maskchip && chip->irq_unmask)
>                 chip->irq_unmask(d);
> 
> Fix the issue by calling pci_msi_unmask_irq() unconditionally in
> hv_irq_unmask(). Also suppress the error message for hibernation because
> the hypercall failure during hibernation does not matter (at this time
> all the devices have been frozen). Note: the correct affinity info is
> still updated into the irqdata data structure in migrate_one_irq() ->
> irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
> resumes, hv_pci_restore_msi_state() is able to correctly restore
> the interrupt with the correct affinity.
> 
> Link: https://lore.kernel.org/r/20201002085158.9168-1-decui@microsoft.com
> Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Reviewed-by: Jake Oshins <jakeo@microsoft.com>
> (cherry picked from commit 915cff7f38c5e4d47f187f8049245afc2cb3e503)
> Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
> ---
> 
> Clean cherry-pich from upstream with positive test results.
> 
> ---
>  drivers/pci/controller/pci-hyperv.c | 50 +++++++++++++++++++++++++++--
>  1 file changed, 47 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 5d074d78da97..b9bff3539918 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1312,11 +1312,25 @@ static void hv_irq_unmask(struct irq_data *data)
>  exit_unlock:
>  	spin_unlock_irqrestore(&hbus->retarget_msi_interrupt_lock, flags);
>  
> -	if (res) {
> +	/*
> +	 * During hibernation, when a CPU is offlined, the kernel tries
> +	 * to move the interrupt to the remaining CPUs that haven't
> +	 * been offlined yet. In this case, the below hv_do_hypercall()
> +	 * always fails since the vmbus channel has been closed:
> +	 * refer to cpu_disable_common() -> fixup_irqs() ->
> +	 * irq_migrate_all_off_this_cpu() -> migrate_one_irq().
> +	 *
> +	 * Suppress the error message for hibernation because the failure
> +	 * during hibernation does not matter (at this time all the devices
> +	 * have been frozen). Note: the correct affinity info is still updated
> +	 * into the irqdata data structure in migrate_one_irq() ->
> +	 * irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
> +	 * resumes, hv_pci_restore_msi_state() is able to correctly restore
> +	 * the interrupt with the correct affinity.
> +	 */
> +	if (res && hbus->state != hv_pcibus_removing)
>  		dev_err(&hbus->hdev->device,
>  			"%s() failed: %#llx", __func__, res);
> -		return;
> -	}
>  
>  	pci_msi_unmask_irq(data);
>  }
> @@ -3348,6 +3362,34 @@ static int hv_pci_suspend(struct hv_device *hdev)
>  	return 0;
>  }
>  
> +static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
> +{
> +	struct msi_desc *entry;
> +	struct irq_data *irq_data;
> +
> +	for_each_pci_msi_entry(entry, pdev) {
> +		irq_data = irq_get_irq_data(entry->irq);
> +		if (WARN_ON_ONCE(!irq_data))
> +			return -EINVAL;
> +
> +		hv_compose_msi_msg(irq_data, &entry->msg);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Upon resume, pci_restore_msi_state() -> ... ->  __pci_write_msi_msg()
> + * directly writes the MSI/MSI-X registers via MMIO, but since Hyper-V
> + * doesn't trap and emulate the MMIO accesses, here hv_compose_msi_msg()
> + * must be used to ask Hyper-V to re-create the IOMMU Interrupt Remapping
> + * Table entries.
> + */
> +static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
> +{
> +	pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
> +}
> +
>  static int hv_pci_resume(struct hv_device *hdev)
>  {
>  	struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
> @@ -3381,6 +3423,8 @@ static int hv_pci_resume(struct hv_device *hdev)
>  
>  	prepopulate_bars(hbus);
>  
> +	hv_pci_restore_msi_state(hbus);
> +
>  	hbus->state = hv_pcibus_installed;
>  	return 0;
>  out:
>
Ian May Oct. 26, 2020, 6:33 p.m. UTC | #3
Applied to Azure Focal/master-next

Thanks!
Ian

On 2020-10-23 17:44:28 , Marcelo Henrique Cerri wrote:
> From: Dexuan Cui <decui@microsoft.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1894893
> 
> pci_restore_msi_state() directly writes the MSI/MSI-X related registers
> via MMIO. On a physical machine, this works perfectly; for a Linux VM
> running on a hypervisor, which typically enables IOMMU interrupt remapping,
> the hypervisor usually should trap and emulate the MMIO accesses in order
> to re-create the necessary interrupt remapping table entries in the IOMMU,
> otherwise the interrupts can not work in the VM after hibernation.
> 
> Hyper-V is different from other hypervisors in that it does not trap and
> emulate the MMIO accesses, and instead it uses a para-virtualized method,
> which requires the VM to call hv_compose_msi_msg() to notify the hypervisor
> of the info that would be passed to the hypervisor in the case of the
> trap-and-emulate method. This is not an issue to a lot of PCI device
> drivers, which destroy and re-create the interrupts across hibernation, so
> hv_compose_msi_msg() is called automatically. However, some PCI device
> drivers (e.g. the in-tree GPU driver nouveau and the out-of-tree Nvidia
> proprietary GPU driver) do not destroy and re-create MSI/MSI-X interrupts
> across hibernation, so hv_pci_resume() has to call hv_compose_msi_msg(),
> otherwise the PCI device drivers can no longer receive interrupts after
> the VM resumes from hibernation.
> 
> Hyper-V is also different in that chip->irq_unmask() may fail in a
> Linux VM running on Hyper-V (on a physical machine, chip->irq_unmask()
> can not fail because unmasking an MSI/MSI-X register just means an MMIO
> write): during hibernation, when a CPU is offlined, the kernel tries
> to move the interrupt to the remaining CPUs that haven't been offlined
> yet. In this case, hv_irq_unmask() -> hv_do_hypercall() always fails
> because the vmbus channel has been closed: here the early "return" in
> hv_irq_unmask() means the pci_msi_unmask_irq() is not called, i.e. the
> desc->masked remains "true", so later after hibernation, the MSI interrupt
> always remains masked, which is incorrect. Refer to cpu_disable_common()
> -> fixup_irqs() -> irq_migrate_all_off_this_cpu() -> migrate_one_irq():
> 
> static bool migrate_one_irq(struct irq_desc *desc)
> {
> ...
>         if (maskchip && chip->irq_mask)
>                 chip->irq_mask(d);
> ...
>         err = irq_do_set_affinity(d, affinity, false);
> ...
>         if (maskchip && chip->irq_unmask)
>                 chip->irq_unmask(d);
> 
> Fix the issue by calling pci_msi_unmask_irq() unconditionally in
> hv_irq_unmask(). Also suppress the error message for hibernation because
> the hypercall failure during hibernation does not matter (at this time
> all the devices have been frozen). Note: the correct affinity info is
> still updated into the irqdata data structure in migrate_one_irq() ->
> irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
> resumes, hv_pci_restore_msi_state() is able to correctly restore
> the interrupt with the correct affinity.
> 
> Link: https://lore.kernel.org/r/20201002085158.9168-1-decui@microsoft.com
> Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Reviewed-by: Jake Oshins <jakeo@microsoft.com>
> (cherry picked from commit 915cff7f38c5e4d47f187f8049245afc2cb3e503)
> Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
> ---
> 
> Clean cherry-pich from upstream with positive test results.
> 
> ---
>  drivers/pci/controller/pci-hyperv.c | 50 +++++++++++++++++++++++++++--
>  1 file changed, 47 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 5d074d78da97..b9bff3539918 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1312,11 +1312,25 @@ static void hv_irq_unmask(struct irq_data *data)
>  exit_unlock:
>  	spin_unlock_irqrestore(&hbus->retarget_msi_interrupt_lock, flags);
>  
> -	if (res) {
> +	/*
> +	 * During hibernation, when a CPU is offlined, the kernel tries
> +	 * to move the interrupt to the remaining CPUs that haven't
> +	 * been offlined yet. In this case, the below hv_do_hypercall()
> +	 * always fails since the vmbus channel has been closed:
> +	 * refer to cpu_disable_common() -> fixup_irqs() ->
> +	 * irq_migrate_all_off_this_cpu() -> migrate_one_irq().
> +	 *
> +	 * Suppress the error message for hibernation because the failure
> +	 * during hibernation does not matter (at this time all the devices
> +	 * have been frozen). Note: the correct affinity info is still updated
> +	 * into the irqdata data structure in migrate_one_irq() ->
> +	 * irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
> +	 * resumes, hv_pci_restore_msi_state() is able to correctly restore
> +	 * the interrupt with the correct affinity.
> +	 */
> +	if (res && hbus->state != hv_pcibus_removing)
>  		dev_err(&hbus->hdev->device,
>  			"%s() failed: %#llx", __func__, res);
> -		return;
> -	}
>  
>  	pci_msi_unmask_irq(data);
>  }
> @@ -3348,6 +3362,34 @@ static int hv_pci_suspend(struct hv_device *hdev)
>  	return 0;
>  }
>  
> +static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
> +{
> +	struct msi_desc *entry;
> +	struct irq_data *irq_data;
> +
> +	for_each_pci_msi_entry(entry, pdev) {
> +		irq_data = irq_get_irq_data(entry->irq);
> +		if (WARN_ON_ONCE(!irq_data))
> +			return -EINVAL;
> +
> +		hv_compose_msi_msg(irq_data, &entry->msg);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Upon resume, pci_restore_msi_state() -> ... ->  __pci_write_msi_msg()
> + * directly writes the MSI/MSI-X registers via MMIO, but since Hyper-V
> + * doesn't trap and emulate the MMIO accesses, here hv_compose_msi_msg()
> + * must be used to ask Hyper-V to re-create the IOMMU Interrupt Remapping
> + * Table entries.
> + */
> +static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
> +{
> +	pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
> +}
> +
>  static int hv_pci_resume(struct hv_device *hdev)
>  {
>  	struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
> @@ -3381,6 +3423,8 @@ static int hv_pci_resume(struct hv_device *hdev)
>  
>  	prepopulate_bars(hbus);
>  
> +	hv_pci_restore_msi_state(hbus);
> +
>  	hbus->state = hv_pcibus_installed;
>  	return 0;
>  out:
> -- 
> 2.25.1
> 
> 
> -- 
> kernel-team mailing list
> kernel-team@lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team
Kleber Souza Oct. 27, 2020, 10:37 a.m. UTC | #4
On 23.10.20 22:44, Marcelo Henrique Cerri wrote:
> From: Dexuan Cui <decui@microsoft.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1894893
> 
> pci_restore_msi_state() directly writes the MSI/MSI-X related registers
> via MMIO. On a physical machine, this works perfectly; for a Linux VM
> running on a hypervisor, which typically enables IOMMU interrupt remapping,
> the hypervisor usually should trap and emulate the MMIO accesses in order
> to re-create the necessary interrupt remapping table entries in the IOMMU,
> otherwise the interrupts can not work in the VM after hibernation.
> 
> Hyper-V is different from other hypervisors in that it does not trap and
> emulate the MMIO accesses, and instead it uses a para-virtualized method,
> which requires the VM to call hv_compose_msi_msg() to notify the hypervisor
> of the info that would be passed to the hypervisor in the case of the
> trap-and-emulate method. This is not an issue to a lot of PCI device
> drivers, which destroy and re-create the interrupts across hibernation, so
> hv_compose_msi_msg() is called automatically. However, some PCI device
> drivers (e.g. the in-tree GPU driver nouveau and the out-of-tree Nvidia
> proprietary GPU driver) do not destroy and re-create MSI/MSI-X interrupts
> across hibernation, so hv_pci_resume() has to call hv_compose_msi_msg(),
> otherwise the PCI device drivers can no longer receive interrupts after
> the VM resumes from hibernation.
> 
> Hyper-V is also different in that chip->irq_unmask() may fail in a
> Linux VM running on Hyper-V (on a physical machine, chip->irq_unmask()
> can not fail because unmasking an MSI/MSI-X register just means an MMIO
> write): during hibernation, when a CPU is offlined, the kernel tries
> to move the interrupt to the remaining CPUs that haven't been offlined
> yet. In this case, hv_irq_unmask() -> hv_do_hypercall() always fails
> because the vmbus channel has been closed: here the early "return" in
> hv_irq_unmask() means the pci_msi_unmask_irq() is not called, i.e. the
> desc->masked remains "true", so later after hibernation, the MSI interrupt
> always remains masked, which is incorrect. Refer to cpu_disable_common()
> -> fixup_irqs() -> irq_migrate_all_off_this_cpu() -> migrate_one_irq():
> 
> static bool migrate_one_irq(struct irq_desc *desc)
> {
> ...
>          if (maskchip && chip->irq_mask)
>                  chip->irq_mask(d);
> ...
>          err = irq_do_set_affinity(d, affinity, false);
> ...
>          if (maskchip && chip->irq_unmask)
>                  chip->irq_unmask(d);
> 
> Fix the issue by calling pci_msi_unmask_irq() unconditionally in
> hv_irq_unmask(). Also suppress the error message for hibernation because
> the hypercall failure during hibernation does not matter (at this time
> all the devices have been frozen). Note: the correct affinity info is
> still updated into the irqdata data structure in migrate_one_irq() ->
> irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
> resumes, hv_pci_restore_msi_state() is able to correctly restore
> the interrupt with the correct affinity.
> 
> Link: https://lore.kernel.org/r/20201002085158.9168-1-decui@microsoft.com
> Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Reviewed-by: Jake Oshins <jakeo@microsoft.com>
> (cherry picked from commit 915cff7f38c5e4d47f187f8049245afc2cb3e503)
> Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>

Applied to groovy/linux-azure.

Thanks,
Kleber

> ---
> 
> Clean cherry-pich from upstream with positive test results.
> 
> ---
>   drivers/pci/controller/pci-hyperv.c | 50 +++++++++++++++++++++++++++--
>   1 file changed, 47 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 5d074d78da97..b9bff3539918 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1312,11 +1312,25 @@ static void hv_irq_unmask(struct irq_data *data)
>   exit_unlock:
>   	spin_unlock_irqrestore(&hbus->retarget_msi_interrupt_lock, flags);
>   
> -	if (res) {
> +	/*
> +	 * During hibernation, when a CPU is offlined, the kernel tries
> +	 * to move the interrupt to the remaining CPUs that haven't
> +	 * been offlined yet. In this case, the below hv_do_hypercall()
> +	 * always fails since the vmbus channel has been closed:
> +	 * refer to cpu_disable_common() -> fixup_irqs() ->
> +	 * irq_migrate_all_off_this_cpu() -> migrate_one_irq().
> +	 *
> +	 * Suppress the error message for hibernation because the failure
> +	 * during hibernation does not matter (at this time all the devices
> +	 * have been frozen). Note: the correct affinity info is still updated
> +	 * into the irqdata data structure in migrate_one_irq() ->
> +	 * irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
> +	 * resumes, hv_pci_restore_msi_state() is able to correctly restore
> +	 * the interrupt with the correct affinity.
> +	 */
> +	if (res && hbus->state != hv_pcibus_removing)
>   		dev_err(&hbus->hdev->device,
>   			"%s() failed: %#llx", __func__, res);
> -		return;
> -	}
>   
>   	pci_msi_unmask_irq(data);
>   }
> @@ -3348,6 +3362,34 @@ static int hv_pci_suspend(struct hv_device *hdev)
>   	return 0;
>   }
>   
> +static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
> +{
> +	struct msi_desc *entry;
> +	struct irq_data *irq_data;
> +
> +	for_each_pci_msi_entry(entry, pdev) {
> +		irq_data = irq_get_irq_data(entry->irq);
> +		if (WARN_ON_ONCE(!irq_data))
> +			return -EINVAL;
> +
> +		hv_compose_msi_msg(irq_data, &entry->msg);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Upon resume, pci_restore_msi_state() -> ... ->  __pci_write_msi_msg()
> + * directly writes the MSI/MSI-X registers via MMIO, but since Hyper-V
> + * doesn't trap and emulate the MMIO accesses, here hv_compose_msi_msg()
> + * must be used to ask Hyper-V to re-create the IOMMU Interrupt Remapping
> + * Table entries.
> + */
> +static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
> +{
> +	pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
> +}
> +
>   static int hv_pci_resume(struct hv_device *hdev)
>   {
>   	struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
> @@ -3381,6 +3423,8 @@ static int hv_pci_resume(struct hv_device *hdev)
>   
>   	prepopulate_bars(hbus);
>   
> +	hv_pci_restore_msi_state(hbus);
> +
>   	hbus->state = hv_pcibus_installed;
>   	return 0;
>   out:
>
diff mbox series

Patch

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 5d074d78da97..b9bff3539918 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1312,11 +1312,25 @@  static void hv_irq_unmask(struct irq_data *data)
 exit_unlock:
 	spin_unlock_irqrestore(&hbus->retarget_msi_interrupt_lock, flags);
 
-	if (res) {
+	/*
+	 * During hibernation, when a CPU is offlined, the kernel tries
+	 * to move the interrupt to the remaining CPUs that haven't
+	 * been offlined yet. In this case, the below hv_do_hypercall()
+	 * always fails since the vmbus channel has been closed:
+	 * refer to cpu_disable_common() -> fixup_irqs() ->
+	 * irq_migrate_all_off_this_cpu() -> migrate_one_irq().
+	 *
+	 * Suppress the error message for hibernation because the failure
+	 * during hibernation does not matter (at this time all the devices
+	 * have been frozen). Note: the correct affinity info is still updated
+	 * into the irqdata data structure in migrate_one_irq() ->
+	 * irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
+	 * resumes, hv_pci_restore_msi_state() is able to correctly restore
+	 * the interrupt with the correct affinity.
+	 */
+	if (res && hbus->state != hv_pcibus_removing)
 		dev_err(&hbus->hdev->device,
 			"%s() failed: %#llx", __func__, res);
-		return;
-	}
 
 	pci_msi_unmask_irq(data);
 }
@@ -3348,6 +3362,34 @@  static int hv_pci_suspend(struct hv_device *hdev)
 	return 0;
 }
 
+static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
+{
+	struct msi_desc *entry;
+	struct irq_data *irq_data;
+
+	for_each_pci_msi_entry(entry, pdev) {
+		irq_data = irq_get_irq_data(entry->irq);
+		if (WARN_ON_ONCE(!irq_data))
+			return -EINVAL;
+
+		hv_compose_msi_msg(irq_data, &entry->msg);
+	}
+
+	return 0;
+}
+
+/*
+ * Upon resume, pci_restore_msi_state() -> ... ->  __pci_write_msi_msg()
+ * directly writes the MSI/MSI-X registers via MMIO, but since Hyper-V
+ * doesn't trap and emulate the MMIO accesses, here hv_compose_msi_msg()
+ * must be used to ask Hyper-V to re-create the IOMMU Interrupt Remapping
+ * Table entries.
+ */
+static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
+{
+	pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
+}
+
 static int hv_pci_resume(struct hv_device *hdev)
 {
 	struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
@@ -3381,6 +3423,8 @@  static int hv_pci_resume(struct hv_device *hdev)
 
 	prepopulate_bars(hbus);
 
+	hv_pci_restore_msi_state(hbus);
+
 	hbus->state = hv_pcibus_installed;
 	return 0;
 out: