From patchwork Fri Oct 23 20:44:28 2020
X-Patchwork-Submitter: Marcelo Henrique Cerri
X-Patchwork-Id: 1386992
From: Marcelo Henrique Cerri
To: kernel-team@lists.ubuntu.com
Subject: [{focal, groovy}:linux-azure][PATCH] PCI: hv: Fix hibernation in case interrupts are not re-created
Date: Fri, 23 Oct 2020 17:44:28 -0300
Message-Id: <20201023204428.2558110-1-marcelo.cerri@canonical.com>
X-Mailer: git-send-email 2.25.1
List-Id: Kernel team discussions
From: Dexuan Cui

BugLink: https://bugs.launchpad.net/bugs/1894893

pci_restore_msi_state() directly writes the MSI/MSI-X related registers
via MMIO. On a physical machine this works perfectly; but for a Linux VM
running on a hypervisor, which typically enables IOMMU interrupt
remapping, the hypervisor usually should trap and emulate the MMIO
accesses in order to re-create the necessary interrupt remapping table
entries in the IOMMU, otherwise the interrupts cannot work in the VM
after hibernation.

Hyper-V is different from other hypervisors in that it does not trap
and emulate the MMIO accesses; instead, it uses a para-virtualized
method that requires the VM to call hv_compose_msi_msg() to notify the
hypervisor of the info that would otherwise be passed to the hypervisor
by the trap-and-emulate method.

This is not an issue for most PCI device drivers, because they destroy
and re-create the interrupts across hibernation, so hv_compose_msi_msg()
is called automatically. However, some PCI device drivers (e.g. the
in-tree GPU driver nouveau and the out-of-tree Nvidia proprietary GPU
driver) do not destroy and re-create MSI/MSI-X interrupts across
hibernation, so hv_pci_resume() has to call hv_compose_msi_msg();
otherwise those drivers can no longer receive interrupts after the VM
resumes from hibernation.

Hyper-V is also different in that chip->irq_unmask() may fail in a
Linux VM running on Hyper-V (on a physical machine, chip->irq_unmask()
cannot fail because unmasking an MSI/MSI-X register is just an MMIO
write): during hibernation, when a CPU is offlined, the kernel tries
to move the interrupt to the remaining CPUs that haven't been offlined
yet. In this case, hv_irq_unmask() -> hv_do_hypercall() always fails
because the vmbus channel has been closed: the early "return" in
hv_irq_unmask() means pci_msi_unmask_irq() is not called, i.e.
desc->masked remains "true", so after hibernation the MSI interrupt
always remains masked, which is incorrect. Refer to
cpu_disable_common() -> fixup_irqs() -> irq_migrate_all_off_this_cpu()
-> migrate_one_irq():

	static bool migrate_one_irq(struct irq_desc *desc)
	{
		...
		if (maskchip && chip->irq_mask)
			chip->irq_mask(d);
		...
		err = irq_do_set_affinity(d, affinity, false);
		...
		if (maskchip && chip->irq_unmask)
			chip->irq_unmask(d);

Fix the issue by calling pci_msi_unmask_irq() unconditionally in
hv_irq_unmask(). Also suppress the error message for hibernation,
because the hypercall failure during hibernation does not matter (at
this time all the devices have been frozen). Note: the correct affinity
info is still updated into the irqdata data structure in
migrate_one_irq() -> irq_do_set_affinity() -> hv_set_affinity(), so
later, when the VM resumes, hv_pci_restore_msi_state() is able to
correctly restore the interrupt with the correct affinity.

Link: https://lore.kernel.org/r/20201002085158.9168-1-decui@microsoft.com
Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
Signed-off-by: Dexuan Cui
Signed-off-by: Lorenzo Pieralisi
Reviewed-by: Jake Oshins
(cherry picked from commit 915cff7f38c5e4d47f187f8049245afc2cb3e503)
Signed-off-by: Marcelo Henrique Cerri
Acked-by: Colin Ian King
Acked-by: Stefan Bader
---

Clean cherry-pick from upstream with positive test results.
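For context (illustrative only, not part of the applied patch): most
drivers avoid the problem described above because their hibernation
callbacks tear down their vectors in .freeze and re-request them in
.restore, which re-enters the MSI-X setup path and, on Hyper-V,
indirectly calls hv_compose_msi_msg() again. A minimal sketch of that
pattern, using a hypothetical "mydev" driver with made-up names:

	#include <linux/interrupt.h>
	#include <linux/pci.h>

	/* Hypothetical driver state; only the fields used below. */
	struct mydev {
		struct pci_dev *pdev;
		int nvecs;
	};

	static irqreturn_t mydev_irq(int irq, void *data)
	{
		return IRQ_HANDLED;
	}

	/* .freeze: release the vectors before the hibernation image
	 * is written.
	 */
	static int mydev_freeze(struct device *dev)
	{
		struct mydev *md = dev_get_drvdata(dev);
		int i;

		for (i = 0; i < md->nvecs; i++)
			free_irq(pci_irq_vector(md->pdev, i), md);
		pci_free_irq_vectors(md->pdev);
		return 0;
	}

	/* .restore: re-allocate and re-request the vectors on resume.
	 * Re-entering the MSI-X setup path is what (indirectly)
	 * triggers hv_compose_msi_msg() on Hyper-V, so the interrupt
	 * remapping entries are re-created without any help from
	 * hv_pci_resume().
	 */
	static int mydev_restore(struct device *dev)
	{
		struct mydev *md = dev_get_drvdata(dev);
		int i, ret;

		ret = pci_alloc_irq_vectors(md->pdev, 1, md->nvecs,
					    PCI_IRQ_MSIX);
		if (ret < 0)
			return ret;

		for (i = 0; i < md->nvecs; i++) {
			ret = request_irq(pci_irq_vector(md->pdev, i),
					  mydev_irq, 0, "mydev", md);
			if (ret)
				return ret;
		}
		return 0;
	}

Drivers like nouveau skip this teardown/re-request cycle, keep the same
cached MSI message, and rely on pci_restore_msi_state()'s MMIO writes,
which Hyper-V never sees; that is the gap hv_pci_restore_msi_state()
below closes.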
---
 drivers/pci/controller/pci-hyperv.c | 50 +++++++++++++++++++++++++++--
 1 file changed, 47 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 5d074d78da97..b9bff3539918 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1312,11 +1312,25 @@ static void hv_irq_unmask(struct irq_data *data)
 exit_unlock:
 	spin_unlock_irqrestore(&hbus->retarget_msi_interrupt_lock, flags);
 
-	if (res) {
+	/*
+	 * During hibernation, when a CPU is offlined, the kernel tries
+	 * to move the interrupt to the remaining CPUs that haven't
+	 * been offlined yet. In this case, the below hv_do_hypercall()
+	 * always fails since the vmbus channel has been closed:
+	 * refer to cpu_disable_common() -> fixup_irqs() ->
+	 * irq_migrate_all_off_this_cpu() -> migrate_one_irq().
+	 *
+	 * Suppress the error message for hibernation because the failure
+	 * during hibernation does not matter (at this time all the devices
+	 * have been frozen). Note: the correct affinity info is still updated
+	 * into the irqdata data structure in migrate_one_irq() ->
+	 * irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
+	 * resumes, hv_pci_restore_msi_state() is able to correctly restore
+	 * the interrupt with the correct affinity.
+	 */
+	if (res && hbus->state != hv_pcibus_removing)
 		dev_err(&hbus->hdev->device,
 			"%s() failed: %#llx", __func__, res);
-		return;
-	}
 
 	pci_msi_unmask_irq(data);
 }
@@ -3348,6 +3362,34 @@ static int hv_pci_suspend(struct hv_device *hdev)
 	return 0;
 }
 
+static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
+{
+	struct msi_desc *entry;
+	struct irq_data *irq_data;
+
+	for_each_pci_msi_entry(entry, pdev) {
+		irq_data = irq_get_irq_data(entry->irq);
+		if (WARN_ON_ONCE(!irq_data))
+			return -EINVAL;
+
+		hv_compose_msi_msg(irq_data, &entry->msg);
+	}
+
+	return 0;
+}
+
+/*
+ * Upon resume, pci_restore_msi_state() -> ... -> __pci_write_msi_msg()
+ * directly writes the MSI/MSI-X registers via MMIO, but since Hyper-V
+ * doesn't trap and emulate the MMIO accesses, here hv_compose_msi_msg()
+ * must be used to ask Hyper-V to re-create the IOMMU Interrupt Remapping
+ * Table entries.
+ */
+static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
+{
+	pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
+}
+
 static int hv_pci_resume(struct hv_device *hdev)
 {
 	struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
@@ -3381,6 +3423,8 @@ static int hv_pci_resume(struct hv_device *hdev)
 
 	prepopulate_bars(hbus);
 
+	hv_pci_restore_msi_state(hbus);
+
 	hbus->state = hv_pcibus_installed;
 	return 0;
 out:
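A side note on the shape of the fix (illustrative, not part of the
patch): pci_walk_bus() calls the given callback for every device on the
bus and below, and stops the walk as soon as the callback returns a
non-zero value, which is why hv_pci_restore_msi_msg() returns -EINVAL
when it hits a corrupt entry. A standalone sketch of the same pattern,
with a made-up count_msi_devices() callback:

	#include <linux/pci.h>

	/*
	 * Hypothetical callback: count devices with MSI/MSI-X enabled.
	 * Returning 0 keeps the walk going; any non-zero value stops
	 * it, mirroring how hv_pci_restore_msi_msg() aborts the walk
	 * on a bad entry.
	 */
	static int count_msi_devices(struct pci_dev *pdev, void *data)
	{
		int *count = data;

		if (pdev->msi_enabled || pdev->msix_enabled)
			(*count)++;
		return 0;
	}

	static void report_msi_devices(struct pci_bus *bus)
	{
		int count = 0;

		pci_walk_bus(bus, count_msi_devices, &count);
		pr_info("%d device(s) using MSI/MSI-X\n", count);
	}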