| Message ID | 20210429230730.201266-1-drbawb@fatalsyntax.com |
| --- | --- |
| State | New |
| Series | pci: add NVMe FLR quirk to the SM951 SSD |
[+cc Alex, author of ffb0863426eb]

Please make your subject line match ffb0863426eb ("PCI: Disable Samsung
SM961/PM961 NVMe before FLR")

On Thu, Apr 29, 2021 at 06:07:30PM -0500, Robert Straw wrote:
> The SM951/PM951, when used in conjunction with the vfio-pci driver and
> passed to a KVM guest, can exhibit the fatal state addressed by the
> existing `nvme_disable_and_flr` quirk. If the guest cleanly shuts down
> the SSD, and vfio-pci attempts an FLR to the device while it is in this
> state, the nvme driver will fail when it attempts to bind to the device
> after the FLR due to the frozen config area, e.g.:
>
>   nvme nvme2: frozen state error detected, reset controller
>   nvme nvme2: Removing after probe failure status: -12
>
> By including this older model (Samsung 950 PRO) of the controller in the
> existing quirk, the device is able to be cleanly reset after being used
> by a KVM guest.

I don't see anything in the PCIe spec about software being required to
do something special before initiating an FLR, so I assume this is a
hardware defect in the Samsung 950 PRO? Has Samsung published an
erratum or at least acknowledged it?

There's always the possibility that we are doing something wrong in
Linux *after* the FLR, e.g., not waiting long enough, not
reinitializing something correctly, etc.
> Signed-off-by: Robert Straw <drbawb@fatalsyntax.com>
> ---
>  drivers/pci/quirks.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 653660e3b..e339ca238 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3920,6 +3920,7 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
>  		reset_ivb_igd },
>  	{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M2_VGA,
>  		reset_ivb_igd },
> +	{ PCI_VENDOR_ID_SAMSUNG, 0xa802, nvme_disable_and_flr },
>  	{ PCI_VENDOR_ID_SAMSUNG, 0xa804, nvme_disable_and_flr },
>  	{ PCI_VENDOR_ID_INTEL, 0x0953, delay_250ms_after_flr },
>  	{ PCI_VENDOR_ID_CHELSIO, PCI_ANY_ID,
> --
> 2.31.1
On Fri Apr 30, 2021 at 3:51 PM CDT, Bjorn Helgaas wrote:
> Please make your subject line match ffb0863426eb ("PCI: Disable
> Samsung SM961/PM961 NVMe before FLR")

Understood, I will send a revision ASAP.

> There's always the possibility that we are doing something wrong in
> Linux *after* the FLR, e.g., not waiting long enough, not
> reinitializing something correctly, etc.

In my experience I was not able to get my particular drive to enter this
state while issuing various types of resets purely from the Linux host.
The issue only appeared when I passed the device to a KVM guest *and
allowed that guest to cleanly shut down.* That last part is crucial: if
the guest was forcibly powered off, Linux was able to reset the drive
just fine. So I suspect the issue here is related to the interaction
between whatever state the guest leaves the NVMe drive in and the Linux
kernel's own reset code triggering some pathological behavior in the
controller.

FWIW, even a remove/rescan, with an interim suspend to RAM, was not
enough to unfreeze the controller. The only way I've found to get the
device back (apart from this patch) was a full reboot.
On Fri Apr 30, 2021 at 3:51 PM CDT, Bjorn Helgaas wrote:
> Please make your subject line match ffb0863426eb ("PCI: Disable
> Samsung SM961/PM961 NVMe before FLR")

I had done this in a v2 of this patch, but after some additional
research I'm thinking the behavior of this quirk might be in line with
the NVMe specification more generally; I'll elaborate below.

> I don't see anything in the PCIe spec about software being required to
> do something special before initiating an FLR, so I assume this is a
> hardware defect in the Samsung 950 PRO? Has Samsung published an
> erratum or at least acknowledged it?
>
> There's always the possibility that we are doing something wrong in
> Linux *after* the FLR, e.g., not waiting long enough, not
> reinitializing something correctly, etc.

I did some dumping of registers both with and without this patch, and
determined the following to be true in my use case:

1. My guest VM leaves the device in a state where CC.SHN (shutdown
   notification) is set to 01b (normal shutdown).
2. The guest also leaves CC.EN (controller enable) set to 1b.
3. vfio-pci attempts to issue an FLR while the device is in this state.

On page 40, sec 3.1.6 of the NVMe 1.1 spec, the documentation on
CSTS.SHST states the following:

> To start executing commands on the controller after a shutdown
> operation (CSTS.SHST set to 10b), a reset (CC.EN cleared to ‘0’)
> is required. If host software submits commands to the controller
> without issuing a reset, the behavior is undefined.

In the case of the SM951/SM961 it appears the undefined behavior is
that they stop responding to attempts to change their configuration if
you do an FLR while the device is in this state. The reason this patch
resolved the issue I was seeing is that the toggle of the CC.EN flag
puts the drive in a known-good state after the guest's shutdown
notification.

Knowing this, I would suspect we'd actually want to treat most NVMe
drives in this manner *if the kernel sees that SHN/SHST has been set
prior.* Perhaps other NVMe devices are more tolerant of not doing this?

~ robert straw
On Sat, May 15, 2021 at 12:20:05PM -0500, Robert Straw wrote:
> On page 40, sec 3.1.6 of the NVMe 1.1 spec, the documentation on SHST
> states the following:

While it doesn't matter here, NVMe 1.1 is very much out of date, being
a more than 8-year-old specification. The current version is 1.4b, with
NVMe 2.0 about to be released.

> Knowing this I would suspect we'd actually want to treat most NVMe
> drives in this manner *if the kernel sees the SHN/SHST has been set
> prior.* Perhaps other NVMe devices are more tolerant of not doing this?

No, we don't. This is a bug particular to a specific implementation.
In fact the whole existing NVMe shutdown-before-reset quirk is rather
broken and dangerous, as it concurrently accesses the NVMe registers
with the actual driver, which could be trivially triggered through the
sysfs reset attribute. I'd much rather quirk these broken Samsung
drives to not allow assigning them to VFIO.
On Wed May 19, 2021 at 3:44 AM CDT, Christoph Hellwig wrote:
> On Sat, May 15, 2021 at 12:20:05PM -0500, Robert Straw wrote:
> While it doesn't matter here, NVMe 1.1 is very much out of date, being
> a more than 8 year old specification. The current version is 1.4b,
> with NVMe 2.0 about to be released.

I can't comment on 2.0, but yes, 1.4b has the same aside regarding
undefined behavior on the SHST field (on p. 50). The only reason I was
looking at 1.1a is that it's specifically listed on the datasheet for
the SM951 (the device under test).

> No, we don't. This is a bug particular to a specific implementation.
> In fact the whole existing NVMe shutdown before reset quirk is rather
> broken and dangerous, as it concurrently accesses the NVMe registers
> with the actual driver, which could be trivially triggered through the
> sysfs reset attribute.

I'm not exactly clear in what way the nvme driver would be racing
against vfio-pci here: (a) vfio-pci is the driver bound in this
scenario, and (b) the vfio-pci driver triggers this quirk by issuing an
FLR, which is done with the device locked (e.g. vfio_pci.c:499).

In my testing *without this patch* vfio-pci is still bound to the
device for at least 60s after guest shutdown, at which point the FLR
times out. After this FLR the device is useless without a full reboot
of the host. Rebinding it either to another guest with vfio-pci or to
the Linux nvme driver doesn't matter: the device can no longer be
reconfigured.

As I understand it, vfio-pci should not blindly issue an FLR to an
NVMe-class device without obeying the protocol. The protocol seems
clear that after a shutdown, CC.EN must transition from 1 to 0. (I
would argue the guest OS leaving the device in this state is the actual
violation of the spec. As I'm unable to change that behavior, having
vfio-pci clean up the mess with this quirk seems to be an adequate
workaround.)

I am currently testing a version of this patch that only disables the
controller if the device has previously been shut down. I am trying to
gauge whether this would be preferable to just blanket-disabling these
bugged devices before relinquishing control of them back to the host.

> I'd much rather quirk these broken Samsung drivers to not allow
> assigning them to VFIO.

I'd much rather keep using my storage devices. I will leave the quirk
limited to these known-bugged devices.
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 653660e3b..e339ca238 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3920,6 +3920,7 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
 		reset_ivb_igd },
 	{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M2_VGA,
 		reset_ivb_igd },
+	{ PCI_VENDOR_ID_SAMSUNG, 0xa802, nvme_disable_and_flr },
 	{ PCI_VENDOR_ID_SAMSUNG, 0xa804, nvme_disable_and_flr },
 	{ PCI_VENDOR_ID_INTEL, 0x0953, delay_250ms_after_flr },
 	{ PCI_VENDOR_ID_CHELSIO, PCI_ANY_ID,