diff mbox series

pci: add NVMe FLR quirk to the SM951 SSD

Message ID 20210429230730.201266-1-drbawb@fatalsyntax.com
State New
Headers show
Series pci: add NVMe FLR quirk to the SM951 SSD | expand

Commit Message

Robert Straw April 29, 2021, 11:07 p.m. UTC
The SM951/PM951, when used in conjunction with the vfio-pci driver and
passed to a KVM guest, can exhibit the fatal state addressed by the
existing `nvme_disable_and_flr` quirk. If the guest cleanly shuts down
the SSD, and vfio-pci attempts an FLR to the device while it is in this 
state, the nvme driver will fail when it attempts to bind to the device 
after the FLR due to the frozen config area, e.g:

  nvme nvme2: frozen state error detected, reset controller
  nvme nvme2: Removing after probe failure status: -12

By including this older model (Samsung 950 PRO) of the controller in the
existing quirk: the device is able to be cleanly reset after being used
by a KVM guest.

Signed-off-by: Robert Straw <drbawb@fatalsyntax.com>
---
 drivers/pci/quirks.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Bjorn Helgaas April 30, 2021, 8:51 p.m. UTC | #1
[+cc Alex, author of ffb0863426eb]

Please make your subject line match ffb0863426eb ("PCI: Disable
Samsung SM961/PM961 NVMe before FLR")

On Thu, Apr 29, 2021 at 06:07:30PM -0500, Robert Straw wrote:
> The SM951/PM951, when used in conjunction with the vfio-pci driver and
> passed to a KVM guest, can exhibit the fatal state addressed by the
> existing `nvme_disable_and_flr` quirk. If the guest cleanly shuts down
> the SSD, and vfio-pci attempts an FLR to the device while it is in this 
> state, the nvme driver will fail when it attempts to bind to the device 
> after the FLR due to the frozen config area, e.g:
> 
>   nvme nvme2: frozen state error detected, reset controller
>   nvme nvme2: Removing after probe failure status: -12
> 
> By including this older model (Samsung 950 PRO) of the controller in the
> existing quirk: the device is able to be cleanly reset after being used
> by a KVM guest.

I don't see anything in the PCIe spec about software being required to
do something special before initiating an FLR, so I assume this is a
hardware defect in the Samsung 950 PRO?  Has Samsung published an
erratum or at least acknowledged it?

There's always the possibility that we are doing something wrong in
Linux *after* the FLR, e.g., not waiting long enough, not
reinitializing something correctly, etc.

> Signed-off-by: Robert Straw <drbawb@fatalsyntax.com>
> ---
>  drivers/pci/quirks.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 653660e3b..e339ca238 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3920,6 +3920,7 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
>  		reset_ivb_igd },
>  	{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M2_VGA,
>  		reset_ivb_igd },
> +	{ PCI_VENDOR_ID_SAMSUNG, 0xa802, nvme_disable_and_flr },
>  	{ PCI_VENDOR_ID_SAMSUNG, 0xa804, nvme_disable_and_flr },
>  	{ PCI_VENDOR_ID_INTEL, 0x0953, delay_250ms_after_flr },
>  	{ PCI_VENDOR_ID_CHELSIO, PCI_ANY_ID,
> -- 
> 2.31.1
>
Robert Straw April 30, 2021, 10:34 p.m. UTC | #2
On Fri Apr 30, 2021 at 3:51 PM CDT, Bjorn Helgaas wrote:
> Please make your subject line match ffb0863426eb ("PCI: Disable
> Samsung SM961/PM961 NVMe before FLR")

Understood, I will send a revision ASAP.

> There's always the possibility that we are doing something wrong in
> Linux *after* the FLR, e.g., not waiting long enough, not
> reinitializing something correctly, etc.

In my experience I was not able to get my particular drive to enter this
state while issuing various types of resets purely from the Linux host. 
The issue only appeared when I pass the device to a KVM guest *and allow
that guest to cleanly shut-down.* The last part is crucial: if the guest 
is forcibly powered off Linux was able to reset the drive just fine.

So I suspect the issue here is related to the interaction between
whatever state the guest leaves the NVMe drive in, and the Linux kernel's 
own reset code triggering some pathological behavior in the controller.

FWIW even a remove/rescan, with an interim suspend to RAM, was not
enough to unfreeze the controller. The only way I've found to get the
device back (apart from this patch) was a full reboot.
Robert Straw May 15, 2021, 5:20 p.m. UTC | #3
On Fri Apr 30, 2021 at 3:51 PM CDT, Bjorn Helgaas wrote:

> Please make your subject line match ffb0863426eb ("PCI: Disable
> Samsung SM961/PM961 NVMe before FLR")

I had done this in a V2 of this patch, but after some additional
research I'm thinking the behavior of this quirk might be in-line 
w/ the NVMe specification more generally, I'll elaborate more below.

> I don't see anything in the PCIe spec about software being required to
> do something special before initiating an FLR, so I assume this is a
> hardware defect in the Samsung 950 PRO? Has Samsung published an
> erratum or at least acknowledged it?
>
> There's always the possibility that we are doing something wrong in
> Linux *after* the FLR, e.g., not waiting long enough, not
> reinitializing something correctly, etc.

I did some dumping of registers both with and without this patch, and
determined the following to be true in my use-case:

1. My guest VM leaves the device in a state where SHN (shutdown
notification) is set to 0b01 (normal shutdown)

2. The guest also leaves CC.EN (controller enable) set to 0b1

3. vfio-pci attempts to issue an FLR while the device is in this state.


On page 40, sec 3.1.6 of the NVMe 1.1 spec, the documentation on SHST 
states the following:

> To start executing commands on the controller after a shutdown 
> operation (CSTS.SHST set to 10b), a reset (CC.EN cleared to ‘0’) 
> is required. If host software submits commands to the controller 
> without issuing a reset, the behavior is undefined.

In the case of the SM951/SM961 it appears the undefined behavior is that
they stop responding to attempts to change their configuration if you do
an FLR while the device is in this state. The reason this patch
resolved the issue I was seeing is because the toggle of the CC.EN flag 
puts the drive in a known-good state after the guest's shutdown 
notification.

Knowing this I would suspect we'd actually want to treat most NVMe
drives in this manner *if the kernel sees the SHN/SHST has been set
prior.* Perhaps other NVMe devices are more tolerant of not doing this?

~ robert straw
Christoph Hellwig May 19, 2021, 8:44 a.m. UTC | #4
On Sat, May 15, 2021 at 12:20:05PM -0500, Robert Straw wrote:
> On page 40, sec 3.1.6 of the NVMe 1.1 spec, the documentation on SHST 
> states the following:

While it doesn't matter here, NVMe 1.1 is very much out of data, being
a more than 8 year old specification.  The current version is 1.4b,
with NVMe 2.0 about to be released.

> Knowing this I would suspect we'd actually want to treat most NVMe
> drives in this manner *if the kernel sees the SHN/SHST has been set
> prior.* Perhaps other NVMe devices are more tolerant of not doing this?

No, we don't.  This is a bug particular to a specific implementation.
In fact the whole existing NVMe shutdown before reset quirk is rather
broken and dangerous, as it concurrently accesses the NVMe registers
with the actual driver, which could be trivially triggered through the
sysfs reset attribute.

I'd much rather quirk these broken Samsung drivers to not allow
assigning them to VFIO.
Robert Straw May 19, 2021, 12:54 p.m. UTC | #5
On Wed May 19, 2021 at 3:44 AM CDT, Christoph Hellwig wrote:
> On Sat, May 15, 2021 at 12:20:05PM -0500, Robert Straw wrote:
> While it doesn't matter here, NVMe 1.1 is very much out of data, being
> a more than 8 year old specification. The current version is 1.4b,
> with NVMe 2.0 about to be released.

I can't comment on 2.0, but yes 1.4b has the same aside regarding undefined
behavior on the SHST field (on p. 50). The only reason I was looking at
1.1a is because it's specifically listed on the datasheet for the SM951.
(The device under test.)

> No, we don't. This is a bug particular to a specific implementation.
> In fact the whole existing NVMe shutdown before reset quirk is rather
> broken and dangerous, as it concurrently accesses the NVMe registers
> with the actual driver, which could be trivially triggered through the
> sysfs reset attribute.

I'm not exactly clear in what way the nvme driver would  be racing against 
vfio-pci here. (a) vfio-pci is the driver bound in this scenario, and (b)
the vfio-pci driver triggers this quirk by issuing an FLR, which is done 
with the device locked. (e.g: vfio_pci.c:499.)

In my testing *without this patch* vfio-pci is still bound to the device 
for at least 60s after guest shutdown, at which point the FLR times out.
After this FLR the device is useless w/o a full reboot of the host. 
Rebinding it to *either* another guest w/ vfio-pci, or the Linux nvme 
driver doesn't matter: as the device can no longer be reconfigured.

As I understand it: vfio-pci should not blindly issue an FLR to an NVMe class 
device w/o obeying the protocol. The protocol seems clear that after a 
shutdown CC->EN must transition from 1 to 0. (I would argue the guest OS 
leaving the device in this state is the actual violation of the spec. As 
I'm unable to change that behavior: having vfio-pci clean up the mess w/ 
this quirk seems to be an adequate workaround.)

I am currently testing  a version of this patch that only disables the
controller if the device has been previously shutdown. I am trying to
gauge whether this would be preferable to just blanket-disabling these 
bugged devices before relinquishing control of them back to the host.

> I'd much rather quirk these broken Samsung drivers to not allow
> assigning them to VFIO.

I'd much rather keep using my storage devices. I will leave the 
quirk limited to these known-bugged devices.
diff mbox series

Patch

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 653660e3b..e339ca238 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3920,6 +3920,7 @@  static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
 		reset_ivb_igd },
 	{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M2_VGA,
 		reset_ivb_igd },
+	{ PCI_VENDOR_ID_SAMSUNG, 0xa802, nvme_disable_and_flr },
 	{ PCI_VENDOR_ID_SAMSUNG, 0xa804, nvme_disable_and_flr },
 	{ PCI_VENDOR_ID_INTEL, 0x0953, delay_250ms_after_flr },
 	{ PCI_VENDOR_ID_CHELSIO, PCI_ANY_ID,