[v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

Message ID 20180918221501.13112-1-mr.nuke.me@gmail.com
State New
Delegated to: Bjorn Helgaas

Commit Message

Alexandru Gagniuc Sept. 18, 2018, 10:15 p.m.
When a PCI device is gone, we don't want to send IO to it if we can
avoid it. We expose functionality via the irq_chip structure. As
users of that structure may not know about the underlying PCI device,
it's our responsibility to guard against removed devices.

.irq_write_msi_msg() is already guarded inside __pci_write_msi_msg().
.irq_mask/unmask() are not. Guard them for completeness.

For example, surprise removal of a PCIe device triggers teardown. This
teardown touches the irq_chip ops at some point to disable the
interrupts. I/O generated here can crash the system on firmware-first
machines. Not triggering the I/O in the first place greatly reduces the
likelihood of the problem occurring.

Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---
 drivers/pci/msi.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Alexandru Gagniuc Nov. 6, 2018, 12:32 a.m. | #1
ping

On 09/18/2018 05:15 PM, Alexandru Gagniuc wrote:
> When a PCI device is gone, we don't want to send IO to it if we can
> avoid it. We expose functionality via the irq_chip structure. As
> users of that structure may not know about the underlying PCI device,
> it's our responsibility to guard against removed devices.
> 
> .irq_write_msi_msg() is already guarded inside __pci_write_msi_msg().
> .irq_mask/unmask() are not. Guard them for completeness.
> 
> For example, surprise removal of a PCIe device triggers teardown. This
> teardown touches the irq_chip ops at some point to disable the
> interrupts. I/O generated here can crash the system on firmware-first
> machines. Not triggering the I/O in the first place greatly reduces the
> likelihood of the problem occurring.
> 
> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> ---
>   drivers/pci/msi.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index f2ef896464b3..f31058fd2260 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data *data, u32 flag)
>   {
>   	struct msi_desc *desc = irq_data_get_msi_desc(data);
>   
> +	if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))
> +		return;
> +
>   	if (desc->msi_attrib.is_msix) {
>   		msix_mask_irq(desc, flag);
>   		readl(desc->mask_base);		/* Flush write to device */
>
Derrick, Jonathan Nov. 7, 2018, 5:04 p.m. | #2
I found the same issue:
https://patchwork.ozlabs.org/patch/989272/

Tested-by: Jon Derrick <jonathan.derrick@intel.com>

On Mon, 2018-11-05 at 18:32 -0600, Alex G. wrote:
> ping
> 
> On 09/18/2018 05:15 PM, Alexandru Gagniuc wrote:
> > When a PCI device is gone, we don't want to send IO to it if we can
> > avoid it. We expose functionality via the irq_chip structure. As
> > users of that structure may not know about the underlying PCI device,
> > it's our responsibility to guard against removed devices.
> > 
> > .irq_write_msi_msg() is already guarded inside __pci_write_msi_msg().
> > .irq_mask/unmask() are not. Guard them for completeness.
> > 
> > For example, surprise removal of a PCIe device triggers teardown. This
> > teardown touches the irq_chip ops at some point to disable the
> > interrupts. I/O generated here can crash the system on firmware-first
> > machines. Not triggering the I/O in the first place greatly reduces the
> > likelihood of the problem occurring.
> > 
> > Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> > ---
> >   drivers/pci/msi.c | 3 +++
> >   1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index f2ef896464b3..f31058fd2260 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data *data, u32 flag)
> >   {
> >   	struct msi_desc *desc = irq_data_get_msi_desc(data);
> >   
> > +	if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))
> > +		return;
> > +
> >   	if (desc->msi_attrib.is_msix) {
> >   		msix_mask_irq(desc, flag);
> >   		readl(desc->mask_base);		/* Flush write to device */
> >
Bjorn Helgaas Nov. 7, 2018, 11:42 p.m. | #3
On Tue, Sep 18, 2018 at 05:15:00PM -0500, Alexandru Gagniuc wrote:
> When a PCI device is gone, we don't want to send IO to it if we can
> avoid it. We expose functionality via the irq_chip structure. As
> users of that structure may not know about the underlying PCI device,
> it's our responsibility to guard against removed devices.
> 
> .irq_write_msi_msg() is already guarded inside __pci_write_msi_msg().
> .irq_mask/unmask() are not. Guard them for completeness.
> 
> For example, surprise removal of a PCIe device triggers teardown. This
> teardown touches the irq_chip ops at some point to disable the
> interrupts. I/O generated here can crash the system on firmware-first
> machines. Not triggering the I/O in the first place greatly reduces the
> likelihood of the problem occurring.
> 
> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>

Applied to pci/misc for v4.21, thanks!

> ---
>  drivers/pci/msi.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index f2ef896464b3..f31058fd2260 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data *data, u32 flag)
>  {
>  	struct msi_desc *desc = irq_data_get_msi_desc(data);
>  
> +	if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))
> +		return;
> +
>  	if (desc->msi_attrib.is_msix) {
>  		msix_mask_irq(desc, flag);
>  		readl(desc->mask_base);		/* Flush write to device */
> -- 
> 2.17.1
>
Bjorn Helgaas Nov. 8, 2018, 8:09 p.m. | #4
[+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about
PCI error recovery in general]

On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote:
> On Tue, Sep 18, 2018 at 05:15:00PM -0500, Alexandru Gagniuc wrote:
> > When a PCI device is gone, we don't want to send IO to it if we can
> > avoid it. We expose functionality via the irq_chip structure. As
> > users of that structure may not know about the underlying PCI device,
> > it's our responsibility to guard against removed devices.
> > 
> > .irq_write_msi_msg() is already guarded inside __pci_write_msi_msg().
> > .irq_mask/unmask() are not. Guard them for completeness.
> > 
> > For example, surprise removal of a PCIe device triggers teardown. This
> > teardown touches the irq_chip ops at some point to disable the
> > interrupts. I/O generated here can crash the system on firmware-first
> > machines. Not triggering the I/O in the first place greatly reduces the
> > likelihood of the problem occurring.
> > 
> > Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> 
> Applied to pci/misc for v4.21, thanks!

I'm having second thoughts about this.  One thing I'm uncomfortable
with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
instead of systematic, in the sense that I don't know how we convince
ourselves that this (and only this) is the correct place to put it.

Another is that the only place we call pci_dev_set_disconnected() is
in pciehp and acpiphp, so the only "disconnected" case we catch is if
hotplug happens to be involved.  Every MMIO read from the device is an
opportunity to learn whether it is reachable (a read from an
unreachable device typically returns ~0 data), but we don't do
anything at all with those.

The config accessors already check pci_dev_is_disconnected(), so this
patch is really aimed at MMIO accesses.  I think it would be more
robust if we added wrappers for readl() and writel() so we could
notice read errors and avoid future reads and writes.

Two compiled but untested patches below for your comments.  You can
mostly ignore the first; it's just boring interface changes.  The
important part is the second, which adds the wrappers.

These would be an alternative to the (admittedly much shorter) patch
here.  The wrappers are independent of MSI and could potentially be
exported from the PCI core for use by drivers.  I think it would be
better for drivers to use something like this instead of calling
pci_device_is_present() or pci_dev_is_disconnected() directly.

> > ---
> >  drivers/pci/msi.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index f2ef896464b3..f31058fd2260 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data *data, u32 flag)
> >  {
> >  	struct msi_desc *desc = irq_data_get_msi_desc(data);
> >  
> > +	if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))
> > +		return;
> > +
> >  	if (desc->msi_attrib.is_msix) {
> >  		msix_mask_irq(desc, flag);
> >  		readl(desc->mask_base);		/* Flush write to device */


commit 150346e09edbcaedc520a6d7dec2b16f3a53afa1
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Thu Nov 8 09:17:51 2018 -0600

    PCI/MSI: Pass pci_dev into IRQ mask interfaces
    
    Add the struct pci_dev pointer to these interfaces:
    
      __pci_msix_desc_mask_irq()
      __pci_msi_desc_mask_irq()
      msix_mask_irq
      msi_mask_irq()
    
    The pci_dev pointer is currently unused, so there's no functional change
    intended with this patch.  A subsequent patch will use the pointer to
    improve error detection.

diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 9f6f392a4461..56bbee2cf761 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -462,9 +462,9 @@ void arch_teardown_msi_irqs(struct pci_dev *pdev)
 		if (!msi->irq)
 			continue;
 		if (msi->msi_attrib.is_msix)
-			__pci_msix_desc_mask_irq(msi, 1);
+			__pci_msix_desc_mask_irq(pdev, msi, 1);
 		else
-			__pci_msi_desc_mask_irq(msi, 1, 1);
+			__pci_msi_desc_mask_irq(pdev, msi, 1, 1);
 		irq_set_msi_desc(msi->irq, NULL);
 		irq_free_desc(msi->irq);
 		msi->msg.address_lo = 0;
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index af24ed50a245..d46ae506e296 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -170,7 +170,8 @@ static inline __attribute_const__ u32 msi_mask(unsigned x)
  * reliably as devices without an INTx disable bit will then generate a
  * level IRQ which will never be cleared.
  */
-u32 __pci_msi_desc_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
+u32 __pci_msi_desc_mask_irq(struct pci_dev *dev, struct msi_desc *desc,
+			    u32 mask, u32 flag)
 {
 	u32 mask_bits = desc->masked;
 
@@ -179,15 +180,15 @@ u32 __pci_msi_desc_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
 
 	mask_bits &= ~mask;
 	mask_bits |= flag;
-	pci_write_config_dword(msi_desc_to_pci_dev(desc), desc->mask_pos,
-			       mask_bits);
+	pci_write_config_dword(dev, desc->mask_pos, mask_bits);
 
 	return mask_bits;
 }
 
-static void msi_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
+static void msi_mask_irq(struct pci_dev *dev, struct msi_desc *desc, u32 mask,
+			 u32 flag)
 {
-	desc->masked = __pci_msi_desc_mask_irq(desc, mask, flag);
+	desc->masked = __pci_msi_desc_mask_irq(dev, desc, mask, flag);
 }
 
 static void __iomem *pci_msix_desc_addr(struct msi_desc *desc)
@@ -203,7 +204,8 @@ static void __iomem *pci_msix_desc_addr(struct msi_desc *desc)
  * file.  This saves a few milliseconds when initialising devices with lots
  * of MSI-X interrupts.
  */
-u32 __pci_msix_desc_mask_irq(struct msi_desc *desc, u32 flag)
+u32 __pci_msix_desc_mask_irq(struct pci_dev *dev, struct msi_desc *desc,
+			     u32 flag)
 {
 	u32 mask_bits = desc->masked;
 
@@ -218,21 +220,22 @@ u32 __pci_msix_desc_mask_irq(struct msi_desc *desc, u32 flag)
 	return mask_bits;
 }
 
-static void msix_mask_irq(struct msi_desc *desc, u32 flag)
+static void msix_mask_irq(struct pci_dev *dev, struct msi_desc *desc, u32 flag)
 {
-	desc->masked = __pci_msix_desc_mask_irq(desc, flag);
+	desc->masked = __pci_msix_desc_mask_irq(dev, desc, flag);
 }
 
 static void msi_set_mask_bit(struct irq_data *data, u32 flag)
 {
 	struct msi_desc *desc = irq_data_get_msi_desc(data);
+	struct pci_dev *dev = msi_desc_to_pci_dev(desc);
 
 	if (desc->msi_attrib.is_msix) {
-		msix_mask_irq(desc, flag);
+		msix_mask_irq(dev, desc, flag);
 		readl(desc->mask_base);		/* Flush write to device */
 	} else {
 		unsigned offset = data->irq - desc->irq;
-		msi_mask_irq(desc, 1 << offset, flag << offset);
+		msi_mask_irq(dev, desc, 1 << offset, flag << offset);
 	}
 }
 
@@ -401,7 +404,7 @@ static void __pci_restore_msi_state(struct pci_dev *dev)
 	arch_restore_msi_irqs(dev);
 
 	pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &control);
-	msi_mask_irq(entry, msi_mask(entry->msi_attrib.multi_cap),
+	msi_mask_irq(dev, entry, msi_mask(entry->msi_attrib.multi_cap),
 		     entry->masked);
 	control &= ~PCI_MSI_FLAGS_QSIZE;
 	control |= (entry->msi_attrib.multiple << 4) | PCI_MSI_FLAGS_ENABLE;
@@ -423,7 +426,7 @@ static void __pci_restore_msix_state(struct pci_dev *dev)
 
 	arch_restore_msi_irqs(dev);
 	for_each_pci_msi_entry(entry, dev)
-		msix_mask_irq(entry, entry->masked);
+		msix_mask_irq(dev, entry, entry->masked);
 
 	pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_MASKALL, 0);
 }
@@ -612,28 +615,28 @@ static int msi_capability_init(struct pci_dev *dev, int nvec,
 
 	/* All MSIs are unmasked by default, Mask them all */
 	mask = msi_mask(entry->msi_attrib.multi_cap);
-	msi_mask_irq(entry, mask, mask);
+	msi_mask_irq(dev, entry, mask, mask);
 
 	list_add_tail(&entry->list, dev_to_msi_list(&dev->dev));
 
 	/* Configure MSI capability structure */
 	ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
 	if (ret) {
-		msi_mask_irq(entry, mask, ~mask);
+		msi_mask_irq(dev, entry, mask, ~mask);
 		free_msi_irqs(dev);
 		return ret;
 	}
 
 	ret = msi_verify_entries(dev);
 	if (ret) {
-		msi_mask_irq(entry, mask, ~mask);
+		msi_mask_irq(dev, entry, mask, ~mask);
 		free_msi_irqs(dev);
 		return ret;
 	}
 
 	ret = populate_msi_sysfs(dev);
 	if (ret) {
-		msi_mask_irq(entry, mask, ~mask);
+		msi_mask_irq(dev, entry, mask, ~mask);
 		free_msi_irqs(dev);
 		return ret;
 	}
@@ -721,7 +724,7 @@ static void msix_program_entries(struct pci_dev *dev,
 			entries[i++].vector = entry->irq;
 		entry->masked = readl(pci_msix_desc_addr(entry) +
 				PCI_MSIX_ENTRY_VECTOR_CTRL);
-		msix_mask_irq(entry, 1);
+		msix_mask_irq(dev, entry, 1);
 	}
 }
 
@@ -895,7 +898,7 @@ static void pci_msi_shutdown(struct pci_dev *dev)
 	/* Return the device with MSI unmasked as initial states */
 	mask = msi_mask(desc->msi_attrib.multi_cap);
 	/* Keep cached state to be restored */
-	__pci_msi_desc_mask_irq(desc, mask, ~mask);
+	__pci_msi_desc_mask_irq(dev, desc, mask, ~mask);
 
 	/* Restore dev->irq to its default pin-assertion irq */
 	dev->irq = desc->msi_attrib.default_irq;
@@ -982,7 +985,7 @@ static void pci_msix_shutdown(struct pci_dev *dev)
 	/* Return the device with MSI-X masked as initial states */
 	for_each_pci_msi_entry(entry, dev) {
 		/* Keep cached states to be restored */
-		__pci_msix_desc_mask_irq(entry, 1);
+		__pci_msix_desc_mask_irq(dev, entry, 1);
 	}
 
 	pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_ENABLE, 0);
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 0e9c50052ff3..1525ec9457a4 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -141,8 +141,9 @@ void free_msi_entry(struct msi_desc *entry);
 void __pci_read_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
 void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
 
-u32 __pci_msix_desc_mask_irq(struct msi_desc *desc, u32 flag);
-u32 __pci_msi_desc_mask_irq(struct msi_desc *desc, u32 mask, u32 flag);
+u32 __pci_msix_desc_mask_irq(struct pci_dev *, struct msi_desc *desc, u32 flag);
+u32 __pci_msi_desc_mask_irq(struct pci_dev *, struct msi_desc *desc, u32 mask,
+			    u32 flag);
 void pci_msi_mask_irq(struct irq_data *data);
 void pci_msi_unmask_irq(struct irq_data *data);
 

commit 9b51617cc910b5facdb4ab0442d6562422c916d7
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Thu Nov 8 08:34:07 2018 -0600

    PCI/MSI: Avoid MMIO accesses to device if it's unreachable
    
    Add wrappers around MMIO accesses (readl/writel) to the device.  If we
    already know the device is unreachable because of hot-removal or a bus
    error, skip the MMIO access.
    
    In addition, if we're doing an MMIO read, check whether it returns ~0 data.
    If a read fails because of a bus error, host bridges typically synthesize
    ~0 data to complete the CPU's read.  If a read returns ~0, do a config read
    to determine whether the device is still reachable.  If the config read is
    successful, the ~0 data is probably valid data from the device.  If the
    config read fails, the device is unreachable, so mark it as disconnected.

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index d46ae506e296..0cf095eef69d 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -31,6 +31,38 @@ int pci_msi_ignore_mask;
 
 #define msix_table_size(flags)	((flags & PCI_MSIX_FLAGS_QSIZE) + 1)
 
+static u32 mmio_readl(struct pci_dev *dev, const volatile void __iomem *addr)
+{
+	u32 val, id;
+
+	if (pci_dev_is_disconnected(dev))
+		return ~0;
+
+	val = readl(addr);
+
+	/*
+	 * If an MMIO read from the device returns ~0 data, that data may
+	 * be valid, or it may indicate a bus error.  If config space is
+	 * readable, assume it's valid data; otherwise, assume a bus error.
+	 */
+	if (val == ~0) {
+		pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
+		if (id == ~0)
+			pci_dev_set_disconnected(dev, NULL);
+	}
+
+	return val;
+}
+
+static void mmio_writel(struct pci_dev *dev, u32 val,
+			volatile void __iomem *addr)
+{
+	if (pci_dev_is_disconnected(dev))
+		return;
+
+	writel(val, addr);
+}
+
 #ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
 static int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 {
@@ -215,7 +247,8 @@ u32 __pci_msix_desc_mask_irq(struct pci_dev *dev, struct msi_desc *desc,
 	mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
 	if (flag)
 		mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
-	writel(mask_bits, pci_msix_desc_addr(desc) + PCI_MSIX_ENTRY_VECTOR_CTRL);
+	mmio_writel(dev, mask_bits,
+		    pci_msix_desc_addr(desc) + PCI_MSIX_ENTRY_VECTOR_CTRL);
 
 	return mask_bits;
 }
@@ -232,7 +265,7 @@ static void msi_set_mask_bit(struct irq_data *data, u32 flag)
 
 	if (desc->msi_attrib.is_msix) {
 		msix_mask_irq(dev, desc, flag);
-		readl(desc->mask_base);		/* Flush write to device */
+		mmio_readl(dev, desc->mask_base); /* Flush write to device */
 	} else {
 		unsigned offset = data->irq - desc->irq;
 		msi_mask_irq(dev, desc, 1 << offset, flag << offset);
@@ -276,9 +309,11 @@ void __pci_read_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
 	if (entry->msi_attrib.is_msix) {
 		void __iomem *base = pci_msix_desc_addr(entry);
 
-		msg->address_lo = readl(base + PCI_MSIX_ENTRY_LOWER_ADDR);
-		msg->address_hi = readl(base + PCI_MSIX_ENTRY_UPPER_ADDR);
-		msg->data = readl(base + PCI_MSIX_ENTRY_DATA);
+		msg->address_lo = mmio_readl(dev,
+					     base + PCI_MSIX_ENTRY_LOWER_ADDR);
+		msg->address_hi = mmio_readl(dev,
+					     base + PCI_MSIX_ENTRY_UPPER_ADDR);
+		msg->data = mmio_readl(dev, base + PCI_MSIX_ENTRY_DATA);
 	} else {
 		int pos = dev->msi_cap;
 		u16 data;
@@ -306,9 +341,11 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
 	} else if (entry->msi_attrib.is_msix) {
 		void __iomem *base = pci_msix_desc_addr(entry);
 
-		writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
-		writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
-		writel(msg->data, base + PCI_MSIX_ENTRY_DATA);
+		mmio_writel(dev, msg->address_lo,
+			    base + PCI_MSIX_ENTRY_LOWER_ADDR);
+		mmio_writel(dev, msg->address_hi,
+			    base + PCI_MSIX_ENTRY_UPPER_ADDR);
+		mmio_writel(dev, msg->data, base + PCI_MSIX_ENTRY_DATA);
 	} else {
 		int pos = dev->msi_cap;
 		u16 msgctl;
@@ -722,8 +759,8 @@ static void msix_program_entries(struct pci_dev *dev,
 	for_each_pci_msi_entry(entry, dev) {
 		if (entries)
 			entries[i++].vector = entry->irq;
-		entry->masked = readl(pci_msix_desc_addr(entry) +
-				PCI_MSIX_ENTRY_VECTOR_CTRL);
+		entry->masked = mmio_readl(dev, pci_msix_desc_addr(entry) +
+					   PCI_MSIX_ENTRY_VECTOR_CTRL);
 		msix_mask_irq(dev, entry, 1);
 	}
 }
Keith Busch Nov. 8, 2018, 9:49 p.m. | #5
On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> 
> I'm having second thoughts about this.  One thing I'm uncomfortable
> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> instead of systematic, in the sense that I don't know how we convince
> ourselves that this (and only this) is the correct place to put it.

You know how the kernel provides ZERO_PAGE; wouldn't it be cool if we
also had a ONES_PAGE and could remap all virtual addresses from a
memory-mapped device to that page on an ungraceful disconnect? I do not
know how to accomplish that, so this might just be crazy talk... But if
it is possible, it would be a pretty nifty way to solve this problem.
Greg KH Nov. 8, 2018, 10:01 p.m. | #6
On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> [+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about
> PCI error recovery in general]
> 
> On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote:
> > On Tue, Sep 18, 2018 at 05:15:00PM -0500, Alexandru Gagniuc wrote:
> > > When a PCI device is gone, we don't want to send IO to it if we can
> > > avoid it. We expose functionality via the irq_chip structure. As
> > > users of that structure may not know about the underlying PCI device,
> > > it's our responsibility to guard against removed devices.
> > > 
> > > .irq_write_msi_msg() is already guarded inside __pci_write_msi_msg().
> > > .irq_mask/unmask() are not. Guard them for completeness.
> > > 
> > > For example, surprise removal of a PCIe device triggers teardown. This
> > > teardown touches the irq_chip ops at some point to disable the
> > > interrupts. I/O generated here can crash the system on firmware-first
> > > machines. Not triggering the I/O in the first place greatly reduces the
> > > likelihood of the problem occurring.
> > > 
> > > Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> > 
> > Applied to pci/misc for v4.21, thanks!
> 
> I'm having second thoughts about this.  One thing I'm uncomfortable
> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> instead of systematic, in the sense that I don't know how we convince
> ourselves that this (and only this) is the correct place to put it.

I think my stance always has been that this call is not good at all
because once you call it you never really know if it is still true as
the device could have been removed right afterward.

So almost any code that relies on it is broken, there is no locking and
it can and will race and you will lose.

I think your patch suffers from this race:

> +static u32 mmio_readl(struct pci_dev *dev, const volatile void __iomem *addr)
> +{
> +	u32 val, id;
> +
> +	if (pci_dev_is_disconnected(dev))
> +		return ~0;

Great, but what happens if I yank the device out right here?

> +	val = readl(addr);

This value could now be all FF, if the device is gone, so what did the
check above help with?

> +	/*
> +	 * If an MMIO read from the device returns ~0 data, that data may
> +	 * be valid, or it may indicate a bus error.  If config space is
> +	 * readable, assume it's valid data; otherwise, assume a bus error.
> +	 */
> +	if (val == ~0) {
> +		pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> +		if (id == ~0)
> +			pci_dev_set_disconnected(dev, NULL);

So why do the check above for "is disconnected"?  What does this buy us
here, just short-circuiting the readl()?

> +	}
> +
> +	return val;
> +}
> +
> +static void mmio_writel(struct pci_dev *dev, u32 val,
> +			volatile void __iomem *addr)
> +{
> +	if (pci_dev_is_disconnected(dev))
> +		return;
> +
> +	writel(val, addr);

Why even check, what's wrong with always doing the write?

I understand the wish to make this easier, but I think the only way is
for the drivers themselves to check their reads.  And they have to
check all reads, or at least some subset of reads, and be able to
handle 0xff for the other ones without going crazy.

I _think_ the xhci driver does this, given that it is hot added/removed
all the time dynamically due to the way modern laptops are made, where
the BIOS adds/removes the xhci controller when a USB device is
added/removed.

thanks,

greg k-h
Alex_Gagniuc@Dellteam.com Nov. 8, 2018, 10:20 p.m. | #7
On 11/08/2018 02:09 PM, Bjorn Helgaas wrote:
> 
> [+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about
> PCI error recovery in general]

Has anyone seen the ECRs in the PCIe base spec and ACPI that have been
floating around the past few months? -- HPX, SFI, CER. Without
divulging too much before publication, I'm curious about opinions on
how well (or not) those flows would work in general, and in Linux.

> On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote:
> I'm having second thoughts about this.  One thing I'm uncomfortable
> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> instead of systematic, in the sense that I don't know how we convince
> ourselves that this (and only this) is the correct place to put it.
>
> Another is that the only place we call pci_dev_set_disconnected() is
> in pciehp and acpiphp, so the only "disconnected" case we catch is if
> hotplug happens to be involved.  Every MMIO read from the device is an
> opportunity to learn whether it is reachable (a read from an
> unreachable device typically returns ~0 data), but we don't do
> anything at all with those.
>
> The config accessors already check pci_dev_is_disconnected(), so this
> patch is really aimed at MMIO accesses.  I think it would be more
> robust if we added wrappers for readl() and writel() so we could
> notice read errors and avoid future reads and writes.

I wouldn't expect anything less than complete scrutiny and quality
control of unquestionable moral integrity :). In theory, ~0 can be a
great indicator that something may be wrong, though I think it's about
as ad hoc as pci_dev_is_disconnected().

I slightly like the idea of wrapping the MMIO accessors. There are
still memcpy and DMA, which cause the same MemRead/Wr PCIe transactions
and the same sort of errors in PCIe land, and it would be good to have
more testing on this. Since this patch is tested and confirmed to fix a
known failure case, I would keep it, and then look at fixing the
problem in a more generic way.

BTW, a lot of the problems we're fixing here come courtesy of
firmware-first error handling. Do we reach a point where we draw a line
in handling new problems introduced by FFS? That is, if something is a
problem with FFS but not with native handling, do we commit to
supporting it?

Alex
Keith Busch Nov. 8, 2018, 10:32 p.m. | #8
On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > I'm having second thoughts about this.  One thing I'm uncomfortable
> > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> > instead of systematic, in the sense that I don't know how we convince
> > ourselves that this (and only this) is the correct place to put it.
> 
> I think my stance always has been that this call is not good at all
> because once you call it you never really know if it is still true as
> the device could have been removed right afterward.
> 
> So almost any code that relies on it is broken, there is no locking and
> it can and will race and you will lose.

AIUI, we're not trying to create code to rely on this. This is more about
reducing reliance on hardware. If the software misses the race once and
accesses disconnected device memory, that's usually not a big deal to
let hardware sort it out, but the point is not to push our luck.

Surprise hot remove is empirically more reliable the less we interact
with hardware and firmware. That shouldn't be necessary, but is just an
unfortunate reality.
Greg KH Nov. 8, 2018, 10:42 p.m. | #9
On Thu, Nov 08, 2018 at 03:32:58PM -0700, Keith Busch wrote:
> On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> > On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > > I'm having second thoughts about this.  One thing I'm uncomfortable
> > > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> > > instead of systematic, in the sense that I don't know how we convince
> > > ourselves that this (and only this) is the correct place to put it.
> > 
> > I think my stance always has been that this call is not good at all
> > because once you call it you never really know if it is still true as
> > the device could have been removed right afterward.
> > 
> > So almost any code that relies on it is broken, there is no locking and
> > it can and will race and you will lose.
> 
> AIUI, we're not trying to create code to rely on this. This is more about
> reducing reliance on hardware. If the software misses the race once and
> accesses disconnected device memory, that's usually not a big deal to
> let hardware sort it out, but the point is not to push our luck.

Then why even care about this call at all?  If you need to really know
if the read worked, you have to check the value.  If the value is FF
then you have a huge hint that the hardware is now gone.  And you can
rely on it being gone; you can never rely on a call to the function
that checks whether the hardware is there still being valid at any
point in time after the call returns.

> Surprise hot remove is empirically more reliable the less we interact
> with hardware and firmware. That shouldn't be necessary, but is just an
> unfortunate reality.

You are not "interacting", you are reading/writing to the hardware, as
you have to.  So I really don't understand what you are talking about
here, sorry.

greg k-h
Alex_Gagniuc@Dellteam.com Nov. 8, 2018, 10:49 p.m. | #10
On 11/08/2018 04:43 PM, Greg Kroah-Hartman wrote:
> 
> On Thu, Nov 08, 2018 at 03:32:58PM -0700, Keith Busch wrote:
>> On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
>>> On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
>>>> I'm having second thoughts about this.  One thing I'm uncomfortable
>>>> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
>>>> instead of systematic, in the sense that I don't know how we convince
>>>> ourselves that this (and only this) is the correct place to put it.
>>>
>>> I think my stance always has been that this call is not good at all
>>> because once you call it you never really know if it is still true as
>>> the device could have been removed right afterward.
>>>
>>> So almost any code that relies on it is broken, there is no locking and
>>> it can and will race and you will lose.
>>
>> AIUI, we're not trying to create code to rely on this. This is more about
>> reducing reliance on hardware. If the software misses the race once and
>> accesses disconnected device memory, that's usually not a big deal to
>> let hardware sort it out, but the point is not to push our luck.
> 
> Then why even care about this call at all?  If you need to really know
> if the read worked, you have to check the value.  If the value is FF
> then you have a huge hint that the hardware is now gone.  And you can
> rely on it being gone; you can never rely on a call to the function
> that checks whether the hardware is there still being valid at any
> point in time after the call returns.

In the case that we're trying to fix, this code executing is a result of 
the device being gone, so we can guarantee race-free operation. I agree 
that there is a race, in the general case. As far as checking the result 
for all F's, that's not an option when firmware crashes the system as a 
result of the mmio read/write. It's never pretty when firmware gets 
involved.

Alex
Greg KH Nov. 8, 2018, 10:51 p.m. | #11
On Thu, Nov 08, 2018 at 10:49:08PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> On 11/08/2018 04:43 PM, Greg Kroah-Hartman wrote:
> > 
> > On Thu, Nov 08, 2018 at 03:32:58PM -0700, Keith Busch wrote:
> >> On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> >>> On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> >>>> I'm having second thoughts about this.  One thing I'm uncomfortable
> >>>> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> >>>> instead of systematic, in the sense that I don't know how we convince
> >>>> ourselves that this (and only this) is the correct place to put it.
> >>>
> >>> I think my stance always has been that this call is not good at all
> >>> because once you call it you never really know if it is still true as
> >>> the device could have been removed right afterward.
> >>>
> >>> So almost any code that relies on it is broken, there is no locking and
> >>> it can and will race and you will lose.
> >>
> >> AIUI, we're not trying to create code to rely on this. This is more about
> >> reducing reliance on hardware. If the software misses the race once and
> >> accesses disconnected device memory, that's usually not a big deal to
> >> let hardware sort it out, but the point is not to push our luck.
> > 
> > Then why even care about this call at all?  If you need to really know
> > if the read worked, you have to check the value.  If the value is FF
> > then you have a huge hint that the hardware is now gone.  And you can
> > rely on it being gone, you can never rely on making the call to the
> > function to check if the hardware is there to be still valid any point
> > in time after the call returns.
> 
> In the case that we're trying to fix, this code executing is a result of 
> the device being gone, so we can guarantee race-free operation. I agree 
> that there is a race, in the general case. As far as checking the result 
> for all F's, that's not an option when firmware crashes the system as a 
> result of the mmio read/write. It's never pretty when firmware gets 
> involved.

If you have firmware that crashes the system when you try to read from a
PCI device that was hot-removed, that is broken firmware and needs to be
fixed.  The kernel can not work around that as again, you will never win
that race.

thanks,

greg k-h
Keith Busch Nov. 8, 2018, 11:03 p.m. | #12
On Thu, Nov 08, 2018 at 02:42:55PM -0800, Greg Kroah-Hartman wrote:
> On Thu, Nov 08, 2018 at 03:32:58PM -0700, Keith Busch wrote:
> > On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> > > On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > > > I'm having second thoughts about this.  One thing I'm uncomfortable
> > > > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> > > > instead of systematic, in the sense that I don't know how we convince
> > > > ourselves that this (and only this) is the correct place to put it.
> > > 
> > > I think my stance always has been that this call is not good at all
> > > because once you call it you never really know if it is still true as
> > > the device could have been removed right afterward.
> > > 
> > > So almost any code that relies on it is broken, there is no locking and
> > > it can and will race and you will lose.
> > 
> > AIUI, we're not trying to create code to rely on this. This is more about
> > reducing reliance on hardware. If the software misses the race once and
> > accesses disconnected device memory, that's usually not a big deal to
> > let hardware sort it out, but the point is not to push our luck.
> 
> Then why even care about this call at all?  If you need to really know
> if the read worked, you have to check the value.  If the value is FF
> then you have a huge hint that the hardware is now gone.  And you can
> rely on it being gone, you can never rely on making the call to the
> function to check if the hardware is there to be still valid any point
> in time after the call returns.
> 
> > Surprise hot remove is empirically more reliable the less we interact
> > with hardware and firmware. That shouldn't be necessary, but is just an
> > unfortunate reality.
> 
> You are not "interacting", you are reading/writing to the hardware, as
> you have to do.  So I really don't understand what you are talking
> about here, sorry.

We're reading hardware memory, yes, but the hardware isn't there.
Something obviously needs to return FF, so we are indirectly interacting
with whatever mechanism handles that. Sometimes that mechanism doesn't
handle it gracefully and instead of having FF to consider, you have a
machine check rebooting your system.
Alex_Gagniuc@Dellteam.com Nov. 8, 2018, 11:06 p.m. | #13
On 11/08/2018 04:51 PM, Greg KH wrote:
> On Thu, Nov 08, 2018 at 10:49:08PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> In the case that we're trying to fix, this code executing is a result of
>> the device being gone, so we can guarantee race-free operation. I agree
>> that there is a race, in the general case. As far as checking the result
>> for all F's, that's not an option when firmware crashes the system as a
>> result of the mmio read/write. It's never pretty when firmware gets
>> involved.
> 
> If you have firmware that crashes the system when you try to read from a
> PCI device that was hot-removed, that is broken firmware and needs to be
> fixed.  The kernel can not work around that as again, you will never win
> that race.

But it's not the firmware that crashes. It's Linux as a result of a 
fatal error message from the firmware. And we can't fix that because FFS 
handling requires that the system reboots [1].

If we're going to say that we don't want to support FFS because it's a 
separate code path and a different flow, that's fine. I am, myself, not a 
fan of FFS. But if we're going to continue supporting it, I think we'll 
continue to have to resolve these sorts of unintended consequences.

Alex

[1] ACPI 6.2, 18.1 - Hardware Errors and Error Sources
Lukas Wunner Nov. 9, 2018, 7:11 a.m. | #14
On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> +	/*
> +	 * If an MMIO read from the device returns ~0 data, that data may
> +	 * be valid, or it may indicate a bus error.  If config space is
> +	 * readable, assume it's valid data; otherwise, assume a bus error.
> +	 */
> +	if (val == ~0) {
> +		pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> +		if (id == ~0)
> +			pci_dev_set_disconnected(dev, NULL);
> +	}

This isn't safe unfortunately because "all ones" may occur for other
reasons besides disconnectedness.  E.g. on an Uncorrectable Error,
the device may likewise respond with all ones, but revert to valid
responses if the error can be recovered through a Secondary Bus Reset.
In such a case, marking the device disconnected would be inappropriate.

Accessing a device in D3cold would be another example where all ones
is returned both from mmio and config space despite the device still
being present and future accesses having a chance to succeed.

In fact, in v2 of Keith's patches adding pci_dev_set_disconnected()
he attempted the same as what you're doing here and caused issues
for me with devices in D3cold:

https://spinics.net/lists/linux-pci/msg54337.html


> One thing I'm uncomfortable with is that [...].  Another is that the
> only place we call pci_dev_set_disconnected() is in pciehp and acpiphp,
> so the only "disconnected" case we catch is if hotplug happens to be
> involved.

Yes, that's because the hotplug drivers are the only ones who can
identify removal authoritatively and unambiguously.  They *know*
when the device is gone and don't have to resort to heuristics
such as all ones.  (ISTR that dpc also marks devices disconnected.)


> sprinkling pci_dev_is_disconnected() around feels ad hoc
> instead of systematic, in the sense that I don't know how we convince
> ourselves that this (and only this) is the correct place to put it.

We need to add documentation for driver authors how to deal with
surprise removal.  Briefly:

* If (pdev->error_state == pci_channel_io_perm_failure), the device
  is definitely gone and any further device access can be skipped.
  Otherwise presence of the device is likely, but not guaranteed.

* If a device access can significantly delay device removal due to
  Completion Timeouts, or can cause an infinite loop, MCE or crash,
  do check pdev->error_state before carrying out the device access.

* Always be prepared that a device access may fail due to surprise
  removal, do not blindly trust mmio or config space reads or
  assume success of writes.

I'm sure this can be extended quite a bit.  There's more information
in this LWN article in the "Surprise removal" section:

https://lwn.net/Articles/767885/

Thanks,

Lukas
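[Editorial note: Lukas's first and third guidelines might look roughly like the sketch below. This is a userspace illustration only; `struct pci_dev`, `pci_channel_io_perm_failure`, and `guarded_readl()` here are mock stand-ins for the kernel types and helpers being discussed, not the real definitions.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mock stand-ins for the kernel types discussed above. */
enum pci_channel_state {
	pci_channel_io_normal = 1,
	pci_channel_io_frozen = 2,
	pci_channel_io_perm_failure = 3,
};

struct pci_dev {
	enum pci_channel_state error_state;
	uint32_t *mmio_base;		/* stands in for a mapped BAR */
};

/*
 * Guideline 1: if the device is known to be permanently gone, skip
 * the access entirely.  Guideline 3: even when the access happens,
 * treat an all-ones result as suspect rather than trusting it.
 */
bool guarded_readl(struct pci_dev *pdev, unsigned int off, uint32_t *val)
{
	if (pdev->error_state == pci_channel_io_perm_failure)
		return false;		/* definitely gone, don't touch it */

	*val = pdev->mmio_base[off / 4];

	/* All ones may be valid data or a bus error; caller must decide. */
	return *val != UINT32_MAX;
}
```

Note that, per the thread, the `error_state` check is only a hint that avoids accesses known to be doomed; the device can still vanish immediately after the check, which is why the result must be validated as well.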
Lukas Wunner Nov. 9, 2018, 7:29 a.m. | #15
On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > I'm having second thoughts about this.  One thing I'm uncomfortable
> > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> 
> I think my stance always has been that this call is not good at all
> because once you call it you never really know if it is still true as
> the device could have been removed right afterward.
> 
> So almost any code that relies on it is broken, there is no locking and
> it can and will race and you will lose.

Hm, to be honest if that's your impression I think you must have missed a
large portion of the discussion we've been having over the past 2 years.

Please consider reading this LWN article, particularly the "Surprise
removal" section, to get up to speed:

https://lwn.net/Articles/767885/

You seem to be assuming that all we care about is the *return value* of
an mmio read.  However a transaction to a surprise removed device has
side effects beyond returning all ones, such as a Completion Timeout
which, with thousands of transactions in flight, added up to many seconds
to handle removal of an NVMe array and occasionally caused MCEs.

It is not an option to just blindly carry out device accesses even though
it is known the device is gone, Completion Timeouts be damned.

However there is more to it than just Completion Timeouts, this is all
detailed in the LWN article.

Thanks,

Lukas
Greg KH Nov. 9, 2018, 11:32 a.m. | #16
On Fri, Nov 09, 2018 at 08:29:53AM +0100, Lukas Wunner wrote:
> On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> > On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > > I'm having second thoughts about this.  One thing I'm uncomfortable
> > > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> > 
> > I think my stance always has been that this call is not good at all
> > because once you call it you never really know if it is still true as
> > the device could have been removed right afterward.
> > 
> > So almost any code that relies on it is broken, there is no locking and
> > it can and will race and you will lose.
> 
> Hm, to be honest if that's your impression I think you must have missed a
> large portion of the discussion we've been having over the past 2 years.
> 
> Please consider reading this LWN article, particularly the "Surprise
> removal" section, to get up to speed:
> 
> https://lwn.net/Articles/767885/
> 
> You seem to be assuming that all we care about is the *return value* of
> an mmio read.  However a transaction to a surprise removed device has
> side effects beyond returning all ones, such as a Completion Timeout
> which, with thousands of transactions in flight, added up to many seconds
> to handle removal of an NVMe array and occasionally caused MCEs.

Again, I still claim this is broken hardware/firmware :)

> It is not an option to just blindly carry out device accesses even though
> it is known the device is gone, Completion Timeouts be damned.

I don't disagree with you at all, and your other email is great with
summarizing the issues here.

What I do object to is somehow relying on that function call as knowing
that the device really is present or not.  It's a good hint, yes, but
driver authors still have to be able to handle the bad data coming back
from when the call races with the device being removed.

> However there is more to it than just Completion Timeouts, this is all
> detailed in the LWN article.

And that's a great article and your work here is much appreciated.  I
think we are in violent agreement :)

thanks,

greg k-h
Keith Busch Nov. 9, 2018, 4:36 p.m. | #17
On Fri, Nov 09, 2018 at 03:32:57AM -0800, Greg Kroah-Hartman wrote:
> On Fri, Nov 09, 2018 at 08:29:53AM +0100, Lukas Wunner wrote:
> > On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> > > On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > > > I'm having second thoughts about this.  One thing I'm uncomfortable
> > > > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> > > 
> > > I think my stance always has been that this call is not good at all
> > > because once you call it you never really know if it is still true as
> > > the device could have been removed right afterward.
> > > 
> > > So almost any code that relies on it is broken, there is no locking and
> > > it can and will race and you will lose.
> > 
> > Hm, to be honest if that's your impression I think you must have missed a
> > large portion of the discussion we've been having over the past 2 years.
> > 
> > Please consider reading this LWN article, particularly the "Surprise
> > removal" section, to get up to speed:
> > 
> > https://lwn.net/Articles/767885/
> > 
> > You seem to be assuming that all we care about is the *return value* of
> > an mmio read.  However a transaction to a surprise removed device has
> > side effects beyond returning all ones, such as a Completion Timeout
> > which, with thousands of transactions in flight, added up to many seconds
> > to handle removal of an NVMe array and occasionally caused MCEs.
> 
> Again, I still claim this is broken hardware/firmware :)

Indeed it is, but I don't want to abandon people with hardware in hand
if we can make it work despite being broken. Perfection is the enemy of
good. :)
 
> > It is not an option to just blindly carry out device accesses even though
> > it is known the device is gone, Completion Timeouts be damned.
> 
> I don't disagree with you at all, and your other email is great with
> summarizing the issues here.
> 
> What I do object to is somehow relying on that function call as knowing
> that the device really is present or not.  It's a good hint, yes, but
> driver authors still have to be able to handle the bad data coming back
> from when the call races with the device being removed.

The function has always been a private interface. It is not available
for drivers to rely on.

The only thing we're trying to accomplish is not starting a transaction
if software knows it will not succeed. There are certainly times when
a transaction will fail that software does not foresee, but we're not
suggesting the intent is to handle those either.
Oliver Nov. 12, 2018, 5:48 a.m. | #18
On Fri, 2018-11-09 at 08:11 +0100, Lukas Wunner wrote:
> On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > +	/*
> > +	 * If an MMIO read from the device returns ~0 data, that data may
> > +	 * be valid, or it may indicate a bus error.  If config space is
> > +	 * readable, assume it's valid data; otherwise, assume a bus error.
> > +	 */
> > +	if (val == ~0) {
> > +		pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> > +		if (id == ~0)
> > +			pci_dev_set_disconnected(dev, NULL);
> > +	}
> 
> This isn't safe unfortunately because "all ones" may occur for other
> reasons besides disconnectedness.  E.g. on an Uncorrectable Error,
> the device may likewise respond with all ones, but revert to valid
> responses if the error can be recovered through a Secondary Bus Reset.
> In such a case, marking the device disconnected would be inappropriate.

I don't really see why we're trying to make a distinction between
recoverable errors and disconnected devices at this stage. In either
case we should assume the device is broken and shouldn't be accessed
until we perform some kind of recovery action.

Bjorn's MMIO wrappers are more-or-less an opt-in software emulation of
the freeze-MMIO-on-error behaviour that the EEH mechanism provides on 
IBM hardware, so I think it makes sense. It also has the nice side
effect of giving driver writers an error injection mechanism so they
might actually test how their drivers deal with errors.

> Accessing a device in D3cold would be another example where all ones
> is returned both from mmio and config space despite the device still
> being present and future accesses having a chance to succeed.

Is doing an MMIO access to a device in D3cold (or hot) ever a valid
thing to do?

> In fact, in v2 of Keith's patches adding pci_dev_set_disconnected()
> he attempted the same as what you're doing here and caused issues
> for me with devices in D3cold:
> 
> https://spinics.net/lists/linux-pci/msg54337.html
> 
> 
> > One thing I'm uncomfortable with is that [...].  Another is that the
> > only place we call pci_dev_set_disconnected() is in pciehp and acpiphp,
> > so the only "disconnected" case we catch is if hotplug happens to be
> > involved.
> 
> Yes, that's because the hotplug drivers are the only ones who can
> identify removal authoritatively and unambiguously.  They *know*
> when the device is gone and don't have to resort to heuristics
> such as all ones.  (ISTR that dpc also marks devices disconnected.)

The heuristics shouldn't be used to work out when the device is gone;
rather, they should be used to work out when we need to check on the
device.

One of the grosser bits of EEH support is a hook in readl() and friends
that checks for a 0xFF response. If it finds one, it looks at the EEH
state and will start the recovery process if the device is marked as
frozen.

(don't look at the code. you won't like what you find)
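[Editorial note: a rough userspace sketch of the pattern Oliver describes; all names here are invented for illustration, not the actual powerpc EEH implementation.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented stand-ins for the EEH state tracked per device. */
enum eeh_state { EEH_OK, EEH_FROZEN };

struct eeh_dev {
	enum eeh_state state;
	bool recovery_started;
	uint32_t *mmio;			/* stands in for a mapped BAR */
};

void eeh_start_recovery(struct eeh_dev *edev)
{
	edev->recovery_started = true;	/* placeholder for real recovery */
}

/*
 * The readl() hook: an all-ones response is only a trigger to check
 * the EEH state; recovery starts only if the device is marked frozen.
 */
uint32_t eeh_readl(struct eeh_dev *edev, unsigned int off)
{
	uint32_t val = edev->mmio[off / 4];

	if (val == UINT32_MAX && edev->state == EEH_FROZEN)
		eeh_start_recovery(edev);

	return val;
}
```

This matches the point above: 0xFFFFFFFF by itself never marks the device bad; it only prompts a check of authoritative state.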

> > sprinkling pci_dev_is_disconnected() around feels ad hoc
> > instead of systematic, in the sense that I don't know how we convince
> > ourselves that this (and only this) is the correct place to put it.
> 
> We need to add documentation for driver authors how to deal with
> surprise removal.  Briefly:
> 
> * If (pdev->error_state == pci_channel_io_perm_failure), the device
>   is definitely gone and any further device access can be skipped.
>   Otherwise presence of the device is likely, but not guaranteed.
>
> * If a device access can significantly delay device removal due to
>   Completion Timeouts, or can cause an infinite loop, MCE or crash,
>   do check pdev->error_state before carrying out the device access.
> 
> * Always be prepared that a device access may fail due to surprise
>   removal, do not blindly trust mmio or config space reads or
>   assume success of writes.

Completely agree. We really need better documentation of what drivers
should be doing.

Oliver
Oliver Nov. 12, 2018, 5:49 a.m. | #19
On Thu, 2018-11-08 at 23:06 +0000, Alex_Gagniuc@Dellteam.com wrote:
> On 11/08/2018 04:51 PM, Greg KH wrote:
> > On Thu, Nov 08, 2018 at 10:49:08PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> > > In the case that we're trying to fix, this code executing is a result of
> > > the device being gone, so we can guarantee race-free operation. I agree
> > > that there is a race, in the general case. As far as checking the result
> > > for all F's, that's not an option when firmware crashes the system as a
> > > result of the mmio read/write. It's never pretty when firmware gets
> > > involved.
> > 
> > If you have firmware that crashes the system when you try to read from a
> > PCI device that was hot-removed, that is broken firmware and needs to be
> > fixed.  The kernel can not work around that as again, you will never win
> > that race.
> 
> But it's not the firmware that crashes. It's Linux as a result of a 
> fatal error message from the firmware. And we can't fix that because FFS 
> handling requires that the system reboots [1].

Do we know the exact circumstances that result in firmware requesting a
reboot? If it happens on any PCIe error, I don't see what we can do to
prevent that beyond masking UEs entirely (are we even allowed to do
that on FFS systems?).

> If we're going to say that we don't want to support FFS because it's a 
> separate code path and a different flow, that's fine. I am, myself, not a 
> fan of FFS. But if we're going to continue supporting it, I think we'll 
> continue to have to resolve these sorts of unintended consequences.
> 
> Alex
> 
> [1] ACPI 6.2, 18.1 - Hardware Errors and Error Sources
Alex_Gagniuc@Dellteam.com Nov. 12, 2018, 8:05 p.m. | #20
On 11/11/2018 11:50 PM, Oliver O'Halloran wrote:
> 
> On Thu, 2018-11-08 at 23:06 +0000, Alex_Gagniuc@Dellteam.com wrote:
>> On 11/08/2018 04:51 PM, Greg KH wrote:
>>> On Thu, Nov 08, 2018 at 10:49:08PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>>>> In the case that we're trying to fix, this code executing is a result of
>>>> the device being gone, so we can guarantee race-free operation. I agree
>>>> that there is a race, in the general case. As far as checking the result
>>>> for all F's, that's not an option when firmware crashes the system as a
>>>> result of the mmio read/write. It's never pretty when firmware gets
>>>> involved.
>>>
>>> If you have firmware that crashes the system when you try to read from a
>>> PCI device that was hot-removed, that is broken firmware and needs to be
>>> fixed.  The kernel can not work around that as again, you will never win
>>> that race.
>>
>> But it's not the firmware that crashes. It's Linux as a result of a
>> fatal error message from the firmware. And we can't fix that because FFS
>> handling requires that the system reboots [1].
> 
> Do we know the exact circumstances that result in firmware requesting a
> reboot? If it happens on any PCIe error, I don't see what we can do to
> prevent that beyond masking UEs entirely (are we even allowed to do
> that on FFS systems?).

Pull a drive out at an angle, push two drives in at the same time, pull 
out a drive really slowly. Whether an error is even reported to the OS 
depends on PD state, and on proprietary mechanisms and logic in the HW 
and FW. The OS is not supposed to mask errors (touch AER bits) on FFS.

Sadly, with FFS, behavior can and does change from BIOS version to BIOS 
version. On one product, for example, we eliminated a lot of crashes by 
simply not reporting some classes of PCIe errors to the OS.

Alex

>> If we're going to say that we don't want to support FFS because it's a
>> separate code path and a different flow, that's fine. I am, myself, not a
>> fan of FFS. But if we're going to continue supporting it, I think we'll
>> continue to have to resolve these sorts of unintended consequences.
>>
>> Alex
>>
>> [1] ACPI 6.2, 18.1 - Hardware Errors and Error Sources
> 
>
Bjorn Helgaas Nov. 13, 2018, 5:02 a.m. | #21
[+cc Jon, for related VMD firmware-first error enable issue]

On Mon, Nov 12, 2018 at 08:05:41PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> On 11/11/2018 11:50 PM, Oliver O'Halloran wrote:
> > On Thu, 2018-11-08 at 23:06 +0000, Alex_Gagniuc@Dellteam.com wrote:

> >> But it's not the firmware that crashes. It's Linux as a result of a
> >> fatal error message from the firmware. And we can't fix that because FFS
> >> handling requires that the system reboots [1].
> > 
> > Do we know the exact circumstances that result in firmware requesting a
> > reboot? If it happens on any PCIe error, I don't see what we can do to
> > prevent that beyond masking UEs entirely (are we even allowed to do
> > that on FFS systems?).
> 
> Pull a drive out at an angle, push two drives in at the same time, pull 
> out a drive really slowly. Whether an error is even reported to the OS 
> depends on PD state, and on proprietary mechanisms and logic in the HW 
> and FW. The OS is not supposed to mask errors (touch AER bits) on FFS.

PD?

Do you think Linux observes the rule about not touching AER bits on
FFS?  I'm not sure it does.  I'm not even sure what section of the
spec is relevant.

The whole issue of firmware-first, the mechanism by which firmware
gets control, the System Error enables in Root Port Root Control
registers, etc., is very murky to me.  Jon has a sort of similar issue
with VMD where he needs to leave System Errors enabled instead of
disabling them as we currently do.

Bjorn

[1] https://lore.kernel.org/linux-pci/20181029210651.GB13681@bhelgaas-glaptop.roam.corp.google.com
Alex_Gagniuc@Dellteam.com Nov. 13, 2018, 10:39 p.m. | #22
On 11/12/2018 11:02 PM, Bjorn Helgaas wrote:
> 
> [+cc Jon, for related VMD firmware-first error enable issue]
> 
> On Mon, Nov 12, 2018 at 08:05:41PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> On 11/11/2018 11:50 PM, Oliver O'Halloran wrote:
>>> On Thu, 2018-11-08 at 23:06 +0000, Alex_Gagniuc@Dellteam.com wrote:
> 
>>>> But it's not the firmware that crashes. It's Linux as a result of a
>>>> fatal error message from the firmware. And we can't fix that because FFS
>>>> handling requires that the system reboots [1].
>>>
> >>> Do we know the exact circumstances that result in firmware requesting a
> >>> reboot? If it happens on any PCIe error, I don't see what we can do to
>>> prevent that beyond masking UEs entirely (are we even allowed to do
>>> that on FFS systems?).
>>
>> Pull a drive out at an angle, push two drives in at the same time, pull
>> out a drive really slowly. Whether an error is even reported to the OS
>> depends on PD state, and on proprietary mechanisms and logic in the HW
>> and FW. The OS is not supposed to mask errors (touch AER bits) on FFS.
> 
> PD?

Presence Detect

> Do you think Linux observes the rule about not touching AER bits on
> FFS?  I'm not sure it does.  I'm not even sure what section of the
> spec is relevant.

I haven't found any place where Linux breaks this rule. I'm very 
confident that, unless otherwise instructed, we follow this rule.

> The whole issue of firmware-first, the mechanism by which firmware
> gets control, the System Error enables in Root Port Root Control
> registers, etc., is very murky to me.  Jon has a sort of similar issue
> with VMD where he needs to leave System Errors enabled instead of
> disabling them as we currently do.

Well, the OS gets control via the _OSC method, and based on that it should 
touch/not touch the AER bits. The bits that get set/cleared come from 
the _HPX method, and there's more about FFS described in the ACPI spec. It 
seems that if the platform wants to enable VMD, it should pass the correct 
bits via _HPX. I'm curious to know in what new twisted way FFS doesn't 
work as intended.

Alex

> Bjorn
> 
> [1] https://lore.kernel.org/linux-pci/20181029210651.GB13681@bhelgaas-glaptop.roam.corp.google.com
>
Keith Busch Nov. 13, 2018, 10:52 p.m. | #23
On Tue, Nov 13, 2018 at 10:39:15PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> On 11/12/2018 11:02 PM, Bjorn Helgaas wrote:
> > The whole issue of firmware-first, the mechanism by which firmware
> > gets control, the System Error enables in Root Port Root Control
> > registers, etc., is very murky to me.  Jon has a sort of similar issue
> > with VMD where he needs to leave System Errors enabled instead of
> > disabling them as we currently do.
> 
> Well, the OS gets control via the _OSC method, and based on that it should 
> touch/not touch the AER bits. The bits that get set/cleared come from 
> the _HPX method, and there's more about FFS described in the ACPI spec. It 
> seems that if the platform wants to enable VMD, it should pass the correct 
> bits via _HPX. I'm curious to know in what new twisted way FFS doesn't 
> work as intended.

When VMD is enabled, the platform sees a VMD endpoint. It doesn't see
any of the root ports on that domain, so ACPI can't provide policies for
them nor AER registers for the platform to consider controlling.
Alex_Gagniuc@Dellteam.com Nov. 14, 2018, 12:31 a.m. | #24
On 11/13/2018 04:56 PM, Keith Busch wrote:
> On Tue, Nov 13, 2018 at 10:39:15PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> On 11/12/2018 11:02 PM, Bjorn Helgaas wrote:
>>> The whole issue of firmware-first, the mechanism by which firmware
>>> gets control, the System Error enables in Root Port Root Control
>>> registers, etc., is very murky to me.  Jon has a sort of similar issue
>>> with VMD where he needs to leave System Errors enabled instead of
>>> disabling them as we currently do.
>>
>> Well, the OS gets control via the _OSC method, and based on that it should
>> touch/not touch the AER bits. The bits that get set/cleared come from
>> the _HPX method, and there's more about FFS described in the ACPI spec. It
>> seems that if the platform wants to enable VMD, it should pass the correct
>> bits via _HPX. I'm curious to know in what new twisted way FFS doesn't
>> work as intended.
> 
> When VMD is enabled, the platform sees a VMD endpoint. It doesn't see
> any of the root ports on that domain, so ACPI can't provide policies for
> them nor AER registers for the platform to consider controlling.

I'm not understanding the interdependency between RP AER settings and 
VMD. My understanding of VMD is quite rudimentary though, so I'll take 
your word for it.

Alex
Bjorn Helgaas Nov. 14, 2018, 5:59 a.m. | #25
On Tue, Nov 13, 2018 at 10:39:15PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> On 11/12/2018 11:02 PM, Bjorn Helgaas wrote:
> > 
> > [EXTERNAL EMAIL]
> > Please report any suspicious attachments, links, or requests for sensitive information.

It looks like Dell's email system adds the above in such a way that the
email quoting convention suggests that *I* wrote it, when I did not.

> ...
> > Do you think Linux observes the rule about not touching AER bits on
> > FFS?  I'm not sure it does.  I'm not even sure what section of the
> > spec is relevant.
> 
> I haven't found any place where Linux breaks this rule. I'm very 
> confident that, unless otherwise instructed, we follow this rule.

Just to make sure we're on the same page, can you point me to this
rule?  I do see that OSPM must request control of AER using _OSC
before it touches the AER registers.  What I don't see is the
connection between firmware-first and the AER registers.

The closest I can find is the "Enabled" field in the HEST PCIe
AER structures (ACPI v6.2, sec 18.3.2.4, .5, .6), where it says:

  If the field value is 1, indicates this error source is
  to be enabled.

  If the field value is 0, indicates that the error source
  is not to be enabled.

  If FIRMWARE_FIRST is set in the flags field, the Enabled
  field is ignored by the OSPM.

AFAICT, Linux completely ignores the Enabled field in these
structures.

These structures also contain values the OS is apparently supposed to
write to Device Control and several AER registers (in struct
acpi_hest_aer_common).  Linux ignores these as well.

These seem like fairly serious omissions in Linux.

> > The whole issue of firmware-first, the mechanism by which firmware
> > gets control, the System Error enables in Root Port Root Control
> > registers, etc., is very murky to me.  Jon has a sort of similar issue
> > with VMD where he needs to leave System Errors enabled instead of
> > disabling them as we currently do.
> 
> Well, OS gets control via _OSC method, and based on that it should 
> touch/not touch the AER bits. 

I agree so far.

> The bits that get set/cleared come from _HPX method,

_HPX tells us about some AER registers, Device Control, Link Control,
and some bridge registers.  It doesn't say anything about the Root
Control register that Jon is concerned with.

For firmware-first to work, firmware has to get control.  How does it
get control?  How does OSPM know to either set up that mechanism or
keep its mitts off something firmware set up before handoff?  In Jon's
VMD case, I think firmware-first relies on the System Error controlled
by the Root Control register.  Linux thinks it owns that, and I don't
know how to learn otherwise.

> and there's more about FFS described in the ACPI spec. It 
> seems that if a platform wants to enable VMD, it should pass the correct 
> bits via _HPX. I'm curious to know in what new twisted way FFS doesn't 
> work as intended.

Bjorn
Alex_Gagniuc@Dellteam.com Nov. 14, 2018, 7:22 p.m. | #26
On 11/14/2018 12:00 AM, Bjorn Helgaas wrote:
> On Tue, Nov 13, 2018 at 10:39:15PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> On 11/12/2018 11:02 PM, Bjorn Helgaas wrote:
>>>
>>> [EXTERNAL EMAIL]
>>> Please report any suspicious attachments, links, or requests for sensitive information.
> 
> It looks like Dell's email system adds the above in such a way that the
> email quoting convention suggests that *I* wrote it, when I did not.

I was wondering why you thought I was suspicious. It's a recent 
(server-side) change. You used to be able to disable these sorts of 
notices. I'm told back in the day people were asked to delete emails 
before reading them.

>> ...
>>> Do you think Linux observes the rule about not touching AER bits on
>>> FFS?  I'm not sure it does.  I'm not even sure what section of the
>>> spec is relevant.
>>
>> I haven't found any place where linux breaks this rule. I'm very
>> confident that, unless otherwise instructed, we follow this rule.
> 
> Just to make sure we're on the same page, can you point me to this
> rule?  I do see that OSPM must request control of AER using _OSC
> before it touches the AER registers.  What I don't see is the
> connection between firmware-first and the AER registers.

ACPI 6.2 - 6.2.11.3, Table 6-197:

PCI Express Advanced Error Reporting control:
  * The firmware sets this bit to 1 to grant control over PCI Express 
Advanced Error Reporting. If firmware allows the OS control of this 
feature, then in the context of the _OSC method it must ensure that 
error messages are routed to device interrupts as described in the PCI 
Express Base Specification[...]

Now I'm confused too:
  * HEST -> __aer_firmware_first
	This is used for touching/not touching AER bits
  * _OSC -> bridge->native_aer
	Used to enable/not enable AER portdrv service
Maybe Keith knows better why we're doing it this way. From ACPI text, it 
doesn't seem that control of AER would be tied to HEST entries, although 
in practice, it is.

> The closest I can find is the "Enabled" field in the HEST PCIe
> AER structures (ACPI v6.2, sec 18.3.2.4, .5, .6), where it says:
> 
>    If the field value is 1, indicates this error source is
>    to be enabled.
> 
>    If the field value is 0, indicates that the error source
>    is not to be enabled.
> 
>    If FIRMWARE_FIRST is set in the flags field, the Enabled
>    field is ignored by the OSPM.
> 
> AFAICT, Linux completely ignores the Enabled field in these
> structures.

I don't think ignoring the field is a problem:
  * With FFS, OS should ignore it.
  * Without FFS, we have control, and we get to make the decisions anyway.
In the latter case we decide whether to use AER, independent of the crap 
in ACPI. I'm not even sure why "Enabled" matters in native AER handling. 
Probably one of the check-boxes in "Binary table designer's handbook"?

> These structures also contain values the OS is apparently supposed to
> write to Device Control and several AER registers (in struct
> acpi_hest_aer_common).  Linux ignores these as well.
> 
> These seem like fairly serious omissions in Linux.

I think HPX carries the same sort of information (except for Root 
Command reg). FW is supposed to program those registers anyway, so even 
if OS doesn't touch them, I'd expect things to just work.

>>> The whole issue of firmware-first, the mechanism by which firmware
>>> gets control, the System Error enables in Root Port Root Control
>>> registers, etc., is very murky to me.  Jon has a sort of similar issue
>>> with VMD where he needs to leave System Errors enabled instead of
>>> disabling them as we currently do.
>>
>> Well, OS gets control via _OSC method, and based on that it should
>> touch/not touch the AER bits.
> 
> I agree so far.
> 
>> The bits that get set/cleared come from _HPX method,
> 
> _HPX tells us about some AER registers, Device Control, Link Control,
> and some bridge registers.  It doesn't say anything about the Root
> Control register that Jon is concerned with.

_HPX type 3 (yay!!!) got approved recently, and that will have more 
fine-grained control. It will be able to handle root control reg.

> For firmware-first to work, firmware has to get control.  How does it
> get control?  How does OSPM know to either set up that mechanism or
> keep its mitts off something firmware set up before handoff?

My understanding is that, if FW keeps control of AER in _OSC, then it 
will have set things up to get notified instead of the OS. OSPM not 
touching AER bits is to make sure it doesn't mess up FW's setup. I think 
there are some proprietary bits in the root port to route interrupts to 
SMIs instead of the AER vectors.

> In Jon's
> VMD case, I think firmware-first relies on the System Error controlled
> by the Root Control register.  Linux thinks it owns that, and I don't
> know how to learn otherwise.

Didn't Keith say the root port is not visible to the OS?

Alex
Derrick, Jonathan Nov. 14, 2018, 7:41 p.m. | #27
On Wed, 2018-11-14 at 19:22 +0000, Alex_Gagniuc@Dellteam.com wrote:
[snip]
> > > > The whole issue of firmware-first, the mechanism by which firmware
> > > > gets control, the System Error enables in Root Port Root Control
> > > > registers, etc., is very murky to me.  Jon has a sort of similar
> > > > issue with VMD where he needs to leave System Errors enabled
> > > > instead of disabling them as we currently do.
> > > 
> > > Well, OS gets control via _OSC method, and based on that it should
> > > touch/not touch the AER bits.
> > 
> > I agree so far.
> > 
> > > The bits that get set/cleared come from _HPX method,
> > 
> > _HPX tells us about some AER registers, Device Control, Link Control,
> > and some bridge registers.  It doesn't say anything about the Root
> > Control register that Jon is concerned with.
> 
> _HPX type 3 (yay!!!) got approved recently, and that will have more 
> fine-grained control. It will be able to handle root control reg.
> 
> > For firmware-first to work, firmware has to get control.  How does it
> > get control?  How does OSPM know to either set up that mechanism or
> > keep its mitts off something firmware set up before handoff?
> 
> My understanding is that, if FW keeps control of AER in _OSC, then it 
> will have set things up to get notified instead of the OS. OSPM not 
> touching AER bits is to make sure it doesn't mess up FW's setup. I 
> think there are some proprietary bits in the root port to route 
> interrupts to SMIs instead of the AER vectors.
> 
> > In Jon's VMD case, I think firmware-first relies on the System Error
> > controlled by the Root Control register.  Linux thinks it owns that,
> > and I don't know how to learn otherwise.
> 
> Didn't Keith say the root port is not visible to the OS?
> 
> Alex

That's correct. OS visibility wrt ACPI is limited to the VMD
endpoint/host bridge device which exposes the root ports. The root
ports aren't described by ACPI. VMD is the unusual case.

In the VMD case, we might or might not need to pass back control to AER
for further error handling post-FFS. I can see that's normally done by
GHES but will probably need some shimming to support the VMD case. I
can't rely on AER, because if any other device uses APEI, the AER module
won't be initialized (aer_service_init::aer_acpi_firmware_first).
Keith Busch Nov. 14, 2018, 8:23 p.m. | #28
On Wed, Nov 14, 2018 at 07:22:04PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> On 11/14/2018 12:00 AM, Bjorn Helgaas wrote:
> > Just to make sure we're on the same page, can you point me to this
> > rule?  I do see that OSPM must request control of AER using _OSC
> > before it touches the AER registers.  What I don't see is the
> > connection between firmware-first and the AER registers.
> 
> ACPI 6.2 - 6.2.11.3, Table 6-197:
> 
> PCI Express Advanced Error Reporting control:
>   * The firmware sets this bit to 1 to grant control over PCI Express 
> Advanced Error Reporting. If firmware allows the OS control of this 
> feature, then in the context of the _OSC method it must ensure that 
> error messages are routed to device interrupts as described in the PCI 
> Express Base Specification[...]
> 
> Now I'm confused too:
>   * HEST -> __aer_firmware_first
> 	This is used for touching/not touching AER bits
>   * _OSC -> bridge->native_aer
> 	Used to enable/not enable AER portdrv service
> Maybe Keith knows better why we're doing it this way. From ACPI text, it 
> doesn't seem that control of AER would be tied to HEST entries, although 
> in practice, it is.

I'm not sure; that predates me.  HEST does have a FIRMWARE_FIRST flag,
but the spec doesn't say anything more about its relation to _OSC control
or the AER capability.  Nothing in the PCIe spec either.

I also don't know why Linux disables the AER driver if only one
device has a FIRMWARE_FIRST HEST. Shouldn't that just be a per-device
decision?

> > The closest I can find is the "Enabled" field in the HEST PCIe
> > AER structures (ACPI v6.2, sec 18.3.2.4, .5, .6), where it says:
> > 
> >    If the field value is 1, indicates this error source is
> >    to be enabled.
> > 
> >    If the field value is 0, indicates that the error source
> >    is not to be enabled.
> > 
> >    If FIRMWARE_FIRST is set in the flags field, the Enabled
> >    field is ignored by the OSPM.
> > 
> > AFAICT, Linux completely ignores the Enabled field in these
> > structures.
> 
> I don't think ignoring the field is a problem:
>   * With FFS, OS should ignore it.
>   * Without FFS, we have control, and we get to make the decisions anyway.
> In the latter case we decide whether to use AER, independent of the crap 
> in ACPI. I'm not even sure why "Enabled" matters in native AER handling. 
> Probably one of the check-boxes in "Binary table designer's handbook"?

And why doesn't Linux do anything with _OSC response other than logging
it? If OS control wasn't granted, shouldn't that take priority over HEST?
Alex_Gagniuc@Dellteam.com Nov. 14, 2018, 8:52 p.m. | #29
On 11/14/2018 02:27 PM, Keith Busch wrote:
> On Wed, Nov 14, 2018 at 07:22:04PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> On 11/14/2018 12:00 AM, Bjorn Helgaas wrote:
>>> Just to make sure we're on the same page, can you point me to this
>>> rule?  I do see that OSPM must request control of AER using _OSC
>>> before it touches the AER registers.  What I don't see is the
>>> connection between firmware-first and the AER registers.
>>
>> ACPI 6.2 - 6.2.11.3, Table 6-197:
>>[...]
>> Maybe Keith knows better why we're doing it this way. From ACPI text, it
>> doesn't seem that control of AER would be tied to HEST entries, although
>> in practice, it is.
> 
> I'm not sure, that predates me.  HEST does have a FIRMWARE_FIRST flag, but
> spec does not say anymore on relation to _OSC control or AER capability.
> Nothing in PCIe spec either.

Speaking to one of the PCIe (and _HPX type 3) spec authors, ownership of 
AER should be determined by _OSC, period. The result of _OSC applies to 
every device under the root port. This dance we do with checking HEST is 
crap.

If I'm not stepping on anyone's toes, and there are no known unintended 
consequences, I can look at patching this up. I'm not promising a patch, 
but it's exactly the sort of thing I like to fix.

> I also don't know why Linux disables the AER driver if only one
> device has a FIRMWARE_FIRST HEST. Shouldn't that just be a per-device
> decision?

I think the logic is that if one HEST entry has both the FFS and GLOBAL 
flags set, then AER services are disabled for all devices. It works 
better in practice than it does in theory. I think _OSC should be the 
determining factor here, not HEST.

>>> The closest I can find is the "Enabled" field in the HEST PCIe
>>> AER structures (ACPI v6.2, sec 18.3.2.4, .5, .6), where it says:
>>> [...]
>>> AFAICT, Linux completely ignores the Enabled field in these
>>> structures.
>>
>> I don't think ignoring the field is a problem:
>>    * With FFS, OS should ignore it.
>>    * Without FFS, we have control, and we get to make the decisions anyway.
>> In the latter case we decide whether to use AER, independent of the crap
>> in ACPI. I'm not even sure why "Enabled" matters in native AER handling.
>> Probably one of the check-boxes in "Binary table designer's handbook"?
> 
> And why doesn't Linux do anything with _OSC response other than logging
> it? If OS control wasn't granted, shouldn't that take priority over HEST?

But it does in portdrv_core.c:

	if (dev->aer_cap && pci_aer_available() &&
	    (pcie_ports_native || host->native_aer)) {
		services |= PCIE_PORT_SERVICE_AER;

That flag later creates a pcie device that allows aerdrv to attach to.

Alex
Keith Busch Nov. 14, 2018, 8:58 p.m. | #30
On Wed, Nov 14, 2018 at 08:52:10PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> But it does in portdrv_core.c:
> 
> 	if (dev->aer_cap && pci_aer_available() &&
> 	    (pcie_ports_native || host->native_aer)) {
> 		services |= PCIE_PORT_SERVICE_AER;
> 
> That flag later creates a pcie device that allows aerdrv to attach to.

Oh, right. I saw negotiate_os_control() just uses a stack variable for
the _OSC response, but if I had looked one level deeper, I'd see it
cached in a different structure.
Bjorn Helgaas Nov. 15, 2018, 6:24 a.m. | #31
On Wed, Nov 14, 2018 at 07:22:04PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> On 11/14/2018 12:00 AM, Bjorn Helgaas wrote:
> > On Tue, Nov 13, 2018 at 10:39:15PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> >> On 11/12/2018 11:02 PM, Bjorn Helgaas wrote:
> >> ...
> >>> Do you think Linux observes the rule about not touching AER bits on
> >>> FFS?  I'm not sure it does.  I'm not even sure what section of the
> >>> spec is relevant.
> >>
> >> I haven't found any place where linux breaks this rule. I'm very
> >> confident that, unless otherwise instructed, we follow this rule.
> > 
> > Just to make sure we're on the same page, can you point me to this
> > rule?  I do see that OSPM must request control of AER using _OSC
> > before it touches the AER registers.  What I don't see is the
> > connection between firmware-first and the AER registers.
> 
> ACPI 6.2 - 6.2.11.3, Table 6-197:
> 
> PCI Express Advanced Error Reporting control:
>   * The firmware sets this bit to 1 to grant control over PCI Express 
> Advanced Error Reporting. If firmware allows the OS control of this 
> feature, then in the context of the _OSC method it must ensure that 
> error messages are routed to device interrupts as described in the PCI 
> Express Base Specification[...]

The PCIe Base Spec is pretty big, so I wish this reference were a
little more explicit.  I *guess* maybe it's referring to PCIe r4.0,
figure 6-3 in sec 6.2.6, where PCIe ERR_* Messages can be routed to
"INTx or MSI Error Interrupts" and/or "platform-specific System Error"
interrupts.

"Device interrupts" seems like it refers to the "INTx or MSI"
interrupts, not the platform-specific System Errors, so I would read
that as saying "if firmware grants OS control of AER via _OSC,
firmware must set the AER Reporting Enables in the AER Root Error
Command register."  But that seems a little silly because the OS now
*owns* the AER capability and it can set the AER Root Error Command
register itself if it wants to.

And I still don't see the connection here with Firmware-First.  I'm
pretty sure firmware could not be notified via INTx or MSI interrupts
because those are totally managed by OSPM.

> > The closest I can find is the "Enabled" field in the HEST PCIe
> > AER structures (ACPI v6.2, sec 18.3.2.4, .5, .6), where it says:
> > 
> >    If the field value is 1, indicates this error source is
> >    to be enabled.
> > 
> >    If the field value is 0, indicates that the error source
> >    is not to be enabled.
> > 
> >    If FIRMWARE_FIRST is set in the flags field, the Enabled
> >    field is ignored by the OSPM.
> > 
> > AFAICT, Linux completely ignores the Enabled field in these
> > structures.
> 
> I don't think ignoring the field is a problem:
>   * With FFS, OS should ignore it.
>   * Without FFS, we have control, and we get to make the decisions anyway.
> In the latter case we decide whether to use AER, independent of the crap 
> in ACPI. I'm not even sure why "Enabled" matters in native AER handling. 

It seems like these HEST structures are "here's how firmware thinks
you should set up AER on this device".  But I agree, I have no idea
how to interpret "Enabled".  The rest of the HEST fields cover all the
useful AER registers, including the Reporting Enables in the AER Root
Error Command register *and* the Error Reporting Enables in the Device
Control register.  So I don't know what the "Enabled" field adds to
all that.  What a mess.

> > For firmware-first to work, firmware has to get control.  How does
> > it get control?  How does OSPM know to either set up that
> > mechanism or keep its mitts off something firmware set up before
> > handoff?
> 
> My understanding is that, if FW keeps control of AER in _OSC, then
> it will have set things up to get notified instead of the OS. OSPM
> not touching AER bits is to make sure it doesn't mess up FW's setup.
> I think there are some proprietary bits in the root port to route
> interrupts to SMIs instead of the AER vectors.

It makes good sense that if OSPM doesn't have AER control, firmware
does all AER handling, including any setup for firmware-first
notification.  If we can assume that firmware-first notification is
done in some way the OS doesn't know about and can't mess up, that
would be awesome.

But I think the VMD model really has nothing to do with the APEI
firmware-first model.  With VMD, it sounds like OSPM owns the AER
capability and doesn't know firmware exists *except* that it has to be
careful not to step on firmware's interrupt.  So maybe we can handle it
separately.

Bjorn
Alex_Gagniuc@Dellteam.com Nov. 16, 2018, 12:19 a.m. | #32
+ Borislav (ACPI guy, maybe he can answer more about HEST)

On 11/15/2018 12:24 AM, Bjorn Helgaas wrote:
> On Wed, Nov 14, 2018 at 07:22:04PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> On 11/14/2018 12:00 AM, Bjorn Helgaas wrote:
>>> On Tue, Nov 13, 2018 at 10:39:15PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> ACPI 6.2 - 6.2.11.3, Table 6-197:
>>
>> PCI Express Advanced Error Reporting control:
>>    * The firmware sets this bit to 1 to grant control over PCI Express
>> Advanced Error Reporting. If firmware allows the OS control of this
>> feature, then in the context of the _OSC method it must ensure that
>> error messages are routed to device interrupts as described in the PCI
>> Express Base Specification[...]
> 
> The PCIe Base Spec is pretty big, so I wish this reference were a
> little more explicit.  I *guess* maybe it's referring to PCIe r4.0,
> figure 6-3 in sec 6.2.6, where PCIe ERR_* Messages can be routed to
> "INTx or MSI Error Interrupts" and/or "platform-specific System Error"
> interrupts.

It's engineering slang for "I don't know what the spec says, but do what 
the spec says." What it means is that FW should set up HW in such a way 
that a PCIe-compliant OS which enables interrupts will get interrupts the 
way it expects. For example, some FW might set the root port to trigger 
an SMI instead of firing the proper MSI vector. In this case FW must 
ensure that AER interrupts fire the MSI vector instead of triggering an SMI.

> "Device interrupts" seems like it refers to the "INTx or MSI"
> interrupts, not the platform-specific System Errors, so I would read
> that as saying "if firmware grants OS control of AER via _OSC,
> firmware must set the AER Reporting Enables in the AER Root Error
> Command register."  But that seems a little silly because the OS now
> *owns* the AER capability and it can set the AER Root Error Command
> register itself if it wants to.

When the OS takes control of AER, it is responsible for setting things up, 
at least as far as standard PCIe registers are concerned. So setting up 
the root command register would still be the OS's responsibility. 
Proprietary registers, like those routing to SMI/SCI/MSI, would have to 
be set up by FW.


> And I still don't see the connection here with Firmware-First.  I'm
> pretty sure firmware could not be notified via INTx or MSI interrupts
> because those are totally managed by OSPM.

The connection with FFS is that FFS needs to be notified by some other 
method. FW can't commandeer interrupt vectors willy-nilly. FW gets 
notified by platform-specific mechanisms. In my case, it happens to be 
SMM, but it could be a number of other proprietary mechanisms.


>>> The closest I can find is the "Enabled" field in the HEST PCIe
>>> AER structures (ACPI v6.2, sec 18.3.2.4, .5, .6), where it says:
>>>[...] 
>>> AFAICT, Linux completely ignores the Enabled field in these
>>> structures.
>>
>> I don't think ignoring the field is a problem:
>>    * With FFS, OS should ignore it.
>>    * Without FFS, we have control, and we get to make the decisions anyway.
>> In the latter case we decide whether to use AER, independent of the crap
>> in ACPI. I'm not even sure why "Enabled" matters in native AER handling.
> 
> It seems like these HEST structures are "here's how firmware thinks
> you should set up AER on this device".  But I agree, I have no idea
> how to interpret "Enabled".  The rest of the HEST fields cover all the
> useful AER registers, including the Reporting Enables in the AER Root
> Error Command register *and* the Error Reporting Enables in the Device
> Control register.  So I don't know what the "Enabled" field adds to
> all that.  

"This table contains information platform firmware supplies to OSPM for 
configuring AER support on a given PCI Express device."

I don't fully get the point of the HEST structures for PCIe. I'm hoping 
Borislav can shed some light on this. I think we can ignore them in the 
PCI core, with a note to revisit them if something ever goes horribly 
wrong. I've posted some changes to reduce reliance on HEST [1].

> What a mess.

It's ACPI. What else did you expect?

>>> For firmware-first to work, firmware has to get control.  How does
>>> it get control?  How does OSPM know to either set up that
>>> mechanism or keep its mitts off something firmware set up before
>>> handoff?
>>
>> My understanding is that, if FW keeps control of AER in _OSC, then
>> it will have set things up to get notified instead of the OS. OSPM
>> not touching AER bits is to make sure it doesn't mess up FW's setup.
>> I think there are some proprietary bits in the root port to route
>> interrupts to SMIs instead of the AER vectors.
> 
> It makes good sense that if OSPM doesn't have AER control, firmware
> does all AER handling, including any setup for firmware-first
> notification.  If we can assume that firmware-first notification is
> done in some way the OS doesn't know about and can't mess up, that
> would be awesome.

I think that's a reasonable assumption as long as OS respects AER ownership.

> But I think the VMD model really has nothing to do with the APEI
> firmware-first model.  With VMD, it sounds like OSPM owns the AER
> capability and doesn't know firmware exists *except* that it has to be
> careful not to step on firmware's interrupt.  So maybe we can handle it
> separately.

<sarcasm_alert>
Sounds like VMD is qualified to become part of ACPI spec.
</sarcasm_alert>

I'll have to look into VMD in more detail, but this sounds like a design 
flaw. It's possible to use the values in HEST to set up AER, but that 
might conflict with how the OS wants to set up AER.

Alex


[1] https://lkml.org/lkml/2018/11/16/202

Patch

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index f2ef896464b3..f31058fd2260 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -227,6 +227,9 @@  static void msi_set_mask_bit(struct irq_data *data, u32 flag)
 {
 	struct msi_desc *desc = irq_data_get_msi_desc(data);
 
+	if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))
+		return;
+
 	if (desc->msi_attrib.is_msix) {
 		msix_mask_irq(desc, flag);
 		readl(desc->mask_base);		/* Flush write to device */