[v3] PCI: Data corruption happening due to race condition

Message ID 1530608741-30664-2-git-send-email-hari.vyas@broadcom.com
State Accepted
Delegated to: Bjorn Helgaas

Commit Message

Hari Vyas July 3, 2018, 9:05 a.m.
When a PCI device is detected, the is_added flag is set to 1 in the
pci_dev structure and the proc and sysfs entries are created.

When a PCI device is removed, is_added is first checked for 1, then the
device is detached and its proc and sysfs entries are cleared, and
finally is_added is set to 0.

is_added and is_busmaster are bit fields in the pci_dev structure that
share the same memory location.

A strange issue was observed when repeatedly removing and rescanning a
PCIe NVMe device using sysfs commands: is_added was observed as zero
instead of one while removing the device, so the proc and sysfs entries
were not cleared.  This causes a failure on later device addition, with
the warning message "proc_dir_entry" already registered.

Debugging revealed a race condition between the PCIe core driver
setting the is_added bit (in pci_bus_add_device()) and the NVMe driver
reset work-queue setting the is_busmaster bit (via pci_set_master()).
Because the two fields are not updated atomically, the is_busmaster
update can clear the is_added bit.

The fix moves the device-added state out of the is_added bit field and
into the separate private flags variable, and uses atomic functions to
set and retrieve it.  Because the state now lives in a different memory
location, the race condition is avoided.

Signed-off-by: Hari Vyas <hari.vyas@broadcom.com>
---
 arch/powerpc/kernel/pci-common.c          |  4 +++-
 arch/powerpc/platforms/powernv/pci-ioda.c |  3 ++-
 arch/powerpc/platforms/pseries/setup.c    |  3 ++-
 drivers/pci/bus.c                         |  6 +++---
 drivers/pci/hotplug/acpiphp_glue.c        |  2 +-
 drivers/pci/pci.h                         | 11 +++++++++++
 drivers/pci/probe.c                       |  4 ++--
 drivers/pci/remove.c                      |  5 +++--
 include/linux/pci.h                       |  1 -
 9 files changed, 27 insertions(+), 12 deletions(-)
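
The lost update described in the commit message can be replayed deterministically in plain C. The following is only a userspace sketch, not kernel code: the bit masks and function names are hypothetical, the shared word stands in for the storage unit that holds the pci_dev bit fields, and C11 atomics stand in for the kernel's assign_bit().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical masks standing in for the two bit fields in one word. */
#define IS_ADDED     (1u << 0)
#define IS_BUSMASTER (1u << 1)
/* Hypothetical mask mirroring PCI_DEV_ADDED (bit 1) in priv_flags. */
#define PRIV_ADDED   (1ul << 1)

/*
 * Replay the losing interleaving by hand: both writers load the shared
 * word before either one stores its result.
 */
unsigned int racy_update(void)
{
	unsigned int word = 0;

	unsigned int a = word;	/* writer A (device add path) loads   */
	unsigned int b = word;	/* writer B (bus-master path) loads   */

	a |= IS_ADDED;
	word = a;		/* A stores: is_added is now set      */

	b |= IS_BUSMASTER;
	word = b;		/* B stores a stale copy: is_added lost */

	return word;
}

/*
 * Same interleaving after the fix: the added state lives in a separate
 * word and is updated atomically, so B's stale store cannot touch it.
 */
bool fixed_update(void)
{
	unsigned int bitfields = 0;	/* still holds is_busmaster */
	atomic_ulong priv_flags = 0;	/* separate memory location */

	unsigned int b = bitfields;		  /* writer B loads  */
	atomic_fetch_or(&priv_flags, PRIV_ADDED); /* writer A, other word */
	b |= IS_BUSMASTER;
	bitfields = b;				  /* writer B stores */

	return (atomic_load(&priv_flags) & PRIV_ADDED) &&
	       (bitfields & IS_BUSMASTER);
}
```

Calling racy_update() returns a word with only the bus-master bit set, while fixed_update() reports both states intact. The replay is single-threaded on purpose, so it shows the interleaving rather than depending on real scheduling.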

Comments

Lukas Wunner July 3, 2018, 9:13 a.m. | #1
On Tue, Jul 03, 2018 at 02:35:41PM +0530, Hari Vyas wrote:
> When a pci device is detected, a variable is_added is set to
> 1 in pci device structure and proc, sys entries are created.
> 
> When a pci device is removed, first is_added is checked for one
> and then device is detached with clearing of proc and sys
> entries and at end, is_added is set to 0.
> 
> is_added and is_busmaster are bit fields in pci_dev structure
> sharing same memory location.
> 
> A strange issue was observed with multiple times removal and
> rescan of a pcie nvme device using sysfs commands where is_added
> flag was observed as zero instead of one while removing device
> and proc,sys entries are not cleared.  This causes issue in
> later device addition with warning message "proc_dir_entry"
> already registered.
> 
> Debugging revealed a race condition between pcie core driver
> enabling is_added bit(pci_bus_add_device()) and nvme driver
> reset work-queue enabling is_busmaster bit (by pci_set_master()).
> As both fields are not handled in atomic manner and that clears
> is_added bit.
> 
> Fix moves device addition is_added bit to separate private flag
> variable and use different atomic functions to set and retrieve
> device addition state. As is_added shares different memory
> location so race condition is avoided.
> 
> Signed-off-by: Hari Vyas <hari.vyas@broadcom.com>

Reviewed-by: Lukas Wunner <lukas@wunner.de>
Bjorn Helgaas July 18, 2018, 11:29 p.m. | #2
[+cc Paul, Michael, linuxppc-dev]

On Tue, Jul 03, 2018 at 02:35:41PM +0530, Hari Vyas wrote:
> When a pci device is detected, a variable is_added is set to
> 1 in pci device structure and proc, sys entries are created.
> 
> When a pci device is removed, first is_added is checked for one
> and then device is detached with clearing of proc and sys
> entries and at end, is_added is set to 0.
> 
> is_added and is_busmaster are bit fields in pci_dev structure
> sharing same memory location.
> 
> A strange issue was observed with multiple times removal and
> rescan of a pcie nvme device using sysfs commands where is_added
> flag was observed as zero instead of one while removing device
> and proc,sys entries are not cleared.  This causes issue in
> later device addition with warning message "proc_dir_entry"
> already registered.
> 
> Debugging revealed a race condition between pcie core driver
> enabling is_added bit(pci_bus_add_device()) and nvme driver
> reset work-queue enabling is_busmaster bit (by pci_set_master()).
> As both fields are not handled in atomic manner and that clears
> is_added bit.
> 
> Fix moves device addition is_added bit to separate private flag
> variable and use different atomic functions to set and retrieve
> device addition state. As is_added shares different memory
> location so race condition is avoided.

Really nice bit of debugging!

> Signed-off-by: Hari Vyas <hari.vyas@broadcom.com>
> ---
>  arch/powerpc/kernel/pci-common.c          |  4 +++-
>  arch/powerpc/platforms/powernv/pci-ioda.c |  3 ++-
>  arch/powerpc/platforms/pseries/setup.c    |  3 ++-
>  drivers/pci/bus.c                         |  6 +++---
>  drivers/pci/hotplug/acpiphp_glue.c        |  2 +-
>  drivers/pci/pci.h                         | 11 +++++++++++
>  drivers/pci/probe.c                       |  4 ++--
>  drivers/pci/remove.c                      |  5 +++--
>  include/linux/pci.h                       |  1 -
>  9 files changed, 27 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
> index fe9733f..471aac3 100644
> --- a/arch/powerpc/kernel/pci-common.c
> +++ b/arch/powerpc/kernel/pci-common.c
> @@ -42,6 +42,8 @@
>  #include <asm/ppc-pci.h>
>  #include <asm/eeh.h>
>  
> +#include "../../../drivers/pci/pci.h"

I see why you need it, but this include path is really ugly.  Outside
of bootloaders and tools, there are very few instances of includes
like this that reference a different top-level directory, and I'm not
very keen about adding more.

Obviously powerpc is the only arch that needs dev->is_added.  It seems
to be because "We can only call pcibios_setup_device() after bus setup
is complete, since some of the platform specific DMA setup code
depends on it."

I don't know powerpc, but it does raise the question in my mind of
whether powerpc could be changed to do the DMA setup more like other
arches do to remove this ordering dependency and the need to use
dev->is_added.

That sounds like a lot of work, but it would have the benefit of
unifying some code that is probably needlessly arch-specific.

>  /* hose_spinlock protects accesses to the the phb_bitmap. */
>  static DEFINE_SPINLOCK(hose_spinlock);
>  LIST_HEAD(hose_list);
> @@ -1014,7 +1016,7 @@ void pcibios_setup_bus_devices(struct pci_bus *bus)
>  		/* Cardbus can call us to add new devices to a bus, so ignore
>  		 * those who are already fully discovered
>  		 */
> -		if (dev->is_added)
> +		if (pci_dev_is_added(dev))
>  			continue;
>  
>  		pcibios_setup_device(dev);
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 5bd0eb6..70b2e1e 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -46,6 +46,7 @@
>  
>  #include "powernv.h"
>  #include "pci.h"
> +#include "../../../../drivers/pci/pci.h"
>  
>  #define PNV_IODA1_M64_NUM	16	/* Number of M64 BARs	*/
>  #define PNV_IODA1_M64_SEGS	8	/* Segments per M64 BAR	*/
> @@ -3138,7 +3139,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>  	struct pci_dn *pdn;
>  	int mul, total_vfs;
>  
> -	if (!pdev->is_physfn || pdev->is_added)
> +	if (!pdev->is_physfn || pci_dev_is_added(pdev))
>  		return;
>  
>  	pdn = pci_get_pdn(pdev);
> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> index 139f0af..8a4868a 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -71,6 +71,7 @@
>  #include <asm/security_features.h>
>  
>  #include "pseries.h"
> +#include "../../../../drivers/pci/pci.h"
>  
>  int CMO_PrPSP = -1;
>  int CMO_SecPSP = -1;
> @@ -664,7 +665,7 @@ static void pseries_pci_fixup_iov_resources(struct pci_dev *pdev)
>  	const int *indexes;
>  	struct device_node *dn = pci_device_to_OF_node(pdev);
>  
> -	if (!pdev->is_physfn || pdev->is_added)
> +	if (!pdev->is_physfn || pci_dev_is_added(pdev))
>  		return;
>  	/*Firmware must support open sriov otherwise dont configure*/
>  	indexes = of_get_property(dn, "ibm,open-sriov-vf-bar-info", NULL);
> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> index 35b7fc8..5cb40b2 100644
> --- a/drivers/pci/bus.c
> +++ b/drivers/pci/bus.c
> @@ -330,7 +330,7 @@ void pci_bus_add_device(struct pci_dev *dev)
>  		return;
>  	}
>  
> -	dev->is_added = 1;
> +	pci_dev_assign_added(dev, true);
>  }
>  EXPORT_SYMBOL_GPL(pci_bus_add_device);
>  
> @@ -347,14 +347,14 @@ void pci_bus_add_devices(const struct pci_bus *bus)
>  
>  	list_for_each_entry(dev, &bus->devices, bus_list) {
>  		/* Skip already-added devices */
> -		if (dev->is_added)
> +		if (pci_dev_is_added(dev))
>  			continue;
>  		pci_bus_add_device(dev);
>  	}
>  
>  	list_for_each_entry(dev, &bus->devices, bus_list) {
>  		/* Skip if device attach failed */
> -		if (!dev->is_added)
> +		if (!pci_dev_is_added(dev))
>  			continue;
>  		child = dev->subordinate;
>  		if (child)
> diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
> index 3a17b29..ef0b1b6 100644
> --- a/drivers/pci/hotplug/acpiphp_glue.c
> +++ b/drivers/pci/hotplug/acpiphp_glue.c
> @@ -509,7 +509,7 @@ static void enable_slot(struct acpiphp_slot *slot)
>  
>  	list_for_each_entry(dev, &bus->devices, bus_list) {
>  		/* Assume that newly added devices are powered on already. */
> -		if (!dev->is_added)
> +		if (!pci_dev_is_added(dev))
>  			dev->current_state = PCI_D0;
>  	}
>  
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 882f1f9..0881725 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -288,6 +288,7 @@ struct pci_sriov {
>  
>  /* pci_dev priv_flags */
>  #define PCI_DEV_DISCONNECTED 0
> +#define PCI_DEV_ADDED 1
>  
>  static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
>  {
> @@ -300,6 +301,16 @@ static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
>  	return test_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags);
>  }
>  
> +static inline void pci_dev_assign_added(struct pci_dev *dev, bool added)
> +{
> +	assign_bit(PCI_DEV_ADDED, &dev->priv_flags, added);
> +}
> +
> +static inline bool pci_dev_is_added(const struct pci_dev *dev)
> +{
> +	return test_bit(PCI_DEV_ADDED, &dev->priv_flags);
> +}
> +
>  #ifdef CONFIG_PCI_ATS
>  void pci_restore_ats_state(struct pci_dev *dev);
>  #else
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index ac876e3..611adcd 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2433,13 +2433,13 @@ int pci_scan_slot(struct pci_bus *bus, int devfn)
>  	dev = pci_scan_single_device(bus, devfn);
>  	if (!dev)
>  		return 0;
> -	if (!dev->is_added)
> +	if (!pci_dev_is_added(dev))
>  		nr++;
>  
>  	for (fn = next_fn(bus, dev, 0); fn > 0; fn = next_fn(bus, dev, fn)) {
>  		dev = pci_scan_single_device(bus, devfn + fn);
>  		if (dev) {
> -			if (!dev->is_added)
> +			if (!pci_dev_is_added(dev))
>  				nr++;
>  			dev->multifunction = 1;
>  		}
> diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
> index 6f072ea..5e3d0dc 100644
> --- a/drivers/pci/remove.c
> +++ b/drivers/pci/remove.c
> @@ -19,11 +19,12 @@ static void pci_stop_dev(struct pci_dev *dev)
>  {
>  	pci_pme_active(dev, false);
>  
> -	if (dev->is_added) {
> +	if (pci_dev_is_added(dev)) {
>  		device_release_driver(&dev->dev);
>  		pci_proc_detach_device(dev);
>  		pci_remove_sysfs_dev_files(dev);
> -		dev->is_added = 0;
> +
> +		pci_dev_assign_added(dev, false);
>  	}
>  
>  	if (dev->bus->self)
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 340029b..506125b 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -368,7 +368,6 @@ struct pci_dev {
>  	unsigned int	transparent:1;		/* Subtractive decode bridge */
>  	unsigned int	multifunction:1;	/* Multi-function device */
>  
> -	unsigned int	is_added:1;
>  	unsigned int	is_busmaster:1;		/* Is busmaster */
>  	unsigned int	no_msi:1;		/* May not use MSI */
>  	unsigned int	no_64bit_msi:1; 	/* May only use 32-bit MSIs */
> -- 
> 1.9.1
>
Benjamin Herrenschmidt July 19, 2018, 4:18 a.m. | #3
On Wed, 2018-07-18 at 18:29 -0500, Bjorn Helgaas wrote:
> [+cc Paul, Michael, linuxppc-dev]
> 

   ..../...

> > Debugging revealed a race condition between pcie core driver
> > enabling is_added bit(pci_bus_add_device()) and nvme driver
> > reset work-queue enabling is_busmaster bit (by pci_set_master()).
> > As both fields are not handled in atomic manner and that clears
> > is_added bit.
> > 
> > Fix moves device addition is_added bit to separate private flag
> > variable and use different atomic functions to set and retrieve
> > device addition state. As is_added shares different memory
> > location so race condition is avoided.
> 
> Really nice bit of debugging!

Indeed. However I'm not a fan of the solution. Shouldn't we instead have
some locking for the contents of pci_dev? I've always been wary of us
having other similar races in there.

As for the powerpc bits, I'm probably the one who wrote them, however,
I'm on vacation this week and right now, no bandwidth to context switch
all that back in :-) So give me a few days and/or ping me next week.

The powerpc PCI code contains a lot of cruft coming from the depth of
history, including rather nasty assumptions. We want to progressively
clean it up, starting with EEH, but it will take time.

Cheers,
Ben.
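
For comparison, the locking alternative raised in this reply could look roughly like the sketch below. It is a userspace illustration under stated assumptions: struct lock_dev, the toy spinlock built on C11 atomic_flag, and the demo_* helpers are all hypothetical and are not part of the patch or of the kernel API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of struct pci_dev. */
struct lock_dev {
	atomic_flag lock;		/* toy spinlock guarding the bit fields */
	unsigned int is_added:1;
	unsigned int is_busmaster:1;
};

static void demo_lock(struct lock_dev *dev)
{
	while (atomic_flag_test_and_set_explicit(&dev->lock,
						 memory_order_acquire))
		;	/* spin until the current holder releases the lock */
}

static void demo_unlock(struct lock_dev *dev)
{
	atomic_flag_clear_explicit(&dev->lock, memory_order_release);
}

/* Every bit field write goes through the lock, so the compiler's
 * read-modify-write of the shared word is never interleaved. */
void demo_set_added(struct lock_dev *dev, bool val)
{
	demo_lock(dev);
	dev->is_added = val;
	demo_unlock(dev);
}

void demo_set_busmaster(struct lock_dev *dev, bool val)
{
	demo_lock(dev);
	dev->is_busmaster = val;
	demo_unlock(dev);
}
```

The trade-off against the accepted patch is that every writer of any bit field in the word has to take the lock, including drivers that touch these fields directly, whereas moving one flag to its own word only changes that flag's users.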

Hari Vyas July 19, 2018, 2:04 p.m. | #4
Hi Bjorn, Ben

On Thu, Jul 19, 2018 at 9:48 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2018-07-18 at 18:29 -0500, Bjorn Helgaas wrote:
>> [+cc Paul, Michael, linuxppc-dev]
>>
>
>    ..../...
>
>> > Debugging revealed a race condition between pcie core driver
>> > enabling is_added bit(pci_bus_add_device()) and nvme driver
>> > reset work-queue enabling is_busmaster bit (by pci_set_master()).
>> > As both fields are not handled in atomic manner and that clears
>> > is_added bit.
>> >
>> > Fix moves device addition is_added bit to separate private flag
>> > variable and use different atomic functions to set and retrieve
>> > device addition state. As is_added shares different memory
>> > location so race condition is avoided.
>>
>> Really nice bit of debugging!
>
> Indeed. However I'm not fan of the solution. Shouldn't we instead have
> some locking for the content of pci_dev ? I've always been wary of us
> having other similar races in there.
>
> As for the powerpc bits, I'm probably the one who wrote them, however,
> I'm on vacation this week and right now, no bandwidth to context switch
> all that back in :-) So give me a few days and/or ping me next week.
>
> The powerpc PCI code contains a lot of cruft coming from the depth of
> history, including rather nasty assumptions. We want to progressively
> clean it up, starting with EEH, but it will take time.
>
> Cheers,
> Ben.
>
Some drivers also use the pci_dev structure flags directly, which may cause
similar issues under a race condition and should be avoided.
This is probably not causing problems currently, but some race scenario may
hit it, so the flags need to be handled atomically through get()/set() APIs.
I suggest using bit positions for all remaining bitfields and atomic
operations on them. That way access can be controlled, and fields are not
updated directly from outside.
For example:
enum pci_dev_flags {
   IS_BUSMASTER = 1,
   NO_MSI = 2,
};
void assign_pci_dev_flag(struct pci_dev *dev, int flag, bool val)
{
   assign_bit(flag, &dev->flags, val);
}
Proper cleanup is required in many places, and that will certainly take
some time and a good effort, but it will be future-safe. If Bjorn agrees, we
can work on that.
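
The get()/set() proposal above can be modelled in userspace as a paired setter and getter over one atomic word. This is a sketch only: struct flag_dev, enum demo_flag, and the demo_* functions are hypothetical names, and C11 atomic_fetch_or()/atomic_fetch_and() are used here as stand-ins for the kernel's assign_bit() and test_bit().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers; the enum in the mail is only a proposal. */
enum demo_flag {
	DEMO_IS_BUSMASTER = 0,
	DEMO_NO_MSI = 1,
};

/* Hypothetical device structure with one bit per former bit field. */
struct flag_dev {
	atomic_ulong flags;
};

/* Atomically set or clear one flag without disturbing its neighbours. */
void demo_assign_flag(struct flag_dev *dev, enum demo_flag flag, bool val)
{
	unsigned long mask = 1UL << flag;

	if (val)
		atomic_fetch_or(&dev->flags, mask);
	else
		atomic_fetch_and(&dev->flags, ~mask);
}

/* Read back a single flag. */
bool demo_test_flag(struct flag_dev *dev, enum demo_flag flag)
{
	return (atomic_load(&dev->flags) >> flag) & 1UL;
}
```

Keeping the flags word reachable only through such helpers is what would stop callers from updating individual fields directly.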
>> > Signed-off-by: Hari Vyas <hari.vyas@broadcom.com>
>> > ---
>> >  arch/powerpc/kernel/pci-common.c          |  4 +++-
>> >  arch/powerpc/platforms/powernv/pci-ioda.c |  3 ++-
>> >  arch/powerpc/platforms/pseries/setup.c    |  3 ++-
>> >  drivers/pci/bus.c                         |  6 +++---
>> >  drivers/pci/hotplug/acpiphp_glue.c        |  2 +-
>> >  drivers/pci/pci.h                         | 11 +++++++++++
>> >  drivers/pci/probe.c                       |  4 ++--
>> >  drivers/pci/remove.c                      |  5 +++--
>> >  include/linux/pci.h                       |  1 -
>> >  9 files changed, 27 insertions(+), 12 deletions(-)
>> >
>> > diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
>> > index fe9733f..471aac3 100644
>> > --- a/arch/powerpc/kernel/pci-common.c
>> > +++ b/arch/powerpc/kernel/pci-common.c
>> > @@ -42,6 +42,8 @@
>> >  #include <asm/ppc-pci.h>
>> >  #include <asm/eeh.h>
>> >
>> > +#include "../../../drivers/pci/pci.h"
>>
>> I see why you need it, but this include path is really ugly.  Outside
>> of bootloaders and tools, there are very few instances of includes
>> like this that reference a different top-level directory, and I'm not
>> very keen about adding more.
>>
>> Obviously powerpc is the only arch that needs dev->is_added.  It seems
>> to be because "We can only call pcibios_setup_device() after bus setup
>> is complete, since some of the platform specific DMA setup code
>> depends on it."
>>
>> I don't know powerpc, but it does raise the question in my mind of
>> whether powerpc could be changed to do the DMA setup more like other
>> arches do to remove this ordering dependency and the need to use
>> dev->is_added.
>>
>> That sounds like a lot of work, but it would have the benefit of
>> unifying some code that is probably needlessly arch-specific.
>>
Yes, I agree that including pci.h with ../ references is really bad.
The first patch used a spin lock to protect the is_added and is_busmaster
bits, but the final patch moved is_added to the private flags.

Bjorn Helgaas July 19, 2018, 5:41 p.m. | #5
On Tue, Jul 03, 2018 at 02:35:41PM +0530, Hari Vyas wrote:
> When a pci device is detected, a variable is_added is set to
> 1 in pci device structure and proc, sys entries are created.
> 
> When a pci device is removed, first is_added is checked for one
> and then device is detached with clearing of proc and sys
> entries and at end, is_added is set to 0.
> 
> is_added and is_busmaster are bit fields in pci_dev structure
> sharing same memory location.
> 
> A strange issue was observed with multiple times removal and
> rescan of a pcie nvme device using sysfs commands where is_added
> flag was observed as zero instead of one while removing device
> and proc,sys entries are not cleared.  This causes issue in
> later device addition with warning message "proc_dir_entry"
> already registered.
> 
> Debugging revealed a race condition between pcie core driver
> enabling is_added bit(pci_bus_add_device()) and nvme driver
> reset work-queue enabling is_busmaster bit (by pci_set_master()).
> As both fields are not handled in atomic manner and that clears
> is_added bit.
> 
> Fix moves device addition is_added bit to separate private flag
> variable and use different atomic functions to set and retrieve
> device addition state. As is_added shares different memory
> location so race condition is avoided.

If/when you post a v4 of this, can you include the bugzilla URL
right here, i.e.,

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200283
> Signed-off-by: Hari Vyas <hari.vyas@broadcom.com>

That way we can connect the commit with the bugzilla, which contains
more information that may be useful in the future.

Bjorn
Lukas Wunner July 19, 2018, 6:55 p.m. | #6
On Thu, Jul 19, 2018 at 9:48 AM, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> Indeed. However I'm not fan of the solution. Shouldn't we instead have
> some locking for the content of pci_dev? I've always been wary of us
> having other similar races in there.

The solution presented is perfectly fine as it uses atomic bitops which
obviate the need for locking.  Why do you want to add unnecessary locking
on top?

Certain other parts of struct pci_dev use their own locking, e.g.
pci_bus_sem to protect bus_list.  Most elements can and should
be accessed lockless for performance.
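
To illustrate the point about atomic bitops, here is a minimal userspace sketch of the lost update described in the commit message. It is not kernel code: the IS_ADDED and IS_BUSMASTER masks and both function names are hypothetical stand-ins, and the "CPUs" are replayed sequentially in one thread so the outcome is deterministic. The second function uses C11 atomics as a rough analogue of the kernel's atomic bitops.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical masks standing in for two bit-fields that share one word. */
#define IS_ADDED     (1u << 0)
#define IS_BUSMASTER (1u << 1)

/*
 * Replay, step by step, two plain read-modify-write sequences on one
 * shared word. Returns the final value of the word.
 */
unsigned int racy_interleaving(void)
{
	unsigned int word = 0;
	unsigned int cpu_a, cpu_b;

	cpu_a = word;           /* CPU A loads the word to set is_added */
	cpu_b = word;           /* CPU B loads the same word to set is_busmaster */
	cpu_a |= IS_ADDED;
	word = cpu_a;           /* CPU A stores: is_added is set */
	cpu_b |= IS_BUSMASTER;
	word = cpu_b;           /* CPU B stores its stale copy: is_added is lost */

	return word;
}

/*
 * The same two updates as atomic read-modify-write operations. Each OR
 * is indivisible, so neither update can overwrite the other.
 */
unsigned int atomic_updates(void)
{
	atomic_uint word;

	atomic_init(&word, 0);
	atomic_fetch_or(&word, IS_ADDED);
	atomic_fetch_or(&word, IS_BUSMASTER);

	return atomic_load(&word);
}
```

In this sketch racy_interleaving() ends with only IS_BUSMASTER set, which matches the symptom reported in the commit message, while atomic_updates() keeps both bits.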


> > The powerpc PCI code contains a lot of cruft coming from the depth of
> > history, including rather nasty assumptions. We want to progressively
> > clean it up, starting with EEH, but it will take time.

Then I suggest using the #include "../../../drivers/pci/pci.h" for now
until the powerpc arch code has been consolidated.

Thanks,

Lukas
Benjamin Herrenschmidt July 20, 2018, 4:27 a.m. | #7
On Thu, 2018-07-19 at 20:55 +0200, Lukas Wunner wrote:
> On Thu, Jul 19, 2018 at 9:48 AM, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > Indeed. However I'm not fan of the solution. Shouldn't we instead have
> > some locking for the content of pci_dev? I've always been wary of us
> > having other similar races in there.
> 
> The solution presented is perfectly fine as it uses atomic bitops which
> obviate the need for locking.  Why do you want to add unnecessary locking
> on top?

Atomic bitops tend to be *more* expensive than a lock.

My concern is that the PCIe code historically had no locking and I
worry we may have other fields in there with similar issues. But maybe
I'm wrong.
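
For readers following along, the helpers under discussion can be approximated in userspace as below. This is only a sketch under assumptions: struct demo_dev and every demo_* name are hypothetical, C11 atomics stand in for the kernel's set_bit()/clear_bit()/test_bit(), and nothing here measures the relative cost of atomics versus a lock.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bit numbers, laid out the same way as the patch's priv_flags defines. */
#define DEMO_DEV_DISCONNECTED 0
#define DEMO_DEV_ADDED        1

/* Hypothetical stand-in for the priv_flags word of struct pci_dev. */
struct demo_dev {
	atomic_ulong priv_flags;
};

/* Atomically set or clear bit nr, in the spirit of the kernel's assign_bit(). */
void demo_assign_bit(unsigned int nr, atomic_ulong *addr, bool value)
{
	assert(nr < sizeof(unsigned long) * CHAR_BIT);
	if (value)
		atomic_fetch_or(addr, 1UL << nr);
	else
		atomic_fetch_and(addr, ~(1UL << nr));
}

/* Read bit nr without modifying the word. */
bool demo_test_bit(unsigned int nr, atomic_ulong *addr)
{
	assert(nr < sizeof(unsigned long) * CHAR_BIT);
	return (atomic_load(addr) >> nr) & 1UL;
}

/* Userspace analogues of pci_dev_assign_added() and pci_dev_is_added(). */
void demo_dev_assign_added(struct demo_dev *dev, bool added)
{
	demo_assign_bit(DEMO_DEV_ADDED, &dev->priv_flags, added);
}

bool demo_dev_is_added(struct demo_dev *dev)
{
	return demo_test_bit(DEMO_DEV_ADDED, &dev->priv_flags);
}
```

Because each flag update touches only its own bit in one indivisible operation, clearing the "added" bit leaves a concurrently set "disconnected" bit intact.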

> Certain other parts of struct pci_dev use their own locking, e.g.
> pci_bus_sem to protect bus_list.  Most elements can and should
> be accessed lockless for performance.
> 
> 
> > > The powerpc PCI code contains a lot of cruft coming from the depth of
> > > history, including rather nasty assumptions. We want to progressively
> > > clean it up, starting with EEH, but it will take time.
> 
> Then I suggest using the #include "../../../drivers/pci/pci.h" for now
> until the powerpc arch code has been consolidated.

There's also the need both in powerpc and sparc to access the guts of
pci_dev because those archs will "fabricate" a pci_dev from the
device-tree rather than probing it under some circumstances.

Cheers,
Ben.
Hari Vyas July 20, 2018, 9:16 a.m. | #8
On Thu, Jul 19, 2018 at 11:11 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> On Tue, Jul 03, 2018 at 02:35:41PM +0530, Hari Vyas wrote:
>> When a pci device is detected, a variable is_added is set to
>> 1 in pci device structure and proc, sys entries are created.
>>
>> When a pci device is removed, first is_added is checked for one
>> and then device is detached with clearing of proc and sys
>> entries and at end, is_added is set to 0.
>>
>> is_added and is_busmaster are bit fields in pci_dev structure
>> sharing same memory location.
>>
>> A strange issue was observed with multiple times removal and
>> rescan of a pcie nvme device using sysfs commands where is_added
>> flag was observed as zero instead of one while removing device
>> and proc,sys entries are not cleared.  This causes issue in
>> later device addition with warning message "proc_dir_entry"
>> already registered.
>>
>> Debugging revealed a race condition between pcie core driver
>> enabling is_added bit(pci_bus_add_device()) and nvme driver
>> reset work-queue enabling is_busmaster bit (by pci_set_master()).
>> As both fields are not handled in atomic manner and that clears
>> is_added bit.
>>
>> Fix moves device addition is_added bit to separate private flag
>> variable and use different atomic functions to set and retrieve
>> device addition state. As is_added shares different memory
>> location so race condition is avoided.
>
> If/when you post a v4 of this, can you include the bugzilla URL
> right here, i.e.,
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200283
>> Signed-off-by: Hari Vyas <hari.vyas@broadcom.com>
>
> That way we can connect the commit with the bugzilla, which contains
> more information that may be useful in the future.
>
> Bjorn
I'm a bit confused.  Do I need to change the commit message to include the bugzilla link?
I believe v3 should be okay from a code perspective for the time being. Please confirm.
Bjorn Helgaas July 20, 2018, 12:20 p.m. | #9
On Fri, Jul 20, 2018 at 4:16 AM Hari Vyas <hari.vyas@broadcom.com> wrote:
>
> On Thu, Jul 19, 2018 at 11:11 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Tue, Jul 03, 2018 at 02:35:41PM +0530, Hari Vyas wrote:
> >> When a pci device is detected, a variable is_added is set to
> >> 1 in pci device structure and proc, sys entries are created.
> >>
> >> When a pci device is removed, first is_added is checked for one
> >> and then device is detached with clearing of proc and sys
> >> entries and at end, is_added is set to 0.
> >>
> >> is_added and is_busmaster are bit fields in pci_dev structure
> >> sharing same memory location.
> >>
> >> A strange issue was observed with multiple times removal and
> >> rescan of a pcie nvme device using sysfs commands where is_added
> >> flag was observed as zero instead of one while removing device
> >> and proc,sys entries are not cleared.  This causes issue in
> >> later device addition with warning message "proc_dir_entry"
> >> already registered.
> >>
> >> Debugging revealed a race condition between pcie core driver
> >> enabling is_added bit(pci_bus_add_device()) and nvme driver
> >> reset work-queue enabling is_busmaster bit (by pci_set_master()).
> >> As both fields are not handled in atomic manner and that clears
> >> is_added bit.
> >>
> >> Fix moves device addition is_added bit to separate private flag
> >> variable and use different atomic functions to set and retrieve
> >> device addition state. As is_added shares different memory
> >> location so race condition is avoided.
> >
> > If/when you post a v4 of this, can you include the bugzilla URL
> > right here, i.e.,
> >
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=200283
> >> Signed-off-by: Hari Vyas <hari.vyas@broadcom.com>
> >
> > That way we can connect the commit with the bugzilla, which contains
> > more information that may be useful in the future.

> I'm a bit confused.  Do I need to change the commit message to include the bugzilla link?
> I believe v3 should be okay from a code perspective for the time being. Please confirm.

You don't need to repost it just to update the commit message (I'll
try to remember to add the link when I apply it).  But if you do
repost it for some other reason, it helps me out if you include the
link.
Bjorn Helgaas July 27, 2018, 10:25 p.m. | #10
On Thu, Jul 19, 2018 at 02:18:09PM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2018-07-18 at 18:29 -0500, Bjorn Helgaas wrote:
> > [+cc Paul, Michael, linuxppc-dev]
> > 
> 
>    ..../...
> 
> > > Debugging revealed a race condition between pcie core driver
> > > enabling is_added bit(pci_bus_add_device()) and nvme driver
> > > reset work-queue enabling is_busmaster bit (by pci_set_master()).
> > > As both fields are not handled in atomic manner and that clears
> > > is_added bit.
> > > 
> > > Fix moves device addition is_added bit to separate private flag
> > > variable and use different atomic functions to set and retrieve
> > > device addition state. As is_added shares different memory
> > > location so race condition is avoided.
> > 
> > Really nice bit of debugging!
> 
> Indeed. However I'm not fan of the solution. Shouldn't we instead have
> some locking for the content of pci_dev ? I've always been wary of us
> having other similar races in there.
> 
> As for the powerpc bits, I'm probably the one who wrote them, however,
> I'm on vacation this week and right now, no bandwidth to context switch
> all that back in :-) So give me a few days and/or ping me next week.

OK, here's a ping :)

Some powerpc cleanup would be ideal, but I'd like to fix the race for
v4.19, so I'm fine with this patch as-is.  But I'd definitely want
your ack before inserting the ugly #include path in the powerpc code.

> The powerpc PCI code contains a lot of cruft coming from the depth of
> history, including rather nasty assumptions. We want to progressively
> clean it up, starting with EEH, but it will take time.
> 
> Cheers,
> Ben.
> 
> > > Signed-off-by: Hari Vyas <hari.vyas@broadcom.com>
> > > ---
> > >  arch/powerpc/kernel/pci-common.c          |  4 +++-
> > >  arch/powerpc/platforms/powernv/pci-ioda.c |  3 ++-
> > >  arch/powerpc/platforms/pseries/setup.c    |  3 ++-
> > >  drivers/pci/bus.c                         |  6 +++---
> > >  drivers/pci/hotplug/acpiphp_glue.c        |  2 +-
> > >  drivers/pci/pci.h                         | 11 +++++++++++
> > >  drivers/pci/probe.c                       |  4 ++--
> > >  drivers/pci/remove.c                      |  5 +++--
> > >  include/linux/pci.h                       |  1 -
> > >  9 files changed, 27 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
> > > index fe9733f..471aac3 100644
> > > --- a/arch/powerpc/kernel/pci-common.c
> > > +++ b/arch/powerpc/kernel/pci-common.c
> > > @@ -42,6 +42,8 @@
> > >  #include <asm/ppc-pci.h>
> > >  #include <asm/eeh.h>
> > >  
> > > +#include "../../../drivers/pci/pci.h"
> > 
> > I see why you need it, but this include path is really ugly.  Outside
> > of bootloaders and tools, there are very few instances of includes
> > like this that reference a different top-level directory, and I'm not
> > very keen about adding more.
Benjamin Herrenschmidt July 28, 2018, 12:45 a.m. | #11
On Fri, 2018-07-27 at 17:25 -0500, Bjorn Helgaas wrote:
> > As for the powerpc bits, I'm probably the one who wrote them, however,
> > I'm on vacation this week and right now, no bandwidth to context switch
> > all that back in :-) So give me a few days and/or ping me next week.
> 
> OK, here's a ping :)
> 
> Some powerpc cleanup would be ideal, but I'd like to fix the race for
> v4.19, so I'm fine with this patch as-is.  But I'd definitely want
> your ack before inserting the ugly #include path in the powerpc code.

Go for it. Looks like I got a last-minute meeting in Austin next week
so I'll have no time to look at any of this for a while.

Cheers,
Ben.
Michael Ellerman July 31, 2018, 11:21 a.m. | #12
Bjorn Helgaas <helgaas@kernel.org> writes:
> On Thu, Jul 19, 2018 at 02:18:09PM +1000, Benjamin Herrenschmidt wrote:
>> On Wed, 2018-07-18 at 18:29 -0500, Bjorn Helgaas wrote:
>> > [+cc Paul, Michael, linuxppc-dev]
>> > 
>> 
>>    ..../...
>> 
>> > > Debugging revealed a race condition between pcie core driver
>> > > enabling is_added bit(pci_bus_add_device()) and nvme driver
>> > > reset work-queue enabling is_busmaster bit (by pci_set_master()).
>> > > As both fields are not handled in atomic manner and that clears
>> > > is_added bit.
>> > > 
>> > > Fix moves device addition is_added bit to separate private flag
>> > > variable and use different atomic functions to set and retrieve
>> > > device addition state. As is_added shares different memory
>> > > location so race condition is avoided.
>> > 
>> > Really nice bit of debugging!
>> 
>> Indeed. However I'm not fan of the solution. Shouldn't we instead have
>> some locking for the content of pci_dev ? I've always been wary of us
>> having other similar races in there.
>> 
>> As for the powerpc bits, I'm probably the one who wrote them, however,
>> I'm on vacation this week and right now, no bandwidth to context switch
>> all that back in :-) So give me a few days and/or ping me next week.
>
> OK, here's a ping :)
>
> Some powerpc cleanup would be ideal, but I'd like to fix the race for
> v4.19, so I'm fine with this patch as-is.  But I'd definitely want
> your ack before inserting the ugly #include path in the powerpc code.

Sorry, the patch didn't hit linuxppc so I forgot about it.

I'm OK with the patch, the include is a bit gross, but I guess it's
fine.

I have a change to pseries/setup.c queued that might collide, though
it's just an addition of another include so it's a trivial fixup.

Acked-by: Michael Ellerman <mpe@ellerman.id.au>


In terms of longer term clean up, do you have a sketch of what you'd
like to see?

cheers

Patch

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index fe9733f..471aac3 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -42,6 +42,8 @@ 
 #include <asm/ppc-pci.h>
 #include <asm/eeh.h>
 
+#include "../../../drivers/pci/pci.h"
+
 /* hose_spinlock protects accesses to the the phb_bitmap. */
 static DEFINE_SPINLOCK(hose_spinlock);
 LIST_HEAD(hose_list);
@@ -1014,7 +1016,7 @@  void pcibios_setup_bus_devices(struct pci_bus *bus)
 		/* Cardbus can call us to add new devices to a bus, so ignore
 		 * those who are already fully discovered
 		 */
-		if (dev->is_added)
+		if (pci_dev_is_added(dev))
 			continue;
 
 		pcibios_setup_device(dev);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5bd0eb6..70b2e1e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -46,6 +46,7 @@ 
 
 #include "powernv.h"
 #include "pci.h"
+#include "../../../../drivers/pci/pci.h"
 
 #define PNV_IODA1_M64_NUM	16	/* Number of M64 BARs	*/
 #define PNV_IODA1_M64_SEGS	8	/* Segments per M64 BAR	*/
@@ -3138,7 +3139,7 @@  static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 	struct pci_dn *pdn;
 	int mul, total_vfs;
 
-	if (!pdev->is_physfn || pdev->is_added)
+	if (!pdev->is_physfn || pci_dev_is_added(pdev))
 		return;
 
 	pdn = pci_get_pdn(pdev);
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 139f0af..8a4868a 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -71,6 +71,7 @@ 
 #include <asm/security_features.h>
 
 #include "pseries.h"
+#include "../../../../drivers/pci/pci.h"
 
 int CMO_PrPSP = -1;
 int CMO_SecPSP = -1;
@@ -664,7 +665,7 @@  static void pseries_pci_fixup_iov_resources(struct pci_dev *pdev)
 	const int *indexes;
 	struct device_node *dn = pci_device_to_OF_node(pdev);
 
-	if (!pdev->is_physfn || pdev->is_added)
+	if (!pdev->is_physfn || pci_dev_is_added(pdev))
 		return;
 	/*Firmware must support open sriov otherwise dont configure*/
 	indexes = of_get_property(dn, "ibm,open-sriov-vf-bar-info", NULL);
diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 35b7fc8..5cb40b2 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -330,7 +330,7 @@  void pci_bus_add_device(struct pci_dev *dev)
 		return;
 	}
 
-	dev->is_added = 1;
+	pci_dev_assign_added(dev, true);
 }
 EXPORT_SYMBOL_GPL(pci_bus_add_device);
 
@@ -347,14 +347,14 @@  void pci_bus_add_devices(const struct pci_bus *bus)
 
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		/* Skip already-added devices */
-		if (dev->is_added)
+		if (pci_dev_is_added(dev))
 			continue;
 		pci_bus_add_device(dev);
 	}
 
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		/* Skip if device attach failed */
-		if (!dev->is_added)
+		if (!pci_dev_is_added(dev))
 			continue;
 		child = dev->subordinate;
 		if (child)
diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
index 3a17b29..ef0b1b6 100644
--- a/drivers/pci/hotplug/acpiphp_glue.c
+++ b/drivers/pci/hotplug/acpiphp_glue.c
@@ -509,7 +509,7 @@  static void enable_slot(struct acpiphp_slot *slot)
 
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		/* Assume that newly added devices are powered on already. */
-		if (!dev->is_added)
+		if (!pci_dev_is_added(dev))
 			dev->current_state = PCI_D0;
 	}
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 882f1f9..0881725 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -288,6 +288,7 @@  struct pci_sriov {
 
 /* pci_dev priv_flags */
 #define PCI_DEV_DISCONNECTED 0
+#define PCI_DEV_ADDED 1
 
 static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
 {
@@ -300,6 +301,16 @@  static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
 	return test_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags);
 }
 
+static inline void pci_dev_assign_added(struct pci_dev *dev, bool added)
+{
+	assign_bit(PCI_DEV_ADDED, &dev->priv_flags, added);
+}
+
+static inline bool pci_dev_is_added(const struct pci_dev *dev)
+{
+	return test_bit(PCI_DEV_ADDED, &dev->priv_flags);
+}
+
 #ifdef CONFIG_PCI_ATS
 void pci_restore_ats_state(struct pci_dev *dev);
 #else
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index ac876e3..611adcd 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2433,13 +2433,13 @@  int pci_scan_slot(struct pci_bus *bus, int devfn)
 	dev = pci_scan_single_device(bus, devfn);
 	if (!dev)
 		return 0;
-	if (!dev->is_added)
+	if (!pci_dev_is_added(dev))
 		nr++;
 
 	for (fn = next_fn(bus, dev, 0); fn > 0; fn = next_fn(bus, dev, fn)) {
 		dev = pci_scan_single_device(bus, devfn + fn);
 		if (dev) {
-			if (!dev->is_added)
+			if (!pci_dev_is_added(dev))
 				nr++;
 			dev->multifunction = 1;
 		}
diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
index 6f072ea..5e3d0dc 100644
--- a/drivers/pci/remove.c
+++ b/drivers/pci/remove.c
@@ -19,11 +19,12 @@  static void pci_stop_dev(struct pci_dev *dev)
 {
 	pci_pme_active(dev, false);
 
-	if (dev->is_added) {
+	if (pci_dev_is_added(dev)) {
 		device_release_driver(&dev->dev);
 		pci_proc_detach_device(dev);
 		pci_remove_sysfs_dev_files(dev);
-		dev->is_added = 0;
+
+		pci_dev_assign_added(dev, false);
 	}
 
 	if (dev->bus->self)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 340029b..506125b 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -368,7 +368,6 @@  struct pci_dev {
 	unsigned int	transparent:1;		/* Subtractive decode bridge */
 	unsigned int	multifunction:1;	/* Multi-function device */
 
-	unsigned int	is_added:1;
 	unsigned int	is_busmaster:1;		/* Is busmaster */
 	unsigned int	no_msi:1;		/* May not use MSI */
 	unsigned int	no_64bit_msi:1; 	/* May only use 32-bit MSIs */