diff mbox series

[v5,03/23] PCI: hotplug: Add a flag for the movable BARs feature

Message ID 20190816165101.911-4-s.miroshnichenko@yadro.com
State Superseded
Delegated to: Bjorn Helgaas
Headers show
Series PCI: Allow BAR movement during hotplug | expand

Commit Message

Sergei Miroshnichenko Aug. 16, 2019, 4:50 p.m. UTC
When hot-adding a device, the bridge may have windows not big enough (or
fragmented too much) for newly requested BARs to fit in. And expanding
these bridge windows may be impossible because blocked by "neighboring"
BARs and bridge windows.

Still, it may be possible to allocate a memory region for new BARs with the
following procedure:

1) notify all the drivers which support movable BARs to pause and release
   the BARs; the rest of the drivers are guaranteed that their devices will
   not get BARs moved;

2) release all the bridge windows except of root bridges;

3) try to recalculate new bridge windows that will fit all the BAR types:
   - fixed;
   - immovable;
   - movable;
   - newly requested by hot-added devices;

4) if the previous step fails, disable BARs for one of the hot-added
   devices and retry from step 3;

5) notify the drivers, so they remap BARs and resume.

This makes the prior reservation of memory by BIOS/bootloader/firmware not
required anymore for the PCI hotplug.

Drivers indicate their support of movable BARs by implementing the new
.rescan_prepare() and .rescan_done() hooks in the struct pci_driver. All
device's activity must be paused during a rescan, and iounmap()+ioremap()
must be applied to every used BAR.

The platform also may need to prepare to BAR movement, so new hooks added:
pcibios_rescan_prepare(pci_dev) and pcibios_rescan_prepare(pci_dev).

This patch is a preparation for future patches with actual implementation,
and for now it just does the following:
 - declares the feature;
 - defines pci_movable_bars_enabled(), pci_dev_movable_bars_supported(dev);
 - invokes the .rescan_prepare() and .rescan_done() driver notifiers;
 - declares and invokes the pcibios_rescan_prepare()/_done() hooks;
 - adds the PCI_IMMOVABLE_BARS flag.

The feature is disabled by default (via PCI_IMMOVABLE_BARS) until the final
patch of the series. It can be overridden per-arch using this flag or by
the following command line option:

    pcie_movable_bars={ off | force }

CC: Sam Bobroff <sbobroff@linux.ibm.com>
CC: Rajat Jain <rajatja@google.com>
CC: Lukas Wunner <lukas@wunner.de>
CC: Oliver O'Halloran <oohall@gmail.com>
CC: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
---
 .../admin-guide/kernel-parameters.txt         |  7 ++
 drivers/pci/pci-driver.c                      |  2 +
 drivers/pci/pci.c                             | 24 ++++++
 drivers/pci/pci.h                             |  2 +
 drivers/pci/probe.c                           | 86 ++++++++++++++++++-
 include/linux/pci.h                           |  7 ++
 6 files changed, 126 insertions(+), 2 deletions(-)

Comments

Bjorn Helgaas Sept. 27, 2019, 10:02 p.m. UTC | #1
On Fri, Aug 16, 2019 at 07:50:41PM +0300, Sergey Miroshnichenko wrote:
> When hot-adding a device, the bridge may have windows not big enough (or
> fragmented too much) for newly requested BARs to fit in. And expanding
> these bridge windows may be impossible because blocked by "neighboring"
> BARs and bridge windows.
> 
> Still, it may be possible to allocate a memory region for new BARs with the
> following procedure:
> 
> 1) notify all the drivers which support movable BARs to pause and release
>    the BARs; the rest of the drivers are guaranteed that their devices will
>    not get BARs moved;
> 
> 2) release all the bridge windows except of root bridges;
> 
> 3) try to recalculate new bridge windows that will fit all the BAR types:
>    - fixed;
>    - immovable;
>    - movable;
>    - newly requested by hot-added devices;
> 
> 4) if the previous step fails, disable BARs for one of the hot-added
>    devices and retry from step 3;
> 
> 5) notify the drivers, so they remap BARs and resume.

You don't do the actual recalculation in *this* patch, but since you
mention the procedure here, are we confident that we never make things
worse?

It's possible that a hot-add will trigger this attempt to move things
around, and it's possible that we won't find space for the new device
even if we move things around.  But are we certain that every device
that worked *before* the hot-add will still work *afterwards*?

Much of the assignment was probably done by the BIOS using different
algorithms than Linux has, so I think there's some chance that the
BIOS did a better job and if we lose that BIOS assignment, we might
not be able to recreate it.

> This makes the prior reservation of memory by BIOS/bootloader/firmware not
> required anymore for the PCI hotplug.
> 
> Drivers indicate their support of movable BARs by implementing the new
> .rescan_prepare() and .rescan_done() hooks in the struct pci_driver. All
> device's activity must be paused during a rescan, and iounmap()+ioremap()
> must be applied to every used BAR.
> 
> The platform also may need to prepare to BAR movement, so new hooks added:
> pcibios_rescan_prepare(pci_dev) and pcibios_rescan_prepare(pci_dev).
> 
> This patch is a preparation for future patches with actual implementation,
> and for now it just does the following:
>  - declares the feature;
>  - defines pci_movable_bars_enabled(), pci_dev_movable_bars_supported(dev);
>  - invokes the .rescan_prepare() and .rescan_done() driver notifiers;
>  - declares and invokes the pcibios_rescan_prepare()/_done() hooks;
>  - adds the PCI_IMMOVABLE_BARS flag.
> 
> The feature is disabled by default (via PCI_IMMOVABLE_BARS) until the final
> patch of the series. It can be overridden per-arch using this flag or by
> the following command line option:
> 
>     pcie_movable_bars={ off | force }
> 
> CC: Sam Bobroff <sbobroff@linux.ibm.com>
> CC: Rajat Jain <rajatja@google.com>
> CC: Lukas Wunner <lukas@wunner.de>
> CC: Oliver O'Halloran <oohall@gmail.com>
> CC: David Laight <David.Laight@ACULAB.COM>
> Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |  7 ++
>  drivers/pci/pci-driver.c                      |  2 +
>  drivers/pci/pci.c                             | 24 ++++++
>  drivers/pci/pci.h                             |  2 +
>  drivers/pci/probe.c                           | 86 ++++++++++++++++++-
>  include/linux/pci.h                           |  7 ++
>  6 files changed, 126 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 47d981a86e2f..e2274ee87a35 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3526,6 +3526,13 @@
>  		nomsi	Do not use MSI for native PCIe PME signaling (this makes
>  			all PCIe root ports use INTx for all services).
>  
> +	pcie_movable_bars=[PCIE]

This isn't a PCIe-specific feature, it's just a function of whether
drivers are smart enough, so we shouldn't tie it specifically to PCIe.
We could eventually do this for conventional PCI as well.

> +			Override the movable BARs support detection:
> +		off
> +			Disable even if supported by the platform
> +		force
> +			Enable even if not explicitly declared as supported

What's the need for "force"?  If it's possible, I think we should
enable this functionality all the time and just have a disable switch
in case we trip over cases where it doesn't work, e.g., something
like:

  pci=no_movable_bars

>  	pcmv=		[HW,PCMCIA] BadgePAD 4
>  
>  	pd_ignore_unused
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index a8124e47bf6e..d11909e79263 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -1688,6 +1688,8 @@ static int __init pci_driver_init(void)
>  {
>  	int ret;
>  
> +	pci_add_flags(PCI_IMMOVABLE_BARS);
> +
>  	ret = bus_register(&pci_bus_type);
>  	if (ret)
>  		return ret;
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 61d951766087..3a504f58ac60 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -139,6 +139,30 @@ static int __init pcie_port_pm_setup(char *str)
>  }
>  __setup("pcie_port_pm=", pcie_port_pm_setup);
>  
> +static bool pcie_movable_bars_off;
> +static bool pcie_movable_bars_force;
> +static int __init pcie_movable_bars_setup(char *str)
> +{
> +	if (!strcmp(str, "off"))
> +		pcie_movable_bars_off = true;
> +	else if (!strcmp(str, "force"))
> +		pcie_movable_bars_force = true;
> +	return 1;
> +}
> +__setup("pcie_movable_bars=", pcie_movable_bars_setup);
> +
> +bool pci_movable_bars_enabled(void)
> +{
> +	if (pcie_movable_bars_off)
> +		return false;
> +
> +	if (pcie_movable_bars_force)
> +		return true;
> +
> +	return !pci_has_flag(PCI_IMMOVABLE_BARS);
> +}
> +EXPORT_SYMBOL(pci_movable_bars_enabled);

I don't think this needs to be exported, since the only references I
see are from the PCI core.

>  /* Time to wait after a reset for device to become responsive */
>  #define PCIE_RESET_READY_POLL_MS 60000
>  
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index d22d1b807701..be7acc477c64 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -257,6 +257,8 @@ bool pci_bus_clip_resource(struct pci_dev *dev, int idx);
>  void pci_reassigndev_resource_alignment(struct pci_dev *dev);
>  void pci_disable_bridge_window(struct pci_dev *dev);
>  
> +bool pci_dev_movable_bars_supported(struct pci_dev *dev);
> +
>  /* PCIe link information */
>  #define PCIE_SPEED2STR(speed) \
>  	((speed) == PCIE_SPEED_16_0GT ? "16 GT/s" : \
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 2e58ece820e8..60e3b48d2251 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -3406,6 +3406,74 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
>  	return max;
>  }
>  
> +bool pci_dev_movable_bars_supported(struct pci_dev *dev)

This name suggests that movable BARs is a property of the device, but
it's mostly a property of the *driver*.  Most uses are in conjuction
with checking IORESOURCE_PCI_FIXED for some resource, so I think this
might read more naturally and simplify the callers slightly:

  bool pci_dev_movable(struct pci_dev *dev)
  {
    if (dev->driver && dev->driver->rescan_prepare)
      return true;

    if ((dev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
      return false;

    if (!dev->driver)
      return true;

    return false;
  }

  bool pci_dev_bar_movable(struct pci_dev *dev, struct resource *res)
  {
    if (res->flags & IORESOURCE_PCI_FIXED)
      return false;

    return pci_dev_movable(dev);
  }

I'm not sure why the PCI_CLASS_DISPLAY_VGA special case is there; can
you add a comment about why that's needed?  Obviously we can't move
the 0xa0000 legacy frame buffer because I think devices are allowed to
claim that region even if no BAR describes it.  But I would think
*other* BARs of VGA devices could be movable.

> +{
> +	if (!dev)
> +		return false;
> +
> +	if (dev->driver && dev->driver->rescan_prepare)
> +		return true;
> +
> +	if ((dev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
> +		return false;
> +
> +	return !dev->driver;
> +}
> +
> +void __weak pcibios_rescan_prepare(struct pci_dev *dev)
> +{
> +}
> +
> +void __weak pcibios_rescan_done(struct pci_dev *dev)
> +{
> +}

Can you add the pcibios_rescan_prepare() and pcibios_rescan_done()
stubs at the point where they're needed, i.e., where you add the
powerpc implementations?  We can't see the need for them at this point
in the series.

> +static void pci_bus_rescan_prepare(struct pci_bus *bus)
> +{
> +	struct pci_dev *dev;
> +
> +	if (bus->self)
> +		pci_config_pm_runtime_get(bus->self);
> +
> +	list_for_each_entry(dev, &bus->devices, bus_list) {
> +		struct pci_bus *child = dev->subordinate;
> +
> +		if (child)
> +			pci_bus_rescan_prepare(child);
> +
> +		if (dev->driver &&
> +		    dev->driver->rescan_prepare) {
> +			dev->driver->rescan_prepare(dev);
> +			pcibios_rescan_prepare(dev);
> +		} else if (pci_dev_movable_bars_supported(dev)) {

The connection between these two conditions is pretty complicated to
figure out.  Can you explain why you need both?

> +			pcibios_rescan_prepare(dev);
> +		}
> +	}
> +}
> +
> +static void pci_bus_rescan_done(struct pci_bus *bus)
> +{
> +	struct pci_dev *dev;
> +
> +	list_for_each_entry(dev, &bus->devices, bus_list) {
> +		struct pci_bus *child = dev->subordinate;
> +
> +		if (dev->driver &&
> +		    dev->driver->rescan_done) {
> +			pcibios_rescan_done(dev);
> +			dev->driver->rescan_done(dev);
> +		} else if (pci_dev_movable_bars_supported(dev)) {
> +			pcibios_rescan_done(dev);
> +		}
> +
> +		if (child)
> +			pci_bus_rescan_done(child);
> +	}
> +
> +	if (bus->self)
> +		pci_config_pm_runtime_put(bus->self);
> +}
> +
>  /**
>   * pci_rescan_bus - Scan a PCI bus for devices
>   * @bus: PCI bus to scan
> @@ -3418,9 +3486,23 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
>  unsigned int pci_rescan_bus(struct pci_bus *bus)
>  {
>  	unsigned int max;
> +	struct pci_bus *root = bus;
> +
> +	while (!pci_is_root_bus(root))
> +		root = root->parent;
> +
> +	if (pci_movable_bars_enabled()) {
> +		pci_bus_rescan_prepare(root);
> +
> +		max = pci_scan_child_bus(root);
> +		pci_assign_unassigned_root_bus_resources(root);
> +
> +		pci_bus_rescan_done(root);
> +	} else {
> +		max = pci_scan_child_bus(bus);
> +		pci_assign_unassigned_bus_resources(bus);
> +	}
>  
> -	max = pci_scan_child_bus(bus);
> -	pci_assign_unassigned_bus_resources(bus);
>  	pci_bus_add_devices(bus);
>  
>  	return max;
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index d3a72159722d..e5b5eff05744 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -838,6 +838,8 @@ struct pci_driver {
>  	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
>  	void (*shutdown)(struct pci_dev *dev);
>  	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
> +	void (*rescan_prepare)(struct pci_dev *dev);
> +	void (*rescan_done)(struct pci_dev *dev);
>  	const struct pci_error_handlers *err_handler;
>  	const struct attribute_group **groups;
>  	struct device_driver	driver;
> @@ -924,6 +926,7 @@ enum {
>  	PCI_ENABLE_PROC_DOMAINS	= 0x00000010,	/* Enable domains in /proc */
>  	PCI_COMPAT_DOMAIN_0	= 0x00000020,	/* ... except domain 0 */
>  	PCI_SCAN_ALL_PCIE_DEVS	= 0x00000040,	/* Scan all, not just dev 0 */
> +	PCI_IMMOVABLE_BARS	= 0x00000080,	/* Disable runtime BAR reassign */

There are no uses of PCI_IMMOVABLE_BARS left at the end of the series,
so I'd rather not add it if we can avoid it.

>  };
>  
>  /* These external functions are only available when PCI support is enabled */
> @@ -1266,6 +1269,9 @@ unsigned int pci_rescan_bus(struct pci_bus *bus);
>  void pci_lock_rescan_remove(void);
>  void pci_unlock_rescan_remove(void);
>  
> +void pcibios_rescan_prepare(struct pci_dev *dev);
> +void pcibios_rescan_done(struct pci_dev *dev);
> +
>  /* Vital Product Data routines */
>  ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf);
>  ssize_t pci_write_vpd(struct pci_dev *dev, loff_t pos, size_t count, const void *buf);
> @@ -1402,6 +1408,7 @@ unsigned char pci_bus_max_busnr(struct pci_bus *bus);
>  void pci_setup_bridge(struct pci_bus *bus);
>  resource_size_t pcibios_window_alignment(struct pci_bus *bus,
>  					 unsigned long type);
> +bool pci_movable_bars_enabled(void);

I would really like it if this were simply

  extern bool pci_no_movable_bars;

in drivers/pci/pci.h.  It would default to false since it's
uninitialized, and "pci=no_movable_bars" would set it to true.

We have similar "=off" and "=force" parameters for ASPM and other
things, and it makes the code really hard to analyze.

>  #define PCI_VGA_STATE_CHANGE_BRIDGE (1 << 0)
>  #define PCI_VGA_STATE_CHANGE_DECODES (1 << 1)
> -- 
> 2.21.0
>
David Laight Sept. 30, 2019, 8:44 a.m. UTC | #2
From: Bjorn Helgaas
> Sent: 27 September 2019 23:02
> On Fri, Aug 16, 2019 at 07:50:41PM +0300, Sergey Miroshnichenko wrote:
> > When hot-adding a device, the bridge may have windows not big enough (or
> > fragmented too much) for newly requested BARs to fit in. And expanding
> > these bridge windows may be impossible because blocked by "neighboring"
> > BARs and bridge windows.
> >
> > Still, it may be possible to allocate a memory region for new BARs with the
> > following procedure:
> >
> > 1) notify all the drivers which support movable BARs to pause and release
> >    the BARs; the rest of the drivers are guaranteed that their devices will
> >    not get BARs moved;
> >
> > 2) release all the bridge windows except of root bridges;
> >
> > 3) try to recalculate new bridge windows that will fit all the BAR types:
> >    - fixed;
> >    - immovable;
> >    - movable;
> >    - newly requested by hot-added devices;
> >
> > 4) if the previous step fails, disable BARs for one of the hot-added
> >    devices and retry from step 3;
> >
> > 5) notify the drivers, so they remap BARs and resume.
> 
> You don't do the actual recalculation in *this* patch, but since you
> mention the procedure here, are we confident that we never make things
> worse?
> 
> It's possible that a hot-add will trigger this attempt to move things
> around, and it's possible that we won't find space for the new device
> even if we move things around.  But are we certain that every device
> that worked *before* the hot-add will still work *afterwards*?
> 
> Much of the assignment was probably done by the BIOS using different
> algorithms than Linux has, so I think there's some chance that the
> BIOS did a better job and if we lose that BIOS assignment, we might
> not be able to recreate it.

Yep, removing everything and starting again is probably OTT and most of the churn won't help.

I think you need to work out what can be moved in order to make the required resources available
to each bus and then make the required changes.

In the simplest case you are trying to add resource below a bridge so need to 'shuffle'
everything allocated after that bridge to later addresses (etc).

Many devices that support address reassignment might not need to be moved - so there is
no point remmapping them.

There is also the case when a device that is present but not currently is use could be taken
through a remove+insert sequence in order to change its resources.
Much easier to implement than 'remap while active'.
This would require a call into the driver (than can sleep) to request whether it is idle.
(and probably one at the end if the remove wasn't done).

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Sergei Miroshnichenko Sept. 30, 2019, 12:59 p.m. UTC | #3
Hello Bjorn,

On 9/28/19 1:02 AM, Bjorn Helgaas wrote:
> On Fri, Aug 16, 2019 at 07:50:41PM +0300, Sergey Miroshnichenko wrote:
>> When hot-adding a device, the bridge may have windows not big enough (or
>> fragmented too much) for newly requested BARs to fit in. And expanding
>> these bridge windows may be impossible because blocked by "neighboring"
>> BARs and bridge windows.
>>
>> Still, it may be possible to allocate a memory region for new BARs with the
>> following procedure:
>>
>> 1) notify all the drivers which support movable BARs to pause and release
>>     the BARs; the rest of the drivers are guaranteed that their devices will
>>     not get BARs moved;
>>
>> 2) release all the bridge windows except of root bridges;
>>
>> 3) try to recalculate new bridge windows that will fit all the BAR types:
>>     - fixed;
>>     - immovable;
>>     - movable;
>>     - newly requested by hot-added devices;
>>
>> 4) if the previous step fails, disable BARs for one of the hot-added
>>     devices and retry from step 3;
>>
>> 5) notify the drivers, so they remap BARs and resume.
> 
> You don't do the actual recalculation in *this* patch, but since you
> mention the procedure here, are we confident that we never make things
> worse?
> 
> It's possible that a hot-add will trigger this attempt to move things
> around, and it's possible that we won't find space for the new device
> even if we move things around.  But are we certain that every device
> that worked *before* the hot-add will still work *afterwards*?
> 
> Much of the assignment was probably done by the BIOS using different
> algorithms than Linux has, so I think there's some chance that the
> BIOS did a better job and if we lose that BIOS assignment, we might
> not be able to recreate it.
> 

If a hardware has some special constraints on BAR assignment that the
kernel is not aware of yet, the movable BARs may break things after a
hotplug event. So the feature must be disabled there (manually) until
the kernel get support for that special needs.

On x86 we had no choice - most of the machines we used just can't boot
with even an "empty" 16-port switch connected. So we hot-add it after
the boot, then trigger a rescan via 'echo 1 > /sys/bus/pci/rescan'.
And reserved bridge windows wasn't enough, and they can't expand
because are blocked by the next device.

>> This makes the prior reservation of memory by BIOS/bootloader/firmware not
>> required anymore for the PCI hotplug.
>>
>> Drivers indicate their support of movable BARs by implementing the new
>> .rescan_prepare() and .rescan_done() hooks in the struct pci_driver. All
>> device's activity must be paused during a rescan, and iounmap()+ioremap()
>> must be applied to every used BAR.
>>
>> The platform also may need to prepare to BAR movement, so new hooks added:
>> pcibios_rescan_prepare(pci_dev) and pcibios_rescan_prepare(pci_dev).
>>
>> This patch is a preparation for future patches with actual implementation,
>> and for now it just does the following:
>>   - declares the feature;
>>   - defines pci_movable_bars_enabled(), pci_dev_movable_bars_supported(dev);
>>   - invokes the .rescan_prepare() and .rescan_done() driver notifiers;
>>   - declares and invokes the pcibios_rescan_prepare()/_done() hooks;
>>   - adds the PCI_IMMOVABLE_BARS flag.
>>
>> The feature is disabled by default (via PCI_IMMOVABLE_BARS) until the final
>> patch of the series. It can be overridden per-arch using this flag or by
>> the following command line option:
>>
>>      pcie_movable_bars={ off | force }
>>
>> CC: Sam Bobroff <sbobroff@linux.ibm.com>
>> CC: Rajat Jain <rajatja@google.com>
>> CC: Lukas Wunner <lukas@wunner.de>
>> CC: Oliver O'Halloran <oohall@gmail.com>
>> CC: David Laight <David.Laight@ACULAB.COM>
>> Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
>> ---
>>   .../admin-guide/kernel-parameters.txt         |  7 ++
>>   drivers/pci/pci-driver.c                      |  2 +
>>   drivers/pci/pci.c                             | 24 ++++++
>>   drivers/pci/pci.h                             |  2 +
>>   drivers/pci/probe.c                           | 86 ++++++++++++++++++-
>>   include/linux/pci.h                           |  7 ++
>>   6 files changed, 126 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index 47d981a86e2f..e2274ee87a35 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -3526,6 +3526,13 @@
>>   		nomsi	Do not use MSI for native PCIe PME signaling (this makes
>>   			all PCIe root ports use INTx for all services).
>>   
>> +	pcie_movable_bars=[PCIE]
> 
> This isn't a PCIe-specific feature, it's just a function of whether
> drivers are smart enough, so we shouldn't tie it specifically to PCIe.
> We could eventually do this for conventional PCI as well.
> 
>> +			Override the movable BARs support detection:
>> +		off
>> +			Disable even if supported by the platform
>> +		force
>> +			Enable even if not explicitly declared as supported
> 
> What's the need for "force"?  If it's possible, I think we should
> enable this functionality all the time and just have a disable switch
> in case we trip over cases where it doesn't work, e.g., something
> like:
> 
>    pci=no_movable_bars
> 

Thanks, I'll simplify that, replace the pci_movable_bars_enabled()
with a bool variable, and remove the flag as you advice.

>>   	pcmv=		[HW,PCMCIA] BadgePAD 4
>>   
>>   	pd_ignore_unused
>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
>> index a8124e47bf6e..d11909e79263 100644
>> --- a/drivers/pci/pci-driver.c
>> +++ b/drivers/pci/pci-driver.c
>> @@ -1688,6 +1688,8 @@ static int __init pci_driver_init(void)
>>   {
>>   	int ret;
>>   
>> +	pci_add_flags(PCI_IMMOVABLE_BARS);
>> +
>>   	ret = bus_register(&pci_bus_type);
>>   	if (ret)
>>   		return ret;
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 61d951766087..3a504f58ac60 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -139,6 +139,30 @@ static int __init pcie_port_pm_setup(char *str)
>>   }
>>   __setup("pcie_port_pm=", pcie_port_pm_setup);
>>   
>> +static bool pcie_movable_bars_off;
>> +static bool pcie_movable_bars_force;
>> +static int __init pcie_movable_bars_setup(char *str)
>> +{
>> +	if (!strcmp(str, "off"))
>> +		pcie_movable_bars_off = true;
>> +	else if (!strcmp(str, "force"))
>> +		pcie_movable_bars_force = true;
>> +	return 1;
>> +}
>> +__setup("pcie_movable_bars=", pcie_movable_bars_setup);
>> +
>> +bool pci_movable_bars_enabled(void)
>> +{
>> +	if (pcie_movable_bars_off)
>> +		return false;
>> +
>> +	if (pcie_movable_bars_force)
>> +		return true;
>> +
>> +	return !pci_has_flag(PCI_IMMOVABLE_BARS);
>> +}
>> +EXPORT_SYMBOL(pci_movable_bars_enabled);
> 
> I don't think this needs to be exported, since the only references I
> see are from the PCI core.
> 

Thanks for catching that, this export slipped from older patches, when
the function was used in arch/*

>>   /* Time to wait after a reset for device to become responsive */
>>   #define PCIE_RESET_READY_POLL_MS 60000
>>   
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index d22d1b807701..be7acc477c64 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -257,6 +257,8 @@ bool pci_bus_clip_resource(struct pci_dev *dev, int idx);
>>   void pci_reassigndev_resource_alignment(struct pci_dev *dev);
>>   void pci_disable_bridge_window(struct pci_dev *dev);
>>   
>> +bool pci_dev_movable_bars_supported(struct pci_dev *dev);
>> +
>>   /* PCIe link information */
>>   #define PCIE_SPEED2STR(speed) \
>>   	((speed) == PCIE_SPEED_16_0GT ? "16 GT/s" : \
>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>> index 2e58ece820e8..60e3b48d2251 100644
>> --- a/drivers/pci/probe.c
>> +++ b/drivers/pci/probe.c
>> @@ -3406,6 +3406,74 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
>>   	return max;
>>   }
>>   
>> +bool pci_dev_movable_bars_supported(struct pci_dev *dev)
> 
> This name suggests that movable BARs is a property of the device, but
> it's mostly a property of the *driver*.  Most uses are in conjuction
> with checking IORESOURCE_PCI_FIXED for some resource, so I think this
> might read more naturally and simplify the callers slightly:
> 
>    bool pci_dev_movable(struct pci_dev *dev)
>    {
>      if (dev->driver && dev->driver->rescan_prepare)
>        return true;
> 
>      if ((dev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
>        return false;
> 
>      if (!dev->driver)
>        return true;
> 
>      return false;
>    }
> 
>    bool pci_dev_bar_movable(struct pci_dev *dev, struct resource *res)
>    {
>      if (res->flags & IORESOURCE_PCI_FIXED)
>        return false;
> 
>      return pci_dev_movable(dev);
>    }
> 

Nice, I'll do that, thanks.

Theoretically, there may be a number of identical devices in the
system, some of them are bound to an "immovable" driver, and the rest
is not, which makes them movable. Also modprobe+rmmod may affect
device's BARs mobility. But I agree, the "_supported" suffix was
confusing.

> I'm not sure why the PCI_CLASS_DISPLAY_VGA special case is there; can
> you add a comment about why that's needed?  Obviously we can't move
> the 0xa0000 legacy frame buffer because I think devices are allowed to
> claim that region even if no BAR describes it.  But I would think
> *other* BARs of VGA devices could be movable.
> 

Sure, I'll add a comment to the code.

The issue that we are avoiding by that is the "nomodeset" command line
argument, which prevents a video driver from being bound, so the BARs
are seems to be used, but can't be moved, otherwise machines just hang
after hotplug events. That was the only special ugly case we've
spotted during testing. I'll check if it will be enough just to work
around the 0xa0000.

>> +{
>> +	if (!dev)
>> +		return false;
>> +
>> +	if (dev->driver && dev->driver->rescan_prepare)
>> +		return true;
>> +
>> +	if ((dev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
>> +		return false;
>> +
>> +	return !dev->driver;
>> +}
>> +
>> +void __weak pcibios_rescan_prepare(struct pci_dev *dev)
>> +{
>> +}
>> +
>> +void __weak pcibios_rescan_done(struct pci_dev *dev)
>> +{
>> +}
> 
> Can you add the pcibios_rescan_prepare() and pcibios_rescan_done()
> stubs at the point where they're needed, i.e., where you add the
> powerpc implementations?  We can't see the need for them at this point
> in the series.
> 

Sure.

>> +static void pci_bus_rescan_prepare(struct pci_bus *bus)
>> +{
>> +	struct pci_dev *dev;
>> +
>> +	if (bus->self)
>> +		pci_config_pm_runtime_get(bus->self);
>> +
>> +	list_for_each_entry(dev, &bus->devices, bus_list) {
>> +		struct pci_bus *child = dev->subordinate;
>> +
>> +		if (child)
>> +			pci_bus_rescan_prepare(child);
>> +
>> +		if (dev->driver &&
>> +		    dev->driver->rescan_prepare) {
>> +			dev->driver->rescan_prepare(dev);
>> +			pcibios_rescan_prepare(dev);
>> +		} else if (pci_dev_movable_bars_supported(dev)) {
> 
> The connection between these two conditions is pretty complicated to
> figure out.  Can you explain why you need both?
> 

Yeah, this is what was meant, if would be written clearer:

	if (pci_dev_movable_bars_supported(dev)) {
		if (dev->driver &&
		    dev->driver->rescan_prepare)
			dev->driver->rescan_prepare(dev);

		pcibios_rescan_prepare(dev);
	}

But anyway the pcibios_rescan_prepare(dev) was for PowerNV, and after
consulting with Oliver it was decided to make that a single call for
the whole domain instead of per-device.

>> +			pcibios_rescan_prepare(dev);
>> +		}
>> +	}
>> +}
>> +
>> +static void pci_bus_rescan_done(struct pci_bus *bus)
>> +{
>> +	struct pci_dev *dev;
>> +
>> +	list_for_each_entry(dev, &bus->devices, bus_list) {
>> +		struct pci_bus *child = dev->subordinate;
>> +
>> +		if (dev->driver &&
>> +		    dev->driver->rescan_done) {
>> +			pcibios_rescan_done(dev);
>> +			dev->driver->rescan_done(dev);
>> +		} else if (pci_dev_movable_bars_supported(dev)) {
>> +			pcibios_rescan_done(dev);
>> +		}
>> +
>> +		if (child)
>> +			pci_bus_rescan_done(child);
>> +	}
>> +
>> +	if (bus->self)
>> +		pci_config_pm_runtime_put(bus->self);
>> +}
>> +
>>   /**
>>    * pci_rescan_bus - Scan a PCI bus for devices
>>    * @bus: PCI bus to scan
>> @@ -3418,9 +3486,23 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
>>   unsigned int pci_rescan_bus(struct pci_bus *bus)
>>   {
>>   	unsigned int max;
>> +	struct pci_bus *root = bus;
>> +
>> +	while (!pci_is_root_bus(root))
>> +		root = root->parent;
>> +
>> +	if (pci_movable_bars_enabled()) {
>> +		pci_bus_rescan_prepare(root);
>> +
>> +		max = pci_scan_child_bus(root);
>> +		pci_assign_unassigned_root_bus_resources(root);
>> +
>> +		pci_bus_rescan_done(root);
>> +	} else {
>> +		max = pci_scan_child_bus(bus);
>> +		pci_assign_unassigned_bus_resources(bus);
>> +	}
>>   
>> -	max = pci_scan_child_bus(bus);
>> -	pci_assign_unassigned_bus_resources(bus);
>>   	pci_bus_add_devices(bus);
>>   
>>   	return max;
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index d3a72159722d..e5b5eff05744 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -838,6 +838,8 @@ struct pci_driver {
>>   	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
>>   	void (*shutdown)(struct pci_dev *dev);
>>   	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
>> +	void (*rescan_prepare)(struct pci_dev *dev);
>> +	void (*rescan_done)(struct pci_dev *dev);
>>   	const struct pci_error_handlers *err_handler;
>>   	const struct attribute_group **groups;
>>   	struct device_driver	driver;
>> @@ -924,6 +926,7 @@ enum {
>>   	PCI_ENABLE_PROC_DOMAINS	= 0x00000010,	/* Enable domains in /proc */
>>   	PCI_COMPAT_DOMAIN_0	= 0x00000020,	/* ... except domain 0 */
>>   	PCI_SCAN_ALL_PCIE_DEVS	= 0x00000040,	/* Scan all, not just dev 0 */
>> +	PCI_IMMOVABLE_BARS	= 0x00000080,	/* Disable runtime BAR reassign */
> 
> There are no uses of PCI_IMMOVABLE_BARS left at the end of the series,
> so I'd rather not add it if we can avoid it.
> 
>>   };
>>   
>>   /* These external functions are only available when PCI support is enabled */
>> @@ -1266,6 +1269,9 @@ unsigned int pci_rescan_bus(struct pci_bus *bus);
>>   void pci_lock_rescan_remove(void);
>>   void pci_unlock_rescan_remove(void);
>>   
>> +void pcibios_rescan_prepare(struct pci_dev *dev);
>> +void pcibios_rescan_done(struct pci_dev *dev);
>> +
>>   /* Vital Product Data routines */
>>   ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf);
>>   ssize_t pci_write_vpd(struct pci_dev *dev, loff_t pos, size_t count, const void *buf);
>> @@ -1402,6 +1408,7 @@ unsigned char pci_bus_max_busnr(struct pci_bus *bus);
>>   void pci_setup_bridge(struct pci_bus *bus);
>>   resource_size_t pcibios_window_alignment(struct pci_bus *bus,
>>   					 unsigned long type);
>> +bool pci_movable_bars_enabled(void);
> 
> I would really like it if this were simply
> 
>    extern bool pci_no_movable_bars;
> 
> in drivers/pci/pci.h.  It would default to false since it's
> uninitialized, and "pci=no_movable_bars" would set it to true.
> 

I have a premonition of platforms that will not support the feature.
Wouldn't be better to put this variable-flag to include/linux/pci.h ,
so code in arch/* can set it, so they could work by default, without
the command line argument?

Serge

> We have similar "=off" and "=force" parameters for ASPM and other
> things, and it makes the code really hard to analyze.
> 
>>   #define PCI_VGA_STATE_CHANGE_BRIDGE (1 << 0)
>>   #define PCI_VGA_STATE_CHANGE_DECODES (1 << 1)
>> -- 
>> 2.21.0
>>
Sergei Miroshnichenko Sept. 30, 2019, 4:17 p.m. UTC | #4
Hello David,

On 9/30/19 11:44 AM, David Laight wrote:
> From: Bjorn Helgaas
>> Sent: 27 September 2019 23:02
>> On Fri, Aug 16, 2019 at 07:50:41PM +0300, Sergey Miroshnichenko wrote:
>>> When hot-adding a device, the bridge may have windows not big enough (or
>>> fragmented too much) for newly requested BARs to fit in. And expanding
>>> these bridge windows may be impossible because blocked by "neighboring"
>>> BARs and bridge windows.
>>>
>>> Still, it may be possible to allocate a memory region for new BARs with the
>>> following procedure:
>>>
>>> 1) notify all the drivers which support movable BARs to pause and release
>>>     the BARs; the rest of the drivers are guaranteed that their devices will
>>>     not get BARs moved;
>>>
>>> 2) release all the bridge windows except of root bridges;
>>>
>>> 3) try to recalculate new bridge windows that will fit all the BAR types:
>>>     - fixed;
>>>     - immovable;
>>>     - movable;
>>>     - newly requested by hot-added devices;
>>>
>>> 4) if the previous step fails, disable BARs for one of the hot-added
>>>     devices and retry from step 3;
>>>
>>> 5) notify the drivers, so they remap BARs and resume.
>>
>> You don't do the actual recalculation in *this* patch, but since you
>> mention the procedure here, are we confident that we never make things
>> worse?
>>
>> It's possible that a hot-add will trigger this attempt to move things
>> around, and it's possible that we won't find space for the new device
>> even if we move things around.  But are we certain that every device
>> that worked *before* the hot-add will still work *afterwards*?
>>
>> Much of the assignment was probably done by the BIOS using different
>> algorithms than Linux has, so I think there's some chance that the
>> BIOS did a better job and if we lose that BIOS assignment, we might
>> not be able to recreate it.
> 
> Yep, removing everything and starting again is probably OTT and most of the churn won't help.
> 
> I think you need to work out what can be moved in order to make the required resources available
> to each bus and then make the required changes.
> 
> In the simplest case you are trying to add resource below a bridge so need to 'shuffle'
> everything allocated after that bridge to later addresses (etc).
> 

Thank you for the review and suggestions!

But a bridge window may be fragmented: its total free space is enough
to fit everything, but no sufficient gaps for the new BARs. And this
bridge window may be jammed between two immovable/fixed BARs.

Or there may be lots of empty spaces in lower addresses after un-plugs,
but everything if fixed/immovable on higher addresses.

I've spent some time thinking on an optimization technique which can
be efficient enough (touch as few BARs as possible) with as high
success rate as calculating from scratch - and concluded that it is
not worth it: if only release the "obstructing" BARs and bridge
windows, a hotplug event will affect a half of (n+m) on average, which
is still O(n+m), where n is a number of endpoints, and m is a
number of bridges. But it's still need to resize windows of a root and
other common bridges.

Calculating bridge windows from scratch is relatively straightforward
and fast, so I have just added support for fixed/immovable BARs there
and reused.

> Many devices that support address reassignment might not need to be moved - so there is
> no point remmapping them.
> 

And it's the same algorithm that allocated BARs in first place, so it
will reassign the same BARs for the non-affected part of the topology.

> There is also the case when a device that is present but not currently is use could be taken
> through a remove+insert sequence in order to change its resources.
> Much easier to implement than 'remap while active'.
> This would require a call into the driver (than can sleep) to request whether it is idle.
> (and probably one at the end if the remove wasn't done).
> 

Unbind+rebind the "immovable" drivers of non-opened devices may
increase the probability of successful BAR allocation, but I'm afraid
this will produce some amount of false hotplug-like events in the logs.
Probably also some undesired effects like spikes in power consumption
because of driver initialization.

Best regards,
Serge

> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>
Bjorn Helgaas Oct. 15, 2019, 10:14 p.m. UTC | #5
On Mon, Sep 30, 2019 at 03:59:25PM +0300, Sergey Miroshnichenko wrote:
> Hello Bjorn,
> 
> On 9/28/19 1:02 AM, Bjorn Helgaas wrote:
> > On Fri, Aug 16, 2019 at 07:50:41PM +0300, Sergey Miroshnichenko wrote:
> > > When hot-adding a device, the bridge may have windows not big enough (or
> > > fragmented too much) for newly requested BARs to fit in. And expanding
> > > these bridge windows may be impossible because blocked by "neighboring"
> > > BARs and bridge windows.
> > > 
> > > Still, it may be possible to allocate a memory region for new BARs with the
> > > following procedure:
> > > 
> > > 1) notify all the drivers which support movable BARs to pause and release
> > >     the BARs; the rest of the drivers are guaranteed that their devices will
> > >     not get BARs moved;
> > > 
> > > 2) release all the bridge windows except of root bridges;
> > > 
> > > 3) try to recalculate new bridge windows that will fit all the BAR types:
> > >     - fixed;
> > >     - immovable;
> > >     - movable;
> > >     - newly requested by hot-added devices;
> > > 
> > > 4) if the previous step fails, disable BARs for one of the hot-added
> > >     devices and retry from step 3;
> > > 
> > > 5) notify the drivers, so they remap BARs and resume.
> > 
> > You don't do the actual recalculation in *this* patch, but since you
> > mention the procedure here, are we confident that we never make things
> > worse?
> > 
> > It's possible that a hot-add will trigger this attempt to move things
> > around, and it's possible that we won't find space for the new device
> > even if we move things around.  But are we certain that every device
> > that worked *before* the hot-add will still work *afterwards*?
> > 
> > Much of the assignment was probably done by the BIOS using different
> > algorithms than Linux has, so I think there's some chance that the
> > BIOS did a better job and if we lose that BIOS assignment, we might
> > not be able to recreate it.
> 
> If a hardware has some special constraints on BAR assignment that the
> kernel is not aware of yet, the movable BARs may break things after a
> hotplug event. So the feature must be disabled there (manually) until
> the kernel get support for that special needs.

I'm not talking about special constraints on BAR assignment.  (I'm not
sure what those constraints would be -- AFAIK the constraints for a
spec-compliant device are all discoverable via the BAR size and type
(or the Enhanced Allocation capability)).

What I'm concerned about is the case where we boot with a working
assignment, we hot-add a device, we move things around to try to
accommodate the new device, and not only do we fail to find resources
for the new device, we also fail to find a working assignment for the
devices that were present at boot.  We've moved things around from
what BIOS did, and since we use a different algorithm than the BIOS,
there's no guarantee that we'll be able to find the assignment BIOS
did.

> > I'm not sure why the PCI_CLASS_DISPLAY_VGA special case is there; can
> > you add a comment about why that's needed?  Obviously we can't move
> > the 0xa0000 legacy frame buffer because I think devices are allowed to
> > claim that region even if no BAR describes it.  But I would think
> > *other* BARs of VGA devices could be movable.
> 
> Sure, I'll add a comment to the code.
> 
> The issue that we are avoiding by that is the "nomodeset" command line
> argument, which prevents a video driver from being bound, so the BARs
> are seems to be used, but can't be moved, otherwise machines just hang
> after hotplug events. That was the only special ugly case we've
> spotted during testing. I'll check if it will be enough just to work
> around the 0xa0000.

"nomodeset" is not really documented and is a funny way to say "don't
bind video drivers that know about it", but OK.  Thanks for checking
on the other BARs.

> > > +bool pci_movable_bars_enabled(void);
> > 
> > I would really like it if this were simply
> > 
> >    extern bool pci_no_movable_bars;
> > 
> > in drivers/pci/pci.h.  It would default to false since it's
> > uninitialized, and "pci=no_movable_bars" would set it to true.
> 
> I have a premonition of platforms that will not support the feature.
> Wouldn't be better to put this variable-flag to include/linux/pci.h ,
> so code in arch/* can set it, so they could work by default, without
> the command line argument?

In general I don't see why a platform wouldn't support this since
there really isn't anything platform-specific here.  But if a platform
does need to disable it, having arch code set this flag sounds
reasonable.  We shouldn't make it globally visible until we actually
need that, though.

> > We have similar "=off" and "=force" parameters for ASPM and other
> > things, and it makes the code really hard to analyze.

The "=off" and "=force" things are the biggest things I'd like to
avoid.

Bjorn
Sergei Miroshnichenko Oct. 16, 2019, 3:50 p.m. UTC | #6
On 10/16/19 1:14 AM, Bjorn Helgaas wrote:
> On Mon, Sep 30, 2019 at 03:59:25PM +0300, Sergey Miroshnichenko wrote:
>> Hello Bjorn,
>>
>> On 9/28/19 1:02 AM, Bjorn Helgaas wrote:
>>> On Fri, Aug 16, 2019 at 07:50:41PM +0300, Sergey Miroshnichenko wrote:
>>>> When hot-adding a device, the bridge may have windows not big enough (or
>>>> fragmented too much) for newly requested BARs to fit in. And expanding
>>>> these bridge windows may be impossible because blocked by "neighboring"
>>>> BARs and bridge windows.
>>>>
>>>> Still, it may be possible to allocate a memory region for new BARs with the
>>>> following procedure:
>>>>
>>>> 1) notify all the drivers which support movable BARs to pause and release
>>>>      the BARs; the rest of the drivers are guaranteed that their devices will
>>>>      not get BARs moved;
>>>>
>>>> 2) release all the bridge windows except of root bridges;
>>>>
>>>> 3) try to recalculate new bridge windows that will fit all the BAR types:
>>>>      - fixed;
>>>>      - immovable;
>>>>      - movable;
>>>>      - newly requested by hot-added devices;
>>>>
>>>> 4) if the previous step fails, disable BARs for one of the hot-added
>>>>      devices and retry from step 3;
>>>>
>>>> 5) notify the drivers, so they remap BARs and resume.
>>>
>>> You don't do the actual recalculation in *this* patch, but since you
>>> mention the procedure here, are we confident that we never make things
>>> worse?
>>>
>>> It's possible that a hot-add will trigger this attempt to move things
>>> around, and it's possible that we won't find space for the new device
>>> even if we move things around.  But are we certain that every device
>>> that worked *before* the hot-add will still work *afterwards*?
>>>
>>> Much of the assignment was probably done by the BIOS using different
>>> algorithms than Linux has, so I think there's some chance that the
>>> BIOS did a better job and if we lose that BIOS assignment, we might
>>> not be able to recreate it.
>>
>> If a hardware has some special constraints on BAR assignment that the
>> kernel is not aware of yet, the movable BARs may break things after a
>> hotplug event. So the feature must be disabled there (manually) until
>> the kernel get support for that special needs.
> 
> I'm not talking about special constraints on BAR assignment.  (I'm not
> sure what those constraints would be -- AFAIK the constraints for a
> spec-compliant device are all discoverable via the BAR size and type
> (or the Enhanced Allocation capability)).
> 
> What I'm concerned about is the case where we boot with a working
> assignment, we hot-add a device, we move things around to try to
> accommodate the new device, and not only do we fail to find resources
> for the new device, we also fail to find a working assignment for the
> devices that were present at boot.  We've moved things around from
> what BIOS did, and since we use a different algorithm than the BIOS,
> there's no guarantee that we'll be able to find the assignment BIOS
> did.
> 

If BAR assignment fails with a hot-added device, these patches will
disable BARs for this device and retry, falling back to the situation
where number of BARs and their size are the same as they were before
the hotplug event.

If all the BARs are immovable - they will just remain on their
positions. Nothing to break here I guess.

If almost all the BARs are immovable and there is one movable BAR,
after releasing the bridge windows there will be a free gap - right
where this movable BAR was. These patches are keeping the size of
released BARs, not requesting the size from the devices again - so the
device can't ask for a larger BAR. The space reserving is disabled by
this patchset, so the kernel will request the same size for the bridge
window containing this movable BAR. So there always will be a gap for
this BAR - in the same location it was before.

Based on these considerations I assume that the kernel is always able
to arrange BARs from scratch if a BIOS was able to make it before.

But! There is an implicit speculation that there will be the same
amount of BARs after the fallback (which is equivalent to a PCI rescan
triggered on unchanged topology). And two week ago I've found that
this is not always true!

I was testing on a "new" x86_64 PC, where BIOS doesn't reserve a space
for SR-IOV BARs (of a network adapter). On the boot, the kernel wasn't
arranging BARs itself - it took values written by the BIOS. And the
bridge window was "jammed" between immovable BARs, so it can't expand.
BARs of this device are also immovable, so the bridge window can't be
moved away. During the PCI rescan, the kernel tried to allocate both
"regular" and SR-IOV BARs - and failed. Even without changes in the
PCI topology.

So in the next version of this series there will be one more patch,
that allows the kernel to ignore BIOS's setting for the "safe" (non-IO
and non-VGA) BARs, so these BARs will be arranged kernel-way - and
also those forgotten by the BIOS.

>>> I'm not sure why the PCI_CLASS_DISPLAY_VGA special case is there; can
>>> you add a comment about why that's needed?  Obviously we can't move
>>> the 0xa0000 legacy frame buffer because I think devices are allowed to
>>> claim that region even if no BAR describes it.  But I would think
>>> *other* BARs of VGA devices could be movable.
>>
>> Sure, I'll add a comment to the code.
>>
>> The issue that we are avoiding by that is the "nomodeset" command line
>> argument, which prevents a video driver from being bound, so the BARs
>> are seems to be used, but can't be moved, otherwise machines just hang
>> after hotplug events. That was the only special ugly case we've
>> spotted during testing. I'll check if it will be enough just to work
>> around the 0xa0000.
> 
> "nomodeset" is not really documented and is a funny way to say "don't
> bind video drivers that know about it", but OK.  Thanks for checking
> on the other BARs.
> 

After modifying the code as you advised, it became possible to mark
only some BARs of the device as immovable. So the code is less ugly
now, and it also works for drivers/video/fbdev/efifb.c , which uses
the BAR in a weird way (dev->driver is NULL, but not the res->child):

   static bool pci_dev_movable(struct pci_dev *dev,
                               bool res_has_children)
   {
     if (!pci_can_move_bars)
       return false;

     if (dev->driver && dev->driver->rescan_prepare)
       return true;

     if (!dev->driver && !res_has_children)
       return true;

     return false;
   }

   bool pci_dev_bar_movable(struct pci_dev *dev, struct resource *res)
   {
     if (res->flags & IORESOURCE_PCI_FIXED)
       return false;

     #ifdef CONFIG_X86
     /* Workaround for the legacy VGA memory 0xa0000-0xbffff */
     if (res->start == 0xa0000)
       return false;
     #endif

     return pci_dev_movable(dev, res->child);
   }

>>>> +bool pci_movable_bars_enabled(void);
>>>
>>> I would really like it if this were simply
>>>
>>>     extern bool pci_no_movable_bars;
>>>
>>> in drivers/pci/pci.h.  It would default to false since it's
>>> uninitialized, and "pci=no_movable_bars" would set it to true.
>>
>> I have a premonition of platforms that will not support the feature.
>> Wouldn't be better to put this variable-flag to include/linux/pci.h ,
>> so code in arch/* can set it, so they could work by default, without
>> the command line argument?
> 
> In general I don't see why a platform wouldn't support this since
> there really isn't anything platform-specific here.  But if a platform
> does need to disable it, having arch code set this flag sounds
> reasonable.  We shouldn't make it globally visible until we actually
> need that, though.
> 

On powerpc the Extended Error Handling hardware facility doesn't allow
to shuffle the BARs (without notifying the platform code), otherwise
it reports errors.

I'm working on adding support for powerpc/powernv, but powerpc/pseries
also has EEH, and I don't have a hardware to test there.

So the arch/powerpc/platforms/pseries/setup.c will be modified as
follows in the next version of this patchset:

@@ -920,6 +920,8 @@ static void __init pseries_init(void)
  {
  	pr_debug(" -> pseries_init()\n");

+	pci_can_move_bars = false;
+


Best regards,
Serge
Bjorn Helgaas Oct. 16, 2019, 5:29 p.m. UTC | #7
On Wed, Oct 16, 2019 at 06:50:30PM +0300, Sergey Miroshnichenko wrote:
> On 10/16/19 1:14 AM, Bjorn Helgaas wrote:
> > On Mon, Sep 30, 2019 at 03:59:25PM +0300, Sergey Miroshnichenko wrote:
> > > On 9/28/19 1:02 AM, Bjorn Helgaas wrote:

> > > > It's possible that a hot-add will trigger this attempt to move things
> > > > around, and it's possible that we won't find space for the new device
> > > > even if we move things around.  But are we certain that every device
> > > > that worked *before* the hot-add will still work *afterwards*?
> > > > 
> > > > Much of the assignment was probably done by the BIOS using different
> > > > algorithms than Linux has, so I think there's some chance that the
> > > > BIOS did a better job and if we lose that BIOS assignment, we might
> > > > not be able to recreate it.
> > > 
> > > If a hardware has some special constraints on BAR assignment that the
> > > kernel is not aware of yet, the movable BARs may break things after a
> > > hotplug event. So the feature must be disabled there (manually) until
> > > the kernel get support for that special needs.
> > 
> > I'm not talking about special constraints on BAR assignment.  (I'm not
> > sure what those constraints would be -- AFAIK the constraints for a
> > spec-compliant device are all discoverable via the BAR size and type
> > (or the Enhanced Allocation capability)).
> > 
> > What I'm concerned about is the case where we boot with a working
> > assignment, we hot-add a device, we move things around to try to
> > accommodate the new device, and not only do we fail to find resources
> > for the new device, we also fail to find a working assignment for the
> > devices that were present at boot.  We've moved things around from
> > what BIOS did, and since we use a different algorithm than the BIOS,
> > there's no guarantee that we'll be able to find the assignment BIOS
> > did.
> 
> If BAR assignment fails with a hot-added device, these patches will
> disable BARs for this device and retry, falling back to the situation
> where number of BARs and their size are the same as they were before
> the hotplug event.
> 
> If all the BARs are immovable - they will just remain on their
> positions. Nothing to break here I guess.
> 
> If almost all the BARs are immovable and there is one movable BAR,
> after releasing the bridge windows there will be a free gap - right
> where this movable BAR was. These patches are keeping the size of
> released BARs, not requesting the size from the devices again - so the
> device can't ask for a larger BAR. The space reserving is disabled by
> this patchset, so the kernel will request the same size for the bridge
> window containing this movable BAR. So there always will be a gap for
> this BAR - in the same location it was before.
> 
> Based on these considerations I assume that the kernel is always able
> to arrange BARs from scratch if a BIOS was able to make it before.
> 
> But! There is an implicit speculation that there will be the same
> amount of BARs after the fallback (which is equivalent to a PCI rescan
> triggered on unchanged topology). And two week ago I've found that
> this is not always true!
> 
> I was testing on a "new" x86_64 PC, where BIOS doesn't reserve a space
> for SR-IOV BARs (of a network adapter). On the boot, the kernel wasn't
> arranging BARs itself - it took values written by the BIOS. And the
> bridge window was "jammed" between immovable BARs, so it can't expand.
> BARs of this device are also immovable, so the bridge window can't be
> moved away. During the PCI rescan, the kernel tried to allocate both
> "regular" and SR-IOV BARs - and failed. Even without changes in the
> PCI topology.
> 
> So in the next version of this series there will be one more patch,
> that allows the kernel to ignore BIOS's setting for the "safe" (non-IO
> and non-VGA) BARs, so these BARs will be arranged kernel-way - and
> also those forgotten by the BIOS.

This still seems a little scary, so I'll probably ask about it again :)

> After modifying the code as you advised, it became possible to mark
> only some BARs of the device as immovable. So the code is less ugly
> now, and it also works for drivers/video/fbdev/efifb.c , which uses
> the BAR in a weird way (dev->driver is NULL, but not the res->child):
> 
>   static bool pci_dev_movable(struct pci_dev *dev,
>                               bool res_has_children)
>   {
>     if (!pci_can_move_bars)
>       return false;
> 
>     if (dev->driver && dev->driver->rescan_prepare)
>       return true;
> 
>     if (!dev->driver && !res_has_children)
>       return true;
> 
>     return false;
>   }
> 
>   bool pci_dev_bar_movable(struct pci_dev *dev, struct resource *res)
>   {
>     if (res->flags & IORESOURCE_PCI_FIXED)
>       return false;
> 
>     #ifdef CONFIG_X86
>     /* Workaround for the legacy VGA memory 0xa0000-0xbffff */
>     if (res->start == 0xa0000)

Nit here; "res->start" is the CPU address, but what you need to check
is the PCI bus address, e.g., something from pcibios_resource_to_bus().
And this is not x86-specific.  0xa0000 is magic on PCI no matter what
the processor architecture.

>       return false;
>     #endif
> 
>     return pci_dev_movable(dev, res->child);
>   }
diff mbox series

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 47d981a86e2f..e2274ee87a35 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3526,6 +3526,13 @@ 
 		nomsi	Do not use MSI for native PCIe PME signaling (this makes
 			all PCIe root ports use INTx for all services).
 
+	pcie_movable_bars=[PCIE]
+			Override the movable BARs support detection:
+		off
+			Disable even if supported by the platform
+		force
+			Enable even if not explicitly declared as supported
+
 	pcmv=		[HW,PCMCIA] BadgePAD 4
 
 	pd_ignore_unused
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index a8124e47bf6e..d11909e79263 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -1688,6 +1688,8 @@  static int __init pci_driver_init(void)
 {
 	int ret;
 
+	pci_add_flags(PCI_IMMOVABLE_BARS);
+
 	ret = bus_register(&pci_bus_type);
 	if (ret)
 		return ret;
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 61d951766087..3a504f58ac60 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -139,6 +139,30 @@  static int __init pcie_port_pm_setup(char *str)
 }
 __setup("pcie_port_pm=", pcie_port_pm_setup);
 
+static bool pcie_movable_bars_off;
+static bool pcie_movable_bars_force;
+static int __init pcie_movable_bars_setup(char *str)
+{
+	if (!strcmp(str, "off"))
+		pcie_movable_bars_off = true;
+	else if (!strcmp(str, "force"))
+		pcie_movable_bars_force = true;
+	return 1;
+}
+__setup("pcie_movable_bars=", pcie_movable_bars_setup);
+
+bool pci_movable_bars_enabled(void)
+{
+	if (pcie_movable_bars_off)
+		return false;
+
+	if (pcie_movable_bars_force)
+		return true;
+
+	return !pci_has_flag(PCI_IMMOVABLE_BARS);
+}
+EXPORT_SYMBOL(pci_movable_bars_enabled);
+
 /* Time to wait after a reset for device to become responsive */
 #define PCIE_RESET_READY_POLL_MS 60000
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index d22d1b807701..be7acc477c64 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -257,6 +257,8 @@  bool pci_bus_clip_resource(struct pci_dev *dev, int idx);
 void pci_reassigndev_resource_alignment(struct pci_dev *dev);
 void pci_disable_bridge_window(struct pci_dev *dev);
 
+bool pci_dev_movable_bars_supported(struct pci_dev *dev);
+
 /* PCIe link information */
 #define PCIE_SPEED2STR(speed) \
 	((speed) == PCIE_SPEED_16_0GT ? "16 GT/s" : \
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 2e58ece820e8..60e3b48d2251 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -3406,6 +3406,74 @@  unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
 	return max;
 }
 
+bool pci_dev_movable_bars_supported(struct pci_dev *dev)
+{
+	if (!dev)
+		return false;
+
+	if (dev->driver && dev->driver->rescan_prepare)
+		return true;
+
+	if ((dev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
+		return false;
+
+	return !dev->driver;
+}
+
+void __weak pcibios_rescan_prepare(struct pci_dev *dev)
+{
+}
+
+void __weak pcibios_rescan_done(struct pci_dev *dev)
+{
+}
+
+static void pci_bus_rescan_prepare(struct pci_bus *bus)
+{
+	struct pci_dev *dev;
+
+	if (bus->self)
+		pci_config_pm_runtime_get(bus->self);
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		struct pci_bus *child = dev->subordinate;
+
+		if (child)
+			pci_bus_rescan_prepare(child);
+
+		if (dev->driver &&
+		    dev->driver->rescan_prepare) {
+			dev->driver->rescan_prepare(dev);
+			pcibios_rescan_prepare(dev);
+		} else if (pci_dev_movable_bars_supported(dev)) {
+			pcibios_rescan_prepare(dev);
+		}
+	}
+}
+
+static void pci_bus_rescan_done(struct pci_bus *bus)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		struct pci_bus *child = dev->subordinate;
+
+		if (dev->driver &&
+		    dev->driver->rescan_done) {
+			pcibios_rescan_done(dev);
+			dev->driver->rescan_done(dev);
+		} else if (pci_dev_movable_bars_supported(dev)) {
+			pcibios_rescan_done(dev);
+		}
+
+		if (child)
+			pci_bus_rescan_done(child);
+	}
+
+	if (bus->self)
+		pci_config_pm_runtime_put(bus->self);
+}
+
 /**
  * pci_rescan_bus - Scan a PCI bus for devices
  * @bus: PCI bus to scan
@@ -3418,9 +3486,23 @@  unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge)
 unsigned int pci_rescan_bus(struct pci_bus *bus)
 {
 	unsigned int max;
+	struct pci_bus *root = bus;
+
+	while (!pci_is_root_bus(root))
+		root = root->parent;
+
+	if (pci_movable_bars_enabled()) {
+		pci_bus_rescan_prepare(root);
+
+		max = pci_scan_child_bus(root);
+		pci_assign_unassigned_root_bus_resources(root);
+
+		pci_bus_rescan_done(root);
+	} else {
+		max = pci_scan_child_bus(bus);
+		pci_assign_unassigned_bus_resources(bus);
+	}
 
-	max = pci_scan_child_bus(bus);
-	pci_assign_unassigned_bus_resources(bus);
 	pci_bus_add_devices(bus);
 
 	return max;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index d3a72159722d..e5b5eff05744 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -838,6 +838,8 @@  struct pci_driver {
 	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
 	void (*shutdown)(struct pci_dev *dev);
 	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
+	void (*rescan_prepare)(struct pci_dev *dev);
+	void (*rescan_done)(struct pci_dev *dev);
 	const struct pci_error_handlers *err_handler;
 	const struct attribute_group **groups;
 	struct device_driver	driver;
@@ -924,6 +926,7 @@  enum {
 	PCI_ENABLE_PROC_DOMAINS	= 0x00000010,	/* Enable domains in /proc */
 	PCI_COMPAT_DOMAIN_0	= 0x00000020,	/* ... except domain 0 */
 	PCI_SCAN_ALL_PCIE_DEVS	= 0x00000040,	/* Scan all, not just dev 0 */
+	PCI_IMMOVABLE_BARS	= 0x00000080,	/* Disable runtime BAR reassign */
 };
 
 /* These external functions are only available when PCI support is enabled */
@@ -1266,6 +1269,9 @@  unsigned int pci_rescan_bus(struct pci_bus *bus);
 void pci_lock_rescan_remove(void);
 void pci_unlock_rescan_remove(void);
 
+void pcibios_rescan_prepare(struct pci_dev *dev);
+void pcibios_rescan_done(struct pci_dev *dev);
+
 /* Vital Product Data routines */
 ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf);
 ssize_t pci_write_vpd(struct pci_dev *dev, loff_t pos, size_t count, const void *buf);
@@ -1402,6 +1408,7 @@  unsigned char pci_bus_max_busnr(struct pci_bus *bus);
 void pci_setup_bridge(struct pci_bus *bus);
 resource_size_t pcibios_window_alignment(struct pci_bus *bus,
 					 unsigned long type);
+bool pci_movable_bars_enabled(void);
 
 #define PCI_VGA_STATE_CHANGE_BRIDGE (1 << 0)
 #define PCI_VGA_STATE_CHANGE_DECODES (1 << 1)