diff mbox series

[PATCH/RESEND] arm64: acpi/pci: invoke _DSM whether to preserve firmware PCI setup

Message ID d53fc77e1e754ddbd9af555ed5b344c5fa523154.camel@kernel.crashing.org
State Superseded
Delegated to: Bjorn Helgaas
Headers show
Series [PATCH/RESEND] arm64: acpi/pci: invoke _DSM whether to preserve firmware PCI setup | expand

Commit Message

Benjamin Herrenschmidt June 6, 2019, 9 a.m. UTC
From: Ard Biesheuvel <ard.biesheuvel@linaro.org>

On arm64 ACPI systems, we unconditionally reconfigure the entire PCI
hierarchy at boot. This is a departure from what is customary on ACPI
systems, and may break assumptions in some places (e.g., EFIFB), that
the kernel will leave BARs of enabled PCI devices where they are.

Given that PCI already specifies a device specific ACPI method (_DSM)
for PCI root bridge nodes that tells us whether the firmware thinks
the configuration should be left alone, let's sidestep the entire
policy debate about whether the PCI configuration should be preserved
or not, and put it under the control of the firmware instead.

[BenH: Added pci_assign_unassigned_root_bus_resources()]

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---

So I would like this variant rather than mucking around with
IORESOURCE_PCI_FIXED at this stage to fix the problem with our platforms.

See my other email, IORESOURCE_PCI_FIXED doesn't really work terribly well
when using pci_bus_size_bridges and pci_bus_assign_resources, and the
resulting patches are ugly and add more mess.

Long run, I propose to start working on consolidating all those various
resource survey mechanisms around what x86 does, unless people strongly
object... (with the addition of the probe only and force reassign quirks
so platforms can still chose that).

Note: I haven't tested the effect of pci_assign_unassigned_root_bus_resources
as our platforms don't leave anything unassigned. I'm not entirely sure how
well pci_bus_claim_resources() will deal with a partially assigned setup...

We do want to support partial assignment by BIOS though, it's a trend to
reduce boot time, people seem to want BIOSes to only assign what's critical
for booting.

Bjorn: I haven't made the claim path the default in absence of _DSM #5 yet.
I suggest we do that as a separate patch in case it breaks somebody, thus
making bisection more meaningful. It will also make this one more palatable
to distros since it won't change the behaviour on systems without _DSM #5,
and we verified nobody has it except Seattle which returns 1. 

 arch/arm64/kernel/pci.c  | 23 +++++++++++++++++++++--
 include/linux/pci-acpi.h |  7 ++++---
 2 files changed, 25 insertions(+), 5 deletions(-)

Comments

Ard Biesheuvel June 6, 2019, 9:13 a.m. UTC | #1
On Thu, 6 Jun 2019 at 11:00, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>
> On arm64 ACPI systems, we unconditionally reconfigure the entire PCI
> hierarchy at boot. This is a departure from what is customary on ACPI
> systems, and may break assumptions in some places (e.g., EFIFB), that
> the kernel will leave BARs of enabled PCI devices where they are.
>
> Given that PCI already specifies a device specific ACPI method (_DSM)
> for PCI root bridge nodes that tells us whether the firmware thinks
> the configuration should be left alone, let's sidestep the entire
> policy debate about whether the PCI configuration should be preserved
> or not, and put it under the control of the firmware instead.
>
> [BenH: Added pci_assign_unassigned_root_bus_resources()]
>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>
> So I would like this variant rather than mucking around with
> IORESOURCE_PCI_FIXED at this stage to fix the problem with our platforms.
>
> See my other email, IORESOURCE_PCI_FIXED doesn't really work terribly well
> when using pci_bus_size_bridges and pci_bus_assign_resources, and the
> resulting patches are ugly and add more mess.
>
> Long run, I propose to start working on consolidating all those various
> resource survey mechanisms around what x86 does, unless people strongly
> object... (with the addition of the probe only and force reassign quirks
> so platforms can still chose that).
>
> Note: I haven't tested the effect of pci_assign_unassigned_root_bus_resources
> as our platforms don't leave anything unassigned. I'm not entirely sure how
> well pci_bus_claim_resources() will deal with a partially assigned setup...
>
> We do want to support partial assignment by BIOS though, it's a trend to
> reduce boot time, people seem to want BIOSes to only assign what's critical
> for booting.
>
> Bjorn: I haven't made the claim path the default in absence of _DSM #5 yet.
> I suggest we do that as a separate patch in case it breaks somebody, thus
> making bisection more meaningful. It will also make this one more palatable
> to distros since it won't change the behaviour on systems without _DSM #5,
> and we verified nobody has it except Seattle which returns 1.
>

FYI Seattle is broken in any case since it returns Package(1) rather
than just an int.

The problem with this patch is that currently, the PCI fw spec permits
_DSM #5 everywhere *except* on the host bridge device object itself,
and this is in the process of being changed.

I will leave it up to the maintainers to decide whether we can take
this patch in anticipation of that, even though it doesn't deal with
_DSM #5 on nodes anywhere else in the PCIe tree.

>  arch/arm64/kernel/pci.c  | 23 +++++++++++++++++++++--
>  include/linux/pci-acpi.h |  7 ++++---
>  2 files changed, 25 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
> index bb85e2f4603f..6358e1cb4f9f 100644
> --- a/arch/arm64/kernel/pci.c
> +++ b/arch/arm64/kernel/pci.c
> @@ -168,6 +168,7 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
>         struct acpi_pci_generic_root_info *ri;
>         struct pci_bus *bus, *child;
>         struct acpi_pci_root_ops *root_ops;
> +       union acpi_object *obj;
>
>         ri = kzalloc(sizeof(*ri), GFP_KERNEL);
>         if (!ri)
> @@ -193,8 +194,26 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
>         if (!bus)
>                 return NULL;
>
> -       pci_bus_size_bridges(bus);
> -       pci_bus_assign_resources(bus);
> +       /*
> +        * Invoke the PCI device specific method (_DSM) #5 'Ignore PCI Boot
> +        * Configuration', which tells us whether the firmware wants us to
> +        * preserve the configuration of the PCI resource tree for this root
> +        * bridge.
> +        */
> +       obj = acpi_evaluate_dsm(ACPI_HANDLE(bus->bridge), &pci_acpi_dsm_guid, 1,
> +                               IGNORE_PCI_BOOT_CONFIG_DSM, NULL);
> +       if (obj && obj->type == ACPI_TYPE_INTEGER && obj->integer.value == 0) {
> +               /* preserve existing resource assignment */
> +               pci_bus_claim_resources(bus);
> +
> +               /* Assign anything that might have been left out */
> +               pci_assign_unassigned_root_bus_resources(bus);
> +       } else {
> +               /* reconfigure the resource tree from scratch */
> +               pci_bus_size_bridges(bus);
> +               pci_bus_assign_resources(bus);
> +       }
> +       ACPI_FREE(obj);
>
>         list_for_each_entry(child, &bus->children, node)
>                 pcie_bus_configure_settings(child);
> diff --git a/include/linux/pci-acpi.h b/include/linux/pci-acpi.h
> index 8082b612f561..62b7fdcc661c 100644
> --- a/include/linux/pci-acpi.h
> +++ b/include/linux/pci-acpi.h
> @@ -107,9 +107,10 @@ static inline void acpiphp_check_host_bridge(struct acpi_device *adev) { }
>  #endif
>
>  extern const guid_t pci_acpi_dsm_guid;
> -#define DEVICE_LABEL_DSM       0x07
> -#define RESET_DELAY_DSM                0x08
> -#define FUNCTION_DELAY_DSM     0x09
> +#define IGNORE_PCI_BOOT_CONFIG_DSM     0x05
> +#define DEVICE_LABEL_DSM               0x07
> +#define RESET_DELAY_DSM                        0x08
> +#define FUNCTION_DELAY_DSM             0x09
>
>  #else  /* CONFIG_ACPI */
>  static inline void acpi_pci_add_bus(struct pci_bus *bus) { }
>
>
Benjamin Herrenschmidt June 6, 2019, 10:55 a.m. UTC | #2
On Thu, 2019-06-06 at 11:13 +0200, Ard Biesheuvel wrote:
> > Bjorn: I haven't made the claim path the default in absence of _DSM #5 yet.
> > I suggest we do that as a separate patch in case it breaks somebody, thus
> > making bisection more meaningful. It will also make this one more palatable
> > to distros since it won't change the behaviour on systems without _DSM #5,
> > and we verified nobody has it except Seattle which returns 1.
> > 
> 
> FYI Seattle is broken in any case since it returns Package(1) rather
> than just an int.

Great .... not. Do we care ?

> The problem with this patch is that currently, the PCI fw spec permits
> _DSM #5 everywhere *except* on the host bridge device object itself,
> and this is in the process of being changed.

Yes, I'm indirectly aware of that :)

> I will leave it up to the maintainers to decide whether we can take
> this patch in anticipation of that, even though it doesn't deal with
> _DSM #5 on nodes anywhere else in the PCIe tree.

Right, so the problem at this point is that dealing with it elsewhere
in the tree is very fragile and problematic (see my other messages).
Doing it at the host bridge level fixes the immediate problem for us
(provided we are ok anticipating the spec update), and doesn't preclude
also honoring it for individual devices later on.

My thinking is if we converge everybody toward the x86 method of doing
a 2 pass survey of existing resources followed by assign_unassigned,
and have that the main generic code path (with added quirks to force a
full assignment and keeping probe_only around but that's easy, we have
that on powerpc and our code is originally based on the x86 one), then
we'll have a much easier time supporting IORESOURCE_PCI_FIXED on
portions of the tree as well (though it also becomes less critical to
do so since we will no longer reallocate unless we have to).

That said we need to understand what "fixed" means and why we do it.

IE, If an endpoint somehere has "fixed" BARs for example, that means
all parent bridge must be setup to enclose that range.

Now our allocator for bridge windows cannot handle that and probably
never will, so we have to rely on the existing window established by
the FW being reasonable and use it. We can still *extend" bridge
windows (and we have code to do that) if necessary but we cannot move
them if they contain a fixed BAR device.

There is a much bigger discussion to be had around that concept of
fixed device anyway, maybe at Plumbers ? Why is the BAR fixed ? Because
the EFI FB is on it ? Because HW bugs ? Because FW might access it from
SMM or ARM equivalent ? Because ACPI will poke at it based on its
initial address ? etc...

Some of the answers to the above questions imply more than the need to
fix the BAR: Does it also mean that disabling access to that BAR, even
temporarily, isn't safe ? However that's what we do today when we
probe, if anything, to do the BAR sizing...

This isn't a new problem. We had issues like that dating back 15 years
on powerpc for example, where a big ASIC hanging off PCI had all the
Apple gunk including the interrupt controller, which was initialized
from the DT way before PCI probing. If you took an interrupt at the
"wrong" time during BAR sizing, kaboom ! If you had debug printk's in
the wrong place in the PCI probing code, kaboom ! etc....

If we want to solve that properly in the long run, we'll probably want
ACPI to tell us the BAR sizes and use that instead of doing manual
sizing on such "system" devices. We similarily have ways to "construct"
pci_dev's from the OF tree on sparc64 and powerpc, limiting direct
config access to populate stuff we can't get from FW.

Cheers,
Ben.
Lorenzo Pieralisi June 11, 2019, 2:31 p.m. UTC | #3
On Thu, Jun 06, 2019 at 08:55:07PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2019-06-06 at 11:13 +0200, Ard Biesheuvel wrote:
> > > Bjorn: I haven't made the claim path the default in absence of _DSM #5 yet.
> > > I suggest we do that as a separate patch in case it breaks somebody, thus
> > > making bisection more meaningful. It will also make this one more palatable
> > > to distros since it won't change the behaviour on systems without _DSM #5,
> > > and we verified nobody has it except Seattle which returns 1.
> > > 
> > 
> > FYI Seattle is broken in any case since it returns Package(1) rather
> > than just an int.
> 
> Great .... not. Do we care ?
> 
> > The problem with this patch is that currently, the PCI fw spec permits
> > _DSM #5 everywhere *except* on the host bridge device object itself,
> > and this is in the process of being changed.
> 
> Yes, I'm indirectly aware of that :)
> 
> > I will leave it up to the maintainers to decide whether we can take
> > this patch in anticipation of that, even though it doesn't deal with
> > _DSM #5 on nodes anywhere else in the PCIe tree.
> 
> Right, so the problem at this point is that dealing with it elsewhere
> in the tree is very fragile and problematic (see my other messages).
> Doing it at the host bridge level fixes the immediate problem for us
> (provided we are ok anticipating the spec update), and doesn't preclude
> also honoring it for individual devices later on.

True, minus specs update schedule, I can't change that and merging
this patch (and firmware thereof) relies on specifications that
are intent changes till they become an ECN (~another merge window,
so this patch could land at v5.4).

The other option is doing what this patch does *without* relying
on _DSM #5, we may have regressions unfortunately though.

It is kind of orthogonal (but not really), bus numbers assignment
is _not_ in line with resource assignment at the moment and I want
to change it.

Since ACPI on ARM64 is still at its inception maybe we should have
a stab at patching the kernel so that it reassigns bus numbers by
default and toggle that behaviour on _DSM #5 == 0 detection.

I doubt that reassigning bus numbers by default can trigger
regressions on existing platforms but the only way to figure
it out is by testing it.

> My thinking is if we converge everybody toward the x86 method of doing
> a 2 pass survey of existing resources followed by assign_unassigned,

I am not entirely sure we need a 2-pass survey,

pci_bus_claim_resources()

should be enough; if it is not we update it.

> and have that the main generic code path (with added quirks to force a
> full assignment and keeping probe_only around but that's easy, we have
> that on powerpc and our code is originally based on the x86 one), then
> we'll have a much easier time supporting IORESOURCE_PCI_FIXED on
> portions of the tree as well (though it also becomes less critical to
> do so since we will no longer reallocate unless we have to).
> 
> That said we need to understand what "fixed" means and why we do it.

Agree, totally and I want to make it clear how a BAR is fixed in
the kernel, there are too many discrepancies in the resource management
code already.

> IE, If an endpoint somehere has "fixed" BARs for example, that means
> all parent bridge must be setup to enclose that range.
> 
> Now our allocator for bridge windows cannot handle that and probably
> never will, so we have to rely on the existing window established by
> the FW being reasonable and use it. We can still *extend" bridge
> windows (and we have code to do that) if necessary but we cannot move
> them if they contain a fixed BAR device.
> 
> There is a much bigger discussion to be had around that concept of
> fixed device anyway, maybe at Plumbers ? Why is the BAR fixed ? Because
> the EFI FB is on it ? Because HW bugs ? Because FW might access it from
> SMM or ARM equivalent ? Because ACPI will poke at it based on its
> initial address ? etc...

Consider a slot booked at LPC PCI uconf for this discussion.

> Some of the answers to the above questions imply more than the need to
> fix the BAR: Does it also mean that disabling access to that BAR, even
> temporarily, isn't safe ? However that's what we do today when we
> probe, if anything, to do the BAR sizing...

Eh, another question that came up already should be debated.

> This isn't a new problem. We had issues like that dating back 15 years
> on powerpc for example, where a big ASIC hanging off PCI had all the
> Apple gunk including the interrupt controller, which was initialized
> from the DT way before PCI probing. If you took an interrupt at the
> "wrong" time during BAR sizing, kaboom ! If you had debug printk's in
> the wrong place in the PCI probing code, kaboom ! etc....
> 
> If we want to solve that properly in the long run, we'll probably want
> ACPI to tell us the BAR sizes and use that instead of doing manual
> sizing on such "system" devices. We similarily have ways to "construct"
> pci_dev's from the OF tree on sparc64 and powerpc, limiting direct
> config access to populate stuff we can't get from FW.

https://lore.kernel.org/linux-pci/20190121174225.15835-1-mr.nuke.me@gmail.com/

?
Lorenzo Pieralisi June 11, 2019, 2:58 p.m. UTC | #4
On Thu, Jun 06, 2019 at 07:00:12PM +1000, Benjamin Herrenschmidt wrote:
> From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> 
> On arm64 ACPI systems, we unconditionally reconfigure the entire PCI
> hierarchy at boot. This is a departure from what is customary on ACPI
> systems, and may break assumptions in some places (e.g., EFIFB), that
> the kernel will leave BARs of enabled PCI devices where they are.
> 
> Given that PCI already specifies a device specific ACPI method (_DSM)
> for PCI root bridge nodes that tells us whether the firmware thinks
> the configuration should be left alone, let's sidestep the entire
> policy debate about whether the PCI configuration should be preserved
> or not, and put it under the control of the firmware instead.
> 
> [BenH: Added pci_assign_unassigned_root_bus_resources()]
> 
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
> 
> So I would like this variant rather than mucking around with
> IORESOURCE_PCI_FIXED at this stage to fix the problem with our platforms.
> 
> See my other email, IORESOURCE_PCI_FIXED doesn't really work terribly well
> when using pci_bus_size_bridges and pci_bus_assign_resources, and the
> resulting patches are ugly and add more mess.
> 
> Long run, I propose to start working on consolidating all those various
> resource survey mechanisms around what x86 does, unless people strongly
> object... (with the addition of the probe only and force reassign quirks
> so platforms can still chose that).
> 
> Note: I haven't tested the effect of pci_assign_unassigned_root_bus_resources
> as our platforms don't leave anything unassigned. I'm not entirely sure how
> well pci_bus_claim_resources() will deal with a partially assigned setup...
> 
> We do want to support partial assignment by BIOS though, it's a trend to
> reduce boot time, people seem to want BIOSes to only assign what's critical
> for booting.
> 
> Bjorn: I haven't made the claim path the default in absence of _DSM #5 yet.
> I suggest we do that as a separate patch in case it breaks somebody, thus
> making bisection more meaningful. It will also make this one more palatable
> to distros since it won't change the behaviour on systems without _DSM #5,
> and we verified nobody has it except Seattle which returns 1. 
> 
>  arch/arm64/kernel/pci.c  | 23 +++++++++++++++++++++--
>  include/linux/pci-acpi.h |  7 ++++---
>  2 files changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
> index bb85e2f4603f..6358e1cb4f9f 100644
> --- a/arch/arm64/kernel/pci.c
> +++ b/arch/arm64/kernel/pci.c
> @@ -168,6 +168,7 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
>  	struct acpi_pci_generic_root_info *ri;
>  	struct pci_bus *bus, *child;
>  	struct acpi_pci_root_ops *root_ops;
> +	union acpi_object *obj;
>  
>  	ri = kzalloc(sizeof(*ri), GFP_KERNEL);
>  	if (!ri)
> @@ -193,8 +194,26 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
>  	if (!bus)
>  		return NULL;
>  
> -	pci_bus_size_bridges(bus);
> -	pci_bus_assign_resources(bus);
> +	/*
> +	 * Invoke the PCI device specific method (_DSM) #5 'Ignore PCI Boot
> +	 * Configuration', which tells us whether the firmware wants us to
> +	 * preserve the configuration of the PCI resource tree for this root
> +	 * bridge.
> +	 */
> +	obj = acpi_evaluate_dsm(ACPI_HANDLE(bus->bridge), &pci_acpi_dsm_guid, 1,
> +	                        IGNORE_PCI_BOOT_CONFIG_DSM, NULL);
> +	if (obj && obj->type == ACPI_TYPE_INTEGER && obj->integer.value == 0) {
> +		/* preserve existing resource assignment */
> +		pci_bus_claim_resources(bus);
> +
> +		/* Assign anything that might have been left out */
> +		pci_assign_unassigned_root_bus_resources(bus);
> +	} else {
> +		/* reconfigure the resource tree from scratch */
> +		pci_bus_size_bridges(bus);
> +		pci_bus_assign_resources(bus);
> +	}

	if (obj && obj->type == ACPI_TYPE_INTEGER && obj->integer.value == 0) {
		/* preserve existing resource assignment */
		pci_bus_claim_resources(bus);
	}

	pci_bus_size_bridges(bus);
	pci_bus_assign_resources(bus);

That's how it should be I think:

1) we do not want pci_assign_unassigned_root_bus_resources(bus) to
   reallocate resources already claimed (see realloc parameter), do we ?
2) pci_bus_size_bridges(bus) and pci_bus_assign_resources(bus) should
   not interfere with resources already claimed so it *should* be safe
   to call them anyway

Most importantly: I want everyone to agree that claiming is equivalent
to making a resource immutable (except for realloc, see (1) above)
because that's what we are doing by claiming on _DSM #5 == 0.

There are too many ways to make a resource immutable in the kernel
and this is confusing and prone to bugs.

Thanks,
Lorenzo

> +	ACPI_FREE(obj);
>  
>  	list_for_each_entry(child, &bus->children, node)
>  		pcie_bus_configure_settings(child);
> diff --git a/include/linux/pci-acpi.h b/include/linux/pci-acpi.h
> index 8082b612f561..62b7fdcc661c 100644
> --- a/include/linux/pci-acpi.h
> +++ b/include/linux/pci-acpi.h
> @@ -107,9 +107,10 @@ static inline void acpiphp_check_host_bridge(struct acpi_device *adev) { }
>  #endif
>  
>  extern const guid_t pci_acpi_dsm_guid;
> -#define DEVICE_LABEL_DSM	0x07
> -#define RESET_DELAY_DSM		0x08
> -#define FUNCTION_DELAY_DSM	0x09
> +#define IGNORE_PCI_BOOT_CONFIG_DSM	0x05
> +#define DEVICE_LABEL_DSM		0x07
> +#define RESET_DELAY_DSM			0x08
> +#define FUNCTION_DELAY_DSM		0x09
>  
>  #else	/* CONFIG_ACPI */
>  static inline void acpi_pci_add_bus(struct pci_bus *bus) { }
> 
>
Benjamin Herrenschmidt June 11, 2019, 10:09 p.m. UTC | #5
On Tue, 2019-06-11 at 15:31 +0100, Lorenzo Pieralisi wrote:
> 
> True, minus specs update schedule, I can't change that and merging
> this patch (and firmware thereof) relies on specifications that
> are intent changes till they become an ECN (~another merge window,
> so this patch could land at v5.4).

Hrm... annoying for us but I understand your reasoning.

> The other option is doing what this patch does *without* relying
> on _DSM #5, we may have regressions unfortunately though.

We could work around regressions with quirks I suppose. It does make
sense to assume that if you have ACPI and UEFI, you have a decent PCI
BAR assignment at boot in the "general case". That said, we need to
double check first that pci_bus_claim_resources() will not do horrible
things on partially assigned setups, since there's a real interest in
doing that in the field.

> It is kind of orthogonal (but not really), bus numbers assignment
> is _not_ in line with resource assignment at the moment and I want
> to change it.

Hrm. We should probably reassign bus numbers if we reassign resources
yes, but then I'd like us to not reassign resources unless we have to
:-)

> Since ACPI on ARM64 is still at its inception maybe we should have
> a stab at patching the kernel so that it reassigns bus numbers by
> default and toggle that behaviour on _DSM #5 == 0 detection.
> 
> I doubt that reassigning bus numbers by default can trigger
> regressions on existing platforms but the only way to figure
> it out is by testing it.
>
> > My thinking is if we converge everybody toward the x86 method of
> > doing
> > a 2 pass survey of existing resources followed by
> > assign_unassigned,
> 
> I am not entirely sure we need a 2-pass survey,
>
> pci_bus_claim_resources()
> 
> should be enough; if it is not we update it.

So it's not so much about the 2 passes per-se, though they have value,
it's more about consolidating archs to do the same thing. Chances that
we change x86 are nil. But we can change powerpc and arm64 to do like
x86 and move that code to generic.

pci_bus_claim_resources() seems to be a "lightweight" variant of the
survey done by x86. The main differences I can see are:

 - The 2 passes thing which we may or may not care about, its main
purpose is to favor resources that are already enabled by the BIOS in
case of conflicts as far as I understand.

 - pci_read_bridge_bases() is done by pci_bus_claim_resources(), while
x86 (and powerpc and others) do it in their pcibios_fixup_bus. That one
is interesting... Any reason why we shouldn't unconditionally read the
bridges while probing ? Bjorn ?

 - When allocating bridge resources, there are interesting differences:

  * x86 (and powerpc to some extent): If one has a 0 start or we fail
to claim it, x86 will wipe out the resource struct (including flags). I
assume that pci_assign_unassign_* will restore bridges when needed but
I haven't verified. 

  * pci_bus_claim_resources() is dumber in that regard. It will call
pci_claim_bridge_resources() blindly try to claim whatever is there
even if res->start is 0. This could be a problem with partially
assigned trees. It also doesn't wipe the resource in case of failure to
claim which could be a problem going down the tree and letting children
attach to the non-claimed resource, thus potentially causing the
reassign pass to fail.

The r->start == 0 test is interesting ... the generic claim code will
honor IORESOURCE_UNSET but we don't seem to set that generically unless
we hit some of the specific pass for explicit resource alignment, or
during the reassignment phases.

 - When allocating device resources, the main difference other than the
2 passes is that x86 will "0 base" the resource (r->end -= r->start; r-
>start = 0) for later reassignment. The claim path we use won't do
that. Note: none sets IORESOURCE_UNSET... Additionally x86 has some
oddball code to save the original FW values and restore them if
assignment later fails, which is somewhat odd since there's a conflict
but probably helps really broken setups.

 - x86 will not claim ROMs in that pass, it does a 3rd pass just for
them (it's common I think to not have room for all the ROMs). It also
disables them in config space during the survey.
pci_bus_claim_resources() will claim everything and leave ROMs enabled.

So as a somewhat temprary conclusion, I think the main difference here
is what happens when claim fails (also the res->start = 0 case which we
need to look at more closely) and whether we should make the generic
code also "0-base" the resource.

The question for me really is, do we want to just "upgrade" (if
necessary) pci_bus_claim_resources() and continue having x86 do its own
thing for ever, or do we want to consolidate around what is probably
the most tested platform when it comes to PCI :-)

And if we consolidate, I think that won't be by changing what x86 does,
that code is the result of decades of fiddling to get things right with
all sorts of broken BIOSes...

> > and have that the main generic code path (with added quirks to force a
> > full assignment and keeping probe_only around but that's easy, we have
> > that on powerpc and our code is originally based on the x86 one), then
> > we'll have a much easier time supporting IORESOURCE_PCI_FIXED on
> > portions of the tree as well (though it also becomes less critical to
> > do so since we will no longer reallocate unless we have to).
> > 
> > That said we need to understand what "fixed" means and why we do it.
> 
> Agree, totally and I want to make it clear how a BAR is fixed in
> the kernel, there are too many discrepancies in the resource
> management code already.
> 
> > IE, If an endpoint somehere has "fixed" BARs for example, that means
> > all parent bridge must be setup to enclose that range.
> > 
> > Now our allocator for bridge windows cannot handle that and probably
> > never will, so we have to rely on the existing window established by
> > the FW being reasonable and use it. We can still *extend" bridge
> > windows (and we have code to do that) if necessary but we cannot move
> > them if they contain a fixed BAR device.
> > 
> > There is a much bigger discussion to be had around that concept of
> > fixed device anyway, maybe at Plumbers ? Why is the BAR fixed ? Because
> > the EFI FB is on it ? Because HW bugs ? Because FW might access it from
> > SMM or ARM equivalent ? Because ACPI will poke at it based on its
> > initial address ? etc...
> 
> Consider a slot booked at LPC PCI uconf for this discussion.

Excellent.

> > Some of the answers to the above questions imply more than the need to
> > fix the BAR: Does it also mean that disabling access to that BAR, even
> > temporarily, isn't safe ? However that's what we do today when we
> > probe, if anything, to do the BAR sizing...
> 
> Eh, another question that came up already should be debated.

Yup.

> > This isn't a new problem. We had issues like that dating back 15 years
> > on powerpc for example, where a big ASIC hanging off PCI had all the
> > Apple gunk including the interrupt controller, which was initialized
> > from the DT way before PCI probing. If you took an interrupt at the
> > "wrong" time during BAR sizing, kaboom ! If you had debug printk's in
> > the wrong place in the PCI probing code, kaboom ! etc....
> > 
> > If we want to solve that properly in the long run, we'll probably want
> > ACPI to tell us the BAR sizes and use that instead of doing manual
> > sizing on such "system" devices. We similarily have ways to "construct"
> > pci_dev's from the OF tree on sparc64 and powerpc, limiting direct
> > config access to populate stuff we can't get from FW.
> 
> https://lore.kernel.org/linux-pci/20190121174225.15835-1-mr.nuke.me@gmail.com/
> 
> ?

Ah I don't know enough about ACPI yet, on my reading list :-)

Cheers,
Ben.
Benjamin Herrenschmidt June 11, 2019, 10:19 p.m. UTC | #6
On Tue, 2019-06-11 at 15:58 +0100, Lorenzo Pieralisi wrote:
> 
> 
> 	if (obj && obj->type == ACPI_TYPE_INTEGER && obj->integer.value == 0) {
> 		/* preserve existing resource assignment */
> 		pci_bus_claim_resources(bus);
> 	}
> 
> 	pci_bus_size_bridges(bus);
> 	pci_bus_assign_resources(bus);

So that makes me nervous... my understanding is that the pair

	pci_bus_size_bridges(bus);
 	pci_bus_assign_resources(bus);

Is intended for full reassignment. Now they will try to skip resources
that already have a parent, but that's yet another code path. What's
wrong with pci_unassigned_* ? That's what it's meant for...

> That's how it should be I think:
> 
> 1) we do not want pci_assign_unassigned_root_bus_resources(bus) to
>    reallocate resources already claimed (see realloc parameter), do we ?

Well, realloc is useful to handle SR_IOV when the BIOS doesn't do it
right (common case). That said, at this point, we should be able to
honor IORESOURCE_PCI_FIXED for things that have _DSM #5 since they have
been claimed. I don't see that realloc logic being a problem for us,
and I want to avoid gratuitous differences with x86, but maybe I'm
missing something here...

> 2) pci_bus_size_bridges(bus) and pci_bus_assign_resources(bus) should
>    not interfere with resources already claimed so it *should* be safe
>    to call them anyway

Sure, *should* and here we introduce yet another way of doing things
though... Any reason we don't want to do what x86 does here ?

> Most importantly: I want everyone to agree that claiming is equivalent
> to making a resource immutable (except for realloc, see (1) above)
> because that's what we are doing by claiming on _DSM #5 == 0.

I think the combination of claiming *and* IORESOURCE_PCI_FIXED is what
makes it *really* immutable. I'm a bit confused by the realloc logic
right now, I'll need more quality time looking at it after ingesting
more caffeing but I'm under the impression that it will honor the flag.

> There are too many ways to make a resource immutable in the kernel
> and this is confusing and prone to bugs.

It is, but I don't want to create new ways of doing things, and what
you seem to propose is a new way imho :-)

Cheers,
Ben.

> Thanks,
> Lorenzo
> 
> > +	ACPI_FREE(obj);
> >  
> >  	list_for_each_entry(child, &bus->children, node)
> >  		pcie_bus_configure_settings(child);
> > diff --git a/include/linux/pci-acpi.h b/include/linux/pci-acpi.h
> > index 8082b612f561..62b7fdcc661c 100644
> > --- a/include/linux/pci-acpi.h
> > +++ b/include/linux/pci-acpi.h
> > @@ -107,9 +107,10 @@ static inline void acpiphp_check_host_bridge(struct acpi_device *adev) { }
> >  #endif
> >  
> >  extern const guid_t pci_acpi_dsm_guid;
> > -#define DEVICE_LABEL_DSM	0x07
> > -#define RESET_DELAY_DSM		0x08
> > -#define FUNCTION_DELAY_DSM	0x09
> > +#define IGNORE_PCI_BOOT_CONFIG_DSM	0x05
> > +#define DEVICE_LABEL_DSM		0x07
> > +#define RESET_DELAY_DSM			0x08
> > +#define FUNCTION_DELAY_DSM		0x09
> >  
> >  #else	/* CONFIG_ACPI */
> >  static inline void acpi_pci_add_bus(struct pci_bus *bus) { }
> > 
> >
Ard Biesheuvel June 11, 2019, 10:34 p.m. UTC | #7
'

On Wed, 12 Jun 2019 at 00:09, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> On Tue, 2019-06-11 at 15:31 +0100, Lorenzo Pieralisi wrote:
> >
> > True, minus specs update schedule, I can't change that and merging
> > this patch (and firmware thereof) relies on specifications that
> > are intent changes till they become an ECN (~another merge window,
> > so this patch could land at v5.4).
>
> Hrm... annoying for us but I understand your reasoning.
>
> > The other option is doing what this patch does *without* relying
> > on _DSM #5, we may have regressions unfortunately though.
>
> We could work around regressions with quirks I suppose. It does make
> sense to assume that if you have ACPI and UEFI, you have a decent PCI
> BAR assignment at boot in the "general case". That said, we need to
> double check first that pci_bus_claim_resources() will not do horrible
> things on partially assigned setups, since there's a real interest in
> doing that in the field.
>

EDK2 based code is typically very fork heavy, in the sense that,
instead of upstreaming a change, a driver gets forked and changes are
applied locally, which then need to be carried into perpetuity. That
means that 'recent' ports could still display behavior that was
removed from the generic code a long time ago. All the open source
arm64 platforms now use the generic PCI host bridge driver (which is
in charge of the bus enumeration and resource allocation) and so
hopefully, future platforms will not deviate too much from that.

In particular, EDK2 has some PCD tunables for things like PCIe hotplug
and SR-IOV support, which affects the number of spare buses that get
allocated for hotplug capable root ports, and for SR-IOV capable
endpoints.

As Lorenzo mentions, we don't actively reassign bus numbers from
scratch, but I am not sure if that is 100% true. I think you do get
some errors when booting with hotplug capable root ports that don't
have 'pci_hotplug_bus_size' spare bus numbers available.

Also note that EDK2 leaves ROM BARs unassigned.

> > It is kind of orthogonal (but not really), bus numbers assignment
> > is _not_ in line with resource assignment at the moment and I want
> > to change it.
>
> Hrm. We should probably reassign bus numbers if we reassign resources
> yes, but then I'd like us to not reassign resources unless we have to
> :-)
>
> > Since ACPI on ARM64 is still at its inception maybe we should have
> > a stab at patching the kernel so that it reassigns bus numbers by
> > default and toggle that behaviour on _DSM #5 == 0 detection.
> >
> > I doubt that reassigning bus numbers by default can trigger
> > regressions on existing platforms but the only way to figure
> > it out is by testing it.
> >
> > > My thinking is if we converge everybody toward the x86 method of
> > > doing
> > > a 2 pass survey of existing resources followed by
> > > assign_unassigned,
> >
> > I am not entirely sure we need a 2-pass survey,
> >
> > pci_bus_claim_resources()
> >
> > should be enough; if it is not we update it.
>
> So it's not so much about the 2 passes per-se, though they have value,
> it's more about consolidating archs to do the same thing. Chances that
> we change x86 are nil. But we can change powerpc and arm64 to do like
> x86 and move that code to generic.
>
> pci_bus_claim_resources() seems to be a "lightweight" variant of the
> survey done by x86. The main differences I can see are:
>
>  - The 2 passes thing which we may or may not care about, its main
> purpose is to favor resources that are already enabled by the BIOS in
> case of conflicts as far as I understand.
>
>  - pci_read_bridge_bases() is done by pci_bus_claim_resources(), while
> x86 (and powerpc and others) do it in their pcibios_fixup_bus. That one
> is interesting... Any reason why we shouldn't unconditionally read the
> bridges while probing ? Bjorn ?
>
>  - When allocating bridge resources, there are interesting differences:
>
>   * x86 (and powerpc to some extent): If one has a 0 start or we fail
> to claim it, x86 will wipe out the resource struct (including flags). I
> assume that pci_assign_unassign_* will restore bridges when needed but
> I haven't verified.
>
>   * pci_bus_claim_resources() is dumber in that regard. It will call
> pci_claim_bridge_resources() blindly try to claim whatever is there
> even if res->start is 0. This could be a problem with partially
> assigned trees. It also doesn't wipe the resource in case of failure to
> claim which could be a problem going down the tree and letting children
> attach to the non-claimed resource, thus potentially causing the
> reassign pass to fail.
>
> The r->start == 0 test is interesting ... the generic claim code will
> honor IORESOURCE_UNSET but we don't seem to set that generically unless
> we hit some of the specific pass for explicit resource alignment, or
> during the reassignment phases.
>
>  - When allocating device resources, the main difference other than the
> 2 passes is that x86 will "0 base" the resource (r->end -= r->start; r-
> >start = 0) for later reassignment. The claim path we use won't do
> that. Note: none sets IORESOURCE_UNSET... Additionally x86 has some
> oddball code to save the original FW values and restore them if
> assignment later fails, which is somewhat odd since there's a conflict
> but probably helps really broken setups.
>
>  - x86 will not claim ROMs in that pass, it does a 3rd pass just for
> them (it's common I think to not have room for all the ROMs). It also
> disables them in config space during the survey.
> pci_bus_claim_resources() will claim everything and leave ROMs enabled.
>
> So as a somewhat temprary conclusion, I think the main difference here
> is what happens when claim fails (also the res->start = 0 case which we
> need to look at more closely) and whether we should make the generic
> code also "0-base" the resource.
>
> The question for me really is, do we want to just "upgrade" (if
> necessary) pci_bus_claim_resources() and continue having x86 do its own
> thing for ever, or do we want to consolidate around what is probably
> the most tested platform when it comes to PCI :-)
>
> And if we consolidate, I think that won't be by changing what x86 does,
> that code is the result of decades of fiddling to get things right with
> all sorts of broken BIOSes...
>
> > > and have that the main generic code path (with added quirks to force a
> > > full assignment and keeping probe_only around but that's easy, we have
> > > that on powerpc and our code is originally based on the x86 one), then
> > > we'll have a much easier time supporting IORESOURCE_PCI_FIXED on
> > > portions of the tree as well (though it also becomes less critical to
> > > do so since we will no longer reallocate unless we have to).
> > >
> > > That said we need to understand what "fixed" means and why we do it.
> >
> > Agree, totally and I want to make it clear how a BAR is fixed in
> > the kernel, there are too many discrepancies in the resource
> > management code already.
> >
> > > IE, If an endpoint somehere has "fixed" BARs for example, that means
> > > all parent bridge must be setup to enclose that range.
> > >
> > > Now our allocator for bridge windows cannot handle that and probably
> > > never will, so we have to rely on the existing window established by
> > > the FW being reasonable and use it. We can still *extend" bridge
> > > windows (and we have code to do that) if necessary but we cannot move
> > > them if they contain a fixed BAR device.
> > >
> > > There is a much bigger discussion to be had around that concept of
> > > fixed device anyway, maybe at Plumbers ? Why is the BAR fixed ? Because
> > > the EFI FB is on it ? Because HW bugs ? Because FW might access it from
> > > SMM or ARM equivalent ? Because ACPI will poke at it based on its
> > > initial address ? etc...
> >
> > Consider a slot booked at LPC PCI uconf for this discussion.
>
> Excellent.
>
> > > Some of the answers to the above questions imply more than the need to
> > > fix the BAR: Does it also mean that disabling access to that BAR, even
> > > temporarily, isn't safe ? However that's what we do today when we
> > > probe, if anything, to do the BAR sizing...
> >
> > Eh, another question that came up already should be debated.
>
> Yup.
>
> > > This isn't a new problem. We had issues like that dating back 15 years
> > > on powerpc for example, where a big ASIC hanging off PCI had all the
> > > Apple gunk including the interrupt controller, which was initialized
> > > from the DT way before PCI probing. If you took an interrupt at the
> > > "wrong" time during BAR sizing, kaboom ! If you had debug printk's in
> > > the wrong place in the PCI probing code, kaboom ! etc....
> > >
> > > If we want to solve that properly in the long run, we'll probably want
> > > ACPI to tell us the BAR sizes and use that instead of doing manual
> > > sizing on such "system" devices. We similarily have ways to "construct"
> > > pci_dev's from the OF tree on sparc64 and powerpc, limiting direct
> > > config access to populate stuff we can't get from FW.
> >
> > https://lore.kernel.org/linux-pci/20190121174225.15835-1-mr.nuke.me@gmail.com/
> >
> > ?
>
> Ah I don't know enough about ACPI yet, on my reading list :-)
>
> Cheers,
> Ben.
>
>
Benjamin Herrenschmidt June 11, 2019, 10:40 p.m. UTC | #8
On Wed, 2019-06-12 at 00:34 +0200, Ard Biesheuvel wrote:

> EDK2 based code is typically very fork heavy, in the sense that,
> instead of upstreaming a change, a driver gets forked and changes are
> applied locally, which then need to be carried into perpetuity. That
> means that 'recent' ports could still display behavior that was
> removed from the generic code a long time ago. All the open source
> arm64 platforms now use the generic PCI host bridge driver (which is
> in charge of the bus enumeration and resource allocation) and so
> hopefully, future platforms will not deviate too much from that.
> 
> In particular, EDK2 has some PCD tunables for things like PCIe
> hotplug
> and SR-IOV support, which affects the number of spare buses that get
> allocated for hotplug capable root ports, and for SR-IOV capable
> endpoints.
> 
> As Lorenzo mentions, we don't actively reassign bus numbers from
> scratch, but I am not sure if that is 100% true. I think you do get
> some errors when booting with hotplug capable root ports that don't
> have 'pci_hotplug_bus_size' spare bus numbers available.
> 
> Also note that EDK2 leaves ROM BARs unassigned.

This is all somewhat reasonable. x86 is in the same situation which is
why I'm really keen on trying to consolidate the two approaches.

> > > It is kind of orthogonal (but not really), bus numbers assignment
> > > is _not_ in line with resource assignment at the moment and I
> > > want
> > > to change it.
> > 
> > Hrm. We should probably reassign bus numbers if we reassign
> > resources
> > yes, but then I'd like us to not reassign resources unless we have
> > to
> > :-)
> > 
> > > Since ACPI on ARM64 is still at its inception maybe we should
> > > have
> > > a stab at patching the kernel so that it reassigns bus numbers by
> > > default and toggle that behaviour on _DSM #5 == 0 detection.
> > > 
> > > I doubt that reassigning bus numbers by default can trigger
> > > regressions on existing platforms but the only way to figure
> > > it out is by testing it.
> > > 
> > > > My thinking is if we converge everybody toward the x86 method
> > > > of
> > > > doing
> > > > a 2 pass survey of existing resources followed by
> > > > assign_unassigned,
> > > 
> > > I am not entirely sure we need a 2-pass survey,
> > > 
> > > pci_bus_claim_resources()
> > > 
> > > should be enough; if it is not we update it.
> > 
> > So it's not so much about the 2 passes per-se, though they have
> > value,
> > it's more about consolidating archs to do the same thing. Chances
> > that
> > we change x86 are nil. But we can change powerpc and arm64 to do
> > like
> > x86 and move that code to generic.
> > 
> > pci_bus_claim_resources() seems to be a "lightweight" variant of
> > the
> > survey done by x86. The main differences I can see are:
> > 
> >  - The 2 passes thing which we may or may not care about, its main
> > purpose is to favor resources that are already enabled by the BIOS
> > in
> > case of conflicts as far as I understand.
> > 
> >  - pci_read_bridge_bases() is done by pci_bus_claim_resources(),
> > while
> > x86 (and powerpc and others) do it in their pcibios_fixup_bus. That
> > one
> > is interesting... Any reason why we shouldn't unconditionally read
> > the
> > bridges while probing ? Bjorn ?
> > 
> >  - When allocating bridge resources, there are interesting
> > differences:
> > 
> >   * x86 (and powerpc to some extent): If one has a 0 start or we
> > fail
> > to claim it, x86 will wipe out the resource struct (including
> > flags). I
> > assume that pci_assign_unassign_* will restore bridges when needed
> > but
> > I haven't verified.
> > 
> >   * pci_bus_claim_resources() is dumber in that regard. It will
> > call
> > pci_claim_bridge_resources() blindly try to claim whatever is there
> > even if res->start is 0. This could be a problem with partially
> > assigned trees. It also doesn't wipe the resource in case of
> > failure to
> > claim which could be a problem going down the tree and letting
> > children
> > attach to the non-claimed resource, thus potentially causing the
> > reassign pass to fail.
> > 
> > The r->start == 0 test is interesting ... the generic claim code
> > will
> > honor IORESOURCE_UNSET but we don't seem to set that generically
> > unless
> > we hit some of the specific pass for explicit resource alignment,
> > or
> > during the reassignment phases.
> > 
> >  - When allocating device resources, the main difference other than
> > the
> > 2 passes is that x86 will "0 base" the resource (r->end -= r-
> > >start; r-
> > > start = 0) for later reassignment. The claim path we use won't do
> > 
> > that. Note: none sets IORESOURCE_UNSET... Additionally x86 has some
> > oddball code to save the original FW values and restore them if
> > assignment later fails, which is somewhat odd since there's a
> > conflict
> > but probably helps really broken setups.
> > 
> >  - x86 will not claim ROMs in that pass, it does a 3rd pass just
> > for
> > them (it's common I think to not have room for all the ROMs). It
> > also
> > disables them in config space during the survey.
> > pci_bus_claim_resources() will claim everything and leave ROMs
> > enabled.
> > 
> > So as a somewhat temprary conclusion, I think the main difference
> > here
> > is what happens when claim fails (also the res->start = 0 case
> > which we
> > need to look at more closely) and whether we should make the
> > generic
> > code also "0-base" the resource.
> > 
> > The question for me really is, do we want to just "upgrade" (if
> > necessary) pci_bus_claim_resources() and continue having x86 do its
> > own
> > thing for ever, or do we want to consolidate around what is
> > probably
> > the most tested platform when it comes to PCI :-)
> > 
> > And if we consolidate, I think that won't be by changing what x86
> > does,
> > that code is the result of decades of fiddling to get things right
> > with
> > all sorts of broken BIOSes...
> > 
> > > > and have that the main generic code path (with added quirks to
> > > > force a
> > > > full assignment and keeping probe_only around but that's easy,
> > > > we have
> > > > that on powerpc and our code is originally based on the x86
> > > > one), then
> > > > we'll have a much easier time supporting IORESOURCE_PCI_FIXED
> > > > on
> > > > portions of the tree as well (though it also becomes less
> > > > critical to
> > > > do so since we will no longer reallocate unless we have to).
> > > > 
> > > > That said we need to understand what "fixed" means and why we
> > > > do it.
> > > 
> > > Agree, totally and I want to make it clear how a BAR is fixed in
> > > the kernel, there are too many discrepancies in the resource
> > > management code already.
> > > 
> > > > IE, If an endpoint somehere has "fixed" BARs for example, that
> > > > means
> > > > all parent bridge must be setup to enclose that range.
> > > > 
> > > > Now our allocator for bridge windows cannot handle that and
> > > > probably
> > > > never will, so we have to rely on the existing window
> > > > established by
> > > > the FW being reasonable and use it. We can still *extend"
> > > > bridge
> > > > windows (and we have code to do that) if necessary but we
> > > > cannot move
> > > > them if they contain a fixed BAR device.
> > > > 
> > > > There is a much bigger discussion to be had around that concept
> > > > of
> > > > fixed device anyway, maybe at Plumbers ? Why is the BAR fixed ?
> > > > Because
> > > > the EFI FB is on it ? Because HW bugs ? Because FW might access
> > > > it from
> > > > SMM or ARM equivalent ? Because ACPI will poke at it based on
> > > > its
> > > > initial address ? etc...
> > > 
> > > Consider a slot booked at LPC PCI uconf for this discussion.
> > 
> > Excellent.
> > 
> > > > Some of the answers to the above questions imply more than the
> > > > need to
> > > > fix the BAR: Does it also mean that disabling access to that
> > > > BAR, even
> > > > temporarily, isn't safe ? However that's what we do today when
> > > > we
> > > > probe, if anything, to do the BAR sizing...
> > > 
> > > Eh, another question that came up already should be debated.
> > 
> > Yup.
> > 
> > > > This isn't a new problem. We had issues like that dating back
> > > > 15 years
> > > > on powerpc for example, where a big ASIC hanging off PCI had
> > > > all the
> > > > Apple gunk including the interrupt controller, which was
> > > > initialized
> > > > from the DT way before PCI probing. If you took an interrupt at
> > > > the
> > > > "wrong" time during BAR sizing, kaboom ! If you had debug
> > > > printk's in
> > > > the wrong place in the PCI probing code, kaboom ! etc....
> > > > 
> > > > If we want to solve that properly in the long run, we'll
> > > > probably want
> > > > ACPI to tell us the BAR sizes and use that instead of doing
> > > > manual
> > > > sizing on such "system" devices. We similarily have ways to
> > > > "construct"
> > > > pci_dev's from the OF tree on sparc64 and powerpc, limiting
> > > > direct
> > > > config access to populate stuff we can't get from FW.
> > > 
> > > 
https://lore.kernel.org/linux-pci/20190121174225.15835-1-mr.nuke.me@gmail.com/
> > > 
> > > ?
> > 
> > Ah I don't know enough about ACPI yet, on my reading list :-)
> > 
> > Cheers,
> > Ben.
> > 
> >
Bjorn Helgaas June 11, 2019, 11:39 p.m. UTC | #9
On Thu, Jun 06, 2019 at 07:00:12PM +1000, Benjamin Herrenschmidt wrote:
> From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> 
> On arm64 ACPI systems, we unconditionally reconfigure the entire PCI
> hierarchy at boot. This is a departure from what is customary on ACPI
> systems, and may break assumptions in some places (e.g., EFIFB), that
> the kernel will leave BARs of enabled PCI devices where they are.
> 
> Given that PCI already specifies a device specific ACPI method (_DSM)
> for PCI root bridge nodes that tells us whether the firmware thinks
> the configuration should be left alone, let's sidestep the entire
> policy debate about whether the PCI configuration should be preserved
> or not, and put it under the control of the firmware instead.

The current PCI Firmware spec r3.2 specifies _DSM function 5 for
PCI-to-PCI bridge objects, which does not include host bridge
(PNP0A03) nodes, but the proposed revision does allow it under host
bridges.  So I'm fine with this, but we should update the commit log
so it doesn't say "PCI *already* specifies this".

> [BenH: Added pci_assign_unassigned_root_bus_resources()]
> 
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

I think you should add a signed-off-by for yourself?

> ---
> 
> So I would like this variant rather than mucking around with
> IORESOURCE_PCI_FIXED at this stage to fix the problem with our platforms.
> 
> See my other email, IORESOURCE_PCI_FIXED doesn't really work terribly well
> when using pci_bus_size_bridges and pci_bus_assign_resources, and the
> resulting patches are ugly and add more mess.
> 
> Long run, I propose to start working on consolidating all those various
> resource survey mechanisms around what x86 does, unless people strongly
> object... (with the addition of the probe only and force reassign quirks
> so platforms can still chose that).
> 
> Note: I haven't tested the effect of pci_assign_unassigned_root_bus_resources
> as our platforms don't leave anything unassigned. I'm not entirely sure how
> well pci_bus_claim_resources() will deal with a partially assigned setup...
> 
> We do want to support partial assignment by BIOS though, it's a trend to
> reduce boot time, people seem to want BIOSes to only assign what's critical
> for booting.
> 
> Bjorn: I haven't made the claim path the default in absence of _DSM #5 yet.
> I suggest we do that as a separate patch in case it breaks somebody, thus
> making bisection more meaningful. It will also make this one more palatable
> to distros since it won't change the behaviour on systems without _DSM #5,
> and we verified nobody has it except Seattle which returns 1. 
> 
>  arch/arm64/kernel/pci.c  | 23 +++++++++++++++++++++--
>  include/linux/pci-acpi.h |  7 ++++---
>  2 files changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
> index bb85e2f4603f..6358e1cb4f9f 100644
> --- a/arch/arm64/kernel/pci.c
> +++ b/arch/arm64/kernel/pci.c
> @@ -168,6 +168,7 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
>  	struct acpi_pci_generic_root_info *ri;
>  	struct pci_bus *bus, *child;
>  	struct acpi_pci_root_ops *root_ops;
> +	union acpi_object *obj;
>  
>  	ri = kzalloc(sizeof(*ri), GFP_KERNEL);
>  	if (!ri)
> @@ -193,8 +194,26 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
>  	if (!bus)
>  		return NULL;
>  
> -	pci_bus_size_bridges(bus);
> -	pci_bus_assign_resources(bus);
> +	/*
> +	 * Invoke the PCI device specific method (_DSM) #5 'Ignore PCI Boot
> +	 * Configuration', which tells us whether the firmware wants us to
> +	 * preserve the configuration of the PCI resource tree for this root
> +	 * bridge.
> +	 */
> +	obj = acpi_evaluate_dsm(ACPI_HANDLE(bus->bridge), &pci_acpi_dsm_guid, 1,
> +	                        IGNORE_PCI_BOOT_CONFIG_DSM, NULL);
> +	if (obj && obj->type == ACPI_TYPE_INTEGER && obj->integer.value == 0) {

This is fine, but can we make a tiny step toward doing this in generic
code instead of adding more arch-specific stuff?

E.g., evaluate the _DSM in the generic acpi_pci_root_add(), set a
"preserve_config" bit in the struct acpi_pci_root, and test the bit
here?

It would also be nice to add a printk in the other
pci_acpi_scan_root() implementations if the bit is set so we know that
the platform supplied the _DSM but we're ignoring it.

> +		/* preserve existing resource assignment */
> +		pci_bus_claim_resources(bus);
> +
> +		/* Assign anything that might have been left out */
> +		pci_assign_unassigned_root_bus_resources(bus);
> +	} else {
> +		/* reconfigure the resource tree from scratch */
> +		pci_bus_size_bridges(bus);
> +		pci_bus_assign_resources(bus);
> +	}
> +	ACPI_FREE(obj);
>  
>  	list_for_each_entry(child, &bus->children, node)
>  		pcie_bus_configure_settings(child);
> diff --git a/include/linux/pci-acpi.h b/include/linux/pci-acpi.h
> index 8082b612f561..62b7fdcc661c 100644
> --- a/include/linux/pci-acpi.h
> +++ b/include/linux/pci-acpi.h
> @@ -107,9 +107,10 @@ static inline void acpiphp_check_host_bridge(struct acpi_device *adev) { }
>  #endif
>  
>  extern const guid_t pci_acpi_dsm_guid;
> -#define DEVICE_LABEL_DSM	0x07
> -#define RESET_DELAY_DSM		0x08
> -#define FUNCTION_DELAY_DSM	0x09
> +#define IGNORE_PCI_BOOT_CONFIG_DSM	0x05
> +#define DEVICE_LABEL_DSM		0x07
> +#define RESET_DELAY_DSM			0x08
> +#define FUNCTION_DELAY_DSM		0x09
>  
>  #else	/* CONFIG_ACPI */
>  static inline void acpi_pci_add_bus(struct pci_bus *bus) { }
> 
>
Benjamin Herrenschmidt June 12, 2019, 12:06 a.m. UTC | #10
On Tue, 2019-06-11 at 18:39 -0500, Bjorn Helgaas wrote:
> On Thu, Jun 06, 2019 at 07:00:12PM +1000, Benjamin Herrenschmidt wrote:
> > From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > 
> > On arm64 ACPI systems, we unconditionally reconfigure the entire PCI
> > hierarchy at boot. This is a departure from what is customary on ACPI
> > systems, and may break assumptions in some places (e.g., EFIFB), that
> > the kernel will leave BARs of enabled PCI devices where they are.
> > 
> > Given that PCI already specifies a device specific ACPI method (_DSM)
> > for PCI root bridge nodes that tells us whether the firmware thinks
> > the configuration should be left alone, let's sidestep the entire
> > policy debate about whether the PCI configuration should be preserved
> > or not, and put it under the control of the firmware instead.
> 
> The current PCI Firmware spec r3.2 specifies _DSM function 5 for
> PCI-to-PCI bridge objects, which does not include host bridge
> (PNP0A03) nodes, but the proposed revision does allow it under host
> bridges.  So I'm fine with this, but we should update the commit log
> so it doesn't say "PCI *already* specifies this".
> 
> > [BenH: Added pci_assign_unassigned_root_bus_resources()]
> > 
> > Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> 
> I think you should add a signed-off-by for yourself?

I should, I forgot. That said, Lorenzo wants to wait for the actual
ECN... and we're also discussing some details.
> 

 .../...

> > +	/*
> > +	 * Invoke the PCI device specific method (_DSM) #5 'Ignore PCI Boot
> > +	 * Configuration', which tells us whether the firmware wants us to
> > +	 * preserve the configuration of the PCI resource tree for this root
> > +	 * bridge.
> > +	 */
> > +	obj = acpi_evaluate_dsm(ACPI_HANDLE(bus->bridge), &pci_acpi_dsm_guid, 1,
> > +	                        IGNORE_PCI_BOOT_CONFIG_DSM, NULL);
> > +	if (obj && obj->type == ACPI_TYPE_INTEGER && obj->integer.value == 0) {
> 
> This is fine, but can we make a tiny step toward doing this in generic
> code instead of adding more arch-specific stuff?
> 
> E.g., evaluate the _DSM in the generic acpi_pci_root_add(), set a
> "preserve_config" bit in the struct acpi_pci_root, and test the bit
> here?

I'd rather have the flag in the host bridge no ?

> It would also be nice to add a printk in the oter
> pci_acpi_scan_root() implementations if the bit is set so we know that
> the platform supplied the _DSM but we're ignoring it.

Ok.

Talking of which, look at the ongoing discussion I have with Lorenzo
when it comes to pci_bus_claim_resources vs. what x86 does, I'd love
for you to chime in. I'd like to try to consolidate things further
accross architectures but there might be reasons I don't see as to why
things are different in that area, so ...

Cheers,
Ben.
Lorenzo Pieralisi June 12, 2019, 10:08 a.m. UTC | #11
On Wed, Jun 12, 2019 at 08:19:40AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2019-06-11 at 15:58 +0100, Lorenzo Pieralisi wrote:
> > 
> > 
> > 	if (obj && obj->type == ACPI_TYPE_INTEGER && obj->integer.value == 0) {
> > 		/* preserve existing resource assignment */
> > 		pci_bus_claim_resources(bus);
> > 	}
> > 
> > 	pci_bus_size_bridges(bus);
> > 	pci_bus_assign_resources(bus);
> 
> So that makes me nervous... my understanding is that the pair
> 
> 	pci_bus_size_bridges(bus);
>  	pci_bus_assign_resources(bus);
> 
> Is intended for full reassignment. Now they will try to skip resources
> that already have a parent, but that's yet another code path. What's
> wrong with pci_unassigned_* ? That's what it's meant for...

Nothing wrong, we have not understood each others. What I am asking
is to write it like this:

if (obj && obj->type == ACPI_TYPE_INTEGER && obj->integer.value == 0) {
	/* preserve existing resource assignment */
	pci_bus_claim_resources(bus);
}

/* this code path should be common, indipendent of _DSM #5
pci_assign_unassigned_root_bus_resources(bus);

There is no reason to have two distinct code paths, current code
can call:

pci_assign_unassigned_root_bus_resources(bus);

instead of

pci_bus_size_bridges(bus);
pci_bus_assign_resources(bus);

Actually, current code is *questionable* given that AFAICS it
does not account for hotplug bridges additional memory window
size.

> > That's how it should be I think:
> > 
> > 1) we do not want pci_assign_unassigned_root_bus_resources(bus) to
> >    reallocate resources already claimed (see realloc parameter), do we ?
> 
> Well, realloc is useful to handle SR_IOV when the BIOS doesn't do it
> right (common case). That said, at this point, we should be able to
> honor IORESOURCE_PCI_FIXED for things that have _DSM #5 since they have
> been claimed. I don't see that realloc logic being a problem for us,
> and I want to avoid gratuitous differences with x86, but maybe I'm
> missing something here...

See above. The conditions that make IORESOURCE_PCI_FIXED work are
still unclear to me.

> > 2) pci_bus_size_bridges(bus) and pci_bus_assign_resources(bus) should
> >    not interfere with resources already claimed so it *should* be safe
> >    to call them anyway
> 
> Sure, *should* and here we introduce yet another way of doing things
> though... Any reason we don't want to do what x86 does here ?

No, see above, keeping in mind that what x86 does is still not
very well defined AFAICS.

> > Most importantly: I want everyone to agree that claiming is equivalent
> > to making a resource immutable (except for realloc, see (1) above)
> > because that's what we are doing by claiming on _DSM #5 == 0.
> 
> I think the combination of claiming *and* IORESOURCE_PCI_FIXED is what
> makes it *really* immutable. I'm a bit confused by the realloc logic
> right now, I'll need more quality time looking at it after ingesting
> more caffeing but I'm under the impression that it will honor the flag.

Likewise, this requires some fuzzing because it is really hard to
understand by perusing the code.

> > There are too many ways to make a resource immutable in the kernel
> > and this is confusing and prone to bugs.
> 
> It is, but I don't want to create new ways of doing things, and what
> you seem to propose is a new way imho :-)

No, see above. What I am saying is that we have IORESOURCE_PCI_FIXED,
res->parent != NULL and Enhanced allocation to make a BAR immutable and
before going any further it would be great if we understand their
interaction given that we are starting from a pseudo clean slate.

If we fail to do that it is quirks later (and I would rather avoid
seeing x86 command line parameters to work around them).

Cheers,
Lorenzo

> Cheers,
> Ben.
> 
> > Thanks,
> > Lorenzo
> > 
> > > +	ACPI_FREE(obj);
> > >  
> > >  	list_for_each_entry(child, &bus->children, node)
> > >  		pcie_bus_configure_settings(child);
> > > diff --git a/include/linux/pci-acpi.h b/include/linux/pci-acpi.h
> > > index 8082b612f561..62b7fdcc661c 100644
> > > --- a/include/linux/pci-acpi.h
> > > +++ b/include/linux/pci-acpi.h
> > > @@ -107,9 +107,10 @@ static inline void acpiphp_check_host_bridge(struct acpi_device *adev) { }
> > >  #endif
> > >  
> > >  extern const guid_t pci_acpi_dsm_guid;
> > > -#define DEVICE_LABEL_DSM	0x07
> > > -#define RESET_DELAY_DSM		0x08
> > > -#define FUNCTION_DELAY_DSM	0x09
> > > +#define IGNORE_PCI_BOOT_CONFIG_DSM	0x05
> > > +#define DEVICE_LABEL_DSM		0x07
> > > +#define RESET_DELAY_DSM			0x08
> > > +#define FUNCTION_DELAY_DSM		0x09
> > >  
> > >  #else	/* CONFIG_ACPI */
> > >  static inline void acpi_pci_add_bus(struct pci_bus *bus) { }
> > > 
> > > 
>
Lorenzo Pieralisi June 12, 2019, 10:21 a.m. UTC | #12
On Wed, Jun 12, 2019 at 08:09:01AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2019-06-11 at 15:31 +0100, Lorenzo Pieralisi wrote:
> > 
> > True, minus specs update schedule, I can't change that and merging
> > this patch (and firmware thereof) relies on specifications that
> > are intent changes till they become an ECN (~another merge window,
> > so this patch could land at v5.4).
> 
> Hrm... annoying for us but I understand your reasoning.

If we can wait it is better, also because it gives us time to
bring this discussion to completion.

> > The other option is doing what this patch does *without* relying
> > on _DSM #5, we may have regressions unfortunately though.
> 
> We could work around regressions with quirks I suppose. It does make
> sense to assume that if you have ACPI and UEFI, you have a decent PCI
> BAR assignment at boot in the "general case". That said, we need to
> double check first that pci_bus_claim_resources() will not do horrible
> things on partially assigned setups, since there's a real interest in
> doing that in the field.
> 
> > It is kind of orthogonal (but not really), bus numbers assignment
> > is _not_ in line with resource assignment at the moment and I want
> > to change it.
> 
> Hrm. We should probably reassign bus numbers if we reassign resources
> yes, but then I'd like us to not reassign resources unless we have to
> :-)

But for that we can use _DSM #5 returning 0, at least we would
be consistent.

Current situation is inconsistent and that bothers me I can put
together a separate patch and send it as an RFT, there are not
that many ARM64 PCI ACPI platforms to test it on.

> > a stab at patching the kernel so that it reassigns bus numbers by
> > default and toggle that behaviour on _DSM #5 == 0 detection.
> > 
> > I doubt that reassigning bus numbers by default can trigger
> > regressions on existing platforms but the only way to figure
> > it out is by testing it.
> >
> > > My thinking is if we converge everybody toward the x86 method of
> > > doing
> > > a 2 pass survey of existing resources followed by
> > > assign_unassigned,
> > 
> > I am not entirely sure we need a 2-pass survey,
> >
> > pci_bus_claim_resources()
> > 
> > should be enough; if it is not we update it.
> 
> So it's not so much about the 2 passes per-se, though they have value,
> it's more about consolidating archs to do the same thing. Chances that
> we change x86 are nil. But we can change powerpc and arm64 to do like
> x86 and move that code to generic.

Agree on that.

> pci_bus_claim_resources() seems to be a "lightweight" variant of the
> survey done by x86. The main differences I can see are:
> 
>  - The 2 passes thing which we may or may not care about, its main
> purpose is to favor resources that are already enabled by the BIOS in
> case of conflicts as far as I understand.

Yes.

>  - pci_read_bridge_bases() is done by pci_bus_claim_resources(), while
> x86 (and powerpc and others) do it in their pcibios_fixup_bus. That one
> is interesting... Any reason why we shouldn't unconditionally read the
> bridges while probing ? Bjorn ?

I tried and failed miserably:

https://lore.kernel.org/linux-pci/20150916085850.GA17510@red-moon/

>  - When allocating bridge resources, there are interesting differences:
> 
>   * x86 (and powerpc to some extent): If one has a 0 start or we fail
> to claim it, x86 will wipe out the resource struct (including flags). I
> assume that pci_assign_unassign_* will restore bridges when needed but
> I haven't verified. 
> 
>   * pci_bus_claim_resources() is dumber in that regard. It will call
> pci_claim_bridge_resources() blindly try to claim whatever is there
> even if res->start is 0. This could be a problem with partially
> assigned trees. It also doesn't wipe the resource in case of failure to
> claim which could be a problem going down the tree and letting children
> attach to the non-claimed resource, thus potentially causing the
> reassign pass to fail.
> 
> The r->start == 0 test is interesting ... the generic claim code will
> honor IORESOURCE_UNSET but we don't seem to set that generically unless
> we hit some of the specific pass for explicit resource alignment, or
> during the reassignment phases.
> 
>  - When allocating device resources, the main difference other than the
> 2 passes is that x86 will "0 base" the resource (r->end -= r->start; r-
> >start = 0) for later reassignment. The claim path we use won't do
> that. Note: none sets IORESOURCE_UNSET... Additionally x86 has some
> oddball code to save the original FW values and restore them if
> assignment later fails, which is somewhat odd since there's a conflict
> but probably helps really broken setups.
> 
>  - x86 will not claim ROMs in that pass, it does a 3rd pass just for
> them (it's common I think to not have room for all the ROMs). It also
> disables them in config space during the survey.
> pci_bus_claim_resources() will claim everything and leave ROMs enabled.
> 
> So as a somewhat temprary conclusion, I think the main difference here
> is what happens when claim fails (also the res->start = 0 case which we
> need to look at more closely) and whether we should make the generic
> code also "0-base" the resource.

Oh my, res->start == 0, another can of worms. Honestly I do not know
what to do on that one mostly because we need to figure out how it
plays with resource assignment code (and legacy stuff, you know the
drill).

> 
> The question for me really is, do we want to just "upgrade" (if
> necessary) pci_bus_claim_resources() and continue having x86 do its own
> thing for ever, or do we want to consolidate around what is probably
> the most tested platform when it comes to PCI :-)

Consolidating is the right thing to do, with the caveats above, there
are many but you have all my support.

> And if we consolidate, I think that won't be by changing what x86 does,
> that code is the result of decades of fiddling to get things right with
> all sorts of broken BIOSes...

There is 0 chance to change x86 code (and there is 0 chance to change
core PCI code with x86 assumptions in it).

Cheers,
Lorenzo
Benjamin Herrenschmidt June 12, 2019, 10:58 a.m. UTC | #13
On Wed, 2019-06-12 at 11:08 +0100, Lorenzo Pieralisi wrote:
> 
> Nothing wrong, we have not understood each others. What I am asking
> is to write it like this:
> 
> if (obj && obj->type == ACPI_TYPE_INTEGER && obj->integer.value == 0) {
> 	/* preserve existing resource assignment */
> 	pci_bus_claim_resources(bus);
> }
> 
> /* this code path should be common, indipendent of _DSM #5
> pci_assign_unassigned_root_bus_resources(bus);
> 
> There is no reason to have two distinct code paths, current code
> can call:
> 
> pci_assign_unassigned_root_bus_resources(bus);
> 
> instead of
> 
> pci_bus_size_bridges(bus);
> pci_bus_assign_resources(bus);
> 
> Actually, current code is *questionable* given that AFAICS it
> does not account for hotplug bridges additional memory window
> size.

Ah yes, I see, ok, I'll play with that in qemu a bit tomorrow
make sure we aren't missing something obvious & update the patch.

> > > That's how it should be I think:
> > > 
> > > 1) we do not want pci_assign_unassigned_root_bus_resources(bus) to
> > >    reallocate resources already claimed (see realloc parameter), do we ?
> > 
> > Well, realloc is useful to handle SR_IOV when the BIOS doesn't do it
> > right (common case). That said, at this point, we should be able to
> > honor IORESOURCE_PCI_FIXED for things that have _DSM #5 since they have
> > been claimed. I don't see that realloc logic being a problem for us,
> > and I want to avoid gratuitous differences with x86, but maybe I'm
> > missing something here...
> 
> See above. The conditions that make IORESOURCE_PCI_FIXED work are
> still unclear to me.

I more/less convinced myself it doesn't in the full-reassignment case
:-) As for the more traditional survey + assign, it can work if the
original survey led to reasonable values, mostly by preventing (I
*think*, I still need to convince myself) the realloc path for kicking
in, but this is all very fishy... At least x86 does use that flag at
least in one place from my grepping around...

> > > 2) pci_bus_size_bridges(bus) and pci_bus_assign_resources(bus) should
> > >    not interfere with resources already claimed so it *should* be safe
> > >    to call them anyway
> > 
> > Sure, *should* and here we introduce yet another way of doing things
> > though... Any reason we don't want to do what x86 does here ?
> 
> No, see above, keeping in mind that what x86 does is still not
> very well defined AFAICS.

It's not defined, it's grown through an evolutionary process :-) It's
still by far the most tested with the widest range of crazy PCI devices
and BIOSes I would think.

> > > Most importantly: I want everyone to agree that claiming is equivalent
> > > to making a resource immutable (except for realloc, see (1) above)
> > > because that's what we are doing by claiming on _DSM #5 == 0.
> > 
> > I think the combination of claiming *and* IORESOURCE_PCI_FIXED is what
> > makes it *really* immutable. I'm a bit confused by the realloc logic
> > right now, I'll need more quality time looking at it after ingesting
> > more caffeing but I'm under the impression that it will honor the flag.
> 
> Likewise, this requires some fuzzing because it is really hard to
> understand by perusing the code.

Yup.

> > > There are too many ways to make a resource immutable in the kernel
> > > and this is confusing and prone to bugs.
> > 
> > It is, but I don't want to create new ways of doing things, and what
> > you seem to propose is a new way imho :-)
> 
> No, see above. What I am saying is that we have IORESOURCE_PCI_FIXED,
> res->parent != NULL and Enhanced allocation to make a BAR immutable and
> before going any further it would be great if we understand their
> interaction given that we are starting from a pseudo clean slate.

> If we fail to do that it is quirks later (and I would rather avoid
> seeing x86 command line parameters to work around them).

Well.. while I would hate to have to *use* the x86 command line
parameters, I also don't like them not existing on all archs, as they
can actually be useful to help diagnose issues if anything (with of
course the intention of fixing things so they aren't needed in the long
run of course).

Cheers,
Ben.

> Cheers,
> Lorenzo
> 
> > Cheers,
> > Ben.
> > 
> > > Thanks,
> > > Lorenzo
> > > 
> > > > +	ACPI_FREE(obj);
> > > >  
> > > >  	list_for_each_entry(child, &bus->children, node)
> > > >  		pcie_bus_configure_settings(child);
> > > > diff --git a/include/linux/pci-acpi.h b/include/linux/pci-acpi.h
> > > > index 8082b612f561..62b7fdcc661c 100644
> > > > --- a/include/linux/pci-acpi.h
> > > > +++ b/include/linux/pci-acpi.h
> > > > @@ -107,9 +107,10 @@ static inline void acpiphp_check_host_bridge(struct acpi_device *adev) { }
> > > >  #endif
> > > >  
> > > >  extern const guid_t pci_acpi_dsm_guid;
> > > > -#define DEVICE_LABEL_DSM	0x07
> > > > -#define RESET_DELAY_DSM		0x08
> > > > -#define FUNCTION_DELAY_DSM	0x09
> > > > +#define IGNORE_PCI_BOOT_CONFIG_DSM	0x05
> > > > +#define DEVICE_LABEL_DSM		0x07
> > > > +#define RESET_DELAY_DSM			0x08
> > > > +#define FUNCTION_DELAY_DSM		0x09
> > > >  
> > > >  #else	/* CONFIG_ACPI */
> > > >  static inline void acpi_pci_add_bus(struct pci_bus *bus) { }
> > > > 
> > > >
Bjorn Helgaas June 12, 2019, 1:27 p.m. UTC | #14
On Wed, Jun 12, 2019 at 10:06:06AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2019-06-11 at 18:39 -0500, Bjorn Helgaas wrote:

> > This is fine, but can we make a tiny step toward doing this in generic
> > code instead of adding more arch-specific stuff?
> > 
> > E.g., evaluate the _DSM in the generic acpi_pci_root_add(), set a
> > "preserve_config" bit in the struct acpi_pci_root, and test the bit
> > here?
> 
> I'd rather have the flag in the host bridge no ?

Oh, of course, that would make more sense.

> Talking of which, look at the ongoing discussion I have with Lorenzo
> when it comes to pci_bus_claim_resources vs. what x86 does, I'd love
> for you to chime in. I'd like to try to consolidate things further
> accross architectures but there might be reasons I don't see as to why
> things are different in that area, so ...

I don't know any reasons why things are different per arch.  In most
cases I suspect FUD.

Speaking of which, *this* patch looks like FUD because it essentially
says "Linux shouldn't change the PCI configuration on this system" but
it offers no explanation of *why* the config needs to be preserved.  I
would really like some note like "run-time firmware depends on the
addresses of device X".

Bjorn
Benjamin Herrenschmidt June 12, 2019, 9:46 p.m. UTC | #15
On Wed, 2019-06-12 at 08:27 -0500, Bjorn Helgaas wrote:
> On Wed, Jun 12, 2019 at 10:06:06AM +1000, Benjamin Herrenschmidt
> wrote:
> > On Tue, 2019-06-11 at 18:39 -0500, Bjorn Helgaas wrote:
> > > This is fine, but can we make a tiny step toward doing this in
> > > generic
> > > code instead of adding more arch-specific stuff?
> > > 
> > > E.g., evaluate the _DSM in the generic acpi_pci_root_add(), set a
> > > "preserve_config" bit in the struct acpi_pci_root, and test the
> > > bit
> > > here?
> > 
> > I'd rather have the flag in the host bridge no ?
> 
> Oh, of course, that would make more sense.

Thinking of this ... I still believe eventually the default should be
to preseve the BIOS config (see ongoing discussions about converging
everybody toward the x86 model), so the flag should be the other way
around.

I'm thinking moving PROBE_ONLY and REASSIGN_ALL_RSRC/BUS to be host
bridge flags (initially copied from the global flags).

To not break things, ARM64 would start setting that 'reassign all' by
default, then we can flip it.

> > Talking of which, look at the ongoing discussion I have with Lorenzo
> > when it comes to pci_bus_claim_resources vs. what x86 does, I'd love
> > for you to chime in. I'd like to try to consolidate things further
> > accross architectures but there might be reasons I don't see as to why
> > things are different in that area, so ...
> 
> I don't know any reasons why things are different per arch.  In most
> cases I suspect FUD.
>
> Speaking of which, *this* patch looks like FUD because it essentially
> says "Linux shouldn't change the PCI configuration on this system" but
> it offers no explanation of *why* the config needs to be preserved.  I
> would really like some note like "run-time firmware depends on the
> addresses of device X".

Oh there are a number of reasons as to why but yes, that's one of them.
I can improve the comments.

That said, I think everybody tends to agree that reassigning everything
by default isn't a great idea, so while I still need something like
this patch in asap, I'll be working towards making that stuff more
common accross archs.

My logic is that x86 is the most tested arch with the widest range of
PCI devices and broken BIOSes. So what works for x86 should work for
others (minus maybe a handful of quirks). So it doesn't make sense to
have the main resource survey logic stuck in arch/x86 and everybody
else growing different things.

Cheers,
Ben.
Benjamin Herrenschmidt June 12, 2019, 10:05 p.m. UTC | #16
On Wed, 2019-06-12 at 11:21 +0100, Lorenzo Pieralisi wrote:

> Hrm. We should probably reassign bus numbers if we reassign resources
> > yes, but then I'd like us to not reassign resources unless we have to
> > :-)
> 
> But for that we can use _DSM #5 returning 0, at least we would
> be consistent.

Yes we should be consistent. My personal preference would be to always
honor FW resources by default regardless of DSM, and have DSM = 1 force
a full reassign instead. In a way, by always reassigning we send the
wrong message to FW folks, that it's ok to be broken bcs Linux will fix
it up..

Do you know how Windows deals with this ?

> Current situation is inconsistent and that bothers me I can put
> together a separate patch and send it as an RFT, there are not
> that many ARM64 PCI ACPI platforms to test it on.

Ok.

 ../..

> >  - pci_read_bridge_bases() is done by pci_bus_claim_resources(), while
> > x86 (and powerpc and others) do it in their pcibios_fixup_bus. That one
> > is interesting... Any reason why we shouldn't unconditionally read the
> > bridges while probing ? Bjorn ?
> 
> I tried and failed miserably:
> 
> https://lore.kernel.org/linux-pci/20150916085850.GA17510@red-moon/

Yes, I see... I think we can revive this if we key it off not
reassigning all resources.

There's a PCI flag PCI_REASSIGN_ALL_RSRC that's currently only use on
powerpc, but it wouldn't be hard to make sure it's set on archs that do
a full reassign. We could then have the generic code key off that.

That said, I'd rather have this be a host bridge flag. I'll look into
it later.

> >  - When allocating bridge resources, there are interesting differences:
> > 
> >   * x86 (and powerpc to some extent): If one has a 0 start or we fail
> > to claim it, x86 will wipe out the resource struct (including flags). I
> > assume that pci_assign_unassign_* will restore bridges when needed but
> > I haven't verified. 
> > 
> >   * pci_bus_claim_resources() is dumber in that regard. It will call
> > pci_claim_bridge_resources() blindly try to claim whatever is there
> > even if res->start is 0. This could be a problem with partially
> > assigned trees. It also doesn't wipe the resource in case of failure to
> > claim which could be a problem going down the tree and letting children
> > attach to the non-claimed resource, thus potentially causing the
> > reassign pass to fail.
> > 
> > The r->start == 0 test is interesting ... the generic claim code will
> > honor IORESOURCE_UNSET but we don't seem to set that generically unless
> > we hit some of the specific pass for explicit resource alignment, or
> > during the reassignment phases.
> > 
> >  - When allocating device resources, the main difference other than the
> > 2 passes is that x86 will "0 base" the resource (r->end -= r->start; r-
> > > start = 0) for later reassignment. The claim path we use won't do
> > 
> > that. Note: none sets IORESOURCE_UNSET... Additionally x86 has some
> > oddball code to save the original FW values and restore them if
> > assignment later fails, which is somewhat odd since there's a conflict
> > but probably helps really broken setups.
> > 
> >  - x86 will not claim ROMs in that pass, it does a 3rd pass just for
> > them (it's common I think to not have room for all the ROMs). It also
> > disables them in config space during the survey.
> > pci_bus_claim_resources() will claim everything and leave ROMs enabled.
> > 
> > So as a somewhat temprary conclusion, I think the main difference here
> > is what happens when claim fails (also the res->start = 0 case which we
> > need to look at more closely) and whether we should make the generic
> > code also "0-base" the resource.
> 
> Oh my, res->start == 0, another can of worms. Honestly I do not know
> what to do on that one mostly because we need to figure out how it
> plays with resource assignment code (and legacy stuff, you know the
> drill).

Yes. We have that funny pcibios_uninitialized_bridge_resource() in
arch/powerpc/kernel/pci-common.c which tries to "guess" whether a
bridge with a 0 res->start means that it's uninitialized or has a
"valid" 0 based resource. Among others, we check if memory decoding is
enabled, etc... If we decide it's really uninitialized we set
IORESOURCE_UNSET, and we rely on that later on.

In an idea world, nobody should create valid 0 based resources, it's
best to stay off the first 1MB of PCI space due to various legacy
concerns anyways but ...

> > The question for me really is, do we want to just "upgrade" (if
> > necessary) pci_bus_claim_resources() and continue having x86 do its own
> > thing for ever, or do we want to consolidate around what is probably
> > the most tested platform when it comes to PCI :-)
> 
> Consolidating is the right thing to do, with the caveats above, there
> are many but you have all my support.
> 
> > And if we consolidate, I think that won't be by changing what x86 does,
> > that code is the result of decades of fiddling to get things right with
> > all sorts of broken BIOSes...
> 
> There is 0 chance to change x86 code (and there is 0 chance to change
> core PCI code with x86 assumptions in it).

I wouldn't say 0 but the bar is high yes.

Cheers,
Ben.
Benjamin Herrenschmidt June 12, 2019, 11:58 p.m. UTC | #17
On Wed, 2019-06-12 at 08:27 -0500, Bjorn Helgaas wrote:
> Speaking of which, *this* patch looks like FUD because it essentially
> says "Linux shouldn't change the PCI configuration on this system" but
> it offers no explanation of *why* the config needs to be preserved.  I
> would really like some note like "run-time firmware depends on the
> addresses of device X".

BTW. the patch doesn't say that. ACPI tells us that :-)

Cheers
Ben.
diff mbox series

Patch

diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
index bb85e2f4603f..6358e1cb4f9f 100644
--- a/arch/arm64/kernel/pci.c
+++ b/arch/arm64/kernel/pci.c
@@ -168,6 +168,7 @@  struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
 	struct acpi_pci_generic_root_info *ri;
 	struct pci_bus *bus, *child;
 	struct acpi_pci_root_ops *root_ops;
+	union acpi_object *obj;
 
 	ri = kzalloc(sizeof(*ri), GFP_KERNEL);
 	if (!ri)
@@ -193,8 +194,26 @@  struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
 	if (!bus)
 		return NULL;
 
-	pci_bus_size_bridges(bus);
-	pci_bus_assign_resources(bus);
+	/*
+	 * Invoke the PCI device specific method (_DSM) #5 'Ignore PCI Boot
+	 * Configuration', which tells us whether the firmware wants us to
+	 * preserve the configuration of the PCI resource tree for this root
+	 * bridge.
+	 */
+	obj = acpi_evaluate_dsm(ACPI_HANDLE(bus->bridge), &pci_acpi_dsm_guid, 1,
+	                        IGNORE_PCI_BOOT_CONFIG_DSM, NULL);
+	if (obj && obj->type == ACPI_TYPE_INTEGER && obj->integer.value == 0) {
+		/* preserve existing resource assignment */
+		pci_bus_claim_resources(bus);
+
+		/* Assign anything that might have been left out */
+		pci_assign_unassigned_root_bus_resources(bus);
+	} else {
+		/* reconfigure the resource tree from scratch */
+		pci_bus_size_bridges(bus);
+		pci_bus_assign_resources(bus);
+	}
+	ACPI_FREE(obj);
 
 	list_for_each_entry(child, &bus->children, node)
 		pcie_bus_configure_settings(child);
diff --git a/include/linux/pci-acpi.h b/include/linux/pci-acpi.h
index 8082b612f561..62b7fdcc661c 100644
--- a/include/linux/pci-acpi.h
+++ b/include/linux/pci-acpi.h
@@ -107,9 +107,10 @@  static inline void acpiphp_check_host_bridge(struct acpi_device *adev) { }
 #endif
 
 extern const guid_t pci_acpi_dsm_guid;
-#define DEVICE_LABEL_DSM	0x07
-#define RESET_DELAY_DSM		0x08
-#define FUNCTION_DELAY_DSM	0x09
+#define IGNORE_PCI_BOOT_CONFIG_DSM	0x05
+#define DEVICE_LABEL_DSM		0x07
+#define RESET_DELAY_DSM			0x08
+#define FUNCTION_DELAY_DSM		0x09
 
 #else	/* CONFIG_ACPI */
 static inline void acpi_pci_add_bus(struct pci_bus *bus) { }