
x86/pci: Stop requiring MMCONFIG to be declared in E820, ACPI or EFI for newer systems

Message ID 20231205154845.44463-1-mario.limonciello@amd.com
State New
Series x86/pci: Stop requiring MMCONFIG to be declared in E820, ACPI or EFI for newer systems

Commit Message

Mario Limonciello Dec. 5, 2023, 3:48 p.m. UTC
commit 7752d5cfe3d1 ("x86: validate against acpi motherboard resources")
introduced checks to ensure that the MCFG table is also backed by memory
region reservations, so that a buggy BIOS does not introduce conflicts.

Over time, other types of reservation checks have been added for ACPI PNP
resources and the EFI MMIO memory type.  The PCI firmware spec, however,
says that these checks are only required when the operating system
doesn't comprehend the firmware region:

```
If the operating system does not natively comprehend reserving the MMCFG
region, the MMCFG region must be reserved by firmware. The address range
reported in the MCFG table or by _CBA method (see Section 4.1.3) must be
reserved by declaring a motherboard resource. For most systems, the
motherboard resource would appear at the root of the ACPI namespace
(under \_SB) in a node with a _HID of EISAID (PNP0C02), and the resources
in this case should not be claimed in the root PCI bus’s _CRS. The
resources can optionally be returned in Int15 E820h or EFIGetMemoryMap
as reserved memory but must always be reported through ACPI as a
motherboard resource.
```

Running this check causes problems with accessing extended PCI
configuration space on OEM laptops that don't specify the region in PNP
resources or in the EFI memory map. That later manifests as problems with
the dGPU and with accessing the resizable BAR. Similar problems don't exist
in Windows 11 with the exact same laptop/firmware stack, and per discussion
with AMD's BIOS team, Windows doesn't have similar checks.

As this series of checks was first introduced as a mitigation for buggy
BIOSes before EFI was introduced, add a BIOS date check so that the checks
are only enforced on hardware that predates the release of Windows 11.

Link: https://members.pcisig.com/wg/PCI-SIG/document/15350
      PCI Firmware Specification 3.3
      Section 4.1.2 MCFG Table Description Note 2
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
---
 arch/x86/pci/mmconfig-shared.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

Comments

Bjorn Helgaas Dec. 5, 2023, 4:17 p.m. UTC | #1
On Tue, Dec 05, 2023 at 09:48:45AM -0600, Mario Limonciello wrote:
> commit 7752d5cfe3d1 ("x86: validate against acpi motherboard resources")
> introduced checks for ensuring that MCFG table also has memory region
> reservations to ensure no conflicts were introduced from a buggy BIOS.
> 
> This has proceeded over time to add other types of reservation checks
> for ACPI PNP resources and EFI MMIO memory type.  The PCI firmware spec
> however says that these checks are only required when the operating system
> doesn't comprehend the firmware region:
> 
> ```
> If the operating system does not natively comprehend reserving the MMCFG
> region, the MMCFG region must be reserved by firmware. The address range
> reported in the MCFG table or by _CBA method (see Section 4.1.3) must be
> reserved by declaring a motherboard resource. For most systems, the
> motherboard resource would appear at the root of the ACPI namespace
> (under \_SB) in a node with a _HID of EISAID (PNP0C02), and the resources
> in this case should not be claimed in the root PCI bus’s _CRS. The
> resources can optionally be returned in Int15 E820h or EFIGetMemoryMap
> as reserved memory but must always be reported through ACPI as a
> motherboard resource.
> ```

My understanding is that native comprehension would mean Linux knows
how to discover and/or configure the MMCFG base address and size in
the hardware and that Linux would then reserve that region so it's not
used for anything else.

Linux doesn't have that, at least for x86.  It relies on the MCFG
table to discover the MMCFG region, and it relies on PNP0C02 _CRS to
reserve it.
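
To make the "discover via MCFG, reserve via _CRS" split concrete, the
reservation test can be pictured as a containment check: the window
reported by MCFG is only trusted if a single firmware-declared range
covers it completely. This is a minimal sketch in plain C, not the code in
arch/x86/pci/mmconfig-shared.c; `struct mem_range` and `range_is_covered()`
are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive physical address range, e.g. [mem 0xe0000000-0xefffffff]. */
struct mem_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Return true if 'win' lies entirely inside one of the 'n' ranges in
 * 'reserved'. A partial overlap is not treated as a reservation.
 */
static bool range_is_covered(const struct mem_range *win,
			     const struct mem_range *reserved, size_t n)
{
	assert(win->start <= win->end);

	for (size_t i = 0; i < n; i++) {
		if (win->start >= reserved[i].start &&
		    win->end <= reserved[i].end)
			return true;
	}
	return false;
}
```

With an empty list of reserved ranges, which is the situation described
for the OEM laptop, the function returns false for any window.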

> Running this check causes problems with accessing extended PCI
> configuration space on OEM laptops that don't specify the region in PNP
> resources or in the EFI memory map. That later manifests as problems with
> dGPU and accessing resizable BAR.

Is there a problem report we can reference here?

Does the problem still occur with this series?
https://lore.kernel.org/r/20231121183643.249006-1-helgaas@kernel.org

This appeared in linux-next 20231130.

> Similar problems don't exist in Windows 11 with exact same
> laptop/firmware stack, and in discussion with AMD's BIOS team
> Windows doesn't have similar checks.

I would love to know AMD BIOS team's take on this.  Does the BIOS
reserve the MMCFG space in any way?

> As this series of checks was first introduced as a mitigation for buggy
> BIOS before EFI was introduced add a BIOS date range to only enforce the
> checks on hardware that predates the release of Windows 11.

Many of the MMCFG checks in Linux are historical artifacts that are
likely related to Linux defects, not BIOS defects, so I wouldn't
expect to see them in Windows.  But it's hard to remove them now.

> Link: https://members.pcisig.com/wg/PCI-SIG/document/15350
>       PCI Firmware Specification 3.3
>       Section 4.1.2 MCFG Table Description Note 2
> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
> ---
>  arch/x86/pci/mmconfig-shared.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c
> index 4b3efaa82ab7..e4594b181ebf 100644
> --- a/arch/x86/pci/mmconfig-shared.c
> +++ b/arch/x86/pci/mmconfig-shared.c
> @@ -570,9 +570,13 @@ static void __init pci_mmcfg_reject_broken(int early)
>  
>  	list_for_each_entry(cfg, &pci_mmcfg_list, list) {
>  		if (pci_mmcfg_check_reserved(NULL, cfg, early) == 0) {
> -			pr_info(PREFIX "not using MMCONFIG\n");
> -			free_all_mmcfg();
> -			return;
> +			if (dmi_get_bios_year() >= 2021) {
> +				pr_info(PREFIX "MMCONFIG wasn't reserved by ACPI or EFI\n");

I think this leads to using the MMCONFIG area without reserving it
anywhere, so we may end up assigning that space to something else,
which won't work, i.e., the problem described here:
https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/commit/?id=5cef3014e02d
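
The failure mode can be sketched as a plain overlap test: if nothing marks
the MMCONFIG window as busy, an allocator is free to hand out an address
inside it. The snippet below is only an illustration under that
assumption; `struct mmio_window` and `windows_overlap()` are hypothetical
names, not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inclusive MMIO address range. */
struct mmio_window {
	uint64_t start;
	uint64_t end;
};

/* Two inclusive ranges overlap if each one starts before the other ends. */
static bool windows_overlap(const struct mmio_window *a,
			    const struct mmio_window *b)
{
	assert(a->start <= a->end && b->start <= b->end);

	return a->start <= b->end && b->start <= a->end;
}
```

A device range placed anywhere inside an unreserved MMCONFIG window would
make this return true, which is the kind of conflict a reservation is
meant to rule out.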

> +			} else {
> +				pr_info(PREFIX "not using MMCONFIG\n");
> +				free_all_mmcfg();
> +				return;
> +			}
>  		}
>  	}
>  }
> -- 
> 2.34.1
>
Mario Limonciello Dec. 5, 2023, 5 p.m. UTC | #2
On 12/5/2023 10:17, Bjorn Helgaas wrote:
> On Tue, Dec 05, 2023 at 09:48:45AM -0600, Mario Limonciello wrote:
>> commit 7752d5cfe3d1 ("x86: validate against acpi motherboard resources")
>> introduced checks for ensuring that MCFG table also has memory region
>> reservations to ensure no conflicts were introduced from a buggy BIOS.
>>
>> This has proceeded over time to add other types of reservation checks
>> for ACPI PNP resources and EFI MMIO memory type.  The PCI firmware spec
>> however says that these checks are only required when the operating system
>> doesn't comprehend the firmware region:
>>
>> ```
>> If the operating system does not natively comprehend reserving the MMCFG
>> region, the MMCFG region must be reserved by firmware. The address range
>> reported in the MCFG table or by _CBA method (see Section 4.1.3) must be
>> reserved by declaring a motherboard resource. For most systems, the
>> motherboard resource would appear at the root of the ACPI namespace
>> (under \_SB) in a node with a _HID of EISAID (PNP0C02), and the resources
>> in this case should not be claimed in the root PCI bus’s _CRS. The
>> resources can optionally be returned in Int15 E820h or EFIGetMemoryMap
>> as reserved memory but must always be reported through ACPI as a
>> motherboard resource.
>> ```
> 
> My understanding is that native comprehension would mean Linux knows
> how to discover and/or configure the MMCFG base address and size in
> the hardware and that Linux would then reserve that region so it's not
> used for anything else.
> 
> Linux doesn't have that, at least for x86.  It relies on the MCFG
> table to discover the MMCFG region, and it relies on PNP0C02 _CRS to
> reserve it.

Using MCFG to discover it matches the PCI firmware spec, but as I point
out above, the decision to reserve this region doesn't require
PNP0C01/PNP0C02 _CRS.

This is a decision made by Linux historically.

> 
>> Running this check causes problems with accessing extended PCI
>> configuration space on OEM laptops that don't specify the region in PNP
>> resources or in the EFI memory map. That later manifests as problems with
>> dGPU and accessing resizable BAR.
> 
> Is there a problem report we can reference here?

Nothing public to share. AMD BIOS team is in discussion with the OEM to 
add the reservation in a BIOS upgrade so it works with things like the 
LTS kernels.

Knowing that Windows works without it, I still feel this is something we
should look at fixing from an upstream perspective, which is what
prompted my patch and this discussion.

> 
> Does the problem still occur with this series?
> https://lore.kernel.org/r/20231121183643.249006-1-helgaas@kernel.org
> 
> This appeared in linux-next 20231130.

Thanks for sharing that.  If I do respin a variation of this patch I'll 
rebase on top of that.

I tried that series on top of 6.7-rc4, but it doesn't fix the issue
(whereas the patch I sent does).

# journalctl -k | grep ECAM
Dec 05 06:37:46 cl-fw-fedora kernel: PCI: ECAM [mem 0xe0000000-0xefffffff] (base 0xe0000000) for domain 0000 [bus 00-ff]
Dec 05 06:37:46 cl-fw-fedora kernel: PCI: not using ECAM ([mem 0xe0000000-0xefffffff] not reserved)
Dec 05 06:37:46 cl-fw-fedora kernel: PCI: ECAM [mem 0xe0000000-0xefffffff] (base 0xe0000000) for domain 0000 [bus 00-ff]
Dec 05 06:37:46 cl-fw-fedora kernel: PCI: [Firmware Info]: ECAM [mem 0xe0000000-0xefffffff] not reserved in ACPI motherboard resources
Dec 05 06:37:46 cl-fw-fedora kernel: PCI: not using ECAM ([mem 0xe0000000-0xefffffff] not reserved)
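
For reference, the size of that window follows from the standard PCIe
ECAM layout: 4 KiB of configuration space per function, 8 functions per
device and 32 devices per bus gives 1 MiB per bus, so buses 00-ff need
256 MiB, which is exactly 0xe0000000-0xefffffff. A small sketch of the
address arithmetic; `ecam_addr()` is a hypothetical helper written for
this illustration, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Physical address of a config register inside an ECAM window:
 * base + (bus << 20 | device << 15 | function << 12 | register).
 * Registers at 0x100 and above are the extended configuration space,
 * which the legacy 0xCF8/0xCFC port mechanism normally cannot reach.
 */
static uint64_t ecam_addr(uint64_t base, unsigned int bus, unsigned int dev,
			  unsigned int fn, unsigned int reg)
{
	assert(bus < 256 && dev < 32 && fn < 8 && reg < 4096);

	return base + (((uint64_t)bus << 20) | ((uint64_t)dev << 15) |
		       ((uint64_t)fn << 12) | reg);
}
```

The last byte of the last function on bus ff,
ecam_addr(0xe0000000, 0xff, 31, 7, 0xfff), lands on 0xefffffff, the end of
the range in the log.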

> 
>> Similar problems don't exist in Windows 11 with exact same
>> laptop/firmware stack, and in discussion with AMD's BIOS team
>> Windows doesn't have similar checks.
> 
> I would love to know AMD BIOS team's take on this.  Does the BIOS
> reserve the MMCFG space in any way?

On the AMD reference platform that this OEM system is based on, it is
reserved in the EFI memory map.  So with a 6.7-based kernel on the
reference system you can see this emitted:

PCI: MMCONFIG at [mem 0xe0000000-0xefffffff] reserved as EfiMemoryMappedIO

But on the OEM system this is not reserved by EFI memory map or _CRS.

That's why my assumption after reading the firmware spec and seeing the 
behavior is that Windows makes the reservation *based on* what's in MCFG.

> 
>> As this series of checks was first introduced as a mitigation for buggy
>> BIOS before EFI was introduced add a BIOS date range to only enforce the
>> checks on hardware that predates the release of Windows 11.
> 
> Many of the MMCFG checks in Linux are historical artifacts that are
> likely related to Linux defects, not BIOS defects, so I wouldn't
> expect to see them in Windows.  But it's hard to remove them now.

I guess I was hoping that by drawing a line in the sand we could avoid
breaking anything that was relying upon the older behavior.

> 
>> Link: https://members.pcisig.com/wg/PCI-SIG/document/15350
>>        PCI Firmware Specification 3.3
>>        Section 4.1.2 MCFG Table Description Note 2
>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>> ---
>>   arch/x86/pci/mmconfig-shared.c | 10 +++++++---
>>   1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c
>> index 4b3efaa82ab7..e4594b181ebf 100644
>> --- a/arch/x86/pci/mmconfig-shared.c
>> +++ b/arch/x86/pci/mmconfig-shared.c
>> @@ -570,9 +570,13 @@ static void __init pci_mmcfg_reject_broken(int early)
>>   
>>   	list_for_each_entry(cfg, &pci_mmcfg_list, list) {
>>   		if (pci_mmcfg_check_reserved(NULL, cfg, early) == 0) {
>> -			pr_info(PREFIX "not using MMCONFIG\n");
>> -			free_all_mmcfg();
>> -			return;
>> +			if (dmi_get_bios_year() >= 2021) {
>> +				pr_info(PREFIX "MMCONFIG wasn't reserved by ACPI or EFI\n");
> 
> I think this leads to using the MMCONFIG area without reserving it
> anywhere, so we may end up assigning that space to something else,
> which won't work, i.e., the problem described here:
> https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/commit/?id=5cef3014e02d
> 
>> +			} else {
>> +				pr_info(PREFIX "not using MMCONFIG\n");
>> +				free_all_mmcfg();
>> +				return;
>> +			}
>>   		}
>>   	}
>>   }
>> -- 
>> 2.34.1
>>
Bjorn Helgaas Dec. 5, 2023, 5:31 p.m. UTC | #3
On Tue, Dec 05, 2023 at 11:00:31AM -0600, Mario Limonciello wrote:
> On 12/5/2023 10:17, Bjorn Helgaas wrote:
> > On Tue, Dec 05, 2023 at 09:48:45AM -0600, Mario Limonciello wrote:
> > > commit 7752d5cfe3d1 ("x86: validate against acpi motherboard resources")
> > > introduced checks for ensuring that MCFG table also has memory region
> > > reservations to ensure no conflicts were introduced from a buggy BIOS.
> > > 
> > > This has proceeded over time to add other types of reservation checks
> > > for ACPI PNP resources and EFI MMIO memory type.  The PCI firmware spec
> > > however says that these checks are only required when the operating system
> > > doesn't comprehend the firmware region:
> > > 
> > > ```
> > > If the operating system does not natively comprehend reserving the MMCFG
> > > region, the MMCFG region must be reserved by firmware. The address range
> > > reported in the MCFG table or by _CBA method (see Section 4.1.3) must be
> > > reserved by declaring a motherboard resource. For most systems, the
> > > motherboard resource would appear at the root of the ACPI namespace
> > > (under \_SB) in a node with a _HID of EISAID (PNP0C02), and the resources
> > > in this case should not be claimed in the root PCI bus’s _CRS. The
> > > resources can optionally be returned in Int15 E820h or EFIGetMemoryMap
> > > as reserved memory but must always be reported through ACPI as a
> > > motherboard resource.
> > > ```
> > 
> > My understanding is that native comprehension would mean Linux knows
> > how to discover and/or configure the MMCFG base address and size in
> > the hardware and that Linux would then reserve that region so it's not
> > used for anything else.
> > 
> > Linux doesn't have that, at least for x86.  It relies on the MCFG
> > table to discover the MMCFG region, and it relies on PNP0C02 _CRS to
> > reserve it.
> 
> MCFG to discover it matches the PCI firmware spec, but as I point
> out above the decision to reserve this region doesn't require
> PNP0C01/PNP0C02 _CRS.

Can you explain this reasoning a little more?  I claim Linux does not
natively comprehend reserving the MMCFG region, but it sounds like you
don't agree?  I think "native" comprehension would mean Linux would
not need the MCFG table.

> This is a decision made by Linux historically.
>
> > > Running this check causes problems with accessing extended PCI
> > > configuration space on OEM laptops that don't specify the region in PNP
> > > resources or in the EFI memory map. That later manifests as problems with
> > > dGPU and accessing resizable BAR.
> > 
> > Is there a problem report we can reference here?
> 
> Nothing public to share. AMD BIOS team is in discussion with the OEM to add
> the reservation in a BIOS upgrade so it works with things like the LTS
> kernels.

Is there some reason this can't be made public (it's obviously fine to
redact proprietary details)?  It's really hard to make this code work
for all the cases even when we know all the details, and practically
impossible if we don't.

> Knowing Windows works without it I feel this is still something that we
> should be looking at fixing from an upstream perspective though which is
> what prompted my patch and discussion.

We definitely need to change Linux so it works correctly with firmware
in the field, whether that means fixing a Linux defect or working
around a firmware defect.

> > Does the problem still occur with this series?
> > https://lore.kernel.org/r/20231121183643.249006-1-helgaas@kernel.org
> > 
> > This appeared in linux-next 20231130.
> 
> Thanks for sharing that.  If I do respin a variation of this patch I'll
> rebase on top of that.
> 
> I had a try with that series on top of 6.7-rc4, but it doesn't fix the issue
> (but obviously the patch I sent does).
> 
> # journalctl -k | grep ECAM
> Dec 05 06:37:46 cl-fw-fedora kernel: PCI: ECAM [mem 0xe0000000-0xefffffff] (base 0xe0000000) for domain 0000 [bus 00-ff]
> Dec 05 06:37:46 cl-fw-fedora kernel: PCI: not using ECAM ([mem 0xe0000000-0xefffffff] not reserved)
> Dec 05 06:37:46 cl-fw-fedora kernel: PCI: ECAM [mem 0xe0000000-0xefffffff] (base 0xe0000000) for domain 0000 [bus 00-ff]
> Dec 05 06:37:46 cl-fw-fedora kernel: PCI: [Firmware Info]: ECAM [mem 0xe0000000-0xefffffff] not reserved in ACPI motherboard resources
> Dec 05 06:37:46 cl-fw-fedora kernel: PCI: not using ECAM ([mem 0xe0000000-0xefffffff] not reserved)

Can you boot with 'efi=debug dyndbg="file arch/x86/pci +p"' and share
the complete dmesg log (redacted if necessary) somewhere?  It's
important to know more about why and how this doesn't work.  I added
more debug logging, but possibly it's still not enough.

> > > Similar problems don't exist in Windows 11 with exact same
> > > laptop/firmware stack, and in discussion with AMD's BIOS team
> > > Windows doesn't have similar checks.
> > 
> > I would love to know AMD BIOS team's take on this.  Does the BIOS
> > reserve the MMCFG space in any way?
> 
> On the AMD reference platform this OEM system is based on it is reserved in
> the EFI memory map.  So on a 6.7 based kernel the reference system you can
> see this emitted:
> 
> PCI: MMCONFIG at [mem 0xe0000000-0xefffffff] reserved as EfiMemoryMappedIO

The EfiMemoryMappedIO entry is not a *reservation* (this was a poor
choice of words in the logging, and my series changes it).  This entry
only means the firmware requests that the OS map this region to a
virtual address so it can be used by EFI runtime services (UEFI v2.9,
sec 7.2).

> But on the OEM system this is not reserved by EFI memory map or _CRS.
> 
> That's why my assumption after reading the firmware spec and seeing the
> behavior is that Windows makes the reservation *based on* what's in MCFG.

Is there some spec language that says MCFG reserves space?  I'm not
aware of anything about ACPI static tables reserving MMIO space.
Here's my reasoning around static tables vs _CRS for reservations:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/PCI/acpi-info.rst?id=v6.6#n32

> > > As this series of checks was first introduced as a mitigation for buggy
> > > BIOS before EFI was introduced add a BIOS date range to only enforce the
> > > checks on hardware that predates the release of Windows 11.
> > 
> > Many of the MMCFG checks in Linux are historical artifacts that are
> > likely related to Linux defects, not BIOS defects, so I wouldn't
> > expect to see them in Windows.  But it's hard to remove them now.
> 
> I guess I was hoping that by cutting a line in the sand we could avoid
> breaking anything that was relying upon the older behavior.
> 
> > > Link: https://members.pcisig.com/wg/PCI-SIG/document/15350
> > >        PCI Firmware Specification 3.3
> > >        Section 4.1.2 MCFG Table Description Note 2
> > > Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
> > > ---
> > >   arch/x86/pci/mmconfig-shared.c | 10 +++++++---
> > >   1 file changed, 7 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c
> > > index 4b3efaa82ab7..e4594b181ebf 100644
> > > --- a/arch/x86/pci/mmconfig-shared.c
> > > +++ b/arch/x86/pci/mmconfig-shared.c
> > > @@ -570,9 +570,13 @@ static void __init pci_mmcfg_reject_broken(int early)
> > >   	list_for_each_entry(cfg, &pci_mmcfg_list, list) {
> > >   		if (pci_mmcfg_check_reserved(NULL, cfg, early) == 0) {
> > > -			pr_info(PREFIX "not using MMCONFIG\n");
> > > -			free_all_mmcfg();
> > > -			return;
> > > +			if (dmi_get_bios_year() >= 2021) {
> > > +				pr_info(PREFIX "MMCONFIG wasn't reserved by ACPI or EFI\n");
> > 
> > I think this leads to using the MMCONFIG area without reserving it
> > anywhere, so we may end up assigning that space to something else,
> > which won't work, i.e., the problem described here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/commit/?id=5cef3014e02d
> > 
> > > +			} else {
> > > +				pr_info(PREFIX "not using MMCONFIG\n");
> > > +				free_all_mmcfg();
> > > +				return;
> > > +			}
> > >   		}
> > >   	}
> > >   }
> > > -- 
> > > 2.34.1
> > > 
>
Mario Limonciello Dec. 5, 2023, 6:28 p.m. UTC | #4
On 12/5/2023 11:31, Bjorn Helgaas wrote:
> On Tue, Dec 05, 2023 at 11:00:31AM -0600, Mario Limonciello wrote:
>> On 12/5/2023 10:17, Bjorn Helgaas wrote:
>>> On Tue, Dec 05, 2023 at 09:48:45AM -0600, Mario Limonciello wrote:
>>>> commit 7752d5cfe3d1 ("x86: validate against acpi motherboard resources")
>>>> introduced checks for ensuring that MCFG table also has memory region
>>>> reservations to ensure no conflicts were introduced from a buggy BIOS.
>>>>
>>>> This has proceeded over time to add other types of reservation checks
>>>> for ACPI PNP resources and EFI MMIO memory type.  The PCI firmware spec
>>>> however says that these checks are only required when the operating system
>>>> doesn't comprehend the firmware region:
>>>>
>>>> ```
>>>> If the operating system does not natively comprehend reserving the MMCFG
>>>> region, the MMCFG region must be reserved by firmware. The address range
>>>> reported in the MCFG table or by _CBA method (see Section 4.1.3) must be
>>>> reserved by declaring a motherboard resource. For most systems, the
>>>> motherboard resource would appear at the root of the ACPI namespace
>>>> (under \_SB) in a node with a _HID of EISAID (PNP0C02), and the resources
>>>> in this case should not be claimed in the root PCI bus’s _CRS. The
>>>> resources can optionally be returned in Int15 E820h or EFIGetMemoryMap
>>>> as reserved memory but must always be reported through ACPI as a
>>>> motherboard resource.
>>>> ```
>>>
>>> My understanding is that native comprehension would mean Linux knows
>>> how to discover and/or configure the MMCFG base address and size in
>>> the hardware and that Linux would then reserve that region so it's not
>>> used for anything else.
>>>
>>> Linux doesn't have that, at least for x86.  It relies on the MCFG
>>> table to discover the MMCFG region, and it relies on PNP0C02 _CRS to
>>> reserve it.
>>
>> MCFG to discover it matches the PCI firmware spec, but as I point
>> out above the decision to reserve this region doesn't require
>> PNP0C01/PNP0C02 _CRS.
> 
> Can you explain this reasoning a little more?  I claim Linux does not
> natively comprehend reserving the MMCFG region, but it sounds like you
> don't agree?  I think "native" comprehension would mean Linux would
> not need the MCFG table.

After rereading our thread and the spec, I think you're right that Linux
doesn't natively comprehend (i.e. reserve) this region, particularly
because of the stance you have on "static table" vs _CRS.

> 
>> This is a decision made by Linux historically.
>>
>>>> Running this check causes problems with accessing extended PCI
>>>> configuration space on OEM laptops that don't specify the region in PNP
>>>> resources or in the EFI memory map. That later manifests as problems with
>>>> dGPU and accessing resizable BAR.
>>>
>>> Is there a problem report we can reference here?
>>
>> Nothing public to share. AMD BIOS team is in discussion with the OEM to add
>> the reservation in a BIOS upgrade so it works with things like the LTS
>> kernels.
> 
> Is there some reason this can't be made public (it's obviously fine to
> redact proprietary details)?  It's really hard to make this code work
> for all the cases even when we know all the details, and practically
> impossible if we don't.

I just don't want to throw the vendor under the bus as it could have 
been caught "sooner" and fixed by BIOS adding _CRS.

I'll share the full dmesg below just redacting the DMI information.

> 
>> Knowing Windows works without it I feel this is still something that we
>> should be looking at fixing from an upstream perspective though which is
>> what prompted my patch and discussion.
> 
> We definitely need to change Linux so it works correctly with firmware
> in the field, whether that means fixing a Linux defect or working
> around a firmware defect.
> 
>>> Does the problem still occur with this series?
>>> https://lore.kernel.org/r/20231121183643.249006-1-helgaas@kernel.org
>>>
>>> This appeared in linux-next 20231130.
>>
>> Thanks for sharing that.  If I do respin a variation of this patch I'll
>> rebase on top of that.
>>
>> I had a try with that series on top of 6.7-rc4, but it doesn't fix the issue
>> (but obviously the patch I sent does).
>>
>> # journalctl -k | grep ECAM
>> Dec 05 06:37:46 cl-fw-fedora kernel: PCI: ECAM [mem 0xe0000000-0xefffffff] (base 0xe0000000) for domain 0000 [bus 00-ff]
>> Dec 05 06:37:46 cl-fw-fedora kernel: PCI: not using ECAM ([mem 0xe0000000-0xefffffff] not reserved)
>> Dec 05 06:37:46 cl-fw-fedora kernel: PCI: ECAM [mem 0xe0000000-0xefffffff] (base 0xe0000000) for domain 0000 [bus 00-ff]
>> Dec 05 06:37:46 cl-fw-fedora kernel: PCI: [Firmware Info]: ECAM [mem 0xe0000000-0xefffffff] not reserved in ACPI motherboard resources
>> Dec 05 06:37:46 cl-fw-fedora kernel: PCI: not using ECAM ([mem 0xe0000000-0xefffffff] not reserved)
> 
> Can you boot with 'efi=debug dyndbg="file arch/x86/pci +p"' and share
> the complete dmesg log (redacted if necessary) somewhere?  It's
> important to know more about why and how this doesn't work.  I added
> more debug logging, but possibly it's still not enough.

Here you go (6.7-rc4 + that series you linked):
https://gist.github.com/superm1/eca87ae661793b9ab969829946adb084

> 
>>>> Similar problems don't exist in Windows 11 with exact same
>>>> laptop/firmware stack, and in discussion with AMD's BIOS team
>>>> Windows doesn't have similar checks.
>>>
>>> I would love to know AMD BIOS team's take on this.  Does the BIOS
>>> reserve the MMCFG space in any way?
>>
>> On the AMD reference platform this OEM system is based on it is reserved in
>> the EFI memory map.  So on a 6.7 based kernel the reference system you can
>> see this emitted:
>>
>> PCI: MMCONFIG at [mem 0xe0000000-0xefffffff] reserved as EfiMemoryMappedIO
> 
> The EfiMemoryMappedIO entry is not a *reservation* (this was a poor
> choice of words in the logging, and my series changes it).  This entry
> only means the firmware requests that the OS map this region to a
> virtual address so it can be used by EFI runtime services (UEFI v2.9,
> sec 7.2).

In that sense the only reason this works on the AMD reference platform 
is because that region happens to have been reserved from a subset of 
another region.

Per the stance on "static table", we should advocate for _CRS to be 
populated with MCFG on AMD reference platform too, right?

> 
>> But on the OEM system this is not reserved by EFI memory map or _CRS.
>>
>> That's why my assumption after reading the firmware spec and seeing the
>> behavior is that Windows makes the reservation *based on* what's in MCFG.
> 
> Is there some spec language that says MCFG reserves space?  I'm not
> aware of anything about ACPI static tables reserving MMIO space.
> Here's my reasoning around static tables vs _CRS for reservations:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/PCI/acpi-info.rst?id=v6.6#n32

Having read your stance, it makes more sense why we are where we are now.

Let me ask though - why does the distinction of old OS vs new OS matter?
If a vendor wants it to work with a kernel that didn't use MCFG to make
a reservation, then _CRS or some other overlapping reservation is their
only option.

But if we changed this behavior in a newer kernel then the stance can be
something like:
"upstream kernel 6.8 or newer will reserve MCFG if not specified by _CRS 
or any other overlapping reservation"
and
"upstream kernel 6.7 or older require explicit reservations".

It seems to me that this type of issue would entirely go away in most 
cases and it would satisfy the spec note about
'natively comprehend' reserving the MMCFG region.
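
As a rough sketch of the two stances, the decision could be written as a
small policy function. This only illustrates the proposal in this mail and
is not what any released kernel does; `enum mmcfg_action` and
`mmcfg_decide()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

enum mmcfg_action {
	MMCFG_USE,		/* firmware already reserved the window */
	MMCFG_RESERVE_AND_USE,	/* OS adds its own reservation, then uses it */
	MMCFG_REJECT,		/* do not use MMCONFIG at all */
};

/*
 * firmware_reserved:     the window is covered by _CRS or another
 *                        overlapping reservation.
 * os_reserves_from_mcfg: the OS is willing to reserve the window itself
 *                        based on the MCFG table (the proposed behavior).
 */
static enum mmcfg_action mmcfg_decide(bool firmware_reserved,
				      bool os_reserves_from_mcfg)
{
	if (firmware_reserved)
		return MMCFG_USE;
	if (os_reserves_from_mcfg)
		return MMCFG_RESERVE_AND_USE;
	return MMCFG_REJECT;
}
```

In these terms, the "6.7 or older" stance corresponds to
os_reserves_from_mcfg being false, and the "6.8 or newer" stance to it
being true.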
Mario Limonciello Dec. 5, 2023, 10:17 p.m. UTC | #5
On 12/5/2023 12:28, Mario Limonciello wrote:
> On 12/5/2023 11:31, Bjorn Helgaas wrote:
>> On Tue, Dec 05, 2023 at 11:00:31AM -0600, Mario Limonciello wrote:
>>> On 12/5/2023 10:17, Bjorn Helgaas wrote:
>>>> On Tue, Dec 05, 2023 at 09:48:45AM -0600, Mario Limonciello wrote:
>>>>> commit 7752d5cfe3d1 ("x86: validate against acpi motherboard resources")
>>>>> introduced checks for ensuring that MCFG table also has memory region
>>>>> reservations to ensure no conflicts were introduced from a buggy BIOS.
>>>>>
>>>>> This has proceeded over time to add other types of reservation checks
>>>>> for ACPI PNP resources and EFI MMIO memory type.  The PCI firmware spec
>>>>> however says that these checks are only required when the operating system
>>>>> doesn't comprehend the firmware region:
>>>>>
>>>>> ```
>>>>> If the operating system does not natively comprehend reserving the MMCFG
>>>>> region, the MMCFG region must be reserved by firmware. The address range
>>>>> reported in the MCFG table or by _CBA method (see Section 4.1.3) must be
>>>>> reserved by declaring a motherboard resource. For most systems, the
>>>>> motherboard resource would appear at the root of the ACPI namespace
>>>>> (under \_SB) in a node with a _HID of EISAID (PNP0C02), and the resources
>>>>> in this case should not be claimed in the root PCI bus’s _CRS. The
>>>>> resources can optionally be returned in Int15 E820h or EFIGetMemoryMap
>>>>> as reserved memory but must always be reported through ACPI as a
>>>>> motherboard resource.
>>>>> ```
>>>>
>>>> My understanding is that native comprehension would mean Linux knows
>>>> how to discover and/or configure the MMCFG base address and size in
>>>> the hardware and that Linux would then reserve that region so it's not
>>>> used for anything else.
>>>>
>>>> Linux doesn't have that, at least for x86.  It relies on the MCFG
>>>> table to discover the MMCFG region, and it relies on PNP0C02 _CRS to
>>>> reserve it.
>>>
>>> MCFG to discover it matches the PCI firmware spec, but as I point
>>> out above the decision to reserve this region doesn't require
>>> PNP0C01/PNP0C02 _CRS.
>>
>> Can you explain this reasoning a little more?  I claim Linux does not
>> natively comprehend reserving the MMCFG region, but it sounds like you
>> don't agree?  I think "native" comprehension would mean Linux would
>> not need the MCFG table.
> 
> After our thread and the spec again I think you're right Linux doesn't 
> natively comprehend (reserve this region;) particularly because of the 
> stance you have on "static table" vs _CRS.
> 
>>
>>> This is a decision made by Linux historically.
>>>
>>>>> Running this check causes problems with accessing extended PCI
>>>>> configuration space on OEM laptops that don't specify the region in PNP
>>>>> resources or in the EFI memory map. That later manifests as problems with
>>>>> dGPU and accessing resizable BAR.
>>>>
>>>> Is there a problem report we can reference here?
>>>
>>> Nothing public to share. AMD BIOS team is in discussion with the OEM to add
>>> the reservation in a BIOS upgrade so it works with things like the LTS
>>> kernels.
>>
>> Is there some reason this can't be made public (it's obviously fine to
>> redact proprietary details)?  It's really hard to make this code work
>> for all the cases even when we know all the details, and practically
>> impossible if we don't.
> 
> I just don't want to throw the vendor under the bus as it could have 
> been caught "sooner" and fixed by BIOS adding _CRS.
> 
> I'll share the full dmesg below just redacting the DMI information.
> 
>>
>>> Knowing Windows works without it I feel this is still something that we
>>> should be looking at fixing from an upstream perspective though which is
>>> what prompted my patch and discussion.
>>
>> We definitely need to change Linux so it works correctly with firmware
>> in the field, whether that means fixing a Linux defect or working
>> around a firmware defect.
>>
>>>> Does the problem still occur with this series?
>>>> https://lore.kernel.org/r/20231121183643.249006-1-helgaas@kernel.org
>>>>
>>>> This appeared in linux-next 20231130.
>>>
>>> Thanks for sharing that.  If I do respin a variation of this patch I'll
>>> rebase on top of that.
>>>
>>> I had a try with that series on top of 6.7-rc4, but it doesn't fix the issue
>>> (but obviously the patch I sent does).
>>>
>>> # journalctl -k | grep ECAM
>>> Dec 05 06:37:46 cl-fw-fedora kernel: PCI: ECAM [mem 0xe0000000-0xefffffff] (base 0xe0000000) for domain 0000 [bus 00-ff]
>>> Dec 05 06:37:46 cl-fw-fedora kernel: PCI: not using ECAM ([mem 0xe0000000-0xefffffff] not reserved)
>>> Dec 05 06:37:46 cl-fw-fedora kernel: PCI: ECAM [mem 0xe0000000-0xefffffff] (base 0xe0000000) for domain 0000 [bus 00-ff]
>>> Dec 05 06:37:46 cl-fw-fedora kernel: PCI: [Firmware Info]: ECAM [mem 0xe0000000-0xefffffff] not reserved in ACPI motherboard resources
>>> Dec 05 06:37:46 cl-fw-fedora kernel: PCI: not using ECAM ([mem 0xe0000000-0xefffffff] not reserved)
>>
>> Can you boot with 'efi=debug dyndbg="file arch/x86/pci +p"' and share
>> the complete dmesg log (redacted if necessary) somewhere?  It's
>> important to know more about why and how this doesn't work.  I added
>> more debug logging, but possibly it's still not enough.
> 
> Here you go (6.7-rc4 + that series you linked):
> https://gist.github.com/superm1/eca87ae661793b9ab969829946adb084
> 
>>
>>>>> Similar problems don't exist in Windows 11 with exact same
>>>>> laptop/firmware stack, and in discussion with AMD's BIOS team
>>>>> Windows doesn't have similar checks.
>>>>
>>>> I would love to know AMD BIOS team's take on this.  Does the BIOS
>>>> reserve the MMCFG space in any way?
>>>
>>> On the AMD reference platform this OEM system is based on, it is reserved in
>>> the EFI memory map.  So with a 6.7-based kernel on the reference system you
>>> can see this emitted:
>>>
>>> PCI: MMCONFIG at [mem 0xe0000000-0xefffffff] reserved as EfiMemoryMappedIO
>>
>> The EfiMemoryMappedIO entry is not a *reservation* (this was a poor
>> choice of words in the logging, and my series changes it).  This entry
>> only means the firmware requests that the OS map this region to a
>> virtual address so it can be used by EFI runtime services (UEFI v2.9,
>> sec 7.2).
> 
> In that sense the only reason this works on the AMD reference platform 
> is because that region happens to have been reserved from a subset of 
> another region.
> 
> Per the stance on "static table", we should advocate for _CRS to be 
> populated with MCFG on AMD reference platform too, right?
> 
>>
>>> But on the OEM system this is not reserved by EFI memory map or _CRS.
>>>
>>> That's why my assumption after reading the firmware spec and seeing the
>>> behavior is that Windows makes the reservation *based on* what's in MCFG.
>>
>> Is there some spec language that says MCFG reserves space?  I'm not
>> aware of anything about ACPI static tables reserving MMIO space.
>> Here's my reasoning around static tables vs _CRS for reservations:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/PCI/acpi-info.rst?id=v6.6#n32
> 
> Reading your stance, it makes more sense why we're where we are now.
> 
> Let me ask though - why does the distinction of old OS vs new OS matter?
> If a vendor wants it to work with a kernel that doesn't use MCFG to make
> a reservation, _CRS or some other overlapping reservation is their only
> option.
> 
> But if we changed this behavior in a newer kernel then the stance can be
> something like:
> "upstream kernel 6.8 or newer will reserve MCFG if not specified by _CRS 
> or any other overlapping reservation"
> and
> "upstream kernel 6.7 or older requires explicit reservations".
> 
> It seems to me that this type of issue would entirely go away in most 
> cases and it would satisfy the spec note about
> 'natively comprehend' reserving the MMCFG region.
> 
> 

I don't think this should be any surprise, but this patch on top of your 
series fixes the issue on that system.

diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c
index 0cc9520666ef..6a77441565e2 100644
--- a/arch/x86/pci/mmconfig-shared.c
+++ b/arch/x86/pci/mmconfig-shared.c
@@ -571,8 +571,6 @@ static void __init pci_mmcfg_reject_broken(int early)
                 if (!pci_mmcfg_reserved(NULL, cfg, early)) {
                         pr_info("not using ECAM (%pR not reserved)\n",
                                 &cfg->res);
-                       free_all_mmcfg();
-                       return;
                 }
         }
  }

And from what I can tell this *does* make a "reservation", specifically
because pci_mmcfg_late_insert_resources() uses insert_resource() to put
it in place.  I would expect that if something else tries to request that
region later, it would get a conflict.
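
The conflict expectation above can be illustrated with a minimal userspace
sketch.  This is not kernel code: it assumes a flat table standing in for
the kernel's resource tree, and struct range, struct range_table and
claim_range() are hypothetical names invented for this example.  The real
insert_resource()/request_resource() logic also handles nested resources,
which this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a resource: inclusive bounds, like %pR output. */
struct range {
	uint64_t start;
	uint64_t end;
};

#define MAX_RANGES 16

/* Hypothetical flat table; the kernel uses a tree of struct resource. */
struct range_table {
	struct range r[MAX_RANGES];
	size_t n;
};

/* Two inclusive ranges overlap if each one starts before the other ends. */
static bool ranges_overlap(struct range a, struct range b)
{
	return a.start <= b.end && b.start <= a.end;
}

/*
 * Record "want" in the table.  Returns 0 on success and -1 if the range
 * overlaps an existing entry or the table is full.
 */
static int claim_range(struct range_table *t, struct range want)
{
	assert(want.start <= want.end);

	for (size_t i = 0; i < t->n; i++) {
		if (ranges_overlap(t->r[i], want))
			return -1;
	}
	if (t->n == MAX_RANGES)
		return -1;

	t->r[t->n++] = want;
	return 0;
}
```

With the ECAM window from the log, [mem 0xe0000000-0xefffffff], claimed
first, a later claim of a 1 MiB range starting at 0xe0000000 is refused,
while a range starting at 0xf0000000 is still accepted.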
Bjorn Helgaas Dec. 14, 2023, 8:43 p.m. UTC | #6
[+cc Len, Robert in case I'm missing something about static tables
reserving address space]

On Tue, Dec 05, 2023 at 12:28:44PM -0600, Mario Limonciello wrote:
> On 12/5/2023 11:31, Bjorn Helgaas wrote:
> > On Tue, Dec 05, 2023 at 11:00:31AM -0600, Mario Limonciello wrote:
> > > On 12/5/2023 10:17, Bjorn Helgaas wrote:
> > > > On Tue, Dec 05, 2023 at 09:48:45AM -0600, Mario Limonciello wrote:
> > > > > commit 7752d5cfe3d1 ("x86: validate against acpi motherboard
> > > > > resources") introduced checks for ensuring that MCFG table
> > > > > also has memory region reservations to ensure no conflicts
> > > > > were introduced from a buggy BIOS.
> > > > > 
> > > > > This has proceeded over time to add other types of
> > > > > reservation checks for ACPI PNP resources and EFI MMIO
> > > > > memory type.  The PCI firmware spec however says that these
> > > > > checks are only required when the operating system doesn't
> > > > > comprehend the firmware region:
> > > > > 
> > > > > ``` If the operating system does not natively comprehend
> > > > > reserving the MMCFG region, the MMCFG region must be
> > > > > reserved by firmware. The address range reported in the MCFG
> > > > > table or by _CBA method (see Section 4.1.3) must be reserved
> > > > > by declaring a motherboard resource. For most systems, the
> > > > > motherboard resource would appear at the root of the ACPI
> > > > > namespace (under \_SB) in a node with a _HID of EISAID
> > > > > (PNP0C02), and the resources in this case should not be
> > > > > claimed in the root PCI bus’s _CRS. The resources can
> > > > > optionally be returned in Int15 E820h or EFIGetMemoryMap as
> > > > > reserved memory but must always be reported through ACPI as
> > > > > a motherboard resource.  ```
> > > > 
> > > > My understanding is that native comprehension would mean Linux
> > > > knows how to discover and/or configure the MMCFG base address
> > > > and size in the hardware and that Linux would then reserve
> > > > that region so it's not used for anything else.
> > > > 
> > > > Linux doesn't have that, at least for x86.  It relies on the
> > > > MCFG table to discover the MMCFG region, and it relies on
> > > > PNP0C02 _CRS to reserve it.
> > > 
> > > MCFG to discover it matches the PCI firmware spec, but as I
> > > point out above the decision to reserve this region doesn't
> > > require PNP0C01/PNP0C02 _CRS.
> > 
> > Can you explain this reasoning a little more?  I claim Linux does
> > not natively comprehend reserving the MMCFG region, but it sounds
> > like you don't agree?  I think "native" comprehension would mean
> > Linux would not need the MCFG table.
> 
> After our thread and the spec again I think you're right Linux
> doesn't natively comprehend (reserve this region;) particularly
> because of the stance you have on "static table" vs _CRS.

["My stance" refers to this:

  Static tables like MCFG, HPET, ECDT, etc., are *not* mechanisms for
  reserving address space.  The static tables are for things the OS
  needs to know early in boot, before it can parse the ACPI namespace.
  If a new table is defined, an old OS needs to operate correctly even
  though it ignores the table.  _CRS allows that because it is generic
  and understood by the old OS; a static table does not.

from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/PCI/acpi-info.rst?id=v6.6#n32]

I don't think this is just my stance.  The ACPI spec could be clearer
in terms of requiring PNP0C02 devices, not static tables, to reserve
address space, but I think that requirement is a logical consequence
of the ACPI design.

It's a goal of ACPI that an OS we release today should run on a
platform released tomorrow.  If the new platform uses a static table
to reserve address space used by some new hardware, today's OS doesn't
know about it and could place another device on top of it.

Using _CRS of an ACPI device to reserve the new hardware address space
is different because it works even with today's OS.  Today's OS can't
*operate* tomorrow's hardware, but at least it won't create address
conflicts with it.

> I just don't want to throw the vendor under the bus as it could have
> been caught "sooner" and fixed by BIOS adding _CRS.

The MCFG requirement for PNP0C02 _CRS has been in the PCI Firmware
spec since r3.0 in 2005.  I'm surprised that vendors still get this
wrong.  Vendors definitely have an interest in making shipping OSes
boot unchanged on new hardware.

> > > Knowing Windows works without it I feel this is still something that we
> > > should be looking at fixing from an upstream perspective though which is
> > > what prompted my patch and discussion.

The fact that Windows works doesn't mean the firmware is correct.
Linux assigns PCI BARs from the bottom up, and ECAM is often at the
bottom of a host bridge aperture.

Windows assigns PCI BARs from the top down, so even without a _CRS
reservation for the ECAM space, Windows is much less likely to put
something on top of it.
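
A rough sketch of the difference, assuming a single power-of-two BAR
placed in an otherwise empty aperture.  The helper names and the aperture
end (0xfebfffff) are made up for illustration; only the ECAM window comes
from the log earlier in the thread.  Real allocators in both OSes are far
more involved than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if size is a non-zero power of two (BAR sizes always are). */
static bool is_pow2(uint64_t size)
{
	return size != 0 && (size & (size - 1)) == 0;
}

/* Lowest size-aligned address at or above the start of the aperture. */
static uint64_t place_bottom_up(uint64_t ap_start, uint64_t size)
{
	assert(is_pow2(size));
	return (ap_start + size - 1) & ~(size - 1);
}

/*
 * Highest size-aligned address whose block still ends at or below the
 * (inclusive) end of the aperture.  Assumes the aperture is large enough.
 */
static uint64_t place_top_down(uint64_t ap_end, uint64_t size)
{
	assert(is_pow2(size));
	return (ap_end + 1 - size) & ~(size - 1);
}

/* Does the block [start, start + size - 1] overlap [r_start, r_end]? */
static bool overlaps(uint64_t start, uint64_t size,
		     uint64_t r_start, uint64_t r_end)
{
	return start <= r_end && r_start <= start + size - 1;
}
```

For a hypothetical aperture [mem 0xe0000000-0xfebfffff] and a 1 MiB BAR,
bottom-up placement yields 0xe0000000, which lands inside the unreserved
ECAM window, while top-down placement yields 0xfeb00000, which does not.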

Bjorn
Mario Limonciello Dec. 14, 2023, 9:45 p.m. UTC | #7
On 12/14/2023 14:43, Bjorn Helgaas wrote:
> [+cc Len, Robert in case I'm missing something about static tables
> reserving address space]
> 
> On Tue, Dec 05, 2023 at 12:28:44PM -0600, Mario Limonciello wrote:
>> On 12/5/2023 11:31, Bjorn Helgaas wrote:
>>> On Tue, Dec 05, 2023 at 11:00:31AM -0600, Mario Limonciello wrote:
>>>> On 12/5/2023 10:17, Bjorn Helgaas wrote:
>>>>> On Tue, Dec 05, 2023 at 09:48:45AM -0600, Mario Limonciello wrote:
>>>>>> commit 7752d5cfe3d1 ("x86: validate against acpi motherboard
>>>>>> resources") introduced checks for ensuring that MCFG table
>>>>>> also has memory region reservations to ensure no conflicts
>>>>>> were introduced from a buggy BIOS.
>>>>>>
>>>>>> This has proceeded over time to add other types of
>>>>>> reservation checks for ACPI PNP resources and EFI MMIO
>>>>>> memory type.  The PCI firmware spec however says that these
>>>>>> checks are only required when the operating system doesn't
>>>>>> comprehend the firmware region:
>>>>>>
>>>>>> ``` If the operating system does not natively comprehend
>>>>>> reserving the MMCFG region, the MMCFG region must be
>>>>>> reserved by firmware. The address range reported in the MCFG
>>>>>> table or by _CBA method (see Section 4.1.3) must be reserved
>>>>>> by declaring a motherboard resource. For most systems, the
>>>>>> motherboard resource would appear at the root of the ACPI
>>>>>> namespace (under \_SB) in a node with a _HID of EISAID
>>>>>> (PNP0C02), and the resources in this case should not be
>>>>>> claimed in the root PCI bus’s _CRS. The resources can
>>>>>> optionally be returned in Int15 E820h or EFIGetMemoryMap as
>>>>>> reserved memory but must always be reported through ACPI as
>>>>>> a motherboard resource.  ```
>>>>>
>>>>> My understanding is that native comprehension would mean Linux
>>>>> knows how to discover and/or configure the MMCFG base address
>>>>> and size in the hardware and that Linux would then reserve
>>>>> that region so it's not used for anything else.
>>>>>
>>>>> Linux doesn't have that, at least for x86.  It relies on the
>>>>> MCFG table to discover the MMCFG region, and it relies on
>>>>> PNP0C02 _CRS to reserve it.
>>>>
>>>> MCFG to discover it matches the PCI firmware spec, but as I
>>>> point out above the decision to reserve this region doesn't
>>>> require PNP0C01/PNP0C02 _CRS.
>>>
>>> Can you explain this reasoning a little more?  I claim Linux does
>>> not natively comprehend reserving the MMCFG region, but it sounds
>>> like you don't agree?  I think "native" comprehension would mean
>>> Linux would not need the MCFG table.
>>
>> After our thread and the spec again I think you're right Linux
>> doesn't natively comprehend (reserve this region;) particularly
>> because of the stance you have on "static table" vs _CRS.
> 
> ["My stance" refers to this:
> 
>    Static tables like MCFG, HPET, ECDT, etc., are *not* mechanisms for
>    reserving address space.  The static tables are for things the OS
>    needs to know early in boot, before it can parse the ACPI namespace.
>    If a new table is defined, an old OS needs to operate correctly even
>    though it ignores the table.  _CRS allows that because it is generic
>    and understood by the old OS; a static table does not.
> 
> from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/PCI/acpi-info.rst?id=v6.6#n32]
> 
> I don't think this is just my stance.  The ACPI spec could be clearer
> in terms of requiring PNP0C02 devices, not static tables, to reserve
> address space, but I think that requirement is a logical consequence
> of the ACPI design.
> 
> It's a goal of ACPI that an OS we release today should run on a
> platform released tomorrow.  If the new platform uses a static table
> to reserve address space used by some new hardware, today's OS doesn't
> know about it and could place another device on top of it.
> 
> Using _CRS of an ACPI device to reserve the new hardware address space
> is different because it works even with today's OS.  Today's OS can't
> *operate* tomorrow's hardware, but at least it won't create address
> conflicts with it.
> 
>> I just don't want to throw the vendor under the bus as it could have
>> been caught "sooner" and fixed by BIOS adding _CRS.
> 
> The MCFG requirement for PNP0C02 _CRS has been in the PCI Firmware
> spec since r3.0 in 2005.  I'm surprised that vendors still get this
> wrong.

Probably worth mentioning: now that I know the root cause of the issue
that prompted this conversation, I've discovered another system with the
exact same problem.  It's a different OEM, a different generation of
hardware and a different IBV that they use for their firmware.

I've also looked through the BIOS for a variety of reference designs and 
I don't see a _CRS entry in any of them.

I'm fairly certain we're just getting lucky in Linux on a lot of devices
because the region often overlaps with a region for EFI runtime services.

> Vendors definitely have an interest in making shipping OSes
> boot unchanged on new hardware.

At least the OEMs that I talk to use FWTS.  FWTS catches this issue, but 
it's marked as LOW.  Everyone fixates on the HIGH or CRITICAL.

Given the severity of what I've seen it do to a system, I'm proposing
that FWTS move it to HIGH:

https://lists.ubuntu.com/archives/fwts-devel/2023-December/013772.html

> 
>>>> Knowing Windows works without it I feel this is still something that we
>>>> should be looking at fixing from an upstream perspective though which is
>>>> what prompted my patch and discussion.
> 
> The fact that Windows works doesn't mean the firmware is correct.

Of course.  But it also means it's worth looking at the semantics of why 
Windows works.

> Linux assigns PCI BARs from the bottom up, and ECAM is often at the
> bottom of a host bridge aperture.
> 
> Windows assigns PCI BARs from the top down, so even without a _CRS
> reservation for the ECAM space, Windows is much less likely to put
> something on top of it.
> 
> Bjorn


I guess I'm naïve in that I don't know how exactly to check what Windows
*really* does (gets lucky from top down or actually reserves from MCFG).

But I do feel the OS doing it from top down vs bottom up is tangential
to the decision of whether or not to make a reservation in Linux for
something you know about without a _CRS entry.

I can push AMD's reference designs to improve, I can push the OEMs I talk
to to improve and I can try to influence FWTS, but I have to ask:

What is the actual *harm* in just using this MCFG table to make a 
reservation when there isn't a PNP0C02 _CRS region declared?

At worst (a buggy BIOS) you would end up with a hole in the memory map
that isn't usable for devices.  At best you end up with more working
devices without changing the firmware.
Bjorn Helgaas Dec. 14, 2023, 11:30 p.m. UTC | #8
On Thu, Dec 14, 2023 at 03:45:43PM -0600, Mario Limonciello wrote:
> On 12/14/2023 14:43, Bjorn Helgaas wrote:
> > On Tue, Dec 05, 2023 at 12:28:44PM -0600, Mario Limonciello wrote:
> > > On 12/5/2023 11:31, Bjorn Helgaas wrote:
> > > > On Tue, Dec 05, 2023 at 11:00:31AM -0600, Mario Limonciello wrote:
> > > > > On 12/5/2023 10:17, Bjorn Helgaas wrote:
> > > > > > On Tue, Dec 05, 2023 at 09:48:45AM -0600, Mario Limonciello wrote:
> > > > > > > commit 7752d5cfe3d1 ("x86: validate against acpi motherboard
> > > > > > > resources") introduced checks for ensuring that MCFG table
> > > > > > > also has memory region reservations to ensure no conflicts
> > > > > > > were introduced from a buggy BIOS.
> > > > > > > 
> > > > > > > This has proceeded over time to add other types of
> > > > > > > reservation checks for ACPI PNP resources and EFI MMIO
> > > > > > > memory type.  The PCI firmware spec however says that these
> > > > > > > checks are only required when the operating system doesn't
> > > > > > > comprehend the firmware region:
> > > > > > > 
> > > > > > > ``` If the operating system does not natively comprehend
> > > > > > > reserving the MMCFG region, the MMCFG region must be
> > > > > > > reserved by firmware. The address range reported in the MCFG
> > > > > > > table or by _CBA method (see Section 4.1.3) must be reserved
> > > > > > > by declaring a motherboard resource. For most systems, the
> > > > > > > motherboard resource would appear at the root of the ACPI
> > > > > > > namespace (under \_SB) in a node with a _HID of EISAID
> > > > > > > (PNP0C02), and the resources in this case should not be
> > > > > > > claimed in the root PCI bus’s _CRS. The resources can
> > > > > > > optionally be returned in Int15 E820h or EFIGetMemoryMap as
> > > > > > > reserved memory but must always be reported through ACPI as
> > > > > > > a motherboard resource.  ```
> > > > > > 
> > > > > > My understanding is that native comprehension would mean Linux
> > > > > > knows how to discover and/or configure the MMCFG base address
> > > > > > and size in the hardware and that Linux would then reserve
> > > > > > that region so it's not used for anything else.
> > > > > > 
> > > > > > Linux doesn't have that, at least for x86.  It relies on the
> > > > > > MCFG table to discover the MMCFG region, and it relies on
> > > > > > PNP0C02 _CRS to reserve it.
> > > > > 
> > > > > MCFG to discover it matches the PCI firmware spec, but as I
> > > > > point out above the decision to reserve this region doesn't
> > > > > require PNP0C01/PNP0C02 _CRS.
> > > > 
> > > > Can you explain this reasoning a little more?  I claim Linux does
> > > > not natively comprehend reserving the MMCFG region, but it sounds
> > > > like you don't agree?  I think "native" comprehension would mean
> > > > Linux would not need the MCFG table.
> > > 
> > > After our thread and the spec again I think you're right Linux
> > > doesn't natively comprehend (reserve this region;) particularly
> > > because of the stance you have on "static table" vs _CRS.
> > 
> > ["My stance" refers to this:
> > 
> >    Static tables like MCFG, HPET, ECDT, etc., are *not* mechanisms for
> >    reserving address space.  The static tables are for things the OS
> >    needs to know early in boot, before it can parse the ACPI namespace.
> >    If a new table is defined, an old OS needs to operate correctly even
> >    though it ignores the table.  _CRS allows that because it is generic
> >    and understood by the old OS; a static table does not.
> > 
> > from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/PCI/acpi-info.rst?id=v6.6#n32]
> > 
> > I don't think this is just my stance.  The ACPI spec could be clearer
> > in terms of requiring PNP0C02 devices, not static tables, to reserve
> > address space, but I think that requirement is a logical consequence
> > of the ACPI design.
> > 
> > It's a goal of ACPI that an OS we release today should run on a
> > platform released tomorrow.  If the new platform uses a static table
> > to reserve address space used by some new hardware, today's OS doesn't
> > know about it and could place another device on top of it.
> > 
> > Using _CRS of an ACPI device to reserve the new hardware address space
> > is different because it works even with today's OS.  Today's OS can't
> > *operate* tomorrow's hardware, but at least it won't create address
> > conflicts with it.
> > 
> > > I just don't want to throw the vendor under the bus as it could have
> > > been caught "sooner" and fixed by BIOS adding _CRS.
> > 
> > The MCFG requirement for PNP0C02 _CRS has been in the PCI Firmware
> > spec since r3.0 in 2005.  I'm surprised that vendors still get this
> > wrong.
> 
> Probably worth mentioning with the clairvoyance of the root cause of
> the issue that prompted this conversation I've now discovered
> another system with the exact same problem.  It's a different OEM,
> different generation of hardware and a different IBV that they use
> for their firmware.
> 
> I've also looked through the BIOS for a variety of reference designs
> and I don't see a _CRS entry in any of them.
> 
> I'm fairly certain we're just getting lucky in Linux on a lot of
> devices that the region is often overlapping with a region for EFI
> runtime services.

Ugh.  Yes, I'm sure it's not an isolated problem.

> > Vendors definitely have an interest in making shipping OSes boot
> > unchanged on new hardware.
> 
> At least the OEMs that I talk to use FWTS.  FWTS catches this issue,
> but it's marked as LOW.  Everyone fixates on the HIGH or CRITICAL.
> 
> Given the severity of what I've seen it can do to a system I'm
> proposing FWTS to move it to HIGH:
> 
> https://lists.ubuntu.com/archives/fwts-devel/2023-December/013772.html

Thanks.  I don't know anything about FWTS, but I'm a little skeptical
that it actually catches this issue.  It *looks* like FWTS builds its
idea of the memory map from a dmesg log or /sys/firmware/memmap, which
I think both come from the E820 map, which is x86-specific, of course.

I don't see anything that builds a memory map based on _CRS methods,
which I think is what we really want since the spec says:

  The resources can optionally be returned in Int15 E820h or
  EFIGetMemoryMap as reserved memory but must always be reported
  through ACPI as a motherboard resource.

(PCI Firmware spec r3.3, sec 4.1.2)

> What is the actual *harm* in just using this MCFG table to make a
> reservation when there isn't a PNP0C02 _CRS region declared?
> 
> At worst (a buggy BIOS) you would end up with hole in the memory map
> that isn't usable for devices.  At best you end up with more working
> devices without changing the firmware.

We definitely need to work around this in Linux, and your patch might
well be the right thing.

I'm a *little* hesitant because all the code in mmconfig-shared.c that
attempts to validate MCFG entries suggests that relying on them
uncritically was a problem in some cases, so I want to try to convince
myself that we really won't break something.

Bjorn
Mario Limonciello Dec. 15, 2023, 3:48 p.m. UTC | #9
On 12/14/2023 17:30, Bjorn Helgaas wrote:

>> I'm fairly certain we're just getting lucky in Linux on a lot of
>> devices that the region is often overlapping with a region for EFI
>> runtime services.
> 
> Ugh.  Yes, I'm sure it's not an isolated problem.
> 
>> Given the severity of what I've seen it can do to a system I'm
>> proposing FWTS to move it to HIGH:
>>
>> https://lists.ubuntu.com/archives/fwts-devel/2023-December/013772.html
> 
> Thanks.  I don't know anything about FWTS, but I'm a little skeptical
> that it actually catches this issue.  It *looks* like FWTS builds its
> idea of the memory map from a dmesg log or /sys/firmware/memmap, which
> I think both come from the E820 map, which is x86-specific, of course.
> 
> I don't see anything that builds a memory map based on _CRS methods,
> which I think is what we really want since the spec says:
> 
>    The resources can optionally be returned in Int15 E820h or
>    EFIGetMemoryMap as reserved memory but must always be reported
>    through ACPI as a motherboard resource.
> 
> (PCI Firmware spec r3.3, sec 4.1.2)

You're right; it doesn't catch the "root" of this issue, it only catches 
specifically when the region doesn't overlap with an existing 
reservation (like EFI runtime services).

A more thorough check would need to build a memory map.
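
A minimal sketch of what such a check could look like, assuming the
reserved ranges have already been collected somehow (for example from
PNP0C02 _CRS).  How FWTS would gather them is not covered here, and
struct range and fully_covered() are hypothetical names, not FWTS or
kernel APIs.  The point is to test full coverage rather than mere overlap.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical inclusive address range. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* qsort() comparator: order ranges by ascending start address. */
static int cmp_start(const void *a, const void *b)
{
	const struct range *x = a;
	const struct range *y = b;

	return (x->start > y->start) - (x->start < y->start);
}

/*
 * Return true only if every address in "target" is covered by the union
 * of the "n" ranges in "res".  Sorts "res" in place.
 */
static bool fully_covered(struct range target, struct range *res, size_t n)
{
	uint64_t next = target.start;	/* first address not yet covered */

	qsort(res, n, sizeof(*res), cmp_start);

	for (size_t i = 0; i < n; i++) {
		if (res[i].end < next)
			continue;		/* entirely below what we need */
		if (res[i].start > next)
			return false;		/* gap before this range */
		if (res[i].end >= target.end)
			return true;		/* reached the end of target */
		next = res[i].end + 1;
	}
	return false;
}
```

A reservation that only covers the lower half of the ECAM window overlaps
it but does not pass this check; two adjacent reservations that together
span the window do.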

> 
>> What is the actual *harm* in just using this MCFG table to make a
>> reservation when there isn't a PNP0C02 _CRS region declared?
>>
>> At worst (a buggy BIOS) you would end up with hole in the memory map
>> that isn't usable for devices.  At best you end up with more working
>> devices without changing the firmware.
> 
> We definitely need to work around this in Linux, and your patch might
> well be the right thing.
> 
> I'm a *little* hesitant because all the code in mmconfig-shared.c that
> attempts to validate MCFG entries suggests that relying on them
> uncritically was a problem in some cases, so I want to try to convince
> myself that we really won't break something.
> 
> Bjorn

As I mentioned in the commit message, this type of check was first
introduced in 7752d5cfe3d1.

$ git describe --contains 7752d5cfe3d1
v2.6.26-rc1~369^2~18

That's roughly 2008.  This is a long time back; IIRC it's before MMIO
over 4GB was really added to BIOS in many PC platforms.

How about we build an escape hatch for users to put on the kernel 
command line in case of problems to restore the behavior that enforces 
reservations?
Maybe "enforce_ecam_resv"?

We could keep that around for a year or two and if nothing pops up
tear it out later.
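
As a rough userspace model of that idea (not a kernel patch): the option
name "enforce_ecam_resv" is only the suggestion above, and
cmdline_has_flag(), ecam_decide() and the enum are hypothetical names.  In
the kernel this would presumably hook into the existing command-line
parameter handling rather than scan a string by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical outcomes for one MCFG entry. */
enum ecam_action {
	ECAM_USE,		/* region is reserved, use it as today */
	ECAM_USE_UNRESERVED,	/* not reserved, but use it and log a note */
	ECAM_REJECT		/* not reserved and enforcement requested */
};

/* True if "flag" appears in "cmdline" as a whole space-separated word. */
static bool cmdline_has_flag(const char *cmdline, const char *flag)
{
	size_t len = strlen(flag);
	const char *p = cmdline;

	assert(len > 0);

	while ((p = strstr(p, flag)) != NULL) {
		bool starts = (p == cmdline) || (p[-1] == ' ');
		bool ends = (p[len] == '\0') || (p[len] == ' ');

		if (starts && ends)
			return true;
		p += len;
	}
	return false;
}

/* Only fall back to the old rejecting behavior when the user asks. */
static enum ecam_action ecam_decide(bool reserved, bool enforce_resv)
{
	if (reserved)
		return ECAM_USE;
	return enforce_resv ? ECAM_REJECT : ECAM_USE_UNRESERVED;
}
```

With this shape the default becomes "trust MCFG", and booting with the
option restores today's behavior of dropping ECAM when the region is not
reserved.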

Patch

diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c
index 4b3efaa82ab7..e4594b181ebf 100644
--- a/arch/x86/pci/mmconfig-shared.c
+++ b/arch/x86/pci/mmconfig-shared.c
@@ -570,9 +570,13 @@  static void __init pci_mmcfg_reject_broken(int early)
 
 	list_for_each_entry(cfg, &pci_mmcfg_list, list) {
 		if (pci_mmcfg_check_reserved(NULL, cfg, early) == 0) {
-			pr_info(PREFIX "not using MMCONFIG\n");
-			free_all_mmcfg();
-			return;
+			if (dmi_get_bios_year() >= 2021) {
+				pr_info(PREFIX "MMCONFIG wasn't reserved by ACPI or EFI\n");
+			} else {
+				pr_info(PREFIX "not using MMCONFIG\n");
+				free_all_mmcfg();
+				return;
+			}
 		}
 	}
 }