diff mbox series

PCI/ASPM: Should not report ASPM support to BIOS if FADT indicates ASPM is unsupported

Message ID 20220713112612.6935-1-limanyi@uniontech.com
State New
Headers show
Series PCI/ASPM: Should not report ASPM support to BIOS if FADT indicates ASPM is unsupported | expand

Commit Message

Manyi Li July 13, 2022, 11:26 a.m. UTC
Startup log of ASUSTeK X456UJ Notebook show:
[    0.130563] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
[   48.092472] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[   48.092479] pcieport 0000:00:1c.5:   device [8086:9d15] error status/mask=00000001/00002000
[   48.092481] pcieport 0000:00:1c.5:    [ 0] RxErr
[   48.092490] pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5
[   48.092504] pcieport 0000:00:1c.5: AER: can't find device of ID00e5
[   48.092506] pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5

Signed-off-by: Manyi Li <limanyi@uniontech.com>
---
 drivers/pci/pcie/aspm.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Bjorn Helgaas July 13, 2022, 6:28 p.m. UTC | #1
[+cc Kai-Heng, Vidya, who also have ASPM patches in flight]

On Wed, Jul 13, 2022 at 07:26:12PM +0800, Manyi Li wrote:
> Startup log of ASUSTeK X456UJ Notebook show:
> [    0.130563] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
> [   48.092472] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> [   48.092479] pcieport 0000:00:1c.5:   device [8086:9d15] error status/mask=00000001/00002000
> [   48.092481] pcieport 0000:00:1c.5:    [ 0] RxErr
> [   48.092490] pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5
> [   48.092504] pcieport 0000:00:1c.5: AER: can't find device of ID00e5
> [   48.092506] pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5

Can you elaborate on the connection between the FADT ASPM bit and the
AER logs above?

What problem are we solving here?  A single corrected error being
logged?  An infinite stream of errors?  A device that doesn't work at
all?

We don't need the dmesg timestamps unless they contribute to
understanding the problem.  I don't think they do in this case.

> Signed-off-by: Manyi Li <limanyi@uniontech.com>
> ---
>  drivers/pci/pcie/aspm.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> index a96b7424c9bc..b173d3c75ae7 100644
> --- a/drivers/pci/pcie/aspm.c
> +++ b/drivers/pci/pcie/aspm.c
> @@ -1359,6 +1359,7 @@ void pcie_no_aspm(void)
>  	if (!aspm_force) {
>  		aspm_policy = POLICY_DEFAULT;
>  		aspm_disabled = 1;
> +		aspm_support_enabled = false;

This makes pcie_no_aspm() work the same as booting with
"pcie_aspm=off".  That might be reasonable.

I do wonder why we need both "aspm_disabled" and
"aspm_support_enabled".  And I wonder why we need to set "aspm_policy"
when we're disabling ASPM.  But those aren't really connected to your
change here.

>  	}
>  }
>  
> -- 
> 2.20.1
> 
> 
>
Kai-Heng Feng July 14, 2022, 3:20 a.m. UTC | #2
[+Cc Matthew]

On Thu, Jul 14, 2022 at 2:28 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Kai-Heng, Vidya, who also have ASPM patches in flight]
>
> On Wed, Jul 13, 2022 at 07:26:12PM +0800, Manyi Li wrote:
> > Startup log of ASUSTeK X456UJ Notebook show:
> > [    0.130563] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
> > [   48.092472] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> > [   48.092479] pcieport 0000:00:1c.5:   device [8086:9d15] error status/mask=00000001/00002000
> > [   48.092481] pcieport 0000:00:1c.5:    [ 0] RxErr
> > [   48.092490] pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5
> > [   48.092504] pcieport 0000:00:1c.5: AER: can't find device of ID00e5
> > [   48.092506] pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5
>
> Can you elaborate on the connection between the FADT ASPM bit and the
> AER logs above?
>
> What problem are we solving here?  A single corrected error being
> logged?  An infinite stream of errors?  A device that doesn't work at
> all?

Agree, what's the real symptom of the issue?

>
> We don't need the dmesg timestamps unless they contribute to
> understanding the problem.  I don't think they do in this case.

According to commit 387d37577fdd ("PCI: Don't clear ASPM bits when the
FADT declares it's unsupported"), the bit means "just use the ASPM
bits handed over by BIOS".

However, I do wonder why both drivers/pci/pci-acpi.c and
drivers/acpi/pci_root.c are doing the ACPI_FADT_NO_ASPM check, maybe
one of them should be removed?

>
> > Signed-off-by: Manyi Li <limanyi@uniontech.com>
> > ---
> >  drivers/pci/pcie/aspm.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > index a96b7424c9bc..b173d3c75ae7 100644
> > --- a/drivers/pci/pcie/aspm.c
> > +++ b/drivers/pci/pcie/aspm.c
> > @@ -1359,6 +1359,7 @@ void pcie_no_aspm(void)
> >       if (!aspm_force) {
> >               aspm_policy = POLICY_DEFAULT;
> >               aspm_disabled = 1;
> > +             aspm_support_enabled = false;
>
> This makes pcie_no_aspm() work the same as booting with
> "pcie_aspm=off".  That might be reasonable.
>
> I do wonder why we need both "aspm_disabled" and
> "aspm_support_enabled".  And I wonder why we need to set "aspm_policy"
> when we're disabling ASPM.  But those aren't really connected to your
> change here.

From what I can understand "aspm_disabled" means "don't touch ASPM
left by BIOS", and "aspm_support_enabled" means "whether ASPM is
disabled via command line".
There seems to be some overlaps though.

Kai-Heng

>
> >       }
> >  }
> >
> > --
> > 2.20.1
> >
> >
> >
Matthew Garrett July 14, 2022, 4:56 a.m. UTC | #3
On Thu, Jul 14, 2022 at 11:20:26AM +0800, Kai-Heng Feng wrote:

> According to commit 387d37577fdd ("PCI: Don't clear ASPM bits when the
> FADT declares it's unsupported"), the bit means "just use the ASPM

Yes, the assumption is that if the BIOS set up ASPM but FADT indicates 
it's unsupported, just trust that the BIOS did the right thing and don't 
interfere. It's been a long time, but when we were clearing the ASPM 
bits in response to this FADT setting, a bunch of machines suddenly 
started consuming a lot more power than when running Windows.
Matthew Garrett July 15, 2022, 8:29 a.m. UTC | #4
On Fri, Jul 15, 2022 at 03:40:36PM +0800, Manyi Li wrote:

> Please see the details of this issus:
> https://bugzilla.kernel.org/show_bug.cgi?id=216245

Hmm. The only case where changing aspm_support_enabled to false should 
matter is in pcie_aspm_init_link_state(), where it looks like we'll 
potentially rewrite some registers even if aspm_disabled is true. I 
think in theory we shouldn't actually modify anything as a result, and 
the lspcis from the bug don't show any ASPM values having changed, but I 
don't trust Realtek hardware in the general case so maybe it gets upset 
here? If the proposed patch is to just set aspm_support_enabled to false 
when we see the FADT bit set then I think this is fine.
Matthew Garrett July 15, 2022, 9:32 a.m. UTC | #5
On Fri, Jul 15, 2022 at 05:19:25PM +0800, Manyi Li wrote:
> 
> 
> On 2022/7/15 16:29, Matthew Garrett wrote:
> > On Fri, Jul 15, 2022 at 03:40:36PM +0800, Manyi Li wrote:
> > 
> > > Please see the details of this issus:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=216245
> > 
> > Hmm. The only case where changing aspm_support_enabled to false should
> > matter is in pcie_aspm_init_link_state(), where it looks like we'll
> > potentially rewrite some registers even if aspm_disabled is true. I
> > think in theory we shouldn't actually modify anything as a result, and
> > the lspcis from the bug don't show any ASPM values having changed, but I
> > don't trust Realtek hardware in the general case so maybe it gets upset
> > here? If the proposed patch is to just set aspm_support_enabled to false
> > when we see the FADT bit set then I think this is fine.
> > 
> 
> "aspm_support_enabled" alse be used in calculate_support():
> if (pcie_aspm_support_enabled())
>     support |= OSC_PCI_ASPM_SUPPORT | OSC_PCI_CLOCK_PM_SUPPORT;
> When set OSC_PCI_ASPM_SUPPORT | OSC_PCI_CLOCK_PM_SUPPORT, cause this AER
> issue. I want don't set OSC_PCI_ASPM_SUPPORT | OSC_PCI_CLOCK_PM_SUPPORT when
> we see the FADT bit set.

Oh hm. Are you sure it's the OSC call that breaks it? I have some 
recollection that I verified the behaviour of Windows here, but it's 
been over 10 years since I touched this so I could well be wrong. I can 
try to set up a test env to verify the behaviour of Windows when it 
comes to _OSC if the FADT says ASPM is unsupported.
Rafael J. Wysocki July 15, 2022, 12:24 p.m. UTC | #6
On Fri, Jul 15, 2022 at 9:40 AM Manyi Li <limanyi@uniontech.com> wrote:
>
>
>
> On 2022/7/14 11:20, Kai-Heng Feng wrote:
> > [+Cc Matthew]
> >
> > On Thu, Jul 14, 2022 at 2:28 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >>
> >> [+cc Kai-Heng, Vidya, who also have ASPM patches in flight]
> >>
> >> On Wed, Jul 13, 2022 at 07:26:12PM +0800, Manyi Li wrote:
> >>> Startup log of ASUSTeK X456UJ Notebook show:
> >>> [    0.130563] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
> >>> [   48.092472] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> >>> [   48.092479] pcieport 0000:00:1c.5:   device [8086:9d15] error status/mask=00000001/00002000
> >>> [   48.092481] pcieport 0000:00:1c.5:    [ 0] RxErr
> >>> [   48.092490] pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5
> >>> [   48.092504] pcieport 0000:00:1c.5: AER: can't find device of ID00e5
> >>> [   48.092506] pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5
> >>
> >> Can you elaborate on the connection between the FADT ASPM bit and the
> >> AER logs above?
>
> Sorry,I don't know about that.
>
> >>
> >> What problem are we solving here?  A single corrected error being
> >> logged?  An infinite stream of errors?  A device that doesn't work at
> >> all?
> >
> > Agree, what's the real symptom of the issue?
>
> Please see the details of this issus:
> https://bugzilla.kernel.org/show_bug.cgi?id=216245
>
> >
> >>
> >> We don't need the dmesg timestamps unless they contribute to
> >> understanding the problem.  I don't think they do in this case.
> >
> > According to commit 387d37577fdd ("PCI: Don't clear ASPM bits when the
> > FADT declares it's unsupported"), the bit means "just use the ASPM
> > bits handed over by BIOS".
> >
> > However, I do wonder why both drivers/pci/pci-acpi.c and
> > drivers/acpi/pci_root.c are doing the ACPI_FADT_NO_ASPM check,

Because pci_root.c doesn't read aspm_disabled.

> > maybe one of them should be removed?

Arguably, pci_root.c could look at aspm_disabled instead of looking at
the FADT flag directly.

> I think duplicate work has been done, but comment
> in drivers/acpi/pci_root.c is
> * We want to disable ASPM here, but aspm_disabled
> * needs to remain in its state from boot so that we
> * properly handle PCIe 1.1 devices.  So we set this
> * flag here, to defer the action until after the ACPI
> * root scan.
>
> I don't understand this logic.

This is about the case after failing acpi_pci_osc_control_set() and
generally we need to defer setting aspm_disabled because of
pcie_aspm_sanity_check().

> >
> >>
> >>> Signed-off-by: Manyi Li <limanyi@uniontech.com>
> >>> ---
> >>>   drivers/pci/pcie/aspm.c | 1 +
> >>>   1 file changed, 1 insertion(+)
> >>>
> >>> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> >>> index a96b7424c9bc..b173d3c75ae7 100644
> >>> --- a/drivers/pci/pcie/aspm.c
> >>> +++ b/drivers/pci/pcie/aspm.c
> >>> @@ -1359,6 +1359,7 @@ void pcie_no_aspm(void)
> >>>        if (!aspm_force) {
> >>>                aspm_policy = POLICY_DEFAULT;
> >>>                aspm_disabled = 1;
> >>> +             aspm_support_enabled = false;
> >>
> >> This makes pcie_no_aspm() work the same as booting with
> >> "pcie_aspm=off".  That might be reasonable.
> >>
> >> I do wonder why we need both "aspm_disabled" and
> >> "aspm_support_enabled".  And I wonder why we need to set "aspm_policy"
> >> when we're disabling ASPM.  But those aren't really connected to your
> >> change here.
> >
> >  From what I can understand "aspm_disabled" means "don't touch ASPM
> > left by BIOS", and "aspm_support_enabled" means "whether ASPM is
> > disabled via command line".
> > There seems to be some overlaps though.
>
> According to commit 8b8bae901ce23 ("PCI/ACPI: Report ASPM support to
> BIOS if not disabled from command line"), "aspm_support_enabled" means
> whether or not report ASPM support to the BIOS through _OSC.

Right.
Rafael J. Wysocki July 15, 2022, 12:32 p.m. UTC | #7
On Fri, Jul 15, 2022 at 12:23 PM Manyi Li <limanyi@uniontech.com> wrote:
>
>
>
> On 2022/7/15 17:32, Matthew Garrett wrote:
> > On Fri, Jul 15, 2022 at 05:19:25PM +0800, Manyi Li wrote:
> >>
> >>
> >> On 2022/7/15 16:29, Matthew Garrett wrote:
> >>> On Fri, Jul 15, 2022 at 03:40:36PM +0800, Manyi Li wrote:
> >>>
> >>>> Please see the details of this issus:
> >>>> https://bugzilla.kernel.org/show_bug.cgi?id=216245
> >>>
> >>> Hmm. The only case where changing aspm_support_enabled to false should
> >>> matter is in pcie_aspm_init_link_state(), where it looks like we'll
> >>> potentially rewrite some registers even if aspm_disabled is true. I
> >>> think in theory we shouldn't actually modify anything as a result, and
> >>> the lspcis from the bug don't show any ASPM values having changed, but I
> >>> don't trust Realtek hardware in the general case so maybe it gets upset
> >>> here? If the proposed patch is to just set aspm_support_enabled to false
> >>> when we see the FADT bit set then I think this is fine.
> >>>
> >>
> >> "aspm_support_enabled" alse be used in calculate_support():
> >> if (pcie_aspm_support_enabled())
> >>      support |= OSC_PCI_ASPM_SUPPORT | OSC_PCI_CLOCK_PM_SUPPORT;
> >> When set OSC_PCI_ASPM_SUPPORT | OSC_PCI_CLOCK_PM_SUPPORT, cause this AER
> >> issue. I want don't set OSC_PCI_ASPM_SUPPORT | OSC_PCI_CLOCK_PM_SUPPORT when
> >> we see the FADT bit set.
> >
> > Oh hm. Are you sure it's the OSC call that breaks it? I have some
>
> I don't sure.
>
> > recollection that I verified the behaviour of Windows here, but it's
> > been over 10 years since I touched this so I could well be wrong. I can
> > try to set up a test env to verify the behaviour of Windows when it
> > comes to _OSC if the FADT says ASPM is unsupported.
> >
> but, I did a test,this modification also solves the problem:
>
> diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
> index d57cf8454b93..b3ea8e886d7c 100644
> --- a/drivers/acpi/pci_root.c
> +++ b/drivers/acpi/pci_root.c
> @@ -494,8 +494,8 @@ static u32 calculate_support(void)
>          support |= OSC_PCI_HPX_TYPE_3_SUPPORT;
>       if (pci_ext_cfg_avail())
>               support |= OSC_PCI_EXT_CONFIG_SUPPORT;
> -    if (pcie_aspm_support_enabled())
> -            support |= OSC_PCI_ASPM_SUPPORT | OSC_PCI_CLOCK_PM_SUPPORT;
> +//  if (pcie_aspm_support_enabled())
> +//          support |= OSC_PCI_ASPM_SUPPORT | OSC_PCI_CLOCK_PM_SUPPORT;
>       if (pci_msi_enabled())
>               support |= OSC_PCI_MSI_SUPPORT;
>       if (IS_ENABLED(CONFIG_PCIE_EDR))
>
> This issue occur in the Notebook: ASUSTeK COMPUTER INC. X456UJ
> (ASUS-NotebookSKU) Notebook
>
> log "AER: Corrected error received: 0000:00:1c.5" is in the device:
> 00:1c.5 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI
> Express Root Port #6 [8086:9d15] (rev f1)

So it looks like the BIOS sets ACPI_FADT_NO_ASPM and then happily
grants control of ASPM via _OSC.  That's somewhat contradictory.

I would rather look at adjusting pcie_aspm_sanity_check() to this case
instead of wholesale changing the way _OSC is handled at the host
bridge level.
Rafael J. Wysocki July 15, 2022, 12:56 p.m. UTC | #8
On Fri, Jul 15, 2022 at 2:32 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Fri, Jul 15, 2022 at 12:23 PM Manyi Li <limanyi@uniontech.com> wrote:
> >
> >
> >
> > On 2022/7/15 17:32, Matthew Garrett wrote:
> > > On Fri, Jul 15, 2022 at 05:19:25PM +0800, Manyi Li wrote:
> > >>
> > >>
> > >> On 2022/7/15 16:29, Matthew Garrett wrote:
> > >>> On Fri, Jul 15, 2022 at 03:40:36PM +0800, Manyi Li wrote:
> > >>>
> > >>>> Please see the details of this issus:
> > >>>> https://bugzilla.kernel.org/show_bug.cgi?id=216245
> > >>>
> > >>> Hmm. The only case where changing aspm_support_enabled to false should
> > >>> matter is in pcie_aspm_init_link_state(), where it looks like we'll
> > >>> potentially rewrite some registers even if aspm_disabled is true. I
> > >>> think in theory we shouldn't actually modify anything as a result, and
> > >>> the lspcis from the bug don't show any ASPM values having changed, but I
> > >>> don't trust Realtek hardware in the general case so maybe it gets upset
> > >>> here? If the proposed patch is to just set aspm_support_enabled to false
> > >>> when we see the FADT bit set then I think this is fine.
> > >>>
> > >>
> > >> "aspm_support_enabled" alse be used in calculate_support():
> > >> if (pcie_aspm_support_enabled())
> > >>      support |= OSC_PCI_ASPM_SUPPORT | OSC_PCI_CLOCK_PM_SUPPORT;
> > >> When set OSC_PCI_ASPM_SUPPORT | OSC_PCI_CLOCK_PM_SUPPORT, cause this AER
> > >> issue. I want don't set OSC_PCI_ASPM_SUPPORT | OSC_PCI_CLOCK_PM_SUPPORT when
> > >> we see the FADT bit set.
> > >
> > > Oh hm. Are you sure it's the OSC call that breaks it? I have some
> >
> > I don't sure.
> >
> > > recollection that I verified the behaviour of Windows here, but it's
> > > been over 10 years since I touched this so I could well be wrong. I can
> > > try to set up a test env to verify the behaviour of Windows when it
> > > comes to _OSC if the FADT says ASPM is unsupported.
> > >
> > but, I did a test,this modification also solves the problem:
> >
> > diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
> > index d57cf8454b93..b3ea8e886d7c 100644
> > --- a/drivers/acpi/pci_root.c
> > +++ b/drivers/acpi/pci_root.c
> > @@ -494,8 +494,8 @@ static u32 calculate_support(void)
> >          support |= OSC_PCI_HPX_TYPE_3_SUPPORT;
> >       if (pci_ext_cfg_avail())
> >               support |= OSC_PCI_EXT_CONFIG_SUPPORT;
> > -    if (pcie_aspm_support_enabled())
> > -            support |= OSC_PCI_ASPM_SUPPORT | OSC_PCI_CLOCK_PM_SUPPORT;
> > +//  if (pcie_aspm_support_enabled())
> > +//          support |= OSC_PCI_ASPM_SUPPORT | OSC_PCI_CLOCK_PM_SUPPORT;
> >       if (pci_msi_enabled())
> >               support |= OSC_PCI_MSI_SUPPORT;
> >       if (IS_ENABLED(CONFIG_PCIE_EDR))
> >
> > This issue occur in the Notebook: ASUSTeK COMPUTER INC. X456UJ
> > (ASUS-NotebookSKU) Notebook
> >
> > log "AER: Corrected error received: 0000:00:1c.5" is in the device:
> > 00:1c.5 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI
> > Express Root Port #6 [8086:9d15] (rev f1)
>
> So it looks like the BIOS sets ACPI_FADT_NO_ASPM and then happily
> grants control of ASPM via _OSC.  That's somewhat contradictory.
>
> I would rather look at adjusting pcie_aspm_sanity_check() to this case
> instead of wholesale changing the way _OSC is handled at the host
> bridge level.

Actually, if ACPI_FADT_NO_ASPM is set in the FADT, aspm_disabled
should be set already when negotiate_os_control() runs?

Can you please check if this is the case on the affected system?

In that case, negotiate_os_control() doesn't need to set *no_aspm,
because it only causes pcie_no_aspm() to be called, but it should have
been called already.
Rafael J. Wysocki July 15, 2022, 2:07 p.m. UTC | #9
On Fri, Jul 15, 2022 at 2:24 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Fri, Jul 15, 2022 at 9:40 AM Manyi Li <limanyi@uniontech.com> wrote:
> >
> >
> >
> > On 2022/7/14 11:20, Kai-Heng Feng wrote:
> > > [+Cc Matthew]
> > >
> > > On Thu, Jul 14, 2022 at 2:28 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >>
> > >> [+cc Kai-Heng, Vidya, who also have ASPM patches in flight]
> > >>
> > >> On Wed, Jul 13, 2022 at 07:26:12PM +0800, Manyi Li wrote:
> > >>> Startup log of ASUSTeK X456UJ Notebook show:
> > >>> [    0.130563] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
> > >>> [   48.092472] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> > >>> [   48.092479] pcieport 0000:00:1c.5:   device [8086:9d15] error status/mask=00000001/00002000
> > >>> [   48.092481] pcieport 0000:00:1c.5:    [ 0] RxErr
> > >>> [   48.092490] pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5
> > >>> [   48.092504] pcieport 0000:00:1c.5: AER: can't find device of ID00e5
> > >>> [   48.092506] pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5
> > >>
> > >> Can you elaborate on the connection between the FADT ASPM bit and the
> > >> AER logs above?
> >
> > Sorry,I don't know about that.
> >
> > >>
> > >> What problem are we solving here?  A single corrected error being
> > >> logged?  An infinite stream of errors?  A device that doesn't work at
> > >> all?
> > >
> > > Agree, what's the real symptom of the issue?
> >
> > Please see the details of this issus:
> > https://bugzilla.kernel.org/show_bug.cgi?id=216245
> >
> > >
> > >>
> > >> We don't need the dmesg timestamps unless they contribute to
> > >> understanding the problem.  I don't think they do in this case.
> > >
> > > According to commit 387d37577fdd ("PCI: Don't clear ASPM bits when the
> > > FADT declares it's unsupported"), the bit means "just use the ASPM
> > > bits handed over by BIOS".
> > >
> > > However, I do wonder why both drivers/pci/pci-acpi.c and
> > > drivers/acpi/pci_root.c are doing the ACPI_FADT_NO_ASPM check,
>
> Because pci_root.c doesn't read aspm_disabled.

I've recalled a bit in the meantime.

First off, ACPI_FADT_NO_ASPM forbids the OS from enabling ASPM control
(quite literally).  It doesn't mean that the OS should not enumerate
ASPM and it doesn't mean that it should not report ASPM support to the
firmware via _OSC.  Moreover, there are (or at least there were)
systems where the firmware expected ASPM support to be reported via
_OSC anyway (see commit 8b8bae901ce2 PCI/ACPI: Report ASPM support to
BIOS if not disabled from command line).

Thus, if ASPM is not disabled from command line, it would be
consistent to carry out the _OSC negotiation as usual regardless of
ACPI_FADT_NO_ASPM and then handle the case in which it is set in the
same way as the case in which the firmware doesn't grant the kernel
control of some PCIe features.  Does this sound reasonable?

If it does, I think that ASPM should be enumerated regardless of
ACPI_FADT_NO_ASPM, but we need to ensure that its configuration is not
changed in any way if ACPI_FADT_NO_ASPM is set and I'm not sure if
that is the case now.

Of course, the same needs to happen when the kernel doesn't get full
control over PCIe features via _OSC, but AFAICS that case is handled
in the same way as the above already.

> > > maybe one of them should be removed?
>
> Arguably, pci_root.c could look at aspm_disabled instead of looking at
> the FADT flag directly.

Second, if the former does sound reasonable, I'd rather avoid setting
aspm_disabled from drivers/pci/pci-acpi.c upfront when
ACPI_FADT_NO_ASPM is set, because doing that is not consistent with
the above.

Now, there may be BIOSes that don't expect to be informed of the OS
support for ASPM via _OSC if ACPI_FADT_NO_ASPM is set, and the
question is what to do with them.  They clearly are at odds with the
BIOSes that do expect that to happen (mentioned above), so honestly
I'm not sure.

> > I think duplicate work has been done, but comment
> > in drivers/acpi/pci_root.c is
> > * We want to disable ASPM here, but aspm_disabled
> > * needs to remain in its state from boot so that we
> > * properly handle PCIe 1.1 devices.  So we set this
> > * flag here, to defer the action until after the ACPI
> > * root scan.
> >
> > I don't understand this logic.
>
> This is about the case after failing acpi_pci_osc_control_set() and
> generally we need to defer setting aspm_disabled because of
> pcie_aspm_sanity_check().
>
> > >
> > >>
> > >>> Signed-off-by: Manyi Li <limanyi@uniontech.com>
> > >>> ---
> > >>>   drivers/pci/pcie/aspm.c | 1 +
> > >>>   1 file changed, 1 insertion(+)
> > >>>
> > >>> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > >>> index a96b7424c9bc..b173d3c75ae7 100644
> > >>> --- a/drivers/pci/pcie/aspm.c
> > >>> +++ b/drivers/pci/pcie/aspm.c
> > >>> @@ -1359,6 +1359,7 @@ void pcie_no_aspm(void)
> > >>>        if (!aspm_force) {
> > >>>                aspm_policy = POLICY_DEFAULT;
> > >>>                aspm_disabled = 1;
> > >>> +             aspm_support_enabled = false;
> > >>
> > >> This makes pcie_no_aspm() work the same as booting with
> > >> "pcie_aspm=off".  That might be reasonable.
> > >>
> > >> I do wonder why we need both "aspm_disabled" and
> > >> "aspm_support_enabled".  And I wonder why we need to set "aspm_policy"
> > >> when we're disabling ASPM.  But those aren't really connected to your
> > >> change here.
> > >
> > >  From what I can understand "aspm_disabled" means "don't touch ASPM
> > > left by BIOS", and "aspm_support_enabled" means "whether ASPM is
> > > disabled via command line".
> > > There seems to be some overlaps though.
> >
> > According to commit 8b8bae901ce23 ("PCI/ACPI: Report ASPM support to
> > BIOS if not disabled from command line"), "aspm_support_enabled" means
> > whether or not report ASPM support to the BIOS through _OSC.
>
> Right.
Bjorn Helgaas July 15, 2022, 10:49 p.m. UTC | #10
Manyi, FYI, your emails aren't making it to the linux-pci list (or to
me), so I'm missing most of this conversation.

If you look at the lore archive:
  https://lore.kernel.org/all/20220713112612.6935-1-limanyi@uniontech.com/
you'll see all the message-ids that are not found.

Maybe you're sending HTML or something else vger doesn't like?
http://vger.kernel.org/majordomo-info.html#taboo

On Fri, Jul 15, 2022 at 10:32:36AM +0100, Matthew Garrett wrote:
> On Fri, Jul 15, 2022 at 05:19:25PM +0800, Manyi Li wrote:
> > On 2022/7/15 16:29, Matthew Garrett wrote:
> > > On Fri, Jul 15, 2022 at 03:40:36PM +0800, Manyi Li wrote:
> > > 
> > > > Please see the details of this issus:
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=216245
> > > 
> > > Hmm. The only case where changing aspm_support_enabled to false should
> > > matter is in pcie_aspm_init_link_state(), where it looks like we'll
> > > potentially rewrite some registers even if aspm_disabled is true. I
> > > think in theory we shouldn't actually modify anything as a result, and
> > > the lspcis from the bug don't show any ASPM values having changed, but I
> > > don't trust Realtek hardware in the general case so maybe it gets upset
> > > here? If the proposed patch is to just set aspm_support_enabled to false
> > > when we see the FADT bit set then I think this is fine.
> > > 
> > 
> > "aspm_support_enabled" alse be used in calculate_support():
> > if (pcie_aspm_support_enabled())
> >     support |= OSC_PCI_ASPM_SUPPORT | OSC_PCI_CLOCK_PM_SUPPORT;
> > When set OSC_PCI_ASPM_SUPPORT | OSC_PCI_CLOCK_PM_SUPPORT, cause this AER
> > issue. I want don't set OSC_PCI_ASPM_SUPPORT | OSC_PCI_CLOCK_PM_SUPPORT when
> > we see the FADT bit set.
> 
> Oh hm. Are you sure it's the OSC call that breaks it? I have some 
> recollection that I verified the behaviour of Windows here, but it's 
> been over 10 years since I touched this so I could well be wrong. I can 
> try to set up a test env to verify the behaviour of Windows when it 
> comes to _OSC if the FADT says ASPM is unsupported.
Manyi Li Aug. 15, 2022, 9:02 a.m. UTC | #11
在 2022/7/15 22:07, Rafael J. Wysocki 写道:
> On Fri, Jul 15, 2022 at 2:24 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>
>> On Fri, Jul 15, 2022 at 9:40 AM Manyi Li <limanyi@uniontech.com> wrote:
>>>
>>>
>>>
>>> On 2022/7/14 11:20, Kai-Heng Feng wrote:
>>>> [+Cc Matthew]
>>>>
>>>> On Thu, Jul 14, 2022 at 2:28 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>>>>>
>>>>> [+cc Kai-Heng, Vidya, who also have ASPM patches in flight]
>>>>>
>>>>> On Wed, Jul 13, 2022 at 07:26:12PM +0800, Manyi Li wrote:
>>>>>> Startup log of ASUSTeK X456UJ Notebook show:
>>>>>> [    0.130563] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
>>>>>> [   48.092472] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
>>>>>> [   48.092479] pcieport 0000:00:1c.5:   device [8086:9d15] error status/mask=00000001/00002000
>>>>>> [   48.092481] pcieport 0000:00:1c.5:    [ 0] RxErr
>>>>>> [   48.092490] pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5
>>>>>> [   48.092504] pcieport 0000:00:1c.5: AER: can't find device of ID00e5
>>>>>> [   48.092506] pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5
>>>>>
>>>>> Can you elaborate on the connection between the FADT ASPM bit and the
>>>>> AER logs above?
>>>
>>> Sorry,I don't know about that.
>>>
>>>>>
>>>>> What problem are we solving here?  A single corrected error being
>>>>> logged?  An infinite stream of errors?  A device that doesn't work at
>>>>> all?
>>>>
>>>> Agree, what's the real symptom of the issue?
>>>
>>> Please see the details of this issus:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=216245
>>>
>>>>
>>>>>
>>>>> We don't need the dmesg timestamps unless they contribute to
>>>>> understanding the problem.  I don't think they do in this case.
>>>>
>>>> According to commit 387d37577fdd ("PCI: Don't clear ASPM bits when the
>>>> FADT declares it's unsupported"), the bit means "just use the ASPM
>>>> bits handed over by BIOS".
>>>>
>>>> However, I do wonder why both drivers/pci/pci-acpi.c and
>>>> drivers/acpi/pci_root.c are doing the ACPI_FADT_NO_ASPM check,
>>
>> Because pci_root.c doesn't read aspm_disabled.
> 
> I've recalled a bit in the meantime.
> 
> First off, ACPI_FADT_NO_ASPM forbids the OS from enabling ASPM control
> (quite literally).  It doesn't mean that the OS should not enumerate
> ASPM and it doesn't mean that it should not report ASPM support to the
> firmware via _OSC.  Moreover, there are (or at least there were)
> systems where the firmware expected ASPM support to be reported via
> _OSC anyway (see commit 8b8bae901ce2 PCI/ACPI: Report ASPM support to
> BIOS if not disabled from command line).
> 
> Thus, if ASPM is not disabled from command line, it would be
> consistent to carry out the _OSC negotiation as usual regardless of
> ACPI_FADT_NO_ASPM and then handle the case in which it is set in the
> same way as the case in which the firmware doesn't grant the kernel
> control of some PCIe features.  Does this sound reasonable

This sound reasonable.

> 
> If it does, I think that ASPM should be enumerated regardless of
> ACPI_FADT_NO_ASPM, but we need to ensure that its configuration is not
> changed in any way if ACPI_FADT_NO_ASPM is set and I'm not sure if
> that is the case now.
> 
> Of course, the same needs to happen when the kernel doesn't get full
> control over PCIe features via _OSC, but AFAICS that case is handled
> in the same way as the above already.
> 
>>>> maybe one of them should be removed?
>>
>> Arguably, pci_root.c could look at aspm_disabled instead of looking at
>> the FADT flag directly.
> 
> Second, if the former does sound reasonable, I'd rather avoid setting
> aspm_disabled from drivers/pci/pci-acpi.c upfront when
> ACPI_FADT_NO_ASPM is set, because doing that is not consistent with
> the above.
> 
> Now, there may be BIOSes that don't expect to be informed of the OS
> support for ASPM via _OSC if ACPI_FADT_NO_ASPM is set, and the
> question is what to do with them.  They clearly are at odds with the
> BIOSes that do expect that to happen (mentioned above), so honestly
> I'm not sure.

I'm not sure my issues is caused by report ASPM support to the firmware 
via _OSC.My issues is the same as this link:
https://groups.google.com/g/fa.linux.kernel/c/0uz8Nr_NVOI

Links to other discussions on this issue:
https://lore.kernel.org/all/20151229155822.GA17321@localhost/T/#u
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173

> 
>>> I think duplicate work has been done, but comment
>>> in drivers/acpi/pci_root.c is
>>> * We want to disable ASPM here, but aspm_disabled
>>> * needs to remain in its state from boot so that we
>>> * properly handle PCIe 1.1 devices.  So we set this
>>> * flag here, to defer the action until after the ACPI
>>> * root scan.
>>>
>>> I don't understand this logic.
>>
>> This is about the case after failing acpi_pci_osc_control_set() and
>> generally we need to defer setting aspm_disabled because of
>> pcie_aspm_sanity_check().
>>
>>>>
>>>>>
>>>>>> Signed-off-by: Manyi Li <limanyi@uniontech.com>
>>>>>> ---
>>>>>>    drivers/pci/pcie/aspm.c | 1 +
>>>>>>    1 file changed, 1 insertion(+)
>>>>>>
>>>>>> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
>>>>>> index a96b7424c9bc..b173d3c75ae7 100644
>>>>>> --- a/drivers/pci/pcie/aspm.c
>>>>>> +++ b/drivers/pci/pcie/aspm.c
>>>>>> @@ -1359,6 +1359,7 @@ void pcie_no_aspm(void)
>>>>>>         if (!aspm_force) {
>>>>>>                 aspm_policy = POLICY_DEFAULT;
>>>>>>                 aspm_disabled = 1;
>>>>>> +             aspm_support_enabled = false;
>>>>>
>>>>> This makes pcie_no_aspm() work the same as booting with
>>>>> "pcie_aspm=off".  That might be reasonable.
>>>>>
>>>>> I do wonder why we need both "aspm_disabled" and
>>>>> "aspm_support_enabled".  And I wonder why we need to set "aspm_policy"
>>>>> when we're disabling ASPM.  But those aren't really connected to your
>>>>> change here.
>>>>
>>>>   From what I can understand "aspm_disabled" means "don't touch ASPM
>>>> left by BIOS", and "aspm_support_enabled" means "whether ASPM is
>>>> disabled via command line".
>>>> There seems to be some overlaps though.
>>>
>>> According to commit 8b8bae901ce23 ("PCI/ACPI: Report ASPM support to
>>> BIOS if not disabled from command line"), "aspm_support_enabled" means
>>> whether or not report ASPM support to the BIOS through _OSC.
>>
>> Right.
>
Bjorn Helgaas Aug. 15, 2022, 6:56 p.m. UTC | #12
Subject line should say what the patch *does*, not what *should*
happen.

On Wed, Jul 13, 2022 at 07:26:12PM +0800, Manyi Li wrote:
> Startup log of ASUSTeK X456UJ Notebook show:
> [    0.130563] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
> [   48.092472] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> [   48.092479] pcieport 0000:00:1c.5:   device [8086:9d15] error status/mask=00000001/00002000
> [   48.092481] pcieport 0000:00:1c.5:    [ 0] RxErr
> [   48.092490] pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5
> [   48.092504] pcieport 0000:00:1c.5: AER: can't find device of ID00e5
> [   48.092506] pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5

No need for timestamps in commit log because they're not relevant to
this issue.  Please use quote style, i.e., blank line before and after
the quote and indent the quote two spaces:

  Startup log of ... shows:

    ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
    ...
    pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)

Can you please open a report at https://bugzilla.kernel.org and attach
the complete dmesg log and the complete "sudo lspci -vv" output before
and after this patch?  Then include the bugzilla URL in the commit
log.

I assume the patch prevents the RxErr from being reported, but I'd
like to know why, or at least have a plausible explanation of how
enabling ASPM might lead to a Receiver Error.

I'd also like to know why the "can't find device of ID00e5" happens.

Since it's not obvious from the patch, mention that "reporting ASPM
support to BIOS" happens via the _OSC Support field.

> Signed-off-by: Manyi Li <limanyi@uniontech.com>
> ---
>  drivers/pci/pcie/aspm.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> index a96b7424c9bc..b173d3c75ae7 100644
> --- a/drivers/pci/pcie/aspm.c
> +++ b/drivers/pci/pcie/aspm.c
> @@ -1359,6 +1359,7 @@ void pcie_no_aspm(void)
>  	if (!aspm_force) {
>  		aspm_policy = POLICY_DEFAULT;
>  		aspm_disabled = 1;
> +		aspm_support_enabled = false;
>  	}
>  }
>  
> -- 
> 2.20.1
> 
> 
>
diff mbox series

Patch

diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index a96b7424c9bc..b173d3c75ae7 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -1359,6 +1359,7 @@  void pcie_no_aspm(void)
 	if (!aspm_force) {
 		aspm_policy = POLICY_DEFAULT;
 		aspm_disabled = 1;
+		aspm_support_enabled = false;
 	}
 }