diff mbox series

[v2] platform/x86: p2sb: Defer P2SB device scan when P2SB device has func 0

Message ID 20240302012813.2011111-1-shinichiro.kawasaki@wdc.com
State New
Headers show
Series [v2] platform/x86: p2sb: Defer P2SB device scan when P2SB device has func 0 | expand

Commit Message

Shinichiro Kawasaki March 2, 2024, 1:28 a.m. UTC
The commit 5913320eb0b3 ("platform/x86: p2sb: Allow p2sb_bar() calls
during PCI device probe") triggered repeated ACPI errors on ASUS
VivoBook D540NV-GQ065T [1]. It was confirmed that the P2SB device scan
and remove at the fs_initcall stage triggered the errors.

To avoid the error, defer the P2SB device scan on the concerned device.
The error was observed on the system with Pentium N4200 in Goldmont micro-
architecture, and on which P2SB has function 0. Then refer to the P2SB
function to decide whether to defer or not.

When the device scan is deferred, do the scan later when p2sb_bar() is
called for the first time. If this first scan is triggered by sysfs
pci bus rescan, deadlock happens. In most cases, the scan happens during
system boot process, then there is no chance of deadlock.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=218531 [1]
Fixes: 5913320eb0b3 ("platform/x86: p2sb: Allow p2sb_bar() calls during PCI device probe")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
Changes from v1:
* Removed unnecessary p2sb_resource_cached()
* Reflected other review comments

 drivers/platform/x86/p2sb.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

Comments

Shinichiro Kawasaki March 2, 2024, 7:28 a.m. UTC | #1
On Mar 02, 2024 / 10:28, Shin'ichiro Kawasaki wrote:
> The commit 5913320eb0b3 ("platform/x86: p2sb: Allow p2sb_bar() calls
> during PCI device probe") triggered repeated ACPI errors on ASUS
> VivoBook D540NV-GQ065T [1]. It was confirmed that the P2SB device scan
> and remove at the fs_initcall stage triggered the errors.
> 
> To avoid the error, defer the P2SB device scan on the concerned device.
> The error was observed on the system with Pentium N4200 in Goldmont micro-
> architecture, and on which P2SB has function 0. Then refer to the P2SB
> function to decide whether to defer or not.
> 
> When the device scan is deferred, do the scan later when p2sb_bar() is
> called for the first time. If this first scan is triggered by sysfs
> pci bus rescan, deadlock happens. In most cases, the scan happens during
> system boot process, then there is no chance of deadlock.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218531 [1]
> Fixes: 5913320eb0b3 ("platform/x86: p2sb: Allow p2sb_bar() calls during PCI device probe")
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>

Let me drop this patch. danilrybakov found that the ACPI errors are still
reported even with this patch. Will try another fix approach.
Hans de Goede March 2, 2024, 11:43 a.m. UTC | #2
Hi Shinichiro,

Thank you for your work on this.

On 3/2/24 08:28, Shinichiro Kawasaki wrote:
> On Mar 02, 2024 / 10:28, Shin'ichiro Kawasaki wrote:
>> The commit 5913320eb0b3 ("platform/x86: p2sb: Allow p2sb_bar() calls
>> during PCI device probe") triggered repeated ACPI errors on ASUS
>> VivoBook D540NV-GQ065T [1]. It was confirmed that the P2SB device scan
>> and remove at the fs_initcall stage triggered the errors.
>>
>> To avoid the error, defer the P2SB device scan on the concerned device.
>> The error was observed on the system with Pentium N4200 in Goldmont micro-
>> architecture, and on which P2SB has function 0. Then refer to the P2SB
>> function to decide whether to defer or not.
>>
>> When the device scan is deferred, do the scan later when p2sb_bar() is
>> called for the first time. If this first scan is triggered by sysfs
>> pci bus rescan, deadlock happens. In most cases, the scan happens during
>> system boot process, then there is no chance of deadlock.
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218531 [1]
>> Fixes: 5913320eb0b3 ("platform/x86: p2sb: Allow p2sb_bar() calls during PCI device probe")
>> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> 
> Let me drop this patch. danilrybakov found that the ACPI errors are still
> reported even with this patch. Will try another fix approach.

Can we not simply just skip scanning function 0 all together when
on Goldmont? I don't think any drivers actually ask for the bar
of function 0 on Goldmont ?

This is likely also why we never had the issue with the old p2sb_bar()
code, because that never touched function 0.

I think this is actually what you did in one of your first test
patches in the bugzilla, right ?

So maybe audit all the callers of p2sb_bar() and see if any
caller asks for function 0 on goldmont ?

Let me know if you need help with this audit.

Regards,

Hans
Shinichiro Kawasaki March 2, 2024, 11:37 p.m. UTC | #3
On Mar 02, 2024 / 12:43, Hans de Goede wrote:
> Hi Shinichiro,
> 
> Thank you for your work on this.
> 
> On 3/2/24 08:28, Shinichiro Kawasaki wrote:
> > On Mar 02, 2024 / 10:28, Shin'ichiro Kawasaki wrote:
> >> The commit 5913320eb0b3 ("platform/x86: p2sb: Allow p2sb_bar() calls
> >> during PCI device probe") triggered repeated ACPI errors on ASUS
> >> VivoBook D540NV-GQ065T [1]. It was confirmed that the P2SB device scan
> >> and remove at the fs_initcall stage triggered the errors.
> >>
> >> To avoid the error, defer the P2SB device scan on the concerned device.
> >> The error was observed on the system with Pentium N4200 in Goldmont micro-
> >> architecture, and on which P2SB has function 0. Then refer to the P2SB
> >> function to decide whether to defer or not.
> >>
> >> When the device scan is deferred, do the scan later when p2sb_bar() is
> >> called for the first time. If this first scan is triggered by sysfs
> >> pci bus rescan, deadlock happens. In most cases, the scan happens during
> >> system boot process, then there is no chance of deadlock.
> >>
> >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218531 [1]
> >> Fixes: 5913320eb0b3 ("platform/x86: p2sb: Allow p2sb_bar() calls during PCI device probe")
> >> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > 
> > Let me drop this patch. danilrybakov found that the ACPI errors are still
> > reported even with this patch. Will try another fix approach.
> 
> Can we not simply just skip scanning function 0 all together when
> on Goldmont? I don't think any drivers actually ask for the bar
> of function 0 on Goldmont ?

Hi Hans, thank you for the idea. If we can take this appraoch, the fix patch
will be simpler.

> 
> This is likely also why we never had the issue with the old p2sb_bar()
> code, because that never touched function 0.
> 
> I think this is actually what you did in one of your first test
> patches in the bugzilla, right ?

To be precise, the first test patch did P2SB scan only for the function 2.
To make sure your idea works, it's the better to test to scan all the function
numbers except 0, from 1 to 7.

> 
> So maybe audit all the callers of p2sb_bar() and see if any
> caller asks for function 0 on goldmont ?
> 
> Let me know if you need help with this audit.

Help for the audit will be appreciated.

With the quick grep for p2sb_bar() [2], there are five p2sb_bar() callers:

 1) edac/pnd2_edac             devfn = 0
 2) i2c/busses/i2c-i801        devfn = 0
 3) mfd/lpc_ich for pinctrl    devfn = 0
 4) mfd/lpc_ich for spi        devfn = PCI_DEVFN(13, 2)
 5) watchdog/simatic-ipc-wdt   devfn = 0

Four out of the five callers specify devfn = 0, which means devfn is the P2SB
default PCI_DEVFN(13, 0) on Goldmont. So the question is if the four drivers are
used on Goldmont platform or not. And I have no clue...

[2]

$ git grep p2sb_bar
drivers/edac/pnd2_edac.c:                       ret = p2sb_bar(NULL, 0, &r);
drivers/i2c/busses/i2c-i801.c:  ret = p2sb_bar(pci_dev->bus, 0, res);
drivers/mfd/lpc_ich.c:  ret = p2sb_bar(dev->bus, 0, &base);
drivers/mfd/lpc_ich.c:          ret = p2sb_bar(dev->bus, PCI_DEVFN(13, 2), res);
drivers/platform/x86/p2sb.c: * p2sb_bar - Get Primary to Sideband (P2SB) bridge device BAR
drivers/platform/x86/p2sb.c:int p2sb_bar(struct pci_bus *bus, unsigned int devfn, struct resource *mem)
drivers/platform/x86/p2sb.c:EXPORT_SYMBOL_GPL(p2sb_bar);
drivers/watchdog/simatic-ipc-wdt.c:             ret = p2sb_bar(NULL, 0, res);
include/linux/platform_data/x86/p2sb.h:int p2sb_bar(struct pci_bus *bus, unsigned int devfn, struct resource *mem);
include/linux/platform_data/x86/p2sb.h:static inline int p2sb_bar(struct pci_bus *bus, unsigned int devfn, struct resource *mem)
Hans de Goede March 3, 2024, 7:35 p.m. UTC | #4
Hi Shinichiro,

On 3/3/24 00:37, Shinichiro Kawasaki wrote:
> On Mar 02, 2024 / 12:43, Hans de Goede wrote:

<snip>

>> Can we not simply just skip scanning function 0 all together when
>> on Goldmont? I don't think any drivers actually ask for the bar
>> of function 0 on Goldmont ?
> 
> Hi Hans, thank you for the idea. If we can take this appraoch, the fix patch
> will be simpler.
> 
>>
>> This is likely also why we never had the issue with the old p2sb_bar()
>> code, because that never touched function 0.
>>
>> I think this is actually what you did in one of your first test
>> patches in the bugzilla, right ?
> 
> To be precise, the first test patch did P2SB scan only for the function 2.
> To make sure your idea works, it's the better to test to scan all the function
> numbers except 0, from 1 to 7.
> 
>>
>> So maybe audit all the callers of p2sb_bar() and see if any
>> caller asks for function 0 on goldmont ?
>>
>> Let me know if you need help with this audit.
> 
> Help for the audit will be appreciated.
> 
> With the quick grep for p2sb_bar() [2], there are five p2sb_bar() callers:

Ack, I have found the same 5 callers, let go over them one by one:

>  1) edac/pnd2_edac             devfn = 0

Hmm, ok so this one binds based on CPU-ids:

static const struct x86_cpu_id pnd2_cpuids[] = {
        X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT,       &apl_ops),
        X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT_D,     &dnv_ops),
        { }
};
MODULE_DEVICE_TABLE(x86cpu, pnd2_cpuids);

And the 0 passed here will get replaced by PCI_DEVFN(13, 0),
so there goes my theory of p2sb() never being called for
function 0.

So I have taken a quick look at your latest patch from:
https://bugzilla.kernel.org/show_bug.cgi?id=218531

I think that skipping the caching at fs_initcall() on goldmont
is a good idea.

But you still cache *all* the bars for goldmont on the first
p2sb_bar(bus, 0, &res) call .

If we delay caching the bars till there first use, why not
just do that for all the bars and also drop p2sb_scan_and_cache()
which for non goldmont is equivalent to p2sb_scan_and_cache_devfn()
but on goldmont caches all the functions.

Since you now delay caching (on goldmont) to the first p2sb_bar()
call I think that you can just drop p2sb_scan_and_cache()
altogether and just directly call p2sb_scan_and_cache_devfn()
in its place.

This means that on goldmont where both the p2sb devfn
PCI_DEVFN(13, 0) and the SPI controller PCI_DEVFN(13, 2)
are used we end up going through p2sb_cache_resources()
twice, assuming both are actually requested at least once.
But with your current patch this will also happen when
PCI_DEVFN(13, 2) gets requested first because then
p2sb_scan_and_cache() will enter the "not function 0"
path and only cache the one resource.

So I think that it would make things more KISS if
p2sb_bar() always only caches the requested devfn bar0
instead of treating function0 special as it does now.

Also talking about making things more KISS, how
about completely dropping the fs_initcall and
simply always delay the caching of a devfn until
the first call of p2sb_bar() for that devfn ?

That way fixing the issue will also actually reduce /
simplify the code :)

Regards,

Hans


p.s.

FWIW here is a quick analysis of the other callers:

>  2) i2c/busses/i2c-i801        devfn = 0

Goldmont platforms use PCI_DEVICE_ID_INTEL_BROXTON_SMBUS for the i801
i2c adapter and that does not have FEATURE_TCO_* set in its
feature flags so the p2sb_bar() call there is never made
on broxton.

>  3) mfd/lpc_ich for pinctrl    devfn = 0

This one is made on Apollo Lake devices, which use goldmont
CPU cores, so this is another case where function 0 is
actually queried through p2sb_bar()...

>  4) mfd/lpc_ich for spi        devfn = PCI_DEVFN(13, 2)

Not function 0.

>  5) watchdog/simatic-ipc-wdt   devfn = 0

This is actually a similar usage to the mfd/lpc_ich to get
to the GPIO controller when its not exported through ACPI,
but before the code switched to the p2sb_bar() helper it
was looking at PCI_DEVFN(31, 1), so this does not run
on goldmont.

Regards,

Hans




> 
> [2]
> 
> $ git grep p2sb_bar
> drivers/edac/pnd2_edac.c:                       ret = p2sb_bar(NULL, 0, &r);
> drivers/i2c/busses/i2c-i801.c:  ret = p2sb_bar(pci_dev->bus, 0, res);
> drivers/mfd/lpc_ich.c:  ret = p2sb_bar(dev->bus, 0, &base);
> drivers/mfd/lpc_ich.c:          ret = p2sb_bar(dev->bus, PCI_DEVFN(13, 2), res);
> drivers/platform/x86/p2sb.c: * p2sb_bar - Get Primary to Sideband (P2SB) bridge device BAR
> drivers/platform/x86/p2sb.c:int p2sb_bar(struct pci_bus *bus, unsigned int devfn, struct resource *mem)
> drivers/platform/x86/p2sb.c:EXPORT_SYMBOL_GPL(p2sb_bar);
> drivers/watchdog/simatic-ipc-wdt.c:             ret = p2sb_bar(NULL, 0, res);
> include/linux/platform_data/x86/p2sb.h:int p2sb_bar(struct pci_bus *bus, unsigned int devfn, struct resource *mem);
> include/linux/platform_data/x86/p2sb.h:static inline int p2sb_bar(struct pci_bus *bus, unsigned int devfn, struct resource *mem)
>
Shinichiro Kawasaki March 4, 2024, 3:19 a.m. UTC | #5
On Mar 03, 2024 / 20:35, Hans de Goede wrote:
> Hi Shinichiro,
> 
> On 3/3/24 00:37, Shinichiro Kawasaki wrote:
> > On Mar 02, 2024 / 12:43, Hans de Goede wrote:
> 
> <snip>
> 
> >> Can we not simply just skip scanning function 0 all together when
> >> on Goldmont? I don't think any drivers actually ask for the bar
> >> of function 0 on Goldmont ?
> > 
> > Hi Hans, thank you for the idea. If we can take this appraoch, the fix patch
> > will be simpler.
> > 
> >>
> >> This is likely also why we never had the issue with the old p2sb_bar()
> >> code, because that never touched function 0.
> >>
> >> I think this is actually what you did in one of your first test
> >> patches in the bugzilla, right ?
> > 
> > To be precise, the first test patch did P2SB scan only for the function 2.
> > To make sure your idea works, it's the better to test to scan all the function
> > numbers except 0, from 1 to 7.
> > 
> >>
> >> So maybe audit all the callers of p2sb_bar() and see if any
> >> caller asks for function 0 on goldmont ?
> >>
> >> Let me know if you need help with this audit.
> > 
> > Help for the audit will be appreciated.
> > 
> > With the quick grep for p2sb_bar() [2], there are five p2sb_bar() callers:
> 
> Ack, I have found the same 5 callers, let go over them one by one:
> 
> >  1) edac/pnd2_edac             devfn = 0
> 
> Hmm, ok so this one binds based on CPU-ids:
> 
> static const struct x86_cpu_id pnd2_cpuids[] = {
>         X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT,       &apl_ops),
>         X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT_D,     &dnv_ops),
>         { }
> };
> MODULE_DEVICE_TABLE(x86cpu, pnd2_cpuids);
> 
> And the 0 passed here will get replaced by PCI_DEVFN(13, 0),
> so there goes my theory of p2sb() never being called for
> function 0.

Thank you very much fot the audit. So, it was clarified that the p2sb_bar()
can be called for function 0 on goldmont.

> 
> So I have taken a quick look at your latest patch from:
> https://bugzilla.kernel.org/show_bug.cgi?id=218531
> 
> I think that skipping the caching at fs_initcall() on goldmont
> is a good idea.
> 
> But you still cache *all* the bars for goldmont on the first
> p2sb_bar(bus, 0, &res) call .
> 
> If we delay caching the bars till there first use, why not
> just do that for all the bars and also drop p2sb_scan_and_cache()
> which for non goldmont is equivalent to p2sb_scan_and_cache_devfn()
> but on goldmont caches all the functions.
> 
> Since you now delay caching (on goldmont) to the first p2sb_bar()
> call I think that you can just drop p2sb_scan_and_cache()
> altogether and just directly call p2sb_scan_and_cache_devfn()
> in its place.
> 
> This means that on goldmont where both the p2sb devfn
> PCI_DEVFN(13, 0) and the SPI controller PCI_DEVFN(13, 2)
> are used we end up going through p2sb_cache_resources()
> twice, assuming both are actually requested at least once.
> But with your current patch this will also happen when
> PCI_DEVFN(13, 2) gets requested first because then
> p2sb_scan_and_cache() will enter the "not function 0"
> path and only cache the one resource.
> 
> So I think that it would make things more KISS if
> p2sb_bar() always only caches the requested devfn bar0
> instead of treating function0 special as it does now.

Thank you again for looking into the patch. I agree that the "function 0" path
in p2sb_scan_and_cache() is not meaningful any more. When I prepare v3 patch,
I will modify the patch to call p2sb_scan_and_cache_devfn() in place of
p2sb_scan_and_cache().

> 
> Also talking about making things more KISS, how
> about completely dropping the fs_initcall and
> simply always delay the caching of a devfn until
> the first call of p2sb_bar() for that devfn ?
> 
> That way fixing the issue will also actually reduce /
> simplify the code :)

This will simplify the code more, but it has two drawabacks:

1) It still leaves the rare deadlock scenario. If the drivers which call
   p2sb_bar() are not probed during boot, and if they are probed afterwards by
   sysfs pci bus rescan, pci_rescan_remove_lock causes the deadlock.

2) It triggers lockdep splat for pci_rescan_remove_lock at sysfs pci bus rescan,
   even for devices unrelated to p2sb (This is what I regularly observe during
   kernel tests for storage sub-system.)

I suggest to limit these drawbacks only on goldmont.

On the other hand, it means that the drawbacks will be still left on goldmont.
I would like to follow your call. If you think the "no fs_initcall() and always
delay the caching" is the best, I'm willing to prepare the patch for it.
Hans de Goede March 4, 2024, 9:55 a.m. UTC | #6
Hi,

On 3/4/24 4:19 AM, Shinichiro Kawasaki wrote:
> On Mar 03, 2024 / 20:35, Hans de Goede wrote:
>> Hi Shinichiro,
>>
>> On 3/3/24 00:37, Shinichiro Kawasaki wrote:

<snip>

>> So I have taken a quick look at your latest patch from:
>> https://bugzilla.kernel.org/show_bug.cgi?id=218531
>>
>> I think that skipping the caching at fs_initcall() on goldmont
>> is a good idea.
>>
>> But you still cache *all* the bars for goldmont on the first
>> p2sb_bar(bus, 0, &res) call .
>>
>> If we delay caching the bars till there first use, why not
>> just do that for all the bars and also drop p2sb_scan_and_cache()
>> which for non goldmont is equivalent to p2sb_scan_and_cache_devfn()
>> but on goldmont caches all the functions.
>>
>> Since you now delay caching (on goldmont) to the first p2sb_bar()
>> call I think that you can just drop p2sb_scan_and_cache()
>> altogether and just directly call p2sb_scan_and_cache_devfn()
>> in its place.
>>
>> This means that on goldmont where both the p2sb devfn
>> PCI_DEVFN(13, 0) and the SPI controller PCI_DEVFN(13, 2)
>> are used we end up going through p2sb_cache_resources()
>> twice, assuming both are actually requested at least once.
>> But with your current patch this will also happen when
>> PCI_DEVFN(13, 2) gets requested first because then
>> p2sb_scan_and_cache() will enter the "not function 0"
>> path and only cache the one resource.
>>
>> So I think that it would make things more KISS if
>> p2sb_bar() always only caches the requested devfn bar0
>> instead of treating function0 special as it does now.
> 
> Thank you again for looking into the patch. I agree that the "function 0" path
> in p2sb_scan_and_cache() is not meaningful any more. When I prepare v3 patch,
> I will modify the patch to call p2sb_scan_and_cache_devfn() in place of
> p2sb_scan_and_cache().

I've seen you've posted v3, this looks good, thanks.

>> Also talking about making things more KISS, how
>> about completely dropping the fs_initcall and
>> simply always delay the caching of a devfn until
>> the first call of p2sb_bar() for that devfn ?
>>
>> That way fixing the issue will also actually reduce /
>> simplify the code :)
> 
> This will simplify the code more, but it has two drawabacks:
> 
> 1) It still leaves the rare deadlock scenario. If the drivers which call
>    p2sb_bar() are not probed during boot, and if they are probed afterwards by
>    sysfs pci bus rescan, pci_rescan_remove_lock causes the deadlock.
> 
> 2) It triggers lockdep splat for pci_rescan_remove_lock at sysfs pci bus rescan,
>    even for devices unrelated to p2sb (This is what I regularly observe during
>    kernel tests for storage sub-system.)
> 
> I suggest to limit these drawbacks only on goldmont.

I was not aware of the lockdep triggering issue. I agree that should be avoided
when possible. I see you have kept the fs_initcall() for this in v3, good.

Regards,

Hans
Hans de Goede March 4, 2024, 11:04 a.m. UTC | #7
Hi Shinichiro,

On 3/4/24 10:55 AM, Hans de Goede wrote:
> Hi,
> 
> On 3/4/24 4:19 AM, Shinichiro Kawasaki wrote:
>> On Mar 03, 2024 / 20:35, Hans de Goede wrote:

<snip>

>>> Also talking about making things more KISS, how
>>> about completely dropping the fs_initcall and
>>> simply always delay the caching of a devfn until
>>> the first call of p2sb_bar() for that devfn ?
>>>
>>> That way fixing the issue will also actually reduce /
>>> simplify the code :)
>>
>> This will simplify the code more, but it has two drawabacks:
>>
>> 1) It still leaves the rare deadlock scenario. If the drivers which call
>>    p2sb_bar() are not probed during boot, and if they are probed afterwards by
>>    sysfs pci bus rescan, pci_rescan_remove_lock causes the deadlock.
>>
>> 2) It triggers lockdep splat for pci_rescan_remove_lock at sysfs pci bus rescan,
>>    even for devices unrelated to p2sb (This is what I regularly observe during
>>    kernel tests for storage sub-system.)
>>
>> I suggest to limit these drawbacks only on goldmont.
> 
> I was not aware of the lockdep triggering issue. I agree that should be avoided
> when possible. I see you have kept the fs_initcall() for this in v3, good.

I have just taken a second look at my analysis of when p2sb_bar(devfn=0) is called
on Goldmont platforms and my analysis for:

1) edac/pnd2_edac             devfn = 0

was wrong, p2sb_bar(devfn=0) is only used on Denverton (aka "Goldmont D")
which uses the default P2SB_DEVFN_DEFAULT = PCI_DEVFN(31, 1) devfn not
the special Goldmont devfn.

So the p2sb_bar(devfn=0) call actually only happens on Goldmont from:

3) mfd/lpc_ich for pinctrl    devfn = 0

and then only when the ACPI tables fail to properly describe the GPIO
controllers as ACPI devices, if the GPIO controllers are described
in ACPI, which they are on the ASUS VivoBook D540NV-GQ065T then that
call gets skipped.

So on the ASUS VivoBook D540NV-GQ065T p2sb_bar(devfn=0) gets never
called. Which explains why not caching it fixes things. I assume that this
laptop just seems not likes the P2SB is touched at all and by not caching
the BAR for the P2SB it ends up not getting touched at all.

This also means that likely the P2SB devfn itself generally speaking is
often not touched on Goldmont platforms. So we can avoid the lockdep
issue on PCI bus rescan there by caching the SPI controller
PCI_DEVFN(13, 2) devfn from fs_initcall(), since that will be the only
devfn for which p2sb_bar() will get called (except on hw with the
GPIO controller missing from the ACPI tables which should be rare).

I have prepared a follow up patch to your v3 to cache the
SPI controller devfn instead of the P2SB devfn at fs_initcall()
time. I'll post this shortly and I'll also ask the bug reporter
to test the combination of our 2 patches.

Regards,

Hans
Shinichiro Kawasaki March 4, 2024, 12:13 p.m. UTC | #8
On Mar 04, 2024 / 12:04, Hans de Goede wrote:
...
> So the p2sb_bar(devfn=0) call actually only happens on Goldmont from:
> 
> 3) mfd/lpc_ich for pinctrl    devfn = 0
> 
> and then only when the ACPI tables fail to properly describe the GPIO
> controllers as ACPI devices, if the GPIO controllers are described
> in ACPI, which they are on the ASUS VivoBook D540NV-GQ065T then that
> call gets skipped.
> 
> So on the ASUS VivoBook D540NV-GQ065T p2sb_bar(devfn=0) gets never
> called. Which explains why not caching it fixes things. I assume that this
> laptop just seems not likes the P2SB is touched at all and by not caching
> the BAR for the P2SB it ends up not getting touched at all.

Thanks for sharing the insights.

> 
> This also means that likely the P2SB devfn itself generally speaking is
> often not touched on Goldmont platforms. So we can avoid the lockdep
> issue on PCI bus rescan there by caching the SPI controller
> PCI_DEVFN(13, 2) devfn from fs_initcall(), since that will be the only
> devfn for which p2sb_bar() will get called (except on hw with the
> GPIO controller missing from the ACPI tables which should be rare).

Oh, this sounds a great idea.

> 
> I have prepared a follow up patch to your v3 to cache the
> SPI controller devfn instead of the P2SB devfn at fs_initcall()
> time. I'll post this shortly and I'll also ask the bug reporter
> to test the combination of our 2 patches.

Thanks a lot. Looking forward to the results.
Hans de Goede March 4, 2024, 1:59 p.m. UTC | #9
Hi,

On 3/4/24 1:13 PM, Shinichiro Kawasaki wrote:
> On Mar 04, 2024 / 12:04, Hans de Goede wrote:
> ...
>> So the p2sb_bar(devfn=0) call actually only happens on Goldmont from:
>>
>> 3) mfd/lpc_ich for pinctrl    devfn = 0
>>
>> and then only when the ACPI tables fail to properly describe the GPIO
>> controllers as ACPI devices, if the GPIO controllers are described
>> in ACPI, which they are on the ASUS VivoBook D540NV-GQ065T then that
>> call gets skipped.
>>
>> So on the ASUS VivoBook D540NV-GQ065T p2sb_bar(devfn=0) gets never
>> called. Which explains why not caching it fixes things. I assume that this
>> laptop just seems not likes the P2SB is touched at all and by not caching
>> the BAR for the P2SB it ends up not getting touched at all.
> 
> Thanks for sharing the insights.

Looking closer at the actual unhiding I don't think accessing func 0
is the problem. The unhiding is always done on function 0 even when
retreiving the bar for function 2 (the SPI function).

So taking that into account, as mentioned in the bugzilla, I think
the problem is probing the other functions (1, 3-7) by calling
pci_scan_single_device() on them.

I have prepared an alternative fix based on this and asked
Danilrybakov to test that in the bugzilla.

(and also posted it as a RFC to the list)

>> This also means that likely the P2SB devfn itself generally speaking is
>> often not touched on Goldmont platforms. So we can avoid the lockdep
>> issue on PCI bus rescan there by caching the SPI controller
>> PCI_DEVFN(13, 2) devfn from fs_initcall(), since that will be the only
>> devfn for which p2sb_bar() will get called (except on hw with the
>> GPIO controller missing from the ACPI tables which should be rare).
> 
> Oh, this sounds a great idea.
> 
>>
>> I have prepared a follow up patch to your v3 to cache the
>> SPI controller devfn instead of the P2SB devfn at fs_initcall()
>> time. I'll post this shortly and I'll also ask the bug reporter
>> to test the combination of our 2 patches.
> 
> Thanks a lot. Looking forward to the results.

Since I already had the follow-up patch to your v3 implementing
this ready I've also send this to the list as RFC.

I hope the patch to only cache 13,0 + 13,2 at fs_initcall()
time works. Then we still avoid all possible deadlock /
lockdep scenarios which would be nice.

Regards,

Hans
Andy Shevchenko March 4, 2024, 2:09 p.m. UTC | #10
On Mon, Mar 04, 2024 at 02:59:57PM +0100, Hans de Goede wrote:
> On 3/4/24 1:13 PM, Shinichiro Kawasaki wrote:
> > On Mar 04, 2024 / 12:04, Hans de Goede wrote:

...

> > Thanks for sharing the insights.
> 
> Looking closer at the actual unhiding I don't think accessing func 0
> is the problem. The unhiding is always done on function 0 even when
> retreiving the bar for function 2 (the SPI function).
> 
> So taking that into account, as mentioned in the bugzilla, I think
> the problem is probing the other functions (1, 3-7) by calling
> pci_scan_single_device() on them.

So, why we can't simply call pci_dev_present() on the function in a loop?

Will be even simpler fix, no?
Hans de Goede March 4, 2024, 2:16 p.m. UTC | #11
Hi Andy,

On 3/4/24 3:09 PM, Andy Shevchenko wrote:
> On Mon, Mar 04, 2024 at 02:59:57PM +0100, Hans de Goede wrote:
>> On 3/4/24 1:13 PM, Shinichiro Kawasaki wrote:
>>> On Mar 04, 2024 / 12:04, Hans de Goede wrote:
> 
> ...
> 
>>> Thanks for sharing the insights.
>>
>> Looking closer at the actual unhiding I don't think accessing func 0
>> is the problem. The unhiding is always done on function 0 even when
>> retreiving the bar for function 2 (the SPI function).
>>
>> So taking that into account, as mentioned in the bugzilla, I think
>> the problem is probing the other functions (1, 3-7) by calling
>> pci_scan_single_device() on them.
> 
> So, why we can't simply call pci_dev_present() on the function in a loop?
> 
> Will be even simpler fix, no?

pci_dev_present takes a set of ids and then looks for those in already
detected devices. That will not work for devices which we have just unhidden.

Also it is unclear what exactly is tripping the hw up. We already have
a separate code-path for Goldmont, on other platforms only the P2SB
itself is scanned instead of all functions on the PCI slot.

This patch makes the Goldmont code closer to the other platforms by
only scanning the one extra function instead of scanning all functions.

As such this patch also mostly removes code :)

Regards,

Hans
Andy Shevchenko March 4, 2024, 2:17 p.m. UTC | #12
On Mon, Mar 04, 2024 at 04:09:49PM +0200, Andy Shevchenko wrote:
> On Mon, Mar 04, 2024 at 02:59:57PM +0100, Hans de Goede wrote:
> > On 3/4/24 1:13 PM, Shinichiro Kawasaki wrote:
> > > On Mar 04, 2024 / 12:04, Hans de Goede wrote:

...

> > > Thanks for sharing the insights.
> > 
> > Looking closer at the actual unhiding I don't think accessing func 0
> > is the problem. The unhiding is always done on function 0 even when
> > retreiving the bar for function 2 (the SPI function).
> > 
> > So taking that into account, as mentioned in the bugzilla, I think
> > the problem is probing the other functions (1, 3-7) by calling
> > pci_scan_single_device() on them.
> 
> So, why we can't simply call pci_dev_present() on the function in a loop?

pci_device_is_present()

> Will be even simpler fix, no?
Andy Shevchenko March 4, 2024, 2:20 p.m. UTC | #13
On Mon, Mar 04, 2024 at 04:17:04PM +0200, Andy Shevchenko wrote:
> On Mon, Mar 04, 2024 at 04:09:49PM +0200, Andy Shevchenko wrote:
> > On Mon, Mar 04, 2024 at 02:59:57PM +0100, Hans de Goede wrote:
> > > On 3/4/24 1:13 PM, Shinichiro Kawasaki wrote:
> > > > On Mar 04, 2024 / 12:04, Hans de Goede wrote:

...

> > > > Thanks for sharing the insights.
> > > 
> > > Looking closer at the actual unhiding I don't think accessing func 0
> > > is the problem. The unhiding is always done on function 0 even when
> > > retreiving the bar for function 2 (the SPI function).
> > > 
> > > So taking that into account, as mentioned in the bugzilla, I think
> > > the problem is probing the other functions (1, 3-7) by calling
> > > pci_scan_single_device() on them.
> > 
> > So, why we can't simply call pci_dev_present() on the function in a loop?
> 
> pci_device_is_present()
> 
> > Will be even simpler fix, no?

Oh, it requires a PCI device structure which is probably is not there until
we call the pci_scan_single_device()...
Hans de Goede March 4, 2024, 2:24 p.m. UTC | #14
Hi,

On 3/4/24 3:17 PM, Andy Shevchenko wrote:
> On Mon, Mar 04, 2024 at 04:09:49PM +0200, Andy Shevchenko wrote:
>> On Mon, Mar 04, 2024 at 02:59:57PM +0100, Hans de Goede wrote:
>>> On 3/4/24 1:13 PM, Shinichiro Kawasaki wrote:
>>>> On Mar 04, 2024 / 12:04, Hans de Goede wrote:
> 
> ...
> 
>>>> Thanks for sharing the insights.
>>>
>>> Looking closer at the actual unhiding I don't think accessing func 0
>>> is the problem. The unhiding is always done on function 0 even when
>>> retreiving the bar for function 2 (the SPI function).
>>>
>>> So taking that into account, as mentioned in the bugzilla, I think
>>> the problem is probing the other functions (1, 3-7) by calling
>>> pci_scan_single_device() on them.
>>
>> So, why we can't simply call pci_dev_present() on the function in a loop?
> 
> pci_device_is_present()

That takes a pci_dev as argument which we only have after calling
pci_scan_single_device() and the calling of pci_scan_single_device()
on some of the other functions is what is suspected of maybe causing
the issue.

Also it is likely that if pci_scan_single_device() is actually
a problem that then what is a problem is running it on an actual
present device. Devices which are not there also should not get confused
by trying to probe them ...

Regards,

Hans
Andy Shevchenko March 4, 2024, 2:35 p.m. UTC | #15
On Mon, Mar 04, 2024 at 03:24:58PM +0100, Hans de Goede wrote:
> On 3/4/24 3:17 PM, Andy Shevchenko wrote:
> > On Mon, Mar 04, 2024 at 04:09:49PM +0200, Andy Shevchenko wrote:
> >> On Mon, Mar 04, 2024 at 02:59:57PM +0100, Hans de Goede wrote:
> >>> On 3/4/24 1:13 PM, Shinichiro Kawasaki wrote:
> >>>> On Mar 04, 2024 / 12:04, Hans de Goede wrote:

...

> >>>> Thanks for sharing the insights.
> >>>
> >>> Looking closer at the actual unhiding I don't think accessing func 0
> >>> is the problem. The unhiding is always done on function 0 even when
> >>> retreiving the bar for function 2 (the SPI function).
> >>>
> >>> So taking that into account, as mentioned in the bugzilla, I think
> >>> the problem is probing the other functions (1, 3-7) by calling
> >>> pci_scan_single_device() on them.
> >>
> >> So, why we can't simply call pci_dev_present() on the function in a loop?
> > 
> > pci_device_is_present()
> 
> That takes a pci_dev as argument which we only have after calling
> pci_scan_single_device() and the calling of pci_scan_single_device()
> on some of the other functions is what is suspected of maybe causing
> the issue.
> 
> Also it is likely that if pci_scan_single_device() is actually
> a problem that then what is a problem is running it on an actual
> present device.

Which is weird. But okay, let's work around first the real issue, then
if I have time I will look into datasheet to see if there is anything
special about P2SB device(s).

> Devices which are not there also should not get confused
> by trying to probe them ...
Hans de Goede March 4, 2024, 2:39 p.m. UTC | #16
Hi,

On 3/4/24 3:35 PM, Andy Shevchenko wrote:
> On Mon, Mar 04, 2024 at 03:24:58PM +0100, Hans de Goede wrote:
>> On 3/4/24 3:17 PM, Andy Shevchenko wrote:
>>> On Mon, Mar 04, 2024 at 04:09:49PM +0200, Andy Shevchenko wrote:
>>>> On Mon, Mar 04, 2024 at 02:59:57PM +0100, Hans de Goede wrote:
>>>>> On 3/4/24 1:13 PM, Shinichiro Kawasaki wrote:
>>>>>> On Mar 04, 2024 / 12:04, Hans de Goede wrote:
> 
> ...
> 
>>>>>> Thanks for sharing the insights.
>>>>>
>>>>> Looking closer at the actual unhiding I don't think accessing func 0
>>>>> is the problem. The unhiding is always done on function 0 even when
>>>>> retreiving the bar for function 2 (the SPI function).
>>>>>
>>>>> So taking that into account, as mentioned in the bugzilla, I think
>>>>> the problem is probing the other functions (1, 3-7) by calling
>>>>> pci_scan_single_device() on them.
>>>>
>>>> So, why we can't simply call pci_dev_present() on the function in a loop?
>>>
>>> pci_device_is_present()
>>
>> That takes a pci_dev as argument which we only have after calling
>> pci_scan_single_device() and the calling of pci_scan_single_device()
>> on some of the other functions is what is suspected of maybe causing
>> the issue.
>>
>> Also it is likely that if pci_scan_single_device() is actually
>> a problem that then what is a problem is running it on an actual
>> present device.
> 
> Which is weird. But okay, let's work around first the real issue, then
> if I have time I will look into datasheet to see if there is anything
> special about P2SB device(s).

Note atm pci_scan_single_device() on one of the "other" being
the problem is still only a theory!

It is also possible that pci_scan_single_device() on the P2SB
itself is somehow an issue on this laptop, since on this laptop
p2sb_bar() is only called for the SPI controller, so before
the caching to avoid the deadlock issue pci_scan_single_device()
was not called on the P2SB devfn itself.

Regards,

Hans
diff mbox series

Patch

diff --git a/drivers/platform/x86/p2sb.c b/drivers/platform/x86/p2sb.c
index 6bd14d0132db..8d206238f63a 100644
--- a/drivers/platform/x86/p2sb.c
+++ b/drivers/platform/x86/p2sb.c
@@ -133,7 +133,7 @@  static struct pci_bus *p2sb_get_bus(struct pci_bus *bus)
 	return p2sb_bus;
 }
 
-static int p2sb_cache_resources(void)
+static int p2sb_cache_resources(bool from_fs_init)
 {
 	unsigned int devfn_p2sb;
 	u32 value = P2SBC_HIDE;
@@ -150,6 +150,15 @@  static int p2sb_cache_resources(void)
 	if (!bus)
 		return -ENODEV;
 
+	/*
+	 * On ASUS VivoBook D540NV-GQ065T which has Goldmont CPU family Pentium
+	 * N4200, P2SB device scan including function 0 at fs_initcall() step
+	 * causes ACPI errors. To avoid the errors, defer P2SB device scan and
+	 * cache when P2SB devices has function 0.
+	 */
+	if (PCI_FUNC(devfn_p2sb) == 0 && from_fs_init)
+		return -EBUSY;
+
 	/*
 	 * When a device with same devfn exists and its device class is not
 	 * PCI_CLASS_MEMORY_OTHER for P2SB, do not touch it.
@@ -214,6 +223,11 @@  int p2sb_bar(struct pci_bus *bus, unsigned int devfn, struct resource *mem)
 	}
 
 	cache = &p2sb_resources[PCI_FUNC(devfn)];
+
+	/* Scan and cache P2SB device if it was deferred at fs_initcall() */
+	if (!p2sb_valid_resource(&cache->res))
+		p2sb_cache_resources(false);
+
 	if (cache->bus_dev_id != bus->dev.id)
 		return -ENODEV;
 
@@ -227,7 +241,7 @@  EXPORT_SYMBOL_GPL(p2sb_bar);
 
 static int __init p2sb_fs_init(void)
 {
-	p2sb_cache_resources();
+	p2sb_cache_resources(true);
 	return 0;
 }