mbox series

[PATCHv3,0/4] drivers/base: bugfix for supplier<-consumer ordering in device_kset

Message ID 1530600642-25090-1-git-send-email-kernelfans@gmail.com (mailing list archive)
Headers show
Series drivers/base: bugfix for supplier<-consumer ordering in device_kset | expand

Message

Pingfan Liu July 3, 2018, 6:50 a.m. UTC
commit 52cdbdd49853 ("driver core: correct device's shutdown order")
places an assumption of supplier<-consumer order on the process of probe.
But it turns out to break down the parent <- child order in some scene.
E.g in pci, a bridge is enabled by pci core, and behind it, the devices
have been probed. Then comes the bridge's module, which enables extra
feature(such as hotplug) on this bridge.
This will break the parent<-children order and cause failure when
"kexec -e" in some scenario.

v2 -> v3:
It is a little hard to impose both "parent<-child" and "supplier<-consumer"
on devices_kset. Hence v3 drops this method, postpones the issue to shutdown
time instead of probing, and utilizes device-tree info during shutdown
instead of the item's seq inside devices_kset.

Pingfan Liu (4):
  drivers/base: fold the routine of device's shutdown into a func
  drivers/base: utilize device tree info to shutdown devices
  drivers/base: clean up the usage of devices_kset_move_last()
  Revert "driver core: correct device's shutdown order"

 drivers/base/base.h    |   1 -
 drivers/base/core.c    | 196 +++++++++++++++++++++++--------------------------
 drivers/base/dd.c      |   8 --
 include/linux/device.h |   1 +
 4 files changed, 92 insertions(+), 114 deletions(-)

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org

Comments

Rafael J. Wysocki July 3, 2018, 2:35 p.m. UTC | #1
On Tuesday, July 3, 2018 8:50:38 AM CEST Pingfan Liu wrote:
> commit 52cdbdd49853 ("driver core: correct device's shutdown order")
> places an assumption of supplier<-consumer order on the process of probe.
> But it turns out to break down the parent <- child order in some scene.
> E.g in pci, a bridge is enabled by pci core, and behind it, the devices
> have been probed. Then comes the bridge's module, which enables extra
> feature(such as hotplug) on this bridge.

So what *exactly* does happen in that case?
Pingfan Liu July 4, 2018, 2:47 a.m. UTC | #2
On Tue, Jul 3, 2018 at 10:36 PM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> On Tuesday, July 3, 2018 8:50:38 AM CEST Pingfan Liu wrote:
> > commit 52cdbdd49853 ("driver core: correct device's shutdown order")
> > places an assumption of supplier<-consumer order on the process of probe.
> > But it turns out to break down the parent <- child order in some scene.
> > E.g in pci, a bridge is enabled by pci core, and behind it, the devices
> > have been probed. Then comes the bridge's module, which enables extra
> > feature(such as hotplug) on this bridge.
>
> So what *exactly* does happen in that case?
>
I saw the  shpc_probe() is called on the bridge, although the probing
failed on that bare-metal. But if it success, then it will enable the
hotplug feature on the bridge.

Thanks,
Pingfan
Rafael J. Wysocki July 4, 2018, 10:21 a.m. UTC | #3
On Wednesday, July 4, 2018 4:47:07 AM CEST Pingfan Liu wrote:
> On Tue, Jul 3, 2018 at 10:36 PM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> > On Tuesday, July 3, 2018 8:50:38 AM CEST Pingfan Liu wrote:
> > > commit 52cdbdd49853 ("driver core: correct device's shutdown order")
> > > places an assumption of supplier<-consumer order on the process of probe.
> > > But it turns out to break down the parent <- child order in some scene.
> > > E.g in pci, a bridge is enabled by pci core, and behind it, the devices
> > > have been probed. Then comes the bridge's module, which enables extra
> > > feature(such as hotplug) on this bridge.
> >
> > So what *exactly* does happen in that case?
> >
> I saw the  shpc_probe() is called on the bridge, although the probing
> failed on that bare-metal. But if it success, then it will enable the
> hotplug feature on the bridge.

I don't understand what you are saying here, sorry.

device_reorder_to_tail() walks the entire device hierarchy below the target
and moves all of the children in there *after* their parents.

How can it break "the parent <- child order" then?

Thanks,
Rafael
Pingfan Liu July 5, 2018, 2:44 a.m. UTC | #4
On Wed, Jul 4, 2018 at 6:23 PM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> On Wednesday, July 4, 2018 4:47:07 AM CEST Pingfan Liu wrote:
> > On Tue, Jul 3, 2018 at 10:36 PM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > >
> > > On Tuesday, July 3, 2018 8:50:38 AM CEST Pingfan Liu wrote:
> > > > commit 52cdbdd49853 ("driver core: correct device's shutdown order")
> > > > places an assumption of supplier<-consumer order on the process of probe.
> > > > But it turns out to break down the parent <- child order in some scene.
> > > > E.g in pci, a bridge is enabled by pci core, and behind it, the devices
> > > > have been probed. Then comes the bridge's module, which enables extra
> > > > feature(such as hotplug) on this bridge.
> > >
> > > So what *exactly* does happen in that case?
> > >
> > I saw the  shpc_probe() is called on the bridge, although the probing
> > failed on that bare-metal. But if it success, then it will enable the
> > hotplug feature on the bridge.
>
> I don't understand what you are saying here, sorry.
>
On the system, I observe the following:
[    2.114986] devices_kset: Moving 0004:00:00.0 to end of list
<---pcie port drive's probe, but it failed
[    2.115192] devices_kset: Moving 0004:01:00.0 to end of list
[    2.115591] devices_kset: Moving 0004:02:02.0 to end of list
[    2.115923] devices_kset: Moving 0004:02:0a.0 to end of list
[    2.116141] devices_kset: Moving 0004:02:0b.0 to end of list
[    2.116358] devices_kset: Moving 0004:02:0c.0 to end of list
[    3.181860] devices_kset: Moving 0004:03:00.0 to end of list
<---the ata disk controller which sits behind the bridge
[   10.267081] devices_kset: Moving 0004:00:00.0 to end of list
 <---shpc_probe() on this bridge, failed too.

As you can the the parent device "0004:00:00.0" is moved twice, and
finally, it is after the "0004:03:00.0", this will break the
"parent<-child" order in devices_kset. This is caused by the code
really_probe()->devices_kset_move_last(). Apparently, it makes
assumption that child device's probing comes after its parent's. But
it does not stand up in the case.

> device_reorder_to_tail() walks the entire device hierarchy below the target
> and moves all of the children in there *after* their parents.
>
As described, the bug is not related with device_reorder_to_tail(), it
is related with really_probe()->devices_kset_move_last(). So [2/4]
uses different method to achieve the "parent<-child" and
"supplier<-consumer" order. The [3/4] clean up some code in
device_reorder_to_tail(), since I need to revert the commit.

> How can it break "the parent <- child order" then?
>
As described, it does not, just not be in use any longer.

Thanks and regards,
Pingfan
Rafael J. Wysocki July 5, 2018, 9:18 a.m. UTC | #5
On Thu, Jul 5, 2018 at 4:44 AM, Pingfan Liu <kernelfans@gmail.com> wrote:
> On Wed, Jul 4, 2018 at 6:23 PM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>>
>> On Wednesday, July 4, 2018 4:47:07 AM CEST Pingfan Liu wrote:
>> > On Tue, Jul 3, 2018 at 10:36 PM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> > >
>> > > On Tuesday, July 3, 2018 8:50:38 AM CEST Pingfan Liu wrote:
>> > > > commit 52cdbdd49853 ("driver core: correct device's shutdown order")
>> > > > places an assumption of supplier<-consumer order on the process of probe.
>> > > > But it turns out to break down the parent <- child order in some scene.
>> > > > E.g in pci, a bridge is enabled by pci core, and behind it, the devices
>> > > > have been probed. Then comes the bridge's module, which enables extra
>> > > > feature(such as hotplug) on this bridge.
>> > >
>> > > So what *exactly* does happen in that case?
>> > >
>> > I saw the  shpc_probe() is called on the bridge, although the probing
>> > failed on that bare-metal. But if it success, then it will enable the
>> > hotplug feature on the bridge.
>>
>> I don't understand what you are saying here, sorry.
>>
> On the system, I observe the following:
> [    2.114986] devices_kset: Moving 0004:00:00.0 to end of list
> <---pcie port drive's probe, but it failed
> [    2.115192] devices_kset: Moving 0004:01:00.0 to end of list
> [    2.115591] devices_kset: Moving 0004:02:02.0 to end of list
> [    2.115923] devices_kset: Moving 0004:02:0a.0 to end of list
> [    2.116141] devices_kset: Moving 0004:02:0b.0 to end of list
> [    2.116358] devices_kset: Moving 0004:02:0c.0 to end of list
> [    3.181860] devices_kset: Moving 0004:03:00.0 to end of list
> <---the ata disk controller which sits behind the bridge
> [   10.267081] devices_kset: Moving 0004:00:00.0 to end of list
>  <---shpc_probe() on this bridge, failed too.
>
> As you can the the parent device "0004:00:00.0" is moved twice, and
> finally, it is after the "0004:03:00.0", this will break the
> "parent<-child" order in devices_kset. This is caused by the code
> really_probe()->devices_kset_move_last(). Apparently, it makes
> assumption that child device's probing comes after its parent's. But
> it does not stand up in the case.
>
>> device_reorder_to_tail() walks the entire device hierarchy below the target
>> and moves all of the children in there *after* their parents.
>>
> As described, the bug is not related with device_reorder_to_tail(), it
> is related with really_probe()->devices_kset_move_last().

OK, so calling devices_kset_move_last() from really_probe() clearly is
a mistake.

I'm not really sure what the intention of it was as the changelog of
commit 52cdbdd49853d doesn't really explain that (why would it be
insufficient without that change?) and I'm quite sure that in the
majority of cases it is unnecessary.

I *think* that it attempted to cover a use case similar to the device
links one, but it should have moved children along with the parent
every time like device_link_add() does.

> So [2/4] uses different method to achieve the "parent<-child" and
> "supplier<-consumer" order. The [3/4] clean up some code in
> device_reorder_to_tail(), since I need to revert the commit.

OK, let me look at that one.

Thanks,
Rafael
Rafael J. Wysocki July 6, 2018, 8:47 a.m. UTC | #6
On Fri, Jul 6, 2018 at 10:36 AM, Lukas Wunner <lukas@wunner.de> wrote:
> [cc += Kishon Vijay Abraham]
>
> On Thu, Jul 05, 2018 at 11:18:28AM +0200, Rafael J. Wysocki wrote:
>> OK, so calling devices_kset_move_last() from really_probe() clearly is
>> a mistake.
>>
>> I'm not really sure what the intention of it was as the changelog of
>> commit 52cdbdd49853d doesn't really explain that (why would it be
>> insufficient without that change?)
>
> It seems 52cdbdd49853d fixed an issue with boards which have an MMC
> whose reset pin needs to be driven high on shutdown, lest the MMC
> won't be found on the next boot.
>
> The boards' devicetrees use a kludge wherein the reset pin is modelled
> as a regulator.  The regulator is enabled when the MMC probes and
> disabled on driver unbind and shutdown.  As a result, the pin is driven
> low on shutdown and the MMC is not found on the next boot.
>
> To fix this, another kludge was invented wherein the GPIO expander
> driving the reset pin unconditionally drives all its pins high on
> shutdown, see pcf857x_shutdown() in drivers/gpio/gpio-pcf857x.c
> (commit adc284755055, "gpio: pcf857x: restore the initial line state
> of all pcf lines").
>
> For this kludge to work, the GPIO expander's ->shutdown hook needs to
> be executed after the MMC expander's ->shutdown hook.
>
> Commit 52cdbdd49853d achieved that by reordering devices_kset according
> to the probe order.  Apparently the MMC probes after the GPIO expander,
> possibly because it returns -EPROBE_DEFER if the vmmc regulator isn't
> available yet, see mmc_regulator_get_supply().
>
> Note, I'm just piecing the information together from git history,
> I'm not responsible for these kludges.  (I'm innocent!)

Sure enough. :-)

In any case, calling devices_kset_move_last() in really_probe() is
plain broken and if its only purpose was to address a single, arguably
kludgy, use case, let's just get rid of it in the first place IMO.

> @Pingfan Liu, if you just remove the call to devices_kset_move_last()
> from really_probe(), does the issue go away?

I would think so from the description of the problem (elsewhere in this thread).

Thanks,
Rafael
Kishon Vijay Abraham I July 6, 2018, 10:02 a.m. UTC | #7
+Grygorii, linux-omap

On Friday 06 July 2018 02:06 PM, Lukas Wunner wrote:
> [cc += Kishon Vijay Abraham]
> 
> On Thu, Jul 05, 2018 at 11:18:28AM +0200, Rafael J. Wysocki wrote:
>> OK, so calling devices_kset_move_last() from really_probe() clearly is
>> a mistake.
>>
>> I'm not really sure what the intention of it was as the changelog of
>> commit 52cdbdd49853d doesn't really explain that (why would it be
>> insufficient without that change?)
> 
> It seems 52cdbdd49853d fixed an issue with boards which have an MMC
> whose reset pin needs to be driven high on shutdown, lest the MMC
> won't be found on the next boot.
> 
> The boards' devicetrees use a kludge wherein the reset pin is modelled
> as a regulator.  The regulator is enabled when the MMC probes and
> disabled on driver unbind and shutdown.  As a result, the pin is driven
> low on shutdown and the MMC is not found on the next boot.
> 
> To fix this, another kludge was invented wherein the GPIO expander
> driving the reset pin unconditionally drives all its pins high on
> shutdown, see pcf857x_shutdown() in drivers/gpio/gpio-pcf857x.c
> (commit adc284755055, "gpio: pcf857x: restore the initial line state
> of all pcf lines").
> 
> For this kludge to work, the GPIO expander's ->shutdown hook needs to
> be executed after the MMC expander's ->shutdown hook.
> 
> Commit 52cdbdd49853d achieved that by reordering devices_kset according
> to the probe order.  Apparently the MMC probes after the GPIO expander,
> possibly because it returns -EPROBE_DEFER if the vmmc regulator isn't
> available yet, see mmc_regulator_get_supply().
> 
> Note, I'm just piecing the information together from git history,
> I'm not responsible for these kludges.  (I'm innocent!)
> 
> @Pingfan Liu, if you just remove the call to devices_kset_move_last()
> from really_probe(), does the issue go away?
> 
> If so, it might be best to remove that call and model the dependency
> with a call to device_link_add() in mmc_regulator_get_supply().

hmm.. had a quick look on this and looks like struct regulator is a regulator
frameworks internal data structure, so its members are not accessible outside.
struct regulator's device pointer is required for device_link_add().

Thanks
Kishon
Pingfan Liu July 6, 2018, 1:52 p.m. UTC | #8
On Fri, Jul 6, 2018 at 4:36 PM Lukas Wunner <lukas@wunner.de> wrote:
>
> [cc += Kishon Vijay Abraham]
>
> On Thu, Jul 05, 2018 at 11:18:28AM +0200, Rafael J. Wysocki wrote:
> > OK, so calling devices_kset_move_last() from really_probe() clearly is
> > a mistake.
> >
> > I'm not really sure what the intention of it was as the changelog of
> > commit 52cdbdd49853d doesn't really explain that (why would it be
> > insufficient without that change?)
>
> It seems 52cdbdd49853d fixed an issue with boards which have an MMC
> whose reset pin needs to be driven high on shutdown, lest the MMC
> won't be found on the next boot.
>
> The boards' devicetrees use a kludge wherein the reset pin is modelled
> as a regulator.  The regulator is enabled when the MMC probes and
> disabled on driver unbind and shutdown.  As a result, the pin is driven
> low on shutdown and the MMC is not found on the next boot.
>
> To fix this, another kludge was invented wherein the GPIO expander
> driving the reset pin unconditionally drives all its pins high on
> shutdown, see pcf857x_shutdown() in drivers/gpio/gpio-pcf857x.c
> (commit adc284755055, "gpio: pcf857x: restore the initial line state
> of all pcf lines").
>
> For this kludge to work, the GPIO expander's ->shutdown hook needs to
> be executed after the MMC expander's ->shutdown hook.
>
> Commit 52cdbdd49853d achieved that by reordering devices_kset according
> to the probe order.  Apparently the MMC probes after the GPIO expander,
> possibly because it returns -EPROBE_DEFER if the vmmc regulator isn't
> available yet, see mmc_regulator_get_supply().
>
> Note, I'm just piecing the information together from git history,
> I'm not responsible for these kludges.  (I'm innocent!)
>
Thanks for your exploration, very clearly. I had tried, but failed
since these commits are contributed with different authors. I am not
familiar with ARM and dts, so had thought
really_probe()->devices_kset_move_last() is used to address a very
popular "supplier<-consumer" order issue in smart phone, based on the
configuration hard-coded in "bios(or counterpart in ARM).

> @Pingfan Liu, if you just remove the call to devices_kset_move_last()
> from really_probe(), does the issue go away?
>
Yes, it goes away.

> If so, it might be best to remove that call and model the dependency
> with a call to device_link_add() in mmc_regulator_get_supply().
> Another idea would be to automatically add a device_link in the
> driver core if -EPROBE_DEFER is returned.
>
Maybe the first one is better, as it is already used by other drivers.

Thanks,
Pingfan
Pingfan Liu July 6, 2018, 1:55 p.m. UTC | #9
On Fri, Jul 6, 2018 at 4:47 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Fri, Jul 6, 2018 at 10:36 AM, Lukas Wunner <lukas@wunner.de> wrote:
> > [cc += Kishon Vijay Abraham]
> >
> > On Thu, Jul 05, 2018 at 11:18:28AM +0200, Rafael J. Wysocki wrote:
> >> OK, so calling devices_kset_move_last() from really_probe() clearly is
> >> a mistake.
> >>
> >> I'm not really sure what the intention of it was as the changelog of
> >> commit 52cdbdd49853d doesn't really explain that (why would it be
> >> insufficient without that change?)
> >
> > It seems 52cdbdd49853d fixed an issue with boards which have an MMC
> > whose reset pin needs to be driven high on shutdown, lest the MMC
> > won't be found on the next boot.
> >
> > The boards' devicetrees use a kludge wherein the reset pin is modelled
> > as a regulator.  The regulator is enabled when the MMC probes and
> > disabled on driver unbind and shutdown.  As a result, the pin is driven
> > low on shutdown and the MMC is not found on the next boot.
> >
> > To fix this, another kludge was invented wherein the GPIO expander
> > driving the reset pin unconditionally drives all its pins high on
> > shutdown, see pcf857x_shutdown() in drivers/gpio/gpio-pcf857x.c
> > (commit adc284755055, "gpio: pcf857x: restore the initial line state
> > of all pcf lines").
> >
> > For this kludge to work, the GPIO expander's ->shutdown hook needs to
> > be executed after the MMC expander's ->shutdown hook.
> >
> > Commit 52cdbdd49853d achieved that by reordering devices_kset according
> > to the probe order.  Apparently the MMC probes after the GPIO expander,
> > possibly because it returns -EPROBE_DEFER if the vmmc regulator isn't
> > available yet, see mmc_regulator_get_supply().
> >
> > Note, I'm just piecing the information together from git history,
> > I'm not responsible for these kludges.  (I'm innocent!)
>
> Sure enough. :-)
>
> In any case, calling devices_kset_move_last() in really_probe() is
> plain broken and if its only purpose was to address a single, arguably
> kludgy, use case, let's just get rid of it in the first place IMO.
>
Yes, if it is only used for a single use case.

> > @Pingfan Liu, if you just remove the call to devices_kset_move_last()
> > from really_probe(), does the issue go away?
>
> I would think so from the description of the problem (elsewhere in this thread).
>
Yes.

Thanks,
Pingfan
Pingfan Liu July 7, 2018, 4:24 a.m. UTC | #10
On Fri, Jul 6, 2018 at 9:55 PM Pingfan Liu <kernelfans@gmail.com> wrote:
>
> On Fri, Jul 6, 2018 at 4:47 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Fri, Jul 6, 2018 at 10:36 AM, Lukas Wunner <lukas@wunner.de> wrote:
> > > [cc += Kishon Vijay Abraham]
> > >
> > > On Thu, Jul 05, 2018 at 11:18:28AM +0200, Rafael J. Wysocki wrote:
> > >> OK, so calling devices_kset_move_last() from really_probe() clearly is
> > >> a mistake.
> > >>
> > >> I'm not really sure what the intention of it was as the changelog of
> > >> commit 52cdbdd49853d doesn't really explain that (why would it be
> > >> insufficient without that change?)
> > >
> > > It seems 52cdbdd49853d fixed an issue with boards which have an MMC
> > > whose reset pin needs to be driven high on shutdown, lest the MMC
> > > won't be found on the next boot.
> > >
> > > The boards' devicetrees use a kludge wherein the reset pin is modelled
> > > as a regulator.  The regulator is enabled when the MMC probes and
> > > disabled on driver unbind and shutdown.  As a result, the pin is driven
> > > low on shutdown and the MMC is not found on the next boot.
> > >
> > > To fix this, another kludge was invented wherein the GPIO expander
> > > driving the reset pin unconditionally drives all its pins high on
> > > shutdown, see pcf857x_shutdown() in drivers/gpio/gpio-pcf857x.c
> > > (commit adc284755055, "gpio: pcf857x: restore the initial line state
> > > of all pcf lines").
> > >
> > > For this kludge to work, the GPIO expander's ->shutdown hook needs to
> > > be executed after the MMC expander's ->shutdown hook.
> > >
> > > Commit 52cdbdd49853d achieved that by reordering devices_kset according
> > > to the probe order.  Apparently the MMC probes after the GPIO expander,
> > > possibly because it returns -EPROBE_DEFER if the vmmc regulator isn't
> > > available yet, see mmc_regulator_get_supply().
> > >
> > > Note, I'm just piecing the information together from git history,
> > > I'm not responsible for these kludges.  (I'm innocent!)
> >
> > Sure enough. :-)
> >
> > In any case, calling devices_kset_move_last() in really_probe() is
> > plain broken and if its only purpose was to address a single, arguably
> > kludgy, use case, let's just get rid of it in the first place IMO.
> >
> Yes, if it is only used for a single use case.
>
Think it again, I saw other potential issue with the current code.
device_link_add->device_reorder_to_tail() can break the
"supplier<-consumer" order. During moving children after parent's
supplier, it ignores the order of child's consumer.  Beside this,
essentially both devices_kset_move_after/_before() and
device_pm_move_after/_before() expose  the shutdown order to the
indirect caller,  and we can not expect that the caller can not handle
it correctly. It should be a job of drivers core.  It is hard to
extract high dimension info and pack them into one dimension
linked-list. And in theory, it is warranted that the shutdown seq is
correct by using device tree info. More important, it is cheap with
the data structure in hand. So I think it is time to resolve the issue
once for all.

Thanks and regards,
Pingfan
Rafael J. Wysocki July 8, 2018, 8:25 a.m. UTC | #11
On Sat, Jul 7, 2018 at 6:24 AM, Pingfan Liu <kernelfans@gmail.com> wrote:
> On Fri, Jul 6, 2018 at 9:55 PM Pingfan Liu <kernelfans@gmail.com> wrote:
>>
>> On Fri, Jul 6, 2018 at 4:47 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>> >
>> > On Fri, Jul 6, 2018 at 10:36 AM, Lukas Wunner <lukas@wunner.de> wrote:
>> > > [cc += Kishon Vijay Abraham]
>> > >
>> > > On Thu, Jul 05, 2018 at 11:18:28AM +0200, Rafael J. Wysocki wrote:
>> > >> OK, so calling devices_kset_move_last() from really_probe() clearly is
>> > >> a mistake.
>> > >>
>> > >> I'm not really sure what the intention of it was as the changelog of
>> > >> commit 52cdbdd49853d doesn't really explain that (why would it be
>> > >> insufficient without that change?)
>> > >
>> > > It seems 52cdbdd49853d fixed an issue with boards which have an MMC
>> > > whose reset pin needs to be driven high on shutdown, lest the MMC
>> > > won't be found on the next boot.
>> > >
>> > > The boards' devicetrees use a kludge wherein the reset pin is modelled
>> > > as a regulator.  The regulator is enabled when the MMC probes and
>> > > disabled on driver unbind and shutdown.  As a result, the pin is driven
>> > > low on shutdown and the MMC is not found on the next boot.
>> > >
>> > > To fix this, another kludge was invented wherein the GPIO expander
>> > > driving the reset pin unconditionally drives all its pins high on
>> > > shutdown, see pcf857x_shutdown() in drivers/gpio/gpio-pcf857x.c
>> > > (commit adc284755055, "gpio: pcf857x: restore the initial line state
>> > > of all pcf lines").
>> > >
>> > > For this kludge to work, the GPIO expander's ->shutdown hook needs to
>> > > be executed after the MMC expander's ->shutdown hook.
>> > >
>> > > Commit 52cdbdd49853d achieved that by reordering devices_kset according
>> > > to the probe order.  Apparently the MMC probes after the GPIO expander,
>> > > possibly because it returns -EPROBE_DEFER if the vmmc regulator isn't
>> > > available yet, see mmc_regulator_get_supply().
>> > >
>> > > Note, I'm just piecing the information together from git history,
>> > > I'm not responsible for these kludges.  (I'm innocent!)
>> >
>> > Sure enough. :-)
>> >
>> > In any case, calling devices_kset_move_last() in really_probe() is
>> > plain broken and if its only purpose was to address a single, arguably
>> > kludgy, use case, let's just get rid of it in the first place IMO.
>> >
>> Yes, if it is only used for a single use case.
>>
> Think it again, I saw other potential issue with the current code.
> device_link_add->device_reorder_to_tail() can break the
> "supplier<-consumer" order. During moving children after parent's
> supplier, it ignores the order of child's consumer.

What do you mean?

> Beside this, essentially both devices_kset_move_after/_before() and
> device_pm_move_after/_before() expose  the shutdown order to the
> indirect caller,  and we can not expect that the caller can not handle
> it correctly. It should be a job of drivers core.

Arguably so, but that's how those functions were designed and the
callers should be aware of the limitation.

If they aren't, there is a bug in the caller.

> It is hard to extract high dimension info and pack them into one dimension
> linked-list.

Well, yes and no.

We know it for a fact that there is a linear ordering that will work.
It is inefficient to figure it out every time during system suspend
and resume, for one and that's why we have dpm_list.

Now, if we have it for suspend and resume, it can also be used for shutdown.

> And in theory, it is warranted that the shutdown seq is
> correct by using device tree info. More important, it is cheap with
> the data structure in hand. So I think it is time to resolve the issue
> once for all.

Not the way you want to do that, though.

Thanks,
Rafael
Pingfan Liu July 9, 2018, 6:48 a.m. UTC | #12
On Sun, Jul 8, 2018 at 4:25 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Sat, Jul 7, 2018 at 6:24 AM, Pingfan Liu <kernelfans@gmail.com> wrote:
> > On Fri, Jul 6, 2018 at 9:55 PM Pingfan Liu <kernelfans@gmail.com> wrote:
> >>
> >> On Fri, Jul 6, 2018 at 4:47 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >> >
> >> > On Fri, Jul 6, 2018 at 10:36 AM, Lukas Wunner <lukas@wunner.de> wrote:
> >> > > [cc += Kishon Vijay Abraham]
> >> > >
> >> > > On Thu, Jul 05, 2018 at 11:18:28AM +0200, Rafael J. Wysocki wrote:
> >> > >> OK, so calling devices_kset_move_last() from really_probe() clearly is
> >> > >> a mistake.
> >> > >>
> >> > >> I'm not really sure what the intention of it was as the changelog of
> >> > >> commit 52cdbdd49853d doesn't really explain that (why would it be
> >> > >> insufficient without that change?)
> >> > >
> >> > > It seems 52cdbdd49853d fixed an issue with boards which have an MMC
> >> > > whose reset pin needs to be driven high on shutdown, lest the MMC
> >> > > won't be found on the next boot.
> >> > >
> >> > > The boards' devicetrees use a kludge wherein the reset pin is modelled
> >> > > as a regulator.  The regulator is enabled when the MMC probes and
> >> > > disabled on driver unbind and shutdown.  As a result, the pin is driven
> >> > > low on shutdown and the MMC is not found on the next boot.
> >> > >
> >> > > To fix this, another kludge was invented wherein the GPIO expander
> >> > > driving the reset pin unconditionally drives all its pins high on
> >> > > shutdown, see pcf857x_shutdown() in drivers/gpio/gpio-pcf857x.c
> >> > > (commit adc284755055, "gpio: pcf857x: restore the initial line state
> >> > > of all pcf lines").
> >> > >
> >> > > For this kludge to work, the GPIO expander's ->shutdown hook needs to
> >> > > be executed after the MMC expander's ->shutdown hook.
> >> > >
> >> > > Commit 52cdbdd49853d achieved that by reordering devices_kset according
> >> > > to the probe order.  Apparently the MMC probes after the GPIO expander,
> >> > > possibly because it returns -EPROBE_DEFER if the vmmc regulator isn't
> >> > > available yet, see mmc_regulator_get_supply().
> >> > >
> >> > > Note, I'm just piecing the information together from git history,
> >> > > I'm not responsible for these kludges.  (I'm innocent!)
> >> >
> >> > Sure enough. :-)
> >> >
> >> > In any case, calling devices_kset_move_last() in really_probe() is
> >> > plain broken and if its only purpose was to address a single, arguably
> >> > kludgy, use case, let's just get rid of it in the first place IMO.
> >> >
> >> Yes, if it is only used for a single use case.
> >>
> > Think it again, I saw other potential issue with the current code.
> > device_link_add->device_reorder_to_tail() can break the
> > "supplier<-consumer" order. During moving children after parent's
> > supplier, it ignores the order of child's consumer.
>
> What do you mean?
>
The drivers use device_link_add() to build "supplier<-consumer" order
without knowing each other. Hence there is the following potential
odds: (consumerX, child_a, ...) (consumer_a,..) (supplierX), where
consumer_a consumes child_a. When
device_link_add()->device_reorder_to_tail() moves all descendant of
consumerX to the tail, it breaks the "supplier<-consumer" order by
"consumer_a <- child_a". And we need recrusion to resolve the item in
(consumer_a,..), each time when moving a consumer behind its supplier,
we may break "parent<-child".

> > Beside this, essentially both devices_kset_move_after/_before() and
> > device_pm_move_after/_before() expose  the shutdown order to the
> > indirect caller,  and we can not expect that the caller can not handle
> > it correctly. It should be a job of drivers core.
>
> Arguably so, but that's how those functions were designed and the
> callers should be aware of the limitation.
>
> If they aren't, there is a bug in the caller.
>
If we consider device_move()-> device_pm_move_after/_before() more
carefully like the above description, then we can hide the detail from
caller. And keep the info of the pm order inside the core.

> > It is hard to extract high dimension info and pack them into one dimension
> > linked-list.
>
> Well, yes and no.
>
For "hard", I means that we need two interleaved recursion to make the
order correct. Otherwise, I think it is a bug or limitation.

> We know it for a fact that there is a linear ordering that will work.
> It is inefficient to figure it out every time during system suspend
> and resume, for one and that's why we have dpm_list.
>
Yeah, I agree that iterating over device tree may hurt performance. I
guess the iterating will not cost the majority of the suspend time,
comparing to the device_suspend(), which causes hardware's sync. But
data is more persuasive. Besides the performance, do you have other
concern till now?

> Now, if we have it for suspend and resume, it can also be used for shutdown.
>
Yes, I do think so.

Thanks and regards,
Pingfan
Rafael J. Wysocki July 9, 2018, 7:48 a.m. UTC | #13
On Mon, Jul 9, 2018 at 8:48 AM, Pingfan Liu <kernelfans@gmail.com> wrote:
> On Sun, Jul 8, 2018 at 4:25 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>
>> On Sat, Jul 7, 2018 at 6:24 AM, Pingfan Liu <kernelfans@gmail.com> wrote:
>> > On Fri, Jul 6, 2018 at 9:55 PM Pingfan Liu <kernelfans@gmail.com> wrote:
>> >>
>> >> On Fri, Jul 6, 2018 at 4:47 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>> >> >
>> >> > On Fri, Jul 6, 2018 at 10:36 AM, Lukas Wunner <lukas@wunner.de> wrote:
>> >> > > [cc += Kishon Vijay Abraham]
>> >> > >
>> >> > > On Thu, Jul 05, 2018 at 11:18:28AM +0200, Rafael J. Wysocki wrote:
>> >> > >> OK, so calling devices_kset_move_last() from really_probe() clearly is
>> >> > >> a mistake.
>> >> > >>
>> >> > >> I'm not really sure what the intention of it was as the changelog of
>> >> > >> commit 52cdbdd49853d doesn't really explain that (why would it be
>> >> > >> insufficient without that change?)
>> >> > >
>> >> > > It seems 52cdbdd49853d fixed an issue with boards which have an MMC
>> >> > > whose reset pin needs to be driven high on shutdown, lest the MMC
>> >> > > won't be found on the next boot.
>> >> > >
>> >> > > The boards' devicetrees use a kludge wherein the reset pin is modelled
>> >> > > as a regulator.  The regulator is enabled when the MMC probes and
>> >> > > disabled on driver unbind and shutdown.  As a result, the pin is driven
>> >> > > low on shutdown and the MMC is not found on the next boot.
>> >> > >
>> >> > > To fix this, another kludge was invented wherein the GPIO expander
>> >> > > driving the reset pin unconditionally drives all its pins high on
>> >> > > shutdown, see pcf857x_shutdown() in drivers/gpio/gpio-pcf857x.c
>> >> > > (commit adc284755055, "gpio: pcf857x: restore the initial line state
>> >> > > of all pcf lines").
>> >> > >
>> >> > > For this kludge to work, the GPIO expander's ->shutdown hook needs to
>> >> > > be executed after the MMC expander's ->shutdown hook.
>> >> > >
>> >> > > Commit 52cdbdd49853d achieved that by reordering devices_kset according
>> >> > > to the probe order.  Apparently the MMC probes after the GPIO expander,
>> >> > > possibly because it returns -EPROBE_DEFER if the vmmc regulator isn't
>> >> > > available yet, see mmc_regulator_get_supply().
>> >> > >
>> >> > > Note, I'm just piecing the information together from git history,
>> >> > > I'm not responsible for these kludges.  (I'm innocent!)
>> >> >
>> >> > Sure enough. :-)
>> >> >
>> >> > In any case, calling devices_kset_move_last() in really_probe() is
>> >> > plain broken and if its only purpose was to address a single, arguably
>> >> > kludgy, use case, let's just get rid of it in the first place IMO.
>> >> >
>> >> Yes, if it is only used for a single use case.
>> >>
>> > Think it again, I saw other potential issue with the current code.
>> > device_link_add->device_reorder_to_tail() can break the
>> > "supplier<-consumer" order. During moving children after parent's
>> > supplier, it ignores the order of child's consumer.
>>
>> What do you mean?
>>
> The drivers use device_link_add() to build "supplier<-consumer" order
> without knowing each other. Hence there is the following potential
> odds: (consumerX, child_a, ...) (consumer_a,..) (supplierX), where
> consumer_a consumes child_a.

Well, what's the initial state of the list?

> When device_link_add()->device_reorder_to_tail() moves all descendant of
> consumerX to the tail, it breaks the "supplier<-consumer" order by
> "consumer_a <- child_a".

That depends on what the initial ordering of the list is and please
note that circular dependencies are explicitly assumed to be not
present.

The assumption is that the initial ordering of the list reflects the
correct suspend (or shutdown) order without the new link.  Therefore
initially all children are located after their parents and all known
consumers are located after their suppliers.

If a new link is added, the new consumer goes to the end of the list
and all of its children and all of its consumers go after it.
device_reorder_to_tail() is recursive, so for each of the devices that
went to the end of the list, all of its children and all of its
consumers go after it and so on.

Now, that operation doesn't change the order of any of the
parent<-child or supplier<-consumer pairs that get moved and since all
of the devices that depend on any device that get moved go to the end
of list after it, the only devices that don't go to the end of list
are guaranteed to not depend on any of them (they may be parents or
suppliers of the devices that go to the end of the list, but not their
children or suppliers).

> And we need recrusion to resolve the item in
> (consumer_a,..), each time when moving a consumer behind its supplier,
> we may break "parent<-child".

I don't see this as per the above.

Say, device_reorder_to_tail() moves a parent after its child.  This
means that device_reorder_to_tail() was not called for the child after
it had been called for the parent, but that is not true, because it is
called for all of the children of each device that gets moved *after*
moving that device.

>> > Beside this, essentially both devices_kset_move_after/_before() and
>> > device_pm_move_after/_before() expose  the shutdown order to the
>> > indirect caller,  and we can not expect that the caller can not handle
>> > it correctly. It should be a job of drivers core.
>>
>> Arguably so, but that's how those functions were designed and the
>> callers should be aware of the limitation.
>>
>> If they aren't, there is a bug in the caller.
>>
> If we consider device_move()-> device_pm_move_after/_before() more
> carefully like the above description, then we can hide the detail from
> caller. And keep the info of the pm order inside the core.

Yes, we can.

My point is that we have not been doing that so far and the current
callers of those routines are expected to know that.

We can do that to make the life of *future* callers easier (and maybe
to simplify the current ones), but currently the caller is expected to
do the right thing.

>> > It is hard to extract high dimension info and pack them into one dimension
>> > linked-list.
>>
>> Well, yes and no.
>>
> For "hard", I means that we need two interleaved recursion to make the
> order correct. Otherwise, I think it is a bug or limitation.

So the limitation is that circular dependencies may not exist, because
if they did, there would be no suitable suspend/shutdown ordering
between devices.

>> We know it for a fact that there is a linear ordering that will work.
>> It is inefficient to figure it out every time during system suspend
>> and resume, for one and that's why we have dpm_list.
>>
> Yeah, I agree that iterating over device tree may hurt performance. I
> guess the iterating will not cost the majority of the suspend time,
> comparing to the device_suspend(), which causes hardware's sync. But
> data is more persuasive. Besides the performance, do you have other
> concern till now?

I simply think that there should be one way to iterate over devices
for both system-wide PM and shutdown.

The reason why it is not like that today is because of the development
history, but if it doesn't work and we want to fix it, let's just
consolidate all of that.

Now, system-wide suspend resume sometimes iterates the list in the
reverse order which would be hard without having a list, wouldn't it?

>> Now, if we have it for suspend and resume, it can also be used for shutdown.
>>
> Yes, I do think so.

OK

Thanks,
Rafael
Pingfan Liu July 9, 2018, 8:40 a.m. UTC | #14
On Mon, Jul 9, 2018 at 3:48 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Mon, Jul 9, 2018 at 8:48 AM, Pingfan Liu <kernelfans@gmail.com> wrote:
> > On Sun, Jul 8, 2018 at 4:25 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >>
> >> On Sat, Jul 7, 2018 at 6:24 AM, Pingfan Liu <kernelfans@gmail.com> wrote:
> >> > On Fri, Jul 6, 2018 at 9:55 PM Pingfan Liu <kernelfans@gmail.com> wrote:
> >> >>
> >> >> On Fri, Jul 6, 2018 at 4:47 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >> >> >
> >> >> > On Fri, Jul 6, 2018 at 10:36 AM, Lukas Wunner <lukas@wunner.de> wrote:
> >> >> > > [cc += Kishon Vijay Abraham]
> >> >> > >
> >> >> > > On Thu, Jul 05, 2018 at 11:18:28AM +0200, Rafael J. Wysocki wrote:
> >> >> > >> OK, so calling devices_kset_move_last() from really_probe() clearly is
> >> >> > >> a mistake.
> >> >> > >>
> >> >> > >> I'm not really sure what the intention of it was as the changelog of
> >> >> > >> commit 52cdbdd49853d doesn't really explain that (why would it be
> >> >> > >> insufficient without that change?)
> >> >> > >
> >> >> > > It seems 52cdbdd49853d fixed an issue with boards which have an MMC
> >> >> > > whose reset pin needs to be driven high on shutdown, lest the MMC
> >> >> > > won't be found on the next boot.
> >> >> > >
> >> >> > > The boards' devicetrees use a kludge wherein the reset pin is modelled
> >> >> > > as a regulator.  The regulator is enabled when the MMC probes and
> >> >> > > disabled on driver unbind and shutdown.  As a result, the pin is driven
> >> >> > > low on shutdown and the MMC is not found on the next boot.
> >> >> > >
> >> >> > > To fix this, another kludge was invented wherein the GPIO expander
> >> >> > > driving the reset pin unconditionally drives all its pins high on
> >> >> > > shutdown, see pcf857x_shutdown() in drivers/gpio/gpio-pcf857x.c
> >> >> > > (commit adc284755055, "gpio: pcf857x: restore the initial line state
> >> >> > > of all pcf lines").
> >> >> > >
> >> >> > > For this kludge to work, the GPIO expander's ->shutdown hook needs to
> >> >> > > be executed after the MMC expander's ->shutdown hook.
> >> >> > >
> >> >> > > Commit 52cdbdd49853d achieved that by reordering devices_kset according
> >> >> > > to the probe order.  Apparently the MMC probes after the GPIO expander,
> >> >> > > possibly because it returns -EPROBE_DEFER if the vmmc regulator isn't
> >> >> > > available yet, see mmc_regulator_get_supply().
> >> >> > >
> >> >> > > Note, I'm just piecing the information together from git history,
> >> >> > > I'm not responsible for these kludges.  (I'm innocent!)
> >> >> >
> >> >> > Sure enough. :-)
> >> >> >
> >> >> > In any case, calling devices_kset_move_last() in really_probe() is
> >> >> > plain broken and if its only purpose was to address a single, arguably
> >> >> > kludgy, use case, let's just get rid of it in the first place IMO.
> >> >> >
> >> >> Yes, if it is only used for a single use case.
> >> >>
> >> > Think it again, I saw other potential issue with the current code.
> >> > device_link_add->device_reorder_to_tail() can break the
> >> > "supplier<-consumer" order. During moving children after parent's
> >> > supplier, it ignores the order of child's consumer.
> >>
> >> What do you mean?
> >>
> > The drivers use device_link_add() to build "supplier<-consumer" order
> > without knowing each other. Hence there is the following potential
> > odds: (consumerX, child_a, ...) (consumer_a,..) (supplierX), where
> > consumer_a consumes child_a.
>
> Well, what's the initial state of the list?
>
> > When device_link_add()->device_reorder_to_tail() moves all descendant of
> > consumerX to the tail, it breaks the "supplier<-consumer" order by
> > "consumer_a <- child_a".
>
> That depends on what the initial ordering of the list is and please
> note that circular dependencies are explicitly assumed to be not
> present.
>
> The assumption is that the initial ordering of the list reflects the
> correct suspend (or shutdown) order without the new link.  Therefore
> initially all children are located after their parents and all known
> consumers are located after their suppliers.
>
> If a new link is added, the new consumer goes to the end of the list
> and all of its children and all of its consumers go after it.
> device_reorder_to_tail() is recursive, so for each of the devices that
> went to the end of the list, all of its children and all of its
> consumers go after it and so on.
>
> Now, that operation doesn't change the order of any of the
> parent<-child or supplier<-consumer pairs that get moved and since all
> of the devices that depend on any device that get moved go to the end
> of list after it, the only devices that don't go to the end of list
> are guaranteed to not depend on any of them (they may be parents or
> suppliers of the devices that go to the end of the list, but not their
> children or suppliers).
>
Thanks for the detailed explain. It is clear now, and you are right.

> > And we need recrusion to resolve the item in
> > (consumer_a,..), each time when moving a consumer behind its supplier,
> > we may break "parent<-child".
>
> I don't see this as per the above.
>
> Say, device_reorder_to_tail() moves a parent after its child.  This
> means that device_reorder_to_tail() was not called for the child after
> it had been called for the parent, but that is not true, because it is
> called for all of the children of each device that gets moved *after*
> moving that device.
>
Yes, you are right.

> >> > Beside this, essentially both devices_kset_move_after/_before() and
> >> > device_pm_move_after/_before() expose  the shutdown order to the
> >> > indirect caller,  and we can not expect that the caller can not handle
> >> > it correctly. It should be a job of drivers core.
> >>
> >> Arguably so, but that's how those functions were designed and the
> >> callers should be aware of the limitation.
> >>
> >> If they aren't, there is a bug in the caller.
> >>
> > If we consider device_move()-> device_pm_move_after/_before() more
> > carefully like the above description, then we can hide the detail from
> > caller. And keep the info of the pm order inside the core.
>
> Yes, we can.
>
> My point is that we have not been doing that so far and the current
> callers of those routines are expected to know that.
>
> We can do that to make the life of *future* callers easier (and maybe
> to simplify the current ones), but currently the caller is expected to
> do the right thing.
>
OK, I get your point.

> >> > It is hard to extract high dimension info and pack them into one dimension
> >> > linked-list.
> >>
> >> Well, yes and no.
> >>
> > For "hard", I means that we need two interleaved recursion to make the
> > order correct. Otherwise, I think it is a bug or limitation.
>
> So the limitation is that circular dependencies may not exist, because
> if they did, there would be no suitable suspend/shutdown ordering
> between devices.
>
Yes.

> >> We know it for a fact that there is a linear ordering that will work.
> >> It is inefficient to figure it out every time during system suspend
> >> and resume, for one and that's why we have dpm_list.
> >>
> > Yeah, I agree that iterating over device tree may hurt performance. I
> > guess the iterating will not cost the majority of the suspend time,
> > comparing to the device_suspend(), which causes hardware's sync. But
> > data is more persuasive. Besides the performance, do you have other
> > concern till now?
>
> I simply think that there should be one way to iterate over devices
> for both system-wide PM and shutdown.
>
> The reason why it is not like that today is because of the development
> history, but if it doesn't work and we want to fix it, let's just
> consolidate all of that.
>
> Now, system-wide suspend resume sometimes iterates the list in the
> reverse order which would be hard without having a list, wouldn't it?
>
Yes, it would be hard without having a list. I just thought to use
device tree info to build up a shadowed list, and rebuild the list
until there is new device_link_add() operation. For
device_add/_remove(), it can modify the shadowed list directly.

Thanks,
Pingfan
Rafael J. Wysocki July 9, 2018, 8:58 a.m. UTC | #15
On Mon, Jul 9, 2018 at 10:40 AM, Pingfan Liu <kernelfans@gmail.com> wrote:
> On Mon, Jul 9, 2018 at 3:48 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>
>> On Mon, Jul 9, 2018 at 8:48 AM, Pingfan Liu <kernelfans@gmail.com> wrote:
>> > On Sun, Jul 8, 2018 at 4:25 PM Rafael J. Wysocki <rafael@kernel.org> wrote:

[cut]

>>
>> I simply think that there should be one way to iterate over devices
>> for both system-wide PM and shutdown.
>>
>> The reason why it is not like that today is because of the development
>> history, but if it doesn't work and we want to fix it, let's just
>> consolidate all of that.
>>
>> Now, system-wide suspend resume sometimes iterates the list in the
>> reverse order which would be hard without having a list, wouldn't it?
>>
> Yes, it would be hard without having a list. I just thought to use
> device tree info to build up a shadowed list, and rebuild the list
> until there is new device_link_add() operation. For
> device_add/_remove(), it can modify the shadowed list directly.

Right, and that's the idea of dpm_list, generally speaking: It
represents one of the (possibly many) orders in which devices can be
suspended (or shut down) based on the information coming from the
device hierarchy and device links.

So it appears straightforward (even though it may be complicated
because of the build-time dependencies) to start using dpm_list for
shutdown too - and to ensure that it is properly maintained
everywhere.

Thanks,
Rafael