diff mbox series

[v2,1/6] PM / core: Add LEAVE_SUSPENDED driver flag

Message ID 2932219.eEdIVn6WWf@aspire.rjw.lan
State Not Applicable
Headers show
Series PM / sleep: Driver flags for system suspend/resume (part 2) | expand

Commit Message

Rafael J. Wysocki Nov. 8, 2017, 1:25 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Define and document a new driver flag, DPM_FLAG_LEAVE_SUSPENDED, to
instruct the PM core and middle-layer (bus type, PM domain, etc.)
code that it is desirable to leave the device in runtime suspend
after system-wide transitions to the working state (for example,
the device may be slow to resume and it may be better to avoid
resuming it right away).

Generally, the middle-layer code involved in the handling of the
device is expected to indicate to the PM core whether or not the
device may be left in suspend with the help of the device's
power.may_skip_resume status bit.  That has to happen in the "noirq"
phase of the preceding system suspend (or analogous) transition.
The middle layer is then responsible for handling the device as
appropriate in its "noirq" resume callback which is executed
regardless of whether or not the device may be left suspended, but
the other resume callbacks (except for ->complete) will be skipped
automatically by the core if the device really can be left in
suspend.

The additional power.must_resume status bit introduced for the
implementation of this mechanisn is used internally by the PM core
to track the requirement to resume the device (which may depend on
its children etc).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 Documentation/driver-api/pm/devices.rst |   24 ++++++++++-
 drivers/base/power/main.c               |   65 +++++++++++++++++++++++++++++---
 include/linux/pm.h                      |   14 +++++-
 3 files changed, 93 insertions(+), 10 deletions(-)

Comments

Ulf Hansson Nov. 10, 2017, 9:09 a.m. UTC | #1
On 8 November 2017 at 14:25, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Define and document a new driver flag, DPM_FLAG_LEAVE_SUSPENDED, to
> instruct the PM core and middle-layer (bus type, PM domain, etc.)
> code that it is desirable to leave the device in runtime suspend
> after system-wide transitions to the working state (for example,
> the device may be slow to resume and it may be better to avoid
> resuming it right away).
>
> Generally, the middle-layer code involved in the handling of the
> device is expected to indicate to the PM core whether or not the
> device may be left in suspend with the help of the device's
> power.may_skip_resume status bit.  That has to happen in the "noirq"
> phase of the preceding system suspend (or analogous) transition.
> The middle layer is then responsible for handling the device as
> appropriate in its "noirq" resume callback which is executed
> regardless of whether or not the device may be left suspended, but
> the other resume callbacks (except for ->complete) will be skipped
> automatically by the core if the device really can be left in
> suspend.

I don't understand the reason to why you need to skip invoking resume
callbacks to achieve this behavior, could you elaborate on that?

Couldn't the PM domain or the middle-layer instead decide what to do?
To me it sounds a bit prone to errors by skipping callbacks from the
PM core, and I wonder if the general driver author will be able to
understand how to use this flag properly.

That said, as the series don't include any changes for drivers making
use of the flag, could please fold in such change as it would provide
a more complete picture?

>
> The additional power.must_resume status bit introduced for the
> implementation of this mechanisn is used internally by the PM core
> to track the requirement to resume the device (which may depend on
> its children etc).

Yeah, clearly the PM core needs to be involved, because of the need of
dealing with parent/child relations, however as kind of indicate
above, couldn't the PM core just set some flag/status bits, which
instructs the middle-layer and PM domain on what to do? That sounds
like an easier approach.

>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
>  Documentation/driver-api/pm/devices.rst |   24 ++++++++++-
>  drivers/base/power/main.c               |   65 +++++++++++++++++++++++++++++---
>  include/linux/pm.h                      |   14 +++++-
>  3 files changed, 93 insertions(+), 10 deletions(-)
>
> Index: linux-pm/include/linux/pm.h
> ===================================================================
> --- linux-pm.orig/include/linux/pm.h
> +++ linux-pm/include/linux/pm.h
> @@ -559,6 +559,7 @@ struct pm_subsys_data {
>   * NEVER_SKIP: Do not skip system suspend/resume callbacks for the device.
>   * SMART_PREPARE: Check the return value of the driver's ->prepare callback.
>   * SMART_SUSPEND: No need to resume the device from runtime suspend.
> + * LEAVE_SUSPENDED: Avoid resuming the device during system resume if possible.
>   *
>   * Setting SMART_PREPARE instructs bus types and PM domains which may want
>   * system suspend/resume callbacks to be skipped for the device to return 0 from
> @@ -572,10 +573,14 @@ struct pm_subsys_data {
>   * necessary from the driver's perspective.  It also may cause them to skip
>   * invocations of the ->suspend_late and ->suspend_noirq callbacks provided by
>   * the driver if they decide to leave the device in runtime suspend.
> + *
> + * Setting LEAVE_SUSPENDED informs the PM core and middle-layer code that the
> + * driver prefers the device to be left in runtime suspend after system resume.
>   */
> -#define DPM_FLAG_NEVER_SKIP    BIT(0)
> -#define DPM_FLAG_SMART_PREPARE BIT(1)
> -#define DPM_FLAG_SMART_SUSPEND BIT(2)
> +#define DPM_FLAG_NEVER_SKIP            BIT(0)
> +#define DPM_FLAG_SMART_PREPARE         BIT(1)
> +#define DPM_FLAG_SMART_SUSPEND         BIT(2)
> +#define DPM_FLAG_LEAVE_SUSPENDED       BIT(3)
>
>  struct dev_pm_info {
>         pm_message_t            power_state;
> @@ -597,6 +602,8 @@ struct dev_pm_info {
>         bool                    wakeup_path:1;
>         bool                    syscore:1;
>         bool                    no_pm_callbacks:1;      /* Owned by the PM core */
> +       unsigned int            must_resume:1;  /* Owned by the PM core */
> +       unsigned int            may_skip_resume:1;      /* Set by subsystems */
>  #else
>         unsigned int            should_wakeup:1;
>  #endif
> @@ -765,6 +772,7 @@ extern int pm_generic_poweroff_late(stru
>  extern int pm_generic_poweroff(struct device *dev);
>  extern void pm_generic_complete(struct device *dev);
>
> +extern bool dev_pm_may_skip_resume(struct device *dev);
>  extern bool dev_pm_smart_suspend_and_suspended(struct device *dev);
>
>  #else /* !CONFIG_PM_SLEEP */
> Index: linux-pm/drivers/base/power/main.c
> ===================================================================
> --- linux-pm.orig/drivers/base/power/main.c
> +++ linux-pm/drivers/base/power/main.c
> @@ -528,6 +528,18 @@ static void dpm_watchdog_clear(struct dp
>  /*------------------------- Resume routines -------------------------*/
>
>  /**
> + * dev_pm_may_skip_resume - System-wide device resume optimization check.
> + * @dev: Target device.
> + *
> + * Checks whether or not the device may be left in suspend after a system-wide
> + * transition to the working state.
> + */
> +bool dev_pm_may_skip_resume(struct device *dev)
> +{
> +       return !dev->power.must_resume && pm_transition.event != PM_EVENT_RESTORE;
> +}
> +
> +/**
>   * device_resume_noirq - Execute a "noirq resume" callback for given device.
>   * @dev: Device to handle.
>   * @state: PM transition of the system being carried out.
> @@ -575,6 +587,12 @@ static int device_resume_noirq(struct de
>         error = dpm_run_callback(callback, dev, state, info);
>         dev->power.is_noirq_suspended = false;
>
> +       if (dev_pm_may_skip_resume(dev)) {
> +               pm_runtime_set_suspended(dev);

According to the doc, the DPM_FLAG_LEAVE_SUSPENDED intends to leave
the device in runtime suspend state during system resume.
However, here you are actually trying to change its runtime PM state to that.

Moreover, you should check the return value from
pm_runtime_set_suspended(). Then I wonder, what should you do when it
fails here?

Perhaps a better idea is to do this in the noirq suspend phase,
because it allows you to bail out in case pm_runtime_set_suspended()
fails.

Another option is to leave this to the middle-layer and PM domain,
that would make it more flexible and probably also easier for them to
deal with the error path.

> +               dev->power.is_late_suspended = false;
> +               dev->power.is_suspended = false;pm_runtime_set_suspended(
> +       }
> +
>   Out:
>         complete_all(&dev->power.completion);
>         TRACE_RESUME(error);
> @@ -1076,6 +1094,22 @@ static pm_message_t resume_event(pm_mess
>         return PMSG_ON;
>  }
>
> +static void dpm_superior_set_must_resume(struct device *dev)
> +{
> +       struct device_link *link;
> +       int idx;
> +
> +       if (dev->parent)
> +               dev->parent->power.must_resume = true;
> +
> +       idx = device_links_read_lock();
> +
> +       list_for_each_entry_rcu(link, &dev->links.suppliers, c_node)
> +               link->supplier->power.must_resume = true;
> +
> +       device_links_read_unlock(idx);
> +}
> +
>  /**
>   * __device_suspend_noirq - Execute a "noirq suspend" callback for given device.
>   * @dev: Device to handle.
> @@ -1127,10 +1161,27 @@ static int __device_suspend_noirq(struct
>         }
>
>         error = dpm_run_callback(callback, dev, state, info);
> -       if (!error)
> -               dev->power.is_noirq_suspended = true;
> -       else
> +       if (error) {
>                 async_error = error;
> +               goto Complete;
> +       }
> +
> +       dev->power.is_noirq_suspended = true;
> +
> +       if (dev_pm_test_driver_flags(dev, DPM_FLAG_LEAVE_SUSPENDED)) {
> +               /*
> +                * The only safe strategy here is to require that if the device
> +                * may not be left in suspend, resume callbacks must be invoked
> +                * for it.
> +                */
> +               dev->power.must_resume = dev->power.must_resume ||
> +                                       !dev->power.may_skip_resume;
> +       } else {
> +               dev->power.must_resume = true;
> +       }
> +
> +       if (dev->power.must_resume)
> +               dpm_superior_set_must_resume(dev);
>
>  Complete:
>         complete_all(&dev->power.completion);
> @@ -1487,6 +1538,9 @@ static int __device_suspend(struct devic
>                 dev->power.direct_complete = false;
>         }
>
> +       dev->power.may_skip_resume = false;
> +       dev->power.must_resume = false;
> +
>         dpm_watchdog_set(&wd, dev);
>         device_lock(dev);
>
> @@ -1652,8 +1706,9 @@ static int device_prepare(struct device
>         if (dev->power.syscore)
>                 return 0;
>
> -       WARN_ON(dev_pm_test_driver_flags(dev, DPM_FLAG_SMART_SUSPEND) &&
> -               !pm_runtime_enabled(dev));
> +       WARN_ON(!pm_runtime_enabled(dev) &&
> +               dev_pm_test_driver_flags(dev, DPM_FLAG_SMART_SUSPEND |
> +                                             DPM_FLAG_LEAVE_SUSPENDED));
>
>         /*
>          * If a device's parent goes into runtime suspend at the wrong time,
> Index: linux-pm/Documentation/driver-api/pm/devices.rst
> ===================================================================
> --- linux-pm.orig/Documentation/driver-api/pm/devices.rst
> +++ linux-pm/Documentation/driver-api/pm/devices.rst
> @@ -788,6 +788,26 @@ must reflect the "active" status for run
>
>  During system-wide resume from a sleep state it's easiest to put devices into
>  the full-power state, as explained in :file:`Documentation/power/runtime_pm.txt`.
> -Refer to that document for more information regarding this particular issue as
> +[Refer to that document for more information regarding this particular issue as
>  well as for information on the device runtime power management framework in
> -general.
> +general.]
> +
> +However, it may be desirable to leave some devices in runtime suspend after
> +system transitions to the working state and device drivers can use the
> +``DPM_FLAG_LEAVE_SUSPENDED`` flag to indicate to the PM core (and middle-layer
> +code) that this is the case.  Whether or not the devices will actually be left
> +in suspend may depend on their state before the given system suspend-resume
> +cycle and on the type of the system transition under way.  In particular,
> +devices are not left suspended if that transition is a restore from hibernation,
> +as device states are not guaranteed to be reflected by the information stored in
> +the hibernation image in that case.
> +
> +The middle-layer code involved in the handling of the device has to indicate to
> +the PM core if the device may be left in suspend with the help of its
> +:c:member:`power.may_skip_resume` status bit.  That has to happen in the "noirq"
> +phase of the preceding system-wide suspend (or analogous) transition.  The
> +middle layer is then responsible for handling the device as appropriate in its
> +"noirq" resume callback, which is executed regardless of whether or not the
> +device may be left suspended, but the other resume callbacks (except for
> +``->complete``) will be skipped automatically by the PM core if the device
> +really can be left in suspend.
>

Kind regards
Uffe
Rafael J. Wysocki Nov. 10, 2017, 11:45 p.m. UTC | #2
On Fri, Nov 10, 2017 at 10:09 AM, Ulf Hansson <ulf.hansson@linaro.org> wrote:
> On 8 November 2017 at 14:25, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> Define and document a new driver flag, DPM_FLAG_LEAVE_SUSPENDED, to
>> instruct the PM core and middle-layer (bus type, PM domain, etc.)
>> code that it is desirable to leave the device in runtime suspend
>> after system-wide transitions to the working state (for example,
>> the device may be slow to resume and it may be better to avoid
>> resuming it right away).
>>
>> Generally, the middle-layer code involved in the handling of the
>> device is expected to indicate to the PM core whether or not the
>> device may be left in suspend with the help of the device's
>> power.may_skip_resume status bit.  That has to happen in the "noirq"
>> phase of the preceding system suspend (or analogous) transition.
>> The middle layer is then responsible for handling the device as
>> appropriate in its "noirq" resume callback which is executed
>> regardless of whether or not the device may be left suspended, but
>> the other resume callbacks (except for ->complete) will be skipped
>> automatically by the core if the device really can be left in
>> suspend.
>
> I don't understand the reason to why you need to skip invoking resume
> callbacks to achieve this behavior, could you elaborate on that?

The reason why it is done this way is because that takes less code and
is easier (or at least less error-prone, because it avoids repeating
patterns in middle layers).

Note that the callbacks only may be skipped by the core if the middle
layer has set power.skip_resume for the device (or if the core is
handling it in patch [5/6], but that's one more step ahead still).

> Couldn't the PM domain or the middle-layer instead decide what to do?

They still can, the whole thing is a total opt-in.

But to be constructive, do you have any specific examples in mind?

> To me it sounds a bit prone to errors by skipping callbacks from the
> PM core, and I wonder if the general driver author will be able to
> understand how to use this flag properly.

This has nothing to do with general driver authors and I'm not sure
what you mean here and where you are going with this.

> That said, as the series don't include any changes for drivers making
> use of the flag, could please fold in such change as it would provide
> a more complete picture?

I've already done so, see https://patchwork.kernel.org/patch/10007349/

IMHO it's not really useful to drag this stuff (which doesn't change
BTW) along with every iteration of the core patches.

>>
>> The additional power.must_resume status bit introduced for the
>> implementation of this mechanisn is used internally by the PM core
>> to track the requirement to resume the device (which may depend on
>> its children etc).
>
> Yeah, clearly the PM core needs to be involved, because of the need of
> dealing with parent/child relations, however as kind of indicate
> above, couldn't the PM core just set some flag/status bits, which
> instructs the middle-layer and PM domain on what to do? That sounds
> like an easier approach.

No, it is not easier.  And it is backwards.

>>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>> ---
>>  Documentation/driver-api/pm/devices.rst |   24 ++++++++++-
>>  drivers/base/power/main.c               |   65 +++++++++++++++++++++++++++++---
>>  include/linux/pm.h                      |   14 +++++-
>>  3 files changed, 93 insertions(+), 10 deletions(-)
>>
>> Index: linux-pm/include/linux/pm.h
>> ===================================================================
>> --- linux-pm.orig/include/linux/pm.h
>> +++ linux-pm/include/linux/pm.h
>> @@ -559,6 +559,7 @@ struct pm_subsys_data {
>>   * NEVER_SKIP: Do not skip system suspend/resume callbacks for the device.
>>   * SMART_PREPARE: Check the return value of the driver's ->prepare callback.
>>   * SMART_SUSPEND: No need to resume the device from runtime suspend.
>> + * LEAVE_SUSPENDED: Avoid resuming the device during system resume if possible.
>>   *
>>   * Setting SMART_PREPARE instructs bus types and PM domains which may want
>>   * system suspend/resume callbacks to be skipped for the device to return 0 from
>> @@ -572,10 +573,14 @@ struct pm_subsys_data {
>>   * necessary from the driver's perspective.  It also may cause them to skip
>>   * invocations of the ->suspend_late and ->suspend_noirq callbacks provided by
>>   * the driver if they decide to leave the device in runtime suspend.
>> + *
>> + * Setting LEAVE_SUSPENDED informs the PM core and middle-layer code that the
>> + * driver prefers the device to be left in runtime suspend after system resume.
>>   */
>> -#define DPM_FLAG_NEVER_SKIP    BIT(0)
>> -#define DPM_FLAG_SMART_PREPARE BIT(1)
>> -#define DPM_FLAG_SMART_SUSPEND BIT(2)
>> +#define DPM_FLAG_NEVER_SKIP            BIT(0)
>> +#define DPM_FLAG_SMART_PREPARE         BIT(1)
>> +#define DPM_FLAG_SMART_SUSPEND         BIT(2)
>> +#define DPM_FLAG_LEAVE_SUSPENDED       BIT(3)
>>
>>  struct dev_pm_info {
>>         pm_message_t            power_state;
>> @@ -597,6 +602,8 @@ struct dev_pm_info {
>>         bool                    wakeup_path:1;
>>         bool                    syscore:1;
>>         bool                    no_pm_callbacks:1;      /* Owned by the PM core */
>> +       unsigned int            must_resume:1;  /* Owned by the PM core */
>> +       unsigned int            may_skip_resume:1;      /* Set by subsystems */
>>  #else
>>         unsigned int            should_wakeup:1;
>>  #endif
>> @@ -765,6 +772,7 @@ extern int pm_generic_poweroff_late(stru
>>  extern int pm_generic_poweroff(struct device *dev);
>>  extern void pm_generic_complete(struct device *dev);
>>
>> +extern bool dev_pm_may_skip_resume(struct device *dev);
>>  extern bool dev_pm_smart_suspend_and_suspended(struct device *dev);
>>
>>  #else /* !CONFIG_PM_SLEEP */
>> Index: linux-pm/drivers/base/power/main.c
>> ===================================================================
>> --- linux-pm.orig/drivers/base/power/main.c
>> +++ linux-pm/drivers/base/power/main.c
>> @@ -528,6 +528,18 @@ static void dpm_watchdog_clear(struct dp
>>  /*------------------------- Resume routines -------------------------*/
>>
>>  /**
>> + * dev_pm_may_skip_resume - System-wide device resume optimization check.
>> + * @dev: Target device.
>> + *
>> + * Checks whether or not the device may be left in suspend after a system-wide
>> + * transition to the working state.
>> + */
>> +bool dev_pm_may_skip_resume(struct device *dev)
>> +{
>> +       return !dev->power.must_resume && pm_transition.event != PM_EVENT_RESTORE;
>> +}
>> +
>> +/**
>>   * device_resume_noirq - Execute a "noirq resume" callback for given device.
>>   * @dev: Device to handle.
>>   * @state: PM transition of the system being carried out.
>> @@ -575,6 +587,12 @@ static int device_resume_noirq(struct de
>>         error = dpm_run_callback(callback, dev, state, info);
>>         dev->power.is_noirq_suspended = false;
>>
>> +       if (dev_pm_may_skip_resume(dev)) {
>> +               pm_runtime_set_suspended(dev);
>
> According to the doc, the DPM_FLAG_LEAVE_SUSPENDED intends to leave
> the device in runtime suspend state during system resume.
> However, here you are actually trying to change its runtime PM state to that.

So the doc needs to be fixed. :-)

But I'm guessing that this just is a misunderstanding and you mean the
phrase "it may be desirable to leave some devices in runtime suspend
after [...]".  Yes, it is talking about "runtime suspend", but
actually "runtime suspend" is the only kind of "suspend" you can leave
a device in after a system transition to the working state.  It never
says that the device must have been suspended before the preceding
system transition into a sleep state started.

> Moreover, you should check the return value from
> pm_runtime_set_suspended().

This is in "noirq", so failures of that are meaningless here.

> Then I wonder, what should you do when it fails here?
>
> Perhaps a better idea is to do this in the noirq suspend phase,
> because it allows you to bail out in case pm_runtime_set_suspended()
> fails.

This doesn't make sense, sorry.

> Another option is to leave this to the middle-layer and PM domain,
> that would make it more flexible and probably also easier for them to
> deal with the error path.

So the middle layer doesn't have to set power.skip_resume.

Just don't set it if you don't like the default handling, but yes, you
will affect others this way.

Thanks,
Rafael
Rafael J. Wysocki Nov. 11, 2017, 12:41 a.m. UTC | #3
On Sat, Nov 11, 2017 at 12:45 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Fri, Nov 10, 2017 at 10:09 AM, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>> On 8 November 2017 at 14:25, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Define and document a new driver flag, DPM_FLAG_LEAVE_SUSPENDED, to
>>> instruct the PM core and middle-layer (bus type, PM domain, etc.)
>>> code that it is desirable to leave the device in runtime suspend
>>> after system-wide transitions to the working state (for example,
>>> the device may be slow to resume and it may be better to avoid
>>> resuming it right away).
>>>
>>> Generally, the middle-layer code involved in the handling of the
>>> device is expected to indicate to the PM core whether or not the
>>> device may be left in suspend with the help of the device's
>>> power.may_skip_resume status bit.  That has to happen in the "noirq"
>>> phase of the preceding system suspend (or analogous) transition.
>>> The middle layer is then responsible for handling the device as
>>> appropriate in its "noirq" resume callback which is executed
>>> regardless of whether or not the device may be left suspended, but
>>> the other resume callbacks (except for ->complete) will be skipped
>>> automatically by the core if the device really can be left in
>>> suspend.
>>
>> I don't understand the reason to why you need to skip invoking resume
>> callbacks to achieve this behavior, could you elaborate on that?
>
> The reason why it is done this way is because that takes less code and
> is easier (or at least less error-prone, because it avoids repeating
> patterns in middle layers).

Actually, it also is a matter of correctness, at least to some extent.

Namely, if the parent or any supplier of the device has
power.must_resume clear in dpm_noirq_resume_devices(), then the device
should not be touched during the whole system resume transition
(because the access may very well go through the suspended parent or
supplier) and the most straightforward way to make that happen is to
avoid running the code that may touch the device.  [Arguably, if
middle layers were made responsible for handling that, they would need
to do pretty much the same thing and so there is no reason for not
doing it in the core.]

Allowing the "noirq" callback from middle layers to run in that case
is a stretch already, but since genpd needs that, well, tough nuggets.

All of that said, if there is a middle layer wanting to set
power.skip_resume and needing to do something different for the resume
callbacks, then this piece can be moved from the core to the middle
layers at any time later.  So far there's none, though.  At least not
in this patch series.

Thanks,
Rafael
Rafael J. Wysocki Nov. 11, 2017, 1:36 a.m. UTC | #4
On Sat, Nov 11, 2017 at 12:45 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Fri, Nov 10, 2017 at 10:09 AM, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>> On 8 November 2017 at 14:25, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:

[cut]

>> Moreover, you should check the return value from
>> pm_runtime_set_suspended().
>
> This is in "noirq", so failures of that are meaningless here.

They *should* be meaningless, but __pm_runtime_set_status() is sort of
buggy and checks child_count regardless of whether or not runtime PM
is enabled for the children (but when changing the status to "active"
it actually checks if runtime PM is enabled for the parent before
returning -EBUSY, so it is not event consistent internally).  Oh well.

>> Then I wonder, what should you do when it fails here?
>>
>> Perhaps a better idea is to do this in the noirq suspend phase,
>> because it allows you to bail out in case pm_runtime_set_suspended()
>> fails.
>
> This doesn't make sense, sorry.

Not for the above reason, but that would allow the bug in
__pm_runtime_set_status() to be sort of worked around by setting the
status to "suspended" for children before doing that for their
parents.

Moreover, stuff with nonzero usage_counts cannot be left in suspend regardless.

Thanks,
Rafael
Ulf Hansson Nov. 14, 2017, 4:07 p.m. UTC | #5
On 11 November 2017 at 00:45, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Fri, Nov 10, 2017 at 10:09 AM, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>> On 8 November 2017 at 14:25, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Define and document a new driver flag, DPM_FLAG_LEAVE_SUSPENDED, to
>>> instruct the PM core and middle-layer (bus type, PM domain, etc.)
>>> code that it is desirable to leave the device in runtime suspend
>>> after system-wide transitions to the working state (for example,
>>> the device may be slow to resume and it may be better to avoid
>>> resuming it right away).
>>>
>>> Generally, the middle-layer code involved in the handling of the
>>> device is expected to indicate to the PM core whether or not the
>>> device may be left in suspend with the help of the device's
>>> power.may_skip_resume status bit.  That has to happen in the "noirq"
>>> phase of the preceding system suspend (or analogous) transition.
>>> The middle layer is then responsible for handling the device as
>>> appropriate in its "noirq" resume callback which is executed
>>> regardless of whether or not the device may be left suspended, but
>>> the other resume callbacks (except for ->complete) will be skipped
>>> automatically by the core if the device really can be left in
>>> suspend.
>>
>> I don't understand the reason to why you need to skip invoking resume
>> callbacks to achieve this behavior, could you elaborate on that?
>
> The reason why it is done this way is because that takes less code and
> is easier (or at least less error-prone, because it avoids repeating
> patterns in middle layers).
>
> Note that the callbacks only may be skipped by the core if the middle
> layer has set power.skip_resume for the device (or if the core is
> handling it in patch [5/6], but that's one more step ahead still).
>
>> Couldn't the PM domain or the middle-layer instead decide what to do?
>
> They still can, the whole thing is a total opt-in.
>
> But to be constructive, do you have any specific examples in mind?

See more below.

>
>> To me it sounds a bit prone to errors by skipping callbacks from the
>> PM core, and I wonder if the general driver author will be able to
>> understand how to use this flag properly.
>
> This has nothing to do with general driver authors and I'm not sure
> what you mean here and where you are going with this.

Let me elaborate.

My general goal is that I want to make it easier (or as easy as
possible) for the general driver author to deploy runtime PM and
system-wide PM support - in an optimized manner. Therefore, I am
pondering over the solution you picked in this series, trying to
understand how it fits into those aspects.

Particular I am a bit worried from a complexity point of view, about
the part with skipping callbacks from the PM core. We have observed
some difficulties with the direct_complete path (i2c dw driver), which
is based on a similar approach as this one.

Additionally, in this case, to trigger skipping of callbacks to
happen, first, drivers needs to inform the middle-layer, second, the
middle layer acts on that information and then informs the PM core,
then in the third step, the PM core can decide what to do. It doesn't
sound straight-forward.

I guess I need to be convinced that this new approach is going to be
better than the the direct_complete path, so it somehow can replace it
along the road. Otherwise, we may end up just having yet another way
of skipping callbacks in the PM core and I don't like that.

Of course, I also realize this hole thing is opt-in, so nothing will
break and we are all good. :-)

>
>> That said, as the series don't include any changes for drivers making
>> use of the flag, could please fold in such change as it would provide
>> a more complete picture?
>
> I've already done so, see https://patchwork.kernel.org/patch/10007349/
>
> IMHO it's not really useful to drag this stuff (which doesn't change
> BTW) along with every iteration of the core patches.

Well, to me it's useful because it shows how these flags can/will be used.

Anyway, I thought you scraped that patch and was working on a new
version. I will have a look then.

[...]

>>>   * device_resume_noirq - Execute a "noirq resume" callback for given device.
>>>   * @dev: Device to handle.
>>>   * @state: PM transition of the system being carried out.
>>> @@ -575,6 +587,12 @@ static int device_resume_noirq(struct de
>>>         error = dpm_run_callback(callback, dev, state, info);
>>>         dev->power.is_noirq_suspended = false;
>>>
>>> +       if (dev_pm_may_skip_resume(dev)) {
>>> +               pm_runtime_set_suspended(dev);
>>
>> According to the doc, the DPM_FLAG_LEAVE_SUSPENDED intends to leave
>> the device in runtime suspend state during system resume.
>> However, here you are actually trying to change its runtime PM state to that.
>
> So the doc needs to be fixed. :-)

Yep.

>
> But I'm guessing that this just is a misunderstanding and you mean the
> phrase "it may be desirable to leave some devices in runtime suspend
> after [...]".  Yes, it is talking about "runtime suspend", but
> actually "runtime suspend" is the only kind of "suspend" you can leave
> a device in after a system transition to the working state.  It never
> says that the device must have been suspended before the preceding
> system transition into a sleep state started.

My point is, it's isn't obvious why you need to make sure the device's
runtime PM status is set to "RPM_SUSPENDED" when leaving the resume
noirq phase. You did explain that somewhat above, thanks!

Perhaps you could fold in some of that information into the doc as well?

>
>> Moreover, you should check the return value from
>> pm_runtime_set_suspended().
>
> This is in "noirq", so failures of that are meaningless here.
>
>> Then I wonder, what should you do when it fails here?
>>
>> Perhaps a better idea is to do this in the noirq suspend phase,
>> because it allows you to bail out in case pm_runtime_set_suspended()
>> fails.
>
> This doesn't make sense, sorry.

What do you mean by "failures of that are meaningless here."?

I was suggesting, instead of calling pm_runtime_set_suspended() in the
noirq *resume* phase, why can't you do that in the noirq *suspend*
phase?

In the noirq *suspend* phase it's not too late to deal with errors!? Or is it?

[...]

Kind regards
Uffe
Rafael J. Wysocki Nov. 15, 2017, 1:48 a.m. UTC | #6
On Tuesday, November 14, 2017 5:07:59 PM CET Ulf Hansson wrote:
> On 11 November 2017 at 00:45, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > On Fri, Nov 10, 2017 at 10:09 AM, Ulf Hansson <ulf.hansson@linaro.org> wrote:
> >> On 8 November 2017 at 14:25, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >>> Define and document a new driver flag, DPM_FLAG_LEAVE_SUSPENDED, to
> >>> instruct the PM core and middle-layer (bus type, PM domain, etc.)
> >>> code that it is desirable to leave the device in runtime suspend
> >>> after system-wide transitions to the working state (for example,
> >>> the device may be slow to resume and it may be better to avoid
> >>> resuming it right away).
> >>>
> >>> Generally, the middle-layer code involved in the handling of the
> >>> device is expected to indicate to the PM core whether or not the
> >>> device may be left in suspend with the help of the device's
> >>> power.may_skip_resume status bit.  That has to happen in the "noirq"
> >>> phase of the preceding system suspend (or analogous) transition.
> >>> The middle layer is then responsible for handling the device as
> >>> appropriate in its "noirq" resume callback which is executed
> >>> regardless of whether or not the device may be left suspended, but
> >>> the other resume callbacks (except for ->complete) will be skipped
> >>> automatically by the core if the device really can be left in
> >>> suspend.
> >>
> >> I don't understand the reason to why you need to skip invoking resume
> >> callbacks to achieve this behavior, could you elaborate on that?
> >
> > The reason why it is done this way is because that takes less code and
> > is easier (or at least less error-prone, because it avoids repeating
> > patterns in middle layers).
> >
> > Note that the callbacks only may be skipped by the core if the middle
> > layer has set power.skip_resume for the device (or if the core is
> > handling it in patch [5/6], but that's one more step ahead still).
> >
> >> Couldn't the PM domain or the middle-layer instead decide what to do?
> >
> > They still can, the whole thing is a total opt-in.
> >
> > But to be constructive, do you have any specific examples in mind?
> 
> See more below.
> 
> >
> >> To me it sounds a bit prone to errors by skipping callbacks from the
> >> PM core, and I wonder if the general driver author will be able to
> >> understand how to use this flag properly.
> >
> > This has nothing to do with general driver authors and I'm not sure
> > what you mean here and where you are going with this.
> 
> Let me elaborate.
> 
> My general goal is that I want to make it easier (or as easy as
> possible) for the general driver author to deploy runtime PM and
> system-wide PM support - in an optimized manner. Therefore, I am
> pondering over the solution you picked in this series, trying to
> understand how it fits into those aspects.
> 
> Particular I am a bit worried from a complexity point of view, about
> the part with skipping callbacks from the PM core. We have observed
> some difficulties with the direct_complete path (i2c dw driver), which
> is based on a similar approach as this one.

These are resume callbacks, not suspend callbacks.  Also not all of them
are skipped.  That is quite a bit different from skipping *all* callbacks.

Moreover, at the point the core decides to skip the callbacks, the device
*has* *to* be left suspended and there simply is no point in running them
no matter what.

That part of code can be trivially moved to middle layers, but then each
of them will have to do exactly the same thing.  I don't see any reason to
do that and I'm not finding one in your comments.  Sorry.

> Additionally, in this case, to trigger skipping of callbacks to
> happen, first, drivers needs to inform the middle-layer, second, the
> middle layer acts on that information and then informs the PM core,
> then in the third step, the PM core can decide what to do. It doesn't
> sound straight-forward.

It really doesn't work like that.

First, the driver sets the LEAVE_SUSPENDED flag for the core to consume.
The middle layers don't have to look at it at all.

Second, each middle layer sets power.may_skip_resume for devices whose
state after system suspend should match the runtime suspend state.  The
middle layer must know that this is the case to set that bit.  [The core
effectively does that part for devices handled by it directly in patch
[5/6].]

The core then takes the LEAVE_SUSPENDED flags, power.may_skip_resume bits,
status of the children and consumers into account in order to produce the
power.must_resume bits and those are used (later) to decide whether or not
to resume the devices.  That decision is made by the core and so the core
acts on it and the middle layers must follow.

> I guess I need to be convinced that this new approach is going to be
> better than the the direct_complete path, so it somehow can replace it
> along the road. Otherwise, we may end up just having yet another way
> of skipping callbacks in the PM core and I don't like that.

Well, this works the other way around this time, I'm afraid.  At this point
you need to convince me that the approach has real issues. :-)

> Of course, I also realize this hole thing is opt-in, so nothing will
> break and we are all good. :-)
> 
> >
> >> That said, as the series don't include any changes for drivers making
> >> use of the flag, could please fold in such change as it would provide
> >> a more complete picture?
> >
> > I've already done so, see https://patchwork.kernel.org/patch/10007349/
> >
> > IMHO it's not really useful to drag this stuff (which doesn't change
> > BTW) along with every iteration of the core patches.
> 
> Well, to me it's useful because it shows how these flags can/will be used.
> 
> Anyway, I thought you scraped that patch and was working on a new
> version. I will have a look then.
> 
> [...]
> 
> >>>   * device_resume_noirq - Execute a "noirq resume" callback for given device.
> >>>   * @dev: Device to handle.
> >>>   * @state: PM transition of the system being carried out.
> >>> @@ -575,6 +587,12 @@ static int device_resume_noirq(struct de
> >>>         error = dpm_run_callback(callback, dev, state, info);
> >>>         dev->power.is_noirq_suspended = false;
> >>>
> >>> +       if (dev_pm_may_skip_resume(dev)) {
> >>> +               pm_runtime_set_suspended(dev);
> >>
> >> According to the doc, the DPM_FLAG_LEAVE_SUSPENDED intends to leave
> >> the device in runtime suspend state during system resume.
> >> However, here you are actually trying to change its runtime PM state to that.
> >
> > So the doc needs to be fixed. :-)
> 
> Yep.
> 
> >
> > But I'm guessing that this just is a misunderstanding and you mean the
> > phrase "it may be desirable to leave some devices in runtime suspend
> > after [...]".  Yes, it is talking about "runtime suspend", but
> > actually "runtime suspend" is the only kind of "suspend" you can leave
> > a device in after a system transition to the working state.  It never
> > says that the device must have been suspended before the preceding
> > system transition into a sleep state started.
> 
> My point is, it's isn't obvious why you need to make sure the device's
> runtime PM status is set to "RPM_SUSPENDED" when leaving the resume
> noirq phase. You did explain that somewhat above, thanks!
> 
> Perhaps you could fold in some of that information into the doc as well?

The doc doesn't describe the design of the code, though.

I guess I'll just add a comment at the point where the status changes.

> >
> >> Moreover, you should check the return value from
> >> pm_runtime_set_suspended().
> >
> > This is in "noirq", so failures of that are meaningless here.
> >
> >> Then I wonder, what should you do when it fails here?
> >>
> >> Perhaps a better idea is to do this in the noirq suspend phase,
> >> because it allows you to bail out in case pm_runtime_set_suspended()
> >> fails.
> >
> > This doesn't make sense, sorry.
> 
> What do you mean by "failures of that are meaningless here."?

If all devices have runtime PM disabled, pm_runtime_set_suspended() should
just do what it is asked for unless called with an invalid argument or
similar.

> I was suggesting, instead of calling pm_runtime_set_suspended() in the
> noirq *resume* phase, why can't you do that in the noirq *suspend*
> phase?
> 
> In the noirq *suspend* phase it's not too late to deal with errors!? Or is it?

At that point it has not been decided whether or not the devices will stay
suspended yet.  The status cannot be changed before making that decision,
which only happens in the noirq resume phase.

Thanks,
Rafael
Ulf Hansson Nov. 16, 2017, 10:18 a.m. UTC | #7
[...]

>>
>> My general goal is that I want to make it easier (or as easy as
>> possible) for the general driver author to deploy runtime PM and
>> system-wide PM support - in an optimized manner. Therefore, I am
>> pondering over the solution you picked in this series, trying to
>> understand how it fits into those aspects.
>>
>> Particular I am a bit worried from a complexity point of view, about
>> the part with skipping callbacks from the PM core. We have observed
>> some difficulties with the direct_complete path (i2c dw driver), which
>> is based on a similar approach as this one.
>
> These are resume callbacks, not suspend callbacks.  Also not all of them
> are skipped.  That is quite a bit different from skipping *all* callbacks.
>
> Moreover, at the point the core decides to skip the callbacks, the device
> *has* *to* be left suspended and there simply is no point in running them
> no matter what.
>
> That part of code can be trivially moved to middle layers, but then each
> of them will have to do exactly the same thing.  I don't see any reason to
> do that and I'm not finding one in your comments.  Sorry.

I think my concerns boils done to that I am wondering how useful it
will be, in general, to enable the core to skip invoking resume
callbacks.

Although, if you are targeting some specific devices/drivers (ACPI,
PCI, etc), and not care that much about flexibility, then I am fine
with it. The approach seems to work.

Let me elaborate on that comment a bit.

1)
Skipping resume callbacks is not going to work for a device that may
be attached to the generic PM domain.

Well, in principle one could try to re-work genpd to cope with this
behavior, I guess, but that would also mean genpd becomes limited to
always use the noirq callbacks to power on/off the PM domain. That
isn't an acceptable limitation.

2)
Because of 1) This leads to those cross SoC drivers, dealing with
devices which sometimes may have a genpd attached and sometimes an
ACPI PM domain attached. I guess those drivers would need to have a
different set of system-wide PM callbacks, depending on the PM domain
the device is attached to, as to achieve a similar optimized behavior
during system resume. Or some other cleverness to deal with
system-wide PM.

Perhaps we can ignore both 1) and 2), because the number of cross SoC
drivers having these issues should be rather limited!?

3)
There are certainly lots of drivers that can cope with its device
remaining in runtime suspend, during system resume.
Although, some of these drivers may have some additional operations to
carry out during resume, which may not require to resume (activate)
its device. For example the driver may need to resume a queue,
re-configure an out-of-band wakeup (GPIO IRQ), re-configure pinctrls,
etc.

These drivers can't use the method behind LEAVE_SUSPENDED, because
they need their resume callbacks to be invoked.

[...]

>
> Well, this works the other way around this time, I'm afraid.  At this point
> you need to convince me that the approach has real issues. :-)

I think I have pointed out some issues above. Feel free to ignore
them, depending on what your target is.

[...]

>> >> Perhaps a better idea is to do this in the noirq suspend phase,
>> >> because it allows you to bail out in case pm_runtime_set_suspended()
>> >> fails.
>> >
>> > This doesn't make sense, sorry.
>>
>> What do you mean by "failures of that are meaningless here."?
>
> If all devices have runtime PM disabled, pm_runtime_set_suspended() should
> just do what it is asked for unless called with an invalid argument or
> similar.

Yes.

>
>> I was suggesting, instead of calling pm_runtime_set_suspended() in the
>> noirq *resume* phase, why can't you do that in the noirq *suspend*
>> phase?
>>
>> In the noirq *suspend* phase it's not too late to deal with errors!? Or is it?
>
> At that point it has not been decided whether or not the devices will stay
> suspended yet.  The status cannot be changed before making that decision,
> which only happens in the noirq resume phase.

Okay, then it's fine as is (because of your other patch to the runtime
PM core, which changes the behavior for pm_runtime_set_suspended()).

BTW, I have some additional minor comments to some other parts of the
code, but I will start over with a new thread proving you with those
comments.

Kind regards
Uffe
diff mbox series

Patch

Index: linux-pm/include/linux/pm.h
===================================================================
--- linux-pm.orig/include/linux/pm.h
+++ linux-pm/include/linux/pm.h
@@ -559,6 +559,7 @@  struct pm_subsys_data {
  * NEVER_SKIP: Do not skip system suspend/resume callbacks for the device.
  * SMART_PREPARE: Check the return value of the driver's ->prepare callback.
  * SMART_SUSPEND: No need to resume the device from runtime suspend.
+ * LEAVE_SUSPENDED: Avoid resuming the device during system resume if possible.
  *
  * Setting SMART_PREPARE instructs bus types and PM domains which may want
  * system suspend/resume callbacks to be skipped for the device to return 0 from
@@ -572,10 +573,14 @@  struct pm_subsys_data {
  * necessary from the driver's perspective.  It also may cause them to skip
  * invocations of the ->suspend_late and ->suspend_noirq callbacks provided by
  * the driver if they decide to leave the device in runtime suspend.
+ *
+ * Setting LEAVE_SUSPENDED informs the PM core and middle-layer code that the
+ * driver prefers the device to be left in runtime suspend after system resume.
  */
-#define DPM_FLAG_NEVER_SKIP	BIT(0)
-#define DPM_FLAG_SMART_PREPARE	BIT(1)
-#define DPM_FLAG_SMART_SUSPEND	BIT(2)
+#define DPM_FLAG_NEVER_SKIP		BIT(0)
+#define DPM_FLAG_SMART_PREPARE		BIT(1)
+#define DPM_FLAG_SMART_SUSPEND		BIT(2)
+#define DPM_FLAG_LEAVE_SUSPENDED	BIT(3)
 
 struct dev_pm_info {
 	pm_message_t		power_state;
@@ -597,6 +602,8 @@  struct dev_pm_info {
 	bool			wakeup_path:1;
 	bool			syscore:1;
 	bool			no_pm_callbacks:1;	/* Owned by the PM core */
+	unsigned int		must_resume:1;	/* Owned by the PM core */
+	unsigned int		may_skip_resume:1;	/* Set by subsystems */
 #else
 	unsigned int		should_wakeup:1;
 #endif
@@ -765,6 +772,7 @@  extern int pm_generic_poweroff_late(stru
 extern int pm_generic_poweroff(struct device *dev);
 extern void pm_generic_complete(struct device *dev);
 
+extern bool dev_pm_may_skip_resume(struct device *dev);
 extern bool dev_pm_smart_suspend_and_suspended(struct device *dev);
 
 #else /* !CONFIG_PM_SLEEP */
Index: linux-pm/drivers/base/power/main.c
===================================================================
--- linux-pm.orig/drivers/base/power/main.c
+++ linux-pm/drivers/base/power/main.c
@@ -528,6 +528,18 @@  static void dpm_watchdog_clear(struct dp
 /*------------------------- Resume routines -------------------------*/
 
 /**
+ * dev_pm_may_skip_resume - System-wide device resume optimization check.
+ * @dev: Target device.
+ *
+ * Checks whether or not the device may be left in suspend after a system-wide
+ * transition to the working state.
+ */
+bool dev_pm_may_skip_resume(struct device *dev)
+{
+	return !dev->power.must_resume && pm_transition.event != PM_EVENT_RESTORE;
+}
+
+/**
  * device_resume_noirq - Execute a "noirq resume" callback for given device.
  * @dev: Device to handle.
  * @state: PM transition of the system being carried out.
@@ -575,6 +587,12 @@  static int device_resume_noirq(struct de
 	error = dpm_run_callback(callback, dev, state, info);
 	dev->power.is_noirq_suspended = false;
 
+	if (dev_pm_may_skip_resume(dev)) {
+		pm_runtime_set_suspended(dev);
+		dev->power.is_late_suspended = false;
+		dev->power.is_suspended = false;
+	}
+
  Out:
 	complete_all(&dev->power.completion);
 	TRACE_RESUME(error);
@@ -1076,6 +1094,22 @@  static pm_message_t resume_event(pm_mess
 	return PMSG_ON;
 }
 
+static void dpm_superior_set_must_resume(struct device *dev)
+{
+	struct device_link *link;
+	int idx;
+
+	if (dev->parent)
+		dev->parent->power.must_resume = true;
+
+	idx = device_links_read_lock();
+
+	list_for_each_entry_rcu(link, &dev->links.suppliers, c_node)
+		link->supplier->power.must_resume = true;
+
+	device_links_read_unlock(idx);
+}
+
 /**
  * __device_suspend_noirq - Execute a "noirq suspend" callback for given device.
  * @dev: Device to handle.
@@ -1127,10 +1161,27 @@  static int __device_suspend_noirq(struct
 	}
 
 	error = dpm_run_callback(callback, dev, state, info);
-	if (!error)
-		dev->power.is_noirq_suspended = true;
-	else
+	if (error) {
 		async_error = error;
+		goto Complete;
+	}
+
+	dev->power.is_noirq_suspended = true;
+
+	if (dev_pm_test_driver_flags(dev, DPM_FLAG_LEAVE_SUSPENDED)) {
+		/*
+		 * The only safe strategy here is to require that if the device
+		 * may not be left in suspend, resume callbacks must be invoked
+		 * for it.
+		 */
+		dev->power.must_resume = dev->power.must_resume ||
+					!dev->power.may_skip_resume;
+	} else {
+		dev->power.must_resume = true;
+	}
+
+	if (dev->power.must_resume)
+		dpm_superior_set_must_resume(dev);
 
 Complete:
 	complete_all(&dev->power.completion);
@@ -1487,6 +1538,9 @@  static int __device_suspend(struct devic
 		dev->power.direct_complete = false;
 	}
 
+	dev->power.may_skip_resume = false;
+	dev->power.must_resume = false;
+
 	dpm_watchdog_set(&wd, dev);
 	device_lock(dev);
 
@@ -1652,8 +1706,9 @@  static int device_prepare(struct device
 	if (dev->power.syscore)
 		return 0;
 
-	WARN_ON(dev_pm_test_driver_flags(dev, DPM_FLAG_SMART_SUSPEND) &&
-		!pm_runtime_enabled(dev));
+	WARN_ON(!pm_runtime_enabled(dev) &&
+		dev_pm_test_driver_flags(dev, DPM_FLAG_SMART_SUSPEND |
+					      DPM_FLAG_LEAVE_SUSPENDED));
 
 	/*
 	 * If a device's parent goes into runtime suspend at the wrong time,
Index: linux-pm/Documentation/driver-api/pm/devices.rst
===================================================================
--- linux-pm.orig/Documentation/driver-api/pm/devices.rst
+++ linux-pm/Documentation/driver-api/pm/devices.rst
@@ -788,6 +788,26 @@  must reflect the "active" status for run
 
 During system-wide resume from a sleep state it's easiest to put devices into
 the full-power state, as explained in :file:`Documentation/power/runtime_pm.txt`.
-Refer to that document for more information regarding this particular issue as
+[Refer to that document for more information regarding this particular issue as
 well as for information on the device runtime power management framework in
-general.
+general.]
+
+However, it may be desirable to leave some devices in runtime suspend after
+system transitions to the working state and device drivers can use the
+``DPM_FLAG_LEAVE_SUSPENDED`` flag to indicate to the PM core (and middle-layer
+code) that this is the case.  Whether or not the devices will actually be left
+in suspend may depend on their state before the given system suspend-resume
+cycle and on the type of the system transition under way.  In particular,
+devices are not left suspended if that transition is a restore from hibernation,
+as device states are not guaranteed to be reflected by the information stored in
+the hibernation image in that case.
+
+The middle-layer code involved in the handling of the device has to indicate to
+the PM core if the device may be left in suspend with the help of its
+:c:member:`power.may_skip_resume` status bit.  That has to happen in the "noirq"
+phase of the preceding system-wide suspend (or analogous) transition.  The
+middle layer is then responsible for handling the device as appropriate in its
+"noirq" resume callback, which is executed regardless of whether or not the
+device may be left suspended, but the other resume callbacks (except for
+``->complete``) will be skipped automatically by the PM core if the device
+really can be left in suspend.