mbox series

[v6,0/2] Synchronize DT overlay removal with devlink removals

Message ID 20240325152140.198219-1-herve.codina@bootlin.com
Headers show
Series Synchronize DT overlay removal with devlink removals | expand

Message

Herve Codina March 25, 2024, 3:21 p.m. UTC
Hi,

In the following sequence:
  of_platform_depopulate(); /* Remove devices from a DT overlay node */
  of_overlay_remove(); /* Remove the DT overlay node itself */

Some warnings are raised by __of_changeset_entry_destroy() which  was
called from of_overlay_remove():
  ERROR: memory leak, expected refcount 1 instead of 2 ...

The issue is that, during the device devlink removals triggered from the
of_platform_depopulate(), jobs are put in a workqueue.
These jobs drop the reference to the devices. When a device is no more
referenced (refcount == 0), it is released and the reference to its
of_node is dropped by a call to of_node_put().
These operations are fully correct except that, because of the
workqueue, they are done asynchronously with respect to function calls.

In the sequence provided, the jobs are run too late, after the call to
__of_changeset_entry_destroy() and so a missing of_node_put() call is
detected by __of_changeset_entry_destroy().

This series fixes this issue introducing device_link_wait_removal() in
order to wait for the end of jobs execution (patch 1) and using this
function to synchronize the overlay removal with the end of jobs
execution (patch 2).

Compared to the previous iteration:
  https://lore.kernel.org/linux-kernel/20240307111036.225007-1-herve.codina@bootlin.com/
this v6 series:
- Add Saravana's 'Reviewed-by' tag

This series handles cases reported by Luca [1] and Nuno [2].
  [1]: https://lore.kernel.org/all/20231220181627.341e8789@booty/
  [2]: https://lore.kernel.org/all/20240205-fix-device-links-overlays-v2-2-5344f8c79d57@analog.com/

Best regards,
Hervé

Changes v5 -> v6
  - Patch 1
    Add 'Reviewed-by: Saravana Kannan <saravanak@google.com>'

  - Patch 2
    No changes

Changes v4 -> v5
  - Patch 1
    Remove the 'Fixes' tag
    Add 'Tested-by: Luca Ceresoli <luca.ceresoli@bootlin.com>'
    Add 'Reviewed-by: Nuno Sa <nuno.sa@analog.com>'

  - Patch 2
    Update comment as suggested
    Add 'Reviewed-by: Saravana Kannan <saravanak@google.com>'
    Add 'Tested-by: Luca Ceresoli <luca.ceresoli@bootlin.com>'
    Add 'Reviewed-by: Nuno Sa <nuno.sa@analog.com>'

Changes v3 -> v4
  - Patch 1
    Uses flush_workqueue() instead of drain_workqueue().

  - Patch 2
    Remove unlock/re-lock when calling device_link_wait_removal()
    Move device_link_wait_removal() call to of_changeset_destroy()
    Update commit log

Changes v2 -> v3
  - Patch 1
    No changes

  - Patch 2
    Add missing device.h

Changes v1 -> v2
  - Patch 1
    Rename the workqueue to 'device_link_wq'
    Add 'Fixes' tag and Cc stable

  - Patch 2
    Add device.h inclusion.
    Call device_link_wait_removal() later in the overlay removal
    sequence (i.e. in free_overlay_changeset() function).
    Drop of_mutex lock while calling device_link_wait_removal().
    Add	'Fixes'	tag and Cc stable

Herve Codina (2):
  driver core: Introduce device_link_wait_removal()
  of: dynamic: Synchronize of_changeset_destroy() with the devlink
    removals

 drivers/base/core.c    | 26 +++++++++++++++++++++++---
 drivers/of/dynamic.c   | 12 ++++++++++++
 include/linux/device.h |  1 +
 3 files changed, 36 insertions(+), 3 deletions(-)

Comments

Rob Herring (Arm) March 25, 2024, 4:44 p.m. UTC | #1
On Mon, Mar 25, 2024 at 04:21:25PM +0100, Herve Codina wrote:
> The commit 80dd33cf72d1 ("drivers: base: Fix device link removal")
> introduces a workqueue to release the consumer and supplier devices used
> in the devlink.
> In the job queued, devices are release and in turn, when all the
> references to these devices are dropped, the release function of the
> device itself is called.
> 
> Nothing is present to provide some synchronisation with this workqueue
> in order to ensure that all ongoing releasing operations are done and
> so, some other operations can be started safely.
> 
> For instance, in the following sequence:
>   1) of_platform_depopulate()
>   2) of_overlay_remove()
> 
> During the step 1, devices are released and related devlinks are removed
> (jobs pushed in the workqueue).
> During the step 2, OF nodes are destroyed but, without any
> synchronisation with devlink removal jobs, of_overlay_remove() can raise
> warnings related to missing of_node_put():
>   ERROR: memory leak, expected refcount 1 instead of 2
> 
> Indeed, the missing of_node_put() call is going to be done, too late,
> from the workqueue job execution.
> 
> Introduce device_link_wait_removal() to offer a way to synchronize
> operations waiting for the end of devlink removals (i.e. end of
> workqueue jobs).
> Also, as a flushing operation is done on the workqueue, the workqueue
> used is moved from a system-wide workqueue to a local one.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Herve Codina <herve.codina@bootlin.com>
> Tested-by: Luca Ceresoli <luca.ceresoli@bootlin.com>
> Reviewed-by: Nuno Sa <nuno.sa@analog.com>
> Reviewed-by: Saravana Kannan <saravanak@google.com>
> ---
>  drivers/base/core.c    | 26 +++++++++++++++++++++++---
>  include/linux/device.h |  1 +
>  2 files changed, 24 insertions(+), 3 deletions(-)

Greg, can you ack and I'll take this series.

Rob
Greg Kroah-Hartman March 25, 2024, 6:37 p.m. UTC | #2
On Mon, Mar 25, 2024 at 04:21:25PM +0100, Herve Codina wrote:
> The commit 80dd33cf72d1 ("drivers: base: Fix device link removal")
> introduces a workqueue to release the consumer and supplier devices used
> in the devlink.
> In the job queued, devices are release and in turn, when all the
> references to these devices are dropped, the release function of the
> device itself is called.
> 
> Nothing is present to provide some synchronisation with this workqueue
> in order to ensure that all ongoing releasing operations are done and
> so, some other operations can be started safely.
> 
> For instance, in the following sequence:
>   1) of_platform_depopulate()
>   2) of_overlay_remove()
> 
> During the step 1, devices are released and related devlinks are removed
> (jobs pushed in the workqueue).
> During the step 2, OF nodes are destroyed but, without any
> synchronisation with devlink removal jobs, of_overlay_remove() can raise
> warnings related to missing of_node_put():
>   ERROR: memory leak, expected refcount 1 instead of 2
> 
> Indeed, the missing of_node_put() call is going to be done, too late,
> from the workqueue job execution.
> 
> Introduce device_link_wait_removal() to offer a way to synchronize
> operations waiting for the end of devlink removals (i.e. end of
> workqueue jobs).
> Also, as a flushing operation is done on the workqueue, the workqueue
> used is moved from a system-wide workqueue to a local one.
> 
> Cc: stable@vger.kernel.org

Why is this for stable?  You are just adding a new api, no one is using
it.

Or if they are, you didn't send me that patch...

greg k-h
Greg Kroah-Hartman March 25, 2024, 6:38 p.m. UTC | #3
On Mon, Mar 25, 2024 at 04:21:25PM +0100, Herve Codina wrote:
> The commit 80dd33cf72d1 ("drivers: base: Fix device link removal")
> introduces a workqueue to release the consumer and supplier devices used
> in the devlink.
> In the job queued, devices are release and in turn, when all the
> references to these devices are dropped, the release function of the
> device itself is called.
> 
> Nothing is present to provide some synchronisation with this workqueue
> in order to ensure that all ongoing releasing operations are done and
> so, some other operations can be started safely.
> 
> For instance, in the following sequence:
>   1) of_platform_depopulate()
>   2) of_overlay_remove()

So this is only an issue for overlays?  Why has no one noticed this in
the years since 80dd33cf72d1 was added?  Why is this an issue now
suddenly?

thanks,

greg k-h
Nuno Sá March 26, 2024, 7:05 a.m. UTC | #4
On Mon, 2024-03-25 at 19:38 +0100, Greg Kroah-Hartman wrote:
> On Mon, Mar 25, 2024 at 04:21:25PM +0100, Herve Codina wrote:
> > The commit 80dd33cf72d1 ("drivers: base: Fix device link removal")
> > introduces a workqueue to release the consumer and supplier devices used
> > in the devlink.
> > In the job queued, devices are release and in turn, when all the
> > references to these devices are dropped, the release function of the
> > device itself is called.
> > 
> > Nothing is present to provide some synchronisation with this workqueue
> > in order to ensure that all ongoing releasing operations are done and
> > so, some other operations can be started safely.
> > 
> > For instance, in the following sequence:
> >   1) of_platform_depopulate()
> >   2) of_overlay_remove()
> 
> So this is only an issue for overlays?  Why has no one noticed this in
> the years since 80dd33cf72d1 was added?  Why is this an issue now
> suddenly?
> 

Not sure either... Note this is only an issue if device links are in place. So the
overlay needs to have nodes creating those links. You need to have regulators, pwm,
eth phy (at least these ones I'm aware they create links) to trigger this. We would
have to dig through git to understand when would this be noticeable. But note this is
very straight to trigger.

May also very well be that most people don't really "play" with overlay removal...
For example, I have been dealing with overlays on rpi's for the last 5 years and only
noticed this last year when we had an usecase that involved overlay removal.

- Nuno Sá
Herve Codina March 26, 2024, 7:56 a.m. UTC | #5
On Mon, 25 Mar 2024 19:37:10 +0100
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:

> On Mon, Mar 25, 2024 at 04:21:25PM +0100, Herve Codina wrote:
> > The commit 80dd33cf72d1 ("drivers: base: Fix device link removal")
> > introduces a workqueue to release the consumer and supplier devices used
> > in the devlink.
> > In the job queued, devices are release and in turn, when all the
> > references to these devices are dropped, the release function of the
> > device itself is called.
> > 
> > Nothing is present to provide some synchronisation with this workqueue
> > in order to ensure that all ongoing releasing operations are done and
> > so, some other operations can be started safely.
> > 
> > For instance, in the following sequence:
> >   1) of_platform_depopulate()
> >   2) of_overlay_remove()
> > 
> > During the step 1, devices are released and related devlinks are removed
> > (jobs pushed in the workqueue).
> > During the step 2, OF nodes are destroyed but, without any
> > synchronisation with devlink removal jobs, of_overlay_remove() can raise
> > warnings related to missing of_node_put():
> >   ERROR: memory leak, expected refcount 1 instead of 2
> > 
> > Indeed, the missing of_node_put() call is going to be done, too late,
> > from the workqueue job execution.
> > 
> > Introduce device_link_wait_removal() to offer a way to synchronize
> > operations waiting for the end of devlink removals (i.e. end of
> > workqueue jobs).
> > Also, as a flushing operation is done on the workqueue, the workqueue
> > used is moved from a system-wide workqueue to a local one.
> > 
> > Cc: stable@vger.kernel.org  
> 
> Why is this for stable?  You are just adding a new api, no one is using
> it.
> 
> Or if they are, you didn't send me that patch...

The patch 2 in this current series uses the new api.

Best regards,
Hervé
Greg Kroah-Hartman March 26, 2024, 8:20 a.m. UTC | #6
On Mon, Mar 25, 2024 at 11:44:01AM -0500, Rob Herring wrote:
> On Mon, Mar 25, 2024 at 04:21:25PM +0100, Herve Codina wrote:
> > The commit 80dd33cf72d1 ("drivers: base: Fix device link removal")
> > introduces a workqueue to release the consumer and supplier devices used
> > in the devlink.
> > In the job queued, devices are release and in turn, when all the
> > references to these devices are dropped, the release function of the
> > device itself is called.
> > 
> > Nothing is present to provide some synchronisation with this workqueue
> > in order to ensure that all ongoing releasing operations are done and
> > so, some other operations can be started safely.
> > 
> > For instance, in the following sequence:
> >   1) of_platform_depopulate()
> >   2) of_overlay_remove()
> > 
> > During the step 1, devices are released and related devlinks are removed
> > (jobs pushed in the workqueue).
> > During the step 2, OF nodes are destroyed but, without any
> > synchronisation with devlink removal jobs, of_overlay_remove() can raise
> > warnings related to missing of_node_put():
> >   ERROR: memory leak, expected refcount 1 instead of 2
> > 
> > Indeed, the missing of_node_put() call is going to be done, too late,
> > from the workqueue job execution.
> > 
> > Introduce device_link_wait_removal() to offer a way to synchronize
> > operations waiting for the end of devlink removals (i.e. end of
> > workqueue jobs).
> > Also, as a flushing operation is done on the workqueue, the workqueue
> > used is moved from a system-wide workqueue to a local one.
> > 
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Herve Codina <herve.codina@bootlin.com>
> > Tested-by: Luca Ceresoli <luca.ceresoli@bootlin.com>
> > Reviewed-by: Nuno Sa <nuno.sa@analog.com>
> > Reviewed-by: Saravana Kannan <saravanak@google.com>
> > ---
> >  drivers/base/core.c    | 26 +++++++++++++++++++++++---
> >  include/linux/device.h |  1 +
> >  2 files changed, 24 insertions(+), 3 deletions(-)
> 
> Greg, can you ack and I'll take this series.

Looks semi-sane:

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Rob Herring (Arm) March 26, 2024, 8:43 p.m. UTC | #7
On Mon, Mar 25, 2024 at 04:21:24PM +0100, Herve Codina wrote:
> Hi,
> 
> In the following sequence:
>   of_platform_depopulate(); /* Remove devices from a DT overlay node */
>   of_overlay_remove(); /* Remove the DT overlay node itself */
> 
> Some warnings are raised by __of_changeset_entry_destroy() which  was
> called from of_overlay_remove():
>   ERROR: memory leak, expected refcount 1 instead of 2 ...
> 
> The issue is that, during the device devlink removals triggered from the
> of_platform_depopulate(), jobs are put in a workqueue.
> These jobs drop the reference to the devices. When a device is no more
> referenced (refcount == 0), it is released and the reference to its
> of_node is dropped by a call to of_node_put().
> These operations are fully correct except that, because of the
> workqueue, they are done asynchronously with respect to function calls.
> 
> In the sequence provided, the jobs are run too late, after the call to
> __of_changeset_entry_destroy() and so a missing of_node_put() call is
> detected by __of_changeset_entry_destroy().
> 
> This series fixes this issue introducing device_link_wait_removal() in
> order to wait for the end of jobs execution (patch 1) and using this
> function to synchronize the overlay removal with the end of jobs
> execution (patch 2).
> 
> Compared to the previous iteration:
>   https://lore.kernel.org/linux-kernel/20240307111036.225007-1-herve.codina@bootlin.com/
> this v6 series:
> - Add Saravana's 'Reviewed-by' tag
> 
> This series handles cases reported by Luca [1] and Nuno [2].
>   [1]: https://lore.kernel.org/all/20231220181627.341e8789@booty/
>   [2]: https://lore.kernel.org/all/20240205-fix-device-links-overlays-v2-2-5344f8c79d57@analog.com/
> 
> Best regards,
> Hervé
> 
> Changes v5 -> v6
>   - Patch 1
>     Add 'Reviewed-by: Saravana Kannan <saravanak@google.com>'
> 
>   - Patch 2
>     No changes
> 
> Changes v4 -> v5
>   - Patch 1
>     Remove the 'Fixes' tag
>     Add 'Tested-by: Luca Ceresoli <luca.ceresoli@bootlin.com>'
>     Add 'Reviewed-by: Nuno Sa <nuno.sa@analog.com>'
> 
>   - Patch 2
>     Update comment as suggested
>     Add 'Reviewed-by: Saravana Kannan <saravanak@google.com>'
>     Add 'Tested-by: Luca Ceresoli <luca.ceresoli@bootlin.com>'
>     Add 'Reviewed-by: Nuno Sa <nuno.sa@analog.com>'
> 
> Changes v3 -> v4
>   - Patch 1
>     Uses flush_workqueue() instead of drain_workqueue().
> 
>   - Patch 2
>     Remove unlock/re-lock when calling device_link_wait_removal()
>     Move device_link_wait_removal() call to of_changeset_destroy()
>     Update commit log
> 
> Changes v2 -> v3
>   - Patch 1
>     No changes
> 
>   - Patch 2
>     Add missing device.h
> 
> Changes v1 -> v2
>   - Patch 1
>     Rename the workqueue to 'device_link_wq'
>     Add 'Fixes' tag and Cc stable
> 
>   - Patch 2
>     Add device.h inclusion.
>     Call device_link_wait_removal() later in the overlay removal
>     sequence (i.e. in free_overlay_changeset() function).
>     Drop of_mutex lock while calling device_link_wait_removal().
>     Add	'Fixes'	tag and Cc stable
> 
> Herve Codina (2):
>   driver core: Introduce device_link_wait_removal()
>   of: dynamic: Synchronize of_changeset_destroy() with the devlink
>     removals
> 
>  drivers/base/core.c    | 26 +++++++++++++++++++++++---
>  drivers/of/dynamic.c   | 12 ++++++++++++
>  include/linux/device.h |  1 +
>  3 files changed, 36 insertions(+), 3 deletions(-)

Applied, thanks!