diff mbox series

powerpc/mobility: Fix node detach/rename problem

Message ID c2fba52a-baac-25fb-c26b-c84b25c3178c@linux.vnet.ibm.com (mailing list archive)
State Rejected
Headers show
Series powerpc/mobility: Fix node detach/rename problem | expand

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success next/apply_patch Successfully applied
snowpatch_ozlabs/checkpatch success Test checkpatch on branch next
snowpatch_ozlabs/build-ppc64le warning Test build-ppc64le on branch next
snowpatch_ozlabs/build-ppc64be warning Test build-ppc64be on branch next
snowpatch_ozlabs/build-ppc64e success Test build-ppc64e on branch next
snowpatch_ozlabs/build-ppc32 success Test build-ppc32 on branch next

Commit Message

Michael Bringmann July 29, 2018, 1:11 p.m. UTC
During LPAR migration, the content of the device tree/sysfs may
be updated including deletion and replacement of nodes in the
tree.  When nodes are added to the internal node structures, they
are appended in FIFO order to a list of nodes maintained by the
OF code APIs.  When nodes are removed from the device tree, they
are marked OF_DETACHED, but not actually deleted from the system
to allow for pointers cached elsewhere in the kernel.  The order
and content of the entries in the list of nodes is not altered,
though.

During LPAR migration some common nodes are deleted and re-added
e.g. "ibm,platform-facilities".  If a node is re-added to the OF
node lists, the of_attach_node function checks to make sure that
the name + ibm,phandle of the to-be-added data is unique.  As the
previous copy of a re-added node is not modified beyond the addition
of a bit flag, the code (1) finds the old copy, (2) prints a WARNING
notice to the console, (3) renames the to-be-added node to avoid
filename collisions within a directory, and (3) adds entries to
the sysfs/kernfs.

This patch fixes the 'migration add' problem by changing the
stored 'phandle' of the OF_DETACHed node to 0 (reserved value for
of_find_node_by_phandle), so that subsequent re-add operations,
such as those during migration, do not find the detached node,
do not observe duplicate names, do not rename them,  and the
extra WARNING notices are removed from the console output.

In addition, it erases the 'name' field of the OF_DETACHED node,
to prevent any future calls to of_find_node_by_name() or
of_find_node_by_path() from matching this node.

Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/dlpar.c |    3 +++
 1 file changed, 3 insertions(+)

Comments

kernel test robot July 29, 2018, 4:31 p.m. UTC | #1
Hi Michael,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on v4.18-rc6 next-20180727]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Michael-Bringmann/powerpc-mobility-Fix-node-detach-rename-problem/20180729-213517
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.2.0 make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

   arch/powerpc/platforms/pseries/dlpar.c: In function 'dlpar_detach_node':
>> arch/powerpc/platforms/pseries/dlpar.c:276:9: error: passing argument 1 of 'memset' discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers]
     memset(dn->name, 0, strlen(dn->name));
            ^~
   In file included from include/linux/string.h:20:0,
                    from arch/powerpc/include/asm/paca.h:19,
                    from arch/powerpc/include/asm/current.h:16,
                    from include/linux/mutex.h:14,
                    from include/linux/notifier.h:14,
                    from arch/powerpc/platforms/pseries/dlpar.c:16:
   arch/powerpc/include/asm/string.h:23:15: note: expected 'void *' but argument is of type 'const char *'
    extern void * memset(void *,int,__kernel_size_t);
                  ^~~~~~
   cc1: all warnings being treated as errors

vim +276 arch/powerpc/platforms/pseries/dlpar.c

   259	
   260	int dlpar_detach_node(struct device_node *dn)
   261	{
   262		struct device_node *child;
   263		int rc;
   264	
   265		child = of_get_next_child(dn, NULL);
   266		while (child) {
   267			dlpar_detach_node(child);
   268			child = of_get_next_child(dn, child);
   269		}
   270	
   271		rc = of_detach_node(dn);
   272		if (rc)
   273			return rc;
   274	
   275		dn->phandle = 0;
 > 276		memset(dn->name, 0, strlen(dn->name));
   277	
   278		return 0;
   279	}
   280	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Michael Ellerman July 30, 2018, 6:31 a.m. UTC | #2
Michael Bringmann <mwb@linux.vnet.ibm.com> writes:

> During LPAR migration, the content of the device tree/sysfs may
> be updated including deletion and replacement of nodes in the
> tree.  When nodes are added to the internal node structures, they
> are appended in FIFO order to a list of nodes maintained by the
> OF code APIs.

That hasn't been true for several years. The data structure is an n-ary
tree. What kernel version are you working on?

> When nodes are removed from the device tree, they
> are marked OF_DETACHED, but not actually deleted from the system
> to allow for pointers cached elsewhere in the kernel.  The order
> and content of the entries in the list of nodes is not altered,
> though.

Something is going wrong if this is actually happening.

When the node is detached it should be *detached* from the tree of all
nodes, so it should not be discoverable other than by having an existing
pointer to it.

That's what __of_detach_node() does:

	parent = np->parent;
	if (WARN_ON(!parent))
		return;

	if (parent->child == np)
		parent->child = np->sibling;
	else {
		struct device_node *prevsib;
		for (prevsib = np->parent->child;
		     prevsib->sibling != np;
		     prevsib = prevsib->sibling)
			;
		prevsib->sibling = np->sibling;
	}

ie. the node must already have a NULL parent, and then it is spliced out
of its parent's child list.


Please give us more info so we can work out what's actually happening.

cheers
Michael Bringmann July 30, 2018, 9:02 p.m. UTC | #3
See below.

On 07/30/2018 01:31 AM, Michael Ellerman wrote:
> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
> 
>> During LPAR migration, the content of the device tree/sysfs may
>> be updated including deletion and replacement of nodes in the
>> tree.  When nodes are added to the internal node structures, they
>> are appended in FIFO order to a list of nodes maintained by the
>> OF code APIs.
> 
> That hasn't been true for several years. The data structure is an n-ary
> tree. What kernel version are you working on?

Sorry for an error in my description.  I oversimplified based on the
name of a search iterator.  Let me try to provide a better explanation
of the problem, here.

This is the problem.  The PPC mobility code receives RTAS requests to
delete nodes with platform-/hardware-specific attributes when restarting
the kernel after a migration.  My example is for migration between a
P8 Alpine and a P8 Brazos.   Nodes to be deleted may include 'ibm,random-v1',
'ibm,compression-v1', 'ibm,platform-facilities', 'ibm,sym-encryption-v1',
or others.

The mobility.c code calls 'of_detach_node' for the nodes and their children.
This makes calls to detach the properties and to try to remove the associated
sysfs/kernfs files.

Then new copies of the same nodes are next provided by the PHYP, local
copies are built, and a pointer to the 'struct device_node' is passed to
of_attach_node.  Before the call to of_attach_node, the phandle is initialized
to 0 when the data structure is alloced.  During the call to of_attach_node,
it calls __of_attach_node which pulls the actual name and phandle from just
created sub-properties named something like 'name' and 'ibm,phandle'.

This is all fine for the first migration.  The problem occurs with the
second and subsequent migrations when the PHYP on the new system wants to
replace the same set of nodes again, referenced with the same names and
phandle values.

> 
>> When nodes are removed from the device tree, they
>> are marked OF_DETACHED, but not actually deleted from the system
>> to allow for pointers cached elsewhere in the kernel.  The order
>> and content of the entries in the list of nodes is not altered,
>> though.
> 
> Something is going wrong if this is actually happening.
> 
> When the node is detached it should be *detached* from the tree of all
> nodes, so it should not be discoverable other than by having an existing
> pointer to it.
On the second and subsequent migrations, the PHYP tells the system
to again delete the nodes 'ibm,platform-facilities', 'ibm,random-v1',
'ibm,compression-v1', 'ibm,sym-encryption-v1'.  It specifies these
nodes by its known set of phandle values -- the same handles used
by the PHYP on the source system are known on the target system.
The mobility.c code calls of_find_node_by_phandle() with these values
and ends up locating the first instance of each node that was added
during the original boot, instead of the second instance of each node
created after the first migration.  The detach during the second
migration fails with errors like,

[ 4565.030704] WARNING: CPU: 3 PID: 4787 at drivers/of/dynamic.c:252 __of_detach_node+0x8/0xa0
[ 4565.030708] Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag lockd grace fscache sunrpc xts vmx_crypto sg pseries_rng binfmt_misc ip_tables xfs libcrc32c sd_mod ibmveth ibmvscsi scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
[ 4565.030733] CPU: 3 PID: 4787 Comm: drmgr Tainted: G        W         4.18.0-rc1-wi107836-v05-120+ #201
[ 4565.030737] NIP:  c0000000007c1ea8 LR: c0000000007c1fb4 CTR: 0000000000655170
[ 4565.030741] REGS: c0000003f302b690 TRAP: 0700   Tainted: G        W          (4.18.0-rc1-wi107836-v05-120+)
[ 4565.030745] MSR:  800000010282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 22288822  XER: 0000000a
[ 4565.030757] CFAR: c0000000007c1fb0 IRQMASK: 1
[ 4565.030757] GPR00: c0000000007c1fa4 c0000003f302b910 c00000000114bf00 c0000003ffff8e68
[ 4565.030757] GPR04: 0000000000000001 ffffffffffffffff 800000c008e0b4b8 ffffffffffffffff
[ 4565.030757] GPR08: 0000000000000000 0000000000000001 0000000080000003 0000000000002843
[ 4565.030757] GPR12: 0000000000008800 c00000001ec9ae00 0000000040000000 0000000000000000
[ 4565.030757] GPR16: 0000000000000000 0000000000000008 0000000000000000 00000000f6ffffff
[ 4565.030757] GPR20: 0000000000000007 0000000000000000 c0000003e9f1f034 0000000000000001
[ 4565.030757] GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 4565.030757] GPR28: c000000001549d28 c000000001134828 c0000003ffff8e68 c0000003f302b930
[ 4565.030804] NIP [c0000000007c1ea8] __of_detach_node+0x8/0xa0
[ 4565.030808] LR [c0000000007c1fb4] of_detach_node+0x74/0xd0
[ 4565.030811] Call Trace:
[ 4565.030815] [c0000003f302b910] [c0000000007c1fa4] of_detach_node+0x64/0xd0 (unreliable)
[ 4565.030821] [c0000003f302b980] [c0000000000c33c4] dlpar_detach_node+0xb4/0x150
[ 4565.030826] [c0000003f302ba10] [c0000000000c3ffc] delete_dt_node+0x3c/0x80
[ 4565.030831] [c0000003f302ba40] [c0000000000c4380] pseries_devicetree_update+0x150/0x4f0
[ 4565.030836] [c0000003f302bb70] [c0000000000c479c] post_mobility_fixup+0x7c/0xf0
[ 4565.030841] [c0000003f302bbe0] [c0000000000c4908] migration_store+0xf8/0x130
[ 4565.030847] [c0000003f302bc70] [c000000000998160] kobj_attr_store+0x30/0x60
[ 4565.030852] [c0000003f302bc90] [c000000000412f14] sysfs_kf_write+0x64/0xa0
[ 4565.030857] [c0000003f302bcb0] [c000000000411cac] kernfs_fop_write+0x16c/0x240
[ 4565.030862] [c0000003f302bd00] [c000000000355f20] __vfs_write+0x40/0x220
[ 4565.030867] [c0000003f302bd90] [c000000000356358] vfs_write+0xc8/0x240
[ 4565.030872] [c0000003f302bde0] [c0000000003566cc] ksys_write+0x5c/0x100
[ 4565.030880] [c0000003f302be30] [c00000000000b288] system_call+0x5c/0x70
[ 4565.030884] Instruction dump:
[ 4565.030887] 38210070 38600000 e8010010 eb61ffd8 eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8
[ 4565.030895] 7c0803a6 4e800020 e9230098 7929f7e2 <0b090000> 2f890000 4cde0020 e9030040
[ 4565.030903] ---[ end trace 5bd54cb1df9d2976 ]---

The mobility.c code continues on during the second migration, accepts the
definitions of the new nodes from the PHYP and ends up renaming the new
properties e.g.

[ 4565.827296] Duplicate name in base, renamed to "ibm,platform-facilities#1"

I don't see any check like 'of_node_check_flag(np, OF_DETACHED)' within
of_find_node_by_phandle to skip nodes that are detached, but still present
due to caching or use count considerations.  Another possibility to consider
is that of_find_node_by_phandle also uses something called 'phandle_cache'
which may have outdated data as of_detach_node() does not have access to
that cache for the 'OF_DETACHED' nodes.

>
> That's what __of_detach_node() does:
> 
> 	parent = np->parent;
> 	if (WARN_ON(!parent))
> 		return;
> 
> 	if (parent->child == np)
> 		parent->child = np->sibling;
> 	else {
> 		struct device_node *prevsib;
> 		for (prevsib = np->parent->child;
> 		     prevsib->sibling != np;
> 		     prevsib = prevsib->sibling)
> 			;
> 		prevsib->sibling = np->sibling;
> 	}
> 
> ie. the node must already have a NULL parent, and then it is spliced out
> of its parent's child list.
> 
> 
> Please give us more info so we can work out what's actually happening.

I duplicated the problem in the 4.18 raw kernel on the second and subsequent
legs of migration of an LPAR back and forth between an Alpine to a Brazos.
Again, entries like  'ibm,random-v1', 'ibm,compression-v1',
'ibm,platform-facilities', 'ibm,sym-encryption-v1', yield errors as described
above.

My patch was attempting to resolve the issue wholly within the powerpc code,
at this time.

> 
> cheers
> 

Michael
Tyrel Datwyler July 31, 2018, 12:59 a.m. UTC | #4
On 07/29/2018 06:11 AM, Michael Bringmann wrote:
> During LPAR migration, the content of the device tree/sysfs may
> be updated including deletion and replacement of nodes in the
> tree.  When nodes are added to the internal node structures, they
> are appended in FIFO order to a list of nodes maintained by the
> OF code APIs.  When nodes are removed from the device tree, they
> are marked OF_DETACHED, but not actually deleted from the system
> to allow for pointers cached elsewhere in the kernel.  The order
> and content of the entries in the list of nodes is not altered,
> though.
> 
> During LPAR migration some common nodes are deleted and re-added
> e.g. "ibm,platform-facilities".  If a node is re-added to the OF
> node lists, the of_attach_node function checks to make sure that
> the name + ibm,phandle of the to-be-added data is unique.  As the
> previous copy of a re-added node is not modified beyond the addition
> of a bit flag, the code (1) finds the old copy, (2) prints a WARNING
> notice to the console, (3) renames the to-be-added node to avoid
> filename collisions within a directory, and (3) adds entries to
> the sysfs/kernfs.

So, this patch actually just band aids over the real problem. This is a long standing problem with several PFO drivers leaking references. The issue here is that, during the device tree update that follows a migration. the update of the ibm,platform-facilities node and friends below are always deleted and re-added on the destination lpar and subsequently the leaked references prevent the devices nodes from every actually being properly cleaned up after detach. Thus, leading to the issue you are observing.

As and example a quick look at nx-842-pseries.c reveals several issues.

static int __init nx842_pseries_init(void)
{
        struct nx842_devdata *new_devdata;
        int ret;

        if (!of_find_compatible_node(NULL, NULL, "ibm,compression"))
                return -ENODEV;

This call to of_find_compatible_node() results in a node returned with refcount incremented and therefore immediately leaked.

Further, the reconfig notifier logic makes the assumption that it only needs to deal with node updates, but as I outlined above the post migration device tree update always results in PFO nodes and properties being deleted and re-added.

/**
 * nx842_OF_notifier - Process updates to OF properties for the device
 *
 * @np: notifier block
 * @action: notifier action
 * @update: struct pSeries_reconfig_prop_update pointer if action is
 *      PSERIES_UPDATE_PROPERTY
 *
 * Returns:
 *      NOTIFY_OK on success
 *      NOTIFY_BAD encoded with error number on failure, use
 *              notifier_to_errno() to decode this value
 */
static int nx842_OF_notifier(struct notifier_block *np, unsigned long action,
                             void *data)
{
        struct of_reconfig_data *upd = data;
        struct nx842_devdata *local_devdata;
        struct device_node *node = NULL;

        rcu_read_lock();
        local_devdata = rcu_dereference(devdata);
        if (local_devdata)
                node = local_devdata->dev->of_node;

        if (local_devdata &&
                        action == OF_RECONFIG_UPDATE_PROPERTY &&
                        !strcmp(upd->dn->name, node->name)) {
                rcu_read_unlock();
                nx842_OF_upd(upd->prop);
        } else
                rcu_read_unlock();

        return NOTIFY_OK;
}

I expect to find the same problems in pseries-rng.c and nx.c.

-Tyrel

> 
> This patch fixes the 'migration add' problem by changing the
> stored 'phandle' of the OF_DETACHed node to 0 (reserved value for
> of_find_node_by_phandle), so that subsequent re-add operations,
> such as those during migration, do not find the detached node,
> do not observe duplicate names, do not rename them,  and the
> extra WARNING notices are removed from the console output.
> 
> In addition, it erases the 'name' field of the OF_DETACHED node,
> to prevent any future calls to of_find_node_by_name() or
> of_find_node_by_path() from matching this node.
> 
> Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
> ---
>  arch/powerpc/platforms/pseries/dlpar.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
> index 2de0f0d..9d82c28 100644
> --- a/arch/powerpc/platforms/pseries/dlpar.c
> +++ b/arch/powerpc/platforms/pseries/dlpar.c
> @@ -274,6 +274,9 @@ int dlpar_detach_node(struct device_node *dn)
>  	if (rc)
>  		return rc;
> 
> +	dn->phandle = 0;
> +	memset(dn->name, 0, strlen(dn->name));
> +
>  	return 0;
>  }
> 
>
Tyrel Datwyler July 31, 2018, 1:46 a.m. UTC | #5
On 07/30/2018 05:59 PM, Tyrel Datwyler wrote:
> On 07/29/2018 06:11 AM, Michael Bringmann wrote:
>> During LPAR migration, the content of the device tree/sysfs may
>> be updated including deletion and replacement of nodes in the
>> tree.  When nodes are added to the internal node structures, they
>> are appended in FIFO order to a list of nodes maintained by the
>> OF code APIs.  When nodes are removed from the device tree, they
>> are marked OF_DETACHED, but not actually deleted from the system
>> to allow for pointers cached elsewhere in the kernel.  The order
>> and content of the entries in the list of nodes is not altered,
>> though.
>>
>> During LPAR migration some common nodes are deleted and re-added
>> e.g. "ibm,platform-facilities".  If a node is re-added to the OF
>> node lists, the of_attach_node function checks to make sure that
>> the name + ibm,phandle of the to-be-added data is unique.  As the
>> previous copy of a re-added node is not modified beyond the addition
>> of a bit flag, the code (1) finds the old copy, (2) prints a WARNING
>> notice to the console, (3) renames the to-be-added node to avoid
>> filename collisions within a directory, and (3) adds entries to
>> the sysfs/kernfs.
> 
> So, this patch actually just band aids over the real problem. This is a long standing problem with several PFO drivers leaking references. The issue here is that, during the device tree update that follows a migration. the update of the ibm,platform-facilities node and friends below are always deleted and re-added on the destination lpar and subsequently the leaked references prevent the devices nodes from every actually being properly cleaned up after detach. Thus, leading to the issue you are observing.
> 
> As and example a quick look at nx-842-pseries.c reveals several issues.
> 
> static int __init nx842_pseries_init(void)
> {
>         struct nx842_devdata *new_devdata;
>         int ret;
> 
>         if (!of_find_compatible_node(NULL, NULL, "ibm,compression"))
>                 return -ENODEV;
> 
> This call to of_find_compatible_node() results in a node returned with refcount incremented and therefore immediately leaked.
> 
> Further, the reconfig notifier logic makes the assumption that it only needs to deal with node updates, but as I outlined above the post migration device tree update always results in PFO nodes and properties being deleted and re-added.
> 
> /**
>  * nx842_OF_notifier - Process updates to OF properties for the device
>  *
>  * @np: notifier block
>  * @action: notifier action
>  * @update: struct pSeries_reconfig_prop_update pointer if action is
>  *      PSERIES_UPDATE_PROPERTY
>  *
>  * Returns:
>  *      NOTIFY_OK on success
>  *      NOTIFY_BAD encoded with error number on failure, use
>  *              notifier_to_errno() to decode this value
>  */
> static int nx842_OF_notifier(struct notifier_block *np, unsigned long action,
>                              void *data)
> {
>         struct of_reconfig_data *upd = data;
>         struct nx842_devdata *local_devdata;
>         struct device_node *node = NULL;
> 
>         rcu_read_lock();
>         local_devdata = rcu_dereference(devdata);
>         if (local_devdata)
>                 node = local_devdata->dev->of_node;
> 
>         if (local_devdata &&
>                         action == OF_RECONFIG_UPDATE_PROPERTY &&
>                         !strcmp(upd->dn->name, node->name)) {
>                 rcu_read_unlock();
>                 nx842_OF_upd(upd->prop);
>         } else
>                 rcu_read_unlock();
> 
>         return NOTIFY_OK;
> }
> 
> I expect to find the same problems in pseries-rng.c and nx.c.

So, in actuality the main root of the problem is really in the vio core. A node reference for each PFO device under "ibm,platform-facilities" is taken during vio_register_device_node(). We need a reconfig notifier to call vio_unregister_device() for each PFO device on detach, and vio_bus_scan_register_devices("ibm,platform-facilities") on attach. This will make sure the PFO vio devices are released such that vio_dev_release() gets called to put the node reference that was taken at original registration time.

/* vio_dev refcount hit 0 */
static void vio_dev_release(struct device *dev)
{
        struct iommu_table *tbl = get_iommu_table_base(dev);

        if (tbl)
                iommu_tce_table_put(tbl);
        of_node_put(dev->of_node);
        kfree(to_vio_dev(dev));
}

-Tyrel

> 
> -Tyrel
> 
>>
>> This patch fixes the 'migration add' problem by changing the
>> stored 'phandle' of the OF_DETACHed node to 0 (reserved value for
>> of_find_node_by_phandle), so that subsequent re-add operations,
>> such as those during migration, do not find the detached node,
>> do not observe duplicate names, do not rename them,  and the
>> extra WARNING notices are removed from the console output.
>>
>> In addition, it erases the 'name' field of the OF_DETACHED node,
>> to prevent any future calls to of_find_node_by_name() or
>> of_find_node_by_path() from matching this node.
>>
>> Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/platforms/pseries/dlpar.c |    3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
>> index 2de0f0d..9d82c28 100644
>> --- a/arch/powerpc/platforms/pseries/dlpar.c
>> +++ b/arch/powerpc/platforms/pseries/dlpar.c
>> @@ -274,6 +274,9 @@ int dlpar_detach_node(struct device_node *dn)
>>  	if (rc)
>>  		return rc;
>>
>> +	dn->phandle = 0;
>> +	memset(dn->name, 0, strlen(dn->name));
>> +
>>  	return 0;
>>  }
>>
>>
>
Michael Ellerman July 31, 2018, 6:42 a.m. UTC | #6
Tyrel Datwyler <tyreld@linux.vnet.ibm.com> writes:
> On 07/29/2018 06:11 AM, Michael Bringmann wrote:
>> During LPAR migration, the content of the device tree/sysfs may
>> be updated including deletion and replacement of nodes in the
>> tree.  When nodes are added to the internal node structures, they
>> are appended in FIFO order to a list of nodes maintained by the
>> OF code APIs.  When nodes are removed from the device tree, they
>> are marked OF_DETACHED, but not actually deleted from the system
>> to allow for pointers cached elsewhere in the kernel.  The order
>> and content of the entries in the list of nodes is not altered,
>> though.
>> 
>> During LPAR migration some common nodes are deleted and re-added
>> e.g. "ibm,platform-facilities".  If a node is re-added to the OF
>> node lists, the of_attach_node function checks to make sure that
>> the name + ibm,phandle of the to-be-added data is unique.  As the
>> previous copy of a re-added node is not modified beyond the addition
>> of a bit flag, the code (1) finds the old copy, (2) prints a WARNING
>> notice to the console, (3) renames the to-be-added node to avoid
>> filename collisions within a directory, and (3) adds entries to
>> the sysfs/kernfs.
>
> So, this patch actually just band aids over the real problem. This is
> a long standing problem with several PFO drivers leaking references.
> The issue here is that, during the device tree update that follows a
> migration. the update of the ibm,platform-facilities node and friends
> below are always deleted and re-added on the destination lpar and
> subsequently the leaked references prevent the devices nodes from
> every actually being properly cleaned up after detach. Thus, leading
> to the issue you are observing.

Leaking references shouldn't affect the node being detached from the
tree though.

See of_detach_node() calling __of_detach_node(), none of that depends on
the refcount.

It's only the actual freeing of the node, in of_node_release() that is
prevented by leaked reference counts.

So I agree we need to do a better job with the reference counting, but I
don't see how it is causing the problem here.

cheers
Tyrel Datwyler July 31, 2018, 7:50 p.m. UTC | #7
On 07/30/2018 11:42 PM, Michael Ellerman wrote:
> Tyrel Datwyler <tyreld@linux.vnet.ibm.com> writes:
>> On 07/29/2018 06:11 AM, Michael Bringmann wrote:
>>> During LPAR migration, the content of the device tree/sysfs may
>>> be updated including deletion and replacement of nodes in the
>>> tree.  When nodes are added to the internal node structures, they
>>> are appended in FIFO order to a list of nodes maintained by the
>>> OF code APIs.  When nodes are removed from the device tree, they
>>> are marked OF_DETACHED, but not actually deleted from the system
>>> to allow for pointers cached elsewhere in the kernel.  The order
>>> and content of the entries in the list of nodes is not altered,
>>> though.
>>>
>>> During LPAR migration some common nodes are deleted and re-added
>>> e.g. "ibm,platform-facilities".  If a node is re-added to the OF
>>> node lists, the of_attach_node function checks to make sure that
>>> the name + ibm,phandle of the to-be-added data is unique.  As the
>>> previous copy of a re-added node is not modified beyond the addition
>>> of a bit flag, the code (1) finds the old copy, (2) prints a WARNING
>>> notice to the console, (3) renames the to-be-added node to avoid
>>> filename collisions within a directory, and (3) adds entries to
>>> the sysfs/kernfs.
>>
>> So, this patch actually just band aids over the real problem. This is
>> a long standing problem with several PFO drivers leaking references.
>> The issue here is that, during the device tree update that follows a
>> migration. the update of the ibm,platform-facilities node and friends
>> below are always deleted and re-added on the destination lpar and
>> subsequently the leaked references prevent the devices nodes from
>> every actually being properly cleaned up after detach. Thus, leading
>> to the issue you are observing.

So, it was the end of the day, and I kind of glossed over the issue Michael was trying to address with an issue that I remembered that had been around for awhile.

> 
> Leaking references shouldn't affect the node being detached from the
> tree though.

No, but it does prevent it from being freed from sysfs which leads to the sysfs entry renaming that happens when another node with the same name is attached.

> 
> See of_detach_node() calling __of_detach_node(), none of that depends on
> the refcount.
> 
> It's only the actual freeing of the node, in of_node_release() that is
> prevented by leaked reference counts.

Right, but if we did refcount correctly there is the new problem that the node is freed and now the phandle cache points at freed memory in the case were no invalidation is done.

> 
> So I agree we need to do a better job with the reference counting, but I
> don't see how it is causing the problem here

Now in regards to the phandle caching somehow I missed that code going into OF earlier this year. That should have had at least some discussion from our side based on the fact that it is built on dtc compiler assumption that there are a set number of phandles that are allocated starting at 1..n such that they are monotonically increasing. That has a nice fixed size with O(1) lookup time. Phyp doesn't guarantee any sort of niceness with nicely ordered phandles. From what I've seen there are a subset of phandles that decrease from (-1) monotonically, and then there are a bunch of random values for cpus and IOAs. Thinking on it might not be that big of a deal as we just end up in the case where we have a phandle collision one makes it into the cache and is optimized while the other doesn't. On another note, they clearly hit a similar issue during overlay attach and remove, and as Rob pointed out their solution to handle it is a full cache invalidation followed by rescanning the whole tree to rebuild it. Seeing as our dynamic lifecycle is node by node this definitely adds some overhead.

-Tyrel

> 
> cheers
>
Michael Ellerman Aug. 1, 2018, 2:35 p.m. UTC | #8
Tyrel Datwyler <tyreld@linux.vnet.ibm.com> writes:
> On 07/30/2018 11:42 PM, Michael Ellerman wrote:
>> Tyrel Datwyler <tyreld@linux.vnet.ibm.com> writes:
>>> On 07/29/2018 06:11 AM, Michael Bringmann wrote:
>>>> During LPAR migration, the content of the device tree/sysfs may
>>>> be updated including deletion and replacement of nodes in the
>>>> tree.  When nodes are added to the internal node structures, they
>>>> are appended in FIFO order to a list of nodes maintained by the
>>>> OF code APIs.  When nodes are removed from the device tree, they
>>>> are marked OF_DETACHED, but not actually deleted from the system
>>>> to allow for pointers cached elsewhere in the kernel.  The order
>>>> and content of the entries in the list of nodes is not altered,
>>>> though.
>>>>
>>>> During LPAR migration some common nodes are deleted and re-added
>>>> e.g. "ibm,platform-facilities".  If a node is re-added to the OF
>>>> node lists, the of_attach_node function checks to make sure that
>>>> the name + ibm,phandle of the to-be-added data is unique.  As the
>>>> previous copy of a re-added node is not modified beyond the addition
>>>> of a bit flag, the code (1) finds the old copy, (2) prints a WARNING
>>>> notice to the console, (3) renames the to-be-added node to avoid
>>>> filename collisions within a directory, and (3) adds entries to
>>>> the sysfs/kernfs.
>>>
>>> So, this patch actually just band aids over the real problem. This is
>>> a long standing problem with several PFO drivers leaking references.
>>> The issue here is that, during the device tree update that follows a
>>> migration. the update of the ibm,platform-facilities node and friends
>>> below are always deleted and re-added on the destination lpar and
>>> subsequently the leaked references prevent the devices nodes from
>>> every actually being properly cleaned up after detach. Thus, leading
>>> to the issue you are observing.
>
> So, it was the end of the day, and I kind of glossed over the issue
> Michael was trying to address with an issue that I remembered that had
> been around for awhile.

Sure, I know that feeling :)

>> Leaking references shouldn't affect the node being detached from the
>> tree though.
>
> No, but it does prevent it from being freed from sysfs which leads to
> the sysfs entry renaming that happens when another node with the same
> name is attached.

OK.

>> See of_detach_node() calling __of_detach_node(), none of that depends on
>> the refcount.
>> 
>> It's only the actual freeing of the node, in of_node_release() that is
>> prevented by leaked reference counts.
>
> Right, but if we did refcount correctly there is the new problem that
> the node is freed and now the phandle cache points at freed memory in
> the case were no invalidation is done.

Yeah that's bad.

I guess no one is supposed to lookup that phandle anymore, but it's
super fragile.

>> So I agree we need to do a better job with the reference counting, but I
>> don't see how it is causing the problem here
>
> Now in regards to the phandle caching somehow I missed that code going
> into OF earlier this year. That should have had at least some
> discussion from our side based on the fact that it is built on dtc
> compiler assumption that there are a set number of phandles that are
> allocated starting at 1..n such that they are monotonically
> increasing. That has a nice fixed size with O(1) lookup time. Phyp
> doesn't guarantee any sort of niceness with nicely ordered phandles.
> From what I've seen there are a subset of phandles that decrease from
> (-1) monotonically, and then there are a bunch of random values for
> cpus and IOAs. Thinking on it might not be that big of a deal as we
> just end up in the case where we have a phandle collision one makes it
> into the cache and is optimized while the other doesn't. On another
> note, they clearly hit a similar issue during overlay attach and
> remove, and as Rob pointed out their solution to handle it is a full
> cache invalidation followed by rescanning the whole tree to rebuild
> it. Seeing as our dynamic lifecycle is node by node this definitely
> adds some overhead.

Yeah I also should have noticed it going in, but despite being
subscribed to the devicetree list I didn't spot it in the flood.

AIUI the 1..n assumption is not about correctness but if our phandles
violate that we may not be getting any benefit.

Anyway yeah lots of things to look at tomorrow :)

cheers
diff mbox series

Patch

diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
index 2de0f0d..9d82c28 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -274,6 +274,9 @@  int dlpar_detach_node(struct device_node *dn)
 	if (rc)
 		return rc;
 
+	dn->phandle = 0;
+	memset(dn->name, 0, strlen(dn->name));
+
 	return 0;
 }