
of: allocate / free phandle cache outside of the devtree_lock

Message ID 20191111172142.ozczh7j2gmzi7o5k@linutronix.de
State Changes Requested, archived
Series of: allocate / free phandle cache outside of the devtree_lock

Checks

Context Check Description
robh/checkpatch warning "total: 1 errors, 0 warnings, 72 lines checked"

Commit Message

Sebastian Andrzej Siewior Nov. 11, 2019, 5:21 p.m. UTC
The phandle cache code allocates memory while holding devtree_lock which
is a raw_spinlock_t. Memory allocation (and free()) is not possible on
RT while a raw_spinlock_t is held.
Invoke the kfree() and kcalloc() while the lock is dropped.

Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: devicetree@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---

This is a repost of:
	https://lore.kernel.org/linux-devicetree/20180910154227.xsbbqvw3cayro4gg@linutronix.de/

I mentioned this patch (briefly) to Frank; let me summarize:

of_populate_phandle_cache() triggers a warning during boot on arm64 with
RT enabled. By moving memory allocation/free outside of the locked
section (which really disables interrupts on -RT) everything is fine
again.

The lock has been made a raw_spinlock_t in RT as part of the pSeries bring-up.
It then made its way upstream as:
   28d0e36bf9686 ("OF: Fixup resursive locking code paths")
   d6d3c4e656513 ("OF: convert devtree lock from rw_lock to raw spinlock")

I've been looking into making devtree_lock a spinlock_t which would
avoid this patch. I haven't seen an issue during boot on arm64 even
with hotplug.
However Power64/pSeries complained during boot:

| smp: Bringing up secondary CPUs ...
| BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:973
| in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
| 1 lock held by swapper/1/0:
|  #0: c000000000def6e0 (devtree_lock){+.+.}, at: of_find_node_opts_by_path+0x1f4/0x230
| Preemption disabled at:
| [<c0000000000557a0>] start_secondary+0xd0/0x6a0
|
| Call Trace:
| [c0000001f9667d10] [c000000000158e30] ___might_sleep+0x250/0x270
| [c0000001f9667da0] [c000000000984f40] rt_spin_lock+0x70/0x90
| [c0000001f9667de0] [c0000000007e3634] of_find_node_opts_by_path+0x1f4/0x230
| [c0000001f9667e30] [c0000000007e3844] of_get_next_cpu_node+0x144/0x180
| [c0000001f9667e70] [c0000000007e38d8] of_get_cpu_node+0x58/0x90
| [c0000001f9667eb0] [c00000000002eb00] cpu_to_chip_id+0x20/0x70
| [c0000001f9667ee0] [c000000000055858] start_secondary+0x188/0x6a0
| [c0000001f9667f90] [c00000000000b554] start_secondary_prolog+0x10/0x14

because cpu_to_chip_id() acquires devtree_lock early in the CPU bring-up
path.

 drivers/of/base.c |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

Comments

Rob Herring Nov. 12, 2019, 3:35 a.m. UTC | #1
On Mon, Nov 11, 2019 at 11:21 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> The phandle cache code allocates memory while holding devtree_lock which
> is a raw_spinlock_t. Memory allocation (and free()) is not possible on
> RT while a raw_spinlock_t is held.
> Invoke the kfree() and kcalloc() while the lock is dropped.
>
> Cc: Rob Herring <robh+dt@kernel.org>
> Cc: Frank Rowand <frowand.list@gmail.com>
> Cc: devicetree@vger.kernel.org
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>
> This is a repost of:
>         https://lore.kernel.org/linux-devicetree/20180910154227.xsbbqvw3cayro4gg@linutronix.de/
>
> I mentioned this patch (briefly) to Frank, let me summarize:
>
> of_populate_phandle_cache() triggers a warning during boot on arm64 with
> RT enabled. By moving memory allocation/free outside of the locked
> section (which really disables interrupts on -RT) everything is fine
> again.
>
> The lock has been made a raw_spinlock_t in RT as part pSeries bring up.
> It then made its way upstream as:
>    28d0e36bf9686 ("OF: Fixup resursive locking code paths")
>    d6d3c4e656513 ("OF: convert devtree lock from rw_lock to raw spinlock")

So to summarize, we changed mainline to fix RT which then broke RT. :)


> I've been looking into making devtree_lock a spinlock_t which would
> avoid this patch. I haven't seen an issue during boot on arm64 even
> with hotplug.

Did you look into using RCU reader locks any more?

Rob
Sebastian Andrzej Siewior Nov. 12, 2019, 9:10 a.m. UTC | #2
On 2019-11-11 21:35:35 [-0600], Rob Herring wrote:
> >    28d0e36bf9686 ("OF: Fixup resursive locking code paths")
> >    d6d3c4e656513 ("OF: convert devtree lock from rw_lock to raw spinlock")
> 
> So to summarize, we changed mainline to fix RT which then broke RT. :)

correct, but we were good until v4.17-rc1 :)

> > I've been looking into making devtree_lock a spinlock_t which would
> > avoid this patch. I haven't seen an issue during boot on arm64 even
> > with hotplug.
> 
> Did you look into using RCU reader locks any more?

A little bit. The writers, which modify the node, would need to replace
the whole node. So this is where things got a little complicated.
Frank wasn't a big fan of it back then and he still wasn't a few weeks
back.
If you two agree to prefer RCU over this patch then I would look more
into adding RCU to the lookup path. The argument was that this isn't
time-critical. I'm just trying to avoid replacing the locking for
nothing.
So, should I come up with an RCU patch?

> Rob

Sebastian
Rob Herring Nov. 12, 2019, 3:55 p.m. UTC | #3
On Tue, Nov 12, 2019 at 3:10 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2019-11-11 21:35:35 [-0600], Rob Herring wrote:
> > >    28d0e36bf9686 ("OF: Fixup resursive locking code paths")
> > >    d6d3c4e656513 ("OF: convert devtree lock from rw_lock to raw spinlock")
> >
> > So to summarize, we changed mainline to fix RT which then broke RT. :)
>
> correct, but we were good until v4.17-rc1 :)
>
> > > I've been looking into making devtree_lock a spinlock_t which would
> > > avoid this patch. I haven't seen an issue during boot on arm64 even
> > > with hotplug.
> >
> > Did you look into using RCU reader locks any more?
>
> A little bit. The writers, which modify the node, would need to replace
> the whole node. So this is where things got a little complicated.

Why is that exactly? That's pretty much how node and property updates
work anyway, though maybe not strictly enforced. The other updates are
atomic flags and ref counts which I assume are fine. The spinlock
protects traversing the tree of nodes and the list of properties.
Traversing and updates would need to follow similar semantics as
rculist, right? BTW, there have been some thoughts about moving node and
property lists to list_head. We may want to do that rather than trying
to roll our own RCU code.
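
To make that concrete, the reader path could look roughly like the sketch
below. It assumes struct device_node gained a properties list_head and
struct property a list member with RCU-managed lifetime -- none of which
exists today, so take it as an illustration only:

#include <linux/rcupdate.h>
#include <linux/rculist.h>

static struct property *__of_find_property_rcu(const struct device_node *np,
					       const char *name)
{
	struct property *pp;

	/* readers: no devtree_lock, only an RCU read-side section */
	rcu_read_lock();
	list_for_each_entry_rcu(pp, &np->properties, list) {
		if (of_prop_cmp(pp->name, name) == 0) {
			rcu_read_unlock();
			/* pp stays valid because the caller holds a
			 * reference on np */
			return pp;
		}
	}
	rcu_read_unlock();
	return NULL;
}

/* Writers would still serialize against each other (of_mutex) and would
 * publish/retire properties with list_add_rcu()/list_del_rcu(), freeing
 * them only after a grace period (kfree_rcu()) -- the usual rculist rules. */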

> Frank wasn't a big fan of it back then and he still wasn't a few weeks
> back.

I guess you spoke at ELCE. RCU seems like the right locking to me as
we're almost entirely reads and I haven't seen another proposal.

> If you two agree to prefer RCU over this patch then I would look more
> into adding RCU into the lookup path. The argument was that this isn't
> time critical. I'm just trying to avoid to replace the locking for
> nothing.
> So, should I come up with a RCU patch?

Let's see what Frank says first.

While this patch is a bit of a band-aid, I don't think it complicates
the situation in a way that prevents coming up with a better solution
later. The other option is to get rid of the memory allocation
altogether. My preference for the cache was a simpler solution that was
truly a cache (i.e. a fixed size that could miss). The performance
wasn't quite as good, though.
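
Just to illustrate what I mean by a fixed-size cache that can miss
(size and helper names are made up here, and refcounting of the cached
entries is left out for brevity):

#define PHANDLE_CACHE_SZ	128	/* fixed, power of two */

static struct device_node *phandle_cache[PHANDLE_CACHE_SZ];

static u32 phandle_cache_idx(phandle handle)
{
	return handle & (PHANDLE_CACHE_SZ - 1);
}

struct device_node *of_find_node_by_phandle(phandle handle)
{
	struct device_node *np = NULL;
	unsigned long flags;

	if (!handle)
		return NULL;

	raw_spin_lock_irqsave(&devtree_lock, flags);
	np = phandle_cache[phandle_cache_idx(handle)];
	if (!np || np->phandle != handle) {
		/* miss: fall back to the full walk, remember the result */
		for_each_of_allnodes(np)
			if (np->phandle == handle)
				break;
		if (np)
			phandle_cache[phandle_cache_idx(handle)] = np;
	}
	of_node_get(np);
	raw_spin_unlock_irqrestore(&devtree_lock, flags);
	return np;
}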

Rob
Frank Rowand Nov. 12, 2019, 10:48 p.m. UTC | #4
Hi Sebastian,

On 11/11/19 11:21 AM, Sebastian Andrzej Siewior wrote:
> The phandle cache code allocates memory while holding devtree_lock which
> is a raw_spinlock_t. Memory allocation (and free()) is not possible on
> RT while a raw_spinlock_t is held.
> Invoke the kfree() and kcalloc() while the lock is dropped.

I thought the GFP_ATOMIC passed to kcalloc() in of_populate_phandle_cache()
was sufficient.  And I didn't realize (or remember) that kfree was
not allowed while a raw_spinlock_t is held.  Do you have a
pointer to the preempt RT documentation that explains that?
I'd like to add that pointer to my personal notes about locking so
that I won't mis-remember this too often.


> Cc: Rob Herring <robh+dt@kernel.org>
> Cc: Frank Rowand <frowand.list@gmail.com>
> Cc: devicetree@vger.kernel.org
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> 
> This is a repost of:
> 	https://lore.kernel.org/linux-devicetree/20180910154227.xsbbqvw3cayro4gg@linutronix.de/
> 
> I mentioned this patch (briefly) to Frank, let me summarize:
> 
> of_populate_phandle_cache() triggers a warning during boot on arm64 with
> RT enabled. By moving memory allocation/free outside of the locked
> section (which really disables interrupts on -RT) everything is fine
> again.
> 
> The lock has been made a raw_spinlock_t in RT as part pSeries bring up.
> It then made its way upstream as:
>    28d0e36bf9686 ("OF: Fixup resursive locking code paths")
>    d6d3c4e656513 ("OF: convert devtree lock from rw_lock to raw spinlock")
> 
> I've been looking into making devtree_lock a spinlock_t which would
> avoid this patch. I haven't seen an issue during boot on arm64 even
> with hotplug.
> However Power64/pSeries complained during boot:
> 
> | smp: Bringing up secondary CPUs ...
> | BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:973
> | in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
> | 1 lock held by swapper/1/0:
> |  #0: c000000000def6e0 (devtree_lock){+.+.}, at: of_find_node_opts_by_path+0x1f4/0x230
> | Preemption disabled at:
> | [<c0000000000557a0>] start_secondary+0xd0/0x6a0
> |
> | Call Trace:
> | [c0000001f9667d10] [c000000000158e30] ___might_sleep+0x250/0x270
> | [c0000001f9667da0] [c000000000984f40] rt_spin_lock+0x70/0x90
> | [c0000001f9667de0] [c0000000007e3634] of_find_node_opts_by_path+0x1f4/0x230
> | [c0000001f9667e30] [c0000000007e3844] of_get_next_cpu_node+0x144/0x180
> | [c0000001f9667e70] [c0000000007e38d8] of_get_cpu_node+0x58/0x90
> | [c0000001f9667eb0] [c00000000002eb00] cpu_to_chip_id+0x20/0x70
> | [c0000001f9667ee0] [c000000000055858] start_secondary+0x188/0x6a0
> | [c0000001f9667f90] [c00000000000b554] start_secondary_prolog+0x10/0x14
> 
> because cpu_to_chip_id() acquires devtree_lock() early in the CPU-bring
> up path.

I read too much into that sentence, and ran off on a tangent re-educating
myself on preempt RT lock stuff.

The issue in this path is that start_secondary() disables preemption before
going down the code path that ends up with an attempt by of_find_node_opts_by_path()
to lock devtree_lock.  It is ok to acquire a raw spinlock with preemption
disabled, but not ok to acquire a normal spinlock with preemption disabled.

The calling path to cpu_to_chip_id() has an intervening call that does not
show up in the above trace, add_cpu_to_masks().  The first call of cpu_to_chip_id()
is "int chipid = cpu_to_chip_id(cpu)", which could be moved out to start_secondary(),
before preemption is disabled.  But at the end of add_cpu_to_masks() is:

        for_each_cpu(i, cpu_online_mask)
                if (cpu_to_chip_id(i) == chipid)
                        set_cpus_related(cpu, i, cpu_core_mask);

This use of cpu_to_chip_id() is a little harder to move to before preemption is
disabled, but it is possible.  A table of the chip ids for all possible cpus could be
created before disabling preemption, and the table could be passed into
add_cpu_to_masks().  This would allow devtree_lock to be changed to a
spinlock_t.

I like this approach because it removes the one known place that constrains
what type of lock devtree_lock is.
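
A rough sketch of that idea (hypothetical and untested, names invented
here; the real add_cpu_to_masks() and its helpers live in
arch/powerpc/kernel/smp.c):

/* Filled while still preemptible (exact call site left open here), so
 * that cpu_to_chip_id() and therefore devtree_lock are never needed
 * with preemption/interrupts disabled.
 */
static int chip_id_table[NR_CPUS];

static void fill_chip_id_table(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		chip_id_table[cpu] = cpu_to_chip_id(cpu);
}

/* add_cpu_to_masks() would take the table instead of calling
 * cpu_to_chip_id() itself (rest of the function unchanged, not shown):
 */
static void add_cpu_to_masks(int cpu, const int *chip_ids)
{
	int chipid = chip_ids[cpu];
	int i;

	for_each_cpu(i, cpu_online_mask)
		if (chip_ids[i] == chipid)
			set_cpus_related(cpu, i, cpu_core_mask);
}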

My second choice (and I am willing to accept this) is:

> 
>  drivers/of/base.c |   19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -138,31 +138,34 @@ static u32 phandle_cache_mask;
>  /*
>   * Caller must hold devtree_lock.
>   */

Add a one-line comment to the effect that kfree()
can not occur while a raw_spinlock_t is held, so the caller must
do the kfree().


> -static void __of_free_phandle_cache(void)
> +static struct device_node** __of_free_phandle_cache(void)
>  {
>  	u32 cache_entries = phandle_cache_mask + 1;
>  	u32 k;
> +	struct device_node **shadow;
>  
>  	if (!phandle_cache)
> -		return;
> +		return NULL;
>  
>  	for (k = 0; k < cache_entries; k++)
>  		of_node_put(phandle_cache[k]);
>  
> -	kfree(phandle_cache);
> +	shadow = phandle_cache;
>  	phandle_cache = NULL;
> +	return shadow;
>  }
>  
>  int of_free_phandle_cache(void)
>  {
>  	unsigned long flags;
> +	struct device_node **shadow;
>  
>  	raw_spin_lock_irqsave(&devtree_lock, flags);
>  
> -	__of_free_phandle_cache();
> +	shadow = __of_free_phandle_cache();
>  
>  	raw_spin_unlock_irqrestore(&devtree_lock, flags);
> -
> +	kfree(shadow);
>  	return 0;
>  }
>  #if !defined(CONFIG_MODULES)
> @@ -197,10 +200,11 @@ void of_populate_phandle_cache(void)
>  	u32 cache_entries;
>  	struct device_node *np;
>  	u32 phandles = 0;
> +	struct device_node **shadow;
>  
>  	raw_spin_lock_irqsave(&devtree_lock, flags);
>  
> -	__of_free_phandle_cache();
> +	shadow = __of_free_phandle_cache();
>  
>  	for_each_of_allnodes(np)
>  		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
> @@ -208,12 +212,14 @@ void of_populate_phandle_cache(void)
>  
>  	if (!phandles)
>  		goto out;

Add a one-line comment to the effect that a raw_spinlock_t can not be held
when calling kcalloc().


> +	raw_spin_unlock_irqrestore(&devtree_lock, flags);
>  
>  	cache_entries = roundup_pow_of_two(phandles);
>  	phandle_cache_mask = cache_entries - 1;
> 

Need to avoid race with of_find_node_by_phandle().  So change the following
to tmp_phandle_cache = kcalloc(...
 
>  	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
>  				GFP_ATOMIC);
> +	raw_spin_lock_irqsave(&devtree_lock, flags);

Then here:

        phandle_cache = tmp_phandle_cache;

>  	if (!phandle_cache)
>  		goto out;
>  
> @@ -225,6 +231,7 @@ void of_populate_phandle_cache(void)
>  
>  out:
>  	raw_spin_unlock_irqrestore(&devtree_lock, flags);
> +	kfree(shadow);
>  }
>  
>  void __init of_core_init(void)
> 

The subtle race with of_find_node_by_phandle() is that if
of_find_node_by_phandle() added an entry to the cache it
also did an of_node_get().  It is ok that of_populate_phandle_cache()
overwrites the cache entry, but it would also do an additional
of_node_get(), leaving that node with an extra reference that is never
put.
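
Putting those two comments together, the middle of
of_populate_phandle_cache() would end up looking roughly like this
(untested sketch, with a local struct device_node **tmp_phandle_cache
added, just to show the ordering):

	/* kcalloc() must not be called while the raw_spinlock_t is held */
	raw_spin_unlock_irqrestore(&devtree_lock, flags);

	cache_entries = roundup_pow_of_two(phandles);
	phandle_cache_mask = cache_entries - 1;

	tmp_phandle_cache = kcalloc(cache_entries, sizeof(*tmp_phandle_cache),
				    GFP_ATOMIC);

	raw_spin_lock_irqsave(&devtree_lock, flags);
	/* publish only under the lock, so a concurrent
	 * of_find_node_by_phandle() cannot cache (and of_node_get()) into
	 * the new array before it is repopulated */
	phandle_cache = tmp_phandle_cache;
	if (!phandle_cache)
		goto out;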

-Frank
Frank Rowand Nov. 12, 2019, 11:46 p.m. UTC | #5
On 11/12/19 9:55 AM, Rob Herring wrote:
> On Tue, Nov 12, 2019 at 3:10 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
>>
>> On 2019-11-11 21:35:35 [-0600], Rob Herring wrote:
>>>>    28d0e36bf9686 ("OF: Fixup resursive locking code paths")
>>>>    d6d3c4e656513 ("OF: convert devtree lock from rw_lock to raw spinlock")
>>>
>>> So to summarize, we changed mainline to fix RT which then broke RT. :)
>>
>> correct, but we were good until v4.17-rc1 :)
>>
>>>> I've been looking into making devtree_lock a spinlock_t which would
>>>> avoid this patch. I haven't seen an issue during boot on arm64 even
>>>> with hotplug.
>>>
>>> Did you look into using RCU reader locks any more?
>>
>> A little bit. The writers, which modify the node, would need to replace
>> the whole node. So this is where things got a little complicated.
> 
> Why is that exactly? That's pretty much how node and property updates
> work anyways though maybe not strictly enforced. The other updates are
> atomic flags and ref counts which I assume are fine. The spinlock
> protects traversing the tree of nodes and list of properties.
> Traversing and updates would need to follow similar semantics as
> rculist, right? BTW, there's been some thoughts to move node and
> property lists to list_head. We may want to do that rather than trying
> to roll our own RCU code.
> 
>> Frank wasn't a big fan of it back then and he still wasn't a few weeks
>> back.
> 
> I guess you spoke at ELCE. RCU seems like the right locking to me as
> we're almost entirely reads and I haven't seen another proposal.
> 
>> If you two agree to prefer RCU over this patch then I would look more
>> into adding RCU into the lookup path. The argument was that this isn't
>> time critical. I'm just trying to avoid to replace the locking for
>> nothing.
>> So, should I come up with a RCU patch?
> 
> Lets see what Frank says first.

RCU is good for adding list entries, deleting list entries, and updating
the value(s) of a list element.

RCU is not good for excluding readers of a data structure while multiple
entries are being added, deleted, and/or modified, such that the readers
only see the state of the data structure either (1) before any changes
or (2) after all changes occur.  Overlay application and removal require
multiple modifications, with readers seeing either the before state or
the after state.


> While this patch is a bit of a band-aid, I don't think it complicates
> the situation at all to prevent coming up with a better solution. The

Agreed.


> other option is get rid of the memory allocation altogether. My
> preference for the cache was a simpler solution that was truly a cache
> (i.e. a fixed size that could miss). The performance wasn't quite as
> good though.

The current implementation allows and properly handles misses.  But misses
will not occur if the phandle values are in the range 1..n when there are
n distinct phandle values.
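
For example (illustrative numbers only):

	/* 5 distinct phandles with the dense values 1..5: */
	cache_entries = roundup_pow_of_two(5);		/* = 8 */
	phandle_cache_mask = cache_entries - 1;		/* = 7 */
	/* slots 1&7=1, 2&7=2, 3&7=3, 4&7=4, 5&7=5 -- no two phandles share
	 * a slot, so after of_populate_phandle_cache() every lookup hits */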

I don't think the patch makes the implementation so bad that the cache
should instead switch to a fixed size to avoid kcalloc().

> 
> Rob
>
Rob Herring Nov. 13, 2019, 12:48 a.m. UTC | #6
On Tue, Nov 12, 2019 at 5:46 PM Frank Rowand <frowand.list@gmail.com> wrote:
>
> On 11/12/19 9:55 AM, Rob Herring wrote:
> > On Tue, Nov 12, 2019 at 3:10 AM Sebastian Andrzej Siewior
> > <bigeasy@linutronix.de> wrote:
> >>
> >> On 2019-11-11 21:35:35 [-0600], Rob Herring wrote:
> >>>>    28d0e36bf9686 ("OF: Fixup resursive locking code paths")
> >>>>    d6d3c4e656513 ("OF: convert devtree lock from rw_lock to raw spinlock")
> >>>
> >>> So to summarize, we changed mainline to fix RT which then broke RT. :)
> >>
> >> correct, but we were good until v4.17-rc1 :)
> >>
> >>>> I've been looking into making devtree_lock a spinlock_t which would
> >>>> avoid this patch. I haven't seen an issue during boot on arm64 even
> >>>> with hotplug.
> >>>
> >>> Did you look into using RCU reader locks any more?
> >>
> >> A little bit. The writers, which modify the node, would need to replace
> >> the whole node. So this is where things got a little complicated.
> >
> > Why is that exactly? That's pretty much how node and property updates
> > work anyways though maybe not strictly enforced. The other updates are
> > atomic flags and ref counts which I assume are fine. The spinlock
> > protects traversing the tree of nodes and list of properties.
> > Traversing and updates would need to follow similar semantics as
> > rculist, right? BTW, there's been some thoughts to move node and
> > property lists to list_head. We may want to do that rather than trying
> > to roll our own RCU code.
> >
> >> Frank wasn't a big fan of it back then and he still wasn't a few weeks
> >> back.
> >
> > I guess you spoke at ELCE. RCU seems like the right locking to me as
> > we're almost entirely reads and I haven't seen another proposal.
> >
> >> If you two agree to prefer RCU over this patch then I would look more
> >> into adding RCU into the lookup path. The argument was that this isn't
> >> time critical. I'm just trying to avoid to replace the locking for
> >> nothing.
> >> So, should I come up with a RCU patch?
> >
> > Lets see what Frank says first.
>
> RCU is good for adding list entries, deleting list entries, and updating
> the value(s) of a list element.
>
> RCU is not good for excluding readers of a data structure while multiple
> entries are being added, deleted, and/or modified, such that the readers
> only see the state of the data structure either (1) before any changes
> or (2) after all changes occur.  Overlay application and removal require
> multiple modifications, with readers seeing either the before state of
> the after state.

Do we ensure that currently? The of_mutex provides what you describe,
but most or all the reader paths don't take the mutex. Even the
changeset API takes the spinlock a single node or property at a time.

Rob
Frank Rowand Nov. 13, 2019, 4:52 p.m. UTC | #7
On 11/12/19 6:48 PM, Rob Herring wrote:
> On Tue, Nov 12, 2019 at 5:46 PM Frank Rowand <frowand.list@gmail.com> wrote:
>>
>> On 11/12/19 9:55 AM, Rob Herring wrote:
>>> On Tue, Nov 12, 2019 at 3:10 AM Sebastian Andrzej Siewior
>>> <bigeasy@linutronix.de> wrote:
>>>>
>>>> On 2019-11-11 21:35:35 [-0600], Rob Herring wrote:
>>>>>>    28d0e36bf9686 ("OF: Fixup resursive locking code paths")
>>>>>>    d6d3c4e656513 ("OF: convert devtree lock from rw_lock to raw spinlock")
>>>>>
>>>>> So to summarize, we changed mainline to fix RT which then broke RT. :)
>>>>
>>>> correct, but we were good until v4.17-rc1 :)
>>>>
>>>>>> I've been looking into making devtree_lock a spinlock_t which would
>>>>>> avoid this patch. I haven't seen an issue during boot on arm64 even
>>>>>> with hotplug.
>>>>>
>>>>> Did you look into using RCU reader locks any more?
>>>>
>>>> A little bit. The writers, which modify the node, would need to replace
>>>> the whole node. So this is where things got a little complicated.
>>>
>>> Why is that exactly? That's pretty much how node and property updates
>>> work anyways though maybe not strictly enforced. The other updates are
>>> atomic flags and ref counts which I assume are fine. The spinlock
>>> protects traversing the tree of nodes and list of properties.
>>> Traversing and updates would need to follow similar semantics as
>>> rculist, right? BTW, there's been some thoughts to move node and
>>> property lists to list_head. We may want to do that rather than trying
>>> to roll our own RCU code.
>>>
>>>> Frank wasn't a big fan of it back then and he still wasn't a few weeks
>>>> back.
>>>
>>> I guess you spoke at ELCE. RCU seems like the right locking to me as
>>> we're almost entirely reads and I haven't seen another proposal.
>>>
>>>> If you two agree to prefer RCU over this patch then I would look more
>>>> into adding RCU into the lookup path. The argument was that this isn't
>>>> time critical. I'm just trying to avoid to replace the locking for
>>>> nothing.
>>>> So, should I come up with a RCU patch?
>>>
>>> Lets see what Frank says first.
>>
>> RCU is good for adding list entries, deleting list entries, and updating
>> the value(s) of a list element.
>>
>> RCU is not good for excluding readers of a data structure while multiple
>> entries are being added, deleted, and/or modified, such that the readers
>> only see the state of the data structure either (1) before any changes
>> or (2) after all changes occur.  Overlay application and removal require
>> multiple modifications, with readers seeing either the before state of
>> the after state.
> 
> Do we ensure that currently? The of_mutex provides what you describe,

No, we don't.  I've been saying that our locking is horribly broken and
that is one of the items that needs to be fixed for overlays.  Overlays
weren't contemplated in the design of the devicetree locking architecture.

So I don't want to change over to RCU, which would carry the same issues forward.

> but most or all the reader paths don't take the mutex. Even the
> changeset API takes the spinlock a single node or property at a time.
> 
> Rob
>
Sebastian Andrzej Siewior Nov. 29, 2019, 1:57 p.m. UTC | #8
On 2019-11-12 16:48:12 [-0600], Frank Rowand wrote:
> Hi Sebastian,
Hi Frank,

> On 11/11/19 11:21 AM, Sebastian Andrzej Siewior wrote:
> > The phandle cache code allocates memory while holding devtree_lock which
> > is a raw_spinlock_t. Memory allocation (and free()) is not possible on
> > RT while a raw_spinlock_t is held.
> > Invoke the kfree() and kcalloc() while the lock is dropped.
> 
> I thought the GFP_ATOMIC passed to kcalloc() in of_populate_phandle_cache()
> was sufficient.  And I didn't realize (or remember) that kfree was
> not allowed while a raw_spinlock_t is held.  Do you have a
> pointer to the preempt RT documentation that explains that?
> I'd like to add that pointer to my personal notes about locking so
> that I won't mis-remember this too often.

Unfortunately not yet. I have a mini document that needs polishing…
However, as far as my RT memory goes:
GFP_ATOMIC can be used in non-blocking context. On !RT that includes
contexts with disabled preemption and/or interrupts, _usually_ as part of
locking (like spin_lock_irqsave()) or inherited from the context, like an
IRQ handler or softirq for instance.
On RT the former examples are no longer non-blocking contexts (sleeping
spinlocks, threaded IRQs). Locking a raw_spinlock_t still disables
preemption and interrupts as part of the procedure. The GFP_ATOMIC in
this case does not matter on RT in terms of blocking (the part with
emergency pools and so on remains unchanged).
kfree() within a raw_spinlock_t section has the same limitations on RT
as kmalloc(), for the same reason. The SLUB part of the function is fine.
The problem is once SLUB has to call into the page allocator in order to
return or ask for memory. We can't do that from a context that must not
block.
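
As a generic illustration of the pattern (some_raw_lock, shared_ptr and
friends are made-up names, not OF code):

	/* Problematic on RT: SLUB may have to take sleeping locks when it
	 * falls back to the page allocator, but this section must not block.
	 */
	raw_spin_lock_irqsave(&some_raw_lock, flags);
	p = kmalloc(size, GFP_ATOMIC);
	kfree(old);
	raw_spin_unlock_irqrestore(&some_raw_lock, flags);

	/* Fine on RT and !RT: allocate/free with the raw lock dropped and
	 * only swap the pointers under the lock.
	 */
	p = kmalloc(size, GFP_ATOMIC);
	raw_spin_lock_irqsave(&some_raw_lock, flags);
	old = shared_ptr;
	shared_ptr = p;
	raw_spin_unlock_irqrestore(&some_raw_lock, flags);
	kfree(old);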

…
> > | smp: Bringing up secondary CPUs ...
> > | BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:973
> > | in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
> > | 1 lock held by swapper/1/0:
> > |  #0: c000000000def6e0 (devtree_lock){+.+.}, at: of_find_node_opts_by_path+0x1f4/0x230
> > | Preemption disabled at:
> > | [<c0000000000557a0>] start_secondary+0xd0/0x6a0
> > |
> > | Call Trace:
> > | [c0000001f9667d10] [c000000000158e30] ___might_sleep+0x250/0x270
> > | [c0000001f9667da0] [c000000000984f40] rt_spin_lock+0x70/0x90
> > | [c0000001f9667de0] [c0000000007e3634] of_find_node_opts_by_path+0x1f4/0x230
> > | [c0000001f9667e30] [c0000000007e3844] of_get_next_cpu_node+0x144/0x180
> > | [c0000001f9667e70] [c0000000007e38d8] of_get_cpu_node+0x58/0x90
> > | [c0000001f9667eb0] [c00000000002eb00] cpu_to_chip_id+0x20/0x70
> > | [c0000001f9667ee0] [c000000000055858] start_secondary+0x188/0x6a0
> > | [c0000001f9667f90] [c00000000000b554] start_secondary_prolog+0x10/0x14
> > 
> > because cpu_to_chip_id() acquires devtree_lock() early in the CPU-bring
> > up path.
> 
> I read too much into that sentence, and ran off on a tangent re-educating
> myself on preempt RT lock stuff.
> 
> The issue in this path is that start_secondary() disables preemption before
> going down the code path that ends up with an attempt by of_find_node_opts_by_path()
> to lock devtree_lock.  It is ok to acquire a raw spinlock with preemption
> disabled, but not ok to acquire a normal spinlock with preemption disabled.
That is correct. So
	spin_lock();
	spin_lock();

is okay. However
	preempt_disable();
	spin_lock();

is not okay. Same is true for s/preempt_disable/local_irq_disable()/.

Please note that even before the preempt_disable() in start_secondary() the
whole code runs with disabled interrupts as part of CPU bring-up.

> The calling path to cpu_to_chip_id() has an intervening call that does not
> show up in the above trace, add_cpu_to_masks().  The first call of cpu_to_chip_id()
> is "int chipid = cpu_to_chip_id(cpu)", which could be moved out to start_secondary(),
> before preemption is disabled.  But at the end of add_cpu_to_masks() is:
> 
>         for_each_cpu(i, cpu_online_mask)
>                 if (cpu_to_chip_id(i) == chipid)
>                         set_cpus_related(cpu, i, cpu_core_mask);
> 
> This use of cpu_to_chip_id() is a little harder to move to before the preemption,
> but it is possible.  A table of the chip ids for all possible cpus could be
> created before disabling preemption, and the table could be passed into
> add_cpu_to_masks().  This would allow devtree_lock to be changed to a
> spinlock_t.
> 
> I like this approach because it removes the one known place that constrains
> what type of lock devtree_lock is.

So even if we move that whole section before preempt_disable(), we still
run with disabled interrupts and can't use spinlock_t.
 
> My second choice (and I am willing to accept this) is:
> 
> > 
> >  drivers/of/base.c |   19 +++++++++++++------
> >  1 file changed, 13 insertions(+), 6 deletions(-)
> > 
> > --- a/drivers/of/base.c
> > +++ b/drivers/of/base.c
> > @@ -138,31 +138,34 @@ static u32 phandle_cache_mask;
> >  /*
> >   * Caller must hold devtree_lock.
> >   */
> 
> Add a one line comment to the effect of kfree()
> can not occur while raw_spinlock_t held, so caller must
> do the kfree().

okay.

…
> > @@ -208,12 +212,14 @@ void of_populate_phandle_cache(void)
> >  
> >  	if (!phandles)
> >  		goto out;
> 
> Add a one line comment to the effect of raw_spinlock_t can not be held
> when calling kcalloc().
okay.

> > +	raw_spin_unlock_irqrestore(&devtree_lock, flags);
> >  
> >  	cache_entries = roundup_pow_of_two(phandles);
> >  	phandle_cache_mask = cache_entries - 1;
> > 
> 
> Need to avoid race with of_find_node_by_phandle().  So change the following
> to tmp_phandle_cache = kcalloc(...
okay.
 
> >  	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
> >  				GFP_ATOMIC);
> > +	raw_spin_lock_irqsave(&devtree_lock, flags);
> 
> Then here:
> 
>         phandle_cache = tmp_phandle_cache;
> 
> >  	if (!phandle_cache)
> >  		goto out;
> >  
> > @@ -225,6 +231,7 @@ void of_populate_phandle_cache(void)
> >  
> >  out:
> >  	raw_spin_unlock_irqrestore(&devtree_lock, flags);
> > +	kfree(shadow);
> >  }
> >  
> >  void __init of_core_init(void)
> > 
> 
> The subtle race with of_find_node_by_phandle() is that if
> of_find_node_by_phandle() added an entry to the cache it
> also did an of_node_get().  It is ok that of_populate_phandle_cache()
> overwrite the cache entry, but it would also do an additional
> of_node_get().
okay.

> -Frank

Sebastian
Frank Rowand Nov. 30, 2019, 2:48 a.m. UTC | #9
Hi Sebastian,

On 11/29/19 7:57 AM, Sebastian Andrzej Siewior wrote:
> On 2019-11-12 16:48:12 [-0600], Frank Rowand wrote:
>> Hi Sebastian,
> Hi Frank,
> 
>> On 11/11/19 11:21 AM, Sebastian Andrzej Siewior wrote:
>>> The phandle cache code allocates memory while holding devtree_lock which
>>> is a raw_spinlock_t. Memory allocation (and free()) is not possible on
>>> RT while a raw_spinlock_t is held.
>>> Invoke the kfree() and kcalloc() while the lock is dropped.
>>
>> I thought the GFP_ATOMIC passed to kcalloc() in of_populate_phandle_cache()
>> was sufficient.  And I didn't realize (or remember) that kfree was
>> not allowed while a raw_spinlock_t is held.  Do you have a
>> pointer to the preempt RT documentation that explains that?
>> I'd like to add that pointer to my personal notes about locking so
>> that I won't mis-remember this too often.
> 
> Unfortunately not yet. I have a mini document that needs polishing…
> However, since as far as my RT memory goes:
> GFP_ATOMIC can be used in non-blocking context. On !RT that includes
> disabled preemption and/or interrupts _usually_ part of locking (like
> spin_lock_irqsave()) or the inherited context like IRQ-handler or
> softirq for instance.
> On RT the former example is not blocking anymore (sleeping spinlocks,
> threaded IRQs). The locking with raw_spinlock_t still disables
> preemption and interrupts as part of the procedure. The GFP_ATOMIC in
> this case does not matter on RT here in terms of blocking (the part with
> emergency pools and so on remains unchanged). 
> The kfree() part within a raw_spinlock_t section has the same
> limitations on RT as kmalloc() for the same reason. The SLUB part of the
> function is fine. The problem is once SLUB has to call to the page
> allocated in order to return or ask for memory. We can't do this in
> blocking context.

Thanks for writing this up.

And thanks for digging even deeper into the relevant context than I
did below.

-Frank

> 
> …
>>> | smp: Bringing up secondary CPUs ...
>>> | BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:973
>>> | in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
>>> | 1 lock held by swapper/1/0:
>>> |  #0: c000000000def6e0 (devtree_lock){+.+.}, at: of_find_node_opts_by_path+0x1f4/0x230
>>> | Preemption disabled at:
>>> | [<c0000000000557a0>] start_secondary+0xd0/0x6a0
>>> |
>>> | Call Trace:
>>> | [c0000001f9667d10] [c000000000158e30] ___might_sleep+0x250/0x270
>>> | [c0000001f9667da0] [c000000000984f40] rt_spin_lock+0x70/0x90
>>> | [c0000001f9667de0] [c0000000007e3634] of_find_node_opts_by_path+0x1f4/0x230
>>> | [c0000001f9667e30] [c0000000007e3844] of_get_next_cpu_node+0x144/0x180
>>> | [c0000001f9667e70] [c0000000007e38d8] of_get_cpu_node+0x58/0x90
>>> | [c0000001f9667eb0] [c00000000002eb00] cpu_to_chip_id+0x20/0x70
>>> | [c0000001f9667ee0] [c000000000055858] start_secondary+0x188/0x6a0
>>> | [c0000001f9667f90] [c00000000000b554] start_secondary_prolog+0x10/0x14
>>>
>>> because cpu_to_chip_id() acquires devtree_lock() early in the CPU-bring
>>> up path.
>>
>> I read too much into that sentence, and ran off on a tangent re-educating
>> myself on preempt RT lock stuff.
>>
>> The issue in this path is that start_secondary() disables preemption before
>> going down the code path that ends up with an attempt by of_find_node_opts_by_path()
>> to lock devtree_lock.  It is ok to acquire a raw spinlock with preemption
>> disabled, but not ok to acquire a normal spinlock with preemption disabled.
> That is correct. So
> 	spin_lock();
> 	spin_lock();
> 
> is okay. However
> 	preempt_disable();
> 	spin_lock();
> 
> is not okay. Same is true for s/preempt_disable/local_irq_disable()/.
> 
> Please note the even before preempt_disable() in start_secondary() the
> whole code runs with disabled interrupts as part of CPU bring up. 
> 
>> The calling path to cpu_to_chip_id() has an intervening call that does not
>> show up in the above trace, add_cpu_to_masks().  The first call of cpu_to_chip_id()
>> is "int chipid = cpu_to_chip_id(cpu)", which could be moved out to start_secondary(),
>> before preemption is disabled.  But at the end of add_cpu_to_masks() is:
>>
>>         for_each_cpu(i, cpu_online_mask)
>>                 if (cpu_to_chip_id(i) == chipid)
>>                         set_cpus_related(cpu, i, cpu_core_mask);
>>
>> This use of cpu_to_chip_id() is a little harder to move to before the preemption,
>> but it is possible.  A table of the chip ids for all possible cpus could be
>> created before disabling preemption, and the table could be passed into
>> add_cpu_to_masks().  This would allow devtree_lock to be changed to a
>> spinlock_t.
>>
>> I like this approach because it removes the one known place that constrains
>> what type of lock devtree_lock is.
> 
> So even if we move that whole section before preempt_disable(), we still
> run with disabled interrupts and can't use spinlock_t.
>  
>> My second choice (and I am willing to accept this) is:
>>
>>>
>>>  drivers/of/base.c |   19 +++++++++++++------
>>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>>
>>> --- a/drivers/of/base.c
>>> +++ b/drivers/of/base.c
>>> @@ -138,31 +138,34 @@ static u32 phandle_cache_mask;
>>>  /*
>>>   * Caller must hold devtree_lock.
>>>   */
>>
>> Add a one line comment to the effect of kfree()
>> can not occur while raw_spinlock_t held, so caller must
>> do the kfree().
> 
> okay.
> 
> …
>>> @@ -208,12 +212,14 @@ void of_populate_phandle_cache(void)
>>>  
>>>  	if (!phandles)
>>>  		goto out;
>>
>> Add a one line comment to the effect of raw_spinlock_t can not be held
>> when calling kcalloc().
> okay.
> 
>>> +	raw_spin_unlock_irqrestore(&devtree_lock, flags);
>>>  
>>>  	cache_entries = roundup_pow_of_two(phandles);
>>>  	phandle_cache_mask = cache_entries - 1;
>>>
>>
>> Need to avoid race with of_find_node_by_phandle().  So change the following
>> to tmp_phandle_cache = kcalloc(...
> okay.
>  
>>>  	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
>>>  				GFP_ATOMIC);
>>> +	raw_spin_lock_irqsave(&devtree_lock, flags);
>>
>> Then here:
>>
>>         phandle_cache = tmp_phandle_cache;
>>
>>>  	if (!phandle_cache)
>>>  		goto out;
>>>  
>>> @@ -225,6 +231,7 @@ void of_populate_phandle_cache(void)
>>>  
>>>  out:
>>>  	raw_spin_unlock_irqrestore(&devtree_lock, flags);
>>> +	kfree(shadow);
>>>  }
>>>  
>>>  void __init of_core_init(void)
>>>
>>
>> The subtle race with of_find_node_by_phandle() is that if
>> of_find_node_by_phandle() added an entry to the cache it
>> also did an of_node_get().  It is ok that of_populate_phandle_cache()
>> overwrite the cache entry, but it would also do an additional
>> of_node_get().
> okay.
> 
>> -Frank
> 
> Sebastian
>

Patch

--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -138,31 +138,34 @@  static u32 phandle_cache_mask;
 /*
  * Caller must hold devtree_lock.
  */
-static void __of_free_phandle_cache(void)
+static struct device_node** __of_free_phandle_cache(void)
 {
 	u32 cache_entries = phandle_cache_mask + 1;
 	u32 k;
+	struct device_node **shadow;
 
 	if (!phandle_cache)
-		return;
+		return NULL;
 
 	for (k = 0; k < cache_entries; k++)
 		of_node_put(phandle_cache[k]);
 
-	kfree(phandle_cache);
+	shadow = phandle_cache;
 	phandle_cache = NULL;
+	return shadow;
 }
 
 int of_free_phandle_cache(void)
 {
 	unsigned long flags;
+	struct device_node **shadow;
 
 	raw_spin_lock_irqsave(&devtree_lock, flags);
 
-	__of_free_phandle_cache();
+	shadow = __of_free_phandle_cache();
 
 	raw_spin_unlock_irqrestore(&devtree_lock, flags);
-
+	kfree(shadow);
 	return 0;
 }
 #if !defined(CONFIG_MODULES)
@@ -197,10 +200,11 @@  void of_populate_phandle_cache(void)
 	u32 cache_entries;
 	struct device_node *np;
 	u32 phandles = 0;
+	struct device_node **shadow;
 
 	raw_spin_lock_irqsave(&devtree_lock, flags);
 
-	__of_free_phandle_cache();
+	shadow = __of_free_phandle_cache();
 
 	for_each_of_allnodes(np)
 		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
@@ -208,12 +212,14 @@  void of_populate_phandle_cache(void)
 
 	if (!phandles)
 		goto out;
+	raw_spin_unlock_irqrestore(&devtree_lock, flags);
 
 	cache_entries = roundup_pow_of_two(phandles);
 	phandle_cache_mask = cache_entries - 1;
 
 	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
 				GFP_ATOMIC);
+	raw_spin_lock_irqsave(&devtree_lock, flags);
 	if (!phandle_cache)
 		goto out;
 
@@ -225,6 +231,7 @@  void of_populate_phandle_cache(void)
 
 out:
 	raw_spin_unlock_irqrestore(&devtree_lock, flags);
+	kfree(shadow);
 }
 
 void __init of_core_init(void)