next: Commit 'mm: Prevent __alloc_pages_nodemask() RCU CPU stall ...' causing hang on sparc32 qemu

Message ID 20161130012817.GH3924@linux.vnet.ibm.com
State Not Applicable
Delegated to: David Miller

Commit Message

Paul E. McKenney Nov. 30, 2016, 1:28 a.m. UTC
On Tue, Nov 29, 2016 at 01:23:08PM -0800, Guenter Roeck wrote:
> Hi Paul,
> 
> most of my qemu tests for sparc32 targets started to fail in next-20161129.
> The problem is only seen in SMP builds; non-SMP builds are fine.
> Bisect points to commit 2d66cccd73436 ("mm: Prevent __alloc_pages_nodemask()
> RCU CPU stall warnings"); reverting that commit fixes the problem.
> 
> Test scripts are available at:
> 	https://github.com/groeck/linux-build-test/tree/master/rootfs/sparc
> Test results are at:
> 	https://github.com/groeck/linux-build-test/tree/master/rootfs/sparc
> 
> Bisect log is attached.
> 
> Please let me know if there is anything I can do to help tracking down the
> problem.

Apologies!!!  Does the patch below help?

							Thanx, Paul

------------------------------------------------------------------------

commit 97708e737e2a55fed4bdbc005bf05ea909df6b73
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Tue Nov 29 11:06:05 2016 -0800

    rcu: Allow boot-time use of cond_resched_rcu_qs()
    
    The cond_resched_rcu_qs() macro is used to force RCU quiescent states into
    long-running in-kernel loops.  However, some of these loops can execute
    during early boot when interrupts are disabled, and during which time
    it is therefore illegal to enter the scheduler.  This commit therefore
    makes cond_resched_rcu_qs() be a no-op during early boot.
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Guenter Roeck Nov. 30, 2016, 4:32 a.m. UTC | #1
On 11/29/2016 05:28 PM, Paul E. McKenney wrote:
> On Tue, Nov 29, 2016 at 01:23:08PM -0800, Guenter Roeck wrote:
>> Hi Paul,
>>
>> most of my qemu tests for sparc32 targets started to fail in next-20161129.
>> The problem is only seen in SMP builds; non-SMP builds are fine.
>> Bisect points to commit 2d66cccd73436 ("mm: Prevent __alloc_pages_nodemask()
>> RCU CPU stall warnings"); reverting that commit fixes the problem.
>>
>> Test scripts are available at:
>> 	https://github.com/groeck/linux-build-test/tree/master/rootfs/sparc
>> Test results are at:
>> 	https://github.com/groeck/linux-build-test/tree/master/rootfs/sparc
>>
>> Bisect log is attached.
>>
>> Please let me know if there is anything I can do to help tracking down the
>> problem.
>
> Apologies!!!  Does the patch below help?
>
No, sorry, it doesn't make a difference.

Guenter

> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 97708e737e2a55fed4bdbc005bf05ea909df6b73
> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date:   Tue Nov 29 11:06:05 2016 -0800
>
>     rcu: Allow boot-time use of cond_resched_rcu_qs()
>
>     The cond_resched_rcu_qs() macro is used to force RCU quiescent states into
>     long-running in-kernel loops.  However, some of these loops can execute
>     during early boot when interrupts are disabled, and during which time
>     it is therefore illegal to enter the scheduler.  This commit therefore
>     makes cond_resched_rcu_qs() be a no-op during early boot.
>
>     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 525ca34603b7..b6944cc19a07 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -423,7 +423,7 @@ extern struct srcu_struct tasks_rcu_exit_srcu;
>   */
>  #define cond_resched_rcu_qs() \
>  do { \
> -	if (!cond_resched()) \
> +	if (!is_idle_task(current) && !cond_resched()) \
>  		rcu_note_voluntary_context_switch(current); \
>  } while (0)
>
> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> index 7232d199a81c..20f5990deeee 100644
> --- a/include/linux/rcutiny.h
> +++ b/include/linux/rcutiny.h
> @@ -228,6 +228,7 @@ static inline void exit_rcu(void)
>  extern int rcu_scheduler_active __read_mostly;
>  void rcu_scheduler_starting(void);
>  #else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> +#define rcu_scheduler_active false
>  static inline void rcu_scheduler_starting(void)
>  {
>  }
>
>

Paul E. McKenney Nov. 30, 2016, 7:02 a.m. UTC | #2
On Tue, Nov 29, 2016 at 08:32:51PM -0800, Guenter Roeck wrote:
> On 11/29/2016 05:28 PM, Paul E. McKenney wrote:
> >On Tue, Nov 29, 2016 at 01:23:08PM -0800, Guenter Roeck wrote:
> >>Hi Paul,
> >>
> >>most of my qemu tests for sparc32 targets started to fail in next-20161129.
> >>The problem is only seen in SMP builds; non-SMP builds are fine.
> >>Bisect points to commit 2d66cccd73436 ("mm: Prevent __alloc_pages_nodemask()
> >>RCU CPU stall warnings"); reverting that commit fixes the problem.
> >>
> >>Test scripts are available at:
> >>	https://github.com/groeck/linux-build-test/tree/master/rootfs/sparc
> >>Test results are at:
> >>	https://github.com/groeck/linux-build-test/tree/master/rootfs/sparc
> >>
> >>Bisect log is attached.
> >>
> >>Please let me know if there is anything I can do to help tracking down the
> >>problem.
> >
> >Apologies!!!  Does the patch below help?
> >
> No, sorry, it doesn't make a difference.

Interesting...  Could you please send me the build failure messages?

							Thanx, Paul

Guenter Roeck Nov. 30, 2016, 10:52 a.m. UTC | #3
On 11/29/2016 11:02 PM, Paul E. McKenney wrote:
> On Tue, Nov 29, 2016 at 08:32:51PM -0800, Guenter Roeck wrote:
>> On 11/29/2016 05:28 PM, Paul E. McKenney wrote:
>>> On Tue, Nov 29, 2016 at 01:23:08PM -0800, Guenter Roeck wrote:
>>>> Hi Paul,
>>>>
>>>> most of my qemu tests for sparc32 targets started to fail in next-20161129.
>>>> The problem is only seen in SMP builds; non-SMP builds are fine.
>>>> Bisect points to commit 2d66cccd73436 ("mm: Prevent __alloc_pages_nodemask()
>>>> RCU CPU stall warnings"); reverting that commit fixes the problem.
>>>>
>>>> Test scripts are available at:
>>>> 	https://github.com/groeck/linux-build-test/tree/master/rootfs/sparc
>>>> Test results are at:
>>>> 	https://github.com/groeck/linux-build-test/tree/master/rootfs/sparc
>>>>
>>>> Bisect log is attached.
>>>>
>>>> Please let me know if there is anything I can do to help tracking down the
>>>> problem.
>>>
>>> Apologies!!!  Does the patch below help?
>>>
>> No, sorry, it doesn't make a difference.
>
> Interesting...  Could you please send me the build failure messages?
>

There is no failure message; it just hangs until I abort the qemu session.

http://kerneltests.org/builders/qemu-sparc-next/builds/532/steps/qemubuildcommand/logs/stdio

Guenter

Paul E. McKenney Nov. 30, 2016, 12:03 p.m. UTC | #4
On Wed, Nov 30, 2016 at 02:52:11AM -0800, Guenter Roeck wrote:
> On 11/29/2016 11:02 PM, Paul E. McKenney wrote:
> >On Tue, Nov 29, 2016 at 08:32:51PM -0800, Guenter Roeck wrote:
> >>On 11/29/2016 05:28 PM, Paul E. McKenney wrote:
> >>>On Tue, Nov 29, 2016 at 01:23:08PM -0800, Guenter Roeck wrote:
> >>>>Hi Paul,
> >>>>
> >>>>most of my qemu tests for sparc32 targets started to fail in next-20161129.
> >>>>The problem is only seen in SMP builds; non-SMP builds are fine.
> >>>>Bisect points to commit 2d66cccd73436 ("mm: Prevent __alloc_pages_nodemask()
> >>>>RCU CPU stall warnings"); reverting that commit fixes the problem.

And I have dropped this patch.  Michal Hocko showed me the error of
my ways with this patch.

							Thanx, Paul

Guenter Roeck Nov. 30, 2016, 7:21 p.m. UTC | #5
On Wed, Nov 30, 2016 at 04:03:33AM -0800, Paul E. McKenney wrote:
> On Wed, Nov 30, 2016 at 02:52:11AM -0800, Guenter Roeck wrote:
> > On 11/29/2016 11:02 PM, Paul E. McKenney wrote:
> > >On Tue, Nov 29, 2016 at 08:32:51PM -0800, Guenter Roeck wrote:
> > >>On 11/29/2016 05:28 PM, Paul E. McKenney wrote:
> > >>>On Tue, Nov 29, 2016 at 01:23:08PM -0800, Guenter Roeck wrote:
> > >>>>Hi Paul,
> > >>>>
> > >>>>most of my qemu tests for sparc32 targets started to fail in next-20161129.
> > >>>>The problem is only seen in SMP builds; non-SMP builds are fine.
> > >>>>Bisect points to commit 2d66cccd73436 ("mm: Prevent __alloc_pages_nodemask()
> > >>>>RCU CPU stall warnings"); reverting that commit fixes the problem.
> 
> And I have dropped this patch.  Michal Hocko showed me the error of
> my ways with this patch.
> 

:-)

On another note, I still get RCU tracebacks in the s390 tests.

BUG: sleeping function called from invalid context at mm/page_alloc.c:3775

That is caused by 'rcu: Maintain special bits at bottom of ->dynticks counter';
if I recall correctly we had discussed that earlier.

Thanks,
Guenter

Paul E. McKenney Nov. 30, 2016, 9:01 p.m. UTC | #6
On Wed, Nov 30, 2016 at 11:21:59AM -0800, Guenter Roeck wrote:
> On Wed, Nov 30, 2016 at 04:03:33AM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 30, 2016 at 02:52:11AM -0800, Guenter Roeck wrote:
> > > On 11/29/2016 11:02 PM, Paul E. McKenney wrote:
> > > >On Tue, Nov 29, 2016 at 08:32:51PM -0800, Guenter Roeck wrote:
> > > >>On 11/29/2016 05:28 PM, Paul E. McKenney wrote:
> > > >>>On Tue, Nov 29, 2016 at 01:23:08PM -0800, Guenter Roeck wrote:
> > > >>>>Hi Paul,
> > > >>>>
> > > >>>>most of my qemu tests for sparc32 targets started to fail in next-20161129.
> > > >>>>The problem is only seen in SMP builds; non-SMP builds are fine.
> > > >>>>Bisect points to commit 2d66cccd73436 ("mm: Prevent __alloc_pages_nodemask()
> > > >>>>RCU CPU stall warnings"); reverting that commit fixes the problem.
> > 
> > And I have dropped this patch.  Michal Hocko showed me the error of
> > my ways with this patch.
> > 
> 
> :-)
> 
> On another note, I still get RCU tracebacks in the s390 tests.
> 
> BUG: sleeping function called from invalid context at mm/page_alloc.c:3775
> 
> That is caused by 'rcu: Maintain special bits at bottom of ->dynticks counter';
> if I recall correctly we had discussed that earlier.

Indeed, I had missed a dyntick counter update back on Nov 11, which meant
that some of the code was still looking at the low-order bit instead of
the next bit up.  This is now fixed.

So to get to the error message you call out above, I need to have improperly
left the system in bh state or left irqs disabled, while the system was
running normally without an oops.  I am having a hard time seeing how this
patch can do that.

I would be more suspicious of f2a471ffc8a8 ("rcu: Allow boot-time use
of cond_resched_rcu_qs()").

So you bisected or did a revert to work out which was the offending commit?

							Thanx, Paul
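
For illustration, the bit layout described above can be modeled in userspace C. This is a minimal sketch, not the kernel code: MODEL_SPECIAL_BIT, MODEL_CTR_STEP and the model_* functions are hypothetical names, and the convention that a set counter bit means "not idle" is an assumption. It only shows why a check that still looks at the low-order bit gives the wrong answer once that bit is reserved for a special flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; a userspace model, not kernel code. */
#define MODEL_SPECIAL_BIT 0x1u                      /* low-order bit, reserved for a special flag */
#define MODEL_CTR_STEP    (MODEL_SPECIAL_BIT + 1u)  /* the counter proper starts at the next bit up */

/* Each idle entry or exit advances the counter by one step; bit 0 is untouched. */
static unsigned int model_dynticks_toggle(unsigned int ctr)
{
	return ctr + MODEL_CTR_STEP;
}

/* Updated check: the bit above the special bit tracks idle state
 * (assumption: set means "not idle"). */
static bool model_cpu_is_nonidle(unsigned int ctr)
{
	return (ctr & MODEL_CTR_STEP) != 0;
}

/* Stale check: still inspects bit 0, which no longer tracks idle state. */
static bool model_cpu_is_nonidle_stale(unsigned int ctr)
{
	return (ctr & MODEL_SPECIAL_BIT) != 0;
}

/* Walks the counter through non-idle -> idle -> non-idle. */
static void model_dynticks_demo(void)
{
	unsigned int ctr = MODEL_CTR_STEP;         /* start out non-idle */

	assert(model_cpu_is_nonidle(ctr));
	assert(!model_cpu_is_nonidle_stale(ctr));  /* the stale check already disagrees */
	ctr = model_dynticks_toggle(ctr);          /* enter idle */
	assert(!model_cpu_is_nonidle(ctr));
	ctr = model_dynticks_toggle(ctr);          /* leave idle */
	assert(model_cpu_is_nonidle(ctr));
}
```

Calling model_dynticks_demo() from a small main() exercises both checks. In this model the stale check never sees a non-idle CPU, because stepping by MODEL_CTR_STEP never changes bit 0.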

Guenter Roeck Nov. 30, 2016, 11:18 p.m. UTC | #7
On Wed, Nov 30, 2016 at 01:01:52PM -0800, Paul E. McKenney wrote:
> On Wed, Nov 30, 2016 at 11:21:59AM -0800, Guenter Roeck wrote:
> > On Wed, Nov 30, 2016 at 04:03:33AM -0800, Paul E. McKenney wrote:
> > > On Wed, Nov 30, 2016 at 02:52:11AM -0800, Guenter Roeck wrote:
> > > > On 11/29/2016 11:02 PM, Paul E. McKenney wrote:
> > > > >On Tue, Nov 29, 2016 at 08:32:51PM -0800, Guenter Roeck wrote:
> > > > >>On 11/29/2016 05:28 PM, Paul E. McKenney wrote:
> > > > >>>On Tue, Nov 29, 2016 at 01:23:08PM -0800, Guenter Roeck wrote:
> > > > >>>>Hi Paul,
> > > > >>>>
> > > > >>>>most of my qemu tests for sparc32 targets started to fail in next-20161129.
> > > > >>>>The problem is only seen in SMP builds; non-SMP builds are fine.
> > > > >>>>Bisect points to commit 2d66cccd73436 ("mm: Prevent __alloc_pages_nodemask()
> > > > >>>>RCU CPU stall warnings"); reverting that commit fixes the problem.
> > > 
> > > And I have dropped this patch.  Michal Hocko showed me the error of
> > > my ways with this patch.
> > > 
> > 
> > :-)
> > 
> > On another note, I still get RCU tracebacks in the s390 tests.
> > 
> > BUG: sleeping function called from invalid context at mm/page_alloc.c:3775
> > 
> > That is caused by 'rcu: Maintain special bits at bottom of ->dynticks counter';
> > if I recall correctly we had discussed that earlier.
> 
> Indeed, I had missed a dyntick counter update back on Nov 11, which meant
> that some of the code was still looking at the low-order bit instead of
> the next bit up.  This is now fixed.
> 
> So to get to the error message you call out above, I need to have improperly
> left the system in bh state or left irqs disabled, while the system was
> running normally without an oops.  I am having a hard time seeing how this
> patch can do that.
> 
> I would be more suspicious of f2a471ffc8a8 ("rcu: Allow boot-time use
> of cond_resched_rcu_qs()").
> 
> So you bisected or did a revert to work out which was the offending commit?
> 

My most recent bisect was with the November 10 image, so that would have missed
any later fix. Comparing the log messages, the current message is indeed
different. Sorry, I mixed that up; I just assumed that the problem would be
the same without really checking. My bad.

Bisect would be tricky, since the s390 image was broken for some time after
November 10. The first time I have seen the above BUG: was with next-20161128
(which is the first build after the crash was fixed). That version did not
include f2a471ffc8a8, so that can not be the cause.

I'll try to set up a bisect tonight, working around the crash problem.
I'll let you know how it goes.

Guenter

Paul E. McKenney Dec. 1, 2016, 1:19 a.m. UTC | #8
On Wed, Nov 30, 2016 at 03:18:46PM -0800, Guenter Roeck wrote:
> On Wed, Nov 30, 2016 at 01:01:52PM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 30, 2016 at 11:21:59AM -0800, Guenter Roeck wrote:
> > > On Wed, Nov 30, 2016 at 04:03:33AM -0800, Paul E. McKenney wrote:
> > > > On Wed, Nov 30, 2016 at 02:52:11AM -0800, Guenter Roeck wrote:
> > > > > On 11/29/2016 11:02 PM, Paul E. McKenney wrote:
> > > > > >On Tue, Nov 29, 2016 at 08:32:51PM -0800, Guenter Roeck wrote:
> > > > > >>On 11/29/2016 05:28 PM, Paul E. McKenney wrote:
> > > > > >>>On Tue, Nov 29, 2016 at 01:23:08PM -0800, Guenter Roeck wrote:
> > > > > >>>>Hi Paul,
> > > > > >>>>
> > > > > >>>>most of my qemu tests for sparc32 targets started to fail in next-20161129.
> > > > > >>>>The problem is only seen in SMP builds; non-SMP builds are fine.
> > > > > >>>>Bisect points to commit 2d66cccd73436 ("mm: Prevent __alloc_pages_nodemask()
> > > > > >>>>RCU CPU stall warnings"); reverting that commit fixes the problem.
> > > > 
> > > > And I have dropped this patch.  Michal Hocko showed me the error of
> > > > my ways with this patch.
> > > > 
> > > 
> > > :-)
> > > 
> > > On another note, I still get RCU tracebacks in the s390 tests.
> > > 
> > > BUG: sleeping function called from invalid context at mm/page_alloc.c:3775
> > > 
> > > That is caused by 'rcu: Maintain special bits at bottom of ->dynticks counter';
> > > if I recall correctly we had discussed that earlier.
> > 
> > Indeed, I had missed a dyntick counter update back on Nov 11, which meant
> > that some of the code was still looking at the low-order bit instead of
> > the next bit up.  This is now fixed.
> > 
> > So to get to the error message you call out above, I need to have improperly
> > left the system in bh state or left irqs disabled, while the system was
> > running normally without an oops.  I am having a hard time seeing how this
> > patch can do that.
> > 
> > I would be more suspicious of f2a471ffc8a8 ("rcu: Allow boot-time use
> > of cond_resched_rcu_qs()").
> > 
> > So you bisected or did a revert to work out which was the offending commit?
> > 
> 
> My most recent bisect was with the November 10 image, so that would have missed
> any later fix. Comparing the log messages, the current message is indeed
> different. Sorry, I mixed that up; I just assumed that the problem would be
> the same without really checking. My bad.
> 
> Bisect would be tricky, since the s390 image was broken for some time after
> November 10. The first time I have seen the above BUG: was with next-20161128
> (which is the first build after the crash was fixed). That version did not
> include f2a471ffc8a8, so that can not be the cause.
> 
> I'll try to set up a bisect tonight, working around the crash problem.
> I'll let you know how it goes.

Whew!  You had me going for a bit there.  ;-)

							Thanx, Paul

Guenter Roeck Dec. 1, 2016, 6:56 a.m. UTC | #9
Hi Paul,

On Wed, Nov 30, 2016 at 05:19:50PM -0800, Paul E. McKenney wrote:
[ ... ]

> > > > 
> > > > BUG: sleeping function called from invalid context at mm/page_alloc.c:3775
[ ... ]
> 
> Whew!  You had me going for a bit there.  ;-)

Bisect results are here ... the culprit is, again, commit 2d66cccd73 ("mm:
Prevent __alloc_pages_nodemask() RCU CPU stall warnings"), and reverting that
patch fixes the problem. Good that you dropped it already :-).

Guenter

---
# bad: [59ab0083490c8a871b51e893bae5806e55901d7e] Add linux-next specific files for 20161130
# good: [e5517c2a5a49ed5e99047008629f1cd60246ea0e] Linux 4.9-rc7
git bisect start 'HEAD' 'v4.9-rc7'
# good: [187f99e5c22bb3fab8b330f3ebbbd235d238f3f8] Merge remote-tracking branch 'crypto/master'
git bisect good 187f99e5c22bb3fab8b330f3ebbbd235d238f3f8
# good: [36126657c908e822523b8563f9b1512937c0f342] Merge remote-tracking branch 'tip/auto-latest'
git bisect good 36126657c908e822523b8563f9b1512937c0f342
# good: [2d2139c5c746ec61024fdfa9c36e4e034bb18e59] Merge tag 'iio-for-4.10d' of git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-next
git bisect good 2d2139c5c746ec61024fdfa9c36e4e034bb18e59
# bad: [926a60551123048c589b45abee2a2ec4c924ab21] Merge remote-tracking branch 'extcon/extcon-next'
git bisect bad 926a60551123048c589b45abee2a2ec4c924ab21
# bad: [1541655795a90720b8a094c8cc39f582dec17398] Merge remote-tracking branch 'tty/tty-next'
git bisect bad 1541655795a90720b8a094c8cc39f582dec17398
# bad: [69a6720a1e54519d9bf8563764e9e93bf1bd6a84] Merge remote-tracking branch 'kvm-arm/next'
git bisect bad 69a6720a1e54519d9bf8563764e9e93bf1bd6a84
# good: [33b8b045b93f9104c61ecad1865af961b3bef03e] Merge remote-tracking branch 'ftrace/for-next'
git bisect good 33b8b045b93f9104c61ecad1865af961b3bef03e
# good: [8370c3d08bd98576d97514eca29970e03767a5d1] kvm: svm: Add kvm_fast_pio_in support
git bisect good 8370c3d08bd98576d97514eca29970e03767a5d1
# good: [0a895142323de3eebb0b753d3d8c0e768ff179d9] mm: Prevent shrink_node() RCU CPU stall warnings
git bisect good 0a895142323de3eebb0b753d3d8c0e768ff179d9
# bad: [f8045446ca778333e960dcb9e30a5858ff2b8c20] srcu: Force full grace-period ordering
git bisect bad f8045446ca778333e960dcb9e30a5858ff2b8c20
# good: [f660d64912ccadadcdce6dfb39eb06924dd93767] doc: Fix RCU requirements typos
git bisect good f660d64912ccadadcdce6dfb39eb06924dd93767
# good: [d2db185bfee894c573faebed93461e9938bdbb61] rcu: Remove short-term CPU kicking
git bisect good d2db185bfee894c573faebed93461e9938bdbb61
# bad: [2d66cccd73436ac9985a08e5c2f82e4344f72264] mm: Prevent __alloc_pages_nodemask() RCU CPU stall warnings
git bisect bad 2d66cccd73436ac9985a08e5c2f82e4344f72264
# good: [34c53f5cd399801b083047cc9cf2ad3ed17c3144] mm: Prevent shrink_node_memcg() RCU CPU stall warnings
git bisect good 34c53f5cd399801b083047cc9cf2ad3ed17c3144
# first bad commit: [2d66cccd73436ac9985a08e5c2f82e4344f72264] mm: Prevent __alloc_pages_nodemask() RCU CPU stall warnings

Paul E. McKenney Dec. 1, 2016, 12:34 p.m. UTC | #10
On Wed, Nov 30, 2016 at 10:56:57PM -0800, Guenter Roeck wrote:
> Hi Paul,
> 
> On Wed, Nov 30, 2016 at 05:19:50PM -0800, Paul E. McKenney wrote:
> [ ... ]
> 
> > > > > 
> > > > > BUG: sleeping function called from invalid context at mm/page_alloc.c:3775
> [ ... ]
> > 
> > Whew!  You had me going for a bit there.  ;-)
> 
> Bisect results are here ... the culprit is, again, commit 2d66cccd73 ("mm:
> Prevent __alloc_pages_nodemask() RCU CPU stall warnings"), and reverting that
> patch fixes the problem. Good that you dropped it already :-).

"My work is done."  ;-)

And apologies for the hassle.  I have no idea what I was thinking when
I put the cond_resched_rcu_qs() there!

							Thanx, Paul

Guenter Roeck Dec. 1, 2016, 12:50 p.m. UTC | #11
On 12/01/2016 04:34 AM, Paul E. McKenney wrote:
> On Wed, Nov 30, 2016 at 10:56:57PM -0800, Guenter Roeck wrote:
>> Hi Paul,
>>
>> On Wed, Nov 30, 2016 at 05:19:50PM -0800, Paul E. McKenney wrote:
>> [ ... ]
>>
>>>>>>
>>>>>> BUG: sleeping function called from invalid context at mm/page_alloc.c:3775
>> [ ... ]
>>>
>>> Whew!  You had me going for a bit there.  ;-)
>>
>> Bisect results are here ... the culprit is, again, commit 2d66cccd73 ("mm:
>> Prevent __alloc_pages_nodemask() RCU CPU stall warnings"), and reverting that
>> patch fixes the problem. Good that you dropped it already :-).
>
> "My work is done."  ;-)
>
> And apologies for the hassle.  I have no idea what I was thinking when
> I put the cond_resched_rcu_qs() there!
>

No worries.

Cheers,
Guenter

Patch

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 525ca34603b7..b6944cc19a07 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -423,7 +423,7 @@  extern struct srcu_struct tasks_rcu_exit_srcu;
  */
 #define cond_resched_rcu_qs() \
 do { \
-	if (!cond_resched()) \
+	if (!is_idle_task(current) && !cond_resched()) \
 		rcu_note_voluntary_context_switch(current); \
 } while (0)
 
diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 7232d199a81c..20f5990deeee 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -228,6 +228,7 @@  static inline void exit_rcu(void)
 extern int rcu_scheduler_active __read_mostly;
 void rcu_scheduler_starting(void);
 #else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+#define rcu_scheduler_active false
 static inline void rcu_scheduler_starting(void)
 {
 }
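
The effect of the one-line change to cond_resched_rcu_qs() can be modeled in userspace C. This is a minimal sketch under stated assumptions, not kernel code: struct task_model and the model_* functions are hypothetical stand-ins for current, is_idle_task(), cond_resched() and rcu_note_voluntary_context_switch(). It shows that with the added test the idle task neither enters the scheduler nor records a quiescent state, while other tasks behave as before.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a task; not a kernel structure. */
struct task_model {
	bool is_idle;       /* stand-in for is_idle_task(current) */
	bool need_resched;  /* stand-in for "the scheduler has something to switch to" */
	int resched_calls;  /* times the modeled scheduler was entered */
	int qs_notes;       /* times a quiescent state was recorded by hand */
};

/* Models cond_resched(): returns true only if it actually rescheduled. */
static bool model_cond_resched(struct task_model *t)
{
	if (!t->need_resched)
		return false;
	t->resched_calls++;
	t->need_resched = false;
	return true;
}

/* Models rcu_note_voluntary_context_switch(). */
static void model_note_qs(struct task_model *t)
{
	t->qs_notes++;
}

/* Patched form of the macro: the idle task short-circuits the whole test. */
static void model_cond_resched_rcu_qs(struct task_model *t)
{
	if (!t->is_idle && !model_cond_resched(t))
		model_note_qs(t);
}

/* Exercises the three cases: idle task, task that reschedules, task that does not. */
static void model_cond_resched_demo(void)
{
	struct task_model idle = { .is_idle = true, .need_resched = true };
	struct task_model busy = { .is_idle = false, .need_resched = true };
	struct task_model quiet = { .is_idle = false, .need_resched = false };

	model_cond_resched_rcu_qs(&idle);
	assert(idle.resched_calls == 0 && idle.qs_notes == 0);
	model_cond_resched_rcu_qs(&busy);
	assert(busy.resched_calls == 1 && busy.qs_notes == 0);
	model_cond_resched_rcu_qs(&quiet);
	assert(quiet.resched_calls == 0 && quiet.qs_notes == 1);
}
```

The model covers only the control flow of the macro. As reported in comment #1, this change made no difference to the sparc32 hang, and the offending mm commit was dropped instead.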