diff mbox

[RFC,idle,2/3] arm: Avoid invoking RCU when CPU is idle

Message ID 1328143404-11038-2-git-send-email-paulmck@linux.vnet.ibm.com
State New
Headers show

Commit Message

Paul E. McKenney Feb. 2, 2012, 12:43 a.m. UTC
From: "Paul E. McKenney" <paul.mckenney@linaro.org>

The idle loop is a quiscent state for RCU, which means that RCU ignores
CPUs that have told RCU that they are idle via rcu_idle_enter().
There are nevertheless quite a few places where idle CPUs use RCU, most
commonly indirectly via tracing.  This patch fixes these problems for ARM.

Many of these bugs have been in the kernel for quite some time, but
Frederic's recent change now gives warnings.

This patch takes the straightforward approach of pushing the
rcu_idle_enter()/rcu_idle_exit() pair further down into the core of the
idle loop.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Victor <linux@maxim.org.za>
Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Sekhar Nori <nsekhar@ti.com>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Kukjin Kim <kgene.kim@samsung.com>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Sascha Hauer <kernel@pengutronix.de>
Cc: Shawn Guo <shawn.guo@linaro.org>
Cc: Lennert Buytenhek <kernel@wantstofly.org>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Amit Kucheria <amit.kucheria@canonical.com>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Barry Song <baohua.song@csr.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Magnus Damm <magnus.damm@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Trinabh Gupta <g.trinabh@gmail.com>
Cc: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-omap@vger.kernel.org
---
 arch/arm/kernel/process.c          |    2 --
 arch/arm/mach-at91/cpuidle.c       |    3 +++
 arch/arm/mach-davinci/cpuidle.c    |    3 +++
 arch/arm/mach-exynos/common.c      |    2 ++
 arch/arm/mach-highbank/pm.c        |   12 ++++++++++++
 arch/arm/mach-imx/mm-imx3.c        |    3 +++
 arch/arm/mach-imx/pm-imx27.c       |    4 ++++
 arch/arm/mach-imx/pm-imx6q.c       |    4 ++++
 arch/arm/mach-kirkwood/cpuidle.c   |    3 +++
 arch/arm/mach-mx5/mm.c             |    3 +++
 arch/arm/mach-mx5/pm-imx5.c        |    3 +++
 arch/arm/mach-mxs/pm.c             |    4 ++++
 arch/arm/mach-omap1/pm.c           |    4 ++++
 arch/arm/mach-omap2/pm24xx.c       |    2 ++
 arch/arm/mach-omap2/pm34xx.c       |    2 ++
 arch/arm/mach-omap2/pm44xx.c       |    3 +++
 arch/arm/mach-pnx4008/pm.c         |    2 ++
 arch/arm/mach-prima2/pm.c          |    4 ++++
 arch/arm/mach-s5p64x0/common.c     |    2 ++
 arch/arm/mach-s5pc100/common.c     |    2 ++
 arch/arm/mach-s5pv210/common.c     |    2 ++
 arch/arm/mach-shmobile/cpuidle.c   |    3 +++
 arch/arm/mach-shmobile/pm-sh7372.c |    8 ++++++++
 23 files changed, 78 insertions(+), 2 deletions(-)

Comments

Rob Herring Feb. 2, 2012, 2:48 a.m. UTC | #1
Paul,

On 02/01/2012 06:43 PM, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> 
> The idle loop is a quiscent state for RCU, which means that RCU ignores
> CPUs that have told RCU that they are idle via rcu_idle_enter().
> There are nevertheless quite a few places where idle CPUs use RCU, most
> commonly indirectly via tracing.  This patch fixes these problems for ARM.

Do you have some specific examples of actual problems? It's not possible
to restrict tracing at this low level?

This really seems like the wrong direction as we are trying to pull
common pieces up out of platform specific code.

> Many of these bugs have been in the kernel for quite some time, but
> Frederic's recent change now gives warnings.
> 
> This patch takes the straightforward approach of pushing the
> rcu_idle_enter()/rcu_idle_exit() pair further down into the core of the
> idle loop.
> 
> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: Andrew Victor <linux@maxim.org.za>
> Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
> Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
> Cc: Sekhar Nori <nsekhar@ti.com>
> Cc: Kevin Hilman <khilman@ti.com>
> Cc: Kukjin Kim <kgene.kim@samsung.com>
> Cc: Rob Herring <rob.herring@calxeda.com>
> Cc: Sascha Hauer <kernel@pengutronix.de>
> Cc: Shawn Guo <shawn.guo@linaro.org>
> Cc: Lennert Buytenhek <kernel@wantstofly.org>
> Cc: Nicolas Pitre <nico@fluxnic.net>
> Cc: Amit Kucheria <amit.kucheria@canonical.com>
> Cc: Tony Lindgren <tony@atomide.com>
> Cc: Barry Song <baohua.song@csr.com>
> Cc: Paul Mundt <lethal@linux-sh.org>
> Cc: Magnus Damm <magnus.damm@gmail.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: H Hartley Sweeten <hsweeten@visionengravers.com>
> Cc: Trinabh Gupta <g.trinabh@gmail.com>
> Cc: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
> Cc: Len Brown <len.brown@intel.com>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-samsung-soc@vger.kernel.org
> Cc: linux-omap@vger.kernel.org
> ---
>  arch/arm/kernel/process.c          |    2 --
>  arch/arm/mach-at91/cpuidle.c       |    3 +++
>  arch/arm/mach-davinci/cpuidle.c    |    3 +++
>  arch/arm/mach-exynos/common.c      |    2 ++
>  arch/arm/mach-highbank/pm.c        |   12 ++++++++++++
>  arch/arm/mach-imx/mm-imx3.c        |    3 +++
>  arch/arm/mach-imx/pm-imx27.c       |    4 ++++
>  arch/arm/mach-imx/pm-imx6q.c       |    4 ++++
>  arch/arm/mach-kirkwood/cpuidle.c   |    3 +++
>  arch/arm/mach-mx5/mm.c             |    3 +++
>  arch/arm/mach-mx5/pm-imx5.c        |    3 +++
>  arch/arm/mach-mxs/pm.c             |    4 ++++
>  arch/arm/mach-omap1/pm.c           |    4 ++++
>  arch/arm/mach-omap2/pm24xx.c       |    2 ++
>  arch/arm/mach-omap2/pm34xx.c       |    2 ++
>  arch/arm/mach-omap2/pm44xx.c       |    3 +++
>  arch/arm/mach-pnx4008/pm.c         |    2 ++
>  arch/arm/mach-prima2/pm.c          |    4 ++++
>  arch/arm/mach-s5p64x0/common.c     |    2 ++
>  arch/arm/mach-s5pc100/common.c     |    2 ++
>  arch/arm/mach-s5pv210/common.c     |    2 ++
>  arch/arm/mach-shmobile/cpuidle.c   |    3 +++
>  arch/arm/mach-shmobile/pm-sh7372.c |    8 ++++++++
>  23 files changed, 78 insertions(+), 2 deletions(-)
> 

This is clearly not all platforms and you're missing some cpuidle
drivers like omap. Some platform's idle code don't need this?

I'd guess this conflicts with Nico's idle clean-up which is in Russell's
for-next branch now.

> diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
> index 971d65c..8241df7 100644
> --- a/arch/arm/kernel/process.c
> +++ b/arch/arm/kernel/process.c
> @@ -207,7 +207,6 @@ void cpu_idle(void)
>  	/* endless idle loop with no priority at all */
>  	while (1) {
>  		tick_nohz_idle_enter();
> -		rcu_idle_enter();

In some cases with the clean-up, there is no platform specific idle
code, so rcu_idle_enter would never be called.

Rob

>  		leds_event(led_idle_start);
>  		while (!need_resched()) {
>  #ifdef CONFIG_HOTPLUG_CPU
> @@ -237,7 +236,6 @@ void cpu_idle(void)
>  			}
>  		}
>  		leds_event(led_idle_end);
> -		rcu_idle_exit();
>  		tick_nohz_idle_exit();
>  		preempt_enable_no_resched();
>  		schedule();
> diff --git a/arch/arm/mach-at91/cpuidle.c b/arch/arm/mach-at91/cpuidle.c
> index a851e6c..8feff6b 100644
> --- a/arch/arm/mach-at91/cpuidle.c
> +++ b/arch/arm/mach-at91/cpuidle.c
> @@ -20,6 +20,7 @@
>  #include <asm/proc-fns.h>
>  #include <linux/io.h>
>  #include <linux/export.h>
> +#include <linux/rcupdate.h>
>  
>  #include "pm.h"
>  
> @@ -43,6 +44,7 @@ static int at91_enter_idle(struct cpuidle_device *dev,
>  
>  	local_irq_disable();
>  	do_gettimeofday(&before);
> +	rcu_idle_enter();
>  	if (index == 0)
>  		/* Wait for interrupt state */
>  		cpu_do_idle();
> @@ -53,6 +55,7 @@ static int at91_enter_idle(struct cpuidle_device *dev,
>  		cpu_do_idle();
>  		sdram_selfrefresh_disable(saved_lpr);
>  	}
> +	rcu_idle_exit();
>  	do_gettimeofday(&after);
>  	local_irq_enable();
>  	idle_time = (after.tv_sec - before.tv_sec) * USEC_PER_SEC +
> diff --git a/arch/arm/mach-davinci/cpuidle.c b/arch/arm/mach-davinci/cpuidle.c
> index a30c7c5..6594b7e 100644
> --- a/arch/arm/mach-davinci/cpuidle.c
> +++ b/arch/arm/mach-davinci/cpuidle.c
> @@ -17,6 +17,7 @@
>  #include <linux/cpuidle.h>
>  #include <linux/io.h>
>  #include <linux/export.h>
> +#include <linux/rcupdate.h>
>  #include <asm/proc-fns.h>
>  
>  #include <mach/cpuidle.h>
> @@ -90,12 +91,14 @@ static int davinci_enter_idle(struct cpuidle_device *dev,
>  	local_irq_disable();
>  	do_gettimeofday(&before);
>  
> +	rcu_idle_enter();
>  	if (ops && ops->enter)
>  		ops->enter(ops->flags);
>  	/* Wait for interrupt state */
>  	cpu_do_idle();
>  	if (ops && ops->exit)
>  		ops->exit(ops->flags);
> +	rcu_idle_exit();
>  
>  	do_gettimeofday(&after);
>  	local_irq_enable();
> diff --git a/arch/arm/mach-exynos/common.c b/arch/arm/mach-exynos/common.c
> index c59e188..1e5844a 100644
> --- a/arch/arm/mach-exynos/common.c
> +++ b/arch/arm/mach-exynos/common.c
> @@ -203,8 +203,10 @@ static struct map_desc exynos4_iodesc1[] __initdata = {
>  
>  static void exynos_idle(void)
>  {
> +	rcu_idle_enter();
>  	if (!need_resched())
>  		cpu_do_idle();
> +	rcu_idle_exit();
>  
>  	local_irq_enable();
>  }
> diff --git a/arch/arm/mach-highbank/pm.c b/arch/arm/mach-highbank/pm.c
> index 33b3beb..5ecc4bc 100644
> --- a/arch/arm/mach-highbank/pm.c
> +++ b/arch/arm/mach-highbank/pm.c
> @@ -17,6 +17,7 @@
>  #include <linux/init.h>
>  #include <linux/io.h>
>  #include <linux/suspend.h>
> +#include <linux/rcupdate.h>
>  
>  #include <asm/proc-fns.h>
>  #include <asm/smp_scu.h>
> @@ -33,12 +34,23 @@ static int highbank_suspend_finish(unsigned long val)
>  
>  static int highbank_pm_enter(suspend_state_t state)
>  {
> +	/*
> +	 * Some of the functions preceding the cpu_suspend() can
> +	 * invoke RCU, but only in case of failure to disable preemption
> +	 * or preempt_disable() misnesting.  Assume that these errors
> +	 * are not committed in the following code, because RCU just
> +	 * doesn't want to run on a CPU that has its caches disabled.
> +	 */
> +	rcu_idle_enter();
> +
>  	hignbank_set_pwr_suspend();
>  	highbank_set_cpu_jump(0, cpu_resume);
>  
>  	scu_power_mode(scu_base_addr, SCU_PM_POWEROFF);
>  	cpu_suspend(0, highbank_suspend_finish);
>  
> +	rcu_idle_exit();
> +
>  	return 0;
>  }
>  
> diff --git a/arch/arm/mach-imx/mm-imx3.c b/arch/arm/mach-imx/mm-imx3.c
> index 31807d2..0945b23 100644
> --- a/arch/arm/mach-imx/mm-imx3.c
> +++ b/arch/arm/mach-imx/mm-imx3.c
> @@ -19,6 +19,7 @@
>  #include <linux/mm.h>
>  #include <linux/init.h>
>  #include <linux/err.h>
> +#include <linux/rcupdate.h>
>  
>  #include <asm/pgtable.h>
>  #include <asm/hardware/cache-l2x0.h>
> @@ -34,6 +35,7 @@ static void imx3_idle(void)
>  {
>  	unsigned long reg = 0;
>  
> +	rcu_idle_enter();
>  	if (!need_resched())
>  		__asm__ __volatile__(
>  			/* disable I and D cache */
> @@ -58,6 +60,7 @@ static void imx3_idle(void)
>  			"orr %0, %0, #0x00000004\n"
>  			"mcr p15, 0, %0, c1, c0, 0\n"
>  			: "=r" (reg));
> +	rcu_idle_exit();
>  	local_irq_enable();
>  }
>  
> diff --git a/arch/arm/mach-imx/pm-imx27.c b/arch/arm/mach-imx/pm-imx27.c
> index e455d2f..c4adc05 100644
> --- a/arch/arm/mach-imx/pm-imx27.c
> +++ b/arch/arm/mach-imx/pm-imx27.c
> @@ -10,12 +10,14 @@
>  #include <linux/kernel.h>
>  #include <linux/suspend.h>
>  #include <linux/io.h>
> +#include <linux/rcupdate.h>
>  #include <mach/system.h>
>  #include <mach/hardware.h>
>  
>  static int mx27_suspend_enter(suspend_state_t state)
>  {
>  	u32 cscr;
> +	rcu_idle_enter();
>  	switch (state) {
>  	case PM_SUSPEND_MEM:
>  		/* Clear MPEN and SPEN to disable MPLL/SPLL */
> @@ -24,9 +26,11 @@ static int mx27_suspend_enter(suspend_state_t state)
>  		__raw_writel(cscr, MX27_IO_ADDRESS(MX27_CCM_BASE_ADDR));
>  		/* Executes WFI */
>  		arch_idle();
> +		rcu_idle_exit();
>  		break;
>  
>  	default:
> +		rcu_idle_exit();
>  		return -EINVAL;
>  	}
>  	return 0;
> diff --git a/arch/arm/mach-imx/pm-imx6q.c b/arch/arm/mach-imx/pm-imx6q.c
> index f7b0c2b..b9c2589 100644
> --- a/arch/arm/mach-imx/pm-imx6q.c
> +++ b/arch/arm/mach-imx/pm-imx6q.c
> @@ -14,6 +14,7 @@
>  #include <linux/io.h>
>  #include <linux/of.h>
>  #include <linux/suspend.h>
> +#include <linux/rcupdate.h>
>  #include <asm/cacheflush.h>
>  #include <asm/proc-fns.h>
>  #include <asm/suspend.h>
> @@ -31,6 +32,7 @@ static int imx6q_suspend_finish(unsigned long val)
>  
>  static int imx6q_pm_enter(suspend_state_t state)
>  {
> +	rcu_idle_enter();
>  	switch (state) {
>  	case PM_SUSPEND_MEM:
>  		imx6q_set_lpm(STOP_POWER_OFF);
> @@ -40,8 +42,10 @@ static int imx6q_pm_enter(suspend_state_t state)
>  		cpu_suspend(0, imx6q_suspend_finish);
>  		imx_smp_prepare();
>  		imx_gpc_post_resume();
> +		rcu_idle_exit();
>  		break;
>  	default:
> +		rcu_idle_exit();
>  		return -EINVAL;
>  	}
>  
> diff --git a/arch/arm/mach-kirkwood/cpuidle.c b/arch/arm/mach-kirkwood/cpuidle.c
> index 7088180..de8e317 100644
> --- a/arch/arm/mach-kirkwood/cpuidle.c
> +++ b/arch/arm/mach-kirkwood/cpuidle.c
> @@ -19,6 +19,7 @@
>  #include <linux/cpuidle.h>
>  #include <linux/io.h>
>  #include <linux/export.h>
> +#include <linux/rcupdate.h>
>  #include <asm/proc-fns.h>
>  #include <mach/kirkwood.h>
>  
> @@ -41,6 +42,7 @@ static int kirkwood_enter_idle(struct cpuidle_device *dev,
>  
>  	local_irq_disable();
>  	do_gettimeofday(&before);
> +	rcu_idle_enter();
>  	if (index == 0)
>  		/* Wait for interrupt state */
>  		cpu_do_idle();
> @@ -55,6 +57,7 @@ static int kirkwood_enter_idle(struct cpuidle_device *dev,
>  		writel(0x7, DDR_OPERATION_BASE);
>  		cpu_do_idle();
>  	}
> +	rcu_idle_exit();
>  	do_gettimeofday(&after);
>  	local_irq_enable();
>  	idle_time = (after.tv_sec - before.tv_sec) * USEC_PER_SEC +
> diff --git a/arch/arm/mach-mx5/mm.c b/arch/arm/mach-mx5/mm.c
> index bc17dfe..d919648 100644
> --- a/arch/arm/mach-mx5/mm.c
> +++ b/arch/arm/mach-mx5/mm.c
> @@ -14,6 +14,7 @@
>  #include <linux/mm.h>
>  #include <linux/init.h>
>  #include <linux/clk.h>
> +#include <linux/rcupdate.h>
>  
>  #include <asm/mach/map.h>
>  
> @@ -37,7 +38,9 @@ static void imx5_idle(void)
>  		mx5_cpu_lp_set(WAIT_UNCLOCKED_POWER_OFF);
>  		if (tzic_enable_wake())
>  			goto err1;
> +		rcu_idle_enter();  /* No tracing until rcu_idle_exit(). */
>  		cpu_do_idle();
> +		rcu_idle_exit();
>  err1:
>  		clk_disable(gpc_dvfs_clk);
>  	}
> diff --git a/arch/arm/mach-mx5/pm-imx5.c b/arch/arm/mach-mx5/pm-imx5.c
> index 98052fc..b4adf42 100644
> --- a/arch/arm/mach-mx5/pm-imx5.c
> +++ b/arch/arm/mach-mx5/pm-imx5.c
> @@ -12,6 +12,7 @@
>  #include <linux/clk.h>
>  #include <linux/io.h>
>  #include <linux/err.h>
> +#include <linux/rcupdate.h>
>  #include <asm/cacheflush.h>
>  #include <asm/tlbflush.h>
>  #include <mach/common.h>
> @@ -27,6 +28,7 @@ static int mx5_suspend_prepare(void)
>  
>  static int mx5_suspend_enter(suspend_state_t state)
>  {
> +	rcu_idle_enter();
>  	switch (state) {
>  	case PM_SUSPEND_MEM:
>  		mx5_cpu_lp_set(STOP_POWER_OFF);
> @@ -47,6 +49,7 @@ static int mx5_suspend_enter(suspend_state_t state)
>  		__raw_writel(0, MXC_SRPG_EMPGC1_SRPGCR);
>  	}
>  	cpu_do_idle();
> +	rcu_idle_exit();
>  	return 0;
>  }
>  
> diff --git a/arch/arm/mach-mxs/pm.c b/arch/arm/mach-mxs/pm.c
> index fb042da..36a3a57 100644
> --- a/arch/arm/mach-mxs/pm.c
> +++ b/arch/arm/mach-mxs/pm.c
> @@ -15,16 +15,20 @@
>  #include <linux/kernel.h>
>  #include <linux/suspend.h>
>  #include <linux/io.h>
> +#include <linux/rcupdate.h>
>  #include <mach/system.h>
>  
>  static int mxs_suspend_enter(suspend_state_t state)
>  {
> +	rcu_idle_enter();
>  	switch (state) {
>  	case PM_SUSPEND_MEM:
>  		arch_idle();
> +		rcu_idle_exit();
>  		break;
>  
>  	default:
> +		rcu_idle_exit();
>  		return -EINVAL;
>  	}
>  	return 0;
> diff --git a/arch/arm/mach-omap1/pm.c b/arch/arm/mach-omap1/pm.c
> index 89ea20c..c661eba 100644
> --- a/arch/arm/mach-omap1/pm.c
> +++ b/arch/arm/mach-omap1/pm.c
> @@ -108,11 +108,13 @@ void omap1_pm_idle(void)
>  	__u32 use_idlect1 = arm_idlect1_mask;
>  	int do_sleep = 0;
>  
> +	rcu_idle_enter();
>  	local_irq_disable();
>  	local_fiq_disable();
>  	if (need_resched()) {
>  		local_fiq_enable();
>  		local_irq_enable();
> +		rcu_idle_exit();
>  		return;
>  	}
>  
> @@ -158,6 +160,7 @@ void omap1_pm_idle(void)
>  
>  		local_fiq_enable();
>  		local_irq_enable();
> +		rcu_idle_exit();
>  		return;
>  	}
>  	omap_sram_suspend(omap_readl(ARM_IDLECT1),
> @@ -165,6 +168,7 @@ void omap1_pm_idle(void)
>  
>  	local_fiq_enable();
>  	local_irq_enable();
> +	rcu_idle_exit();
>  }
>  
>  /*
> diff --git a/arch/arm/mach-omap2/pm24xx.c b/arch/arm/mach-omap2/pm24xx.c
> index b8822f8..2139e9d 100644
> --- a/arch/arm/mach-omap2/pm24xx.c
> +++ b/arch/arm/mach-omap2/pm24xx.c
> @@ -232,6 +232,7 @@ static int omap2_can_sleep(void)
>  
>  static void omap2_pm_idle(void)
>  {
> +	rcu_idle_enter();
>  	local_irq_disable();
>  	local_fiq_disable();
>  
> @@ -250,6 +251,7 @@ static void omap2_pm_idle(void)
>  out:
>  	local_fiq_enable();
>  	local_irq_enable();
> +	rcu_idle_exit();
>  }
>  
>  #ifdef CONFIG_SUSPEND
> diff --git a/arch/arm/mach-omap2/pm34xx.c b/arch/arm/mach-omap2/pm34xx.c
> index fc69875..58a8689 100644
> --- a/arch/arm/mach-omap2/pm34xx.c
> +++ b/arch/arm/mach-omap2/pm34xx.c
> @@ -427,7 +427,9 @@ static void omap3_pm_idle(void)
>  	trace_power_start(POWER_CSTATE, 1, smp_processor_id());
>  	trace_cpu_idle(1, smp_processor_id());
>  
> +	rcu_idle_enter();
>  	omap_sram_idle();
> +	rcu_idle_exit();
>  
>  	trace_power_end(smp_processor_id());
>  	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
> diff --git a/arch/arm/mach-omap2/pm44xx.c b/arch/arm/mach-omap2/pm44xx.c
> index c264ef7..038796c 100644
> --- a/arch/arm/mach-omap2/pm44xx.c
> +++ b/arch/arm/mach-omap2/pm44xx.c
> @@ -16,6 +16,7 @@
>  #include <linux/list.h>
>  #include <linux/err.h>
>  #include <linux/slab.h>
> +#include <linux/rcupdate.h>
>  
>  #include "common.h"
>  #include "clockdomain.h"
> @@ -178,6 +179,7 @@ static int __init pwrdms_setup(struct powerdomain *pwrdm, void *unused)
>   */
>  static void omap_default_idle(void)
>  {
> +	rcu_idle_enter();
>  	local_irq_disable();
>  	local_fiq_disable();
>  
> @@ -185,6 +187,7 @@ static void omap_default_idle(void)
>  
>  	local_fiq_enable();
>  	local_irq_enable();
> +	rcu_idle_exit();
>  }
>  
>  /**
> diff --git a/arch/arm/mach-pnx4008/pm.c b/arch/arm/mach-pnx4008/pm.c
> index f3e60a0..0b8703f 100644
> --- a/arch/arm/mach-pnx4008/pm.c
> +++ b/arch/arm/mach-pnx4008/pm.c
> @@ -102,6 +102,7 @@ static inline void pnx4008_suspend(void)
>  
>  static int pnx4008_pm_enter(suspend_state_t state)
>  {
> +	rcu_idle_enter();
>  	switch (state) {
>  	case PM_SUSPEND_STANDBY:
>  		pnx4008_standby();
> @@ -110,6 +111,7 @@ static int pnx4008_pm_enter(suspend_state_t state)
>  		pnx4008_suspend();
>  		break;
>  	}
> +	rcu_idle_exit();
>  	return 0;
>  }
>  
> diff --git a/arch/arm/mach-prima2/pm.c b/arch/arm/mach-prima2/pm.c
> index 26ebb57..1e35c8a 100644
> --- a/arch/arm/mach-prima2/pm.c
> +++ b/arch/arm/mach-prima2/pm.c
> @@ -15,6 +15,7 @@
>  #include <linux/of_device.h>
>  #include <linux/of_platform.h>
>  #include <linux/io.h>
> +#include <linux/rcupdate.h>
>  #include <linux/rtc/sirfsoc_rtciobrg.h>
>  #include <asm/suspend.h>
>  #include <asm/hardware/cache-l2x0.h>
> @@ -64,6 +65,7 @@ static int sirfsoc_pre_suspend_power_off(void)
>  
>  static int sirfsoc_pm_enter(suspend_state_t state)
>  {
> +	rcu_idle_enter(); /* No good place for this: locking->lockdep->RCU. */
>  	switch (state) {
>  	case PM_SUSPEND_MEM:
>  		sirfsoc_pre_suspend_power_off();
> @@ -73,8 +75,10 @@ static int sirfsoc_pm_enter(suspend_state_t state)
>  		/* go zzz */
>  		cpu_suspend(0, sirfsoc_finish_suspend);
>  		outer_resume();
> +		rcu_idle_exit();
>  		break;
>  	default:
> +		rcu_idle_exit();
>  		return -EINVAL;
>  	}
>  	return 0;
> diff --git a/arch/arm/mach-s5p64x0/common.c b/arch/arm/mach-s5p64x0/common.c
> index 52b89a3..c59a7f2 100644
> --- a/arch/arm/mach-s5p64x0/common.c
> +++ b/arch/arm/mach-s5p64x0/common.c
> @@ -146,6 +146,7 @@ static void s5p64x0_idle(void)
>  {
>  	unsigned long val;
>  
> +	rcu_idle_enter();
>  	if (!need_resched()) {
>  		val = __raw_readl(S5P64X0_PWR_CFG);
>  		val &= ~(0x3 << 5);
> @@ -154,6 +155,7 @@ static void s5p64x0_idle(void)
>  
>  		cpu_do_idle();
>  	}
> +	rcu_idle_exit();
>  	local_irq_enable();
>  }
>  
> diff --git a/arch/arm/mach-s5pc100/common.c b/arch/arm/mach-s5pc100/common.c
> index c909573..7bf02ff 100644
> --- a/arch/arm/mach-s5pc100/common.c
> +++ b/arch/arm/mach-s5pc100/common.c
> @@ -131,9 +131,11 @@ static struct map_desc s5pc100_iodesc[] __initdata = {
>  
>  static void s5pc100_idle(void)
>  {
> +	rcu_idle_enter();
>  	if (!need_resched())
>  		cpu_do_idle();
>  
> +	rcu_idle_exit();
>  	local_irq_enable();
>  }
>  
> diff --git a/arch/arm/mach-s5pv210/common.c b/arch/arm/mach-s5pv210/common.c
> index 9c1bcdc..ccd6b4d 100644
> --- a/arch/arm/mach-s5pv210/common.c
> +++ b/arch/arm/mach-s5pv210/common.c
> @@ -144,9 +144,11 @@ static struct map_desc s5pv210_iodesc[] __initdata = {
>  
>  static void s5pv210_idle(void)
>  {
> +	rcu_idle_enter();
>  	if (!need_resched())
>  		cpu_do_idle();
>  
> +	rcu_idle_exit();
>  	local_irq_enable();
>  }
>  
> diff --git a/arch/arm/mach-shmobile/cpuidle.c b/arch/arm/mach-shmobile/cpuidle.c
> index 1b23342..651e92b 100644
> --- a/arch/arm/mach-shmobile/cpuidle.c
> +++ b/arch/arm/mach-shmobile/cpuidle.c
> @@ -13,6 +13,7 @@
>  #include <linux/suspend.h>
>  #include <linux/module.h>
>  #include <linux/err.h>
> +#include <linux/rcupdate.h>
>  #include <asm/system.h>
>  #include <asm/io.h>
>  
> @@ -33,6 +34,7 @@ static int shmobile_cpuidle_enter(struct cpuidle_device *dev,
>  
>  	before = ktime_get();
>  
> +	rcu_idle_enter();
>  	local_irq_disable();
>  	local_fiq_disable();
>  
> @@ -40,6 +42,7 @@ static int shmobile_cpuidle_enter(struct cpuidle_device *dev,
>  
>  	local_irq_enable();
>  	local_fiq_enable();
> +	rcu_idle_exit();
>  
>  	after = ktime_get();
>  	dev->last_residency = ktime_to_ns(ktime_sub(after, before)) >> 10;
> diff --git a/arch/arm/mach-shmobile/pm-sh7372.c b/arch/arm/mach-shmobile/pm-sh7372.c
> index fcf8b17..e11507e 100644
> --- a/arch/arm/mach-shmobile/pm-sh7372.c
> +++ b/arch/arm/mach-shmobile/pm-sh7372.c
> @@ -21,6 +21,7 @@
>  #include <linux/irq.h>
>  #include <linux/bitrev.h>
>  #include <linux/console.h>
> +#include <linux/rcupdate.h>
>  #include <asm/system.h>
>  #include <asm/io.h>
>  #include <asm/tlbflush.h>
> @@ -520,17 +521,24 @@ static int sh7372_enter_suspend(suspend_state_t suspend_state)
>  		    sh7372_a4s.genpd.status == GPD_STATE_POWER_OFF) {
>  			/* enter A4S sleep with PLLC0 off */
>  			pr_debug("entering A4S\n");
> +			rcu_idle_enter();
>  			sh7372_enter_a4s_common(0);
> +			rcu_idle_exit();
>  		} else {
>  			/* enter A3SM sleep with PLLC0 off */
>  			pr_debug("entering A3SM\n");
> +			rcu_idle_enter();
>  			sh7372_enter_a3sm_common(0);
> +			rcu_idle_exit();
>  		}
>  	} else {
>  		/* default to Core Standby that supports all wakeup sources */
>  		pr_debug("entering Core Standby\n");
> +		rcu_idle_enter();
>  		sh7372_enter_core_standby();
> +		rcu_idle_exit();
>  	}
> +
>  	return 0;
>  }
>
Nicolas Pitre Feb. 2, 2012, 3:49 a.m. UTC | #2
On Wed, 1 Feb 2012, Paul E. McKenney wrote:

> From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> 
> The idle loop is a quiscent state for RCU, which means that RCU ignores
> CPUs that have told RCU that they are idle via rcu_idle_enter().
> There are nevertheless quite a few places where idle CPUs use RCU, most
> commonly indirectly via tracing.  This patch fixes these problems for ARM.
> 
> Many of these bugs have been in the kernel for quite some time, but
> Frederic's recent change now gives warnings.
> 
> This patch takes the straightforward approach of pushing the
> rcu_idle_enter()/rcu_idle_exit() pair further down into the core of the
> idle loop.

NAK.

1) This is going to conflict with a patch series I've pushed to RMK 
   already.  You can see its result in linux-next.

2) The purpose of (1) is to do precisely the very opposite i.e. move as 
   much common knowledge as possible up the idle call paths and leave 
   the platform specific quirks as bare as possible, if any.

So I'd much prefer if this change was constrained to be inside 
cpu_idle(), or at least in its close vicinity.

> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: Andrew Victor <linux@maxim.org.za>
> Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
> Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
> Cc: Sekhar Nori <nsekhar@ti.com>
> Cc: Kevin Hilman <khilman@ti.com>
> Cc: Kukjin Kim <kgene.kim@samsung.com>
> Cc: Rob Herring <rob.herring@calxeda.com>
> Cc: Sascha Hauer <kernel@pengutronix.de>
> Cc: Shawn Guo <shawn.guo@linaro.org>
> Cc: Lennert Buytenhek <kernel@wantstofly.org>
> Cc: Nicolas Pitre <nico@fluxnic.net>
> Cc: Amit Kucheria <amit.kucheria@canonical.com>
> Cc: Tony Lindgren <tony@atomide.com>
> Cc: Barry Song <baohua.song@csr.com>
> Cc: Paul Mundt <lethal@linux-sh.org>
> Cc: Magnus Damm <magnus.damm@gmail.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: H Hartley Sweeten <hsweeten@visionengravers.com>
> Cc: Trinabh Gupta <g.trinabh@gmail.com>
> Cc: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
> Cc: Len Brown <len.brown@intel.com>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-samsung-soc@vger.kernel.org
> Cc: linux-omap@vger.kernel.org
> ---
>  arch/arm/kernel/process.c          |    2 --
>  arch/arm/mach-at91/cpuidle.c       |    3 +++
>  arch/arm/mach-davinci/cpuidle.c    |    3 +++
>  arch/arm/mach-exynos/common.c      |    2 ++
>  arch/arm/mach-highbank/pm.c        |   12 ++++++++++++
>  arch/arm/mach-imx/mm-imx3.c        |    3 +++
>  arch/arm/mach-imx/pm-imx27.c       |    4 ++++
>  arch/arm/mach-imx/pm-imx6q.c       |    4 ++++
>  arch/arm/mach-kirkwood/cpuidle.c   |    3 +++
>  arch/arm/mach-mx5/mm.c             |    3 +++
>  arch/arm/mach-mx5/pm-imx5.c        |    3 +++
>  arch/arm/mach-mxs/pm.c             |    4 ++++
>  arch/arm/mach-omap1/pm.c           |    4 ++++
>  arch/arm/mach-omap2/pm24xx.c       |    2 ++
>  arch/arm/mach-omap2/pm34xx.c       |    2 ++
>  arch/arm/mach-omap2/pm44xx.c       |    3 +++
>  arch/arm/mach-pnx4008/pm.c         |    2 ++
>  arch/arm/mach-prima2/pm.c          |    4 ++++
>  arch/arm/mach-s5p64x0/common.c     |    2 ++
>  arch/arm/mach-s5pc100/common.c     |    2 ++
>  arch/arm/mach-s5pv210/common.c     |    2 ++
>  arch/arm/mach-shmobile/cpuidle.c   |    3 +++
>  arch/arm/mach-shmobile/pm-sh7372.c |    8 ++++++++
>  23 files changed, 78 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
> index 971d65c..8241df7 100644
> --- a/arch/arm/kernel/process.c
> +++ b/arch/arm/kernel/process.c
> @@ -207,7 +207,6 @@ void cpu_idle(void)
>  	/* endless idle loop with no priority at all */
>  	while (1) {
>  		tick_nohz_idle_enter();
> -		rcu_idle_enter();
>  		leds_event(led_idle_start);
>  		while (!need_resched()) {
>  #ifdef CONFIG_HOTPLUG_CPU
> @@ -237,7 +236,6 @@ void cpu_idle(void)
>  			}
>  		}
>  		leds_event(led_idle_end);
> -		rcu_idle_exit();
>  		tick_nohz_idle_exit();
>  		preempt_enable_no_resched();
>  		schedule();
> diff --git a/arch/arm/mach-at91/cpuidle.c b/arch/arm/mach-at91/cpuidle.c
> index a851e6c..8feff6b 100644
> --- a/arch/arm/mach-at91/cpuidle.c
> +++ b/arch/arm/mach-at91/cpuidle.c
> @@ -20,6 +20,7 @@
>  #include <asm/proc-fns.h>
>  #include <linux/io.h>
>  #include <linux/export.h>
> +#include <linux/rcupdate.h>
>  
>  #include "pm.h"
>  
> @@ -43,6 +44,7 @@ static int at91_enter_idle(struct cpuidle_device *dev,
>  
>  	local_irq_disable();
>  	do_gettimeofday(&before);
> +	rcu_idle_enter();
>  	if (index == 0)
>  		/* Wait for interrupt state */
>  		cpu_do_idle();
> @@ -53,6 +55,7 @@ static int at91_enter_idle(struct cpuidle_device *dev,
>  		cpu_do_idle();
>  		sdram_selfrefresh_disable(saved_lpr);
>  	}
> +	rcu_idle_exit();
>  	do_gettimeofday(&after);
>  	local_irq_enable();
>  	idle_time = (after.tv_sec - before.tv_sec) * USEC_PER_SEC +
> diff --git a/arch/arm/mach-davinci/cpuidle.c b/arch/arm/mach-davinci/cpuidle.c
> index a30c7c5..6594b7e 100644
> --- a/arch/arm/mach-davinci/cpuidle.c
> +++ b/arch/arm/mach-davinci/cpuidle.c
> @@ -17,6 +17,7 @@
>  #include <linux/cpuidle.h>
>  #include <linux/io.h>
>  #include <linux/export.h>
> +#include <linux/rcupdate.h>
>  #include <asm/proc-fns.h>
>  
>  #include <mach/cpuidle.h>
> @@ -90,12 +91,14 @@ static int davinci_enter_idle(struct cpuidle_device *dev,
>  	local_irq_disable();
>  	do_gettimeofday(&before);
>  
> +	rcu_idle_enter();
>  	if (ops && ops->enter)
>  		ops->enter(ops->flags);
>  	/* Wait for interrupt state */
>  	cpu_do_idle();
>  	if (ops && ops->exit)
>  		ops->exit(ops->flags);
> +	rcu_idle_exit();
>  
>  	do_gettimeofday(&after);
>  	local_irq_enable();
> diff --git a/arch/arm/mach-exynos/common.c b/arch/arm/mach-exynos/common.c
> index c59e188..1e5844a 100644
> --- a/arch/arm/mach-exynos/common.c
> +++ b/arch/arm/mach-exynos/common.c
> @@ -203,8 +203,10 @@ static struct map_desc exynos4_iodesc1[] __initdata = {
>  
>  static void exynos_idle(void)
>  {
> +	rcu_idle_enter();
>  	if (!need_resched())
>  		cpu_do_idle();
> +	rcu_idle_exit();
>  
>  	local_irq_enable();
>  }
> diff --git a/arch/arm/mach-highbank/pm.c b/arch/arm/mach-highbank/pm.c
> index 33b3beb..5ecc4bc 100644
> --- a/arch/arm/mach-highbank/pm.c
> +++ b/arch/arm/mach-highbank/pm.c
> @@ -17,6 +17,7 @@
>  #include <linux/init.h>
>  #include <linux/io.h>
>  #include <linux/suspend.h>
> +#include <linux/rcupdate.h>
>  
>  #include <asm/proc-fns.h>
>  #include <asm/smp_scu.h>
> @@ -33,12 +34,23 @@ static int highbank_suspend_finish(unsigned long val)
>  
>  static int highbank_pm_enter(suspend_state_t state)
>  {
> +	/*
> +	 * Some of the functions preceding the cpu_suspend() can
> +	 * invoke RCU, but only in case of failure to disable preemption
> +	 * or preempt_disable() misnesting.  Assume that these errors
> +	 * are not committed in the following code, because RCU just
> +	 * doesn't want to run on a CPU that has its caches disabled.
> +	 */
> +	rcu_idle_enter();
> +
>  	hignbank_set_pwr_suspend();
>  	highbank_set_cpu_jump(0, cpu_resume);
>  
>  	scu_power_mode(scu_base_addr, SCU_PM_POWEROFF);
>  	cpu_suspend(0, highbank_suspend_finish);
>  
> +	rcu_idle_exit();
> +
>  	return 0;
>  }
>  
> diff --git a/arch/arm/mach-imx/mm-imx3.c b/arch/arm/mach-imx/mm-imx3.c
> index 31807d2..0945b23 100644
> --- a/arch/arm/mach-imx/mm-imx3.c
> +++ b/arch/arm/mach-imx/mm-imx3.c
> @@ -19,6 +19,7 @@
>  #include <linux/mm.h>
>  #include <linux/init.h>
>  #include <linux/err.h>
> +#include <linux/rcupdate.h>
>  
>  #include <asm/pgtable.h>
>  #include <asm/hardware/cache-l2x0.h>
> @@ -34,6 +35,7 @@ static void imx3_idle(void)
>  {
>  	unsigned long reg = 0;
>  
> +	rcu_idle_enter();
>  	if (!need_resched())
>  		__asm__ __volatile__(
>  			/* disable I and D cache */
> @@ -58,6 +60,7 @@ static void imx3_idle(void)
>  			"orr %0, %0, #0x00000004\n"
>  			"mcr p15, 0, %0, c1, c0, 0\n"
>  			: "=r" (reg));
> +	rcu_idle_exit();
>  	local_irq_enable();
>  }
>  
> diff --git a/arch/arm/mach-imx/pm-imx27.c b/arch/arm/mach-imx/pm-imx27.c
> index e455d2f..c4adc05 100644
> --- a/arch/arm/mach-imx/pm-imx27.c
> +++ b/arch/arm/mach-imx/pm-imx27.c
> @@ -10,12 +10,14 @@
>  #include <linux/kernel.h>
>  #include <linux/suspend.h>
>  #include <linux/io.h>
> +#include <linux/rcupdate.h>
>  #include <mach/system.h>
>  #include <mach/hardware.h>
>  
>  static int mx27_suspend_enter(suspend_state_t state)
>  {
>  	u32 cscr;
> +	rcu_idle_enter();
>  	switch (state) {
>  	case PM_SUSPEND_MEM:
>  		/* Clear MPEN and SPEN to disable MPLL/SPLL */
> @@ -24,9 +26,11 @@ static int mx27_suspend_enter(suspend_state_t state)
>  		__raw_writel(cscr, MX27_IO_ADDRESS(MX27_CCM_BASE_ADDR));
>  		/* Executes WFI */
>  		arch_idle();
> +		rcu_idle_exit();
>  		break;
>  
>  	default:
> +		rcu_idle_exit();
>  		return -EINVAL;
>  	}
>  	return 0;
> diff --git a/arch/arm/mach-imx/pm-imx6q.c b/arch/arm/mach-imx/pm-imx6q.c
> index f7b0c2b..b9c2589 100644
> --- a/arch/arm/mach-imx/pm-imx6q.c
> +++ b/arch/arm/mach-imx/pm-imx6q.c
> @@ -14,6 +14,7 @@
>  #include <linux/io.h>
>  #include <linux/of.h>
>  #include <linux/suspend.h>
> +#include <linux/rcupdate.h>
>  #include <asm/cacheflush.h>
>  #include <asm/proc-fns.h>
>  #include <asm/suspend.h>
> @@ -31,6 +32,7 @@ static int imx6q_suspend_finish(unsigned long val)
>  
>  static int imx6q_pm_enter(suspend_state_t state)
>  {
> +	rcu_idle_enter();
>  	switch (state) {
>  	case PM_SUSPEND_MEM:
>  		imx6q_set_lpm(STOP_POWER_OFF);
> @@ -40,8 +42,10 @@ static int imx6q_pm_enter(suspend_state_t state)
>  		cpu_suspend(0, imx6q_suspend_finish);
>  		imx_smp_prepare();
>  		imx_gpc_post_resume();
> +		rcu_idle_exit();
>  		break;
>  	default:
> +		rcu_idle_exit();
>  		return -EINVAL;
>  	}
>  
> diff --git a/arch/arm/mach-kirkwood/cpuidle.c b/arch/arm/mach-kirkwood/cpuidle.c
> index 7088180..de8e317 100644
> --- a/arch/arm/mach-kirkwood/cpuidle.c
> +++ b/arch/arm/mach-kirkwood/cpuidle.c
> @@ -19,6 +19,7 @@
>  #include <linux/cpuidle.h>
>  #include <linux/io.h>
>  #include <linux/export.h>
> +#include <linux/rcupdate.h>
>  #include <asm/proc-fns.h>
>  #include <mach/kirkwood.h>
>  
> @@ -41,6 +42,7 @@ static int kirkwood_enter_idle(struct cpuidle_device *dev,
>  
>  	local_irq_disable();
>  	do_gettimeofday(&before);
> +	rcu_idle_enter();
>  	if (index == 0)
>  		/* Wait for interrupt state */
>  		cpu_do_idle();
> @@ -55,6 +57,7 @@ static int kirkwood_enter_idle(struct cpuidle_device *dev,
>  		writel(0x7, DDR_OPERATION_BASE);
>  		cpu_do_idle();
>  	}
> +	rcu_idle_exit();
>  	do_gettimeofday(&after);
>  	local_irq_enable();
>  	idle_time = (after.tv_sec - before.tv_sec) * USEC_PER_SEC +
> diff --git a/arch/arm/mach-mx5/mm.c b/arch/arm/mach-mx5/mm.c
> index bc17dfe..d919648 100644
> --- a/arch/arm/mach-mx5/mm.c
> +++ b/arch/arm/mach-mx5/mm.c
> @@ -14,6 +14,7 @@
>  #include <linux/mm.h>
>  #include <linux/init.h>
>  #include <linux/clk.h>
> +#include <linux/rcupdate.h>
>  
>  #include <asm/mach/map.h>
>  
> @@ -37,7 +38,9 @@ static void imx5_idle(void)
>  		mx5_cpu_lp_set(WAIT_UNCLOCKED_POWER_OFF);
>  		if (tzic_enable_wake())
>  			goto err1;
> +		rcu_idle_enter();  /* No tracing until rcu_idle_exit(). */
>  		cpu_do_idle();
> +		rcu_idle_exit();
>  err1:
>  		clk_disable(gpc_dvfs_clk);
>  	}
> diff --git a/arch/arm/mach-mx5/pm-imx5.c b/arch/arm/mach-mx5/pm-imx5.c
> index 98052fc..b4adf42 100644
> --- a/arch/arm/mach-mx5/pm-imx5.c
> +++ b/arch/arm/mach-mx5/pm-imx5.c
> @@ -12,6 +12,7 @@
>  #include <linux/clk.h>
>  #include <linux/io.h>
>  #include <linux/err.h>
> +#include <linux/rcupdate.h>
>  #include <asm/cacheflush.h>
>  #include <asm/tlbflush.h>
>  #include <mach/common.h>
> @@ -27,6 +28,7 @@ static int mx5_suspend_prepare(void)
>  
>  static int mx5_suspend_enter(suspend_state_t state)
>  {
> +	rcu_idle_enter();
>  	switch (state) {
>  	case PM_SUSPEND_MEM:
>  		mx5_cpu_lp_set(STOP_POWER_OFF);
> @@ -47,6 +49,7 @@ static int mx5_suspend_enter(suspend_state_t state)
>  		__raw_writel(0, MXC_SRPG_EMPGC1_SRPGCR);
>  	}
>  	cpu_do_idle();
> +	rcu_idle_exit();
>  	return 0;
>  }
>  
> diff --git a/arch/arm/mach-mxs/pm.c b/arch/arm/mach-mxs/pm.c
> index fb042da..36a3a57 100644
> --- a/arch/arm/mach-mxs/pm.c
> +++ b/arch/arm/mach-mxs/pm.c
> @@ -15,16 +15,20 @@
>  #include <linux/kernel.h>
>  #include <linux/suspend.h>
>  #include <linux/io.h>
> +#include <linux/rcupdate.h>
>  #include <mach/system.h>
>  
>  static int mxs_suspend_enter(suspend_state_t state)
>  {
> +	rcu_idle_enter();
>  	switch (state) {
>  	case PM_SUSPEND_MEM:
>  		arch_idle();
> +		rcu_idle_exit();
>  		break;
>  
>  	default:
> +		rcu_idle_exit();
>  		return -EINVAL;
>  	}
>  	return 0;
> diff --git a/arch/arm/mach-omap1/pm.c b/arch/arm/mach-omap1/pm.c
> index 89ea20c..c661eba 100644
> --- a/arch/arm/mach-omap1/pm.c
> +++ b/arch/arm/mach-omap1/pm.c
> @@ -108,11 +108,13 @@ void omap1_pm_idle(void)
>  	__u32 use_idlect1 = arm_idlect1_mask;
>  	int do_sleep = 0;
>  
> +	rcu_idle_enter();
>  	local_irq_disable();
>  	local_fiq_disable();
>  	if (need_resched()) {
>  		local_fiq_enable();
>  		local_irq_enable();
> +		rcu_idle_exit();
>  		return;
>  	}
>  
> @@ -158,6 +160,7 @@ void omap1_pm_idle(void)
>  
>  		local_fiq_enable();
>  		local_irq_enable();
> +		rcu_idle_exit();
>  		return;
>  	}
>  	omap_sram_suspend(omap_readl(ARM_IDLECT1),
> @@ -165,6 +168,7 @@ void omap1_pm_idle(void)
>  
>  	local_fiq_enable();
>  	local_irq_enable();
> +	rcu_idle_exit();
>  }
>  
>  /*
> diff --git a/arch/arm/mach-omap2/pm24xx.c b/arch/arm/mach-omap2/pm24xx.c
> index b8822f8..2139e9d 100644
> --- a/arch/arm/mach-omap2/pm24xx.c
> +++ b/arch/arm/mach-omap2/pm24xx.c
> @@ -232,6 +232,7 @@ static int omap2_can_sleep(void)
>  
>  static void omap2_pm_idle(void)
>  {
> +	rcu_idle_enter();
>  	local_irq_disable();
>  	local_fiq_disable();
>  
> @@ -250,6 +251,7 @@ static void omap2_pm_idle(void)
>  out:
>  	local_fiq_enable();
>  	local_irq_enable();
> +	rcu_idle_exit();
>  }
>  
>  #ifdef CONFIG_SUSPEND
> diff --git a/arch/arm/mach-omap2/pm34xx.c b/arch/arm/mach-omap2/pm34xx.c
> index fc69875..58a8689 100644
> --- a/arch/arm/mach-omap2/pm34xx.c
> +++ b/arch/arm/mach-omap2/pm34xx.c
> @@ -427,7 +427,9 @@ static void omap3_pm_idle(void)
>  	trace_power_start(POWER_CSTATE, 1, smp_processor_id());
>  	trace_cpu_idle(1, smp_processor_id());
>  
> +	rcu_idle_enter();
>  	omap_sram_idle();
> +	rcu_idle_exit();
>  
>  	trace_power_end(smp_processor_id());
>  	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
> diff --git a/arch/arm/mach-omap2/pm44xx.c b/arch/arm/mach-omap2/pm44xx.c
> index c264ef7..038796c 100644
> --- a/arch/arm/mach-omap2/pm44xx.c
> +++ b/arch/arm/mach-omap2/pm44xx.c
> @@ -16,6 +16,7 @@
>  #include <linux/list.h>
>  #include <linux/err.h>
>  #include <linux/slab.h>
> +#include <linux/rcupdate.h>
>  
>  #include "common.h"
>  #include "clockdomain.h"
> @@ -178,6 +179,7 @@ static int __init pwrdms_setup(struct powerdomain *pwrdm, void *unused)
>   */
>  static void omap_default_idle(void)
>  {
> +	rcu_idle_enter();
>  	local_irq_disable();
>  	local_fiq_disable();
>  
> @@ -185,6 +187,7 @@ static void omap_default_idle(void)
>  
>  	local_fiq_enable();
>  	local_irq_enable();
> +	rcu_idle_exit();
>  }
>  
>  /**
> diff --git a/arch/arm/mach-pnx4008/pm.c b/arch/arm/mach-pnx4008/pm.c
> index f3e60a0..0b8703f 100644
> --- a/arch/arm/mach-pnx4008/pm.c
> +++ b/arch/arm/mach-pnx4008/pm.c
> @@ -102,6 +102,7 @@ static inline void pnx4008_suspend(void)
>  
>  static int pnx4008_pm_enter(suspend_state_t state)
>  {
> +	rcu_idle_enter();
>  	switch (state) {
>  	case PM_SUSPEND_STANDBY:
>  		pnx4008_standby();
> @@ -110,6 +111,7 @@ static int pnx4008_pm_enter(suspend_state_t state)
>  		pnx4008_suspend();
>  		break;
>  	}
> +	rcu_idle_exit();
>  	return 0;
>  }
>  
> diff --git a/arch/arm/mach-prima2/pm.c b/arch/arm/mach-prima2/pm.c
> index 26ebb57..1e35c8a 100644
> --- a/arch/arm/mach-prima2/pm.c
> +++ b/arch/arm/mach-prima2/pm.c
> @@ -15,6 +15,7 @@
>  #include <linux/of_device.h>
>  #include <linux/of_platform.h>
>  #include <linux/io.h>
> +#include <linux/rcupdate.h>
>  #include <linux/rtc/sirfsoc_rtciobrg.h>
>  #include <asm/suspend.h>
>  #include <asm/hardware/cache-l2x0.h>
> @@ -64,6 +65,7 @@ static int sirfsoc_pre_suspend_power_off(void)
>  
>  static int sirfsoc_pm_enter(suspend_state_t state)
>  {
> +	rcu_idle_enter(); /* No good place for this: locking->lockdep->RCU. */
>  	switch (state) {
>  	case PM_SUSPEND_MEM:
>  		sirfsoc_pre_suspend_power_off();
> @@ -73,8 +75,10 @@ static int sirfsoc_pm_enter(suspend_state_t state)
>  		/* go zzz */
>  		cpu_suspend(0, sirfsoc_finish_suspend);
>  		outer_resume();
> +		rcu_idle_exit();
>  		break;
>  	default:
> +		rcu_idle_exit();
>  		return -EINVAL;
>  	}
>  	return 0;
> diff --git a/arch/arm/mach-s5p64x0/common.c b/arch/arm/mach-s5p64x0/common.c
> index 52b89a3..c59a7f2 100644
> --- a/arch/arm/mach-s5p64x0/common.c
> +++ b/arch/arm/mach-s5p64x0/common.c
> @@ -146,6 +146,7 @@ static void s5p64x0_idle(void)
>  {
>  	unsigned long val;
>  
> +	rcu_idle_enter();
>  	if (!need_resched()) {
>  		val = __raw_readl(S5P64X0_PWR_CFG);
>  		val &= ~(0x3 << 5);
> @@ -154,6 +155,7 @@ static void s5p64x0_idle(void)
>  
>  		cpu_do_idle();
>  	}
> +	rcu_idle_exit();
>  	local_irq_enable();
>  }
>  
> diff --git a/arch/arm/mach-s5pc100/common.c b/arch/arm/mach-s5pc100/common.c
> index c909573..7bf02ff 100644
> --- a/arch/arm/mach-s5pc100/common.c
> +++ b/arch/arm/mach-s5pc100/common.c
> @@ -131,9 +131,11 @@ static struct map_desc s5pc100_iodesc[] __initdata = {
>  
>  static void s5pc100_idle(void)
>  {
> +	rcu_idle_enter();
>  	if (!need_resched())
>  		cpu_do_idle();
>  
> +	rcu_idle_exit();
>  	local_irq_enable();
>  }
>  
> diff --git a/arch/arm/mach-s5pv210/common.c b/arch/arm/mach-s5pv210/common.c
> index 9c1bcdc..ccd6b4d 100644
> --- a/arch/arm/mach-s5pv210/common.c
> +++ b/arch/arm/mach-s5pv210/common.c
> @@ -144,9 +144,11 @@ static struct map_desc s5pv210_iodesc[] __initdata = {
>  
>  static void s5pv210_idle(void)
>  {
> +	rcu_idle_enter();
>  	if (!need_resched())
>  		cpu_do_idle();
>  
> +	rcu_idle_exit();
>  	local_irq_enable();
>  }
>  
> diff --git a/arch/arm/mach-shmobile/cpuidle.c b/arch/arm/mach-shmobile/cpuidle.c
> index 1b23342..651e92b 100644
> --- a/arch/arm/mach-shmobile/cpuidle.c
> +++ b/arch/arm/mach-shmobile/cpuidle.c
> @@ -13,6 +13,7 @@
>  #include <linux/suspend.h>
>  #include <linux/module.h>
>  #include <linux/err.h>
> +#include <linux/rcupdate.h>
>  #include <asm/system.h>
>  #include <asm/io.h>
>  
> @@ -33,6 +34,7 @@ static int shmobile_cpuidle_enter(struct cpuidle_device *dev,
>  
>  	before = ktime_get();
>  
> +	rcu_idle_enter();
>  	local_irq_disable();
>  	local_fiq_disable();
>  
> @@ -40,6 +42,7 @@ static int shmobile_cpuidle_enter(struct cpuidle_device *dev,
>  
>  	local_irq_enable();
>  	local_fiq_enable();
> +	rcu_idle_exit();
>  
>  	after = ktime_get();
>  	dev->last_residency = ktime_to_ns(ktime_sub(after, before)) >> 10;
> diff --git a/arch/arm/mach-shmobile/pm-sh7372.c b/arch/arm/mach-shmobile/pm-sh7372.c
> index fcf8b17..e11507e 100644
> --- a/arch/arm/mach-shmobile/pm-sh7372.c
> +++ b/arch/arm/mach-shmobile/pm-sh7372.c
> @@ -21,6 +21,7 @@
>  #include <linux/irq.h>
>  #include <linux/bitrev.h>
>  #include <linux/console.h>
> +#include <linux/rcupdate.h>
>  #include <asm/system.h>
>  #include <asm/io.h>
>  #include <asm/tlbflush.h>
> @@ -520,17 +521,24 @@ static int sh7372_enter_suspend(suspend_state_t suspend_state)
>  		    sh7372_a4s.genpd.status == GPD_STATE_POWER_OFF) {
>  			/* enter A4S sleep with PLLC0 off */
>  			pr_debug("entering A4S\n");
> +			rcu_idle_enter();
>  			sh7372_enter_a4s_common(0);
> +			rcu_idle_exit();
>  		} else {
>  			/* enter A3SM sleep with PLLC0 off */
>  			pr_debug("entering A3SM\n");
> +			rcu_idle_enter();
>  			sh7372_enter_a3sm_common(0);
> +			rcu_idle_exit();
>  		}
>  	} else {
>  		/* default to Core Standby that supports all wakeup sources */
>  		pr_debug("entering Core Standby\n");
> +		rcu_idle_enter();
>  		sh7372_enter_core_standby();
> +		rcu_idle_exit();
>  	}
> +
>  	return 0;
>  }
>  
> -- 
> 1.7.8
>
Paul E. McKenney Feb. 2, 2012, 4:40 a.m. UTC | #3
On Wed, Feb 01, 2012 at 08:48:00PM -0600, Rob Herring wrote:
> Paul,
> 
> On 02/01/2012 06:43 PM, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> > 
> > The idle loop is a quiscent state for RCU, which means that RCU ignores
> > CPUs that have told RCU that they are idle via rcu_idle_enter().
> > There are nevertheless quite a few places where idle CPUs use RCU, most
> > commonly indirectly via tracing.  This patch fixes these problems for ARM.
> 
> Do you have some specific examples of actual problems? It's not possible
> to restrict tracing at this low level?
> 
> This really seems like the wrong direction as we are trying to pull
> common pieces up out of platform specific code.

If you yank the tracing, that works for me!  But it is your platform,
so I cannot decide that for you.

							Thanx, Paul

> > Many of these bugs have been in the kernel for quite some time, but
> > Frederic's recent change now gives warnings.
> > 
> > This patch takes the straightforward approach of pushing the
> > rcu_idle_enter()/rcu_idle_exit() pair further down into the core of the
> > idle loop.
> > 
> > Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Russell King <linux@arm.linux.org.uk>
> > Cc: Andrew Victor <linux@maxim.org.za>
> > Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
> > Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
> > Cc: Sekhar Nori <nsekhar@ti.com>
> > Cc: Kevin Hilman <khilman@ti.com>
> > Cc: Kukjin Kim <kgene.kim@samsung.com>
> > Cc: Rob Herring <rob.herring@calxeda.com>
> > Cc: Sascha Hauer <kernel@pengutronix.de>
> > Cc: Shawn Guo <shawn.guo@linaro.org>
> > Cc: Lennert Buytenhek <kernel@wantstofly.org>
> > Cc: Nicolas Pitre <nico@fluxnic.net>
> > Cc: Amit Kucheria <amit.kucheria@canonical.com>
> > Cc: Tony Lindgren <tony@atomide.com>
> > Cc: Barry Song <baohua.song@csr.com>
> > Cc: Paul Mundt <lethal@linux-sh.org>
> > Cc: Magnus Damm <magnus.damm@gmail.com>
> > Cc: Will Deacon <will.deacon@arm.com>
> > Cc: H Hartley Sweeten <hsweeten@visionengravers.com>
> > Cc: Trinabh Gupta <g.trinabh@gmail.com>
> > Cc: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
> > Cc: Len Brown <len.brown@intel.com>
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linux-samsung-soc@vger.kernel.org
> > Cc: linux-omap@vger.kernel.org
> > ---
> >  arch/arm/kernel/process.c          |    2 --
> >  arch/arm/mach-at91/cpuidle.c       |    3 +++
> >  arch/arm/mach-davinci/cpuidle.c    |    3 +++
> >  arch/arm/mach-exynos/common.c      |    2 ++
> >  arch/arm/mach-highbank/pm.c        |   12 ++++++++++++
> >  arch/arm/mach-imx/mm-imx3.c        |    3 +++
> >  arch/arm/mach-imx/pm-imx27.c       |    4 ++++
> >  arch/arm/mach-imx/pm-imx6q.c       |    4 ++++
> >  arch/arm/mach-kirkwood/cpuidle.c   |    3 +++
> >  arch/arm/mach-mx5/mm.c             |    3 +++
> >  arch/arm/mach-mx5/pm-imx5.c        |    3 +++
> >  arch/arm/mach-mxs/pm.c             |    4 ++++
> >  arch/arm/mach-omap1/pm.c           |    4 ++++
> >  arch/arm/mach-omap2/pm24xx.c       |    2 ++
> >  arch/arm/mach-omap2/pm34xx.c       |    2 ++
> >  arch/arm/mach-omap2/pm44xx.c       |    3 +++
> >  arch/arm/mach-pnx4008/pm.c         |    2 ++
> >  arch/arm/mach-prima2/pm.c          |    4 ++++
> >  arch/arm/mach-s5p64x0/common.c     |    2 ++
> >  arch/arm/mach-s5pc100/common.c     |    2 ++
> >  arch/arm/mach-s5pv210/common.c     |    2 ++
> >  arch/arm/mach-shmobile/cpuidle.c   |    3 +++
> >  arch/arm/mach-shmobile/pm-sh7372.c |    8 ++++++++
> >  23 files changed, 78 insertions(+), 2 deletions(-)
> > 
> 
> This is clearly not all platforms and you're missing some cpuidle
> drivers like omap. Some platform's idle code don't need this?
> 
> I'd guess this conflicts with Nico's idle clean-up which is in Russell's
> for-next branch now.
> 
> > diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
> > index 971d65c..8241df7 100644
> > --- a/arch/arm/kernel/process.c
> > +++ b/arch/arm/kernel/process.c
> > @@ -207,7 +207,6 @@ void cpu_idle(void)
> >  	/* endless idle loop with no priority at all */
> >  	while (1) {
> >  		tick_nohz_idle_enter();
> > -		rcu_idle_enter();
> 
> In some cases with the clean-up, there is no platform specific idle
> code, so rcu_idle_enter would never be called.
> 
> Rob
> 
> >  		leds_event(led_idle_start);
> >  		while (!need_resched()) {
> >  #ifdef CONFIG_HOTPLUG_CPU
> > @@ -237,7 +236,6 @@ void cpu_idle(void)
> >  			}
> >  		}
> >  		leds_event(led_idle_end);
> > -		rcu_idle_exit();
> >  		tick_nohz_idle_exit();
> >  		preempt_enable_no_resched();
> >  		schedule();
> > diff --git a/arch/arm/mach-at91/cpuidle.c b/arch/arm/mach-at91/cpuidle.c
> > index a851e6c..8feff6b 100644
> > --- a/arch/arm/mach-at91/cpuidle.c
> > +++ b/arch/arm/mach-at91/cpuidle.c
> > @@ -20,6 +20,7 @@
> >  #include <asm/proc-fns.h>
> >  #include <linux/io.h>
> >  #include <linux/export.h>
> > +#include <linux/rcupdate.h>
> >  
> >  #include "pm.h"
> >  
> > @@ -43,6 +44,7 @@ static int at91_enter_idle(struct cpuidle_device *dev,
> >  
> >  	local_irq_disable();
> >  	do_gettimeofday(&before);
> > +	rcu_idle_enter();
> >  	if (index == 0)
> >  		/* Wait for interrupt state */
> >  		cpu_do_idle();
> > @@ -53,6 +55,7 @@ static int at91_enter_idle(struct cpuidle_device *dev,
> >  		cpu_do_idle();
> >  		sdram_selfrefresh_disable(saved_lpr);
> >  	}
> > +	rcu_idle_exit();
> >  	do_gettimeofday(&after);
> >  	local_irq_enable();
> >  	idle_time = (after.tv_sec - before.tv_sec) * USEC_PER_SEC +
> > diff --git a/arch/arm/mach-davinci/cpuidle.c b/arch/arm/mach-davinci/cpuidle.c
> > index a30c7c5..6594b7e 100644
> > --- a/arch/arm/mach-davinci/cpuidle.c
> > +++ b/arch/arm/mach-davinci/cpuidle.c
> > @@ -17,6 +17,7 @@
> >  #include <linux/cpuidle.h>
> >  #include <linux/io.h>
> >  #include <linux/export.h>
> > +#include <linux/rcupdate.h>
> >  #include <asm/proc-fns.h>
> >  
> >  #include <mach/cpuidle.h>
> > @@ -90,12 +91,14 @@ static int davinci_enter_idle(struct cpuidle_device *dev,
> >  	local_irq_disable();
> >  	do_gettimeofday(&before);
> >  
> > +	rcu_idle_enter();
> >  	if (ops && ops->enter)
> >  		ops->enter(ops->flags);
> >  	/* Wait for interrupt state */
> >  	cpu_do_idle();
> >  	if (ops && ops->exit)
> >  		ops->exit(ops->flags);
> > +	rcu_idle_exit();
> >  
> >  	do_gettimeofday(&after);
> >  	local_irq_enable();
> > diff --git a/arch/arm/mach-exynos/common.c b/arch/arm/mach-exynos/common.c
> > index c59e188..1e5844a 100644
> > --- a/arch/arm/mach-exynos/common.c
> > +++ b/arch/arm/mach-exynos/common.c
> > @@ -203,8 +203,10 @@ static struct map_desc exynos4_iodesc1[] __initdata = {
> >  
> >  static void exynos_idle(void)
> >  {
> > +	rcu_idle_enter();
> >  	if (!need_resched())
> >  		cpu_do_idle();
> > +	rcu_idle_exit();
> >  
> >  	local_irq_enable();
> >  }
> > diff --git a/arch/arm/mach-highbank/pm.c b/arch/arm/mach-highbank/pm.c
> > index 33b3beb..5ecc4bc 100644
> > --- a/arch/arm/mach-highbank/pm.c
> > +++ b/arch/arm/mach-highbank/pm.c
> > @@ -17,6 +17,7 @@
> >  #include <linux/init.h>
> >  #include <linux/io.h>
> >  #include <linux/suspend.h>
> > +#include <linux/rcupdate.h>
> >  
> >  #include <asm/proc-fns.h>
> >  #include <asm/smp_scu.h>
> > @@ -33,12 +34,23 @@ static int highbank_suspend_finish(unsigned long val)
> >  
> >  static int highbank_pm_enter(suspend_state_t state)
> >  {
> > +	/*
> > +	 * Some of the functions preceding the cpu_suspend() can
> > +	 * invoke RCU, but only in case of failure to disable preemption
> > +	 * or preempt_disable() misnesting.  Assume that these errors
> > +	 * are not committed in the following code, because RCU just
> > +	 * doesn't want to run on a CPU that has its caches disabled.
> > +	 */
> > +	rcu_idle_enter();
> > +
> >  	hignbank_set_pwr_suspend();
> >  	highbank_set_cpu_jump(0, cpu_resume);
> >  
> >  	scu_power_mode(scu_base_addr, SCU_PM_POWEROFF);
> >  	cpu_suspend(0, highbank_suspend_finish);
> >  
> > +	rcu_idle_exit();
> > +
> >  	return 0;
> >  }
> >  
> > diff --git a/arch/arm/mach-imx/mm-imx3.c b/arch/arm/mach-imx/mm-imx3.c
> > index 31807d2..0945b23 100644
> > --- a/arch/arm/mach-imx/mm-imx3.c
> > +++ b/arch/arm/mach-imx/mm-imx3.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/mm.h>
> >  #include <linux/init.h>
> >  #include <linux/err.h>
> > +#include <linux/rcupdate.h>
> >  
> >  #include <asm/pgtable.h>
> >  #include <asm/hardware/cache-l2x0.h>
> > @@ -34,6 +35,7 @@ static void imx3_idle(void)
> >  {
> >  	unsigned long reg = 0;
> >  
> > +	rcu_idle_enter();
> >  	if (!need_resched())
> >  		__asm__ __volatile__(
> >  			/* disable I and D cache */
> > @@ -58,6 +60,7 @@ static void imx3_idle(void)
> >  			"orr %0, %0, #0x00000004\n"
> >  			"mcr p15, 0, %0, c1, c0, 0\n"
> >  			: "=r" (reg));
> > +	rcu_idle_exit();
> >  	local_irq_enable();
> >  }
> >  
> > diff --git a/arch/arm/mach-imx/pm-imx27.c b/arch/arm/mach-imx/pm-imx27.c
> > index e455d2f..c4adc05 100644
> > --- a/arch/arm/mach-imx/pm-imx27.c
> > +++ b/arch/arm/mach-imx/pm-imx27.c
> > @@ -10,12 +10,14 @@
> >  #include <linux/kernel.h>
> >  #include <linux/suspend.h>
> >  #include <linux/io.h>
> > +#include <linux/rcupdate.h>
> >  #include <mach/system.h>
> >  #include <mach/hardware.h>
> >  
> >  static int mx27_suspend_enter(suspend_state_t state)
> >  {
> >  	u32 cscr;
> > +	rcu_idle_enter();
> >  	switch (state) {
> >  	case PM_SUSPEND_MEM:
> >  		/* Clear MPEN and SPEN to disable MPLL/SPLL */
> > @@ -24,9 +26,11 @@ static int mx27_suspend_enter(suspend_state_t state)
> >  		__raw_writel(cscr, MX27_IO_ADDRESS(MX27_CCM_BASE_ADDR));
> >  		/* Executes WFI */
> >  		arch_idle();
> > +		rcu_idle_exit();
> >  		break;
> >  
> >  	default:
> > +		rcu_idle_exit();
> >  		return -EINVAL;
> >  	}
> >  	return 0;
> > diff --git a/arch/arm/mach-imx/pm-imx6q.c b/arch/arm/mach-imx/pm-imx6q.c
> > index f7b0c2b..b9c2589 100644
> > --- a/arch/arm/mach-imx/pm-imx6q.c
> > +++ b/arch/arm/mach-imx/pm-imx6q.c
> > @@ -14,6 +14,7 @@
> >  #include <linux/io.h>
> >  #include <linux/of.h>
> >  #include <linux/suspend.h>
> > +#include <linux/rcupdate.h>
> >  #include <asm/cacheflush.h>
> >  #include <asm/proc-fns.h>
> >  #include <asm/suspend.h>
> > @@ -31,6 +32,7 @@ static int imx6q_suspend_finish(unsigned long val)
> >  
> >  static int imx6q_pm_enter(suspend_state_t state)
> >  {
> > +	rcu_idle_enter();
> >  	switch (state) {
> >  	case PM_SUSPEND_MEM:
> >  		imx6q_set_lpm(STOP_POWER_OFF);
> > @@ -40,8 +42,10 @@ static int imx6q_pm_enter(suspend_state_t state)
> >  		cpu_suspend(0, imx6q_suspend_finish);
> >  		imx_smp_prepare();
> >  		imx_gpc_post_resume();
> > +		rcu_idle_exit();
> >  		break;
> >  	default:
> > +		rcu_idle_exit();
> >  		return -EINVAL;
> >  	}
> >  
> > diff --git a/arch/arm/mach-kirkwood/cpuidle.c b/arch/arm/mach-kirkwood/cpuidle.c
> > index 7088180..de8e317 100644
> > --- a/arch/arm/mach-kirkwood/cpuidle.c
> > +++ b/arch/arm/mach-kirkwood/cpuidle.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/cpuidle.h>
> >  #include <linux/io.h>
> >  #include <linux/export.h>
> > +#include <linux/rcupdate.h>
> >  #include <asm/proc-fns.h>
> >  #include <mach/kirkwood.h>
> >  
> > @@ -41,6 +42,7 @@ static int kirkwood_enter_idle(struct cpuidle_device *dev,
> >  
> >  	local_irq_disable();
> >  	do_gettimeofday(&before);
> > +	rcu_idle_enter();
> >  	if (index == 0)
> >  		/* Wait for interrupt state */
> >  		cpu_do_idle();
> > @@ -55,6 +57,7 @@ static int kirkwood_enter_idle(struct cpuidle_device *dev,
> >  		writel(0x7, DDR_OPERATION_BASE);
> >  		cpu_do_idle();
> >  	}
> > +	rcu_idle_exit();
> >  	do_gettimeofday(&after);
> >  	local_irq_enable();
> >  	idle_time = (after.tv_sec - before.tv_sec) * USEC_PER_SEC +
> > diff --git a/arch/arm/mach-mx5/mm.c b/arch/arm/mach-mx5/mm.c
> > index bc17dfe..d919648 100644
> > --- a/arch/arm/mach-mx5/mm.c
> > +++ b/arch/arm/mach-mx5/mm.c
> > @@ -14,6 +14,7 @@
> >  #include <linux/mm.h>
> >  #include <linux/init.h>
> >  #include <linux/clk.h>
> > +#include <linux/rcupdate.h>
> >  
> >  #include <asm/mach/map.h>
> >  
> > @@ -37,7 +38,9 @@ static void imx5_idle(void)
> >  		mx5_cpu_lp_set(WAIT_UNCLOCKED_POWER_OFF);
> >  		if (tzic_enable_wake())
> >  			goto err1;
> > +		rcu_idle_enter();  /* No tracing until rcu_idle_exit(). */
> >  		cpu_do_idle();
> > +		rcu_idle_exit();
> >  err1:
> >  		clk_disable(gpc_dvfs_clk);
> >  	}
> > diff --git a/arch/arm/mach-mx5/pm-imx5.c b/arch/arm/mach-mx5/pm-imx5.c
> > index 98052fc..b4adf42 100644
> > --- a/arch/arm/mach-mx5/pm-imx5.c
> > +++ b/arch/arm/mach-mx5/pm-imx5.c
> > @@ -12,6 +12,7 @@
> >  #include <linux/clk.h>
> >  #include <linux/io.h>
> >  #include <linux/err.h>
> > +#include <linux/rcupdate.h>
> >  #include <asm/cacheflush.h>
> >  #include <asm/tlbflush.h>
> >  #include <mach/common.h>
> > @@ -27,6 +28,7 @@ static int mx5_suspend_prepare(void)
> >  
> >  static int mx5_suspend_enter(suspend_state_t state)
> >  {
> > +	rcu_idle_enter();
> >  	switch (state) {
> >  	case PM_SUSPEND_MEM:
> >  		mx5_cpu_lp_set(STOP_POWER_OFF);
> > @@ -47,6 +49,7 @@ static int mx5_suspend_enter(suspend_state_t state)
> >  		__raw_writel(0, MXC_SRPG_EMPGC1_SRPGCR);
> >  	}
> >  	cpu_do_idle();
> > +	rcu_idle_exit();
> >  	return 0;
> >  }
> >  
> > diff --git a/arch/arm/mach-mxs/pm.c b/arch/arm/mach-mxs/pm.c
> > index fb042da..36a3a57 100644
> > --- a/arch/arm/mach-mxs/pm.c
> > +++ b/arch/arm/mach-mxs/pm.c
> > @@ -15,16 +15,20 @@
> >  #include <linux/kernel.h>
> >  #include <linux/suspend.h>
> >  #include <linux/io.h>
> > +#include <linux/rcupdate.h>
> >  #include <mach/system.h>
> >  
> >  static int mxs_suspend_enter(suspend_state_t state)
> >  {
> > +	rcu_idle_enter();
> >  	switch (state) {
> >  	case PM_SUSPEND_MEM:
> >  		arch_idle();
> > +		rcu_idle_exit();
> >  		break;
> >  
> >  	default:
> > +		rcu_idle_exit();
> >  		return -EINVAL;
> >  	}
> >  	return 0;
> > diff --git a/arch/arm/mach-omap1/pm.c b/arch/arm/mach-omap1/pm.c
> > index 89ea20c..c661eba 100644
> > --- a/arch/arm/mach-omap1/pm.c
> > +++ b/arch/arm/mach-omap1/pm.c
> > @@ -108,11 +108,13 @@ void omap1_pm_idle(void)
> >  	__u32 use_idlect1 = arm_idlect1_mask;
> >  	int do_sleep = 0;
> >  
> > +	rcu_idle_enter();
> >  	local_irq_disable();
> >  	local_fiq_disable();
> >  	if (need_resched()) {
> >  		local_fiq_enable();
> >  		local_irq_enable();
> > +		rcu_idle_exit();
> >  		return;
> >  	}
> >  
> > @@ -158,6 +160,7 @@ void omap1_pm_idle(void)
> >  
> >  		local_fiq_enable();
> >  		local_irq_enable();
> > +		rcu_idle_exit();
> >  		return;
> >  	}
> >  	omap_sram_suspend(omap_readl(ARM_IDLECT1),
> > @@ -165,6 +168,7 @@ void omap1_pm_idle(void)
> >  
> >  	local_fiq_enable();
> >  	local_irq_enable();
> > +	rcu_idle_exit();
> >  }
> >  
> >  /*
> > diff --git a/arch/arm/mach-omap2/pm24xx.c b/arch/arm/mach-omap2/pm24xx.c
> > index b8822f8..2139e9d 100644
> > --- a/arch/arm/mach-omap2/pm24xx.c
> > +++ b/arch/arm/mach-omap2/pm24xx.c
> > @@ -232,6 +232,7 @@ static int omap2_can_sleep(void)
> >  
> >  static void omap2_pm_idle(void)
> >  {
> > +	rcu_idle_enter();
> >  	local_irq_disable();
> >  	local_fiq_disable();
> >  
> > @@ -250,6 +251,7 @@ static void omap2_pm_idle(void)
> >  out:
> >  	local_fiq_enable();
> >  	local_irq_enable();
> > +	rcu_idle_exit();
> >  }
> >  
> >  #ifdef CONFIG_SUSPEND
> > diff --git a/arch/arm/mach-omap2/pm34xx.c b/arch/arm/mach-omap2/pm34xx.c
> > index fc69875..58a8689 100644
> > --- a/arch/arm/mach-omap2/pm34xx.c
> > +++ b/arch/arm/mach-omap2/pm34xx.c
> > @@ -427,7 +427,9 @@ static void omap3_pm_idle(void)
> >  	trace_power_start(POWER_CSTATE, 1, smp_processor_id());
> >  	trace_cpu_idle(1, smp_processor_id());
> >  
> > +	rcu_idle_enter();
> >  	omap_sram_idle();
> > +	rcu_idle_exit();
> >  
> >  	trace_power_end(smp_processor_id());
> >  	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
> > diff --git a/arch/arm/mach-omap2/pm44xx.c b/arch/arm/mach-omap2/pm44xx.c
> > index c264ef7..038796c 100644
> > --- a/arch/arm/mach-omap2/pm44xx.c
> > +++ b/arch/arm/mach-omap2/pm44xx.c
> > @@ -16,6 +16,7 @@
> >  #include <linux/list.h>
> >  #include <linux/err.h>
> >  #include <linux/slab.h>
> > +#include <linux/rcupdate.h>
> >  
> >  #include "common.h"
> >  #include "clockdomain.h"
> > @@ -178,6 +179,7 @@ static int __init pwrdms_setup(struct powerdomain *pwrdm, void *unused)
> >   */
> >  static void omap_default_idle(void)
> >  {
> > +	rcu_idle_enter();
> >  	local_irq_disable();
> >  	local_fiq_disable();
> >  
> > @@ -185,6 +187,7 @@ static void omap_default_idle(void)
> >  
> >  	local_fiq_enable();
> >  	local_irq_enable();
> > +	rcu_idle_exit();
> >  }
> >  
> >  /**
> > diff --git a/arch/arm/mach-pnx4008/pm.c b/arch/arm/mach-pnx4008/pm.c
> > index f3e60a0..0b8703f 100644
> > --- a/arch/arm/mach-pnx4008/pm.c
> > +++ b/arch/arm/mach-pnx4008/pm.c
> > @@ -102,6 +102,7 @@ static inline void pnx4008_suspend(void)
> >  
> >  static int pnx4008_pm_enter(suspend_state_t state)
> >  {
> > +	rcu_idle_enter();
> >  	switch (state) {
> >  	case PM_SUSPEND_STANDBY:
> >  		pnx4008_standby();
> > @@ -110,6 +111,7 @@ static int pnx4008_pm_enter(suspend_state_t state)
> >  		pnx4008_suspend();
> >  		break;
> >  	}
> > +	rcu_idle_exit();
> >  	return 0;
> >  }
> >  
> > diff --git a/arch/arm/mach-prima2/pm.c b/arch/arm/mach-prima2/pm.c
> > index 26ebb57..1e35c8a 100644
> > --- a/arch/arm/mach-prima2/pm.c
> > +++ b/arch/arm/mach-prima2/pm.c
> > @@ -15,6 +15,7 @@
> >  #include <linux/of_device.h>
> >  #include <linux/of_platform.h>
> >  #include <linux/io.h>
> > +#include <linux/rcupdate.h>
> >  #include <linux/rtc/sirfsoc_rtciobrg.h>
> >  #include <asm/suspend.h>
> >  #include <asm/hardware/cache-l2x0.h>
> > @@ -64,6 +65,7 @@ static int sirfsoc_pre_suspend_power_off(void)
> >  
> >  static int sirfsoc_pm_enter(suspend_state_t state)
> >  {
> > +	rcu_idle_enter(); /* No good place for this: locking->lockdep->RCU. */
> >  	switch (state) {
> >  	case PM_SUSPEND_MEM:
> >  		sirfsoc_pre_suspend_power_off();
> > @@ -73,8 +75,10 @@ static int sirfsoc_pm_enter(suspend_state_t state)
> >  		/* go zzz */
> >  		cpu_suspend(0, sirfsoc_finish_suspend);
> >  		outer_resume();
> > +		rcu_idle_exit();
> >  		break;
> >  	default:
> > +		rcu_idle_exit();
> >  		return -EINVAL;
> >  	}
> >  	return 0;
> > diff --git a/arch/arm/mach-s5p64x0/common.c b/arch/arm/mach-s5p64x0/common.c
> > index 52b89a3..c59a7f2 100644
> > --- a/arch/arm/mach-s5p64x0/common.c
> > +++ b/arch/arm/mach-s5p64x0/common.c
> > @@ -146,6 +146,7 @@ static void s5p64x0_idle(void)
> >  {
> >  	unsigned long val;
> >  
> > +	rcu_idle_enter();
> >  	if (!need_resched()) {
> >  		val = __raw_readl(S5P64X0_PWR_CFG);
> >  		val &= ~(0x3 << 5);
> > @@ -154,6 +155,7 @@ static void s5p64x0_idle(void)
> >  
> >  		cpu_do_idle();
> >  	}
> > +	rcu_idle_exit();
> >  	local_irq_enable();
> >  }
> >  
> > diff --git a/arch/arm/mach-s5pc100/common.c b/arch/arm/mach-s5pc100/common.c
> > index c909573..7bf02ff 100644
> > --- a/arch/arm/mach-s5pc100/common.c
> > +++ b/arch/arm/mach-s5pc100/common.c
> > @@ -131,9 +131,11 @@ static struct map_desc s5pc100_iodesc[] __initdata = {
> >  
> >  static void s5pc100_idle(void)
> >  {
> > +	rcu_idle_enter();
> >  	if (!need_resched())
> >  		cpu_do_idle();
> >  
> > +	rcu_idle_exit();
> >  	local_irq_enable();
> >  }
> >  
> > diff --git a/arch/arm/mach-s5pv210/common.c b/arch/arm/mach-s5pv210/common.c
> > index 9c1bcdc..ccd6b4d 100644
> > --- a/arch/arm/mach-s5pv210/common.c
> > +++ b/arch/arm/mach-s5pv210/common.c
> > @@ -144,9 +144,11 @@ static struct map_desc s5pv210_iodesc[] __initdata = {
> >  
> >  static void s5pv210_idle(void)
> >  {
> > +	rcu_idle_enter();
> >  	if (!need_resched())
> >  		cpu_do_idle();
> >  
> > +	rcu_idle_exit();
> >  	local_irq_enable();
> >  }
> >  
> > diff --git a/arch/arm/mach-shmobile/cpuidle.c b/arch/arm/mach-shmobile/cpuidle.c
> > index 1b23342..651e92b 100644
> > --- a/arch/arm/mach-shmobile/cpuidle.c
> > +++ b/arch/arm/mach-shmobile/cpuidle.c
> > @@ -13,6 +13,7 @@
> >  #include <linux/suspend.h>
> >  #include <linux/module.h>
> >  #include <linux/err.h>
> > +#include <linux/rcupdate.h>
> >  #include <asm/system.h>
> >  #include <asm/io.h>
> >  
> > @@ -33,6 +34,7 @@ static int shmobile_cpuidle_enter(struct cpuidle_device *dev,
> >  
> >  	before = ktime_get();
> >  
> > +	rcu_idle_enter();
> >  	local_irq_disable();
> >  	local_fiq_disable();
> >  
> > @@ -40,6 +42,7 @@ static int shmobile_cpuidle_enter(struct cpuidle_device *dev,
> >  
> >  	local_irq_enable();
> >  	local_fiq_enable();
> > +	rcu_idle_exit();
> >  
> >  	after = ktime_get();
> >  	dev->last_residency = ktime_to_ns(ktime_sub(after, before)) >> 10;
> > diff --git a/arch/arm/mach-shmobile/pm-sh7372.c b/arch/arm/mach-shmobile/pm-sh7372.c
> > index fcf8b17..e11507e 100644
> > --- a/arch/arm/mach-shmobile/pm-sh7372.c
> > +++ b/arch/arm/mach-shmobile/pm-sh7372.c
> > @@ -21,6 +21,7 @@
> >  #include <linux/irq.h>
> >  #include <linux/bitrev.h>
> >  #include <linux/console.h>
> > +#include <linux/rcupdate.h>
> >  #include <asm/system.h>
> >  #include <asm/io.h>
> >  #include <asm/tlbflush.h>
> > @@ -520,17 +521,24 @@ static int sh7372_enter_suspend(suspend_state_t suspend_state)
> >  		    sh7372_a4s.genpd.status == GPD_STATE_POWER_OFF) {
> >  			/* enter A4S sleep with PLLC0 off */
> >  			pr_debug("entering A4S\n");
> > +			rcu_idle_enter();
> >  			sh7372_enter_a4s_common(0);
> > +			rcu_idle_exit();
> >  		} else {
> >  			/* enter A3SM sleep with PLLC0 off */
> >  			pr_debug("entering A3SM\n");
> > +			rcu_idle_enter();
> >  			sh7372_enter_a3sm_common(0);
> > +			rcu_idle_exit();
> >  		}
> >  	} else {
> >  		/* default to Core Standby that supports all wakeup sources */
> >  		pr_debug("entering Core Standby\n");
> > +		rcu_idle_enter();
> >  		sh7372_enter_core_standby();
> > +		rcu_idle_exit();
> >  	}
> > +
> >  	return 0;
> >  }
> >  
>
Paul E. McKenney Feb. 2, 2012, 4:44 a.m. UTC | #4
On Wed, Feb 01, 2012 at 10:49:22PM -0500, Nicolas Pitre wrote:
> On Wed, 1 Feb 2012, Paul E. McKenney wrote:
> 
> > From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> > 
> > The idle loop is a quiscent state for RCU, which means that RCU ignores
> > CPUs that have told RCU that they are idle via rcu_idle_enter().
> > There are nevertheless quite a few places where idle CPUs use RCU, most
> > commonly indirectly via tracing.  This patch fixes these problems for ARM.
> > 
> > Many of these bugs have been in the kernel for quite some time, but
> > Frederic's recent change now gives warnings.
> > 
> > This patch takes the straightforward approach of pushing the
> > rcu_idle_enter()/rcu_idle_exit() pair further down into the core of the
> > idle loop.
> 
> NAK.
> 
> 1) This is going to conflict with a patch series I've pushed to RMK 
>    already.  You can see its result in linux-next.
> 
> 2) The purpose of (1) is to do precisely the very opposite i.e. move as 
>    much common knowledge as possible up the idle call paths and leave 
>    the platform specific quirks as bare as possible, if any.
> 
> So I'd much prefer if this change was constrained to be inside 
> cpu_idle(), or at least in its close vicinity.

OK.  Then the tracing in the inner idle loop needs to go.  Other
restructuring might be required as well.  The long and short of it is that
you absolutely cannot use RCU between the time you invoke rcu_idle_enter()
and the time you invoke rcu_idle_exit().  The ARM idle code currently
does just that and therefore must change.

Whatever change you choose that causes the code to meet that constraint
is met is fine by me.

But that constraint absolutely must be met.

							Thanx, Paul

> > Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Russell King <linux@arm.linux.org.uk>
> > Cc: Andrew Victor <linux@maxim.org.za>
> > Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
> > Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
> > Cc: Sekhar Nori <nsekhar@ti.com>
> > Cc: Kevin Hilman <khilman@ti.com>
> > Cc: Kukjin Kim <kgene.kim@samsung.com>
> > Cc: Rob Herring <rob.herring@calxeda.com>
> > Cc: Sascha Hauer <kernel@pengutronix.de>
> > Cc: Shawn Guo <shawn.guo@linaro.org>
> > Cc: Lennert Buytenhek <kernel@wantstofly.org>
> > Cc: Nicolas Pitre <nico@fluxnic.net>
> > Cc: Amit Kucheria <amit.kucheria@canonical.com>
> > Cc: Tony Lindgren <tony@atomide.com>
> > Cc: Barry Song <baohua.song@csr.com>
> > Cc: Paul Mundt <lethal@linux-sh.org>
> > Cc: Magnus Damm <magnus.damm@gmail.com>
> > Cc: Will Deacon <will.deacon@arm.com>
> > Cc: H Hartley Sweeten <hsweeten@visionengravers.com>
> > Cc: Trinabh Gupta <g.trinabh@gmail.com>
> > Cc: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
> > Cc: Len Brown <len.brown@intel.com>
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linux-samsung-soc@vger.kernel.org
> > Cc: linux-omap@vger.kernel.org
> > ---
> >  arch/arm/kernel/process.c          |    2 --
> >  arch/arm/mach-at91/cpuidle.c       |    3 +++
> >  arch/arm/mach-davinci/cpuidle.c    |    3 +++
> >  arch/arm/mach-exynos/common.c      |    2 ++
> >  arch/arm/mach-highbank/pm.c        |   12 ++++++++++++
> >  arch/arm/mach-imx/mm-imx3.c        |    3 +++
> >  arch/arm/mach-imx/pm-imx27.c       |    4 ++++
> >  arch/arm/mach-imx/pm-imx6q.c       |    4 ++++
> >  arch/arm/mach-kirkwood/cpuidle.c   |    3 +++
> >  arch/arm/mach-mx5/mm.c             |    3 +++
> >  arch/arm/mach-mx5/pm-imx5.c        |    3 +++
> >  arch/arm/mach-mxs/pm.c             |    4 ++++
> >  arch/arm/mach-omap1/pm.c           |    4 ++++
> >  arch/arm/mach-omap2/pm24xx.c       |    2 ++
> >  arch/arm/mach-omap2/pm34xx.c       |    2 ++
> >  arch/arm/mach-omap2/pm44xx.c       |    3 +++
> >  arch/arm/mach-pnx4008/pm.c         |    2 ++
> >  arch/arm/mach-prima2/pm.c          |    4 ++++
> >  arch/arm/mach-s5p64x0/common.c     |    2 ++
> >  arch/arm/mach-s5pc100/common.c     |    2 ++
> >  arch/arm/mach-s5pv210/common.c     |    2 ++
> >  arch/arm/mach-shmobile/cpuidle.c   |    3 +++
> >  arch/arm/mach-shmobile/pm-sh7372.c |    8 ++++++++
> >  23 files changed, 78 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
> > index 971d65c..8241df7 100644
> > --- a/arch/arm/kernel/process.c
> > +++ b/arch/arm/kernel/process.c
> > @@ -207,7 +207,6 @@ void cpu_idle(void)
> >  	/* endless idle loop with no priority at all */
> >  	while (1) {
> >  		tick_nohz_idle_enter();
> > -		rcu_idle_enter();
> >  		leds_event(led_idle_start);
> >  		while (!need_resched()) {
> >  #ifdef CONFIG_HOTPLUG_CPU
> > @@ -237,7 +236,6 @@ void cpu_idle(void)
> >  			}
> >  		}
> >  		leds_event(led_idle_end);
> > -		rcu_idle_exit();
> >  		tick_nohz_idle_exit();
> >  		preempt_enable_no_resched();
> >  		schedule();
> > diff --git a/arch/arm/mach-at91/cpuidle.c b/arch/arm/mach-at91/cpuidle.c
> > index a851e6c..8feff6b 100644
> > --- a/arch/arm/mach-at91/cpuidle.c
> > +++ b/arch/arm/mach-at91/cpuidle.c
> > @@ -20,6 +20,7 @@
> >  #include <asm/proc-fns.h>
> >  #include <linux/io.h>
> >  #include <linux/export.h>
> > +#include <linux/rcupdate.h>
> >  
> >  #include "pm.h"
> >  
> > @@ -43,6 +44,7 @@ static int at91_enter_idle(struct cpuidle_device *dev,
> >  
> >  	local_irq_disable();
> >  	do_gettimeofday(&before);
> > +	rcu_idle_enter();
> >  	if (index == 0)
> >  		/* Wait for interrupt state */
> >  		cpu_do_idle();
> > @@ -53,6 +55,7 @@ static int at91_enter_idle(struct cpuidle_device *dev,
> >  		cpu_do_idle();
> >  		sdram_selfrefresh_disable(saved_lpr);
> >  	}
> > +	rcu_idle_exit();
> >  	do_gettimeofday(&after);
> >  	local_irq_enable();
> >  	idle_time = (after.tv_sec - before.tv_sec) * USEC_PER_SEC +
> > diff --git a/arch/arm/mach-davinci/cpuidle.c b/arch/arm/mach-davinci/cpuidle.c
> > index a30c7c5..6594b7e 100644
> > --- a/arch/arm/mach-davinci/cpuidle.c
> > +++ b/arch/arm/mach-davinci/cpuidle.c
> > @@ -17,6 +17,7 @@
> >  #include <linux/cpuidle.h>
> >  #include <linux/io.h>
> >  #include <linux/export.h>
> > +#include <linux/rcupdate.h>
> >  #include <asm/proc-fns.h>
> >  
> >  #include <mach/cpuidle.h>
> > @@ -90,12 +91,14 @@ static int davinci_enter_idle(struct cpuidle_device *dev,
> >  	local_irq_disable();
> >  	do_gettimeofday(&before);
> >  
> > +	rcu_idle_enter();
> >  	if (ops && ops->enter)
> >  		ops->enter(ops->flags);
> >  	/* Wait for interrupt state */
> >  	cpu_do_idle();
> >  	if (ops && ops->exit)
> >  		ops->exit(ops->flags);
> > +	rcu_idle_exit();
> >  
> >  	do_gettimeofday(&after);
> >  	local_irq_enable();
> > diff --git a/arch/arm/mach-exynos/common.c b/arch/arm/mach-exynos/common.c
> > index c59e188..1e5844a 100644
> > --- a/arch/arm/mach-exynos/common.c
> > +++ b/arch/arm/mach-exynos/common.c
> > @@ -203,8 +203,10 @@ static struct map_desc exynos4_iodesc1[] __initdata = {
> >  
> >  static void exynos_idle(void)
> >  {
> > +	rcu_idle_enter();
> >  	if (!need_resched())
> >  		cpu_do_idle();
> > +	rcu_idle_exit();
> >  
> >  	local_irq_enable();
> >  }
> > diff --git a/arch/arm/mach-highbank/pm.c b/arch/arm/mach-highbank/pm.c
> > index 33b3beb..5ecc4bc 100644
> > --- a/arch/arm/mach-highbank/pm.c
> > +++ b/arch/arm/mach-highbank/pm.c
> > @@ -17,6 +17,7 @@
> >  #include <linux/init.h>
> >  #include <linux/io.h>
> >  #include <linux/suspend.h>
> > +#include <linux/rcupdate.h>
> >  
> >  #include <asm/proc-fns.h>
> >  #include <asm/smp_scu.h>
> > @@ -33,12 +34,23 @@ static int highbank_suspend_finish(unsigned long val)
> >  
> >  static int highbank_pm_enter(suspend_state_t state)
> >  {
> > +	/*
> > +	 * Some of the functions preceding the cpu_suspend() can
> > +	 * invoke RCU, but only in case of failure to disable preemption
> > +	 * or preempt_disable() misnesting.  Assume that these errors
> > +	 * are not committed in the following code, because RCU just
> > +	 * doesn't want to run on a CPU that has its caches disabled.
> > +	 */
> > +	rcu_idle_enter();
> > +
> >  	hignbank_set_pwr_suspend();
> >  	highbank_set_cpu_jump(0, cpu_resume);
> >  
> >  	scu_power_mode(scu_base_addr, SCU_PM_POWEROFF);
> >  	cpu_suspend(0, highbank_suspend_finish);
> >  
> > +	rcu_idle_exit();
> > +
> >  	return 0;
> >  }
> >  
> > diff --git a/arch/arm/mach-imx/mm-imx3.c b/arch/arm/mach-imx/mm-imx3.c
> > index 31807d2..0945b23 100644
> > --- a/arch/arm/mach-imx/mm-imx3.c
> > +++ b/arch/arm/mach-imx/mm-imx3.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/mm.h>
> >  #include <linux/init.h>
> >  #include <linux/err.h>
> > +#include <linux/rcupdate.h>
> >  
> >  #include <asm/pgtable.h>
> >  #include <asm/hardware/cache-l2x0.h>
> > @@ -34,6 +35,7 @@ static void imx3_idle(void)
> >  {
> >  	unsigned long reg = 0;
> >  
> > +	rcu_idle_enter();
> >  	if (!need_resched())
> >  		__asm__ __volatile__(
> >  			/* disable I and D cache */
> > @@ -58,6 +60,7 @@ static void imx3_idle(void)
> >  			"orr %0, %0, #0x00000004\n"
> >  			"mcr p15, 0, %0, c1, c0, 0\n"
> >  			: "=r" (reg));
> > +	rcu_idle_exit();
> >  	local_irq_enable();
> >  }
> >  
> > diff --git a/arch/arm/mach-imx/pm-imx27.c b/arch/arm/mach-imx/pm-imx27.c
> > index e455d2f..c4adc05 100644
> > --- a/arch/arm/mach-imx/pm-imx27.c
> > +++ b/arch/arm/mach-imx/pm-imx27.c
> > @@ -10,12 +10,14 @@
> >  #include <linux/kernel.h>
> >  #include <linux/suspend.h>
> >  #include <linux/io.h>
> > +#include <linux/rcupdate.h>
> >  #include <mach/system.h>
> >  #include <mach/hardware.h>
> >  
> >  static int mx27_suspend_enter(suspend_state_t state)
> >  {
> >  	u32 cscr;
> > +	rcu_idle_enter();
> >  	switch (state) {
> >  	case PM_SUSPEND_MEM:
> >  		/* Clear MPEN and SPEN to disable MPLL/SPLL */
> > @@ -24,9 +26,11 @@ static int mx27_suspend_enter(suspend_state_t state)
> >  		__raw_writel(cscr, MX27_IO_ADDRESS(MX27_CCM_BASE_ADDR));
> >  		/* Executes WFI */
> >  		arch_idle();
> > +		rcu_idle_exit();
> >  		break;
> >  
> >  	default:
> > +		rcu_idle_exit();
> >  		return -EINVAL;
> >  	}
> >  	return 0;
> > diff --git a/arch/arm/mach-imx/pm-imx6q.c b/arch/arm/mach-imx/pm-imx6q.c
> > index f7b0c2b..b9c2589 100644
> > --- a/arch/arm/mach-imx/pm-imx6q.c
> > +++ b/arch/arm/mach-imx/pm-imx6q.c
> > @@ -14,6 +14,7 @@
> >  #include <linux/io.h>
> >  #include <linux/of.h>
> >  #include <linux/suspend.h>
> > +#include <linux/rcupdate.h>
> >  #include <asm/cacheflush.h>
> >  #include <asm/proc-fns.h>
> >  #include <asm/suspend.h>
> > @@ -31,6 +32,7 @@ static int imx6q_suspend_finish(unsigned long val)
> >  
> >  static int imx6q_pm_enter(suspend_state_t state)
> >  {
> > +	rcu_idle_enter();
> >  	switch (state) {
> >  	case PM_SUSPEND_MEM:
> >  		imx6q_set_lpm(STOP_POWER_OFF);
> > @@ -40,8 +42,10 @@ static int imx6q_pm_enter(suspend_state_t state)
> >  		cpu_suspend(0, imx6q_suspend_finish);
> >  		imx_smp_prepare();
> >  		imx_gpc_post_resume();
> > +		rcu_idle_exit();
> >  		break;
> >  	default:
> > +		rcu_idle_exit();
> >  		return -EINVAL;
> >  	}
> >  
> > diff --git a/arch/arm/mach-kirkwood/cpuidle.c b/arch/arm/mach-kirkwood/cpuidle.c
> > index 7088180..de8e317 100644
> > --- a/arch/arm/mach-kirkwood/cpuidle.c
> > +++ b/arch/arm/mach-kirkwood/cpuidle.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/cpuidle.h>
> >  #include <linux/io.h>
> >  #include <linux/export.h>
> > +#include <linux/rcupdate.h>
> >  #include <asm/proc-fns.h>
> >  #include <mach/kirkwood.h>
> >  
> > @@ -41,6 +42,7 @@ static int kirkwood_enter_idle(struct cpuidle_device *dev,
> >  
> >  	local_irq_disable();
> >  	do_gettimeofday(&before);
> > +	rcu_idle_enter();
> >  	if (index == 0)
> >  		/* Wait for interrupt state */
> >  		cpu_do_idle();
> > @@ -55,6 +57,7 @@ static int kirkwood_enter_idle(struct cpuidle_device *dev,
> >  		writel(0x7, DDR_OPERATION_BASE);
> >  		cpu_do_idle();
> >  	}
> > +	rcu_idle_exit();
> >  	do_gettimeofday(&after);
> >  	local_irq_enable();
> >  	idle_time = (after.tv_sec - before.tv_sec) * USEC_PER_SEC +
> > diff --git a/arch/arm/mach-mx5/mm.c b/arch/arm/mach-mx5/mm.c
> > index bc17dfe..d919648 100644
> > --- a/arch/arm/mach-mx5/mm.c
> > +++ b/arch/arm/mach-mx5/mm.c
> > @@ -14,6 +14,7 @@
> >  #include <linux/mm.h>
> >  #include <linux/init.h>
> >  #include <linux/clk.h>
> > +#include <linux/rcupdate.h>
> >  
> >  #include <asm/mach/map.h>
> >  
> > @@ -37,7 +38,9 @@ static void imx5_idle(void)
> >  		mx5_cpu_lp_set(WAIT_UNCLOCKED_POWER_OFF);
> >  		if (tzic_enable_wake())
> >  			goto err1;
> > +		rcu_idle_enter();  /* No tracing until rcu_idle_exit(). */
> >  		cpu_do_idle();
> > +		rcu_idle_exit();
> >  err1:
> >  		clk_disable(gpc_dvfs_clk);
> >  	}
> > diff --git a/arch/arm/mach-mx5/pm-imx5.c b/arch/arm/mach-mx5/pm-imx5.c
> > index 98052fc..b4adf42 100644
> > --- a/arch/arm/mach-mx5/pm-imx5.c
> > +++ b/arch/arm/mach-mx5/pm-imx5.c
> > @@ -12,6 +12,7 @@
> >  #include <linux/clk.h>
> >  #include <linux/io.h>
> >  #include <linux/err.h>
> > +#include <linux/rcupdate.h>
> >  #include <asm/cacheflush.h>
> >  #include <asm/tlbflush.h>
> >  #include <mach/common.h>
> > @@ -27,6 +28,7 @@ static int mx5_suspend_prepare(void)
> >  
> >  static int mx5_suspend_enter(suspend_state_t state)
> >  {
> > +	rcu_idle_enter();
> >  	switch (state) {
> >  	case PM_SUSPEND_MEM:
> >  		mx5_cpu_lp_set(STOP_POWER_OFF);
> > @@ -47,6 +49,7 @@ static int mx5_suspend_enter(suspend_state_t state)
> >  		__raw_writel(0, MXC_SRPG_EMPGC1_SRPGCR);
> >  	}
> >  	cpu_do_idle();
> > +	rcu_idle_exit();
> >  	return 0;
> >  }
> >  
> > diff --git a/arch/arm/mach-mxs/pm.c b/arch/arm/mach-mxs/pm.c
> > index fb042da..36a3a57 100644
> > --- a/arch/arm/mach-mxs/pm.c
> > +++ b/arch/arm/mach-mxs/pm.c
> > @@ -15,16 +15,20 @@
> >  #include <linux/kernel.h>
> >  #include <linux/suspend.h>
> >  #include <linux/io.h>
> > +#include <linux/rcupdate.h>
> >  #include <mach/system.h>
> >  
> >  static int mxs_suspend_enter(suspend_state_t state)
> >  {
> > +	rcu_idle_enter();
> >  	switch (state) {
> >  	case PM_SUSPEND_MEM:
> >  		arch_idle();
> > +		rcu_idle_exit();
> >  		break;
> >  
> >  	default:
> > +		rcu_idle_exit();
> >  		return -EINVAL;
> >  	}
> >  	return 0;
> > diff --git a/arch/arm/mach-omap1/pm.c b/arch/arm/mach-omap1/pm.c
> > index 89ea20c..c661eba 100644
> > --- a/arch/arm/mach-omap1/pm.c
> > +++ b/arch/arm/mach-omap1/pm.c
> > @@ -108,11 +108,13 @@ void omap1_pm_idle(void)
> >  	__u32 use_idlect1 = arm_idlect1_mask;
> >  	int do_sleep = 0;
> >  
> > +	rcu_idle_enter();
> >  	local_irq_disable();
> >  	local_fiq_disable();
> >  	if (need_resched()) {
> >  		local_fiq_enable();
> >  		local_irq_enable();
> > +		rcu_idle_exit();
> >  		return;
> >  	}
> >  
> > @@ -158,6 +160,7 @@ void omap1_pm_idle(void)
> >  
> >  		local_fiq_enable();
> >  		local_irq_enable();
> > +		rcu_idle_exit();
> >  		return;
> >  	}
> >  	omap_sram_suspend(omap_readl(ARM_IDLECT1),
> > @@ -165,6 +168,7 @@ void omap1_pm_idle(void)
> >  
> >  	local_fiq_enable();
> >  	local_irq_enable();
> > +	rcu_idle_exit();
> >  }
> >  
> >  /*
> > diff --git a/arch/arm/mach-omap2/pm24xx.c b/arch/arm/mach-omap2/pm24xx.c
> > index b8822f8..2139e9d 100644
> > --- a/arch/arm/mach-omap2/pm24xx.c
> > +++ b/arch/arm/mach-omap2/pm24xx.c
> > @@ -232,6 +232,7 @@ static int omap2_can_sleep(void)
> >  
> >  static void omap2_pm_idle(void)
> >  {
> > +	rcu_idle_enter();
> >  	local_irq_disable();
> >  	local_fiq_disable();
> >  
> > @@ -250,6 +251,7 @@ static void omap2_pm_idle(void)
> >  out:
> >  	local_fiq_enable();
> >  	local_irq_enable();
> > +	rcu_idle_exit();
> >  }
> >  
> >  #ifdef CONFIG_SUSPEND
> > diff --git a/arch/arm/mach-omap2/pm34xx.c b/arch/arm/mach-omap2/pm34xx.c
> > index fc69875..58a8689 100644
> > --- a/arch/arm/mach-omap2/pm34xx.c
> > +++ b/arch/arm/mach-omap2/pm34xx.c
> > @@ -427,7 +427,9 @@ static void omap3_pm_idle(void)
> >  	trace_power_start(POWER_CSTATE, 1, smp_processor_id());
> >  	trace_cpu_idle(1, smp_processor_id());
> >  
> > +	rcu_idle_enter();
> >  	omap_sram_idle();
> > +	rcu_idle_exit();
> >  
> >  	trace_power_end(smp_processor_id());
> >  	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
> > diff --git a/arch/arm/mach-omap2/pm44xx.c b/arch/arm/mach-omap2/pm44xx.c
> > index c264ef7..038796c 100644
> > --- a/arch/arm/mach-omap2/pm44xx.c
> > +++ b/arch/arm/mach-omap2/pm44xx.c
> > @@ -16,6 +16,7 @@
> >  #include <linux/list.h>
> >  #include <linux/err.h>
> >  #include <linux/slab.h>
> > +#include <linux/rcupdate.h>
> >  
> >  #include "common.h"
> >  #include "clockdomain.h"
> > @@ -178,6 +179,7 @@ static int __init pwrdms_setup(struct powerdomain *pwrdm, void *unused)
> >   */
> >  static void omap_default_idle(void)
> >  {
> > +	rcu_idle_enter();
> >  	local_irq_disable();
> >  	local_fiq_disable();
> >  
> > @@ -185,6 +187,7 @@ static void omap_default_idle(void)
> >  
> >  	local_fiq_enable();
> >  	local_irq_enable();
> > +	rcu_idle_exit();
> >  }
> >  
> >  /**
> > diff --git a/arch/arm/mach-pnx4008/pm.c b/arch/arm/mach-pnx4008/pm.c
> > index f3e60a0..0b8703f 100644
> > --- a/arch/arm/mach-pnx4008/pm.c
> > +++ b/arch/arm/mach-pnx4008/pm.c
> > @@ -102,6 +102,7 @@ static inline void pnx4008_suspend(void)
> >  
> >  static int pnx4008_pm_enter(suspend_state_t state)
> >  {
> > +	rcu_idle_enter();
> >  	switch (state) {
> >  	case PM_SUSPEND_STANDBY:
> >  		pnx4008_standby();
> > @@ -110,6 +111,7 @@ static int pnx4008_pm_enter(suspend_state_t state)
> >  		pnx4008_suspend();
> >  		break;
> >  	}
> > +	rcu_idle_exit();
> >  	return 0;
> >  }
> >  
> > diff --git a/arch/arm/mach-prima2/pm.c b/arch/arm/mach-prima2/pm.c
> > index 26ebb57..1e35c8a 100644
> > --- a/arch/arm/mach-prima2/pm.c
> > +++ b/arch/arm/mach-prima2/pm.c
> > @@ -15,6 +15,7 @@
> >  #include <linux/of_device.h>
> >  #include <linux/of_platform.h>
> >  #include <linux/io.h>
> > +#include <linux/rcupdate.h>
> >  #include <linux/rtc/sirfsoc_rtciobrg.h>
> >  #include <asm/suspend.h>
> >  #include <asm/hardware/cache-l2x0.h>
> > @@ -64,6 +65,7 @@ static int sirfsoc_pre_suspend_power_off(void)
> >  
> >  static int sirfsoc_pm_enter(suspend_state_t state)
> >  {
> > +	rcu_idle_enter(); /* No good place for this: locking->lockdep->RCU. */
> >  	switch (state) {
> >  	case PM_SUSPEND_MEM:
> >  		sirfsoc_pre_suspend_power_off();
> > @@ -73,8 +75,10 @@ static int sirfsoc_pm_enter(suspend_state_t state)
> >  		/* go zzz */
> >  		cpu_suspend(0, sirfsoc_finish_suspend);
> >  		outer_resume();
> > +		rcu_idle_exit();
> >  		break;
> >  	default:
> > +		rcu_idle_exit();
> >  		return -EINVAL;
> >  	}
> >  	return 0;
> > diff --git a/arch/arm/mach-s5p64x0/common.c b/arch/arm/mach-s5p64x0/common.c
> > index 52b89a3..c59a7f2 100644
> > --- a/arch/arm/mach-s5p64x0/common.c
> > +++ b/arch/arm/mach-s5p64x0/common.c
> > @@ -146,6 +146,7 @@ static void s5p64x0_idle(void)
> >  {
> >  	unsigned long val;
> >  
> > +	rcu_idle_enter();
> >  	if (!need_resched()) {
> >  		val = __raw_readl(S5P64X0_PWR_CFG);
> >  		val &= ~(0x3 << 5);
> > @@ -154,6 +155,7 @@ static void s5p64x0_idle(void)
> >  
> >  		cpu_do_idle();
> >  	}
> > +	rcu_idle_exit();
> >  	local_irq_enable();
> >  }
> >  
> > diff --git a/arch/arm/mach-s5pc100/common.c b/arch/arm/mach-s5pc100/common.c
> > index c909573..7bf02ff 100644
> > --- a/arch/arm/mach-s5pc100/common.c
> > +++ b/arch/arm/mach-s5pc100/common.c
> > @@ -131,9 +131,11 @@ static struct map_desc s5pc100_iodesc[] __initdata = {
> >  
> >  static void s5pc100_idle(void)
> >  {
> > +	rcu_idle_enter();
> >  	if (!need_resched())
> >  		cpu_do_idle();
> >  
> > +	rcu_idle_exit();
> >  	local_irq_enable();
> >  }
> >  
> > diff --git a/arch/arm/mach-s5pv210/common.c b/arch/arm/mach-s5pv210/common.c
> > index 9c1bcdc..ccd6b4d 100644
> > --- a/arch/arm/mach-s5pv210/common.c
> > +++ b/arch/arm/mach-s5pv210/common.c
> > @@ -144,9 +144,11 @@ static struct map_desc s5pv210_iodesc[] __initdata = {
> >  
> >  static void s5pv210_idle(void)
> >  {
> > +	rcu_idle_enter();
> >  	if (!need_resched())
> >  		cpu_do_idle();
> >  
> > +	rcu_idle_exit();
> >  	local_irq_enable();
> >  }
> >  
> > diff --git a/arch/arm/mach-shmobile/cpuidle.c b/arch/arm/mach-shmobile/cpuidle.c
> > index 1b23342..651e92b 100644
> > --- a/arch/arm/mach-shmobile/cpuidle.c
> > +++ b/arch/arm/mach-shmobile/cpuidle.c
> > @@ -13,6 +13,7 @@
> >  #include <linux/suspend.h>
> >  #include <linux/module.h>
> >  #include <linux/err.h>
> > +#include <linux/rcupdate.h>
> >  #include <asm/system.h>
> >  #include <asm/io.h>
> >  
> > @@ -33,6 +34,7 @@ static int shmobile_cpuidle_enter(struct cpuidle_device *dev,
> >  
> >  	before = ktime_get();
> >  
> > +	rcu_idle_enter();
> >  	local_irq_disable();
> >  	local_fiq_disable();
> >  
> > @@ -40,6 +42,7 @@ static int shmobile_cpuidle_enter(struct cpuidle_device *dev,
> >  
> >  	local_irq_enable();
> >  	local_fiq_enable();
> > +	rcu_idle_exit();
> >  
> >  	after = ktime_get();
> >  	dev->last_residency = ktime_to_ns(ktime_sub(after, before)) >> 10;
> > diff --git a/arch/arm/mach-shmobile/pm-sh7372.c b/arch/arm/mach-shmobile/pm-sh7372.c
> > index fcf8b17..e11507e 100644
> > --- a/arch/arm/mach-shmobile/pm-sh7372.c
> > +++ b/arch/arm/mach-shmobile/pm-sh7372.c
> > @@ -21,6 +21,7 @@
> >  #include <linux/irq.h>
> >  #include <linux/bitrev.h>
> >  #include <linux/console.h>
> > +#include <linux/rcupdate.h>
> >  #include <asm/system.h>
> >  #include <asm/io.h>
> >  #include <asm/tlbflush.h>
> > @@ -520,17 +521,24 @@ static int sh7372_enter_suspend(suspend_state_t suspend_state)
> >  		    sh7372_a4s.genpd.status == GPD_STATE_POWER_OFF) {
> >  			/* enter A4S sleep with PLLC0 off */
> >  			pr_debug("entering A4S\n");
> > +			rcu_idle_enter();
> >  			sh7372_enter_a4s_common(0);
> > +			rcu_idle_exit();
> >  		} else {
> >  			/* enter A3SM sleep with PLLC0 off */
> >  			pr_debug("entering A3SM\n");
> > +			rcu_idle_enter();
> >  			sh7372_enter_a3sm_common(0);
> > +			rcu_idle_exit();
> >  		}
> >  	} else {
> >  		/* default to Core Standby that supports all wakeup sources */
> >  		pr_debug("entering Core Standby\n");
> > +		rcu_idle_enter();
> >  		sh7372_enter_core_standby();
> > +		rcu_idle_exit();
> >  	}
> > +
> >  	return 0;
> >  }
> >  
> > -- 
> > 1.7.8
> > 
>
Nicolas Pitre Feb. 2, 2012, 5:13 p.m. UTC | #5
On Wed, 1 Feb 2012, Paul E. McKenney wrote:

> On Wed, Feb 01, 2012 at 10:49:22PM -0500, Nicolas Pitre wrote:
> > On Wed, 1 Feb 2012, Paul E. McKenney wrote:
> > 
> > > From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> > > 
> > > The idle loop is a quiscent state for RCU, which means that RCU ignores
> > > CPUs that have told RCU that they are idle via rcu_idle_enter().
> > > There are nevertheless quite a few places where idle CPUs use RCU, most
> > > commonly indirectly via tracing.  This patch fixes these problems for ARM.
> > > 
> > > Many of these bugs have been in the kernel for quite some time, but
> > > Frederic's recent change now gives warnings.
> > > 
> > > This patch takes the straightforward approach of pushing the
> > > rcu_idle_enter()/rcu_idle_exit() pair further down into the core of the
> > > idle loop.
> > 
> > NAK.
> > 
> > 1) This is going to conflict with a patch series I've pushed to RMK 
> >    already.  You can see its result in linux-next.
> > 
> > 2) The purpose of (1) is to do precisely the very opposite i.e. move as 
> >    much common knowledge as possible up the idle call paths and leave 
> >    the platform specific quirks as bare as possible, if any.
> > 
> > So I'd much prefer if this change was constrained to be inside 
> > cpu_idle(), or at least in its close vicinity.
> 
> OK.  Then the tracing in the inner idle loop needs to go.  Other
> restructuring might be required as well. 

Let me expand a bit more on the "if any" in my #2 above.  Most ARM 
sub-architectures don't have any special idle drivers and they simply 
rely on the default cpu_do_idle call.  What this proposed patch is doing 
is to move the rcu_idle_enter()/rcu_idle_exit() pair down into the few 
subarchs with an existing cpu-idle callback.  This means in practice 
that rcu_idle_enter()/rcu_idle_exit() wouldn't be called at all anymore 
on most ARM targets.  Is that appropriate?

> The long and short of it is that you absolutely cannot use RCU between 
> the time you invoke rcu_idle_enter() and the time you invoke 
> rcu_idle_exit().  The ARM idle code currently does just that and 
> therefore must change.
> 
> Whatever change you choose that causes the code to meet that constraint
> is met is fine by me.
> 
> But that constraint absolutely must be met.

I would have to know more about what the rcu_idle_*() calls imply before 
suggesting an alternative.


Nicolas
Paul E. McKenney Feb. 2, 2012, 5:43 p.m. UTC | #6
On Thu, Feb 02, 2012 at 12:13:00PM -0500, Nicolas Pitre wrote:
> On Wed, 1 Feb 2012, Paul E. McKenney wrote:
> 
> > On Wed, Feb 01, 2012 at 10:49:22PM -0500, Nicolas Pitre wrote:
> > > On Wed, 1 Feb 2012, Paul E. McKenney wrote:
> > > 
> > > > From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> > > > 
> > > > The idle loop is a quiscent state for RCU, which means that RCU ignores
> > > > CPUs that have told RCU that they are idle via rcu_idle_enter().
> > > > There are nevertheless quite a few places where idle CPUs use RCU, most
> > > > commonly indirectly via tracing.  This patch fixes these problems for ARM.
> > > > 
> > > > Many of these bugs have been in the kernel for quite some time, but
> > > > Frederic's recent change now gives warnings.
> > > > 
> > > > This patch takes the straightforward approach of pushing the
> > > > rcu_idle_enter()/rcu_idle_exit() pair further down into the core of the
> > > > idle loop.
> > > 
> > > NAK.
> > > 
> > > 1) This is going to conflict with a patch series I've pushed to RMK 
> > >    already.  You can see its result in linux-next.
> > > 
> > > 2) The purpose of (1) is to do precisely the very opposite i.e. move as 
> > >    much common knowledge as possible up the idle call paths and leave 
> > >    the platform specific quirks as bare as possible, if any.
> > > 
> > > So I'd much prefer if this change was constrained to be inside 
> > > cpu_idle(), or at least in its close vicinity.
> > 
> > OK.  Then the tracing in the inner idle loop needs to go.  Other
> > restructuring might be required as well. 
> 
> Let me expand a bit more on the "if any" in my #2 above.  Most ARM 
> sub-architectures don't have any special idle drivers and they simply 
> rely on the default cpu_do_idle call.  What this proposed patch is doing 
> is to move the rcu_idle_enter()/rcu_idle_exit() pair down into the few 
> subarchs with an existing cpu-idle callback.  This means in practice 
> that rcu_idle_enter()/rcu_idle_exit() wouldn't be called at all anymore 
> on most ARM targets.  Is that appropriate?

Now that you put it that way, it does seem to have severe disadvantages.
Good thing I tagged the patches as RFC.  ;-)

Annoying, that.  I was hoping to get this fixed on an arch-by-arch basis.
Looks like it might have to be a global change requiring global agreement. :-(

> > The long and short of it is that you absolutely cannot use RCU between 
> > the time you invoke rcu_idle_enter() and the time you invoke 
> > rcu_idle_exit().  The ARM idle code currently does just that and 
> > therefore must change.
> > 
> > Whatever change you choose that causes the code to meet that constraint
> > is met is fine by me.
> > 
> > But that constraint absolutely must be met.
> 
> I would have to know more about what the rcu_idle_*() calls imply before 
> suggesting an alternative.

After a call to rcu_idle_enter(), RCU ignores the CPU completely until the
next call to rcu_idle_exit().  This means that in the interval between
rcu_idle_enter() and rcu_idle_exit(), you can say rcu_read_lock() all
you want, but your data structures will get absolutely no RCU protection.
This will result in random memory corruption, and for all I know might
already be resultin in random memory corruption.  So a fix is required.

So why does RCU need to ignore idle CPUs?  Because otherwise, RCU needs
to periodically wake them up to check their status.  Waking them up
periodically defeats CONFIG_NO_HZ's attempts to conserve power.  Avoiding
waking them up and avoiding ignoring them extended RCU grace periods,
eventually OOMing the systems.

Why this new imposition?  It is not new.  It has always been illegal to
use RCU in the idle loop from the beginning.  But people happily ignored
that restriction, partly because it is not always immediately obvious
what is using RCU.  Event tracing does, but unless you know that...

So we added diagnostics to check for illegal uses of RCU in the idle
loop, and also separated rcu_idle_enter() and rcu_idle_exit() from
tick_nohz_stop_sched_tick(), so that things like idle notifiers can work.
The diagnostics promptly located all sorts of problems, including these.

The two options I see are:

1.	Rip tracing out of the inner idle loops and everything that
	they invoke.

2.	Push rcu_idle_enter() and rcu_idle_exit() further down into
	the idle loops.  As you say, -all- of them.

My patch set was an attempt at #2, but clearly a very poorly conceived
attempt.  But I had to start somewhere.

Other ideas?

							Thanx, Paul
Nicolas Pitre Feb. 2, 2012, 6:31 p.m. UTC | #7
On Thu, 2 Feb 2012, Paul E. McKenney wrote:

> On Thu, Feb 02, 2012 at 12:13:00PM -0500, Nicolas Pitre wrote:
> > I would have to know more about what the rcu_idle_*() calls imply before 
> > suggesting an alternative.
> 
> After a call to rcu_idle_enter(), RCU ignores the CPU completely until the
> next call to rcu_idle_exit().  This means that in the interval between
> rcu_idle_enter() and rcu_idle_exit(), you can say rcu_read_lock() all
> you want, but your data structures will get absolutely no RCU protection.
> This will result in random memory corruption, and for all I know might
> already be resultin in random memory corruption.  So a fix is required.
> 
> So why does RCU need to ignore idle CPUs?  Because otherwise, RCU needs
> to periodically wake them up to check their status.  Waking them up
> periodically defeats CONFIG_NO_HZ's attempts to conserve power.  Avoiding
> waking them up and avoiding ignoring them extended RCU grace periods,
> eventually OOMing the systems.
> 
> Why this new imposition?  It is not new.  It has always been illegal to
> use RCU in the idle loop from the beginning.  But people happily ignored
> that restriction, partly because it is not always immediately obvious
> what is using RCU.  Event tracing does, but unless you know that...

Not having the slightest idea about the tracing context, I'd simply 
suggest flaming the people who ignored the restriction harder.  Or turn 
that into a BUG() to get their attention.

> So we added diagnostics to check for illegal uses of RCU in the idle
> loop, and also separated rcu_idle_enter() and rcu_idle_exit() from
> tick_nohz_stop_sched_tick(), so that things like idle notifiers can work.
> The diagnostics promptly located all sorts of problems, including these.

Good.

> The two options I see are:
> 
> 1.	Rip tracing out of the inner idle loops and everything that
> 	they invoke.

What I suggested above.  But as I said I know sh*t about that tracing 
implementation so that's an easy suggestion for me to make.

> 2.	Push rcu_idle_enter() and rcu_idle_exit() further down into
> 	the idle loops.  As you say, -all- of them.
> 
> My patch set was an attempt at #2, but clearly a very poorly conceived
> attempt.  But I had to start somewhere.

I'm pretty opposed to #2 in any case.  Spreading knowledge about generic 
kernel infrastructure requirements across 53 or so different ARM 
subarchitectures is what has put ARM in trouble in the past, as core 
kernel folks called it a maintenance hell.  Just imagine that you do put 
your rcu_idle_enter() call in those subarchs, and a year from now you 
need to modify its semantics.  At that point you would have to audit all 
those 53 subarchs which might have evolved their idle support in the 
mean time, and the new ones that might have appeared with buggy usage of 
rcu_idle_enter() just because they copied it from somewhere else without 
thinking it through.

Now we're working very hard to kill that trend and move things in the 
other direction so to keep only minimal and very platform specific 
knowledge in the subarch code.  If rcu were to push its requirements 
down there again instead of keeping things abstracted in one place 
higher the stack then that would be a big step backward.

> Other ideas?

Have a special tracing API just for the idle code that queues its events 
in a per-CPU buffer (there should not be that many events to trace in 
the idle code) and have rcu_idle_exit() flush that buffer back in the 
actual tracing infrastructure.  Maybe rcu_idle_enter() could set things 
in a way so the regular tracing API could still be used regardless.  
This has the advantage of keeping the knowledge about rcu restrictions 
in a central place that can be much more easily maintained.


Nicolas
Paul E. McKenney Feb. 2, 2012, 7:07 p.m. UTC | #8
On Thu, Feb 02, 2012 at 01:31:28PM -0500, Nicolas Pitre wrote:
> On Thu, 2 Feb 2012, Paul E. McKenney wrote:
> 
> > On Thu, Feb 02, 2012 at 12:13:00PM -0500, Nicolas Pitre wrote:
> > > I would have to know more about what the rcu_idle_*() calls imply before 
> > > suggesting an alternative.
> > 
> > After a call to rcu_idle_enter(), RCU ignores the CPU completely until the
> > next call to rcu_idle_exit().  This means that in the interval between
> > rcu_idle_enter() and rcu_idle_exit(), you can say rcu_read_lock() all
> > you want, but your data structures will get absolutely no RCU protection.
> > This will result in random memory corruption, and for all I know might
> > already be resultin in random memory corruption.  So a fix is required.
> > 
> > So why does RCU need to ignore idle CPUs?  Because otherwise, RCU needs
> > to periodically wake them up to check their status.  Waking them up
> > periodically defeats CONFIG_NO_HZ's attempts to conserve power.  Avoiding
> > waking them up and avoiding ignoring them extended RCU grace periods,
> > eventually OOMing the systems.
> > 
> > Why this new imposition?  It is not new.  It has always been illegal to
> > use RCU in the idle loop from the beginning.  But people happily ignored
> > that restriction, partly because it is not always immediately obvious
> > what is using RCU.  Event tracing does, but unless you know that...
> 
> Not having the slightest idea about the tracing context, I'd simply 
> suggest flaming the people who ignored the restriction harder.  Or turn 
> that into a BUG() to get their attention.

Done in my tree.  But it screams enough that effort to get it fixed
are in order.

> > So we added diagnostics to check for illegal uses of RCU in the idle
> > loop, and also separated rcu_idle_enter() and rcu_idle_exit() from
> > tick_nohz_stop_sched_tick(), so that things like idle notifiers can work.
> > The diagnostics promptly located all sorts of problems, including these.
> 
> Good.
> 
> > The two options I see are:
> > 
> > 1.	Rip tracing out of the inner idle loops and everything that
> > 	they invoke.
> 
> What I suggested above.  But as I said I know sh*t about that tracing 
> implementation so that's an easy suggestion for me to make.

Works for me as well.  ;-)

> > 2.	Push rcu_idle_enter() and rcu_idle_exit() further down into
> > 	the idle loops.  As you say, -all- of them.
> > 
> > My patch set was an attempt at #2, but clearly a very poorly conceived
> > attempt.  But I had to start somewhere.
> 
> I'm pretty opposed to #2 in any case.  Spreading knowledge about generic 
> kernel infrastructure requirements across 53 or so different ARM 
> subarchitectures is what has put ARM in trouble in the past, as core 
> kernel folks called it a maintenance hell.  Just imagine that you do put 
> your rcu_idle_enter() call in those subarchs, and a year from now you 
> need to modify its semantics.  At that point you would have to audit all 
> those 53 subarchs which might have evolved their idle support in the 
> mean time, and the new ones that might have appeared with buggy usage of 
> rcu_idle_enter() just because they copied it from somewhere else without 
> thinking it through.

Fair point!

> Now we're working very hard to kill that trend and move things in the 
> other direction so to keep only minimal and very platform specific 
> knowledge in the subarch code.  If rcu were to push its requirements 
> down there again instead of keeping things abstracted in one place 
> higher the stack then that would be a big step backward.

Again, fair point!

> > Other ideas?
> 
> Have a special tracing API just for the idle code that queues its events 
> in a per-CPU buffer (there should not be that many events to trace in 
> the idle code) and have rcu_idle_exit() flush that buffer back in the 
> actual tracing infrastructure.  Maybe rcu_idle_enter() could set things 
> in a way so the regular tracing API could still be used regardless.  
> This has the advantage of keeping the knowledge about rcu restrictions 
> in a central place that can be much more easily maintained.

Hmmm...  Worth some thought.  Though the delay in trace messages might
be a bit disconcerting.  Probably better than random memory corruption,
though.

							Thanx, Paul
Kevin Hilman Feb. 2, 2012, 10:20 p.m. UTC | #9
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:

[...]

>> > The two options I see are:
>> > 
>> > 1.	Rip tracing out of the inner idle loops and everything that
>> > 	they invoke.
>> 
>> What I suggested above.  But as I said I know sh*t about that tracing 
>> implementation so that's an easy suggestion for me to make.
>
> Works for me as well.  ;-)

While I must admit not having a better suggestion, I for one would vote
strongly against removing tracing from the idle path.

Being a PM developer and maintainer, much of the code I work on and
maintain happens to be run in the bowels of the idle path.  Not having
the ability to trace this code would be a major step backwards IMO.

Kevin
Rob Herring Feb. 2, 2012, 10:49 p.m. UTC | #10
On 02/02/2012 04:20 PM, Kevin Hilman wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> 
> [...]
> 
>>>> The two options I see are:
>>>>
>>>> 1.	Rip tracing out of the inner idle loops and everything that
>>>> 	they invoke.
>>>
>>> What I suggested above.  But as I said I know sh*t about that tracing 
>>> implementation so that's an easy suggestion for me to make.
>>
>> Works for me as well.  ;-)
> 
> While I must admit not having a better suggestion, I for one would vote
> strongly against removing tracing from the idle path.
> 
> Being a PM developer and maintainer, much of the code I work on and
> maintain happens to be run in the bowels of the idle path.  Not having
> the ability to trace this code would be a major step backwards IMO.

How is it a step backwards if it is already broken. Obviously you
haven't actually used any tracing here because it doesn't work right
with things as is.

What exactly do you want to trace at this level. By the point you are in
this code, the path is somewhat known and problems you have are likely
h/w issues. If you are trying to go thru a very precise sequence of
saving cpu state and flushing caches, you don't want calls out to
tracing code that could very easily change the behavior.

It would be nice to understand a real example of how to hit this
problem. Why type of tracing? Are there other known examples?

Rob
Paul E. McKenney Feb. 2, 2012, 11:03 p.m. UTC | #11
On Thu, Feb 02, 2012 at 02:20:26PM -0800, Kevin Hilman wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> 
> [...]
> 
> >> > The two options I see are:
> >> > 
> >> > 1.	Rip tracing out of the inner idle loops and everything that
> >> > 	they invoke.
> >> 
> >> What I suggested above.  But as I said I know sh*t about that tracing 
> >> implementation so that's an easy suggestion for me to make.
> >
> > Works for me as well.  ;-)
> 
> While I must admit not having a better suggestion, I for one would vote
> strongly against removing tracing from the idle path.
> 
> Being a PM developer and maintainer, much of the code I work on and
> maintain happens to be run in the bowels of the idle path.  Not having
> the ability to trace this code would be a major step backwards IMO.

OK...

What if the tracing code between the rcu_idle_enter() and the
rcu_idle_exit() had to be enclosed in a wrapper?  For example,
the tracing in cpuidle_idle_call() might appear as follows:

	RCU_NONIDLE(
		trace_power_start(POWER_CSTATE, next_state, dev->cpu);
		trace_cpu_idle(next_state, dev->cpu);
	);

	entered_state = target_state->enter(dev, drv, next_state);

	RCU_NONIDLE(
		trace_power_end(dev->cpu);
		trace_cpu_idle(PWR_EVENT_EXIT, dev->cpu);
	);

The RCU_NONIDLE() macro would do rcu_idle_exit(), execute its
argument, then do rcu_idle_enter().  (Credit to Steven Rostedt
for suggesting this.)  Given the possibility of code invoked both
from idle and not-idle, I have some changes to rcu to allow nesting
of rcu_idle_enter() and rcu_idle_exit().

Would that work for you?

							Thanx, Paul
Steven Rostedt Feb. 2, 2012, 11:03 p.m. UTC | #12
On Thu, 2012-02-02 at 16:49 -0600, Rob Herring wrote:
> On 02/02/2012 04:20 PM, Kevin Hilman wrote:
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> > 
> > [...]
> > 
> >>>> The two options I see are:
> >>>>
> >>>> 1.	Rip tracing out of the inner idle loops and everything that
> >>>> 	they invoke.
> >>>
> >>> What I suggested above.  But as I said I know sh*t about that tracing 
> >>> implementation so that's an easy suggestion for me to make.
> >>
> >> Works for me as well.  ;-)
> > 
> > While I must admit not having a better suggestion, I for one would vote
> > strongly against removing tracing from the idle path.
> > 
> > Being a PM developer and maintainer, much of the code I work on and
> > maintain happens to be run in the bowels of the idle path.  Not having
> > the ability to trace this code would be a major step backwards IMO.
> 
> How is it a step backwards if it is already broken. Obviously you
> haven't actually used any tracing here because it doesn't work right
> with things as is.
> 
> What exactly do you want to trace at this level. By the point you are in
> this code, the path is somewhat known and problems you have are likely
> h/w issues. If you are trying to go thru a very precise sequence of
> saving cpu state and flushing caches, you don't want calls out to
> tracing code that could very easily change the behavior.
> 
> It would be nice to understand a real example of how to hit this
> problem. Why type of tracing? Are there other known examples?

Well, these are the tracepoints in question:

trace_power_start(), trace_cpu_idle(), and trace_power_end()

But this is only used by powertop, and I'm sure nobody would mind if you
rip these tracepoints out, causing powertop to no longer work. (that's
why we have 4 bytes per tracepoint wasting space in every trace event).

Anyway, one answer is (and I was talking with Paul about this on IRC) is
that we can create a special "TRACE_EVENT_IDLE()" that will explicitly
call "rcu_idle_exit/enter()" as it expects to be called with it off.

This should solve most issues I believe.

-- Steve
Paul E. McKenney Feb. 2, 2012, 11:27 p.m. UTC | #13
On Thu, Feb 02, 2012 at 06:03:39PM -0500, Steven Rostedt wrote:

[ . . . ]

> Anyway, one answer is (and I was talking with Paul about this on IRC) is
> that we can create a special "TRACE_EVENT_IDLE()" that will explicitly
> call "rcu_idle_exit/enter()" as it expects to be called with it off.
> 
> This should solve most issues I believe.

You OK with something like RCU_NONIDLE() rather than RCU_EVENT_IDLE()?
I have this funny feeling that tracing won't be the only thing using
RCU from idle.  :-/

Something like this, perhaps?

	#define RCU_NONIDLE(a) \
		do { \
			rcu_idle_exit(); \
			do { a; } while (0); \
			rcu_idle_enter(); \
		}

							Thanx, Paul
Rob Herring Feb. 2, 2012, 11:39 p.m. UTC | #14
On 02/02/2012 05:03 PM, Steven Rostedt wrote:
> On Thu, 2012-02-02 at 16:49 -0600, Rob Herring wrote:
>> On 02/02/2012 04:20 PM, Kevin Hilman wrote:
>>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>>>
>>> [...]
>>>
>>>>>> The two options I see are:
>>>>>>
>>>>>> 1.	Rip tracing out of the inner idle loops and everything that
>>>>>> 	they invoke.
>>>>>
>>>>> What I suggested above.  But as I said I know sh*t about that tracing 
>>>>> implementation so that's an easy suggestion for me to make.
>>>>
>>>> Works for me as well.  ;-)
>>>
>>> While I must admit not having a better suggestion, I for one would vote
>>> strongly against removing tracing from the idle path.
>>>
>>> Being a PM developer and maintainer, much of the code I work on and
>>> maintain happens to be run in the bowels of the idle path.  Not having
>>> the ability to trace this code would be a major step backwards IMO.
>>
>> How is it a step backwards if it is already broken. Obviously you
>> haven't actually used any tracing here because it doesn't work right
>> with things as is.
>>
>> What exactly do you want to trace at this level. By the point you are in
>> this code, the path is somewhat known and problems you have are likely
>> h/w issues. If you are trying to go thru a very precise sequence of
>> saving cpu state and flushing caches, you don't want calls out to
>> tracing code that could very easily change the behavior.
>>
>> It would be nice to understand a real example of how to hit this
>> problem. Why type of tracing? Are there other known examples?
> 
> Well, these are the tracepoints in question:
> 
> trace_power_start(), trace_cpu_idle(), and trace_power_end()
> 
> But this is only used by powertop, and I'm sure nobody would mind if you
> rip these tracepoints out, causing powertop to no longer work. (that's
> why we have 4 bytes per tracepoint wasting space in every trace event).
> 
> Anyway, one answer is (and I was talking with Paul about this on IRC) is
> that we can create a special "TRACE_EVENT_IDLE()" that will explicitly
> call "rcu_idle_exit/enter()" as it expects to be called with it off.
> 
> This should solve most issues I believe.

Seems like these could all be placed common cpuidle code. If you're not
using cpuidle, then you probably aren't doing much in the default idle
code. You would still need the wrappers within the cpuidle code or push
rcu_idle_exit/enter down into a common default idle (which I think we
now have with Nico's clean-up).

Rob
Paul E. McKenney Feb. 2, 2012, 11:51 p.m. UTC | #15
On Thu, Feb 02, 2012 at 03:27:36PM -0800, Paul E. McKenney wrote:
> On Thu, Feb 02, 2012 at 06:03:39PM -0500, Steven Rostedt wrote:
> 
> [ . . . ]
> 
> > Anyway, one answer is (and I was talking with Paul about this on IRC) is
> > that we can create a special "TRACE_EVENT_IDLE()" that will explicitly
> > call "rcu_idle_exit/enter()" as it expects to be called with it off.
> > 
> > This should solve most issues I believe.
> 
> You OK with something like RCU_NONIDLE() rather than RCU_EVENT_IDLE()?
> I have this funny feeling that tracing won't be the only thing using
> RCU from idle.  :-/
> 
> Something like this, perhaps?

Or perhaps with the trailing "while (0)" included:

	#define RCU_NONIDLE(a) \
		do { \
			rcu_idle_exit(); \
			do { a; } while (0); \
			rcu_idle_enter(); \
		} while (0)

							Thanx, Paul
Steven Rostedt Feb. 3, 2012, 2:45 a.m. UTC | #16
On Thu, 2012-02-02 at 15:27 -0800, Paul E. McKenney wrote:
> On Thu, Feb 02, 2012 at 06:03:39PM -0500, Steven Rostedt wrote:
> 
> [ . . . ]
> 
> > Anyway, one answer is (and I was talking with Paul about this on IRC) is
> > that we can create a special "TRACE_EVENT_IDLE()" that will explicitly
> > call "rcu_idle_exit/enter()" as it expects to be called with it off.
> > 
> > This should solve most issues I believe.
> 
> You OK with something like RCU_NONIDLE() rather than RCU_EVENT_IDLE()?
> I have this funny feeling that tracing won't be the only thing using
> RCU from idle.  :-/
> 
> Something like this, perhaps?
> 
> 	#define RCU_NONIDLE(a) \
> 		do { \
> 			rcu_idle_exit(); \
> 			do { a; } while (0); \
> 			rcu_idle_enter(); \
> 		}
> 

I'm fine with this. But what is the overhead for doing such a thing.
Remember, to encapsulate a tracepoint means that you will be doing that
work everytime regardless if the tracepoint is ever activated or not. Or
even worse, not even compiled into the kernel.

Now this is entering and exiting idle, so maybe it's not that critical?

-- Steve
Paul E. McKenney Feb. 3, 2012, 6:04 a.m. UTC | #17
On Thu, Feb 02, 2012 at 09:45:31PM -0500, Steven Rostedt wrote:
> On Thu, 2012-02-02 at 15:27 -0800, Paul E. McKenney wrote:
> > On Thu, Feb 02, 2012 at 06:03:39PM -0500, Steven Rostedt wrote:
> > 
> > [ . . . ]
> > 
> > > Anyway, one answer is (and I was talking with Paul about this on IRC) is
> > > that we can create a special "TRACE_EVENT_IDLE()" that will explicitly
> > > call "rcu_idle_exit/enter()" as it expects to be called with it off.
> > > 
> > > This should solve most issues I believe.
> > 
> > You OK with something like RCU_NONIDLE() rather than RCU_EVENT_IDLE()?
> > I have this funny feeling that tracing won't be the only thing using
> > RCU from idle.  :-/
> > 
> > Something like this, perhaps?
> > 
> > 	#define RCU_NONIDLE(a) \
> > 		do { \
> > 			rcu_idle_exit(); \
> > 			do { a; } while (0); \
> > 			rcu_idle_enter(); \
> > 		}
> > 
> 
> I'm fine with this. But what is the overhead for doing such a thing.
> Remember, to encapsulate a tracepoint means that you will be doing that
> work everytime regardless if the tracepoint is ever activated or not. Or
> even worse, not even compiled into the kernel.
> 
> Now this is entering and exiting idle, so maybe it's not that critical?

It is an atomic instruction or two, plus some memory barriers.  Entering
idle is more heavyweight for RCU_FAST_NO_HZ.  But as you say, it is
entering and exiting idle.

But should I make an empty definition of RCU_NONIDLE() for some #define
or another?

	#ifdef CONFIG_YOU_TELL_ME
	#define RCU_NONIDLE(a) \
		do { \
			rcu_idle_exit(); \
			do { a; } while (0); \
			rcu_idle_enter(); \
		} while (0)
	#else /* #ifdef CONFIG_YOU_TELL_ME */
	#define RCU_NONIDLE(a) do { } while (0);
	#endif /* #else #ifdef CONFIG_YOU_TELL_ME */

Or is event tracing unconditional these days?

							Thanx, Paul
Paul E. McKenney Feb. 3, 2012, 6:06 a.m. UTC | #18
On Thu, Feb 02, 2012 at 06:53:50PM -0800, Josh Triplett wrote:
> On Thu, Feb 02, 2012 at 09:45:31PM -0500, Steven Rostedt wrote:
> > On Thu, 2012-02-02 at 15:27 -0800, Paul E. McKenney wrote:
> > > On Thu, Feb 02, 2012 at 06:03:39PM -0500, Steven Rostedt wrote:
> > > 
> > > [ . . . ]
> > > 
> > > > Anyway, one answer is (and I was talking with Paul about this on IRC) is
> > > > that we can create a special "TRACE_EVENT_IDLE()" that will explicitly
> > > > call "rcu_idle_exit/enter()" as it expects to be called with it off.
> > > > 
> > > > This should solve most issues I believe.
> > > 
> > > You OK with something like RCU_NONIDLE() rather than RCU_EVENT_IDLE()?
> > > I have this funny feeling that tracing won't be the only thing using
> > > RCU from idle.  :-/
> > > 
> > > Something like this, perhaps?
> > > 
> > > 	#define RCU_NONIDLE(a) \
> > > 		do { \
> > > 			rcu_idle_exit(); \
> > > 			do { a; } while (0); \
> > > 			rcu_idle_enter(); \
> > > 		}
> > > 
> > 
> > I'm fine with this. But what is the overhead for doing such a thing.
> > Remember, to encapsulate a tracepoint means that you will be doing that
> > work everytime regardless if the tracepoint is ever activated or not. Or
> > even worse, not even compiled into the kernel.
> 
> Yeah, this should definitely try to avoid doing any work in the normal
> case, if at all possible.  Not quite sure how to integrate that into
> tracepoints, though, short of teaching the trace API how to deal with
> getting invoked with RCU currently idle.

I agree that we need to apply this on a tracepoint-by-tracepoint basis --
too expensive for every tracepoint to be doing this, so reserve it for
the ones used in idle.

							Thanx, Paul
Kevin Hilman Feb. 3, 2012, 6:41 p.m. UTC | #19
Rob Herring <robherring2@gmail.com> writes:

> On 02/02/2012 04:20 PM, Kevin Hilman wrote:
>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>> 
>> [...]
>> 
>>>>> The two options I see are:
>>>>>
>>>>> 1.	Rip tracing out of the inner idle loops and everything that
>>>>> 	they invoke.
>>>>
>>>> What I suggested above.  But as I said I know sh*t about that tracing 
>>>> implementation so that's an easy suggestion for me to make.
>>>
>>> Works for me as well.  ;-)
>> 
>> While I must admit not having a better suggestion, I for one would vote
>> strongly against removing tracing from the idle path.
>> 
>> Being a PM developer and maintainer, much of the code I work on and
>> maintain happens to be run in the bowels of the idle path.  Not having
>> the ability to trace this code would be a major step backwards IMO.
>
> How is it a step backwards if it is already broken. 

Well, I didn't know it was broken. ;) And, as Paul mentioned, this has
been broken for a long time. Apparently it's been working well enough
for nobody to notice until recently.

> Obviously you haven't actually used any tracing here because it
> doesn't work right with things as is.

It's been working well enough for me to debug several idle path problems
with tracing.  Admittedly, this has been primarily on UP systems, but
I've recently started using the tracing on SMP as well.  (however, due
to "coupled" low-power states on OMAP, large parts of the idle path are
effectively UP since one CPU0 has to wait for CPU1 to hit a low-power
state before it can.)

> What exactly do you want to trace at this level. By the point you are in
> this code, the path is somewhat known and problems you have are likely
> h/w issues. 

Not really. 

There is still quite a bit of software between the decision to enter
idle and the hardware taking over.  On OMAP for example, we have power
domains, clock domains and clocks that are managed during idle, and
these layers contain tracepoints.

Add to that the runtime PM management of some devices that are coupled
to the CPU (because they share a power domain, etc).  Runtime PM
contains tracepoints.

Add to that possible voltage scaling during idle using regulators.
Regulator framework has tracepoints.

That can lead to quite a bit of tracing info *after* the decision to
enter idle.

> If you are trying to go thru a very precise sequence of
> saving cpu state and flushing caches, you don't want calls out to
> tracing code that could very easily change the behavior.

I'm more worried about the power domain and voltage domain transitions
(or lack thereof) when trying to debug why a particular low-power state
was not hit.

Kevin
Steven Rostedt Feb. 3, 2012, 6:55 p.m. UTC | #20
On Thu, 2012-02-02 at 22:04 -0800, Paul E. McKenney wrote:
> On Thu, Feb 02, 2012 at 09:45:31PM -0500, Steven Rostedt wrote:

> It is an atomic instruction or two, plus some memory barriers.  Entering
> idle is more heavyweight for RCU_FAST_NO_HZ.  But as you say, it is
> entering and exiting idle.
> 
> But should I make an empty definition of RCU_NONIDLE() for some #define
> or another?
> 
> 	#ifdef CONFIG_YOU_TELL_ME
> 	#define RCU_NONIDLE(a) \
> 		do { \
> 			rcu_idle_exit(); \
> 			do { a; } while (0); \
> 			rcu_idle_enter(); \
> 		} while (0)
> 	#else /* #ifdef CONFIG_YOU_TELL_ME */
> 	#define RCU_NONIDLE(a) do { } while (0);
> 	#endif /* #else #ifdef CONFIG_YOU_TELL_ME */
> 
> Or is event tracing unconditional these days?

I don't like it. As it binds the RCU_NONIDLE to tracepoints only without
any annotation that they are bound. Still doesn't help when tracepoints
are configured but not enabled.

I have no problem in making a special TRACE_EVENT_IDLE() that does this
inside the jump label. Basically what we have today is:


	if (static_branch(tracepoint_key)) {
		rcu_read_lock_sched_notrace();
		for (all attached tracepoints) {
			[...]
		}
		rcu_read_unlock_sched_notrace();
	}

Ideally we want the enter/exit idle inside that static_branch()
condition:

	if (static_branch(tracepoint_key)) {
		rcu_idle_exit();
		rcu_read_lock_sched_notrace();
		for (all attached tracepoints) {
			[...]
		}
		rcu_read_unlock_sched_notrace();
		rcu_idle_enter();
	}

The static_branch() is the jump label code when it's a nop when disabled
and a jump to the tracing code when enabled:

	nop; /* or jmp 2f */  <<--- jump label
1:	[ normal code ]
	ret;

2:	[trace code]
	jmp 1b


The jump label when disabled is just a nop that ignores the trace code
(although current gcc has a bug that it currently doesn't do it this
elegantly). When tracing is enabled the nop is converted to a jump to
the tracing code. This makes tracepoints very light weight in hot paths.

Ideally, we want the exit/enter rcu idle with in the [trace code], which
makes it not used when not needed.

-- Steve
Kevin Hilman Feb. 3, 2012, 7:12 p.m. UTC | #21
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:

> On Thu, Feb 02, 2012 at 02:20:26PM -0800, Kevin Hilman wrote:
>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>> 
>> [...]
>> 
>> >> > The two options I see are:
>> >> > 
>> >> > 1.	Rip tracing out of the inner idle loops and everything that
>> >> > 	they invoke.
>> >> 
>> >> What I suggested above.  But as I said I know sh*t about that tracing 
>> >> implementation so that's an easy suggestion for me to make.
>> >
>> > Works for me as well.  ;-)
>> 
>> While I must admit not having a better suggestion, I for one would vote
>> strongly against removing tracing from the idle path.
>> 
>> Being a PM developer and maintainer, much of the code I work on and
>> maintain happens to be run in the bowels of the idle path.  Not having
>> the ability to trace this code would be a major step backwards IMO.
>
> OK...
>
> What if the tracing code between the rcu_idle_enter() and the
> rcu_idle_exit() had to be enclosed in a wrapper?  For example,
> the tracing in cpuidle_idle_call() might appear as follows:
>
> 	RCU_NONIDLE(
> 		trace_power_start(POWER_CSTATE, next_state, dev->cpu);
> 		trace_cpu_idle(next_state, dev->cpu);
> 	);
>
> 	entered_state = target_state->enter(dev, drv, next_state);
>
> 	RCU_NONIDLE(
> 		trace_power_end(dev->cpu);
> 		trace_cpu_idle(PWR_EVENT_EXIT, dev->cpu);
> 	);
>
> The RCU_NONIDLE() macro would do rcu_idle_exit(), execute its
> argument, then do rcu_idle_enter().  (Credit to Steven Rostedt
> for suggesting this.)  Given the possibility of code invoked both
> from idle and not-idle, I have some changes to rcu to allow nesting
> of rcu_idle_enter() and rcu_idle_exit().
>
> Would that work for you?

Yes, that should work.

And I defintely have examples of code paths that use tracepoints in both
idle and non-idle context (power domains, clocks, etc.) so the changes
to allow nesting will be needed.

Thanks,

Kevin
Paul E. McKenney Feb. 3, 2012, 7:26 p.m. UTC | #22
On Fri, Feb 03, 2012 at 10:41:01AM -0800, Kevin Hilman wrote:
> Rob Herring <robherring2@gmail.com> writes:
> 
> > On 02/02/2012 04:20 PM, Kevin Hilman wrote:
> >> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> >> 
> >> [...]
> >> 
> >>>>> The two options I see are:
> >>>>>
> >>>>> 1.	Rip tracing out of the inner idle loops and everything that
> >>>>> 	they invoke.
> >>>>
> >>>> What I suggested above.  But as I said I know sh*t about that tracing 
> >>>> implementation so that's an easy suggestion for me to make.
> >>>
> >>> Works for me as well.  ;-)
> >> 
> >> While I must admit not having a better suggestion, I for one would vote
> >> strongly against removing tracing from the idle path.
> >> 
> >> Being a PM developer and maintainer, much of the code I work on and
> >> maintain happens to be run in the bowels of the idle path.  Not having
> >> the ability to trace this code would be a major step backwards IMO.
> >
> > How is it a step backwards if it is already broken. 
> 
> Well, I didn't know it was broken. ;) And, as Paul mentioned, this has
> been broken for a long time. Apparently it's been working well enough
> for nobody to notice until recently.

Yep.  The probability is quite low, but the consequences are dire.
We might well have hit it occasionally -- it would be a random
inexplicable crash.

							Thanx, Paul

> > Obviously you haven't actually used any tracing here because it
> > doesn't work right with things as is.
> 
> It's been working well enough for me to debug several idle path problems
> with tracing.  Admittedly, this has been primarily on UP systems, but
> I've recently started using the tracing on SMP as well.  (however, due
> to "coupled" low-power states on OMAP, large parts of the idle path are
> effectively UP since one CPU0 has to wait for CPU1 to hit a low-power
> state before it can.)
> 
> > What exactly do you want to trace at this level. By the point you are in
> > this code, the path is somewhat known and problems you have are likely
> > h/w issues. 
> 
> Not really. 
> 
> There is still quite a bit of software between the decision to enter
> idle and the hardware taking over.  On OMAP for example, we have power
> domains, clock domains and clocks that are managed during idle, and
> these layers contain tracepoints.
> 
> Add to that the runtime PM management of some devices that are coupled
> to the CPU (because they share a power domain, etc).  Runtime PM
> contains tracepoints.
> 
> Add to that possible voltage scaling during idle using regulators.
> Regulator framework has tracepoints.
> 
> That can lead to quite a bit of tracing info *after* the decision to
> enter idle.
> 
> > If you are trying to go thru a very precise sequence of
> > saving cpu state and flushing caches, you don't want calls out to
> > tracing code that could very easily change the behavior.
> 
> I'm more worried about the power domain and voltage domain transitions
> (or lack thereof) when trying to debug why a particular low-power state
> was not hit.
> 
> Kevin
>
Paul E. McKenney Feb. 3, 2012, 7:26 p.m. UTC | #23
On Fri, Feb 03, 2012 at 11:12:49AM -0800, Kevin Hilman wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> 
> > On Thu, Feb 02, 2012 at 02:20:26PM -0800, Kevin Hilman wrote:
> >> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> >> 
> >> [...]
> >> 
> >> >> > The two options I see are:
> >> >> > 
> >> >> > 1.	Rip tracing out of the inner idle loops and everything that
> >> >> > 	they invoke.
> >> >> 
> >> >> What I suggested above.  But as I said I know sh*t about that tracing 
> >> >> implementation so that's an easy suggestion for me to make.
> >> >
> >> > Works for me as well.  ;-)
> >> 
> >> While I must admit not having a better suggestion, I for one would vote
> >> strongly against removing tracing from the idle path.
> >> 
> >> Being a PM developer and maintainer, much of the code I work on and
> >> maintain happens to be run in the bowels of the idle path.  Not having
> >> the ability to trace this code would be a major step backwards IMO.
> >
> > OK...
> >
> > What if the tracing code between the rcu_idle_enter() and the
> > rcu_idle_exit() had to be enclosed in a wrapper?  For example,
> > the tracing in cpuidle_idle_call() might appear as follows:
> >
> > 	RCU_NONIDLE(
> > 		trace_power_start(POWER_CSTATE, next_state, dev->cpu);
> > 		trace_cpu_idle(next_state, dev->cpu);
> > 	);
> >
> > 	entered_state = target_state->enter(dev, drv, next_state);
> >
> > 	RCU_NONIDLE(
> > 		trace_power_end(dev->cpu);
> > 		trace_cpu_idle(PWR_EVENT_EXIT, dev->cpu);
> > 	);
> >
> > The RCU_NONIDLE() macro would do rcu_idle_exit(), execute its
> > argument, then do rcu_idle_enter().  (Credit to Steven Rostedt
> > for suggesting this.)  Given the possibility of code invoked both
> > from idle and not-idle, I have some changes to rcu to allow nesting
> > of rcu_idle_enter() and rcu_idle_exit().
> >
> > Would that work for you?
> 
> Yes, that should work.
> 
> And I defintely have examples of code paths that use tracepoints in both
> idle and non-idle context (power domains, clocks, etc.) so the changes
> to allow nesting will be needed.

OK, good to know that my paranoia is still functioning correctly.  ;-)

							Thanx, Paul
Steven Rostedt Feb. 3, 2012, 7:36 p.m. UTC | #24
On Fri, 2012-02-03 at 10:41 -0800, Kevin Hilman wrote:

> > How is it a step backwards if it is already broken. 
> 
> Well, I didn't know it was broken. ;) And, as Paul mentioned, this has
> been broken for a long time. Apparently it's been working well enough
> for nobody to notice until recently.
> 
> > Obviously you haven't actually used any tracing here because it
> > doesn't work right with things as is.
> 
> It's been working well enough for me to debug several idle path problems
> with tracing.  Admittedly, this has been primarily on UP systems, but
> I've recently started using the tracing on SMP as well.  (however, due
> to "coupled" low-power states on OMAP, large parts of the idle path are
> effectively UP since one CPU0 has to wait for CPU1 to hit a low-power
> state before it can.)

It's used by all users of powertop, and we haven't heard about a bug
yet. This doesn't mean that the bug doesn't exist. The race is extremely
hard to hit. It's one of those "good bugs". You know, the kind that you
don't really have to worry about because you are more likely to win the
lottery, become President of the United States, and find a cure for
cancer (all those together, not just one) than the chance of hitting
this bug. But it's a bug regardless and should, unfortunately, be fixed.

But here's the explanation of the bug:

As Paul has stated, when rcu_idle_enter() is in effect, the calls to
rcu_read_lock_* are ignored. Thus we can pretend they don't exist.

The code in question is the __DO_TRACE() in include/linux/tracepoint.h:

		rcu_read_lock_sched_notrace();				\
		it_func_ptr = rcu_dereference_sched((tp)->funcs);	\
		if (it_func_ptr) {					\
			do {						\
				it_func = (it_func_ptr)->func;		\
				__data = (it_func_ptr)->data;		\
				((void(*)(proto))(it_func))(args);	\
			} while ((++it_func_ptr)->func);		\
		}							\
		rcu_read_unlock_sched_notrace();	

As stated above, the rcu_read_(un)lock_sched_notrace() are worthless
when in rcu_idle_enter().

They protect the referencing of tp->funcs, which is an array of all
funcs that are attached to this tracepoint.

Now we need to look at kernel/tracepoint.c:

The protection is needed against a simultaneous insertion or deletion of
a tracepoint hook. This happens when a user enables or disables tracing.

Note, this race is even made harder to hit, because due to the static
branch that controls whether this gets called, will be off if no
tracepoints are attached. So the race can only happen after at least one
tracepoint is active.

But if two probes are are added to this tracepoint, then we can hit the
race. And it is possible to trigger with only one probe on removal.

When adding or removing a tracepoint, the array (the one that
it_func_ptr points to) is updated by allocating a new array, copying the
old array plus or minus the tracpoint being added or removed, setting
the tp->funcs to the new array, and then it calls call_rcu_sched() to
free it.

Now for the bug to hit, something had to be coming in or out of idle,
and jumping to this code. Between the time it got the it_func_ptr to the
time it accessed any of that pointer's data in the loop, the tp->func
had to be updated to the new array, and then all CPUs would have passed
a schedule point (except the rcu_idle CPUs).

On uniprocessor, this is not an issue, but on SMP, it is possible that
with two CPUs the first being in rcu_idle may be ignored, and the second
would have been adding the tracepoint and then going directly to freeing
the code. But as tracepoints are very low weight, it is most likely that
the tracepoints will finish before the first could even free the memory.

But the chance does exist. As the chance of me winning the lottery,
becoming President of the United States, and curing cancer also exists!

;-)

-- Steve
Paul E. McKenney Feb. 3, 2012, 7:40 p.m. UTC | #25
On Fri, Feb 03, 2012 at 01:55:09PM -0500, Steven Rostedt wrote:
> On Thu, 2012-02-02 at 22:04 -0800, Paul E. McKenney wrote:
> > On Thu, Feb 02, 2012 at 09:45:31PM -0500, Steven Rostedt wrote:
> 
> > It is an atomic instruction or two, plus some memory barriers.  Entering
> > idle is more heavyweight for RCU_FAST_NO_HZ.  But as you say, it is
> > entering and exiting idle.
> > 
> > But should I make an empty definition of RCU_NONIDLE() for some #define
> > or another?
> > 
> > 	#ifdef CONFIG_YOU_TELL_ME
> > 	#define RCU_NONIDLE(a) \
> > 		do { \
> > 			rcu_idle_exit(); \
> > 			do { a; } while (0); \
> > 			rcu_idle_enter(); \
> > 		} while (0)
> > 	#else /* #ifdef CONFIG_YOU_TELL_ME */
> > 	#define RCU_NONIDLE(a) do { } while (0);
> > 	#endif /* #else #ifdef CONFIG_YOU_TELL_ME */
> > 
> > Or is event tracing unconditional these days?
> 
> I don't like it. As it binds the RCU_NONIDLE to tracepoints only without
> any annotation that they are bound. Still doesn't help when tracepoints
> are configured but not enabled.
> 
> I have no problem in making a special TRACE_EVENT_IDLE() that does this
> inside the jump label. Basically what we have today is:
> 
> 
> 	if (static_branch(tracepoint_key)) {
> 		rcu_read_lock_sched_notrace();
> 		for (all attached tracepoints) {
> 			[...]
> 		}
> 		rcu_read_unlock_sched_notrace();
> 	}
> 
> Ideally we want the enter/exit idle inside that static_branch()
> condition:
> 
> 	if (static_branch(tracepoint_key)) {
> 		rcu_idle_exit();
> 		rcu_read_lock_sched_notrace();
> 		for (all attached tracepoints) {
> 			[...]
> 		}
> 		rcu_read_unlock_sched_notrace();
> 		rcu_idle_enter();
> 	}
> 
> The static_branch() is the jump label code when it's a nop when disabled
> and a jump to the tracing code when enabled:
> 
> 	nop; /* or jmp 2f */  <<--- jump label
> 1:	[ normal code ]
> 	ret;
> 
> 2:	[trace code]
> 	jmp 1b
> 
> 
> The jump label when disabled is just a nop that ignores the trace code
> (although current gcc has a bug that it currently doesn't do it this
> elegantly). When tracing is enabled the nop is converted to a jump to
> the tracing code. This makes tracepoints very light weight in hot paths.
> 
> Ideally, we want the exit/enter rcu idle with in the [trace code], which
> makes it not used when not needed.

So the idea is that if you have a trace event that is to be used in idle,
you use TRACE_EVENT_IDLE() rather than TRACE_EVENT() to declare that
trace event?  That would work for me, and might make for fewer changes
for the architecture guys.  Also, this should address the code-size
concerns we discussed yesterday.

So sounds good!

Is a DEFINE_EVENT_IDLE() also needed?  Or prehaps a
DECLARE_EVENT_CLASS_IDLE()?  My guess is "yes" for at least one of the
two based on include/trace/events/power.h.

I will keep RCU_NONIDLE() for at least a little while (reworking comments
to point out TRACE_EVENT_IDLE() and friends) in case there turn out to
be non-tracepoint uses of RCU in the idle loop.

							Thanx, Paul
Steven Rostedt Feb. 3, 2012, 8:02 p.m. UTC | #26
On Fri, 2012-02-03 at 11:40 -0800, Paul E. McKenney wrote:

> So the idea is that if you have a trace event that is to be used in idle,
> you use TRACE_EVENT_IDLE() rather than TRACE_EVENT() to declare that
> trace event?  That would work for me, and might make for fewer changes
> for the architecture guys.  Also, this should address the code-size
> concerns we discussed yesterday.
> 
> So sounds good!
> 
> Is a DEFINE_EVENT_IDLE() also needed?  Or prehaps a
> DECLARE_EVENT_CLASS_IDLE()?  My guess is "yes" for at least one of the
> two based on include/trace/events/power.h.

I'll have to take a look. I may even find a better way to do this too.

> 
> I will keep RCU_NONIDLE() for at least a little while (reworking comments
> to point out TRACE_EVENT_IDLE() and friends) in case there turn out to
> be non-tracepoint uses of RCU in the idle loop.

OK, I'll take a crack at this next Monday.

-- Steve
Paul E. McKenney Feb. 3, 2012, 8:23 p.m. UTC | #27
On Fri, Feb 03, 2012 at 03:02:55PM -0500, Steven Rostedt wrote:
> On Fri, 2012-02-03 at 11:40 -0800, Paul E. McKenney wrote:
> 
> > So the idea is that if you have a trace event that is to be used in idle,
> > you use TRACE_EVENT_IDLE() rather than TRACE_EVENT() to declare that
> > trace event?  That would work for me, and might make for fewer changes
> > for the architecture guys.  Also, this should address the code-size
> > concerns we discussed yesterday.
> > 
> > So sounds good!
> > 
> > Is a DEFINE_EVENT_IDLE() also needed?  Or prehaps a
> > DECLARE_EVENT_CLASS_IDLE()?  My guess is "yes" for at least one of the
> > two based on include/trace/events/power.h.
> 
> I'll have to take a look. I may even find a better way to do this too.

Better is always better.  ;-)

> > I will keep RCU_NONIDLE() for at least a little while (reworking comments
> > to point out TRACE_EVENT_IDLE() and friends) in case there turn out to
> > be non-tracepoint uses of RCU in the idle loop.
> 
> OK, I'll take a crack at this next Monday.

Sounds good!

I plan to push my stack to -next later today, but will yank my cpuidle
commit as soon as your approach is available.  They do not conflict
because the rcu_idle_enter()s nest.

							Thanx, Paul
Paul E. McKenney Feb. 4, 2012, 2:21 p.m. UTC | #28
On Fri, Feb 03, 2012 at 02:36:27PM -0500, Steven Rostedt wrote:
> On Fri, 2012-02-03 at 10:41 -0800, Kevin Hilman wrote:
> 
> > > How is it a step backwards if it is already broken. 
> > 
> > Well, I didn't know it was broken. ;) And, as Paul mentioned, this has
> > been broken for a long time. Apparently it's been working well enough
> > for nobody to notice until recently.
> > 
> > > Obviously you haven't actually used any tracing here because it
> > > doesn't work right with things as is.
> > 
> > It's been working well enough for me to debug several idle path problems
> > with tracing.  Admittedly, this has been primarily on UP systems, but
> > I've recently started using the tracing on SMP as well.  (however, due
> > to "coupled" low-power states on OMAP, large parts of the idle path are
> > effectively UP since one CPU0 has to wait for CPU1 to hit a low-power
> > state before it can.)
> 
> It's used by all users of powertop, and we haven't heard about a bug
> yet. This doesn't mean that the bug doesn't exist. The race is extremely
> hard to hit. It's one of those "good bugs". You know, the kind that you
> don't really have to worry about because you are more likely to win the
> lottery, become President of the United States, and find a cure for
> cancer (all those together, not just one) than the chance of hitting
> this bug. But it's a bug regardless and should, unfortunately, be fixed.
> 
> But here's the explanation of the bug:
> 
> As Paul has stated, when rcu_idle_enter() is in effect, the calls to
> rcu_read_lock_* are ignored. Thus we can pretend they don't exist.
> 
> The code in question is the __DO_TRACE() in include/linux/tracepoint.h:
> 
> 		rcu_read_lock_sched_notrace();				\
> 		it_func_ptr = rcu_dereference_sched((tp)->funcs);	\
> 		if (it_func_ptr) {					\
> 			do {						\
> 				it_func = (it_func_ptr)->func;		\
> 				__data = (it_func_ptr)->data;		\
> 				((void(*)(proto))(it_func))(args);	\
> 			} while ((++it_func_ptr)->func);		\
> 		}							\
> 		rcu_read_unlock_sched_notrace();	
> 
> As stated above, the rcu_read_(un)lock_sched_notrace() are worthless
> when in rcu_idle_enter().
> 
> They protect the referencing of tp->funcs, which is an array of all
> funcs that are attached to this tracepoint.
> 
> Now we need to look at kernel/tracepoint.c:
> 
> The protection is needed against a simultaneous insertion or deletion of
> a tracepoint hook. This happens when a user enables or disables tracing.
> 
> Note, this race is even made harder to hit, because due to the static
> branch that controls whether this gets called, will be off if no
> tracepoints are attached. So the race can only happen after at least one
> tracepoint is active.

I agree that this race is hard to hit when running Linux on bare metal.

But consider a Linux kernel running as a guest OS.  Then the host might
preempt the guest in the middle of a tracepoint.  Then from the guest OS's
viewpoint, that VCPU has just stopped, possibly for a very long time --
easily long enough for all the other VCPUs to pass through quiescent
states.  And the guest OS is ignoring that VCPU, so a too-short grace
period could easily happen in this scenario.

							Thanx, Paul

> But if two probes are are added to this tracepoint, then we can hit the
> race. And it is possible to trigger with only one probe on removal.
> 
> When adding or removing a tracepoint, the array (the one that
> it_func_ptr points to) is updated by allocating a new array, copying the
> old array plus or minus the tracpoint being added or removed, setting
> the tp->funcs to the new array, and then it calls call_rcu_sched() to
> free it.
> 
> Now for the bug to hit, something had to be coming in or out of idle,
> and jumping to this code. Between the time it got the it_func_ptr to the
> time it accessed any of that pointer's data in the loop, the tp->func
> had to be updated to the new array, and then all CPUs would have passed
> a schedule point (except the rcu_idle CPUs).
> 
> On uniprocessor, this is not an issue, but on SMP, it is possible that
> with two CPUs the first being in rcu_idle may be ignored, and the second
> would have been adding the tracepoint and then going directly to freeing
> the code. But as tracepoints are very low weight, it is most likely that
> the tracepoints will finish before the first could even free the memory.
> 
> But the chance does exist. As the chance of me winning the lottery,
> becoming President of the United States, and curing cancer also exists!
> 
> ;-)
> 
> -- Steve
> 
>
Steven Rostedt Feb. 6, 2012, 7:32 p.m. UTC | #29
On Sat, 2012-02-04 at 06:21 -0800, Paul E. McKenney wrote:
> On Fri, Feb 03, 2012 at 02:36:27PM -0500, Steven Rostedt wrote:

> > Note, this race is even made harder to hit, because due to the static
> > branch that controls whether this gets called, will be off if no
> > tracepoints are attached. So the race can only happen after at least one
> > tracepoint is active.
> 
> I agree that this race is hard to hit when running Linux on bare metal.
> 
> But consider a Linux kernel running as a guest OS.  Then the host might
> preempt the guest in the middle of a tracepoint.  Then from the guest OS's
> viewpoint, that VCPU has just stopped, possibly for a very long time --
> easily long enough for all the other VCPUs to pass through quiescent
> states.  And the guest OS is ignoring that VCPU, so a too-short grace
> period could easily happen in this scenario.

But there's one thing that you forget. The race only happens on adding
or removing of the tracepoint, which requires human intervention.

That is, this gap needs to exist when a human starts or stops tracing on
the guest. For the race to occur, the one guest CPU has to preempt at
that exact location, and then be busy doing other things as a user on
the guest enables or disables tracing. For the enabled part, something
else had to already be tracing that same tracepoint (which seldom
happens, and only if the users chooses to (root user)). Otherwise, the
race only exists on removing the tracepoint.

OK, I'll update statement for running this on a guest. It is less likely
to trigger than me winning the lottery, becoming president of the United
states, *or* curing cancer. Not all together, just one of the above ;-)

-- Steve
Paul E. McKenney Feb. 6, 2012, 11:38 p.m. UTC | #30
On Mon, Feb 06, 2012 at 04:18:33PM -0500, Steven Rostedt wrote:
> As I have said, I may find a better solution than to create a
> TRACE_EVENT_IDLE(), and I believe I have :-)
> 
> A added a new static inline function that lets *any* tracepoint be used
> inside a rcu_idle_exit() section. And this also solves the problem where
> the same tracepoint may be used inside a rcu_idle_exit() section as well
> as outside of one.
> 
> I added a new tracepoint function with a "_rcuidle" extension. All
> tracepoints can be used with either the normal "trace_foobar()"
> function, or the "trace_foobar_rcuidle()" function when inside a
> rcu_idle_exit() section.
> 
> Ideally, this patch would be broken up into two commits. The first would
> change the tracepoint.h to introduce the new trace_foobar_rcuidle()
> static inline, and the second patch would change the power tracepoints
> to use this tracepoint function. For the RFC, I'm keeping it as a single
> patch.
> 
> Another nice aspect about this patch is that "static inline"s are not
> compiled into text when not used. So only the tracepoints that actually
> use the _rcuidle() version will have them defined in the actual text
> that is booted.
> 
> -- Steve

Aside from the "pre;" below needing to precede rcu_read_lock_sched_notrace()
as Josh noted, this looks reasonable to me.

							Thanx, Paul

> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> 
> Index: linux-trace.git/include/linux/tracepoint.h
> ===================================================================
> --- linux-trace.git.orig/include/linux/tracepoint.h
> +++ linux-trace.git/include/linux/tracepoint.h
> @@ -114,7 +114,7 @@ static inline void tracepoint_synchroniz
>   * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
>   * "void *data", where as the DECLARE_TRACE() will pass in "void *data, proto".
>   */
> -#define __DO_TRACE(tp, proto, args, cond)				\
> +#define __DO_TRACE(tp, proto, args, cond, pre, post)			\
>  	do {								\
>  		struct tracepoint_func *it_func_ptr;			\
>  		void *it_func;						\
> @@ -123,6 +123,7 @@ static inline void tracepoint_synchroniz
>  		if (!(cond))						\
>  			return;						\
>  		rcu_read_lock_sched_notrace();				\
> +		pre;							\
>  		it_func_ptr = rcu_dereference_sched((tp)->funcs);	\
>  		if (it_func_ptr) {					\
>  			do {						\
> @@ -132,6 +133,7 @@ static inline void tracepoint_synchroniz
>  			} while ((++it_func_ptr)->func);		\
>  		}							\
>  		rcu_read_unlock_sched_notrace();			\
> +		post;							\
>  	} while (0)
> 
>  /*
> @@ -139,7 +141,7 @@ static inline void tracepoint_synchroniz
>   * not add unwanted padding between the beginning of the section and the
>   * structure. Force alignment to the same alignment as the section start.
>   */
> -#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args)	\
> +#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
>  	extern struct tracepoint __tracepoint_##name;			\
>  	static inline void trace_##name(proto)				\
>  	{								\
> @@ -147,7 +149,17 @@ static inline void tracepoint_synchroniz
>  			__DO_TRACE(&__tracepoint_##name,		\
>  				TP_PROTO(data_proto),			\
>  				TP_ARGS(data_args),			\
> -				TP_CONDITION(cond));			\
> +				TP_CONDITION(cond),,);			\
> +	}								\
> +	static inline void trace_##name##_rcuidle(proto)		\
> +	{								\
> +		if (static_branch(&__tracepoint_##name.key))		\
> +			__DO_TRACE(&__tracepoint_##name,		\
> +				TP_PROTO(data_proto),			\
> +				TP_ARGS(data_args),			\
> +				TP_CONDITION(cond),			\
> +				rcu_idle_exit(),			\
> +				rcu_idle_enter());			\
>  	}								\
>  	static inline int						\
>  	register_trace_##name(void (*probe)(data_proto), void *data)	\
> @@ -190,7 +202,7 @@ static inline void tracepoint_synchroniz
>  	EXPORT_SYMBOL(__tracepoint_##name)
> 
>  #else /* !CONFIG_TRACEPOINTS */
> -#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args)	\
> +#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
>  	static inline void trace_##name(proto)				\
>  	{ }								\
>  	static inline int						\
> Index: linux-trace.git/arch/x86/kernel/process.c
> ===================================================================
> --- linux-trace.git.orig/arch/x86/kernel/process.c
> +++ linux-trace.git/arch/x86/kernel/process.c
> @@ -377,8 +377,8 @@ static inline int hlt_use_halt(void)
>  void default_idle(void)
>  {
>  	if (hlt_use_halt()) {
> -		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
> -		trace_cpu_idle(1, smp_processor_id());
> +		trace_power_start_rcuidle(POWER_CSTATE, 1, smp_processor_id());
> +		trace_cpu_idle_rcuidle(1, smp_processor_id());
>  		current_thread_info()->status &= ~TS_POLLING;
>  		/*
>  		 * TS_POLLING-cleared state must be visible before we
> @@ -391,7 +391,7 @@ void default_idle(void)
>  		else
>  			local_irq_enable();
>  		current_thread_info()->status |= TS_POLLING;
> -		trace_power_end(smp_processor_id());
> +		trace_power_end_rcuidle(smp_processor_id());
>  		trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
>  	} else {
>  		local_irq_enable();
> @@ -450,8 +450,8 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
>  static void mwait_idle(void)
>  {
>  	if (!need_resched()) {
> -		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
> -		trace_cpu_idle(1, smp_processor_id());
> +		trace_power_start_rcuidle(POWER_CSTATE, 1, smp_processor_id());
> +		trace_cpu_idle_rcuidle(1, smp_processor_id());
>  		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
>  			clflush((void *)&current_thread_info()->flags);
> 
> @@ -461,8 +461,8 @@ static void mwait_idle(void)
>  			__sti_mwait(0, 0);
>  		else
>  			local_irq_enable();
> -		trace_power_end(smp_processor_id());
> -		trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
> +		trace_power_end_rcuidle(smp_processor_id());
> +		trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
>  	} else
>  		local_irq_enable();
>  }
> @@ -474,13 +474,13 @@ static void mwait_idle(void)
>   */
>  static void poll_idle(void)
>  {
> -	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
> -	trace_cpu_idle(0, smp_processor_id());
> +	trace_power_start_rcuidle(POWER_CSTATE, 0, smp_processor_id());
> +	trace_cpu_idle_rcuidle(0, smp_processor_id());
>  	local_irq_enable();
>  	while (!need_resched())
>  		cpu_relax();
> -	trace_power_end(smp_processor_id());
> -	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
> +	trace_power_end_rcuidle(smp_processor_id());
> +	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
>  }
> 
>  /*
> 
>
Steven Rostedt Feb. 7, 2012, 12:36 a.m. UTC | #31
On Mon, 2012-02-06 at 14:05 -0800, Josh Triplett wrote:
> >   */
> > -#define __DO_TRACE(tp, proto, args, cond)				\
> > +#define __DO_TRACE(tp, proto, args, cond, pre, post)			\
> >  	do {								\
> >  		struct tracepoint_func *it_func_ptr;			\
> >  		void *it_func;						\
> > @@ -123,6 +123,7 @@ static inline void tracepoint_synchroniz
> >  		if (!(cond))						\
> >  			return;						\
> >  		rcu_read_lock_sched_notrace();				\
> > +		pre;							\
> >  		it_func_ptr = rcu_dereference_sched((tp)->funcs);	\
> 
> This results in calling rcu_idle_exit after rcu_read_lock; it needs to
> occur before.

Bah! Good catch. The fumes of my furnace must have been sending out some
strong narcotics. Yeah yeah, that's the reason for this mistake, and
I'll stand by it!

-- Steve
Steven Rostedt Feb. 7, 2012, 12:32 p.m. UTC | #32
On Mon, 2012-02-06 at 15:38 -0800, Paul E. McKenney wrote:

> Aside from the "pre;" below needing to precede rcu_read_lock_sched_notrace()
> as Josh noted, this looks reasonable to me.
> 

Does this mean I can add an Acked-by from you and Josh? Obviously with
the post "pre" change.

-- Steve
Paul E. McKenney Feb. 7, 2012, 2:11 p.m. UTC | #33
On Tue, Feb 07, 2012 at 07:32:56AM -0500, Steven Rostedt wrote:
> On Mon, 2012-02-06 at 15:38 -0800, Paul E. McKenney wrote:
> 
> > Aside from the "pre;" below needing to precede rcu_read_lock_sched_notrace()
> > as Josh noted, this looks reasonable to me.
> > 
> 
> Does this mean I can add an Acked-by from you and Josh? Obviously with
> the post "pre" change.

With the "pre" change,

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

							Thanx, Paul
Josh Triplett Feb. 7, 2012, 2:40 p.m. UTC | #34
On Tue, Feb 07, 2012 at 07:32:56AM -0500, Steven Rostedt wrote:
> On Mon, 2012-02-06 at 15:38 -0800, Paul E. McKenney wrote:
> 
> > Aside from the "pre;" below needing to precede rcu_read_lock_sched_notrace()
> > as Josh noted, this looks reasonable to me.
> > 
> 
> Does this mean I can add an Acked-by from you and Josh? Obviously with
> the post "pre" change.

With the pre change:

Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Frédéric Weisbecker Feb. 8, 2012, 1:57 p.m. UTC | #35
On Tue, Feb 07, 2012 at 06:11:42AM -0800, Paul E. McKenney wrote:
> On Tue, Feb 07, 2012 at 07:32:56AM -0500, Steven Rostedt wrote:
> > On Mon, 2012-02-06 at 15:38 -0800, Paul E. McKenney wrote:
> > 
> > > Aside from the "pre;" below needing to precede rcu_read_lock_sched_notrace()
> > > as Josh noted, this looks reasonable to me.
> > > 
> > 
> > Does this mean I can add an Acked-by from you and Josh? Obviously with
> > the post "pre" change.
> 
> With the "pre" change,
> 
> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Same for me! Good idea.

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
diff mbox

Patch

diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 971d65c..8241df7 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -207,7 +207,6 @@  void cpu_idle(void)
 	/* endless idle loop with no priority at all */
 	while (1) {
 		tick_nohz_idle_enter();
-		rcu_idle_enter();
 		leds_event(led_idle_start);
 		while (!need_resched()) {
 #ifdef CONFIG_HOTPLUG_CPU
@@ -237,7 +236,6 @@  void cpu_idle(void)
 			}
 		}
 		leds_event(led_idle_end);
-		rcu_idle_exit();
 		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
diff --git a/arch/arm/mach-at91/cpuidle.c b/arch/arm/mach-at91/cpuidle.c
index a851e6c..8feff6b 100644
--- a/arch/arm/mach-at91/cpuidle.c
+++ b/arch/arm/mach-at91/cpuidle.c
@@ -20,6 +20,7 @@ 
 #include <asm/proc-fns.h>
 #include <linux/io.h>
 #include <linux/export.h>
+#include <linux/rcupdate.h>
 
 #include "pm.h"
 
@@ -43,6 +44,7 @@  static int at91_enter_idle(struct cpuidle_device *dev,
 
 	local_irq_disable();
 	do_gettimeofday(&before);
+	rcu_idle_enter();
 	if (index == 0)
 		/* Wait for interrupt state */
 		cpu_do_idle();
@@ -53,6 +55,7 @@  static int at91_enter_idle(struct cpuidle_device *dev,
 		cpu_do_idle();
 		sdram_selfrefresh_disable(saved_lpr);
 	}
+	rcu_idle_exit();
 	do_gettimeofday(&after);
 	local_irq_enable();
 	idle_time = (after.tv_sec - before.tv_sec) * USEC_PER_SEC +
diff --git a/arch/arm/mach-davinci/cpuidle.c b/arch/arm/mach-davinci/cpuidle.c
index a30c7c5..6594b7e 100644
--- a/arch/arm/mach-davinci/cpuidle.c
+++ b/arch/arm/mach-davinci/cpuidle.c
@@ -17,6 +17,7 @@ 
 #include <linux/cpuidle.h>
 #include <linux/io.h>
 #include <linux/export.h>
+#include <linux/rcupdate.h>
 #include <asm/proc-fns.h>
 
 #include <mach/cpuidle.h>
@@ -90,12 +91,14 @@  static int davinci_enter_idle(struct cpuidle_device *dev,
 	local_irq_disable();
 	do_gettimeofday(&before);
 
+	rcu_idle_enter();
 	if (ops && ops->enter)
 		ops->enter(ops->flags);
 	/* Wait for interrupt state */
 	cpu_do_idle();
 	if (ops && ops->exit)
 		ops->exit(ops->flags);
+	rcu_idle_exit();
 
 	do_gettimeofday(&after);
 	local_irq_enable();
diff --git a/arch/arm/mach-exynos/common.c b/arch/arm/mach-exynos/common.c
index c59e188..1e5844a 100644
--- a/arch/arm/mach-exynos/common.c
+++ b/arch/arm/mach-exynos/common.c
@@ -203,8 +203,10 @@  static struct map_desc exynos4_iodesc1[] __initdata = {
 
 static void exynos_idle(void)
 {
+	rcu_idle_enter();
 	if (!need_resched())
 		cpu_do_idle();
+	rcu_idle_exit();
 
 	local_irq_enable();
 }
diff --git a/arch/arm/mach-highbank/pm.c b/arch/arm/mach-highbank/pm.c
index 33b3beb..5ecc4bc 100644
--- a/arch/arm/mach-highbank/pm.c
+++ b/arch/arm/mach-highbank/pm.c
@@ -17,6 +17,7 @@ 
 #include <linux/init.h>
 #include <linux/io.h>
 #include <linux/suspend.h>
+#include <linux/rcupdate.h>
 
 #include <asm/proc-fns.h>
 #include <asm/smp_scu.h>
@@ -33,12 +34,23 @@  static int highbank_suspend_finish(unsigned long val)
 
 static int highbank_pm_enter(suspend_state_t state)
 {
+	/*
+	 * Some of the functions preceding the cpu_suspend() can
+	 * invoke RCU, but only in case of failure to disable preemption
+	 * or preempt_disable() misnesting.  Assume that these errors
+	 * are not committed in the following code, because RCU just
+	 * doesn't want to run on a CPU that has its caches disabled.
+	 */
+	rcu_idle_enter();
+
 	hignbank_set_pwr_suspend();
 	highbank_set_cpu_jump(0, cpu_resume);
 
 	scu_power_mode(scu_base_addr, SCU_PM_POWEROFF);
 	cpu_suspend(0, highbank_suspend_finish);
 
+	rcu_idle_exit();
+
 	return 0;
 }
 
diff --git a/arch/arm/mach-imx/mm-imx3.c b/arch/arm/mach-imx/mm-imx3.c
index 31807d2..0945b23 100644
--- a/arch/arm/mach-imx/mm-imx3.c
+++ b/arch/arm/mach-imx/mm-imx3.c
@@ -19,6 +19,7 @@ 
 #include <linux/mm.h>
 #include <linux/init.h>
 #include <linux/err.h>
+#include <linux/rcupdate.h>
 
 #include <asm/pgtable.h>
 #include <asm/hardware/cache-l2x0.h>
@@ -34,6 +35,7 @@  static void imx3_idle(void)
 {
 	unsigned long reg = 0;
 
+	rcu_idle_enter();
 	if (!need_resched())
 		__asm__ __volatile__(
 			/* disable I and D cache */
@@ -58,6 +60,7 @@  static void imx3_idle(void)
 			"orr %0, %0, #0x00000004\n"
 			"mcr p15, 0, %0, c1, c0, 0\n"
 			: "=r" (reg));
+	rcu_idle_exit();
 	local_irq_enable();
 }
 
diff --git a/arch/arm/mach-imx/pm-imx27.c b/arch/arm/mach-imx/pm-imx27.c
index e455d2f..c4adc05 100644
--- a/arch/arm/mach-imx/pm-imx27.c
+++ b/arch/arm/mach-imx/pm-imx27.c
@@ -10,12 +10,14 @@ 
 #include <linux/kernel.h>
 #include <linux/suspend.h>
 #include <linux/io.h>
+#include <linux/rcupdate.h>
 #include <mach/system.h>
 #include <mach/hardware.h>
 
 static int mx27_suspend_enter(suspend_state_t state)
 {
 	u32 cscr;
+	rcu_idle_enter();
 	switch (state) {
 	case PM_SUSPEND_MEM:
 		/* Clear MPEN and SPEN to disable MPLL/SPLL */
@@ -24,9 +26,11 @@  static int mx27_suspend_enter(suspend_state_t state)
 		__raw_writel(cscr, MX27_IO_ADDRESS(MX27_CCM_BASE_ADDR));
 		/* Executes WFI */
 		arch_idle();
+		rcu_idle_exit();
 		break;
 
 	default:
+		rcu_idle_exit();
 		return -EINVAL;
 	}
 	return 0;
diff --git a/arch/arm/mach-imx/pm-imx6q.c b/arch/arm/mach-imx/pm-imx6q.c
index f7b0c2b..b9c2589 100644
--- a/arch/arm/mach-imx/pm-imx6q.c
+++ b/arch/arm/mach-imx/pm-imx6q.c
@@ -14,6 +14,7 @@ 
 #include <linux/io.h>
 #include <linux/of.h>
 #include <linux/suspend.h>
+#include <linux/rcupdate.h>
 #include <asm/cacheflush.h>
 #include <asm/proc-fns.h>
 #include <asm/suspend.h>
@@ -31,6 +32,7 @@  static int imx6q_suspend_finish(unsigned long val)
 
 static int imx6q_pm_enter(suspend_state_t state)
 {
+	rcu_idle_enter();
 	switch (state) {
 	case PM_SUSPEND_MEM:
 		imx6q_set_lpm(STOP_POWER_OFF);
@@ -40,8 +42,10 @@  static int imx6q_pm_enter(suspend_state_t state)
 		cpu_suspend(0, imx6q_suspend_finish);
 		imx_smp_prepare();
 		imx_gpc_post_resume();
+		rcu_idle_exit();
 		break;
 	default:
+		rcu_idle_exit();
 		return -EINVAL;
 	}
 
diff --git a/arch/arm/mach-kirkwood/cpuidle.c b/arch/arm/mach-kirkwood/cpuidle.c
index 7088180..de8e317 100644
--- a/arch/arm/mach-kirkwood/cpuidle.c
+++ b/arch/arm/mach-kirkwood/cpuidle.c
@@ -19,6 +19,7 @@ 
 #include <linux/cpuidle.h>
 #include <linux/io.h>
 #include <linux/export.h>
+#include <linux/rcupdate.h>
 #include <asm/proc-fns.h>
 #include <mach/kirkwood.h>
 
@@ -41,6 +42,7 @@  static int kirkwood_enter_idle(struct cpuidle_device *dev,
 
 	local_irq_disable();
 	do_gettimeofday(&before);
+	rcu_idle_enter();
 	if (index == 0)
 		/* Wait for interrupt state */
 		cpu_do_idle();
@@ -55,6 +57,7 @@  static int kirkwood_enter_idle(struct cpuidle_device *dev,
 		writel(0x7, DDR_OPERATION_BASE);
 		cpu_do_idle();
 	}
+	rcu_idle_exit();
 	do_gettimeofday(&after);
 	local_irq_enable();
 	idle_time = (after.tv_sec - before.tv_sec) * USEC_PER_SEC +
diff --git a/arch/arm/mach-mx5/mm.c b/arch/arm/mach-mx5/mm.c
index bc17dfe..d919648 100644
--- a/arch/arm/mach-mx5/mm.c
+++ b/arch/arm/mach-mx5/mm.c
@@ -14,6 +14,7 @@ 
 #include <linux/mm.h>
 #include <linux/init.h>
 #include <linux/clk.h>
+#include <linux/rcupdate.h>
 
 #include <asm/mach/map.h>
 
@@ -37,7 +38,9 @@  static void imx5_idle(void)
 		mx5_cpu_lp_set(WAIT_UNCLOCKED_POWER_OFF);
 		if (tzic_enable_wake())
 			goto err1;
+		rcu_idle_enter();  /* No tracing until rcu_idle_exit(). */
 		cpu_do_idle();
+		rcu_idle_exit();
 err1:
 		clk_disable(gpc_dvfs_clk);
 	}
diff --git a/arch/arm/mach-mx5/pm-imx5.c b/arch/arm/mach-mx5/pm-imx5.c
index 98052fc..b4adf42 100644
--- a/arch/arm/mach-mx5/pm-imx5.c
+++ b/arch/arm/mach-mx5/pm-imx5.c
@@ -12,6 +12,7 @@ 
 #include <linux/clk.h>
 #include <linux/io.h>
 #include <linux/err.h>
+#include <linux/rcupdate.h>
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
 #include <mach/common.h>
@@ -27,6 +28,7 @@  static int mx5_suspend_prepare(void)
 
 static int mx5_suspend_enter(suspend_state_t state)
 {
+	rcu_idle_enter();
 	switch (state) {
 	case PM_SUSPEND_MEM:
 		mx5_cpu_lp_set(STOP_POWER_OFF);
@@ -47,6 +49,7 @@  static int mx5_suspend_enter(suspend_state_t state)
 		__raw_writel(0, MXC_SRPG_EMPGC1_SRPGCR);
 	}
 	cpu_do_idle();
+	rcu_idle_exit();
 	return 0;
 }
 
diff --git a/arch/arm/mach-mxs/pm.c b/arch/arm/mach-mxs/pm.c
index fb042da..36a3a57 100644
--- a/arch/arm/mach-mxs/pm.c
+++ b/arch/arm/mach-mxs/pm.c
@@ -15,16 +15,20 @@ 
 #include <linux/kernel.h>
 #include <linux/suspend.h>
 #include <linux/io.h>
+#include <linux/rcupdate.h>
 #include <mach/system.h>
 
 static int mxs_suspend_enter(suspend_state_t state)
 {
+	rcu_idle_enter();
 	switch (state) {
 	case PM_SUSPEND_MEM:
 		arch_idle();
+		rcu_idle_exit();
 		break;
 
 	default:
+		rcu_idle_exit();
 		return -EINVAL;
 	}
 	return 0;
diff --git a/arch/arm/mach-omap1/pm.c b/arch/arm/mach-omap1/pm.c
index 89ea20c..c661eba 100644
--- a/arch/arm/mach-omap1/pm.c
+++ b/arch/arm/mach-omap1/pm.c
@@ -108,11 +108,13 @@  void omap1_pm_idle(void)
 	__u32 use_idlect1 = arm_idlect1_mask;
 	int do_sleep = 0;
 
+	rcu_idle_enter();
 	local_irq_disable();
 	local_fiq_disable();
 	if (need_resched()) {
 		local_fiq_enable();
 		local_irq_enable();
+		rcu_idle_exit();
 		return;
 	}
 
@@ -158,6 +160,7 @@  void omap1_pm_idle(void)
 
 		local_fiq_enable();
 		local_irq_enable();
+		rcu_idle_exit();
 		return;
 	}
 	omap_sram_suspend(omap_readl(ARM_IDLECT1),
@@ -165,6 +168,7 @@  void omap1_pm_idle(void)
 
 	local_fiq_enable();
 	local_irq_enable();
+	rcu_idle_exit();
 }
 
 /*
diff --git a/arch/arm/mach-omap2/pm24xx.c b/arch/arm/mach-omap2/pm24xx.c
index b8822f8..2139e9d 100644
--- a/arch/arm/mach-omap2/pm24xx.c
+++ b/arch/arm/mach-omap2/pm24xx.c
@@ -232,6 +232,7 @@  static int omap2_can_sleep(void)
 
 static void omap2_pm_idle(void)
 {
+	rcu_idle_enter();
 	local_irq_disable();
 	local_fiq_disable();
 
@@ -250,6 +251,7 @@  static void omap2_pm_idle(void)
 out:
 	local_fiq_enable();
 	local_irq_enable();
+	rcu_idle_exit();
 }
 
 #ifdef CONFIG_SUSPEND
diff --git a/arch/arm/mach-omap2/pm34xx.c b/arch/arm/mach-omap2/pm34xx.c
index fc69875..58a8689 100644
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -427,7 +427,9 @@  static void omap3_pm_idle(void)
 	trace_power_start(POWER_CSTATE, 1, smp_processor_id());
 	trace_cpu_idle(1, smp_processor_id());
 
+	rcu_idle_enter();
 	omap_sram_idle();
+	rcu_idle_exit();
 
 	trace_power_end(smp_processor_id());
 	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
diff --git a/arch/arm/mach-omap2/pm44xx.c b/arch/arm/mach-omap2/pm44xx.c
index c264ef7..038796c 100644
--- a/arch/arm/mach-omap2/pm44xx.c
+++ b/arch/arm/mach-omap2/pm44xx.c
@@ -16,6 +16,7 @@ 
 #include <linux/list.h>
 #include <linux/err.h>
 #include <linux/slab.h>
+#include <linux/rcupdate.h>
 
 #include "common.h"
 #include "clockdomain.h"
@@ -178,6 +179,7 @@  static int __init pwrdms_setup(struct powerdomain *pwrdm, void *unused)
  */
 static void omap_default_idle(void)
 {
+	rcu_idle_enter();
 	local_irq_disable();
 	local_fiq_disable();
 
@@ -185,6 +187,7 @@  static void omap_default_idle(void)
 
 	local_fiq_enable();
 	local_irq_enable();
+	rcu_idle_exit();
 }
 
 /**
diff --git a/arch/arm/mach-pnx4008/pm.c b/arch/arm/mach-pnx4008/pm.c
index f3e60a0..0b8703f 100644
--- a/arch/arm/mach-pnx4008/pm.c
+++ b/arch/arm/mach-pnx4008/pm.c
@@ -102,6 +102,7 @@  static inline void pnx4008_suspend(void)
 
 static int pnx4008_pm_enter(suspend_state_t state)
 {
+	rcu_idle_enter();
 	switch (state) {
 	case PM_SUSPEND_STANDBY:
 		pnx4008_standby();
@@ -110,6 +111,7 @@  static int pnx4008_pm_enter(suspend_state_t state)
 		pnx4008_suspend();
 		break;
 	}
+	rcu_idle_exit();
 	return 0;
 }
 
diff --git a/arch/arm/mach-prima2/pm.c b/arch/arm/mach-prima2/pm.c
index 26ebb57..1e35c8a 100644
--- a/arch/arm/mach-prima2/pm.c
+++ b/arch/arm/mach-prima2/pm.c
@@ -15,6 +15,7 @@ 
 #include <linux/of_device.h>
 #include <linux/of_platform.h>
 #include <linux/io.h>
+#include <linux/rcupdate.h>
 #include <linux/rtc/sirfsoc_rtciobrg.h>
 #include <asm/suspend.h>
 #include <asm/hardware/cache-l2x0.h>
@@ -64,6 +65,7 @@  static int sirfsoc_pre_suspend_power_off(void)
 
 static int sirfsoc_pm_enter(suspend_state_t state)
 {
+	rcu_idle_enter(); /* No good place for this: locking->lockdep->RCU. */
 	switch (state) {
 	case PM_SUSPEND_MEM:
 		sirfsoc_pre_suspend_power_off();
@@ -73,8 +75,10 @@  static int sirfsoc_pm_enter(suspend_state_t state)
 		/* go zzz */
 		cpu_suspend(0, sirfsoc_finish_suspend);
 		outer_resume();
+		rcu_idle_exit();
 		break;
 	default:
+		rcu_idle_exit();
 		return -EINVAL;
 	}
 	return 0;
diff --git a/arch/arm/mach-s5p64x0/common.c b/arch/arm/mach-s5p64x0/common.c
index 52b89a3..c59a7f2 100644
--- a/arch/arm/mach-s5p64x0/common.c
+++ b/arch/arm/mach-s5p64x0/common.c
@@ -146,6 +146,7 @@  static void s5p64x0_idle(void)
 {
 	unsigned long val;
 
+	rcu_idle_enter();
 	if (!need_resched()) {
 		val = __raw_readl(S5P64X0_PWR_CFG);
 		val &= ~(0x3 << 5);
@@ -154,6 +155,7 @@  static void s5p64x0_idle(void)
 
 		cpu_do_idle();
 	}
+	rcu_idle_exit();
 	local_irq_enable();
 }
 
diff --git a/arch/arm/mach-s5pc100/common.c b/arch/arm/mach-s5pc100/common.c
index c909573..7bf02ff 100644
--- a/arch/arm/mach-s5pc100/common.c
+++ b/arch/arm/mach-s5pc100/common.c
@@ -131,9 +131,11 @@  static struct map_desc s5pc100_iodesc[] __initdata = {
 
 static void s5pc100_idle(void)
 {
+	rcu_idle_enter();
 	if (!need_resched())
 		cpu_do_idle();
 
+	rcu_idle_exit();
 	local_irq_enable();
 }
 
diff --git a/arch/arm/mach-s5pv210/common.c b/arch/arm/mach-s5pv210/common.c
index 9c1bcdc..ccd6b4d 100644
--- a/arch/arm/mach-s5pv210/common.c
+++ b/arch/arm/mach-s5pv210/common.c
@@ -144,9 +144,11 @@  static struct map_desc s5pv210_iodesc[] __initdata = {
 
 static void s5pv210_idle(void)
 {
+	rcu_idle_enter();
 	if (!need_resched())
 		cpu_do_idle();
 
+	rcu_idle_exit();
 	local_irq_enable();
 }
 
diff --git a/arch/arm/mach-shmobile/cpuidle.c b/arch/arm/mach-shmobile/cpuidle.c
index 1b23342..651e92b 100644
--- a/arch/arm/mach-shmobile/cpuidle.c
+++ b/arch/arm/mach-shmobile/cpuidle.c
@@ -13,6 +13,7 @@ 
 #include <linux/suspend.h>
 #include <linux/module.h>
 #include <linux/err.h>
+#include <linux/rcupdate.h>
 #include <asm/system.h>
 #include <asm/io.h>
 
@@ -33,6 +34,7 @@  static int shmobile_cpuidle_enter(struct cpuidle_device *dev,
 
 	before = ktime_get();
 
+	rcu_idle_enter();
 	local_irq_disable();
 	local_fiq_disable();
 
@@ -40,6 +42,7 @@  static int shmobile_cpuidle_enter(struct cpuidle_device *dev,
 
 	local_irq_enable();
 	local_fiq_enable();
+	rcu_idle_exit();
 
 	after = ktime_get();
 	dev->last_residency = ktime_to_ns(ktime_sub(after, before)) >> 10;
diff --git a/arch/arm/mach-shmobile/pm-sh7372.c b/arch/arm/mach-shmobile/pm-sh7372.c
index fcf8b17..e11507e 100644
--- a/arch/arm/mach-shmobile/pm-sh7372.c
+++ b/arch/arm/mach-shmobile/pm-sh7372.c
@@ -21,6 +21,7 @@ 
 #include <linux/irq.h>
 #include <linux/bitrev.h>
 #include <linux/console.h>
+#include <linux/rcupdate.h>
 #include <asm/system.h>
 #include <asm/io.h>
 #include <asm/tlbflush.h>
@@ -520,17 +521,24 @@  static int sh7372_enter_suspend(suspend_state_t suspend_state)
 		    sh7372_a4s.genpd.status == GPD_STATE_POWER_OFF) {
 			/* enter A4S sleep with PLLC0 off */
 			pr_debug("entering A4S\n");
+			rcu_idle_enter();
 			sh7372_enter_a4s_common(0);
+			rcu_idle_exit();
 		} else {
 			/* enter A3SM sleep with PLLC0 off */
 			pr_debug("entering A3SM\n");
+			rcu_idle_enter();
 			sh7372_enter_a3sm_common(0);
+			rcu_idle_exit();
 		}
 	} else {
 		/* default to Core Standby that supports all wakeup sources */
 		pr_debug("entering Core Standby\n");
+		rcu_idle_enter();
 		sh7372_enter_core_standby();
+		rcu_idle_exit();
 	}
+
 	return 0;
 }