diff mbox

[v8,2/8] : cpuidle: implement a list based approach to register a set of idle routines.

Message ID 20091008095027.GC20595@linux.vnet.ibm.com (mailing list archive)
State Superseded
Delegated to: Benjamin Herrenschmidt
Headers show

Commit Message

Arun Bharadwaj Oct. 8, 2009, 9:50 a.m. UTC
* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-10-08 15:18:28]:

Implement a list based registering mechanism for architectures which
have multiple sets of idle routines which are to be registered.

Currently, in x86 it is done by merely setting pm_idle = idle_routine
and managing this pm_idle pointer is messy.

To give an example of how this mechanism works:
In x86, initially, idle routine is selected from the set of poll/mwait/
c1e/default idle loops. So the selected idle loop is registered in cpuidle
as one idle state cpuidle devices. Once ACPI comes up, it registers
another set of idle states on top of this state. Again, suppose a module
registers another set of idle loops, it is added to this list.

This provides a clean way of registering and unregistering idle state
routines.

In the current implementation, pm_idle is set as the current idle routine
being used and the old idle routine has to be maintained and when a module
registers/unregisters an idle routine, confusion arises.


Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle.c |   54 ++++++++++++++++++++++++++++++++++++++++------
 include/linux/cpuidle.h   |    1 
 2 files changed, 48 insertions(+), 7 deletions(-)

Comments

Peter Zijlstra Oct. 8, 2009, 10:36 a.m. UTC | #1
On Thu, 2009-10-08 at 15:20 +0530, Arun R Bharadwaj wrote:
> * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-10-08 15:18:28]:
> 
> Implement a list based registering mechanism for architectures which
> have multiple sets of idle routines which are to be registered.
> 
> Currently, in x86 it is done by merely setting pm_idle = idle_routine
> and managing this pm_idle pointer is messy.
> 
> To give an example of how this mechanism works:
> In x86, initially, idle routine is selected from the set of poll/mwait/
> c1e/default idle loops. So the selected idle loop is registered in cpuidle
> as one idle state cpuidle devices. Once ACPI comes up, it registers
> another set of idle states on top of this state. Again, suppose a module
> registers another set of idle loops, it is added to this list.
> 
> This provides a clean way of registering and unregistering idle state
> routines.

So cpuidle didn't already have a list of idle functions it takes an
appropriate one from?

Then what does this governor do?

Also, does this imply the governor doesn't consider these idle routines?
Arun Bharadwaj Oct. 8, 2009, 10:42 a.m. UTC | #2
* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-10-08 12:36:02]:

> On Thu, 2009-10-08 at 15:20 +0530, Arun R Bharadwaj wrote:
> > * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-10-08 15:18:28]:
> > 
> > Implement a list based registering mechanism for architectures which
> > have multiple sets of idle routines which are to be registered.
> > 
> > Currently, in x86 it is done by merely setting pm_idle = idle_routine
> > and managing this pm_idle pointer is messy.
> > 
> > To give an example of how this mechanism works:
> > In x86, initially, idle routine is selected from the set of poll/mwait/
> > c1e/default idle loops. So the selected idle loop is registered in cpuidle
> > as one idle state cpuidle devices. Once ACPI comes up, it registers
> > another set of idle states on top of this state. Again, suppose a module
> > registers another set of idle loops, it is added to this list.
> > 
> > This provides a clean way of registering and unregistering idle state
> > routines.
> 
> So cpuidle didn't already have a list of idle functions it takes an
> appropriate one from?
> 

No.. As of now, cpuidle supported only one _set_ of idle states that
can be registered. So in this one set, it would choose the appropriate
idle state. But this list mechanism(actually a stack) allows for
multiple sets.

This is needed because we have a hierarchy of idle states discovery
in x86. First, select_idle_routine() would select poll/mwait/default/c1e.
It doesn't know of existance of ACPI. Later when ACPI comes up,
it registers a set of routines on top of the earlier set.

> Then what does this governor do?
>

The governor would only select the best idle state available from the
set of states which is at the top of the stack. (In the above case, it
would only consider the states registered by ACPI).

If the top-of-the-stack set of idle states is unregistered, the next
set of states on the stack are considered.

> Also, does this imply the governor doesn't consider these idle routines?
>

As i said above, governor would only consider the idle routines which
are at the top of the stack.

Hope this gave a better idea..
Peter Zijlstra Oct. 8, 2009, 10:50 a.m. UTC | #3
On Thu, 2009-10-08 at 16:12 +0530, Arun R Bharadwaj wrote:
> 
> > So cpuidle didn't already have a list of idle functions it takes an
> > appropriate one from?
> > 
> 
> No.. As of now, cpuidle supported only one _set_ of idle states that
> can be registered. So in this one set, it would choose the appropriate
> idle state. But this list mechanism(actually a stack) allows for
> multiple sets.
> 
> This is needed because we have a hierarchy of idle states discovery
> in x86. First, select_idle_routine() would select poll/mwait/default/c1e.
> It doesn't know of existance of ACPI. Later when ACPI comes up,
> it registers a set of routines on top of the earlier set.
> 
> > Then what does this governor do?
> >
> 
> The governor would only select the best idle state available from the
> set of states which is at the top of the stack. (In the above case, it
> would only consider the states registered by ACPI).
> 
> If the top-of-the-stack set of idle states is unregistered, the next
> set of states on the stack are considered.
> 
> > Also, does this imply the governor doesn't consider these idle routines?
> >
> 
> As i said above, governor would only consider the idle routines which
> are at the top of the stack.
> 
> Hope this gave a better idea..

So does it make sense to have a set of sets?

Why not integrate them all into one set to be ruled by this governor
thing?
Arun Bharadwaj Oct. 8, 2009, 11:01 a.m. UTC | #4
* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-10-08 12:50:33]:

> On Thu, 2009-10-08 at 16:12 +0530, Arun R Bharadwaj wrote:
> > 
> > > So cpuidle didn't already have a list of idle functions it takes an
> > > appropriate one from?
> > > 
> > 
> > No.. As of now, cpuidle supported only one _set_ of idle states that
> > can be registered. So in this one set, it would choose the appropriate
> > idle state. But this list mechanism(actually a stack) allows for
> > multiple sets.
> > 
> > This is needed because we have a hierarchy of idle states discovery
> > in x86. First, select_idle_routine() would select poll/mwait/default/c1e.
> > It doesn't know of existance of ACPI. Later when ACPI comes up,
> > it registers a set of routines on top of the earlier set.
> > 
> > > Then what does this governor do?
> > >
> > 
> > The governor would only select the best idle state available from the
> > set of states which is at the top of the stack. (In the above case, it
> > would only consider the states registered by ACPI).
> > 
> > If the top-of-the-stack set of idle states is unregistered, the next
> > set of states on the stack are considered.
> > 
> > > Also, does this imply the governor doesn't consider these idle routines?
> > >
> > 
> > As i said above, governor would only consider the idle routines which
> > are at the top of the stack.
> > 
> > Hope this gave a better idea..
> 
> So does it make sense to have a set of sets?
> 
> Why not integrate them all into one set to be ruled by this governor
> thing?
> 

Right now there is a clean hierarchy. So breaking that would mean
putting the registration of all idle routines under ACPI. So, if ACPI
fails to come up or if ACPI is not supported, that would lead to
problems. Because if that happens now, we can fallback to the
initially registered set.

Also, if a module wants to register a set of routines later on, that
cant be added to the initially registered set. So i think we need this
set of sets.
Peter Zijlstra Oct. 8, 2009, 11:25 a.m. UTC | #5
On Thu, 2009-10-08 at 16:31 +0530, Arun R Bharadwaj wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-10-08 12:50:33]:
> 
> > On Thu, 2009-10-08 at 16:12 +0530, Arun R Bharadwaj wrote:
> > > 
> > > > So cpuidle didn't already have a list of idle functions it takes an
> > > > appropriate one from?
> > > > 
> > > 
> > > No.. As of now, cpuidle supported only one _set_ of idle states that
> > > can be registered. So in this one set, it would choose the appropriate
> > > idle state. But this list mechanism(actually a stack) allows for
> > > multiple sets.
> > > 
> > > This is needed because we have a hierarchy of idle states discovery
> > > in x86. First, select_idle_routine() would select poll/mwait/default/c1e.
> > > It doesn't know of existance of ACPI. Later when ACPI comes up,
> > > it registers a set of routines on top of the earlier set.
> > > 
> > > > Then what does this governor do?
> > > >
> > > 
> > > The governor would only select the best idle state available from the
> > > set of states which is at the top of the stack. (In the above case, it
> > > would only consider the states registered by ACPI).
> > > 
> > > If the top-of-the-stack set of idle states is unregistered, the next
> > > set of states on the stack are considered.
> > > 
> > > > Also, does this imply the governor doesn't consider these idle routines?
> > > >
> > > 
> > > As i said above, governor would only consider the idle routines which
> > > are at the top of the stack.
> > > 
> > > Hope this gave a better idea..
> > 
> > So does it make sense to have a set of sets?
> > 
> > Why not integrate them all into one set to be ruled by this governor
> > thing?
> > 
> 
> Right now there is a clean hierarchy. So breaking that would mean
> putting the registration of all idle routines under ACPI. 

Uhm, no, it would mean ACPI putting its idle routines on the same level
as all others.

> So, if ACPI
> fails to come up or if ACPI is not supported, that would lead to
> problems.

I think the problem is that ACPI is thinking its special, that should be
rectified, its not.

>  Because if that happens now, we can fallback to the
> initially registered set.

I'm thinking its all daft and we should be having one set of idle
routines, if ACPI fails (a tautology if ever there was one) we simply
wouldn't have its idle routines to pick from.

> Also, if a module wants to register a set of routines later on, that
> cant be added to the initially registered set. So i think we need this
> set of sets.

Sounds like something is wrong alright. If you can register an idle
routine you should be able to unregister it too.

What about making ACPI register its idle routines too, 1 for each C
state, and have the governor make a selection out of the full set?

That also allows you to do away with this default_idle() nonsense and
simply panic the box when there are no registered idle routines when the
box wants to go idle.
Arun Bharadwaj Oct. 8, 2009, 12:01 p.m. UTC | #6
* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-10-08 13:25:10]:

> On Thu, 2009-10-08 at 16:31 +0530, Arun R Bharadwaj wrote:
> > * Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-10-08 12:50:33]:
> > 
> > > On Thu, 2009-10-08 at 16:12 +0530, Arun R Bharadwaj wrote:
> > > > 
> > > > > So cpuidle didn't already have a list of idle functions it takes an
> > > > > appropriate one from?
> > > > > 
> > > > 
> > > > No.. As of now, cpuidle supported only one _set_ of idle states that
> > > > can be registered. So in this one set, it would choose the appropriate
> > > > idle state. But this list mechanism(actually a stack) allows for
> > > > multiple sets.
> > > > 
> > > > This is needed because we have a hierarchy of idle states discovery
> > > > in x86. First, select_idle_routine() would select poll/mwait/default/c1e.
> > > > It doesn't know of existance of ACPI. Later when ACPI comes up,
> > > > it registers a set of routines on top of the earlier set.
> > > > 
> > > > > Then what does this governor do?
> > > > >
> > > > 
> > > > The governor would only select the best idle state available from the
> > > > set of states which is at the top of the stack. (In the above case, it
> > > > would only consider the states registered by ACPI).
> > > > 
> > > > If the top-of-the-stack set of idle states is unregistered, the next
> > > > set of states on the stack are considered.
> > > > 
> > > > > Also, does this imply the governor doesn't consider these idle routines?
> > > > >
> > > > 
> > > > As i said above, governor would only consider the idle routines which
> > > > are at the top of the stack.
> > > > 
> > > > Hope this gave a better idea..
> > > 
> > > So does it make sense to have a set of sets?
> > > 
> > > Why not integrate them all into one set to be ruled by this governor
> > > thing?
> > > 
> > 
> > Right now there is a clean hierarchy. So breaking that would mean
> > putting the registration of all idle routines under ACPI. 
> 
> Uhm, no, it would mean ACPI putting its idle routines on the same level
> as all others.
> 

Putting them all on the same level would mean, we need an
enable/disable routine to enable only the currently active routines.

Also, the way governor works is that, it assumes that idle routines
are indexed in the increasing order of power benefit that can be got
out of the state. So this would get messed up.

> > So, if ACPI
> > fails to come up or if ACPI is not supported, that would lead to
> > problems.
> 
> I think the problem is that ACPI is thinking its special, that should be
> rectified, its not.
> 
> >  Because if that happens now, we can fallback to the
> > initially registered set.
> 
> I'm thinking its all daft and we should be having one set of idle
> routines, if ACPI fails (a tautology if ever there was one) we simply
> wouldn't have its idle routines to pick from.
> 
> > Also, if a module wants to register a set of routines later on, that
> > cant be added to the initially registered set. So i think we need this
> > set of sets.
> 
> Sounds like something is wrong alright. If you can register an idle
> routine you should be able to unregister it too.
>

Yes, we can register and unregister in a clean way now.
Consider this. We have a set of routines A, B, C currently registered.
Now a module comes and registers D and E, and later on at some point
of time wants to unregister. So how do you keep track of what all idle
routines the module registered and unregister only those?
Best way to do that is a stack, which is how I have currently
implemented.

> What about making ACPI register its idle routines too, 1 for each C
> state, and have the governor make a selection out of the full set?
> 
> That also allows you to do away with this default_idle() nonsense and
> simply panic the box when there are no registered idle routines when the
> box wants to go idle.
>
Peter Zijlstra Oct. 8, 2009, 12:25 p.m. UTC | #7
On Thu, 2009-10-08 at 17:31 +0530, Arun R Bharadwaj wrote:
> 
> > Uhm, no, it would mean ACPI putting its idle routines on the same level
> > as all others.
> > 
> 
> Putting them all on the same level would mean, we need an
> enable/disable routine to enable only the currently active routines.

What's this enable/disable stuff about?

> Also, the way governor works is that, it assumes that idle routines
> are indexed in the increasing order of power benefit that can be got
> out of the state. So this would get messed up.

Right, which is why I initially had a power-savings field in my
proposal, so it could weight the power savings vs the wakeup latency.

  http://lkml.org/lkml/2009/8/27/159

There it was said that was exactly what these governors were doing,
seems its not.

> > Sounds like something is wrong alright. If you can register an idle
> > routine you should be able to unregister it too.
> >
> 
> Yes, we can register and unregister in a clean way now.
> Consider this. We have a set of routines A, B, C currently registered.
> Now a module comes and registers D and E, and later on at some point
> of time wants to unregister. So how do you keep track of what all idle
> routines the module registered and unregister only those?
> Best way to do that is a stack, which is how I have currently
> implemented.

Right, so destroy that inner set thing, that way we only have one
left ;-)
Vaidyanathan Srinivasan Oct. 8, 2009, 1:10 p.m. UTC | #8
* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-10-08 14:25:37]:

> On Thu, 2009-10-08 at 17:31 +0530, Arun R Bharadwaj wrote:
> > 
> > > Uhm, no, it would mean ACPI putting its idle routines on the same level
> > > as all others.
> > > 
> > 
> > Putting them all on the same level would mean, we need an
> > enable/disable routine to enable only the currently active routines.
> 
> What's this enable/disable stuff about?

Don't we need an explicit 'don't use this routine' apart from having
the weights based on power consumption and latency.  In multiple
registration the assumption is that the top most 'set' has all
necessary routines and we do not need any other idle routines for
optimum operation.

Otherwise the governor has to select from larger 'set' which could
have redundant or conflicting idle routines.

For example we now have c1e_idle to start with and then a set of ACPI
C1, C2, C3 routines.  The expectation now is that once we have the
ACPI's routines, we do not need the previous used c1e_idle because
this new set is self contained and picking one from this set based on
latency is good for power savings.
 
> > Also, the way governor works is that, it assumes that idle routines
> > are indexed in the increasing order of power benefit that can be got
> > out of the state. So this would get messed up.
> 
> Right, which is why I initially had a power-savings field in my
> proposal, so it could weight the power savings vs the wakeup latency.
> 
>   http://lkml.org/lkml/2009/8/27/159

This proposal that you had suggested is being used for the 'set' of
idle routines.  The patch changes the idle routines as 'sets' and does
not mix use of routines between two registrations.
 
> There it was said that was exactly what these governors were doing,
> seems its not.

Governors select from a set of idle routines based on latency.  But
there is a probability that any of the routine in the set can be
selected.  

> > > Sounds like something is wrong alright. If you can register an idle
> > > routine you should be able to unregister it too.
> > >
> > 
> > Yes, we can register and unregister in a clean way now.
> > Consider this. We have a set of routines A, B, C currently registered.
> > Now a module comes and registers D and E, and later on at some point
> > of time wants to unregister. So how do you keep track of what all idle
> > routines the module registered and unregister only those?
> > Best way to do that is a stack, which is how I have currently
> > implemented.
> 
> Right, so destroy that inner set thing, that way we only have one
> left ;-)

If un-registration is not needed, then this framework can easily
override the current set with the new one and not worry about the set
of sets.

Ideally, during system boot, we could wait until we know enough
information about idle states and then have a single registration.
The boot process can be in poll-idle until this decision happens.
Like in x86, we can register either c1e_idle or ACPI's routines at
a stage where we know for sure if ACPI is enabled or not.

--Vaidy
Arun Bharadwaj Oct. 9, 2009, 9:39 a.m. UTC | #9
* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-10-08 14:25:37]:

> On Thu, 2009-10-08 at 17:31 +0530, Arun R Bharadwaj wrote:
> > 
> > > Uhm, no, it would mean ACPI putting its idle routines on the same level
> > > as all others.
> > > 
> > 
> > Putting them all on the same level would mean, we need an
> > enable/disable routine to enable only the currently active routines.
> 
> What's this enable/disable stuff about?
> 
> > Also, the way governor works is that, it assumes that idle routines
> > are indexed in the increasing order of power benefit that can be got
> > out of the state. So this would get messed up.
> 
> Right, which is why I initially had a power-savings field in my
> proposal, so it could weight the power savings vs the wakeup latency.
> 
>   http://lkml.org/lkml/2009/8/27/159
> 
> There it was said that was exactly what these governors were doing,
> seems its not.
> 
> > > Sounds like something is wrong alright. If you can register an idle
> > > routine you should be able to unregister it too.
> > >
> > 
> > Yes, we can register and unregister in a clean way now.
> > Consider this. We have a set of routines A, B, C currently registered.
> > Now a module comes and registers D and E, and later on at some point
> > of time wants to unregister. So how do you keep track of what all idle
> > routines the module registered and unregister only those?
> > Best way to do that is a stack, which is how I have currently
> > implemented.
> 
> Right, so destroy that inner set thing, that way we only have one
> left ;-)
> 

I'm not convinced with your argument. Why dont we do this
incrementally. Right now, this set of sets mechanism works fine and
doesn't look like it has any obvious flaws in it. We have a clean
register/unregister mechanism which solves all the earlier problems we
started out to solve.

We can gradually build on this and try to come up with a single set
of idle states for a governor to choose from.

thanks,
arun
Andi Kleen Oct. 12, 2009, 6 p.m. UTC | #10
Peter Zijlstra <a.p.zijlstra@chello.nl> writes:
>
> So does it make sense to have a set of sets?
>
> Why not integrate them all into one set to be ruled by this governor
> thing?

cpuidle is currently optional, that is why the two level hierarchy
is there so that you can still have simple idle selection without it.

% size drivers/cpuidle/*.o
   text    data     bss     dec     hex filename
   5514    1416      44    6974    1b3e drivers/cpuidle/built-in.o

Adding it unconditionally would add ~7k to everyone who wants idle functions.

I think making it unconditional would require putting it on a serious
diet first.

-Andi
Arun Bharadwaj Oct. 14, 2009, 6:17 a.m. UTC | #11
* Andi Kleen <andi@firstfloor.org> [2009-10-12 20:00:05]:

> Peter Zijlstra <a.p.zijlstra@chello.nl> writes:
> >
> > So does it make sense to have a set of sets?
> >
> > Why not integrate them all into one set to be ruled by this governor
> > thing?
> 
> cpuidle is currently optional, that is why the two level hierarchy
> is there so that you can still have simple idle selection without it.
> 
> % size drivers/cpuidle/*.o
>    text    data     bss     dec     hex filename
>    5514    1416      44    6974    1b3e drivers/cpuidle/built-in.o
> 
> Adding it unconditionally would add ~7k to everyone who wants idle functions.
> 
> I think making it unconditional would require putting it on a serious
> diet first.
> 

Hi Andi,

Yes, this is a valid point.

How about something like this..
If the arch does not enable CONFIG_CPU_IDLE, the cpuidle_idle_call
which is called from cpu_idle() should call default_idle without
involving the registering cpuidle steps. This should prevent bloating
up of the kernel for archs which dont want to use cpuidle.

--arun
> -Andi
> -- 
> ak@linux.intel.com -- Speaking for myself only.
Andi Kleen Oct. 14, 2009, 7:18 a.m. UTC | #12
> How about something like this..
> If the arch does not enable CONFIG_CPU_IDLE, the cpuidle_idle_call
> which is called from cpu_idle() should call default_idle without
> involving the registering cpuidle steps. This should prevent bloating
> up of the kernel for archs which dont want to use cpuidle.

On x86 some people want small kernel too, so selecting it on a architecture
granuality is not good. Also you can make it default, you just need
to slim it down first.

-Andi
Arun Bharadwaj Oct. 15, 2009, 6:06 a.m. UTC | #13
* Andi Kleen <andi@firstfloor.org> [2009-10-14 09:18:38]:

> > How about something like this..
> > If the arch does not enable CONFIG_CPU_IDLE, the cpuidle_idle_call
> > which is called from cpu_idle() should call default_idle without
> > involving the registering cpuidle steps. This should prevent bloating
> > up of the kernel for archs which dont want to use cpuidle.
> 
> On x86 some people want small kernel too, so selecting it on a architecture
> granuality is not good. Also you can make it default, you just need
> to slim it down first.
> 

No, I dont mean selecting it on an architecture granularity.

At compile time, if CONFIG_CPU_IDLE is disabled, the arch can redefine
cpuidle_idle_call. For e.g. in arch/x86/kernel/process.c

#ifndef CONFIG_CPU_IDLE
void cpuidle_idle_call(void)
{
	if (local_idle)
		local_idle();
	else
		default_idle();
}
#endif

where local_idle points to the idle routine selected using
select_idle_routine() which can be poll, mwait, c1e.

So this way, we still preserve the exact same functionality as before
and we also remove the ugly pm_idle exported function pointer and we
avoid unnecessary code bloat for platforms who do not want to use
cpuidle.

--arun
diff mbox

Patch

Index: linux.trees.git/drivers/cpuidle/cpuidle.c
===================================================================
--- linux.trees.git.orig/drivers/cpuidle/cpuidle.c
+++ linux.trees.git/drivers/cpuidle/cpuidle.c
@@ -22,6 +22,7 @@ 
 #include "cpuidle.h"
 
 DEFINE_PER_CPU(struct cpuidle_device *, cpuidle_devices);
+DEFINE_PER_CPU(struct list_head, cpuidle_devices_list);
 
 DEFINE_MUTEX(cpuidle_lock);
 
@@ -112,6 +113,45 @@  void cpuidle_resume_and_unlock(void)
 
 EXPORT_SYMBOL_GPL(cpuidle_resume_and_unlock);
 
+int cpuidle_add_to_list(struct cpuidle_device *dev)
+{
+	int ret, cpu = dev->cpu;
+	struct cpuidle_device *old_dev;
+
+	if (!list_empty(&per_cpu(cpuidle_devices_list, cpu))) {
+		old_dev = list_first_entry(&per_cpu(cpuidle_devices_list, cpu),
+				struct cpuidle_device, idle_list);
+		cpuidle_remove_state_sysfs(old_dev);
+	}
+
+	list_add(&dev->idle_list, &per_cpu(cpuidle_devices_list, cpu));
+	ret = cpuidle_add_state_sysfs(dev);
+	return ret;
+}
+
+void cpuidle_remove_from_list(struct cpuidle_device *dev)
+{
+	struct cpuidle_device *temp_dev;
+	struct list_head *pos;
+	int ret, cpu = dev->cpu;
+
+	list_for_each(pos, &per_cpu(cpuidle_devices_list, cpu)) {
+		temp_dev = container_of(pos, struct cpuidle_device, idle_list);
+		if (dev == temp_dev) {
+			list_del(&temp_dev->idle_list);
+			cpuidle_remove_state_sysfs(temp_dev);
+			break;
+		}
+	}
+
+	if (!list_empty(&per_cpu(cpuidle_devices_list, cpu))) {
+		temp_dev = list_first_entry(&per_cpu(cpuidle_devices_list, cpu),
+					struct cpuidle_device, idle_list);
+		ret = cpuidle_add_state_sysfs(temp_dev);
+	}
+	cpuidle_kick_cpus();
+}
+
 /**
  * cpuidle_enable_device - enables idle PM for a CPU
  * @dev: the CPU
@@ -136,9 +176,6 @@  int cpuidle_enable_device(struct cpuidle
 			return ret;
 	}
 
-	if ((ret = cpuidle_add_state_sysfs(dev)))
-		return ret;
-
 	if (cpuidle_curr_governor->enable &&
 	    (ret = cpuidle_curr_governor->enable(dev)))
 		goto fail_sysfs;
@@ -157,7 +194,7 @@  int cpuidle_enable_device(struct cpuidle
 	return 0;
 
 fail_sysfs:
-	cpuidle_remove_state_sysfs(dev);
+	cpuidle_remove_from_list(dev);
 
 	return ret;
 }
@@ -182,8 +219,6 @@  void cpuidle_disable_device(struct cpuid
 
 	if (cpuidle_curr_governor->disable)
 		cpuidle_curr_governor->disable(dev);
-
-	cpuidle_remove_state_sysfs(dev);
 }
 
 EXPORT_SYMBOL_GPL(cpuidle_disable_device);
@@ -267,6 +302,7 @@  int cpuidle_register_device(struct cpuid
 	}
 
 	cpuidle_enable_device(dev);
+	cpuidle_add_to_list(dev);
 
 	mutex_unlock(&cpuidle_lock);
 
@@ -288,6 +324,7 @@  void cpuidle_unregister_device(struct cp
 	cpuidle_pause_and_lock();
 
 	cpuidle_disable_device(dev);
+	cpuidle_remove_from_list(dev);
 
 	per_cpu(cpuidle_devices, dev->cpu) = NULL;
 
@@ -338,12 +375,15 @@  static inline void latency_notifier_init
  */
 static int __init cpuidle_init(void)
 {
-	int ret;
+	int ret, cpu;
 
 	ret = cpuidle_add_class_sysfs(&cpu_sysdev_class);
 	if (ret)
 		return ret;
 
+	for_each_possible_cpu(cpu)
+		INIT_LIST_HEAD(&per_cpu(cpuidle_devices_list, cpu));
+
 	latency_notifier_init(&cpuidle_latency_notifier);
 
 	return 0;
Index: linux.trees.git/include/linux/cpuidle.h
===================================================================
--- linux.trees.git.orig/include/linux/cpuidle.h
+++ linux.trees.git/include/linux/cpuidle.h
@@ -92,6 +92,7 @@  struct cpuidle_device {
 	struct cpuidle_state_kobj *kobjs[CPUIDLE_STATE_MAX];
 	struct cpuidle_state	*last_state;
 
+	struct list_head	idle_list;
 	struct kobject		kobj;
 	struct completion	kobj_unregister;
 	void			*governor_data;