
net: allow netdev_wait_allrefs() to run faster

Message ID 20091029233848.GV3141@kvack.org
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Benjamin LaHaise Oct. 29, 2009, 11:38 p.m. UTC
On Thu, Oct 29, 2009 at 04:07:18PM -0700, Eric W. Biederman wrote:
> Could you keep me in the loop with that.  I have some pending cleanups for
> all of those pieces of code and may be able to help/advice/review.

Here are the sysfs scaling improvements.  I have to break them up, as there 
are 3 separate changes in this patch: 1. use an rbtree for name lookup in 
sysfs, 2. keep track of the number of directories for the purpose of 
generating the link count, as otherwise too much cpu time is spent in 
sysfs_count_nlink when new entries are added, and 3. when adding a new 
sysfs_dirent, walk the list backwards when linking it in, as higher 
numbered inodes tend to be at the end of the list, not the beginning.

		-ben



Comments

Eric W. Biederman Oct. 30, 2009, 1:45 a.m. UTC | #1
Benjamin LaHaise <bcrl@lhnet.ca> writes:

> On Thu, Oct 29, 2009 at 04:07:18PM -0700, Eric W. Biederman wrote:
>> Could you keep me in the loop with that.  I have some pending cleanups for
>> all of those pieces of code and may be able to help/advice/review.
>
> Here are the sysfs scaling improvements.  I have to break them up, as there 
> are 3 separate changes in this patch: 1. use an rbtree for name lookup in 
> sysfs, 2. keep track of the number of directories for the purpose of 
> generating the link count, as otherwise too much cpu time is spent in 
> sysfs_count_nlink when new entries are added, and 3. when adding a new 
> sysfs_dirent, walk the list backwards when linking it in, as higher 
> numbered inodes tend to be at the end of the list, not the beginning.

The reason for the existence of sysfs_dirent is that as things grow
larger we want to keep the amount of RAM consumed down.  So we don't
pin everything in the dcache, and we try to keep the amount of memory
consumed down.

So I would like to see how much we can pare down.

For dealing with seeks in the middle of readdir I expect the best way
to do that is to take inspiration from the htrees in the extN
filesystems and return a hash of the filename as our position, keeping
the filename list sorted by that hash.  Since we are optimizing for
size we don't need to store that hash.  Then we can turn that list
into some flavor of sorted binary tree.
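
A standalone userspace sketch of that idea (illustrative only -- the
djb2-style hash and the sample names are stand-ins, not what sysfs
would actually use): the hash of the name becomes the readdir position,
so nothing extra needs to be stored per entry.

/* Sketch: derive the directory position from the name itself.  The hash
 * never has to be stored with the entry -- it can always be recomputed. */
#include <stdio.h>

static unsigned long name_hash(const char *name)
{
	unsigned long h = 5381;			/* djb2 */

	while (*name)
		h = h * 33 + (unsigned char)*name++;
	return (h & 0x7fffffff) | 1;		/* never 0: 0 means "start" */
}

int main(void)
{
	const char *names[] = { "lo", "eth0", "eth1", "ppp10234" };

	for (unsigned i = 0; i < sizeof(names) / sizeof(names[0]); i++)
		printf("%-10s pos=%lu\n", names[i], name_hash(names[i]));
	return 0;
}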

I'm surprised sysfs_count_nlink shows up, as it is not directly on the
add or remove path.  I think the answer there is to change s_flags
into a set of bitfields and make link_count one of them, perhaps
16 bits long.  If we ever overflow our bitfield we can just set the
link count to 0, and userspace (aka find) will know it can't optimize
based on link count.
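
A minimal userspace sketch of that suggestion (the field names here are
invented, not the real sysfs_dirent): the link count shares a word with
the flags, and once it would overflow it is pinned at 0, which find(1)
already treats as "unknown".

#include <stdio.h>

struct dir_hdr {
	unsigned int flags : 16;
	unsigned int nlink : 16;	/* 0 == overflowed / unknown */
};

static void dir_hdr_inc_nlink(struct dir_hdr *d)
{
	if (d->nlink == 0)		/* already overflowed: stay at 0 */
		return;
	if (d->nlink == 0xffff)		/* would overflow: give up counting */
		d->nlink = 0;
	else
		d->nlink++;
}

int main(void)
{
	struct dir_hdr d = { .flags = 0, .nlink = 2 };

	for (int i = 0; i < 100000; i++)	/* 100k subdirectories */
		dir_hdr_inc_nlink(&d);
	printf("nlink = %u\n", (unsigned)d.nlink);	/* prints 0 */
	return 0;
}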

I was expecting someone to run into problems with the linear directory
of sysfs someday.

Eric

Benjamin LaHaise Oct. 30, 2009, 2:35 p.m. UTC | #2
On Thu, Oct 29, 2009 at 06:45:32PM -0700, Eric W. Biederman wrote:
> The reason for the existence of sysfs_dirent is as things grow larger
> we want to keep the amount of RAM consumed down.  So we don't pin
> everything in the dcache.  So we try and keep the amount of memory
> consumed down.

I'm aware of that, but for users running into this sort of scaling issue, 
the amount of RAM required is a non-issue (30,000 interfaces require about 
1GB of RAM at present), making the question more one of how to avoid the 
overhead for users who don't require it.  I'd prefer a config option.  The 
only way I can really see saving memory usage is to somehow tie sysfs dirent 
lookups into the network stack's own tables for looking up device entries.  
The network stack already has to cope with this kind of scaling, and that 
would save the RAM.
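
(For scale, 1 GB across 30,000 interfaces works out to roughly 35 KB of
kernel memory per interface.)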

> So I would like to see how much we can pare down.

> For dealing with seeks in the middle of readdir I expect the best way
> to do that is to be inspired by htrees in extNfs and return a hash of
> the filename as our position, and keep the filename list sorted by
> that hash.  Since we are optimizing for size we don't need to store
> that hash.  Then we can turn that list into a some flavor of sorted
> binary tree.

readdir() generally isn't an issue at present.

> I'm surprised sysfs_count_nlink shows up, as it is not directly on the
> add or remove path.  I think the answer there is to change s_flags
> into a set of bitfields and make link_count one of them, perhaps
> 16bits long.  If we ever overflow our bitfield we can just set link
> count to 0, and userspace (aka find) will know it can't optimize
> based on link count.

It shows up because of the bits of userspace (udev) touching the directory 
from things like the hotplug code path.

> I was expecting someone to run into problems with the linear directory
> of sysfs someday.

Alas, sysfs isn't the only offender.

		-ben
Eric Dumazet Oct. 30, 2009, 2:43 p.m. UTC | #3
Benjamin LaHaise wrote:
> On Thu, Oct 29, 2009 at 06:45:32PM -0700, Eric W. Biederman wrote:
>> I was expecting someone to run into problems with the linear directory
>> of sysfs someday.
> 
> Alas, sysfs isn't the only offender.
> 
> 		-ben

In my tests, the sysfs lookup by name is the big offender.  I believe you
should post your rb_tree patch ASAP ;)  Then we can go further.



Eric W. Biederman Oct. 30, 2009, 11:25 p.m. UTC | #4
Benjamin LaHaise <bcrl@lhnet.ca> writes:

> On Thu, Oct 29, 2009 at 06:45:32PM -0700, Eric W. Biederman wrote:
>> The reason for the existence of sysfs_dirent is as things grow larger
>> we want to keep the amount of RAM consumed down.  So we don't pin
>> everything in the dcache.  So we try and keep the amount of memory
>> consumed down.
>
> I'm aware of that, but for users running into this sort of scaling issue, 
> the amount of RAM required is a non-issue (30,000 interfaces require about 
> 1GB of RAM at present), making the question more one of how to avoid the 
> overhead for users who don't require it.  I'd prefer a config option.  The 
> only way I can really see saving memory usage is to somehow tie sysfs dirent 
> lookups into the network stack's own tables for looking up device entries.  
> The network stack already has to cope with this kind of scaling, and that 
> would save the RAM.

There is that.  I'm trying to figure out how to add the improvements
without making sysfs_dirent larger, which I think is doable.

>> So I would like to see how much we can pare down.
>
>> For dealing with seeks in the middle of readdir I expect the best way
>> to do that is to be inspired by htrees in extNfs and return a hash of
>> the filename as our position, and keep the filename list sorted by
>> that hash.  Since we are optimizing for size we don't need to store
>> that hash.  Then we can turn that list into a some flavor of sorted
>> binary tree.
>
> readdir() generally isn't an issue at present.

Supporting seekdir into the middle of a directory is the entire reason
I keep the entries sorted by inode.  If we sort by a hash of the name
instead, we can use the hash to support the directory position in
readdir and seekdir, and we can completely remove the linear list when
the rb_tree is introduced.
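
As a toy illustration of resuming at a stored position (userspace only,
with a sorted array standing in for the eventual tree and a made-up
hash): readdir hands back the entry's hash as the position, and a later
seekdir simply restarts at the first entry whose hash is greater.

#include <stdio.h>
#include <stdlib.h>

struct dent {
	unsigned long hash;
	const char *name;
};

static unsigned long name_hash(const char *s)
{
	unsigned long h = 5381;

	while (*s)
		h = h * 33 + (unsigned char)*s++;
	return (h & 0x7fffffff) | 1;
}

static int by_hash(const void *a, const void *b)
{
	const struct dent *x = a, *y = b;

	return (x->hash > y->hash) - (x->hash < y->hash);
}

int main(void)
{
	const char *names[] = { "lo", "eth0", "eth1", "ppp0", "ppp1" };
	struct dent d[5];
	unsigned long pos;

	for (int i = 0; i < 5; i++)
		d[i] = (struct dent){ name_hash(names[i]), names[i] };
	qsort(d, 5, sizeof(d[0]), by_hash);

	pos = d[1].hash;	/* "readdir" consumed the first two entries */

	/* "seekdir(pos)": resume at the first entry whose hash is > pos */
	for (int i = 0; i < 5; i++)
		if (d[i].hash > pos)
			printf("%s\n", d[i].name);
	return 0;
}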

>> I'm surprised sysfs_count_nlink shows up, as it is not directly on the
>> add or remove path.  I think the answer there is to change s_flags
>> into a set of bitfields and make link_count one of them, perhaps
>> 16bits long.  If we ever overflow our bitfield we can just set link
>> count to 0, and userspace (aka find) will know it can't optimize
>> based on link count.
>
> It shows up because of the bits of userspace (udev) touching the directory 
> from things like the hotplug code path.

I realized after sending the message that s_mode in sysfs_dirent is a
real size offense.  It is a 16-bit field packed in between two longs.
So in practice it is possible to move s_mode up next to s_flags and
add an s_nlink after it, both unsigned short, and get a cheap
sysfs_count_nlink.
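
An illustrative layout comparison (invented field names, not the real
struct, but the same shape: a 16-bit field wedged between an unsigned
long and a pointer): on 64-bit, moving the mode up next to the flags and
adding a 16-bit nlink beside it shrinks the structure instead of growing
it.

#include <stdio.h>

struct before {				/* ~ flags, ino, mode, iattr */
	unsigned int   flags;
	unsigned long  ino;
	unsigned short mode;		/* 2 bytes used, 6 bytes padding (LP64) */
	void          *iattr;
};

struct after {
	unsigned int   flags;
	unsigned short mode;		/* fills padding that already existed */
	unsigned short nlink;		/* cheap link count for directories */
	unsigned long  ino;
	void          *iattr;
};

int main(void)
{
	printf("before: %zu bytes, after: %zu bytes\n",
	       sizeof(struct before), sizeof(struct after));
	return 0;
}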

>> I was expecting someone to run into problems with the linear directory
>> of sysfs someday.
>
> Alas, sysfs isn't the only offender.

Agreed. Sysfs is probably the easiest to untangle.

Since I'm not quite ready to post my patches, I will briefly
mention what I have in my queue and hopefully get things posted.

I have changes to make it so that sysfs never has to go from
the sysfs_dirent to the sysfs inode.  

I have changes to sys_sysctl() so that it becomes a filesystem lookup
under /proc/sys, which ultimately makes the code easier to maintain
and debug.

Now back to getting things forward ported and ready to post.

Eric
Benjamin LaHaise Oct. 30, 2009, 11:53 p.m. UTC | #5
On Fri, Oct 30, 2009 at 04:25:52PM -0700, Eric W. Biederman wrote:
> I realized after sending the message that s_mode in sysfs_dirent is a
> real size offense.  It is a 16bit field packed in between two longs.
> So in practice it is possible to move the s_mode  up next to s_flags
> and add a s_nlink after it both unsigned short and get a cheap sysfs_nlink.

That doesn't work -- the number of directory entries can easily exceed 65535.  
Current mid range hardware is good enough to terminate 100,000 network 
interfaces on a single host.

> Since I'm not quite ready to post my patches.  I will briefly
> mention what I have in my queue and hopefully get things posted.
> 
> I have changes to make it so that sysfs never has to go from
> the sysfs_dirent to the sysfs inode.  

Ah, interesting.

> I have changes to sys_sysctl() so that it becomes a filesystem lookup
> under /proc/sys.  Which ultimately makes the code easier to maintain
> and debug.

That sounds like a much saner approach, but has the wrinkle that procfs can 
be configured out.

> Now back to getting things forward ported and ready to post.

I'm looking forward to those changes.  I've been ignoring procfs for the
time being by disabling the per-interface entries in the network stack,
but there is some desire to be able to enable rp_filter on a
per-interface basis from the RADIUS config at runtime.  rp_filter has to
be disabled across the board on my access routers, as there are several
places where asymmetric routing is used for performance reasons.

		-ben
Eric W. Biederman Oct. 31, 2009, 12:37 a.m. UTC | #6
Benjamin LaHaise <bcrl@lhnet.ca> writes:

> On Fri, Oct 30, 2009 at 04:25:52PM -0700, Eric W. Biederman wrote:
>> I realized after sending the message that s_mode in sysfs_dirent is a
>> real size offense.  It is a 16bit field packed in between two longs.
>> So in practice it is possible to move the s_mode  up next to s_flags
>> and add a s_nlink after it both unsigned short and get a cheap sysfs_nlink.
>
> That doesn't work -- the number of directory entries can easily exceed 65535.  
> Current mid range hardware is good enough to terminate 100,000 network 
> interfaces on a single host.

On overflow nlink becomes zero and you leave it there.  That is how
on-disk filesystems handle that case for directories, and find etc.
knows how to deal with it.

>> Since I'm not quite ready to post my patches.  I will briefly
>> mention what I have in my queue and hopefully get things posted.
>> 
>> I have changes to make it so that sysfs never has to go from
>> the sysfs_dirent to the sysfs inode.  
>
> Ah, interesting.

I have to clean up sysfs before I merge changes for supporting
multiple network namespaces.

>> I have changes to sys_sysctl() so that it becomes a filesystem lookup
>> under /proc/sys.  Which ultimately makes the code easier to maintain
>> and debug.
>
> That sounds like a much saner approach, but has the wrinkle that procfs can 
> be configured out.

So I will add the dependency.  There are very few serious users of sys_sysctl,
and all of them have been getting a deprecated interface warning every
time they use it for the last several years.

>> Now back to getting things forward ported and ready to post.
>
> I'm looking forward to those changes.  I've been ignoring procfs for the 
> time being by disabling the per-interface entries in the network stack, 
> but there is some desire to be able to enable rp_filter on a per-interface 
> radius config at runtime.  rp_filter has to be disabled across the board 
> on my access routers, as there are several places where asymmetric routing 
> is used for performance reasons.

Just out of curiosity, does the loose rp_filter mode work for you?

Eric
Ben Greear Aug. 9, 2010, 5:23 p.m. UTC | #7
On 10/29/2009 04:38 PM, Benjamin LaHaise wrote:
> On Thu, Oct 29, 2009 at 04:07:18PM -0700, Eric W. Biederman wrote:
>> Could you keep me in the loop with that.  I have some pending cleanups for
>> all of those pieces of code and may be able to help/advice/review.
>
> Here are the sysfs scaling improvements.  I have to break them up, as there
> are 3 separate changes in this patch: 1. use an rbtree for name lookup in
> sysfs, 2. keep track of the number of directories for the purpose of
> generating the link count, as otherwise too much cpu time is spent in
> sysfs_count_nlink when new entries are added, and 3. when adding a new
> sysfs_dirent, walk the list backwards when linking it in, as higher
> numbered inodes tend to be at the end of the list, not the beginning.

I was just comparing my out-of-tree patch set to .35, and it appears
little or none of the patches discussed in this thread are in the
upstream kernel yet.

Specifically, there is still that msleep(250) in netdev_wait_allrefs().

Is anyone still trying to get the improvements needed for adding/deleting
lots of interfaces into the kernel?

Thanks,
Ben
Benjamin LaHaise Aug. 9, 2010, 5:34 p.m. UTC | #8
Hello Ben,

On Mon, Aug 09, 2010 at 10:23:37AM -0700, Ben Greear wrote:
> I was just comparing my out-of-tree patch set to .35, and it appears
> little or none of the patches discussed in this thread are in the
> upstream kernel yet.

I was waiting on Eric's sysfs changes for namespaces to settle down, but 
ended up getting busy on other things.  I guess now is a good time to pick 
this back up and try to merge my changes for improving interface scaling.  
I'll send out a new version of the patches sometime in the next couple of 
days.  I'm about to make a new Babylon release as well; I just need 
to write some more documentation. :-/

Btw, one thing I noticed but haven't been able to come up with a fix for 
yet is that iptables has scaling issues with lots of interfaces.  
Specifically, we had to start adding one iptables rule per interface for 
SMTP filtering (not all subscribers are permitted to send SMTP directly 
out to the net, so it has to be per-interface).  It seems that those all 
get dumped into a giant list.  What I'd like to do is to be able to 
attach rules directly to the interface, but I haven't really had the time 
to do a mergeable set of changes for that.  Thoughts, anyone?

		-ben
Ben Greear Aug. 9, 2010, 5:44 p.m. UTC | #9
On 08/09/2010 10:34 AM, Benjamin LaHaise wrote:
> Hello Ben,
>
> On Mon, Aug 09, 2010 at 10:23:37AM -0700, Ben Greear wrote:
>> I was just comparing my out-of-tree patch set to .35, and it appears
>> little or none of the patches discussed in this thread are in the
>> upstream kernel yet.
>
> I was waiting on Eric's sysfs changes for namespaces to settle down, but
> ended up getting busy on other things.  I guess now is a good time to pick
> this back up and try to merge my changes for improving interface scaling.
> I'll send out a new version of the patches sometime in the next couple of
> days.  I'm also about to make a new Babylon release as well, I just need
> to write some more documentation. :-/
>
> Btw, one thing I noticed but haven't been able to come up with a fix for
> yet is that iptables has scaling issues with lots of interfaces.
> Specifically, we had to start adding one iptables rule per interface for smtp
> filtering (not all subscribers are permitted to send smtp directly out to
> the net, so it has to be per-interface).  It seems that those all get
> dumped into a giant list.  What I'd like to do is to be able to attach rules
> directly to the interface, but I haven't really had the time to do a mergable
> set of changes for that.  Thoughts anyone?

We also have a few rules per interface, and notice that it takes around 10ms
per rule when we are removing them, even when using batching in 'ip':

This is on a high-end core i7, otherwise lightly loaded.

Total IPv4 rule listings: 2097
Cleaning 2094 rules with ip -batch...
time -p ip -4 -force -batch /tmp/crr_batch_cmds_4.txt
real 17.81
user 0.05
sys 0.00
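
(17.81 seconds for 2094 rules works out to about 8.5 ms per deleted
rule, in line with the ~10 ms figure above.)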


Patrick had an idea, but I don't think he had time to
look at it further:

"Its probably the synchronize_rcu() in fib_nl_delrule() and
the route flushing happening after rule removal."


Thanks,
Ben
Benjamin LaHaise Aug. 9, 2010, 5:48 p.m. UTC | #10
On Mon, Aug 09, 2010 at 10:44:14AM -0700, Ben Greear wrote:
> We also have a few rules per interface, and notice that it takes around 10ms
> per rule when we are removing them, even when using batching in 'ip':
...
> Patrick had an idea, but I don't think he had time to
> look at it further:
> 
> "Its probably the synchronize_rcu() in fib_nl_delrule() and
> the route flushing happening after rule removal."

Yes, that would be a problem, but the issue is deeper than that -- if I'm 
not mistaken it's on the packet processing path that iptables doesn't scale 
for 100k interfaces with 1 rule per interface.  It's been a while since I 
ran the tests, but I don't think it's changed much.

		-ben
Ben Greear Aug. 9, 2010, 6:03 p.m. UTC | #11
On 08/09/2010 10:48 AM, Benjamin LaHaise wrote:
> On Mon, Aug 09, 2010 at 10:44:14AM -0700, Ben Greear wrote:
>> We also have a few rules per interface, and notice that it takes around 10ms
>> per rule when we are removing them, even when using batching in 'ip':
> ...
>> Patrick had an idea, but I don't think he had time to
>> look at it further:
>>
>> "Its probably the synchronize_rcu() in fib_nl_delrule() and
>> the route flushing happening after rule removal."
>
> Yes, that would be a problem, but the issue is deeper than that -- if I'm
> not mistaken it's on the packet processing path that iptables doesn't scale
> for 100k interfaces with 1 rule per interface.  It's been a while since I
> ran the tests, but I don't think it's changed much.

It would be nice to tie the rules based on 'iif' to a specific
interface.  It seems that should give near-constant-time lookup for
rules if we only have a few per interface...
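
A toy sketch of that data-structure change (made-up types, nothing to do
with the real netfilter internals): the packet path already holds a
pointer to the input device, so matching walks only that device's short
rule list, independent of the total rule count.

#include <stdio.h>

struct rule {
	int id;
	struct rule *next;
};

struct iface {
	const char *name;
	struct rule *rules;	/* only the rules whose 'iif' is this device */
};

static void match(const struct iface *in_dev)
{
	/* O(rules on this interface), not O(all rules in the table) */
	for (const struct rule *r = in_dev->rules; r; r = r->next)
		printf("%s: checking rule %d\n", in_dev->name, r->id);
}

int main(void)
{
	struct rule smtp_block = { .id = 1, .next = NULL };
	struct iface ppp0 = { .name = "ppp0", .rules = &smtp_block };

	match(&ppp0);
	return 0;
}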

Thanks,
Ben
Eric W. Biederman Aug. 9, 2010, 7:59 p.m. UTC | #12
Benjamin LaHaise <bcrl@lhnet.ca> writes:

> Hello Ben,
>
> On Mon, Aug 09, 2010 at 10:23:37AM -0700, Ben Greear wrote:
>> I was just comparing my out-of-tree patch set to .35, and it appears
>> little or none of the patches discussed in this thread are in the
>> upstream kernel yet.

The network device deletion batching code has gone in, which is
a big help, as have some dev_put deletions, so we hit that 250ms
delay less often.  

> I was waiting on Eric's sysfs changes for namespaces to settle down, but 
> ended up getting busy on other things.  I guess now is a good time to pick 
> this back up and try to merge my changes for improving interface scaling.  
> I'll send out a new version of the patches sometime in the next couple of 
> days.  I'm also about to make a new Babylon release as well, I just need 
> to write some more documentation. :-/

Feature-wise, sysfs has now settled down, and the regressions have all
been stamped out, so now should be a good time to work on scaling.

I still have some preliminary patches in my tree that I will dig up
as time goes by.

Eric
Benjamin LaHaise Aug. 9, 2010, 9:03 p.m. UTC | #13
On Mon, Aug 09, 2010 at 12:59:14PM -0700, Eric W. Biederman wrote:
> The network device deletion batching code has gone in, which is
> a big help, as have some dev_put deletions, so we hit that 250ms
> delay less often.  

I'll see how much that helps.  Odds are I'm going to have to move the 
device deletion into a separate thread.  That should give me a natural 
boundary to queue up deletions at, which should fix the tunnel-flap and 
partial tunnel-flap cases I'm worried about.  At some point I have to 
figure out how to get my API needs met by the in-kernel L2TP code, but 
that's a worry for another day.
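
For what it's worth, a toy userspace sketch of that plan (pthreads
standing in for whatever thread the daemon would actually use):
producers queue deletions, and a single deleter thread wakes up and
tears down whatever has accumulated as one batch, so a tunnel flap pays
the expensive teardown once per batch rather than once per device.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int pending;			/* devices waiting to be destroyed */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  kick = PTHREAD_COND_INITIALIZER;

static void queue_delete(void)		/* called wherever a tunnel drops */
{
	pthread_mutex_lock(&lock);
	pending++;
	pthread_cond_signal(&kick);
	pthread_mutex_unlock(&lock);
}

static void *deleter(void *arg)
{
	(void)arg;
	for (;;) {
		int batch;

		pthread_mutex_lock(&lock);
		while (pending == 0)
			pthread_cond_wait(&kick, &lock);
		batch = pending;	/* take everything queued so far */
		pending = 0;
		pthread_mutex_unlock(&lock);

		/* the expensive teardown happens here, once per batch
		 * rather than once per device */
		printf("destroying %d devices in one batch\n", batch);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, deleter, NULL);
	for (int i = 0; i < 1000; i++)	/* e.g. a tunnel flap */
		queue_delete();
	sleep(1);			/* let the worker drain */
	return 0;
}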

> sysfs feature wise has now settled down, and the regressions have all
> been stamped out so now should be a good time to work on scaling.
> 
> I still have some preliminary patches in my tree, that I will dig up
> as time goes by.

I should have some time this evening to run a few tests, and hopefully can 
post some results.

		-ben
Eric W. Biederman Aug. 9, 2010, 9:17 p.m. UTC | #14
Benjamin LaHaise <bcrl@lhnet.ca> writes:

> On Mon, Aug 09, 2010 at 12:59:14PM -0700, Eric W. Biederman wrote:
>> The network device deletion batching code has gone in, which is
>> a big help, as have some dev_put deletions, so we hit that 250ms
>> delay less often.  
>
> I'll see how much that helps.  Odds are I'm going to have to move the 
> device deletion into a separate thread.  That should give me a natural 
> boundary to queue up deletions at, which should fix the tunnel-flap and 
> partial tunnel-flap cases I'm worried about.  At some point I have to 
> figure out how to get my API needs met by the in-kernel L2TP code, but 
> that's a worry for another day.

In case it is useful: if you delete a network namespace, in general
all of the network device deletions can be batched.

Eric

Patch

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 5fad489..38ad7c8 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -43,10 +43,18 @@  static DEFINE_IDA(sysfs_ino_ida);
 static void sysfs_link_sibling(struct sysfs_dirent *sd)
 {
 	struct sysfs_dirent *parent_sd = sd->s_parent;
-	struct sysfs_dirent **pos;
+	struct sysfs_dirent **pos, *prev = NULL;
+	struct rb_node **new, *parent;
 
 	BUG_ON(sd->s_sibling);
 
+	if (parent_sd->s_dir.children_tail &&
+	    parent_sd->s_dir.children_tail->s_ino < sd->s_ino) {
+		prev = parent_sd->s_dir.children_tail;
+		pos = &prev->s_sibling;
+		goto got_it;
+	}
+
 	/* Store directory entries in order by ino.  This allows
 	 * readdir to properly restart without having to add a
 	 * cursor into the s_dir.children list.
@@ -54,9 +62,36 @@  static void sysfs_link_sibling(struct sysfs_dirent *sd)
 	for (pos = &parent_sd->s_dir.children; *pos; pos = &(*pos)->s_sibling) {
 		if (sd->s_ino < (*pos)->s_ino)
 			break;
+		prev = *pos;
 	}
+got_it:
+	if (prev == parent_sd->s_dir.children_tail)
+		parent_sd->s_dir.children_tail = sd;
 	sd->s_sibling = *pos;
+	sd->s_sibling_prev = prev;
 	*pos = sd;
+	parent_sd->s_nr_children_dir += (sysfs_type(sd) == SYSFS_DIR);
+
+	// rb tree insert
+	new = &(parent_sd->s_dir.child_rb_root.rb_node);
+	parent = NULL;
+
+	while (*new) {
+		struct sysfs_dirent *this =
+			container_of(*new, struct sysfs_dirent, s_rb_node);
+		int result = strcmp(sd->s_name, this->s_name);
+
+		parent = *new;
+		if (result < 0)
+			new = &((*new)->rb_left);
+		else if (result > 0)
+			new = &((*new)->rb_right);
+		else
+			BUG();
+	}
+
+	rb_link_node(&sd->s_rb_node, parent, new);
+	rb_insert_color(&sd->s_rb_node, &parent_sd->s_dir.child_rb_root);
 }
 
 /**
@@ -71,16 +106,22 @@  static void sysfs_link_sibling(struct sysfs_dirent *sd)
  */
 static void sysfs_unlink_sibling(struct sysfs_dirent *sd)
 {
-	struct sysfs_dirent **pos;
+	struct sysfs_dirent **pos, *prev = NULL;
 
-	for (pos = &sd->s_parent->s_dir.children; *pos;
-	     pos = &(*pos)->s_sibling) {
-		if (*pos == sd) {
-			*pos = sd->s_sibling;
-			sd->s_sibling = NULL;
-			break;
-		}
-	}
+	prev = sd->s_sibling_prev;
+	if (prev)
+		pos = &prev->s_sibling;
+	else
+		pos = &sd->s_parent->s_dir.children;
+	if (sd == sd->s_parent->s_dir.children_tail)
+		sd->s_parent->s_dir.children_tail = prev;
+	*pos = sd->s_sibling;
+	if (sd->s_sibling)
+		sd->s_sibling->s_sibling_prev = prev;
+	
+	sd->s_parent->s_nr_children_dir -= (sysfs_type(sd) == SYSFS_DIR);
+	sd->s_sibling_prev = NULL;
+	rb_erase(&sd->s_rb_node, &sd->s_parent->s_dir.child_rb_root);
 }
 
 /**
@@ -331,6 +372,9 @@  struct sysfs_dirent *sysfs_new_dirent(const char *name, umode_t mode, int type)
 	sd->s_mode = mode;
 	sd->s_flags = type;
 
+	if (type == SYSFS_DIR)
+		sd->s_dir.child_rb_root = RB_ROOT;
+
 	return sd;
 
  err_out2:
@@ -630,11 +674,20 @@  void sysfs_addrm_finish(struct sysfs_addrm_cxt *acxt)
 struct sysfs_dirent *sysfs_find_dirent(struct sysfs_dirent *parent_sd,
 				       const unsigned char *name)
 {
-	struct sysfs_dirent *sd;
-
-	for (sd = parent_sd->s_dir.children; sd; sd = sd->s_sibling)
-		if (!strcmp(sd->s_name, name))
-			return sd;
+	struct rb_node *node = parent_sd->s_dir.child_rb_root.rb_node;
+
+	while (node) {
+		struct sysfs_dirent *data =
+			container_of(node, struct sysfs_dirent, s_rb_node);
+		int result;
+		result = strcmp(name, data->s_name);
+		if (result < 0)
+			node = node->rb_left;
+		else if (result > 0)
+			node = node->rb_right;
+		else
+			return data;
+	}
 	return NULL;
 }
 
diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index e28cecf..ff6e960 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -191,14 +191,7 @@  static struct lock_class_key sysfs_inode_imutex_key;
 
 static int sysfs_count_nlink(struct sysfs_dirent *sd)
 {
-	struct sysfs_dirent *child;
-	int nr = 0;
-
-	for (child = sd->s_dir.children; child; child = child->s_sibling)
-		if (sysfs_type(child) == SYSFS_DIR)
-			nr++;
-
-	return nr + 2;
+	return sd->s_nr_children_dir + 2;
 }
 
 static void sysfs_init_inode(struct sysfs_dirent *sd, struct inode *inode)
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index af4c4e7..22fd1bc 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -9,6 +9,7 @@ 
  */
 
 #include <linux/fs.h>
+#include <linux/rbtree.h>
 
 struct sysfs_open_dirent;
 
@@ -16,7 +17,10 @@  struct sysfs_open_dirent;
 struct sysfs_elem_dir {
 	struct kobject		*kobj;
 	/* children list starts here and goes through sd->s_sibling */
+	
 	struct sysfs_dirent	*children;
+	struct sysfs_dirent	*children_tail;
+	struct rb_root		child_rb_root;
 };
 
 struct sysfs_elem_symlink {
@@ -52,6 +56,8 @@  struct sysfs_dirent {
 	atomic_t		s_active;
 	struct sysfs_dirent	*s_parent;
 	struct sysfs_dirent	*s_sibling;
+	struct sysfs_dirent	*s_sibling_prev;
+	struct rb_node		s_rb_node;
 	const char		*s_name;
 
 	union {
@@ -65,6 +71,8 @@  struct sysfs_dirent {
 	ino_t			s_ino;
 	umode_t			s_mode;
 	struct sysfs_inode_attrs *s_iattr;
+
+	int			s_nr_children_dir;
 };
 
 #define SD_DEACTIVATED_BIAS		INT_MIN