tcp: Fix slowness in read /proc/net/tcp

Message ID alpine.DEB.1.00.1006062020180.24064@pokey.mtv.corp.google.com
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Tom Herbert June 7, 2010, 3:27 a.m. UTC
This patch addresses a serious performance issue in reading the
TCP sockets table (/proc/net/tcp).

Reading the full table is done by a number of sequential read
operations.  At each read operation, a seek is done to find the
last socket that was previously read.  This seek requires counting
the sockets in the table up to the current file position, and
counting them requires taking a lock on each non-empty bucket.  The
whole algorithm is O(n^2).
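
As a rough illustration (a user-space model, not the kernel code),
the pre-patch cost can be seen in a few lines of C: every read chunk
re-walks the hash table from bucket 0 to find the current position,
so reading all n sockets touches O(n^2) buckets.

#include <stdio.h>

#define NBUCKETS 1024                   /* model hash table size */
static int bucket_len[NBUCKETS];        /* sockets per bucket */

/* Walk from the first bucket until 'pos' sockets have been skipped,
 * the way the old seq_file seek did (taking each bucket's lock). */
static int get_idx(long pos)
{
	int b;

	for (b = 0; b < NBUCKETS; b++) {
		if (pos < bucket_len[b])
			return b;       /* socket 'pos' lives here */
		pos -= bucket_len[b];
	}
	return -1;                      /* past the end of the table */
}

int main(void)
{
	long nsockets = 0, walked = 0, pos;
	int b;

	for (b = 0; b < NBUCKETS; b++) {
		bucket_len[b] = 178;    /* ~182K sockets, spread evenly */
		nsockets += bucket_len[b];
	}
	/* assume one socket returned per read chunk, the worst case */
	for (pos = 0; pos < nsockets; pos++)
		walked += get_idx(pos) + 1;

	printf("%ld sockets, %ld bucket visits\n", nsockets, walked);
	return 0;
}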

The fix is to cache the last bucket value, the offset within the
bucket, and the file position returned by the last read operation.
On the next sequential read, the bucket and offset are used to find
the last read socket immediately without needing to scan the
previous buckets in the table.  With this algorithm, reading the
whole table is O(n).
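
In outline, the new tcp_seq_start() implements exactly this (code
taken from the patch below, with explanatory comments added here):

static void *tcp_seq_start(struct seq_file *seq, loff_t *pos)
{
	struct tcp_iter_state *st = seq->private;
	void *rc;

	/* Fast path: a sequential reader resumes exactly where the
	 * previous read stopped, so jump straight to the cached
	 * bucket and skip st->offset sockets within it. */
	if (*pos && *pos == st->last_pos) {
		rc = tcp_seek_last_pos(seq);
		if (rc)
			goto out;
	}

	/* Slow path: first read, or the reader lseek()ed elsewhere;
	 * fall back to the old full scan from the first bucket. */
	st->state = TCP_SEQ_STATE_LISTENING;
	st->num = 0;
	st->bucket = 0;
	st->offset = 0;
	rc = *pos ? tcp_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;

out:
	st->last_pos = *pos;	/* track *pos for the sequential check */
	return rc;
}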

The improvement offered by this patch is easily shown by cat'ing
/proc/net/tcp on a machine with a lot of connections.  With about
182K connections in the table, I see the following:

- Without patch
time cat /proc/net/tcp > /dev/null

real	1m56.729s
user	0m0.214s
sys	1m56.344s

- With patch
time cat /proc/net/tcp > /dev/null

real	0m0.894s
user	0m0.290s
sys	0m0.594s

Signed-off-by: Tom Herbert <therbert@google.com>
---


Comments

Eric Dumazet June 7, 2010, 7:19 a.m. UTC | #1
On Sunday, June 6, 2010 at 20:27 -0700, Tom Herbert wrote:
> This patch addresses a serious performance issue in reading the
> TCP sockets table (/proc/net/tcp).
> 
> Reading the full table is done by a number of sequential read
> operations.  At each read operation, a seek is done to find the
> last socket that was previously read.  This seek requires counting
> the sockets in the table up to the current file position, and
> counting them requires taking a lock on each non-empty bucket.  The
> whole algorithm is O(n^2).
> 
> The fix is to cache the last bucket value, the offset within the
> bucket, and the file position returned by the last read operation.
> On the next sequential read, the bucket and offset are used to find
> the last read socket immediately without needing to scan the
> previous buckets in the table.  With this algorithm, reading the
> whole table is O(n).
> 
> The improvement offered by this patch is easily shown by cat'ing
> /proc/net/tcp on a machine with a lot of connections.  With about
> 182K connections in the table, I see the following:
> 
> - Without patch
> time cat /proc/net/tcp > /dev/null
> 
> real	1m56.729s
> user	0m0.214s
> sys	1m56.344s
> 
> - With patch
> time cat /proc/net/tcp > /dev/null
> 
> real	0m0.894s
> user	0m0.290s
> sys	0m0.594s
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---

This problem comes up every year (the last attempt was from Yakov Lerner:
http://kerneltrap.org/mailarchive/linux-netdev/2009/9/26/6256119 )

And finally, someone motivated enough to use /proc/net/tcp found the
right answer ;)

Most netdev people tend to push the inet_diag (netlink) interface
instead of the old /proc/net/tcp, but it seems the old interface will
still be in use in 2030, so:

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

BTW, another problem with /proc/net/tcp is the buffer size used by the
netstat utility: 1024 bytes instead of PAGE_SIZE, making the O(N^2)
behavior even more palpable.
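
As a rough worked example (assuming ~150 bytes per output line), 182K
sockets is about 27 MB of text, so a 1024-byte buffer costs roughly
27,000 read() calls, and before this patch every one of those reads
re-seeked from the first bucket.  A minimal way to count the reads:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[1024];                /* netstat's old buffer size */
	long nreads = 0;
	ssize_t n;
	int fd = open("/proc/net/tcp", O_RDONLY);

	if (fd < 0)
		return 1;
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		nreads++;              /* one seq_file chunk per read() */
	close(fd);
	printf("%ld reads of %zu bytes\n", nreads, sizeof(buf));
	return 0;
}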



David Miller June 7, 2010, 7:55 a.m. UTC | #2
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 07 Jun 2010 09:19:27 +0200

> On Sunday, June 6, 2010 at 20:27 -0700, Tom Herbert wrote:
>> Signed-off-by: Tom Herbert <therbert@google.com>
>> ---
> 
> This problem comes up every year (the last attempt was from Yakov Lerner:
> http://kerneltrap.org/mailarchive/linux-netdev/2009/9/26/6256119 )
> 
> And finally, someone motivated enough to use /proc/net/tcp found the
> right answer ;)
> 
> Most netdev people tend to push the inet_diag (netlink) interface
> instead of the old /proc/net/tcp, but it seems the old interface will
> still be in use in 2030, so:
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Indeed, this is the best attempt at this I've seen so far;
applied to net-next-2.6, thanks Tom.

> BTW, another problem with /proc/net/tcp is the buffer size used by the
> netstat utility: 1024 bytes instead of PAGE_SIZE, making the O(N^2)
> behavior even more palpable.

Probably cap it at 8K like we do for netlink atomic reads, because
there are all sorts of crazy PAGE_SIZE configurations possible out
there, even as high as 512K.
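
In user space that clamp would look something like this (a sketch;
pick_bufsize() is a hypothetical helper, not netstat code):

#include <stdio.h>
#include <unistd.h>

/* Use the page size as the read buffer, but never more than the 8K
 * cap used for netlink atomic reads. */
static long pick_bufsize(void)
{
	long page = sysconf(_SC_PAGESIZE); /* typically 4K, up to 512K */

	return page > 8192 ? 8192 : page;
}

int main(void)
{
	printf("buffer size: %ld\n", pick_bufsize());
	return 0;
}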
Andi Kleen June 7, 2010, 8:49 a.m. UTC | #3
Eric Dumazet <eric.dumazet@gmail.com> writes:
>
> BTW, another problem with /proc/net/tcp is the buffer size used by the
> netstat utility: 1024 bytes instead of PAGE_SIZE, making the O(N^2)
> behavior even more palpable.

That would be easily fixable in netstat.

But I'm not sure net-tools is still maintained as a separate package;
AFAIK the standard procedure is to submit a patch to a big distribution
and the others take it from there.

-Andi
Eric Dumazet June 7, 2010, 10:32 a.m. UTC | #4
On Monday, June 7, 2010 at 10:49 +0200, Andi Kleen wrote:
> Eric Dumazet <eric.dumazet@gmail.com> writes:
> >
> > BTW, another problem with /proc/net/tcp is the buffer size used by the
> > netstat utility: 1024 bytes instead of PAGE_SIZE, making the O(N^2)
> > behavior even more palpable.
> 
> That would be easily fixable in netstat.
> 
> But I'm not sure net-tools is still maintained as a separate package;
> AFAIK the standard procedure is to submit a patch to a big distribution
> and the others take it from there.

Well, it seems this is fixed in the latest netstat; sorry for the noise
(I still have machines with RHEL 4).




Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index a144914..5731664 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1413,7 +1413,8 @@  struct tcp_iter_state {
 	sa_family_t		family;
 	enum tcp_seq_states	state;
 	struct sock		*syn_wait_sk;
-	int			bucket, sbucket, num, uid;
+	int			bucket, offset, sbucket, num, uid;
+	loff_t			last_pos;
 };
 
 extern int tcp_proc_register(struct net *net, struct tcp_seq_afinfo *afinfo);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 202cf09..4af6f20 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1977,6 +1977,11 @@  static inline struct inet_timewait_sock *tw_next(struct inet_timewait_sock *tw)
 		hlist_nulls_entry(tw->tw_node.next, typeof(*tw), tw_node) : NULL;
 }
 
+/*
+ * Get the next listener socket following cur.  If cur is NULL, get the
+ * first socket starting from the bucket given in st->bucket; when
+ * st->bucket is zero the very first socket in the hash table is returned.
+ */
 static void *listening_get_next(struct seq_file *seq, void *cur)
 {
 	struct inet_connection_sock *icsk;
@@ -1987,14 +1992,15 @@  static void *listening_get_next(struct seq_file *seq, void *cur)
 	struct net *net = seq_file_net(seq);
 
 	if (!sk) {
-		st->bucket = 0;
-		ilb = &tcp_hashinfo.listening_hash[0];
+		ilb = &tcp_hashinfo.listening_hash[st->bucket];
 		spin_lock_bh(&ilb->lock);
 		sk = sk_nulls_head(&ilb->head);
+		st->offset = 0;
 		goto get_sk;
 	}
 	ilb = &tcp_hashinfo.listening_hash[st->bucket];
 	++st->num;
+	++st->offset;
 
 	if (st->state == TCP_SEQ_STATE_OPENREQ) {
 		struct request_sock *req = cur;
@@ -2009,6 +2015,7 @@  static void *listening_get_next(struct seq_file *seq, void *cur)
 				}
 				req = req->dl_next;
 			}
+			st->offset = 0;
 			if (++st->sbucket >= icsk->icsk_accept_queue.listen_opt->nr_table_entries)
 				break;
 get_req:
@@ -2044,6 +2051,7 @@  start_req:
 		read_unlock_bh(&icsk->icsk_accept_queue.syn_wait_lock);
 	}
 	spin_unlock_bh(&ilb->lock);
+	st->offset = 0;
 	if (++st->bucket < INET_LHTABLE_SIZE) {
 		ilb = &tcp_hashinfo.listening_hash[st->bucket];
 		spin_lock_bh(&ilb->lock);
@@ -2057,7 +2065,12 @@  out:
 
 static void *listening_get_idx(struct seq_file *seq, loff_t *pos)
 {
-	void *rc = listening_get_next(seq, NULL);
+	struct tcp_iter_state *st = seq->private;
+	void *rc;
+
+	st->bucket = 0;
+	st->offset = 0;
+	rc = listening_get_next(seq, NULL);
 
 	while (rc && *pos) {
 		rc = listening_get_next(seq, rc);
@@ -2072,13 +2085,18 @@  static inline int empty_bucket(struct tcp_iter_state *st)
 		hlist_nulls_empty(&tcp_hashinfo.ehash[st->bucket].twchain);
 }
 
+/*
+ * Get the first established socket, starting from the bucket given in
+ * st->bucket; if st->bucket is zero, the very first socket in the hash
+ * is returned.
+ */
 static void *established_get_first(struct seq_file *seq)
 {
 	struct tcp_iter_state *st = seq->private;
 	struct net *net = seq_file_net(seq);
 	void *rc = NULL;
 
-	for (st->bucket = 0; st->bucket <= tcp_hashinfo.ehash_mask; ++st->bucket) {
+	st->offset = 0;
+	for (; st->bucket <= tcp_hashinfo.ehash_mask; ++st->bucket) {
 		struct sock *sk;
 		struct hlist_nulls_node *node;
 		struct inet_timewait_sock *tw;
@@ -2123,6 +2141,7 @@  static void *established_get_next(struct seq_file *seq, void *cur)
 	struct net *net = seq_file_net(seq);
 
 	++st->num;
+	++st->offset;
 
 	if (st->state == TCP_SEQ_STATE_TIME_WAIT) {
 		tw = cur;
@@ -2139,6 +2158,7 @@  get_tw:
 		st->state = TCP_SEQ_STATE_ESTABLISHED;
 
 		/* Look for next non empty bucket */
+		st->offset = 0;
 		while (++st->bucket <= tcp_hashinfo.ehash_mask &&
 				empty_bucket(st))
 			;
@@ -2166,7 +2186,11 @@  out:
 
 static void *established_get_idx(struct seq_file *seq, loff_t pos)
 {
-	void *rc = established_get_first(seq);
+	struct tcp_iter_state *st = seq->private;
+	void *rc;
+
+	st->bucket = 0;
+	rc = established_get_first(seq);
 
 	while (rc && pos) {
 		rc = established_get_next(seq, rc);
@@ -2191,24 +2215,72 @@  static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
 	return rc;
 }
 
+static void *tcp_seek_last_pos(struct seq_file *seq)
+{
+	struct tcp_iter_state *st = seq->private;
+	int offset = st->offset;
+	int orig_num = st->num;
+	void *rc = NULL;
+	
+	switch (st->state) {
+	case TCP_SEQ_STATE_OPENREQ:
+	case TCP_SEQ_STATE_LISTENING:
+		if (st->bucket >= INET_LHTABLE_SIZE)
+			break;
+		st->state = TCP_SEQ_STATE_LISTENING;
+		rc = listening_get_next(seq, NULL);
+		while (offset-- && rc)
+			rc = listening_get_next(seq, rc);
+		if (rc)
+			break;
+		st->bucket = 0;
+		/* Fallthrough */
+	case TCP_SEQ_STATE_ESTABLISHED:
+	case TCP_SEQ_STATE_TIME_WAIT:
+		st->state = TCP_SEQ_STATE_ESTABLISHED;
+		if (st->bucket > tcp_hashinfo.ehash_mask)
+			break;
+		rc = established_get_first(seq);
+		while (offset-- && rc)
+			rc = established_get_next(seq, rc);
+	}
+
+	st->num = orig_num;
+
+	return rc;
+}
+
 static void *tcp_seq_start(struct seq_file *seq, loff_t *pos)
 {
 	struct tcp_iter_state *st = seq->private;
+	void *rc;
+
+	if (*pos && *pos == st->last_pos) {
+		rc = tcp_seek_last_pos(seq);
+		if (rc)
+			goto out;
+	}
+
 	st->state = TCP_SEQ_STATE_LISTENING;
 	st->num = 0;
-	return *pos ? tcp_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
+	st->bucket = 0;
+	st->offset = 0;
+	rc = *pos ? tcp_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
+
+out:
+	st->last_pos = *pos;
+	return rc;
 }
 
 static void *tcp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
+	struct tcp_iter_state *st = seq->private;
 	void *rc = NULL;
-	struct tcp_iter_state *st;
 
 	if (v == SEQ_START_TOKEN) {
 		rc = tcp_get_idx(seq, 0);
 		goto out;
 	}
-	st = seq->private;
 
 	switch (st->state) {
 	case TCP_SEQ_STATE_OPENREQ:
@@ -2216,6 +2288,8 @@  static void *tcp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 		rc = listening_get_next(seq, v);
 		if (!rc) {
 			st->state = TCP_SEQ_STATE_ESTABLISHED;
+			st->bucket = 0;
+			st->offset = 0;
 			rc	  = established_get_first(seq);
 		}
 		break;
@@ -2226,6 +2300,7 @@  static void *tcp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 	}
 out:
 	++*pos;
+	st->last_pos = *pos;
 	return rc;
 }
 
@@ -2264,6 +2339,7 @@  static int tcp_seq_open(struct inode *inode, struct file *file)
 
 	s = ((struct seq_file *)file->private_data)->private;
 	s->family		= afinfo->family;
+	s->last_pos 		= 0;
 	return 0;
 }