
[RFC] accounting for socket backlog

Message ID 1267067593.16986.1583.camel@debian
State RFC, archived
Delegated to: David Miller

Commit Message

Zhu Yi Feb. 25, 2010, 3:13 a.m. UTC
Hi,

We got a system OOM while running some UDP netperf tests on the
loopback device. The case is multiple senders sending streams of UDP
packets to a single receiver via loopback on the local host. Of course,
the receiver is not able to handle all the packets in time. But we were
surprised to find that these packets were not discarded due to the
receiver's sk->sk_rcvbuf limit. Instead, they kept queuing up in
sk->sk_backlog and finally ate up all the memory. We believe this is a
security hole that allows a non-privileged user to crash the system.

The root cause of this problem is that when the receiver is doing
__release_sock() (i.e. after userspace recv, kernel udp_recvmsg ->
skb_free_datagram_locked -> release_sock), it moves skbs from the
backlog to sk_receive_queue with softirqs enabled. In the above case,
multiple busy senders turn this into an almost endless loop. The skbs
in the backlog end up eating all the system memory.
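
For reference, the drain loop in question looks roughly like this (a
simplified sketch of __release_sock() from net/core/sock.c of this era,
not the exact source):

/* While one batch is processed with softirqs enabled, senders can keep
 * appending new skbs to sk_backlog, so the outer loop may never
 * terminate and nothing bounds the memory the queued skbs pin down.
 */
static void __release_sock(struct sock *sk)
{
	struct sk_buff *skb = sk->sk_backlog.head;

	do {
		/* Detach the current batch, then let softirqs back in. */
		sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
		bh_unlock_sock(sk);

		do {
			struct sk_buff *next = skb->next;

			skb->next = NULL;
			sk_backlog_rcv(sk, skb);	/* e.g. __udp_queue_rcv_skb() */
			cond_resched_softirq();
			skb = next;
		} while (skb != NULL);

		bh_lock_sock(sk);
		/* Senders may have refilled the backlog meanwhile. */
	} while ((skb = sk->sk_backlog.head) != NULL);
}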

The patch fixes this problem by adding accounting for the socket
backlog, so that the backlog size can be restricted at the protocol's
discretion (e.g. UDP).

Signed-off-by: Zhu Yi <yi.zhu@intel.com>
---



Comments

David Miller Feb. 25, 2010, 8:31 a.m. UTC | #1
From: Zhu Yi <yi.zhu@intel.com>
Date: Thu, 25 Feb 2010 11:13:13 +0800

> @@ -1372,8 +1372,13 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>  	bh_lock_sock(sk);
>  	if (!sock_owned_by_user(sk))
>  		rc = __udp_queue_rcv_skb(sk, skb);
> -	else
> +	else {
> +		if (atomic_read(&sk->sk_backlog.len) >= sk->sk_rcvbuf) {
> +			bh_unlock_sock(sk);
> +			goto drop;
> +		}
>  		sk_add_backlog(sk, skb);
> +	}

We have to address this issue, of course, but I bet this method of
handling it negatively impacts performance in normal cases.

Right now we can queue up a lot and still get it to the application
if it is slow getting scheduled onto a cpu, but if you put this
limit here it could result in lots of drops.
Eric Dumazet Feb. 25, 2010, 11:24 a.m. UTC | #2
On Thu, 2010-02-25 at 11:13 +0800, Zhu Yi wrote:
> Hi,
> 
> We got a system OOM while running some UDP netperf tests on the
> loopback device. The case is multiple senders sending streams of UDP
> packets to a single receiver via loopback on the local host. Of course,
> the receiver is not able to handle all the packets in time. But we were
> surprised to find that these packets were not discarded due to the
> receiver's sk->sk_rcvbuf limit. Instead, they kept queuing up in
> sk->sk_backlog and finally ate up all the memory. We believe this is a
> security hole that allows a non-privileged user to crash the system.
> 
> The root cause of this problem is that when the receiver is doing
> __release_sock() (i.e. after userspace recv, kernel udp_recvmsg ->
> skb_free_datagram_locked -> release_sock), it moves skbs from the
> backlog to sk_receive_queue with softirqs enabled. In the above case,
> multiple busy senders turn this into an almost endless loop. The skbs
> in the backlog end up eating all the system memory.
> 
> The patch fixes this problem by adding accounting for the socket
> backlog, so that the backlog size can be restricted at the protocol's
> discretion (e.g. UDP).
> 
> Signed-off-by: Zhu Yi <yi.zhu@intel.com>
> ---
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 3f1a480..2e003b9 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -253,6 +253,7 @@ struct sock {
>  	struct {
>  		struct sk_buff *head;
>  		struct sk_buff *tail;
> +		atomic_t len;

This adds a hole on 32-bit arches.

I am pretty sure we don't need an atomic here, since we must own a lock
before manipulating sk_backlog.{head,tail,len}.

UDP/IPv6 should be addressed in your patch too.

Other questions raised by your discovery:
- What about other protocols that also use a backlog?
- __release_sock() could run forever with no preemption, even with a
limit on backlog.


Zhu Yi Feb. 26, 2010, 2:34 a.m. UTC | #3
On Thu, 2010-02-25 at 19:24 +0800, Eric Dumazet wrote:
> > @@ -253,6 +253,7 @@ struct sock {
> >       struct {
> >               struct sk_buff *head;
> >               struct sk_buff *tail;
> > +             atomic_t len;
> 
> This adds a hole on 32-bit arches.
> 
> I am pretty sure we don't need an atomic here, since we must own a lock
> before manipulating sk_backlog.{head,tail,len}.

Good point. bh_lock_sock is always held for backlog operations.
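
For illustration, the accounting could then become a plain integer,
along these lines (a sketch only; same fields and helper as in the RFC
patch below):

/* Non-atomic backlog length: safe because every reader and writer of
 * sk_backlog already runs under bh_lock_sock() or the socket lock.
 */
struct {
	struct sk_buff	*head;
	struct sk_buff	*tail;
	int		len;	/* bytes (skb->truesize) in the backlog */
} sk_backlog;

static inline void sk_add_backlog(struct sock *sk, struct sk_buff *skb)
{
	if (!sk->sk_backlog.tail) {
		sk->sk_backlog.head = sk->sk_backlog.tail = skb;
	} else {
		sk->sk_backlog.tail->next = skb;
		sk->sk_backlog.tail = skb;
	}
	skb->next = NULL;
	sk->sk_backlog.len += skb->truesize;	/* plain add, lock held */
}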

> UDP/IPv6 should be addressed in your patch too.

Will do, this is only an RFC anyway.

> Other questions raised by your discovery:
> - What about other protocols that also use a backlog?

I don't think protocols with flow/congestion control capability have
such an issue. We have tested that TCP is immune. The other current
backlog users are dccp, sctp, tipc, x.25 and llc. We didn't test all of
them, but it looks like only llc here is possibly affected, and even
that is unlikely?

> - __release_sock() could run forever with no preemption, even with a
> limit on backlog. 

Yes, but there is no critical impact like memory exhaustion in that
case.

Thanks,
-yi

Zhu Yi Feb. 26, 2010, 2:44 a.m. UTC | #4
On Thu, 2010-02-25 at 16:31 +0800, David Miller wrote:
> > @@ -1372,8 +1372,13 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> >       bh_lock_sock(sk);
> >       if (!sock_owned_by_user(sk))
> >               rc = __udp_queue_rcv_skb(sk, skb);
> > -     else
> > +     else {
> > +             if (atomic_read(&sk->sk_backlog.len) >= sk->sk_rcvbuf) {
> > +                     bh_unlock_sock(sk);
> > +                     goto drop;
> > +             }
> >               sk_add_backlog(sk, skb);
> > +     }
> 
> We have to address this issue, of course, but I bet this method of
> handling it negatively impacts performance in normal cases.

Eric mentioned an atomic is not required here. With that change, I
don't think performance will be impacted by the above if clause. Right?

> Right now we can queue up a lot and still get it to the application
> if it is slow getting scheduled onto a cpu, but if you put this
> limit here it could result in lots of drops. 

Or we can replace the sk->sk_rcvbuf limit with the backlog's own limit.
We can queue "a lot", but not endlessly. We have to have a limit anyway.

Thanks,
-yi

David Miller Feb. 26, 2010, 5:52 a.m. UTC | #5
From: Zhu Yi <yi.zhu@intel.com>
Date: Fri, 26 Feb 2010 10:44:13 +0800

> On Thu, 2010-02-25 at 16:31 +0800, David Miller wrote:
>> Right now we can queue up a lot and still get it to the application
>> if it is slow getting scheduled onto a cpu, but if you put this
>> limit here it could result in lots of drops. 
> 
> Or we can replace the sk->sk_rcvbuf limit with the backlog's own limit.
> We can queue "a lot", but not endlessly. We have to have a limit anyway.

Simply using (2 * sk->sk_rcvbuf) might be sufficient.
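
A minimal sketch of that suggestion (hypothetical placement in
udp_queue_rcv_skb(), reusing the check from the RFC patch; the exact
bound is still under discussion):

	bh_lock_sock(sk);
	if (!sock_owned_by_user(sk)) {
		rc = __udp_queue_rcv_skb(sk, skb);
	} else {
		/* Give a slowly scheduled receiver a cushion of twice
		 * sk_rcvbuf before dropping (davem's suggested bound).
		 */
		if (atomic_read(&sk->sk_backlog.len) >= 2 * sk->sk_rcvbuf) {
			bh_unlock_sock(sk);
			goto drop;
		}
		sk_add_backlog(sk, skb);
	}
	bh_unlock_sock(sk);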
Eric Dumazet Feb. 26, 2010, 1:12 p.m. UTC | #6
On Fri, 2010-02-26 at 10:34 +0800, Zhu Yi wrote:
> On Thu, 2010-02-25 at 19:24 +0800, Eric Dumazet wrote:
> > - What about other protocols that also use a backlog?
> 
> I don't think protocols with flow/congestion control capability have
> such an issue. We have tested that TCP is immune. The other current
> backlog users are dccp, sctp, tipc, x.25 and llc. We didn't test all of
> them, but it looks like only llc here is possibly affected, and even
> that is unlikely?
> 

I looked again at the TCP case and I see the same problem as with UDP.
If the socket is locked by the user and a softirq flood happens, you can
fill the backlog with frames too (any frame with a correct tuple, even
with bad sequence numbers), and exhaust memory.
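
For what it's worth, the TCP receive path has the same unconditional
queueing; roughly (a sketch paraphrased from tcp_v4_rcv() of this era,
details may differ):

	bh_lock_sock_nested(sk);
	ret = 0;
	if (!sock_owned_by_user(sk)) {
		ret = tcp_v4_do_rcv(sk, skb);
	} else {
		/* Socket owned by user: every segment matching the tuple
		 * is backlogged with no size check, just like UDP.
		 */
		sk_add_backlog(sk, skb);
	}
	bh_unlock_sock(sk);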


> > - __release_sock() could run forever with no preemption, even with a
> > limit on backlog. 
> 
> Yes, but there is no critical impact like memory exhaustion in that
> case.
> 

Well, if you have one processor, and a process doesn't want to yield the
cpu (apart from the softirq, of course, which is filling the backlog
while your process tries to empty it), your machine is dead. This is
critical too :)



Zhu Yi March 1, 2010, 2:17 a.m. UTC | #7
On Fri, 2010-02-26 at 21:12 +0800, Eric Dumazet wrote:
> Well, if you have one processor, and a process doesnt want to yield
> the cpu (apart of sofirq of course that is filling the backlog while
> your process tries to empty it), your machine is dead. This is
> critical too :) 

If you only have one CPU, this won't happen, because while the receiver
is busy processing the backlog, no sender will have the chance to be
scheduled to Tx more. And with a limited backlog, it won't take long for
the receiver to finish processing all the frames in it.

Thanks,
-yi

Eric Dumazet March 1, 2010, 2:29 a.m. UTC | #8
On Mon, 2010-03-01 at 10:17 +0800, Zhu Yi wrote:
> On Fri, 2010-02-26 at 21:12 +0800, Eric Dumazet wrote:
> > Well, if you have one processor, and a process doesn't want to yield
> > the cpu (apart from the softirq, of course, which is filling the
> > backlog while your process tries to empty it), your machine is dead.
> > This is critical too :)
> 
> If you only have one CPU, this won't happen, because while the receiver
> is busy processing the backlog, no sender will have the chance to be
> scheduled to Tx more. And with a limited backlog, it won't take long
> for the receiver to finish processing all the frames in it.
> 

You focus on the case where you Intel guys discovered the flaw, using
the loopback interface. I am concerned with a DOS situation, where some
bad guy on your LAN sends a flood at your machine, using a real 10Gb
NIC.

I was concerned by two things:

- One process being stuck forever in __release_sock(), basically
stopping an application from making progress. This is a DOS problem.
Our kernel is potentially affected. We probably should do something to
avoid this problem. But I am _not_ saying this should be done by your
patch; it is probably possible to address it in an independent patch.

- One process being stuck in __release_sock() and not yielding to other
processes. This is not the case, since we call the cond_resched_softirq()
function, which permits other high-priority tasks to get the CPU.


One more point:

If you remember my previous mail, I suggested clearing the len field in
__release_sock(). This way, you can provide a first, protocol-agnostic
patch, then provide a further patch for UDP v4, another patch for UDP
v6, etc., to have a clean path and make the resolution of the bugs more
self-explanatory, instead of one big patch.
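
Something along these lines (a sketch of that suggestion, untested; it
assumes the non-atomic len discussed earlier):

static void __release_sock(struct sock *sk)
{
	struct sk_buff *skb = sk->sk_backlog.head;

	do {
		sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
		bh_unlock_sock(sk);

		do {
			struct sk_buff *next = skb->next;

			skb->next = NULL;
			sk_backlog_rcv(sk, skb);
			cond_resched_softirq();
			skb = next;
		} while (skb != NULL);

		bh_lock_sock(sk);
	} while ((skb = sk->sk_backlog.head) != NULL);

	/* Clear the accounted length only after the drain, lock held:
	 * producers keep seeing a non-zero len (and get bounced by the
	 * limit) for the whole drain, so a wild sender cannot make this
	 * loop run forever.
	 */
	sk->sk_backlog.len = 0;
}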
 
Thanks


Zhu Yi March 1, 2010, 3:03 a.m. UTC | #9
On Mon, 2010-03-01 at 10:29 +0800, Eric Dumazet wrote:
> You focus on the case where you Intel guys discovered the flaw, using
> the loopback interface. I am concerned with a DOS situation, where
> some bad guy on your LAN sends a flood at your machine, using a real
> 10Gb NIC.
> I was concerned by two things:
> 
> - One process being stuck forever in __release_sock(), basically
> stopping an application from making progress. This is a DOS problem.
> Our kernel is potentially affected. We probably should do something to
> avoid this problem. But I am _not_ saying this should be done by your
> patch; it is probably possible to address it in an independent patch.

Agreed. Unless you or someone else comes up with a better idea that
solves them both, I'd only fix one problem at a time.

> - One process being stuck in __release_sock() and not yielding to
> other processes. This is not the case, since we call the
> cond_resched_softirq() function, which permits other high-priority
> tasks to get the CPU.
> 
> 
> One more point:
> 
> If you remember my previous mail, I suggested clearing the len field
> in __release_sock(). This way, you can provide a first,
> protocol-agnostic patch, then provide a further patch for UDP v4,
> another patch for UDP v6, etc., to have a clean path and make the
> resolution of the bugs more self-explanatory, instead of one big
> patch.

This was my original plan. But davem just mentioned we have to protect
against maliciously sent frames, which means all protocols using the
backlog. That makes me wonder whether I should put the check in the
generic path.

Thanks,
-yi

Eric Dumazet March 1, 2010, 3:12 a.m. UTC | #10
On Mon, 2010-03-01 at 11:03 +0800, Zhu Yi wrote:

> This was my original plan. But davem just mentioned we have to protect
> against maliciously sent frames, which means all protocols using the
> backlog. That makes me wonder whether I should put the check in the
> generic path.
> 

Sure, the check could be done in sk_add_backlog() (including the
atomic_inc(&sk->sk_drops); and kfree_skb(skb);), but make sure a status
is returned so that we can increment SNMP counters at the protocol
level.
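
For instance, something like this (a sketch under the assumptions
above: non-atomic len and a caller-chosen limit; the helper name is
illustrative, not final):

/* Generic, protocol-agnostic limit check: the helper drops and counts
 * the skb, and the caller bumps its own SNMP counters (e.g.
 * UDP_MIB_RCVBUFERRORS) when it sees -ENOBUFS.
 */
static inline int sk_add_backlog_limited(struct sock *sk,
					 struct sk_buff *skb,
					 unsigned int limit)
{
	if (sk->sk_backlog.len + skb->truesize > limit) {
		atomic_inc(&sk->sk_drops);
		kfree_skb(skb);
		return -ENOBUFS;
	}
	sk_add_backlog(sk, skb);	/* also adds skb->truesize to len */
	return 0;
}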




Patch

diff --git a/include/net/sock.h b/include/net/sock.h
index 3f1a480..2e003b9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -253,6 +253,7 @@  struct sock {
 	struct {
 		struct sk_buff *head;
 		struct sk_buff *tail;
+		atomic_t len;
 	} sk_backlog;
 	wait_queue_head_t	*sk_sleep;
 	struct dst_entry	*sk_dst_cache;
@@ -583,11 +584,13 @@  static inline void sk_add_backlog(struct sock *sk, struct sk_buff *skb)
 		sk->sk_backlog.tail->next = skb;
 		sk->sk_backlog.tail = skb;
 	}
+	atomic_add(skb->truesize, &sk->sk_backlog.len);
 	skb->next = NULL;
 }
 
 static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
 {
+	atomic_sub(skb->truesize, &sk->sk_backlog.len);
 	return sk->sk_backlog_rcv(sk, skb);
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index e1f6f22..3b988f2 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1138,6 +1138,7 @@  struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 		sock_lock_init(newsk);
 		bh_lock_sock(newsk);
 		newsk->sk_backlog.head	= newsk->sk_backlog.tail = NULL;
+		atomic_set(&newsk->sk_backlog.len, 0);
 
 		atomic_set(&newsk->sk_rmem_alloc, 0);
 		/*
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f0126fd..e019067 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1372,8 +1372,13 @@  int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	bh_lock_sock(sk);
 	if (!sock_owned_by_user(sk))
 		rc = __udp_queue_rcv_skb(sk, skb);
-	else
+	else {
+		if (atomic_read(&sk->sk_backlog.len) >= sk->sk_rcvbuf) {
+			bh_unlock_sock(sk);
+			goto drop;
+		}
 		sk_add_backlog(sk, skb);
+	}
 	bh_unlock_sock(sk);
 
 	return rc;