diff mbox

[net-next-2.6] ip: Report qdisc packet drops

Message ID 4A9EF113.9030102@gmail.com
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Sept. 2, 2009, 10:26 p.m. UTC
Eric Dumazet a écrit :
> David Miller a écrit :
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Mon, 31 Aug 2009 14:09:50 +0200
>>
>>> Re-reading again this stuff, I realized ip6_push_pending_frames()
>>> was not updating IPSTATS_MIB_OUTDISCARDS, even if IP_RECVERR was set.
>>>
>>> May I suggest following path :
>>>
>>> 1) Correct ip6_push_pending_frames() to properly
>>> account for dropped-by-qdisc frames when IP_RECVERR is set
>> Your patch is  applied to net-next-2.6, thanks!
>>
>>> 2) Submit a patch to account for qdisc-dropped frames in SNMP counters
>>> but still return a OK to user application, to not break them ?
>> Sounds good.
>>
>> I think if you sample random UDP applications, you will find that such
>> errors will upset them terribly, make them log tons of crap to
>> /var/log/messages et al., and consume tons of CPU.
>>
>> And in such cases silent ignoring of drops is entirely appropriate and
>> optimal, which supports our current behavior.
>>
>> If we are to make such applications "more sophisticated" such
>> converted apps can be indicated simply their use of IP_RECVERR.
>>
>> If you want to be notified of all asynchronous errors we can detect,
>> you use this, end of story.  It is the only way to handle this
>> situation without breaking the world.
>>
>> As usual, Alexey Kuznetsov's analysis of this situation is timeless,
>> accurate, and wise.  And he understood all of this 10+ years ago.
> 
> Thanks David, here is the 2nd patch then :

Here is an updated version of the patch, after Christoph comments.



[PATCH net-next-2.6] ip: Report qdisc packet drops

Christoph Lameter pointed out that packet drops at qdisc level where not
accounted in SNMP counters. Only if application sets IP_RECVERR, drops
are reported to user (-ENOBUFS errors) and SNMP counters updated.

IP_RECVERR is used to enable extended reliable error message passing,
but these are not needed to update system wide SNMP stats.

This patch changes things a bit to allow SNMP counters to be updated,
regardless of IP_RECVERR being set or not on the socket.

Example after an UDP tx flood
# netstat -s 
...
IP:
    1487048 outgoing packets dropped
...
Udp:
...
    SndbufErrors: 1487048


send() syscalls, do however still return an OK status, to not
break applications.

Note : send() manual page explicitly says for -ENOBUFS error :

 "The output queue for a network interface was full.
  This generally indicates that the interface has stopped sending,
  but may be caused by transient congestion.
  (Normally, this does not occur in Linux. Packets are just silently
  dropped when a device queue overflows.) "

This is not true for IP_RECVERR enabled sockets : a send() syscall
that hit a qdisc drop returns an ENOBUFS error.

Many thanks to Christoph, David, and last but not least, Alexey !

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/ip_output.c  |    2 +-
 net/ipv4/raw.c        |    9 +++++++--
 net/ipv4/udp.c        |   12 +++++++++---
 net/ipv6/ip6_output.c |    2 +-
 net/ipv6/raw.c        |    4 +++-
 net/ipv6/udp.c        |   12 +++++++++---
 6 files changed, 30 insertions(+), 11 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller Sept. 3, 2009, 1:05 a.m. UTC | #1
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 03 Sep 2009 00:26:27 +0200

> Here is an updated version of the patch, after Christoph comments.
> 
> 
> 
> [PATCH net-next-2.6] ip: Report qdisc packet drops

Looks great, applied!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 3, 2009, 2:12 p.m. UTC | #2
Christoph Lameter a écrit :
> On Thu, 3 Sep 2009, Eric Dumazet wrote:
>> index 29ebb0d..ebaaa7f 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -561,12 +561,18 @@ static int udp_push_pending_frames(struct sock *sk)
>>
>>  send:
>>  	err = ip_push_pending_frames(sk);
>> +	if (err) {
>> +		if (err == -ENOBUFS && !inet->recverr) {
>> +			UDP_INC_STATS_USER(sock_net(sk),
>> +					   UDP_MIB_SNDBUFERRORS, is_udplite);
> 
> This means that we now do not increment SNDBUFERRORS if IP_RECVERR is set.
> 
> I think it would be better to move UDP_INC_STATS_USER before the if
> statement. That will also simplify the code further.
> 
> 

I believe you already said this once Christoph on a previous patch, and I 
replied that in case of IP_RECVERR set, udp_push_pending_frames()
returns -ENOBUFS to its caller : udp_sendmsg()

And the caller takes care of :

out:
        ip_rt_put(rt);
        if (free)
                kfree(ipc.opt);
        if (!err)
                return len;
        /*
         * ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space.  Reporting
         * ENOBUFS might not be good (it's not tunable per se), but otherwise
         * we don't have a good statistic (IpOutDiscards but it can be too many
         * things).  We could add another new stat but at least for now that
         * seems like overkill.
         */
        if (err == -ENOBUFS || test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
                UDP_INC_STATS_USER(sock_net(sk),
                                UDP_MIB_SNDBUFERRORS, is_udplite);
        }
        return err;


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter Sept. 3, 2009, 5:57 p.m. UTC | #3
On Thu, 3 Sep 2009, Eric Dumazet wrote:
> index 29ebb0d..ebaaa7f 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -561,12 +561,18 @@ static int udp_push_pending_frames(struct sock *sk)
>
>  send:
>  	err = ip_push_pending_frames(sk);
> +	if (err) {
> +		if (err == -ENOBUFS && !inet->recverr) {
> +			UDP_INC_STATS_USER(sock_net(sk),
> +					   UDP_MIB_SNDBUFERRORS, is_udplite);

This means that we now do not increment SNDBUFERRORS if IP_RECVERR is set.

I think it would be better to move UDP_INC_STATS_USER before the if
statement. That will also simplify the code further.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter Sept. 3, 2009, 6:35 p.m. UTC | #4
On Thu, 3 Sep 2009, Eric Dumazet wrote:

> >> --- a/net/ipv4/udp.c
> >> +++ b/net/ipv4/udp.c
> >> @@ -561,12 +561,18 @@ static int udp_push_pending_frames(struct sock *sk)
> >>
> >>  send:
> >>  	err = ip_push_pending_frames(sk);
> >> +	if (err) {
> >> +		if (err == -ENOBUFS && !inet->recverr) {
> >> +			UDP_INC_STATS_USER(sock_net(sk),
> >> +					   UDP_MIB_SNDBUFERRORS, is_udplite);
> >
> > This means that we now do not increment SNDBUFERRORS if IP_RECVERR is set.
> >
> > I think it would be better to move UDP_INC_STATS_USER before the if
> > statement. That will also simplify the code further.
> >
> >
>
> I believe you already said this once Christoph on a previous patch, and I
> replied that in case of IP_RECVERR set, udp_push_pending_frames()
> returns -ENOBUFS to its caller : udp_sendmsg()

Ahhh. I see. Would it not be better to have a single location where the
SNDBUFERRORS are accounted for? Check for -ENOBUFS before the code in
udp_sendmsg()? Hmmm.. The control flow is a bit complicated there...

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7d08210..afae0cb 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1302,7 +1302,7 @@  int ip_push_pending_frames(struct sock *sk)
 	err = ip_local_out(skb);
 	if (err) {
 		if (err > 0)
-			err = inet->recverr ? net_xmit_errno(err) : 0;
+			err = net_xmit_errno(err);
 		if (err)
 			goto error;
 	}
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 2979f14..ebb1e58 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -375,7 +375,7 @@  static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
 	err = NF_HOOK(PF_INET, NF_INET_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
 		      dst_output);
 	if (err > 0)
-		err = inet->recverr ? net_xmit_errno(err) : 0;
+		err = net_xmit_errno(err);
 	if (err)
 		goto error;
 out:
@@ -386,6 +386,8 @@  error_fault:
 	kfree_skb(skb);
 error:
 	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+	if (err == -ENOBUFS && !inet->recverr)
+		err = 0;
 	return err;
 }
 
@@ -576,8 +578,11 @@  back_from_confirm:
 					&ipc, &rt, msg->msg_flags);
 		if (err)
 			ip_flush_pending_frames(sk);
-		else if (!(msg->msg_flags & MSG_MORE))
+		else if (!(msg->msg_flags & MSG_MORE)) {
 			err = ip_push_pending_frames(sk);
+			if (err == -ENOBUFS && !inet->recverr)
+				err = 0;
+		}
 		release_sock(sk);
 	}
 done:
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 29ebb0d..ebaaa7f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -561,12 +561,18 @@  static int udp_push_pending_frames(struct sock *sk)
 
 send:
 	err = ip_push_pending_frames(sk);
+	if (err) {
+		if (err == -ENOBUFS && !inet->recverr) {
+			UDP_INC_STATS_USER(sock_net(sk),
+					   UDP_MIB_SNDBUFERRORS, is_udplite);
+			err = 0;
+		}
+	} else
+		UDP_INC_STATS_USER(sock_net(sk),
+				   UDP_MIB_OUTDATAGRAMS, is_udplite);
 out:
 	up->len = 0;
 	up->pending = 0;
-	if (!err)
-		UDP_INC_STATS_USER(sock_net(sk),
-				UDP_MIB_OUTDATAGRAMS, is_udplite);
 	return err;
 }
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a931229..cd48801 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1511,7 +1511,7 @@  int ip6_push_pending_frames(struct sock *sk)
 	err = ip6_local_out(skb);
 	if (err) {
 		if (err > 0)
-			err = np->recverr ? net_xmit_errno(err) : 0;
+			err = net_xmit_errno(err);
 		if (err)
 			goto error;
 	}
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 5068410..7d675b8 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -642,7 +642,7 @@  static int rawv6_send_hdrinc(struct sock *sk, void *from, int length,
 	err = NF_HOOK(PF_INET6, NF_INET_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
 		      dst_output);
 	if (err > 0)
-		err = np->recverr ? net_xmit_errno(err) : 0;
+		err = net_xmit_errno(err);
 	if (err)
 		goto error;
 out:
@@ -653,6 +653,8 @@  error_fault:
 	kfree_skb(skb);
 error:
 	IP6_INC_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
+	if (err == -ENOBUFS && !np->recverr)
+		err = 0;
 	return err;
 }
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 20d2ffc..1640406 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -724,12 +724,18 @@  static int udp_v6_push_pending_frames(struct sock *sk)
 
 send:
 	err = ip6_push_pending_frames(sk);
+	if (err) {
+		if (err == -ENOBUFS && !inet6_sk(sk)->recverr) {
+			UDP6_INC_STATS_USER(sock_net(sk),
+					    UDP_MIB_SNDBUFERRORS, is_udplite);
+			err = 0;
+		}
+	} else
+		UDP6_INC_STATS_USER(sock_net(sk),
+				    UDP_MIB_OUTDATAGRAMS, is_udplite);
 out:
 	up->len = 0;
 	up->pending = 0;
-	if (!err)
-		UDP6_INC_STATS_USER(sock_net(sk),
-				UDP_MIB_OUTDATAGRAMS, is_udplite);
 	return err;
 }