diff mbox

[net-next-2.6] ip: Report qdisc packet drops

Message ID 4A9E849A.30105@gmail.com
State Superseded, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Sept. 2, 2009, 2:43 p.m. UTC
David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 31 Aug 2009 14:09:50 +0200
> 
>> Re-reading again this stuff, I realized ip6_push_pending_frames()
>> was not updating IPSTATS_MIB_OUTDISCARDS, even if IP_RECVERR was set.
>>
>> May I suggest following path :
>>
>> 1) Correct ip6_push_pending_frames() to properly
>> account for dropped-by-qdisc frames when IP_RECVERR is set
> 
> Your patch is  applied to net-next-2.6, thanks!
> 
>> 2) Submit a patch to account for qdisc-dropped frames in SNMP counters
>> but still return a OK to user application, to not break them ?
> 
> Sounds good.
> 
> I think if you sample random UDP applications, you will find that such
> errors will upset them terribly, make them log tons of crap to
> /var/log/messages et al., and consume tons of CPU.
> 
> And in such cases silent ignoring of drops is entirely appropriate and
> optimal, which supports our current behavior.
> 
> If we are to make such applications "more sophisticated" such
> converted apps can be indicated simply their use of IP_RECVERR.
> 
> If you want to be notified of all asynchronous errors we can detect,
> you use this, end of story.  It is the only way to handle this
> situation without breaking the world.
> 
> As usual, Alexey Kuznetsov's analysis of this situation is timeless,
> accurate, and wise.  And he understood all of this 10+ years ago.

Thanks David, here is the 2nd patch then :


[PATCH net-next-2.6] ip: Report qdisc packet drops

Christoph Lameter pointed out that packet drops at qdisc level where not
accounted in SNMP counters. Only if application sets IP_RECVERR, drops
are reported to user (-ENOBUFS errors) and SNMP counters updated.

IP_RECVERR is used to enable extended reliable error message passing,
but these are not needed to update system wide SNMP stats.

This patch changes things a bit to allow SNMP counters to be updated,
regardless of IP_RECVERR being set or not on the socket.

Example after an UDP tx flood
# netstat -s 
...
IP:
    1487048 outgoing packets dropped
...
Udp:
...
    SndbufErrors: 1487048


send() syscalls, do however still return an OK status, to not
break applications.

Note : send() manual page explicitly says for -ENOBUFS error :

 "The output queue for a network interface was full.
  This generally indicates that the interface has stopped sending,
  but may be caused by transient congestion.
  (Normally, this does not occur in Linux. Packets are just silently
  dropped when a device queue overflows.) "

This is not true for IP_RECVERR enabled sockets : a send() syscall
that hit a qdisc drop returns an ENOBUFS error.

Many thanks to Christoph, David, and last but not least, Alexey !

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/ip.h      |    2 +-
 include/net/ipv6.h    |    2 +-
 include/net/udp.h     |    2 +-
 net/ipv4/icmp.c       |    2 +-
 net/ipv4/ip_output.c  |   19 ++++++++++---------
 net/ipv4/raw.c        |   14 ++++++++++----
 net/ipv4/udp.c        |   20 +++++++++++++-------
 net/ipv6/icmp.c       |    2 +-
 net/ipv6/ip6_output.c |   18 +++++++++++-------
 net/ipv6/raw.c        |   15 ++++++++++-----
 net/ipv6/udp.c        |   14 ++++++++++----
 11 files changed, 69 insertions(+), 41 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet Sept. 2, 2009, 4:05 p.m. UTC | #1
Christoph Lameter a écrit :
> The patch is smaller if you remove the handling of recverr completely from
> ip_push_pending_frames() and return NET_RX_DROP etc. Two of the callers
> never even inspect the return code. For them this is useless processing.

We must check NET_XMIT_CN before doing the update, but you are probably right.

I'll cook another patch ASAP, thanks !


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sridhar Samudrala Sept. 2, 2009, 4:11 p.m. UTC | #2
On Wed, 2009-09-02 at 16:43 +0200, Eric Dumazet wrote:
> David Miller a écrit :
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Mon, 31 Aug 2009 14:09:50 +0200
> > 
> >> Re-reading again this stuff, I realized ip6_push_pending_frames()
> >> was not updating IPSTATS_MIB_OUTDISCARDS, even if IP_RECVERR was set.
> >>
> >> May I suggest following path :
> >>
> >> 1) Correct ip6_push_pending_frames() to properly
> >> account for dropped-by-qdisc frames when IP_RECVERR is set
> > 
> > Your patch is  applied to net-next-2.6, thanks!
> > 
> >> 2) Submit a patch to account for qdisc-dropped frames in SNMP counters
> >> but still return a OK to user application, to not break them ?
> > 
> > Sounds good.
> > 
> > I think if you sample random UDP applications, you will find that such
> > errors will upset them terribly, make them log tons of crap to
> > /var/log/messages et al., and consume tons of CPU.
> > 
> > And in such cases silent ignoring of drops is entirely appropriate and
> > optimal, which supports our current behavior.
> > 
> > If we are to make such applications "more sophisticated" such
> > converted apps can be indicated simply their use of IP_RECVERR.
> > 
> > If you want to be notified of all asynchronous errors we can detect,
> > you use this, end of story.  It is the only way to handle this
> > situation without breaking the world.
> > 
> > As usual, Alexey Kuznetsov's analysis of this situation is timeless,
> > accurate, and wise.  And he understood all of this 10+ years ago.
> 
> Thanks David, here is the 2nd patch then :
> 
> 
> [PATCH net-next-2.6] ip: Report qdisc packet drops
> 
> Christoph Lameter pointed out that packet drops at qdisc level where not
> accounted in SNMP counters. Only if application sets IP_RECVERR, drops
> are reported to user (-ENOBUFS errors) and SNMP counters updated.
> 
> IP_RECVERR is used to enable extended reliable error message passing,
> but these are not needed to update system wide SNMP stats.
> 
> This patch changes things a bit to allow SNMP counters to be updated,
> regardless of IP_RECVERR being set or not on the socket.
> 
> Example after an UDP tx flood
> # netstat -s 
> ...
> IP:
>     1487048 outgoing packets dropped
> ...
> Udp:
> ...
>     SndbufErrors: 1487048
> 

Didn't we agree that qdisc drops should not be counted as IP or UDP 
drops as David Stevens pointed out?
I would say that even when IP_RECVERR is set, SNMP counters at IP and
UDP should not be counted when a packet is dropped at qdisc level,
but the error can be reported to user.

Now that qdisc stats issue is figured out and they can be accounted
and seen at qdisc level, doesn't it confuse if we count the same drop 
at IP, UDP and qdisc level?

Thanks
Sridhar

> 
> send() syscalls, do however still return an OK status, to not
> break applications.
> 
> Note : send() manual page explicitly says for -ENOBUFS error :
> 
>  "The output queue for a network interface was full.
>   This generally indicates that the interface has stopped sending,
>   but may be caused by transient congestion.
>   (Normally, this does not occur in Linux. Packets are just silently
>   dropped when a device queue overflows.) "
> 
> This is not true for IP_RECVERR enabled sockets : a send() syscall
> that hit a qdisc drop returns an ENOBUFS error.
> 
> Many thanks to Christoph, David, and last but not least, Alexey !
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  include/net/ip.h      |    2 +-
>  include/net/ipv6.h    |    2 +-
>  include/net/udp.h     |    2 +-
>  net/ipv4/icmp.c       |    2 +-
>  net/ipv4/ip_output.c  |   19 ++++++++++---------
>  net/ipv4/raw.c        |   14 ++++++++++----
>  net/ipv4/udp.c        |   20 +++++++++++++-------
>  net/ipv6/icmp.c       |    2 +-
>  net/ipv6/ip6_output.c |   18 +++++++++++-------
>  net/ipv6/raw.c        |   15 ++++++++++-----
>  net/ipv6/udp.c        |   14 ++++++++++----
>  11 files changed, 69 insertions(+), 41 deletions(-)
> 
> diff --git a/include/net/ip.h b/include/net/ip.h
> index 72c3692..9dd19a8 100644
> --- a/include/net/ip.h
> +++ b/include/net/ip.h
> @@ -116,7 +116,7 @@ extern int		ip_append_data(struct sock *sk,
>  extern int		ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);
>  extern ssize_t		ip_append_page(struct sock *sk, struct page *page,
>  				int offset, size_t size, int flags);
> -extern int		ip_push_pending_frames(struct sock *sk);
> +extern int		ip_push_pending_frames(struct sock *sk, int recverr);
>  extern void		ip_flush_pending_frames(struct sock *sk);
> 
>  /* datagram.c */
> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> index ad9a511..f514257 100644
> --- a/include/net/ipv6.h
> +++ b/include/net/ipv6.h
> @@ -498,7 +498,7 @@ extern int			ip6_append_data(struct sock *sk,
>  						struct rt6_info *rt,
>  						unsigned int flags);
> 
> -extern int			ip6_push_pending_frames(struct sock *sk);
> +extern int			ip6_push_pending_frames(struct sock *sk, int recverr);
> 
>  extern void			ip6_flush_pending_frames(struct sock *sk);
> 
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 5fb029f..a60ef10 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -145,7 +145,7 @@ extern int 	udp_lib_getsockopt(struct sock *sk, int level, int optname,
>  			           char __user *optval, int __user *optlen);
>  extern int 	udp_lib_setsockopt(struct sock *sk, int level, int optname,
>  				   char __user *optval, int optlen,
> -				   int (*push_pending_frames)(struct sock *));
> +				   int (*push_pending_frames)(struct sock *, int));
> 
>  extern struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
>  				    __be32 daddr, __be16 dport,
> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> index 97c410e..f46a53c 100644
> --- a/net/ipv4/icmp.c
> +++ b/net/ipv4/icmp.c
> @@ -345,7 +345,7 @@ static void icmp_push_reply(struct icmp_bxm *icmp_param,
>  						 icmp_param->head_len, csum);
>  		icmph->checksum = csum_fold(csum);
>  		skb->ip_summed = CHECKSUM_NONE;
> -		ip_push_pending_frames(sk);
> +		ip_push_pending_frames(sk, 0);
>  	}
>  }
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 7d08210..8f81dab 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1216,7 +1216,7 @@ static void ip_cork_release(struct inet_sock *inet)
>   *	Combined all pending IP fragments on the socket as one IP datagram
>   *	and push them out.
>   */
> -int ip_push_pending_frames(struct sock *sk)
> +int ip_push_pending_frames(struct sock *sk, int recverr)
>  {
>  	struct sk_buff *skb, *tmp_skb;
>  	struct sk_buff **tail_skb;
> @@ -1301,19 +1301,20 @@ int ip_push_pending_frames(struct sock *sk)
>  	/* Netfilter gets whole the not fragmented skb. */
>  	err = ip_local_out(skb);
>  	if (err) {
> -		if (err > 0)
> -			err = inet->recverr ? net_xmit_errno(err) : 0;
> +		if (err > 0) {
> +			err = net_xmit_errno(err);
> +			if (err && !recverr) {
> +				IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
> +				err = 0;
> +			}
> +		}
>  		if (err)
> -			goto error;
> +			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
>  	}
> 
>  out:
>  	ip_cork_release(inet);
>  	return err;
> -
> -error:
> -	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
> -	goto out;
>  }
> 
>  /*
> @@ -1412,7 +1413,7 @@ void ip_send_reply(struct sock *sk, struct sk_buff *skb, struct ip_reply_arg *ar
>  			  arg->csumoffset) = csum_fold(csum_add(skb->csum,
>  								arg->csum));
>  		skb->ip_summed = CHECKSUM_NONE;
> -		ip_push_pending_frames(sk);
> +		ip_push_pending_frames(sk, 0);
>  	}
> 
>  	bh_unlock_sock(sk);
> diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
> index 2979f14..444c465 100644
> --- a/net/ipv4/raw.c
> +++ b/net/ipv4/raw.c
> @@ -374,8 +374,13 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
> 
>  	err = NF_HOOK(PF_INET, NF_INET_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
>  		      dst_output);
> -	if (err > 0)
> -		err = inet->recverr ? net_xmit_errno(err) : 0;
> +	if (err > 0) {
> +		err = net_xmit_errno(err);
> +		if (!inet->recverr && err) {
> +			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
> +			err = 0;
> +		}
> +	}
>  	if (err)
>  		goto error;
>  out:
> @@ -576,8 +581,9 @@ back_from_confirm:
>  					&ipc, &rt, msg->msg_flags);
>  		if (err)
>  			ip_flush_pending_frames(sk);
> -		else if (!(msg->msg_flags & MSG_MORE))
> -			err = ip_push_pending_frames(sk);
> +		else if (!(msg->msg_flags & MSG_MORE)) {
> +			err = ip_push_pending_frames(sk, inet->recverr);
> +		}
>  		release_sock(sk);
>  	}
>  done:
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 29ebb0d..6a6bf1d 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -513,7 +513,7 @@ static void udp4_hwcsum_outgoing(struct sock *sk, struct sk_buff *skb,
>  /*
>   * Push out all pending data as one UDP datagram. Socket is locked.
>   */
> -static int udp_push_pending_frames(struct sock *sk)
> +static int udp_push_pending_frames(struct sock *sk, int recverr)
>  {
>  	struct udp_sock  *up = udp_sk(sk);
>  	struct inet_sock *inet = inet_sk(sk);
> @@ -560,7 +560,7 @@ static int udp_push_pending_frames(struct sock *sk)
>  		uh->check = CSUM_MANGLED_0;
> 
>  send:
> -	err = ip_push_pending_frames(sk);
> +	err = ip_push_pending_frames(sk, recverr);
>  out:
>  	up->len = 0;
>  	up->pending = 0;
> @@ -752,8 +752,14 @@ do_append_data:
>  			corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
>  	if (err)
>  		udp_flush_pending_frames(sk);
> -	else if (!corkreq)
> -		err = udp_push_pending_frames(sk);
> +	else if (!corkreq) {
> +		err = udp_push_pending_frames(sk, 1);
> +		if (err == -ENOBUFS && !inet->recverr) {
> +			UDP_INC_STATS_USER(sock_net(sk),
> +					   UDP_MIB_SNDBUFERRORS, is_udplite);
> +			err = 0;
> +		}
> +	}
>  	else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
>  		up->pending = 0;
>  	release_sock(sk);
> @@ -826,7 +832,7 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
> 
>  	up->len += size;
>  	if (!(up->corkflag || (flags&MSG_MORE)))
> -		ret = udp_push_pending_frames(sk);
> +		ret = udp_push_pending_frames(sk, inet_sk(sk)->recverr);
>  	if (!ret)
>  		ret = size;
>  out:
> @@ -1354,7 +1360,7 @@ void udp_destroy_sock(struct sock *sk)
>   */
>  int udp_lib_setsockopt(struct sock *sk, int level, int optname,
>  		       char __user *optval, int optlen,
> -		       int (*push_pending_frames)(struct sock *))
> +		       int (*push_pending_frames)(struct sock *, int))
>  {
>  	struct udp_sock *up = udp_sk(sk);
>  	int val;
> @@ -1374,7 +1380,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
>  		} else {
>  			up->corkflag = 0;
>  			lock_sock(sk);
> -			(*push_pending_frames)(sk);
> +			(*push_pending_frames)(sk, 0);
>  			release_sock(sk);
>  		}
>  		break;
> diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
> index e2325f6..a9c54c2 100644
> --- a/net/ipv6/icmp.c
> +++ b/net/ipv6/icmp.c
> @@ -253,7 +253,7 @@ static int icmpv6_push_pending_frames(struct sock *sk, struct flowi *fl, struct
>  						      len, fl->proto,
>  						      tmp_csum);
>  	}
> -	ip6_push_pending_frames(sk);
> +	ip6_push_pending_frames(sk, 0);
>  out:
>  	return err;
>  }
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index a931229..ade5707 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -1440,7 +1440,7 @@ static void ip6_cork_release(struct inet_sock *inet, struct ipv6_pinfo *np)
>  	memset(&inet->cork.fl, 0, sizeof(inet->cork.fl));
>  }
> 
> -int ip6_push_pending_frames(struct sock *sk)
> +int ip6_push_pending_frames(struct sock *sk, int recverr)
>  {
>  	struct sk_buff *skb, *tmp_skb;
>  	struct sk_buff **tail_skb;
> @@ -1510,18 +1510,22 @@ int ip6_push_pending_frames(struct sock *sk)
> 
>  	err = ip6_local_out(skb);
>  	if (err) {
> -		if (err > 0)
> -			err = np->recverr ? net_xmit_errno(err) : 0;
> +		if (err > 0) {
> +			err = net_xmit_errno(err);
> +			if (err && !recverr) {
> +				IP6_INC_STATS(net, rt->rt6i_idev,
> +					      IPSTATS_MIB_OUTDISCARDS);
> +				err = 0;
> +			}
> +		}
>  		if (err)
> -			goto error;
> +			IP6_INC_STATS(net, rt->rt6i_idev,
> +				      IPSTATS_MIB_OUTDISCARDS);
>  	}
> 
>  out:
>  	ip6_cork_release(inet, np);
>  	return err;
> -error:
> -	IP6_INC_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
> -	goto out;
>  }
> 
>  void ip6_flush_pending_frames(struct sock *sk)
> diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
> index 5068410..d054fa2 100644
> --- a/net/ipv6/raw.c
> +++ b/net/ipv6/raw.c
> @@ -523,7 +523,7 @@ csum_copy_err:
>  }
> 
>  static int rawv6_push_pending_frames(struct sock *sk, struct flowi *fl,
> -				     struct raw6_sock *rp)
> +				     struct raw6_sock *rp, int recverr)
>  {
>  	struct sk_buff *skb;
>  	int err = 0;
> @@ -595,7 +595,7 @@ static int rawv6_push_pending_frames(struct sock *sk, struct flowi *fl,
>  		BUG();
> 
>  send:
> -	err = ip6_push_pending_frames(sk);
> +	err = ip6_push_pending_frames(sk, recverr);
>  out:
>  	return err;
>  }
> @@ -641,8 +641,13 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, int length,
>  	IP6_UPD_PO_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUT, skb->len);
>  	err = NF_HOOK(PF_INET6, NF_INET_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
>  		      dst_output);
> -	if (err > 0)
> -		err = np->recverr ? net_xmit_errno(err) : 0;
> +	if (err > 0) {
> +		err = net_xmit_errno(err);
> +		if (!np->recverr && err) {
> +			IP6_INC_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
> +			err = 0;
> +		}
> +	}
>  	if (err)
>  		goto error;
>  out:
> @@ -895,7 +900,7 @@ back_from_confirm:
>  		if (err)
>  			ip6_flush_pending_frames(sk);
>  		else if (!(msg->msg_flags & MSG_MORE))
> -			err = rawv6_push_pending_frames(sk, &fl, rp);
> +			err = rawv6_push_pending_frames(sk, &fl, rp, np->recverr);
>  		release_sock(sk);
>  	}
>  done:
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 20d2ffc..963dd0a 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -683,7 +683,7 @@ static void udp6_hwcsum_outgoing(struct sock *sk, struct sk_buff *skb,
>   *	Sending
>   */
> 
> -static int udp_v6_push_pending_frames(struct sock *sk)
> +static int udp_v6_push_pending_frames(struct sock *sk, int recverr)
>  {
>  	struct sk_buff *skb;
>  	struct udphdr *uh;
> @@ -723,7 +723,7 @@ static int udp_v6_push_pending_frames(struct sock *sk)
>  		uh->check = CSUM_MANGLED_0;
> 
>  send:
> -	err = ip6_push_pending_frames(sk);
> +	err = ip6_push_pending_frames(sk, recverr);
>  out:
>  	up->len = 0;
>  	up->pending = 0;
> @@ -975,8 +975,14 @@ do_append_data:
>  		corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
>  	if (err)
>  		udp_v6_flush_pending_frames(sk);
> -	else if (!corkreq)
> -		err = udp_v6_push_pending_frames(sk);
> +	else if (!corkreq) {
> +		err = udp_v6_push_pending_frames(sk, 1);
> +		if (err == -ENOBUFS && !np->recverr) {
> +			UDP6_INC_STATS_USER(sock_net(sk),
> +					   UDP_MIB_SNDBUFERRORS, is_udplite);
> +			err = 0;
> +		}
> +	}
>  	else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
>  		up->pending = 0;
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 2, 2009, 4:20 p.m. UTC | #3
Sridhar Samudrala a écrit :
> On Wed, 2009-09-02 at 16:43 +0200, Eric Dumazet wrote:
>> David Miller a écrit :
>>> From: Eric Dumazet <eric.dumazet@gmail.com>
>>> Date: Mon, 31 Aug 2009 14:09:50 +0200
>>>
>>>> Re-reading again this stuff, I realized ip6_push_pending_frames()
>>>> was not updating IPSTATS_MIB_OUTDISCARDS, even if IP_RECVERR was set.
>>>>
>>>> May I suggest following path :
>>>>
>>>> 1) Correct ip6_push_pending_frames() to properly
>>>> account for dropped-by-qdisc frames when IP_RECVERR is set
>>> Your patch is  applied to net-next-2.6, thanks!
>>>
>>>> 2) Submit a patch to account for qdisc-dropped frames in SNMP counters
>>>> but still return a OK to user application, to not break them ?
>>> Sounds good.
>>>
>>> I think if you sample random UDP applications, you will find that such
>>> errors will upset them terribly, make them log tons of crap to
>>> /var/log/messages et al., and consume tons of CPU.
>>>
>>> And in such cases silent ignoring of drops is entirely appropriate and
>>> optimal, which supports our current behavior.
>>>
>>> If we are to make such applications "more sophisticated" such
>>> converted apps can be indicated simply their use of IP_RECVERR.
>>>
>>> If you want to be notified of all asynchronous errors we can detect,
>>> you use this, end of story.  It is the only way to handle this
>>> situation without breaking the world.
>>>
>>> As usual, Alexey Kuznetsov's analysis of this situation is timeless,
>>> accurate, and wise.  And he understood all of this 10+ years ago.
>> Thanks David, here is the 2nd patch then :
>>
>>
>> [PATCH net-next-2.6] ip: Report qdisc packet drops
>>
>> Christoph Lameter pointed out that packet drops at qdisc level where not
>> accounted in SNMP counters. Only if application sets IP_RECVERR, drops
>> are reported to user (-ENOBUFS errors) and SNMP counters updated.
>>
>> IP_RECVERR is used to enable extended reliable error message passing,
>> but these are not needed to update system wide SNMP stats.
>>
>> This patch changes things a bit to allow SNMP counters to be updated,
>> regardless of IP_RECVERR being set or not on the socket.
>>
>> Example after an UDP tx flood
>> # netstat -s 
>> ...
>> IP:
>>     1487048 outgoing packets dropped
>> ...
>> Udp:
>> ...
>>     SndbufErrors: 1487048
>>
> 
> Didn't we agree that qdisc drops should not be counted as IP or UDP 
> drops as David Stevens pointed out?
> I would say that even when IP_RECVERR is set, SNMP counters at IP and
> UDP should not be counted when a packet is dropped at qdisc level,
> but the error can be reported to user.
> 
> Now that qdisc stats issue is figured out and they can be accounted
> and seen at qdisc level, doesn't it confuse if we count the same drop 
> at IP, UDP and qdisc level?
> 
> Thanks
> Sridhar
>

Yes, I am aware of David point, but its already not true with current kernel.

Current kernels and an UDP frame sent by application :

if IP_RECVERR not set, no SNMP error logged, IP or UDP level

if IP_RECVERR is set, qdisc drops are reported both to IP and UDP
SNMP counters.



udp_sendmsg()
{
...
out:
        ip_rt_put(rt);
        if (free)
                kfree(ipc.opt);
        if (!err)
                return len;
        /*
         * ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space.  Reporting
         * ENOBUFS might not be good (it's not tunable per se), but otherwise
         * we don't have a good statistic (IpOutDiscards but it can be too many
         * things).  We could add another new stat but at least for now that
         * seems like overkill.
         */
        if (err == -ENOBUFS || test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
                UDP_INC_STATS_USER(sock_net(sk),
                                UDP_MIB_SNDBUFERRORS, is_udplite);
        }
        return err;
...
}


So what shall we do ?

IMHO, one should not add MIB counters for different domains (IP / UDP), this
makes no sense.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter Sept. 2, 2009, 7:37 p.m. UTC | #4
The patch is smaller if you remove the handling of recverr completely from
ip_push_pending_frames() and return NET_RX_DROP etc. Two of the callers
never even inspect the return code. For them this is useless processing.

The others could handle the processing of recverr on their own. Doing so
voids adding code to ip_push_pending_frames() which is latency critical
and also avoids changing the calling conventions.

I have a draft here from our earlier disucssions but its not as
comprehensive as yours.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/net/ip.h b/include/net/ip.h
index 72c3692..9dd19a8 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -116,7 +116,7 @@  extern int		ip_append_data(struct sock *sk,
 extern int		ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);
 extern ssize_t		ip_append_page(struct sock *sk, struct page *page,
 				int offset, size_t size, int flags);
-extern int		ip_push_pending_frames(struct sock *sk);
+extern int		ip_push_pending_frames(struct sock *sk, int recverr);
 extern void		ip_flush_pending_frames(struct sock *sk);
 
 /* datagram.c */
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index ad9a511..f514257 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -498,7 +498,7 @@  extern int			ip6_append_data(struct sock *sk,
 						struct rt6_info *rt,
 						unsigned int flags);
 
-extern int			ip6_push_pending_frames(struct sock *sk);
+extern int			ip6_push_pending_frames(struct sock *sk, int recverr);
 
 extern void			ip6_flush_pending_frames(struct sock *sk);
 
diff --git a/include/net/udp.h b/include/net/udp.h
index 5fb029f..a60ef10 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -145,7 +145,7 @@  extern int 	udp_lib_getsockopt(struct sock *sk, int level, int optname,
 			           char __user *optval, int __user *optlen);
 extern int 	udp_lib_setsockopt(struct sock *sk, int level, int optname,
 				   char __user *optval, int optlen,
-				   int (*push_pending_frames)(struct sock *));
+				   int (*push_pending_frames)(struct sock *, int));
 
 extern struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
 				    __be32 daddr, __be16 dport,
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 97c410e..f46a53c 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -345,7 +345,7 @@  static void icmp_push_reply(struct icmp_bxm *icmp_param,
 						 icmp_param->head_len, csum);
 		icmph->checksum = csum_fold(csum);
 		skb->ip_summed = CHECKSUM_NONE;
-		ip_push_pending_frames(sk);
+		ip_push_pending_frames(sk, 0);
 	}
 }
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7d08210..8f81dab 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1216,7 +1216,7 @@  static void ip_cork_release(struct inet_sock *inet)
  *	Combined all pending IP fragments on the socket as one IP datagram
  *	and push them out.
  */
-int ip_push_pending_frames(struct sock *sk)
+int ip_push_pending_frames(struct sock *sk, int recverr)
 {
 	struct sk_buff *skb, *tmp_skb;
 	struct sk_buff **tail_skb;
@@ -1301,19 +1301,20 @@  int ip_push_pending_frames(struct sock *sk)
 	/* Netfilter gets whole the not fragmented skb. */
 	err = ip_local_out(skb);
 	if (err) {
-		if (err > 0)
-			err = inet->recverr ? net_xmit_errno(err) : 0;
+		if (err > 0) {
+			err = net_xmit_errno(err);
+			if (err && !recverr) {
+				IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+				err = 0;
+			}
+		}
 		if (err)
-			goto error;
+			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
 	}
 
 out:
 	ip_cork_release(inet);
 	return err;
-
-error:
-	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
-	goto out;
 }
 
 /*
@@ -1412,7 +1413,7 @@  void ip_send_reply(struct sock *sk, struct sk_buff *skb, struct ip_reply_arg *ar
 			  arg->csumoffset) = csum_fold(csum_add(skb->csum,
 								arg->csum));
 		skb->ip_summed = CHECKSUM_NONE;
-		ip_push_pending_frames(sk);
+		ip_push_pending_frames(sk, 0);
 	}
 
 	bh_unlock_sock(sk);
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 2979f14..444c465 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -374,8 +374,13 @@  static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
 
 	err = NF_HOOK(PF_INET, NF_INET_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
 		      dst_output);
-	if (err > 0)
-		err = inet->recverr ? net_xmit_errno(err) : 0;
+	if (err > 0) {
+		err = net_xmit_errno(err);
+		if (!inet->recverr && err) {
+			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+			err = 0;
+		}
+	}
 	if (err)
 		goto error;
 out:
@@ -576,8 +581,9 @@  back_from_confirm:
 					&ipc, &rt, msg->msg_flags);
 		if (err)
 			ip_flush_pending_frames(sk);
-		else if (!(msg->msg_flags & MSG_MORE))
-			err = ip_push_pending_frames(sk);
+		else if (!(msg->msg_flags & MSG_MORE)) {
+			err = ip_push_pending_frames(sk, inet->recverr);
+		}
 		release_sock(sk);
 	}
 done:
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 29ebb0d..6a6bf1d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -513,7 +513,7 @@  static void udp4_hwcsum_outgoing(struct sock *sk, struct sk_buff *skb,
 /*
  * Push out all pending data as one UDP datagram. Socket is locked.
  */
-static int udp_push_pending_frames(struct sock *sk)
+static int udp_push_pending_frames(struct sock *sk, int recverr)
 {
 	struct udp_sock  *up = udp_sk(sk);
 	struct inet_sock *inet = inet_sk(sk);
@@ -560,7 +560,7 @@  static int udp_push_pending_frames(struct sock *sk)
 		uh->check = CSUM_MANGLED_0;
 
 send:
-	err = ip_push_pending_frames(sk);
+	err = ip_push_pending_frames(sk, recverr);
 out:
 	up->len = 0;
 	up->pending = 0;
@@ -752,8 +752,14 @@  do_append_data:
 			corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
 	if (err)
 		udp_flush_pending_frames(sk);
-	else if (!corkreq)
-		err = udp_push_pending_frames(sk);
+	else if (!corkreq) {
+		err = udp_push_pending_frames(sk, 1);
+		if (err == -ENOBUFS && !inet->recverr) {
+			UDP_INC_STATS_USER(sock_net(sk),
+					   UDP_MIB_SNDBUFERRORS, is_udplite);
+			err = 0;
+		}
+	}
 	else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
 		up->pending = 0;
 	release_sock(sk);
@@ -826,7 +832,7 @@  int udp_sendpage(struct sock *sk, struct page *page, int offset,
 
 	up->len += size;
 	if (!(up->corkflag || (flags&MSG_MORE)))
-		ret = udp_push_pending_frames(sk);
+		ret = udp_push_pending_frames(sk, inet_sk(sk)->recverr);
 	if (!ret)
 		ret = size;
 out:
@@ -1354,7 +1360,7 @@  void udp_destroy_sock(struct sock *sk)
  */
 int udp_lib_setsockopt(struct sock *sk, int level, int optname,
 		       char __user *optval, int optlen,
-		       int (*push_pending_frames)(struct sock *))
+		       int (*push_pending_frames)(struct sock *, int))
 {
 	struct udp_sock *up = udp_sk(sk);
 	int val;
@@ -1374,7 +1380,7 @@  int udp_lib_setsockopt(struct sock *sk, int level, int optname,
 		} else {
 			up->corkflag = 0;
 			lock_sock(sk);
-			(*push_pending_frames)(sk);
+			(*push_pending_frames)(sk, 0);
 			release_sock(sk);
 		}
 		break;
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index e2325f6..a9c54c2 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -253,7 +253,7 @@  static int icmpv6_push_pending_frames(struct sock *sk, struct flowi *fl, struct
 						      len, fl->proto,
 						      tmp_csum);
 	}
-	ip6_push_pending_frames(sk);
+	ip6_push_pending_frames(sk, 0);
 out:
 	return err;
 }
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a931229..ade5707 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1440,7 +1440,7 @@  static void ip6_cork_release(struct inet_sock *inet, struct ipv6_pinfo *np)
 	memset(&inet->cork.fl, 0, sizeof(inet->cork.fl));
 }
 
-int ip6_push_pending_frames(struct sock *sk)
+int ip6_push_pending_frames(struct sock *sk, int recverr)
 {
 	struct sk_buff *skb, *tmp_skb;
 	struct sk_buff **tail_skb;
@@ -1510,18 +1510,22 @@  int ip6_push_pending_frames(struct sock *sk)
 
 	err = ip6_local_out(skb);
 	if (err) {
-		if (err > 0)
-			err = np->recverr ? net_xmit_errno(err) : 0;
+		if (err > 0) {
+			err = net_xmit_errno(err);
+			if (err && !recverr) {
+				IP6_INC_STATS(net, rt->rt6i_idev,
+					      IPSTATS_MIB_OUTDISCARDS);
+				err = 0;
+			}
+		}
 		if (err)
-			goto error;
+			IP6_INC_STATS(net, rt->rt6i_idev,
+				      IPSTATS_MIB_OUTDISCARDS);
 	}
 
 out:
 	ip6_cork_release(inet, np);
 	return err;
-error:
-	IP6_INC_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
-	goto out;
 }
 
 void ip6_flush_pending_frames(struct sock *sk)
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 5068410..d054fa2 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -523,7 +523,7 @@  csum_copy_err:
 }
 
 static int rawv6_push_pending_frames(struct sock *sk, struct flowi *fl,
-				     struct raw6_sock *rp)
+				     struct raw6_sock *rp, int recverr)
 {
 	struct sk_buff *skb;
 	int err = 0;
@@ -595,7 +595,7 @@  static int rawv6_push_pending_frames(struct sock *sk, struct flowi *fl,
 		BUG();
 
 send:
-	err = ip6_push_pending_frames(sk);
+	err = ip6_push_pending_frames(sk, recverr);
 out:
 	return err;
 }
@@ -641,8 +641,13 @@  static int rawv6_send_hdrinc(struct sock *sk, void *from, int length,
 	IP6_UPD_PO_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUT, skb->len);
 	err = NF_HOOK(PF_INET6, NF_INET_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
 		      dst_output);
-	if (err > 0)
-		err = np->recverr ? net_xmit_errno(err) : 0;
+	if (err > 0) {
+		err = net_xmit_errno(err);
+		if (!np->recverr && err) {
+			IP6_INC_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
+			err = 0;
+		}
+	}
 	if (err)
 		goto error;
 out:
@@ -895,7 +900,7 @@  back_from_confirm:
 		if (err)
 			ip6_flush_pending_frames(sk);
 		else if (!(msg->msg_flags & MSG_MORE))
-			err = rawv6_push_pending_frames(sk, &fl, rp);
+			err = rawv6_push_pending_frames(sk, &fl, rp, np->recverr);
 		release_sock(sk);
 	}
 done:
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 20d2ffc..963dd0a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -683,7 +683,7 @@  static void udp6_hwcsum_outgoing(struct sock *sk, struct sk_buff *skb,
  *	Sending
  */
 
-static int udp_v6_push_pending_frames(struct sock *sk)
+static int udp_v6_push_pending_frames(struct sock *sk, int recverr)
 {
 	struct sk_buff *skb;
 	struct udphdr *uh;
@@ -723,7 +723,7 @@  static int udp_v6_push_pending_frames(struct sock *sk)
 		uh->check = CSUM_MANGLED_0;
 
 send:
-	err = ip6_push_pending_frames(sk);
+	err = ip6_push_pending_frames(sk, recverr);
 out:
 	up->len = 0;
 	up->pending = 0;
@@ -975,8 +975,14 @@  do_append_data:
 		corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
 	if (err)
 		udp_v6_flush_pending_frames(sk);
-	else if (!corkreq)
-		err = udp_v6_push_pending_frames(sk);
+	else if (!corkreq) {
+		err = udp_v6_push_pending_frames(sk, 1);
+		if (err == -ENOBUFS && !np->recverr) {
+			UDP6_INC_STATS_USER(sock_net(sk),
+					   UDP_MIB_SNDBUFERRORS, is_udplite);
+			err = 0;
+		}
+	}
 	else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
 		up->pending = 0;