diff mbox

UDP multicast packet loss not reported if TX ring overrun?

Message ID 4A930DEF.5000008@gmail.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Aug. 24, 2009, 10:02 p.m. UTC
Christoph Lameter a écrit :
> On Mon, 17 Aug 2009, Sridhar Samudrala wrote:
> 
>> So it is possible that there is some other place in the stack where the packets
>> are gettting dropped but not counted.
> 
> Such a deed occurs in ip_push_pending_frames():
> 
>         /* Netfilter gets whole the not fragmented skb. */
>         err = ip_local_out(skb);
>         if (err) {
>                 if (err > 0)
>                         err = inet->recverr ? net_xmit_errno(err) : 0;
> 			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                 if (err)
>                         goto error;
>         }
> 
> out:
>         ip_cork_release(inet);
>         return err;
> 
> error:
>         IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
>         goto out;
> 
> 
> So if ip_local_out returns NET_XMIT_DROP then its simply going to be
> replaced by 0. Then we check err again and there is no error!!!!
> 
> The statistics are only generated if IP_RECVERR is set.
> 
> Could we move the increment of IPSTATS_MIB_OUTDISCARDS up so that it
> is incremented regardless of the setting of IP_RECVERR?
> 
> F.e?
> 
> 
> Subject: Report TX drops
> 
> Incrementing of TX drop counters currently does not work if errors from the
> network stack are suppressed (IP_RECVERR off). Increment the statistics
> independently of the setting of IP_RECVERR.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> ---
>  net/ipv4/ip_output.c |   19 ++++++++++---------
>  1 file changed, 10 insertions(+), 9 deletions(-)
> 
> Index: linux-2.6/net/ipv4/ip_output.c
> ===================================================================
> --- linux-2.6.orig/net/ipv4/ip_output.c	2009-08-24 17:04:27.000000000 +0000
> +++ linux-2.6/net/ipv4/ip_output.c	2009-08-24 17:32:05.000000000 +0000
> @@ -1300,20 +1300,21 @@ int ip_push_pending_frames(struct sock *
> 
>  	/* Netfilter gets whole the not fragmented skb. */
>  	err = ip_local_out(skb);
> -	if (err) {
> -		if (err > 0)
> -			err = inet->recverr ? net_xmit_errno(err) : 0;
> -		if (err)
> -			goto error;
> +	if (err > 0) {
> +		/* The packet was dropped by the network subsystem */
> +		IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
> +
> +		/*
> +		 * Errors are not passed on if the socket
> +		 * does not process errors (see IP_RECVERR).
> +		 * net_xmit_errno filters NET_XMIT_CN.
> +		 */
> +		err = inet->recverr ? net_xmit_errno(err) : 0;
>  	}
> 
>  out:
>  	ip_cork_release(inet);
>  	return err;
> -
> -error:
> -	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
> -	goto out;
>  }
> 
>  /*
> 
> 
> 
> 

NET_XMIT_CN strikes again :)

Well, if ip_local_out() returns a negative error (say -EPERM for example),
 your patch disables OUTDISCARDS increments.

Maybe a simpler patch like this one ?

[PATCH] net: correctly updates OUTDISCARDS in ip_push_pending_frames()

ip_push_pending_frames() can fail to send a frame because of a congestioned
device. In this case, we increment SNMP OUTDISCARDS only if user set
IP_RECVERR, which is not RFC conformant.

Only case where we should not update OUTDISCARDS is when
ip_local_output() return value is NET_XMIT_CN (meaning
skb was xmitted but future frames might be dropped)

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Sridhar Samudrala Aug. 24, 2009, 10:36 p.m. UTC | #1
On Tue, 2009-08-25 at 00:02 +0200, Eric Dumazet wrote:
> Christoph Lameter a écrit :
> > On Mon, 17 Aug 2009, Sridhar Samudrala wrote:
> > 
> >> So it is possible that there is some other place in the stack where the packets
> >> are gettting dropped but not counted.
> > 
> > Such a deed occurs in ip_push_pending_frames():
> > 
> >         /* Netfilter gets whole the not fragmented skb. */
> >         err = ip_local_out(skb);
> >         if (err) {
> >                 if (err > 0)
> >                         err = inet->recverr ? net_xmit_errno(err) : 0;
> > 			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >                 if (err)
> >                         goto error;
> >         }
> > 
> > out:
> >         ip_cork_release(inet);
> >         return err;
> > 
> > error:
> >         IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
> >         goto out;
> > 
> > 
> > So if ip_local_out returns NET_XMIT_DROP then its simply going to be
> > replaced by 0. Then we check err again and there is no error!!!!

Christoph,

So are you hitting this case with your workload and does this account for all the
packet losses you are seeing? 

If we are dropping the packet and returing NET_XMIT_DROP, should
we also increment qdisc drop stats (sch->qstats.drops)?

In dev_queue_xmit(), we have

        if (q->enqueue) {
                spinlock_t *root_lock = qdisc_lock(q);

                spin_lock(root_lock);

                if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
                        kfree_skb(skb);
                        rc = NET_XMIT_DROP;
                } else {
                        rc = qdisc_enqueue_root(skb, q);
                        qdisc_run(q);
                }
                spin_unlock(root_lock);

                goto out;
        }

Here, if QDISC_STATE_DEACTIVATED is true, the skb is dropped and NET_XMIT_DROP
is returned, but not accounted in qdisc drop stats.
However it is incremented when NET_XMIT_DROP is returned via qdisc_drop().

If we count these drops as qdisc drops, should we also count them as IP OUTDISCARDS?

Thanks
Sridhar

> > 
> NET_XMIT_CN strikes again :)
> 
> Well, if ip_local_out() returns a negative error (say -EPERM for example),
>  your patch disables OUTDISCARDS increments.
> 
> Maybe a simpler patch like this one ?
> 
> [PATCH] net: correctly updates OUTDISCARDS in ip_push_pending_frames()
> 
> ip_push_pending_frames() can fail to send a frame because of a congestioned
> device. In this case, we increment SNMP OUTDISCARDS only if user set
> IP_RECVERR, which is not RFC conformant.
> 
> Only case where we should not update OUTDISCARDS is when
> ip_local_output() return value is NET_XMIT_CN (meaning
> skb was xmitted but future frames might be dropped)
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 7d08210..27a5b79 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1301,19 +1301,15 @@ int ip_push_pending_frames(struct sock *sk)
>  	/* Netfilter gets whole the not fragmented skb. */
>  	err = ip_local_out(skb);
>  	if (err) {
> +		if (err != NET_XMIT_CN)
> +			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
>  		if (err > 0)
>  			err = inet->recverr ? net_xmit_errno(err) : 0;
> -		if (err)
> -			goto error;
>  	}
> 
>  out:
>  	ip_cork_release(inet);
>  	return err;
> -
> -error:
> -	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
> -	goto out;
>  }
> 
>  /*


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter Aug. 25, 2009, 1:46 p.m. UTC | #2
On Tue, 25 Aug 2009, Eric Dumazet wrote:

> NET_XMIT_CN strikes again :)
>
> Well, if ip_local_out() returns a negative error (say -EPERM for example),
>  your patch disables OUTDISCARDS increments.
>
> Maybe a simpler patch like this one ?

Yes looks good. Can we get this in soon?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Aug. 25, 2009, 1:48 p.m. UTC | #3
Christoph Lameter a écrit :
> On Tue, 25 Aug 2009, Eric Dumazet wrote:
> 
>> NET_XMIT_CN strikes again :)
>>
>> Well, if ip_local_out() returns a negative error (say -EPERM for example),
>>  your patch disables OUTDISCARDS increments.
>>
>> Maybe a simpler patch like this one ?
> 
> Yes looks good. Can we get this in soon?
> 

Please hold on, I would like to fully understand what's happening,
and test the patch :)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter Aug. 25, 2009, 1:48 p.m. UTC | #4
On Mon, 24 Aug 2009, Sridhar Samudrala wrote:

> So are you hitting this case with your workload and does this account for all the
> packet losses you are seeing?

Yes.

> If we are dropping the packet and returing NET_XMIT_DROP, should
> we also increment qdisc drop stats (sch->qstats.drops)?

I think so but I am no expert. I was surprised to not even see counter
increments at that level. But I was content to fix up the higher level
tracking to have at least one counter that showed the packet loss.

> If we count these drops as qdisc drops, should we also count them as IP OUTDISCARDS?

Yes.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter Aug. 25, 2009, 2 p.m. UTC | #5
On Tue, 25 Aug 2009, Eric Dumazet wrote:

> Please hold on, I would like to fully understand what's happening,
> and test the patch :)

Ok. It would be good if the drops would also be somehow noted by the UDP
subsystem (one should see something with netstat -su) and may be even the
socket. I see a drops column in /proc/net/udp. rx_drops, tx_drops?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Aug. 25, 2009, 3:32 p.m. UTC | #6
Christoph Lameter a écrit :
> On Tue, 25 Aug 2009, Eric Dumazet wrote:
> 
>> Please hold on, I would like to fully understand what's happening,
>> and test the patch :)
> 
> Ok. It would be good if the drops would also be somehow noted by the UDP
> subsystem (one should see something with netstat -su) and may be even the
> socket. I see a drops column in /proc/net/udp. rx_drops, tx_drops?

This /proc/net/udp column is for rx_drops currently and was recently added...
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter Aug. 25, 2009, 3:35 p.m. UTC | #7
On Tue, 25 Aug 2009, Eric Dumazet wrote:

> Christoph Lameter a ?crit :
> > On Tue, 25 Aug 2009, Eric Dumazet wrote:
> >
> >> Please hold on, I would like to fully understand what's happening,
> >> and test the patch :)
> >
> > Ok. It would be good if the drops would also be somehow noted by the UDP
> > subsystem (one should see something with netstat -su) and may be even the
> > socket. I see a drops column in /proc/net/udp. rx_drops, tx_drops?
>
> This /proc/net/udp column is for rx_drops currently and was recently added...

So lets rename it to rx_drops and then add tx_drops?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Aug. 25, 2009, 3:58 p.m. UTC | #8
Christoph Lameter a écrit :
> On Tue, 25 Aug 2009, Eric Dumazet wrote:
> 
>> Christoph Lameter a ?crit :
>>> On Tue, 25 Aug 2009, Eric Dumazet wrote:
>>>
>>>> Please hold on, I would like to fully understand what's happening,
>>>> and test the patch :)
>>> Ok. It would be good if the drops would also be somehow noted by the UDP
>>> subsystem (one should see something with netstat -su) and may be even the
>>> socket. I see a drops column in /proc/net/udp. rx_drops, tx_drops?
>> This /proc/net/udp column is for rx_drops currently and was recently added...
> 
> So lets rename it to rx_drops and then add tx_drops?
> 

It wont be very nice, because it'll add yet another 32bits counter in each socket
structure, for a unlikely use. While rx_drops can happen if application is slow.

Also, tx_drops might be done later and not noticed.

Please read this old (and usefull) thread, with Alexey words...

http://oss.sgi.com/archives/netdev/2002-10/msg00612.html

http://oss.sgi.com/archives/netdev/2002-10/msg00617.html


So I bet your best choice is to set IP_RECVERR, as mentioned in 2002 by Jamal and Alexey :)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter Aug. 25, 2009, 4:11 p.m. UTC | #9
On Tue, 25 Aug 2009, Eric Dumazet wrote:

> It wont be very nice, because it'll add yet another 32bits counter in each socket
> structure, for a unlikely use. While rx_drops can happen if application is slow.

tx_drops happen if the application sends too fast.

TX drop tracking is important due to the braindamaged throttling logic
during send. If SO_SNDBUF is less than what happens to fit in the TX ring then the
application will be throttled and no packet loss happens. If SO_SNDBUF is
set high then the TX ring will overflow and packets are dropped.

We need some way to diagnose TX drops per socket as long as we have
that mind boggling issue. TX drops means that one should reduce the size
of the sendbuffer in order to get better throttling which reduces packet
loss.

> Also, tx_drops might be done later and not noticed.
>
> Please read this old (and usefull) thread, with Alexey words...
>
> http://oss.sgi.com/archives/netdev/2002-10/msg00612.html
>
> http://oss.sgi.com/archives/netdev/2002-10/msg00617.html
>
>
> So I bet your best choice is to set IP_RECVERR, as mentioned in 2002 by Jamal and Alexey :)

I read this just yesterday. IP_RECVERR means that the application wants to
see details on each loss. We just want some counters that give us accurate
statistics to gauge where packet loss is occurring. Applications are
usually not interested in tracking the fate of each packet.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Aug. 25, 2009, 4:27 p.m. UTC | #10
Christoph Lameter a écrit :
> On Tue, 25 Aug 2009, Eric Dumazet wrote:
> 
>> It wont be very nice, because it'll add yet another 32bits counter in each socket
>> structure, for a unlikely use. While rx_drops can happen if application is slow.
> 
> tx_drops happen if the application sends too fast.
> 
> TX drop tracking is important due to the braindamaged throttling logic
> during send. If SO_SNDBUF is less than what happens to fit in the TX ring then the
> application will be throttled and no packet loss happens. If SO_SNDBUF is
> set high then the TX ring will overflow and packets are dropped.
> 
> We need some way to diagnose TX drops per socket as long as we have
> that mind boggling issue. TX drops means that one should reduce the size
> of the sendbuffer in order to get better throttling which reduces packet
> loss.
> 
>> Also, tx_drops might be done later and not noticed.
>>
>> Please read this old (and usefull) thread, with Alexey words...
>>
>> http://oss.sgi.com/archives/netdev/2002-10/msg00612.html
>>
>> http://oss.sgi.com/archives/netdev/2002-10/msg00617.html
>>
>>
>> So I bet your best choice is to set IP_RECVERR, as mentioned in 2002 by Jamal and Alexey :)
> 
> I read this just yesterday. IP_RECVERR means that the application wants to
> see details on each loss. We just want some counters that give us accurate
> statistics to gauge where packet loss is occurring. Applications are
> usually not interested in tracking the fate of each packet.

Yep,  but IP_RECVERR also has the side effect of letting kernel returns -ENOBUFS error
in sending and congestion, which was your initial point :)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter Aug. 25, 2009, 4:36 p.m. UTC | #11
On Tue, 25 Aug 2009, Eric Dumazet wrote:

> > I read this just yesterday. IP_RECVERR means that the application wants to
> > see details on each loss. We just want some counters that give us accurate
> > statistics to gauge where packet loss is occurring. Applications are
> > usually not interested in tracking the fate of each packet.
>
> Yep,  but IP_RECVERR also has the side effect of letting kernel returns -ENOBUFS error
> in sending and congestion, which was your initial point :)

The initial point was that the SNMP counters are not updated if IP_RECVERR
is not set which is clearly a bug and your and my patch addresses that.

Then Sridhar noted that there are other tx drop counters. qdisc counters
are also not updated. Wish we would maintain tx drops counters there as
well so that we can track down which NIC drops it.

Then came the wishlist of UDP counters for tx drops and socket based
tx_drop accounting for tuning and tracking down which app is sending
too fast .... ;-)

The apps could be third party apps. Just need to be able to troubleshoot
packet loss.





--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Stevens Aug. 25, 2009, 7:03 p.m. UTC | #12
Christoph Lameter <cl@linux-foundation.org> wrote on 08/25/2009 06:48:24 
AM:

> On Mon, 24 Aug 2009, Sridhar Samudrala wrote:

> > If we count these drops as qdisc drops, should we also count them as 
IP OUTDISCARDS?
> 
> Yes.

Actually, no. (!)

IP_OUTDISCARDS should count the packets IP dropped, not
anything dropped at a lower layer (which, in general, it
is not aware of). If you count these in multiple layers,
then you don't really know who dropped it.

                                                +-DLS

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Aug. 25, 2009, 7:08 p.m. UTC | #13
From: David Stevens <dlstevens@us.ibm.com>
Date: Tue, 25 Aug 2009 12:03:58 -0700

> IP_OUTDISCARDS should count the packets IP dropped, not
> anything dropped at a lower layer (which, in general, it
> is not aware of). If you count these in multiple layers,
> then you don't really know who dropped it.

Right.

We are in danger of going from one extreme to the other.
Previously we lacked some drop detection capabilities
but now we've filled most of these holes and ON TOP of
all of that we have Neil's SKB drop tracer.

Let's not get carried away over-accounting this stuff.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter Aug. 25, 2009, 7:15 p.m. UTC | #14
On Tue, 25 Aug 2009, David Stevens wrote:

> Christoph Lameter <cl@linux-foundation.org> wrote on 08/25/2009 06:48:24
> AM:
>
> > On Mon, 24 Aug 2009, Sridhar Samudrala wrote:
>
> > > If we count these drops as qdisc drops, should we also count them as
> IP OUTDISCARDS?
> >
> > Yes.
>
> Actually, no. (!)
>
> IP_OUTDISCARDS should count the packets IP dropped, not
> anything dropped at a lower layer (which, in general, it
> is not aware of). If you count these in multiple layers,
> then you don't really know who dropped it.

You are right. I skipped that IP OUTDICARDS reference. They need to be
accounted at the qdisc level though.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joe Perches Aug. 25, 2009, 7:56 p.m. UTC | #15
On Tue, 2009-08-25 at 15:15 -0400, Christoph Lameter wrote:
> On Tue, 25 Aug 2009, David Stevens wrote:
> > Christoph Lameter <cl@linux-foundation.org> wrote on 08/25/2009 06:48:24
> > AM:
> > > On Mon, 24 Aug 2009, Sridhar Samudrala wrote:
> > > > If we count these drops as qdisc drops, should we also count them as
> > IP OUTDISCARDS?
> > > Yes.
> > Actually, no. (!)
> > IP_OUTDISCARDS should count the packets IP dropped, not
> > anything dropped at a lower layer (which, in general, it
> > is not aware of). If you count these in multiple layers,
> > then you don't really know who dropped it.
> They need to be accounted at the qdisc level though.

It's probably useful to be able to know when packets
and payloads are dropped.  It may not be necessary though.
It's probably not a fundamental.

Chariot, LANforge and apps like it might care, but most all
other apps might not care at all.

Maybe these should be allowed with a CONFIG.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7d08210..27a5b79 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1301,19 +1301,15 @@  int ip_push_pending_frames(struct sock *sk)
 	/* Netfilter gets whole the not fragmented skb. */
 	err = ip_local_out(skb);
 	if (err) {
+		if (err != NET_XMIT_CN)
+			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
 		if (err > 0)
 			err = inet->recverr ? net_xmit_errno(err) : 0;
-		if (err)
-			goto error;
 	}
 
 out:
 	ip_cork_release(inet);
 	return err;
-
-error:
-	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
-	goto out;
 }
 
 /*