
[1/4] net: skb_orphan on dev_hard_start_xmit

Message ID 200905292344.56814.rusty@rustcorp.com.au
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Rusty Russell May 29, 2009, 2:14 p.m. UTC
Various drivers do skb_orphan() because they do not free transmitted
skbs in a timely manner (causing apps which hit their socket limits to
stall, socket close to hang, etc.).

DaveM points out that there are advantages to doing it generally (it's
more likely to be on same CPU than after xmit), and I couldn't find
any new starvation issues in simple benchmarking here.

This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
can be premature in the NETDEV_TX_BUSY case, but that's uncommon.

I removed the drivers' skb_orphan(), though it's harmless.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Divy Le Ray <divy@chelsio.com>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Dan Williams <dcbw@redhat.com>
Cc: libertas-dev@lists.infradead.org
---
 drivers/net/cxgb3/sge.c            |   27 ---------------------------
 drivers/net/loopback.c             |    2 --
 drivers/net/mlx4/en_tx.c           |    4 ----
 drivers/net/niu.c                  |    3 +--
 drivers/net/veth.c                 |    2 --
 drivers/net/wireless/libertas/tx.c |    3 ---
 net/core/dev.c                     |   21 +++++----------------
 7 files changed, 6 insertions(+), 56 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet May 29, 2009, 3:11 p.m. UTC | #1
Rusty Russell wrote:
> Various drivers do skb_orphan() because they do not free transmitted
> skbs in a timely manner (causing apps which hit their socket limits to
> stall, socket close to hang, etc.).
> 
> DaveM points out that there are advantages to doing it generally (it's
> more likely to be on same CPU than after xmit), and I couldn't find
> any new starvation issues in simple benchmarking here.
> 
>

If really no starvations are possible at all, I really wonder why some
guys added memory accounting to UDP flows. Maybe they don't run "simple
benchmarks" but real apps ? :)

For TCP, I agree your patch is a huge benefit, since it's paced by remote ACKs
and window control, but a UDP sender will likely be able to saturate a link.

Maybe we can try to selectively call skb_orphan() only for TCP packets?

I understand your motivation is driver simplification, so you
want all packets to be orphaned... hmm...

> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
> 
> I removed the drivers' skb_orphan(), though it's harmless.
> 
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Divy Le Ray <divy@chelsio.com>
> Cc: Roland Dreier <rolandd@cisco.com>
> Cc: Pavel Emelianov <xemul@openvz.org>
> Cc: Dan Williams <dcbw@redhat.com>
> Cc: libertas-dev@lists.infradead.org
> ---
>  drivers/net/cxgb3/sge.c            |   27 ---------------------------
>  drivers/net/loopback.c             |    2 --
>  drivers/net/mlx4/en_tx.c           |    4 ----
>  drivers/net/niu.c                  |    3 +--
>  drivers/net/veth.c                 |    2 --
>  drivers/net/wireless/libertas/tx.c |    3 ---
>  net/core/dev.c                     |   21 +++++----------------
>  7 files changed, 6 insertions(+), 56 deletions(-)
> 
> diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
> --- a/drivers/net/cxgb3/sge.c
> +++ b/drivers/net/cxgb3/sge.c
> @@ -1288,33 +1288,6 @@ int t3_eth_xmit(struct sk_buff *skb, str
>  	dev->trans_start = jiffies;
>  	spin_unlock(&q->lock);
>  
> -	/*
> -	 * We do not use Tx completion interrupts to free DMAd Tx packets.
> -	 * This is good for performamce but means that we rely on new Tx
> -	 * packets arriving to run the destructors of completed packets,
> -	 * which open up space in their sockets' send queues.  Sometimes
> -	 * we do not get such new packets causing Tx to stall.  A single
> -	 * UDP transmitter is a good example of this situation.  We have
> -	 * a clean up timer that periodically reclaims completed packets
> -	 * but it doesn't run often enough (nor do we want it to) to prevent
> -	 * lengthy stalls.  A solution to this problem is to run the
> -	 * destructor early, after the packet is queued but before it's DMAd.
> -	 * A cons is that we lie to socket memory accounting, but the amount
> -	 * of extra memory is reasonable (limited by the number of Tx
> -	 * descriptors), the packets do actually get freed quickly by new
> -	 * packets almost always, and for protocols like TCP that wait for
> -	 * acks to really free up the data the extra memory is even less.
> -	 * On the positive side we run the destructors on the sending CPU
> -	 * rather than on a potentially different completing CPU, usually a
> -	 * good thing.  We also run them without holding our Tx queue lock,
> -	 * unlike what reclaim_completed_tx() would otherwise do.
> -	 *
> -	 * Run the destructor before telling the DMA engine about the packet
> -	 * to make sure it doesn't complete and get freed prematurely.
> -	 */
> -	if (likely(!skb_shared(skb)))
> -		skb_orphan(skb);
> -
>  	write_tx_pkt_wr(adap, skb, pi, pidx, gen, q, ndesc, compl);
>  	check_ring_tx_db(adap, q);
>  	return NETDEV_TX_OK;
> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> --- a/drivers/net/loopback.c
> +++ b/drivers/net/loopback.c
> @@ -72,8 +72,6 @@ static int loopback_xmit(struct sk_buff 
>  {
>  	struct pcpu_lstats *pcpu_lstats, *lb_stats;
>  
> -	skb_orphan(skb);
> -
>  	skb->protocol = eth_type_trans(skb,dev);
>  
>  	/* it's OK to use per_cpu_ptr() because BHs are off */
> diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
> --- a/drivers/net/mlx4/en_tx.c
> +++ b/drivers/net/mlx4/en_tx.c
> @@ -807,10 +807,6 @@ int mlx4_en_xmit(struct sk_buff *skb, st
>  	if (tx_desc == (struct mlx4_en_tx_desc *) ring->bounce_buf)
>  		tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
>  
> -	/* Run destructor before passing skb to HW */
> -	if (likely(!skb_shared(skb)))
> -		skb_orphan(skb);
> -
>  	/* Ensure new descirptor hits memory
>  	 * before setting ownership of this descriptor to HW */
>  	wmb();
> diff --git a/drivers/net/niu.c b/drivers/net/niu.c
> --- a/drivers/net/niu.c
> +++ b/drivers/net/niu.c
> @@ -6667,8 +6667,7 @@ static int niu_start_xmit(struct sk_buff
>  		}
>  		kfree_skb(skb);
>  		skb = skb_new;
> -	} else
> -		skb_orphan(skb);
> +	}
>  
>  	align = ((unsigned long) skb->data & (16 - 1));
>  	headroom = align + sizeof(struct tx_pkt_hdr);
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -155,8 +155,6 @@ static int veth_xmit(struct sk_buff *skb
>  	struct veth_net_stats *stats, *rcv_stats;
>  	int length, cpu;
>  
> -	skb_orphan(skb);
> -
>  	priv = netdev_priv(dev);
>  	rcv = priv->peer;
>  	rcv_priv = netdev_priv(rcv);
> diff --git a/drivers/net/wireless/libertas/tx.c 
> b/drivers/net/wireless/libertas/tx.c
> --- a/drivers/net/wireless/libertas/tx.c
> +++ b/drivers/net/wireless/libertas/tx.c
> @@ -154,9 +154,6 @@ int lbs_hard_start_xmit(struct sk_buff *
>  	if (priv->monitormode) {
>  		/* Keep the skb to echo it back once Tx feedback is
>  		   received from FW */
> -		skb_orphan(skb);
> -
> -		/* Keep the skb around for when we get feedback */
>  		priv->currenttxskb = skb;
>  	} else {
>   free:
> diff --git a/net/core/dev.c b/net/core/dev.c
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1677,6 +1677,10 @@ int dev_hard_start_xmit(struct sk_buff *
>  	const struct net_device_ops *ops = dev->netdev_ops;
>  	int rc;
>  
> +	/* Call destructor here.  It's SMP-cache-friendly and avoids issues
> +	 * with drivers which can hold transmitted skbs for long times */
> +	skb_orphan(skb);
> +
>  	if (likely(!skb->next)) {
>  		if (!list_empty(&ptype_all))
>  			dev_queue_xmit_nit(skb, dev);
> @@ -1688,22 +1692,7 @@ int dev_hard_start_xmit(struct sk_buff *
>  				goto gso;
>  		}
>  
> -		rc = ops->ndo_start_xmit(skb, dev);
> -		/*
> -		 * TODO: if skb_orphan() was called by
> -		 * dev->hard_start_xmit() (for example, the unmodified
> -		 * igb driver does that; bnx2 doesn't), then
> -		 * skb_tx_software_timestamp() will be unable to send
> -		 * back the time stamp.
> -		 *
> -		 * How can this be prevented? Always create another
> -		 * reference to the socket before calling
> -		 * dev->hard_start_xmit()? Prevent that skb_orphan()
> -		 * does anything in dev->hard_start_xmit() by clearing
> -		 * the skb destructor before the call and restoring it
> -		 * afterwards, then doing the skb_orphan() ourselves?
> -		 */
> -		return rc;
> +		return ops->ndo_start_xmit(skb, dev);
>  	}
>  
>  gso:
> 
> 

Rusty Russell June 1, 2009, 12:27 p.m. UTC | #2
On Sat, 30 May 2009 12:41:00 am Eric Dumazet wrote:
> Rusty Russell wrote:
> > DaveM points out that there are advantages to doing it generally (it's
> > more likely to be on same CPU than after xmit), and I couldn't find
> > any new starvation issues in simple benchmarking here.
>
> If really no starvations are possible at all, I really wonder why some
> guys added memory accounting to UDP flows. Maybe they dont run "simple
> benchmarks" but real apps ? :)

Well, without any accounting at all you could use quite a lot of memory as 
there are many places packets can be queued.

> For TCP, I agree your patch is a huge benefit, since its paced by remote
> ACKS and window control

I doubt that.  There'll be some cache friendliness, but I'm not sure it'll be 
measurable, let alone "huge".  It's the win for drivers which don't have a 
timely, batching tx-free mechanism that I'm aiming for.

> , but an UDP sender will likely be able to saturate
> a link.

I couldn't see any difference in saturation here (with the default scheduler and a 
100MBit e1000e).  Two reasons come to mind: firstly, only the hardware queue is 
unregulated: the tx queue is still accounted.  And when you add scheduling to 
the mix, I can't in practice cause starvation of other senders.

Hope that clarifies,
Rusty.
Patrick Ohly June 1, 2009, 7:47 p.m. UTC | #3
On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.

Would it be possible to make the new skb_orphan() at the start of
dev_hard_start_xmit() conditional, so that it is not executed for
packets that are to be time stamped?

As discussed before
(http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
socket pointer is required for sending back the send time stamp from
inside the device driver. Calling skb_orphan() unconditionally as in
this patch would break the hardware time stamping of outgoing packets.

This should work without much overhead (skb_tx() expands to a lookup of
the skb_shared_info):
if (!skb_tx(skb)->flags)
    skb_orphan(skb);

Instead of looking at the individual skb_shared_tx "hardware" and
"software" bits I'm just checking the whole byte here because it is
shorter and perhaps faster. Semantically the result is the same.
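
[Editor's note: Patrick's check can be modeled in a small self-contained userspace sketch. The struct and function names below mirror the kernel's but are reduced stand-ins, not kernel code; the single tx_flags byte stands in for skb_shared_tx.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced model: skb_orphan() runs the "destructor" (uncharging the
 * socket's write memory) and drops the socket pointer.  The whole-byte
 * tx_flags test gates it, exactly as in Patrick's snippet. */
struct sock { int wmem; };

struct sk_buff {
	struct sock *sk;
	uint8_t tx_flags;   /* stand-in for skb_shared_info's skb_shared_tx */
	int truesize;
};

static void skb_orphan(struct sk_buff *skb)
{
	if (skb->sk) {
		skb->sk->wmem -= skb->truesize;  /* run the destructor */
		skb->sk = NULL;
	}
}

/* Orphan only packets that are not to be time stamped. */
static void maybe_orphan(struct sk_buff *skb)
{
	if (!skb->tx_flags)
		skb_orphan(skb);
}
```

A time-stamped packet (any tx_flags bit set) keeps its skb->sk so the driver can deliver the send timestamp back to the socket.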
David Miller June 2, 2009, 7:25 a.m. UTC | #4
From: Patrick Ohly <patrick.ohly@intel.com>
Date: Mon, 01 Jun 2009 21:47:22 +0200

> On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
>> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
>> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
> 
> Would it be possible to make the new skb_orphan() at the start of
> dev_hard_start_xmit() conditionally so that it is not executed for
> packets that are to be time stamped?
> 
> As discussed before
> (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
> socket pointer is required for sending back the send time stamp from
> inside the device driver. Calling skb_orphan() unconditionally as in
> this patch would break the hardware time stamping of outgoing packets.

Indeed, we need to check that case, at a minimum.

And there are potentially other problems.  For example, I
wonder how this interacts with the new TX MMAP af_packet support
in net-next-2.6 :-/

Rusty Russell June 2, 2009, 2:08 p.m. UTC | #5
On Tue, 2 Jun 2009 04:55:53 pm David Miller wrote:
> From: Patrick Ohly <patrick.ohly@intel.com>
> Date: Mon, 01 Jun 2009 21:47:22 +0200
>
> > On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
> >> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> >> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
> >
> > Would it be possible to make the new skb_orphan() at the start of
> > dev_hard_start_xmit() conditionally so that it is not executed for
> > packets that are to be time stamped?
> >
> > As discussed before
> > (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
> > socket pointer is required for sending back the send time stamp from
> > inside the device driver. Calling skb_orphan() unconditionally as in
> > this patch would break the hardware time stamping of outgoing packets.
>
> Indeed, we need to check that case, at a minimum.
>
> And there are other potentially other problems.  For example, I
> wonder how this interacts with the new TX MMAP af_packet support
> in net-next-2.6 :-/

I think I'll do this in the driver for now, and let's revisit doing it 
generically later?

Thanks,
Rusty.
David Miller June 3, 2009, 12:14 a.m. UTC | #6
From: Rusty Russell <rusty@rustcorp.com.au>
Date: Tue, 2 Jun 2009 23:38:29 +0930

> On Tue, 2 Jun 2009 04:55:53 pm David Miller wrote:
>> From: Patrick Ohly <patrick.ohly@intel.com>
>> Date: Mon, 01 Jun 2009 21:47:22 +0200
>>
>> > On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
>> >> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
>> >> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
>> >
>> > Would it be possible to make the new skb_orphan() at the start of
>> > dev_hard_start_xmit() conditionally so that it is not executed for
>> > packets that are to be time stamped?
>> >
>> > As discussed before
>> > (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
>> > socket pointer is required for sending back the send time stamp from
>> > inside the device driver. Calling skb_orphan() unconditionally as in
>> > this patch would break the hardware time stamping of outgoing packets.
>>
>> Indeed, we need to check that case, at a minimum.
>>
>> And there are other potentially other problems.  For example, I
>> wonder how this interacts with the new TX MMAP af_packet support
>> in net-next-2.6 :-/
> 
> I think I'll do this in the driver for now, and let's revisit doing it 
> generically later?

That might be the best course of action for the time being.
This whole area is a rat's nest.
Eric Dumazet June 3, 2009, 9:02 p.m. UTC | #7
Rusty Russell wrote:
> On Sat, 30 May 2009 12:41:00 am Eric Dumazet wrote:
>> Rusty Russell wrote:
>>> DaveM points out that there are advantages to doing it generally (it's
>>> more likely to be on same CPU than after xmit), and I couldn't find
>>> any new starvation issues in simple benchmarking here.
>> If really no starvations are possible at all, I really wonder why some
>> guys added memory accounting to UDP flows. Maybe they dont run "simple
>> benchmarks" but real apps ? :)
> 
> Well, without any accounting at all you could use quite a lot of memory as 
> there are many places packets can be queued.
> 
>> For TCP, I agree your patch is a huge benefit, since its paced by remote
>> ACKS and window control
> 
> I doubt that.  There'll be some cache friendliness, but I'm not sure it'll be 
> measurable, let alone "huge".  It's the win to drivers which don't have a 
> timely and batching tx free mechanism which I aim for.

At 250,000 packets/second on a Gigabit link, this is huge, I can tell you.
(250,000 incoming packets and 250,000 outgoing packets per second, 700 Mbit/s)

In the oprofile below, taken on CPU0 (dedicated to softirqs on one bnx2 eth adapter),
we can see sock_wfree() being number 2 on the profile, because it touches three
cache lines per socket and transmitted packet in the TX completion handler.

Also, taking a reference on the socket for each xmit packet in flight is very expensive,
since it slows down the receiver in __udp4_lib_lookup(). Several CPUs are fighting for the sk->refcnt cache line.


CPU: Core 2, speed 3000.24 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
21215    21215         11.8847  11.8847    bnx2_poll_work
17239    38454          9.6573  21.5420    sock_wfree      << effect of udp memory accounting >>
14817    53271          8.3005  29.8425    __slab_free
14635    67906          8.1986  38.0411    __udp4_lib_lookup
11425    79331          6.4003  44.4414    __alloc_skb
9710     89041          5.4396  49.8810    __slab_alloc
8095     97136          4.5348  54.4158    __udp4_lib_rcv
7831     104967         4.3869  58.8027    sock_def_write_space
7586     112553         4.2497  63.0524    ip_rcv
7518     120071         4.2116  67.2640    skb_dma_unmap
6711     126782         3.7595  71.0235    netif_receive_skb
6272     133054         3.5136  74.5371    udp_queue_rcv_skb
5262     138316         2.9478  77.4849    skb_release_data
5023     143339         2.8139  80.2988    __kmalloc_track_caller
4070     147409         2.2800  82.5788    kmem_cache_alloc
3216     150625         1.8016  84.3804    ipt_do_table
2576     153201         1.4431  85.8235    skb_queue_tail



Rusty Russell June 4, 2009, 3:54 a.m. UTC | #8
On Thu, 4 Jun 2009 06:32:53 am Eric Dumazet wrote:
> Rusty Russell wrote:
> > On Sat, 30 May 2009 12:41:00 am Eric Dumazet wrote:
> >> For TCP, I agree your patch is a huge benefit, since its paced by remote
> >> ACKS and window control
> >
> > I doubt that.  There'll be some cache friendliness, but I'm not sure
> > it'll be measurable, let alone "huge".
...
> We can see sock_wfree() being number 2 on the profile, because it touches
> three cache lines per socket and transmited packet in TX completion
> handler.

Interesting, I take it back: got some "after" stats as well?

> Also, taking a reference on socket for each xmit packet in flight is very
> expensive, since it slows down receiver in __udp4_lib_lookup(). Several
> cpus are fighting for sk->refcnt cache line.

Now that we have decent dynamic per-cpu, we can finally implement bigrefs.  More 
obvious for device counts than sockets, but perhaps applicable here as well?
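
[Editor's note: "bigref" here means a reference count split across per-cpu counters so the fast path touches only a local cache line. A minimal single-threaded userspace sketch of the idea, with illustrative names (this is not an existing kernel API):]

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Each counter would sit on its own cache line in a real
 * implementation; get/put never touch a shared line. */
struct bigref {
	long cpu_count[NR_CPUS];
	bool dying;               /* set when falling back to a slow path */
};

static void bigref_get(struct bigref *r, int cpu) { r->cpu_count[cpu]++; }

/* A put may run on a different cpu than the matching get;
 * individual counters can go negative, only the sum matters. */
static void bigref_put(struct bigref *r, int cpu) { r->cpu_count[cpu]--; }

/* Slow path: only meaningful once new gets are excluded (dying set). */
static long bigref_total(const struct bigref *r)
{
	long sum = 0;
	for (int i = 0; i < NR_CPUS; i++)
		sum += r->cpu_count[i];
	return sum;
}
```

The cost moves from every get/put (a contended atomic) to teardown (a sum over cpus), which is why it suits long-lived objects like devices better than short-lived sockets.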

Rusty.
David Miller June 4, 2009, 4 a.m. UTC | #9
From: Rusty Russell <rusty@rustcorp.com.au>
Date: Thu, 4 Jun 2009 13:24:57 +0930

> On Thu, 4 Jun 2009 06:32:53 am Eric Dumazet wrote:
>> Also, taking a reference on socket for each xmit packet in flight is very
>> expensive, since it slows down receiver in __udp4_lib_lookup(). Several
>> cpus are fighting for sk->refcnt cache line.
> 
> Now we have decent dynamic per-cpu, we can finally implement bigrefs.  More 
> obvious for device counts than sockets, but perhaps applicable here as well?

It might be very beneficial for longer-lasting, active connections, but
for high connection rates it's going to be a lose in my estimation.
Eric Dumazet June 4, 2009, 4:54 a.m. UTC | #10
David Miller wrote:
> From: Rusty Russell <rusty@rustcorp.com.au>
> Date: Thu, 4 Jun 2009 13:24:57 +0930
> 
>> On Thu, 4 Jun 2009 06:32:53 am Eric Dumazet wrote:
>>> Also, taking a reference on socket for each xmit packet in flight is very
>>> expensive, since it slows down receiver in __udp4_lib_lookup(). Several
>>> cpus are fighting for sk->refcnt cache line.
>> Now we have decent dynamic per-cpu, we can finally implement bigrefs.  More 
>> obvious for device counts than sockets, but perhaps applicable here as well?
> 
> It might be very beneficial for longer lasting, active, connections, but
> for high connection rates it's going to be a lose in my estimation.

Agreed.

We can also avoid the sock_put()/sock_hold() pair for each tx packet,
and only touch sk_wmem_alloc (with an appropriate atomic_sub_return() in sock_wfree()
and atomic_dec_and_test() in sk_free()).

We could initialize sk->sk_wmem_alloc to one instead of 0, so that
sock_wfree() could just synchronize itself with sk_free():

void sk_free(struct sock *sk)
{
	if (atomic_dec_and_test(&sk->sk_wmem_alloc))
		__sk_free(sk);
}

 static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 {
-       sock_hold(sk);
        skb->sk = sk;
        skb->destructor = sock_wfree;
        atomic_add(skb->truesize, &sk->sk_wmem_alloc);
 }

 void sock_wfree(struct sk_buff *skb)
 {
        struct sock *sk = skb->sk;
+       int res;

        /* In case it might be waiting for more memory. */
-       atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
+       res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
        if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
                sk->sk_write_space(sk);
-       sock_put(sk);
+       if (res == 0)
+               __sk_free(sk);
 }

Patch will follow after some testing
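
[Editor's note: the bias-to-one trick above can be checked in a self-contained userspace model using C11 atomics. The names mirror Eric's sketch but the types and __sk_free() are reduced stand-ins, not kernel code.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Model: sk_wmem_alloc starts at 1 (the bias).  Each in-flight skb adds
 * its truesize; sock_wfree() subtracts it; sk_free() drops the bias.
 * Whoever brings the counter to zero frees the socket, so no separate
 * refcount (sock_hold/sock_put per packet) is needed. */
struct sock {
	atomic_int sk_wmem_alloc;
	bool freed;
};

static void sock_init(struct sock *sk)
{
	atomic_init(&sk->sk_wmem_alloc, 1);   /* the bias */
	sk->freed = false;
}

static void __sk_free(struct sock *sk) { sk->freed = true; }

static void skb_set_owner_w(struct sock *sk, int truesize)
{
	atomic_fetch_add(&sk->sk_wmem_alloc, truesize);
}

static void sock_wfree(struct sock *sk, int truesize)
{
	/* atomic_sub_return() in the kernel sketch */
	if (atomic_fetch_sub(&sk->sk_wmem_alloc, truesize) - truesize == 0)
		__sk_free(sk);
}

static void sk_free(struct sock *sk)
{
	/* atomic_dec_and_test(): frees only if no packets are in flight */
	if (atomic_fetch_sub(&sk->sk_wmem_alloc, 1) - 1 == 0)
		__sk_free(sk);
}
```

The bias guarantees sk_wmem_alloc cannot transiently hit zero while the socket is still alive, so the last of sk_free() and the final sock_wfree() always does the free, in either order.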

David Miller June 4, 2009, 4:56 a.m. UTC | #11
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 04 Jun 2009 06:54:24 +0200

> We also can avoid the sock_put()/sock_hold() pair for each tx packet,
> to only touch sk_wmem_alloc (with appropriate atomic_sub_return() in sock_wfree()
> and atomic_dec_test in sk_free
> 
> We could initialize sk->sk_wmem_alloc to one instead of 0, so that
> sock_wfree() could just synchronize itself with sk_free()

Excellent idea Eric.

> Patch will follow after some testing

I look forward to it :-)
Herbert Xu July 3, 2009, 7:55 a.m. UTC | #12
David Miller <davem@davemloft.net> wrote:
>
>> I think I'll do this in the driver for now, and let's revisit doing it 
>> generically later?
> 
> That might be the best course of action for the time being.
> This whole area is a rat's nest.

Calling skb_orphan like this should be forbidden.  Apart from the
problems already raised, it is a sign that the driver is trying to
paper over a more serious issue of not cleaning up skbs in a timely manner.

Yes skb_orphan will work for the cases where calling the skb
destructor allows forward progress, but for the cases where you
really need the skb to be freed (e.g., iSCSI or Xen), this
simply doesn't work.

So anytime someone tries to propose such a solution it is a sign
that they have bigger problems.

Cheers,
David Miller July 4, 2009, 3:02 a.m. UTC | #13
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 3 Jul 2009 15:55:30 +0800

> Calling skb_orphan like this should be forbidden.  Apart from the
> problems already raised, it is a sign that the driver is trying to
> paper over a more serious issue of not cleaning up skb's timely.
> 
> Yes skb_orphan will work for the cases where calling the skb
> destructor allows forward progress, but for the cases where you
> really need to the skb to be freed (e.g., iSCSI or Xen), this
> simply doesn't work.
> 
> So anytime someone tries to propose such a solution it is a sign
> that they have bigger problems.

Agreed, but alas we are foaming at the mouth until we have a truly
usable alternative.

In particular the case of handling a device without usable TX
completion event indications is still quite troublesome.
Herbert Xu July 4, 2009, 3:08 a.m. UTC | #14
On Fri, Jul 03, 2009 at 08:02:54PM -0700, David Miller wrote:
>
> In particular the case of handling a device without usable TX
> completion event indications is still quite troublesome.

Which particular devices do you have in mind?

Cheers,
David Miller July 4, 2009, 3:13 a.m. UTC | #15
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 4 Jul 2009 11:08:30 +0800

> On Fri, Jul 03, 2009 at 08:02:54PM -0700, David Miller wrote:
>>
>> In particular the case of handling a device without usable TX
>> completion event indications is still quite troublesome.
>
> Which particular devices do you have in mind?

NIU

I basically can't defer interrupts because the chip supports
per-TX-desc interrupt indications but it lacks an "all TX queue sent"
event.  So if, say, I tell it to interrupt every 1/4 of the TX queue,
then up to 1/4 of the queue can have packets "stuck" in there
if TX activity all of a sudden ceases.

The only thing I've come up with to be able to mitigate interrupts is
to use an hrtimer of some sort.  But that's going to be hard to get
right, and who knows what kind of latencies will be introduced for TX
completion packet freeing unless I am very careful.

And finally this belongs in generic code, not in the NIU driver,
whatever we come up with.  Especially since my understanding is that
this is similar to what Rusty needs.

Herbert Xu July 4, 2009, 7:42 a.m. UTC | #16
On Fri, Jul 03, 2009 at 08:13:47PM -0700, David Miller wrote:
>
> NIU
> 
> I basically can't defer interrupts because the chip supports
> per-TX-desc interrupt indications but it lacks an "all TX queue sent"
> event.  So if, say, tell it to interrupt every 1/4 of the TX queue
> then up to 1/4 of the queue can have packets "stuck" in there
> if TX activity all of a sudden ceases.

Here's an idea: We let the sender decide whether we need to enable
notification.  This decision would be carried as a flag in the skb.
For example, UDP would set this flag when its socket buffer is close
to capacity.  Routing would set this flag per NAPI run, etc.

Of course you'd ignore this flag completely if the qdisc queue is
non-empty.

Cheers,
Herbert Xu July 4, 2009, 9:09 a.m. UTC | #17
On Sat, Jul 04, 2009 at 03:42:45PM +0800, Herbert Xu wrote:
> 
> Here's an idea: We let the sender decide whether we need to enable
> notification.  This decision would be carried as a flag in the skb.
> For example, UDP would set this flag when its socket buffer is close
> to capacity.  Routing would set this flag per NAPI run, etc.

Actually it doesn't even matter for routing, because only those
that are charged for the skbs or the pages care, and they're the
only ones that would need to set this.

One potential problem is if the socket is constantly running
close to capacity, but that should only happen if the device
TX queue is also close to capacity which means that the qdisc
queue should be non-empty.

Cheers,
Herbert Xu July 5, 2009, 3:26 a.m. UTC | #18
On Sat, Jul 04, 2009 at 05:09:10PM +0800, Herbert Xu wrote:
>
> One potential problem is if the socket is constantly running
> close to capacity, but that should only happen if the device
> TX queue is also close to capacity which means that the qdisc
> queue should be non-empty.

Here's a another crazy idea:

Let's use dummy TX descriptors to generate an interrupt, either
with or without transmitting an actual packet on the wire depending
on the NIC.

xmit(skb)

	if (TX queue contains no interrupting descriptor &&
	    qdisc is empty)
		mark TX descriptor as interrupting

clean()

	do work
	if (TX queue contains no interrupting descriptor &&
	    TX queue non-empty &&
	    qdisc is empty)
		send dummy TX descriptor

This is based on the assumption that in the time it takes for
the NIC to process one packet and interrupt us, we can generate
more packets.  Obviously if we can't then even if the NIC had
a queue-empty interrupt capability it wouldn't help.

Cheers,
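
[Editor's note: the heuristic in idea #18 can be modeled in a self-contained userspace sketch. All state (ring, qdisc, interrupts) is simulated with plain counters; none of this is driver code and the names are illustrative.]

```c
#include <assert.h>

/* Simulated TX queue state for the dummy-descriptor heuristic. */
struct txq {
	int ring_len;    /* descriptors currently owned by hardware */
	int irq_descs;   /* how many of them are marked interrupting */
	int qdisc_len;   /* packets still queued above the driver */
	int dummies;     /* dummy interrupting descriptors posted by clean() */
};

/* xmit(): mark this descriptor interrupting only if nothing else in the
 * ring will interrupt AND the qdisc is empty (no more traffic coming
 * that could carry the interrupt instead). */
static void xmit(struct txq *q)
{
	q->ring_len++;
	if (q->irq_descs == 0 && q->qdisc_len == 0)
		q->irq_descs++;
}

/* clean(): after reclaiming completed work, if TX work remains but no
 * descriptor will ever interrupt us again and no new packets are
 * queued, post a dummy interrupting descriptor so the ring drains. */
static void clean(struct txq *q, int done, int irqs_done)
{
	q->ring_len -= done;
	q->irq_descs -= irqs_done;
	if (q->irq_descs == 0 && q->ring_len > 0 && q->qdisc_len == 0) {
		q->dummies++;
		q->irq_descs++;
	}
}
```

The invariant the heuristic maintains: whenever the hardware holds unreclaimed descriptors and no further xmit is guaranteed, at least one descriptor in the ring will interrupt.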
Herbert Xu July 5, 2009, 3:34 a.m. UTC | #19
On Sun, Jul 05, 2009 at 11:26:58AM +0800, Herbert Xu wrote:
> 
> Here's a another crazy idea:
> 
> Let's use dummy TX descriptors to generate an interrupt, either
> with or without transmitting an actual packet on the wire depending
> on the NIC.

Here's an even crazier idea that doesn't use dummy descriptors.

xmit(skb)

	if (TX queue contains no interrupting descriptor &&
	    qdisc is empty)
		mark TX descriptor as interrupting

	if (TX queue now contains an interrupting descriptor &&
	    qdisc len < 2)
		stop queue

	if (TX ring full)
		stop queue

clean()

 	do work
	wake queue as per usual

Cheers,
David Miller Aug. 18, 2009, 1:47 a.m. UTC | #20
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 5 Jul 2009 11:34:08 +0800

> Here's an even crazier idea that doesn't use dummy descriptors.
> 
> xmit(skb)
> 
> 	if (TX queue contains no interrupting descriptor &&
> 	    qdisc is empty)
> 		mark TX descriptor as interrupting
> 
> 	if (TX queue now contains an interrupting descriptor &&
> 	    qdisc len < 2)
> 		stop queue
> 
> 	if (TX ring full)
> 		stop queue
> 
> clean()
> 
>  	do work
> 	wake queue as per usual

I'm pretty sure that for normal TCP and UDP workloads, this is just
going to set the interrupt bit on the first packet that gets into the
queue, and then not in the rest.

TCP just loops over packets in the send queue, and at initial state
the qdisc will be empty.

It's very hard to get this to work as well as if we had a real
queue empty interrupt status event.

Even if you get upstream status from the protocols saying "there's
more packets coming" via some flag in the SKB, that only says what one
client feeding the TX ring is about to do.

It says nothing about other threads of control which are about to start
feeding packets to the same device.
Herbert Xu Aug. 19, 2009, 3:19 a.m. UTC | #21
On Mon, Aug 17, 2009 at 06:47:12PM -0700, David Miller wrote:
> 
> I'm pretty sure that for normal TCP and UDP workloads, this is just
> going to set the interrupt bit on the first packet that gets into the
> queue, and then not in the rest.
> 
> TCP just loops over packets in the send queue, and at initial state
> the qdisc will be empty.

The scheme I actually tried on the e1000e is not quite the same
as what I had here.  I'm basically just adding descriptors per
usual.

However, I don't give them to the hardware immediately.  Instead
they're held back so that, on average, we have about three equally
spaced descriptors in the queue that will cause interrupts.  It seems
to work fairly well for netperf-generated TCP/UDP traffic in terms
of generating the right number of interrupts (once every qlen/3
packets).

I haven't posted anything yet because I ran into a weird problem
with the e1000e.  It was generating a lot more interrupts than
what I'm telling it to do.  Even if I hard-code the interrupts
at 0, qlen/3, 2qlen/3 I still get way more than qlen/3 interrupts
for qlen packets.

This may be related to the fact that e1000e doesn't really have
a way of turning the interrupts off for a given descriptor.  Well
actually it does but it also turns off status reporting which means
that we will never know whether that entry has finished processing.
So I've had to rely on using its TX interrupt delay mechanism as an
approximation of turning interrupt notification off.  Evidently that
is not working in the way I intended it to.

I'm in the process of repeating the same experiment with cxgb3
which hopefully should let me turn interrupts off on descriptors
while still reporting completion status.

Cheers,
David Miller Aug. 19, 2009, 3:34 a.m. UTC | #22
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 19 Aug 2009 13:19:26 +1000

> I'm in the process of repeating the same experiment with cxgb3
> which hopefully should let me turn interrupts off on descriptors
> while still reporting completion status.

Ok, I look forward to seeing your work however it turns out.

Once I see what you've done, I'll give it a spin on niu.

Patch

diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
--- a/drivers/net/cxgb3/sge.c
+++ b/drivers/net/cxgb3/sge.c
@@ -1288,33 +1288,6 @@  int t3_eth_xmit(struct sk_buff *skb, str
 	dev->trans_start = jiffies;
 	spin_unlock(&q->lock);
 
-	/*
-	 * We do not use Tx completion interrupts to free DMAd Tx packets.
-	 * This is good for performamce but means that we rely on new Tx
-	 * packets arriving to run the destructors of completed packets,
-	 * which open up space in their sockets' send queues.  Sometimes
-	 * we do not get such new packets causing Tx to stall.  A single
-	 * UDP transmitter is a good example of this situation.  We have
-	 * a clean up timer that periodically reclaims completed packets
-	 * but it doesn't run often enough (nor do we want it to) to prevent
-	 * lengthy stalls.  A solution to this problem is to run the
-	 * destructor early, after the packet is queued but before it's DMAd.
-	 * A cons is that we lie to socket memory accounting, but the amount
-	 * of extra memory is reasonable (limited by the number of Tx
-	 * descriptors), the packets do actually get freed quickly by new
-	 * packets almost always, and for protocols like TCP that wait for
-	 * acks to really free up the data the extra memory is even less.
-	 * On the positive side we run the destructors on the sending CPU
-	 * rather than on a potentially different completing CPU, usually a
-	 * good thing.  We also run them without holding our Tx queue lock,
-	 * unlike what reclaim_completed_tx() would otherwise do.
-	 *
-	 * Run the destructor before telling the DMA engine about the packet
-	 * to make sure it doesn't complete and get freed prematurely.
-	 */
-	if (likely(!skb_shared(skb)))
-		skb_orphan(skb);
-
 	write_tx_pkt_wr(adap, skb, pi, pidx, gen, q, ndesc, compl);
 	check_ring_tx_db(adap, q);
 	return NETDEV_TX_OK;
diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -72,8 +72,6 @@  static int loopback_xmit(struct sk_buff 
 {
 	struct pcpu_lstats *pcpu_lstats, *lb_stats;
 
-	skb_orphan(skb);
-
 	skb->protocol = eth_type_trans(skb,dev);
 
 	/* it's OK to use per_cpu_ptr() because BHs are off */
diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
--- a/drivers/net/mlx4/en_tx.c
+++ b/drivers/net/mlx4/en_tx.c
@@ -807,10 +807,6 @@  int mlx4_en_xmit(struct sk_buff *skb, st
 	if (tx_desc == (struct mlx4_en_tx_desc *) ring->bounce_buf)
 		tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
 
-	/* Run destructor before passing skb to HW */
-	if (likely(!skb_shared(skb)))
-		skb_orphan(skb);
-
 	/* Ensure new descirptor hits memory
 	 * before setting ownership of this descriptor to HW */
 	wmb();
diff --git a/drivers/net/niu.c b/drivers/net/niu.c
--- a/drivers/net/niu.c
+++ b/drivers/net/niu.c
@@ -6667,8 +6667,7 @@  static int niu_start_xmit(struct sk_buff
 		}
 		kfree_skb(skb);
 		skb = skb_new;
-	} else
-		skb_orphan(skb);
+	}
 
 	align = ((unsigned long) skb->data & (16 - 1));
 	headroom = align + sizeof(struct tx_pkt_hdr);
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -155,8 +155,6 @@  static int veth_xmit(struct sk_buff *skb
 	struct veth_net_stats *stats, *rcv_stats;
 	int length, cpu;
 
-	skb_orphan(skb);
-
 	priv = netdev_priv(dev);
 	rcv = priv->peer;
 	rcv_priv = netdev_priv(rcv);
diff --git a/drivers/net/wireless/libertas/tx.c 
b/drivers/net/wireless/libertas/tx.c
--- a/drivers/net/wireless/libertas/tx.c
+++ b/drivers/net/wireless/libertas/tx.c
@@ -154,9 +154,6 @@  int lbs_hard_start_xmit(struct sk_buff *
 	if (priv->monitormode) {
 		/* Keep the skb to echo it back once Tx feedback is
 		   received from FW */
-		skb_orphan(skb);
-
-		/* Keep the skb around for when we get feedback */
 		priv->currenttxskb = skb;
 	} else {
  free:
diff --git a/net/core/dev.c b/net/core/dev.c
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1677,6 +1677,10 @@  int dev_hard_start_xmit(struct sk_buff *
 	const struct net_device_ops *ops = dev->netdev_ops;
 	int rc;
 
+	/* Call destructor here.  It's SMP-cache-friendly and avoids issues
+	 * with drivers which can hold transmitted skbs for long times */
+	skb_orphan(skb);
+
 	if (likely(!skb->next)) {
 		if (!list_empty(&ptype_all))
 			dev_queue_xmit_nit(skb, dev);
@@ -1688,22 +1692,7 @@  int dev_hard_start_xmit(struct sk_buff *
 				goto gso;
 		}
 
-		rc = ops->ndo_start_xmit(skb, dev);
-		/*
-		 * TODO: if skb_orphan() was called by
-		 * dev->hard_start_xmit() (for example, the unmodified
-		 * igb driver does that; bnx2 doesn't), then
-		 * skb_tx_software_timestamp() will be unable to send
-		 * back the time stamp.
-		 *
-		 * How can this be prevented? Always create another
-		 * reference to the socket before calling
-		 * dev->hard_start_xmit()? Prevent that skb_orphan()
-		 * does anything in dev->hard_start_xmit() by clearing
-		 * the skb destructor before the call and restoring it
-		 * afterwards, then doing the skb_orphan() ourselves?
-		 */
-		return rc;
+		return ops->ndo_start_xmit(skb, dev);
 	}
 
 gso: