Patchwork [RFC] net: release dst entry in dev_queue_xmit()

login
register
mail settings
Submitter Eric Dumazet
Date March 25, 2009, 6:41 p.m.
Message ID <49CA7AD7.8040401@cosmosbay.com>
Download mbox | patch
Permalink /patch/25098/
State RFC
Delegated to: David Miller
Headers show

Comments

Eric Dumazet - March 25, 2009, 6:41 p.m.
Jarek Poplawski a écrit :
> David Miller wrote, On 03/25/2009 08:17 AM:
> 
>> From: Eric Dumazet <dada1@cosmosbay.com>
>> Date: Wed, 25 Mar 2009 08:13:30 +0100
>>
>>> If done in dev_hard_start_xmit(), skb could be requeued (because of
>>> NETDEV_TX_BUSY).  Then if requeued, maybe at this time, dst being
>>> NULL is not a problem ?
>> Usually it should be OK because the packet schedulers have
>> a sort-of one-behind state that allows them to reinsert
>> the SKB into their queue datastructures without reclassifying.
> 
> 
> Actually, since David has dumped requeuing there is no reinserting;
> this last one "requeued" skb is buffered at the top in q->gso_skb
> and waiting for better times.

Yes indeed, this is what I thought too, thanks Jarek.

I tested following patch today on my machine, but obviously could not try 
all possible quirks :)

[PATCH] net: release dst entry in dev_hard_start_xmit()

One point of contention in high network loads is the dst_release() performed
when a transmited skb is freed. This is because NIC tx completion calls skb free
long after original call to dev_queue_xmit(skb).

CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is
quite visible if one CPU is 100% handling softirqs for a network device,
since dst_clone() is done by other cpus, involving cache line ping pongs.

I believe we can release dst in dev_hard_start_xmit(), while cpu cache is hot, since
caller of dev_queue_xmit() had to hold a reference on dst right before. This reduce
work to be done by softirq handler, and decrease cache misses.

I also believe only pktgen can call dev_queue_xmit() with skb which have
a skb->users != 1. But pkthen skbs have a NULL dst entry.

I added a WARN_ON_ONCE() to catch other cases, and not release skb->dst
if skb->users != 1

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski - March 25, 2009, 7:18 p.m.
On Wed, Mar 25, 2009 at 07:41:27PM +0100, Eric Dumazet wrote:
> Jarek Poplawski a écrit :
> > David Miller wrote, On 03/25/2009 08:17 AM:
> > 
> >> From: Eric Dumazet <dada1@cosmosbay.com>
> >> Date: Wed, 25 Mar 2009 08:13:30 +0100
> >>
> >>> If done in dev_hard_start_xmit(), skb could be requeued (because of
> >>> NETDEV_TX_BUSY).  Then if requeued, maybe at this time, dst being
> >>> NULL is not a problem ?
> >> Usually it should be OK because the packet schedulers have
> >> a sort-of one-behind state that allows them to reinsert
> >> the SKB into their queue datastructures without reclassifying.
> > 
> > 
> > Actually, since David has dumped requeuing there is no reinserting;
> > this last one "requeued" skb is buffered at the top in q->gso_skb
> > and waiting for better times.
> 
> Yes indeed, this is what I thought too, thanks Jarek.

Alas I'm a bit concerned with virtual devs, e.g. now I'm looking at
xmits in macvlan and pppoe. Maybe this patch should exclude them?

Jarek P.

> 
> I tested following patch today on my machine, but obviously could not try 
> all possible quirks :)
> 
> [PATCH] net: release dst entry in dev_hard_start_xmit()
> 
> One point of contention in high network loads is the dst_release() performed
> when a transmited skb is freed. This is because NIC tx completion calls skb free
> long after original call to dev_queue_xmit(skb).
> 
> CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is
> quite visible if one CPU is 100% handling softirqs for a network device,
> since dst_clone() is done by other cpus, involving cache line ping pongs.
> 
> I believe we can release dst in dev_hard_start_xmit(), while cpu cache is hot, since
> caller of dev_queue_xmit() had to hold a reference on dst right before. This reduce
> work to be done by softirq handler, and decrease cache misses.
> 
> I also believe only pktgen can call dev_queue_xmit() with skb which have
> a skb->users != 1. But pkthen skbs have a NULL dst entry.
> 
> I added a WARN_ON_ONCE() to catch other cases, and not release skb->dst
> if skb->users != 1
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index e3fe5c7..a622db6 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1664,6 +1664,26 @@ static int dev_gso_segment(struct sk_buff *skb)
>  	return 0;
>  }
>  
> +
> +/*
> + * Release dst while its refcnt is likely hot in CPU cache, instead
> + * of waiting NIC tx completion.
> + * We inline dst_release() code for performance reason
> + */
> +static void release_skb_dst(struct sk_buff *skb)
> +{
> +	if (likely(skb->dst)) {
> +		if (!WARN_ON_ONCE(atomic_read(&skb->users) != 1)) {
> +			int newrefcnt;
> +
> +			smp_mb__before_atomic_dec();
> +			newrefcnt = atomic_dec_return(&skb->dst->__refcnt);
> +			WARN_ON(newrefcnt < 0);
> +			skb->dst = NULL;
> +		}
> +	}
> +}
> +
>  int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
>  			struct netdev_queue *txq)
>  {
> @@ -1681,6 +1701,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
>  				goto gso;
>  		}
>  
> +		release_skb_dst(skb);
>  		return ops->ndo_start_xmit(skb, dev);
>  	}
>  
> @@ -1691,6 +1712,7 @@ gso:
>  
>  		skb->next = nskb->next;
>  		nskb->next = NULL;
> +		release_skb_dst(nskb);
>  		rc = ops->ndo_start_xmit(nskb, dev);
>  		if (unlikely(rc)) {
>  			nskb->next = skb->next;
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - March 25, 2009, 7:40 p.m.
Jarek Poplawski a écrit :
> On Wed, Mar 25, 2009 at 07:41:27PM +0100, Eric Dumazet wrote:
>> Jarek Poplawski a écrit :
>>> David Miller wrote, On 03/25/2009 08:17 AM:
>>>
>>>> From: Eric Dumazet <dada1@cosmosbay.com>
>>>> Date: Wed, 25 Mar 2009 08:13:30 +0100
>>>>
>>>>> If done in dev_hard_start_xmit(), skb could be requeued (because of
>>>>> NETDEV_TX_BUSY).  Then if requeued, maybe at this time, dst being
>>>>> NULL is not a problem ?
>>>> Usually it should be OK because the packet schedulers have
>>>> a sort-of one-behind state that allows them to reinsert
>>>> the SKB into their queue datastructures without reclassifying.
>>>
>>> Actually, since David has dumped requeuing there is no reinserting;
>>> this last one "requeued" skb is buffered at the top in q->gso_skb
>>> and waiting for better times.
>> Yes indeed, this is what I thought too, thanks Jarek.
> 
> Alas I'm a bit concerned with virtual devs, e.g. now I'm looking at
> xmits in macvlan and pppoe. Maybe this patch should exclude them?
> 

Yes, MACVLAN :) its macvlan_start_xmit() function calls
dev_queue_xmit(skb) again, so we go back to packet schedulers and
classifiers, they might need dst again :(

Only other potential problem I found was in 
drivers/net/appletalk/ipddp.c

static int ipddp_xmit(struct sk_buff *skb, struct net_device *dev)
{
__be32 paddr = ((struct rtable*)skb->dst)->rt_gateway;

Not sure this driver is still supported, or if this paddr can be found elsewhere...
 __sk_dst_get(skb->sk) ???

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski - March 25, 2009, 7:54 p.m.
On Wed, Mar 25, 2009 at 08:40:05PM +0100, Eric Dumazet wrote:
> Jarek Poplawski a écrit :
...
> > Alas I'm a bit concerned with virtual devs, e.g. now I'm looking at
> > xmits in macvlan and pppoe. Maybe this patch should exclude them?
> > 
> 
> Yes, MACVLAN :) its macvlan_start_xmit() function calls
> dev_queue_xmit(skb) again, so we go back to packet schedulers and
> classifiers, they might need dst again :(
> 
> Only other potential problem I found was in 
> drivers/net/appletalk/ipddp.c
> 
> static int ipddp_xmit(struct sk_buff *skb, struct net_device *dev)
> {
> __be32 paddr = ((struct rtable*)skb->dst)->rt_gateway;
> 
> Not sure this driver is still supported, or if this paddr can be found elsewhere...
>  __sk_dst_get(skb->sk) ???
> 

I guess you've considered the loopback too...

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - March 25, 2009, 8:28 p.m.
Jarek Poplawski a écrit :

> I guess you've considered the loopback too...
> 

I had for the first patch (carefuly avoiding loopback being not responsive) :
I was releasing dst only if enqueue() was called.

You are right, loopback doesnt work anymore with last patch,
and I have no idea why...

Do you know why ?

Thank you

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski - March 25, 2009, 9:12 p.m.
On Wed, Mar 25, 2009 at 09:28:37PM +0100, Eric Dumazet wrote:
> Jarek Poplawski a écrit :
> 
> > I guess you've considered the loopback too...
> > 
> 
> I had for the first patch (carefuly avoiding loopback being not responsive) :
> I was releasing dst only if enqueue() was called.
> 
> You are right, loopback doesnt work anymore with last patch,
> and I have no idea why...
> 
> Do you know why ?

Hmm.. isn't this test for dst == NULL in ip_rcv_finish expected to be
negative for loopback source?

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Patrick McHardy - March 25, 2009, 9:20 p.m.
Jarek Poplawski wrote:
> On Wed, Mar 25, 2009 at 09:28:37PM +0100, Eric Dumazet wrote:
>   
>> Jarek Poplawski a écrit :
>>
>>     
>>> I guess you've considered the loopback too...
>>>
>>>       
>> I had for the first patch (carefuly avoiding loopback being not responsive) :
>> I was releasing dst only if enqueue() was called.
>>
>> You are right, loopback doesnt work anymore with last patch,
>> and I have no idea why...
>>
>> Do you know why ?
>>     
>
> Hmm.. isn't this test for dst == NULL in ip_rcv_finish expected to be
> negative for loopback source?

ip_route_input() doesn't accept loopback addresses, so loopback packets
already need to have a dst_entry attached.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..a622db6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1664,6 +1664,26 @@  static int dev_gso_segment(struct sk_buff *skb)
 	return 0;
 }
 
+
+/*
+ * Release dst while its refcnt is likely hot in CPU cache, instead
+ * of waiting NIC tx completion.
+ * We inline dst_release() code for performance reason
+ */
+static void release_skb_dst(struct sk_buff *skb)
+{
+	if (likely(skb->dst)) {
+		if (!WARN_ON_ONCE(atomic_read(&skb->users) != 1)) {
+			int newrefcnt;
+
+			smp_mb__before_atomic_dec();
+			newrefcnt = atomic_dec_return(&skb->dst->__refcnt);
+			WARN_ON(newrefcnt < 0);
+			skb->dst = NULL;
+		}
+	}
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 			struct netdev_queue *txq)
 {
@@ -1681,6 +1701,7 @@  int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 				goto gso;
 		}
 
+		release_skb_dst(skb);
 		return ops->ndo_start_xmit(skb, dev);
 	}
 
@@ -1691,6 +1712,7 @@  gso:
 
 		skb->next = nskb->next;
 		nskb->next = NULL;
+		release_skb_dst(nskb);
 		rc = ops->ndo_start_xmit(nskb, dev);
 		if (unlikely(rc)) {
 			nskb->next = skb->next;