High contention on the sk_buff_head.lock

Submitter: Eric Dumazet
Date: March 18, 2009, 7:07 p.m.
Message ID: <49C14667.2040806@cosmosbay.com>
Permalink: /patch/24623/
State: RFC
Delegated to: David Miller

Comments

Eric Dumazet - March 18, 2009, 7:07 p.m.
Vernon Mauery wrote:
> I have been beating on network throughput in the -rt kernel for some time
> now.  After digging down through the send path of UDP packets, I found
> that the sk_buff_head.lock is under some very high contention.  This lock
> is acquired each time a packet is enqueued on a qdisc and then acquired
> again to dequeue the packet.  Under high networking loads, the enqueueing
> processes are not only contending among each other for the lock, but also
> with the net-tx soft irq.  This makes for some very high contention on this
> one lock.  My testcase is running varying numbers of concurrent netperf
> instances pushing UDP traffic to another machine.  As the count goes from
> 1 to 2, the network performance increases.  But from 2 to 4 and from 4 to 8,
> we see a big decline, with 8 instances pushing about half of what a single
> thread can do.
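
The pattern described above boils down to a single spinlock shared by every
sender and the net-tx softirq.  A minimal sketch of that pattern (simplified;
the real paths are dev_queue_xmit() and __qdisc_run(), and the sketch_* names
are hypothetical):

	#include <linux/skbuff.h>
	#include <linux/spinlock.h>

	/* Every sender takes q->lock to enqueue... */
	static void sketch_enqueue(struct sk_buff_head *q, struct sk_buff *skb)
	{
		unsigned long flags;

		spin_lock_irqsave(&q->lock, flags);	/* contended by all senders */
		__skb_queue_tail(q, skb);
		spin_unlock_irqrestore(&q->lock, flags);
	}

	/* ...and the net-tx softirq takes the same lock to dequeue. */
	static struct sk_buff *sketch_dequeue(struct sk_buff_head *q)
	{
		unsigned long flags;
		struct sk_buff *skb;

		spin_lock_irqsave(&q->lock, flags);	/* same lock, once per packet */
		skb = __skb_dequeue(q);
		spin_unlock_irqrestore(&q->lock, flags);
		return skb;
	}

With N senders plus the softirq each taking q->lock once per packet, the lock
cache line bounces between CPUs on every enqueue/dequeue pair, which is what
the con-bounces numbers below reflect.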
> 
> Running 2.6.29-rc6-rt3 on an 8-way machine with a 10GbE card (I have tried
> both NetXen and Broadcom, with very similar results), I can only push about
> 1200 Mb/s, whereas with the mainline 2.6.29-rc8 kernel I can push nearly
> 6000 Mb/s, though still not as much as I think is possible.  I was curious and
> decided to see if the mainline kernel was hitting the same lock, and using
> /proc/lock_stat, it is hitting the sk_buff_head.lock as well (it was the
> number one contended lock).
> 
> So while this issue really hits -rt kernels hard, it has a real effect on
> mainline kernels as well.  The contention of the spinlocks is amplified
> when they get turned into rt-mutexes, which causes a double context switch.
> 
> Below is the top of the lock_stat for 2.6.29-rc8.  This was captured from
> a 1-minute network stress test.  The next highest contender had 2 orders of
> magnitude fewer contentions.  Think of the throughput increase if we could
> ease this contention a bit.  We might even be able to saturate a 10GbE
> link.
> 
> lock_stat version 0.3
> -----------------------------------------------------------------------------------------------------------------------------------
>  class name      con-bounces  contentions  waittime-min  waittime-max  waittime-total  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total
> -----------------------------------------------------------------------------------------------------------------------------------
>  &list->lock#3:     24517307     24643791          0.71       1286.62     56516392.42     34834296      44904018          0.60        164.79     31314786.02
>  -------------
>  &list->lock#3      15596927  [<ffffffff812474da>] dev_queue_xmit+0x2ea/0x468
>  &list->lock#3       9046864  [<ffffffff812546e9>] __qdisc_run+0x11b/0x1ef
>  -------------
>  &list->lock#3       6525300  [<ffffffff812546e9>] __qdisc_run+0x11b/0x1ef
>  &list->lock#3      18118491  [<ffffffff812474da>] dev_queue_xmit+0x2ea/0x468
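
For reference, the capture cycle for numbers like these (on a kernel built
with CONFIG_LOCK_STAT=y) is: clear /proc/lock_stat, run the workload, then
read the table back.  A minimal userspace sketch; the netperf invocation and
the peer hostname "peer" are placeholders:

	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		char line[512];
		FILE *f;

		/* equivalent of: echo 0 > /proc/lock_stat, resetting counters */
		f = fopen("/proc/lock_stat", "w");
		if (!f) { perror("clear lock_stat"); return 1; }
		fputs("0\n", f);
		fclose(f);

		/* run the 1-minute UDP stress test ("peer" is a placeholder) */
		system("netperf -t UDP_STREAM -H peer -l 60 > /dev/null");

		/* dump the top of the contention table */
		f = fopen("/proc/lock_stat", "r");
		if (!f) { perror("read lock_stat"); return 1; }
		for (int i = 0; i < 20 && fgets(line, sizeof(line), f); i++)
			fputs(line, stdout);
		fclose(f);
		return 0;
	}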
> 
> 
> The story is the same for -rt kernels, only the waittime and holdtime are
> both orders of magnitude greater.
> 
> I am not exactly clear on the solution, but if I understand correctly, in
> the past there has been some discussion of batched enqueueing and
> dequeueing.  Is anyone else working on this problem right now who has just
> not yet posted anything for review?  Questions, comments, flames?
> 

Yes, we have a known contention point here, but before adding more complex
code, could you try the following patch, please?

[PATCH] net: Reorder fields of struct Qdisc

dev_queue_xmit() needs to dirty the fields "state" and "q".

On x86_64, these fields currently span two cache lines, causing more
cache line ping-pong than necessary.

Before patch:

offsetof(struct Qdisc, state)=0x38
offsetof(struct Qdisc, q)=0x48
offsetof(struct Qdisc, dev_queue)=0x60

After patch:

offsetof(struct Qdisc, dev_queue)=0x38
offsetof(struct Qdisc, state)=0x48
offsetof(struct Qdisc, q)=0x50


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
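
The offsets in the changelog can be reproduced with a throwaway helper like
the one below (a hypothetical debugging aid, not part of the patch; exact
values depend on architecture and config):

	#include <linux/kernel.h>
	#include <net/sch_generic.h>

	/* Print the field offsets that determine which cache lines
	 * dev_queue_xmit() touches.  With 64-byte lines, 0x38 and 0x48
	 * fall on different lines, while 0x48 and 0x50 share one. */
	static void print_qdisc_offsets(void)
	{
		pr_info("offsetof(struct Qdisc, state)=0x%zx\n",
			offsetof(struct Qdisc, state));
		pr_info("offsetof(struct Qdisc, q)=0x%zx\n",
			offsetof(struct Qdisc, q));
		pr_info("offsetof(struct Qdisc, dev_queue)=0x%zx\n",
			offsetof(struct Qdisc, dev_queue));
	}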



Vernon Mauery - March 18, 2009, 8:17 p.m.
Eric Dumazet wrote:
> Vernon Mauery wrote:
>> I have been beating on network throughput in the -rt kernel for some time
>> now.  After digging down through the send path of UDP packets, I found
>> that the sk_buff_head.lock is under some very high contention.
>> [full report snipped; quoted in full above]
> 
> Yes, we have a known contention point here, but before adding more complex
> code, could you try the following patch, please?

This patch does seem to reduce the number of contentions by about 10%.  That is
a good start (and a good catch on the cacheline bounces).  But, as I mentioned
above, this lock still has 2 orders of magnitude more contention than the
next lock, so even a 10% decrease makes little difference in the overall
contention characteristics.

So we will have to do something more.  Whether it needs to be more complex or
not is still up in the air.  Batched enqueueing and batched dequeueing are
just two options, and the former would be a *lot* less complex than the latter.
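
To make the batched-dequeue idea concrete, here is a rough sketch (not a
patch; sketch_xmit_one() is a hypothetical stand-in for the per-packet
transmit path): take the queue lock once, splice the whole backlog onto a
private list, and transmit from that list with the lock dropped.

	#include <linux/skbuff.h>
	#include <linux/spinlock.h>

	static void sketch_xmit_one(struct sk_buff *skb);	/* hypothetical */

	static void sketch_batch_dequeue(struct sk_buff_head *q)
	{
		struct sk_buff_head batch;
		struct sk_buff *skb;
		unsigned long flags;

		skb_queue_head_init(&batch);

		/* one lock round-trip drains the whole queue... */
		spin_lock_irqsave(&q->lock, flags);
		skb_queue_splice_init(q, &batch);
		spin_unlock_irqrestore(&q->lock, flags);

		/* ...and transmission proceeds without holding q->lock */
		while ((skb = __skb_dequeue(&batch)) != NULL)
			sketch_xmit_one(skb);
	}

This amortizes the lock acquisition over however many packets the senders
managed to queue, instead of paying one acquisition per packet on the
dequeue side.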

If anyone else has any ideas they have been holding back, now would be a great
time to get them out in the open.

--Vernon

> [PATCH] net: Reorder fields of struct Qdisc
> [changelog and diff snipped; the patch appears in full at the end of this page]


Patch

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f8c4742..e24feeb 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -51,10 +51,11 @@ struct Qdisc
 	u32			handle;
 	u32			parent;
 	atomic_t		refcnt;
-	unsigned long		state;
+	struct netdev_queue	*dev_queue;
+
 	struct sk_buff		*gso_skb;
+	unsigned long		state;
 	struct sk_buff_head	q;
-	struct netdev_queue	*dev_queue;
 	struct Qdisc		*next_sched;
 	struct list_head	list;