[net-next-2.6] net: Xmit Packet Steering (XPS)

Message ID 4B05D8DC.7020907@gmail.com
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Nov. 19, 2009, 11:46 p.m. UTC
Here is first version of XPS.

Goal of XPS is to free TX completed skbs by the cpu that submitted the transmit.

Because I chose to union skb->iif with skb->sending_cpu, I chose
to introduce a new xps_consume_skb(skb), and not generalize consume_skb() itself.

This means that selected drivers must use new function to benefit from XPS

Preliminary tests are quite good, especially on NUMA machines.

Only NAPI drivers can use this new infrastructure (xps_consume_skb() cannot
be called from hardirq context, only from softirq)

I converted tg3 and pktgen for my tests
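
For illustration, a converted NAPI TX-completion handler would look like
this (all foo_* names are hypothetical; the only real change is the
xps_consume_skb() call, as in the tg3 hunk below) :

	/* runs from NAPI poll (softirq context), as XPS requires */
	static void foo_tx_clean(struct foo_napi *fnapi)
	{
		struct sk_buff *skb;

		while ((skb = foo_next_completed_skb(fnapi)) != NULL) {
			foo_unmap_dma(fnapi, skb);	/* hypothetical DMA unmap */
			xps_consume_skb(skb);		/* was dev_kfree_skb(skb) */
		}
	}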

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 drivers/net/tg3.c      |    2
 include/linux/skbuff.h |   14 +++
 net/core/Makefile      |    1
 net/core/dev.c         |    8 +-
 net/core/pktgen.c      |    1
 net/core/xps.c         |  145 +++++++++++++++++++++++++++++++++++++++
 6 files changed, 165 insertions(+), 6 deletions(-)


Comments

Changli Gao Nov. 20, 2009, 2:12 a.m. UTC | #1
On Fri, Nov 20, 2009 at 7:46 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 9977288..9e134f6 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2000,6 +2001,7 @@ gso:
>         */
>        rcu_read_lock_bh();
>
> +       skb->sending_cpu = cpu = smp_processor_id();
>        txq = dev_pick_tx(dev, skb);
>        q = rcu_dereference(txq->qdisc);

I think assigning cpu to skb->sending_cpu just before calling
hard_start_xmit is better, because the CPU which dequeues the skb will
be another one.

>
> @@ -2024,8 +2026,6 @@ gso:
>           Either shot noqueue qdisc, it is even simpler 8)
>         */
>        if (dev->flags & IFF_UP) {
> -               int cpu = smp_processor_id(); /* ok because BHs are off */
> -
>                if (txq->xmit_lock_owner != cpu) {
>
>                        HARD_TX_LOCK(dev, txq, cpu);
> @@ -2967,7 +2967,7 @@ static void net_rx_action(struct softirq_action *h)
>        }
>  out:
>        local_irq_enable();
> -
> +       xps_flush();

If there aren't any new skbs, the memory will be held forever. I know
you want to eliminate unnecessary IPIs; how about sending an IPI only when
the remote xps_pcpu_queues change from empty to nonempty?
Eric Dumazet Nov. 20, 2009, 4:58 a.m. UTC | #2
Changli Gao wrote:
> On Fri, Nov 20, 2009 at 7:46 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 9977288..9e134f6 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -2000,6 +2001,7 @@ gso:
>>         */
>>        rcu_read_lock_bh();
>>
>> +       skb->sending_cpu = cpu = smp_processor_id();
>>        txq = dev_pick_tx(dev, skb);
>>        q = rcu_dereference(txq->qdisc);
> 
> I think assigning cpu to skb->sending_cpu just before calling
> hard_start_xmit is better, because the CPU which dequeues the skb will
> be another one.

I want to record the application CPU, because I want the application CPU
to call sock_wfree(), not the CPU that happened to dequeue skb to transmit it
in case of txq contention.

> 
>> @@ -2024,8 +2026,6 @@ gso:
>>           Either shot noqueue qdisc, it is even simpler 8)
>>         */
>>        if (dev->flags & IFF_UP) {
>> -               int cpu = smp_processor_id(); /* ok because BHs are off */
>> -
>>                if (txq->xmit_lock_owner != cpu) {
>>
>>                        HARD_TX_LOCK(dev, txq, cpu);
>> @@ -2967,7 +2967,7 @@ static void net_rx_action(struct softirq_action *h)
>>        }
>>  out:
>>        local_irq_enable();
>> -
>> +       xps_flush();
> 
> If there aren't any new skbs, the memory will be held forever. I know
> you want to eliminate unnecessary IPIs; how about sending an IPI only when
> the remote xps_pcpu_queues change from empty to nonempty?

I dont understand your remark, and dont see the problem, yet.

I send IPI only on cpus I know I have at least one skb queued for them.
For each cpu taking TX completion interrupts I have :

One bitmask (xps_cpus) of cpus I will eventually send IPI at end of net_rx_action()

One array of skb lists per remote cpu, allocated on cpu node memory, thanks
to __alloc_percpu() at boot time.

I say _eventually_ because the algo is :

+		if (cpu_online(cpu)) {
+			spin_lock(&q->list.lock);
+			prevlen = skb_queue_len(&q->list);
+			skb_queue_splice_init(&head[cpu], &q->list);
+			spin_unlock(&q->list.lock);
+			/*
+			 * We hope remote cpu will be fast enough to transfer
+			 * this list to its completion queue before our
+			 * next xps_flush() call
+			 */
+			if (!prevlen)
+				__smp_call_function_single(cpu, &q->csd, 0);
+			continue;

So I send an IPI only if needed, once for the whole skb list.

With my pktgen (no skb cloning setup) tests, and

ethtool -C eth3 tx-usecs 1000 tx-frames 100

I really saw batches of 100 frames given from CPU X (NIC interrupts) to CPU Y (pktgen cpu)

What memory is held forever ?
Eric Dumazet Nov. 20, 2009, 5:08 a.m. UTC | #3
Tom Herbert wrote:
> 
> 
> On Thu, Nov 19, 2009 at 3:46 PM, Eric Dumazet <eric.dumazet@gmail.com
> <mailto:eric.dumazet@gmail.com>> wrote:
> 
>     Here is first version of XPS.
> 
> Very cool!  The infrastructure to move lists of skb's from one cpu to
> another and the IPI to kick processing look like something that could be
> consolidated between rps and xps :-)

Sure, but RPS seems quite slow to integrate (no offense Tom) :)
I wanted to cook a patch on top of yours, but got no sign you were
about to release RPS version 4 soon.

> 
>     Goal of XPS is to free TX completed skbs by the cpu that submitted
>     the transmit.
> 
>     Because I chose to union skb->iif with skb->sending_cpu, I chose
>     to introduce a new xps_consume_skb(skb), and not generalize
>     consume_skb() itself.
> 
>     This means that selected drivers must use new function to benefit
>     from XPS
> 
> 
> Is this better than modifying consume_skb so this can be used by any driver?

consume_skb() is also used by RX side, and this side doesnt want XPS.

Adding a flag in skb to differentiate the use might be possible,
but we add a new test in hot paths...
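
For reference, a sketch of that rejected approach (the xps_ok bit is
hypothetical, not in this patch); note the extra test in the common
free path :

	void consume_skb(struct sk_buff *skb)
	{
		if (skb->xps_ok) {			/* hypothetical skb bit */
			xps_consume_skb(skb);		/* TX completion path */
			return;
		}
		if (atomic_dec_and_test(&skb->users))	/* RX side, as today */
			__kfree_skb(skb);
	}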

>  
> 
>      
> 
>     Preliminary tests are quite good, especially on NUMA machines.
> 
>     Only NAPI drivers can use this new infrastructure (xps_consume_skb()
>     cannot
>     be called from hardirq context, only from softirq)
> 
> Is this a strict requirement, especially considering devices with
> separate TX interrupts?  For a hardirq we could just put the skb on
> a local percpu queue and schedule a softirq to do the IPI.

I chose this way because any sane and up2date driver should really
not use hardirq TX completion. I want fast processing, without
masking local interrupts in xps_consume_skb(), and without testing
our context.

Note : if TX completion and RX are run in different NAPI contexts,
there is no problem using xps_consume_skb().
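
For reference, a sketch of the hardirq variant you describe (hypothetical,
not in this patch) :

	void xps_consume_skb_any(struct sk_buff *skb)
	{
		unsigned long flags;

		/* every other access to the per-cpu queues would then
		 * need irq masking too - the cost I want to avoid
		 */
		local_irq_save(flags);
		xps_consume_skb(skb);
		/* net_tx_action() would also need to call xps_flush() */
		raise_softirq_irqoff(NET_TX_SOFTIRQ);
		local_irq_restore(flags);
	}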


Thanks for reviewing, Tom.
Changli Gao Nov. 20, 2009, 5:11 a.m. UTC | #4
On Fri, Nov 20, 2009 at 12:58 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Changli Gao wrote:
>> On Fri, Nov 20, 2009 at 7:46 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>> index 9977288..9e134f6 100644
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -2000,6 +2001,7 @@ gso:
>>>         */
>>>        rcu_read_lock_bh();
>>>
>>> +       skb->sending_cpu = cpu = smp_processor_id();
>>>        txq = dev_pick_tx(dev, skb);
>>>        q = rcu_dereference(txq->qdisc);
>>
>> I think assigning cpu to skb->sending_cpu just before calling
>> hard_start_xmit is better, because the CPU which dequeues the skb will
>> be another one.
>
> I want to record the application CPU, because I want the application CPU
> to call sock_wfree(), not the CPU that happened to dequeue skb to transmit it
> in case of txq contention.
>

got it.

>>
>>> @@ -2024,8 +2026,6 @@ gso:
>>>           Either shot noqueue qdisc, it is even simpler 8)
>>>         */
>>>        if (dev->flags & IFF_UP) {
>>> -               int cpu = smp_processor_id(); /* ok because BHs are off */
>>> -
>>>                if (txq->xmit_lock_owner != cpu) {
>>>
>>>                        HARD_TX_LOCK(dev, txq, cpu);
>>> @@ -2967,7 +2967,7 @@ static void net_rx_action(struct softirq_action *h)
>>>        }
>>>  out:
>>>        local_irq_enable();
>>> -
>>> +       xps_flush();
>>
>> If there aren't any new skbs, the memory will be held forever. I know
>> you want to eliminate unnecessary IPIs; how about sending an IPI only when
>> the remote xps_pcpu_queues change from empty to nonempty?
>
> I dont understand your remark, and dont see the problem, yet.
>
> I send IPI only on cpus I know I have at least one skb queued for them.
> For each cpu taking TX completion interrupts I have :
>
> One bitmask (xps_cpus) of cpus I will eventually send IPI at end of net_rx_action()
>

You call xps_flush() in net_rx_action(). It means that if no new
packets arrive, xps_flush() will never be called, and the memory
used by skbs will be held forever. Did I misunderstand? Your algorithm
only works with packet forwarding, not with packets sent from local
sockets.
Eric Dumazet Nov. 20, 2009, 5:24 a.m. UTC | #5
Changli Gao wrote:
> You call xps_flush() in net_rx_action(). It means that if no new
> packets arrive, xps_flush() will never be called, and the memory
> used by skbs will be held forever. Did I misunderstand? Your algorithm
> only works with packet forwarding, not with packets sent from local
> sockets.
> 

Please re-read my patch, you misunderstood it, or I dont get you.

If xps_consume_skb(skb) is ever called (from one to XXX times),
then we xps_flush() them, from net_rx_action()

net_rx_action()
{
	while (has_work) {
		perform_napi_things(); // calls xps_consume_skb()
	}
	xps_flush(); // post things to remote cpus, and dont leak memory
}

This net_rx_action() is the same for forwarding and locally generated packets.
Changli Gao Nov. 20, 2009, 5:34 a.m. UTC | #6
On Fri, Nov 20, 2009 at 1:24 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Changli Gao wrote:
>
> Please re-read my patch, you misunderstood it, or I dont get you.

I think I didn't misunderstand it. If a local socket only sends packets
which don't need replies from the receiver, no new NIC RX IRQ or NET_RX
softirq will be triggered. Who will call xps_flush() to free the
memory used by locally generated packets?

>
> If xps_consume_skb(skb) is ever called (from one to XXX times),
> then we xps_flush() them, from net_rx_action()
>
> net_rx_action()
> {
>        while (has_work) {
>                perform_napi_things(); // calls xps_consume_skb()
>        }
>        xps_flush(); // post things to remote cpus, and dont leak memory
> }
>
> This net_rx_action() is the same for forwarding and locally generated packets.
>
Eric Dumazet Nov. 20, 2009, 5:42 a.m. UTC | #7
Changli Gao wrote:
> On Fri, Nov 20, 2009 at 1:24 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Changli Gao wrote:
>>
>> Please re-read my patch, you misunderstood it, or I dont get you.
> 
>> I think I didn't misunderstand it. If a local socket only sends packets
>> which don't need replies from the receiver, no new NIC RX IRQ or NET_RX
>> softirq will be triggered. Who will call xps_flush() to free the
>> memory used by locally generated packets?
> 

Changli, when we transmit a skb on a NIC, the NIC is supposed to have a TX completion
callback to free this skb.

These completion calls are running from net_rx_action(), if driver is NAPI enabled.

Only NAPI enabled drivers are allowed to use XPS infrastructure.

I sent 100.000.000 packets in my pktgen+tg3 tests, without receiving a single packet
in return, I can tell you all packets were correctly freed :)


Changli Gao Nov. 20, 2009, 5:50 a.m. UTC | #8
On Fri, Nov 20, 2009 at 1:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Changli Gao wrote:
>> On Fri, Nov 20, 2009 at 1:24 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> Changli Gao wrote:
>>>
>>> Please re-read my patch, you misunderstood it, or I dont get you.
>>
>> I think I didn't misunderstand it. If a local socket only sends packets
>> which don't need replies from the receiver, no new NIC RX IRQ or NET_RX
>> softirq will be triggered. Who will call xps_flush() to free the
>> memory used by locally generated packets?
>>
>
> Changli, when we transmit a skb on a NIC, the NIC is supposed to have a TX completion
> callback to free this skb.
>
> These completion calls are running from net_rx_action(), if driver is NAPI enabled.
>
> Only NAPI enabled drivers are allowed to use XPS infrastructure.
>
> I sent 100.000.000 packets in my pktgen+tg3 tests, without receiving a single packet
> in return, I can tell you all packets were correctly freed :)
>
>

Oh, I am so sorry. It seems I missed something. I need to review the new
NAPI. Thanks for your patience and help.
Jarek Poplawski Nov. 20, 2009, 1:32 p.m. UTC | #9
On 20-11-2009 00:46, Eric Dumazet wrote:
> Here is first version of XPS.
> 
> Goal of XPS is to free TX completed skbs by the cpu that submitted the transmit.

But why?... OK, you write in another message about sock_wfree(). Then
how about users who don't use sock_wfree() (routers)? Will there be any way
to disable it?

> 
> Because I chose to union skb->iif with skb->sending_cpu, I chose
> to introduce a new xps_consume_skb(skb), and not generalize consume_skb() itself.
> 
> This means that selected drivers must use new function to benefit from XPS
> 
> Preliminary tests are quite good, especially on NUMA machines.
> 
> Only NAPI drivers can use this new infrastructure (xps_consume_skb() cannot
> be called from hardirq context, only from softirq)
> 
> I converted tg3 and pktgen for my tests
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
...
> diff --git a/net/core/xps.c b/net/core/xps.c
> index e69de29..e580159 100644
> --- a/net/core/xps.c
> +++ b/net/core/xps.c
> @@ -0,0 +1,145 @@
> +/*
> + * XPS : Xmit Packet Steering
> + *
> + * TX completion packet freeing is performed on cpu that sent packet.
> + */
> +#if defined(CONFIG_SMP)

Shouldn't it be in the Makefile?

...
> +/*
> + * called at end of net_rx_action()
> + * preemption (and cpu migration/offline/online) disabled
> + */
> +void xps_flush(void)
> +{
> +	int cpu, prevlen;
> +	struct sk_buff_head *head = per_cpu_ptr(xps_array, smp_processor_id());
> +	struct xps_pcpu_queue *q;
> +	struct sk_buff *skb;
> +
> +	for_each_cpu_mask_nr(cpu, __get_cpu_var(xps_cpus)) {
> +		q = &per_cpu(xps_pcpu_queue, cpu);
> +		if (cpu_online(cpu)) {
> +			spin_lock(&q->list.lock);

This lock probably needs irq disabling: let's say 2 cpus run this at
the same time and both are interrupted with these (previously
scheduled) IPIs?

> +			prevlen = skb_queue_len(&q->list);
> +			skb_queue_splice_init(&head[cpu], &q->list);
> +			spin_unlock(&q->list.lock);
> +			/*
> +			 * We hope remote cpu will be fast enough to transfer
> +			 * this list to its completion queue before our
> +			 * next xps_flush() call
> +			 */
> +			if (!prevlen)
> +				__smp_call_function_single(cpu, &q->csd, 0);
> +			continue;
> +		}
> +		/*
> +		 * ok, we must free these skbs, even if we tried to avoid it :)
> +		 */
> +		while ((skb = __skb_dequeue(&head[cpu])) != NULL)
> +			__kfree_skb(skb);
> +	}
> +	cpus_clear(__get_cpu_var(xps_cpus));
> +}
> +
> +/*
> + * called from hardirq (IPI) context
> + */
> +static void remote_free_skb_list(void *arg)
> +{
> +	struct sk_buff *last;
> +	struct softnet_data *sd;
> +	struct xps_pcpu_queue *q = arg; /* &__get_cpu_var(xps_pcpu_queue); */
> +
> +	spin_lock(&q->list.lock);
> +
> +	last = q->list.prev;

Is q->list handled in case this cpu goes down before this IPI is
triggered?

Jarek P.

> +	sd = &__get_cpu_var(softnet_data);
> +	last->next = sd->completion_queue;
> +	sd->completion_queue = q->list.next;
> +	__skb_queue_head_init(&q->list);
> +
> +	spin_unlock(&q->list.lock);
> +
> +	raise_softirq_irqoff(NET_TX_SOFTIRQ);
> +}
...
Eric Dumazet Nov. 20, 2009, 2:45 p.m. UTC | #10
Jarek Poplawski wrote:
> On 20-11-2009 00:46, Eric Dumazet wrote:
>> Here is first version of XPS.
>>
>> Goal of XPS is to free TX completed skbs by the cpu that submitted the transmit.
> 
> But why?... OK, you write in another message about sock_wfree(). Then
> how about users who don't use sock_wfree() (routers)? Will there be any way
> to disable it?


This is open for discussion, but I saw no problem with routing workloads.

sock_wfree() is not that expensive for tcp anyway.
You also have a cost of kfreeing() two blocks of memory per skb, if allocation was done by another cpu.

If this happens to be a problem, we can immediately free the packet if it
has no destructor :

At xmit time, initialize skb->sending_cpu like that

skb->sending_cpu = (skb->destructor) ? smp_processor_id() : 0xFFFF;

to make sure we dont touch too many cache lines at tx completion time.
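
(On the completion side, the existing "remote >= nr_cpu_ids" test in
xps_consume_skb() already routes such 0xFFFF-marked skbs straight to
__kfree_skb().)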


>> +/*
>> + * XPS : Xmit Packet Steering
>> + *
>> + * TX completion packet freeing is performed on cpu that sent packet.
>> + */
>> +#if defined(CONFIG_SMP)
> 
> Shouldn't it be in the Makefile?

It is in the Makefile too, I left it in the prelim code to make it clear this was CONFIG_SMP only.

> 
> ...
>> +/*
>> + * called at end of net_rx_action()
>> + * preemption (and cpu migration/offline/online) disabled
>> + */
>> +void xps_flush(void)
>> +{
>> +	int cpu, prevlen;
>> +	struct sk_buff_head *head = per_cpu_ptr(xps_array, smp_processor_id());
>> +	struct xps_pcpu_queue *q;
>> +	struct sk_buff *skb;
>> +
>> +	for_each_cpu_mask_nr(cpu, __get_cpu_var(xps_cpus)) {
>> +		q = &per_cpu(xps_pcpu_queue, cpu);
>> +		if (cpu_online(cpu)) {
>> +			spin_lock(&q->list.lock);
> 
> This lock probably needs irq disabling: let's say 2 cpus run this at
> the same time and both are interrupted with these (previously
> scheduled) IPIs?

Repeat after me :

lockdep is my friend, lockdep is my friend, lockdep is my friend... :)

Seriously, I must think again about this locking scheme.

>> +static void remote_free_skb_list(void *arg)
>> +{
>> +	struct sk_buff *last;
>> +	struct softnet_data *sd;
>> +	struct xps_pcpu_queue *q = arg; /* &__get_cpu_var(xps_pcpu_queue); */
>> +
>> +	spin_lock(&q->list.lock);
>> +
>> +	last = q->list.prev;
> 
> Is q->list handled in case this cpu goes down before this IPI is
> triggered?


[block migration] (how ? this is the question)

if (cpu_online(cpu)) { 
	give_work_to_cpu(cpu);
	trigger IPI
} else {
	handle_work_ourself()
}

[unblock migration]

The general problem is : what guards against the cpu going offline between the if (cpu_online(cpu))
and the IPI.
I dont know yet, but it seems that disabling preemption is enough to get this
guarantee. This seems strange.

We can add a notifier (or better, call a function from an existing one : dev_cpu_callback()) to
flush this queue when necessary.
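
As a sketch (hypothetical, modelled on what dev_cpu_callback() already
does for the softnet completion queue on CPU_DEAD) :

	/* run on a surviving cpu once oldcpu is dead : free anything
	 * still parked on oldcpu's xps_pcpu_queue
	 */
	static void xps_cpu_dead(unsigned int oldcpu)
	{
		struct xps_pcpu_queue *q = &per_cpu(xps_pcpu_queue, oldcpu);
		struct sk_buff *skb;

		spin_lock_irq(&q->list.lock);
		while ((skb = __skb_dequeue(&q->list)) != NULL)
			__kfree_skb(skb);
		spin_unlock_irq(&q->list.lock);
	}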

Thanks

Jarek Poplawski Nov. 20, 2009, 8:04 p.m. UTC | #11
On Fri, Nov 20, 2009 at 03:45:42PM +0100, Eric Dumazet wrote:
> Jarek Poplawski wrote:
> > On 20-11-2009 00:46, Eric Dumazet wrote:
> >> Here is first version of XPS.
> >>
> >> Goal of XPS is to free TX completed skbs by the cpu that submitted the transmit.
> > 
> > But why?... OK, you write in another message about sock_wfree(). Then
> > how about users who don't use sock_wfree() (routers)? Will there be any way
> > to disable it?
> 
> 
> This is open for discussion, but I saw no problem with routing workloads.

IMHO, it should depend on testing: if you can prove there is a
distinct gain in the "common" use case (which probably isn't a NUMA box
used e.g. for google.com or even kernel.org yet), and no visible
slowdown for such a router, then we could probably forget about
disabling, and look more at optimizations of the fast path.

> 
> sock_wfree() is not that expensive for tcp anyway.
> You also have a cost of kfreeing() two blocks of memory per skb, if allocation was done by another cpu.
> 
> If this happens to be a problem, we can immediately free the packet if it
> has no destructor :
> 
> At xmit time, initialize skb->sending_cpu like that
> 
> skb->sending_cpu = (skb->destructor) ? smp_processor_id() : 0xFFFF;
> 
> to make sure we dont touch too many cache lines at tx completion time.
> 
> 
> >> +/*
> >> + * XPS : Xmit Packet Steering
> >> + *
> >> + * TX completion packet freeing is performed on cpu that sent packet.
> >> + */
> >> +#if defined(CONFIG_SMP)
> > 
> > Shouldn't it be in the Makefile?
> 
> It is in the Makefile too, I left it in the prelim code to make it clear this was CONFIG_SMP only.

Aha! Now it's clear why I made this mistake. ;-)
> 
> > 
> > ...
> >> +/*
> >> + * called at end of net_rx_action()
> >> + * preemption (and cpu migration/offline/online) disabled
> >> + */
> >> +void xps_flush(void)
> >> +{
> >> +	int cpu, prevlen;
> >> +	struct sk_buff_head *head = per_cpu_ptr(xps_array, smp_processor_id());
> >> +	struct xps_pcpu_queue *q;
> >> +	struct sk_buff *skb;
> >> +
> >> +	for_each_cpu_mask_nr(cpu, __get_cpu_var(xps_cpus)) {
> >> +		q = &per_cpu(xps_pcpu_queue, cpu);
> >> +		if (cpu_online(cpu)) {
> >> +			spin_lock(&q->list.lock);
> > 
> > This lock probably needs irq disabling: let's say 2 cpus run this at
> > the same time and both are interrupted with these (previously
> > scheduled) IPIs?
> 
> Repeat after me :
> 
> lockdep is my friend, lockdep is my friend, lockdep is my friend... :)

Hmm... Actually, why did I have to do lockdep's job...
> 
> Seriously, I must think again about this locking scheme.
> 
> >> +static void remote_free_skb_list(void *arg)
> >> +{
> >> +	struct sk_buff *last;
> >> +	struct softnet_data *sd;
> >> +	struct xps_pcpu_queue *q = arg; /* &__get_cpu_var(xps_pcpu_queue); */
> >> +
> >> +	spin_lock(&q->list.lock);
> >> +
> >> +	last = q->list.prev;
> > 
> > Is q->list handled in case this cpu goes down before this IPI is
> > triggered?
> 
> 
> [block migration] (how ? this is the question)
> 
> if (cpu_online(cpu)) { 
> 	give_work_to_cpu(cpu);
> 	trigger IPI
> } else {
> 	handle_work_ourself()
> }
> 
> [unblock migration]
> 
> The general problem is : what guards against the cpu going offline between the if (cpu_online(cpu))
> and the IPI.
> I dont know yet, but it seems that disabling preemption is enough to get this
> guarantee. This seems strange.
> 
> We can add a notifier (or better, call a function from an existing one : dev_cpu_callback()) to
> flush this queue when necessary.

Using dev_cpu_callback() looks quite obvious to me.

Jarek P.
Andi Kleen Nov. 20, 2009, 8:51 p.m. UTC | #12
Eric Dumazet <eric.dumazet@gmail.com> writes:

> Here is first version of XPS.
>
> Goal of XPS is to free TX completed skbs by the cpu that submitted the transmit.
>
> Because I chose to union skb->iif with skb->sending_cpu, I chose
> to introduce a new xps_consume_skb(skb), and not generalize consume_skb() itself.
>
> This means that selected drivers must use new function to benefit from XPS
>
> Preliminary tests are quite good, especially on NUMA machines.
>
> Only NAPI drivers can use this new infrastructure (xps_consume_skb() cannot
> be called from hardirq context, only from softirq)
>
> I converted tg3 and pktgen for my tests

Do you have numbers on this? It seems like a lot of effort to avoid transferring
a few cache lines.

-Andi (who is a bit sceptical and would rather see generic work for this in slab)
David Miller Nov. 20, 2009, 8:53 p.m. UTC | #13
From: Andi Kleen <andi@firstfloor.org>
Date: Fri, 20 Nov 2009 21:51:02 +0100

> -Andi (who is a bit sceptical and would rather see generic work for
> this in slab)

SLAB does this to an extent, but it's the other things about
SKB freeing that are expensive and hit lots of cache misses
in these cases.
Jarek Poplawski Nov. 20, 2009, 8:53 p.m. UTC | #14
Eric Dumazet wrote, On 11/20/2009 12:46 AM:

> Here is first version of XPS.
> 
> Goal of XPS is to free TX completed skbs by the cpu that submitted the transmit.
> 
> Because I chose to union skb->iif with skb->sending_cpu, I chose
> to introduce a new xps_consume_skb(skb), and not generalize consume_skb() itself.

...

> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 63f4742..e8e4795 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -366,7 +366,10 @@ struct sk_buff {
>  	struct nf_bridge_info	*nf_bridge;
>  #endif
>  
> -	int			iif;
> +	union {
> +		int		iif;
> +		int		sending_cpu;
> +	};

...

> diff --git a/net/core/dev.c b/net/core/dev.c
> index 9977288..9e134f6 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1965,6 +1965,7 @@ int dev_queue_xmit(struct sk_buff *skb)
>  	struct netdev_queue *txq;
>  	struct Qdisc *q;
>  	int rc = -ENOMEM;
> +	int cpu;
>  
>  	/* GSO will handle the following emulations directly. */
>  	if (netif_needs_gso(dev, skb))
> @@ -2000,6 +2001,7 @@ gso:
>  	 */
>  	rcu_read_lock_bh();
>  
> +	skb->sending_cpu = cpu = smp_processor_id();

There is one more problem: this will break things like act_mirred + ifb,
and other cases using skb->iif e.g. for filtering on virtual devices on
the xmit path.

Jarek P.
Eric Dumazet Nov. 20, 2009, 9:43 p.m. UTC | #15
Jarek Poplawski wrote:
> On Fri, Nov 20, 2009 at 03:45:42PM +0100, Eric Dumazet wrote:
>> Jarek Poplawski wrote:
>>> On 20-11-2009 00:46, Eric Dumazet wrote:
>>> scheduled) IPIs?
>> Repeat after me :
>>
>> lockdep is my friend, lockdep is my friend, lockdep is my friend... :)
> 
> Hmm... Actually, why did I have to do lockdep's job...

In fact I could not find why irq masking is necessary, and lockdep is
fine with my code and my testing. Care to explain what problem you spotted ?


In fact, I originally used cmpxchg() and xchg() (no spinlock),
but had to find the end of the skb list in the IPI handler, so I chose
to use an sk_buff_head instead to let the IPI handler be as short as possible.

Jarek Poplawski Nov. 20, 2009, 10:08 p.m. UTC | #16
On Fri, Nov 20, 2009 at 10:43:55PM +0100, Eric Dumazet wrote:
> Jarek Poplawski wrote:
> > On Fri, Nov 20, 2009 at 03:45:42PM +0100, Eric Dumazet wrote:
> >> Jarek Poplawski wrote:
> >>> On 20-11-2009 00:46, Eric Dumazet wrote:
> >>> scheduled) IPIs?
> >> Repeat after me :
> >>
> >> lockdep is my friend, lockdep is my friend, lockdep is my friend... :)
> > 
> > Hmm... Actually, why did I have to do lockdep's job...
> 
> In fact I could not find why irq masking is necessary, and lockdep is
> fine with my code and my testing. Care to explain what problem you spotted ?
> 

CPU1						 CPU2
net_rx_action()					 net_rx_action()
 xps_flush()				  	  xps_flush()
  q = &per_cpu(xps_pcpu_queue, cpu2)		   q = &per_cpu(xps_pcpu_queue, cpu1)
  spin_lock(&q->list.lock of cpu2)		   spin_lock(&q->list.lock of cpu1)

<IPI>						 <IPI>
remote_free_skb_list()				 remote_free_skb_list()
waiting on spin_lock(&q->list.lock of cpu1)	 waiting on spin_lock(&q->list.lock of cpu2)


IPIs triggered e.g. by CPU3 (and/or CPU4...) doing net_rx_action as well.
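
A minimal fix, if the spinlock scheme is kept, would be to take the queue
lock with irqs disabled on both sides, so a holder can no longer be
interrupted by the IPI, e.g. in xps_flush() :

	unsigned long flags;

	spin_lock_irqsave(&q->list.lock, flags);
	prevlen = skb_queue_len(&q->list);
	skb_queue_splice_init(&head[cpu], &q->list);
	spin_unlock_irqrestore(&q->list.lock, flags);

(remote_free_skb_list(), already running in hardirq context, can keep a
plain spin_lock().)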

Jarek P.
Eric Dumazet Nov. 20, 2009, 10:21 p.m. UTC | #17
Jarek Poplawski wrote:
> On Fri, Nov 20, 2009 at 10:43:55PM +0100, Eric Dumazet wrote:
>>> Hmm... Actually, why did I have to do lockdep's job...
>> In fact I could not find why irq masking is necessary, and lockdep is
>> fine with my code and my testings. Care to explain what problem you spotted ?
>>
> 
> CPU1						 CPU2
> net_rx_action()					 net_rx_action()
>  xps_flush()				  	  xps_flush()
>   q = &per_cpu(xps_pcpu_queue, cpu2)		   q = &per_cpu(xps_pcpu_queue, cpu1)
>   spin_lock(&q->list.lock of cpu2)		   spin_lock(&q->list.lock of cpu1)
> 
> <IPI>						 <IPI>
> remote_free_skb_list()				 remote_free_skb_list()
> waiting on spin_lock(&q->list.lock of cpu1)	 waiting on spin_lock(&q->list.lock of cpu2)
> 
> 
> IPIs triggered e.g. by CPU3 (and/or CPU4...) doing net_rx_action as well.
> 

:) Now I am convinced :)

Thanks Jarek


Eric Dumazet Nov. 20, 2009, 10:30 p.m. UTC | #18
Andi Kleen wrote:
> 
> Do you have numbers on this? It seems like a lot of effort to avoid transferring
> a few cache lines.

A lot of effort? hmm...

> 
> -Andi (who is a bit sceptical and would rather see generic work for this in slab)
> 

Yes, I know, but slab/slub is already quite optimized :)

Coding XPS was more a way to try to help push RPS in; I did not benchmark it, BTW.

My new dev machine is up, with two E5530 cpus and an 82599EB 10-Gigabit dual-port card :
Pretty amazing :=)

David Miller Nov. 20, 2009, 10:32 p.m. UTC | #19
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 20 Nov 2009 00:46:36 +0100

> Goal of XPS is to free TX completed skbs by the cpu that submitted
> the transmit.
> 
> Because I chose to union skb->iif with skb->sending_cpu, I chose
> to introduce a new xps_consume_skb(skb), and not generalize consume_skb() itself.
> 
> This means that selected drivers must use new function to benefit from XPS
> 
> Preliminary tests are quite good, especially on NUMA machines.
> 
> Only NAPI drivers can use this new infrastructure (xps_consume_skb() cannot
> be called from hardirq context, only from softirq)
> 
> I converted tg3 and pktgen for my tests
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

I like this work.  But as you know it still needs a little bit
more work :-)

Let's also pick a more decent name for the free function since
tons of drivers are going to call this thing.  How about
dev_kfree_tx_skb()? :-)

I see Jarek and you have come to a mutual understanding about the
locking.  Since you need to change it anyway to fix the deadlock,
what about using a netchannel-like scheme to do remote SKB queueing?

PAGE_SIZE queue arrays, lockless access to head and tail pointers, and
if queue is full we local free.

I think that's a reasonable policy and the only detail to work out is
to make sure we never race on the IPI send and thus miss processing
the queue.

What do you think?
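
A rough sketch, just to fix ideas (all names hypothetical; single
producer = the flushing cpu, single consumer = the IPI handler, so head
and tail are each written by one side only) :

	struct xps_ring {
		unsigned int	head;		/* written by producer only */
		unsigned int	tail;		/* written by consumer only */
		struct sk_buff	*skbs[510];	/* ~PAGE_SIZE on 64bit/4K pages */
	};

	/* returns false when full : caller then frees the skb locally */
	static bool xps_ring_put(struct xps_ring *r, struct sk_buff *skb)
	{
		unsigned int next = (r->head + 1) % ARRAY_SIZE(r->skbs);

		if (next == r->tail)
			return false;
		r->skbs[r->head] = skb;
		smp_wmb();		/* publish skb before moving head */
		r->head = next;
		return true;
	}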

Eric Dumazet Nov. 20, 2009, 10:36 p.m. UTC | #20
David Miller wrote:
> 
> I like this work.  But as you know it still needs a little bit
> more work :-)
> 
> Let's also pick a more decent name for the free function since
> tons of drivers are going to call this thing.  How about
> dev_kfree_tx_skb()? :-)
> 
> I see Jarek and you have come to a mutual understanding about the
> locking.  Since you need to change it anyway to fix the deadlock,
> what about using a netchannel-like scheme to do remote SKB queueing?
> 
> PAGE_SIZE queue arrays, lockless access to head and tail pointers, and
> if queue is full we local free.
> 
> I think that's a reasonable policy and the only detail to work out is
> to make sure we never race on the IPI send and thus miss processing
> the queue.
> 
> What do you think?
> 

Those are good ideas David, I'll work on them next week, and do benchmarks as well
before sending a new version.

Thanks
Andi Kleen Nov. 20, 2009, 10:37 p.m. UTC | #21
On Fri, Nov 20, 2009 at 11:30:36PM +0100, Eric Dumazet wrote:
> Andi Kleen wrote:
> > 
> > Do you have numbers on this? It seems like a lot of effort to avoid transferring
> > a few cache lines.
> 
> A lot of effort? hmm...

Well, lots of code at least. Perhaps I'm old-fashioned, but I always like to have
each piece of code justify its complexity.

It seems like a very narrow special case. Or perhaps this is something that
should be a general library function for all allocator users?

> Yes, I know, but slab/slub is already quite optimized :)

Well it still has a lot of problems, other benchmarks suffer too.

-Andi
Eric Dumazet Nov. 21, 2009, 6:58 a.m. UTC | #22
Tom Herbert wrote:
> 
> It probably is a special case, but addresses a real problem.  Like rps,
> this addresses the case where a single queue NIC is the bottleneck on a
> system.  Anything that takes work off the interrupting core and spreads
> it to other cores may alleviate the bottleneck. 
> 


Yes, and this brings back an idea I mentioned earlier in April, for a multicast
problem we had with AthenaCR.

XPS could be 'started' only if the caller runs from ksoftirqd, e.g. when we are
in a stress situation and the additional cost of XPS is nothing compared to the
huge gains of spreading part of the work to more CPUs.

(Costs of XPS include : ~500 bytes of ICACHE, and an IPI)
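
As a sketch (running_from_ksoftirqd() is a hypothetical helper; no such
export exists) :

	static inline void xps_consume_skb_gated(struct sk_buff *skb)
	{
		if (running_from_ksoftirqd())	/* hypothetical : stressed ? */
			xps_consume_skb(skb);	/* steer back to sending cpu */
		else
			consume_skb(skb);	/* not stressed : free locally */
	}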



Patch

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 6e6db95..bc756e6 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -4379,7 +4379,7 @@  static void tg3_tx(struct tg3_napi *tnapi)
 			sw_idx = NEXT_TX(sw_idx);
 		}
 
-		dev_kfree_skb(skb);
+		xps_consume_skb(skb);
 
 		if (unlikely(tx_bug)) {
 			tg3_tx_recover(tp);
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 63f4742..e8e4795 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -366,7 +366,10 @@  struct sk_buff {
 	struct nf_bridge_info	*nf_bridge;
 #endif
 
-	int			iif;
+	union {
+		int		iif;
+		int		sending_cpu;
+	};
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
 #ifdef CONFIG_NET_CLS_ACT
@@ -441,6 +444,15 @@  static inline struct rtable *skb_rtable(const struct sk_buff *skb)
 
 extern void kfree_skb(struct sk_buff *skb);
 extern void consume_skb(struct sk_buff *skb);
+#if defined(CONFIG_SMP)
+extern void xps_consume_skb(struct sk_buff *skb);
+extern void xps_flush(void);
+extern void xps_init(void);
+#else
+#define xps_consume_skb(skb) consume_skb(skb)
+static inline void xps_flush(void) {}
+static inline void xps_init(void) {}
+#endif
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
 				   gfp_t priority, int fclone, int node);
diff --git a/net/core/Makefile b/net/core/Makefile
index 796f46e..eacd3d8 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -19,4 +19,5 @@  obj-$(CONFIG_NET_DMA) += user_dma.o
 obj-$(CONFIG_FIB_RULES) += fib_rules.o
 obj-$(CONFIG_TRACEPOINTS) += net-traces.o
 obj-$(CONFIG_NET_DROP_MONITOR) += drop_monitor.o
+obj-$(CONFIG_SMP) += xps.o
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 9977288..9e134f6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1965,6 +1965,7 @@  int dev_queue_xmit(struct sk_buff *skb)
 	struct netdev_queue *txq;
 	struct Qdisc *q;
 	int rc = -ENOMEM;
+	int cpu;
 
 	/* GSO will handle the following emulations directly. */
 	if (netif_needs_gso(dev, skb))
@@ -2000,6 +2001,7 @@  gso:
 	 */
 	rcu_read_lock_bh();
 
+	skb->sending_cpu = cpu = smp_processor_id();
 	txq = dev_pick_tx(dev, skb);
 	q = rcu_dereference(txq->qdisc);
 
@@ -2024,8 +2026,6 @@  gso:
 	   Either shot noqueue qdisc, it is even simpler 8)
 	 */
 	if (dev->flags & IFF_UP) {
-		int cpu = smp_processor_id(); /* ok because BHs are off */
-
 		if (txq->xmit_lock_owner != cpu) {
 
 			HARD_TX_LOCK(dev, txq, cpu);
@@ -2967,7 +2967,7 @@  static void net_rx_action(struct softirq_action *h)
 	}
 out:
 	local_irq_enable();
-
+	xps_flush();
 #ifdef CONFIG_NET_DMA
 	/*
 	 * There may not be any more sk_buffs coming right now, so push
@@ -5798,7 +5798,7 @@  static int __init net_dev_init(void)
 		queue->backlog.gro_list = NULL;
 		queue->backlog.gro_count = 0;
 	}
-
+	xps_init();
 	dev_boot_phase = 0;
 
 	/* The loopback device is special if any other network devices
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index d38470a..b41b794 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3435,6 +3435,7 @@  static void pktgen_xmit(struct pktgen_dev *pkt_dev)
 			pkt_dev->clone_count--;	/* back out increment, OOM */
 			return;
 		}
+		pkt_dev->skb->sending_cpu = smp_processor_id();
 		pkt_dev->last_pkt_size = pkt_dev->skb->len;
 		pkt_dev->allocated_skbs++;
 		pkt_dev->clone_count = 0;	/* reset counter */
diff --git a/net/core/xps.c b/net/core/xps.c
index e69de29..e580159 100644
--- a/net/core/xps.c
+++ b/net/core/xps.c
@@ -0,0 +1,145 @@ 
+/*
+ * XPS : Xmit Packet Steering
+ *
+ * TX completion packet freeing is performed on cpu that sent packet.
+ */
+#if defined(CONFIG_SMP)
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/kmemcheck.h>
+#include <linux/mm.h>
+#include <linux/interrupt.h>
+#include <linux/in.h>
+#include <linux/inet.h>
+#include <linux/slab.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+
+
+struct xps_pcpu_queue {
+	struct call_single_data csd;
+	struct sk_buff_head	list;
+};
+
+static DEFINE_PER_CPU(cpumask_t, xps_cpus);
+
+static DEFINE_PER_CPU(struct xps_pcpu_queue, xps_pcpu_queue);
+
+static struct sk_buff_head *xps_array; /* nr_cpu_ids elems */
+
+
+
+/* called from softirq context only */
+void xps_consume_skb(struct sk_buff *skb)
+{
+	unsigned int remote;
+	int thiscpu;
+	struct sk_buff_head *head;
+	/*
+	 * Might be stupid to dirty this cache line, but might
+	 * also be stupid to send an IPI if skb is not to be freed :(
+	 * One solution to this problem would be to move ->users in first cache line
+	 * (shared with ->next & ->sending_cpu fields ), so that this cpu dirties
+	 * only one cache line per queued skb.
+	 */
+	if (!atomic_dec_and_test(&skb->users))
+		return;
+
+	remote = skb->sending_cpu;
+	thiscpu = smp_processor_id();
+	if (remote >= nr_cpu_ids || remote == thiscpu) {
+		__kfree_skb(skb);
+		return;
+	}
+	head = &per_cpu_ptr(xps_array, thiscpu)[remote];
+
+	__skb_queue_head(head, skb);
+
+	/* IPI to remote processor will be sent later by xps_flush(),
+	 * to coalesce as much as possible skbs
+	 */
+	cpu_set(remote, __get_cpu_var(xps_cpus));
+}
+EXPORT_SYMBOL(xps_consume_skb);
+
+/*
+ * called at end of net_rx_action()
+ * preemption (and cpu migration/offline/online) disabled
+ */
+void xps_flush(void)
+{
+	int cpu, prevlen;
+	struct sk_buff_head *head = per_cpu_ptr(xps_array, smp_processor_id());
+	struct xps_pcpu_queue *q;
+	struct sk_buff *skb;
+
+	for_each_cpu_mask_nr(cpu, __get_cpu_var(xps_cpus)) {
+		q = &per_cpu(xps_pcpu_queue, cpu);
+		if (cpu_online(cpu)) {
+			spin_lock(&q->list.lock);
+			prevlen = skb_queue_len(&q->list);
+			skb_queue_splice_init(&head[cpu], &q->list);
+			spin_unlock(&q->list.lock);
+			/*
+			 * We hope remote cpu will be fast enough to transfer
+			 * this list to its completion queue before our
+			 * next xps_flush() call
+			 */
+			if (!prevlen)
+				__smp_call_function_single(cpu, &q->csd, 0);
+			continue;
+		}
+		/*
+		 * ok, we must free these skbs, even if we tried to avoid it :)
+		 */
+		while ((skb = __skb_dequeue(&head[cpu])) != NULL)
+			__kfree_skb(skb);
+	}
+	cpus_clear(__get_cpu_var(xps_cpus));
+}
+
+/*
+ * called from hardirq (IPI) context
+ */
+static void remote_free_skb_list(void *arg)
+{
+	struct sk_buff *last;
+	struct softnet_data *sd;
+	struct xps_pcpu_queue *q = arg; /* &__get_cpu_var(xps_pcpu_queue); */
+
+	spin_lock(&q->list.lock);
+
+	last = q->list.prev;
+	sd = &__get_cpu_var(softnet_data);
+	last->next = sd->completion_queue;
+	sd->completion_queue = q->list.next;
+	__skb_queue_head_init(&q->list);
+
+	spin_unlock(&q->list.lock);
+
+	raise_softirq_irqoff(NET_TX_SOFTIRQ);
+}
+
+void __init xps_init(void)
+{
+	int cpu, remote;
+	struct sk_buff_head *head;
+
+	xps_array = __alloc_percpu(nr_cpu_ids * sizeof(struct sk_buff_head),
+				   __alignof__(struct sk_buff_head));
+	if (!xps_array)
+		panic("XPS: Could not allocate xps_array\n");
+
+	for_each_possible_cpu(cpu) {
+		skb_queue_head_init(&per_cpu(xps_pcpu_queue, cpu).list);
+		per_cpu(xps_pcpu_queue, cpu).csd.func = remote_free_skb_list;
+		per_cpu(xps_pcpu_queue, cpu).csd.info = &per_cpu(xps_pcpu_queue, cpu);
+		head = per_cpu_ptr(xps_array, cpu);
+		for (remote = 0; remote < nr_cpu_ids; remote++)
+			__skb_queue_head_init(head + remote);
+	}
+}
+
+#endif /* CONFIG_SMP */