
[RFC,v1] hand off skb list to other cpu to submit to upper layer

Message ID 1235525270.2604.483.camel@ymzhang
State RFC, archived
Delegated to: David Miller

Commit Message

Zhang, Yanmin Feb. 25, 2009, 1:27 a.m. UTC
Subject: hand off skb list to other cpu to submit to upper layer
From: Zhang Yanmin <yanmin.zhang@linux.intel.com>

Recently, I have been investigating an ip_forward performance issue with the 10G IXGBE NIC.
The test uses 2 machines, each with 2 10G NICs. The 1st machine sends packets with pktgen.
The 2nd receives the packets on one NIC and forwards them out through the other NIC. As the
NICs support multi-queue, I bind the queues to different logical cpus of different physical
cpus while considering cache sharing carefully.

Compared with the sending speed on the 1st machine, the forwarding speed is not good, only
about 60% of the sending speed. The IXGBE driver starts NAPI when an interrupt arrives.
With ip_forward=1, the receiver collects a packet and forwards it out immediately, so
although IXGBE collects packets with NAPI, the forwarding has a big impact on collection.
As IXGBE runs very fast, it drops packets quickly. The receiving cpu would do better to do
nothing but collect packets.

Currently the kernel has the backlog to support a similar capability, but process_backlog
still runs on the receiving cpu. I enhance the backlog by adding a new input_pkt_alien_queue
to softnet_data. The receiving cpu collects packets, links them into an skb list, then delivers
the list to the input_pkt_alien_queue of another cpu. process_backlog picks up the skb list
from input_pkt_alien_queue when input_pkt_queue is empty.

A NIC driver could use this capability with the steps below in its NAPI RX cleanup function
(see the sketch after this list).
1) Initialize a local variable struct sk_buff_head skb_head;
2) In the packet collection loop, call netif_rx_queue or __skb_queue_tail(skb_head, skb)
to add the skb to the list;
3) Before exiting, call raise_netif_irq to submit the skb list to a specific cpu.
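
For illustration, here is a minimal sketch of such an RX cleanup routine. All the example_*
names and the target_cpu parameter are hypothetical placeholders (how the destination cpu is
chosen, e.g. from admin configuration, is outside this patch):

/* Hypothetical NAPI RX cleanup using the new interface; the example_*
 * names and target_cpu are placeholders, not part of this patch. */
static int example_clean_rx_irq(struct example_ring *ring, int budget,
				int target_cpu)
{
	struct sk_buff_head skb_head;	/* step 1: local skb list */
	struct sk_buff *skb;
	int work = 0;

	skb_queue_head_init(&skb_head);

	while (work < budget &&
	       (skb = example_get_rx_skb(ring)) != NULL) {
		/* step 2: append to the local list instead of passing
		 * the skb up right away */
		netif_rx_queue(skb, &skb_head);
		work++;
	}

	/* step 3: hand the whole batch to target_cpu's backlog */
	raise_netif_irq(target_cpu, &skb_head);

	return work;
}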

Enlarge /proc/sys/net/core/netdev_max_backlog and netdev_budget before testing.

I tested my patch on top of 2.6.28.5. The improvement is about 43%.

Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>

---




Comments

stephen hemminger Feb. 25, 2009, 2:11 a.m. UTC | #1
On Wed, 25 Feb 2009 09:27:49 +0800
"Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:

> Subject: hand off skb list to other cpu to submit to upper layer
> From: Zhang Yanmin <yanmin.zhang@linux.intel.com>
> 
> Recently, I am investigating an ip_forward performance issue with 10G IXGBE NIC.
> I start the testing on 2 machines. Every machine has 2 10G NICs. The 1st one seconds
> packets by pktgen. The 2nd receives the packets from one NIC and forwards them out
> from the 2nd NIC. As NICs supports multi-queue, I bind the queues to different logical
> cpu of different physical cpu while considering cache sharing carefully.
> 
> Comparing with sending speed on the 1st machine, the forward speed is not good, only
> about 60% of sending speed. As a matter of fact, IXGBE driver starts NAPI when interrupt
> arrives. When ip_forward=1, receiver collects a packet and forwards it out immediately.
> So although IXGBE collects packets with NAPI, the forwarding really has much impact on
> collection. As IXGBE runs very fast, it drops packets quickly. The better way for
> receiving cpu is doing nothing than just collecting packets.
> 
> Currently kernel has backlog to support a similar capability, but process_backlog still
> runs on the receiving cpu. I enhance backlog by adding a new input_pkt_alien_queue to
> softnet_data. Receving cpu collects packets and link them into skb list, then delivers
> the list to the input_pkt_alien_queue of other cpu. process_backlog picks up the skb list
> from input_pkt_alien_queue when input_pkt_queue is empty.
> 
> NIC driver could use this capability like below step in NAPI RX cleanup function.
> 1) Initiate a local var struct sk_buff_head skb_head;
> 2) In the packet collection loop, just calls netif_rx_queue or __skb_queue_tail(skb_head, skb)
> to add skb to the list;
> 3) Before exiting, calls raise_netif_irq to submit the skb list to specific cpu.
> 
> Enlarge /proc/sys/net/core/netdev_max_backlog and netdev_budget before testing.
> 
> I tested my patch on top of 2.6.28.5. The improvement is about 43%.
> 
> Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>
> 
> ---

You can't safely put packets on another CPU's queue without adding a spinlock.
And if you add the spinlock, you drop the performance back down for your
device and all the other devices. Also, you will end up reordering
packets, which hurts single-stream TCP performance.

Is this all because the hardware doesn't do MSI-X, or are you testing only
a single flow?
Zhang, Yanmin Feb. 25, 2009, 2:35 a.m. UTC | #2
On Tue, 2009-02-24 at 18:11 -0800, Stephen Hemminger wrote:
> On Wed, 25 Feb 2009 09:27:49 +0800
> "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:
> 
> > Subject: hand off skb list to other cpu to submit to upper layer
> > From: Zhang Yanmin <yanmin.zhang@linux.intel.com>
> > 
> > Recently, I am investigating an ip_forward performance issue with 10G IXGBE NIC.
> > I start the testing on 2 machines. Every machine has 2 10G NICs. The 1st one seconds
> > packets by pktgen. The 2nd receives the packets from one NIC and forwards them out
> > from the 2nd NIC. As NICs supports multi-queue, I bind the queues to different logical
> > cpu of different physical cpu while considering cache sharing carefully.
> > 
> > Comparing with sending speed on the 1st machine, the forward speed is not good, only
> > about 60% of sending speed. As a matter of fact, IXGBE driver starts NAPI when interrupt
> > arrives. When ip_forward=1, receiver collects a packet and forwards it out immediately.
> > So although IXGBE collects packets with NAPI, the forwarding really has much impact on
> > collection. As IXGBE runs very fast, it drops packets quickly. The better way for
> > receiving cpu is doing nothing than just collecting packets.
> > 
> > Currently kernel has backlog to support a similar capability, but process_backlog still
> > runs on the receiving cpu. I enhance backlog by adding a new input_pkt_alien_queue to
> > softnet_data. Receving cpu collects packets and link them into skb list, then delivers
> > the list to the input_pkt_alien_queue of other cpu. process_backlog picks up the skb list
> > from input_pkt_alien_queue when input_pkt_queue is empty.
> > 
> > NIC driver could use this capability like below step in NAPI RX cleanup function.
> > 1) Initiate a local var struct sk_buff_head skb_head;
> > 2) In the packet collection loop, just calls netif_rx_queue or __skb_queue_tail(skb_head, skb)
> > to add skb to the list;
> > 3) Before exiting, calls raise_netif_irq to submit the skb list to specific cpu.
> > 
> > Enlarge /proc/sys/net/core/netdev_max_backlog and netdev_budget before testing.
> > 
> > I tested my patch on top of 2.6.28.5. The improvement is about 43%.
> > 
> > Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>
> > 
> > ---
Thanks for your comments.

> 
> You can't safely put packets on another CPU queue without adding a spinlock.
input_pkt_alien_queue is a struct sk_buff_head which has a spinlock. We use
that lock to protect the queue.

> And if you add the spinlock, you drop the performance back down for your
> device and all the other devices.
My testing shows a 43% improvement. As multi-core machines are becoming
popular, we can dedicate some cores to packet collection only.

I use the spinlock carefully. The delivering cpu takes it only when input_pkt_queue
is empty, and just merges the list into input_pkt_queue; later skb dequeues needn't
hold the spinlock. On the other hand, the original receiving cpu dispatches a whole batch
of skbs (64 packets with the IXGBE default) each time it holds the lock.
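
In code terms, the consumer side (condensed from the process_backlog() change in the patch)
looks roughly like this:

	/* one locked splice per batch, then lock-free dequeueing */
	if (!skb_queue_empty(&queue->input_pkt_alien_queue)) {
		spin_lock(&queue->input_pkt_alien_queue.lock);
		skb_queue_splice_tail_init(&queue->input_pkt_alien_queue,
					   &queue->input_pkt_queue);
		spin_unlock(&queue->input_pkt_alien_queue.lock);
	}
	/* subsequent __skb_dequeue(&queue->input_pkt_queue) calls need
	 * no spinlock, only the usual local_irq_disable() */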

>  Also, you will end up reordering
> packets which hurts single stream TCP performance.
Would you like to elaborate on that scenario? Do you mean multi-queue
also hurts single-stream TCP performance when we bind the multi-queue interrupts to
different cpus?

> 
> Is this all because the hardware doesn't do MSI-X
IXGBE supports MSI-X and I enable it when testing. The receiver has 16 queues,
so 16 irq numbers. I bind 2 irq numbers per logical cpu of one physical cpu.

>  or are you testing only
> a single flow. 
What does a single flow mean here? One sender? I do start only one sender for testing because
I couldn't get enough hardware.

In addition, my patch doesn't change the old interface, so there would be no performance
hit to old drivers.

yanmin


stephen hemminger Feb. 25, 2009, 5:18 a.m. UTC | #3
On Wed, 25 Feb 2009 10:35:43 +0800
"Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:

> On Tue, 2009-02-24 at 18:11 -0800, Stephen Hemminger wrote:
> > On Wed, 25 Feb 2009 09:27:49 +0800
> > "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:
> > 
> > > Subject: hand off skb list to other cpu to submit to upper layer
> > > From: Zhang Yanmin <yanmin.zhang@linux.intel.com>
> > > 
> > > Recently, I am investigating an ip_forward performance issue with 10G IXGBE NIC.
> > > I start the testing on 2 machines. Every machine has 2 10G NICs. The 1st one seconds
> > > packets by pktgen. The 2nd receives the packets from one NIC and forwards them out
> > > from the 2nd NIC. As NICs supports multi-queue, I bind the queues to different logical
> > > cpu of different physical cpu while considering cache sharing carefully.
> > > 
> > > Comparing with sending speed on the 1st machine, the forward speed is not good, only
> > > about 60% of sending speed. As a matter of fact, IXGBE driver starts NAPI when interrupt
> > > arrives. When ip_forward=1, receiver collects a packet and forwards it out immediately.
> > > So although IXGBE collects packets with NAPI, the forwarding really has much impact on
> > > collection. As IXGBE runs very fast, it drops packets quickly. The better way for
> > > receiving cpu is doing nothing than just collecting packets.
> > > 
> > > Currently kernel has backlog to support a similar capability, but process_backlog still
> > > runs on the receiving cpu. I enhance backlog by adding a new input_pkt_alien_queue to
> > > softnet_data. Receving cpu collects packets and link them into skb list, then delivers
> > > the list to the input_pkt_alien_queue of other cpu. process_backlog picks up the skb list
> > > from input_pkt_alien_queue when input_pkt_queue is empty.
> > > 
> > > NIC driver could use this capability like below step in NAPI RX cleanup function.
> > > 1) Initiate a local var struct sk_buff_head skb_head;
> > > 2) In the packet collection loop, just calls netif_rx_queue or __skb_queue_tail(skb_head, skb)
> > > to add skb to the list;
> > > 3) Before exiting, calls raise_netif_irq to submit the skb list to specific cpu.
> > > 
> > > Enlarge /proc/sys/net/core/netdev_max_backlog and netdev_budget before testing.
> > > 
> > > I tested my patch on top of 2.6.28.5. The improvement is about 43%.
> > > 
> > > Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>
> > > 
> > > ---
> Thanks for your comments.
> 
> > 
> > You can't safely put packets on another CPU queue without adding a spinlock.
> input_pkt_alien_queue is a struct sk_buff_head which has a spinlock. We use
> that lock to protect the queue.

I was reading netif_rx_queue() and you have it using __skb_queue_tail() which
has no locking. 

> > And if you add the spinlock, you drop the performance back down for your
> > device and all the other devices.
> My testing shows 43% improvement. As multi-core machines are becoming
> popular, we can allocate some core for packet collection only.
> 
> I use the spinlock carefully. The deliver cpu locks it only when input_pkt_queue
> is empty, and just merges the list to input_pkt_queue. Later skb dequeue needn't
> hold the spinlock. In the other hand, the original receving cpu dispatches a batch
> of skb (64 packets with IXGBE default) when holding the lock once.
> 
> >  Also, you will end up reordering
> > packets which hurts single stream TCP performance.
> Would you like to elaborate the scenario? Does your speaking mean multi-queue
> also hurts single stream TCP performance when we bind multi-queue(interrupt) to
> different cpu?
> 
> > 
> > Is this all because the hardware doesn't do MSI-X
> IXGBE supports MSI-X and I enables it when testing.  The receiver has 16 multi-queue,
> so 16 irq numbers. I bind 2 irq numbers per logical cpu of one physical cpu.
> 
> >  or are you testing only
> > a single flow. 
> What does a single flow mean here? One sender? I do start one sender for testing because
> I couldn't get enough hardware.

Multiple receive queues only give a performance gain if the packets are being
sent with different SRC/DST address pairs. That is how the hardware is supposed
to break them into queues.

Reordering is what happens when packets that are sent as [ 0, 1, 2, 3, 4 ]
get received as [ 0, 1, 4, 3, 2 ] because your receive processing happened on different
CPUs. You really need to test this with some program like 'iperf' to see the effect
it has on TCP. Older Juniper routers used to have hardware that did this and it
caused performance loss. Do some google searches and you will see
it is an active research topic whether reordering is okay or not. Existing
multiqueue is safe because it doesn't reorder inside a single flow; it only
changes order between flows:  [ A1, A2, B1, B2 ] => [ A1, B1, A2, B2 ]


> 
> In addition, my patch doesn't change old interface, so there would be no performance
> hurt to old drivers.
> 
> yanmin
> 
> 

Isn't this a problem:
> +int netif_rx_queue(struct sk_buff *skb, struct sk_buff_head *skb_queue)
>  {
>  	struct softnet_data *queue;
>  	unsigned long flags;
> +	int this_cpu;
>  
>  	/* if netpoll wants it, pretend we never saw it */
>  	if (netpoll_rx(skb))
> @@ -1943,24 +1946,31 @@ int netif_rx(struct sk_buff *skb)
>  	if (!skb->tstamp.tv64)
>  		net_timestamp(skb);
>  
> +	if (skb_queue)
> +		this_cpu = 0;
> +	else
> +		this_cpu = 1;

Why bother with a special boolean instead of just testing for skb_queue != NULL?

> +
>  	/*
>  	 * The code is rearranged so that the path is the most
>  	 * short when CPU is congested, but is still operating.
>  	 */
>  	local_irq_save(flags);
> +
>  	queue = &__get_cpu_var(softnet_data);
> +	if (!skb_queue)
> +		skb_queue = &queue->input_pkt_queue;

>  
>  	__get_cpu_var(netdev_rx_stat).total++;
> -	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
> -		if (queue->input_pkt_queue.qlen) {
> -enqueue:
> -			__skb_queue_tail(&queue->input_pkt_queue, skb);
> -			local_irq_restore(flags);
> -			return NET_RX_SUCCESS;
> +
> +	if (skb_queue->qlen <= netdev_max_backlog) {
> +		if (!skb_queue->qlen && this_cpu) {
> +			napi_schedule(&queue->backlog);
>  		}

Won't this break if skb_queue is NULL (the non-NAPI case)?
Zhang, Yanmin Feb. 25, 2009, 5:51 a.m. UTC | #4
On Tue, 2009-02-24 at 21:18 -0800, Stephen Hemminger wrote:
> On Wed, 25 Feb 2009 10:35:43 +0800
> "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:
> 
> > On Tue, 2009-02-24 at 18:11 -0800, Stephen Hemminger wrote:
> > > On Wed, 25 Feb 2009 09:27:49 +0800
> > > "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:
> > > 
> > > > Subject: hand off skb list to other cpu to submit to upper layer
> > > > From: Zhang Yanmin <yanmin.zhang@linux.intel.com>
> > > > 
> > > > Recently, I am investigating an ip_forward performance issue with 10G IXGBE NIC.
> > > > I start the testing on 2 machines. Every machine has 2 10G NICs. The 1st one seconds
> > > > packets by pktgen. The 2nd receives the packets from one NIC and forwards them out
> > > > from the 2nd NIC. As NICs supports multi-queue, I bind the queues to different logical
> > > > cpu of different physical cpu while considering cache sharing carefully.
> > > > 
> > > > Comparing with sending speed on the 1st machine, the forward speed is not good, only
> > > > about 60% of sending speed. As a matter of fact, IXGBE driver starts NAPI when interrupt
> > > > arrives. When ip_forward=1, receiver collects a packet and forwards it out immediately.
> > > > So although IXGBE collects packets with NAPI, the forwarding really has much impact on
> > > > collection. As IXGBE runs very fast, it drops packets quickly. The better way for
> > > > receiving cpu is doing nothing than just collecting packets.
> > > > 
> > > > Currently kernel has backlog to support a similar capability, but process_backlog still
> > > > runs on the receiving cpu. I enhance backlog by adding a new input_pkt_alien_queue to
> > > > softnet_data. Receving cpu collects packets and link them into skb list, then delivers
> > > > the list to the input_pkt_alien_queue of other cpu. process_backlog picks up the skb list
> > > > from input_pkt_alien_queue when input_pkt_queue is empty.
> > > > 
> > > > NIC driver could use this capability like below step in NAPI RX cleanup function.
> > > > 1) Initiate a local var struct sk_buff_head skb_head;
> > > > 2) In the packet collection loop, just calls netif_rx_queue or __skb_queue_tail(skb_head, skb)
> > > > to add skb to the list;
> > > > 3) Before exiting, calls raise_netif_irq to submit the skb list to specific cpu.
> > > > 
> > > > Enlarge /proc/sys/net/core/netdev_max_backlog and netdev_budget before testing.
> > > > 
> > > > I tested my patch on top of 2.6.28.5. The improvement is about 43%.
> > > > 
> > > > Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>
> > > > 
> > > You can't safely put packets on another CPU queue without adding a spinlock.
> > input_pkt_alien_queue is a struct sk_buff_head which has a spinlock. We use
> > that lock to protect the queue.
> 
> I was reading netif_rx_queue() and you have it using __skb_queue_tail() which
> has no locking. 
Sorry, I need to add some comments to function netif_rx_queue.

Parameter skb_queue points to a local variable, or is NULL. If it points to a local variable,
as in function ixgbe_clean_rx_irq of IXGBE, we needn't protect it when
using __skb_queue_tail to add a new skb. If skb_queue is NULL, the line below

skb_queue = &queue->input_pkt_queue;

makes it point to the local input_pkt_queue, which is protected by local_irq_save.
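
For example, the comment could read something like this (just a draft, not in the posted patch):

/**
 *	netif_rx_queue	-	post buffer to a backlog list
 *	@skb: buffer to post
 *	@skb_queue: if non-NULL, a caller-owned (typically on-stack)
 *		list; the skb is appended without locking and the
 *		caller later hands the whole list over with
 *		raise_netif_irq().  If NULL, the skb is added to this
 *		cpu's input_pkt_queue under local_irq_save(), exactly
 *		as netif_rx() does today.
 */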

> > > And if you add the spinlock, you drop the performance back down for your
> > > device and all the other devices.
> > My testing shows 43% improvement. As multi-core machines are becoming
> > popular, we can allocate some core for packet collection only.
> > 
> > I use the spinlock carefully. The deliver cpu locks it only when input_pkt_queue
> > is empty, and just merges the list to input_pkt_queue. Later skb dequeue needn't
> > hold the spinlock. In the other hand, the original receving cpu dispatches a batch
> > of skb (64 packets with IXGBE default) when holding the lock once.
> > 
> > >  Also, you will end up reordering
> > > packets which hurts single stream TCP performance.
> > Would you like to elaborate the scenario? Does your speaking mean multi-queue
> > also hurts single stream TCP performance when we bind multi-queue(interrupt) to
> > different cpu?
> > 
> > > 
> > > Is this all because the hardware doesn't do MSI-X
> > IXGBE supports MSI-X and I enables it when testing.  The receiver has 16 multi-queue,
> > so 16 irq numbers. I bind 2 irq numbers per logical cpu of one physical cpu.
> > 
> > >  or are you testing only
> > > a single flow. 
> > What does a single flow mean here? One sender? I do start one sender for testing because
> > I couldn't get enough hardware.
> 
> Multiple receive queues only have an performance gain if the packets are being
> sent with different SRC/DST address pairs. That is how the hardware is supposed
> to break them into queues.
Thanks for your explanation.

> 
> Reordering is what happens when packts that are sent as [ 0, 1, 2, 3, 4 ]
> get received as [ 0, 1, 4, 3, 2 ] because your receive processing happened on different
> CPU's. You really need to test this with some program like 'iperf' to see the effect
> it has on TCP. Older Juniper routers used to have hardware that did this and it
> caused it caused performance loss. Do some google searches and you will see
> it is a active research topic about whether reordering is okay or not. Existing
> multiqueue is safe because it doesn't reorder inside a single flow; it only
> changes order between flows:  [ A1, A2, B1, B2] => [ A1, B1, A2, B2 ]
Thanks. Your explanation is very clear. My patch might cause reordering, but very rarely,
because reordering only happens when there is a failover in function raise_netif_irq. Perhaps
I should replace the failover with just dropping the packets?
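
For instance, the failover path in raise_netif_irq could be reduced to something like the
sketch below (only an idea at this point, not what the posted patch does):

failover:
	/* sketch: drop the batch instead of splicing it back onto the
	 * local input_pkt_queue, so skbs already handed to the remote
	 * cpu can never be overtaken by these ones */
	net_drop_skb(skb_queue);
	return 1;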

I will try iperf.

> 
> Isn't this a problem:
> > +int netif_rx_queue(struct sk_buff *skb, struct sk_buff_head *skb_queue)
> >  {
> >  	struct softnet_data *queue;
> >  	unsigned long flags;
> > +	int this_cpu;
> >  
> >  	/* if netpoll wants it, pretend we never saw it */
> >  	if (netpoll_rx(skb))
> > @@ -1943,24 +1946,31 @@ int netif_rx(struct sk_buff *skb)
> >  	if (!skb->tstamp.tv64)
> >  		net_timestamp(skb);
> >  
> > +	if (skb_queue)
> > +		this_cpu = 0;
> > +	else
> > +		this_cpu = 1;
> 
> Why bother with a special boolean? and instead just test for skb_queue != NULL
The variable this_cpu is used for the napi_schedule call later. Although the logic has no
problem, this_cpu does look confusing. Let me check if there is a better way to handle the
later napi_schedule.

> 
> > +
> >  	/*
> >  	 * The code is rearranged so that the path is the most
> >  	 * short when CPU is congested, but is still operating.
> >  	 */
> >  	local_irq_save(flags);
> > +
> >  	queue = &__get_cpu_var(softnet_data);
> > +	if (!skb_queue)
> > +		skb_queue = &queue->input_pkt_queue;
When skb_queue is NULL, we redirect it to queue->input_pkt_queue.

> 
> >  
> >  	__get_cpu_var(netdev_rx_stat).total++;
> > -	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
> > -		if (queue->input_pkt_queue.qlen) {
> > -enqueue:
> > -			__skb_queue_tail(&queue->input_pkt_queue, skb);
> > -			local_irq_restore(flags);
> > -			return NET_RX_SUCCESS;
> > +
> > +	if (skb_queue->qlen <= netdev_max_backlog) {
> > +		if (!skb_queue->qlen && this_cpu) {
> > +			napi_schedule(&queue->backlog);
> >  		}
> 
> Won't this break if skb_queue is NULL (non NAPI case)?
So skb_queue isn't NULL here.

Another idea is just to delete function netif_rx_queue and let drivers use
__skb_queue_tail directly. The difference is that netif_rx_queue has a queue-length check
while __skb_queue_tail hasn't. But mostly skb_queue's length is far smaller than
queue->input_pkt_queue.qlen and queue->input_pkt_alien_queue.qlen.

Yanmin


Herbert Xu Feb. 25, 2009, 6:36 a.m. UTC | #5
Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> Subject: hand off skb list to other cpu to submit to upper layer
> From: Zhang Yanmin <yanmin.zhang@linux.intel.com>
> 
> Recently, I am investigating an ip_forward performance issue with 10G IXGBE NIC.
> I start the testing on 2 machines. Every machine has 2 10G NICs. The 1st one seconds
> packets by pktgen. The 2nd receives the packets from one NIC and forwards them out
> from the 2nd NIC. As NICs supports multi-queue, I bind the queues to different logical
> cpu of different physical cpu while considering cache sharing carefully.
> 
> Comparing with sending speed on the 1st machine, the forward speed is not good, only
> about 60% of sending speed. As a matter of fact, IXGBE driver starts NAPI when interrupt
> arrives. When ip_forward=1, receiver collects a packet and forwards it out immediately.
> So although IXGBE collects packets with NAPI, the forwarding really has much impact on
> collection. As IXGBE runs very fast, it drops packets quickly. The better way for
> receiving cpu is doing nothing than just collecting packets.

This doesn't make sense.  With multiqueue RX, every core should be
working to receive its fraction of the traffic and forward it
out.  So you shouldn't have any idle cores to begin with.  The fact
that you do means that multiqueue RX hasn't maximised its utility,
so you should tackle that instead of trying to redirect traffic away
from the cores that are receiving.

Of course for NICs that don't support multiqueue RX, or where the
number of RX queues is less than the number of cores, then a scheme
like yours may be useful.

Cheers,
Zhang, Yanmin Feb. 25, 2009, 7:20 a.m. UTC | #6
On Wed, 2009-02-25 at 14:36 +0800, Herbert Xu wrote:
> Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> > Subject: hand off skb list to other cpu to submit to upper layer
> > From: Zhang Yanmin <yanmin.zhang@linux.intel.com>
> > 
> > Recently, I am investigating an ip_forward performance issue with 10G IXGBE NIC.
> > I start the testing on 2 machines. Every machine has 2 10G NICs. The 1st one seconds
> > packets by pktgen. The 2nd receives the packets from one NIC and forwards them out
> > from the 2nd NIC. As NICs supports multi-queue, I bind the queues to different logical
> > cpu of different physical cpu while considering cache sharing carefully.
> > 
> > Comparing with sending speed on the 1st machine, the forward speed is not good, only
> > about 60% of sending speed. As a matter of fact, IXGBE driver starts NAPI when interrupt
> > arrives. When ip_forward=1, receiver collects a packet and forwards it out immediately.
> > So although IXGBE collects packets with NAPI, the forwarding really has much impact on
> > collection. As IXGBE runs very fast, it drops packets quickly. The better way for
> > receiving cpu is doing nothing than just collecting packets.
> 
Thanks for your comments.

> This doesn't make sense.  With multiqueue RX, every core should be
> working to receive its fraction of the traffic and forwarding them
> out.
I never said the cores can't receive and forward packets at the same time.
I mean the performance isn't good when they do.

>   So you shouldn't have any idle cores to begin with.  The fact
> that you do means that multiqueue RX hasn't maximised its utility,
> so you should tackle that instead of trying redirect traffic away
> from the cores that are receiving.
From Stephen's explanation, the packets are being sent with different SRC/DST address
pairs, by which the hardware delivers packets to different queues. We can't expect the
NIC to always put packets into the queues evenly.

The behavior is that IXGBE is very fast and a cpu can't collect packets in time if it
collects packets and forwards them at the same time. That causes IXGBE to drop packets.

> 
> Of course for NICs that don't support multiqueue RX, or where the
> number of RX queues is less than the number of cores, then a scheme
> like yours may be useful.
The IXGBE NIC does support a large number of RX queues. By default, it creates
CPU_NUM queues. But the performance is not good when we bind the queues to
cpus evenly. One reason is cache misses/ping-pong. The forwarder machine has
2 physical cpus and every cpu has 8 logical threads. All 8 logical cpus share
the last-level cache. In my ip_forward testing with pktgen, binding the queues
to the 8 logical cpus of one physical cpu gives about a 40% improvement over binding
the queues to all 16 logical cpus. So the optimization scenario just needs the IXGBE
driver to create 8 queues.

If a machine has a couple of NICs and every NIC has CPU_NUM queues,
binding them evenly might cause even more cache misses/ping-pong. I didn't test
the multiple-receiving-NICs scenario as I couldn't get enough hardware.

Yanmin


David Miller Feb. 25, 2009, 7:31 a.m. UTC | #7
From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Date: Wed, 25 Feb 2009 15:20:23 +0800

> If the machines might have a couple of NICs and every NIC has CPU_NUM queues,
> binding them evenly might cause more cache-miss/ping-pong. I didn't test
> multiple receiving NICs scenario as I couldn't get enough hardware.

In the net-next-2.6 tree, since we mark incoming packets with
skb_record_rx_queue() properly, we'll make a more favorable choice of
TX queue.

You may want to figure out why that isn't behaving well in your
case.

I don't think we should do any kind of software spreading for such
capable hardware, it defeats the whole point of supporting the
multiqueue features.
Zhang, Yanmin March 4, 2009, 9:27 a.m. UTC | #8
On Tue, 2009-02-24 at 23:31 -0800, David Miller wrote: 
> From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
> Date: Wed, 25 Feb 2009 15:20:23 +0800
> 
> > If the machines might have a couple of NICs and every NIC has CPU_NUM queues,
> > binding them evenly might cause more cache-miss/ping-pong. I didn't test
> > multiple receiving NICs scenario as I couldn't get enough hardware.
> 
> In the net-next-2.6 tree, since we mark incoming packets with
> skb_record_rx_queue() properly, we'll make a more favorable choice of
> TX queue.
Thanks for the pointer. I cloned the net-next-2.6 tree. skb_record_rx_queue is a smart
idea for implementing automatic TX queue selection.

There is no NIC multi-queue standard or RFC available; at least I didn't find one
with google.

Both the new skb_record_rx_queue and the current kernel make an assumption about
multi-queue: it's best to send packets out on the TX queue with the same number as the
RX queue they arrived on, when the received packets are related to the outgoing ones.
Put more directly, we send packets on the same cpu on which we receive them. The starting
point is that this can reduce skb and data cache misses.

With a slow NIC, the assumption holds. But with a high-speed NIC, especially a 10G
NIC, the assumption doesn't seem to hold.

Here is a simple calculation with real test data on a Nehalem machine and a Bensley
machine. There are 2 machines, with the testing driven by pktgen.

		    	send packets
	Machine A   	==============>		Machine B
		    
		    	forward pkts back
		    	<==============		


With the Nehalem machines, I can get 4 million pps (packets per second) and every packet is
60 bytes, so the rate is about 240 MBytes/s. Nehalem has 2 sockets and every socket has
4 cores and 8 logical cpus. All logical cpus of a socket share the 8 MByte last-level cache.
That means every physical cpu receives 120 MBytes per second, which is about 15 times the
last-level cache size.

With the Bensley machine, I can get 1.2M pps, or 72 MBytes/s. That machine has 2 sockets and
every socket has a quad-core cpu. Every dual-core pair shares a 6 MByte last-level cache. That
means every dual-core pair gets 18 MBytes per second, which is 3 times the last-level cache size.

So on both Bensley and Nehalem, the cache is flushed very quickly in the 10G NIC testing.

Some other kinds of machines might have bigger caches. For example, my Montvale Itanium machine
has 2 sockets, and every socket has a quad-core cpu plus multi-threading. Every dual-core pair
shares a 12 MByte last-level cache. But the cache is still flushed at least twice per second.

If we check the NIC drivers, we find that they touch very limited fields of the sk_buff when
collecting packets from the NIC.

It is said that 20G and 30G NICs are in production.

So with a high-speed 10G NIC, the old assumption doesn't seem to work.

On the other hand, which part causes most of the cache footprint and cache misses? I don't think
the drivers do, because the receiving cpu only touches some fields of the sk_buff before sending
it to the upper layer.

My patch throws packets to a specific cpu controlled by configuration, which doesn't
cause much cache ping-pong. After the receiving cpu throws packets to the 2nd cpu, it doesn't
need them again. The 2nd cpu takes cache misses, but that doesn't cause cache ping-pong.

My patch doesn't always disagree with skb_record_rx_queue.
1) It can be configured by the admin;
2) We can call skb_record_rx_queue or similar functions on the 2nd cpu (the cpu that really
processes the packets in process_backlog), so later the cache footprint won't be wasted when
forwarding packets out.

> 
> You may want to figure out what that isn't behaving well in your
> case.

I did check the kernel, including slab tuning (I tried slab/slub/slqb and use slub now), and
instrumented the IXGBE driver. Besides careful multi-queue/interrupt binding, the other way is
just to use my patch, which improves the speed by more than 40% on both Nehalem and Bensley.


> 
> I don't think we should do any kind of software spreading for such
> capable hardware,
> it defeats the whole point of supporting the
> multiqueue features.
There is no NIC multi-queue standard or RFC.

Jesse is worried that we might allocate free cores for packet collection while a
real environment keeps all cpus busy. I added more pressure on the sending machine and got
better performance on the forwarding machine, whose cpus are busier
than before; some logical cpus' idle time is near 0. But I only have a couple of 10G NICs,
and couldn't add more pressure to make all cpus busy.


Thanks again for your comments and patience.

Yanmin


David Miller March 4, 2009, 9:39 a.m. UTC | #9
From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Date: Wed, 04 Mar 2009 17:27:48 +0800

> Both the new skb_record_rx_queue and current kernel have an
> assumption on multi-queue. The assumption is it's best to send out
> packets from the TX of the same number of queue like the one of RX
> if the receved packets are related to the out packets. Or more
> direct speaking is we need send packets on the same cpu on which we
> receive them. The start point is that could reduce skb and data
> cache miss.

We have to use the same TX queue for all packets of the same
connection flow (same src/dst IP addresses and ports); otherwise
we introduce reordering.

Herbert brought this up, now I have explicitly brought this up,
and you cannot ignore this issue.

You must not knowingly reorder packets, and using different TX
queues for packets within the same flow does that.
Zhang, Yanmin March 5, 2009, 1:04 a.m. UTC | #10
On Wed, 2009-03-04 at 01:39 -0800, David Miller wrote:
> From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
> Date: Wed, 04 Mar 2009 17:27:48 +0800
> 
> > Both the new skb_record_rx_queue and current kernel have an
> > assumption on multi-queue. The assumption is it's best to send out
> > packets from the TX of the same number of queue like the one of RX
> > if the receved packets are related to the out packets. Or more
> > direct speaking is we need send packets on the same cpu on which we
> > receive them. The start point is that could reduce skb and data
> > cache miss.
> 
> We have to use the same TX queue for all packets for the same
> connection flow (same src/dst IP address and ports) otherwise
> we introduce reordering.
> Herbert brought this up, now I have explicitly brought this up,
> and you cannot ignore this issue.
Thanks. Stephen Hemminger brought it up and explained what reordering
is. I answered in a reply (sorry for not being clear) that mostly we need to spread
packets among RX/TX queues in a 1:1 or N:1 mapping. For example, all packets
received from RX 8 are always sent out on TX 0.


> 
> You must not knowingly reorder packets, and using different TX
> queues for packets within the same flow does that.
Thanks for your explanation, which is really consistent with Stephen's.


Zhang, Yanmin March 5, 2009, 2:40 a.m. UTC | #11
On Thu, 2009-03-05 at 09:04 +0800, Zhang, Yanmin wrote:
> On Wed, 2009-03-04 at 01:39 -0800, David Miller wrote:
> > From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
> > Date: Wed, 04 Mar 2009 17:27:48 +0800
> > 
> > > Both the new skb_record_rx_queue and current kernel have an
> > > assumption on multi-queue. The assumption is it's best to send out
> > > packets from the TX of the same number of queue like the one of RX
> > > if the receved packets are related to the out packets. Or more
> > > direct speaking is we need send packets on the same cpu on which we
> > > receive them. The start point is that could reduce skb and data
> > > cache miss.
> > 
> > We have to use the same TX queue for all packets for the same
> > connection flow (same src/dst IP address and ports) otherwise
> > we introduce reordering.
> > Herbert brought this up, now I have explicitly brought this up,
> > and you cannot ignore this issue.
> Thanks. Stephen Hemminger brought it up and explained what reorder
> is. I answered in a reply (sorry for not clear) that mostly we need spread
> packets among RX/TX in a 1:1 mapping or N:1 mapping. For example, all packets
> received from RX 8 will be spreaded to TX 0 always.
To make it clearer, I used a 1:1 mapping binding when running the tests
on Bensley (4*2 cores) and Nehalem (2*4*2 logical cpus), so there is no reordering
issue. I also worked out a new patch for the failover path to just drop
packets when qlen is bigger than netdev_max_backlog, so the failover path won't
cause reordering.

> 
> 
> > 
> > You must not knowingly reorder packets, and using different TX
> > queues for packets within the same flow does that.
> Thanks for you rexplanation which is really consistent with Stephen's speaking.


Jens Låås March 5, 2009, 7:32 a.m. UTC | #12
2009/3/5, Zhang, Yanmin <yanmin_zhang@linux.intel.com>:
> On Thu, 2009-03-05 at 09:04 +0800, Zhang, Yanmin wrote:
>  > On Wed, 2009-03-04 at 01:39 -0800, David Miller wrote:
>  > > From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
>  > > Date: Wed, 04 Mar 2009 17:27:48 +0800
>  > >
>  > > > Both the new skb_record_rx_queue and current kernel have an
>  > > > assumption on multi-queue. The assumption is it's best to send out
>  > > > packets from the TX of the same number of queue like the one of RX
>  > > > if the receved packets are related to the out packets. Or more
>  > > > direct speaking is we need send packets on the same cpu on which we
>  > > > receive them. The start point is that could reduce skb and data
>  > > > cache miss.
>  > >
>  > > We have to use the same TX queue for all packets for the same
>  > > connection flow (same src/dst IP address and ports) otherwise
>  > > we introduce reordering.
>  > > Herbert brought this up, now I have explicitly brought this up,
>  > > and you cannot ignore this issue.
>  > Thanks. Stephen Hemminger brought it up and explained what reorder
>  > is. I answered in a reply (sorry for not clear) that mostly we need spread
>  > packets among RX/TX in a 1:1 mapping or N:1 mapping. For example, all packets
>  > received from RX 8 will be spreaded to TX 0 always.
>
> To make it clearer, I used 1:1 mapping binding when running testing
>  on bensley (4*2 cores) and Nehalem (2*4*2 logical cpu). So there is no reorder
>  issue. I also worked out a new patch on the failover path to just drop
>  packets when qlen is bigger than netdev_max_backlog, so the failover path wouldn't
>  cause reorder.
>

We have not seen this problem in our testing.
We do keep the skb processing with the same CPU from RX to TX.
This is done by setting affinity for the queues and using a custom select_queue.

+static u16 select_queue(struct net_device *dev, struct sk_buff *skb)
+{
+       if( dev->real_num_tx_queues && skb_rx_queue_recorded(skb) )
+               return  skb_get_rx_queue(skb) % dev->real_num_tx_queues;
+
+       return  smp_processor_id() %  dev->real_num_tx_queues;
+}
+

The hash-based default for selecting the TX queue generates an uneven
spread that is hard to follow with correct affinity.

We have not been able to generate quite as much traffic from the sender.

Sender: (64 byte pkts)
eth5            4.5 k bit/s        3   pps   1233.9 M bit/s    2.632 M pps

Router:
eth0         1077.2 M bit/s    2.298 M pps      1.7 k bit/s        1   pps
eth1            744   bit/s        1   pps   1076.3 M bit/s    2.296 M pps

I'm not sure I like the proposed concept, since it decouples RX
processing from receiving.
There is no point in collecting lots of packets just to drop them later
in the qdisc.
In fact this is bad for performance; we just consume cpu for nothing.

It is important to have as strong a correlation as possible between RX
and TX so we don't receive more packets than we can handle. Better to drop
on the interface.

We might start thinking of a way for userland to set the policy for
multiq mapping.

Cheers,
Jens Låås


>  > >
>  > > You must not knowingly reorder packets, and using different TX
>  > > queues for packets within the same flow does that.
>  > Thanks for you rexplanation which is really consistent with Stephen's speaking.
>
>
Zhang, Yanmin March 5, 2009, 9:24 a.m. UTC | #13
On Thu, 2009-03-05 at 08:32 +0100, Jens Låås wrote:
> 2009/3/5, Zhang, Yanmin <yanmin_zhang@linux.intel.com>:
> > On Thu, 2009-03-05 at 09:04 +0800, Zhang, Yanmin wrote:
> >  > On Wed, 2009-03-04 at 01:39 -0800, David Miller wrote:
> >  > > From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
> >  > > Date: Wed, 04 Mar 2009 17:27:48 +0800
> >  > >
> >  > > > Both the new skb_record_rx_queue and current kernel have an
> >  > > > assumption on multi-queue. The assumption is it's best to send out
> >  > > > packets from the TX of the same number of queue like the one of RX
> >  > > > if the receved packets are related to the out packets. Or more
> >  > > > direct speaking is we need send packets on the same cpu on which we
> >  > > > receive them. The start point is that could reduce skb and data
> >  > > > cache miss.
> >  > >
> >  > > We have to use the same TX queue for all packets for the same
> >  > > connection flow (same src/dst IP address and ports) otherwise
> >  > > we introduce reordering.
> >  > > Herbert brought this up, now I have explicitly brought this up,
> >  > > and you cannot ignore this issue.
> >  > Thanks. Stephen Hemminger brought it up and explained what reorder
> >  > is. I answered in a reply (sorry for not clear) that mostly we need spread
> >  > packets among RX/TX in a 1:1 mapping or N:1 mapping. For example, all packets
> >  > received from RX 8 will be spreaded to TX 0 always.
> >
> > To make it clearer, I used 1:1 mapping binding when running testing
> >  on bensley (4*2 cores) and Nehalem (2*4*2 logical cpu). So there is no reorder
> >  issue. I also worked out a new patch on the failover path to just drop
> >  packets when qlen is bigger than netdev_max_backlog, so the failover path wouldn't
> >  cause reorder.
> >
> 

> We have not seen this problem in our testing.
Thanks for your valuable input. We need more data on high-speed NICs.

> We do keep the skb processing with the same CPU from RX to TX.
That's the usual approach. I did so when I began to investigate why the forwarding
speed is far slower than the sending speed with the 10G NIC.

> This is done via setting affinity for queues and using custom select_queue.
> 
> +static u16 select_queue(struct net_device *dev, struct sk_buff *skb)
> +{
> +       if( dev->real_num_tx_queues && skb_rx_queue_recorded(skb) )
> +               return  skb_get_rx_queue(skb) % dev->real_num_tx_queues;
> +
> +       return  smp_processor_id() %  dev->real_num_tx_queues;
> +}
> +
Yes, with that function, and with every NIC having CPU_NUM queues, an skb is processed
on the same cpu from RX to TX.

> 
> The hash based default for selecting TX-queue generates an uneven
> spread that is hard to follow with correct affinity.
> 
> We have not been able to generate quite as much traffic from the sender.
pktgen in the latest kernel supports multiple threads on the same device. If you
just start one thread, the speed is limited. Could you try 4 or 8 threads? Perhaps
the speed could double then.

> 
> Sender: (64 byte pkts)
> eth5            4.5 k bit/s        3   pps   1233.9 M bit/s    2.632 M pps
I'm a little confused by the data. Do the first 2 columns mean IN and the last 2 mean OUT?

What kind of NIC and machines are they? How big is the last level cache of the cpu?

> 
> Router:
> eth0         1077.2 M bit/s    2.298 M pps      1.7 k bit/s        1   pps
> eth1            744   bit/s        1   pps   1076.3 M bit/s    2.296 M pps
The forwarding speed is quite close to the sender's sending speed. It seems
your machine doesn't need my patch.

In my original case the forwarding speed was 1.4M pps with careful cpu binding that considers
cpu cache sharing. With my patch, the result becomes 2M pps, while the sending speed is
2.36M pps. The NICs I am using are not the latest.

> 
> Im not sure I like the proposed concept since it decouples RX
> processing from receiving.
> There is no point collecting lots of packets just to drop them later
> in the qdisc.
> Infact this is bad for performance, we just consume cpu for nothing.
Yes, if the skb-processing cpu is very busy and we choose to drop skbs there instead of
in the driver or NIC hardware, performance might be worse.

A small change to my patch and the driver could reduce that possibility: check qlen before
collecting the 64 packets (assuming the driver collects 64 packets per NAPI loop). If qlen is
larger than netdev_max_backlog, the driver could just return without doing any real collection,
as sketched below.
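
Roughly like this (a sketch only; how the driver gets at the target cpu's softnet_data is an
open question, since the patch doesn't export an accessor for it):

	/* sketch: skip this NAPI poll when the target backlog is already
	 * full, so the NIC drops in hardware instead of us queueing skbs
	 * that would be dropped later anyway; target_queue is assumed to
	 * be the remote cpu's softnet_data */
	if (target_queue->input_pkt_alien_queue.qlen > netdev_max_backlog)
		return 0;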

We need data to tell whether this is good or bad.

> It is important to have as strong correlation as possible between RX
> and TX so we dont receive more pkts than we can handle. Better to drop
> on the interface.
With my small change above, the interface would drop the packets.

> 
> We might start thinking of a way for userland to set the policy for
> multiq mapping.
I also think so.

I did more testing with different slab allocators, as slab has a big impact on
performance. SLQB has very different behavior from SLUB. It seems SLQB (try2) needs to
improve NUMA allocation/free. At least, I use slub_min_objects=64 and slub_max_order=6
to get the best result on my machine.

Thanks for your comments.

> >  > >
> >  > > You must not knowingly reorder packets, and using different TX
> >  > > queues for packets within the same flow does that.
> >  > Thanks for you rexplanation which is really consistent with Stephen's speaking.
> >



Patch

--- linux-2.6.29-rc2/include/linux/netdevice.h	2009-01-20 14:20:45.000000000 +0800
+++ linux-2.6.29-rc2_napi_rcv/include/linux/netdevice.h	2009-02-23 13:32:48.000000000 +0800
@@ -1119,6 +1119,9 @@  static inline int unregister_gifconf(uns
 /*
  * Incoming packets are placed on per-cpu queues so that
  * no locking is needed.
+ * To speed up fast network, sometimes place incoming packets
+ * to other cpu queues. Use input_pkt_alien_queue.lock to
+ * protect input_pkt_alien_queue.
  */
 struct softnet_data
 {
@@ -1127,6 +1130,7 @@  struct softnet_data
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
 
+	struct sk_buff_head	input_pkt_alien_queue;
 	struct napi_struct	backlog;
 };
 
@@ -1368,6 +1372,10 @@  extern void dev_kfree_skb_irq(struct sk_
 extern void dev_kfree_skb_any(struct sk_buff *skb);
 
 #define HAVE_NETIF_RX 1
+extern int		netif_rx_queue(struct sk_buff *skb,
+					struct sk_buff_head *skb_queue);
+extern int		raise_netif_irq(int cpu,
+					struct sk_buff_head *skb_queue);
 extern int		netif_rx(struct sk_buff *skb);
 extern int		netif_rx_ni(struct sk_buff *skb);
 #define HAVE_NETIF_RECEIVE_SKB 1
--- linux-2.6.29-rc2/net/core/dev.c	2009-01-20 14:20:45.000000000 +0800
+++ linux-2.6.29-rc2_napi_rcv/net/core/dev.c	2009-02-24 13:53:02.000000000 +0800
@@ -1917,8 +1917,10 @@  DEFINE_PER_CPU(struct netif_rx_stats, ne
 
 
 /**
- *	netif_rx	-	post buffer to the network code
+ *	netif_rx_queue	-	post buffer to the network code
  *	@skb: buffer to post
+ *	@sk_buff_head: the queue to keep skb. It could be NULL or point
+ *		to a local variable.
  *
  *	This function receives a packet from a device driver and queues it for
  *	the upper (protocol) levels to process.  It always succeeds. The buffer
@@ -1931,10 +1933,11 @@  DEFINE_PER_CPU(struct netif_rx_stats, ne
  *
  */
 
-int netif_rx(struct sk_buff *skb)
+int netif_rx_queue(struct sk_buff *skb, struct sk_buff_head *skb_queue)
 {
 	struct softnet_data *queue;
 	unsigned long flags;
+	int this_cpu;
 
 	/* if netpoll wants it, pretend we never saw it */
 	if (netpoll_rx(skb))
@@ -1943,24 +1946,31 @@  int netif_rx(struct sk_buff *skb)
 	if (!skb->tstamp.tv64)
 		net_timestamp(skb);
 
+	if (skb_queue)
+		this_cpu = 0;
+	else
+		this_cpu = 1;
+
 	/*
 	 * The code is rearranged so that the path is the most
 	 * short when CPU is congested, but is still operating.
 	 */
 	local_irq_save(flags);
+
 	queue = &__get_cpu_var(softnet_data);
+	if (!skb_queue)
+		skb_queue = &queue->input_pkt_queue;
 
 	__get_cpu_var(netdev_rx_stat).total++;
-	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
-		if (queue->input_pkt_queue.qlen) {
-enqueue:
-			__skb_queue_tail(&queue->input_pkt_queue, skb);
-			local_irq_restore(flags);
-			return NET_RX_SUCCESS;
+
+	if (skb_queue->qlen <= netdev_max_backlog) {
+		if (!skb_queue->qlen && this_cpu) {
+			napi_schedule(&queue->backlog);
 		}
 
-		napi_schedule(&queue->backlog);
-		goto enqueue;
+		__skb_queue_tail(skb_queue, skb);
+		local_irq_restore(flags);
+		return NET_RX_SUCCESS;
 	}
 
 	__get_cpu_var(netdev_rx_stat).dropped++;
@@ -1970,6 +1980,11 @@  enqueue:
 	return NET_RX_DROP;
 }
 
+int netif_rx(struct sk_buff *skb)
+{
+	return netif_rx_queue(skb, NULL);
+}
+
 int netif_rx_ni(struct sk_buff *skb)
 {
 	int err;
@@ -1985,6 +2000,79 @@  int netif_rx_ni(struct sk_buff *skb)
 
 EXPORT_SYMBOL(netif_rx_ni);
 
+static void net_drop_skb(struct sk_buff_head *skb_queue)
+{
+	struct sk_buff *skb = __skb_dequeue(skb_queue);
+
+	while (skb) {
+		__get_cpu_var(netdev_rx_stat).dropped++;
+		kfree_skb(skb);
+		skb = __skb_dequeue(skb_queue);
+	}
+}
+
+static void net_napi_backlog(void *data)
+{
+	struct softnet_data *queue = &__get_cpu_var(softnet_data);
+
+	napi_schedule(&queue->backlog);
+	kfree(data);
+}
+
+int raise_netif_irq(int cpu, struct sk_buff_head *skb_queue)
+{
+	unsigned long flags;
+	struct softnet_data *queue;
+
+	if (skb_queue_empty(skb_queue))
+		return 0;
+
+	if ((unsigned)cpu < nr_cpu_ids &&
+		cpu_online(cpu) &&
+		cpu != smp_processor_id()) {
+
+		struct call_single_data *data;
+
+		queue = &per_cpu(softnet_data, cpu);
+
+		if (queue->input_pkt_alien_queue.qlen > netdev_max_backlog)
+			goto failover;
+
+		data = kmalloc(sizeof(struct call_single_data), GFP_ATOMIC);
+		if (!data)
+			goto failover;
+
+		spin_lock_irqsave(&queue->input_pkt_alien_queue.lock, flags);
+		skb_queue_splice_tail_init(skb_queue,
+				&queue->input_pkt_alien_queue);
+		spin_unlock_irqrestore(&queue->input_pkt_alien_queue.lock,
+					flags);
+
+		data->func = net_napi_backlog;
+		data->info = data;
+		data->flags = 0;
+
+		__smp_call_function_single(cpu, data);
+
+		return 0;
+	}
+
+failover:
+	/* If cpu is offline, we queue skb back to the queue on current cpu*/
+	queue = &__get_cpu_var(softnet_data);
+	if (queue->input_pkt_queue.qlen + skb_queue->qlen <=
+		netdev_max_backlog) {
+		local_irq_save(flags);
+		skb_queue_splice_tail_init(skb_queue, &queue->input_pkt_queue);
+		napi_schedule(&queue->backlog);
+		local_irq_restore(flags);
+	} else {
+		net_drop_skb(skb_queue);
+	}
+
+	return 1;
+}
+
 static void net_tx_action(struct softirq_action *h)
 {
 	struct softnet_data *sd = &__get_cpu_var(softnet_data);
@@ -2324,6 +2412,13 @@  static void flush_backlog(void *arg)
 	struct net_device *dev = arg;
 	struct softnet_data *queue = &__get_cpu_var(softnet_data);
 	struct sk_buff *skb, *tmp;
+	unsigned long flags;
+
+	spin_lock_irqsave(&queue->input_pkt_alien_queue.lock, flags);
+	skb_queue_splice_tail_init(
+			&queue->input_pkt_alien_queue,
+			&queue->input_pkt_queue );
+	spin_unlock_irqrestore(&queue->input_pkt_alien_queue.lock, flags);
 
 	skb_queue_walk_safe(&queue->input_pkt_queue, skb, tmp)
 		if (skb->dev == dev) {
@@ -2575,9 +2670,19 @@  static int process_backlog(struct napi_s
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
-			__napi_complete(napi);
-			local_irq_enable();
-			break;
+			if (!skb_queue_empty(&queue->input_pkt_alien_queue)) {
+				spin_lock(&queue->input_pkt_alien_queue.lock);
+				skb_queue_splice_tail_init(
+						&queue->input_pkt_alien_queue,
+						&queue->input_pkt_queue );
+				spin_unlock(&queue->input_pkt_alien_queue.lock);
+
+				skb = __skb_dequeue(&queue->input_pkt_queue);
+			} else {
+				__napi_complete(napi);
+				local_irq_enable();
+				break;
+			}
 		}
 		local_irq_enable();
 
@@ -4966,6 +5071,11 @@  static int dev_cpu_callback(struct notif
 	local_irq_enable();
 
 	/* Process offline CPU's input_pkt_queue */
+	spin_lock(&oldsd->input_pkt_alien_queue.lock);
+	skb_queue_splice_tail_init(&oldsd->input_pkt_alien_queue,
+			&oldsd->input_pkt_queue);
+	spin_unlock(&oldsd->input_pkt_alien_queue.lock);
+
 	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue)))
 		netif_rx(skb);
 
@@ -5165,10 +5275,13 @@  static int __init net_dev_init(void)
 		struct softnet_data *queue;
 
 		queue = &per_cpu(softnet_data, i);
+
 		skb_queue_head_init(&queue->input_pkt_queue);
 		queue->completion_queue = NULL;
 		INIT_LIST_HEAD(&queue->poll_list);
 
+		skb_queue_head_init(&queue->input_pkt_alien_queue);
+
 		queue->backlog.poll = process_backlog;
 		queue->backlog.weight = weight_p;
 		queue->backlog.gro_list = NULL;
@@ -5227,7 +5340,9 @@  EXPORT_SYMBOL(netdev_boot_setup_check);
 EXPORT_SYMBOL(netdev_set_master);
 EXPORT_SYMBOL(netdev_state_change);
 EXPORT_SYMBOL(netif_receive_skb);
+EXPORT_SYMBOL(netif_rx_queue);
 EXPORT_SYMBOL(netif_rx);
+EXPORT_SYMBOL(raise_netif_irq);
 EXPORT_SYMBOL(register_gifconf);
 EXPORT_SYMBOL(register_netdevice);
 EXPORT_SYMBOL(register_netdevice_notifier);