
[v4,1/1] rps: core implementation

Message ID 4B44D89B.8070006@gmail.com
State Superseded, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Jan. 6, 2010, 6:38 p.m. UTC
On 06/01/2010 06:54, Eric Dumazet wrote:
> 
> This can probably be done later; this Version 4 of RPS looks very good, thanks!
> I am going to test it today on my dev machine before giving an Acked-by :)
> 

Hmm, I had to make some changes to get RPS working on my setup
(bonding + vlans).

Some devices don't have NAPI contexts at all, so force cnt to be
non-zero in store_rps_cpus():

+       list_for_each_entry(napi, &net->napi_list, dev_list)
+               cnt++;
+       if (cnt == 0)
+               cnt = 1;

BTW, the following sequence is not safe:

	rcu_read_lock_bh();
	old_drmap = rcu_dereference(net->dev_rps_maps); 
	rcu_assign_pointer(net->dev_rps_maps, drmap);
	rcu_read_unlock_bh();

rcu_read_lock/unlock is not the right paradigm for writers :)
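
For reference, a minimal sketch of what the writer side could look like
instead, assuming updates are serialized by a dedicated updater lock
(the rps_map_mutex name below is purely illustrative):

	static DEFINE_MUTEX(rps_map_mutex);	/* hypothetical updater lock */

	mutex_lock(&rps_map_mutex);
	/* A plain read is fine here: updaters are serialized by the
	 * mutex, so the pointer cannot change under us. It is
	 * rcu_assign_pointer() that publishes the new map to readers.
	 */
	old_drmap = net->dev_rps_maps;
	rcu_assign_pointer(net->dev_rps_maps, drmap);
	mutex_unlock(&rps_map_mutex);

	if (old_drmap)
		call_rcu(&old_drmap->rcu, dev_map_release);

rcu_read_lock_bh() only marks a read-side critical section; it provides
no mutual exclusion between two concurrent updaters.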


I also noticed these messages; there is a lock imbalance somewhere...
(caused by the cpu handling the original soft interrupt)

[  442.636637] huh, cpu 0 entered softirq 3 NET_RX c051f0a0 preempt_count 00000102, exited with 00000103?
[  442.652122] huh, cpu 0 entered softirq 3 NET_RX c051f0a0 preempt_count 00000102, exited with 00000103?
[  442.735065] huh, cpu 0 entered softirq 3 NET_RX c051f0a0 preempt_count 00000102, exited with 00000103?
[  442.752136] huh, cpu 0 entered softirq 3 NET_RX c051f0a0 preempt_count 00000102, exited with 00000103?
[  442.966726] huh, cpu 0 entered softirq 3 NET_RX c051f0a0 preempt_count 00000102, exited with 00000103?
[  442.982126] huh, cpu 0 entered softirq 3 NET_RX c051f0a0 preempt_count 00000102, exited with 00000103?
[  450.794525] huh, cpu 0 entered softirq 3 NET_RX c051f0a0 preempt_count 00000102, exited with 00000103?
[  451.808901] huh, cpu 0 entered softirq 3 NET_RX c051f0a0 preempt_count 00000102, exited with 00000103?
[  454.610304] huh, cpu 0 entered softirq 3 NET_RX c051f0a0 preempt_count 00000102, exited with 00000103?
[  511.825160] huh, cpu 0 entered softirq 3 NET_RX c051f0a0 preempt_count 00000102, exited with 00000103?
[  571.840765] huh, cpu 0 entered softirq 3 NET_RX c051f0a0 preempt_count 00000102, exited with 00000103?
[  631.856645] huh, cpu 0 entered softirq 3 NET_RX c051f0a0 preempt_count 00000102, exited with 00000103?
[  691.869030] huh, cpu 0 entered softirq 3 NET_RX c051f0a0 preempt_count 00000102, exited with 00000103?
[  721.847106] huh, cpu 0 entered softirq 3 NET_RX c051f0a0 preempt_count 00000102, exited with 00000103?
[  721.860279] huh, cpu 0 entered softirq 3 NET_RX c051f0a0 preempt_count 00000102, exited with 00000103?
[  722.223053] huh, cpu 0 entered softirq 3 NET_RX c051f0a0 preempt_count 00000102, exited with 00000103?
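
Looking at the patch, one plausible culprit for the drift (entered with
preempt_count 00000102, exited with 00000103) is the remote-cpu branch
of enqueue_to_backlog(): get_cpu_var() disables preemption, so it must
be paired with put_cpu_var(), which is missing there. A sketch of the
balanced version (untested, only to illustrate the pairing):

	if (cpu != smp_processor_id()) {
		/* get_cpu_var() bumps preempt_count; without a matching
		 * put_cpu_var() the count leaks by one, which is exactly
		 * what the softirq warnings above report.
		 */
		cpu_set(cpu, get_cpu_var(rps_remote_softirq_cpus));
		put_cpu_var(rps_remote_softirq_cpus);
		__raise_softirq_irqoff(NET_RX_SOFTIRQ);
	} else
		__napi_schedule(&queue->backlog);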


The patch I'm currently using against linux-2.6 is below.

(net-next-2.6 doesn't work well on my bond/vlan setup; I suspect I need a bisection)


Comments

Eric Dumazet Jan. 6, 2010, 9:10 p.m. UTC | #1
On 06/01/2010 19:38, Eric Dumazet wrote:
> 
> (net-next-2.6 doesn't work well on my bond/vlan setup; I suspect I need a bisection)

David, I had to revert 1f3c8804acba841b5573b953f5560d2683d2db0d
(bonding: allow arp_ip_targets on separate vlans to use arp validation).

Otherwise my vlan devices don't work (unfortunately I don't have much
time these days to debug the thing).

My config:

              +---------+
vlan.103 -----+ bond0   +--- eth1 (bnx2)
              |         +
vlan.825 -----+         +--- eth2 (tg3)
              +---------+

$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth1  (bnx2)
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:1e:0b:ec:d3:d2

Slave Interface: eth2   (tg3)
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1e:0b:92:78:50

author	Andy Gospodarek <andy@greyhouse.net>	
	Mon, 14 Dec 2009 10:48:58 +0000 (10:48 +0000)
committer	David S. Miller <davem@davemloft.net>	
	Mon, 4 Jan 2010 05:17:16 +0000 (21:17 -0800)
commit	1f3c8804acba841b5573b953f5560d2683d2db0d
tree	453fae141b4a37e72ee0513ea1816fbff8f1cf8a	
parent	3a999e6eb5d277cd6a321dcda3fc43c3d9e4e4b8	
bonding: allow arp_ip_targets on separate vlans to use arp validation

This allows a bond device to specify an arp_ip_target as a host that is
not on the same vlan as the base bond device and still use arp
validation.  A configuration like this now works:

BONDING_OPTS="mode=active-backup arp_interval=1000 arp_ip_target=10.0.100.1 arp_validate=3"

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
    link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
3: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
    link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::213:21ff:febe:33e9/64 scope link
       valid_lft forever preferred_lft forever
9: bond0.100@bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
    inet 10.0.100.2/24 brd 10.0.100.255 scope global bond0.100
    inet6 fe80::213:21ff:febe:33e9/64 scope link
       valid_lft forever preferred_lft forever

Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 1000
ARP IP target/s (n.n.n.n form): 10.0.100.1

Slave Interface: eth1
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:40:05:30:ff:30

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:13:21:be:33:e9

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jay Vosburgh Jan. 6, 2010, 9:28 p.m. UTC | #2
Eric Dumazet <eric.dumazet@gmail.com> wrote:

>On 06/01/2010 19:38, Eric Dumazet wrote:
>> 
>> (net-next-2.6 doesn't work well on my bond/vlan setup; I suspect I need a bisection)
>
>David, I had to revert 1f3c8804acba841b5573b953f5560d2683d2db0d
>(bonding: allow arp_ip_targets on separate vlans to use arp validation).
>
>Otherwise my vlan devices don't work (unfortunately I don't have much
>time these days to debug the thing).
>
>My config :
>
>              +---------+
>vlan.103 -----+ bond0   +--- eth1 (bnx2)
>              |         +
>vlan.825 -----+         +--- eth2 (tg3)
>              +---------+

	I'm looking into this right now; I'm seeing what I suspect is
the same thing: the ARP traffic for the probes is processed, and the
bonding slaves are marked up, but any other incoming traffic on the VLAN
is dropped.  It might be that just the incoming ARP replies are lost;
I'm not sure yet.  Tcpdump clearly shows the traffic from the peer
arriving.

	This is the patch we put in last week that worked for Andy, but
not for me.  Earlier versions worked fine, so this might be something in
the last version.  With Eric now having issues, perhaps this isn't just
my problem.  Perhaps our configurations differ in some way from what
Andy has.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
Eric Dumazet Jan. 6, 2010, 9:34 p.m. UTC | #3
On 06/01/2010 22:28, Jay Vosburgh wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> On 06/01/2010 19:38, Eric Dumazet wrote:
>>>
>>> (net-next-2.6 doesn't work well on my bond/vlan setup; I suspect I need a bisection)
>>
>> David, I had to revert 1f3c8804acba841b5573b953f5560d2683d2db0d
>> (bonding: allow arp_ip_targets on separate vlans to use arp validation).
>>
>> Otherwise my vlan devices don't work (unfortunately I don't have much
>> time these days to debug the thing).
>>
>> My config :
>>
>>              +---------+
>> vlan.103 -----+ bond0   +--- eth1 (bnx2)
>>              |         +
>> vlan.825 -----+         +--- eth2 (tg3)
>>              +---------+
> 
> 	I'm looking into this right now; I'm seeing what I suspect is
> the same thing: the ARP traffic for the probes is processed, and the
> bonding slaves are marked up, but any other incoming traffic on the VLAN
> is dropped.  It might be that just the incoming ARP replies are lost;
> I'm not sure yet.  Tcpdump clearly shows the traffic from the peer
> arriving.
> 
> 	This is the patch we put in last week that worked for Andy, but
> not for me.  Earlier versions worked fine, so this might be something in
> the last version.  With Eric now having issues, perhaps this isn't just
> my problem.  Perhaps our configurations differ in some way from what
> Andy has.
> 

Before going to sleep, I can confirm ARP traffic was going out and
coming in, but the ARP table entries stay in the incomplete state.

I only had time to try a single revert (no time for a bisect), and this
commit was the obvious candidate :)

Thanks
David Miller Jan. 6, 2010, 9:38 p.m. UTC | #4
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 06 Jan 2010 22:10:03 +0100

> On 06/01/2010 19:38, Eric Dumazet wrote:
>> 
>> (net-next-2.6 doesn't work well on my bond/vlan setup; I suspect I need a bisection)
> 
> David, I had to revert 1f3c8804acba841b5573b953f5560d2683d2db0d
> (bonding: allow arp_ip_targets on separate vlans to use arp validation).
> 
> Otherwise my vlan devices don't work (unfortunately I don't have much
> time these days to debug the thing).

I bet this is the same issue Jay was running into, and he
ACK'd the patch anyway. :-)

This is why I wanted full testing and feedback before committing this
change.

Unless I see a fix in the next day, I'm reverting it from net-next-2.6.

Andy Gospodarek Jan. 6, 2010, 9:45 p.m. UTC | #5
On Wed, Jan 6, 2010 at 4:38 PM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 06 Jan 2010 22:10:03 +0100
>
>> On 06/01/2010 19:38, Eric Dumazet wrote:
>>>
>>> (net-next-2.6 doesn't work well on my bond/vlan setup; I suspect I need a bisection)
>>
>> David, I had to revert 1f3c8804acba841b5573b953f5560d2683d2db0d
>> (bonding: allow arp_ip_targets on separate vlans to use arp validation).
>>
>> Otherwise my vlan devices don't work (unfortunately I don't have much
>> time these days to debug the thing).
>
> I bet this is the same issue Jay was running into, and he
> ACK'd the patch anyway. :-)
>
> This is why I wanted full testing and feedback before committing this
> change.
>
> Unless I see a fix in the next day, I'm reverting it from net-next-2.6.

Seems reasonable.  I'm working on it now, trying to understand what
might be different about my setup and why I'm not seeing any problems.

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a3fccc8..7218970 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -216,12 +216,20 @@  enum {
 struct neighbour;
 struct neigh_parms;
 struct sk_buff;
-
+/**
+ * struct netif_rx_stats - softnet stats for rx on each cpu
+ * @total: number of times the softirq was triggered
+ * @dropped: number of dropped frames
+ * @time_squeeze: number of times the rx softirq was delayed
+ * @cpu_collision: number of times two cpus collided taking the tx queue lock
+ * @rps: number of times this cpu was given work by Receive Packet Steering
+ */
 struct netif_rx_stats {
 	unsigned total;
 	unsigned dropped;
 	unsigned time_squeeze;
 	unsigned cpu_collision;
+	unsigned rps;
 };
 
 DECLARE_PER_CPU(struct netif_rx_stats, netdev_rx_stat);
@@ -676,6 +684,29 @@  struct net_device_ops {
 };
 
 /*
+ * Structure for Receive Packet Steering.  Length of the map and array of CPU IDs.
+ */
+struct rps_map {
+	int len;
+	u16 map[0];
+};
+
+/*
+ * Structure that contains the rps maps for various NAPI instances of a device.
+ */
+struct dev_rps_maps {
+	int num_maps;
+	struct rcu_head rcu;
+	struct rps_map maps[0];
+};
+
+/* Bound number of CPUs that can be in an rps map */
+#define MAX_RPS_CPUS (num_possible_cpus() < 256 ? num_possible_cpus() : 256)
+
+/* Maximum size of RPS map (for allocation) */
+#define RPS_MAP_SIZE (sizeof(struct rps_map) + (MAX_RPS_CPUS * sizeof(u16)))
+
+/*
  *	The DEVICE structure.
  *	Actually, this whole structure is a big mistake.  It mixes I/O
  *	data with strictly "high-level" data, and it has to know about
@@ -861,6 +892,9 @@  struct net_device {
 
 	struct netdev_queue	rx_queue;
 
+	struct dev_rps_maps	*dev_rps_maps;	/* Per-NAPI maps for
+						   receive packet steering */
+
 	struct netdev_queue	*_tx ____cacheline_aligned_in_smp;
 
 	/* Number of TX queues allocated at alloc_netdev_mq() time  */
@@ -1276,10 +1310,12 @@  static inline int unregister_gifconf(unsigned int family)
  */
 struct softnet_data {
 	struct Qdisc		*output_queue;
-	struct sk_buff_head	input_pkt_queue;
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
 
+	/* Elements below can be accessed between CPUs for RPS */
+	struct call_single_data	csd ____cacheline_aligned_in_smp;
+	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
 };
 
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ae836fd..8ed3f66 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -267,6 +267,7 @@  typedef unsigned char *sk_buff_data_t;
  *	@mac_header: Link layer header
  *	@_skb_dst: destination entry
  *	@sp: the security path, used for xfrm
+ *	@rxhash: the packet hash computed on receive
  *	@cb: Control buffer. Free for use by every layer. Put private vars here
  *	@len: Length of actual data
  *	@data_len: Data length
@@ -323,6 +324,8 @@  struct sk_buff {
 #ifdef CONFIG_XFRM
 	struct	sec_path	*sp;
 #endif
+	__u32			rxhash;
+
 	/*
 	 * This is the control buffer. It is free to use for every
 	 * layer. Please put your private variables there. If you
diff --git a/kernel/softirq.c b/kernel/softirq.c
index a09502e..5bde7bc 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -219,9 +219,10 @@  restart:
 			h->action(h);
 			trace_softirq_exit(h, softirq_vec);
 			if (unlikely(prev_count != preempt_count())) {
-				printk(KERN_ERR "huh, entered softirq %td %s %p"
+				pr_err("huh, cpu %d entered softirq %td %s %p "
 				       "with preempt_count %08x,"
-				       " exited with %08x?\n", h - softirq_vec,
+				       " exited with %08x?\n", cpu,
+				       h - softirq_vec,
 				       softirq_to_name[h - softirq_vec],
 				       h->action, prev_count, preempt_count());
 				preempt_count() = prev_count;
diff --git a/net/core/dev.c b/net/core/dev.c
index be9924f..9c92f10 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1874,7 +1874,7 @@  out_kfree_skb:
 	return rc;
 }
 
-static u32 skb_tx_hashrnd;
+static u32 hashrnd __read_mostly;
 
 u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 {
@@ -1892,7 +1892,7 @@  u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 	else
 		hash = skb->protocol;
 
-	hash = jhash_1word(hash, skb_tx_hashrnd);
+	hash = jhash_1word(hash, hashrnd);
 
 	return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
 }
@@ -2113,6 +2113,149 @@  int weight_p __read_mostly = 64;            /* old backlog weight */
 
 DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
 
+/*
+ * get_rps_cpu is called from netif_receive_skb and returns the target
+ * CPU from the RPS map of the receiving NAPI instance for a given skb.
+ */
+static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
+{
+	u32 addr1, addr2, ports;
+	struct ipv6hdr *ip6;
+	struct iphdr *ip;
+	u32 ihl;
+	u8 ip_proto;
+	int cpu = -1;
+	struct dev_rps_maps *drmap;
+	struct rps_map *map = NULL;
+	u16 index;
+
+	rcu_read_lock();
+
+	drmap = rcu_dereference(dev->dev_rps_maps);
+	if (!drmap)
+		goto done;
+
+	index = skb_get_rx_queue(skb);
+	if (index >= drmap->num_maps)
+		index = 0;
+
+	map = (struct rps_map *)
+	    ((void *)drmap->maps + (RPS_MAP_SIZE * index));
+	if (!map->len)
+		goto done;
+
+	if (skb->rxhash)
+		goto got_hash; /* Skip hash computation on packet header */
+
+	switch (skb->protocol) {
+	case __constant_htons(ETH_P_IP):
+		if (!pskb_may_pull(skb, sizeof(*ip)))
+			goto done;
+
+		ip = (struct iphdr *) skb->data;
+		ip_proto = ip->protocol;
+		addr1 = ip->saddr;
+		addr2 = ip->daddr;
+		ihl = ip->ihl;
+		break;
+	case __constant_htons(ETH_P_IPV6):
+		if (!pskb_may_pull(skb, sizeof(*ip6)))
+			goto done;
+
+		ip6 = (struct ipv6hdr *) skb->data;
+		ip_proto = ip6->nexthdr;
+		addr1 = ip6->saddr.s6_addr32[3];
+		addr2 = ip6->daddr.s6_addr32[3];
+		ihl = (40 >> 2);
+		break;
+	default:
+		goto done;
+	}
+	ports = 0;
+	switch (ip_proto) {
+	case IPPROTO_TCP:
+	case IPPROTO_UDP:
+	case IPPROTO_DCCP:
+	case IPPROTO_ESP:
+	case IPPROTO_AH:
+	case IPPROTO_SCTP:
+	case IPPROTO_UDPLITE:
+		if (pskb_may_pull(skb, (ihl * 4) + 4))
+			ports = *((u32 *) (skb->data + (ihl * 4)));
+		break;
+
+	default:
+		break;
+	}
+
+	skb->rxhash = jhash_3words(addr1, addr2, ports, hashrnd);
+	if (!skb->rxhash)
+		skb->rxhash = 1;
+
+got_hash:
+	cpu = map->map[((u64) skb->rxhash * map->len) >> 32];
+
+	if (!cpu_online(cpu))
+		cpu = -1;
+done:
+	rcu_read_unlock();
+	return cpu;
+}
+
+static DEFINE_PER_CPU(cpumask_t, rps_remote_softirq_cpus);
+
+/* Called from hardirq (IPI) context */
+static void trigger_softirq(void *data)
+{
+	struct softnet_data *queue = data;
+	__napi_schedule(&queue->backlog);
+	__get_cpu_var(netdev_rx_stat).rps++;
+}
+
+/*
+ * enqueue_to_backlog is called to queue an skb to a per CPU backlog
+ * queue (may be a remote CPU queue).
+ */
+static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
+{
+	struct softnet_data *queue;
+	unsigned long flags;
+
+	queue = &per_cpu(softnet_data, cpu);
+
+	local_irq_save(flags);
+	__get_cpu_var(netdev_rx_stat).total++;
+
+	spin_lock(&queue->input_pkt_queue.lock);
+	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
+		if (queue->input_pkt_queue.qlen) {
+enqueue:
+			__skb_queue_tail(&queue->input_pkt_queue, skb);
+			spin_unlock_irqrestore(&queue->input_pkt_queue.lock,
+			    flags);
+			return NET_RX_SUCCESS;
+		}
+
+		/* Schedule NAPI for backlog device */
+		if (napi_schedule_prep(&queue->backlog)) {
+			if (cpu != smp_processor_id()) {
+				cpu_set(cpu,
+				    get_cpu_var(rps_remote_softirq_cpus));
+				__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+			} else
+				__napi_schedule(&queue->backlog);
+		}
+		goto enqueue;
+	}
+
+	spin_unlock(&queue->input_pkt_queue.lock);
+
+	__get_cpu_var(netdev_rx_stat).dropped++;
+	local_irq_restore(flags);
+
+	kfree_skb(skb);
+	return NET_RX_DROP;
+}
 
 /**
  *	netif_rx	-	post buffer to the network code
@@ -2131,8 +2274,7 @@  DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
 
 int netif_rx(struct sk_buff *skb)
 {
-	struct softnet_data *queue;
-	unsigned long flags;
+	int cpu;
 
 	/* if netpoll wants it, pretend we never saw it */
 	if (netpoll_rx(skb))
@@ -2141,31 +2283,12 @@  int netif_rx(struct sk_buff *skb)
 	if (!skb->tstamp.tv64)
 		net_timestamp(skb);
 
-	/*
-	 * The code is rearranged so that the path is the most
-	 * short when CPU is congested, but is still operating.
-	 */
-	local_irq_save(flags);
-	queue = &__get_cpu_var(softnet_data);
-
-	__get_cpu_var(netdev_rx_stat).total++;
-	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
-		if (queue->input_pkt_queue.qlen) {
-enqueue:
-			__skb_queue_tail(&queue->input_pkt_queue, skb);
-			local_irq_restore(flags);
-			return NET_RX_SUCCESS;
-		}
-
-		napi_schedule(&queue->backlog);
-		goto enqueue;
-	}
 
-	__get_cpu_var(netdev_rx_stat).dropped++;
-	local_irq_restore(flags);
+	cpu = get_rps_cpu(skb->dev, skb);
+	if (cpu < 0)
+		cpu = smp_processor_id();
 
-	kfree_skb(skb);
-	return NET_RX_DROP;
+	return enqueue_to_backlog(skb, cpu);
 }
 EXPORT_SYMBOL(netif_rx);
 
@@ -2403,10 +2526,10 @@  void netif_nit_deliver(struct sk_buff *skb)
 }
 
 /**
- *	netif_receive_skb - process receive buffer from network
+ *	__netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
  *
- *	netif_receive_skb() is the main receive data processing function.
+ *	__netif_receive_skb() is the main receive data processing function.
  *	It always succeeds. The buffer may be dropped during processing
  *	for congestion control or by the protocol layers.
  *
@@ -2417,7 +2540,8 @@  void netif_nit_deliver(struct sk_buff *skb)
  *	NET_RX_SUCCESS: no congestion
  *	NET_RX_DROP: packet was dropped
  */
-int netif_receive_skb(struct sk_buff *skb)
+
+int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
 	struct net_device *orig_dev;
@@ -2515,6 +2639,16 @@  out:
 }
 EXPORT_SYMBOL(netif_receive_skb);
 
+int netif_receive_skb(struct sk_buff *skb)
+{
+	int cpu = get_rps_cpu(skb->dev, skb);
+
+	if (cpu < 0)
+		return __netif_receive_skb(skb);
+	else
+		return enqueue_to_backlog(skb, cpu);
+}
+
 /* Network device is going away, flush any packets still pending  */
 static void flush_backlog(void *arg)
 {
@@ -2558,7 +2692,7 @@  static int napi_gro_complete(struct sk_buff *skb)
 	}
 
 out:
-	return netif_receive_skb(skb);
+	return __netif_receive_skb(skb);
 }
 
 void napi_gro_flush(struct napi_struct *napi)
@@ -2691,7 +2825,7 @@  gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
 {
 	switch (ret) {
 	case GRO_NORMAL:
-		if (netif_receive_skb(skb))
+		if (__netif_receive_skb(skb))
 			ret = GRO_DROP;
 		break;
 
@@ -2765,7 +2899,7 @@  gro_result_t napi_frags_finish(struct napi_struct *napi, struct sk_buff *skb,
 
 		if (ret == GRO_HELD)
 			skb_gro_pull(skb, -ETH_HLEN);
-		else if (netif_receive_skb(skb))
+		else if (__netif_receive_skb(skb))
 			ret = GRO_DROP;
 		break;
 
@@ -2840,16 +2974,16 @@  static int process_backlog(struct napi_struct *napi, int quota)
 	do {
 		struct sk_buff *skb;
 
-		local_irq_disable();
+		spin_lock_irq(&queue->input_pkt_queue.lock);
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
 			__napi_complete(napi);
-			local_irq_enable();
+			spin_unlock_irq(&queue->input_pkt_queue.lock);
 			break;
 		}
-		local_irq_enable();
+		spin_unlock_irq(&queue->input_pkt_queue.lock);
 
-		netif_receive_skb(skb);
+		__netif_receive_skb(skb);
 	} while (++work < quota && jiffies == start_time);
 
 	return work;
@@ -2938,6 +3072,21 @@  void netif_napi_del(struct napi_struct *napi)
 }
 EXPORT_SYMBOL(netif_napi_del);
 
+/*
+ * net_rps_action sends any pending IPI's for rps.  This is only called from
+ * softirq and interrupts must be enabled.
+ */
+static void net_rps_action(void)
+{
+	int cpu;
+
+	/* Send pending IPI's to kick RPS processing on remote cpus. */
+	for_each_cpu_mask_nr(cpu, __get_cpu_var(rps_remote_softirq_cpus)) {
+		struct softnet_data *queue = &per_cpu(softnet_data, cpu);
+		cpu_clear(cpu, __get_cpu_var(rps_remote_softirq_cpus));
+		__smp_call_function_single(cpu, &queue->csd, 0);
+	}
+}
 
 static void net_rx_action(struct softirq_action *h)
 {
@@ -3009,6 +3158,8 @@  static void net_rx_action(struct softirq_action *h)
 out:
 	local_irq_enable();
 
+	net_rps_action();
+
 #ifdef CONFIG_NET_DMA
 	/*
 	 * There may not be any more sk_buffs coming right now, so push
@@ -3255,7 +3406,7 @@  static int softnet_seq_show(struct seq_file *seq, void *v)
 
 	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
 		   s->total, s->dropped, s->time_squeeze, 0,
-		   0, 0, 0, 0, /* was fastroute */
+		   0, 0, 0, s->rps, /* was fastroute */
 		   s->cpu_collision);
 	return 0;
 }
@@ -5403,6 +5554,8 @@  void free_netdev(struct net_device *dev)
 	/* Flush device addresses */
 	dev_addr_flush(dev);
 
+	kfree(dev->dev_rps_maps);
+
 	list_for_each_entry_safe(p, n, &dev->napi_list, dev_list)
 		netif_napi_del(p);
 
@@ -5877,6 +6030,10 @@  static int __init net_dev_init(void)
 		queue->completion_queue = NULL;
 		INIT_LIST_HEAD(&queue->poll_list);
 
+		queue->csd.func = trigger_softirq;
+		queue->csd.info = queue;
+		queue->csd.flags = 0;
+
 		queue->backlog.poll = process_backlog;
 		queue->backlog.weight = weight_p;
 		queue->backlog.gro_list = NULL;
@@ -5915,7 +6072,7 @@  subsys_initcall(net_dev_init);
 
 static int __init initialize_hashrnd(void)
 {
-	get_random_bytes(&skb_tx_hashrnd, sizeof(skb_tx_hashrnd));
+	get_random_bytes(&hashrnd, sizeof(hashrnd));
 	return 0;
 }
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index fbc1c74..5ca8731 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -18,6 +18,9 @@ 
 #include <linux/wireless.h>
 #include <net/wext.h>
 
+#include <linux/string.h>
+#include <linux/ctype.h>
+
 #include "net-sysfs.h"
 
 #ifdef CONFIG_SYSFS
@@ -253,6 +256,137 @@  static ssize_t store_tx_queue_len(struct device *dev,
 	return netdev_store(dev, attr, buf, len, change_tx_queue_len);
 }
 
+static char *get_token(const char **cp, size_t *len)
+{
+	const char *bp = *cp;
+	char *start;
+
+	while (isspace(*bp))
+		bp++;
+
+	start = (char *)bp;
+	while (!isspace(*bp) && *bp != '\0')
+		bp++;
+
+	if (start != bp)
+		*len = bp - start;
+	else
+		start = NULL;
+
+	*cp = bp;
+	return start;
+}
+
+static void dev_map_release(struct rcu_head *rcu)
+{
+	struct dev_rps_maps *drmap =
+	    container_of(rcu, struct dev_rps_maps, rcu);
+
+	kfree(drmap);
+}
+
+static ssize_t store_rps_cpus(struct device *dev,
+    struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct net_device *net = to_net_dev(dev);
+	struct napi_struct *napi;
+	cpumask_t mask;
+	int err, cpu, index, i;
+	int cnt = 0;
+	char *token;
+	const char *cp = buf;
+	size_t tlen;
+	struct dev_rps_maps *drmap, *old_drmap;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	cnt = 0;
+	list_for_each_entry(napi, &net->napi_list, dev_list)
+		cnt++;
+	if (cnt == 0) {
+		pr_err("forcing a non-null cnt on device %s\n", net->name);
+		cnt = 1;
+	}
+	pr_err("cnt=%d for device %s\n", cnt, net->name);
+
+	drmap = kzalloc(sizeof(struct dev_rps_maps) +
+	    RPS_MAP_SIZE * cnt, GFP_KERNEL);
+	if (!drmap)
+		return -ENOMEM;
+
+	drmap->num_maps = cnt;
+
+	cp = buf;
+	for (index = 0; index < cnt &&
+	   (token = get_token(&cp, &tlen)); index++) {
+		struct rps_map *map = (struct rps_map *)
+		    ((void *)drmap->maps + (RPS_MAP_SIZE * index));
+		err = bitmap_parse(token, tlen, cpumask_bits(&mask),
+		    nr_cpumask_bits);
+
+		if (err) {
+			kfree(drmap);
+			return err;
+		}
+
+		cpus_and(mask, mask, cpu_online_map);
+		i = 0;
+		for_each_cpu_mask(cpu, mask) {
+			if (i >= MAX_RPS_CPUS)
+				break;
+			map->map[i++] =  cpu;
+		}
+		map->len = i;
+	}
+
+	rcu_read_lock_bh();
+	old_drmap = rcu_dereference(net->dev_rps_maps);
+	rcu_assign_pointer(net->dev_rps_maps, drmap);
+	rcu_read_unlock_bh();
+
+	if (old_drmap)
+		call_rcu(&old_drmap->rcu, dev_map_release);
+
+	return len;
+}
+
+static ssize_t show_rps_cpus(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	struct net_device *net = to_net_dev(dev);
+	size_t len = 0;
+	cpumask_t mask;
+	int i, j;
+	struct dev_rps_maps *drmap;
+
+	rcu_read_lock_bh();
+	drmap = rcu_dereference(net->dev_rps_maps);
+
+	if (drmap) {
+		for (j = 0; j < drmap->num_maps; j++) {
+			struct rps_map *map = (struct rps_map *)
+			    ((void *)drmap->maps + (RPS_MAP_SIZE * j));
+			cpus_clear(mask);
+			for (i = 0; i < map->len; i++)
+				cpu_set(map->map[i], mask);
+
+			len += cpumask_scnprintf(buf + len, PAGE_SIZE - len, &mask);
+			if (PAGE_SIZE - len < 3) {
+				rcu_read_unlock_bh();
+				return -EINVAL;
+			}
+			if (j < drmap->num_maps - 1)
+				len += sprintf(buf + len, " ");
+		}
+	}
+
+	rcu_read_unlock_bh();
+
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+
 static ssize_t store_ifalias(struct device *dev, struct device_attribute *attr,
 			     const char *buf, size_t len)
 {
@@ -309,6 +443,7 @@  static struct device_attribute net_class_attributes[] = {
 	__ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags),
 	__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
 	       store_tx_queue_len),
+	__ATTR(rps_cpus, S_IRUGO | S_IWUSR, show_rps_cpus, store_rps_cpus),
 	{}
 };