[v5] rps: Receive Packet Steering

Message ID alpine.DEB.1.00.1001141353140.19018@pokey.mtv.corp.google.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Tom Herbert Jan. 14, 2010, 9:56 p.m. UTC
This patch implements software receive side packet steering (RPS).  RPS
distributes the load of received packet processing across multiple CPUs.

Problem statement: Protocol processing done in the NAPI context for received
packets is serialized per device queue and becomes a bottleneck under high
packet load.  This substantially limits the pps that can be achieved on a
single-queue NIC and provides no scaling with multiple cores.

This solution queues packets early on in the receive path on the backlog queues
of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
performed on packets in parallel.   For each device (or NAPI instance for
a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
process packets for the device. A CPU is selected on a per packet basis by
hashing contents of the packet header (the TCP or UDP 4-tuple) and using the
result to index into the CPU mask.  The IPI mechanism is used to raise
networking receive softirqs between CPUs.  This effectively emulates in
software what a multi-queue NIC can provide, but is generic, requiring no device
support.

Many devices now provide a hash over the 4-tuple on a per packet basis
(Toeplitz is popular).  This patch allows drivers to set the HW-reported hash
in an skb field, and that value in turn is used to index into the RPS maps.
Using the HW generated hash can avoid cache misses on the packet when
steering the packet to a remote CPU.
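
As an aside on how a driver feeds this (illustrative only; the descriptor
structure below is hypothetical and not defined by this patch), the RX
completion path just records the hardware hash before handing the skb up
the stack:

struct example_rx_desc {
	__le32 rss_hash;	/* hypothetical field holding the NIC's Toeplitz hash */
	/* ... */
};

static void example_rx_complete(struct net_device *netdev, struct sk_buff *skb,
				const struct example_rx_desc *rx_desc)
{
	/* Record the hash the NIC computed so RPS can pick a CPU without
	 * touching the packet headers. */
	skb->rxhash = le32_to_cpu(rx_desc->rss_hash);

	skb->protocol = eth_type_trans(skb, netdev);
	netif_receive_skb(skb);
}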

The CPU masks are set on a per-device basis in the sysfs variable
/sys/class/net/<device>/rps_cpus.  This is a set of canonical bit maps, one for
each NAPI instance of the device.  For example:

echo "0b 0b0 0b00 0b000" > /sys/class/net/eth0/rps_cpus

would set maps for four NAPI instances on eth0.

Generally, we have found this technique increases the pps capability of a
single-queue device with good CPU utilization.  Optimal settings for the CPU
masks seem to depend on architecture and cache hierarchy.  Below are some
results from running 500 instances of the netperf TCP_RR test with 1-byte
requests and responses.
Results show cumulative transaction rate and system CPU utilization.

e1000e on 8 core Intel
    Without RPS: 90K tps at 33% CPU
    With RPS:    239K tps at 60% CPU

forcedeth on 16 core AMD
    Without RPS: 103K tps at 15% CPU
    With RPS:    285K tps at 49% CPU

Caveats:
- The benefits of this patch are dependent on architecture and cache hierarchy.
Tuning the masks to get best performance is probably necessary.
- This patch adds overhead in the path for processing a single packet.  On
a lightly loaded server this overhead may eliminate the advantages of
increased parallelism, and possibly cause some relative performance degradation.
We have found that RPS masks that are cache aware (sharing the same caches as
the interrupting CPU) mitigate much of this.
- The RPS masks can be changed dynamically; however, whenever a mask is changed
there is a possibility of generating out-of-order packets.  It's
probably best not to change the masks too frequently.

Signed-off-by: Tom Herbert <therbert@google.com>

Comments

stephen hemminger Jan. 14, 2010, 10:56 p.m. UTC | #1
On Thu, 14 Jan 2010 13:56:23 -0800 (PST)
Tom Herbert <therbert@google.com> wrote:

> This patch implements software receive side packet steering (RPS).  RPS
> distributes the load of received packet processing across multiple CPUs.
> 
> Problem statement: Protocol processing done in the NAPI context for received
> packets is serialized per device queue and becomes a bottleneck under high
> packet load.  This substantially limits pps that can be achieved on a single
> queue NIC and provides no scaling with multiple cores.
> 
> This solution queues packets early on in the receive path on the backlog queues
> of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
> performed on packets in parallel.   For each device (or NAPI instance for
> a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
> process packets for the device. A CPU is selected on a per packet basis by
> hashing contents of the packet header (the TCP or UDP 4-tuple) and using the
> result to index into the CPU mask.  The IPI mechanism is used to raise
> networking receive softirqs between CPUs.  This effectively emulates in
> software what a multi-queue NIC can provide, but is generic requiring no device
> support.
> 
> Many devices now provide a hash over the 4-tuple on a per packet basis
> (Toeplitz is popular).  This patch allow drivers to set the HW reported hash
> in an skb field, and that value in turn is used to index into the RPS maps.
> Using the HW generated hash can avoid cache misses on the packet when
> steering the packet to a remote CPU.
> 
> The CPU masks is set on a per device basis in the sysfs variable
> /sys/class/net/<device>/rps_cpus.  This is a set of canonical bit maps for
> each NAPI nstance of the device.  For example:
> 
> echo "0b 0b0 0b00 0b000" > /sys/class/net/eth0/rps_cpus

Why not make a kobject out of the cpus, which would add a subdirectory?
This would keep the interface consistent with the one-value-per-file
semantic of sysfs.
Rick Jones Jan. 14, 2010, 11:31 p.m. UTC | #2
>>The CPU masks is set on a per device basis in the sysfs variable
>>/sys/class/net/<device>/rps_cpus.  This is a set of canonical bit maps for
>>each NAPI nstance of the device.  For example:
>>
>>echo "0b 0b0 0b00 0b000" > /sys/class/net/eth0/rps_cpus
> 
> 
> Why not make a kobject out of cpus which would add subdirectory.
> This would keep interface consistent with the one value per file
> semantic of sysfs.

Perhaps a question of po-tay-toe vs po-tah-toe, but which will be easier to deal
with for, say, 128 or 256 "CPUs"?

rick jones
Eric Dumazet Jan. 15, 2010, 6:19 a.m. UTC | #3
On 15/01/2010 03:22, Changli Gao wrote:
> On Fri, Jan 15, 2010 at 5:56 AM, Tom Herbert <therbert@google.com> wrote:
>> +
>> +       if (skb->rxhash)
>> +               goto got_hash; /* Skip hash computation on packet header */
>> +
>> +       switch (skb->protocol) {
>> +       case __constant_htons(ETH_P_IP):
>> +               if (!pskb_may_pull(skb, sizeof(*ip)))
>> +                       goto done;
>> +
>> +               ip = (struct iphdr *) skb->data;
>> +               ip_proto = ip->protocol;
>> +               addr1 = ip->saddr;
>> +               addr2 = ip->daddr;
>> +               ihl = ip->ihl;
>> +               break;
>> +       case __constant_htons(ETH_P_IPV6):
>> +               if (!pskb_may_pull(skb, sizeof(*ip6)))
>> +                       goto done;
>> +
>> +               ip6 = (struct ipv6hdr *) skb->data;
>> +               ip_proto = ip6->nexthdr;
> This code can't work, when there are extra headers. ipv6_skip_exthdr()
> can be used to get the l4 header.

Could you give exact code please ?

> 
>> +               addr1 = ip6->saddr.s6_addr32[3];
>> +               addr2 = ip6->daddr.s6_addr32[3];
>> +               ihl = (40 >> 2);
>> +               break;
>> +       default:
>> +               goto done;
>> +       }
>> +       ports = 0;
>> +       switch (ip_proto) {
>> +       case IPPROTO_TCP:
>> +       case IPPROTO_UDP:
>> +       case IPPROTO_DCCP:
>> +       case IPPROTO_ESP:
>> +       case IPPROTO_AH:
>> +       case IPPROTO_SCTP:
>> +       case IPPROTO_UDPLITE:
>> +               if (pskb_may_pull(skb, (ihl * 4) + 4))
>> +                       ports = *((u32 *) (skb->data + (ihl * 4)));
>> +               break;
>> +
>> +       default:
>> +               break;
>> +       }
>> +
>> +       skb->rxhash = jhash_3words(addr1, addr2, ports, hashrnd);
> For connection based packet processing, such as netfilter,
> distributing the packets in two directions into one CPU will reduce
> cache miss, when NAT isn't used. I think the code bellow will help:
> if (addr1 > addr2)
>   swap(addr1, addr2);

Yes, I already gave this hint in a previous review, but this adds a test
and I suspect Google is not using NAT :)

> 
>> +       if (!skb->rxhash)
>> +               skb->rxhash = 1;
> 
> Why not put the above code into a new function, and add more protocols
> support, such as 802.1Q.  Though rxhash is based on 4-tuple, I think
> netfilter will benefit from it.
> 

Sure, this can be done in a followup patch.

Thanks
Eric Dumazet Jan. 15, 2010, 6:27 a.m. UTC | #4
On 14/01/2010 22:56, Tom Herbert wrote:
> This patch implements software receive side packet steering (RPS).  RPS
> distributes the load of received packet processing across multiple CPUs.
> 
> Problem statement: Protocol processing done in the NAPI context for
> received
> packets is serialized per device queue and becomes a bottleneck under high
> packet load.  This substantially limits pps that can be achieved on a
> single
> queue NIC and provides no scaling with multiple cores.
> 
> This solution queues packets early on in the receive path on the backlog
> queues
> of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
> performed on packets in parallel.   For each device (or NAPI instance for
> a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
> process packets for the device. A CPU is selected on a per packet basis by
> hashing contents of the packet header (the TCP or UDP 4-tuple) and using
> the
> result to index into the CPU mask.  The IPI mechanism is used to raise
> networking receive softirqs between CPUs.  This effectively emulates in
> software what a multi-queue NIC can provide, but is generic requiring no
> device
> support.
> 
> Many devices now provide a hash over the 4-tuple on a per packet basis
> (Toeplitz is popular).  This patch allow drivers to set the HW reported
> hash
> in an skb field, and that value in turn is used to index into the RPS maps.
> Using the HW generated hash can avoid cache misses on the packet when
> steering the packet to a remote CPU.
> 
> The CPU masks is set on a per device basis in the sysfs variable
> /sys/class/net/<device>/rps_cpus.  This is a set of canonical bit maps for
> each NAPI nstance of the device.  For example:
> 
> echo "0b 0b0 0b00 0b000" > /sys/class/net/eth0/rps_cpus
> 
> would set maps for four NAPI instances on eth0.
> 
> Generally, we have found this technique increases pps capabilities of a
> single
> queue device with good CPU utilization.  Optimal settings for the CPU mask
> seems to depend on architectures and cache hierarcy.  Below are some
> results
> running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
> Results show cumulative transaction rate and system CPU utilization.
> 
> e1000e on 8 core Intel
>    Without RPS: 90K tps at 33% CPU
>    With RPS:    239K tps at 60% CPU
> 
> foredeth on 16 core AMD
>    Without RPS: 103K tps at 15% CPU
>    With RPS:    285K tps at 49% CPU
> 
> Caveats:
> - The benefits of this patch are dependent on architecture and cache
> hierarchy.
> Tuning the masks to get best performance is probably necessary.
> - This patch adds overhead in the path for processing a single packet.  In
> a lightly loaded server this overhead may eliminate the advantages of
> increased parallelism, and possibly cause some relative performance
> degradation.
> We have found that RPS masks that are cache aware (share same caches with
> the interrupting CPU) mitigate much of this.
> - The RPS masks can be changed dynamically, however whenever the mask is
> changed
> this introduces the possbility of generating out of order packets.  It's
> probably best not change the masks too frequently.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>


> 
> +/*
> + * net_rps_action sends any pending IPI's for rps.  This is only called
> from
> + * softirq and interrupts must be enabled.
> + */
> +static void net_rps_action(void)
> +{
> +    int cpu;
> +
> +    /* Send pending IPI's to kick RPS processing on remote cpus. */
> +    for_each_cpu_mask_nr(cpu, __get_cpu_var(rps_remote_softirq_cpus)) {
> +        struct softnet_data *queue = &per_cpu(softnet_data, cpu);
> +        cpu_clear(cpu, __get_cpu_var(rps_remote_softirq_cpus));
> +        if (cpu_online(cpu))
> +            __smp_call_function_single(cpu, &queue->csd, 0);
> +    }
> +}
> 

So we have this last bit that might have a reentrancy problem...

Do you plan a followup patch to copy rps_remote_softirq_cpus into a local variable
before enabling interrupts and calling net_rps_action()?

	cpumask_t rps_copy;

	copy and clean rps_remote_softirq_cpus
	local_irq_enable();
	net_rps_action(&rps_copy); 
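
Spelled out, the suggestion would look roughly like this (a sketch only, not
the posted code):

	/* At the tail of net_rx_action(), with interrupts still disabled:
	 * snapshot and clear the pending mask, then send the IPIs from the
	 * private copy once interrupts are re-enabled. */
	cpumask_t rps_copy = __get_cpu_var(rps_remote_softirq_cpus);

	cpus_clear(__get_cpu_var(rps_remote_softirq_cpus));
	local_irq_enable();
	net_rps_action(&rps_copy);

/* ...with net_rps_action() taking the snapshot as an argument: */
static void net_rps_action(cpumask_t *mask)
{
	int cpu;

	/* Send pending IPIs to kick RPS processing on remote CPUs. */
	for_each_cpu_mask_nr(cpu, *mask) {
		struct softnet_data *queue = &per_cpu(softnet_data, cpu);

		if (cpu_online(cpu))
			__smp_call_function_single(cpu, &queue->csd, 0);
	}
}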
Changli Gao Jan. 15, 2010, 6:39 a.m. UTC | #5
On Fri, Jan 15, 2010 at 2:19 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 15/01/2010 03:22, Changli Gao wrote:
>> On Fri, Jan 15, 2010 at 5:56 AM, Tom Herbert <therbert@google.com> wrote:
>>> +
>>> +       if (skb->rxhash)
>>> +               goto got_hash; /* Skip hash computation on packet header */
>>> +
>>> +       switch (skb->protocol) {
>>> +       case __constant_htons(ETH_P_IP):
>>> +               if (!pskb_may_pull(skb, sizeof(*ip)))
>>> +                       goto done;
>>> +
>>> +               ip = (struct iphdr *) skb->data;
>>> +               ip_proto = ip->protocol;
>>> +               addr1 = ip->saddr;
>>> +               addr2 = ip->daddr;
>>> +               ihl = ip->ihl;
>>> +               break;
>>> +       case __constant_htons(ETH_P_IPV6):
>>> +               if (!pskb_may_pull(skb, sizeof(*ip6)))
>>> +                       goto done;
>>> +
>>> +               ip6 = (struct ipv6hdr *) skb->data;
>>> +               ip_proto = ip6->nexthdr;
>> This code can't work, when there are extra headers. ipv6_skip_exthdr()
>> can be used to get the l4 header.
>
> Could you give exact code please ?
>

The code below is from my ifb-mq.patch
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
        case __constant_htons(ETH_P_IPV6):
process_ipv6:
                if (unlikely(!pskb_may_pull(skb, sizeof(struct ipv6hdr))))
                        goto process_other;
                addr1 = ipv6_hdr(skb)->saddr.s6_addr32[3];
                addr2 = ipv6_hdr(skb)->daddr.s6_addr32[3];
                ihl = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &ip_proto);
                if (unlikely(ihl < 0))
                        goto process_other_trans;
                break;
#endif
Eric Dumazet Jan. 15, 2010, 6:57 a.m. UTC | #6
On 15/01/2010 07:39, Changli Gao wrote:
> The code bellow is from my ifb-mq.patch
> #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
>         case __constant_htons(ETH_P_IPV6):
> process_ipv6:
>                 if (unlikely(!pskb_may_pull(skb, sizeof(struct ipv6hdr))))
>                         goto process_other;
>                 addr1 = ipv6_hdr(skb)->saddr.s6_addr32[3];
>                 addr2 = ipv6_hdr(skb)->daddr.s6_addr32[3];
>                 ihl = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &ip_proto);
>                 if (unlikely(ihl < 0))
>                         goto process_other_trans;
>                 break;
> #endif
> 
> 

Thanks Changli !

David Miller Jan. 15, 2010, 8:49 a.m. UTC | #7
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 15 Jan 2010 07:57:56 +0100

> On 15/01/2010 07:39, Changli Gao wrote:
>> The code bellow is from my ifb-mq.patch
>> #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
>>         case __constant_htons(ETH_P_IPV6):
>> process_ipv6:
>>                 if (unlikely(!pskb_may_pull(skb, sizeof(struct ipv6hdr))))
>>                         goto process_other;
>>                 addr1 = ipv6_hdr(skb)->saddr.s6_addr32[3];
>>                 addr2 = ipv6_hdr(skb)->daddr.s6_addr32[3];
>>                 ihl = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &ip_proto);
>>                 if (unlikely(ihl < 0))
>>                         goto process_other_trans;
>>                 break;
>> #endif
>> 
>> 
> 
> Thanks Changli !

Actually, no thanks.  Have you actually taken a look at
ipv6_skip_exthdr()?

Do that, then tell me that you want the extra function call, plus all
of the processing and data touching that that function does, just to
handle the case that there "might" be ipv6 extension headers there.

It is the exception rather than the rule, and I think it's fine to just
assume we have a real protocol header next.

And that's what skb_tx_hash() used to do too before we started using
the recorded RX queue and socket hash values.

Nobody cared and nobody complained.  Guess why?  Because in practice
it doesn't matter.
David Miller Jan. 15, 2010, 8:50 a.m. UTC | #8
From: Changli Gao <xiaosuo@gmail.com>
Date: Fri, 15 Jan 2010 10:22:03 +0800

> For connection based packet processing, such as netfilter,
> distributing the packets in two directions into one CPU will reduce
> cache miss, when NAT isn't used. I think the code bellow will help:
> if (addr1 > addr2)
>   swap(addr1, addr2);

You can't just do the addresses, the ports will swap too.
Changli Gao Jan. 15, 2010, 9:05 a.m. UTC | #9
On Fri, Jan 15, 2010 at 4:50 PM, David Miller <davem@davemloft.net> wrote:
> From: Changli Gao <xiaosuo@gmail.com>
> Date: Fri, 15 Jan 2010 10:22:03 +0800
>
>> For connection based packet processing, such as netfilter,
>> distributing the packets in two directions into one CPU will reduce
>> cache miss, when NAT isn't used. I think the code bellow will help:
>> if (addr1 > addr2)
>>   swap(addr1, addr2);
>
> You can't just do the addresses, the ports will swap too.
>

Yea, and it is just an example.
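
For reference, a sketch of the symmetric variant being discussed (illustrative
only, not part of the patch): order the address pair and rotate the 16-bit port
halves along with it, so both directions of a flow produce the same hash.

static u32 rps_symmetric_hash(u32 addr1, u32 addr2, u32 ports, u32 hashrnd)
{
	if (addr1 > addr2) {
		swap(addr1, addr2);
		/* Keep the ports consistent with the address swap so the
		 * reverse direction hashes identically.  (The corner case of
		 * equal addresses is ignored here for brevity.) */
		ports = (ports >> 16) | (ports << 16);
	}
	return jhash_3words(addr1, addr2, ports, hashrnd);
}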
Changli Gao Jan. 15, 2010, 9:20 a.m. UTC | #10
On Fri, Jan 15, 2010 at 4:49 PM, David Miller <davem@davemloft.net> wrote:
>
> Actually, no thanks.  Have you actually taken a look at
> ipv6_skip_exthdr()?
>
> Do that, then tell me that you want the extra function call, plus all
> of the processing and data touching that that function does, just to
> handle the case that there "might" be ipv6 extension headers there.
>

I don't think ipv6_skip_exthdr() is too heavyweight. If there aren't any
extension headers, only some compare and jump instructions are added, and no
additional data references. If there are extension headers, I think distributing
packets among CPUs is more important than the extra cost introduced by
calling ipv6_skip_exthdr().

> It is the exception rather than the rule, and I think it's just
> assume we have a real protocol header next.
>
> And that's what skb_tx_hash() used to do too before we started using
> the recorded RX queue and socket hash values.
>
> Nobody cared and nobody complained.  Guess why?  Because in practice
> it doesn't matter.
>

Maybe they don't know about it. If it were a performance regression, I think
more people might pay attention to it.
David Miller Jan. 15, 2010, 9:26 a.m. UTC | #11
From: Changli Gao <xiaosuo@gmail.com>
Date: Fri, 15 Jan 2010 17:20:43 +0800

> On Fri, Jan 15, 2010 at 4:49 PM, David Miller <davem@davemloft.net> wrote:
>>
>> Actually, no thanks.  Have you actually taken a look at
>> ipv6_skip_exthdr()?
>>
>> Do that, then tell me that you want the extra function call, plus all
>> of the processing and data touching that that function does, just to
>> handle the case that there "might" be ipv6 extension headers there.
>>
> 
> I don't think ipv6_skip_exthdr() is too weight. If there isn't any
> extra header, only some compare and jump instruments are added, and no
> more data references. If there are some headers, I think distributing
> packets among CPUs is more important than the extra cost introduced by
> calling ipv6_skip_exthdr().

Calling a function is expensive.

What is now a leaf function deep in the call chain will no longer
be one, so GCC will need to push all live registers onto the stack,
then reload them back into registers when ipv6_skip_exthdr() returns.

And that function is expensive; it's a lot of code that 99% of the
time serves no purpose at all.

This will be executed for every single packet we process, and Linux
can process millions of packets per second, so every cycle and every
memory reference matters.

> Maybe they don't know it.If it was a performance regression, I think
> more people might pay attention on it.

And we can address such a problem at that time.

Can you show a real-life setup that sees ipv6 packets with extension
headers and would be affected by this?

Really, I do not want to bloat up this path with useless code
execution when for all practical purposes it really doesn't matter.
David Miller Jan. 15, 2010, 9:45 a.m. UTC | #12
Tom, your patch still doesn't apply.

I took it out of patchwork:

	http://patchwork.ozlabs.org/patch/42931/

and the patch is all corrupted in the leading line characters.
Ben Hutchings Jan. 16, 2010, 2:11 a.m. UTC | #13
On Thu, 2010-01-14 at 14:56 -0800, Stephen Hemminger wrote:
> On Thu, 14 Jan 2010 13:56:23 -0800 (PST)
> Tom Herbert <therbert@google.com> wrote:
[...]
> > The CPU masks is set on a per device basis in the sysfs variable
> > /sys/class/net/<device>/rps_cpus.  This is a set of canonical bit maps for
> > each NAPI nstance of the device.  For example:
> > 
> > echo "0b 0b0 0b00 0b000" > /sys/class/net/eth0/rps_cpus
> 
> Why not make a kobject out of cpus which would add subdirectory.
> This would keep interface consistent with the one value per file
> semantic of sysfs.

Do you mean a kobject per NAPI context, each initially with a rps_cpus
attribute holding a CPU mask, or a kobject per CPU, each with an
attribute specifying which NAPI contexts it does work for?  (Personally
I'd favour the first.)

Ben.
Ben Hutchings Jan. 16, 2010, 2:26 a.m. UTC | #14
On Thu, 2010-01-14 at 13:56 -0800, Tom Herbert wrote:
[...]
> The CPU masks is set on a per device basis in the sysfs variable
> /sys/class/net/<device>/rps_cpus.  This is a set of canonical bit maps for
> each NAPI nstance of the device.  For example:
> 
> echo "0b 0b0 0b00 0b000" > /sys/class/net/eth0/rps_cpus
> 
> would set maps for four NAPI instances on eth0.
[...]
> Caveats:
> - The benefits of this patch are dependent on architecture and cache hierarchy.
> Tuning the masks to get best performance is probably necessary.

It seems to me that it would be helpful to provide some kind of sensible
default behaviour.  I'm sure Google has the in-house expertise to do
this at a higher level, but most end users rely on good defaults rather
than tuning.
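
One conceivable fallback, sketched here purely for illustration (it is not
something the patch implements), would be to hash across all online CPUs
whenever no rps_cpus mask has been configured:

static int rps_default_cpu(u32 hash)
{
	/* Map the flow hash onto the set of online CPUs; no cache-topology
	 * awareness, just a uniform spread. */
	unsigned int target = ((u64) hash * num_online_cpus()) >> 32;
	int cpu = cpumask_first(cpu_online_mask);

	while (target-- && cpu < nr_cpu_ids)
		cpu = cpumask_next(cpu, cpu_online_mask);

	return cpu < nr_cpu_ids ? cpu : smp_processor_id();
}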

[...]
> +/*
> + * Structure that contains the rps maps for various NAPI instances of a device.
> + */
> +struct dev_rps_maps {
> +	int num_maps;
> +	struct rcu_head rcu;
> +	struct rps_map maps[0];

This declaration is a botch.  An array of structures that themselves end in
flexible arrays has an index operation, but it's broken.  It would be better to
remove the maps member and define an inline function for indexing the following
array of maps, instead of writing the magic formula way over...

[...]
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
[...]
> +static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
> +{
[...]
> +	map = (struct rps_map *)
> +	    ((void *)drmap->maps + (rps_map_size * index));
[...]

...here.
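
Something along these lines would do it (the helper name here is illustrative,
not taken from the patch):

static inline struct rps_map *dev_rps_map(struct dev_rps_maps *drmap,
					  unsigned int index)
{
	/* Each entry is rps_map_size bytes: the struct rps_map header plus
	 * rps_cpus_in_map u16 CPU ids. */
	return (struct rps_map *)((void *)drmap->maps + rps_map_size * index);
}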

[...]
> @@ -2363,10 +2487,10 @@ void netif_nit_deliver(struct sk_buff *skb)
>   }
> 
>   /**
> - *	netif_receive_skb - process receive buffer from network
> + *	__netif_receive_skb - process receive buffer from network
>    *	@skb: buffer to process
>    *
> - *	netif_receive_skb() is the main receive data processing function.
> + *	__netif_receive_skb() is the main receive data processing function.
>    *	It always succeeds. The buffer may be dropped during processing
>    *	for congestion control or by the protocol layers.
>    *

Surely this kernel-doc should be moved rather than modified, since you
want most callers to continue using netif_receive_skb()?

[...]
> @@ -2475,6 +2599,16 @@ out:
>   }
>   EXPORT_SYMBOL(netif_receive_skb);

This should be moved underneath the new implementation...

> +int netif_receive_skb(struct sk_buff *skb)
> +{
> +	int cpu = get_rps_cpu(skb->dev, skb);
> +
> +	if (cpu < 0)
> +		return __netif_receive_skb(skb);
> +	else
> +		return enqueue_to_backlog(skb, cpu);
> +}
[...]

...here.

Ben.
stephen hemminger Jan. 17, 2010, 5:22 p.m. UTC | #15
On Sat, 16 Jan 2010 02:11:33 +0000
Ben Hutchings <bhutchings@solarflare.com> wrote:

> On Thu, 2010-01-14 at 14:56 -0800, Stephen Hemminger wrote:
> > On Thu, 14 Jan 2010 13:56:23 -0800 (PST)
> > Tom Herbert <therbert@google.com> wrote:
> [...]
> > > The CPU masks is set on a per device basis in the sysfs variable
> > > /sys/class/net/<device>/rps_cpus.  This is a set of canonical bit maps for
> > > each NAPI nstance of the device.  For example:
> > > 
> > > echo "0b 0b0 0b00 0b000" > /sys/class/net/eth0/rps_cpus
> > 
> > Why not make a kobject out of cpus which would add subdirectory.
> > This would keep interface consistent with the one value per file
> > semantic of sysfs.
> 
> Do you mean a kobject per NAPI context, each initially with a rps_cpus
> attribute holding a CPU mask, or a kobject per CPU, each with an
> attribute specifying which NAPI contexts it does work for?  (Personally
> I'd favour the first.)

Yes, make NAPI instances real kobjects, linked to the device.

BUT
make sure this also handles the case of N-to-1 mapping as well
as the 1-to-N case.
Changli Gao Jan. 21, 2010, 7:04 a.m. UTC | #16
On Fri, Jan 15, 2010 at 5:26 PM, David Miller <davem@davemloft.net> wrote:
> From: Changli Gao <xiaosuo@gmail.com>
> Date: Fri, 15 Jan 2010 17:20:43 +0800
>
> Calling a function is expensive.
>
> What was now a leaf function deep in the call chain, will no longer
> be, so GCC will need to push all live registers onto the stack,
> then reload them back into registers when ipv6_skip_exthdr() returns.
>
> And that function is expensive, it's a lot of code that %99 of the
> time serves no purpose at all.
>
> This will be executed for every single packet we process, and Linux
> can process millions of packets per second, so every cycle and every
> memory reference matters.
>

We can write a new inline function like this:

static inline int ipv6_get_ports(const struct sk_buff *skb, u16 *port1,
                u16 *port2)
{
        u8 nexthdr;
        int hdrlen;

        nexthdr = ipv6_hdr(skb)->nexthdr;
        hdrlen = sizeof(struct ipv6hdr);
        while (1) {
                switch (nexthdr) {
                case IPPROTO_TCP:
                case IPPROTO_UDP:
                case IPPROTO_DCCP:
                case IPPROTO_ESP:
                case IPPROTO_AH:
                case IPPROTO_SCTP:
                case IPPROTO_UDPLITE:
                        skb_copy_bits(skb, hdrlen, port1, 2);
                        skb_copy_bits(skb, hdrlen + 2, port2, 2);
                        return 0;
                case NEXTHDR_HOP:
                case NEXTHDR_ROUTING:
                case NEXTHDR_FRAGMENT:
                case NEXTHDR_DEST:
                        /* NEXTHDR_AUTH has the same value as IPPROTO_AH,
                         * which is already matched above */
                        // some code like ipv6_skip_exthdr()
                        ....
                        break;
                case NEXTHDR_NONE:
                        return -1;
                default:
                        return -1;
                }
        }
}
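
For reference, a sketch of what the elided extension-header branch would have
to do, modelled loosely on ipv6_skip_exthdr() (illustrative only; fragments and
malformed chains are not handled carefully here):

                case NEXTHDR_HOP:
                case NEXTHDR_ROUTING:
                case NEXTHDR_FRAGMENT:
                case NEXTHDR_DEST: {
                        struct ipv6_opt_hdr _hdr, *hp;

                        hp = skb_header_pointer(skb, hdrlen, sizeof(_hdr), &_hdr);
                        if (hp == NULL)
                                return -1;
                        if (nexthdr == NEXTHDR_FRAGMENT)
                                hdrlen += 8;    /* fragment header is fixed size */
                        else
                                hdrlen += ipv6_optlen(hp);
                        nexthdr = hp->nexthdr;
                        break;
                }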
Changli Gao Jan. 21, 2010, 7:54 a.m. UTC | #17
On Fri, Jan 15, 2010 at 5:56 AM, Tom Herbert <therbert@google.com> wrote:
> +/*
> + * get_rps_cpu is called from netif_receive_skb and returns the target
> + * CPU from the RPS map of the receiving NAPI instance for a given skb.
> + */
> +static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
> +{
> +       u32 addr1, addr2, ports;
> +       struct ipv6hdr *ip6;
> +       struct iphdr *ip;
> +       u32 ihl;
> +       u8 ip_proto;
> +       int cpu = -1;
> +       struct dev_rps_maps *drmap;
> +       struct rps_map *map = NULL;
> +       u16 index;
> +
> +       rcu_read_lock();
> +
> +       drmap = rcu_dereference(dev->dev_rps_maps);
> +       if (!drmap)
> +               goto done;
> +
> +       index = skb_get_rx_queue(skb);
> +       if (index >= drmap->num_maps)
> +               index = 0;
> +
> +       map = (struct rps_map *)
> +           ((void *)drmap->maps + (rps_map_size * index));
> +       if (!map->len)
> +               goto done;
> +
> +       if (skb->rxhash)
> +               goto got_hash; /* Skip hash computation on packet header */
> +

Sometimes the generated rxhash will be 0. In order to check whether
rxhash has been generated or not, a new bit field in sk_buff is needed. When
rxhash is generated and saved in the sk_buff, the bit would be set.

And I think rxhash should be preserved when calling skb_copy and skb_clone.
Eric Dumazet Jan. 21, 2010, 9:16 a.m. UTC | #18
On 21/01/2010 08:54, Changli Gao wrote:
> Sometimes, rxhash will be 0 generated. In order to check whether
> rxhash is generated or not, a new bit field in sk_buff is needed. When
> rxhash is generated and saved in sk_buff, the bit is set.
> 
> And, I think rxhash should be reserved when calling skb_copy and skb_clone.
> 
> 

I disagree

A null rxhash should not be generated by a driver, and even if it is null,
why should we care?

In this very unlikely event, let get_rps_cpu() compute a (non-null) hash.

Adding a bit in the skb for such a low-probability event brings nothing but complexity.
stephen hemminger Jan. 28, 2010, 6:04 a.m. UTC | #19
On Thu, 14 Jan 2010 13:56:23 -0800 (PST)
Tom Herbert <therbert@google.com> wrote:

> This patch implements software receive side packet steering (RPS).  RPS
> distributes the load of received packet processing across multiple CPUs.
> 
> Problem statement: Protocol processing done in the NAPI context for received
> packets is serialized per device queue and becomes a bottleneck under high
> packet load.  This substantially limits pps that can be achieved on a single
> queue NIC and provides no scaling with multiple cores.
> 
> This solution queues packets early on in the receive path on the backlog queues
> of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
> performed on packets in parallel.   For each device (or NAPI instance for
> a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
> process packets for the device. A CPU is selected on a per packet basis by
> hashing contents of the packet header (the TCP or UDP 4-tuple) and using the
> result to index into the CPU mask.  The IPI mechanism is used to raise
> networking receive softirqs between CPUs.  This effectively emulates in
> software what a multi-queue NIC can provide, but is generic requiring no device
> support.
> 
> Many devices now provide a hash over the 4-tuple on a per packet basis
> (Toeplitz is popular).  This patch allow drivers to set the HW reported hash
> in an skb field, and that value in turn is used to index into the RPS maps.
> Using the HW generated hash can avoid cache misses on the packet when
> steering the packet to a remote CPU.
> 
> The CPU masks is set on a per device basis in the sysfs variable
> /sys/class/net/<device>/rps_cpus.  This is a set of canonical bit maps for
> each NAPI nstance of the device.  For example:
> 
> echo "0b 0b0 0b00 0b000" > /sys/class/net/eth0/rps_cpus
> 
> would set maps for four NAPI instances on eth0.
> 
> Generally, we have found this technique increases pps capabilities of a single
> queue device with good CPU utilization.  Optimal settings for the CPU mask
> seems to depend on architectures and cache hierarcy.  Below are some results
> running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
> Results show cumulative transaction rate and system CPU utilization.
> 
> e1000e on 8 core Intel
>     Without RPS: 90K tps at 33% CPU
>     With RPS:    239K tps at 60% CPU
> 
> foredeth on 16 core AMD
>     Without RPS: 103K tps at 15% CPU
>     With RPS:    285K tps at 49% CPU
> 
> Caveats:
> - The benefits of this patch are dependent on architecture and cache hierarchy.
> Tuning the masks to get best performance is probably necessary.
> - This patch adds overhead in the path for processing a single packet.  In
> a lightly loaded server this overhead may eliminate the advantages of
> increased parallelism, and possibly cause some relative performance degradation.
> We have found that RPS masks that are cache aware (share same caches with
> the interrupting CPU) mitigate much of this.
> - The RPS masks can be changed dynamically, however whenever the mask is changed
> this introduces the possbility of generating out of order packets.  It's
> probably best not change the masks too frequently.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> 

I started playing and looking more closely at this.
1. CPU and several of the other parameters like backlog should be unsigned
   to avoid possible problems
2. __netif_receive_skb() can be static so gcc can optimize better
3. Not sure if it works or not with devices like sky2 that can have
   two netdevices sharing the same NAPI instance because both ports
   share an IRQ.

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 97873e3..8b33522 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -222,6 +222,7 @@  struct netif_rx_stats {
  	unsigned dropped;
  	unsigned time_squeeze;
  	unsigned cpu_collision;
+	unsigned received_rps;
  };

  DECLARE_PER_CPU(struct netif_rx_stats, netdev_rx_stat);
@@ -676,6 +677,27 @@  struct net_device_ops {
  };

  /*
+ * Structure for Receive Packet Steering.  Length of map and array of CPU ID's.
+ */
+struct rps_map {
+	int len;
+	u16 map[0];
+};
+
+#define MAX_RPS_CPUS 256 /* Limit maximum number of CPUs in a map */
+extern int rps_map_size; /* Size of an RPS map */
+extern int rps_cpus_in_map; /* Number of CPUs in a map */
+
+/*
+ * Structure that contains the rps maps for various NAPI instances of a device.
+ */
+struct dev_rps_maps {
+	int num_maps;
+	struct rcu_head rcu;
+	struct rps_map maps[0];
+};
+
+/*
   *	The DEVICE structure.
   *	Actually, this whole structure is a big mistake.  It mixes I/O
   *	data with strictly "high-level" data, and it has to know about
@@ -861,6 +883,9 @@  struct net_device {

  	struct netdev_queue	rx_queue;

+	struct dev_rps_maps	*dev_rps_maps;	/* Per-NAPI maps for
+						   receive packet steering */
+
  	struct netdev_queue	*_tx ____cacheline_aligned_in_smp;

  	/* Number of TX queues allocated at alloc_netdev_mq() time  */
@@ -1274,14 +1299,16 @@  static inline int unregister_gifconf(unsigned int family)
   */
  struct softnet_data {
  	struct Qdisc		*output_queue;
-	struct sk_buff_head	input_pkt_queue;
  	struct list_head	poll_list;
  	struct sk_buff		*completion_queue;

+	/* Elements below can be accessed between CPUs for RPS */
+	struct call_single_data	csd ____cacheline_aligned_in_smp;
+	struct sk_buff_head	input_pkt_queue;
  	struct napi_struct	backlog;
  };

-DECLARE_PER_CPU(struct softnet_data,softnet_data);
+DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);

  #define HAVE_NETIF_QUEUE

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 63f4742..f188301 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -267,6 +267,7 @@  typedef unsigned char *sk_buff_data_t;
   *	@mac_header: Link layer header
   *	@_skb_dst: destination entry
   *	@sp: the security path, used for xfrm
+ *	@rxhash: the packet hash computed on receive
   *	@cb: Control buffer. Free for use by every layer. Put private vars here
   *	@len: Length of actual data
   *	@data_len: Data length
@@ -323,6 +324,8 @@  struct sk_buff {
  #ifdef CONFIG_XFRM
  	struct	sec_path	*sp;
  #endif
+	__u32			rxhash;
+
  	/*
  	 * This is the control buffer. It is free to use for every
  	 * layer. Please put your private variables there. If you
diff --git a/net/core/dev.c b/net/core/dev.c
index 9977288..b7ad07d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1834,7 +1834,7 @@  out_kfree_skb:
  	return rc;
  }

-static u32 skb_tx_hashrnd;
+static u32 hashrnd __read_mostly;

  u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
  {
@@ -1852,7 +1852,7 @@  u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
  	else
  		hash = skb->protocol;

-	hash = jhash_1word(hash, skb_tx_hashrnd);
+	hash = jhash_1word(hash, hashrnd);

  	return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
  }
@@ -2070,9 +2070,154 @@  EXPORT_SYMBOL(dev_queue_xmit);
  int netdev_max_backlog __read_mostly = 1000;
  int netdev_budget __read_mostly = 300;
  int weight_p __read_mostly = 64;            /* old backlog weight */
+int rps_cpus_in_map __read_mostly;
+int rps_map_size __read_mostly;

  DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };

+/*
+ * get_rps_cpu is called from netif_receive_skb and returns the target
+ * CPU from the RPS map of the receiving NAPI instance for a given skb.
+ */
+static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
+{
+	u32 addr1, addr2, ports;
+	struct ipv6hdr *ip6;
+	struct iphdr *ip;
+	u32 ihl;
+	u8 ip_proto;
+	int cpu = -1;
+	struct dev_rps_maps *drmap;
+	struct rps_map *map = NULL;
+	u16 index;
+
+	rcu_read_lock();
+
+	drmap = rcu_dereference(dev->dev_rps_maps);
+	if (!drmap)
+		goto done;
+
+	index = skb_get_rx_queue(skb);
+	if (index >= drmap->num_maps)
+		index = 0;
+
+	map = (struct rps_map *)
+	    ((void *)drmap->maps + (rps_map_size * index));
+	if (!map->len)
+		goto done;
+
+	if (skb->rxhash)
+		goto got_hash; /* Skip hash computation on packet header */
+
+	switch (skb->protocol) {
+	case __constant_htons(ETH_P_IP):
+		if (!pskb_may_pull(skb, sizeof(*ip)))
+			goto done;
+
+		ip = (struct iphdr *) skb->data;
+		ip_proto = ip->protocol;
+		addr1 = ip->saddr;
+		addr2 = ip->daddr;
+		ihl = ip->ihl;
+		break;
+	case __constant_htons(ETH_P_IPV6):
+		if (!pskb_may_pull(skb, sizeof(*ip6)))
+			goto done;
+
+		ip6 = (struct ipv6hdr *) skb->data;
+		ip_proto = ip6->nexthdr;
+		addr1 = ip6->saddr.s6_addr32[3];
+		addr2 = ip6->daddr.s6_addr32[3];
+		ihl = (40 >> 2);
+		break;
+	default:
+		goto done;
+	}
+	ports = 0;
+	switch (ip_proto) {
+	case IPPROTO_TCP:
+	case IPPROTO_UDP:
+	case IPPROTO_DCCP:
+	case IPPROTO_ESP:
+	case IPPROTO_AH:
+	case IPPROTO_SCTP:
+	case IPPROTO_UDPLITE:
+		if (pskb_may_pull(skb, (ihl * 4) + 4))
+			ports = *((u32 *) (skb->data + (ihl * 4)));
+		break;
+
+	default:
+		break;
+	}
+
+	skb->rxhash = jhash_3words(addr1, addr2, ports, hashrnd);
+	if (!skb->rxhash)
+		skb->rxhash = 1;
+
+got_hash:
+	cpu = map->map[((u64) skb->rxhash * map->len) >> 32];
+
+	if (!cpu_online(cpu))
+		cpu = -1;
+done:
+	rcu_read_unlock();
+	return cpu;
+}
+
+static DEFINE_PER_CPU(cpumask_t, rps_remote_softirq_cpus);
+
+/* Called from hardirq (IPI) context */
+static void trigger_softirq(void *data)
+{
+	struct softnet_data *queue = data;
+	__napi_schedule(&queue->backlog);
+	__get_cpu_var(netdev_rx_stat).received_rps++;
+}
+
+/*
+ * enqueue_to_backlog is called to queue an skb to a per CPU backlog
+ * queue (may be a remote CPU queue).
+ */
+static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
+{
+	struct softnet_data *queue;
+	unsigned long flags;
+
+	queue = &per_cpu(softnet_data, cpu);
+
+	local_irq_save(flags);
+	__get_cpu_var(netdev_rx_stat).total++;
+
+	spin_lock(&queue->input_pkt_queue.lock);
+	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
+		if (queue->input_pkt_queue.qlen) {
+enqueue:
+			__skb_queue_tail(&queue->input_pkt_queue, skb);
+			spin_unlock_irqrestore(&queue->input_pkt_queue.lock,
+			    flags);
+			return NET_RX_SUCCESS;
+		}
+
+		/* Schedule NAPI for backlog device */
+		if (napi_schedule_prep(&queue->backlog)) {
+			if (cpu != smp_processor_id()) {
+				cpu_set(cpu,
+				    __get_cpu_var(rps_remote_softirq_cpus));
+				__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+			} else
+				__napi_schedule(&queue->backlog);
+		}
+		goto enqueue;
+	}
+
+	spin_unlock(&queue->input_pkt_queue.lock);
+
+	__get_cpu_var(netdev_rx_stat).dropped++;
+	local_irq_restore(flags);
+
+	kfree_skb(skb);
+	return NET_RX_DROP;
+}

  /**
   *	netif_rx	-	post buffer to the network code
@@ -2091,8 +2236,7 @@  DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };

  int netif_rx(struct sk_buff *skb)
  {
-	struct softnet_data *queue;
-	unsigned long flags;
+	int cpu;

  	/* if netpoll wants it, pretend we never saw it */
  	if (netpoll_rx(skb))
@@ -2101,31 +2245,11 @@  int netif_rx(struct sk_buff *skb)
  	if (!skb->tstamp.tv64)
  		net_timestamp(skb);

-	/*
-	 * The code is rearranged so that the path is the most
-	 * short when CPU is congested, but is still operating.
-	 */
-	local_irq_save(flags);
-	queue = &__get_cpu_var(softnet_data);
-
-	__get_cpu_var(netdev_rx_stat).total++;
-	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
-		if (queue->input_pkt_queue.qlen) {
-enqueue:
-			__skb_queue_tail(&queue->input_pkt_queue, skb);
-			local_irq_restore(flags);
-			return NET_RX_SUCCESS;
-		}
-
-		napi_schedule(&queue->backlog);
-		goto enqueue;
-	}
+	cpu = get_rps_cpu(skb->dev, skb);
+	if (cpu < 0)
+		cpu = smp_processor_id();

-	__get_cpu_var(netdev_rx_stat).dropped++;
-	local_irq_restore(flags);
-
-	kfree_skb(skb);
-	return NET_RX_DROP;
+	return enqueue_to_backlog(skb, cpu);
  }
  EXPORT_SYMBOL(netif_rx);

@@ -2363,10 +2487,10 @@  void netif_nit_deliver(struct sk_buff *skb)
  }

  /**
- *	netif_receive_skb - process receive buffer from network
+ *	__netif_receive_skb - process receive buffer from network
   *	@skb: buffer to process
   *
- *	netif_receive_skb() is the main receive data processing function.
+ *	__netif_receive_skb() is the main receive data processing function.
   *	It always succeeds. The buffer may be dropped during processing
   *	for congestion control or by the protocol layers.
   *
@@ -2377,7 +2501,7 @@  void netif_nit_deliver(struct sk_buff *skb)
   *	NET_RX_SUCCESS: no congestion
   *	NET_RX_DROP: packet was dropped
   */
-int netif_receive_skb(struct sk_buff *skb)
+int __netif_receive_skb(struct sk_buff *skb)
  {
  	struct packet_type *ptype, *pt_prev;
  	struct net_device *orig_dev;
@@ -2475,6 +2599,16 @@  out:
  }
  EXPORT_SYMBOL(netif_receive_skb);

+int netif_receive_skb(struct sk_buff *skb)
+{
+	int cpu = get_rps_cpu(skb->dev, skb);
+
+	if (cpu < 0)
+		return __netif_receive_skb(skb);
+	else
+		return enqueue_to_backlog(skb, cpu);
+}
+
  /* Network device is going away, flush any packets still pending  */
  static void flush_backlog(void *arg)
  {
@@ -2799,16 +2933,16 @@  static int process_backlog(struct napi_struct *napi, int quota)
  	do {
  		struct sk_buff *skb;

-		local_irq_disable();
+		spin_lock_irq(&queue->input_pkt_queue.lock);
  		skb = __skb_dequeue(&queue->input_pkt_queue);
  		if (!skb) {
  			__napi_complete(napi);
-			local_irq_enable();
+			spin_unlock_irq(&queue->input_pkt_queue.lock);
  			break;
  		}
-		local_irq_enable();
+		spin_unlock_irq(&queue->input_pkt_queue.lock);

-		netif_receive_skb(skb);
+		__netif_receive_skb(skb);
  	} while (++work < quota && jiffies == start_time);

  	return work;
@@ -2897,6 +3031,22 @@  void netif_napi_del(struct napi_struct *napi)
  }
  EXPORT_SYMBOL(netif_napi_del);

+/*
+ * net_rps_action sends any pending IPI's for rps.  This is only called from
+ * softirq and interrupts must be enabled.
+ */
+static void net_rps_action(void)
+{
+	int cpu;
+
+	/* Send pending IPI's to kick RPS processing on remote cpus. */
+	for_each_cpu_mask_nr(cpu, __get_cpu_var(rps_remote_softirq_cpus)) {
+		struct softnet_data *queue = &per_cpu(softnet_data, cpu);
+		cpu_clear(cpu, __get_cpu_var(rps_remote_softirq_cpus));
+		if (cpu_online(cpu))
+			__smp_call_function_single(cpu, &queue->csd, 0);
+	}
+}

  static void net_rx_action(struct softirq_action *h)
  {
@@ -2968,6 +3118,8 @@  static void net_rx_action(struct softirq_action *h)
  out:
  	local_irq_enable();

+	net_rps_action();
+
  #ifdef CONFIG_NET_DMA
  	/*
  	 * There may not be any more sk_buffs coming right now, so push
@@ -3212,10 +3364,10 @@  static int softnet_seq_show(struct seq_file *seq, void *v)
  {
  	struct netif_rx_stats *s = v;

-	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
+	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
  		   s->total, s->dropped, s->time_squeeze, 0,
  		   0, 0, 0, 0, /* was fastroute */
-		   s->cpu_collision);
+		   s->cpu_collision, s->received_rps);
  	return 0;
  }

@@ -5341,6 +5493,8 @@  void free_netdev(struct net_device *dev)
  	/* Flush device addresses */
  	dev_addr_flush(dev);

+	kfree(dev->dev_rps_maps);
+
  	list_for_each_entry_safe(p, n, &dev->napi_list, dev_list)
  		netif_napi_del(p);

@@ -5793,12 +5947,20 @@  static int __init net_dev_init(void)
  		queue->completion_queue = NULL;
  		INIT_LIST_HEAD(&queue->poll_list);

+		queue->csd.func = trigger_softirq;
+		queue->csd.info = queue;
+		queue->csd.flags = 0;
+
  		queue->backlog.poll = process_backlog;
  		queue->backlog.weight = weight_p;
  		queue->backlog.gro_list = NULL;
  		queue->backlog.gro_count = 0;
  	}

+	rps_cpus_in_map = num_possible_cpus() < MAX_RPS_CPUS ?
+	    num_possible_cpus() : MAX_RPS_CPUS;
+	rps_map_size = sizeof(struct rps_map) + (rps_cpus_in_map * sizeof(u16));
+
  	dev_boot_phase = 0;

  	/* The loopback device is special if any other network devices
@@ -5831,7 +5993,7 @@  subsys_initcall(net_dev_init);

  static int __init initialize_hashrnd(void)
  {
-	get_random_bytes(&skb_tx_hashrnd, sizeof(skb_tx_hashrnd));
+	get_random_bytes(&hashrnd, sizeof(hashrnd));
  	return 0;
  }

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 157645c..a390c07 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -18,6 +18,9 @@ 
  #include <linux/wireless.h>
  #include <net/wext.h>

+#include <linux/string.h>
+#include <linux/ctype.h>
+
  #include "net-sysfs.h"

  #ifdef CONFIG_SYSFS
@@ -253,6 +256,134 @@  static ssize_t store_tx_queue_len(struct device *dev,
  	return netdev_store(dev, attr, buf, len, change_tx_queue_len);
  }

+static char *get_token(const char **cp, size_t *len)
+{
+	const char *bp = *cp;
+	char *start;
+
+	while (isspace(*bp))
+		bp++;
+
+	start = (char *)bp;
+	while (!isspace(*bp) && *bp != '\0')
+		bp++;
+
+	if (start != bp)
+		*len = bp - start;
+	else
+		start = NULL;
+
+	*cp = bp;
+	return start;
+}
+
+static void dev_map_release(struct rcu_head *rcu)
+{
+	struct dev_rps_maps *drmap =
+	    container_of(rcu, struct dev_rps_maps, rcu);
+
+	kfree(drmap);
+}
+
+static ssize_t store_rps_cpus(struct device *dev,
+    struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct net_device *net = to_net_dev(dev);
+	struct napi_struct *napi;
+	cpumask_t mask;
+	int err, cpu, index, i;
+	int cnt = 0;
+	char *token;
+	const char *cp = buf;
+	size_t tlen;
+	struct dev_rps_maps *drmap, *old_drmap;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	cnt = 0;
+	list_for_each_entry(napi, &net->napi_list, dev_list)
+		cnt++;
+	if (cnt == 0)
+		cnt = 1; /* For devices with no napi instances */
+
+	drmap = kzalloc(sizeof(struct dev_rps_maps) +
+	    rps_map_size * cnt, GFP_KERNEL);
+	if (!drmap)
+		return -ENOMEM;
+
+	drmap->num_maps = cnt;
+
+	cp = buf;
+	for (index = 0; index < cnt &&
+	   (token = get_token(&cp, &tlen)); index++) {
+		struct rps_map *map = (struct rps_map *)
+		    ((void *)drmap->maps + (rps_map_size * index));
+		err = bitmap_parse(token, tlen, cpumask_bits(&mask),
+		    nr_cpumask_bits);
+
+		if (err) {
+			kfree(drmap);
+			return err;
+		}
+
+		cpus_and(mask, mask, cpu_online_map);
+		i = 0;
+		for_each_cpu_mask(cpu, mask) {
+			if (i >= rps_cpus_in_map)
+				break;
+			map->map[i++] =  cpu;
+		}
+		map->len = i;
+	}
+
+	rtnl_lock();
+	old_drmap = net->dev_rps_maps;
+	rcu_assign_pointer(net->dev_rps_maps, drmap);
+	rtnl_unlock();
+
+	if (old_drmap)
+		call_rcu(&old_drmap->rcu, dev_map_release);
+
+	return len;
+}
+
+static ssize_t show_rps_cpus(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	struct net_device *net = to_net_dev(dev);
+	size_t len = 0;
+	cpumask_t mask;
+	int i, j;
+	struct dev_rps_maps *drmap;
+
+	rcu_read_lock();
+	drmap = rcu_dereference(net->dev_rps_maps);
+
+	if (drmap) {
+		for (j = 0; j < drmap->num_maps; j++) {
+			struct rps_map *map = (struct rps_map *)
+			    ((void *)drmap->maps + (rps_map_size * j));
+			cpus_clear(mask);
+			for (i = 0; i < map->len; i++)
+				cpu_set(map->map[i], mask);
+
+			len += cpumask_scnprintf(buf + len, PAGE_SIZE, &mask);
+			if (PAGE_SIZE - len < 3) {
+				rcu_read_unlock();
+				return -EINVAL;
+			}
+			if (j < drmap->num_maps)
+				len += sprintf(buf + len, " ");
+		}
+	}
+
+	rcu_read_unlock();
+
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+
  static ssize_t store_ifalias(struct device *dev, struct device_attribute *attr,
  			     const char *buf, size_t len)
  {
@@ -309,6 +440,7 @@  static struct device_attribute net_class_attributes[] = {
  	__ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags),
  	__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
  	       store_tx_queue_len),
+	__ATTR(rps_cpus, S_IRUGO | S_IWUSR, show_rps_cpus, store_rps_cpus),
  	{}
  };