Patchwork Software receive packet steering

Submitter Tom Herbert
Date April 8, 2009, 10:48 p.m.
Message ID <65634d660904081548g7ea3e3bfn858f2336db9a671f@mail.gmail.com>
Permalink /patch/25745/
State RFC
Delegated to: David Miller

Comments

Tom Herbert - April 8, 2009, 10:48 p.m.
This patch implements software receive side packet steering (RPS).  RPS
distributes the load of received packet processing across multiple CPUs.

Problem statement: Protocol processing done in the NAPI context for received
packets is serialized per device queue and becomes a bottleneck under high
packet load.  This substantially limits pps that can be achieved on a single
queue NIC and provides no scaling with multiple cores.

This solution queues packets early on in the receive path on the backlog queues
of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
performed on packets in parallel.   For each device a mask of CPUs is set
to indicate the CPUs that can process packets for the device. A CPU is selected
on a per packet basis by hashing contents of the packet header (the TCP or UDP
4-tuple) and using the result to index into the CPU mask.  The IPI mechanism
is used to raise networking receive softirqs between CPUs.  This effectively
emulates in software what a multi-queue NIC can provide, but is generic
requiring no device support.
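
A minimal user-space sketch of the selection step described above, assuming the flow hash has already been computed. It is not the patch's code: the 64-bit mask type and the helper name rps_pick_cpu are hypothetical, chosen only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: pick the (hash % weight)-th set bit of a CPU mask.
 * Returns -1 for an empty mask so the caller can fall back to the
 * current CPU. */
static int rps_pick_cpu(uint64_t cpu_mask, uint32_t hash)
{
	uint32_t weight = 0;
	uint32_t index;
	int cpu;

	/* Count the CPUs enabled in the mask. */
	for (cpu = 0; cpu < 64; cpu++)
		if (cpu_mask & (UINT64_C(1) << cpu))
			weight++;

	if (weight == 0)
		return -1;

	/* Walk to the index-th enabled CPU. */
	index = hash % weight;
	for (cpu = 0; cpu < 64; cpu++) {
		if (!(cpu_mask & (UINT64_C(1) << cpu)))
			continue;
		if (index == 0)
			return cpu;
		index--;
	}
	return -1; /* not reached when weight > 0 */
}
```

Because the same hash always yields the same index, all packets of one flow land on the same CPU for as long as the mask is unchanged.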

The CPU mask is set on a per device basis in the sysfs variable
/sys/class/net/<device>/soft_rps_cpus.  This is a canonical bit map.
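
The message does not spell out the bitmap format. Assuming the file accepts the hexadecimal CPU bitmap notation used by other kernel CPU masks, a value could be built as in the following sketch. The helper names cpus_to_mask and format_cpu_mask are hypothetical, and only CPUs 0-31 are handled.

```c
#include <stdint.h>
#include <stdio.h>

/* Build a bitmask with one bit set per CPU id in cpus[].
 * Ids outside 0..31 are ignored in this sketch. */
static uint32_t cpus_to_mask(const int *cpus, int n)
{
	uint32_t mask = 0;
	int i;

	for (i = 0; i < n; i++)
		if (cpus[i] >= 0 && cpus[i] < 32)
			mask |= UINT32_C(1) << cpus[i];
	return mask;
}

/* Format the mask as lowercase hex (assumed file format). */
static int format_cpu_mask(uint32_t mask, char *buf, size_t len)
{
	return snprintf(buf, len, "%x", (unsigned int)mask);
}
```

Under that assumption, enabling CPUs 0 through 3 would mean writing the string "f" to the file.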

Generally, we have found this technique increases pps capabilities of a single
queue device with good CPU utilization.  Optimal settings for the CPU mask
seem to depend on architecture and cache hierarchy.  Below are some results
running 700 instances of netperf TCP_RR test with 1 byte req. and resp.
Results show cumulative transaction rate and system CPU utilization.

tg3 on 8 core Intel
   Without RPS: 90K tps at 34% CPU
   With RPS:    285K tps at 70% CPU

e1000 on 8 core Intel
   Without RPS: 90K tps at 34% CPU
   With RPS:    292K tps at 66% CPU

forcedeth on 16 core AMD
   Without RPS: 117K tps at 10% CPU
   With RPS:    327K tps at 29% CPU

bnx2x on 16 core AMD
   Single queue without RPS:        139K tps at 17% CPU
   Single queue with RPS:           352K tps at 30% CPU
   Multi queue (1 queue per CPU)    204K tps at 12% CPU

We have been running a variant of this patch on production servers for a while
with good results.  In some of our more networking intensive applications we
have seen 30-50% gains in end application performance.

Tom

Signed-off-by: Tom Herbert <therbert@google.com>
---

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
stephen hemminger - April 8, 2009, 11:08 p.m.
On Wed, 8 Apr 2009 15:48:12 -0700
Tom Herbert <therbert@google.com> wrote:

> 	if (skb->protocol == __constant_htons(ETH_P_IP)) {
> +		struct iphdr *iph = (struct iphdr *)skb->data;
> +		__be16 *layer4hdr = (__be16 *)((u32 *)iph + iph->ihl);
> +
> +		hash = 0;
> +		if (!(iph->frag_off &
> +		      __constant_htons(IP_MF|IP_OFFSET)) &&
> +		    ((iph->protocol == IPPROTO_TCP) ||
> +		     (iph->protocol == IPPROTO_UDP)))
> +			hash = ntohs(*layer4hdr ^ *(layer4hdr + 1));
> +
> +		hash ^= (ntohl(iph->saddr ^ iph->daddr)) & 0xffff;
> +		goto got_hash;

The hash should be the same as the existing Tx hash?

What about using hardware RSS values?
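
For readers following the quoted hunk, the folding it performs can be reproduced in user space roughly as below. This is a sketch, not the patch itself: struct flow_key and rps_fold_hash are hypothetical names, and the has_ports flag stands in for the fragment and protocol checks in the quoted code.

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Hypothetical flow key; all fields are in network byte order. */
struct flow_key {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t sport;
	uint16_t dport;
};

/* Same folding as the quoted hunk: XOR the two ports, then XOR in the
 * low 16 bits of the XORed addresses. has_ports is 0 for fragments and
 * for protocols other than TCP/UDP. */
static unsigned int rps_fold_hash(const struct flow_key *k, int has_ports)
{
	unsigned int hash = 0;

	if (has_ports)
		hash = ntohs((uint16_t)(k->sport ^ k->dport));

	hash ^= ntohl(k->saddr ^ k->daddr) & 0xffff;
	return hash;
}
```

Since only XOR is used, swapping source and destination gives the same value, and the result never exceeds 16 bits.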
stephen hemminger - April 8, 2009, 11:09 p.m.
On Wed, 8 Apr 2009 15:48:12 -0700
Tom Herbert <therbert@google.com> wrote:

> -extern int		netif_receive_skb(struct sk_buff *skb);
> +extern int            __netif_receive_skb(struct sk_buff *skb);
> +
> +static inline int netif_receive_skb(struct sk_buff *skb)
> +{
> +#ifdef CONFIG_NET_SOFTRPS
> +	return netif_rx(skb);
> +#else
> +	return __netif_receive_skb(skb);
> +#endif
> +}

Ugh, this forces all devices receiving back into a single backlog
queue.
David Miller - April 8, 2009, 11:15 p.m.
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Wed, 8 Apr 2009 16:09:48 -0700

> On Wed, 8 Apr 2009 15:48:12 -0700
> Tom Herbert <therbert@google.com> wrote:
> 
>> -extern int		netif_receive_skb(struct sk_buff *skb);
>> +extern int            __netif_receive_skb(struct sk_buff *skb);
>> +
>> +static inline int netif_receive_skb(struct sk_buff *skb)
>> +{
>> +#ifdef CONFIG_NET_SOFTRPS
>> +	return netif_rx(skb);
>> +#else
>> +	return __netif_receive_skb(skb);
>> +#endif
>> +}
> 
> Ugh, this forces all devices receiving back into a single backlog
> queue.

Yes, it basically turns off NAPI.

This patch seems to be throwing the baby out with the
bath water.
Tom Herbert - April 9, 2009, 4:43 p.m.
>>> -extern int          netif_receive_skb(struct sk_buff *skb);
>>> +extern int            __netif_receive_skb(struct sk_buff *skb);
>>> +
>>> +static inline int netif_receive_skb(struct sk_buff *skb)
>>> +{
>>> +#ifdef CONFIG_NET_SOFTRPS
>>> +    return netif_rx(skb);
>>> +#else
>>> +    return __netif_receive_skb(skb);
>>> +#endif
>>> +}
>>
>> Ugh, this forces all devices receiving back into a single backlog
>> queue.
>
> Yes, it basically turns off NAPI.
>

NAPI is still useful, but it does take a higher packet load before
polling kicks in.  I believe this is similarly true for HW multi
queue, and could actually be worse depending on the number of queues
traffic is being split across (in my bnx2x experiment 16 core AMD with
16 queues, I was seeing around 300K interrupts per second, no benefit
from NAPI).

The bimodal behavior between polling and non-polling states does give
us fits.  I looked at the parked mode idea, but the latency hit seems
too high.  We've considered holding the interface in polling state for
longer periods of time, maybe this could trade off CPU cycles (on the
core taking interrupts) for lower latency and higher throughput.
Ben Hutchings - April 9, 2009, 6:23 p.m.
On Thu, 2009-04-09 at 09:43 -0700, Tom Herbert wrote:
> >>> -extern int          netif_receive_skb(struct sk_buff *skb);
> >>> +extern int            __netif_receive_skb(struct sk_buff *skb);
> >>> +
> >>> +static inline int netif_receive_skb(struct sk_buff *skb)
> >>> +{
> >>> +#ifdef CONFIG_NET_SOFTRPS
> >>> +    return netif_rx(skb);
> >>> +#else
> >>> +    return __netif_receive_skb(skb);
> >>> +#endif
> >>> +}
> >>
> >> Ugh, this forces all devices receiving back into a single backlog
> >> queue.
> >
> > Yes, it basically turns off NAPI.
> >
> 
> NAPI is still useful, but it does take a higher packet load before
> polling kicks in.  I believe this is similarly true for HW multi
> queue, and could actually be worse depending on the number of queues
> traffic is being split across (in my bnx2x experiment 16 core AMD with
> 16 queues, I was seeing around 300K interrupts per second, no benefit
> from NAPI).
[...]

Have you tried using fewer than 16 queues?  We found using every core in
a multi-core package to be a waste of cycles.

Ben.
David Miller - April 9, 2009, 9:17 p.m.
From: Tom Herbert <therbert@google.com>
Date: Thu, 9 Apr 2009 09:43:07 -0700

> The bimodal behavior between polling and non-polling states does give
> us fits.  I looked at the parked mode idea, but the latency hit seems
> too high.  We've considered holding the interface in polling state for
> longer periods of time, maybe this could trade off CPU cycles (on the
> core taking interrupts) for lower latency and higher throughput.

The sweet spot is usually obtained by having moderate HW interrupt
mitigation settings.  Unfortunately, not all drivers do this, and not all
have been well tuned.  tg3 is one driver that does do
this correctly.

I would imagine that a non-trivial swath of the issues you guys run
into are actually driver related.  It took us a while to get the tg3
HW interrupt mitigations to play just-right with NAPI.

And we were able to get it right because on a particular system the
NAPI transition was incredibly expensive (some big NUMA SGI box)
so all of the effects were pronounced.
Andi Kleen - April 20, 2009, 10:32 a.m.
Tom Herbert <therbert@google.com> writes:

> +static int netif_cpu_for_rps(struct net_device *dev, struct sk_buff *skb)
> +{
> +	cpumask_t mask;
> +	unsigned int hash;
> +	int cpu, count = 0;
> +
> +	cpus_and(mask, dev->soft_rps_cpus, cpu_online_map);
> +	if (cpus_empty(mask))
> +		return smp_processor_id();

There's a race here with CPU hotunplug I think. When a CPU is hotunplugged
in parallel you can still push packets to it even though they are not
drained. You probably need some kind of drain callback in a CPU hotunplug
notifier that eats all packets left over.

> +got_hash:
> +	hash %= cpus_weight_nr(mask);

That looks rather heavyweight even on modern CPUs. I bet it's 40-50+ cycles
alone for the hweight and the division. Surely that can be done better?

Also I suspect some kind of runtime switch for this would be useful.

Also the manual set up of the receive mask seems really clumsy. Couldn't
you set that up dynamically based on where processes executing recvmsg()
are running?

-Andi
David Miller - April 20, 2009, 10:46 a.m.
From: Andi Kleen <andi@firstfloor.org>
Date: Mon, 20 Apr 2009 12:32:29 +0200

> Tom Herbert <therbert@google.com> writes:
> 
>> +got_hash:
>> +	hash %= cpus_weight_nr(mask);
> 
> That looks rather heavyweight even on modern CPUs. I bet it's 40-50+ cycles
> alone for the hweight and the division. Surely that can be done better?

The standard way to do this is to compute a 32-bit jenkins
hash, and do a 64-bit multiply of this value with a suitable
reciprocal.

This is what skb_tx_hash() is doing in net/core/dev.c
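
A minimal sketch of the multiply-and-shift step described above. The helper name scale_hash is hypothetical and the jenkins hash itself is not reproduced; the input is simply assumed to be a well-mixed 32-bit value.

```c
#include <stdint.h>

/* Map a 32-bit hash onto the range [0, n) with one 64-bit multiply and
 * one shift, avoiding the division that '%' implies. */
static uint32_t scale_hash(uint32_t hash, uint32_t n)
{
	return (uint32_t)(((uint64_t)hash * n) >> 32);
}
```

The result depends on the high bits of the hash, so the input needs to be well distributed across all 32 bits for the buckets to be used evenly.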
Tom Herbert - April 21, 2009, 3:26 a.m.
On Mon, Apr 20, 2009 at 3:32 AM, Andi Kleen <andi@firstfloor.org> wrote:
>
> Tom Herbert <therbert@google.com> writes:
>
> > +static int netif_cpu_for_rps(struct net_device *dev, struct sk_buff *skb)
> > +{
> > +     cpumask_t mask;
> > +     unsigned int hash;
> > +     int cpu, count = 0;
> > +
> > +     cpus_and(mask, dev->soft_rps_cpus, cpu_online_map);
> > +     if (cpus_empty(mask))
> > +             return smp_processor_id();
>
> There's a race here with CPU hotunplug I think. When a CPU is hotunplugged
> in parallel you can still push packets to it even though they are not
> drained. You probably need some kind of drain callback in a CPU hotunplug
> notifier that eats all packets left over.
>
We will look at that, the hotplug support may very well be lacking in the patch.

> > +got_hash:
> > +     hash %= cpus_weight_nr(mask);
>
> That looks rather heavyweight even on modern CPUs. I bet it's 40-50+ cycles
> alone for the hweight and the division. Surely that can be done better?
>
Agreed, I will try to pull in the RX hash from Dave Miller's remote
softirq patch.

> Also I suspect some kind of runtime switch for this would be useful.
>
> Also the manual set up of the receive mask seems really clumsy. Couldn't
> you set that up dynamically based on where processes executing recvmsg()
> are running?
>
We have done exactly that.  It works very well in many cases
(application + platform combinations), but I haven't found it to be
better than doing the hash in all cases.  I could provide the patch,
but it might be more of a follow-up patch to this base one.

Thanks,
Tom
Eric Dumazet - April 21, 2009, 9:48 a.m.
Tom Herbert wrote:
> On Mon, Apr 20, 2009 at 3:32 AM, Andi Kleen <andi@firstfloor.org> wrote:
>> Tom Herbert <therbert@google.com> writes:
>>
>>> +static int netif_cpu_for_rps(struct net_device *dev, struct sk_buff *skb)
>>> +{
>>> +     cpumask_t mask;
>>> +     unsigned int hash;
>>> +     int cpu, count = 0;
>>> +
>>> +     cpus_and(mask, dev->soft_rps_cpus, cpu_online_map);
>>> +     if (cpus_empty(mask))
>>> +             return smp_processor_id();
>> There's a race here with CPU hotunplug I think. When a CPU is hotunplugged
>> in parallel you can still push packets to it even though they are not
>> drained. You probably need some kind of drain callback in a CPU hotunplug
>> notifier that eats all packets left over.
>>
> We will look at that, the hotplug support may very well be lacking in the patch.
> 
>>> +got_hash:
>>> +     hash %= cpus_weight_nr(mask);
>> That looks rather heavyweight even on modern CPUs. I bet it's 40-50+ cycles
>> alone for the hweight and the division. Surely that can be done better?
>>
> Agreed, I will try to pull in the RX hash from Dave Miller's remote
> softirq patch.
> 
>> Also I suspect some kind of runtime switch for this would be useful.
>>
>> Also the manual set up of the receive mask seems really clumsy. Couldn't
>> you set that up dynamically based on where processes executing recvmsg()
>> are running?
>>
> We have done exactly that.  It works very well in many cases
> (application + platform combinations), but I haven't found it to be
> better than doing the hash in all cases.  I could provide the patch,
> but it might be more of a follow-up patch to this base one.

Hello Tom

I was thinking about your patch (and David's one), and thought it could be
possible to spread packets to other cpus only if current one is under stress.

A possible metric would be to test if softirq is handled by ksoftirqd
(stress situation) or not.

Under moderate load, we could have one active cpu (and fewer cache line
transfers), keeping good latencies.

I tried an alternative approach to solve the multicast problem raised some time ago,
but still have one cpu handling one device. Only wakeups were deferred to a
workqueue (and possibly another cpu) if running from ksoftirqd only.
Patch not yet ready for review, but based on a previous patch that was more
intrusive (touching kernel/softirq.c)

Under stress, your idea permits using more cpus for a fast NIC and getting better
throughput. It's more generic.


stephen hemminger - April 21, 2009, 3:46 p.m.
On Tue, 21 Apr 2009 11:48:43 +0200
Eric Dumazet <dada1@cosmosbay.com> wrote:

> Tom Herbert wrote:
> > On Mon, Apr 20, 2009 at 3:32 AM, Andi Kleen <andi@firstfloor.org> wrote:
> >> Tom Herbert <therbert@google.com> writes:
> >>
> >>> +static int netif_cpu_for_rps(struct net_device *dev, struct sk_buff *skb)
> >>> +{
> >>> +     cpumask_t mask;
> >>> +     unsigned int hash;
> >>> +     int cpu, count = 0;
> >>> +
> >>> +     cpus_and(mask, dev->soft_rps_cpus, cpu_online_map);
> >>> +     if (cpus_empty(mask))
> >>> +             return smp_processor_id();
> >> There's a race here with CPU hotunplug I think. When a CPU is hotunplugged
> >> in parallel you can still push packets to it even though they are not
> >> drained. You probably need some kind of drain callback in a CPU hotunplug
> >> notifier that eats all packets left over.
> >>
> > We will look at that, the hotplug support may very well be lacking in the patch.
> > 
> >>> +got_hash:
> >>> +     hash %= cpus_weight_nr(mask);
> >> That looks rather heavyweight even on modern CPUs. I bet it's 40-50+ cycles
> >> alone for the hweight and the division. Surely that can be done better?
> >>
> > Agreed, I will try to pull in the RX hash from Dave Miller's remote
> > softirq patch.
> > 
> >> Also I suspect some kind of runtime switch for this would be useful.
> >>
> >> Also the manual set up of the receive mask seems really clumsy. Couldn't
> >> you set that up dynamically based on where processes executing recvmsg()
> >> are running?
> >>
> > We have done exactly that.  It works very well in many cases
> > (application + platform combinations), but I haven't found it to be
> > better than doing the hash in all cases.  I could provide the patch,
> > but it might be more of a follow-up patch to this base one.
> 
> Hello Tom
> 
> I was thinking about your patch (and David's one), and thought it could be
> possible to spread packets to other cpus only if current one is under stress.
> 
> A possible metric would be to test if softirq is handled by ksoftirqd
> (stress situation) or not.
> 
> Under moderate load, we could have one active cpu (and fewer cache line
> transfers), keeping good latencies.
> 
> I tried an alternative approach to solve the multicast problem raised some time ago,
> but still have one cpu handling one device. Only wakeups were deferred to a
> workqueue (and possibly another cpu) if running from ksoftirqd only.
> Patch not yet ready for review, but based on a previous patch that was more
> intrusive (touching kernel/softirq.c)
> 
> Under stress, your idea permits using more cpus for a fast NIC and getting better
> throughput. It's more generic.

I would like to see some way to have multiple CPUs pulling packets and adapting
the number of CPUs being used based on load. Basically, turn all devices into
receive multiqueue. The mapping could be adjusted by user level (see irqbalancer).
Tom Herbert - April 21, 2009, 6:52 p.m.
>> Hello Tom
>>
>> I was thinking about your patch (and David's one), and thought it could be
>> possible to spread packets to other cpus only if current one is under stress.
>>
>> A possible metric would be to test if softirq is handled by ksoftirqd
>> (stress situation) or not.
>>
>> Under moderate load, we could have one active cpu (and fewer cache line
>> transfers), keeping good latencies.
>>
>> I tried an alternative approach to solve the multicast problem raised some time ago,
>> but still have one cpu handling one device. Only wakeups were deferred to a
>> workqueue (and possibly another cpu) if running from ksoftirqd only.
>> Patch not yet ready for review, but based on a previous patch that was more
>> intrusive (touching kernel/softirq.c)
>>
>> Under stress, your idea permits using more cpus for a fast NIC and getting better
>> throughput. It's more generic.
>
> I would like to see some way to have multiple CPUs pulling packets and adapting
> the number of CPUs being used based on load. Basically, turn all devices into
> receive multiqueue. The mapping could be adjusted by user level (see irqbalancer).
>

That is possible, and I don't think the design of our patch would
preclude it, but I am worried that each time the mapping from a
connection to a CPU changes this could cause out-of-order packets.
I suppose this is a similar problem to changing the RSS hash mappings in
a device.

Tom
David Miller - April 22, 2009, 9:21 a.m.
From: Tom Herbert <therbert@google.com>
Date: Tue, 21 Apr 2009 11:52:07 -0700

> That is possible, and I don't think the design of our patch would
> preclude it, but I am worried that each time the mapping from a
> connection to a CPU changes this could cause out-of-order
> packets.  I suppose this is a similar problem to changing the RSS hash
> mappings in a device.

Yes, out of order packet processing is a serious issue.

There are some things I've been brainstorming about.

One thought I keep coming back to is the hack the block layer
is using right now.  It remembers which CPU a block I/O request
comes in on, and it makes sure the completion runs on that
cpu too.

We could remember the cpu that the last socket level operation
occurred upon, and use that as a target for packets.  This requires a
bit of work.

First we'd need some kind of pre-demux at netif_receive_skb()
time to look up the cpu target, and reference this blob from
the socket somehow, and keep it up to date at various specific
locations (read/write/poll, whatever...).

Or we could pre-demux the real socket.  That could be exciting.

But then we come back to the cpu number changing issue.  There is a
cool way to handle this, because it seems that we can just keep
queueing to the previous cpu and it can check the socket cpu cookie.
If that changes, the old target can push the rest of its queue to
that cpu and then update the cpu target blob.

Anyways, just some ideas.
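
The hand-off idea above can be sketched in user space as follows. Everything here is hypothetical and simplified: the structure and function names are invented for illustration, each backlog is assumed to hold packets of a single flow, and locking and inter-CPU signalling are left out entirely.

```c
#include <stddef.h>

#define NCPUS 4
#define QLEN  16

/* Hypothetical per-flow steering state (the "cpu target blob"). */
struct flow_steer {
	int target_cpu; /* CPU that incoming packets are queued to */
	int socket_cpu; /* CPU of the last socket-level operation (cookie) */
};

/* Hypothetical per-CPU backlog holding packet sequence numbers. */
struct backlog {
	int pkts[QLEN];
	size_t len;
};

static struct backlog backlogs[NCPUS];

static int enqueue(int cpu, int seq)
{
	struct backlog *q = &backlogs[cpu];

	if (q->len == QLEN)
		return -1; /* queue full, packet dropped */
	q->pkts[q->len++] = seq;
	return 0;
}

/* Receive path: always queue to the current target, even if the socket
 * has already moved, so packets cannot overtake each other. */
static int steer_packet(struct flow_steer *f, int seq)
{
	return enqueue(f->target_cpu, seq);
}

/* Runs on the old target CPU: if the cookie changed, push the pending
 * packets to the new CPU in order, then switch the target. */
static void migrate_if_needed(struct flow_steer *f)
{
	struct backlog *old = &backlogs[f->target_cpu];
	size_t i;

	if (f->socket_cpu == f->target_cpu)
		return;

	for (i = 0; i < old->len; i++)
		(void)enqueue(f->socket_cpu, old->pkts[i]);
	old->len = 0;
	f->target_cpu = f->socket_cpu;
}
```

The ordering guarantee in this sketch comes from updating target_cpu only after the old queue has been drained into the new one.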
Martin Josefsson - April 22, 2009, 2:33 p.m.
On Tue, 21 Apr 2009, Stephen Hemminger wrote:

> I would like to see some way to have multiple CPUs pulling packets and adapting
> the number of CPUs being used based on load. Basically, turn all devices into
> receive multiqueue. The mapping could be adjusted by user level (see irqbalancer).

I've been toying with the irqbalancer idea as well.
Set the number of software "queues" high and have a mapping table of 
queue->cpu and then let irqbalanced which has knowledge of cpu cache 
hierarchy etc balance the load by changing the mapping table as it sees
fit.

This would make it fairly fine grained so that it could work reasonably 
well even if you have very few but very performance demanding clients.
Having "collisions" in the queues in that situation isn't very good.
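
A rough sketch of such a mapping table, with hypothetical names and an assumed queue count of 256. It only shows the indirection (hash to software queue to CPU) and does nothing about the packet reordering that can follow when the mapping of a live flow changes.

```c
#include <stdint.h>

#define NUM_SOFT_QUEUES 256 /* assumed; deliberately larger than the CPU count */

/* queue -> CPU mapping that a user-space balancer could rewrite. */
static uint8_t queue_to_cpu[NUM_SOFT_QUEUES];

/* Spread the queues round-robin across ncpus as a starting point. */
static void map_init(unsigned int ncpus)
{
	unsigned int q;

	for (q = 0; q < NUM_SOFT_QUEUES; q++)
		queue_to_cpu[q] = (uint8_t)(q % ncpus);
}

/* Flow hash -> software queue -> CPU. */
static unsigned int steer_cpu(uint32_t hash)
{
	return queue_to_cpu[hash % NUM_SOFT_QUEUES];
}

/* Rebalance by moving one queue; only flows hashing to it are affected. */
static void map_move_queue(unsigned int queue, unsigned int cpu)
{
	queue_to_cpu[queue % NUM_SOFT_QUEUES] = (uint8_t)cpu;
}
```

With many more queues than CPUs, each move shifts only a small share of the flows, which is what makes the balancing fine grained.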

/Martin
Tom Herbert - April 22, 2009, 3:46 p.m.
>
> There are some things I've been brainstorming about.
>
> One thought I keep coming back to is the hack the block layer
> is using right now.  It remembers which CPU a block I/O request
> comes in on, and it makes sure the completion runs on that
> cpu too.
>
> We could remember the cpu that the last socket level operation
> occurred upon, and use that as a target for packets.  This requires a
> bit of work.
>
> First we'd need some kind of pre-demux at netif_receive_skb()
> time to look up the cpu target, and reference this blob from
> the socket somehow, and keep it up to date at various specific
> locations (read/write/poll, whatever...).
>
> Or we could pre-demux the real socket.  That could be exciting.
>

We are doing the pre-demux, and it works well.  The additional benefit
is that the hash result or the sk itself could be cached in the
skb for the upper layer protocol.

One caveat though is that if the device provides a hash, i.e. Toeplitz,
we really want to use that in the CPU look-up to avoid the cache miss
on the header.  We considered using the Toeplitz hash as the inet
hash, but it's incredibly expensive to do in software: about 20x
slower than inet_ehashfn is the best we could do.  Our (naive) solution is
to maintain a big array of CPU indices where we write the CPU ids from
recvmsg and sendmsg, and then read it using the hash provided on
incoming packets.  This is lockless and allows very fast operations,
but doesn't take collisions into account (probably allows a slim
possibility of thrashing a connection between CPUs).  The other option
we considered was maintaining a secondary cnx lookup table based on
Toeplitz hash, but that seemed to be rather involved.
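
A simplified sketch of the array-of-CPU-indices approach described above. The table size, the NO_CPU marker, and all function names are assumptions made for illustration; a real lockless version would also need to consider atomicity of the loads and stores, which is ignored here.

```c
#include <stdint.h>

#define FLOW_TABLE_SIZE 4096 /* assumed size; must be a power of two */
#define FLOW_TABLE_MASK (FLOW_TABLE_SIZE - 1)
#define NO_CPU 0xffff        /* marks a slot that was never written */

static uint16_t flow_cpu[FLOW_TABLE_SIZE];

static void flow_table_init(void)
{
	unsigned int i;

	for (i = 0; i < FLOW_TABLE_SIZE; i++)
		flow_cpu[i] = NO_CPU;
}

/* Called from the recvmsg/sendmsg side: remember where the
 * application last touched this flow. */
static void flow_record(uint32_t hash, uint16_t cpu)
{
	flow_cpu[hash & FLOW_TABLE_MASK] = cpu;
}

/* Called on receive with the hash supplied for the packet. */
static uint16_t flow_lookup(uint32_t hash, uint16_t fallback_cpu)
{
	uint16_t cpu = flow_cpu[hash & FLOW_TABLE_MASK];

	return cpu == NO_CPU ? fallback_cpu : cpu;
}
```

Two flows whose hashes share the same low bits overwrite each other's slot, which is the collision case mentioned above.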

> But then we come back to the cpu number changing issue.  There is a
> cool way to handle this, because it seems that we can just keep
> queueing to the previous cpu and it can check the socket cpu cookie.
> If that changes, the old target can push the rest of its queue to
> that cpu and then update the cpu target blob.
>
> Anyways, just some ideas.
>
Thanks for your thoughts.
Rick Jones - April 22, 2009, 6:49 p.m.
David Miller wrote:
> From: Tom Herbert <therbert@google.com>
> Date: Tue, 21 Apr 2009 11:52:07 -0700
> 
> 
>>That is possible, and I don't think the design of our patch would
>>preclude it, but I am worried that each time the mapping from a
>>connection to a CPU changes this could cause out-of-order
>>packets.  I suppose this is a similar problem to changing the RSS hash
>>mappings in a device.
> 
> 
> Yes, out of order packet processing is a serious issue.
> 
> There are some things I've been brainstorming about.
> 
> One thought I keep coming back to is the hack the block layer
> is using right now.  It remembers which CPU a block I/O request
> comes in on, and it makes sure the completion runs on that
> cpu too.
> 
> We could remember the cpu that the last socket level operation
> occurred upon, and use that as a target for packets.  This requires a
> bit of work.
> 
> First we'd need some kind of pre-demux at netif_receive_skb()
> time to look up the cpu target, and reference this blob from
> the socket somehow, and keep it up to date at various specific
> locations (read/write/poll, whatever...).

Does poll on the socket touch all that many cachelines, or are you thinking of it 
as being a predictor of where read/write will be called?

> 
> Or we could pre-demux the real socket.  That could be exciting.
> 
> But then we come back to the cpu number changing issue.  There is a
> cool way to handle this, because it seems that we can just keep
> queueing to the previous cpu and it can check the socket cpu cookie.
> If that changes, the old target can push the rest of its queue to
> that cpu and then update the cpu target blob.
> 
> Anyways, just some ideas.

For what it is worth, at the 5000 foot description level that is exactly what 
HP-UX 11.X does and calls TOPS (Thread Optimized Packet Scheduling).  Where the 
socket was last accessed is stashed away (in the socket/stream structure) and 
that is looked-up when the driver hands the packet up the stack.  It was done 
that way in HP-UX 11.X because we found that simply hashing the headers (what 
HP-UX 10.20 called "Inbound Packet Scheduling" or IPS) while fine for discrete 
netperf TCP_RR tests, wasn't really what one wanted when a single thread of 
execution was servicing more than one connection/flow.

The TOPS patches were added to HP-UX 11.0 ca 1998 and while there have been some 
issues (as you surmise, and others thanks to Streams being involved :) it appears 
to have worked rather well these last ten years.  So, at least in the abstract 
what is proposed above has at least a little pre-validation.  TOPS can be 
disabled/enabled via an ndd (ie sysctl) setting for those cases when the number 
of NICs (back then they were all single-queue) or now queues is a reasonable 
fraction of the number of cores and  the administrator can/wants to silo things.

rick jones
Jesper Dangaard Brouer - April 22, 2009, 8:44 p.m.
On Wed, 22 Apr 2009, David Miller wrote:

> One thought I keep coming back to is the hack the block layer
> is using right now.  It remembers which CPU a block I/O request
> comes in on, and it makes sure the completion runs on that
> cpu too.

This is also very important for routing performance.

Experience from practical 10GbE routing tests (done by Robert's team and
myself) reveals that we can only achieve (close to) 10Gbit/s routing
performance when carefully making sure that the rx-queue and tx-queue run
on the same CPU. (Not doing so really kills performance.)

Currently I'm using some patches by Jens Låås that allow userspace to
set up the rx-queue to tx-queue mapping, plus manual smp_affinity tuning.
The problem with this approach is that it requires way too much manual
tuning from userspace to achieve good performance.

I would like to see an approach with less manual tuning, as we basically
"just" need to make sure that TX completion is done on the same CPU as RX.
I would like to see some effort in this area and am willing to participate
actively.

Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
Jens Axboe - April 23, 2009, 6:58 a.m.
On Wed, Apr 22 2009, Jesper Dangaard Brouer wrote:
> On Wed, 22 Apr 2009, David Miller wrote:
>
>> One thought I keep coming back to is the hack the block layer
>> is using right now.  It remembers which CPU a block I/O request
>> comes in on, and it makes sure the completion runs on that
>> cpu too.

Hack?! :-)

It's actually nicely integrated to our existing IO completion path,
where we raise a softirq to complete the IO out of path. The only
difference now being that if you enable rq_affinity, it'll raise the
softirq potentially on a remote CPU instead of always using the local
one.

> This is also very important for routing performance.
>
> Experience from practical 10GbE routing tests (done by Robert's team and
> myself) reveals that we can only achieve (close to) 10Gbit/s routing
> performance when carefully making sure that the rx-queue and tx-queue run
> on the same CPU. (Not doing so really kills performance.)
>
> Currently I'm using some patches by Jens Låås that allow userspace to
> set up the rx-queue to tx-queue mapping, plus manual smp_affinity tuning.
> The problem with this approach is that it requires way too much manual
> tuning from userspace to achieve good performance.
>
> I would like to see an approach with less manual tuning, as we basically
> "just" need to make sure that TX completion is done on the same CPU as RX.
> I would like to see some effort in this area and am willing to participate
> actively.

I saw very nice benefits on the IO side as well!
David Miller - April 23, 2009, 7:25 a.m.
From: Jens Axboe <jens.axboe@oracle.com>
Date: Thu, 23 Apr 2009 08:58:30 +0200

> On Wed, Apr 22 2009, Jesper Dangaard Brouer wrote:
>> On Wed, 22 Apr 2009, David Miller wrote:
>>
>>> One thought I keep coming back to is the hack the block layer
>>> is using right now.  It remembers which CPU a block I/O request
>>> comes in on, and it makes sure the completion runs on that
>>> cpu too.
> 
> Hack?! :-)

I meant hack in the most positive sense :)
Jens Axboe - April 23, 2009, 7:29 a.m.
On Thu, Apr 23 2009, David Miller wrote:
> From: Jens Axboe <jens.axboe@oracle.com>
> Date: Thu, 23 Apr 2009 08:58:30 +0200
> 
> > On Wed, Apr 22 2009, Jesper Dangaard Brouer wrote:
> >> On Wed, 22 Apr 2009, David Miller wrote:
> >>
> >>> One thought I keep coming back to is the hack the block layer
> >>> is using right now.  It remembers which CPU a block I/O request
> >>> comes in on, and it makes sure the completion runs on that
> >>> cpu too.
> > 
> > Hack?! :-)
> 
> I meant hack in the most positive sense :)

Duly noted ;-)
David Miller - April 23, 2009, 7:34 a.m.
From: Martin Josefsson <gandalf@mjufs.se>
Date: Wed, 22 Apr 2009 16:33:17 +0200 (CEST)

> On Tue, 21 Apr 2009, Stephen Hemminger wrote:
> 
>> I would like to see some way to have multiple CPUs pulling packets
>> and adapting the number of CPUs being used based on load. Basically,
>> turn all devices into receive multiqueue. The mapping could be
>> adjusted by user level (see irqbalancer).
> 
> I've been toying with the irqbalancer idea as well.
> Set the number of software "queues" high and have a mapping table of
> queue->cpu and then let irqbalanced, which has knowledge of cpu cache
> hierarchy etc., balance the load by changing the mapping table as it sees
> fit.

Steering changes without any protocol-specific handling are
a total non-starter because of packet reordering.

Please understand how deeply important this is when considering
any packet steering scheme.  You absolutely cannot allow it
to happen.
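
To make the concern concrete, here is a minimal user-space sketch, assuming a hypothetical queue->cpu mapping table of the kind proposed above. None of these names come from the patch. It only shows when a table change moves an established flow to another CPU; once that happens, packets still sitting in the old CPU's backlog can be processed after newer packets queued on the new CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical steering table: software queue index -> CPU number. */
struct steer_table {
	const int *queue_to_cpu;
	size_t nqueues;
};

/* The flow hash selects a queue; the table maps that queue to a CPU. */
static int steer_cpu(const struct steer_table *t, uint32_t flow_hash)
{
	assert(t->nqueues > 0);
	return t->queue_to_cpu[flow_hash % t->nqueues];
}

/*
 * True if replacing 'before' with 'after' sends this flow to a different
 * CPU. That is the case in which in-flight packets of the flow may be
 * reordered, because the two per-CPU backlogs drain independently.
 */
static bool remap_moves_flow(const struct steer_table *before,
			     const struct steer_table *after,
			     uint32_t flow_hash)
{
	return steer_cpu(before, flow_hash) != steer_cpu(after, flow_hash);
}
```

A balancer that rewrites the table purely on load has no way to know whether a moved flow still has packets queued on the old CPU, which is why protocol-aware handling is needed before a flow may migrate.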
Jens Laas - April 23, 2009, 9:12 a.m.
(09.04.22 at 22:44) Jesper Dangaard Brouer wrote the following to David Miller:

> On Wed, 22 Apr 2009, David Miller wrote:
>
>> One thought I keep coming back to is the hack the block layer
>> is using right now.  It remembers which CPU a block I/O request
>> comes in on, and it makes sure the completion runs on that
>> cpu too.
>
> This is also very important for routing performance.
>
> Experiences from practical 10GbE routing tests (done by Robert's team and
> myself) reveal that we can only achieve (close to) 10Gbit/s routing
> performance when carefully making sure that the rx-queue and tx-queue run on
> the same CPU. (Not doing so really kills performance.)
>
> Currently I'm using some patches by Jens Låås that allow userspace to set up
> the rx-queue to tx-queue mapping, plus manual smp_affinity tuning. The
> problem with this approach is that it requires way too much manual tuning
> from userspace to achieve good performance.

We have a C program for setting the affinity correctly. Note that
"correctly" very much depends on your setup and what you want to do.
We started with a script for doing this, but it's a bit easier to implement
some heuristics in a proper program.

The patch (which implements a concept called "flowtrunks") also requires
setup from userspace (via ethtool ioctl). We don't actually use this yet in
production.

The natural way forward would be to implement a userspace program
that can tune smp_affinity and queue-mapping (maybe via flowtrunks)
together. With knowledge of the setup and user preferences, it should be
able to tune your system for you automatically.
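
As a minimal sketch of what the core of such a userspace tuner might look like (this is not the program mentioned above; the function names are made up for illustration), the helpers below build a hex CPU mask and write it to /proc/irq/<irq>/smp_affinity. The sketch assumes at most 32 CPUs so that a single hex word is enough.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format the given CPU numbers (0..31 only, an assumption of this sketch)
 * as a hex bitmask string. Returns the snprintf() result.
 */
static int format_cpu_mask(const int *cpus, size_t n, char *buf, size_t len)
{
	uint32_t mask = 0;

	for (size_t i = 0; i < n; i++) {
		assert(cpus[i] >= 0 && cpus[i] < 32);
		mask |= UINT32_C(1) << cpus[i];
	}
	return snprintf(buf, len, "%x", (unsigned int)mask);
}

/* Write a mask string to an IRQ's smp_affinity file. 0 on success, -1 on error. */
static int write_irq_affinity(int irq, const char *mask)
{
	char path[64];
	FILE *f;
	int ok;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(mask, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

The heuristics (which IRQ belongs to which queue, which CPUs share a cache) are the hard part and are left out here. Writing the file normally requires root.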

One advantage of flowtrunks (generic queue/NIC to/from flowtrunk
mapping) would be that we would not have to patch every supported NIC.
Plus we could tune the system for more than one use case (forwarding
between multi-queue NICs).
The main object of the flowtrunk patch was to try to start a discussion 
and create something concrete to help our thinking. 
This problem-space needs to be explored.

>
> I would like to see an approach with less manual tuning, as we basically
> "just" need to make sure that TX completion is done on the same CPU as RX. I
> would like to see some effort in this area and am willing to participate
> actively.

I don't see a problem with tuning from userspace. I think it will be hard
for the kernel to automatically tune all types of setups for all use cases.
Maybe I'm just lacking in imagination though.

Cheers,
Jens

>
> Cheers,
>  Jesper Brouer
>
> --
> -------------------------------------------------------------------
> MSc. Master of Computer Science
> Dept. of Computer Science, University of Copenhagen
> Author of http://www.adsl-optimizer.dk
> -------------------------------------------------------------------

-----------------------------------------------------------------------
     'In theory, there is no difference between theory and practice.
      But, in practice, there is.'
-----------------------------------------------------------------------
     Jens Låås                              Email: jens.laas@its.uu.se
     ITS                                    Phone: +46 18 471 77 03
     SWEDEN
-----------------------------------------------------------------------

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index be3ebd7..ca52116 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -758,6 +758,9 @@  struct net_device
 	void			*ax25_ptr;	/* AX.25 specific data */
 	struct wireless_dev	*ieee80211_ptr;	/* IEEE 802.11 specific data,
 						   assign before registering */
+#ifdef CONFIG_NET_SOFTRPS
+	cpumask_t		soft_rps_cpus;	/* CPU Mask for RX processing */
+#endif

 /*
  * Cache line mostly used on receive path (including eth_type_trans())
@@ -1170,6 +1173,15 @@  struct softnet_data
 	struct Qdisc		*output_queue;
 	struct sk_buff_head	input_pkt_queue;
 	struct list_head	poll_list;
+#ifdef CONFIG_NET_SOFTRPS
+	int			rps_cpu;
+	struct list_head	rps_poll_list;
+	spinlock_t		rps_poll_list_lock;
+	struct call_single_data	rps_csd;
+	unsigned long		rps_flags;
+#define RPS_SOFTIRQ_PENDING	0x1
+#define RPS_SOFTIRQ_COMPLETING	0x2
+#endif
 	struct sk_buff		*completion_queue;

 	struct napi_struct	backlog;
@@ -1177,6 +1189,32 @@  struct softnet_data

 DECLARE_PER_CPU(struct softnet_data,softnet_data);

+static inline void lock_softnet_input_queue(struct softnet_data *queue,
+    unsigned long *flags)
+{
+	local_irq_save(*flags);
+#ifdef CONFIG_NET_SOFTRPS
+	spin_lock(&queue->input_pkt_queue.lock);
+#endif
+}
+
+static inline void lock_softnet_input_queue_noflags(struct softnet_data *queue)
+{
+#ifdef CONFIG_NET_SOFTRPS
+	spin_lock(&queue->input_pkt_queue.lock);
+#endif
+}
+
+static inline void unlock_softnet_input_queue(struct softnet_data *queue,
+    unsigned long *flags)
+{
+#ifdef CONFIG_NET_SOFTRPS
+	spin_unlock(&queue->input_pkt_queue.lock);
+#endif
+	local_irq_restore(*flags);
+}
+
+
 #define HAVE_NETIF_QUEUE

 extern void __netif_schedule(struct Qdisc *q);
@@ -1416,7 +1454,17 @@  extern void dev_kfree_skb_any(struct sk_buff *skb);
 extern int		netif_rx(struct sk_buff *skb);
 extern int		netif_rx_ni(struct sk_buff *skb);
 #define HAVE_NETIF_RECEIVE_SKB 1
-extern int		netif_receive_skb(struct sk_buff *skb);
+extern int            __netif_receive_skb(struct sk_buff *skb);
+
+static inline int netif_receive_skb(struct sk_buff *skb)
+{
+#ifdef CONFIG_NET_SOFTRPS
+	return netif_rx(skb);
+#else
+	return __netif_receive_skb(skb);
+#endif
+}
+
 extern void		napi_gro_flush(struct napi_struct *napi);
 extern int		dev_gro_receive(struct napi_struct *napi,
 					struct sk_buff *skb);
diff --git a/net/Kconfig b/net/Kconfig
index ec93e7e..75bdda0 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -25,6 +25,19 @@  if NET

 menu "Networking options"

+config NET_SOFTRPS
+	bool "Software RX packet steering"
+	depends on SMP
+	help
+	  Say Y here to enable a software implementation of receive path
+	  packet steering (RPS).  RPS distributes the load of received
+	  packet processing across multiple CPUs.  Packets are scheduled
+	  to different CPUs for protocol processing in the netif_rx function.
+	  A hash is performed on fields in packet headers (the four tuple
+	  in the case of TCP), this resulting value is used to index into a
+	  mask of CPUs.  The CPU masks are set on a per device basis
+	  in the sysfs variable /sys/class/net/<device>/soft_rps_cpus.
+
 source "net/packet/Kconfig"
 source "net/unix/Kconfig"
 source "net/xfrm/Kconfig"
diff --git a/net/core/dev.c b/net/core/dev.c
index 052dd47..df0507b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1906,6 +1906,140 @@  int weight_p __read_mostly = 64;            /* old backlog weight */

 DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };

+#ifdef CONFIG_NET_SOFTRPS
+/**
+ *	netif_cpu_for_rps - return the appropriate CPU for protocol
+ *	processing of a packet when doing receive packet steering.
+ *	@dev: receiving device
+ *	@skb: buffer with packet
+ *
+ *	Fields in packet headers are hashed to be used as an index into a
+ *	per device CPU mask (IP packets).  For TCP and UDP packets
+ *	a simple hash is done on the 4-tuple, for other IP packets a hash
+ *	is done on the source and destination addresses.
+ *
+ *	Called with irq's disabled.
+ */
+static int netif_cpu_for_rps(struct net_device *dev, struct sk_buff *skb)
+{
+	cpumask_t mask;
+	unsigned int hash;
+	int cpu, count = 0;
+
+	cpus_and(mask, dev->soft_rps_cpus, cpu_online_map);
+	if (cpus_empty(mask))
+		return smp_processor_id();
+
+	if (skb->protocol == __constant_htons(ETH_P_IP)) {
+		struct iphdr *iph = (struct iphdr *)skb->data;
+		__be16 *layer4hdr = (__be16 *)((u32 *)iph + iph->ihl);
+
+		hash = 0;
+		if (!(iph->frag_off &
+		      __constant_htons(IP_MF|IP_OFFSET)) &&
+		    ((iph->protocol == IPPROTO_TCP) ||
+		     (iph->protocol == IPPROTO_UDP)))
+			hash = ntohs(*layer4hdr ^ *(layer4hdr + 1));
+
+		hash ^= (ntohl(iph->saddr ^ iph->daddr)) & 0xffff;
+		goto got_hash;
+	}
+
+	return smp_processor_id();
+
+got_hash:
+	hash %= cpus_weight_nr(mask);
+
+	for_each_cpu_mask(cpu, mask) {
+		if (count++ == hash)
+			break;
+	}
+	return cpu;
+}
+
+static DEFINE_PER_CPU(cpumask_t, rps_remote_softirq_cpus);
+
+/* Called from hardirq (IPI) context */
+static void trigger_softirq(void *data)
+{
+	struct softnet_data *queue = data;
+	set_bit(RPS_SOFTIRQ_COMPLETING, &queue->rps_flags);
+	raise_softirq(NET_RX_SOFTIRQ);
+}
+
+/**
+ * net_rx_action_rps is called from net_rx_action to do the softirq
+ * functions related to receive packet steering.
+ *
+ * Called with irq's disabled.
+ */
+static void net_rx_action_rps(struct softnet_data *queue)
+{
+	int cpu;
+
+	/* Finish remote softirq invocation for this CPU. */
+	if (test_bit(RPS_SOFTIRQ_COMPLETING, &queue->rps_flags)) {
+		clear_bit(RPS_SOFTIRQ_COMPLETING, &queue->rps_flags);
+		clear_bit(RPS_SOFTIRQ_PENDING, &queue->rps_flags);
+		smp_mb__after_clear_bit();
+	}
+
+	/* Send any pending remote softirqs, allows for coalescing */
+	for_each_cpu_mask_nr(cpu, __get_cpu_var(rps_remote_softirq_cpus)) {
+		struct softnet_data *queue = &per_cpu(softnet_data, cpu);
+		if (!test_and_set_bit(RPS_SOFTIRQ_PENDING,
+		    &queue->rps_flags))
+			__smp_call_function_single(cpu, &queue->rps_csd);
+	}
+	cpus_clear(__get_cpu_var(rps_remote_softirq_cpus));
+
+	/* Splice devices that were remotely scheduled for processing */
+	if (!list_empty(&queue->rps_poll_list)) {
+		spin_lock(&queue->rps_poll_list_lock);
+		list_splice_init(&queue->rps_poll_list, &queue->poll_list);
+		spin_unlock(&queue->rps_poll_list_lock);
+	}
+}
+
+static void net_rps_init_queue(struct softnet_data *queue, int cpu)
+{
+	INIT_LIST_HEAD(&queue->rps_poll_list);
+	spin_lock_init(&queue->rps_poll_list_lock);
+	queue->rps_cpu = cpu;
+	queue->rps_csd.func = trigger_softirq;
+	queue->rps_csd.info = queue;
+	queue->rps_csd.flags = 0;
+}
+
+#endif /* CONFIG_NET_SOFTRPS */
+
+/**
+ * schedule_backlog_napi is called to schedule backlog processing.
+ *
+ * Called with irq's disabled.
+ */
+static void schedule_backlog_napi(struct softnet_data *queue)
+{
+	if (napi_schedule_prep(&queue->backlog)) {
+#ifdef CONFIG_NET_SOFTRPS
+		if (queue->rps_cpu != smp_processor_id()) {
+			 /* Scheduling the backlog queue for a different
+			  * CPU, a remote softirq is performed accordingly.
+			  */
+			spin_lock(&queue->rps_poll_list_lock);
+			list_add_tail(&queue->backlog.poll_list,
+			    &queue->rps_poll_list);
+			spin_unlock(&queue->rps_poll_list_lock);
+
+			cpu_set(queue->rps_cpu,
+			    __get_cpu_var(rps_remote_softirq_cpus));
+			raise_softirq_irqoff(NET_RX_SOFTIRQ);
+			return;
+		}
+#endif
+		__napi_schedule(&queue->backlog);
+	}
+}

 /**
  *	netif_rx	-	post buffer to the network code
@@ -1939,23 +2073,28 @@  int netif_rx(struct sk_buff *skb)
 	 * short when CPU is congested, but is still operating.
 	 */
 	local_irq_save(flags);
-	queue = &__get_cpu_var(softnet_data);

+#ifdef CONFIG_NET_SOFTRPS
+	queue = &per_cpu(softnet_data, netif_cpu_for_rps(skb->dev, skb));
+	lock_softnet_input_queue_noflags(queue);
+#else
+	queue = &__get_cpu_var(softnet_data);
+#endif
 	__get_cpu_var(netdev_rx_stat).total++;
 	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
 		if (queue->input_pkt_queue.qlen) {
 enqueue:
 			__skb_queue_tail(&queue->input_pkt_queue, skb);
-			local_irq_restore(flags);
+			unlock_softnet_input_queue(queue, &flags);
 			return NET_RX_SUCCESS;
 		}

-		napi_schedule(&queue->backlog);
+		schedule_backlog_napi(queue);
 		goto enqueue;
 	}

 	__get_cpu_var(netdev_rx_stat).dropped++;
-	local_irq_restore(flags);
+	unlock_softnet_input_queue(queue, &flags);

 	kfree_skb(skb);
 	return NET_RX_DROP;
@@ -2192,10 +2331,10 @@  void netif_nit_deliver(struct sk_buff *skb)
 }

 /**
- *	netif_receive_skb - process receive buffer from network
+ *	__netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
  *
- *	netif_receive_skb() is the main receive data processing function.
+ *	__netif_receive_skb() is the main receive data processing function.
  *	It always succeeds. The buffer may be dropped during processing
  *	for congestion control or by the protocol layers.
  *
@@ -2206,7 +2345,7 @@  void netif_nit_deliver(struct sk_buff *skb)
  *	NET_RX_SUCCESS: no congestion
  *	NET_RX_DROP: packet was dropped
  */
-int netif_receive_skb(struct sk_buff *skb)
+int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
 	struct net_device *orig_dev;
@@ -2347,7 +2486,7 @@  static int napi_gro_complete(struct sk_buff *skb)

 out:
 	skb_shinfo(skb)->gso_size = 0;
-	return netif_receive_skb(skb);
+	return __netif_receive_skb(skb);
 }

 void napi_gro_flush(struct napi_struct *napi)
@@ -2484,7 +2623,7 @@  int napi_skb_finish(int ret, struct sk_buff *skb)

 	switch (ret) {
 	case GRO_NORMAL:
-		return netif_receive_skb(skb);
+		return __netif_receive_skb(skb);

 	case GRO_DROP:
 		err = NET_RX_DROP;
@@ -2585,7 +2724,7 @@  int napi_frags_finish(struct napi_struct *napi, struct sk_buff *skb, int ret)
 		skb->protocol = eth_type_trans(skb, napi->dev);

 		if (ret == GRO_NORMAL)
-			return netif_receive_skb(skb);
+			return __netif_receive_skb(skb);

 		skb_gro_pull(skb, -ETH_HLEN);
 		break;
@@ -2619,19 +2758,24 @@  static int process_backlog(struct napi_struct *napi, int quota)
 	int work = 0;
 	struct softnet_data *queue = &__get_cpu_var(softnet_data);
 	unsigned long start_time = jiffies;
+	unsigned long flags;

 	napi->weight = weight_p;
 	do {
 		struct sk_buff *skb;

-		local_irq_disable();
+		lock_softnet_input_queue(queue, &flags);
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
-			local_irq_enable();
-			napi_complete(napi);
+			unlock_softnet_input_queue(queue, &flags);
+			napi_gro_flush(napi);
+			lock_softnet_input_queue(queue, &flags);
+			if (skb_queue_empty(&queue->input_pkt_queue))
+				__napi_complete(napi);
+			unlock_softnet_input_queue(queue, &flags);
 			goto out;
 		}
-		local_irq_enable();
+		unlock_softnet_input_queue(queue, &flags);

 		napi_gro_receive(napi, skb);
 	} while (++work < quota && jiffies == start_time);
@@ -2728,13 +2872,18 @@  EXPORT_SYMBOL(netif_napi_del);

 static void net_rx_action(struct softirq_action *h)
 {
-	struct list_head *list = &__get_cpu_var(softnet_data).poll_list;
+	struct softnet_data *queue = &__get_cpu_var(softnet_data);
+	struct list_head *list = &queue->poll_list;
 	unsigned long time_limit = jiffies + 2;
 	int budget = netdev_budget;
 	void *have;

 	local_irq_disable();

+#ifdef CONFIG_NET_SOFTRPS
+	net_rx_action_rps(queue);
+#endif
+
 	while (!list_empty(list)) {
 		struct napi_struct *n;
 		int work, weight;
@@ -5239,6 +5388,9 @@  static int __init net_dev_init(void)
 		queue->completion_queue = NULL;
 		INIT_LIST_HEAD(&queue->poll_list);

+#ifdef CONFIG_NET_SOFTRPS
+		net_rps_init_queue(queue, i);
+#endif
 		queue->backlog.poll = process_backlog;
 		queue->backlog.weight = weight_p;
 		queue->backlog.gro_list = NULL;
@@ -5305,7 +5457,7 @@  EXPORT_SYMBOL(free_netdev);
 EXPORT_SYMBOL(netdev_boot_setup_check);
 EXPORT_SYMBOL(netdev_set_master);
 EXPORT_SYMBOL(netdev_state_change);
-EXPORT_SYMBOL(netif_receive_skb);
+EXPORT_SYMBOL(__netif_receive_skb);
 EXPORT_SYMBOL(netif_rx);
 EXPORT_SYMBOL(register_gifconf);
 EXPORT_SYMBOL(register_netdevice);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 2da59a0..b12ae88 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -211,6 +211,64 @@  static ssize_t store_tx_queue_len(struct device *dev,
 	return netdev_store(dev, attr, buf, len, change_tx_queue_len);
 }

+#ifdef CONFIG_NET_SOFTRPS
+static ssize_t netdev_store_cpumask(struct net_device *net, const char *buf,
+    size_t len, cpumask_t *maskp)
+{
+	cpumask_t new_value;
+	int err;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	err = bitmap_parse(buf, len, cpumask_bits(&new_value), nr_cpumask_bits);
+	if (err)
+		return err;
+
+	rtnl_lock();
+	if (dev_isalive(net))
+		*maskp = new_value;
+	rtnl_unlock();
+
+	return len;
+}
+
+static ssize_t netdev_show_cpumask(struct net_device *net, char *buf,
+    cpumask_t *maskp)
+{
+	size_t len;
+	cpumask_t tmp;
+
+	read_lock(&dev_base_lock);
+	if (dev_isalive(net))
+		cpus_and(tmp, *maskp, cpu_online_map);
+	else
+		cpus_clear(tmp);
+	read_unlock(&dev_base_lock);
+
+	len = cpumask_scnprintf(buf, PAGE_SIZE, &tmp);
+	if (PAGE_SIZE - len < 2)
+		return -EINVAL;
+
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+
+static ssize_t show_soft_rps_cpus(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	struct net_device *net = to_net_dev(dev);
+	return netdev_show_cpumask(net, buf, &net->soft_rps_cpus);
+}
+
+static ssize_t store_soft_rps_cpus(struct device *dev,
+    struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct net_device *net = to_net_dev(dev);
+	return netdev_store_cpumask(net, buf, len, &net->soft_rps_cpus);
+}
+#endif
+
 static ssize_t store_ifalias(struct device *dev, struct device_attribute *attr,
 			     const char *buf, size_t len)
 {
@@ -263,6 +321,10 @@  static struct device_attribute net_class_attributes[] = {
 	__ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags),
 	__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
 	       store_tx_queue_len),
+#ifdef CONFIG_NET_SOFTRPS
+	__ATTR(soft_rps_cpus, S_IRUGO | S_IWUSR, show_soft_rps_cpus,
+	       store_soft_rps_cpus),
+#endif
 	{}
 };
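
For readers following the thread, the CPU selection done by netif_cpu_for_rps() in the patch above can be approximated in user space. This is only a sketch under stated assumptions: the flow_key struct and the function names are hypothetical, the fields are taken to be already in host byte order, and the CPU mask is limited to 32 bits. The patch itself reads the fields straight from the IP header and uses a cpumask_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flow key in host byte order. */
struct flow_key {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	int has_ports;	/* nonzero for unfragmented TCP or UDP */
};

/* Number of set bits in the mask. */
static unsigned int count_bits(uint32_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * XOR the ports (when present) with the low 16 bits of the XORed
 * addresses, reduce modulo the number of CPUs in the mask, and return
 * the CPU at that position. Returns -1 for an empty mask; the patch
 * falls back to the local CPU in that case.
 */
static int rps_pick_cpu(const struct flow_key *k, uint32_t cpu_mask)
{
	unsigned int hash = 0, seen = 0;
	unsigned int weight = count_bits(cpu_mask);

	assert(k);
	if (weight == 0)
		return -1;
	if (k->has_ports)
		hash = (unsigned int)(k->sport ^ k->dport);
	hash ^= (k->saddr ^ k->daddr) & 0xffff;
	hash %= weight;

	for (int cpu = 0; cpu < 32; cpu++) {
		if (!(cpu_mask & (UINT32_C(1) << cpu)))
			continue;
		if (seen++ == hash)
			return cpu;
	}
	return -1;
}
```

Because the hash is built only from XORs of source and destination fields, both directions of a connection select the same CPU, and every packet of a flow lands on one CPU for as long as the mask stays unchanged.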
--
To unsubscribe from this list: send the line "unsubscribe netdev" in