Patchwork [net-next] rps: overflow prevention for saturated cpus

login
register
mail settings
Submitter Willem de Bruijn
Date Dec. 6, 2012, 8:36 p.m.
Message ID <1354826194-9289-1-git-send-email-willemb@google.com>
Download mbox | patch
Permalink /patch/204318/
State Changes Requested
Delegated to: David Miller
Headers show

Comments

Willem de Bruijn - Dec. 6, 2012, 8:36 p.m.
RPS and RFS balance load across cpus with flow affinity. This can
cause local bottlenecks, where a small number or single large flow
(DoS) can saturate one CPU while others are idle.

This patch maintains flow affinity in normal conditions, but
trades it for throughput when a cpu becomes saturated. Then, packets
destined to that cpu (only) are redirected to the lightest loaded cpu
in the rxqueue's rps_map. This breaks flow affinity under high load
for some flows, in favor of processing packets up to the capacity
of the complete rps_map cpuset in all circumstances.

Overload on an rps cpu is detected when the cpu's input queue
exceeds a high watermark. This threshold, netdev_max_rps_backlog,
is configurable through sysctl. By default, it is the same as
netdev_max_backlog, the threshold at which point packets are
dropped. Therefore, the behavior is disabled by default. It is
enabled by setting the rps threshold lower than the drop threshold.

This mechanism is orthogonal to filtering approaches to handle
unwanted large flows (DoS). The goal here is to avoid dropping any
traffic until the entire system is saturated.

Tested:
Sent a steady stream of 1.4M UDP packets to a machine with four
RPS cpus 0--3, set in /sys/class/net/eth0/queues/rx-N/rps_cpus.
RFS is disabled to illustrate load balancing more clearly. Showing
the output from /proc/net/softnet_stat, columns 0 (processed), 1
(dropped), 2 (time squeeuze) and 9 (rps), for the relevant CPUs:

- without patch, 40 source IP addresses (i.e., balanced):
0: ok=00409483 drop=00000000 time=00001224 rps=00000089
1: ok=00496336 drop=00051365 time=00001551 rps=00000000
2: ok=00374380 drop=00000000 time=00001105 rps=00000129
3: ok=00411348 drop=00000000 time=00001175 rps=00000165

- without patch, 1 source IP address:
0: ok=00856313 drop=00863842 time=00002676 rps=00000000
1: ok=00000003 drop=00000000 time=00000000 rps=00000001
2: ok=00000001 drop=00000000 time=00000000 rps=00000001
3: ok=00000014 drop=00000000 time=00000000 rps=00000000

- with patch, 1 source IP address:
0: ok=00278675 drop=00000000 time=00000475 rps=00001201
1: ok=00276154 drop=00000000 time=00000459 rps=00001213
2: ok=00647050 drop=00000000 time=00002022 rps=00000000
3: ok=00276228 drop=00000000 time=00000464 rps=00001218

(let me know if commit messages like this are too wordy)
---
 Documentation/networking/scaling.txt |   12 +++++++++++
 include/linux/netdevice.h            |    3 ++
 net/core/dev.c                       |   37 ++++++++++++++++++++++++++++++++-
 net/core/sysctl_net_core.c           |    9 ++++++++
 4 files changed, 59 insertions(+), 2 deletions(-)
Rick Jones - Dec. 6, 2012, 10:25 p.m.
On 12/06/2012 12:36 PM, Willem de Bruijn wrote:
> RPS and RFS balance load across cpus with flow affinity. This can
> cause local bottlenecks, where a small number or single large flow
> (DoS) can saturate one CPU while others are idle.
>
> This patch maintains flow affinity in normal conditions, but
> trades it for throughput when a cpu becomes saturated. Then, packets
> destined to that cpu (only) are redirected to the lightest loaded cpu
> in the rxqueue's rps_map. This breaks flow affinity under high load
> for some flows, in favor of processing packets up to the capacity
> of the complete rps_map cpuset in all circumstances.

I thought (one of) the ideas behind RFS at least was to give the CPU 
scheduler control over where network processing took place instead of it 
being dictated solely by the addressing.  I would have expected the CPU 
scheduler to migrate some work off the saturated CPU.  Or will this only 
affect RPS and not RFS?

Allowing individual flows to straddle the CPUs - won't that be somewhat 
like what happens in bonding with mode-rr in the outbound case - packet 
reordering evil?  What kind of workload is this targeting that calls for 
such intra-flow parallelism?

With respect to the examples given, what happens when it is TCP traffic 
rather than UDP?

happy benchmarking,

rick jones

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Willem de Bruijn - Dec. 6, 2012, 11:04 p.m.
On Thu, Dec 6, 2012 at 5:25 PM, Rick Jones <rick.jones2@hp.com> wrote:
> On 12/06/2012 12:36 PM, Willem de Bruijn wrote:
>>
>> RPS and RFS balance load across cpus with flow affinity. This can
>> cause local bottlenecks, where a small number or single large flow
>> (DoS) can saturate one CPU while others are idle.
>>
>> This patch maintains flow affinity in normal conditions, but
>> trades it for throughput when a cpu becomes saturated. Then, packets
>> destined to that cpu (only) are redirected to the lightest loaded cpu
>> in the rxqueue's rps_map. This breaks flow affinity under high load
>> for some flows, in favor of processing packets up to the capacity
>> of the complete rps_map cpuset in all circumstances.
>
>
> I thought (one of) the ideas behind RFS at least was to give the CPU
> scheduler control over where network processing took place instead of it
> being dictated solely by the addressing. I would have expected the CPU
> scheduler to migrate some work off the saturated CPU.  Or will this only
> affect RPS and not RFS?

I wrote it with RPS in mind, indeed. With RFS, for sufficiently
multithreaded applications that are unpinned, the scheduler will
likely spread the threads across as many cpus as possible. In that
case, the mechanism will not kick in, or as quickly. Even with RFS,
pinned threads and single-threaded applications will likely also
benefit during high load from redirecting kernel receive
processing away from the cpu that runs the application thread. I
haven't tested that case independently.

> Allowing individual flows to straddle the CPUs - won't that be somewhat like
> what happens in bonding with mode-rr in the outbound case - packet
> reordering evil?

Yes, that's the main drawback.

> What kind of workload is this targeting that calls for
> such intra-flow parallelism?

Packet processing middeboxes that rather operate in degraded mode
(reordering) than drop packets. Intrusion detection systems and proxies,
for instance. These boxes are actually likely to have RPS enabled and
RFS disabled.

> With respect to the examples given, what happens when it is TCP traffic
> rather than UDP?

That should be identical. RFS is supported for both protocols. In the test, it
is turned off to demonstrate the effect solely with RPS.

Restricting this to only RPS is easy, btw: get_rps_overflow_cpu() is called
one for RFS and once for RPS.

> happy benchmarking,
>
> rick jones
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones - Dec. 6, 2012, 11:45 p.m.
On 12/06/2012 03:04 PM, Willem de Bruijn wrote:
> On Thu, Dec 6, 2012 at 5:25 PM, Rick Jones <rick.jones2@hp.com> wrote:
>> I thought (one of) the ideas behind RFS at least was to give the CPU
>> scheduler control over where network processing took place instead of it
>> being dictated solely by the addressing. I would have expected the CPU
>> scheduler to migrate some work off the saturated CPU.  Or will this only
>> affect RPS and not RFS?
>
> I wrote it with RPS in mind, indeed. With RFS, for sufficiently
> multithreaded applications that are unpinned, the scheduler will
> likely spread the threads across as many cpus as possible. In that
> case, the mechanism will not kick in, or as quickly. Even with RFS,
> pinned threads and single-threaded applications will likely also
> benefit during high load from redirecting kernel receive
> processing away from the cpu that runs the application thread. I
> haven't tested that case independently.

Unless that single-threaded application (or single receiving thread) is 
pinned to a CPU, isn't there a non-trivial chance that incoming traffic 
flowing up different CPUs will cause it to be bounced from one CPU to 
another, taking its cache lines with it and not just the "intra-stack" 
cache lines?

Long (?) ago and far away it was possible to say that a given IRQ should 
be potentially serviced by more than one CPU (if I recall though not 
phrase correctly).  Didn't that get taken away because it did such nasty 
things like reordering and such?  (Admittedly, I'm really stretching the 
limits of my dimm memory there)

>> What kind of workload is this targeting that calls for
>> such intra-flow parallelism?
>
> Packet processing middeboxes that rather operate in degraded mode
> (reordering) than drop packets. Intrusion detection systems and proxies,
> for instance. These boxes are actually likely to have RPS enabled and
> RFS disabled.
>
>> With respect to the examples given, what happens when it is TCP traffic
>> rather than UDP?
>
> That should be identical. RFS is supported for both protocols. In the
> test, it is turned off to demonstrate the effect solely with RPS.

Will it be identical with TCP?  If anything, I would think causing 
reordering of the TCP segments within flows would only further increase 
the workload of the middlebox because it will increase the ACK rates. 
Perhaps quite significantly if GRO was effective at the receivers before 
the reordering started.

At least unless/until the reordering is bad enough to cause the sending 
TCPs to fast retransmit and so throttle back.  And unless we are talking 
about being overloaded by massive herds of "mice" I'd think that the TCP 
flows would be throttling back to what the single CPU in the middlebox 
could handle.

rick
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Hutchings - Dec. 7, 2012, 2:51 p.m.
On Thu, 2012-12-06 at 15:36 -0500, Willem de Bruijn wrote:
> RPS and RFS balance load across cpus with flow affinity. This can
> cause local bottlenecks, where a small number or single large flow
> (DoS) can saturate one CPU while others are idle.
> 
> This patch maintains flow affinity in normal conditions, but
> trades it for throughput when a cpu becomes saturated. Then, packets
> destined to that cpu (only) are redirected to the lightest loaded cpu
> in the rxqueue's rps_map. This breaks flow affinity under high load
> for some flows, in favor of processing packets up to the capacity
> of the complete rps_map cpuset in all circumstances.
[...]
> --- a/Documentation/networking/scaling.txt
> +++ b/Documentation/networking/scaling.txt
> @@ -135,6 +135,18 @@ packets have been queued to their backlog queue. The IPI wakes backlog
>  processing on the remote CPU, and any queued packets are then processed
>  up the networking stack.
>  
> +==== RPS Overflow Protection
> +
> +By selecting the same cpu from the cpuset for each packet in the same
> +flow, RPS will cause load imbalance when input flows are not uniformly
> +random. In the extreme case, a single flow, all packets are handled on a
> +single CPU, which limits the throughput of the machine to the throughput
> +of that CPU. RPS has optional overflow protection, which disables flow
> +affinity when an RPS CPU becomes saturated: during overload, its packets
> +will be sent to the least loaded other CPU in the RPS cpuset. To enable
> +this option, set sysctl net.core.netdev_max_rps_backlog to be smaller than
> +net.core.netdev_max_backlog. Setting it to half is a reasonable heuristic.
[...]

This only seems to be suitable for specialised applications where a high
degree of reordering is tolerable.  This documentation should make that
very clear.

Ben.
Willem de Bruijn - Dec. 7, 2012, 4:04 p.m.
On Thu, Dec 6, 2012 at 6:45 PM, Rick Jones <rick.jones2@hp.com> wrote:
> On 12/06/2012 03:04 PM, Willem de Bruijn wrote:
>>
>> On Thu, Dec 6, 2012 at 5:25 PM, Rick Jones <rick.jones2@hp.com> wrote:
>>>
>>> I thought (one of) the ideas behind RFS at least was to give the CPU
>>> scheduler control over where network processing took place instead of it
>>> being dictated solely by the addressing. I would have expected the CPU
>>> scheduler to migrate some work off the saturated CPU.  Or will this only
>>> affect RPS and not RFS?
>>
>>
>> I wrote it with RPS in mind, indeed. With RFS, for sufficiently
>> multithreaded applications that are unpinned, the scheduler will
>> likely spread the threads across as many cpus as possible. In that
>> case, the mechanism will not kick in, or as quickly. Even with RFS,
>> pinned threads and single-threaded applications will likely also
>> benefit during high load from redirecting kernel receive
>> processing away from the cpu that runs the application thread. I
>> haven't tested that case independently.
>
>
> Unless that single-threaded application (or single receiving thread) is
> pinned to a CPU, isn't there a non-trivial chance that incoming traffic
> flowing up different CPUs will cause it to be bounced from one CPU to
> another, taking its cache lines with it and not just the "intra-stack" cache
> lines?

Yes. The patch restricts the offload cpus to rps_cpus, with the assumption
that this is a small subset of all cpus. In that case, other workloads will
eventually migrate to the remainder. I previously tested spreading across
all cpus, which indeed did interfere with the userspace threads.

> Long (?) ago and far away it was possible to say that a given IRQ should be
> potentially serviced by more than one CPU (if I recall though not phrase
> correctly).  Didn't that get taken away because it did such nasty things
> like reordering and such?  (Admittedly, I'm really stretching the limits of
> my dimm memory there)

Sounds familiar. Wasn't there a mechanism to periodically switch the
destination cpu? If at HZ granularity, that is very coarse grain compared to
Mpps, but out of order does seem likely. I assume that this patch will lead
to a steady state where userspace and kernel receive run on disjoint cpusets,
due to the rps_cpus set being hot with kernel receive processing. That said,
I can run a test with RFS enabled to see whether that actually holds.

>>> What kind of workload is this targeting that calls for
>>> such intra-flow parallelism?
>>
>>
>> Packet processing middeboxes that rather operate in degraded mode
>> (reordering) than drop packets. Intrusion detection systems and proxies,
>> for instance. These boxes are actually likely to have RPS enabled and
>> RFS disabled.
>>
>>> With respect to the examples given, what happens when it is TCP traffic
>>> rather than UDP?
>>
>>
>> That should be identical. RFS is supported for both protocols. In the
>> test, it is turned off to demonstrate the effect solely with RPS.
>
>
> Will it be identical with TCP?  If anything, I would think causing
> reordering of the TCP segments within flows would only further increase the
> workload of the middlebox because it will increase the ACK rates. Perhaps
> quite significantly if GRO was effective at the receivers before the
> reordering started.
>
> At least unless/until the reordering is bad enough to cause the sending TCPs
> to fast retransmit and so throttle back.  And unless we are talking about
> being overloaded by massive herds of "mice" I'd think that the TCP flows
> would be throttling back to what the single CPU in the middlebox could
> handle.

Agreed, I will try to get some data on the interaction with TCP flows. My
hunch is that they throttle down due to the reordering, but data is more useful.
The initial increase in ACKs, if any, will likely not increase rate beyond a
small factor.

The situations that this patch mean to address are more straightforward
DoS attacks, where a box can handle normal load with a big safety margin,
but falls over at a 10x or 100x flood of TCP SYN or similar packets.

> rick
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Willem de Bruijn - Dec. 7, 2012, 4:41 p.m.
On Fri, Dec 7, 2012 at 9:51 AM, Ben Hutchings <bhutchings@solarflare.com> wrote:
> On Thu, 2012-12-06 at 15:36 -0500, Willem de Bruijn wrote:
>> RPS and RFS balance load across cpus with flow affinity. This can
>> cause local bottlenecks, where a small number or single large flow
>> (DoS) can saturate one CPU while others are idle.
>>
>> This patch maintains flow affinity in normal conditions, but
>> trades it for throughput when a cpu becomes saturated. Then, packets
>> destined to that cpu (only) are redirected to the lightest loaded cpu
>> in the rxqueue's rps_map. This breaks flow affinity under high load
>> for some flows, in favor of processing packets up to the capacity
>> of the complete rps_map cpuset in all circumstances.
> [...]
>> --- a/Documentation/networking/scaling.txt
>> +++ b/Documentation/networking/scaling.txt
>> @@ -135,6 +135,18 @@ packets have been queued to their backlog queue. The IPI wakes backlog
>>  processing on the remote CPU, and any queued packets are then processed
>>  up the networking stack.
>>
>> +==== RPS Overflow Protection
>> +
>> +By selecting the same cpu from the cpuset for each packet in the same
>> +flow, RPS will cause load imbalance when input flows are not uniformly
>> +random. In the extreme case, a single flow, all packets are handled on a
>> +single CPU, which limits the throughput of the machine to the throughput
>> +of that CPU. RPS has optional overflow protection, which disables flow
>> +affinity when an RPS CPU becomes saturated: during overload, its packets
>> +will be sent to the least loaded other CPU in the RPS cpuset. To enable
>> +this option, set sysctl net.core.netdev_max_rps_backlog to be smaller than
>> +net.core.netdev_max_backlog. Setting it to half is a reasonable heuristic.
> [...]
>
> This only seems to be suitable for specialised applications where a high
> degree of reordering is tolerable.  This documentation should make that
> very clear.

Good point. I'll revise that when I respin the patch.

I wasn't too concerned with this earlier, but there may be a way
to reduce the amount of reordering imposed, in
particular in the case where normal load has many small flows and the
exception is the normal case plus a small number of very high rate flows
(think synflood).

It is possible for a single high rate flow to exceed a single cpu
capacity, so those flows will always either drop packets or span
cpus and thus witness reordering (they are unlikely to be tcp
connections). It would be an improvement if the smaller flows
would at least not see reordering.

If the algorithm only redistributes packets from high rate flows, or an
approximation thereof, this will be the case. Keeping a hashtable,
counting arrivals per bucket and redirecting the highest fraction of buckets,
will do this (not my idea: a variation on a drop strategy that Eric mentioned
to me earlier). I can implement this, instead, if that sounds like a better
idea.

Because of the constraint that a single flow may exceed a single cpu
capacity, redistributed packets will always have to be redistributed
without flow affinity, I think.


> Ben.
>
> --
> Ben Hutchings, Staff Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - Dec. 7, 2012, 7:20 p.m.
From: Willem de Bruijn <willemb@google.com>
Date: Thu,  6 Dec 2012 15:36:34 -0500

> This patch maintains flow affinity in normal conditions, but
> trades it for throughput when a cpu becomes saturated. Then, packets
> destined to that cpu (only) are redirected to the lightest loaded cpu
> in the rxqueue's rps_map. This breaks flow affinity under high load
> for some flows, in favor of processing packets up to the capacity
> of the complete rps_map cpuset in all circumstances.

We specifically built-in very strict checks to make sure we never
deliver packets out-of-order.  Those mechanisms must be used and
enforced in any change of this nature.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Willem de Bruijn - Dec. 10, 2012, 8:09 p.m.
On Fri, Dec 7, 2012 at 2:20 PM, David Miller <davem@davemloft.net> wrote:
> From: Willem de Bruijn <willemb@google.com>
> Date: Thu,  6 Dec 2012 15:36:34 -0500
>
>> This patch maintains flow affinity in normal conditions, but
>> trades it for throughput when a cpu becomes saturated. Then, packets
>> destined to that cpu (only) are redirected to the lightest loaded cpu
>> in the rxqueue's rps_map. This breaks flow affinity under high load
>> for some flows, in favor of processing packets up to the capacity
>> of the complete rps_map cpuset in all circumstances.
>
> We specifically built-in very strict checks to make sure we never
> deliver packets out-of-order.  Those mechanisms must be used and
> enforced in any change of this nature.

Okay. I'm working on a table-based revision to redirect flows consistently
when backlogged and to drop flows that are too big for any cpu to handle.
Revising and testing will take some time. If results are good, I'll post
a v2 soon. Thanks for all feedback so far.

Flow redirection when backlogged should improve resilience against
unbalanced load (such as synfloods) for all rps/rfs applications, not
just middleboxes. For that case, I'd like to be able to spray packets
instead of drop them when a single flow exceeds cpu capacity, but
that's a separate issue.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index 579994a..f454564 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -135,6 +135,18 @@  packets have been queued to their backlog queue. The IPI wakes backlog
 processing on the remote CPU, and any queued packets are then processed
 up the networking stack.
 
+==== RPS Overflow Protection
+
+By selecting the same cpu from the cpuset for each packet in the same
+flow, RPS will cause load imbalance when input flows are not uniformly
+random. In the extreme case, a single flow, all packets are handled on a
+single CPU, which limits the throughput of the machine to the throughput
+of that CPU. RPS has optional overflow protection, which disables flow
+affinity when an RPS CPU becomes saturated: during overload, its packets
+will be sent to the least loaded other CPU in the RPS cpuset. To enable
+this option, set sysctl net.core.netdev_max_rps_backlog to be smaller than
+net.core.netdev_max_backlog. Setting it to half is a reasonable heuristic.
+
 ==== RPS Configuration
 
 RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 18c5dc9..84624fa 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2609,6 +2609,9 @@  extern void netdev_stats_to_stats64(struct rtnl_link_stats64 *stats64,
 				    const struct net_device_stats *netdev_stats);
 
 extern int		netdev_max_backlog;
+#ifdef CONFIG_RPS
+extern int		netdev_max_rps_backlog;
+#endif
 extern int		netdev_tstamp_prequeue;
 extern int		weight_p;
 extern int		bpf_jit_enable;
diff --git a/net/core/dev.c b/net/core/dev.c
index 2f94df2..08c99ad 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2734,6 +2734,9 @@  EXPORT_SYMBOL(dev_queue_xmit);
 int netdev_max_backlog __read_mostly = 1000;
 EXPORT_SYMBOL(netdev_max_backlog);
 
+#ifdef CONFIG_RPS
+int netdev_max_rps_backlog __read_mostly = 1000;
+#endif
 int netdev_tstamp_prequeue __read_mostly = 1;
 int netdev_budget __read_mostly = 300;
 int weight_p __read_mostly = 64;            /* old backlog weight */
@@ -2834,6 +2837,36 @@  set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	return rflow;
 }
 
+/* @return cpu under normal conditions, another rps_cpu if backlogged. */
+static int get_rps_overflow_cpu(int cpu, const struct rps_map* map)
+{
+       struct softnet_data *sd;
+       unsigned int cur, tcpu, min;
+       int i;
+
+       if (skb_queue_len(&per_cpu(softnet_data, cpu).input_pkt_queue) <
+           netdev_max_rps_backlog || !map)
+               return cpu;
+
+       /* leave room to prioritize the flows sent to the cpu by rxhash. */
+       min = netdev_max_rps_backlog;
+       min -= min >> 3;
+
+       for (i = 0; i < map->len; i++) {
+               tcpu = map->cpus[i];
+               if (cpu_online(tcpu)) {
+                       sd = &per_cpu(softnet_data, tcpu);
+                       cur = skb_queue_len(&sd->input_pkt_queue);
+                       if (cur < min) {
+                               min = cur;
+                               cpu = tcpu;
+                       }
+               }
+       }
+
+       return cpu;
+}
+
 /*
  * get_rps_cpu is called from netif_receive_skb and returns the target
  * CPU from the RPS map of the receiving queue for a given skb.
@@ -2912,7 +2945,7 @@  static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 
 		if (tcpu != RPS_NO_CPU && cpu_online(tcpu)) {
 			*rflowp = rflow;
-			cpu = tcpu;
+			cpu = get_rps_overflow_cpu(tcpu, map);
 			goto done;
 		}
 	}
@@ -2921,7 +2954,7 @@  static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
 
 		if (cpu_online(tcpu)) {
-			cpu = tcpu;
+			cpu = get_rps_overflow_cpu(tcpu, map);
 			goto done;
 		}
 	}
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index d1b0804..c1b7829 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -129,6 +129,15 @@  static struct ctl_table net_core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+#ifdef CONFIG_RPS
+	{
+		.procname	= "netdev_max_rps_backlog",
+		.data		= &netdev_max_rps_backlog,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+#endif
 #ifdef CONFIG_BPF_JIT
 	{
 		.procname	= "bpf_jit_enable",