| Message ID | 1354826194-9289-1-git-send-email-willemb@google.com |
|---|---|
| State | Changes Requested, archived |
| Delegated to: | David Miller |
On 12/06/2012 12:36 PM, Willem de Bruijn wrote:
> RPS and RFS balance load across cpus with flow affinity. This can
> cause local bottlenecks, where a small number or single large flow
> (DoS) can saturate one CPU while others are idle.
>
> This patch maintains flow affinity in normal conditions, but
> trades it for throughput when a cpu becomes saturated. Then, packets
> destined to that cpu (only) are redirected to the lightest loaded cpu
> in the rxqueue's rps_map. This breaks flow affinity under high load
> for some flows, in favor of processing packets up to the capacity
> of the complete rps_map cpuset in all circumstances.

I thought (one of) the ideas behind RFS at least was to give the CPU
scheduler control over where network processing took place instead of
it being dictated solely by the addressing. I would have expected the
CPU scheduler to migrate some work off the saturated CPU. Or will this
only affect RPS and not RFS?

Allowing individual flows to straddle the CPUs - won't that be somewhat
like what happens in bonding with mode-rr in the outbound case - packet
reordering evil? What kind of workload is this targeting that calls for
such intra-flow parallelism?

With respect to the examples given, what happens when it is TCP traffic
rather than UDP?

happy benchmarking,

rick jones
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Dec 6, 2012 at 5:25 PM, Rick Jones <rick.jones2@hp.com> wrote:
> On 12/06/2012 12:36 PM, Willem de Bruijn wrote:
>>
>> RPS and RFS balance load across cpus with flow affinity. This can
>> cause local bottlenecks, where a small number or single large flow
>> (DoS) can saturate one CPU while others are idle.
>>
>> This patch maintains flow affinity in normal conditions, but
>> trades it for throughput when a cpu becomes saturated. Then, packets
>> destined to that cpu (only) are redirected to the lightest loaded cpu
>> in the rxqueue's rps_map. This breaks flow affinity under high load
>> for some flows, in favor of processing packets up to the capacity
>> of the complete rps_map cpuset in all circumstances.
>
> I thought (one of) the ideas behind RFS at least was to give the CPU
> scheduler control over where network processing took place instead of it
> being dictated solely by the addressing. I would have expected the CPU
> scheduler to migrate some work off the saturated CPU. Or will this only
> affect RPS and not RFS?

I wrote it with RPS in mind, indeed. With RFS, for sufficiently
multithreaded applications that are unpinned, the scheduler will likely
spread the threads across as many cpus as possible. In that case, the
mechanism will not kick in, or not as quickly. Even with RFS, pinned
threads and single-threaded applications will likely also benefit
during high load from redirecting kernel receive processing away from
the cpu that runs the application thread. I haven't tested that case
independently.

> Allowing individual flows to straddle the CPUs - won't that be somewhat
> like what happens in bonding with mode-rr in the outbound case - packet
> reordering evil?

Yes, that's the main drawback.

> What kind of workload is this targeting that calls for
> such intra-flow parallelism?

Packet processing middleboxes that would rather operate in degraded
mode (reordering) than drop packets. Intrusion detection systems and
proxies, for instance. These boxes are actually likely to have RPS
enabled and RFS disabled.

> With respect to the examples given, what happens when it is TCP traffic
> rather than UDP?

That should be identical. RFS is supported for both protocols. In the
test, it is turned off to demonstrate the effect solely with RPS.
Restricting this to only RPS is easy, btw: get_rps_overflow_cpu() is
called once for RFS and once for RPS.

> happy benchmarking,
>
> rick jones
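[The flow affinity being discussed comes from the hash-to-cpu mapping in get_rps_cpu(): the 32-bit rxhash is scaled onto the rps_map by a fixed-point multiply, so all packets of a flow land on the same CPU. A minimal userspace sketch of that mapping, not the kernel code itself, with a made-up example map:]

```c
#include <stdint.h>

/* Userspace sketch of how get_rps_cpu() picks a CPU from the rxqueue's
 * rps_map: the 32-bit flow hash is scaled into [0, len) by fixed-point
 * multiplication, so equal hashes always map to the same CPU.  This is
 * the flow affinity the patch selectively trades away under overload. */
static unsigned int rps_map_cpu(uint32_t rxhash, const uint16_t *cpus,
                                unsigned int len)
{
	return cpus[((uint64_t)rxhash * len) >> 32];
}
```

[The multiply-and-shift avoids a division and spreads hashes evenly across the map regardless of its length.]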
On 12/06/2012 03:04 PM, Willem de Bruijn wrote:
> On Thu, Dec 6, 2012 at 5:25 PM, Rick Jones <rick.jones2@hp.com> wrote:
>> I thought (one of) the ideas behind RFS at least was to give the CPU
>> scheduler control over where network processing took place instead of it
>> being dictated solely by the addressing. I would have expected the CPU
>> scheduler to migrate some work off the saturated CPU. Or will this only
>> affect RPS and not RFS?
>
> I wrote it with RPS in mind, indeed. With RFS, for sufficiently
> multithreaded applications that are unpinned, the scheduler will
> likely spread the threads across as many cpus as possible. In that
> case, the mechanism will not kick in, or as quickly. Even with RFS,
> pinned threads and single-threaded applications will likely also
> benefit during high load from redirecting kernel receive
> processing away from the cpu that runs the application thread. I
> haven't tested that case independently.

Unless that single-threaded application (or single receiving thread) is
pinned to a CPU, isn't there a non-trivial chance that incoming traffic
flowing up different CPUs will cause it to be bounced from one CPU to
another, taking its cache lines with it and not just the "intra-stack"
cache lines?

Long (?) ago and far away it was possible to say that a given IRQ
should be potentially serviced by more than one CPU (if I recall the
thought, though not the phrase, correctly). Didn't that get taken away
because it did such nasty things like reordering and such?
(Admittedly, I'm really stretching the limits of my dimm memory there)

>> What kind of workload is this targeting that calls for
>> such intra-flow parallelism?
>
> Packet processing middleboxes that would rather operate in degraded mode
> (reordering) than drop packets. Intrusion detection systems and proxies,
> for instance. These boxes are actually likely to have RPS enabled and
> RFS disabled.
>
>> With respect to the examples given, what happens when it is TCP traffic
>> rather than UDP?
>
> That should be identical. RFS is supported for both protocols. In the
> test, it is turned off to demonstrate the effect solely with RPS.

Will it be identical with TCP? If anything, I would think causing
reordering of the TCP segments within flows would only further increase
the workload of the middlebox because it will increase the ACK rates.
Perhaps quite significantly if GRO was effective at the receivers
before the reordering started.

At least unless/until the reordering is bad enough to cause the sending
TCPs to fast retransmit and so throttle back. And unless we are talking
about being overloaded by massive herds of "mice" I'd think that the
TCP flows would be throttling back to what the single CPU in the
middlebox could handle.

rick
On Thu, 2012-12-06 at 15:36 -0500, Willem de Bruijn wrote:
> RPS and RFS balance load across cpus with flow affinity. This can
> cause local bottlenecks, where a small number or single large flow
> (DoS) can saturate one CPU while others are idle.
>
> This patch maintains flow affinity in normal conditions, but
> trades it for throughput when a cpu becomes saturated. Then, packets
> destined to that cpu (only) are redirected to the lightest loaded cpu
> in the rxqueue's rps_map. This breaks flow affinity under high load
> for some flows, in favor of processing packets up to the capacity
> of the complete rps_map cpuset in all circumstances.
[...]
> --- a/Documentation/networking/scaling.txt
> +++ b/Documentation/networking/scaling.txt
> @@ -135,6 +135,18 @@ packets have been queued to their backlog queue. The IPI wakes backlog
>  processing on the remote CPU, and any queued packets are then processed
>  up the networking stack.
>
> +==== RPS Overflow Protection
> +
> +By selecting the same cpu from the cpuset for each packet in the same
> +flow, RPS will cause load imbalance when input flows are not uniformly
> +random. In the extreme case, a single flow, all packets are handled on a
> +single CPU, which limits the throughput of the machine to the throughput
> +of that CPU. RPS has optional overflow protection, which disables flow
> +affinity when an RPS CPU becomes saturated: during overload, its packets
> +will be sent to the least loaded other CPU in the RPS cpuset. To enable
> +this option, set sysctl net.core.netdev_max_rps_backlog to be smaller than
> +net.core.netdev_max_backlog. Setting it to half is a reasonable heuristic.
[...]

This only seems to be suitable for specialised applications where a
high degree of reordering is tolerable. This documentation should make
that very clear.

Ben.
On Thu, Dec 6, 2012 at 6:45 PM, Rick Jones <rick.jones2@hp.com> wrote:
> On 12/06/2012 03:04 PM, Willem de Bruijn wrote:
>>
>> On Thu, Dec 6, 2012 at 5:25 PM, Rick Jones <rick.jones2@hp.com> wrote:
>>>
>>> I thought (one of) the ideas behind RFS at least was to give the CPU
>>> scheduler control over where network processing took place instead of it
>>> being dictated solely by the addressing. I would have expected the CPU
>>> scheduler to migrate some work off the saturated CPU. Or will this only
>>> affect RPS and not RFS?
>>
>> I wrote it with RPS in mind, indeed. With RFS, for sufficiently
>> multithreaded applications that are unpinned, the scheduler will
>> likely spread the threads across as many cpus as possible. In that
>> case, the mechanism will not kick in, or as quickly. Even with RFS,
>> pinned threads and single-threaded applications will likely also
>> benefit during high load from redirecting kernel receive
>> processing away from the cpu that runs the application thread. I
>> haven't tested that case independently.
>
> Unless that single-threaded application (or single receiving thread) is
> pinned to a CPU, isn't there a non-trivial chance that incoming traffic
> flowing up different CPUs will cause it to be bounced from one CPU to
> another, taking its cache lines with it and not just the "intra-stack"
> cache lines?

Yes. The patch restricts the offload cpus to rps_cpus, with the
assumption that this is a small subset of all cpus. In that case, other
workloads will eventually migrate to the remainder. I previously tested
spreading across all cpus, which indeed did interfere with the
userspace threads.

> Long (?) ago and far away it was possible to say that a given IRQ should
> be potentially serviced by more than one CPU (if I recall though not
> phrase correctly). Didn't that get taken away because it did such nasty
> things like reordering and such? (Admittedly, I'm really stretching the
> limits of my dimm memory there)

Sounds familiar. Wasn't there a mechanism to periodically switch the
destination cpu? If at HZ granularity, that is very coarse grained
compared to Mpps, but out of order delivery does seem likely.

I assume that this patch will lead to a steady state where userspace
and kernel receive processing run on disjoint cpusets, due to the
rps_cpus set being hot with kernel receive processing. That said, I can
run a test with RFS enabled to see whether that actually holds.

>>> What kind of workload is this targeting that calls for
>>> such intra-flow parallelism?
>>
>> Packet processing middleboxes that would rather operate in degraded mode
>> (reordering) than drop packets. Intrusion detection systems and proxies,
>> for instance. These boxes are actually likely to have RPS enabled and
>> RFS disabled.
>>
>>> With respect to the examples given, what happens when it is TCP traffic
>>> rather than UDP?
>>
>> That should be identical. RFS is supported for both protocols. In the
>> test, it is turned off to demonstrate the effect solely with RPS.
>
> Will it be identical with TCP? If anything, I would think causing
> reordering of the TCP segments within flows would only further increase
> the workload of the middlebox because it will increase the ACK rates.
> Perhaps quite significantly if GRO was effective at the receivers before
> the reordering started.
>
> At least unless/until the reordering is bad enough to cause the sending
> TCPs to fast retransmit and so throttle back. And unless we are talking
> about being overloaded by massive herds of "mice" I'd think that the TCP
> flows would be throttling back to what the single CPU in the middlebox
> could handle.

Agreed, I will try to get some data on the interaction with TCP flows.
My hunch is that they throttle down due to the reordering, but data is
more useful. The initial increase in ACKs, if any, will likely not
increase the rate beyond a small factor.

The situations that this patch means to address are more
straightforward DoS attacks, where a box can handle normal load with a
big safety margin, but falls over at a 10x or 100x flood of TCP SYN or
similar packets.

> rick
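[The overflow selection under discussion, keep affinity until the hash-chosen cpu saturates and then fall back to the least loaded cpu in the rps_map, can be modeled outside the kernel. This is a simplified sketch, not the kernel code: MAX_RPS_BACKLOG stands in for the proposed netdev_max_rps_backlog sysctl, qlen[] stands in for the per-cpu input_pkt_queue lengths, and the cpu_online() checks are omitted.]

```c
#define MAX_RPS_BACKLOG 1000	/* stands in for netdev_max_rps_backlog */

/* Userspace model of the patch's get_rps_overflow_cpu().  While the
 * hash-chosen cpu's backlog is below the limit, affinity is kept.
 * Once it saturates, the least loaded cpu in the map is chosen, but
 * only if that cpu's queue is below limit - limit/8, which reserves
 * headroom for the flows that hash to that cpu in the first place. */
static int overflow_cpu(int cpu, const int *map, int map_len,
			const unsigned int *qlen)
{
	unsigned int cur, min;
	int i;

	if (!map || qlen[cpu] < MAX_RPS_BACKLOG)
		return cpu;

	min = MAX_RPS_BACKLOG;
	min -= min >> 3;		/* 1000 - 125 = 875 */

	for (i = 0; i < map_len; i++) {
		cur = qlen[map[i]];
		if (cur < min) {
			min = cur;
			cpu = map[i];
		}
	}
	return cpu;
}
```

[Note that when every cpu in the map is above the headroom threshold, the function falls through and returns the original cpu, so packets are then dropped by the normal backlog limit rather than sprayed.]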
On Fri, Dec 7, 2012 at 9:51 AM, Ben Hutchings <bhutchings@solarflare.com> wrote:
> On Thu, 2012-12-06 at 15:36 -0500, Willem de Bruijn wrote:
>> RPS and RFS balance load across cpus with flow affinity. This can
>> cause local bottlenecks, where a small number or single large flow
>> (DoS) can saturate one CPU while others are idle.
>>
>> This patch maintains flow affinity in normal conditions, but
>> trades it for throughput when a cpu becomes saturated. Then, packets
>> destined to that cpu (only) are redirected to the lightest loaded cpu
>> in the rxqueue's rps_map. This breaks flow affinity under high load
>> for some flows, in favor of processing packets up to the capacity
>> of the complete rps_map cpuset in all circumstances.
> [...]
>> --- a/Documentation/networking/scaling.txt
>> +++ b/Documentation/networking/scaling.txt
>> @@ -135,6 +135,18 @@ packets have been queued to their backlog queue. The IPI wakes backlog
>>  processing on the remote CPU, and any queued packets are then processed
>>  up the networking stack.
>>
>> +==== RPS Overflow Protection
>> +
>> +By selecting the same cpu from the cpuset for each packet in the same
>> +flow, RPS will cause load imbalance when input flows are not uniformly
>> +random. In the extreme case, a single flow, all packets are handled on a
>> +single CPU, which limits the throughput of the machine to the throughput
>> +of that CPU. RPS has optional overflow protection, which disables flow
>> +affinity when an RPS CPU becomes saturated: during overload, its packets
>> +will be sent to the least loaded other CPU in the RPS cpuset. To enable
>> +this option, set sysctl net.core.netdev_max_rps_backlog to be smaller than
>> +net.core.netdev_max_backlog. Setting it to half is a reasonable heuristic.
> [...]
>
> This only seems to be suitable for specialised applications where a high
> degree of reordering is tolerable. This documentation should make that
> very clear.

Good point. I'll revise that when I respin the patch.

I wasn't too concerned with this earlier, but there may be a way to
reduce the amount of reordering imposed, in particular in the case
where normal load has many small flows and the exception is the normal
case plus a small number of very high rate flows (think synflood). It
is possible for a single high rate flow to exceed single cpu capacity,
so those flows will always either drop packets or span cpus and thus
witness reordering (they are unlikely to be tcp connections). It would
be an improvement if the smaller flows would at least not see
reordering. If the algorithm only redistributes packets from high rate
flows, or an approximation thereof, this will be the case. Keeping a
hashtable, counting arrivals per bucket and redirecting the highest
fraction of buckets, will do this (not my idea: a variation on a drop
strategy that Eric mentioned to me earlier). I can implement this,
instead, if that sounds like a better idea.

Because of the constraint that a single flow may exceed single cpu
capacity, redistributed packets will always have to be redistributed
without flow affinity, I think.

> Ben.
>
> --
> Ben Hutchings, Staff Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
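[The bucket-counting idea sketched in words above could look roughly like this in userspace. The table size, the 1/8 threshold, and the warm-up cutoff are invented here for illustration; they are not taken from any posted patch.]

```c
#include <stdint.h>

#define HH_BUCKETS 256

/* Sketch of "keep a hashtable, count arrivals per bucket, redirect the
 * highest fraction of buckets": a bucket is flagged as a heavy (high
 * rate) flow candidate once it holds more than 1/8 of all arrivals.
 * Only packets in heavy buckets would lose flow affinity; the many
 * small flows keep ordered delivery. */
struct hh_table {
	unsigned int count[HH_BUCKETS];
	unsigned int total;
};

static int hh_is_heavy(struct hh_table *t, uint32_t rxhash)
{
	unsigned int b = rxhash % HH_BUCKETS;

	t->count[b]++;
	t->total++;
	if (t->total < 64)		/* too few samples to judge yet */
		return 0;
	return t->count[b] > t->total / 8;
}
```

[A real version would decay or reset the counters periodically so that a flow that quiets down regains affinity; the sketch ignores that.]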
From: Willem de Bruijn <willemb@google.com>
Date: Thu, 6 Dec 2012 15:36:34 -0500

> This patch maintains flow affinity in normal conditions, but
> trades it for throughput when a cpu becomes saturated. Then, packets
> destined to that cpu (only) are redirected to the lightest loaded cpu
> in the rxqueue's rps_map. This breaks flow affinity under high load
> for some flows, in favor of processing packets up to the capacity
> of the complete rps_map cpuset in all circumstances.

We specifically built in very strict checks to make sure we never
deliver packets out-of-order. Those mechanisms must be used and
enforced in any change of this nature.
On Fri, Dec 7, 2012 at 2:20 PM, David Miller <davem@davemloft.net> wrote:
> From: Willem de Bruijn <willemb@google.com>
> Date: Thu, 6 Dec 2012 15:36:34 -0500
>
>> This patch maintains flow affinity in normal conditions, but
>> trades it for throughput when a cpu becomes saturated. Then, packets
>> destined to that cpu (only) are redirected to the lightest loaded cpu
>> in the rxqueue's rps_map. This breaks flow affinity under high load
>> for some flows, in favor of processing packets up to the capacity
>> of the complete rps_map cpuset in all circumstances.
>
> We specifically built in very strict checks to make sure we never
> deliver packets out-of-order. Those mechanisms must be used and
> enforced in any change of this nature.

Okay. I'm working on a table-based revision to redirect flows
consistently when backlogged and to drop flows that are too big for any
cpu to handle. Revising and testing will take some time. If results are
good, I'll post a v2 soon. Thanks for all feedback so far.

Flow redirection when backlogged should improve resilience against
unbalanced load (such as synfloods) for all rps/rfs applications, not
just middleboxes. For that case, I'd like to be able to spray packets
instead of drop them when a single flow exceeds cpu capacity, but
that's a separate issue.
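[The table-based revision mentioned here was not posted in this thread. One plausible shape for "redirect flows consistently when backlogged", with all names invented for illustration, is a small table that pins each overflowed hash bucket to a single alternate cpu, so a redirected flow keeps in-order delivery on its new cpu instead of being rebalanced per packet:]

```c
#include <stdint.h>

#define FLOW_TABLE 1024

/* Hypothetical sketch: the first packet that finds its hash-chosen cpu
 * saturated pins the whole hash bucket to one spare cpu (-1 in redir[]
 * means "not redirected").  Later packets of the same flow follow the
 * pinned choice, preserving per-flow ordering while load moves away. */
static int steer(int *redir, uint32_t rxhash, int hash_cpu,
		 int saturated, int spare_cpu)
{
	unsigned int b = rxhash % FLOW_TABLE;

	if (redir[b] >= 0)
		return redir[b];	/* earlier decision wins: in order */
	if (!saturated)
		return hash_cpu;	/* normal case: full flow affinity */
	redir[b] = spare_cpu;		/* first overflowed packet pins it */
	return spare_cpu;
}
```

[A real version would need to age entries out once the backlog drains, otherwise flows stay redirected forever; the sketch ignores expiry entirely.]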
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index 579994a..f454564 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -135,6 +135,18 @@ packets have been queued to their backlog queue. The IPI wakes backlog
 processing on the remote CPU, and any queued packets are then processed
 up the networking stack.
 
+==== RPS Overflow Protection
+
+By selecting the same cpu from the cpuset for each packet in the same
+flow, RPS will cause load imbalance when input flows are not uniformly
+random. In the extreme case, a single flow, all packets are handled on a
+single CPU, which limits the throughput of the machine to the throughput
+of that CPU. RPS has optional overflow protection, which disables flow
+affinity when an RPS CPU becomes saturated: during overload, its packets
+will be sent to the least loaded other CPU in the RPS cpuset. To enable
+this option, set sysctl net.core.netdev_max_rps_backlog to be smaller than
+net.core.netdev_max_backlog. Setting it to half is a reasonable heuristic.
+
 ==== RPS Configuration
 
 RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 18c5dc9..84624fa 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2609,6 +2609,9 @@ extern void netdev_stats_to_stats64(struct rtnl_link_stats64 *stats64,
 				const struct net_device_stats *netdev_stats);
 
 extern int		netdev_max_backlog;
+#ifdef CONFIG_RPS
+extern int		netdev_max_rps_backlog;
+#endif
 extern int		netdev_tstamp_prequeue;
 extern int		weight_p;
 extern int		bpf_jit_enable;
diff --git a/net/core/dev.c b/net/core/dev.c
index 2f94df2..08c99ad 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2734,6 +2734,9 @@ EXPORT_SYMBOL(dev_queue_xmit);
 
 int netdev_max_backlog __read_mostly = 1000;
 EXPORT_SYMBOL(netdev_max_backlog);
+#ifdef CONFIG_RPS
+int netdev_max_rps_backlog __read_mostly = 1000;
+#endif
 int netdev_tstamp_prequeue __read_mostly = 1;
 int netdev_budget __read_mostly = 300;
 int weight_p __read_mostly = 64;            /* old backlog weight */
@@ -2834,6 +2837,36 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	return rflow;
 }
 
+/* @return cpu under normal conditions, another rps_cpu if backlogged. */
+static int get_rps_overflow_cpu(int cpu, const struct rps_map* map)
+{
+	struct softnet_data *sd;
+	unsigned int cur, tcpu, min;
+	int i;
+
+	if (skb_queue_len(&per_cpu(softnet_data, cpu).input_pkt_queue) <
+	    netdev_max_rps_backlog || !map)
+		return cpu;
+
+	/* leave room to prioritize the flows sent to the cpu by rxhash. */
+	min = netdev_max_rps_backlog;
+	min -= min >> 3;
+
+	for (i = 0; i < map->len; i++) {
+		tcpu = map->cpus[i];
+		if (cpu_online(tcpu)) {
+			sd = &per_cpu(softnet_data, tcpu);
+			cur = skb_queue_len(&sd->input_pkt_queue);
+			if (cur < min) {
+				min = cur;
+				cpu = tcpu;
+			}
+		}
+	}
+
+	return cpu;
+}
+
 /*
  * get_rps_cpu is called from netif_receive_skb and returns the target
  * CPU from the RPS map of the receiving queue for a given skb.
@@ -2912,7 +2945,7 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 
 		if (tcpu != RPS_NO_CPU && cpu_online(tcpu)) {
 			*rflowp = rflow;
-			cpu = tcpu;
+			cpu = get_rps_overflow_cpu(tcpu, map);
 			goto done;
 		}
 	}
@@ -2921,7 +2954,7 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 
 		tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
 		if (cpu_online(tcpu)) {
-			cpu = tcpu;
+			cpu = get_rps_overflow_cpu(tcpu, map);
 			goto done;
 		}
 	}
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index d1b0804..c1b7829 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -129,6 +129,15 @@ static struct ctl_table net_core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+#ifdef CONFIG_RPS
+	{
+		.procname	= "netdev_max_rps_backlog",
+		.data		= &netdev_max_rps_backlog,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+#endif
 #ifdef CONFIG_BPF_JIT
 	{
 		.procname	= "bpf_jit_enable",