Message ID: 1335607805-735-1-git-send-email-dczhu@mips.com
State: Rejected, archived
Delegated to: David Miller
On Sat, 2012-04-28 at 18:10 +0800, Deng-Cheng Zhu wrote:
> From: Deng-Cheng Zhu <dczhu@mips.com>
>
> Currently, choosing the target CPU to process the incoming packet is based
> on skb->rxhash. In the case of sparse connections, this could lead to
> relatively low and inconsistent bandwidth while doing network throughput
> tests -- CPU selection in the RPS map is imbalanced. Even with the same
> hash value, 2 packets could come from different devices. Besides, on
> architectures like MIPS with multi-threaded cores, siblings of CPU0 should
> not be selected when others are not saturated.

What is CPU0 doing that is so special that you have to mention it in this
changelog?

> This patch introduces a feature that allows some flows to select their CPUs
> by looping the RPS CPU maps. Some tests were performed on the MIPS Malta
> 1004K platform (2 cores, each with 2 VPEs) at 25 MHz with 2 Intel Pro/1000
> NICs. The Malta board works as a router between 2 PCs. Using iperf, here
> are the results:

RPS on a router? That's not very good, unless maybe you perform a crazy
amount of work in iptables rules.

One packet comes in; it's better to handle it right now and send it right
now on the same CPU. No IPI cost, no cache line misses...

RPS is better suited to TCP handling on the local host, because the stack
has a big memory footprint and latencies -- not to a forwarding workload.

I suspect you can reach more throughput using appropriate tunings
(correct interrupt affinities). This sounds like a bad config from the
very beginning.

> +};
> +
> +static struct cpu_flow flow[CONFIG_NR_RPS_MAP_LOOPS][NR_CPUS];

It's absolutely not allowed to add a [NR_CPUS] array anywhere in the Linux
kernel in 2012.

> +/*
> + * We've got CONFIG_SMP to do RPS, so only arch define is needed here to access
> + * sibling specific information.
> + */
> +#if defined(CONFIG_MIPS)

It's not allowed to add a CONFIG_somearch in net/core/dev.c.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/28/2012 07:55 PM, Eric Dumazet wrote:
> On Sat, 2012-04-28 at 18:10 +0800, Deng-Cheng Zhu wrote:
>> From: Deng-Cheng Zhu <dczhu@mips.com>
>>
>> Currently, choosing the target CPU to process the incoming packet is based
>> on skb->rxhash. In the case of sparse connections, this could lead to
>> relatively low and inconsistent bandwidth while doing network throughput
>> tests -- CPU selection in the RPS map is imbalanced. Even with the same
>> hash value, 2 packets could come from different devices. Besides, on
>> architectures like MIPS with multi-threaded cores, siblings of CPU0 should
>> not be selected when others are not saturated.
>
> What is CPU0 doing that is so special that you have to mention it in this
> changelog?

It serves the NIC hardware IRQ and does the first part of the softirq in
RPS. Changes relating to this will be made in v2. Thanks.

>> This patch introduces a feature that allows some flows to select their CPUs
>> by looping the RPS CPU maps. Some tests were performed on the MIPS Malta
>> 1004K platform (2 cores, each with 2 VPEs) at 25 MHz with 2 Intel Pro/1000
>> NICs. The Malta board works as a router between 2 PCs. Using iperf, here
>> are the results:
>
> RPS on a router? That's not very good, unless maybe you perform a crazy
> amount of work in iptables rules.
>
> One packet comes in; it's better to handle it right now and send it right
> now on the same CPU. No IPI cost, no cache line misses...

Theoretically, you are right. But RPS kicks in early -- when the hot CPU
gets really hot, RPS takes effect even with a forwarding workload. Earlier
experiments on the mentioned Malta platform proved it as well (gains from
typically 45% up to more than double, amazingly).

> RPS is better suited to TCP handling on the local host, because the stack
> has a big memory footprint and latencies -- not to a forwarding workload.

See above.

> I suspect you can reach more throughput using appropriate tunings
> (correct interrupt affinities). This sounds like a bad config from the
> very beginning.

Unfortunately, on the Malta platform, the NIC IRQs are not suitable for
SMP IRQ affinity -- they are based on the XT-PIC. However, I did do some
RPS tests where different CPU masks were assigned to the 2 NICs, and the
throughput *WAS* better than that of assigning the same mask to both
NICs. But the problem addressed in this patch *STILL* exists -- hash
indexing causes imbalance across CPUs in the case of sparse connections.

>> +};
>> +
>> +static struct cpu_flow flow[CONFIG_NR_RPS_MAP_LOOPS][NR_CPUS];
>
> It's absolutely not allowed to add a [NR_CPUS] array anywhere in the Linux
> kernel in 2012.

Good advice. This will be different in v2.

>> +/*
>> + * We've got CONFIG_SMP to do RPS, so only arch define is needed here to access
>> + * sibling specific information.
>> + */
>> +#if defined(CONFIG_MIPS)
>
> It's not allowed to add a CONFIG_somearch in net/core/dev.c.

Good advice too. Thanks!


Deng-Cheng
On Wed, 2012-05-02 at 12:08 +0800, Deng-Cheng Zhu wrote:
> Unfortunately, on the Malta platform, the NIC IRQs are not suitable for
> SMP IRQ affinity -- they are based on the XT-PIC. However, I did do some
> RPS tests where different CPU masks were assigned to the 2 NICs, and the
> throughput *WAS* better than that of assigning the same mask to both
> NICs. But the problem addressed in this patch *STILL* exists -- hash
> indexing causes imbalance across CPUs in the case of sparse connections.

You mean your two NIC IRQs are handled by CPU0 and this can't be changed?

That really is bad.
On 05/02/2012 12:12 PM, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 12:08 +0800, Deng-Cheng Zhu wrote:
>> Unfortunately, on the Malta platform, the NIC IRQs are not suitable for
>> SMP IRQ affinity -- they are based on the XT-PIC. However, I did do some
>> RPS tests where different CPU masks were assigned to the 2 NICs, and the
>> throughput *WAS* better than that of assigning the same mask to both
>> NICs. But the problem addressed in this patch *STILL* exists -- hash
>> indexing causes imbalance across CPUs in the case of sparse connections.
>
> You mean your two NIC IRQs are handled by CPU0 and this can't be changed?
>
> That really is bad.

Yes, that's right. The IO-APIC is not enabled. Trying to write values to
/proc/irq/$irqnum/smp_affinity fails with:

echo: write error: Input/output error


Deng-Cheng
diff --git a/net/Kconfig b/net/Kconfig
index e07272d..d5aa682 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -222,6 +222,28 @@ config RPS
 	depends on SMP && SYSFS && USE_GENERIC_SMP_HELPERS
 	default y
 
+config RPS_SPARSE_FLOW_OPTIMIZATION
+	bool "RPS optimizations for sparse flows"
+	depends on RPS
+	default n
+	---help---
+	  This feature will try to map some network flows to consecutive
+	  CPUs in the RPS map. It will bring in some per-packet overhead
+	  but should be able to benefit network throughput in the case
+	  of a low number of connections, while not much affecting other
+	  cases (e.g. relatively consistent and high bandwidth in single
+	  connection tests).
+
+config NR_RPS_MAP_LOOPS
+	int "Number of loops walking RPS map before hash indexing (1-5)"
+	range 1 5
+	depends on RPS_SPARSE_FLOW_OPTIMIZATION
+	default "4"
+	---help---
+	  It defines how many loops to go through the RPS map while
+	  determining the target CPU to process the incoming packet. After
+	  that, the decision will fall back on hash indexing the RPS map.
+
 config RFS_ACCEL
 	boolean
 	depends on RPS && GENERIC_HARDIRQS
diff --git a/net/core/dev.c b/net/core/dev.c
index c25d453..82633df 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2698,6 +2698,79 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	return rflow;
 }
 
+#ifdef CONFIG_RPS_SPARSE_FLOW_OPTIMIZATION
+struct cpu_flow {
+	struct net_device *dev;
+	u32 rxhash;
+	unsigned long ts;
+};
+
+static struct cpu_flow flow[CONFIG_NR_RPS_MAP_LOOPS][NR_CPUS];
+static unsigned long hash_active;
+
+#define FLOW_INACTIVE(now, base) (time_after((now), (base) + HZ) || \
+				  unlikely(time_before((now), (base))))
+
+static int find_cpu(const struct rps_map *map, const struct sk_buff *skb)
+{
+	int i, l, cpu, do_alloc = 0;
+	unsigned long now = jiffies;
+
+retry:
+	for (l = 0; l < CONFIG_NR_RPS_MAP_LOOPS; l++) {
+		for (i = map->len - 1; i >= 0; i--) {
+			cpu = map->cpus[i];
+
+/*
+ * We've got CONFIG_SMP to do RPS, so only an arch define is needed here to
+ * access sibling specific information.
+ */
+#if defined(CONFIG_MIPS)
+			/*
+			 * Some architectures like MIPS have multithreaded
+			 * cores. When CPU0 serves NIC IRQs, its siblings, e.g.
+			 * CPU1, have less horsepower than CPUs in other cores
+			 * in the RPS CPU map. We simply leave them alone here.
+			 */
+			if (cpu_isset(cpu, cpu_sibling_map[0]))
+				continue;
+#endif
+
+			if (do_alloc) {
+				if (flow[l][cpu].dev == NULL ||
+				    FLOW_INACTIVE(now, flow[l][cpu].ts)) {
+					flow[l][cpu].dev = skb->dev;
+					flow[l][cpu].rxhash = skb->rxhash;
+					flow[l][cpu].ts = now;
+					return cpu;
+				}
+			} else {
+				/*
+				 * Unlike hash indexing, this avoids packet
+				 * processing imbalance across CPUs.
+				 */
+				if (flow[l][cpu].rxhash == skb->rxhash &&
+				    flow[l][cpu].dev == skb->dev &&
+				    !FLOW_INACTIVE(now, flow[l][cpu].ts)) {
+					flow[l][cpu].ts = now;
+					return cpu;
+				}
+			}
+		}
+	}
+
+	if (FLOW_INACTIVE(now, hash_active) && do_alloc == 0) {
+		do_alloc = 1;
+		goto retry;
+	}
+
+	/* For all other flows */
+	hash_active = now;
+
+	return map->cpus[((u64) skb->rxhash * map->len) >> 32];
+}
+#endif
+
 /*
  * get_rps_cpu is called from netif_receive_skb and returns the target
  * CPU from the RPS map of the receiving queue for a given skb.
@@ -2780,7 +2853,11 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	}
 
 	if (map) {
+#ifdef CONFIG_RPS_SPARSE_FLOW_OPTIMIZATION
+		tcpu = find_cpu(map, skb);
+#else
 		tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
+#endif
 
 		if (cpu_online(tcpu)) {
 			cpu = tcpu;