Message ID | alpine.DEB.1.00.1004152243470.15102@pokey.mtv.corp.google.com
---|---
State | Accepted, archived
Delegated to: | David Miller
From: Tom Herbert <therbert@google.com>
Date: Thu, 15 Apr 2010 22:47:08 -0700 (PDT)

> Version 5 of RFS:
> - Moved rps_sock_flow_sysctl into net/core/sysctl_net_core.c as a
>   static function.
> - Apply limits to the rps_sock_flow_entries sysctl and rps_flow_cnt
>   sysfs variable.

I've read this over a few times and I think it's ready to go into
net-next-2.6; we can tweak things as needed from here on out.

Eric, what do you think?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thursday, April 15, 2010 at 23:33 -0700, David Miller wrote:
> From: Tom Herbert <therbert@google.com>
> Date: Thu, 15 Apr 2010 22:47:08 -0700 (PDT)
>
> I've read this over a few times and I think it's ready to go into
> net-next-2.6, we can tweak things as-needed from here on out.
>
> Eric, what do you think?

I read the patch and found no error.

I booted a test machine and performed some tests.

I am a bit worried about a tbench regression I am looking at right now.

With RFS disabled: tbench 16 -> 4408.63 MB/sec

# grep . /sys/class/net/lo/queues/rx-0/*
/sys/class/net/lo/queues/rx-0/rps_cpus:00000000
/sys/class/net/lo/queues/rx-0/rps_flow_cnt:8192
# cat /proc/sys/net/core/rps_sock_flow_entries
8192

echo ffff >/sys/class/net/lo/queues/rx-0/rps_cpus

tbench 16 -> 2336.32 MB/sec

-------------------------------------------------------------------------------
   PerfTop: 14561 irqs/sec  kernel:86.3%  [1000Hz cycles],  (all, 16 CPUs)
-------------------------------------------------------------------------------

  samples  pcnt  function                        DSO
  _______  _____  ______________________________  ________________________________________

  2664.00   5.1%  copy_user_generic_string        /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
  2323.00   4.4%  acpi_os_read_port               /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
  1641.00   3.1%  _raw_spin_lock_irqsave          /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
  1260.00   2.4%  schedule                        /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
  1159.00   2.2%  _raw_spin_lock                  /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
  1051.00   2.0%  tcp_ack                         /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   991.00   1.9%  tcp_sendmsg                     /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   922.00   1.8%  tcp_recvmsg                     /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   821.00   1.6%  child_run                       /usr/bin/tbench
   766.00   1.5%  all_string_sub                  /usr/bin/tbench
   630.00   1.2%  __switch_to                     /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   608.00   1.2%  __GI_strchr                     /lib/tls/libc-2.3.4.so
   606.00   1.2%  ipt_do_table                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   600.00   1.1%  __GI_strstr                     /lib/tls/libc-2.3.4.so
   556.00   1.1%  __netif_receive_skb             /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   504.00   1.0%  tcp_transmit_skb                /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   502.00   1.0%  tick_nohz_stop_sched_tick       /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   481.00   0.9%  _raw_spin_unlock_irqrestore     /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   473.00   0.9%  next_token                      /usr/bin/tbench
   449.00   0.9%  ip_rcv                          /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   423.00   0.8%  call_function_single_interrupt  /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   422.00   0.8%  ia32_sysenter_target            /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   420.00   0.8%  compat_sys_socketcall           /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   401.00   0.8%  mod_timer                       /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   400.00   0.8%  process_backlog                 /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   399.00   0.8%  ip_queue_xmit                   /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   387.00   0.7%  select_task_rq_fair             /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   377.00   0.7%  _raw_spin_lock_bh               /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
   360.00   0.7%  tcp_v4_rcv                      /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux

But if RFS is on, why does activating rps_cpus change tbench?
On Friday, April 16, 2010 at 08:56 +0200, Eric Dumazet wrote:
> I read the patch and found no error.
>
> I booted a test machine and performed some tests.
>
> I am a bit worried about a tbench regression I am looking at right now.
>
> With RFS disabled: tbench 16 -> 4408.63 MB/sec
> [...]
> echo ffff >/sys/class/net/lo/queues/rx-0/rps_cpus
>
> tbench 16 -> 2336.32 MB/sec
> [...]
> But if RFS is on, why does activating rps_cpus change tbench?

Hmm, I wonder if it's not an artifact of net-next-2.6 being a bit old
(versus linux-2.6). I know the scheduler guys did some tweaks.

Because apparently, some CPUs are idle part of their time (30% ???).

Or it's a new bug in CPU accounting, reporting idle time while CPUs are
busy...
# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system--- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo     in      cs us sy id wa
16  0      0 5670264  13280  63392    0    0     2     1   1512     227 12 47 41  0
18  0      0 5669396  13280  63392    0    0     0     0 657952 1606102 14 58 28  0
17  0      0 5668776  13288  63392    0    0     0    12 656701 1606369 14 58 28  0
18  0      0 5669644  13288  63392    0    0     0     0 657636 1603960 15 57 28  0
17  0      0 5670900  13288  63392    0    0     0     0 666425 1584847 15 56 29  0
15  0      0 5669164  13288  63392    0    0     0     0 682578 1472616 14 56 30  0
16  0      0 5669412  13288  63392    0    0     0     0 695767 1506302 14 54 32  0
14  0      0 5668916  13296  63396    0    0     4   148 685286 1482897 14 56 30  0
17  0      0 5669784  13296  63396    0    0     0     0 683910 1477994 14 56 30  0
18  0      0 5670032  13296  63396    0    0     0     0 692023 1497195 14 55 31  0
16  0      0 5669040  13296  63396    0    0     0     0 677477 1468157 14 56 30  0
16  0      0 5668916  13312  63396    0    0     0    32 489358 1048553 14 57 30  0
18  0      0 5667924  13320  63396    0    0     0    12 424787  897145 15 55 29  0

RFS off:

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system--- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo     in      cs us sy id wa
24  0      0 5669624  13632  63476    0    0     2     1    261      82 12 48 40  0
26  0      0 5669492  13632  63476    0    0     0     0   4223 1740651 21 71  7  0
23  0      0 5669864  13640  63476    0    0     0    12   4205 1731882 21 71  8  0
23  0      0 5670484  13640  63476    0    0     0     0   4176 1733448 21 71  8  0
24  0      0 5670588  13640  63476    0    0     0     0   4176 1733845 21 72  7  0
21  0      0 5671084  13640  63476    0    0     0     0   4200 1734990 20 73  7  0
23  0      0 5671580  13640  63476    0    0     0     0   4168 1735100 21 71  8  0
23  0      0 5671704  13640  63480    0    0     4   132   4221 1733428 21 72  7  0
22  0      0 5671952  13640  63480    0    0     0     0   4190 1730370 21 72  8  0
20  0      0 5672292  13640  63480    0    0     0     0   4212 1732084 22 70  8  0
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 16 Apr 2010 09:18:03 +0200

> Hmm, I wonder if it's not an artifact of net-next-2.6 being a bit old
> (versus linux-2.6). I know the scheduler guys did some tweaks.

I synced net-next-2.6 up with Linus's current tree just a day or two ago
when I pulled net-2.6 into net-next-2.6.
On Friday, April 16, 2010 at 00:26 -0700, David Miller wrote:
> I synced net-next-2.6 up with Linus's current tree just a day
> or two ago when I pulled net-2.6 into net-next-2.6.

OK, thanks :)

Tom, please add a __read_mostly to rps_sock_flow_table:

struct rps_sock_flow_table *rps_sock_flow_table __read_mostly;

I'll spend some hours today tracking the problem.
Tom Herbert <therbert@google.com> writes:
> +
> +	/*
> +	 * If the desired CPU (where last recvmsg was done) is
> +	 * different from current CPU (one in the rx-queue flow
> +	 * table entry), switch if one of the following holds:
> +	 *   - Current CPU is unset (equal to RPS_NO_CPU).
> +	 *   - Current CPU is offline.
> +	 *   - The current CPU's queue tail has advanced beyond the
> +	 *     last packet that was enqueued using this table entry.
> +	 *     This guarantees that all previous packets for the flow
> +	 *     have been dequeued, thus preserving in order delivery.
> +	 */
> +	if (unlikely(tcpu != next_cpu) &&
> +	    (tcpu == RPS_NO_CPU || !cpu_online(tcpu) ||
> +	     ((int)(per_cpu(softnet_data, tcpu).input_queue_head -

One thing I've been wondering while reading is whether this should be
made socket or SMT aware.

If you're on a hyperthreaded system, sending an IPI to your core
sibling, which has a completely shared cache hierarchy, might not be the
best use of cycles.

The same could potentially be true for a shared L2 or shared L3 cache
(e.g. only redirect flows between different sockets).

Have you ever considered that? This is of course something that could be
addressed post-merge, not a blocker.

-Andi
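The quoted condition (which the patch hunk truncates) can be paraphrased as a standalone sketch. The function and parameter names below are simplified, hypothetical stand-ins for the kernel's per-CPU softnet_data plumbing; the point is the decision rule itself, in particular the signed cast that makes the head/tail comparison safe across counter wraparound:

```c
#include <stdbool.h>

#define RPS_NO_CPU 0xffff  /* sentinel: no CPU recorded for this flow */

/* Sketch (not the actual net/core/dev.c code): may the flow be moved
 * from tcpu, the CPU recorded in the rx-queue flow table, to next_cpu,
 * where recvmsg last ran?  Yes if tcpu is unset or offline, or if
 * tcpu's input queue head counter has advanced past the last packet
 * enqueued for this flow -- meaning every earlier packet of the flow
 * has already been dequeued, so in-order delivery is preserved. */
bool rfs_may_switch_cpu(unsigned int tcpu, unsigned int next_cpu,
                        bool tcpu_online,
                        unsigned int tcpu_input_queue_head,
                        unsigned int flow_last_qtail)
{
    if (tcpu == next_cpu)
        return false;  /* already steered to the desired CPU */
    return tcpu == RPS_NO_CPU || !tcpu_online ||
           (int)(tcpu_input_queue_head - flow_last_qtail) >= 0;
}
```

For example, with tcpu online, head counter 10 and last enqueue at tail 5, the flow may switch; with head 5 and tail 10 there are still packets in flight on tcpu, so it may not.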
On Fri, 2010-04-16 at 13:57 +0200, Andi Kleen wrote:
> One thing I've been wondering while reading is whether this should be
> made socket or SMT aware.
>
> If you're on a hyperthreaded system, sending an IPI to your core
> sibling, which has a completely shared cache hierarchy, might not be
> the best use of cycles.
>
> Have you ever considered that?

How are you going to schedule the net softirq on an empty queue if you
do this?

BTW, in my tests, sending an IPI to an SMT sibling or to another core
didn't make any difference in terms of latency - still 5 microsecs.
I don't have a dual Nehalem where we have to cross QPI - there I suspect
it will be longer than 5 microsecs.

cheers,
jamal
On Fri, Apr 16, 2010 at 09:32:06AM -0400, jamal wrote:
> How are you going to schedule the net softirq on an empty queue if you
> do this?

Sorry, I don't understand the question. You can always handle the flow
as if RPS was not there.

> BTW, in my tests, sending an IPI to an SMT sibling or to another core
> didn't make any difference in terms of latency - still 5 microsecs.

I meant an IPI to a sibling is not useful. You send the IPI to get cache
locality in the target, but if the target has the same cache locality as
you, you can just as well avoid the cost of the IPI and process
directly.

For a thread sibling I'm pretty sure it's useless. Not fully sure about
a socket sibling. Maybe.

-Andi
On Fri, 2010-04-16 at 15:42 +0200, Andi Kleen wrote:
> Sorry, I don't understand the question.
>
> You can always handle the flow as if RPS was not there.

Meaning you schedule the other side's netrx softirq if the queue is
empty?

> I meant an IPI to a sibling is not useful. You send the IPI to get
> cache locality in the target, but if the target has the same cache
> locality as you, you can just as well avoid the cost of the IPI and
> process directly.

Isn't the purpose of the IPI to signal the remote side that there's
something for it to do? Does it also sync the remote cache?

> For a thread sibling I'm pretty sure it's useless. Not fully sure
> about a socket sibling. Maybe.

Agreed. The SMT threads share L2, and all the cores share L3. And it is
inclusive, so if something is in the L1 of one thread it must be present
in the shared L2 as well as L3. Across QPI I don't think that is true.
But if you special-case this, aren't you being specific to Nehalem?

cheers,
jamal
On Fri, Apr 16, 2010 at 10:05:15AM -0400, jamal wrote:
> Meaning you schedule the other side's netrx softirq if the queue is
> empty?

You handle the packet as if RPS wasn't enabled: softirq on the current
CPU, and it queues it on the socket.

> Isn't the purpose of the IPI to signal the remote side that there's
> something for it to do?

The current CPU can queue on that socket as well. The whole point of the
IPI is to do it with cache locality. But if cache locality is already
there on the current CPU, you don't need the IPI.

> Does it also sync the remote cache?

No, the caches are always coherent.

> Agreed. The SMT threads share L2, and all the cores share L3. And it
> is inclusive, so if something is in the L1 of one thread it must be
> present in the shared L2 as well as L3. Across QPI I don't think that
> is true.
> But if you special-case this, aren't you being specific to Nehalem?

Other CPUs have SMT too (Niagara, POWER 6/7, mips, ...). It should be
the same there. Assuming L3 affinity helps, it might need to be a
CPU-specific tunable, yes. The scheduler has some information about
this.

-Andi
Eric, thanks for testing that. Admittedly, we have not looked at
enabling RFS/RPS over loopback. I'll look at that today also.

On Thu, Apr 15, 2010 at 11:56 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> I read the patch and found no error.
>
> I am a bit worried about a tbench regression I am looking at right now.
>
> With RFS disabled: tbench 16 -> 4408.63 MB/sec
> [...]
> echo ffff >/sys/class/net/lo/queues/rx-0/rps_cpus
>
> tbench 16 -> 2336.32 MB/sec
> [...]
> But if RFS is on, why does activating rps_cpus change tbench?
On Friday, April 16, 2010 at 08:35 -0700, Tom Herbert wrote:
> Eric, thanks for testing that. Admittedly, we have not looked at
> enabling RFS/RPS over loopback. I'll look at that today also.

Hi Tom,

I am sorry, but I could not work on this today. I hope I can find some
time a bit later.
Results with "tbench 16" on an 8-core Intel machine:

No RPS/RFS:     2155 MB/sec
RPS (0ff mask): 1700 MB/sec
RFS:            1097 MB/sec

I am not particularly surprised by the results: the loopback interface
already provides good parallelism, and RPS/RFS really would only add
overhead and more trips between CPUs (the last part is why RPS < RFS, I
suspect) -- I guess this is why we've never enabled RPS on loopback :-)

Eric, do you have a particular concern that this could affect a real
workload?

Tom

On Thu, Apr 15, 2010 at 11:56 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> I am a bit worried about a tbench regression I am looking at right now.
> [...]
> But if RFS is on, why does activating rps_cpus change tbench?
On Friday, April 16, 2010 at 11:35 -0700, Tom Herbert wrote:
> Results with "tbench 16" on an 8-core Intel machine:
>
> No RPS/RFS:     2155 MB/sec
> RPS (0ff mask): 1700 MB/sec
> RFS:            1097 MB/sec
>
> Eric, do you have a particular concern that this could affect a real
> workload?

I was expecting RFS to be better than RPS at least, for this particular
workload (TCP over loopback).

With RPS, the hash of (127.0.0.1, port1, 127.0.0.1, port2) is different
from the hash of (127.0.0.1, port2, 127.0.0.1, port1), so basically we
force the server to run on a different processor than the client.

However, I was expecting that with RFS, client and server would run on
the same CPU.

Maybe we could change (for a test) the hash function to use
(sport ^ dport) instead of (sport << 16) + dport.
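Eric's observation can be demonstrated numerically. The snippet below is a toy stand-in for the kernel's jhash-based flow hash (which mixes the addresses too); it only shows the property at issue: combining the ports with XOR is symmetric, so both directions of a loopback connection fold to the same value, while the shift-and-add form is direction-dependent:

```c
#include <stdint.h>

/* Direction-dependent port fold, as in the current hash input:
 * (sport << 16) + dport differs for the two directions of a flow. */
uint32_t ports_shift_add(uint16_t sport, uint16_t dport)
{
    return ((uint32_t)sport << 16) + dport;
}

/* Symmetric port fold suggested for the test: sport ^ dport gives the
 * same value for client->server and server->client packets on lo, so
 * both halves of the connection would hash to the same flow entry. */
uint32_t ports_xor(uint16_t sport, uint16_t dport)
{
    return (uint32_t)(sport ^ dport);
}
```

For a connection between ephemeral port 45678 and port 80, ports_xor(45678, 80) equals ports_xor(80, 45678), whereas the shift-and-add values differ; the trade-off is that a symmetric fold loses entropy (all flows between the same port pair collide, whichever direction).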
On Thursday, April 15, 2010 at 23:33 -0700, David Miller wrote:
> From: Tom Herbert <therbert@google.com>
> Date: Thu, 15 Apr 2010 22:47:08 -0700 (PDT)
>
> > Version 5 of RFS:
> > - Moved rps_sock_flow_sysctl into net/core/sysctl_net_core.c as a
> >   static function.
> > - Apply limits to rps_sock_flow_entries sysctl and rps_flow_cnt
> >   sysfs variable.
>
> I've read this over a few times and I think it's ready to go into
> net-next-2.6, we can tweak things as-needed from here on out.
>
> Eric, what do you think?

I think I can give my Sob, and we have time to fully test it and tweak
it if necessary.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Thanks Tom !
On Fri, Apr 16, 2010 at 11:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Friday, April 16, 2010 at 11:35 -0700, Tom Herbert wrote:
>> Results with "tbench 16" on an 8 core Intel machine.
>>
>> No RPS/RFS: 2155 MB/sec
>> RPS (0ff mask): 1700 MB/sec
>> RFS: 1097
>>

Blah, I mistakenly reported that... it should have been:

No RPS/RFS: 2155 MB/sec
RPS (0ff mask): 1097 MB/sec
RFS: 1700 MB/sec

Sorry about that!

>> I am not particularly surprised by the results, using loopback
>> interface already provides good parallelism and RPS/RFS really would
>> only add overhead and more trips between CPUs (last part is why RPS <
>> RFS I suspect)-- I guess this is why we've never enabled RPS on
>> loopback :-)
>>
>> Eric, do you have a particular concern that this could affect a real workload?
>>
>
> I was expecting RFS to be better than RPS at least, for this particular
> workload (tcp over loopback)
>

This was my expectation too, and what my "corrected" numbers show :-)
But, I take it this is different in your results?

Tom
On Friday, April 16, 2010 at 13:42 -0700, Tom Herbert wrote:
> On Fri, Apr 16, 2010 at 11:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Friday, April 16, 2010 at 11:35 -0700, Tom Herbert wrote:
> >> Results with "tbench 16" on an 8 core Intel machine.
> >>
> >> No RPS/RFS: 2155 MB/sec
> >> RPS (0ff mask): 1700 MB/sec
> >> RFS: 1097
> >>
>
> Blah, I mistakenly reported that... it should have been:
>
> No RPS/RFS: 2155 MB/sec
> RPS (0ff mask): 1097 MB/sec
> RFS: 1700 MB/sec
>
> Sorry about that!

> This was my expectation too, and what my "corrected" numbers show :-)
> But, I take it this is different in your results?

My results are with "tbench 16" on a dual X5570 @ 2.93GHz
(16 logical CPUs).

No RPS, no RFS: 4448.14 MB/sec
RPS: 2298.00 MB/sec (but with a lot of variation)
RFS: 2600 MB/sec

Maybe my RFS setup is bad?
(8192 flows)
On Friday, April 16, 2010 at 23:12 +0200, Eric Dumazet wrote:
> On Friday, April 16, 2010 at 13:42 -0700, Tom Herbert wrote:
> > On Fri, Apr 16, 2010 at 11:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > On Friday, April 16, 2010 at 11:35 -0700, Tom Herbert wrote:
> > >> Results with "tbench 16" on an 8 core Intel machine.
> > >>
> > >> No RPS/RFS: 2155 MB/sec
> > >> RPS (0ff mask): 1700 MB/sec
> > >> RFS: 1097
> > >>
> >
> > Blah, I mistakenly reported that... it should have been:
> >
> > No RPS/RFS: 2155 MB/sec
> > RPS (0ff mask): 1097 MB/sec
> > RFS: 1700 MB/sec
> >
> > Sorry about that!
> >
> > This was my expectation too, and what my "corrected" numbers show :-)
> > But, I take it this is different in your results?
>
> My results are with "tbench 16" on a dual X5570 @ 2.93GHz
> (16 logical CPUs).
>
> No RPS, no RFS: 4448.14 MB/sec
> RPS: 2298.00 MB/sec (but with a lot of variation)
> RFS: 2600 MB/sec
>
> Maybe my RFS setup is bad?
> (8192 flows)
>

Very strange, a second tbench-16 RFS=y run gave me 2134.08 MB/sec.
A third run gave me 1813.21 MB/sec.
A fourth run gave me 2472.91 MB/sec.

Hmm...
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 16 Apr 2010 21:37:59 +0200

> On Thursday, April 15, 2010 at 23:33 -0700, David Miller wrote:
>> From: Tom Herbert <therbert@google.com>
>> Date: Thu, 15 Apr 2010 22:47:08 -0700 (PDT)
>>
>> > Version 5 of RFS:
>> > - Moved rps_sock_flow_sysctl into net/core/sysctl_net_core.c as a
>> >   static function.
>> > - Apply limits to rps_sock_flow_entries sysctl and rps_flow_cnt
>> >   sysfs variable.
>>
>> I've read this over a few times and I think it's ready to go into
>> net-next-2.6, we can tweak things as-needed from here on out.
>>
>> Eric, what do you think?
>
> I think I can give my Sob, and we have time to fully test it and tweak
> it if necessary.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Great, I'll add this to net-next-2.6 right now.

Thanks!
From: David Miller <davem@davemloft.net>
Date: Fri, 16 Apr 2010 15:49:32 -0700 (PDT)

> Great, I'll add this to net-next-2.6 right now.

I had to add an include of linux/vmalloc.h to net/core/sysctl_net_core.c
to fix the build while committing this.
From: David Miller <davem@davemloft.net>
Date: Fri, 16 Apr 2010 15:53:40 -0700 (PDT)

> From: David Miller <davem@davemloft.net>
> Date: Fri, 16 Apr 2010 15:49:32 -0700 (PDT)
>
>> Great, I'll add this to net-next-2.6 right now.
>
> I had to add an include of linux/vmalloc.h to net/core/sysctl_net_core.c
> to fix the build while committing this.

net/core/net-sysfs.c needed it too :-/
Ugh, vmalloc.h must be sneaking in through some other header file for
me :-( Sorry about that. Do you need me to respin the patch?

Tom

On Fri, Apr 16, 2010 at 3:57 PM, David Miller <davem@davemloft.net> wrote:
> From: David Miller <davem@davemloft.net>
> Date: Fri, 16 Apr 2010 15:53:40 -0700 (PDT)
>
>> From: David Miller <davem@davemloft.net>
>> Date: Fri, 16 Apr 2010 15:49:32 -0700 (PDT)
>>
>>> Great, I'll add this to net-next-2.6 right now.
>>
>> I had to add an include of linux/vmalloc.h to net/core/sysctl_net_core.c
>> to fix the build while committing this.
>
> net/core/net-sysfs.c needed it too :-/
>
From: Tom Herbert <therbert@google.com>
Date: Fri, 16 Apr 2010 17:22:49 -0700

> Ugh, vmalloc.h must be sneaking in through some other header file for
> me :-( Sorry about that. Do you need me to respin the patch?

No, I took care of it and am about to push things out to net-next-2.6
on kernel.org.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 55c2086..649a025 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -530,14 +530,73 @@ struct rps_map { }; #define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16))) +/* + * The rps_dev_flow structure contains the mapping of a flow to a CPU and the + * tail pointer for that CPU's input queue at the time of last enqueue. + */ +struct rps_dev_flow { + u16 cpu; + u16 fill; + unsigned int last_qtail; +}; + +/* + * The rps_dev_flow_table structure contains a table of flow mappings. + */ +struct rps_dev_flow_table { + unsigned int mask; + struct rcu_head rcu; + struct work_struct free_work; + struct rps_dev_flow flows[0]; +}; +#define RPS_DEV_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_dev_flow_table) + \ + (_num * sizeof(struct rps_dev_flow))) + +/* + * The rps_sock_flow_table contains mappings of flows to the last CPU + * on which they were processed by the application (set in recvmsg). + */ +struct rps_sock_flow_table { + unsigned int mask; + u16 ents[0]; +}; +#define RPS_SOCK_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_sock_flow_table) + \ + (_num * sizeof(u16))) + +#define RPS_NO_CPU 0xffff + +static inline void rps_record_sock_flow(struct rps_sock_flow_table *table, + u32 hash) +{ + if (table && hash) { + unsigned int cpu, index = hash & table->mask; + + /* We only give a hint, preemption can change cpu under us */ + cpu = raw_smp_processor_id(); + + if (table->ents[index] != cpu) + table->ents[index] = cpu; + } +} + +static inline void rps_reset_sock_flow(struct rps_sock_flow_table *table, + u32 hash) +{ + if (table && hash) + table->ents[hash & table->mask] = RPS_NO_CPU; +} + +extern struct rps_sock_flow_table *rps_sock_flow_table; + /* This structure contains an instance of an RX queue. 
*/ struct netdev_rx_queue { struct rps_map *rps_map; + struct rps_dev_flow_table *rps_flow_table; struct kobject kobj; struct netdev_rx_queue *first; atomic_t count; } ____cacheline_aligned_in_smp; -#endif +#endif /* CONFIG_RPS */ /* * This structure defines the management hooks for network devices. @@ -1333,11 +1392,19 @@ struct softnet_data { /* Elements below can be accessed between CPUs for RPS */ #ifdef CONFIG_RPS struct call_single_data csd ____cacheline_aligned_in_smp; + unsigned int input_queue_head; #endif struct sk_buff_head input_pkt_queue; struct napi_struct backlog; }; +static inline void incr_input_queue_head(struct softnet_data *queue) +{ +#ifdef CONFIG_RPS + queue->input_queue_head++; +#endif +} + DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data); #define HAVE_NETIF_QUEUE diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h index 83fd344..b487bc1 100644 --- a/include/net/inet_sock.h +++ b/include/net/inet_sock.h @@ -21,6 +21,7 @@ #include <linux/string.h> #include <linux/types.h> #include <linux/jhash.h> +#include <linux/netdevice.h> #include <net/flow.h> #include <net/sock.h> @@ -101,6 +102,7 @@ struct rtable; * @uc_ttl - Unicast TTL * @inet_sport - Source port * @inet_id - ID counter for DF pkts + * @rxhash - flow hash received from netif layer * @tos - TOS * @mc_ttl - Multicasting TTL * @is_icsk - is this an inet_connection_sock? @@ -124,6 +126,9 @@ struct inet_sock { __u16 cmsg_flags; __be16 inet_sport; __u16 inet_id; +#ifdef CONFIG_RPS + __u32 rxhash; +#endif struct ip_options *opt; __u8 tos; @@ -219,4 +224,37 @@ static inline __u8 inet_sk_flowi_flags(const struct sock *sk) return inet_sk(sk)->transparent ? 
FLOWI_FLAG_ANYSRC : 0; } +static inline void inet_rps_record_flow(const struct sock *sk) +{ +#ifdef CONFIG_RPS + struct rps_sock_flow_table *sock_flow_table; + + rcu_read_lock(); + sock_flow_table = rcu_dereference(rps_sock_flow_table); + rps_record_sock_flow(sock_flow_table, inet_sk(sk)->rxhash); + rcu_read_unlock(); +#endif +} + +static inline void inet_rps_reset_flow(const struct sock *sk) +{ +#ifdef CONFIG_RPS + struct rps_sock_flow_table *sock_flow_table; + + rcu_read_lock(); + sock_flow_table = rcu_dereference(rps_sock_flow_table); + rps_reset_sock_flow(sock_flow_table, inet_sk(sk)->rxhash); + rcu_read_unlock(); +#endif +} + +static inline void inet_rps_save_rxhash(const struct sock *sk, u32 rxhash) +{ +#ifdef CONFIG_RPS + if (unlikely(inet_sk(sk)->rxhash != rxhash)) { + inet_rps_reset_flow(sk); + inet_sk(sk)->rxhash = rxhash; + } +#endif +} #endif /* _INET_SOCK_H */ diff --git a/net/core/dev.c b/net/core/dev.c index e8041eb..d7107ac 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2203,19 +2203,28 @@ int weight_p __read_mostly = 64; /* old backlog weight */ DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, }; #ifdef CONFIG_RPS + +/* One global table that all flow-based protocols share. */ +struct rps_sock_flow_table *rps_sock_flow_table; +EXPORT_SYMBOL(rps_sock_flow_table); + /* * get_rps_cpu is called from netif_receive_skb and returns the target * CPU from the RPS map of the receiving queue for a given skb. * rcu_read_lock must be held on entry. 
*/ -static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb) +static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb, + struct rps_dev_flow **rflowp) { struct ipv6hdr *ip6; struct iphdr *ip; struct netdev_rx_queue *rxqueue; struct rps_map *map; + struct rps_dev_flow_table *flow_table; + struct rps_sock_flow_table *sock_flow_table; int cpu = -1; u8 ip_proto; + u16 tcpu; u32 addr1, addr2, ports, ihl; if (skb_rx_queue_recorded(skb)) { @@ -2232,7 +2241,7 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb) } else rxqueue = dev->_rx; - if (!rxqueue->rps_map) + if (!rxqueue->rps_map && !rxqueue->rps_flow_table) goto done; if (skb->rxhash) @@ -2284,9 +2293,48 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb) skb->rxhash = 1; got_hash: + flow_table = rcu_dereference(rxqueue->rps_flow_table); + sock_flow_table = rcu_dereference(rps_sock_flow_table); + if (flow_table && sock_flow_table) { + u16 next_cpu; + struct rps_dev_flow *rflow; + + rflow = &flow_table->flows[skb->rxhash & flow_table->mask]; + tcpu = rflow->cpu; + + next_cpu = sock_flow_table->ents[skb->rxhash & + sock_flow_table->mask]; + + /* + * If the desired CPU (where last recvmsg was done) is + * different from current CPU (one in the rx-queue flow + * table entry), switch if one of the following holds: + * - Current CPU is unset (equal to RPS_NO_CPU). + * - Current CPU is offline. + * - The current CPU's queue tail has advanced beyond the + * last packet that was enqueued using this table entry. + * This guarantees that all previous packets for the flow + * have been dequeued, thus preserving in order delivery. 
+ */ + if (unlikely(tcpu != next_cpu) && + (tcpu == RPS_NO_CPU || !cpu_online(tcpu) || + ((int)(per_cpu(softnet_data, tcpu).input_queue_head - + rflow->last_qtail)) >= 0)) { + tcpu = rflow->cpu = next_cpu; + if (tcpu != RPS_NO_CPU) + rflow->last_qtail = per_cpu(softnet_data, + tcpu).input_queue_head; + } + if (tcpu != RPS_NO_CPU && cpu_online(tcpu)) { + *rflowp = rflow; + cpu = tcpu; + goto done; + } + } + map = rcu_dereference(rxqueue->rps_map); if (map) { - u16 tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32]; + tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32]; if (cpu_online(tcpu)) { cpu = tcpu; @@ -2320,13 +2368,14 @@ static void trigger_softirq(void *data) __napi_schedule(&queue->backlog); __get_cpu_var(netdev_rx_stat).received_rps++; } -#endif /* CONFIG_SMP */ +#endif /* CONFIG_RPS */ /* * enqueue_to_backlog is called to queue an skb to a per CPU backlog * queue (may be a remote CPU queue). */ -static int enqueue_to_backlog(struct sk_buff *skb, int cpu) +static int enqueue_to_backlog(struct sk_buff *skb, int cpu, + unsigned int *qtail) { struct softnet_data *queue; unsigned long flags; @@ -2341,6 +2390,10 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu) if (queue->input_pkt_queue.qlen) { enqueue: __skb_queue_tail(&queue->input_pkt_queue, skb); +#ifdef CONFIG_RPS + *qtail = queue->input_queue_head + + queue->input_pkt_queue.qlen; +#endif rps_unlock(queue); local_irq_restore(flags); return NET_RX_SUCCESS; @@ -2355,11 +2408,10 @@ enqueue: cpu_set(cpu, rcpus->mask[rcpus->select]); __raise_softirq_irqoff(NET_RX_SOFTIRQ); - } else - __napi_schedule(&queue->backlog); -#else - __napi_schedule(&queue->backlog); + goto enqueue; + } #endif + __napi_schedule(&queue->backlog); } goto enqueue; } @@ -2401,18 +2453,25 @@ int netif_rx(struct sk_buff *skb) #ifdef CONFIG_RPS { + struct rps_dev_flow voidflow, *rflow = &voidflow; int cpu; rcu_read_lock(); - cpu = get_rps_cpu(skb->dev, skb); + + cpu = get_rps_cpu(skb->dev, skb, &rflow); if (cpu < 0) cpu = 
smp_processor_id(); - ret = enqueue_to_backlog(skb, cpu); + + ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail); + rcu_read_unlock(); } #else - ret = enqueue_to_backlog(skb, get_cpu()); - put_cpu(); + { + unsigned int qtail; + ret = enqueue_to_backlog(skb, get_cpu(), &qtail); + put_cpu(); + } #endif return ret; } @@ -2830,14 +2889,22 @@ out: int netif_receive_skb(struct sk_buff *skb) { #ifdef CONFIG_RPS - int cpu; + struct rps_dev_flow voidflow, *rflow = &voidflow; + int cpu, ret; + + rcu_read_lock(); - cpu = get_rps_cpu(skb->dev, skb); + cpu = get_rps_cpu(skb->dev, skb, &rflow); - if (cpu < 0) - return __netif_receive_skb(skb); - else - return enqueue_to_backlog(skb, cpu); + if (cpu >= 0) { + ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail); + rcu_read_unlock(); + } else { + rcu_read_unlock(); + ret = __netif_receive_skb(skb); + } + + return ret; #else return __netif_receive_skb(skb); #endif @@ -2856,6 +2923,7 @@ static void flush_backlog(void *arg) if (skb->dev == dev) { __skb_unlink(skb, &queue->input_pkt_queue); kfree_skb(skb); + incr_input_queue_head(queue); } rps_unlock(queue); } @@ -3179,6 +3247,7 @@ static int process_backlog(struct napi_struct *napi, int quota) local_irq_enable(); break; } + incr_input_queue_head(queue); rps_unlock(queue); local_irq_enable(); @@ -5542,8 +5611,10 @@ static int dev_cpu_callback(struct notifier_block *nfb, local_irq_enable(); /* Process offline CPU's input_pkt_queue */ - while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) + while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) { netif_rx(skb); + incr_input_queue_head(oldsd); + } return NOTIFY_OK; } diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index 96ed690..f0f1bb7 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -601,22 +601,109 @@ ssize_t store_rps_map(struct netdev_rx_queue *queue, return len; } +static ssize_t show_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue, + struct rx_queue_attribute *attr, + char *buf) +{ + struct 
rps_dev_flow_table *flow_table; + unsigned int val = 0; + + rcu_read_lock(); + flow_table = rcu_dereference(queue->rps_flow_table); + if (flow_table) + val = flow_table->mask + 1; + rcu_read_unlock(); + + return sprintf(buf, "%u\n", val); +} + +static void rps_dev_flow_table_release_work(struct work_struct *work) +{ + struct rps_dev_flow_table *table = container_of(work, + struct rps_dev_flow_table, free_work); + + vfree(table); +} + +static void rps_dev_flow_table_release(struct rcu_head *rcu) +{ + struct rps_dev_flow_table *table = container_of(rcu, + struct rps_dev_flow_table, rcu); + + INIT_WORK(&table->free_work, rps_dev_flow_table_release_work); + schedule_work(&table->free_work); +} + +ssize_t store_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue, + struct rx_queue_attribute *attr, + const char *buf, size_t len) +{ + unsigned int count; + char *endp; + struct rps_dev_flow_table *table, *old_table; + static DEFINE_SPINLOCK(rps_dev_flow_lock); + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + count = simple_strtoul(buf, &endp, 0); + if (endp == buf) + return -EINVAL; + + if (count) { + int i; + + if (count > 1<<30) { + /* Enforce a limit to prevent overflow */ + return -EINVAL; + } + count = roundup_pow_of_two(count); + table = vmalloc(RPS_DEV_FLOW_TABLE_SIZE(count)); + if (!table) + return -ENOMEM; + + table->mask = count - 1; + for (i = 0; i < count; i++) + table->flows[i].cpu = RPS_NO_CPU; + } else + table = NULL; + + spin_lock(&rps_dev_flow_lock); + old_table = queue->rps_flow_table; + rcu_assign_pointer(queue->rps_flow_table, table); + spin_unlock(&rps_dev_flow_lock); + + if (old_table) + call_rcu(&old_table->rcu, rps_dev_flow_table_release); + + return len; +} + static struct rx_queue_attribute rps_cpus_attribute = __ATTR(rps_cpus, S_IRUGO | S_IWUSR, show_rps_map, store_rps_map); + +static struct rx_queue_attribute rps_dev_flow_table_cnt_attribute = + __ATTR(rps_flow_cnt, S_IRUGO | S_IWUSR, + show_rps_dev_flow_table_cnt, 
store_rps_dev_flow_table_cnt); + static struct attribute *rx_queue_default_attrs[] = { &rps_cpus_attribute.attr, + &rps_dev_flow_table_cnt_attribute.attr, NULL }; static void rx_queue_release(struct kobject *kobj) { struct netdev_rx_queue *queue = to_rx_queue(kobj); - struct rps_map *map = queue->rps_map; struct netdev_rx_queue *first = queue->first; - if (map) - call_rcu(&map->rcu, rps_map_release); + if (queue->rps_map) + call_rcu(&queue->rps_map->rcu, rps_map_release); + + if (queue->rps_flow_table) + call_rcu(&queue->rps_flow_table->rcu, + rps_dev_flow_table_release); if (atomic_dec_and_test(&first->count)) kfree(first); diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c index b7b6b82..e023c93 100644 --- a/net/core/sysctl_net_core.c +++ b/net/core/sysctl_net_core.c @@ -17,6 +17,65 @@ #include <net/ip.h> #include <net/sock.h> +#ifdef CONFIG_RPS +static int rps_sock_flow_sysctl(ctl_table *table, int write, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + unsigned int orig_size, size; + int ret, i; + ctl_table tmp = { + .data = &size, + .maxlen = sizeof(size), + .mode = table->mode + }; + struct rps_sock_flow_table *orig_sock_table, *sock_table; + static DEFINE_MUTEX(sock_flow_mutex); + + mutex_lock(&sock_flow_mutex); + + orig_sock_table = rps_sock_flow_table; + size = orig_size = orig_sock_table ? 
orig_sock_table->mask + 1 : 0; + + ret = proc_dointvec(&tmp, write, buffer, lenp, ppos); + + if (write) { + if (size) { + if (size > 1<<30) { + /* Enforce limit to prevent overflow */ + mutex_unlock(&sock_flow_mutex); + return -EINVAL; + } + size = roundup_pow_of_two(size); + if (size != orig_size) { + sock_table = + vmalloc(RPS_SOCK_FLOW_TABLE_SIZE(size)); + if (!sock_table) { + mutex_unlock(&sock_flow_mutex); + return -ENOMEM; + } + + sock_table->mask = size - 1; + } else + sock_table = orig_sock_table; + + for (i = 0; i < size; i++) + sock_table->ents[i] = RPS_NO_CPU; + } else + sock_table = NULL; + + if (sock_table != orig_sock_table) { + rcu_assign_pointer(rps_sock_flow_table, sock_table); + synchronize_rcu(); + vfree(orig_sock_table); + } + } + + mutex_unlock(&sock_flow_mutex); + + return ret; +} +#endif /* CONFIG_RPS */ + static struct ctl_table net_core_table[] = { #ifdef CONFIG_NET { @@ -82,6 +141,14 @@ static struct ctl_table net_core_table[] = { .mode = 0644, .proc_handler = proc_dointvec }, +#ifdef CONFIG_RPS + { + .procname = "rps_sock_flow_entries", + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = rps_sock_flow_sysctl + }, +#endif #endif /* CONFIG_NET */ { .procname = "netdev_budget", diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 193dcd6..c5376c7 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -419,6 +419,8 @@ int inet_release(struct socket *sock) if (sk) { long timeout; + inet_rps_reset_flow(sk); + /* Applications forget to leave groups before exiting */ ip_mc_drop_socket(sk); @@ -720,6 +722,8 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, { struct sock *sk = sock->sk; + inet_rps_record_flow(sk); + /* We may need to bind the socket. 
*/ if (!inet_sk(sk)->inet_num && inet_autobind(sk)) return -EAGAIN; @@ -728,12 +732,13 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, } EXPORT_SYMBOL(inet_sendmsg); - static ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset, size_t size, int flags) { struct sock *sk = sock->sk; + inet_rps_record_flow(sk); + /* We may need to bind the socket. */ if (!inet_sk(sk)->inet_num && inet_autobind(sk)) return -EAGAIN; @@ -743,6 +748,22 @@ static ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset, return sock_no_sendpage(sock, page, offset, size, flags); } +int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, + size_t size, int flags) +{ + struct sock *sk = sock->sk; + int addr_len = 0; + int err; + + inet_rps_record_flow(sk); + + err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT, + flags & ~MSG_DONTWAIT, &addr_len); + if (err >= 0) + msg->msg_namelen = addr_len; + return err; +} +EXPORT_SYMBOL(inet_recvmsg); int inet_shutdown(struct socket *sock, int how) { @@ -872,7 +893,7 @@ const struct proto_ops inet_stream_ops = { .setsockopt = sock_common_setsockopt, .getsockopt = sock_common_getsockopt, .sendmsg = tcp_sendmsg, - .recvmsg = sock_common_recvmsg, + .recvmsg = inet_recvmsg, .mmap = sock_no_mmap, .sendpage = tcp_sendpage, .splice_read = tcp_splice_read, @@ -899,7 +920,7 @@ const struct proto_ops inet_dgram_ops = { .setsockopt = sock_common_setsockopt, .getsockopt = sock_common_getsockopt, .sendmsg = inet_sendmsg, - .recvmsg = sock_common_recvmsg, + .recvmsg = inet_recvmsg, .mmap = sock_no_mmap, .sendpage = inet_sendpage, #ifdef CONFIG_COMPAT @@ -929,7 +950,7 @@ static const struct proto_ops inet_sockraw_ops = { .setsockopt = sock_common_setsockopt, .getsockopt = sock_common_getsockopt, .sendmsg = inet_sendmsg, - .recvmsg = sock_common_recvmsg, + .recvmsg = inet_recvmsg, .mmap = sock_no_mmap, .sendpage = inet_sendpage, #ifdef CONFIG_COMPAT diff --git 
a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index a24995c..ad08392 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1672,6 +1672,8 @@ process: skb->dev = NULL; + inet_rps_save_rxhash(sk, skb->rxhash); + bh_lock_sock_nested(sk); ret = 0; if (!sock_owned_by_user(sk)) { diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 8fef859..666b963 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1217,6 +1217,7 @@ int udp_disconnect(struct sock *sk, int flags) sk->sk_state = TCP_CLOSE; inet->inet_daddr = 0; inet->inet_dport = 0; + inet_rps_save_rxhash(sk, 0); sk->sk_bound_dev_if = 0; if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK)) inet_reset_saddr(sk); @@ -1258,8 +1259,12 @@ EXPORT_SYMBOL(udp_lib_unhash); static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) { - int rc = sock_queue_rcv_skb(sk, skb); + int rc; + + if (inet_sk(sk)->inet_daddr) + inet_rps_save_rxhash(sk, skb->rxhash); + rc = sock_queue_rcv_skb(sk, skb); if (rc < 0) { int is_udplite = IS_UDPLITE(sk);
Version 5 of RFS:
- Moved rps_sock_flow_sysctl into net/core/sysctl_net_core.c as a
  static function.
- Apply limits to the rps_sock_flow_entries sysctl and rps_flow_cnt
  sysfs variable.
---
This patch implements receive flow steering (RFS). RFS steers received
packets for layer 3 and 4 processing to the CPU where the application
for the corresponding flow is running. RFS is an extension of Receive
Packet Steering (RPS).

The basic idea of RFS is that when an application calls recvmsg (or
sendmsg), the application's running CPU is stored in a hash table that
is indexed by the connection's rxhash, which is stored in the socket
structure. The rxhash is passed in skbs received on the connection from
netif_receive_skb. For each received packet, the associated rxhash is
used to look up the CPU in the hash table; if a valid CPU is set, the
packet is steered to that CPU using the RPS mechanisms.

The complication with this simple approach is that it would potentially
allow out-of-order (OOO) packets. If threads are thrashing around CPUs
or multiple threads are trying to read from the same sockets, a quickly
changing CPU value in the hash table could cause rampant OOO packets;
we consider this a non-starter.

To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.

rps_sock_flow_table is a global hash table. Each entry is just a CPU
number, and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.

rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current" CPU
for a matching flow. The tail queue counter holds the value of the tail
queue counter of the associated CPU's backlog queue at the time of the
last enqueue for a flow matching the entry. Each backlog queue has a
queue head counter which is incremented on dequeue, so a queue tail
counter is computed as queue head count + queue length.
When a packet is enqueued on a backlog queue, the current value of the
queue tail counter is saved in the hash entry of the rps_dev_flow_table.

And now the trick: when selecting the CPU for RPS (get_rps_cpu), the
rps_sock_flow table and the rps_dev_flow table of the RX queue are
consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU if
one of the following is true:

- The current CPU is unset (equal to RPS_NO_CPU)
- The current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
  rps_dev_flow table. This checks whether the queue tail has advanced
  beyond the last packet that was enqueued using this table entry,
  which guarantees that all packets queued using this entry have been
  dequeued, thus preserving in-order delivery.

Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so keeping
the table local to the interrupting CPU is good for locality, and
2) it allows lockless access to the table: the CPU number and queue
tail counter need to be accessed together under mutual exclusion from
netif_receive_skb, and we assume that this is only called from device
napi_poll, which is non-reentrant.

This patch implements RFS for TCP and connected UDP sockets. It should
be usable for other flow-oriented protocols.

There are two configuration parameters for RFS: the
"rps_sock_flow_entries" sysctl sets the number of entries in the
rps_sock_flow_table, and the per-rxqueue sysfs entry "rps_flow_cnt"
contains the number of entries in the rps_dev_flow table for that
rxqueue. Both are rounded up to a power of two.

The obvious benefit of RFS (over just RPS) is that it achieves CPU
locality between the receive processing for a flow and the
application's processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application load,
and other factors. On simple benchmarks, we don't necessarily see
improvement and sometimes see degradation. However, for more complex
benchmarks and for applications where cache pressure is much higher,
this technique seems to perform very well.

Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of the netperf TCP_RR
test with 1 byte request and response. The RPC test is a
request/response test similar in structure to the netperf RR test, with
100 threads on each host, but it does more work in userspace than
netperf.

e1000e on 8 core Intel
  No RFS or RPS             104K tps at 30% CPU
  No RFS (best RPS config): 290K tps at 63% CPU
  RFS                       303K tps at 61% CPU

RPC test        tps     CPU%    50/90/99% usec latency  Latency StdDev
  No RFS/RPS    103K    48%     757/900/3185            4472.35
  RPS only:     174K    73%     415/993/2468            491.66
  RFS           223K    73%     379/651/1382            315.61

Signed-off-by: Tom Herbert <therbert@google.com>
---