Message ID:   1272783366.2173.13.camel@edumazet-laptop
State:        RFC, archived
Delegated to: David Miller
> I tried it on the right spot (since my bench was only doing recvmsg()
> calls, I had to patch wait_for_packet() in net/core/datagram.c
>
> udp_recvmsg -> __skb_recv_datagram -> wait_for_packet ->
>     schedule_timeout
>
> Unfortunately, using io_schedule_timeout() did not solve the problem.

Hmm, too bad. Weird.

> Tell me if you need some traces or something.

I'll try to reproduce it and see what I can do.

-Andi
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Le dimanche 02 mai 2010 à 11:20 +0200, Andi Kleen a écrit :
> > I tried it on the right spot (since my bench was only doing recvmsg()
> > calls, I had to patch wait_for_packet() in net/core/datagram.c
> >
> > udp_recvmsg -> __skb_recv_datagram -> wait_for_packet ->
> >     schedule_timeout
> >
> > Unfortunately, using io_schedule_timeout() did not solve the problem.
>
> Hmm, too bad. Weird.
>
> > Tell me if you need some traces or something.
>
> I'll try to reproduce it and see what I can do.

Here is the perf report on the latest test; I confirm I am using
io_schedule_timeout() in this kernel.

In this test, all 16 queues of one BCM57711E NIC (1Gb link) deliver
packets at about 1.300.000 pps to 16 cpus (one cpu per queue), and these
packets are then redistributed by RPS to the same 16 cpus, generating
about 650.000 IPI per second.

top says :

Cpu(s):  3.0%us, 17.3%sy,  0.0%ni, 22.4%id, 28.2%wa,  0.0%hi, 29.1%si,  0.0%st

# Samples: 321362570767
#
# Overhead  Command  Shared Object       Symbol
# ........  .......  .............       ......
#
    25.08%  init     [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
            |
            --- _raw_spin_lock_irqsave
               |
               |--93.47%-- clockevents_notify
               |          lapic_timer_state_broadcast
               |          acpi_idle_enter_bm
               |          cpuidle_idle_call
               |          cpu_idle
               |          start_secondary
               |
               |--4.70%-- tick_broadcast_oneshot_control
               |          tick_notify
               |          notifier_call_chain
               |          __raw_notifier_call_chain
               |          raw_notifier_call_chain
               |          clockevents_do_notify
               |          clockevents_notify
               |          lapic_timer_state_broadcast
               |          acpi_idle_enter_bm
               |          cpuidle_idle_call
               |          cpu_idle
               |          start_secondary
               |
               |--0.64%-- generic_exec_single
               |          __smp_call_function_single
               |          net_rps_action_and_irq_enable
               ...
     9.72%  init     [kernel.kallsyms]   [k] acpi_os_read_port
            |
            --- acpi_os_read_port
               |
               |--99.45%-- acpi_hw_read_port
               |          acpi_hw_read
               |          acpi_hw_read_multiple
               |          acpi_hw_register_read
               |          acpi_read_bit_register
               |          acpi_idle_enter_bm
               |          cpuidle_idle_call
               |          cpu_idle
               |          start_secondary
               |
                --0.55%-- acpi_hw_read
                          acpi_hw_read_multiple

powertop says :

PowerTOP version 1.11      (C) 2007 Intel Corporation

Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        (68.9%)         2.93 Ghz    46.5%
polling           0.0ms ( 0.0%)         2.80 Ghz     5.1%
C1 mwait          0.0ms ( 0.0%)         2.53 Ghz     3.0%
C2 mwait          0.0ms (31.1%)         2.13 Ghz     2.8%
                                        1.60 Ghz    38.2%

Wakeups-from-idle per second : 45177.8     interval: 5.0s
no ACPI power usage estimate available

Top causes for wakeups:
  9.9% (40863.0)   <interrupt> : eth1-fp-7
  9.9% (40861.0)   <interrupt> : eth1-fp-8
  9.9% (40858.0)   <interrupt> : eth1-fp-5
  9.9% (40855.2)   <interrupt> : eth1-fp-10
  9.9% (40847.6)   <interrupt> : eth1-fp-14
  9.9% (40847.2)   <interrupt> : eth1-fp-12
  9.9% (40835.0)   <interrupt> : eth1-fp-1
  9.9% (40834.2)   <interrupt> : eth1-fp-3
  9.9% (40834.0)   <interrupt> : eth1-fp-6
  9.9% (40829.6)   <interrupt> : eth1-fp-4
  1.0% (4002.0)    <kernel core> : hrtimer_start_range_ns (tick_sched_timer)
  0.4% (1725.6)    <interrupt> : extra timer interrupt
  0.0% (  4.0)     <kernel core> : usb_hcd_poll_rh_status (rh_timer_func)
  0.0% (  2.0)     <kernel core> : clocksource_watchdog (clocksource_watchdog)
  0.0% (  2.0)     snmpd : hrtimer_start_range_ns (hrtimer_wakeup)
> Cn                Avg residency       P-states (frequencies)
> C0 (cpu running)        (68.9%)         2.93 Ghz    46.5%
> polling           0.0ms ( 0.0%)         2.80 Ghz     5.1%
> C1 mwait          0.0ms ( 0.0%)         2.53 Ghz     3.0%
> C2 mwait          0.0ms (31.1%)         2.13 Ghz     2.8%
>                                         1.60 Ghz    38.2%

I bet your system advertises C2 with the same latency as C1, but with
lower power... which means Linux will pretty much never pick C1... no
matter how much you take Andi's patch.

this is a bios thing... and until we put in the patch to override the
bios values (I can dust it off but it might need a bit of tweaking
since it was against .31) Andi's patch alone won't cut it... you also
need a non-lying bios ;)
Le dimanche 02 mai 2010 à 07:13 -0700, Arjan van de Ven a écrit :
> I bet your system advertises C2 with the same latency as C1, but with
> lower power... which means Linux will pretty much never pick C1... no
> matter how much you take Andi's patch.
>
> this is a bios thing... and until we put in the patch to override the
> bios values (I can dust it off but it might need a bit of tweaking
> since it was against .31) Andi's patch alone won't cut it... you also
> need a non-lying bios ;)

# pwd
/sys/devices/system/cpu/cpu15/cpuidle
# grep . */*
state0/desc:CPUIDLE CORE POLL IDLE
state0/latency:0
state0/name:C0
state0/power:4294967295
state0/time:0
state0/usage:0
state1/desc:ACPI FFH INTEL MWAIT 0x0
state1/latency:1
state1/name:C1
state1/power:1000
state1/time:433855186
state1/usage:126869
state2/desc:ACPI FFH INTEL MWAIT 0x10
state2/latency:64
state2/name:C2
state2/power:500
state2/time:198095020416
state2/usage:76287744

C2 latency seems to be 64 (us ?), while C1 seems to be 1

BIOS Information
	Vendor: HP
	Version: I24
	Release Date: 10/01/2009

# powertop
PowerTOP 1.11   (C) 2007, 2008 Intel Corporation

Collecting data for 5 seconds

Your CPU supports the following C-states : C1 C2 C3
Your BIOS reports the following C-states : C1 C2

C3 seems to be disabled in BIOS
Le dimanche 02 mai 2010 à 16:27 +0200, Eric Dumazet a écrit :
> Le dimanche 02 mai 2010 à 07:13 -0700, Arjan van de Ven a écrit :
> > I bet your system advertises C2 with the same latency as C1, but
> > with lower power... which means Linux will pretty much never pick
> > C1... no matter how much you take Andi's patch.
>
> C2 latency seems to be 64 (us ?), while C1 seems to be 1
>
> Your CPU supports the following C-states : C1 C2 C3
> Your BIOS reports the following C-states : C1 C2
>
> C3 seems to be disabled in BIOS

I took a look at BIOS settings and enabled the minimum sleep state to
be C6 (instead of C3, the default). Now we see C3 being available...

No changes, only more IPI delivered during the test, and more overhead
in clockevents_notify()

# grep . */*
state0/desc:CPUIDLE CORE POLL IDLE
state0/latency:0
state0/name:C0
state0/power:4294967295
state0/time:0
state0/usage:0
state1/desc:ACPI FFH INTEL MWAIT 0x0
state1/latency:1
state1/name:C1
state1/power:1000
state1/time:39432
state1/usage:119
state2/desc:ACPI FFH INTEL MWAIT 0x10
state2/latency:64
state2/name:C2
state2/power:500
state2/time:3170745
state2/usage:11177
state3/desc:ACPI FFH INTEL MWAIT 0x20
state3/latency:96
state3/name:C3
state3/power:350
state3/time:1030987453
state3/usage:14047019

---------------------------------------------------------------------------
   PerfTop: 15984 irqs/sec  kernel:98.5% [1000Hz cycles],  (all, 16 CPUs)
---------------------------------------------------------------------------

             samples  pcnt function                        DSO
             _______ _____ ______________________________ _______

            23822.00 40.2% _raw_spin_lock_irqsave          vmlinux
             4413.00  7.4% acpi_os_read_port               vmlinux
             1426.00  2.4% _raw_spin_lock                  vmlinux
             1284.00  2.2% _raw_spin_unlock_irqrestore     vmlinux
             1247.00  2.1% schedule                        vmlinux
             1137.00  1.9% bnx2x_rx_int                    vmlinux
              643.00  1.1% tick_broadcast_oneshot_control  vmlinux
              597.00  1.0% copy_user_generic_string        vmlinux
              595.00  1.0% __napi_complete                 vmlinux
              550.00  0.9% call_function_single_interrupt  vmlinux
              548.00  0.9% bnx2x_msix_fp_int               vmlinux
              486.00  0.8% __netif_receive_skb             vmlinux
              461.00  0.8% bnx2x_poll                      vmlinux
              433.00  0.7% eth_type_trans                  vmlinux
              428.00  0.7% acpi_idle_enter_bm              vmlinux
              422.00  0.7% sock_recv_ts_and_drops          vmlinux
              382.00  0.6% __udp4_lib_lookup               vmlinux
              369.00  0.6% __slab_free                     vmlinux
              357.00  0.6% ip_route_input                  vmlinux
              341.00  0.6% kfree                           vmlinux
              335.00  0.6% ipt_do_table                    vmlinux
              334.00  0.6% ip_rcv                          vmlinux
              332.00  0.6% udp_recvmsg                     vmlinux
              317.00  0.5% __kmalloc_node_track_caller     vmlinux

    37.46%  init  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
            |
            --- _raw_spin_lock_irqsave
               |
               |--95.58%-- clockevents_notify
               |          lapic_timer_state_broadcast
               |          acpi_idle_enter_bm
               |          cpuidle_idle_call
               |          cpu_idle
               |          start_secondary
               |
               |--3.27%-- tick_broadcast_oneshot_control
               |          tick_notify
               |          notifier_call_chain
               |          __raw_notifier_call_chain
               |          raw_notifier_call_chain
               |          clockevents_do_notify
               |          clockevents_notify
               |          lapic_timer_state_broadcast
               |          acpi_idle_enter_bm
> In this test, all 16 queues of one BCM57711E NIC (1Gb link) deliver
> packets at about 1.300.000 pps to 16 cpus (one cpu per queue), and
> these packets are then redistributed by RPS to the same 16 cpus,
> generating about 650.000 IPI per second.

BTW if rps was SMT aware it could avoid a lot of the IPIs in the first
place.

-Andi
Le dimanche 02 mai 2010 à 17:46 +0200, Andi Kleen a écrit :
> BTW if rps was SMT aware it could avoid a lot of the IPIs in the
> first place.

RPS does what you want, just give it a good cpumask, not an unaware
one :)

In my test, I specifically do something 'stupid' like :

echo fffe >/sys/class/net/bond0.2240/queues/rx-0/rps_cpus
echo fffd >/sys/class/net/bond0.2240/queues/rx-1/rps_cpus
echo fffb >/sys/class/net/bond0.2240/queues/rx-2/rps_cpus
echo fff7 >/sys/class/net/bond0.2240/queues/rx-3/rps_cpus
echo ffef >/sys/class/net/bond0.2240/queues/rx-4/rps_cpus
echo ffdf >/sys/class/net/bond0.2240/queues/rx-5/rps_cpus
echo ffbf >/sys/class/net/bond0.2240/queues/rx-6/rps_cpus
echo ff7f >/sys/class/net/bond0.2240/queues/rx-7/rps_cpus
echo feff >/sys/class/net/bond0.2240/queues/rx-8/rps_cpus
echo fdff >/sys/class/net/bond0.2240/queues/rx-9/rps_cpus
echo fbff >/sys/class/net/bond0.2240/queues/rx-10/rps_cpus
echo f7ff >/sys/class/net/bond0.2240/queues/rx-11/rps_cpus
echo efff >/sys/class/net/bond0.2240/queues/rx-12/rps_cpus
echo dfff >/sys/class/net/bond0.2240/queues/rx-13/rps_cpus
echo bfff >/sys/class/net/bond0.2240/queues/rx-14/rps_cpus
echo 7fff >/sys/class/net/bond0.2240/queues/rx-15/rps_cpus

echo 0001 >/proc/irq/*/eth1-fp-0/../smp_affinity
echo 0002 >/proc/irq/*/eth1-fp-1/../smp_affinity
echo 0004 >/proc/irq/*/eth1-fp-2/../smp_affinity
echo 0008 >/proc/irq/*/eth1-fp-3/../smp_affinity
echo 0010 >/proc/irq/*/eth1-fp-4/../smp_affinity
echo 0020 >/proc/irq/*/eth1-fp-5/../smp_affinity
echo 0040 >/proc/irq/*/eth1-fp-6/../smp_affinity
echo 0080 >/proc/irq/*/eth1-fp-7/../smp_affinity
echo 0100 >/proc/irq/*/eth1-fp-8/../smp_affinity
echo 0200 >/proc/irq/*/eth1-fp-9/../smp_affinity
echo 0400 >/proc/irq/*/eth1-fp-10/../smp_affinity
echo 0800 >/proc/irq/*/eth1-fp-11/../smp_affinity
echo 1000 >/proc/irq/*/eth1-fp-12/../smp_affinity
echo 2000 >/proc/irq/*/eth1-fp-13/../smp_affinity
echo 4000 >/proc/irq/*/eth1-fp-14/../smp_affinity
echo 8000 >/proc/irq/*/eth1-fp-15/../smp_affinity

You mean we can wakeup a thread with something else than an IPI ?
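[Editor's note: the per-queue masks above follow a simple pattern (all 16 CPU bits set except the bit of the CPU owning the queue), so they can be generated instead of typed by hand. A sketch, assuming 16 CPUs; it only prints the masks rather than writing to sysfs:

```shell
# Print, for each of 16 RX queues, an rps_cpus mask that excludes the
# queue's own CPU (bit i cleared for queue i).  The corresponding sysfs
# path would be /sys/class/net/$DEV/queues/rx-$i/rps_cpus.
for i in $(seq 0 15); do
    printf '%04x\n' $(( 0xffff ^ (1 << i) ))
done
```

The first line printed is fffe (queue 0 excludes cpu0) and the last is 7fff (queue 15 excludes cpu15), matching the commands above.]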
On Sun, 02 May 2010 18:35:31 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> You mean we can wakeup a thread with something else than an IPI ?

actually we can.

mwait is not only "go idle", it is "go idle until someone writes to
<THIS> cacheline". where <THIS> is set up with a "monitor" instruction.
We don't need to send an ipi per se.. all we need is to write to the
right cacheline that we're monitoring.
Le dimanche 02 mai 2010 à 10:43 -0700, Arjan van de Ven a écrit :
> actually we can.
>
> mwait is not only "go idle", it is "go idle until someone writes to
> <THIS> cacheline". where <THIS> is set up with a "monitor" instruction.
> We don't need to send an ipi per se.. all we need is to write to the
> right cacheline that we're monitoring.

That's a bit x86 specific, isn't it ?

But we want to eventually send a 'signal' to a cpu, even if it is not
blocked in idle, so that it can do the following action :

/* Called from hardirq (IPI) context */
static void rps_trigger_softirq(void *data)
{
	struct softnet_data *sd = data;

	__napi_schedule(&sd->backlog);
	__get_cpu_var(netdev_rx_stat).received_rps++;
}

And it also should be portable ;)

If something else than an IPI is available, please let us know !

Thanks
On Sun, 02 May 2010 16:27:28 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> C2 latency seems to be 64 (us ?), while C1 seems to be 1

the processor_idle module has a "latency_factor" module parameter.
The default is 2, but sometimes people think 6 is a better value...
.. any chance you can try that value ?

Also, I'm starting to wonder if Andi's patch to use io_schedule() needs
to be replaced with a net_schedule() kind of thing. The cpuidle code
currently has a weight factor for IO (based on measuring/experiments),
and maybe networking really needs another factor... so just having a
parallel concept with a different weight could be the right answer for
that.

> Your CPU supports the following C-states : C1 C2 C3
> Your BIOS reports the following C-states : C1 C2
>
> C3 seems to be disabled in BIOS

btw this C2 == marketing name C3, and C3 == marketing name C6 (too many
translations ;-) we'll fix powertop to report the marketing name soon.
Le dimanche 02 mai 2010 à 10:54 -0700, Arjan van de Ven a écrit :
> the processor_idle module has a "latency_factor" module parameter.
> The default is 2, but sometimes people think 6 is a better value...
> .. any chance you can try that value ?

I tried 6 and 20, nothing changed ;(

> Also, I'm starting to wonder if Andi's patch to use io_schedule() needs
> to be replaced with a net_schedule() kind of thing. The cpuidle code
> currently has a weight factor for IO (based on measuring/experiments),
> and maybe networking really needs another factor... so just having a
> parallel concept with a different weight could be the right answer for
> that.

But a task blocked on disk IO is probably blocked for a small amount of
time, while on network, it can be blocked for a long time. I am not
sure it is the right metric.

I was expecting something based on recent history. Say if we have
20.000 wakeups per second, most likely we should not enter C2/C3
states...

> we'll fix powertop to report the marketing name soon.

Ah, I see, thanks :)
> You mean we can wakeup a thread with something else than an IPI ?
It's pointless to send an IPI to your thread sibling for this.
Everything it could do you can do yourself too with the same performance.
-Andi
On Sun, May 02, 2010 at 10:54:18AM -0700, Arjan van de Ven wrote:
> Also, I'm starting to wonder if Andi's patch to use io_schedule() needs
> to be replaced with a net_schedule() kind of thing. The cpuidle code
> currently has a weight factor for IO (based on measuring/experiments),
> and maybe networking really needs another factor... so just having a
> parallel concept with a different weight could be the right answer for
> that.

We definitely need a net_schedule() for other reasons too: to avoid the
blkio wait code, and also because networking needs a short "fast idle"
timeout, because the delays are not bounded. Otherwise a sender that
suddenly stops sending could break all your power saving.

I think the reference count used in io_schedule is not the right model
for this; it probably needs a per cpu timeout ("be fast until this
time"), possibly a dynamic one fed by the measured input rate.

-Andi
Le dimanche 02 mai 2010 à 23:25 +0200, Andi Kleen a écrit :
> It's pointless to send an IPI to your thread sibling for this.
> Everything it could do you can do yourself too with the same
> performance.

Amen

Tests just prove the reverse.

I have some colleagues that disable HyperThreading for the exact same
reasons. I wonder why Intel designed HT. Should be marketing I guess.
On Sun, May 02, 2010 at 11:45:55PM +0200, Eric Dumazet wrote:
> > It's pointless to send an IPI to your thread sibling for this.
> > Everything it could do you can do yourself too with the same
> > performance.
>
> Amen

That is in terms of cache locality.

> Tests just prove the reverse.

What do you mean?

> I have some colleagues that disable HyperThreading for the exact same
> reasons. I wonder why Intel designed HT. Should be marketing I guess.

HT (especially Nehalem HT) is useful for a wide range of workloads.
Just handling network interrupts for its thread sibling is not one of
them.

-Andi
> But a task blocked on disk IO is probably blocked for a small amount
> of time, while on network, it can be blocked for a long time. I am
> not sure it is the right metric.

I think it needs a dynamic timeout. I agree the reference count as is
will not work well for networking.

> I was expecting something based on recent history.
> Say if we have 20.000 wakeups per second, most likely we should not
> enter C2/C3 states...

That's what the menu governor already does, it just doesn't work in
some cases :/

-Andi
Le dimanche 02 mai 2010 à 23:54 +0200, Andi Kleen a écrit :
> On Sun, May 02, 2010 at 11:45:55PM +0200, Eric Dumazet wrote:
> > Tests just prove the reverse.
>
> What do you mean?

Test I did this week with Jamal.

We first set a "ee" rps mask, because all NIC interrupts were handled
by CPU0, and Jamal thought like you, that not using cpu4 would give
better performance.

But using "fe" mask gave me a bonus, from ~700.000 pps to ~800.000 pps

CPU : E5450 @3.00GHz
Two quad-core cpus in the machine, tg3 NIC.

With RPS, CPU0 does not do a lot of things, it just talks with the NIC,
brings a few cache lines per packet and dispatches them to a slave cpu.

> HT (especially Nehalem HT) is useful for a wide range of workloads.
> Just handling network interrupts for its thread sibling is not one of
> them.

That's the theory; in practice I see different results. Of course, this
might be related to the hash distribution being different and more
uniform. I should redo the test with many more flows.
> > Also, I'm starting to wonder if Andi's patch to use io_schedule()
> > needs to be replaced with a net_schedule() kind of thing. The
> > cpuidle code currently has a weight factor for IO (based on
> > measuring/experiments), and maybe networking really needs another
> > factor... so just having a parallel concept with a different weight
> > could be the right answer for that.
>
> But a task blocked on disk IO is probably blocked for a small amount
> of time, while on network, it can be for a long time. I am not sure
> its the right metric.

it's not so much about the duration, as it is about the performance
sensitivity....

> I was expecting something based on recent history.
> Say if we have 20.000 wakeups per second, most likely we should not
> enter C2/C3 states...

we effectively do that. The thing is that C2 is so low cost normally
that it's still worth it even at 20k wakeups...

this is where the bios tells us how "heavy" the states are....
and 64 usec... is just not very much.
Le dimanche 02 mai 2010 à 20:50 -0700, Arjan van de Ven a écrit :
> we effectively do that. The thing is that C2 is so low cost normally
> that it's still worth it even at 20k wakeups...
>
> this is where the bios tells us how "heavy" the states are....
> and 64 usec... is just not very much.

Maybe it is low cost (apparently it is, since I can reach ~900.000 ipis
on my 16 cores machine), but multiply this by 16 or 32 or 64 cpus, and
the clockevents_notify() cost appears to be a killer: all cpus compete
on a single lock.

Maybe this notifier could use RCU ?
On Mon, 03 May 2010 07:17:14 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Maybe its low cost, (apparently, it is, since I can reach ~900.000
> ipis on my 16 cores machine) but multiply this by 16 or 32 or 64
> cpus, and clockevents_notify() cost appears to be a killer, all cpus
> compete on a single lock.
>
> Maybe this notifier could use RCU ?

could this be an artifact of the local apic stopping in deeper C states?
(which is finally fixed in the Westmere generation)
> > Maybe its low cost, (apparently, it is, since I can reach ~900.000
> > ipis on my 16 cores machine) but multiply this by 16 or 32 or 64
> > cpus, and clockevents_notify() cost appears to be a killer, all
> > cpus compete on a single lock.
> >
> > Maybe this notifier could use RCU ?
>
> could this be an artifact of the local apic stopping in deeper C
> states? (which is finally fixed in the Westmere generation)

Yes it is, I think.

But I suspect Eric wants a solution for Nehalem.

-Andi
On Mon, 3 May 2010 12:34:26 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> > could this be an artifact of the local apic stopping in deeper C
> > states? (which is finally fixed in the Westmere generation)
>
> Yes it is, I think.
>
> But I suspect Eric wants a solution for Nehalem.

sure ;-)

so the hard problem is that on going idle, the local timers need to be
funneled to the external HPET. Afaik right now we use one channel of
the hpet, with the result that we have one global lock for this.

HPETs have more than one channel (2 or 3 historically, newer chipsets
iirc have a few more), so in principle we can split this lock at least
a little bit... if we can get to one hpet channel per level 3 cache
domain we'd already make huge progress in terms of cost of the
contention....
Arjan van de Ven wrote:
> so the hard problem is that on going idle, the local timers need to
> be funneled to the external HPET. Afaik right now we use one channel
> of the hpet, with the result that we have one global lock for this.

Does the HPET only need to be programmed when going idle? That could
mean that this isn't a big performance issue. Who cares if you spin for
a while when you're about to sleep for at least 60usec?

> HPETs have more than one channel (2 or 3 historically, newer chipsets
> iirc have a few more), so in principle we can split this lock at
> least a little bit... if we can get to one hpet channel per level 3
> cache domain we'd already make huge progress in terms of cost of the
> contention....

Another possible approach: if a core needs the HPET and finds it
locked, it could queue up its request to a backlog which the locking
core will service.
> so the hard problem is that on going idle, the local timers need to be
> funneled to the external HPET. Afaik right now we use one channel of
> the hpet, with the result that we have one global lock for this.
>
> HPETs have more than one channel (2 or 3 historically, newer chipsets
> iirc have a few more), so in principle we can split this lock at least
> a little bit... if we can get to one hpet channel per level 3 cache
> domain we'd already make huge progress in terms of cost of the
> contention....

I suggested the same thing a few emails up :) (great minds think alike
etc. etc. :)

I'm not sure how difficult it would be to implement though. Potential
issues:

Some user applications use the hpet channels directly through the
character device interface, so there would be a potential compatibility
issue (but maybe that should just be moved to be emulated with a
hrtimer ?)

And if multiple broadcast controllers are elected, this might make it
harder to become idle.

-Andi
On Mon, 2010-05-03 at 00:08 +0200, Eric Dumazet wrote:
> Test I did this week with Jamal.
>
> We first set a "ee" rps mask, because all NIC interrupts were handled
> by CPU0, and Jamal thought like you, that not using cpu4 would give
> better performance.
>
> But using "fe" mask gave me a bonus, from ~700.000 pps to ~800.000 pps

I am seeing the opposite with my machine (Nehalem): with ee i get 99.4%
and fe i get 94.2%, whereas non-rps is about 98.1%.

cheers,
jamal

PS: sorry, I don't have time to collect a lot more data - tomorrow i
could do more.
On Mon, 03 May 2010 10:45:07 -0400 Brian Bloniarz <bmb@athenacr wrote:

> > so the hard problem is that on going idle, the local timers need to
> > be funneled to the external HPET. Afaik right now we use one
> > channel of the hpet, with the result that we have one global lock
> > for this.
>
> Does the HPET only need to be programmed when going idle?

correct; when going idle the per logical CPU timer value needs to be
put in the global HPET (assuming 1 channel is in use). This "global"
is where the lock comes in.

> That could mean that this isn't a big performance issue. Who cares if
> you spin for a while when you're about to sleep for at least 60usec?

depends on how long the sleep is ;-)
On Mon, 3 May 2010 17:52:04 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> > HPETs have more than one channel (2 or 3 historically, newer
> > chipsets iirc have a few more), so in principle we can split this
> > lock at least a little bit... if we can get to one hpet channel per
> > level 3 cache domain we'd already make huge progress in terms of
> > cost of the contention....
>
> I suggested the same thing a few emails up :) (great minds think
> alike etc. etc. :)
>
> I'm not sure how difficult it would be to implement though.

the hardest part will be cases where the SMM code borrows higher HPET
channels or something.. not sure if they do, but.. color me a bit
afraid we'll find cases.

> Potential issues:
>
> Some user applications use the hpet channels directly through
> the character device interface, so there would be a potential
> compatibility issue (but maybe that should just be moved to be
> emulated with a hrtimer ?)

we can and should just emulate this. Same for the rtc device I suspect.

> And if multiple broadcast controllers are elected, this might
> make it harder to become idle.

not quite, as long as you do a directed broadcast. As long as there's
a predictable mapping for which cores group to which hpet channel..
won't be that bad since you only need to wake up your own local subset.
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 95b851f..051fd5b 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -113,7 +113,7 @@ static int wait_for_packet(struct sock *sk, int *err, long *timeo_p)
 		goto interrupted;
 
 	error = 0;
-	*timeo_p = schedule_timeout(*timeo_p);
+	*timeo_p = io_schedule_timeout(*timeo_p);
 out:
 	finish_wait(sk_sleep(sk), &wait);
 	return error;