
[v6] net: batch skb dequeueing from softnet input_pkt_queue

Message ID 1272783366.2173.13.camel@edumazet-laptop
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet May 2, 2010, 6:56 a.m. UTC
On Saturday, May 1, 2010 at 13:00 +0200, Andi Kleen wrote:
> On Fri, Apr 30, 2010 at 04:38:57PM -0700, David Miller wrote:
> > From: Andi Kleen <ak@gargoyle.fritz.box>
> > Date: Thu, 29 Apr 2010 23:41:44 +0200
> > 
> > >     Use io_schedule() in network stack to tell cpuidle governor to guarantee lower latencies
> > > 
> > >     XXX: probably too aggressive, some of these sleeps are not under high load.
> > > 
> > >     Based on a bug report from Eric Dumazet.
> > >     
> > >     Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > 
> > I like this, except that we probably don't want the delayacct_blkio_*() calls
> > these things do.
> 
> Yes.
> 
> It needs more work, please don't apply it yet, to handle the "long sleep" case.
> 
> Still curious if it fixes Eric's test case.
> 

I tried it in the right spot (since my bench was only doing recvmsg()
calls, I had to patch wait_for_packet() in net/core/datagram.c):

udp_recvmsg -> __skb_recv_datagram -> wait_for_packet ->
schedule_timeout

Unfortunately, using io_schedule_timeout() did not solve the problem.

Tell me if you need some traces or something.

Thanks !




Comments

Andi Kleen May 2, 2010, 9:20 a.m. UTC | #1
> I tried it in the right spot (since my bench was only doing recvmsg()
> calls, I had to patch wait_for_packet() in net/core/datagram.c):
> 
> udp_recvmsg -> __skb_recv_datagram -> wait_for_packet ->
> schedule_timeout
> 
> Unfortunately, using io_schedule_timeout() did not solve the problem.

Hmm, too bad. Weird.

> 
> Tell me if you need some traces or something.

I'll try to reproduce it and see what I can do.

-Andi

Eric Dumazet May 2, 2010, 10:54 a.m. UTC | #2
On Sunday, May 2, 2010 at 11:20 +0200, Andi Kleen wrote:
> > I tried it in the right spot (since my bench was only doing recvmsg()
> > calls, I had to patch wait_for_packet() in net/core/datagram.c):
> > 
> > udp_recvmsg -> __skb_recv_datagram -> wait_for_packet ->
> > schedule_timeout
> > 
> > Unfortunately, using io_schedule_timeout() did not solve the problem.
> 
> Hmm, too bad. Weird.
> 
> > 
> > Tell me if you need some traces or something.
> 
> I'll try to reproduce it and see what I can do.
> 

Here the perf report on the latest test done, I confirm I am using
io_schedule_timeout() in this kernel.

In this test, all 16 queues of one BCM57711E NIC (1Gb link) deliver
packets at about 1,300,000 pps to 16 cpus (one cpu per queue), and these
packets are then redistributed by RPS to the same 16 cpus, generating
about 650,000 IPIs per second.

top says :
Cpu(s):  3.0%us, 17.3%sy,  0.0%ni, 22.4%id, 28.2%wa,  0.0%hi, 29.1%si,
0.0%st


# Samples: 321362570767
#
# Overhead         Command                 Shared Object  Symbol
# ........  ..............  ............................  ......
#
    25.08%            init  [kernel.kallsyms]             [k] _raw_spin_lock_irqsave
                      |
                      --- _raw_spin_lock_irqsave
                         |          
                         |--93.47%-- clockevents_notify
                         |          lapic_timer_state_broadcast
                         |          acpi_idle_enter_bm
                         |          cpuidle_idle_call
                         |          cpu_idle
                         |          start_secondary
                         |          
                         |--4.70%-- tick_broadcast_oneshot_control
                         |          tick_notify
                         |          notifier_call_chain
                         |          __raw_notifier_call_chain
                         |          raw_notifier_call_chain
                         |          clockevents_do_notify
                         |          clockevents_notify
                         |          lapic_timer_state_broadcast
                         |          acpi_idle_enter_bm
                         |          cpuidle_idle_call
                         |          cpu_idle
                         |          start_secondary
                         |          
                         |--0.64%-- generic_exec_single
                         |          __smp_call_function_single
                         |          net_rps_action_and_irq_enable
...
     9.72%            init  [kernel.kallsyms]             [k] acpi_os_read_port
                      |
                      --- acpi_os_read_port
                         |          
                         |--99.45%-- acpi_hw_read_port
                         |          acpi_hw_read
                         |          acpi_hw_read_multiple
                         |          acpi_hw_register_read
                         |          acpi_read_bit_register
                         |          acpi_idle_enter_bm
                         |          cpuidle_idle_call
                         |          cpu_idle
                         |          start_secondary
                         |          
                          --0.55%-- acpi_hw_read
                                    acpi_hw_read_multiple

powertop says :
     PowerTOP version 1.11      (C) 2007 Intel Corporation

Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        (68.9%)         2.93 Ghz    46.5%
polling           0.0ms ( 0.0%)         2.80 Ghz     5.1%
C1 mwait          0.0ms ( 0.0%)         2.53 Ghz     3.0%
C2 mwait          0.0ms (31.1%)         2.13 Ghz     2.8%
                                        1.60 Ghz    38.2%

Wakeups-from-idle per second : 45177.8  interval: 5.0s
no ACPI power usage estimate available

Top causes for wakeups:
   9.9% (40863.0)       <interrupt> : eth1-fp-7 
   9.9% (40861.0)       <interrupt> : eth1-fp-8 
   9.9% (40858.0)       <interrupt> : eth1-fp-5 
   9.9% (40855.2)       <interrupt> : eth1-fp-10 
   9.9% (40847.6)       <interrupt> : eth1-fp-14 
   9.9% (40847.2)       <interrupt> : eth1-fp-12 
   9.9% (40835.0)       <interrupt> : eth1-fp-1 
   9.9% (40834.2)       <interrupt> : eth1-fp-3 
   9.9% (40834.0)       <interrupt> : eth1-fp-6 
   9.9% (40829.6)       <interrupt> : eth1-fp-4 
   1.0% (4002.0)     <kernel core> : hrtimer_start_range_ns (tick_sched_timer) 
   0.4% (1725.6)       <interrupt> : extra timer interrupt 
   0.0% (  4.0)     <kernel core> : usb_hcd_poll_rh_status (rh_timer_func)
   0.0% (  2.0)     <kernel core> : clocksource_watchdog (clocksource_watchdog)
   0.0% (  2.0)             snmpd : hrtimer_start_range_ns (hrtimer_wakeup)


Arjan van de Ven May 2, 2010, 2:13 p.m. UTC | #3
> 
> Cn                Avg residency       P-states (frequencies)
> C0 (cpu running)        (68.9%)         2.93 Ghz    46.5%
> polling           0.0ms ( 0.0%)         2.80 Ghz     5.1%
> C1 mwait          0.0ms ( 0.0%)         2.53 Ghz     3.0%
> C2 mwait          0.0ms (31.1%)         2.13 Ghz     2.8%
>                                         1.60 Ghz    38.2%

I bet your system advertises C2 with the same latency as C1,
but with lower power... which means Linux will pretty much never
pick C1.... no matter how much you take Andi's patch.

this is a bios thing... and until we put in the patch to override the
bios values (I can dust it off but it might need a bit of tweaking
since it was against .31) Andi's patch alone won't cut it... you also
need a non-lying bios ;)
Eric Dumazet May 2, 2010, 2:27 p.m. UTC | #4
On Sunday, May 2, 2010 at 07:13 -0700, Arjan van de Ven wrote:
> > 
> > Cn                Avg residency       P-states (frequencies)
> > C0 (cpu running)        (68.9%)         2.93 Ghz    46.5%
> > polling           0.0ms ( 0.0%)         2.80 Ghz     5.1%
> > C1 mwait          0.0ms ( 0.0%)         2.53 Ghz     3.0%
> > C2 mwait          0.0ms (31.1%)         2.13 Ghz     2.8%
> >                                         1.60 Ghz    38.2%
> 
> I bet your system advertises C2 with the same latency as C1,
> but with lower power... which means Linux will pretty much never
> pick C1.... no matter how much you take Andi's patch.
> 
> this is a bios thing... and until we put in the patch to override the
> bios values (I can dust it off but it might need a bit of tweaking
> since it was against .31) Andi's patch alone won't cut it... you also
> need a non-lying bios ;)
> 
> 
> 
# pwd
/sys/devices/system/cpu/cpu15/cpuidle
# grep . */*
state0/desc:CPUIDLE CORE POLL IDLE
state0/latency:0
state0/name:C0
state0/power:4294967295
state0/time:0
state0/usage:0
state1/desc:ACPI FFH INTEL MWAIT 0x0
state1/latency:1
state1/name:C1
state1/power:1000
state1/time:433855186
state1/usage:126869
state2/desc:ACPI FFH INTEL MWAIT 0x10
state2/latency:64
state2/name:C2
state2/power:500
state2/time:198095020416
state2/usage:76287744

C2 latency seems to be 64  (us ?), while C1 seems to be 1

BIOS Information
	Vendor: HP
	Version: I24
	Release Date: 10/01/2009

# powertop
PowerTOP 1.11   (C) 2007, 2008 Intel Corporation 

Collecting data for 5 seconds 


Your CPU supports the following C-states : C1 C2 C3 
Your BIOS reports the following C-states : C1 C2 

C3 seems to be disabled in BIOS


Eric Dumazet May 2, 2010, 3:32 p.m. UTC | #5
On Sunday, May 2, 2010 at 16:27 +0200, Eric Dumazet wrote:
> On Sunday, May 2, 2010 at 07:13 -0700, Arjan van de Ven wrote:
> > > 
> > > Cn                Avg residency       P-states (frequencies)
> > > C0 (cpu running)        (68.9%)         2.93 Ghz    46.5%
> > > polling           0.0ms ( 0.0%)         2.80 Ghz     5.1%
> > > C1 mwait          0.0ms ( 0.0%)         2.53 Ghz     3.0%
> > > C2 mwait          0.0ms (31.1%)         2.13 Ghz     2.8%
> > >                                         1.60 Ghz    38.2%
> > 
> > I bet your system advertises C2 with the same latency as C1,
> > but with lower power... which means Linux will pretty much never
> > pick C1.... no matter how much you take Andi's patch.
> > 
> > this is a bios thing... and until we put in the patch to override the
> > bios values (I can dust it off but it might need a bit of tweaking
> > since it was against .31) Andi's patch alone won't cut it... you also
> > need a non-lying bios ;)
> > 
> > 
> > 
> # pwd
> /sys/devices/system/cpu/cpu15/cpuidle
> # grep . */*
> state0/desc:CPUIDLE CORE POLL IDLE
> state0/latency:0
> state0/name:C0
> state0/power:4294967295
> state0/time:0
> state0/usage:0
> state1/desc:ACPI FFH INTEL MWAIT 0x0
> state1/latency:1
> state1/name:C1
> state1/power:1000
> state1/time:433855186
> state1/usage:126869
> state2/desc:ACPI FFH INTEL MWAIT 0x10
> state2/latency:64
> state2/name:C2
> state2/power:500
> state2/time:198095020416
> state2/usage:76287744
> 
> C2 latency seems to be 64  (us ?), while C1 seems to be 1
> 
> BIOS Information
> 	Vendor: HP
> 	Version: I24
> 	Release Date: 10/01/2009
> 
> # powertop
> PowerTOP 1.11   (C) 2007, 2008 Intel Corporation 
> 
> Collecting data for 5 seconds 
> 
> 
> Your CPU supports the following C-states : C1 C2 C3 
> Your BIOS reports the following C-states : C1 C2 
> 
> C3 seems to be disabled in BIOS
> 

I took a look at the BIOS settings and set the minimum sleep state to
C6 (instead of C3, the default). Now we see C3 being available...

No change, just more IPIs delivered during the test, and more overhead
in clockevents_notify().

# grep . */*
state0/desc:CPUIDLE CORE POLL IDLE
state0/latency:0
state0/name:C0
state0/power:4294967295
state0/time:0
state0/usage:0
state1/desc:ACPI FFH INTEL MWAIT 0x0
state1/latency:1
state1/name:C1
state1/power:1000
state1/time:39432
state1/usage:119
state2/desc:ACPI FFH INTEL MWAIT 0x10
state2/latency:64
state2/name:C2
state2/power:500
state2/time:3170745
state2/usage:11177
state3/desc:ACPI FFH INTEL MWAIT 0x20
state3/latency:96
state3/name:C3
state3/power:350
state3/time:1030987453
state3/usage:14047019

---------------------------------------------------------------------------------------------------------------------------
   PerfTop:   15984 irqs/sec  kernel:98.5% [1000Hz cycles],  (all, 16 CPUs)
---------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ _______

            23822.00 40.2% _raw_spin_lock_irqsave         vmlinux
             4413.00  7.4% acpi_os_read_port              vmlinux
             1426.00  2.4% _raw_spin_lock                 vmlinux
             1284.00  2.2% _raw_spin_unlock_irqrestore    vmlinux
             1247.00  2.1% schedule                       vmlinux
             1137.00  1.9% bnx2x_rx_int                   vmlinux
              643.00  1.1% tick_broadcast_oneshot_control vmlinux
              597.00  1.0% copy_user_generic_string       vmlinux
              595.00  1.0% __napi_complete                vmlinux
              550.00  0.9% call_function_single_interrupt vmlinux
              548.00  0.9% bnx2x_msix_fp_int              vmlinux
              486.00  0.8% __netif_receive_skb            vmlinux
              461.00  0.8% bnx2x_poll                     vmlinux
              433.00  0.7% eth_type_trans                 vmlinux
              428.00  0.7% acpi_idle_enter_bm             vmlinux
              422.00  0.7% sock_recv_ts_and_drops         vmlinux
              382.00  0.6% __udp4_lib_lookup              vmlinux
              369.00  0.6% __slab_free                    vmlinux
              357.00  0.6% ip_route_input                 vmlinux
              341.00  0.6% kfree                          vmlinux
              335.00  0.6% ipt_do_table                   vmlinux
              334.00  0.6% ip_rcv                         vmlinux
              332.00  0.6% udp_recvmsg                    vmlinux
              317.00  0.5% __kmalloc_node_track_caller    vmlinux

    37.46%            init  [kernel.kallsyms]             [k] _raw_spin_lock_irqsave
                      |
                      --- _raw_spin_lock_irqsave
                         |          
                         |--95.58%-- clockevents_notify
                         |          lapic_timer_state_broadcast
                         |          acpi_idle_enter_bm
                         |          cpuidle_idle_call
                         |          cpu_idle
                         |          start_secondary
                         |          
                         |--3.27%-- tick_broadcast_oneshot_control
                         |          tick_notify
                         |          notifier_call_chain
                         |          __raw_notifier_call_chain
                         |          raw_notifier_call_chain
                         |          clockevents_do_notify
                         |          clockevents_notify
                         |          lapic_timer_state_broadcast
                         |          acpi_idle_enter_bm


Andi Kleen May 2, 2010, 3:46 p.m. UTC | #6
> In this test, all 16 queues of one BCM57711E NIC (1Gb link) deliver
> packets at about 1,300,000 pps to 16 cpus (one cpu per queue), and these
> packets are then redistributed by RPS to the same 16 cpus, generating
> about 650,000 IPIs per second.

BTW if rps was SMT aware it could avoid a lot of the IPIs in the first place.

-Andi

Eric Dumazet May 2, 2010, 4:35 p.m. UTC | #7
On Sunday, May 2, 2010 at 17:46 +0200, Andi Kleen wrote:
> > In this test, all 16 queues of one BCM57711E NIC (1Gb link) deliver
> > packets at about 1,300,000 pps to 16 cpus (one cpu per queue), and these
> > packets are then redistributed by RPS to the same 16 cpus, generating
> > about 650,000 IPIs per second.
> 
> BTW if rps was SMT aware it could avoid a lot of the IPIs in the first place.

RPS does what you want, just stick in a good cpumask, not an unaware one :)

In my test, I specifically do something 'stupid' like this (each rx queue's
rps_cpus mask covers every cpu except the one handling that queue's interrupt):

echo fffe >/sys/class/net/bond0.2240/queues/rx-0/rps_cpus
echo fffd >/sys/class/net/bond0.2240/queues/rx-1/rps_cpus
echo fffb >/sys/class/net/bond0.2240/queues/rx-2/rps_cpus
echo fff7 >/sys/class/net/bond0.2240/queues/rx-3/rps_cpus

echo ffef >/sys/class/net/bond0.2240/queues/rx-4/rps_cpus
echo ffdf >/sys/class/net/bond0.2240/queues/rx-5/rps_cpus
echo ffbf >/sys/class/net/bond0.2240/queues/rx-6/rps_cpus
echo ff7f >/sys/class/net/bond0.2240/queues/rx-7/rps_cpus

echo feff >/sys/class/net/bond0.2240/queues/rx-8/rps_cpus
echo fdff >/sys/class/net/bond0.2240/queues/rx-9/rps_cpus
echo fbff >/sys/class/net/bond0.2240/queues/rx-10/rps_cpus
echo f7ff >/sys/class/net/bond0.2240/queues/rx-11/rps_cpus

echo efff >/sys/class/net/bond0.2240/queues/rx-12/rps_cpus
echo dfff >/sys/class/net/bond0.2240/queues/rx-13/rps_cpus
echo bfff >/sys/class/net/bond0.2240/queues/rx-14/rps_cpus
echo 7fff >/sys/class/net/bond0.2240/queues/rx-15/rps_cpus

echo 0001 >/proc/irq/*/eth1-fp-0/../smp_affinity
echo 0002 >/proc/irq/*/eth1-fp-1/../smp_affinity
echo 0004 >/proc/irq/*/eth1-fp-2/../smp_affinity
echo 0008 >/proc/irq/*/eth1-fp-3/../smp_affinity
echo 0010 >/proc/irq/*/eth1-fp-4/../smp_affinity
echo 0020 >/proc/irq/*/eth1-fp-5/../smp_affinity
echo 0040 >/proc/irq/*/eth1-fp-6/../smp_affinity
echo 0080 >/proc/irq/*/eth1-fp-7/../smp_affinity
echo 0100 >/proc/irq/*/eth1-fp-8/../smp_affinity
echo 0200 >/proc/irq/*/eth1-fp-9/../smp_affinity
echo 0400 >/proc/irq/*/eth1-fp-10/../smp_affinity
echo 0800 >/proc/irq/*/eth1-fp-11/../smp_affinity
echo 1000 >/proc/irq/*/eth1-fp-12/../smp_affinity
echo 2000 >/proc/irq/*/eth1-fp-13/../smp_affinity
echo 4000 >/proc/irq/*/eth1-fp-14/../smp_affinity
echo 8000 >/proc/irq/*/eth1-fp-15/../smp_affinity


You mean we can wakeup a thread with something else than an IPI ?



Arjan van de Ven May 2, 2010, 5:43 p.m. UTC | #8
On Sun, 02 May 2010 18:35:31 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote
> 
> 
> You mean we can wakeup a thread with something else than an IPI ?
> 

actually we can.

mwait is not only "go idle", it is "go idle until someone writes to
<THIS> cacheline", where <THIS> is set up with a "monitor" instruction.
We don't need to send an IPI per se... all we need is to write to the
right cacheline that we're monitoring.
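For illustration, a minimal sketch of the monitor/mwait pattern described
above (hypothetical names, raw inline asm; it assumes an x86 CPU where
monitor/mwait may be executed at the current privilege level -- the kernel
itself uses its own __monitor()/__mwait() helpers):

#include <stdint.h>

static volatile uint32_t wake_flag;     /* hypothetical per-cpu wakeup word */

static inline void my_monitor(const volatile void *addr)
{
        /* monitor: eax = address to watch, ecx = extensions, edx = hints */
        asm volatile("monitor" :: "a"(addr), "c"(0UL), "d"(0UL));
}

static inline void my_mwait(unsigned int hints)
{
        /* mwait: stop until the monitored cacheline is written (or an interrupt) */
        asm volatile("mwait" :: "a"(hints), "c"(0UL));
}

static void wait_for_kick(void)
{
        while (!wake_flag) {
                my_monitor(&wake_flag);         /* arm the address monitor */
                if (!wake_flag)                 /* re-check to avoid a lost wakeup */
                        my_mwait(0);
        }
        wake_flag = 0;
}

static void kick_other_cpu(void)
{
        wake_flag = 1;  /* a plain store to the monitored line is the wakeup */
}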
Eric Dumazet May 2, 2010, 5:47 p.m. UTC | #9
On Sunday, May 2, 2010 at 10:43 -0700, Arjan van de Ven wrote:
> On Sun, 02 May 2010 18:35:31 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote
> > 
> > 
> > You mean we can wakeup a thread with something else than an IPI ?
> > 
> 
> actually we can.
> 
> mwait is not only "go idle", it is "go idle until someone writes to
> <THIS> cacheline", where <THIS> is set up with a "monitor" instruction.
> We don't need to send an IPI per se... all we need is to write to the
> right cacheline that we're monitoring.
> 
> 

That's a bit x86-specific, isn't it?

But we eventually want to send a 'signal' to a cpu, even if it is not blocked
in idle, so that it can perform the following action:

/* Called from hardirq (IPI) context */
static void rps_trigger_softirq(void *data)
{
        struct softnet_data *sd = data;

        __napi_schedule(&sd->backlog);
        __get_cpu_var(netdev_rx_stat).received_rps++;
}

And it also should be portable ;)

If something other than an IPI is available, please let us know!

Thanks


Arjan van de Ven May 2, 2010, 5:54 p.m. UTC | #10
On Sun, 02 May 2010 16:27:28 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> C2 latency seems to be 64  (us ?), while C1 seems to be 1

the processor_idle module has a "latency_factor" module parameter.
The default is 2, but sometimes people think 6 is a better value...
.. any chance you can try that value ?

Also, I'm starting to wonder if Andi's patch to use io_schedule() needs
to be replaced with a net_schedule() kind of thing. The cpuidle code
currently has a weight factor for IO (based on measuring/experiments),
and maybe networking really needs another factor... so just having a
parallel concept with a different weight could be the right answer for
that.



> 
> Your CPU supports the following C-states : C1 C2 C3 
> Your BIOS reports the following C-states : C1 C2 
> 
> C3 seems to be disabled in BIOS

btw this C2 == marketing name C3, and C3 == marketing name C6

(too many translations ;-)

we'll fix powertop to report the marketing name soon.
Eric Dumazet May 2, 2010, 7:22 p.m. UTC | #11
On Sunday, May 2, 2010 at 10:54 -0700, Arjan van de Ven wrote:
> On Sun, 02 May 2010 16:27:28 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > C2 latency seems to be 64  (us ?), while C1 seems to be 1
> 
> the processor_idle module has a "latency_factor" module parameter.
> The default is 2, but sometimes people think 6 is a better value...
> .. any chance you can try that value ?
> 

I tried 6 and 20, nothing changed ;(

> Also, I'm starting to wonder if Andi's patch to use io_schedule() needs
> to be replaced with a net_schedule() kind of thing. The cpuidle code
> currently has a weight factor for IO (based on measuring/experiments),
> and maybe networking really needs another factor... so just having a
> parallel concept with a different weight could be the right answer for
> that.
> 

But a task blocked on disk IO is probably blocked for a small amount of
time, while on network, it can be for a long time. I am not sure it's the
right metric.

I was expecting something based on recent history.
Say if we have 20.000 wakeups per second, most likely we should not
enter C2/C3 states...

> 
> we'll fix powertop to report the marketing name soon.
> 
> 

Ah, I see, thanks :)


Andi Kleen May 2, 2010, 9:25 p.m. UTC | #12
> You mean we can wakeup a thread with something else than an IPI ?

It's pointless to send an IPI to your thread sibling for this. 
Everything it could do you can do yourself too with the same performance.

-Andi
Andi Kleen May 2, 2010, 9:30 p.m. UTC | #13
On Sun, May 02, 2010 at 10:54:18AM -0700, Arjan van de Ven wrote:
> On Sun, 02 May 2010 16:27:28 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > C2 latency seems to be 64  (us ?), while C1 seems to be 1
> 
> the processor_idle module has a "latency_factor" module parameter.
> The default is 2, but sometimes people think 6 is a better value...
> .. any chance you can try that value ?
> 
> Also, I'm starting to wonder if Andi's patch to use io_schedule() needs
> to be replaced with a net_schedule() kind of thing. The cpuidle code
> currently has a weight factor for IO (based on measuring/experiments),
> and maybe networking really needs another factor... so just having a
> parallel concept with a different weight could be the right answer for
> that.

We definitely need a net_schedule() for other reasons too: to avoid the blkio 
wait code and then also because networking needs a short "fast idle" timeout 
because the delays are not bounded.  

Otherwise a sender that suddenly stops sending could break all your power 
saving.

I think the reference count used in io_schedule is not the right model for
this; it probably needs a per-cpu timeout ("be fast until this time"), possibly
a dynamic one fed by the measured input rate.
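Purely as a sketch of that idea (all names hypothetical, nothing here is
existing kernel API): the network receive path bumps a per-cpu "stay fast
until" deadline, and the idle governor refuses deep C-states while that
deadline is in the future:

#include <stdint.h>

#define NR_CPUS         16
#define FAST_IDLE_NS    (200 * 1000ull)         /* example window: 200 us */

static uint64_t fast_idle_until[NR_CPUS];       /* per-cpu deadline, in ns */

static uint64_t now_ns(void);                   /* assumed clock source */

/* Called from the networking hot path (a would-be net_schedule()). The
 * window could itself be adapted from the measured packet rate. */
static void net_mark_busy(int cpu)
{
        fast_idle_until[cpu] = now_ns() + FAST_IDLE_NS;
}

/* Consulted by the idle governor before picking a deep C-state. */
static int deep_cstate_allowed(int cpu)
{
        return now_ns() >= fast_idle_until[cpu];
}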

-Andi
Eric Dumazet May 2, 2010, 9:45 p.m. UTC | #14
On Sunday, May 2, 2010 at 23:25 +0200, Andi Kleen wrote:

> It's pointless to send an IPI to your thread sibling for this. 
> Everything it could do you can do yourself too with the same performance.
> 
> -Andi

Amen

Tests just prove the reverse.

I have some colleagues who disable HyperThreading for the exact same
reasons. I wonder why Intel designed HT. Must be marketing, I guess.



Andi Kleen May 2, 2010, 9:54 p.m. UTC | #15
On Sun, May 02, 2010 at 11:45:55PM +0200, Eric Dumazet wrote:
> On Sunday, May 2, 2010 at 23:25 +0200, Andi Kleen wrote:
> 
> > It's pointless to send an IPI to your thread sibling for this. 
> > Everything it could do you can do yourself too with the same performance.
> > 
> > -Andi
> 
> Amen

That is in terms of cache locality.

> 
> Tests just prove the reverse.

What do you mean? 

> 
> I have some colleagues who disable HyperThreading for the exact same
> reasons. I wonder why Intel designed HT. Must be marketing, I guess.

HT (especially Nehalem HT) is useful for a wide range of workloads.
Just handling network interrupts for its thread sibling is not one of them.

-Andi

Andi Kleen May 2, 2010, 10:06 p.m. UTC | #16
> But a task blocked on disk IO is probably blocked for a small amount of
> time, while on network, it can be for a long time. I am not sure its the
> right metric.

I think it needs a dynamic timeout.

I agree the reference count as is will not work well for networking.

> 
> I was expecting something based on recent history.
> Say if we have 20.000 wakeups per second, most likely we should not
> enter C2/C3 states...

That's what the menu governor already does; it just doesn't work
in some cases :/

-Andi

Eric Dumazet May 2, 2010, 10:08 p.m. UTC | #17
On Sunday, May 2, 2010 at 23:54 +0200, Andi Kleen wrote:
> On Sun, May 02, 2010 at 11:45:55PM +0200, Eric Dumazet wrote:

> > Tests just prove the reverse.
> 
> What do you mean? 
> 

A test I did this week with Jamal:

We first set an "ee" rps mask, because all NIC interrupts were handled by
CPU0, and Jamal thought, like you, that not using cpu4 would give better
performance.

But using an "fe" mask gave me a bonus, from ~700,000 pps to ~800,000 pps.

CPU: E5450 @ 3.00GHz.
Two quad-core cpus in the machine, tg3 NIC.

With RPS, CPU0 does not do a lot of things: it just talks with the NIC,
brings a few cache lines per packet and dispatches each packet to a slave cpu.



> HT (especially Nehalem HT) is useful for a wide range of workloads.
> Just handling network interrupts for its thread sibling is not one of them.
> 

That's the theory; in practice I see different results.

Of course, this might be related to hash distribution being different
and more uniform.

I should redo the test with many more flows.



Arjan van de Ven May 3, 2010, 3:50 a.m. UTC | #18
> > Also, I'm starting to wonder if Andi's patch to use io_schedule()
> > needs to be replaced with a net_schedule() kind of thing. The
> > cpuidle code currently has a weight factor for IO (based on
> > measuring/experiments), and maybe networking really needs another
> > factor... so just having a parallel concept with a different weight
> > could be the right answer for that.
> > 
> 
> But a task blocked on disk IO is probably blocked for a small amount
> of time, while on network, it can be for a long time. I am not sure
> its the right metric.

it's not so much about the duration, as it is about the performance
sensitivity....

 
> I was expecting something based on recent history.
> Say if we have 20.000 wakeups per second, most likely we should not
> enter C2/C3 states...

we effectively do that. The thing is that C2 is so low cost normally
that it's still worth it even at 20k wakeups...

this is where the bios tells us how "heavy" the states are....
and 64 usec... is just not very much.
Eric Dumazet May 3, 2010, 5:17 a.m. UTC | #19
On Sunday, May 2, 2010 at 20:50 -0700, Arjan van de Ven wrote:

> we effectively do that. The thing is that C2 is so low cost normally
> that it's still worth it even at 20k wakeups...
> 
> this is where the bios tells us how "heavy" the states are....
> and 64 usec... is just not very much.

Maybe it's low cost (apparently it is, since I can reach ~900,000 IPIs
on my 16-core machine), but multiply this by 16 or 32 or 64 cpus and the
clockevents_notify() cost appears to be a killer: all cpus compete on a
single lock.

Maybe this notifier could use RCU?


Arjan van de Ven May 3, 2010, 10:22 a.m. UTC | #20
On Mon, 03 May 2010 07:17:14 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Sunday, May 2, 2010 at 20:50 -0700, Arjan van de Ven wrote:
> 
> > we effectively do that. The thing is that C2 is so low cost normally
> > that it's still worth it even at 20k wakeups...
> > 
> > this is where the bios tells us how "heavy" the states are....
> > and 64 usec... is just not very much.
> 
> Maybe its low cost, (apparently, it is, since I can reach ~900.000
> ipis on my 16 cores machine) but multiply this by 16 or 32 or 64
> cpus, and clockevents_notify() cost appears to be a killer, all cpus
> compete on a single lock.
> 
> Maybe this notifier could use RCU ?

could this be an artifact of the local apic stopping in deeper C states?
(which is finally fixed in the Westmere generation)
Andi Kleen May 3, 2010, 10:34 a.m. UTC | #21
> > Maybe its low cost, (apparently, it is, since I can reach ~900.000
> > ipis on my 16 cores machine) but multiply this by 16 or 32 or 64
> > cpus, and clockevents_notify() cost appears to be a killer, all cpus
> > compete on a single lock.
> > 
> > Maybe this notifier could use RCU ?
> 
> could this be an artifact of the local apic stopping in deeper C states?
> (which is finally fixed in the Westmere generation)

Yes it is I think.

But I suspect Eric wants a solution for Nehalem.

-Andi
Arjan van de Ven May 3, 2010, 2:09 p.m. UTC | #22
On Mon, 3 May 2010 12:34:26 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> > > Maybe its low cost, (apparently, it is, since I can reach ~900.000
> > > ipis on my 16 cores machine) but multiply this by 16 or 32 or 64
> > > cpus, and clockevents_notify() cost appears to be a killer, all
> > > cpus compete on a single lock.
> > > 
> > > Maybe this notifier could use RCU ?
> > 
> > could this be an artifact of the local apic stopping in deeper C
> > states? (which is finally fixed in the Westmere generation)
> 
> Yes it is I think.
> 
> But I suspect Eric wants a solution for Nehalem.

sure ;-)


so the hard problem is that on going idle, the local timers need to be
funneled to the external HPET. Afaik right now we use one channel of
the hpet, with the result that we have one global lock for this.

HPETs have more than one channel (2 or 3 historically, newer chipsets
iirc have a few more), so in principle we can split this lock at least
a little bit... if we can get to one hpet channel per level 3 cache
domain we'd already make huge progress in terms of cost of the
contention....
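
As a tiny illustration of that partitioning (hypothetical names; real code
would have to discover the HPET channels and the cache topology):

#include <stdint.h>

#define NR_HPET_CHANNELS 4              /* however many the chipset exposes */

struct hpet_channel {
        int             lock;           /* placeholder for a per-channel spinlock */
        uint64_t        next_event;     /* earliest deadline posted to this channel */
};

static struct hpet_channel hpet_channels[NR_HPET_CHANNELS];

static int llc_domain_of(int cpu);      /* assumed topology helper: L3 domain id */

/* Idle entry on 'cpu' then only contends with CPUs sharing its L3 domain. */
static struct hpet_channel *broadcast_channel_for(int cpu)
{
        return &hpet_channels[llc_domain_of(cpu) % NR_HPET_CHANNELS];
}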
Brian Bloniarz May 3, 2010, 2:45 p.m. UTC | #23
Arjan van de Ven wrote:
> On Mon, 3 May 2010 12:34:26 +0200
> Andi Kleen <andi@firstfloor.org> wrote:
> 
>>>> Maybe its low cost, (apparently, it is, since I can reach ~900.000
>>>> ipis on my 16 cores machine) but multiply this by 16 or 32 or 64
>>>> cpus, and clockevents_notify() cost appears to be a killer, all
>>>> cpus compete on a single lock.
>>>>
>>>> Maybe this notifier could use RCU ?
>>> could this be an artifact of the local apic stopping in deeper C
>>> states? (which is finally fixed in the Westmere generation)
>> Yes it is I think.
>>
>> But I suspect Eric wants a solution for Nehalem.
> 
> sure ;-)
> 
> 
> so the hard problem is that on going idle, the local timers need to be
> funneled to the external HPET. Afaik right now we use one channel of
> the hpet, with the result that we have one global lock for this.

Does the HPET only need to be programmed when going idle?
That could mean that this isn't a big performance issue. Who
cares if you spin for a while when you're about to sleep for
at least 60usec?

> HPETs have more than one channel (2 or 3 historically, newer chipsets
> iirc have a few more), so in principle we can split this lock at least
> a little bit... if we can get to one hpet channel per level 3 cache
> domain we'd already make huge progress in terms of cost of the
> contention....

Another possible approach: if a core needs the HPET and finds it
locked, it could queue up its request to a backlog which the
locking core will service.
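
A rough sketch of that approach, with entirely made-up names (not kernel
code, and memory ordering is glossed over): each CPU posts its wanted wakeup
time into a per-cpu slot; a CPU that gets the lock folds all posted requests
into the HPET, and a CPU that finds the lock busy just returns:

#include <stdint.h>

#define NR_CPUS 16

static volatile uint64_t posted_deadline[NR_CPUS];      /* 0 = nothing posted */
static volatile int hpet_lock;                          /* stand-in for a spinlock */

static void hpet_program(uint64_t deadline);            /* assumed hardware access;
                                                           a real version would also
                                                           compare against the
                                                           currently programmed expiry */

static int try_lock(volatile int *l) { return __sync_lock_test_and_set(l, 1) == 0; }
static void unlock(volatile int *l)  { __sync_lock_release(l); }

static void request_hpet_wakeup(int cpu, uint64_t deadline)
{
        posted_deadline[cpu] = deadline;

        /* If someone else holds the lock, they will service our slot. */
        while (try_lock(&hpet_lock)) {
                uint64_t earliest = UINT64_MAX;
                int i, pending = 0;

                for (i = 0; i < NR_CPUS; i++) {
                        uint64_t d = posted_deadline[i];
                        if (d) {
                                posted_deadline[i] = 0; /* consume the request */
                                if (d < earliest)
                                        earliest = d;
                        }
                }
                if (earliest != UINT64_MAX)
                        hpet_program(earliest);
                unlock(&hpet_lock);

                /* A slot posted while we held the lock must not be stranded:
                 * if anything is still pending, go around again. */
                for (i = 0; i < NR_CPUS; i++)
                        if (posted_deadline[i])
                                pending = 1;
                if (!pending)
                        break;
        }
}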
Andi Kleen May 3, 2010, 3:52 p.m. UTC | #24
> so the hard problem is that on going idle, the local timers need to be
> funneled to the external HPET. Afaik right now we use one channel of
> the hpet, with the result that we have one global lock for this.
> 
> HPETs have more than one channel (2 or 3 historically, newer chipsets
> iirc have a few more), so in principle we can split this lock at least
> a little bit... if we can get to one hpet channel per level 3 cache
> domain we'd already make huge progress in terms of cost of the
> contention....

I suggested the same thing a few emails up :) (great minds think
alike, etc. etc.)

I'm not sure how difficult it would be to implement though.

Potential issues:

Some user applications use the hpet channels directly through
the character device interface so there would be a potential
compatibility issue (but maybe that should be just moved
to be emulated with a hrtimer ?)

And if multiple broadcast controllers are elected this might
make it harder to become idle.

-Andi
jamal May 3, 2010, 8:15 p.m. UTC | #25
On Mon, 2010-05-03 at 00:08 +0200, Eric Dumazet wrote:

> 
> Test I did this week with Jamal.
> 
> We first set a "ee" rps mask, because all NIC interrupts were handled by
> CPU0, and Jamal thought like you, that not using cpu4 would give better
> performance.
> 
> But using "fe" mask gave me a bonus, from ~700.000 pps to ~800.000 pps
> 

I am seeing the opposite with my machine (Nehalem):
with ee I get 99.4% and fe I get 94.2%, whereas non-rps
is about 98.1%.


cheers,
jamal

PS: sorry, I don't have time to collect a lot more data - tomorrow I could
do more.


Arjan van de Ven May 4, 2010, 1:10 a.m. UTC | #26
On Mon, 03 May 2010 10:45:07 -0400
Brian Bloniarz <bmb@athenacr

> > so the hard problem is that on going idle, the local timers need to
> > be funneled to the external HPET. Afaik right now we use one
> > channel of the hpet, with the result that we have one global lock
> > for this.
> 
> Does the HPET only need to be programmed when going idle?

correct; when going idle the per logical CPU timer value needs
to be put in the global HPET (assuming 1 channel is in use).
This "global" is where the lock comes in.

> That could mean that this isn't a big performance issue. Who
> cares if you spin for a while when you're about to sleep for
> at least 60usec?

depends on how long the sleep is ;-)
Arjan van de Ven May 4, 2010, 1:11 a.m. UTC | #27
On Mon, 3 May 2010 17:52:04 +0200
Andi Kleen <andi@firstfloor.org> wrote:
> > HPETs have more than one channel (2 or 3 historically, newer
> > chipsets iirc have a few more), so in principle we can split this
> > lock at least a little bit... if we can get to one hpet channel per
> > level 3 cache domain we'd already make huge progress in terms of
> > cost of the contention....
> 
> I suggested the same thing a few emails up @) (great minds think 
> alike etc.etc. @) . 
> 
> I'm not sure how difficult it would be to implement though.

the hardest part will be cases where the SMM code borrows higher HPET
channels or something.. not sure if they do, but.. color me a bit afraid
we'll find cases.


> 
> Potential issues:
> 
> Some user applications use the hpet channels directly through
> the character device interface so there would be a potential
> compatibility issue (but maybe that should be just moved
> to be emulated with a hrtimer ?)

we can and should just emulate this. Same for the rtc device I suspect.

 
> And if multiple broadcast controllers are elected this might
> make it harder to become idle.

not quite, as long as you do a directed broadcast. As long as there's a
predictable mapping for which cores group to which hpet channel.. won't
be that bad since you only need to wake up your own local subset.

Patch

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 95b851f..051fd5b 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -113,7 +113,7 @@  static int wait_for_packet(struct sock *sk, int *err, long *timeo_p)
 		goto interrupted;
 
 	error = 0;
-	*timeo_p = schedule_timeout(*timeo_p);
+	*timeo_p = io_schedule_timeout(*timeo_p);
 out:
 	finish_wait(sk_sleep(sk), &wait);
 	return error;