
[RFC] tcp: make icsk_retransmit_timer pinned

Message ID 20191101221605.32210-1-xiyou.wangcong@gmail.com
State RFC
Delegated to: David Miller
Series: [RFC] tcp: make icsk_retransmit_timer pinned

Commit Message

Cong Wang Nov. 1, 2019, 10:16 p.m. UTC
While investigating the spinlock contention on resetting TCP
retransmit timer:

  61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
   ...
    - 58.83% tcp_v4_rcv
      - 58.80% tcp_v4_do_rcv
         - 58.80% tcp_rcv_established
            - 52.88% __tcp_push_pending_frames
               - 52.88% tcp_write_xmit
                  - 28.16% tcp_event_new_data_sent
                     - 28.15% sk_reset_timer
                        + mod_timer
                  - 24.68% tcp_schedule_loss_probe
                     - 24.68% sk_reset_timer
                        + 24.68% mod_timer

it turns out to be a serious timer migration issue. After collecting timer_start
trace events for tcp_write_timer, it shows that more than 77% of the time this
timer got migrated to a different CPU:

	$ perl -ne 'if (/\[(\d+)\].* cpu=(\d+)/){print if $1 != $2 ;}' tcp_timer_trace.txt | wc -l
	1303826
	$ wc -l tcp_timer_trace.txt
	1681068 tcp_timer_trace.txt
	$ python
	Python 2.7.5 (default, Jul 11 2019, 17:13:53)
	[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
	Type "help", "copyright", "credits" or "license" for more information.
	>>> 1303826 / 1681068.0
	0.7755938486723916

And all of those migrations happened while an idle CPU was serving a network RX
softirq. So the CPU-idleness test in idle_cpu() gives a false positive here.
I don't know whether we should relax it for this scenario particularly, something
like:

-	if (!idle_cpu(cpu) && housekeeping_cpu(cpu, HK_FLAG_TIMER))
+	if ((!idle_cpu(cpu) || in_serving_softirq()) &&
+	    housekeeping_cpu(cpu, HK_FLAG_TIMER))
 		return cpu;

(There could be better way than in_serving_softirq() to measure the idleness,
of course.)

Or simply just make the TCP retransmit timer pinned. At least this approach
has minimal impact.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/ipv4/inet_connection_sock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Eric Dumazet Nov. 1, 2019, 10:30 p.m. UTC | #1
On Fri, Nov 1, 2019 at 3:16 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> While investigating the spinlock contention on resetting TCP
> retransmit timer:
>
>   61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
>    ...
>     - 58.83% tcp_v4_rcv
>       - 58.80% tcp_v4_do_rcv
>          - 58.80% tcp_rcv_established
>             - 52.88% __tcp_push_pending_frames
>                - 52.88% tcp_write_xmit
>                   - 28.16% tcp_event_new_data_sent
>                      - 28.15% sk_reset_timer
>                         + mod_timer
>                   - 24.68% tcp_schedule_loss_probe
>                      - 24.68% sk_reset_timer
>                         + 24.68% mod_timer
>
> it turns out to be a serious timer migration issue. After collecting timer_start
> trace events for tcp_write_timer, it shows that more than 77% of the time this
> timer got migrated to a different CPU:
>
>         $ perl -ne 'if (/\[(\d+)\].* cpu=(\d+)/){print if $1 != $2 ;}' tcp_timer_trace.txt | wc -l
>         1303826
>         $ wc -l tcp_timer_trace.txt
>         1681068 tcp_timer_trace.txt
>         $ python
>         Python 2.7.5 (default, Jul 11 2019, 17:13:53)
>         [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
>         Type "help", "copyright", "credits" or "license" for more information.
>         >>> 1303826 / 1681068.0
>         0.7755938486723916
>
> And all of those migrations happened while an idle CPU was serving a network RX
> softirq. So the CPU-idleness test in idle_cpu() gives a false positive here.
> I don't know whether we should relax it for this scenario particularly, something
> like:
>
> -       if (!idle_cpu(cpu) && housekeeping_cpu(cpu, HK_FLAG_TIMER))
> +       if ((!idle_cpu(cpu) || in_serving_softirq()) &&
> +           housekeeping_cpu(cpu, HK_FLAG_TIMER))
>                 return cpu;
>
> (There could be better way than in_serving_softirq() to measure the idleness,
> of course.)
>
> Or simply just make the TCP retransmit timer pinned. At least this approach
> has minimal impact.
>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>  net/ipv4/inet_connection_sock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index eb30fc1770de..de5510ddb1c8 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -507,7 +507,7 @@ void inet_csk_init_xmit_timers(struct sock *sk,
>  {
>         struct inet_connection_sock *icsk = inet_csk(sk);
>
> -       timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, 0);
> +       timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, TIMER_PINNED);
>         timer_setup(&icsk->icsk_delack_timer, delack_handler, 0);
>         timer_setup(&sk->sk_timer, keepalive_handler, 0);
>         icsk->icsk_pending = icsk->icsk_ack.pending = 0;
> --
> 2.21.0
>

Now you are talking ...

We have disabled /proc/sys/kernel/timer_migration on all Google servers,
because this made no sense on servers really, and not only for tcp timers.

This has been a hot topic years ago ( random example :
https://lore.kernel.org/patchwork/patch/947052/ )
Cong Wang Nov. 1, 2019, 10:43 p.m. UTC | #2
On Fri, Nov 1, 2019 at 3:31 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 1, 2019 at 3:16 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > While investigating the spinlock contention on resetting TCP
> > retransmit timer:
> >
> >   61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
> >    ...
> >     - 58.83% tcp_v4_rcv
> >       - 58.80% tcp_v4_do_rcv
> >          - 58.80% tcp_rcv_established
> >             - 52.88% __tcp_push_pending_frames
> >                - 52.88% tcp_write_xmit
> >                   - 28.16% tcp_event_new_data_sent
> >                      - 28.15% sk_reset_timer
> >                         + mod_timer
> >                   - 24.68% tcp_schedule_loss_probe
> >                      - 24.68% sk_reset_timer
> >                         + 24.68% mod_timer
> >
> > it turns out to be a serious timer migration issue. After collecting timer_start
> > trace events for tcp_write_timer, it shows that more than 77% of the time this
> > timer got migrated to a different CPU:
> >
> >         $ perl -ne 'if (/\[(\d+)\].* cpu=(\d+)/){print if $1 != $2 ;}' tcp_timer_trace.txt | wc -l
> >         1303826
> >         $ wc -l tcp_timer_trace.txt
> >         1681068 tcp_timer_trace.txt
> >         $ python
> >         Python 2.7.5 (default, Jul 11 2019, 17:13:53)
> >         [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
> >         Type "help", "copyright", "credits" or "license" for more information.
> >         >>> 1303826 / 1681068.0
> >         0.7755938486723916
> >
> > And all of those migrations happened while an idle CPU was serving a network RX
> > softirq. So the CPU-idleness test in idle_cpu() gives a false positive here.
> > I don't know whether we should relax it for this scenario particularly, something
> > like:
> >
> > -       if (!idle_cpu(cpu) && housekeeping_cpu(cpu, HK_FLAG_TIMER))
> > +       if ((!idle_cpu(cpu) || in_serving_softirq()) &&
> > +           housekeeping_cpu(cpu, HK_FLAG_TIMER))
> >                 return cpu;
> >
> > (There could be better way than in_serving_softirq() to measure the idleness,
> > of course.)
> >
> > Or simply just make the TCP retransmit timer pinned. At least this approach
> > has minimal impact.
> >
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> > ---
> >  net/ipv4/inet_connection_sock.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index eb30fc1770de..de5510ddb1c8 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -507,7 +507,7 @@ void inet_csk_init_xmit_timers(struct sock *sk,
> >  {
> >         struct inet_connection_sock *icsk = inet_csk(sk);
> >
> > -       timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, 0);
> > +       timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, TIMER_PINNED);
> >         timer_setup(&icsk->icsk_delack_timer, delack_handler, 0);
> >         timer_setup(&sk->sk_timer, keepalive_handler, 0);
> >         icsk->icsk_pending = icsk->icsk_ack.pending = 0;
> > --
> > 2.21.0
> >
>
> Now you are talking ...
>
> We have disabled /proc/sys/kernel/timer_migration on all Google servers,
> because this made no sense on servers really, and not only for tcp timers.

So let's make the sysctl timer_migration disabled by default? It is
always a trade-off between CPU power saving and latency.

Did you measure how much CPU power it increases after disabling it?
If not much, we can certainly make it disabled by default.
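For reference, the knob under discussion is exposed as kernel.timer_migration; an example of turning it off (requires root, and assumes a kernel built with the sysctl):

```shell
# Disable timer migration at runtime:
#   sysctl -w kernel.timer_migration=0
# Persist it across reboots, e.g. in /etc/sysctl.d/timer.conf:
kernel.timer_migration = 0
```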

>
> This has been a hot topic years ago ( random example :
> https://lore.kernel.org/patchwork/patch/947052/ )

Yeah, this specific patch has been merged for a long time,
but I know you are not just talking about this single one. :)

Thanks.
Eric Dumazet Nov. 1, 2019, 11:44 p.m. UTC | #3
On Fri, Nov 1, 2019 at 3:43 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> So let's make the sysctl timer_migration disabled by default? It is
> always a trade-off between CPU power saving and latency.

At least timer_migration sysctl deserves a proper documentation.
I do not see any.

And maybe automatically disabling it for hosts with more than 64
possible cpus would make sense,
but that is only a suggestion. I won't fight this battle.

(All sysctls can be set by admins, we do not need to change the default)
Cong Wang Nov. 1, 2019, 11:51 p.m. UTC | #4
On Fri, Nov 1, 2019 at 4:44 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 1, 2019 at 3:43 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> > So let's make the sysctl timer_migration disabled by default? It is
> > always a trade-off between CPU power saving and latency.
>
> At least timer_migration sysctl deserves a proper documentation.
> I do not see any.
>
> And maybe automatically disabling it for hosts with more than 64
> possible cpus would make sense,
> but that is only a suggestion. I won't fight this battle.
>
> (All sysctls can be set by admins, we do not need to change the default)

People rely on the default value, as obviously not everyone
is able to revise all of the sysctls.

Anyway, I read this as saying that making the TCP retransmit timer
pinned is not interesting, so let's discard the patch. We can at least
carry it ourselves, so it's not a big deal.

Thanks!

Patch

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index eb30fc1770de..de5510ddb1c8 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -507,7 +507,7 @@  void inet_csk_init_xmit_timers(struct sock *sk,
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
-	timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, 0);
+	timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, TIMER_PINNED);
 	timer_setup(&icsk->icsk_delack_timer, delack_handler, 0);
 	timer_setup(&sk->sk_timer, keepalive_handler, 0);
 	icsk->icsk_pending = icsk->icsk_ack.pending = 0;