
[RFC] tcp: make icsk_retransmit_timer pinned

Message ID 20191101221605.32210-1-xiyou.wangcong@gmail.com
State RFC
Delegated to: David Miller
Series: [RFC] tcp: make icsk_retransmit_timer pinned

Commit Message

Cong Wang Nov. 1, 2019, 10:16 p.m. UTC
While investigating the spinlock contention on resetting TCP
retransmit timer:

  61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
   ...
    - 58.83% tcp_v4_rcv
      - 58.80% tcp_v4_do_rcv
         - 58.80% tcp_rcv_established
            - 52.88% __tcp_push_pending_frames
               - 52.88% tcp_write_xmit
                  - 28.16% tcp_event_new_data_sent
                     - 28.15% sk_reset_timer
                        + mod_timer
                  - 24.68% tcp_schedule_loss_probe
                     - 24.68% sk_reset_timer
                        + 24.68% mod_timer

it turns out to be a serious timer migration issue. After collecting timer_start
trace events for tcp_write_timer, it shows that more than 77% of the time this
timer got migrated to a different CPU:

	$ perl -ne 'if (/\[(\d+)\].* cpu=(\d+)/){print if $1 != $2 ;}' tcp_timer_trace.txt | wc -l
	1303826
	$ wc -l tcp_timer_trace.txt
	1681068 tcp_timer_trace.txt
	$ python
	Python 2.7.5 (default, Jul 11 2019, 17:13:53)
	[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
	Type "help", "copyright", "credits" or "license" for more information.
	>>> 1303826 / 1681068.0
	0.7755938486723916

And all of those migrations happened while an idle CPU was serving a network RX
softirq. So the CPU-idleness test in idle_cpu() gives a false positive here.
I don't know whether we should relax it for this scenario particularly, something
like:

-	if (!idle_cpu(cpu) && housekeeping_cpu(cpu, HK_FLAG_TIMER))
+	if ((!idle_cpu(cpu) || in_serving_softirq()) &&
+	    housekeeping_cpu(cpu, HK_FLAG_TIMER))
 		return cpu;

(There could be better way than in_serving_softirq() to measure the idleness,
of course.)

Or simply just make the TCP retransmit timer pinned. At least this approach
has minimal impact.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/ipv4/inet_connection_sock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Eric Dumazet Nov. 1, 2019, 10:30 p.m. UTC | #1
On Fri, Nov 1, 2019 at 3:16 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> While investigating the spinlock contention on resetting TCP
> retransmit timer:
>
>   61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
>    ...
>     - 58.83% tcp_v4_rcv
>       - 58.80% tcp_v4_do_rcv
>          - 58.80% tcp_rcv_established
>             - 52.88% __tcp_push_pending_frames
>                - 52.88% tcp_write_xmit
>                   - 28.16% tcp_event_new_data_sent
>                      - 28.15% sk_reset_timer
>                         + mod_timer
>                   - 24.68% tcp_schedule_loss_probe
>                      - 24.68% sk_reset_timer
>                         + 24.68% mod_timer
>
> it turns out to be a serious timer migration issue. After collecting timer_start
> trace events for tcp_write_timer, it shows that more than 77% of the time this
> timer got migrated to a different CPU:
>
>         $ perl -ne 'if (/\[(\d+)\].* cpu=(\d+)/){print if $1 != $2 ;}' tcp_timer_trace.txt | wc -l
>         1303826
>         $ wc -l tcp_timer_trace.txt
>         1681068 tcp_timer_trace.txt
>         $ python
>         Python 2.7.5 (default, Jul 11 2019, 17:13:53)
>         [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
>         Type "help", "copyright", "credits" or "license" for more information.
>         >>> 1303826 / 1681068.0
>         0.7755938486723916
>
> And all of those migrations happened while an idle CPU was serving a network RX
> softirq. So the CPU-idleness test in idle_cpu() gives a false positive here.
> I don't know whether we should relax it for this scenario particularly, something
> like:
>
> -       if (!idle_cpu(cpu) && housekeeping_cpu(cpu, HK_FLAG_TIMER))
> +       if ((!idle_cpu(cpu) || in_serving_softirq()) &&
> +           housekeeping_cpu(cpu, HK_FLAG_TIMER))
>                 return cpu;
>
> (There could be better way than in_serving_softirq() to measure the idleness,
> of course.)
>
> Or simply just make the TCP retransmit timer pinned. At least this approach
> has minimal impact.
>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>  net/ipv4/inet_connection_sock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index eb30fc1770de..de5510ddb1c8 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -507,7 +507,7 @@ void inet_csk_init_xmit_timers(struct sock *sk,
>  {
>         struct inet_connection_sock *icsk = inet_csk(sk);
>
> -       timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, 0);
> +       timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, TIMER_PINNED);
>         timer_setup(&icsk->icsk_delack_timer, delack_handler, 0);
>         timer_setup(&sk->sk_timer, keepalive_handler, 0);
>         icsk->icsk_pending = icsk->icsk_ack.pending = 0;
> --
> 2.21.0
>

Now you are talking ...

We have disabled /proc/sys/kernel/timer_migration on all Google servers,
because this made no sense on servers really, and not only for tcp timers.

This has been a hot topic years ago ( random example :
https://lore.kernel.org/patchwork/patch/947052/ )
Cong Wang Nov. 1, 2019, 10:43 p.m. UTC | #2
On Fri, Nov 1, 2019 at 3:31 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 1, 2019 at 3:16 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > While investigating the spinlock contention on resetting TCP
> > retransmit timer:
> >
> >   61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
> >    ...
> >     - 58.83% tcp_v4_rcv
> >       - 58.80% tcp_v4_do_rcv
> >          - 58.80% tcp_rcv_established
> >             - 52.88% __tcp_push_pending_frames
> >                - 52.88% tcp_write_xmit
> >                   - 28.16% tcp_event_new_data_sent
> >                      - 28.15% sk_reset_timer
> >                         + mod_timer
> >                   - 24.68% tcp_schedule_loss_probe
> >                      - 24.68% sk_reset_timer
> >                         + 24.68% mod_timer
> >
> > it turns out to be a serious timer migration issue. After collecting timer_start
> > trace events for tcp_write_timer, it shows that more than 77% of the time this
> > timer got migrated to a different CPU:
> >
> >         $ perl -ne 'if (/\[(\d+)\].* cpu=(\d+)/){print if $1 != $2 ;}' tcp_timer_trace.txt | wc -l
> >         1303826
> >         $ wc -l tcp_timer_trace.txt
> >         1681068 tcp_timer_trace.txt
> >         $ python
> >         Python 2.7.5 (default, Jul 11 2019, 17:13:53)
> >         [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
> >         Type "help", "copyright", "credits" or "license" for more information.
> >         >>> 1303826 / 1681068.0
> >         0.7755938486723916
> >
> > And all of those migrations happened while an idle CPU was serving a network RX
> > softirq. So the CPU-idleness test in idle_cpu() gives a false positive here.
> > I don't know whether we should relax it for this scenario particularly, something
> > like:
> >
> > -       if (!idle_cpu(cpu) && housekeeping_cpu(cpu, HK_FLAG_TIMER))
> > +       if ((!idle_cpu(cpu) || in_serving_softirq()) &&
> > +           housekeeping_cpu(cpu, HK_FLAG_TIMER))
> >                 return cpu;
> >
> > (There could be better way than in_serving_softirq() to measure the idleness,
> > of course.)
> >
> > Or simply just make the TCP retransmit timer pinned. At least this approach
> > has minimal impact.
> >
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> > ---
> >  net/ipv4/inet_connection_sock.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index eb30fc1770de..de5510ddb1c8 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -507,7 +507,7 @@ void inet_csk_init_xmit_timers(struct sock *sk,
> >  {
> >         struct inet_connection_sock *icsk = inet_csk(sk);
> >
> > -       timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, 0);
> > +       timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, TIMER_PINNED);
> >         timer_setup(&icsk->icsk_delack_timer, delack_handler, 0);
> >         timer_setup(&sk->sk_timer, keepalive_handler, 0);
> >         icsk->icsk_pending = icsk->icsk_ack.pending = 0;
> > --
> > 2.21.0
> >
>
> Now you are talking ...
>
> We have disabled /proc/sys/kernel/timer_migration on all Google servers,
> because this made no sense on servers really, and not only for tcp timers.

So let's make the sysctl timer_migration disabled by default? It is
always a trade-off between CPU power saving and latency.

Did you measure how much CPU power it increases after disabling it?
If not much, we can certainly make it disabled by default.
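For reference, the knob under discussion is exposed as kernel.timer_migration; an example of turning it off (requires root, and assumes a kernel built with the sysctl):

```shell
# Disable timer migration at runtime:
#   sysctl -w kernel.timer_migration=0
# Persist it across reboots, e.g. in /etc/sysctl.d/timer.conf:
kernel.timer_migration = 0
```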

>
> This has been a hot topic years ago ( random example :
> https://lore.kernel.org/patchwork/patch/947052/ )

Yeah, this specific patch has been merged for a long time,
but I know you are not just talking about this single one. :)

Thanks.
Eric Dumazet Nov. 1, 2019, 11:44 p.m. UTC | #3
On Fri, Nov 1, 2019 at 3:43 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> So let's make the sysctl timer_migration disabled by default? It is
> always a trade-off between CPU power saving and latency.

At least timer_migration sysctl deserves a proper documentation.
I do not see any.

And maybe automatically disabling it for hosts with more than 64
possible cpus would make sense,
but that is only a suggestion. I won't fight this battle.

(All sysctls can be set by admins, we do not need to change the default)
Cong Wang Nov. 1, 2019, 11:51 p.m. UTC | #4
On Fri, Nov 1, 2019 at 4:44 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 1, 2019 at 3:43 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> > So let's make the sysctl timer_migration disabled by default? It is
> > always a trade-off between CPU power saving and latency.
>
> At least timer_migration sysctl deserves a proper documentation.
> I do not see any.
>
> And maybe automatically disabling it for hosts with more than 64
> possible cpus would make sense,
> but that is only a suggestion. I won't fight this battle.
>
> (All sysctls can be set by admins, we do not need to change the default)

People rely on the default value, as obviously not everyone
is able to revise all of the sysctls.

Anyway, I read this as saying that making the TCP retransmit timer
pinned is not interesting, so let's discard the patch. We can at least
carry it ourselves, so it's not a big deal.

Thanks!

Patch

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index eb30fc1770de..de5510ddb1c8 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -507,7 +507,7 @@  void inet_csk_init_xmit_timers(struct sock *sk,
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
-	timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, 0);
+	timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, TIMER_PINNED);
 	timer_setup(&icsk->icsk_delack_timer, delack_handler, 0);
 	timer_setup(&sk->sk_timer, keepalive_handler, 0);
 	icsk->icsk_pending = icsk->icsk_ack.pending = 0;