
[v3,net-next] tcp: TSO packets automatic sizing

Message ID 1377607592.8828.149.camel@edumazet-glaptop
State Accepted, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Aug. 27, 2013, 12:46 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

After hearing many people over the past years complain about TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.

One part of the problem is that tcp_tso_should_defer() uses a heuristic
relying on upcoming ACKs instead of a timer, but more generally, big
TSO packets make little sense at low rates, as they tend to create
micro-bursts on the network, and the general consensus is to reduce the
amount of buffering.

This patch introduces a per-socket sk_pacing_rate that approximates
the current sending rate and allows us to size TSO packets so that we
try to send one packet every millisecond.

This field could be set by other transports.

The patch has no impact on high-speed flows, where large TSO packets
make sense to reach line rate.

For other flows, it enables better packet scheduling and ACK clocking.

It also increases the performance of TCP flows in lossy environments.

A new sysctl (tcp_min_tso_segs) is added to specify the minimum
size of a TSO packet, in segments (default: 2).
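
(As a purely hypothetical example, an admin who wants big TSO frames
regardless of rate could raise this floor with
"echo 8 > /proc/sys/net/ipv4/tcp_min_tso_segs"; the sysctl handler
clamps written values to the [0, GSO_MAX_SEGS] range.)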

A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per-flow pacing.

This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing a 'slow start' ramp-up.

sk_pacing_rate = 2 * cwnd * mss / srtt
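
For example (round numbers, assuming mss = 1448 bytes): a flow with
cwnd = 100 and srtt = 10 ms gets sk_pacing_rate = 2 * 100 * 1448 / 0.010
~= 29 MB/s, so tcp_xmit_size_goal() targets 29e6 / 2000 = 14480 bytes,
i.e. TSO packets of 10 segments, one per ms at the actual (un-doubled)
rate. A slow flow with cwnd = 10 and srtt = 100 ms (~290 KB/s) falls back
to the tcp_min_tso_segs floor of 2 segments.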
 
v2: Neal Cardwell reported a suspicious deferral of the last two segments
on an initial write of 10 MSS; I had to change tcp_tso_should_defer() to
take tp->xmit_size_goal_segs into account.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Tom Herbert <therbert@google.com>
---
v3: The change Yuchung suggested added the possibility of a divide by 0:
    in some cases (retransmits), srtt can be 0 because
    tcp_rtt_estimator() has not yet been called.
    Changed the computation to remove this possibility, and keep HZ-based
    units for now rather than usec. [ It is interesting to see that
    jiffies_to_usecs() is an out-of-line function :( ]
    (A standalone sketch of this computation follows the diffstat.)

This version passed all our tests.

 Documentation/networking/ip-sysctl.txt |    9 ++++++
 include/net/sock.h                     |    2 +
 include/net/tcp.h                      |    1 
 net/ipv4/sysctl_net_ipv4.c             |   10 +++++++
 net/ipv4/tcp.c                         |   28 ++++++++++++++++----
 net/ipv4/tcp_input.c                   |   32 ++++++++++++++++++++++-
 net/ipv4/tcp_output.c                  |    2 -
 7 files changed, 77 insertions(+), 7 deletions(-)
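
The v3 units can be sanity-checked with a standalone userspace sketch
(assumptions: HZ = 1000 and tp->srtt stored as smoothed RTT in
jiffies << 3, as tcp_rtt_estimator() keeps it; this is not kernel code):

#include <stdio.h>
#include <stdint.h>

#define HZ 1000ULL

/* Mirrors the v3 tcp_update_pacing_rate() arithmetic in userspace */
static uint32_t pacing_rate(uint32_t mss, uint32_t cwnd, uint32_t srtt)
{
	uint64_t rate = (uint64_t)mss * 2 * (HZ << 3);

	rate *= cwnd;
	/* skip the divide for a tiny/unset srtt, like the v3 guard */
	if (srtt > 8 + 2)
		rate /= srtt;
	return rate < ~0U ? (uint32_t)rate : ~0U;
}

int main(void)
{
	/* cwnd = 10, smoothed RTT = 51 jiffies (51 ms at HZ=1000), stored << 3 */
	printf("%u bytes/sec\n", pacing_rate(1448, 10, 51 << 3));
	return 0;
}

This prints 567843 bytes/sec, matching the "rate 231680000/srtt 408"
debug lines later in this thread (231680000 / 408).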




Comments

Yuchung Cheng Aug. 28, 2013, 12:17 a.m. UTC | #1
On Tue, Aug 27, 2013 at 5:46 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> [ full commit message and diffstat trimmed ]
Acked-by: Yuchung Cheng <ycheng@google.com>

Neal Cardwell Aug. 28, 2013, 12:21 a.m. UTC | #2
On Tue, Aug 27, 2013 at 8:46 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> [ full commit message and diffstat trimmed ]

Acked-by: Neal Cardwell <ncardwell@google.com>

neal
Jason Wang Aug. 28, 2013, 7:37 a.m. UTC | #3
On 08/27/2013 08:46 PM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> [ commit message and earlier hunks trimmed ]
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 884efff..e63ae4c 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1631,7 +1631,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
>  
>  	/* If a full-sized TSO skb can be sent, do it. */
>  	if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
> -			   sk->sk_gso_max_segs * tp->mss_cache))
> +			   tp->xmit_size_goal_segs * tp->mss_cache))
>  		goto send_now;
A question: does this really guarantee the minimum number of TSO
segments, excluding the case of a small available window? skb->len may be
much smaller and can still be sent here. Maybe we should also check
skb->len?

Eric Dumazet Aug. 28, 2013, 10:34 a.m. UTC | #4
On Wed, 2013-08-28 at 15:37 +0800, Jason Wang wrote:

> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 884efff..e63ae4c 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -1631,7 +1631,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
> >  
> >  	/* If a full-sized TSO skb can be sent, do it. */
> >  	if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
> > -			   sk->sk_gso_max_segs * tp->mss_cache))
> > +			   tp->xmit_size_goal_segs * tp->mss_cache))
> >  		goto send_now;
> A question: does this really guarantee the minimum number of TSO
> segments, excluding the case of a small available window? skb->len may
> be much smaller and can still be sent here. Maybe we should also check
> skb->len?

tcp_tso_should_defer() is all about hoping the application will
'complete' the last skb in the write queue with more payload in the near
future.

skb->len might therefore change, because sendmsg()/sendpage() will add
new payload to that skb.

We try hard not to remove tcp_tso_should_defer(), and to get the best out
of it. We have not yet decided to add a real timer instead of relying on
upcoming ACKs.

Neal has an idea/patch to avoid deferring, depending on the expected
arrival time of the following ACKs.

By making TSO sizes smaller at low rates, we avoid these stalls in
tcp_tso_should_defer(), because an incoming ACK has normally freed enough
window to send the next packet in the write queue without splitting it
into two parts.

These changes are fundamental for using delay-based congestion modules
like Vegas/Westwood, and experimental new ones, without having to disable
TSO.



David Miller Aug. 29, 2013, 7:51 p.m. UTC | #5
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 27 Aug 2013 05:46:32 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> [ full commit message trimmed ]
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, please post a new copy of your accompanying packet scheduler.

Thanks.
Eric Dumazet Aug. 29, 2013, 8:26 p.m. UTC | #6
On Thu, 2013-08-29 at 15:51 -0400, David Miller wrote:

> Applied, please post a new copy of your accompanying packet scheduler.
> 
> Thanks.

Thanks David.

I am a bit puzzled by the caching of srtt in tcp metrics. We tend to
cache bufferbloated values that are almost useless.

On this 50 ms RTT link, the SYN/SYNACK rtt was correctly sampled at 51
jiffies, but tcp_init_metrics() finds a very high srtt cached from a
previous TCP flow, which ended its life with a huge cwnd=327/srtt=1468
because of bufferbloat.

Since the new connection starts with IW10, the estimated rate is slightly
wrong for the first ~10 incoming ACKs, before the EWMA converges to the
right value...
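
(Decoding the first lines below, assuming HZ=1000 and mss_cache=1448:
the printed "rate 231680000" is the numerator mss * 2 * (HZ << 3) * cwnd
= 1448 * 2 * 8000 * 10 before the divide, and "srtt 408" is 51 jiffies
<< 3, so sk_pacing_rate ends up 231680000 / 408 ~= 568 KB/s, i.e. twice
10 * 1448 / 51 ms.)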

[ 4544.656476] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 51 sack_rtt 4294967295
[ 4544.656482] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 51 srtt 0
[ 4544.656496] TCP: sk ffff88085825d180 cwnd 10 packets 0 rate 231680000/srtt 408
[ 4544.707045] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 50 sack_rtt 4294967295
[ 4544.707051] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 50 srtt 1468
[ 4544.707055] TCP: sk ffff88085825d180 cwnd 11 packets 9 rate 254848000/srtt 1335
[ 4544.707067] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 50 sack_rtt 4294967295
[ 4544.707069] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 50 srtt 1335
[ 4544.707071] TCP: sk ffff88085825d180 cwnd 12 packets 10 rate 278016000/srtt 1219
[ 4544.707694] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 51 sack_rtt 4294967295
[ 4544.707699] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 51 srtt 1219
[ 4544.707703] TCP: sk ffff88085825d180 cwnd 13 packets 11 rate 301184000/srtt 1118
[ 4544.708324] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 52 sack_rtt 4294967295
[ 4544.708330] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 52 srtt 1118
[ 4544.708333] TCP: sk ffff88085825d180 cwnd 14 packets 12 rate 324352000/srtt 1031
[ 4544.708846] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 52 sack_rtt 4294967295
[ 4544.708851] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 52 srtt 1031
[ 4544.708855] TCP: sk ffff88085825d180 cwnd 15 packets 13 rate 347520000/srtt 955
[ 4544.709521] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 53 sack_rtt 4294967295
[ 4544.709526] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 53 srtt 955
[ 4544.709530] TCP: sk ffff88085825d180 cwnd 16 packets 14 rate 370688000/srtt 889
[ 4544.710103] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 53 sack_rtt 4294967295
[ 4544.710108] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 53 srtt 889
[ 4544.710111] TCP: sk ffff88085825d180 cwnd 17 packets 15 rate 393856000/srtt 831
[ 4544.710683] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 54 sack_rtt 4294967295
[ 4544.710688] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 54 srtt 831
[ 4544.710691] TCP: sk ffff88085825d180 cwnd 18 packets 16 rate 417024000/srtt 782
[ 4544.711210] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 55 sack_rtt 4294967295
[ 4544.711215] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 55 srtt 782
[ 4544.711219] TCP: sk ffff88085825d180 cwnd 19 packets 17 rate 440192000/srtt 740
[ 4544.711868] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 55 sack_rtt 4294967295
[ 4544.711873] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 55 srtt 740
[ 4544.711876] TCP: sk ffff88085825d180 cwnd 20 packets 18 rate 463360000/srtt 703
[ 4544.757576] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 51 sack_rtt 4294967295
[ 4544.757581] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 51 srtt 703
[ 4544.757585] TCP: sk ffff88085825d180 cwnd 21 packets 19 rate 486528000/srtt 667
[ 4544.757595] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 51 sack_rtt 4294967295
[ 4544.757597] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 51 srtt 667
[ 4544.757610] TCP: sk ffff88085825d180 cwnd 22 packets 20 rate 509696000/srtt 635
[ 4544.773527] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 67 sack_rtt 4294967295
[ 4544.773533] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 67 srtt 635
[ 4544.773536] TCP: sk ffff88085825d180 cwnd 23 packets 21 rate 532864000/srtt 623
[ 4544.773548] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 67 sack_rtt 4294967295
[ 4544.773560] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 67 srtt 623
[ 4544.773562] TCP: sk ffff88085825d180 cwnd 24 packets 22 rate 556032000/srtt 613
[ 4544.778208] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 71 sack_rtt 4294967295
[ 4544.778213] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 71 srtt 613
[ 4544.778216] TCP: sk ffff88085825d180 cwnd 25 packets 23 rate 579200000/srtt 608
[ 4544.778237] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 71 sack_rtt 4294967295
[ 4544.778238] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 71 srtt 608
[ 4544.778240] TCP: sk ffff88085825d180 cwnd 26 packets 24 rate 602368000/srtt 603
[ 4544.782776] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 74 sack_rtt 4294967295
[ 4544.782781] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 74 srtt 603
[ 4544.782785] TCP: sk ffff88085825d180 cwnd 27 packets 25 rate 625536000/srtt 602
[ 4544.782795] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 74 sack_rtt 4294967295
[ 4544.782808] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 74 srtt 602
...
Typical bufferbloat at the end of the transfer:

[ 4547.051521] TCP: sk ffff88085825d180 cwnd 327 packets 3 rate 7575936000/srtt 1581
[ 4547.052722] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 198 sack_rtt 4294967295
[ 4547.052726] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 198 srtt 1581
[ 4547.052729] TCP: sk ffff88085825d180 cwnd 327 packets 1 rate 7575936000/srtt 1582
[ 4547.053315] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 198 sack_rtt 4294967295
[ 4547.053318] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 198 srtt 1582
[ 4547.053321] TCP: sk ffff88085825d180 cwnd 327 packets 0 rate 7575936000/srtt 1583

Maybe we could instead store a value corrected by sk_pacing_rate:

rate = (big_cwin * mss) / big_srtt

stored_rtt = rate / (big_cwin * mss)



David Miller Aug. 29, 2013, 8:35 p.m. UTC | #7
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 29 Aug 2013 13:26:17 -0700

> Typical bufferbloat at the end of transfert :
> 
> [ 4547.051521] TCP: sk ffff88085825d180 cwnd 327 packets 3 rate 7575936000/srtt 1581
> [ 4547.052722] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 198 sack_rtt 4294967295
> [ 4547.052726] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 198 srtt 1581
> [ 4547.052729] TCP: sk ffff88085825d180 cwnd 327 packets 1 rate 7575936000/srtt 1582
> [ 4547.053315] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 198 sack_rtt 4294967295
> [ 4547.053318] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 198 srtt 1582
> [ 4547.053321] TCP: sk ffff88085825d180 cwnd 327 packets 0 rate 7575936000/srtt 1583
> 
> Maybe we could instead store a value corrected by the sk_pacing_rate
> 
> rate = (big_cwin * mss) / big_srtt
> 
> stored_rtt = rate / (big_cwin * mss)

No objections from me.
Eric Dumazet Aug. 29, 2013, 9:26 p.m. UTC | #8
On Thu, 2013-08-29 at 16:35 -0400, David Miller wrote:

> 
> No objections from me.

We'll cook a different patch.

The idea is to feed tcp_set_rto() with the srtt found in the TCP metrics
cache, and leave the tp->srtt value sampled from the SYN/SYNACK exchange
(if available) as is.

(Be conservative with the initial RTO value, yet allow tp->srtt to be the
current RTT of the network.)
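
Roughly something like this (a hypothetical sketch of the idea in
tcp_init_metrics(), not the actual patch):

	/* Hypothetical: cached srtt (<<3 units) seeds only the RTO */
	u32 crtt = tcp_metric_get(tm, TCP_METRIC_RTT);

	if (crtt > tp->srtt) {
		/* conservative RTO from the cached RTT, tp->srtt untouched */
		crtt >>= 3;
		inet_csk(sk)->icsk_rto = crtt + max(crtt >> 2, tcp_rto_min(sk));
	} else if (tp->srtt) {
		tcp_set_rto(sk);
	}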

Thanks


Jason Wang Aug. 30, 2013, 3:02 a.m. UTC | #9
On 08/28/2013 06:34 PM, Eric Dumazet wrote:
> On Wed, 2013-08-28 at 15:37 +0800, Jason Wang wrote:
>
>>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>>> index 884efff..e63ae4c 100644
>>> --- a/net/ipv4/tcp_output.c
>>> +++ b/net/ipv4/tcp_output.c
>>> @@ -1631,7 +1631,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
>>>  
>>>  	/* If a full-sized TSO skb can be sent, do it. */
>>>  	if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
>>> -			   sk->sk_gso_max_segs * tp->mss_cache))
>>> +			   tp->xmit_size_goal_segs * tp->mss_cache))
>>>  		goto send_now;
>> A question: does this really guarantee the minimum number of TSO
>> segments, excluding the case of a small available window? skb->len may
>> be much smaller and can still be sent here. Maybe we should also check
>> skb->len?
> tcp_tso_should_defer() is all about hoping the application will
> 'complete' the last skb in the write queue with more payload in the near
> future.
>
> skb->len might therefore change, because sendmsg()/sendpage() will add
> new payload to that skb.

True, but sometimes the application may be slow to fill bytes into the
skb, especially an application running in a virtualized guest with
multiqueue. In that case, the application in the guest tends to be slower
than the NIC (virtio-net), which does the transmission through a host
thread (vhost). It looks like the current defer algorithm cannot handle
this very well, and if we want to force the batching of 64K packets,
tcp_min_tso_segs does not work well either.

Patch

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index debfe85..ce5bb43 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -482,6 +482,15 @@  tcp_syn_retries - INTEGER
 tcp_timestamps - BOOLEAN
 	Enable timestamps as defined in RFC1323.
 
+tcp_min_tso_segs - INTEGER
+	Minimal number of segments per TSO frame.
+	Since linux-3.12, TCP does an automatic sizing of TSO frames,
+	depending on flow rate, instead of filling 64Kbytes packets.
+	For specific usages, it's possible to force TCP to build big
+	TSO frames. Note that TCP stack might split too big TSO packets
+	if available window is too small.
+	Default: 2
+
 tcp_tso_win_divisor - INTEGER
 	This allows control over what percentage of the congestion window
 	can be consumed by a single TSO frame.
diff --git a/include/net/sock.h b/include/net/sock.h
index e4bbcbf..6ba2e7b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -232,6 +232,7 @@  struct cg_proto;
   *	@sk_napi_id: id of the last napi context to receive data for sk
   *	@sk_ll_usec: usecs to busypoll when there is no data
   *	@sk_allocation: allocation mode
+  *	@sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
   *	@sk_sndbuf: size of send buffer in bytes
   *	@sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
   *		   %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
@@ -361,6 +362,7 @@  struct sock {
 	kmemcheck_bitfield_end(flags);
 	int			sk_wmem_queued;
 	gfp_t			sk_allocation;
+	u32			sk_pacing_rate; /* bytes per second */
 	netdev_features_t	sk_route_caps;
 	netdev_features_t	sk_route_nocaps;
 	int			sk_gso_type;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 09cb5c1..73fcd7c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -281,6 +281,7 @@  extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
 extern unsigned int sysctl_tcp_notsent_lowat;
+extern int sysctl_tcp_min_tso_segs;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 8ed7c32..540279f 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -29,6 +29,7 @@ 
 static int zero;
 static int one = 1;
 static int four = 4;
+static int gso_max_segs = GSO_MAX_SEGS;
 static int tcp_retr1_max = 255;
 static int ip_local_port_range_min[] = { 1, 1 };
 static int ip_local_port_range_max[] = { 65535, 65535 };
@@ -761,6 +762,15 @@  static struct ctl_table ipv4_table[] = {
 		.extra2		= &four,
 	},
 	{
+		.procname	= "tcp_min_tso_segs",
+		.data		= &sysctl_tcp_min_tso_segs,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &gso_max_segs,
+	},
+	{
 		.procname	= "udp_mem",
 		.data		= &sysctl_udp_mem,
 		.maxlen		= sizeof(sysctl_udp_mem),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4e42c03..fdf7409 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -283,6 +283,8 @@ 
 
 int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
 
+int sysctl_tcp_min_tso_segs __read_mostly = 2;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
@@ -785,12 +787,28 @@  static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 	xmit_size_goal = mss_now;
 
 	if (large_allowed && sk_can_gso(sk)) {
-		xmit_size_goal = ((sk->sk_gso_max_size - 1) -
-				  inet_csk(sk)->icsk_af_ops->net_header_len -
-				  inet_csk(sk)->icsk_ext_hdr_len -
-				  tp->tcp_header_len);
+		u32 gso_size, hlen;
+
+		/* Maybe we should/could use sk->sk_prot->max_header here ? */
+		hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
+		       inet_csk(sk)->icsk_ext_hdr_len +
+		       tp->tcp_header_len;
+
+		/* Goal is to send at least one packet per ms,
+		 * not one big TSO packet every 100 ms.
+		 * This preserves ACK clocking and is consistent
+		 * with tcp_tso_should_defer() heuristic.
+		 */
+		gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
+		gso_size = max_t(u32, gso_size,
+				 sysctl_tcp_min_tso_segs * mss_now);
+
+		xmit_size_goal = min_t(u32, gso_size,
+				       sk->sk_gso_max_size - 1 - hlen);
 
-		/* TSQ : try to have two TSO segments in flight */
+		/* TSQ : try to have at least two segments in flight
+		 * (one in NIC TX ring, another in Qdisc)
+		 */
 		xmit_size_goal = min_t(u32, xmit_size_goal,
 				       sysctl_tcp_limit_output_bytes >> 1);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ec492ea..436c7e8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -688,6 +688,34 @@  static void tcp_rtt_estimator(struct sock *sk, const __u32 mrtt)
 	}
 }
 
+/* Set the sk_pacing_rate to allow proper sizing of TSO packets.
+ * Note: TCP stack does not yet implement pacing.
+ * FQ packet scheduler can be used to implement cheap but effective
+ * TCP pacing, to smooth the burst on large writes when packets
+ * in flight is significantly lower than cwnd (or rwin)
+ */
+static void tcp_update_pacing_rate(struct sock *sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	u64 rate;
+
+	/* set sk_pacing_rate to 200 % of current rate (mss * cwnd / srtt) */
+	rate = (u64)tp->mss_cache * 2 * (HZ << 3);
+
+	rate *= max(tp->snd_cwnd, tp->packets_out);
+
+	/* Correction for small srtt : minimum srtt being 8 (1 jiffy << 3),
+	 * be conservative and assume srtt = 1 (125 us instead of 1.25 ms)
+	 * We probably need usec resolution in the future.
+	 * Note: This also takes care of possible srtt=0 case,
+	 * when tcp_rtt_estimator() was not yet called.
+	 */
+	if (tp->srtt > 8 + 2)
+		do_div(rate, tp->srtt);
+
+	sk->sk_pacing_rate = min_t(u64, rate, ~0U);
+}
+
 /* Calculate rto without backoff.  This is the second half of Van Jacobson's
  * routine referred to above.
  */
@@ -3278,7 +3306,7 @@  static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	u32 ack_seq = TCP_SKB_CB(skb)->seq;
 	u32 ack = TCP_SKB_CB(skb)->ack_seq;
 	bool is_dupack = false;
-	u32 prior_in_flight;
+	u32 prior_in_flight, prior_cwnd = tp->snd_cwnd, prior_rtt = tp->srtt;
 	u32 prior_fackets;
 	int prior_packets = tp->packets_out;
 	const int prior_unsacked = tp->packets_out - tp->sacked_out;
@@ -3383,6 +3411,8 @@  static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 
 	if (icsk->icsk_pending == ICSK_TIME_RETRANS)
 		tcp_schedule_loss_probe(sk);
+	if (tp->srtt != prior_rtt || tp->snd_cwnd != prior_cwnd)
+		tcp_update_pacing_rate(sk);
 	return 1;
 
 no_queue:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 884efff..e63ae4c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1631,7 +1631,7 @@  static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
 
 	/* If a full-sized TSO skb can be sent, do it. */
 	if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
-			   sk->sk_gso_max_segs * tp->mss_cache))
+			   tp->xmit_size_goal_segs * tp->mss_cache))
 		goto send_now;
 
 	/* Middle in queue won't get any more data, full sendable already? */