diff mbox series

[net] tcp: do not send empty skb from tcp_write_xmit()

Message ID 20191211073419.258820-1-edumazet@google.com
State Superseded
Delegated to: David Miller
Headers show
Series [net] tcp: do not send empty skb from tcp_write_xmit() | expand

Commit Message

Eric Dumazet Dec. 11, 2019, 7:34 a.m. UTC
Backport of commit fdfc5c8594c2 ("tcp: remove empty skb from
write queue in error cases") in linux-4.14 stable triggered
various bugs. One of them has been fixed in commit ba2ddb43f270
("tcp: Don't dequeue SYN/FIN-segments from write-queue"), but
we still have crashes in some occasions.

Root-cause is that when tcp_sendmsg() has allocated a fresh
skb and could not append a fragment before being blocked
in sk_stream_wait_memory(), tcp_write_xmit() might be called
and decide to send this fresh and empty skb.

Sending an empty packet is not only silly, it might have caused
many issues we had in the past with tp->packets_out being
out of sync.

Fixes: c65f7f00c587 ("[TCP]: Simplify SKB data portion allocation with NETIF_F_SG.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Christoph Paasch <cpaasch@apple.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Jason Baron <jbaron@akamai.com>
---
 net/ipv4/tcp_output.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Neal Cardwell Dec. 11, 2019, 7:17 p.m. UTC | #1
On Wed, Dec 11, 2019 at 2:34 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Backport of commit fdfc5c8594c2 ("tcp: remove empty skb from
> write queue in error cases") in linux-4.14 stable triggered
> various bugs. One of them has been fixed in commit ba2ddb43f270
> ("tcp: Don't dequeue SYN/FIN-segments from write-queue"), but
> we still have crashes in some occasions.
>
> Root-cause is that when tcp_sendmsg() has allocated a fresh
> skb and could not append a fragment before being blocked
> in sk_stream_wait_memory(), tcp_write_xmit() might be called
> and decide to send this fresh and empty skb.
>
> Sending an empty packet is not only silly, it might have caused
> many issues we had in the past with tp->packets_out being
> out of sync.
>
> Fixes: c65f7f00c587 ("[TCP]: Simplify SKB data portion allocation with NETIF_F_SG.")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Christoph Paasch <cpaasch@apple.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Jason Baron <jbaron@akamai.com>
> ---
>  net/ipv4/tcp_output.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index b184f03d743715ef4b2d166ceae651529be77953..57f434a8e41ffd6bc584cb4d9e87703491a378c1 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2438,6 +2438,14 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                 if (tcp_small_queue_check(sk, skb, 0))
>                         break;
>
> +               /* Argh, we hit an empty skb(), presumably a thread
> +                * is sleeping in sendmsg()/sk_stream_wait_memory().
> +                * We do not want to send a pure-ack packet and have
> +                * a strange looking rtx queue with empty packet(s).
> +                */
> +               if (TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq)
> +                       break;
> +
>                 if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
>                         break;
>
> --

Thanks for the fix, Eric!

Is there any risk that any current or future bugs that create
persistently empty skbs could cause the connection to "freeze", unable
to reach the tcp_transmit_skb() call in tcp_write_xmit()?

To avoid this risk, would it make sense to delete the empty skb and
continue the tcp_write_xmit() transmit loop, rather than breaking out
of the loop?

Just curious to learn. :-)

thanks,
neal
Eric Dumazet Dec. 11, 2019, 7:39 p.m. UTC | #2
On Wed, Dec 11, 2019 at 11:17 AM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Wed, Dec 11, 2019 at 2:34 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > Backport of commit fdfc5c8594c2 ("tcp: remove empty skb from
> > write queue in error cases") in linux-4.14 stable triggered
> > various bugs. One of them has been fixed in commit ba2ddb43f270
> > ("tcp: Don't dequeue SYN/FIN-segments from write-queue"), but
> > we still have crashes in some occasions.
> >
> > Root-cause is that when tcp_sendmsg() has allocated a fresh
> > skb and could not append a fragment before being blocked
> > in sk_stream_wait_memory(), tcp_write_xmit() might be called
> > and decide to send this fresh and empty skb.
> >
> > Sending an empty packet is not only silly, it might have caused
> > many issues we had in the past with tp->packets_out being
> > out of sync.
> >
> > Fixes: c65f7f00c587 ("[TCP]: Simplify SKB data portion allocation with NETIF_F_SG.")
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Christoph Paasch <cpaasch@apple.com>
> > Cc: Neal Cardwell <ncardwell@google.com>
> > Cc: Jason Baron <jbaron@akamai.com>
> > ---
> >  net/ipv4/tcp_output.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index b184f03d743715ef4b2d166ceae651529be77953..57f434a8e41ffd6bc584cb4d9e87703491a378c1 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -2438,6 +2438,14 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> >                 if (tcp_small_queue_check(sk, skb, 0))
> >                         break;
> >
> > +               /* Argh, we hit an empty skb(), presumably a thread
> > +                * is sleeping in sendmsg()/sk_stream_wait_memory().
> > +                * We do not want to send a pure-ack packet and have
> > +                * a strange looking rtx queue with empty packet(s).
> > +                */
> > +               if (TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq)
> > +                       break;
> > +
> >                 if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
> >                         break;
> >
> > --
>
> Thanks for the fix, Eric!
>
> Is there any risk that any current or future bugs that create
> persistently empty skbs could cause the connection to "freeze", unable
> to reach the tcp_transmit_skb() call in tcp_write_xmit()?
>
> To avoid this risk, would it make sense to delete the empty skb and
> continue the tcp_write_xmit() transmit loop, rather than breaking out
> of the loop?

This 'empty' skb must be the last in the queue.

Removing it from the queue would not prevent tcp_write_xmit() from
breaking the loop.

If we remove it, then we force sendmsg() to re-allocate a fresh skb,
which seems more work, and would paper around the bugs that would be
un-noticed.

Another question to ask is if we need to reconsider how we use
tcp_write_queue_empty() as an indicator
for 'I  have something in the write queue'

This is obviously wrong if the write queue has a single empty skb.

Maybe we should instead use tp->write_seq != tp->snd_nxt
Eric Dumazet Dec. 11, 2019, 7:43 p.m. UTC | #3
On Wed, Dec 11, 2019 at 11:39 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Dec 11, 2019 at 11:17 AM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Wed, Dec 11, 2019 at 2:34 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > Backport of commit fdfc5c8594c2 ("tcp: remove empty skb from
> > > write queue in error cases") in linux-4.14 stable triggered
> > > various bugs. One of them has been fixed in commit ba2ddb43f270
> > > ("tcp: Don't dequeue SYN/FIN-segments from write-queue"), but
> > > we still have crashes in some occasions.
> > >
> > > Root-cause is that when tcp_sendmsg() has allocated a fresh
> > > skb and could not append a fragment before being blocked
> > > in sk_stream_wait_memory(), tcp_write_xmit() might be called
> > > and decide to send this fresh and empty skb.
> > >
> > > Sending an empty packet is not only silly, it might have caused
> > > many issues we had in the past with tp->packets_out being
> > > out of sync.
> > >
> > > Fixes: c65f7f00c587 ("[TCP]: Simplify SKB data portion allocation with NETIF_F_SG.")
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > Cc: Christoph Paasch <cpaasch@apple.com>
> > > Cc: Neal Cardwell <ncardwell@google.com>
> > > Cc: Jason Baron <jbaron@akamai.com>
> > > ---
> > >  net/ipv4/tcp_output.c | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > >
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index b184f03d743715ef4b2d166ceae651529be77953..57f434a8e41ffd6bc584cb4d9e87703491a378c1 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -2438,6 +2438,14 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> > >                 if (tcp_small_queue_check(sk, skb, 0))
> > >                         break;
> > >
> > > +               /* Argh, we hit an empty skb(), presumably a thread
> > > +                * is sleeping in sendmsg()/sk_stream_wait_memory().
> > > +                * We do not want to send a pure-ack packet and have
> > > +                * a strange looking rtx queue with empty packet(s).
> > > +                */
> > > +               if (TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq)
> > > +                       break;
> > > +
> > >                 if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
> > >                         break;
> > >
> > > --
> >
> > Thanks for the fix, Eric!
> >
> > Is there any risk that any current or future bugs that create
> > persistently empty skbs could cause the connection to "freeze", unable
> > to reach the tcp_transmit_skb() call in tcp_write_xmit()?
> >
> > To avoid this risk, would it make sense to delete the empty skb and
> > continue the tcp_write_xmit() transmit loop, rather than breaking out
> > of the loop?
>
> This 'empty' skb must be the last in the queue.
>
> Removing it from the queue would not prevent tcp_write_xmit() from
> breaking the loop.
>
> If we remove it, then we force sendmsg() to re-allocate a fresh skb,
> which seems more work, and would paper around the bugs that would be
> un-noticed.
>
> Another question to ask is if we need to reconsider how we use
> tcp_write_queue_empty() as an indicator
> for 'I  have something in the write queue'
>
> This is obviously wrong if the write queue has a single empty skb.
>
> Maybe we should instead use tp->write_seq != tp->snd_nxt

Also we use skb_queue_len(&sk->sk_write_queue) == 0 in
tcp_sendmsg_locked() and this seems not good.

We could have changed tcp_sendmsg_locked() to leave the empty skb in
the write queue.
( and we can thus remove tcp_remove_empty_skb() completely since it
has caused many problems in stable kernels)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8a39ee79489192c02385aaadc8d1ae969fb55d23..9ba3294de9cb4e929afdc0e00b01b6b534c84af6
100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1419,7 +1419,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct
msghdr *msg, size_t size)
        sock_zerocopy_put_abort(uarg, true);
        err = sk_stream_error(sk, flags, err);
        /* make sure we wake any epoll edge trigger waiter */
-       if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
+       if (unlikely(tp->write_seq == tp->snd_nxt &&
                     err == -EAGAIN)) {
                sk->sk_write_space(sk);
                tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);
Neal Cardwell Dec. 11, 2019, 8:04 p.m. UTC | #4
On Wed, Dec 11, 2019 at 2:44 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Dec 11, 2019 at 11:39 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Dec 11, 2019 at 11:17 AM Neal Cardwell <ncardwell@google.com> wrote:
> > >
> > > On Wed, Dec 11, 2019 at 2:34 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > Backport of commit fdfc5c8594c2 ("tcp: remove empty skb from
> > > > write queue in error cases") in linux-4.14 stable triggered
> > > > various bugs. One of them has been fixed in commit ba2ddb43f270
> > > > ("tcp: Don't dequeue SYN/FIN-segments from write-queue"), but
> > > > we still have crashes in some occasions.
> > > >
> > > > Root-cause is that when tcp_sendmsg() has allocated a fresh
> > > > skb and could not append a fragment before being blocked
> > > > in sk_stream_wait_memory(), tcp_write_xmit() might be called
> > > > and decide to send this fresh and empty skb.
> > > >
> > > > Sending an empty packet is not only silly, it might have caused
> > > > many issues we had in the past with tp->packets_out being
> > > > out of sync.
> > > >
> > > > Fixes: c65f7f00c587 ("[TCP]: Simplify SKB data portion allocation with NETIF_F_SG.")
> > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > Cc: Christoph Paasch <cpaasch@apple.com>
> > > > Cc: Neal Cardwell <ncardwell@google.com>
> > > > Cc: Jason Baron <jbaron@akamai.com>
> > > > ---
> > > >  net/ipv4/tcp_output.c | 8 ++++++++
> > > >  1 file changed, 8 insertions(+)
> > > >
> > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > > index b184f03d743715ef4b2d166ceae651529be77953..57f434a8e41ffd6bc584cb4d9e87703491a378c1 100644
> > > > --- a/net/ipv4/tcp_output.c
> > > > +++ b/net/ipv4/tcp_output.c
> > > > @@ -2438,6 +2438,14 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> > > >                 if (tcp_small_queue_check(sk, skb, 0))
> > > >                         break;
> > > >
> > > > +               /* Argh, we hit an empty skb(), presumably a thread
> > > > +                * is sleeping in sendmsg()/sk_stream_wait_memory().
> > > > +                * We do not want to send a pure-ack packet and have
> > > > +                * a strange looking rtx queue with empty packet(s).
> > > > +                */
> > > > +               if (TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq)
> > > > +                       break;
> > > > +
> > > >                 if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
> > > >                         break;
> > > >
> > > > --
> > >
> > > Thanks for the fix, Eric!
> > >
> > > Is there any risk that any current or future bugs that create
> > > persistently empty skbs could cause the connection to "freeze", unable
> > > to reach the tcp_transmit_skb() call in tcp_write_xmit()?
> > >
> > > To avoid this risk, would it make sense to delete the empty skb and
> > > continue the tcp_write_xmit() transmit loop, rather than breaking out
> > > of the loop?
> >
> > This 'empty' skb must be the last in the queue.
> >
> > Removing it from the queue would not prevent tcp_write_xmit() from
> > breaking the loop.
> >
> > If we remove it, then we force sendmsg() to re-allocate a fresh skb,
> > which seems more work, and would paper around the bugs that would be
> > un-noticed.
> >
> > Another question to ask is if we need to reconsider how we use
> > tcp_write_queue_empty() as an indicator
> > for 'I  have something in the write queue'
> >
> > This is obviously wrong if the write queue has a single empty skb.
> >
> > Maybe we should instead use tp->write_seq != tp->snd_nxt
>
> Also we use skb_queue_len(&sk->sk_write_queue) == 0 in
> tcp_sendmsg_locked() and this seems not good.
>
> We could have changed tcp_sendmsg_locked() to leave the empty skb in
> the write queue.
> ( and we can thus remove tcp_remove_empty_skb() completely since it
> has caused many problems in stable kernels)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 8a39ee79489192c02385aaadc8d1ae969fb55d23..9ba3294de9cb4e929afdc0e00b01b6b534c84af6
> 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1419,7 +1419,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct
> msghdr *msg, size_t size)
>         sock_zerocopy_put_abort(uarg, true);
>         err = sk_stream_error(sk, flags, err);
>         /* make sure we wake any epoll edge trigger waiter */
> -       if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
> +       if (unlikely(tp->write_seq == tp->snd_nxt &&
>                      err == -EAGAIN)) {
>                 sk->sk_write_space(sk);
>                 tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);

Thanks, Eric.

I like your idea to audit the TCP calls to tcp_write_queue_empty() and
skb_queue_len(&sk->sk_write_queue) to see which ones should be
replaced with comparisons of tp->write_seq and tp->snd_nxt (including
this one). Good catch!

neal
Neal Cardwell Dec. 11, 2019, 8:05 p.m. UTC | #5
On Wed, Dec 11, 2019 at 2:34 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Backport of commit fdfc5c8594c2 ("tcp: remove empty skb from
> write queue in error cases") in linux-4.14 stable triggered
> various bugs. One of them has been fixed in commit ba2ddb43f270
> ("tcp: Don't dequeue SYN/FIN-segments from write-queue"), but
> we still have crashes in some occasions.
>
> Root-cause is that when tcp_sendmsg() has allocated a fresh
> skb and could not append a fragment before being blocked
> in sk_stream_wait_memory(), tcp_write_xmit() might be called
> and decide to send this fresh and empty skb.
>
> Sending an empty packet is not only silly, it might have caused
> many issues we had in the past with tp->packets_out being
> out of sync.
>
> Fixes: c65f7f00c587 ("[TCP]: Simplify SKB data portion allocation with NETIF_F_SG.")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Christoph Paasch <cpaasch@apple.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Jason Baron <jbaron@akamai.com>
> ---

Acked-by: Neal Cardwell <ncardwell@google.com>

thanks,
neal
diff mbox series

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b184f03d743715ef4b2d166ceae651529be77953..57f434a8e41ffd6bc584cb4d9e87703491a378c1 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2438,6 +2438,14 @@  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		if (tcp_small_queue_check(sk, skb, 0))
 			break;
 
+		/* Argh, we hit an empty skb(), presumably a thread
+		 * is sleeping in sendmsg()/sk_stream_wait_memory().
+		 * We do not want to send a pure-ack packet and have
+		 * a strange looking rtx queue with empty packet(s).
+		 */
+		if (TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq)
+			break;
+
 		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
 			break;