diff mbox

[bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.12+

Message ID 1709726.jUgUSQI9sl@pikkukde.a.i2n
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Thomas Jarosch Nov. 29, 2014, 11:44 a.m. UTC
Hello,

we're in the process of updating production level machines
from kernel 3.4.101 to kernel 3.14.25. On one mail server
we noticed that emails destined for an IPSec tunnel sometimes
get stuck in the mail queue with TCP timeouts.

To make a long story short: When the VPN connection is initially
set up or re-newed, the path MTU for the xfrm tunnel is undetermined.

As soon as a TCP client starts to send large packets,
it triggers path MTU detection. Some middlebox on the
way to the final server has a lower MTU and sends back
an "ICMP fragmentation needed" packet as normal.

With the old kernel, the packet size for the TCP connection inside
the xfrm tunnel gets adjusted and all is fine. With kernel v3.12+,
the connection stalls completely. Same thing with kernel v3.18-rc6.

We wrote a small tool to mimic postfix's TCP behavior (see attached file).
In the end it's a normal TCP client sending large packets.
The server side is just "socat - tcp4-listen:667".

If you run "socket_client" a second time, the path MTU
for the xfrm tunnel is already known and packets flow normal, too.


The "evil" commit in question is this one:
---------------------------------------------------------------------
commit 8f26fb1c1ed81c33f5d87c5936f4d9d1b4118918
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Oct 15 12:24:54 2013 -0700

    tcp: remove the sk_can_gso() check from tcp_set_skb_tso_segs()

    sk_can_gso() should only be used as a hint in tcp_sendmsg() to build GSO
    packets in the first place. (As a performance hint)

    Once we have GSO packets in write queue, we can not decide they are no
    longer GSO only because flow now uses a route which doesn't handle
    TSO/GSO.

    Core networking stack handles the case very well for us, all we need
    is keeping track of packet counts in MSS terms, regardless of
    segmentation done later (in GSO or hardware)

    Right now, if  tcp_fragment() splits a GSO packet in two parts,
    @left and @right, and route changed through a non GSO device,
    both @left and @right have pcount set to 1, which is wrong,
    and leads to incorrect packet_count tracking.

    This problem was added in commit d5ac99a648 ("[TCP]: skb pcount with MTU
    discovery")

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Reported-by: Maciej Żenczykowski <maze@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

---------------------------------------------------------------------

When I revert it, even kernel v3.18-rc6 starts working.
But I doubt this is the root problem, may be just hiding another issue.

--- Sample output of socket_client using vanilla v3.12 kernel ---
[1417258063 SEND result: 4096, strerror: Success]
tcp max seg: res: 0, max_seg: 1370
[1417258063 SEND result: 4096, strerror: Success]
tcp max seg: res: 0, max_seg: 1370
[1417258063 SEND result: 4096, strerror: Success]
tcp max seg: res: 0, max_seg: 1370
[1417258063 SEND result: 4096, strerror: Success]
tcp max seg: res: 0, max_seg: 1370
[1417258063 SEND result: 4096, strerror: Success]
tcp max seg: res: 0, max_seg: 1338
[1417258063 SEND result: 4096, strerror: Success]
tcp max seg: res: 0, max_seg: 1338
*STUCK*
--------------------------------------------------------

The "machine" is running on KVM and using "virtio_net" as NIC driver.
I've played with the ethtool offload settings:

*** eth1 defaults ***
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off

*** eth1 working (no stalls) using vanilla kernel ***
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: off  <-- the magic switch
scatter-gather: off
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off

When I turn "tx-checksumming" back on, it fails again.
Though that is probably also just a side effect.

I can provide tcpdumps if needed but they are no real help
since you can just see the kernel stops sending TCP packets.
(and the outgoing TCP packets are encrypted in ESP packets)


Any vague idea what might be the root cause?

I also tried reverting commit 4d53eff48b5f03ce67f4f301d6acca1d2145cb7a
("xfrm: Don't queue retransmitted packets if the original is still on the host")
but that didn't change the situation. In fact it wasn't even triggered.

Please CC: comments. Thanks.

Best regards,
Thomas

Comments

Herbert Xu Dec. 1, 2014, 10:25 a.m. UTC | #1
Thomas Jarosch <thomas.jarosch@intra2net.com> wrote:
> 
> When I revert it, even kernel v3.18-rc6 starts working.
> But I doubt this is the root problem, may be just hiding another issue.

Can you do a tcpdump with this patch reverted? I would like to
see the size of the packets that are sent out vs. the ICMP message
that came back.

My guess is that the IPsec GSO path is buggy since as you rightly
pointed out it couldn't have been heavily tested prior to this
patch.

Though I am surprised that it only breaks when you have a PMTU event
so it might be something else after all.

Thanks,
Thomas Jarosch Dec. 1, 2014, 11:20 a.m. UTC | #2
On Monday, 1. December 2014 18:25:22 Herbert Xu wrote:
> Thomas Jarosch <thomas.jarosch@intra2net.com> wrote:
> > When I revert it, even kernel v3.18-rc6 starts working.
> > But I doubt this is the root problem, may be just hiding another issue.
> 
> Can you do a tcpdump with this patch reverted? I would like to
> see the size of the packets that are sent out vs. the ICMP message
> that came back.
> 
> My guess is that the IPsec GSO path is buggy since as you rightly
> pointed out it couldn't have been heavily tested prior to this
> patch.
> 
> Though I am surprised that it only breaks when you have a PMTU event
> so it might be something else after all.

I've sent you two pcap files off list. One with the reverted patch
and one I created on Saturday showing the stalling TCP connection.
(though the mail currently seems greylisted by the receiving mail server)

One thing I can say from looking at the tcpdump with the reverted patch
is that the ESP packet size drops from 1510 to 1478. The announced MTU
of the next hop in the ICMP message is 1464. It also contains seven
duplicated ACK packets later on.


Without the reverted patch, it sends four ESP packets:
1. 1498 bytes
2. 1498 bytes
3. 1498 bytes
4. 234 bytes

That triggers the  ICMP "fragmentation needed" message. I can only spot one 
ESP packet of size 1466 afterwards that's sent from time to time. Only one
duplicated ACK packet can be seen. The other packets are just not resent. 
May be it's stuck in some in-kernel queue because it's too big to send
and stays there until it expires?

Today I also tried changing the NIC driver on the virtual machine
from virtio_net to e1000. Luckily still the same behavior,
so it's probably not related to the virtio_net driver.

Cheers,
Thomas

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Wolfgang Walter Dec. 1, 2014, 1:17 p.m. UTC | #3
Am Samstag, 29. November 2014, 12:44:07 schrieb Thomas Jarosch:
> Hello,
> 
> we're in the process of updating production level machines
> from kernel 3.4.101 to kernel 3.14.25. On one mail server
> we noticed that emails destined for an IPSec tunnel sometimes
> get stuck in the mail queue with TCP timeouts.
> 
> To make a long story short: When the VPN connection is initially
> set up or re-newed, the path MTU for the xfrm tunnel is undetermined.
> 
> As soon as a TCP client starts to send large packets,
> it triggers path MTU detection. Some middlebox on the
> way to the final server has a lower MTU and sends back
> an "ICMP fragmentation needed" packet as normal.
> 
> With the old kernel, the packet size for the TCP connection inside
> the xfrm tunnel gets adjusted and all is fine. With kernel v3.12+,
> the connection stalls completely. Same thing with kernel v3.18-rc6.

We see something similar with real nic (RTL8139). In our case only the first 
tcp-connection which triggers PMTU stalls. Later tcp-connections then work 
fine.

I will revert that patch and see if that fixes the problem.

> 
> We wrote a small tool to mimic postfix's TCP behavior (see attached file).
> In the end it's a normal TCP client sending large packets.
> The server side is just "socat - tcp4-listen:667".
> 
> If you run "socket_client" a second time, the path MTU
> for the xfrm tunnel is already known and packets flow normal, too.
> 
> 
> The "evil" commit in question is this one:
> ---------------------------------------------------------------------
> commit 8f26fb1c1ed81c33f5d87c5936f4d9d1b4118918
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Tue Oct 15 12:24:54 2013 -0700
> 
>     tcp: remove the sk_can_gso() check from tcp_set_skb_tso_segs()
> 
>     sk_can_gso() should only be used as a hint in tcp_sendmsg() to build GSO
> packets in the first place. (As a performance hint)
> 
>     Once we have GSO packets in write queue, we can not decide they are no
>     longer GSO only because flow now uses a route which doesn't handle
>     TSO/GSO.
> 
>     Core networking stack handles the case very well for us, all we need
>     is keeping track of packet counts in MSS terms, regardless of
>     segmentation done later (in GSO or hardware)
> 
>     Right now, if  tcp_fragment() splits a GSO packet in two parts,
>     @left and @right, and route changed through a non GSO device,
>     both @left and @right have pcount set to 1, which is wrong,
>     and leads to incorrect packet_count tracking.
> 
>     This problem was added in commit d5ac99a648 ("[TCP]: skb pcount with MTU
> discovery")
> 
>     Signed-off-by: Eric Dumazet <edumazet@google.com>
>     Signed-off-by: Neal Cardwell <ncardwell@google.com>
>     Signed-off-by: Yuchung Cheng <ycheng@google.com>
>     Reported-by: Maciej Żenczykowski <maze@google.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 8fad1c1..d46f214 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -989,8 +989,7 @@ static void tcp_set_skb_tso_segs(const struct sock *sk,
> struct sk_buff *skb, /* Make sure we own this skb before messing
> gso_size/gso_segs */ WARN_ON_ONCE(skb_cloned(skb));
> 
> -       if (skb->len <= mss_now || !sk_can_gso(sk) ||
> -           skb->ip_summed == CHECKSUM_NONE) {
> +       if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
>                 /* Avoid the costly divide in the normal
>                  * non-TSO case.
>                  */
> ---------------------------------------------------------------------
> 
> When I revert it, even kernel v3.18-rc6 starts working.
> But I doubt this is the root problem, may be just hiding another issue.
> 
> --- Sample output of socket_client using vanilla v3.12 kernel ---
> [1417258063 SEND result: 4096, strerror: Success]
> tcp max seg: res: 0, max_seg: 1370
> [1417258063 SEND result: 4096, strerror: Success]
> tcp max seg: res: 0, max_seg: 1370
> [1417258063 SEND result: 4096, strerror: Success]
> tcp max seg: res: 0, max_seg: 1370
> [1417258063 SEND result: 4096, strerror: Success]
> tcp max seg: res: 0, max_seg: 1370
> [1417258063 SEND result: 4096, strerror: Success]
> tcp max seg: res: 0, max_seg: 1338
> [1417258063 SEND result: 4096, strerror: Success]
> tcp max seg: res: 0, max_seg: 1338
> *STUCK*
> --------------------------------------------------------
> 
> The "machine" is running on KVM and using "virtio_net" as NIC driver.
> I've played with the ethtool offload settings:
> 
> *** eth1 defaults ***
> Offload parameters for eth1:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp-segmentation-offload: on
> udp-fragmentation-offload: on
> generic-segmentation-offload: on
> generic-receive-offload: on
> large-receive-offload: off
> 
> *** eth1 working (no stalls) using vanilla kernel ***
> Offload parameters for eth1:
> rx-checksumming: on
> tx-checksumming: off  <-- the magic switch
> scatter-gather: off
> tcp-segmentation-offload: off
> udp-fragmentation-offload: off
> generic-segmentation-offload: off
> generic-receive-offload: off
> large-receive-offload: off
> 
> When I turn "tx-checksumming" back on, it fails again.
> Though that is probably also just a side effect.
> 
> I can provide tcpdumps if needed but they are no real help
> since you can just see the kernel stops sending TCP packets.
> (and the outgoing TCP packets are encrypted in ESP packets)
> 
> 
> Any vague idea what might be the root cause?
> 
> I also tried reverting commit 4d53eff48b5f03ce67f4f301d6acca1d2145cb7a
> ("xfrm: Don't queue retransmitted packets if the original is still on the
> host") but that didn't change the situation. In fact it wasn't even
> triggered.
> 
> Please CC: comments. Thanks.
> 
> Best regards,
> Thomas

Regards,
Wolfgang Walter Dec. 1, 2014, 4:41 p.m. UTC | #4
Am Montag, 1. Dezember 2014, 14:17:28 schrieb Wolfgang Walter:
> Am Samstag, 29. November 2014, 12:44:07 schrieb Thomas Jarosch:
> > Hello,
> > 
> > we're in the process of updating production level machines
> > from kernel 3.4.101 to kernel 3.14.25. On one mail server
> > we noticed that emails destined for an IPSec tunnel sometimes
> > get stuck in the mail queue with TCP timeouts.
> > 
> > To make a long story short: When the VPN connection is initially
> > set up or re-newed, the path MTU for the xfrm tunnel is undetermined.
> > 
> > As soon as a TCP client starts to send large packets,
> > it triggers path MTU detection. Some middlebox on the
> > way to the final server has a lower MTU and sends back
> > an "ICMP fragmentation needed" packet as normal.
> > 
> > With the old kernel, the packet size for the TCP connection inside
> > the xfrm tunnel gets adjusted and all is fine. With kernel v3.12+,
> > the connection stalls completely. Same thing with kernel v3.18-rc6.
> 
> We see something similar with real nic (RTL8139). In our case only the first
> tcp-connection which triggers PMTU stalls. Later tcp-connections then work
> fine.
> 
> I will revert that patch and see if that fixes the problem.


Reverting the commit fixes the problem here, too.


> 
> > We wrote a small tool to mimic postfix's TCP behavior (see attached file).
> > In the end it's a normal TCP client sending large packets.
> > The server side is just "socat - tcp4-listen:667".
> > 
> > If you run "socket_client" a second time, the path MTU
> > for the xfrm tunnel is already known and packets flow normal, too.
> > 
> > 
> > The "evil" commit in question is this one:
> > ---------------------------------------------------------------------
> > commit 8f26fb1c1ed81c33f5d87c5936f4d9d1b4118918
> > Author: Eric Dumazet <edumazet@google.com>
> > Date:   Tue Oct 15 12:24:54 2013 -0700
> > 
> >     tcp: remove the sk_can_gso() check from tcp_set_skb_tso_segs()
> >     
> >     sk_can_gso() should only be used as a hint in tcp_sendmsg() to build
> >     GSO
> > 
> > packets in the first place. (As a performance hint)
> > 
> >     Once we have GSO packets in write queue, we can not decide they are no
> >     longer GSO only because flow now uses a route which doesn't handle
> >     TSO/GSO.
> >     
> >     Core networking stack handles the case very well for us, all we need
> >     is keeping track of packet counts in MSS terms, regardless of
> >     segmentation done later (in GSO or hardware)
> >     
> >     Right now, if  tcp_fragment() splits a GSO packet in two parts,
> >     @left and @right, and route changed through a non GSO device,
> >     both @left and @right have pcount set to 1, which is wrong,
> >     and leads to incorrect packet_count tracking.
> >     
> >     This problem was added in commit d5ac99a648 ("[TCP]: skb pcount with
> >     MTU
> > 
> > discovery")
> > 
> >     Signed-off-by: Eric Dumazet <edumazet@google.com>
> >     Signed-off-by: Neal Cardwell <ncardwell@google.com>
> >     Signed-off-by: Yuchung Cheng <ycheng@google.com>
> >     Reported-by: Maciej Żenczykowski <maze@google.com>
> >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > 
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 8fad1c1..d46f214 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -989,8 +989,7 @@ static void tcp_set_skb_tso_segs(const struct sock
> > *sk,
> > struct sk_buff *skb, /* Make sure we own this skb before messing
> > gso_size/gso_segs */ WARN_ON_ONCE(skb_cloned(skb));
> > 
> > -       if (skb->len <= mss_now || !sk_can_gso(sk) ||
> > -           skb->ip_summed == CHECKSUM_NONE) {
> > +       if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
> > 
> >                 /* Avoid the costly divide in the normal
> >                 
> >                  * non-TSO case.
> >                  */
> > 
> > ---------------------------------------------------------------------
> > 
> > When I revert it, even kernel v3.18-rc6 starts working.
> > But I doubt this is the root problem, may be just hiding another issue.
> > 
> > --- Sample output of socket_client using vanilla v3.12 kernel ---
> > [1417258063 SEND result: 4096, strerror: Success]
> > tcp max seg: res: 0, max_seg: 1370
> > [1417258063 SEND result: 4096, strerror: Success]
> > tcp max seg: res: 0, max_seg: 1370
> > [1417258063 SEND result: 4096, strerror: Success]
> > tcp max seg: res: 0, max_seg: 1370
> > [1417258063 SEND result: 4096, strerror: Success]
> > tcp max seg: res: 0, max_seg: 1370
> > [1417258063 SEND result: 4096, strerror: Success]
> > tcp max seg: res: 0, max_seg: 1338
> > [1417258063 SEND result: 4096, strerror: Success]
> > tcp max seg: res: 0, max_seg: 1338
> > *STUCK*
> > --------------------------------------------------------
> > 
> > The "machine" is running on KVM and using "virtio_net" as NIC driver.
> > I've played with the ethtool offload settings:
> > 
> > *** eth1 defaults ***
> > Offload parameters for eth1:
> > rx-checksumming: on
> > tx-checksumming: on
> > scatter-gather: on
> > tcp-segmentation-offload: on
> > udp-fragmentation-offload: on
> > generic-segmentation-offload: on
> > generic-receive-offload: on
> > large-receive-offload: off
> > 
> > *** eth1 working (no stalls) using vanilla kernel ***
> > Offload parameters for eth1:
> > rx-checksumming: on
> > tx-checksumming: off  <-- the magic switch
> > scatter-gather: off
> > tcp-segmentation-offload: off
> > udp-fragmentation-offload: off
> > generic-segmentation-offload: off
> > generic-receive-offload: off
> > large-receive-offload: off
> > 
> > When I turn "tx-checksumming" back on, it fails again.
> > Though that is probably also just a side effect.
> > 
> > I can provide tcpdumps if needed but they are no real help
> > since you can just see the kernel stops sending TCP packets.
> > (and the outgoing TCP packets are encrypted in ESP packets)
> > 
> > 
> > Any vague idea what might be the root cause?
> > 
> > I also tried reverting commit 4d53eff48b5f03ce67f4f301d6acca1d2145cb7a
> > ("xfrm: Don't queue retransmitted packets if the original is still on the
> > host") but that didn't change the situation. In fact it wasn't even
> > triggered.
> > 
> > Please CC: comments. Thanks.
> > 
> > Best regards,
> > Thomas
> 
> Regards,

Regards,
Wolfgang Walter Dec. 5, 2014, 12:09 p.m. UTC | #5
Hello,

as reverting this patch fixes this rather annoying problem: is it dangerous to 
revert it as a workaround until the root cause is found?


Am Montag, 1. Dezember 2014, 17:41:23 schrieb Wolfgang Walter:
> Am Montag, 1. Dezember 2014, 14:17:28 schrieb Wolfgang Walter:
> > Am Samstag, 29. November 2014, 12:44:07 schrieb Thomas Jarosch:
> > > Hello,
> > > 
> > > we're in the process of updating production level machines
> > > from kernel 3.4.101 to kernel 3.14.25. On one mail server
> > > we noticed that emails destined for an IPSec tunnel sometimes
> > > get stuck in the mail queue with TCP timeouts.
> > > 
> > > To make a long story short: When the VPN connection is initially
> > > set up or re-newed, the path MTU for the xfrm tunnel is undetermined.
> > > 
> > > As soon as a TCP client starts to send large packets,
> > > it triggers path MTU detection. Some middlebox on the
> > > way to the final server has a lower MTU and sends back
> > > an "ICMP fragmentation needed" packet as normal.
> > > 
> > > With the old kernel, the packet size for the TCP connection inside
> > > the xfrm tunnel gets adjusted and all is fine. With kernel v3.12+,
> > > the connection stalls completely. Same thing with kernel v3.18-rc6.
> > 
> > We see something similar with real nic (RTL8139). In our case only the
> > first tcp-connection which triggers PMTU stalls. Later tcp-connections
> > then work fine.
> > 
> > I will revert that patch and see if that fixes the problem.
> 
> Reverting the commit fixes the problem here, too.
> 
> > > We wrote a small tool to mimic postfix's TCP behavior (see attached
> > > file).
> > > In the end it's a normal TCP client sending large packets.
> > > The server side is just "socat - tcp4-listen:667".
> > > 
> > > If you run "socket_client" a second time, the path MTU
> > > for the xfrm tunnel is already known and packets flow normal, too.
> > > 
> > > 
> > > The "evil" commit in question is this one:
> > > ---------------------------------------------------------------------
> > > commit 8f26fb1c1ed81c33f5d87c5936f4d9d1b4118918
> > > Author: Eric Dumazet <edumazet@google.com>
> > > Date:   Tue Oct 15 12:24:54 2013 -0700
> > > 
> > >     tcp: remove the sk_can_gso() check from tcp_set_skb_tso_segs()
> > >     
> > >     sk_can_gso() should only be used as a hint in tcp_sendmsg() to build
> > >     GSO
> > > 
> > > packets in the first place. (As a performance hint)
> > > 
> > >     Once we have GSO packets in write queue, we can not decide they are
> > >     no
> > >     longer GSO only because flow now uses a route which doesn't handle
> > >     TSO/GSO.
> > >     
> > >     Core networking stack handles the case very well for us, all we need
> > >     is keeping track of packet counts in MSS terms, regardless of
> > >     segmentation done later (in GSO or hardware)
> > >     
> > >     Right now, if  tcp_fragment() splits a GSO packet in two parts,
> > >     @left and @right, and route changed through a non GSO device,
> > >     both @left and @right have pcount set to 1, which is wrong,
> > >     and leads to incorrect packet_count tracking.
> > >     
> > >     This problem was added in commit d5ac99a648 ("[TCP]: skb pcount with
> > >     MTU
> > > 
> > > discovery")
> > > 
> > >     Signed-off-by: Eric Dumazet <edumazet@google.com>
> > >     Signed-off-by: Neal Cardwell <ncardwell@google.com>
> > >     Signed-off-by: Yuchung Cheng <ycheng@google.com>
> > >     Reported-by: Maciej Żenczykowski <maze@google.com>
> > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > > 
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index 8fad1c1..d46f214 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -989,8 +989,7 @@ static void tcp_set_skb_tso_segs(const struct sock
> > > *sk,
> > > struct sk_buff *skb, /* Make sure we own this skb before messing
> > > gso_size/gso_segs */ WARN_ON_ONCE(skb_cloned(skb));
> > > 
> > > -       if (skb->len <= mss_now || !sk_can_gso(sk) ||
> > > -           skb->ip_summed == CHECKSUM_NONE) {
> > > +       if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
> > > 
> > >                 /* Avoid the costly divide in the normal
> > >                 
> > >                  * non-TSO case.
> > >                  */
> > > 
> > > ---------------------------------------------------------------------
> > > 
> > > When I revert it, even kernel v3.18-rc6 starts working.
> > > But I doubt this is the root problem, may be just hiding another issue.
> > > 
> > > --- Sample output of socket_client using vanilla v3.12 kernel ---
> > > [1417258063 SEND result: 4096, strerror: Success]
> > > tcp max seg: res: 0, max_seg: 1370
> > > [1417258063 SEND result: 4096, strerror: Success]
> > > tcp max seg: res: 0, max_seg: 1370
> > > [1417258063 SEND result: 4096, strerror: Success]
> > > tcp max seg: res: 0, max_seg: 1370
> > > [1417258063 SEND result: 4096, strerror: Success]
> > > tcp max seg: res: 0, max_seg: 1370
> > > [1417258063 SEND result: 4096, strerror: Success]
> > > tcp max seg: res: 0, max_seg: 1338
> > > [1417258063 SEND result: 4096, strerror: Success]
> > > tcp max seg: res: 0, max_seg: 1338
> > > *STUCK*
> > > --------------------------------------------------------
> > > 
> > > The "machine" is running on KVM and using "virtio_net" as NIC driver.
> > > I've played with the ethtool offload settings:
> > > 
> > > *** eth1 defaults ***
> > > Offload parameters for eth1:
> > > rx-checksumming: on
> > > tx-checksumming: on
> > > scatter-gather: on
> > > tcp-segmentation-offload: on
> > > udp-fragmentation-offload: on
> > > generic-segmentation-offload: on
> > > generic-receive-offload: on
> > > large-receive-offload: off
> > > 
> > > *** eth1 working (no stalls) using vanilla kernel ***
> > > Offload parameters for eth1:
> > > rx-checksumming: on
> > > tx-checksumming: off  <-- the magic switch
> > > scatter-gather: off
> > > tcp-segmentation-offload: off
> > > udp-fragmentation-offload: off
> > > generic-segmentation-offload: off
> > > generic-receive-offload: off
> > > large-receive-offload: off
> > > 
> > > When I turn "tx-checksumming" back on, it fails again.
> > > Though that is probably also just a side effect.
> > > 
> > > I can provide tcpdumps if needed but they are no real help
> > > since you can just see the kernel stops sending TCP packets.
> > > (and the outgoing TCP packets are encrypted in ESP packets)
> > > 
> > > 
> > > Any vague idea what might be the root cause?
> > > 
> > > I also tried reverting commit 4d53eff48b5f03ce67f4f301d6acca1d2145cb7a
> > > ("xfrm: Don't queue retransmitted packets if the original is still on
> > > the
> > > host") but that didn't change the situation. In fact it wasn't even
> > > triggered.
> > > 
> > > Please CC: comments. Thanks.
> > > 
> > > Best regards,
> > > Thomas
> > 
> > Regards,
> 
> Regards,

Regards,
Eric Dumazet Dec. 5, 2014, 1:26 p.m. UTC | #6
On Fri, 2014-12-05 at 13:09 +0100, Wolfgang Walter wrote:
> Hello,
> 
> as reverting this patch fixes this rather annoying problem: is it dangerous to 
> revert it as a workaround until the root cause is found?
> 

Unfortunately no, this patch fixes a serious issue.

We need to find the root cause of your problem instead of trying to work
around it.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Wolfgang Walter Dec. 8, 2014, 10:20 p.m. UTC | #7
Am Freitag, 5. Dezember 2014, 05:26:25 schrieb Eric Dumazet:
> On Fri, 2014-12-05 at 13:09 +0100, Wolfgang Walter wrote:
> > Hello,
> > 
> > as reverting this patch fixes this rather annoying problem: is it
> > dangerous to revert it as a workaround until the root cause is found?
> 
> Unfortunately no, this patch fixes a serious issue.
> 
> We need to find the root cause of your problem instead of trying to work
> around it.
> 

I only wanted to use it as local workaround here.


I looked a bit at at code. I'm not familiar with the network code, though :-).

When exactly should the gso handling with esp tunnel mode happen?

esp_output_done() calls xfrm_output_resume() and no gso handling is done. 
Would be to late anyway.
esp_output() does not handle gso.
xfrm4_mode_tunnel_output() does not do any gso handling as far as I can see.

xfrm_output() (from net/xfrm/xfrm_output) does but is it called with esp 
tunnel mode?

Regards,
Thomas Jarosch Dec. 9, 2014, 8:54 a.m. UTC | #8
On Monday, 8. December 2014 23:20:42 Wolfgang Walter wrote:
> Am Freitag, 5. Dezember 2014, 05:26:25 schrieb Eric Dumazet:
> > On Fri, 2014-12-05 at 13:09 +0100, Wolfgang Walter wrote:
> > > Hello,
> > > 
> > > as reverting this patch fixes this rather annoying problem: is it
> > > dangerous to revert it as a workaround until the root cause is found?
> > 
> > Unfortunately no, this patch fixes a serious issue.
> > 
> > We need to find the root cause of your problem instead of trying to work
> > around it.
> 
> I only wanted to use it as local workaround here.
> 
> 
> I looked a bit at at code. I'm not familiar with the network code, though
> :-).

If it helps, I'm running the reverted patch on five production boxes hitherto 
without a hiccup. As far as I understood the original commit message,
some packet counters might me wrong without it.

@Eric: What could possibly go wrong(tm)? :)

Of course a real fix is preferred over this band-aid.

Cheers,
Thomas

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Dec. 9, 2014, 2:26 p.m. UTC | #9
On Tue, 2014-12-09 at 09:54 +0100, Thomas Jarosch wrote:
> On Monday, 8. December 2014 23:20:42 Wolfgang Walter wrote:
> > Am Freitag, 5. Dezember 2014, 05:26:25 schrieb Eric Dumazet:
> > > On Fri, 2014-12-05 at 13:09 +0100, Wolfgang Walter wrote:
> > > > Hello,
> > > > 
> > > > as reverting this patch fixes this rather annoying problem: is it
> > > > dangerous to revert it as a workaround until the root cause is found?
> > > 
> > > Unfortunately no, this patch fixes a serious issue.
> > > 
> > > We need to find the root cause of your problem instead of trying to work
> > > around it.
> > 
> > I only wanted to use it as local workaround here.
> > 
> > 
> > I looked a bit at at code. I'm not familiar with the network code, though
> > :-).
> 
> If it helps, I'm running the reverted patch on five production boxes hitherto 
> without a hiccup. As far as I understood the original commit message,
> some packet counters might me wrong without it.
> 
> @Eric: What could possibly go wrong(tm)? :)

Crashes in TCP stack, because of packet count mismatches.

The sk_can_gso() status is already tested in tcp_sendmsg() as a hint,
since path behavior can dynamically be changed on existing flow :

<start a TCP flow>
ethtool -K eth0 tso off gso off

In this case, core networking stack detects this and segments the
packets _after_ TCP or IP stack, before they reach eth0.

TCP stack does not have to know that something is changed right before
giving a GSO packet to core networking stack, this would be racy by
nature, as TCP does not know or control full path. Hopefully we do not
take RTNL for every packet we send in TCP !

It seems XFRM triggers in a slow path something which is not correctly
handled.

It is not correct to add a racy kludge in TCP fast path for this very
unlikely case.

I would disable TSO/GSO on xfrm, and problem should disappear.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Jarosch Dec. 9, 2014, 2:49 p.m. UTC | #10
On Tuesday, 9. December 2014 06:26:49 Eric Dumazet wrote:
> > If it helps, I'm running the reverted patch on five production boxes
> > hitherto without a hiccup. As far as I understood the original commit
> > message, some packet counters might me wrong without it.
> > 
> > @Eric: What could possibly go wrong(tm)? :)
> 
> Crashes in TCP stack, because of packet count mismatches.

alright, that sounds like a pretty good argument.

> ...
> I would disable TSO/GSO on xfrm, and problem should disappear.

I guess you can't explicitly disable this with the "ip xfrm" command?
Or do you mean this should be disabled on the ethX device
serving the xfrm connection?

We are about to push out this code to ten more machines,
so the best time (for me) to do any changes  that increases
stability would be now :o)

Cheers,
Thomas

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Wolfgang Walter Dec. 9, 2014, 8:36 p.m. UTC | #11
Am Dienstag, 9. Dezember 2014, 06:26:49 schrieb Eric Dumazet:
> On Tue, 2014-12-09 at 09:54 +0100, Thomas Jarosch wrote:
> > On Monday, 8. December 2014 23:20:42 Wolfgang Walter wrote:
> > > Am Freitag, 5. Dezember 2014, 05:26:25 schrieb Eric Dumazet:
> > > > On Fri, 2014-12-05 at 13:09 +0100, Wolfgang Walter wrote:
> > > > > Hello,
> > > > > 
> > > > > as reverting this patch fixes this rather annoying problem: is it
> > > > > dangerous to revert it as a workaround until the root cause is
> > > > > found?
> > > > 
> > > > Unfortunately no, this patch fixes a serious issue.
> > > > 
> > > > We need to find the root cause of your problem instead of trying to
> > > > work
> > > > around it.
> > > 
> > > I only wanted to use it as local workaround here.
> > > 
> > > 
> > > I looked a bit at at code. I'm not familiar with the network code,
> > > though
> > > 
> > > :-).
> > 
> > If it helps, I'm running the reverted patch on five production boxes
> > hitherto without a hiccup. As far as I understood the original commit
> > message, some packet counters might me wrong without it.
> > 
> > @Eric: What could possibly go wrong(tm)? :)
> 
> Crashes in TCP stack, because of packet count mismatches.
> 
> The sk_can_gso() status is already tested in tcp_sendmsg() as a hint,
> since path behavior can dynamically be changed on existing flow :
> 
> <start a TCP flow>
> ethtool -K eth0 tso off gso off
> 
> In this case, core networking stack detects this and segments the
> packets _after_ TCP or IP stack, before they reach eth0.
> 
> TCP stack does not have to know that something is changed right before
> giving a GSO packet to core networking stack, this would be racy by
> nature, as TCP does not know or control full path. Hopefully we do not
> take RTNL for every packet we send in TCP !
> 
> It seems XFRM triggers in a slow path something which is not correctly
> handled.
> 
> It is not correct to add a racy kludge in TCP fast path for this very
> unlikely case.
> 
> I would disable TSO/GSO on xfrm, and problem should disappear.

How would that be done? I found no way to disable it especially for xfrm. I 
disabled gso for the interface which serves the ipsec traffic but this does 
not help. tcp still uses gso for the esp tunnel.

I put a view printk's in net/xfrm/xfrm_output.c and net/ipv4/tcp_output.c. (I 
try to understand where in the xfrm transformation gso is handeled).

What I can say yet is:

xfrm_output() is used with ipsec (esp) tunnel mode but at I never see gso 
packets here. xfrm_output_gso() is never called.

Everytime tcp_set_skb_tso_segs() is called for a tcp connection over the esp-
tunnel and it is a gso case then the tcp connection hangs. Those packets 
always have skb->len 1398 and mss_now is 1374. I see a call of xfrm_output() 
afterwards but for a packet of skb->len 52 (maybe ACK from other direction?).

As long as the tcp-connection over the ipsec-tunnel works and if I send bulk 
traffic xfrm_output() is called 3 times with packet skb->len 1426 and then one 
time with 78 (maybe other direction?), don't know if that is of any interest.


With non-ipsec-traffic gso works fine: in this case the skb->len() varies a 
lot and mss_now is always 1288.


Regards,
diff mbox

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8fad1c1..d46f214 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -989,8 +989,7 @@  static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
        /* Make sure we own this skb before messing gso_size/gso_segs */
        WARN_ON_ONCE(skb_cloned(skb));
 
-       if (skb->len <= mss_now || !sk_can_gso(sk) ||
-           skb->ip_summed == CHECKSUM_NONE) {
+       if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
                /* Avoid the costly divide in the normal
                 * non-TSO case.
                 */