diff mbox

tcp: Do not apply TSO segment limit to non-TSO packets

Message ID 20141231133923.GA30248@gondor.apana.org.au
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Herbert Xu Dec. 31, 2014, 1:39 p.m. UTC
On Mon, Dec 01, 2014 at 06:25:22PM +0800, Herbert Xu wrote:
> Thomas Jarosch <thomas.jarosch@intra2net.com> wrote:
> > 
> > When I revert it, even kernel v3.18-rc6 starts working.
> > But I doubt this is the root problem, may be just hiding another issue.
> 
> Can you do a tcpdump with this patch reverted? I would like to
> see the size of the packets that are sent out vs. the ICMP message
> that came back.

Thanks for providing the data.  Here is a patch that should fix
the problem.

-- >8 --
Thomas Jarosch reported IPsec TCP stalls when a PMTU event occurs.

In fact the problem was completely unrelated to IPsec.  The bug is
also reproducible if you just disable TSO/GSO.

The problem is that when the MSS goes down, existing queued packet
on the TX queue that have not been transmitted yet all look like
TSO packets and get treated as such.

This then triggers a bug where tcp_mss_split_point tells us to
generate a zero-sized packet on the TX queue.  Once that happens
we're screwed because the zero-sized packet can never be removed
by ACKs.

Fixes: 1485348d242 ("tcp: Apply device TSO segment limit earlier")
Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>


Cheers,

Comments

Herbert Xu Dec. 31, 2014, 1:42 p.m. UTC | #1
On Thu, Jan 01, 2015 at 12:39:23AM +1100, Herbert Xu wrote:
>
> Thomas Jarosch reported IPsec TCP stalls when a PMTU event occurs.
> 
> In fact the problem was completely unrelated to IPsec.  The bug is
> also reproducible if you just disable TSO/GSO.

This raises two interesting questions.

Firstly not many people test non-TSO code paths anymore so bugs
are likely to persist for a long time there.  Perhaps it's time
to remove the non-TSO code path altogether? The GSO code path
should provide enough speed-up in terms of boosting the effective
MTU to offset the cost of copying.

Secondly why are we dealing with hardware TSO segment limits
by limiting the size of the TSO packet in the TCP stack? Surely
in this case GSO is free since there won't be any copying?

Cheers,
Eric Dumazet Jan. 2, 2015, 6:24 p.m. UTC | #2
On Thu, 2015-01-01 at 00:42 +1100, Herbert Xu wrote:
> On Thu, Jan 01, 2015 at 12:39:23AM +1100, Herbert Xu wrote:
> >
> > Thomas Jarosch reported IPsec TCP stalls when a PMTU event occurs.
> > 
> > In fact the problem was completely unrelated to IPsec.  The bug is
> > also reproducible if you just disable TSO/GSO.
> 
> This raises two interesting questions.
> 
> Firstly not many people test non-TSO code paths anymore so bugs
> are likely to persist for a long time there.  Perhaps it's time
> to remove the non-TSO code path altogether? The GSO code path
> should provide enough speed-up in terms of boosting the effective
> MTU to offset the cost of copying.

> Secondly why are we dealing with hardware TSO segment limits
> by limiting the size of the TSO packet in the TCP stack? Surely
> in this case GSO is free since there won't be any copying?

It might depends on the device capabilities.

Non TSO/GSO path is known to be better for devices unable to perform TX
checksumming, as we compute the checksum at the time we copy data from
user to kernel (csum_and_copy_from_user() from tcp_sendmsg())).

With BQL+TSQ, having to compute the TX hash means bringing data into cpu
caches a second time right before ndo_start_xmit()

But maybe this gain is very relative in a full blown configuration, with
netfilter / complex qdisc being used.

Thanks Herbert !


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Jan. 2, 2015, 8:36 p.m. UTC | #3
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 02 Jan 2015 10:24:00 -0800

> On Thu, 2015-01-01 at 00:42 +1100, Herbert Xu wrote:
>> Firstly not many people test non-TSO code paths anymore so bugs
>> are likely to persist for a long time there.  Perhaps it's time
>> to remove the non-TSO code path altogether? The GSO code path
>> should provide enough speed-up in terms of boosting the effective
>> MTU to offset the cost of copying.
> 
>> Secondly why are we dealing with hardware TSO segment limits
>> by limiting the size of the TSO packet in the TCP stack? Surely
>> in this case GSO is free since there won't be any copying?
> 
> It might depends on the device capabilities.
> 
> Non TSO/GSO path is known to be better for devices unable to perform TX
> checksumming, as we compute the checksum at the time we copy data from
> user to kernel (csum_and_copy_from_user() from tcp_sendmsg())).

Non-SG capable devices suffer in this scenerio as well.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Jan. 2, 2015, 9:13 p.m. UTC | #4
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 1 Jan 2015 00:39:23 +1100

> Thomas Jarosch reported IPsec TCP stalls when a PMTU event occurs.
> 
> In fact the problem was completely unrelated to IPsec.  The bug is
> also reproducible if you just disable TSO/GSO.
> 
> The problem is that when the MSS goes down, existing queued packet
> on the TX queue that have not been transmitted yet all look like
> TSO packets and get treated as such.
> 
> This then triggers a bug where tcp_mss_split_point tells us to
> generate a zero-sized packet on the TX queue.  Once that happens
> we're screwed because the zero-sized packet can never be removed
> by ACKs.
> 
> Fixes: 1485348d242 ("tcp: Apply device TSO segment limit earlier")
> Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied and queued up for -stable, thanks Herbert.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Jan. 2, 2015, 10:01 p.m. UTC | #5
On Fri, Jan 02, 2015 at 03:36:55PM -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 02 Jan 2015 10:24:00 -0800
>
> > Non TSO/GSO path is known to be better for devices unable to perform TX
> > checksumming, as we compute the checksum at the time we copy data from
> > user to kernel (csum_and_copy_from_user() from tcp_sendmsg())).
> 
> Non-SG capable devices suffer in this scenerio as well.

Yes I was referring to using GSO on non-SG/non-checksumming devices.
Anything that supports checksum/SG should obviously be using GSO.

IIRC when I first tested this GSO is basically on par for the non-SG
case as the overhead of the extra copying was offset by the benefit
of a larger MTU through the stack.

So has anyone actually observed worse performance with GSO on these
devices (you could even test on a GSO-capable device simply by
disabling checksumming)?

Also the fact that this bug took two years to surface means that
very few people are actually using non-GSO in the real world as
this bug is easily triggered by a PMTU event.

Cheers,
David Miller Jan. 2, 2015, 10:06 p.m. UTC | #6
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 3 Jan 2015 09:01:07 +1100

> So has anyone actually observed worse performance with GSO on these
> devices (you could even test on a GSO-capable device simply by
> disabling checksumming)?

Good question.

> Also the fact that this bug took two years to surface means that
> very few people are actually using non-GSO in the real world as
> this bug is easily triggered by a PMTU event.

I think the rarity of PMTU events on non-VPN'd connections plays
a part in how long it took as well.

But I totally accept your point, for sure.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Jan. 2, 2015, 10:09 p.m. UTC | #7
On Fri, Jan 02, 2015 at 05:06:29PM -0500, David Miller wrote:
>
> I think the rarity of PMTU events on non-VPN'd connections plays
> a part in how long it took as well.

Maybe in your neck of the woods but certainly in China I observe
loads of PMTU events without involving any VPNs at all :)

In fact even the fibre connections here use PPPOE so PMTU is
sort of unavoidable.

Cheers,
Thomas Jarosch Jan. 19, 2015, 1:39 p.m. UTC | #8
Hi Herbert,

On Thursday, 1. January 2015 00:39:23 Herbert Xu wrote:
> The problem is that when the MSS goes down, existing queued packet
> on the TX queue that have not been transmitted yet all look like
> TSO packets and get treated as such.
> 
> This then triggers a bug where tcp_mss_split_point tells us to
> generate a zero-sized packet on the TX queue.  Once that happens
> we're screwed because the zero-sized packet can never be removed
> by ACKs.

picking up this one again: Is there any valid use case to have
zero-sized packets in the TX queue? If not, may be a WARN_ON() could
be added to the processing of the TX queue. That would help to
prevent future issues like this.

Cheers,
Thomas

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Jan. 19, 2015, 10:36 p.m. UTC | #9
On Mon, Jan 19, 2015 at 09:19:48AM -0800, Eric Dumazet wrote:
> Zero sized packets are valid, just take a look at tcpdump ;)

But zero-sized packets in the send queue are not valid and will
not ever be cleaned, which is what triggered the original bug.

Although I'm not sure whether a WARN_ON will actually help since
it won't tell you why there is a zero-sized packet on the queue,
just the fact that there is.  You'll know you've got a zero-sized
packet on the queue anyway by looking at the tcpdump.

Cheers,
Eric Dumazet Jan. 19, 2015, 10:38 p.m. UTC | #10
On Mon, Jan 19, 2015 at 2:36 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Mon, Jan 19, 2015 at 09:19:48AM -0800, Eric Dumazet wrote:
>> Zero sized packets are valid, just take a look at tcpdump ;)
>
> But zero-sized packets in the send queue are not valid and will
> not ever be cleaned, which is what triggered the original bug.

SYN and FIN packets are in the send queue.

They are valid and might have a zero size.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Jan. 19, 2015, 10:40 p.m. UTC | #11
On Mon, Jan 19, 2015 at 02:38:47PM -0800, Eric Dumazet wrote:
> 
> SYN and FIN packets are in the send queue.
> 
> They are valid and might have a zero size.

Right, obviously I meant packets in between the SYN and the FIN.
Technically SYN/FIN are not really zero-sided since they carry
one bit of information and yes they do get cleaned while the ones
in the middle will not.

Cheers,
diff mbox

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7f18262..65caf8b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2019,7 +2019,7 @@  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
 			break;
 
-		if (tso_segs == 1) {
+		if (tso_segs == 1 || !max_segs) {
 			if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
 						     (tcp_skb_is_last(sk, skb) ?
 						      nonagle : TCP_NAGLE_PUSH))))
@@ -2032,7 +2032,7 @@  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		}
 
 		limit = mss_now;
-		if (tso_segs > 1 && !tcp_urg_mode(tp))
+		if (tso_segs > 1 && max_segs && !tcp_urg_mode(tp))
 			limit = tcp_mss_split_point(sk, skb, mss_now,
 						    min_t(unsigned int,
 							  cwnd_quota,