diff mbox

Stable regression with 'tcp: allow splice() to build full TSO packets'

Message ID 20120517150157.GA19274@1wt.eu
State Deferred, archived
Delegated to: David Miller
Headers show

Commit Message

Willy Tarreau May 17, 2012, 3:01 p.m. UTC
Hi Eric,

On Thu, May 17, 2012 at 02:18:00PM +0200, Willy Tarreau wrote:
> Hi Eric,
> 
> I'm facing a regression in stable 3.2.17 and 3.0.31 which is
> exhibited by your patch 'tcp: allow splice() to build full TSO
> packets' which unfortunately I am very interested in !
> 
> What I'm observing is that TCP transmits using splice() stall
> quite quickly if I'm using pipes larger than 64kB (even 65537
> is enough to reliably observe the stall).
(...)

I managed to fix the issue and I really think that the fix makes sense.
I'm appending the patch, please could you review it and if approved,
push it for inclusion ?

BTW, your patch significantly improves performance here. On this
machine I was reaching max 515 Mbps of proxied traffic, and with
the patch I reach 665 Mbps with the same test ! I think you managed
to fix what caused splice() to always be slightly slower than
recv/send() for a long time in my tests with a number of NICs !

Thanks,
Willy

-------

From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 17 May 2012 16:48:56 +0200
Subject: [PATCH] tcp: force push data out when buffers are missing

Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
significantly improved splice() performance for some workloads but
caused stalls when pipe buffers were larger than socket buffers.

The issue seems to happen when no data can be copied at all due to
lack of buffers, which results in pending data never being pushed.

This change checks if all pending data has been pushed or not and
pushes them when waiting for send buffers.
---
 net/ipv4/tcp.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

Comments

Eric Dumazet May 17, 2012, 3:43 p.m. UTC | #1
On Thu, 2012-05-17 at 17:01 +0200, Willy Tarreau wrote:
> Hi Eric,
> 
> On Thu, May 17, 2012 at 02:18:00PM +0200, Willy Tarreau wrote:
> > Hi Eric,
> > 
> > I'm facing a regression in stable 3.2.17 and 3.0.31 which is
> > exhibited by your patch 'tcp: allow splice() to build full TSO
> > packets' which unfortunately I am very interested in !
> > 
> > What I'm observing is that TCP transmits using splice() stall
> > quite quickly if I'm using pipes larger than 64kB (even 65537
> > is enough to reliably observe the stall).
> (...)
> 
> I managed to fix the issue and I really think that the fix makes sense.
> I'm appending the patch, please could you review it and if approved,
> push it for inclusion ?
> 
> BTW, your patch significantly improves performance here. On this
> machine I was reaching max 515 Mbps of proxied traffic, and with
> the patch I reach 665 Mbps with the same test ! I think you managed
> to fix what caused splice() to always be slightly slower than
> recv/send() for a long time in my tests with a number of NICs !
> 

What NIC do you use exactly ?

With commit 1d0c0b328a6 in net-next
(net: makes skb_splice_bits() aware of skb->head_frag)
You'll be able to get even more speed, if NIC uses frag to hold frame.

> Thanks,
> Willy
> 
> -------
> 
> From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 17 May 2012 16:48:56 +0200
> Subject: [PATCH] tcp: force push data out when buffers are missing
> 
> Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
> significantly improved splice() performance for some workloads but
> caused stalls when pipe buffers were larger than socket buffers.
> 

But... This seems bug was already there ? Only having more data in pipe
trigger it more often ?

> The issue seems to happen when no data can be copied at all due to
> lack of buffers, which results in pending data never being pushed.
> 
> This change checks if all pending data has been pushed or not and
> pushes them when waiting for send buffers.
> ---
>  net/ipv4/tcp.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 80b988f..f6d9e5f 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -850,7 +850,7 @@ new_segment:
>  wait_for_sndbuf:
>  		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>  wait_for_memory:
> -		if (copied)
> +		if (tcp_sk(sk)->pushed_seq != tcp_sk(sk)->write_seq)
>  			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
>  
>  		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)

Can you check you have commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 
( tcp: tcp_sendpages() should call tcp_push() once )


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Willy Tarreau May 17, 2012, 3:56 p.m. UTC | #2
On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 17:01 +0200, Willy Tarreau wrote:
> > Hi Eric,
> > 
> > On Thu, May 17, 2012 at 02:18:00PM +0200, Willy Tarreau wrote:
> > > Hi Eric,
> > > 
> > > I'm facing a regression in stable 3.2.17 and 3.0.31 which is
> > > exhibited by your patch 'tcp: allow splice() to build full TSO
> > > packets' which unfortunately I am very interested in !
> > > 
> > > What I'm observing is that TCP transmits using splice() stall
> > > quite quickly if I'm using pipes larger than 64kB (even 65537
> > > is enough to reliably observe the stall).
> > (...)
> > 
> > I managed to fix the issue and I really think that the fix makes sense.
> > I'm appending the patch, please could you review it and if approved,
> > push it for inclusion ?
> > 
> > BTW, your patch significantly improves performance here. On this
> > machine I was reaching max 515 Mbps of proxied traffic, and with
> > the patch I reach 665 Mbps with the same test ! I think you managed
> > to fix what caused splice() to always be slightly slower than
> > recv/send() for a long time in my tests with a number of NICs !
> > 
> 
> What NIC do you use exactly ?

It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC
driver is mv643xx. It's the same hardware you find in sheevaplugs, guruplugs,
dreamplugs and almost all ARM-based cheap NAS boxes.

> With commit 1d0c0b328a6 in net-next
> (net: makes skb_splice_bits() aware of skb->head_frag)
> You'll be able to get even more speed, if NIC uses frag to hold frame.

I'm going to check this now, sounds interesting :-)

The NIC does not support TSO but I've seen an alternate driver for this
NIC which pretends to do TSO and in fact builds header frags so that the
NIC is able to send all frames at once. I think it's already what GSO is
doing but I'm wondering whether it would be possible to get more speed
by doing this than by relying on GSO to (possibly) split the frags earlier.

Note that I'm not quite skilled on this, so do not hesitate to correct me
if I'm saying stupid things.

(...)

> > Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
> > significantly improved splice() performance for some workloads but
> > caused stalls when pipe buffers were larger than socket buffers.
> 
> But... This seems bug was already there ? Only having more data in pipe
> trigger it more often ?

I honnestly don't know. As I said in the initial mail, I suspect this is
the case because I don't see in your patch what could cause the bug, but
I suspect it makes it more frequent. Still, I could not manage to reproduce
the bug without your patch. So maybe we switched from just below a limit
to just over.

> > The issue seems to happen when no data can be copied at all due to
> > lack of buffers, which results in pending data never being pushed.
> > 
> > This change checks if all pending data has been pushed or not and
> > pushes them when waiting for send buffers.
> > ---
> >  net/ipv4/tcp.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 80b988f..f6d9e5f 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -850,7 +850,7 @@ new_segment:
> >  wait_for_sndbuf:
> >  		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> >  wait_for_memory:
> > -		if (copied)
> > +		if (tcp_sk(sk)->pushed_seq != tcp_sk(sk)->write_seq)
> >  			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
> >  
> >  		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
> 
> Can you check you have commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 
> ( tcp: tcp_sendpages() should call tcp_push() once )

Yes I have it, in fact I'm on -stable (3.0.31) which includes your backport
of the two patches in a single one. It's just a few lines below, out of the
context of this patch.

Best regards,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 17, 2012, 4:33 p.m. UTC | #3
On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote:
> On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote:

> It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC
> driver is mv643xx. It's the same hardware you find in sheevaplugs, guruplugs,
> dreamplugs and almost all ARM-based cheap NAS boxes.
> 
> > With commit 1d0c0b328a6 in net-next
> > (net: makes skb_splice_bits() aware of skb->head_frag)
> > You'll be able to get even more speed, if NIC uses frag to hold frame.
> 
> I'm going to check this now, sounds interesting :-)

splice(socket -> pipe) does a copy of frame from skb->head to a page
fragment.

With latest patches (net-next), we can provide a frag for skb->head and
avoid this copy in splice(). You would have a true zero copy
socket->socket mode.

tg3 drivers uses this new thing, mv643xx_eth.c could be changed as well.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Willy Tarreau May 17, 2012, 4:40 p.m. UTC | #4
On Thu, May 17, 2012 at 06:33:24PM +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote:
> > On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote:
> 
> > It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC
> > driver is mv643xx. It's the same hardware you find in sheevaplugs, guruplugs,
> > dreamplugs and almost all ARM-based cheap NAS boxes.
> > 
> > > With commit 1d0c0b328a6 in net-next
> > > (net: makes skb_splice_bits() aware of skb->head_frag)
> > > You'll be able to get even more speed, if NIC uses frag to hold frame.
> > 
> > I'm going to check this now, sounds interesting :-)
> 
> splice(socket -> pipe) does a copy of frame from skb->head to a page
> fragment.
> 
> With latest patches (net-next), we can provide a frag for skb->head and
> avoid this copy in splice(). You would have a true zero copy
> socket->socket mode.

That's what I've read in your commit messages. Indeed, we've been waiting
for this to happen for a long time. It should bring a real benefit on small
devices like the one I'm playing with which have very low memory bandwidth,
because it's a shame to copy the data and thrash the cache when we don't
need to see them.

I'm currently trying to get mainline to boot then I'll switch to net-next.

> tg3 drivers uses this new thing, mv643xx_eth.c could be changed as well.

I'll try to port your patch (8d4057 tg3: provide frags as skb head) to
mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll
send a patch :-)

Cheers,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 17, 2012, 4:47 p.m. UTC | #5
On Thu, 2012-05-17 at 18:40 +0200, Willy Tarreau wrote:
> On Thu, May 17, 2012 at 06:33:24PM +0200, Eric Dumazet wrote:
> > On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote:
> > > On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote:
> > 
> > > It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC
> > > driver is mv643xx. It's the same hardware you find in sheevaplugs, guruplugs,
> > > dreamplugs and almost all ARM-based cheap NAS boxes.
> > > 
> > > > With commit 1d0c0b328a6 in net-next
> > > > (net: makes skb_splice_bits() aware of skb->head_frag)
> > > > You'll be able to get even more speed, if NIC uses frag to hold frame.
> > > 
> > > I'm going to check this now, sounds interesting :-)
> > 
> > splice(socket -> pipe) does a copy of frame from skb->head to a page
> > fragment.
> > 
> > With latest patches (net-next), we can provide a frag for skb->head and
> > avoid this copy in splice(). You would have a true zero copy
> > socket->socket mode.
> 
> That's what I've read in your commit messages. Indeed, we've been waiting
> for this to happen for a long time. It should bring a real benefit on small
> devices like the one I'm playing with which have very low memory bandwidth,
> because it's a shame to copy the data and thrash the cache when we don't
> need to see them.
> 
> I'm currently trying to get mainline to boot then I'll switch to net-next.
> 
> > tg3 drivers uses this new thing, mv643xx_eth.c could be changed as well.
> 
> I'll try to port your patch (8d4057 tg3: provide frags as skb head) to
> mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll
> send a patch :-)

In fact I could post a generic patch, to make netdev_alloc_skb() do this
automatically for all network drivers.

Its probably less than 30 lines of code.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 17, 2012, 4:49 p.m. UTC | #6
On Thu, 2012-05-17 at 18:40 +0200, Willy Tarreau wrote:
>  try to port your patch (8d4057 tg3: provide frags as skb head) to
> mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll
> send a patch :-)


tg3 uses build_skb(), thats an API that allocates the skb after frame is
filled by the device, to have less cache misses in RX path.

But for most devices, netdev_alloc_skb() would be more than enough.

I'll send a patch once tested.

Thanks


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Willy Tarreau May 17, 2012, 5:22 p.m. UTC | #7
On Thu, May 17, 2012 at 06:49:47PM +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 18:40 +0200, Willy Tarreau wrote:
> >  try to port your patch (8d4057 tg3: provide frags as skb head) to
> > mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll
> > send a patch :-)
> 
> 
> tg3 uses build_skb(), thats an API that allocates the skb after frame is
> filled by the device, to have less cache misses in RX path.
> 
> But for most devices, netdev_alloc_skb() would be more than enough.

OK, so that's out of reach for me right now.

> I'll send a patch once tested.

I'll happily give it a try.

BTW, net-next gave me around 13-14% improvement over 3.0/3.2+splice
patches ! Now I'm seeing splice() being 75% faster than recv/send, while
it used to be slower on this hardware not that long ago! That's really
impressive, great work!

Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Hutchings May 17, 2012, 6:38 p.m. UTC | #8
On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote:
[...]
> The NIC does not support TSO but I've seen an alternate driver for this
> NIC which pretends to do TSO and in fact builds header frags so that the
> NIC is able to send all frames at once. I think it's already what GSO is
> doing but I'm wondering whether it would be possible to get more speed
> by doing this than by relying on GSO to (possibly) split the frags earlier.
[...]

Yes, GSO has some overhead for skb allocation and additional function
calls that you can avoid by doing 'TSO' in the driver.

Ben.
David Miller May 17, 2012, 7:55 p.m. UTC | #9
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 17 May 2012 17:01:57 +0200

>>From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 17 May 2012 16:48:56 +0200
> Subject: [PATCH] tcp: force push data out when buffers are missing
> 
> Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
> significantly improved splice() performance for some workloads but
> caused stalls when pipe buffers were larger than socket buffers.
> 
> The issue seems to happen when no data can be copied at all due to
> lack of buffers, which results in pending data never being pushed.
> 
> This change checks if all pending data has been pushed or not and
> pushes them when waiting for send buffers.

Eric, please indicate whether we need Willy's patch here.

I want to propagate this fix as fast as possible if so.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Willy Tarreau May 17, 2012, 8:04 p.m. UTC | #10
Hi David,

On Thu, May 17, 2012 at 03:55:03PM -0400, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 17 May 2012 17:01:57 +0200
> 
> >>From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001
> > From: Willy Tarreau <w@1wt.eu>
> > Date: Thu, 17 May 2012 16:48:56 +0200
> > Subject: [PATCH] tcp: force push data out when buffers are missing
> > 
> > Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
> > significantly improved splice() performance for some workloads but
> > caused stalls when pipe buffers were larger than socket buffers.
> > 
> > The issue seems to happen when no data can be copied at all due to
> > lack of buffers, which results in pending data never being pushed.
> > 
> > This change checks if all pending data has been pushed or not and
> > pushes them when waiting for send buffers.
> 
> Eric, please indicate whether we need Willy's patch here.
> 
> I want to propagate this fix as fast as possible if so.

I think you should hold off for now, because it's possible that my patch
hides another issue instead of fixing it.

I'm having the same stall issue again since I applied Eric's build_skb
patch, but not for all data sizes. So if the same issue is still there,
it's possible that we're playing hide-and-seek with it. That's rather
strange.

Thanks,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller May 17, 2012, 8:07 p.m. UTC | #11
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 17 May 2012 22:04:04 +0200

> I'm having the same stall issue again since I applied Eric's build_skb
> patch, but not for all data sizes. So if the same issue is still there,
> it's possible that we're playing hide-and-seek with it. That's rather
> strange.

Ok, a Heisenbug :-)  Let me know when you guys resolve this.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 80b988f..f6d9e5f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -850,7 +850,7 @@  new_segment:
 wait_for_sndbuf:
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 wait_for_memory:
-		if (copied)
+		if (tcp_sk(sk)->pushed_seq != tcp_sk(sk)->write_seq)
 			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
 
 		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)