Message ID | 20120517150157.GA19274@1wt.eu |
---|---|
State | Deferred, archived |
Delegated to: | David Miller |
Headers | show |
On Thu, 2012-05-17 at 17:01 +0200, Willy Tarreau wrote: > Hi Eric, > > On Thu, May 17, 2012 at 02:18:00PM +0200, Willy Tarreau wrote: > > Hi Eric, > > > > I'm facing a regression in stable 3.2.17 and 3.0.31 which is > > exhibited by your patch 'tcp: allow splice() to build full TSO > > packets' which unfortunately I am very interested in ! > > > > What I'm observing is that TCP transmits using splice() stall > > quite quickly if I'm using pipes larger than 64kB (even 65537 > > is enough to reliably observe the stall). > (...) > > I managed to fix the issue and I really think that the fix makes sense. > I'm appending the patch, please could you review it and if approved, > push it for inclusion ? > > BTW, your patch significantly improves performance here. On this > machine I was reaching max 515 Mbps of proxied traffic, and with > the patch I reach 665 Mbps with the same test ! I think you managed > to fix what caused splice() to always be slightly slower than > recv/send() for a long time in my tests with a number of NICs ! > What NIC do you use exactly ? With commit 1d0c0b328a6 in net-next (net: makes skb_splice_bits() aware of skb->head_frag) You'll be able to get even more speed, if NIC uses frag to hold frame. > Thanks, > Willy > > ------- > > From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001 > From: Willy Tarreau <w@1wt.eu> > Date: Thu, 17 May 2012 16:48:56 +0200 > Subject: [PATCH] tcp: force push data out when buffers are missing > > Commit 2f533844242 (tcp: allow splice() to build full TSO packets) > significantly improved splice() performance for some workloads but > caused stalls when pipe buffers were larger than socket buffers. > But... This seems bug was already there ? Only having more data in pipe trigger it more often ? > The issue seems to happen when no data can be copied at all due to > lack of buffers, which results in pending data never being pushed. > > This change checks if all pending data has been pushed or not and > pushes them when waiting for send buffers. > --- > net/ipv4/tcp.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 80b988f..f6d9e5f 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -850,7 +850,7 @@ new_segment: > wait_for_sndbuf: > set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); > wait_for_memory: > - if (copied) > + if (tcp_sk(sk)->pushed_seq != tcp_sk(sk)->write_seq) > tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH); > > if ((err = sk_stream_wait_memory(sk, &timeo)) != 0) Can you check you have commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 ( tcp: tcp_sendpages() should call tcp_push() once ) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote: > On Thu, 2012-05-17 at 17:01 +0200, Willy Tarreau wrote: > > Hi Eric, > > > > On Thu, May 17, 2012 at 02:18:00PM +0200, Willy Tarreau wrote: > > > Hi Eric, > > > > > > I'm facing a regression in stable 3.2.17 and 3.0.31 which is > > > exhibited by your patch 'tcp: allow splice() to build full TSO > > > packets' which unfortunately I am very interested in ! > > > > > > What I'm observing is that TCP transmits using splice() stall > > > quite quickly if I'm using pipes larger than 64kB (even 65537 > > > is enough to reliably observe the stall). > > (...) > > > > I managed to fix the issue and I really think that the fix makes sense. > > I'm appending the patch, please could you review it and if approved, > > push it for inclusion ? > > > > BTW, your patch significantly improves performance here. On this > > machine I was reaching max 515 Mbps of proxied traffic, and with > > the patch I reach 665 Mbps with the same test ! I think you managed > > to fix what caused splice() to always be slightly slower than > > recv/send() for a long time in my tests with a number of NICs ! > > > > What NIC do you use exactly ? It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC driver is mv643xx. It's the same hardware you find in sheevaplugs, guruplugs, dreamplugs and almost all ARM-based cheap NAS boxes. > With commit 1d0c0b328a6 in net-next > (net: makes skb_splice_bits() aware of skb->head_frag) > You'll be able to get even more speed, if NIC uses frag to hold frame. I'm going to check this now, sounds interesting :-) The NIC does not support TSO but I've seen an alternate driver for this NIC which pretends to do TSO and in fact builds header frags so that the NIC is able to send all frames at once. I think it's already what GSO is doing but I'm wondering whether it would be possible to get more speed by doing this than by relying on GSO to (possibly) split the frags earlier. Note that I'm not quite skilled on this, so do not hesitate to correct me if I'm saying stupid things. (...) > > Commit 2f533844242 (tcp: allow splice() to build full TSO packets) > > significantly improved splice() performance for some workloads but > > caused stalls when pipe buffers were larger than socket buffers. > > But... This seems bug was already there ? Only having more data in pipe > trigger it more often ? I honnestly don't know. As I said in the initial mail, I suspect this is the case because I don't see in your patch what could cause the bug, but I suspect it makes it more frequent. Still, I could not manage to reproduce the bug without your patch. So maybe we switched from just below a limit to just over. > > The issue seems to happen when no data can be copied at all due to > > lack of buffers, which results in pending data never being pushed. > > > > This change checks if all pending data has been pushed or not and > > pushes them when waiting for send buffers. > > --- > > net/ipv4/tcp.c | 2 +- > > 1 files changed, 1 insertions(+), 1 deletions(-) > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > > index 80b988f..f6d9e5f 100644 > > --- a/net/ipv4/tcp.c > > +++ b/net/ipv4/tcp.c > > @@ -850,7 +850,7 @@ new_segment: > > wait_for_sndbuf: > > set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); > > wait_for_memory: > > - if (copied) > > + if (tcp_sk(sk)->pushed_seq != tcp_sk(sk)->write_seq) > > tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH); > > > > if ((err = sk_stream_wait_memory(sk, &timeo)) != 0) > > Can you check you have commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 > ( tcp: tcp_sendpages() should call tcp_push() once ) Yes I have it, in fact I'm on -stable (3.0.31) which includes your backport of the two patches in a single one. It's just a few lines below, out of the context of this patch. Best regards, Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote: > On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote: > It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC > driver is mv643xx. It's the same hardware you find in sheevaplugs, guruplugs, > dreamplugs and almost all ARM-based cheap NAS boxes. > > > With commit 1d0c0b328a6 in net-next > > (net: makes skb_splice_bits() aware of skb->head_frag) > > You'll be able to get even more speed, if NIC uses frag to hold frame. > > I'm going to check this now, sounds interesting :-) splice(socket -> pipe) does a copy of frame from skb->head to a page fragment. With latest patches (net-next), we can provide a frag for skb->head and avoid this copy in splice(). You would have a true zero copy socket->socket mode. tg3 drivers uses this new thing, mv643xx_eth.c could be changed as well. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 17, 2012 at 06:33:24PM +0200, Eric Dumazet wrote: > On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote: > > On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote: > > > It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC > > driver is mv643xx. It's the same hardware you find in sheevaplugs, guruplugs, > > dreamplugs and almost all ARM-based cheap NAS boxes. > > > > > With commit 1d0c0b328a6 in net-next > > > (net: makes skb_splice_bits() aware of skb->head_frag) > > > You'll be able to get even more speed, if NIC uses frag to hold frame. > > > > I'm going to check this now, sounds interesting :-) > > splice(socket -> pipe) does a copy of frame from skb->head to a page > fragment. > > With latest patches (net-next), we can provide a frag for skb->head and > avoid this copy in splice(). You would have a true zero copy > socket->socket mode. That's what I've read in your commit messages. Indeed, we've been waiting for this to happen for a long time. It should bring a real benefit on small devices like the one I'm playing with which have very low memory bandwidth, because it's a shame to copy the data and thrash the cache when we don't need to see them. I'm currently trying to get mainline to boot then I'll switch to net-next. > tg3 drivers uses this new thing, mv643xx_eth.c could be changed as well. I'll try to port your patch (8d4057 tg3: provide frags as skb head) to mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll send a patch :-) Cheers, Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2012-05-17 at 18:40 +0200, Willy Tarreau wrote: > On Thu, May 17, 2012 at 06:33:24PM +0200, Eric Dumazet wrote: > > On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote: > > > On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote: > > > > > It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC > > > driver is mv643xx. It's the same hardware you find in sheevaplugs, guruplugs, > > > dreamplugs and almost all ARM-based cheap NAS boxes. > > > > > > > With commit 1d0c0b328a6 in net-next > > > > (net: makes skb_splice_bits() aware of skb->head_frag) > > > > You'll be able to get even more speed, if NIC uses frag to hold frame. > > > > > > I'm going to check this now, sounds interesting :-) > > > > splice(socket -> pipe) does a copy of frame from skb->head to a page > > fragment. > > > > With latest patches (net-next), we can provide a frag for skb->head and > > avoid this copy in splice(). You would have a true zero copy > > socket->socket mode. > > That's what I've read in your commit messages. Indeed, we've been waiting > for this to happen for a long time. It should bring a real benefit on small > devices like the one I'm playing with which have very low memory bandwidth, > because it's a shame to copy the data and thrash the cache when we don't > need to see them. > > I'm currently trying to get mainline to boot then I'll switch to net-next. > > > tg3 drivers uses this new thing, mv643xx_eth.c could be changed as well. > > I'll try to port your patch (8d4057 tg3: provide frags as skb head) to > mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll > send a patch :-) In fact I could post a generic patch, to make netdev_alloc_skb() do this automatically for all network drivers. Its probably less than 30 lines of code. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2012-05-17 at 18:40 +0200, Willy Tarreau wrote: > try to port your patch (8d4057 tg3: provide frags as skb head) to > mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll > send a patch :-) tg3 uses build_skb(), thats an API that allocates the skb after frame is filled by the device, to have less cache misses in RX path. But for most devices, netdev_alloc_skb() would be more than enough. I'll send a patch once tested. Thanks -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 17, 2012 at 06:49:47PM +0200, Eric Dumazet wrote: > On Thu, 2012-05-17 at 18:40 +0200, Willy Tarreau wrote: > > try to port your patch (8d4057 tg3: provide frags as skb head) to > > mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll > > send a patch :-) > > > tg3 uses build_skb(), thats an API that allocates the skb after frame is > filled by the device, to have less cache misses in RX path. > > But for most devices, netdev_alloc_skb() would be more than enough. OK, so that's out of reach for me right now. > I'll send a patch once tested. I'll happily give it a try. BTW, net-next gave me around 13-14% improvement over 3.0/3.2+splice patches ! Now I'm seeing splice() being 75% faster than recv/send, while it used to be slower on this hardware not that long ago! That's really impressive, great work! Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote: [...] > The NIC does not support TSO but I've seen an alternate driver for this > NIC which pretends to do TSO and in fact builds header frags so that the > NIC is able to send all frames at once. I think it's already what GSO is > doing but I'm wondering whether it would be possible to get more speed > by doing this than by relying on GSO to (possibly) split the frags earlier. [...] Yes, GSO has some overhead for skb allocation and additional function calls that you can avoid by doing 'TSO' in the driver. Ben.
From: Willy Tarreau <w@1wt.eu> Date: Thu, 17 May 2012 17:01:57 +0200 >>From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001 > From: Willy Tarreau <w@1wt.eu> > Date: Thu, 17 May 2012 16:48:56 +0200 > Subject: [PATCH] tcp: force push data out when buffers are missing > > Commit 2f533844242 (tcp: allow splice() to build full TSO packets) > significantly improved splice() performance for some workloads but > caused stalls when pipe buffers were larger than socket buffers. > > The issue seems to happen when no data can be copied at all due to > lack of buffers, which results in pending data never being pushed. > > This change checks if all pending data has been pushed or not and > pushes them when waiting for send buffers. Eric, please indicate whether we need Willy's patch here. I want to propagate this fix as fast as possible if so. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi David, On Thu, May 17, 2012 at 03:55:03PM -0400, David Miller wrote: > From: Willy Tarreau <w@1wt.eu> > Date: Thu, 17 May 2012 17:01:57 +0200 > > >>From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001 > > From: Willy Tarreau <w@1wt.eu> > > Date: Thu, 17 May 2012 16:48:56 +0200 > > Subject: [PATCH] tcp: force push data out when buffers are missing > > > > Commit 2f533844242 (tcp: allow splice() to build full TSO packets) > > significantly improved splice() performance for some workloads but > > caused stalls when pipe buffers were larger than socket buffers. > > > > The issue seems to happen when no data can be copied at all due to > > lack of buffers, which results in pending data never being pushed. > > > > This change checks if all pending data has been pushed or not and > > pushes them when waiting for send buffers. > > Eric, please indicate whether we need Willy's patch here. > > I want to propagate this fix as fast as possible if so. I think you should hold off for now, because it's possible that my patch hides another issue instead of fixing it. I'm having the same stall issue again since I applied Eric's build_skb patch, but not for all data sizes. So if the same issue is still there, it's possible that we're playing hide-and-seek with it. That's rather strange. Thanks, Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Willy Tarreau <w@1wt.eu> Date: Thu, 17 May 2012 22:04:04 +0200 > I'm having the same stall issue again since I applied Eric's build_skb > patch, but not for all data sizes. So if the same issue is still there, > it's possible that we're playing hide-and-seek with it. That's rather > strange. Ok, a Heisenbug :-) Let me know when you guys resolve this. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 80b988f..f6d9e5f 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -850,7 +850,7 @@ new_segment: wait_for_sndbuf: set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); wait_for_memory: - if (copied) + if (tcp_sk(sk)->pushed_seq != tcp_sk(sk)->write_seq) tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH); if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)