Message ID: 20090115.152608.89323697.davem@davemloft.net
State: Accepted, archived
Delegated to: David Miller
On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
>
> New patch, this has the SKB clone removal as well:

Thanks Dave! Something else just came to mind though.

> +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> +                                          unsigned int offset)
> +{
> +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> +
> +	if (!p)
> +		return NULL;
> +	memcpy(page_address(p) + offset, page_address(page) + offset, len);

This won't work very well if skb->head is longer than a page.

We'll need to divide it up into individual pages.

Cheers,
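[Editor's note: Herbert's objection can be sketched outside the kernel. A single alloc_pages()/memcpy() bounded by one page cannot cover a linear area longer than PAGE_SIZE, so the copy has to be chunked at page boundaries. The following is a minimal hypothetical userspace illustration of that boundary arithmetic only — the function name and the flat destination buffer are made up; the real fix would copy each chunk into its own freshly allocated page.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Hypothetical sketch: walk the linear buffer and issue one copy per
 * page-sized chunk, never letting a single chunk cross a page boundary.
 * Returns the number of bytes copied. */
static size_t copy_in_page_chunks(char *dst, const char *src, size_t len,
				  size_t offset)
{
	size_t copied = 0;

	while (copied < len) {
		/* Offset within the current page. */
		size_t poff = (offset + copied) & (PAGE_SIZE - 1);
		/* Room left in this page. */
		size_t chunk = PAGE_SIZE - poff;

		if (chunk > len - copied)
			chunk = len - copied;
		memcpy(dst + offset + copied, src + offset + copied, chunk);
		copied += chunk;
	}
	return copied;
}
```

A copy spanning more than two pages then decomposes into one partial chunk, full-page chunks, and a trailing partial chunk, which is exactly the splitting Herbert is asking for.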
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 16 Jan 2009 10:32:05 +1100

> On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > +                                          unsigned int offset)
> > +{
> > +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> > +
> > +	if (!p)
> > +		return NULL;
> > +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
>
> This won't work very well if skb->head is longer than a page.
>
> We'll need to divide it up into individual pages.

Oh yes, the same bug I pointed out the other day.

But Willy can test this patch as-is, since he is not using jumbo
frames in linear SKBs.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jan 15, 2009 at 03:34:49PM -0800, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Fri, 16 Jan 2009 10:32:05 +1100
>
> > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > > +                                          unsigned int offset)
> > > +{
> > > +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> > > +
> > > +	if (!p)
> > > +		return NULL;
> > > +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> >
> > This won't work very well if skb->head is longer than a page.
> >
> > We'll need to divide it up into individual pages.
>
> Oh yes, the same bug I pointed out the other day.
>
> But Willy can test this patch as-is,

Hey, nice work Dave. +3% performance over your previous patch
(31.6 MB/s). It's going fine and stable here.

> since he is not using jumbo frames in linear SKBs.

If you're interested, this week-end I can do some tests on my myri10ge
NICs which support LRO. I frequently observe 23 kB packets there, and
they also support jumbo frames. Those should cover the case above.

I'm afraid that's all for me this evening, I have to get some sleep
before going to work. If you want to cook up more patches, I'll be
able to do a bit of testing in 5 hours from now.

Cheers!
Willy
On Fri, Jan 16, 2009 at 12:42:55AM +0100, Willy Tarreau wrote:
> On Thu, Jan 15, 2009 at 03:34:49PM -0800, David Miller wrote:
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Fri, 16 Jan 2009 10:32:05 +1100
> >
> > > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > > > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > > > +                                          unsigned int offset)
> > > > +{
> > > > +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> > > > +
> > > > +	if (!p)
> > > > +		return NULL;
> > > > +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> > >
> > > This won't work very well if skb->head is longer than a page.
> > >
> > > We'll need to divide it up into individual pages.
> >
> > Oh yes, the same bug I pointed out the other day.
> >
> > But Willy can test this patch as-is,
>
> Hey, nice work Dave. +3% performance over your previous patch
> (31.6 MB/s). It's going fine and stable here.

And BTW feel free to add my Tested-by if you want in case you merge
this fix.

Willy
From: Willy Tarreau <w@1wt.eu>
Date: Fri, 16 Jan 2009 00:44:08 +0100

> And BTW feel free to add my Tested-by if you want in case you merge
> this fix.

Done, thanks Willy.
On Fri, Jan 16, 2009 at 12:44:08AM +0100, Willy Tarreau wrote:
> On Fri, Jan 16, 2009 at 12:42:55AM +0100, Willy Tarreau wrote:
> > On Thu, Jan 15, 2009 at 03:34:49PM -0800, David Miller wrote:
> > > From: Herbert Xu <herbert@gondor.apana.org.au>
> > > Date: Fri, 16 Jan 2009 10:32:05 +1100
> > >
> > > > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > > > > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > > > > +                                          unsigned int offset)
> > > > > +{
> > > > > +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> > > > > +
> > > > > +	if (!p)
> > > > > +		return NULL;
> > > > > +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> > > >
> > > > This won't work very well if skb->head is longer than a page.
> > > >
> > > > We'll need to divide it up into individual pages.
> > >
> > > Oh yes, the same bug I pointed out the other day.
> > >
> > > But Willy can test this patch as-is,
> >
> > Hey, nice work Dave. +3% performance over your previous patch
> > (31.6 MB/s). It's going fine and stable here.
>
> And BTW feel free to add my Tested-by if you want in case you merge
> this fix.
>
> Willy

Herbert, good catch!

David, if it's not too late I think more credits are needed,
especially for Willy. He did "a bit" more than testing.

Alas, I can't see this problem with skb->head longer than a page.
There is even a comment on this in __splice_segment(), but I could be
missing something.

I'm more concerned with memory usage if these skbs are not acked for
some reason. Isn't there a possible DoS issue?

Thanks everybody,
Jarek P.
--------->
Based on a review by Changli Gao <xiaosuo@gmail.com>:
http://lkml.org/lkml/2008/2/26/210

Foreseen-by: Changli Gao <xiaosuo@gmail.com>
Diagnosed-by: Willy Tarreau <w@1wt.eu>
Reported-by: Willy Tarreau <w@1wt.eu>
Fixed-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Hi guys,

On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Fri, 16 Jan 2009 00:44:08 +0100
>
> > And BTW feel free to add my Tested-by if you want in case you merge
> > this fix.
>
> Done, thanks Willy.

Just for the record, I've now re-integrated those changes in a test
kernel that I booted on my 10gig machines. I have updated my
user-space code in haproxy to run a new series of tests. Even though
there is a memcpy(), the results are EXCELLENT (on a C2D 2.66 GHz
using Myricom's Myri10GE NICs):

  - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
    (3.2 Gbps at 100% CPU without splice)

  - 9.2 Gbps at 50% CPU using MTU=1500 with LRO

  - 10 Gbps at 20% CPU using MTU=9000 without LRO
    (7 Gbps at 100% CPU without splice)

  - 10 Gbps at 15% CPU using MTU=9000 with LRO

These last ones are really impressive. While I had already observed
such performance on the Myri10GE with Tux, it's the first time I can
reach that level with so little CPU usage in haproxy!

So I think that the memcpy() workaround might be a non-issue for some
time. I agree it's not beautiful, but it works pretty well for now.

The 3 patches I used on top of 2.6.27.10 were the fix to return 0
instead of -EAGAIN on end of read, the one to process multiple skbs
at once, and Dave's last patch based on Jarek's workaround for the
corruption issue.

Cheers,
Willy
On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote:
>
> Just for the record, I've now re-integrated those changes in a test
> kernel that I booted on my 10gig machines. I have updated my
> user-space code in haproxy to run a new series of tests. Even though
> there is a memcpy(), the results are EXCELLENT (on a C2D 2.66 GHz
> using Myricom's Myri10GE NICs):
>
>   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
>     (3.2 Gbps at 100% CPU without splice)

One thing to note is that Myricom's driver probably uses page frags,
which means that you're not actually triggering the copy.

Cheers,
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 19 Jan 2009 14:08:44 +1100

> On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote:
> >
> > Just for the record, I've now re-integrated those changes in a test
> > kernel that I booted on my 10gig machines. I have updated my
> > user-space code in haproxy to run a new series of tests. Even though
> > there is a memcpy(), the results are EXCELLENT (on a C2D 2.66 GHz
> > using Myricom's Myri10GE NICs):
> >
> >   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> >     (3.2 Gbps at 100% CPU without splice)
>
> One thing to note is that Myricom's driver probably uses page
> frags, which means that you're not actually triggering the copy.

Right.

And this is also the only reason why jumbo MTU worked :-)
From: Willy Tarreau <w@1wt.eu>
Date: Mon, 19 Jan 2009 01:42:06 +0100

> Hi guys,
>
> On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
> > From: Willy Tarreau <w@1wt.eu>
> > Date: Fri, 16 Jan 2009 00:44:08 +0100
> >
> > > And BTW feel free to add my Tested-by if you want in case you merge
> > > this fix.
> >
> > Done, thanks Willy.
>
> Just for the record, I've now re-integrated those changes in a test
> kernel that I booted on my 10gig machines. I have updated my
> user-space code in haproxy to run a new series of tests. Even though
> there is a memcpy(), the results are EXCELLENT (on a C2D 2.66 GHz
> using Myricom's Myri10GE NICs):
>
>   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
>     (3.2 Gbps at 100% CPU without splice)
>
>   - 9.2 Gbps at 50% CPU using MTU=1500 with LRO
>
>   - 10 Gbps at 20% CPU using MTU=9000 without LRO
>     (7 Gbps at 100% CPU without splice)
>
>   - 10 Gbps at 15% CPU using MTU=9000 with LRO

Thanks for the numbers.

We can almost certainly do a lot better, so if you have the time and
can get some oprofile dumps for these various cases, that would be
useful to us.

Thanks again.
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Fri, 16 Jan 2009 06:51:08 +0000

> David, if it's not too late I think more credits are needed,
> especially for Willy. He did "a bit" more than testing.
 ...
> Foreseen-by: Changli Gao <xiaosuo@gmail.com>
> Diagnosed-by: Willy Tarreau <w@1wt.eu>
> Reported-by: Willy Tarreau <w@1wt.eu>
> Fixed-by: Jens Axboe <jens.axboe@oracle.com>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

I will be sure to add these before committing, thanks Jarek.
On Sun, Jan 18, 2009 at 07:28:15PM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Mon, 19 Jan 2009 01:42:06 +0100
>
> > Hi guys,
> >
> > On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
> > > From: Willy Tarreau <w@1wt.eu>
> > > Date: Fri, 16 Jan 2009 00:44:08 +0100
> > >
> > > > And BTW feel free to add my Tested-by if you want in case you merge
> > > > this fix.
> > >
> > > Done, thanks Willy.
> >
> > Just for the record, I've now re-integrated those changes in a test
> > kernel that I booted on my 10gig machines. I have updated my
> > user-space code in haproxy to run a new series of tests. Even though
> > there is a memcpy(), the results are EXCELLENT (on a C2D 2.66 GHz
> > using Myricom's Myri10GE NICs):
> >
> >   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> >     (3.2 Gbps at 100% CPU without splice)
> >
> >   - 9.2 Gbps at 50% CPU using MTU=1500 with LRO
> >
> >   - 10 Gbps at 20% CPU using MTU=9000 without LRO
> >     (7 Gbps at 100% CPU without splice)
> >
> >   - 10 Gbps at 15% CPU using MTU=9000 with LRO
>
> Thanks for the numbers.
>
> We can almost certainly do a lot better, so if you have the time and
> can get some oprofile dumps for these various cases, that would be
> useful to us.

No problem, of course. It's just a matter of time, but if we can push
the numbers further, let's try.

Regards,
Willy
On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Mon, 19 Jan 2009 14:08:44 +1100
>
> > On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote:
> > >
> > > Just for the record, I've now re-integrated those changes in a test
> > > kernel that I booted on my 10gig machines. I have updated my
> > > user-space code in haproxy to run a new series of tests. Even though
> > > there is a memcpy(), the results are EXCELLENT (on a C2D 2.66 GHz
> > > using Myricom's Myri10GE NICs):
> > >
> > >   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> > >     (3.2 Gbps at 100% CPU without splice)
> >
> > One thing to note is that Myricom's driver probably uses page
> > frags, which means that you're not actually triggering the copy.

So does this mean that the corruption problem should still be there
for such a driver? I'm asking before testing, because at these speeds,
validity tests are not that easy ;-)

> Right.
>
> And this is also the only reason why jumbo MTU worked :-)

What should we expect from other drivers with jumbo frames? Hangs,
corruption, errors, packet loss?

Thanks,
Willy
From: David Miller <davem@davemloft.net>
Date: Thu, 15 Jan 2009 15:34:49 -0800 (PST)

> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Fri, 16 Jan 2009 10:32:05 +1100
>
> > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > > +                                          unsigned int offset)
> > > +{
> > > +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> > > +
> > > +	if (!p)
> > > +		return NULL;
> > > +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> >
> > This won't work very well if skb->head is longer than a page.
> >
> > We'll need to divide it up into individual pages.
>
> Oh yes, the same bug I pointed out the other day.
>
> But Willy can test this patch as-is, since he is not using
> jumbo frames in linear SKBs.

Actually, Herbert, it turns out this case should be OK.

Look a level or two higher, at __splice_segment(), it even has a
comment :-)

--------------------
	do {
		unsigned int flen = min(*len, plen);

		/* the linear region may spread across several pages  */
		flen = min_t(unsigned int, flen, PAGE_SIZE - poff);

		if (spd_fill_page(spd, page, flen, poff, skb, linear))
			return 1;

		__segment_seek(&page, &poff, &plen, flen);
		*len -= flen;

	} while (*len && plen);
--------------------

That code should work and do what we need.
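[Editor's note: the clamping done by the quoted __splice_segment() loop can be mirrored as a small userspace model. This is an illustrative sketch only — the names `split_linear` and `struct seg` are made up, and it merely records the (offset, length) pieces a linear region would be split into, each piece clamped to the remainder of its current page.]

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096

struct seg {
	size_t poff;	/* starting offset of this piece */
	size_t flen;	/* length of this piece */
};

/* Userspace model of the splitting: emit one segment per iteration,
 * clamped so that no segment crosses a page boundary, until either the
 * requested length or the linear region is exhausted. */
static size_t split_linear(size_t poff, size_t plen, size_t len,
			   struct seg *out, size_t max)
{
	size_t n = 0;

	while (len && plen && n < max) {
		size_t flen = len < plen ? len : plen;
		size_t room = PAGE_SIZE - (poff & (PAGE_SIZE - 1));

		if (flen > room)
			flen = room;	/* stay within the current page */
		out[n].poff = poff;
		out[n].flen = flen;
		n++;
		poff += flen;
		plen -= flen;
		len -= flen;
	}
	return n;
}
```

A 9000-byte linear region starting at offset 100 comes out as three segments, none of which straddles a page, which is why the single-page copy in linear_to_page() is never asked to do more than one page's worth of work.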
From: Willy Tarreau <w@1wt.eu>
Date: Mon, 19 Jan 2009 07:14:20 +0100

> On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote:
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Mon, 19 Jan 2009 14:08:44 +1100
> >
> > > One thing to note is that Myricom's driver probably uses page
> > > frags, which means that you're not actually triggering the copy.
>
> So does this mean that the corruption problem should still be there
> for such a driver? I'm asking before testing, because at these speeds,
> validity tests are not that easy ;-)

It ought not to, but it seems that is the case where you
saw the original corruptions, so hmmm...

Actually, I see, the myri10ge driver does put up to
64 bytes of the initial packet into the linear area.
If the IPV4 + TCP headers are less than this, you will
hit the corruption case even with the myri10ge driver.

So everything checks out.

> > And this is also the only reason why jumbo MTU worked :-)
>
> What should we expect from other drivers with jumbo frames? Hangs,
> corruption, errors, packet loss?

Upon recent review I think jumbo frames in such drivers
should actually be fine.
On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Mon, 19 Jan 2009 07:14:20 +0100
>
> > On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote:
> > > From: Herbert Xu <herbert@gondor.apana.org.au>
> > > Date: Mon, 19 Jan 2009 14:08:44 +1100
> > >
> > > > One thing to note is that Myricom's driver probably uses page
> > > > frags, which means that you're not actually triggering the copy.
> >
> > So does this mean that the corruption problem should still be there
> > for such a driver? I'm asking before testing, because at these speeds,
> > validity tests are not that easy ;-)
>
> It ought not to, but it seems that is the case where you
> saw the original corruptions, so hmmm...
>
> Actually, I see, the myri10ge driver does put up to
> 64 bytes of the initial packet into the linear area.
> If the IPV4 + TCP headers are less than this, you will
> hit the corruption case even with the myri10ge driver.
>
> So everything checks out.

OK, so I will modify my tools in order to perform a few checks.

Thanks!
Willy
On Mon, Jan 19, 2009 at 07:14:20AM +0100, Willy Tarreau wrote:
> On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote:
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Mon, 19 Jan 2009 14:08:44 +1100
> >
> > > On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote:
> > > >
> > > > Just for the record, I've now re-integrated those changes in a test
> > > > kernel that I booted on my 10gig machines. I have updated my
> > > > user-space code in haproxy to run a new series of tests. Even though
> > > > there is a memcpy(), the results are EXCELLENT (on a C2D 2.66 GHz
> > > > using Myricom's Myri10GE NICs):
> > > >
> > > >   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> > > >     (3.2 Gbps at 100% CPU without splice)
> > >
> > > One thing to note is that Myricom's driver probably uses page
> > > frags, which means that you're not actually triggering the copy.
>
> So does this mean that the corruption problem should still be there
> for such a driver? I'm asking before testing, because at these speeds,
> validity tests are not that easy ;-)

I guess David meant the performance: there is not much change because
only a small part could be copied. The most harmed would be jumbo
frames in linear-only skbs. But no corruption is expected.

Jarek P.
On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote:
>
> Actually, I see, the myri10ge driver does put up to
> 64 bytes of the initial packet into the linear area.
> If the IPV4 + TCP headers are less than this, you will
> hit the corruption case even with the myri10ge driver.

I thought splice only mapped the payload areas, no? So we should
probably test with a non-paged driver to be totally sure.

Cheers,
On Sun, Jan 18, 2009 at 10:16:03PM -0800, David Miller wrote:
>
> Actually, Herbert, it turns out this case should be OK.
>
> Look a level or two higher, at __splice_segment(), it even has a
> comment :-)

Aha, that should take care of it.

Thanks,
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 19 Jan 2009 21:19:24 +1100

> On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote:
> >
> > Actually, I see, the myri10ge driver does put up to
> > 64 bytes of the initial packet into the linear area.
> > If the IPV4 + TCP headers are less than this, you will
> > hit the corruption case even with the myri10ge driver.
>
> I thought splice only mapped the payload areas, no?

And the difference between 64 and the IPV4+TCP header len becomes the
payload, don't you see? :-)

myri10ge just pulls min(64, skb->len) bytes from the SKB frags into
the linear area, unconditionally. So a small number of payload bytes
can in fact end up there.

Otherwise Willy could never have triggered this bug.
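[Editor's note: the arithmetic David describes is easy to sanity-check. Below is a hypothetical helper — not code from any driver — computing how many payload bytes land in the linear area when a driver unconditionally pulls a fixed number of bytes from the frags into it.]

```c
#include <assert.h>

/* Illustrative only: if a driver pulls min(pulled, skb_len) bytes into
 * the linear area, any of those bytes beyond the protocol headers are
 * payload sitting in the linear area. */
static unsigned int payload_in_linear(unsigned int pulled,
				      unsigned int hdr_len,
				      unsigned int skb_len)
{
	unsigned int linear = pulled < skb_len ? pulled : skb_len;

	return linear > hdr_len ? linear - hdr_len : 0;
}
```

With plain 20-byte IPv4 and 20-byte TCP headers (40 bytes total) and a 64-byte pull, 24 payload bytes end up in the linear area — which is why Willy could hit the corruption even on a page-frag driver like myri10ge.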
On Mon, Jan 19, 2009 at 12:59:41PM -0800, David Miller wrote:

> And the difference between 64 and the IPV4+TCP header len becomes the
> payload, don't you see? :-)
>
> myri10ge just pulls min(64, skb->len) bytes from the SKB frags into
> the linear area, unconditionally. So a small number of payload bytes
> can in fact end up there.

Ah, I thought they were doing a precise division. Yes, pulling a
constant length will trigger it if it's big enough.

Cheers,
Willy Tarreau wrote:
> Hi guys,
>
> On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
>> From: Willy Tarreau <w@1wt.eu>
>> Date: Fri, 16 Jan 2009 00:44:08 +0100
>>
>>> And BTW feel free to add my Tested-by if you want in case you merge
>>> this fix.
>> Done, thanks Willy.
>
> Just for the record, I've now re-integrated those changes in a test
> kernel that I booted on my 10gig machines. I have updated my
> user-space code in haproxy to run a new series of tests. Even though
> there is a memcpy(), the results are EXCELLENT (on a C2D 2.66 GHz
> using Myricom's Myri10GE NICs):
>
>   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
>     (3.2 Gbps at 100% CPU without splice)
>
>   - 9.2 Gbps at 50% CPU using MTU=1500 with LRO
>
>   - 10 Gbps at 20% CPU using MTU=9000 without LRO
>     (7 Gbps at 100% CPU without splice)
>
>   - 10 Gbps at 15% CPU using MTU=9000 with LRO
>
> These last ones are really impressive. While I had already observed
> such performance on the Myri10GE with Tux, it's the first time I can
> reach that level with so little CPU usage in haproxy!
>
> So I think that the memcpy() workaround might be a non-issue for some
> time. I agree it's not beautiful, but it works pretty well for now.
>
> The 3 patches I used on top of 2.6.27.10 were the fix to return 0
> instead of -EAGAIN on end of read, the one to process multiple skbs
> at once, and Dave's last patch based on Jarek's workaround for the
> corruption issue.

I've also tested the same three patches (against 2.6.27.2 here), and
they appear to work just fine. I'm running a similar proxy benchmark
test to Willy's, on a machine with 4 gigabit NICs (2x tg3,
2x forcedeth). splice is working OK now, although I get identical
results when using splice() or read()/write(): 2.4 Gbps at 100% CPU
(2% user, 98% system).

I may be hitting a h/w limitation which prevents any higher
throughput, but I'm a little surprised that splice() didn't use less
CPU time.
Anyway, the splice code is working, which is the important part!


Ben
On Tue, Jan 20, 2009 at 12:01:13PM +0000, Ben Mansell (ben@zeus.com) wrote:
> I've also tested the same three patches (against 2.6.27.2 here), and
> they appear to work just fine. I'm running a similar proxy benchmark
> test to Willy's, on a machine with 4 gigabit NICs (2x tg3,
> 2x forcedeth). splice is working OK now, although I get identical
> results when using splice() or read()/write(): 2.4 Gbps at 100% CPU
> (2% user, 98% system).

With a small MTU, or when the driver does not support fragmented
allocation (IIRC at least forcedeth does not), the skb will contain
all the data in the linear part and thus it will be copied in the
kernel. read()/write() does effectively the same, but in userspace.
This should only affect splice usage which involves socket->pipe data
transfer.

> I may be hitting a h/w limitation which prevents any higher
> throughput, but I'm a little surprised that splice() didn't use less
> CPU time.
> Anyway, the splice code is working, which is the important part!

Does splice without the patches (but with the performance improvement
for non-blocking splice) have the same performance? It does not copy
data, but you may hit the data corruption. If the performance is the
same, this may indeed be a HW limitation.
Evgeniy Polyakov wrote:
> On Tue, Jan 20, 2009 at 12:01:13PM +0000, Ben Mansell (ben@zeus.com) wrote:
>> I've also tested the same three patches (against 2.6.27.2 here), and
>> they appear to work just fine. I'm running a similar proxy benchmark
>> test to Willy's, on a machine with 4 gigabit NICs (2x tg3,
>> 2x forcedeth). splice is working OK now, although I get identical
>> results when using splice() or read()/write(): 2.4 Gbps at 100% CPU
>> (2% user, 98% system).
>
> With a small MTU, or when the driver does not support fragmented
> allocation (IIRC at least forcedeth does not), the skb will contain
> all the data in the linear part and thus it will be copied in the
> kernel. read()/write() does effectively the same, but in userspace.
> This should only affect splice usage which involves socket->pipe data
> transfer.

I'll try with some larger MTUs and see if that helps - it should also
give an improvement if I'm hitting a limit on the number of
packets/second that the cards can process, regardless of splice...

>> I may be hitting a h/w limitation which prevents any higher
>> throughput, but I'm a little surprised that splice() didn't use less
>> CPU time.
>> Anyway, the splice code is working, which is the important part!
>
> Does splice without the patches (but with the performance improvement
> for non-blocking splice) have the same performance? It does not copy
> data, but you may hit the data corruption. If the performance is the
> same, this may indeed be a HW limitation.

With an unpatched kernel, the splice performance was worse (due to the
one-packet-per-splice issue). With the small patch to fix that, I was
getting around 2 Gbps, although oddly enough, I could only get 2 Gbps
with read()/write() then as well...
I'll try and do some tests on a machine that hopefully doesn't have
the bottlenecks (and one that uses different NICs).


Ben
On Tue, Jan 20, 2009 at 01:43:52PM +0000, Ben Mansell wrote:
...
> With an unpatched kernel, the splice performance was worse (due to the
> one-packet-per-splice issue). With the small patch to fix that, I was
> getting around 2 Gbps, although oddly enough, I could only get 2 Gbps
> with read()/write() then as well...
>
> I'll try and do some tests on a machine that hopefully doesn't have
> the bottlenecks (and one that uses different NICs).

I guess you should especially check if SG and checksum offload are on,
and it could depend on the chip within those NICs as well.

Jarek P.
Hi David,

On Sun, Jan 18, 2009 at 07:28:15PM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Mon, 19 Jan 2009 01:42:06 +0100
>
> > Hi guys,
> >
> > On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
> > > From: Willy Tarreau <w@1wt.eu>
> > > Date: Fri, 16 Jan 2009 00:44:08 +0100
> > >
> > > > And BTW feel free to add my Tested-by if you want in case you merge
> > > > this fix.
> > >
> > > Done, thanks Willy.
> >
> > Just for the record, I've now re-integrated those changes in a test
> > kernel that I booted on my 10gig machines. I have updated my
> > user-space code in haproxy to run a new series of tests. Even though
> > there is a memcpy(), the results are EXCELLENT (on a C2D 2.66 GHz
> > using Myricom's Myri10GE NICs):
> >
> >   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> >     (3.2 Gbps at 100% CPU without splice)
> >
> >   - 9.2 Gbps at 50% CPU using MTU=1500 with LRO
> >
> >   - 10 Gbps at 20% CPU using MTU=9000 without LRO
> >     (7 Gbps at 100% CPU without splice)
> >
> >   - 10 Gbps at 15% CPU using MTU=9000 with LRO
>
> Thanks for the numbers.
>
> We can almost certainly do a lot better, so if you have the time and
> can get some oprofile dumps for these various cases, that would be
> useful to us.

Well, I tried to get oprofile to work on those machines, but I'm
surely missing something, as "opreport" always complains:

  opreport error: No sample file found: try running opcontrol --dump
  or specify a session containing sample files

I've straced opcontrol --dump and opcontrol --stop, and I see nothing
creating any file in the samples directory. I thought it would be
opjitconv which would do it, but it's hard to debug, and I haven't
used oprofile for 2-3 years now. I followed exactly the instructions
in the kernel doc, as well as a howto found on the net, but I remain
out of luck. I just see a "complete_dump" file created with only two
bytes.
It's not easy to debug on those machines as they're diskless and
PXE-booted from squashfs root images. However, upon Ingo's suggestion
I tried his perfcounters while running a test at 5 Gbps with haproxy
running alone on one core, and IRQs on the other one. No LRO was used
and the MTU was 1500.

For instance, kerneltop tells how many CPU cycles are spent in each
function:

  # kerneltop -e 0 -d 1 -c 1000000 -C 1

   events   RIP                kernel function
   ______   ________________   _______________

  1864.00 - 00000000f87be000 : init_module        [myri10ge]
  1078.00 - 00000000784a6580 : tcp_read_sock
   901.00 - 00000000784a7840 : tcp_sendpage
   857.00 - 0000000078470be0 : __skb_splice_bits
   617.00 - 00000000784b2260 : tcp_transmit_skb
   604.00 - 000000007849fdf0 : __ip_local_out
   581.00 - 0000000078470460 : __copy_skb_header
   569.00 - 000000007850cac0 : _spin_lock         [myri10ge]
   472.00 - 0000000078185150 : __slab_free
   443.00 - 000000007850cc10 : _spin_lock_bh      [sky2]
   434.00 - 00000000781852e0 : __slab_alloc
   408.00 - 0000000078488620 : __qdisc_run
   355.00 - 0000000078185b20 : kmem_cache_free
   352.00 - 0000000078472950 : __alloc_skb
   348.00 - 00000000784705f0 : __skb_clone
   337.00 - 0000000078185870 : kmem_cache_alloc   [myri10ge]
   302.00 - 0000000078472150 : skb_release_data
   297.00 - 000000007847bcf0 : dev_queue_xmit
   285.00 - 00000000784a08f0 : ip_queue_xmit

You should ignore the module annotations on the init_module,
_spin_lock, etc. lines — in fact on all lines indicating a module, as
there's something wrong there: they always report the name of the last
module loaded, and the name changes when the module is unloaded.
I also tried dumping the number of cache misses per function:

  ------------------------------------------------------------------------------
   KernelTop: 1146 irqs/sec  [NMI, 1000 cache-misses],  (all, cpu: 1)
  ------------------------------------------------------------------------------

   events         RIP            kernel function
   ______   ________________     _______________

  7512.00 - 00000000784a6580 : tcp_read_sock
  2716.00 - 0000000078470be0 : __skb_splice_bits
  2516.00 - 00000000784a08f0 : ip_queue_xmit
   986.00 - 00000000784a7840 : tcp_sendpage
   587.00 - 00000000781a40c0 : sys_splice
   462.00 - 00000000781856b0 : kfree                  [myri10ge]
   451.00 - 000000007849fdf0 : __ip_local_out
   242.00 - 0000000078185b20 : kmem_cache_free
   205.00 - 00000000784b1b90 : __tcp_select_window
   153.00 - 000000007850cac0 : _spin_lock             [myri10ge]
   142.00 - 000000007849f6a0 : ip_fragment
   119.00 - 0000000078188950 : rw_verify_area
   117.00 - 00000000784a99e0 : tcp_rcv_space_adjust
   107.00 - 000000007850cc10 : _spin_lock_bh          [sky2]
   104.00 - 00000000781852e0 : __slab_alloc

There are other options to combine events, but I don't understand the
output when I enable them. I think that when properly used, these tools
can report useful information.

Regards,
Willy
Hi David,

On Mon, Jan 19, 2009 at 12:59:41PM -0800, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Mon, 19 Jan 2009 21:19:24 +1100
> 
> > On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote:
> > >
> > > Actually, I see, the myri10ge driver does put up to
> > > 64 bytes of the initial packet into the linear area.
> > > If the IPV4 + TCP headers are less than this, you will
> > > hit the corruption case even with the myri10ge driver.
> > 
> > I thought splice only mapped the payload areas, no?
> 
> And the difference between 64 and IPV4+TCP header len becomes the
> payload, don't you see? :-)
> 
> myri10ge just pulls min(64, skb->len) bytes from the SKB frags into
> the linear area, unconditionally. So a small number of payload bytes
> can in fact end up there.
> 
> Otherwise Willy could never have triggered this bug.

Just FWIW, I've updated my tools in order to perform content checks more
easily. I cannot reproduce the issue at all with the myri10ge NICs, neither
with large frames nor with tiny ones (8 bytes).

However, I have noticed that the load is now sensitive to the number of
concurrent sessions. I'm using 2.6.29-rc2 with the perfcounters patches,
and I'm not sure whether the difference in behaviour came with the data
corruption fixes or with the new kernel (which has some profiling options
turned on).

Basically, below 800-1000 concurrent sessions, I have no problem reaching
10 Gbps with LRO and MTU=1500, at about 60% CPU. Above this number of
sessions, the CPU suddenly jumps to 100% and the data rate drops to about
6.7 Gbps.

I spent a long time trying to figure out what it was, but I think I have
found it: kerneltop reports different figures above and below the limit.
1) below the limit:

  1429.00 - 00000000784a7840 : tcp_sendpage
   561.00 - 00000000784a6580 : tcp_read_sock
   485.00 - 00000000f87e13c0 : myri10ge_xmit          [myri10ge]
   433.00 - 00000000781a40c0 : sys_splice
   411.00 - 00000000784a6eb0 : tcp_poll
   344.00 - 000000007847bcf0 : dev_queue_xmit
   342.00 - 0000000078470be0 : __skb_splice_bits
   319.00 - 0000000078472950 : __alloc_skb
   310.00 - 0000000078185870 : kmem_cache_alloc
   285.00 - 00000000784b2260 : tcp_transmit_skb
   285.00 - 000000007850cac0 : _spin_lock
   250.00 - 00000000781afda0 : sys_epoll_ctl
   238.00 - 000000007810334c : system_call
   232.00 - 000000007850ac20 : schedule
   230.00 - 000000007850cc10 : _spin_lock_bh
   222.00 - 00000000784705f0 : __skb_clone
   220.00 - 000000007850cbc0 : _spin_lock_irqsave
   213.00 - 00000000784a08f0 : ip_queue_xmit
   211.00 - 0000000078185ea0 : __kmalloc_track_caller

2) above the limit:

  1778.00 - 00000000784a7840 : tcp_sendpage
  1281.00 - 0000000078472950 : __alloc_skb
   639.00 - 00000000784a6780 : sk_stream_alloc_skb
   507.00 - 0000000078185ea0 : __kmalloc_track_caller
   484.00 - 0000000078185870 : kmem_cache_alloc
   476.00 - 00000000784a6580 : tcp_read_sock
   451.00 - 00000000784a08f0 : ip_queue_xmit
   421.00 - 00000000f87e13c0 : myri10ge_xmit          [myri10ge]
   374.00 - 00000000781852e0 : __slab_alloc
   361.00 - 00000000781a40c0 : sys_splice
   273.00 - 0000000078470be0 : __skb_splice_bits
   231.00 - 000000007850cac0 : _spin_lock
   206.00 - 0000000078168b30 : get_pageblock_flags_group
   165.00 - 00000000784a0260 : ip_finish_output
   165.00 - 00000000784b2260 : tcp_transmit_skb
   161.00 - 0000000078470460 : __copy_skb_header
   153.00 - 000000007816d6d0 : put_page
   144.00 - 000000007850cbc0 : _spin_lock_irqsave
   137.00 - 0000000078189be0 : fget_light

The memory allocation clearly is the culprit here. I'll try Jarek's patch
which reduces memory allocation to see if that changes something, as I'm
sure we can do fairly better, given how it behaves with limited sessions.
Regards,
Willy

PS: this thread is long; if some of the people in CC want to get off the
thread, please complain.
On Sun, Jan 25, 2009 at 10:03:25PM +0100, Willy Tarreau wrote:
...
> The memory allocation clearly is the culprit here. I'll try Jarek's
> patch which reduces memory allocation to see if that changes something,
> as I'm sure we can do fairly better, given how it behaves with limited
> sessions.

I think you are right, but I wonder if it's not better to wait with
more profiling until this splicing is really redone.

Regards,
Jarek P.
On Mon, Jan 26, 2009 at 07:59:33AM +0000, Jarek Poplawski wrote:
> On Sun, Jan 25, 2009 at 10:03:25PM +0100, Willy Tarreau wrote:
> ...
> > The memory allocation clearly is the culprit here. I'll try Jarek's
> > patch which reduces memory allocation to see if that changes something,
> > as I'm sure we can do fairly better, given how it behaves with limited
> > sessions.
> 
> I think you are right, but I wonder if it's not better to wait with
> more profiling until this splicing is really redone.

Agreed. In fact I have run a few tests even with your patch and I could
see no obvious pattern starting to appear. I'll go back to 2.6.27-stable
plus the fixes, because I don't really know what I'm testing under
2.6.29-rc2. Once I'm able to get reproducible reference numbers, I'll
test again with the latest kernel.

Regards,
Willy
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 65eac77..56272ac 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
-	struct sk_buff *skb = (struct sk_buff *) buf->private;
-
-	kfree_skb(skb);
+	put_page(buf->page);
 }
 
 static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
 				struct pipe_buffer *buf)
 {
-	struct sk_buff *skb = (struct sk_buff *) buf->private;
-
-	skb_get(skb);
+	get_page(buf->page);
 }
 
 static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
  */
 static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 {
-	struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+	put_page(spd->pages[i]);
+}
 
-	kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+					  unsigned int offset)
+{
+	struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+	if (!p)
+		return NULL;
+	memcpy(page_address(p) + offset, page_address(page) + offset, len);
+
+	return p;
 }
 
 /*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 				unsigned int len, unsigned int offset,
-				struct sk_buff *skb)
+				struct sk_buff *skb, int linear)
 {
 	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
 		return 1;
 
+	if (linear) {
+		page = linear_to_page(page, len, offset);
+		if (!page)
+			return 1;
+	} else
+		get_page(page);
+
 	spd->pages[spd->nr_pages] = page;
 	spd->partial[spd->nr_pages].len = len;
 	spd->partial[spd->nr_pages].offset = offset;
-	spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
 	spd->nr_pages++;
+
 	return 0;
 }
 
@@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
 static inline int __splice_segment(struct page *page, unsigned int poff,
 				   unsigned int plen, unsigned int *off,
 				   unsigned int *len, struct sk_buff *skb,
-				   struct splice_pipe_desc *spd)
+				   struct splice_pipe_desc *spd, int linear)
 {
 	if (!*len)
 		return 1;
@@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
 		/* the linear region may spread across several pages  */
 		flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
 
-		if (spd_fill_page(spd, page, flen, poff, skb))
+		if (spd_fill_page(spd, page, flen, poff, skb, linear))
 			return 1;
 
 		__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
 	if (__splice_segment(virt_to_page(skb->data),
 			     (unsigned long) skb->data & (PAGE_SIZE - 1),
 			     skb_headlen(skb),
-			     offset, len, skb, spd))
+			     offset, len, skb, spd, 1))
 		return 1;
 
 	/*
@@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
 		const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
 
 		if (__splice_segment(f->page, f->page_offset, f->size,
-				     offset, len, skb, spd))
+				     offset, len, skb, spd, 0))
 			return 1;
 	}
 
@@ -1442,7 +1455,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
  * the frag list, if such a thing exists. We'd probably need to recurse to
  * handle that cleanly.
  */
-int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
+int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int tlen,
 		    unsigned int flags)
 {
@@ -1455,16 +1468,6 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 		.ops = &sock_pipe_buf_ops,
 		.spd_release = sock_spd_release,
 	};
-	struct sk_buff *skb;
-
-	/*
-	 * I'd love to avoid the clone here, but tcp_read_sock()
-	 * ignores reference counts and unconditonally kills the sk_buff
-	 * on return from the actor.
-	 */
-	skb = skb_clone(__skb, GFP_KERNEL);
-	if (unlikely(!skb))
-		return -ENOMEM;
 
 	/*
 	 * __skb_splice_bits() only fails if the output has no room left,
@@ -1488,15 +1491,9 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 	}
 
 done:
-	/*
-	 * drop our reference to the clone, the pipe consumption will
-	 * drop the rest.
-	 */
-	kfree_skb(skb);
-
 	if (spd.nr_pages) {
+		struct sock *sk = skb->sk;
 		int ret;
-		struct sock *sk = __skb->sk;
 
 		/*
 		 * Drop the socket lock, otherwise we have reverse
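For what it's worth, the multi-page problem Herbert pointed out earlier (linear_to_page() above does a single memcpy, which cannot work once skb->head spans more than one page) can be sketched in userspace as a loop that never lets a copy cross a page boundary, so that each chunk could go into its own freshly allocated page. This is only an illustration under that assumption, not the actual kernel fix:

```c
#include <string.h>

#define PAGE_SIZE 4096

/* Walk a linear buffer page by page; each chunk stays within one
 * page-sized destination. Returns the number of chunks (pages) used. */
static int copy_in_page_chunks(char *dst, const char *src, size_t len,
                               size_t offset)
{
    size_t copied = 0;
    int chunks = 0;

    while (copied < len) {
        /* Room left before the next page boundary at this position. */
        size_t room = PAGE_SIZE - ((offset + copied) & (PAGE_SIZE - 1));
        size_t chunk = len - copied < room ? len - copied : room;

        memcpy(dst + copied, src + copied, chunk);
        copied += chunk;
        chunks++;
    }
    return chunks;
}
```

For example, 10000 bytes starting at offset 100 split into three chunks (3996, 4096 and 1908 bytes), one per destination page, instead of one copy that would overrun a single allocated page.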