
gso: Attempt to handle mega-GRO packets

Message ID 20131106013038.GA14894@gondor.apana.org.au
State RFC, archived
Delegated to: David Miller

Commit Message

Herbert Xu Nov. 6, 2013, 1:30 a.m. UTC
Here is a totally untested patch that tries to trivially process
these new frags + frag_list skbs.  It should actually be trivial
to make this generate TSO packets by just adding a gso_ok check
and short-circuiting.
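
For illustration, the short-circuit could look roughly like this inside
the fskb->len != len branch (untested sketch; skb_gso_ok() is the
existing helper, the locals are the ones skb_segment() already has):

        if (skb_gso_ok(fskb, features)) {
                /* The device can handle this frag_list skb as a
                 * GSO packet, so emit it unsegmented instead of
                 * recursing into skb_segment().
                 */
                nskb = skb_clone(fskb, GFP_ATOMIC);
                if (unlikely(!nskb))
                        goto err;
        }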


Thanks,

Comments

Eric Dumazet Nov. 6, 2013, 1:45 a.m. UTC | #1
On Wed, 2013-11-06 at 09:30 +0800, Herbert Xu wrote:
> Here is a totally untested patch that tries to trivially process
> these new frags + frag_list skbs.  It should actually be trivial
> to make this generate TSO packets by just adding a gso_ok check
> and short-circuiting.
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 3735fad..ec8e8bc 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -2816,7 +2816,24 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>  			hsize = len;
>  
>  		if (!hsize && i >= nfrags) {
> -			BUG_ON(fskb->len != len);
> +			if (fskb->len != len) {
> +				SKB_FRAG_ASSERT(fskb);
> +
> +				nskb = skb_segment(fskb, features);
> +
> +				err = PTR_ERR(nskb);
> +				if (IS_ERR(nskb))
> +					goto err;
> +				err = -ENOMEM;
> +
> +				if (segs)
> +					tail->next = nskb;
> +				else
> +					segs = nskb;
> +				tail = nskb;
> +				while (tail->next)
> +					tail = tail->next;
> +			}
>  
>  			pos += len;
>  			nskb = skb_clone(fskb, GFP_ATOMIC);
> 
> Thanks,

Hmm, I do not think fskb has the headers in the general case.

It might work in the GRO case only.



Herbert Xu Nov. 6, 2013, 4:07 a.m. UTC | #2
On Tue, Nov 05, 2013 at 05:45:47PM -0800, Eric Dumazet wrote:
>
> Hmm, I do not think fskb has the headers in the general case.

Only GRO produces such packets; see the changeset where I added
frag_list support to skb_segment:

	89319d3801d1d3ac29c7df1f067038986f267d29

> It might work in the GRO case only.

Any chance you guys can test this patch out?

Thanks!
Eric Dumazet Nov. 6, 2013, 4:23 a.m. UTC | #3
On Wed, 2013-11-06 at 12:07 +0800, Herbert Xu wrote:
> On Tue, Nov 05, 2013 at 05:45:47PM -0800, Eric Dumazet wrote:
> >
> > Hmm, I do not think fskb has the headers in the general case.
> 
> Only GRO produces such packets; see the changeset where I added
> frag_list support to skb_segment:

Nope, I already mentioned this :

Please take a look at 2613af0ed18a11d5c566a81f9a6510b73180660a
("virtio_net: migrate mergeable rx buffers to page frag allocators")

I do not see why skb_segment() would be tied to GRO -> GSO, with
the property of each page frag being exactly MSS sized.

We use it in TCP stack with page frags of any size.

skb_segment() needs to iterate properly on all page frags,
the one found in the first skb, and the ones found on frag_list skbs.
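
Roughly, the walk that is needed looks like this (sketch only, not a
patch; emit_frag() is a stand-in for the segment-filling logic):

        int i, f;
        struct sk_buff *fskb;

        /* page frags of the head skb first ... */
        for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
                emit_frag(&skb_shinfo(skb)->frags[i]);

        /* ... then the page frags of every skb on the frag_list,
         * whatever their individual sizes happen to be
         */
        skb_walk_frags(skb, fskb)
                for (f = 0; f < skb_shinfo(fskb)->nr_frags; f++)
                        emit_frag(&skb_shinfo(fskb)->frags[f]);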



Herbert Xu Nov. 6, 2013, 4:28 a.m. UTC | #4
On Tue, Nov 05, 2013 at 08:23:05PM -0800, Eric Dumazet wrote:
>
> Nope, I already mentioned this :
> 
> Please take a look at 2613af0ed18a11d5c566a81f9a6510b73180660a
> ("virtio_net: migrate mergeable rx buffers to page frag allocators")

This patch should simply be reverted.  You guys steam-rolled over
the virtio_net people's concerns that this seriously impacts virt
performance.

Cheers,
Eric Dumazet Nov. 6, 2013, 5:20 a.m. UTC | #5
On Wed, 2013-11-06 at 12:28 +0800, Herbert Xu wrote:
> On Tue, Nov 05, 2013 at 08:23:05PM -0800, Eric Dumazet wrote:
> >
> > Nope, I already mentioned this :
> > 
> > Please take a look at 2613af0ed18a11d5c566a81f9a6510b73180660a
> > ("virtio_net: migrate mergeable rx buffers to page frag allocators")
> 
> This patch should simply be reverted.  You guys steam-rolled over
> the virtio_net people's concerns that this seriously impacts virt
> performance.
> 

Have you followed netdev traffic lately ?

virtio_net performance is better than before, and we did no revert,
thank you.

I am really tired, I'll fix skb_segment() tomorrow.



Herbert Xu Nov. 6, 2013, 8:04 a.m. UTC | #6
On Tue, Nov 05, 2013 at 09:20:22PM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-06 at 12:28 +0800, Herbert Xu wrote:
> > On Tue, Nov 05, 2013 at 08:23:05PM -0800, Eric Dumazet wrote:
> > >
> > > Nope, I already mentioned this :
> > > 
> > > Please take a look at 2613af0ed18a11d5c566a81f9a6510b73180660a
> > > ("virtio_net: migrate mergeable rx buffers to page frag allocators")
> > 
> > This patch should simply be reverted.  You guys steam-rolled over
> > the virtio_net people's concerns that this seriously impacts virt
> > performance.
> > 
> 
> Have you followed netdev traffic lately ?
> 
> virtio_net performance is better than before, and we did no revert,
> thank you.

In the last email I saw from Michael Tsirkin, he still wasn't convinced
that your patch does not degrade virt performance.  Where is the
reply to that?

Cheers,
Herbert Xu Nov. 6, 2013, 8:16 a.m. UTC | #7
On Wed, Nov 06, 2013 at 04:04:25PM +0800, Herbert Xu wrote:
> On Tue, Nov 05, 2013 at 09:20:22PM -0800, Eric Dumazet wrote:
> > On Wed, 2013-11-06 at 12:28 +0800, Herbert Xu wrote:
> > > On Tue, Nov 05, 2013 at 08:23:05PM -0800, Eric Dumazet wrote:
> > > >
> > > > Nope, I already mentioned this :
> > > > 
> > > > Please take a look at 2613af0ed18a11d5c566a81f9a6510b73180660a
> > > > ("virtio_net: migrate mergeable rx buffers to page frag allocators")
> > > 
> > > This patch should simply be reverted.  You guys steam-rolled over
> > > the virtio_net people's concerns that this seriously impacts virt
> > > performance.
> > > 
> > 
> > Have you followed netdev traffic lately ?
> > 
> > virtio_net performance is better than before, and we did no revert,
> > thank you.
> 
> In the last email I saw from Michael Tsirkin, he still wasn't convinced
> that your patch does not degrade virt performance.  Where is the
> reply to that?

I just looked at the aforementioned patch again and it is seriously
broken! How on earth did it get merged?

Instead of using perfectly sane 4K pages per frag to store guest-to-guest
traffic, we now end up using 1.5K frags, which is why you end up
having to use the frag_list, WTF?

Dave, please revert the above commit as it is seriously broken.

Whatever performance problem it is trying to address can surely
be fixed without being so stupid as to break up perfectly sized
4K pages into 1.5K chunks.

Thanks!
Herbert Xu Nov. 6, 2013, 1:12 p.m. UTC | #8
On Wed, Nov 06, 2013 at 04:16:38PM +0800, Herbert Xu wrote:
>
> I just looked at the aforementioned patch again and it is seriously
> broken! How on earth did it get merged?
> 
> Instead of using perfectly sane 4K pages per frag to store guest-to-guest
> traffic, we now end up using 1.5K frags, which is why you end up
> having to use the frag_list, WTF?
> 
> Dave, please revert the above commit as it is seriously broken.

I take that back.  While the original patch was seriously broken,
it has since been fixed by the coalescing patch that Jason Wang
wrote.

It's still pretty weird to be dividing page frags into 1500-byte
chunks and then merging back up to 4K but at least it should do the
right thing now.
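
For reference, the coalescing in the receive path boils down to roughly
this (sketch from memory; skb_coalesce_rx_frag() is the helper Jason
added, the surrounding variable names are my own):

        int i = skb_shinfo(skb)->nr_frags - 1;
        skb_frag_t *last = &skb_shinfo(skb)->frags[i];

        if (skb_frag_page(last) == page &&
            last->page_offset + skb_frag_size(last) == offset) {
                /* the new chunk is contiguous in the same page:
                 * grow the last frag instead of using a new slot
                 */
                skb_coalesce_rx_frag(skb, i, len, truesize);
        } else {
                skb_add_rx_frag(skb, i + 1, page, offset, len, truesize);
        }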

With regard to the impact on my skb_segment patch, there should be
none, because those packets ultimately still originate from either
GRO or the TCP stack, in which case the same assumptions on
frag_list still hold.

However, we do have to guard against evil hosts/guests injecting
bogus packets into our stack, so either we'll have to lose the
BUG_ON in skb_segment or we'll need some sort of a filter in
virtio_net.

Perhaps a rate-limited printk might be the way to go.
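
Something like this, say (sketch; the error label is the one already
in skb_segment()):

        if (unlikely(fskb->len != len)) {
                net_warn_ratelimited("skb_segment: bogus frag_list skb length\n");
                err = -EINVAL;
                goto err;
        }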

Cheers,
Eric Dumazet Nov. 6, 2013, 3:01 p.m. UTC | #9
On Wed, 2013-11-06 at 21:12 +0800, Herbert Xu wrote:

> I take that back.  While the original patch was seriously broken,
> it has since been fixed by the coalescing patch that Jason Wang
> wrote.
> 

The patch was not 'broken', but a step in the right direction. I am
very sorry if you think otherwise.

> It's still pretty weird to be dividing page frags into 1500-byte
> chunks and then merging back up to 4K but at least it should do the
> right thing now.


Have you thought about arches having PAGE_SIZE=65536, and how bad it is
to use a full page per network frame ? It is lazy and x86 centered.

So after our patches, we now have an optimal situation, even on these
arches.

On x86, a full 64KB GSO packet will now fit in 2 (or 3) frags allocated
in 32KB pages (order-3) in the fast path (i.e. if there is no high
memory pressure).
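
(Arithmetic: 65536 bytes of data fit in two 32768-byte chunks when the
packet starts at the beginning of a chunk; the third frag is only needed
when it starts part-way into one.)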

According to our tests, performance is better on x86, and virtio_net
is now usable on arches with PAGE_SIZE=65536.



Eric Dumazet Nov. 6, 2013, 3:05 p.m. UTC | #10
On Wed, 2013-11-06 at 16:16 +0800, Herbert Xu wrote:

> Instead of using perfectly sane 4K pages per frag to store guest-to-guest
> traffic, we now end up using 1.5K frags, which is why you end up
> having to use the frag_list, WTF?

Sure, if your host has an infinite amount of memory, we can remove all the
silly, expensive and bothersome checks we added in the stack against
skb->truesize.

And yes, network stack will be faster.

But in real life, we have memory constraints.


Herbert Xu Nov. 7, 2013, 12:36 a.m. UTC | #11
On Wed, Nov 06, 2013 at 07:01:10AM -0800, Eric Dumazet wrote:
> Have you thought about arches having PAGE_SIZE=65536, and how bad it is
> to use a full page per network frame ? It is lazy and x86 centered.

So instead if we were sending a full 64K packet on such an arch to
another guest, we'd now chop it up into 1.5K chunks and reassemble them.

> So after our patches, we now have an optimal situation, even on these
> arches.

Optimal only for physical incoming packets with no jumbo frames.

What's worse, I now realise that the coalesce thing isn't even
guaranteed to work.  It probably works in your benchmarks because
you're working with freshly allocated pages.

But once the system has been running for a while, I see nothing
in the virtio_net code that tries to prevent fragmentation.  Once
fragmentation sets in, you'll be back in the terrible situation
that we were in prior to the coalesce patch.

Jason/Michael (Tsirkin), am I missing something that would prevent
fragmentation of these buffers?

Cheers,
Herbert Xu Nov. 7, 2013, 12:39 a.m. UTC | #12
On Wed, Nov 06, 2013 at 07:05:41AM -0800, Eric Dumazet wrote:
> 
> Sure, if your host has an infinite amount of memory, we can remove all the
> silly, expensive and bothersome checks we added in the stack against
> skb->truesize.
> 
> And yes, network stack will be faster.
> 
> But in real life, we have memory constraints.

With intra-page fragmentation, it's not clear that you'd be better
off with this patch in the long run anyway.

Cheers,
Eric Dumazet Nov. 7, 2013, 1:03 a.m. UTC | #13
On Thu, 2013-11-07 at 08:36 +0800, Herbert Xu wrote:
> On Wed, Nov 06, 2013 at 07:01:10AM -0800, Eric Dumazet wrote:
> > Have you thought about arches having PAGE_SIZE=65536, and how bad it is
> > to use a full page per network frame ? It is lazy and x86 centered.
> 
> So instead if we were sending a full 64K packet on such an arch to
> another guest, we'd now chop it up into 1.5K chunks and reassemble them.
> 

Yep, and speed is now better than before the patches.

I understand you do not believe it. But this is the truth.

And now your guest can receive a bunch of small UDP frames, without
having to drop them because the sk->rcvbuf limit is hit.
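
(Rough example: if every received frame is charged a full 65536-byte
page to skb->truesize, three or four tiny UDP datagrams already exhaust
a default receive buffer of a couple hundred KB; with ~1.5KB frags the
same budget covers on the order of a hundred of them.)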

> > So after our patches, we now have an optimal situation, even on these
> > arches.
> 
> Optimal only for physical incoming packets with no jumbo frames.

Have you actually tested this ?

> 
> What's worse, I now realise that the coalesce thing isn't even
> guaranteed to work.  It probably works in your benchmarks because
> you're working with freshly allocated pages.
> 

Oh well.

> But once the system has been running for a while, I see nothing
> in the virtio_net code that tries to prevent fragmentation.  Once
> fragmentation sets in, you'll be back in the terrible situation
> that we were in prior to the coalesce patch.
> 

There is no fragmentation, since we allocate 32KB pages.

Michael Dalton worked on a patch to add EWMA for auto-sizing and a
private page_frag per virtio queue, instead of using the per-cpu one.
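
The auto-sizing is essentially an EWMA of recent packet sizes used to
pick the length of the next posted buffer, i.e. something like this
(sketch; the names are mine, not Michael's actual patch):

        /* per receive queue */
        unsigned int avg_pkt_len;

        /* on every received packet of length pkt_len */
        avg_pkt_len = (avg_pkt_len * (EWMA_WEIGHT - 1) + pkt_len) / EWMA_WEIGHT;

        /* when refilling the ring */
        buf_len = clamp_t(unsigned int, avg_pkt_len,
                          MAX_PACKET_LEN, PAGE_SIZE);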

On x86 :

- All offloads enabled (average packet size should be >> MTU-size)

net-next trunk w/ virtio_net prior to 2613af0ed (PAGE_SIZE bufs): 14179.17Gb/s
net-next trunk (MTU-size bufs): 13390.69Gb/s
net-next trunk + auto-tune: 14358.41Gb/s

- guest_tso4/guest_csum disabled (forces MTU-sized packets on receiver)

net-next trunk w/ virtio_net prior to 2613af0ed: 4059.49Gb/s
net-next trunk (MTU 1500, packet takes two bufs due to sizing bug): 4174.30Gb/s
net-next trunk (MTU 1480, packet fits in one buf): 6672.16Gb/s
net-next trunk + auto-tune (MTU 1500 fixed, packet uses one buf): 6791.28Gb/s




Herbert Xu Nov. 7, 2013, 1:47 a.m. UTC | #14
On Wed, Nov 06, 2013 at 05:03:28PM -0800, Eric Dumazet wrote:
>
> > But once the system has been running for a while, I see nothing
> > in the virtio_net code that tries to prevent fragmentation.  Once
> > fragmentation sets in, you'll be back in the terrible situation
> > that we were in prior to the coalesce patch.
> 
> There is no fragmentation, since we allocate 32KB pages.

Say the system is fragmented sufficiently that you'll end up with
0-order pages.  In that case you'll only ever be able to coalesce
two packets.

Real systems that run for more than a day do end up with seriously
fragmented memory.

Cheers,
Eric Dumazet Nov. 7, 2013, 2:02 a.m. UTC | #15
On Thu, 2013-11-07 at 09:47 +0800, Herbert Xu wrote:

> Say the system is fragmented sufficiently that you'll end up with
> 0-order pages.  In that case you'll only ever be able to coalesce
> two packets.

4K page will contain 2 frags and they will coalesce.

Performance will still be quite good.

We probably add a tweak, to not have any hole in this case.

> 
> Real systems that run for more than a day do end up with seriously
> fragmented memory.

Sure, but having shallow skbs in the first place helps quite a bit.

There is no perfect solution, unless of course you change virtio_net to
provide different queues, with different frag sizes.

Sort of what the NIU driver uses.


Eric Dumazet Nov. 7, 2013, 2:08 a.m. UTC | #16
On Wed, 2013-11-06 at 18:02 -0800, Eric Dumazet wrote:
> On Thu, 2013-11-07 at 09:47 +0800, Herbert Xu wrote:
> 
> > Say the system is fragmented sufficiently that you'll end up with
> > 0-order pages.  In that case you'll only ever be able to coalesce
> > two packets.
> 
> 4K page will contain 2 frags and they will coalesce.
> 
> Performance will still be quite good.
> 
> We probably add a tweak, to not have any hole in this case.

After the skb_page_frag_refill() call, it's trivial to check if the
page is order-0.

If yes, use the whole remaining space, instead of MAX_PACKET_LEN.

-> If memory is fragmented, we switch back to the old behavior.

Really, it's that simple.
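
Roughly (sketch, using the existing helpers; pfrag/gfp as in virtio_net):

        if (unlikely(!skb_page_frag_refill(MAX_PACKET_LEN, pfrag, gfp)))
                return -ENOMEM;

        len = MAX_PACKET_LEN;
        if (pfrag->size == PAGE_SIZE) {
                /* refill fell back to an order-0 page: use all the
                 * remaining space so no unusable hole is left behind
                 */
                len = pfrag->size - pfrag->offset;
        }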





Herbert Xu Nov. 7, 2013, 2:15 a.m. UTC | #17
On Wed, Nov 06, 2013 at 06:02:38PM -0800, Eric Dumazet wrote:
> 
> 4K page will contain 2 frags and they will coalesce.
> 
> Performance will still be quite good.
>
> We probably add a tweak, to not have any hole in this case.

Also have you considered the security aspect of this? If you have
two skbs sharing a page, and one gets transmitted to a third party
using zero-copy, the other unrelated skb's content may become visible
where it shouldn't.

Cheers,
Eric Dumazet Nov. 7, 2013, 2:37 a.m. UTC | #18
On Thu, 2013-11-07 at 10:15 +0800, Herbert Xu wrote:
> On Wed, Nov 06, 2013 at 06:02:38PM -0800, Eric Dumazet wrote:
> > 
> > 4K page will contain 2 frags and they will coalesce.
> > 
> > Performance will still be quite good.
> >
> > We probably add a tweak, to not have any hole in this case.
> 
> Also have you considered the security aspect of this? If you have
> two skbs sharing a page, and one gets transmitted to a third party
> using zero-copy, the other unrelated skb's content may become visible
> where it shouldn't.

If the hypervisor is doomed, there is nothing we can do.

virtio_net owns the pages, and relies on hypervisor doing the right
thing.

That you use part of the page is really irrelevant.

It seems you are speaking of virtio_net sending frames, but it's
about receiving frames here.

We receive frames, delivered by the trusted hypervisor.

OK, I will shut up now, since apparently I really upset you.




Herbert Xu Nov. 7, 2013, 2:41 a.m. UTC | #19
On Wed, Nov 06, 2013 at 06:37:35PM -0800, Eric Dumazet wrote:
>
> If the hypervisor is doomed, there is nothing we can do.

It's not just the hypervisor, you know.  There is also DMA etc.
The hypervisor doesn't know that there is stuff in the page that
should not be exposed.

Also one day we may want to do zero-copy receive in which case
this would be even more important.

Cheers,
Jason Wang Nov. 7, 2013, 2:52 a.m. UTC | #20
On 11/07/2013 08:36 AM, Herbert Xu wrote:
> On Wed, Nov 06, 2013 at 07:01:10AM -0800, Eric Dumazet wrote:
>> Have you thought about arches having PAGE_SIZE=65536, and how bad it is
>> to use a full page per network frame ? It is lazy and x86 centered.
> So instead if we were sending a full 64K packet on such an arch to
> another guest, we'd now chop it up into 1.5K chunks and reassemble them.
>
>> So after our patches, we now have an optimal situation, even on these
>> arches.
> Optimal only for physical incoming packets with no jumbo frames.
>
> What's worse, I now realise that the coalesce thing isn't even
> guaranteed to work.  It probably works in your benchmarks because
> you're working with freshly allocated pages.
>
> But once the system has been running for a while, I see nothing
> in the virtio_net code that tries to prevent fragmentation.  Once
> fragmentation sets in, you'll be back in the terrible situation
> that we were in prior to the coalesce patch.
>
> Jason/Michael (Tsirkin), am I missing something that would prevent
> fragmentation of these buffers?
>
> Cheers,

No. Maybe we can use per-queue buffers instead.
Michael S. Tsirkin Nov. 7, 2013, 5:56 a.m. UTC | #21
On Wed, Nov 06, 2013 at 06:02:38PM -0800, Eric Dumazet wrote:
> On Thu, 2013-11-07 at 09:47 +0800, Herbert Xu wrote:
> 
> > Say the system is fragmented sufficiently that you'll end up with
> > 0-order pages.  In that case you'll only ever be able to coalesce
> > two packets.
> 
> 4K page will contain 2 frags and they will coalesce.
> 
> Performance will still be quite good.
> 
> We probably add a tweak, to not have any hole in this case.
> 
> > 
> > Real systems that run for more than a day do end up with seriously
> > fragmented memory.
> 
> Sure, but having shallow skbs in the first place helps quite a bit.
> 
> There is no perfect solution, unless of course you change virtio_net to
> provide different queues, with different frag sizes.
> 
> Sort of what the NIU driver uses.
> 

I considered doing this but won't this mean packets can get reordered?
In practice we need to maintain ordering of RX frames within
a given flow, correct?
Eric Dumazet Nov. 7, 2013, 7:07 a.m. UTC | #22
On Thu, 2013-11-07 at 07:56 +0200, Michael S. Tsirkin wrote:

> I considered doing this but won't this mean packets can get reordered?
> In practice we need to maintain ordering of RX frames within
> a given flow, correct?

Sorry, I was referring to two pools of frags, instead of a single one.

One pool of frags of (1500 + hdr) bytes
One pool of frags of 4096 bytes

But it's still one logical queue.
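
Concretely, something like this (sketch only, nothing I have coded up;
the fields would live in the per-queue struct):

        /* per receive queue: two frag pools, still one logical queue */
        struct page_frag small_pool;    /* (1500 + hdr)-byte chunks */
        struct page_frag big_pool;      /* 4096-byte chunks */

        /* pick the pool per posted buffer; completions stay in order
         * because it is still a single virtqueue
         */
        pool = expect_big_frame ? &rq->big_pool : &rq->small_pool;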



Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3735fad..ec8e8bc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2816,7 +2816,24 @@  struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			hsize = len;
 
 		if (!hsize && i >= nfrags) {
-			BUG_ON(fskb->len != len);
+			if (fskb->len != len) {
+				SKB_FRAG_ASSERT(fskb);
+
+				nskb = skb_segment(fskb, features);
+
+				err = PTR_ERR(nskb);
+				if (IS_ERR(nskb))
+					goto err;
+				err = -ENOMEM;
+
+				if (segs)
+					tail->next = nskb;
+				else
+					segs = nskb;
+				tail = nskb;
+				while (tail->next)
+					tail = tail->next;
+			}
 
 			pos += len;
 			nskb = skb_clone(fskb, GFP_ATOMIC);