diff mbox

[RFC,v7,01/19] Add a new structure for skb buffer from external.

Message ID 1275732899-5423-1-git-send-email-xiaohui.xin@intel.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Xin, Xiaohui June 5, 2010, 10:14 a.m. UTC
From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/skbuff.h |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

Comments

Stephen Hemminger June 6, 2010, 11:13 p.m. UTC | #1
Still not sure this is a good idea for a couple of reasons:

1. We already have lots of special cases with skb's (frags and fraglist),
   and skb's travel through a lot of different parts of the kernel.  So any
   new change like this creates lots of exposed points for new bugs. Look
   at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
   and ppp and ...

2. SKB's can have infinite lifetime in the kernel. If these buffers come from
   a fixed size pool in an external device, they can easily all get tied up
   if you have a slow listener. What happens then?
Andi Kleen June 7, 2010, 7:51 a.m. UTC | #2
Stephen Hemminger <shemminger@vyatta.com> writes:

> Still not sure this is a good idea for a couple of reasons:
>
> 1. We already have lots of special cases with skb's (frags and fraglist),
>    and skb's travel through a lot of different parts of the kernel.  So any
>    new change like this creates lots of exposed points for new bugs. Look
>    at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
>    and ppp and ...
>
> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
>    a fixed size pool in an external device, they can easily all get tied up
>    if you have a slow listener. What happens then?

3. If they come from an internal pool what happens when the kernel runs 
low on memory? How is that pool balanced against other kernel
memory users?

-Andi
Mitchell Erblich June 7, 2010, 8:17 a.m. UTC | #3
On Jun 7, 2010, at 12:51 AM, Andi Kleen wrote:

> Stephen Hemminger <shemminger@vyatta.com> writes:
> 
>> Still not sure this is a good idea for a couple of reasons:
>> 
>> 1. We already have lots of special cases with skb's (frags and fraglist),
>>   and skb's travel through a lot of different parts of the kernel.  So any
>>   new change like this creates lots of exposed points for new bugs. Look
>>   at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
>>   and ppp and ...
>> 
>> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
>>   a fixed size pool in an external device, they can easily all get tied up
>>   if you have a slow listener. What happens then?
> 
> 3. If they come from an internal pool what happens when the kernel runs 
> low on memory? How is that pool balanced against other kernel
> memory users?
> 
> -Andi
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only.

In general,

When an internal pool is created/used, there SHOULD be a reason.
Maybe to keep allocation latency to a minimum, OR ...

Now IMO,

internal pool objects should have a ref count, and if that count is 0,
then under memory pressure and/or when the number of objects is above a
high-water mark, they are freed,

OR if there is a last-reference age field, then the object is to be
cleaned if dirty, then freed,

Else, the pool is allowed to grow if the number of objects in the
pool is below a set max (max COULD equal Infinity).
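
A rough sketch of such a reap policy (the structures and names below are
illustrative only, not from this patch set):

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/jiffies.h>

/* Hypothetical pool object: one externally provided buffer. */
struct ext_buf {
	struct list_head	list;
	unsigned int		refcnt;		/* 0 == idle */
	unsigned long		last_ref;	/* jiffies of last put */
	bool			dirty;
};

struct ext_pool {
	spinlock_t		lock;
	struct list_head	idle;		/* refcnt == 0 objects */
	unsigned int		nr_idle;
	unsigned int		high_water;	/* reap above this mark */
	unsigned int		max;		/* growth cap, checked at alloc time (not shown) */
};

/* Free idle objects under memory pressure or above the high-water mark. */
static void ext_pool_reap(struct ext_pool *pool, bool mem_pressure)
{
	struct ext_buf *buf, *tmp;

	spin_lock(&pool->lock);
	list_for_each_entry_safe(buf, tmp, &pool->idle, list) {
		if (!mem_pressure && pool->nr_idle <= pool->high_water)
			break;
		if (buf->refcnt)
			continue;
		/* only reap objects whose last reference is old enough */
		if (time_before(jiffies, buf->last_ref + HZ))
			continue;
		if (buf->dirty)
			buf->dirty = false;	/* "clean if dirty" step */
		list_del(&buf->list);
		pool->nr_idle--;
		kfree(buf);
	}
	spin_unlock(&pool->lock);
}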



Mitchell Erblich
Herbert Xu June 8, 2010, 5:27 a.m. UTC | #4
On Sun, Jun 06, 2010 at 04:13:48PM -0700, Stephen Hemminger wrote:
> Still not sure this is a good idea for a couple of reasons:
> 
> 1. We already have lots of special cases with skb's (frags and fraglist),
>    and skb's travel through a lot of different parts of the kernel.  So any
>    new change like this creates lots of exposed points for new bugs. Look
>    at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
>    and ppp and ...
> 
> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
>    a fixed size pool in an external device, they can easily all get tied up
>    if you have a slow listener. What happens then?

I agree with Stephen on this.

FWIW I don't think we even need the external pages concept in
order to implement zero-copy receive (which I gather is the intent
here).

Here is one way to do it: simply construct a completely non-linear
packet in the driver, as you would if you were using the GRO frags
interface (grep for napi_gro_frags under drivers/net for examples).

This way you can transfer the entire contents of the packet without
copying through to the other side, provided that the host stack does
not modify the packet.

If the host side did modify the packet then we have to incur the
memory cost anyway.

IOW I think the only feature provided by the external pages
construct is allowing the skb->head area to be shared without
copying.  I'm claiming that this can be done by simply making
skb->head empty.
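
As a rough illustration of that model (simplified from how the existing
napi_gro_frags users under drivers/net do it; the function name and the
page/offset/length parameters below are placeholders):

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/mm.h>

/*
 * Build a completely non-linear skb: skb->head stays empty
 * (skb_headlen() == 0) and the packet data lives in page frags.
 */
static void rx_frags_example(struct napi_struct *napi,
			     struct page *rx_page,
			     unsigned int rx_off, unsigned int rx_len)
{
	struct sk_buff *skb = napi_get_frags(napi);

	if (!skb) {
		put_page(rx_page);	/* drop on allocation failure */
		return;
	}

	/* attach the receive page as frag 0, no copy of the payload */
	skb_fill_page_desc(skb, 0, rx_page, rx_off, rx_len);
	skb->len += rx_len;
	skb->data_len += rx_len;
	skb->truesize += rx_len;

	/* GRO pulls whatever headers it needs out of frag 0 itself */
	napi_gro_frags(napi);
}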

Cheers,
Xin, Xiaohui June 9, 2010, 8:29 a.m. UTC | #5
>-----Original Message-----
>From: Stephen Hemminger [mailto:shemminger@vyatta.com]
>Sent: Monday, June 07, 2010 7:14 AM
>To: Xin, Xiaohui
>Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>mst@redhat.com; mingo@elte.hu; davem@davemloft.net; herbert@gondor.apana.org.au;
>jdike@linux.intel.com
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>Still not sure this is a good idea for a couple of reasons:
>
>1. We already have lots of special cases with skb's (frags and fraglist),
>   and skb's travel through a lot of different parts of the kernel.  So any
>   new change like this creates lots of exposed points for new bugs. Look
>   at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
>   and ppp and ...
>

Yes, I agree with your concern to some extent. But the skbs which use
external pages in our case have a short journey: they start in the NIC
driver and are forwarded to the guest immediately. They do not travel
into the host kernel stack or through any filters, which avoids many of
the problems you have mentioned here.

Another point is that we try to make the solution generic across
different NIC drivers, so that many drivers may use this way without
modifications.
  
>2. SKB's can have infinite lifetime in the kernel. If these buffers come from
>   a fixed size pool in an external device, they can easily all get tied up
>   if you have a slow listener. What happens then?

The pool is not fixed in size; its size is the number of usable buffers
submitted by the guest virtio-net driver. Guest virtio-net will check how
many buffers have been filled and resubmit more. What does a slow
listener mean here? A slow NIC driver?

Thanks
Xiaohui
Xin, Xiaohui June 9, 2010, 8:48 a.m. UTC | #6
>-----Original Message-----
>From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On Behalf Of Andi
>Kleen
>Sent: Monday, June 07, 2010 3:51 PM
>To: Stephen Hemminger
>Cc: Xin, Xiaohui; netdev@vger.kernel.org; kvm@vger.kernel.org;
>linux-kernel@vger.kernel.org; mst@redhat.com; mingo@elte.hu; davem@davemloft.net;
>herbert@gondor.apana.org.au; jdike@linux.intel.com
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>Stephen Hemminger <shemminger@vyatta.com> writes:
>
>> Still not sure this is a good idea for a couple of reasons:
>>
>> 1. We already have lots of special cases with skb's (frags and fraglist),
>>    and skb's travel through a lot of different parts of the kernel.  So any
>>    new change like this creates lots of exposed points for new bugs. Look
>>    at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
>>    and ppp and ...
>>
>> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
>>    a fixed size pool in an external device, they can easily all get tied up
>>    if you have a slow listener. What happens then?
>
>3. If they come from an internal pool what happens when the kernel runs
>low on memory? How is that pool balanced against other kernel
>memory users?
>
The size of the pool is currently limited by the virtqueue capacity.
If the virtqueue is really huge, then how to balance the memory is a problem.
I have not yet thought clearly about how to tune it dynamically...

>-Andi
>
>--
>ak@linux.intel.com -- Speaking for myself only.
Xin, Xiaohui June 9, 2010, 9:22 a.m. UTC | #7
>-----Original Message-----
>From: Mitchell Erblich [mailto:erblichs@earthlink.net]
>Sent: Monday, June 07, 2010 4:17 PM
>To: Andi Kleen
>Cc: Stephen Hemminger; Xin, Xiaohui; netdev@vger.kernel.org; kvm@vger.kernel.org;
>linux-kernel@vger.kernel.org; mst@redhat.com; mingo@elte.hu; davem@davemloft.net;
>herbert@gondor.apana.org.au; jdike@linux.intel.com
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>
>On Jun 7, 2010, at 12:51 AM, Andi Kleen wrote:
>
>> Stephen Hemminger <shemminger@vyatta.com> writes:
>>
>>> Still not sure this is a good idea for a couple of reasons:
>>>
>>> 1. We already have lots of special cases with skb's (frags and fraglist),
>>>   and skb's travel through a lot of different parts of the kernel.  So any
>>>   new change like this creates lots of exposed points for new bugs. Look
>>>   at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
>>>   and ppp and ...
>>>
>>> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
>>>   a fixed size pool in an external device, they can easily all get tied up
>>>   if you have a slow listener. What happens then?
>>
>> 3. If they come from an internal pool what happens when the kernel runs
>> low on memory? How is that pool balanced against other kernel
>> memory users?
>>
>> -Andi
>>
>> --
>> ak@linux.intel.com -- Speaking for myself only.
>
>In general,
>
>When an internal pool is created/used, their SHOULD be a reason.
>Maybe, to keep allocation latency to a min, OR ...
>
The internal pool here is a collection of user buffers submitted by the
guest virtio-net driver. The guest puts buffers into it and the driver
takes buffers from it. If the guest submits more buffers than the driver
needs, we need somewhere to hold them; that is what the internal pool is
for.

>Now IMO,
>
>internal pool objects should have a ref count and
>if that count is 0, then under memory pressure and/or num
>of objects are above a high water mark, then they are freed,
>
>OR if there is a last reference age field, then the object is to be
>cleaned if dirty, then freed,
>
>Else, the pool is allowed to grow if the number of objects in the
>pool is below a set max (max COULD equal Infinity).

Thanks for the thoughts.

Basically, the size of the internal pool is not decided by the pool
itself, and adding/removing objects in the pool is not a task of the
pool itself either. Both are decided by the guest virtio-net driver and
the vhost-net driver together, and by the guest's receive speed and
submit speed.
The max size of the pool is limited by the number of virtqueue buffers.

Thanks
Xiaohui

>
>
>
>Mitchell Erblich
Xin, Xiaohui June 9, 2010, 9:54 a.m. UTC | #8
>-----Original Message-----
>From: Herbert Xu [mailto:herbert@gondor.apana.org.au]
>Sent: Tuesday, June 08, 2010 1:28 PM
>To: Stephen Hemminger
>Cc: Xin, Xiaohui; netdev@vger.kernel.org; kvm@vger.kernel.org;
>linux-kernel@vger.kernel.org; mst@redhat.com; mingo@elte.hu; davem@davemloft.net;
>jdike@linux.intel.com
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>On Sun, Jun 06, 2010 at 04:13:48PM -0700, Stephen Hemminger wrote:
>> Still not sure this is a good idea for a couple of reasons:
>>
>> 1. We already have lots of special cases with skb's (frags and fraglist),
>>    and skb's travel through a lot of different parts of the kernel.  So any
>>    new change like this creates lots of exposed points for new bugs. Look
>>    at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
>>    and ppp and ...
>>
>> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
>>    a fixed size pool in an external device, they can easily all get tied up
>>    if you have a slow listener. What happens then?
>
>I agree with Stephen on this.
>
>FWIW I don't think we even need the external pages concept in
>order to implement zero-copy receive (which I gather is the intent
>here).
>
>Here is one way to do it, simply construct a completely non-linear
>packet in the driver, as you would if you were using the GRO frags
>interface (grep for napi_gro_frags under drivers/net for examples).
>
I'm not sure I understand your way correctly:
1) Does the way only work for drivers with the SG feature, since the
packet is non-linear?

2) Is skb->data still pointing to guest user buffers?
If yes, how do we avoid modifications to the net core handling of the skb?

3) In our way only parts of the drivers need to be modified to support
zero-copy. Here, do we need to modify all the drivers?

If I have missed your idea, could you explain it in more detail?

>This way you can transfer the entire contents of the packet without
>copying through to the other side, provided that the host stack does
>not modify the packet.
>

>If the host side did modify the packet then we have to incur the
>memory cost anyway.
>
>IOW I think the only feature provided by the external pages
>construct is allowing the skb->head area to be shared without
>copying.  I'm claiming that this can be done by simply making
>skb->head empty.
>
I think making skb->head empty from the start will require more effort to
pass the checks against the skb header. Have I missed something here? I
actually make skb->head NULL only just before kfree(skb) in
skb_release_data(); it's done by the callback we have made for the skb.

Thanks
Xiaohui

>Cheers,
>--
>Visit Openswan at http://www.openswan.org/
>Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
>Home Page: http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Herbert Xu June 11, 2010, 5:21 a.m. UTC | #9
On Wed, Jun 09, 2010 at 05:54:02PM +0800, Xin, Xiaohui wrote:
>
> I'm not sure if I understand your way correctly:
> 1) Does the way only deal with driver with SG feature? Since packet 
> is non-linear...

No the hardware doesn't have to support SG.  You just need to
place the entire packet contents in a page instead of skb->head.

> 2) Is skb->data still pointing to guest user buffers?
> If yes, how to avoid the modifications to net core change to skb?

skb->data would not point to guest user buffers.  In the common
case the packet is not modified on its way to the guest so this
is not an issue.

In the rare case where it is modified, you only have to copy the
bits which are modified and the cost of that is inconsequential
since you have to write to that memory anyway.

> 3) In our way only parts of drivers need be modified to support zero-copy.
> and here, need we modify all the drivers?

If you're asking the portion of each driver supporting zero-copy
that needs to be modified, then AFAICS this doesn't change that
very much at all.

> I think to make skb->head empty at first will cause more effort to pass the check with 
> skb header. Have I missed something here? I really make the skb->head NULL
> just before kfree(skb) in skb_release_data(), it's done by callback we have made for skb.

No I'm not suggesting you set it to NULL.  It should have some
memory allocated, but skb_headlen(skb) should be zero.

Please have a look at how the napi_gro_frags interface works (e.g.,
in drivers/net/cxgb3/sge.c).  This is exactly the model that I am
suggesting.

Cheers,
Xin, Xiaohui June 12, 2010, 9:31 a.m. UTC | #10
>-----Original Message-----
>From: Herbert Xu [mailto:herbert@gondor.apana.org.au]
>Sent: Friday, June 11, 2010 1:21 PM
>To: Xin, Xiaohui
>Cc: Stephen Hemminger; netdev@vger.kernel.org; kvm@vger.kernel.org;
>linux-kernel@vger.kernel.org; mst@redhat.com; mingo@elte.hu; davem@davemloft.net;
>jdike@linux.intel.com
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>On Wed, Jun 09, 2010 at 05:54:02PM +0800, Xin, Xiaohui wrote:
>>
>> I'm not sure if I understand your way correctly:
>> 1) Does the way only deal with driver with SG feature? Since packet
>> is non-linear...
>
>No the hardware doesn't have to support SG.  You just need to
>place the entire packet contents in a page instead of skb->head.
>
>> 2) Is skb->data still pointing to guest user buffers?
>> If yes, how to avoid the modifications to net core change to skb?
>
>skb->data would not point to guest user buffers.  In the common
>case the packet is not modified on its way to the guest so this
>is not an issue.
>
>In the rare case where it is modified, you only have to copy the
>bits which are modified and the cost of that is inconsequential
>since you have to write to that memory anyway.
>
>> 3) In our way only parts of drivers need be modified to support zero-copy.
>> and here, need we modify all the drivers?
>
>If you're asking the portion of each driver supporting zero-copy
>that needs to be modified, then AFAICS this doesn't change that
>very much at all.
>
>> I think to make skb->head empty at first will cause more effort to pass the check with
>> skb header. Have I missed something here? I really make the skb->head NULL
>> just before kfree(skb) in skb_release_data(), it's done by callback we have made for skb.
>
>No I'm not suggesting you set it to NULL.  It should have some
>memory allocated, but skb_headlen(skb) should be zero.
>
>Please have a look at how the napi_gro_frags interface works (e.g.,
>in drivers/net/cxgb3/sge.c).  This is exactly the model that I am
>suggesting.
>
>Cheers,
>--
>Visit Openswan at http://www.openswan.org/
>Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
>Home Page: http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Herbert,
I explained what I think the thought in your mind here, please clarify if 
something missed.

1) Modify the driver to allocate user pages instead of using netdev_alloc_skb() if the dev is zero-copied.
  If the driver supports PS mode, then modify alloc_page() too.
2) Use napi_gro_frags() in the driver to receive the user pages instead of the driver's own receive
function.
3) napi_gro_frags() will allocate a small skb and pull the header data from
the first page into skb->data.

Is above the way what you have suggested?
I have thought something in detail about the way.

1) The first page will have an offset after the header is copied into the allocated kernel skb.
The offset should be recalculated when the user page data is transferred to the guest. This
may require modifying some of the GRO code.

2) napi_gro_frags() may remove a page when its data has been completely pulled, but we cannot
put (release) a user page in the normal way. This may require modifying the GRO code too.

3) When the user buffers are returned to the guest, some of them need a vnet header prepended.
That means for some pages, room for the vnet header should be reserved when they are allocated.
But we cannot know at allocation time which one will be used as the first page. If we reserved
vnet header room in every page, then since set_skb_frag() in the guest driver only uses offset 0
for the second and later pages, the page data would be wrong.

4) Since the user buffer pages must eventually be released, we still need a dtor callback to do
that, and then I still need a place to hold it. What do you think about putting it in skb_shinfo?

Currently this is all I can think of.
What do you think?

Thanks
Xiaohui 



Xin, Xiaohui June 13, 2010, 8:58 a.m. UTC | #11
>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of
>Xin, Xiaohui
>Sent: Saturday, June 12, 2010 5:31 PM
>To: Herbert Xu
>Cc: Stephen Hemminger; netdev@vger.kernel.org; kvm@vger.kernel.org;
>linux-kernel@vger.kernel.org; mst@redhat.com; mingo@elte.hu; davem@davemloft.net;
>jdike@linux.intel.com
>Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>>-----Original Message-----
>>From: Herbert Xu [mailto:herbert@gondor.apana.org.au]
>>Sent: Friday, June 11, 2010 1:21 PM
>>To: Xin, Xiaohui
>>Cc: Stephen Hemminger; netdev@vger.kernel.org; kvm@vger.kernel.org;
>>linux-kernel@vger.kernel.org; mst@redhat.com; mingo@elte.hu; davem@davemloft.net;
>>jdike@linux.intel.com
>>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>>
>>On Wed, Jun 09, 2010 at 05:54:02PM +0800, Xin, Xiaohui wrote:
>>>
>>> I'm not sure if I understand your way correctly:
>>> 1) Does the way only deal with driver with SG feature? Since packet
>>> is non-linear...
>>
>>No the hardware doesn't have to support SG.  You just need to
>>place the entire packet contents in a page instead of skb->head.
>>
>>> 2) Is skb->data still pointing to guest user buffers?
>>> If yes, how to avoid the modifications to net core change to skb?
>>
>>skb->data would not point to guest user buffers.  In the common
>>case the packet is not modified on its way to the guest so this
>>is not an issue.
>>
>>In the rare case where it is modified, you only have to copy the
>>bits which are modified and the cost of that is inconsequential
>>since you have to write to that memory anyway.
>>
>>> 3) In our way only parts of drivers need be modified to support zero-copy.
>>> and here, need we modify all the drivers?
>>
>>If you're asking the portion of each driver supporting zero-copy
>>that needs to be modified, then AFAICS this doesn't change that
>>very much at all.
>>
>>> I think to make skb->head empty at first will cause more effort to pass the check with
>>> skb header. Have I missed something here? I really make the skb->head NULL
>>> just before kfree(skb) in skb_release_data(), it's done by callback we have made for skb.
>>
>>No I'm not suggesting you set it to NULL.  It should have some
>>memory allocated, but skb_headlen(skb) should be zero.
>>
>>Please have a look at how the napi_gro_frags interface works (e.g.,
>>in drivers/net/cxgb3/sge.c).  This is exactly the model that I am
>>suggesting.
>>
>>Cheers,
>>--
>>Visit Openswan at http://www.openswan.org/
>>Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
>>Home Page: http://gondor.apana.org.au/~herbert/
>>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>
>Herbert,
>I explained what I think the thought in your mind here, please clarify if
>something missed.
>
>1) Modify driver from netdev_alloc_skb() to alloc user pages if dev is zero-copyed.
>  If the driver support PS mode, then modify alloc_page() too.
>2) Add napi_gro_frags() in driver to receive the user pages instead of driver's receiving
>function.
>3) napi_gro_frags() will allocate small skb and pull the header data from
>the first page to skb->data.
>
>Is above the way what you have suggested?
>I have thought something in detail about the way.
>
>1) The first page will have an offset after the header is copied into allocated kernel skb.
>The offset should be recalculated when the user page data is transferred to guest. This
>may modify some of the gro code.
>
>2) napi_gro_frags() may remove a page when it's data is totally be pulled, but we cannot
>put a user page as normally. This may modify the gro code too.
>
>3) When the user buffer returned to guest, some of them need to be appended a vnet header.
>That means for some pages, the vnet header room should be reserved when allocated.
>But we cannot know which one will be used as the first page when allocated. If we reserved
>vnet header for each page, since the set_skb_frag() in guest driver only use the offset 0 for
>second pages, then page data will be wrong.
>
>4) Since the user buffer pages should be released, so we still need a dtor callback to do that,
>and then I still need a place to hold it. How do you think about to put it in skb_shinfo?
>
>Currently I can only think of this.
>How do you think about then?
>
>Thanks
>Xiaohui

Herbert,
In this way, I think we need to create at least 3 functions in drivers: one to allocate the rx
buffers, one to receive the rx buffers, and one to clean up the rx buffers.

We can also take another way here. We can provide a function that only substitutes for
alloc_page(), and a function to release the pages when cleaning up the rx buffers.
The skb for the rx buffer can be allocated in the original way, and when pushing
the data to the guest, the header data will be copied into the guest buffer. In this way, we
should reserve sufficient room for the header in the first guest user buffer.
That needs modifications to the guest virtio-net kernel. And this way is only suitable for
drivers that support PS mode. Considering that advanced drivers mostly have PS mode,
it should not be a critical issue.

Thanks
Xiaohui
 

Herbert Xu June 17, 2010, 11:20 a.m. UTC | #12
On Sat, Jun 12, 2010 at 05:31:10PM +0800, Xin, Xiaohui wrote:
> 
> 1) Modify driver from netdev_alloc_skb() to alloc user pages if dev is zero-copyed.
>   If the driver support PS mode, then modify alloc_page() too.

Well if you were doing this then the driver wouldn't be generating
skbs at all.  So you don't need to change netdev_alloc_skb.

The driver would currently do alloc_page, so you would replace
that with netdev_alloc_page, which can call your new function
to allocate an external page where appropriate.

IOW you just need one change in the driver if it already uses
the skbless interface: replacing alloc_page.

If the driver doesn't use the skbless interface then you need
to make a few more changes, but it isn't too hard either; it'll
also mean that the driver has less overhead even for normal
use, which is a win-win situation.
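
A minimal sketch of the alloc_page replacement described above (the
zero-copy test and the external-page allocator here are hypothetical
stand-ins, not existing kernel interfaces):

#include <linux/netdevice.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Illustrative stubs standing in for the external-page provider. */
static bool dev_in_zerocopy_mode(struct net_device *dev)
{
	return false;
}

static struct page *external_page_alloc(struct net_device *dev)
{
	return NULL;
}

/*
 * Replacement for a driver's bare alloc_page() in its rx refill path:
 * hand out a guest-backed page when the device runs in zero-copy mode,
 * otherwise fall back to an ordinary kernel page.
 */
static struct page *example_rx_alloc_page(struct net_device *dev)
{
	if (dev_in_zerocopy_mode(dev)) {
		struct page *page = external_page_alloc(dev);

		if (page)
			return page;
		/* no guest buffer available right now: fall back */
	}
	return alloc_page(GFP_ATOMIC);
}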

> 2) Add napi_gro_frags() in driver to receive the user pages instead of driver's receiving 
> function.
>
> 3) napi_gro_frags() will allocate small skb and pull the header data from 
> the first page to skb->data.
> 
> Is above the way what you have suggested?

Yes.

> I have thought something in detail about the way.
> 
> 1) The first page will have an offset after the header is copied into allocated kernel skb. 
> The offset should be recalculated when the user page data is transferred to guest. This 
> may modify some of the gro code.

We could keep track of whether the stack has modified the header,
since you can simply ignore this case if it doesn't modify it, which
should be the common case for virt.

> 2) napi_gro_frags() may remove a page when it's data is totally be pulled, but we cannot 
> put a user page as normally. This may modify the gro code too.

If it does anything like that, then we're not in the fast-path
case so you can just fall back to copying.

> 3) When the user buffer returned to guest, some of them need to be appended a vnet header.
> That means for some pages, the vnet header room should be reserved when allocated.
> But we cannot know which one will be used as the first page when allocated. If we reserved vnet header for each page, since the set_skb_frag() in guest driver only use the offset 0 for second pages, then page data will be wrong.

I don't see why this would be a problem, since as far as what
the driver is putting onto the physical RX ring goes, nothing has
changed.  IOW if you want to designate a certain page as special,
or as the first page, you can still do so.

So can you explain which bits of your patches would be affected
by this?

> 4) Since the user buffer pages should be released, so we still need a dtor callback to do that, and then I still need a place to hold it. How do you think about to put it in skb_shinfo?

While I don't like that very much I guess I can live with that
if nobody else objects.

Thanks,
Herbert Xu June 17, 2010, 11:21 a.m. UTC | #13
On Sun, Jun 13, 2010 at 04:58:36PM +0800, Xin, Xiaohui wrote:
> 
> Herbert,
> In this way, I think we should create 3 functions at least in drivers to allocate rx buffer, to receive the rx buffers, and to clean the rx buffers.
> 
> We can also have another way here. We can provide a function to only substitute 
> alloc_page(), and a function to release the pages when cleaning the rx buffers.

Yes that's exactly what I had in mind.

Cheers,
Xin, Xiaohui June 18, 2010, 5:26 a.m. UTC | #14
>-----Original Message-----
>From: Herbert Xu [mailto:herbert@gondor.apana.org.au]
>Sent: Thursday, June 17, 2010 7:21 PM
>To: Xin, Xiaohui
>Cc: Stephen Hemminger; netdev@vger.kernel.org; kvm@vger.kernel.org;
>linux-kernel@vger.kernel.org; mst@redhat.com; mingo@elte.hu; davem@davemloft.net;
>jdike@linux.intel.com
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>On Sun, Jun 13, 2010 at 04:58:36PM +0800, Xin, Xiaohui wrote:
>>
>> Herbert,
>> In this way, I think we should create 3 functions at least in drivers to allocate rx buffer, to
>receive the rx buffers, and to clean the rx buffers.
>>
>> We can also have another way here. We can provide a function to only substitute
>> alloc_page(), and a function to release the pages when cleaning the rx buffers.
>
>Yes that's exactly what I had in mind.
>
Herbert,
I have questions about the idea above:
1) Since netdev_alloc_skb() is still there, and we only modify alloc_page(),
then we don't need napi_gro_frags() any more; the driver's original receive
function is OK. Right?

2) Is napi_gro_frags() only suitable for TCP protocol packets?
I have done a small test with the ixgbe driver, letting it only allocate paged buffers,
and found that the kernel hangs when napi_gro_frags() receives an ARP packet.

3) As I have mentioned above, with this idea netdev_alloc_skb() will allocate
as usual, and the data pointed to by skb->data will be copied into the first guest buffer.
That means we should reserve sufficient room in the guest buffer. For a PS mode
capable driver (for example ixgbe), the room will be more than 128 bytes. After 128 bytes,
we will put the first frag data. Looking into virtio-net.c, the functions page_to_skb()
and receive_mergeable() show that we should modify the guest virtio-net driver to
compute the offset as the parameter for skb_set_frag().

What do you think about this? Attached is a patch showing how to modify the guest driver.
I reserve 512 bytes as an example, and transfer the header length of the skb in hdr->hdr_len.

Thanks
Xiaohui
>Cheers,
>--
>Visit Openswan at http://www.openswan.org/
>Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
>Home Page: http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Herbert Xu June 18, 2010, 5:59 a.m. UTC | #15
On Fri, Jun 18, 2010 at 01:26:49PM +0800, Xin, Xiaohui wrote:
>
> Herbert,
> I have questions about the idea above:
> 1) Since netdev_alloc_skb() is still there, and we only modify alloc_page(), 
> then we don't need napi_gro_frags() any more, the driver's original receiving 
> function is ok. Right?

Well I was actually thinking about converting all drivers that
need this to napi_gro_frags.  But now that you mention it, yes
we could still keep the old interface to minimise the work.

> 2) Is napi_gro_frags() only suitable for TCP protocol packet? 
> I have done a small test for ixgbe driver to let it only allocate paged buffers 
> and found kernel hangs when napi_gro_frags() receives an ARP packet.

It should work with any packet.  In fact, I'm pretty sure the
other drivers (e.g., cxgb3) use that interface for all packets.

> 3) As I have mentioned above, with this idea, netdev_alloc_skb() will allocate 
> as usual, the data pointed by skb->data will be copied into the first guest buffer. 
> That means we should reserve sufficient room in guest buffer. For PS mode 
> supported driver (for example ixgbe), the room will be more than 128. After 128bytes, 
> we will put the first frag data. Look into virtio-net.c the function page_to_skb()  
> and receive_mergeable(), that means we should modify guest virtio-net driver to 
> compute the offset as the parameter for skb_set_frag().
> 
> How do you think about this? Attached is a patch to how to modify the guest driver.
> I reserve 512 bytes as an example, and transfer the header len of the skb in hdr->hdr_len.

Expanding the buffer size to 512 bytes to accommodate PS mode
looks reasonable to me.

However, I don't think we should increase the copy threshold to
512 bytes at the same time.  I don't have any figures myself but
I think if we are to make such a change it should be a separate
one and come with supporting numbers.

Cheers,
Xin, Xiaohui June 18, 2010, 7:14 a.m. UTC | #16
>-----Original Message-----
>From: Herbert Xu [mailto:herbert@gondor.apana.org.au]
>Sent: Friday, June 18, 2010 1:59 PM
>To: Xin, Xiaohui
>Cc: Stephen Hemminger; netdev@vger.kernel.org; kvm@vger.kernel.org;
>linux-kernel@vger.kernel.org; mst@redhat.com; mingo@elte.hu; davem@davemloft.net;
>jdike@linux.intel.com; Rusty Russell
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>On Fri, Jun 18, 2010 at 01:26:49PM +0800, Xin, Xiaohui wrote:
>>
>> Herbert,
>> I have questions about the idea above:
>> 1) Since netdev_alloc_skb() is still there, and we only modify alloc_page(),
>> then we don't need napi_gro_frags() any more, the driver's original receiving
>> function is ok. Right?
>
>Well I was actually thinking about converting all drivers that
>need this to napi_gro_frags.  But now that you mention it, yes
>we could still keep the old interface to minimise the work.
>
>> 2) Is napi_gro_frags() only suitable for TCP protocol packet?
>> I have done a small test for ixgbe driver to let it only allocate paged buffers
>> and found kernel hangs when napi_gro_frags() receives an ARP packet.
>
>It should work with any packet.  In fact, I'm pretty sure the
>other drivers (e.g., cxgb3) use that interface for all packets.
>
Thanks for the verification. By the way, does that mean that nearly all drivers can use the
same napi_gro_frags() to receive buffers, even though currently each driver has its own receive
function?

>> 3) As I have mentioned above, with this idea, netdev_alloc_skb() will allocate
>> as usual, the data pointed by skb->data will be copied into the first guest buffer.
>> That means we should reserve sufficient room in guest buffer. For PS mode
>> supported driver (for example ixgbe), the room will be more than 128. After 128bytes,
>> we will put the first frag data. Look into virtio-net.c the function page_to_skb()
>> and receive_mergeable(), that means we should modify guest virtio-net driver to
>> compute the offset as the parameter for skb_set_frag().
>>
>> How do you think about this? Attached is a patch to how to modify the guest driver.
>> I reserve 512 bytes as an example, and transfer the header len of the skb in hdr->hdr_len.
>
>Expanding the buffer size to 512 bytes to accomodate PS mode
>looks reasonable to me.
>
>However, I don't think we should increase the copy threshold to
>512 bytes at the same time.  I don't have any figures myself but
>I think if we are to make such a change it should be a separate
>one and come with supporting numbers.
>
Let me have a look to see if I can retain the copy threshold as 128 bytes 
and copy the header data safely.

>Cheers,
>--
>Visit Openswan at http://www.openswan.org/
>Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
>Home Page: http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Herbert Xu June 18, 2010, 7:45 a.m. UTC | #17
On Fri, Jun 18, 2010 at 03:14:18PM +0800, Xin, Xiaohui wrote:
>
> Thanks for the verification. By the way, does that mean that nearly all drivers can use the 
> same napi_gro_frags() to receive buffers though currently each driver has it's own receiving 
> function?

There is no reason why the napi_gro_frags can't be used by any
driver that supports receiving data into pages.

Cheers,
Michael S. Tsirkin June 20, 2010, 10:06 a.m. UTC | #18
On Fri, Jun 18, 2010 at 03:14:18PM +0800, Xin, Xiaohui wrote:
> >-----Original Message-----
> >From: Herbert Xu [mailto:herbert@gondor.apana.org.au]
> >Sent: Friday, June 18, 2010 1:59 PM
> >To: Xin, Xiaohui
> >Cc: Stephen Hemminger; netdev@vger.kernel.org; kvm@vger.kernel.org;
> >linux-kernel@vger.kernel.org; mst@redhat.com; mingo@elte.hu; davem@davemloft.net;
> >jdike@linux.intel.com; Rusty Russell
> >Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
> >
> >On Fri, Jun 18, 2010 at 01:26:49PM +0800, Xin, Xiaohui wrote:
> >>
> >> Herbert,
> >> I have questions about the idea above:
> >> 1) Since netdev_alloc_skb() is still there, and we only modify alloc_page(),
> >> then we don't need napi_gro_frags() any more, the driver's original receiving
> >> function is ok. Right?
> >
> >Well I was actually thinking about converting all drivers that
> >need this to napi_gro_frags.  But now that you mention it, yes
> >we could still keep the old interface to minimise the work.
> >
> >> 2) Is napi_gro_frags() only suitable for TCP protocol packet?
> >> I have done a small test for ixgbe driver to let it only allocate paged buffers
> >> and found kernel hangs when napi_gro_frags() receives an ARP packet.
> >
> >It should work with any packet.  In fact, I'm pretty sure the
> >other drivers (e.g., cxgb3) use that interface for all packets.
> >
> Thanks for the verification. By the way, does that mean that nearly all drivers can use the 
> same napi_gro_frags() to receive buffers though currently each driver has it's own receiving 
> function?
> 
> >> 3) As I have mentioned above, with this idea, netdev_alloc_skb() will allocate
> >> as usual, the data pointed by skb->data will be copied into the first guest buffer.
> >> That means we should reserve sufficient room in guest buffer. For PS mode
> >> supported driver (for example ixgbe), the room will be more than 128. After 128bytes,
> >> we will put the first frag data. Look into virtio-net.c the function page_to_skb()
> >> and receive_mergeable(), that means we should modify guest virtio-net driver to
> >> compute the offset as the parameter for skb_set_frag().
> >>
> >> How do you think about this? Attached is a patch to how to modify the guest driver.
> >> I reserve 512 bytes as an example, and transfer the header len of the skb in hdr->hdr_len.
> >
> >Expanding the buffer size to 512 bytes to accomodate PS mode
> >looks reasonable to me.
> >
> >However, I don't think we should increase the copy threshold to
> >512 bytes at the same time.  I don't have any figures myself but
> >I think if we are to make such a change it should be a separate
> >one and come with supporting numbers.
> >
> Let me have a look to see if I can retain the copy threshold as 128 bytes 
> and copy the header data safely.

Changing the guest virtio to match the backend is a problem;
this breaks migration etc.


> >Cheers,
> >--
> >Visit Openswan at http://www.openswan.org/
> >Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> >Home Page: http://gondor.apana.org.au/~herbert/
> >PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Herbert Xu June 20, 2010, 10:32 a.m. UTC | #19
On Sun, Jun 20, 2010 at 01:06:32PM +0300, Michael S. Tsirkin wrote:
>
> Changing the guest virtio to match the backend is a problem,
> this breaks migration etc.

As long as it's done in a backwards compatible way it should be
fine.  It's just like migrating from a backend that supports TSO
to one that doesn't.

Cheers,
Michael S. Tsirkin June 20, 2010, 10:39 a.m. UTC | #20
On Sun, Jun 20, 2010 at 08:32:35PM +1000, Herbert Xu wrote:
> On Sun, Jun 20, 2010 at 01:06:32PM +0300, Michael S. Tsirkin wrote:
> >
> > Changing the guest virtio to match the backend is a problem,
> > this breaks migration etc.
> 
> As long as it's done in a backwards compatible way it should be
> fine.

Possibly, but to me the need to do this implies that
we'll need another change with different hardware at the backend.

> It's just like migrating from a backend that supports TSO
> to one that doesn't.
> 
> Cheers,

Exactly. We don't support such migration.

> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Herbert Xu June 20, 2010, 11:02 a.m. UTC | #21
On Sun, Jun 20, 2010 at 01:39:09PM +0300, Michael S. Tsirkin wrote:
>
> > It's just like migrating from a backend that supports TSO
> > to one that doesn't.
> 
> Exactly. We don't support such migration.

Well that's something that has to be addressed in the virtio_net.

Cheers,
Michael S. Tsirkin June 20, 2010, 11:11 a.m. UTC | #22
On Sun, Jun 20, 2010 at 09:02:54PM +1000, Herbert Xu wrote:
> On Sun, Jun 20, 2010 at 01:39:09PM +0300, Michael S. Tsirkin wrote:
> >
> > > It's just like migrating from a backend that supports TSO
> > > to one that doesn't.
> > 
> > Exactly. We don't support such migration.
> 
> Well that's something that has to be addressed in the virtio_net.

Rather than modifying all guests, it seems much easier not to assume a
specific buffer layout in the host.  Copying the network header around
seems a small cost.

> Cheers,
> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Herbert Xu June 20, 2010, 11:36 a.m. UTC | #23
On Sun, Jun 20, 2010 at 02:11:24PM +0300, Michael S. Tsirkin wrote:
>
> Rather than modifying all guests, it seems much easier not to assume
> specific buffer layout in host.  Copying network header around seems a
> small cost.

Well sure we can debate the specifics of this implementation detail.

However, the fact that virtio_net doesn't support feature renegotiation
on live migration is not a valid reason against this.

Cheers,
Michael S. Tsirkin June 20, 2010, 11:47 a.m. UTC | #24
On Sun, Jun 20, 2010 at 09:36:09PM +1000, Herbert Xu wrote:
> On Sun, Jun 20, 2010 at 02:11:24PM +0300, Michael S. Tsirkin wrote:
> >
> > Rather than modifying all guests, it seems much easier not to assume
> > specific buffer layout in host.  Copying network header around seems a
> > small cost.
> 
> Well sure we can debate the specifics of this implementation detail.

Let's do this then.  So far the virtio spec has avoided making layout
assumptions, leaving guests to lay out data as they see fit.
Isn't it possible to keep supporting this with zero copy for hardware
that can issue DMA at arbitrary addresses?

> However, the fact that virtio_net doesn't support feature renegotiation
> on live migration is not a valid reason against this.
> 
> Cheers,
Herbert Xu June 20, 2010, 11:59 a.m. UTC | #25
On Sun, Jun 20, 2010 at 02:47:19PM +0300, Michael S. Tsirkin wrote:
>
> Let's do this then.  So far the virtio spec avoided making layout
> assumptions, leaving guests lay out data as they see fit.
> Isn't it possible to keep supporting this with zero copy for hardware
> that can issue DMA at arbitrary addresses?

I think you're mistaken with respect to what is being proposed.
Raising 512 bytes isn't a hard constraint, it is merely an
optimisation for Intel NICs because their PS mode can produce
a head fragment of up to 512 bytes.

If the guest didn't allocate 512 bytes it wouldn't be the end of
the world, it'd just mean that we'd either copy whatever is in
the head fragment, or we waste 4096-X bytes of memory where X
is the number of bytes in the head.

Cheers,
Michael S. Tsirkin June 20, 2010, 12:48 p.m. UTC | #26
On Sun, Jun 20, 2010 at 09:59:26PM +1000, Herbert Xu wrote:
> On Sun, Jun 20, 2010 at 02:47:19PM +0300, Michael S. Tsirkin wrote:
> >
> > Let's do this then.  So far the virtio spec avoided making layout
> > assumptions, leaving guests lay out data as they see fit.
> > Isn't it possible to keep supporting this with zero copy for hardware
> > that can issue DMA at arbitrary addresses?
> 
> I think you're mistaken with respect to what is being proposed.
> Raising 512 bytes isn't a hard constraint, it is merely an
> optimisation for Intel NICs because their PS mode can produce
> a head fragment of up to 512 bytes.
> 
Thanks for the clarification. So what is being discussed here is
the API changes that will enable this optimization?
Of course, it makes sense to consider this to try and avoid code churn
in the future.

As a side note, I hope to see a basic zero copy implementation with
GSO/GRO that beats copy in host convincingly before work is started on
further optimizations, though.

> If the guest didn't allocate 512 bytes it wouldn't be the end of
> the world, it'd just mean that we'd either copy whatever is in
> the head fragment,
I don't know how much copying the head will cost.

> or we waste 4096-X bytes of memory where X
> is the number of bytes in the head.

This seems mostly harmless - and the guest can always do a copy internally
to save memory, correct?
Note also that we lock a full page to allow DMA anyway.

> Cheers,
> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Ben Hutchings June 20, 2010, 3:19 p.m. UTC | #27
On Sun, 2010-06-20 at 21:59 +1000, Herbert Xu wrote:
> On Sun, Jun 20, 2010 at 02:47:19PM +0300, Michael S. Tsirkin wrote:
> >
> > Let's do this then.  So far the virtio spec avoided making layout
> > assumptions, leaving guests lay out data as they see fit.
> > Isn't it possible to keep supporting this with zero copy for hardware
> > that can issue DMA at arbitrary addresses?
> 
> I think you're mistaken with respect to what is being proposed.
> Raising 512 bytes isn't a hard constraint, it is merely an
> optimisation for Intel NICs because their PS mode can produce
> a head fragment of up to 512 bytes.
> 
> If the guest didn't allocate 512 bytes it wouldn't be the end of
> the world, it'd just mean that we'd either copy whatever is in
> the head fragment, or we waste 4096-X bytes of memory where X
> is the number of bytes in the head.

If I understand correctly what this 'PS mode' is (I haven't seen the
documentation for it), it is a feature that Microsoft requested from
hardware vendors for use in Hyper-V.  As a result, the SFC9000 family
and presumably other controllers also implement something similar.

Ben.
Dong, Eddie June 23, 2010, 8:09 a.m. UTC | #28
> 3) As I have mentioned above, with this idea, netdev_alloc_skb() will
> allocate 
> as usual, the data pointed by skb->data will be copied into the first
> guest buffer. 
> That means we should reserve sufficient room in guest buffer. For PS
> mode 
> supported driver (for example ixgbe), the room will be more than 128.
> After 128bytes, 
> we will put the first frag data. Look into virtio-net.c the function
> page_to_skb() 
> and receive_mergeable(), that means we should modify guest virtio-net
> driver to 
> compute the offset as the parameter for skb_set_frag().
> 
> How do you think about this? Attached is a patch to how to modify the
> guest driver. 
> I reserve 512 bytes as an example, and transfer the header len of the
> skb in hdr->hdr_len. 
> 
Xiaohui & Herbert:
	Mixing copy of the head & zero-copy of the bulk data imposes an additional challenge
in finding the guest buffer. The backend driver may be unable to find a spare guest buffer
in the virtqueue at that time, which may then block the receiving process.
	Can't we completely eliminate netdev_alloc_skb here? Assigning the guest buffer at
this time makes life much easier.
Thx, Eddie
Herbert Xu June 23, 2010, 9:52 a.m. UTC | #29
On Wed, Jun 23, 2010 at 04:09:40PM +0800, Dong, Eddie wrote:
>
> Xiaohui & Herbert:
> 	Mixing copy of head & 0-copy of bulk data imposes additional challange to find the guest buffer. The backend driver may be unable to find a spare guest buffer from virtqueue at that time which may block the receiving process then.
> 	Can't we completely eliminate netdev_alloc_skb here? Assigning guest buffer at this time makes life much easier.

I'm not sure I understand your concern.  If you mean that, when
the guest doesn't give enough pages to the host, the host
can't receive on behalf of the guest, then isn't that already
the case with the original patch-set?

Cheers,
Dong, Eddie June 23, 2010, 10:05 a.m. UTC | #30
Herbert Xu wrote:
> On Wed, Jun 23, 2010 at 04:09:40PM +0800, Dong, Eddie wrote:
>> 
>> Xiaohui & Herbert:
>> 	Mixing copy of head & 0-copy of bulk data imposes additional
>> 	challange to find the guest buffer. The backend driver may be
>> unable to find a spare guest buffer from virtqueue at that time
>> which may block the receiving process then. Can't we completely
>> eliminate netdev_alloc_skb here? Assigning guest buffer at this time
>> makes life much easier.    
> 
> I'm not sure I understand you concern.  If you mean that when
> the guest doesn't give enough pages to the host and the host
> can't receive on behalf of the guest then isn't that already
> the case with the original patch-set?
> 

I mean that once the frontend side driver posts the buffers to the backend driver, the backend
driver will "immediately" use those buffers to compose skbs or gro_frags and post them to the
assigned host NIC driver as receive buffers. In that case, if the backend driver receives a packet
from the NIC that requires a copy, it may be unable to find an additional free guest buffer,
because all of them are already used by the NIC driver. We have to reserve some guest buffers for
the possible copy even if the buffer address is not identified by the original skb :(

Thx, Eddie
Herbert Xu June 24, 2010, 10:08 a.m. UTC | #31
On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote:
> 
> I mean once the frontend side driver post the buffers to the backend driver, the backend driver will "immediately" use that buffers to compose skb or gro_frags and post them to the assigned host NIC driver as receive buffers. In that case, if the backend driver recieves a packet from the NIC that requires to do copy, it may be unable to find additional free guest buffer because all of them are already used by the NIC driver. We have to reserve some guest buffers for the possible copy even if the buffer address is not identified by original skb :(

OK I see what you mean.  Can you tell me how Xiaohui's
previous patch-set deals with this problem?

Thanks,
Dong, Eddie June 25, 2010, 1:03 a.m. UTC | #32
Herbert Xu wrote:
> On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote:
>> 
>> I mean once the frontend side driver post the buffers to the backend
>> driver, the backend driver will "immediately" use that buffers to
>> compose skb or gro_frags and post them to the assigned host NIC
>> driver as receive buffers. In that case, if the backend driver
>> recieves a packet from the NIC that requires to do copy, it may be
>> unable to find additional free guest buffer because all of them are
>> already used by the NIC driver. We have to reserve some guest
>> buffers for the possible copy even if the buffer address is not
>> identified by original skb :(        
> 
> OK I see what you mean.  Can you tell me how does Xiaohui's
> previous patch-set deal with this problem?
> 
> Thanks,

In the current patch, each SKB for the assigned device (SRIOV VF or NIC or a complete queue pair)
uses buffers from the guest, so it eliminates the copy completely in software and requires the
hardware to do so. If we can have an additional place to store a buffer per skb (which may cause a
copy later on), we can do the copy later on or re-post the buffer to the assigned NIC driver later
on. But that may not be very clean either :(

BTW, some hardware may require a certain level of packet copying, such as for broadcast packets on
very old VMDq devices, which is not addressed in Xiaohui's previous patches yet. We may address
this by implementing an additional virtqueue between guest and host for the slow path (broadcast
packets only here), with additional complexity in the FE/BE drivers.

Thx, Eddie
Xin, Xiaohui June 25, 2010, 2:07 a.m. UTC | #33
Herbert,
That's why I have sent you the patch for the guest virtio-net driver. I reserved 512 bytes in each
page, so I can always have the space to copy into and avoid the issue of the backend memory being
used up.

Thanks
Xiaohui

>-----Original Message-----
>From: Herbert Xu [mailto:herbert@gondor.apana.org.au]
>Sent: Thursday, June 24, 2010 6:09 PM
>To: Dong, Eddie
>Cc: Xin, Xiaohui; Stephen Hemminger; netdev@vger.kernel.org; kvm@vger.kernel.org;
>linux-kernel@vger.kernel.org; mst@redhat.com; mingo@elte.hu; davem@davemloft.net;
>jdike@linux.intel.com
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote:
>>
>> I mean once the frontend side driver post the buffers to the backend driver, the backend
>driver will "immediately" use that buffers to compose skb or gro_frags and post them to the
>assigned host NIC driver as receive buffers. In that case, if the backend driver recieves a
>packet from the NIC that requires to do copy, it may be unable to find additional free guest
>buffer because all of them are already used by the NIC driver. We have to reserve some guest
>buffers for the possible copy even if the buffer address is not identified by original skb :(
>
>OK I see what you mean.  Can you tell me how does Xiaohui's
>previous patch-set deal with this problem?
>
>Thanks,
>--
>Visit Openswan at http://www.openswan.org/
>Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
>Home Page: http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Michael S. Tsirkin June 25, 2010, 11:06 a.m. UTC | #34
On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote:
> Herbert Xu wrote:
> > On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote:
> >> 
> >> I mean once the frontend side driver post the buffers to the backend
> >> driver, the backend driver will "immediately" use that buffers to
> >> compose skb or gro_frags and post them to the assigned host NIC
> >> driver as receive buffers. In that case, if the backend driver
> >> recieves a packet from the NIC that requires to do copy, it may be
> >> unable to find additional free guest buffer because all of them are
> >> already used by the NIC driver. We have to reserve some guest
> >> buffers for the possible copy even if the buffer address is not
> >> identified by original skb :(        
> > 
> > OK I see what you mean.  Can you tell me how does Xiaohui's
> > previous patch-set deal with this problem?
> > 
> > Thanks,
> 
> In current patch, each SKB for the assigned device (SRIOV VF or NIC or a complete queue pairs) uses the buffer from guest, so it eliminates copy completely in software and requires hardware to do so. If we can have an additonal place to store the buffer per skb (may cause copy later on), we can do copy later on or re-post the buffer to assigned NIC driver later on. But that may be not very clean either :(
> BTW, some hardware may require certain level of packet copy such as for broadcast packets in very old VMDq device, which is not addressed in previous Xiaohui's patch yet. We may address this by implementing an additional virtqueue between guest and host for slow path (broadcast packets only here) with additinal complexity in FE/BE driver. 
> 
> Thx, Eddie

The guest posts a large number of buffers to the host.
The host can use them any way it wants to, and in any order;
for example, it can reserve half the buffers for the copy.

This might waste some memory if buffers are used
only partially, but let's worry about this later.
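
Roughly, that split could be kept with something like the sketch below; the types and names are invented for illustration and are not part of the patch set:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/types.h>
#include <linux/mm.h>

/* Hypothetical bookkeeping: buffers the guest posts are parked on two
 * lists, one handed to the NIC for zero-copy receive, one held back so
 * a copy path can never be starved by a slow guest.
 * (List/lock initialization is omitted for brevity.) */
struct guest_buf {
	struct list_head list;
	struct page *page;
	unsigned int offset;
	unsigned int len;
};

struct guest_buf_pool {
	struct list_head zerocopy;	/* given to the NIC as rx buffers */
	struct list_head copy_reserve;	/* kept back for copies */
	bool to_reserve;		/* alternate the destination */
	spinlock_t lock;
};

static void guest_buf_pool_add(struct guest_buf_pool *p, struct guest_buf *b)
{
	spin_lock(&p->lock);
	/* alternate so roughly half the posted buffers stay reserved */
	list_add_tail(&b->list,
		      p->to_reserve ? &p->copy_reserve : &p->zerocopy);
	p->to_reserve = !p->to_reserve;
	spin_unlock(&p->lock);
}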
Herbert Xu June 27, 2010, 6:14 a.m. UTC | #35
On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote:
> 
> In current patch, each SKB for the assigned device (SRIOV VF or NIC or a complete queue pairs) uses the buffer from guest, so it eliminates copy completely in software and requires hardware to do so. If we can have an additonal place to store the buffer per skb (may cause copy later on), we can do copy later on or re-post the buffer to assigned NIC driver later on. But that may be not very clean either :(

OK, if I understand you correctly then I don't think we have a
problem.  With your current patch-set you have exactly the same
situation when skb->data is reallocated as a kernel buffer.

This is OK because as you correctly argue, it is a rare situation.

With my proposal you will need to get this extra external buffer
in even fewer cases, because you'd only need to do it if the skb
head grows, which only happens if the packet becomes encapsulated.

So let me explain it in a bit more detail:

Our packet starts out as a purely non-linear skb, i.e., skb->head
contains nothing and all the page frags come from the guest.

During host processing we may pull data into skb->head but the
first frag will remain unless we pull all of it.  If we did do
that then you would have a free external buffer anyway.

Now in the common case the header may be modified or pulled, but
it very rarely grows.  So you can just copy the header back into
the first frag just before we give it to the guest.

Only in the case where the packet header grows (e.g., encapsulation)
would you need to get an extra external buffer.

Cheers,
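
For reference, a minimal sketch of that starting point: an skb with an empty head whose payload sits entirely in externally supplied page frags. The function is illustrative only and not from the patch set; the caller is assumed to hold page references and to pass at most MAX_SKB_FRAGS pages.

#include <linux/skbuff.h>
#include <linux/netdevice.h>

/* Sketch only: build a purely non-linear skb, i.e. no packet data in
 * skb->head and every byte sitting in guest-supplied page fragments. */
static struct sk_buff *build_external_skb(struct net_device *dev,
					  struct page **pages, int npages,
					  unsigned int frag_len)
{
	struct sk_buff *skb;
	int i;

	/* small linear area for later header pulls; holds no data yet */
	skb = netdev_alloc_skb(dev, 128);
	if (!skb)
		return NULL;

	for (i = 0; i < npages; i++) {
		skb_fill_page_desc(skb, i, pages[i], 0, frag_len);
		skb->len += frag_len;
		skb->data_len += frag_len;
		skb->truesize += frag_len;
	}

	return skb;
}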
Xin, Xiaohui June 28, 2010, 9:56 a.m. UTC | #36
>-----Original Message-----
>From: Herbert Xu [mailto:herbert@gondor.apana.org.au]
>Sent: Sunday, June 27, 2010 2:15 PM
>To: Dong, Eddie
>Cc: Xin, Xiaohui; Stephen Hemminger; netdev@vger.kernel.org; kvm@vger.kernel.org;
>linux-kernel@vger.kernel.org; mst@redhat.com; mingo@elte.hu; davem@davemloft.net;
>jdike@linux.intel.com
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote:
>>
>> In current patch, each SKB for the assigned device (SRIOV VF or NIC or a complete
>queue pairs) uses the buffer from guest, so it eliminates copy completely in software and
>requires hardware to do so. If we can have an additonal place to store the buffer per skb (may
>cause copy later on), we can do copy later on or re-post the buffer to assigned NIC driver
>later on. But that may be not very clean either :(
>
>OK, if I understand you correctly then I don't think have a
>problem.  With your current patch-set you have exactly the same
>situation when the skb->data is reallocated as a kernel buffer.
>

When will skb->data be reallocated? Could you point me to the code path?

>This is OK because as you correctly argue, it is a rare situation.
>
>With my proposal you will need to get this extra external buffer
>in even less cases, because you'd only need to do it if the skb
>head grows, which only happens if it becomes encapsulated.
>So let me explain it in a bit more detail:
>
>Our packet starts out as a purely non-linear skb, i.e., skb->head
>contains nothing and all the page frags come from the guest.
>
>During host processing we may pull data into skb->head but the
>first frag will remain unless we pull all of it.  If we did do
>that then you would have a free external buffer anyway.
>
>Now in the common case the header may be modified or pulled, but
>it very rarely grows.  So you can just copy the header back into
>the first frag just before we give it to the guest.
>
Since the data is still there, recomputing the page offset and size is OK, right?

>Only in the case where the packet header grows (e.g., encapsulation)
>would you need to get an extra external buffer.
>
>Cheers,
>--
>Email: Herbert Xu <herbert@gondor.apana.org.au>
>Home Page: http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Michael S. Tsirkin June 28, 2010, 10 a.m. UTC | #37
On Mon, Jun 28, 2010 at 05:56:07PM +0800, Xin, Xiaohui wrote:
> >-----Original Message-----
> >From: Herbert Xu [mailto:herbert@gondor.apana.org.au]
> >Sent: Sunday, June 27, 2010 2:15 PM
> >To: Dong, Eddie
> >Cc: Xin, Xiaohui; Stephen Hemminger; netdev@vger.kernel.org; kvm@vger.kernel.org;
> >linux-kernel@vger.kernel.org; mst@redhat.com; mingo@elte.hu; davem@davemloft.net;
> >jdike@linux.intel.com
> >Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
> >
> >On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote:
> >>
> >> In current patch, each SKB for the assigned device (SRIOV VF or NIC or a complete
> >queue pairs) uses the buffer from guest, so it eliminates copy completely in software and
> >requires hardware to do so. If we can have an additonal place to store the buffer per skb (may
> >cause copy later on), we can do copy later on or re-post the buffer to assigned NIC driver
> >later on. But that may be not very clean either :(
> >
> >OK, if I understand you correctly then I don't think have a
> >problem.  With your current patch-set you have exactly the same
> >situation when the skb->data is reallocated as a kernel buffer.
> >
> 
> When will skb->data to be reallocated? May you point me the code path?
> 
> >This is OK because as you correctly argue, it is a rare situation.
> >
> >With my proposal you will need to get this extra external buffer
> >in even less cases, because you'd only need to do it if the skb
> >head grows, which only happens if it becomes encapsulated.
> >So let me explain it in a bit more detail:
> >
> >Our packet starts out as a purely non-linear skb, i.e., skb->head
> >contains nothing and all the page frags come from the guest.
> >
> >During host processing we may pull data into skb->head but the
> >first frag will remain unless we pull all of it.  If we did do
> >that then you would have a free external buffer anyway.
> >
> >Now in the common case the header may be modified or pulled, but
> >it very rarely grows.  So you can just copy the header back into
> >the first frag just before we give it to the guest.
> >
> Since the data is still there, so recompute the page offset and size is ok, right?

Question: can devices use parts of the same page
in frags of different skbs (or for other purposes)? If they do,
we'll corrupt that memory if we try to stick the header there.

We have another option: reserve some of the buffers
posted by the guest and use them if we need to copy
the header. This seems the most straightforward to me.

> >Only in the case where the packet header grows (e.g., encapsulation)
> >would you need to get an extra external buffer.
> >
> >Cheers,
> >--
> >Email: Herbert Xu <herbert@gondor.apana.org.au>
> >Home Page: http://gondor.apana.org.au/~herbert/
> >PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Herbert Xu July 3, 2010, 9:12 a.m. UTC | #38
On Mon, Jun 28, 2010 at 05:56:07PM +0800, Xin, Xiaohui wrote:
>
> >OK, if I understand you correctly then I don't think have a
> >problem.  With your current patch-set you have exactly the same
> >situation when the skb->data is reallocated as a kernel buffer.
> 
> When will skb->data to be reallocated? May you point me the code path?

Anything that calls pskb_expand_head.
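
For illustration, one common trigger looks roughly like the sketch below; the function itself is made up, only skb_cow_head()/pskb_expand_head() are real kernel APIs:

#include <linux/skbuff.h>
#include <linux/string.h>
#include <linux/errno.h>

/* Sketch: pushing an encapsulation header onto an skb with too little
 * headroom ends up in pskb_expand_head(), which replaces skb->head
 * (and hence skb->data) with a newly allocated kernel buffer. */
static int push_encap_header(struct sk_buff *skb, const void *hdr,
			     unsigned int hlen)
{
	/* skb_cow_head() calls pskb_expand_head() when the headroom is
	 * too small or the header is shared with a clone */
	if (skb_cow_head(skb, hlen))
		return -ENOMEM;

	memcpy(skb_push(skb, hlen), hdr, hlen);
	return 0;
}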

> >This is OK because as you correctly argue, it is a rare situation.
> >
> >With my proposal you will need to get this extra external buffer
> >in even less cases, because you'd only need to do it if the skb
> >head grows, which only happens if it becomes encapsulated.
> >So let me explain it in a bit more detail:
> >
> >Our packet starts out as a purely non-linear skb, i.e., skb->head
> >contains nothing and all the page frags come from the guest.
> >
> >During host processing we may pull data into skb->head but the
> >first frag will remain unless we pull all of it.  If we did do
> >that then you would have a free external buffer anyway.
> >
> >Now in the common case the header may be modified or pulled, but
> >it very rarely grows.  So you can just copy the header back into
> >the first frag just before we give it to the guest.
> >
> Since the data is still there, so recompute the page offset and size is ok, right?

Right, you just move the page offset back and increase the size.
However, to do this safely we'd need to have a way of knowing
whether the skb head has been modified.

It may well turn out to be just as effective to do something like

	if (memcmp(skb->data, page frag head, skb_headlen))
		memcpy(page frag head, skb->data, skb_headlen)

As the page frag head should be in cache since it would've been
used to populate skb->data.

It'd be good to run some benchmarks with this to see whether
adding a bit to sk_buff to avoid the memcmp is worth it or not.

Cheers,
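
A minimal sketch of that check-and-copy, assuming the whole pulled header came from the front of frag 0 and the page is in lowmem (a real path would also need the kmap, error handling, and any extra bookkeeping this leaves out):

#include <linux/skbuff.h>
#include <linux/string.h>
#include <linux/mm.h>

/* Sketch only: write the (possibly modified) header in skb->data back
 * into the guest page it was pulled from, but only if it changed. */
static void writeback_header(struct sk_buff *skb)
{
	skb_frag_t *frag = &skb_shinfo(skb)->frags[0];
	unsigned int hlen = skb_headlen(skb);
	u8 *fhead;

	/* the spot in the guest page the header bytes were pulled from;
	 * assumes all hlen bytes came from this frag */
	fhead = page_address(frag->page) + frag->page_offset - hlen;

	if (memcmp(skb->data, fhead, hlen))
		memcpy(fhead, skb->data, hlen);

	/* move the page offset back and grow the frag so the guest sees
	 * the header in the page again; the linear copy in skb->data is
	 * not handed to the guest, so its accounting is left untouched */
	frag->page_offset -= hlen;
	frag->size += hlen;
}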
diff mbox

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 124f90c..cf309c9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,18 @@  struct skb_shared_info {
 	void *		destructor_arg;
 };
 
+/* The structure is for an skb whose skb->data may point to
+ * an external buffer that is not allocated from kernel space.
+ * Since the buffer is external, the shinfo and frags are
+ * external too. It also contains a destructor for itself.
+ */
+struct skb_external_page {
+	u8              *start;
+	int             size;
+	struct skb_frag_struct *frags;
+	struct skb_shared_info *ushinfo;
+	void		(*dtor)(struct skb_external_page *);
+};
 /* We divide dataref into two halves.  The higher 16 bits hold references
  * to the payload part of skb->data.  The lower 16 bits hold references to
  * the entire skb->data.  A clone of a headerless skb holds the length of