diff mbox

[RFC] net: add dataref destructor to sk_buff

Message ID 20091002141407.30224.54207.stgit@dev.haskins.net
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Gregory Haskins Oct. 2, 2009, 2:20 p.m. UTC
(Applies to davem/net-2.6.git:4fdb78d30)

Hi David, netdevs,

The following is an RFC for an attempt at addressing a zero-copy solution.

To be perfectly honest, I have no idea if this is the best solution, or if
there is truly a problem with skb->destructor that requires an alternate
mechanism.  What I do know is that this patch seems to work, and I would
like to see some kind of solution available upstream.  So I thought I would
send my hack out as at least a point of discussion.  FWIW: This has been
tested heavily in my rig and is technically suitable for inclusion after
review as is, if that is decided to be the optimal path forward here.

Thanks for your review and consideration,

Kind regards,
-Greg

----------------------------------------
From: Gregory Haskins <ghaskins@novell.com>
Subject: [RFC PATCH] net: add dataref destructor to sk_buff

What: The skb->destructor field is reportedly unreliable for ensuring
that all shinfo users have dropped their references.  Therefore, we add
a distinct ->release() method for the shinfo structure which is closely
tied to the underlying page resources we want to protect.

Why: We want to add zero-copy transmit support for AlacrityVM guests.
In order to support this, the host kernel must map guest pages directly
into a paged-skb and send it as normal.  put_page() alone is not
sufficient lifetime management since the pages are ultimately allocated
from within the guest.  Therefore, we need higher-level notification
when the skb is finally freed on the host so we can then inject a proper
"tx-complete" event into the guest context.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/skbuff.h |    2 ++
 net/core/skbuff.c      |    9 +++++++++
 2 files changed, 11 insertions(+), 0 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller Nov. 6, 2009, 5:08 a.m. UTC | #1
From: Gregory Haskins <ghaskins@novell.com>
Date: Fri, 02 Oct 2009 10:20:00 -0400

> The following is an RFC for an attempt at addressing a zero-copy solution.
> 
> To be perfectly honest, I have no idea if this is the best solution, or if
> there is truly a problem with skb->destructor that requires an alternate
> mechanism.  What I do know is that this patch seems to work, and I would
> like to see some kind of solution available upstream.  So I thought I would
> send my hack out as at least a point of discussion.  FWIW: This has been
> tested heavily in my rig and is technically suitable for inclusion after
> review as is, if that is decided to be the optimal path forward here.
> 
> Thanks for your review and consideration,

I have no fundamental objections, but when you submit this for
real can you show the code that uses it so we can get a better
idea about things?

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gregory Haskins Nov. 6, 2009, 4:08 p.m. UTC | #2
David Miller wrote:
> From: Gregory Haskins <ghaskins@novell.com>
> Date: Fri, 02 Oct 2009 10:20:00 -0400
> 
>> The following is an RFC for an attempt at addressing a zero-copy solution.
>>
>> To be perfectly honest, I have no idea if this is the best solution, or if
>> there is truly a problem with skb->destructor that requires an alternate
>> mechanism.  What I do know is that this patch seems to work, and I would
>> like to see some kind of solution available upstream.  So I thought I would
>> send my hack out as at least a point of discussion.  FWIW: This has been
>> tested heavily in my rig and is technically suitable for inclusion after
>> review as is, if that is decided to be the optimal path forward here.
>>
>> Thanks for your review and consideration,
> 
> I have no fundamental objections, but when you submit this for
> real can you show the code that uses it so we can get a better
> idea about things?
> 
> Thanks.

Absolutely.  I am not sure if this content would be appropriate for the
patch header, so I will just reply to your request here.  If you would
like to see the patch resubmitted with the following in the header, let
me know.

The way we use this today is in the venet driver as part of the
AlacrityVM hypervisor.  We therefore have a guest and host environment,
where the guest builds fully formed L2 frames, and the host generally
acts as a conduit for passing those frames to a real physical device
(such as through a soft-bridge, etc).  We would like to do this without
requiring copies for certain classes of packets (i.e. packets larger
than a threshold).  However we have to be smart about how we do this
since the guest technically "owns" the pages, and therefore needs
io-completion events to properly signify when pages are actually freed.

The way this all looks today is (I hope this doesn't get mangled):

----------------------------------------------------------------------
|              guest              |               host               |
----------------------------------------------------------------------
|     stack      |     venet      |     venetdev    |     phydev     |
----------------------------------------------------------------------
| alloc_skb()                                                        |
| dev_xmit()                                                         |
|                -> queue_tx(sg)                                     |
|                                 -> dequeue_rx(sg)                  |
|                                    alloc_pskb()                    |
|                                    map_sg(sg)->pskb                |
|                                    loop(get_page())                |
|                                    skb->release = cb               |
|                                                   -> dev_xmit()    |
|                                                                    |
|                                                     txc_isr() <-   |
|                                                     kfree_skb()    |
|                                                     skb->release() |
|                                    cb()           <-               |
|                                     queue_event(sg)                |
|                    txc_isr()    <-                                 |
|                    kfree_skb()                                     |
|                                                    loop(put_page())|
----------------------------------------------------------------------

And here is the actual code in action
(kernel/vbus/devices/venet/device.c) from the alacrityvm.git tree

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/devices/venet/device.c;h=d49ba7fa9f70cbb7e61c366d52d4c316d15f8b73;hb=HEAD

Line 587 is the "dequeue_rx()" operation from the diagram.  Line 627 is
where we map in the guests pages to a scatterlist.  Line 649 is where we
update the skb_shinfo->frags with the mapping.  And finally, Line 677 is
where I register a callback for when the skb is released.

Line 853 is the callback that the stack invokes when the phydev finally
frees the packet.  You can see that line 863 then sends an
transmit-complete event back up to the guest.

If this is not what you were looking for, please let me know.  If this
looks acceptable to you, please consider the original patch for
inclusion at the next convenient merge window.

Thanks David!
-Greg
Michael S. Tsirkin Nov. 10, 2009, 11:53 a.m. UTC | #3
On Fri, Oct 02, 2009 at 10:20:00AM -0400, Gregory Haskins wrote:
> (Applies to davem/net-2.6.git:4fdb78d30)
> 
> Hi David, netdevs,
> 
> The following is an RFC for an attempt at addressing a zero-copy solution.
> 
> To be perfectly honest, I have no idea if this is the best solution, or if
> there is truly a problem with skb->destructor that requires an alternate
> mechanism.  What I do know is that this patch seems to work, and I would
> like to see some kind of solution available upstream.  So I thought I would
> send my hack out as at least a point of discussion.  FWIW: This has been
> tested heavily in my rig and is technically suitable for inclusion after
> review as is, if that is decided to be the optimal path forward here.
> 
> Thanks for your review and consideration,
> 
> Kind regards,
> -Greg
> 
> ----------------------------------------
> From: Gregory Haskins <ghaskins@novell.com>
> Subject: [RFC PATCH] net: add dataref destructor to sk_buff
> 
> What: The skb->destructor field is reportedly unreliable for ensuring
> that all shinfo users have dropped their references.  Therefore, we add
> a distinct ->release() method for the shinfo structure which is closely
> tied to the underlying page resources we want to protect.
> 
> Why: We want to add zero-copy transmit support for AlacrityVM guests.
> In order to support this, the host kernel must map guest pages directly
> into a paged-skb and send it as normal.  put_page() alone is not
> sufficient lifetime management since the pages are ultimately allocated
> from within the guest.  Therefore, we need higher-level notification
> when the skb is finally freed on the host so we can then inject a proper
> "tx-complete" event into the guest context.
> 
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> ---
> 
>  include/linux/skbuff.h |    2 ++
>  net/core/skbuff.c      |    9 +++++++++
>  2 files changed, 11 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index df7b23a..02cdab6 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -207,6 +207,8 @@ struct skb_shared_info {
>  	/* Intermediate layers must ensure that destructor_arg
>  	 * remains valid until skb destructor */
>  	void *		destructor_arg;
> +	void *          priv;
> +	void           (*release)(struct sk_buff *skb);
>  };
>  
>  /* We divide dataref into two halves.  The higher 16 bits hold references
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 80a9616..a7e40a9 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -219,6 +219,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  	shinfo->tx_flags.flags = 0;
>  	skb_frag_list_init(skb);
>  	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
> +	shinfo->release = NULL;
> +	shinfo->priv = NULL;
>  
>  	if (fclone) {
>  		struct sk_buff *child = skb + 1;
> @@ -350,6 +352,9 @@ static void skb_release_data(struct sk_buff *skb)
>  		if (skb_has_frags(skb))
>  			skb_drop_fraglist(skb);
>  
> +		if (skb_shinfo(skb)->release)
> +			skb_shinfo(skb)->release(skb);
> +
>  		kfree(skb->head);
>  	}
>  }
> @@ -514,6 +519,8 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
>  	shinfo->tx_flags.flags = 0;
>  	skb_frag_list_init(skb);
>  	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
> +	shinfo->release = NULL;
> +	shinfo->priv = NULL;
>  
>  	memset(skb, 0, offsetof(struct sk_buff, tail));
>  	skb->data = skb->head + NET_SKB_PAD;
> @@ -856,6 +863,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  	skb->hdr_len  = 0;
>  	skb->nohdr    = 0;
>  	atomic_set(&skb_shinfo(skb)->dataref, 1);
> +	skb_shinfo(skb)->release = NULL;
> +	skb_shinfo(skb)->priv = NULL;
>  	return 0;
>  
>  nodata:

This is basically subset of the skb data destructors patch, isn't it?
Last time this was tried, this is the objection that was voiced:

	The problem with this patch is that it's tracking skb's, while
	you want use it to track pages for zero-copy.  That just doesn't
	work.  Through mechanisms like splice, individual pages in the
	skb can be detached and metastasize to other locations, e.g.,
	the VFS.

and I think this applies here. In other words, this only *seems*
to work for you because you are not trying to do things like
guest to host communication, with host doing smart things.

Cc Herbert which was involved in the original discussion.

In the specific case, it seems that things like pskb_copy,
skb_split and others will also be broken, won't they?
Gregory Haskins Nov. 10, 2009, 12:40 p.m. UTC | #4
>>> On 11/10/2009 at  6:53 AM, in message <20091110115335.GC6989@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com> wrote: 
> On Fri, Oct 02, 2009 at 10:20:00AM -0400, Gregory Haskins wrote:
>> (Applies to davem/net-2.6.git:4fdb78d30)
>> 
>> Hi David, netdevs,
>> 
>> The following is an RFC for an attempt at addressing a zero-copy solution.
>> 
>> To be perfectly honest, I have no idea if this is the best solution, or if
>> there is truly a problem with skb->destructor that requires an alternate
>> mechanism.  What I do know is that this patch seems to work, and I would
>> like to see some kind of solution available upstream.  So I thought I would
>> send my hack out as at least a point of discussion.  FWIW: This has been
>> tested heavily in my rig and is technically suitable for inclusion after
>> review as is, if that is decided to be the optimal path forward here.
>> 
>> Thanks for your review and consideration,
>> 
>> Kind regards,
>> -Greg
>> 
>> ----------------------------------------
>> From: Gregory Haskins <ghaskins@novell.com>
>> Subject: [RFC PATCH] net: add dataref destructor to sk_buff
>> 
>> What: The skb->destructor field is reportedly unreliable for ensuring
>> that all shinfo users have dropped their references.  Therefore, we add
>> a distinct ->release() method for the shinfo structure which is closely
>> tied to the underlying page resources we want to protect.
>> 
>> Why: We want to add zero-copy transmit support for AlacrityVM guests.
>> In order to support this, the host kernel must map guest pages directly
>> into a paged-skb and send it as normal.  put_page() alone is not
>> sufficient lifetime management since the pages are ultimately allocated
>> from within the guest.  Therefore, we need higher-level notification
>> when the skb is finally freed on the host so we can then inject a proper
>> "tx-complete" event into the guest context.
>> 
>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>> ---
>> 
>>  include/linux/skbuff.h |    2 ++
>>  net/core/skbuff.c      |    9 +++++++++
>>  2 files changed, 11 insertions(+), 0 deletions(-)
>> 
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index df7b23a..02cdab6 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -207,6 +207,8 @@ struct skb_shared_info {
>>  	/* Intermediate layers must ensure that destructor_arg
>>  	 * remains valid until skb destructor */
>>  	void *		destructor_arg;
>> +	void *          priv;
>> +	void           (*release)(struct sk_buff *skb);
>>  };
>>  
>>  /* We divide dataref into two halves.  The higher 16 bits hold references
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 80a9616..a7e40a9 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -219,6 +219,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
> gfp_mask,
>>  	shinfo->tx_flags.flags = 0;
>>  	skb_frag_list_init(skb);
>>  	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
>> +	shinfo->release = NULL;
>> +	shinfo->priv = NULL;
>>  
>>  	if (fclone) {
>>  		struct sk_buff *child = skb + 1;
>> @@ -350,6 +352,9 @@ static void skb_release_data(struct sk_buff *skb)
>>  		if (skb_has_frags(skb))
>>  			skb_drop_fraglist(skb);
>>  
>> +		if (skb_shinfo(skb)->release)
>> +			skb_shinfo(skb)->release(skb);
>> +
>>  		kfree(skb->head);
>>  	}
>>  }
>> @@ -514,6 +519,8 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
>>  	shinfo->tx_flags.flags = 0;
>>  	skb_frag_list_init(skb);
>>  	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
>> +	shinfo->release = NULL;
>> +	shinfo->priv = NULL;
>>  
>>  	memset(skb, 0, offsetof(struct sk_buff, tail));
>>  	skb->data = skb->head + NET_SKB_PAD;
>> @@ -856,6 +863,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int 
> ntail,
>>  	skb->hdr_len  = 0;
>>  	skb->nohdr    = 0;
>>  	atomic_set(&skb_shinfo(skb)->dataref, 1);
>> +	skb_shinfo(skb)->release = NULL;
>> +	skb_shinfo(skb)->priv = NULL;
>>  	return 0;
>>  
>>  nodata:
> 
> This is basically subset of the skb data destructors patch, isn't it?

Sort of, but the emphasis is different.  Here are the main differences:

1) skb->destructor() is on the skb level, shinfo->release() is at the shared-info/page level

2) skb->destructor is (iiuc) modified during the skb's lifetime (for instance rmem hooks here to
manage the buffer-space dynamically), whereas shinfo->release is designed to be used by
"the owner" and is thus only set at creation.

3) shinfo->release tracks the lifetime of the pages, not the skb.  This means it transcends
the skb's lifetime (such as splits for clone, etc) and focuses on the shared component: the pages

> Last time this was tried, this is the objection that was voiced:
> 
> 	The problem with this patch is that it's tracking skb's, while
> 	you want use it to track pages for zero-copy.  That just doesn't
> 	work.  Through mechanisms like splice, individual pages in the
> 	skb can be detached and metastasize to other locations, e.g.,
> 	the VFS.

Right, and I don't think this applies here because I specifically chose the shinfo level to try to properly
track the page level avoid this issue.  Multiple skb's can point to a single shinfo, iiuc.

> 
> and I think this applies here.

I don't think so, but if you think I missed something, do not be shy (not that you ever are).

> In other words, this only *seems*
> to work for you because you are not trying to do things like
> guest to host communication, with host doing smart things.

I am not following what you mean here, as I do use this for guest->host and guest->host->remote, and
it works quite nicely.  I map the guest pages in, and when the last reference to the pages are dropped,
I release the pages back to the guest.  It doesn't matter if the skb egresses out a physical adapter or is
received locally.  All that matters is the lifetime of the shinfo (and thus its pages) is handled correctly.

> 
> Cc Herbert which was involved in the original discussion.
> 
> In the specific case, it seems that things like pskb_copy,
> skb_split and others will also be broken, won't they?

Not to my knowledge.   They up the reference to the shinfo before proceeding.

Kind Regards,
-Greg



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin Nov. 10, 2009, 1:17 p.m. UTC | #5
On Tue, Nov 10, 2009 at 05:40:50AM -0700, Gregory Haskins wrote:
> >>> On 11/10/2009 at  6:53 AM, in message <20091110115335.GC6989@redhat.com>,
> "Michael S. Tsirkin" <mst@redhat.com> wrote: 
> > On Fri, Oct 02, 2009 at 10:20:00AM -0400, Gregory Haskins wrote:
> >> (Applies to davem/net-2.6.git:4fdb78d30)
> >> 
> >> Hi David, netdevs,
> >> 
> >> The following is an RFC for an attempt at addressing a zero-copy solution.
> >> 
> >> To be perfectly honest, I have no idea if this is the best solution, or if
> >> there is truly a problem with skb->destructor that requires an alternate
> >> mechanism.  What I do know is that this patch seems to work, and I would
> >> like to see some kind of solution available upstream.  So I thought I would
> >> send my hack out as at least a point of discussion.  FWIW: This has been
> >> tested heavily in my rig and is technically suitable for inclusion after
> >> review as is, if that is decided to be the optimal path forward here.
> >> 
> >> Thanks for your review and consideration,
> >> 
> >> Kind regards,
> >> -Greg
> >> 
> >> ----------------------------------------
> >> From: Gregory Haskins <ghaskins@novell.com>
> >> Subject: [RFC PATCH] net: add dataref destructor to sk_buff
> >> 
> >> What: The skb->destructor field is reportedly unreliable for ensuring
> >> that all shinfo users have dropped their references.  Therefore, we add
> >> a distinct ->release() method for the shinfo structure which is closely
> >> tied to the underlying page resources we want to protect.
> >> 
> >> Why: We want to add zero-copy transmit support for AlacrityVM guests.
> >> In order to support this, the host kernel must map guest pages directly
> >> into a paged-skb and send it as normal.  put_page() alone is not
> >> sufficient lifetime management since the pages are ultimately allocated
> >> from within the guest.  Therefore, we need higher-level notification
> >> when the skb is finally freed on the host so we can then inject a proper
> >> "tx-complete" event into the guest context.
> >> 
> >> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> >> ---
> >> 
> >>  include/linux/skbuff.h |    2 ++
> >>  net/core/skbuff.c      |    9 +++++++++
> >>  2 files changed, 11 insertions(+), 0 deletions(-)
> >> 
> >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >> index df7b23a..02cdab6 100644
> >> --- a/include/linux/skbuff.h
> >> +++ b/include/linux/skbuff.h
> >> @@ -207,6 +207,8 @@ struct skb_shared_info {
> >>  	/* Intermediate layers must ensure that destructor_arg
> >>  	 * remains valid until skb destructor */
> >>  	void *		destructor_arg;
> >> +	void *          priv;
> >> +	void           (*release)(struct sk_buff *skb);
> >>  };
> >>  
> >>  /* We divide dataref into two halves.  The higher 16 bits hold references
> >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >> index 80a9616..a7e40a9 100644
> >> --- a/net/core/skbuff.c
> >> +++ b/net/core/skbuff.c
> >> @@ -219,6 +219,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
> > gfp_mask,
> >>  	shinfo->tx_flags.flags = 0;
> >>  	skb_frag_list_init(skb);
> >>  	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
> >> +	shinfo->release = NULL;
> >> +	shinfo->priv = NULL;
> >>  
> >>  	if (fclone) {
> >>  		struct sk_buff *child = skb + 1;
> >> @@ -350,6 +352,9 @@ static void skb_release_data(struct sk_buff *skb)
> >>  		if (skb_has_frags(skb))
> >>  			skb_drop_fraglist(skb);
> >>  
> >> +		if (skb_shinfo(skb)->release)
> >> +			skb_shinfo(skb)->release(skb);
> >> +
> >>  		kfree(skb->head);
> >>  	}
> >>  }
> >> @@ -514,6 +519,8 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
> >>  	shinfo->tx_flags.flags = 0;
> >>  	skb_frag_list_init(skb);
> >>  	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
> >> +	shinfo->release = NULL;
> >> +	shinfo->priv = NULL;
> >>  
> >>  	memset(skb, 0, offsetof(struct sk_buff, tail));
> >>  	skb->data = skb->head + NET_SKB_PAD;
> >> @@ -856,6 +863,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int 
> > ntail,
> >>  	skb->hdr_len  = 0;
> >>  	skb->nohdr    = 0;
> >>  	atomic_set(&skb_shinfo(skb)->dataref, 1);
> >> +	skb_shinfo(skb)->release = NULL;
> >> +	skb_shinfo(skb)->priv = NULL;
> >>  	return 0;
> >>  
> >>  nodata:
> > 
> > This is basically subset of the skb data destructors patch, isn't it?
> 
> Sort of, but the emphasis is different.  Here are the main differences:
> 
> 1) skb->destructor() is on the skb level, shinfo->release() is at the shared-info/page level
> 
> 2) skb->destructor is (iiuc) modified during the skb's lifetime (for instance rmem hooks here to
> manage the buffer-space dynamically), whereas shinfo->release is designed to be used by
> "the owner" and is thus only set at creation.
> 
> 3) shinfo->release tracks the lifetime of the pages, not the skb.  This means it transcends
> the skb's lifetime (such as splits for clone, etc) and focuses on the shared component: the pages

You are comparing with skb destructors, not skb data destructors :) skb
data destructor is Rusty's patch which he wanted to use for vringfd.  I
mean e.g. this:
http://lists.openwall.net/netdev/2008/04/18/7

> > Last time this was tried, this is the objection that was voiced:
> > 
> > 	The problem with this patch is that it's tracking skb's, while
> > 	you want use it to track pages for zero-copy.  That just doesn't
> > 	work.  Through mechanisms like splice, individual pages in the
> > 	skb can be detached and metastasize to other locations, e.g.,
> > 	the VFS.
> 
> Right, and I don't think this applies here because I specifically chose the shinfo level to try to properly
> track the page level avoid this issue.  Multiple skb's can point to a single shinfo, iiuc.

VFS does not know about shinfo either, does it?

> > 
> > and I think this applies here.
> 
> I don't think so, but if you think I missed something, do not be shy (not that you ever are).

Well, I hope the reviews are helpful.  I'll be happy if we learn to
track pages involved in transmit, but need to be careful.

> > In other words, this only *seems*
> > to work for you because you are not trying to do things like
> > guest to host communication, with host doing smart things.
> 
> I am not following what you mean here, as I do use this for guest->host and guest->host->remote, and
> it works quite nicely.  I map the guest pages in, and when the last reference to the pages are dropped,
> I release the pages back to the guest.  It doesn't matter if the skb egresses out a physical adapter or is
> received locally.  All that matters is the lifetime of the shinfo (and thus its pages) is handled correctly.

Not if someone else is referencing the pages without a reference to shinfo.

> > 
> > Cc Herbert which was involved in the original discussion.
> > 
> > In the specific case, it seems that things like pskb_copy,
> > skb_split and others will also be broken, won't they?
> 
> Not to my knowledge.   They up the reference to the shinfo before proceeding.

I don't seem to find where does skb_split reference the shinfo.
It seems to do get_page on individual pages?

> Kind Regards,
> -Greg
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gregory Haskins Nov. 10, 2009, 2:11 p.m. UTC | #6
Michael S. Tsirkin wrote:
> On Tue, Nov 10, 2009 at 05:40:50AM -0700, Gregory Haskins wrote:
>>>>> On 11/10/2009 at  6:53 AM, in message <20091110115335.GC6989@redhat.com>,
>> "Michael S. Tsirkin" <mst@redhat.com> wrote: 
>>> On Fri, Oct 02, 2009 at 10:20:00AM -0400, Gregory Haskins wrote:
>>>> (Applies to davem/net-2.6.git:4fdb78d30)
>>>>
>>>> Hi David, netdevs,
>>>>
>>>> The following is an RFC for an attempt at addressing a zero-copy solution.
>>>>
>>>> To be perfectly honest, I have no idea if this is the best solution, or if
>>>> there is truly a problem with skb->destructor that requires an alternate
>>>> mechanism.  What I do know is that this patch seems to work, and I would
>>>> like to see some kind of solution available upstream.  So I thought I would
>>>> send my hack out as at least a point of discussion.  FWIW: This has been
>>>> tested heavily in my rig and is technically suitable for inclusion after
>>>> review as is, if that is decided to be the optimal path forward here.
>>>>
>>>> Thanks for your review and consideration,
>>>>
>>>> Kind regards,
>>>> -Greg
>>>>
>>>> ----------------------------------------
>>>> From: Gregory Haskins <ghaskins@novell.com>
>>>> Subject: [RFC PATCH] net: add dataref destructor to sk_buff
>>>>
>>>> What: The skb->destructor field is reportedly unreliable for ensuring
>>>> that all shinfo users have dropped their references.  Therefore, we add
>>>> a distinct ->release() method for the shinfo structure which is closely
>>>> tied to the underlying page resources we want to protect.
>>>>
>>>> Why: We want to add zero-copy transmit support for AlacrityVM guests.
>>>> In order to support this, the host kernel must map guest pages directly
>>>> into a paged-skb and send it as normal.  put_page() alone is not
>>>> sufficient lifetime management since the pages are ultimately allocated
>>>> from within the guest.  Therefore, we need higher-level notification
>>>> when the skb is finally freed on the host so we can then inject a proper
>>>> "tx-complete" event into the guest context.
>>>>
>>>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>>>> ---
>>>>
>>>>  include/linux/skbuff.h |    2 ++
>>>>  net/core/skbuff.c      |    9 +++++++++
>>>>  2 files changed, 11 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>>> index df7b23a..02cdab6 100644
>>>> --- a/include/linux/skbuff.h
>>>> +++ b/include/linux/skbuff.h
>>>> @@ -207,6 +207,8 @@ struct skb_shared_info {
>>>>  	/* Intermediate layers must ensure that destructor_arg
>>>>  	 * remains valid until skb destructor */
>>>>  	void *		destructor_arg;
>>>> +	void *          priv;
>>>> +	void           (*release)(struct sk_buff *skb);
>>>>  };
>>>>  
>>>>  /* We divide dataref into two halves.  The higher 16 bits hold references
>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>> index 80a9616..a7e40a9 100644
>>>> --- a/net/core/skbuff.c
>>>> +++ b/net/core/skbuff.c
>>>> @@ -219,6 +219,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
>>> gfp_mask,
>>>>  	shinfo->tx_flags.flags = 0;
>>>>  	skb_frag_list_init(skb);
>>>>  	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
>>>> +	shinfo->release = NULL;
>>>> +	shinfo->priv = NULL;
>>>>  
>>>>  	if (fclone) {
>>>>  		struct sk_buff *child = skb + 1;
>>>> @@ -350,6 +352,9 @@ static void skb_release_data(struct sk_buff *skb)
>>>>  		if (skb_has_frags(skb))
>>>>  			skb_drop_fraglist(skb);
>>>>  
>>>> +		if (skb_shinfo(skb)->release)
>>>> +			skb_shinfo(skb)->release(skb);
>>>> +
>>>>  		kfree(skb->head);
>>>>  	}
>>>>  }
>>>> @@ -514,6 +519,8 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
>>>>  	shinfo->tx_flags.flags = 0;
>>>>  	skb_frag_list_init(skb);
>>>>  	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
>>>> +	shinfo->release = NULL;
>>>> +	shinfo->priv = NULL;
>>>>  
>>>>  	memset(skb, 0, offsetof(struct sk_buff, tail));
>>>>  	skb->data = skb->head + NET_SKB_PAD;
>>>> @@ -856,6 +863,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int 
>>> ntail,
>>>>  	skb->hdr_len  = 0;
>>>>  	skb->nohdr    = 0;
>>>>  	atomic_set(&skb_shinfo(skb)->dataref, 1);
>>>> +	skb_shinfo(skb)->release = NULL;
>>>> +	skb_shinfo(skb)->priv = NULL;
>>>>  	return 0;
>>>>  
>>>>  nodata:
>>> This is basically subset of the skb data destructors patch, isn't it?
>> Sort of, but the emphasis is different.  Here are the main differences:
>>
>> 1) skb->destructor() is on the skb level, shinfo->release() is at the shared-info/page level
>>
>> 2) skb->destructor is (iiuc) modified during the skb's lifetime (for instance rmem hooks here to
>> manage the buffer-space dynamically), whereas shinfo->release is designed to be used by
>> "the owner" and is thus only set at creation.
>>
>> 3) shinfo->release tracks the lifetime of the pages, not the skb.  This means it transcends
>> the skb's lifetime (such as splits for clone, etc) and focuses on the shared component: the pages
> 
> You are comparing with skb destructors, not skb data destructors :) skb
> data destructor is Rusty's patch which he wanted to use for vringfd.  I
> mean e.g. this:
> http://lists.openwall.net/netdev/2008/04/18/7

Ah, I wasn't aware.  I believe Anthony Ligouri had pointed me at this
same patch earlier this year.  However, more recently when I saw
skb->destructor() in mainline (seemingly from Rusty), I thought it had
been accepted and didn't investigate further.  I see now that you are
talking about something else.

> 
>>> Last time this was tried, this is the objection that was voiced:
>>>
>>> 	The problem with this patch is that it's tracking skb's, while
>>> 	you want use it to track pages for zero-copy.  That just doesn't
>>> 	work.  Through mechanisms like splice, individual pages in the
>>> 	skb can be detached and metastasize to other locations, e.g.,
>>> 	the VFS.
>> Right, and I don't think this applies here because I specifically chose the shinfo level to try to properly
>> track the page level avoid this issue.  Multiple skb's can point to a single shinfo, iiuc.
> 
> VFS does not know about shinfo either, does it?

I do not follow the reference.  Where does VFS come into play?

> 
>>> and I think this applies here.
>> I don't think so, but if you think I missed something, do not be shy (not that you ever are).
> 
> Well, I hope the reviews are helpful.

Yes, I didn't mean it as a slam.  Was only trying to convey that I
didn't think I was wrong but am open minded to the possibility, so
please keep the discussion going.

>  I'll be happy if we learn to
> track pages involved in transmit, but need to be careful.
> 

Agreed.

>>> In other words, this only *seems*
>>> to work for you because you are not trying to do things like
>>> guest to host communication, with host doing smart things.
>> I am not following what you mean here, as I do use this for guest->host and guest->host->remote, and
>> it works quite nicely.  I map the guest pages in, and when the last reference to the pages are dropped,
>> I release the pages back to the guest.  It doesn't matter if the skb egresses out a physical adapter or is
>> received locally.  All that matters is the lifetime of the shinfo (and thus its pages) is handled correctly.
> 
> Not if someone else is referencing the pages without a reference to shinfo.

I agree that if we can reference pages outside of the skb/shinfo then
there is a problem.  I wasn't aware that we could do this, tbh.

However, it seems to me that this is a problem with the overall stack,
if true....isn't it?  For instance, if I do a sendmsg() from a userspace
app and block until its consumed, how can the system function sanely if
the app returns from the call but something is still referencing the
page(s)?  This seems broken to me.

> 
>>> Cc Herbert which was involved in the original discussion.
>>>
>>> In the specific case, it seems that things like pskb_copy,
>>> skb_split and others will also be broken, won't they?
>> Not to my knowledge.   They up the reference to the shinfo before proceeding.
> 
> I don't seem to find where does skb_split reference the shinfo.
> It seems to do get_page on individual pages?

Ill take a look.

If so, one alternate solution that I had considered was to look at some
kind of page->release() hook.  There are obvious disadvantages to such a
solution (for one, it has no such notion, and second we have to manage
the aggregate collection which has overhead), so I shied away from that
design in favor of this one.  But perhaps we will ultimately have no choice.

Kind Regards,
-Greg
Michael S. Tsirkin Nov. 10, 2009, 2:36 p.m. UTC | #7
On Tue, Nov 10, 2009 at 09:11:10AM -0500, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Tue, Nov 10, 2009 at 05:40:50AM -0700, Gregory Haskins wrote:
> >>>>> On 11/10/2009 at  6:53 AM, in message <20091110115335.GC6989@redhat.com>,
> >> "Michael S. Tsirkin" <mst@redhat.com> wrote: 
> >>> On Fri, Oct 02, 2009 at 10:20:00AM -0400, Gregory Haskins wrote:
> >>>> (Applies to davem/net-2.6.git:4fdb78d30)
> >>>>
> >>>> Hi David, netdevs,
> >>>>
> >>>> The following is an RFC for an attempt at addressing a zero-copy solution.
> >>>>
> >>>> To be perfectly honest, I have no idea if this is the best solution, or if
> >>>> there is truly a problem with skb->destructor that requires an alternate
> >>>> mechanism.  What I do know is that this patch seems to work, and I would
> >>>> like to see some kind of solution available upstream.  So I thought I would
> >>>> send my hack out as at least a point of discussion.  FWIW: This has been
> >>>> tested heavily in my rig and is technically suitable for inclusion after
> >>>> review as is, if that is decided to be the optimal path forward here.
> >>>>
> >>>> Thanks for your review and consideration,
> >>>>
> >>>> Kind regards,
> >>>> -Greg
> >>>>
> >>>> ----------------------------------------
> >>>> From: Gregory Haskins <ghaskins@novell.com>
> >>>> Subject: [RFC PATCH] net: add dataref destructor to sk_buff
> >>>>
> >>>> What: The skb->destructor field is reportedly unreliable for ensuring
> >>>> that all shinfo users have dropped their references.  Therefore, we add
> >>>> a distinct ->release() method for the shinfo structure which is closely
> >>>> tied to the underlying page resources we want to protect.
> >>>>
> >>>> Why: We want to add zero-copy transmit support for AlacrityVM guests.
> >>>> In order to support this, the host kernel must map guest pages directly
> >>>> into a paged-skb and send it as normal.  put_page() alone is not
> >>>> sufficient lifetime management since the pages are ultimately allocated
> >>>> from within the guest.  Therefore, we need higher-level notification
> >>>> when the skb is finally freed on the host so we can then inject a proper
> >>>> "tx-complete" event into the guest context.
> >>>>
> >>>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> >>>> ---
> >>>>
> >>>>  include/linux/skbuff.h |    2 ++
> >>>>  net/core/skbuff.c      |    9 +++++++++
> >>>>  2 files changed, 11 insertions(+), 0 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >>>> index df7b23a..02cdab6 100644
> >>>> --- a/include/linux/skbuff.h
> >>>> +++ b/include/linux/skbuff.h
> >>>> @@ -207,6 +207,8 @@ struct skb_shared_info {
> >>>>  	/* Intermediate layers must ensure that destructor_arg
> >>>>  	 * remains valid until skb destructor */
> >>>>  	void *		destructor_arg;
> >>>> +	void *          priv;
> >>>> +	void           (*release)(struct sk_buff *skb);
> >>>>  };
> >>>>  
> >>>>  /* We divide dataref into two halves.  The higher 16 bits hold references
> >>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >>>> index 80a9616..a7e40a9 100644
> >>>> --- a/net/core/skbuff.c
> >>>> +++ b/net/core/skbuff.c
> >>>> @@ -219,6 +219,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
> >>> gfp_mask,
> >>>>  	shinfo->tx_flags.flags = 0;
> >>>>  	skb_frag_list_init(skb);
> >>>>  	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
> >>>> +	shinfo->release = NULL;
> >>>> +	shinfo->priv = NULL;
> >>>>  
> >>>>  	if (fclone) {
> >>>>  		struct sk_buff *child = skb + 1;
> >>>> @@ -350,6 +352,9 @@ static void skb_release_data(struct sk_buff *skb)
> >>>>  		if (skb_has_frags(skb))
> >>>>  			skb_drop_fraglist(skb);
> >>>>  
> >>>> +		if (skb_shinfo(skb)->release)
> >>>> +			skb_shinfo(skb)->release(skb);
> >>>> +
> >>>>  		kfree(skb->head);
> >>>>  	}
> >>>>  }
> >>>> @@ -514,6 +519,8 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
> >>>>  	shinfo->tx_flags.flags = 0;
> >>>>  	skb_frag_list_init(skb);
> >>>>  	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
> >>>> +	shinfo->release = NULL;
> >>>> +	shinfo->priv = NULL;
> >>>>  
> >>>>  	memset(skb, 0, offsetof(struct sk_buff, tail));
> >>>>  	skb->data = skb->head + NET_SKB_PAD;
> >>>> @@ -856,6 +863,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int 
> >>> ntail,
> >>>>  	skb->hdr_len  = 0;
> >>>>  	skb->nohdr    = 0;
> >>>>  	atomic_set(&skb_shinfo(skb)->dataref, 1);
> >>>> +	skb_shinfo(skb)->release = NULL;
> >>>> +	skb_shinfo(skb)->priv = NULL;
> >>>>  	return 0;
> >>>>  
> >>>>  nodata:
> >>> This is basically subset of the skb data destructors patch, isn't it?
> >> Sort of, but the emphasis is different.  Here are the main differences:
> >>
> >> 1) skb->destructor() is on the skb level, shinfo->release() is at the shared-info/page level
> >>
> >> 2) skb->destructor is (iiuc) modified during the skb's lifetime (for instance rmem hooks here to
> >> manage the buffer-space dynamically), whereas shinfo->release is designed to be used by
> >> "the owner" and is thus only set at creation.
> >>
> >> 3) shinfo->release tracks the lifetime of the pages, not the skb.  This means it transcends
> >> the skb's lifetime (such as splits for clone, etc) and focuses on the shared component: the pages
> > 
> > You are comparing with skb destructors, not skb data destructors :) skb
> > data destructor is Rusty's patch which he wanted to use for vringfd.  I
> > mean e.g. this:
> > http://lists.openwall.net/netdev/2008/04/18/7
> 
> Ah, I wasn't aware.  I believe Anthony Ligouri had pointed me at this
> same patch earlier this year.  However, more recently when I saw
> skb->destructor() in mainline (seemingly from Rusty), I thought it had
> been accepted and didn't investigate further.  I see now that you are
> talking about something else.
> 
> > 
> >>> Last time this was tried, this is the objection that was voiced:
> >>>
> >>> 	The problem with this patch is that it's tracking skb's, while
> >>> 	you want use it to track pages for zero-copy.  That just doesn't
> >>> 	work.  Through mechanisms like splice, individual pages in the
> >>> 	skb can be detached and metastasize to other locations, e.g.,
> >>> 	the VFS.
> >> Right, and I don't think this applies here because I specifically chose the shinfo level to try to properly
> >> track the page level avoid this issue.  Multiple skb's can point to a single shinfo, iiuc.
> > 
> > VFS does not know about shinfo either, does it?
> 
> I do not follow the reference.  Where does VFS come into play?

"Through mechanisms like splice, individual pages in the
skb can be detached and metastasize to other locations, e.g.,
the VFS"

> > 
> >>> and I think this applies here.
> >> I don't think so, but if you think I missed something, do not be shy (not that you ever are).
> > 
> > Well, I hope the reviews are helpful.
> 
> Yes, I didn't mean it as a slam.  Was only trying to convey that I
> didn't think I was wrong but am open minded to the possibility, so
> please keep the discussion going.
> 
> >  I'll be happy if we learn to
> > track pages involved in transmit, but need to be careful.
> > 
> 
> Agreed.
> 
> >>> In other words, this only *seems*
> >>> to work for you because you are not trying to do things like
> >>> guest to host communication, with host doing smart things.
> >> I am not following what you mean here, as I do use this for guest->host and guest->host->remote, and
> >> it works quite nicely.  I map the guest pages in, and when the last reference to the pages are dropped,
> >> I release the pages back to the guest.  It doesn't matter if the skb egresses out a physical adapter or is
> >> received locally.  All that matters is the lifetime of the shinfo (and thus its pages) is handled correctly.
> > 
> > Not if someone else is referencing the pages without a reference to shinfo.
> 
> I agree that if we can reference pages outside of the skb/shinfo then
> there is a problem.  I wasn't aware that we could do this, tbh.
> 
> However, it seems to me that this is a problem with the overall stack,
> if true....isn't it?  For instance, if I do a sendmsg() from a userspace
> app and block until its consumed,

consumed == memcpy_from_iovec?

> how can the system function sanely if
> the app returns from the call but something is still referencing the
> page(s)?

which pages?

>  This seems broken to me.
> > 
> >>> Cc Herbert which was involved in the original discussion.
> >>>
> >>> In the specific case, it seems that things like pskb_copy,
> >>> skb_split and others will also be broken, won't they?
> >> Not to my knowledge.   They up the reference to the shinfo before proceeding.
> > 
> > I don't seem to find where does skb_split reference the shinfo.
> > It seems to do get_page on individual pages?
> 
> Ill take a look.
> 
> If so, one alternate solution that I had considered was to look at some
> kind of page->release() hook.  There are obvious disadvantages to such a
> solution (for one, it has no such notion, and second we have to manage
> the aggregate collection which has overhead), so I shied away from that
> design in favor of this one.  But perhaps we will ultimately have no choice.
> 
> Kind Regards,
> -Greg
> 


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin Nov. 10, 2009, 2:45 p.m. UTC | #8
On Tue, Nov 10, 2009 at 09:11:10AM -0500, Gregory Haskins wrote:
> > You are comparing with skb destructors, not skb data destructors :) skb
> > data destructor is Rusty's patch which he wanted to use for vringfd.  I
> > mean e.g. this:
> > http://lists.openwall.net/netdev/2008/04/18/7
> 
> Ah, I wasn't aware.  I believe Anthony Ligouri had pointed me at this
> same patch earlier this year.  However, more recently when I saw
> skb->destructor() in mainline (seemingly from Rusty), I thought it had
> been accepted and didn't investigate further.

skb->destructor seems to be there as far back as 2.2.0
http://lxr.linux.no/#linux-old+v2.2.0/include/linux/skbuff.h
Gregory Haskins Nov. 10, 2009, 3:45 p.m. UTC | #9
Michael S. Tsirkin wrote:
> On Tue, Nov 10, 2009 at 09:11:10AM -0500, Gregory Haskins wrote:
>> Michael S. Tsirkin wrote:
>>> On Tue, Nov 10, 2009 at 05:40:50AM -0700, Gregory Haskins wrote:
>>>>>>> On 11/10/2009 at  6:53 AM, in message <20091110115335.GC6989@redhat.com>,
>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote: 
>>
>>>>> Last time this was tried, this is the objection that was voiced:
>>>>>
>>>>> 	The problem with this patch is that it's tracking skb's, while
>>>>> 	you want use it to track pages for zero-copy.  That just doesn't
>>>>> 	work.  Through mechanisms like splice, individual pages in the
>>>>> 	skb can be detached and metastasize to other locations, e.g.,
>>>>> 	the VFS.
>>>> Right, and I don't think this applies here because I specifically chose the shinfo level to try to properly
>>>> track the page level avoid this issue.  Multiple skb's can point to a single shinfo, iiuc.
>>> VFS does not know about shinfo either, does it?
>> I do not follow the reference.  Where does VFS come into play?
> 
> "Through mechanisms like splice, individual pages in the
> skb can be detached and metastasize to other locations, e.g.,
> the VFS"

Right, understood.  What I mean is: How is that actually used in
real-life in a way that is valid?

What I am getting at is as follows:  From a real basic perspective, you
can look at all of this as a simple synchronous call (i.e. sendmsg()).
The "app" (be it a userspace app, or a guest) prepares a buffer for
transmission, and offers it to the next layer in the stack.  The app
must maintain the integrity of that buffer at least until the layer
below it signifies that it is "consumed".  This may mean its a
synchronous call, like sendmsg(), or it may be asynchronous, like AIO.

But the key thing here is that at some point, the lower layer has to
signify that the buffer stability constraint has been met.  In either
case, we have a clear delineated event: the io-completes = the buffer is
free to be reused.

In the simple case, the buffer in question is copied to a kernel buffer,
and the io completes immediately. In other cases (such as zero copy),
the buffer is mapped into the skb, and we have to wait for even lower
layers to signify the completion.

I am not a stack expert, but I was under the impression that we use this
model for userspace pages today as well using the wmem callbacks in
skb->destructor().  If so, I do not see how you could do something like
detach a page from a pskb and still expect to have a proper event that
delineates the io-completion to the higher layers.

So the questions are:

1) do we in fact map userspace pages to pskbs today?
2a) if so, how do we delineate the completion event?
2b) and how do we prevent worrying about the get_page() issue you refer
to.


>>
>>>>> In other words, this only *seems*
>>>>> to work for you because you are not trying to do things like
>>>>> guest to host communication, with host doing smart things.
>>>> I am not following what you mean here, as I do use this for guest->host and guest->host->remote, and
>>>> it works quite nicely.  I map the guest pages in, and when the last reference to the pages are dropped,
>>>> I release the pages back to the guest.  It doesn't matter if the skb egresses out a physical adapter or is
>>>> received locally.  All that matters is the lifetime of the shinfo (and thus its pages) is handled correctly.
>>> Not if someone else is referencing the pages without a reference to shinfo.
>> I agree that if we can reference pages outside of the skb/shinfo then
>> there is a problem.  I wasn't aware that we could do this, tbh.
>>
>> However, it seems to me that this is a problem with the overall stack,
>> if true....isn't it?  For instance, if I do a sendmsg() from a userspace
>> app and block until its consumed,
> 
> consumed == memcpy_from_iovec?

For non-zero-copy, sure why not.

> 
>> how can the system function sanely if
>> the app returns from the call but something is still referencing the
>> page(s)?
> 
> which pages?

You said that there are paths that get_page() out of shinfo without
holding a shinfo reference.

Kind Regards,
-Greg
Gregory Haskins Nov. 10, 2009, 3:47 p.m. UTC | #10
Michael S. Tsirkin wrote:
> On Tue, Nov 10, 2009 at 09:11:10AM -0500, Gregory Haskins wrote:
>>> You are comparing with skb destructors, not skb data destructors :) skb
>>> data destructor is Rusty's patch which he wanted to use for vringfd.  I
>>> mean e.g. this:
>>> http://lists.openwall.net/netdev/2008/04/18/7
>> Ah, I wasn't aware.  I believe Anthony Ligouri had pointed me at this
>> same patch earlier this year.  However, more recently when I saw
>> skb->destructor() in mainline (seemingly from Rusty), I thought it had
>> been accepted and didn't investigate further.
> 
> skb->destructor seems to be there as far back as 2.2.0
> http://lxr.linux.no/#linux-old+v2.2.0/include/linux/skbuff.h
> 

Right, I meant recent activity around it for fixing it up related to
leaks, etc.
Michael S. Tsirkin Nov. 10, 2009, 5:36 p.m. UTC | #11
On Tue, Nov 10, 2009 at 10:45:16AM -0500, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Tue, Nov 10, 2009 at 09:11:10AM -0500, Gregory Haskins wrote:
> >> Michael S. Tsirkin wrote:
> >>> On Tue, Nov 10, 2009 at 05:40:50AM -0700, Gregory Haskins wrote:
> >>>>>>> On 11/10/2009 at  6:53 AM, in message <20091110115335.GC6989@redhat.com>,
> >>>> "Michael S. Tsirkin" <mst@redhat.com> wrote: 
> >>
> >>>>> Last time this was tried, this is the objection that was voiced:
> >>>>>
> >>>>> 	The problem with this patch is that it's tracking skb's, while
> >>>>> 	you want use it to track pages for zero-copy.  That just doesn't
> >>>>> 	work.  Through mechanisms like splice, individual pages in the
> >>>>> 	skb can be detached and metastasize to other locations, e.g.,
> >>>>> 	the VFS.
> >>>> Right, and I don't think this applies here because I specifically chose the shinfo level to try to properly
> >>>> track the page level avoid this issue.  Multiple skb's can point to a single shinfo, iiuc.
> >>> VFS does not know about shinfo either, does it?
> >> I do not follow the reference.  Where does VFS come into play?
> > 
> > "Through mechanisms like splice, individual pages in the
> > skb can be detached and metastasize to other locations, e.g.,
> > the VFS"
> 
> Right, understood.  What I mean is: How is that actually used in
> real-life in a way that is valid?
> 
> What I am getting at is as follows:  From a real basic perspective, you
> can look at all of this as a simple synchronous call (i.e. sendmsg()).
> The "app" (be it a userspace app, or a guest) prepares a buffer for
> transmission, and offers it to the next layer in the stack.  The app
> must maintain the integrity of that buffer at least until the layer
> below it signifies that it is "consumed".  This may mean its a
> synchronous call, like sendmsg(), or it may be asynchronous, like AIO.
> 
> But the key thing here is that at some point, the lower layer has to
> signify that the buffer stability constraint has been met.  In either
> case, we have a clear delineated event: the io-completes = the buffer is
> free to be reused.
> 
> In the simple case, the buffer in question is copied to a kernel buffer,
> and the io completes immediately. In other cases (such as zero copy),
> the buffer is mapped into the skb, and we have to wait for even lower
> layers to signify the completion.
> 
> I am not a stack expert, but I was under the impression that we use this
> model for userspace pages today as well using the wmem callbacks in
> skb->destructor().  If so, I do not see how you could do something like
> detach a page from a pskb and still expect to have a proper event that
> delineates the io-completion to the higher layers.

I think linux only cares about that for accounting purposes (stuff like
socket sndbuff size). If someone takes over the page, the socket can
stop worrying about it.

> So the questions are:
> 
> 1) do we in fact map userspace pages to pskbs today?

I don't think so.

> 2a) if so, how do we delineate the completion event?
> 2b) and how do we prevent worrying about the get_page() issue you refer
> to.
> 
> 
> >>
> >>>>> In other words, this only *seems*
> >>>>> to work for you because you are not trying to do things like
> >>>>> guest to host communication, with host doing smart things.
> >>>> I am not following what you mean here, as I do use this for guest->host and guest->host->remote, and
> >>>> it works quite nicely.  I map the guest pages in, and when the last reference to the pages are dropped,
> >>>> I release the pages back to the guest.  It doesn't matter if the skb egresses out a physical adapter or is
> >>>> received locally.  All that matters is the lifetime of the shinfo (and thus its pages) is handled correctly.
> >>> Not if someone else is referencing the pages without a reference to shinfo.
> >> I agree that if we can reference pages outside of the skb/shinfo then
> >> there is a problem.  I wasn't aware that we could do this, tbh.
> >>
> >> However, it seems to me that this is a problem with the overall stack,
> >> if true....isn't it?  For instance, if I do a sendmsg() from a userspace
> >> app and block until its consumed,
> > 
> > consumed == memcpy_from_iovec?
> 
> For non-zero-copy, sure why not.
> 
> > 
> >> how can the system function sanely if
> >> the app returns from the call but something is still referencing the
> >> page(s)?
> > 
> > which pages?
> 
> You said that there are paths that get_page() out of shinfo without
> holding a shinfo reference.

Without zero copy, application does not care about these,
they have been allocated by kernel.
Gregory Haskins Nov. 10, 2009, 6:36 p.m. UTC | #12
Michael S. Tsirkin wrote:
> On Tue, Nov 10, 2009 at 10:45:16AM -0500, Gregory Haskins wrote:
>> I am not a stack expert, but I was under the impression that we use this
>> model for userspace pages today as well using the wmem callbacks in
>> skb->destructor().  If so, I do not see how you could do something like
>> detach a page from a pskb and still expect to have a proper event that
>> delineates the io-completion to the higher layers.
> 
> I think linux only cares about that for accounting purposes (stuff like
> socket sndbuff size). If someone takes over the page, the socket can
> stop worrying about it.

Only if there isn't zero-copy.

> 
>> So the questions are:
>>
>> 1) do we in fact map userspace pages to pskbs today?
> 
> I don't think so.

What about things like sendfile()?  There has to be *some* way to
synchronize with the io-completion event,  I would think.  Whatever that
is, I'd like to tap into it.

>>> which pages?
>>
>> You said that there are paths that get_page() out of shinfo without
>> holding a shinfo reference.
> 
> Without zero copy, application does not care about these,
> they have been allocated by kernel.

Agreed in the non-zero copy case.  I am not yet convinced that we do not
do zero copy in some form, however. Ill have to dig through the code
when I get a chance to confirm.

Kind Regards,
-Greg
Evgeniy Polyakov Nov. 10, 2009, 9:40 p.m. UTC | #13
On Tue, Nov 10, 2009 at 01:36:23PM -0500, Gregory Haskins (ghaskins@novell.com) wrote:
> What about things like sendfile()?  There has to be *some* way to
> synchronize with the io-completion event,  I would think.  Whatever that
> is, I'd like to tap into it.

All skb manipulation functions properly maintain data reference
counters, so pages will not be freed until all data is consumed.
But there is no guarantee that data placed in given page will not be
overwritten while page is being held somewhere in the stack.

Putting shared info destructor will allow to get notification, that
given shared info processing is over, i.e. that network stack does not
use data placed in shared info for given skb, but if it was copied or
VFS hold those pages, they may or may not be freed.
Herbert Xu Nov. 14, 2009, 1:12 a.m. UTC | #14
On Tue, Nov 10, 2009 at 10:45:16AM -0500, Gregory Haskins wrote:
>
> What I am getting at is as follows:  From a real basic perspective, you
> can look at all of this as a simple synchronous call (i.e. sendmsg()).
> The "app" (be it a userspace app, or a guest) prepares a buffer for
> transmission, and offers it to the next layer in the stack.  The app
> must maintain the integrity of that buffer at least until the layer
> below it signifies that it is "consumed".  This may mean its a
> synchronous call, like sendmsg(), or it may be asynchronous, like AIO.

Neither sendmsg() nor sendfile() is synchronous in the way you
imagine.

Cheers,
Gregory Haskins Nov. 14, 2009, 1:33 a.m. UTC | #15
Herbert Xu wrote:
> On Tue, Nov 10, 2009 at 10:45:16AM -0500, Gregory Haskins wrote:
>> What I am getting at is as follows:  From a real basic perspective, you
>> can look at all of this as a simple synchronous call (i.e. sendmsg()).
>> The "app" (be it a userspace app, or a guest) prepares a buffer for
>> transmission, and offers it to the next layer in the stack.  The app
>> must maintain the integrity of that buffer at least until the layer
>> below it signifies that it is "consumed".  This may mean its a
>> synchronous call, like sendmsg(), or it may be asynchronous, like AIO.
> 
> Neither sendmsg() nor sendfile() is synchronous in the way you
> imagine.

Well, not with respect to the overall protocol, of course not.  But with
respect to the buffer in question, it _has_ to be.  Or am I missing
something?

Kind Regards,
-Greg
Herbert Xu Nov. 14, 2009, 2:21 a.m. UTC | #16
On Fri, Nov 13, 2009 at 08:33:35PM -0500, Gregory Haskins wrote:
>
> Well, not with respect to the overall protocol, of course not.  But with
> respect to the buffer in question, it _has_ to be.  Or am I missing
> something?

sendfile() has never guaranteed that the kernel is finished with
the underlying pages when it returns.

Cheers,
Gregory Haskins Nov. 14, 2009, 2:27 a.m. UTC | #17
Herbert Xu wrote:
> On Fri, Nov 13, 2009 at 08:33:35PM -0500, Gregory Haskins wrote:
>> Well, not with respect to the overall protocol, of course not.  But with
>> respect to the buffer in question, it _has_ to be.  Or am I missing
>> something?
> 
> sendfile() has never guaranteed that the kernel is finished with
> the underlying pages when it returns.
> 
> Cheers,

Clearly there must be _some_ mechanism to synchronize (e.g.
flush/barrier) though, right?  Otherwise, that interface would seem to
be quite prone to races and would likely be unusable.   So what does
said flush use to know when the buffer is free?

-Greg
Herbert Xu Nov. 14, 2009, 2:43 a.m. UTC | #18
On Fri, Nov 13, 2009 at 09:27:57PM -0500, Gregory Haskins wrote:
>
> Clearly there must be _some_ mechanism to synchronize (e.g.
> flush/barrier) though, right?  Otherwise, that interface would seem to
> be quite prone to races and would likely be unusable.   So what does
> said flush use to know when the buffer is free?

It is up to the application to devise its own sync mechanism.
For TCP, you may query the kernel for the last sequence number
acked by the other side.

Cheers,
stephen hemminger Nov. 14, 2009, 2:45 a.m. UTC | #19
On Fri, 13 Nov 2009 21:27:57 -0500
Gregory Haskins <gregory.haskins@gmail.com> wrote:

> Herbert Xu wrote:
> > On Fri, Nov 13, 2009 at 08:33:35PM -0500, Gregory Haskins wrote:
> >> Well, not with respect to the overall protocol, of course not.  But with
> >> respect to the buffer in question, it _has_ to be.  Or am I missing
> >> something?
> > 
> > sendfile() has never guaranteed that the kernel is finished with
> > the underlying pages when it returns.
> > 
> > Cheers,
> 
> Clearly there must be _some_ mechanism to synchronize (e.g.
> flush/barrier) though, right?  Otherwise, that interface would seem to
> be quite prone to races and would likely be unusable.   So what does
> said flush use to know when the buffer is free?

No all the interfaces require a copy. Actually, sendfile makes no guarantee about synchronization
because the receiver of said file could be arbitrarily slow, and any attempt at locking down
current contents of file is a denial of service exposure.

People have tried doing copy-less send by page flipping, but the overhead of the IPI to
invalidate the TLB exceeded the overhead of the copy. There was an Intel paper on this in
at Linux Symposium (Ottawa) several years ago.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Nov. 14, 2009, 2:51 a.m. UTC | #20
On Fri, Nov 13, 2009 at 06:45:03PM -0800, Stephen Hemminger wrote:
> 
> No all the interfaces require a copy. Actually, sendfile makes no guarantee about synchronization

Actually sendfile does not require a copy.

> because the receiver of said file could be arbitrarily slow, and any attempt at locking down
> current contents of file is a denial of service exposure.

Agreed.

Cheers,
David Miller Nov. 14, 2009, 3:04 a.m. UTC | #21
From: Gregory Haskins <gregory.haskins@gmail.com>
Date: Fri, 13 Nov 2009 20:33:35 -0500

> Well, not with respect to the overall protocol, of course not.  But with
> respect to the buffer in question, it _has_ to be.  Or am I missing
> something?

sendfile() absolutely, and positively, is not.

Any entity can write to the pages being send via sendfile(), at will,
and those writes will show up in the packet stream if they occur
before the NIC DMA's the memory backed by those pages into it's
buffer.

There is zero data synchronization whatsoever, we don't lock the
pages, we don't block their usage while they are queued up in the
socket send queue, nothing like that.

The user returns long before it every hits the wire and there is zero
"notification" to the user that the pages in question for the
sendfile() request are no longer in use.

It seems that your understanding of how buffering and synchronization
works in the TCP stack has come out of a fairy tale :-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Nov. 14, 2009, 3:09 a.m. UTC | #22
From: Gregory Haskins <gregory.haskins@gmail.com>
Date: Fri, 13 Nov 2009 21:27:57 -0500

> Clearly there must be _some_ mechanism to synchronize (e.g.
> flush/barrier) though, right?  Otherwise, that interface would seem to
> be quite prone to races and would likely be unusable.   So what does
> said flush use to know when the buffer is free?

There is no such synchronization at all.

If some synchronization is required, it must be done at a higher
level.

For example, SAMBA can only use sendfile() to serve file contents when
the client holds an SMB "OP Lock" on the file.  Luckily, by default
pretty much all SMB client implementations hold the OP Lock on a file
even just to read it (this is so that self-modifying autoexec.bat
scripts work :-)

But frankly most sendfile() consumers don't care, and the only thing
that really matters is that the checksum on the final TCP packet that
goes out is correct, and since we require card checksums to sendfile()
without copying, that is taken care of transparently.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gregory Haskins Nov. 14, 2009, 5:27 a.m. UTC | #23
Stephen Hemminger wrote:
> On Fri, 13 Nov 2009 21:27:57 -0500
> Gregory Haskins <gregory.haskins@gmail.com> wrote:
> 
>> Herbert Xu wrote:
>>> On Fri, Nov 13, 2009 at 08:33:35PM -0500, Gregory Haskins wrote:
>>>> Well, not with respect to the overall protocol, of course not.  But with
>>>> respect to the buffer in question, it _has_ to be.  Or am I missing
>>>> something?
>>> sendfile() has never guaranteed that the kernel is finished with
>>> the underlying pages when it returns.
>>>
>>> Cheers,
>> Clearly there must be _some_ mechanism to synchronize (e.g.
>> flush/barrier) though, right?  Otherwise, that interface would seem to
>> be quite prone to races and would likely be unusable.   So what does
>> said flush use to know when the buffer is free?
> 
> No all the interfaces require a copy.

I'm sorry, but I do not think that is correct.  As others have pointed
out, that would not appear to be true for at least sendfile.

> Actually, sendfile makes no guarantee about synchronization
> because the receiver of said file could be arbitrarily slow, and any attempt at locking down
> current contents of file is a denial of service exposure.

I think you are inverting the problem space.  It is fully expected that
changing the "file", or the pages that represent the file before the
packet is queued would constitute the modification of the stream on the
wire.

I am more thinking about the applications of mmap+sendfile to implement
a zero-copy interface.  As David mentions in another mail, it seems that
there is no sync mechanism available, so this would not appear to be a
viable use case today, unfortunately.

> 
> People have tried doing copy-less send by page flipping, but the overhead of the IPI to
> invalidate the TLB exceeded the overhead of the copy. There was an Intel paper on this in
> at Linux Symposium (Ottawa) several years ago.

I think you are confusing copy-less tx with copy-less rx.  You can try
to do copy-less rx with page flipping, which has the IPI/TLB thrashing
properties you mention, and I agree is problematic.  We are talking
about copy-less tx here, however, and therefore no page-flipping is
involved.  Rather, we are just posting SG lists of pages directly to the
NIC (assuming the nic supports HIGH_DMA, etc).  You do not need to flip
the page, or invalidate the TLB (and thus IPI the other cores) to do
this to my knowledge.

Kind Regards,
-Greg
Gregory Haskins Nov. 16, 2009, 2:22 p.m. UTC | #24
David Miller wrote:
> From: Gregory Haskins <gregory.haskins@gmail.com>
> Date: Fri, 13 Nov 2009 20:33:35 -0500
> 
>> Well, not with respect to the overall protocol, of course not.  But with
>> respect to the buffer in question, it _has_ to be.  Or am I missing
>> something?
> 
> sendfile() absolutely, and positively, is not.
> 
> Any entity can write to the pages being send via sendfile(), at will,
> and those writes will show up in the packet stream if they occur
> before the NIC DMA's the memory backed by those pages into it's
> buffer.

Right, understood.

> 
> There is zero data synchronization whatsoever, we don't lock the
> pages, we don't block their usage while they are queued up in the
> socket send queue, nothing like that.

Understood.

> 
> The user returns long before it every hits the wire and there is zero
> "notification" to the user that the pages in question for the
> sendfile() request are no longer in use.

Ok, this was the part I didn't know.

> 
> It seems that your understanding of how buffering and synchronization
> works in the TCP stack has come out of a fairy tale :-)

I understand that we do not protect the buffers from modification from
other entities in process.  This was purely a question of
synchronization from the producers standpoint.

IOW:

for (;;) {
   char buf[512];

   memcpy(buf, next, sizeof(buf));	
   write(fd, buf);
}

would work without worrying that the producer will stomp on buf itself.
 It is now my understanding that for things other than sendfile, this
works because the buffer is copied before it returns control to the app.
 For sendfile(), the producer is more or less on its own and therefore
has to be careful if they are reusing previous mmapped buffers.  Ok.

But really, this is somewhat orthogonal to the original problem, so let
me see if we can bring it back on topic.  Michael stated that this patch
in question may be problematic because there are places in the stack
that can get_page() without also maintaining a reference to the shinfo
object.  Evgeniy seems to say the opposite.  I am not sure who is right,
or if I misunderstood one or both of them.  Any thoughts?

Kind Regards,
-Greg
Avi Kivity Nov. 16, 2009, 5:08 p.m. UTC | #25
On 11/14/2009 05:04 AM, David Miller wrote:
> From: Gregory Haskins<gregory.haskins@gmail.com>
> Date: Fri, 13 Nov 2009 20:33:35 -0500
>
>    
>> Well, not with respect to the overall protocol, of course not.  But with
>> respect to the buffer in question, it _has_ to be.  Or am I missing
>> something?
>>      
> sendfile() absolutely, and positively, is not.
>
> Any entity can write to the pages being send via sendfile(), at will,
> and those writes will show up in the packet stream if they occur
> before the NIC DMA's the memory backed by those pages into it's
> buffer.
>
> There is zero data synchronization whatsoever, we don't lock the
> pages, we don't block their usage while they are queued up in the
> socket send queue, nothing like that.
>
>    

But it must maintain a reference count on the page being dmaed and drop 
it only after dma is complete.  Otherwise we risk the page being 
recycled and arbitrary memory sent out on the wire; and an application 
can trivially cause this by truncate()ing a sendfile.

> The user returns long before it every hits the wire and there is zero
> "notification" to the user that the pages in question for the
> sendfile() request are no longer in use.
>    

The put_page() is a notification except it doesn't reach the caller.  
Gregory's patch (and previous shared info destructor patches) is an 
attempt to make it reach the caller, IIUC.
stephen hemminger Nov. 16, 2009, 7:59 p.m. UTC | #26
On Sat, 14 Nov 2009 00:27:46 -0500
Gregory Haskins <gregory.haskins@gmail.com> wrote:

> Stephen Hemminger wrote:
> > On Fri, 13 Nov 2009 21:27:57 -0500
> > Gregory Haskins <gregory.haskins@gmail.com> wrote:
> > 
> >> Herbert Xu wrote:
> >>> On Fri, Nov 13, 2009 at 08:33:35PM -0500, Gregory Haskins wrote:
> >>>> Well, not with respect to the overall protocol, of course not.  But with
> >>>> respect to the buffer in question, it _has_ to be.  Or am I missing
> >>>> something?
> >>> sendfile() has never guaranteed that the kernel is finished with
> >>> the underlying pages when it returns.
> >>>
> >>> Cheers,
> >> Clearly there must be _some_ mechanism to synchronize (e.g.
> >> flush/barrier) though, right?  Otherwise, that interface would seem to
> >> be quite prone to races and would likely be unusable.   So what does
> >> said flush use to know when the buffer is free?
> > 
> > No all the interfaces require a copy.
> 
> I'm sorry, but I do not think that is correct.  As others have pointed
> out, that would not appear to be true for at least sendfile.

Correct.

> 
> > Actually, sendfile makes no guarantee about synchronization
> > because the receiver of said file could be arbitrarily slow, and any attempt at locking down
> > current contents of file is a denial of service exposure.
> 
> I think you are inverting the problem space.  It is fully expected that
> changing the "file", or the pages that represent the file before the
> packet is queued would constitute the modification of the stream on the
> wire.
> 
> I am more thinking about the applications of mmap+sendfile to implement
> a zero-copy interface.  As David mentions in another mail, it seems that
> there is no sync mechanism available, so this would not appear to be a
> viable use case today, unfortunately.

yes, if you do mmap/sendfile then there is no synchronization, and the stack
can hold onto your data for an arbitrary time.  The file and mapping's can
be closed but that risks tying up all of memory.


> > 
> > People have tried doing copy-less send by page flipping, but the overhead of the IPI to
> > invalidate the TLB exceeded the overhead of the copy. There was an Intel paper on this in
> > at Linux Symposium (Ottawa) several years ago.
> 
> I think you are confusing copy-less tx with copy-less rx.  You can try
> to do copy-less rx with page flipping, which has the IPI/TLB thrashing
> properties you mention, and I agree is problematic.  We are talking
> about copy-less tx here, however, and therefore no page-flipping is
> involved.  Rather, we are just posting SG lists of pages directly to the
> NIC (assuming the nic supports HIGH_DMA, etc).  You do not need to flip
> the page, or invalidate the TLB (and thus IPI the other cores) to do
> this to my knowledge.
> 

If you want to do copy-less tx for all applications, you have to
do COW to handle the trivial case of :

while (cc = read(infd, buffer, sizeof buffer)) {
   send(outsock, buffer, cc);
}
Gregory Haskins Nov. 16, 2009, 8:18 p.m. UTC | #27
Stephen Hemminger wrote:
> On Sat, 14 Nov 2009 00:27:46 -0500
> Gregory Haskins <gregory.haskins@gmail.com> wrote:
> 
>> Stephen Hemminger wrote:

> 
>>> People have tried doing copy-less send by page flipping, but the overhead of the IPI to
>>> invalidate the TLB exceeded the overhead of the copy. There was an Intel paper on this in
>>> at Linux Symposium (Ottawa) several years ago.
>> I think you are confusing copy-less tx with copy-less rx.  You can try
>> to do copy-less rx with page flipping, which has the IPI/TLB thrashing
>> properties you mention, and I agree is problematic.  We are talking
>> about copy-less tx here, however, and therefore no page-flipping is
>> involved.  Rather, we are just posting SG lists of pages directly to the
>> NIC (assuming the nic supports HIGH_DMA, etc).  You do not need to flip
>> the page, or invalidate the TLB (and thus IPI the other cores) to do
>> this to my knowledge.
>>
> 
> If you want to do copy-less tx for all applications, you have to
> do COW to handle the trivial case of :
> 
> while (cc = read(infd, buffer, sizeof buffer)) {
>    send(outsock, buffer, cc);
> }
> 
> 

You certainly _could_ implement this as a COW I suppose, but that would
be insane.  If someone did do this, you are right: you need TLB
invalidation.

However, if I were going to actually propose the changeover of the
system calls to use zero-copy (note that I am not), it would be based on
the concept in this patch.  That is: the send() would block until the
NIC completes the DMA and the shinfo block is freed.  Alternate
implementations would be AIO based, where the shinfo destructor
signifies the generation of the completion event.

FWIW: The latter is conceptually similar to how this is being used in
AlacrityVM.

HTH

Kind Regards,
-Greg
Herbert Xu Nov. 17, 2009, 1:02 a.m. UTC | #28
On Mon, Nov 16, 2009 at 09:22:57AM -0500, Gregory Haskins wrote:
> 
> But really, this is somewhat orthogonal to the original problem, so let
> me see if we can bring it back on topic.  Michael stated that this patch
> in question may be problematic because there are places in the stack
> that can get_page() without also maintaining a reference to the shinfo
> object.  Evgeniy seems to say the opposite.  I am not sure who is right,
> or if I misunderstood one or both of them.  Any thoughts?

There are loads of places where this can happen.  Start with
pskb_expand_head.

Cheers,
Gregory Haskins Nov. 17, 2009, 12:33 p.m. UTC | #29
Herbert Xu wrote:
> On Mon, Nov 16, 2009 at 09:22:57AM -0500, Gregory Haskins wrote:
>> But really, this is somewhat orthogonal to the original problem, so let
>> me see if we can bring it back on topic.  Michael stated that this patch
>> in question may be problematic because there are places in the stack
>> that can get_page() without also maintaining a reference to the shinfo
>> object.  Evgeniy seems to say the opposite.  I am not sure who is right,
>> or if I misunderstood one or both of them.  Any thoughts?
> 
> There are loads of places where this can happen.  Start with
> pskb_expand_head.

Indeed, I see your point.  Looks like we can potentially solve that with
an extra level of indirection.  E.g. skb->shinfo->owner, with a
ref-count+callback.  Thoughts?

Kind Regards,
-Greg
diff mbox

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index df7b23a..02cdab6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -207,6 +207,8 @@  struct skb_shared_info {
 	/* Intermediate layers must ensure that destructor_arg
 	 * remains valid until skb destructor */
 	void *		destructor_arg;
+	void *          priv;
+	void           (*release)(struct sk_buff *skb);
 };
 
 /* We divide dataref into two halves.  The higher 16 bits hold references
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 80a9616..a7e40a9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -219,6 +219,8 @@  struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	shinfo->tx_flags.flags = 0;
 	skb_frag_list_init(skb);
 	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
+	shinfo->release = NULL;
+	shinfo->priv = NULL;
 
 	if (fclone) {
 		struct sk_buff *child = skb + 1;
@@ -350,6 +352,9 @@  static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
 
+		if (skb_shinfo(skb)->release)
+			skb_shinfo(skb)->release(skb);
+
 		kfree(skb->head);
 	}
 }
@@ -514,6 +519,8 @@  int skb_recycle_check(struct sk_buff *skb, int skb_size)
 	shinfo->tx_flags.flags = 0;
 	skb_frag_list_init(skb);
 	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
+	shinfo->release = NULL;
+	shinfo->priv = NULL;
 
 	memset(skb, 0, offsetof(struct sk_buff, tail));
 	skb->data = skb->head + NET_SKB_PAD;
@@ -856,6 +863,8 @@  int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	skb->hdr_len  = 0;
 	skb->nohdr    = 0;
 	atomic_set(&skb_shinfo(skb)->dataref, 1);
+	skb_shinfo(skb)->release = NULL;
+	skb_shinfo(skb)->priv = NULL;
 	return 0;
 
 nodata: