diff mbox

[RFC,23/23] : Support for zero-copy TCP transmit of user space data

Message ID 494012C4.7090304@vlnb.net
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Vladislav Bolkhovitin Dec. 10, 2008, 7:04 p.m. UTC
This patch implements support for zero-copy TCP transmit of user space 
data. It is necessary in iSCSI-SCST target driver for transmitting data 
from user space buffers, supplied by user space backend handlers. In 
this case SCST core needs to know when TCP finished transmitting the 
data, so the corresponding buffers can be reused or freed. Without this 
patch it isn't possible, so iSCSI-SCST has to use data copying to TCP 
send buffers function sock_sendpage(). ISCSI-SCST also works without 
this patch, but that this patch gives a nice performance improvement.

In the chosen approach new optional field void *net_priv was added to 
struct page. It is enclosed by

#if defined(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION),

so if one doesn't need this functionality, net_priv won't consume space 
in struct page.

Then, 2 new global callbacks net_get_page_callback and 
net_put_page_callback together with 2 new inline functions 
net_get_page() and net_put_page() were added. If 
CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION not defined 
net_get_page() and net_put_page() effectively become get_page() and 
put_page() correspondingly.

Those functions, if the corresponding net_get_page_callback or 
net_put_page_callback assigned, call it, then do get_page() or put_page().

Then in net/ subdirectory all get_page() calls were replaced by 
net_get_page() and put_page() - by net_put_page().

How it works. ISCSI-SCST assigns net_get_page_callback and 
net_put_page_callback to its internal functions. Each page before being 
sent to TCP's sendpage has net_priv field set to pointer to the 
corresponding iSCSI command. Then in each net_get_page_callback handler 
reference counter for that command increased and in each 
net_put_page_callback - decreased. When it reaches zero, then all the 
data for this command were transferred, so the command and its buffer 
can be freed.

You can find how it used in the iSCSI-SCST patch (number 21 in this series).

Global callbacks were chosen, because this is the simplest and most
performance effective approach, fully following section 2 subsection 4 
of SubmittingPatches file: "Don't over-design". If accepted, iSCSI-SCST 
will be the only user of this functionality. Requirements to call 
net_set_get_put_page_callbacks() (see comment in the patch) allows to 
not protect those callbacks anyhow. Then, if in the future there is 
another user of that functionality, it will be possible to convert those 
callbacks to RCU-protected list of callbacks. But for now there's no 
need to overcomplicate the code.

During development the following approaches were also examined and rejected:

1. Add net_priv analog in struct sk_buff, not in struct page. But then 
it would be required that all the pages in each skb must be from the 
same originator, i.e. with the same net_priv. It is unpractical to 
change all the operations with skb's to forbid merging them, if they 
have different net_priv. I tried, but quickly gave up. There are too 
many such places in very not obvious code pieces.

2. Have in iSCSI-SCST a hashed list to translate page to iSCSI cmd by a 
simple search function. This approach was rejected, because to copy a 
page a modern CPU needs using MMX about 1500 ticks. It was observed, 
that each page can be referenced by TCP during transmit about 20 times 
or even more. So, if each search needs, say, 20 ticks, the overall 
search time will be 20*20*2 (to get() and put()) = 800 ticks. So, this 
approach would considerably worse performance-wise to the chosen 
approach and provide not too much benefit.

Please, if you reject this approach, advice any other way to implement 
the required functionality.

Signed-off-by: Vladislav Bolkhovitin <vst@vlnb.net>
---
  include/linux/mm_types.h |   12 +++++++++++
  include/linux/net.h      |   40 ++++++++++++++++++++++++++++++++++++++
  net/Kconfig              |   12 +++++++++++
  net/core/skbuff.c        |   14 ++++++-------
  net/ipv4/Makefile        |    1
  net/ipv4/ip_output.c     |    4 +--
  net/ipv4/tcp.c           |    8 +++----
  net/ipv4/tcp_output.c    |    2 -
  net/ipv4/tcp_zero_copy.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++
  net/ipv6/ip6_output.c    |    2 -
  10 files changed, 129 insertions(+), 15 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Evgeniy Polyakov Dec. 10, 2008, 9:45 p.m. UTC | #1
Hi Vladislav.

On Wed, Dec 10, 2008 at 10:04:36PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote:
> In the chosen approach new optional field void *net_priv was added to 
> struct page. It is enclosed by

There is a huge no-no in networking land on increasing skb.
Reason is simple every skb will carry potentially unneded data as long
as given option is enabled, and most of the time it will.
To break this barrier one has to have (I wanted to write ego, but then
decided to replace it with mojo) so huge reason to do this, that it is
almost impossible to have.

Something tells me that increasing page structure with 8 bytes because
of zero-copy iscsi transfer is not that great idea, since basically every
user out there will have it enabled in the distro config and will waste
noticeble amount of ram.

The same problem of not sending any kind of notification to the user
when his pages are 'acked' by receiving some packet or freeing the data
exists long ago and was tried to be fixed several times.

The most applicable to your case maybe DST experience. DST is a block
layer device and all its pages starting from quite recent kernels are
not allowed to be slab ones (xfs was the last one who provided slab
pages in the bios), so each page has two 'unused' pointers in lru list
entry, which you may reuse. If scsi layer may have slab pages from some
place (although this does not sound like a good idea, ->sendpage() will
bug on on them anyway), this hack will not work, otherwise you only need
to have net_page_get/put stuff in and do not mess with increasing page.
And this was tested 3-4 kernel releases ago, so things may be changed.

Another appropach is to increase skb's shared data (at the end of the
skb->data), and this approach was not frowned upon too much either, but
it requires to mess with skb->destructor, which may not be appropriate
in some cases. If iscsi does not use sockets (it does iirc), things are
much simpler.

Hope this helps.
Vladislav Bolkhovitin Dec. 11, 2008, 6:16 p.m. UTC | #2
Hi Evgeniy,

Evgeniy Polyakov wrote:
> Hi Vladislav.
> 
> On Wed, Dec 10, 2008 at 10:04:36PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote:
>> In the chosen approach new optional field void *net_priv was added to 
>> struct page. It is enclosed by
> 
> There is a huge no-no in networking land on increasing skb.
> Reason is simple every skb will carry potentially unneded data as long
> as given option is enabled, and most of the time it will.
> To break this barrier one has to have (I wanted to write ego, but then
> decided to replace it with mojo) so huge reason to do this, that it is
> almost impossible to have.
> 
> Something tells me that increasing page structure with 8 bytes because
> of zero-copy iscsi transfer is not that great idea, since basically every
> user out there will have it enabled in the distro config and will waste
> noticeble amount of ram.

The waste will be only 0.2% of RAM or 2MB per 1GB. Not much. Perhaps, 
not noticeable for an average user of distro kernels at all. Embedded 
people, who count each byte, almost always don't need iSCSI, so won't 
have any problems to disable 
TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION option.

Also, as I wrote, iSCSI-SCST can work without this patch or with 
TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION option disabled, but with 
user space device handlers it will work considerably worse. Only few 
distro kernels users need an iSCSI target and only few among such users 
need to use a user space device handler. So, option 
TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION can be safely disabled in 
default configs of distro kernels. People who need both iSCSI target 
*and* fast working user space device handler would simply enable that 
option and rebuild the kernel. Rejecting this patch provides much worse 
alternative: those people would also have to *patch* the kernel at 
first, only then enable that option, then rebuild the kernel.

> The same problem of not sending any kind of notification to the user
> when his pages are 'acked' by receiving some packet or freeing the data
> exists long ago and was tried to be fixed several times.
> 
> The most applicable to your case maybe DST experience. DST is a block
> layer device and all its pages starting from quite recent kernels are
> not allowed to be slab ones (xfs was the last one who provided slab
> pages in the bios), so each page has two 'unused' pointers in lru list
> entry, which you may reuse. If scsi layer may have slab pages from some
> place (although this does not sound like a good idea, ->sendpage() will
> bug on on them anyway), this hack will not work, otherwise you only need
> to have net_page_get/put stuff in and do not mess with increasing page.
> And this was tested 3-4 kernel releases ago, so things may be changed.

I've just rechecked if any of struct page fields can be shared with 
net_priv field. Unfortunately, it looks like none among the mandatory 
fields:

  - Transmitting pages can be mapped in some user space, so sharing in 
unions with _mapcount, mapping and index fields isn't possible

  - Transmitting pages can be in the page cache, so sharing lru field 
isn't possible as well

It might, however, be shared with field "virtual", since SCST allocates 
only lowmem pages, but on non-HIGHMEM kernels and most HIGHMEM kernels 
WANT_PAGE_VIRTUAL isn't defined, so it will change basically nothing.

> Another appropach is to increase skb's shared data (at the end of the
> skb->data), and this approach was not frowned upon too much either, but
> it requires to mess with skb->destructor, which may not be appropriate
> in some cases. If iscsi does not use sockets (it does iirc), things are
> much simpler.

As I wrote, approach to use net_priv on the skb level was examined, but 
rejected as unpractical. Simply too many places should be modified to 
prevent merging skb's with different net_priv-like labels and those 
places are too inobvious. Implementation of this approach and, 
especially, maintenance it would be just a nightmare.

Thanks,
Vlad

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Bottomley Dec. 11, 2008, 7:12 p.m. UTC | #3
On Thu, 2008-12-11 at 21:16 +0300, Vladislav Bolkhovitin wrote:
> Hi Evgeniy,
> 
> Evgeniy Polyakov wrote:
> > Hi Vladislav.
> > 
> > On Wed, Dec 10, 2008 at 10:04:36PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote:
> >> In the chosen approach new optional field void *net_priv was added to 
> >> struct page. It is enclosed by
> > 
> > There is a huge no-no in networking land on increasing skb.
> > Reason is simple every skb will carry potentially unneded data as long
> > as given option is enabled, and most of the time it will.
> > To break this barrier one has to have (I wanted to write ego, but then
> > decided to replace it with mojo) so huge reason to do this, that it is
> > almost impossible to have.
> > 
> > Something tells me that increasing page structure with 8 bytes because
> > of zero-copy iscsi transfer is not that great idea, since basically every
> > user out there will have it enabled in the distro config and will waste
> > noticeble amount of ram.
> 
> The waste will be only 0.2% of RAM or 2MB per 1GB. Not much. Perhaps, 
> not noticeable for an average user of distro kernels at all. Embedded 
> people, who count each byte, almost always don't need iSCSI, so won't 
> have any problems to disable 
> TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION option.

Actually, there are several other considerations:

     1. struct page is a lowmem structure, so increasing its size
        becomes problematic on x86 PAE systems.
     2. The current 64 bit struct page seems to be exactly pushing a
        cacheline boundary.  Increasing it so it spills over will have a
        performance impact

It's the performance problems that will be most critical, I suspect, so
you'll need mm people buy in for doing this.

One thing that leaps immediately to mind is that you could isolate this
to the net layer by putting it in skb_frag_struct.  However, such a move
would require a proper API for this in net ... right now it looks like
you're using the struct page addition to carry this information from
SCSI to net, which is a bit of a layering violation.

James


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vladislav Bolkhovitin Dec. 12, 2008, 7:25 p.m. UTC | #4
James Bottomley wrote:
> On Thu, 2008-12-11 at 21:16 +0300, Vladislav Bolkhovitin wrote:
>> Hi Evgeniy,
>>
>> Evgeniy Polyakov wrote:
>>> Hi Vladislav.
>>>
>>> On Wed, Dec 10, 2008 at 10:04:36PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote:
>>>> In the chosen approach new optional field void *net_priv was added to 
>>>> struct page. It is enclosed by
>>> There is a huge no-no in networking land on increasing skb.
>>> Reason is simple every skb will carry potentially unneded data as long
>>> as given option is enabled, and most of the time it will.
>>> To break this barrier one has to have (I wanted to write ego, but then
>>> decided to replace it with mojo) so huge reason to do this, that it is
>>> almost impossible to have.
>>>
>>> Something tells me that increasing page structure with 8 bytes because
>>> of zero-copy iscsi transfer is not that great idea, since basically every
>>> user out there will have it enabled in the distro config and will waste
>>> noticeble amount of ram.
>> The waste will be only 0.2% of RAM or 2MB per 1GB. Not much. Perhaps, 
>> not noticeable for an average user of distro kernels at all. Embedded 
>> people, who count each byte, almost always don't need iSCSI, so won't 
>> have any problems to disable 
>> TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION option.
> 
> Actually, there are several other considerations:
> 
>      1. struct page is a lowmem structure, so increasing its size
>         becomes problematic on x86 PAE systems.
>      2. The current 64 bit struct page seems to be exactly pushing a
>         cacheline boundary.  Increasing it so it spills over will have a
>         performance impact

This is why I suggest to have 
CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION disabled in 
general kernels. ISCSI-SCST will still work with almost no performance 
loss for in-kernel backend and people would better recompile kernel, 
then patch it, then recompile.

> It's the performance problems that will be most critical, I suspect, so
> you'll need mm people buy in for doing this.

I'll ask in linux-mm, thanks for the suggestion.

> One thing that leaps immediately to mind is that you could isolate this
> to the net layer by putting it in skb_frag_struct.  However, such a move
> would require a proper API for this in net ...

To have net_priv analog in skb was the first idea I was tried. But I 
quickly gave up, because it would required that all the pages in each 
skb_frag_struct be from the same originator, i.e. with the same 
net_priv. It is unpractical to change all the operations with skb's to 
forbid merging them, if they have different net_priv. There are too many 
such places in very not obvious code pieces.

> right now it looks like
> you're using the struct page addition to carry this information from
> SCSI to net, which is a bit of a layering violation.

I don't think there is any layering violation here. Just lower layer 
notifies upper layer that transmission of a page has finished. It's done 
a bit not straightforward, but still basically the same as, for 
instance, on_free_cmd() callbacks which SCST core uses to notify target 
drivers and dev handlers that the corresponding command is about to be 
freed, so they can free associated with it data as well.

Thanks,
Vlad
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Bottomley Dec. 12, 2008, 7:37 p.m. UTC | #5
On Fri, 2008-12-12 at 22:25 +0300, Vladislav Bolkhovitin wrote:
> > One thing that leaps immediately to mind is that you could isolate this
> > to the net layer by putting it in skb_frag_struct.  However, such a move
> > would require a proper API for this in net ...
> 
> To have net_priv analog in skb was the first idea I was tried. But I 
> quickly gave up, because it would required that all the pages in each 
> skb_frag_struct be from the same originator, i.e. with the same 
> net_priv. It is unpractical to change all the operations with skb's to 
> forbid merging them, if they have different net_priv. There are too many 
> such places in very not obvious code pieces.

Actually, I said carry in skb_frag_struct not skb ... that allows for
merging of skbs with different page sources.  The API changes would have
to allow setting at this level.

> > right now it looks like
> > you're using the struct page addition to carry this information from
> > SCSI to net, which is a bit of a layering violation.
> 
> I don't think there is any layering violation here. Just lower layer 
> notifies upper layer that transmission of a page has finished. It's done 
> a bit not straightforward, but still basically the same as, for 
> instance, on_free_cmd() callbacks which SCST core uses to notify target 
> drivers and dev handlers that the corresponding command is about to be 
> freed, so they can free associated with it data as well.

The way you transmit the information you want notification is done by a
private tag in struct page, so you're carrying the information on an
object that belongs to neither layer ... that's the violation.  It's
essentially an extension of the net API that goes via the mm layer.

James


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vladislav Bolkhovitin Dec. 15, 2008, 5:58 p.m. UTC | #6
James Bottomley wrote:
> On Fri, 2008-12-12 at 22:25 +0300, Vladislav Bolkhovitin wrote:
>>> One thing that leaps immediately to mind is that you could isolate this
>>> to the net layer by putting it in skb_frag_struct.  However, such a move
>>> would require a proper API for this in net ...
>> To have net_priv analog in skb was the first idea I was tried. But I 
>> quickly gave up, because it would required that all the pages in each 
>> skb_frag_struct be from the same originator, i.e. with the same 
>> net_priv. It is unpractical to change all the operations with skb's to 
>> forbid merging them, if they have different net_priv. There are too many 
>> such places in very not obvious code pieces.
> 
> Actually, I said carry in skb_frag_struct not skb ... that allows for
> merging of skbs with different page sources.  The API changes would have
> to allow setting at this level.

Possibly, you are right, but the amount of changes would be very big, 
while the net_priv patch I did is just 309 lines long including all the 
descriptions and comments. And it's really simple and straightforward.

>>> right now it looks like
>>> you're using the struct page addition to carry this information from
>>> SCSI to net, which is a bit of a layering violation.
>> I don't think there is any layering violation here. Just lower layer 
>> notifies upper layer that transmission of a page has finished. It's done 
>> a bit not straightforward, but still basically the same as, for 
>> instance, on_free_cmd() callbacks which SCST core uses to notify target 
>> drivers and dev handlers that the corresponding command is about to be 
>> freed, so they can free associated with it data as well.
> 
> The way you transmit the information you want notification is done by a
> private tag in struct page, so you're carrying the information on an
> object that belongs to neither layer ... that's the violation.  It's
> essentially an extension of the net API that goes via the mm layer.

I understand your fears, but I think there can be another view on this. 
There are 2 layers: TCP and iSCSI. ISCSI asks TCP to send data and TCP 
performs that. ISCSI communicates with TCP using sendpage() function, it 
asks TCP to send pages. So, the unit of communications is page. When TCP 
finished transmitting a page, it notifies iSCSI that the page was 
transmitted. Now iSCSI needs to find out its own command, associated 
with that page. It does that using page->net_priv, which keeps pointer 
to the associated iSCSI command. I don't think there is any layer 
violation here, because iSCSI and TCP communicate using pages, not any 
other objects. If they communicated using, e.g., skb, no doubts, it 
would be a layers violation.

But, since iSCSI and TCP communicate using pages, the situation is the 
same as between SCST core and a target driver in the on_free_cmd() 
example I used before. When SCST core notifies the target driver using 
on_free_cmd() callback that a command finished and is about to be freed, 
the target driver also needs to find out its own associated with the 
command data and it does that using tgt_priv pointer in struct scst_cmd. 
Similarly, for instance, struct scsi_cmnd has host_scribble pointer for 
low level drivers.

Thanks,
Vlad
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Dec. 15, 2008, 11:18 p.m. UTC | #7
If you don't believe that increasing struct page size for a fringe
feature is a no-go submit the patch to the VM people and wait..
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bart Van Assche Dec. 16, 2008, 4 p.m. UTC | #8
On Wed, Dec 10, 2008 at 10:45 PM, Evgeniy Polyakov <zbr@ioremap.net> wrote:
> Another approach is to increase skb's shared data (at the end of the
> skb->data), and this approach was not frowned upon too much either, but
> it requires to mess with skb->destructor, which may not be appropriate
> in some cases. If iscsi does not use sockets (it does iirc), things are
> much simpler.

Hello Evgeniy,

Any idea whether it would be acceptable to extend "struct sock" with
one or two pointers to callback functions ? These callback functions
could be called from skb_release_data() etc. through the "struct sock*
sk" member of "struct sk_buff".

Bart.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov Dec. 16, 2008, 5:41 p.m. UTC | #9
Hi Bart.

On Tue, Dec 16, 2008 at 05:00:07PM +0100, Bart Van Assche (bart.vanassche@gmail.com) wrote:
> Any idea whether it would be acceptable to extend "struct sock" with
> one or two pointers to callback functions ? These callback functions
> could be called from skb_release_data() etc. through the "struct sock*
> sk" member of "struct sk_buff".

That's unlikely, since again this will be unused by the all but iscsi
users. But you can be tricky and replace skb destructor with your own
pointer and call 'old' destructor yourself. Its main goal is to adjust
socket memory statistics and since all iscsi sockets are under your full
control, you can play with it. Although this sounds like a hack, with
proper implementation it is not that bad idea. As additional pointer you
can use sk_user_data which is used by rpc socket calls only iirc.
Vladislav Bolkhovitin Dec. 16, 2008, 6:57 p.m. UTC | #10
Christoph Hellwig wrote:
> If you don't believe that increasing struct page size for a fringe
> feature is a no-go submit the patch to the VM people and wait..

I guessed, it can be no-go, this is why I wrote that this feature isn't 
required for iSCSI-SCST functionality. But, since the implementation is 
*so* simple and doesn't do any layering violation, I have a hope that 
once it disabled by default it will be harmless and, hence, could be 
accepted. Only few people need this feature. Otherwise there will be an 
alternative for them between enable that feature, then recompile, vs 
patch the kernel, enable that feature, then recompile.

When I was developing it my main goal was to do it as simple as 
possible. I believe, I succeeded in it. If it's rejected, it will simply 
live out of tree until me or somebody else finds time to reimplement it 
in an acceptable, although at least 10 times more complicated manner.

Thanks, I'll follow your advise.

Vlad
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vladislav Bolkhovitin Dec. 18, 2008, 6:35 p.m. UTC | #11
Hello linux-mm,

Recently I submitted a new SCSI target framework (SCST) and 4 target 
drivers for it for the first iteration of review and comments. See 
http://lkml.org/lkml/2008/12/10/245 for details.

An iSCSI target driver iSCSI-SCST was a part of the patchset 
(http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to 
have TCP zero-copy transmit of user space data was implemented. Patch, 
implementing this optimization was also sent in the patchset, see 
http://lkml.org/lkml/2008/12/10/296.

I would like to ask, if the approach used in this patch can be 
acceptable from your point of view? I understand, that extending struct 
page is a very much undesirable, but, from other side:

  - This approach is very simple and straightforward. The patch is only 
309 lines long, including comments. All other alternative 
implementations would be at least an order of magnitude more complicated.

  - Related kernel config option 
TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION should be disabled by 
default in general distro kernels, so the would be no harm at all from 
this patch. ISCSI-SCST can work without this patch or with 
TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION option disabled, although 
with user space device handlers it will work considerably worse. Only 
few distro kernels users need an iSCSI target and only few among such 
users need to use user space device handlers. People who need both iSCSI 
target *and* fast working user space device handlers would simply enable 
that option and rebuild the kernel. Rejecting this patch provides much 
worse alternative: those people would also have to *patch* the kernel at 
first, only then enable that option, then rebuild the kernel.

  - Although usage of struct page to keep network related pointer might 
look as a layering violation, it isn't. I wrote in 
http://lkml.org/lkml/2008/12/15/190 why.

Thanks,
Vlad
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David M. Lloyd Dec. 18, 2008, 6:43 p.m. UTC | #12
On 12/18/2008 12:35 PM, Vladislav Bolkhovitin wrote:
> An iSCSI target driver iSCSI-SCST was a part of the patchset 
> (http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to 
> have TCP zero-copy transmit of user space data was implemented. Patch, 
> implementing this optimization was also sent in the patchset, see 
> http://lkml.org/lkml/2008/12/10/296.

I'm probably ignorant of about 90% of the context here, but isn't this the 
sort of problem that was supposed to have been solved by vmsplice(2)?

- DML

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andi Kleen Dec. 19, 2008, 11:27 a.m. UTC | #13
Vladislav Bolkhovitin <vst@vlnb.net> writes:
>
>  - Although usage of struct page to keep network related pointer might
> look as a layering violation, it isn't. I wrote in
> http://lkml.org/lkml/2008/12/15/190 why.

Sorry but extending struct page for this is really a bad idea because
of the extreme memory overhead even when it's not used (which is a 
problem on distribution kernels) Find some other way to store this
information.  Even for patches with more general value it was not
acceptable.

-Andi
Vladislav Bolkhovitin Dec. 19, 2008, 5:37 p.m. UTC | #14
David M. Lloyd, on 12/18/2008 09:43 PM wrote:
> On 12/18/2008 12:35 PM, Vladislav Bolkhovitin wrote:
>> An iSCSI target driver iSCSI-SCST was a part of the patchset 
>> (http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to 
>> have TCP zero-copy transmit of user space data was implemented. Patch, 
>> implementing this optimization was also sent in the patchset, see 
>> http://lkml.org/lkml/2008/12/10/296.
> 
> I'm probably ignorant of about 90% of the context here, but isn't this the 
> sort of problem that was supposed to have been solved by vmsplice(2)?

No, vmsplice can't help here. ISCSI-SCST is a kernel space driver. But, 
even if it was a user space driver, vmsplice wouldn't change anything 
much. It doesn't have a possibility for a user to know, when 
transmission of the data finished. So, it is intended to be used as: 
vmsplice() buffer -> munmap() the buffer -> mmap() new buffer -> 
vmsplice() it. But on the mmap() stage kernel has to zero all the newly 
mapped pages and zeroing memory isn't much faster, than copying it. 
Hence, there would be no considerable performance increase.

Thanks,
Vlad
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vladislav Bolkhovitin Dec. 19, 2008, 5:38 p.m. UTC | #15
Andi Kleen, on 12/19/2008 02:27 PM wrote:
> Vladislav Bolkhovitin <vst@vlnb.net> writes:
>>  - Although usage of struct page to keep network related pointer might
>> look as a layering violation, it isn't. I wrote in
>> http://lkml.org/lkml/2008/12/15/190 why.
> 
> Sorry but extending struct page for this is really a bad idea because
> of the extreme memory overhead even when it's not used (which is a 
> problem on distribution kernels) Find some other way to store this
> information.  Even for patches with more general value it was not
> acceptable.

Sure, this is why I propose to disable that option by default in 
distribution kernels, so it would produce no harm. ISCSI-SCST can work 
in this configuration quite well too. People who need both iSCSI target 
*and* fast working user space device handlers would simply enable that 
option and rebuild the kernel. Rejecting this patch provides much worse 
alternative: those people would also have to *patch* the kernel at 
first, only then enable that option, then rebuild the kernel. (I'm 
repeating it to make sure you didn't miss this my point; it was in the 
part of my original message, which you cut out.)

Thanks,
Vlad
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vladislav Bolkhovitin Dec. 19, 2008, 5:57 p.m. UTC | #16
Andi Kleen, on 12/19/2008 09:00 PM wrote:
>> Sure, this is why I propose to disable that option by default in 
>> distribution kernels, so it would produce no harm.
> 
> That would make the option useless for most users. You might as well
> not bother merging then.

I believe 99.(9)% of users prefer don't patch kernel, if possible.

>> first, only then enable that option, then rebuild the kernel. (I'm 
>> repeating it to make sure you didn't miss this my point; it was in the 
>> part of my original message, which you cut out.)
> 
> That was such a ridiculous suggestion, I didn't take it seriously.
> 
> Also it should be really not rocket science to use a separate 
> table for this.

Sorry, what do you mean? If usage of something like a hash table to map 
pages to the corresponding iSCSI commands, this approach was evaluated 
and rejected, because it wouldn't provide much performance increase, 
which would worth the effort. See details in the end of the patch 
description in http://lkml.org/lkml/2008/12/10/296

Thanks,
Vlad
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andi Kleen Dec. 19, 2008, 6 p.m. UTC | #17
> Sure, this is why I propose to disable that option by default in 
> distribution kernels, so it would produce no harm.

That would make the option useless for most users. You might as well
not bother merging then.

> first, only then enable that option, then rebuild the kernel. (I'm 
> repeating it to make sure you didn't miss this my point; it was in the 
> part of my original message, which you cut out.)

That was such a ridiculous suggestion, I didn't take it seriously.

Also it should be really not rocket science to use a separate 
table for this.

-Andi
Jens Axboe Dec. 19, 2008, 7:07 p.m. UTC | #18
On Fri, Dec 19 2008, Vladislav Bolkhovitin wrote:
> David M. Lloyd, on 12/18/2008 09:43 PM wrote:
> >On 12/18/2008 12:35 PM, Vladislav Bolkhovitin wrote:
> >>An iSCSI target driver iSCSI-SCST was a part of the patchset 
> >>(http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to 
> >>have TCP zero-copy transmit of user space data was implemented. Patch, 
> >>implementing this optimization was also sent in the patchset, see 
> >>http://lkml.org/lkml/2008/12/10/296.
> >
> >I'm probably ignorant of about 90% of the context here, but isn't this the 
> >sort of problem that was supposed to have been solved by vmsplice(2)?
> 
> No, vmsplice can't help here. ISCSI-SCST is a kernel space driver. But, 
> even if it was a user space driver, vmsplice wouldn't change anything 
> much. It doesn't have a possibility for a user to know, when 
> transmission of the data finished. So, it is intended to be used as: 
> vmsplice() buffer -> munmap() the buffer -> mmap() new buffer -> 
> vmsplice() it. But on the mmap() stage kernel has to zero all the newly 
> mapped pages and zeroing memory isn't much faster, than copying it. 
> Hence, there would be no considerable performance increase.

vmsplice() isn't the right choice, but splice() very well could be. You
could easily use splice internally as well. The vmsplice() part sort-of
applies in the sense that you want to fill pages into a pipe, which is
essentially what vmsplice() does. You'd need some helper to do that. And
the ack-on-xmit-done bits is something that splice-to-socket needs
anyway, so I think it'd be quite a suitable choice for this.
Vladislav Bolkhovitin Dec. 19, 2008, 7:17 p.m. UTC | #19
Jens Axboe, on 12/19/2008 10:07 PM wrote:
> On Fri, Dec 19 2008, Vladislav Bolkhovitin wrote:
>> David M. Lloyd, on 12/18/2008 09:43 PM wrote:
>>> On 12/18/2008 12:35 PM, Vladislav Bolkhovitin wrote:
>>>> An iSCSI target driver iSCSI-SCST was a part of the patchset 
>>>> (http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to 
>>>> have TCP zero-copy transmit of user space data was implemented. Patch, 
>>>> implementing this optimization was also sent in the patchset, see 
>>>> http://lkml.org/lkml/2008/12/10/296.
>>> I'm probably ignorant of about 90% of the context here, but isn't this the 
>>> sort of problem that was supposed to have been solved by vmsplice(2)?
>> No, vmsplice can't help here. ISCSI-SCST is a kernel space driver. But, 
>> even if it was a user space driver, vmsplice wouldn't change anything 
>> much. It doesn't have a possibility for a user to know, when 
>> transmission of the data finished. So, it is intended to be used as: 
>> vmsplice() buffer -> munmap() the buffer -> mmap() new buffer -> 
>> vmsplice() it. But on the mmap() stage kernel has to zero all the newly 
>> mapped pages and zeroing memory isn't much faster, than copying it. 
>> Hence, there would be no considerable performance increase.
> 
> vmsplice() isn't the right choice, but splice() very well could be. You
> could easily use splice internally as well. The vmsplice() part sort-of
> applies in the sense that you want to fill pages into a pipe, which is
> essentially what vmsplice() does. You'd need some helper to do that.

Sorry, Jens, but splice() works only if there is a file handle on the 
another side, so user space doesn't see data buffers. But SCST needs to 
serve a wider usage cases, like reading data with decompression from a 
virtual tape, where decompression is done in user space. For those only 
complete zero-copy network send, which I implemented, can give the best 
performance.

> And
> the ack-on-xmit-done bits is something that splice-to-socket needs
> anyway, so I think it'd be quite a suitable choice for this.

So, are you writing that splice() could also benefit from the zero-copy 
transmit feature, like I implemented?

Thanks,
Vlad


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jens Axboe Dec. 19, 2008, 7:27 p.m. UTC | #20
On Fri, Dec 19 2008, Vladislav Bolkhovitin wrote:
> Jens Axboe, on 12/19/2008 10:07 PM wrote:
> >On Fri, Dec 19 2008, Vladislav Bolkhovitin wrote:
> >>David M. Lloyd, on 12/18/2008 09:43 PM wrote:
> >>>On 12/18/2008 12:35 PM, Vladislav Bolkhovitin wrote:
> >>>>An iSCSI target driver iSCSI-SCST was a part of the patchset 
> >>>>(http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to 
> >>>>have TCP zero-copy transmit of user space data was implemented. Patch, 
> >>>>implementing this optimization was also sent in the patchset, see 
> >>>>http://lkml.org/lkml/2008/12/10/296.
> >>>I'm probably ignorant of about 90% of the context here, but isn't this 
> >>>the sort of problem that was supposed to have been solved by vmsplice(2)?
> >>No, vmsplice can't help here. ISCSI-SCST is a kernel space driver. But, 
> >>even if it was a user space driver, vmsplice wouldn't change anything 
> >>much. It doesn't have a possibility for a user to know, when 
> >>transmission of the data finished. So, it is intended to be used as: 
> >>vmsplice() buffer -> munmap() the buffer -> mmap() new buffer -> 
> >>vmsplice() it. But on the mmap() stage kernel has to zero all the newly 
> >>mapped pages and zeroing memory isn't much faster, than copying it. 
> >>Hence, there would be no considerable performance increase.
> >
> >vmsplice() isn't the right choice, but splice() very well could be. You
> >could easily use splice internally as well. The vmsplice() part sort-of
> >applies in the sense that you want to fill pages into a pipe, which is
> >essentially what vmsplice() does. You'd need some helper to do that.
> 
> Sorry, Jens, but splice() works only if there is a file handle on the 
> another side, so user space doesn't see data buffers. But SCST needs to 
> serve a wider usage cases, like reading data with decompression from a 
> virtual tape, where decompression is done in user space. For those only 
> complete zero-copy network send, which I implemented, can give the best 
> performance.

__splice_from_pipe() takes a pipe, a descriptor and an actor. There's
absolutely ZERO reason you could not reuse most of that for this
implementation. The big bonus here is that getting the put correct from
networking would even make splice() better for everyone. Win for Linux,
win for you since it'll make it MUCH easier for you to get this stuff
in. Looking at your original patch and I almost think it's a flame bait
to induce discussion (nothing wrong with that, that approach works quite
well and has been used before). There's no way in HELL that it'd ever be
a merge candidate. And I suspect you know that, at least I hope you do
or you are farther away from going forward with this than you think.

So don't look at splice() the system call, look at the infrastructure
and check if that could be useful for your case. To me it looks
absolutely like it could, if you goal is just zero-copy transmit. The
only missing piece is dropping the reference and signalling page
consumption at the right point, which is when the data is safe to be
reused. That very bit is missing, but that should be all as far as I can
tell.

> >And
> >the ack-on-xmit-done bits is something that splice-to-socket needs
> >anyway, so I think it'd be quite a suitable choice for this.
> 
> So, are you writing that splice() could also benefit from the zero-copy 
> transmit feature, like I implemented?

I like how you want to reinvent everything, perhaps you should spend a
little more time looking into various other approaches? splice() already
does zero-copy network transmit, there are no copies going on. Ideally,
you'd have zero copies moving data into your pipe, but migrade/move
isn't quite there yet. But that doesn't apply to your case at all.

What is missing, as I wrote, is the 'release on ack' and not on pipe
buffer release. This is similar to the get_page/put_page stuff you did
in your patch, but don't go claiming that zero-copy transmit is a
Vladislav original - the ->sendpage() does no copies.
Jeremy Fitzhardinge Dec. 19, 2008, 8:21 p.m. UTC | #21
Vladislav Bolkhovitin wrote:
> This patch implements support for zero-copy TCP transmit of user space 
> data. It is necessary in iSCSI-SCST target driver for transmitting data 
> from user space buffers, supplied by user space backend handlers. In 
> this case SCST core needs to know when TCP finished transmitting the 
> data, so the corresponding buffers can be reused or freed. Without this 
> patch it isn't possible, so iSCSI-SCST has to use data copying to TCP 
> send buffers function sock_sendpage(). ISCSI-SCST also works without 
> this patch, but that this patch gives a nice performance improvement.
>   

In Xen networking it looks like we're going to need to solve a very 
similar problem.

When a guest (non-privileged, with no direct hardware access) wants to 
send a network packet, it passes it over to the privileged (host) 
domain, who then puts it into the network stack for transmission.

The packet gets passed over in a page granted (read "borrowed") from the 
guest domain.   We can't return it to the guest while its tangled up in 
the host's network stack, so we need notification of when the stack has 
finished with the page.

The out of tree Xen patches do this by marking a page as having been 
allocated by a foreign allocator, and overloads the private memory of 
struct page with a destructor function pointer, which put_page calls as 
appropriate.  We can do this because the page is definitely "owned" by 
the Xen subsystem, so most of the fields are available for recycling; 
the main problem is that we need to grab another page flag.  Your case 
sounds more complex because the source page can be mapped by userspace 
and/or be in the pagecache, so everything is already claimed.

As with your case, we can simply copy the page data if this mechanism 
isn't available.  But it would be nice if it were.

> 1. Add net_priv analog in struct sk_buff, not in struct page. But then 
> it would be required that all the pages in each skb must be from the 
> same originator, i.e. with the same net_priv. It is unpractical to 
> change all the operations with skb's to forbid merging them, if they 
> have different net_priv. I tried, but quickly gave up. There are too 
> many such places in very not obvious code pieces.
>   

I think Rusty has a patch to put some kind of put notifier in struct 
skb_shared_info, but I'm not sure of the details.

> 2. Have in iSCSI-SCST a hashed list to translate page to iSCSI cmd by a 
> simple search function. This approach was rejected, because to copy a 
> page a modern CPU needs using MMX about 1500 ticks.

Is that the cold cache timing?

>  It was observed, 
> that each page can be referenced by TCP during transmit about 20 times 
> or even more. So, if each search needs, say, 20 ticks, the overall 
> search time will be 20*20*2 (to get() and put()) = 800 ticks. So, this 
> approach would considerably worse performance-wise to the chosen 
> approach and provide not too much benefit.
>   

Wouldn't you only need to do the lookup on the last put?

An external lookup table might well for for us, if the net_put_page() 
change is acceptable to the network folk.

    J
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov Dec. 19, 2008, 9:58 p.m. UTC | #22
On Fri, Dec 19, 2008 at 08:27:36PM +0100, Jens Axboe (jens.axboe@oracle.com) wrote:
> What is missing, as I wrote, is the 'release on ack' and not on pipe
> buffer release. This is similar to the get_page/put_page stuff you did
> in your patch, but don't go claiming that zero-copy transmit is a
> Vladislav original - the ->sendpage() does no copies.

Just my small rant: it does, when underlying device does not support
hardware tx checksumming and scatter/gather, which is likely exception
than a rule for the modern NICs.

As of having notifications of the received ack (or from user's point of
view notification of the freeing of the buffer), I have following idea
in mind: extend skb ahsred info by copy of the frag array and additional
destructor field, which will be invoked when not only skb but also all
its clones are freed (that's when shared info is freed), so that user
could save some per-page context in fraglist and work with it when data
is not used anymore.

Extending page or skb structure is a no-go for sure, and actually even
shared info is not rubber, but there we can at least add something...

If only destructor field is allowed (similar patch was not rejected),
scsi can save its pages in the tree (indexed by the page pointer) and
traverse it when destructor is invoked selecting pages found in the
freed skb.
Evgeniy Polyakov Dec. 19, 2008, 10:04 p.m. UTC | #23
On Fri, Dec 19, 2008 at 12:21:41PM -0800, Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> I think Rusty has a patch to put some kind of put notifier in struct 
> skb_shared_info, but I'm not sure of the details.

Yes, he added destructor callback into shared info.

There maybe a problem though, if iscsi will run over xen network, but in
this case xen may copy the data, or iscsi may do that after determining
that underlying network device does not allow shared info destructor
(vis device/route flag for example).

> Wouldn't you only need to do the lookup on the last put?
> 
> An external lookup table might well for for us, if the net_put_page() 
> change is acceptable to the network folk.

That sounds like the best solution for this problem.
David, will you accept shared info destructor?
And second fraglist? (Put to the new line to easily say no here :)
Jeremy Fitzhardinge Dec. 19, 2008, 10:21 p.m. UTC | #24
Evgeniy Polyakov wrote:
> On Fri, Dec 19, 2008 at 12:21:41PM -0800, Jeremy Fitzhardinge (jeremy@goop.org) wrote:
>   
>> I think Rusty has a patch to put some kind of put notifier in struct 
>> skb_shared_info, but I'm not sure of the details.
>>     
>
> Yes, he added destructor callback into shared info.
>
> There maybe a problem though, if iscsi will run over xen network, but in
> this case xen may copy the data, or iscsi may do that after determining
> that underlying network device does not allow shared info destructor
> (vis device/route flag for example).
>   

Xen only needs the callback in the case of traffic originating in 
another Xen domain.  If iscsi is involved, it will be running in the 
other domain, and so all its callbacks and so on will happen there.  
There's no conflict.

>> Wouldn't you only need to do the lookup on the last put?
>>
>> An external lookup table might well for for us, if the net_put_page() 
>> change is acceptable to the network folk.
>>     
>
> That sounds like the best solution for this problem.
>   

An external lookup would just need to change put_page -> net_put_page, I 
think.

> David, will you accept shared info destructor?
>   

I'm not very familiar with the network stack, but am I right in assuming 
that the shared_info destructor would be called when the network stack 
has finished with all the pages it refers to?  And those pages could be 
the combination of pages separately submitted in different skbs?

    J
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov Dec. 19, 2008, 10:33 p.m. UTC | #25
On Fri, Dec 19, 2008 at 02:21:18PM -0800, Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> >There maybe a problem though, if iscsi will run over xen network, but in
> >this case xen may copy the data, or iscsi may do that after determining
> >that underlying network device does not allow shared info destructor
> >(vis device/route flag for example).
> >  
> 
> Xen only needs the callback in the case of traffic originating in 
> another Xen domain.  If iscsi is involved, it will be running in the 
> other domain, and so all its callbacks and so on will happen there.  
> There's no conflict.

That's good news.

> >>Wouldn't you only need to do the lookup on the last put?
> >>
> >>An external lookup table might well for for us, if the net_put_page() 
> >>change is acceptable to the network folk.
> >>    
> >
> >That sounds like the best solution for this problem.
> >  
> 
> An external lookup would just need to change put_page -> net_put_page, I 
> think.

It is not needed I think, having skb shared info destructor has access
to the all pages in that skb.

> >David, will you accept shared info destructor?
> >  
> 
> I'm not very familiar with the network stack, but am I right in assuming 
> that the shared_info destructor would be called when the network stack 
> has finished with all the pages it refers to?  And those pages could be 
> the combination of pages separately submitted in different skbs?

Shared info is freed when there are no skbs referring to the shared info
in question. Skb holds all pages in shared info in the fraglist array,
so when it is about to be freed, it means that network stack does not
use it (particulary it will putpage every page in fraglist). Usually
there are two skbs in the network stack per packet in TCP (allocated at
once though via fastclone mechanims): one is provided to the device
(and will be freed there) and another one is placed into retransmit
queue, where it will be located and freed when ack has been received.

There may be another layers which may clone skb, but its shared info
structure (shared between the clones) will only be freed when all users
freed appropriate cloned skbs.
Jeremy Fitzhardinge Dec. 20, 2008, 1:56 a.m. UTC | #26
Evgeniy Polyakov wrote:
> Shared info is freed when there are no skbs referring to the shared info
> in question. Skb holds all pages in shared info in the fraglist array,
> so when it is about to be freed, it means that network stack does not
> use it (particulary it will putpage every page in fraglist). Usually
> there are two skbs in the network stack per packet in TCP (allocated at
> once though via fastclone mechanims): one is provided to the device
> (and will be freed there) and another one is placed into retransmit
> queue, where it will be located and freed when ack has been received.
>
> There may be another layers which may clone skb, but its shared info
> structure (shared between the clones) will only be freed when all users
> freed appropriate cloned skbs.
>   

I see.  One more question: how would I go about submitting some data 
with the callback attached to it?

    J
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Dec. 20, 2008, 2:02 a.m. UTC | #27
On Fri, Dec 19, 2008 at 05:56:11PM -0800, Jeremy Fitzhardinge wrote:
> Evgeniy Polyakov wrote:
>> Shared info is freed when there are no skbs referring to the shared info
>> in question. Skb holds all pages in shared info in the fraglist array,
>> so when it is about to be freed, it means that network stack does not
>> use it (particulary it will putpage every page in fraglist). Usually
>> there are two skbs in the network stack per packet in TCP (allocated at
>> once though via fastclone mechanims): one is provided to the device
>> (and will be freed there) and another one is placed into retransmit
>> queue, where it will be located and freed when ack has been received.
>>
>> There may be another layers which may clone skb, but its shared info
>> structure (shared between the clones) will only be freed when all users
>> freed appropriate cloned skbs.

This is all correct.  However, please note that that if any clone
does a pskb_expand_head then it will get its own private copy of
of the shared info.  So you can't use the shared info to ref count
the pages in it.

Cheers,
Jeremy Fitzhardinge Dec. 20, 2008, 6:14 a.m. UTC | #28
Herbert Xu wrote:
> On Fri, Dec 19, 2008 at 05:56:11PM -0800, Jeremy Fitzhardinge wrote:
>   
>> Evgeniy Polyakov wrote:
>>     
>>> Shared info is freed when there are no skbs referring to the shared info
>>> in question. Skb holds all pages in shared info in the fraglist array,
>>> so when it is about to be freed, it means that network stack does not
>>> use it (particulary it will putpage every page in fraglist). Usually
>>> there are two skbs in the network stack per packet in TCP (allocated at
>>> once though via fastclone mechanims): one is provided to the device
>>> (and will be freed there) and another one is placed into retransmit
>>> queue, where it will be located and freed when ack has been received.
>>>
>>> There may be another layers which may clone skb, but its shared info
>>> structure (shared between the clones) will only be freed when all users
>>> freed appropriate cloned skbs.
>>>       
>
> This is all correct.  However, please note that that if any clone
> does a pskb_expand_head then it will get its own private copy of
> of the shared info.  So you can't use the shared info to ref count
> the pages in it.

Ah, so the lifetime of the shared_info structure doesn't match the 
lifetime of the underlying pages, and this mechanism would be 
insufficient for my purposes?  If so, how can it be solved?

Thanks,
    J
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Dec. 20, 2008, 6:51 a.m. UTC | #29
On Fri, Dec 19, 2008 at 10:14:47PM -0800, Jeremy Fitzhardinge wrote:
>
> Ah, so the lifetime of the shared_info structure doesn't match the  
> lifetime of the underlying pages, and this mechanism would be  
> insufficient for my purposes?

Right.

> If so, how can it be solved?

Well since each individual page can be pulled out I think you
just have to track them.

Cheers,
Jeremy Fitzhardinge Dec. 20, 2008, 7:43 a.m. UTC | #30
Herbert Xu wrote:
>> If so, how can it be solved?
>>     
>
> Well since each individual page can be pulled out I think you
> just have to track them.
>   

Hm.  So if I get a destructor call from the shared_info, can I go an 
inspect the page refcounts to see if its really the last use?

    J
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Dec. 20, 2008, 8:10 a.m. UTC | #31
On Fri, Dec 19, 2008 at 11:43:34PM -0800, Jeremy Fitzhardinge wrote:
>
> Hm.  So if I get a destructor call from the shared_info, can I go an  
> inspect the page refcounts to see if its really the last use?

The pages that were originally in the shared_info at creation
time may no longer be there by the time it's freed because of
pskb_pull_tail.

Cheers,
Evgeniy Polyakov Dec. 20, 2008, 10:32 a.m. UTC | #32
On Sat, Dec 20, 2008 at 07:10:45PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > Hm.  So if I get a destructor call from the shared_info, can I go an  
> > inspect the page refcounts to see if its really the last use?
> 
> The pages that were originally in the shared_info at creation
> time may no longer be there by the time it's freed because of
> pskb_pull_tail.

Things should work fine, since pskb_expand_head() copies whole shared
info structure (and thus will copy destructor), get all pages and then
copy all pointers into the new skb, and then release old skb's data.

So destructor for the pages should not rely on which skb it is called on
and check if pages are about to be really freed (i.e. check theirs
reference counter).

__pskb_pull_tail() is tricky, it just puts some pages it does not want
to be present in the skb, but it could be possible to add there
destructor callback from the original skb with partial flag (or just
having destructor with two parameters: skb and page, and if page is not
NULL, then actually only given page is freed, otherwise the whole skb).
Jeremy Fitzhardinge Dec. 20, 2008, 7:39 p.m. UTC | #33
Evgeniy Polyakov wrote:
> Things should work fine, since pskb_expand_head() copies whole shared
> info structure (and thus will copy destructor), get all pages and then
> copy all pointers into the new skb, and then release old skb's data.
>
> So destructor for the pages should not rely on which skb it is called on
> and check if pages are about to be really freed (i.e. check theirs
> reference counter).
>   

OK.

> __pskb_pull_tail() is tricky, it just puts some pages it does not want
> to be present in the skb, but it could be possible to add there
> destructor callback from the original skb with partial flag (or just
> having destructor with two parameters: skb and page, and if page is not
> NULL, then actually only given page is freed, otherwise the whole skb).
>   

Yes, that doesn't sound too bad.

    J
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vladislav Bolkhovitin Dec. 23, 2008, 7:11 p.m. UTC | #34
Jens Axboe, on 12/19/2008 10:27 PM wrote:
>>>>>> An iSCSI target driver iSCSI-SCST was a part of the patchset 
>>>>>> (http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to 
>>>>>> have TCP zero-copy transmit of user space data was implemented. Patch, 
>>>>>> implementing this optimization was also sent in the patchset, see 
>>>>>> http://lkml.org/lkml/2008/12/10/296.
>>>>> I'm probably ignorant of about 90% of the context here, but isn't this 
>>>>> the sort of problem that was supposed to have been solved by vmsplice(2)?
>>>> No, vmsplice can't help here. ISCSI-SCST is a kernel space driver. But, 
>>>> even if it was a user space driver, vmsplice wouldn't change anything 
>>>> much. It doesn't have a possibility for a user to know, when 
>>>> transmission of the data finished. So, it is intended to be used as: 
>>>> vmsplice() buffer -> munmap() the buffer -> mmap() new buffer -> 
>>>> vmsplice() it. But on the mmap() stage kernel has to zero all the newly 
>>>> mapped pages and zeroing memory isn't much faster, than copying it. 
>>>> Hence, there would be no considerable performance increase.
>>> vmsplice() isn't the right choice, but splice() very well could be. You
>>> could easily use splice internally as well. The vmsplice() part sort-of
>>> applies in the sense that you want to fill pages into a pipe, which is
>>> essentially what vmsplice() does. You'd need some helper to do that.
>> Sorry, Jens, but splice() works only if there is a file handle on the 
>> another side, so user space doesn't see data buffers. But SCST needs to 
>> serve a wider usage cases, like reading data with decompression from a 
>> virtual tape, where decompression is done in user space. For those only 
>> complete zero-copy network send, which I implemented, can give the best 
>> performance.
> 
> __splice_from_pipe() takes a pipe, a descriptor and an actor. There's
> absolutely ZERO reason you could not reuse most of that for this
> implementation. The big bonus here is that getting the put correct from
> networking would even make splice() better for everyone. Win for Linux,
> win for you since it'll make it MUCH easier for you to get this stuff
> in. Looking at your original patch and I almost think it's a flame bait
> to induce discussion (nothing wrong with that, that approach works quite
> well and has been used before). There's no way in HELL that it'd ever be
> a merge candidate. And I suspect you know that, at least I hope you do
> or you are farther away from going forward with this than you think.
> 
> So don't look at splice() the system call, look at the infrastructure
> and check if that could be useful for your case. To me it looks
> absolutely like it could, if you goal is just zero-copy transmit.

I looked at the splice code again to make sure I don't miss anything. 
__splice_from_pipe() leads to pipe_to_sendpage(), which leads to 
sock_sendpage, then to sock->sendpage(). Sorry, but I don't see any 
point why to go over all the complicated splice infrastructure instead 
of directly call sock->sendpage(), as I do.

> The
> only missing piece is dropping the reference and signalling page
> consumption at the right point, which is when the data is safe to be
> reused. That very bit is missing, but that should be all as far as I can
> tell.

This is exactly what I implemented in the patch we are discussing.

>>> And
>>> the ack-on-xmit-done bits is something that splice-to-socket needs
>>> anyway, so I think it'd be quite a suitable choice for this.
>> So, are you writing that splice() could also benefit from the zero-copy 
>> transmit feature, like I implemented?
> 
> I like how you want to reinvent everything, perhaps you should spend a
> little more time looking into various other approaches? splice() already
> does zero-copy network transmit, there are no copies going on. Ideally,
> you'd have zero copies moving data into your pipe, but migrade/move
> isn't quite there yet. But that doesn't apply to your case at all.
> 
> What is missing, as I wrote, is the 'release on ack' and not on pipe
> buffer release. This is similar to the get_page/put_page stuff you did
> in your patch, but don't go claiming that zero-copy transmit is a
> Vladislav original - the ->sendpage() does no copies.

Jens, I have never claimed I reinvented ->sendpage(). Quite opposite, I 
use it. I only extended it by a missing feature. Although, seems, since 
you were misleaded, I should apologize for not too good description of 
the patch.

Thanks,
Vlad



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vladislav Bolkhovitin Dec. 23, 2008, 7:13 p.m. UTC | #35
Jeremy Fitzhardinge, on 12/19/2008 11:21 PM wrote:

[...]

> As with your case, we can simply copy the page data if this mechanism 
> isn't available.  But it would be nice if it were.
> 
>> 1. Add net_priv analog in struct sk_buff, not in struct page. But then 
>> it would be required that all the pages in each skb must be from the 
>> same originator, i.e. with the same net_priv. It is unpractical to 
>> change all the operations with skb's to forbid merging them, if they 
>> have different net_priv. I tried, but quickly gave up. There are too 
>> many such places in very not obvious code pieces.
>>   
> 
> I think Rusty has a patch to put some kind of put notifier in struct 
> skb_shared_info, but I'm not sure of the details.
> 
>> 2. Have in iSCSI-SCST a hashed list to translate page to iSCSI cmd by a 
>> simple search function. This approach was rejected, because to copy a 
>> page a modern CPU needs using MMX about 1500 ticks.
> 
> Is that the cold cache timing?

Should be L2 cache hot, which is almost always the case if FILEIO is 
used, because data are just copied from the page cache. Although, 
frankly, at the moment I can't find from where I got that number..

>>  It was observed, 
>> that each page can be referenced by TCP during transmit about 20 times 
>> or even more. So, if each search needs, say, 20 ticks, the overall 
>> search time will be 20*20*2 (to get() and put()) = 800 ticks. So, this 
>> approach would considerably worse performance-wise to the chosen 
>> approach and provide not too much benefit.
> 
> Wouldn't you only need to do the lookup on the last put?

No, because you can't say which one is the last. E.g., a page can be 
mmaped to another process, while it's being transmitted. So, the only 
possible way is to track all gets and puts done by networking using some 
external reference counting (net_ref_cnt in case if iscsi-scst).

Vlad
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vladislav Bolkhovitin Dec. 23, 2008, 7:16 p.m. UTC | #36
Evgeniy Polyakov, on 12/20/2008 01:32 PM wrote:
> On Sat, Dec 20, 2008 at 07:10:45PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
>>> Hm.  So if I get a destructor call from the shared_info, can I go an  
>>> inspect the page refcounts to see if its really the last use?
>> The pages that were originally in the shared_info at creation
>> time may no longer be there by the time it's freed because of
>> pskb_pull_tail.
> 
> Things should work fine, since pskb_expand_head() copies whole shared
> info structure (and thus will copy destructor), get all pages and then
> copy all pointers into the new skb, and then release old skb's data.
> 
> So destructor for the pages should not rely on which skb it is called on
> and check if pages are about to be really freed (i.e. check theirs
> reference counter).
> 
> __pskb_pull_tail() is tricky, it just puts some pages it does not want
> to be present in the skb, but it could be possible to add there
> destructor callback from the original skb with partial flag (or just
> having destructor with two parameters: skb and page, and if page is not
> NULL, then actually only given page is freed, otherwise the whole skb).

Actually, there's another way, which seems to be a lot simpler. Alexey 
Kuznetsov privately suggested it to me.

In skb_shared_info new pointer transaction_token would be added, which 
would point on:

struct sk_transaction_token
{
	atomic_t			io_count;
	struct sk_transaction_token	*next;
	unsigned long			token;
	unsigned long			private;
	void				(*finish_callback)(struct sk_transaction_token *);
};

When skb is translated, transaction_token inherited. If 2 skb are merged 
(the same places where I put net_get_page's in my patch), the *older* 
token is inherited. This is the main point of this idea.

Before starting new asynchronous send a client would open a new token. 
Everything sent then would receive that token. Finish_callback() would 
be called and the corresponding token freed, when io_count == 0 *AND* 
all previous tokens closed.

This idea seems to be simpler, than even what Rusty implemented. Correct 
me, if I wrong. But, unfortunately, in the near future I will have no 
time to develop it.. :-(

Vlad

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov Dec. 23, 2008, 9:38 p.m. UTC | #37
On Tue, Dec 23, 2008 at 10:16:25PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote:
> Actually, there's another way, which seems to be a lot simpler. Alexey 
> Kuznetsov privately suggested it to me.
> 
> In skb_shared_info new pointer transaction_token would be added, which 
> would point on:
> 
> struct sk_transaction_token
> {
> 	atomic_t			io_count;
> 	struct sk_transaction_token	*next;
> 	unsigned long			token;
> 	unsigned long			private;
> 	void				(*finish_callback)(struct 
> 	sk_transaction_token *);
> };
> 
> When skb is translated, transaction_token inherited. If 2 skb are merged 
> (the same places where I put net_get_page's in my patch), the *older* 
> token is inherited. This is the main point of this idea.
> 
> Before starting new asynchronous send a client would open a new token. 
> Everything sent then would receive that token. Finish_callback() would 
> be called and the corresponding token freed, when io_count == 0 *AND* 
> all previous tokens closed.
> 
> This idea seems to be simpler, than even what Rusty implemented. Correct 
> me, if I wrong. But, unfortunately, in the near future I will have no 
> time to develop it.. :-(

Yes, it is simpler and cleaner, but it requires additional allocation.
This is additional (and quite noticeble) overhead.
Vladislav Bolkhovitin Dec. 24, 2008, 2:37 p.m. UTC | #38
Evgeniy Polyakov, on 12/24/2008 12:38 AM wrote:
> On Tue, Dec 23, 2008 at 10:16:25PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote:
>> Actually, there's another way, which seems to be a lot simpler. Alexey 
>> Kuznetsov privately suggested it to me.
>>
>> In skb_shared_info new pointer transaction_token would be added, which 
>> would point on:
>>
>> struct sk_transaction_token
>> {
>> 	atomic_t			io_count;
>> 	struct sk_transaction_token	*next;
>> 	unsigned long			token;
>> 	unsigned long			private;
>> 	void				(*finish_callback)(struct 
>> 	sk_transaction_token *);
>> };
>>
>> When skb is translated, transaction_token inherited. If 2 skb are merged 
>> (the same places where I put net_get_page's in my patch), the *older* 
>> token is inherited. This is the main point of this idea.
>>
>> Before starting new asynchronous send a client would open a new token. 
>> Everything sent then would receive that token. Finish_callback() would 
>> be called and the corresponding token freed, when io_count == 0 *AND* 
>> all previous tokens closed.
>>
>> This idea seems to be simpler, than even what Rusty implemented. Correct 
>> me, if I wrong. But, unfortunately, in the near future I will have no 
>> time to develop it.. :-(
> 
> Yes, it is simpler and cleaner, but it requires additional allocation.
> This is additional (and quite noticeble) overhead.

Not necessary requires. For instance, in iscsi-scst sk_transaction_token 
can (and should) be part of iSCSI cmd structure, so no additional 
allocations would be needed.

Thanks,
Vlad


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov Dec. 24, 2008, 2:44 p.m. UTC | #39
On Wed, Dec 24, 2008 at 05:37:51PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote:
> >Yes, it is simpler and cleaner, but it requires additional allocation.
> >This is additional (and quite noticeble) overhead.
> 
> Not necessary requires. For instance, in iscsi-scst sk_transaction_token 
> can (and should) be part of iSCSI cmd structure, so no additional 
> allocations would be needed.

This is special case, I'm not sure it is always possible to cache that
token and attach to every skb, but if it can be done, then of course
this does not end up with additional overhead.
Vladislav Bolkhovitin Dec. 24, 2008, 5:46 p.m. UTC | #40
Evgeniy Polyakov, on 12/24/2008 05:44 PM wrote:
> On Wed, Dec 24, 2008 at 05:37:51PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote:
>>> Yes, it is simpler and cleaner, but it requires additional allocation.
>>> This is additional (and quite noticeble) overhead.
>> Not necessary requires. For instance, in iscsi-scst sk_transaction_token 
>> can (and should) be part of iSCSI cmd structure, so no additional 
>> allocations would be needed.
> 
> This is special case, I'm not sure it is always possible to cache that
> token and attach to every skb, but if it can be done, then of course
> this does not end up with additional overhead.

I think in most cases there would be possibility to embed 
sk_transaction_token to some higher level structure. E.g. Xen apparently 
should have something to track packets passed through host/guest 
boundary. From other side, kmem cache is too well polished to have much 
overhead. I doubt, you would even notice it in this application. In most 
cases allocation of such small object in it using SLUB is just about the 
same as a list_del() under disabled IRQs.

Vlad



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov Dec. 24, 2008, 6:08 p.m. UTC | #41
On Wed, Dec 24, 2008 at 08:46:56PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote:
> I think in most cases there would be possibility to embed 
> sk_transaction_token to some higher level structure. E.g. Xen apparently 
> should have something to track packets passed through host/guest 
> boundary. From other side, kmem cache is too well polished to have much 
> overhead. I doubt, you would even notice it in this application. In most 
> cases allocation of such small object in it using SLUB is just about the 
> same as a list_del() under disabled IRQs.

I definitely would not rely on that, especially at cache reclaim time.
But it of course depends on the workload and maybe appropriate for the
cases in question. The best solution I think is to combine tag and
separate destructur, so that those who do not want to allocate a token
could still get notification via destructor callback.
Vladislav Bolkhovitin Dec. 30, 2008, 5:37 p.m. UTC | #42
Evgeniy Polyakov, on 12/24/2008 09:08 PM wrote:
> On Wed, Dec 24, 2008 at 08:46:56PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote:
>> I think in most cases there would be possibility to embed 
>> sk_transaction_token to some higher level structure. E.g. Xen apparently 
>> should have something to track packets passed through host/guest 
>> boundary. From other side, kmem cache is too well polished to have much 
>> overhead. I doubt, you would even notice it in this application. In most 
>> cases allocation of such small object in it using SLUB is just about the 
>> same as a list_del() under disabled IRQs.
> 
> I definitely would not rely on that, especially at cache reclaim time.
> But it of course depends on the workload and maybe appropriate for the
> cases in question. The best solution I think is to combine tag and
> separate destructur, so that those who do not want to allocate a token
> could still get notification via destructor callback.

Although I agree that any additional allocation is something, which 
should be avoided, *if possible*. But you shouldn't overestimate the 
overhead of the sk_transaction_token allocation in cases, when it would 
be needed. At first, sk_transaction_token is quite small, so a single 
page in the kmem cache would keep about 100 of them, hence the slow 
allocation path would be called only once per 100 objects. Second, in 
many cases ->sendpages() needs to allocate a new skb, so already there 
is at least one such allocations on the fast path.

Actually, it doesn't look like the skb shared info destructor alone 
can't solve the task we are solving, because we need to know not when an 
skb transmittion finished, but when transmittion of our *set of pages* 
finished. Hence, with skb shared info destructor we would need also to 
invent some way to track set of pages <-> set of skbs translation (you 
refer it as combining tag and separate destructor), which would bring 
this solution on the entire new complexity level for no gain over the 
sk_transaction_token solution.

Thanks,
Vlad



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov Dec. 30, 2008, 9:35 p.m. UTC | #43
Hi Vlad.

On Tue, Dec 30, 2008 at 08:37:00PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote:
> Although I agree that any additional allocation is something, which 
> should be avoided, *if possible*. But you shouldn't overestimate the 
> overhead of the sk_transaction_token allocation in cases, when it would 
> be needed. At first, sk_transaction_token is quite small, so a single 
> page in the kmem cache would keep about 100 of them, hence the slow 
> allocation path would be called only once per 100 objects. Second, in 
> many cases ->sendpages() needs to allocate a new skb, so already there 
> is at least one such allocations on the fast path.

Once per 100 objects? With millions of packets per second at extreme
cases this does not scale. Even more common thousand of usual packets
per second with 1.5k mtu will show up (especially freeing actually).

Any additional overhead has to be avoided if possible, even if it looks
innocent.

BSD guys already learned this lesson with packet processing tags at
every layer.

> Actually, it doesn't look like the skb shared info destructor alone 
> can't solve the task we are solving, because we need to know not when an 
> skb transmittion finished, but when transmittion of our *set of pages* 
> finished. Hence, with skb shared info destructor we would need also to 
> invent some way to track set of pages <-> set of skbs translation (you 
> refer it as combining tag and separate destructor), which would bring 
> this solution on the entire new complexity level for no gain over the 
> sk_transaction_token solution.

You really do not need to know when transmission is over, but when remote
side acks it (or connection is reset by the timeout). There is no way to
know when transmission is over without creating own skbs and submitting
them avoiding usual tcp/ip stack machinery.

You do not need to know which skbs contain which pages, system only should
track page pointers freed at skb destruction (shared info destruction
actually) time, no matter who owns those pages (since new pages can be
added into the page and some of the old ones can be freed early).

This will be effectively the same token, but it does not mean that
everyone who needs notification will have to perform additional
allocation. Put two pointers: destructor and token and do whatever you
like if one of them is non-empty, but try to avoid unneded overhead when
it is possible.
diff mbox

Patch

diff -upr linux-2.6.26/include/linux/mm_types.h linux-2.6.26/include/linux/mm_types.h
--- linux-2.6.26/include/linux/mm_types.h	2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/include/linux/mm_types.h	2008-07-22 20:30:21.000000000 +0400
@@ -92,6 +92,18 @@  struct page {
 	void *virtual;			/* Kernel virtual address (NULL if
 					   not kmapped, ie. highmem) */
 #endif /* WANT_PAGE_VIRTUAL */
+
+#if defined(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION)
+	/*
+	 * Used to implement support for notification on zero-copy TCP transfer
+	 * completion. It might look as not good to have this field here and
+	 * it's better to have it in struct sk_buff, but it would make the code
+	 * much more complicated and fragile, since all skb then would have to
+	 * contain only pages with the same value in this field.
+	 */
+	 void *net_priv;
+#endif
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 	unsigned long page_cgroup;
 #endif
diff -upr linux-2.6.26/include/linux/net.h linux-2.6.26/include/linux/net.h
--- linux-2.6.26/include/linux/net.h	2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/include/linux/net.h	2008-07-29 20:48:07.000000000 +0400
@@ -57,6 +57,7 @@  typedef enum {
 #include <linux/random.h>
 #include <linux/wait.h>
 #include <linux/fcntl.h>	/* For O_CLOEXEC and O_NONBLOCK */
+#include <linux/mm.h>

 struct poll_table_struct;
 struct pipe_inode_info;
@@ -354,5 +354,44 @@  extern int net_msg_cost;
 extern struct ratelimit_state net_ratelimit_state;
 #endif

+#if defined(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION)
+/* Support for notification on zero-copy TCP transfer completion */
+typedef void (*net_get_page_callback_t)(struct page *page);
+typedef void (*net_put_page_callback_t)(struct page *page);
+
+extern net_get_page_callback_t net_get_page_callback;
+extern net_put_page_callback_t net_put_page_callback;
+
+extern int net_set_get_put_page_callbacks(
+	net_get_page_callback_t get_callback,
+	net_put_page_callback_t put_callback);
+
+/*
+ * See comment for net_set_get_put_page_callbacks() why those functions
+ * don't need any protection.
+ */
+static inline void net_get_page(struct page *page)
+{
+	if (page->net_priv != 0)
+		net_get_page_callback(page);
+	get_page(page);
+}
+static inline void net_put_page(struct page *page)
+{
+	if (page->net_priv != 0)
+		net_put_page_callback(page);
+	put_page(page);
+}
+#else
+static inline void net_get_page(struct page *page)
+{
+	get_page(page);
+}
+static inline void net_put_page(struct page *page)
+{
+	put_page(page);
+}
+#endif /* CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION */
+
 #endif /* __KERNEL__ */
 #endif	/* _LINUX_NET_H */
diff -upr linux-2.6.26/net/core/skbuff.c linux-2.6.26/net/core/skbuff.c
--- linux-2.6.26/net/core/skbuff.c	2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/core/skbuff.c	2008-07-22 20:28:41.000000000 +0400
@@ -319,7 +319,7 @@  static void skb_release_data(struct sk_b
 		if (skb_shinfo(skb)->nr_frags) {
 			int i;
 			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
-				put_page(skb_shinfo(skb)->frags[i].page);
+				net_put_page(skb_shinfo(skb)->frags[i].page);
 		}
 
 		if (skb_shinfo(skb)->frag_list)
@@ -658,7 +658,7 @@  struct sk_buff *pskb_copy(struct sk_buff
 
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 			skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
-			get_page(skb_shinfo(n)->frags[i].page);
+			net_get_page(skb_shinfo(n)->frags[i].page);
 		}
 		skb_shinfo(n)->nr_frags = i;
 	}
@@ -721,7 +721,7 @@  int pskb_expand_head(struct sk_buff *skb
 	       sizeof(struct skb_shared_info));
 
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
-		get_page(skb_shinfo(skb)->frags[i].page);
+		net_get_page(skb_shinfo(skb)->frags[i].page);
 
 	if (skb_shinfo(skb)->frag_list)
 		skb_clone_fraglist(skb);
@@ -990,7 +990,7 @@  drop_pages:
 		skb_shinfo(skb)->nr_frags = i;
 
 		for (; i < nfrags; i++)
-			put_page(skb_shinfo(skb)->frags[i].page);
+			net_put_page(skb_shinfo(skb)->frags[i].page);
 
 		if (skb_shinfo(skb)->frag_list)
 			skb_drop_fraglist(skb);
@@ -1159,7 +1159,7 @@  pull_pages:
 	k = 0;
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		if (skb_shinfo(skb)->frags[i].size <= eat) {
-			put_page(skb_shinfo(skb)->frags[i].page);
+			net_put_page(skb_shinfo(skb)->frags[i].page);
 			eat -= skb_shinfo(skb)->frags[i].size;
 		} else {
 			skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i];
@@ -1916,7 +1916,7 @@  static inline void skb_split_no_header(s
 				 *    where splitting is expensive.
 				 * 2. Split is accurately. We make this.
 				 */
-				get_page(skb_shinfo(skb)->frags[i].page);
+				net_get_page(skb_shinfo(skb)->frags[i].page);
 				skb_shinfo(skb1)->frags[0].page_offset += len - pos;
 				skb_shinfo(skb1)->frags[0].size -= len - pos;
 				skb_shinfo(skb)->frags[i].size	= len - pos;
@@ -2284,7 +2284,7 @@  struct sk_buff *skb_segment(struct sk_bu
 			BUG_ON(i >= nfrags);
 
 			*frag = skb_shinfo(skb)->frags[i];
-			get_page(frag->page);
+			net_get_page(frag->page);
 			size = frag->size;
 
 			if (pos < offset) {
diff -upr linux-2.6.26/net/ipv4/ip_output.c linux-2.6.26/net/ipv4/ip_output.c
--- linux-2.6.26/net/ipv4/ip_output.c	2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/ipv4/ip_output.c	2008-07-22 20:28:41.000000000 +0400
@@ -1007,7 +1007,7 @@  alloc_new_skb:
 						err = -EMSGSIZE;
 						goto error;
 					}
-					get_page(page);
+					net_get_page(page);
 					skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0);
 					frag = &skb_shinfo(skb)->frags[i];
 				}
@@ -1165,7 +1165,7 @@  ssize_t	ip_append_page(struct sock *sk, 
 		if (skb_can_coalesce(skb, i, page, offset)) {
 			skb_shinfo(skb)->frags[i-1].size += len;
 		} else if (i < MAX_SKB_FRAGS) {
-			get_page(page);
+			net_get_page(page);
 			skb_fill_page_desc(skb, i, page, offset, len);
 		} else {
 			err = -EMSGSIZE;
diff -upr linux-2.6.26/net/ipv4/Makefile linux-2.6.26/net/ipv4/Makefile
--- linux-2.6.26/net/ipv4/Makefile	2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/ipv4/Makefile	2008-07-22 20:35:05.000000000 +0400
@@ -50,6 +50,7 @@  obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o
 obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o
 obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
 obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
+obj-$(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION) += tcp_zero_copy.o
 
 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
 		      xfrm4_output.o
diff -upr linux-2.6.26/net/ipv4/tcp.c linux-2.6.26/net/ipv4/tcp.c
--- linux-2.6.26/net/ipv4/tcp.c	2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/ipv4/tcp.c	2008-07-22 20:28:41.000000000 +0400
@@ -712,7 +712,7 @@  new_segment:
 		if (can_coalesce) {
 			skb_shinfo(skb)->frags[i - 1].size += copy;
 		} else {
-			get_page(page);
+			net_get_page(page);
 			skb_fill_page_desc(skb, i, page, offset, copy);
 		}
 
@@ -917,7 +917,7 @@  new_segment:
 					goto new_segment;
 				} else if (page) {
 					if (off == PAGE_SIZE) {
-						put_page(page);
+						net_put_page(page);
 						TCP_PAGE(sk) = page = NULL;
 						off = 0;
 					}
@@ -958,9 +958,9 @@  new_segment:
 				} else {
 					skb_fill_page_desc(skb, i, page, off, copy);
 					if (TCP_PAGE(sk)) {
-						get_page(page);
+						net_get_page(page);
 					} else if (off + copy < PAGE_SIZE) {
-						get_page(page);
+						net_get_page(page);
 						TCP_PAGE(sk) = page;
 					}
 				}
diff -upr linux-2.6.26/net/ipv4/tcp_output.c linux-2.6.26/net/ipv4/tcp_output.c
--- linux-2.6.26/net/ipv4/tcp_output.c	2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/ipv4/tcp_output.c	2008-07-22 20:28:41.000000000 +0400
@@ -854,7 +854,7 @@  static void __pskb_trim_head(struct sk_b
 	k = 0;
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		if (skb_shinfo(skb)->frags[i].size <= eat) {
-			put_page(skb_shinfo(skb)->frags[i].page);
+			net_put_page(skb_shinfo(skb)->frags[i].page);
 			eat -= skb_shinfo(skb)->frags[i].size;
 		} else {
 			skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i];
diff -upr linux-2.6.26/net/ipv4/tcp_zero_copy.c linux-2.6.26/net/ipv4/tcp_zero_copy.c
--- linux-2.6.26/net/ipv4/tcp_zero_copy.c	2008-07-22 20:12:35.000000000 +0400
+++ linux-2.6.26/net/ipv4/tcp_zero_copy.c	2008-07-31 21:21:13.000000000 +0400
@@ -0,0 +1,49 @@ 
+/*
+ *	Support routines for TCP zero copy transmit
+ *
+ *	Created by Vladislav Bolkhovitin
+ *
+ *	This program is free software; you can redistribute it and/or
+ *      modify it under the terms of the GNU General Public License
+ *      version 2 as published by the Free Software Foundation.
+ */
+
+#include <linux/skbuff.h>
+
+net_get_page_callback_t net_get_page_callback __read_mostly;
+EXPORT_SYMBOL(net_get_page_callback);
+
+net_put_page_callback_t net_put_page_callback __read_mostly;
+EXPORT_SYMBOL(net_put_page_callback);
+
+/*
+ * Caller of this function must ensure that at the moment when it's called
+ * there are no pages in the system with net_priv field set to non-zero
+ * value. Hence, this function, as well as net_get_page() and net_put_page(),
+ * don't need any protection.
+ */
+int net_set_get_put_page_callbacks(
+	net_get_page_callback_t get_callback,
+	net_put_page_callback_t put_callback)
+{
+	int res = 0;
+
+	if ((net_get_page_callback != NULL) && (get_callback != NULL) &&
+	    (net_get_page_callback != get_callback)) {
+		res = -EBUSY;
+		goto out;
+	}
+
+	if ((net_put_page_callback != NULL) && (put_callback != NULL) &&
+	    (net_put_page_callback != put_callback)) {
+		res = -EBUSY;
+		goto out;
+	}
+
+	net_get_page_callback = get_callback;
+	net_put_page_callback = put_callback;
+
+out:
+	return res;
+}
+EXPORT_SYMBOL(net_set_get_put_page_callbacks);
diff -upr linux-2.6.26/net/ipv6/ip6_output.c linux-2.6.26/net/ipv6/ip6_output.c
--- linux-2.6.26/net/ipv6/ip6_output.c	2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/ipv6/ip6_output.c	2008-07-22 20:28:41.000000000 +0400
@@ -1349,7 +1349,7 @@  alloc_new_skb:
 						err = -EMSGSIZE;
 						goto error;
 					}
-					get_page(page);
+					net_get_page(page);
 					skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0);
 					frag = &skb_shinfo(skb)->frags[i];
 				}
diff -upr linux-2.6.26/net/Kconfig linux-2.6.26/net/Kconfig
--- linux-2.6.26/net/Kconfig	2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/Kconfig	2008-07-29 21:15:39.000000000 +0400
@@ -59,6 +59,18 @@  config INET
 
 	  Short answer: say Y.
 
+config TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION
+	bool "TCP/IP zero-copy transfer completion notification"
+        depends on INET
+        default SCST_ISCSI
+	---help---
+	  Adds support for sending a notification upon completion of a
+          zero-copy TCP/IP transfer. This can speed up certain TCP/IP
+          software. Currently this is only used by the iSCSI target driver
+          iSCSI-SCST.
+
+          If unsure, say N.
+
 if INET
 source "net/ipv4/Kconfig"
 source "net/ipv6/Kconfig"