[RFC,1/6] skbuff: support per-page destructors in copy_ubufs

Message ID 1336726800.23818.33.camel@zakaz.uk.xensource.com
State RFC, archived
Delegated to: David Miller

Commit Message

Ian Campbell May 11, 2012, 9 a.m. UTC
On Thu, 2012-05-10 at 19:42 +0100, Michael S. Tsirkin wrote:
> On Thu, May 10, 2012 at 06:46:17PM +0100, Ian Campbell wrote:
> > On Mon, 2012-05-07 at 14:54 +0100, Michael S. Tsirkin wrote:
> So the below on top then. I pushed these on
> top of my zerocopy branch - can you confirm pls?

I added these to my test branch:
93772fea6cd66616912101b9e0144dfed645d8fe fix per page destructors in copy ubufs
d72b7ab15f944c5df5f28cce7b5c9a0bca61ff6d clear destructor arg when set zerocopy 

I think you also need, as part of the second one:


I'm seeing copy_ubufs called in my remote NFS test, which I don't think
I expected -- I'll investigate why this is happening today.

Ian.

> 
> ---
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 930a50e..e52bc8d 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1270,8 +1270,10 @@ static inline void skb_frag_set_destructor(struct sk_buff *skb, int i,
>  {
>  	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
>  	frag->page.destructor = destroy;
> -	if (destroy)
> +	if (destroy) {
>  		skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
> +		skb_shinfo(skb)->destructor_arg = NULL;
> +	}
>  }
>  
>  /**
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index b7fc47e..453f621 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -753,12 +753,11 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
>  		uarg->callback(uarg);
>  
>  	/* skb frags point to kernel buffers */
> -	for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) {
> +	for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
>  		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
>  		if (unlikely((!uarg && !f->page.destructor)))
>  			continue;
> -		__skb_fill_page_desc(skb, i-1, head, 0,
> -				     skb_shinfo(skb)->frags[i - 1].size);
> +		__skb_fill_page_desc(skb, i, head, 0, f->size);
>  		head = (struct page *)head->private;
>  	}
>  
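
The loop change in the hunk above is an index fix: the old loop read
frags[i] with i starting at nr_frags, one past the last valid entry,
while writing slot i-1. A minimal userspace sketch of the corrected
walk follows. The toy_frag type and toy_replace_pages() are
hypothetical stand-ins, not kernel code, and the replacement pages
come from a plain array rather than the page->private chain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a fragment descriptor (not skb_frag_t). */
struct toy_frag {
	int page;          /* pretend page identifier */
	unsigned int size; /* fragment length in bytes */
};

/*
 * Walk the fragments from last to first and give each one the next
 * replacement page, keeping the fragment's own size. Valid indices are
 * 0 .. nr_frags - 1, so the walk starts at nr_frags - 1 and both the
 * read and the write use the same index i.
 * Returns the number of fragments rewritten.
 */
static int toy_replace_pages(struct toy_frag *frags, int nr_frags,
			     const int *new_pages)
{
	int i, used = 0;

	assert(frags != NULL && new_pages != NULL && nr_frags >= 0);

	for (i = nr_frags - 1; i >= 0; i--) {
		struct toy_frag *f = &frags[i];

		f->page = new_pages[used++];
		/* f->size is deliberately left untouched */
	}
	return used;
}
```

With three fragments and new_pages = { 22, 21, 20 }, the last fragment
receives 22 and the first receives 20, and every size is preserved.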


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Ian Campbell May 11, 2012, 10:58 a.m. UTC | #1
On Fri, 2012-05-11 at 10:00 +0100, Ian Campbell wrote:
> I'm seeing copy_ubufs called in my remote NFS test, which I don't
> think I expected -- I'll investigate why this is happening today. 

It's tcp_transmit_skb which can (conditionally) call skb_clone
(backtrace below)

I suspect this means that the existing SKBTX_DEV_ZEROCOPY semantics are
a superset of what we need to consider for the destructor case. I'm
assuming here that the existing SKBTX_DEV_ZEROCOPY is copying aside
exactly the right amount and isn't conservatively copying more often than
necessary.

shinfo->tx_flags are pretty scarce -- can we afford a new one for this
usecase?

Or perhaps this is actually a function of the callsite, not of the
individual skb, and we want to have some concept of "deep" and "shallow"
clones combined with SKBTX_DEV_ZEROCOPY to decide when to copy_ubufs or
not? e.g. deep clone => always copy if SKBTX_DEV_ZEROCOPY and shallow
clone => only copy if SKBTX_DEV_ZEROCOPY && destructor_arg!=NULL
(neither copy if !SKBTX_DEV_ZEROCOPY).

Oh, I suppose that reintroduces the copy_ubufs under a (shallow) cloned
skb race if one of those skbs eventually finds itself in a situation
where a skb_frag_orphan is required, doesn't it? Hrm :-/
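
For reference, the deep/shallow rule floated above could be written
down roughly as follows. This is only a userspace sketch of the
decision table: toy_shinfo, toy_clone_kind, TOY_TX_DEV_ZEROCOPY and
toy_clone_needs_copy() are hypothetical names, no such clone variants
exist in the tree, and the sketch does nothing about the race
described in the previous paragraph.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for SKBTX_DEV_ZEROCOPY; the value here is arbitrary. */
#define TOY_TX_DEV_ZEROCOPY 0x1u

/* Hypothetical: which kind of clone the callsite is asking for. */
enum toy_clone_kind { TOY_CLONE_SHALLOW, TOY_CLONE_DEEP };

/* Hypothetical stand-in for the two shinfo fields under discussion. */
struct toy_shinfo {
	unsigned int tx_flags;
	void *destructor_arg;
};

/*
 * Proposed rule:
 *   !ZEROCOPY              -> never copy
 *   deep clone, ZEROCOPY   -> always copy
 *   shallow clone, ZEROCOPY -> copy only if destructor_arg != NULL
 */
static bool toy_clone_needs_copy(const struct toy_shinfo *sh,
				 enum toy_clone_kind kind)
{
	assert(sh != NULL);

	if (!(sh->tx_flags & TOY_TX_DEV_ZEROCOPY))
		return false;
	if (kind == TOY_CLONE_DEEP)
		return true;
	return sh->destructor_arg != NULL;
}
```

Under this rule the tcp_transmit_skb clone would count as shallow, so
an skb carrying only per-page destructors (destructor_arg == NULL)
would not be copied there.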

Will have to have a think...

Ian.

[  109.680828] ------------[ cut here ]------------
[  109.685440] WARNING: at /local/scratch/ianc/devel/kernels/linux/include/linux/skbuff.h:1732 skb_clone+0xe6/0xf0()
[  109.695678] Hardware name:
[  109.699162] ORPHANING
[  109.701434] Modules linked in:
[  109.704495] Pid: 10, comm: kworker/0:1 Tainted: G        W    3.4.0-rc4-x86_64-native+ #186
[  109.712830] Call Trace:
[  109.715278]  [<ffffffff8107edfa>] warn_slowpath_common+0x7a/0xb0
[  109.721273]  [<ffffffff8107eed1>] warn_slowpath_fmt+0x41/0x50
[  109.727007]  [<ffffffff8170feea>] ? tcp_transmit_skb+0x9a/0x8f0
[  109.732914]  [<ffffffff8169b2d6>] skb_clone+0xe6/0xf0
[  109.737957]  [<ffffffff8170feea>] tcp_transmit_skb+0x9a/0x8f0
[  109.743694]  [<ffffffff81712d7a>] tcp_write_xmit+0x1ea/0x9c0
[  109.749343]  [<ffffffff8171357b>] tcp_push_one+0x2b/0x40
[  109.754648]  [<ffffffff81705b2b>] tcp_sendpage+0x64b/0x6d0
[  109.760126]  [<ffffffff8172785d>] inet_sendpage+0x4d/0xf0
[  109.765518]  [<ffffffff817afed7>] xs_sendpages+0x117/0x2a0
[  109.770996]  [<ffffffff817ad3f0>] ? xprt_reserve+0x2d0/0x2d0
[  109.776647]  [<ffffffff817b0178>] xs_tcp_send_request+0x58/0x110
[  109.782644]  [<ffffffff817ad5bb>] xprt_transmit+0x6b/0x2d0
[  109.788123]  [<ffffffff817aa9a0>] ? call_transmit_status+0xd0/0xd0
[  109.794293]  [<ffffffff817aab70>] call_transmit+0x1d0/0x290
[  109.799857]  [<ffffffff817aa9a0>] ? call_transmit_status+0xd0/0xd0
[  109.806029]  [<ffffffff817b3725>] __rpc_execute+0x65/0x260
[  109.811505]  [<ffffffff817b3920>] ? __rpc_execute+0x260/0x260
[  109.817241]  [<ffffffff817b3930>] rpc_async_schedule+0x10/0x20
[  109.823066]  [<ffffffff81098fff>] process_one_work+0x11f/0x460
[  109.828895]  [<ffffffff8109b0b3>] worker_thread+0x173/0x3f0
[  109.834459]  [<ffffffff8109af40>] ? manage_workers+0x210/0x210
[  109.840283]  [<ffffffff8109fa26>] kthread+0x96/0xa0
[  109.845179]  [<ffffffff81861654>] kernel_thread_helper+0x4/0x10
[  109.851092]  [<ffffffff8109f990>] ? kthread_freezable_should_stop+0x70/0x70
[  109.858053]  [<ffffffff81861650>] ? gs_change+0xb/0xb
[  109.863087] ---[ end trace 3e3acdb7cc57c191 ]---


Michael S. Tsirkin May 11, 2012, 12:08 p.m. UTC | #2
On Fri, May 11, 2012 at 11:58:12AM +0100, Ian Campbell wrote:
> On Fri, 2012-05-11 at 10:00 +0100, Ian Campbell wrote:
> > I'm seeing copy_ubufs called in my remote NFS test, which I don't
> > think I expected -- I'll investigate why this is happening today. 
> 
> It's tcp_transmit_skb which can (conditionally) call skb_clone
> (backtrace below)

Interesting. I didn't realise we clone skbs on data path:
tcp_write_xmit calls tcp_transmit_skb with clone_it flag.
Could someone comment on why we need to clone on good path
like this?
Michael S. Tsirkin May 11, 2012, 4:30 p.m. UTC | #3
On Fri, May 11, 2012 at 03:08:36PM +0300, Michael S. Tsirkin wrote:
> On Fri, May 11, 2012 at 11:58:12AM +0100, Ian Campbell wrote:
> > On Fri, 2012-05-11 at 10:00 +0100, Ian Campbell wrote:
> > > I'm seeing copy_ubufs called in my remote NFS test, which I don't
> > > think I expected -- I'll investigate why this is happening today. 
> > 
> > It's tcp_transmit_skb which can (conditionally) call skb_clone
> > (backtrace below)
> 
> Interesting. I didn't realise we clone skbs on data path:
> tcp_write_xmit calls tcp_transmit_skb with clone_it flag.
> Could someone comment on why we need to clone on good path
> like this?

Hmm, it's in case we need to retransmit it later.

> -- 
> MST
David Miller May 11, 2012, 9:12 p.m. UTC | #4
From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Fri, 11 May 2012 15:08:37 +0300

> On Fri, May 11, 2012 at 11:58:12AM +0100, Ian Campbell wrote:
>> On Fri, 2012-05-11 at 10:00 +0100, Ian Campbell wrote:
>> > I'm seeing copy_ubufs called in my remote NFS test, which I don't
>> > think I expected -- I'll investigate why this is happening today. 
>> 
>> It's tcp_transmit_skb which can (conditionally) call skb_clone
>> (backtrace below)
> 
> Interesting. I didn't realise we clone skbs on data path:
> tcp_write_xmit calls tcp_transmit_skb with clone_it flag.
> Could someone comment on why we need to clone on good path
> like this?

We can't send the original SKB that's linked into the retransmit
queue.  Its linkage must stay secure in that queue.
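
A toy userspace model of that constraint, assuming nothing beyond what
is said above: the queued packet keeps its list linkage, and what gets
handed down for transmission is a separate header that shares the same
payload by reference count. toy_pkt, toy_data, toy_clone() and
toy_put() are hypothetical names, not the sk_buff API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical shared payload with a user count. */
struct toy_data {
	int refcnt;
	const char *payload;
};

/* Hypothetical packet header: queue linkage plus a payload pointer. */
struct toy_pkt {
	struct toy_pkt *next;  /* linkage in the retransmit queue */
	struct toy_data *data; /* payload, shared between clones */
};

/*
 * Make a second header for the same payload. The clone starts out
 * unlinked, so sending (and later freeing) it cannot disturb the
 * original's place in the queue. Returns NULL on allocation failure.
 */
static struct toy_pkt *toy_clone(const struct toy_pkt *orig)
{
	struct toy_pkt *c;

	assert(orig != NULL && orig->data != NULL);

	c = malloc(sizeof(*c));
	if (!c)
		return NULL;
	c->next = NULL;
	c->data = orig->data;
	c->data->refcnt++;
	return c;
}

/* Drop a clone's reference; the payload itself is caller-owned here. */
static void toy_put(struct toy_pkt *p)
{
	if (!p)
		return;
	p->data->refcnt--;
	free(p);
}
```

The shared payload is exactly where the frags live, which is why the
clone and the queued original both see any change copy_ubufs makes.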

Ian Campbell May 12, 2012, 6:01 a.m. UTC | #5
On Fri, 2012-05-11 at 17:30 +0100, Michael S. Tsirkin wrote:
> On Fri, May 11, 2012 at 03:08:36PM +0300, Michael S. Tsirkin wrote:
> > On Fri, May 11, 2012 at 11:58:12AM +0100, Ian Campbell wrote:
> > > On Fri, 2012-05-11 at 10:00 +0100, Ian Campbell wrote:
> > > > I'm seeing copy_ubufs called in my remote NFS test, which I don't
> > > > think I expected -- I'll investigate why this is happening today. 
> > > 
> > > It's tcp_transmit_skb which can (conditionally) call skb_clone
> > > (backtrace below)
> > 
> > Interesting. I didn't realise we clone skbs on data path:
> > tcp_write_xmit calls tcp_transmit_skb with clone_it flag.
> > Could someone comment on why we need to clone on good path
> > like this?
> 
> Hmm, it's in case we need to retransmit it later.

I wonder if we could avoid the copy_ubuf in this particular clone path
and have any subsequent calls to copy_ubufs use skb->fclone to determine
if it can safely replace the frags?

If it cannot then could it do a full copy of the skb (including new
shinfo, new frag pages etc) as a fallback?

Ian.


Michael S. Tsirkin May 13, 2012, 10:10 a.m. UTC | #6
On Sat, May 12, 2012 at 07:01:24AM +0100, Ian Campbell wrote:
> On Fri, 2012-05-11 at 17:30 +0100, Michael S. Tsirkin wrote:
> > On Fri, May 11, 2012 at 03:08:36PM +0300, Michael S. Tsirkin wrote:
> > > On Fri, May 11, 2012 at 11:58:12AM +0100, Ian Campbell wrote:
> > > > On Fri, 2012-05-11 at 10:00 +0100, Ian Campbell wrote:
> > > > > I'm seeing copy_ubufs called in my remote NFS test, which I don't
> > > > > think I expected -- I'll investigate why this is happening today. 
> > > > 
> > > > It's tcp_transmit_skb which can (conditionally) call skb_clone
> > > > (backtrace below)
> > > 
> > > Interesting. I didn't realise we clone skbs on data path:
> > > tcp_write_xmit calls tcp_transmit_skb with clone_it flag.
> > > Could someone comment on why we need to clone on good path
> > > like this?
> > 
> > Hmm, it's in case we need to retransmit it later.
> 
> I wonder if we could avoid the copy_ubuf in this particular clone path
> and have any subsequent calls to copy_ubufs use skb->fclone to determine
> if it can safely replace the frags?
> 
> If it cannot then could it do a full copy of the skb (including new
> shinfo, new frag pages etc) as a fallback?
> 
> Ian.
> 

Yes I think we should call a variant of clone that avoids copy_ubuf on
the first transmit.  But need to be careful we don't access the frag
list while it is being modified.

For example very roughly, maybe we could have copy_ubuf detect
packet clone is queued and take some lock?

On retransmit we could check and if we are not the only clone left
(which should be uncommon) trigger copy ubuf then.

Thoughts?
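
To make the retransmit check concrete, here is a very rough userspace
sketch. It assumes a user count on the shared data and a flag saying
the frags still point at user pages. toy_shared and
toy_retransmit_must_copy() are hypothetical, and the locking question
raised above is not addressed at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of the data shared by an skb and its clones. */
struct toy_shared {
	int users;       /* headers currently referencing the data */
	bool user_pages; /* frags still point at zerocopy user pages */
};

/*
 * First transmit: clone without copying (not modelled here).
 * Retransmit: if an earlier clone is still alive, the frags may still
 * be in use further down the stack, so copy before sending again.
 * If we are the only user left, no copy is needed.
 */
static bool toy_retransmit_must_copy(const struct toy_shared *s)
{
	assert(s != NULL && s->users >= 1);

	return s->user_pages && s->users > 1;
}
```

In the common case the first clone has already been freed by the time
a retransmit happens, so users is 1 and the copy is skipped.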

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index af2d10e..40ca43e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1744,6 +1744,7 @@  static inline void skb_copy_frag_destructor(struct sk_buff *to,
 {
 	skb_shinfo(to)->tx_flags |= skb_shinfo(from)->tx_flags &
 		SKBTX_DEV_ZEROCOPY;
+	skb_shinfo(to)->destructor_arg = NULL;
 }
 
 /**