
[net-next,v3,02/13] sock: skb_copy_ubufs support for compound pages

Message ID 20170621211816.53837-3-willemdebruijn.kernel@gmail.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Willem de Bruijn June 21, 2017, 9:18 p.m. UTC
From: Willem de Bruijn <willemb@google.com>

Refine skb_copy_ubufs to support compound pages. With upcoming TCP
and UDP zerocopy sendmsg, such fragments may appear.

The existing code replaces each source page with exactly one
destination page. Splitting each compound page into multiple regular
pages can therefore exceed the MAX_SKB_FRAGS limit.

Instead, fill all destination pages but the last to PAGE_SIZE.
Split the existing alloc + copy loop into separate stages. Compute
the total byte length and allocate the minimum number of pages needed
to hold it. Revise the copy loop to fill each destination page.
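
As a rough illustration of the allocation arithmetic (a sketch only;
the names mirror the patch below, the frag sizes are made up):

  /* e.g. frags hold one 16KB compound frag plus a 1KB frag: 17408 bytes */
  u32 bytelen = __skb_pagelen(skb);
  int new_frags = (bytelen + PAGE_SIZE - 1) >> PAGE_SHIFT; /* 5 pages of 4KB */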

It is not safe to modify skb frags when the skbuff is shared. No
existing codepath should hit this case.

Eventually, this fragile function can perhaps be replaced with calls
to skb_linearize -- when converted to not always require GFP_ATOMIC.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h |  9 +++++++--
 net/core/skbuff.c      | 49 ++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 45 insertions(+), 13 deletions(-)

Comments

David Miller June 22, 2017, 5:05 p.m. UTC | #1
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Wed, 21 Jun 2017 17:18:05 -0400

> @@ -958,15 +958,20 @@ EXPORT_SYMBOL_GPL(skb_morph);
>   */
>  int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
>  {
> -	int i;
>  	int num_frags = skb_shinfo(skb)->nr_frags;
>  	struct page *page, *head = NULL;
>  	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
> +	int i, new_frags;
> +	u32 d_off;

Reverse Christmas tree with the local variable declarations.

> +	page = head;
> +	d_off = 0;
> +	for (i = 0; i < num_frags; i++) {
> +		u8 *vaddr;
> +		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
> +		u32 f_off, f_size, copy;

Likewise.

> +		f_off = f->page_offset;
> +		f_size = f->size;
> +
> +		vaddr = kmap_atomic(skb_frag_page(f));

I looked at some kmap_atomic() implementations and I do not think
it supports compound pages.
Willem de Bruijn June 22, 2017, 8:57 p.m. UTC | #2
>
> Likewise.
>
>> +             f_off = f->page_offset;
>> +             f_size = f->size;
>> +
>> +             vaddr = kmap_atomic(skb_frag_page(f));
>
> I looked at some kmap_atomic() implementations and I do not think
> it supports compound pages.

Indeed. Thanks. It appears that I can do the obvious thing and
kmap the individual page that is being copied inside the loop:

  kmap_atomic(skb_frag_page(f) + (f_off >> PAGE_SHIFT));

This is similar to existing logic in copy_huge_page_from_user
and __flush_dcache_page in arch/arm/mm/flush.c
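
A minimal sketch of that per-subpage mapping inside the copy loop
(p, p_off and chunk are illustrative names, the other variables are
from the patch; the copy also has to be clamped so it stays within
the mapped 4KB subpage):

  /* map only the 4KB subpage that contains f_off, not the compound head */
  struct page *p = skb_frag_page(f) + (f_off >> PAGE_SHIFT);
  u32 p_off = f_off & (PAGE_SIZE - 1);
  u32 chunk = min_t(u32, copy, PAGE_SIZE - p_off); /* stay inside the mapping */

  vaddr = kmap_atomic(p);
  memcpy(page_address(page) + d_off, vaddr + p_off, chunk);
  kunmap_atomic(vaddr);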

But, this also applies to other skb operations that call kmap_atomic,
such as skb_copy_bits and __skb_checksum. Not all can be called
from a codepath with a compound user page, but I have to address
the ones that can.
David Miller June 23, 2017, 1:36 a.m. UTC | #3
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Thu, 22 Jun 2017 16:57:07 -0400

>>
>> Likewise.
>>
>>> +             f_off = f->page_offset;
>>> +             f_size = f->size;
>>> +
>>> +             vaddr = kmap_atomic(skb_frag_page(f));
>>
>> I looked at some kmap_atomic() implementations and I do not think
>> it supports compound pages.
> 
> Indeed. Thanks. It appears that I can do the obvious thing and
> kmap the individual page that is being copied inside the loop:
> 
>   kmap_atomic(skb_frag_page(f) + (f_off >> PAGE_SHIFT));
> 
> This is similar to existing logic in copy_huge_page_from_user
> and __flush_dcache_page in arch/arm/mm/flush.c
> 
> But, this also applies to other skb operations that call kmap_atomic,
> such as skb_copy_bits and __skb_checksum. Not all can be called
> from a codepath with a compound user page, but I have to address
> the ones that can.

Yeah that's quite a mess, it looks like this assumption that
kmap can handle compound pages exists in quite a few places.
Willem de Bruijn June 23, 2017, 3:59 a.m. UTC | #4
On Thu, Jun 22, 2017 at 9:36 PM, David Miller <davem@davemloft.net> wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Thu, 22 Jun 2017 16:57:07 -0400
>
>>>
>>> Likewise.
>>>
>>>> +             f_off = f->page_offset;
>>>> +             f_size = f->size;
>>>> +
>>>> +             vaddr = kmap_atomic(skb_frag_page(f));
>>>
>>> I looked at some kmap_atomic() implementations and I do not think
>>> it supports compound pages.
>>
>> Indeed. Thanks. It appears that I can do the obvious thing and
>> kmap the individual page that is being copied inside the loop:
>>
>>   kmap_atomic(skb_frag_page(f) + (f_off >> PAGE_SHIFT));
>>
>> This is similar to existing logic in copy_huge_page_from_user
>> and __flush_dcache_page in arch/arm/mm/flush.c
>>
>> But, this also applies to other skb operations that call kmap_atomic,
>> such as skb_copy_bits and __skb_checksum. Not all can be called
>> from a codepath with a compound user page, but I have to address
>> the ones that can.
>
> Yeah that's quite a mess, it looks like this assumption that
> kmap can handle compound pages exists in quite a few places.

I hadn't even considered that skbs can already hold compound
page frags without zerocopy.

Open coding all call sites to iterate is tedious and unnecessary
in the common case where a page is not highmem.

kmap_atomic has enough slots to map an entire order-3 compound
page at once. But kmap_atomic cannot fail and there may be edge
cases that are larger than order-3.

Packet rings allocate with __GFP_COMP and an order derived
from (user supplied) tp_block_size, for instance. But it links each
skb_frag_t from an individual page, so this case seems okay.

Perhaps calls to kmap_atomic can be replaced with a
kmap_compound(..) that checks

 __this_cpu_read(__kmap_atomic_idx) +  (1 << compound_order(p)) < KM_TYPE_NR

before calling kmap_atomic on all pages in the compound page. In
the common case that the page is not high mem, a single call is
enough, as there is no per-page operation.
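
A hypothetical sketch of such a check (kmap_compound_fits is an
illustrative name, not an existing kernel API):

static bool kmap_compound_fits(struct page *p)
{
#if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32)
        /* would all compound subpages fit in the per-cpu kmap_atomic slots? */
        if (PageHighMem(p))
                return __this_cpu_read(__kmap_atomic_idx) +
                       (1 << compound_order(p)) < KM_TYPE_NR;
#endif
        return true;    /* low mem: page_address() suffices, no slot needed */
}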
Willem de Bruijn June 27, 2017, 3:53 p.m. UTC | #5
>>>> I looked at some kmap_atomic() implementations and I do not think
>>>> it supports compound pages.
>>>
>>> Indeed. Thanks. It appears that I can do the obvious thing and
>>> kmap the individual page that is being copied inside the loop:
>>>
>>>   kmap_atomic(skb_frag_page(f) + (f_off >> PAGE_SHIFT));
>>>
>>> This is similar to existing logic in copy_huge_page_from_user
>>> and __flush_dcache_page in arch/arm/mm/flush.c
>>>
>>> But, this also applies to other skb operations that call kmap_atomic,
>>> such as skb_copy_bits and __skb_checksum. Not all can be called
>>> from a codepath with a compound user page, but I have to address
>>> the ones that can.
>>
>> Yeah that's quite a mess, it looks like this assumption that
>> kmap can handle compound pages exists in quite a few places.
>
> I hadn't even considered that skbs can already hold compound
> page frags without zerocopy.
>
> Open coding all call sites to iterate is tedious and unnecessary
> in the common case where a page is not highmem.
>
> kmap_atomic has enough slots to map an entire order-3 compound
> page at once. But kmap_atomic cannot fail and there may be edge
> cases that are larger than order-3.
>
> Packet rings allocate with __GFP_COMP and an order derived
> from (user supplied) tp_block_size, for instance. But it links each
> skb_frag_t from an individual page, so this case seems okay.
>
> Perhaps calls to kmap_atomic can be replaced with a
> kmap_compound(..) that checks
>
>  __this_cpu_read(__kmap_atomic_idx) +  (1 << compound_order(p)) < KM_TYPE_NR
>
> before calling kmap_atomic on all pages in the compound page. In
> the common case that the page is not high mem, a single call is
> enough, as there is no per-page operation.

This does not work. Some callers, such as __skb_checksum, cannot
fail, so neither can kmap_compound. Also, the vaddrs of consecutive
kmap_atomic calls are not guaranteed to be in order. Indeed, on x86
and arm the vaddr appears to grow down: (FIXADDR_TOP - ((x) << PAGE_SHIFT))

An alternative is to change the kmap_atomic callers in skbuff.c. To
avoid open coding, we can wrap the kmap_atomic; op; kunmap_atomic
in a macro that loops only if needed:

static inline bool skb_frag_must_loop(struct page *p)
{
#if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32)
        if (PageHighMem(p))
                return true;
#endif
        return false;
}

#define skb_frag_map_foreach(f, start, size, p, p_off, cp, copied)      \
        for (p = skb_frag_page(f) + ((start) >> PAGE_SHIFT),            \
             p_off = (start) & (PAGE_SIZE - 1),                         \
             copied = 0,                                                \
             cp = skb_frag_must_loop(p) ?                               \
                    min_t(u32, size, PAGE_SIZE - p_off) : size;         \
             copied < size;                                             \
             copied += cp, p++, p_off = 0,                              \
             cp = min_t(u32, size - copied, PAGE_SIZE))

This does not change behavior on machines without high mem
or on low mem pages.
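
A hypothetical use in a copy loop might look like this (dst is an
assumed flat destination buffer; f, f_off and f_size are set up as in
skb_copy_ubufs above):

struct page *p;
u32 p_off, cp, copied;
u8 *vaddr;

skb_frag_map_foreach(f, f_off, f_size, p, p_off, cp, copied) {
        vaddr = kmap_atomic(p);
        memcpy(dst + copied, vaddr + p_off, cp);
        kunmap_atomic(vaddr);
}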

skb_seq_read keeps a mapping between calls to the function,
so it will need a separate approach.
Willem de Bruijn June 29, 2017, 3:54 p.m. UTC | #6
>> Perhaps calls to kmap_atomic can be replaced with a
>> kmap_compound(..) that checks
>>
>>  __this_cpu_read(__kmap_atomic_idx) +  (1 << compound_order(p)) < KM_TYPE_NR
>>
>> before calling kmap_atomic on all pages in the compound page. In
>> the common case that the page is not high mem, a single call is
>> enough, as there is no per-page operation.
>
> This does not work. Some callers, such as __skb_checksum, cannot
> fail, so neither can kmap_compound. Also, the vaddrs of consecutive
> kmap_atomic calls are not guaranteed to be in order. Indeed, on x86
> and arm the vaddr appears to grow down: (FIXADDR_TOP - ((x) << PAGE_SHIFT))
>
> An alternative is to change the kmap_atomic callers in skbuff.c. To
> avoid open coding, we can wrap the kmap_atomic; op; kunmap_atomic
> in a macro that loops only if needed

I'll send this as an RFC. It's not the most elegant solution.

The issue only arises with pages allocated with both __GFP_COMP and
__GFP_HIGHMEM, which is rare: skb_page_frag_refill,
alloc_skb_with_frags, __napi_alloc_skb and most device drivers do not
pass the high mem flag.

Exceptions are rds and mlx5, and transparent hugepages, which are a
problem for zerocopy fragments only (not only msg_zerocopy, but
potentially also the existing virtio and xen paths).

A simpler solution, then, may be to convert rds and mlx5 to not pass
__GFP_HIGHMEM and to copy the data on all zerocopy requests for this
type of page.

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a17e235639ae..81c17bd41661 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1783,13 +1783,18 @@  static inline unsigned int skb_headlen(const struct sk_buff *skb)
 	return skb->len - skb->data_len;
 }
 
-static inline unsigned int skb_pagelen(const struct sk_buff *skb)
+static inline unsigned int __skb_pagelen(const struct sk_buff *skb)
 {
 	unsigned int i, len = 0;
 
 	for (i = skb_shinfo(skb)->nr_frags - 1; (int)i >= 0; i--)
 		len += skb_frag_size(&skb_shinfo(skb)->frags[i]);
-	return len + skb_headlen(skb);
+	return len;
+}
+
+static inline unsigned int skb_pagelen(const struct sk_buff *skb)
+{
+	return skb_headlen(skb) + __skb_pagelen(skb);
 }
 
 /**
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f75897a33fa4..96db26594192 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -958,15 +958,20 @@  EXPORT_SYMBOL_GPL(skb_morph);
  */
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 {
-	int i;
 	int num_frags = skb_shinfo(skb)->nr_frags;
 	struct page *page, *head = NULL;
 	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
+	int i, new_frags;
+	u32 d_off;
 
-	for (i = 0; i < num_frags; i++) {
-		u8 *vaddr;
-		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+	if (!num_frags)
+		return 0;
 
+	if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
+		return -EINVAL;
+
+	new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	for (i = 0; i < new_frags; i++) {
 		page = alloc_page(gfp_mask);
 		if (!page) {
 			while (head) {
@@ -976,14 +981,35 @@  int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 			}
 			return -ENOMEM;
 		}
-		vaddr = kmap_atomic(skb_frag_page(f));
-		memcpy(page_address(page),
-		       vaddr + f->page_offset, skb_frag_size(f));
-		kunmap_atomic(vaddr);
 		set_page_private(page, (unsigned long)head);
 		head = page;
 	}
 
+	page = head;
+	d_off = 0;
+	for (i = 0; i < num_frags; i++) {
+		u8 *vaddr;
+		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+		u32 f_off, f_size, copy;
+
+		f_off = f->page_offset;
+		f_size = f->size;
+
+		vaddr = kmap_atomic(skb_frag_page(f));
+		while (f_size) {
+			if (d_off == PAGE_SIZE) {
+				d_off = 0;
+				page = (struct page *)page_private(page);
+			}
+			copy = min_t(u32, PAGE_SIZE - d_off, f_size);
+			memcpy(page_address(page) + d_off, vaddr + f_off, copy);
+			f_size -= copy;
+			d_off += copy;
+			f_off += copy;
+		}
+		kunmap_atomic(vaddr);
+	}
+
 	/* skb frags release userspace buffers */
 	for (i = 0; i < num_frags; i++)
 		skb_frag_unref(skb, i);
@@ -991,11 +1017,12 @@  int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	uarg->callback(uarg, false);
 
 	/* skb frags point to kernel buffers */
-	for (i = num_frags - 1; i >= 0; i--) {
-		__skb_fill_page_desc(skb, i, head, 0,
-				     skb_shinfo(skb)->frags[i].size);
+	for (i = 0; i < new_frags - 1; i++) {
+		__skb_fill_page_desc(skb, i, head, 0, PAGE_SIZE);
 		head = (struct page *)page_private(head);
 	}
+	__skb_fill_page_desc(skb, new_frags - 1, head, 0, d_off);
+	skb_shinfo(skb)->nr_frags = new_frags;
 
 	skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
 	return 0;