Message ID: 20170621211816.53837-3-willemdebruijn.kernel@gmail.com
State: Changes Requested, archived
Delegated to: David Miller
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Wed, 21 Jun 2017 17:18:05 -0400

> @@ -958,15 +958,20 @@ EXPORT_SYMBOL_GPL(skb_morph);
>   */
>  int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
>  {
> -	int i;
>  	int num_frags = skb_shinfo(skb)->nr_frags;
>  	struct page *page, *head = NULL;
>  	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
> +	int i, new_frags;
> +	u32 d_off;

Reverse christmas tree with the local variable declarations.

> +	page = head;
> +	d_off = 0;
> +	for (i = 0; i < num_frags; i++) {
> +		u8 *vaddr;
> +		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
> +		u32 f_off, f_size, copy;

Likewise.

> +		f_off = f->page_offset;
> +		f_size = f->size;
> +
> +		vaddr = kmap_atomic(skb_frag_page(f));

I looked at some kmap_atomic() implementations and I do not think
it supports compound pages.
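For context, "reverse christmas tree" is the netdev convention of ordering local
variable declarations from longest line to shortest. A minimal sketch of how the
declarations above could be reordered to follow it (illustrative only, not the
posted fix):

	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
	int num_frags = skb_shinfo(skb)->nr_frags;
	struct page *page, *head = NULL;
	int i, new_frags;
	u32 d_off;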
>
> Likewise.
>
>> +		f_off = f->page_offset;
>> +		f_size = f->size;
>> +
>> +		vaddr = kmap_atomic(skb_frag_page(f));
>
> I looked at some kmap_atomic() implementations and I do not think
> it supports compound pages.

Indeed. Thanks. It appears that I can do the obvious thing and
kmap the individual page that is being copied inside the loop:

  kmap_atomic(skb_frag_page(f) + (f_off >> PAGE_SHIFT));

This is similar to existing logic in copy_huge_page_from_user
and __flush_dcache_page in arch/arm/mm/flush.c

But, this also applies to other skb operations that call kmap_atomic,
such as skb_copy_bits and __skb_checksum. Not all can be called
from a codepath with a compound user page, but I have to address
the ones that can.
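A minimal sketch of that per-page kmap inside the copy loop (hypothetical
variable names; a real version must also cap each memcpy so it does not cross
the boundary of the single 4K page that is mapped):

	/* Map only the 4K subpage of a possibly compound frag page that
	 * holds f_off, and rebase the offset to that subpage.
	 */
	struct page *p = skb_frag_page(f) + (f_off >> PAGE_SHIFT);
	u32 p_off = f_off & (PAGE_SIZE - 1);
	u8 *vaddr = kmap_atomic(p);

	memcpy(page_address(page) + d_off, vaddr + p_off, copy);
	kunmap_atomic(vaddr);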
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Thu, 22 Jun 2017 16:57:07 -0400

>>
>> Likewise.
>>
>>> +		f_off = f->page_offset;
>>> +		f_size = f->size;
>>> +
>>> +		vaddr = kmap_atomic(skb_frag_page(f));
>>
>> I looked at some kmap_atomic() implementations and I do not think
>> it supports compound pages.
>
> Indeed. Thanks. It appears that I can do the obvious thing and
> kmap the individual page that is being copied inside the loop:
>
>   kmap_atomic(skb_frag_page(f) + (f_off >> PAGE_SHIFT));
>
> This is similar to existing logic in copy_huge_page_from_user
> and __flush_dcache_page in arch/arm/mm/flush.c
>
> But, this also applies to other skb operations that call kmap_atomic,
> such as skb_copy_bits and __skb_checksum. Not all can be called
> from a codepath with a compound user page, but I have to address
> the ones that can.

Yeah that's quite a mess, it looks like this assumption that
kmap can handle compound pages exists in quite a few places.
On Thu, Jun 22, 2017 at 9:36 PM, David Miller <davem@davemloft.net> wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Thu, 22 Jun 2017 16:57:07 -0400
>
>>>
>>> Likewise.
>>>
>>>> +		f_off = f->page_offset;
>>>> +		f_size = f->size;
>>>> +
>>>> +		vaddr = kmap_atomic(skb_frag_page(f));
>>>
>>> I looked at some kmap_atomic() implementations and I do not think
>>> it supports compound pages.
>>
>> Indeed. Thanks. It appears that I can do the obvious thing and
>> kmap the individual page that is being copied inside the loop:
>>
>>   kmap_atomic(skb_frag_page(f) + (f_off >> PAGE_SHIFT));
>>
>> This is similar to existing logic in copy_huge_page_from_user
>> and __flush_dcache_page in arch/arm/mm/flush.c
>>
>> But, this also applies to other skb operations that call kmap_atomic,
>> such as skb_copy_bits and __skb_checksum. Not all can be called
>> from a codepath with a compound user page, but I have to address
>> the ones that can.
>
> Yeah that's quite a mess, it looks like this assumption that
> kmap can handle compound pages exists in quite a few places.

I hadn't even considered that skbs can already hold compound
page frags without zerocopy.

Open coding all call sites to iterate is tedious and unnecessary
in the common case where a page is not highmem.

kmap_atomic has enough slots to map an entire order-3 compound
page at once. But kmap_atomic cannot fail and there may be edge
cases that are larger than order-3.

Packet rings allocate with __GFP_COMP and an order derived
from (user supplied) tp_block_size, for instance. But it links each
skb_frag_t from an individual page, so this case seems okay.

Perhaps calls to kmap_atomic can be replaced with a
kmap_compound(..) that checks

  __this_cpu_read(__kmap_atomic_idx) + (1 << compound_order(p)) < KM_TYPE_NR

before calling kmap_atomic on all pages in the compound page. In
the common case that the page is not high mem, a single call is
enough, as there is no per-page operation.
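A rough sketch of the kmap_compound() idea described above (hypothetical helper,
not in mainline; the follow-up below explains why this approach does not actually
work, since consecutive kmap_atomic() mappings are not virtually contiguous and
callers cannot handle failure):

static void *kmap_compound(struct page *page)
{
	int i, nr = 1 << compound_order(page);
	void *vaddr;

	if (!PageHighMem(page))
		return page_address(page);

	/* would need to fail when not enough fixmap slots remain, but
	 * callers such as __skb_checksum cannot handle failure
	 */
	if (__this_cpu_read(__kmap_atomic_idx) + nr >= KM_TYPE_NR)
		return NULL;

	/* map every subpage; assumes (wrongly) that the mappings end up
	 * virtually contiguous so the first vaddr covers them all
	 */
	vaddr = kmap_atomic(page);
	for (i = 1; i < nr; i++)
		kmap_atomic(page + i);

	return vaddr;
}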
>>>> I looked at some kmap_atomic() implementations and I do not think
>>>> it supports compound pages.
>>>
>>> Indeed. Thanks. It appears that I can do the obvious thing and
>>> kmap the individual page that is being copied inside the loop:
>>>
>>>   kmap_atomic(skb_frag_page(f) + (f_off >> PAGE_SHIFT));
>>>
>>> This is similar to existing logic in copy_huge_page_from_user
>>> and __flush_dcache_page in arch/arm/mm/flush.c
>>>
>>> But, this also applies to other skb operations that call kmap_atomic,
>>> such as skb_copy_bits and __skb_checksum. Not all can be called
>>> from a codepath with a compound user page, but I have to address
>>> the ones that can.
>>
>> Yeah that's quite a mess, it looks like this assumption that
>> kmap can handle compound pages exists in quite a few places.
>
> I hadn't even considered that skbs can already hold compound
> page frags without zerocopy.
>
> Open coding all call sites to iterate is tedious and unnecessary
> in the common case where a page is not highmem.
>
> kmap_atomic has enough slots to map an entire order-3 compound
> page at once. But kmap_atomic cannot fail and there may be edge
> cases that are larger than order-3.
>
> Packet rings allocate with __GFP_COMP and an order derived
> from (user supplied) tp_block_size, for instance. But it links each
> skb_frag_t from an individual page, so this case seems okay.
>
> Perhaps calls to kmap_atomic can be replaced with a
> kmap_compound(..) that checks
>
>   __this_cpu_read(__kmap_atomic_idx) + (1 << compound_order(p)) < KM_TYPE_NR
>
> before calling kmap_atomic on all pages in the compound page. In
> the common case that the page is not high mem, a single call is
> enough, as there is no per-page operation.

This does not work. Some callers, such as __skb_checksum, cannot
fail, so neither can kmap_compound. Also, vaddr of consecutive
kmap_atomic calls are not guaranteed to be in order. Indeed, on x86
and arm vaddr appears to grow down: (FIXADDR_TOP - ((x) << PAGE_SHIFT))

An alternative is to change the kmap_atomic callers in skbuff.c. To
avoid open coding, we can wrap the kmap_atomic; op; kunmap_atomic
in a macro that loops only if needed:

static inline bool skb_frag_must_loop(struct page *p)
{
#if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32)
	if (PageHighMem(p))
		return true;
#endif
	return false;
}

#define skb_frag_map_foreach(f, start, size, p, p_off, cp, copied)	\
	for (p = skb_frag_page(f) + ((start) >> PAGE_SHIFT),		\
	     p_off = (start) & (PAGE_SIZE - 1),				\
	     copied = 0,						\
	     cp = skb_frag_must_loop(p) ?				\
		     min_t(u32, size, PAGE_SIZE - p_off) : size;	\
	     copied < size;						\
	     copied += cp, p++, p_off = 0,				\
	     cp = min_t(u32, size - copied, PAGE_SIZE))			\

This does not change behavior on machines without high mem
or on low mem pages.

skb_seq_read keeps a mapping between calls to the function, so
will need a separate approach.
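A rough usage sketch of that macro (hypothetical caller; `to` is an assumed
destination buffer), copying one frag out while mapping at most one 4K page at a
time on highmem configurations:

	struct page *p;
	u32 p_off, p_len, copied;
	u8 *vaddr;

	skb_frag_map_foreach(f, f->page_offset, f->size, p, p_off, p_len, copied) {
		vaddr = kmap_atomic(p);
		memcpy(to + copied, vaddr + p_off, p_len);
		kunmap_atomic(vaddr);
	}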
>> Perhaps calls to kmap_atomic can be replaced with a
>> kmap_compound(..) that checks
>>
>>   __this_cpu_read(__kmap_atomic_idx) + (1 << compound_order(p)) < KM_TYPE_NR
>>
>> before calling kmap_atomic on all pages in the compound page. In
>> the common case that the page is not high mem, a single call is
>> enough, as there is no per-page operation.
>
> This does not work. Some callers, such as __skb_checksum, cannot
> fail, so neither can kmap_compound. Also, vaddr of consecutive
> kmap_atomic calls are not guaranteed to be in order. Indeed, on x86
> and arm vaddr appears to grow down: (FIXADDR_TOP - ((x) << PAGE_SHIFT))
>
> An alternative is to change the kmap_atomic callers in skbuff.c. To
> avoid open coding, we can wrap the kmap_atomic; op; kunmap_atomic
> in a macro that loops only if needed

I'll send this as RFC. It's not the most elegant solution.

The issue only arises with pages allocated with both __GFP_COMP
and __GFP_HIGHMEM, which is rare: skb_page_frag_refill,
alloc_skb_with_frags, __napi_alloc_skb and most device drivers do
not pass the high mem flag. Exceptions are rds and mlx5. And
transparent hugepages, which is a problem with zerocopy fragments
only (though not only msg_zerocopy, potentially also the existing
virtio and xen paths).

A simpler solution, then, may be to convert rds and mlx5 to not
pass __GFP_HIGHMEM and copy data on all zerocopy requests for
this type of page.
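A sketch of how such pages might be detected when deciding to fall back to a
copy (hypothetical helper, not part of the posted series):

/* true if any frag references a page that is both highmem and compound,
 * i.e. the rare case that a single kmap_atomic() call cannot cover
 */
static bool skb_has_compound_highmem_frags(const struct sk_buff *skb)
{
	int i;

	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
		struct page *p = skb_frag_page(&skb_shinfo(skb)->frags[i]);

		if (PageHighMem(p) && PageCompound(p))
			return true;
	}

	return false;
}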
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a17e235639ae..81c17bd41661 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1783,13 +1783,18 @@ static inline unsigned int skb_headlen(const struct sk_buff *skb)
 	return skb->len - skb->data_len;
 }
 
-static inline unsigned int skb_pagelen(const struct sk_buff *skb)
+static inline unsigned int __skb_pagelen(const struct sk_buff *skb)
 {
 	unsigned int i, len = 0;
 
 	for (i = skb_shinfo(skb)->nr_frags - 1; (int)i >= 0; i--)
 		len += skb_frag_size(&skb_shinfo(skb)->frags[i]);
-	return len + skb_headlen(skb);
+	return len;
+}
+
+static inline unsigned int skb_pagelen(const struct sk_buff *skb)
+{
+	return skb_headlen(skb) + __skb_pagelen(skb);
 }
 
 /**
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f75897a33fa4..96db26594192 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -958,15 +958,20 @@ EXPORT_SYMBOL_GPL(skb_morph);
  */
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 {
-	int i;
 	int num_frags = skb_shinfo(skb)->nr_frags;
 	struct page *page, *head = NULL;
 	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
+	int i, new_frags;
+	u32 d_off;
 
-	for (i = 0; i < num_frags; i++) {
-		u8 *vaddr;
-		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+	if (!num_frags)
+		return 0;
+	if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
+		return -EINVAL;
+
+	new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	for (i = 0; i < new_frags; i++) {
 
 		page = alloc_page(gfp_mask);
 		if (!page) {
 			while (head) {
@@ -976,14 +981,35 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 			}
 			return -ENOMEM;
 		}
-		vaddr = kmap_atomic(skb_frag_page(f));
-		memcpy(page_address(page),
-		       vaddr + f->page_offset, skb_frag_size(f));
-		kunmap_atomic(vaddr);
 		set_page_private(page, (unsigned long)head);
 		head = page;
 	}
 
+	page = head;
+	d_off = 0;
+	for (i = 0; i < num_frags; i++) {
+		u8 *vaddr;
+		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+		u32 f_off, f_size, copy;
+
+		f_off = f->page_offset;
+		f_size = f->size;
+
+		vaddr = kmap_atomic(skb_frag_page(f));
+		while (f_size) {
+			if (d_off == PAGE_SIZE) {
+				d_off = 0;
+				page = (struct page *)page_private(page);
+			}
+			copy = min_t(u32, PAGE_SIZE - d_off, f_size);
+			memcpy(page_address(page) + d_off, vaddr + f_off, copy);
+			f_size -= copy;
+			d_off += copy;
+			f_off += copy;
+		}
+		kunmap_atomic(vaddr);
+	}
+
 	/* skb frags release userspace buffers */
 	for (i = 0; i < num_frags; i++)
 		skb_frag_unref(skb, i);
@@ -991,11 +1017,12 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	uarg->callback(uarg, false);
 
 	/* skb frags point to kernel buffers */
-	for (i = num_frags - 1; i >= 0; i--) {
-		__skb_fill_page_desc(skb, i, head, 0,
-				     skb_shinfo(skb)->frags[i].size);
+	for (i = 0; i < new_frags - 1; i++) {
+		__skb_fill_page_desc(skb, i, head, 0, PAGE_SIZE);
 		head = (struct page *)page_private(head);
 	}
+	__skb_fill_page_desc(skb, new_frags - 1, head, 0, d_off);
 
+	skb_shinfo(skb)->nr_frags = new_frags;
 	skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
 	return 0;
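As a worked example of the resulting layout (illustrative numbers only): an skb
with three zerocopy frags of 3000, 5000 and 2000 bytes has __skb_pagelen() ==
10000, so with 4 KiB pages new_frags = (10000 + 4095) >> PAGE_SHIFT = 3. The
copy loop fills two full destination pages and ends with d_off = 10000 - 2 * 4096
= 1808, which becomes the size of the last kernel frag installed by the final
__skb_fill_page_desc() call.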