
[net-next] net: zerocopy: combine pages in zerocopy_sg_from_iter()

Message ID 20200820154359.1806305-1-edumazet@google.com
State Accepted
Delegated to: David Miller

Commit Message

Eric Dumazet Aug. 20, 2020, 3:43 p.m. UTC
Currently, tcp sendmsg(MSG_ZEROCOPY) is building skbs with order-0 fragments.
Compared to standard sendmsg(), these skbs usually contain up to 16 fragments
on arches with 4KB page sizes, instead of two.

This adds considerable costs on various ndo_start_xmit() handlers,
especially when IOMMU is in the picture.

As high-performance applications are often using huge pages,
we can try to combine adjacent pages belonging to the same
compound page.

Tested on an AMD Rome platform, with IOMMU, nominal single TCP flow speed
is roughly doubled (~55Gbit -> ~100Gbit) when the user application
is using hugepages.

For reference, nominal single TCP flow speed on this platform
without MSG_ZEROCOPY is ~65Gbit.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
---
 net/core/datagram.c | 33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)
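
For context, an application reaches this path by sending from a huge-page-backed
buffer on a socket with SO_ZEROCOPY enabled. The sketch below is illustrative
only (it is not part of the patch or this thread): it assumes an already
connected TCP socket fd and omits the MSG_ERRQUEUE completion handling a real
MSG_ZEROCOPY sender needs before reusing the buffer.

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

static int send_hugepage_zerocopy(int fd)
{
	const size_t len = 2UL << 20;	/* one 2MB huge page */
	int one = 1;
	void *buf;

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		return -1;

	/* Huge-page backing: all 4KB subpages share one compound head. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	memset(buf, 'x', len);

	/* The kernel pins the pages; with this patch, adjacent subpages of
	 * the compound page are merged into fewer, larger frags.  The
	 * caller must reap completions from the socket error queue before
	 * reusing buf.
	 */
	if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
		return -1;
	return 0;
}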

Comments

David Miller Aug. 20, 2020, 11:13 p.m. UTC | #1
From: Eric Dumazet <edumazet@google.com>
Date: Thu, 20 Aug 2020 08:43:59 -0700

> Currently, tcp sendmsg(MSG_ZEROCOPY) is building skbs with order-0 fragments.
> Compared to standard sendmsg(), these skbs usually contain up to 16 fragments
> on arches with 4KB page sizes, instead of two.
> 
> This adds considerable costs on various ndo_start_xmit() handlers,
> especially when IOMMU is in the picture.
> 
> As high-performance applications are often using huge pages,
> we can try to combine adjacent pages belonging to the same
> compound page.
> 
> Tested on an AMD Rome platform, with IOMMU, nominal single TCP flow speed
> is roughly doubled (~55Gbit -> ~100Gbit) when the user application
> is using hugepages.
> 
> For reference, nominal single TCP flow speed on this platform
> without MSG_ZEROCOPY is ~65Gbit.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, the refcounting in these kinds of patches is always fun to
audit :-)
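
For anyone auditing that refcounting: iov_iter_get_pages() takes one page
reference per 4KB page it pins, and a page that gets merged into the previous
frag no longer needs its own reference, so the patch gives those references
back in batches per compound head via page_ref_sub() rather than one at a
time. A toy user-space model of the bookkeeping (hypothetical names, not
kernel code):

#include <stdio.h>

struct head { int refcount; };

/* Stand-in for page_ref_sub(): one adjustment instead of 'refs' of them. */
static void page_ref_sub_model(struct head *h, int refs)
{
	h->refcount -= refs;
}

int main(void)
{
	struct head huge = { .refcount = 16 };	/* 16 pinned 4KB subpages */
	int refs = 0, merged;

	/* Say the 15 subpages after the first all merge into frag 0. */
	for (merged = 0; merged < 15; merged++)
		refs++;			/* defer the release */

	page_ref_sub_model(&huge, refs);	/* single batched drop */
	printf("refcount now %d (held by the one remaining frag)\n",
	       huge.refcount);
	return 0;
}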

Patch

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 639745d4f3b94a248da9a685f45158410a85bec7..9fcaa544f11a92f1b833d03e9db0863c32905673 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -623,10 +623,11 @@  int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
 
 	while (length && iov_iter_count(from)) {
 		struct page *pages[MAX_SKB_FRAGS];
+		struct page *last_head = NULL;
 		size_t start;
 		ssize_t copied;
 		unsigned long truesize;
-		int n = 0;
+		int refs, n = 0;
 
 		if (frag == MAX_SKB_FRAGS)
 			return -EMSGSIZE;
@@ -649,13 +650,37 @@  int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
 		} else {
 			refcount_add(truesize, &skb->sk->sk_wmem_alloc);
 		}
-		while (copied) {
+		for (refs = 0; copied != 0; start = 0) {
 			int size = min_t(int, copied, PAGE_SIZE - start);
-			skb_fill_page_desc(skb, frag++, pages[n], start, size);
-			start = 0;
+			struct page *head = compound_head(pages[n]);
+
+			start += (pages[n] - head) << PAGE_SHIFT;
 			copied -= size;
 			n++;
+			if (frag) {
+				skb_frag_t *last = &skb_shinfo(skb)->frags[frag - 1];
+
+				if (head == skb_frag_page(last) &&
+				    start == skb_frag_off(last) + skb_frag_size(last)) {
+					skb_frag_size_add(last, size);
+					/* We combined this page, we need to release
+					 * a reference. Since compound pages refcount
+					 * is shared among many pages, batch the refcount
+					 * adjustments to limit false sharing.
+					 */
+					last_head = head;
+					refs++;
+					continue;
+				}
+			}
+			if (refs) {
+				page_ref_sub(last_head, refs);
+				refs = 0;
+			}
+			skb_fill_page_desc(skb, frag++, head, start, size);
 		}
+		if (refs)
+			page_ref_sub(last_head, refs);
 	}
 	return 0;
 }
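
The least obvious line above is "start += (pages[n] - head) << PAGE_SHIFT;":
the iterator hands back individual 4KB pages, but the frag is now recorded
against the compound head, so the subpage's index within the compound page
has to be folded into the byte offset. A small worked example of that
arithmetic, with made-up values (not part of the patch):

#include <stdio.h>

#define PAGE_SHIFT 12	/* 4KB pages, as in the commit message */

int main(void)
{
	unsigned long subpage_index = 3;	/* pages[n] - head */
	unsigned long start = 0x200;		/* offset within pages[n] */

	/* Same rewrite as the patch: express the offset relative to the
	 * compound head instead of the individual subpage.
	 */
	start += subpage_index << PAGE_SHIFT;

	printf("offset into compound head: 0x%lx\n", start);	/* 0x3200 */
	return 0;
}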