diff mbox series

[net-next,v10,05/14] netdev: netdevice devmem allocator

Message ID 20240530201616.1316526-6-almasrymina@google.com
State New
Headers show
Series Device Memory TCP | expand

Commit Message

Mina Almasry May 30, 2024, 8:16 p.m. UTC
Implement netdev devmem allocator. The allocator takes a given struct
netdev_dmabuf_binding as input and allocates net_iov from that
binding.

The allocation simply delegates to the binding's genpool for the
allocation logic and wraps the returned memory region in a net_iov
struct.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>

---

v8:
- Rename netdev_dmabuf_binding -> net_devmem_dmabuf_binding to avoid
  patch-by-patch build error.
- Move niov->pp_magic/pp/pp_ref_counter usage to later patch to avoid
  patch-by-patch build error.

v7:
- netdev_ -> net_devmem_* naming (Yunsheng).

v6:
- Add comment on net_iov_dma_addr to explain why we don't use
  niov->dma_addr (Pavel)
- Refactor new functions into net/core/devmem.c (Pavel)

v1:
- Rename devmem -> dmabuf (David).

---
 include/net/devmem.h | 13 +++++++++++++
 include/net/netmem.h | 18 ++++++++++++++++++
 net/core/devmem.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 75 insertions(+)

Comments

Paolo Abeni June 4, 2024, 10:13 a.m. UTC | #1
On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index d82f92d7cf9ce..d5fac8edf621d 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
>  	kfree(owner);
>  }
>  
> +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)

Minor nit: please no 'inline' keyword in c files.

Thanks,

Paolo
Steven Rostedt June 4, 2024, 4:15 p.m. UTC | #2
On Tue, 04 Jun 2024 12:13:15 +0200
Paolo Abeni <pabeni@redhat.com> wrote:

> On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:
> > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > index d82f92d7cf9ce..d5fac8edf621d 100644
> > --- a/net/core/devmem.c
> > +++ b/net/core/devmem.c
> > @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
> >  	kfree(owner);
> >  }
> >  
> > +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)  
> 
> Minor nit: please no 'inline' keyword in c files.

I'm curious. Is this a networking rule? I use 'inline' in my C code all the
time.

-- Steve
Jason Gunthorpe June 4, 2024, 4:31 p.m. UTC | #3
On Tue, Jun 04, 2024 at 12:15:51PM -0400, Steven Rostedt wrote:
> On Tue, 04 Jun 2024 12:13:15 +0200
> Paolo Abeni <pabeni@redhat.com> wrote:
> 
> > On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:
> > > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > > index d82f92d7cf9ce..d5fac8edf621d 100644
> > > --- a/net/core/devmem.c
> > > +++ b/net/core/devmem.c
> > > @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
> > >  	kfree(owner);
> > >  }
> > >  
> > > +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)  
> > 
> > Minor nit: please no 'inline' keyword in c files.
> 
> I'm curious. Is this a networking rule? I use 'inline' in my C code all the
> time.

It mostly comes from Documentation/process/coding-style.rst:

15) The inline disease
----------------------

There appears to be a common misperception that gcc has a magic "make me
faster" speedup option called ``inline``. While the use of inlines can be
appropriate (for example as a means of replacing macros, see Chapter 12), it
very often is not. Abundant use of the inline keyword leads to a much bigger
kernel, which in turn slows the system as a whole down, due to a bigger
icache footprint for the CPU and simply because there is less memory
available for the pagecache. Just think about it; a pagecache miss causes a
disk seek, which easily takes 5 milliseconds. There are a LOT of cpu cycles
that can go into these 5 milliseconds.

A reasonable rule of thumb is to not put inline at functions that have more
than 3 lines of code in them. An exception to this rule are the cases where
a parameter is known to be a compiletime constant, and as a result of this
constantness you *know* the compiler will be able to optimize most of your
function away at compile time. For a good example of this later case, see
the kmalloc() inline function.

Often people argue that adding inline to functions that are static and used
only once is always a win since there is no space tradeoff. While this is
technically correct, gcc is capable of inlining these automatically without
help, and the maintenance issue of removing the inline when a second user
appears outweighs the potential value of the hint that tells gcc to do
something it would have done anyway.

Jason
Steven Rostedt June 4, 2024, 4:42 p.m. UTC | #4
On Tue, 4 Jun 2024 13:31:58 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Tue, Jun 04, 2024 at 12:15:51PM -0400, Steven Rostedt wrote:
> > On Tue, 04 Jun 2024 12:13:15 +0200
> > Paolo Abeni <pabeni@redhat.com> wrote:
> >   
> > > On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:  
> > > > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > > > index d82f92d7cf9ce..d5fac8edf621d 100644
> > > > --- a/net/core/devmem.c
> > > > +++ b/net/core/devmem.c
> > > > @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
> > > >  	kfree(owner);
> > > >  }
> > > >  
> > > > +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)    
> > > 
> > > Minor nit: please no 'inline' keyword in c files.  
> > 
> > I'm curious. Is this a networking rule? I use 'inline' in my C code all the
> > time.  
> 
> It mostly comes from Documentation/process/coding-style.rst:
> 
> 15) The inline disease
> ----------------------
> 
> There appears to be a common misperception that gcc has a magic "make me
> faster" speedup option called ``inline``. While the use of inlines can be
> appropriate (for example as a means of replacing macros, see Chapter 12), it
> very often is not. Abundant use of the inline keyword leads to a much bigger
> kernel, which in turn slows the system as a whole down, due to a bigger
> icache footprint for the CPU and simply because there is less memory
> available for the pagecache. Just think about it; a pagecache miss causes a
> disk seek, which easily takes 5 milliseconds. There are a LOT of cpu cycles
> that can go into these 5 milliseconds.
> 
> A reasonable rule of thumb is to not put inline at functions that have more
> than 3 lines of code in them. An exception to this rule are the cases where
> a parameter is known to be a compiletime constant, and as a result of this
> constantness you *know* the compiler will be able to optimize most of your
> function away at compile time. For a good example of this later case, see
> the kmalloc() inline function.
> 
> Often people argue that adding inline to functions that are static and used
> only once is always a win since there is no space tradeoff. While this is
> technically correct, gcc is capable of inlining these automatically without
> help, and the maintenance issue of removing the inline when a second user
> appears outweighs the potential value of the hint that tells gcc to do
> something it would have done anyway.
> 

Interesting, as I sped up the ftrace ring buffer by a substantial amount by
adding strategic __always_inline, noinline, likely() and unlikely()
throughout the code. It had to do with what was considered the fast path
and slow path, and not actually the size of the function. gcc got it
horribly wrong.

-- Steve
Andrew Lunn June 4, 2024, 11:44 p.m. UTC | #5
> Interesting, as I sped up the ftrace ring buffer by a substantial amount by
> adding strategic __always_inline, noinline, likely() and unlikely()
> throughout the code. It had to do with what was considered the fast path
> and slow path, and not actually the size of the function. gcc got it
> horribly wrong.

And what did the compiler people say when you reported gcc was getting
it wrong?

Our assumption is, the compiler is better than a human at deciding
this. Or at least, a human who does not spend a long time profiling
and tuning. If this assumption is not true, we probably should be
trying to figure out why, and improving the compiler when
possible. That will benefit everybody.

       Andrew
Steven Rostedt June 5, 2024, 12:27 a.m. UTC | #6
On Wed, 5 Jun 2024 01:44:37 +0200
Andrew Lunn <andrew@lunn.ch> wrote:

> > Interesting, as I sped up the ftrace ring buffer by a substantial amount by
> > adding strategic __always_inline, noinline, likely() and unlikely()
> > throughout the code. It had to do with what was considered the fast path
> > and slow path, and not actually the size of the function. gcc got it
> > horribly wrong.  
> 
> And what did the compiler people say when you reported gcc was getting
> it wrong?
> 
> Our assumption is, the compiler is better than a human at deciding
> this. Or at least, a human who does not spend a long time profiling
> and tuning. If this assumption is not true, we probably should be
> trying to figure out why, and improving the compiler when
> possible. That will benefit everybody.
> 

How is the compiler going to know which path is going to be taken the most?
There's two main paths in the ring buffer logic. One when an event stays on
the sub-buffer, the other when the event crosses over to a new sub buffer.
As there's 100s of events that happen on the same sub-buffer for every one
time there's a cross over, I optimized the paths that stayed on the
sub-buffer, which caused the time for those events to go from 250ns down to
150 ns!. That's a 40% speed up.

I added the unlikely/likely and 'always_inline' and 'noinline' paths to
make sure the "staying on the buffer" path was always the hot path, and
keeping it tight in cache.

How is a compiler going to know that?

-- Steve
Andrew Lunn June 5, 2024, 12:52 a.m. UTC | #7
> How is the compiler going to know which path is going to be taken the most?
> There's two main paths in the ring buffer logic. One when an event stays on
> the sub-buffer, the other when the event crosses over to a new sub buffer.
> As there's 100s of events that happen on the same sub-buffer for every one
> time there's a cross over, I optimized the paths that stayed on the
> sub-buffer, which caused the time for those events to go from 250ns down to
> 150 ns!. That's a 40% speed up.
> 
> I added the unlikely/likely and 'always_inline' and 'noinline' paths to
> make sure the "staying on the buffer" path was always the hot path, and
> keeping it tight in cache.
> 
> How is a compiler going to know that?

It might have some heuristics to try to guess unlikely/likely, but
that is not what we are talking about here.

How much difference did 'always_inline' and 'noinline' make? Hopefully
the likely is enough of a clue it should prefer to inline whatever is
in that branch, where as for the unlikely case it can do a function
call.

But compilers is not my thing, which is why i would reach out to the
compiler people and ask them, is it expected to get this wrong, could
it be made better?

   Andrew
Steven Rostedt June 6, 2024, 1:43 a.m. UTC | #8
On Wed, 5 Jun 2024 02:52:29 +0200
Andrew Lunn <andrew@lunn.ch> wrote:

> > How is a compiler going to know that?  
> 
> It might have some heuristics to try to guess unlikely/likely, but
> that is not what we are talking about here.
> 
> How much difference did 'always_inline' and 'noinline' make? Hopefully
> the likely is enough of a clue it should prefer to inline whatever is
> in that branch, where as for the unlikely case it can do a function
> call.

Perhaps, but one of the issues was that I have lots of small functions that
are used all over the place, and gcc tends to change them to function
calls, instead of duplicating them. I did this analysis back in 2016, so
maybe it became better.

> 
> But compilers is not my thing, which is why i would reach out to the
> compiler people and ask them, is it expected to get this wrong, could
> it be made better?

Well, I actually do work with the compiler folks, and we are actually
trying to get a session at GNU Cauldron where Linux kernel folks can talk
with the gcc compiler folks.

I've stared at so many objdump outputs, that I can now pretty much see the
assembly that my C code makes ;-)

-- Steve
Niklas Schnelle June 7, 2024, 7:55 a.m. UTC | #9
On Tue, 2024-06-04 at 20:27 -0400, Steven Rostedt wrote:
> On Wed, 5 Jun 2024 01:44:37 +0200
> Andrew Lunn <andrew@lunn.ch> wrote:
> 
> > > Interesting, as I sped up the ftrace ring buffer by a substantial amount by
> > > adding strategic __always_inline, noinline, likely() and unlikely()
> > > throughout the code. It had to do with what was considered the fast path
> > > and slow path, and not actually the size of the function. gcc got it
> > > horribly wrong.  
> > 
> > And what did the compiler people say when you reported gcc was getting
> > it wrong?
> > 
> > Our assumption is, the compiler is better than a human at deciding
> > this. Or at least, a human who does not spend a long time profiling
> > and tuning. If this assumption is not true, we probably should be
> > trying to figure out why, and improving the compiler when
> > possible. That will benefit everybody.
> > 
> 
> How is the compiler going to know which path is going to be taken the most?
> There's two main paths in the ring buffer logic. One when an event stays on
> the sub-buffer, the other when the event crosses over to a new sub buffer.
> As there's 100s of events that happen on the same sub-buffer for every one
> time there's a cross over, I optimized the paths that stayed on the
> sub-buffer, which caused the time for those events to go from 250ns down to
> 150 ns!. That's a 40% speed up.
> 
> I added the unlikely/likely and 'always_inline' and 'noinline' paths to
> make sure the "staying on the buffer" path was always the hot path, and
> keeping it tight in cache.
> 
> How is a compiler going to know that?
> 
> -- Steve
> 

Isn't this basically a perfect example of something where profile
guided optimization should work?

Thanks,
Niklas
diff mbox series

Patch

diff --git a/include/net/devmem.h b/include/net/devmem.h
index fa03bdabdffd9..cd3186f5d1fbd 100644
--- a/include/net/devmem.h
+++ b/include/net/devmem.h
@@ -68,7 +68,20 @@  int net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
 void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding);
 int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
 				    struct net_devmem_dmabuf_binding *binding);
+struct net_iov *
+net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding);
+void net_devmem_free_dmabuf(struct net_iov *ppiov);
 #else
+static inline struct net_iov *
+net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
+{
+	return NULL;
+}
+
+static inline void net_devmem_free_dmabuf(struct net_iov *ppiov)
+{
+}
+
 static inline void
 __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
 {
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 72e932a1a9489..01dbdd216fae7 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -14,8 +14,26 @@ 
 
 struct net_iov {
 	struct dmabuf_genpool_chunk_owner *owner;
+	unsigned long dma_addr;
 };
 
+static inline struct dmabuf_genpool_chunk_owner *
+net_iov_owner(const struct net_iov *niov)
+{
+	return niov->owner;
+}
+
+static inline unsigned int net_iov_idx(const struct net_iov *niov)
+{
+	return niov - net_iov_owner(niov)->niovs;
+}
+
+static inline struct net_devmem_dmabuf_binding *
+net_iov_binding(const struct net_iov *niov)
+{
+	return net_iov_owner(niov)->binding;
+}
+
 /* netmem */
 
 /**
diff --git a/net/core/devmem.c b/net/core/devmem.c
index d82f92d7cf9ce..d5fac8edf621d 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -32,6 +32,14 @@  static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
 	kfree(owner);
 }
 
+static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)
+{
+	struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov);
+
+	return owner->base_dma_addr +
+	       ((dma_addr_t)net_iov_idx(niov) << PAGE_SHIFT);
+}
+
 void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
 {
 	size_t size, avail;
@@ -54,6 +62,42 @@  void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
 	kfree(binding);
 }
 
+struct net_iov *
+net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
+{
+	struct dmabuf_genpool_chunk_owner *owner;
+	unsigned long dma_addr;
+	struct net_iov *niov;
+	ssize_t offset;
+	ssize_t index;
+
+	dma_addr = gen_pool_alloc_owner(binding->chunk_pool, PAGE_SIZE,
+					(void **)&owner);
+	if (!dma_addr)
+		return NULL;
+
+	offset = dma_addr - owner->base_dma_addr;
+	index = offset / PAGE_SIZE;
+	niov = &owner->niovs[index];
+
+	niov->dma_addr = 0;
+
+	net_devmem_dmabuf_binding_get(binding);
+
+	return niov;
+}
+
+void net_devmem_free_dmabuf(struct net_iov *niov)
+{
+	struct net_devmem_dmabuf_binding *binding = net_iov_binding(niov);
+	unsigned long dma_addr = net_devmem_get_dma_addr(niov);
+
+	if (gen_pool_has_addr(binding->chunk_pool, dma_addr, PAGE_SIZE))
+		gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE);
+
+	net_devmem_dmabuf_binding_put(binding);
+}
+
 /* Protected by rtnl_lock() */
 static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);