[net-next,v2,8/8] net: Introduce SO_INCOMING_NAPI_ID

Message ID 20170323213802.12615.58216.stgit@localhost.localdomain
State Superseded, archived
Delegated to: David Miller

Commit Message

Alexander H Duyck March 23, 2017, 9:38 p.m. UTC
From: Sridhar Samudrala <sridhar.samudrala@intel.com>

This socket option returns the NAPI ID associated with the queue on which
the last frame was received. Applications can use this information to
split incoming flows among threads based on the Rx queue on which they
were received.

If the NAPI ID actually represents a sender_cpu then the value is ignored
and 0 is returned.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 arch/alpha/include/uapi/asm/socket.h   |    2 ++
 arch/avr32/include/uapi/asm/socket.h   |    2 ++
 arch/frv/include/uapi/asm/socket.h     |    2 ++
 arch/ia64/include/uapi/asm/socket.h    |    2 ++
 arch/m32r/include/uapi/asm/socket.h    |    2 ++
 arch/mips/include/uapi/asm/socket.h    |    1 +
 arch/mn10300/include/uapi/asm/socket.h |    2 ++
 arch/parisc/include/uapi/asm/socket.h  |    2 ++
 arch/powerpc/include/uapi/asm/socket.h |    2 ++
 arch/s390/include/uapi/asm/socket.h    |    2 ++
 arch/sparc/include/uapi/asm/socket.h   |    2 ++
 arch/xtensa/include/uapi/asm/socket.h  |    2 ++
 include/uapi/asm-generic/socket.h      |    2 ++
 net/core/sock.c                        |   12 ++++++++++++
 14 files changed, 37 insertions(+)
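
For illustration, a minimal userspace sketch of querying the new option
(hypothetical snippet, not part of the patch; the fallback define matches
the asm-generic value added below):

#include <sys/socket.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID	56	/* asm-generic value from this patch */
#endif

/* Return the NAPI ID of the queue that delivered the last frame on fd,
 * or 0 if the kernel had a non-NAPI value (e.g. sender_cpu) stored there.
 */
static unsigned int get_incoming_napi_id(int fd)
{
        unsigned int napi_id = 0;
        socklen_t len = sizeof(napi_id);

        if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len) < 0)
                return 0;

        return napi_id;
}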

Comments

Eric Dumazet March 23, 2017, 10:22 p.m. UTC | #1
On Thu, 2017-03-23 at 14:38 -0700, Alexander Duyck wrote:
> From: Sridhar Samudrala <sridhar.samudrala@intel.com>
> 
> This socket option returns the NAPI ID associated with the queue on which
> the last frame was received. Applications can use this information to
> split incoming flows among threads based on the Rx queue on which they
> were received.
> 
> If the NAPI ID actually represents a sender_cpu then the value is ignored
> and 0 is returned.

Acked-by: Eric Dumazet <edumazet@google.com>
Andy Lutomirski March 23, 2017, 10:43 p.m. UTC | #2
On Thu, Mar 23, 2017 at 2:38 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> From: Sridhar Samudrala <sridhar.samudrala@intel.com>
>
> This socket option returns the NAPI ID associated with the queue on which
> the last frame was received. Applications can use this information to
> split incoming flows among threads based on the Rx queue on which they
> were received.
>
> If the NAPI ID actually represents a sender_cpu then the value is ignored
> and 0 is returned.

This may be more of a naming / documentation issue than a
functionality issue, but to me this reads as:

"This socket option returns an internal implementation detail that, if
you are sufficiently clueful about the current performance heuristics
used by the Linux networking stack, just might give you a hint as to
which epoll set to put the socket in."  I've done some digging into
Linux networking stuff, but not nearly enough to have the slightest
clue what you're supposed to do with the NAPI ID.

It would be nice to make this a bit more concrete and a bit less tied
to Linux innards.  Perhaps a socket option could instead return a hint
saying "for best results, put this socket in an epoll set that's on
cpu N"?  After all, you're unlikely to do anything productive with
busy polling at all, even on a totally different kernel
implementation, if you have more than one epoll set per CPU.  I can
see cases where you could plausibly poll with fewer than one set per
CPU, I suppose.

Again, though, from the description, it's totally unclear what a user
is supposed to do.

--Andy
Alexander H Duyck March 24, 2017, 12:58 a.m. UTC | #3
On Thu, Mar 23, 2017 at 3:43 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Thu, Mar 23, 2017 at 2:38 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> From: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>
>> This socket option returns the NAPI ID associated with the queue on which
>> the last frame was received. Applications can use this information to
>> split incoming flows among threads based on the Rx queue on which they
>> were received.
>>
>> If the NAPI ID actually represents a sender_cpu then the value is ignored
>> and 0 is returned.
>
> This may be more of a naming / documentation issue than a
> functionality issue, but to me this reads as:
>
> "This socket option returns an internal implementation detail that, if
> you are sufficiently clueful about the current performance heuristics
> used by the Linux networking stack, just might give you a hint as to
> which epoll set to put the socket in."  I've done some digging into
> Linux networking stuff, but not nearly enough to have the slightest
> clue what you're supposed to do with the NAPI ID.

Really the NAPI ID is an arbitrary number that will be unique per
device queue, though multiple Rx queues can share a NAPI ID if they
are meant to be processed in the same call to poll.

If we wanted we could probably rename it to something like Device Poll
Identifier or Device Queue Identifier, DPID or DQID, if that would
work for you.  Essentially it is just a unique u32 value that should
not identify any other queue in the system while this device queue is
active.  Really the number itself is mostly arbitrary, the main thing
is that it doesn't change and uniquely identifies the queue in the
system.

> It would be nice to make this a bit more concrete and a bit less tied
> to Linux innards.  Perhaps a socket option could instead return a hint
> saying "for best results, put this socket in an epoll set that's on
> cpu N"?  After all, you're unlikely to do anything productive with
> busy polling at all, even on a totally different kernel
> implementation, if you have more than one epoll set per CPU.  I can
> see cases where you could plausibly poll with fewer than one set per
> CPU, I suppose.

Really we kind of already have an option that does what you are
implying called SO_INCOMING_CPU.  The problem is it requires pinning
the interrupts to the CPUs in order to keep the values consistent, and
even then busy polling can mess that up if the busy poll thread is
running on a different CPU.  With the NAPI ID we have to do a bit of
work on the application end, but we can uniquely identify each
incoming queue and interrupt migration and busy polling don't have any
effect on it.  So for example we could stack all the interrupts on CPU
0, and have our main thread located there doing the sorting of
incoming requests and handing them out to epoll listener threads on
other CPUs.  When those epoll listener threads start doing busy
polling the NAPI ID won't change even though the packet is being
processed on a different CPU.

> Again, though, from the description, it's totally unclear what a user
> is supposed to do.

What you end up having to do is essentially create a hash of sorts so
that you can go from NAPI IDs to threads.  In an ideal setup what you
end up with is multiple threads, each one running one epoll, and each
epoll polling on one specific queue.
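
A rough sketch of that dispatch step (illustrative only; the table
layout, sizing, and locking are up to the application):

#include <sys/epoll.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID	56
#endif

#define MAX_WORKERS	16

static struct {
        unsigned int napi_id;	/* 0 = slot not yet bound to a queue */
        int epfd;		/* epoll fd owned by one worker thread */
} workers[MAX_WORKERS];

/* Run on the acceptor thread: hand fd to the worker whose epoll
 * busy polls the device queue the connection arrived on.
 */
static void dispatch(int fd)
{
        unsigned int id = 0;
        socklen_t len = sizeof(id);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        int i, slot = 0;	/* slot 0 is the fallback for id == 0 */

        getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &id, &len);

        for (i = 1; i < MAX_WORKERS && id; i++) {
                if (workers[i].napi_id == id || !workers[i].napi_id) {
                        workers[i].napi_id = id;
                        slot = i;
                        break;
                }
        }

        epoll_ctl(workers[slot].epfd, EPOLL_CTL_ADD, fd, &ev);
}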

Hope that helps to clarify it.

- Alex
Andy Lutomirski March 24, 2017, 4:47 a.m. UTC | #4
On Thu, Mar 23, 2017 at 5:58 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Thu, Mar 23, 2017 at 3:43 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> On Thu, Mar 23, 2017 at 2:38 PM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>>> From: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>
>>> This socket option returns the NAPI ID associated with the queue on which
>>> the last frame was received. Applications can use this information to
>>> split incoming flows among threads based on the Rx queue on which they
>>> were received.
>>>
>>> If the NAPI ID actually represents a sender_cpu then the value is ignored
>>> and 0 is returned.
>>
>> This may be more of a naming / documentation issue than a
>> functionality issue, but to me this reads as:
>>
>> "This socket option returns an internal implementation detail that, if
>> you are sufficiently clueful about the current performance heuristics
>> used by the Linux networking stack, just might give you a hint as to
>> which epoll set to put the socket in."  I've done some digging into
>> Linux networking stuff, but not nearly enough to have the slightest
>> clue what you're supposed to do with the NAPI ID.
>
> Really the NAPI ID is an arbitrary number that will be unique per
> device queue, though multiple Rx queues can share a NAPI ID if they
> are meant to be processed in the same call to poll.
>
> If we wanted we could probably rename it to something like Device Poll
> Identifier or Device Queue Identifier, DPID or DQID, if that would
> work for you.  Essentially it is just a unique u32 value that should
> not identify any other queue in the system while this device queue is
> active.  Really the number itself is mostly arbitrary, the main thing
> is that it doesn't change and uniquely identifies the queue in the
> system.

That seems reasonably sane to me.

>
>> It would be nice to make this a bit more concrete and a bit less tied
>> to Linux innards.  Perhaps a socket option could instead return a hint
>> saying "for best results, put this socket in an epoll set that's on
>> cpu N"?  After all, you're unlikely to do anything productive with
>> busy polling at all, even on a totally different kernel
>> implementation, if you have more than one epoll set per CPU.  I can
>> see cases where you could plausibly poll with fewer than one set per
>> CPU, I suppose.
>
> Really we kind of already have an option that does what you are
> implying called SO_INCOMING_CPU.  The problem is it requires pinning
> the interrupts to the CPUs in order to keep the values consistent,

Some day the kernel should just solve this problem once and for all.
Have root give a basic policy for mapping queues to CPUs (one per
physical core / one per logical core / use this subset of cores) and
perhaps forcibly prevent irqbalanced from even seeing it.  I'm sure
other solutions are possible.

> and
> even then busy polling can mess that up if the busy poll thread is
> running on a different CPU.  With the NAPI ID we have to do a bit of
> work on the application end, but we can uniquely identify each
> incoming queue and interrupt migration and busy polling don't have any
> effect on it.  So for example we could stack all the interrupts on CPU
> 0, and have our main thread located there doing the sorting of
> incoming requests and handing them out to epoll listener threads on
> other CPUs.  When those epoll listener threads start doing busy
> polling the NAPI ID won't change even though the packet is being
> processed on a different CPU.
>
>> Again, though, from the description, it's totally unclear what a user
>> is supposed to do.
>
> What you end up having to do is essentially create a hash of sorts so
> that you can go from NAPI IDs to threads.  In an ideal setup what you
> end up with is multiple threads, each one running one epoll, and each
> epoll polling on one specific queue.

So don't we want queue id, not NAPI id?  Or am I still missing something?

But I'm also a bit confused as to the overall performance effect.
Suppose I have an rx queue that has its interrupt bound to cpu 0.  For
whatever reason (random chance if I'm hashing, for example), I end up
with the epoll caller on cpu 1.  Suppose further that cpus 0 and 1 are
on different NUMA nodes.

Now, let's suppose that I get lucky and *all* the packets are pulled
off the queue by epoll busy polling.  Life is great [1].  But suppose
that, due to a tiny hiccup or simply user code spending some cycles
processing those packets, an rx interrupt fires.  Now cpu 0 starts
pulling packets off the queue via NAPI, right?  So both NUMA nodes are
fighting over all the cachelines involved in servicing the queue *and*
the packets just got dequeued on the wrong NUMA node.

ISTM this would work better if the epoll busy polling could handle the
case where one epoll set polls sockets on different queues as long as
those queues are all owned by the same CPU.  Then user code could use
SO_INCOMING_CPU to sort out the sockets.

Am I missing something?

[1] Maybe.  How smart is direct cache access?  If it's smart enough,
it'll pre-populate node 0's LLC, which means that life isn't so great
after all.
Eric Dumazet March 24, 2017, 5:07 a.m. UTC | #5
On Thu, Mar 23, 2017 at 9:47 PM, Andy Lutomirski <luto@kernel.org> wrote:

> So don't we want queue id, not NAPI id?  Or am I still missing something?
>
> But I'm also a bit confused as to the overall performance effect.
> Suppose I have an rx queue that has its interrupt bound to cpu 0.  For
> whatever reason (random chance if I'm hashing, for example), I end up
> with the epoll caller on cpu 1.  Suppose further that cpus 0 and 1 are
> on different NUMA nodes.
>
> Now, let's suppose that I get lucky and *all* the packets are pulled
> off the queue by epoll busy polling.  Life is great [1].  But suppose
> that, due to a tiny hiccup or simply user code spending some cycles
> processing those packets, an rx interrupt fires.  Now cpu 0 starts
> pulling packets off the queue via NAPI, right?  So both NUMA nodes are
> fighting over all the cachelines involved in servicing the queue *and*
> the packets just got dequeued on the wrong NUMA node.
>
> ISTM this would work better if the epoll busy polling could handle the
> case where one epoll set polls sockets on different queues as long as
> those queues are all owned by the same CPU.  Then user code could use
> SO_INCOMING_CPU to sort out the sockets.
>

Of course you can do that already.

SO_REUSEPORT + an appropriate eBPF filter can select the best socket to
receive your packets, based on various SMP/NUMA affinities
(BPF_FUNC_get_smp_processor_id or BPF_FUNC_get_numa_node_id).

This new option simply _allows_ other schemes, based on queue IDs, in
the case where each NIC queue can be managed by a group of cores
(presumably on the same NUMA node).


> Am I missing something?
>
> [1] Maybe.  How smart is direct cache access?  If it's smart enough,
> it'll pre-populate node 0's LLC, which means that life isn't so great
> after all.
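
For reference, a minimal sketch of the SO_REUSEPORT + eBPF steering Eric
describes (assumes the reuseport group was built with one listener per
CPU, so the program's return value indexes a socket in the group; error
handling elided, and an out-of-range index falls back to hash selection):

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

#ifndef SO_ATTACH_REUSEPORT_EBPF
#define SO_ATTACH_REUSEPORT_EBPF	52
#endif

/* eBPF program: return the current CPU number; for a reuseport group
 * the return value selects the socket at that index in the group.
 */
static int attach_cpu_steering(int fd)
{
        struct bpf_insn prog[] = {
                { .code = BPF_JMP | BPF_CALL,
                  .imm  = BPF_FUNC_get_smp_processor_id },	/* r0 = CPU */
                { .code = BPF_JMP | BPF_EXIT },			/* return r0 */
        };
        union bpf_attr attr;
        int prog_fd;

        memset(&attr, 0, sizeof(attr));
        attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
        attr.insn_cnt  = sizeof(prog) / sizeof(prog[0]);
        attr.insns     = (__u64)(unsigned long)prog;
        attr.license   = (__u64)(unsigned long)"GPL";

        prog_fd = (int)syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
        if (prog_fd < 0)
                return -1;

        return setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
                          &prog_fd, sizeof(prog_fd));
}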

Patch

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index 089db42c1b40..1bb8cac61a28 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -101,4 +101,6 @@ 
 
 #define SO_MEMINFO		55
 
+#define SO_INCOMING_NAPI_ID	56
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/avr32/include/uapi/asm/socket.h b/arch/avr32/include/uapi/asm/socket.h
index 6eabcbd2f82a..f824eeb0f2e4 100644
--- a/arch/avr32/include/uapi/asm/socket.h
+++ b/arch/avr32/include/uapi/asm/socket.h
@@ -94,4 +94,6 @@ 
 
 #define SO_MEMINFO		55
 
+#define SO_INCOMING_NAPI_ID	56
+
 #endif /* _UAPI__ASM_AVR32_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index bd497f8356b9..a8ad9bebfc47 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -94,5 +94,7 @@ 
 
 #define SO_MEMINFO		55
 
+#define SO_INCOMING_NAPI_ID	56
+
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index f1bb54686168..6af3253e4209 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -103,4 +103,6 @@ 
 
 #define SO_MEMINFO		55
 
+#define SO_INCOMING_NAPI_ID	56
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index 459c46076f6f..e98b6bb897c0 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -94,4 +94,6 @@ 
 
 #define SO_MEMINFO		55
 
+#define SO_INCOMING_NAPI_ID	56
+
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 688c18dd62ef..ae2b62e39d4d 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -112,5 +112,6 @@ 
 
 #define SO_MEMINFO		55
 
+#define SO_INCOMING_NAPI_ID	56
 
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index 312d2c457a04..e4ac1843ee01 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -94,4 +94,6 @@ 
 
 #define SO_MEMINFO		55
 
+#define SO_INCOMING_NAPI_ID	56
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index b98ec38f2083..f754c793e82a 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -93,4 +93,6 @@ 
 
 #define SO_MEMINFO		0x4030
 
+#define SO_INCOMING_NAPI_ID	0x4031
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/powerpc/include/uapi/asm/socket.h b/arch/powerpc/include/uapi/asm/socket.h
index 099a889240f6..5f84af7dcb2e 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -101,4 +101,6 @@ 
 
 #define SO_MEMINFO		55
 
+#define SO_INCOMING_NAPI_ID	56
+
 #endif	/* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 6199bb34f7fa..25ac4960e707 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -100,4 +100,6 @@ 
 
 #define	SO_MEMINFO		55
 
+#define SO_INCOMING_NAPI_ID	56
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index 12cd8c2ec422..b05513acd589 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -90,6 +90,8 @@ 
 
 #define SO_MEMINFO		0x0039
 
+#define SO_INCOMING_NAPI_ID	0x003a
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION		0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT	0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index d0b85f6c1484..786606c81edd 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -105,4 +105,6 @@ 
 
 #define SO_MEMINFO		55
 
+#define SO_INCOMING_NAPI_ID	56
+
 #endif	/* _XTENSA_SOCKET_H */
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 8313702c1eae..c98a52fb572a 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -96,4 +96,6 @@ 
 
 #define SO_MEMINFO		55
 
+#define SO_INCOMING_NAPI_ID	56
+
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index 0aa725cb3dd6..1de6c369ed86 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1328,6 +1328,18 @@  int sock_getsockopt(struct socket *sock, int level, int optname,
 
 		goto lenout;
 	}
+
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	case SO_INCOMING_NAPI_ID:
+		v.val = READ_ONCE(sk->sk_napi_id);
+
+		/* aggregate non-NAPI IDs down to 0 */
+		if (v.val < MIN_NAPI_ID)
+			v.val = 0;
+
+		break;
+#endif
+
 	default:
 		/* We implement the SO_SNDLOWAT etc to not be settable
 		 * (1003.1g 7).