diff mbox

[v2,net-next] net: Add low-latency/polling support for UDP multicast

Message ID 1375818663-12318-1-git-send-email-sbohrer@rgmadvisors.com
State Rejected, archived
Delegated to: David Miller
Headers show

Commit Message

Shawn Bohrer Aug. 6, 2013, 7:51 p.m. UTC
Set the napi id for each socket in the multicast path to enable
low-latency/polling support.

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
---
v2 include ipv6 support

 net/ipv4/udp.c |    1 +
 net/ipv6/udp.c |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

Comments

Eric Dumazet Aug. 7, 2013, 8:22 p.m. UTC | #1
On Tue, 2013-08-06 at 14:51 -0500, Shawn Bohrer wrote:
> Set the napi id for each socket in the multicast path to enable
> low-latency/polling support.
> 
> Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
> ---
> v2 include ipv6 support

This might help your workload, but I doubt it is generic enough.

One UDP socket is supposed to receive traffic from many endpoints,
so we have no guarantee all the received traffic will end on a single RX
queue on the NIC.

That's the same logic than RFS here.

sk_mark_napi_id() in UDP are wrong IMHO.

It should be guarded by the following test in  
__udp_queue_rcv_skb()

if (inet_sk(sk)->inet_daddr) {
    sock_rps_save_rxhash(sk, skb);
    sk_mark_napi_id(sk, skb);
}

(To occur only for connected UDP sockets, where we are 100% sure all
packets will use this same rxhash/rx queue)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eliezer Tamir Aug. 8, 2013, 8:46 a.m. UTC | #2
On 07/08/2013 23:22, Eric Dumazet wrote:
> On Tue, 2013-08-06 at 14:51 -0500, Shawn Bohrer wrote:
>> Set the napi id for each socket in the multicast path to enable
>> low-latency/polling support.
>>
>> Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
>> ---
>> v2 include ipv6 support
> 
> This might help your workload, but I doubt it is generic enough.
> 
> One UDP socket is supposed to receive traffic from many endpoints,
> so we have no guarantee all the received traffic will end on a single RX
> queue on the NIC.
> 
> That's the same logic than RFS here.
> 
> sk_mark_napi_id() in UDP are wrong IMHO.
> 
> It should be guarded by the following test in  
> __udp_queue_rcv_skb()
> 
> if (inet_sk(sk)->inet_daddr) {
>     sock_rps_save_rxhash(sk, skb);
>     sk_mark_napi_id(sk, skb);
> }
> 
> (To occur only for connected UDP sockets, where we are 100% sure all
> packets will use this same rxhash/rx queue)

This would also be safe if there is only one NIC and said NIC was
programmed to always place this socket's data on the same queue.

I don't have a good suggestion on how to detect this.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Aug. 8, 2013, 11:55 p.m. UTC | #3
On Thu, 2013-08-08 at 11:46 +0300, Eliezer Tamir wrote:
> On 07/08/2013 23:22, Eric Dumazet wrote:
> > On Tue, 2013-08-06 at 14:51 -0500, Shawn Bohrer wrote:
> >> Set the napi id for each socket in the multicast path to enable
> >> low-latency/polling support.
> >>
> >> Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
> >> ---
> >> v2 include ipv6 support
> > 
> > This might help your workload, but I doubt it is generic enough.
> > 
> > One UDP socket is supposed to receive traffic from many endpoints,
> > so we have no guarantee all the received traffic will end on a single RX
> > queue on the NIC.
> > 
> > That's the same logic than RFS here.
> > 
> > sk_mark_napi_id() in UDP are wrong IMHO.
> > 
> > It should be guarded by the following test in  
> > __udp_queue_rcv_skb()
> > 
> > if (inet_sk(sk)->inet_daddr) {
> >     sock_rps_save_rxhash(sk, skb);
> >     sk_mark_napi_id(sk, skb);
> > }
> > 
> > (To occur only for connected UDP sockets, where we are 100% sure all
> > packets will use this same rxhash/rx queue)
> 
> This would also be safe if there is only one NIC and said NIC was
> programmed to always place this socket's data on the same queue.
> 
> I don't have a good suggestion on how to detect this.

Well, this stuff relies on flows being correctly steered.

TCP stack performs much better if we avoid reorders ;)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eliezer Tamir Aug. 11, 2013, 7:59 a.m. UTC | #4
On 09/08/2013 02:55, Eric Dumazet wrote:
> On Thu, 2013-08-08 at 11:46 +0300, Eliezer Tamir wrote:
>> On 07/08/2013 23:22, Eric Dumazet wrote:
>>> sk_mark_napi_id() in UDP are wrong IMHO.
>>>
>>> It should be guarded by the following test in  
>>> __udp_queue_rcv_skb()
>>>
>>> if (inet_sk(sk)->inet_daddr) {
>>>     sock_rps_save_rxhash(sk, skb);
>>>     sk_mark_napi_id(sk, skb);
>>> }
>>>
>>> (To occur only for connected UDP sockets, where we are 100% sure all
>>> packets will use this same rxhash/rx queue)
>>
>> This would also be safe if there is only one NIC and said NIC was
>> programmed to always place this socket's data on the same queue.
>>
>> I don't have a good suggestion on how to detect this.
> 
> Well, this stuff relies on flows being correctly steered.
> 
> TCP stack performs much better if we avoid reorders ;)

Let's limit the discussion to UDP for now.
If you are getting packets on multiple queues for a TCP socket,
it's hard to see how you can avoid caused reordering.

Maybe it would be enough to have a socket option that the user
can set to tell the stack "I know what I'm doing, please allow busy
polling on this UDP socket, even though it's not bound".
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 766e6ba..0d0da17 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1596,6 +1596,7 @@  static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	dif = skb->dev->ifindex;
 	sk = udp_v4_mcast_next(net, sk, uh->dest, daddr, uh->source, saddr, dif);
 	while (sk) {
+		sk_mark_napi_id(sk, skb);
 		stack[count++] = sk;
 		sk = udp_v4_mcast_next(net, sk_nulls_next(sk), uh->dest,
 				       daddr, uh->source, saddr, dif);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index f405815..82be372 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -756,6 +756,7 @@  static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	dif = inet6_iif(skb);
 	sk = udp_v6_mcast_next(net, sk, uh->dest, daddr, uh->source, saddr, dif);
 	while (sk) {
+		sk_mark_napi_id(sk, skb);
 		stack[count++] = sk;
 		sk = udp_v6_mcast_next(net, sk_nulls_next(sk), uh->dest, daddr,
 				       uh->source, saddr, dif);