
[v2,net-next] udp: clear rps flow table for packets recv on UDP unconnected sockets

Message ID alpine.DEB.2.02.1408091013240.1087@tomh.mtv.corp.google.com
State Rejected, archived
Delegated to: David Miller
Headers show

Commit Message

Tom Herbert Aug. 9, 2014, 5:15 p.m. UTC
This patch addresses a problem posed by Eric Dumazet in RPS/RFS
concerning the interaction between connected sockets and traffic on
unconnected sockets.

On a server which has both traffic on connected sockets and traffic
going through unconnected UDP sockets, it is very possible that the
connected sockets could heavily populate the RPS flow tables. Packets
received on unconnected sockets would then be steered based on
unrelated entries in the flow tables, which leads to suboptimal
steering. This happens even if the connections that populate the table
are inactive with no traffic: as long as the connected sockets simply
remain open, unrelated traffic can be steered using that information.
This problem would be further exacerbated if the packets on the
unconnected UDP sockets are actually part of long-lived flows (which
apparently would happen with QUIC in its current implementation).

This patch clears the RPS flow hash table entry for packets received
on unconnected UDP sockets. The effect is that the "flows" on
unconnected sockets will be steered using RPS. We don't do this for
unconnected UDP tunnels (encap_rcv) under the assumption that the flow
table entries will be adjusted during processing of inner packets.

Tested using super_netperf UDP_RR with 200 flows. Did not see any
noticeable regression from writing to the flow table for every packet.

Before fix:
  76.99% CPU utilization
  112/158/230 90/95/99% latencies
  1.62776e+06 tps

After fix:
  76.03% CPU utilization
  115/162/239 90/95/99% latencies
  1.62257e+06 tps

Signed-off-by: Tom Herbert <therbert@google.com>
---
 net/ipv4/udp.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Eric Dumazet Aug. 10, 2014, 4:59 p.m. UTC | #1
On Sat, 2014-08-09 at 10:15 -0700, Tom Herbert wrote:
> This patch addresses a problem posed by Eric Dumazet in RPS/RFS
> concerning the interaction between connected sockets and traffic on
> unconnected sockets.
> 
> On a server which has both traffic on connected sockets and traffic
> going through unconnected UDP sockets, it is very possible that the
> connected sockets could heavily populate the RPS flow tables. Packets
> received on unconnected sockets would then be steered based on
> unrelated entries in the flow tables, which leads to suboptimal
> steering. This happens even if the connections that populate the table
> are inactive with no traffic: as long as the connected sockets simply
> remain open, unrelated traffic can be steered using that information.
> This problem would be further exacerbated if the packets on the
> unconnected UDP sockets are actually part of long-lived flows (which
> apparently would happen with QUIC in its current implementation).
> 
> This patch clears the RPS flow hash table entry for packets received
> on unconnected UDP sockets. The effect is that the "flows" on
> unconnected sockets will be steered using RPS. We don't do this for
> unconnected UDP tunnels (encap_rcv) under the assumption that the flow
> table entries will be adjusted during processing of inner packets.
> 
> Tested using super_netperf UDP_RR with 200 flows. Did not see any
> noticeable regression from writing to the flow table for every packet.
> 
> Before fix:
>   76.99% CPU utilization
>   112/158/230 90/95/99% latencies
>   1.62776e+06 tps
> 
> After fix:
>   76.03% CPU utilization
>   115/162/239 90/95/99% latencies
>   1.62257e+06 tps
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
>  net/ipv4/udp.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index f57c0e4..7ba764a 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1444,6 +1444,8 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>  	if (inet_sk(sk)->inet_daddr) {
>  		sock_rps_save_rxhash(sk, skb);
>  		sk_mark_napi_id(sk, skb);
> +	} else {
> +		sock_rps_reset_flow_hash(skb->hash);
>  	}
>  
>  	rc = sock_queue_rcv_skb(sk, skb);


I believe this patch is absolutely wrong.

1) Your benchmarks are either all TCP or all UDP, so you can't see
any problem with this kind of heuristic.

Now, a server might have a mixed workload, with some TCP flows, and some
thousands of 'unconnected' UDP messages per second.

This patch would completely destroy TCP performance by lowering the
RFS hit rate (the table would be almost empty).

If this were valid, the admin would not set up RFS in the first place.

2) There is no guarantee that skb->hash at this point is well spread,
so you might add false sharing in some cases (GRE tunnels, for
instance), as rps_reset_sock_flow() would constantly set the same
entry to RPS_NO_CPU from multiple CPUs.

Let's face it: RFS only works if the workload is specialized, not generic.

If you want RFS to be more usable, you need to find a way so that UDP
messages do not share the hash table used by TCP (which by definition
only deals with connected flows).



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert Aug. 10, 2014, 8:14 p.m. UTC | #2
On Sun, Aug 10, 2014 at 9:59 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sat, 2014-08-09 at 10:15 -0700, Tom Herbert wrote:
>> This patch addresses a problem posed by Eric Dumazet in RPS/RFS
>> concerning the interaction between connected sockets and traffic on
>> unconnected sockets.
>>
>> On a server which has both traffic on connected sockets and traffic
>> going through unconnected UDP sockets, it is very possible that the
>> connected sockets could heavily populate the RPS flow tables. Packets
>> received on unconnected sockets would then be steered based on
>> unrelated entries in the flow tables, which leads to suboptimal
>> steering. This happens even if the connections that populate the table
>> are inactive with no traffic: as long as the connected sockets simply
>> remain open, unrelated traffic can be steered using that information.
>> This problem would be further exacerbated if the packets on the
>> unconnected UDP sockets are actually part of long-lived flows (which
>> apparently would happen with QUIC in its current implementation).
>>
>> This patch clears the RPS flow hash table entry for packets received
>> on unconnected UDP sockets. The effect is that the "flows" on
>> unconnected sockets will be steered using RPS. We don't do this for
>> unconnected UDP tunnels (encap_rcv) under the assumption that the flow
>> table entries will be adjusted during processing of inner packets.
>>
>> Tested using super_netperf UDP_RR with 200 flows. Did not see any
>> noticeable regression from writing to the flow table for every packet.
>>
>> Before fix:
>>   76.99% CPU utilization
>>   112/158/230 90/95/99% latencies
>>   1.62776e+06 tps
>>
>> After fix:
>>   76.03% CPU utilization
>>   115/162/239 90/95/99% latencies
>>   1.62257e+06 tps
>>
>> Signed-off-by: Tom Herbert <therbert@google.com>
>> ---
>>  net/ipv4/udp.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index f57c0e4..7ba764a 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -1444,6 +1444,8 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>>       if (inet_sk(sk)->inet_daddr) {
>>               sock_rps_save_rxhash(sk, skb);
>>               sk_mark_napi_id(sk, skb);
>> +     } else {
>> +             sock_rps_reset_flow_hash(skb->hash);
>>       }
>>
>>       rc = sock_queue_rcv_skb(sk, skb);
>
>
> I believe this patch is absolutely wrong.
>
> 1) Your benchmarks are either all TCP or all UDP, so you can't see
> any problem with this kind of heuristic.
>
> Now, a server might have a mixed workload, with some TCP flows, and some
> thousands of 'unconnected' UDP messages per second.
>
> This patch would completely destroy TCP performance by lowering the
> RFS hit rate (the table would be almost empty).
>
> If this were valid, the admin would not set up RFS in the first place.
>
You also need to size the tables for your workload and desired hit
rate; this patch does not address that, as it is part of the
configuration the admin needs to do. If the admin is unwilling to do
this, then, yes, they shouldn't bother setting up RFS!

> 2) There is no guarantee that skb->hash at this point is well spread,
> so you might add false sharing in some cases (GRE tunnels, for
> instance), as rps_reset_sock_flow() would constantly set the same
> entry to RPS_NO_CPU from multiple CPUs.
>
> Let's face it: RFS only works if the workload is specialized, not generic.
>

RFS is a probabilistic algorithm; it is not designed to be correct
100% of the time, but it is adaptive.

Let P(x) be the probability of a packet being steered to an
"incorrect" CPU. RFS is useful on a server if P(x) < some delta, say
0.01.

P(x) = F(D, S): the probability of incorrect steering is a function of
D, the flow distribution in some time quantum, and S, the size of the
flow tables. So for an expected flow distribution, the table should be
sized accordingly to make P(x) < 0.01.

Assuming that the correct CPU choice for unconnected sockets is to use
RPS, this patch fixes the problem of choosing an incorrect CPU because
of interference with *idle* TCP connections, and hence reduces P(x)
for the workload.

> If you want RFS to be more usable, you need to find a way so that UDP
> messages do not share the hash table used by TCP (which by definition
> only deals with connected flows).
>

Sorry, I don't see how this is feasible, and it would fundamentally
break the whole model that RPS/RFS is flow based, not protocol-flow
based -- we don't need to know the protocol to steer packets. There is
simply nothing special about UDP; as far as RFS/RPS is concerned these
are flows. Note that connected UDP sockets work perfectly well, and
UDP tunnel packets often carry TCP packets anyway, which also work as
expected. Even in TCP, if the number of active connections far exceeds
the flow table size, P(x) could start to approach 1.

>
>
Eric Dumazet Aug. 11, 2014, 1:07 a.m. UTC | #3
On Sun, 2014-08-10 at 13:14 -0700, Tom Herbert wrote:

> Sorry, I don't see how this is feasible and would fundamentally break
> the whole model that RPS/RFS is flow based, not protocol-flow based--

In a 5-tuple we _do_ have the protocol: TCP, UDP, ...

I do not know why you want to say it should be protocol independent.

This is only because you want to use a NIC-provided hash, but this is
wrong in many cases (tunnels...), and we often fall back to flow
dissection anyway.

The thing is: 100% of TCP packets are flow steered.

And 99% of UDP packets are not, especially on servers where network
performance is an issue.

The current UDP stack does not allow the use of millions of connected
UDP flows on a server, so a 'server' is _forced_ to use unconnected
UDP sockets.

RFS _assumed_ that all packets would participate in the dance, while
it's obviously not true. When we have a mix of connected/unconnected
packets, the RFS hit rate is very low.

Allowing TCP packets to use RFS, and only TCP packets, would immediately
solve the problem, and remove one cache miss per incoming UDP packet.


> Even in TCP, if the number of active connections far exceeds
> the flow table size P(x) could start to approach 1.

Experiments show that only a fraction of flows are really active at a
given point. For the others, we do not care which CPU handles the one
packet every xx seconds.

Experiments show that increasing the flow hash table size has very
little impact, apart from the overall memory increase.

If we have a lot of TCP "active" flows, then RFS is not worth it.

Prefer normal steering on a multiqueue NIC, because affine wakeups
will be far better.



Tom Herbert Aug. 11, 2014, 3:47 p.m. UTC | #4
On Sun, Aug 10, 2014 at 6:07 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sun, 2014-08-10 at 13:14 -0700, Tom Herbert wrote:
>
>> Sorry, I don't see how this is feasible and would fundamentally break
>> the whole model that RPS/RFS is flow based, not protocol-flow based--
>
> In a 5-tuple we _do_ have the protocol: TCP, UDP, ...
>
> I do not know why you want to say it should be protocol independent.
>
> This is only because you want to use a NIC-provided hash, but this is
> wrong in many cases (tunnels...), and we often fall back to flow
> dissection anyway.
>
> The thing is: 100% of TCP packets are flow steered.
>
> And 99% of UDP packets are not, especially on servers where network
> performance is an issue.
>
> The current UDP stack does not allow the use of millions of connected
> UDP flows on a server, so a 'server' is _forced_ to use unconnected
> UDP sockets.
>
> RFS _assumed_ that all packets would participate in the dance, while
> it's obviously not true. When we have a mix of connected/unconnected
> packets, the RFS hit rate is very low.
>
> Allowing TCP packets to use RFS, and only TCP packets, would immediately
> solve the problem, and remove one cache miss per incoming UDP packet.
>
>
>> Even in TCP, if the number of active connections far exceeds
>> the flow table size P(x) could start to approach 1.
>
> Experiments show that only a fraction of flows are really active at a
> given point. For the others, we do not care which CPU handles the one
> packet every xx seconds.
>
> Experiments show that increasing the flow hash table size has very
> little impact, apart from the overall memory increase.
>
> If we have a lot of TCP "active" flows, then RFS is not worth it.
>
> Prefer a normal steering on a multiqueue NIC, because affine wakeups
> will be far better.
>
Then why not do that, if it solves your problem? RFS is optionally
configured, and no one is under any obligation to use it. In fact, if
RFS is indeed completely useless, I'd rather see it removed from the
kernel than hacked up just to extend its lifetime if it's now only
useful in a few specialized use cases. On the other hand, if you are
really interested in fixing it, please start to articulate and
quantify the problems you're seeing, provide the test cases and real
data that demonstrate the problem, and describe exactly what possible
solutions you've tried and precisely why they have or haven't worked.

Thanks,
Tom

>
>

Patch

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f57c0e4..7ba764a 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1444,6 +1444,8 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	if (inet_sk(sk)->inet_daddr) {
 		sock_rps_save_rxhash(sk, skb);
 		sk_mark_napi_id(sk, skb);
+	} else {
+		sock_rps_reset_flow_hash(skb->hash);
 	}
 
 	rc = sock_queue_rcv_skb(sk, skb);