Message ID | alpine.DEB.2.02.1408091013240.1087@tomh.mtv.corp.google.com |
---|---
State | Rejected, archived |
Delegated to: | David Miller |
On Sat, 2014-08-09 at 10:15 -0700, Tom Herbert wrote:
> This patch addresses a problem posed by Eric Dumazet in RPS/RFS
> concerning the interaction between connected sockets and traffic on
> unconnected sockets.
>
> On a server which has both traffic using connected sockets and
> traffic that is going through unconnected UDP sockets, it is very
> possible that the connected sockets could heavily populate the RPS
> flow tables. Packets received on unconnected sockets would then be
> steered based on unrelated entries in the flow tables, which leads
> to suboptimal steering. This happens even if the connections that
> populate the table are inactive with no traffic: as long as the
> connected sockets simply remain open, unrelated traffic can be
> steered using that information. This problem would be further
> exacerbated if the packets on the unconnected UDP sockets are
> actually part of long-lived flows (which apparently would happen
> with QUIC in its current implementation).
>
> This patch clears the RPS flow hash table entry for packets
> received on unconnected UDP sockets. The effect is that the "flows"
> on unconnected sockets will be steered using RPS. We don't do this
> for unconnected UDP tunnels (encap_rcv) under the assumption that
> the flow table entries will be adjusted during processing of inner
> packets.
>
> Tested using super_netperf UDP_RR with 200 flows. Did not see any
> noticeable regression in writing to the flow table for every packet.
>
> Before fix:
> 76.99% CPU utilization
> 112/158/230 90/95/99% latencies
> 1.62776e+06 tps
>
> After fix:
> 76.03% CPU utilization
> 115/162/239 90/95/99% latencies
> 1.62257e+06 tps
>
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
>  net/ipv4/udp.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index f57c0e4..7ba764a 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1444,6 +1444,8 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>  	if (inet_sk(sk)->inet_daddr) {
>  		sock_rps_save_rxhash(sk, skb);
>  		sk_mark_napi_id(sk, skb);
> +	} else {
> +		sock_rps_reset_flow_hash(skb->hash);
>  	}
>
>  	rc = sock_queue_rcv_skb(sk, skb);

I believe this patch is absolutely wrong.

1) Your benchmarks are either all TCP or all UDP, so you can't see
any problem with this kind of heuristic.

Now, a server might have a mixed workload, with some TCP flows and
some thousands of 'unconnected' UDP messages per second.

This patch would completely destroy TCP performance by lowering the
RFS hit rate (the table would be almost empty).

If this were valid, the admin would not set up RFS in the first place.

2) There is no guarantee skb->hash at this point is well spread, so
you might add false sharing in some cases (GRE tunnels, for
instance), as rps_reset_sock_flow() would constantly set the same
entry to RPS_NO_CPU from multiple cpus.

Let's face it, RFS only works if the workload is specialized, not
generic.

If you want RFS to be more usable, you need to find a way so that
UDP messages do not share a hash table used by TCP (which by
definition only deals with connected flows).

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Aug 10, 2014 at 9:59 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sat, 2014-08-09 at 10:15 -0700, Tom Herbert wrote:
>> This patch addresses a problem posed by Eric Dumazet in RPS/RFS
>> concerning the interaction between connected sockets and traffic on
>> unconnected sockets.
>>
>> On a server which has both traffic using connected sockets and
>> traffic that is going through unconnected UDP sockets, it is very
>> possible that the connected sockets could heavily populate the RPS
>> flow tables. Packets received on unconnected sockets would then be
>> steered based on unrelated entries in the flow tables, which leads
>> to suboptimal steering. This happens even if the connections that
>> populate the table are inactive with no traffic: as long as the
>> connected sockets simply remain open, unrelated traffic can be
>> steered using that information. This problem would be further
>> exacerbated if the packets on the unconnected UDP sockets are
>> actually part of long-lived flows (which apparently would happen
>> with QUIC in its current implementation).
>>
>> This patch clears the RPS flow hash table entry for packets
>> received on unconnected UDP sockets. The effect is that the "flows"
>> on unconnected sockets will be steered using RPS. We don't do this
>> for unconnected UDP tunnels (encap_rcv) under the assumption that
>> the flow table entries will be adjusted during processing of inner
>> packets.
>>
>> Tested using super_netperf UDP_RR with 200 flows. Did not see any
>> noticeable regression in writing to the flow table for every packet.
>>
>> Before fix:
>> 76.99% CPU utilization
>> 112/158/230 90/95/99% latencies
>> 1.62776e+06 tps
>>
>> After fix:
>> 76.03% CPU utilization
>> 115/162/239 90/95/99% latencies
>> 1.62257e+06 tps
>>
>> Signed-off-by: Tom Herbert <therbert@google.com>
>> ---
>>  net/ipv4/udp.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index f57c0e4..7ba764a 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -1444,6 +1444,8 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>>  	if (inet_sk(sk)->inet_daddr) {
>>  		sock_rps_save_rxhash(sk, skb);
>>  		sk_mark_napi_id(sk, skb);
>> +	} else {
>> +		sock_rps_reset_flow_hash(skb->hash);
>>  	}
>>
>>  	rc = sock_queue_rcv_skb(sk, skb);
>
> I believe this patch is absolutely wrong.
>
> 1) Your benchmarks are either all TCP or all UDP, so you can't see
> any problem with this kind of heuristic.
>
> Now, a server might have a mixed workload, with some TCP flows and
> some thousands of 'unconnected' UDP messages per second.
>
> This patch would completely destroy TCP performance by lowering the
> RFS hit rate (the table would be almost empty).
>
> If this were valid, the admin would not set up RFS in the first place.
>
You also need to size the tables for your workload and desired hit
rate. This patch does not address that; it is part of the
configuration the admin needs to do. If the admin is unwilling to do
this then, yes, they shouldn't bother setting up RFS!

> 2) There is no guarantee skb->hash at this point is well spread, so
> you might add false sharing in some cases (GRE tunnels, for
> instance), as rps_reset_sock_flow() would constantly set the same
> entry to RPS_NO_CPU from multiple cpus.
>
> Let's face it, RFS only works if the workload is specialized, not
> generic.
>
RFS is a probabilistic algorithm; it is not designed to be correct
100% of the time, but it is adaptive.

P(x) = probability of a packet being steered to an "incorrect" CPU.

So RFS is useful on a server if P(x) < some delta, say 0.01.

P(x) = F(D, S): the probability of incorrect steering is a function
of D, the flow distribution in some time quantum, and S, the size of
the flow tables. So for an expected flow distribution, the table
should be sized accordingly to make P(x) < 0.01.

Assuming that the correct CPU choice for unconnected sockets is to
use RPS, this patch fixes the problem of choosing an incorrect CPU
because of interference with *idle* TCP connections, and hence
reduces P(x) for the workload.

> If you want RFS to be more usable, you need to find a way so that
> UDP messages do not share a hash table used by TCP (which by
> definition only deals with connected flows).
>
Sorry, I don't see how this is feasible, and it would fundamentally
break the whole model that RPS/RFS is flow based, not
protocol-plus-flow based -- we don't need to know the protocol to
steer packets. There is simply nothing special about UDP; as far as
RFS/RPS is concerned, these are flows. Note that connected UDP
sockets work perfectly well, and UDP tunnel packets often carry TCP
anyway, which also works as expected. Even in TCP, if the number of
active connections far exceeds the flow table size, P(x) could start
to approach 1.
On Sun, 2014-08-10 at 13:14 -0700, Tom Herbert wrote:
> Sorry, I don't see how this is feasible, and it would fundamentally
> break the whole model that RPS/RFS is flow based, not
> protocol-plus-flow based --

In a 5-tuple we _do_ have the protocol: TCP, UDP, ...

I do not know why you want to say it should be protocol independent.

This is only because you want to use a NIC-provided hash, but this is
wrong in many cases (tunnels...), and we often fall back to flow
dissection anyway.

The thing is: 100% of TCP packets are flow steered.

And 99% of UDP packets are not, especially on servers where network
performance is an issue.

The current UDP stack does not allow using millions of connected UDP
flows on a server. So a 'server' is _forced_ to use unconnected UDP
sockets.

RFS _assumed_ that all packets would participate in the dance, while
it's obviously not true. When we have a mix of connected/unconnected
packets, the RFS hit rate is very low.

Allowing TCP packets to use RFS, and only TCP packets, would
immediately solve the problem, and remove one cache miss per incoming
UDP packet.

> Even in TCP, if the number of active connections far exceeds
> the flow table size, P(x) could start to approach 1.

Experiments show that only a fraction of flows are really active at a
given point. For the others, we do not care which cpu handles the one
packet every xx seconds.

Experiments show that increasing the flow hash table size has very
little impact beyond an overall memory increase.

If we have a lot of TCP "active" flows, then RFS is not worth it.
Prefer normal steering on a multiqueue NIC, because affine wakeups
will be far better.
On Sun, Aug 10, 2014 at 6:07 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sun, 2014-08-10 at 13:14 -0700, Tom Herbert wrote:
>
>> Sorry, I don't see how this is feasible, and it would fundamentally
>> break the whole model that RPS/RFS is flow based, not
>> protocol-plus-flow based --
>
> In a 5-tuple we _do_ have the protocol: TCP, UDP, ...
>
> I do not know why you want to say it should be protocol independent.
>
> This is only because you want to use a NIC-provided hash, but this
> is wrong in many cases (tunnels...), and we often fall back to flow
> dissection anyway.
>
> The thing is: 100% of TCP packets are flow steered.
>
> And 99% of UDP packets are not, especially on servers where network
> performance is an issue.
>
> The current UDP stack does not allow using millions of connected UDP
> flows on a server. So a 'server' is _forced_ to use unconnected UDP
> sockets.
>
> RFS _assumed_ that all packets would participate in the dance, while
> it's obviously not true. When we have a mix of
> connected/unconnected packets, the RFS hit rate is very low.
>
> Allowing TCP packets to use RFS, and only TCP packets, would
> immediately solve the problem, and remove one cache miss per
> incoming UDP packet.
>
>> Even in TCP, if the number of active connections far exceeds
>> the flow table size, P(x) could start to approach 1.
>
> Experiments show that only a fraction of flows are really active at
> a given point. For the others, we do not care which cpu handles the
> one packet every xx seconds.
>
> Experiments show that increasing the flow hash table size has very
> little impact beyond an overall memory increase.
>
> If we have a lot of TCP "active" flows, then RFS is not worth it.
> Prefer normal steering on a multiqueue NIC, because affine wakeups
> will be far better.
>
Then why not do that if it solves your problem? RFS is optionally
configured and no one is under any obligation to use it.

In fact, if RFS is indeed completely useless, I'd rather see it
removed from the kernel than hacked up just to extend its lifetime if
it's now only useful in a few specialized use cases. On the other
hand, if you are really interested in fixing it, please start to
articulate and quantify the problems you're seeing, provide the test
cases and real data that demonstrate the problem, and describe
exactly what possible solutions you've tried and precisely why they
have or haven't worked.

Thanks,
Tom
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f57c0e4..7ba764a 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1444,6 +1444,8 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	if (inet_sk(sk)->inet_daddr) {
 		sock_rps_save_rxhash(sk, skb);
 		sk_mark_napi_id(sk, skb);
+	} else {
+		sock_rps_reset_flow_hash(skb->hash);
 	}
 
 	rc = sock_queue_rcv_skb(sk, skb);
This patch addresses a problem posed by Eric Dumazet in RPS/RFS
concerning the interaction between connected sockets and traffic on
unconnected sockets.

On a server which has both traffic using connected sockets and
traffic that is going through unconnected UDP sockets, it is very
possible that the connected sockets could heavily populate the RPS
flow tables. Packets received on unconnected sockets would then be
steered based on unrelated entries in the flow tables, which leads to
suboptimal steering. This happens even if the connections that
populate the table are inactive with no traffic: as long as the
connected sockets simply remain open, unrelated traffic can be
steered using that information. This problem would be further
exacerbated if the packets on the unconnected UDP sockets are
actually part of long-lived flows (which apparently would happen with
QUIC in its current implementation).

This patch clears the RPS flow hash table entry for packets received
on unconnected UDP sockets. The effect is that the "flows" on
unconnected sockets will be steered using RPS. We don't do this for
unconnected UDP tunnels (encap_rcv) under the assumption that the
flow table entries will be adjusted during processing of inner
packets.

Tested using super_netperf UDP_RR with 200 flows. Did not see any
noticeable regression in writing to the flow table for every packet.

Before fix:
76.99% CPU utilization
112/158/230 90/95/99% latencies
1.62776e+06 tps

After fix:
76.03% CPU utilization
115/162/239 90/95/99% latencies
1.62257e+06 tps

Signed-off-by: Tom Herbert <therbert@google.com>
---
 net/ipv4/udp.c | 2 ++
 1 file changed, 2 insertions(+)