Message ID | 1271750198.3845.216.camel@edumazet-laptop |
---|---|
State | Accepted, archived |
Delegated to: | David Miller |
Headers | show |
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Tue, 20 Apr 2010 09:56:38 +0200 > In case we compute a software skb->rxhash, we can generate a consistent > hash : Its value will be the same in both flow directions. > > This helps some workloads, like conntracking, since the same state needs > to be accessed in both directions. > > tbench + RFS + this patch gives better results than tbench with default > kernel configuration (no RPS, no RFS) > > Also fixed some sparse warnings. > > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Applied. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Eric Dumazet wrote: > In case we compute a software skb->rxhash, we can generate a consistent > hash : Its value will be the same in both flow directions. > > This helps some workloads, like conntracking, since the same state needs > to be accessed in both directions. > > tbench + RFS + this patch gives better results than tbench with default > kernel configuration (no RPS, no RFS) > > Also fixed some sparse warnings. > > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> > --- I thought about this for some time... Do we really need the port numbers here at all? A simple addr1^addr2 can provide a good enough pointer for distribution amongst CPUs. The real connection tracking is better done locally at the corresponding CPU. That way a potential cache miss can be avoided and the still needed hash calculation for connection tracking will be offloaded. Franco -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le mardi 20 avril 2010 à 14:48 +0200, Franco Fichtner a écrit : > > I thought about this for some time... > > Do we really need the port numbers here at all? A simple > addr1^addr2 can provide a good enough pointer for > distribution amongst CPUs. > > The real connection tracking is better done locally at the > corresponding CPU. That way a potential cache miss can be > avoided and the still needed hash calculation for > connection tracking will be offloaded. > Yes, doing the port test/swap is useful in the loopback case (addr1 == addr2). This is probably a bit convoluted, but David (and me) found this funny ;) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Eric Dumazet wrote: > Le mardi 20 avril 2010 à 14:48 +0200, Franco Fichtner a écrit : > >> I thought about this for some time... >> >> Do we really need the port numbers here at all? A simple >> addr1^addr2 can provide a good enough pointer for >> distribution amongst CPUs. >> >> The real connection tracking is better done locally at the >> corresponding CPU. That way a potential cache miss can be >> avoided and the still needed hash calculation for >> connection tracking will be offloaded. >> > > Yes, doing the port test/swap is useful in the loopback case > (addr1 == addr2). > > This is probably a bit convoluted, but David (and me) found this > funny ;) > > It is funny, but I fail to see the big picture of the firewall / conntrack application here. It looks like this is needed for local netperf tests to impress, but it's a quite special use case, isn't it? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le mardi 20 avril 2010 à 16:03 +0200, Franco Fichtner a écrit : > > It is funny, but I fail to see the big picture of the > firewall / conntrack application here. It looks like > this is needed for local netperf tests to impress, but > it's a quite special use case, isn't it? I know many applications using TCP on loopback, they are real :) What I find 'funny' are not the tbench results, but the fact that RFS can give pretty good hints to process scheduler, something that might be good to investigate by scheduler specialists. In the meantime, if some admin finds that setting RFS on loopback can boost by 10% its application, why not ? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> I thought about this for some time... > > Do we really need the port numbers here at all? A simple > addr1^addr2 can provide a good enough pointer for > distribution amongst CPUs. > What about a server behind a TCP proxy? Also, need to minimize collisions for RPS to be effective. Tom > The real connection tracking is better done locally at the > corresponding CPU. That way a potential cache miss can be > avoided and the still needed hash calculation for > connection tracking will be offloaded. > > > Franco > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Tue, 20 Apr 2010 16:57:01 +0200 > I know many applications using TCP on loopback, they are real :) This is all true and I support your hashing patch and all of that. But if we really want TCP over loopback to go fast, there are much better ways to do this. Eric, do you remember that "TCP friends" rough patch I sent you last year that essentailly made TCP sockets over loopback behave like AF_UNIX ones and just queue the SKBs directly to the destination socket without doing any protocol work? If we ever got that working, tbench performance would become impressive :) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Apr 21, 2010 at 5:41 AM, David Miller <davem@davemloft.net> wrote: > Eric, do you remember that "TCP friends" rough patch I sent you last > year that essentailly made TCP sockets over loopback behave like > AF_UNIX ones and just queue the SKBs directly to the destination > socket without doing any protocol work? I think it will break some benchmark tools. The loopback device is for testing networking protocol stacks, so we shouldn't bypass the protocol processing. And anyone who has a performance problem of loopback device should turn to UNIX domain socket. For routers, how about letting users choose whether RPS mixes layer 4 info in?
From: Changli Gao <xiaosuo@gmail.com> Date: Wed, 21 Apr 2010 07:35:48 +0800 > On Wed, Apr 21, 2010 at 5:41 AM, David Miller <davem@davemloft.net> wrote: >> Eric, do you remember that "TCP friends" rough patch I sent you last >> year that essentailly made TCP sockets over loopback behave like >> AF_UNIX ones and just queue the SKBs directly to the destination >> socket without doing any protocol work? > > I think it will break some benchmark tools. Other systems already do this optimization, so if things break, this breakage is already pervasive. We should be able to tell people that they can use TCP solely in their applications and it will perform optimally regardless of transport. People already code their applications this way, and ignoring this issue would just makes us stupid. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Tom Herbert wrote: >> I thought about this for some time... >> >> Do we really need the port numbers here at all? A simple >> addr1^addr2 can provide a good enough pointer for >> distribution amongst CPUs. >> > > What about a server behind a TCP proxy? Also, need to minimize > collisions for RPS to be effective What about routers? What about loopback? This all boils down to the same issue of obscuring IP data by "magical" means and then reattaching functionality by reaching for upper layer information. It is necessary in some cases, but it can cripple performance for other cases. The interesting thing is you don't need to deal with collisions while distributing amonst cpus at all. You just need to make sure the distribution algorithm keeps every single flow attached to the correct cpu. All of the actual flow hashing, tracking and whatever else the traffic needs to go through can be done locally by cpu x which helps a lot with load distribution and cache issues in mind. It also helps locking because there is no global flow lookup table. Oh, and it also reduces collisions with every cpu you add for receiving. I work with a lot of plain office and ISP traffic in mind daily, so please don't misunderstand my motivation here. I'd hate to see poor performance in scenarios in which there is a lot of potential improvement. Franco -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le mercredi 21 avril 2010 à 11:29 +0200, Franco Fichtner a écrit : > Tom Herbert wrote: > >> I thought about this for some time... > >> > >> Do we really need the port numbers here at all? A simple > >> addr1^addr2 can provide a good enough pointer for > >> distribution amongst CPUs. > >> > > > > What about a server behind a TCP proxy? Also, need to minimize > > collisions for RPS to be effective > > What about routers? What about loopback? This all boils down to > the same issue of obscuring IP data by "magical" means and then > reattaching functionality by reaching for upper layer information. > It is necessary in some cases, but it can cripple performance > for other cases. > > The interesting thing is you don't need to deal with collisions > while distributing amonst cpus at all. You just need to make sure > the distribution algorithm keeps every single flow attached to > the correct cpu. > > All of the actual flow hashing, tracking and whatever else the > traffic needs to go through can be done locally by cpu x which > helps a lot with load distribution and cache issues in mind. It > also helps locking because there is no global flow lookup table. > Oh, and it also reduces collisions with every cpu you add for > receiving. > > I work with a lot of plain office and ISP traffic in mind daily, > so please don't misunderstand my motivation here. I'd hate to > see poor performance in scenarios in which there is a lot of > potential improvement. > I am a bit lost by this conversation. Are you saying something is wrong with current schem ? What are exactly your suggestions ? Tom replied to you that a hash derived from (addr1 ^ addr2) would not work in situations where all flows goes from machine A to machine B (all hashes would be the same) Current hash is probably more than enough to cover all situations. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Eric Dumazet wrote: > Le mercredi 21 avril 2010 à 11:29 +0200, Franco Fichtner a écrit : >> Tom Herbert wrote: >>>> I thought about this for some time... >>>> >>>> Do we really need the port numbers here at all? A simple >>>> addr1^addr2 can provide a good enough pointer for >>>> distribution amongst CPUs. >>>> >>> What about a server behind a TCP proxy? Also, need to minimize >>> collisions for RPS to be effective >> What about routers? What about loopback? This all boils down to >> the same issue of obscuring IP data by "magical" means and then >> reattaching functionality by reaching for upper layer information. >> It is necessary in some cases, but it can cripple performance >> for other cases. >> >> The interesting thing is you don't need to deal with collisions >> while distributing amonst cpus at all. You just need to make sure >> the distribution algorithm keeps every single flow attached to >> the correct cpu. >> >> All of the actual flow hashing, tracking and whatever else the >> traffic needs to go through can be done locally by cpu x which >> helps a lot with load distribution and cache issues in mind. It >> also helps locking because there is no global flow lookup table. >> Oh, and it also reduces collisions with every cpu you add for >> receiving. >> >> I work with a lot of plain office and ISP traffic in mind daily, >> so please don't misunderstand my motivation here. I'd hate to >> see poor performance in scenarios in which there is a lot of >> potential improvement. >> > > I am a bit lost by this conversation. > > Are you saying something is wrong with current schem ? > > What are exactly your suggestions ? > Hashing for cpu distribution should be as minimal as it could possibly be with the least number operations needed to compute a hash, which normally involves touching one cold cache line (ip header). If you add the ports to your mix you have the luxury of solving static ip mappings, but only for protocols that support it. Usage of the destination port may also prove to be more or less pointless with a lot of http traffic, because it's most likely static. And you add another potential cold cache line access. For a lot of traffic scenarios, we'll have a bunch of internal ips and the internet on the other side, so having a simple hash based on a flavor if internal/external ip is more than enough to work with for distribution. If the network card can provide a complete hash all the better. Then this part of my point is void. But then, hashing for cpu distribution should have nothing todo with real flow tracking with lookup tables for let's say a firewall or dpi application, because that data is only needed by local cpu and can be gathered after distribution. Simply put, the lookup for the flow, if it is needed, does not belong to distribution. It can be outsourced to the destination cpu or just simply be ignored, if the application doesn't care. > Tom replied to you that a hash derived from (addr1 ^ addr2) would not > work in situations where all flows goes from machine A to machine B > (all hashes would be the same) > Yes, I see the point. And all I'm just asking if it's wise to optimize for this particular scenario. If you spin this idea further beyond flow tracking, maybe an application also needs to do some kind of user tracking by ip. Wouldn't it make sense to have user based flows on a more local basis, not a global one because ports will get in the way? > Current hash is probably more than enough to cover all situations. > I agree with this, but would like to point out the phrasing "probably more than enough". :) Franco -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le mercredi 21 avril 2010 à 13:06 +0200, Franco Fichtner a écrit : > > > > Hashing for cpu distribution should be as minimal as it could possibly > be with the least number operations needed to compute a hash, which > normally involves touching one cold cache line (ip header). If you add > the ports to your mix you have the luxury of solving static ip mappings, > but only for protocols that support it. Usage of the destination port > may also prove to be more or less pointless with a lot of http traffic, > because it's most likely static. And you add another potential cold > cache line access. For a lot of traffic scenarios, we'll have a bunch of > internal ips and the internet on the other side, so having a simple hash > based on a flavor if internal/external ip is more than enough to work > with for distribution. If the network card can provide a complete hash > all the better. Then this part of my point is void. > But we already have to bring into our cpu cache one cache line, needed in eth_type_trans() : (12+2 bytes of ethernet header) TCP/UDP tuples are included into this cache line (64 bytes on current popular arches) Cost of rxhash is absolute noise into the picture. A device provided hash, to be effective, would also make eth_type_trans() call not done. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Apr 20, 2010 at 2:41 PM, David Miller <davem@davemloft.net> wrote: > From: Eric Dumazet <eric.dumazet@gmail.com> > Date: Tue, 20 Apr 2010 16:57:01 +0200 > >> I know many applications using TCP on loopback, they are real :) > > This is all true and I support your hashing patch and all of that. > > But if we really want TCP over loopback to go fast, there are much > better ways to do this. > > Eric, do you remember that "TCP friends" rough patch I sent you last > year that essentailly made TCP sockets over loopback behave like > AF_UNIX ones and just queue the SKBs directly to the destination > socket without doing any protocol work? > This is sounds very interesting! Could you post a patch? :-) > If we ever got that working, tbench performance would become > impressive :) > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Tom Herbert <therbert@google.com> Date: Wed, 21 Apr 2010 12:12:41 -0700 > On Tue, Apr 20, 2010 at 2:41 PM, David Miller <davem@davemloft.net> wrote: >> Eric, do you remember that "TCP friends" rough patch I sent you last >> year that essentailly made TCP sockets over loopback behave like >> AF_UNIX ones and just queue the SKBs directly to the destination >> socket without doing any protocol work? >> > > This is sounds very interesting! Could you post a patch? :-) I'll see if I can find it, I sent it to Eric more than a year ago... The basic scheme was pretty simple: 1) Add "struct sock *friend" to struct sk_buff 2) TCP initial handshake SYN and SYN+ACK transmits set "skb->friend = sk" and TCP receive path notices this and stores this 'friend' socket pointer locally in the newly created connection socket. The purpose of skb->friend is to let the receiving socket on loopback see that the other end is on the local system and can be directly communicated to. 3) TCP sendmsg queues data directly to sk->friend's receive queue instead sending TCP protocol packets. The only complications come from making sendmsg and recvmsg not try to do all of the sequence handling and checking, stuff like that. Also, URG would need to be dealt with somehow too. I'm sure someone suitably motivated could get a working patch going in no time :-) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/core/dev.c b/net/core/dev.c index 05a2b29..cb150ec 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1974,7 +1974,7 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb) if (skb->sk && skb->sk->sk_hash) hash = skb->sk->sk_hash; else - hash = skb->protocol; + hash = (__force u16) skb->protocol; hash = jhash_1word(hash, hashrnd); @@ -2253,8 +2253,8 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb, ip = (struct iphdr *) skb->data; ip_proto = ip->protocol; - addr1 = ip->saddr; - addr2 = ip->daddr; + addr1 = (__force u32) ip->saddr; + addr2 = (__force u32) ip->daddr; ihl = ip->ihl; break; case __constant_htons(ETH_P_IPV6): @@ -2263,8 +2263,8 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb, ip6 = (struct ipv6hdr *) skb->data; ip_proto = ip6->nexthdr; - addr1 = ip6->saddr.s6_addr32[3]; - addr2 = ip6->daddr.s6_addr32[3]; + addr1 = (__force u32) ip6->saddr.s6_addr32[3]; + addr2 = (__force u32) ip6->daddr.s6_addr32[3]; ihl = (40 >> 2); break; default: @@ -2279,14 +2279,25 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb, case IPPROTO_AH: case IPPROTO_SCTP: case IPPROTO_UDPLITE: - if (pskb_may_pull(skb, (ihl * 4) + 4)) - ports = *((u32 *) (skb->data + (ihl * 4))); + if (pskb_may_pull(skb, (ihl * 4) + 4)) { + __be16 *hports = (__be16 *) (skb->data + (ihl * 4)); + u32 sport, dport; + + sport = (__force u16) hports[0]; + dport = (__force u16) hports[1]; + if (dport < sport) + swap(sport, dport); + ports = (sport << 16) + dport; + } break; default: break; } + /* get a consistent hash (same value on both flow directions) */ + if (addr2 < addr1) + swap(addr1, addr2); skb->rxhash = jhash_3words(addr1, addr2, ports, hashrnd); if (!skb->rxhash) skb->rxhash = 1;
In case we compute a software skb->rxhash, we can generate a consistent hash : Its value will be the same in both flow directions. This helps some workloads, like conntracking, since the same state needs to be accessed in both directions. tbench + RFS + this patch gives better results than tbench with default kernel configuration (no RPS, no RFS) Also fixed some sparse warnings. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- net/core/dev.c | 25 ++++++++++++++++++------- 1 files changed, 18 insertions(+), 7 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html