diff mbox

[net-next-2.6] rps: consistent rxhash

Message ID 1271750198.3845.216.camel@edumazet-laptop
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet April 20, 2010, 7:56 a.m. UTC
In case we compute a software skb->rxhash, we can generate a consistent
hash : Its value will be the same in both flow directions.

This helps some workloads, like conntracking, since the same state needs
to be accessed in both directions.

tbench + RFS + this patch gives better results than tbench with default
kernel configuration (no RPS, no RFS)

Also fixed some sparse warnings.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/dev.c |   25 ++++++++++++++++++-------
 1 files changed, 18 insertions(+), 7 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller April 20, 2010, 8:18 a.m. UTC | #1
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Apr 2010 09:56:38 +0200

> In case we compute a software skb->rxhash, we can generate a consistent
> hash : Its value will be the same in both flow directions.
> 
> This helps some workloads, like conntracking, since the same state needs
> to be accessed in both directions.
> 
> tbench + RFS + this patch gives better results than tbench with default
> kernel configuration (no RPS, no RFS)
> 
> Also fixed some sparse warnings.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Franco Fichtner April 20, 2010, 12:48 p.m. UTC | #2
Eric Dumazet wrote:
> In case we compute a software skb->rxhash, we can generate a consistent
> hash : Its value will be the same in both flow directions.
>
> This helps some workloads, like conntracking, since the same state needs
> to be accessed in both directions.
>
> tbench + RFS + this patch gives better results than tbench with default
> kernel configuration (no RPS, no RFS)
>
> Also fixed some sparse warnings.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---

I thought about this for some time...

Do we really need the port numbers here at all? A simple
addr1^addr2 can provide a good enough pointer for
distribution amongst CPUs.

The real connection tracking is better done locally at the
corresponding CPU. That way a potential cache miss can be
avoided and the still needed hash calculation for
connection tracking will be offloaded.


Franco

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 20, 2010, 1:16 p.m. UTC | #3
Le mardi 20 avril 2010 à 14:48 +0200, Franco Fichtner a écrit :

> 
> I thought about this for some time...
> 
> Do we really need the port numbers here at all? A simple
> addr1^addr2 can provide a good enough pointer for
> distribution amongst CPUs.
> 
> The real connection tracking is better done locally at the
> corresponding CPU. That way a potential cache miss can be
> avoided and the still needed hash calculation for
> connection tracking will be offloaded.
> 

Yes, doing the port test/swap is useful in the loopback case 
(addr1 == addr2).

This is probably a bit convoluted, but David (and me) found this
funny ;)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Franco Fichtner April 20, 2010, 2:03 p.m. UTC | #4
Eric Dumazet wrote:
> Le mardi 20 avril 2010 à 14:48 +0200, Franco Fichtner a écrit :
> 
>> I thought about this for some time...
>>
>> Do we really need the port numbers here at all? A simple
>> addr1^addr2 can provide a good enough pointer for
>> distribution amongst CPUs.
>>
>> The real connection tracking is better done locally at the
>> corresponding CPU. That way a potential cache miss can be
>> avoided and the still needed hash calculation for
>> connection tracking will be offloaded.
>>
> 
> Yes, doing the port test/swap is useful in the loopback case 
> (addr1 == addr2).
> 
> This is probably a bit convoluted, but David (and me) found this
> funny ;)
> 
> 

It is funny, but I fail to see the big picture of the
firewall / conntrack application here. It looks like
this is needed for local netperf tests to impress, but
it's a quite special use case, isn't it?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 20, 2010, 2:57 p.m. UTC | #5
Le mardi 20 avril 2010 à 16:03 +0200, Franco Fichtner a écrit :

> 
> It is funny, but I fail to see the big picture of the
> firewall / conntrack application here. It looks like
> this is needed for local netperf tests to impress, but
> it's a quite special use case, isn't it?

I know many applications using TCP on loopback, they are real :)

What I find 'funny' are not the tbench results, but the fact that RFS
can give pretty good hints to process scheduler, something that might be
good to investigate by scheduler specialists.

In the meantime, if some admin finds that setting RFS on loopback can
boost by 10% its application, why not ?



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert April 20, 2010, 3:09 p.m. UTC | #6
> I thought about this for some time...
>
> Do we really need the port numbers here at all? A simple
> addr1^addr2 can provide a good enough pointer for
> distribution amongst CPUs.
>

What about a server behind a TCP proxy?  Also, need to minimize
collisions for RPS to be effective.

Tom

> The real connection tracking is better done locally at the
> corresponding CPU. That way a potential cache miss can be
> avoided and the still needed hash calculation for
> connection tracking will be offloaded.
>
>
> Franco
>
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller April 20, 2010, 9:41 p.m. UTC | #7
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Apr 2010 16:57:01 +0200

> I know many applications using TCP on loopback, they are real :)

This is all true and I support your hashing patch and all of that.

But if we really want TCP over loopback to go fast, there are much
better ways to do this.

Eric, do you remember that "TCP friends" rough patch I sent you last
year that essentailly made TCP sockets over loopback behave like
AF_UNIX ones and just queue the SKBs directly to the destination
socket without doing any protocol work?

If we ever got that working, tbench performance would become
impressive :)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Changli Gao April 20, 2010, 11:35 p.m. UTC | #8
On Wed, Apr 21, 2010 at 5:41 AM, David Miller <davem@davemloft.net> wrote:
> Eric, do you remember that "TCP friends" rough patch I sent you last
> year that essentailly made TCP sockets over loopback behave like
> AF_UNIX ones and just queue the SKBs directly to the destination
> socket without doing any protocol work?

I think it will break some benchmark tools. The loopback device is for
testing networking protocol stacks, so we shouldn't bypass the
protocol processing. And anyone who has a performance problem of
loopback device should turn to UNIX domain socket.

For routers, how about letting users choose whether RPS mixes layer 4 info in?
David Miller April 20, 2010, 11:38 p.m. UTC | #9
From: Changli Gao <xiaosuo@gmail.com>
Date: Wed, 21 Apr 2010 07:35:48 +0800

> On Wed, Apr 21, 2010 at 5:41 AM, David Miller <davem@davemloft.net> wrote:
>> Eric, do you remember that "TCP friends" rough patch I sent you last
>> year that essentailly made TCP sockets over loopback behave like
>> AF_UNIX ones and just queue the SKBs directly to the destination
>> socket without doing any protocol work?
> 
> I think it will break some benchmark tools.

Other systems already do this optimization, so if things break, this
breakage is already pervasive.

We should be able to tell people that they can use TCP solely in their
applications and it will perform optimally regardless of transport.

People already code their applications this way, and ignoring
this issue would just makes us stupid.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Franco Fichtner April 21, 2010, 9:29 a.m. UTC | #10
Tom Herbert wrote:
>> I thought about this for some time...
>>
>> Do we really need the port numbers here at all? A simple
>> addr1^addr2 can provide a good enough pointer for
>> distribution amongst CPUs.
>>
> 
> What about a server behind a TCP proxy?  Also, need to minimize
> collisions for RPS to be effective

What about routers? What about loopback? This all boils down to
the same issue of obscuring IP data by "magical" means and then
reattaching functionality by reaching for upper layer information.
It is necessary in some cases, but it can cripple performance
for other cases.

The interesting thing is you don't need to deal with collisions
while distributing amonst cpus at all. You just need to make sure
the distribution algorithm keeps every single flow attached to
the correct cpu.

All of the actual flow hashing, tracking and whatever else the
traffic needs to go through can be done locally by cpu x which
helps a lot with load distribution and cache issues in mind. It
also helps locking because there is no global flow lookup table.
Oh, and it also reduces collisions with every cpu you add for
receiving.

I work with a lot of plain office and ISP traffic in mind daily,
so please don't misunderstand my motivation here. I'd hate to
see poor performance in scenarios in which there is a lot of
potential improvement.


Franco
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 21, 2010, 9:39 a.m. UTC | #11
Le mercredi 21 avril 2010 à 11:29 +0200, Franco Fichtner a écrit :
> Tom Herbert wrote:
> >> I thought about this for some time...
> >>
> >> Do we really need the port numbers here at all? A simple
> >> addr1^addr2 can provide a good enough pointer for
> >> distribution amongst CPUs.
> >>
> > 
> > What about a server behind a TCP proxy?  Also, need to minimize
> > collisions for RPS to be effective
> 
> What about routers? What about loopback? This all boils down to
> the same issue of obscuring IP data by "magical" means and then
> reattaching functionality by reaching for upper layer information.
> It is necessary in some cases, but it can cripple performance
> for other cases.
> 
> The interesting thing is you don't need to deal with collisions
> while distributing amonst cpus at all. You just need to make sure
> the distribution algorithm keeps every single flow attached to
> the correct cpu.
> 
> All of the actual flow hashing, tracking and whatever else the
> traffic needs to go through can be done locally by cpu x which
> helps a lot with load distribution and cache issues in mind. It
> also helps locking because there is no global flow lookup table.
> Oh, and it also reduces collisions with every cpu you add for
> receiving.
> 
> I work with a lot of plain office and ISP traffic in mind daily,
> so please don't misunderstand my motivation here. I'd hate to
> see poor performance in scenarios in which there is a lot of
> potential improvement.
> 

I am a bit lost by this conversation.

Are you saying something is wrong with current schem ?

What are exactly your suggestions ?

Tom replied to you that a hash derived from (addr1 ^ addr2) would not
work in situations where all flows goes from machine A to machine B
(all hashes would be the same)

Current hash is probably more than enough to cover all situations.





--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Franco Fichtner April 21, 2010, 11:06 a.m. UTC | #12
Eric Dumazet wrote:
> Le mercredi 21 avril 2010 à 11:29 +0200, Franco Fichtner a écrit :
>> Tom Herbert wrote:
>>>> I thought about this for some time...
>>>>
>>>> Do we really need the port numbers here at all? A simple
>>>> addr1^addr2 can provide a good enough pointer for
>>>> distribution amongst CPUs.
>>>>
>>> What about a server behind a TCP proxy?  Also, need to minimize
>>> collisions for RPS to be effective
>> What about routers? What about loopback? This all boils down to
>> the same issue of obscuring IP data by "magical" means and then
>> reattaching functionality by reaching for upper layer information.
>> It is necessary in some cases, but it can cripple performance
>> for other cases.
>>
>> The interesting thing is you don't need to deal with collisions
>> while distributing amonst cpus at all. You just need to make sure
>> the distribution algorithm keeps every single flow attached to
>> the correct cpu.
>>
>> All of the actual flow hashing, tracking and whatever else the
>> traffic needs to go through can be done locally by cpu x which
>> helps a lot with load distribution and cache issues in mind. It
>> also helps locking because there is no global flow lookup table.
>> Oh, and it also reduces collisions with every cpu you add for
>> receiving.
>>
>> I work with a lot of plain office and ISP traffic in mind daily,
>> so please don't misunderstand my motivation here. I'd hate to
>> see poor performance in scenarios in which there is a lot of
>> potential improvement.
>>
> 
> I am a bit lost by this conversation.
> 
> Are you saying something is wrong with current schem ?
> 
> What are exactly your suggestions ?
> 

Hashing for cpu distribution should be as minimal as it could possibly
be with the least number operations needed to compute a hash, which 
normally involves touching one cold cache line (ip header). If you add 
the ports to your mix you have the luxury of solving static ip mappings,
but only for protocols that support it. Usage of the destination port
may also prove to be more or less pointless with a lot of http traffic,
because it's most likely static. And you add another potential cold
cache line access. For a lot of traffic scenarios, we'll have a bunch of
internal ips and the internet on the other side, so having a simple hash
based on a flavor if internal/external ip is more than enough to work
with for distribution. If the network card can provide a complete hash
all the better. Then this part of my point is void.

But then, hashing for cpu distribution should have nothing todo with
real flow tracking with lookup tables for let's say a firewall or dpi
application, because that data is only needed by local cpu and can
be gathered after distribution. Simply put, the lookup for the flow, if
it is needed, does not belong to distribution. It can be outsourced to
the destination cpu or just simply be ignored, if the application
doesn't care.

> Tom replied to you that a hash derived from (addr1 ^ addr2) would not
> work in situations where all flows goes from machine A to machine B
> (all hashes would be the same)
> 

Yes, I see the point. And all I'm just asking if it's wise to optimize
for this particular scenario.

If you spin this idea further beyond flow tracking, maybe an application
also needs to do some kind of user tracking by ip. Wouldn't it make
sense to have user based flows on a more local basis, not a global one
because ports will get in the way?

> Current hash is probably more than enough to cover all situations.
> 

I agree with this, but would like to point out the phrasing "probably
more than enough". :)


Franco
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 21, 2010, 11:16 a.m. UTC | #13
Le mercredi 21 avril 2010 à 13:06 +0200, Franco Fichtner a écrit :
> > 
> 
> Hashing for cpu distribution should be as minimal as it could possibly
> be with the least number operations needed to compute a hash, which 
> normally involves touching one cold cache line (ip header). If you add 
> the ports to your mix you have the luxury of solving static ip mappings,
> but only for protocols that support it. Usage of the destination port
> may also prove to be more or less pointless with a lot of http traffic,
> because it's most likely static. And you add another potential cold
> cache line access. For a lot of traffic scenarios, we'll have a bunch of
> internal ips and the internet on the other side, so having a simple hash
> based on a flavor if internal/external ip is more than enough to work
> with for distribution. If the network card can provide a complete hash
> all the better. Then this part of my point is void.
> 

But we already have to bring into our cpu cache one cache line, needed
in eth_type_trans() : (12+2 bytes of ethernet header)

TCP/UDP tuples are included into this cache line (64 bytes on current
popular arches)

Cost of rxhash is absolute noise into the picture.
A device provided hash, to be effective, would also make
eth_type_trans() call not done.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert April 21, 2010, 7:12 p.m. UTC | #14
On Tue, Apr 20, 2010 at 2:41 PM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 20 Apr 2010 16:57:01 +0200
>
>> I know many applications using TCP on loopback, they are real :)
>
> This is all true and I support your hashing patch and all of that.
>
> But if we really want TCP over loopback to go fast, there are much
> better ways to do this.
>
> Eric, do you remember that "TCP friends" rough patch I sent you last
> year that essentailly made TCP sockets over loopback behave like
> AF_UNIX ones and just queue the SKBs directly to the destination
> socket without doing any protocol work?
>

This is sounds very interesting!  Could you post a patch? :-)

> If we ever got that working, tbench performance would become
> impressive :)
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller April 23, 2010, 8:44 p.m. UTC | #15
From: Tom Herbert <therbert@google.com>
Date: Wed, 21 Apr 2010 12:12:41 -0700

> On Tue, Apr 20, 2010 at 2:41 PM, David Miller <davem@davemloft.net> wrote:
>> Eric, do you remember that "TCP friends" rough patch I sent you last
>> year that essentailly made TCP sockets over loopback behave like
>> AF_UNIX ones and just queue the SKBs directly to the destination
>> socket without doing any protocol work?
>>
> 
> This is sounds very interesting!  Could you post a patch? :-)

I'll see if I can find it, I sent it to Eric more than a year
ago...

The basic scheme was pretty simple:

1) Add "struct sock *friend" to struct sk_buff

2) TCP initial handshake SYN and SYN+ACK transmits set "skb->friend =
   sk" and TCP receive path notices this and stores this 'friend'
   socket pointer locally in the newly created connection socket.

   The purpose of skb->friend is to let the receiving socket on
   loopback see that the other end is on the local system and
   can be directly communicated to.

3) TCP sendmsg queues data directly to sk->friend's receive queue
   instead sending TCP protocol packets.

The only complications come from making sendmsg and recvmsg not
try to do all of the sequence handling and checking, stuff like
that.  Also, URG would need to be dealt with somehow too.

I'm sure someone suitably motivated could get a working patch
going in no time :-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 05a2b29..cb150ec 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1974,7 +1974,7 @@  u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 	if (skb->sk && skb->sk->sk_hash)
 		hash = skb->sk->sk_hash;
 	else
-		hash = skb->protocol;
+		hash = (__force u16) skb->protocol;
 
 	hash = jhash_1word(hash, hashrnd);
 
@@ -2253,8 +2253,8 @@  static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 
 		ip = (struct iphdr *) skb->data;
 		ip_proto = ip->protocol;
-		addr1 = ip->saddr;
-		addr2 = ip->daddr;
+		addr1 = (__force u32) ip->saddr;
+		addr2 = (__force u32) ip->daddr;
 		ihl = ip->ihl;
 		break;
 	case __constant_htons(ETH_P_IPV6):
@@ -2263,8 +2263,8 @@  static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 
 		ip6 = (struct ipv6hdr *) skb->data;
 		ip_proto = ip6->nexthdr;
-		addr1 = ip6->saddr.s6_addr32[3];
-		addr2 = ip6->daddr.s6_addr32[3];
+		addr1 = (__force u32) ip6->saddr.s6_addr32[3];
+		addr2 = (__force u32) ip6->daddr.s6_addr32[3];
 		ihl = (40 >> 2);
 		break;
 	default:
@@ -2279,14 +2279,25 @@  static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	case IPPROTO_AH:
 	case IPPROTO_SCTP:
 	case IPPROTO_UDPLITE:
-		if (pskb_may_pull(skb, (ihl * 4) + 4))
-			ports = *((u32 *) (skb->data + (ihl * 4)));
+		if (pskb_may_pull(skb, (ihl * 4) + 4)) {
+			__be16 *hports = (__be16 *) (skb->data + (ihl * 4));
+			u32 sport, dport;
+
+			sport = (__force u16) hports[0];
+			dport = (__force u16) hports[1];
+			if (dport < sport)
+				swap(sport, dport);
+			ports = (sport << 16) + dport;
+		}
 		break;
 
 	default:
 		break;
 	}
 
+	/* get a consistent hash (same value on both flow directions) */
+	if (addr2 < addr1)
+		swap(addr1, addr2);
 	skb->rxhash = jhash_3words(addr1, addr2, ports, hashrnd);
 	if (!skb->rxhash)
 		skb->rxhash = 1;