diff mbox

[v5] rfs: Receive Flow Steering

Message ID 1271520633.16881.4754.camel@edumazet-laptop
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet April 17, 2010, 4:10 p.m. UTC
Le vendredi 16 avril 2010 à 23:12 +0200, Eric Dumazet a écrit :
> Le vendredi 16 avril 2010 à 13:42 -0700, Tom Herbert a écrit :
> > On Fri, Apr 16, 2010 at 11:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > Le vendredi 16 avril 2010 à 11:35 -0700, Tom Herbert a écrit :
> > >> Results with "tbench 16" on an 8 core Intel machine.
> > >>
> > >> No RPS/RFS:  2155 MB/sec
> > >> RPS (0ff mask): 1700 MB/sec
> > >> RFS: 1097
> > >>
> > 
> > Blah, I mistakingly reported that... should have been:
> > 
> > No RPS/RFS:  2155 MB/sec
> > RPS (0ff mask): 1097 MB/sec
> > RFS: 1700 MB/sec
> > 
> > Sorry about that!
> 
> > This was my expectation too, and what my "corrected" numbers show :-)
> > But, I take it this is different in your results?
> 
> 
> My results are on a "tbench 16" on an dual X5570  @ 2.93GHz.
> (16 logical cpus)
> 
> No RPS , no RFS : 4448.14 MB/sec 
> RPS : 2298.00 MB/sec (but lot of variation)
> RFS : 2600 MB/sec
> 
> Maybe my RFS setup is bad ?
> (8192 flows)
> 

With attached patch, I reached 

Throughput 4465.13 MB/sec 16 procs

RFS better than no RPS/RFS :)

So, the old idea to make rxhash consistent (same value in both
directions) is a win for some workloads (Consider connection tracking /
firewalling) 

port1 = ...
port2 = ...
addr1 = ...
addr2 = ...
if (addr1 > addr2)
	exchange(addr1, addr2)
if (port1 > port2)
	exchange(port, port2)

hash = jhash(addr1, addr2, (port1<<16)+port2, ...)

        default:


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Tom Herbert April 17, 2010, 5:38 p.m. UTC | #1
>
> With attached patch, I reached
>
> Throughput 4465.13 MB/sec 16 procs
>
> RFS better than no RPS/RFS :)
>
> So, the old idea to make rxhash consistent (same value in both
> directions) is a win for some workloads (Consider connection tracking /
> firewalling)
>
> port1 = ...
> port2 = ...
> addr1 = ...
> addr2 = ...
> if (addr1 > addr2)
>        exchange(addr1, addr2)
> if (port1 > port2)
>        exchange(port, port2)
>
> hash = jhash(addr1, addr2, (port1<<16)+port2, ...)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7abf959..6b757ff 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2280,8 +2280,10 @@ static int get_rps_cpu(struct net_device *dev,
> struct sk_buff *skb,
>        case IPPROTO_AH:
>        case IPPROTO_SCTP:
>        case IPPROTO_UDPLITE:
> -               if (pskb_may_pull(skb, (ihl * 4) + 4))
> -                       ports = *((u32 *) (skb->data + (ihl * 4)));
> +               if (pskb_may_pull(skb, (ihl * 4) + 4)) {
> +                       u16 *_ports = (u16 *)(skb->data + (ihl * 4));
> +                       ports = _ports[0] ^ _ports[1];
> +               }
>                break;
>
>        default:
>
That's cool!, but I still like the idea that this hash is treated as
an opaque value, getting the hash from the device to avoid the jhash
or cache misses on the packet can also be a win...  Maybe connection
tracking/firewall could use the skb->rxhash which provides the
consistency and also eliminates the need to do more jhashes.


>
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Changli Gao April 18, 2010, 12:06 a.m. UTC | #2
On Sun, Apr 18, 2010 at 1:38 AM, Tom Herbert <therbert@google.com> wrote:
> That's cool!, but I still like the idea that this hash is treated as
> an opaque value getting the hash from the device to avoid the jhash
> or cache misses on the packet can also be a win...  Maybe connection
> tracking/firewall could use the skb->rxhash which provides the
> consistency and also eliminates the need to do more jhashes.
>

consistent rxhash only adds the risk of the hash collision, and I
don't think it is a big problem. For connection tracking/firewall use,
I am afraid that we have to recompute this value after defrag.  So we
have to export the hash function we used in RPS.

As NIC's hash function can be changed dynamically, the rxhash isn't
consistent, so the rxhash can't be used by connection tracking, socket
lookup and others come later.
Franco Fichtner April 18, 2010, 11:06 a.m. UTC | #3
Changli Gao wrote:
> On Sun, Apr 18, 2010 at 1:38 AM, Tom Herbert <therbert@google.com> wrote:
>   
>> That's cool!, but I still like the idea that this hash is treated as
>> an opaque value getting the hash from the device to avoid the jhash
>> or cache misses on the packet can also be a win...  Maybe connection
>> tracking/firewall could use the skb->rxhash which provides the
>> consistency and also eliminates the need to do more jhashes.
>>
>>     
>
> consistent rxhash only adds the risk of the hash collision, and I
> don't think it is a big problem. For connection tracking/firewall use,
> I am afraid that we have to recompute this value after defrag.  So we
> have to export the hash function we used in RPS.
>
> As NIC's hash function can be changed dynamically, the rxhash isn't
> consistent, so the rxhash can't be used by connection tracking, socket
> lookup and others come later.
>
>   
I have to agree with Eric and Changli here.

It's especially true if you're passively tracking via one NIC, where all 
traffic is just forwarded.
In this scenario, you need to compute consistent hashes. rxhashes by NIC 
will be different for
"incoming" and "outgoing" traffic...

Where rxhash by NIC can be used (note: didn't say _useful_) are 
scenarios with different net
ports for incoming and outgoing traffic (in active but also passive 
traffic scenarios). Here,
rxhashes could be used on a per-port basis, but associating two 
seemingly separate rxhashes
with one another to match CPUs is a really annoying task. This would 
involve computing the
corresponding "txhash" and looking it up, which is what we'd be doing 
with the jhash anyway.

For proper flow tracking Eric's suggestion is the way to go. And if 
there are worries about
collisions, why not add IPPROTO_* to the mix.


Franco
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller April 19, 2010, 8:09 p.m. UTC | #4
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 17 Apr 2010 18:10:33 +0200

> Le vendredi 16 avril 2010 à 23:12 +0200, Eric Dumazet a écrit :
>> My results are on a "tbench 16" on an dual X5570  @ 2.93GHz.
>> (16 logical cpus)
>> 
>> No RPS , no RFS : 4448.14 MB/sec 
>> RPS : 2298.00 MB/sec (but lot of variation)
>> RFS : 2600 MB/sec
>> 
>> Maybe my RFS setup is bad ?
>> (8192 flows)
>> 
> 
> With attached patch, I reached 
> 
> Throughput 4465.13 MB/sec 16 procs
> 
> RFS better than no RPS/RFS :)
> 
> So, the old idea to make rxhash consistent (same value in both
> directions) is a win for some workloads (Consider connection tracking /
> firewalling) 

Fun :-)  I toyed around with this on my 128 cpu machine (2 NUMA
nodes, 64 cpus each NUMA node).

Vanilla net-next-2.6, no configuration changes:

tbench 64: Throughput 1843.43 MB/sec 64 procs
tbench 128: Throughput 1889.67 MB/sec 128 procs

Vanilla net-next-2.6, rps_cpus="ffffffff,ffffffff,ffffffff,ffffffff"

tbench 64: Throughput 1455.89 MB/sec 64 procs
tbench 128: Throughput 2009.91 MB/sec 128 procs

net-next-2.6 + Eric's port hashing patch, rps_cpus="ffffffff,ffffffff,ffffffff,ffffffff"

tbench 64: Throughput 1593.13 MB/sec 64 procs
tbench 128: Throughput 2367.27 MB/sec 128 procs

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller April 19, 2010, 8:23 p.m. UTC | #5
From: David Miller <davem@davemloft.net>
Date: Mon, 19 Apr 2010 13:09:05 -0700 (PDT)

> net-next-2.6 + Eric's port hashing patch, rps_cpus="ffffffff,ffffffff,ffffffff,ffffffff"
> 
> tbench 64: Throughput 1593.13 MB/sec 64 procs
> tbench 128: Throughput 2367.27 MB/sec 128 procs

Eric, I think there is agreement that your patch is not a bad idea.

Your original posting had whitespace damange in the patch plus I want
to see a proper commit message and signoff, so could you please submit
this formally?

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 19, 2010, 8:32 p.m. UTC | #6
Le lundi 19 avril 2010 à 13:23 -0700, David Miller a écrit :
> From: David Miller <davem@davemloft.net>
> Date: Mon, 19 Apr 2010 13:09:05 -0700 (PDT)
> 
> > net-next-2.6 + Eric's port hashing patch, rps_cpus="ffffffff,ffffffff,ffffffff,ffffffff"
> > 
> > tbench 64: Throughput 1593.13 MB/sec 64 procs
> > tbench 128: Throughput 2367.27 MB/sec 128 procs
> 
> Eric, I think there is agreement that your patch is not a bad idea.
> 
> Your original posting had whitespace damange in the patch plus I want
> to see a proper commit message and signoff, so could you please submit
> this formally?

Hmm, this was not a formal patch, just an information.

Problem is if hardware provides rxhash, will it be "consistent" too ?



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller April 19, 2010, 9:19 p.m. UTC | #7
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 19 Apr 2010 22:32:01 +0200

> Hmm, this was not a formal patch, just an information.
> 
> Problem is if hardware provides rxhash, will it be "consistent" too ?

Yes, it is an issue.  I am not aware of whether the Toeplitz
hash computed by cards is impervious to the order of the
input bits of not, probably it is.

I was thinking also about how we could compute rxhash in the
loopback driver :-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Changli Gao April 19, 2010, 11:38 p.m. UTC | #8
On Tue, Apr 20, 2010 at 4:32 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Hmm, this was not a formal patch, just an information.
>
> Problem is if hardware provides rxhash, will it be "consistent" too ?
>
>

Does this problem has relationship with your patch? No. If the rxhash
isn't provided by hardware, we can get more throughput from you patch,
and on the other side, we don't lose anything but potential more hash
collision.
Eric Dumazet April 20, 2010, 5:59 a.m. UTC | #9
Le mardi 20 avril 2010 à 07:38 +0800, Changli Gao a écrit :

> Does this problem has relationship with your patch? No. If the rxhash
> isn't provided by hardware, we can get more throughput from you patch,
> and on the other side, we don't lose anything but potential more hash
> collision.
> 

I am not sure what you call hash collision. There is no hash chain here.

This 32bit hash is a jhash one, and we only need 1 to 12 bits in it, I
am pretty sure its OK.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert April 20, 2010, 3:04 p.m. UTC | #10
On Mon, Apr 19, 2010 at 10:59 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mardi 20 avril 2010 à 07:38 +0800, Changli Gao a écrit :
>
>> Does this problem has relationship with your patch? No. If the rxhash
>> isn't provided by hardware, we can get more throughput from you patch,
>> and on the other side, we don't lose anything but potential more hash
>> collision.
>>
>
> I am not sure what you call hash collision. There is no hash chain here.
>
> This 32bit hash is a jhash one, and we only need 1 to 12 bits in it, I
> am pretty sure its OK.
>
Maybe for the purposes of RPS, but hash collisions could definitely be
an issue in RFS.  If two active connections hit the same rps_flow
entry this may cause thrashing of those connections between CPUs.  I
think your patch may increase the probability of this happening.

>
>
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 20, 2010, 3:39 p.m. UTC | #11
Le mardi 20 avril 2010 à 08:04 -0700, Tom Herbert a écrit :

> Maybe for the purposes of RPS, but hash collisions could definitely be
> an issue in RFS.  If two active connections hit the same rps_flow
> entry this may cause thrashing of those connections between CPUs.  I
> think your patch may increase the probability of this happening.
> 

Good point.

I'll make a gathering of tcp tuples on a busy server over a day and try
to compute number of clashes we can get with and without the addr/port
swapping.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 26, 2010, 8:41 a.m. UTC | #12
Le lundi 19 avril 2010 à 14:19 -0700, David Miller a écrit :

> 
> I was thinking also about how we could compute rxhash in the
> loopback driver :-)

This would be easy if rxhash was not a "struct inet_sock" field but a
"struct sock" one

sock_alloc_send_pskb() (or skb_set_owner_w())

skb->rxhash = sk->rxhash;



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 7abf959..6b757ff 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2280,8 +2280,10 @@  static int get_rps_cpu(struct net_device *dev,
struct sk_buff *skb,
        case IPPROTO_AH:
        case IPPROTO_SCTP:
        case IPPROTO_UDPLITE:
-               if (pskb_may_pull(skb, (ihl * 4) + 4))
-                       ports = *((u32 *) (skb->data + (ihl * 4)));
+               if (pskb_may_pull(skb, (ihl * 4) + 4)) {
+                       u16 *_ports = (u16 *)(skb->data + (ihl * 4));
+                       ports = _ports[0] ^ _ports[1];
+               }
                break;