
[net-next,6/6] sched: Eliminate use of flow_keys in sch_sfq

Message ID 1425247789-21211-7-git-send-email-therbert@google.com
State RFC, archived
Delegated to: David Miller

Commit Message

Tom Herbert March 1, 2015, 10:09 p.m. UTC
Call qdisc_skb_get_hash instead of doing skb_flow_dissect and then
jhash by hand.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 net/sched/sch_sfq.c | 29 +++--------------------------
 1 file changed, 3 insertions(+), 26 deletions(-)

Comments

Eric Dumazet March 2, 2015, 12:58 a.m. UTC | #1
On Sun, 2015-03-01 at 14:09 -0800, Tom Herbert wrote:
> Call qdisc_skb_get_hash instead of doing skb_flow_dissect and then
> jhash by hand.

This defeats one of the perturbation goals:

If two flows hash to the same value, then skb_get_hash_perturb(skb,
q->perturbation) will also give the same hash forever and map them to
the same bucket.

Ideally, we could add a 'u32 perturbation' salt to
__skb_get_hash()/__flow_hash_from_keys()/__flow_hash_3words instead of
using a net_get_random_once(&hashrnd, sizeof(hashrnd)); 

but I guess all these functions are becoming very fat.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 2, 2015, 4:06 a.m. UTC | #2
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 01 Mar 2015 16:58:52 -0800

> On Sun, 2015-03-01 at 14:09 -0800, Tom Herbert wrote:
>> Call qdisc_skb_get_hash instead of doing skb_flow_dissect and then
>> jhash by hand.
> 
> This defeats one of the perturbation goals:
> 
> If two flows hash to the same value, then skb_get_hash_perturb(skb,
> q->perturbation) will also give the same hash forever and map them to
> the same bucket.

Yes, this needs to be resolved.
Tom Herbert March 2, 2015, 3:31 p.m. UTC | #3
On Sun, Mar 1, 2015 at 4:58 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sun, 2015-03-01 at 14:09 -0800, Tom Herbert wrote:
>> Call qdisc_skb_get_hash instead of doing skb_flow_dissect and then
>> jhash by hand.
>
> This defeats one of the perturbation goals:
>
> If two flows hash to the same value, then skb_get_hash_perturb(skb,
> q->perturbation) will also give the same hash forever and map them to
> the same bucket.
>
As I already mentioned, the probability of skb_get_hash returning the
same value for two flows is 1/2^32. I suspect there is a greater
probability of a hash collision between two IPv6 flows with subtly
different addresses, or of simply matching 4-tuples in different
address spaces (like VLAN or with VNID). If hash collisions really are
a problem, we can periodically rekey skb->hash (either SW or HW) or
maybe move to 64-bit hashes somehow. Anyway, we potentially have the
same problem in RSS and other uses of the hash: if hashes collide,
then two flows share the same bucket forever unless there is an action
to break that.

> Ideally, we could add a 'u32 perturbation' salt to
> __skb_get_hash()/__flow_hash_from_keys()/__flow_hash_3words instead of
> using a net_get_random_once(&hashrnd, sizeof(hashrnd));
>
> but I guess all these functions are becoming very fat.
>
Same thing as rekeying.

>
>
Tom Herbert March 2, 2015, 4:25 p.m. UTC | #4
On Mon, Mar 2, 2015 at 7:31 AM, Tom Herbert <therbert@google.com> wrote:
>
> On Sun, Mar 1, 2015 at 4:58 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Sun, 2015-03-01 at 14:09 -0800, Tom Herbert wrote:
> >> Call qdisc_skb_get_hash instead of doing skb_flow_dissect and then
> >> jhash by hand.
> >
> > This defeats one of the perturbation goals:
> >
> > If two flows hash to the same value, then skb_get_hash_perturb(skb,
> > q->perturbation) will also give the same hash forever and map them
> > to the same bucket.
> >
> As I already mentioned, the probability of skb_get_hash returning the
> same value for two flows is 1/2^32. I suspect there is a greater
> probability of a hash collision between two IPv6 flows with subtly
> different addresses, or of simply matching 4-tuples in different
> address spaces (like VLAN or with VNID). If hash collisions really are
> a problem, we can periodically rekey skb->hash (either SW or HW) or
> maybe move to 64-bit hashes somehow. Anyway, we potentially have the
> same problem in RSS and other uses of the hash: if hashes collide,
> then two flows share the same bucket forever unless there is an action
> to break that.
>
One other possibility is that with the txhash changes we can rehash
individual connections that are sourced from the host. I was planning
to do this after too many TCP retransmits, to try to get a different
ECMP path (a type of source routing), but we could also do it
periodically or on request from the qdisc. ooo_okay could be taken into
account to avoid OOO packets.

>
> > Ideally, we could add a 'u32 perturbation' salt to
> > __skb_get_hash()/__flow_hash_from_keys()/__flow_hash_3words instead of
> > using a net_get_random_once(&hashrnd, sizeof(hashrnd));
> >
> > but I guess all these functions are becoming very fat.
> >
> Same thing as rekeying.
>
> >
> >
Eric Dumazet March 2, 2015, 7:24 p.m. UTC | #5
On Mon, 2015-03-02 at 07:31 -0800, Tom Herbert wrote:

> As I already mentioned, the probability of skb_get_hash returning the
> same value for two flows is 1/2^32. I suspect there is a greater
> probability of a hash collision between two IPv6 flows with subtly
> different addresses, or of simply matching 4-tuples in different
> address spaces (like VLAN or with VNID). If hash collisions really are
> a problem, we can periodically rekey skb->hash (either SW or HW) or
> maybe move to 64-bit hashes somehow. Anyway, we potentially have the
> same problem in RSS and other uses of the hash: if hashes collide,
> then two flows share the same bucket forever unless there is an action
> to break that.

The main historical use of SFQ is on routers.

We do not control skb->hash for forwarding workloads, unless you
reprogram the RSS keys on the NIC.

If you remember, we used to have a problem with some NICs using a
well-known RSS key. It is fortunate that SFQ/fq_codel did not rely on
skb->hash too much.

To target one flow, attackers would need to flood one particular RX
queue at the NIC level on your big boxes, which has a huge cost for
them.

Targeting one hash bucket on a qdisc, however, might be much easier if
they know that even a hash perturbation has no effect against their
attack.



Eric Dumazet March 2, 2015, 7:28 p.m. UTC | #6
On Mon, 2015-03-02 at 08:25 -0800, Tom Herbert wrote:

> One other possibility is that with the txhash changes we can rehash
> individual connections that are sourced from the host. I was planning
> to do this after too many TCP retransmits, to try to get a different
> ECMP path (a type of source routing), but we could also do it
> periodically or on request from the qdisc. ooo_okay could be taken
> into account to avoid OOO packets.

For locally generated traffic, potential skb->txhash uses would be:

1) Somewhat spraying txhash on the wire (Hmmm...)
   Hard to do with regular IPv4/TCP traffic, or quite hacky
   (like VLAN spraying...)

2.1) Slave selection in the bonding driver
2.2) TX queue selection on a multiqueue NIC

    In this model (quite different from XPS), skb->ooo_okay is of
    no use:

    In both cases, if you change skb->txhash after a congestion event,
    you do not care about reordering anyway; it is most probably going
    to happen because of the different paths.



Tom Herbert March 2, 2015, 7:34 p.m. UTC | #7
On Mon, Mar 2, 2015 at 11:28 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On Mon, 2015-03-02 at 08:25 -0800, Tom Herbert wrote:
>
>> One other possibility is that with the txhash changes we can rehash
>> individual connections that are sourced from the host. I was planning
>> to do this after too many TCP retransmits, to try to get a different
>> ECMP path (a type of source routing), but we could also do it
>> periodically or on request from the qdisc. ooo_okay could be taken
>> into account to avoid OOO packets.
>
> For locally generated traffic, potential skb->txhash uses would be:
>
> 1) Somewhat spraying txhash on the wire (Hmmm...)
>    Hard to do with regular IPv4/TCP traffic, or quite hacky
>    (like VLAN spraying...)
>
Spraying would be quite easy by modulating txhash from TCP and using
either UDP encapsulation like GUE or IPv6 flow labels. The only
problem is that this would constantly change the rxhash at the
receiver for the connection, although flow labels might 'just work'
with IPv6 Toeplitz in NICs.

> 2.1) Slave selection in the bonding driver
> 2.2) TX queue selection on a multiqueue NIC
>
>     In this model (quite different from XPS), skb->ooo_okay is of
>     no use:
>
>     In both cases, if you change skb->txhash after a congestion event,
>     you do not care about reordering anyway; it is most probably going
>     to happen because of the different paths.
>
>
>
Eric Dumazet March 2, 2015, 7:43 p.m. UTC | #8
On Mon, 2015-03-02 at 11:34 -0800, Tom Herbert wrote:

> Spraying would be quite easy by modulating txhash from TCP and using
> either UDP encapsulation like GUE or IPv6 flow labels. The only
> problem is that this would constantly change the rxhash at the
> receiver for the connection, although flow labels might 'just work'
> with IPv6 Toeplitz in NICs.

When I said 'regular IPv4/TCP traffic', that did not include any added
encapsulation.

Most NICs are not able to properly offload TCP (TSO) with such
encapsulations.



Tom Herbert March 2, 2015, 7:51 p.m. UTC | #9
On Mon, Mar 2, 2015 at 11:43 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2015-03-02 at 11:34 -0800, Tom Herbert wrote:
>
>> Spraying would be quite easy by modulating txhash from TCP and using
>> either UDP encapsulation like GUE or IPv6 flow labels. The only
>> problem is that this would constantly change the rxhash at the
>> receiver for the connection, although flow labels might 'just work'
>> with IPv6 Toeplitz in NICs.
>
> When I said 'regular IPv4/TCP traffic', that did not include any added
> encapsulation.
>
> Most NICs are not able to properly offload TCP (TSO) with such
> encapsulations.
>
IPv6 flow labels should work through TSO; they would just be
replicated for each segment. You would need switches that can hash
based on the flow label, though, and of course IPv6 deployed in the
network :-).

>
>
Eric Dumazet March 2, 2015, 8:02 p.m. UTC | #10
On Mon, 2015-03-02 at 11:51 -0800, Tom Herbert wrote:
> On Mon, Mar 2, 2015 at 11:43 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Mon, 2015-03-02 at 11:34 -0800, Tom Herbert wrote:
> >
> >> Spraying would be quite easy by modulating txhash from TCP and
> >> using either UDP encapsulation like GUE or IPv6 flow labels. The
> >> only problem is that this would constantly change the rxhash at the
> >> receiver for the connection, although flow labels might 'just work'
> >> with IPv6 Toeplitz in NICs.
> >
> > When I said 'regular IPv4/TCP traffic', that did not include any
> > added encapsulation.
> >
> > Most NICs are not able to properly offload TCP (TSO) with such
> > encapsulations.
>
> IPv6 flow labels should work through TSO; they would just be
> replicated for each segment. You would need switches that can hash
> based on the flow label, though, and of course IPv6 deployed in the
> network :-).


Yeah, but I explicitly talked about native IPv4 + TCP ;)

Not IPv6 + TCP, or IPv6 + IPv4 + TCP, or some encapsulation...

The trend is to generalize random packet spraying, even within a flow:

https://www.cs.purdue.edu/homes/kompella/publications/infocom13.pdf

This forces us to build a resilient TCP stack.


Tom Herbert March 3, 2015, 9 p.m. UTC | #11
On Mon, Mar 2, 2015 at 11:24 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2015-03-02 at 07:31 -0800, Tom Herbert wrote:
>
>> As I already mentioned, the probability of skb_get_hash returning the
>> same value for two flows is 1/2^32. I suspect there is a greater
>> probability of a hash collision between two IPv6 flows with subtly
>> different addresses, or of simply matching 4-tuples in different
>> address spaces (like VLAN or with VNID). If hash collisions really
>> are a problem, we can periodically rekey skb->hash (either SW or HW)
>> or maybe move to 64-bit hashes somehow. Anyway, we potentially have
>> the same problem in RSS and other uses of the hash: if hashes
>> collide, then two flows share the same bucket forever unless there is
>> an action to break that.
>
> The main historical use of SFQ is on routers.
>
> We do not control skb->hash for forwarding workloads, unless you
> reprogram the RSS keys on the NIC.
>
> If you remember, we used to have a problem with some NICs using a
> well-known RSS key. It is fortunate that SFQ/fq_codel did not rely on
> skb->hash too much.
>
Okay, I'll audit the drivers returning skb->hash from HW to make sure
they are properly randomizing the initial hash key.

Unless there are other material objections, I'll formally post these
patches. Consolidating the packet hash for different use cases should
be more efficient, and we'll be able to focus efforts on strengthening
it over time in one place. As for SW vs. HW hashing, it still seems
like getting a Toeplitz hash from the hardware is going to be stronger,
at least for TCP/IPv6, than what we can currently do with jhash, as
the input is the full 40 bytes of the 4-tuple. I must admit I wish
Microsoft had continued to make advancements with the Toeplitz hash;
it would be nice if there were a real definition for use with UDP, and
maybe we had instructions for this in x86!

Tom

> To target one flow, attackers would need to flood one particular RX
> queue at the NIC level on your big boxes, which has a huge cost for
> them.
>
> Targeting one hash bucket on a qdisc, however, might be much easier if
> they know that even a hash perturbation has no effect against their
> attack.
>
>
>

Patch

diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index b877140..e7ed8a5 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -23,7 +23,6 @@ 
 #include <linux/vmalloc.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
-#include <net/flow_keys.h>
 #include <net/red.h>
 
 
@@ -156,30 +155,10 @@  static inline struct sfq_head *sfq_dep_head(struct sfq_sched_data *q, sfq_index
 	return &q->dep[val - SFQ_MAX_FLOWS];
 }
 
-/*
- * In order to be able to quickly rehash our queue when timer changes
- * q->perturbation, we store flow_keys in skb->cb[]
- */
-struct sfq_skb_cb {
-       struct flow_keys        keys;
-};
-
-static inline struct sfq_skb_cb *sfq_skb_cb(const struct sk_buff *skb)
-{
-	qdisc_cb_private_validate(skb, sizeof(struct sfq_skb_cb));
-	return (struct sfq_skb_cb *)qdisc_skb_cb(skb)->data;
-}
-
 static unsigned int sfq_hash(const struct sfq_sched_data *q,
-			     const struct sk_buff *skb)
+			     struct sk_buff *skb)
 {
-	const struct flow_keys *keys = &sfq_skb_cb(skb)->keys;
-	unsigned int hash;
-
-	hash = jhash_3words((__force u32)keys->dst,
-			    (__force u32)keys->src ^ keys->ip_proto,
-			    (__force u32)keys->ports, q->perturbation);
-	return hash & (q->divisor - 1);
+	return skb_get_hash_perturb(skb, q->perturbation) & (q->divisor - 1);
 }
 
 static unsigned int sfq_classify(struct sk_buff *skb, struct Qdisc *sch,
@@ -196,10 +175,8 @@  static unsigned int sfq_classify(struct sk_buff *skb, struct Qdisc *sch,
 		return TC_H_MIN(skb->priority);
 
 	fl = rcu_dereference_bh(q->filter_list);
-	if (!fl) {
-		skb_flow_dissect(skb, &sfq_skb_cb(skb)->keys);
+	if (!fl)
 		return sfq_hash(q, skb) + 1;
-	}
 
 	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
 	result = tc_classify(skb, fl, &res);