diff mbox

net: implement emergency route cache rebulds when gc_elasticity is exceeded

Message ID c3ca0c0f0810042145q35a451a7u706bc64fb43723fa@mail.gmail.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Andrew Dickinson Oct. 5, 2008, 4:45 a.m. UTC
Here's the patch that Herbert's referring to.  The basic idea is that
we have a flag which indicates whether or not we need to invalidate
the route cache.  If any chain exceeds gc_elasticity, we set the flag
and reschedule the timer.  In the worst-case, we'll invalidate the
route cache once every secret_interval; in the best-case, we never
invalidate the cache.

                                          void __user *buffer, size_t *lenp,
@@ -3200,6 +3206,8 @@ static __net_init int
rt_secret_timer_init(struct net *net)
                        (int) ((num_physpages ^ (num_physpages>>8)) ^
                        (jiffies ^ (jiffies >> 7))));

+       net->ipv4.rt_secret_flag = 0;
+
        net->ipv4.rt_secret_timer.function = rt_secret_rebuild;
        net->ipv4.rt_secret_timer.data = (unsigned long)net;
        init_timer_deferrable(&net->ipv4.rt_secret_timer);









On Sat, Oct 4, 2008 at 8:26 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> David Miller <davem@davemloft.net> wrote:
>>
>> The idea is that we can by default not rebuild the secret
>> at all.
>
> Actually Andrew Dickson <whydna@whydna.net> came up with this idea
> quite a while ago: Keep the rehash interval but do nothing until
> some chain hits a specified length.  This is quite similar to
> what is being discussed here.
>
> Andrew, could you post the patch please?
>
> In addition to this, we should probably enforce that limit as
> well by simply not adding the newly created entry or deleting
> one forcibly.
>
> Thanks,
> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller Oct. 5, 2008, 5:34 p.m. UTC | #1
From: "Andrew Dickinson" <whydna@whydna.net>
Date: Sat, 4 Oct 2008 21:45:27 -0700

> Here's the patch that Herbert's referring to.  The basic idea is that
> we have a flag which indicates whether or not we need to invalidate
> the route cache.  If any chain exceeds gc_elasticity, we set the flag
> and reschedule the timer.  In the worst-case, we'll invalidate the
> route cache once every secret_interval; in the best-case, we never
> invalidate the cache.

This is a very interesting patch and idea, but...

Eric showed clearly that on a completely normal well loaded
system, the chain lengths exceed the elasticity all the time
and it's not like these are entries we can get rid of because
their refcounts are all > 1
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Dickinson Oct. 5, 2008, 6:06 p.m. UTC | #2
I've got another patch that takes a different approach...  Instead of
disabling the secret_interval timer or trying to heuristically guess
when we're under attack, we continue to invalidate the cache; we just
invalidate it with kid-gloves instead of a sledge hammer.

Like we do today, we continue to update the genid every time the
secret_interval timer expires.  Instead of simply creating a new value
(and thus invalidating the entire cache), we keep a short history of
genid values (I'm thinking on the order of 2-4 previous values).  In
rt_intern_hash(), when we do the check to see if we already have an
existing hash entry, we'll check each of the previous genid versions
(hence the desire to keep the history short) before declaring it as
not there.  If we do find the entry in the hash with an older genid
value, we'll re-bucket it into the correct location for the latest
genid.

Basically, we're allowing entries to continue to exist in the hash
after the route cache has been invalidated (they can still be pruned
by GC).  Happy to send the patch along if you'd like, although I'm not
as confident that this approach is really desirable.

-A


On Sun, Oct 5, 2008 at 10:34 AM, David Miller <davem@davemloft.net> wrote:
> From: "Andrew Dickinson" <whydna@whydna.net>
> Date: Sat, 4 Oct 2008 21:45:27 -0700
>
>> Here's the patch that Herbert's referring to.  The basic idea is that
>> we have a flag which indicates whether or not we need to invalidate
>> the route cache.  If any chain exceeds gc_elasticity, we set the flag
>> and reschedule the timer.  In the worst-case, we'll invalidate the
>> route cache once every secret_interval; in the best-case, we never
>> invalidate the cache.
>
> This is a very interesting patch and idea, but...
>
> Eric showed clearly that on a completely normal well loaded
> system, the chain lengths exceed the elasticity all the time
> and it's not like these are entries we can get rid of because
> their refcounts are all > 1
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Oct. 6, 2008, 4:21 a.m. UTC | #3
On Sun, Oct 05, 2008 at 10:34:54AM -0700, David Miller wrote:
>
> Eric showed clearly that on a completely normal well loaded
> system, the chain lengths exceed the elasticity all the time
> and it's not like these are entries we can get rid of because
> their refcounts are all > 1

I think there are two orthogonal issues here.

1) The way we count the chain length is wrong.  There are keys
which do not form part of the hash computation.  Entries that
only differ by them will always end up in the same bucket.

We should count all entries that only differ by those keys as
a single entry for the purposes of detecting an attack.

FWIW we could even reorganise the storage inside a bucket such
that it is a 2-level list where the first level only contained
entries that differ by saddr/daddr.

2) What do we do when we get a long chain just after a rehash.

This is an indication that the attacker has more knowledge about
us than we expected.  Continuing to rehash is probably no going
to help.

We need to decide whether we care about this scenario.

If yes, then we'll need to come up with a way to bypass the
route cache, or at least act as if it was bypassed.

If not, then we can just kill the rehash interval altogether.

Cheers,
Neil Horman Oct. 6, 2008, 10:50 a.m. UTC | #4
On Mon, Oct 06, 2008 at 12:21:08PM +0800, Herbert Xu wrote:
> On Sun, Oct 05, 2008 at 10:34:54AM -0700, David Miller wrote:
> >
> > Eric showed clearly that on a completely normal well loaded
> > system, the chain lengths exceed the elasticity all the time
> > and it's not like these are entries we can get rid of because
> > their refcounts are all > 1
> 
> I think there are two orthogonal issues here.
> 
> 1) The way we count the chain length is wrong.  There are keys
> which do not form part of the hash computation.  Entries that
> only differ by them will always end up in the same bucket.
> 
> We should count all entries that only differ by those keys as
> a single entry for the purposes of detecting an attack.
> 
> FWIW we could even reorganise the storage inside a bucket such
> that it is a 2-level list where the first level only contained
> entries that differ by saddr/daddr.
> 

I'm not sure I follow what your saying here.  I understand that some keys will
wind up hashing to the same bucket, but from what I see a change to the saddr
and daddr parameters to rt_hash, will change what bucket you hash too.  What am
I missing?

> 2) What do we do when we get a long chain just after a rehash.
> 
> This is an indication that the attacker has more knowledge about
> us than we expected.  Continuing to rehash is probably no going
> to help.
> 
Seems like it might be ambiguous to me.  perhaps we just got a series of
collisions in the firs few entries after a  rebuild?  I dont know, Im just
playing devils advocate.

> We need to decide whether we care about this scenario.
> 
I expect we should.

> If yes, then we'll need to come up with a way to bypass the
> route cache, or at least act as if it was bypassed.
> 
Why don't we just add a count to the number of times we call
rt_emergency_hash_rebuild?  If we cross a threshold on that count (or perhaps a
rate determined by jiffies since the last emergency rebuild), we can set a flag
to not always return a failed lookup in the cache, so as to force routing into
the slow path.


Does that seem reasonable to you?


Best
Neil
Herbert Xu Oct. 6, 2008, 11:02 a.m. UTC | #5
On Mon, Oct 06, 2008 at 06:50:22AM -0400, Neil Horman wrote:
>
> > 1) The way we count the chain length is wrong.  There are keys
> > which do not form part of the hash computation.  Entries that
> > only differ by them will always end up in the same bucket.
>
> I'm not sure I follow what your saying here.  I understand that some keys will
> wind up hashing to the same bucket, but from what I see a change to the saddr
> and daddr parameters to rt_hash, will change what bucket you hash too.  What am
> I missing?

We've got keys other than saddr/daddr, for example, all route
entries that differ only by TOS hash to the same bucket.  There
are quite a few possible TOS values.

Have a look at ip r l c.

> > 2) What do we do when we get a long chain just after a rehash.
> > 
> > This is an indication that the attacker has more knowledge about
> > us than we expected.  Continuing to rehash is probably no going
> > to help.
>
> Seems like it might be ambiguous to me.  perhaps we just got a series of
> collisions in the firs few entries after a  rebuild?  I dont know, Im just
> playing devils advocate.

Have you computed the probability of that? If the limit is sensible
this should be extremely unlikely.  Note that I'm also assuming
that you've got 1) solved as otherwise all bets are off.

Assuming that we are counting it correctly, and that the length
limit is set to a sensible value, then the probability of the
attacker reaching that limit soon after a rehash is very small.

Therefore we can deduce that if it does happen then chances are
that our RNG has been compromised and as such rehashing again
will not help.

> Why don't we just add a count to the number of times we call
> rt_emergency_hash_rebuild?  If we cross a threshold on that count (or perhaps a
> rate determined by jiffies since the last emergency rebuild), we can set a flag
> to not always return a failed lookup in the cache, so as to force routing into
> the slow path.

Really, if we're hitting this when nobody is attacking us then
something is screwed up, notably 1) as it stands.

Cheers,
Neil Horman Oct. 6, 2008, 12:43 p.m. UTC | #6
On Mon, Oct 06, 2008 at 07:02:49PM +0800, Herbert Xu wrote:
> On Mon, Oct 06, 2008 at 06:50:22AM -0400, Neil Horman wrote:
> >
> > > 1) The way we count the chain length is wrong.  There are keys
> > > which do not form part of the hash computation.  Entries that
> > > only differ by them will always end up in the same bucket.
> >
> > I'm not sure I follow what your saying here.  I understand that some keys will
> > wind up hashing to the same bucket, but from what I see a change to the saddr
> > and daddr parameters to rt_hash, will change what bucket you hash too.  What am
> > I missing?
> 
> We've got keys other than saddr/daddr, for example, all route
> entries that differ only by TOS hash to the same bucket.  There
> are quite a few possible TOS values.
> 
> Have a look at ip r l c.
Ok, that makes a bit more sense.  Thank you.   I agree that in that case we
probably do need to bias our chain length accounting such that entries that
differ only by TOS or other fields that do not affect the hash value only get
counted once.

> 
> > > 2) What do we do when we get a long chain just after a rehash.
> > > 
> > > This is an indication that the attacker has more knowledge about
> > > us than we expected.  Continuing to rehash is probably no going
> > > to help.
> >
> > Seems like it might be ambiguous to me.  perhaps we just got a series of
> > collisions in the firs few entries after a  rebuild?  I dont know, Im just
> > playing devils advocate.
> 
> Have you computed the probability of that? If the limit is sensible
> this should be extremely unlikely.  Note that I'm also assuming
> that you've got 1) solved as otherwise all bets are off.
> 
> Assuming that we are counting it correctly, and that the length
> limit is set to a sensible value, then the probability of the
> attacker reaching that limit soon after a rehash is very small.
> 
> Therefore we can deduce that if it does happen then chances are
> that our RNG has been compromised and as such rehashing again
> will not help.
> 
Like I said, I'm just playing devils advocate on this.  Clearly the hash
function should be such that an early data set hashes to the same chain is very
low.

> > Why don't we just add a count to the number of times we call
> > rt_emergency_hash_rebuild?  If we cross a threshold on that count (or perhaps a
> > rate determined by jiffies since the last emergency rebuild), we can set a flag
> > to not always return a failed lookup in the cache, so as to force routing into
> > the slow path.
> 
> Really, if we're hitting this when nobody is attacking us then
> something is screwed up, notably 1) as it stands.
> 

Thats rather the point though.  If we're not being attacked, then we'll never
hit this (ideally).

I'll post a new patch when I figure out how best to handle entries which differ
by a field irrelevant to the hash function.

Neil

> Cheers,
> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>
diff mbox

Patch

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index a6ed838..82baf68 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -48,6 +48,7 @@  struct netns_ipv4 {
        int sysctl_icmp_errors_use_inbound_ifaddr;

        struct timer_list rt_secret_timer;
+       int rt_secret_flag;
        atomic_t rt_genid;
 };
 #endif
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index e91bafe..83a1b43 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -837,13 +837,49 @@  void rt_cache_flush(struct net *net, int delay)
 }

 /*
- * We change rt_genid and let gc do the cleanup
+ * We set rt_secret_flag indicating that we can invalidate the cache if needed.
  */
 static void rt_secret_rebuild(unsigned long __net)
 {
        struct net *net = (struct net *)__net;
-       rt_cache_invalidate(net);
-       mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
+       net->ipv4.rt_secret_flag = 1;
+}
+
+static void rt_secret_reschedule(int old)
+{
+       struct net *net;
+       int new = ip_rt_secret_interval;
+       int diff = new - old;
+
+       if (!diff)
+               return;
+
+       rtnl_lock();
+       for_each_net(net) {
+               int deleted = del_timer_sync(&net->ipv4.rt_secret_timer);
+
+               if (!new) {
+                       net->ipv4.rt_secret_flag = 0;
+                       continue;
+               }
+
+               if(net->ipv4.rt_secret_flag)
+                       continue;
+
+               if (old && deleted) {
+                       long time = net->ipv4.rt_secret_timer.expires - jiffies;
+
+                       if (time <= 0 || (time += diff) <= 0)
+                               time = 0;
+
+                       net->ipv4.rt_secret_timer.expires = time;
+               } else
+                       net->ipv4.rt_secret_timer.expires = new;
+
+               net->ipv4.rt_secret_timer.expires += jiffies;
+               add_timer(&net->ipv4.rt_secret_timer);
+       }
+       rtnl_unlock();
 }

 /*
@@ -1045,17 +1081,19 @@  restart:
                rthp = &rth->u.dst.rt_next;
        }

-       if (cand) {
-               /* ip_rt_gc_elasticity used to be average length of chain
-                * length, when exceeded gc becomes really aggressive.
-                *
-                * The second limit is less certain. At the moment it allows
-                * only 2 entries per bucket. We will see.
-                */
-               if (chain_length > ip_rt_gc_elasticity) {
+       if (chain_length > ip_rt_gc_elasticity) {
+               struct net *net = dev_net(rth->u.dst.dev);
+
+               if (cand) {
                        *candp = cand->u.dst.rt_next;
                        rt_free(cand);
                }
+
+               if (net->ipv4.rt_secret_flag &&
+                   xchg(&net->ipv4.rt_secret_flag, 0)) {
+                       rt_cache_invalidate(net);
+                       rt_secret_reschedule(0);
+               }
        }

        /* Try to bind route to arp only if it is output
@@ -2914,38 +2952,6 @@  static int
ipv4_sysctl_rtcache_flush_strategy(ctl_table *table,
        return 0;
 }

-static void rt_secret_reschedule(int old)
-{
-       struct net *net;
-       int new = ip_rt_secret_interval;
-       int diff = new - old;
-
-       if (!diff)
-               return;
-
-       rtnl_lock();
-       for_each_net(net) {
-               int deleted = del_timer_sync(&net->ipv4.rt_secret_timer);
-
-               if (!new)
-                       continue;
-
-               if (deleted) {
-                       long time = net->ipv4.rt_secret_timer.expires - jiffies;
-
-                       if (time <= 0 || (time += diff) <= 0)
-                               time = 0;
-
-                       net->ipv4.rt_secret_timer.expires = time;
-               } else
-                       net->ipv4.rt_secret_timer.expires = new;
-
-               net->ipv4.rt_secret_timer.expires += jiffies;
-               add_timer(&net->ipv4.rt_secret_timer);
-       }
-       rtnl_unlock();
-}
-
 static int ipv4_sysctl_rt_secret_interval(ctl_table *ctl, int write,
                                          struct file *filp,