
[00/16] Remove the ipv4 routing cache

Message ID 20120726.140601.1137230112117936793.davem@davemloft.net
State RFC, archived
Delegated to: David Miller

Commit Message

David Miller July 26, 2012, 9:06 p.m. UTC
From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Thu, 26 Jul 2012 11:26:26 -0700

> The previous results were with slight modifications to your earlier
> patch.  With this patch applied I am seeing 10.4Mpps with 8 queues,
> reaching a maximum of 11.6Mpps with 9 queues.

For fun you might want to see what this patch does for your tests,
it should cut the number of fib_table_lookup() calls roughly in half.


Comments

Alexander H Duyck July 26, 2012, 10:03 p.m. UTC | #1
On Thu, Jul 26, 2012 at 2:06 PM, David Miller <davem@davemloft.net> wrote:
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Date: Thu, 26 Jul 2012 11:26:26 -0700
>
>> The previous results were with slight modifications to your earlier
>> patch.  With this patch applied I am seeing 10.4Mpps with 8 queues,
>> reaching a maximum of 11.6Mpps with 9 queues.
>
> For fun you might want to see what this patch does for your tests,
> it should cut the number of fib_table_lookup() calls roughly in half.

So with your patch, Eric's patch, and this most recent patch we are
now at 11.8Mpps with 8 or 9 queues.  At this point I am starting to
hit the hardware limits, since the 82599 will typically max out at
about 12Mpps w/ 9 queues.

Here are the latest perf results with all of these patches in place.
As you predicted, your patch essentially cut the lookup overhead in
half:
    10.65%  [k] ixgbe_poll
     7.77%  [k] fib_table_lookup
     6.21%  [k] ixgbe_xmit_frame_ring
     6.08%  [k] __netif_receive_skb
     4.41%  [k] _raw_spin_lock
     3.95%  [k] kmem_cache_free
     3.30%  [k] build_skb
     3.17%  [k] memcpy
     2.96%  [k] dev_queue_xmit
     2.79%  [k] ip_finish_output
     2.66%  [k] kmem_cache_alloc
     2.57%  [k] check_leaf
     2.52%  [k] ip_route_input_noref
     2.50%  [k] netdev_alloc_frag
     2.17%  [k] ip_rcv
     2.16%  [k] __phys_addr

I will probably do some more poking around over the next few days in
order to get my head around the fib_table_lookup overhead.

Thanks,

Alex
stephen hemminger July 26, 2012, 10:13 p.m. UTC | #2
On Thu, 26 Jul 2012 15:03:39 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Thu, Jul 26, 2012 at 2:06 PM, David Miller <davem@davemloft.net> wrote:
> > From: Alexander Duyck <alexander.duyck@gmail.com>
> > Date: Thu, 26 Jul 2012 11:26:26 -0700
> >
> >> The previous results were with slight modifications to your earlier
> >> patch.  With this patch applied I am seeing 10.4Mpps with 8 queues,
> >> reaching a maximum of 11.6Mpps with 9 queues.
> >
> > For fun you might want to see what this patch does for your tests,
> > it should cut the number of fib_table_lookup() calls roughly in half.
> 
> So with your patch, Eric's patch, and this most recent patch we are
> now at 11.8Mpps with 8 or 9 queues.  At this point I am starting to
> hit the hardware limits, since the 82599 will typically max out at
> about 12Mpps w/ 9 queues.
> 
> Here are the latest perf results with all of these patches in place.
> As you predicted, your patch essentially cut the lookup overhead in
> half:
>     10.65%  [k] ixgbe_poll
>      7.77%  [k] fib_table_lookup
>      6.21%  [k] ixgbe_xmit_frame_ring
>      6.08%  [k] __netif_receive_skb
>      4.41%  [k] _raw_spin_lock
>      3.95%  [k] kmem_cache_free
>      3.30%  [k] build_skb
>      3.17%  [k] memcpy
>      2.96%  [k] dev_queue_xmit
>      2.79%  [k] ip_finish_output
>      2.66%  [k] kmem_cache_alloc
>      2.57%  [k] check_leaf
>      2.52%  [k] ip_route_input_noref
>      2.50%  [k] netdev_alloc_frag
>      2.17%  [k] ip_rcv
>      2.16%  [k] __phys_addr
> 
> I will probably do some more poking around over the next few days in
> order to get my head around the fib_table_lookup overhead.
> 
> Thanks,
> 
> Alex

The fib trie stats are global; you may want to either disable CONFIG_IP_FIB_TRIE_STATS
or convert them to per-cpu.
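
A minimal sketch of the per-cpu idea follows; the trie_use_stats field
names are recalled from net/ipv4/fib_trie.c of that era and may not be
exact, and the alloc_percpu() setup plus the /proc summing code are
omitted:

#ifdef CONFIG_IP_FIB_TRIE_STATS
/* Sketch only: make the lookup-path counters per-cpu so that parallel
 * lookups stop bouncing a shared cache line. */
struct trie_use_stats {
	unsigned int gets;
	unsigned int backtrack;
	unsigned int semantic_match_passed;
	unsigned int semantic_match_miss;
	unsigned int null_node_hit;
	unsigned int resize_node_skipped;
};
#endif

struct trie {
	struct rt_trie_node __rcu *trie;
#ifdef CONFIG_IP_FIB_TRIE_STATS
	struct trie_use_stats __percpu *stats;	/* was: struct trie_use_stats stats */
#endif
};

/* and in the fib_table_lookup() hot path the counters become e.g.: */
	struct trie_use_stats __percpu *stats = t->stats;

	this_cpu_inc(stats->gets);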
Eric Dumazet July 26, 2012, 10:19 p.m. UTC | #3
On Thu, 2012-07-26 at 15:13 -0700, Stephen Hemminger wrote:

> The fib trie stats are global, you may want to either disable CONFIG_IP_FIB_TRIE_STATS
> or convert them to per-cpu.

I guess it's already disabled in Alex's case ;)


David Miller July 26, 2012, 10:48 p.m. UTC | #4
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Thu, 26 Jul 2012 15:13:12 -0700

> The fib trie stats are global, you may want to either disable
> CONFIG_IP_FIB_TRIE_STATS or convert them to per-cpu.

Oh yeah, turn that stuff off when testing :-)
David Miller July 26, 2012, 10:53 p.m. UTC | #5
From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Thu, 26 Jul 2012 15:03:39 -0700

> Here are the latest perf results with all of these patches in place.
> As you predicted, your patch essentially cut the lookup overhead in
> half:

Ok good.

That patch is hard to make legitimate; I'd have to do a bit of work
before we could realize that change.

We can only combine the LOCAL and MAIN tables like that so long as
there are no overlaps in the routes covered by the two tables.  We'd
also have to be sure to report the routes properly in dumps too.

I really wish we had never segregated these two tables; it's
completely pointless and hurts performance.  But now we have to
accommodate this legacy.
Alexander H Duyck July 27, 2012, 2:14 a.m. UTC | #6
On Thu, Jul 26, 2012 at 3:53 PM, David Miller <davem@davemloft.net> wrote:
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Date: Thu, 26 Jul 2012 15:03:39 -0700
>
>> Here are the latest perf results with all of these patches in place.
>> As you predicted, your patch essentially cut the lookup overhead in
>> half:
>
> Ok good.
>
> That patch is hard to make legitimate; I'd have to do a bit of work
> before we could realize that change.
>
> We can only combine the LOCAL and MAIN tables like that so long as
> there are no overlaps in the routes covered by the two tables.  We'd
> also have to be sure to report the routes properly in dumps too.
>
> I really wish we had never segregated these two tables; it's
> completely pointless and hurts performance.  But now we have to
> accommodate this legacy.

Any idea why these look-ups are so expensive in the first place?  When
I dump fib_trie it doesn't look like I have much there.  I would have
thought the table would be pretty static with just 8 flows all going
to the same destination address, but it seems like I was getting hit
with cache misses for some reason.

Thanks,

Alex
David Miller July 27, 2012, 3:08 a.m. UTC | #7
From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Thu, 26 Jul 2012 19:14:55 -0700

> Any idea why these look-ups are so expensive in the first place?  When
> I dump fib_trie it doesn't look like I have much there.  I would have
> thought the table would be pretty static with just 8 flows all going
> to the same destination address, but it seems like I was getting hit
> with cache misses for some reason.

A lot of the overhead comes from write traffic that results from
filling in the "fib_result" structure onto the caller's stack.

The return value from a fib_lookup() has far too many components.  But
simplifying things is not easy.
David Miller July 27, 2012, 6:02 a.m. UTC | #8
From: David Miller <davem@davemloft.net>
Date: Thu, 26 Jul 2012 20:08:46 -0700 (PDT)

> A lot of the overhead comes from write traffic that results from
> filling in the "fib_result" structure onto the caller's stack.

Here's the longer analysis of how things are now.

There are several components to a route lookup result, and struct
fib_result tries to encapsulate all of this.

Another aspect is that our route tables are broken up into different
data structures which reference each other, in order to save space.

So the actual objects in the FIB trie are fib_alias structures, and
those point to fib_info.  There is a many to one relationship between
FIB trie nodes and fib_info objects.

The idea is that many routes have the same set of nexthops, metrics,
preferred source address, etc.

So one thing we return in the fib_result is a pointer to the fib_info
and an index into the nexthop array (nh_sel).  That's why we have all
of these funny accessors, FIB_RES_X(res), which essentially provide
res.fi->fib_nh[res.nh_sel].X
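
For reference, those helpers follow roughly this pattern (paraphrased,
not copied verbatim from include/net/ip_fib.h):

/* Every per-nexthop field is reached through the fib_info pointer
 * plus the selected nexthop index. */
#define FIB_RES_NH(res)		((res).fi->fib_nh[(res).nh_sel])
#define FIB_RES_GW(res)		(FIB_RES_NH(res).nh_gw)
#define FIB_RES_DEV(res)	(FIB_RES_NH(res).nh_dev)
#define FIB_RES_OIF(res)	(FIB_RES_NH(res).nh_oif)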

Therefore one area of simplification would be to just return a pointer
to the FIB nexthop, rather than the fib_info pointer and the nexthop
index.  We can get to the fib_info, if we need to, via the nh_parent
pointer of the nexthop.
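I.e. callers would end up doing something like this (a hypothetical
sketch of the proposed convention, not code that exists today):

	struct fib_nh *nh = res->nh;		/* filled in by fib_lookup() */
	struct fib_info *fi = nh->nh_parent;	/* back-pointer to the shared fib_info */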

It seems also that the res->scope value can be cribbed from the
fib_info as well.

res->type is embedded in the fib_alias we select hanging off of the
FIB trie node.  And the res->prefixlen is taken from the FIB trie
node.

res->tclassid is problematic, because it comes from the FIB rules
tables rather than the FIB trie.  We used to store a full FIB rules
pointer in the fib_result, but I reduced it down to just the u32
tclassid.
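
Putting the pieces above together, the structure every lookup fills in
looks roughly like this (reconstructed from this discussion; member
order and the trailing bookkeeping pointers are approximate):

	struct fib_result {
		unsigned char		prefixlen;	/* from the matched trie node */
		unsigned char		nh_sel;		/* index into fi->fib_nh[] */
		unsigned char		type;		/* from the selected fib_alias */
		unsigned char		scope;		/* duplicated from the fib_info */
		u32			tclassid;	/* from the FIB rules, not the trie */
		struct fib_info		*fi;		/* shared nexthop/metric object */
		struct fib_table	*table;		/* bookkeeping, not discussed above */
		struct list_head	*fa_head;	/* bookkeeping, not discussed above */
	};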

This whole area, as well as the FIB trie lookup itself, is ripe for a
large number of small micro-optimizations that in the end would make
its overhead much more reasonable.

Another thing I haven't mentioned is that part of the FIB trie's
overhead comes from backtracking.  The shorter prefixes sit at the top
of the trie, so the lookup traverses down until it can't get a match,
then walks back up toward the root until it does have a match.
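
Stripped of all the fib_trie details, the shape of it is something
like this (a conceptual sketch only, not the kernel code; node layout
and bit handling are greatly simplified):

/* Simplified binary trie: short prefixes live near the root, so a
 * lookup first descends as far as the key allows, then climbs back
 * toward the root until it finds an ancestor that actually carries a
 * prefix. */
struct tnode {
	struct tnode *parent;
	struct tnode *child[2];
	int has_prefix;			/* does a route terminate here? */
};

static struct tnode *trie_lookup(struct tnode *root, unsigned int key)
{
	struct tnode *n = root;

	/* Descend: consume key bits while matching children exist. */
	while (n->child[key & 1]) {
		n = n->child[key & 1];
		key >>= 1;
	}

	/* Backtrack: climb until some ancestor holds a (shorter)
	 * matching prefix. */
	while (n && !n->has_prefix)
		n = n->parent;

	return n;			/* NULL => no route */
}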
David Miller July 28, 2012, 4:15 a.m. UTC | #9
From: David Miller <davem@davemloft.net>
Date: Thu, 26 Jul 2012 23:02:46 -0700 (PDT)

> Therefore one area of simplification would be to just return a pointer
> to the FIB nexthop, rather than the fib_info pointer and the nexthop
> index.  We can get to the fib_info, if we need to, via the nh_parent
> pointer of the nexthop.

So I'm about to post an RFC set of patches which show this kind
of simplification.  It gets fib_result down to two members:

	u32		tclassid;
	struct fib_nh	*nh;

If I could get rid of that tclassid it would be really nice.  But
that's hard because the tclassid is fetched from the fib_rule and
all of that lookup path is abstracted behind a common layer
that's shared between ipv4 and ipv6 so it's a bit of work changing
arg conventions.

These changes help, but only ever so slightly, in my testing.
Alexander H Duyck July 28, 2012, 5:45 a.m. UTC | #10
On Fri, Jul 27, 2012 at 9:15 PM, David Miller <davem@davemloft.net> wrote:
> From: David Miller <davem@davemloft.net>
> Date: Thu, 26 Jul 2012 23:02:46 -0700 (PDT)
>
>> Therefore one area of simplification would be to just return a pointer
>> to the FIB nexthop, rather than the fib_info pointer and the nexthop
>> index.  We can get to the fib_info, if we need to, via the nh_parent
>> pointer of the nexthop.
>
> So I'm about to post an RFC set of patches which show this kind
> of simplification.  It gets fib_result down to two members:
>
>         u32             tclassid;
>         struct fib_nh   *nh;
>
> If I could get rid of that tclassid it would be really nice.  But
> that's hard because the tclassid is fetched from the fib_rule and
> all of that lookup path is abstracted behind a common layer
> that's shared between ipv4 and ipv6 so it's a bit of work changing
> arg conventions.
>
> These changes help, but only ever so slightly, in my testing.

I probably won't be able to do any real testing of the patches until
Monday, or at least I may let somebody else test them first, just to
make sure they don't panic the system, before I do anything with them.

Thanks,

Alex

Patch

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index b64a19c..fc7eade 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -204,9 +204,7 @@  static inline struct fib_table *fib_get_table(struct net *net, u32 id)
 {
 	struct hlist_head *ptr;
 
-	ptr = id == RT_TABLE_LOCAL ?
-		&net->ipv4.fib_table_hash[TABLE_LOCAL_INDEX] :
-		&net->ipv4.fib_table_hash[TABLE_MAIN_INDEX];
+	ptr = &net->ipv4.fib_table_hash[TABLE_MAIN_INDEX];
 	return hlist_entry(ptr->first, struct fib_table, tb_hlist);
 }
 
@@ -220,10 +218,6 @@  static inline int fib_lookup(struct net *net, const struct flowi4 *flp,
 {
 	struct fib_table *table;
 
-	table = fib_get_table(net, RT_TABLE_LOCAL);
-	if (!fib_table_lookup(table, flp, res, FIB_LOOKUP_NOREF))
-		return 0;
-
 	table = fib_get_table(net, RT_TABLE_MAIN);
 	if (!fib_table_lookup(table, flp, res, FIB_LOOKUP_NOREF))
 		return 0;
@@ -245,10 +239,6 @@  static inline int fib_lookup(struct net *net, struct flowi4 *flp,
 {
 	if (!net->ipv4.fib_has_custom_rules) {
 		res->tclassid = 0;
-		if (net->ipv4.fib_local &&
-		    !fib_table_lookup(net->ipv4.fib_local, flp, res,
-				      FIB_LOOKUP_NOREF))
-			return 0;
 		if (net->ipv4.fib_main &&
 		    !fib_table_lookup(net->ipv4.fib_main, flp, res,
 				      FIB_LOOKUP_NOREF))
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 44bf82e..bdc0231 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -160,7 +160,7 @@  struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref)
 		/* Fallback to FIB local table so that communication
 		 * over loopback subnets work.
 		 */
-		local = fib_get_table(net, RT_TABLE_LOCAL);
+		local = fib_get_table(net, RT_TABLE_MAIN);
 		if (local &&
 		    !fib_table_lookup(local, &fl4, &res, FIB_LOOKUP_NOREF) &&
 		    res.type == RTN_LOCAL)
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index c1fde53..ddfe398 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -52,16 +52,10 @@  static int __net_init fib4_rules_init(struct net *net)
 {
 	struct fib_table *local_table, *main_table;
 
-	local_table = fib_trie_table(RT_TABLE_LOCAL);
-	if (local_table == NULL)
-		return -ENOMEM;
-
 	main_table  = fib_trie_table(RT_TABLE_MAIN);
 	if (main_table == NULL)
 		goto fail;
 
-	hlist_add_head_rcu(&local_table->tb_hlist,
-				&net->ipv4.fib_table_hash[TABLE_LOCAL_INDEX]);
 	hlist_add_head_rcu(&main_table->tb_hlist,
 				&net->ipv4.fib_table_hash[TABLE_MAIN_INDEX]);
 	return 0;
@@ -169,7 +163,7 @@  static inline unsigned int __inet_dev_addr_type(struct net *net,
 	if (ipv4_is_multicast(addr))
 		return RTN_MULTICAST;
 
-	local_table = fib_get_table(net, RT_TABLE_LOCAL);
+	local_table = fib_get_table(net, RT_TABLE_MAIN);
 	if (local_table) {
 		ret = RTN_UNICAST;
 		rcu_read_lock();
@@ -712,11 +706,7 @@  static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifad
 		},
 	};
 
-	if (type == RTN_UNICAST)
-		tb = fib_new_table(net, RT_TABLE_MAIN);
-	else
-		tb = fib_new_table(net, RT_TABLE_LOCAL);
-
+	tb = fib_new_table(net, RT_TABLE_MAIN);
 	if (tb == NULL)
 		return;
 
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index a83d74e..65135dd 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -284,9 +284,6 @@  static int fib_default_rules_init(struct fib_rules_ops *ops)
 {
 	int err;
 
-	err = fib_default_rule_add(ops, 0, RT_TABLE_LOCAL, 0);
-	if (err < 0)
-		return err;
 	err = fib_default_rule_add(ops, 0x7FFE, RT_TABLE_MAIN, 0);
 	if (err < 0)
 		return err;