Patchwork [00/16] Remove the ipv4 routing cache

login
register
mail settings
Submitter Eric Dumazet
Date July 27, 2012, 10:01 a.m.
Message ID <1343383283.2626.12691.camel@edumazet-glaptop>
Download mbox | patch
Permalink /patch/173612/
State Changes Requested
Delegated to: David Miller
Headers show

Comments

Eric Dumazet - July 27, 2012, 10:01 a.m.
From: Eric Dumazet <edumazet@google.com>

On Thu, 2012-07-26 at 23:02 -0700, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Thu, 26 Jul 2012 20:08:46 -0700 (PDT)
> 
> > A lot of the overhead comes from write traffic that results from
> > filling in the "fib_result" structure onto the callers stack.
> 
> Here's the longer analysis of how things are now.
> 
> There are several components to a route lookup result, and struct
> fib_result tries to encapsulate all of this.
> 
> Another aspect is that our route tables are broken up into different
> datas tructures which reference each other, in order to save space.
> 
> So the actual objects in the FIB trie are fib_alias structures, and
> those point to fib_info.  There is a many to one relationship between
> FIB trie nodes and fib_info objects.
> 
> The idea is that many routes have the same set of nexthops, metrics,
> preferred source address, etc.
> 
> So one thing we return in the fib_result is a pointer to the fib_info
> and an index into the nexthop array (nh_sel).  That's why we have all
> of these funny accessor's FIB_RES_X(res) which essentially provide
> res.fi->fib_nh[res.nh_sel].X
> 
> Therefore one area of simplification would be to just return a pointer
> to the FIB nexthop, rather than the fib_info pointer and the nexthop
> index.  We can get to the fib_info, if we need to, via the nh_parent
> pointer of the nexthop.
> 
> It seems also that the res->scope value can be cribbed from the
> fib_info as well.
> 
> res->type is embedded in the fib_alias we select hanging off of the
> FIB trie node.  And the res->prefixlen is taken from the FIB trie
> node.
> 
> res->tclassid is problematic, because it comes from the FIB rules
> tables rather than the FIB trie.  We used to store a full FIB rules
> pointer in the fib_result, but I reduced it down to just the u32
> tclassid.
> 
> This whole area, as well as the FIB trie lookup itself, is an area
> ripe for a large number of small micro-optimizations that in the end
> make it's overhead much more reasonable.
> 
> Another thing I haven't mentioned is that another part of FIB trie's
> overhead is that it does backtracking.  The shorter prefixes sit at
> the top of the trie, so when it traverses down it does so until it
> can't get a match, then it walks back up to the root until it does
> have a match.

We also have cache line misses in this code, and we shouldn't have them
at all.

Following patch helps a lot in my case.

Thanks

[PATCH] ipv4: fib: avoid false sharing

Now IP route cache is removed, we should make sure fib structures
cant share cache lines with possibly often dirtied objects.

On x86, kmalloc-96 cache can be source of such problems.

Problem spotted with perf ... -e cache-misses ... while doing
a forwarding benchmark.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/ip_fib.h     |    6 ++++++
 net/ipv4/fib_frontend.c  |    5 +----
 net/ipv4/fib_semantics.c |   15 ++++++++-------
 net/ipv4/fib_trie.c      |   26 +++++++++-----------------
 4 files changed, 24 insertions(+), 28 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman - July 27, 2012, 2:53 p.m.
Eric Dumazet <eric.dumazet@gmail.com> writes:

> Now IP route cache is removed, we should make sure fib structures
> cant share cache lines with possibly often dirtied objects.
>
> On x86, kmalloc-96 cache can be source of such problems.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>


> +static inline void *fib_zalloc(size_t size)
> +{
> +	/* We want to avoid possible false sharing */
> +	return kzalloc(max_t(size_t, 128, size), GFP_KERNEL);

Why the hard coded 128 here?

It seems more portable and obvious to do 
	return kzalloc(round_up(size, L1_CACHE_BYTES), GFP_KERNEL);

> +}

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - July 27, 2012, 3:12 p.m.
On Fri, 2012-07-27 at 07:53 -0700, Eric W. Biederman wrote:
> Eric Dumazet <eric.dumazet@gmail.com> writes:
> 
> > Now IP route cache is removed, we should make sure fib structures
> > cant share cache lines with possibly often dirtied objects.
> >
> > On x86, kmalloc-96 cache can be source of such problems.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> 
> 
> > +static inline void *fib_zalloc(size_t size)
> > +{
> > +	/* We want to avoid possible false sharing */
> > +	return kzalloc(max_t(size_t, 128, size), GFP_KERNEL);
> 
> Why the hard coded 128 here?
> 
> It seems more portable and obvious to do 
> 	return kzalloc(round_up(size, L1_CACHE_BYTES), GFP_KERNEL);
> 

Its not that obvious, because some machines have an apparent
L1_CACHE_BYTES of 64, but hardware prefetching to 128 bytes

But using 2*L1_CACHE_BYTES as minimum allocation size might be overkill
on some arches with 256 bytes cache lines.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman - July 27, 2012, 4:23 p.m.
Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Fri, 2012-07-27 at 07:53 -0700, Eric W. Biederman wrote:
>> Eric Dumazet <eric.dumazet@gmail.com> writes:
>> 
>> > Now IP route cache is removed, we should make sure fib structures
>> > cant share cache lines with possibly often dirtied objects.
>> >
>> > On x86, kmalloc-96 cache can be source of such problems.
>> >
>> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>> 
>> 
>> > +static inline void *fib_zalloc(size_t size)
>> > +{
>> > +	/* We want to avoid possible false sharing */
>> > +	return kzalloc(max_t(size_t, 128, size), GFP_KERNEL);
>> 
>> Why the hard coded 128 here?
>> 
>> It seems more portable and obvious to do 
>> 	return kzalloc(round_up(size, L1_CACHE_BYTES), GFP_KERNEL);
>> 
>
> Its not that obvious, because some machines have an apparent
> L1_CACHE_BYTES of 64, but hardware prefetching to 128 bytes

I am familiar.  But does hardware prefetching make a difference
if your object is less than 64 bytes?

I don't believe only allocating 64 bytes will be a problem,
as no one else well be dirtying your cache line.

I suppose you could run into pathologies where your object
is 3*64 bytes in size, but your expression doesn't handle
that case either.

> But using 2*L1_CACHE_BYTES as minimum allocation size might be overkill
> on some arches with 256 bytes cache lines.

The other alternative to guarantee very good cache behavior is
to ensure you are allocating a power of two size up to some limit,
perhaps page size.

My point is the magic 128 likely requires an explicatory comment and I
think the net result is you have encoded something fragile that is good
for testing but that will in the fullness of time do strange things that
will be easy to overlook.

Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - July 27, 2012, 4:28 p.m.
On Fri, 2012-07-27 at 09:23 -0700, Eric W. Biederman wrote:

> I am familiar.  But does hardware prefetching make a difference
> if your object is less than 64 bytes?
> 

Apparently yes, if the prefetch touches a dirtied neighbour cache line.

> I don't believe only allocating 64 bytes will be a problem,
> as no one else well be dirtying your cache line.
> 
> I suppose you could run into pathologies where your object
> is 3*64 bytes in size, but your expression doesn't handle
> that case either.
> 

Sure, but in most cases fib objects are under 128 bytes.


> The other alternative to guarantee very good cache behavior is
> to ensure you are allocating a power of two size up to some limit,
> perhaps page size.
> 

Good idea.

> My point is the magic 128 likely requires an explicatory comment and I
> think the net result is you have encoded something fragile that is good
> for testing but that will in the fullness of time do strange things that
> will be easy to overlook.

Sure, I'll send a v2, thanks.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Duyck - July 27, 2012, 7:06 p.m.
On Fri, Jul 27, 2012 at 9:28 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2012-07-27 at 09:23 -0700, Eric W. Biederman wrote:
>
>> I am familiar.  But does hardware prefetching make a difference
>> if your object is less than 64 bytes?
>>
>
> Apparently yes, if the prefetch touches a dirtied neighbour cache line.
>
>> I don't believe only allocating 64 bytes will be a problem,
>> as no one else well be dirtying your cache line.
>>
>> I suppose you could run into pathologies where your object
>> is 3*64 bytes in size, but your expression doesn't handle
>> that case either.
>>
>
> Sure, but in most cases fib objects are under 128 bytes.
>
>
>> The other alternative to guarantee very good cache behavior is
>> to ensure you are allocating a power of two size up to some limit,
>> perhaps page size.
>>
>
> Good idea.
>
>> My point is the magic 128 likely requires an explicatory comment and I
>> think the net result is you have encoded something fragile that is good
>> for testing but that will in the fullness of time do strange things that
>> will be easy to overlook.
>
> Sure, I'll send a v2, thanks.
>
>

I tested out v1 of your patch and didn't seem much of a change in my
test environment.  I figure I will spend most of today going through
the code trying to figure out what is causing the issues I am seeing
and why your changes had no effect on my setup.

Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index e69c3a4..5b6a9b3 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -295,6 +295,12 @@  extern void fib_select_multipath(struct fib_result *res);
 /* Exported by fib_trie.c */
 extern void fib_trie_init(void);
 extern struct fib_table *fib_trie_table(u32 id);
+static inline void *fib_zalloc(size_t size)
+{
+	/* We want to avoid possible false sharing */
+	return kzalloc(max_t(size_t, 128, size), GFP_KERNEL);
+}
+
 
 static inline void fib_combine_itag(u32 *itag, const struct fib_result *res)
 {
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 8732cc7..a3d0285 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1089,10 +1089,7 @@  static int __net_init ip_fib_net_init(struct net *net)
 	int err;
 	size_t size = sizeof(struct hlist_head) * FIB_TABLE_HASHSZ;
 
-	/* Avoid false sharing : Use at least a full cache line */
-	size = max_t(size_t, size, L1_CACHE_BYTES);
-
-	net->ipv4.fib_table_hash = kzalloc(size, GFP_KERNEL);
+	net->ipv4.fib_table_hash = fib_zalloc(size);
 	if (net->ipv4.fib_table_hash == NULL)
 		return -ENOMEM;
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index da0cc2e..a9f5e87 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -651,17 +651,17 @@  static inline unsigned int fib_laddr_hashfn(__be32 val)
 		((__force u32)val >> 14)) & mask;
 }
 
-static struct hlist_head *fib_info_hash_alloc(int bytes)
+static struct hlist_head *fib_info_hash_alloc(size_t bytes)
 {
 	if (bytes <= PAGE_SIZE)
-		return kzalloc(bytes, GFP_KERNEL);
+		return fib_zalloc(bytes);
 	else
 		return (struct hlist_head *)
 			__get_free_pages(GFP_KERNEL | __GFP_ZERO,
 					 get_order(bytes));
 }
 
-static void fib_info_hash_free(struct hlist_head *hash, int bytes)
+static void fib_info_hash_free(struct hlist_head *hash, size_t bytes)
 {
 	if (!hash)
 		return;
@@ -678,7 +678,8 @@  static void fib_info_hash_move(struct hlist_head *new_info_hash,
 {
 	struct hlist_head *old_info_hash, *old_laddrhash;
 	unsigned int old_size = fib_info_hash_size;
-	unsigned int i, bytes;
+	unsigned int i;
+	size_t bytes;
 
 	spin_lock_bh(&fib_info_lock);
 	old_info_hash = fib_info_hash;
@@ -766,10 +767,10 @@  struct fib_info *fib_create_info(struct fib_config *cfg)
 		unsigned int new_size = fib_info_hash_size << 1;
 		struct hlist_head *new_info_hash;
 		struct hlist_head *new_laddrhash;
-		unsigned int bytes;
+		size_t bytes;
 
 		if (!new_size)
-			new_size = 1;
+			new_size = 16;
 		bytes = new_size * sizeof(struct hlist_head *);
 		new_info_hash = fib_info_hash_alloc(bytes);
 		new_laddrhash = fib_info_hash_alloc(bytes);
@@ -783,7 +784,7 @@  struct fib_info *fib_create_info(struct fib_config *cfg)
 			goto failure;
 	}
 
-	fi = kzalloc(sizeof(*fi)+nhs*sizeof(struct fib_nh), GFP_KERNEL);
+	fi = fib_zalloc(sizeof(*fi) + nhs*sizeof(struct fib_nh));
 	if (fi == NULL)
 		goto failure;
 	if (cfg->fc_mx) {
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 18cbc15..664152c 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -379,7 +379,7 @@  static inline void free_leaf_info(struct leaf_info *leaf)
 static struct tnode *tnode_alloc(size_t size)
 {
 	if (size <= PAGE_SIZE)
-		return kzalloc(size, GFP_KERNEL);
+		return fib_zalloc(size);
 	else
 		return vzalloc(size);
 }
@@ -449,7 +449,7 @@  static struct leaf *leaf_new(void)
 
 static struct leaf_info *leaf_info_new(int plen)
 {
-	struct leaf_info *li = kmalloc(sizeof(struct leaf_info),  GFP_KERNEL);
+	struct leaf_info *li = fib_zalloc(sizeof(struct leaf_info));
 	if (li) {
 		li->plen = plen;
 		li->mask_plen = ntohl(inet_make_mask(plen));
@@ -1969,32 +1969,24 @@  void __init fib_trie_init(void)
 {
 	fn_alias_kmem = kmem_cache_create("ip_fib_alias",
 					  sizeof(struct fib_alias),
-					  0, SLAB_PANIC, NULL);
+					  0, SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
 
 	trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
 					   max(sizeof(struct leaf),
 					       sizeof(struct leaf_info)),
-					   0, SLAB_PANIC, NULL);
+					   0, SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
 }
 
 
 struct fib_table *fib_trie_table(u32 id)
 {
 	struct fib_table *tb;
-	struct trie *t;
-
-	tb = kmalloc(sizeof(struct fib_table) + sizeof(struct trie),
-		     GFP_KERNEL);
-	if (tb == NULL)
-		return NULL;
-
-	tb->tb_id = id;
-	tb->tb_default = -1;
-	tb->tb_num_default = 0;
-
-	t = (struct trie *) tb->tb_data;
-	memset(t, 0, sizeof(*t));
 
+	tb = fib_zalloc(sizeof(struct fib_table) + sizeof(struct trie));
+	if (tb) {
+		tb->tb_id = id;
+		tb->tb_default = -1;
+	}
 	return tb;
 }