Patchwork net: make sure struct dst_entry refcount is aligned on 64 bytes

Submitter Eric Dumazet
Date Nov. 14, 2008, 10:47 a.m.
Message ID <491D5725.50006@cosmosbay.com>
Permalink /patch/8754/
State Accepted
Delegated to: David Miller

Comments

Eric Dumazet - Nov. 14, 2008, 10:47 a.m.
Alexey Dobriyan wrote:
> On Fri, Nov 14, 2008 at 10:04:24AM +0100, Eric Dumazet wrote:
>> David Miller wrote:
>>> From: Eric Dumazet <dada1@cosmosbay.com>
>>> Date: Fri, 14 Nov 2008 09:09:31 +0100
>>>
>>>> During tbench/oprofile sessions, I found that dst_release() was in third position.
>>>  ...
>>>> Instead of first checking the refcount value, then decrement it,
>>>> we use atomic_dec_return() to help CPU to make the right memory transaction
>>>> (ie getting the cache line in exclusive mode)
>>>  ...
>>>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>> This looks great, applied, thanks Eric.
>>>
>> Thanks David
>>
>>
>> I think I understood some regressions here on 32bits 
>>
>> offsetof(struct dst_entry, __refcnt) is 0x7c again !!!
>>
>> This is really really bad for performance
>>
>> I believe this comes from a patch from Alexey Dobriyan
>> (commit def8b4faff5ca349beafbbfeb2c51f3602a6ef3a
>> net: reduce structures when XFRM=n)
> 
> Ick.

Well, your patch is a good thing, we only need to make adjustments.

> 
>> This kills effort from Zhang Yanmin (and me...)
>>
>> (commit f1dd9c379cac7d5a76259e7dffcd5f8edc697d17
>> [NET]: Fix tbench regression in 2.6.25-rc1)
>>
>>
>> Really we must find something so that this damned __refcnt is starting at 0x80
> 
> Make it last member?

Yes, it will help tbench, but not machines that stress the IP route cache

(dst_use() must dirty the three fields __refcnt, __use and lastuse).

Also, the 'next' pointer should be in the same cache line, to speed up route
cache lookups.

The next problem is that offsets depend on whether the architecture is 32 or 64 bits.

On 64bit, offsetof(struct dst_entry, __refcnt) is 0xb0 : not very good...
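
To make the offsets quoted in this thread easier to read, here is a small arithmetic sketch in C. It assumes 64-byte cache lines and an object that itself starts on a line boundary; the helper names (demo_line_index, demo_line_pos, demo_check_thread_offsets) are made up for illustration and are not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines, object starts on a line boundary. */
#define DEMO_LINE_BYTES 64u

/* Index of the cache line that holds byte `offset` of the object. */
static unsigned int demo_line_index(size_t offset)
{
	return (unsigned int)(offset / DEMO_LINE_BYTES);
}

/* Position of byte `offset` inside its cache line (0 = line start). */
static unsigned int demo_line_pos(size_t offset)
{
	return (unsigned int)(offset % DEMO_LINE_BYTES);
}

/* Checks the four offsets mentioned in the thread. */
static void demo_check_thread_offsets(void)
{
	/* 0x7c: last 4 bytes of line 1, so the next field starts line 2 */
	assert(demo_line_index(0x7c) == 1 && demo_line_pos(0x7c) == 60);
	/* 0x80: first byte of line 2 (the 32-bit target) */
	assert(demo_line_index(0x80) == 2 && demo_line_pos(0x80) == 0);
	/* 0xb0: 48 bytes into line 2, not on a boundary */
	assert(demo_line_index(0xb0) == 2 && demo_line_pos(0xb0) == 48);
	/* 0xc0: first byte of line 3 (the 64-bit target) */
	assert(demo_line_index(0xc0) == 3 && demo_line_pos(0xc0) == 0);
}
```

Under these assumptions, a 4-byte counter at 0x7c sits at the very end of one line while the fields declared after it begin the next line, which is presumably why writes to the three dst_use() fields end up touching two lines.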


[PATCH] net: make sure struct dst_entry refcount is aligned on 64 bytes

As found in the past (commit f1dd9c379cac7d5a76259e7dffcd5f8edc697d17
[NET]: Fix tbench regression in 2.6.25-rc1), it is really
important that struct dst_entry refcount is aligned on a cache line.

We cannot use __attribute__((aligned)), so manually pad the structure
for 32 and 64 bit arches.

for 32bit : offsetof(struct dst_entry, __refcnt) is 0x80
for 64bit : offsetof(struct dst_entry, __refcnt) is 0xc0

As it is not possible to know the cache line size at compile time,
we use a generic value of 64 bytes, which satisfies many current arches.
(Using 128 bytes alignment on 64bit arches would waste 64 bytes)

Add a BUILD_BUG_ON to make sure future updates to "struct dst_entry" do not
break this alignment.

"tbench 8" is 4.4 % faster on a dual quad core (HP BL460c G1), Intel E5450 @3.00GHz
(2350 MB/s instead of 2250 MB/s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/net/dst.h |   19 +++++++++++++++++++
 1 files changed, 19 insertions(+)
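
The padding-plus-check idea from the patch description can be tried outside the kernel. What follows is a minimal userspace sketch, not the kernel code: struct demo_entry, its fields and demo_refcnt_offset() are hypothetical, C11 _Static_assert stands in for BUILD_BUG_ON, and the pad sizes are hand-computed on the assumption of a common ILP32 or LP64 ABI with no implicit padding between the listed members.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ALIGN 64	/* generic value, as in the patch description */

struct demo_entry {
	/* read-mostly part: six pointers and two 32-bit words */
	void		*next_hop;
	void		*dev;
	void		*ops;
	void		*path;
	int		(*input)(void *);
	int		(*output)(void *);
	uint32_t	tclassid;
	uint32_t	flags;
	/* hand-computed pad so that refcnt starts a 64-byte block */
#if UINTPTR_MAX > 0xffffffffu
	uintptr_t	pad_to_align_refcnt[1];	/* 6*8 + 8 = 56, pad 8 */
#else
	uintptr_t	pad_to_align_refcnt[8];	/* 6*4 + 8 = 32, pad 32 */
#endif
	/* write-hot part */
	int		refcnt;
	int		use;
	unsigned long	lastuse;
};

/* Compilation fails here if a later edit to the struct shifts refcnt. */
_Static_assert(offsetof(struct demo_entry, refcnt) % DEMO_ALIGN == 0,
	       "refcnt must start on a 64-byte boundary");

/* Returns the offset so that callers can inspect it at run time. */
static size_t demo_refcnt_offset(void)
{
	return offsetof(struct demo_entry, refcnt);
}
```

Adding or removing a member in the read-mostly part without adjusting the pad makes the static assertion fail, which is the behaviour the BUILD_BUG_ON in the patch is meant to provide.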
Alexey Dobriyan - Nov. 14, 2008, 11:35 a.m.
On Fri, Nov 14, 2008 at 11:47:01AM +0100, Eric Dumazet wrote:
> Alexey Dobriyan wrote:
>> On Fri, Nov 14, 2008 at 10:04:24AM +0100, Eric Dumazet wrote:
>>> David Miller wrote:
>>>> From: Eric Dumazet <dada1@cosmosbay.com>
>>>> Date: Fri, 14 Nov 2008 09:09:31 +0100
>>>>
>>>>> During tbench/oprofile sessions, I found that dst_release() was in third position.
>>>>  ...
>>>>> Instead of first checking the refcount value, then decrement it,
>>>>> we use atomic_dec_return() to help CPU to make the right memory transaction
>>>>> (ie getting the cache line in exclusive mode)
>>>>  ...
>>>>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>>> This looks great, applied, thanks Eric.
>>>>
>>> Thanks David
>>>
>>>
>>> I think I understood some regressions here on 32bits 
>>>
>>> offsetof(struct dst_entry, __refcnt) is 0x7c again !!!
>>>
>>> This is really really bad for performance
>>>
>>> I believe this comes from a patch from Alexey Dobriyan
>>> (commit def8b4faff5ca349beafbbfeb2c51f3602a6ef3a
>>> net: reduce structures when XFRM=n)
>>
>> Ick.
>
> Well, your patch is a good thing, we only need to make adjustments.
>
>>
>>> This kills effort from Zhang Yanmin (and me...)
>>>
>>> (commit f1dd9c379cac7d5a76259e7dffcd5f8edc697d17
>>> [NET]: Fix tbench regression in 2.6.25-rc1)
>>>
>>>
>>> Really we must find something so that this damned __refcnt is starting at 0x80
>>
>> Make it last member?
>
> Yes, it will help tbench, but not machines that stress IP route cache
>
> (dst_use() must dirty the three fields "refcnt, __use , lastuse" )
>
> Also, 'next' pointer should be in the same cache line, to speedup route
> cache lookups.

Knowledge taken.

> Next problem is that offsets depend on architecture being 32 or 64 bits.
>
> On 64bit, offsetof(struct dst_entry, __refcnt) is 0xb0 : not very good...

I think all these constraints can be satisfied with clever rearranging of dst_entry.
Let me come up with alternative patch which still reduces dst slab size.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - Nov. 14, 2008, 11:43 a.m.
Alexey Dobriyan wrote:
> On Fri, Nov 14, 2008 at 11:47:01AM +0100, Eric Dumazet wrote:
>> Alexey Dobriyan wrote:
>>> On Fri, Nov 14, 2008 at 10:04:24AM +0100, Eric Dumazet wrote:
>>>> David Miller wrote:
>>>>> From: Eric Dumazet <dada1@cosmosbay.com>
>>>>> Date: Fri, 14 Nov 2008 09:09:31 +0100
>>>>>
>>>>>> During tbench/oprofile sessions, I found that dst_release() was in third position.
>>>>>  ...
>>>>>> Instead of first checking the refcount value, then decrement it,
>>>>>> we use atomic_dec_return() to help CPU to make the right memory transaction
>>>>>> (ie getting the cache line in exclusive mode)
>>>>>  ...
>>>>>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>>>> This looks great, applied, thanks Eric.
>>>>>
>>>> Thanks David
>>>>
>>>>
>>>> I think I understood some regressions here on 32bits 
>>>>
>>>> offsetof(struct dst_entry, __refcnt) is 0x7c again !!!
>>>>
>>>> This is really really bad for performance
>>>>
>>>> I believe this comes from a patch from Alexey Dobriyan
>>>> (commit def8b4faff5ca349beafbbfeb2c51f3602a6ef3a
>>>> net: reduce structures when XFRM=n)
>>> Ick.
>> Well, your patch is a good thing, we only need to make adjustments.
>>
>>>> This kills effort from Zhang Yanmin (and me...)
>>>>
>>>> (commit f1dd9c379cac7d5a76259e7dffcd5f8edc697d17
>>>> [NET]: Fix tbench regression in 2.6.25-rc1)
>>>>
>>>>
>>>> Really we must find something so that this damned __refcnt is starting at 0x80
>>> Make it last member?
>> Yes, it will help tbench, but not machines that stress IP route cache
>>
>> (dst_use() must dirty the three fields "refcnt, __use , lastuse" )
>>
>> Also, 'next' pointer should be in the same cache line, to speedup route
>> cache lookups.
> 
> Knowledge taken.
> 
>> Next problem is that offsets depend on architecture being 32 or 64 bits.
>>
>> On 64bit, offsetof(struct dst_entry, __refcnt) is 0xb0 : not very good...
> 
> I think all these constraints can be satisfied with clever rearranging of dst_entry.
> Let me come up with alternative patch which still reduces dst slab size.

You cannot reduce the size, and it doesn't matter, since we use dst_entry inside rtable
and rtable is using a SLAB_HWCACHE_ALIGN kmem_cache: we have many bytes available.

After patch on 32 bits

sizeof(struct rtable)=244   (12 bytes left)

Same for other containers.
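
The "12 bytes left" figure can be reproduced with simple rounding. This sketch assumes a 64-byte cache line and models SLAB_HWCACHE_ALIGN as plain round-up of the object size to a multiple of the line size, which is a simplification of what the slab allocator does; demo_round_up() and demo_slack() are illustrative helpers, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Round `size` up to the next multiple of `align` (align > 0). */
static size_t demo_round_up(size_t size, size_t align)
{
	return (size + align - 1) / align * align;
}

/* Bytes left unused when an object of `size` bytes fills an aligned slot. */
static size_t demo_slack(size_t size, size_t align)
{
	return demo_round_up(size, align) - size;
}

/* The 32-bit figure from the thread: 244 bytes in a 256-byte slot. */
static void demo_check_rtable_slack(void)
{
	assert(demo_round_up(244, 64) == 256);
	assert(demo_slack(244, 64) == 12);
}
```

Under that model, any size from 193 to 256 bytes occupies the same 256-byte slot, so shrinking the structure within that range saves no memory.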


Alexey Dobriyan - Nov. 14, 2008, 1:22 p.m.
On Fri, Nov 14, 2008 at 12:43:06PM +0100, Eric Dumazet wrote:
> Alexey Dobriyan wrote:
>> On Fri, Nov 14, 2008 at 11:47:01AM +0100, Eric Dumazet wrote:
>>> Alexey Dobriyan wrote:
>>>> On Fri, Nov 14, 2008 at 10:04:24AM +0100, Eric Dumazet wrote:
>>>>> David Miller wrote:
>>>>>> From: Eric Dumazet <dada1@cosmosbay.com>
>>>>>> Date: Fri, 14 Nov 2008 09:09:31 +0100
>>>>>>
>>>>>>> During tbench/oprofile sessions, I found that dst_release() was in third position.
>>>>>>  ...
>>>>>>> Instead of first checking the refcount value, then decrement it,
>>>>>>> we use atomic_dec_return() to help CPU to make the right memory transaction
>>>>>>> (ie getting the cache line in exclusive mode)
>>>>>>  ...
>>>>>>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>>>>> This looks great, applied, thanks Eric.
>>>>>>
>>>>> Thanks David
>>>>>
>>>>>
>>>>> I think I understood some regressions here on 32bits 
>>>>>
>>>>> offsetof(struct dst_entry, __refcnt) is 0x7c again !!!
>>>>>
>>>>> This is really really bad for performance
>>>>>
>>>>> I believe this comes from a patch from Alexey Dobriyan
>>>>> (commit def8b4faff5ca349beafbbfeb2c51f3602a6ef3a
>>>>> net: reduce structures when XFRM=n)
>>>> Ick.
>>> Well, your patch is a good thing, we only need to make adjustments.
>>>
>>>>> This kills effort from Zhang Yanmin (and me...)
>>>>>
>>>>> (commit f1dd9c379cac7d5a76259e7dffcd5f8edc697d17
>>>>> [NET]: Fix tbench regression in 2.6.25-rc1)
>>>>>
>>>>>
>>>>> Really we must find something so that this damned __refcnt is starting at 0x80
>>>> Make it last member?
>>> Yes, it will help tbench, but not machines that stress IP route cache
>>>
>>> (dst_use() must dirty the three fields "refcnt, __use , lastuse" )
>>>
>>> Also, 'next' pointer should be in the same cache line, to speedup route
>>> cache lookups.
>>
>> Knowledge taken.
>>
>>> Next problem is that offsets depend on architecture being 32 or 64 bits.
>>>
>>> On 64bit, offsetof(struct dst_entry, __refcnt) is 0xb0 : not very good...
>>
>> I think all these constraints can be satisfied with clever rearranging of dst_entry.
>> Let me come up with alternative patch which still reduces dst slab size.
>
> You cannot reduce size, and it doesnt matter, since we use dst_entry inside rtable
> and rtable is using SLAB_HWCACHE_ALIGN kmem_cachep : we have many bytes available.
>
> After patch on 32 bits
>
> sizeof(struct rtable)=244   (12 bytes left)
>
> Same for other containers.

Hmm, indeed.

I tried moving __refcnt et al to the very beginning, but it seems to make
things worse (on x86_64, almost within statistical error).

And there is no way to use offsetof() inside a struct definition. :-(
Eric Dumazet - Nov. 14, 2008, 1:37 p.m.
Alexey Dobriyan wrote:

> Hmm, indeed.
> 
> I tried moving __refcnt et al to the very beginning, but it seems to make
> things worse (on x86_64, almost within statistical error).
> 
> And there is no way to use offsetof() inside a struct definition. :-(

Yes, it is important that the beginning of the structure contains read-mostly fields.

refcnt being the most written field (incremented / decremented for each packet),
it is really important to move it outside of the first 128 bytes
(192 bytes on 64 bit arches) of dst_entry.

I wonder if some really hot dst_entries could be split (one copy for each stream),
to reduce cache line ping-pongs.

David Miller - Nov. 17, 2008, 3:46 a.m.
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 14 Nov 2008 11:47:01 +0100

> [PATCH] net: make sure struct dst_entry refcount is aligned on 64 bytes

Applied to net-next-2.6, thanks Eric.

Patch

diff --git a/include/net/dst.h b/include/net/dst.h
index 65a60fa..6c77879 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -61,6 +61,8 @@  struct dst_entry
 	struct hh_cache		*hh;
 #ifdef CONFIG_XFRM
 	struct xfrm_state	*xfrm;
+#else
+	void			*__pad1;
 #endif
 	int			(*input)(struct sk_buff*);
 	int			(*output)(struct sk_buff*);
@@ -71,8 +73,20 @@  struct dst_entry
 
 #ifdef CONFIG_NET_CLS_ROUTE
 	__u32			tclassid;
+#else
+	__u32			__pad2;
 #endif
 
+
+	/*
+	 * Align __refcnt to a 64 bytes alignment
+	 * (L1_CACHE_SIZE would be too much)
+	 */
+#ifdef CONFIG_64BIT
+	long			__pad_to_align_refcnt[2];
+#else
+	long			__pad_to_align_refcnt[1];
+#endif
 	/*
 	 * __refcnt wants to be on a different cache line from
 	 * input/output/ops or performance tanks badly
@@ -157,6 +171,11 @@  dst_metric_locked(struct dst_entry *dst, int metric)
 
 static inline void dst_hold(struct dst_entry * dst)
 {
+	/*
+	 * If your kernel compilation stops here, please check
+	 * __pad_to_align_refcnt declaration in struct dst_entry
+	 */
+	BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
 	atomic_inc(&dst->__refcnt);
 }