
hackbench regression due to commit 9dfc6e68bfe6e

Message ID alpine.DEB.2.00.1004071130260.13261@router.home
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Christoph Lameter April 7, 2010, 4:43 p.m. UTC
On Wed, 7 Apr 2010, Zhang, Yanmin wrote:

> I collected retired instructions, dtlb misses and LLC misses.
> Below is the data for LLC misses.
>
> Kernel 2.6.33:
>     20.94%        hackbench  [kernel.kallsyms]                                       [k] copy_user_generic_string
>     14.56%        hackbench  [kernel.kallsyms]                                       [k] unix_stream_recvmsg
>     12.88%        hackbench  [kernel.kallsyms]                                       [k] kfree
>      7.37%        hackbench  [kernel.kallsyms]                                       [k] kmem_cache_free
>      7.18%        hackbench  [kernel.kallsyms]                                       [k] kmem_cache_alloc_node
>      6.78%        hackbench  [kernel.kallsyms]                                       [k] kfree_skb
>      6.27%        hackbench  [kernel.kallsyms]                                       [k] __kmalloc_node_track_caller
>      2.73%        hackbench  [kernel.kallsyms]                                       [k] __slab_free
>      2.21%        hackbench  [kernel.kallsyms]                                       [k] get_partial_node
>      2.01%        hackbench  [kernel.kallsyms]                                       [k] _raw_spin_lock
>      1.59%        hackbench  [kernel.kallsyms]                                       [k] schedule
>      1.27%        hackbench  hackbench                                               [.] receiver
>      0.99%        hackbench  libpthread-2.9.so                                       [.] __read
>      0.87%        hackbench  [kernel.kallsyms]                                       [k] unix_stream_sendmsg
>
> Kernel 2.6.34-rc3:
>     18.55%        hackbench  [kernel.kallsyms]                                                     [k] copy_user_generic_string
>     13.19%        hackbench  [kernel.kallsyms]                                                     [k] unix_stream_recvmsg
>     11.62%        hackbench  [kernel.kallsyms]                                                     [k] kfree
>      8.54%        hackbench  [kernel.kallsyms]                                                     [k] kmem_cache_free
>      7.88%        hackbench  [kernel.kallsyms]                                                     [k] __kmalloc_node_track_caller

It seems the overhead of __kmalloc_node_track_caller increased. That
function inlines slab_alloc().
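
For reference, the call path looks roughly like this (a paraphrase of the
2.6.3x SLUB code with the kmalloc-cache lookup simplified, not an exact
copy):

void *__kmalloc_node_track_caller(size_t size, gfp_t flags,
				  int node, unsigned long caller)
{
	struct kmem_cache *s = get_slab(size, flags);

	/* slab_alloc() is __always_inline, so its cost is accounted
	 * to this symbol in the profiles above. */
	return slab_alloc(s, flags, node, caller);
}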

>      6.54%        hackbench  [kernel.kallsyms]                                                     [k] kmem_cache_alloc_node
>      5.94%        hackbench  [kernel.kallsyms]                                                     [k] kfree_skb
>      3.48%        hackbench  [kernel.kallsyms]                                                     [k] __slab_free
>      2.15%        hackbench  [kernel.kallsyms]                                                     [k] _raw_spin_lock
>      1.83%        hackbench  [kernel.kallsyms]                                                     [k] schedule
>      1.82%        hackbench  [kernel.kallsyms]                                                     [k] get_partial_node
>      1.59%        hackbench  hackbench                                                             [.] receiver
>      1.37%        hackbench  libpthread-2.9.so                                                     [.] __read

I wonder if this is not related to the kmem_cache_cpu structure straddling
cache line boundaries under some conditions. On 2.6.33 the kmem_cache_cpu
structure was larger and therefore tight packing resulted in different
alignment.

Could you see how the following patch affects the results? It attempts to
increase the size of kmem_cache_cpu to a power-of-two number of bytes. There
is also the potential that other per-cpu fetches to neighboring objects
affect the situation. We could cacheline-align the whole thing.

---
 include/linux/slub_def.h |    5 +++++
 1 file changed, 5 insertions(+)


Comments

Pekka Enberg April 7, 2010, 4:49 p.m. UTC | #1
Christoph Lameter wrote:
> I wonder if this is not related to the kmem_cache_cpu structure straddling
> cache line boundaries under some conditions. On 2.6.33 the kmem_cache_cpu
> structure was larger and therefore tight packing resulted in different
> alignment.
> 
> Could you see how the following patch affects the results? It attempts to
> increase the size of kmem_cache_cpu to a power-of-two number of bytes. There
> is also the potential that other per-cpu fetches to neighboring objects
> affect the situation. We could cacheline-align the whole thing.
> 
> ---
>  include/linux/slub_def.h |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> Index: linux-2.6/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slub_def.h	2010-04-07 11:33:50.000000000 -0500
> +++ linux-2.6/include/linux/slub_def.h	2010-04-07 11:35:18.000000000 -0500
> @@ -38,6 +38,11 @@ struct kmem_cache_cpu {
>  	void **freelist;	/* Pointer to first free per cpu object */
>  	struct page *page;	/* The slab from which we are allocating */
>  	int node;		/* The node of the page (or -1 for debug) */
> +#ifndef CONFIG_64BIT
> +	int dummy1;
> +#endif
> +	unsigned long dummy2;
> +
>  #ifdef CONFIG_SLUB_STATS
>  	unsigned stat[NR_SLUB_STAT_ITEMS];
>  #endif

Would __cacheline_aligned_in_smp do the trick here?
Pekka Enberg April 7, 2010, 4:52 p.m. UTC | #2
Pekka Enberg wrote:
> Christoph Lameter wrote:
>> I wonder if this is not related to the kmem_cache_cpu structure straddling
>> cache line boundaries under some conditions. On 2.6.33 the kmem_cache_cpu
>> structure was larger and therefore tight packing resulted in different
>> alignment.
>>
>> Could you see how the following patch affects the results? It attempts to
>> increase the size of kmem_cache_cpu to a power-of-two number of bytes. There
>> is also the potential that other per-cpu fetches to neighboring objects
>> affect the situation. We could cacheline-align the whole thing.
>>
>> ---
>>  include/linux/slub_def.h |    5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> Index: linux-2.6/include/linux/slub_def.h
>> ===================================================================
>> --- linux-2.6.orig/include/linux/slub_def.h    2010-04-07 11:33:50.000000000 -0500
>> +++ linux-2.6/include/linux/slub_def.h    2010-04-07 11:35:18.000000000 -0500
>> @@ -38,6 +38,11 @@ struct kmem_cache_cpu {
>>      void **freelist;    /* Pointer to first free per cpu object */
>>      struct page *page;    /* The slab from which we are allocating */
>>      int node;        /* The node of the page (or -1 for debug) */
>> +#ifndef CONFIG_64BIT
>> +    int dummy1;
>> +#endif
>> +    unsigned long dummy2;
>> +
>>  #ifdef CONFIG_SLUB_STATS
>>      unsigned stat[NR_SLUB_STAT_ITEMS];
>>  #endif
> 
> Would __cacheline_aligned_in_smp do the trick here?

Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with 
four underscores) for per-cpu data. Confusing...
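
That is, the suggestion would amount to something like this (a sketch only;
the follow-ups below explain why it was not pursued):

struct kmem_cache_cpu {
	void **freelist;	/* Pointer to first free per cpu object */
	struct page *page;	/* The slab from which we are allocating */
	int node;		/* The node of the page (or -1 for debug) */
#ifdef CONFIG_SLUB_STATS
	unsigned stat[NR_SLUB_STAT_ITEMS];
#endif
} ____cacheline_aligned_in_smp;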
Christoph Lameter April 7, 2010, 6:18 p.m. UTC | #3
On Wed, 7 Apr 2010, Pekka Enberg wrote:

> Christoph Lameter wrote:
> > I wonder if this is not related to the kmem_cache_cpu structure straddling
> > cache line boundaries under some conditions. On 2.6.33 the kmem_cache_cpu
> > structure was larger and therefore tight packing resulted in different
> > alignment.
> >
> > Could you see how the following patch affects the results? It attempts to
> > increase the size of kmem_cache_cpu to a power-of-two number of bytes. There
> > is also the potential that other per-cpu fetches to neighboring objects
> > affect the situation. We could cacheline-align the whole thing.
> >
> > ---
> >  include/linux/slub_def.h |    5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > Index: linux-2.6/include/linux/slub_def.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/slub_def.h	2010-04-07 11:33:50.000000000 -0500
> > +++ linux-2.6/include/linux/slub_def.h	2010-04-07 11:35:18.000000000 -0500
> > @@ -38,6 +38,11 @@ struct kmem_cache_cpu {
> >  	void **freelist;	/* Pointer to first free per cpu object */
> >  	struct page *page;	/* The slab from which we are allocating */
> >  	int node;		/* The node of the page (or -1 for debug) */
> > +#ifndef CONFIG_64BIT
> > +	int dummy1;
> > +#endif
> > +	unsigned long dummy2;
> > +
> >  #ifdef CONFIG_SLUB_STATS
> >  	unsigned stat[NR_SLUB_STAT_ITEMS];
> >  #endif
>
> Would __cacheline_aligned_in_smp do the trick here?

This is allocated via the percpu allocator. We could specify cacheline
alignment there, but that would reduce the density. You basically need 4
words for a kmem_cache_cpu structure, and a number of those fit into one
64-byte cacheline.
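
The density argument can be checked with a minimal userspace sketch
(assuming x86_64 pointer sizes and !CONFIG_SLUB_STATS; the struct below is
a local stand-in with the padding patch applied, not the kernel
definition):

#include <stdio.h>

struct kmem_cache_cpu_sketch {
	void **freelist;	/* 8 bytes */
	void *page;		/* 8 bytes */
	int node;		/* 4 bytes + 4 bytes implicit padding */
	unsigned long dummy2;	/* 8 bytes -> 32 bytes total, a power of 2 */
};

int main(void)
{
	printf("sizeof = %zu bytes, %zu per 64-byte cacheline\n",
	       sizeof(struct kmem_cache_cpu_sketch),
	       64 / sizeof(struct kmem_cache_cpu_sketch));
	return 0;
}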

Christoph Lameter April 7, 2010, 6:20 p.m. UTC | #4
On Wed, 7 Apr 2010, Pekka Enberg wrote:

> Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with four
> underscores) for per-cpu data. Confusing...

This does not particularly help to clarify the situation, since we are
dealing with data that can either be allocated via the percpu allocator or
be statically present (kmalloc bootstrap situation).

Pekka Enberg April 7, 2010, 6:25 p.m. UTC | #5
Christoph Lameter wrote:
> On Wed, 7 Apr 2010, Pekka Enberg wrote:
> 
>> Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with four
>> underscores) for per-cpu data. Confusing...
> 
> This does not particularly help to clarify the situation, since we are
> dealing with data that can either be allocated via the percpu allocator or
> be statically present (kmalloc bootstrap situation).

Yes, I am an idiot. :-)
Eric Dumazet April 7, 2010, 6:38 p.m. UTC | #6
On Wed, 2010-04-07 at 13:20 -0500, Christoph Lameter wrote:
> On Wed, 7 Apr 2010, Pekka Enberg wrote:
> 
> > Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with four
> > underscores) for per-cpu data. Confusing...
> 
> This does not particularly help to clarify the situation, since we are
> dealing with data that can either be allocated via the percpu allocator or
> be statically present (kmalloc bootstrap situation).

Do we have a user program to check the actual L1 cache size of a machine?

I remember my HP blades have many BIOS options; I would like to make
sure they are properly set.



Christoph Lameter April 7, 2010, 7:30 p.m. UTC | #7
On Wed, 7 Apr 2010, Pekka Enberg wrote:

> Yes, I am an idiot. :-)

Plato said it in another way:

"As for me, all I know is that I know nothing."



Zhang, Yanmin April 8, 2010, 1:05 a.m. UTC | #8
On Wed, 2010-04-07 at 20:38 +0200, Eric Dumazet wrote:
> On Wed, 2010-04-07 at 13:20 -0500, Christoph Lameter wrote:
> > On Wed, 7 Apr 2010, Pekka Enberg wrote:
> > 
> > > Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with four
> > > underscores) for per-cpu data. Confusing...
> > 
> > This does not particularly help to clarify the situation, since we are
> > dealing with data that can either be allocated via the percpu allocator or
> > be statically present (kmalloc bootstrap situation).
> 
> Do we have a user program to check the actual L1 cache size of a machine?
If there is none, it's easy to write one, as the kernel exports the cache
stats under /sys/devices/system/cpu/cpuXXX/cache/indexXXX/.
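For example, to read the advertised line size (one file of interest):

cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size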

> 
> I remember my HP blades have many BIOS options; I would like to make
> sure they are properly set.
> 
> 
> 


Eric Dumazet April 8, 2010, 4:59 a.m. UTC | #9
On Thu, 2010-04-08 at 09:05 +0800, Zhang, Yanmin wrote:

> > Do we have a user program to check the actual L1 cache size of a machine?
> If there is none, it's easy to write one, as the kernel exports the cache
> stats under /sys/devices/system/cpu/cpuXXX/cache/indexXXX/.

Yes, this is what advertises my L1 cache as having 64-byte lines, but I
would like to check that in practice it is not 128 bytes...

./index0/type:Data
./index0/level:1
./index0/coherency_line_size:64
./index0/physical_line_partition:1
./index0/ways_of_associativity:8
./index0/number_of_sets:64
./index0/size:32K
./index0/shared_cpu_map:00000101
./index0/shared_cpu_list:0,8
./index1/type:Instruction
./index1/level:1
./index1/coherency_line_size:64
./index1/physical_line_partition:1
./index1/ways_of_associativity:4
./index1/number_of_sets:128
./index1/size:32K
./index1/shared_cpu_map:00000101
./index1/shared_cpu_list:0,8
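
One way to check it in practice is a small false-sharing probe -- a sketch
only, not a validated tool: two threads increment bytes `gap' bytes apart,
and the run time should drop sharply once `gap' exceeds the real coherency
line size. Pin the two threads to different cores (e.g. with
pthread_setaffinity_np()) for a trustworthy number.

/* Build: gcc -O2 -pthread probe.c (add -lrt on older glibc) */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL
static volatile char buf[512] __attribute__((aligned(512)));
static int gap;

static void *worker(void *arg)
{
	volatile char *p = &buf[(long)arg * gap];
	unsigned long i;

	for (i = 0; i < ITERS; i++)
		(*p)++;
	return NULL;
}

int main(void)
{
	for (gap = 8; gap <= 256; gap <<= 1) {
		pthread_t t1, t2;
		struct timespec a, b;

		clock_gettime(CLOCK_MONOTONIC, &a);
		pthread_create(&t1, NULL, worker, (void *)0L);
		pthread_create(&t2, NULL, worker, (void *)1L);
		pthread_join(t1, NULL);
		pthread_join(t2, NULL);
		clock_gettime(CLOCK_MONOTONIC, &b);
		printf("gap %3d: %.2f s\n", gap,
		       (b.tv_sec - a.tv_sec) +
		       (b.tv_nsec - a.tv_nsec) / 1e9);
	}
	return 0;
}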


Eric Dumazet April 8, 2010, 5:39 a.m. UTC | #10
I suspect NUMA is completely out of order on the current kernel, or my
Nehalem machine's NUMA support is a joke.

# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 3071 MB
node 0 free: 2637 MB
node 1 size: 3062 MB
node 1 free: 2909 MB


# cat try.sh
hackbench 50 process 5000
numactl --cpubind=0 --membind=0 hackbench 25 process 5000 >RES0 &
numactl --cpubind=1 --membind=1 hackbench 25 process 5000 >RES1 &
wait
echo node0 results
cat RES0
echo node1 results
cat RES1

numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
wait
echo node0 on mem1 results
cat RES0_1
echo node1 on mem0 results
cat RES1_0

# ./try.sh
Running with 50*40 (== 2000) tasks.
Time: 16.865
node0 results
Running with 25*40 (== 1000) tasks.
Time: 16.767
node1 results
Running with 25*40 (== 1000) tasks.
Time: 16.564
node0 on mem1 results
Running with 25*40 (== 1000) tasks.
Time: 16.814
node1 on mem0 results
Running with 25*40 (== 1000) tasks.
Time: 16.896


Eric Dumazet April 8, 2010, 7 a.m. UTC | #11
On Thu, 2010-04-08 at 07:39 +0200, Eric Dumazet wrote:
> I suspect NUMA is completely out of order on the current kernel, or my
> Nehalem machine's NUMA support is a joke.
> 
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 size: 3071 MB
> node 0 free: 2637 MB
> node 1 size: 3062 MB
> node 1 free: 2909 MB
> 
> 
> # cat try.sh
> hackbench 50 process 5000
> numactl --cpubind=0 --membind=0 hackbench 25 process 5000 >RES0 &
> numactl --cpubind=1 --membind=1 hackbench 25 process 5000 >RES1 &
> wait
> echo node0 results
> cat RES0
> echo node1 results
> cat RES1
> 
> numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
> numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
> wait
> echo node0 on mem1 results
> cat RES0_1
> echo node1 on mem0 results
> cat RES1_0
> 
> # ./try.sh
> Running with 50*40 (== 2000) tasks.
> Time: 16.865
> node0 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.767
> node1 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.564
> node0 on mem1 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.814
> node1 on mem0 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.896

If run individually, the test results are more what we would expect
(slow), but if the machine runs the two sets of processes concurrently,
each group runs much faster...


# numactl --cpubind=0 --membind=1 hackbench 25 process 5000
Running with 25*40 (== 1000) tasks.
Time: 21.810

# numactl --cpubind=1 --membind=0 hackbench 25 process 5000
Running with 25*40 (== 1000) tasks.
Time: 20.679

# numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
[1] 9177
# numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
[2] 9196
# wait
[1]-  Done                    numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1
[2]+  Done                    numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0
# echo node0 on mem1 results
node0 on mem1 results
# cat RES0_1
Running with 25*40 (== 1000) tasks.
Time: 13.818
# echo node1 on mem0 results
node1 on mem0 results
# cat RES1_0
Running with 25*40 (== 1000) tasks.
Time: 11.633

Oh well...


David Miller April 8, 2010, 7:05 a.m. UTC | #12
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 08 Apr 2010 09:00:19 +0200

> If run individually, the test results are more what we would expect
> (slow), but if the machine runs the two sets of processes concurrently,
> each group runs much faster...

BTW, I just discovered (thanks to the function graph tracer, woo hoo!)
that loopback TCP packets get fully checksum validated on receive.

I'm trying to figure out why skb->ip_summed ends up being
CHECKSUM_NONE in tcp_v4_rcv() even though it gets set to
CHECKSUM_PARTIAL in tcp_sendmsg().

I wonder how much this accounts for some of the hackbench
oddities... and other regressions in loopback tests we've seen.
:-)

Just FYI...
Zhang, Yanmin April 8, 2010, 7:18 a.m. UTC | #13
On Wed, 2010-04-07 at 11:43 -0500, Christoph Lameter wrote:
> On Wed, 7 Apr 2010, Zhang, Yanmin wrote:
> 
> > I collected retired instructions, dtlb misses and LLC misses.
> > Below is the data for LLC misses.
> >
> > Kernel 2.6.33:
> >     20.94%        hackbench  [kernel.kallsyms]                                       [k] copy_user_generic_string
> >     14.56%        hackbench  [kernel.kallsyms]                                       [k] unix_stream_recvmsg
> >     12.88%        hackbench  [kernel.kallsyms]                                       [k] kfree
> >      7.37%        hackbench  [kernel.kallsyms]                                       [k] kmem_cache_free
> >      7.18%        hackbench  [kernel.kallsyms]                                       [k] kmem_cache_alloc_node
> >      6.78%        hackbench  [kernel.kallsyms]                                       [k] kfree_skb
> >      6.27%        hackbench  [kernel.kallsyms]                                       [k] __kmalloc_node_track_caller
> >      2.73%        hackbench  [kernel.kallsyms]                                       [k] __slab_free
> >      2.21%        hackbench  [kernel.kallsyms]                                       [k] get_partial_node
> >      2.01%        hackbench  [kernel.kallsyms]                                       [k] _raw_spin_lock
> >      1.59%        hackbench  [kernel.kallsyms]                                       [k] schedule
> >      1.27%        hackbench  hackbench                                               [.] receiver
> >      0.99%        hackbench  libpthread-2.9.so                                       [.] __read
> >      0.87%        hackbench  [kernel.kallsyms]                                       [k] unix_stream_sendmsg
> >
> > Kernel 2.6.34-rc3:
> >     18.55%        hackbench  [kernel.kallsyms]                                                     [k] copy_user_generic_string
> >     13.19%        hackbench  [kernel.kallsyms]                                                     [k] unix_stream_recvmsg
> >     11.62%        hackbench  [kernel.kallsyms]                                                     [k] kfree
> >      8.54%        hackbench  [kernel.kallsyms]                                                     [k] kmem_cache_free
> >      7.88%        hackbench  [kernel.kallsyms]                                                     [k] __kmalloc_node_track_caller
> 
> It seems the overhead of __kmalloc_node_track_caller increased. That
> function inlines slab_alloc().
> 
> >      6.54%        hackbench  [kernel.kallsyms]                                                     [k] kmem_cache_alloc_node
> >      5.94%        hackbench  [kernel.kallsyms]                                                     [k] kfree_skb
> >      3.48%        hackbench  [kernel.kallsyms]                                                     [k] __slab_free
> >      2.15%        hackbench  [kernel.kallsyms]                                                     [k] _raw_spin_lock
> >      1.83%        hackbench  [kernel.kallsyms]                                                     [k] schedule
> >      1.82%        hackbench  [kernel.kallsyms]                                                     [k] get_partial_node
> >      1.59%        hackbench  hackbench                                                             [.] receiver
> >      1.37%        hackbench  libpthread-2.9.so                                                     [.] __read
> 
> I wonder if this is not related to the kmem_cache_cpu structure straddling
> cache line boundaries under some conditions. On 2.6.33 the kmem_cache_cpu
> structure was larger and therefore tight packing resulted in different
> alignment.
> 
> Could you see how the following patch affects the results? It attempts to
> increase the size of kmem_cache_cpu to a power-of-two number of bytes. There
> is also the potential that other per-cpu fetches to neighboring objects
> affect the situation. We could cacheline-align the whole thing.
I tested the patch against 2.6.33+9dfc6e68bfe6e and it seems it doesn't help.

I dumped the percpu allocation info when booting the kernel and didn't find
a clear sign.

> 
> ---
>  include/linux/slub_def.h |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> Index: linux-2.6/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slub_def.h	2010-04-07 11:33:50.000000000 -0500
> +++ linux-2.6/include/linux/slub_def.h	2010-04-07 11:35:18.000000000 -0500
> @@ -38,6 +38,11 @@ struct kmem_cache_cpu {
>  	void **freelist;	/* Pointer to first free per cpu object */
>  	struct page *page;	/* The slab from which we are allocating */
>  	int node;		/* The node of the page (or -1 for debug) */
> +#ifndef CONFIG_64BIT
> +	int dummy1;
> +#endif
> +	unsigned long dummy2;
> +
>  #ifdef CONFIG_SLUB_STATS
>  	unsigned stat[NR_SLUB_STAT_ITEMS];
>  #endif


David Miller April 8, 2010, 7:20 a.m. UTC | #14
From: David Miller <davem@davemloft.net>
Date: Thu, 08 Apr 2010 00:05:57 -0700 (PDT)

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 08 Apr 2010 09:00:19 +0200
> 
>> If run individually, the test results are more what we would expect
>> (slow), but if the machine runs the two sets of processes concurrently,
>> each group runs much faster...
> 
> BTW, I just discovered (thanks to the function graph tracer, woo hoo!)
> that loopback TCP packets get fully checksum validated on receive.
> 
> I'm trying to figure out why skb->ip_summed ends up being
> CHECKSUM_NONE in tcp_v4_rcv() even though it gets set to
> CHECKSUM_PARTIAL in tcp_sendmsg().

Ok, it looks like it's only ACK packets that have this problem,
but still :-)

It's weird that we have a special ip_dev_loopback_xmit() for
ip_mc_output() NF_HOOK()s, which forces skb->ip_summed to
CHECKSUM_UNNECESSARY, but the actual normal loopback xmit doesn't
do that...
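
For reference, that multicast helper looks roughly like this in the 2.6.3x
tree (paraphrased from net/ipv4/ip_output.c, not an exact copy):

static int ip_dev_loopback_xmit(struct sk_buff *newskb)
{
	skb_reset_mac_header(newskb);
	__skb_pull(newskb, skb_network_offset(newskb));
	newskb->pkt_type = PACKET_LOOPBACK;
	/* the forcing mentioned above: */
	newskb->ip_summed = CHECKSUM_UNNECESSARY;
	WARN_ON(!skb_dst(newskb));
	netif_rx_ni(newskb);
	return 0;
}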
Eric Dumazet April 8, 2010, 7:25 a.m. UTC | #15
On Thu, 2010-04-08 at 00:05 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 08 Apr 2010 09:00:19 +0200
> 
> > If run individually, the test results are more what we would expect
> > (slow), but if the machine runs the two sets of processes concurrently,
> > each group runs much faster...
> 
> BTW, I just discovered (thanks to the function graph tracer, woo hoo!)
> that loopback TCP packets get fully checksum validated on receive.
> 
> I'm trying to figure out why skb->ip_summed ends up being
> CHECKSUM_NONE in tcp_v4_rcv() even though it gets set to
> CHECKSUM_PARTIAL in tcp_sendmsg().
> 
> I wonder how much this accounts for some of the hackbench
> oddities... and other regressions in loopback tests we've seen.
> :-)
> 
> Just FYI...

Thanks!

But hackbench is an af_unix benchmark, so the loopback stuff is not used
that much :)


Eric Dumazet April 8, 2010, 7:54 a.m. UTC | #16
On Thu, 2010-04-08 at 15:54 +0800, Zhang, Yanmin wrote:

> If there are 2 nodes in the machine, processes on node 0 will contact the
> MCH of node 1 to access memory on node 1. I suspect the MCH of node 1 might
> enter a power-saving mode when all the cpus of node 1 are free, so the
> transactions from MCH 1 to MCH 0 have a larger latency.
> 

Hmm, thanks for the hint, I will investigate this.


Zhang, Yanmin April 8, 2010, 7:54 a.m. UTC | #17
On Thu, 2010-04-08 at 09:00 +0200, Eric Dumazet wrote:
> On Thu, 2010-04-08 at 07:39 +0200, Eric Dumazet wrote:
> > I suspect NUMA is completely out of order on the current kernel, or my
> > Nehalem machine's NUMA support is a joke.
> > 
> > # numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 size: 3071 MB
> > node 0 free: 2637 MB
> > node 1 size: 3062 MB
> > node 1 free: 2909 MB
> > 
> > 
> > # cat try.sh
> > hackbench 50 process 5000
> > numactl --cpubind=0 --membind=0 hackbench 25 process 5000 >RES0 &
> > numactl --cpubind=1 --membind=1 hackbench 25 process 5000 >RES1 &
> > wait
> > echo node0 results
> > cat RES0
> > echo node1 results
> > cat RES1
> > 
> > numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
> > numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
> > wait
> > echo node0 on mem1 results
> > cat RES0_1
> > echo node1 on mem0 results
> > cat RES1_0
> > 
> > # ./try.sh
> > Running with 50*40 (== 2000) tasks.
> > Time: 16.865
> > node0 results
> > Running with 25*40 (== 1000) tasks.
> > Time: 16.767
> > node1 results
> > Running with 25*40 (== 1000) tasks.
> > Time: 16.564
> > node0 on mem1 results
> > Running with 25*40 (== 1000) tasks.
> > Time: 16.814
> > node1 on mem0 results
> > Running with 25*40 (== 1000) tasks.
> > Time: 16.896
> 
> If run individually, the test results are more what we would expect
> (slow), but if the machine runs the two sets of processes concurrently,
> each group runs much faster...
If there are 2 nodes in the machine, processes on node 0 will contact the
MCH of node 1 to access memory on node 1. I suspect the MCH of node 1 might
enter a power-saving mode when all the cpus of node 1 are free, so the
transactions from MCH 1 to MCH 0 have a larger latency.

> 
> 
> # numactl --cpubind=0 --membind=1 hackbench 25 process 5000
> Running with 25*40 (== 1000) tasks.
> Time: 21.810
> 
> # numactl --cpubind=1 --membind=0 hackbench 25 process 5000
> Running with 25*40 (== 1000) tasks.
> Time: 20.679
> 
> # numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
> [1] 9177
> # numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
> [2] 9196
> # wait
> [1]-  Done                    numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1
> [2]+  Done                    numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0
> # echo node0 on mem1 results
> node0 on mem1 results
> # cat RES0_1
> Running with 25*40 (== 1000) tasks.
> Time: 13.818
> # echo node1 on mem0 results
> node1 on mem0 results
> # cat RES1_0
> Running with 25*40 (== 1000) tasks.
> Time: 11.633
> 
> Oh well...
> 
> 


Eric Dumazet April 8, 2010, 8:09 a.m. UTC | #18
On Thu, 2010-04-08 at 09:54 +0200, Eric Dumazet wrote:
> On Thu, 2010-04-08 at 15:54 +0800, Zhang, Yanmin wrote:
> 
> > If there are 2 nodes in the machine, processes on node 0 will contact the
> > MCH of node 1 to access memory on node 1. I suspect the MCH of node 1 might
> > enter a power-saving mode when all the cpus of node 1 are free, so the
> > transactions from MCH 1 to MCH 0 have a larger latency.
> > 
> 
> Hmm, thanks for the hint, I will investigate this.

Oh well, 

perf timechart record &

Instant crash

Call Trace:
 perf_trace_sched_switch+0xd5/0x120
 schedule+0x6b5/0x860
 retint_careful+0xd/0x21
 
RIP ffffffff81010955 perf_arch_fetch_caller_regs+0x15/0x40
CR2: 00000000d21f1422


Christoph Lameter April 8, 2010, 3:34 p.m. UTC | #19
On Thu, 8 Apr 2010, Eric Dumazet wrote:

> I suspect NUMA is completely out of order on the current kernel, or my
> Nehalem machine's NUMA support is a joke.
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 size: 3071 MB
> node 0 free: 2637 MB
> node 1 size: 3062 MB
> node 1 free: 2909 MB

How do the cpus map to the nodes? cpu 0 and 1 both on the same node?

> # ./try.sh
> Running with 50*40 (== 2000) tasks.
> Time: 16.865
> node0 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.767
> node1 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.564
> node0 on mem1 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.814
> node1 on mem0 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.896
>
>
>
Eric Dumazet April 8, 2010, 3:52 p.m. UTC | #20
On Thu, 2010-04-08 at 10:34 -0500, Christoph Lameter wrote:
> On Thu, 8 Apr 2010, Eric Dumazet wrote:
> 
> > I suspect NUMA is completely out of order on the current kernel, or my
> > Nehalem machine's NUMA support is a joke.
> >
> > # numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 size: 3071 MB
> > node 0 free: 2637 MB
> > node 1 size: 3062 MB
> > node 1 free: 2909 MB
> 
> How do the cpus map to the nodes? cpu 0 and 1 both on the same node?

one socket maps to 0 2 4 6 8 10 12 14 (Node 0)
one socket maps to 1 3 5 7 9 11 13 15 (Node 1)

# numactl --cpubind=0 --membind=0 numactl --show
policy: bind
preferred node: 0
interleavemask: 
interleavenode: 0
nodebind: 0 
membind: 0 
cpubind: 1 3 5 7 9 11 13 15 1024 

(strange 1024 report...)

# numactl --cpubind=1 --membind=1 numactl --show
policy: bind
preferred node: 1
interleavemask: 
interleavenode: 0
nodebind: 
membind: 1 
cpubind: 0 2 4 6 8 10 12 14 



[    0.161170] Booting Node   0, Processors  #1
[    0.248995] CPU 1 MCA banks CMCI:2 CMCI:3 CMCI:5 CMCI:6 SHD:8
[    0.269177]  Ok.
[    0.269453] Booting Node   1, Processors  #2
[    0.356965] CPU 2 MCA banks CMCI:2 CMCI:3 CMCI:5 SHD:6 SHD:8
[    0.377207]  Ok.
[    0.377485] Booting Node   0, Processors  #3
[    0.464935] CPU 3 MCA banks CMCI:2 CMCI:3 CMCI:5 SHD:6 SHD:8
[    0.485065]  Ok.
[    0.485217] Booting Node   1, Processors  #4
[    0.572906] CPU 4 MCA banks CMCI:2 CMCI:3 CMCI:5 SHD:6 SHD:8
[    0.593044]  Ok.
...
grep "physical id" /proc/cpuinfo 
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
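
A sysfs cross-check of the same mapping (hypothetical invocation; per the
numactl output above, node 0 should list 0,2,4,6,8,10,12,14):

cat /sys/devices/system/node/node0/cpulist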



Patch

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2010-04-07 11:33:50.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2010-04-07 11:35:18.000000000 -0500
@@ -38,6 +38,11 @@ struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
+#ifndef CONFIG_64BIT
+	int dummy1;
+#endif
+	unsigned long dummy2;
+
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
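
The resulting structure layout and padding can be verified with pahole from
the dwarves package (a hypothetical invocation, assuming a kernel built
with debug info):

	pahole -C kmem_cache_cpu vmlinux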