Message ID: alpine.DEB.2.00.1004071130260.13261@router.home
State: Not Applicable, archived
Delegated to: David Miller
Christoph Lameter wrote:
> I wonder if this is not related to the kmem_cache_cpu structure straggling
> cache line boundaries under some conditions. On 2.6.33 the kmem_cache_cpu
> structure was larger and therefore tight packing resulted in different
> alignment.
>
> Could you see how the following patch affects the results. It attempts to
> increase the size of kmem_cache_cpu to a power of 2 bytes. There is also
> the potential that other per cpu fetches to neighboring objects affect the
> situation. We could cacheline align the whole thing.
>
> ---
>  include/linux/slub_def.h | 5 +++++
>  1 file changed, 5 insertions(+)
>
> Index: linux-2.6/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slub_def.h	2010-04-07 11:33:50.000000000 -0500
> +++ linux-2.6/include/linux/slub_def.h	2010-04-07 11:35:18.000000000 -0500
> @@ -38,6 +38,11 @@ struct kmem_cache_cpu {
>  	void **freelist;	/* Pointer to first free per cpu object */
>  	struct page *page;	/* The slab from which we are allocating */
>  	int node;		/* The node of the page (or -1 for debug) */
> +#ifndef CONFIG_64BIT
> +	int dummy1;
> +#endif
> +	unsigned long dummy2;
> +
>  #ifdef CONFIG_SLUB_STATS
>  	unsigned stat[NR_SLUB_STAT_ITEMS];
>  #endif

Would __cacheline_aligned_in_smp do the trick here?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Pekka Enberg wrote:
> Christoph Lameter wrote:
>> I wonder if this is not related to the kmem_cache_cpu structure
>> straggling cache line boundaries under some conditions. On 2.6.33 the
>> kmem_cache_cpu structure was larger and therefore tight packing
>> resulted in different alignment.
>>
>> Could you see how the following patch affects the results. It attempts to
>> increase the size of kmem_cache_cpu to a power of 2 bytes. There is also
>> the potential that other per cpu fetches to neighboring objects affect
>> the situation. We could cacheline align the whole thing.
>>
>> ---
>>  include/linux/slub_def.h | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> Index: linux-2.6/include/linux/slub_def.h
>> ===================================================================
>> --- linux-2.6.orig/include/linux/slub_def.h	2010-04-07 11:33:50.000000000 -0500
>> +++ linux-2.6/include/linux/slub_def.h	2010-04-07 11:35:18.000000000 -0500
>> @@ -38,6 +38,11 @@ struct kmem_cache_cpu {
>>  	void **freelist;	/* Pointer to first free per cpu object */
>>  	struct page *page;	/* The slab from which we are allocating */
>>  	int node;		/* The node of the page (or -1 for debug) */
>> +#ifndef CONFIG_64BIT
>> +	int dummy1;
>> +#endif
>> +	unsigned long dummy2;
>> +
>>  #ifdef CONFIG_SLUB_STATS
>>  	unsigned stat[NR_SLUB_STAT_ITEMS];
>>  #endif
>
> Would __cacheline_aligned_in_smp do the trick here?

Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with four
underscores) for per-cpu data. Confusing...
On Wed, 7 Apr 2010, Pekka Enberg wrote:

> Christoph Lameter wrote:
> > I wonder if this is not related to the kmem_cache_cpu structure straggling
> > cache line boundaries under some conditions. On 2.6.33 the kmem_cache_cpu
> > structure was larger and therefore tight packing resulted in different
> > alignment.
> >
> > Could you see how the following patch affects the results. It attempts to
> > increase the size of kmem_cache_cpu to a power of 2 bytes. There is also
> > the potential that other per cpu fetches to neighboring objects affect the
> > situation. We could cacheline align the whole thing.
> >
> > ---
> >  include/linux/slub_def.h | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > Index: linux-2.6/include/linux/slub_def.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/slub_def.h	2010-04-07 11:33:50.000000000 -0500
> > +++ linux-2.6/include/linux/slub_def.h	2010-04-07 11:35:18.000000000 -0500
> > @@ -38,6 +38,11 @@ struct kmem_cache_cpu {
> >  	void **freelist;	/* Pointer to first free per cpu object */
> >  	struct page *page;	/* The slab from which we are allocating */
> >  	int node;		/* The node of the page (or -1 for debug) */
> > +#ifndef CONFIG_64BIT
> > +	int dummy1;
> > +#endif
> > +	unsigned long dummy2;
> > +
> >  #ifdef CONFIG_SLUB_STATS
> >  	unsigned stat[NR_SLUB_STAT_ITEMS];
> >  #endif
>
> Would __cacheline_aligned_in_smp do the trick here?

This is allocated via the percpu allocator. We could specify cacheline
alignment there but that would reduce the density. You basically need 4
words for a kmem_cache_cpu structure. A number of those fit into one 64
byte cacheline.
On Wed, 7 Apr 2010, Pekka Enberg wrote:

> Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with four
> underscores) for per-cpu data. Confusing...

This does not particularly help to clarify the situation since we are
dealing with data that can either be allocated via the percpu allocator or
be statically present (kmalloc bootstrap situation).
Christoph Lameter wrote:
> On Wed, 7 Apr 2010, Pekka Enberg wrote:
>
>> Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with four
>> underscores) for per-cpu data. Confusing...
>
> This does not particularly help to clarify the situation since we are
> dealing with data that can either be allocated via the percpu allocator or
> be statically present (kmalloc bootstrap situation).

Yes, I am an idiot. :-)
On Wednesday 07 April 2010 at 13:20 -0500, Christoph Lameter wrote:
> On Wed, 7 Apr 2010, Pekka Enberg wrote:
>
> > Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with four
> > underscores) for per-cpu data. Confusing...
>
> This does not particularly help to clarify the situation since we are
> dealing with data that can either be allocated via the percpu allocator or
> be statically present (kmalloc bootstrap situation).

Do we have a user program to check the actual L1 cache size of a machine?

I remember my HP blades have many BIOS options, I would like to make
sure they are properly set.
On Wed, 7 Apr 2010, Pekka Enberg wrote:
> Yes, I am an idiot. :-)
Plato said it in another way:
"As for me, all I know is that I know nothing."
On Wed, 2010-04-07 at 20:38 +0200, Eric Dumazet wrote:
> On Wednesday 07 April 2010 at 13:20 -0500, Christoph Lameter wrote:
> > On Wed, 7 Apr 2010, Pekka Enberg wrote:
> >
> > > Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with four
> > > underscores) for per-cpu data. Confusing...
> >
> > This does not particularly help to clarify the situation since we are
> > dealing with data that can either be allocated via the percpu allocator or
> > be statically present (kmalloc bootstrap situation).
>
> Do we have a user program to check the actual L1 cache size of a machine?

If there is none, it's easy to write one, as the kernel exports the cache
stats under /sys/devices/system/cpu/cpuXXX/cache/indexXXX/.

> I remember my HP blades have many BIOS options, I would like to make
> sure they are properly set.
On Thursday 08 April 2010 at 09:05 +0800, Zhang, Yanmin wrote:
> > Do we have a user program to check the actual L1 cache size of a machine?
> If there is none, it's easy to write one, as the kernel exports the cache
> stats under /sys/devices/system/cpu/cpuXXX/cache/indexXXX/.

Yes, this is what advertises my L1 cache as having 64-byte lines, but I
would like to check that in practice it is not 128 bytes...

./index0/type:Data
./index0/level:1
./index0/coherency_line_size:64
./index0/physical_line_partition:1
./index0/ways_of_associativity:8
./index0/number_of_sets:64
./index0/size:32K
./index0/shared_cpu_map:00000101
./index0/shared_cpu_list:0,8

./index1/type:Instruction
./index1/level:1
./index1/coherency_line_size:64
./index1/physical_line_partition:1
./index1/ways_of_associativity:4
./index1/number_of_sets:128
./index1/size:32K
./index1/shared_cpu_map:00000101
./index1/shared_cpu_list:0,8
I suspect NUMA is completely out of order on current kernel, or my
Nehalem machine NUMA support is a joke

# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 3071 MB
node 0 free: 2637 MB
node 1 size: 3062 MB
node 1 free: 2909 MB

# cat try.sh
hackbench 50 process 5000
numactl --cpubind=0 --membind=0 hackbench 25 process 5000 >RES0 &
numactl --cpubind=1 --membind=1 hackbench 25 process 5000 >RES1 &
wait
echo node0 results
cat RES0
echo node1 results
cat RES1

numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
wait
echo node0 on mem1 results
cat RES0_1
echo node1 on mem0 results
cat RES1_0

# ./try.sh
Running with 50*40 (== 2000) tasks.
Time: 16.865
node0 results
Running with 25*40 (== 1000) tasks.
Time: 16.767
node1 results
Running with 25*40 (== 1000) tasks.
Time: 16.564
node0 on mem1 results
Running with 25*40 (== 1000) tasks.
Time: 16.814
node1 on mem0 results
Running with 25*40 (== 1000) tasks.
Time: 16.896
On Thursday 08 April 2010 at 07:39 +0200, Eric Dumazet wrote:
> I suspect NUMA is completely out of order on current kernel, or my
> Nehalem machine NUMA support is a joke
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 size: 3071 MB
> node 0 free: 2637 MB
> node 1 size: 3062 MB
> node 1 free: 2909 MB
>
> # ./try.sh
> Running with 50*40 (== 2000) tasks.
> Time: 16.865
> node0 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.767
> node1 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.564
> node0 on mem1 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.814
> node1 on mem0 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.896

If run individually, the test results are more what we would expect
(slow), but if the machine runs the two sets of processes concurrently,
each group runs much faster...

# numactl --cpubind=0 --membind=1 hackbench 25 process 5000
Running with 25*40 (== 1000) tasks.
Time: 21.810

# numactl --cpubind=1 --membind=0 hackbench 25 process 5000
Running with 25*40 (== 1000) tasks.
Time: 20.679

# numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
[1] 9177
# numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
[2] 9196
# wait
[1]-  Done    numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1
[2]+  Done    numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0
# echo node0 on mem1 results
node0 on mem1 results
# cat RES0_1
Running with 25*40 (== 1000) tasks.
Time: 13.818
# echo node1 on mem0 results
node1 on mem0 results
# cat RES1_0
Running with 25*40 (== 1000) tasks.
Time: 11.633

Oh well...
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 08 Apr 2010 09:00:19 +0200

> If run individually, the test results are more what we would expect
> (slow), but if the machine runs the two sets of processes concurrently,
> each group runs much faster...

BTW, I just discovered (thanks to the function graph tracer, woo hoo!)
that loopback TCP packets get fully checksum validated on receive.

I'm trying to figure out why skb->ip_summed ends up being CHECKSUM_NONE
in tcp_v4_rcv() even though it gets set to CHECKSUM_PARTIAL in
tcp_sendmsg().

I wonder how much this accounts for some of the hackbench oddities...
and other regressions in loopback tests we've seen. :-)

Just FYI...
On Wed, 2010-04-07 at 11:43 -0500, Christoph Lameter wrote:
> On Wed, 7 Apr 2010, Zhang, Yanmin wrote:
>
> > I collected retired instruction, dtlb miss and LLC miss.
> > Below is data of LLC miss.
> >
> > Kernel 2.6.33:
> >     20.94%  hackbench  [kernel.kallsyms]  [k] copy_user_generic_string
> >     14.56%  hackbench  [kernel.kallsyms]  [k] unix_stream_recvmsg
> >     12.88%  hackbench  [kernel.kallsyms]  [k] kfree
> >      7.37%  hackbench  [kernel.kallsyms]  [k] kmem_cache_free
> >      7.18%  hackbench  [kernel.kallsyms]  [k] kmem_cache_alloc_node
> >      6.78%  hackbench  [kernel.kallsyms]  [k] kfree_skb
> >      6.27%  hackbench  [kernel.kallsyms]  [k] __kmalloc_node_track_caller
> >      2.73%  hackbench  [kernel.kallsyms]  [k] __slab_free
> >      2.21%  hackbench  [kernel.kallsyms]  [k] get_partial_node
> >      2.01%  hackbench  [kernel.kallsyms]  [k] _raw_spin_lock
> >      1.59%  hackbench  [kernel.kallsyms]  [k] schedule
> >      1.27%  hackbench  hackbench          [.] receiver
> >      0.99%  hackbench  libpthread-2.9.so  [.] __read
> >      0.87%  hackbench  [kernel.kallsyms]  [k] unix_stream_sendmsg
> >
> > Kernel 2.6.34-rc3:
> >     18.55%  hackbench  [kernel.kallsyms]  [k] copy_user_generic_string
> >     13.19%  hackbench  [kernel.kallsyms]  [k] unix_stream_recvmsg
> >     11.62%  hackbench  [kernel.kallsyms]  [k] kfree
> >      8.54%  hackbench  [kernel.kallsyms]  [k] kmem_cache_free
> >      7.88%  hackbench  [kernel.kallsyms]  [k] __kmalloc_node_track_caller
>
> Seems that the overhead of __kmalloc_node_track_caller was increased. The
> function inlines slab_alloc().
>
> >      6.54%  hackbench  [kernel.kallsyms]  [k] kmem_cache_alloc_node
> >      5.94%  hackbench  [kernel.kallsyms]  [k] kfree_skb
> >      3.48%  hackbench  [kernel.kallsyms]  [k] __slab_free
> >      2.15%  hackbench  [kernel.kallsyms]  [k] _raw_spin_lock
> >      1.83%  hackbench  [kernel.kallsyms]  [k] schedule
> >      1.82%  hackbench  [kernel.kallsyms]  [k] get_partial_node
> >      1.59%  hackbench  hackbench          [.] receiver
> >      1.37%  hackbench  libpthread-2.9.so  [.] __read
>
> I wonder if this is not related to the kmem_cache_cpu structure straggling
> cache line boundaries under some conditions. On 2.6.33 the kmem_cache_cpu
> structure was larger and therefore tight packing resulted in different
> alignment.
>
> Could you see how the following patch affects the results. It attempts to
> increase the size of kmem_cache_cpu to a power of 2 bytes. There is also
> the potential that other per cpu fetches to neighboring objects affect the
> situation. We could cacheline align the whole thing.

I tested the patch against 2.6.33+9dfc6e68bfe6e and it seems it doesn't
help. I dumped the percpu allocation info when booting the kernel and
didn't find a clear sign.

> ---
>  include/linux/slub_def.h | 5 +++++
>  1 file changed, 5 insertions(+)
>
> Index: linux-2.6/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slub_def.h	2010-04-07 11:33:50.000000000 -0500
> +++ linux-2.6/include/linux/slub_def.h	2010-04-07 11:35:18.000000000 -0500
> @@ -38,6 +38,11 @@ struct kmem_cache_cpu {
>  	void **freelist;	/* Pointer to first free per cpu object */
>  	struct page *page;	/* The slab from which we are allocating */
>  	int node;		/* The node of the page (or -1 for debug) */
> +#ifndef CONFIG_64BIT
> +	int dummy1;
> +#endif
> +	unsigned long dummy2;
> +
>  #ifdef CONFIG_SLUB_STATS
>  	unsigned stat[NR_SLUB_STAT_ITEMS];
>  #endif
From: David Miller <davem@davemloft.net>
Date: Thu, 08 Apr 2010 00:05:57 -0700 (PDT)

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 08 Apr 2010 09:00:19 +0200
>
>> If run individually, the test results are more what we would expect
>> (slow), but if the machine runs the two sets of processes concurrently,
>> each group runs much faster...
>
> BTW, I just discovered (thanks to the function graph tracer, woo hoo!)
> that loopback TCP packets get fully checksum validated on receive.
>
> I'm trying to figure out why skb->ip_summed ends up being
> CHECKSUM_NONE in tcp_v4_rcv() even though it gets set to
> CHECKSUM_PARTIAL in tcp_sendmsg().

Ok, it looks like it's only ACK packets that have this problem, but
still :-)

It's weird that we have a special ip_dev_loopback_xmit() for
ip_mc_output() NF_HOOK()s, which forces skb->ip_summed to
CHECKSUM_UNNECESSARY, but the actual normal loopback xmit doesn't do
that...
On Thursday 08 April 2010 at 00:05 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 08 Apr 2010 09:00:19 +0200
>
> > If run individually, the test results are more what we would expect
> > (slow), but if the machine runs the two sets of processes concurrently,
> > each group runs much faster...
>
> BTW, I just discovered (thanks to the function graph tracer, woo hoo!)
> that loopback TCP packets get fully checksum validated on receive.
>
> I'm trying to figure out why skb->ip_summed ends up being
> CHECKSUM_NONE in tcp_v4_rcv() even though it gets set to
> CHECKSUM_PARTIAL in tcp_sendmsg().
>
> I wonder how much this accounts for some of the hackbench
> oddities... and other regressions in loopback tests we've seen.
> :-)
>
> Just FYI...

Thanks!

But hackbench is an af_unix benchmark, so loopback stuff is not used
that much :)
On Thursday 08 April 2010 at 15:54 +0800, Zhang, Yanmin wrote:
> If there are 2 nodes in the machine, processes on node 0 will contact the
> MCH of node 1 to access memory of node 1. I suspect the MCH of node 1
> might enter a power-saving mode when all the cpus of node 1 are free. So
> the transactions from MCH 1 to MCH 0 have a larger latency.

Hmm, thanks for the hint, I will investigate this.
On Thu, 2010-04-08 at 09:00 +0200, Eric Dumazet wrote:
> On Thursday 08 April 2010 at 07:39 +0200, Eric Dumazet wrote:
> > I suspect NUMA is completely out of order on current kernel, or my
> > Nehalem machine NUMA support is a joke
> >
> > # numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 size: 3071 MB
> > node 0 free: 2637 MB
> > node 1 size: 3062 MB
> > node 1 free: 2909 MB
> >
> > # cat try.sh
> > hackbench 50 process 5000
> > numactl --cpubind=0 --membind=0 hackbench 25 process 5000 >RES0 &
> > numactl --cpubind=1 --membind=1 hackbench 25 process 5000 >RES1 &
> > wait
> > echo node0 results
> > cat RES0
> > echo node1 results
> > cat RES1
> >
> > numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
> > numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
> > wait
> > echo node0 on mem1 results
> > cat RES0_1
> > echo node1 on mem0 results
> > cat RES1_0
> >
> > # ./try.sh
> > Running with 50*40 (== 2000) tasks.
> > Time: 16.865
> > node0 results
> > Running with 25*40 (== 1000) tasks.
> > Time: 16.767
> > node1 results
> > Running with 25*40 (== 1000) tasks.
> > Time: 16.564
> > node0 on mem1 results
> > Running with 25*40 (== 1000) tasks.
> > Time: 16.814
> > node1 on mem0 results
> > Running with 25*40 (== 1000) tasks.
> > Time: 16.896
>
> If run individually, the test results are more what we would expect
> (slow), but if the machine runs the two sets of processes concurrently,
> each group runs much faster...

If there are 2 nodes in the machine, processes on node 0 will contact the
MCH of node 1 to access memory of node 1. I suspect the MCH of node 1
might enter a power-saving mode when all the cpus of node 1 are free. So
the transactions from MCH 1 to MCH 0 have a larger latency.

> # numactl --cpubind=0 --membind=1 hackbench 25 process 5000
> Running with 25*40 (== 1000) tasks.
> Time: 21.810
>
> # numactl --cpubind=1 --membind=0 hackbench 25 process 5000
> Running with 25*40 (== 1000) tasks.
> Time: 20.679
>
> # numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
> [1] 9177
> # numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
> [2] 9196
> # wait
> [1]-  Done    numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1
> [2]+  Done    numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0
> # echo node0 on mem1 results
> node0 on mem1 results
> # cat RES0_1
> Running with 25*40 (== 1000) tasks.
> Time: 13.818
> # echo node1 on mem0 results
> node1 on mem0 results
> # cat RES1_0
> Running with 25*40 (== 1000) tasks.
> Time: 11.633
>
> Oh well...
On Thursday 08 April 2010 at 09:54 +0200, Eric Dumazet wrote:
> On Thursday 08 April 2010 at 15:54 +0800, Zhang, Yanmin wrote:
>
> > If there are 2 nodes in the machine, processes on node 0 will contact the
> > MCH of node 1 to access memory of node 1. I suspect the MCH of node 1
> > might enter a power-saving mode when all the cpus of node 1 are free. So
> > the transactions from MCH 1 to MCH 0 have a larger latency.
>
> Hmm, thanks for the hint, I will investigate this.

Oh well, "perf timechart record" -> instant crash

Call Trace:
 perf_trace_sched_switch+0xd5/0x120
 schedule+0x6b5/0x860
 retint_careful+0xd/0x21
RIP ffffffff81010955 perf_arch_fetch_caller_regs+0x15/0x40
CR2: 00000000d21f1422
On Thu, 8 Apr 2010, Eric Dumazet wrote:

> I suspect NUMA is completely out of order on current kernel, or my
> Nehalem machine NUMA support is a joke
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 size: 3071 MB
> node 0 free: 2637 MB
> node 1 size: 3062 MB
> node 1 free: 2909 MB

How do the cpus map to the nodes? cpu 0 and 1 both on the same node?

> # ./try.sh
> Running with 50*40 (== 2000) tasks.
> Time: 16.865
> node0 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.767
> node1 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.564
> node0 on mem1 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.814
> node1 on mem0 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.896
On Thursday 08 April 2010 at 10:34 -0500, Christoph Lameter wrote:
> On Thu, 8 Apr 2010, Eric Dumazet wrote:
>
> > I suspect NUMA is completely out of order on current kernel, or my
> > Nehalem machine NUMA support is a joke
> >
> > # numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 size: 3071 MB
> > node 0 free: 2637 MB
> > node 1 size: 3062 MB
> > node 1 free: 2909 MB
>
> How do the cpus map to the nodes? cpu 0 and 1 both on the same node?

One socket maps to 0 2 4 6 8 10 12 14 (Node 0),
the other socket maps to 1 3 5 7 9 11 13 15 (Node 1).

# numactl --cpubind=0 --membind=0 numactl --show
policy: bind
preferred node: 0
interleavemask:
interleavenode: 0
nodebind: 0
membind: 0
cpubind: 1 3 5 7 9 11 13 15 1024

(strange 1024 report...)

# numactl --cpubind=1 --membind=1 numactl --show
policy: bind
preferred node: 1
interleavemask:
interleavenode: 0
nodebind:
membind: 1
cpubind: 0 2 4 6 8 10 12 14

[    0.161170] Booting Node 0, Processors #1
[    0.248995] CPU 1 MCA banks CMCI:2 CMCI:3 CMCI:5 CMCI:6 SHD:8
[    0.269177] Ok.
[    0.269453] Booting Node 1, Processors #2
[    0.356965] CPU 2 MCA banks CMCI:2 CMCI:3 CMCI:5 SHD:6 SHD:8
[    0.377207] Ok.
[    0.377485] Booting Node 0, Processors #3
[    0.464935] CPU 3 MCA banks CMCI:2 CMCI:3 CMCI:5 SHD:6 SHD:8
[    0.485065] Ok.
[    0.485217] Booting Node 1, Processors #4
[    0.572906] CPU 4 MCA banks CMCI:2 CMCI:3 CMCI:5 SHD:6 SHD:8
[    0.593044] Ok.
...

# grep "physical id" /proc/cpuinfo
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2010-04-07 11:33:50.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2010-04-07 11:35:18.000000000 -0500
@@ -38,6 +38,11 @@ struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
+#ifndef CONFIG_64BIT
+	int dummy1;
+#endif
+	unsigned long dummy2;
+
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif