Patchwork Mainline kernel OLTP performance update

Submitter Pekka Enberg
Date Jan. 22, 2009, 9:47 a.m.
Message ID <1232617672.14549.25.camel@penberg-laptop>
Permalink /patch/19781/
State Not Applicable
Delegated to: David Miller

Comments

Pekka Enberg - Jan. 22, 2009, 9:47 a.m.
On Thu, 2009-01-22 at 17:28 +0800, Zhang, Yanmin wrote:
> On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote:
> > On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote:
> > > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote:
> > > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote:
> > > > 
> > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
> > > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages.
> > > > 
> > > > That order can be changed. Try specifying slub_max_order=0 on the kernel
> > > > command line to force an order 0 alloc.
> > > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue.
> > > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high.
> > > 
> > > I checked my instrumentation in kernel and found it's caused by large object allocation/free
> > > whose size is more than PAGE_SIZE. Here its order is 1.
> > > 
> > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data.
> > > 
> > > So this case isn't the issue that batch of allocation/free might erase partial page
> > > functionality.
> > 
> > So is this the kfree(skb->head) in skb_release_data() or the put_page()
> > calls in the same function in a loop?
> It's kfree(skb->head).
> 
> > 
> > If it's the former, with big enough size passed to __alloc_skb(), the
> > networking code might be taking a hit from the SLUB page allocator
> > pass-through.

Do we know what kind of size is being passed to __alloc_skb() in this
case? Maybe we want to do something like this.

		Pekka

SLUB: revert page allocator pass-through

This is a revert of commit aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB:
direct pass through of page size or higher kmalloc requests").
---



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zhang, Yanmin - Jan. 23, 2009, 3:02 a.m.
On Thu, 2009-01-22 at 11:47 +0200, Pekka Enberg wrote:
> On Thu, 2009-01-22 at 17:28 +0800, Zhang, Yanmin wrote:
> > On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote:
> > > On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote:
> > > > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote:
> > > > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote:
> > > > > 
> > > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
> > > > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages.
> > > > > 
> > > > > That order can be changed. Try specifying slub_max_order=0 on the kernel
> > > > > command line to force an order 0 alloc.
> > > > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue.
> > > > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high.
> > > > 
> > > > I checked my instrumentation in kernel and found it's caused by large object allocation/free
> > > > whose size is more than PAGE_SIZE. Here its order is 1.
> > > > 
> > > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data.
> > > > 
> > > > So this case isn't the issue that batch of allocation/free might erase partial page
> > > > functionality.
> > > 
> > > So is this the kfree(skb->head) in skb_release_data() or the put_page()
> > > calls in the same function in a loop?
> > It's kfree(skb->head).
> > 
> > > 
> > > If it's the former, with big enough size passed to __alloc_skb(), the
> > > networking code might be taking a hit from the SLUB page allocator
> > > pass-through.
> 
> Do we know what kind of size is being passed to __alloc_skb() in this
> case?
In function __alloc_skb, original parameter size=4155,
SKB_DATA_ALIGN(size)=4224, sizeof(struct skb_shared_info)=472, so
__kmalloc_track_caller's parameter size=4696.
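
For reference, the arithmetic above can be reproduced in a few lines of user-space C. This is only a sketch: the helper names are made up for illustration, and the 128-byte alignment is an assumption inferred from 4155 rounding up to 4224 (SKB_DATA_ALIGN rounds up to SMP_CACHE_BYTES in the kernel). The 472 is the sizeof(struct skb_shared_info) reported above.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of align (align must be a power of two). */
size_t align_up(size_t size, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (size + align - 1) & ~(align - 1);
}

/*
 * Bytes requested from kmalloc for an skb data buffer: the cache-aligned
 * payload plus the trailing shared-info block. Illustrative helper, not a
 * kernel function.
 */
size_t skb_kmalloc_size(size_t size, size_t cache_bytes, size_t shinfo_size)
{
	return align_up(size, cache_bytes) + shinfo_size;
}
```

With these inputs, skb_kmalloc_size(4155, 128, 472) evaluates to 4696, which is larger than a 4096-byte page.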

>  Maybe we want to do something like this.
> 
> 		Pekka
> 
> SLUB: revert page allocator pass-through
This patch almost fixes the netperf UDP-U-4k issue.

#slabinfo -AD
Name                   Objects    Alloc     Free   %Fast
:0000256                  1658 70350463 70348946  99  99 
kmalloc-8192                31 70322309 70322293  99  99 
:0000168                  2592   143154   140684  93  28 
:0004096                  1456    91072    89644  99  96 
:0000192                  3402    63838    60491  89  11 
:0000064                  6177    49635    43743  98  77 

So kmalloc-8192 appears. Without the patch, kmalloc-8192 hides.
kmalloc-8192's default order on my 8-core stoakley is 2.
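
As a back-of-the-envelope check, assuming 4 KiB pages and no per-object debug metadata (which may not hold on a given configuration), an order-2 slab of 8192-byte objects holds only two objects, and each 4696-byte request leaves a good part of its object unused. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Objects that fit in one slab of the given page order, ignoring metadata. */
size_t objects_per_slab(size_t page_size, unsigned int order, size_t object_size)
{
	assert(object_size > 0);
	return (page_size << order) / object_size;
}

/* Bytes left unused when a request is served from a fixed-size cache object. */
size_t internal_waste(size_t object_size, size_t request_size)
{
	assert(request_size <= object_size);
	return object_size - request_size;
}
```

Here objects_per_slab(4096, 2, 8192) is 2, and internal_waste(8192, 4696) is 3496 bytes.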

1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
2) If I start 1 client and 1 server, and bind them to different physical CPUs, SLQB's result
is about 10% better than SLUB's.

I don't know why there is still a 10% difference in case 2). Maybe cache misses cause it?

> 
> This is a revert of commit aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB:
> direct pass through of page size or higher kmalloc requests").
> ---
> 
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index 2f5c16b..3bd3662 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h


Pekka Enberg - Jan. 23, 2009, 6:52 a.m.
Zhang, Yanmin wrote:
>>>> If it's the former, with big enough size passed to __alloc_skb(), the
>>>> networking code might be taking a hit from the SLUB page allocator
>>>> pass-through.
>> Do we know what kind of size is being passed to __alloc_skb() in this
>> case?
> In function __alloc_skb, original parameter size=4155,
> SKB_DATA_ALIGN(size)=4224, sizeof(struct skb_shared_info)=472, so
> __kmalloc_track_caller's parameter size=4696.

OK, so all allocations go straight to the page allocator.
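
A rough user-space sketch of why that is, assuming 4 KiB pages (PAGE_SHIFT = 12): before the revert, the kmalloc paths hand anything larger than PAGE_SIZE to the page allocator; after the revert, get_slab() picks the cache index with fls(size - 1). The names below are illustrative, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE ((size_t)1 << SKETCH_PAGE_SHIFT)

/* Smallest n with (1 << n) >= size; equals fls(size - 1) for size > 1. */
int ceil_log2(size_t size)
{
	int n = 0;

	assert(size > 0);
	while (((size_t)1 << n) < size)
		n++;
	return n;
}

/* Nonzero if the pre-revert check (size > PAGE_SIZE) bypasses the slab caches. */
int passes_through(size_t size)
{
	return size > SKETCH_PAGE_SIZE;
}

/* Page order needed for a pass-through allocation of this size. */
int page_order(size_t size)
{
	int shift = ceil_log2(size);

	return shift > SKETCH_PAGE_SHIFT ? shift - SKETCH_PAGE_SHIFT : 0;
}

/* Index of the power-of-two kmalloc cache used once pass-through is reverted. */
int kmalloc_cache_index(size_t size)
{
	return ceil_log2(size);
}
```

For size 4696 this gives an order-1 page allocation before the revert and index 13 (the 8192-byte cache) after it.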

> 
>>  Maybe we want to do something like this.
>>
>> SLUB: revert page allocator pass-through
> This patch almost fixes the netperf UDP-U-4k issue.
> 
> #slabinfo -AD
> Name                   Objects    Alloc     Free   %Fast
> :0000256                  1658 70350463 70348946  99  99 
> kmalloc-8192                31 70322309 70322293  99  99 
> :0000168                  2592   143154   140684  93  28 
> :0004096                  1456    91072    89644  99  96 
> :0000192                  3402    63838    60491  89  11 
> :0000064                  6177    49635    43743  98  77 
> 
> So kmalloc-8192 appears. Without the patch, kmalloc-8192 hides.
> kmalloc-8192's default order on my 8-core stoakley is 2.

Christoph, should we merge my patch as-is or do you have an alternative 
fix in mind? We could, of course, increase kmalloc() caches one level up 
to 8192 or higher.

> 
> 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
> 2) If I start 1 client and 1 server, and bind them to different physical cpu, SLQB's result
> is about 10% better than SLUB's.
> 
> I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it?

Maybe we can use the perfstat and/or kerneltop utilities of the new perf 
counters patch to diagnose this:

http://lkml.org/lkml/2009/1/21/273

And do oprofile, of course. Thanks!

		Pekka
Pekka Enberg - Jan. 23, 2009, 8:06 a.m.
On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote:
> > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
> > 2) If I start 1 client and 1 server, and bind them to different physical cpu, SLQB's result
> > is about 10% better than SLUB's.
> > 
> > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it?
> 
> Maybe we can use the perfstat and/or kerneltop utilities of the new perf 
> counters patch to diagnose this:
> 
> http://lkml.org/lkml/2009/1/21/273
> 
> And do oprofile, of course. Thanks!

I assume binding the client and the server to different physical CPUs
also  means that the SKB is always allocated on CPU 1 and freed on CPU
2? If so, we will be taking the __slab_free() slow path all the time on
kfree() which will cause cache effects, no doubt.

But there's another potential performance hit we're taking because the
object size of the cache is so big. As allocations from CPU 1 keep
coming in, we need to allocate new pages and unfreeze the per-cpu page.
That in turn causes __slab_free() to be more eager to discard the slab
(see the PageSlubFrozen check there).

So before going for cache profiling, I'd really like to see an oprofile
report. I suspect we're still going to see much more page allocator
activity there than with SLAB or SLQB which is why we're still behaving
so badly here.

		Pekka

Zhang, Yanmin - Jan. 23, 2009, 8:30 a.m.
On Fri, 2009-01-23 at 10:06 +0200, Pekka Enberg wrote:
> On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote:
> > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
> > > 2) If I start 1 client and 1 server, and bind them to different physical cpu, SLQB's result
> > > is about 10% better than SLUB's.
> > > 
> > > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it?
> > 
> > Maybe we can use the perfstat and/or kerneltop utilities of the new perf 
> > counters patch to diagnose this:
> > 
> > http://lkml.org/lkml/2009/1/21/273
> > 
> > And do oprofile, of course. Thanks!
> 
> I assume binding the client and the server to different physical CPUs
> also  means that the SKB is always allocated on CPU 1 and freed on CPU
> 2? If so, we will be taking the __slab_free() slow path all the time on
> kfree() which will cause cache effects, no doubt.
> 
> But there's another potential performance hit we're taking because the
> object size of the cache is so big. As allocations from CPU 1 keep
> coming in, we need to allocate new pages and unfreeze the per-cpu page.
> That in turn causes __slab_free() to be more eager to discard the slab
> (see the PageSlubFrozen check there).
> 
> So before going for cache profiling, I'd really like to see an oprofile
> report. I suspect we're still going to see much more page allocator
> activity
Theoretically, it should, but oprofile doesn't show that.

>  there than with SLAB or SLQB which is why we're still behaving
> so badly here.

oprofile output with 2.6.29-rc2-slubrevertlarge:
CPU: Core 2, speed 2666.71 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        app name                 symbol name
132779   32.9951  vmlinux                  copy_user_generic_string
25334     6.2954  vmlinux                  schedule
21032     5.2264  vmlinux                  tg_shares_up
17175     4.2679  vmlinux                  __skb_recv_datagram
9091      2.2591  vmlinux                  sock_def_readable
8934      2.2201  vmlinux                  mwait_idle
8796      2.1858  vmlinux                  try_to_wake_up
6940      1.7246  vmlinux                  __slab_free

#slabinfo -AD
Name                   Objects    Alloc     Free   %Fast
:0000256                  1643  5215544  5214027  94   0 
kmalloc-8192                28  5189576  5189560   0   0 
:0000168                  2631   141466   138976  92  28 
:0004096                  1452    88697    87269  99  96 
:0000192                  3402    63050    59732  89  11 
:0000064                  6265    46611    40721  98  82 
:0000128                  1895    30429    28654  93  32 


oprofile output with kernel 2.6.29-rc2-slqb0121:
CPU: Core 2, speed 2666.76 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        image name               app name                 symbol name
114793   28.7163  vmlinux                  vmlinux                  copy_user_generic_string
27880     6.9744  vmlinux                  vmlinux                  tg_shares_up
22218     5.5580  vmlinux                  vmlinux                  schedule
12238     3.0614  vmlinux                  vmlinux                  mwait_idle
7395      1.8499  vmlinux                  vmlinux                  task_rq_lock
7348      1.8382  vmlinux                  vmlinux                  sock_def_readable
7202      1.8016  vmlinux                  vmlinux                  sched_clock_cpu
6981      1.7464  vmlinux                  vmlinux                  __skb_recv_datagram
6566      1.6425  vmlinux                  vmlinux                  udp_queue_rcv_skb


Nick Piggin - Jan. 23, 2009, 8:33 a.m.
On Friday 23 January 2009 14:02:53 Zhang, Yanmin wrote:

> 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better
> than SLQB's;

I'll have to look into this too. Could be evidence of the possible
TLB improvement from using bigger pages and/or page-specific freelist,
I suppose.

Do you have a script that you use to start netperf in that configuration?

Pekka Enberg - Jan. 23, 2009, 8:40 a.m.
On Fri, 2009-01-23 at 16:30 +0800, Zhang, Yanmin wrote:
> > I assume binding the client and the server to different physical CPUs
> > also  means that the SKB is always allocated on CPU 1 and freed on CPU
> > 2? If so, we will be taking the __slab_free() slow path all the time on
> > kfree() which will cause cache effects, no doubt.
> > 
> > But there's another potential performance hit we're taking because the
> > object size of the cache is so big. As allocations from CPU 1 keep
> > coming in, we need to allocate new pages and unfreeze the per-cpu page.
> > That in turn causes __slab_free() to be more eager to discard the slab
> > (see the PageSlubFrozen check there).
> > 
> > So before going for cache profiling, I'd really like to see an oprofile
> > report. I suspect we're still going to see much more page allocator
> > activity
> Theoretically, it should, but oprofile doesn't show that.
> 
> > there than with SLAB or SLQB which is why we're still behaving
> > so badly here.
> 
> oprofile output with 2.6.29-rc2-slubrevertlarge:
> CPU: Core 2, speed 2666.71 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples  %        app name                 symbol name
> 132779   32.9951  vmlinux                  copy_user_generic_string
> 25334     6.2954  vmlinux                  schedule
> 21032     5.2264  vmlinux                  tg_shares_up
> 17175     4.2679  vmlinux                  __skb_recv_datagram
> 9091      2.2591  vmlinux                  sock_def_readable
> 8934      2.2201  vmlinux                  mwait_idle
> 8796      2.1858  vmlinux                  try_to_wake_up
> 6940      1.7246  vmlinux                  __slab_free
> 
> #slabinfo -AD
> Name                   Objects    Alloc     Free   %Fast
> :0000256                  1643  5215544  5214027  94   0 
> kmalloc-8192                28  5189576  5189560   0   0 
                                                    ^^^^^^

This looks a bit funny. Hmm.

> :0000168                  2631   141466   138976  92  28 
> :0004096                  1452    88697    87269  99  96 
> :0000192                  3402    63050    59732  89  11 
> :0000064                  6265    46611    40721  98  82 
> :0000128                  1895    30429    28654  93  32 


Zhang, Yanmin - Jan. 23, 2009, 9:02 a.m.
On Fri, 2009-01-23 at 19:33 +1100, Nick Piggin wrote:
> On Friday 23 January 2009 14:02:53 Zhang, Yanmin wrote:
> 
> > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better
> > than SLQB's;
> 
> I'll have to look into this too. Could be evidence of the possible
> TLB improvement from using bigger pages and/or page-specific freelist,
> I suppose.
> 
> Do you have a script that you use to start netperf in that configuration?
See the attachment.

Steps to run testing:
1) compile netperf;
2) Change PROG_DIR to path/to/netperf/src;
3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus.
Rick Jones - Jan. 23, 2009, 6:40 p.m.
> 3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus.

Some comments on the script:

> #!/bin/sh
> 
> PROG_DIR=/home/ymzhang/test/netperf/src
> date=`date +%H%M%N`
> #PROG_DIR=/root/netperf/netperf/src
> client_num=$1
> pin_cpu=$2
> 
> start_port_server=12384
> start_port_client=15888
> 
> killall netserver
> ${PROG_DIR}/netserver
> sleep 2

Any particular reason for killing-off the netserver daemon?

> if [ ! -d result ]; then
>         mkdir result
> fi
> 
> all_result_files=""
> for i in `seq 1 ${client_num}`; do
>         if [ "${pin_cpu}" == "pin" ]; then
>                 pin_param="-T ${i} ${i}"

The -T option takes arguments of the form:

N   - bind both netperf and netserver to core N
N,  - bind only netperf to core N, float netserver
  ,M - float netperf, bind only netserver to core M
N,M - bind netperf to core N and netserver to core M

Without a comma between N and M knuth only knows what the command line parser 
will do :)

>         fi
>         result_file=result/netperf_${start_port_client}.${date}
>         #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096
>         #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384 12888 -s 32768 -S 32768 -m 4096
>         #${PROG_DIR}/netperf -p ${port_num} -t TCP_RR -l 60 -H 127.0.0.1 ${pin_param} -- -r 1,1 >${result_file} &
>         ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 ${pin_param} -- -P ${start_port_client} ${start_port_server} -s 32768 -S 32768 -m 4096 >${result_file}  &

Same thing here for the -P option - there needs to be a comma between the two 
port numbers; otherwise, the best case is that the second port number is ignored. 
The worst case is that netperf starts doing knuth only knows what.


To get quick profiles, that form of aggregate netperf is OK - just the one 
iteration with background processes using a moderately long run time.  However, 
for result reporting, it is best to (ab)use the confidence intervals 
functionality to try to avoid skew errors.  I tend to add in a global -i 30 
option to get each netperf to repeat its measurements 30 times.  That way one is 
reasonably confident that skew issues are minimized.

http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance

And I would probably add the -c and -C options to have netperf report service 
demands.


>         sub_pid="${sub_pid} `echo $!`"
>         port_num=$((${port_num}+1))
>         all_result_files="${all_result_files} ${result_file}"
>         start_port_server=$((${start_port_server}+1))
>         start_port_client=$((${start_port_client}+1))
> done;
> 
> wait ${sub_pid}
> killall netserver
> 
> result="0"
> for i in `echo ${all_result_files}`; do
>         sub_result=`awk '/Throughput/ {getline; getline; getline; print " "$6}' ${i}`
>         result=`echo "${result}+${sub_result}"|bc`
> done;

The documented-only-in-source :( "omni" tests in top-of-trunk netperf:

http://www.netperf.org/svn/netperf2/trunk

./configure --enable-omni

allow one to specify which result values one wants, in which order, either as 
more or less traditional netperf output (test-specific -O), CSV (test-specific 
-o) or keyval (test-specific -k).  All three take an optional filename as an 
argument with the file containing a list of desired output values.  You can give 
a "filename" of '?' to get the list of output values known to that version of 
netperf.

Might help simplify parsing and whatnot.

happy benchmarking,

rick jones

> 
> echo $result

> 

Grant Grundler - Jan. 23, 2009, 6:51 p.m.
On Fri, Jan 23, 2009 at 10:40 AM, Rick Jones <rick.jones2@hp.com> wrote:
...
> And I would probably add the -c and -C options to have netperf report
> service demands.

For performance analysis, the service demand is often more interesting
than the absolute performance (which typically only varies a few Mb/s
for gigE NICs). I strongly encourage adding -c and -C.

grant
Zhang, Yanmin - Jan. 24, 2009, 3:03 a.m.
On Fri, 2009-01-23 at 10:40 -0800, Rick Jones wrote:
> > 3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus.
> 
> Some comments on the script:
Thanks. I wanted the test to produce results quickly, as long as
the results don't fluctuate much.

> 
> > #!/bin/sh
> > 
> > PROG_DIR=/home/ymzhang/test/netperf/src
> > date=`date +%H%M%N`
> > #PROG_DIR=/root/netperf/netperf/src
> > client_num=$1
> > pin_cpu=$2
> > 
> > start_port_server=12384
> > start_port_client=15888
> > 
> > killall netserver
> > ${PROG_DIR}/netserver
> > sleep 2
> 
> Any particular reason for killing-off the netserver daemon?
I'm not sure whether an earlier run might affect a later one, so I
just kill netserver.

> 
> > if [ ! -d result ]; then
> >         mkdir result
> > fi
> > 
> > all_result_files=""
> > for i in `seq 1 ${client_num}`; do
> >         if [ "${pin_cpu}" == "pin" ]; then
> >                 pin_param="-T ${i} ${i}"
> 
> The -T option takes arguments of the form:
> 
> N   - bind both netperf and netserver to core N
> N,  - bind only netperf to core N, float netserver
>   ,M - float netperf, bind only netserver to core M
> N,M - bind netperf to core N and netserver to core M
> 
> Without a comma between N and M knuth only knows what the command line parser 
> will do :)
> 
> >         fi
> >         result_file=result/netperf_${start_port_client}.${date}
> >         #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096
> >         #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384 12888 -s 32768 -S 32768 -m 4096
> >         #${PROG_DIR}/netperf -p ${port_num} -t TCP_RR -l 60 -H 127.0.0.1 ${pin_param} -- -r 1,1 >${result_file} &
> >         ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 ${pin_param} -- -P ${start_port_client} ${start_port_server} -s 32768 -S 32768 -m 4096 >${result_file}  &
> 
> Same thing here for the -P option - there needs to be a comma between the two 
> port numbers otherwise, the best case is that the second port number is ignored. 
>   Worst case is that netperf starts doing knuth only knows what.
Thanks.

> 
> 
> To get quick profiles, that form of aggregate netperf is OK - just the one 
> iteration with background processes using a moderately long run time.  However, 
> for result reporting, it is best to (ab)use the confidence intervals 
> functionality to try to avoid skew errors.
Yes. My formal testing uses -i 50. I just wanted a quick test. If I need
finer tuning or further investigation, I turn on more options.

>   I tend to add-in a global -i 30 
> option to get each netperf to repeat its measurements 30 times.  That way one is 
> reasonably confident that skew issues are minimized.
> 
> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance
> 
> And I would probably add the -c and -C options to have netperf report service 
> demands.
Yes. That's good. I usually start vmstat or mpstat to monitor CPU utilization
in real time.

> 
> 
> >         sub_pid="${sub_pid} `echo $!`"
> >         port_num=$((${port_num}+1))
> >         all_result_files="${all_result_files} ${result_file}"
> >         start_port_server=$((${start_port_server}+1))
> >         start_port_client=$((${start_port_client}+1))
> > done;
> > 
> > wait ${sub_pid}
> > killall netserver
> > 
> > result="0"
> > for i in `echo ${all_result_files}`; do
> >         sub_result=`awk '/Throughput/ {getline; getline; getline; print " "$6}' ${i}`
> >         result=`echo "${result}+${sub_result}"|bc`
> > done;
> 
> The documented-only-in-source :( "omni" tests in top-of-trunk netperf:
> 
> http://www.netperf.org/svn/netperf2/trunk
> 
> ./configure --enable-omni
> 
> allow one to specify which result values one wants, in which order, either as 
> more or less traditional netperf output (test-specific -O), CSV (test-specific 
> -o) or keyval (test-specific -k).  All three take an optional filename as an 
> argument with the file containing a list of desired output values.  You can give 
> a "filename" of '?' to get the list of output values known to that version of 
> netperf.
> 
> Might help simplify parsing and whatnot.
Yes, it does.

> 
> happy benchmarking,
> 
> rick jones
Thanks again. I learned a lot.

> 
> > 
> > echo $result
> 
> > 

Rick Jones - Jan. 26, 2009, 6:26 p.m.
>>To get quick profiles, that form of aggregate netperf is OK - just the one 
>>iteration with background processes using a moderately long run time.  However, 
>>for result reporting, it is best to (ab)use the confidence intervals 
>>functionality to try to avoid skew errors.
> 
> Yes. My formal testing uses -i 50. I just wanted a quick testing. If I need
> finer-tuning or investigation, I would turn on more options.

Netperf will silently clip that to 30 as that is all the built-in tables know.

> Thanks again. I learned a lot.

Feel free to wander over to netperf-talk over at netperf.org if you want to talk 
some more about the care and feeding of netperf.

happy benchmarking,

rick jones

Patch

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 2f5c16b..3bd3662 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -124,7 +124,7 @@  struct kmem_cache {
  * We keep the general caches in an array of slab caches that are used for
  * 2^x bytes of allocations.
  */
-extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1];
+extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1];
 
 /*
  * Sorry that the following has to be that ugly but some versions of GCC
@@ -135,6 +135,9 @@  static __always_inline int kmalloc_index(size_t size)
 	if (!size)
 		return 0;
 
+	if (size > KMALLOC_MAX_SIZE)
+		return -1;
+
 	if (size <= KMALLOC_MIN_SIZE)
 		return KMALLOC_SHIFT_LOW;
 
@@ -154,10 +157,6 @@  static __always_inline int kmalloc_index(size_t size)
 	if (size <=       1024) return 10;
 	if (size <=   2 * 1024) return 11;
 	if (size <=   4 * 1024) return 12;
-/*
- * The following is only needed to support architectures with a larger page
- * size than 4k.
- */
 	if (size <=   8 * 1024) return 13;
 	if (size <=  16 * 1024) return 14;
 	if (size <=  32 * 1024) return 15;
@@ -167,6 +166,10 @@  static __always_inline int kmalloc_index(size_t size)
 	if (size <= 512 * 1024) return 19;
 	if (size <= 1024 * 1024) return 20;
 	if (size <=  2 * 1024 * 1024) return 21;
+	if (size <=  4 * 1024 * 1024) return 22;
+	if (size <=  8 * 1024 * 1024) return 23;
+	if (size <= 16 * 1024 * 1024) return 24;
+	if (size <= 32 * 1024 * 1024) return 25;
 	return -1;
 
 /*
@@ -191,6 +194,19 @@  static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
 	if (index == 0)
 		return NULL;
 
+	/*
+	 * This function only gets expanded if __builtin_constant_p(size), so
+	 * testing it here shouldn't be needed.  But some versions of gcc need
+	 * help.
+	 */
+	if (__builtin_constant_p(size) && index < 0) {
+		/*
+		 * Generate a link failure. Would be great if we could
+		 * do something to stop the compile here.
+		 */
+		extern void __kmalloc_size_too_large(void);
+		__kmalloc_size_too_large();
+	}
 	return &kmalloc_caches[index];
 }
 
@@ -204,17 +220,9 @@  static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
 void *__kmalloc(size_t size, gfp_t flags);
 
-static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
-{
-	return (void *)__get_free_pages(flags | __GFP_COMP, get_order(size));
-}
-
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
 	if (__builtin_constant_p(size)) {
-		if (size > PAGE_SIZE)
-			return kmalloc_large(size, flags);
-
 		if (!(flags & SLUB_DMA)) {
 			struct kmem_cache *s = kmalloc_slab(size);
 
diff --git a/mm/slub.c b/mm/slub.c
index 6392ae5..8fad23f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2475,7 +2475,7 @@  EXPORT_SYMBOL(kmem_cache_destroy);
  *		Kmalloc subsystem
  *******************************************************************/
 
-struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1] __cacheline_aligned;
+struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1] __cacheline_aligned;
 EXPORT_SYMBOL(kmalloc_caches);
 
 static int __init setup_slub_min_order(char *str)
@@ -2537,7 +2537,7 @@  panic:
 }
 
 #ifdef CONFIG_ZONE_DMA
-static struct kmem_cache *kmalloc_caches_dma[PAGE_SHIFT + 1];
+static struct kmem_cache *kmalloc_caches_dma[KMALLOC_SHIFT_HIGH + 1];
 
 static void sysfs_add_func(struct work_struct *w)
 {
@@ -2643,8 +2643,12 @@  static struct kmem_cache *get_slab(size_t size, gfp_t flags)
 			return ZERO_SIZE_PTR;
 
 		index = size_index[(size - 1) / 8];
-	} else
+	} else {
+		if (size > KMALLOC_MAX_SIZE)
+			return NULL;
+
 		index = fls(size - 1);
+	}
 
 #ifdef CONFIG_ZONE_DMA
 	if (unlikely((flags & SLUB_DMA)))
@@ -2658,9 +2662,6 @@  void *__kmalloc(size_t size, gfp_t flags)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
-		return kmalloc_large(size, flags);
-
 	s = get_slab(size, flags);
 
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -2670,25 +2671,11 @@  void *__kmalloc(size_t size, gfp_t flags)
 }
 EXPORT_SYMBOL(__kmalloc);
 
-static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
-{
-	struct page *page = alloc_pages_node(node, flags | __GFP_COMP,
-						get_order(size));
-
-	if (page)
-		return page_address(page);
-	else
-		return NULL;
-}
-
 #ifdef CONFIG_NUMA
 void *__kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
-		return kmalloc_large_node(size, flags, node);
-
 	s = get_slab(size, flags);
 
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -2746,11 +2733,8 @@  void kfree(const void *x)
 		return;
 
 	page = virt_to_head_page(x);
-	if (unlikely(!PageSlab(page))) {
-		BUG_ON(!PageCompound(page));
-		put_page(page);
+	if (unlikely(WARN_ON(!PageSlab(page)))) /* XXX */
 		return;
-	}
 	slab_free(page->slab, page, object, _RET_IP_);
 }
 EXPORT_SYMBOL(kfree);
@@ -2985,7 +2969,7 @@  void __init kmem_cache_init(void)
 		caches++;
 	}
 
-	for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++) {
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
 		create_kmalloc_cache(&kmalloc_caches[i],
 			"kmalloc", 1 << i, GFP_KERNEL);
 		caches++;
@@ -3022,7 +3006,7 @@  void __init kmem_cache_init(void)
 	slab_state = UP;
 
 	/* Provide the correct kmalloc names now that the caches are up */
-	for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++)
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++)
 		kmalloc_caches[i]. name =
 			kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
 
@@ -3222,9 +3206,6 @@  void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
-		return kmalloc_large(size, gfpflags);
-
 	s = get_slab(size, gfpflags);
 
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -3238,9 +3219,6 @@  void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
-		return kmalloc_large_node(size, gfpflags, node);
-
 	s = get_slab(size, gfpflags);
 
 	if (unlikely(ZERO_OR_NULL_PTR(s)))