Patchwork slow performance on disk/network i/o full speed after drop_caches

Submitter: Wu Fengguang
Date: Sept. 1, 2011, 4:14 a.m.
Message ID: <20110901041458.GA30123@localhost>
Permalink: /patch/112794/
State: Not Applicable
Delegated to: David Miller

Comments

Wu Fengguang - Sept. 1, 2011, 4:14 a.m.
Hi Stefan,

On Wed, Aug 31, 2011 at 03:11:02PM +0800, Stefan Priebe - Profihost AG wrote:
> Hi Fengguang,
> Hi Yanhai,
> 
> > you're absolutely correct, zone_reclaim_mode is on - but why?
> > There must be some Linux software which switches it on.
> >
> > ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
> > ~#
> >
> > also
> > ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
> > ~#
> >
> > tells us nothing.
> >
> > I've then read this:
> >
> > "zone_reclaim_mode is set during bootup to 1 if it is determined that
> > pages from remote zones will cause a measurable performance reduction.
> > The page allocator will then reclaim easily reusable pages (those page
> > cache pages that are currently not used) before allocating off node pages."
> >
> > Why does the kernel do that here in our case on these machines?
> 
> Can nobody explain why the kernel sets it to 1 in this case?

It's determined by RECLAIM_DISTANCE.

build_zonelists():

                /*
                 * If another node is sufficiently far away then it is better
                 * to reclaim pages in a zone before going off node.
                 */
                if (distance > RECLAIM_DISTANCE)
                        zone_reclaim_mode = 1;
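
As a rough illustration of that check, here is a standalone sketch (not the kernel's code). The helper name row_enables_zone_reclaim() and the sample distances are made up for the example; 10 is the usual local-node distance and 21 is the firmware value mentioned in the commit message quoted further down.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the kernel: returns true if any entry
 * in one node's distance row exceeds the threshold. This mirrors the
 * "distance > RECLAIM_DISTANCE" test quoted from build_zonelists().
 */
bool row_enables_zone_reclaim(const int *distances, size_t n,
                              int reclaim_distance)
{
    for (size_t i = 0; i < n; i++) {
        if (distances[i] > reclaim_distance)
            return true;    /* a node is "far enough" away */
    }
    return false;           /* every node is within the threshold */
}
```

With a row of {10, 21}, the helper returns true for a threshold of 20 and false for a threshold of 30, which is the difference the commit below makes for such machines.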

Since Linux v3.0, RECLAIM_DISTANCE has been raised from 20 to 30 by this commit.
It may well help your case, too.

commit 32e45ff43eaf5c17f5a82c9ad358d515622c2562
Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date:   Wed Jun 15 15:08:20 2011 -0700

    mm: increase RECLAIM_DISTANCE to 30
    
    Recently, Robert Mueller reported (http://lkml.org/lkml/2010/9/12/236)
    that zone_reclaim_mode doesn't work properly on his new NUMA server (Dual
    Xeon E5520 + Intel S5520UR MB).  He is using Cyrus IMAPd and it's built on
    a very traditional single-process model.
    
      * a master process which reads config files and manages the other
        processes
      * multiple imapd processes, one per connection
      * multiple pop3d processes, one per connection
      * multiple lmtpd processes, one per connection
      * periodical "cleanup" processes.
    
    There are thousands of independent processes.  The problem is that recent
    Intel motherboards turn on zone_reclaim_mode by default, and traditional
    prefork-model software doesn't work well with it.  Unfortunately, such
    models are still typical even in the 21st century.  We can't ignore them.
    
    This patch raises the zone_reclaim_mode threshold to 30.  30 doesn't have
    any specific meaning, but 20 corresponds to one-hop QPI/HyperTransport, and
    such relatively cheap 2-4 socket machines are often used for traditional
    servers like the one above.  The intention is that these machines don't
    use zone_reclaim_mode.
    
    Note: ia64 and Power have arch specific RECLAIM_DISTANCE definitions.
    This patch doesn't change such high-end NUMA machine behavior.
    
    Dave Hansen said:
    
    : I know specifically of pieces of x86 hardware that set the information
    : in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
    : behavior which that implies.
    :
    : They've done performance testing and run very large and scary benchmarks
    : to make sure that they _want_ this turned on.  What this means for them
    : is that they'll probably be de-optimized, at least on newer versions of
    : the kernel.
    :
    : If you want to do this for particular systems, maybe _that_'s what we
    : should do.  Have a list of specific configurations that need the
    : defaults overridden either because they're buggy, or they have an
    : unusual hardware configuration not really reflected in the distance
    : table.

    And later said:
    
    : The original change in the hardware tables was for the benefit of a
    : benchmark.  Said benchmark isn't going to get run on mainline until the
    : next batch of enterprise distros drops, at which point the hardware where
    : this was done will be irrelevant for the benchmark.  I'm sure any new
    : hardware will just set this distance to another yet arbitrary value to
    : make the kernel do what it wants.  :)
    :
    : Also, when the hardware got _set_ to this initially, I complained.  So, I
    : guess I'm getting my way now, with this patch.  I'm cool with it.
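
To see which distances the firmware reports on a given machine, one can read a node's distance row from sysfs (for example /sys/devices/system/node/node0/distance) and look at the largest value. The following is only a sketch under that assumption; max_node_distance() is a hypothetical helper, and it takes the row as a string so it can be tried without NUMA hardware.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a whitespace-separated distance row such as
 * "10 21\n" and return the largest value. Parsing stops at the first token
 * that is not a number. Returns -1 if no number was found or a value is
 * out of range.
 */
int max_node_distance(const char *row)
{
    int max = -1;
    const char *p = row;

    for (;;) {
        char *end;
        long v;

        errno = 0;
        v = strtol(p, &end, 10);
        if (end == p)
            break;                      /* no further number in the row */
        if (errno != 0 || v < 0 || v > INT_MAX)
            return -1;                  /* overflow or negative distance */
        if (v > max)
            max = (int)v;
        p = end;
    }
    return max;
}
```

A row of "10 21" yields 21, which is above the old threshold of 20 but not above the new threshold of 30.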


Thanks,
Fengguang

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stefan Priebe - Profihost AG - Sept. 1, 2011, 5:41 a.m.
Thanks!

On 01.09.2011 06:14, Wu Fengguang wrote:
> It's determined by RECLAIM_DISTANCE.
> [...]
Mel Gorman - Sept. 1, 2011, 12:57 p.m.
On Thu, Sep 01, 2011 at 12:14:58PM +0800, Wu Fengguang wrote:
> Hi Stefan,
> 
> On Wed, Aug 31, 2011 at 03:11:02PM +0800, Stefan Priebe - Profihost AG wrote:
> > Hi Fengguang,
> > Hi Yanhai,
> > 
> > > you're absolutely correct, zone_reclaim_mode is on - but why?
> > > There must be some Linux software which switches it on.
> > >
> > > ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
> > > ~#
> > >
> > > also
> > > ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
> > > ~#
> > >
> > > tells us nothing.
> > >
> > > I've then read this:
> > >
> > > "zone_reclaim_mode is set during bootup to 1 if it is determined that
> > > pages from remote zones will cause a measurable performance reduction.
> > > The page allocator will then reclaim easily reusable pages (those page
> > > cache pages that are currently not used) before allocating off node pages."
> > >
> > > Why does the kernel do that here in our case on these machines?
> > 
> > Can nobody explain why the kernel sets it to 1 in this case?
> 
> It's determined by RECLAIM_DISTANCE.
> 
> build_zonelists():
> 
>                 /*
>                  * If another node is sufficiently far away then it is better
>                  * to reclaim pages in a zone before going off node.
>                  */
>                 if (distance > RECLAIM_DISTANCE)
>                         zone_reclaim_mode = 1;
> 
> Since Linux v3.0, RECLAIM_DISTANCE has been raised from 20 to 30 by this commit.
> It may well help your case, too.
> 

Even with that change, zone_reclaim() is known to be a disaster when
it runs into problems. This should be fixed in 3.1 by the following
commits:

[cd38b115 mm: page allocator: initialise ZLC for first zone eligible for zone_reclaim]
[76d3fbf8 mm: page allocator: reconsider zones for allocation after direct reclaim]

The description in cd38b115 has the interesting details.
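
A quick way to tell whether a machine is likely to have those fixes is to compare the running kernel's release string against 3.1. This is a rough heuristic only, since distribution kernels may backport fixes without changing the base version. release_at_least() is a hypothetical helper written for this example.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns true if a release string such as
 * "3.0.4-amd64" is at least major.minor. Only the first two numeric
 * components are compared; unparsable strings yield false.
 */
bool release_at_least(const char *release, int major, int minor)
{
    int maj = 0, min = 0;

    if (sscanf(release, "%d.%d", &maj, &min) != 2)
        return false;
    return maj > major || (maj == major && min >= minor);
}
```

The release string can be obtained from the `release` field filled in by uname(2), declared in <sys/utsname.h>, and passed as release_at_least(u.release, 3, 1).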

Patch

diff --git a/include/linux/topology.h b/include/linux/topology.h
index b91a40e..fc839bf 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -60,7 +60,7 @@  int arch_update_cpu_topology(void);
  * (in whatever arch specific measurement units returned by node_distance())
  * then switch on zone reclaim on boot.
  */
-#define RECLAIM_DISTANCE 20
+#define RECLAIM_DISTANCE 30
 #endif
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS     (1)
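
The #endif in the context lines closes an #ifndef guard around the define, which is why architectures with their own RECLAIM_DISTANCE (ia64 and Power, per the commit message) are unaffected by this patch. A minimal standalone sketch of that pattern follows; the commented-out override value of 15 and the reclaim_distance() accessor are made up for illustration.

```c
/*
 * An architecture header would be included first and may define its own
 * threshold. The value 15 is invented for this example.
 */
/* #define RECLAIM_DISTANCE 15 */

/* Generic fallback, used only when no architecture override exists. */
#ifndef RECLAIM_DISTANCE
#define RECLAIM_DISTANCE 30
#endif

/* Hypothetical accessor so the effective value can be inspected. */
int reclaim_distance(void)
{
    return RECLAIM_DISTANCE;
}
```

As written, reclaim_distance() returns the generic default of 30. Uncommenting the override makes the guard skip the fallback, so the architecture's value wins.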