From patchwork Tue Jul 12 14:03:11 2011 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Subject: Natty SRU: Fix kswapd 100% CPU utilization From: Tim Gardner X-Patchwork-Id: 104374 Message-Id: <4E1C541F.4040406@canonical.com> To: Kernel team list Date: Tue, 12 Jul 2011 08:03:11 -0600 The following series was submitted to stable for 2.6.38.8 by Mel Gorman, but NACK'd by Greg K-H as that stable kernel series is no longer maintained (but accepted it for 2.6.39). I'd like to suggest that we need this patch series for low resource Sandybridge based platforms. http://bugs.launchpad.net/bugs/808509 rtg Acked-by: Stefan Bader Acked-by: Leann Ogasawara diff --git a/mm/vmscan.c b/mm/vmscan.c index a74bf72..a578535 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2261,7 +2261,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining, return true; /* Check the watermark levels */ - for (i = 0; i < pgdat->nr_zones; i++) { + for (i = 0; i <= classzone_idx; i++) { struct zone *zone = pgdat->node_zones + i; if (!populated_zone(zone)) -- 1.7.0.4 >From c774fd4d541e2b930ca21382755d1084f39f5587 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 22 Mar 2011 16:33:04 -0700 Subject: [PATCH 2/5] mm: vmscan: kswapd should not free an excessive number of pages when balancing small zones BugLink: http://bugs.launchpad.net/bugs/808509 When reclaiming for order-0 pages, kswapd requires that all zones be balanced. Each cycle through balance_pgdat() does background ageing on all zones if necessary and applies equal pressure on the inactive zone unless a lot of pages are free already. A "lot of free pages" is defined as a "balance gap" above the high watermark which is currently 7*high_watermark. Historically this was reasonable as min_free_kbytes was small. However, on systems using huge pages, it is recommended that min_free_kbytes is higher and it is tuned with hugeadm --set-recommended-min_free_kbytes. With the introduction of transparent huge page support, this recommended value is also applied. On X86-64 with 4G of memory, min_free_kbytes becomes 67584 so one would expect around 68M of memory to be free. The Normal zone is approximately 35000 pages so under even normal memory pressure such as copying a large file, it gets exhausted quickly. As it is getting exhausted, kswapd applies pressure equally to all zones, including the DMA32 zone. DMA32 is approximately 700,000 pages with a high watermark of around 23,000 pages. In this situation, kswapd will reclaim around (23000*8 where 8 is the high watermark + balance gap of 7 * high watermark) pages or 718M of pages before the zone is ignored. What the user sees is that free memory far higher than it should be. To avoid an excessive number of pages being reclaimed from the larger zones, explicitely defines the "balance gap" to be either 1% of the zone or the low watermark for the zone, whichever is smaller. While kswapd will check all zones to apply pressure, it'll ignore zones that meets the (high_wmark + balance_gap) watermark. To test this, 80G were copied from a partition and the amount of memory being used was recorded. A comparison of a patch and unpatched kernel can be seen at http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps and shows that kswapd is not reclaiming as much memory with the patch applied. Signed-off-by: Andrea Arcangeli Signed-off-by: Mel Gorman Acked-by: Rik van Riel Cc: Shaohua Li Cc: "Chen, Tim C" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds (cherry picked from commit 8afdcece4911e51cfff2b50a269418914cab8a3f) Signed-off-by: Tim Gardner --- include/linux/swap.h | 9 +++++++++ mm/vmscan.c | 16 +++++++++++++--- 2 files changed, 22 insertions(+), 3 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 97a5514..3b454c0 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -155,6 +155,15 @@ enum { #define SWAP_CLUSTER_MAX 32 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX +/* + * Ratio between the present memory in the zone and the "gap" that + * we're allowing kswapd to shrink in addition to the per-zone high + * wmark, even for zones that already have the high wmark satisfied, + * in order to provide better per-zone lru behavior. We are ok to + * spend not more than 1% of the memory for this zone balancing "gap". + */ +#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100 + #define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */ #define SWAP_MAP_BAD 0x3f /* Note pageblock is bad, in first swap_map */ #define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */ diff --git a/mm/vmscan.c b/mm/vmscan.c index a578535..f50f716 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2406,6 +2406,7 @@ loop_again: for (i = 0; i <= end_zone; i++) { struct zone *zone = pgdat->node_zones + i; int nr_slab; + unsigned long balance_gap; if (!populated_zone(zone)) continue; @@ -2422,11 +2423,20 @@ loop_again: mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask); /* - * We put equal pressure on every zone, unless one - * zone has way too many pages free already. + * We put equal pressure on every zone, unless + * one zone has way too many pages free + * already. The "too many pages" is defined + * as the high wmark plus a "gap" where the + * gap is either the low watermark or 1% + * of the zone, whichever is smaller. */ + balance_gap = min(low_wmark_pages(zone), + (zone->present_pages + + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) / + KSWAPD_ZONE_BALANCE_GAP_RATIO); if (!zone_watermark_ok_safe(zone, order, - 8*high_wmark_pages(zone), end_zone, 0)) + high_wmark_pages(zone) + balance_gap, + end_zone, 0)) shrink_zone(priority, zone, &sc); reclaim_state->reclaimed_slab = 0; nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, -- 1.7.0.4 >From c06f51cab98ee6861846f816d334d439f384933a Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 11 Jul 2011 10:21:15 +0100 Subject: [PATCH 3/5] mm: vmscan: do not apply pressure to slab if we are not applying pressure to zone MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit BugLink: http://bugs.launchpad.net/bugs/808509 commit d7868dae893c83c50c7824bc2bc75f93d114669f During allocator-intensive workloads, kswapd will be woken frequently causing free memory to oscillate between the high and min watermark. This is expected behaviour. When kswapd applies pressure to zones during node balancing, it checks if the zone is above a high+balance_gap threshold. If it is, it does not apply pressure but it unconditionally shrinks slab on a global basis which is excessive. In the event kswapd is being kept awake due to a high small unreclaimable zone, it skips zone shrinking but still calls shrink_slab(). Once pressure has been applied, the check for zone being unreclaimable is being made before the check is made if all_unreclaimable should be set. This miss of unreclaimable can cause has_under_min_watermark_zone to be set due to an unreclaimable zone preventing kswapd backing off on congestion_wait(). Signed-off-by: Mel Gorman Reported-by: Pádraig Brady Tested-by: Pádraig Brady Tested-by: Andrew Lutomirski Acked-by: Rik van Riel Reviewed-by: Minchan Kim Reviewed-by: KOSAKI Motohiro Cc: Johannes Weiner Cc: Signed-off-by: Andrew Morton Signed-off-by: Tim Gardner --- mm/vmscan.c | 25 ++++++++++++++----------- 1 files changed, 14 insertions(+), 11 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index f50f716..7f8e890 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2436,19 +2436,19 @@ loop_again: KSWAPD_ZONE_BALANCE_GAP_RATIO); if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) + balance_gap, - end_zone, 0)) + end_zone, 0)) { shrink_zone(priority, zone, &sc); - reclaim_state->reclaimed_slab = 0; - nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, - lru_pages); - sc.nr_reclaimed += reclaim_state->reclaimed_slab; - total_scanned += sc.nr_scanned; - if (zone->all_unreclaimable) - continue; - if (nr_slab == 0 && - !zone_reclaimable(zone)) - zone->all_unreclaimable = 1; + reclaim_state->reclaimed_slab = 0; + nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, + lru_pages); + sc.nr_reclaimed += reclaim_state->reclaimed_slab; + total_scanned += sc.nr_scanned; + + if (nr_slab == 0 && !zone_reclaimable(zone)) + zone->all_unreclaimable = 1; + } + /* * If we've done a decent amount of scanning and * the reclaim ratio is low, start doing writepage @@ -2458,6 +2458,9 @@ loop_again: total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2) sc.may_writepage = 1; + if (zone->all_unreclaimable) + continue; + if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), end_zone, 0)) { all_zones_ok = 0; -- 1.7.0.4 >From 3b9266cd8d60fb291cee6f8b8bbc6ff282f6dc20 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Fri, 8 Jul 2011 15:39:39 -0700 Subject: [PATCH 4/5] mm: vmscan: evaluate the watermarks against the correct classzone MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit BugLink: http://bugs.launchpad.net/bugs/808509 When deciding if kswapd is sleeping prematurely, the classzone is taken into account but this is different to what balance_pgdat() and the allocator are doing. Specifically, the DMA zone will be checked based on the classzone used when waking kswapd which could be for a GFP_KERNEL or GFP_HIGHMEM request. The lowmem reserve limit kicks in, the watermark is not met and kswapd thinks it's sleeping prematurely keeping kswapd awake in error. Signed-off-by: Mel Gorman Reported-by: Pádraig Brady Tested-by: Pádraig Brady Tested-by: Andrew Lutomirski Acked-by: Rik van Riel Reviewed-by: Minchan Kim Cc: KOSAKI Motohiro Cc: Johannes Weiner Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds (cherry picked from commit da175d06b437093f93109ba9e5efbe44dfdf9409) Signed-off-by: Tim Gardner --- mm/vmscan.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 7f8e890..136dc5e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2279,7 +2279,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining, } if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), - classzone_idx, 0)) + i, 0)) all_zones_ok = false; else balanced += zone->present_pages; -- 1.7.0.4 >From 35b09ae93713a435cd5abe3ea54bce8ad70d4345 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Fri, 8 Jul 2011 15:39:40 -0700 Subject: [PATCH 5/5] mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit BugLink: http://bugs.launchpad.net/bugs/808509 During allocator-intensive workloads, kswapd will be woken frequently causing free memory to oscillate between the high and min watermark. This is expected behaviour. Unfortunately, if the highest zone is small, a problem occurs. When balance_pgdat() returns, it may be at a lower classzone_idx than it started because the highest zone was unreclaimable. Before checking if it should go to sleep though, it checks pgdat->classzone_idx which when there is no other activity will be MAX_NR_ZONES-1. It interprets this as it has been woken up while reclaiming, skips scheduling and reclaims again. As there is no useful reclaim work to do, it enters into a loop of shrinking slab consuming loads of CPU until the highest zone becomes reclaimable for a long period of time. There are two problems here. 1) If the returned classzone or order is lower, it'll continue reclaiming without scheduling. 2) if the highest zone was marked unreclaimable but balance_pgdat() returns immediately at DEF_PRIORITY, the new lower classzone is not communicated back to kswapd() for sleeping. This patch does two things that are related. If the end_zone is unreclaimable, this information is communicated back. Second, if the classzone or order was reduced due to failing to reclaim, new information is not read from pgdat and instead an attempt is made to go to sleep. Due to this, it is also necessary that pgdat->classzone_idx be initialised each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as wakeups. Signed-off-by: Mel Gorman Reported-by: Pádraig Brady Tested-by: Pádraig Brady Tested-by: Andrew Lutomirski Acked-by: Rik van Riel Cc: Minchan Kim Cc: KOSAKI Motohiro Cc: Johannes Weiner Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds (cherry picked from commit 215ddd6664ced067afca7eebd2d1eb83f064ff5a) Signed-off-by: Tim Gardner --- mm/vmscan.c | 34 +++++++++++++++++++++------------- 1 files changed, 21 insertions(+), 13 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 136dc5e..4b8b37c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2381,7 +2381,6 @@ loop_again: if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), 0, 0)) { end_zone = i; - *classzone_idx = i; break; } } @@ -2458,8 +2457,11 @@ loop_again: total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2) sc.may_writepage = 1; - if (zone->all_unreclaimable) + if (zone->all_unreclaimable) { + if (end_zone && end_zone == i) + end_zone--; continue; + } if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), end_zone, 0)) { @@ -2639,8 +2641,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx) */ static int kswapd(void *p) { - unsigned long order; - int classzone_idx; + unsigned long order, new_order; + int classzone_idx, new_classzone_idx; pg_data_t *pgdat = (pg_data_t*)p; struct task_struct *tsk = current; @@ -2670,17 +2672,23 @@ static int kswapd(void *p) tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; set_freezable(); - order = 0; - classzone_idx = MAX_NR_ZONES - 1; + order = new_order = 0; + classzone_idx = new_classzone_idx = pgdat->nr_zones - 1; for ( ; ; ) { - unsigned long new_order; - int new_classzone_idx; int ret; - new_order = pgdat->kswapd_max_order; - new_classzone_idx = pgdat->classzone_idx; - pgdat->kswapd_max_order = 0; - pgdat->classzone_idx = MAX_NR_ZONES - 1; + /* + * If the last balance_pgdat was unsuccessful it's unlikely a + * new request of a similar or harder type will succeed soon + * so consider going to sleep on the basis we reclaimed at + */ + if (classzone_idx >= new_classzone_idx && order == new_order) { + new_order = pgdat->kswapd_max_order; + new_classzone_idx = pgdat->classzone_idx; + pgdat->kswapd_max_order = 0; + pgdat->classzone_idx = pgdat->nr_zones - 1; + } + if (order < new_order || classzone_idx > new_classzone_idx) { /* * Don't sleep if someone wants a larger 'order' @@ -2693,7 +2701,7 @@ static int kswapd(void *p) order = pgdat->kswapd_max_order; classzone_idx = pgdat->classzone_idx; pgdat->kswapd_max_order = 0; - pgdat->classzone_idx = MAX_NR_ZONES - 1; + pgdat->classzone_idx = pgdat->nr_zones - 1; } ret = try_to_freeze();