Patchwork [4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep

Submitter Mel Gorman
Date May 13, 2011, 2:03 p.m.
Message ID <1305295404-12129-5-git-send-email-mgorman@suse.de>
Permalink /patch/95487/
State Not Applicable

Comments

Mel Gorman - May 13, 2011, 2:03 p.m.
Under constant allocation pressure, kswapd can be in the situation where
sleeping_prematurely() will always return true even if kswapd has been
running a long time. Check if kswapd needs to be scheduled.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)
KOSAKI Motohiro - May 15, 2011, 10:27 a.m.
(2011/05/13 23:03), Mel Gorman wrote:
> Under constant allocation pressure, kswapd can be in the situation where
> sleeping_prematurely() will always return true even if kswapd has been
> running a long time. Check if kswapd needs to be scheduled.
> 
> Signed-off-by: Mel Gorman<mgorman@suse.de>
> ---
>   mm/vmscan.c |    4 ++++
>   1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index af24d1e..4d24828 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>   	unsigned long balanced = 0;
>   	bool all_zones_ok = true;
> 
> +	/* If kswapd has been running too long, just sleep */
> +	if (need_resched())
> +		return false;
> +

Hmm... I don't like this patch very much, because with it kswapd

- doesn't sleep if it got context-switched in shrink_inactive_list()
- sleeps if it didn't

That seems like semi-random behavior.



--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Bottomley - May 16, 2011, 4:21 a.m.
On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
> (2011/05/13 23:03), Mel Gorman wrote:
> > Under constant allocation pressure, kswapd can be in the situation where
> > sleeping_prematurely() will always return true even if kswapd has been
> > running a long time. Check if kswapd needs to be scheduled.
> > 
> > Signed-off-by: Mel Gorman<mgorman@suse.de>
> > ---
> >   mm/vmscan.c |    4 ++++
> >   1 files changed, 4 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index af24d1e..4d24828 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >   	unsigned long balanced = 0;
> >   	bool all_zones_ok = true;
> > 
> > +	/* If kswapd has been running too long, just sleep */
> > +	if (need_resched())
> > +		return false;
> > +
> 
> Hmm... I don't like this patch so much. because this code does
> 
> - don't sleep if kswapd got context switch at shrink_inactive_list

This isn't entirely true:  need_resched() will be false, so we'll follow
the normal path for determining whether to sleep or not, in effect
leaving the current behaviour unchanged.

> - sleep if kswapd didn't

This also isn't entirely true: whether need_resched() is true at this
point depends on a whole lot more than whether we did a context switch
in shrink_inactive. It mostly depends on how long we've been running
without giving up the CPU.  Generally that will mean we've been round
the shrinker loop hundreds to thousands of times without sleeping.

> It seems to be semi random behavior.

Well, we have to do something.  Chris Mason first suspected the hang was
a kswapd rescheduling problem a while ago.  We tried putting
cond_rescheds() in several places in the vmscan code, but to no avail.
The need_resched() in sleeping_prematurely() seems to be about the best
option.  The other option might be just to put a cond_resched() in
kswapd_try_to_sleep(), but that will really have about the same effect.
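
To make the two primitives discussed above concrete, here is a minimal
userspace sketch of how they differ. It is a model under stated
assumptions, not kernel code: the flag is an ordinary bool passed by
pointer, and the names model_need_resched() and model_cond_resched() are
hypothetical stand-ins. The behaviour it illustrates is that
need_resched() only tests the flag, while cond_resched() yields (and
thereby clears the flag) when the flag is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for need_resched(): only tests the flag. */
bool model_need_resched(const bool *resched_flag)
{
	assert(resched_flag);
	return *resched_flag;
}

/*
 * Hypothetical stand-in for cond_resched(): if a reschedule is pending,
 * "yield" (modelled here as clearing the flag) and report that we did.
 */
bool model_cond_resched(bool *resched_flag)
{
	assert(resched_flag);
	if (!*resched_flag)
		return false;
	*resched_flag = false;	/* the scheduler ran; the request is satisfied */
	return true;
}
```

In this model, testing the flag in sleeping_prematurely() leaves the
request pending, so kswapd goes on to sleep, whereas a cond_resched() in
kswapd_try_to_sleep() would consume the request at that point.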

James


MinChan Kim - May 16, 2011, 5:04 a.m.
On Mon, May 16, 2011 at 1:21 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
>> (2011/05/13 23:03), Mel Gorman wrote:
>> > Under constant allocation pressure, kswapd can be in the situation where
>> > sleeping_prematurely() will always return true even if kswapd has been
>> > running a long time. Check if kswapd needs to be scheduled.
>> >
>> > Signed-off-by: Mel Gorman<mgorman@suse.de>
>> > ---
>> >   mm/vmscan.c |    4 ++++
>> >   1 files changed, 4 insertions(+), 0 deletions(-)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index af24d1e..4d24828 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>> >     unsigned long balanced = 0;
>> >     bool all_zones_ok = true;
>> >
>> > +   /* If kswapd has been running too long, just sleep */
>> > +   if (need_resched())
>> > +           return false;
>> > +
>>
>> Hmm... I don't like this patch so much. because this code does
>>
>> - don't sleep if kswapd got context switch at shrink_inactive_list
>
> This isn't entirely true:  need_resched() will be false, so we'll follow
> the normal path for determining whether to sleep or not, in effect
> leaving the current behaviour unchanged.
>
>> - sleep if kswapd didn't
>
> This also isn't entirely true: whether need_resched() is true at this
> point depends on a whole lot more that whether we did a context switch
> in shrink_inactive. It mostly depends on how long we've been running
> without giving up the CPU.  Generally that will mean we've been round
> the shrinker loop hundreds to thousands of times without sleeping.
>
>> It seems to be semi random behavior.
>
> Well, we have to do something.  Chris Mason first suspected the hang was
> a kswapd rescheduling problem a while ago.  We tried putting
> cond_rescheds() in several places in the vmscan code, but to no avail.

Is that the result of a test with Hannes' patch applied (i.e., !pgdat_balanced)?

If it isn't, adding cond_resched() calls to vmscan.c would be a no-op.
Although zone balancing completes, kswapd doesn't sleep because
pgdat_balanced returns the wrong result, so the VM ends up calling
balance_pgdat() again. In that case balance_pgdat() returns without doing
any work, because kswapd can't find any zone that is short of free pages
and jumps to out. kswapd could repeat this indefinitely, so there is
never a chance to call cond_resched().

But if your test did include Hannes' patch, I am very curious why
kswapd consumes so much CPU.

> The need_resched() in sleeping_prematurely() seems to be about the best
> option.  The other option might be just to put a cond_resched() in
> kswapd_try_to_sleep(), but that will really have about the same effect.

I don't oppose it, but before that I think we need to understand why
kswapd consumes so much CPU even with Hannes' patch applied.

>
> James
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
>
Mel Gorman - May 16, 2011, 8:45 a.m.
On Sun, May 15, 2011 at 07:27:12PM +0900, KOSAKI Motohiro wrote:
> (2011/05/13 23:03), Mel Gorman wrote:
> > Under constant allocation pressure, kswapd can be in the situation where
> > sleeping_prematurely() will always return true even if kswapd has been
> > running a long time. Check if kswapd needs to be scheduled.
> > 
> > Signed-off-by: Mel Gorman<mgorman@suse.de>
> > ---
> >   mm/vmscan.c |    4 ++++
> >   1 files changed, 4 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index af24d1e..4d24828 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >   	unsigned long balanced = 0;
> >   	bool all_zones_ok = true;
> > 
> > +	/* If kswapd has been running too long, just sleep */
> > +	if (need_resched())
> > +		return false;
> > +
> 
> Hmm... I don't like this patch so much. because this code does
> 
> - don't sleep if kswapd got context switch at shrink_inactive_list
> - sleep if kswapd didn't
> 
> It seems to be semi random behavior.
> 

It's possible to keep kswapd awake simply by allocating fast enough that
the watermarks are never balanced making kswapd appear to consume 100%
of CPU. This check causes kswapd to sleep in this case. The processes
doing the allocations will enter direct reclaim and probably stall while
processes that are not allocating will get some CPU time.
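
A rough userspace sketch of the situation described above, assuming a
deliberately simplified model: one reclaim pass frees a fixed number of
pages, allocators consume a fixed number per pass, and need_resched()
becomes true after a fixed number of passes. All names and parameters
are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns the pass on which the modelled kswapd goes to sleep, or -1 if
 * it is still running after max_passes. passes_per_timeslice == 0
 * models a kernel without the need_resched() check.
 */
int passes_until_sleep(long free_pages, long high_wmark,
		       long reclaim_per_pass, long alloc_per_pass,
		       int passes_per_timeslice, int max_passes)
{
	assert(max_passes > 0);

	for (int pass = 1; pass <= max_passes; pass++) {
		/* kswapd frees pages while allocators take them away */
		free_pages += reclaim_per_pass - alloc_per_pass;
		if (free_pages < 0)
			free_pages = 0;

		/* with the patch: running too long means "just sleep" */
		if (passes_per_timeslice > 0 && pass >= passes_per_timeslice)
			return pass;

		/* without pressure the watermark is eventually met */
		if (free_pages >= high_wmark)
			return pass;
	}
	return -1;	/* never balanced, never slept */
}
```

With allocation matching reclaim (for example 32 pages each per pass)
and no timeslice check, the model never sleeps; with the check it sleeps
once the modelled timeslice is used up.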
Mel Gorman - May 16, 2011, 8:45 a.m.
On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
> >> (2011/05/13 23:03), Mel Gorman wrote:
> >> > Under constant allocation pressure, kswapd can be in the situation where
> >> > sleeping_prematurely() will always return true even if kswapd has been
> >> > running a long time. Check if kswapd needs to be scheduled.
> >> >
> >> > Signed-off-by: Mel Gorman<mgorman@suse.de>
> >> > ---
> >> >   mm/vmscan.c |    4 ++++
> >> >   1 files changed, 4 insertions(+), 0 deletions(-)
> >> >
> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > index af24d1e..4d24828 100644
> >> > --- a/mm/vmscan.c
> >> > +++ b/mm/vmscan.c
> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >> >     unsigned long balanced = 0;
> >> >     bool all_zones_ok = true;
> >> >
> >> > +   /* If kswapd has been running too long, just sleep */
> >> > +   if (need_resched())
> >> > +           return false;
> >> > +
> >>
> >> Hmm... I don't like this patch so much. because this code does
> >>
> >> - don't sleep if kswapd got context switch at shrink_inactive_list
> >
> > This isn't entirely true:  need_resched() will be false, so we'll follow
> > the normal path for determining whether to sleep or not, in effect
> > leaving the current behaviour unchanged.
> >
> >> - sleep if kswapd didn't
> >
> > This also isn't entirely true: whether need_resched() is true at this
> > point depends on a whole lot more that whether we did a context switch
> > in shrink_inactive. It mostly depends on how long we've been running
> > without giving up the CPU.  Generally that will mean we've been round
> > the shrinker loop hundreds to thousands of times without sleeping.
> >
> >> It seems to be semi random behavior.
> >
> > Well, we have to do something.  Chris Mason first suspected the hang was
> > a kswapd rescheduling problem a while ago.  We tried putting
> > cond_rescheds() in several places in the vmscan code, but to no avail.
> 
> Is it a result of  test with patch of Hannes(ie, !pgdat_balanced)?
> 
> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c.
> Because, although we complete zone balancing, kswapd doesn't sleep as
> pgdat_balance returns wrong result. And at last VM calls
> balance_pgdat. In this case, balance_pgdat returns without any work as
> kswap couldn't find zones which have not enough free pages and goto
> out. kswapd could repeat this work infinitely. So you don't have a
> chance to call cond_resched.
> 
> But if your test was with Hanne's patch, I am very curious how come
> kswapd consumes CPU a lot.
> 
> > The need_resched() in sleeping_prematurely() seems to be about the best
> > option.  The other option might be just to put a cond_resched() in
> > kswapd_try_to_sleep(), but that will really have about the same effect.
> 
> I don't oppose it but before that, I think we have to know why kswapd
> consumes CPU a lot although we applied Hannes' patch.
> 

Because it's still possible for processes to allocate pages at the same
rate kswapd is freeing them leading to a situation where kswapd does not
consider the zone balanced for prolonged periods of time.
MinChan Kim - May 16, 2011, 8:58 a.m.
On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
>> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
>> <James.Bottomley@hansenpartnership.com> wrote:
>> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
>> >> (2011/05/13 23:03), Mel Gorman wrote:
>> >> > Under constant allocation pressure, kswapd can be in the situation where
>> >> > sleeping_prematurely() will always return true even if kswapd has been
>> >> > running a long time. Check if kswapd needs to be scheduled.
>> >> >
>> >> > Signed-off-by: Mel Gorman<mgorman@suse.de>
>> >> > ---
>> >> >   mm/vmscan.c |    4 ++++
>> >> >   1 files changed, 4 insertions(+), 0 deletions(-)
>> >> >
>> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >> > index af24d1e..4d24828 100644
>> >> > --- a/mm/vmscan.c
>> >> > +++ b/mm/vmscan.c
>> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>> >> >     unsigned long balanced = 0;
>> >> >     bool all_zones_ok = true;
>> >> >
>> >> > +   /* If kswapd has been running too long, just sleep */
>> >> > +   if (need_resched())
>> >> > +           return false;
>> >> > +
>> >>
>> >> Hmm... I don't like this patch so much. because this code does
>> >>
>> >> - don't sleep if kswapd got context switch at shrink_inactive_list
>> >
>> > This isn't entirely true:  need_resched() will be false, so we'll follow
>> > the normal path for determining whether to sleep or not, in effect
>> > leaving the current behaviour unchanged.
>> >
>> >> - sleep if kswapd didn't
>> >
>> > This also isn't entirely true: whether need_resched() is true at this
>> > point depends on a whole lot more that whether we did a context switch
>> > in shrink_inactive. It mostly depends on how long we've been running
>> > without giving up the CPU.  Generally that will mean we've been round
>> > the shrinker loop hundreds to thousands of times without sleeping.
>> >
>> >> It seems to be semi random behavior.
>> >
>> > Well, we have to do something.  Chris Mason first suspected the hang was
>> > a kswapd rescheduling problem a while ago.  We tried putting
>> > cond_rescheds() in several places in the vmscan code, but to no avail.
>>
>> Is it a result of  test with patch of Hannes(ie, !pgdat_balanced)?
>>
>> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c.
>> Because, although we complete zone balancing, kswapd doesn't sleep as
>> pgdat_balance returns wrong result. And at last VM calls
>> balance_pgdat. In this case, balance_pgdat returns without any work as
>> kswap couldn't find zones which have not enough free pages and goto
>> out. kswapd could repeat this work infinitely. So you don't have a
>> chance to call cond_resched.
>>
>> But if your test was with Hanne's patch, I am very curious how come
>> kswapd consumes CPU a lot.
>>
>> > The need_resched() in sleeping_prematurely() seems to be about the best
>> > option.  The other option might be just to put a cond_resched() in
>> > kswapd_try_to_sleep(), but that will really have about the same effect.
>>
>> I don't oppose it but before that, I think we have to know why kswapd
>> consumes CPU a lot although we applied Hannes' patch.
>>
>
> Because it's still possible for processes to allocate pages at the same
> rate kswapd is freeing them leading to a situation where kswapd does not
> consider the zone balanced for prolonged periods of time.

We have cond_resched() in shrink_page_list(), shrink_slab() and
balance_pgdat(). So I think kswapd can be scheduled out, although it is
scheduled back in after a short time because the task that ran in its
place also needs page reclaim. If every task in the system needs
reclaim, kswapd consuming 99% of the CPU is a natural result, I think.
Am I missing something?

>
> --
> Mel Gorman
> SUSE Labs
>
Mel Gorman - May 16, 2011, 10:27 a.m.
On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote:
> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote:
> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
> >> <James.Bottomley@hansenpartnership.com> wrote:
> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
> >> >> (2011/05/13 23:03), Mel Gorman wrote:
> >> >> > Under constant allocation pressure, kswapd can be in the situation where
> >> >> > sleeping_prematurely() will always return true even if kswapd has been
> >> >> > running a long time. Check if kswapd needs to be scheduled.
> >> >> >
> >> >> > Signed-off-by: Mel Gorman<mgorman@suse.de>
> >> >> > ---
> >> >> >   mm/vmscan.c |    4 ++++
> >> >> >   1 files changed, 4 insertions(+), 0 deletions(-)
> >> >> >
> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> >> > index af24d1e..4d24828 100644
> >> >> > --- a/mm/vmscan.c
> >> >> > +++ b/mm/vmscan.c
> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >> >> >     unsigned long balanced = 0;
> >> >> >     bool all_zones_ok = true;
> >> >> >
> >> >> > +   /* If kswapd has been running too long, just sleep */
> >> >> > +   if (need_resched())
> >> >> > +           return false;
> >> >> > +
> >> >>
> >> >> Hmm... I don't like this patch so much. because this code does
> >> >>
> >> >> - don't sleep if kswapd got context switch at shrink_inactive_list
> >> >
> >> > This isn't entirely true:  need_resched() will be false, so we'll follow
> >> > the normal path for determining whether to sleep or not, in effect
> >> > leaving the current behaviour unchanged.
> >> >
> >> >> - sleep if kswapd didn't
> >> >
> >> > This also isn't entirely true: whether need_resched() is true at this
> >> > point depends on a whole lot more that whether we did a context switch
> >> > in shrink_inactive. It mostly depends on how long we've been running
> >> > without giving up the CPU.  Generally that will mean we've been round
> >> > the shrinker loop hundreds to thousands of times without sleeping.
> >> >
> >> >> It seems to be semi random behavior.
> >> >
> >> > Well, we have to do something.  Chris Mason first suspected the hang was
> >> > a kswapd rescheduling problem a while ago.  We tried putting
> >> > cond_rescheds() in several places in the vmscan code, but to no avail.
> >>
> >> Is it a result of  test with patch of Hannes(ie, !pgdat_balanced)?
> >>
> >> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c.
> >> Because, although we complete zone balancing, kswapd doesn't sleep as
> >> pgdat_balance returns wrong result. And at last VM calls
> >> balance_pgdat. In this case, balance_pgdat returns without any work as
> >> kswap couldn't find zones which have not enough free pages and goto
> >> out. kswapd could repeat this work infinitely. So you don't have a
> >> chance to call cond_resched.
> >>
> >> But if your test was with Hanne's patch, I am very curious how come
> >> kswapd consumes CPU a lot.
> >>
> >> > The need_resched() in sleeping_prematurely() seems to be about the best
> >> > option.  The other option might be just to put a cond_resched() in
> >> > kswapd_try_to_sleep(), but that will really have about the same effect.
> >>
> >> I don't oppose it but before that, I think we have to know why kswapd
> >> consumes CPU a lot although we applied Hannes' patch.
> >>
> >
> > Because it's still possible for processes to allocate pages at the same
> > rate kswapd is freeing them leading to a situation where kswapd does not
> > consider the zone balanced for prolonged periods of time.
> 
> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat.
> So I think kswapd can be scheduled out although it's scheduled in
> after a short time as task scheduled also need page reclaim. Although
> all task in system need reclaim, kswapd cpu 99% consumption is a
> natural result, I think.
> Do I miss something?
> 

Let's see:

shrink_page_list() only applies if inactive pages were isolated
	which in turn may not happen if all_unreclaimable is set in
	shrink_zones(). If for whatever reason, all_unreclaimable is
	set on all zones, we can miss calling cond_resched().

shrink_slab only applies if we are reclaiming slab pages. If the first
	shrinker returns -1, we do not call cond_resched(). If that
	first shrinker is dcache and __GFP_FS is not set, direct
	reclaimers will not shrink at all. However, if there are
	enough of them running or if one of the other shrinkers
	is running for a very long time, kswapd could be starved
	acquiring the shrinker_rwsem and never reaching the
	cond_resched().

balance_pgdat() only calls cond_resched() if the zones are not
	balanced. For a high-order allocation that is balanced, it
	checks order-0 again. During that window, order-0 might have
	become unbalanced so it loops again for order-0 and returns
	that it was reclaiming for order-0 to kswapd(). It can then find
	that a caller has rewoken kswapd for a high-order and re-enters
	balance_pgdat() without ever having called cond_resched().

While it appears unlikely, there are bad conditions which can result
in cond_resched() being avoided.
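
The three cases above can be sketched as a small userspace model. This
is only an illustration under assumptions: each boolean says whether a
reclaim pass reaches the cond_resched() call in the corresponding
function, and the function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Count the passes in which none of the three cond_resched() call sites
 * is reached. The same conditions are assumed to hold on every pass.
 */
int passes_without_yield(int passes, bool pages_isolated,
			 bool slab_yield_reached, bool zones_unbalanced)
{
	int skipped = 0;

	assert(passes >= 0);

	for (int i = 0; i < passes; i++) {
		bool yielded = false;

		/* shrink_page_list() runs only if pages were isolated */
		if (pages_isolated)
			yielded = true;
		/* shrink_slab() may bail out before its cond_resched() */
		if (slab_yield_reached)
			yielded = true;
		/* balance_pgdat() yields only when zones are unbalanced */
		if (zones_unbalanced)
			yielded = true;

		if (!yielded)
			skipped++;
	}
	return skipped;
}
```

When all three conditions are false, every pass completes without a
yield point, which is the bad case described above.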
Rik van Riel - May 16, 2011, 2:30 p.m.
On 05/13/2011 10:03 AM, Mel Gorman wrote:
> Under constant allocation pressure, kswapd can be in the situation where
> sleeping_prematurely() will always return true even if kswapd has been
> running a long time. Check if kswapd needs to be scheduled.
>
> Signed-off-by: Mel Gorman<mgorman@suse.de>

Acked-by: Rik van Riel<riel@redhat.com>
KOSAKI Motohiro - May 18, 2011, 12:26 a.m.
> Lets see;
>
> shrink_page_list() only applies if inactive pages were isolated
> 	which in turn may not happen if all_unreclaimable is set in
> 	shrink_zones(). If for whatver reason, all_unreclaimable is
> 	set on all zones, we can miss calling cond_resched().
>
> shrink_slab only applies if we are reclaiming slab pages. If the first
> 	shrinker returns -1, we do not call cond_resched(). If that
> 	first shrinker is dcache and __GFP_FS is not set, direct
> 	reclaimers will not shrink at all. However, if there are
> 	enough of them running or if one of the other shrinkers
> 	is running for a very long time, kswapd could be starved
> 	acquiring the shrinker_rwsem and never reaching the
> 	cond_resched().

OK.


>
> balance_pgdat() only calls cond_resched if the zones are not
> 	balanced. For a high-order allocation that is balanced, it
> 	checks order-0 again. During that window, order-0 might have
> 	become unbalanced so it loops again for order-0 and returns
> 	that was reclaiming for order-0 to kswapd(). It can then find
> 	that a caller has rewoken kswapd for a high-order and re-enters
> 	balance_pgdat() without ever have called cond_resched().

Then shouldn't balance_pgdat() call cond_resched() unconditionally?
The problem is NOT 100% CPU consumption. If kswapd sleeps, other
processes have to reclaim old pages themselves. The problem is that
kswapd doesn't invoke a context switch, so other tasks hang.




> While it appears unlikely, there are bad conditions which can result
> in cond_resched() being avoided.
>


Mel Gorman - May 18, 2011, 9:57 a.m.
On Wed, May 18, 2011 at 09:26:09AM +0900, KOSAKI Motohiro wrote:
> >Lets see;
> >
> >shrink_page_list() only applies if inactive pages were isolated
> >	which in turn may not happen if all_unreclaimable is set in
> >	shrink_zones(). If for whatver reason, all_unreclaimable is
> >	set on all zones, we can miss calling cond_resched().
> >
> >shrink_slab only applies if we are reclaiming slab pages. If the first
> >	shrinker returns -1, we do not call cond_resched(). If that
> >	first shrinker is dcache and __GFP_FS is not set, direct
> >	reclaimers will not shrink at all. However, if there are
> >	enough of them running or if one of the other shrinkers
> >	is running for a very long time, kswapd could be starved
> >	acquiring the shrinker_rwsem and never reaching the
> >	cond_resched().
> 
> OK.
> 
> 
> >
> >balance_pgdat() only calls cond_resched if the zones are not
> >	balanced. For a high-order allocation that is balanced, it
> >	checks order-0 again. During that window, order-0 might have
> >	become unbalanced so it loops again for order-0 and returns
> >	that was reclaiming for order-0 to kswapd(). It can then find
> >	that a caller has rewoken kswapd for a high-order and re-enters
> >	balance_pgdat() without ever have called cond_resched().
> 
> Then, Shouldn't balance_pgdat() call cond_resched() unconditionally?
> The problem is NOT 100% cpu consumption. if kswapd will sleep, other
> processes need to reclaim old pages. The problem is, kswapd doesn't
> invoke context switch and other tasks hang-up.
> 

Which the shrink_slab patch does (either version). What's the gain from
sprinkling more cond_resched() around? If you think there is, submit
another pair of patches (include patch 1 from this series) but I'm not
seeing the advantage myself.

> 
> >While it appears unlikely, there are bad conditions which can result
> >in cond_resched() being avoided.
> >
> 
>

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index af24d1e..4d24828 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2251,6 +2251,10 @@  static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	unsigned long balanced = 0;
 	bool all_zones_ok = true;
 
+	/* If kswapd has been running too long, just sleep */
+	if (need_resched())
+		return false;
+
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return true;
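
For readers who want to reason about the ordering of the checks, the
following is a simplified userspace model of the patched function. It is
a sketch, not the kernel implementation: need_resched() and the per-zone
watermark walk are replaced by plain arguments (need_resched and
node_balanced are hypothetical parameters), and only the decision order
shown in the hunk above is preserved.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the decision order in sleeping_prematurely() after the patch.
 * Returns true if going to sleep now would be premature.
 */
bool model_sleeping_prematurely(bool need_resched, long remaining,
				bool node_balanced)
{
	assert(remaining >= 0);

	/* Added by the patch: kswapd has been running too long, just sleep */
	if (need_resched)
		return false;

	/* A direct reclaimer woke kswapd within HZ/10, so it is premature */
	if (remaining)
		return true;

	/* Simplification: the real function inspects each zone's watermarks */
	return !node_balanced;
}
```

Note that the new check takes priority over everything else: even with
time remaining and an unbalanced node, a pending reschedule lets kswapd
sleep.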