
[v4,1/6] mm: teach mm by current context info to not do I/O during memory allocation

Message ID 1351931714-11689-2-git-send-email-ming.lei@canonical.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Ming Lei Nov. 3, 2012, 8:35 a.m. UTC
This patch introduces PF_MEMALLOC_NOIO as a process flag (in the 'flags'
field of 'struct task_struct'), so that a task can set the flag to avoid
doing I/O inside memory allocations made in its context.

The patch tries to solve a deadlock problem caused by block devices,
which may happen in at least the following situations:

- during block device runtime resume: if a memory allocation with
GFP_KERNEL is done inside the runtime resume callback of any of the
device's ancestors (or of the block device itself), a deadlock may be
triggered inside the memory allocation, since it might not complete
until the block device becomes active and the involved page I/O finishes.
This situation was first pointed out by Alan Stern. It is not a good
approach to convert all GFP_KERNEL[1] allocations in the path into
GFP_NOIO, because several subsystems may be involved (for example, PCI,
USB and SCSI may be involved for a USB mass storage device, and network
devices are involved too in the iSCSI case).

- during block device runtime suspend: because runtime resume needs
to wait for completion of a concurrent runtime suspend.

- during error handling of a USB mass storage device: a USB bus reset
will be performed on the device, so there should not be any memory
allocation with GFP_KERNEL during the bus reset, otherwise a deadlock
similar to the above may be triggered. Unfortunately, in theory any
USB device may include a mass storage interface, so this would require
all USB interface drivers to handle the situation. In fact, most USB
drivers don't know how to handle a bus reset on the device and don't
provide .pre_reset() and .post_reset() callbacks at all, so the USB
core has to unbind and rebind the driver for these devices. So it is
still not practical to resort to GFP_NOIO for solving the problem.

The introduced solution can also be used by the block subsystem or by
block drivers, for example, by setting the PF_MEMALLOC_NOIO flag before
doing the actual I/O transfer.
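
For instance, a block driver could bracket its transfer (or error handling)
path with the helpers from this patch; a minimal sketch, in which
struct my_blkdev and my_blkdev_issue() are made-up names and only the
save/restore calls come from this series:

static int my_blkdev_do_transfer(struct my_blkdev *bdev, struct request *rq)
{
	unsigned int pflags;
	int ret;

	/* allocations made while this transfer is outstanding must not
	 * themselves wait on I/O to this device */
	memalloc_noio_save(pflags);
	ret = my_blkdev_issue(bdev, rq);	/* may use plain GFP_KERNEL */
	memalloc_noio_restore(pflags);
	return ret;
}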

It is not a good idea to convert all the GFP_KERNEL allocations in the
affected paths into GFP_NOIO, because the functions doing them may be
implemented as library code and called in many other contexts.

In fact, memalloc_noio() allows some of the current static GFP_NOIO
allocations to be converted back into GFP_KERNEL in other, non-affected
contexts; at least almost all GFP_NOIO allocations in the USB subsystem
can be converted into GFP_KERNEL after applying this approach, so that
allocations with __GFP_IO generally only happen in runtime resume, bus
reset and block I/O transfer contexts.
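
As a hypothetical example (not taken from any driver in this series), an
allocation that today uses GFP_NOIO only because the code can also run in
the resume path could then be relaxed:

	/* before: defensive GFP_NOIO because this helper may run at resume time */
	buf = kmalloc(len, GFP_NOIO);

	/* after: the resume context itself runs with PF_MEMALLOC_NOIO set, so
	 * the allocator already strips the I/O-related bits there and plain
	 * GFP_KERNEL is safe for all callers */
	buf = kmalloc(len, GFP_KERNEL);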

[1] Several GFP_KERNEL allocation examples in the runtime resume path:

- pci subsystem
acpi_os_allocate
	<-acpi_ut_allocate
		<-ACPI_ALLOCATE_ZEROED
			<-acpi_evaluate_object
				<-__acpi_bus_set_power
					<-acpi_bus_set_power
						<-acpi_pci_set_power_state
							<-platform_pci_set_power_state
								<-pci_platform_power_transition
									<-__pci_complete_power_transition
										<-pci_set_power_state
											<-pci_restore_standard_config
												<-pci_pm_runtime_resume
- usb subsystem
usb_get_status
	<-finish_port_resume
		<-usb_port_resume
			<-generic_resume
				<-usb_resume_device
					<-usb_resume_both
						<-usb_runtime_resume

- some individual usb drivers
usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....

That is just what I have found so far. Unfortunately, these allocations
can only be found by manual inspection for now, and there are probably
many more that have not been found, since any function in the resume
path (call tree) may allocate memory with GFP_KERNEL.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
v4:
	- fix comment
v3:
	- no change
v2:
        - remove changes on 'may_writepage' and 'may_swap' because they
          aren't related to this patchset and can't introduce I/O in the
          allocation path if GFP_IOFS is unset, so handling 'may_swap'
          and 'may_writepage' for GFP_NOIO or GFP_NOFS should be an
          mm-internal thing; let the mm folks deal with that, :-).

          It looks like clearing the two may_XXX flags only excludes dirty
          pages and anon pages from reclaiming, and that behaviour should
          be decided by the GFP flags, IMO.

        - unset GFP_IOFS in the try_to_free_pages() path, since
          alloc_page_buffers() and dma_alloc_from_contiguous() may drop
          into that path, as pointed out by KAMEZAWA Hiroyuki
v1:
        - take Minchan's change to avoid the check in alloc_page hot
          path

        - change the helpers' style into save/restore as suggested by
          Alan Stern
---
 include/linux/sched.h |   10 ++++++++++
 mm/page_alloc.c       |   10 +++++++++-
 mm/vmscan.c           |   12 ++++++++++++
 3 files changed, 31 insertions(+), 1 deletion(-)

Comments

Andrew Morton Nov. 6, 2012, 11:23 p.m. UTC | #1
On Sat,  3 Nov 2012 16:35:09 +0800
Ming Lei <ming.lei@canonical.com> wrote:

> This patch introduces PF_MEMALLOC_NOIO as a process flag (in the 'flags'
> field of 'struct task_struct'), so that a task can set the flag to avoid
> doing I/O inside memory allocations made in its context.
> 
> The patch tries to solve a deadlock problem caused by block devices,
> which may happen in at least the following situations:
> 
> - during block device runtime resume: if a memory allocation with
> GFP_KERNEL is done inside the runtime resume callback of any of the
> device's ancestors (or of the block device itself), a deadlock may be
> triggered inside the memory allocation, since it might not complete
> until the block device becomes active and the involved page I/O finishes.
> This situation was first pointed out by Alan Stern. It is not a good
> approach to convert all GFP_KERNEL[1] allocations in the path into
> GFP_NOIO, because several subsystems may be involved (for example, PCI,
> USB and SCSI may be involved for a USB mass storage device, and network
> devices are involved too in the iSCSI case).
> 
> - during block device runtime suspend: because runtime resume needs
> to wait for completion of a concurrent runtime suspend.
> 
> - during error handling of a USB mass storage device: a USB bus reset
> will be performed on the device, so there should not be any memory
> allocation with GFP_KERNEL during the bus reset, otherwise a deadlock
> similar to the above may be triggered. Unfortunately, in theory any
> USB device may include a mass storage interface, so this would require
> all USB interface drivers to handle the situation. In fact, most USB
> drivers don't know how to handle a bus reset on the device and don't
> provide .pre_reset() and .post_reset() callbacks at all, so the USB
> core has to unbind and rebind the driver for these devices. So it is
> still not practical to resort to GFP_NOIO for solving the problem.
> 
> The introduced solution can also be used by the block subsystem or by
> block drivers, for example, by setting the PF_MEMALLOC_NOIO flag before
> doing the actual I/O transfer.
> 
> It is not a good idea to convert all the GFP_KERNEL allocations in the
> affected paths into GFP_NOIO, because the functions doing them may be
> implemented as library code and called in many other contexts.
> 
> In fact, memalloc_noio() allows some of the current static GFP_NOIO
> allocations to be converted back into GFP_KERNEL in other, non-affected
> contexts; at least almost all GFP_NOIO allocations in the USB subsystem
> can be converted into GFP_KERNEL after applying this approach, so that
> allocations with __GFP_IO generally only happen in runtime resume, bus
> reset and block I/O transfer contexts.

It's unclear from the description why we're also clearing __GFP_FS in
this situation.

If we can avoid doing this then there will be a very small gain: there
are some situations in which a filesystem can clean pagecache without
performing I/O.


It doesn't appear that the patch will add overhead to the alloc/free
hotpaths, which is good.

> 
> ...
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1805,6 +1805,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
>  #define PF_FROZEN	0x00010000	/* frozen for system suspend */
>  #define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
>  #define PF_KSWAPD	0x00040000	/* I am kswapd */
> +#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
>  #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
>  #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
>  #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
> @@ -1842,6 +1843,15 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
>  #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
>  #define used_math() tsk_used_math(current)
>  
> +#define memalloc_noio() (current->flags & PF_MEMALLOC_NOIO)
> +#define memalloc_noio_save(flag) do { \
> +	(flag) = current->flags & PF_MEMALLOC_NOIO; \
> +	current->flags |= PF_MEMALLOC_NOIO; \
> +} while (0)
> +#define memalloc_noio_restore(flag) do { \
> +	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flag; \
> +} while (0)
> +

Again with the ghastly macros.  Please, do this properly in regular old
C, as previously discussed.  It really doesn't matter what daft things
local_irq_save() did 20 years ago.  Just do it right!

Also, you can probably put the unlikely() inside memalloc_noio() and
avoid repeating it at all the callsites.

And it might be neater to do:

/*
 * Nice comment goes here
 */
static inline gfp_t memalloc_noio_flags(gfp_t flags)
{
	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
		flags &= ~GFP_IOFS;
	return flags;
}
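
The save/restore pair could likewise become plain inline functions; one
possible form (a sketch only, assuming the same PF_MEMALLOC_NOIO semantics
as the macros above):

static inline unsigned int memalloc_noio_save(void)
{
	unsigned int flags = current->flags & PF_MEMALLOC_NOIO;

	current->flags |= PF_MEMALLOC_NOIO;
	return flags;
}

static inline void memalloc_noio_restore(unsigned int flags)
{
	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
}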

>   * task->jobctl flags
>   */
>
> ...
>
> @@ -2304,6 +2304,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  		.gfp_mask = sc.gfp_mask,
>  	};
>  
> +	if (unlikely(memalloc_noio())) {
> +		gfp_mask &= ~GFP_IOFS;
> +		sc.gfp_mask = gfp_mask;
> +		shrink.gfp_mask = sc.gfp_mask;
> +	}

We can avoid writing to shrink.gfp_mask twice.  And maybe sc.gfp_mask
as well.  Unclear, I didn't think about it too hard ;)

>  	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
>  
>  	/*
>
> ...
>
Ming Lei Nov. 7, 2012, 3:11 a.m. UTC | #2
On Wed, Nov 7, 2012 at 7:23 AM, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> It's unclear from the description why we're also clearing __GFP_FS in
> this situation.
>
> If we can avoid doing this then there will be a very small gain: there
> are some situations in which a filesystem can clean pagecache without
> performing I/O.

Firstly, the patch follows the policy used in the system suspend/resume
situation, in which __GFP_FS is cleared, and the problem here is basically
very similar to the one in the system PM path.

Secondly, inside shrink_page_list(), pageout() may be triggered on a dirty
anon page if __GFP_FS is set.

IMO, if performing I/O can be completely avoided while __GFP_FS is set, the
flag can be kept; otherwise it is better to clear it in this situation.

>
> It doesn't appear that the patch will add overhead to the alloc/free
> hotpaths, which is good.

Thanks to Minchan's previous comment.

>
>>
>> ...
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1805,6 +1805,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
>>  #define PF_FROZEN    0x00010000      /* frozen for system suspend */
>>  #define PF_FSTRANS   0x00020000      /* inside a filesystem transaction */
>>  #define PF_KSWAPD    0x00040000      /* I am kswapd */
>> +#define PF_MEMALLOC_NOIO 0x00080000  /* Allocating memory without IO involved */
>>  #define PF_LESS_THROTTLE 0x00100000  /* Throttle me less: I clean memory */
>>  #define PF_KTHREAD   0x00200000      /* I am a kernel thread */
>>  #define PF_RANDOMIZE 0x00400000      /* randomize virtual address space */
>> @@ -1842,6 +1843,15 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
>>  #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
>>  #define used_math() tsk_used_math(current)
>>
>> +#define memalloc_noio() (current->flags & PF_MEMALLOC_NOIO)
>> +#define memalloc_noio_save(flag) do { \
>> +     (flag) = current->flags & PF_MEMALLOC_NOIO; \
>> +     current->flags |= PF_MEMALLOC_NOIO; \
>> +} while (0)
>> +#define memalloc_noio_restore(flag) do { \
>> +     current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flag; \
>> +} while (0)
>> +
>
> Again with the ghastly macros.  Please, do this properly in regular old
> C, as previously discussed.  It really doesn't matter what daft things
> local_irq_save() did 20 years ago.  Just do it right!

OK, I will switch to inline functions in -v5.

>
> Also, you can probably put the unlikely() inside memalloc_noio() and
> avoid repeating it at all the callsites.
>
> And it might be neater to do:
>
> /*
>  * Nice comment goes here
>  */
> static inline gfp_t memalloc_noio_flags(gfp_t flags)
> {
>         if (unlikely(current->flags & PF_MEMALLOC_NOIO))
>                 flags &= ~GFP_IOFS;
>         return flags;
> }

But without the check at the call sites, some local variables would be
written twice, so it is better not to do it.

>
>>   * task->jobctl flags
>>   */
>>
>> ...
>>
>> @@ -2304,6 +2304,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>>               .gfp_mask = sc.gfp_mask,
>>       };
>>
>> +     if (unlikely(memalloc_noio())) {
>> +             gfp_mask &= ~GFP_IOFS;
>> +             sc.gfp_mask = gfp_mask;
>> +             shrink.gfp_mask = sc.gfp_mask;
>> +     }
>
> We can avoid writing to shrink.gfp_mask twice.  And maybe sc.gfp_mask
> as well.  Unclear, I didn't think about it too hard ;)

Yes, we can do that by initializing the 'shrink' local variable just after
the branch, so a single write is enough. Will do it in -v5.

Thanks,
--
Ming Lei
Andrew Morton Nov. 7, 2012, 3:48 a.m. UTC | #3
On Wed, 7 Nov 2012 11:11:24 +0800 Ming Lei <ming.lei@canonical.com> wrote:

> On Wed, Nov 7, 2012 at 7:23 AM, Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > It's unclear from the description why we're also clearing __GFP_FS in
> > this situation.
> >
> > If we can avoid doing this then there will be a very small gain: there
> > are some situations in which a filesystem can clean pagecache without
> > performing I/O.
> 
> Firstly, the patch follows the policy used in the system suspend/resume
> situation, in which __GFP_FS is cleared, and the problem here is basically
> very similar to the one in the system PM path.

I suspect that code is wrong.  Or at least, suboptimal.

> Secondly, inside shrink_page_list(), pageout() may be triggered on a dirty
> anon page if __GFP_FS is set.

pageout() should be called if GFP_FS is set or if GFP_IO is set and the
IO is against swap.

And that's what we want to happen: we want to enter the fs to try to
turn dirty pagecache into clean pagecache without doing IO.  If we in
fact enter the device drivers when GFP_IO was not set then that's a bug
which we should fix.
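
For reference, the gate in shrink_page_list() is roughly the following
(paraphrased from mm/vmscan.c of this era, not quoted verbatim):

	may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
		(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
	...
	/* dirty pages are only handed to pageout() when may_enter_fs holds */
	if (PageDirty(page) && !may_enter_fs)
		goto keep_locked;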

> IMO, if performing I/O can be completely avoided while __GFP_FS is set, the
> flag can be kept; otherwise it is better to clear it in this situation.

yup.

> >
> > Also, you can probably put the unlikely() inside memalloc_noio() and
> > avoid repeating it at all the callsites.
> >
> > And it might be neater to do:
> >
> > /*
> >  * Nice comment goes here
> >  */
> > static inline gfp_t memalloc_noio_flags(gfp_t flags)
> > {
> >         if (unlikely(current->flags & PF_MEMALLOC_NOIO))
> >                 flags &= ~GFP_IOFS;
> >         return flags;
> > }
> 
> But without the check at the call sites, some local variables would be
> written twice, so it is better not to do it.

I don't see why - we just modify the incoming gfp_t at the start of the
function, then use it.

It gets a bit tricky with those struct initialisations.  Things like

	struct foo bar = {
		.a = a1,
		.b = b1,
	};

should not be turned into

	struct foo bar = {
		.a = a1,
	};
	
	bar.b = b1;

and we don't want to do

	struct foo bar = { };

	bar.a = a1;
	bar.b = b1;

either, because these are indeed a double-write.  But we can do

	struct foo bar = {
		.flags = (flags = memalloc_noio_flags(flags)),
		.b = b1,
	};

which is a bit arcane but not toooo bad.  Have a think about it...
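
Applied to try_to_free_pages(), that could look like the following sketch
(only the lines shown change; the remaining initializers and the rest of
the function stay as they are today):

unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
				gfp_t gfp_mask, nodemask_t *nodemask)
{
	struct scan_control sc = {
		/* filter and store in a single write */
		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
		.may_writepage = !laptop_mode,
		.nodemask = nodemask,
		/* ... remaining fields unchanged ... */
	};
	struct shrink_control shrink = {
		.gfp_mask = sc.gfp_mask,	/* already filtered */
	};

	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
	/* rest of the function unchanged */
}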


Ming Lei Nov. 7, 2012, 4:35 a.m. UTC | #4
On Wed, Nov 7, 2012 at 11:48 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>>
>> Firstly, the patch follows the policy used in the system suspend/resume
>> situation, in which __GFP_FS is cleared, and the problem here is basically
>> very similar to the one in the system PM path.
>
> I suspect that code is wrong.  Or at least, suboptimal.
>
>> Secondly, inside shrink_page_list(), pageout() may be triggered on a dirty
>> anon page if __GFP_FS is set.
>
> pageout() should be called if GFP_FS is set or if GFP_IO is set and the
> IO is against swap.
>
> And that's what we want to happen: we want to enter the fs to try to
> turn dirty pagecache into clean pagecache without doing IO.  If we in
> fact enter the device drivers when GFP_IO was not set then that's a bug
> which we should fix.

OK, I got it, and I'll not clear GFP_FS in -v5.
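
For illustration, with __GFP_FS kept, the helper would then just clear
__GFP_IO (a sketch of the direction agreed here, not the posted code):

static inline gfp_t memalloc_noio_flags(gfp_t flags)
{
	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
		flags &= ~__GFP_IO;
	return flags;
}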

>
>> IMO, if performing I/O can be completely avoided while __GFP_FS is set, the
>> flag can be kept; otherwise it is better to clear it in this situation.
>
> yup.
>
>> >
>> > Also, you can probably put the unlikely() inside memalloc_noio() and
>> > avoid repeating it at all the callsites.
>> >
>> > And it might be neater to do:
>> >
>> > /*
>> >  * Nice comment goes here
>> >  */
>> > static inline gfp_t memalloc_noio_flags(gfp_t flags)
>> > {
>> >         if (unlikely(current->flags & PF_MEMALLOC_NOIO))
>> >                 flags &= ~GFP_IOFS;
>> >         return flags;
>> > }
>>
>> But without the check at the call sites, some local variables would be
>> written twice, so it is better not to do it.
>
> I don't see why - we just modify the incoming gfp_t at the start of the
> function, then use it.
>
> It gets a bit tricky with those struct initialisations.  Things like
>
>         struct foo bar = {
>                 .a = a1,
>                 .b = b1,
>         };
>
> should not be turned into
>
>         struct foo bar = {
>                 .a = a1,
>         };
>
>         bar.b = b1;
>
> and we don't want to do
>
>         struct foo bar = { };
>
>         bar.a = a1;
>         bar.b = b1;
>
> either, because these are indeed a double-write.  But we can do
>
>         struct foo bar = {
>                 .flags = (flags = memalloc_noio_flags(flags)),
>                 .b = b1,
>         };
>
> which is a bit arcane but not toooo bad.  Have a think about it...

Got it, memalloc_noio_flags() looks neater, and I will take it in v5.

Thanks,
--
Ming Lei

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fb27acd..283fe86 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1805,6 +1805,7 @@  extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
 #define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
 #define PF_KSWAPD	0x00040000	/* I am kswapd */
+#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
@@ -1842,6 +1843,15 @@  extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+#define memalloc_noio() (current->flags & PF_MEMALLOC_NOIO)
+#define memalloc_noio_save(flag) do { \
+	(flag) = current->flags & PF_MEMALLOC_NOIO; \
+	current->flags |= PF_MEMALLOC_NOIO; \
+} while (0)
+#define memalloc_noio_restore(flag) do { \
+	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flag; \
+} while (0)
+
 /*
  * task->jobctl flags
  */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 45c916b..3c47049 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2634,10 +2634,18 @@  retry_cpuset:
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
 			preferred_zone, migratetype);
-	if (unlikely(!page))
+	if (unlikely(!page)) {
+		/*
+		 * Runtime PM, block IO and its error handling path
+		 * can deadlock because I/O on the device might not
+		 * complete.
+		 */
+		if (unlikely(memalloc_noio()))
+			gfp_mask &= ~GFP_IOFS;
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
+	}
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 10090c8..035088a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2304,6 +2304,12 @@  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.gfp_mask = sc.gfp_mask,
 	};
 
+	if (unlikely(memalloc_noio())) {
+		gfp_mask &= ~GFP_IOFS;
+		sc.gfp_mask = gfp_mask;
+		shrink.gfp_mask = sc.gfp_mask;
+	}
+
 	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
 
 	/*
@@ -3304,6 +3310,12 @@  static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	};
 	unsigned long nr_slab_pages0, nr_slab_pages1;
 
+	if (unlikely(memalloc_noio())) {
+		gfp_mask &= ~GFP_IOFS;
+		sc.gfp_mask = gfp_mask;
+		shrink.gfp_mask = sc.gfp_mask;
+	}
+
 	cond_resched();
 	/*
 	 * We need to be able to allocate from the reserves for RECLAIM_SWAP