Message ID | 1351931714-11689-2-git-send-email-ming.lei@canonical.com |
---|---|
State | Not Applicable, archived |
Delegated to: | David Miller |
Headers | show |
On Sat, 3 Nov 2012 16:35:09 +0800 Ming Lei <ming.lei@canonical.com> wrote: > This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of > 'struct task_struct'), so that the flag can be set by one task > to avoid doing I/O inside memory allocation in the task's context. > > The patch trys to solve one deadlock problem caused by block device, > and the problem may happen at least in the below situations: > > - during block device runtime resume, if memory allocation with > GFP_KERNEL is called inside runtime resume callback of any one > of its ancestors(or the block device itself), the deadlock may be > triggered inside the memory allocation since it might not complete > until the block device becomes active and the involed page I/O finishes. > The situation is pointed out first by Alan Stern. It is not a good > approach to convert all GFP_KERNEL[1] in the path into GFP_NOIO because > several subsystems may be involved(for example, PCI, USB and SCSI may > be involved for usb mass stoarage device, network devices involved too > in the iSCSI case) > > - during block device runtime suspend, because runtime resume need > to wait for completion of concurrent runtime suspend. > > - during error handling of usb mass storage deivce, USB bus reset > will be put on the device, so there shouldn't have any > memory allocation with GFP_KERNEL during USB bus reset, otherwise > the deadlock similar with above may be triggered. Unfortunately, any > usb device may include one mass storage interface in theory, so it > requires all usb interface drivers to handle the situation. In fact, > most usb drivers don't know how to handle bus reset on the device > and don't provide .pre_set() and .post_reset() callback at all, so > USB core has to unbind and bind driver for these devices. So it > is still not practical to resort to GFP_NOIO for solving the problem. > > Also the introduced solution can be used by block subsystem or block > drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing > actual I/O transfer. > > It is not a good idea to convert all these GFP_KERNEL in the > affected path into GFP_NOIO because these functions doing that may be > implemented as library and will be called in many other contexts. > > In fact, memalloc_noio() can convert some of current static GFP_NOIO > allocation into GFP_KERNEL back in other non-affected contexts, at least > almost all GFP_NOIO in USB subsystem can be converted into GFP_KERNEL > after applying the approach and make allocation with GFP_IO > only happen in runtime resume/bus reset/block I/O transfer contexts > generally. It's unclear from the description why we're also clearing __GFP_FS in this situation. If we can avoid doing this then there will be a very small gain: there are some situations in which a filesystem can clean pagecache without performing I/O. It doesn't appear that the patch will add overhead to the alloc/free hotpaths, which is good. > > ... > > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1805,6 +1805,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * > #define PF_FROZEN 0x00010000 /* frozen for system suspend */ > #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */ > #define PF_KSWAPD 0x00040000 /* I am kswapd */ > +#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */ > #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ > #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ > #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */ > @@ -1842,6 +1843,15 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * > #define tsk_used_math(p) ((p)->flags & PF_USED_MATH) > #define used_math() tsk_used_math(current) > > +#define memalloc_noio() (current->flags & PF_MEMALLOC_NOIO) > +#define memalloc_noio_save(flag) do { \ > + (flag) = current->flags & PF_MEMALLOC_NOIO; \ > + current->flags |= PF_MEMALLOC_NOIO; \ > +} while (0) > +#define memalloc_noio_restore(flag) do { \ > + current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flag; \ > +} while (0) > + Again with the ghastly macros. Please, do this properly in regular old C, as previously discussed. It really doesn't matter what daft things local_irq_save() did 20 years ago. Just do it right! Also, you can probably put the unlikely() inside memalloc_noio() and avoid repeating it at all the callsites. And it might be neater to do: /* * Nice comment goes here */ static inline gfp_t memalloc_noio_flags(gfp_t flags) { if (unlikely(current->flags & PF_MEMALLOC_NOIO)) flags &= ~GFP_IOFS; return flags; } > * task->jobctl flags > */ > > ... > > @@ -2304,6 +2304,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > .gfp_mask = sc.gfp_mask, > }; > > + if (unlikely(memalloc_noio())) { > + gfp_mask &= ~GFP_IOFS; > + sc.gfp_mask = gfp_mask; > + shrink.gfp_mask = sc.gfp_mask; > + } We can avoid writing to shrink.gfp_mask twice. And maybe sc.gfp_mask as well. Unclear, I didn't think about it too hard ;) > throttle_direct_reclaim(gfp_mask, zonelist, nodemask); > > /* > > ... > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 7, 2012 at 7:23 AM, Andrew Morton <akpm@linux-foundation.org> wrote: > > It's unclear from the description why we're also clearing __GFP_FS in > this situation. > > If we can avoid doing this then there will be a very small gain: there > are some situations in which a filesystem can clean pagecache without > performing I/O. Firstly, the patch follows the policy in the system suspend/resume situation, in which the __GFP_FS is cleared, and basically the problem is very similar with that in system PM path. Secondly, inside shrink_page_list(), pageout() may be triggered on dirty anon page if __GFP_FS is set. IMO, if performing I/O can be completely avoided when __GFP_FS is set, the flag can be kept, otherwise it is better to clear it in the situation. > > It doesn't appear that the patch will add overhead to the alloc/free > hotpaths, which is good. Thanks for previous Minchan's comment. > >> >> ... >> >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -1805,6 +1805,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * >> #define PF_FROZEN 0x00010000 /* frozen for system suspend */ >> #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */ >> #define PF_KSWAPD 0x00040000 /* I am kswapd */ >> +#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */ >> #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ >> #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ >> #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */ >> @@ -1842,6 +1843,15 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * >> #define tsk_used_math(p) ((p)->flags & PF_USED_MATH) >> #define used_math() tsk_used_math(current) >> >> +#define memalloc_noio() (current->flags & PF_MEMALLOC_NOIO) >> +#define memalloc_noio_save(flag) do { \ >> + (flag) = current->flags & PF_MEMALLOC_NOIO; \ >> + current->flags |= PF_MEMALLOC_NOIO; \ >> +} while (0) >> +#define memalloc_noio_restore(flag) do { \ >> + current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flag; \ >> +} while (0) >> + > > Again with the ghastly macros. Please, do this properly in regular old > C, as previously discussed. It really doesn't matter what daft things > local_irq_save() did 20 years ago. Just do it right! OK, I will take inline function in -v5. > > Also, you can probably put the unlikely() inside memalloc_noio() and > avoid repeating it at all the callsites. > > And it might be neater to do: > > /* > * Nice comment goes here > */ > static inline gfp_t memalloc_noio_flags(gfp_t flags) > { > if (unlikely(current->flags & PF_MEMALLOC_NOIO)) > flags &= ~GFP_IOFS; > return flags; > } But without the check in callsites, some local variables will be write two times, so it is better to not do it. > >> * task->jobctl flags >> */ >> >> ... >> >> @@ -2304,6 +2304,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, >> .gfp_mask = sc.gfp_mask, >> }; >> >> + if (unlikely(memalloc_noio())) { >> + gfp_mask &= ~GFP_IOFS; >> + sc.gfp_mask = gfp_mask; >> + shrink.gfp_mask = sc.gfp_mask; >> + } > > We can avoid writing to shrink.gfp_mask twice. And maybe sc.gfp_mask > as well. Unclear, I didn't think about it too hard ;) Yes, we can do it by initializing 'shrink' local variable just after the branch, so one writing is enough. Will do it in -v5. Thanks, -- Ming Lei -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 7 Nov 2012 11:11:24 +0800 Ming Lei <ming.lei@canonical.com> wrote: > On Wed, Nov 7, 2012 at 7:23 AM, Andrew Morton <akpm@linux-foundation.org> wrote: > > > > It's unclear from the description why we're also clearing __GFP_FS in > > this situation. > > > > If we can avoid doing this then there will be a very small gain: there > > are some situations in which a filesystem can clean pagecache without > > performing I/O. > > Firstly, the patch follows the policy in the system suspend/resume situation, > in which the __GFP_FS is cleared, and basically the problem is very similar > with that in system PM path. I suspect that code is wrong. Or at least, suboptimal. > Secondly, inside shrink_page_list(), pageout() may be triggered on dirty anon > page if __GFP_FS is set. pageout() should be called if GFP_FS is set or if GFP_IO is set and the IO is against swap. And that's what we want to happen: we want to enter the fs to try to turn dirty pagecache into clean pagecache without doing IO. If we in fact enter the device drivers when GFP_IO was not set then that's a bug which we should fix. > IMO, if performing I/O can be completely avoided when __GFP_FS is set, the > flag can be kept, otherwise it is better to clear it in the situation. yup. > > > > Also, you can probably put the unlikely() inside memalloc_noio() and > > avoid repeating it at all the callsites. > > > > And it might be neater to do: > > > > /* > > * Nice comment goes here > > */ > > static inline gfp_t memalloc_noio_flags(gfp_t flags) > > { > > if (unlikely(current->flags & PF_MEMALLOC_NOIO)) > > flags &= ~GFP_IOFS; > > return flags; > > } > > But without the check in callsites, some local variables will be write > two times, > so it is better to not do it. I don't see why - we just modify the incoming gfp_t at the start of the function, then use it. It gets a bit tricky with those struct initialisations. Things like struct foo bar { .a = a1, .b = b1, }; should not be turned into struct foo bar { .a = a1, }; bar.b = b1; and we don't want to do struct foo bar { }; bar.a = a1; bar.b = b1; either, because these are indeed a double-write. But we can do struct foo bar { .flags = (flags = memalloc_noio_flags(flags)), .b = b1, }; which is a bit arcane but not toooo bad. Have a think about it... -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 7, 2012 at 11:48 AM, Andrew Morton <akpm@linux-foundation.org> wrote: >> >> Firstly, the patch follows the policy in the system suspend/resume situation, >> in which the __GFP_FS is cleared, and basically the problem is very similar >> with that in system PM path. > > I suspect that code is wrong. Or at least, suboptimal. > >> Secondly, inside shrink_page_list(), pageout() may be triggered on dirty anon >> page if __GFP_FS is set. > > pageout() should be called if GFP_FS is set or if GFP_IO is set and the > IO is against swap. > > And that's what we want to happen: we want to enter the fs to try to > turn dirty pagecache into clean pagecache without doing IO. If we in > fact enter the device drivers when GFP_IO was not set then that's a bug > which we should fix. OK, I got it, and I'll not clear GFP_FS in -v5. > >> IMO, if performing I/O can be completely avoided when __GFP_FS is set, the >> flag can be kept, otherwise it is better to clear it in the situation. > > yup. > >> > >> > Also, you can probably put the unlikely() inside memalloc_noio() and >> > avoid repeating it at all the callsites. >> > >> > And it might be neater to do: >> > >> > /* >> > * Nice comment goes here >> > */ >> > static inline gfp_t memalloc_noio_flags(gfp_t flags) >> > { >> > if (unlikely(current->flags & PF_MEMALLOC_NOIO)) >> > flags &= ~GFP_IOFS; >> > return flags; >> > } >> >> But without the check in callsites, some local variables will be write >> two times, >> so it is better to not do it. > > I don't see why - we just modify the incoming gfp_t at the start of the > function, then use it. > > It gets a bit tricky with those struct initialisations. Things like > > struct foo bar { > .a = a1, > .b = b1, > }; > > should not be turned into > > struct foo bar { > .a = a1, > }; > > bar.b = b1; > > and we don't want to do > > struct foo bar { }; > > bar.a = a1; > bar.b = b1; > > either, because these are indeed a double-write. But we can do > > struct foo bar { > .flags = (flags = memalloc_noio_flags(flags)), > .b = b1, > }; > > which is a bit arcane but not toooo bad. Have a think about it... Got it, looks memalloc_noio_flags() neater, and I will take it in v5. Thanks, -- Ming Lei -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/linux/sched.h b/include/linux/sched.h index fb27acd..283fe86 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1805,6 +1805,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * #define PF_FROZEN 0x00010000 /* frozen for system suspend */ #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */ #define PF_KSWAPD 0x00040000 /* I am kswapd */ +#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */ #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */ @@ -1842,6 +1843,15 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * #define tsk_used_math(p) ((p)->flags & PF_USED_MATH) #define used_math() tsk_used_math(current) +#define memalloc_noio() (current->flags & PF_MEMALLOC_NOIO) +#define memalloc_noio_save(flag) do { \ + (flag) = current->flags & PF_MEMALLOC_NOIO; \ + current->flags |= PF_MEMALLOC_NOIO; \ +} while (0) +#define memalloc_noio_restore(flag) do { \ + current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flag; \ +} while (0) + /* * task->jobctl flags */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 45c916b..3c47049 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2634,10 +2634,18 @@ retry_cpuset: page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, zonelist, high_zoneidx, alloc_flags, preferred_zone, migratetype); - if (unlikely(!page)) + if (unlikely(!page)) { + /* + * Runtime PM, block IO and its error handling path + * can deadlock because I/O on the device might not + * complete. + */ + if (unlikely(memalloc_noio())) + gfp_mask &= ~GFP_IOFS; page = __alloc_pages_slowpath(gfp_mask, order, zonelist, high_zoneidx, nodemask, preferred_zone, migratetype); + } trace_mm_page_alloc(page, order, gfp_mask, migratetype); diff --git a/mm/vmscan.c b/mm/vmscan.c index 10090c8..035088a 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2304,6 +2304,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, .gfp_mask = sc.gfp_mask, }; + if (unlikely(memalloc_noio())) { + gfp_mask &= ~GFP_IOFS; + sc.gfp_mask = gfp_mask; + shrink.gfp_mask = sc.gfp_mask; + } + throttle_direct_reclaim(gfp_mask, zonelist, nodemask); /* @@ -3304,6 +3310,12 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) }; unsigned long nr_slab_pages0, nr_slab_pages1; + if (unlikely(memalloc_noio())) { + gfp_mask &= ~GFP_IOFS; + sc.gfp_mask = gfp_mask; + shrink.gfp_mask = sc.gfp_mask; + } + cond_resched(); /* * We need to be able to allocate from the reserves for RECLAIM_SWAP