
[v4,1/6] mm: teach mm by current context info to not do I/O during memory allocation

Message ID 1351931714-11689-2-git-send-email-ming.lei@canonical.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Ming Lei Nov. 3, 2012, 8:35 a.m. UTC
This patch introduces PF_MEMALLOC_NOIO as a process flag (in the 'flags'
field of 'struct task_struct'), so that a task can set the flag to avoid
doing I/O inside memory allocations made in its context.

The patch tries to solve a deadlock problem caused by block devices,
which may happen in at least the following situations:

- during block device runtime resume: if a memory allocation with
GFP_KERNEL is done inside the runtime resume callback of any of the
device's ancestors (or of the block device itself), a deadlock may be
triggered inside the memory allocation, since it might not complete
until the block device becomes active and the involved page I/O finishes.
This situation was first pointed out by Alan Stern. It is not a good
approach to convert all GFP_KERNEL[1] allocations in the path into
GFP_NOIO, because several subsystems may be involved (for example, PCI,
USB and SCSI may be involved for a USB mass storage device, and network
devices are involved too in the iSCSI case).

- during block device runtime suspend: because runtime resume needs
to wait for completion of a concurrent runtime suspend.

- during error handling of a USB mass storage device: a USB bus reset
will be performed on the device, so there should not be any memory
allocation with GFP_KERNEL during the bus reset, otherwise a deadlock
similar to the above may be triggered. Unfortunately, in theory any
USB device may include a mass storage interface, so this would require
all USB interface drivers to handle the situation. In fact, most USB
drivers don't know how to handle a bus reset on the device and don't
provide .pre_reset() and .post_reset() callbacks at all, so the USB
core has to unbind and rebind the driver for these devices. So it is
still not practical to resort to GFP_NOIO for solving the problem.

The introduced solution can also be used by the block subsystem or by
block drivers, for example, by setting the PF_MEMALLOC_NOIO flag before
doing the actual I/O transfer.
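
For instance, a block driver could bracket its transfer (or error handling)
path with the helpers from this patch; a minimal sketch, in which
struct my_blkdev and my_blkdev_issue() are made-up names and only the
save/restore calls come from this series:

static int my_blkdev_do_transfer(struct my_blkdev *bdev, struct request *rq)
{
	unsigned int pflags;
	int ret;

	/* allocations made while this transfer is outstanding must not
	 * themselves wait on I/O to this device */
	memalloc_noio_save(pflags);
	ret = my_blkdev_issue(bdev, rq);	/* may use plain GFP_KERNEL */
	memalloc_noio_restore(pflags);
	return ret;
}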

It is not a good idea to convert all the GFP_KERNEL allocations in the
affected paths into GFP_NOIO, because the functions doing them may be
implemented as library code and called in many other contexts.

In fact, memalloc_noio() allows some of the current static GFP_NOIO
allocations to be converted back into GFP_KERNEL in other, non-affected
contexts; at least almost all GFP_NOIO allocations in the USB subsystem
can be converted into GFP_KERNEL after applying this approach, so that
allocations with __GFP_IO generally only happen in runtime resume, bus
reset and block I/O transfer contexts.
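
As a hypothetical example (not taken from any driver in this series), an
allocation that today uses GFP_NOIO only because the code can also run in
the resume path could then be relaxed:

	/* before: defensive GFP_NOIO because this helper may run at resume time */
	buf = kmalloc(len, GFP_NOIO);

	/* after: the resume context itself runs with PF_MEMALLOC_NOIO set, so
	 * the allocator already strips the I/O-related bits there and plain
	 * GFP_KERNEL is safe for all callers */
	buf = kmalloc(len, GFP_KERNEL);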

[1] Several GFP_KERNEL allocation examples in the runtime resume path:

- pci subsystem
acpi_os_allocate
	<-acpi_ut_allocate
		<-ACPI_ALLOCATE_ZEROED
			<-acpi_evaluate_object
				<-__acpi_bus_set_power
					<-acpi_bus_set_power
						<-acpi_pci_set_power_state
							<-platform_pci_set_power_state
								<-pci_platform_power_transition
									<-__pci_complete_power_transition
										<-pci_set_power_state
											<-pci_restore_standard_config
												<-pci_pm_runtime_resume
- usb subsystem
usb_get_status
	<-finish_port_resume
		<-usb_port_resume
			<-generic_resume
				<-usb_resume_device
					<-usb_resume_both
						<-usb_runtime_resume

- some individual usb drivers
usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....

That is just what I have found so far. Unfortunately, these allocations
can only be found by manual inspection for now, and there are probably
many more that have not been found, since any function in the resume
path (call tree) may allocate memory with GFP_KERNEL.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
v4:
	- fix comment
v3:
	- no change
v2:
        - remove changes on 'may_writepage' and 'may_swap' because they
          aren't related to this patchset and can't introduce I/O in the
          allocation path if GFP_IOFS is unset, so handling 'may_swap'
          and 'may_writepage' for GFP_NOIO or GFP_NOFS should be an
          mm-internal thing; let the mm folks deal with that, :-).

          It looks like clearing the two may_XXX flags only excludes dirty
          pages and anon pages from reclaiming, and that behaviour should
          be decided by the GFP flags, IMO.

        - unset GFP_IOFS in the try_to_free_pages() path, since
          alloc_page_buffers() and dma_alloc_from_contiguous() may drop
          into that path, as pointed out by KAMEZAWA Hiroyuki
v1:
        - take Minchan's change to avoid the check in alloc_page hot
          path

        - change the helpers' style into save/restore as suggested by
          Alan Stern
---
 include/linux/sched.h |   10 ++++++++++
 mm/page_alloc.c       |   10 +++++++++-
 mm/vmscan.c           |   12 ++++++++++++
 3 files changed, 31 insertions(+), 1 deletion(-)

Comments

Andrew Morton Nov. 6, 2012, 11:23 p.m. UTC | #1
On Sat,  3 Nov 2012 16:35:09 +0800
Ming Lei <ming.lei@canonical.com> wrote:

> This patch introduces PF_MEMALLOC_NOIO as a process flag (in the 'flags'
> field of 'struct task_struct'), so that a task can set the flag to avoid
> doing I/O inside memory allocations made in its context.
> 
> The patch tries to solve a deadlock problem caused by block devices,
> which may happen in at least the following situations:
> 
> - during block device runtime resume: if a memory allocation with
> GFP_KERNEL is done inside the runtime resume callback of any of the
> device's ancestors (or of the block device itself), a deadlock may be
> triggered inside the memory allocation, since it might not complete
> until the block device becomes active and the involved page I/O finishes.
> This situation was first pointed out by Alan Stern. It is not a good
> approach to convert all GFP_KERNEL[1] allocations in the path into
> GFP_NOIO, because several subsystems may be involved (for example, PCI,
> USB and SCSI may be involved for a USB mass storage device, and network
> devices are involved too in the iSCSI case).
> 
> - during block device runtime suspend: because runtime resume needs
> to wait for completion of a concurrent runtime suspend.
> 
> - during error handling of a USB mass storage device: a USB bus reset
> will be performed on the device, so there should not be any memory
> allocation with GFP_KERNEL during the bus reset, otherwise a deadlock
> similar to the above may be triggered. Unfortunately, in theory any
> USB device may include a mass storage interface, so this would require
> all USB interface drivers to handle the situation. In fact, most USB
> drivers don't know how to handle a bus reset on the device and don't
> provide .pre_reset() and .post_reset() callbacks at all, so the USB
> core has to unbind and rebind the driver for these devices. So it is
> still not practical to resort to GFP_NOIO for solving the problem.
> 
> The introduced solution can also be used by the block subsystem or by
> block drivers, for example, by setting the PF_MEMALLOC_NOIO flag before
> doing the actual I/O transfer.
> 
> It is not a good idea to convert all the GFP_KERNEL allocations in the
> affected paths into GFP_NOIO, because the functions doing them may be
> implemented as library code and called in many other contexts.
> 
> In fact, memalloc_noio() allows some of the current static GFP_NOIO
> allocations to be converted back into GFP_KERNEL in other, non-affected
> contexts; at least almost all GFP_NOIO allocations in the USB subsystem
> can be converted into GFP_KERNEL after applying this approach, so that
> allocations with __GFP_IO generally only happen in runtime resume, bus
> reset and block I/O transfer contexts.

It's unclear from the description why we're also clearing __GFP_FS in
this situation.

If we can avoid doing this then there will be a very small gain: there
are some situations in which a filesystem can clean pagecache without
performing I/O.


It doesn't appear that the patch will add overhead to the alloc/free
hotpaths, which is good.

> 
> ...
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1805,6 +1805,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
>  #define PF_FROZEN	0x00010000	/* frozen for system suspend */
>  #define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
>  #define PF_KSWAPD	0x00040000	/* I am kswapd */
> +#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
>  #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
>  #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
>  #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
> @@ -1842,6 +1843,15 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
>  #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
>  #define used_math() tsk_used_math(current)
>  
> +#define memalloc_noio() (current->flags & PF_MEMALLOC_NOIO)
> +#define memalloc_noio_save(flag) do { \
> +	(flag) = current->flags & PF_MEMALLOC_NOIO; \
> +	current->flags |= PF_MEMALLOC_NOIO; \
> +} while (0)
> +#define memalloc_noio_restore(flag) do { \
> +	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flag; \
> +} while (0)
> +

Again with the ghastly macros.  Please, do this properly in regular old
C, as previously discussed.  It really doesn't matter what daft things
local_irq_save() did 20 years ago.  Just do it right!

Also, you can probably put the unlikely() inside memalloc_noio() and
avoid repeating it at all the callsites.

And it might be neater to do:

/*
 * Nice comment goes here
 */
static inline gfp_t memalloc_noio_flags(gfp_t flags)
{
	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
		flags &= ~GFP_IOFS;
	return flags;
}
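
The save/restore pair could likewise become plain inline functions; one
possible form (a sketch only, assuming the same PF_MEMALLOC_NOIO semantics
as the macros above):

static inline unsigned int memalloc_noio_save(void)
{
	unsigned int flags = current->flags & PF_MEMALLOC_NOIO;

	current->flags |= PF_MEMALLOC_NOIO;
	return flags;
}

static inline void memalloc_noio_restore(unsigned int flags)
{
	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
}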

>   * task->jobctl flags
>   */
>
> ...
>
> @@ -2304,6 +2304,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  		.gfp_mask = sc.gfp_mask,
>  	};
>  
> +	if (unlikely(memalloc_noio())) {
> +		gfp_mask &= ~GFP_IOFS;
> +		sc.gfp_mask = gfp_mask;
> +		shrink.gfp_mask = sc.gfp_mask;
> +	}

We can avoid writing to shrink.gfp_mask twice.  And maybe sc.gfp_mask
as well.  Unclear, I didn't think about it too hard ;)

>  	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
>  
>  	/*
>
> ...
>
Ming Lei Nov. 7, 2012, 3:11 a.m. UTC | #2
On Wed, Nov 7, 2012 at 7:23 AM, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> It's unclear from the description why we're also clearing __GFP_FS in
> this situation.
>
> If we can avoid doing this then there will be a very small gain: there
> are some situations in which a filesystem can clean pagecache without
> performing I/O.

Firstly, the patch follows the policy used in the system suspend/resume
situation, in which __GFP_FS is cleared, and the problem here is basically
very similar to the one in the system PM path.

Secondly, inside shrink_page_list(), pageout() may be triggered on a dirty
anon page if __GFP_FS is set.

IMO, if performing I/O can be completely avoided while __GFP_FS is set, the
flag can be kept; otherwise it is better to clear it in this situation.

>
> It doesn't appear that the patch will add overhead to the alloc/free
> hotpaths, which is good.

Thanks to Minchan's previous comment.

>
>>
>> ...
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1805,6 +1805,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
>>  #define PF_FROZEN    0x00010000      /* frozen for system suspend */
>>  #define PF_FSTRANS   0x00020000      /* inside a filesystem transaction */
>>  #define PF_KSWAPD    0x00040000      /* I am kswapd */
>> +#define PF_MEMALLOC_NOIO 0x00080000  /* Allocating memory without IO involved */
>>  #define PF_LESS_THROTTLE 0x00100000  /* Throttle me less: I clean memory */
>>  #define PF_KTHREAD   0x00200000      /* I am a kernel thread */
>>  #define PF_RANDOMIZE 0x00400000      /* randomize virtual address space */
>> @@ -1842,6 +1843,15 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
>>  #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
>>  #define used_math() tsk_used_math(current)
>>
>> +#define memalloc_noio() (current->flags & PF_MEMALLOC_NOIO)
>> +#define memalloc_noio_save(flag) do { \
>> +     (flag) = current->flags & PF_MEMALLOC_NOIO; \
>> +     current->flags |= PF_MEMALLOC_NOIO; \
>> +} while (0)
>> +#define memalloc_noio_restore(flag) do { \
>> +     current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flag; \
>> +} while (0)
>> +
>
> Again with the ghastly macros.  Please, do this properly in regular old
> C, as previously discussed.  It really doesn't matter what daft things
> local_irq_save() did 20 years ago.  Just do it right!

OK, I will switch to inline functions in -v5.

>
> Also, you can probably put the unlikely() inside memalloc_noio() and
> avoid repeating it at all the callsites.
>
> And it might be neater to do:
>
> /*
>  * Nice comment goes here
>  */
> static inline gfp_t memalloc_noio_flags(gfp_t flags)
> {
>         if (unlikely(current->flags & PF_MEMALLOC_NOIO))
>                 flags &= ~GFP_IOFS;
>         return flags;
> }

But without the check at the call sites, some local variables would be
written twice, so it is better not to do it.

>
>>   * task->jobctl flags
>>   */
>>
>> ...
>>
>> @@ -2304,6 +2304,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>>               .gfp_mask = sc.gfp_mask,
>>       };
>>
>> +     if (unlikely(memalloc_noio())) {
>> +             gfp_mask &= ~GFP_IOFS;
>> +             sc.gfp_mask = gfp_mask;
>> +             shrink.gfp_mask = sc.gfp_mask;
>> +     }
>
> We can avoid writing to shrink.gfp_mask twice.  And maybe sc.gfp_mask
> as well.  Unclear, I didn't think about it too hard ;)

Yes, we can do that by initializing the 'shrink' local variable just after
the branch, so a single write is enough. Will do it in -v5.

Thanks,
--
Ming Lei
Andrew Morton Nov. 7, 2012, 3:48 a.m. UTC | #3
On Wed, 7 Nov 2012 11:11:24 +0800 Ming Lei <ming.lei@canonical.com> wrote:

> On Wed, Nov 7, 2012 at 7:23 AM, Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > It's unclear from the description why we're also clearing __GFP_FS in
> > this situation.
> >
> > If we can avoid doing this then there will be a very small gain: there
> > are some situations in which a filesystem can clean pagecache without
> > performing I/O.
> 
> Firstly, the patch follows the policy used in the system suspend/resume
> situation, in which __GFP_FS is cleared, and the problem here is basically
> very similar to the one in the system PM path.

I suspect that code is wrong.  Or at least, suboptimal.

> Secondly, inside shrink_page_list(), pageout() may be triggered on a dirty
> anon page if __GFP_FS is set.

pageout() should be called if GFP_FS is set or if GFP_IO is set and the
IO is against swap.

And that's what we want to happen: we want to enter the fs to try to
turn dirty pagecache into clean pagecache without doing IO.  If we in
fact enter the device drivers when GFP_IO was not set then that's a bug
which we should fix.
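
For reference, the gate in shrink_page_list() is roughly the following
(paraphrased from mm/vmscan.c of this era, not quoted verbatim):

	may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
		(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
	...
	/* dirty pages are only handed to pageout() when may_enter_fs holds */
	if (PageDirty(page) && !may_enter_fs)
		goto keep_locked;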

> IMO, if performing I/O can be completely avoided while __GFP_FS is set, the
> flag can be kept; otherwise it is better to clear it in this situation.

yup.

> >
> > Also, you can probably put the unlikely() inside memalloc_noio() and
> > avoid repeating it at all the callsites.
> >
> > And it might be neater to do:
> >
> > /*
> >  * Nice comment goes here
> >  */
> > static inline gfp_t memalloc_noio_flags(gfp_t flags)
> > {
> >         if (unlikely(current->flags & PF_MEMALLOC_NOIO))
> >                 flags &= ~GFP_IOFS;
> >         return flags;
> > }
> 
> But without the check at the call sites, some local variables would be
> written twice, so it is better not to do it.

I don't see why - we just modify the incoming gfp_t at the start of the
function, then use it.

It gets a bit tricky with those struct initialisations.  Things like

	struct foo bar = {
		.a = a1,
		.b = b1,
	};

should not be turned into

	struct foo bar = {
		.a = a1,
	};
	
	bar.b = b1;

and we don't want to do

	struct foo bar = { };

	bar.a = a1;
	bar.b = b1;

either, because these are indeed a double-write.  But we can do

	struct foo bar = {
		.flags = (flags = memalloc_noio_flags(flags)),
		.b = b1,
	};

which is a bit arcane but not toooo bad.  Have a think about it...
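
Applied to try_to_free_pages(), that could look like the following sketch
(only the lines shown change; the remaining initializers and the rest of
the function stay as they are today):

unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
				gfp_t gfp_mask, nodemask_t *nodemask)
{
	struct scan_control sc = {
		/* filter and store in a single write */
		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
		.may_writepage = !laptop_mode,
		.nodemask = nodemask,
		/* ... remaining fields unchanged ... */
	};
	struct shrink_control shrink = {
		.gfp_mask = sc.gfp_mask,	/* already filtered */
	};

	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
	/* rest of the function unchanged */
}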


Ming Lei Nov. 7, 2012, 4:35 a.m. UTC | #4
On Wed, Nov 7, 2012 at 11:48 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>>
>> Firstly, the patch follows the policy used in the system suspend/resume
>> situation, in which __GFP_FS is cleared, and the problem here is basically
>> very similar to the one in the system PM path.
>
> I suspect that code is wrong.  Or at least, suboptimal.
>
>> Secondly, inside shrink_page_list(), pageout() may be triggered on a dirty
>> anon page if __GFP_FS is set.
>
> pageout() should be called if GFP_FS is set or if GFP_IO is set and the
> IO is against swap.
>
> And that's what we want to happen: we want to enter the fs to try to
> turn dirty pagecache into clean pagecache without doing IO.  If we in
> fact enter the device drivers when GFP_IO was not set then that's a bug
> which we should fix.

OK, I got it, and I'll not clear GFP_FS in -v5.
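
For illustration, with __GFP_FS kept, the helper would then just clear
__GFP_IO (a sketch of the direction agreed here, not the posted code):

static inline gfp_t memalloc_noio_flags(gfp_t flags)
{
	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
		flags &= ~__GFP_IO;
	return flags;
}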

>
>> IMO, if performing I/O can be completely avoided while __GFP_FS is set, the
>> flag can be kept; otherwise it is better to clear it in this situation.
>
> yup.
>
>> >
>> > Also, you can probably put the unlikely() inside memalloc_noio() and
>> > avoid repeating it at all the callsites.
>> >
>> > And it might be neater to do:
>> >
>> > /*
>> >  * Nice comment goes here
>> >  */
>> > static inline gfp_t memalloc_noio_flags(gfp_t flags)
>> > {
>> >         if (unlikely(current->flags & PF_MEMALLOC_NOIO))
>> >                 flags &= ~GFP_IOFS;
>> >         return flags;
>> > }
>>
>> But without the check at the call sites, some local variables would be
>> written twice, so it is better not to do it.
>
> I don't see why - we just modify the incoming gfp_t at the start of the
> function, then use it.
>
> It gets a bit tricky with those struct initialisations.  Things like
>
>         struct foo bar = {
>                 .a = a1,
>                 .b = b1,
>         };
>
> should not be turned into
>
>         struct foo bar = {
>                 .a = a1,
>         };
>
>         bar.b = b1;
>
> and we don't want to do
>
>         struct foo bar = { };
>
>         bar.a = a1;
>         bar.b = b1;
>
> either, because these are indeed a double-write.  But we can do
>
>         struct foo bar = {
>                 .flags = (flags = memalloc_noio_flags(flags)),
>                 .b = b1,
>         };
>
> which is a bit arcane but not toooo bad.  Have a think about it...

Got it, memalloc_noio_flags() looks neater, and I will take it in v5.

Thanks,
--
Ming Lei

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fb27acd..283fe86 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1805,6 +1805,7 @@  extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
 #define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
 #define PF_KSWAPD	0x00040000	/* I am kswapd */
+#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
@@ -1842,6 +1843,15 @@  extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+#define memalloc_noio() (current->flags & PF_MEMALLOC_NOIO)
+#define memalloc_noio_save(flag) do { \
+	(flag) = current->flags & PF_MEMALLOC_NOIO; \
+	current->flags |= PF_MEMALLOC_NOIO; \
+} while (0)
+#define memalloc_noio_restore(flag) do { \
+	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flag; \
+} while (0)
+
 /*
  * task->jobctl flags
  */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 45c916b..3c47049 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2634,10 +2634,18 @@  retry_cpuset:
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
 			preferred_zone, migratetype);
-	if (unlikely(!page))
+	if (unlikely(!page)) {
+		/*
+		 * Runtime PM, block IO and its error handling path
+		 * can deadlock because I/O on the device might not
+		 * complete.
+		 */
+		if (unlikely(memalloc_noio()))
+			gfp_mask &= ~GFP_IOFS;
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
+	}
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 10090c8..035088a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2304,6 +2304,12 @@  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.gfp_mask = sc.gfp_mask,
 	};
 
+	if (unlikely(memalloc_noio())) {
+		gfp_mask &= ~GFP_IOFS;
+		sc.gfp_mask = gfp_mask;
+		shrink.gfp_mask = sc.gfp_mask;
+	}
+
 	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
 
 	/*
@@ -3304,6 +3310,12 @@  static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	};
 	unsigned long nr_slab_pages0, nr_slab_pages1;
 
+	if (unlikely(memalloc_noio())) {
+		gfp_mask &= ~GFP_IOFS;
+		sc.gfp_mask = gfp_mask;
+		shrink.gfp_mask = sc.gfp_mask;
+	}
+
 	cond_resched();
 	/*
 	 * We need to be able to allocate from the reserves for RECLAIM_SWAP