
[RESEND,V2,1/3] Add mmap flag to request pages are locked after page fault

Message ID 1433942810-7852-2-git-send-email-emunson@akamai.com
State Not Applicable
Delegated to: David Miller

Commit Message

Eric B Munson June 10, 2015, 1:26 p.m. UTC
The cost of faulting in all memory to be locked can be very high when
working with large mappings.  If only portions of the mapping will be
used, this can incur a high penalty for locking.

As an example of the large-file case, this is the usage pattern for a
large statistical language model (and it probably applies to other
statistical or graphical models as well).  As a security example,
consider any application transacting in data that cannot be swapped out
(credit card data, medical records, etc.).

This patch introduces the ability to request that pages are not
pre-faulted, but are placed on the unevictable LRU when they are finally
faulted in.

To keep accounting checks out of the page fault path, users are billed
for the entire mapping lock as if MAP_LOCKED was used.

Signed-off-by: Eric B Munson <emunson@akamai.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: linux-alpha@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-mm@kvack.org
Cc: linux-arch@vger.kernel.org
Cc: linux-api@vger.kernel.org
---
 arch/alpha/include/uapi/asm/mman.h   | 1 +
 arch/mips/include/uapi/asm/mman.h    | 1 +
 arch/parisc/include/uapi/asm/mman.h  | 1 +
 arch/powerpc/include/uapi/asm/mman.h | 1 +
 arch/sparc/include/uapi/asm/mman.h   | 1 +
 arch/tile/include/uapi/asm/mman.h    | 1 +
 arch/xtensa/include/uapi/asm/mman.h  | 1 +
 include/linux/mm.h                   | 1 +
 include/linux/mman.h                 | 3 ++-
 include/uapi/asm-generic/mman.h      | 1 +
 mm/mmap.c                            | 4 ++--
 mm/swap.c                            | 3 ++-
 12 files changed, 15 insertions(+), 4 deletions(-)
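
For illustration, a minimal sketch of the intended usage.  This assumes
a kernel with this patch applied; MAP_LOCKONFAULT is not in any released
headers, so the asm-generic value from the patch is supplied as a
fallback.  The mapping may still fail if RLIMIT_MEMLOCK is too low,
since the whole length is charged up front:

#include <string.h>
#include <sys/mman.h>

#ifndef MAP_LOCKONFAULT
#define MAP_LOCKONFAULT 0x80000  /* asm-generic value from this patch */
#endif

int main(void)
{
        size_t len = 1UL << 30;  /* large mapping, only partly used */

        /* No prefault: pages are locked as they are faulted in.  The
         * full length still counts against RLIMIT_MEMLOCK, as described
         * in the commit message above. */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKONFAULT,
                         -1, 0);
        if (buf == MAP_FAILED)
                return 1;

        /* Touch a small portion; only these pages become resident and
         * are placed on the unevictable LRU. */
        memset(buf, 0, 16 * 4096);

        munmap(buf, len);
        return 0;
}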

Comments

Michal Hocko June 18, 2015, 3:29 p.m. UTC | #1
[Sorry for the late reply - I meant to answer in the previous threads
 but something always preempted me from that]

On Wed 10-06-15 09:26:48, Eric B Munson wrote:
> The cost of faulting in all memory to be locked can be very high when
> working with large mappings.  If only portions of the mapping will be
> used, this can incur a high penalty for locking.
> 
> As an example of the large-file case, this is the usage pattern for a
> large statistical language model (and it probably applies to other
> statistical or graphical models as well).  As a security example,
> consider any application transacting in data that cannot be swapped out
> (credit card data, medical records, etc.).

Such a use case makes some sense to me but I am not sure the way you
implement it is the right one. This is another mlock-related flag for
mmap with a different semantic. You do not want to prefault, but is,
e.g., readahead or fault-around acceptable? I do not see anything in
your patch to handle those...

Wouldn't it be much more reasonable and straightforward to have
MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
explicitly disallow any form of pre-faulting? It would be usable for
other use cases than the MAP_LOCKED combination.

> This patch introduces the ability to request that pages are not
> pre-faulted, but are placed on the unevictable LRU when they are finally
> faulted in.
> 
> To keep accounting checks out of the page fault path, users are billed
> for the entire mapping lock as if MAP_LOCKED was used.
> 
> Signed-off-by: Eric B Munson <emunson@akamai.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: linux-alpha@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mips@linux-mips.org
> Cc: linux-parisc@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: sparclinux@vger.kernel.org
> Cc: linux-xtensa@linux-xtensa.org
> Cc: linux-mm@kvack.org
> Cc: linux-arch@vger.kernel.org
> Cc: linux-api@vger.kernel.org
> ---
>  arch/alpha/include/uapi/asm/mman.h   | 1 +
>  arch/mips/include/uapi/asm/mman.h    | 1 +
>  arch/parisc/include/uapi/asm/mman.h  | 1 +
>  arch/powerpc/include/uapi/asm/mman.h | 1 +
>  arch/sparc/include/uapi/asm/mman.h   | 1 +
>  arch/tile/include/uapi/asm/mman.h    | 1 +
>  arch/xtensa/include/uapi/asm/mman.h  | 1 +
>  include/linux/mm.h                   | 1 +
>  include/linux/mman.h                 | 3 ++-
>  include/uapi/asm-generic/mman.h      | 1 +
>  mm/mmap.c                            | 4 ++--
>  mm/swap.c                            | 3 ++-
>  12 files changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 0086b47..15e96e1 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -30,6 +30,7 @@
>  #define MAP_NONBLOCK	0x40000		/* do not block on IO */
>  #define MAP_STACK	0x80000		/* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB	0x100000	/* create a huge page mapping */
> +#define MAP_LOCKONFAULT	0x200000	/* Lock pages after they are faulted in, do not prefault */
>  
>  #define MS_ASYNC	1		/* sync memory asynchronously */
>  #define MS_SYNC		2		/* synchronous memory sync */
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index cfcb876..47846a5 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -48,6 +48,7 @@
>  #define MAP_NONBLOCK	0x20000		/* do not block on IO */
>  #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
> +#define MAP_LOCKONFAULT	0x100000	/* Lock pages after they are faulted in, do not prefault */
>  
>  /*
>   * Flags for msync
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index 294d251..1514cd7 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -24,6 +24,7 @@
>  #define MAP_NONBLOCK	0x20000		/* do not block on IO */
>  #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
> +#define MAP_LOCKONFAULT	0x100000	/* Lock pages after they are faulted in, do not prefault */
>  
>  #define MS_SYNC		1		/* synchronous memory sync */
>  #define MS_ASYNC	2		/* sync memory asynchronously */
> diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
> index 6ea26df..fce74fe 100644
> --- a/arch/powerpc/include/uapi/asm/mman.h
> +++ b/arch/powerpc/include/uapi/asm/mman.h
> @@ -27,5 +27,6 @@
>  #define MAP_NONBLOCK	0x10000		/* do not block on IO */
>  #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
> +#define MAP_LOCKONFAULT	0x80000		/* Lock pages after they are faulted in, do not prefault */
>  
>  #endif /* _UAPI_ASM_POWERPC_MMAN_H */
> diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
> index 0b14df3..12425d8 100644
> --- a/arch/sparc/include/uapi/asm/mman.h
> +++ b/arch/sparc/include/uapi/asm/mman.h
> @@ -22,6 +22,7 @@
>  #define MAP_NONBLOCK	0x10000		/* do not block on IO */
>  #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
> +#define MAP_LOCKONFAULT	0x80000		/* Lock pages after they are faulted in, do not prefault */
>  
>  
>  #endif /* _UAPI__SPARC_MMAN_H__ */
> diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h
> index 81b8fc3..ec04eaf 100644
> --- a/arch/tile/include/uapi/asm/mman.h
> +++ b/arch/tile/include/uapi/asm/mman.h
> @@ -29,6 +29,7 @@
>  #define MAP_DENYWRITE	0x0800		/* ETXTBSY */
>  #define MAP_EXECUTABLE	0x1000		/* mark it as an executable */
>  #define MAP_HUGETLB	0x4000		/* create a huge page mapping */
> +#define MAP_LOCKONFAULT	0x8000		/* Lock pages after they are faulted in, do not prefault */
>  
>  
>  /*
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 201aec0..42d43cc 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -55,6 +55,7 @@
>  #define MAP_NONBLOCK	0x20000		/* do not block on IO */
>  #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
> +#define MAP_LOCKONFAULT	0x100000	/* Lock pages after they are faulted in, do not prefault */
>  #ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
>  # define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
>  					 * uninitialized */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0755b9f..3e31457 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -126,6 +126,7 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
>  #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
>  
> +#define VM_LOCKONFAULT	0x00001000	/* Lock the pages covered when they are faulted in */
>  #define VM_LOCKED	0x00002000
>  #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
>  
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index 16373c8..437264b 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -86,7 +86,8 @@ calc_vm_flag_bits(unsigned long flags)
>  {
>  	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
>  	       _calc_vm_trans(flags, MAP_DENYWRITE,  VM_DENYWRITE ) |
> -	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    );
> +	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
> +	       _calc_vm_trans(flags, MAP_LOCKONFAULT,VM_LOCKONFAULT);
>  }
>  
>  unsigned long vm_commit_limit(void);
> diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
> index e9fe6fd..fc4e586 100644
> --- a/include/uapi/asm-generic/mman.h
> +++ b/include/uapi/asm-generic/mman.h
> @@ -12,6 +12,7 @@
>  #define MAP_NONBLOCK	0x10000		/* do not block on IO */
>  #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
> +#define MAP_LOCKONFAULT	0x80000		/* Lock pages after they are faulted in, do not prefault */
>  
>  /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index bb50cac..ba1a6bf 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1233,7 +1233,7 @@ static inline int mlock_future_check(struct mm_struct *mm,
>  	unsigned long locked, lock_limit;
>  
>  	/*  mlock MCL_FUTURE? */
> -	if (flags & VM_LOCKED) {
> +	if (flags & (VM_LOCKED | VM_LOCKONFAULT)) {
>  		locked = len >> PAGE_SHIFT;
>  		locked += mm->locked_vm;
>  		lock_limit = rlimit(RLIMIT_MEMLOCK);
> @@ -1301,7 +1301,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
>  	vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
>  			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
>  
> -	if (flags & MAP_LOCKED)
> +	if (flags & (MAP_LOCKED | MAP_LOCKONFAULT))
>  		if (!can_do_mlock())
>  			return -EPERM;
>  
> diff --git a/mm/swap.c b/mm/swap.c
> index a7251a8..07c905e 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -711,7 +711,8 @@ void lru_cache_add_active_or_unevictable(struct page *page,
>  {
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
>  
> -	if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
> +	if (likely((vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) == 0) ||
> +		   (vma->vm_flags & VM_SPECIAL)) {
>  		SetPageActive(page);
>  		lru_cache_add(page);
>  		return;
> -- 
> 1.9.1
>
Eric B Munson June 18, 2015, 8:30 p.m. UTC | #2
On Thu, 18 Jun 2015, Michal Hocko wrote:

> [Sorry for the late reply - I meant to answer in the previous threads
>  but something always preempted me from that]
> 
> On Wed 10-06-15 09:26:48, Eric B Munson wrote:
> > The cost of faulting in all memory to be locked can be very high when
> > working with large mappings.  If only portions of the mapping will be
> > used, this can incur a high penalty for locking.
> > 
> > As an example of the large-file case, this is the usage pattern for a
> > large statistical language model (and it probably applies to other
> > statistical or graphical models as well).  As a security example,
> > consider any application transacting in data that cannot be swapped out
> > (credit card data, medical records, etc.).
> 
> Such a use case makes some sense to me but I am not sure the way you
> implement it is the right one. This is another mlock-related flag for
> mmap with a different semantic. You do not want to prefault, but is,
> e.g., readahead or fault-around acceptable? I do not see anything in
> your patch to handle those...

We haven't bumped into readahead or fault around causing performance
problems for us.  If they cause problems for users when LOCKONFAULT is
in use, then we can address them.

> 
> Wouldn't it be much more reasonable and straightforward to have
> MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
> explicitly disallow any form of pre-faulting? It would be usable for
> other use cases than the MAP_LOCKED combination.

I don't see a clear case for it being more reasonable, it is one
possible way to solve the problem.  But I think it leaves us in an even
more awkward state WRT VMA flags.  As you noted in your fix for the
mmap() man page, one can get into a state where a VMA is VM_LOCKED, but
not present.  Having VM_LOCKONFAULT states that this was intentional, if
we go to using MAP_FAULTPOPULATE instead of MAP_LOCKONFAULT, we no
longer set VM_LOCKONFAULT (unless we want to start mapping it to the
presence of two MAP_ flags).  This can make detecting the MAP_LOCKED +
populate failure state harder.

If this is the preferred path for mmap(), I am fine with that.  However,
I would like to see the new system calls that Andrew mentioned (and that
I am testing patches for) go in as well.  That way we give users the
ability to request VM_LOCKONFAULT for memory allocated using something
other than mmap.
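
For concreteness, a hypothetical sketch of the interface being alluded
to here -- the names and flag values are placeholders, not a posted ABI:

#include <stddef.h>  /* for size_t */

/* Hypothetical flags for a flags-taking mlock(). */
#define MLOCK_LOCKED  0x01  /* populate and lock the range now */
#define MLOCK_ONFAULT 0x02  /* lock pages only as they fault in */

/* Would allow requesting lock-on-fault for memory not obtained via
 * mmap(), e.g. malloc'd memory. */
int mlock2(const void *addr, size_t len, int flags);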

[...]
Michal Hocko June 19, 2015, 2:57 p.m. UTC | #3
On Thu 18-06-15 16:30:48, Eric B Munson wrote:
> On Thu, 18 Jun 2015, Michal Hocko wrote:
[...]
> > Wouldn't it be much more reasonable and straightforward to have
> > MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
> > explicitly disallow any form of pre-faulting? It would be usable for
> > other use cases than the MAP_LOCKED combination.
> 
> I don't see a clear case for it being more reasonable, it is one
> possible way to solve the problem.

MAP_FAULTPOPULATE would be usable for other cases as well. E.g. fault
around is an all-or-nothing feature. Either all mappings (which support
this) fault around or none. There is no way to tell the kernel that
this particular mapping shouldn't fault around. I haven't seen such a
request yet but we have seen requests to have a way to opt out from
a global policy in the past (e.g. per-process opt out from THP). So
I can imagine somebody will come with a request to opt out from any
speculative operations on the mapped area in the future.

> But I think it leaves us in an even
> more awkward state WRT VMA flags.  As you noted in your fix for the
> mmap() man page, one can get into a state where a VMA is VM_LOCKED, but
> not present.  Having VM_LOCKONFAULT states that this was intentional, if
> we go to using MAP_FAULTPOPULATE instead of MAP_LOCKONFAULT, we no
> longer set VM_LOCKONFAULT (unless we want to start mapping it to the
> presence of two MAP_ flags).  This can make detecting the MAP_LOCKED +
> populate failure state harder.

I am not sure I understand your point here. Could you be more specific
how would you check for that and what for?

From my understanding MAP_LOCKONFAULT is essentially
MAP_FAULTPOPULATE|MAP_LOCKED with a quite obvious semantic (unlike
single MAP_LOCKED unfortunately). I would love to also have
MAP_LOCKED|MAP_POPULATE (aka full mlock semantic) but I am really
skeptical considering how my previous attempt to make MAP_POPULATE
reasonable went.

> If this is the preferred path for mmap(), I am fine with that. 

> However,
> I would like to see the new system calls that Andrew mentioned (and that
> I am testing patches for) go in as well. 

mlock with flags sounds like a good step but I am not sure it will make
sense in the future. POSIX has screwed that and I am not sure how many
applications would use it. This ship has sailed a long time ago.

> That way we give users the
> ability to request VM_LOCKONFAULT for memory allocated using something
> other than mmap.

mmap(MAP_FAULTPOPULATE); mlock() would have the same semantic even
without changing mlock syscall.
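
A sketch of that combination (MAP_FAULTPOPULATE is only a proposal at
this point; the flag and its value below are purely illustrative, and
whether mlock() would then still populate is exactly what is questioned
later in this thread):

#include <stddef.h>
#include <sys/mman.h>

#define MAP_FAULTPOPULATE 0x200000  /* proposed flag, made-up value */

void *lock_on_fault(size_t len)
{
        /* Map with no pre-faulting, read-ahead or fault-around... */
        void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_FAULTPOPULATE,
                          -1, 0);
        if (addr == MAP_FAILED)
                return NULL;

        /* ...then mlock(); under the semantic suggested here, pages
         * would be locked as they fault in rather than populated by
         * this call. */
        mlock(addr, len);
        return addr;
}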
 
[...]
Eric B Munson June 19, 2015, 4:43 p.m. UTC | #4
On Fri, 19 Jun 2015, Michal Hocko wrote:

> On Thu 18-06-15 16:30:48, Eric B Munson wrote:
> > On Thu, 18 Jun 2015, Michal Hocko wrote:
> [...]
> > > Wouldn't it be much more reasonable and straightforward to have
> > > MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
> > > explicitly disallow any form of pre-faulting? It would be usable for
> > > other use cases than the MAP_LOCKED combination.
> > 
> > I don't see a clear case for it being more reasonable, it is one
> > possible way to solve the problem.
> 
> MAP_FAULTPOPULATE would be usable for other cases as well. E.g. fault
> around is an all-or-nothing feature. Either all mappings (which support
> this) fault around or none. There is no way to tell the kernel that
> this particular mapping shouldn't fault around. I haven't seen such a
> request yet but we have seen requests to have a way to opt out from
> a global policy in the past (e.g. per-process opt out from THP). So
> I can imagine somebody will come with a request to opt out from any
> speculative operations on the mapped area in the future.
> 
> > But I think it leaves us in an even
> > more awkward state WRT VMA flags.  As you noted in your fix for the
> > mmap() man page, one can get into a state where a VMA is VM_LOCKED, but
> > not present.  Having VM_LOCKONFAULT states that this was intentional, if
> > we go to using MAP_FAULTPOPULATE instead of MAP_LOCKONFAULT, we no
> > longer set VM_LOCKONFAULT (unless we want to start mapping it to the
> > presence of two MAP_ flags).  This can make detecting the MAP_LOCKED +
> > populate failure state harder.
> 
> I am not sure I understand your point here. Could you be more specific
> how would you check for that and what for?

My thought on detecting was that someone might want to know if they had
a VMA that was VM_LOCKED but had not been made present because of a
failure in mmap.  We don't have a way today, but adding VM_LOCKONFAULT
is at least explicit about what is happening which would make detecting
the VM_LOCKED but not present state easier.  This assumes that
MAP_FAULTPOPULATE does not translate to a VMA flag, but it sounds like
it would have to.

> 
> From my understanding MAP_LOCKONFAULT is essentially
> MAP_FAULTPOPULATE|MAP_LOCKED with a quite obvious semantic (unlike
> single MAP_LOCKED unfortunately). I would love to also have
> MAP_LOCKED|MAP_POPULATE (aka full mlock semantic) but I am really
> skeptical considering how my previous attempt to make MAP_POPULATE
> reasonable went.

Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the
new MAP_LOCKONFAULT flag (or both)?  If you prefer that MAP_LOCKED |
MAP_FAULTPOPULATE means that VM_LOCKONFAULT is set, I am fine with that
instead of introducing MAP_LOCKONFAULT.  I went with the new flag
because to date, we have a one to one mapping of MAP_* to VM_* flags.

> 
> > If this is the preferred path for mmap(), I am fine with that. 
> 
> > However,
> > I would like to see the new system calls that Andrew mentioned (and that
> > I am testing patches for) go in as well. 
> 
> mlock with flags sounds like a good step but I am not sure it will make
> sense in the future. POSIX has screwed that and I am not sure how many
> applications would use it. This ship has sailed a long time ago.

I don't know either, but the code is the question, right?  I know that
we have at least one team that wants it here.

> 
> > That way we give users the
> > ability to request VM_LOCKONFAULT for memory allocated using something
> > other than mmap.
> 
> mmap(MAP_FAULTPOPULATE); mlock() would have the same semantic even
> without changing mlock syscall.

That is true as long as MAP_FAULTPOPULATE sets a flag in the VMA(s).  It
doesn't cover the actual case I was asking about, which is how do I get
lock on fault on malloc'd memory?

[...]
Michal Hocko June 22, 2015, 12:38 p.m. UTC | #5
On Fri 19-06-15 12:43:33, Eric B Munson wrote:
> On Fri, 19 Jun 2015, Michal Hocko wrote:
> 
> > On Thu 18-06-15 16:30:48, Eric B Munson wrote:
> > > On Thu, 18 Jun 2015, Michal Hocko wrote:
> > [...]
> > > > Wouldn't it be much more reasonable and straightforward to have
> > > > MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
> > > > explicitly disallow any form of pre-faulting? It would be usable for
> > > > other use cases than the MAP_LOCKED combination.
> > > 
> > > I don't see a clear case for it being more reasonable, it is one
> > > possible way to solve the problem.
> > 
> > MAP_FAULTPOPULATE would be usable for other cases as well. E.g. fault
> > around is an all-or-nothing feature. Either all mappings (which support
> > this) fault around or none. There is no way to tell the kernel that
> > this particular mapping shouldn't fault around. I haven't seen such a
> > request yet but we have seen requests to have a way to opt out from
> > a global policy in the past (e.g. per-process opt out from THP). So
> > I can imagine somebody will come with a request to opt out from any
> > speculative operations on the mapped area in the future.
> > 
> > > But I think it leaves us in an even
> > > more awkward state WRT VMA flags.  As you noted in your fix for the
> > > mmap() man page, one can get into a state where a VMA is VM_LOCKED, but
> > > not present.  Having VM_LOCKONFAULT states that this was intentional, if
> > > we go to using MAP_FAULTPOPULATE instead of MAP_LOCKONFAULT, we no
> > > longer set VM_LOCKONFAULT (unless we want to start mapping it to the
> > > presence of two MAP_ flags).  This can make detecting the MAP_LOCKED +
> > > populate failure state harder.
> > 
> > I am not sure I understand your point here. Could you be more specific
> > how would you check for that and what for?
> 
> My thought on detecting was that someone might want to know if they had
> a VMA that was VM_LOCKED but had not been made present because of a
> failure in mmap.  We don't have a way today, but adding VM_LOCKONFAULT
> is at least explicit about what is happening which would make detecting
> the VM_LOCKED but not present state easier. 

One could use /proc/<pid>/pagemap to query the residency.
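
For reference, pagemap (Documentation/vm/pagemap.txt) exposes one 64-bit
entry per virtual page of the process, with bit 63 set when the page is
present.  A minimal residency check along those lines:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Return 1 if the page containing vaddr is resident, 0 if not,
 * -1 on error. */
static int page_resident(uintptr_t vaddr)
{
        uint64_t entry;
        long psize = sysconf(_SC_PAGESIZE);
        int fd = open("/proc/self/pagemap", O_RDONLY);

        if (fd < 0)
                return -1;
        if (pread(fd, &entry, sizeof(entry),
                  (vaddr / psize) * sizeof(entry)) != sizeof(entry)) {
                close(fd);
                return -1;
        }
        close(fd);
        return (int)((entry >> 63) & 1);
}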

> This assumes that
> MAP_FAULTPOPULATE does not translate to a VMA flag, but it sounds like
> it would have to.

Yes, it would have to have a VM flag for the vma.

> > From my understanding MAP_LOCKONFAULT is essentially
> > MAP_FAULTPOPULATE|MAP_LOCKED with a quite obvious semantic (unlike
> > single MAP_LOCKED unfortunately). I would love to also have
> > MAP_LOCKED|MAP_POPULATE (aka full mlock semantic) but I am really
> > skeptical considering how my previous attempt to make MAP_POPULATE
> > reasonable went.
> 
> Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the
> new MAP_LOCKONFAULT flag (or both)? 

I thought the MAP_FAULTPOPULATE (or any other better name) would
directly translate into VM_FAULTPOPULATE and wouldn't be tied to the
locked semantic. We already have VM_LOCKED for that. The direct effect
of the flag would be to prevent population other than by the direct
page fault - including any speculative actions like fault around or
read-ahead.

> If you prefer that MAP_LOCKED |
> MAP_FAULTPOPULATE means that VM_LOCKONFAULT is set, I am fine with that
> instead of introducing MAP_LOCKONFAULT.  I went with the new flag
> because to date, we have a one to one mapping of MAP_* to VM_* flags.
> 
> > 
> > > If this is the preferred path for mmap(), I am fine with that. 
> > 
> > > However,
> > > I would like to see the new system calls that Andrew mentioned (and that
> > > I am testing patches for) go in as well. 
> > 
> > mlock with flags sounds like a good step but I am not sure it will make
> > sense in the future. POSIX has screwed that and I am not sure how many
> > applications would use it. This ship has sailed a long time ago.
> 
> I don't know either, but the code is the question, right?  I know that
> we have at least one team that wants it here.
> 
> > 
> > > That way we give users the
> > > ability to request VM_LOCKONFAULT for memory allocated using something
> > > other than mmap.
> > 
> > mmap(MAP_FAULTPOPULATE); mlock() would have the same semantic even
> > without changing mlock syscall.
> 
> That is true as long as MAP_FAULTPOPULATE sets a flag in the VMA(s).  It
> doesn't cover the actual case I was asking about, which is how do I get
> lock on fault on malloc'd memory?

OK I see your point now. We would indeed need a flag argument for mlock.
Eric B Munson June 22, 2015, 2:18 p.m. UTC | #6
On Mon, 22 Jun 2015, Michal Hocko wrote:

> On Fri 19-06-15 12:43:33, Eric B Munson wrote:
> > On Fri, 19 Jun 2015, Michal Hocko wrote:
> > 
> > > On Thu 18-06-15 16:30:48, Eric B Munson wrote:
> > > > On Thu, 18 Jun 2015, Michal Hocko wrote:
> > > [...]
> > > > > Wouldn't it be much more reasonable and straightforward to have
> > > > > MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
> > > > > explicitly disallow any form of pre-faulting? It would be usable for
> > > > > other use cases than the MAP_LOCKED combination.
> > > > 
> > > > I don't see a clear case for it being more reasonable, it is one
> > > > possible way to solve the problem.
> > > 
> > > MAP_FAULTPOPULATE would be usable for other cases as well. E.g. fault
> > > around is an all-or-nothing feature. Either all mappings (which support
> > > this) fault around or none. There is no way to tell the kernel that
> > > this particular mapping shouldn't fault around. I haven't seen such a
> > > request yet but we have seen requests to have a way to opt out from
> > > a global policy in the past (e.g. per-process opt out from THP). So
> > > I can imagine somebody will come with a request to opt out from any
> > > speculative operations on the mapped area in the future.
> > > 
> > > > But I think it leaves us in an even
> > > > more awkward state WRT VMA flags.  As you noted in your fix for the
> > > > mmap() man page, one can get into a state where a VMA is VM_LOCKED, but
> > > > not present.  Having VM_LOCKONFAULT states that this was intentional, if
> > > > we go to using MAP_FAULTPOPULATE instead of MAP_LOCKONFAULT, we no
> > > > longer set VM_LOCKONFAULT (unless we want to start mapping it to the
> > > > presence of two MAP_ flags).  This can make detecting the MAP_LOCKED +
> > > > populate failure state harder.
> > > 
> > > I am not sure I understand your point here. Could you be more specific
> > > how would you check for that and what for?
> > 
> > My thought on detecting was that someone might want to know if they had
> > a VMA that was VM_LOCKED but had not been made present because of a
> > failure in mmap.  We don't have a way today, but adding VM_LOCKONFAULT
> > is at least explicit about what is happening which would make detecting
> > the VM_LOCKED but not present state easier. 
> 
> One could use /proc/<pid>/pagemap to query the residency.
> 
> > This assumes that
> > MAP_FAULTPOPULATE does not translate to a VMA flag, but it sounds like
> > it would have to.
> 
> Yes, it would have to have a VM flag for the vma.
> 
> > > From my understanding MAP_LOCKONFAULT is essentially
> > > MAP_FAULTPOPULATE|MAP_LOCKED with a quite obvious semantic (unlike
> > > single MAP_LOCKED unfortunately). I would love to also have
> > > MAP_LOCKED|MAP_POPULATE (aka full mlock semantic) but I am really
> > > skeptical considering how my previous attempt to make MAP_POPULATE
> > > reasonable went.
> > 
> > Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the
> > new MAP_LOCKONFAULT flag (or both)? 
> 
> I thought the MAP_FAULTPOPULATE (or any other better name) would
> directly translate into VM_FAULTPOPULATE and wouldn't be tight to the
> locked semantic. We already have VM_LOCKED for that. The direct effect
> of the flag would be to prevent from population other than the direct
> page fault - including any speculative actions like fault around or
> read-ahead.

I like the ability to control other speculative population, but I am not
sure about overloading it with the VM_LOCKONFAULT case.  Here is my
concern.  If we are using VM_FAULTPOPULATE | VM_LOCKED to denote
LOCKONFAULT, how can we tell the difference between someone that wants
to avoid read-ahead and wants to use mlock()?  This might lead to some
interesting states with mlock() and munlock() that take flags.  For
instance, using VM_LOCKONFAULT mlock(MLOCK_ONFAULT) followed by
munlock(MLOCK_LOCKED) leaves the VMAs in the same state with
VM_LOCKONFAULT set.  If we use VM_FAULTPOPULATE, the same pair of calls
would clear VM_LOCKED, but leave VM_FAULTPOPULATE.  It may not matter in
the end, but I am concerned about the subtleties here.
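
Restating the concern as a sequence (the three-argument mlock/munlock
calls and the flag names are the hypothetical interface discussed above,
not an existing API):

/* With a dedicated VM_LOCKONFAULT flag: */
mlock(addr, len, MLOCK_ONFAULT);        /* VM_LOCKONFAULT set */
munlock(addr, len, MLOCK_LOCKED);       /* VM_LOCKONFAULT still set */

/* With VM_FAULTPOPULATE | VM_LOCKED standing in for lock-on-fault: */
mlock(addr, len, MLOCK_ONFAULT);        /* VM_FAULTPOPULATE | VM_LOCKED set */
munlock(addr, len, MLOCK_LOCKED);       /* VM_LOCKED cleared; only
                                         * VM_FAULTPOPULATE remains */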

[...]
Vlastimil Babka June 23, 2015, 12:45 p.m. UTC | #7
On 06/22/2015 04:18 PM, Eric B Munson wrote:
> On Mon, 22 Jun 2015, Michal Hocko wrote:
>
>> On Fri 19-06-15 12:43:33, Eric B Munson wrote:
>>> On Fri, 19 Jun 2015, Michal Hocko wrote:
>>>
>>>> On Thu 18-06-15 16:30:48, Eric B Munson wrote:
>>>>> On Thu, 18 Jun 2015, Michal Hocko wrote:
>>>> [...]
>>>>>> Wouldn't it be much more reasonable and straightforward to have
>>>>>> MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
>>>>>> explicitly disallow any form of pre-faulting? It would be usable for
>>>>>> other use cases than the MAP_LOCKED combination.
>>>>>
>>>>> I don't see a clear case for it being more reasonable, it is one
>>>>> possible way to solve the problem.
>>>>
>>>> MAP_FAULTPOPULATE would be usable for other cases as well. E.g. fault
>>>> around is an all-or-nothing feature. Either all mappings (which support
>>>> this) fault around or none. There is no way to tell the kernel that
>>>> this particular mapping shouldn't fault around. I haven't seen such a
>>>> request yet but we have seen requests to have a way to opt out from
>>>> a global policy in the past (e.g. per-process opt out from THP). So
>>>> I can imagine somebody will come with a request to opt out from any
>>>> speculative operations on the mapped area in the future.

That sounds like something where a new madvise() flag would make more
sense than a new mmap flag; conflating it with locking behavior would
lead to all kinds of weird corner cases, as Eric mentioned.

>>>>
>>>>> But I think it leaves us in an even
>>>>> more awkward state WRT VMA flags.  As you noted in your fix for the
>>>>> mmap() man page, one can get into a state where a VMA is VM_LOCKED, but
>>>>> not present.  Having VM_LOCKONFAULT states that this was intentional, if
>>>>> we go to using MAP_FAULTPOPULATE instead of MAP_LOCKONFAULT, we no
>>>>> longer set VM_LOCKONFAULT (unless we want to start mapping it to the
>>>>> presence of two MAP_ flags).  This can make detecting the MAP_LOCKED +
>>>>> populate failure state harder.
>>>>
>>>> I am not sure I understand your point here. Could you be more specific
>>>> how would you check for that and what for?
>>>
>>> My thought on detecting was that someone might want to know if they had
>>> a VMA that was VM_LOCKED but had not been made present because of a
>>> failure in mmap.  We don't have a way today, but adding VM_LOCKONFAULT
>>> is at least explicit about what is happening which would make detecting
>>> the VM_LOCKED but not present state easier.
>>
>> One could use /proc/<pid>/pagemap to query the residency.

I think that's all too complex a scenario for too little gain. If
someone knows that mmap(MAP_LOCKED|MAP_POPULATE) is not perfect, he 
should either mlock() separately from mmap(), or fault the range 
manually with a for loop. Why try to detect if the corner case was hit?

>>
>>> This assumes that
>>> MAP_FAULTPOPULATE does not translate to a VMA flag, but it sounds like
>>> it would have to.
>>
>> Yes, it would have to have a VM flag for the vma.

So with your approach, VM_LOCKED flag is enough, right? The new MAP_ / 
MLOCK_ flags just cause setting VM_LOCKED to not fault the whole vma, 
but otherwise nothing changes.

If that's true, I think it's better than a new vma flag.

>>
>>>>  From my understanding MAP_LOCKONFAULT is essentially
>>>> MAP_FAULTPOPULATE|MAP_LOCKED with a quite obvious semantic (unlike
>>>> single MAP_LOCKED unfortunately). I would love to also have
>>>> MAP_LOCKED|MAP_POPULATE (aka full mlock semantic) but I am really
>>>> skeptical considering how my previous attempt to make MAP_POPULATE
>>>> reasonable went.
>>>
>>> Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the
>>> new MAP_LOCKONFAULT flag (or both)?
>>
>> I thought the MAP_FAULTPOPULATE (or any other better name) would
>> directly translate into VM_FAULTPOPULATE and wouldn't be tied to the
>> locked semantic. We already have VM_LOCKED for that. The direct effect
>> of the flag would be to prevent population other than by the direct
>> page fault - including any speculative actions like fault around or
>> read-ahead.
>
> I like the ability to control other speculative population, but I am not
> sure about overloading it with the VM_LOCKONFAULT case.  Here is my
> concern.  If we are using VM_FAULTPOPULATE | VM_LOCKED to denote
> LOCKONFAULT, how can we tell the difference between someone that wants
> to avoid read-ahead and wants to use mlock()?  This might lead to some
> interesting states with mlock() and munlock() that take flags.  For
> instance, using VM_LOCKONFAULT mlock(MLOCK_ONFAULT) followed by
> munlock(MLOCK_LOCKED) leaves the VMAs in the same state with
> VM_LOCKONFAULT set.  If we use VM_FAULTPOPULATE, the same pair of calls
> would clear VM_LOCKED, but leave VM_FAULTPOPULATE.  It may not matter in
> the end, but I am concerned about the subtleties here.

Right.

[...]
Michal Hocko June 24, 2015, 8:50 a.m. UTC | #8
On Mon 22-06-15 10:18:06, Eric B Munson wrote:
> On Mon, 22 Jun 2015, Michal Hocko wrote:
> 
> > On Fri 19-06-15 12:43:33, Eric B Munson wrote:
[...]
> > > Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the
> > > new MAP_LOCKONFAULT flag (or both)? 
> > 
> > I thought the MAP_FAULTPOPULATE (or any other better name) would
> > directly translate into VM_FAULTPOPULATE and wouldn't be tied to the
> > locked semantic. We already have VM_LOCKED for that. The direct effect
> > of the flag would be to prevent population other than by the direct
> > page fault - including any speculative actions like fault around or
> > read-ahead.
> 
> I like the ability to control other speculative population, but I am not
> sure about overloading it with the VM_LOCKONFAULT case.  Here is my
> concern.  If we are using VM_FAULTPOPULATE | VM_LOCKED to denote
> LOCKONFAULT, how can we tell the difference between someone that wants
> to avoid read-ahead and wants to use mlock()?

Not sure I understand. Something like?
addr = mmap(MAP_FAULTPOPULATE) # To prevent speculative mappings into the vma
[...]
mlock(addr, len) # Now I want the full mlock semantic

and the latter to have the full mlock semantic and populate the given
area regardless of VM_FAULTPOPULATE being set on the vma? This would
be an interesting question because mlock man page clearly states the
semantic and that is to _always_ populate or fail. So I originally
thought that it would obey VM_FAULTPOPULATE but this needs a more
thinking.

> This might lead to some
> interesting states with mlock() and munlock() that take flags.  For
> instance, using VM_LOCKONFAULT mlock(MLOCK_ONFAULT) followed by
> munlock(MLOCK_LOCKED) leaves the VMAs in the same state with
> VM_LOCKONFAULT set. 

This is really confusing. Let me try to rephrase that. So you have
mlock(addr, len, MLOCK_ONFAULT)
munlock(addr, len, MLOCK_LOCKED)

IIUC you would expect the vma still being MLOCK_ONFAULT, right? Isn't
that behavior strange and unexpected? First of all, munlock has
traditionally dropped the lock on the address range (e.g. what should
happen if you did plain old munlock(addr, len)). But even without
that. You are trying to unlock something that hasn't been locked the
same way. So I would expect -EINVAL at least, if the two modes should be
really represented by different flags.

Or did you mean both types of lock, like:
mlock(addr, len, MLOCK_ONFAULT) | mmap(MAP_LOCKONFAULT)
mlock(addr, len, MLOCK_LOCKED)
munlock(addr, len, MLOCK_LOCKED)

and that should keep MLOCK_ONFAULT?
This sounds even more weird to me because that means that the vma in
question would be locked by two different mechanisms. MLOCK_LOCKED with
the "always populate" semantic would rule out MLOCK_ONFAULT so what
would be the meaning of the other flag then? Also what should regular
munlock(addr, len) without flags unlock? Both?

> If we use VM_FAULTPOPULATE, the same pair of calls
> would clear VM_LOCKED, but leave VM_FAULTPOPULATE.  It may not matter in
> the end, but I am concerned about the subtleties here.

This sounds like the proper behavior to me. munlock should simply always
drop VM_LOCKED and the VM_FAULTPOPULATE can live its separate life.

Btw. could you be more specific about semantic of m{un}lock(addr, len, flags)
you want to propose? The more I think about that the more I am unclear
about it, especially munlock behavior and possible flags.
Michal Hocko June 24, 2015, 9:47 a.m. UTC | #9
On Tue 23-06-15 14:45:17, Vlastimil Babka wrote:
> On 06/22/2015 04:18 PM, Eric B Munson wrote:
> >On Mon, 22 Jun 2015, Michal Hocko wrote:
> >
> >>On Fri 19-06-15 12:43:33, Eric B Munson wrote:
[...]
> >>>My thought on detecting was that someone might want to know if they had
> >>>a VMA that was VM_LOCKED but had not been made present because of a
> >>>failure in mmap.  We don't have a way today, but adding VM_LOCKONFAULT
> >>>is at least explicit about what is happening which would make detecting
> >>>the VM_LOCKED but not present state easier.
> >>
> >>One could use /proc/<pid>/pagemap to query the residency.
> 
> I think that's all too complex a scenario for too little gain. If someone
> knows that mmap(MAP_LOCKED|MAP_POPULATE) is not perfect, he should either
> mlock() separately from mmap(), or fault the range manually with a for loop.
> Why try to detect if the corner case was hit?

No idea. I have just offered a way to do that. I do not think it is
anyhow useful but who knows... I do agree that the mlock should be used
for the full mlock semantic.

> >>>This assumes that
> >>>MAP_FAULTPOPULATE does not translate to a VMA flag, but it sounds like
> >>>it would have to.
> >>
> >>Yes, it would have to have a VM flag for the vma.
> 
> So with your approach, VM_LOCKED flag is enough, right? The new MAP_ /
> MLOCK_ flags just cause setting VM_LOCKED to not fault the whole vma, but
> otherwise nothing changes.

VM_FAULTPOPULATE would have to be sticky to prevent other
speculative population of the mapping. I mean, is it OK to have a new
mlock semantic (on fault) which might still populate&lock memory which
hasn't been faulted directly? Who knows what kind of speculative things
we will do in the future and then find out that the semantic of
lock-on-fault is not usable anymore.

[...]
Eric B Munson June 25, 2015, 2:46 p.m. UTC | #10
On Wed, 24 Jun 2015, Michal Hocko wrote:

> On Mon 22-06-15 10:18:06, Eric B Munson wrote:
> > On Mon, 22 Jun 2015, Michal Hocko wrote:
> > 
> > > On Fri 19-06-15 12:43:33, Eric B Munson wrote:
> [...]
> > > > Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the
> > > > new MAP_LOCKONFAULT flag (or both)? 
> > > 
> > > I thought the MAP_FAULTPOPULATE (or any other better name) would
> > > directly translate into VM_FAULTPOPULATE and wouldn't be tied to the
> > > locked semantic. We already have VM_LOCKED for that. The direct effect
> > > of the flag would be to prevent population other than by the direct
> > > page fault - including any speculative actions like fault around or
> > > read-ahead.
> > 
> > I like the ability to control other speculative population, but I am not
> > sure about overloading it with the VM_LOCKONFAULT case.  Here is my
> > concern.  If we are using VM_FAULTPOPULATE | VM_LOCKED to denote
> > LOCKONFAULT, how can we tell the difference between someone that wants
> > to avoid read-ahead and wants to use mlock()?
> 
> Not sure I understand. Something like?
> addr = mmap(MAP_FAULTPOPULATE) # To prevent speculative mappings into the vma
> [...]
> mlock(addr, len) # Now I want the full mlock semantic

So this leaves us without the LOCKONFAULT semantics?  That is not at all
what I am looking for.  What I want is a way to express 3 possible
states of a VMA WRT locking: locked (populated and all pages on the
unevictable LRU), lock on fault (populated by page fault, pages that are
present are on the unevictable LRU, newly faulted pages are added to
same), and not locked.
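
As a sketch, those three states could be written out as (illustrative
only, not from any posted patch):

enum vma_lock_state {
        VMA_UNLOCKED,           /* no mlock semantics at all */
        VMA_LOCK_ONFAULT,       /* resident pages are unevictable; new
                                 * faults are locked as they occur */
        VMA_LOCKED,             /* fully populated and unevictable */
};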

> 
> and the latter to have the full mlock semantic and populate the given
> area regardless of VM_FAULTPOPULATE being set on the vma? This would
> be an interesting question because mlock man page clearly states the
> semantic and that is to _always_ populate or fail. So I originally
> thought that it would obey VM_FAULTPOPULATE but this needs a more
> thinking.
> 
> > This might lead to some
> > interesting states with mlock() and munlock() that take flags.  For
> > instance, using VM_LOCKONFAULT mlock(MLOCK_ONFAULT) followed by
> > munlock(MLOCK_LOCKED) leaves the VMAs in the same state with
> > VM_LOCKONFAULT set. 
> 
> This is really confusing. Let me try to rephrase that. So you have
> mlock(addr, len, MLOCK_ONFAULT)
> munlock(addr, len, MLOCK_LOCKED)
> 
> IIUC you would expect the vma still being MLOCK_ONFAULT, right? Isn't
> that behavior strange and unexpected? First of all, munlock has
> traditionally dropped the lock on the address range (e.g. what should
> happen if you did plain old munlock(addr, len)). But even without
> that. You are trying to unlock something that hasn't been locked the
> same way. So I would expect -EINVAL at least, if the two modes should be
> really represented by different flags.

I would expect it to remain MLOCK_ONFAULT because the user requested
munlock(addr, len, MLOCK_LOCKED).  It is not currently an error to
unlock memory that is not locked.  We do this because we do not require
the user to track which areas are locked; it is acceptable to unlock a
mostly locked area, holes included, with a single call to munlock that
spans the entire area.  The same semantics should hold for munlock with
flags.  If I have an area with MLOCK_LOCKED and MLOCK_ONFAULT
interleaved, it should be acceptable to clear the MLOCK_ONFAULT flag
from those areas with a single munlock call that spans the area, as in
the sketch below.
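
A sketch of that interleaving, using hypothetical mlock2()/munlock2()
wrappers for the unposted syscall series (names and flag values are
placeholders, not a final interface):

	/* hypothetical prototypes, the syscall series is not posted yet */
	int mlock2(const void *addr, size_t len, int flags);
	int munlock2(const void *addr, size_t len, int flags);

	void interleave(char *addr, size_t chunk, size_t len)
	{
		/* alternate LOCKED and ONFAULT chunks across the area */
		mlock2(addr, chunk, MLOCK_LOCKED);
		mlock2(addr + chunk, chunk, MLOCK_ONFAULT);
		/* ... and so on across [addr, addr + len) ... */

		/* a single spanning call clears MLOCK_ONFAULT wherever it
		 * is set and is not an error over the LOCKED chunks */
		munlock2(addr, len, MLOCK_ONFAULT);
	}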

On top of preserving the munlock semantics, the implementation would
need the ability to roll back an munlock call if it failed after
altering VMAs.  If we take the same interleaved area as before and
return -EINVAL the first time we hit an area that was MLOCK_LOCKED, how
do we restore the state of the VMAs we have already processed, and
possibly merged or split?
> 
> Or did you mean both types of lock, like:
> mlock(addr, len, MLOCK_ONFAULT) | mmap(MAP_LOCKONFAULT)
> mlock(addr, len, MLOCK_LOCKED)
> munlock(addr, len, MLOCK_LOCKED)
> 
> and that should keep MLOCK_ONFAULT?
> This sounds even weirder to me because it means that the vma in
> question would be locked by two different mechanisms. MLOCK_LOCKED with
> the "always populate" semantic would rule out MLOCK_ONFAULT, so what
> would the other flag mean then? Also, what should a regular
> munlock(addr, len) without flags unlock? Both?

This is indeed confusing and not what I was trying to illustrate, but
since you bring it up: mlockall() currently clears all flags and then
sets the new flags with each subsequent call.  mlock2 would use that
same behavior: if LOCKED were specified for an ONFAULT region, that
region would become LOCKED, and vice versa (see the sketch below).
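
With the same hypothetical mlock2() as above, the conversion would be:

	mlock2(addr, len, MLOCK_ONFAULT);	/* region is lock-on-fault */
	mlock2(addr, len, MLOCK_LOCKED);	/* same range becomes fully
						 * locked, ONFAULT cleared */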

I have the new system call set ready; I am waiting until rc1 to post it
so I can run the benchmarks again on a base more stable than the middle
of a merge window.  We should wait to hash out implementations until the
code is up rather than talk past each other here.

> 
> > If we use VM_FAULTPOPULATE, the same pair of calls
> > would clear VM_LOCKED, but leave VM_FAULTPOPULATE.  It may not matter in
> > the end, but I am concerned about the subtleties here.
> 
> This sounds like the proper behavior to me. munlock should simply always
> drop VM_LOCKED, and VM_FAULTPOPULATE can live its own separate life.
> 
> Btw. could you be more specific about the semantics of the
> m{un}lock(addr, len, flags) you want to propose? The more I think about
> it, the less clear I am, especially about munlock behavior and the
> possible flags.
> -- 
> Michal Hocko
> SUSE Labs

Patch

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 0086b47..15e96e1 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -30,6 +30,7 @@ 
 #define MAP_NONBLOCK	0x40000		/* do not block on IO */
 #define MAP_STACK	0x80000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x100000	/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x200000	/* Lock pages after they are faulted in, do not prefault */
 
 #define MS_ASYNC	1		/* sync memory asynchronously */
 #define MS_SYNC		2		/* synchronous memory sync */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index cfcb876..47846a5 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -48,6 +48,7 @@ 
 #define MAP_NONBLOCK	0x20000		/* do not block on IO */
 #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x100000	/* Lock pages after they are faulted in, do not prefault */
 
 /*
  * Flags for msync
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 294d251..1514cd7 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -24,6 +24,7 @@ 
 #define MAP_NONBLOCK	0x20000		/* do not block on IO */
 #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x100000	/* Lock pages after they are faulted in, do not prefault */
 
 #define MS_SYNC		1		/* synchronous memory sync */
 #define MS_ASYNC	2		/* sync memory asynchronously */
diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index 6ea26df..fce74fe 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -27,5 +27,6 @@ 
 #define MAP_NONBLOCK	0x10000		/* do not block on IO */
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x80000		/* Lock pages after they are faulted in, do not prefault */
 
 #endif /* _UAPI_ASM_POWERPC_MMAN_H */
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index 0b14df3..12425d8 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -22,6 +22,7 @@ 
 #define MAP_NONBLOCK	0x10000		/* do not block on IO */
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x80000		/* Lock pages after they are faulted in, do not prefault */
 
 
 #endif /* _UAPI__SPARC_MMAN_H__ */
diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h
index 81b8fc3..ec04eaf 100644
--- a/arch/tile/include/uapi/asm/mman.h
+++ b/arch/tile/include/uapi/asm/mman.h
@@ -29,6 +29,7 @@ 
 #define MAP_DENYWRITE	0x0800		/* ETXTBSY */
 #define MAP_EXECUTABLE	0x1000		/* mark it as an executable */
 #define MAP_HUGETLB	0x4000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x8000		/* Lock pages after they are faulted in, do not prefault */
 
 
 /*
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 201aec0..42d43cc 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -55,6 +55,7 @@ 
 #define MAP_NONBLOCK	0x20000		/* do not block on IO */
 #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x100000	/* Lock pages after they are faulted in, do not prefault */
 #ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
 # define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0755b9f..3e31457 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -126,6 +126,7 @@  extern unsigned int kobjsize(const void *objp);
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
 #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
 
+#define VM_LOCKONFAULT	0x00001000	/* Lock the pages covered when they are faulted in */
 #define VM_LOCKED	0x00002000
 #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
 
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 16373c8..437264b 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -86,7 +86,8 @@  calc_vm_flag_bits(unsigned long flags)
 {
 	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
 	       _calc_vm_trans(flags, MAP_DENYWRITE,  VM_DENYWRITE ) |
-	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    );
+	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
+	       _calc_vm_trans(flags, MAP_LOCKONFAULT,VM_LOCKONFAULT);
 }
 
 unsigned long vm_commit_limit(void);
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index e9fe6fd..fc4e586 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -12,6 +12,7 @@ 
 #define MAP_NONBLOCK	0x10000		/* do not block on IO */
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x80000		/* Lock pages after they are faulted in, do not prefault */
 
 /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
 
diff --git a/mm/mmap.c b/mm/mmap.c
index bb50cac..ba1a6bf 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1233,7 +1233,7 @@  static inline int mlock_future_check(struct mm_struct *mm,
 	unsigned long locked, lock_limit;
 
 	/*  mlock MCL_FUTURE? */
-	if (flags & VM_LOCKED) {
+	if (flags & (VM_LOCKED | VM_LOCKONFAULT)) {
 		locked = len >> PAGE_SHIFT;
 		locked += mm->locked_vm;
 		lock_limit = rlimit(RLIMIT_MEMLOCK);
@@ -1301,7 +1301,7 @@  unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
 	vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
-	if (flags & MAP_LOCKED)
+	if (flags & (MAP_LOCKED | MAP_LOCKONFAULT))
 		if (!can_do_mlock())
 			return -EPERM;
 
diff --git a/mm/swap.c b/mm/swap.c
index a7251a8..07c905e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -711,7 +711,8 @@  void lru_cache_add_active_or_unevictable(struct page *page,
 {
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
-	if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
+	if (likely((vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) == 0) ||
+		   (vma->vm_flags & VM_SPECIAL)) {
 		SetPageActive(page);
 		lru_cache_add(page);
 		return;