[v3,1/5] mm: add vm_insert_mixed_mkwrite()

Message ID 20170628220152.28161-2-ross.zwisler@linux.intel.com
State New
Headers show

Commit Message

Ross Zwisler June 28, 2017, 10:01 p.m.
To be able to use the common 4k zero page in DAX we need to have our PTE
fault path look more like our PMD fault path where a PTE entry can be
marked as dirty and writeable as it is first inserted, rather than waiting
for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call.

Right now we can rely on having a dax_pfn_mkwrite() call because we can
distinguish between these two cases in do_wp_page():

	case 1: 4k zero page => writable DAX storage
	case 2: read-only DAX storage => writeable DAX storage

This distinction is made by via vm_normal_page().  vm_normal_page() returns
false for the common 4k zero page, though, just as it does for DAX ptes.
Instead of special casing the DAX + 4k zero page case, we will simplify our
DAX PTE page fault sequence so that it matches our DAX PMD sequence, and
get rid of dax_pfn_mkwrite() completely.

This means that insert_pfn() needs to follow the lead of insert_pfn_pmd()
and allow us to pass in a 'mkwrite' flag.  If 'mkwrite' is set insert_pfn()
will do the work that was previously done by wp_page_reuse() as part of the
dax_pfn_mkwrite() call path.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 54 insertions(+), 5 deletions(-)

Comments

Jan Kara July 19, 2017, 2:16 p.m. | #1
On Wed 28-06-17 16:01:48, Ross Zwisler wrote:
> To be able to use the common 4k zero page in DAX we need to have our PTE
> fault path look more like our PMD fault path where a PTE entry can be
> marked as dirty and writeable as it is first inserted, rather than waiting
> for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call.
> 
> Right now we can rely on having a dax_pfn_mkwrite() call because we can
> distinguish between these two cases in do_wp_page():
> 
> 	case 1: 4k zero page => writable DAX storage
> 	case 2: read-only DAX storage => writeable DAX storage
> 
> This distinction is made by via vm_normal_page().  vm_normal_page() returns
> false for the common 4k zero page, though, just as it does for DAX ptes.
> Instead of special casing the DAX + 4k zero page case, we will simplify our
> DAX PTE page fault sequence so that it matches our DAX PMD sequence, and
> get rid of dax_pfn_mkwrite() completely.
> 
> This means that insert_pfn() needs to follow the lead of insert_pfn_pmd()
> and allow us to pass in a 'mkwrite' flag.  If 'mkwrite' is set insert_pfn()
> will do the work that was previously done by wp_page_reuse() as part of the
> dax_pfn_mkwrite() call path.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Just one small comment below.

> @@ -1658,14 +1658,26 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>  	if (!pte)
>  		goto out;
>  	retval = -EBUSY;
> -	if (!pte_none(*pte))
> -		goto out_unlock;
> +	if (!pte_none(*pte)) {
> +		if (mkwrite) {
> +			entry = *pte;
> +			goto out_mkwrite;

Can we maybe check here that (pte_pfn(*pte) == pfn_t_to_pfn(pfn)) and
return -EBUSY otherwise? That way we are sure insert_pfn() isn't doing
anything we don't expect and if I understand the code right, we need to
invalidate all zero page mappings at given file offset (via
unmap_mapping_range()) before mapping an allocated block there and thus the
case of filling the hole won't be affected by this?

								Honza
Ross Zwisler July 19, 2017, 5:51 p.m. | #2
On Wed, Jul 19, 2017 at 04:16:59PM +0200, Jan Kara wrote:
> On Wed 28-06-17 16:01:48, Ross Zwisler wrote:
> > To be able to use the common 4k zero page in DAX we need to have our PTE
> > fault path look more like our PMD fault path where a PTE entry can be
> > marked as dirty and writeable as it is first inserted, rather than waiting
> > for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call.
> > 
> > Right now we can rely on having a dax_pfn_mkwrite() call because we can
> > distinguish between these two cases in do_wp_page():
> > 
> > 	case 1: 4k zero page => writable DAX storage
> > 	case 2: read-only DAX storage => writeable DAX storage
> > 
> > This distinction is made by via vm_normal_page().  vm_normal_page() returns
> > false for the common 4k zero page, though, just as it does for DAX ptes.
> > Instead of special casing the DAX + 4k zero page case, we will simplify our
> > DAX PTE page fault sequence so that it matches our DAX PMD sequence, and
> > get rid of dax_pfn_mkwrite() completely.
> > 
> > This means that insert_pfn() needs to follow the lead of insert_pfn_pmd()
> > and allow us to pass in a 'mkwrite' flag.  If 'mkwrite' is set insert_pfn()
> > will do the work that was previously done by wp_page_reuse() as part of the
> > dax_pfn_mkwrite() call path.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> Just one small comment below.
> 
> > @@ -1658,14 +1658,26 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> >  	if (!pte)
> >  		goto out;
> >  	retval = -EBUSY;
> > -	if (!pte_none(*pte))
> > -		goto out_unlock;
> > +	if (!pte_none(*pte)) {
> > +		if (mkwrite) {
> > +			entry = *pte;
> > +			goto out_mkwrite;
> 
> Can we maybe check here that (pte_pfn(*pte) == pfn_t_to_pfn(pfn)) and
> return -EBUSY otherwise? That way we are sure insert_pfn() isn't doing
> anything we don't expect 

Sure, that's fine.  I'll add it as a WARN_ON_ONCE() so it's a very loud
failure.  If the pfns don't match I think we're insane (and would have been
insane prior to this patch series as well) because we are getting a page fault
and somehow have a different PFN already mapped at that location.

> and if I understand the code right, we need to
> invalidate all zero page mappings at given file offset (via
> unmap_mapping_range()) before mapping an allocated block there and thus the
> case of filling the hole won't be affected by this?

Correct.  Here's the call tree if we already have a zero page mapped and are
now faulting in an allocated block:

dax_iomap_pte_fault()
  dax_insert_mapping()
    dax_insert_mapping_entry()
      unmap_mapping_range() for our zero page
    vm_insert_mixed_mkwrite() installs the new PTE. We have pte_none(), so we
    				skip the new mkwrite goto in insert_pfn().
Ross Zwisler July 19, 2017, 9:58 p.m. | #3
On Wed, Jul 19, 2017 at 11:51:12AM -0600, Ross Zwisler wrote:
> On Wed, Jul 19, 2017 at 04:16:59PM +0200, Jan Kara wrote:
> > On Wed 28-06-17 16:01:48, Ross Zwisler wrote:
> > > To be able to use the common 4k zero page in DAX we need to have our PTE
> > > fault path look more like our PMD fault path where a PTE entry can be
> > > marked as dirty and writeable as it is first inserted, rather than waiting
> > > for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call.
> > > 
> > > Right now we can rely on having a dax_pfn_mkwrite() call because we can
> > > distinguish between these two cases in do_wp_page():
> > > 
> > > 	case 1: 4k zero page => writable DAX storage
> > > 	case 2: read-only DAX storage => writeable DAX storage
> > > 
> > > This distinction is made by via vm_normal_page().  vm_normal_page() returns
> > > false for the common 4k zero page, though, just as it does for DAX ptes.
> > > Instead of special casing the DAX + 4k zero page case, we will simplify our
> > > DAX PTE page fault sequence so that it matches our DAX PMD sequence, and
> > > get rid of dax_pfn_mkwrite() completely.
> > > 
> > > This means that insert_pfn() needs to follow the lead of insert_pfn_pmd()
> > > and allow us to pass in a 'mkwrite' flag.  If 'mkwrite' is set insert_pfn()
> > > will do the work that was previously done by wp_page_reuse() as part of the
> > > dax_pfn_mkwrite() call path.
> > > 
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > 
> > Just one small comment below.
> > 
> > > @@ -1658,14 +1658,26 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> > >  	if (!pte)
> > >  		goto out;
> > >  	retval = -EBUSY;
> > > -	if (!pte_none(*pte))
> > > -		goto out_unlock;
> > > +	if (!pte_none(*pte)) {
> > > +		if (mkwrite) {
> > > +			entry = *pte;
> > > +			goto out_mkwrite;
> > 
> > Can we maybe check here that (pte_pfn(*pte) == pfn_t_to_pfn(pfn)) and
> > return -EBUSY otherwise? That way we are sure insert_pfn() isn't doing
> > anything we don't expect 
> 
> Sure, that's fine.  I'll add it as a WARN_ON_ONCE() so it's a very loud
> failure.  If the pfns don't match I think we're insane (and would have been
> insane prior to this patch series as well) because we are getting a page fault
> and somehow have a different PFN already mapped at that location.

Umm...well, I added the warning, and during my regression testing hit a case
where the PFNs didn't match.  (generic/437 with both ext4 & XFS)

I've verified that this behavior happens with vanilla v4.12, so it's not a new
condition introduced by my patch.

I'm off tracking that down - there's a bug lurking somewhere, I think.
Vivek Goyal July 20, 2017, 3:26 p.m. | #4
On Wed, Jun 28, 2017 at 04:01:48PM -0600, Ross Zwisler wrote:
> To be able to use the common 4k zero page in DAX we need to have our PTE
> fault path look more like our PMD fault path where a PTE entry can be
> marked as dirty and writeable as it is first inserted, rather than waiting
> for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call.
> 
> Right now we can rely on having a dax_pfn_mkwrite() call because we can
> distinguish between these two cases in do_wp_page():
> 
> 	case 1: 4k zero page => writable DAX storage
> 	case 2: read-only DAX storage => writeable DAX storage
> 
> This distinction is made by via vm_normal_page().  vm_normal_page() returns
> false for the common 4k zero page, though, just as it does for DAX ptes.
> Instead of special casing the DAX + 4k zero page case, we will simplify our
> DAX PTE page fault sequence so that it matches our DAX PMD sequence, and
> get rid of dax_pfn_mkwrite() completely.
> 
> This means that insert_pfn() needs to follow the lead of insert_pfn_pmd()
> and allow us to pass in a 'mkwrite' flag.  If 'mkwrite' is set insert_pfn()
> will do the work that was previously done by wp_page_reuse() as part of the
> dax_pfn_mkwrite() call path.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  include/linux/mm.h |  2 ++
>  mm/memory.c        | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 54 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6f543a4..096052f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2293,6 +2293,8 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
>  			unsigned long pfn, pgprot_t pgprot);
>  int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
>  			pfn_t pfn);
> +int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr,
> +			pfn_t pfn);
>  int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
>  
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index bb11c47..de4aa71 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1646,7 +1646,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
>  EXPORT_SYMBOL(vm_insert_page);
>  
>  static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> -			pfn_t pfn, pgprot_t prot)
> +			pfn_t pfn, pgprot_t prot, bool mkwrite)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	int retval;
> @@ -1658,14 +1658,26 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>  	if (!pte)
>  		goto out;
>  	retval = -EBUSY;
> -	if (!pte_none(*pte))
> -		goto out_unlock;
> +	if (!pte_none(*pte)) {
> +		if (mkwrite) {
> +			entry = *pte;
> +			goto out_mkwrite;
> +		} else
> +			goto out_unlock;
> +	}
>  
>  	/* Ok, finally just insert the thing.. */
>  	if (pfn_t_devmap(pfn))
>  		entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
>  	else
>  		entry = pte_mkspecial(pfn_t_pte(pfn, prot));
> +
> +out_mkwrite:
> +	if (mkwrite) {
> +		entry = pte_mkyoung(entry);
> +		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +	}
> +
>  	set_pte_at(mm, addr, pte, entry);
>  	update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */
>  
> @@ -1736,7 +1748,8 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
>  
>  	track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
>  
> -	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot);
> +	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
> +			false);
>  
>  	return ret;
>  }
> @@ -1772,10 +1785,44 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
>  		page = pfn_to_page(pfn_t_to_pfn(pfn));
>  		return insert_page(vma, addr, page, pgprot);
>  	}
> -	return insert_pfn(vma, addr, pfn, pgprot);
> +	return insert_pfn(vma, addr, pfn, pgprot, false);
>  }
>  EXPORT_SYMBOL(vm_insert_mixed);
>  
> +int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr,
> +			pfn_t pfn)
> +{
> +	pgprot_t pgprot = vma->vm_page_prot;
> +
> +	BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
> +
> +	if (addr < vma->vm_start || addr >= vma->vm_end)
> +		return -EFAULT;
> +
> +	track_pfn_insert(vma, &pgprot, pfn);
> +
> +	/*
> +	 * If we don't have pte special, then we have to use the pfn_valid()
> +	 * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
> +	 * refcount the page if pfn_valid is true (hence insert_page rather
> +	 * than insert_pfn).  If a zero_pfn were inserted into a VM_MIXEDMAP
> +	 * without pte special, it would there be refcounted as a normal page.
> +	 */
> +	if (!HAVE_PTE_SPECIAL && !pfn_t_devmap(pfn) && pfn_t_valid(pfn)) {
> +		struct page *page;
> +
> +		/*
> +		 * At this point we are committed to insert_page()
> +		 * regardless of whether the caller specified flags that
> +		 * result in pfn_t_has_page() == false.
> +		 */
> +		page = pfn_to_page(pfn_t_to_pfn(pfn));
> +		return insert_page(vma, addr, page, pgprot);
> +	}
> +	return insert_pfn(vma, addr, pfn, pgprot, true);
> +}
> +EXPORT_SYMBOL(vm_insert_mixed_mkwrite);

Hi Ross,

vm_insert_mixed_mkwrite() is same as vm_insert_mixed() except this sets
write parameter to inser_pfn() true. Will it make sense to just add
mkwrite parameter to vm_insert_mixed() and not add a new helper function.
(like insert_pfn()).

Vivek
Ross Zwisler July 20, 2017, 3:59 p.m. | #5
On Thu, Jul 20, 2017 at 11:26:16AM -0400, Vivek Goyal wrote:
> On Wed, Jun 28, 2017 at 04:01:48PM -0600, Ross Zwisler wrote:
> > To be able to use the common 4k zero page in DAX we need to have our PTE
> > fault path look more like our PMD fault path where a PTE entry can be
> > marked as dirty and writeable as it is first inserted, rather than waiting
> > for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call.
> > 
> > Right now we can rely on having a dax_pfn_mkwrite() call because we can
> > distinguish between these two cases in do_wp_page():
> > 
> > 	case 1: 4k zero page => writable DAX storage
> > 	case 2: read-only DAX storage => writeable DAX storage
> > 
> > This distinction is made by via vm_normal_page().  vm_normal_page() returns
> > false for the common 4k zero page, though, just as it does for DAX ptes.
> > Instead of special casing the DAX + 4k zero page case, we will simplify our
> > DAX PTE page fault sequence so that it matches our DAX PMD sequence, and
> > get rid of dax_pfn_mkwrite() completely.
> > 
> > This means that insert_pfn() needs to follow the lead of insert_pfn_pmd()
> > and allow us to pass in a 'mkwrite' flag.  If 'mkwrite' is set insert_pfn()
> > will do the work that was previously done by wp_page_reuse() as part of the
> > dax_pfn_mkwrite() call path.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  include/linux/mm.h |  2 ++
> >  mm/memory.c        | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-----
> >  2 files changed, 54 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 6f543a4..096052f 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2293,6 +2293,8 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
> >  			unsigned long pfn, pgprot_t pgprot);
> >  int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> >  			pfn_t pfn);
> > +int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr,
> > +			pfn_t pfn);
> >  int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
> >  
> >  
> > diff --git a/mm/memory.c b/mm/memory.c
> > index bb11c47..de4aa71 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1646,7 +1646,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
> >  EXPORT_SYMBOL(vm_insert_page);
> >  
> >  static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> > -			pfn_t pfn, pgprot_t prot)
> > +			pfn_t pfn, pgprot_t prot, bool mkwrite)
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	int retval;
> > @@ -1658,14 +1658,26 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> >  	if (!pte)
> >  		goto out;
> >  	retval = -EBUSY;
> > -	if (!pte_none(*pte))
> > -		goto out_unlock;
> > +	if (!pte_none(*pte)) {
> > +		if (mkwrite) {
> > +			entry = *pte;
> > +			goto out_mkwrite;
> > +		} else
> > +			goto out_unlock;
> > +	}
> >  
> >  	/* Ok, finally just insert the thing.. */
> >  	if (pfn_t_devmap(pfn))
> >  		entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> >  	else
> >  		entry = pte_mkspecial(pfn_t_pte(pfn, prot));
> > +
> > +out_mkwrite:
> > +	if (mkwrite) {
> > +		entry = pte_mkyoung(entry);
> > +		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > +	}
> > +
> >  	set_pte_at(mm, addr, pte, entry);
> >  	update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */
> >  
> > @@ -1736,7 +1748,8 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
> >  
> >  	track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
> >  
> > -	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot);
> > +	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
> > +			false);
> >  
> >  	return ret;
> >  }
> > @@ -1772,10 +1785,44 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> >  		page = pfn_to_page(pfn_t_to_pfn(pfn));
> >  		return insert_page(vma, addr, page, pgprot);
> >  	}
> > -	return insert_pfn(vma, addr, pfn, pgprot);
> > +	return insert_pfn(vma, addr, pfn, pgprot, false);
> >  }
> >  EXPORT_SYMBOL(vm_insert_mixed);
> >  
> > +int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr,
> > +			pfn_t pfn)
> > +{
> > +	pgprot_t pgprot = vma->vm_page_prot;
> > +
> > +	BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
> > +
> > +	if (addr < vma->vm_start || addr >= vma->vm_end)
> > +		return -EFAULT;
> > +
> > +	track_pfn_insert(vma, &pgprot, pfn);
> > +
> > +	/*
> > +	 * If we don't have pte special, then we have to use the pfn_valid()
> > +	 * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
> > +	 * refcount the page if pfn_valid is true (hence insert_page rather
> > +	 * than insert_pfn).  If a zero_pfn were inserted into a VM_MIXEDMAP
> > +	 * without pte special, it would there be refcounted as a normal page.
> > +	 */
> > +	if (!HAVE_PTE_SPECIAL && !pfn_t_devmap(pfn) && pfn_t_valid(pfn)) {
> > +		struct page *page;
> > +
> > +		/*
> > +		 * At this point we are committed to insert_page()
> > +		 * regardless of whether the caller specified flags that
> > +		 * result in pfn_t_has_page() == false.
> > +		 */
> > +		page = pfn_to_page(pfn_t_to_pfn(pfn));
> > +		return insert_page(vma, addr, page, pgprot);
> > +	}
> > +	return insert_pfn(vma, addr, pfn, pgprot, true);
> > +}
> > +EXPORT_SYMBOL(vm_insert_mixed_mkwrite);
> 
> Hi Ross,
> 
> vm_insert_mixed_mkwrite() is same as vm_insert_mixed() except this sets
> write parameter to inser_pfn() true. Will it make sense to just add
> mkwrite parameter to vm_insert_mixed() and not add a new helper function.
> (like insert_pfn()).
> 
> Vivek

Yep, this is how my initial implementation worked:

https://lkml.org/lkml/2017/6/7/907

vm_insert_mixed_mkwrite() was the new version that took an extra parameter,
and vm_insert_mixed() stuck around as a wrapper that supplied a default value
for the new parameter, so existing call sites didn't need to change and didn't
need to worry about the new parameter, but so that we didn't duplicate any
code.

I changed this to the way that it currently works based on Dan's feedback in
that same mail thread.
Ross Zwisler July 21, 2017, 5:44 p.m. | #6
On Wed, Jul 19, 2017 at 03:58:31PM -0600, Ross Zwisler wrote:
> On Wed, Jul 19, 2017 at 11:51:12AM -0600, Ross Zwisler wrote:
> > On Wed, Jul 19, 2017 at 04:16:59PM +0200, Jan Kara wrote:
> > > On Wed 28-06-17 16:01:48, Ross Zwisler wrote:
> > > > To be able to use the common 4k zero page in DAX we need to have our PTE
> > > > fault path look more like our PMD fault path where a PTE entry can be
> > > > marked as dirty and writeable as it is first inserted, rather than waiting
> > > > for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call.
> > > > 
> > > > Right now we can rely on having a dax_pfn_mkwrite() call because we can
> > > > distinguish between these two cases in do_wp_page():
> > > > 
> > > > 	case 1: 4k zero page => writable DAX storage
> > > > 	case 2: read-only DAX storage => writeable DAX storage
> > > > 
> > > > This distinction is made by via vm_normal_page().  vm_normal_page() returns
> > > > false for the common 4k zero page, though, just as it does for DAX ptes.
> > > > Instead of special casing the DAX + 4k zero page case, we will simplify our
> > > > DAX PTE page fault sequence so that it matches our DAX PMD sequence, and
> > > > get rid of dax_pfn_mkwrite() completely.
> > > > 
> > > > This means that insert_pfn() needs to follow the lead of insert_pfn_pmd()
> > > > and allow us to pass in a 'mkwrite' flag.  If 'mkwrite' is set insert_pfn()
> > > > will do the work that was previously done by wp_page_reuse() as part of the
> > > > dax_pfn_mkwrite() call path.
> > > > 
> > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > 
> > > Just one small comment below.
> > > 
> > > > @@ -1658,14 +1658,26 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> > > >  	if (!pte)
> > > >  		goto out;
> > > >  	retval = -EBUSY;
> > > > -	if (!pte_none(*pte))
> > > > -		goto out_unlock;
> > > > +	if (!pte_none(*pte)) {
> > > > +		if (mkwrite) {
> > > > +			entry = *pte;
> > > > +			goto out_mkwrite;
> > > 
> > > Can we maybe check here that (pte_pfn(*pte) == pfn_t_to_pfn(pfn)) and
> > > return -EBUSY otherwise? That way we are sure insert_pfn() isn't doing
> > > anything we don't expect 
> > 
> > Sure, that's fine.  I'll add it as a WARN_ON_ONCE() so it's a very loud
> > failure.  If the pfns don't match I think we're insane (and would have been
> > insane prior to this patch series as well) because we are getting a page fault
> > and somehow have a different PFN already mapped at that location.
> 
> Umm...well, I added the warning, and during my regression testing hit a case
> where the PFNs didn't match.  (generic/437 with both ext4 & XFS)
> 
> I've verified that this behavior happens with vanilla v4.12, so it's not a new
> condition introduced by my patch.
> 
> I'm off tracking that down - there's a bug lurking somewhere, I think.

Actually, I think we're fine.  What was happening was that two faults were
racing for a private mapping.  One was installing a RW PTE for the COW page
cache page via wp_page_copy(), and the second was trying to install a
read-only PTE in insert_pfn().  The PFNs don't match because the two faults
are trying to map very different PTEs - one for DAX storage, one for a page
cache page.

This collision is handled by insert_pfn() by just returning -EBUSY, which will
bail out of the fault and either re-fault if necessary, or use the PTE that
the other thread installed.  For the case I described above I think both
faults will just happily use the page cache page, and the RO DAX fault won't
be retried.

I think this is fine, and I'll preserve this behavior as you suggest in the
mkwrite case by validating that the PTE is what we think it should be after we
grab the PTL.
Ross Zwisler July 21, 2017, 6:02 p.m. | #7
On Thu, Jul 20, 2017 at 09:59:22AM -0600, Ross Zwisler wrote:
> On Thu, Jul 20, 2017 at 11:26:16AM -0400, Vivek Goyal wrote:
<>
> > Hi Ross,
> > 
> > vm_insert_mixed_mkwrite() is same as vm_insert_mixed() except this sets
> > write parameter to inser_pfn() true. Will it make sense to just add
> > mkwrite parameter to vm_insert_mixed() and not add a new helper function.
> > (like insert_pfn()).
> > 
> > Vivek
> 
> Yep, this is how my initial implementation worked:
> 
> https://lkml.org/lkml/2017/6/7/907
> 
> vm_insert_mixed_mkwrite() was the new version that took an extra parameter,
> and vm_insert_mixed() stuck around as a wrapper that supplied a default value
> for the new parameter, so existing call sites didn't need to change and didn't
> need to worry about the new parameter, but so that we didn't duplicate any
> code.
> 
> I changed this to the way that it currently works based on Dan's feedback in
> that same mail thread.

Looking at this again, I agree that duplicating vm_insert_mixed() seems
undesirable.  For v4 I'll add the flag to vm_insert_mixed() and just update
all the call sites instead of adding a separate wrapper for the mkwrite case,
which will fix this duplication and address Dan's naming concerns.

Thanks for the review feedback.
Jan Kara July 24, 2017, 11:20 a.m. | #8
On Fri 21-07-17 11:44:05, Ross Zwisler wrote:
> On Wed, Jul 19, 2017 at 03:58:31PM -0600, Ross Zwisler wrote:
> > On Wed, Jul 19, 2017 at 11:51:12AM -0600, Ross Zwisler wrote:
> > > On Wed, Jul 19, 2017 at 04:16:59PM +0200, Jan Kara wrote:
> > > > On Wed 28-06-17 16:01:48, Ross Zwisler wrote:
> > > > > To be able to use the common 4k zero page in DAX we need to have our PTE
> > > > > fault path look more like our PMD fault path where a PTE entry can be
> > > > > marked as dirty and writeable as it is first inserted, rather than waiting
> > > > > for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call.
> > > > > 
> > > > > Right now we can rely on having a dax_pfn_mkwrite() call because we can
> > > > > distinguish between these two cases in do_wp_page():
> > > > > 
> > > > > 	case 1: 4k zero page => writable DAX storage
> > > > > 	case 2: read-only DAX storage => writeable DAX storage
> > > > > 
> > > > > This distinction is made by via vm_normal_page().  vm_normal_page() returns
> > > > > false for the common 4k zero page, though, just as it does for DAX ptes.
> > > > > Instead of special casing the DAX + 4k zero page case, we will simplify our
> > > > > DAX PTE page fault sequence so that it matches our DAX PMD sequence, and
> > > > > get rid of dax_pfn_mkwrite() completely.
> > > > > 
> > > > > This means that insert_pfn() needs to follow the lead of insert_pfn_pmd()
> > > > > and allow us to pass in a 'mkwrite' flag.  If 'mkwrite' is set insert_pfn()
> > > > > will do the work that was previously done by wp_page_reuse() as part of the
> > > > > dax_pfn_mkwrite() call path.
> > > > > 
> > > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > > 
> > > > Just one small comment below.
> > > > 
> > > > > @@ -1658,14 +1658,26 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> > > > >  	if (!pte)
> > > > >  		goto out;
> > > > >  	retval = -EBUSY;
> > > > > -	if (!pte_none(*pte))
> > > > > -		goto out_unlock;
> > > > > +	if (!pte_none(*pte)) {
> > > > > +		if (mkwrite) {
> > > > > +			entry = *pte;
> > > > > +			goto out_mkwrite;
> > > > 
> > > > Can we maybe check here that (pte_pfn(*pte) == pfn_t_to_pfn(pfn)) and
> > > > return -EBUSY otherwise? That way we are sure insert_pfn() isn't doing
> > > > anything we don't expect 
> > > 
> > > Sure, that's fine.  I'll add it as a WARN_ON_ONCE() so it's a very loud
> > > failure.  If the pfns don't match I think we're insane (and would have been
> > > insane prior to this patch series as well) because we are getting a page fault
> > > and somehow have a different PFN already mapped at that location.
> > 
> > Umm...well, I added the warning, and during my regression testing hit a case
> > where the PFNs didn't match.  (generic/437 with both ext4 & XFS)
> > 
> > I've verified that this behavior happens with vanilla v4.12, so it's not a new
> > condition introduced by my patch.
> > 
> > I'm off tracking that down - there's a bug lurking somewhere, I think.
> 
> Actually, I think we're fine.  What was happening was that two faults were
> racing for a private mapping.  One was installing a RW PTE for the COW page
> cache page via wp_page_copy(), and the second was trying to install a
> read-only PTE in insert_pfn().  The PFNs don't match because the two faults
> are trying to map very different PTEs - one for DAX storage, one for a page
> cache page.

OK, so two threads (sharing page tables) were doing read and write fault at
the same offset of a private mapping. OK, makes sense.

> This collision is handled by insert_pfn() by just returning -EBUSY, which will
> bail out of the fault and either re-fault if necessary, or use the PTE that
> the other thread installed.  For the case I described above I think both
> faults will just happily use the page cache page, and the RO DAX fault won't
> be retried.
> 
> I think this is fine, and I'll preserve this behavior as you suggest in the
> mkwrite case by validating that the PTE is what we think it should be after we
> grab the PTL.

Yeah, that seems to essential for the races of faults in private mappings
to work as they should. Thanks for analysing this!

								Honza

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f543a4..096052f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2293,6 +2293,8 @@  int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn, pgprot_t pgprot);
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			pfn_t pfn);
+int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr,
+			pfn_t pfn);
 int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
 
 
diff --git a/mm/memory.c b/mm/memory.c
index bb11c47..de4aa71 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1646,7 +1646,7 @@  int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
 EXPORT_SYMBOL(vm_insert_page);
 
 static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
-			pfn_t pfn, pgprot_t prot)
+			pfn_t pfn, pgprot_t prot, bool mkwrite)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int retval;
@@ -1658,14 +1658,26 @@  static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 	if (!pte)
 		goto out;
 	retval = -EBUSY;
-	if (!pte_none(*pte))
-		goto out_unlock;
+	if (!pte_none(*pte)) {
+		if (mkwrite) {
+			entry = *pte;
+			goto out_mkwrite;
+		} else
+			goto out_unlock;
+	}
 
 	/* Ok, finally just insert the thing.. */
 	if (pfn_t_devmap(pfn))
 		entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
 	else
 		entry = pte_mkspecial(pfn_t_pte(pfn, prot));
+
+out_mkwrite:
+	if (mkwrite) {
+		entry = pte_mkyoung(entry);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	}
+
 	set_pte_at(mm, addr, pte, entry);
 	update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */
 
@@ -1736,7 +1748,8 @@  int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 
 	track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
 
-	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot);
+	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
+			false);
 
 	return ret;
 }
@@ -1772,10 +1785,44 @@  int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 		page = pfn_to_page(pfn_t_to_pfn(pfn));
 		return insert_page(vma, addr, page, pgprot);
 	}
-	return insert_pfn(vma, addr, pfn, pgprot);
+	return insert_pfn(vma, addr, pfn, pgprot, false);
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr,
+			pfn_t pfn)
+{
+	pgprot_t pgprot = vma->vm_page_prot;
+
+	BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
+
+	if (addr < vma->vm_start || addr >= vma->vm_end)
+		return -EFAULT;
+
+	track_pfn_insert(vma, &pgprot, pfn);
+
+	/*
+	 * If we don't have pte special, then we have to use the pfn_valid()
+	 * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
+	 * refcount the page if pfn_valid is true (hence insert_page rather
+	 * than insert_pfn).  If a zero_pfn were inserted into a VM_MIXEDMAP
+	 * without pte special, it would there be refcounted as a normal page.
+	 */
+	if (!HAVE_PTE_SPECIAL && !pfn_t_devmap(pfn) && pfn_t_valid(pfn)) {
+		struct page *page;
+
+		/*
+		 * At this point we are committed to insert_page()
+		 * regardless of whether the caller specified flags that
+		 * result in pfn_t_has_page() == false.
+		 */
+		page = pfn_to_page(pfn_t_to_pfn(pfn));
+		return insert_page(vma, addr, page, pgprot);
+	}
+	return insert_pfn(vma, addr, pfn, pgprot, true);
+}
+EXPORT_SYMBOL(vm_insert_mixed_mkwrite);
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed. any references to nonexistent pages results