[11/13] dax, iomap: Add support for synchronous faults

Message ID 20170817160815.30466-12-jack@suse.cz
State New
Headers show

Commit Message

Jan Kara Aug. 17, 2017, 4:08 p.m.
Add a flag to iomap interface informing the caller that inode needs
fdstasync(2) for returned extent to become persistent and use it in DAX
fault code so that we map such extents only read only. We propagate the
information that the page table entry has been inserted write-protected
from dax_iomap_fault() with a new VM_FAULT_RO flag. Filesystem fault
handler is then responsible for calling fdatasync(2) and updating page
tables to map pfns read-write. dax_iomap_fault() also takes care of
updating vmf->orig_pte to match the PTE that was inserted so that we can
safely recheck that PTE did not change while write-enabling it.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c              | 31 +++++++++++++++++++++++++++++++
 include/linux/iomap.h |  2 ++
 include/linux/mm.h    |  6 +++++-
 3 files changed, 38 insertions(+), 1 deletion(-)

Comments

Ross Zwisler Aug. 21, 2017, 6:58 p.m. | #1
On Thu, Aug 17, 2017 at 06:08:13PM +0200, Jan Kara wrote:
> Add a flag to iomap interface informing the caller that inode needs
> fdstasync(2) for returned extent to become persistent and use it in DAX
> fault code so that we map such extents only read only. We propagate the
> information that the page table entry has been inserted write-protected
> from dax_iomap_fault() with a new VM_FAULT_RO flag. Filesystem fault
> handler is then responsible for calling fdatasync(2) and updating page
> tables to map pfns read-write. dax_iomap_fault() also takes care of
> updating vmf->orig_pte to match the PTE that was inserted so that we can
> safely recheck that PTE did not change while write-enabling it.

This changelog needs a little love.  s/VM_FAULT_RO/VM_FAULT_NEEDDSYNC/, the
new path doesn't do the RO mapping, but instead just does the entire RW
mapping after the fdatasync is complete, the vmf->orig_pte manipulation went
away, etc.

> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c              | 31 +++++++++++++++++++++++++++++++
>  include/linux/iomap.h |  2 ++
>  include/linux/mm.h    |  6 +++++-
>  3 files changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index bc040e654cc9..ca88fc356786 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1177,6 +1177,22 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
>  			goto error_finish_iomap;
>  		}
>  
> +		/*
> +		 * If we are doing synchronous page fault and inode needs fsync,
> +		 * we can insert PTE into page tables only after that happens.
> +		 * Skip insertion for now and return the pfn so that caller can
> +		 * insert it after fsync is done.
> +		 */
> +		if (write && (vma->vm_flags & VM_SYNC) &&
> +		    (iomap.flags & IOMAP_F_NEEDDSYNC)) {

Just a small nit, but I don't think we really need to check for 'write' here.
The fact that IOMAP_F_NEEDDSYNC is set tells us that we are doing a write.

	if ((flags & IOMAP_WRITE) &&
	    !jbd2_transaction_committed(EXT4_SB(inode->i_sb)->s_journal,
					EXT4_I(inode)->i_datasync_tid))
		iomap->flags |= IOMAP_F_NEEDDSYNC;

Ditto for the PMD case.

With that one simplification and a cleaned up changelog, you can add:

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

> +			if (WARN_ON_ONCE(!pfnp)) {
> +				error = -EIO;
> +				goto error_finish_iomap;
> +			}
> +			*pfnp = pfn;
> +			vmf_ret = VM_FAULT_NEEDDSYNC | major;
> +			goto finish_iomap;
> +		}
>  		trace_dax_insert_mapping(inode, vmf, entry);
>  		if (write)
>  			error = vm_insert_mixed_mkwrite(vma, vaddr, pfn);
Ross Zwisler Aug. 21, 2017, 9:09 p.m. | #2
On Thu, Aug 17, 2017 at 06:08:13PM +0200, Jan Kara wrote:
> Add a flag to iomap interface informing the caller that inode needs
> fdstasync(2) for returned extent to become persistent and use it in DAX
> fault code so that we map such extents only read only. We propagate the
> information that the page table entry has been inserted write-protected
> from dax_iomap_fault() with a new VM_FAULT_RO flag. Filesystem fault
> handler is then responsible for calling fdatasync(2) and updating page
> tables to map pfns read-write. dax_iomap_fault() also takes care of
> updating vmf->orig_pte to match the PTE that was inserted so that we can
> safely recheck that PTE did not change while write-enabling it.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c              | 31 +++++++++++++++++++++++++++++++
>  include/linux/iomap.h |  2 ++
>  include/linux/mm.h    |  6 +++++-
>  3 files changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index bc040e654cc9..ca88fc356786 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1177,6 +1177,22 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
>  			goto error_finish_iomap;
>  		}
>  
> +		/*
> +		 * If we are doing synchronous page fault and inode needs fsync,
> +		 * we can insert PTE into page tables only after that happens.
> +		 * Skip insertion for now and return the pfn so that caller can
> +		 * insert it after fsync is done.
> +		 */
> +		if (write && (vma->vm_flags & VM_SYNC) &&
> +		    (iomap.flags & IOMAP_F_NEEDDSYNC)) {
> +			if (WARN_ON_ONCE(!pfnp)) {
> +				error = -EIO;
> +				goto error_finish_iomap;
> +			}
> +			*pfnp = pfn;
> +			vmf_ret = VM_FAULT_NEEDDSYNC | major;
> +			goto finish_iomap;
> +		}

Sorry for the second reply, but I spotted this during my testing.

The radix tree entry is inserted and marked as dirty by the
dax_insert_mapping_entry() call a few lines above this newly added block.

I think that this patch should prevent the radix tree entry that we insert
from being marked as dirty, and let the dax_insert_pfn_mkwrite() handler do
that work.  Right now it is being made dirty twice, which we don't need.

Just inserting the entry as clean here and then marking it as dirty later in
dax_insert_pfn_mkwrite() keeps the radix tree entry dirty state consistent
with the PTE dirty state.  It also solves an issue we have right now where the
newly inserted dirty entry will immediately be flushed as part of the
vfs_fsync_range() call that the filesystem will do before
dax_insert_pfn_mkwrite(). 

For example, here's a trace of a PMD write fault on a completely sparse file:

  dax_pmd_fault: dev 259:0 ino 0xc shared WRITE|ALLOW_RETRY|KILLABLE|USER
  address 0x7feab8e00000 vm_start 0x7feab8e00000 vm_end 0x7feab9000000 pgoff
  0x0 max_pgoff 0x1ff 
  
  dax_pmd_fault_done: dev 259:0 ino 0xc shared WRITE|ALLOW_RETRY|KILLABLE|USER
  address 0x7feab8e00000 vm_start 0x7feab8e00000 vm_end 0x7feab9000000 pgoff
  0x0 max_pgoff 0x1ff NEEDDSYNC
  
  dax_writeback_range: dev 259:0 ino 0xc pgoff 0x0-0x1ff
  
  dax_writeback_one: dev 259:0 ino 0xc pgoff 0x0 pglen 0x200
  
  dax_writeback_range_done: dev 259:0 ino 0xc pgoff 0x1-0x1ff
  
  dax_insert_pfn_mkwrite: dev 259:0 ino 0xc shared
  WRITE|ALLOW_RETRY|KILLABLE|USER address 0x7feab8e00000 pgoff 0x0 NOPAGE

The PMD that we are writing back with dax_writeback_one() is the one that we
just made dirty via the first 1/2 of the sync fault, before we've installed a
page table entry.  This fix might speed up some of your test measurements as
well.
Jan Kara Aug. 22, 2017, 9:46 a.m. | #3
On Mon 21-08-17 12:58:30, Ross Zwisler wrote:
> On Thu, Aug 17, 2017 at 06:08:13PM +0200, Jan Kara wrote:
> > Add a flag to iomap interface informing the caller that inode needs
> > fdstasync(2) for returned extent to become persistent and use it in DAX
> > fault code so that we map such extents only read only. We propagate the
> > information that the page table entry has been inserted write-protected
> > from dax_iomap_fault() with a new VM_FAULT_RO flag. Filesystem fault
> > handler is then responsible for calling fdatasync(2) and updating page
> > tables to map pfns read-write. dax_iomap_fault() also takes care of
> > updating vmf->orig_pte to match the PTE that was inserted so that we can
> > safely recheck that PTE did not change while write-enabling it.
> 
> This changelog needs a little love.  s/VM_FAULT_RO/VM_FAULT_NEEDDSYNC/, the
> new path doesn't do the RO mapping, but instead just does the entire RW
> mapping after the fdatasync is complete, the vmf->orig_pte manipulation went
> away, etc.

Yeah, fixed. Thanks for noticing.

> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c              | 31 +++++++++++++++++++++++++++++++
> >  include/linux/iomap.h |  2 ++
> >  include/linux/mm.h    |  6 +++++-
> >  3 files changed, 38 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index bc040e654cc9..ca88fc356786 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1177,6 +1177,22 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
> >  			goto error_finish_iomap;
> >  		}
> >  
> > +		/*
> > +		 * If we are doing synchronous page fault and inode needs fsync,
> > +		 * we can insert PTE into page tables only after that happens.
> > +		 * Skip insertion for now and return the pfn so that caller can
> > +		 * insert it after fsync is done.
> > +		 */
> > +		if (write && (vma->vm_flags & VM_SYNC) &&
> > +		    (iomap.flags & IOMAP_F_NEEDDSYNC)) {
> 
> Just a small nit, but I don't think we really need to check for 'write' here.
> The fact that IOMAP_F_NEEDDSYNC is set tells us that we are doing a write.
> 
> 	if ((flags & IOMAP_WRITE) &&
> 	    !jbd2_transaction_committed(EXT4_SB(inode->i_sb)->s_journal,
> 					EXT4_I(inode)->i_datasync_tid))
> 		iomap->flags |= IOMAP_F_NEEDDSYNC;
> 
> Ditto for the PMD case.

OK, done.

> With that one simplification and a cleaned up changelog, you can add:
> 
> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Thanks.

								Honza
Jan Kara Aug. 22, 2017, 10:08 a.m. | #4
On Mon 21-08-17 15:09:16, Ross Zwisler wrote:
> On Thu, Aug 17, 2017 at 06:08:13PM +0200, Jan Kara wrote:
> > Add a flag to iomap interface informing the caller that inode needs
> > fdstasync(2) for returned extent to become persistent and use it in DAX
> > fault code so that we map such extents only read only. We propagate the
> > information that the page table entry has been inserted write-protected
> > from dax_iomap_fault() with a new VM_FAULT_RO flag. Filesystem fault
> > handler is then responsible for calling fdatasync(2) and updating page
> > tables to map pfns read-write. dax_iomap_fault() also takes care of
> > updating vmf->orig_pte to match the PTE that was inserted so that we can
> > safely recheck that PTE did not change while write-enabling it.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c              | 31 +++++++++++++++++++++++++++++++
> >  include/linux/iomap.h |  2 ++
> >  include/linux/mm.h    |  6 +++++-
> >  3 files changed, 38 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index bc040e654cc9..ca88fc356786 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1177,6 +1177,22 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
> >  			goto error_finish_iomap;
> >  		}
> >  
> > +		/*
> > +		 * If we are doing synchronous page fault and inode needs fsync,
> > +		 * we can insert PTE into page tables only after that happens.
> > +		 * Skip insertion for now and return the pfn so that caller can
> > +		 * insert it after fsync is done.
> > +		 */
> > +		if (write && (vma->vm_flags & VM_SYNC) &&
> > +		    (iomap.flags & IOMAP_F_NEEDDSYNC)) {
> > +			if (WARN_ON_ONCE(!pfnp)) {
> > +				error = -EIO;
> > +				goto error_finish_iomap;
> > +			}
> > +			*pfnp = pfn;
> > +			vmf_ret = VM_FAULT_NEEDDSYNC | major;
> > +			goto finish_iomap;
> > +		}
> 
> Sorry for the second reply, but I spotted this during my testing.
> 
> The radix tree entry is inserted and marked as dirty by the
> dax_insert_mapping_entry() call a few lines above this newly added block.

Yes I know and this is actually deliberate. My original thinking was that
we *want* following vfs_fsync_range() to flush out any changes that are
possibly lingering in CPU caches for the block. Now thinking about this
again, changes through write(2) or pre-zeroing of block are non-temporal
anyway and changes through mmap(2) will already dirty the radix tree entry
so it seems you are right that we don't need the dirtying here. I'll change
this. Thanks for asking.

								Honza
Christoph Hellwig Aug. 24, 2017, 12:27 p.m. | #5
Just curious:  how does IOMAP_F_NEEDDSYNC practically differ
from IOMAP_F_NEW?
Jan Kara Aug. 24, 2017, 12:34 p.m. | #6
On Thu 24-08-17 05:27:20, Christoph Hellwig wrote:
> Just curious:  how does IOMAP_F_NEEDDSYNC practically differ
> from IOMAP_F_NEW?

In a subtle but important way ;). The main difference is that if the extent
has been already allocated by previous write, but the changing transaction
is not yet committed, we will return IOMAP_F_NEEDDSYNC but not IOMAP_F_NEW.

								Honza
Christoph Hellwig Aug. 24, 2017, 1:38 p.m. | #7
On Thu, Aug 24, 2017 at 02:34:51PM +0200, Jan Kara wrote:
> In a subtle but important way ;). The main difference is that if the extent
> has been already allocated by previous write, but the changing transaction
> is not yet committed, we will return IOMAP_F_NEEDDSYNC but not IOMAP_F_NEW.

Ok.  How about a IOMAP_F_DIRTY flag and a better explanation?
Jan Kara Aug. 24, 2017, 4:45 p.m. | #8
On Thu 24-08-17 06:38:09, Christoph Hellwig wrote:
> On Thu, Aug 24, 2017 at 02:34:51PM +0200, Jan Kara wrote:
> > In a subtle but important way ;). The main difference is that if the extent
> > has been already allocated by previous write, but the changing transaction
> > is not yet committed, we will return IOMAP_F_NEEDDSYNC but not IOMAP_F_NEW.
> 
> Ok.  How about a IOMAP_F_DIRTY flag and a better explanation?

OK, will change it.

								Honza

Patch

diff --git a/fs/dax.c b/fs/dax.c
index bc040e654cc9..ca88fc356786 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1177,6 +1177,22 @@  static int dax_iomap_pte_fault(struct vm_fault *vmf,
 			goto error_finish_iomap;
 		}
 
+		/*
+		 * If we are doing synchronous page fault and inode needs fsync,
+		 * we can insert PTE into page tables only after that happens.
+		 * Skip insertion for now and return the pfn so that caller can
+		 * insert it after fsync is done.
+		 */
+		if (write && (vma->vm_flags & VM_SYNC) &&
+		    (iomap.flags & IOMAP_F_NEEDDSYNC)) {
+			if (WARN_ON_ONCE(!pfnp)) {
+				error = -EIO;
+				goto error_finish_iomap;
+			}
+			*pfnp = pfn;
+			vmf_ret = VM_FAULT_NEEDDSYNC | major;
+			goto finish_iomap;
+		}
 		trace_dax_insert_mapping(inode, vmf, entry);
 		if (write)
 			error = vm_insert_mixed_mkwrite(vma, vaddr, pfn);
@@ -1362,6 +1378,21 @@  static int dax_iomap_pmd_fault(struct vm_fault *vmf,
 		if (IS_ERR(entry))
 			goto finish_iomap;
 
+		/*
+		 * If we are doing synchronous page fault and inode needs fsync,
+		 * we can insert PMD into page tables only after that happens.
+		 * Skip insertion for now and return the pfn so that caller can
+		 * insert it after fsync is done.
+		 */
+		if (write && (vmf->vma->vm_flags & VM_SYNC) &&
+		    (iomap.flags & IOMAP_F_NEEDDSYNC)) {
+			if (WARN_ON_ONCE(!pfnp))
+				goto finish_iomap;
+			*pfnp = pfn;
+			result = VM_FAULT_NEEDDSYNC;
+			goto finish_iomap;
+		}
+
 		trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, pfn, entry);
 		result = vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
 					    write);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f64dc6ce5161..957463602f6e 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -22,6 +22,8 @@  struct vm_fault;
  * Flags for all iomap mappings:
  */
 #define IOMAP_F_NEW	0x01	/* blocks have been newly allocated */
+#define IOMAP_F_NEEDDSYNC	0x02	/* inode needs fdatasync for storage to
+					 * become persistent */
 
 /*
  * Flags that only need to be reported for IOMAP_REPORT requests:
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d0fb385414a4..20e95c3a7701 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1143,6 +1143,9 @@  static inline void clear_page_pfmemalloc(struct page *page)
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
 #define VM_FAULT_FALLBACK 0x0800	/* huge page fault failed, fall back to small */
 #define VM_FAULT_DONE_COW   0x1000	/* ->fault has fully handled COW */
+#define VM_FAULT_NEEDDSYNC  0x2000	/* ->fault did not modify page tables
+					 * and needs fsync() to complete (for
+					 * synchronous page faults in DAX) */
 
 #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | \
 			 VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
@@ -1160,7 +1163,8 @@  static inline void clear_page_pfmemalloc(struct page *page)
 	{ VM_FAULT_LOCKED,		"LOCKED" }, \
 	{ VM_FAULT_RETRY,		"RETRY" }, \
 	{ VM_FAULT_FALLBACK,		"FALLBACK" }, \
-	{ VM_FAULT_DONE_COW,		"DONE_COW" }
+	{ VM_FAULT_DONE_COW,		"DONE_COW" }, \
+	{ VM_FAULT_NEEDDSYNC,		"NEEDDSYNC" }
 
 /* Encode hstate index for a hwpoisoned large page */
 #define VM_FAULT_SET_HINDEX(x) ((x) << 12)