[16/19] ext4: Support for synchronous DAX faults

Message ID 20171011200603.27442-17-jack@suse.cz
State Not Applicable, archived
Series dax, ext4, xfs: Synchronous page faults

Commit Message

Jan Kara Oct. 11, 2017, 8:06 p.m. UTC
We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
synchronous write fault when inode has some uncommitted metadata
changes. In the fault handler ext4_dax_fault() we then detect this case,
call vfs_fsync_range() to make sure all metadata is committed, and call
dax_insert_pfn_mkwrite() to insert page table entry. Note that this will
also dirty corresponding radix tree entry which is what we want -
fsync(2) will still provide data integrity guarantees for applications
not using userspace flushing. And applications using userspace flushing
can avoid calling fsync(2) and thus avoid the performance overhead.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c       |  6 +++++-
 fs/ext4/inode.c      | 15 +++++++++++++++
 fs/jbd2/journal.c    | 17 +++++++++++++++++
 include/linux/jbd2.h |  1 +
 4 files changed, 38 insertions(+), 1 deletion(-)

Comments

Dan Williams Oct. 11, 2017, 10:23 p.m. UTC | #1
On Wed, Oct 11, 2017 at 1:06 PM, Jan Kara <jack@suse.cz> wrote:
> We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin()

I assume this is a stale changelog from a previous version and should be
VM_FAULT_NEEDDSYNC?

Jan Kara Oct. 12, 2017, 1:42 p.m. UTC | #2
On Wed 11-10-17 15:23:21, Dan Williams wrote:
> On Wed, Oct 11, 2017 at 1:06 PM, Jan Kara <jack@suse.cz> wrote:
> > We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin()
> 
> I assume this is a stale changelog from a previous version and should be
> VM_FAULT_NEEDDSYNC?

Yeah, stale changelog. It should be IOMAP_F_DIRTY. Thanks for catching
this.

								Honza

Ross Zwisler Oct. 13, 2017, 8:58 p.m. UTC | #3
On Wed, Oct 11, 2017 at 10:06:00PM +0200, Jan Kara wrote:
> We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> synchronous write fault when inode has some uncommitted metadata
> changes. In the fault handler ext4_dax_fault() we then detect this case,
> call vfs_fsync_range() to make sure all metadata is committed, and call
> dax_insert_pfn_mkwrite() to insert page table entry. Note that this will
> also dirty corresponding radix tree entry which is what we want -
> fsync(2) will still provide data integrity guarantees for applications
> not using userspace flushing. And applications using userspace flushing
> can avoid calling fsync(2) and thus avoid the performance overhead.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/file.c       |  6 +++++-
>  fs/ext4/inode.c      | 15 +++++++++++++++
>  fs/jbd2/journal.c    | 17 +++++++++++++++++
>  include/linux/jbd2.h |  1 +
>  4 files changed, 38 insertions(+), 1 deletion(-)

<>

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 31db875bc7a1..13a198924a0f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3394,6 +3394,19 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
>  }
>  
>  #ifdef CONFIG_FS_DAX
> +static bool ext4_inode_datasync_dirty(struct inode *inode)
> +{
> +	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> +
> +	if (journal)
> +		return !jbd2_transaction_committed(journal,
> +					EXT4_I(inode)->i_datasync_tid);
> +	/* Any metadata buffers to write? */
> +	if (!list_empty(&inode->i_mapping->private_list))
> +		return true;
> +	return inode->i_state & I_DIRTY_DATASYNC;
> +}

I just had 2 quick questions on this:

1) Does ext4 actually use inode->i_mapping->private_list to keep track of
dirty metadata buffers?  The comment above ext4_write_end() leads me to
believe that this list is unused?

 * ext4 never places buffers on inode->i_mapping->private_list.  metadata
 * buffers are managed internally.

Or does the above comment only apply to ext4 with a journal?

2) Where is I_DIRTY_DATASYNC set in inode->i_state?  I poked around a bit and
couldn't see it.

The rest of the patch looks good to me, and you can add:

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Jan Kara Oct. 16, 2017, 3:50 p.m. UTC | #4
On Fri 13-10-17 14:58:54, Ross Zwisler wrote:
> On Wed, Oct 11, 2017 at 10:06:00PM +0200, Jan Kara wrote:
> > We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> > synchronous write fault when inode has some uncommitted metadata
> > changes. In the fault handler ext4_dax_fault() we then detect this case,
> > call vfs_fsync_range() to make sure all metadata is committed, and call
> > dax_insert_pfn_mkwrite() to insert page table entry. Note that this will
> > also dirty corresponding radix tree entry which is what we want -
> > fsync(2) will still provide data integrity guarantees for applications
> > not using userspace flushing. And applications using userspace flushing
> > can avoid calling fsync(2) and thus avoid the performance overhead.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/ext4/file.c       |  6 +++++-
> >  fs/ext4/inode.c      | 15 +++++++++++++++
> >  fs/jbd2/journal.c    | 17 +++++++++++++++++
> >  include/linux/jbd2.h |  1 +
> >  4 files changed, 38 insertions(+), 1 deletion(-)
> 
> <>
> 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 31db875bc7a1..13a198924a0f 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3394,6 +3394,19 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
> >  }
> >  
> >  #ifdef CONFIG_FS_DAX
> > +static bool ext4_inode_datasync_dirty(struct inode *inode)
> > +{
> > +	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> > +
> > +	if (journal)
> > +		return !jbd2_transaction_committed(journal,
> > +					EXT4_I(inode)->i_datasync_tid);
> > +	/* Any metadata buffers to write? */
> > +	if (!list_empty(&inode->i_mapping->private_list))
> > +		return true;
> > +	return inode->i_state & I_DIRTY_DATASYNC;
> > +}
> 
> I just had 2 quick questions on this:
> 
> 1) Does ext4 actually use inode->i_mapping->private_list to keep track of
> dirty metadata buffers?  The comment above ext4_write_end() leads me to
> believe that this list is unused?
> 
>  * ext4 never places buffers on inode->i_mapping->private_list.  metadata
>  * buffers are managed internally.
> 
> Or does the above comment only apply to ext4 with a journal?

Yes, the above comment applies only to ext4 with a journal. ext4 without a
journal uses inode->i_mapping->private_list for metadata tracking. And DAX
can be used without the journal just fine...

> 2) Where is I_DIRTY_DATASYNC set in inode->i_state?  I poked around a bit and
> couldn't see it.

Never directly (at least for ext4). But it will get set by
mark_inode_dirty() as I_DIRTY contains I_DIRTY_DATASYNC.

> The rest of the patch looks good to me, and you can add:
> 
> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Thanks!

								Honza
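
The answer to question 2 above can be illustrated with a small userspace sketch; it is not kernel code. The flag values are assumed to match include/linux/fs.h as it stood around the time of this series and should be checked against the tree being patched. `struct toy_inode` and the `toy_` helpers are hypothetical stand-ins for `struct inode`, `mark_inode_dirty()` and the final check in `ext4_inode_datasync_dirty()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values, modelled on include/linux/fs.h of that era. */
#define I_DIRTY_SYNC     (1UL << 0)
#define I_DIRTY_DATASYNC (1UL << 1)
#define I_DIRTY_PAGES    (1UL << 2)
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

/* The point made in the reply: I_DIRTY contains I_DIRTY_DATASYNC. */
static_assert((I_DIRTY & I_DIRTY_DATASYNC) != 0,
	      "I_DIRTY must include I_DIRTY_DATASYNC");

/* Hypothetical stand-in for the i_state field of struct inode. */
struct toy_inode {
	unsigned long i_state;
};

/* Models mark_inode_dirty(), which dirties with the whole I_DIRTY mask. */
static void toy_mark_inode_dirty(struct toy_inode *inode)
{
	inode->i_state |= I_DIRTY;
}

/* Models the last line of ext4_inode_datasync_dirty() (no-journal case). */
static bool toy_datasync_dirty(const struct toy_inode *inode)
{
	return (inode->i_state & I_DIRTY_DATASYNC) != 0;
}
```

Under these assumptions, an inode that has only ever been dirtied through `toy_mark_inode_dirty()` still reports as datasync-dirty, even though nothing sets `I_DIRTY_DATASYNC` by name.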

Patch

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 208adfc3e673..61a8788168f3 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -295,6 +295,7 @@  static int ext4_dax_huge_fault(struct vm_fault *vmf,
 	 */
 	bool write = (vmf->flags & FAULT_FLAG_WRITE) &&
 		(vmf->vma->vm_flags & VM_SHARED);
+	pfn_t pfn;
 
 	if (write) {
 		sb_start_pagefault(sb);
@@ -310,9 +311,12 @@  static int ext4_dax_huge_fault(struct vm_fault *vmf,
 	} else {
 		down_read(&EXT4_I(inode)->i_mmap_sem);
 	}
-	result = dax_iomap_fault(vmf, pe_size, NULL, &ext4_iomap_ops);
+	result = dax_iomap_fault(vmf, pe_size, &pfn, &ext4_iomap_ops);
 	if (write) {
 		ext4_journal_stop(handle);
+		/* Handling synchronous page fault? */
+		if (result & VM_FAULT_NEEDDSYNC)
+			result = dax_finish_sync_fault(vmf, pe_size, pfn);
 		up_read(&EXT4_I(inode)->i_mmap_sem);
 		sb_end_pagefault(sb);
 	} else {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 31db875bc7a1..13a198924a0f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3394,6 +3394,19 @@  static int ext4_releasepage(struct page *page, gfp_t wait)
 }
 
 #ifdef CONFIG_FS_DAX
+static bool ext4_inode_datasync_dirty(struct inode *inode)
+{
+	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+
+	if (journal)
+		return !jbd2_transaction_committed(journal,
+					EXT4_I(inode)->i_datasync_tid);
+	/* Any metadata buffers to write? */
+	if (!list_empty(&inode->i_mapping->private_list))
+		return true;
+	return inode->i_state & I_DIRTY_DATASYNC;
+}
+
 static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 			    unsigned flags, struct iomap *iomap)
 {
@@ -3466,6 +3479,8 @@  static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	}
 
 	iomap->flags = 0;
+	if ((flags & IOMAP_WRITE) && ext4_inode_datasync_dirty(inode))
+		iomap->flags |= IOMAP_F_DIRTY;
 	iomap->bdev = inode->i_sb->s_bdev;
 	iomap->dax_dev = sbi->s_daxdev;
 	iomap->offset = first_block << blkbits;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 7d5ef3bf3f3e..fa8cde498b4b 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -738,6 +738,23 @@  int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
 	return err;
 }
 
+/* Return 1 when transaction with given tid has already committed. */
+int jbd2_transaction_committed(journal_t *journal, tid_t tid)
+{
+	int ret = 1;
+
+	read_lock(&journal->j_state_lock);
+	if (journal->j_running_transaction &&
+	    journal->j_running_transaction->t_tid == tid)
+		ret = 0;
+	if (journal->j_committing_transaction &&
+	    journal->j_committing_transaction->t_tid == tid)
+		ret = 0;
+	read_unlock(&journal->j_state_lock);
+	return ret;
+}
+EXPORT_SYMBOL(jbd2_transaction_committed);
+
 /*
  * When this function returns the transaction corresponding to tid
  * will be completed.  If the transaction has currently running, start
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 606b6bce3a5b..296d1e0ea87b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1367,6 +1367,7 @@  int jbd2_log_start_commit(journal_t *journal, tid_t tid);
 int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
 int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
 int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
+int jbd2_transaction_committed(journal_t *journal, tid_t tid);
 int jbd2_complete_transaction(journal_t *journal, tid_t tid);
 int jbd2_log_do_checkpoint(journal_t *journal);
 int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
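
The journal branch of the check can be modelled in userspace roughly as follows. This is a single-threaded sketch, so it omits `j_state_lock`; the `toy_` types and functions are hypothetical stand-ins for `journal_t`, `transaction_t`, `jbd2_transaction_committed()` and `ext4_inode_datasync_dirty()`, and keep only the fields the patch reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int tid_t;

/* Hypothetical stand-in for transaction_t: only the tid matters here. */
struct toy_transaction {
	tid_t t_tid;
};

/* Hypothetical stand-in for journal_t: the two in-flight transactions. */
struct toy_journal {
	struct toy_transaction *j_running_transaction;    /* NULL if none */
	struct toy_transaction *j_committing_transaction; /* NULL if none */
};

/*
 * Same test as the patch: a tid counts as committed unless it belongs to
 * the running or the committing transaction.
 */
static int toy_transaction_committed(const struct toy_journal *journal,
				     tid_t tid)
{
	assert(journal != NULL);
	if (journal->j_running_transaction &&
	    journal->j_running_transaction->t_tid == tid)
		return 0;
	if (journal->j_committing_transaction &&
	    journal->j_committing_transaction->t_tid == tid)
		return 0;
	return 1;
}

/* Journal branch of the ext4 helper: dirty until i_datasync_tid commits. */
static bool toy_inode_datasync_dirty(const struct toy_journal *journal,
				     tid_t i_datasync_tid)
{
	return !toy_transaction_committed(journal, i_datasync_tid);
}
```

In this model an inode whose `i_datasync_tid` matches either in-flight transaction is reported dirty, which is the case where `ext4_iomap_begin()` sets `IOMAP_F_DIRTY`. Once neither transaction carries that tid, the inode is reported clean.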