Message ID | 20171011200603.27442-17-jack@suse.cz |
---|---|
State | Not Applicable, archived |
Series | dax, ext4, xfs: Synchronous page faults |
On Wed, Oct 11, 2017 at 1:06 PM, Jan Kara <jack@suse.cz> wrote:
> We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin()
I assume this is a stale changelog from a previous version and should be
VM_FAULT_NEEDDSYNC?
On Wed 11-10-17 15:23:21, Dan Williams wrote:
> On Wed, Oct 11, 2017 at 1:06 PM, Jan Kara <jack@suse.cz> wrote:
> > We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin()
>
> I assume this is a stale changelog from a previous version and should be
> VM_FAULT_NEEDDSYNC?

Yeah, stale changelog. It should be IOMAP_F_DIRTY. Thanks for catching this.

Honza
On Wed, Oct 11, 2017 at 10:06:00PM +0200, Jan Kara wrote:
> We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> synchronous write fault when inode has some uncommitted metadata
> changes. In the fault handler ext4_dax_fault() we then detect this case,
> call vfs_fsync_range() to make sure all metadata is committed, and call
> dax_insert_pfn_mkwrite() to insert page table entry. Note that this will
> also dirty corresponding radix tree entry which is what we want -
> fsync(2) will still provide data integrity guarantees for applications
> not using userspace flushing. And applications using userspace flushing
> can avoid calling fsync(2) and thus avoid the performance overhead.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/file.c       |  6 +++++-
>  fs/ext4/inode.c      | 15 +++++++++++++++
>  fs/jbd2/journal.c    | 17 +++++++++++++++++
>  include/linux/jbd2.h |  1 +
>  4 files changed, 38 insertions(+), 1 deletion(-)

<>

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 31db875bc7a1..13a198924a0f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3394,6 +3394,19 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
>  }
>
>  #ifdef CONFIG_FS_DAX
> +static bool ext4_inode_datasync_dirty(struct inode *inode)
> +{
> +	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> +
> +	if (journal)
> +		return !jbd2_transaction_committed(journal,
> +					EXT4_I(inode)->i_datasync_tid);
> +	/* Any metadata buffers to write? */
> +	if (!list_empty(&inode->i_mapping->private_list))
> +		return true;
> +	return inode->i_state & I_DIRTY_DATASYNC;
> +}

I just had 2 quick questions on this:

1) Does ext4 actually use inode->i_mapping->private_list to keep track of
dirty metadata buffers? The comment above ext4_write_end() leads me to
believe that this list is unused?

 * ext4 never places buffers on inode->i_mapping->private_list. metadata
 * buffers are managed internally.

Or does the above comment only apply to ext4 with a journal?

2) Where is I_DIRTY_DATASYNC set in inode->i_state? I poked around a bit and
couldn't see it.

The rest of the patch looks good to me, and you can add:

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
On Fri 13-10-17 14:58:54, Ross Zwisler wrote:
> On Wed, Oct 11, 2017 at 10:06:00PM +0200, Jan Kara wrote:
> > We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> > synchronous write fault when inode has some uncommitted metadata
> > changes. In the fault handler ext4_dax_fault() we then detect this case,
> > call vfs_fsync_range() to make sure all metadata is committed, and call
> > dax_insert_pfn_mkwrite() to insert page table entry. Note that this will
> > also dirty corresponding radix tree entry which is what we want -
> > fsync(2) will still provide data integrity guarantees for applications
> > not using userspace flushing. And applications using userspace flushing
> > can avoid calling fsync(2) and thus avoid the performance overhead.
> >
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/ext4/file.c       |  6 +++++-
> >  fs/ext4/inode.c      | 15 +++++++++++++++
> >  fs/jbd2/journal.c    | 17 +++++++++++++++++
> >  include/linux/jbd2.h |  1 +
> >  4 files changed, 38 insertions(+), 1 deletion(-)
>
> <>
>
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 31db875bc7a1..13a198924a0f 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3394,6 +3394,19 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
> >  }
> >
> >  #ifdef CONFIG_FS_DAX
> > +static bool ext4_inode_datasync_dirty(struct inode *inode)
> > +{
> > +	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> > +
> > +	if (journal)
> > +		return !jbd2_transaction_committed(journal,
> > +					EXT4_I(inode)->i_datasync_tid);
> > +	/* Any metadata buffers to write? */
> > +	if (!list_empty(&inode->i_mapping->private_list))
> > +		return true;
> > +	return inode->i_state & I_DIRTY_DATASYNC;
> > +}
>
> I just had 2 quick questions on this:
>
> 1) Does ext4 actually use inode->i_mapping->private_list to keep track of
> dirty metadata buffers? The comment above ext4_write_end() leads me to
> believe that this list is unused?
>
>  * ext4 never places buffers on inode->i_mapping->private_list. metadata
>  * buffers are managed internally.
>
> Or does the above comment only apply to ext4 with a journal?

Yes, the above applies for ext4 with a journal. ext4 without a journal uses
inode->i_mapping->private_list for metadata tracking. And DAX can be used
without the journal just fine...

> 2) Where is I_DIRTY_DATASYNC set in inode->i_state? I poked around a bit and
> couldn't see it.

Never directly (at least for ext4). But it will get set by
mark_inode_dirty() as I_DIRTY contains I_DIRTY_DATASYNC.

> The rest of the patch looks good to me, and you can add:
>
> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Thanks!

Honza
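The two answers above can be illustrated with a small userspace model. This is only a sketch, not kernel code: the bit values are an assumption based on how include/linux/fs.h defined the dirty flags around v4.14, and struct model_inode, model_mark_inode_dirty() and model_inode_datasync_dirty() are hypothetical stand-ins for the real inode, mark_inode_dirty() and ext4_inode_datasync_dirty().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Assumed values, mirroring include/linux/fs.h of that era.
 * I_DIRTY is a mask that includes I_DIRTY_DATASYNC.
 */
#define I_DIRTY_SYNC		(1UL << 0)
#define I_DIRTY_DATASYNC	(1UL << 1)
#define I_DIRTY_PAGES		(1UL << 2)
#define I_DIRTY	(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

/* Hypothetical model of the state the patch's helper looks at. */
struct model_inode {
	bool has_journal;	 /* stands in for s_journal != NULL */
	bool tid_committed;	 /* stands in for jbd2_transaction_committed() */
	bool private_list_empty; /* stands in for list_empty(private_list) */
	unsigned long i_state;
};

/* Stand-in for mark_inode_dirty(): ORs in the whole I_DIRTY mask. */
static void model_mark_inode_dirty(struct model_inode *inode)
{
	inode->i_state |= I_DIRTY;
}

/* Same three-way decision as ext4_inode_datasync_dirty() in the patch. */
static bool model_inode_datasync_dirty(const struct model_inode *inode)
{
	if (inode->has_journal)
		return !inode->tid_committed;
	/* No journal: metadata buffers are tracked on private_list. */
	if (!inode->private_list_empty)
		return true;
	return (inode->i_state & I_DIRTY_DATASYNC) != 0;
}
```

In this model a journal-less inode with an empty buffer list reads as clean until model_mark_inode_dirty() is called on it, which matches the point that I_DIRTY_DATASYNC is never set on its own but arrives as part of I_DIRTY.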
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 208adfc3e673..61a8788168f3 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -295,6 +295,7 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
 	 */
 	bool write = (vmf->flags & FAULT_FLAG_WRITE) &&
 		(vmf->vma->vm_flags & VM_SHARED);
+	pfn_t pfn;
 
 	if (write) {
 		sb_start_pagefault(sb);
@@ -310,9 +311,12 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
 	} else {
 		down_read(&EXT4_I(inode)->i_mmap_sem);
 	}
-	result = dax_iomap_fault(vmf, pe_size, NULL, &ext4_iomap_ops);
+	result = dax_iomap_fault(vmf, pe_size, &pfn, &ext4_iomap_ops);
 	if (write) {
 		ext4_journal_stop(handle);
+		/* Handling synchronous page fault? */
+		if (result & VM_FAULT_NEEDDSYNC)
+			result = dax_finish_sync_fault(vmf, pe_size, pfn);
 		up_read(&EXT4_I(inode)->i_mmap_sem);
 		sb_end_pagefault(sb);
 	} else {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 31db875bc7a1..13a198924a0f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3394,6 +3394,19 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
 }
 
 #ifdef CONFIG_FS_DAX
+static bool ext4_inode_datasync_dirty(struct inode *inode)
+{
+	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+
+	if (journal)
+		return !jbd2_transaction_committed(journal,
+					EXT4_I(inode)->i_datasync_tid);
+	/* Any metadata buffers to write? */
+	if (!list_empty(&inode->i_mapping->private_list))
+		return true;
+	return inode->i_state & I_DIRTY_DATASYNC;
+}
+
 static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 			    unsigned flags, struct iomap *iomap)
 {
@@ -3466,6 +3479,8 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	}
 
 	iomap->flags = 0;
+	if ((flags & IOMAP_WRITE) && ext4_inode_datasync_dirty(inode))
+		iomap->flags |= IOMAP_F_DIRTY;
 	iomap->bdev = inode->i_sb->s_bdev;
 	iomap->dax_dev = sbi->s_daxdev;
 	iomap->offset = first_block << blkbits;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 7d5ef3bf3f3e..fa8cde498b4b 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -738,6 +738,23 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
 	return err;
 }
 
+/* Return 1 when transaction with given tid has already committed. */
+int jbd2_transaction_committed(journal_t *journal, tid_t tid)
+{
+	int ret = 1;
+
+	read_lock(&journal->j_state_lock);
+	if (journal->j_running_transaction &&
+	    journal->j_running_transaction->t_tid == tid)
+		ret = 0;
+	if (journal->j_committing_transaction &&
+	    journal->j_committing_transaction->t_tid == tid)
+		ret = 0;
+	read_unlock(&journal->j_state_lock);
+	return ret;
+}
+EXPORT_SYMBOL(jbd2_transaction_committed);
+
 /*
  * When this function returns the transaction corresponding to tid
  * will be completed.  If the transaction has currently running, start
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 606b6bce3a5b..296d1e0ea87b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1367,6 +1367,7 @@ int jbd2_log_start_commit(journal_t *journal, tid_t tid);
 int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
 int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
 int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
+int jbd2_transaction_committed(journal_t *journal, tid_t tid);
 int jbd2_complete_transaction(journal_t *journal, tid_t tid);
 int jbd2_log_do_checkpoint(journal_t *journal);
 int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
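As a reading aid for the new jbd2_transaction_committed() helper in the diff, here is a lock-free userspace sketch of the same check. The model_ types and function are hypothetical names invented for this illustration; the real function also takes j_state_lock for reading, which the sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int tid_t;

/* Hypothetical, stripped-down stand-ins for transaction_t and journal_t. */
struct model_transaction {
	tid_t t_tid;
};

struct model_journal {
	struct model_transaction *running;	/* NULL when none is running */
	struct model_transaction *committing;	/* NULL when none is committing */
};

/*
 * A tid counts as committed unless it belongs to the transaction that is
 * currently running or currently being committed. Locking is omitted
 * here; the kernel helper holds j_state_lock while it looks.
 */
static bool model_transaction_committed(const struct model_journal *journal,
					tid_t tid)
{
	if (journal->running && journal->running->t_tid == tid)
		return false;
	if (journal->committing && journal->committing->t_tid == tid)
		return false;
	return true;
}
```

One consequence worth noting in this model: a tid that matches neither in-flight transaction is reported as committed, so an idle journal (both pointers NULL) reports every tid as committed.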
We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
synchronous write fault when inode has some uncommitted metadata
changes. In the fault handler ext4_dax_fault() we then detect this case,
call vfs_fsync_range() to make sure all metadata is committed, and call
dax_insert_pfn_mkwrite() to insert page table entry. Note that this will
also dirty corresponding radix tree entry which is what we want -
fsync(2) will still provide data integrity guarantees for applications
not using userspace flushing. And applications using userspace flushing
can avoid calling fsync(2) and thus avoid the performance overhead.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c       |  6 +++++-
 fs/ext4/inode.c      | 15 +++++++++++++++
 fs/jbd2/journal.c    | 17 +++++++++++++++++
 include/linux/jbd2.h |  1 +
 4 files changed, 38 insertions(+), 1 deletion(-)