[v5,00/12] ext4: port direct I/O to iomap infrastructure

Message ID	cover.1571647178.git.mbobrowski@mbobrowski.org
Headers	show Return-Path: <linux-ext4-owner@vger.kernel.org> Date: Mon, 21 Oct 2019 20:17:18 +1100 From: Matthew Bobrowski <mbobrowski@mbobrowski.org> To: tytso@mit.edu, jack@suse.cz, adilger.kernel@dilger.ca Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, hch@infradead.org, david@fromorbit.com, darrick.wong@oracle.com Subject: [PATCH v5 00/12] ext4: port direct I/O to iomap infrastructure Message-ID: <cover.1571647178.git.mbobrowski@mbobrowski.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk
Series	ext4: port direct I/O to iomap infrastructure \| expand [v5,00/12] ext4: port direct I/O to iomap infrastructure [v5,01/12] ext4: move set iomap routines into separate helper ext4_set_iomap() [v5,02/12] ext4: iomap that extends beyond EOF should be marked dirty [v5,03/12] ext4: split IOMAP_WRITE branch in ext4_iomap_begin() into helper [v5,04/12] ext4: introduce new callback for IOMAP_REPORT [v5,05/12] iomap: Allow forcing of waiting for running DIO in iomap_dio_rw() [v5,06/12] xfs: Use iomap_dio_rw_wait() [v5,07/12] ext4: introduce direct I/O read using iomap infrastructure [v5,08/12] ext4: update direct I/O read to do trylock in IOCB_NOWAIT cases [v5,09/12] ext4: move inode extension/truncate code out from ->iomap_end() callback [v5,12/12] ext4: introduce direct I/O write using iomap infrastructure [v5,11/12] ext4: reorder map->m_flags checks in ext4_set_iomap() [v5,10/12] ext4: move inode extension check out from ext4_iomap_alloc()

Matthew Bobrowski Oct. 21, 2019, 9:17 a.m. UTC

Hello!

I managed to get some time over the weekend to fix up the stuff that
came out from the last review. Please feel free to review and provide
any necessary constructive criticism.

This patch series ports the ext4 direct I/O paths over to the iomap
infrastructure. The legacy buffer_head based direct I/O implementation
has subsequently been removed as it's no longer in use. The result of
this change is that ext4 now uses the newer iomap framework for direct
I/O read/write operations. Overall, this results in a much cleaner
direct I/O implementation and keeps this code isolated from the
buffer_head internals. In addition to this, a slight performance boost
may be expected while using O_SYNC | O_DIRECT.

The changes within this patch series have been tested via xfstests in
both DAX and non-DAX modes using the various filesystem configuration
options i.e. 4k, dioread_nolock, etc.

Changes since v4:

- Removed the ext4_handle_inode_extension() function out from the
  ->end_io() callback so that we can realiably truncate any allocated
  blocks as we have all the information we need in
  ext4_dio_write_iter().

- Fixed a couple comment formatting issues in addition to spelling
  mistakes here and there.

- Incorporated the fix for the issue that Dave Chinner found around
  writes extending beyond and not marking the iomap dirty.

Thank you all who took the time to review and provide feedback. :)

Jan Kara (2):
  iomap: Allow forcing of waiting for running DIO in iomap_dio_rw()
  xfs: Use iomap_dio_rw_wait()

Matthew Bobrowski (10):
  ext4: move set iomap routines into separate helper ext4_set_iomap()
  ext4: iomap that extends beyond EOF should be marked dirty
  ext4: split IOMAP_WRITE branch in ext4_iomap_begin() into helper
  ext4: introduce new callback for IOMAP_REPORT
  ext4: introduce direct I/O read using iomap infrastructure
  ext4: update direct I/O read to do trylock in IOCB_NOWAIT cases
  ext4: move inode extension/truncate code out from ->iomap_end()
    callback
  ext4: move inode extension check out from ext4_iomap_alloc()
  ext4: reorder map->m_flags checks in ext4_set_iomap()
  ext4: introduce direct I/O write using iomap infrastructure

 fs/ext4/ext4.h        |   4 +-
 fs/ext4/extents.c     |   4 +-
 fs/ext4/file.c        | 378 +++++++++++++++++-----
 fs/ext4/inode.c       | 716 +++++++++++-------------------------------
 fs/gfs2/file.c        |   6 +-
 fs/iomap/direct-io.c  |   7 +-
 fs/xfs/xfs_file.c     |  13 +-
 include/linux/iomap.h |   3 +-
 8 files changed, 498 insertions(+), 633 deletions(-)

Theodore Ts'o Oct. 21, 2019, 1:31 p.m. UTC | #1

Hi Matthew, thanks for your work on this patch series!

I applied it against 4c3, and ran a quick test run on it, and found
the following locking problem.  To reproduce:

kvm-xfstests -c nojournal generic/113

generic/113		[09:27:19][    5.841937] run fstests generic/113 at 2019-10-21 09:27:19
[    7.959477] 
[    7.959798] ============================================
[    7.960518] WARNING: possible recursive locking detected
[    7.961225] 5.4.0-rc3-xfstests-00012-g7fe6ea084e48 #1238 Not tainted
[    7.961991] --------------------------------------------
[    7.962569] aio-stress/1516 is trying to acquire lock:
[    7.963129] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: __generic_file_fsync+0x3e/0xb0
[    7.964109] 
[    7.964109] but task is already holding lock:
[    7.964740] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: ext4_dio_write_iter+0x15b/0x430
[    7.965763] 
[    7.965763] other info that might help us debug this:
[    7.966630]  Possible unsafe locking scenario:
[    7.966630] 
[    7.967424]        CPU0
[    7.967760]        ----
[    7.968097]   lock(&sb->s_type->i_mutex_key#12);
[    7.968827]   lock(&sb->s_type->i_mutex_key#12);
[    7.969558] 
[    7.969558]  *** DEADLOCK ***
[    7.969558] 
[    7.970518]  May be due to missing lock nesting notation
[    7.970518] 
[    7.971592] 1 lock held by aio-stress/1516:
[    7.972267]  #0: ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: ext4_dio_write_iter+0x15b/0x430
[    7.973807] 
[    7.973807] stack backtrace:
[    7.974510] CPU: 0 PID: 1516 Comm: aio-stress Not tainted 5.4.0-rc3-xfstests-00012-g7fe6ea084e48 #1238
[    7.976053] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[    7.977327] Call Trace:
[    7.977700]  dump_stack+0x67/0x90
[    7.978198]  __lock_acquire.cold+0x130/0x1f7
[    7.978829]  ? __switch_to_asm+0x40/0x70
[    7.979659]  lock_acquire+0x9a/0x160
[    7.980320]  ? __generic_file_fsync+0x3e/0xb0
[    7.981014]  down_write+0x40/0x110
[    7.981717]  ? __generic_file_fsync+0x3e/0xb0
[    7.982676]  __generic_file_fsync+0x3e/0xb0
[    7.983454]  ext4_sync_file+0x277/0x4e0
[    7.984188]  iomap_dio_complete+0x112/0x130
[    7.984971]  ? iomap_dio_rw+0x3a0/0x4b0
[    7.985647]  iomap_dio_rw+0x419/0x4b0
[    7.986317]  ? ext4_dio_write_iter+0x296/0x430
[    7.987039]  ext4_dio_write_iter+0x296/0x430
[    7.987786]  aio_write+0xef/0x1c0
[    7.988284]  ? kvm_sched_clock_read+0x14/0x30
[    7.988822]  ? sched_clock+0x5/0x10
[    7.989234]  ? sched_clock_cpu+0xc/0xc0
[    7.989719]  __io_submit_one.constprop.0+0x399/0x5f0
[    7.990315]  ? kvm_sched_clock_read+0x14/0x30
[    7.990917]  ? sched_clock+0x5/0x10
[    7.991473]  ? sched_clock_cpu+0xc/0xc0
[    7.992097]  ? io_submit_one+0x141/0x5a0
[    7.992741]  io_submit_one+0x141/0x5a0
[    7.993354]  __x64_sys_io_submit+0x9a/0x290
[    7.993853]  ? do_syscall_64+0x50/0x1b0
[    7.994250]  ? __ia32_sys_io_destroy+0x10/0x10
[    7.994748]  do_syscall_64+0x50/0x1b0
[    7.995175]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[    7.995761] RIP: 0033:0x55d1268c2d17
[    7.996270] Code: 00 75 08 8b 47 0c 39 47 08 74 08 e9 b3 ff ff ff 0f 1f 00 31 c0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 d1 00 00 00 0f 05 <c3> 0f 1f 84 00 00 00 00 00 48 63 ff b8 ce 00 00 00 0f 05 c3 0f 1f
[    7.999131] RSP: 002b:00007f090fb0bd88 EFLAGS: 00000202 ORIG_RAX: 00000000000000d1
[    7.999994] RAX: ffffffffffffffda RBX: 000055d128135010 RCX: 000055d1268c2d17
[    8.000881] RDX: 000055d128135010 RSI: 0000000000000008 RDI: 00007f0907263000
[    8.001765] RBP: 000055d128129560 R08: 00007fff421ae080 R09: 00007f090fb0bd68
[    8.002824] R10: 00007f090fb0bd60 R11: 0000000000000202 R12: 0000000000000008
[    8.004016] R13: 00007f090fb0bdb0 R14: 00007f090fb0bda0 R15: 000055d128129560

I'm other test configurations are still running, and I haven't had a
chance to do a detailed review on it, but I'll try to get to it this
week.

Thanks,

						- Ted

Jan Kara Oct. 21, 2019, 7:43 p.m. UTC | #2

On Mon 21-10-19 09:31:12, Theodore Y. Ts'o wrote:
> Hi Matthew, thanks for your work on this patch series!
> 
> I applied it against 4c3, and ran a quick test run on it, and found
> the following locking problem.  To reproduce:
> 
> kvm-xfstests -c nojournal generic/113
> 
> generic/113		[09:27:19][    5.841937] run fstests generic/113 at 2019-10-21 09:27:19
> [    7.959477] 
> [    7.959798] ============================================
> [    7.960518] WARNING: possible recursive locking detected
> [    7.961225] 5.4.0-rc3-xfstests-00012-g7fe6ea084e48 #1238 Not tainted
> [    7.961991] --------------------------------------------
> [    7.962569] aio-stress/1516 is trying to acquire lock:
> [    7.963129] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: __generic_file_fsync+0x3e/0xb0
> [    7.964109] 
> [    7.964109] but task is already holding lock:
> [    7.964740] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: ext4_dio_write_iter+0x15b/0x430

This is going to be a tricky one. With iomap, the inode locking is handled
by the filesystem while calling generic_write_sync() is done by
iomap_dio_rw(). I would really prefer to avoid tweaking iomap_dio_rw() not
to call generic_write_sync(). So we need to remove inode_lock from
__generic_file_fsync() (used from ext4_sync_file()). This locking is mostly
for legacy purposes and we don't need this in ext4 AFAICT - but removing
the lock from __generic_file_fsync() would mean auditing all legacy
filesystems that use this to make sure flushing inode & its metadata buffer
list while it is possibly changing cannot result in something unexpected. I
don't want to clutter this series with it so we are left with
reimplementing __generic_file_fsync() inside ext4 without inode_lock. Not
too bad but not great either. Thoughts?

								Honza

Dave Chinner Oct. 21, 2019, 10:38 p.m. UTC | #3

On Mon, Oct 21, 2019 at 09:43:30PM +0200, Jan Kara wrote:
> On Mon 21-10-19 09:31:12, Theodore Y. Ts'o wrote:
> > Hi Matthew, thanks for your work on this patch series!
> > 
> > I applied it against 4c3, and ran a quick test run on it, and found
> > the following locking problem.  To reproduce:
> > 
> > kvm-xfstests -c nojournal generic/113
> > 
> > generic/113		[09:27:19][    5.841937] run fstests generic/113 at 2019-10-21 09:27:19
> > [    7.959477] 
> > [    7.959798] ============================================
> > [    7.960518] WARNING: possible recursive locking detected
> > [    7.961225] 5.4.0-rc3-xfstests-00012-g7fe6ea084e48 #1238 Not tainted
> > [    7.961991] --------------------------------------------
> > [    7.962569] aio-stress/1516 is trying to acquire lock:
> > [    7.963129] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: __generic_file_fsync+0x3e/0xb0
> > [    7.964109] 
> > [    7.964109] but task is already holding lock:
> > [    7.964740] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: ext4_dio_write_iter+0x15b/0x430
> 
> This is going to be a tricky one. With iomap, the inode locking is handled
> by the filesystem while calling generic_write_sync() is done by
> iomap_dio_rw(). I would really prefer to avoid tweaking iomap_dio_rw() not
> to call generic_write_sync().

You can't remove it from there, because that will break O_DSYNC
AIO+DIO. i.e. generic_write_sync() needs to be called before
iocb->ki_complete() is called in the AIO completion path, and that
means filesystems using iomap_dio_rw need to be be able to run
generic_write_sync() without taking the inode_lock().

> So we need to remove inode_lock from
> __generic_file_fsync() (used from ext4_sync_file()). This locking is mostly
> for legacy purposes and we don't need this in ext4 AFAICT - but removing
> the lock from __generic_file_fsync() would mean auditing all legacy
> filesystems that use this to make sure flushing inode & its metadata buffer
> list while it is possibly changing cannot result in something unexpected. I
> don't want to clutter this series with it so we are left with
> reimplementing __generic_file_fsync() inside ext4 without inode_lock. Not
> too bad but not great either. Thoughts?

XFS has implemented it's own ->fsync operation pretty much forever
without issues. It's basically:

	1. flush dirty cached data w/ WB_SYNC_ALL
	2. flush dirty cached metadata (i.e. journal force)
	3. flush device caches if journal force didn't, keeping in
	mind the requirements of data and journal being placed on
	different devices.

The ext4 variant shouldn't need to be any more complex than that...

Cheers,

Dave.

Jan Kara Oct. 22, 2019, 8:01 a.m. UTC | #4

On Tue 22-10-19 09:38:19, Dave Chinner wrote:
> On Mon, Oct 21, 2019 at 09:43:30PM +0200, Jan Kara wrote:
> > On Mon 21-10-19 09:31:12, Theodore Y. Ts'o wrote:
> > > Hi Matthew, thanks for your work on this patch series!
> > > 
> > > I applied it against 4c3, and ran a quick test run on it, and found
> > > the following locking problem.  To reproduce:
> > > 
> > > kvm-xfstests -c nojournal generic/113
> > > 
> > > generic/113		[09:27:19][    5.841937] run fstests generic/113 at 2019-10-21 09:27:19
> > > [    7.959477] 
> > > [    7.959798] ============================================
> > > [    7.960518] WARNING: possible recursive locking detected
> > > [    7.961225] 5.4.0-rc3-xfstests-00012-g7fe6ea084e48 #1238 Not tainted
> > > [    7.961991] --------------------------------------------
> > > [    7.962569] aio-stress/1516 is trying to acquire lock:
> > > [    7.963129] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: __generic_file_fsync+0x3e/0xb0
> > > [    7.964109] 
> > > [    7.964109] but task is already holding lock:
> > > [    7.964740] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: ext4_dio_write_iter+0x15b/0x430
> > 
> > This is going to be a tricky one. With iomap, the inode locking is handled
> > by the filesystem while calling generic_write_sync() is done by
> > iomap_dio_rw(). I would really prefer to avoid tweaking iomap_dio_rw() not
> > to call generic_write_sync().
> 
> You can't remove it from there, because that will break O_DSYNC
> AIO+DIO. i.e. generic_write_sync() needs to be called before
> iocb->ki_complete() is called in the AIO completion path, and that
> means filesystems using iomap_dio_rw need to be be able to run
> generic_write_sync() without taking the inode_lock().
> 
> > So we need to remove inode_lock from
> > __generic_file_fsync() (used from ext4_sync_file()). This locking is mostly
> > for legacy purposes and we don't need this in ext4 AFAICT - but removing
> > the lock from __generic_file_fsync() would mean auditing all legacy
> > filesystems that use this to make sure flushing inode & its metadata buffer
> > list while it is possibly changing cannot result in something unexpected. I
> > don't want to clutter this series with it so we are left with
> > reimplementing __generic_file_fsync() inside ext4 without inode_lock. Not
> > too bad but not great either. Thoughts?
> 
> XFS has implemented it's own ->fsync operation pretty much forever
> without issues. It's basically:
> 
> 	1. flush dirty cached data w/ WB_SYNC_ALL
> 	2. flush dirty cached metadata (i.e. journal force)
> 	3. flush device caches if journal force didn't, keeping in
> 	mind the requirements of data and journal being placed on
> 	different devices.
> 
> The ext4 variant shouldn't need to be any more complex than that...

Yeah, that's what we do for the common case as well. But when the
filesystem is created without a journal (i.e., ext2 compatibility mode) we
currently use the old fsync implementation including
__generic_file_fsync(). But as I wrote above, duplicating those ~5 lines
out of __generic_file_fsync() we really care about is not a big deal.

								Honza

Matthew Bobrowski Oct. 23, 2019, 2:35 a.m. UTC | #5

On Mon, Oct 21, 2019 at 09:43:30PM +0200, Jan Kara wrote:
> On Mon 21-10-19 09:31:12, Theodore Y. Ts'o wrote:
> > Hi Matthew, thanks for your work on this patch series!
> > 
> > I applied it against 4c3, and ran a quick test run on it, and found
> > the following locking problem.  To reproduce:
> > 
> > kvm-xfstests -c nojournal generic/113
> > 
> > generic/113		[09:27:19][    5.841937] run fstests generic/113 at 2019-10-21 09:27:19
> > [    7.959477] 
> > [    7.959798] ============================================
> > [    7.960518] WARNING: possible recursive locking detected
> > [    7.961225] 5.4.0-rc3-xfstests-00012-g7fe6ea084e48 #1238 Not tainted
> > [    7.961991] --------------------------------------------
> > [    7.962569] aio-stress/1516 is trying to acquire lock:
> > [    7.963129] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: __generic_file_fsync+0x3e/0xb0
> > [    7.964109] 
> > [    7.964109] but task is already holding lock:
> > [    7.964740] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: ext4_dio_write_iter+0x15b/0x430
> 
> This is going to be a tricky one. With iomap, the inode locking is handled
> by the filesystem while calling generic_write_sync() is done by
> iomap_dio_rw(). I would really prefer to avoid tweaking iomap_dio_rw() not
> to call generic_write_sync(). So we need to remove inode_lock from
> __generic_file_fsync() (used from ext4_sync_file()). This locking is mostly
> for legacy purposes and we don't need this in ext4 AFAICT - but removing
> the lock from __generic_file_fsync() would mean auditing all legacy
> filesystems that use this to make sure flushing inode & its metadata buffer
> list while it is possibly changing cannot result in something unexpected. I
> don't want to clutter this series with it so we are left with
> reimplementing __generic_file_fsync() inside ext4 without inode_lock. Not
> too bad but not great either. Thoughts?

So, I just looked at this on my lunch break and I think the simplest approach
would be to just transfer the necessary chunks of code from within
__generic_file_fsync() into ext4_sync_file() for !journal cases, minus the
inode lock, and minus calling into __generic_file_fsync(). I don't forsee this
causing any issues, but feel free to correct me if I'm wrong.

If this is deemed to be OK, then I will go ahead and include this as a
separate patch in my series.

--<M>--

Jan Kara Oct. 23, 2019, 10:01 a.m. UTC | #6

On Wed 23-10-19 13:35:19, Matthew Bobrowski wrote:
> On Mon, Oct 21, 2019 at 09:43:30PM +0200, Jan Kara wrote:
> > On Mon 21-10-19 09:31:12, Theodore Y. Ts'o wrote:
> > > Hi Matthew, thanks for your work on this patch series!
> > > 
> > > I applied it against 4c3, and ran a quick test run on it, and found
> > > the following locking problem.  To reproduce:
> > > 
> > > kvm-xfstests -c nojournal generic/113
> > > 
> > > generic/113		[09:27:19][    5.841937] run fstests generic/113 at 2019-10-21 09:27:19
> > > [    7.959477] 
> > > [    7.959798] ============================================
> > > [    7.960518] WARNING: possible recursive locking detected
> > > [    7.961225] 5.4.0-rc3-xfstests-00012-g7fe6ea084e48 #1238 Not tainted
> > > [    7.961991] --------------------------------------------
> > > [    7.962569] aio-stress/1516 is trying to acquire lock:
> > > [    7.963129] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: __generic_file_fsync+0x3e/0xb0
> > > [    7.964109] 
> > > [    7.964109] but task is already holding lock:
> > > [    7.964740] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: ext4_dio_write_iter+0x15b/0x430
> > 
> > This is going to be a tricky one. With iomap, the inode locking is handled
> > by the filesystem while calling generic_write_sync() is done by
> > iomap_dio_rw(). I would really prefer to avoid tweaking iomap_dio_rw() not
> > to call generic_write_sync(). So we need to remove inode_lock from
> > __generic_file_fsync() (used from ext4_sync_file()). This locking is mostly
> > for legacy purposes and we don't need this in ext4 AFAICT - but removing
> > the lock from __generic_file_fsync() would mean auditing all legacy
> > filesystems that use this to make sure flushing inode & its metadata buffer
> > list while it is possibly changing cannot result in something unexpected. I
> > don't want to clutter this series with it so we are left with
> > reimplementing __generic_file_fsync() inside ext4 without inode_lock. Not
> > too bad but not great either. Thoughts?
> 
> So, I just looked at this on my lunch break and I think the simplest
> approach would be to just transfer the necessary chunks of code from
> within __generic_file_fsync() into ext4_sync_file() for !journal cases,
> minus the inode lock, and minus calling into __generic_file_fsync(). I
> don't forsee this causing any issues, but feel free to correct me if I'm
> wrong.

Yes, that's what I'd suggest as well. In fact when doing that you can share
file_write_and_wait_range() call with the one already in ext4_sync_file()
use for other cases. Similarly with file_check_and_advance_wb_err(). So the
copied bit will be really only:

        ret = sync_mapping_buffers(inode->i_mapping);
        if (!(inode->i_state & I_DIRTY_ALL))
                goto out;
        if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
                goto out;

        err = sync_inode_metadata(inode, 1);
        if (ret == 0)
                ret = err;

> If this is deemed to be OK, then I will go ahead and include this as a
> separate patch in my series.

Yes, please.

								Honza

Matthew Bobrowski Oct. 23, 2019, 10:11 a.m. UTC | #7

On Wed, Oct 23, 2019 at 12:01:53PM +0200, Jan Kara wrote:
> On Wed 23-10-19 13:35:19, Matthew Bobrowski wrote:
> > On Mon, Oct 21, 2019 at 09:43:30PM +0200, Jan Kara wrote:
> > > On Mon 21-10-19 09:31:12, Theodore Y. Ts'o wrote:
> > > > Hi Matthew, thanks for your work on this patch series!
> > > > 
> > > > I applied it against 4c3, and ran a quick test run on it, and found
> > > > the following locking problem.  To reproduce:
> > > > 
> > > > kvm-xfstests -c nojournal generic/113
> > > > 
> > > > generic/113		[09:27:19][    5.841937] run fstests generic/113 at 2019-10-21 09:27:19
> > > > [    7.959477] 
> > > > [    7.959798] ============================================
> > > > [    7.960518] WARNING: possible recursive locking detected
> > > > [    7.961225] 5.4.0-rc3-xfstests-00012-g7fe6ea084e48 #1238 Not tainted
> > > > [    7.961991] --------------------------------------------
> > > > [    7.962569] aio-stress/1516 is trying to acquire lock:
> > > > [    7.963129] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: __generic_file_fsync+0x3e/0xb0
> > > > [    7.964109] 
> > > > [    7.964109] but task is already holding lock:
> > > > [    7.964740] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: ext4_dio_write_iter+0x15b/0x430
> > > 
> > > This is going to be a tricky one. With iomap, the inode locking is handled
> > > by the filesystem while calling generic_write_sync() is done by
> > > iomap_dio_rw(). I would really prefer to avoid tweaking iomap_dio_rw() not
> > > to call generic_write_sync(). So we need to remove inode_lock from
> > > __generic_file_fsync() (used from ext4_sync_file()). This locking is mostly
> > > for legacy purposes and we don't need this in ext4 AFAICT - but removing
> > > the lock from __generic_file_fsync() would mean auditing all legacy
> > > filesystems that use this to make sure flushing inode & its metadata buffer
> > > list while it is possibly changing cannot result in something unexpected. I
> > > don't want to clutter this series with it so we are left with
> > > reimplementing __generic_file_fsync() inside ext4 without inode_lock. Not
> > > too bad but not great either. Thoughts?
> > 
> > So, I just looked at this on my lunch break and I think the simplest
> > approach would be to just transfer the necessary chunks of code from
> > within __generic_file_fsync() into ext4_sync_file() for !journal cases,
> > minus the inode lock, and minus calling into __generic_file_fsync(). I
> > don't forsee this causing any issues, but feel free to correct me if I'm
> > wrong.
> 
> Yes, that's what I'd suggest as well. In fact when doing that you can share
> file_write_and_wait_range() call with the one already in ext4_sync_file()
> use for other cases. Similarly with file_check_and_advance_wb_err(). So the
> copied bit will be really only:
> 
>         ret = sync_mapping_buffers(inode->i_mapping);
>         if (!(inode->i_state & I_DIRTY_ALL))
>                 goto out;
>         if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
>                 goto out;
> 
>         err = sync_inode_metadata(inode, 1);
>         if (ret == 0)
>                 ret = err;
> 
> > If this is deemed to be OK, then I will go ahead and include this as a
> > separate patch in my series.
> 
> Yes, please.

Heh!

I just finished writing and testing it and exactly what I've done
(attached). Anyway, I will include it in v6. :)

--<M>--

Christoph Hellwig Oct. 24, 2019, 1:58 a.m. UTC | #8

Maybe something like the version below which declutter the thing a bit?

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 5508baa11bb6..ab601c6779d2 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -80,6 +80,41 @@ static int ext4_sync_parent(struct inode *inode)
 	return ret;
 }
 
+static int ext4_fsync_nojournal(struct inode *inode, bool datasync,
+		bool *needs_barrier)
+{
+	int ret, err;
+
+	ret = sync_mapping_buffers(inode->i_mapping);
+	if (!(inode->i_state & I_DIRTY_ALL))
+		return ret;
+	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
+		return ret;
+
+	err = sync_inode_metadata(inode, 1);
+	if (!ret)
+		ret = err;
+
+	if (!ret)
+		ret = ext4_sync_parent(inode);
+	if (test_opt(inode->i_sb, BARRIER))
+		*needs_barrier = true;
+	return ret;
+}
+
+static int ext4_fsync_journal(struct inode *inode, bool datasync,
+		bool *needs_barrier)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+	tid_t commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
+
+	if (journal->j_flags & JBD2_BARRIER &&
+	    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
+		*needs_barrier = true;
+	return jbd2_complete_transaction(journal, commit_tid);
+}
+
 /*
  * akpm: A new design for ext4_sync_file().
  *
@@ -95,13 +130,11 @@ static int ext4_sync_parent(struct inode *inode)
 int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 {
 	struct inode *inode = file->f_mapping->host;
-	struct ext4_inode_info *ei = EXT4_I(inode);
-	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	int ret = 0, err;
-	tid_t commit_tid;
 	bool needs_barrier = false;
 
-	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+	if (unlikely(ext4_forced_shutdown(sbi)))
 		return -EIO;
 
 	J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -116,18 +149,10 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		goto out;
 	}
 
-	if (!journal) {
-		ret = __generic_file_fsync(file, start, end, datasync);
-		if (!ret)
-			ret = ext4_sync_parent(inode);
-		if (test_opt(inode->i_sb, BARRIER))
-			goto issue_flush;
-		goto out;
-	}
-
 	ret = file_write_and_wait_range(file, start, end);
 	if (ret)
 		return ret;
+
 	/*
 	 * data=writeback,ordered:
 	 *  The caller's filemap_fdatawrite()/wait will sync the data.
@@ -142,18 +167,14 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 *  (they were dirtied by commit).  But that's OK - the blocks are
 	 *  safe in-journal, which is all fsync() needs to ensure.
 	 */
-	if (ext4_should_journal_data(inode)) {
+	if (!sbi->s_journal) 
+		ret = ext4_fsync_nojournal(inode, datasync, &needs_barrier);
+	else if (ext4_should_journal_data(inode))
 		ret = ext4_force_commit(inode->i_sb);
-		goto out;
-	}
+	else
+		ret = ext4_fsync_journal(inode, datasync, &needs_barrier);
 
-	commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
-	if (journal->j_flags & JBD2_BARRIER &&
-	    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
-		needs_barrier = true;
-	ret = jbd2_complete_transaction(journal, commit_tid);
 	if (needs_barrier) {
-	issue_flush:
 		err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
 		if (!ret)
 			ret = err;

Matthew Bobrowski Oct. 24, 2019, 11:09 a.m. UTC | #9

On Wed, Oct 23, 2019 at 06:58:57PM -0700, Christoph Hellwig wrote:
> Maybe something like the version below which declutter the thing a bit?

Yeah, I like this suggestion Christoph. Super awesome.

Thank you!

--<M>--

[v5,00/12] ext4: port direct I/O to iomap infrastructure

Message

Comments