Message ID | 1363742959-12815-1-git-send-email-tytso@mit.edu |
---|---|
State | Accepted, archived |
Headers | show |
I found one more regression failure while running tests on the dev branch. It doesn't fail every single time you run xfstests #270; in my test setup, the test failure happens approximatrely one in three times. Fortunately, "kvm-xfstests -c dioread_nolock 270,270,270,270,270,270,270,270,270" made it easy to bisect the failure to commit 84c17543ab56 (ext4: move work from io_end to inode). I've added this commit to fix it, and restarted running the regression tests. Hopefully this will be the last fix up required before I send a pull request to Linus.... - Ted On Tue, Mar 19, 2013 at 09:29:19PM -0400, Theodore Ts'o wrote: > Commit 84c17543ab56 (ext4: move work from io_end to inode) triggered a > regression when running xfstest #270 when the file system is mounted > with dioread_nolock. > > The problem is that after ext4_evict_inode() calls ext4_ioend_wait(), > this guarantees that last io_end structure has been freed, but it does > not guarantee that the workqueue structure, which was moved into the > inode by commit 84c17543ab56, is actually finished. Once > ext4_flush_completed_IO() calls ext4_free_io_end() on CPU #1, this > will allow ext4_ioend_wait() to return on CPU #2, at which point the > evict_inode() codepath can race against the workqueue code on CPU #1 > accessing EXT4_I(inode)->i_unwritten_work to find the next item of > work to do. > > Fix this by calling flush_work() if the work structure is still > pending in ext$_ioend_wait(). Also, move the call to > ext4_ioend_wait() until after truncate_inode_pages() and > filemap_write_and_wait() are called, to make sure all dirty pages have > been written back and flushed from the page cache first. > > BUG: unable to handle kernel NULL pointer dereference at (null) > IP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e > *pdpt = 0000000030bc3001 *pde = 0000000000000000 > Oops: 0000 [#1] SMP DEBUG_PAGEALLOC > Modules linked in: > Pid: 6, comm: kworker/u:0 Not tainted 3.8.0-rc3-00013-g84c1754-dirty #91 Bochs Bochs > EIP: 0060:[<c01dda6a>] EFLAGS: 00010046 CPU: 0 > EIP is at cwq_activate_delayed_work+0x3b/0x7e > EAX: 00000000 EBX: 00000000 ECX: f505fe54 EDX: 00000000 > ESI: ed5b697c EDI: 00000006 EBP: f64b7e8c ESP: f64b7e84 > DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 > CR0: 8005003b CR2: 00000000 CR3: 30bc2000 CR4: 000006f0 > DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 > DR6: ffff0ff0 DR7: 00000400 > Process kworker/u:0 (pid: 6, ti=f64b6000 task=f64b4160 task.ti=f64b6000) > Stack: > f505fe00 00000006 f64b7e9c c01de3d7 f6435540 00000003 f64b7efc c01def1d > f6435540 00000002 00000000 0000008a c16d0808 c040a10b c16d07d8 c16d08b0 > f505fe00 c16d0780 00000000 00000000 ee153df4 c1ce4a30 c17d0e30 00000000 > Call Trace: > [<c01de3d7>] cwq_dec_nr_in_flight+0x71/0xfb > [<c01def1d>] process_one_work+0x5d8/0x637 > [<c040a10b>] ? ext4_end_bio+0x300/0x300 > [<c01e3105>] worker_thread+0x249/0x3ef > [<c01ea317>] kthread+0xd8/0xeb > [<c01e2ebc>] ? manage_workers+0x4bb/0x4bb > [<c023a370>] ? trace_hardirqs_on+0x27/0x37 > [<c0f1b4b7>] ret_from_kernel_thread+0x1b/0x28 > [<c01ea23f>] ? __init_kthread_worker+0x71/0x71 > Code: 01 83 15 ac ff 6c c1 00 31 db 89 c6 8b 00 a8 04 74 12 89 c3 30 db 83 05 b0 ff 6c c1 01 83 15 b4 ff 6c c1 00 89 f0 e8 42 ff ff ff <8b> 13 89 f0 83 05 b8 ff 6c c1 > 6c c1 00 31 c9 83 > EIP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e SS:ESP 0068:f64b7e84 > CR2: 0000000000000000 > ---[ end trace a1923229da53d8a4 ]--- > > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> > Cc: Jan Kara <jack@suse.cz> ... -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue 19-03-13 21:29:19, Ted Tso wrote: > Commit 84c17543ab56 (ext4: move work from io_end to inode) triggered a > regression when running xfstest #270 when the file system is mounted > with dioread_nolock. > > The problem is that after ext4_evict_inode() calls ext4_ioend_wait(), > this guarantees that last io_end structure has been freed, but it does > not guarantee that the workqueue structure, which was moved into the > inode by commit 84c17543ab56, is actually finished. Once > ext4_flush_completed_IO() calls ext4_free_io_end() on CPU #1, this > will allow ext4_ioend_wait() to return on CPU #2, at which point the > evict_inode() codepath can race against the workqueue code on CPU #1 > accessing EXT4_I(inode)->i_unwritten_work to find the next item of > work to do. > > Fix this by calling flush_work() if the work structure is still > pending in ext$_ioend_wait(). Also, move the call to > ext4_ioend_wait() until after truncate_inode_pages() and > filemap_write_and_wait() are called, to make sure all dirty pages have > been written back and flushed from the page cache first. > > BUG: unable to handle kernel NULL pointer dereference at (null) > IP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e > *pdpt = 0000000030bc3001 *pde = 0000000000000000 > Oops: 0000 [#1] SMP DEBUG_PAGEALLOC > Modules linked in: > Pid: 6, comm: kworker/u:0 Not tainted 3.8.0-rc3-00013-g84c1754-dirty #91 Bochs Bochs > EIP: 0060:[<c01dda6a>] EFLAGS: 00010046 CPU: 0 > EIP is at cwq_activate_delayed_work+0x3b/0x7e > EAX: 00000000 EBX: 00000000 ECX: f505fe54 EDX: 00000000 > ESI: ed5b697c EDI: 00000006 EBP: f64b7e8c ESP: f64b7e84 > DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 > CR0: 8005003b CR2: 00000000 CR3: 30bc2000 CR4: 000006f0 > DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 > DR6: ffff0ff0 DR7: 00000400 > Process kworker/u:0 (pid: 6, ti=f64b6000 task=f64b4160 task.ti=f64b6000) > Stack: > f505fe00 00000006 f64b7e9c c01de3d7 f6435540 00000003 f64b7efc c01def1d > f6435540 00000002 00000000 0000008a c16d0808 c040a10b c16d07d8 c16d08b0 > f505fe00 c16d0780 00000000 00000000 ee153df4 c1ce4a30 c17d0e30 00000000 > Call Trace: > [<c01de3d7>] cwq_dec_nr_in_flight+0x71/0xfb > [<c01def1d>] process_one_work+0x5d8/0x637 > [<c040a10b>] ? ext4_end_bio+0x300/0x300 > [<c01e3105>] worker_thread+0x249/0x3ef > [<c01ea317>] kthread+0xd8/0xeb > [<c01e2ebc>] ? manage_workers+0x4bb/0x4bb > [<c023a370>] ? trace_hardirqs_on+0x27/0x37 > [<c0f1b4b7>] ret_from_kernel_thread+0x1b/0x28 > [<c01ea23f>] ? __init_kthread_worker+0x71/0x71 > Code: 01 83 15 ac ff 6c c1 00 31 db 89 c6 8b 00 a8 04 74 12 89 c3 30 db 83 05 b0 ff 6c c1 01 83 15 b4 ff 6c c1 00 89 f0 e8 42 ff ff ff <8b> 13 89 f0 83 05 b8 ff 6c c1 > 6c c1 00 31 c9 83 > EIP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e SS:ESP 0068:f64b7e84 > CR2: 0000000000000000 > ---[ end trace a1923229da53d8a4 ]--- > > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> > Cc: Jan Kara <jack@suse.cz> Good catch. Thanks for fixing this. Just one question below: > diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c > index 809b310..8a004a4 100644 > --- a/fs/ext4/page-io.c > +++ b/fs/ext4/page-io.c > @@ -50,11 +50,21 @@ void ext4_exit_pageio(void) > kmem_cache_destroy(io_page_cachep); > } > > +/* > + * Called by ext4_evict_inode() to make sure there are no pending I/O > + * completion work left to do. > + */ > void ext4_ioend_wait(struct inode *inode) > { > wait_queue_head_t *wq = ext4_ioend_wq(inode); > > wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0)); > + /* > + * We need to make sure the work structure is finished before > + * we let the inode get destroyed by ext4_evict_inode() > + */ > + if (work_pending(&EXT4_I(inode)->i_unwritten_work)) > + flush_work(&EXT4_I(inode)->i_unwritten_work); > } Won't it be more logical to use cancel_work_sync() here? Honza
On Wed, Mar 20, 2013 at 02:22:23PM +0100, Jan Kara wrote: > > + if (work_pending(&EXT4_I(inode)->i_unwritten_work)) > > + flush_work(&EXT4_I(inode)->i_unwritten_work); > > } > Won't it be more logical to use cancel_work_sync() here? Hmm.... yes, probably, but then ext4_ioend_wait() can only be safely used by ext4_evict_inode(). I'll make the change, but I'll also make a comment to this effect. (No one else is using it now, but if there was ever a need to use it while the inode was in use, using cancel_work_sync() would be highly dagernous/racy. That being said, I can't really think of a good use case other than evict_inode path, so it seems fine to make this change.) - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed 20-03-13 09:37:51, Ted Tso wrote: > On Wed, Mar 20, 2013 at 02:22:23PM +0100, Jan Kara wrote: > > > + if (work_pending(&EXT4_I(inode)->i_unwritten_work)) > > > + flush_work(&EXT4_I(inode)->i_unwritten_work); > > > } > > Won't it be more logical to use cancel_work_sync() here? > > Hmm.... yes, probably, but then ext4_ioend_wait() can only be safely > used by ext4_evict_inode(). I'll make the change, but I'll also make > a comment to this effect. (No one else is using it now, but if there > was ever a need to use it while the inode was in use, using > cancel_work_sync() would be highly dagernous/racy. That being said, I > can't really think of a good use case other than evict_inode path, so > it seems fine to make this change.) Yeah, we can possibly rename the function or maybe even just inline it in ext4_evict_inode? Honza
On Wed, Mar 20, 2013 at 02:42:26PM +0100, Jan Kara wrote: > Yeah, we can possibly rename the function or maybe even just inline it in > ext4_evict_inode? Good idea, I'll rename it to ext4_ioend_shutdown(). Given that the rest of the logic is in page_io.c, it seemed better to leave that function there than to inline it in ext4_evict_inode(). - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 3/19/13 8:29 PM, Theodore Ts'o wrote: > Commit 84c17543ab56 (ext4: move work from io_end to inode) triggered a > regression when running xfstest #270 when the file system is mounted > with dioread_nolock. As an aside, is there any reason to have "dioread_nolock" as an option at this point? If it works now, would you ever *not* want it? (granted it doesn't work with some journaling options etc, but that behavior could be automatic, w/o the need for special mount options). -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 20, 2013 at 09:14:42AM -0500, Eric Sandeen wrote: > > As an aside, is there any reason to have "dioread_nolock" as an option > at this point? If it works now, would you ever *not* want it? > > (granted it doesn't work with some journaling options etc, but that > behavior could be automatic, w/o the need for special mount options). The primary restriction is that diread_nolock doesn't work when fs block size != page size. If your proposal is that we automatically enable diread_nolock when we can use it safely, that's definitely something to consider for the next merge window. My long range plan/hope is that we eventually be able to use the extent status tree so that we do allocating writes, we first (a) allocate the blocks, and mark them as in use as far as the mballoc data structures are concerned, but we do _not_ mark them as in use in the on-disk allocation bitmaps, then (b) we write the data blocks, and then triggered by the block I/O completion, (c) in a single journal trnasaction, we update the allocation bitmaps, update the inode's extent tree, and update the inode's i_size field. This is different from the dioread_nolock approach in that we're not initially inserting the blocks in the extent tree as uninitialized, and then convert the extent tree entries from uninit to init after the I/O completion. If we get to this long-term nirvana, then (1) we can eliminate the data=writeback vs data=ordered distiction, since we'll have the safety benefits of data=ordered while still having the performance characteristics of data=writeback, and (2) we can eliminate diread_nolock, since this approach should also obviate needing to take the read lock on the direct I/O read path. I also think this approach in the long term will be simpler and faster, since we don't have modify the extent tree, and start a journal transaction, before we write the data blocks. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed 20-03-13 10:45:23, Ted Tso wrote: > On Wed, Mar 20, 2013 at 09:14:42AM -0500, Eric Sandeen wrote: > > > > As an aside, is there any reason to have "dioread_nolock" as an option > > at this point? If it works now, would you ever *not* want it? > > > > (granted it doesn't work with some journaling options etc, but that > > behavior could be automatic, w/o the need for special mount options). > > The primary restriction is that diread_nolock doesn't work when fs > block size != page size. If your proposal is that we automatically > enable diread_nolock when we can use it safely, that's definitely > something to consider for the next merge window. > > My long range plan/hope is that we eventually be able to use the > extent status tree so that we do allocating writes, we first (a) > allocate the blocks, and mark them as in use as far as the mballoc > data structures are concerned, but we do _not_ mark them as in use in > the on-disk allocation bitmaps, then (b) we write the data blocks, and > then triggered by the block I/O completion, (c) in a single journal > trnasaction, we update the allocation bitmaps, update the inode's > extent tree, and update the inode's i_size field. > > This is different from the dioread_nolock approach in that we're not > initially inserting the blocks in the extent tree as uninitialized, > and then convert the extent tree entries from uninit to init after the > I/O completion. > > If we get to this long-term nirvana, then (1) we can eliminate the > data=writeback vs data=ordered distiction, since we'll have the safety > benefits of data=ordered while still having the performance > characteristics of data=writeback, and (2) we can eliminate > diread_nolock, since this approach should also obviate needing to take > the read lock on the direct I/O read path. But this will be somewhat tricky because when we have racing buffered write and DIO read to the same block, we have to make sure that DIO read ignores the information in the extent status tree because data isn't written to the blocks yet. Umm, maybe we could just mark the extent as unwritten in the extent status tree (without having anything on disk) and this should make DIO read work. That sounds like a nice optimization. > I also think this approach > in the long term will be simpler and faster, since we don't have > modify the extent tree, and start a journal transaction, before we > write the data blocks. Yeah, it should be faster because we will need to perform some extent ops only in memory and not on disk. Honza
Sorry for the late reply. On Wed, Mar 20, 2013 at 10:45:23AM -0400, Theodore Ts'o wrote: > On Wed, Mar 20, 2013 at 09:14:42AM -0500, Eric Sandeen wrote: > > > > As an aside, is there any reason to have "dioread_nolock" as an option > > at this point? If it works now, would you ever *not* want it? > > > > (granted it doesn't work with some journaling options etc, but that > > behavior could be automatic, w/o the need for special mount options). > > The primary restriction is that diread_nolock doesn't work when fs > block size != page size. If your proposal is that we automatically > enable diread_nolock when we can use it safely, that's definitely > something to consider for the next merge window. Yes, I also think we can automatically enable dioread_nolock because it brings us some benefits. BTW, I think there is an minor improvement for dio overwrite codepath with indirect-based file. We don't need to take i_mutex in this condition just as we have done for extent-based file. If a user mounts a ext2/3 file system with a ext4 kernel modules, he/she could get a lower latency. But it seems that it would break dio semantic in ext2/3. Currently in ext2/3 if we issue a overwrite dio and then issue a read dio. We will always read the latest data because we wait on i_mutex lock. But after parallelizing overwite dio, this semantic might breaks. I re-read this doc but it seems that it doesn't describe this case. Do we need to keep this semantic? > > My long range plan/hope is that we eventually be able to use the > extent status tree so that we do allocating writes, we first (a) > allocate the blocks, and mark them as in use as far as the mballoc > data structures are concerned, but we do _not_ mark them as in use in > the on-disk allocation bitmaps, then (b) we write the data blocks, and > then triggered by the block I/O completion, (c) in a single journal > trnasaction, we update the allocation bitmaps, update the inode's > extent tree, and update the inode's i_size field. > > This is different from the dioread_nolock approach in that we're not > initially inserting the blocks in the extent tree as uninitialized, > and then convert the extent tree entries from uninit to init after the > I/O completion. Yes, this approach is better. I am happy to work on this. Regards, - Zheng -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Mar 26, 2013 at 01:52:51PM +0800, Zheng Liu wrote: > Sorry for the late reply. > > On Wed, Mar 20, 2013 at 10:45:23AM -0400, Theodore Ts'o wrote: > > On Wed, Mar 20, 2013 at 09:14:42AM -0500, Eric Sandeen wrote: > > > > > > As an aside, is there any reason to have "dioread_nolock" as an option > > > at this point? If it works now, would you ever *not* want it? > > > > > > (granted it doesn't work with some journaling options etc, but that > > > behavior could be automatic, w/o the need for special mount options). > > > > The primary restriction is that diread_nolock doesn't work when fs > > block size != page size. If your proposal is that we automatically > > enable diread_nolock when we can use it safely, that's definitely > > something to consider for the next merge window. > > Yes, I also think we can automatically enable dioread_nolock because it > brings us some benefits. > > BTW, I think there is an minor improvement for dio overwrite codepath > with indirect-based file. We don't need to take i_mutex in this > condition just as we have done for extent-based file. If a user mounts > a ext2/3 file system with a ext4 kernel modules, he/she could get a > lower latency. But it seems that it would break dio semantic in ext2/3. > Currently in ext2/3 if we issue a overwrite dio and then issue a read > dio. We will always read the latest data because we wait on i_mutex > lock. But after parallelizing overwite dio, this semantic might breaks. > I re-read this doc but it seems that it doesn't describe this case. ^^^ Ah, sorry, I forgot to paste the link here. https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics Regards, - Zheng -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue 26-03-13 13:52:51, Zheng Liu wrote: > Sorry for the late reply. > > On Wed, Mar 20, 2013 at 10:45:23AM -0400, Theodore Ts'o wrote: > > On Wed, Mar 20, 2013 at 09:14:42AM -0500, Eric Sandeen wrote: > > > > > > As an aside, is there any reason to have "dioread_nolock" as an option > > > at this point? If it works now, would you ever *not* want it? > > > > > > (granted it doesn't work with some journaling options etc, but that > > > behavior could be automatic, w/o the need for special mount options). > > > > The primary restriction is that diread_nolock doesn't work when fs > > block size != page size. If your proposal is that we automatically > > enable diread_nolock when we can use it safely, that's definitely > > something to consider for the next merge window. > > Yes, I also think we can automatically enable dioread_nolock because it > brings us some benefits. But isn't there also some overhead due to buffered writes having to go through uninit->init conversion? Plus there's this potential deadlock in dioread_nolock code (http://www.spinics.net/lists/linux-ext4/msg36569.html) which I'm not sure how to fix yet... > BTW, I think there is an minor improvement for dio overwrite codepath > with indirect-based file. We don't need to take i_mutex in this > condition just as we have done for extent-based file. If a user mounts > a ext2/3 file system with a ext4 kernel modules, he/she could get a > lower latency. But it seems that it would break dio semantic in ext2/3. > Currently in ext2/3 if we issue a overwrite dio and then issue a read > dio. We will always read the latest data because we wait on i_mutex > lock. But after parallelizing overwite dio, this semantic might breaks. > I re-read this doc but it seems that it doesn't describe this case. Do > we need to keep this semantic? I'm not sure but also I don't think it's important to optimize that special case. Honza
On Tue, Mar 26, 2013 at 09:34:03PM +0100, Jan Kara wrote: > On Tue 26-03-13 13:52:51, Zheng Liu wrote: > > Sorry for the late reply. > > > > On Wed, Mar 20, 2013 at 10:45:23AM -0400, Theodore Ts'o wrote: > > > On Wed, Mar 20, 2013 at 09:14:42AM -0500, Eric Sandeen wrote: > > > > > > > > As an aside, is there any reason to have "dioread_nolock" as an option > > > > at this point? If it works now, would you ever *not* want it? > > > > > > > > (granted it doesn't work with some journaling options etc, but that > > > > behavior could be automatic, w/o the need for special mount options). > > > > > > The primary restriction is that diread_nolock doesn't work when fs > > > block size != page size. If your proposal is that we automatically > > > enable diread_nolock when we can use it safely, that's definitely > > > something to consider for the next merge window. > > > > Yes, I also think we can automatically enable dioread_nolock because it > > brings us some benefits. > But isn't there also some overhead due to buffered writes having to go > through uninit->init conversion? Yeah, in my test, the IOPS will decrease after dioread_nolock enables. But the latency of dio will also descrease. Honestly I don't test buffered IO. So I will test this case and post the result later. IMO, this is a tradeoff that we want to improve latency or get a better throughput. > Plus there's this potential deadlock in > dioread_nolock code (http://www.spinics.net/lists/linux-ext4/msg36569.html) > which I'm not sure how to fix yet... Yes, we need to fix this bug first. > > > BTW, I think there is an minor improvement for dio overwrite codepath > > with indirect-based file. We don't need to take i_mutex in this > > condition just as we have done for extent-based file. If a user mounts > > a ext2/3 file system with a ext4 kernel modules, he/she could get a > > lower latency. But it seems that it would break dio semantic in ext2/3. > > Currently in ext2/3 if we issue a overwrite dio and then issue a read > > dio. We will always read the latest data because we wait on i_mutex > > lock. But after parallelizing overwite dio, this semantic might breaks. > > I re-read this doc but it seems that it doesn't describe this case. Do > > we need to keep this semantic? > I'm not sure but also I don't think it's important to optimize that > special case. Thanks for the comment. I am really not sure whether it is worth. Let me test the performance w/ and w/o dioread_nolock first. :-) Regards, - Zheng -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Mar 26, 2013 at 09:34:03PM +0100, Jan Kara wrote: > On Tue 26-03-13 13:52:51, Zheng Liu wrote: > > Sorry for the late reply. > > > > On Wed, Mar 20, 2013 at 10:45:23AM -0400, Theodore Ts'o wrote: > > > On Wed, Mar 20, 2013 at 09:14:42AM -0500, Eric Sandeen wrote: > > > > > > > > As an aside, is there any reason to have "dioread_nolock" as an option > > > > at this point? If it works now, would you ever *not* want it? > > > > > > > > (granted it doesn't work with some journaling options etc, but that > > > > behavior could be automatic, w/o the need for special mount options). > > > > > > The primary restriction is that diread_nolock doesn't work when fs > > > block size != page size. If your proposal is that we automatically > > > enable diread_nolock when we can use it safely, that's definitely > > > something to consider for the next merge window. > > > > Yes, I also think we can automatically enable dioread_nolock because it > > brings us some benefits. > But isn't there also some overhead due to buffered writes having to go > through uninit->init conversion? Hi Jan, Here is my test result. I use fio to do some performance tests. The test environment is a Dell desktop with a Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz, 4G memory, 1 Intel 320 SSD. I test three cases that are sync dio read/write, aio dio read/write, and sync buffered io. The result is as below. Due to too many number we post some key data here. From the result we can see that after dioread_nolock enables the max latency can be reduced dramatically for sync dio and sync bio, and other value doesn't be affected too much. If I miss some important test case, please let me know. Thanks, - Zheng The kernel version is 3.9.0-rc4+. = sync dio = == fio config file == [global] ioengine=psync direct=1 bs=4k thread group_reporting directory=/mnt/sda1/ filename=testfile filesize=10g size=10g runtime=120 iodepth=16 fallocate=0 [reader] rw=randread numjobs=8 [writer] rw=randwrite numjobs=2 == result == === w/o dioread_nolock === read : io=5142.1MB, bw=43885KB/s, iops=10971 , runt=120002msec clat (usec): min=123 , max=679911 , avg=727.21, stdev=3986.35 lat (usec): min=123 , max=679911 , avg=727.40, stdev=3986.35 write: io=542548KB, bw=4521.2KB/s, iops=1130 , runt=120003msec clat (usec): min=136 , max=2743.3K, avg=1766.78, stdev=14844.10 lat (usec): min=137 , max=2743.3K, avg=1767.14, stdev=14844.10 === w/ dioread_nolock === read : io=5072.2MB, bw=43282KB/s, iops=10820 , runt=120002msec clat (usec): min=158 , max=291847 , avg=737.38, stdev=3944.14 lat (usec): min=158 , max=291848 , avg=737.56, stdev=3944.14 write: io=589752KB, bw=4914.6KB/s, iops=1228 , runt=120001msec clat (usec): min=138 , max=2048.3K, avg=1625.15, stdev=11344.36 lat (usec): min=139 , max=2048.3K, avg=1625.50, stdev=11344.35 = aio dio = == fio config file == [global] ioengine=libaio direct=1 bs=4k thread group_reporting directory=/mnt/sda1/ filename=testfile filesize=10g size=10g runtime=120 iodepth=64 fallocate=0 [reader] rw=randread numjobs=8 [writer] rw=randwrite numjobs=2 == result == === w/o dioread_nolock === read : io=4865.4MB, bw=41510KB/s, iops=10377 , runt=120014msec slat (usec): min=3 , max=320273 , avg=700.79, stdev=6070.93 clat (usec): min=232 , max=958452 , avg=48627.21, stdev=53296.33 lat (usec): min=248 , max=958471 , avg=49328.36, stdev=53905.27 write: io=684100KB, bw=5700.3KB/s, iops=1425 , runt=120013msec slat (usec): min=5 , max=349019 , avg=1339.75, stdev=9538.63 clat (usec): min=140 , max=1060.1K, avg=88459.40, stdev=95903.63 lat (usec): min=211 , max=1060.1K, avg=89799.48, stdev=97025.95 === w/ dioread_nolock === read : io=4869.1MB, bw=41544KB/s, iops=10385 , runt=120037msec slat (usec): min=4 , max=322951 , avg=700.33, stdev=5987.29 clat (usec): min=229 , max=703241 , avg=48584.27, stdev=52570.02 lat (usec): min=248 , max=703247 , avg=49284.94, stdev=53185.97 write: io=698856KB, bw=5821.1KB/s, iops=1455 , runt=120038msec slat (usec): min=7 , max=367042 , avg=1297.87, stdev=9316.21 clat (usec): min=178 , max=968434 , avg=86625.80, stdev=92500.21 lat (usec): min=284 , max=1020.8K, avg=87924.01, stdev=93554.86 = sync buffered io = == fio config file == [global] ioengine=psync direct=0 bs=4k thread group_reporting directory=/mnt/sda1/ filename=testfile filesize=10g size=10g runtime=120 iodepth=16 fallocate=0 [reader] rw=randread numjobs=8 [writer] rw=randwrite numjobs=2 == result == === w/o dioread_nolock === read : io=7577.6MB, bw=64661KB/s, iops=16165 , runt=120001msec clat (usec): min=1 , max=3107.8K, avg=493.15, stdev=5368.87 lat (usec): min=1 , max=3107.8K, avg=493.31, stdev=5368.87 write: io=764256KB, bw=6350.2KB/s, iops=1587 , runt=120353msec clat (usec): min=2 , max=42443K, avg=1257.41, stdev=138832.28 lat (usec): min=2 , max=42443K, avg=1257.57, stdev=138832.29 === w/ dioread_nolock === read : io=7613.1MB, bw=64972KB/s, iops=16243 , runt=120001msec clat (usec): min=1 , max=1126.4K, avg=490.79, stdev=3742.64 lat (usec): min=1 , max=1126.4K, avg=490.95, stdev=3742.64 write: io=763428KB, bw=6351.1KB/s, iops=1587 , runt=120189msec clat (usec): min=2 , max=45783K, avg=1258.23, stdev=163258.23 lat (usec): min=2 , max=45783K, avg=1258.37, stdev=163258.23 -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 65bbc93..15e0da1 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -185,8 +185,6 @@ void ext4_evict_inode(struct inode *inode) trace_ext4_evict_inode(inode); - ext4_ioend_wait(inode); - if (inode->i_nlink) { /* * When journalling data dirty buffers are tracked only in the @@ -216,6 +214,7 @@ void ext4_evict_inode(struct inode *inode) filemap_write_and_wait(&inode->i_data); } truncate_inode_pages(&inode->i_data, 0); + ext4_ioend_wait(inode); goto no_delete; } @@ -225,6 +224,7 @@ void ext4_evict_inode(struct inode *inode) if (ext4_should_order_data(inode)) ext4_begin_ordered_truncate(inode, 0); truncate_inode_pages(&inode->i_data, 0); + ext4_ioend_wait(inode); if (is_bad_inode(inode)) goto no_delete; diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c index 809b310..8a004a4 100644 --- a/fs/ext4/page-io.c +++ b/fs/ext4/page-io.c @@ -50,11 +50,21 @@ void ext4_exit_pageio(void) kmem_cache_destroy(io_page_cachep); } +/* + * Called by ext4_evict_inode() to make sure there are no pending I/O + * completion work left to do. + */ void ext4_ioend_wait(struct inode *inode) { wait_queue_head_t *wq = ext4_ioend_wq(inode); wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0)); + /* + * We need to make sure the work structure is finished before + * we let the inode get destroyed by ext4_evict_inode() + */ + if (work_pending(&EXT4_I(inode)->i_unwritten_work)) + flush_work(&EXT4_I(inode)->i_unwritten_work); } static void put_io_page(struct ext4_io_page *io_page)
Commit 84c17543ab56 (ext4: move work from io_end to inode) triggered a regression when running xfstest #270 when the file system is mounted with dioread_nolock. The problem is that after ext4_evict_inode() calls ext4_ioend_wait(), this guarantees that last io_end structure has been freed, but it does not guarantee that the workqueue structure, which was moved into the inode by commit 84c17543ab56, is actually finished. Once ext4_flush_completed_IO() calls ext4_free_io_end() on CPU #1, this will allow ext4_ioend_wait() to return on CPU #2, at which point the evict_inode() codepath can race against the workqueue code on CPU #1 accessing EXT4_I(inode)->i_unwritten_work to find the next item of work to do. Fix this by calling flush_work() if the work structure is still pending in ext$_ioend_wait(). Also, move the call to ext4_ioend_wait() until after truncate_inode_pages() and filemap_write_and_wait() are called, to make sure all dirty pages have been written back and flushed from the page cache first. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e *pdpt = 0000000030bc3001 *pde = 0000000000000000 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC Modules linked in: Pid: 6, comm: kworker/u:0 Not tainted 3.8.0-rc3-00013-g84c1754-dirty #91 Bochs Bochs EIP: 0060:[<c01dda6a>] EFLAGS: 00010046 CPU: 0 EIP is at cwq_activate_delayed_work+0x3b/0x7e EAX: 00000000 EBX: 00000000 ECX: f505fe54 EDX: 00000000 ESI: ed5b697c EDI: 00000006 EBP: f64b7e8c ESP: f64b7e84 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 CR0: 8005003b CR2: 00000000 CR3: 30bc2000 CR4: 000006f0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 Process kworker/u:0 (pid: 6, ti=f64b6000 task=f64b4160 task.ti=f64b6000) Stack: f505fe00 00000006 f64b7e9c c01de3d7 f6435540 00000003 f64b7efc c01def1d f6435540 00000002 00000000 0000008a c16d0808 c040a10b c16d07d8 c16d08b0 f505fe00 c16d0780 00000000 00000000 ee153df4 c1ce4a30 c17d0e30 00000000 Call Trace: [<c01de3d7>] cwq_dec_nr_in_flight+0x71/0xfb [<c01def1d>] process_one_work+0x5d8/0x637 [<c040a10b>] ? ext4_end_bio+0x300/0x300 [<c01e3105>] worker_thread+0x249/0x3ef [<c01ea317>] kthread+0xd8/0xeb [<c01e2ebc>] ? manage_workers+0x4bb/0x4bb [<c023a370>] ? trace_hardirqs_on+0x27/0x37 [<c0f1b4b7>] ret_from_kernel_thread+0x1b/0x28 [<c01ea23f>] ? __init_kthread_worker+0x71/0x71 Code: 01 83 15 ac ff 6c c1 00 31 db 89 c6 8b 00 a8 04 74 12 89 c3 30 db 83 05 b0 ff 6c c1 01 83 15 b4 ff 6c c1 00 89 f0 e8 42 ff ff ff <8b> 13 89 f0 83 05 b8 ff 6c c1 6c c1 00 31 c9 83 EIP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e SS:ESP 0068:f64b7e84 CR2: 0000000000000000 ---[ end trace a1923229da53d8a4 ]--- Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> --- fs/ext4/inode.c | 4 ++-- fs/ext4/page-io.c | 10 ++++++++++ 2 files changed, 12 insertions(+), 2 deletions(-)