diff mbox

[REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal

Message ID 5612BBB3.7010201@intel.com
State Accepted, archived
Headers show

Commit Message

Dave Hansen Oct. 5, 2015, 6:04 p.m. UTC
I managed to catch the condition in an ftrace.  Full spew is below.

We can see that the iov_iter_copy_from_user_atomic() "failed" and ended
up with a copied=0 which we can see in the ext4_journalled_write_end()
tracepoint as "copied 0".

So we're in this code with copied=0 and len=4096:

> static int ext4_journalled_write_end(struct file *file, ... unsigned copied, ...
> {
...

>         else {
>                 if (copied < len) {
>                         if (!PageUptodate(page))
>                                 copied = 0;
>                         page_zero_new_buffers(page, from+copied, to);
>                 }
> 
>                 ret = ext4_walk_page_buffers(handle, page_buffers(page), from,
>                                              to, &partial, write_end_fn);
>                 if (!partial)
>                         SetPageUptodate(page);
>         }


The warning comes out of ext4_walk_page_buffers() and the dirty state
comes from page_zero_new_buffers(). That seems a _bit_ goofy that the
filesystem is marking the page dirty and then so shortly warning about it.

If we either come in here with copied=0 or force it (when
!PageUptodate()) we are essentially telling the caller that its write
must be repeated. We have no data in the buffer to preserve and we are
not leaking any data we are not marking it Uptodate.

The very lightly tested attached patch gets rid of the "Spotted dirty
metadata buffer" messages that I've been getting testing like this:

	mount -o data=journal /dev/sda1 /mnt/sda1
	TEST_DIR=/mnt/sda1 TEST_DEV=/dev/sda1 ./tests/generic/208

It _seems_ like a relatively sane thing to do.

>>  2)  aio-dio-3204  |               |              ext4_journalled_write_end() {
>>  2)  aio-dio-3204  |               |                /* ext4_journalled_write_end: dev 8,1 ino 12 pos 0 len 4096 copied 0 */
>>  2)  aio-dio-3204  |               |                page_zero_new_buffers() {
>>  2)  aio-dio-3204  |               |                  mark_buffer_dirty() {
>>  2)  aio-dio-3204  |               |                    /* block_dirty_buffer: 8,1 sector=34356 size=4096 */
>>  2)  aio-dio-3204  |   0.025 us    |                    page_mapping();
>>  2)  aio-dio-3204  |               |                    __set_page_dirty.constprop.46() {
>>  2)  aio-dio-3204  |   0.026 us    |                      _raw_spin_lock_irqsave();
>>  2)  aio-dio-3204  |               |                      account_page_dirtied() {
>>  2)  aio-dio-3204  |               |                        /* writeback_dirty_page: bdi 8:0: ino=12 index=0 */
>>  2)  aio-dio-3204  |               |                        __inc_zone_page_state() {
>>  2)  aio-dio-3204  |   0.030 us    |                          __inc_zone_state();
>>  2)  aio-dio-3204  |   0.211 us    |                        }
>>  2)  aio-dio-3204  |               |                        __inc_zone_page_state() {
>>  2)  aio-dio-3204  |   0.024 us    |                          __inc_zone_state();
>>  2)  aio-dio-3204  |   0.211 us    |                        }
>>  2)  aio-dio-3204  |   1.085 us    |                      }
>>  2)  aio-dio-3204  |   0.033 us    |                      _raw_spin_unlock_irqrestore();
>>  2)  aio-dio-3204  |   1.727 us    |                    }
>>  2)  aio-dio-3204  |               |                    __mark_inode_dirty() {
>>  2)  aio-dio-3204  |               |                      /* writeback_mark_inode_dirty: bdi 8:0: ino=12 state=I_DIRTY_SYNC|I_DIRTY_DATASYNC|I_DIRTY_PAGES flags=I_DIRTY_PAGES */
>>  2)  aio-dio-3204  |   0.207 us    |                    }
>>  2)  aio-dio-3204  |   2.600 us    |                  }
>>  2)  aio-dio-3204  |   2.878 us    |                }
>>  2)  aio-dio-3204  |               |                ext4_walk_page_buffers() {
>>  2)  aio-dio-3204  |               |                  write_end_fn() {
>>  2)  aio-dio-3204  |               |                    __ext4_handle_dirty_metadata() {
>>  2)  aio-dio-3204  |   0.024 us    |                      _cond_resched();
>>  2)  aio-dio-3204  |               |                      jbd2_journal_dirty_metadata() {
>>  2)  aio-dio-3204  |   0.025 us    |                        _raw_spin_lock();
>>  2)  aio-dio-3204  |               |                        __jbd2_journal_file_buffer() {
>>  2)  aio-dio-3204  |               |                          warn_dirty_buffer.isra.10() {
>>  2)  aio-dio-3204  |               |                            bdevname() {
>>  2)  aio-dio-3204  | + 15.280 us   |                              disk_name();
>>  2)  aio-dio-3204  | + 15.661 us   |                            }
>>  2)  aio-dio-3204  |               |                            /* ^A4JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 34356). There's a risk of filesystem corruption in case of system crash. */
>>  2)  aio-dio-3204  |               |                            bdevname() {
>>  2)  aio-dio-3204  |   0.080 us    |                              disk_name();
>>  2)  aio-dio-3204  |   0.298 us    |                            }

Comments

Theodore Ts'o Oct. 7, 2015, 3:34 a.m. UTC | #1
On Mon, Oct 05, 2015 at 11:04:35AM -0700, Dave Hansen wrote:
> 
> The warning comes out of ext4_walk_page_buffers() and the dirty state
> comes from page_zero_new_buffers(). That seems a _bit_ goofy that the
> filesystem is marking the page dirty and then so shortly warning about it.

Yes, this is a bug in ext4 --- and in fact in ext3, which apparently
we've lived with for *years*.  The problem is that when we are
journalling data buffers, we can't use page_zero_new_buffers(),
because instead of calling mark_buffer_dirty(bh), we need to call
ext4_handle_dirty_metadata(bh).  This will call mark_buffer_dirty(bh)
if journalling is not enabled, or if journalling is enabled, it will
call jbd2_journal_dirty_metadata(handle,bh).

Apprently it is extremely rare that (copied < len) --- especially when
mm/filemap.c was doing a prefault.  :-)

So your patch looks good, but in addition to that, if copied is > 0
and less than len, we shouldn't be calling page_zero_new_buffers().
We're going to need our own version of it that doesn't call
mark_buffer_dirty().

So if Linus wants to revert 998ef75ddb patch, we can do that, but I'm
also happy applying your patch as a way of preventing the failure.
We'll need to do more work to make ext4_journalled_write_end(), but
that's a bigger change which I'd rather not do at this point in the
development cycle.

Thanks again for taking a closer look at things.  I'm currently
running a full soak test to make sure your patch to
ext4_journalled_write_end() doesn't introduce any other problems, but
I'm quite confident it should be fine.

Cheers,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Oct. 7, 2015, 7:32 a.m. UTC | #2
On Wed, Oct 7, 2015 at 4:34 AM, Theodore Ts'o <tytso@mit.edu> wrote:
>
> So your patch looks good, but in addition to that, if copied is > 0
> and less than len, we shouldn't be calling page_zero_new_buffers().
> We're going to need our own version of it that doesn't call
> mark_buffer_dirty().

Well, even with copied == 0, isn't it wrong not to zero the data and
mark it dirty, though?

At least for traditional non-prealloc filesystems, write_begin() will
have allocated the blocks on disk, and depending on the size of the
write may not have zeroed anything, or marked anything dirty. So the
disk layout may be set up, and the blocks *need* to be written back to
disk with real data at some later time.

I'm not sure that that is how write_begin() works for ext4, but the
fact that you do the page_zero_new_buffers() does imply that it's an
issue.

And none of *those* requirements change just because "copied" would be
zero. If you avoid zeroing the buffers and marking them dirty, nothing
will ever initialize them on disk, andn if the prefault then later
fails during retry, no later write will happen either. So now
eventually later, a read() can see stale data from disk.

Now, again, I didn't actually follow the ext4 code, so this may not be
an issue on ext4 due to the journaling perhaps never writing back the
actual block allocations either, so the above may be a "only happens
on old traditional unix filesystems" kind of scenario. But it worries
me.

That's why II'll just revert the change for now as a "ok, the change
itself isn't buggy, but it exposes long-time bugs in at least ext4,
and let's take this slow and give the ext4 people time to make sure
they have that case fixed".

> So if Linus wants to revert 998ef75ddb patch, we can do that, but I'm
> also happy applying your patch as a way of preventing the failure.

I do think this is an ext4 bug, and you'll need to do something *like*
that patch. Maybe Dave's patch is good as-is. It's the "I think you
need to do more" that I worry about. Not at -rc4 time. Not with a core
filesystem like ext4. Let's not hurry this too much.

                Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Oct. 7, 2015, 3:43 p.m. UTC | #3
On Wed, Oct 07, 2015 at 08:32:16AM +0100, Linus Torvalds wrote:
> And none of *those* requirements change just because "copied" would be
> zero. If you avoid zeroing the buffers and marking them dirty, nothing
> will ever initialize them on disk, andn if the prefault then later
> fails during retry, no later write will happen either. So now
> eventually later, a read() can see stale data from disk.

Shoot.  You're right, we could end up allowing a stale data to be
exposed.  If we knew the caller of write_end() was guaranteed to
retry, we could skip the jbd2_journal_stop() call and keep the handle
open, which would prevent the transaction from closing.  But if the
write gets abandoned, then the transaction would never close, and
things would grind to a halt.

> I do think this is an ext4 bug, and you'll need to do something *like*
> that patch. Maybe Dave's patch is good as-is. It's the "I think you
> need to do more" that I worry about. Not at -rc4 time. Not with a core
> filesystem like ext4. Let's not hurry this too much.

Agreed, I know what to do, and and the change is not something I'd
want to get in -rc4.  I'll target a fix for the next merge window.

    	    	   	       	      	- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff -puN fs/ext4/inode.c~debug-dont-prefault-on-write fs/ext4/inode.c
--- a/fs/ext4/inode.c~debug-dont-prefault-on-write	2015-10-05 10:12:35.286153137 -0700
+++ b/fs/ext4/inode.c	2015-10-05 10:50:37.372253050 -0700
@@ -1204,10 +1204,20 @@  static int ext4_journalled_write_end(str
 		copied = ext4_write_inline_data_end(inode, pos, len,
 						    copied, page);
 	else {
+		/*
+		 * With a short write (copied < len) we have potentially
+		 * valuable data in 'page'.  But, when the page is made Uptodate
+		 * this data will be overwritten.  Setting copied=0 will tell
+		 * the upper layers to repeat the write to 'page'.
+		 *
+		 * Only bother zeroing out buffers when we have _actually_
+		 * written data.
+		 */
 		if (copied < len) {
 			if (!PageUptodate(page))
 				copied = 0;
-			page_zero_new_buffers(page, from+copied, to);
+			if (copied)
+				page_zero_new_buffers(page, from+copied, to);
 		}
 
 		ret = ext4_walk_page_buffers(handle, page_buffers(page), from,
_