diff mbox

RFC PATCH: ext4 no journal corruption with locale-gen

Message ID 6601abe90906171148w1431258fvd0afa105cda9b77b@mail.gmail.com
State Superseded, archived
Headers show

Commit Message

Curt Wohlgemuth June 17, 2009, 6:48 p.m. UTC
So I finally was able to figure out the data corruption problem with
locale-gen on ext4 without a journal.

Basically, using mmap(MAP_SHARED, PROT_WRITE) to write to a file through an
mmap'ed pointer is broken on ext4 when there is no journal.

It seems to be a combination of several problems:

   a. The choice of what address space ops to use in ext4_set_aops() just
      seems wrong to me.

   b. The use of ext4_journalled_writepage() if there is no journal being
      used is broken if the page was marked dirty from
      ext4_journalled_set_page_dirty().

      I don't understand how __ext4_journalled_writepage() would ever work.

ext4_set_aops() chooses among 4 different structures:

            ext4_da_aops
            ext4_ordered_aops
            ext4_writeback_aops
            ext4_journalled_aops

It seems to me that ext4_da_aops should be used whenever delayed allocation
is used, and the rest otherwise.  But this leaves open the question of what
to use when nodelalloc is used, AND there's no journal.  Today, this uses
ext4_journalled_aops, but this seems odd on its face.  Yes, I know that all
the routines there are supposed to handle no journal, but it's nevertheless
odd.

The problem with ext4_journalled_writepage() is this:

1. ext4_journalled_set_page_dirty() sets the PageChecked bit, then dirties
   the page.

2. ext4_journalled_writepage() will do the following; note the goto if there
   IS a journal:

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Theodore Ts'o June 17, 2009, 11:46 p.m. UTC | #1
Hi Curt,

Thanks for your analysis of the bug.  The reason for the strange logic
in ext4_set_aops() is because at the moment the code doesn't support
the combination of data=journalled && delalloc.  That's why it was
explicitly checking for ext4_should_order_data() and
ext4_should_writeback_data().

We have a check for this in ext4_fill_super(), so your patch should be
safe, since the combination of ext4_should_journal_data &&
test_opt(inode->i_sb, DELALLOC) should never happen.

As to your question of whether the nodelalloc and nojournal case
should really be ext4_journalled_aops, I suspect ext4_writeback_aops
makes more sense.  I haven't audited all of the code paths to make
sure they DTRT in the non-journalled case yet, though.

							- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

===================================================================
	if (ext4_journal_current_handle())
		goto no_write;

	if (PageChecked(page)) {
		/*
		 * It's mmapped pagecache.  Add buffers and journal it.  There
		 * doesn't seem much point in redirtying the page here.
		 */
		ClearPageChecked(page);
		return __ext4_journalled_writepage(page, wbc);
	} else {
		/*
		 * It may be a page full of checkpoint-mode buffers.  We don't
		 * really know unless we go poke around in the buffer_heads.
		 * But block_write_full_page will do the right thing.
		 */
		return block_write_full_page(page,
						ext4_normal_get_block_write,
						wbc);
	}
no_write:
===================================================================

3. Unfortunately, __ext4_journalled_writepage() in the case of no journal,
   will just

      - call block_prepare_write()
      - calls write_end_fn() on all buffers, which just marks them dirty,
        doesn't actually write them out.

   And it doesn't seem to me that __ext4_journalled_writepage() will ever be
   called if there IS a journal.

   Am I missing something here?

4. If PageChecked bit isn't set, it calls block_write_full_page() and works
   fine.


Below is a patch that "fixes" ext4_set_aops(): that is, in the case of
delayed allocation but no journal, we'll use ext4_da_aops.  It does NOT fix
the problem of nodelalloc and either

   - no journal
   - data=journal

I'm holding off on fixing this because I'm not sure of the right place for
it.  I think it should be one of:

   a. Adding an address space ops structure just for nodelalloc/nojournal
   b. Getting rid of __ext4_journalled_writepage() as well as
      ext4_journalled_set_page_dirty(), since I'm not really sure they do
      anything.

Comments please?


	Signed-off-by: Curt Wohlgemuth <curtw@google.com>
---
--- linux-2.6/fs/ext4/inode.c.orig	2009-06-09 20:05:27.000000000 -0700
+++ linux-2.6/fs/ext4/inode.c	2009-06-17 11:07:57.000000000 -0700
@@ -3442,14 +3442,10 @@  static const struct address_space_operat

 void ext4_set_aops(struct inode *inode)
 {
-	if (ext4_should_order_data(inode) &&
-		test_opt(inode->i_sb, DELALLOC))
+	if (test_opt(inode->i_sb, DELALLOC))
 		inode->i_mapping->a_ops = &ext4_da_aops;
 	else if (ext4_should_order_data(inode))
 		inode->i_mapping->a_ops = &ext4_ordered_aops;
-	else if (ext4_should_writeback_data(inode) &&
-		 test_opt(inode->i_sb, DELALLOC))
-		inode->i_mapping->a_ops = &ext4_da_aops;
 	else if (ext4_should_writeback_data(inode))
 		inode->i_mapping->a_ops = &ext4_writeback_aops;
 	else