Message ID | 5612BBB3.7010201@intel.com |
---|---|
State | Accepted, archived |
Headers | show |
On Mon, Oct 05, 2015 at 11:04:35AM -0700, Dave Hansen wrote: > > The warning comes out of ext4_walk_page_buffers() and the dirty state > comes from page_zero_new_buffers(). That seems a _bit_ goofy that the > filesystem is marking the page dirty and then so shortly warning about it. Yes, this is a bug in ext4 --- and in fact in ext3, which apparently we've lived with for *years*. The problem is that when we are journalling data buffers, we can't use page_zero_new_buffers(), because instead of calling mark_buffer_dirty(bh), we need to call ext4_handle_dirty_metadata(bh). This will call mark_buffer_dirty(bh) if journalling is not enabled, or if journalling is enabled, it will call jbd2_journal_dirty_metadata(handle,bh). Apprently it is extremely rare that (copied < len) --- especially when mm/filemap.c was doing a prefault. :-) So your patch looks good, but in addition to that, if copied is > 0 and less than len, we shouldn't be calling page_zero_new_buffers(). We're going to need our own version of it that doesn't call mark_buffer_dirty(). So if Linus wants to revert 998ef75ddb patch, we can do that, but I'm also happy applying your patch as a way of preventing the failure. We'll need to do more work to make ext4_journalled_write_end(), but that's a bigger change which I'd rather not do at this point in the development cycle. Thanks again for taking a closer look at things. I'm currently running a full soak test to make sure your patch to ext4_journalled_write_end() doesn't introduce any other problems, but I'm quite confident it should be fine. Cheers, - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 7, 2015 at 4:34 AM, Theodore Ts'o <tytso@mit.edu> wrote: > > So your patch looks good, but in addition to that, if copied is > 0 > and less than len, we shouldn't be calling page_zero_new_buffers(). > We're going to need our own version of it that doesn't call > mark_buffer_dirty(). Well, even with copied == 0, isn't it wrong not to zero the data and mark it dirty, though? At least for traditional non-prealloc filesystems, write_begin() will have allocated the blocks on disk, and depending on the size of the write may not have zeroed anything, or marked anything dirty. So the disk layout may be set up, and the blocks *need* to be written back to disk with real data at some later time. I'm not sure that that is how write_begin() works for ext4, but the fact that you do the page_zero_new_buffers() does imply that it's an issue. And none of *those* requirements change just because "copied" would be zero. If you avoid zeroing the buffers and marking them dirty, nothing will ever initialize them on disk, andn if the prefault then later fails during retry, no later write will happen either. So now eventually later, a read() can see stale data from disk. Now, again, I didn't actually follow the ext4 code, so this may not be an issue on ext4 due to the journaling perhaps never writing back the actual block allocations either, so the above may be a "only happens on old traditional unix filesystems" kind of scenario. But it worries me. That's why II'll just revert the change for now as a "ok, the change itself isn't buggy, but it exposes long-time bugs in at least ext4, and let's take this slow and give the ext4 people time to make sure they have that case fixed". > So if Linus wants to revert 998ef75ddb patch, we can do that, but I'm > also happy applying your patch as a way of preventing the failure. I do think this is an ext4 bug, and you'll need to do something *like* that patch. Maybe Dave's patch is good as-is. It's the "I think you need to do more" that I worry about. Not at -rc4 time. Not with a core filesystem like ext4. Let's not hurry this too much. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 07, 2015 at 08:32:16AM +0100, Linus Torvalds wrote: > And none of *those* requirements change just because "copied" would be > zero. If you avoid zeroing the buffers and marking them dirty, nothing > will ever initialize them on disk, andn if the prefault then later > fails during retry, no later write will happen either. So now > eventually later, a read() can see stale data from disk. Shoot. You're right, we could end up allowing a stale data to be exposed. If we knew the caller of write_end() was guaranteed to retry, we could skip the jbd2_journal_stop() call and keep the handle open, which would prevent the transaction from closing. But if the write gets abandoned, then the transaction would never close, and things would grind to a halt. > I do think this is an ext4 bug, and you'll need to do something *like* > that patch. Maybe Dave's patch is good as-is. It's the "I think you > need to do more" that I worry about. Not at -rc4 time. Not with a core > filesystem like ext4. Let's not hurry this too much. Agreed, I know what to do, and and the change is not something I'd want to get in -rc4. I'll target a fix for the next merge window. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff -puN fs/ext4/inode.c~debug-dont-prefault-on-write fs/ext4/inode.c --- a/fs/ext4/inode.c~debug-dont-prefault-on-write 2015-10-05 10:12:35.286153137 -0700 +++ b/fs/ext4/inode.c 2015-10-05 10:50:37.372253050 -0700 @@ -1204,10 +1204,20 @@ static int ext4_journalled_write_end(str copied = ext4_write_inline_data_end(inode, pos, len, copied, page); else { + /* + * With a short write (copied < len) we have potentially + * valuable data in 'page'. But, when the page is made Uptodate + * this data will be overwritten. Setting copied=0 will tell + * the upper layers to repeat the write to 'page'. + * + * Only bother zeroing out buffers when we have _actually_ + * written data. + */ if (copied < len) { if (!PageUptodate(page)) copied = 0; - page_zero_new_buffers(page, from+copied, to); + if (copied) + page_zero_new_buffers(page, from+copied, to); } ret = ext4_walk_page_buffers(handle, page_buffers(page), from, _