diff mbox

[REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal

Message ID CA+55aFz9hH26=iQa832J5789Kf1ufYEJPt5L_D1wcDhCycUCFQ@mail.gmail.com
State Accepted, archived
Headers show

Commit Message

Linus Torvalds Oct. 5, 2015, 9:55 p.m. UTC
On Mon, Oct 5, 2015 at 10:18 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Your ext4 patch may well fix the issue, and be the right thing to do
> (_regardless_ of the revert, in fact - while it might make the revert
> unnecessary, it might also be a good idea even if we do revert).

Thinking a bit more about your patch, I actually am getting more and
more convinced that it's the wrong thing to do.

Why?

Because the whole "Setting copied=0 will tell the upper layers to
repeat the write" just seems a nasty layering violation, where the
low-level filesystem uses a magic return code to introduce a special
case at the upper layers. But the upper layers are actually already
*aware* of the special case, and in fact have a comment about it.

So I think that the whole "setting copied to 0" would actually make a
lot more sense in the *caller*. Just do it in generic_perform_write()
instead. Then all the special cases and the restarting is all
together.

What do you guys think? This basically simplifies the low-level
filesystem rules, and says:

 - the filesystem will only ever see a partial "->write_end()" for the
case where the page was up-to-date, so that there is no issue with
"oops, we now have part of the page that may not have been written at
all"

 - if the page wasn't up-to-date before, ->write_end() will either be
everything we said we'd do in ->write_begin(), or it will be nothing
at all.

Hmm? This would seem to keep the special cases at the right layer, and
actually allow low-level filesystems to simplify things (ie the
special "copied = 0" special case in ext4 goes away.

The ext4 side still worries me, though. You made that
"page_zero_new_buffers()" conditional on "copied" being non-zero, but
I'm not convinced it can be conditional. Even if we retry, that retry
may end up failing (for example, because the source isn't mapped, so
we return -EFAULT rather than re-doing the write), but we have those
new buffers that got allocated in write_begin(), and now nobody has
ever written any data to them at all, so they have random stale
contents.

So I do think this needs more thought. Or at least somebody should
explain to me better why it's all ok.

I'm attaching the "copied = 0" special case thing at the
generic_perform_write() level as a patch for comments. But for now I
still think that reverting would seem to be the safer thing (which
still possibly leaves things buggy with a racy unmap, but at least
it's the old bug that we've never hit in practice).

Dave? Ted? Comments?

               Linus
mm/filemap.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

Dave Hansen Oct. 5, 2015, 11:33 p.m. UTC | #1
> On Mon, Oct 5, 2015 at 10:18 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Your ext4 patch may well fix the issue, and be the right thing to do
>> (_regardless_ of the revert, in fact - while it might make the revert
>> unnecessary, it might also be a good idea even if we do revert).
> 
> Thinking a bit more about your patch, I actually am getting more and
> more convinced that it's the wrong thing to do.
> 
> Why?
> 
> Because the whole "Setting copied=0 will tell the upper layers to
> repeat the write" just seems a nasty layering violation, where the
> low-level filesystem uses a magic return code to introduce a special
> case at the upper layers. But the upper layers are actually already
> *aware* of the special case, and in fact have a comment about it.

It's nasty, but it's widespread.  I'm fairly sure some version of that
code shows up in all these places:

	ext4_journalled_write_end()
	generic_write_end()->block_write_end()
	ocfs2_write_end_inline()
	ocfs2_write_end_nolock()
	logfs_write_end()
	ext3_journalled_write_end()
	btrfs_copy_from_user()
	reiserfs_write_end()

> The ext4 side still worries me, though. You made that
> "page_zero_new_buffers()" conditional on "copied" being non-zero, but
> I'm not convinced it can be conditional. Even if we retry, that retry
> may end up failing (for example, because the source isn't mapped, so
> we return -EFAULT rather than re-doing the write), but we have those
> new buffers that got allocated in write_begin(), and now nobody has
> ever written any data to them at all, so they have random stale
> contents.

Yeah, I guess they're not uptodate and nobody is on the hook to make
them uptodate.

> So I do think this needs more thought. Or at least somebody should
> explain to me better why it's all ok.
> 
> I'm attaching the "copied = 0" special case thing at the
> generic_perform_write() level as a patch for comments. But for now I
> still think that reverting would seem to be the safer thing (which
> still possibly leaves things buggy with a racy unmap, but at least
> it's the old bug that we've never hit in practice).

FWIW, I'm not objecting at all to a revert.  It's obviously the safe
thing to do.

> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2493,6 +2493,20 @@ again:
>  		pagefault_enable();
>  		flush_dcache_page(page);
>  
> +		/*
> +		 * If we didn't successfully copy all the data from user space,
> +		 * and the target page is not up-to-date, we will have to prefault
> +		 * the source. And if the page wasn't up-to-date before the write,
> +		 * the "write_end()" may need to *make* it up-to-date, and thus
> +		 * overwrite our partial copy.
> +		 *
> +		 * So for that case, thow away the whole thing and force a full
> +		 * restart (see comment above, and iov_iter_fault_in_readable()
> +		 * below).
> +		 */
> +		if (copied < bytes && !PageUptodate(page))
> +			copied = 0;
> +
>  		status = a_ops->write_end(file, mapping, pos, bytes, copied,
>  						page, fsdata);

Did you mean that as a cleanup, or to fix the regression?

Since the page isn't faulted in yet, iov_iter_copy_from_user_atomic()
had already set copied=0, so that has no effect on the ext4 code being
called and it spits out the same warning Ted originally reported.

Ted, do we really just need to tell ext4 that this dirtying operation is
OK?  Or is there really a risk of filesystem corruption that we're not
seeing?
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Oct. 6, 2015, 9:01 a.m. UTC | #2
On Tue, Oct 6, 2015 at 12:33 AM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> Did you mean that as a cleanup, or to fix the regression?

Purely as a cleanup to try to avoid the (already existing) special
case in at least ext4 - and possibly others.

So that patch was meant just for discussion - it's not really fixing
any existing bugs, and I didn't actually keep it live in my tree.

I'm planning on just doing the revert for now, but I'll wait a bit to
see how this thread pans out first.

> Since the page isn't faulted in yet, iov_iter_copy_from_user_atomic()
> had already set copied=0

Not necessarily. For your case that only does one-byte writes, yes.
But in the generic case you may well have a page-crossing source area
in user space, and get a partial success from the user copy. It's that
partial success case (when the rest of the missing data isn't
necessarily already up-to-date) that I'd like to have low-level
filesystems not have to worry about.

           Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/mm/filemap.c b/mm/filemap.c
index 72940fb38666..e8d01936817a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2493,6 +2493,20 @@  again:
 		pagefault_enable();
 		flush_dcache_page(page);
 
+		/*
+		 * If we didn't successfully copy all the data from user space,
+		 * and the target page is not up-to-date, we will have to prefault
+		 * the source. And if the page wasn't up-to-date before the write,
+		 * the "write_end()" may need to *make* it up-to-date, and thus
+		 * overwrite our partial copy.
+		 *
+		 * So for that case, thow away the whole thing and force a full
+		 * restart (see comment above, and iov_iter_fault_in_readable()
+		 * below).
+		 */
+		if (copied < bytes && !PageUptodate(page))
+			copied = 0;
+
 		status = a_ops->write_end(file, mapping, pos, bytes, copied,
 						page, fsdata);
 		if (unlikely(status < 0))