Patchwork [3/3] filemap: don't call generic_write_sync for -EIOCBQUEUED

Submitter Jeff Moyer
Date Jan. 27, 2012, 9:15 p.m.
Message ID <1327698949-12616-4-git-send-email-jmoyer@redhat.com>
Permalink /patch/138331/
State New

Comments

Jeff Moyer - Jan. 27, 2012, 9:15 p.m.
Hi,

As it stands, generic_file_aio_write will call into generic_write_sync
when -EIOCBQUEUED is returned from __generic_file_aio_write.  EIOCBQUEUED
indicates that an I/O was submitted but NOT completed.  Thus, we will
flush the disk cache, potentially before the write(s) even make it to
the disk!  Up until now, this has been the best we could do, as file
systems didn't bother to flush the disk cache after an O_SYNC AIO+DIO
write.  After applying the prior two patches to xfs and ext4, at least
the major two file systems do the right thing.  So, let's go ahead and
fix this backwards logic.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
---
 mm/filemap.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)
Martin Steigerwald - Jan. 28, 2012, 3:08 p.m.
Adding linux-btrfs to Cc.

On Friday, 27 January 2012, Jeff Moyer wrote:
> Hi,

Hi,
 
> As it stands, generic_file_aio_write will call into generic_write_sync
> when -EIOCBQUEUED is returned from __generic_file_aio_write. 
> EIOCBQUEUED indicates that an I/O was submitted but NOT completed. 
> Thus, we will flush the disk cache, potentially before the write(s)
> even make it to the disk!  Up until now, this has been the best we
> could do, as file systems didn't bother to flush the disk cache after
> an O_SYNC AIO+DIO write.  After applying the prior two patches to xfs
> and ext4, at least the major two file systems do the right thing.  So,
> let's go ahead and fix this backwards logic.

Would this need an adaptation for BTRFS as well?

Thanks,
Martin

> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
> ---
>  mm/filemap.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c4ee2e9..004442f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2634,7 +2634,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
>  	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
>  	mutex_unlock(&inode->i_mutex);
> 
> -	if (ret > 0 || ret == -EIOCBQUEUED) {
> +	if (ret > 0) {
>  		ssize_t err;
> 
>  		err = generic_write_sync(file, pos, ret);
Jan Kara - Feb. 2, 2012, 5:52 p.m.
Hello,

On Fri 27-01-12 16:15:49, Jeff Moyer wrote:
> As it stands, generic_file_aio_write will call into generic_write_sync
> when -EIOCBQUEUED is returned from __generic_file_aio_write.  EIOCBQUEUED
> indicates that an I/O was submitted but NOT completed.  Thus, we will
> flush the disk cache, potentially before the write(s) even make it to
> the disk!
  Yeah. It seems to be a problem introduced by Tejun's rewrite of barrier
code, right? Before that we'd drain the IO queue when cache flush is issued
and thus effectively wait for IO completion...

>  Up until now, this has been the best we could do, as file
> systems didn't bother to flush the disk cache after an O_SYNC AIO+DIO
> write.  After applying the prior two patches to xfs and ext4, at least
> the major two file systems do the right thing.  So, let's go ahead and
> fix this backwards logic.
  But doesn't this break filesystems which you didn't fix explicitly even
more than they were? You are right they might have sent cache flush too
early but they'd at least properly force all metadata modifications (e.g.
from allocation) to disk. But after this patch O_SYNC will have simply no
effect for these filesystems.

Also I was thinking whether we couldn't implement the fix in VFS. Basically
it would be the same like the fix for ext4. Like having a per-sb workqueue
and queue work calling generic_write_sync() from end_io handler when the
file is O_SYNC? That would solve the issue for all filesystems...

								Honza
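
To make the proposal above concrete, here is a small userspace model of
"queue the sync from the end_io handler". It is only a sketch under loud
assumptions: struct model_io, struct model_workqueue, model_end_io() and
model_run_work() are hypothetical names invented for illustration, not
kernel interfaces, and the "sync" is just a flag standing in for a call
to generic_write_sync(). The work is deferred rather than run directly
because a completion handler may run in a context that must not block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one in-flight AIO+DIO write. */
struct model_io {
	bool o_sync;		/* file was opened with O_SYNC */
	bool data_on_disk;	/* set when the simulated write completes */
	bool synced;		/* set when the simulated sync has run */
	bool sync_saw_data;	/* was the write complete when the sync ran? */
	struct model_io *next;	/* link in the pending-work list */
};

/* Hypothetical stand-in for a per-superblock workqueue (simple FIFO). */
struct model_workqueue {
	struct model_io *head;
	struct model_io *tail;
};

static void model_queue_work(struct model_workqueue *wq, struct model_io *io)
{
	io->next = NULL;
	if (wq->tail)
		wq->tail->next = io;
	else
		wq->head = io;
	wq->tail = io;
}

/* Models the end_io handler: only O_SYNC files get a deferred sync. */
void model_end_io(struct model_workqueue *wq, struct model_io *io)
{
	io->data_on_disk = true;
	if (io->o_sync)
		model_queue_work(wq, io);
}

/*
 * Models the worker. Marking the item as synced stands in for calling
 * generic_write_sync() from process context. Returns the number of
 * items processed.
 */
int model_run_work(struct model_workqueue *wq)
{
	int processed = 0;

	while (wq->head) {
		struct model_io *io = wq->head;

		wq->head = io->next;
		if (!wq->head)
			wq->tail = NULL;
		io->sync_saw_data = io->data_on_disk;
		io->synced = true;
		processed++;
	}
	return processed;
}
```

In this model the sync can only ever run after the write has completed,
which is the ordering that the current generic_file_aio_write path does
not guarantee for -EIOCBQUEUED.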

> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
> ---
>  mm/filemap.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c4ee2e9..004442f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2634,7 +2634,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
>  	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
>  	mutex_unlock(&inode->i_mutex);
>  
> -	if (ret > 0 || ret == -EIOCBQUEUED) {
> +	if (ret > 0) {
>  		ssize_t err;
>  
>  		err = generic_write_sync(file, pos, ret);
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Moyer - Feb. 6, 2012, 4:33 p.m.
Jan Kara <jack@suse.cz> writes:

>   Hello,
>
> On Fri 27-01-12 16:15:49, Jeff Moyer wrote:
>> As it stands, generic_file_aio_write will call into generic_write_sync
>> when -EIOCBQUEUED is returned from __generic_file_aio_write.  EIOCBQUEUED
>> indicates that an I/O was submitted but NOT completed.  Thus, we will
>> flush the disk cache, potentially before the write(s) even make it to
>> the disk!
>   Yeah. It seems to be a problem introduced by Tejun's rewrite of barrier
> code, right? Before that we'd drain the IO queue when cache flush is issued
> and thus effectively wait for IO completion...

Right, though hch seems to think even then the problem existed.

>>  Up until now, this has been the best we could do, as file
>> systems didn't bother to flush the disk cache after an O_SYNC AIO+DIO
>> write.  After applying the prior two patches to xfs and ext4, at least
>> the major two file systems do the right thing.  So, let's go ahead and
>> fix this backwards logic.
>   But doesn't this break filesystems which you didn't fix explicitly even
> more than they were? You are right they might have sent cache flush too
> early but they'd at least properly force all metadata modifications (e.g.
> from allocation) to disk. But after this patch O_SYNC will have simply no
> effect for these filesystems.

Yep.  Note that we're calling into generic_write_sync with a negative
value.  I followed that call chain all the way down and convinced myself
that it was "mostly harmless," but it sure as heck ain't right.  I'll
audit other file systems to see whether it's a problem.  btrfs, at
least, isn't affected by this.

> Also I was thinking whether we couldn't implement the fix in VFS. Basically
> it would be the same like the fix for ext4. Like having a per-sb workqueue
> and queue work calling generic_write_sync() from end_io handler when the
> file is O_SYNC? That would solve the issue for all filesystems...

Well, that would require buy-in from the other file system developers.
What do the XFS folks think?

Cheers,
Jeff
Christoph Hellwig - Feb. 6, 2012, 7:55 p.m.
On Mon, Feb 06, 2012 at 11:33:29AM -0500, Jeff Moyer wrote:
> > code, right? Before that we'd drain the IO queue when cache flush is issued
> > and thus effectively wait for IO completion...
> 
> Right, though hch seems to think even then the problem existed.

I was wrong, using -o barrier it didn't.  That was however not something
people using O_SYNC heavy production loads would do, they'd use disabled
caches and nobarrier.

> > Also I was thinking whether we couldn't implement the fix in VFS. Basically
> > it would be the same like the fix for ext4. Like having a per-sb workqueue
> > and queue work calling generic_write_sync() from end_io handler when the
> > file is O_SYNC? That would solve the issue for all filesystems...
> 
> Well, that would require buy-in from the other file system developers.
> What do the XFS folks think?

I don't think using that code for XFS makes sense.  But just like
generic_write_sync there's no reason it can't be added to generic code,
just make sure only generic_file_aio_write/__generic_file_aio_write use
it, but generic_file_buffered_write and generic_file_direct_write stay
clear of it.

Jeff Moyer - Feb. 7, 2012, 8:39 p.m.
Christoph Hellwig <hch@infradead.org> writes:

> On Mon, Feb 06, 2012 at 11:33:29AM -0500, Jeff Moyer wrote:
>> > code, right? Before that we'd drain the IO queue when cache flush is issued
>> > and thus effectively wait for IO completion...
>> 
>> Right, though hch seems to think even then the problem existed.
>
> I was wrong, using -o barrier it didn't.  That was however not something
> people using O_SYNC heavy production loads would do, they'd use disabled
> caches and nobarrier.
>
>> > Also I was thinking whether we couldn't implement the fix in VFS. Basically
>> > it would be the same like the fix for ext4. Like having a per-sb workqueue
>> > and queue work calling generic_write_sync() from end_io handler when the
>> > file is O_SYNC? That would solve the issue for all filesystems...
>> 
>> Well, that would require buy-in from the other file system developers.
>> What do the XFS folks think?
>
> I don't think using that code for XFS makes sense.  But just like
> generic_write_sync there's no reason it can't be added to generic code,
> just make sure only generic_file_aio_write/__generic_file_aio_write use
> it, but generic_file_buffered_write and generic_file_direct_write stay
> clear of it.

ext4_file_write (ext4's .aio_write routine) calls into
generic_file_aio_write.  So, I don't think we can generalize that this
routine means that the file system doesn't install its own endio
handler.  What's more, we'd have to pass an endio routine down the call
stack quite a ways.  In all, I think that would be an uglier solution to
the problem.  Did I miss something?

Cheers,
Jeff

Patch

diff --git a/mm/filemap.c b/mm/filemap.c
index c4ee2e9..004442f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2634,7 +2634,7 @@  ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
 
-	if (ret > 0 || ret == -EIOCBQUEUED) {
+	if (ret > 0) {
 		ssize_t err;
 
 		err = generic_write_sync(file, pos, ret);
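
For readers following the one-line condition change above, here is a
minimal userspace sketch of the two predicates. It is not kernel code:
sync_before_patch() and sync_after_patch() are hypothetical names made
up for illustration, and EIOCBQUEUED is redefined locally with the value
from the kernel's include/linux/errno.h, since userspace headers do not
export it.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>	/* ssize_t */

/* Kernel-internal errno value (include/linux/errno.h), redefined here
 * only so that this sketch builds in userspace. */
#define EIOCBQUEUED 529

/* Condition guarding generic_write_sync() before the patch. */
bool sync_before_patch(ssize_t ret)
{
	return ret > 0 || ret == -EIOCBQUEUED;
}

/* Condition guarding generic_write_sync() after the patch. */
bool sync_after_patch(ssize_t ret)
{
	return ret > 0;
}
```

With ret == -EIOCBQUEUED the old predicate is true, so ret would be
passed on to generic_write_sync() as a negative length while the write
may still be in flight. The new predicate only syncs when a positive
byte count has been returned.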