Patchwork [4/4] ext3: Warn if mounting rw on a disk requiring stable page writes

login
register
mail settings
Submitter Darrick J. Wong
Date Nov. 21, 2012, 2 a.m.
Message ID <20121121020056.10225.15220.stgit@blackbox.djwong.org>
Download mbox | patch
Permalink /patch/200538/
State Superseded
Headers show

Comments

Darrick J. Wong - Nov. 21, 2012, 2 a.m.
ext3 doesn't properly isolate pages from changes during writeback.  Since the
recommended fix is to use ext4, for now we'll just print a warning if the user
tries to mount in write mode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ext3/super.c |    8 ++++++++
 1 file changed, 8 insertions(+)



--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Nov. 21, 2012, 2:15 a.m.
On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> ext3 doesn't properly isolate pages from changes during writeback.  Since the
> recommended fix is to use ext4, for now we'll just print a warning if the user
> tries to mount in write mode.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/ext3/super.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> 
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 5366393..5b3725d 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
>  			"forcing read-only mode");
>  		res = MS_RDONLY;
>  	}
> +	if (!read_only &&
> +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> +		ext3_msg(sb, KERN_ERR,
> +			"error: ext3 cannot safely write data to a disk "
> +			"requiring stable pages writes; forcing read-only "
> +			"mode.  Upgrading to ext4 is recommended.");
> +		res = MS_RDONLY;
> +	}
>  	if (read_only)
>  		return res;
>  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
  Why this? ext3 should be fixed by your change to
filemap_page_mkwrite()... Or does testing show otherwise?

								Honza
Darrick J. Wong - Nov. 21, 2012, 9:13 p.m.
On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > recommended fix is to use ext4, for now we'll just print a warning if the user
> > tries to mount in write mode.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/ext3/super.c |    8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > 
> > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > index 5366393..5b3725d 100644
> > --- a/fs/ext3/super.c
> > +++ b/fs/ext3/super.c
> > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> >  			"forcing read-only mode");
> >  		res = MS_RDONLY;
> >  	}
> > +	if (!read_only &&
> > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > +		ext3_msg(sb, KERN_ERR,
> > +			"error: ext3 cannot safely write data to a disk "
> > +			"requiring stable pages writes; forcing read-only "
> > +			"mode.  Upgrading to ext4 is recommended.");
> > +		res = MS_RDONLY;
> > +	}
> >  	if (read_only)
> >  		return res;
> >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
>   Why this? ext3 should be fixed by your change to
> filemap_page_mkwrite()... Or does testing show otherwise?

Yes, it's still broken even with this new set of changes.  Now that I think
about it a little more, I recall that writeback mode was actually fine, so this
is a little harsh.

Hm... looking at the ordered code a little more, it looks like
ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
to write mapped buffers back through the journal?  Taking it out seems to fix
ordered mode, though I have a suspicion that it might very well break ordered
mode too.

--D
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Nov. 21, 2012, 9:33 p.m.
On Wed 21-11-12 13:13:19, Darrick J. Wong wrote:
> On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > > recommended fix is to use ext4, for now we'll just print a warning if the user
> > > tries to mount in write mode.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/ext3/super.c |    8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > > 
> > > 
> > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > > index 5366393..5b3725d 100644
> > > --- a/fs/ext3/super.c
> > > +++ b/fs/ext3/super.c
> > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> > >  			"forcing read-only mode");
> > >  		res = MS_RDONLY;
> > >  	}
> > > +	if (!read_only &&
> > > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > > +		ext3_msg(sb, KERN_ERR,
> > > +			"error: ext3 cannot safely write data to a disk "
> > > +			"requiring stable pages writes; forcing read-only "
> > > +			"mode.  Upgrading to ext4 is recommended.");
> > > +		res = MS_RDONLY;
> > > +	}
> > >  	if (read_only)
> > >  		return res;
> > >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
> >   Why this? ext3 should be fixed by your change to
> > filemap_page_mkwrite()... Or does testing show otherwise?
> 
> Yes, it's still broken even with this new set of changes.  Now that I think
> about it a little more, I recall that writeback mode was actually fine, so this
> is a little harsh.
> 
> Hm... looking at the ordered code a little more, it looks like
> ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
> to write mapped buffers back through the journal?  Taking it out seems to fix
> ordered mode, though I have a suspicion that it might very well break ordered
> mode too.
  Oh, right. kjournald writing buffers directly (without setting
PageWriteback) will break things. So please, change warning to:

	/*
	 * In data=ordered mode, kjournald writes buffers without setting
	 * PageWriteback bit thus generic code does not properly wait for
	 * writeback of those buffers to finish.
	 */
	if (!read_only &&
	    test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA &&
	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
		ext3_msg(sb, KERN_ERR,
			"error: data=ordered mode does not support stable "
			"page writes required by the disk; forcing read-only "
			"mode. Upgrading to ext4 is recommended.");
		res = MS_RDONLY;
	}

then you need a similar check in ext3_remount() so that filesystem cannot
be remounted read-write.

								Honza
Neil Brown - Nov. 21, 2012, 9:47 p.m.
On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote:

> On Wed 21-11-12 13:13:19, Darrick J. Wong wrote:
> > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > > > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > > > recommended fix is to use ext4, for now we'll just print a warning if the user
> > > > tries to mount in write mode.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/ext3/super.c |    8 ++++++++
> > > >  1 file changed, 8 insertions(+)
> > > > 
> > > > 
> > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > > > index 5366393..5b3725d 100644
> > > > --- a/fs/ext3/super.c
> > > > +++ b/fs/ext3/super.c
> > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> > > >  			"forcing read-only mode");
> > > >  		res = MS_RDONLY;
> > > >  	}
> > > > +	if (!read_only &&
> > > > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > > > +		ext3_msg(sb, KERN_ERR,
> > > > +			"error: ext3 cannot safely write data to a disk "
> > > > +			"requiring stable pages writes; forcing read-only "
> > > > +			"mode.  Upgrading to ext4 is recommended.");
> > > > +		res = MS_RDONLY;
> > > > +	}
> > > >  	if (read_only)
> > > >  		return res;
> > > >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
> > >   Why this? ext3 should be fixed by your change to
> > > filemap_page_mkwrite()... Or does testing show otherwise?
> > 
> > Yes, it's still broken even with this new set of changes.  Now that I think
> > about it a little more, I recall that writeback mode was actually fine, so this
> > is a little harsh.
> > 
> > Hm... looking at the ordered code a little more, it looks like
> > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
> > to write mapped buffers back through the journal?  Taking it out seems to fix
> > ordered mode, though I have a suspicion that it might very well break ordered
> > mode too.
>   Oh, right. kjournald writing buffers directly (without setting
> PageWriteback) will break things. So please, change warning to:
> 
> 	/*
> 	 * In data=ordered mode, kjournald writes buffers without setting
> 	 * PageWriteback bit thus generic code does not properly wait for
> 	 * writeback of those buffers to finish.
> 	 */
> 	if (!read_only &&
> 	    test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA &&
> 	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> 		ext3_msg(sb, KERN_ERR,
> 			"error: data=ordered mode does not support stable "
> 			"page writes required by the disk; forcing read-only "
> 			"mode. Upgrading to ext4 is recommended.");
> 		res = MS_RDONLY;
> 	}
> 
> then you need a similar check in ext3_remount() so that filesystem cannot
> be remounted read-write.
> 
> 								Honza

Given this restriction, there is no way that I can change md/raid5 to set the
"stable pages" flag and stop copying pages into the stripe-cache.  ext3 on
raid5 will be much too common to allow this breakage.

I would really like to be able to say "I prefer stable pages, but they aren't
a requirement as long as I know which is which" ....

NeilBrown
Darrick J. Wong - Nov. 22, 2012, 1:47 a.m.
On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote:
> On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote:
> 
> > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote:
> > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > > > > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > > > > recommended fix is to use ext4, for now we'll just print a warning if the user
> > > > > tries to mount in write mode.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > >  fs/ext3/super.c |    8 ++++++++
> > > > >  1 file changed, 8 insertions(+)
> > > > > 
> > > > > 
> > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > > > > index 5366393..5b3725d 100644
> > > > > --- a/fs/ext3/super.c
> > > > > +++ b/fs/ext3/super.c
> > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> > > > >  			"forcing read-only mode");
> > > > >  		res = MS_RDONLY;
> > > > >  	}
> > > > > +	if (!read_only &&
> > > > > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > > > > +		ext3_msg(sb, KERN_ERR,
> > > > > +			"error: ext3 cannot safely write data to a disk "
> > > > > +			"requiring stable pages writes; forcing read-only "
> > > > > +			"mode.  Upgrading to ext4 is recommended.");
> > > > > +		res = MS_RDONLY;
> > > > > +	}
> > > > >  	if (read_only)
> > > > >  		return res;
> > > > >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
> > > >   Why this? ext3 should be fixed by your change to
> > > > filemap_page_mkwrite()... Or does testing show otherwise?
> > > 
> > > Yes, it's still broken even with this new set of changes.  Now that I think
> > > about it a little more, I recall that writeback mode was actually fine, so this
> > > is a little harsh.
> > > 
> > > Hm... looking at the ordered code a little more, it looks like
> > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
> > > to write mapped buffers back through the journal?  Taking it out seems to fix
> > > ordered mode, though I have a suspicion that it might very well break ordered
> > > mode too.
> >   Oh, right. kjournald writing buffers directly (without setting
> > PageWriteback) will break things. So please, change warning to:

Maybe we should just fix this anyway?

I still have the patch that adds PG_stable (and changes the
wait_for_page_stable() test to use this flag instead of PG_writeback) kicking
around in my tree.  I wrote a patch to jbd that changes journal_do_submit_data
to set PG_stable, call clear_page_dirty_for_io(), and unsets the stable bit in
the end_io processing.

It seems to get rid of the checksum-after-write errors, though I'm not
convinced it's correct.  But, I'll send both patches along.

> > 
> > 	/*
> > 	 * In data=ordered mode, kjournald writes buffers without setting
> > 	 * PageWriteback bit thus generic code does not properly wait for
> > 	 * writeback of those buffers to finish.
> > 	 */
> > 	if (!read_only &&
> > 	    test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA &&

test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA

since I bet data=journal mode is also borken wrt PageWriteback.

> > 	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > 		ext3_msg(sb, KERN_ERR,
> > 			"error: data=ordered mode does not support stable "
> > 			"page writes required by the disk; forcing read-only "
> > 			"mode. Upgrading to ext4 is recommended.");
> > 		res = MS_RDONLY;
> > 	}
> > 
> > then you need a similar check in ext3_remount() so that filesystem cannot
> > be remounted read-write.
> > 
> > 								Honza
> 
> Given this restriction, there is no way that I can change md/raid5 to set the
> "stable pages" flag and stop copying pages into the stripe-cache.  ext3 on
> raid5 will be much too common to allow this breakage.
> 
> I would really like to be able to say "I prefer stable pages, but they aren't
> a requirement as long as I know which is which" ....

I'd rather just fix ext3. :)

(or remove it, since ext4 can handle ext3)

--D

> NeilBrown


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Nov. 22, 2012, 9:12 a.m.
On Wed 21-11-12 17:47:55, Darrick J. Wong wrote:
> On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote:
> > On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote:
> > 
> > > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote:
> > > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> > > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > > > > > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > > > > > recommended fix is to use ext4, for now we'll just print a warning if the user
> > > > > > tries to mount in write mode.
> > > > > > 
> > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > ---
> > > > > >  fs/ext3/super.c |    8 ++++++++
> > > > > >  1 file changed, 8 insertions(+)
> > > > > > 
> > > > > > 
> > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > > > > > index 5366393..5b3725d 100644
> > > > > > --- a/fs/ext3/super.c
> > > > > > +++ b/fs/ext3/super.c
> > > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> > > > > >  			"forcing read-only mode");
> > > > > >  		res = MS_RDONLY;
> > > > > >  	}
> > > > > > +	if (!read_only &&
> > > > > > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > > > > > +		ext3_msg(sb, KERN_ERR,
> > > > > > +			"error: ext3 cannot safely write data to a disk "
> > > > > > +			"requiring stable pages writes; forcing read-only "
> > > > > > +			"mode.  Upgrading to ext4 is recommended.");
> > > > > > +		res = MS_RDONLY;
> > > > > > +	}
> > > > > >  	if (read_only)
> > > > > >  		return res;
> > > > > >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
> > > > >   Why this? ext3 should be fixed by your change to
> > > > > filemap_page_mkwrite()... Or does testing show otherwise?
> > > > 
> > > > Yes, it's still broken even with this new set of changes.  Now that I think
> > > > about it a little more, I recall that writeback mode was actually fine, so this
> > > > is a little harsh.
> > > > 
> > > > Hm... looking at the ordered code a little more, it looks like
> > > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
> > > > to write mapped buffers back through the journal?  Taking it out seems to fix
> > > > ordered mode, though I have a suspicion that it might very well break ordered
> > > > mode too.
> > >   Oh, right. kjournald writing buffers directly (without setting
> > > PageWriteback) will break things. So please, change warning to:
> 
> Maybe we should just fix this anyway?
> 
> I still have the patch that adds PG_stable (and changes the
> wait_for_page_stable() test to use this flag instead of PG_writeback) kicking
> around in my tree.  I wrote a patch to jbd that changes journal_do_submit_data
> to set PG_stable, call clear_page_dirty_for_io(), and unsets the stable bit in
> the end_io processing.
> 
> It seems to get rid of the checksum-after-write errors, though I'm not
> convinced it's correct.  But, I'll send both patches along.
  I'll check the patches. Fixing PageWriteback logic for ext3 is not easily
doable due to lock ranking constraints - PageWriteback has to be set under
PageLocked but that ranks above transaction start so kjournald cannot grab
page locks so it cannot set PageWriteback... And changing the lock ordering
is a major surgery.

What could be doable is waiting for buffer locks from ext3's ->write_begin
and ->page_mkwrite implementations in case stable writes are required. If
your approach with a separate page bit doesn't work out (and I have some
doubts about that as mm people are *really* thrifty with page bits).

> > > 	/*
> > > 	 * In data=ordered mode, kjournald writes buffers without setting
> > > 	 * PageWriteback bit thus generic code does not properly wait for
> > > 	 * writeback of those buffers to finish.
> > > 	 */
> > > 	if (!read_only &&
> > > 	    test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA &&
> 
> test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA
> 
> since I bet data=journal mode is also borken wrt PageWriteback.
  It is broken wrt PageWriteback but it actually waits for buffer locks in
->write_begin() so at least write path should be properly protected. But
mmap is not handled properly there (although that wouldn't be that hard to
fix). So I agree the condition should rather be what you suggest.

								Honza
Dave Chinner - Nov. 22, 2012, 11:15 p.m.
On Wed, Nov 21, 2012 at 05:47:55PM -0800, Darrick J. Wong wrote:
> On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote:
> > On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote:
> > 
> > > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote:
> > > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> > > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > > > > > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > > > > > recommended fix is to use ext4, for now we'll just print a warning if the user
> > > > > > tries to mount in write mode.
> > > > > > 
> > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > ---
> > > > > >  fs/ext3/super.c |    8 ++++++++
> > > > > >  1 file changed, 8 insertions(+)
> > > > > > 
> > > > > > 
> > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > > > > > index 5366393..5b3725d 100644
> > > > > > --- a/fs/ext3/super.c
> > > > > > +++ b/fs/ext3/super.c
> > > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> > > > > >  			"forcing read-only mode");
> > > > > >  		res = MS_RDONLY;
> > > > > >  	}
> > > > > > +	if (!read_only &&
> > > > > > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > > > > > +		ext3_msg(sb, KERN_ERR,
> > > > > > +			"error: ext3 cannot safely write data to a disk "
> > > > > > +			"requiring stable pages writes; forcing read-only "
> > > > > > +			"mode.  Upgrading to ext4 is recommended.");
> > > > > > +		res = MS_RDONLY;
> > > > > > +	}
> > > > > >  	if (read_only)
> > > > > >  		return res;
> > > > > >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
> > > > >   Why this? ext3 should be fixed by your change to
> > > > > filemap_page_mkwrite()... Or does testing show otherwise?
> > > > 
> > > > Yes, it's still broken even with this new set of changes.  Now that I think
> > > > about it a little more, I recall that writeback mode was actually fine, so this
> > > > is a little harsh.
> > > > 
> > > > Hm... looking at the ordered code a little more, it looks like
> > > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
> > > > to write mapped buffers back through the journal?  Taking it out seems to fix
> > > > ordered mode, though I have a suspicion that it might very well break ordered
> > > > mode too.
> > >   Oh, right. kjournald writing buffers directly (without setting
> > > PageWriteback) will break things. So please, change warning to:
> 
> Maybe we should just fix this anyway?
> 
> I still have the patch that adds PG_stable (and changes the

Page flags are in short supply. If you need a page flag to fix a bug
in a single filesytem, there's a couple of reserved private page
flags that can be used. i.e PG_owner_priv_1. I don't think using a
glbal page flag for just this purpose will fly, though...

Cheers,

Dave.
Darrick J. Wong - Nov. 27, 2012, 2:17 a.m.
On Thu, Nov 22, 2012 at 10:12:40AM +0100, Jan Kara wrote:
> On Wed 21-11-12 17:47:55, Darrick J. Wong wrote:
> > On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote:
> > > On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote:
> > > 
> > > > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote:
> > > > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> > > > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > > > > > > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > > > > > > recommended fix is to use ext4, for now we'll just print a warning if the user
> > > > > > > tries to mount in write mode.
> > > > > > > 
> > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > ---
> > > > > > >  fs/ext3/super.c |    8 ++++++++
> > > > > > >  1 file changed, 8 insertions(+)
> > > > > > > 
> > > > > > > 
> > > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > > > > > > index 5366393..5b3725d 100644
> > > > > > > --- a/fs/ext3/super.c
> > > > > > > +++ b/fs/ext3/super.c
> > > > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> > > > > > >  			"forcing read-only mode");
> > > > > > >  		res = MS_RDONLY;
> > > > > > >  	}
> > > > > > > +	if (!read_only &&
> > > > > > > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > > > > > > +		ext3_msg(sb, KERN_ERR,
> > > > > > > +			"error: ext3 cannot safely write data to a disk "
> > > > > > > +			"requiring stable pages writes; forcing read-only "
> > > > > > > +			"mode.  Upgrading to ext4 is recommended.");
> > > > > > > +		res = MS_RDONLY;
> > > > > > > +	}
> > > > > > >  	if (read_only)
> > > > > > >  		return res;
> > > > > > >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
> > > > > >   Why this? ext3 should be fixed by your change to
> > > > > > filemap_page_mkwrite()... Or does testing show otherwise?
> > > > > 
> > > > > Yes, it's still broken even with this new set of changes.  Now that I think
> > > > > about it a little more, I recall that writeback mode was actually fine, so this
> > > > > is a little harsh.
> > > > > 
> > > > > Hm... looking at the ordered code a little more, it looks like
> > > > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
> > > > > to write mapped buffers back through the journal?  Taking it out seems to fix
> > > > > ordered mode, though I have a suspicion that it might very well break ordered
> > > > > mode too.
> > > >   Oh, right. kjournald writing buffers directly (without setting
> > > > PageWriteback) will break things. So please, change warning to:
> > 
> > Maybe we should just fix this anyway?
> > 
> > I still have the patch that adds PG_stable (and changes the
> > wait_for_page_stable() test to use this flag instead of PG_writeback) kicking
> > around in my tree.  I wrote a patch to jbd that changes journal_do_submit_data
> > to set PG_stable, call clear_page_dirty_for_io(), and unsets the stable bit in
> > the end_io processing.
> > 
> > It seems to get rid of the checksum-after-write errors, though I'm not
> > convinced it's correct.  But, I'll send both patches along.
>   I'll check the patches. Fixing PageWriteback logic for ext3 is not easily
> doable due to lock ranking constraints - PageWriteback has to be set under
> PageLocked but that ranks above transaction start so kjournald cannot grab
> page locks so it cannot set PageWriteback... And changing the lock ordering
> is a major surgery.
> 
> What could be doable is waiting for buffer locks from ext3's ->write_begin
> and ->page_mkwrite implementations in case stable writes are required. If
> your approach with a separate page bit doesn't work out (and I have some
> doubts about that as mm people are *really* thrifty with page bits).
> 
> > > > 	/*
> > > > 	 * In data=ordered mode, kjournald writes buffers without setting
> > > > 	 * PageWriteback bit thus generic code does not properly wait for
> > > > 	 * writeback of those buffers to finish.
> > > > 	 */
> > > > 	if (!read_only &&
> > > > 	    test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA &&
> > 
> > test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA
> > 
> > since I bet data=journal mode is also borken wrt PageWriteback.
>   It is broken wrt PageWriteback but it actually waits for buffer locks in
> ->write_begin() so at least write path should be properly protected. But
> mmap is not handled properly there (although that wouldn't be that hard to
> fix). So I agree the condition should rather be what you suggest.

Hm.  In journal mode, write_begin calls do_journal_get_write_access on each
buffer for a given page, and in turn, jbd's do_get_write_access calls
lock_buffer.  Is that what you're referring to by "actually waits for buffer
locks"?  I'm wondering how that helps us, since afaict PG_writeback doesn't get
set in that path, and I think it's a little early to be setting PG_writeback
anyway.

If the page has to be locked before the transaction starts, how much of a
problem is it to set PG_writeback?  Even though that seems a bit early to be
doing that?

Just for fun, I tried porting ext4_page_mkwrite into ext3 (removing all the
parts that don't exist in ext3) so that do_journal_get_write_access would also
get called here, but it didn't seem to fix journal mode.  

--D
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Dec. 5, 2012, 12:12 p.m.
On Mon 26-11-12 18:17:40, Darrick J. Wong wrote:
> On Thu, Nov 22, 2012 at 10:12:40AM +0100, Jan Kara wrote:
> > On Wed 21-11-12 17:47:55, Darrick J. Wong wrote:
> > > On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote:
> > > > On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote:
> > > > 
> > > > > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote:
> > > > > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> > > > > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > > > > > > > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > > > > > > > recommended fix is to use ext4, for now we'll just print a warning if the user
> > > > > > > > tries to mount in write mode.
> > > > > > > > 
> > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > ---
> > > > > > > >  fs/ext3/super.c |    8 ++++++++
> > > > > > > >  1 file changed, 8 insertions(+)
> > > > > > > > 
> > > > > > > > 
> > > > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > > > > > > > index 5366393..5b3725d 100644
> > > > > > > > --- a/fs/ext3/super.c
> > > > > > > > +++ b/fs/ext3/super.c
> > > > > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> > > > > > > >  			"forcing read-only mode");
> > > > > > > >  		res = MS_RDONLY;
> > > > > > > >  	}
> > > > > > > > +	if (!read_only &&
> > > > > > > > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > > > > > > > +		ext3_msg(sb, KERN_ERR,
> > > > > > > > +			"error: ext3 cannot safely write data to a disk "
> > > > > > > > +			"requiring stable pages writes; forcing read-only "
> > > > > > > > +			"mode.  Upgrading to ext4 is recommended.");
> > > > > > > > +		res = MS_RDONLY;
> > > > > > > > +	}
> > > > > > > >  	if (read_only)
> > > > > > > >  		return res;
> > > > > > > >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
> > > > > > >   Why this? ext3 should be fixed by your change to
> > > > > > > filemap_page_mkwrite()... Or does testing show otherwise?
> > > > > > 
> > > > > > Yes, it's still broken even with this new set of changes.  Now that I think
> > > > > > about it a little more, I recall that writeback mode was actually fine, so this
> > > > > > is a little harsh.
> > > > > > 
> > > > > > Hm... looking at the ordered code a little more, it looks like
> > > > > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
> > > > > > to write mapped buffers back through the journal?  Taking it out seems to fix
> > > > > > ordered mode, though I have a suspicion that it might very well break ordered
> > > > > > mode too.
> > > > >   Oh, right. kjournald writing buffers directly (without setting
> > > > > PageWriteback) will break things. So please, change warning to:
> > > 
> > > Maybe we should just fix this anyway?
> > > 
> > > I still have the patch that adds PG_stable (and changes the
> > > wait_for_page_stable() test to use this flag instead of PG_writeback) kicking
> > > around in my tree.  I wrote a patch to jbd that changes journal_do_submit_data
> > > to set PG_stable, call clear_page_dirty_for_io(), and unsets the stable bit in
> > > the end_io processing.
> > > 
> > > It seems to get rid of the checksum-after-write errors, though I'm not
> > > convinced it's correct.  But, I'll send both patches along.
> >   I'll check the patches. Fixing PageWriteback logic for ext3 is not easily
> > doable due to lock ranking constraints - PageWriteback has to be set under
> > PageLocked but that ranks above transaction start so kjournald cannot grab
> > page locks so it cannot set PageWriteback... And changing the lock ordering
> > is a major surgery.
> > 
> > What could be doable is waiting for buffer locks from ext3's ->write_begin
> > and ->page_mkwrite implementations in case stable writes are required. If
> > your approach with a separate page bit doesn't work out (and I have some
> > doubts about that as mm people are *really* thrifty with page bits).
> > 
> > > > > 	/*
> > > > > 	 * In data=ordered mode, kjournald writes buffers without setting
> > > > > 	 * PageWriteback bit thus generic code does not properly wait for
> > > > > 	 * writeback of those buffers to finish.
> > > > > 	 */
> > > > > 	if (!read_only &&
> > > > > 	    test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA &&
> > > 
> > > test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA
> > > 
> > > since I bet data=journal mode is also borken wrt PageWriteback.
> >   It is broken wrt PageWriteback but it actually waits for buffer locks in
> > ->write_begin() so at least write path should be properly protected. But
> > mmap is not handled properly there (although that wouldn't be that hard to
> > fix). So I agree the condition should rather be what you suggest.
  Sorry for late reply. I was on vacation...

> Hm.  In journal mode, write_begin calls do_journal_get_write_access on each
> buffer for a given page, and in turn, jbd's do_get_write_access calls
> lock_buffer.  Is that what you're referring to by "actually waits for buffer
> locks"?  I'm wondering how that helps us, since afaict PG_writeback doesn't get
> set in that path, and I think it's a little early to be setting PG_writeback
> anyway.
  It does help us. In ext3 data writeback is done either by flusher thread,
that happens under PG_Writeback and generic code waits for that as need, or
by kjournald - that happens under buffer lock and as you properly observed
do_get_write_access() waits for that (and actually copies data that should
go to disk to a separate buffer if needed).

> If the page has to be locked before the transaction starts, how much of a
> problem is it to set PG_writeback?  Even though that seems a bit early to be
> doing that?
  Well, what you would need to make things consistent is to set
PG_writeback from kjournald so that all writeback happens with PG_writeback
set on the page. But setting has to happen while the page is locked and
kjournald can never block on page lock because that would cause
deadlocks...

									Honza
Darrick J. Wong - Dec. 8, 2012, 1:09 a.m.
On Wed, Dec 05, 2012 at 01:12:28PM +0100, Jan Kara wrote:
> On Mon 26-11-12 18:17:40, Darrick J. Wong wrote:
> > On Thu, Nov 22, 2012 at 10:12:40AM +0100, Jan Kara wrote:
> > > On Wed 21-11-12 17:47:55, Darrick J. Wong wrote:
> > > > On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote:
> > > > > On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote:
> > > > > 
> > > > > > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote:
> > > > > > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> > > > > > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > > > > > > > > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > > > > > > > > recommended fix is to use ext4, for now we'll just print a warning if the user
> > > > > > > > > tries to mount in write mode.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > > ---
> > > > > > > > >  fs/ext3/super.c |    8 ++++++++
> > > > > > > > >  1 file changed, 8 insertions(+)
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > > > > > > > > index 5366393..5b3725d 100644
> > > > > > > > > --- a/fs/ext3/super.c
> > > > > > > > > +++ b/fs/ext3/super.c
> > > > > > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> > > > > > > > >  			"forcing read-only mode");
> > > > > > > > >  		res = MS_RDONLY;
> > > > > > > > >  	}
> > > > > > > > > +	if (!read_only &&
> > > > > > > > > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > > > > > > > > +		ext3_msg(sb, KERN_ERR,
> > > > > > > > > +			"error: ext3 cannot safely write data to a disk "
> > > > > > > > > +			"requiring stable pages writes; forcing read-only "
> > > > > > > > > +			"mode.  Upgrading to ext4 is recommended.");
> > > > > > > > > +		res = MS_RDONLY;
> > > > > > > > > +	}
> > > > > > > > >  	if (read_only)
> > > > > > > > >  		return res;
> > > > > > > > >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
> > > > > > > >   Why this? ext3 should be fixed by your change to
> > > > > > > > filemap_page_mkwrite()... Or does testing show otherwise?
> > > > > > > 
> > > > > > > Yes, it's still broken even with this new set of changes.  Now that I think
> > > > > > > about it a little more, I recall that writeback mode was actually fine, so this
> > > > > > > is a little harsh.
> > > > > > > 
> > > > > > > Hm... looking at the ordered code a little more, it looks like
> > > > > > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
> > > > > > > to write mapped buffers back through the journal?  Taking it out seems to fix
> > > > > > > ordered mode, though I have a suspicion that it might very well break ordered
> > > > > > > mode too.
> > > > > >   Oh, right. kjournald writing buffers directly (without setting
> > > > > > PageWriteback) will break things. So please, change warning to:
> > > > 
> > > > Maybe we should just fix this anyway?
> > > > 
> > > > I still have the patch that adds PG_stable (and changes the
> > > > wait_for_page_stable() test to use this flag instead of PG_writeback) kicking
> > > > around in my tree.  I wrote a patch to jbd that changes journal_do_submit_data
> > > > to set PG_stable, call clear_page_dirty_for_io(), and unsets the stable bit in
> > > > the end_io processing.
> > > > 
> > > > It seems to get rid of the checksum-after-write errors, though I'm not
> > > > convinced it's correct.  But, I'll send both patches along.
> > >   I'll check the patches. Fixing PageWriteback logic for ext3 is not easily
> > > doable due to lock ranking constraints - PageWriteback has to be set under
> > > PageLocked but that ranks above transaction start so kjournald cannot grab
> > > page locks so it cannot set PageWriteback... And changing the lock ordering
> > > is a major surgery.
> > > 
> > > What could be doable is waiting for buffer locks from ext3's ->write_begin
> > > and ->page_mkwrite implementations in case stable writes are required. If
> > > your approach with a separate page bit doesn't work out (and I have some
> > > doubts about that as mm people are *really* thrifty with page bits).
> > > 
> > > > > > 	/*
> > > > > > 	 * In data=ordered mode, kjournald writes buffers without setting
> > > > > > 	 * PageWriteback bit thus generic code does not properly wait for
> > > > > > 	 * writeback of those buffers to finish.
> > > > > > 	 */
> > > > > > 	if (!read_only &&
> > > > > > 	    test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA &&
> > > > 
> > > > test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA
> > > > 
> > > > since I bet data=journal mode is also borken wrt PageWriteback.
> > >   It is broken wrt PageWriteback but it actually waits for buffer locks in
> > > ->write_begin() so at least write path should be properly protected. But
> > > mmap is not handled properly there (although that wouldn't be that hard to
> > > fix). So I agree the condition should rather be what you suggest.
>   Sorry for late reply. I was on vacation...

No worries; I have plenty of other patchsets to work on. :)

> > Hm.  In journal mode, write_begin calls do_journal_get_write_access on each
> > buffer for a given page, and in turn, jbd's do_get_write_access calls
> > lock_buffer.  Is that what you're referring to by "actually waits for buffer
> > locks"?  I'm wondering how that helps us, since afaict PG_writeback doesn't get
> > set in that path, and I think it's a little early to be setting PG_writeback
> > anyway.
>   It does help us. In ext3 data writeback is done either by flusher thread,
> that happens under PG_Writeback and generic code waits for that as need, or
> by kjournald - that happens under buffer lock and as you properly observed
> do_get_write_access() waits for that (and actually copies data that should
> go to disk to a separate buffer if needed).

Hm.  jbd2 calls generic_writepages to flush those buffers out, which sets
PG_writeback like you'd expect.  It'd be nice if you could do that like ext4
does... but then the ext3 writepage functions can call journal start/stop.

Maybe we could call block_write_full_page directly?

> > If the page has to be locked before the transaction starts, how much of a
> > problem is it to set PG_writeback?  Even though that seems a bit early to be
> > doing that?
>   Well, what you would need to make things consistent is to set
> PG_writeback from kjournald so that all writeback happens with PG_writeback
> set on the page. But setting has to happen while the page is locked and
> kjournald can never block on page lock because that would cause
> deadlocks...

Right now we try to submit_bh() everything on commit->t_sync_datalist without
the page lock.  I guess we could try to lock (and set writeback on) every page
on that list, and anything that we can lock then goes on a new
t_locked_for_sync list.  If we find a page that's already locked, simply back
out to the main kjournald loop and try the whole mess again.  Once all the
pages are on the locked_for_sync list, then we can unlock the pages and
actually commit the transaction.

But, uck, that feels like courting disaster.  Do people sleep with a page
locked?  I'm unsure of the wisdom of making jbd do that.  Can we stall out
forever if someone keeps shoving pages onto t_sync_data?

I actually tried a dumb trylock_page loop in journal_do_submit_data, but it
quickly stalled.

Ok, enough rambling, I'll give a few of my harebrained suggestions a try and
see what results.

--D
> 
> 									Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Dec. 10, 2012, 10:41 a.m.
On Fri 07-12-12 17:09:58, Darrick J. Wong wrote:
> On Wed, Dec 05, 2012 at 01:12:28PM +0100, Jan Kara wrote:
> > On Mon 26-11-12 18:17:40, Darrick J. Wong wrote:
> > > On Thu, Nov 22, 2012 at 10:12:40AM +0100, Jan Kara wrote:
> > > > On Wed 21-11-12 17:47:55, Darrick J. Wong wrote:
> > > >   I'll check the patches. Fixing PageWriteback logic for ext3 is not easily
> > > > doable due to lock ranking constraints - PageWriteback has to be set under
> > > > PageLocked but that ranks above transaction start so kjournald cannot grab
> > > > page locks so it cannot set PageWriteback... And changing the lock ordering
> > > > is a major surgery.
> > > > 
> > > > What could be doable is waiting for buffer locks from ext3's ->write_begin
> > > > and ->page_mkwrite implementations in case stable writes are required. If
> > > > your approach with a separate page bit doesn't work out (and I have some
> > > > doubts about that as mm people are *really* thrifty with page bits).
> > > > 
> > > > > > > 	/*
> > > > > > > 	 * In data=ordered mode, kjournald writes buffers without setting
> > > > > > > 	 * PageWriteback bit thus generic code does not properly wait for
> > > > > > > 	 * writeback of those buffers to finish.
> > > > > > > 	 */
> > > > > > > 	if (!read_only &&
> > > > > > > 	    test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA &&
> > > > > 
> > > > > test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA
> > > > > 
> > > > > since I bet data=journal mode is also borken wrt PageWriteback.
> > > >   It is broken wrt PageWriteback but it actually waits for buffer locks in
> > > > ->write_begin() so at least write path should be properly protected. But
> > > > mmap is not handled properly there (although that wouldn't be that hard to
> > > > fix). So I agree the condition should rather be what you suggest.
> >   Sorry for late reply. I was on vacation...
> 
> No worries; I have plenty of other patchsets to work on. :)
> 
> > > Hm.  In journal mode, write_begin calls do_journal_get_write_access on each
> > > buffer for a given page, and in turn, jbd's do_get_write_access calls
> > > lock_buffer.  Is that what you're referring to by "actually waits for buffer
> > > locks"?  I'm wondering how that helps us, since afaict PG_writeback doesn't get
> > > set in that path, and I think it's a little early to be setting PG_writeback
> > > anyway.
> >   It does help us. In ext3 data writeback is done either by flusher thread,
> > that happens under PG_Writeback and generic code waits for that as need, or
> > by kjournald - that happens under buffer lock and as you properly observed
> > do_get_write_access() waits for that (and actually copies data that should
> > go to disk to a separate buffer if needed).
> 
> Hm.  jbd2 calls generic_writepages to flush those buffers out, which sets
> PG_writeback like you'd expect.  It'd be nice if you could do that like ext4
> does... but then the ext3 writepage functions can call journal start/stop.
  Yes, it would be nice if ext3 did it the way ext4 does but as I already
said earlier, that's a major change to the way locking is done. ext3 has
lock ordering PageLock -> transaction start, while ext4 has lock ordering
transaction start -> PageLock. This influences a lot of places so it's too
big change to do to ext3 which is in maintenance only mode.

> Maybe we could call block_write_full_page directly?
  You have to have the page locked to call that function. Otherwise someone
could invalidate the page under your hands (or do other sort of things
function does not expect).

> > > If the page has to be locked before the transaction starts, how much of a
> > > problem is it to set PG_writeback?  Even though that seems a bit early to be
> > > doing that?
> >   Well, what you would need to make things consistent is to set
> > PG_writeback from kjournald so that all writeback happens with PG_writeback
> > set on the page. But setting has to happen while the page is locked and
> > kjournald can never block on page lock because that would cause
> > deadlocks...
> 
> Right now we try to submit_bh() everything on commit->t_sync_datalist without
> the page lock.  I guess we could try to lock (and set writeback on) every page
> on that list, and anything that we can lock then goes on a new
> t_locked_for_sync list.  If we find a page that's already locked, simply back
> out to the main kjournald loop and try the whole mess again.  Once all the
> pages are on the locked_for_sync list, then we can unlock the pages and
> actually commit the transaction.
  That's just more complicated way of implementing page lock ;) It won't
remove deadlocks.

> But, uck, that feels like courting disaster.  Do people sleep with a page
> locked?
  Yes, they do - for example in ext3_write_begin(), ext3_journal_start()
can sleep waiting for kjournald to commit a transaction :). That's the
deadlock I'm talking about...

>  I'm unsure of the wisdom of making jbd do that.  Can we stall out
> forever if someone keeps shoving pages onto t_sync_data?
  That's not an issue. Once transaction commit starts, no new buffers can
be added to it.

  Hum, I was thinking about how to implement proper exclusion in ext3 using
buffer locks and it won't be easy. Buffered writes are simple enough but
mmap is tricky - we'd have to writeprotect the page at the moment we add
buffer to t_sync_datalist (at that moment we hold the page lock so that is
doable). That page_mkwrite will have to wait for transaction commit (or
write the buffers) if underlying buffers are in t_sync_datalist. That will
make mixing of buffered IO and mmap writes to the same page very slow but
maybe that would be acceptable. I'll think about it some more...

								Honza

Patch

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 5366393..5b3725d 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1325,6 +1325,14 @@  static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
 			"forcing read-only mode");
 		res = MS_RDONLY;
 	}
+	if (!read_only &&
+	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
+		ext3_msg(sb, KERN_ERR,
+			"error: ext3 cannot safely write data to a disk "
+			"requiring stable pages writes; forcing read-only "
+			"mode.  Upgrading to ext4 is recommended.");
+		res = MS_RDONLY;
+	}
 	if (read_only)
 		return res;
 	if (!(sbi->s_mount_state & EXT3_VALID_FS))