Message ID | 20121121020056.10225.15220.stgit@blackbox.djwong.org |
---|---|
State | Superseded, archived |
Headers | show |
On Tue 20-11-12 18:00:56, Darrick J. Wong wrote: > ext3 doesn't properly isolate pages from changes during writeback. Since the > recommended fix is to use ext4, for now we'll just print a warning if the user > tries to mount in write mode. > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > --- > fs/ext3/super.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c > index 5366393..5b3725d 100644 > --- a/fs/ext3/super.c > +++ b/fs/ext3/super.c > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es, > "forcing read-only mode"); > res = MS_RDONLY; > } > + if (!read_only && > + queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) { > + ext3_msg(sb, KERN_ERR, > + "error: ext3 cannot safely write data to a disk " > + "requiring stable pages writes; forcing read-only " > + "mode. Upgrading to ext4 is recommended."); > + res = MS_RDONLY; > + } > if (read_only) > return res; > if (!(sbi->s_mount_state & EXT3_VALID_FS)) Why this? ext3 should be fixed by your change to filemap_page_mkwrite()... Or does testing show otherwise? Honza
On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote: > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote: > > ext3 doesn't properly isolate pages from changes during writeback. Since the > > recommended fix is to use ext4, for now we'll just print a warning if the user > > tries to mount in write mode. > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > --- > > fs/ext3/super.c | 8 ++++++++ > > 1 file changed, 8 insertions(+) > > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c > > index 5366393..5b3725d 100644 > > --- a/fs/ext3/super.c > > +++ b/fs/ext3/super.c > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es, > > "forcing read-only mode"); > > res = MS_RDONLY; > > } > > + if (!read_only && > > + queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) { > > + ext3_msg(sb, KERN_ERR, > > + "error: ext3 cannot safely write data to a disk " > > + "requiring stable pages writes; forcing read-only " > > + "mode. Upgrading to ext4 is recommended."); > > + res = MS_RDONLY; > > + } > > if (read_only) > > return res; > > if (!(sbi->s_mount_state & EXT3_VALID_FS)) > Why this? ext3 should be fixed by your change to > filemap_page_mkwrite()... Or does testing show otherwise? Yes, it's still broken even with this new set of changes. Now that I think about it a little more, I recall that writeback mode was actually fine, so this is a little harsh. Hm... looking at the ordered code a little more, it looks like ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries to write mapped buffers back through the journal? Taking it out seems to fix ordered mode, though I have a suspicion that it might very well break ordered mode too. --D > > Honza > -- > Jan Kara <jack@suse.cz> > SUSE Labs, CR > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed 21-11-12 13:13:19, Darrick J. Wong wrote: > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote: > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote: > > > ext3 doesn't properly isolate pages from changes during writeback. Since the > > > recommended fix is to use ext4, for now we'll just print a warning if the user > > > tries to mount in write mode. > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > --- > > > fs/ext3/super.c | 8 ++++++++ > > > 1 file changed, 8 insertions(+) > > > > > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c > > > index 5366393..5b3725d 100644 > > > --- a/fs/ext3/super.c > > > +++ b/fs/ext3/super.c > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es, > > > "forcing read-only mode"); > > > res = MS_RDONLY; > > > } > > > + if (!read_only && > > > + queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) { > > > + ext3_msg(sb, KERN_ERR, > > > + "error: ext3 cannot safely write data to a disk " > > > + "requiring stable pages writes; forcing read-only " > > > + "mode. Upgrading to ext4 is recommended."); > > > + res = MS_RDONLY; > > > + } > > > if (read_only) > > > return res; > > > if (!(sbi->s_mount_state & EXT3_VALID_FS)) > > Why this? ext3 should be fixed by your change to > > filemap_page_mkwrite()... Or does testing show otherwise? > > Yes, it's still broken even with this new set of changes. Now that I think > about it a little more, I recall that writeback mode was actually fine, so this > is a little harsh. > > Hm... looking at the ordered code a little more, it looks like > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries > to write mapped buffers back through the journal? Taking it out seems to fix > ordered mode, though I have a suspicion that it might very well break ordered > mode too. Oh, right. kjournald writing buffers directly (without setting PageWriteback) will break things. So please, change warning to: /* * In data=ordered mode, kjournald writes buffers without setting * PageWriteback bit thus generic code does not properly wait for * writeback of those buffers to finish. */ if (!read_only && test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA && queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) { ext3_msg(sb, KERN_ERR, "error: data=ordered mode does not support stable " "page writes required by the disk; forcing read-only " "mode. Upgrading to ext4 is recommended."); res = MS_RDONLY; } then you need a similar check in ext3_remount() so that filesystem cannot be remounted read-write. Honza
On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote: > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote: > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote: > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote: > > > > ext3 doesn't properly isolate pages from changes during writeback. Since the > > > > recommended fix is to use ext4, for now we'll just print a warning if the user > > > > tries to mount in write mode. > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > --- > > > > fs/ext3/super.c | 8 ++++++++ > > > > 1 file changed, 8 insertions(+) > > > > > > > > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c > > > > index 5366393..5b3725d 100644 > > > > --- a/fs/ext3/super.c > > > > +++ b/fs/ext3/super.c > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es, > > > > "forcing read-only mode"); > > > > res = MS_RDONLY; > > > > } > > > > + if (!read_only && > > > > + queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) { > > > > + ext3_msg(sb, KERN_ERR, > > > > + "error: ext3 cannot safely write data to a disk " > > > > + "requiring stable pages writes; forcing read-only " > > > > + "mode. Upgrading to ext4 is recommended."); > > > > + res = MS_RDONLY; > > > > + } > > > > if (read_only) > > > > return res; > > > > if (!(sbi->s_mount_state & EXT3_VALID_FS)) > > > Why this? ext3 should be fixed by your change to > > > filemap_page_mkwrite()... Or does testing show otherwise? > > > > Yes, it's still broken even with this new set of changes. Now that I think > > about it a little more, I recall that writeback mode was actually fine, so this > > is a little harsh. > > > > Hm... looking at the ordered code a little more, it looks like > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries > > to write mapped buffers back through the journal? Taking it out seems to fix > > ordered mode, though I have a suspicion that it might very well break ordered > > mode too. > Oh, right. kjournald writing buffers directly (without setting > PageWriteback) will break things. So please, change warning to: > > /* > * In data=ordered mode, kjournald writes buffers without setting > * PageWriteback bit thus generic code does not properly wait for > * writeback of those buffers to finish. > */ > if (!read_only && > test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA && > queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) { > ext3_msg(sb, KERN_ERR, > "error: data=ordered mode does not support stable " > "page writes required by the disk; forcing read-only " > "mode. Upgrading to ext4 is recommended."); > res = MS_RDONLY; > } > > then you need a similar check in ext3_remount() so that filesystem cannot > be remounted read-write. > > Honza Given this restriction, there is no way that I can change md/raid5 to set the "stable pages" flag and stop copying pages into the stripe-cache. ext3 on raid5 will be much too common to allow this breakage. I would really like to be able to say "I prefer stable pages, but they aren't a requirement as long as I know which is which" .... NeilBrown
On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote: > On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote: > > > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote: > > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote: > > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote: > > > > > ext3 doesn't properly isolate pages from changes during writeback. Since the > > > > > recommended fix is to use ext4, for now we'll just print a warning if the user > > > > > tries to mount in write mode. > > > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > > --- > > > > > fs/ext3/super.c | 8 ++++++++ > > > > > 1 file changed, 8 insertions(+) > > > > > > > > > > > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c > > > > > index 5366393..5b3725d 100644 > > > > > --- a/fs/ext3/super.c > > > > > +++ b/fs/ext3/super.c > > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es, > > > > > "forcing read-only mode"); > > > > > res = MS_RDONLY; > > > > > } > > > > > + if (!read_only && > > > > > + queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) { > > > > > + ext3_msg(sb, KERN_ERR, > > > > > + "error: ext3 cannot safely write data to a disk " > > > > > + "requiring stable pages writes; forcing read-only " > > > > > + "mode. Upgrading to ext4 is recommended."); > > > > > + res = MS_RDONLY; > > > > > + } > > > > > if (read_only) > > > > > return res; > > > > > if (!(sbi->s_mount_state & EXT3_VALID_FS)) > > > > Why this? ext3 should be fixed by your change to > > > > filemap_page_mkwrite()... Or does testing show otherwise? > > > > > > Yes, it's still broken even with this new set of changes. Now that I think > > > about it a little more, I recall that writeback mode was actually fine, so this > > > is a little harsh. > > > > > > Hm... looking at the ordered code a little more, it looks like > > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries > > > to write mapped buffers back through the journal? Taking it out seems to fix > > > ordered mode, though I have a suspicion that it might very well break ordered > > > mode too. > > Oh, right. kjournald writing buffers directly (without setting > > PageWriteback) will break things. So please, change warning to: Maybe we should just fix this anyway? I still have the patch that adds PG_stable (and changes the wait_for_page_stable() test to use this flag instead of PG_writeback) kicking around in my tree. I wrote a patch to jbd that changes journal_do_submit_data to set PG_stable, call clear_page_dirty_for_io(), and unsets the stable bit in the end_io processing. It seems to get rid of the checksum-after-write errors, though I'm not convinced it's correct. But, I'll send both patches along. > > > > /* > > * In data=ordered mode, kjournald writes buffers without setting > > * PageWriteback bit thus generic code does not properly wait for > > * writeback of those buffers to finish. > > */ > > if (!read_only && > > test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA && test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA since I bet data=journal mode is also borken wrt PageWriteback. > > queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) { > > ext3_msg(sb, KERN_ERR, > > "error: data=ordered mode does not support stable " > > "page writes required by the disk; forcing read-only " > > "mode. Upgrading to ext4 is recommended."); > > res = MS_RDONLY; > > } > > > > then you need a similar check in ext3_remount() so that filesystem cannot > > be remounted read-write. > > > > Honza > > Given this restriction, there is no way that I can change md/raid5 to set the > "stable pages" flag and stop copying pages into the stripe-cache. ext3 on > raid5 will be much too common to allow this breakage. > > I would really like to be able to say "I prefer stable pages, but they aren't > a requirement as long as I know which is which" .... I'd rather just fix ext3. :) (or remove it, since ext4 can handle ext3) --D > NeilBrown -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed 21-11-12 17:47:55, Darrick J. Wong wrote: > On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote: > > On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote: > > > > > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote: > > > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote: > > > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote: > > > > > > ext3 doesn't properly isolate pages from changes during writeback. Since the > > > > > > recommended fix is to use ext4, for now we'll just print a warning if the user > > > > > > tries to mount in write mode. > > > > > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > --- > > > > > > fs/ext3/super.c | 8 ++++++++ > > > > > > 1 file changed, 8 insertions(+) > > > > > > > > > > > > > > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c > > > > > > index 5366393..5b3725d 100644 > > > > > > --- a/fs/ext3/super.c > > > > > > +++ b/fs/ext3/super.c > > > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es, > > > > > > "forcing read-only mode"); > > > > > > res = MS_RDONLY; > > > > > > } > > > > > > + if (!read_only && > > > > > > + queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) { > > > > > > + ext3_msg(sb, KERN_ERR, > > > > > > + "error: ext3 cannot safely write data to a disk " > > > > > > + "requiring stable pages writes; forcing read-only " > > > > > > + "mode. Upgrading to ext4 is recommended."); > > > > > > + res = MS_RDONLY; > > > > > > + } > > > > > > if (read_only) > > > > > > return res; > > > > > > if (!(sbi->s_mount_state & EXT3_VALID_FS)) > > > > > Why this? ext3 should be fixed by your change to > > > > > filemap_page_mkwrite()... Or does testing show otherwise? > > > > > > > > Yes, it's still broken even with this new set of changes. Now that I think > > > > about it a little more, I recall that writeback mode was actually fine, so this > > > > is a little harsh. > > > > > > > > Hm... looking at the ordered code a little more, it looks like > > > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries > > > > to write mapped buffers back through the journal? Taking it out seems to fix > > > > ordered mode, though I have a suspicion that it might very well break ordered > > > > mode too. > > > Oh, right. kjournald writing buffers directly (without setting > > > PageWriteback) will break things. So please, change warning to: > > Maybe we should just fix this anyway? > > I still have the patch that adds PG_stable (and changes the > wait_for_page_stable() test to use this flag instead of PG_writeback) kicking > around in my tree. I wrote a patch to jbd that changes journal_do_submit_data > to set PG_stable, call clear_page_dirty_for_io(), and unsets the stable bit in > the end_io processing. > > It seems to get rid of the checksum-after-write errors, though I'm not > convinced it's correct. But, I'll send both patches along. I'll check the patches. Fixing PageWriteback logic for ext3 is not easily doable due to lock ranking constraints - PageWriteback has to be set under PageLocked but that ranks above transaction start so kjournald cannot grab page locks so it cannot set PageWriteback... And changing the lock ordering is a major surgery. What could be doable is waiting for buffer locks from ext3's ->write_begin and ->page_mkwrite implementations in case stable writes are required. If your approach with a separate page bit doesn't work out (and I have some doubts about that as mm people are *really* thrifty with page bits). > > > /* > > > * In data=ordered mode, kjournald writes buffers without setting > > > * PageWriteback bit thus generic code does not properly wait for > > > * writeback of those buffers to finish. > > > */ > > > if (!read_only && > > > test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA && > > test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA > > since I bet data=journal mode is also borken wrt PageWriteback. It is broken wrt PageWriteback but it actually waits for buffer locks in ->write_begin() so at least write path should be properly protected. But mmap is not handled properly there (although that wouldn't be that hard to fix). So I agree the condition should rather be what you suggest. Honza
On Wed, Nov 21, 2012 at 05:47:55PM -0800, Darrick J. Wong wrote: > On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote: > > On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote: > > > > > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote: > > > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote: > > > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote: > > > > > > ext3 doesn't properly isolate pages from changes during writeback. Since the > > > > > > recommended fix is to use ext4, for now we'll just print a warning if the user > > > > > > tries to mount in write mode. > > > > > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > --- > > > > > > fs/ext3/super.c | 8 ++++++++ > > > > > > 1 file changed, 8 insertions(+) > > > > > > > > > > > > > > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c > > > > > > index 5366393..5b3725d 100644 > > > > > > --- a/fs/ext3/super.c > > > > > > +++ b/fs/ext3/super.c > > > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es, > > > > > > "forcing read-only mode"); > > > > > > res = MS_RDONLY; > > > > > > } > > > > > > + if (!read_only && > > > > > > + queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) { > > > > > > + ext3_msg(sb, KERN_ERR, > > > > > > + "error: ext3 cannot safely write data to a disk " > > > > > > + "requiring stable pages writes; forcing read-only " > > > > > > + "mode. Upgrading to ext4 is recommended."); > > > > > > + res = MS_RDONLY; > > > > > > + } > > > > > > if (read_only) > > > > > > return res; > > > > > > if (!(sbi->s_mount_state & EXT3_VALID_FS)) > > > > > Why this? ext3 should be fixed by your change to > > > > > filemap_page_mkwrite()... Or does testing show otherwise? > > > > > > > > Yes, it's still broken even with this new set of changes. Now that I think > > > > about it a little more, I recall that writeback mode was actually fine, so this > > > > is a little harsh. > > > > > > > > Hm... looking at the ordered code a little more, it looks like > > > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries > > > > to write mapped buffers back through the journal? Taking it out seems to fix > > > > ordered mode, though I have a suspicion that it might very well break ordered > > > > mode too. > > > Oh, right. kjournald writing buffers directly (without setting > > > PageWriteback) will break things. So please, change warning to: > > Maybe we should just fix this anyway? > > I still have the patch that adds PG_stable (and changes the Page flags are in short supply. If you need a page flag to fix a bug in a single filesytem, there's a couple of reserved private page flags that can be used. i.e PG_owner_priv_1. I don't think using a glbal page flag for just this purpose will fly, though... Cheers, Dave.
On Thu, Nov 22, 2012 at 10:12:40AM +0100, Jan Kara wrote: > On Wed 21-11-12 17:47:55, Darrick J. Wong wrote: > > On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote: > > > On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote: > > > > > > > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote: > > > > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote: > > > > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote: > > > > > > > ext3 doesn't properly isolate pages from changes during writeback. Since the > > > > > > > recommended fix is to use ext4, for now we'll just print a warning if the user > > > > > > > tries to mount in write mode. > > > > > > > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > > --- > > > > > > > fs/ext3/super.c | 8 ++++++++ > > > > > > > 1 file changed, 8 insertions(+) > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c > > > > > > > index 5366393..5b3725d 100644 > > > > > > > --- a/fs/ext3/super.c > > > > > > > +++ b/fs/ext3/super.c > > > > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es, > > > > > > > "forcing read-only mode"); > > > > > > > res = MS_RDONLY; > > > > > > > } > > > > > > > + if (!read_only && > > > > > > > + queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) { > > > > > > > + ext3_msg(sb, KERN_ERR, > > > > > > > + "error: ext3 cannot safely write data to a disk " > > > > > > > + "requiring stable pages writes; forcing read-only " > > > > > > > + "mode. Upgrading to ext4 is recommended."); > > > > > > > + res = MS_RDONLY; > > > > > > > + } > > > > > > > if (read_only) > > > > > > > return res; > > > > > > > if (!(sbi->s_mount_state & EXT3_VALID_FS)) > > > > > > Why this? ext3 should be fixed by your change to > > > > > > filemap_page_mkwrite()... Or does testing show otherwise? > > > > > > > > > > Yes, it's still broken even with this new set of changes. Now that I think > > > > > about it a little more, I recall that writeback mode was actually fine, so this > > > > > is a little harsh. > > > > > > > > > > Hm... looking at the ordered code a little more, it looks like > > > > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries > > > > > to write mapped buffers back through the journal? Taking it out seems to fix > > > > > ordered mode, though I have a suspicion that it might very well break ordered > > > > > mode too. > > > > Oh, right. kjournald writing buffers directly (without setting > > > > PageWriteback) will break things. So please, change warning to: > > > > Maybe we should just fix this anyway? > > > > I still have the patch that adds PG_stable (and changes the > > wait_for_page_stable() test to use this flag instead of PG_writeback) kicking > > around in my tree. I wrote a patch to jbd that changes journal_do_submit_data > > to set PG_stable, call clear_page_dirty_for_io(), and unsets the stable bit in > > the end_io processing. > > > > It seems to get rid of the checksum-after-write errors, though I'm not > > convinced it's correct. But, I'll send both patches along. > I'll check the patches. Fixing PageWriteback logic for ext3 is not easily > doable due to lock ranking constraints - PageWriteback has to be set under > PageLocked but that ranks above transaction start so kjournald cannot grab > page locks so it cannot set PageWriteback... And changing the lock ordering > is a major surgery. > > What could be doable is waiting for buffer locks from ext3's ->write_begin > and ->page_mkwrite implementations in case stable writes are required. If > your approach with a separate page bit doesn't work out (and I have some > doubts about that as mm people are *really* thrifty with page bits). > > > > > /* > > > > * In data=ordered mode, kjournald writes buffers without setting > > > > * PageWriteback bit thus generic code does not properly wait for > > > > * writeback of those buffers to finish. > > > > */ > > > > if (!read_only && > > > > test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA && > > > > test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA > > > > since I bet data=journal mode is also borken wrt PageWriteback. > It is broken wrt PageWriteback but it actually waits for buffer locks in > ->write_begin() so at least write path should be properly protected. But > mmap is not handled properly there (although that wouldn't be that hard to > fix). So I agree the condition should rather be what you suggest. Hm. In journal mode, write_begin calls do_journal_get_write_access on each buffer for a given page, and in turn, jbd's do_get_write_access calls lock_buffer. Is that what you're referring to by "actually waits for buffer locks"? I'm wondering how that helps us, since afaict PG_writeback doesn't get set in that path, and I think it's a little early to be setting PG_writeback anyway. If the page has to be locked before the transaction starts, how much of a problem is it to set PG_writeback? Even though that seems a bit early to be doing that? Just for fun, I tried porting ext4_page_mkwrite into ext3 (removing all the parts that don't exist in ext3) so that do_journal_get_write_access would also get called here, but it didn't seem to fix journal mode. --D > > Honza > -- > Jan Kara <jack@suse.cz> > SUSE Labs, CR > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon 26-11-12 18:17:40, Darrick J. Wong wrote: > On Thu, Nov 22, 2012 at 10:12:40AM +0100, Jan Kara wrote: > > On Wed 21-11-12 17:47:55, Darrick J. Wong wrote: > > > On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote: > > > > On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote: > > > > > > > > > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote: > > > > > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote: > > > > > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote: > > > > > > > > ext3 doesn't properly isolate pages from changes during writeback. Since the > > > > > > > > recommended fix is to use ext4, for now we'll just print a warning if the user > > > > > > > > tries to mount in write mode. > > > > > > > > > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > > > --- > > > > > > > > fs/ext3/super.c | 8 ++++++++ > > > > > > > > 1 file changed, 8 insertions(+) > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c > > > > > > > > index 5366393..5b3725d 100644 > > > > > > > > --- a/fs/ext3/super.c > > > > > > > > +++ b/fs/ext3/super.c > > > > > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es, > > > > > > > > "forcing read-only mode"); > > > > > > > > res = MS_RDONLY; > > > > > > > > } > > > > > > > > + if (!read_only && > > > > > > > > + queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) { > > > > > > > > + ext3_msg(sb, KERN_ERR, > > > > > > > > + "error: ext3 cannot safely write data to a disk " > > > > > > > > + "requiring stable pages writes; forcing read-only " > > > > > > > > + "mode. Upgrading to ext4 is recommended."); > > > > > > > > + res = MS_RDONLY; > > > > > > > > + } > > > > > > > > if (read_only) > > > > > > > > return res; > > > > > > > > if (!(sbi->s_mount_state & EXT3_VALID_FS)) > > > > > > > Why this? ext3 should be fixed by your change to > > > > > > > filemap_page_mkwrite()... Or does testing show otherwise? > > > > > > > > > > > > Yes, it's still broken even with this new set of changes. Now that I think > > > > > > about it a little more, I recall that writeback mode was actually fine, so this > > > > > > is a little harsh. > > > > > > > > > > > > Hm... looking at the ordered code a little more, it looks like > > > > > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries > > > > > > to write mapped buffers back through the journal? Taking it out seems to fix > > > > > > ordered mode, though I have a suspicion that it might very well break ordered > > > > > > mode too. > > > > > Oh, right. kjournald writing buffers directly (without setting > > > > > PageWriteback) will break things. So please, change warning to: > > > > > > Maybe we should just fix this anyway? > > > > > > I still have the patch that adds PG_stable (and changes the > > > wait_for_page_stable() test to use this flag instead of PG_writeback) kicking > > > around in my tree. I wrote a patch to jbd that changes journal_do_submit_data > > > to set PG_stable, call clear_page_dirty_for_io(), and unsets the stable bit in > > > the end_io processing. > > > > > > It seems to get rid of the checksum-after-write errors, though I'm not > > > convinced it's correct. But, I'll send both patches along. > > I'll check the patches. Fixing PageWriteback logic for ext3 is not easily > > doable due to lock ranking constraints - PageWriteback has to be set under > > PageLocked but that ranks above transaction start so kjournald cannot grab > > page locks so it cannot set PageWriteback... And changing the lock ordering > > is a major surgery. > > > > What could be doable is waiting for buffer locks from ext3's ->write_begin > > and ->page_mkwrite implementations in case stable writes are required. If > > your approach with a separate page bit doesn't work out (and I have some > > doubts about that as mm people are *really* thrifty with page bits). > > > > > > > /* > > > > > * In data=ordered mode, kjournald writes buffers without setting > > > > > * PageWriteback bit thus generic code does not properly wait for > > > > > * writeback of those buffers to finish. > > > > > */ > > > > > if (!read_only && > > > > > test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA && > > > > > > test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA > > > > > > since I bet data=journal mode is also borken wrt PageWriteback. > > It is broken wrt PageWriteback but it actually waits for buffer locks in > > ->write_begin() so at least write path should be properly protected. But > > mmap is not handled properly there (although that wouldn't be that hard to > > fix). So I agree the condition should rather be what you suggest. Sorry for late reply. I was on vacation... > Hm. In journal mode, write_begin calls do_journal_get_write_access on each > buffer for a given page, and in turn, jbd's do_get_write_access calls > lock_buffer. Is that what you're referring to by "actually waits for buffer > locks"? I'm wondering how that helps us, since afaict PG_writeback doesn't get > set in that path, and I think it's a little early to be setting PG_writeback > anyway. It does help us. In ext3 data writeback is done either by flusher thread, that happens under PG_Writeback and generic code waits for that as need, or by kjournald - that happens under buffer lock and as you properly observed do_get_write_access() waits for that (and actually copies data that should go to disk to a separate buffer if needed). > If the page has to be locked before the transaction starts, how much of a > problem is it to set PG_writeback? Even though that seems a bit early to be > doing that? Well, what you would need to make things consistent is to set PG_writeback from kjournald so that all writeback happens with PG_writeback set on the page. But setting has to happen while the page is locked and kjournald can never block on page lock because that would cause deadlocks... Honza
On Wed, Dec 05, 2012 at 01:12:28PM +0100, Jan Kara wrote: > On Mon 26-11-12 18:17:40, Darrick J. Wong wrote: > > On Thu, Nov 22, 2012 at 10:12:40AM +0100, Jan Kara wrote: > > > On Wed 21-11-12 17:47:55, Darrick J. Wong wrote: > > > > On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote: > > > > > On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote: > > > > > > > > > > > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote: > > > > > > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote: > > > > > > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote: > > > > > > > > > ext3 doesn't properly isolate pages from changes during writeback. Since the > > > > > > > > > recommended fix is to use ext4, for now we'll just print a warning if the user > > > > > > > > > tries to mount in write mode. > > > > > > > > > > > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > > > > --- > > > > > > > > > fs/ext3/super.c | 8 ++++++++ > > > > > > > > > 1 file changed, 8 insertions(+) > > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c > > > > > > > > > index 5366393..5b3725d 100644 > > > > > > > > > --- a/fs/ext3/super.c > > > > > > > > > +++ b/fs/ext3/super.c > > > > > > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es, > > > > > > > > > "forcing read-only mode"); > > > > > > > > > res = MS_RDONLY; > > > > > > > > > } > > > > > > > > > + if (!read_only && > > > > > > > > > + queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) { > > > > > > > > > + ext3_msg(sb, KERN_ERR, > > > > > > > > > + "error: ext3 cannot safely write data to a disk " > > > > > > > > > + "requiring stable pages writes; forcing read-only " > > > > > > > > > + "mode. Upgrading to ext4 is recommended."); > > > > > > > > > + res = MS_RDONLY; > > > > > > > > > + } > > > > > > > > > if (read_only) > > > > > > > > > return res; > > > > > > > > > if (!(sbi->s_mount_state & EXT3_VALID_FS)) > > > > > > > > Why this? ext3 should be fixed by your change to > > > > > > > > filemap_page_mkwrite()... Or does testing show otherwise? > > > > > > > > > > > > > > Yes, it's still broken even with this new set of changes. Now that I think > > > > > > > about it a little more, I recall that writeback mode was actually fine, so this > > > > > > > is a little harsh. > > > > > > > > > > > > > > Hm... looking at the ordered code a little more, it looks like > > > > > > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries > > > > > > > to write mapped buffers back through the journal? Taking it out seems to fix > > > > > > > ordered mode, though I have a suspicion that it might very well break ordered > > > > > > > mode too. > > > > > > Oh, right. kjournald writing buffers directly (without setting > > > > > > PageWriteback) will break things. So please, change warning to: > > > > > > > > Maybe we should just fix this anyway? > > > > > > > > I still have the patch that adds PG_stable (and changes the > > > > wait_for_page_stable() test to use this flag instead of PG_writeback) kicking > > > > around in my tree. I wrote a patch to jbd that changes journal_do_submit_data > > > > to set PG_stable, call clear_page_dirty_for_io(), and unsets the stable bit in > > > > the end_io processing. > > > > > > > > It seems to get rid of the checksum-after-write errors, though I'm not > > > > convinced it's correct. But, I'll send both patches along. > > > I'll check the patches. Fixing PageWriteback logic for ext3 is not easily > > > doable due to lock ranking constraints - PageWriteback has to be set under > > > PageLocked but that ranks above transaction start so kjournald cannot grab > > > page locks so it cannot set PageWriteback... And changing the lock ordering > > > is a major surgery. > > > > > > What could be doable is waiting for buffer locks from ext3's ->write_begin > > > and ->page_mkwrite implementations in case stable writes are required. If > > > your approach with a separate page bit doesn't work out (and I have some > > > doubts about that as mm people are *really* thrifty with page bits). > > > > > > > > > /* > > > > > > * In data=ordered mode, kjournald writes buffers without setting > > > > > > * PageWriteback bit thus generic code does not properly wait for > > > > > > * writeback of those buffers to finish. > > > > > > */ > > > > > > if (!read_only && > > > > > > test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA && > > > > > > > > test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA > > > > > > > > since I bet data=journal mode is also borken wrt PageWriteback. > > > It is broken wrt PageWriteback but it actually waits for buffer locks in > > > ->write_begin() so at least write path should be properly protected. But > > > mmap is not handled properly there (although that wouldn't be that hard to > > > fix). So I agree the condition should rather be what you suggest. > Sorry for late reply. I was on vacation... No worries; I have plenty of other patchsets to work on. :) > > Hm. In journal mode, write_begin calls do_journal_get_write_access on each > > buffer for a given page, and in turn, jbd's do_get_write_access calls > > lock_buffer. Is that what you're referring to by "actually waits for buffer > > locks"? I'm wondering how that helps us, since afaict PG_writeback doesn't get > > set in that path, and I think it's a little early to be setting PG_writeback > > anyway. > It does help us. In ext3 data writeback is done either by flusher thread, > that happens under PG_Writeback and generic code waits for that as need, or > by kjournald - that happens under buffer lock and as you properly observed > do_get_write_access() waits for that (and actually copies data that should > go to disk to a separate buffer if needed). Hm. jbd2 calls generic_writepages to flush those buffers out, which sets PG_writeback like you'd expect. It'd be nice if you could do that like ext4 does... but then the ext3 writepage functions can call journal start/stop. Maybe we could call block_write_full_page directly? > > If the page has to be locked before the transaction starts, how much of a > > problem is it to set PG_writeback? Even though that seems a bit early to be > > doing that? > Well, what you would need to make things consistent is to set > PG_writeback from kjournald so that all writeback happens with PG_writeback > set on the page. But setting has to happen while the page is locked and > kjournald can never block on page lock because that would cause > deadlocks... Right now we try to submit_bh() everything on commit->t_sync_datalist without the page lock. I guess we could try to lock (and set writeback on) every page on that list, and anything that we can lock then goes on a new t_locked_for_sync list. If we find a page that's already locked, simply back out to the main kjournald loop and try the whole mess again. Once all the pages are on the locked_for_sync list, then we can unlock the pages and actually commit the transaction. But, uck, that feels like courting disaster. Do people sleep with a page locked? I'm unsure of the wisdom of making jbd do that. Can we stall out forever if someone keeps shoving pages onto t_sync_data? I actually tried a dumb trylock_page loop in journal_do_submit_data, but it quickly stalled. Ok, enough rambling, I'll give a few of my harebrained suggestions a try and see what results. --D > > Honza > -- > Jan Kara <jack@suse.cz> > SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri 07-12-12 17:09:58, Darrick J. Wong wrote: > On Wed, Dec 05, 2012 at 01:12:28PM +0100, Jan Kara wrote: > > On Mon 26-11-12 18:17:40, Darrick J. Wong wrote: > > > On Thu, Nov 22, 2012 at 10:12:40AM +0100, Jan Kara wrote: > > > > On Wed 21-11-12 17:47:55, Darrick J. Wong wrote: > > > > I'll check the patches. Fixing PageWriteback logic for ext3 is not easily > > > > doable due to lock ranking constraints - PageWriteback has to be set under > > > > PageLocked but that ranks above transaction start so kjournald cannot grab > > > > page locks so it cannot set PageWriteback... And changing the lock ordering > > > > is a major surgery. > > > > > > > > What could be doable is waiting for buffer locks from ext3's ->write_begin > > > > and ->page_mkwrite implementations in case stable writes are required. If > > > > your approach with a separate page bit doesn't work out (and I have some > > > > doubts about that as mm people are *really* thrifty with page bits). > > > > > > > > > > > /* > > > > > > > * In data=ordered mode, kjournald writes buffers without setting > > > > > > > * PageWriteback bit thus generic code does not properly wait for > > > > > > > * writeback of those buffers to finish. > > > > > > > */ > > > > > > > if (!read_only && > > > > > > > test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA && > > > > > > > > > > test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA > > > > > > > > > > since I bet data=journal mode is also borken wrt PageWriteback. > > > > It is broken wrt PageWriteback but it actually waits for buffer locks in > > > > ->write_begin() so at least write path should be properly protected. But > > > > mmap is not handled properly there (although that wouldn't be that hard to > > > > fix). So I agree the condition should rather be what you suggest. > > Sorry for late reply. I was on vacation... > > No worries; I have plenty of other patchsets to work on. :) > > > > Hm. In journal mode, write_begin calls do_journal_get_write_access on each > > > buffer for a given page, and in turn, jbd's do_get_write_access calls > > > lock_buffer. Is that what you're referring to by "actually waits for buffer > > > locks"? I'm wondering how that helps us, since afaict PG_writeback doesn't get > > > set in that path, and I think it's a little early to be setting PG_writeback > > > anyway. > > It does help us. In ext3 data writeback is done either by flusher thread, > > that happens under PG_Writeback and generic code waits for that as need, or > > by kjournald - that happens under buffer lock and as you properly observed > > do_get_write_access() waits for that (and actually copies data that should > > go to disk to a separate buffer if needed). > > Hm. jbd2 calls generic_writepages to flush those buffers out, which sets > PG_writeback like you'd expect. It'd be nice if you could do that like ext4 > does... but then the ext3 writepage functions can call journal start/stop. Yes, it would be nice if ext3 did it the way ext4 does but as I already said earlier, that's a major change to the way locking is done. ext3 has lock ordering PageLock -> transaction start, while ext4 has lock ordering transaction start -> PageLock. This influences a lot of places so it's too big change to do to ext3 which is in maintenance only mode. > Maybe we could call block_write_full_page directly? You have to have the page locked to call that function. Otherwise someone could invalidate the page under your hands (or do other sort of things function does not expect). > > > If the page has to be locked before the transaction starts, how much of a > > > problem is it to set PG_writeback? Even though that seems a bit early to be > > > doing that? > > Well, what you would need to make things consistent is to set > > PG_writeback from kjournald so that all writeback happens with PG_writeback > > set on the page. But setting has to happen while the page is locked and > > kjournald can never block on page lock because that would cause > > deadlocks... > > Right now we try to submit_bh() everything on commit->t_sync_datalist without > the page lock. I guess we could try to lock (and set writeback on) every page > on that list, and anything that we can lock then goes on a new > t_locked_for_sync list. If we find a page that's already locked, simply back > out to the main kjournald loop and try the whole mess again. Once all the > pages are on the locked_for_sync list, then we can unlock the pages and > actually commit the transaction. That's just more complicated way of implementing page lock ;) It won't remove deadlocks. > But, uck, that feels like courting disaster. Do people sleep with a page > locked? Yes, they do - for example in ext3_write_begin(), ext3_journal_start() can sleep waiting for kjournald to commit a transaction :). That's the deadlock I'm talking about... > I'm unsure of the wisdom of making jbd do that. Can we stall out > forever if someone keeps shoving pages onto t_sync_data? That's not an issue. Once transaction commit starts, no new buffers can be added to it. Hum, I was thinking about how to implement proper exclusion in ext3 using buffer locks and it won't be easy. Buffered writes are simple enough but mmap is tricky - we'd have to writeprotect the page at the moment we add buffer to t_sync_datalist (at that moment we hold the page lock so that is doable). That page_mkwrite will have to wait for transaction commit (or write the buffers) if underlying buffers are in t_sync_datalist. That will make mixing of buffered IO and mmap writes to the same page very slow but maybe that would be acceptable. I'll think about it some more... Honza
diff --git a/fs/ext3/super.c b/fs/ext3/super.c index 5366393..5b3725d 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es, "forcing read-only mode"); res = MS_RDONLY; } + if (!read_only && + queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) { + ext3_msg(sb, KERN_ERR, + "error: ext3 cannot safely write data to a disk " + "requiring stable pages writes; forcing read-only " + "mode. Upgrading to ext4 is recommended."); + res = MS_RDONLY; + } if (read_only) return res; if (!(sbi->s_mount_state & EXT3_VALID_FS))
ext3 doesn't properly isolate pages from changes during writeback. Since the recommended fix is to use ext4, for now we'll just print a warning if the user tries to mount in write mode. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- fs/ext3/super.c | 8 ++++++++ 1 file changed, 8 insertions(+) -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html