ext3: prevent reread after write IO error v2

Message ID 4B4EEE86.7080807@hitachi.com
State Not Applicable, archived

Commit Message

Hidehiro Kawai Jan. 14, 2010, 10:14 a.m. UTC
This patch fixes a bug similar to the one fixed by commit 95450f5a.

If a directory is modified, its data block is journaled as metadata
and finally written back to its proper location.  Now assume a
transient write error happens on that writeback.  The buffer's
uptodate flag is cleared by the write error, so the next access to
the buffer causes a reread from disk.  This breaks the filesystem's
consistency.

To prevent old directory data from being reread, this patch sets the
uptodate flag again after a write error, before issuing the read
operation.  The write error on the directory's data block is still
detected at journal checkpointing time, or is discarded if a rewrite
by another modification succeeds, so nothing is lost.

Similarly, this kind of consistency breakage can be caused by
a transient write error on a bitmap block.

I tested this patch using a fault-injection approach.

By the way, I think the right fix is to keep the uptodate flag on
write error, but that has a big impact.  We would have to confirm
whether each of the 200-plus buffer_uptodate() call sites is a real
uptodate check or a write-error check.  For now, I adopt the
quick-fix solution.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/ext3/balloc.c |   12 ++++++++++++
 fs/ext3/inode.c  |   13 +++++++++++++
 fs/ext3/namei.c  |   15 ++++++++++++++-
 3 files changed, 39 insertions(+), 1 deletions(-)

Comments

Jan Kara Jan. 14, 2010, 2:18 p.m. UTC | #1
On Thu 14-01-10 19:14:30, Hidehiro Kawai wrote:
> This patch fixes a bug similar to the one fixed by commit 95450f5a.
> 
> If a directory is modified, its data block is journaled as metadata
> and finally written back to its proper location.  Now assume a
> transient write error happens on that writeback.  The buffer's
> uptodate flag is cleared by the write error, so the next access to
> the buffer causes a reread from disk.  This breaks the filesystem's
> consistency.
> 
> To prevent old directory data from being reread, this patch sets the
> uptodate flag again after a write error, before issuing the read
> operation.  The write error on the directory's data block is still
> detected at journal checkpointing time, or is discarded if a rewrite
> by another modification succeeds, so nothing is lost.
> 
> Similarly, this kind of consistency breakage can be caused by
> a transient write error on a bitmap block.
  Good catch, that's indeed a problem.

> I tested this patch using a fault-injection approach.
> 
> By the way, I think the right fix is to keep the uptodate flag on
> write error, but that has a big impact.  We would have to confirm
> whether each of the 200-plus buffer_uptodate() call sites is a real
> uptodate check or a write-error check.  For now, I adopt the
> quick-fix solution.
  Yes, that needs to be solved. I also looked into it and it's too much work
to do it in one big sweep. But I think we could do the conversion
filesystem by filesystem - see below.
  Admittedly, I don't like your solution very much. It looks strange to
check write_io_error when *reading* the buffer, and I'm afraid we could
introduce bugs, e.g. by clearing the write_io_error bit so that ext3_bread
would then fail to detect the error condition, or by some other code
deciding to read the buffer from disk via a function other than ext3_bread.
So I think it would be much better to set the uptodate flag back shortly
after the failed write, or not clear it at all.
  Now here's how I think we could achieve that without having to change all
other filesystems: we could create a new superblock flag which would mean
"this filesystem handles write_io_error and does not want the uptodate flag
cleared". Filesystems with this capability would set the flag when
calling get_sb_bdev. And if a write IO fails, we check this flag
(via bh->b_bdev->bd_super->s_flags) and clear or keep the uptodate flag
accordingly. What do you think?
  I know it's more work than your quick fix, but it should fix all these
problems for ext3 once and for all, and it would be much cleaner...
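  Roughly, the buffer write completion path could then do something like
this (MS_KEEPS_UPTODATE is just an illustrative name, and the bd_super
check matters because not every buffer belongs to a mounted filesystem):

	/* sketch only - e.g. in end_buffer_write_sync() / end_buffer_async_write() */
	if (!uptodate) {
		struct super_block *sb = bh->b_bdev->bd_super;

		set_buffer_write_io_error(bh);
		if (!(sb && (sb->s_flags & MS_KEEPS_UPTODATE)))
			clear_buffer_uptodate(bh);
	}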

									Honza
> 
> Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
> ---
>  fs/ext3/balloc.c |   12 ++++++++++++
>  fs/ext3/inode.c  |   13 +++++++++++++
>  fs/ext3/namei.c  |   15 ++++++++++++++-
>  3 files changed, 39 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
> index 27967f9..5dc5ccf 100644
> --- a/fs/ext3/balloc.c
> +++ b/fs/ext3/balloc.c
> @@ -156,6 +156,18 @@ read_block_bitmap(struct super_block *sb, unsigned int block_group)
>  	if (likely(bh_uptodate_or_lock(bh)))
>  		return bh;
>  
> +	/*
> +	 * uptodate flag may have been cleared by a previous (transient)
> +	 * write IO error.  In this case, we don't want to reread its
> +	 * old on-disk data.  Actually the buffer has the latest data,
> +	 * so set uptodate flag again.
> +	 */
> +	if (buffer_write_io_error(bh)) {
> +		set_buffer_uptodate(bh);
> +		unlock_buffer(bh);
> +		return bh;
> +	}
> +
>  	if (bh_submit_read(bh) < 0) {
>  		brelse(bh);
>  		ext3_error(sb, __func__,
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 455e6e6..67d7849 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -1077,10 +1077,23 @@ struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
>  		return bh;
>  	if (buffer_uptodate(bh))
>  		return bh;
> +
> +	/*
> +	 * uptodate flag may have been cleared by a previous (transient)
> +	 * write IO error.  In this case, we don't want to reread its
> +	 * old on-disk data.  Actually the buffer has the latest data,
> +	 * so set uptodate flag again.
> +	 */
> +	if (buffer_write_io_error(bh)) {
> +		set_buffer_uptodate(bh);
> +		return bh;
> +	}
> +
>  	ll_rw_block(READ_META, 1, &bh);
>  	wait_on_buffer(bh);
>  	if (buffer_uptodate(bh))
>  		return bh;
> +
>  	put_bh(bh);
>  	*err = -EIO;
>  	return NULL;
> diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
> index 7b0e44f..7ed8e45 100644
> --- a/fs/ext3/namei.c
> +++ b/fs/ext3/namei.c
> @@ -909,7 +909,20 @@ restart:
>  				num++;
>  				bh = ext3_getblk(NULL, dir, b++, 0, &err);
>  				bh_use[ra_max] = bh;
> -				if (bh)
> +				if (!bh || buffer_uptodate(bh))
> +					continue;
> +
> +				/*
> +				 * uptodate flag may have been cleared by a
> +				 * previous (transient) write IO error.  In
> +				 * this case, we don't want to reread its
> +				 * old on-disk data.  Actually the buffer
> +				 * has the latest data, so set uptodate flag
> +				 * again.
> +				 */
> +				if (buffer_write_io_error(bh))
> +					set_buffer_uptodate(bh);
> +				else
>  					ll_rw_block(READ_META, 1, &bh);
>  			}
>  		}
> -- 
> 1.6.0.3
> 
> 
> 
> -- 
> Hidehiro Kawai
> Hitachi, Systems Development Laboratory
> Linux Technology Center
>
Hidehiro Kawai Jan. 15, 2010, 10:38 a.m. UTC | #2
Jan Kara wrote:

>>By the way, I think the right fix is to keep the uptodate flag on
>>write error, but that has a big impact.  We would have to confirm
>>whether each of the 200-plus buffer_uptodate() call sites is a real
>>uptodate check or a write-error check.  For now, I adopt the
>>quick-fix solution.
> 
>   Yes, that needs to be solved. I also looked into it and it's too much work
> to do it in one big sweep. But I think we could do the conversion
> filesystem by filesystem - see below.
>   Admittedly, I don't like your solution very much. It looks strange to
> check write_io_error when *reading* the buffer, and I'm afraid we could
> introduce bugs, e.g. by clearing the write_io_error bit so that ext3_bread
> would then fail to detect the error condition, or by some other code
> deciding to read the buffer from disk via a function other than ext3_bread.
> So I think it would be much better to set the uptodate flag back shortly
> after the failed write, or not clear it at all.
>   Now here's how I think we could achieve that without having to change all
> other filesystems: we could create a new superblock flag which would mean
> "this filesystem handles write_io_error and does not want the uptodate flag
> cleared". Filesystems with this capability would set the flag when
> calling get_sb_bdev. And if a write IO fails, we check this flag
> (via bh->b_bdev->bd_super->s_flags) and clear or keep the uptodate flag
> accordingly. What do you think?

Thanks for your comment!

Your suggestion is what I wanted to do ultimately, and it seems
like it will work well.  I'll send a revised patch later.

Best regards,
Nick Piggin Jan. 18, 2010, 5:18 a.m. UTC | #3
On Thu, Jan 14, 2010 at 03:18:04PM +0100, Jan Kara wrote:
> On Thu 14-01-10 19:14:30, Hidehiro Kawai wrote:
> > This patch fixes a bug similar to the one fixed by commit 95450f5a.
> > 
> > If a directory is modified, its data block is journaled as metadata
> > and finally written back to its proper location.  Now assume a
> > transient write error happens on that writeback.  The buffer's
> > uptodate flag is cleared by the write error, so the next access to
> > the buffer causes a reread from disk.  This breaks the filesystem's
> > consistency.
> > 
> > To prevent old directory data from being reread, this patch sets the
> > uptodate flag again after a write error, before issuing the read
> > operation.  The write error on the directory's data block is still
> > detected at journal checkpointing time, or is discarded if a rewrite
> > by another modification succeeds, so nothing is lost.
> > 
> > Similarly, this kind of consistency breakage can be caused by
> > a transient write error on a bitmap block.
>   Good catch, that's indeed a problem.
> 
> > I tested this patch using a fault-injection approach.
> > 
> > By the way, I think the right fix is to keep the uptodate flag on
> > write error, but that has a big impact.  We would have to confirm
> > whether each of the 200-plus buffer_uptodate() call sites is a real
> > uptodate check or a write-error check.  For now, I adopt the
> > quick-fix solution.
>   Yes, that needs to be solved. I also looked into it and it's too much work
> to do it in one big sweep. But I think we could do the conversion
> filesystem by filesystem - see below.

Yes, this is something I've had problems with as well, although I never
properly solved the issue of auditing / backward compatibility with
filesystems. So it is good to see people working on a real solution :)

Clearing the uptodate flag on write errors is a really nasty thing to do.
It means that failed writeback cannot be retried, it can break application
consistency for data blocks just as it breaks filesystem consistency for
metadata blocks, and it might even cause oopses and weird problems where
!uptodate pages/buffers are not expected (mmapped pages, for example), or
by breaking consistency between PageUptodate and buffer_uptodate.


>   Admittedly, I don't like your solution very much. It looks strange to
> check write_io_error when *reading* the buffer, and I'm afraid we could
> introduce bugs, e.g. by clearing the write_io_error bit so that ext3_bread
> would then fail to detect the error condition, or by some other code
> deciding to read the buffer from disk via a function other than ext3_bread.
> So I think it would be much better to set the uptodate flag back shortly
> after the failed write, or not clear it at all.
>   Now here's how I think we could achieve that without having to change all
> other filesystems: we could create a new superblock flag which would mean
> "this filesystem handles write_io_error and does not want the uptodate flag
> cleared". Filesystems with this capability would set the flag when
> calling get_sb_bdev. And if a write IO fails, we check this flag
> (via bh->b_bdev->bd_super->s_flags) and clear or keep the uptodate flag
> accordingly. What do you think?
>   I know it's more work than your quick fix, but it should fix all these
> problems for ext3 once and for all, and it would be much cleaner...

I agree, and this sounds like a decent solution.

We also need to remove some ClearPageUptodate calls I think (similar
issues), so keep those in mind too. Unfortunately it looks like there
are also a lot of filesystem specific tests of PageUptodate... but you
could also move those under the new compatibility s_flag.

I don't know of a really good way to inject and test filesystem errors.
Injecting request failures causes most filesystems to quickly go readonly
or hit bigger problems. If you're careful and, say, only fail read IOs for
data, or only fail write IOs not involved in integrity or journal
operations, then test programs just tend to abort pretty quickly. Does
anyone know of anything more systematic?


> > Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
> > ---
> >  fs/ext3/balloc.c |   12 ++++++++++++
> >  fs/ext3/inode.c  |   13 +++++++++++++
> >  fs/ext3/namei.c  |   15 ++++++++++++++-
> >  3 files changed, 39 insertions(+), 1 deletions(-)
> > 
> > diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
> > index 27967f9..5dc5ccf 100644
> > --- a/fs/ext3/balloc.c
> > +++ b/fs/ext3/balloc.c
> > @@ -156,6 +156,18 @@ read_block_bitmap(struct super_block *sb, unsigned int block_group)
> >  	if (likely(bh_uptodate_or_lock(bh)))
> >  		return bh;
> >  
> > +	/*
> > +	 * uptodate flag may have been cleared by a previous (transient)
> > +	 * write IO error.  In this case, we don't want to reread its
> > +	 * old on-disk data.  Actually the buffer has the latest data,
> > +	 * so set uptodate flag again.
> > +	 */
> > +	if (buffer_write_io_error(bh)) {
> > +		set_buffer_uptodate(bh);
> > +		unlock_buffer(bh);
> > +		return bh;
> > +	}
> > +
> >  	if (bh_submit_read(bh) < 0) {
> >  		brelse(bh);
> >  		ext3_error(sb, __func__,
> > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > index 455e6e6..67d7849 100644
> > --- a/fs/ext3/inode.c
> > +++ b/fs/ext3/inode.c
> > @@ -1077,10 +1077,23 @@ struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
> >  		return bh;
> >  	if (buffer_uptodate(bh))
> >  		return bh;
> > +
> > +	/*
> > +	 * uptodate flag may have been cleared by a previous (transient)
> > +	 * write IO error.  In this case, we don't want to reread its
> > +	 * old on-disk data.  Actually the buffer has the latest data,
> > +	 * so set uptodate flag again.
> > +	 */
> > +	if (buffer_write_io_error(bh)) {
> > +		set_buffer_uptodate(bh);
> > +		return bh;
> > +	}
> > +
> >  	ll_rw_block(READ_META, 1, &bh);
> >  	wait_on_buffer(bh);
> >  	if (buffer_uptodate(bh))
> >  		return bh;
> > +
> >  	put_bh(bh);
> >  	*err = -EIO;
> >  	return NULL;
> > diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
> > index 7b0e44f..7ed8e45 100644
> > --- a/fs/ext3/namei.c
> > +++ b/fs/ext3/namei.c
> > @@ -909,7 +909,20 @@ restart:
> >  				num++;
> >  				bh = ext3_getblk(NULL, dir, b++, 0, &err);
> >  				bh_use[ra_max] = bh;
> > -				if (bh)
> > +				if (!bh || buffer_uptodate(bh))
> > +					continue;
> > +
> > +				/*
> > +				 * uptodate flag may have been cleared by a
> > +				 * previous (transient) write IO error.  In
> > +				 * this case, we don't want to reread its
> > +				 * old on-disk data.  Actually the buffer
> > +				 * has the latest data, so set uptodate flag
> > +				 * again.
> > +				 */
> > +				if (buffer_write_io_error(bh))
> > +					set_buffer_uptodate(bh);
> > +				else
> >  					ll_rw_block(READ_META, 1, &bh);
> >  			}
> >  		}
> > -- 
> > 1.6.0.3
> > 
> > 
> > 
> > -- 
> > Hidehiro Kawai
> > Hitachi, Systems Development Laboratory
> > Linux Technology Center
> > 
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
Nick Piggin Jan. 18, 2010, 6:05 a.m. UTC | #4
On Mon, Jan 18, 2010 at 04:18:47PM +1100, Nick Piggin wrote:
> We also need to remove some ClearPageUptodate calls I think (similar
> issues), so keep those in mind too. Unfortunately it looks like there
> are also a lot of filesystem specific tests of PageUptodate... but you
> could also move those under the new compatibility s_flag.
> 
> I don't know of a really good way to inject and test filesystem errors.
> Injecting request failures causes most filesystems to quickly go readonly
> or hit bigger problems. If you're careful and, say, only fail read IOs for
> data, or only fail write IOs not involved in integrity or journal
> operations, then test programs just tend to abort pretty quickly. Does
> anyone know of anything more systematic?

This might be a good time to bring up IO error behaviour again. I got
into some debates I think on Andi's hwpoison thread a while back, but
probably not appropriate thread to find a real solution to this.

The problem we have now is that IO error semantics are not well defined.
It is hard to even enumerate all the issues.

read IOs
  how to retry? appropriate defaults should happen at the block layer I
  think. Should retry behaviour be tunable by the mm/fs, or should that
  be coded explicitly as submission retry loops? Either way does imply
  there is either similar defaults for all types (or maybe classes) of
  drivers, or some way to query/set this.

  It would be nice to be able to set fs/driver behaviour from userspace
  too, in a generic (not driver or fs specific way). But defaults should
  be reasonable and similar between all, I guess.

write IOs
  This is more interesting. How to handle write IO errors. In my opinion
  we must not invalidate the data before an IO error is returned to
  somebody (whether it be fsync or a synchronous write syscall). Any
  earlier and the app just gets RAW consistency randomly violated. And I
  think it is important to treat IO errors as transparently as possible
  until the error can be detected.

  I happen to think that actually we should go further and not
  invalidate the data at all. This makes implementation simpler, and
  also allows us to retry writes like we can retry reads. It's also
  problematic to throw out errors at that point because *sync syscalls
  coming from elsewhere could result in loss of error reporting (think,
  sys_sync).

  If we go this way, we probably need another syscall and fs helper call
  to invalidate the dirty data when we give up on retries. truncate_range
  probably not appropriate because it is much harder to implement and
  maybe we want to try to get at the most recent data that is on disk.

  Also do we need to think about O_SYNC or -o sync type of writes that
  are implemented via writeback cache? We could invalidate the dirtied
  cache ASAP, which would leave a window where a concurrent read can see
  first new, then old data. It would also kind of break the above scheme
  in case the pagecache was already dirty via a descriptor without
  O_SYNC. It might just make sense to leave the pagecache dirty. Either
  way it should be documented I think.

Do we even care enough to bother thinking about this now? (serious question)

Dave Chinner Jan. 18, 2010, 12:24 p.m. UTC | #5
On Mon, Jan 18, 2010 at 05:05:18PM +1100, Nick Piggin wrote:
> On Mon, Jan 18, 2010 at 04:18:47PM +1100, Nick Piggin wrote:
> > We also need to remove some ClearPageUptodate calls I think (similar
> > issues), so keep those in mind too. Unfortunately it looks like there
> > are also a lot of filesystem specific tests of PageUptodate... but you
> > could also move those under the new compatibility s_flag.
> > 
> > I don't know of a really good way to inject and test filesystem errors.
> > Injecting request failures causes most filesystems to quickly go readonly
> > or hit bigger problems. If you're careful and, say, only fail read IOs for
> > data, or only fail write IOs not involved in integrity or journal
> > operations, then test programs just tend to abort pretty quickly. Does
> > anyone know of anything more systematic?
> 
> This might be a good time to bring up IO error behaviour again. I got
> into some debates I think on Andi's hwpoison thread a while back, but
> probably not appropriate thread to find a real solution to this.
> 
> The problem we have now is that IO error semantics are not well defined.
> It is hard to even enumerate all the issues.
> 
> read IOs
>   how to retry? appropriate defaults should happen at the block layer I
>   think. Should retry behaviour be tunable by the mm/fs, or should that
>   be coded explicitly as submission retry loops? Either way does imply
>   there is either similar defaults for all types (or maybe classes) of
>   drivers, or some way to query/set this.

It's more complex than that - there are classes of errors to
consider as well. e.g transient vs permanent.

Transient is from stuff like FC path failures - failover can take up
to 240s to occur, and then the IO will generally complete
successfully.  Permanent errors are those that involve data loss e.g
bad sectors on single disks or on degraded RAID devices.

The action to take is generally different for different error
classes - transient errors can be retried later, while permanent
errors won't change no matter how many retries you do.  IOWs, we'll
need help from the block layer to enable us to determine the error
class we are dealing with.
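Even something as minimal as an explicit classification handed up with the
completion would help (the names here are purely illustrative):

	enum blk_error_class {
		BLK_ERR_TRANSIENT,	/* e.g. path failure - a later retry may succeed */
		BLK_ERR_PERMANENT,	/* e.g. media error - retrying is pointless */
		BLK_ERR_UNKNOWN,	/* the driver could not classify the failure */
	};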

>   It would be nice to be able to set fs/driver behaviour from userspace
>   too, in a generic (not driver or fs specific way). But defaults should
>   be reasonable and similar between all, I guess.

I don't think generic handling is really possible - filesystems may
have different ways of recovering e.g. duplicate copies of data or
metadata, or internal ECC that can be used to recover the bad
region. Also, depending on where the error occurs, the filesystem might
need to shut down to be repaired....

> write IOs
>   This is more interesting. How to handle write IO errors. In my opinion
>   we must not invalidate the data before an IO error is returned to
>   somebody (whether it be fsync or a synchronous write syscall).

We already pass the error via mapping_set_error() calls when the
error occurs, and check it in filemap_fdatawait_range().  However,
by the time we check the error we've lost all context, including what range the
error occurred on. I don't see any easy way to track such an
error for later invalidation except maybe by a new radix tree tag.
That would allow later invalidation of only the specific range the
error was reported from.
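For reference, the existing plumbing is roughly this (simplified; see
mm/filemap.c and include/linux/pagemap.h):

	/* writeback completion records the error on the mapping... */
	static inline void mapping_set_error(struct address_space *mapping, int error)
	{
		if (unlikely(error)) {
			if (error == -ENOSPC)
				set_bit(AS_ENOSPC, &mapping->flags);
			else
				set_bit(AS_EIO, &mapping->flags);
		}
	}

	/* ...and filemap_fdatawait_range() folds it back into a single errno */
	if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
		ret = -ENOSPC;
	if (test_and_clear_bit(AS_EIO, &mapping->flags))
		ret = -EIO;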

>   Any
>   earlier and the app just gets RAW consistency randomly violated. And I
>   think it is important to treat IO errors as transparently as possible
>   until the error can be detected.
> 
>   I happen to think that actually we should go further and not
>   invalidate the data at all. This makes implementation simpler, and
>   also allows us to retry writes like we can retry reads. It's also
>   problematic to throw out errors at that point because *sync syscalls
>   coming from elsewhere could result in loss of error reporting (think,
>   sys_sync).

The worst problem with this is what happens when you can't write
back to the filesystem because of IO errors, but you still allow more
incoming writes? It's not far from IO error to running out of memory
and deadlocking....

>   If we go this way, we probably need another syscall and fs helper call
>   to invalidate the dirty data when we give up on retries. truncate_range
>   probably not appropriate because it is much harder to implement and
>   maybe we want to try to get at the most recent data that is on disk.

First we need to track what needs invalidating...

>   Also do we need to think about O_SYNC or -o sync type of writes that
>   are implemented via writeback cache? We could invalidate the dirtied
>   cache ASAP, which would leave a window where a concurrent read can see
>   first new, then old data. It would also kind of break the above scheme
>   in case the pagecache was already dirty via a descriptor without
>   O_SYNC. It might just make sense to leave the pagecache dirty. Either
>   way it should be documented I think.

How to handle this comes down to the type of error that occurred. In
the case of permanent error, the second read after the invalidation
probably should return EIO because you have no idea whether what is on
disk is the old, the new, some combination of the two or some other
random or stale garbage....

> Do we even care enough to bother thinking about this now? (serious question)

It's a damn hard problem and many of the details are filesystem
specific. However, if we want high grade reliability from our
systems then we have to tackle these problems at some point in time.

FWIW, I started to document some of what I've just been talking
(from a XFS metadata reliability context) about a year and a half
ago. The relevant section is here:

http://xfs.org/index.php/Reliable_Detection_and_Repair_of_Metadata_Corruption#Exception_Handling

Though the entire page is probably somewhat relevant.  I only got as
far as documenting methods for handling transient and permanent read
errors, and the TODO includes handling:

	- Transient write error
	- Permanent write error
	- Corrupted data on read
	- Corrupted data on write (detected during guard calculation)
	- I/O timeouts
	- Memory corruption

If we want to handle these sort of errors, I think first we've got
to understand what we have to do. A rough outline of the approach I
was taking to the above document was:

	- work out how to detect each type of error effectively
	- determine strategies of how to report each type of error
	- find the best way to recover from the error
	- work out what additional information was needed on disk
	  to enable successful recovery from the error.
	- work out what additional functionality was required from
	  the lower layers to allow reliable detection and recovery
	  to occur.

In the case of reporting data errors through pagecache paths, I'd add:

	- determine if the error can be handled in a generic way
	- work out how filesystem specific handlers can be invoked
	  for those that can do filesystem level recovery
	- work out how to consistently report the same result after
	  an error.

There's plenty more I could write, but I need to sleep now....

Cheers,

Dave.
Nick Piggin Jan. 18, 2010, 2 p.m. UTC | #6
On Mon, Jan 18, 2010 at 11:24:37PM +1100, Dave Chinner wrote:
> On Mon, Jan 18, 2010 at 05:05:18PM +1100, Nick Piggin wrote:
> > The problem we have now is that IO error semantics are not well defined.
> > It is hard to even enumerate all the issues.
> > 
> > read IOs
> >   how to retry? appropriate defaults should happen at the block layer I
> >   think. Should retry behaviour be tunable by the mm/fs, or should that
> >   be coded explicitly as submission retry loops? Either way does imply
> >   there is either similar defaults for all types (or maybe classes) of
> >   drivers, or some way to query/set this.
> 
> It's more complex than that - there are classes of errors to
> consider as well. e.g transient vs permanent.
> 
> Transient is from stuff like FC path failures - failover can take up
> to 240s to occur, and then the IO will generally complete
> successfully.  Permanent errors are those that involve data loss e.g
> bad sectors on single disks or on degraded RAID devices.

Yes. Is this something that should be visible above the block layer
though? If it is known transient, should it remain uncompleted until it
is successful?

Known permanent errors yes could avoid any need for retries. Leaving
cases where the lower layers don't really know (in which case we'd
maybe want to leave it to userspace or a userspace-set policy).


> >   It would be nice to be able to set fs/driver behaviour from userspace
> >   too, in a generic (not driver or fs specific way). But defaults should
> >   be reasonable and similar between all, I guess.
> 
> I don't think generic handling is really possible - filesystems may
> have different ways of recovering e.g. duplicate copies of data or

For write errors, you could also do block re-allocation, which would be
fun.


> metadata, or internal ECC that can be used to recover the bad
> region. Also, depending on where the error occurs, the filesystem might
> need to shut down to be repaired....

Definitely there will be filesystem specific issues. But I mean that
some common things could be specified (like how long / how many times
to retry failed requests).


> > write IOs
> >   This is more interesting. How to handle write IO errors. In my opinion
> >   we must not invalidate the data before an IO error is returned to
> >   somebody (whether it be fsync or a synchronous write syscall).
> 
> We already pass the error via mapping_set_error() calls when the
> error occurs and checking in it filemap_fdatawait_range().  However,
> where we check the error we've lost all context and what range the
> error occurred on. I don't see any easy way to track such an
> error for later invalidation except maybe by a new radix tree tag.
> That would allow later invalidation of only the specific range the
> error was reported from.

If we always leave the error pages / buffers as dirty and uptodate,
then we can walk the radix tree dirty bits. IO errors are only really
reported by syncing calls anyway which walk dirty bits already.

If we wanted a purely querying syscall, it probably doesn't need to be
so performance critical as to require a new tag rather than just
checking PageError on the dirty pages.
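An untested sketch of what such a query could look like (the helper name is
made up; needs linux/pagevec.h and linux/pagemap.h):

	static int mapping_has_write_errors(struct address_space *mapping)
	{
		struct pagevec pvec;
		pgoff_t index = 0;
		unsigned i, nr;
		int err = 0;

		pagevec_init(&pvec, 0);
		while (!err && (nr = pagevec_lookup_tag(&pvec, mapping, &index,
					PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE))) {
			for (i = 0; i < nr; i++)
				if (PageError(pvec.pages[i]))
					err = -EIO;	/* dirty page still carries a write error */
			pagevec_release(&pvec);
		}
		return err;
	}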


> >   Any
> >   earlier and the app just gets RAW consistency randomly violated. And I
> >   think it is important to treat IO errors as transparently as possible
> >   until the error can be detected.
> > 
> >   I happen to think that actually we should go further and not
> >   invalidate the data at all. This makes implementation simpler, and
> >   also allows us to retry writes like we can retry reads. It's also
> >   problematic to throw out errors at that point because *sync syscalls
> >   coming from elsewhere could result in loss of error reporting (think,
> >   sys_sync).
> 
> The worst problem with this is what happens when you can't write
> back to the filesystem because of IO errors, but you still allow more
> incoming writes? It's not far from IO error to running out of memory
> and deadlocking....

Again, keeping pages dirty so we'll start synchronous dirty pagecache
throttling eventually.

That could cause problems of its own as well, but I don't know what else
we can do. I don't think we can throw out the dirty data by default (the
errors might be transient). It could be a policy, maybe.


> >   If we go this way, we probably need another syscall and fs helper call
> >   to invalidate the dirty data when we give up on retries. truncate_range
> >   probably not appropriate because it is much harder to implement and
> >   maybe we want to try to get at the most recent data that is on disk.
> 
> First we need to track what needs invalidating...

Well by this I just mean the dirty, unwritten pagecache and its associated
fs private structures. For errors in filesystem metadata yes it is a lot
harder. I guess filesystems simply need to check and handle errors on a
case by case basis.


> >   Also do we need to think about O_SYNC or -o sync type of writes that
> >   are implemented via writeback cache? We could invalidate the dirtied
> >   cache ASAP, which would leave a window where a concurrent read can see
> >   first new, then old data. It would also kind of break the above scheme
> >   in case the pagecache was already dirty via a descriptor without
> >   O_SYNC. It might just make sense to leave the pagecache dirty. Either
> >   way it should be documented I think.
> 
> How to handle this comes down to the type of error that occurred. In
> the case of permanent error, the second read after the invalidation
> probably should return EIO because you have no idea whether what is on
> disk is the old, the new, some combination of the two or some other
> random or stale garbage....

I'm not sure if that is important because you would have the same
problems if the read was not preceded by a write (or if the write came
from previous boot, or a different machine etc).

If we want to catch IO errors not detected by the block layer, it really
needs a complete solution, in the fs.

 
> > Do we even care enough to bother thinking about this now? (serious question)
> 
> It's a damn hard problem and many of the details are filesystem
> specific. However, if we want high grade reliability from our
> systems then we have to tackle these problems at some point in time.
> 
> FWIW, I started to document some of what I've just been talking
> (from a XFS metadata reliability context) about a year and a half
> ago. The relevant section is here:
> 
> http://xfs.org/index.php/Reliable_Detection_and_Repair_of_Metadata_Corruption#Exception_Handling

OK, interesting. Yes a document is needed.

 
> Though the entire page is probably somewhat relevant.  I only got as
> far as documenting methods for handling transient and permanent read
> errors, and the TODO includes handling:
> 
> 	- Transient write error
> 	- Permanent write error
> 	- Corrupted data on read
> 	- Corrupted data on write (detected during guard calculation)

We do want to start by making this as _simple_ as possible. Even the
existing rudimentary error reporting by the block layer is not used in a
consistent way (or at all, in many cases).

So I think squashing corrupted data errors into transient/permanent
errors (at least to start with) could be a good idea.

> 	- I/O timeouts

Different from transient/permanent error cases?


> 	- Memory corruption

Yes this needs support, which I've talked about in hwpoison discussions.
Currently (or last time I checked) it just causes corrupted dirty
pagecache to appear as an IO error. IMO this is wrong -- the fs or the
app might retry the write, or try to re-allocate things and write that
data elsewhere in the case of EIO, which is totally wrong for memory
corruption.

Dave Chinner Jan. 18, 2010, 10:51 p.m. UTC | #7
On Tue, Jan 19, 2010 at 01:00:39AM +1100, Nick Piggin wrote:
> On Mon, Jan 18, 2010 at 11:24:37PM +1100, Dave Chinner wrote:
> > On Mon, Jan 18, 2010 at 05:05:18PM +1100, Nick Piggin wrote:
> > > The problem we have now is that IO error semantics are not well defined.
> > > It is hard to even enumerate all the issues.
> > > 
> > > read IOs
> > >   how to retry? appropriate defaults should happen at the block layer I
> > >   think. Should retry behaviour be tunable by the mm/fs, or should that
> > >   be coded explicitly as submission retry loops? Either way does imply
> > >   there is either similar defaults for all types (or maybe classes) of
> > >   drivers, or some way to query/set this.
> > 
> > It's more complex than that - there are classes of errors to
> > consider as well. e.g transient vs permanent.
> > 
> > Transient is from stuff like FC path failures - failover can take up
> > to 240s to occur, and then the IO will generally complete
> > successfully.  Permanent errors are those that involve data loss e.g
> > bad sectors on single disks or on degraded RAID devices.
> 
> Yes. Is this something that should be visible above the block layer
> though? If it is known transient, should it remain uncompleted until it
> is successful?

I think it needs to be exposed because if the filesystem has
multiple copies of the data it can read from the other location
immediately and not hang the read for tens of seconds.

> Known permanent errors yes could avoid any need for retries. Leaving
> cases where the lower layers don't really know (in which case we'd
> maybe want to leave it to userspace or a userspace-set policy).

Personally, I don't think users are going to be able to make
intelligent decisions about what to do with such a knob. I'd prefer to
just make it a fixed policy first, and only provide tunables if
that proves to be insufficient.

> > >   It would be nice to be able to set fs/driver behaviour from userspace
> > >   too, in a generic (not driver or fs specific way). But defaults should
> > >   be reasonable and similar between all, I guess.
> > 
> > I don't think generic handling is really possible - filesystems may
> > have different ways of recovering e.g. duplicate copies of data or
> 
> For write errors, you could also do block re-allocation, which would be
> fun.
> 
> > metadata, or internal ECC that can be used to recover the bad
> > region. Also, depending on where the error occurs, the filesystem might
> > need to shut down to be repaired....
> 
> Definitely there will be filesystem specific issues. But I mean that
> some common things could be specified (like how long / how many times
> to retry failed requests).

Agreed - some common things will fall out, but I'd like to
see an analysis done first before we try to extract the common
elements from the mess....

> > > write IOs
> > >   This is more interesting. How to handle write IO errors. In my opinion
> > >   we must not invalidate the data before an IO error is returned to
> > >   somebody (whether it be fsync or a synchronous write syscall).
> > 
> > We already pass the error via mapping_set_error() calls when the
> > error occurs, and check it in filemap_fdatawait_range().  However,
> > by the time we check the error we've lost all context, including what range the
> > error occurred on. I don't see any easy way to track such an
> > error for later invalidation except maybe by a new radix tree tag.
> > That would allow later invalidation of only the specific range the
> > error was reported from.
> 
> If we always leave the error pages / buffers as dirty and uptodate,
> then we can walk the radix tree dirty bits. IO errors are only really
> reported by syncing calls anyway which walk dirty bits already.
> 
> If we wanted a purely querying syscall, it probably doesn't need to be
> so performance critical as to require a new tag rather than just
> checking PageError on the dirty pages.

The drive for the document I was writing was big, high performance
filesystems (think petabyte scale) and machines that might cache a
TB or two of a single file in memory. At that point, finding a
handful of error pages is like finding a needle in a haystack....

> > >   Any
> > >   earlier and the app just gets RAW consistency randomly violated. And I
> > >   think it is important to treat IO errors as transparently as possible
> > >   until the error can be detected.
> > > 
> > >   I happen to think that actually we should go further and not
> > >   invalidate the data at all. This makes implementation simpler, and
> > >   also allows us to retry writes like we can retry reads. It's also
> > >   problematic to throw out errors at that point because *sync syscalls
> > >   coming from elsewhere could result in loss of error reporting (think,
> > >   sys_sync).
> > 
> > The worst problem with this is what happens when you can't write
> > back to the filesystem because of IO errors, but you still allow more
> > incoming writes? It's not far from IO error to running out of memory
> > and deadlocking....
> 
> Again, keeping pages dirty so we'll start synchronous dirty pagecache
> throttling eventually.

write_cache_pages() decrements nr_to_write even if there was a write
error on that page. Hence the throttling in balance_dirty_pages
won't kick in if lots of errors occur during synchronous writeback,
because it thinks all the pages it asked to be written actually
were written. As a result, IO errors for written data often lead to OOM.

> That could cause problems of its own as well, but I don't know what else
> we can do. I don't think we can throw out the dirty data by default (the
> errors might be transient). It could be a policy, maybe.

A certain number of retries is certainly worth attempting for
errors that we can't directly report (background writeback), but
whether that should be done for sync/fsync is an open question in my
mind...

> > >   If we go this way, we probably need another syscall and fs helper call
> > >   to invalidate the dirty data when we give up on retries. truncate_range
> > >   probably not appropriate because it is much harder to implement and
> > >   maybe we want to try to get at the most recent data that is on disk.
> > 
> > First we need to track what needs invalidating...
> 
> Well by this I just mean the dirty, unwritten pagecache and its associated
> fs private structures. For errors in filesystem metadata yes it is a lot
> harder. I guess filesystems simply need to check and handle errors on a
> case by case basis.

Nothing simple about that. ;)

> > >   Also do we need to think about O_SYNC or -o sync type of writes that
> > >   are implemented via writeback cache? We could invalidate the dirtied
> > >   cache ASAP, which would leave a window where a concurrent read can see
> > >   first new, then old data. It would also kind of break the above scheme
> > >   in case the pagecache was already dirty via a descriptor without
> > >   O_SYNC. It might just make sense to leave the pagecache dirty. Either
> > >   way it should be documented I think.
> > 
> > How to handle this comes down to the type of error that occurred. In
> > the case of permanent error, the second read after the invalidation
> > probably should return EIO because you have no idea whether what is on
> > disk is the old, the new, some combination of the two or some other
> > random or stale garbage....
> 
> I'm not sure if that is important because you would have the same
> problems if the read was not preceded by a write (or if the write came
> from previous boot, or a different machine etc).

If the filesystem has been unmounted, then we have to assume that
corrective action has been taken (i.e. we've reported a problem,
it's been fixed) or the filesystem has marked the region bad.

> If we want to catch IO errors not detected by the block layer, it really
> needs a complete solution, in the fs.

Yes, though there are plenty of different types of errors the block
layer detects but reports simply as "EIO". e.g. on Irix, the block
layer would report EXDEV rather than EIO for transient path-failure
errors so that they could be handled differently by the filesystem....

> > FWIW, I started to document some of what I've just been talking
> > (from a XFS metadata reliability context) about a year and a half
> > ago. The relevant section is here:
> > 
> > http://xfs.org/index.php/Reliable_Detection_and_Repair_of_Metadata_Corruption#Exception_Handling
> 
> OK, interesting. Yes a document is needed.
> 
>  
> > Though the entire page is probably somewhat relevant.  I only got as
> > far as documenting methods for handling transient and permanent read
> > errors, and the TODO includes handling:
> > 
> > 	- Transient write error
> > 	- Permanent write error
> > 	- Corrupted data on read
> > 	- Corrupted data on write (detected during guard calculation)
> 
> We do want to start by making this as _simple_ as possible. Even the
> existing rudimentary error reporting by the block layer is not used in a
> consistent way (or at all, in many cases).
> 
> So I think squashing corrupted data errors into transient/permanent
> errors (at least to start with) could be a good idea.

True. My main point is, though, we can't really make that
classification without understanding the whole scope of errors that
can occur and ensuring that we get the correct errors reported from
the lower layers first. i.e. this is not just a pagecache/filesystem
level problem - the lower layers have to do the right thing before we
can even hope to get it right at the FS level...

> > 	- I/O timeouts
> 
> Different from transient/permanent error cases?

Yeah - a week rarely goes by when we don't get a report of an XFS
filesystem hung due to something below it just stopping mid-IO
(DM, md, drivers, and/or hardware). e.g what appears to be a 
DM-related hang reported last night on #xfs:

http://www.pastebin.org/78102

And there was one last week where DM was stuck waiting for a barrier
write to complete and another where a raid controller was just hanging
mid IO.

IOWs, having the IO subsystem just stop completely dead and being
unrecoverable happens far too regularly to ignore when talking about
reliability engineering. These events almost always result in
filesystem corruption of some sort, even though what we have in
memory is consistent. That's because we have to reboot to get the IO
subsystem back and that loses the consistent in-memory state. Being
able to detect such hangs, re-initialise the complete stack below
the filesystem and then re-issue all the IO that was in flight would
make a large number of these problems go away.

> > 	- Memory corruption
> 
> Yes this needs support, which I've talked about in hwpoison discussions.
> Currently (or last time I checked) it just causes corrupted dirty
> pagecache to appear as an IO error. IMO this is wrong -- the fs or the
> app might retry the write, or try to re-allocate things and write that
> data elsewhere in the case of EIO, which is totally wrong for memory
> corruption.

Yes, and it's a hard one to detect (think bit errors in non-ECC memory)
without end-to-end data CRCs. On reads it can be treated as a
transient error, but I'm not sure how you'd classify it. I hadn't
got that far...

Cheers,

Dave.
Anton Altaparmakov Jan. 18, 2010, 11:33 p.m. UTC | #8
Hi,

On 18 Jan 2010, at 14:00, Nick Piggin wrote:
> For write errors, you could also do block re-allocation, which would be fun.

Yes it would.  (-:

FWIW, Windows does this with Microsoft's NTFS driver.  When a write fails due to a bad block, the block is marked as bad (recorded in the bad cluster list and marked as allocated in the in-use bitmap so no-one tries to allocate it), a new block is allocated, inode metadata is updated to reflect the change in the logical to physical block map of the file the block belongs to, and the write is then re-tried to its new location.

I have never bothered implementing it in NTFS on Linux partially because there doesn't seem any obvious way to do it inside the file system.  I think the VFS and/or the block layer would have to offer help there in some way.  What I mean for example is that if ->writepage fails then the failure is only detected inside the asynchronous i/o completion handler at which point the page is not locked any more, it is marked as being under writeback, and we are in IRQ context (or something) and thus it is not easy to see how we can from there get to doing all the above needed actions that require memory allocations, disk i/o, etc...  I suppose a separate thread could do it where we just schedule the work to be done.  But problem with that is that that work later on might fail so we can't simply pretend the block was written successfully yet we do not want to report an error or the upper layers would pick it up even though we hopefully will correct it in due course...
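The deferred-work shape would be something like this (every ntfs_* helper
below is hypothetical, named only to illustrate the sequence):

	struct ntfs_failed_write {
		struct work_struct work;
		ntfs_volume *vol;
		ntfs_inode *ni;
		VCN vcn;		/* logical block whose write failed */
		LCN old_lcn;		/* physical block that went bad */
	};

	static void ntfs_rewrite_failed_block(struct work_struct *work)
	{
		struct ntfs_failed_write *fw =
			container_of(work, struct ntfs_failed_write, work);
		LCN new_lcn;

		ntfs_mark_cluster_bad(fw->vol, fw->old_lcn);	/* $BadClus + in-use bitmap */
		new_lcn = ntfs_allocate_cluster(fw->vol);
		ntfs_remap_run(fw->ni, fw->vcn, new_lcn);	/* fix logical->physical map */
		ntfs_resubmit_write(fw->ni, fw->vcn);		/* retry to the new location */
	}

	/* scheduled from the i/o completion handler: */
	INIT_WORK(&fw->work, ntfs_rewrite_failed_block);
	schedule_work(&fw->work);

The catch remains that any of those steps can itself fail, so the error can
only be surfaced (or suppressed) once this work item has actually run.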

Best regards,

	Anton
Ric Wheeler Jan. 25, 2010, 3:23 p.m. UTC | #9
On 01/18/2010 06:33 PM, Anton Altaparmakov wrote:
> Hi,
>
> On 18 Jan 2010, at 14:00, Nick Piggin wrote:
>    
>> For write errors, you could also do block re-allocation, which would be fun.
>>      
> Yes it would.  (-:
>
> FWIW, Windows does this with Microsoft's NTFS driver.  When a write fails due to a bad block, the block is marked as bad (recorded in the bad cluster list and marked as allocated in the in-use bitmap so no-one tries to allocate it), a new block is allocated, inode metadata is updated to reflect the change in the logical to physical block map of the file the block belongs to, and the write is then re-tried to its new location.
>
> I have never bothered implementing it in NTFS on Linux partially because there doesn't seem any obvious way to do it inside the file system.  I think the VFS and/or the block layer would have to offer help there in some way.  What I mean for example is that if ->writepage fails then the failure is only detected inside the asynchronous i/o completion handler at which point the page is not locked any more, it is marked as being under writeback, and we are in IRQ context (or something) and thus it is not easy to see how we can from there get to doing all the above needed actions that require memory allocations, disk i/o, etc...  I suppose a separate thread could do it where we just schedule the work to be done.  But problem with that is that that work later on might fail so we can't simply pretend the block was written successfully yet we do not want to report an error or the upper layers would pick it up even though we hopefully will correct it in due course...
>
> Best regards,
>
> 	Anton
>    

For permanent write errors, I would expect any modern drive to do a 
sector remapping internally. We should never need to track this kind of 
information for any modern device that I know of (S-ATA, SAS, SSD's and 
raid arrays should all handle this).

Would not seem to be worth the complexity.

Also keep in mind that retrying IO errors is not always a good thing - 
devices retry failed IO multiple times internally. Adding additional 
retry loops up the stack only makes our unavoidable IO error take much 
longer to hit!

Ric


Greg Freemyer Jan. 25, 2010, 4:15 p.m. UTC | #10
On Mon, Jan 25, 2010 at 10:23 AM, Ric Wheeler <rwheeler@redhat.com> wrote:
> On 01/18/2010 06:33 PM, Anton Altaparmakov wrote:
>>
>> Hi,
>>
>> On 18 Jan 2010, at 14:00, Nick Piggin wrote:
>>
>>>
>>> For write errors, you could also do block re-allocation, which would be
>>> fun.
>>>
>>
>> Yes it would.  (-:
>>
>> FWIW, Windows does this with Microsoft's NTFS driver.  When a write fails
>> due to a bad block, the block is marked as bad (recorded in the bad cluster
>> list and marked as allocated in the in-use bitmap so no-one tries to
>> allocate it), a new block is allocated, inode metadata is updated to reflect
>> the change in the logical to physical block map of the file the block
>> belongs to, and the write is then re-tried to its new location.
>>
>> I have never bothered implementing it in NTFS on Linux partially because
>> there doesn't seem any obvious way to do it inside the file system.  I think
>> the VFS and/or the block layer would have to offer help there in some way.
>>  What I mean for example is that if ->writepage fails then the failure is
>> only detected inside the asynchronous i/o completion handler at which point
>> the page is not locked any more, it is marked as being under writeback, and
>> we are in IRQ context (or something) and thus it is not easy to see how we
>> can from there get to doing all the above needed actions that require memory
>> allocations, disk i/o, etc...  I suppose a separate thread could do it where
>> we just schedule the work to be done.  But problem with that is that that
>> work later on might fail so we can't simply pretend the block was written
>> successfully yet we do not want to report an error or the upper layers would
>> pick it up even though we hopefully will correct it in due course...
>>
>> Best regards,
>>
>>        Anton
>>
>
> For permanent write errors, I would expect any modern drive to do a sector
> remapping internally. We should never need to track this kind of information
> for any modern device that I know of (S-ATA, SAS, SSD's and raid arrays
> should all handle this).
>
> Would not seem to be worth the complexity.
>
> Also keep in mind that retrying IO errors is not always a good thing -
> devices retry failed IO multiple times internally. Adding additional retry
> loops up the stack only makes our unavoidable IO error take much longer to
> hit!
>
> Ric

I thought write errors returned by modern drives (last 15 years) in
general were caused by bad cables, controllers, power supplies, etc.

When a media error is returned on a write, it indicates that the spare
sector area of the drive is full.

Thus a media write error is a major error.  I would think, if
anything, we should turn the filesystem readonly upon a write media
error.  Not try to hide such a major problem.

Greg
Theodore Ts'o Jan. 25, 2010, 5:47 p.m. UTC | #11
On Mon, Jan 25, 2010 at 10:23:57AM -0500, Ric Wheeler wrote:
> 
> For permanent write errors, I would expect any modern drive to do a
> sector remapping internally. We should never need to track this kind
> of information for any modern device that I know of (S-ATA, SAS,
> SSD's and raid arrays should all handle this).

... and if the device is run out of all of its blocks in its spare
blocks pool, it's probably well past the time to replace said disk.

BTW, I really liked Dave Chinner's summary of the issues involved; I
ran into Kawai-san last week at Linux.conf.au, and we discussed pretty
much the same thing over lunch.  (i.e., that it's a hard problem, and
in some cases we need to retry the writes, such as a transient FC path
problem --- but some kind of write throttling is critical or we could
end up choking the VM due to too many pages getting dirtied and no way
of cleaning them.)

						- Ted
Ric Wheeler Jan. 25, 2010, 5:50 p.m. UTC | #12
On 01/25/2010 12:47 PM, tytso@mit.edu wrote:
> On Mon, Jan 25, 2010 at 10:23:57AM -0500, Ric Wheeler wrote:
>>
>> For permanent write errors, I would expect any modern drive to do a
>> sector remapping internally. We should never need to track this kind
>> of information for any modern device that I know of (S-ATA, SAS,
>> SSD's and raid arrays should all handle this).
>
> ... and if the device is run out of all of its blocks in its spare
> blocks pool, it's probably well past the time to replace said disk.
>
> BTW, I really liked Dave Chinner's summary of the issues involved; I
> ran into Kawai-san last week at Linux.conf.au, and we discussed pretty
> much the same thing over lunch.  (i.e., that it's a hard problem, and
> in some cases we need to retry the writes, such as a transient FC path
> problem --- but some kind of write throttling is critical or we could
> end up choking the VM due to too many pages getting dirtied and no way
> of cleaning them.)
>
> 						- Ted

Also note that retrying writes (or reads for that matter) is often
counterproductive. For those of us who have suffered through trying to
migrate data off of an old, failing disk onto a new, shiny one, excessive
retries can be painful...

ric

Nick Piggin Jan. 25, 2010, 5:55 p.m. UTC | #13
On Mon, Jan 25, 2010 at 12:47:23PM -0500, tytso@mit.edu wrote:
> On Mon, Jan 25, 2010 at 10:23:57AM -0500, Ric Wheeler wrote:
> > 
> > For permanent write errors, I would expect any modern drive to do a
> > sector remapping internally. We should never need to track this kind
> > of information for any modern device that I know of (S-ATA, SAS,
> > SSD's and raid arrays should all handle this).
> 
> ... and if the device is run out of all of its blocks in its spare
> blocks pool, it's probably well past the time to replace said disk.
> 
> BTW, I really liked Dave Chinner's summary of the issues involved; I
> ran into Kawai-san last week at Linux.conf.au, and we discussed pretty
> much the same thing over lunch.  (i.e., that it's a hard problem, and
> in some cases we need to retry the writes, such as a transient FC path
> problem --- but some kind of write throttling is critical or we could
> end up choking the VM due to too many pages getting dirtied and no way
> of cleaning them.)

Well I just don't think we can ever discard them by default. Therefore
we must default to not discarding them, therefore we need to solve or
work around the dirty page congestion problem somehow.


Nick Piggin Jan. 25, 2010, 5:59 p.m. UTC | #14
On Mon, Jan 25, 2010 at 12:50:19PM -0500, Ric Wheeler wrote:
> On 01/25/2010 12:47 PM, tytso@mit.edu wrote:
> >On Mon, Jan 25, 2010 at 10:23:57AM -0500, Ric Wheeler wrote:
> >>
> >>For permanent write errors, I would expect any modern drive to do a
> >>sector remapping internally. We should never need to track this kind
> >>of information for any modern device that I know of (S-ATA, SAS,
> >>SSD's and raid arrays should all handle this).
> >
> >... and if the device is run out of all of its blocks in its spare
> >blocks pool, it's probably well past the time to replace said disk.
> >
> >BTW, I really liked Dave Chinner's summary of the issues involved; I
> >ran into Kawai-san last week at Linux.conf.au, and we discussed pretty
> >much the same thing over lunch.  (i.e., that it's a hard problem, and
> >in some cases we need to retry the writes, such as a transient FC path
> >problem --- but some kind of write throttling is critical or we could
> >end up choking the VM due to too many pages getting dirtied and no way
> >of cleaning them.)
> >
> >						- Ted
> 
> Also note that retrying writes (or reads for that matter) is often
> counterproductive. For those of us who have suffered through trying to
> migrate data off of an old, failing disk onto a new, shiny one,
> excessive retries can be painful...

That is probably true most of the time. So some sane defaults should
be attempted that work for most cases.

After that, retrying I was imagining should be driven by the
application. So: attempting to read or fsync again.

What should not happen is for the page to be marked !dirty or !uptodate.
This randomly breaks write to read consistency without necessarily even
any error reported, so it seems really hard for an app to do the right
thing there.

Dave Chinner Jan. 26, 2010, 6:19 a.m. UTC | #15
On Tue, Jan 26, 2010 at 04:55:30AM +1100, Nick Piggin wrote:
> On Mon, Jan 25, 2010 at 12:47:23PM -0500, tytso@mit.edu wrote:
> > On Mon, Jan 25, 2010 at 10:23:57AM -0500, Ric Wheeler wrote:
> > > 
> > > For permanent write errors, I would expect any modern drive to do a
> > > sector remapping internally. We should never need to track this kind
> > > of information for any modern device that I know of (S-ATA, SAS,
> > > SSD's and raid arrays should all handle this).
> > 
> > ... and if the device is run out of all of its blocks in its spare
> > blocks pool, it's probably well past the time to replace said disk.
> > 
> > BTW, I really liked Dave Chinner's summary of the issues involved; I
> > ran into Kawai-san last week at Linux.conf.au, and we discussed pretty
> > much the same thing over lunch.  (i.e., that it's a hard problem, and
> > in some cases we need to retry the writes, such as a transient FC path
> > problem --- but some kind of write throttling is critical or we could
> > end up choking the VM due to too many pages getting dirtied and no way
> > of cleaning them.)
> 
> Well I just don't think we can ever discard them by default.

We have done this for a long time in XFS. e.g. If we can't issue IO
on the page (e.g. allocation fails or we are in a shutdown
situation already) we invalidate the page immediately, clear the
page uptodate flag and return an error to mark the address space
with an error. See xfs_page_state_convert() for more detail.
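In generic terms the shape is roughly (not the actual
xfs_page_state_convert() code, just the effect described above):

	/* ->writepage path, after failing to issue the IO on the page */
	block_invalidatepage(page, 0);
	ClearPageUptodate(page);
	mapping_set_error(page->mapping, -EIO);
	return -EIO;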

And besides, if there is an error of some kind sufficient to shut
down the filesystem, the last thing you want to do is write more
data to it and potentially make the problem worse, especially if
async transactions that the data write might rely on were cancelled
by the shutdown rather than pushed to disk....

> Therefore
> we must default to not discarding them, therefore we need to solve or
> work around the dirty page congestion problem somehow.

Agreed. XFS treats data IO errors the way it does because that's the only
thing we can do right now if we want the system to continue to function
in the face of IO errors....

Cheers,

Dave.

Patch

diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
index 27967f9..5dc5ccf 100644
--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -156,6 +156,18 @@  read_block_bitmap(struct super_block *sb, unsigned int block_group)
 	if (likely(bh_uptodate_or_lock(bh)))
 		return bh;
 
+	/*
+	 * uptodate flag may have been cleared by a previous (transient)
+	 * write IO error.  In this case, we don't want to reread its
+	 * old on-disk data.  Actually the buffer has the latest data,
+	 * so set uptodate flag again.
+	 */
+	if (buffer_write_io_error(bh)) {
+		set_buffer_uptodate(bh);
+		unlock_buffer(bh);
+		return bh;
+	}
+
 	if (bh_submit_read(bh) < 0) {
 		brelse(bh);
 		ext3_error(sb, __func__,
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 455e6e6..67d7849 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1077,10 +1077,23 @@  struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
 		return bh;
 	if (buffer_uptodate(bh))
 		return bh;
+
+	/*
+	 * uptodate flag may have been cleared by a previous (transient)
+	 * write IO error.  In this case, we don't want to reread its
+	 * old on-disk data.  Actually the buffer has the latest data,
+	 * so set uptodate flag again.
+	 */
+	if (buffer_write_io_error(bh)) {
+		set_buffer_uptodate(bh);
+		return bh;
+	}
+
 	ll_rw_block(READ_META, 1, &bh);
 	wait_on_buffer(bh);
 	if (buffer_uptodate(bh))
 		return bh;
+
 	put_bh(bh);
 	*err = -EIO;
 	return NULL;
diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 7b0e44f..7ed8e45 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -909,7 +909,20 @@  restart:
 				num++;
 				bh = ext3_getblk(NULL, dir, b++, 0, &err);
 				bh_use[ra_max] = bh;
-				if (bh)
+				if (!bh || buffer_uptodate(bh))
+					continue;
+
+				/*
+				 * uptodate flag may have been cleared by a
+				 * previous (transient) write IO error.  In
+				 * this case, we don't want to reread its
+				 * old on-disk data.  Actually the buffer
+				 * has the latest data, so set uptodate flag
+				 * again.
+				 */
+				if (buffer_write_io_error(bh))
+					set_buffer_uptodate(bh);
+				else
 					ll_rw_block(READ_META, 1, &bh);
 			}
 		}