Patchwork [2/3] ext4: introduce ext4_error_remove_page

login
register
mail settings
Submitter Naoya Horiguchi
Date Oct. 25, 2012, 3:12 p.m.
Message ID <1351177969-893-3-git-send-email-n-horiguchi@ah.jp.nec.com>
Download mbox | patch
Permalink /patch/194218/
State New
Headers show

Comments

Naoya Horiguchi - Oct. 25, 2012, 3:12 p.m.
Ext4 has its own configurable error handling policy, so it's helpful
if we can use it also in the context of memory error handling.
With this patch, when we detect a memory error on a dirty pagecache in
ext4 filesystem, we can allow users to choose to trigger kernel panic
to avoid consuming corrupted data.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 fs/ext4/inode.c | 35 +++++++++++++++++++++++++++++++----
 1 file changed, 31 insertions(+), 4 deletions(-)
Jan Kara - Oct. 25, 2012, 7:39 p.m.
On Thu 25-10-12 11:12:48, Naoya Horiguchi wrote:
> Ext4 has its own configurable error handling policy, so it's helpful
> if we can use it also in the context of memory error handling.
> With this patch, when we detect a memory error on a dirty pagecache in
> ext4 filesystem, we can allow users to choose to trigger kernel panic
> to avoid consuming corrupted data.
  OK, I've checked and memory_failure() function guarantees page->mapping
is !NULL. So I'm OK with this patch. You can add:
  Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  fs/ext4/inode.c | 35 +++++++++++++++++++++++++++++++----
>  1 file changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git v3.7-rc2.orig/fs/ext4/inode.c v3.7-rc2/fs/ext4/inode.c
> index b3c243b..513badb 100644
> --- v3.7-rc2.orig/fs/ext4/inode.c
> +++ v3.7-rc2/fs/ext4/inode.c
> @@ -3163,6 +3163,33 @@ static int ext4_journalled_set_page_dirty(struct page *page)
>  	return __set_page_dirty_nobuffers(page);
>  }
>  
> +static int ext4_error_remove_page(struct address_space *mapping,
> +				struct page *page)
> +{
> +	struct inode *inode = mapping->host;
> +	struct buffer_head *bh, *head;
> +	ext4_fsblk_t block;
> +
> +	if (!PageDirty(page) || !page_has_buffers(page))
> +		goto remove_page;
> +
> +	/* Lost data. Handle as critical fs error. */
> +	bh = head = page_buffers(page);
> +	do {
> +		if (buffer_dirty(bh) && !buffer_delay(bh)) {
> +			block = bh->b_blocknr;
> +			EXT4_ERROR_INODE_BLOCK(inode, block,
> +						"Removing dirty pagecache page");
> +		} else
> +			EXT4_ERROR_INODE(inode,
> +					"Removing dirty pagecache page");
> +		bh = bh->b_this_page;
> +	} while (bh != head);
> +
> +remove_page:
> +	return generic_error_remove_page(mapping, page);
> +}
> +
>  static const struct address_space_operations ext4_ordered_aops = {
>  	.readpage		= ext4_readpage,
>  	.readpages		= ext4_readpages,
> @@ -3175,7 +3202,7 @@ static const struct address_space_operations ext4_ordered_aops = {
>  	.direct_IO		= ext4_direct_IO,
>  	.migratepage		= buffer_migrate_page,
>  	.is_partially_uptodate  = block_is_partially_uptodate,
> -	.error_remove_page	= generic_error_remove_page,
> +	.error_remove_page	= ext4_error_remove_page,
>  };
>  
>  static const struct address_space_operations ext4_writeback_aops = {
> @@ -3190,7 +3217,7 @@ static const struct address_space_operations ext4_writeback_aops = {
>  	.direct_IO		= ext4_direct_IO,
>  	.migratepage		= buffer_migrate_page,
>  	.is_partially_uptodate  = block_is_partially_uptodate,
> -	.error_remove_page	= generic_error_remove_page,
> +	.error_remove_page	= ext4_error_remove_page,
>  };
>  
>  static const struct address_space_operations ext4_journalled_aops = {
> @@ -3205,7 +3232,7 @@ static const struct address_space_operations ext4_journalled_aops = {
>  	.releasepage		= ext4_releasepage,
>  	.direct_IO		= ext4_direct_IO,
>  	.is_partially_uptodate  = block_is_partially_uptodate,
> -	.error_remove_page	= generic_error_remove_page,
> +	.error_remove_page	= ext4_error_remove_page,
>  };
>  
>  static const struct address_space_operations ext4_da_aops = {
> @@ -3221,7 +3248,7 @@ static const struct address_space_operations ext4_da_aops = {
>  	.direct_IO		= ext4_direct_IO,
>  	.migratepage		= buffer_migrate_page,
>  	.is_partially_uptodate  = block_is_partially_uptodate,
> -	.error_remove_page	= generic_error_remove_page,
> +	.error_remove_page	= ext4_error_remove_page,
>  };
>  
>  void ext4_set_aops(struct inode *inode)
> -- 
> 1.7.11.7
>
Theodore Ts'o - Oct. 26, 2012, 6:12 a.m.
On Thu, Oct 25, 2012 at 11:12:48AM -0400, Naoya Horiguchi wrote:
> +	/* Lost data. Handle as critical fs error. */
> +	bh = head = page_buffers(page);
> +	do {
> +		if (buffer_dirty(bh) && !buffer_delay(bh)) {
> +			block = bh->b_blocknr;
> +			EXT4_ERROR_INODE_BLOCK(inode, block,
> +						"Removing dirty pagecache page");
> +		} else
> +			EXT4_ERROR_INODE(inode,
> +					"Removing dirty pagecache page");

One of the side effects of calling EXT4_ERROR_INODE (or ext3_error in
your ext3 patch), it sets the "file system is corrupt" bit which
forces the file system to be fsck'ed at the next boot.

If this is just a memory error, it's not clear that this is the right
thing to have happen.  It's also not clear what the benefit would be
is of forcing a reboot in the errors=panic case.  If the file system
is corrupt, forcing a panic and reboot is useful because it allows a
file system to get checked instead of allowing the system to continue
on and perhaps cause more data loss.

But if what happened is that there was a hard ECC error on a page,
we've already lost data.  Forcing a reboot isn't going to make things
better; and if you force an e2fsck, it will just increase the system's
downtime.  It's also not entirely clear that throwing away the page is
the right thing to do, either, by the way.  If you have a hard ECC
error, then there has might be a two or three bits that have gotten
flipped on that page.  But by throwing the dirty page entirely, we're
throwing away 4k worth of data.

If we go back to first principles, what do we want to do?  We want the
system administrator to know that a file might be potentially
corrupted.  And perhaps, if a program tries to read from that file, it
should get an error.  If we have a program that has that file mmap'ed
at the time of the error, perhaps we should kill the program with some
kind of signal.  But to force a reboot of the entire system?  Or to
remounte the file system read-only?  That seems to be completely
disproportionate for what might be 2 or 3 bits getting flipped in a
page cache for a file.

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Luck, Tony - Oct. 26, 2012, 4:55 p.m.
> If we go back to first principles, what do we want to do?  We want the
> system administrator to know that a file might be potentially
> corrupted.  And perhaps, if a program tries to read from that file, it
> should get an error.  If we have a program that has that file mmap'ed
> at the time of the error, perhaps we should kill the program with some
> kind of signal.  But to force a reboot of the entire system?  Or to
> remounte the file system read-only?  That seems to be completely
> disproportionate for what might be 2 or 3 bits getting flipped in a
> page cache for a file.

I think that we know that the file *is* corrupted, not just "potentially".
We probably know the location of the corruption to cache-line granularity.
Perhaps better on systems where we have access to ecc syndrome bits,
perhaps worse ... we do have some errors where the low bits of the address
are not known.

I'm in total agreement that forcing a reboot or fsck is unhelpful here.

But what should we do?  We don't want to let the error be propagated. That
could cause a cascade of more failures as applications make bad decisions
based on the corrupted data.

Perhaps we could ask the filesystem to move the file to a top-level
"corrupted" directory (analogous to "lost+found") with some attached
metadata to help recovery tools know where the file came from, and the
range of corrupted bytes in the file? We'd also need to invalidate existing
open file descriptors (or less damaging - flag them to avoid the corrupted
area??). Whatever we do, it needs to be persistent across a reboot ... the
lost bits are not going to magically heal themselves.

We already have code to send SIGBUS to applications that have the
corrupted page mmap(2)'d (see mm/memory-failure.c).

Other ideas?

-Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Oct. 26, 2012, 6:46 p.m.
On Fri, Oct 26, 2012 at 04:55:01PM +0000, Luck, Tony wrote:
> 
> I think that we know that the file *is* corrupted, not just "potentially".
> We probably know the location of the corruption to cache-line granularity.
> Perhaps better on systems where we have access to ecc syndrome bits,
> perhaps worse ... we do have some errors where the low bits of the address
> are not known.

Well, it's at least *possible* that it was only the ECC bits that got
flipped.  :-) Not likely, I'll grant!  (Or does the motherboard zero
out the entire cache-line on a hard ECC failure?)

> I'm in total agreement that forcing a reboot or fsck is unhelpful here.
> 
> But what should we do?  We don't want to let the error be propagated. That
> could cause a cascade of more failures as applications make bad decisions
> based on the corrupted data.
> 
> Perhaps we could ask the filesystem to move the file to a top-level
> "corrupted" directory (analogous to "lost+found") with some attached
> metadata to help recovery tools know where the file came from, and the
> range of corrupted bytes in the file? We'd also need to invalidate existing
> open file descriptors (or less damaging - flag them to avoid the corrupted
> area??). Whatever we do, it needs to be persistent across a reboot ... the
> lost bits are not going to magically heal themselves.

Well, we could set a new attribute bit on the file which indicates
that the file has been corrupted, and this could cause any attempts to
open the file to return some error until the bit has been cleared.
This would persist across reboots.  The only problem is that system
administrators might get very confused (at least at first, when they
first run a kernel or a distribution which has this feature enabled).
Application programs could also get very confused when any attempt to
open or read from a file suddenly returned some new error code (EIO,
or should we designate a new errno code for this purpose, so there is
a better indication of what the heck was going on?)

Also, if we just log the message in dmesg, if the system administrator
doesn't find the "this file is corrupted" bit right away, they might
not be able to determine which part of the file was corrupted.  How
important is this?  If the file system supports extended attributes,
should we attempt to attach a new extended attribute with information
about the ECC failure?

I'm not sure it's worth it to go to these extents, but I could imagine
some customers wanting to have this sort of information.  Do we know
what their "nice to have" / "must have" requirements might be?

     	       	       	       	    	 - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Naoya Horiguchi - Oct. 26, 2012, 6:50 p.m.
On Fri, Oct 26, 2012 at 02:12:06AM -0400, Theodore Ts'o wrote:
> On Thu, Oct 25, 2012 at 11:12:48AM -0400, Naoya Horiguchi wrote:
> > +	/* Lost data. Handle as critical fs error. */
> > +	bh = head = page_buffers(page);
> > +	do {
> > +		if (buffer_dirty(bh) && !buffer_delay(bh)) {
> > +			block = bh->b_blocknr;
> > +			EXT4_ERROR_INODE_BLOCK(inode, block,
> > +						"Removing dirty pagecache page");
> > +		} else
> > +			EXT4_ERROR_INODE(inode,
> > +					"Removing dirty pagecache page");
> 
> One of the side effects of calling EXT4_ERROR_INODE (or ext3_error in
> your ext3 patch), it sets the "file system is corrupt" bit which
> forces the file system to be fsck'ed at the next boot.
> 
> If this is just a memory error, it's not clear that this is the right
> thing to have happen.  It's also not clear what the benefit would be
> is of forcing a reboot in the errors=panic case.  If the file system
> is corrupt, forcing a panic and reboot is useful because it allows a
> file system to get checked instead of allowing the system to continue
> on and perhaps cause more data loss.

Let me explain what I'm worry about.
Once memory error hits a dirty pagecache page, the page itself is
removed from pagecache tree and will never be accessed later.
But the problem is that after memory error handling is done,
not all processes can know about that error event.
As a result, a process which is not aware of the error tries to read
the old data for disk, and data corruption spreads by using wrong data.
I think that forcing rebooting is effective to prevent such spread of
data corruption.

> 
> But if what happened is that there was a hard ECC error on a page,
> we've already lost data.  Forcing a reboot isn't going to make things
> better; and if you force an e2fsck, it will just increase the system's
> downtime.  It's also not entirely clear that throwing away the page is
> the right thing to do, either, by the way.  If you have a hard ECC
> error, then there has might be a two or three bits that have gotten
> flipped on that page.  But by throwing the dirty page entirely, we're
> throwing away 4k worth of data.
> 
> If we go back to first principles, what do we want to do?  We want the
> system administrator to know that a file might be potentially
> corrupted.  And perhaps, if a program tries to read from that file, it
> should get an error.  If we have a program that has that file mmap'ed
> at the time of the error, perhaps we should kill the program with some
> kind of signal.  But to force a reboot of the entire system?  Or to
> remounte the file system read-only?  That seems to be completely
> disproportionate for what might be 2 or 3 bits getting flipped in a
> page cache for a file.

In order to completely solve this problem, I'm thinking about another
approach without rebooting. Roughly saying, it introduces additional
tag in pagecache tree to keep error-affected address of the error-affected
file untouchable until we 'recover' the address range (This is based
on the provious discussion in https://lkml.org/lkml/2012/9/2/194.)
But I should be careful for many details for the patch and will take
some time, so I want to start with a simpler approach.

Thanks,
Naoya
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Luck, Tony - Oct. 26, 2012, 10:24 p.m.
> Well, we could set a new attribute bit on the file which indicates
> that the file has been corrupted, and this could cause any attempts to
> open the file to return some error until the bit has been cleared.

That sounds a lot better than renaming/moving the file.

> This would persist across reboots.  The only problem is that system
> administrators might get very confused (at least at first, when they
> first run a kernel or a distribution which has this feature enabled).

Yes. This would require some education. But new attributes have been
added in the past (e.g. immutable) that caused confusion to users and
tools that didn't know about them.

> Application programs could also get very confused when any attempt to
> open or read from a file suddenly returned some new error code (EIO,
> or should we designate a new errno code for this purpose, so there is
> a better indication of what the heck was going on?)

EIO sounds wrong ... but it is perhaps the best of the existing codes. Adding
a new one is also challenging too.

> Also, if we just log the message in dmesg, if the system administrator
> doesn't find the "this file is corrupted" bit right away

This is pretty much a given. Nobody will see the message in the console log
until it is far too late.

> I'm not sure it's worth it to go to these extents, but I could imagine
> some customers wanting to have this sort of information.  Do we know
> what their "nice to have" / "must have" requirements might be?

18 years ago Intel rather famously attempted to sell users on the idea that a
rare divide error that sometimes gave the wrong answer could be ignored. Before
my time at Intel, but it is still burned into the corporate psyche that customers
really don't like to get the wrong answers from their computers.

Whether it is worth it may depend on the relative frequency of data being
corrupted this way, compared to all the other ways that it might get messed
up. If it were a thousand times more likely that data got silently corrupted
on its path to media, sitting spinning on the media, and then back off the
drive again - then all this fancy stuff wouldn't make any real difference.
I have no data on the relative error rates of memory and i/o - so I can't
answer this.

-Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Oct. 27, 2012, 10:16 p.m.
On Fri, Oct 26, 2012 at 10:24:23PM +0000, Luck, Tony wrote:
> > Well, we could set a new attribute bit on the file which indicates
> > that the file has been corrupted, and this could cause any attempts to
> > open the file to return some error until the bit has been cleared.
> 
> That sounds a lot better than renaming/moving the file.

What I would recommend is adding a 

#define FS_CORRUPTED_FL		0x01000000 /* File is corrupted */

... and which could be accessed and cleared via the lsattr and chattr
programs.

> > Application programs could also get very confused when any attempt to
> > open or read from a file suddenly returned some new error code (EIO,
> > or should we designate a new errno code for this purpose, so there is
> > a better indication of what the heck was going on?)
> 
> EIO sounds wrong ... but it is perhaps the best of the existing codes. Adding
> a new one is also challenging too.

I think we really need a different error code from EIO; it's already
horribly overloaded already, and if this is new behavior when the
customers get confused and call up the distribution help desk, they
won't thank us if we further overload EIO.  This is abusing one of the
System V stream errno's, but no one else is using it:

#define EADV		 68  /* Advertise error */

I note that we've already added a new error code:

#define EHWPOISON 133	  /* Memory page has hardware error */

... although the glibc shipping with Debian testing hasn't been taught
what it is, so strerror(EHWPOISON) returns "Unknown error 133".  We
could simply allow open(2) and stat(2) return this error, although I
wonder if we're just better off defining a new error code.

> 18 years ago Intel rather famously attempted to sell users on the
> idea that a rare divide error that sometimes gave the wrong answer
> could be ignored. Before my time at Intel, but it is still burned
> into the corporate psyche that customers really don't like to get
> the wrong answers from their computers.

... and yet, people are generally not willing to pay the few extra
dollars for ECC memory, such that even if I want ECC for a laptop or a
desktop machine, it's generally not available without paying $$$$ for
a server-class motherboard.  :-(

The lesson I'd take from that incident is that customers really hate
it when it's trivial to reproduce the error, especially using the
something as simple and universal as the Windows Calculator
application.

Anyway, that's neither here nor there.  Perhaps it's enough to simply
log an error with a sufficient level of severity that it gets saved in
log files, at least for now.

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Naoya Horiguchi - Oct. 28, 2012, 1:57 a.m.
Hi Ted,

On Sat, Oct 27, 2012 at 06:16:26PM -0400, Theodore Ts'o wrote:
> On Fri, Oct 26, 2012 at 10:24:23PM +0000, Luck, Tony wrote:
> > > Well, we could set a new attribute bit on the file which indicates
> > > that the file has been corrupted, and this could cause any attempts to
> > > open the file to return some error until the bit has been cleared.
> > 
> > That sounds a lot better than renaming/moving the file.
> 
> What I would recommend is adding a 
> 
> #define FS_CORRUPTED_FL		0x01000000 /* File is corrupted */
> 
> ... and which could be accessed and cleared via the lsattr and chattr
> programs.

Thank you for the info. This could help my next work.

> > > Application programs could also get very confused when any attempt to
> > > open or read from a file suddenly returned some new error code (EIO,
> > > or should we designate a new errno code for this purpose, so there is
> > > a better indication of what the heck was going on?)
> > 
> > EIO sounds wrong ... but it is perhaps the best of the existing codes. Adding
> > a new one is also challenging too.
> 
> I think we really need a different error code from EIO; it's already
> horribly overloaded already, and if this is new behavior when the
> customers get confused and call up the distribution help desk, they
> won't thank us if we further overload EIO.  This is abusing one of the
> System V stream errno's, but no one else is using it:
> 
> #define EADV		 68  /* Advertise error */
> 
> I note that we've already added a new error code:
> 
> #define EHWPOISON 133	  /* Memory page has hardware error */
> 
> ... although the glibc shipping with Debian testing hasn't been taught
> what it is, so strerror(EHWPOISON) returns "Unknown error 133".  We
> could simply allow open(2) and stat(2) return this error, although I
> wonder if we're just better off defining a new error code.

Whether we use EIO or EHWPOISON seems to be controversial. Andi likes
to use EIO because we can handle memory errors and legacy I/O errors in
the similar and integrated manner.
But personally, it's OK for me to use EHWPOISON. Obviously defining this
error code in glibc is a necessary step if we go in this direction.

Thanks,
Naoya
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner - Oct. 29, 2012, 1:16 a.m.
On Sat, Oct 27, 2012 at 06:16:26PM -0400, Theodore Ts'o wrote:
> On Fri, Oct 26, 2012 at 10:24:23PM +0000, Luck, Tony wrote:
> > > Well, we could set a new attribute bit on the file which indicates
> > > that the file has been corrupted, and this could cause any attempts to
> > > open the file to return some error until the bit has been cleared.
> > 
> > That sounds a lot better than renaming/moving the file.
> 
> What I would recommend is adding a 
> 
> #define FS_CORRUPTED_FL		0x01000000 /* File is corrupted */
> 
> ... and which could be accessed and cleared via the lsattr and chattr
> programs.

Except that there are filesystems that cannot implement such flags,
or require on-disk format changes to add more of those flags. This
is most definitely not a filesystem specific behaviour, so any sort
of VFS level per-file state needs to be kept in xattrs, not special
flags. Filesystems are welcome to optimise the storage of such
special xattrs (e.g. down to a single boolean flag in an inode), but
using a flag for something that dould, in fact, storage the exactly
offset and length of the corruption is far better than just storing
a "something is corrupted in this file" bit....

> > > Application programs could also get very confused when any attempt to
> > > open or read from a file suddenly returned some new error code (EIO,
> > > or should we designate a new errno code for this purpose, so there is
> > > a better indication of what the heck was going on?)
> > 
> > EIO sounds wrong ... but it is perhaps the best of the existing codes. Adding
> > a new one is also challenging too.
> 
> I think we really need a different error code from EIO; it's already
> horribly overloaded already, and if this is new behavior when the
> customers get confused and call up the distribution help desk, they
> won't thank us if we further overload EIO.  This is abusing one of the
> System V stream errno's, but no one else is using it:
> 
> #define EADV		 68  /* Advertise error */
> 
> I note that we've already added a new error code:
> 
> #define EHWPOISON 133	  /* Memory page has hardware error */
> 
> ... although the glibc shipping with Debian testing hasn't been taught
> what it is, so strerror(EHWPOISON) returns "Unknown error 133".  We
> could simply allow open(2) and stat(2) return this error, although I
> wonder if we're just better off defining a new error code.

If we are going to add special new "file corrupted" errors, we
should add EFSCORRUPTED (i.e. "filesystem corrupted") at the same
time....

Cheers,

Dave.
Theodore Ts'o - Oct. 29, 2012, 2:40 a.m.
On Mon, Oct 29, 2012 at 12:16:32PM +1100, Dave Chinner wrote:
> 
> Except that there are filesystems that cannot implement such flags,
> or require on-disk format changes to add more of those flags. This
> is most definitely not a filesystem specific behaviour, so any sort
> of VFS level per-file state needs to be kept in xattrs, not special
> flags. Filesystems are welcome to optimise the storage of such
> special xattrs (e.g. down to a single boolean flag in an inode), but
> using a flag for something that dould, in fact, storage the exactly
> offset and length of the corruption is far better than just storing
> a "something is corrupted in this file" bit....

Agreed, if we're going to add an xattr, then we might as well store
not just a boolean, but some indication of what part of the file was
corrupted.  The only complication is what if there are many memory
corruptions.  Do we store just the last ECC hard error that we
detected?  Or just the first?

It wasn't clear to me it was worth the extra complexity, but if there
are indeed for file systems that don't have or don't want to allocate
a spare bit in their inode structure, that might be a good enough
justification to add an xattr.  (Was this a hypothetical, or does this
constraint apply to XFS or some other file system that you're aware of?)

> > I note that we've already added a new error code:
> > 
> > #define EHWPOISON 133	  /* Memory page has hardware error */
> > 
> > ... although the glibc shipping with Debian testing hasn't been taught
> > what it is, so strerror(EHWPOISON) returns "Unknown error 133".  We
> > could simply allow open(2) and stat(2) return this error, although I
> > wonder if we're just better off defining a new error code.
> 
> If we are going to add special new "file corrupted" errors, we
> should add EFSCORRUPTED (i.e. "filesystem corrupted") at the same
> time....

I would dearly love it if we could allocate a new EFSCORRUPTED errno.
I was about to follow XFS's lead and change ext4 to return EUCLEAN
instead of EIO in the cases of fs corruption, but that really is ugly
and gross...

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andi Kleen - Oct. 29, 2012, 10:37 a.m.
Theodore Ts'o <tytso@mit.edu> writes:

> On Mon, Oct 29, 2012 at 12:16:32PM +1100, Dave Chinner wrote:
>> 
>> Except that there are filesystems that cannot implement such flags,
>> or require on-disk format changes to add more of those flags. This
>> is most definitely not a filesystem specific behaviour, so any sort
>> of VFS level per-file state needs to be kept in xattrs, not special
>> flags. Filesystems are welcome to optimise the storage of such
>> special xattrs (e.g. down to a single boolean flag in an inode), but
>> using a flag for something that dould, in fact, storage the exactly
>> offset and length of the corruption is far better than just storing
>> a "something is corrupted in this file" bit....
>
> Agreed, if we're going to add an xattr, then we might as well store

I don't think an xattr makes sense for this. It's sufficient to keep
this state in memory.

In general these error paths are hard to test and it's important
to keep them as simple as possible. Doing IO and other complexities
just doesn't make sense. Just have the simplest possible path
that can do the job.

> not just a boolean, but some indication of what part of the file was

You're overdesigning I think.

-Andi
Jun'ichi Nomura (NEC) - Oct. 29, 2012, 11:05 a.m.
On 10/29/12 19:37, Andi Kleen wrote:
> Theodore Ts'o <tytso@mit.edu> writes:
>> On Mon, Oct 29, 2012 at 12:16:32PM +1100, Dave Chinner wrote:
>>> Except that there are filesystems that cannot implement such flags,
>>> or require on-disk format changes to add more of those flags. This
>>> is most definitely not a filesystem specific behaviour, so any sort
>>> of VFS level per-file state needs to be kept in xattrs, not special
>>> flags. Filesystems are welcome to optimise the storage of such
>>> special xattrs (e.g. down to a single boolean flag in an inode), but
>>> using a flag for something that dould, in fact, storage the exactly
>>> offset and length of the corruption is far better than just storing
>>> a "something is corrupted in this file" bit....
>>
>> Agreed, if we're going to add an xattr, then we might as well store
> 
> I don't think an xattr makes sense for this. It's sufficient to keep
> this state in memory.
> 
> In general these error paths are hard to test and it's important
> to keep them as simple as possible. Doing IO and other complexities
> just doesn't make sense. Just have the simplest possible path
> that can do the job.

And since it's difficult to prove, I think it's nice to have an
option to panic if the memory error was on dirty page cache.

It's theoretically same as disk I/O error; dirty cache is marked invalid
and next read will go to disk.
Though in practice, the next read will likely to fail if disk was broken.
(Given that transient errors are usually recovered by retries and fail-overs
 in storage stack and not visible to applications which don't care.)
So it's "consistent" in some sense.
OTOH, the next read will likely succeed reading old data from disk
in case of the memory error.
I'm afraid the read-after-write inconsistency could cause silent data
corruption.
Luck, Tony - Oct. 29, 2012, 6:11 p.m.
> What I would recommend is adding a 
>
> #define FS_CORRUPTED_FL		0x01000000 /* File is corrupted */
>
> ... and which could be accessed and cleared via the lsattr and chattr
> programs.

Good - but we need some space to save the corrupted range information
too. These errors should be quite rare, so one range per file should be
enough.

New file systems should plan to add space in their on-disk format. The
corruption isn't going to go away across a reboot.

-Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Oct. 29, 2012, 6:24 p.m.
On Mon, Oct 29, 2012 at 03:37:05AM -0700, Andi Kleen wrote:
> > Agreed, if we're going to add an xattr, then we might as well store
> 
> I don't think an xattr makes sense for this. It's sufficient to keep
> this state in memory.
> 
> In general these error paths are hard to test and it's important
> to keep them as simple as possible. Doing IO and other complexities
> just doesn't make sense. Just have the simplest possible path
> that can do the job.

It's actually pretty easy to test this particular one, and certainly
one of the things I'd strongly encourage in this patch series is the
introduction of an interface via madvise and fadvise that allows us to
simulate an ECC hard error event.  So I don't think "it's hard to
test" is a reason not to do the right thing.  Let's make it easy to
test, and the include it in xfstests.  All of the core file systems
are regularly running regression tests via xfstests, so if we define a
standard way of testing this function (this is why I suggested
fadvise/madvise instead of an ioctl, but in a pinch we could do this
via an ioctl instead).

Note that the problem that we're dealing with is buffered writes; so
it's quite possible that the process which wrote the file, thus
dirtying the page cache, has already exited; so there's no way we can
guarantee we can inform the process which wrote the file via a signal
or a error code return.  It's also possible that part of the file has
been written out to the disk, so forcibly crashing the system and
rebooting isn't necessarily going to save the file from being
corrupted.

Also, if you're going to keep this state in memory, what happens if
the inode gets pushed out of memory?  Do we pin the inode?  Or do we
just say that it's stored into memory until some non-deterministic
time (as far as userspace programs are concerned, but if they are
running in a tight memory cgroup, it might be very short time later)
suddenly the state gets lost?

I think it's fair that if there are file systems that don't have a
single bit they can allocate in the inode, we can either accept
Jun'ichi's suggest to just forcibly crash the system, or we can allow
the state to be stored in memory.  But I suspect the core file systems
that might be used by enterprise-class workloads will want to provide
something better.

I'm not that convinced that we need to insert an xattr; after all, not
all file systems support xattrs at all, and I think a single bit
indicating that the file has corrupted data is sufficient.  But I
think it would be useful to at least optionally support a persistent
storage of this bit.

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Oct. 29, 2012, 6:55 p.m.
On Mon 29-10-12 14:24:56, Ted Tso wrote:
> On Mon, Oct 29, 2012 at 03:37:05AM -0700, Andi Kleen wrote:
> > > Agreed, if we're going to add an xattr, then we might as well store
> > 
> > I don't think an xattr makes sense for this. It's sufficient to keep
> > this state in memory.
> > 
> > In general these error paths are hard to test and it's important
> > to keep them as simple as possible. Doing IO and other complexities
> > just doesn't make sense. Just have the simplest possible path
> > that can do the job.
> 
> It's actually pretty easy to test this particular one, and certainly
> one of the things I'd strongly encourage in this patch series is the
> introduction of an interface via madvise and fadvise that allows us to
> simulate an ECC hard error event.  So I don't think "it's hard to
> test" is a reason not to do the right thing.  Let's make it easy to
> test, and the include it in xfstests.  All of the core file systems
> are regularly running regression tests via xfstests, so if we define a
> standard way of testing this function (this is why I suggested
> fadvise/madvise instead of an ioctl, but in a pinch we could do this
> via an ioctl instead).
  Well, we have an error-injection framework which would be a better fit
for this functionality rather than some madvise / fadvise hack IMHO. And
test in xfstests can just check whether the error-injection code is
compiled in before running the test...

> Note that the problem that we're dealing with is buffered writes; so
> it's quite possible that the process which wrote the file, thus
> dirtying the page cache, has already exited; so there's no way we can
> guarantee we can inform the process which wrote the file via a signal
> or a error code return.  It's also possible that part of the file has
> been written out to the disk, so forcibly crashing the system and
> rebooting isn't necessarily going to save the file from being
> corrupted.
> 
> Also, if you're going to keep this state in memory, what happens if
> the inode gets pushed out of memory?  Do we pin the inode?  Or do we
> just say that it's stored into memory until some non-deterministic
> time (as far as userspace programs are concerned, but if they are
> running in a tight memory cgroup, it might be very short time later)
> suddenly the state gets lost?
> 
> I think it's fair that if there are file systems that don't have a
> single bit they can allocate in the inode, we can either accept
> Jun'ichi's suggest to just forcibly crash the system, or we can allow
> the state to be stored in memory.  But I suspect the core file systems
> that might be used by enterprise-class workloads will want to provide
> something better.
> 
> I'm not that convinced that we need to insert an xattr; after all, not
> all file systems support xattrs at all, and I think a single bit
> indicating that the file has corrupted data is sufficient.  But I
> think it would be useful to at least optionally support a persistent
> storage of this bit.
  Here I'd just note that the situation isn't that much different from a
plain old EIO we can hit during writeback. There we use in memory flag
(AS_EIO / PageError) and hope that an application which cares enough about
its data will call fsync() and thus get the error back. Sure it's not
ideal but so far nobody seemed to be bothered enough to improve on that. So
is it really fair to bother authors of this patch with it?

And regarding xattr - I think it's an overkill. Just recording the error in
the kernel log with enough detail and flagging the file (or even just the
filesystem) would be OK IMHO. Rest could be handled in userspace if someone
cares enough.

								Honza
Andi Kleen - Oct. 29, 2012, 7:07 p.m.
Theodore Ts'o <tytso@mit.edu> writes:
>
> It's actually pretty easy to test this particular one, 

Note the error can happen at any time.

> and certainly
> one of the things I'd strongly encourage in this patch series is the
> introduction of an interface via madvise

It already exists of course.

I would suggest to study the existing framework before more 
suggestions.

> simulate an ECC hard error event.  So I don't think "it's hard to
> test" is a reason not to do the right thing.  Let's make it easy to

What you can't test doesn't work. It's that simple.

And memory error handling is extremly hard to test. The errors
can happen at any time. It's not a well defined event.
There are test suites for it of course (mce-test, mce-inject[1]),
but they needed a lot of engineering effort to be at where
they are.

[1] despite the best efforts of some current RAS developers
at breaking it.

> Note that the problem that we're dealing with is buffered writes; so
> it's quite possible that the process which wrote the file, thus
> dirtying the page cache, has already exited; so there's no way we can
> guarantee we can inform the process which wrote the file via a signal
> or a error code return.

Is that any different from other IO errors? It doesn't need to 
be better.

> Also, if you're going to keep this state in memory, what happens if
> the inode gets pushed out of memory? 

You lose the error, just like you do today with any other IO error.

We had a lot of discussions on this when the memory error handling
was originally introduced, that was the conclusuion.

I don't think a special panic knob for this makes sense either.
We already have multiple panic knobs for memory errors, that
can be used.

-Andi
Naoya Horiguchi - Oct. 29, 2012, 9:47 p.m.
On Mon, Oct 29, 2012 at 12:07:04PM -0700, Andi Kleen wrote:
> Theodore Ts'o <tytso@mit.edu> writes:
...
> > Also, if you're going to keep this state in memory, what happens if
> > the inode gets pushed out of memory? 
> 
> You lose the error, just like you do today with any other IO error.
> 
> We had a lot of discussions on this when the memory error handling
> was originally introduced, that was the conclusuion.
> 
> I don't think a special panic knob for this makes sense either.
> We already have multiple panic knobs for memory errors, that
> can be used.

Yes. I understand that adding a new knob is not good.
So this patch uses the existing ext4 knob without adding new one.

Thanks,
Naoya
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jun'ichi Nomura (NEC) - Oct. 30, 2012, midnight
On 10/30/12 04:07, Andi Kleen wrote:
> Theodore Ts'o <tytso@mit.edu> writes:
>> Note that the problem that we're dealing with is buffered writes; so
>> it's quite possible that the process which wrote the file, thus
>> dirtying the page cache, has already exited; so there's no way we can
>> guarantee we can inform the process which wrote the file via a signal
>> or a error code return.
> 
> Is that any different from other IO errors? It doesn't need to 
> be better.

IMO, it is different in that next read from disk will likely succeed.
(and read corrupted data)
For IO errors come from disk failure, next read will likely fail
again so we don't have to remember it somewhere.

>> Also, if you're going to keep this state in memory, what happens if
>> the inode gets pushed out of memory? 
> 
> You lose the error, just like you do today with any other IO error.
Dave Chinner - Oct. 31, 2012, 12:21 a.m.
On Mon, Oct 29, 2012 at 06:11:58PM +0000, Luck, Tony wrote:
> > What I would recommend is adding a 
> >
> > #define FS_CORRUPTED_FL		0x01000000 /* File is corrupted */
> >
> > ... and which could be accessed and cleared via the lsattr and chattr
> > programs.
> 
> Good - but we need some space to save the corrupted range information
> too. These errors should be quite rare, so one range per file should be
> enough.
> 
> New file systems should plan to add space in their on-disk format. The
> corruption isn't going to go away across a reboot.

No, not at all. if you want to store something in the filesystem
permanently, then use xattrs. You cannot rely on the filesystem
being able to store random application specific data in their
on-disk format. That's the *exact purpose* that xattrs were
invented for - they are an extensible, user-defined, per-file
metadata storage mechanism that is not tied to the filesystem
on-disk format.

The kernel already makes extensive use of xattrs for such metadata -
just look at all the security and integrity code that uses xattrs to
store their application-specific metadata.  Hence *anything* that
the kernel wants to store on permanent storage should be using
xattrs because then the application has complete control of what is
stored without caring about what filesystem it is storing it on.

Cheers,

Dave.

Patch

diff --git v3.7-rc2.orig/fs/ext4/inode.c v3.7-rc2/fs/ext4/inode.c
index b3c243b..513badb 100644
--- v3.7-rc2.orig/fs/ext4/inode.c
+++ v3.7-rc2/fs/ext4/inode.c
@@ -3163,6 +3163,33 @@  static int ext4_journalled_set_page_dirty(struct page *page)
 	return __set_page_dirty_nobuffers(page);
 }
 
+static int ext4_error_remove_page(struct address_space *mapping,
+				struct page *page)
+{
+	struct inode *inode = mapping->host;
+	struct buffer_head *bh, *head;
+	ext4_fsblk_t block;
+
+	if (!PageDirty(page) || !page_has_buffers(page))
+		goto remove_page;
+
+	/* Lost data. Handle as critical fs error. */
+	bh = head = page_buffers(page);
+	do {
+		if (buffer_dirty(bh) && !buffer_delay(bh)) {
+			block = bh->b_blocknr;
+			EXT4_ERROR_INODE_BLOCK(inode, block,
+						"Removing dirty pagecache page");
+		} else
+			EXT4_ERROR_INODE(inode,
+					"Removing dirty pagecache page");
+		bh = bh->b_this_page;
+	} while (bh != head);
+
+remove_page:
+	return generic_error_remove_page(mapping, page);
+}
+
 static const struct address_space_operations ext4_ordered_aops = {
 	.readpage		= ext4_readpage,
 	.readpages		= ext4_readpages,
@@ -3175,7 +3202,7 @@  static const struct address_space_operations ext4_ordered_aops = {
 	.direct_IO		= ext4_direct_IO,
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate  = block_is_partially_uptodate,
-	.error_remove_page	= generic_error_remove_page,
+	.error_remove_page	= ext4_error_remove_page,
 };
 
 static const struct address_space_operations ext4_writeback_aops = {
@@ -3190,7 +3217,7 @@  static const struct address_space_operations ext4_writeback_aops = {
 	.direct_IO		= ext4_direct_IO,
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate  = block_is_partially_uptodate,
-	.error_remove_page	= generic_error_remove_page,
+	.error_remove_page	= ext4_error_remove_page,
 };
 
 static const struct address_space_operations ext4_journalled_aops = {
@@ -3205,7 +3232,7 @@  static const struct address_space_operations ext4_journalled_aops = {
 	.releasepage		= ext4_releasepage,
 	.direct_IO		= ext4_direct_IO,
 	.is_partially_uptodate  = block_is_partially_uptodate,
-	.error_remove_page	= generic_error_remove_page,
+	.error_remove_page	= ext4_error_remove_page,
 };
 
 static const struct address_space_operations ext4_da_aops = {
@@ -3221,7 +3248,7 @@  static const struct address_space_operations ext4_da_aops = {
 	.direct_IO		= ext4_direct_IO,
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate  = block_is_partially_uptodate,
-	.error_remove_page	= generic_error_remove_page,
+	.error_remove_page	= ext4_error_remove_page,
 };
 
 void ext4_set_aops(struct inode *inode)