Patchwork [JFFS2] load_xattr_datum need to return a positive number in case of unrecoverable error

login
register
mail settings
Submitter Jean-Christophe DUBOIS
Date April 11, 2012, 8:54 p.m.
Message ID <1334177689-19732-1-git-send-email-jcd@tribudubois.net>
Download mbox | patch
Permalink /patch/151881/
State New
Headers show

Comments

Jean-Christophe DUBOIS - April 11, 2012, 8:54 p.m.
As per load_xattr_datum() comment:
	rc < 0 : recoverable error, try again
	rc = 0 : success
	rc > 0 : Unrecoverable error, this node should be deleted.

For now we were only returning negative number (so recoverable error).
But a CRC failure or some inconsitent data seems fatal enough to
consider the attribute instance (version) as lost.

So this patch returns a positive number (1) when it detects an
unrecoverable error.

Signed-off-by: Jean-Christophe DUBOIS <jcd@tribudubois.net>
---
 fs/jffs2/xattr.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)
Artem Bityutskiy - April 22, 2012, 1:08 p.m.
On Wed, 2012-04-11 at 22:54 +0200, Jean-Christophe DUBOIS wrote:
> As per load_xattr_datum() comment:
> 	rc < 0 : recoverable error, try again
> 	rc = 0 : success
> 	rc > 0 : Unrecoverable error, this node should be deleted.
> 
> For now we were only returning negative number (so recoverable error).
> But a CRC failure or some inconsitent data seems fatal enough to
> consider the attribute instance (version) as lost.
> 
> So this patch returns a positive number (1) when it detects an
> unrecoverable error.
> 
> Signed-off-by: Jean-Christophe DUBOIS <jcd@tribudubois.net>

Looks sensible. But since I did not take your previous patch, you might
want to check whether this patch is really independent. Also, please,
tell  whether this fixes a real-life bug or you are fixing a complaint
of a static analysis tools or something like this? And please, tell how
you tested it.

Thanks!
Artem Bityutskiy - April 22, 2012, 1:16 p.m.
On Wed, 2012-04-11 at 22:54 +0200, Jean-Christophe DUBOIS wrote:
> As per load_xattr_datum() comment:
> 	rc < 0 : recoverable error, try again
> 	rc = 0 : success
> 	rc > 0 : Unrecoverable error, this node should be deleted.
> 
> For now we were only returning negative number (so recoverable error).
> But a CRC failure or some inconsitent data seems fatal enough to
> consider the attribute instance (version) as lost.
> 
> So this patch returns a positive number (1) when it detects an
> unrecoverable error.
> 
> Signed-off-by: Jean-Christophe DUBOIS <jcd@tribudubois.net>

Looks sensible. But since I did not take your previous patch, you might
want to check whether this patch is really independent. Also, please,
tell  whether this fixes a real-life bug or you are fixing a complaint
of a static analysis tools or something like this? And please, tell how
you tested it.

Thanks!
Jean-Christophe DUBOIS - April 22, 2012, 5:35 p.m.
On 22/04/2012 15:16, Artem Bityutskiy wrote:
> On Wed, 2012-04-11 at 22:54 +0200, Jean-Christophe DUBOIS wrote:
>> As per load_xattr_datum() comment:
>> 	rc<  0 : recoverable error, try again
>> 	rc = 0 : success
>> 	rc>  0 : Unrecoverable error, this node should be deleted.
>>
>> For now we were only returning negative number (so recoverable error).
>> But a CRC failure or some inconsitent data seems fatal enough to
>> consider the attribute instance (version) as lost.
>>
>> So this patch returns a positive number (1) when it detects an
>> unrecoverable error.
>>
>> Signed-off-by: Jean-Christophe DUBOIS<jcd@tribudubois.net>
> Looks sensible. But since I did not take your previous patch, you might
> want to check whether this patch is really independent.
I believe it is. At the moment the load_xttrr_datum() (and such) 
functions never return an unrecoverable error. All error code are 
negative (and therefore recoverable) which mean JFFS will not try to fix 
errors that are permanent. This also means there is some code in JFFS2 
that is never executed (the one that try to recover from unrecoverable 
errors).

Please note that this patch is more or less a "partial reverse" of the 
following patch (http://patchwork.ozlabs.org/patch/71147/) which was 
accepted in 2010-12-03. I am not sure this accepted patch was correct at 
the time. Granted, returning (positive) EIO might not have been a good 
idea but turning them in negative values was not necessarily the good 
solution.
>   Also, please,
> tell  whether this fixes a real-life bug or you are fixing a complaint
> of a static analysis tools or something like this?
>   And please, tell how
> you tested it.
>
> Thanks!
>
Artem Bityutskiy - April 25, 2012, 1:59 p.m.
On Sun, 2012-04-22 at 19:35 +0200, Jean-Christophe DUBOIS wrote:
> >   Also, please,
> > tell  whether this fixes a real-life bug or you are fixing a complaint
> > of a static analysis tools or something like this?
> >   And please, tell how
> > you tested it.

I'd like to get some answers to the above questions. The reason I am
asking is that JFFS2 is old and does not see too much maintanance
anymore, so I feel that it is important important to be careful about
merging something which does not fix a real life issue.
Jean-Christophe DUBOIS - April 26, 2012, 9:09 p.m.
On 25/04/2012 15:59, Artem Bityutskiy wrote:
> On Sun, 2012-04-22 at 19:35 +0200, Jean-Christophe DUBOIS wrote:
>>>    Also, please,
>>> tell  whether this fixes a real-life bug or you are fixing a complaint
>>> of a static analysis tools or something like this?
>>>    And please, tell how
>>> you tested it.
> I'd like to get some answers to the above questions.
Well, this patch certainly tries to fix a "real-life" bug.

It mainly intend to re-enable the xattr related self-healing code 
present in JFFS2.

So first, let's try to reproduce this bug in an emulated/controlled 
environment.

# mkdir /tmp/jffs2
# modprobe mtdblock
# modprobe nandsim first_id_byte=0x20 second_id_byte=0x33
# modprobe jffs2
# mount -t jffs2 /dev/mtdblock0 /tmp/jffs2

We create a file and add an attribute
# cp /etc/passwd /tmp/jffs2
# attr -s attr1 -V "TUTU" /tmp/jffs2/passwd

So now we are going to corrupt the attribute:
# umount /tmp/jffs2
# hexedit /dev/mtdblock0

Search for the string TUTU and replace it with tutu and save the all 
thing (Ctrl-X).

Now we remount the jffs2 file:
# mount -t jffs2 /dev/mtdblock0 /tmp/jffs2

And we check "dmesg"
# dmesg
...
[972157.159571] JFFS2 notice: (3038) jffs2_build_xattr_subsystem: 
complete building xattr subsystem, 1 of xdatum (0 unchecked, 0 orphan) 
and 1 of xref (0 dead, 0 orphan) found.

No error found so far. Ok, let's try to check the attribute:
# attr -l /tmp/jffs2/passwd
attr_list: Input/output error
Could not list "(null)" for /tmp/jffs2/passwd
# dmesg
...
[972244.173098] JFFS2 warning: (3068) do_load_xattr_datum: node CRC 
failed (JFFS2_NODETYPE_XREF) at 0xfc8600, read: 0xe46f3e72 calculated: 
0xd2fad3c6

Let's try again ...
# attr -l /tmp/jffs2/passwd
attr_list: Input/output error
Could not list "(null)" for /tmp/jffs2/passwd
#

If you unmount the file system and then remount it, you end up in the 
same situation. Nothing can cure the problem except deleting the file 
(or patching the data back to TUTU in the mtd device).

So, now we modify the jffs2 code to return positive values for 
unrecoverable error and recompile the module (run the usual make ...)

Now let's try the new jffs2 module on the corrupted mtd.
# umount /tmp/jffs2
# rmmod jffs2
# insmod /wherever/fs/jffs2/jffs2.ko
# mount -t jffs2 /dev/mtdblock0 /tmp/jffs2

Let's check for the attribute:
# attr -l /tmp/jffs2/passwd
# dmesg
...
[973190.247650] JFFS2 notice: (10804) jffs2_build_xattr_subsystem: 
complete building xattr subsystem, 1 of xdatum (0 unchecked, 0 orphan) 
and 1 of xref (0 dead, 0 orphan) found.
[973212.584130] JFFS2 warning: (10813) do_load_xattr_datum: node CRC 
failed (JFFS2_NODETYPE_XREF) at 0xfc8600, read: 0xe46f3e72 calculated: 
0xd2fad3c6

No error, this time in user space. The corrupted attribute has been 
automatically discarded from the file system and there are no attributes 
left for this file.

Let's try something else:

# attr -s attr1 -V "TITI" /tmp/jffs2/passwd
# attr -s attr1 -V "TOTO" /tmp/jffs2/passwd
# attr -g attr1 /tmp/jffs2/passwd
Attribute "attr1" had a 4 byte value for /tmp/jffs2/passwd:
TOTO

So this is the last value associated to "attr1". Now we are going to 
corrupt the attribute:
# umount /tmp/jffs2
# hexedit /dev/mtdblock0

Search for the string TOTO and replace it with toto and save the all 
thing (Ctrl-X).

Now we remount the jffs2 file:
# mount -t jffs2 /dev/mtdblock0 /tmp/jffs2

# attr -g attr1 /tmp/jffs2/passwd
Attribute "attr1" had a 4 byte value for /tmp/jffs2/passwd:
TITI

As we can see the attribute value is back to the only left good value. 
The TOTO and TUTU values have been discarded.
If we check dmesg, we find:
# dmesg:
[973559.872703] JFFS2 notice: (10913) jffs2_build_xattr_subsystem: 
complete building xattr subsystem, 3 of xdatum (0 unchecked, 0 orphan) 
and 3 of xref (0 dead, 0 orphan) found.
[973590.824723] JFFS2 warning: (10922) do_load_xattr_datum: node CRC 
failed (JFFS2_NODETYPE_XREF) at 0xfc8600, read: 0xe46f3e72 calculated: 
0xd2fad3c6
[973590.824746] JFFS2 warning: (10922) do_load_xattr_datum: node CRC 
failed (JFFS2_NODETYPE_XREF) at 0xfc8848, read: 0x08bce1ae calculated: 
0x3e290c1a

So the attributes are still allocated and linked to the file but they 
are discarded on first failed access. The value of the only valid 
attribute left is retrieved.

Is this convincing enough?

You might think that the attibute corruption I am showing here is very 
artificial but I can assure you that it does happen in the wild when you 
get a power cut on your equipment while the GC was moving an attribute 
node for example.

Regards

JC
> The reason I am
> asking is that JFFS2 is old and does not see too much maintanance
> anymore, so I feel that it is important important to be careful about
> merging something which does not fix a real life issue.
Artem Bityutskiy - April 29, 2012, 3:44 p.m.
On Thu, 2012-04-26 at 23:09 +0200, Jean-Christophe DUBOIS wrote:
> Is this convincing enough?

Sure, thanks.

> You might think that the attibute corruption I am showing here is very 
> artificial but I can assure you that it does happen in the wild when you 
> get a power cut on your equipment while the GC was moving an attribute 
> node for example.

This issue is for sure worth fixing. Just make sure xattr corruption is
handled the same way as data node corruption is handled. I do not
remember a lot of details anymore, but I wrote you the main idea in the
previous e-mail.
Jean-Christophe DUBOIS - April 30, 2012, 8:54 p.m.
On 29/04/2012 17:44, Artem Bityutskiy wrote:
> On Thu, 2012-04-26 at 23:09 +0200, Jean-Christophe DUBOIS wrote:
>> Is this convincing enough?
> Sure, thanks.
>
>> You might think that the attibute corruption I am showing here is very
>> artificial but I can assure you that it does happen in the wild when you
>> get a power cut on your equipment while the GC was moving an attribute
>> node for example.
> This issue is for sure worth fixing.
So, will you accept this patch?
> Just make sure xattr corruption is
> handled the same way as data node corruption is handled. I do not
> remember a lot of details anymore, but I wrote you the main idea in the
> previous e-mail.
I will look at the other patch to do the xattr integrity checking in the 
GC second phase scanning.

Regards

JC
>
Artem Bityutskiy - May 1, 2012, 12:15 p.m.
On Mon, 2012-04-30 at 22:54 +0200, Jean-Christophe DUBOIS wrote:
> On 29/04/2012 17:44, Artem Bityutskiy wrote:
> > On Thu, 2012-04-26 at 23:09 +0200, Jean-Christophe DUBOIS wrote:
> >> Is this convincing enough?
> > Sure, thanks.
> >
> >> You might think that the attibute corruption I am showing here is very
> >> artificial but I can assure you that it does happen in the wild when you
> >> get a power cut on your equipment while the GC was moving an attribute
> >> node for example.
> > This issue is for sure worth fixing.
> So, will you accept this patch?

No, I meant that I agree that there is an issue, but I think that it
should be solved for all node types the same way. So I suggest to look
at what happens when data nodes are corrupted, how JFFS2 picks the older
node (if it does) and do it similarly for xattrs.
Jean-Christophe DUBOIS - May 1, 2012, 2:31 p.m.
On 01/05/2012 14:15, Artem Bityutskiy wrote:
> On Mon, 2012-04-30 at 22:54 +0200, Jean-Christophe DUBOIS wrote:
>> On 29/04/2012 17:44, Artem Bityutskiy wrote:
>>> On Thu, 2012-04-26 at 23:09 +0200, Jean-Christophe DUBOIS wrote:
>>>> Is this convincing enough?
>>> Sure, thanks.
>>>
>>>> You might think that the attibute corruption I am showing here is very
>>>> artificial but I can assure you that it does happen in the wild when you
>>>> get a power cut on your equipment while the GC was moving an attribute
>>>> node for example.
>>> This issue is for sure worth fixing.
>> So, will you accept this patch?
> No, I meant that I agree that there is an issue, but I think that it
> should be solved for all node types the same way. So I suggest to look
> at what happens when data nodes are corrupted, how JFFS2 picks the older
> node (if it does) and do it similarly for xattrs.

I just want to restate that this patch is just re-enabling a feature 
that was (wrongly I think) disabled a bit more than a year ago.

I will have a look at the other way you suggested but this might mean in 
the end a lot more restructuring of the JFFS2 code (at least the xattr 
part). As you seem to be very cautious about JFFS2 code change this will 
certainly mean a slow merge process. In the mean time the broken JFFS2 
xattr behavior will still be present.

JC

Patch

diff --git a/fs/jffs2/xattr.c b/fs/jffs2/xattr.c
index b55b803..382b1e0 100644
--- a/fs/jffs2/xattr.c
+++ b/fs/jffs2/xattr.c
@@ -11,6 +11,8 @@ 
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#define JFFS2_XATTR_IS_CORRUPTED	1
+
 #include <linux/kernel.h>
 #include <linux/slab.h>
 #include <linux/fs.h>
@@ -153,7 +155,7 @@  static int do_verify_xattr_datum(struct jffs2_sb_info *c, struct jffs2_xattr_dat
 		JFFS2_ERROR("node CRC failed at %#08x, read=%#08x, calc=%#08x\n",
 			    offset, je32_to_cpu(rx.hdr_crc), crc);
 		xd->flags |= JFFS2_XFLAGS_INVALID;
-		return -EIO;
+		return JFFS2_XATTR_IS_CORRUPTED;
 	}
 	totlen = PAD(sizeof(rx) + rx.name_len + 1 + je16_to_cpu(rx.value_len));
 	if (je16_to_cpu(rx.magic) != JFFS2_MAGIC_BITMASK
@@ -169,7 +171,7 @@  static int do_verify_xattr_datum(struct jffs2_sb_info *c, struct jffs2_xattr_dat
 			    je32_to_cpu(rx.xid), xd->xid,
 			    je32_to_cpu(rx.version), xd->version);
 		xd->flags |= JFFS2_XFLAGS_INVALID;
-		return -EIO;
+		return JFFS2_XATTR_IS_CORRUPTED;
 	}
 	xd->xprefix = rx.xprefix;
 	xd->name_len = rx.name_len;
@@ -232,7 +234,7 @@  static int do_load_xattr_datum(struct jffs2_sb_info *c, struct jffs2_xattr_datum
 			      ref_offset(xd->node), xd->data_crc, crc);
 		kfree(data);
 		xd->flags |= JFFS2_XFLAGS_INVALID;
-		return -EIO;
+		return JFFS2_XATTR_IS_CORRUPTED;
 	}
 
 	xd->flags |= JFFS2_XFLAGS_HOT;
@@ -1282,7 +1284,7 @@  int jffs2_verify_xattr(struct jffs2_sb_info *c)
 	down_write(&c->xattr_sem);
 	list_for_each_entry_safe(xd, _xd, &c->xattr_unchecked, xindex) {
 		rc = do_verify_xattr_datum(c, xd);
-		if (rc < 0)
+		if (rc)
 			continue;
 		list_del_init(&xd->xindex);
 		spin_lock(&c->erase_completion_lock);