Patchwork ubifs_decompress: cannot decompress ...

login
register
mail settings
Submitter Matthew L. Creech
Date June 6, 2011, 7:52 p.m.
Message ID <1307389926-12209-1-git-send-email-mlcreech@gmail.com>
Download mbox | patch
Permalink /patch/99019/
State New
Headers show

Comments

Matthew L. Creech - June 6, 2011, 7:52 p.m.
On Mon, Jun 6, 2011 at 12:18 PM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> I have to go home now - could you please improve dbg_dump_leb().
> Currently it calls ubifs_scan(), which scans, finds corrupted node,
> prints corruption information and returns -EUCLEAN and destroys the
> scanned data.
>

Will something like this be okay?  Or do you still want to dump the
partially-parsed data from the corrupt node as well (not just the raw contents
of the LEB)?


Currently an error in ubifs_scan() will cause dbg_dump_leb() to abort without
completing the dump.  Instead, we should abandon parsing the data, but dump
the raw (uninterpreted) LEB contents instead.

Signed-off-by: Matthew L. Creech <mlcreech@gmail.com>
---
 fs/ubifs/debug.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)
Artem Bityutskiy - June 7, 2011, 4:34 a.m.
On Mon, 2011-06-06 at 15:52 -0400, Matthew L. Creech wrote:
>  		ubifs_err("scan error %d", (int)PTR_ERR(sleb));
> +		printk(KERN_DEBUG "\tLEB data buffer:\n");
> +		print_hex_dump(KERN_DEBUG, "\t", DUMP_PREFIX_OFFSET, 32, 1,
> +			       buf, c->leb_size, 0);

No, I have difficulties reading hexdumps. You have set of good nodes
following by one broken node. I wanted to see a human-readable dump of
the good nodes at the beginning of the LEB.
Matthew L. Creech - June 7, 2011, 8:41 p.m.
On Tue, Jun 7, 2011 at 12:34 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> No, I have difficulties reading hexdumps. You have set of good nodes
> following by one broken node. I wanted to see a human-readable dump of
> the good nodes at the beginning of the LEB.
>

Oh I see - sorry, I thought you wanted to debug the corrupted portion.

Here's the output for my corrupt flash:

http://mcreech.com/work/ubifs-2011-06-07.txt

I'll follow up with a patch.
Artem Bityutskiy - June 8, 2011, 2:11 p.m.
On Tue, 2011-06-07 at 16:41 -0400, Matthew L. Creech wrote:
> On Tue, Jun 7, 2011 at 12:34 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> >
> > No, I have difficulties reading hexdumps. You have set of good nodes
> > following by one broken node. I wanted to see a human-readable dump of
> > the good nodes at the beginning of the LEB.
> >
> 
> Oh I see - sorry, I thought you wanted to debug the corrupted portion.
> 
> Here's the output for my corrupt flash:
> 
> http://mcreech.com/work/ubifs-2011-06-07.txt
> 
> I'll follow up with a patch.

Yes, it does look like this LEB might be garbage-collected. But it does
not have to be.

Anyway, what I can suggest you is to do several things.

1. If you have many occasions of such error, try to gather some
   information about how the device was used, and if it was uncleanly
   power-cut. Remember, I often saw that embedded devices have incorrect
   reboot. Whe users reboot it "normally" - it does not try to unmount
   the FS-es cleanly and just jumps to som HW reset function.

   You can verify this by rebooting normally and checking if UBIFS says
   "recovery needed" or not. If it does - the reboot was not normal.

2. This error may be due to memory corruptions in some driver (e.g.,
   wireless or video), due to issues in the mtd driver, etc. Try to
   stress your system with slub/slab full checks enabled, and other
   debugging features which you can find in the "hacking" section of
   make menuconfig.

3. If my theory is true, then what may help is adding a check it
   ubifs recovery function. The recovery ends with an ubifs_leb_change()
   call. You need to check the last node there - is it full and correct?
   If not, you should print a loud warning and information like leb dump
   _before_ the change, and dump of the buffer which we are going to
   write with ubifs_leb_change().

   You'd probably need to deploy this check to the field if this issue
   is not easy to reproduce. If you have then this info you may fix the
   bug.

4. Set-up power-cut emulation testing in your office.

P.S. I'm curious where you use UBIFS, if this is not a trade secret, of
course.
Matthew L. Creech - June 8, 2011, 5:50 p.m.
On Wed, Jun 8, 2011 at 10:11 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> Yes, it does look like this LEB might be garbage-collected. But it does
> not have to be.
>
> Anyway, what I can suggest you is to do several things.
>
> 1. If you have many occasions of such error, try to gather some
>   information about how the device was used, and if it was uncleanly
>   power-cut. Remember, I often saw that embedded devices have incorrect
>   reboot. Whe users reboot it "normally" - it does not try to unmount
>   the FS-es cleanly and just jumps to som HW reset function.
>
>   You can verify this by rebooting normally and checking if UBIFS says
>   "recovery needed" or not. If it does - the reboot was not normal.
>

Yes, it currently reboots uncleanly (though it does do a "sync"
first).  I noticed this a while back, and the next release firmware
will have it fixed.  However, it doesn't make a huge difference to us,
because these devices are probably more likely to experience power
loss than a software reboot, in the field at least.

> 2. This error may be due to memory corruptions in some driver (e.g.,
>   wireless or video), due to issues in the mtd driver, etc. Try to
>   stress your system with slub/slab full checks enabled, and other
>   debugging features which you can find in the "hacking" section of
>   make menuconfig.
>

Will do.

> 3. If my theory is true, then what may help is adding a check it
>   ubifs recovery function. The recovery ends with an ubifs_leb_change()
>   call. You need to check the last node there - is it full and correct?
>   If not, you should print a loud warning and information like leb dump
>   _before_ the change, and dump of the buffer which we are going to
>   write with ubifs_leb_change().
>
>   You'd probably need to deploy this check to the field if this issue
>   is not easy to reproduce. If you have then this info you may fix the
>   bug.
>

Great, I'll add this check and see if we get any hits.  Even if it
takes a while to hit it in the field, this would at least give us a
way to make some progress in finding the issue.

> 4. Set-up power-cut emulation testing in your office.
>

I did this at one point - I have a programmable UPS, so I was able to
automate a test to turn outlet power off & on repeatedly while having
the device do some work.  It didn't seem to help reproduce the
problem, but it's worth trying again on a long-term basis (especially
with the change above to try & catch the corruption in the act).

Thanks again Artem.

Patch

diff --git a/fs/ubifs/debug.c b/fs/ubifs/debug.c
index 26d4c61..6ab43e4 100644
--- a/fs/ubifs/debug.c
+++ b/fs/ubifs/debug.c
@@ -901,6 +901,9 @@  void dbg_dump_leb(const struct ubifs_info *c, int lnum)
 	sleb = ubifs_scan(c, lnum, 0, buf, 0);
 	if (IS_ERR(sleb)) {
 		ubifs_err("scan error %d", (int)PTR_ERR(sleb));
+		printk(KERN_DEBUG "\tLEB data buffer:\n");
+		print_hex_dump(KERN_DEBUG, "\t", DUMP_PREFIX_OFFSET, 32, 1,
+			       buf, c->leb_size, 0);
 		goto out;
 	}