[RFC] UBIFS recovery

Message-ID: <54D33C36.9060805@huawei.com>
State: RFC
From: hujianyang <hujianyang@huawei.com>
To: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Richard Weinberger <richard@nod.at>, linux-mtd <linux-mtd@lists.infradead.org>, Sheng Yong <shengyong1@huawei.com>
Date: Thu, 5 Feb 2015 17:47:34 +0800
Subject: [RFC] UBIFS recovery

hujianyang Feb. 5, 2015, 9:47 a.m. UTC

UBIFS currently lacks a recovery method; that is, once a UBIFS
partition refuses to mount, all data on that partition may be lost.
The default recovery mechanism in UBIFS can deal with a corrupted
master node and with power-cut cleanup, but that is not enough. UBIFS
on flash may suffer from other kinds of data corruption, the most
common being ECC errors.

I've searched the mailing list archive and found that a recovery method
was requested once before (sorry, I can't find the link). Artem suggested
that we could introduce a new repair mount option instead of working
on a new userspace repair tool, but it seems no further work has
been done so far.

There are two ways to do UBIFS recovery. One is to repair the UBIFS image
in userspace via the UBI interfaces; the other is to repair the corrupted
data during mount, either by default or via a special mount option.

The userspace tool is the most effective way to repair a partition.
It has enough time and resources to scan the whole target and
clean up the corruption while the file-system is offline. But it's hard
to program: many structures and functions in the kernel need to be copied
into this utility, the current ubi-utils focus mostly on the UBI device, not
UBIFS, and any subsequent change to the file-system has to take the
userspace tool into account. It's too complicated.

The other way is to expand the existing recovery methods in recovery.c.
It's easy to add a new recovery method this way; changing a few lines
could improve reliability in some areas. But it's hard to get a
global view to control these recovery features, because they are scattered
along the mount path. It also becomes hard to add new features once
lots of recovery methods have been added.

I can't say which way is better. It depends on what we expect from
UBIFS. Actually I'm working on a userspace tool, ubidump. It can already
print the on-flash format of a specified LEB, and adding features like
file-system repair can be considered. On the other hand, I'm also
working on expanding the UBIFS recovery methods in the kernel, e.g. clean up
all the logs if an error occurs while replaying buds and revert the
file-system to the last commit state instead of failing to mount.

Regardless of how we fix a corrupt partition, the first thing that should
be done is to add a method that tries to mount the file-system R/O instead
of failing, to give users a chance to copy their valid data
out of the corrupt image.

Thanks!
Hu


buds replay patch for linux 3.10 stable:

shengyong Feb. 5, 2015, 1:09 p.m. UTC | #1

On 2015/2/5 17:47, hujianyang wrote:
> Current UBIFS is lack of recovery method, that means, once a UBIFS
> partition refuse to mount, all data on that partition may lose.
> The default recovery mechanism in UBIFS now can deal with corruption
> on master node or power cut cleanup. But it's not enough. UBIFS
> on flash may suffer different kinds of data corrupted, the most
> common case, ECC error.
> 
> I've scanned the archive of maillist and found the recovery method
> was once requested(Sorry, I can't find the link). Artem suggested
> we could introduce a new repairing mount option instead of working
> on a new userspace repairing tool. But seems no more efforts had
> been done so far.
> 
> There are two ways for UBIFS recovery. One is repairing UBIFS image
> in userspace via UBI interfaces, the other is repairing the corrupted
> data during mount by default or via a special mount option.
> 
> The userspace tool is the most effective way to repair a partition.
> It could have enough time and resource to whole scan the target and
> cleanup the corrupted while the file-system offline. But it's hard
> to program: many structures and functions in kernel need to be copied
> into this utility, current ubi-utils focus mostly on UBI device, not
> UBIFS, and the subsequent updating of file-system should consider the
> userspace tool. It's too complicated.
> 
> Another way is expanding the existing recovery methods in recovery.c.
> It's easy to add new recovery method in this way, few lines changes
> could improve reliability in some fields. But it's hard to give a
> global view to control these recovery features, they are dispersed
> in mounting path. Also, make it hard to add new features after
> importing lots of recovery methods.
No matter how the fs is recovered, data is corrupted. With the default recovery
mechanism, the recovery just drops the last node, so the lost data is
kept to a minimum. In other situations, like data corrupted in
the middle of the log area, it may be hard to figure out which nodes should be
dropped. So we'd prefer to roll the whole fs back to the last checkpoint,
rather than losing all data.

Here is a simple recovery procedure; something could easily have been missed
in it:
1. if the default recovery fails, we start to roll the whole filesystem
   back to the last checkpoint.
2. scan all buds already in the replay_buds list. If the last commit in the bud
   starts from the beginning of the LEB, then all nodes in the bud are new, and
   we unmap it; if the last commit starts in the middle of the bud, we
   leb_change the bud, keeping old nodes and dropping new nodes.
3. get the seqnum of the last commit; this is the last checkpoint, where the fs
   stayed consistent.
4. scan all LEBs (skipping the superblock and the 2 master LEBs), compare each
   node's seqnum with the checkpoint, and find the offset where new nodes start.
5. unmap or leb_change the corrupted LEBs, and do related cleanup.
6. create a new log area.
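
The seqnum comparison in the scanning steps above can be sketched in plain C
as follows. This is only an illustration under stated assumptions:
`struct scanned_node`, `enum leb_action` and `rollback_action()` are
hypothetical names, not UBIFS kernel symbols. The nodes of one LEB are assumed
to be already scanned into an array ordered by offset, and sequence numbers
are assumed to increase with the offset, which garbage collection may not
preserve on a real image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one node found while scanning a LEB. */
struct scanned_node {
	uint32_t offs;   /* offset of the node inside the LEB */
	uint64_t sqnum;  /* sequence number stored in the node header */
};

/* What the rollback would do with a LEB (illustrative names only). */
enum leb_action {
	LEB_KEEP,    /* no node is newer than the last commit */
	LEB_UNMAP,   /* every node is newer: drop the whole LEB */
	LEB_CHANGE,  /* keep data before *cut_offs, drop the rest */
};

/*
 * Find the first node that is newer than the last commit. Nodes are assumed
 * to be sorted by offset. On LEB_CHANGE, *cut_offs is the offset where the
 * new nodes start; otherwise it is set to 0.
 */
static enum leb_action rollback_action(const struct scanned_node *nodes,
				       size_t cnt, uint64_t commit_sqnum,
				       uint32_t *cut_offs)
{
	size_t i;

	assert(cut_offs != NULL);
	*cut_offs = 0;

	for (i = 0; i < cnt; i++) {
		if (nodes[i].sqnum > commit_sqnum) {
			*cut_offs = nodes[i].offs;
			return nodes[i].offs == 0 ? LEB_UNMAP : LEB_CHANGE;
		}
	}
	return LEB_KEEP;
}
```

A caller would then unmap the LEB, rewrite it up to `*cut_offs`, or leave it
alone, depending on the returned action.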

BTW, the current ubifs will update master node when mounting, no matter whether
the mount succeeds or fails. So if need_recovery is detected, the master node
should not be updated.

thanks & best regards,
Sheng
> 
> I can't say which way is better. It depends on what we expect on
> UBIFS. Actually I'm working on a userspace tool ubidump, it can
> print on-flash format of a specified LEB now and add features like
> file-system repairing can be considered. On the other hand, I'm
> working on expanding UBIFS recovery method in kernel. e.g. cleanup
> all the logs if an error occur while replaying buds, revert file-
> system to last commit state instead of mounting fail.
> 
> Regardless of how to fix a corrupt partition, the first stuff should
> be done is adding a method that try to mount file-system R/O instead
> of breaking down to give users a chance to copy their valid data
> out from the corrupt image.
> 
> Thanks!
> Hu
> 
> 
> buds replay patch for linux 3.10 stable:
> 
> diff --git a/fs/ubifs/replay.c b/fs/ubifs/replay.c
> index 3187925..e2208a2 100644
> --- a/fs/ubifs/replay.c
> +++ b/fs/ubifs/replay.c
> @@ -706,14 +706,35 @@ static int replay_buds(struct ubifs_info *c)
> 
>  	list_for_each_entry(b, &c->replay_buds, list) {
>  		err = replay_bud(c, b);
> -		if (err)
> -			return err;
> +		if (err) {
> +			ubifs_err("error %d during buds replay, try to revert\n",
> +				  err);
> +			goto revert;
> +		}
> 
>  		ubifs_assert(b->sqnum > prev_sqnum);
>  		prev_sqnum = b->sqnum;
>  	}
> 
>  	return 0;
> +
> +revert:
> +	prev_sqnum = 0;
> +
> +	list_for_each_entry(b, &c->replay_buds, list) {
> +		/*
> +		 * Revert to last commit state, update lprops by setting
> +		 * the state of space used by buds to dirty.
> +		 */
> +		b->free = c->leb_size % c->min_io_size;
> +		b->dirty = c->leb_size - b->bud->start - b->free;
> +
> +		ubifs_assert(b->sqnum > prev_sqnum);
> +		prev_sqnum = b->sqnum;
> +	}
> +	ubifs_warn("revert to last commit state with data lost\n");
> +
> +	return 1;
>  }
> 
>  /**
> @@ -1036,13 +1057,15 @@ int ubifs_replay_journal(struct ubifs_info *c)
>  		lnum = ubifs_next_log_lnum(c, lnum);
>  	} while (lnum != c->ltail_lnum);
> 
> -	err = replay_buds(c);
> -	if (err)
> -		goto out;
> -
> -	err = apply_replay_list(c);
> -	if (err)
> -		goto out;
> +	/*
> +	 * If an error occur during buds replay, try to revert filesystem
> +	 * to last commit state. Should not apply corrupt replay list.
> +	 */
> +	if (!replay_buds(c)) {
> +		err = apply_replay_list(c);
> +		if (err)
> +			goto out;
> +	}
> 
>  	err = set_buds_lprops(c);
>  	if (err)
>
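
As a quick check of the space accounting in the revert path of the quoted
patch, the two assignments can be evaluated in isolation. The sketch below is
stand-alone user-space C, not kernel code; `struct bud_space` and
`revert_bud_space()` are made-up names, and the geometry values in the note
after it are only examples.

```c
#include <assert.h>

/* Hypothetical result holder mirroring b->free and b->dirty in the patch. */
struct bud_space {
	int free;
	int dirty;
};

/*
 * Same arithmetic as the revert path: everything from the bud start up to
 * the last min. I/O unit boundary is counted as dirty, and the unaligned
 * tail of the LEB (if any) is counted as free.
 */
static struct bud_space revert_bud_space(int leb_size, int min_io_size,
					 int bud_start)
{
	struct bud_space s;

	assert(min_io_size > 0);
	assert(bud_start >= 0);
	assert(bud_start <= leb_size - leb_size % min_io_size);

	s.free = leb_size % min_io_size;
	s.dirty = leb_size - bud_start - s.free;
	return s;
}
```

For example, with a 126976-byte LEB, 2048-byte min. I/O units and a bud
starting at offset 4096, this gives free = 0 and dirty = 122880, i.e. the
whole LEB from the bud start onwards is marked dirty.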

Steve deRosier Feb. 5, 2015, 3:08 p.m. UTC | #2

On Thu, Feb 5, 2015 at 1:47 AM, hujianyang <hujianyang@huawei.com> wrote:
> There are two ways for UBIFS recovery. One is repairing UBIFS image
> in userspace via UBI interfaces, the other is repairing the corrupted
> data during mount by default or via a special mount option.
>
> The userspace tool is the most effective way to repair a partition.
> It could have enough time and resource to whole scan the target and
> cleanup the corrupted while the file-system offline. But it's hard
> to program: many structures and functions in kernel need to be copied
> into this utility, current ubi-utils focus mostly on UBI device, not
> UBIFS, and the subsequent updating of file-system should consider the
> userspace tool. It's too complicated.
>

I hear (and agree with) several valid arguments for a tool in
userspace. And I'd like to throw my support towards an in-driver
solution. Flash filesystems are different than on-disk filesystems, in
particular in their usecase: they're generally both critical and
exclusive to embedded systems. As such, the entire filesystem might be
on the corrupted UBIFS, so even if the filesystem is recoverable, if
we can't mount it and get at the userspace tool, then we're toast.
Often the kernel itself is stored in a separate read-only partition as
a blob directly on the flash, and thus the kernel itself would be
fine. The better UBI & UBIFS can recover to a usable state in-kernel,
the better off we are I think.

Just my thoughts on the matter.

- Steve

Richard Weinberger Feb. 5, 2015, 11:36 p.m. UTC | #3

On 05.02.2015 at 16:08, Steve deRosier wrote:
> On Thu, Feb 5, 2015 at 1:47 AM, hujianyang <hujianyang@huawei.com> wrote:
>> There are two ways for UBIFS recovery. One is repairing UBIFS image
>> in userspace via UBI interfaces, the other is repairing the corrupted
>> data during mount by default or via a special mount option.
>>
>> The userspace tool is the most effective way to repair a partition.
>> It could have enough time and resource to whole scan the target and
>> cleanup the corrupted while the file-system offline. But it's hard
>> to program: many structures and functions in kernel need to be copied
>> into this utility, current ubi-utils focus mostly on UBI device, not
>> UBIFS, and the subsequent updating of file-system should consider the
>> userspace tool. It's too complicated.
>>
> 
> I hear (and agree with) several valid arguments for a tool in
> userspace. And I'd like to throw my support towards an in-driver
> solution. Flash filesystems are different than on-disk filesystems, in
> particular in their usecase: they're generally both critical and
> exclusive to embedded systems. As such, the entire filesystem might be
> on the corrupted UBIFS, so even if the filesystem is recoverable, if
> we can't mount it and get at the userspace tool, then we're toast.

No, embedded is not per se an excuse for doing bad/stupid things.
Embedded is *not* special.
There are folks out there that want a "force" mount option for UBIFS
to mount it in any case no matter in how bad shape it is.
But this will make the situation much worse as you'll get silent data
corruption/loss.
It is as stupid as running a "fsck -y /dev/sdXY" at every boot on a
regular disk filesystem.

UBIFS can only fully automatically recover *iff* it can guarantee that it
will be consistent after recovery and does not lose data.
If not, it has to fail at mount time.
What does it help if UBIFS successfully mounts but /sbin/init is damaged or
the permissions of /etc/shadow are corrupted?
On the other hand, if UBIFS can do a better job at automatic and safe
recovery, let's improve it.

But what we really need is a fsck.ubifs and a debugfs.ubifs.
In-kernel recovery cannot replace a fsck as in-kernel will always be non-interactive.

> Often the kernel itself is stored in a separate read-only partition as
> a blob directly on the flash, and thus the kernel itself would be
> fine. The better UBI & UBIFS can recover to a usable state in-kernel,
> the better off we are I think.

Using an initramfs you can have an fsck.ubifs without a mounted root.

Thanks,
//richard

Artem Bityutskiy Feb. 6, 2015, 5:02 p.m. UTC | #4

Hi Hujianyang,

On Thu, 2015-02-05 at 17:47 +0800, hujianyang wrote:
> Current UBIFS is lack of recovery method, that means, once a UBIFS
> partition refuse to mount, all data on that partition may lose.
> The default recovery mechanism in UBIFS now can deal with corruption
> on master node or power cut cleanup. But it's not enough. UBIFS
> on flash may suffer different kinds of data corrupted, the most
> common case, ECC error.

First of all, it is important to agree on terminology.

I think I understand what you mean in this paragraph, but other people
may get the wrong impression. Simply because "UBIFS has no recovery" is
_absolutely_ not true. UBIFS has _a lot_ of recovery, just check
'recovery.c' :-)

But I understand that this is not the recovery you mean. And I
understand that it may be difficult to express things in English.
Good terminology will help - let's introduce it and stick to it.

Here is what UBIFS "thinks" about file-system recovery.

There are 2 types of recovery:

1. Power-cut recovery
2. Corruption recovery.

"Power-cut recovery" is, obviously, recovering from power cuts. Indeed,
power-cuts may happen in the middle of write or erase operations and
cause rubbish on the flash media. Cleaning up this rubbish at mount time
is the power-cut recovery.

"Corruption recovery" is recovery from media corruptions. E.g., the
flash is just too worn-out and does not keep data, or part of the flash
is erased and part of the UBIFS meta-data and data are gone.

And these are 2 completely different cases, right?

Now, UBIFS _does_ support power-cut recovery. In practice this means
that you should always be able to mount the file-system after a power
cut. All the garbage caused by the power cut should go away. No data
which were on the flash media before the power cut should be lost. Any
file which was fsync()'ed before the power cut should stay intact.

And this is not a trivial task. Power cuts may happen during garbage
collection or during commit. There may be a sequence of power cuts:
power cut -> mount process -> another power cut while we are recovering
from the previous one -> and again and again.

UBIFS tries hard to provide power-cut recovery. There may be issues, and
if there are, they are bugs which should be fixed.

The _corruption recovery_, on the other hand, is not implemented in the
driver. And yes, there is no user-space tool. If UBIFS sees that some
data structure is missing or corrupted, and at the same time UBIFS
"knows" that this can't be because of a power cut - UBIFS refuses to
mount the file-system or switches to R/O mode.
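
One way to picture the distinction: garbage left by an interrupted write is
expected only at the tail of the data written so far, with nothing but erased
flash (0xFF bytes) behind it, whereas damage followed by more data cannot be
explained by a single power cut. The C sketch below illustrates that idea
only. It is a simplification under assumptions, not the checks in recovery.c,
and `classify_damage()` and `enum damage_kind` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum damage_kind {
	DAMAGE_POWER_CUT,   /* consistent with an interrupted last write */
	DAMAGE_CORRUPTION,  /* data follows the damage: not a power cut */
};

/*
 * buf/len:     contents of one LEB
 * bad_offs:    offset where the first unreadable node starts
 * min_io_size: smallest unit the flash can write
 *
 * The damaged write is assumed to end at the next min. I/O unit boundary
 * after bad_offs. If anything after that boundary is not erased (0xFF),
 * something was written after the damaged area, so the damage was not
 * caused by a power cut during the last write.
 */
static enum damage_kind classify_damage(const uint8_t *buf, size_t len,
					size_t bad_offs, size_t min_io_size)
{
	size_t i, end;

	assert(buf != NULL && min_io_size > 0 && bad_offs < len);

	end = (bad_offs / min_io_size + 1) * min_io_size;
	for (i = end; i < len; i++)
		if (buf[i] != 0xFF)
			return DAMAGE_CORRUPTION;
	return DAMAGE_POWER_CUT;
}
```

In this simplified model only the second result would justify refusing the
mount; the first is the kind of rubbish that power-cut recovery cleans up.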

UBIFS does not make any attempt to do corruption recovery.

UBIFS authors believed it is simply impossible to do inside the driver
for the generic case. E.g., what do you do if the LEB which should
contain the UBIFS index now contains "rubbish"? Will you erase it? If
yes, what if this turns out to be my favorite cat's picture? Or will you
move it? If yes, what if there is no space to move to?

User-space tools may start asking user questions, etc. A kernel driver
can't. User-space tools may copy the "rubbish" somewhere so that users
have a chance to recover the picture of the beloved animal.

> I've scanned the archive of maillist and found the recovery method
> was once requested(Sorry, I can't find the link). Artem suggested
> we could introduce a new repairing mount option instead of working
> on a new userspace repairing tool. But seems no more efforts had
> been done so far.

I do not remember what I suggested, but I do not think corruption
recovery is possible to implement in the driver.

But I can imagine that there may be some specific cases which could be
covered. If there is good justification for that, I am fine.

> +	/*
> +	 * If an error occur during buds replay, try to revert filesystem
> +	 * to last commit state. Should not apply corrupt replay list.
> +	 */
> +	if (!replay_buds(c)) {
> +		err = apply_replay_list(c);
> +		if (err)
> +			goto out;
> +	}

Reverting to the last committed state _may_ make sense. Probably this
could be a mount option. In this case, though, UBIFS should periodically
commit, say, every 5-10 seconds.

Thanks!

Artem Bityutskiy Feb. 6, 2015, 5:21 p.m. UTC | #5

On Thu, 2015-02-05 at 21:09 +0800, shengyong wrote:
> No matter how fs is recovered, data is corrupted. For the default recovery
> machanism, the recovery just drops the last node, and the lost data can be
> limited in the mininum range. For other situations, like data corrupted in
> the middle of log area, it may be hard to figure out which nodes should be
> droped. So we'd prefer to roll the whole fs back to the last checkpoint,
> rather than losing all data.

So are you focused on log corruptions only? Why is this case important
for you?

> Here is a simple recovery procedure, something could be easily missed in the
> procedure:
> 1. if the default recovery fails, we start to roll the whole filesystem
>    back to the last checkpoint.

Let's use the word "commit" instead, just for clarity.

> 2. scan all buds already in replay_buds list, if last commit in the bud starts
>    from the begining of the LEB, then all nodes in the bud are new, and we
>    unmap it; if last commit starts in the middle of the bud, we leb_change the
>    bud, keep old nodes and drop new nodes.

I do not really understand this. A bud is an uncommitted LEB; the journal
consists of buds. The log contains references to the buds, plus commit
start/end nodes.

Also, do you realize that if I fsync() a file, it does not mean a
commit? It just means writing all the data to the journal.

Do you suggest just erasing the entire journal LEBs which contain
pieces of a file I fsync()'ed?

We really need to step back, think, and come up with a good English
description of the specific problem we are trying to solve here.

> BTW, the current ubifs will update master node when mounting, no matter whether
> the mount succeeds or fails. So if need_recovery is detected, the master node
> should not be updated.

This sounds like a bug!

Artem Bityutskiy Feb. 6, 2015, 5:26 p.m. UTC | #6

On Thu, 2015-02-05 at 07:08 -0800, Steve deRosier wrote:
> I hear (and agree with) several valid arguments for a tool in
> userspace. And I'd like to throw my support towards an in-driver
> solution. Flash filesystems are different than on-disk filesystems, in
> particular in their usecase: they're generally both critical and
> exclusive to embedded systems. As such, the entire filesystem might be
> on the corrupted UBIFS, so even if the filesystem is recoverable, if
> we can't mount it and get at the userspace tool, then we're toast.
> Often the kernel itself is stored in a separate read-only partition as
> a blob directly on the flash, and thus the kernel itself would be
> fine. The better UBI & UBIFS can recover to a usable state in-kernel,
> the better off we are I think.

Yes, being able to mount a corrupted FS R/O sounds like a good goal. We
are not speaking of recovery here, just about mounting R/O and providing
access to as much uncorrupted data as we can.

If the FS index is not corrupted, this sounds quite doable. If the index is
corrupted, though, this requires a full scan and index rebuild. Otherwise
we'd mount and show an empty file-system.

I can see a potential problem of reading "insane" data from flash (some
circular never ending references, etc) - the driver should be very
careful about those. On the other hand, the driver should be careful
even if we are not talking about corruptions.
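
A common defensive pattern for the "circular references" concern is to bound
every walk over on-flash structures, so that bad links produce an error
instead of an endless loop. The sketch below is a generic illustration, not
UBIFS code: `struct disk_node`, `NO_LINK` and `walk_chain()` are hypothetical,
and a simple singly linked chain stands in for the real index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_LINK UINT32_MAX  /* hypothetical end-of-chain marker */

/* Hypothetical on-media record that refers to its successor by index. */
struct disk_node {
	uint32_t next;  /* index of the next record, or NO_LINK */
};

/*
 * Follow the chain starting at 'start'. A valid chain can visit each of the
 * 'cnt' records at most once, so more than 'cnt' steps means a cycle.
 * Returns the chain length, or -1 if a link is out of range or loops.
 */
static long walk_chain(const struct disk_node *nodes, size_t cnt,
		       uint32_t start)
{
	size_t steps = 0;
	uint32_t cur = start;

	assert(nodes != NULL);

	while (cur != NO_LINK) {
		if (cur >= cnt)       /* link points outside the media */
			return -1;
		if (++steps > cnt)    /* visited more records than exist */
			return -1;
		cur = nodes[cur].next;
	}
	return (long)steps;
}
```

The same step budget works for tree walks: cap the number of visited nodes
(or the depth) at what the media can physically hold.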

Artem.

Richard Weinberger Feb. 6, 2015, 5:33 p.m. UTC | #7

On 06.02.2015 at 18:26, Artem Bityutskiy wrote:
> On Thu, 2015-02-05 at 07:08 -0800, Steve deRosier wrote:
>> I hear (and agree with) several valid arguments for a tool in
>> userspace. And I'd like to throw my support towards an in-driver
>> solution. Flash filesystems are different than on-disk filesystems, in
>> particular in their usecase: they're generally both critical and
>> exclusive to embedded systems. As such, the entire filesystem might be
>> on the corrupted UBIFS, so even if the filesystem is recoverable, if
>> we can't mount it and get at the userspace tool, then we're toast.
>> Often the kernel itself is stored in a separate read-only partition as
>> a blob directly on the flash, and thus the kernel itself would be
>> fine. The better UBI & UBIFS can recover to a usable state in-kernel,
>> the better off we are I think.
> 
> Yes, being able to mount a corrupted FS R/O sounds like a good goal. We
> are not speaking of recovery here, just about mounting R/O and providing
> access to as much uncorrupted data as we can.
> 
> If FS index is not corrupted, this sounds quite doable. If the index is
> corrupted, though, this requires full scan and index rebuild. Other wise
> we'd mount and show empty file-system.

While I agree that mounting RO to get access to data is a feasible goal
I really think that this is the job of a debugfs.ubifs tool.
The kernel cannot ask questions, such a tool can.

Thanks,
//richard

Artem Bityutskiy Feb. 6, 2015, 5:40 p.m. UTC | #8

On Fri, 2015-02-06 at 18:33 +0100, Richard Weinberger wrote:
> On 06.02.2015 at 18:26, Artem Bityutskiy wrote:
> > On Thu, 2015-02-05 at 07:08 -0800, Steve deRosier wrote:
> >> I hear (and agree with) several valid arguments for a tool in
> >> userspace. And I'd like to throw my support towards an in-driver
> >> solution. Flash filesystems are different than on-disk filesystems, in
> >> particular in their usecase: they're generally both critical and
> >> exclusive to embedded systems. As such, the entire filesystem might be
> >> on the corrupted UBIFS, so even if the filesystem is recoverable, if
> >> we can't mount it and get at the userspace tool, then we're toast.
> >> Often the kernel itself is stored in a separate read-only partition as
> >> a blob directly on the flash, and thus the kernel itself would be
> >> fine. The better UBI & UBIFS can recover to a usable state in-kernel,
> >> the better off we are I think.
> > 
> > Yes, being able to mount a corrupted FS R/O sounds like a good goal. We
> > are not speaking of recovery here, just about mounting R/O and providing
> > access to as much uncorrupted data as we can.
> > 
> > If FS index is not corrupted, this sounds quite doable. If the index is
> > corrupted, though, this requires full scan and index rebuild. Other wise
> > we'd mount and show empty file-system.
> 
> While I agree that mounting RO to get access to data is a feasible goal
> I really think that this is the job of a debugfs.ubifs tool.
> The kernel cannot ask questions, such a tool can.

The user-space tool would turn a corrupted FS into an uncorrupted FS.

But if the driver could provide you read access to uncorrupted files
even though the file-system is corrupted, this would be useful. The
driver would not need to do any recovery - no write or erase operation
allowed.

I think this is hard in general, but probably doable for some cases.

Artem.

Richard Weinberger Feb. 6, 2015, 5:43 p.m. UTC | #9

On 06.02.2015 at 18:40, Artem Bityutskiy wrote:
> On Fri, 2015-02-06 at 18:33 +0100, Richard Weinberger wrote:
>> On 06.02.2015 at 18:26, Artem Bityutskiy wrote:
>>> On Thu, 2015-02-05 at 07:08 -0800, Steve deRosier wrote:
>>>> I hear (and agree with) several valid arguments for a tool in
>>>> userspace. And I'd like to throw my support towards an in-driver
>>>> solution. Flash filesystems are different than on-disk filesystems, in
>>>> particular in their usecase: they're generally both critical and
>>>> exclusive to embedded systems. As such, the entire filesystem might be
>>>> on the corrupted UBIFS, so even if the filesystem is recoverable, if
>>>> we can't mount it and get at the userspace tool, then we're toast.
>>>> Often the kernel itself is stored in a separate read-only partition as
>>>> a blob directly on the flash, and thus the kernel itself would be
>>>> fine. The better UBI & UBIFS can recover to a usable state in-kernel,
>>>> the better off we are I think.
>>>
>>> Yes, being able to mount a corrupted FS R/O sounds like a good goal. We
>>> are not speaking of recovery here, just about mounting R/O and providing
>>> access to as much uncorrupted data as we can.
>>>
>>> If FS index is not corrupted, this sounds quite doable. If the index is
>>> corrupted, though, this requires full scan and index rebuild. Other wise
>>> we'd mount and show empty file-system.
>>
>> While I agree that mounting RO to get access to data is a feasible goal
>> I really think that this is the job of a debugfs.ubifs tool.
>> The kernel cannot ask questions, such a tool can.
> 
> The user-space tool would turn a corrupted FS into an uncorrupted FS.

This is what fsck.ubifs should do. I was talking about a debugfs.ubifs which
is able to extract files, ask questions, and tell the user what exactly is going
wrong. Like "yes, I can dump your file /foo/bar.dat but the range 5m to 10m may
be corrupted and the xattrs are gone".
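
The kind of report described here could be produced by collapsing per-block
read results into byte ranges. The following C sketch shows only that
bookkeeping; `struct bad_range`, `find_bad_ranges()` and the block-status
array are hypothetical and not part of any existing UBIFS tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A run of consecutive damaged bytes, [start, end). */
struct bad_range {
	uint64_t start;
	uint64_t end;
};

/*
 * block_ok[i] is non-zero if data block i of the file was read and verified.
 * Consecutive bad blocks are merged into one range. At most 'max' ranges are
 * stored in 'out'; the return value is the total number of ranges found.
 */
static size_t find_bad_ranges(const unsigned char *block_ok, size_t nblocks,
			      uint32_t block_size, struct bad_range *out,
			      size_t max)
{
	size_t i, found = 0;
	int in_run = 0;

	assert(block_ok != NULL && out != NULL && block_size > 0);

	for (i = 0; i < nblocks; i++) {
		if (!block_ok[i] && !in_run) {
			/* a new damaged run starts at this block */
			if (found < max)
				out[found].start = (uint64_t)i * block_size;
			in_run = 1;
		} else if (block_ok[i] && in_run) {
			/* the run ended just before this block */
			if (found < max)
				out[found].end = (uint64_t)i * block_size;
			found++;
			in_run = 0;
		}
	}
	if (in_run) {
		/* the last run extends to the end of the file */
		if (found < max)
			out[found].end = (uint64_t)nblocks * block_size;
		found++;
	}
	return found;
}
```

A tool could print each returned range next to the file name and let the user
decide whether a partially damaged copy is still worth extracting.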

Thanks,
//richard

hujianyang Feb. 9, 2015, 2:34 a.m. UTC | #10

Hi Artem,

On 2015/2/7 1:02, Artem Bityutskiy wrote:
> Hi Hujianyang,
> 
> On Thu, 2015-02-05 at 17:47 +0800, hujianyang wrote:
>> Current UBIFS is lack of recovery method, that means, once a UBIFS
>> partition refuse to mount, all data on that partition may lose.
>> The default recovery mechanism in UBIFS now can deal with corruption
>> on master node or power cut cleanup. But it's not enough. UBIFS
>> on flash may suffer different kinds of data corrupted, the most
>> common case, ECC error.
> 
> First of all, it is important to agree on terminology.
> 
> I think I understand what you mean in this paragraph, but other people
> may get wrong impression. Simply because "UBIFS has no recovery" is
> _absolutely_ not True. UBIFS has _a lot_ of recovery, just check
> 'recovery.c' :-)
> 
> But I understand that this is not the recovery you mean. And I
> understand that it may be difficult to express things in English.
> And good terminology will help - let's introduce it and and stick to it.
> 
> Here is what UBIFS "things" about file-system recovery.
> 
> There are 2 types of recovery:
> 
> 1. Power-cut recovery
> 2. Corruption recovery.
> 
> "Power-cut recovery" is, obviously, recovering from power cuts. Indeed,
> power-cuts may happen in the middle of write or erase operations and
> cause rubbish on the flash media. Cleaning up this rubbish at mount time
> is the power-cut recovery.
> 
> "Corruption recovery" is recovery from media corruptions. E.g., the
> flash is just too worn-out and does not keep data, or part of the flash
> is erased and part of the UBIFS meta-data and data are gone.
> 
> And these are 2 completely different cases, right?

Yes, nice definition.

I know about the power-cut recovery in recovery.c, but I didn't express it well. Thanks!

> 
> Now, UBIFS _does_ support power-cut recovery. In practice this means
> that you should always be able to mount the file-system after a power
> cut. All the garbage caused by the power cut should go away. No data
> which were on the flash media before the power cut should be lost. Any
> file which was fsync()'ed be before the power cut should be stay intact.
> 
> And this is not a trivial task. Power cuts may happen during garbage
> collecting, during commit. There may be a sequence of power cut:
> power-cut -> mount proces -> another power cut while we are recovering
> from the previous one -> and again and again.
> 
> UBIFS tries hard to provide power-cut recovery. There may be issues, and
> if there are, they are bugs which should be fixed.
> 
> The _corruption recovery_, on the other hand, is not implemented in the
> driver. And yes, there is not user-space tool. If UBIFS sees that some
> data structure is missing or corrupted, and at the same time UBIFS
> "knows" that this can't be because of a power cut - UBIFS refuses to
> mount the file-system or switches to R/O mode.
> 
> UBIFS does not make any attempt to do corruption recovery.
> 
> UBIFS authors believed it is simply impossible to do inside the driver
> for the generic case. E.g., what do you do if the LEB which should
> contain the UBIFS index now contains "rubbish"? Will you erase it? If
> yes, what if this turns out to be my favorite cat's picture? Or will you
> move it? If yes, what if there is no space to move to?

Power-cut recovery is predictable; that is to say:

1) Where the corrupted data is can be known.
2) What kind of corruption it is can be known.

But corruption recovery is quite different. Corrupted data may exist
anywhere and take any form. Even if we successfully mount a partition,
we don't know whether there are still corruptions on the flash.

In this respect, a userspace tool is better. It can do a whole scan, pick
up corruptions, check them and fix them.

> 
> User-space tools may start asking user questions, etc. Kernel driver
> can't. User-space tools may copy the "rubbish" somewhere so that users
> had chance to recover the picture of the beloved animal.
> 
>> I've scanned the archive of maillist and found the recovery method
>> was once requested(Sorry, I can't find the link). Artem suggested
>> we could introduce a new repairing mount option instead of working
>> on a new userspace repairing tool. But seems no more efforts had
>> been done so far.
> 
> I do not remember what I suggested, but I do not think corruption
> recover is possible to implement in the driver.
> 
> But I can imagine that there may be some specific cases which could be
> covered. If there is good justification for that, I am fine.
> 
>> +	/*
>> +	 * If an error occur during buds replay, try to revert filesystem
>> +	 * to last commit state. Should not apply corrupt replay list.
>> +	 */
>> +	if (!replay_buds(c)) {
>> +		err = apply_replay_list(c);
>> +		if (err)
>> +			goto out;
>> +	}
> 
> Reverting to the last committed state _may_ make sense. Probably this
> could be a mount option. In this case, though, UBIFS should periodically
> commit, say, every 5-10 seconds.
> 

Good suggestions. I will try to implement periodic commit first. But I
don't know if this feature is really needed. Switch to R/O and revert to
the last committed state? So far we have only considered the log, never
the index.

I think maybe we should first work out which kinds of corruption we can
recover from, and which kinds we could fix by adding some simple
mechanism.

Thanks,
Hu

hujianyang Feb. 9, 2015, 2:48 a.m. UTC | #11

On 2015/2/7 1:33, Richard Weinberger wrote:
> Am 06.02.2015 um 18:26 schrieb Artem Bityutskiy:
>> On Thu, 2015-02-05 at 07:08 -0800, Steve deRosier wrote:
>>> I hear (and agree with) several valid arguments for a tool in
>>> userspace. And I'd like to throw my support towards an in-driver
>>> solution. Flash filesystems are different than on-disk filesystems, in
>>> particular in their usecase: they're generally both critical and
>>> exclusive to embedded systems. As such, the entire filesystem might be
>>> on the corrupted UBIFS, so even if the filesystem is recoverable, if
>>> we can't mount it and get at the userspace tool, then we're toast.
>>> Often the kernel itself is stored in a separate read-only partition as
>>> a blob directly on the flash, and thus the kernel itself would be
>>> fine. The better UBI & UBIFS can recover to a usable state in-kernel,
>>> the better off we are I think.
>>
>> Yes, being able to mount a corrupted FS R/O sounds like a good goal. We
>> are not speaking of recovery here, just about mounting R/O and providing
>> access to as much uncorrupted data as we can.
>>
>> If FS index is not corrupted, this sounds quite doable. If the index is
>> corrupted, though, this requires full scan and index rebuild. Other wise
>> we'd mount and show empty file-system.
> 
> While I agree that mounting RO to get access to data is a feasible goal
> I really think that this is the job of a debugfs.ubifs tool.
> The kernel cannot ask questions, such a tool can.
> 

Hi Richard,

What's the difference between fsck.ubifs and debugfs.ubifs? debugfs.ubifs
seems to need to provide more debugging options, not just recovery. Could
you please say more about what you have in mind?

For the R/O mount case, I think we could do it directly in the kernel.

Thanks,
Hu

hujianyang Feb. 9, 2015, 3 a.m. UTC | #12

On 2015/2/7 1:43, Richard Weinberger wrote:
> Am 06.02.2015 um 18:40 schrieb Artem Bityutskiy:
>> On Fri, 2015-02-06 at 18:33 +0100, Richard Weinberger wrote:
>>> Am 06.02.2015 um 18:26 schrieb Artem Bityutskiy:
>>>> On Thu, 2015-02-05 at 07:08 -0800, Steve deRosier wrote:
>>>>> I hear (and agree with) several valid arguments for a tool in
>>>>> userspace. And I'd like to throw my support towards an in-driver
>>>>> solution. Flash filesystems are different than on-disk filesystems, in
>>>>> particular in their usecase: they're generally both critical and
>>>>> exclusive to embedded systems. As such, the entire filesystem might be
>>>>> on the corrupted UBIFS, so even if the filesystem is recoverable, if
>>>>> we can't mount it and get at the userspace tool, then we're toast.
>>>>> Often the kernel itself is stored in a separate read-only partition as
>>>>> a blob directly on the flash, and thus the kernel itself would be
>>>>> fine. The better UBI & UBIFS can recover to a usable state in-kernel,
>>>>> the better off we are I think.
>>>>
>>>> Yes, being able to mount a corrupted FS R/O sounds like a good goal. We
>>>> are not speaking of recovery here, just about mounting R/O and providing
>>>> access to as much uncorrupted data as we can.
>>>>
>>>> If FS index is not corrupted, this sounds quite doable. If the index is
>>>> corrupted, though, this requires full scan and index rebuild. Other wise
>>>> we'd mount and show empty file-system.
>>>
>>> While I agree that mounting RO to get access to data is a feasible goal
>>> I really think that this is the job of a debugfs.ubifs tool.
>>> The kernel cannot ask questions, such a tool can.
>>
>> The user-space tool would turn a corrupted FS into an uncorrupted FS.
> 
> This is what fsck.ubifs should to. I was talking about a debugfs.ubifs which
> is able to extract files, ask questions, and tell the user what exactly is going
> wrong. Like "yes, I can dump you file /foo/bar.dat but rage 5m to 10m maybe be corrupted and the xattrs are gone".
> 

Er, maybe I know what you mean.

So you think that with debugfs.ubifs we could get a wanted file out of a
partition without mounting it? And do other things like (?)

Moving a few files out may be simpler than mounting the whole partition in
some cases. But is it acceptable for scripts? If someone wants to run some
binaries stored on the corrupted UBIFS, I think mounting the partition R/O is
better than moving the requested file out and then running it.

Thanks,
Hu

hujianyang Feb. 9, 2015, 3:09 a.m. UTC | #13

Hi Steve,

On 2015/2/5 23:08, Steve deRosier wrote:
> On Thu, Feb 5, 2015 at 1:47 AM, hujianyang <hujianyang@huawei.com> wrote:
>> There are two ways for UBIFS recovery. One is repairing UBIFS image
>> in userspace via UBI interfaces, the other is repairing the corrupted
>> data during mount by default or via a special mount option.
>>
>> The userspace tool is the most effective way to repair a partition.
>> It could have enough time and resource to whole scan the target and
>> cleanup the corrupted while the file-system offline. But it's hard
>> to program: many structures and functions in kernel need to be copied
>> into this utility, current ubi-utils focus mostly on UBI device, not
>> UBIFS, and the subsequent updating of file-system should consider the
>> userspace tool. It's too complicated.
>>
> 
> I hear (and agree with) several valid arguments for a tool in
> userspace. And I'd like to throw my support towards an in-driver
> solution. 

Thanks~!

> Flash filesystems are different than on-disk filesystems, in
> particular in their usecase: they're generally both critical and
> exclusive to embedded systems. As such, the entire filesystem might be
> on the corrupted UBIFS, so even if the filesystem is recoverable, if
> we can't mount it and get at the userspace tool, then we're toast.
> Often the kernel itself is stored in a separate read-only partition as
> a blob directly on the flash, and thus the kernel itself would be
> fine. The better UBI & UBIFS can recover to a usable state in-kernel,
> the better off we are I think.
> 
> Just my thoughts on the matter.
> 
> - Steve
> 
> .
> 

I think that's a good point you are making. It's a problem like
filesystem driver and filesystem partition. But it may not exist only
in UBIFS. A good user configuration is always needed to solve problems
like this.

Maybe an acceptable solution is to mount the filesystem R/O first and then
perform other recovery steps.

Thanks,
Hu

Artem Bityutskiy Feb. 9, 2015, 7:51 a.m. UTC | #14

On Mon, 2015-02-09 at 10:34 +0800, hujianyang wrote:
> Good suggestions. I will try to realize periodically commit first. But I
> don't know if this feature is really needed. Switch to R/O and revert to
> last comitted state? But we just consider about log before, never think
> about index.

I think the right way to approach this problem is to come up with a high
level summary of the problems we are trying to solve, and the solutions,
along with some analysis of the solutions. This does not have to be very
detailed, but it should put everyone involved on the same page.

Artem.

Richard Weinberger Feb. 9, 2015, 7:56 a.m. UTC | #15

Am 09.02.2015 um 04:00 schrieb hujianyang:
>> This is what fsck.ubifs should to. I was talking about a debugfs.ubifs which
>> is able to extract files, ask questions, and tell the user what exactly is going
>> wrong. Like "yes, I can dump you file /foo/bar.dat but rage 5m to 10m maybe be corrupted and the xattrs are gone".
>>
> 
> Er, maybe I know what you mean.
> 
> So you think by debugfs.ubifs, we could get wanted file out from a partition
> without mounting it? and do other things like (?)

This is the use case of a debugfs. See debugfs.ext2/3/4, etc...
You can debug (analyze, get your files, etc...) a broken filesystem
without mounting it.

> Moving less files out maybe simpler than mounting the whole partition in some
> cases. But is it acceptable for scripts? If someone want to perform some binary
> files on the corrupted ubifs. I think mounting a R/O partition is better than
> moving the request file out and then run it.

Scripts?
debugfs is meant for _manual_ forensics/recovery.
Mounting R/O is not always an option; we cannot make UBIFS so smart that you
can always turn it into a state where you can safely get everything out of it.
And as I wrote in a previous mail, the interaction between kernel and user is almost zero.
Debugfs can ask questions and give you a much better overall overview of the filesystem.
This is exactly why debugfs was invented. You can also manually fix/transform the filesystem...

Thanks,
//richard

Richard Weinberger Feb. 9, 2015, 7:57 a.m. UTC | #16

Am 09.02.2015 um 08:51 schrieb Artem Bityutskiy:
> On Mon, 2015-02-09 at 10:34 +0800, hujianyang wrote:
>> Good suggestions. I will try to realize periodically commit first. But I
>> don't know if this feature is really needed. Switch to R/O and revert to
>> last comitted state? But we just consider about log before, never think
>> about index.
> 
> I think the right way to approach this problem is to come up with a high
> level summary of the problems we are trying to solve, and the solutions,
> along with some analysis of the solutions. This does not have to be very
> detailed, but it should put everyone involved into the same page.

Agreed. I fear we're talking about different things. :)

Thanks,
//richard

Artem Bityutskiy Feb. 9, 2015, 8:26 a.m. UTC | #17

On Mon, 2015-02-09 at 08:56 +0100, Richard Weinberger wrote:
> Am 09.02.2015 um 04:00 schrieb hujianyang:
> >> This is what fsck.ubifs should to. I was talking about a debugfs.ubifs which
> >> is able to extract files, ask questions, and tell the user what exactly is going
> >> wrong. Like "yes, I can dump you file /foo/bar.dat but rage 5m to 10m maybe be corrupted and the xattrs are gone".
> >>
> > 
> > Er, maybe I know what you mean.
> > 
> > So you think by debugfs.ubifs, we could get wanted file out from a partition
> > without mounting it? and do other things like (?)
> 
> This is the use case of a debugfs. See debugfs.ext2/3/4, etc...
> You can debug (analyze, get files your, etc...) from a broken filesystem
> without mounting it.

Let's consider 2 hypothetical gadgets using UBIFS: R-gadget and H-gadget.

1. R-gadget has UBIFS which refuses to mount whenever there is any
unexpected corruption.
2. H-gadget tries hard to mount in R/O mode and let the rest of the SW
stack have a file-system.

H-gadget is resilient. When things go wrong with the storage, it still
manages to boot, show a dialog explaining that there is a problem, let
users fetch all the important files, and then either reset to factory
defaults, or bring the device to the service point.

R-gadget, in contrast, just does not boot when there are issues.
Users see nothing on the screen. When they google for "R-gadget does not
boot", they hit some forum discussions, very technical, talking about
some "debugfs", which is very confusing.

The new generation of R-gadget, however, does a better job. Unlike the
first generation, shipped under tight TTM requirements, the second
generation gave the vendor a bit more time to polish it. So the vendor
managed to use "debugfs" stuff, and now R-gadget. But unfortunately,
this feature stopped working after the first system upgrade, because of a
bug (probably not enough testing). The R-gadget was asking strange
questions about moving some "inodes" from a broken "bud". But the input
did not work, and users had a hard time understanding "inodes" and
"buds" anyway (they thought an inode is some kind of flower).

Anyway, the message is: I'd prefer H-gadget :-)

hujianyang Feb. 9, 2015, 10:38 a.m. UTC | #18

Hi Artem and Richard,

On 2015/2/9 15:57, Richard Weinberger wrote:
> Am 09.02.2015 um 08:51 schrieb Artem Bityutskiy:
>> On Mon, 2015-02-09 at 10:34 +0800, hujianyang wrote:
>>> Good suggestions. I will try to realize periodically commit first. But I
>>> don't know if this feature is really needed. Switch to R/O and revert to
>>> last comitted state? But we just consider about log before, never think
>>> about index.
>>
>> I think the right way to approach this problem is to come up with a high
>> level summary of the problems we are trying to solve, and the solutions,
>> along with some analysis of the solutions. This does not have to be very
>> detailed, but it should put everyone involved into the same page.
> 
> Agreed. I fear we're talking about different things. :)
> 

I'm afraid I didn't explain the use case of the corruption recovery feature.
UBIFS is used mostly in embedded environments. Once products have been sold,
they are hard to debug. So the production team has to consider any failure
that could happen and put a recovery method into their operation
scripts/utilities.

Flash corruption is a problem they need to care about. Using high-quality
cells is not enough; ECC errors cannot be avoided. So a recovery method
provided by the filesystem itself is required. This feature is not for
us, the kernel developers, but for the production team. They know little
about the Linux kernel. So the easier the interface we provide, the more
effective the recovery methods they can build into their products. So, Artem,
I agree with your other email about R-gadget and H-gadget.

I think mounting R/O is a good beginning. We don't need to consider much about
how to recover, but can still provide a usable (in some cases) file-system.
And a R/O mount means we could do some cleanup to revert to this R/O state.
This R/O mount should be provided by the driver itself without any userspace
tools.

Thanks,
Hu

Richard Weinberger Feb. 9, 2015, 11:04 a.m. UTC | #19

Am 09.02.2015 um 09:26 schrieb Artem Bityutskiy:
> On Mon, 2015-02-09 at 08:56 +0100, Richard Weinberger wrote:
>> Am 09.02.2015 um 04:00 schrieb hujianyang:
>>>> This is what fsck.ubifs should to. I was talking about a debugfs.ubifs which
>>>> is able to extract files, ask questions, and tell the user what exactly is going
>>>> wrong. Like "yes, I can dump you file /foo/bar.dat but rage 5m to 10m maybe be corrupted and the xattrs are gone".
>>>>
>>>
>>> Er, maybe I know what you mean.
>>>
>>> So you think by debugfs.ubifs, we could get wanted file out from a partition
>>> without mounting it? and do other things like (?)
>>
>> This is the use case of a debugfs. See debugfs.ext2/3/4, etc...
>> You can debug (analyze, get files your, etc...) from a broken filesystem
>> without mounting it.
> 
> Lets consider hypothetical 2 gadgets using UBIFS: R-gadget and H-gadget.
> 
> 1. R-gadget has UBIFS which refuses to mount whenever there is any
> unexpected corruption.
> 2. H-gadget tries hard to mount in R/O mode and let the rest of the SW
> stack have a file-system.
> 
> H-gadget is resilient. When things go wrong with the storage, it still
> manages to boot, show a dialog explaining that there is a problem, let
> users fetch all the important files, and then either reset to factory
> defaults, or bring the device to the service point.

The question is, can we achieve that?
Just falling back to R/O and continuing is not good enough.
What if the "/" inode or /lib/libc*so is broken?
Just by falling back to R/O the target won't magically be in a consistent
state.

> R-gadget, on the opposite, just does not boot when there are issues.
> Users see nothing on the screen. When they google for "R-gadget does not
> boot", they hit some forum discussions, very technical, talking about
> some "debugfs", which is very confusing.

It is not our job to make sure what users will find if they google for something. ;)

In contrast to the H-Gadget, the R-Gadget can print a perfectly sane message to the user.
Use an initramfs to mount UBIFS; if it fails, display a nice message to the user that something
major went wrong...
On the other hand, the H-Gadget will continue to some point, fail or maybe not fail.
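
The initramfs approach described above could look roughly like the
following sketch in plain C. Everything in it is an assumption made for
illustration: boot_mount_step(), the mount_fn callback type, and the two
fake_mount_*() stand-ins are hypothetical names, and "ubi0:rootfs" and
"/newroot" are only example arguments. A real init program would pass a
wrapper around mount(2) instead of the stand-ins.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical callback: returns 0 on success, non-zero on failure. */
typedef int (*mount_fn)(const char *dev, const char *dir);

/* Stand-in for a mount attempt that succeeds (illustration only). */
static int fake_mount_ok(const char *dev, const char *dir)
{
	(void)dev;
	(void)dir;
	return 0;
}

/* Stand-in for a mount attempt that is refused (illustration only). */
static int fake_mount_fails(const char *dev, const char *dir)
{
	(void)dev;
	(void)dir;
	return -1;
}

/*
 * Try to mount the root file-system. On failure, do not continue
 * booting; tell the user plainly that something major went wrong.
 * Returns 0 if the mount succeeded and -1 otherwise.
 */
static int boot_mount_step(mount_fn try_mount, const char *dev,
			   const char *dir, FILE *console)
{
	if (try_mount(dev, dir) == 0)
		return 0;

	fprintf(console,
		"Storage problem: could not mount %s on %s.\n"
		"Please bring the device to a service point.\n",
		dev, dir);
	return -1;
}
```

Calling boot_mount_step(fake_mount_fails, "ubi0:rootfs", "/newroot", stderr)
prints the message and returns -1, so the caller can stop the boot at a
well-defined point instead of continuing with an unknown file-system state.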

> The new generation of R-gadget, however, does better job. Unlike the
> first generation, shipped under tight TTM requirements, the second
> generation gave the vendor a bit more time to polish it. So the vendor
> managed to use "debugfs" stuff, and now R-gadget. But unfortunately,
> this feature stopped working after first system upgrade, because of a
> bug (probably not enough testing). The R-gadgets was asking strange
> question about moving some "inodes" from a broken "bud". But the input
> did not work, and users anyway had hard time understanding "inodes" and
> "buds" (they thought and inoed is some kind of flower).
> 
> Anyway, the message is: I'd prefer H-gadget :-)

My points are:
- If UBIFS can do a better job in dealing with corruptions, fix/improve it.
- Having a debugfs/fsck would be a good tool for people like me that have to analyze/fix UBI/UBIFS failures.
- Having a UBIFS "force" mode *will* be abused in horrid ways. I agree that I'm a bit biased on that, maybe because I've seen too many
horror hacks from embedded vendors to make their devices somehow pass QA (quote: "just make it boot to pass all tests").
Of course all these "just make it boot" hacks failed later due to undetected major corruptions, as the filesystem consistency was gone a long time ago,
but it booted somehow for a few more days.^^

Thanks,
//richard

Richard Weinberger Feb. 9, 2015, 11:05 a.m. UTC | #20

Am 09.02.2015 um 11:38 schrieb hujianyang:
> Hi Artem and Richard,
> 
> On 2015/2/9 15:57, Richard Weinberger wrote:
>> Am 09.02.2015 um 08:51 schrieb Artem Bityutskiy:
>>> On Mon, 2015-02-09 at 10:34 +0800, hujianyang wrote:
>>>> Good suggestions. I will try to realize periodically commit first. But I
>>>> don't know if this feature is really needed. Switch to R/O and revert to
>>>> last comitted state? But we just consider about log before, never think
>>>> about index.
>>>
>>> I think the right way to approach this problem is to come up with a high
>>> level summary of the problems we are trying to solve, and the solutions,
>>> along with some analysis of the solutions. This does not have to be very
>>> detailed, but it should put everyone involved into the same page.
>>
>> Agreed. I fear we're talking about different things. :)
>>
> 
> I'm afraid I didn't express the use case of the corruption recovery feature.
> UBIFS is used mostly in embedded environment. After products selling out,
> it's hard to debug it. So the production team may consider any failure that
> could happen and put the recovery method into their operation scripts/utilities.
> 
> Flash corruption is a problem they need to care about. Using high quality
> cell is not enough, ECC error could not be avoid. So a recovery method which
> is provided by filesystem itself is required. This feature is not used by
> us, the developer of kernel, but the production team. They know little about
> linux kernel. So the easier interface we provide, the much effective recovery
> method of the products they could make. So, Artem, I'm agree with your another
> email mail about R-gadget and H-gadget.
> 
> I think mount R/O is a good beginning. We don't need consider much about how
> to recover but can provide a usable(in some cases) file-system. And a R/O
> mount means we could do some cleanup to revert to this R/O state. This R/O
> mount should be provided by driver itself without any userspace tools.

So, at the end of the day you want an UBIFS that can deal with randomly failed PEBs?

Thanks,
//richard

Artem Bityutskiy Feb. 9, 2015, 11:18 a.m. UTC | #21

On Mon, 2015-02-09 at 18:38 +0800, hujianyang wrote:
> I think mount R/O is a good beginning. We don't need consider much about how
> to recover but can provide a usable(in some cases) file-system. And a R/O
> mount means we could do some cleanup to revert to this R/O state. This R/O
> mount should be provided by driver itself without any userspace tools.

I guess if we decompose the problem this way it will also be helpful (to
you and the readers).

1. There are types of corruptions when UBIFS mounts the file-system just
fine. For example, a committed data node is corrupted. You will only
notice this when you read the corresponding file, and this is the point
when the file-system becomes read-only.

2. There are types of corruptions when UBIFS refuses to mount. These are
related to the replay process. Whenever there is a corrupted node which
does not look like a result of power-cut, UBIFS refuses to mount.

It appears to me that you are after nailing down the problem #2. You
want UBIFS to still mount the FS, and stay R/O. Is this correct?

I would like you to consider problem #1 too. Consider cases like: a data
node is corrupted, an inode is corrupted (both directory and
non-directory), a dentry is corrupted, an index node is corrupted, an
LPT area is corrupted.

What happens in each of these cases? Are you OK with that or would you
like to change that? What does the product team do in these cases?

You do not have to answer these questions in this e-mail. You can, but
these are mostly for you, so that you see the bigger picture.

Now, regarding problem #2.

There are multiple cases here too: master nodes are corrupted, a
corruption in the log, and corruption in the journal (buds), a
corruption in the LPT area, a corruption in the index.

I'd like you to think about all these cases. Again, just for yourself,
to understand the broader picture.

It looks like you are focusing on corruptions in buds, right? Is it
because this is the most probable situation, or is this something which
shows up as problems in the field/testing?

You suggest that in case of a corrupted bud, you just try to go back to
the previous committed state.

This sounds rational to me. As I described, though, the problem is that
'fsync()' does not mean 'commit'. So what this means is that, say, mysql
fsync()'s its database, and believes it is now on the media. But then
there is a problem in the journal, in some LEB which is not related to
the fsync()'ed mysql database at all, and you drop the database changes.

So the better thing to do is to try dropping just the corrupted nodes,
not the entire journal. It does not sound too hard - you just keep
scanning and skip corrupted nodes. Replay as usual. Just mark the FS as
R/O if corruptions were not power-cut-related.
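
To make the "skip corrupted nodes, replay the rest" idea above concrete,
here is a minimal sketch in plain C. It is not UBIFS code: enum
node_state, struct replay_result and replay_skip_corrupted() are
hypothetical names, and how a node would be classified as power-cut
damage versus other corruption is assumed to happen elsewhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of one scanned journal node. */
enum node_state {
	NODE_OK,		/* node is intact */
	NODE_BAD_POWER_CUT,	/* damage looks like an interrupted write */
	NODE_BAD_OTHER,		/* any other corruption, e.g. an ECC error */
};

struct replay_result {
	size_t applied;		/* nodes that would be replayed */
	size_t skipped;		/* corrupted nodes that were dropped */
	bool read_only;		/* true if the FS should stay R/O */
};

/*
 * Walk all scanned nodes. Good nodes are counted as replayed, corrupted
 * ones are skipped instead of discarding the whole journal. Any
 * corruption that is not power-cut-related forces read-only mode.
 */
static struct replay_result
replay_skip_corrupted(const enum node_state *nodes, size_t count)
{
	struct replay_result res = { 0, 0, false };
	size_t i;

	for (i = 0; i < count; i++) {
		if (nodes[i] == NODE_OK) {
			res.applied++;
			continue;
		}
		res.skipped++;
		if (nodes[i] == NODE_BAD_OTHER)
			res.read_only = true;
	}
	return res;
}
```

With a journal of { NODE_OK, NODE_BAD_OTHER, NODE_OK, NODE_BAD_POWER_CUT }
the sketch keeps the two good nodes, skips the two bad ones, and reports
read_only as true because one corruption was not power-cut-related. The
design point is that an fsync()'ed node in an unrelated LEB survives a
corruption elsewhere in the journal.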

hujianyang Feb. 9, 2015, 11:23 a.m. UTC | #22

On 2015/2/9 19:05, Richard Weinberger wrote:
> Am 09.02.2015 um 11:38 schrieb hujianyang:
>> Hi Artem and Richard,
>>
>> On 2015/2/9 15:57, Richard Weinberger wrote:
>>> Am 09.02.2015 um 08:51 schrieb Artem Bityutskiy:
>>>> On Mon, 2015-02-09 at 10:34 +0800, hujianyang wrote:
>>>>> Good suggestions. I will try to realize periodically commit first. But I
>>>>> don't know if this feature is really needed. Switch to R/O and revert to
>>>>> last comitted state? But we just consider about log before, never think
>>>>> about index.
>>>>
>>>> I think the right way to approach this problem is to come up with a high
>>>> level summary of the problems we are trying to solve, and the solutions,
>>>> along with some analysis of the solutions. This does not have to be very
>>>> detailed, but it should put everyone involved into the same page.
>>>
>>> Agreed. I fear we're talking about different things. :)
>>>
>>
>> I'm afraid I didn't express the use case of the corruption recovery feature.
>> UBIFS is used mostly in embedded environment. After products selling out,
>> it's hard to debug it. So the production team may consider any failure that
>> could happen and put the recovery method into their operation scripts/utilities.
>>
>> Flash corruption is a problem they need to care about. Using high quality
>> cell is not enough, ECC error could not be avoid. So a recovery method which
>> is provided by filesystem itself is required. This feature is not used by
>> us, the developer of kernel, but the production team. They know little about
>> linux kernel. So the easier interface we provide, the much effective recovery
>> method of the products they could make. So, Artem, I'm agree with your another
>> email mail about R-gadget and H-gadget.
>>
>> I think mount R/O is a good beginning. We don't need consider much about how
>> to recover but can provide a usable(in some cases) file-system. And a R/O
>> mount means we could do some cleanup to revert to this R/O state. This R/O
>> mount should be provided by driver itself without any userspace tools.
> 
> So, at the end of the day you want an UBIFS that can deal with randomly failed PEBs?
> 

It depends. We know we can't deal with all kinds of failure. So, as I see it,
we should first list the error types.

If the log is corrupted, we could discard it and just revert to the last
committed state.
If the index is corrupted, we could scan the whole partition and rebuild the
index.
And so on.

There must be some cases where we can't achieve what we expect. What can we do
in that situation? Fail the mount, or mount an empty R/O partition? We could
discuss that.
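
The "list the error types first" approach above can be written down as a
small table in C. This is only a sketch of the two cases named in this
mail; enum corrupt_area, enum recover_action and pick_action() are
hypothetical names, and every other case is deliberately left undecided
because the thread has not settled what should happen there.

```c
#include <assert.h>

/* Hypothetical: which on-flash area was found corrupted. */
enum corrupt_area {
	AREA_LOG,
	AREA_INDEX,
	AREA_OTHER,	/* everything not discussed yet */
};

/* Hypothetical: what the driver would do about it. */
enum recover_action {
	ACT_REVERT_TO_LAST_COMMIT,	/* discard the log */
	ACT_SCAN_AND_REBUILD_INDEX,	/* full scan of the partition */
	ACT_UNDECIDED,			/* fail the mount or mount empty R/O? */
};

/* Map a corrupted area to the recovery action proposed in this mail. */
static enum recover_action pick_action(enum corrupt_area area)
{
	switch (area) {
	case AREA_LOG:
		return ACT_REVERT_TO_LAST_COMMIT;
	case AREA_INDEX:
		return ACT_SCAN_AND_REBUILD_INDEX;
	default:
		return ACT_UNDECIDED;
	}
}
```

Keeping the mapping in one function makes the open cases visible: each
new error type needs an explicit entry before it stops falling through to
ACT_UNDECIDED.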

Recovery is hard, but mounting R/O is much easier, I think. For the recovery
case, we may need some kind of userspace tool, but that is much more complex
work. Such a tool is different from anything we already have. We should be
cautious about starting there.

Thanks,
Hu

Artem Bityutskiy Feb. 9, 2015, 11:36 a.m. UTC | #23

Richard,

On Mon, 2015-02-09 at 12:04 +0100, Richard Weinberger wrote:
> My points are:
> - If UBIFS can do a better job in dealing with corruptions, fix/improve it.

Right.

> - Having a debugfs/fsck would be a good tool for people like me that have to analyze/fix UBI/UBIFS failures.

Right. I think no one denies this. Correct, and I agree on this.

> - Having an UBIFS "force" mode *will* be abused in horrid ways.

I did not see anyone suggesting this. Was this suggested?

As I read it, Steve just expressed a high-level user standpoint: the
more you can do without external tools the better. I did not see him
suggesting "just mount at any price".

Artem.

Richard Weinberger Feb. 9, 2015, 11:48 a.m. UTC | #24

Am 09.02.2015 um 12:36 schrieb Artem Bityutskiy:
> Richard,
> 
> On Mon, 2015-02-09 at 12:04 +0100, Richard Weinberger wrote:
>> My points are:
>> - If UBIFS can do a better job in dealing with corruptions, fix/improve it.
> 
> Right.
> 
>> - Having a debugfs/fsck would be a good tool for people like me that have to analyze/fix UBI/UBIFS failures.
> 
> Right. I think no one denies this. Correct, and I agree on this.
> 
>> - Having an UBIFS "force" mode *will* be abused in horrid ways.
> 
> I did not see anyone suggesting this. Was this suggested?
> 
> As I read it, Steve just expressed a high-level user standpoint: the
> more you can do without external tools the better. I did not see him
> suggesting "just mount at any price".

It was not directly suggested by Steve, sorry if I was not clear about that!
I get such requests rather often from customers and therefore I'm sick of explaining
why this is a bad idea and quite nervous because most of the time vendors try to hide
issues in their software stack when they ask for such an option.

That said, if we define clearly in which situations UBIFS can safely mount R/O I'm happy.
But please make this new mount option opt-in and disabled by default.

Thanks,
//richard

hujianyang Feb. 9, 2015, 12:02 p.m. UTC | #25

Hi Artem,

On 2015/2/9 19:18, Artem Bityutskiy wrote:
> On Mon, 2015-02-09 at 18:38 +0800, hujianyang wrote:
>> I think mount R/O is a good beginning. We don't need consider much about how
>> to recover but can provide a usable(in some cases) file-system. And a R/O
>> mount means we could do some cleanup to revert to this R/O state. This R/O
>> mount should be provided by driver itself without any userspace tools.
> 
> I guess if we decompose the problem this way it will also be helpful (to
> you and the readers).
> 
> 1. There are types of corruptions when UBIFS mounts the file-system just
> fine. For example, a committed data node is currupted. You will only
> notice this when you read the corresponding file, and this is the point
> when the file-system becomes read-only.
> 
> 
> 2. There are types of corruptions when UBIFS refuses to mount. These are
> related to the replay process. Whenever there is a corrupted node which
> does not look like a result of power-cut, UBIFS refuses to mount.
> 
> 
> It appears to me that you are after nailing down the problem #2. You
> want UBIFS to still mount the FS, and stay R/O. Is this correct?
> 
> 
> I would like you to consider problem #1 too. Consider cases like: a data
> node is corrupted, an inode is corrupted (both directory and
> non-directory), a dentry is corrupted, an index node is corrupted, an
> LPT are is corrupted.
> 
> What happens in each of these cases? Are you OK with that or you'd like
> to change that? What the product team does in these cases?
> 

Er, that's a good point. I'm not sure about it; I'd like to talk with them
about it. But I think maybe they haven't considered this problem either.

I don't want to change the current behavior. But maybe we could repair these
kinds of problems with a userspace tool or a repair mode in the kernel in
this process.

> You do not have to answer these questions in this e-mail. You can, but
> these are mostly for you, so that you see the bigger picture.
> 
> 
> Now, regarding problem #2.
> 
> 
> There are multiple cases here too: master nodes are corrupted, a
> corruption in the log, and corruption in the journal (buds), a
> corruption in the LPT area, a corruption in the index.
> 
> I'd like you to think about all these cases. Again, just for yourself,
> to understand the broader picture.
> 
> 
> It looks like you are focusing on corruptions in buds, right? Is it
> because this is the most probable situation, or is this something which
> show problems in the field/testing?
> 

No. It's because bud corruptions showed up in our environment, so we first
fixed them in a crude way. It does not mean we focus only on this kind of
corruption, and we don't insist on our existing code. A better solution is
welcome.

> 
> You suggest that in case of a corrupted bud, you just try to go back to
> the previous commited state.
> 
> 
> This sounds rational to me. As I described, though, the problem is that
> 'fsync()' does not mean 'commit'. So what this means is that, say, mysql
> fsync()'s its database, and believes it is now on the media. But then
> there is a problem in the journal, in some LEB which is not related to
> the fsync()'ed mysql database at all, and you drop the database changes.
> 

Yes, you explained that. I have been thinking about it these days.

> 
> So the better thing to do is to try dropping just the corrupted nodes,
> not the entire journal. It does not sound too hard - you just keep
> scanning and skip corrupted nodes. Replay as usual. Just mark the FS as
> R/O if corruptions were not power-cut-related.
> 
> 

Marking the FS R/O will not change anything on flash; write/flush are
disallowed.

I'm also thinking about snapshots. Do you think that is an acceptable
solution? Leaving any corruptions as they are, we directly keep a usable
snapshot that the user could apply if the current partition refuses to mount.
I don't want to complicate the discussion; it's just a new thought.

Coming back to recovery, I know it's hard work, as you described, and
there is a lot to consider. But we don't need a complete plan at the
beginning; we could make our solution deal with corruptions step by step,
and turn it into a useful solution over time.

Thanks,
Hu

Richard Weinberger Feb. 9, 2015, 12:08 p.m. UTC | #26

Steve,

Am 06.02.2015 um 00:36 schrieb Richard Weinberger:
>> I hear (and agree with) several valid arguments for a tool in
>> userspace. And I'd like to throw my support towards an in-driver
>> solution. Flash filesystems are different than on-disk filesystems, in
>> particular in their usecase: they're generally both critical and
>> exclusive to embedded systems. As such, the entire filesystem might be
>> on the corrupted UBIFS, so even if the filesystem is recoverable, if
>> we can't mount it and get at the userspace tool, then we're toast.
> 
> No, embedded is not per se an excuse for doing bad/stupid things.
> Embedded is *not* special.
> There are folks out there that want a "force" mount option for UBIFS
> to mount it in any case no matter in how bad shape it is.
> But this will make the situation much worse as you'll get silent data
> corruption/loss.
> It is as stupid as running a "fsck -y /dev/sdXY" at every boot on a
> regular disk filesystem.

I just want to point out that my rather harsh reply was not meant as an attack against you.
I'm sorry for that, please accept my apology.

Thanks,
//richard

Ricard Wanderlof Feb. 9, 2015, 12:12 p.m. UTC | #27

On Mon, 9 Feb 2015, hujianyang wrote:

> Hi Artem and Richard,
> 
> On 2015/2/9 15:57, Richard Weinberger wrote:
> > On 09.02.2015 at 08:51, Artem Bityutskiy wrote:
> >> On Mon, 2015-02-09 at 10:34 +0800, hujianyang wrote:
> >>> Good suggestions. I will try to realize periodically commit first. But I
> >>> don't know if this feature is really needed. Switch to R/O and revert to
> >>> last comitted state? But we just consider about log before, never think
> >>> about index.
> >>
> >> I think the right way to approach this problem is to come up with a high
> >> level summary of the problems we are trying to solve, and the solutions,
> >> along with some analysis of the solutions. This does not have to be very
> >> detailed, but it should put everyone involved into the same page.
> > 
> > Agreed. I fear we're talking about different things. :)
> > 
> 
> I'm afraid I didn't express the use case of the corruption recovery feature.
> UBIFS is used mostly in embedded environment. After products selling out,
> it's hard to debug it. So the production team may consider any failure that
> could happen and put the recovery method into their operation scripts/utilities.
> 
> Flash corruption is a problem they need to care about. Using high quality
> cell is not enough, ECC error could not be avoid. So a recovery method which
> is provided by filesystem itself is required.

Isn't this a bit backward? Given a certain acceptable failure rate for a 
product, select an appropriate flash chip in combination with a reasonable 
amount of ECC to get a medium that has a low enough error rate so that 
higher levels do not need to concern themselves. If a high level of 
reliability is needed, then some other form of nonvolatile storage should 
be selected.

The only high level function should be some sort of periodic scrubbing of 
NAND flash blocks to ensure the error rate does not rise too fast 
unnoticed.
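
As a rough sketch of what such a scrubbing check could look like: a block
is rewritten once the number of bit flips corrected by ECC reaches some
fraction of what the ECC can correct. The struct, the field names and the
num/den threshold are assumptions made up for this example, not an
existing MTD or UBI interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block read statistics; names are illustrative only. */
struct block_stats {
	unsigned int corrected_bits; /* bit flips ECC corrected on last read */
	unsigned int ecc_strength;   /* max bit flips ECC can correct */
};

/*
 * Return true when the corrected bit flips have reached the fraction
 * num/den of the ECC strength, i.e. the block should be scrubbed
 * before it degrades into an uncorrectable error.
 */
static bool needs_scrub(const struct block_stats *b,
			unsigned int num, unsigned int den)
{
	assert(b != NULL && den != 0);
	return b->corrected_bits * den >= b->ecc_strength * num;
}
```

With a threshold of 3/4 and an ECC strength of 8 bits, a block showing 6
corrected bit flips would be scrubbed, while a block showing 1 would not.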

Having UBIFS manage random corruptions would seem hopeful at best; if some 
critical file is corrupted then the system can't start anyway.

In any system all components have a failure rate, so it's a question of 
getting the failure rate of the NAND subsystem on par with the failure 
rate of other components. Just because there is a theoretical possibility 
of fixing a UBIFS problem does not really make the system more reliable 
per se. What if you get a fault in a RAM chip? The CPU? The PSU? In all 
those cases the product will be simply "broken", and we can handle 
defective flash the same way. A transistor in the PSU blew or the NAND 
flash happened to be the one-in-a-million part that keeps losing 
bits. Same result, product dead, repair or replace it.

/Ricard

hujianyang Feb. 9, 2015, 12:38 p.m. UTC | #28

On 2015/2/9 20:12, Ricard Wanderlof wrote:
> 
> On Mon, 9 Feb 2015, hujianyang wrote:
> 
>> Hi Artem and Richard,
>>
>> On 2015/2/9 15:57, Richard Weinberger wrote:
>>> On 09.02.2015 at 08:51, Artem Bityutskiy wrote:
>>>> On Mon, 2015-02-09 at 10:34 +0800, hujianyang wrote:
>>>>> Good suggestions. I will try to realize periodically commit first. But I
>>>>> don't know if this feature is really needed. Switch to R/O and revert to
>>>>> last comitted state? But we just consider about log before, never think
>>>>> about index.
>>>>
>>>> I think the right way to approach this problem is to come up with a high
>>>> level summary of the problems we are trying to solve, and the solutions,
>>>> along with some analysis of the solutions. This does not have to be very
>>>> detailed, but it should put everyone involved into the same page.
>>>
>>> Agreed. I fear we're talking about different things. :)
>>>
>>
>> I'm afraid I didn't express the use case of the corruption recovery feature.
>> UBIFS is used mostly in embedded environment. After products selling out,
>> it's hard to debug it. So the production team may consider any failure that
>> could happen and put the recovery method into their operation scripts/utilities.
>>
>> Flash corruption is a problem they need to care about. Using high quality
>> cell is not enough, ECC error could not be avoid. So a recovery method which
>> is provided by filesystem itself is required.
> 
> Isn't this a bit backward? Given a certain acceptable failure rate for a 
> product, select an appropriate flash chip in combination with a reasonable 
> amount of ECC to get a medium that has a low enough error rate so that 
> higher levels do not need to concern themselves. If a high level of 
> reliability is needed, then some other form of nonvolatile storage should 
> be selected.
> 
> The only high level function should be some sort of periodic scrubbing of 
> NAND flash blocks to ensure the error rate does not rise too fast 
> unnoticed.
> 
> Having UBIFS manage random corruptions would seem hopeful at best, if some 
> critical file is corrupted then the system can't start anyway.
> 
> In any system all components have a failure rate, so it's a question of 
> getting the failure rate of the NAND subsystem on par with the failure 
> rate of other components. Just because there is a theoretical possibility 
> of fixing an UBIFS problem does not really make the system more reliable 
> per se. What if you get a fault in a RAM chip? The CPU? The PSU? In all 
> those cases the product will be simply "broken", and we can handle 
> defective flash the same way. A transistor in the PSU blew or the NAND 
> flash happened to be the one-in-a-million part that keeps losing 
> bits. Same result, product dead, repair or replace it.
> 
> /Ricard
> 

Hi Ricard,

Yes, that's true. We can't deal with every kind of problem, and in the
worst case we could re-format the partition.

But we could do something when data corruption occurs during mount or
during I/O. In the mount case, the current driver makes no effort if a
non-power-cut corruption occurs. In my view, that could be improved.

I think the improvement is worth doing, rather than just saying "It's
broken, you need a new one". We can come up with solutions for some small
cases now. The problem is defining which kinds of problems we can fix. I
don't want to make an unachievable plan, but I really think we could do
something, just in the kernel, to improve things.

Thanks,
Hu
