Message ID | AANLkTi==-biDcaFozVvkHUwXMYqMWk4=mbD16_TJ1JOO@mail.gmail.com |
---|---|
State | New, archived |
Headers | show |
On Fri, 2011-01-28 at 12:13 -0500, twebb wrote: > It's been awhile but I wanted to reopen the discussion on this topic. > Could you take a look at this proposed patch? Essentially this change > results in the LEB being cleaned/recovered regardless of whether > is_last_write() is true or not. There may be a better way to do this > earlier in the same function, but I'm not familiar enough with it to > know the significance of the is_last_write() call. But I think I explained why this check is there. Why exactly it does not work for your flash? I think you need to get better understanding what is happening in your case. I am reluctant to take this patch because it is more of a band-aid but not a proper solution.
>> It's been awhile but I wanted to reopen the discussion on this topic. >> Could you take a look at this proposed patch? Essentially this change >> results in the LEB being cleaned/recovered regardless of whether >> is_last_write() is true or not. There may be a better way to do this >> earlier in the same function, but I'm not familiar enough with it to >> know the significance of the is_last_write() call. > > But I think I explained why this check is there. Why exactly it does not > work for your flash? I think you need to get better understanding what > is happening in your case. I am reluctant to take this patch because it > is more of a band-aid but not a proper solution. > I don't recall an explanation about why that check is there, but regardless, I'll try to explain why things are failing on my flash: Let's take the case of a read or write disturb error causing multiple bit flips in the empty space of a LEB (not in the common header node magic number location). I believe this type of error is very common with MLC (since ECC will generally handle bit flips in non-FF areas.) LEB: | good nodes | 0xFFs | bit flip | 0xFFs | bit flip | 0xFFs| When ubifs_recover_leb() calls ubifs_scan_a_node(), it correctly returns SCANNED_EMPTY_SPACE. Then ubifs_recover_leb() finds that the LEB buf is not empty. It also finds that !is_last_write() is TRUE and breaks without setting empty_chkd. Later, as a result of !empty_chkd and !is_empty and !is_last_write all being TRUE, the LEB is marked as corrupted. This ultimately may result in a failure to mount or in a RO mount. However, because of the nature of the "corruption", if ubifs_recover_leb() ignores is_last_write() result and instead calls clean_buf() and sets need_clean = 1, then fix_unclean_leb() ultimately fixes the bit flip via the ubi_leb_change() call and without data loss. Does this make sense or is my logic wrong? I think it's OK assuming that no true corruption (associated with a power loss) happens in conjunction with the rd/wr disturb. I do understand that perhaps this is more a band-aid than a proper solution. However, I'm trying to understand whether it is reasonable and whether you think it does more good than harm. twebb
On Mon, 2011-01-31 at 20:26 -0500, twebb wrote: > >> It's been awhile but I wanted to reopen the discussion on this topic. > >> Could you take a look at this proposed patch? Essentially this change > >> results in the LEB being cleaned/recovered regardless of whether > >> is_last_write() is true or not. There may be a better way to do this > >> earlier in the same function, but I'm not familiar enough with it to > >> know the significance of the is_last_write() call. > > > > But I think I explained why this check is there. Why exactly it does not > > work for your flash? I think you need to get better understanding what > > is happening in your case. I am reluctant to take this patch because it > > is more of a band-aid but not a proper solution. > > > > I don't recall an explanation about why that check is there, but > regardless, I'll try to explain why things are failing on my flash: Well, I explained it in various threads, probably not in this one. But I've just pushed a patch to the ubifs-2.6.git tree which should shed some light: http://git.infradead.org/ubifs-2.6.git/blobdiff/c8aa892b861cda6fc29526feae5d41db69523c36..5b1791a1529ccd4e49c59fe9f8fb6575f74c569e:/fs/ubifs/recovery.c > Let's take the case of a read or write disturb error causing multiple > bit flips in the empty space of a LEB (not in the common header node > magic number location). I believe this type of error is very common > with MLC (since ECC will generally handle bit flips in non-FF areas.) > > LEB: | good nodes | 0xFFs | bit flip | 0xFFs | bit flip | 0xFFs| What will happen if I write to the space which goes after "good nodes" and contain bit-flips? If I fill that space with UBIFS nodes, will they be all-right? Is it safe? Because this is what UBIFS will do. ... snip ... > However, because of the nature of the "corruption", if > ubifs_recover_leb() ignores is_last_write() result and instead calls > clean_buf() and sets need_clean = 1, then fix_unclean_leb() ultimately > fixes the bit flip via the ubi_leb_change() call and without data > loss. We can do this, but if your flash is so weak that you can never rely on empty space being 0xFFed, then you are screwed anyway. Because UBIFS will write nodes to other LEBs with empty space damaged by bit-flips. With such weak flash we'd need to do much more than the patch you've sent. This is why I think your change is a hack which hides the problem, not fixes. And BTW, can you confirm that with your patch your system can survive stress tests and good power-cut testing? Did you do this, can you describe how you stress your system? If the answer is "yes", then I doubt the reason for those bit-flips is read-disturb. If "yes", then I really suspect the reason for bit-flips is somewhat similar to the unstable bits problem reported by Matthieu CASTET and discussed here. His problem is SLC-specific. The idea of the issue is that if you cut the power just after you started writing to the NAND page X, you then may find that it becomes unstable. You can sometimes read all 0xFFs from there, some-times have bit-flips, sometimes even hard ECC errors. UBIFS does not currently deal with this issue and I'm going to work on it as soon as I have time. UBI also needs fixes to deal with this issue. I did start doing some work some time ago and documented the issue in my UBI development tree: http://git.infradead.org/ubi-2.6.git/blobdiff/0d3d5b3db2e2a0c6eed40714a3e9505031769508..a3353e9ebd42c710396ed6e8db0bda0a600bb1a5:/drivers/mtd/ubi/scan.c So, probably in case of MLC the unstable bit problem becomes something more severe. Thus, if you'd invest time and more properly explore the nature of these bit-flips you have, if you experimented properly and if you could give exact description about how they appeared, not just "I guess these are read-disturbs", then we can think how to really fix the issue. Thanks!
diff -Naru a/fs/ubifs/recovery.c b/fs/ubifs/recovery.c --- a/fs/ubifs/recovery.c 2011-01-26 16:08:04.000000000 -0500 +++ b/fs/ubifs/recovery.c 2011-01-26 16:08:25.000000000 -0500 @@ -665,19 +665,8 @@ } if (!empty_chkd && !is_empty(buf, len)) { - if (is_last_write(c, buf, offs)) { - clean_buf(c, &buf, lnum, &offs, &len); - need_clean = 1; - } else { - int corruption = first_non_ff(buf, len); - - ubifs_err("corrupt empty space LEB %d:%d, corruption " - "starts at %d", lnum, offs, corruption); - /* Make sure we dump interesting non-0xFF data */ - offs = corruption; - buf += corruption; - goto corrupted; - } + clean_buf(c, &buf, lnum, &offs, &len); + need_clean = 1; } /* Drop nodes from incomplete group */