From patchwork Tue Oct 2 12:25:23 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Oberhollenzer
Both UBI (see here) and UBIFS are tolerant to power-cuts, and they were designed with this property in mind.
-Year 2011 note: however, there is an unsolved
-unstable bits issue which makes
-UBI/UBIFS fail to recover after a power cut on modern SLC and MLC flashes. This
-issue has not been observed on older SLC NANDs back at the time UBI/UBIFS was
-being developed. Note, the below text is quite old and has been written before
-the unstable bits issue has been first discovered.
-UBIFS has internal debugging infrastructure to emulate power failures and the authors used it for extensive testing. It was tested for long time with power-fail emulation. The advantage of the emulation is that it emulates power
@@ -311,141 +303,8 @@ some specific aspects of MLC NAND flashes:
emulation, then use the integck
test for testing. After
all the issues are fixed, real power-cut tests could be carried
out.
-
- In the MTD community the "unstable bits" term is used to describe data
-instabilities caused by power cuts while writing or erasing. The unstable bits
-issue is still not resolved in UBI and UBIFS, and it was reported several times
-in the MTD mailing list. In theory, this issue should be visible in any flash,
-but for some reason back at the times when we developed UBI/UBIFS and
-extensively tested them on a robust SLC NAND, we did not observe it. No one
-reported about this issue for NOR flash yet. However, on modern SLC and MLC
-flashes this problem is reproducible.
-
-The unstable bits are the result of a power cut during a program or erase
-operation. Depending on when the power cut has happened, they can corrupt the
-data or the free space. Consider the following 4 situations:
-
-The number of unstable bits resulting from a power-cut may be greater than
-what the ECC algorithm is able to correct. This is why a previously readable
-page may suddenly become unreadable, or conversely a previously unreadable page
-may suddenly become readable.
-
-Here is an example scenario how UBIFS may fail. UBIFS writes data node A to
-the journal LEB, and a power cut of type 1 happens. After the reboot, UBIFS
-recovery code reads that LEB, no bit-flips are reported by MTD, all the CRCs
-match, everything looks fine. UBIFS just assume that this LEB is all-right and
-the free space at the end of this LEB can be used for writing more data. UBIFS
-performs the commit operations, writes more user data, and everything works
-fine until the user reads node A by reading the corresponding file: an ECC
-error happens and the user gets the EIO error.
-The EIO may be what the user gets instead of his/her data also
-if a type 2 power cut happens, and UBIFS re-uses the corrupted free space for
-writing new nodes, and then these nodes are read.
-The solution is to teach UBIFS to erase-cycle any LEB which could potentially
-be written to when the power cut happened. This is not only about the
-journal LEBs, but also LPT, log, master and orphan LEBs. This means that the
-valid data from this LEB has to be read (and only once!) and then it should be
-written back to this LEB using the
-atomic LEB change UBI operation.
-This has to be done even if the LEB looks all-right - no corruptions, all 0xFFs
-at the end.
-
-Similarly, UBI has to erase-cycle every eraseblock which could potentially be
-erased when the power cut happened.
-
-The other requirement is that during the recovery UBI/UBIFS should read data
-from the media only once. This is easy to demonstrate on the delayed recovery
-example. The delayed recovery happens when after a power cut the file-system is
-mounted R/O, in which case UBIFS must not write anything to the flash, and the
-real recovery is delayed until the FS is re-mounted R/W. Currently UBIFS just
-scans the journal during mounting R/O, drops (or "remembers") corrupted nodes,
-and "does not let" users read them. But there is no guarantee that UBIFS
-spots all the corrupted nodes during the first scanning, so users may get
-EIO while reading data from the R/O-mounted FS.
-When UBIFS is then remounted R/W, it actually drops the corrupted nodes from
-the flash media by erase-cycling the corresponding LEBs. And UBIFS re-reads
-all the LEB data again. And there is no guarantee that UBIFS will get the same
-corruptions again.
-
-So it is important to make sure that the corrupted LEBs are read only once.
-E.g., we can cache the results of the first scanning, and then use that data
-when running the delayed recovery, instead of re-reading the data. Probably we
-may remember only the last NAND page containing valid nodes, not whole LEB,
-since for the journal only unstable bits of type 1 and 2 are relevant.
-
-There are similar double-read issues in UBI scanning - when it finds 2 PEBs
-belonging to the same LEB and it has to find out which one is newer. The volume
-table has to be erase-cycled as well in UBI.
-
-There are more issues related to unstable bits of type 2 and 3 in UBI, I
-think. This all needs a very careful look, and this is not trivial to fix
-because of the complexity: UBIFS as any file-system has many interfaces and a
-lot of states. The best strategy to attack this problem would be:
-
-integck test to stress the file-system with
- power cut emulation enabled - the test can re-start when an emulated
- power cut happens. This will allow you to very quickly emulate hundreds
- of power cuts in interesting places. Fix all the bugs. Make sure it is
- rock solid. Of course, if you have various independent issues, you may
- temporary hack the power cut emulation code to emulate unstable bits
- only at certain places, to temporarily limit the amount of problems you
- have to simultaneously deal with.
-integck test to support that infrastructure and fix all the
- issues.
The UBIFS git tree is
diff --git a/faq/ubi.xml b/faq/ubi.xml
index fdd7abb..f9a76b3 100644
--- a/faq/ubi.xml
+++ b/faq/ubi.xml
@@ -449,12 +449,6 @@ probably do this.
Yes, UBI is designed to be tolerant of power failures and unclean reboots.
-Year 2011 note: however, there is an unsolved
-unstable bits issue which may make
-UBI fail to recover after a power cut on modern SLC and MLC flashes. This has
-never been reported yet for UBI, but has been reported for UBIFS and we believe
-must be an issue for UBI as well.
-Year 2011 note: however, there is an unsolved
-unstable bits issue which may make
-UBI fail to recover after a power cut on modern SLC and MLC flashes. This has
-never been reported yet for UBI, but has been reported for UBIFS and we believe
-must be an issue for UBI as well.
-Note, unlike UBI, JFFS2 uses random wear-leveling algorithm, which is in fact not completely random, because JFFS2 makes it more probable to garbage collect eraseblocks with more dirty data. This means that JFFS2 is not