diff mbox

ext2/3: document conditions when reliable operation is possible

Message ID 20090316123051.GJ2405@elf.ucw.cz
State Not Applicable, archived
Headers show

Commit Message

Pavel Machek March 16, 2009, 12:30 p.m. UTC
Updated version here.

On Thu 2009-03-12 14:13:03, Rob Landley wrote:
> On Thursday 12 March 2009 04:21:14 Pavel Machek wrote:
> > Not all block devices are suitable for all filesystems. In fact, some
> > block devices are so broken that reliable operation is pretty much
> > impossible. Document stuff ext2/ext3 needs for reliable operation.
> >
> > Signed-off-by: Pavel Machek <pavel@ucw.cz>

Comments

Theodore Ts'o March 16, 2009, 7:03 p.m. UTC | #1
On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote:
> Updated version here.
> 
> On Thu 2009-03-12 14:13:03, Rob Landley wrote:
> > On Thursday 12 March 2009 04:21:14 Pavel Machek wrote:
> > > Not all block devices are suitable for all filesystems. In fact, some
> > > block devices are so broken that reliable operation is pretty much
> > > impossible. Document stuff ext2/ext3 needs for reliable operation.

Some of what is here are bugs, and some are legitimate long-term
interfaces (for example, the question of losing I/O errors when two
processes are writing to the same file, or to a directory entry, and
errors aren't or in some cases, can't, be reflected back to userspace).

I'm a little concerned that some of this reads a bit too much like a
rant (and I know Pavel was very frustrated when he tried to use a
flash card with a sucky flash card socket) and it will get used the
wrong way by apoligists, because it mixes areas where "we suck, we
should do better", which a re bug reports, and "Posix or the
underlying block device layer makes it hard", and simply states them
as fundamental design requirements, when that's probably not true.

There's a lot of work that we could do to make I/O errors get better
reflected to userspace by fsync().  So state things as bald
requirements I think goes a little too far IMHO.  We can surely do
better.


> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
> new file mode 100644
> index 0000000..710d119
> --- /dev/null
> +++ b/Documentation/filesystems/expectations.txt
> +
> +Write errors not allowed (NO-WRITE-ERRORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Writes to media never fail. Even if disk returns error condition
> +during write, filesystems can't handle that correctly, because success
> +on fsync was already returned when data hit the journal.

The last half of this sentence "because success on fsync was already
returned when data hit the journal", obviously doesn't apply to all
filesystems, since some filesystems, like ext2, don't journal data.
Even for ext3, it only applies in the case of data=journal mode.  

There are other issues here, such as fsync() only reports an I/O
problem to one caller, and in some cases I/O errors aren't propagated
up the storage stack.  The latter is clearly just a bug that should be
fixed; the former is more of an interface limitation.  But you don't
talk about in this section, and I think it would be good to have a
more extended discussion about I/O errors when writing data blocks,
and I/O errors writing metadata blocks, etc.


> +
> +Sector writes are atomic (ATOMIC-SECTORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.

This requirement is not quite the same as what you discuss below.

> +
> +	Unfortunately, none of the cheap USB/SD flash cards I've seen
> +	do behave like this, and are thus unsuitable for all Linux
> +	filesystems I know.
> +
> +		An inherent problem with using flash as a normal block
> +		device is that the flash erase size is bigger than
> +		most filesystem sector sizes.  So when you request a
> +		write, it may erase and rewrite some 64k, 128k, or
> +		even a couple megabytes on the really _big_ ones.
> +
> +		If you lose power in the middle of that, filesystem
> +		won't notice that data in the "sectors" _around_ the
> +		one your were trying to write to got trashed.

The characteristic you descrive here is not an issue about whether
the whole sector is either written or nothing happens to the data ---
but rather, or at least in addition to that, there is also the issue
that when a there is a flash card failure --- particularly one caused
by a sucky flash card reader design causing the SD card to disconnect
from the laptop in the middle of a write --- there may be "collateral
damange"; that is, in addition to corrupting sector being writen,
adjacent sectors might also end up getting list as an unfortunate side
effect.

So there are actually two desirable properties for a storage system to
have; one is "don't damage the old data on a failed write"; and the
other is "don't cause collateral damage to adjacent sectors on a
failed write".

> +	Because RAM tends to fail faster than rest of system during 
> +	powerfail, special hw killing DMA transfers may be necessary;
> +	otherwise, disks may write garbage during powerfail.
> +	Not sure how common that problem is on generic PC machines.

This problem is still relatively common, from what I can tell.  And
ext3 handles this surprisingly well at least in the catastrophic case
of garbage getting written into the inode table, since the journal
replay often will "repair" the garbage that was written into the
filesystem metadata blocks.  It won't do a bit of good for the data
blocks, of course (unless you are using data=journal mode).  But this
means that in fact, ext3 is more resistant to suriving failures to the
first problem (powerfail while writing can damage old data on a failed
write) but fortunately, hard drives generally don't cause collateral
damage on a failed write.  Of course, there are some spectaular
exemption to this rule --- a physical shock which causes the head to
slam into a surface moving at 7200rpm can throw a lot of debris into
the hard drive enclosure, causing loss to adjacent sectors.

    	       		  	       	  - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sitsofe Wheeler March 16, 2009, 7:40 p.m. UTC | #2
On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote:
> +	Unfortunately, none of the cheap USB/SD flash cards I've seen
> +	do behave like this, and are thus unsuitable for all Linux
> +	filesystems I know.

When you say Linux filesystems do you mean "filesystems originally
designed on Linux" or do you mean "filesystems that Linux supports"?
Additionally whatever the answer, people are going to need help
answering the "which is the least bad?" question and saying what's not
good without offering alternatives is only half helpful... People need
to put SOMETHING on these cheap (and not quite so cheap) devices... The
last recommendation I heard was that until btrfs/logfs/nilfs arrive
people are best off sticking with FAT -
http://marc.info/?l=linux-kernel&m=122398315223323&w=2 . Perhaps that
should be mentioned?

> +* either write caching is disabled, or hw can do barriers and they are enabled.
> +
> +	   (Note that barriers are disabled by default, use "barrier=1"
> +	   mount option after making sure hw can support them). 
> +
> +	   hdparm -I reports disk features. If you have "Native
> +	   Command Queueing" is the feature you are looking for.

The document makes it sound like nearly everything bar battery backed
hardware RAIDed SCSI disks (with perfect firmware) is bad  - is this
the intent?
Rob Landley March 16, 2009, 9:43 p.m. UTC | #3
On Monday 16 March 2009 14:40:57 Sitsofe Wheeler wrote:
> On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote:
> > +	Unfortunately, none of the cheap USB/SD flash cards I've seen
> > +	do behave like this, and are thus unsuitable for all Linux
> > +	filesystems I know.
>
> When you say Linux filesystems do you mean "filesystems originally
> designed on Linux" or do you mean "filesystems that Linux supports"?
> Additionally whatever the answer, people are going to need help
> answering the "which is the least bad?" question and saying what's not
> good without offering alternatives is only half helpful... People need
> to put SOMETHING on these cheap (and not quite so cheap) devices... The
> last recommendation I heard was that until btrfs/logfs/nilfs arrive
> people are best off sticking with FAT -
> http://marc.info/?l=linux-kernel&m=122398315223323&w=2 . Perhaps that
> should be mentioned?

Actually, the best filesystem for USB flash devices is probably UDF.  (Yes, 
the DVD filesystem turns out to be writeable if you put it on a writeable 
media.  The ISO spec requires write support, so any OS that supports DVDs also 
supports this.)

The reasons for this are:

A) It's the only filesystem other than FAT that's supported out of the box by 
windows, mac, _and_ Linux for hotpluggable media.

B) It doesn't have the horrible limitations of FAT (such as a max filesize of 
2 gigabytes).

C) Microsoft doesn't claim to own it, and thus hasn't sued anybody over 
patents on it.

However, when it comes to cutting the power on a mounted filesystem (either by 
yanking the device or powering off the machine) without losing your data, 
without warning, they all suck horribly.

If you yank a USB flash disk in the middle of a write, and the device has 
decided to wipe a 2 megabyte erase sector that's behind a layer of wear 
levelling and thus consists of a series of random sectors scattered all over 
the disk, you're screwed no matter what filesystem you use.  You know the 
vinyl "record scratch" sound?  Imagine that, on a digital level.  Bad Things 
Happen to the hardware, cannot compensate in software.

> > +* either write caching is disabled, or hw can do barriers and they are
> > enabled. +
> > +	   (Note that barriers are disabled by default, use "barrier=1"
> > +	   mount option after making sure hw can support them).
> > +
> > +	   hdparm -I reports disk features. If you have "Native
> > +	   Command Queueing" is the feature you are looking for.
>
> The document makes it sound like nearly everything bar battery backed
> hardware RAIDed SCSI disks (with perfect firmware) is bad  - is this
> the intent?

SCSI disks?  They still make those?

Everything fails, it's just a question of how.  Rotational media combined with 
journaling at least fails in fairly understandable ways, so ext3 on sata is 
reasonable.

Flash gets into trouble when it presents the _interface_ of rotational media 
(a USB block device with normal 512 byte read/write sectors, which never wear 
out) which doesn't match what the hardware's actually doing (erase block sizes 
of up to several megabytes at a time, hidden behind a block remapping layer 
for wear leveling).

For devices that have built in flash that DON'T pretend to be a conventional 
block device, but instead expose their flash erase granularity and let the OS 
do the wear levelling itself, we have special flash filesystems that can be 
reasonably reliable.  It's just that ext3 isn't one of them, jffs2 and ubifs 
and logfs are.  The problem with these flash filesystems is they ONLY work on 
flash, if you want to mount them on something other than flash you need 
something like a loopback interface to make a normal block device pretend to 
be flash.  (We've got a ramdisk driver called "mtdram" that does this, but 
nobody's bothered to write a generic wrapper for a normal block device you can 
wrap over the loopback driver.)

Unfortunately, when it comes to USB flash (the most common type), the USB 
standard defines a way for a USB device to provide a normal block disk 
interface as if it was rotational media.  It does NOT provide a way to expose 
the flash erase granularity, or a way for the operating system to disable any 
built-in wear levelling (which is needed because windows doesn't _do_ wear 
levelling, and thus burns out the administrative sectors of the disk really 
fast while the rest of the disk is still fine unless the hardware wear-levels 
for it).

So every USB flash disk pretends to be a normal disk, which it isn't, and 
Linux can't _disable_ this emulation.  Which brings us back to UDF as the 
least sucky alternative.  (Although the UDF tools kind of suck.  If you 
reformat a FAT disk as UDF with mkudffs, it'll still be autodetected as FAT 
because it won't overwrite the FAT root directory.  You have to blank the 
first 64k by hand with dd.  Sad, isn't it?)

Rob
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kyle Moffett March 17, 2009, 4:55 a.m. UTC | #4
On Mon, Mar 16, 2009 at 5:43 PM, Rob Landley <rob@landley.net> wrote:
> Flash gets into trouble when it presents the _interface_ of rotational media
> (a USB block device with normal 512 byte read/write sectors, which never wear
> out) which doesn't match what the hardware's actually doing (erase block sizes
> of up to several megabytes at a time, hidden behind a block remapping layer
> for wear leveling).
>
> For devices that have built in flash that DON'T pretend to be a conventional
> block device, but instead expose their flash erase granularity and let the OS
> do the wear levelling itself, we have special flash filesystems that can be
> reasonably reliable.  It's just that ext3 isn't one of them, jffs2 and ubifs
> and logfs are.  The problem with these flash filesystems is they ONLY work on
> flash, if you want to mount them on something other than flash you need
> something like a loopback interface to make a normal block device pretend to
> be flash.  (We've got a ramdisk driver called "mtdram" that does this, but
> nobody's bothered to write a generic wrapper for a normal block device you can
> wrap over the loopback driver.)

The really nice SSDs actually reserve ~15-30% of their internal
block-level storage and actually run their own log-structured virtual
disk in hardware.  From what I understand the Intel SSDs are that way.
 Real-time garbage collection is tricky, but if you require (for
example) a max of ~80% utilization then you can provide good latency
and bandwidth guarantees.  There's usually something like a
log-structured virtual-to-physical sector map as well.  If designed
properly with automatic hardware checksumming, such a system can
actually provide atomic writes and barriers with virtually no impact
on performance.

With firmware-level hardware knowledge and the ability to perform
extremely efficient parallel reads of flash blocks, such a
log-structured virtual block device can be many times more efficient
than a general purpose OS running a log-structured filesystem.  The
result is that for an ordinary ext3-esque filesystem with 4k blocks
you can treat the SSD as though it is an atomic-write seek-less block
device.

Now if only I had the spare cash to go out and buy one of the shiny
Intel ones for my laptop... :-)

Cheers,
Kyle Moffett
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pavel Machek March 23, 2009, 11 a.m. UTC | #5
On Mon 2009-03-16 19:40:57, Sitsofe Wheeler wrote:
> On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote:
> > +	Unfortunately, none of the cheap USB/SD flash cards I've seen
> > +	do behave like this, and are thus unsuitable for all Linux
> > +	filesystems I know.
> 
> When you say Linux filesystems do you mean "filesystems originally
> designed on Linux" or do you mean "filesystems that Linux supports"?

"Linux filesystems I know" :-). No filesystem that Linux supports,
AFAICT.

> Additionally whatever the answer, people are going to need help
> answering the "which is the least bad?" question and saying what's not
> good without offering alternatives is only half helpful... People need
> to put SOMETHING on these cheap (and not quite so cheap)
> devices... The

According to me, people should just AVOID those devices. I don't plan
to point the "least bad"; its still bad.

> > +	   hdparm -I reports disk features. If you have "Native
> > +	   Command Queueing" is the feature you are looking for.
> 
> The document makes it sound like nearly everything bar battery backed
> hardware RAIDed SCSI disks (with perfect firmware) is bad  - is this
> the intent?

Battery backed RAID should be ok, as should be plain single SATA drive.
									Pavel
diff mbox

Patch

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..710d119
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,47 @@ 
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly, because success
+on fsync was already returned when data hit the journal.
+
+	Fortunately writes failing are very uncommon on traditional 
+	spinning disks, as they have spare sectors they use when write
+	fails.
+
+Sector writes are atomic (ATOMIC-SECTORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+	Unfortunately, none of the cheap USB/SD flash cards I've seen
+	do behave like this, and are thus unsuitable for all Linux
+	filesystems I know.
+
+		An inherent problem with using flash as a normal block
+		device is that the flash erase size is bigger than
+		most filesystem sector sizes.  So when you request a
+		write, it may erase and rewrite some 64k, 128k, or
+		even a couple megabytes on the really _big_ ones.
+
+		If you lose power in the middle of that, filesystem
+		won't notice that data in the "sectors" _around_ the
+		one your were trying to write to got trashed.
+
+	Because RAM tends to fail faster than rest of system during 
+	powerfail, special hw killing DMA transfers may be necessary;
+	otherwise, disks may write garbage during powerfail.
+	Not sure how common that problem is on generic PC machines.
+
+	Note that atomic write is very hard to guarantee for RAID-4/5/6,
+	because it needs to write both changed data, and parity, to 
+	different disks. UPS for RAID array should help.
+
+
+
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 4333e83..41fd2ec 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,25 @@  enough 4-character names to make up unique directory entries, so they
 have to be 8 character filenames, even then we are fairly close to
 running out of unique filenames.
 
+Requirements
+============
+
+Ext2 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed
+
+* sector writes are atomic
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* write caching is disabled. ext2 does not know how to issue barriers
+  as of 2.6.28. hdparm -W0 disables it on SATA disks.
+
 Journaling
-----------
-
-A journaling extension to the ext2 code has been developed by Stephen
-Tweedie.  It avoids the risks of metadata corruption and the need to
-wait for e2fsck to complete after a crash, without requiring a change
-to the on-disk ext2 layout.  In a nutshell, the journal is a regular
-file which stores whole metadata (and optionally data) blocks that have
-been modified, prior to writing them into the filesystem.  This means
-it is possible to add a journal to an existing ext2 filesystem without
-the need for data conversion.
-
-When changes to the filesystem (e.g. a file is renamed) they are stored in
-a transaction in the journal and can either be complete or incomplete at
-the time of a crash.  If a transaction is complete at the time of a crash
-(or in the normal case where the system does not crash), then any blocks
-in that transaction are guaranteed to represent a valid filesystem state,
-and are copied into the filesystem.  If a transaction is incomplete at
-the time of the crash, then there is no guarantee of consistency for
-the blocks in that transaction so they are discarded (which means any
-filesystem changes they represent are also lost).
+==========
 Check Documentation/filesystems/ext3.txt if you want to read more about
 ext3 and journaling.
 
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..02a9bd5 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -188,6 +200,27 @@  mke2fs: 	create a ext3 partition with the -j flag.
 debugfs: 	ext2 and ext3 file system debugger.
 ext2online:	online (mounted) ext2 and ext3 filesystem resizer
 
+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed
+
+* sector writes are atomic
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+	   (Note that barriers are disabled by default, use "barrier=1"
+	   mount option after making sure hw can support them). 
+
+	   hdparm -I reports disk features. If you have "Native
+	   Command Queueing" is the feature you are looking for.
 
 References
 ==========