
ext2/3: document conditions when reliable operation is possible

Message ID 20090312092114.GC6949@elf.ucw.cz
State Not Applicable, archived

Commit Message

Pavel Machek March 12, 2009, 9:21 a.m. UTC
Not all block devices are suitable for all filesystems. In fact, some
block devices are so broken that reliable operation is pretty much
impossible. Document stuff ext2/ext3 needs for reliable operation.

Signed-off-by: Pavel Machek <pavel@ucw.cz>

Comments

Jochen Voß March 12, 2009, 11:40 a.m. UTC | #1
Hi,

2009/3/12 Pavel Machek <pavel@ucw.cz>:
> diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
> index 4333e83..b09aa4c 100644
> --- a/Documentation/filesystems/ext2.txt
> +++ b/Documentation/filesystems/ext2.txt
> @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they
>  have to be 8 character filenames, even then we are fairly close to
>  running out of unique filenames.
>
> +Requirements
> +============
> +
> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
   ^^^^
Shouldn't this be "Ext2"?

All the best,
Jochen
Rob Landley March 12, 2009, 7:13 p.m. UTC | #2
On Thursday 12 March 2009 04:21:14 Pavel Machek wrote:
> Not all block devices are suitable for all filesystems. In fact, some
> block devices are so broken that reliable operation is pretty much
> impossible. Document stuff ext2/ext3 needs for reliable operation.
>
> Signed-off-by: Pavel Machek <pavel@ucw.cz>
>
> diff --git a/Documentation/filesystems/expectations.txt
> b/Documentation/filesystems/expectations.txt new file mode 100644
> index 0000000..9c3d729
> --- /dev/null
> +++ b/Documentation/filesystems/expectations.txt
> @@ -0,0 +1,47 @@
> +Linux block-backed filesystems can only work correctly when several
> +conditions are met in the block layer and below (disks, flash
> +cards). Some of them are obvious ("data on media should not change
> +randomly"), some are less so.
> +
> +Write errors not allowed (NO-WRITE-ERRORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Writes to media never fail. Even if disk returns error condition
> +during write, filesystems can't handle that correctly, because success
> +on fsync was already returned when data hit the journal.
> +
> +	Fortunately writes failing are very uncommon on traditional
> +	spinning disks, as they have spare sectors they use when write
> +	fails.

I vaguely recall that the behavior of when a write error _does_ occur is to 
remount the filesystem read only?  (Is this VFS or per-fs?)

Is there any kind of hotplug event associated with this?

I'm aware write errors shouldn't happen, and by the time they do it's too late 
to gracefully handle them, and all we can do is fail.  So how do we fail?

> +Sector writes are atomic (ATOMIC-SECTORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> +	Unfortuantely, none of the cheap USB/SD flash cards I seen do

I've seen

> +	behave like this, and are unsuitable for all linux filesystems

"are thus unsuitable", perhaps?  (Too pretentious? :)

> +	I know.
> +
> +		An inherent problem with using flash as a normal block
> +		device is that the flash erase size is bigger than
> +		most filesystem sector sizes.  So when you request a
> +		write, it may erase and rewrite the next 64k, 128k, or
> +		even a couple megabytes on the really _big_ ones.

Somebody corrected me, it's not "the next" it's "the surrounding".

(Writes aren't always cleanly at the start of an erase block, so critical data 
_before_ what you touch is endangered too.)

> +		If you lose power in the middle of that, filesystem
> +		won't notice that data in the "sectors" _around_ the
> +		one your were trying to write to got trashed.
> +
> +	Because RAM tends to fail faster than rest of system during
> +	powerfail, special hw killing DMA transfers may be neccessary;

Necessary

> +	otherwise, disks may write garbage during powerfail.
> +	Not sure how common that problem is on generic PC machines.
> +
> +	Note that atomic write is very hard to guarantee for RAID-4/5/6,
> +	because it needs to write both changed data, and parity, to
> +	different disks.

These days instead of "atomic" it's better to think in terms of "barriers".  
Requesting a flush blocks until all the data written _before_ that point has 
made it to disk.  This wait may be arbitrarily long on a busy system with lots 
of disk transactions happening in parallel (perhaps because Firefox decided to 
garbage collect and is spending the next 30 seconds swapping itself back in to 
do so).
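
To make the ordering idea concrete, here is a minimal user-space sketch, assuming plain POSIX I/O; the filename and record layout are made up, and this only illustrates "flush as an ordering point", not how any particular filesystem issues barriers:

/* barrier_sketch.c - write a record, flush, then write a commit marker.
 * Nothing written after the fdatasync() can become durable "before" the
 * record it depends on; the flush itself may block for a long time on a
 * busy system, which is the point above.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("journal.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char record[] = "transaction body\n";
        if (write(fd, record, sizeof(record) - 1) < 0) { perror("write"); return 1; }

        /* Ordering point: the commit marker must not be written until the
         * body is on stable storage. */
        if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

        const char commit[] = "commit\n";
        if (write(fd, commit, sizeof(commit) - 1) < 0) { perror("write"); return 1; }
        if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

        close(fd);
        return 0;
}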

> +
> +
> diff --git a/Documentation/filesystems/ext2.txt
> b/Documentation/filesystems/ext2.txt index 4333e83..b09aa4c 100644
> --- a/Documentation/filesystems/ext2.txt
> +++ b/Documentation/filesystems/ext2.txt
> @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory
> entries, so they have to be 8 character filenames, even then we are fairly
> close to running out of unique filenames.
>
> +Requirements
> +============
> +
> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:

This paragraph talks about ext3...

> +* write errors not allowed
> +
> +* sector writes are atomic
> +
> +(see expectations.txt; note that most/all linux block-based
> +filesystems have similar expectations)
> +
> +* write caching is disabled. ext2 does not know how to issue barriers
> +  as of 2.6.28. hdparm -W0 disables it on SATA disks.

And here we're talking about ext2.  Does neither one know about write 
barriers, or does this just apply to ext2?  (What about ext4?)

Also I remember a historical problem that not all disks honor write barriers, 
because actual data integrity makes for horrible benchmark numbers.  Dunno how 
current that is with SATA, Alan Cox would probably know.

Rob
Pavel Machek March 16, 2009, 12:28 p.m. UTC | #3
Hi!

> > +Write errors not allowed (NO-WRITE-ERRORS)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Writes to media never fail. Even if disk returns error condition
> > +during write, filesystems can't handle that correctly, because success
> > +on fsync was already returned when data hit the journal.
> > +
> > +	Fortunately writes failing are very uncommon on traditional
> > +	spinning disks, as they have spare sectors they use when write
> > +	fails.
> 
> I vaguely recall that the behavior of when a write error _does_ occur is to 
> remount the filesystem read only?  (Is this VFS or per-fs?)

Per-fs.

> Is there any kind of hotplug event associated with this?

I don't think so.

> I'm aware write errors shouldn't happen, and by the time they do it's too late 
> to gracefully handle them, and all we can do is fail.  So how do we
> fail?

Well, even remount-ro may be too late, IIRC.

> > +	Unfortuantely, none of the cheap USB/SD flash cards I seen do
> 
> I've seen
> 
> > +	behave like this, and are unsuitable for all linux filesystems
> 
> "are thus unsuitable", perhaps?  (Too pretentious? :)

ACK, thanks.

> > +	I know.
> > +
> > +		An inherent problem with using flash as a normal block
> > +		device is that the flash erase size is bigger than
> > +		most filesystem sector sizes.  So when you request a
> > +		write, it may erase and rewrite the next 64k, 128k, or
> > +		even a couple megabytes on the really _big_ ones.
> 
> Somebody corrected me, it's not "the next" it's "the surrounding".

Its "some" ... due to wear leveling logic.

> (Writes aren't always cleanly at the start of an erase block, so critical data 
> _before_ what you touch is endangered too.)

Well, flashes do remap, so it is actually "random blocks".

> > +	otherwise, disks may write garbage during powerfail.
> > +	Not sure how common that problem is on generic PC machines.
> > +
> > +	Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > +	because it needs to write both changed data, and parity, to
> > +	different disks.
> 
> These days instead of "atomic" it's better to think in terms of
> "barriers".  

This is not about barriers (that should be different topic). Atomic
write means that either whole sector is written, or nothing at all is
written. Because raid5 needs to update both master data and parity at
the same time, I don't think it can guarantee this during powerfail.


> > +Requirements
> > +* write errors not allowed
> > +
> > +* sector writes are atomic
> > +
> > +(see expectations.txt; note that most/all linux block-based
> > +filesystems have similar expectations)
> > +
> > +* write caching is disabled. ext2 does not know how to issue barriers
> > +  as of 2.6.28. hdparm -W0 disables it on SATA disks.
> 
> And here we're talking about ext2.  Does neither one know about write 
> barriers, or does this just apply to ext2?  (What about ext4?)

This document is about ext2. Ext3 can support barriers in
2.6.28. Someone else needs to write ext4 docs :-).

> Also I remember a historical problem that not all disks honor write barriers, 
> because actual data integrity makes for horrible benchmark numbers.  Dunno how 
> current that is with SATA, Alan Cox would probably know.

Sounds like broken disk, then. We should blacklist those.
									Pavel
Rob Landley March 16, 2009, 7:26 p.m. UTC | #4
On Monday 16 March 2009 07:28:47 Pavel Machek wrote:
> Hi!
> > > +	Fortunately writes failing are very uncommon on traditional
> > > +	spinning disks, as they have spare sectors they use when write
> > > +	fails.
> >
> > I vaguely recall that the behavior of when a write error _does_ occur is
> > to remount the filesystem read only?  (Is this VFS or per-fs?)
>
> Per-fs.

Might be nice to note that in the doc.

> > Is there any kind of hotplug event associated with this?
>
> I don't think so.

There probably should be, but that's a separate issue.

> > I'm aware write errors shouldn't happen, and by the time they do it's too
> > late to gracefully handle them, and all we can do is fail.  So how do we
> > fail?
>
> Well, even remount-ro may be too late, IIRC.

Care to elaborate?  (When a filesystem is mounted RO, I'm not sure what 
happens to the pages that have already been dirtied...)

> > (Writes aren't always cleanly at the start of an erase block, so critical
> > data _before_ what you touch is endangered too.)
>
> Well, flashes do remap, so it is actually "random blocks".

Fun.

When "please do not turn of your playstation until game save completes" 
honestly seems like the best solution for making the technology reliable, 
something is wrong with the technology.

I think I'll stick with rotating disks for now, thanks.

> > > +	otherwise, disks may write garbage during powerfail.
> > > +	Not sure how common that problem is on generic PC machines.
> > > +
> > > +	Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > > +	because it needs to write both changed data, and parity, to
> > > +	different disks.
> >
> > These days instead of "atomic" it's better to think in terms of
> > "barriers".
>
> This is not about barriers (that should be different topic). Atomic
> write means that either whole sector is written, or nothing at all is
> written. Because raid5 needs to update both master data and parity at
> the same time, I don't think it can guarantee this during powerfail.

Good point, but I thought that's what journaling was for?

I'm aware that any flash filesystem _must_ be journaled in order to work 
sanely, and must be able to view the underlying erase granularity down to the 
bare metal, through any remapping the hardware's doing.  Possibly what's 
really needed is a "flash is weird" section, since flash filesystems can't be 
mounted on arbitrary block devices.

Although an "-O erase_size=128" option so they _could_ would be nice.  There's 
"mtdram" which seems to be the only remaining use for ram disks, but why there 
isn't an "mtdwrap" that works with arbitrary underlying block devices, I have 
no idea.  (Layering it on top of a loopback device would be most useful.)

> > > +Requirements
> > > +* write errors not allowed
> > > +
> > > +* sector writes are atomic
> > > +
> > > +(see expectations.txt; note that most/all linux block-based
> > > +filesystems have similar expectations)
> > > +
> > > +* write caching is disabled. ext2 does not know how to issue barriers
> > > +  as of 2.6.28. hdparm -W0 disables it on SATA disks.
> >
> > And here we're talking about ext2.  Does neither one know about write
> > barriers, or does this just apply to ext2?  (What about ext4?)
>
> This document is about ext2. Ext3 can support barriers in
> 2.6.28. Someone else needs to write ext4 docs :-).
>
> > Also I remember a historical problem that not all disks honor write
> > barriers, because actual data integrity makes for horrible benchmark
> > numbers.  Dunno how current that is with SATA, Alan Cox would probably
> > know.
>
> Sounds like broken disk, then. We should blacklist those.

It wasn't just one brand of disk cheating like that, and you'd have to ask him 
(or maybe Jens Axboe or somebody) whether the problem is still current.  I've 
been off in embedded-land for a few years now...

Rob
Greg Freemyer March 16, 2009, 7:45 p.m. UTC | #5
On Thu, Mar 12, 2009 at 5:21 AM, Pavel Machek <pavel@ucw.cz> wrote:
<snip>
> +Sector writes are atomic (ATOMIC-SECTORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> +       Unfortuantely, none of the cheap USB/SD flash cards I seen do
> +       behave like this, and are unsuitable for all linux filesystems
> +       I know.
> +
> +               An inherent problem with using flash as a normal block
> +               device is that the flash erase size is bigger than
> +               most filesystem sector sizes.  So when you request a
> +               write, it may erase and rewrite the next 64k, 128k, or
> +               even a couple megabytes on the really _big_ ones.
> +
> +               If you lose power in the middle of that, filesystem
> +               won't notice that data in the "sectors" _around_ the
> +               one your were trying to write to got trashed.

I had *assumed* that SSDs worked like:

1) write request comes in
2) new unused erase block area marked to hold the new data
3) updated data written to the previously unused erase block
4) mapping updated to replace the old erase block with the new one

If it were done that way, a failure in the middle would just leave the
SSD with the old data in it.

If it is not done that way, then I can see your issue.  (I love the
potential performance of SSDs, but I'm beginning to hate the
implementations and spec writing.)
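
To make the assumed scheme concrete, here is a toy copy-on-write remapping sketch in C; the block counts and names are invented, and real flash translation layers are far more involved, but it shows why a power cut in the middle would leave the old data visible:

/* ftl_cow_sketch.c - toy model of steps 1-4 above: never overwrite a
 * logical erase block in place; program a spare physical block first and
 * only then flip the logical-to-physical map.
 */
#include <stdio.h>
#include <string.h>

#define ERASE_BLOCKS 4          /* logical erase blocks exposed to the host */
#define PHYS_BLOCKS  6          /* physical blocks, including spares */
#define BLOCK_BYTES  16         /* absurdly small "erase block" */

static char phys[PHYS_BLOCKS][BLOCK_BYTES];
static int  map[ERASE_BLOCKS]   = { 0, 1, 2, 3 };       /* logical -> physical */
static int  in_use[PHYS_BLOCKS] = { 1, 1, 1, 1, 0, 0 };

static int find_free_block(void)
{
        for (int i = 0; i < PHYS_BLOCKS; i++)
                if (!in_use[i])
                        return i;
        return -1;
}

static int write_block(int logical, const char *data)
{
        int new_phys = find_free_block();
        if (new_phys < 0)
                return -1;

        /* Steps 2+3: program the new data into a spare block.  A power
         * failure here is harmless: map[] still points at the old block. */
        memcpy(phys[new_phys], data, BLOCK_BYTES);

        /* Step 4: switch the mapping and retire the old block so it can
         * be erased and reused later. */
        int old_phys = map[logical];
        map[logical] = new_phys;
        in_use[new_phys] = 1;
        in_use[old_phys] = 0;
        return 0;
}

int main(void)
{
        memcpy(phys[map[1]], "old contents....", BLOCK_BYTES);
        write_block(1, "new contents....");
        printf("logical 1 now lives in physical %d: %.16s\n",
               map[1], phys[map[1]]);
        return 0;
}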

Greg
Pavel Machek March 16, 2009, 9:48 p.m. UTC | #6
On Mon 2009-03-16 15:45:36, Greg Freemyer wrote:
> On Thu, Mar 12, 2009 at 5:21 AM, Pavel Machek <pavel@ucw.cz> wrote:
> <snip>
> > +Sector writes are atomic (ATOMIC-SECTORS)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Either whole sector is correctly written or nothing is written during
> > +powerfail.
> > +
> > +       Unfortuantely, none of the cheap USB/SD flash cards I seen do
> > +       behave like this, and are unsuitable for all linux filesystems
> > +       I know.
> > +
> > +               An inherent problem with using flash as a normal block
> > +               device is that the flash erase size is bigger than
> > +               most filesystem sector sizes.  So when you request a
> > +               write, it may erase and rewrite the next 64k, 128k, or
> > +               even a couple megabytes on the really _big_ ones.
> > +
> > +               If you lose power in the middle of that, filesystem
> > +               won't notice that data in the "sectors" _around_ the
> > +               one your were trying to write to got trashed.
> 
> I had *assumed* that SSDs worked like:
> 
> 1) write request comes in
> 2) new unused erase block area marked to hold the new data
> 3) updated data written to the previously unused erase block
> 4) mapping updated to replace the old erase block with the new one
> 
> If it were done that way, a failure in the middle would just leave the
> SSD with the old data in it.

The really expensive ones (Intel SSD) apparently work like that, but I've
never seen one of those. USB sticks and SD cards I tried behave like I
described above.
								Pavel
Pavel Machek March 21, 2009, 11:24 a.m. UTC | #7
On Thu 2009-03-12 11:40:52, Jochen Voß wrote:
> Hi,
> 
> 2009/3/12 Pavel Machek <pavel@ucw.cz>:
> > diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
> > index 4333e83..b09aa4c 100644
> > --- a/Documentation/filesystems/ext2.txt
> > +++ b/Documentation/filesystems/ext2.txt
> > @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they
> >  have to be 8 character filenames, even then we are fairly close to
> >  running out of unique filenames.
> >
> > +Requirements
> > +============
> > +
> > +Ext3 expects disk/storage subsystem to behave sanely. On sanely
>    ^^^^
> Shouldn't this be "Ext2"?

Thanks, fixed.
									Pavel
Pavel Machek March 23, 2009, 10:45 a.m. UTC | #8
On Mon 2009-03-16 14:26:23, Rob Landley wrote:
> On Monday 16 March 2009 07:28:47 Pavel Machek wrote:
> > Hi!
> > > > +	Fortunately writes failing are very uncommon on traditional
> > > > +	spinning disks, as they have spare sectors they use when write
> > > > +	fails.
> > >
> > > I vaguely recall that the behavior of when a write error _does_ occur is
> > > to remount the filesystem read only?  (Is this VFS or per-fs?)
> >
> > Per-fs.
> 
> Might be nice to note that in the doc.

Ok, can you suggest a patch? I believe remount-ro is already
documented ... somewhere :-).

> > > I'm aware write errors shouldn't happen, and by the time they do it's too
> > > late to gracefully handle them, and all we can do is fail.  So how do we
> > > fail?
> >
> > Well, even remount-ro may be too late, IIRC.
> 
> Care to elaborate?  (When a filesystem is mounted RO, I'm not sure what 
> happens to the pages that have already been dirtied...)

Well, fsync() error reporting does not really work properly, but I
guess it will save you for the remount-ro case. So the data will be in
the journal, but it will be impossible to replay it...
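
For what it's worth, the application-side view of this is roughly the sketch below, assuming plain POSIX I/O (the path is made up): write() usually only dirties the page cache, so a media error tends to surface, if at all, as -EIO from fsync(), and historically the error could be reported once and then lost, so retrying fsync() may "succeed" even though the data never reached the disk.

/* fsync_error_sketch.c - check fsync(), not just write(). */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        int fd = open("important.dat", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char buf[] = "payload\n";
        if (write(fd, buf, sizeof(buf) - 1) < 0) {
                perror("write");        /* rarely fails: data only hit the cache */
                return 1;
        }

        if (fsync(fd) < 0) {
                /* This is where a failed write to the media shows up, if ever. */
                fprintf(stderr, "fsync: %s - data may be lost\n", strerror(errno));
                return 1;
        }

        close(fd);
        return 0;
}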

> > > (Writes aren't always cleanly at the start of an erase block, so critical
> > > data _before_ what you touch is endangered too.)
> >
> > Well, flashes do remap, so it is actually "random blocks".
> 
> Fun.

Yes.

> > > > +	otherwise, disks may write garbage during powerfail.
> > > > +	Not sure how common that problem is on generic PC machines.
> > > > +
> > > > +	Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > > > +	because it needs to write both changed data, and parity, to
> > > > +	different disks.
> > >
> > > These days instead of "atomic" it's better to think in terms of
> > > "barriers".
> >
> > This is not about barriers (that should be different topic). Atomic
> > write means that either whole sector is written, or nothing at all is
> > written. Because raid5 needs to update both master data and parity at
> > the same time, I don't think it can guarantee this during powerfail.
> 
> Good point, but I thought that's what journaling was for?

I believe journaling operates on assumption that "either whole sector
is written, or nothing at all is written".

> I'm aware that any flash filesystem _must_ be journaled in order to work 
> sanely, and must be able to view the underlying erase granularity down to the 
> bare metal, through any remapping the hardware's doing.  Possibly what's 
> really needed is a "flash is weird" section, since flash filesystems can't be 
> mounted on arbitrary block devices.

> Although an "-O erase_size=128" option so they _could_ would be nice.  There's 
> "mtdram" which seems to be the only remaining use for ram disks, but why there 
> isn't an "mtdwrap" that works with arbitrary underlying block devices, I have 
> no idea.  (Layering it on top of a loopback device would be most
> useful.)

I don't think that works. Compactflash (etc) cards basically randomly
remap the data, so you can't really run flash filesystem over
compactflash/usb/SD card -- you don't know the details of remapping.
									Pavel
Goswin von Brederlow March 30, 2009, 3:06 p.m. UTC | #9
Pavel Machek <pavel@ucw.cz> writes:

> On Mon 2009-03-16 14:26:23, Rob Landley wrote:
>> On Monday 16 March 2009 07:28:47 Pavel Machek wrote:
>> > > > +	otherwise, disks may write garbage during powerfail.
>> > > > +	Not sure how common that problem is on generic PC machines.
>> > > > +
>> > > > +	Note that atomic write is very hard to guarantee for RAID-4/5/6,
>> > > > +	because it needs to write both changed data, and parity, to
>> > > > +	different disks.
>> > >
>> > > These days instead of "atomic" it's better to think in terms of
>> > > "barriers".

Would be nice to have barriers in md and dm.

>> > This is not about barriers (that should be different topic). Atomic
>> > write means that either whole sector is written, or nothing at all is
>> > written. Because raid5 needs to update both master data and parity at
>> > the same time, I don't think it can guarantee this during powerfail.

Actually raid5 should have no problem with a power failure during
normal operations of the raid. The parity block should get marked out
of sync, then the new data block should be written, then the new
parity block and then the parity block should be flagged in sync.

>> Good point, but I thought that's what journaling was for?
>
> I believe journaling operates on assumption that "either whole sector
> is written, or nothing at all is written".

The real problem comes in degraded mode. In that case the data block
(if present) and parity block must be written at the same time
atomically. If the system crashes after writing one but before writing
the other then the data block on the missing drive changes its
contents. And for example with a chunk size of 1MB and 16 disks that
could be 15MB away from the block you actually do change. And you can
not recover that after a crash as you need both the original and
changed contents of the block.

So writing one sector has the risk of corrupting another (for the FS)
totally unconnected sector. No amount of journaling will help
there. The raid5 would need to do journaling or use battery backed
cache.
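
A tiny worked example of that failure mode, with one byte standing in for each block and arbitrary values: reconstruction of the missing disk changes even though the filesystem never wrote to it.

/* write_hole_sketch.c - 3 data "disks" + 1 parity, disk 2 missing. */
#include <stdio.h>

int main(void)
{
        unsigned char d0 = 0x11, d1 = 0x22, d2 = 0x33;
        unsigned char parity = d0 ^ d1 ^ d2;

        /* Disk 2 dies: its contents are only implied by d0 ^ d1 ^ parity. */
        printf("rebuilt d2 before the crash: 0x%02x (expected 0x33)\n",
               (unsigned)(d0 ^ d1 ^ parity));

        /* Degraded-mode write to disk 0; power fails before the matching
         * parity update makes it out. */
        d0 = 0x44;
        /* parity = d0 ^ d1 ^ 0x33;   <-- never happens */

        printf("rebuilt d2 after the crash:  0x%02x (silently corrupted)\n",
               (unsigned)(d0 ^ d1 ^ parity));
        return 0;
}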

MfG
        Goswin
Pavel Machek Aug. 24, 2009, 9:26 a.m. UTC | #10
Hi!

> >> > This is not about barriers (that should be different topic). Atomic
> >> > write means that either whole sector is written, or nothing at all is
> >> > written. Because raid5 needs to update both master data and parity at
> >> > the same time, I don't think it can guarantee this during powerfail.
> 
> Actually raid5 should have no problem with a power failure during
> normal operations of the raid. The parity block should get marked out
> of sync, then the new data block should be written, then the new
> parity block and then the parity block should be flagged in sync.
> 
> >> Good point, but I thought that's what journaling was for?
> >
> > I believe journaling operates on assumption that "either whole sector
> > is written, or nothing at all is written".
> 
> The real problem comes in degraded mode. In that case the data block
> (if present) and parity block must be written at the same time
> atomically. If the system crashes after writing one but before writing
> the other then the data block on the missing drive changes its
> contents. And for example with a chunk size of 1MB and 16 disks that
> could be 15MB away from the block you actually do change. And you can
> not recover that after a crash as you need both the original and
> changed contents of the block.
> 
> So writing one sector has the risk of corrupting another (for the FS)
> totally unconnected sector. No amount of journaling will help
> there. The raid5 would need to do journaling or use battery backed
> cache.

Thanks, I updated my notes.
									Pavel
Robert Hancock Aug. 29, 2009, 1:33 a.m. UTC | #11
On 03/12/2009 01:13 PM, Rob Landley wrote:
>> +* write caching is disabled. ext2 does not know how to issue barriers
>> +  as of 2.6.28. hdparm -W0 disables it on SATA disks.
>
> And here we're talking about ext2.  Does neither one know about write
> barriers, or does this just apply to ext2?  (What about ext4?)
>
> Also I remember a historical problem that not all disks honor write barriers,
> because actual data integrity makes for horrible benchmark numbers.  Dunno how
> current that is with SATA, Alan Cox would probably know.

I've heard rumors of disks that claim to support cache flushes but 
really just ignore them, but have never heard any specifics of model 
numbers, etc. which are known to do this, so it may just be legend. If 
we do have such knowledge then we should really be blacklisting those 
drives and warning the user that we can't ensure data integrity. (Even 
powering down the system would be unsafe in this case.)
Alan Cox Aug. 29, 2009, 1:04 p.m. UTC | #12
> I've heard rumors of disks that claim to support cache flushes but 
> really just ignore them, but have never heard any specifics of model 
> numbers, etc. which are known to do this, so it may just be legend. If 
> we do have such knowledge then we should really be blacklisting those 
> drives and warning the user that we can't ensure data integrity. (Even 
> powering down the system would be unsafe in this case.)

This should not be the case for any vaguely modern drive. The standard
requires the drive flushes the cache if sent the command and the size of
caches on modern drives rather require it.

Alan

Patch

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..9c3d729
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,47 @@ 
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly, because success
+on fsync was already returned when data hit the journal.
+
+	Fortunately writes failing are very uncommon on traditional 
+	spinning disks, as they have spare sectors they use when write
+	fails.
+
+Sector writes are atomic (ATOMIC-SECTORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+	Unfortuantely, none of the cheap USB/SD flash cards I seen do 
+	behave like this, and are unsuitable for all linux filesystems 
+	I know. 
+
+		An inherent problem with using flash as a normal block
+		device is that the flash erase size is bigger than
+		most filesystem sector sizes.  So when you request a
+		write, it may erase and rewrite the next 64k, 128k, or
+		even a couple megabytes on the really _big_ ones.
+
+		If you lose power in the middle of that, filesystem
+		won't notice that data in the "sectors" _around_ the
+		one your were trying to write to got trashed.
+
+	Because RAM tends to fail faster than rest of system during 
+	powerfail, special hw killing DMA transfers may be neccessary;
+	otherwise, disks may write garbage during powerfail.
+	Not sure how common that problem is on generic PC machines.
+
+	Note that atomic write is very hard to guarantee for RAID-4/5/6,
+	because it needs to write both changed data, and parity, to 
+	different disks.
+
+
+
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 4333e83..b09aa4c 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,25 @@  enough 4-character names to make up unique directory entries, so they
 have to be 8 character filenames, even then we are fairly close to
 running out of unique filenames.
 
+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed
+
+* sector writes are atomic
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* write caching is disabled. ext2 does not know how to issue barriers
+  as of 2.6.28. hdparm -W0 disables it on SATA disks.
+
 Journaling
-----------
-
-A journaling extension to the ext2 code has been developed by Stephen
-Tweedie.  It avoids the risks of metadata corruption and the need to
-wait for e2fsck to complete after a crash, without requiring a change
-to the on-disk ext2 layout.  In a nutshell, the journal is a regular
-file which stores whole metadata (and optionally data) blocks that have
-been modified, prior to writing them into the filesystem.  This means
-it is possible to add a journal to an existing ext2 filesystem without
-the need for data conversion.
-
-When changes to the filesystem (e.g. a file is renamed) they are stored in
-a transaction in the journal and can either be complete or incomplete at
-the time of a crash.  If a transaction is complete at the time of a crash
-(or in the normal case where the system does not crash), then any blocks
-in that transaction are guaranteed to represent a valid filesystem state,
-and are copied into the filesystem.  If a transaction is incomplete at
-the time of the crash, then there is no guarantee of consistency for
-the blocks in that transaction so they are discarded (which means any
-filesystem changes they represent are also lost).
+==========
 Check Documentation/filesystems/ext3.txt if you want to read more about
 ext3 and journaling.
 
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..02a9bd5 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -188,6 +200,27 @@  mke2fs: 	create a ext3 partition with the -j flag.
 debugfs: 	ext2 and ext3 file system debugger.
 ext2online:	online (mounted) ext2 and ext3 filesystem resizer
 
+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed
+
+* sector writes are atomic
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+	   (Note that barriers are disabled by default, use "barrier=1"
+	   mount option after making sure hw can support them). 
+
+	   hdparm -I reports disk features. "Native Command Queueing"
+	   is the feature you are looking for.
 
 References
 ==========