Update Documentation/md.txt to mention journaling won't help dirty+degraded case.

Message ID 200909021749.47695.rob@landley.net
State Not Applicable, archived

Commit Message

Rob Landley Sept. 2, 2009, 10:49 p.m. UTC
From: Rob Landley <rob@landley.net>

Add more warnings to the "Boot time assembly of degraded/dirty arrays" section,
explaining that using a journaling filesystem can't overcome this problem.

Signed-off-by: Rob Landley <rob@landley.net>
---

 Documentation/md.txt |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

Comments

Pavel Machek Sept. 3, 2009, 9:08 a.m. UTC | #1
On Wed 2009-09-02 17:49:46, Rob Landley wrote:
> From: Rob Landley <rob@landley.net>
> 
> Add more warnings to the "Boot time assembly of degraded/dirty arrays" section,
> explaining that using a journaling filesystem can't overcome this problem.
> 
> Signed-off-by: Rob Landley <rob@landley.net>

I like it! Not sure if I know enough about MD to add ack, but...

Acked-by: Pavel Machek <pavel@ucw.cz>

								Pavel
Ric Wheeler Sept. 3, 2009, 12:05 p.m. UTC | #2
On 09/02/2009 06:49 PM, Rob Landley wrote:
> From: Rob Landley<rob@landley.net>
>
> Add more warnings to the "Boot time assembly of degraded/dirty arrays" section,
> explaining that using a journaling filesystem can't overcome this problem.
>
> Signed-off-by: Rob Landley<rob@landley.net>
> ---
>
>   Documentation/md.txt |   17 +++++++++++++++++
>   1 file changed, 17 insertions(+)
>
> diff --git a/Documentation/md.txt b/Documentation/md.txt
> index 4edd39e..52b8450 100644
> --- a/Documentation/md.txt
> +++ b/Documentation/md.txt
> @@ -75,6 +75,23 @@ So, to boot with a root filesystem of a dirty degraded raid[56], use
>
>      md-mod.start_dirty_degraded=1
>
> +Note that Journaling filesystems do not effectively protect data in this
> +case, because the update granularity of the RAID is larger than the journal
> +was designed to expect.  Reconstructing data via parity information involves
> +matching together corresponding stripes, and updating only some of these
> +stripes renders the corresponding data in all the unmatched stripes
> +meaningless.  Thus seemingly unrelated data in other parts of the filesystem
> +(stored in the unmatched stripes) can become unreadable after a partial
> +update, but the journal is only aware of the parts it modified, not the
> +"collateral damage" elsewhere in the filesystem which was affected by those
> +changes.
> +
> +Thus successful journal replay proves nothing in this context, and even a
> +full fsck only shows whether or not the filesystem's metadata was affected.
> +(A proper solution to this problem would involve adding journaling to the RAID
> +itself, at least during degraded writes.  In the meantime, try not to allow
> +a system to shut down uncleanly with its RAID both dirty and degraded, it
> +can handle one but not both.)
>
>   Superblock formats
>   ------------------
>
>

NACK.

Now you have moved the inaccurate documentation about journalling file systems 
into the MD documentation.

Repeat after me:

(1) partial writes to a RAID stripe (with or without file systems, with or 
without journals) create an invalid stripe

(2) partial writes can be prevented in most cases by running with write cache 
disabled or working barriers

(3) fsck can (for journalling fs or non journalling fs) detect and fix your file 
system. It won't give you back the data in that stripe, but you will get the 
rest of your metadata and data back and usable.
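
Point (1) can be illustrated with a toy model. The sketch below is not MD code; 
it assumes a hypothetical 3-disk RAID5 stripe (two data blocks plus XOR parity) 
with made-up 4-byte blocks, and shows how a write that updates one data block 
but not the parity changes what is reconstructed for the other block once the 
array is degraded.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks (the RAID5 parity operation)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy stripe (assumed layout): two data blocks and their parity.
d0 = b"AAAA"
d1 = b"BBBB"
parity = xor_blocks(d0, d1)

# Degraded array: the disk holding d1 is gone, so d1 must be rebuilt
# from the surviving data block and the parity.
assert xor_blocks(d0, parity) == d1

# Partial stripe write: the new d0 reaches disk, but the system goes
# down before the matching parity update is written.
new_d0 = b"CCCC"
stale_parity = parity

# d1 was never written to, yet its reconstruction is now wrong.
reconstructed_d1 = xor_blocks(new_d0, stale_parity)
```

With all disks present, d1 would simply be read from its own disk and a resync 
could recompute the parity; the mismatch only turns into lost data when d1 has 
to be reconstructed.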

You don't need MD in the picture to test this: take fsfuzzer or just dd and 
zero out a RAID stripe width of data from a file system. If you hit data blocks, 
your fsck (for ext2) or mount (for any journalling fs) will not see an error. If 
you hit metadata, fsck will in both cases try to fix it as best it can when run.
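
A minimal sketch of the "zero out a stripe width" step, for anyone who would 
rather script it than use dd. The chunk size, the number of data disks, and the 
helper name zero_region are assumptions made for illustration, and the demo 
runs against a scratch file, not a real file system image.

```python
import os
import tempfile


def zero_region(path: str, offset: int, length: int) -> None:
    """Overwrite `length` bytes at `offset` with zeros, in place."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(b"\x00" * length)
        f.flush()
        os.fsync(f.fileno())


# Hypothetical geometry: 64 KiB chunks, 3 data disks per stripe.
CHUNK = 64 * 1024
DATA_DISKS = 3
STRIPE_WIDTH = CHUNK * DATA_DISKS

# Scratch file standing in for a file system image (four stripes of 0xff).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\xff" * (4 * STRIPE_WIDTH))
    image_path = tmp.name

# Wipe the second stripe only, as a lost stripe would.
zero_region(image_path, offset=STRIPE_WIDTH, length=STRIPE_WIDTH)

with open(image_path, "rb") as f:
    image = f.read()
os.unlink(image_path)
```

Pointed at a copy of a real image, the same helper gives a repeatable way to 
see whether fsck or mount notices the damage.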

Also note that partial writes (similar to torn writes) can happen for multiple 
reasons on non-RAID systems and leave the same kind of damage.

Side note: proposing a half-sketched-out "fix" for partial stripe writes in 
documentation is not productive. Much better to submit a fully thought out 
proposal or actual patches to demonstrate the issue.

Rob, you should really try to take a few disks, build a working MD RAID5 group 
and test your ideas. Try it with and without the write cache enabled.

Measure and report, say after 20 power losses, how file integrity and fsck 
repairs were impacted.

Try the same with ext2 and ext3.

Regards,

Ric

Pavel Machek Sept. 3, 2009, 12:31 p.m. UTC | #3
On Thu 2009-09-03 08:05:31, Ric Wheeler wrote:
> On 09/02/2009 06:49 PM, Rob Landley wrote:
>> From: Rob Landley<rob@landley.net>
>>
>> Add more warnings to the "Boot time assembly of degraded/dirty arrays" section,
>> explaining that using a journaling filesystem can't overcome this problem.
>>
>> Signed-off-by: Rob Landley<rob@landley.net>
>> ---
>>
>>   Documentation/md.txt |   17 +++++++++++++++++
>>   1 file changed, 17 insertions(+)
>>
>> diff --git a/Documentation/md.txt b/Documentation/md.txt
>> index 4edd39e..52b8450 100644
>> --- a/Documentation/md.txt
>> +++ b/Documentation/md.txt
>> @@ -75,6 +75,23 @@ So, to boot with a root filesystem of a dirty degraded raid[56], use
>>
>>      md-mod.start_dirty_degraded=1
>>
>> +Note that Journaling filesystems do not effectively protect data in this
>> +case, because the update granularity of the RAID is larger than the journal
>> +was designed to expect.  Reconstructing data via parity information involves
>> +matching together corresponding stripes, and updating only some of these
>> +stripes renders the corresponding data in all the unmatched stripes
>> +meaningless.  Thus seemingly unrelated data in other parts of the filesystem
>> +(stored in the unmatched stripes) can become unreadable after a partial
>> +update, but the journal is only aware of the parts it modified, not the
>> +"collateral damage" elsewhere in the filesystem which was affected by those
>> +changes.
>> +
>> +Thus successful journal replay proves nothing in this context, and even a
>> +full fsck only shows whether or not the filesystem's metadata was affected.
>> +(A proper solution to this problem would involve adding journaling to the RAID
>> +itself, at least during degraded writes.  In the meantime, try not to allow
>> +a system to shut down uncleanly with its RAID both dirty and degraded, it
>> +can handle one but not both.)
>>
>>   Superblock formats
>>   ------------------
>>
>>
>
> NACK.
>
> Now you have moved the inaccurate documentation about journalling file 
> systems into the MD documentation.

What is inaccurate about it?

> Repeat after me:

> (1) partial writes to a RAID stripe (with or without file systems, with 
> or without journals) create an invalid stripe

That's what he's documenting.

> (2) partial writes can be prevented in most cases by running with write 
> cache disabled or working barriers

Given how much storage experience you claim, you should know by now that
MD RAID5 does not support barriers...


> Rob, you should really try to take a few disks, build a working MD RAID5 
> group and test your ideas. Try it with and without the write cache 
> enabled.

....and understand by now that statistics are irrelevant for design
problems.

Ouch, and trying to silence people by telling them to fix the problem
instead of documenting it is not nice either.
								Pavel

Patch

diff --git a/Documentation/md.txt b/Documentation/md.txt
index 4edd39e..52b8450 100644
--- a/Documentation/md.txt
+++ b/Documentation/md.txt
@@ -75,6 +75,23 @@  So, to boot with a root filesystem of a dirty degraded raid[56], use
 
    md-mod.start_dirty_degraded=1
 
+Note that Journaling filesystems do not effectively protect data in this
+case, because the update granularity of the RAID is larger than the journal
+was designed to expect.  Reconstructing data via parity information involves
+matching together corresponding stripes, and updating only some of these
+stripes renders the corresponding data in all the unmatched stripes
+meaningless.  Thus seemingly unrelated data in other parts of the filesystem
+(stored in the unmatched stripes) can become unreadable after a partial
+update, but the journal is only aware of the parts it modified, not the
+"collateral damage" elsewhere in the filesystem which was affected by those
+changes.
+
+Thus successful journal replay proves nothing in this context, and even a
+full fsck only shows whether or not the filesystem's metadata was affected.
+(A proper solution to this problem would involve adding journaling to the RAID
+itself, at least during degraded writes.  In the meantime, try not to allow
+a system to shut down uncleanly with its RAID both dirty and degraded, it
+can handle one but not both.)
 
 Superblock formats
 ------------------