diff mbox series

[01/19] btrfs: require at least 4 devices for RAID 6

Message ID 20260512052230.2947683-2-hch@lst.de (mailing list archive)
State Handled Elsewhere
Headers show
Series [01/19] btrfs: require at least 4 devices for RAID 6 | expand

Commit Message

Christoph Hellwig May 12, 2026, 5:20 a.m. UTC
While the RAID6 algorithm could in theory support 3 devices by just
copying the data disk to the two parity disks, this version is not only
useless because it is a suboptimal version of 3-way mirroring, but also
broken with various crashes and incorrect parity generation in various
architecture-optimized implementations.  Disallow it similar to mdraid
which requires at least 4 devices for RAID 6.

Fixes: 53b381b3abeb ("Btrfs: RAID5 and RAID6")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

David Sterba May 12, 2026, 11:42 a.m. UTC | #1
On Tue, May 12, 2026 at 07:20:41AM +0200, Christoph Hellwig wrote:
> While the RAID6 algorithm could in theory support 3 devices by just
> copying the data disk to the two parity disks, this version is not only
> useless because it is a suboptimal version of 3-way mirroring, but also
> broken with various crashes and incorrect parity generation in various
> architecture-optimized implementations.  Disallow it similar to mdraid
> which requires at least 4 devices for RAID 6.
> 
> Fixes: 53b381b3abeb ("Btrfs: RAID5 and RAID6")
> Signed-off-by: Christoph Hellwig <hch@lst.de>

This patch should have been sent separately as it has user visible
impact and can potentially break some setups. The degenerate modes of
raid0, 5, or 6 are explicit as a possible middle step when converting
profiles.  We can use a fallback implementation for this case if the
accelerated implementations cannot do it.
Christoph Hellwig May 13, 2026, 5:47 a.m. UTC | #2
On Tue, May 12, 2026 at 01:42:31PM +0200, David Sterba wrote:
> On Tue, May 12, 2026 at 07:20:41AM +0200, Christoph Hellwig wrote:
> > While the RAID6 algorithm could in theory support 3 devices by just
> > copying the data disk to the two parity disks, this version is not only
> > useless because it is a suboptimal version of 3-way mirroring, but also
> > broken with various crashes and incorrect parity generation in various
> > architecture-optimized implementations.  Disallow it similar to mdraid
> > which requires at least 4 devices for RAID 6.
> > 
> > Fixes: 53b381b3abeb ("Btrfs: RAID5 and RAID6")
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> This patch should have been sent separately as it has user visible
> impact and can potentially break some setups.

It _is_ sent out separate.

> The degenerate modes of
> raid0, 5, or 6 are explicit as a possible middle step when converting
> profiles.  We can use a fallback implementation for this case if the
> accelerated implementations cannot do it.

This is not about a degenerated mode.  For a degenerated RAID 6, parity
generation uses the RAID 5 XOR routines as the second parity will be
missing.  This is about generating two parities for a single data disk,
which must be explicitly selected.
H. Peter Anvin May 13, 2026, 4:14 p.m. UTC | #3
On May 11, 2026 10:20:41 PM PDT, Christoph Hellwig <hch@lst.de> wrote:
>While the RAID6 algorithm could in theory support 3 devices by just
>copying the data disk to the two parity disks, this version is not only
>useless because it is a suboptimal version of 3-way mirroring, but also
>broken with various crashes and incorrect parity generation in various
>architecture-optimized implementations.  Disallow it similar to mdraid
>which requires at least 4 devices for RAID 6.
>
>Fixes: 53b381b3abeb ("Btrfs: RAID5 and RAID6")
>Signed-off-by: Christoph Hellwig <hch@lst.de>
>---
> fs/btrfs/volumes.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>index a88e68f90564..0b54b97bdad8 100644
>--- a/fs/btrfs/volumes.c
>+++ b/fs/btrfs/volumes.c
>@@ -159,7 +159,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
> 		.sub_stripes	= 1,
> 		.dev_stripes	= 1,
> 		.devs_max	= 0,
>-		.devs_min	= 3,
>+		.devs_min	= 4,
> 		.tolerated_failures = 2,
> 		.devs_increment	= 1,
> 		.ncopies	= 1,

Yes, if anyone cares about < 4 disks for the RAID-6 case (or < 3 for the RAID-4/5 case), just use the RAID-1 code.
David Sterba May 13, 2026, 8:19 p.m. UTC | #4
On Wed, May 13, 2026 at 07:47:42AM +0200, Christoph Hellwig wrote:
> On Tue, May 12, 2026 at 01:42:31PM +0200, David Sterba wrote:
> > On Tue, May 12, 2026 at 07:20:41AM +0200, Christoph Hellwig wrote:
> > > While the RAID6 algorithm could in theory support 3 devices by just
> > > copying the data disk to the two parity disks, this version is not only
> > > useless because it is a suboptimal version of 3-way mirroring, but also
> > > broken with various crashes and incorrect parity generation in various
> > > architecture-optimized implementations.  Disallow it similar to mdraid
> > > which requires at least 4 devices for RAID 6.
> > > 
> > > Fixes: 53b381b3abeb ("Btrfs: RAID5 and RAID6")
> > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > 
> > This patch should have been sent separately as it has user visible
> > impact and can potentially break some setups.
> 
> It _is_ sent out separate.

It's public interface change of btrfs but in a patch series cleaning
up some library code, I noticed it by accident.

> > The degenerate modes of
> > raid0, 5, or 6 are explicit as a possible middle step when converting
> > profiles.  We can use a fallback implementation for this case if the
> > accelerated implementations cannot do it.
> 
> This is not about a degenerated mode.  For a degenerated RAID 6, parity
> generation uses the RAID 5 XOR routines as the second parity will be
> missing.  This is about generating two parities for a single data disk,
> which must be explicitly selected.

The calcuation is a different than what I'm concened about, changing
minimum devices from 3 to 4 is a breaking change. If the library won't
provide the xor/parity functions then we'll have to add a fallback for
this special case.
Goffredo Baroncelli May 14, 2026, 7:51 p.m. UTC | #5
On 13/05/2026 07.47, Christoph Hellwig wrote:
> On Tue, May 12, 2026 at 01:42:31PM +0200, David Sterba wrote:

> 
>> The degenerate modes of
>> raid0, 5, or 6 are explicit as a possible middle step when converting
>> profiles.  We can use a fallback implementation for this case if the
>> accelerated implementations cannot do it.
> 
> This is not about a degenerated mode.  For a degenerated RAID 6, parity
> generation uses the RAID 5 XOR routines as the second parity will be
> missing.  This is about generating two parities for a single data disk,
> which must be explicitly selected.
> 

I think that the David concern is : "what happens for an already
existing btrfs raid6 3 disks filesystem when the user upgrade the kernel ?"
(I am thinking when a new BG needs to be allocated)...

BR
GB
H. Peter Anvin May 14, 2026, 7:57 p.m. UTC | #6
On May 14, 2026 12:51:59 PM PDT, Goffredo Baroncelli <kreijack@libero.it> wrote:
>On 13/05/2026 07.47, Christoph Hellwig wrote:
>> On Tue, May 12, 2026 at 01:42:31PM +0200, David Sterba wrote:
>
>> 
>>> The degenerate modes of
>>> raid0, 5, or 6 are explicit as a possible middle step when converting
>>> profiles.  We can use a fallback implementation for this case if the
>>> accelerated implementations cannot do it.
>> 
>> This is not about a degenerated mode.  For a degenerated RAID 6, parity
>> generation uses the RAID 5 XOR routines as the second parity will be
>> missing.  This is about generating two parities for a single data disk,
>> which must be explicitly selected.
>> 
>
>I think that the David concern is : "what happens for an already
>existing btrfs raid6 3 disks filesystem when the user upgrade the kernel ?"
>(I am thinking when a new BG needs to be allocated)...
>
>BR
>GB
>

That's what I'm saying – it should invoke the RAID-1 code under the cover (as with 3 disks, D = P = Q.)
Christoph Hellwig May 15, 2026, 4:37 a.m. UTC | #7
On Thu, May 14, 2026 at 09:51:59PM +0200, Goffredo Baroncelli wrote:
> I think that the David concern is : "what happens for an already
> existing btrfs raid6 3 disks filesystem when the user upgrade the kernel ?"
> (I am thinking when a new BG needs to be allocated)...

Then it will cleanly fail to mount instead of constantly corrupting data
and memory with every write, yes.  Which clearly suggest that such
file systems don't exist in the wild.

But if btrfs wants to keep supporting this I'll just add a _unsafe
version without the check in the core library.
Christoph Hellwig May 15, 2026, 4:37 a.m. UTC | #8
On Thu, May 14, 2026 at 12:57:53PM -0700, H. Peter Anvin wrote:
> That's what I'm saying – it should invoke the RAID-1 code under the
> cover (as with 3 disks, D = P = Q.)

Yes, if the btrfs maintainer cared for this setup that is what should
be done.
David Sterba May 15, 2026, 2:51 p.m. UTC | #9
On Thu, May 14, 2026 at 12:57:53PM -0700, H. Peter Anvin wrote:
> On May 14, 2026 12:51:59 PM PDT, Goffredo Baroncelli <kreijack@libero.it> wrote:
> >On 13/05/2026 07.47, Christoph Hellwig wrote:
> >> On Tue, May 12, 2026 at 01:42:31PM +0200, David Sterba wrote:
> >
> >> 
> >>> The degenerate modes of
> >>> raid0, 5, or 6 are explicit as a possible middle step when converting
> >>> profiles.  We can use a fallback implementation for this case if the
> >>> accelerated implementations cannot do it.
> >> 
> >> This is not about a degenerated mode.  For a degenerated RAID 6, parity
> >> generation uses the RAID 5 XOR routines as the second parity will be
> >> missing.  This is about generating two parities for a single data disk,
> >> which must be explicitly selected.
> >> 
> >
> >I think that the David concern is : "what happens for an already
> >existing btrfs raid6 3 disks filesystem when the user upgrade the kernel ?"
> >(I am thinking when a new BG needs to be allocated)...
> 
> That's what I'm saying – it should invoke the RAID-1 code under the cover (as with 3 disks, D = P = Q.)

Thanks, it was not clear to me what you meant. For the two edge cases
the code should do simple memcpy for both calculations of parity and
recovery.
Goffredo Baroncelli May 15, 2026, 4:50 p.m. UTC | #10
On 15/05/2026 06.37, Christoph Hellwig wrote:
> On Thu, May 14, 2026 at 09:51:59PM +0200, Goffredo Baroncelli wrote:
>> I think that the David concern is : "what happens for an already
>> existing btrfs raid6 3 disks filesystem when the user upgrade the kernel ?"
>> (I am thinking when a new BG needs to be allocated)...
> 
> Then it will cleanly fail to mount instead of constantly corrupting data
> and memory with every write, yes.  Which clearly suggest that such
> file systems don't exist in the wild.
> 
> But if btrfs wants to keep supporting this I'll just add a _unsafe
> version without the check in the core library.
> 

I am not arguing about this part. My point is that the change shouldn't have impacted the
BTRFS interface versus the user (as patch 01/19 does), but instead the change should
have modify the interface raid code <-> btrfs (e.g. doing a memcpy....), or at least the
cover letter should warn that the raid6 code requires a number of disk >= 4, pointing
to BTRFS as "client doing wrong things".

At least, the message was received: don't relay to the raid6 code when the number of disk is
less than 4.

BR
GB
H. Peter Anvin May 15, 2026, 7:59 p.m. UTC | #11
On May 14, 2026 9:37:05 PM PDT, Christoph Hellwig <hch@lst.de> wrote:
>On Thu, May 14, 2026 at 09:51:59PM +0200, Goffredo Baroncelli wrote:
>> I think that the David concern is : "what happens for an already
>> existing btrfs raid6 3 disks filesystem when the user upgrade the kernel ?"
>> (I am thinking when a new BG needs to be allocated)...
>
>Then it will cleanly fail to mount instead of constantly corrupting data
>and memory with every write, yes.  Which clearly suggest that such
>file systems don't exist in the wild.
>
>But if btrfs wants to keep supporting this I'll just add a _unsafe
>version without the check in the core library.

I don't think this is a good idea. Error out; it is the btrfs maintainers' job to ensure user data isn't lost. 

The RAID-6 code has *never* supported only 3 units, and if it ever worked for *any* of the implementations it was purely by accident. Speaking as the original author I should know; this was deliberate as in some cases the degenerate case (3) would have required extra trays in the code to no user benefit. 

I would not be surprised if the kernel crashed or corrupted the page cache in that case.
Christoph Hellwig May 18, 2026, 5:12 a.m. UTC | #12
On Fri, May 15, 2026 at 12:59:34PM -0700, H. Peter Anvin wrote:
> I don't think this is a good idea. Error out; it is the btrfs maintainers' job to ensure user data isn't lost. 
> 
> The RAID-6 code has *never* supported only 3 units, and if it ever worked for *any* of the implementations it was purely by accident. Speaking as the original author I should know; this was deliberate as in some cases the degenerate case (3) would have required extra trays in the code to no user benefit. 
> 
> I would not be surprised if the kernel crashed or corrupted the page cache in that case.

It does, that's why I wanted to exclude it.  Anyway, for the about to be
resent version I'll drop this btrfs patch over the stated objection and
will otherwise not change anything.  This means the (IMHO hypothetical)
users of this configuration will get a WARN_ON_ONCE triggered, but
otherwise keep working (or rather not working) as before.
Qu Wenruo May 20, 2026, 8:41 a.m. UTC | #13
在 2026/5/18 14:42, Christoph Hellwig 写道:
> On Fri, May 15, 2026 at 12:59:34PM -0700, H. Peter Anvin wrote:
>> I don't think this is a good idea. Error out; it is the btrfs maintainers' job to ensure user data isn't lost.
>>
>> The RAID-6 code has *never* supported only 3 units, and if it ever worked for *any* of the implementations it was purely by accident. Speaking as the original author I should know; this was deliberate as in some cases the degenerate case (3) would have required extra trays in the code to no user benefit.
>>
>> I would not be surprised if the kernel crashed or corrupted the page cache in that case.
> 
> It does, that's why I wanted to exclude it.  Anyway, for the about to be
> resent version I'll drop this btrfs patch over the stated objection and
> will otherwise not change anything.  This means the (IMHO hypothetical)
> users of this configuration will get a WARN_ON_ONCE triggered, but
> otherwise keep working (or rather not working) as before.
> 

For the btrfs part, I believe I can get the current 2-disk-raid5 and 
3-disk-raid6 to fallback to raid1 inside btrfs.

I hope the btrfs part can be finished and reach the next merge window, 
but I'm not 100% sure.

What is the planned cycle to merge this raid5/6 cleanup?

Thanks,
Qu
Andrew Morton May 22, 2026, 12:17 a.m. UTC | #14
On Wed, 20 May 2026 18:11:09 +0930 Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:

> 
> 
> 在 2026/5/18 14:42, Christoph Hellwig 写道:
> > On Fri, May 15, 2026 at 12:59:34PM -0700, H. Peter Anvin wrote:
> >> I don't think this is a good idea. Error out; it is the btrfs maintainers' job to ensure user data isn't lost.
> >>
> >> The RAID-6 code has *never* supported only 3 units, and if it ever worked for *any* of the implementations it was purely by accident. Speaking as the original author I should know; this was deliberate as in some cases the degenerate case (3) would have required extra trays in the code to no user benefit.
> >>
> >> I would not be surprised if the kernel crashed or corrupted the page cache in that case.
> > 
> > It does, that's why I wanted to exclude it.  Anyway, for the about to be
> > resent version I'll drop this btrfs patch over the stated objection and
> > will otherwise not change anything.  This means the (IMHO hypothetical)
> > users of this configuration will get a WARN_ON_ONCE triggered, but
> > otherwise keep working (or rather not working) as before.
> > 
> 
> For the btrfs part, I believe I can get the current 2-disk-raid5 and 
> 3-disk-raid6 to fallback to raid1 inside btrfs.
> 
> I hope the btrfs part can be finished and reach the next merge window, 
> but I'm not 100% sure.
> 
> What is the planned cycle to merge this raid5/6 cleanup?

At present it's on track for the 7.2-rc1 merge window.  Does that suit?
Qu Wenruo May 22, 2026, 12:27 a.m. UTC | #15
在 2026/5/22 09:47, Andrew Morton 写道:
> On Wed, 20 May 2026 18:11:09 +0930 Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> 
>>
>>
>> 在 2026/5/18 14:42, Christoph Hellwig 写道:
>>> On Fri, May 15, 2026 at 12:59:34PM -0700, H. Peter Anvin wrote:
>>>> I don't think this is a good idea. Error out; it is the btrfs maintainers' job to ensure user data isn't lost.
>>>>
>>>> The RAID-6 code has *never* supported only 3 units, and if it ever worked for *any* of the implementations it was purely by accident. Speaking as the original author I should know; this was deliberate as in some cases the degenerate case (3) would have required extra trays in the code to no user benefit.
>>>>
>>>> I would not be surprised if the kernel crashed or corrupted the page cache in that case.
>>>
>>> It does, that's why I wanted to exclude it.  Anyway, for the about to be
>>> resent version I'll drop this btrfs patch over the stated objection and
>>> will otherwise not change anything.  This means the (IMHO hypothetical)
>>> users of this configuration will get a WARN_ON_ONCE triggered, but
>>> otherwise keep working (or rather not working) as before.
>>>
>>
>> For the btrfs part, I believe I can get the current 2-disk-raid5 and
>> 3-disk-raid6 to fallback to raid1 inside btrfs.
>>
>> I hope the btrfs part can be finished and reach the next merge window,
>> but I'm not 100% sure.
>>
>> What is the planned cycle to merge this raid5/6 cleanup?
> 
> At present it's on track for the 7.2-rc1 merge window.  Does that suit?

The current btrfs fix (*) is pretty small, I believe we can get it into 
the next merge window, as long as we got enough review on it.

*: 
https://lore.kernel.org/linux-btrfs/a1d63733465229936351804f3760803d5894a962.1779274630.git.wqu@suse.com/T/#u
diff mbox series

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a88e68f90564..0b54b97bdad8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -159,7 +159,7 @@  const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.sub_stripes	= 1,
 		.dev_stripes	= 1,
 		.devs_max	= 0,
-		.devs_min	= 3,
+		.devs_min	= 4,
 		.tolerated_failures = 2,
 		.devs_increment	= 1,
 		.ncopies	= 1,