Patchwork Ext4: deadlock occurs when running fsstress and ENOSPC errors are seen.

login
register
mail settings
Submitter Amit Sahrawat
Date April 16, 2014, 5 a.m.
Message ID <CADDb1s3HYDvb51Ngrwk82gkpbUWg1bRo7kaUmbGRmb0g_9JKgw@mail.gmail.com>
Download mbox | patch
Permalink /patch/339431/
State New
Headers show

Comments

Amit Sahrawat - April 16, 2014, 5 a.m.
Thanks Ted, for the detailed reply.

We could not retain the original HDD – on which we got the issue. In
order to replicate the problem we did steps like this:
1)	Make 250MB partition
2)	Format the partition with blocksize ‘4K’
3)	Using dumpe2fs – get the block group information
Group 0: (Blocks 0-32767)
  Checksum 0x89ae, unused inodes 29109
  Primary superblock at 0, Group descriptors at 1-1
  Reserved GDT blocks at 2-15
  Block bitmap at 16 (+16), Inode bitmap at 32 (+32)
  Inode table at 48-957 (+48)
  26826 free blocks, 29109 free inodes, 2 directories, 29109 unused inodes
  Free blocks: 5942-32767
  Free inodes: 12-29120
Group 1: (Blocks 32768-58226) [INODE_UNINIT]
  Checksum 0xb43a, unused inodes 29120
  Backup superblock at 32768, Group descriptors at 32769-32769
  Reserved GDT blocks at 32770-32783
  Block bitmap at 17 (bg #0 + 17), Inode bitmap at 33 (bg #0 + 33)
  Inode table at 958-1867 (bg #0 + 958)
  25443 free blocks, 29120 free inodes, 0 directories, 29120 unused inodes
  Free blocks: 32784-58226
  Free inodes: 29121-58240
4)	Corrupt the block group ‘1’  by writing all ‘1’, we had one file
with all 1’s, so using ‘dd’ –
dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1
After this mount the partition – create few random size files and then
ran ‘fsstress,

fsstress -p 10 -n 100 -l 100 -d /mnt/test_dir

and we get logs like before the hang:
#fsstress -p 10 -n 100 -l 100 -d /mnt/test_dir
seed = 582332
EXT4-fs error (device sdb1): ext4_mb_generate_buddy:743: group 1,
20480 clusters in bitmap, 25443 in gd; block bitmap corrupt.
JBD2: Spotted dirty metadata buffer (dev = sdb1, blocknr = 0). There's
a risk of filesystem corruption in case of system crash.

EXT4-fs (sdb1): delayed block allocation failed for inode 26 at
logical offset 181 with max blocks 2 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost

EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=51
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=9
EXT4-fs (sdb1): i_reserved_meta_blocks=1

EXT4-fs (sdb1): delayed block allocation failed for inode 101 at
logical offset 198 with max blocks 1 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost

EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=43
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=1
EXT4-fs (sdb1): i_reserved_meta_blocks=1

EXT4-fs (sdb1): delayed block allocation failed for inode 94 at
logical offset 450 with max blocks 1 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost

EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=36
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=12
EXT4-fs (sdb1): i_reserved_meta_blocks=2


…
EXT4-fs (sdb1): error count: 3
EXT4-fs (sdb1): initial error at 545: ext4_mb_generate_buddy:743
EXT4-fs (sdb1): last error at 42: ext4_mb_generate_buddy:743
…
Yes, we are running this on ARM target and we did not set the time
before running these operations. So, the time actually corresponds to
date -d @545
Thu Jan  1 05:39:05 IST 1970
date -d @42
Thu Jan  1 05:30:42 IST 1970


Yes, we are running kernel version 3.8 and applied the patches from
‘Darrick’ to fix the first looping around in ext4_da_writepages.

As you suggested, we also made change to mark the FS READ ONLY in case
of corruption. The changes are;
                ext4_da_block_invalidatepages(mpd);
@@ -2303,6 +2305,36 @@ out:
        return ret;
 }

+static void ext4_invalidate_mapping(struct  address_space *mapping)
+{
+       struct pagevec pvec;
+       unsigned int i;
+       pgoff_t index = 0;
+       struct inode* inode = mapping->host;
+
+       pagevec_init(&pvec, 0);
+       while (pagevec_lookup_tag(&pvec, mapping, &index, PAGECACHE_TAG_DIRTY,
+                                 PAGEVEC_SIZE)) {
+               for (i = 0; i < pagevec_count(&pvec); i++) {
+                       struct page *page = pvec.pages[i];
+                       lock_page(page);
+                       spin_lock_irq(&mapping->tree_lock);
+                       if (test_bit(PG_dirty, &page->flags)) {
+                               printk(KERN_ERR"INO[%lu], i(%d)\n",
inode->i_ino, i);
+                               radix_tree_tag_clear(&mapping->page_tree,
+                               page_index(page),
+                               PAGECACHE_TAG_DIRTY);
+                               spin_unlock_irq(&mapping->tree_lock);
+                               clear_page_dirty_for_io(page);
+                               goto unlock_page;
+                       }
+                       spin_unlock_irq(&mapping->tree_lock);
+unlock_page:
+                       unlock_page(page);
+               }
+               pagevec_release(&pvec);
+       }
+}
static int ext4_da_writepages(struct address_space *mapping,
                              struct writeback_control *wbc)
@@ -2419,6 +2451,14 @@ retry:
                        ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
                               "%ld pages, ino %lu; err %d", __func__,
                                wbc->nr_to_write, inode->i_ino, ret);
+
+                       if(ret == -EROFS) {
+                               ext4_msg(inode->i_sb, KERN_CRIT,"This
is probably result of corruption"
+                                                "that FS is marked RO
in between writepages\n");
+                               /*Invalidate mapping if RO encountered
during writepages */
+                                ext4_invalidate_mapping(mapping);
+                                invalidate_inode_pages2(mapping);
+                       }
                        blk_finish_plug(&plug);
                        goto out_writepages;
                }

After making these changes, as soon as there is corruption – the FS is
marked RO and all pending DIRTY writes are cleared. So, the hang is
not seen.

But we have a doubt about the possible side-effects of this code. Can
you please provide your inputs on this? It will be of immense help.

Thanks & Regards,
Amit Sahrawat


On Tue, Apr 15, 2014 at 6:17 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Apr 15, 2014 at 01:24:26PM +0530, Amit Sahrawat wrote:
>> Initially in normal write path, when the disk was almost full – we got
>> hung for the ‘sync’ because the flusher (which is busy in the
>> writepages is not responding). Before the hung task, we also found the
>> logs like:
>>
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1493, 0
>> clusters in bitmap, 58339 in gd
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1000, 0
>> clusters in bitmap, 3 in gd
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1425, 0
>> clusters in bitmap, 1 in gd
>
> These errors indicate that the several block groups contain have
> corrupt block bitmap: block group #1493, #1000, and #1425.  The fact
> that there are 0 free blocks/clusters in the bitmap indicate that the
> bitmap was all zero's, which could be the potential cause of the
> corruption.
>
> The other thing which is funny is the number of free blocks/clusters
> being greater than 32768 in block group #1493.  Assuming a 4k page
> size, that shouldn't be possible.  Can you send the output of
> "dumpe2fs -h /dev/sdXX" so we can take a look at the file system parameters?
>
> How much before the hung task did you see these messages?  I normally
> recommend that the file system be set up to either panic the system,
> or force the file system to be remounted read/only when EXT4-fs error
> messages are reported, since that means that the file system is
> corrupted, and further operaion can cause more data to be lost.
>
>> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>>
>> EXT4-fs (sda1): error count: 58
>> EXT4-fs (sda1): initial error at 607: ext4_mb_generate_buddy:742
>> EXT4-fs (sda1): last error at 58: ext4_mb_generate_buddy:742
>
> The "607" and "58" in the "at 607" and "at 58" are normally supposed
> to be a unix time_t value.  That is, it's normally a number like:
> 1397564866, and it can be decoded via:
>
> % date -d @1397564866
> Tue Apr 15 08:27:46 EDT 2014
>
> The fact that these numbers are numerically so small means that the
> time wasn't set correctly on your system.  Was this a test system
> running under kvm without a proper real-time clock?
>
>> When we analysed the problem, it occurred from the writepages path in ext4.
>> This is because of the difference in the free blocks reported by
>> cluster bitmap and the number of free blocks reported by group
>> descriptor.
>
> Yes, indicating that the file system was corrupt.
>
>> During ext4_fill_super, ext4 calculates the number of free blocks by
>> reading all the descriptors in function ext4_count_free_clusters and
>> store it in percpu counter s_freeclusters_counter.
>> ext4_count_free_clusters:
>> desc_count = 0;
>> for (i = 0; i < ngroups; i++) {
>> gdp = ext4_get_group_desc(sb, i, NULL);
>> if (!gdp)
>> continue;
>> desc_count += ext4_free_group_clusters(sb, gdp);
>> }
>> return desc_count;
>>
>> During writebegin call, ext4 checks this s_freeclusters_counter
>> counter to know if there are free blocks present or not.
>> When the free blocks reported by group descriptor are greater than the
>> actual free blocks reported by bitmap, a call to writebegin could
>> still succeed even if the free blocks represented by bitmaps are 0.
>
> Yes.  We used to have code that would optionally read every single
> bitmap, and verify that the descriptor counts match the values in the
> bitmap.  However, that was expensive, and wasn't a full check of all
> possible file system inconsistencies that could lead to data loss.  So
> we ultimately removed this code.  If the file system is potentially
> corrupt, it is the system administrator's responsibility to force an
> fsck run to make sure the file system data structures are consistent.
>
>> When searching for the relevant problem which occurs in this path. We
>> got the patch-set from ‘Darrick’ which revolves around this problem.
>> ext4: error out if verifying the block bitmap fails
>> ext4: fix type declaration of ext4_validate_block_bitmap
>> ext4: mark block group as corrupt on block bitmap error
>> ext4: mark block group as corrupt on inode bitmap error
>> ext4: mark group corrupt on group descriptor checksum
>> ext4: don't count free clusters from a corrupt block group
>>
>> After adopting the patch-set and performing verification on the
>> similar setup, we ran ‘fsstress’. But now it is resulting in hang at
>> different points.
>>
>> In the current logs we got:
>> EXT4-fs error (device sdb1): ext4_mb_generate_buddy:743: group 1,
>> 20480 clusters in bitmap, 25443 in gd; block bitmap corrupt.
>> JBD2: Spotted dirty metadata buffer (dev = sdb1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>
> OK, what version of the kernel are you using?  The patches that you
> reference above have been in the upstream kernel since 3.12, so I'm
> assuming you're not using the latest upstream kernel, but rather an
> older kernel with some patches applied.  Hmmm, skipping ahead:
>
>> Kernel Version: 3.8
>> Test command:
>> fsstress -p 10 -n 100 -l 100 -d /mnt/test_dir
>
> There is clearly either some kernel bug or hardware problem which is
> causing the file system corruption.  Given that you are using a much
> older kernel, it's quite likely that there is some bug that has been
> fixed in a later version of the kernel (although we can't really rule
> out a hardware problem without know much more about your setup).
>
> Unfortunately, there has been a *large* number of changes since
> version 3.8, and I can't remember all of the changes and bug fixes
> that we might have made in the past year or more (v3.8 dates from
> March 2013).
>
> Something that might be helpful is for you to use xfstests.  That's a
> much more thorough set of tests which we've been using so if you must
> use an antique version of the kernel, that will probably be a much
> better set of tests.  It includes fsstress, and much more besides.
> More importantly, there are times when fixes are identified by the
> xfstest failure that has gotten fixed up in the commit logs.  So that
> might help you find the bug fix that you need to backport.
>
> For your convenience, there is a simple test framework that makes it
> relatively easy to build and run xfstests under KVM.  You can find it
> here:
>
>         git://git.kernel.org/pub/scm/fs/ext2/xfstests-bld.git
>
> See the documentation found at:
>
>         https://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/README
>
> for more details.
>
> I hope this helps,
>
>                                                 - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - April 16, 2014, 5:07 a.m.
On Wed, Apr 16, 2014 at 10:30:10AM +0530, Amit Sahrawat wrote:
> 4)	Corrupt the block group ‘1’  by writing all ‘1’, we had one file
> with all 1’s, so using ‘dd’ –
> dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1
> After this mount the partition – create few random size files and then
> ran ‘fsstress,

Um, sigh.  You didn't say that you were deliberately corrupting the
file system.  That wasn't in the subject line, or anywhere else in the
original message.

So the question isn't how the file system got corrupted, but that
you'd prefer that the system recovers without hanging after this
corruption.

I wish you had *said* that.  It would have saved me a lot of time,
since I was trying to figure out how the system had gotten so
corrupted (not realizing you had deliberately corrupted the file
system).

So I think if you run "tune2fs -e remount-ro /dev/sdb1" before you
started the fsstress, the file system would have remounted the
filesystem read-only at the first EXT4-fs error message.  This would
avoid the hang that you saw, since the file system would hopefully
"failed fast", before th euser had the opportunity to put data into
the page cache that would be lost when the system discovered there was
no place to put the data.

Regards,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger - April 16, 2014, 5:22 a.m.
On Apr 15, 2014, at 11:07 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Wed, Apr 16, 2014 at 10:30:10AM +0530, Amit Sahrawat wrote:
>> 4)	Corrupt the block group ‘1’  by writing all ‘1’, we had one file
>> with all 1’s, so using ‘dd’ –
>> dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1
>> After this mount the partition – create few random size files and then
>> ran ‘fsstress,
> 
> Um, sigh.  You didn't say that you were deliberately corrupting the
> file system.  That wasn't in the subject line, or anywhere else in the
> original message.
> 
> So the question isn't how the file system got corrupted, but that
> you'd prefer that the system recovers without hanging after this
> corruption.
> 
> I wish you had *said* that.  It would have saved me a lot of time,
> since I was trying to figure out how the system had gotten so
> corrupted (not realizing you had deliberately corrupted the file
> system).

Don't we check the bitmaps upon load to verify they are not bogus?
Looks like this is disabled completely if flex_bg is enabled, though
I guess that it is not impossible to do some checking even if flex_bg
is enabled.

For example, we have the background thread to do the inode table zeroing,
and it could load the block bitmaps and check the group descriptors for
itable and bitmap locations against the various bitmaps.  With flex_bg,
this would probably only load a small subset of block bitmaps.

> So I think if you run "tune2fs -e remount-ro /dev/sdb1" before you
> started the fsstress, the file system would have remounted the
> filesystem read-only at the first EXT4-fs error message.  This would
> avoid the hang that you saw, since the file system would hopefully
> "failed fast", before th euser had the opportunity to put data into
> the page cache that would be lost when the system discovered there was
> no place to put the data.

Even without remount-ro, it would be possible to flag the block bitmaps
for a particular group as corrupt, so that no new allocations are done
from that group until the next e2fsck is run.

Cheers, Andreas
Amit Sahrawat - April 16, 2014, 7:51 a.m.
Sorry Ted, if it caused the confusion.

There were actually 2 parts to the problem, the logs in the first mail
were from the original situation – where in there were many block
groups and error prints also showed that.

EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1493, 0
clusters in bitmap, 58339 in gd
EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1000, 0
clusters in bitmap, 3 in gd
EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1425, 0
clusters in bitmap, 1 in gd
JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
a risk of filesystem corruption in case of system crash.
JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
a risk of filesystem corruption in case of system crash.

1)	Original case – when the disk got corrupted and we only had the
logs and the hung task messages. But not the HDD on which issue was
observed.
2)	In order to reproduce the problem as was coming through the logs
(which highlighted the problem in the bitmap corruption). To minimize
the environment and make a proper case, we created a smaller partition
size and with only 2 groups. And intentionally corrupted the group 1
(our intention was just to replicate the error scenario).
3)	After corruption we used ‘fsstress’  - we got the similar problem
as was coming the original logs. – We shared our analysis after this
point for looping in the writepages part the free blocks mismatch.
4)	We came across ‘Darrick’ patches(in which it also mentioned about
how to corrupt to reproduce the problem) and applied on our
environment. It solved the initial problem about the looping in
writepages, but now we got hangs at other places.

Using ‘tune2fs’ is not a viable solution in our case, we can only
provide the solution via. the kernel changes. So, we made the changes
as shared earlier.

So the question isn't how the file system got corrupted, but that
you'd prefer that the system recovers without hanging after this
corruption.
>> Yes,  our priority is to keep the system running.

Again, Sorry for the confusion. But the intention was just to show the
original problem and what we did in order to replicate the problem.

Thanks & Regards,
Amit Sahrawat


On Wed, Apr 16, 2014 at 10:37 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Wed, Apr 16, 2014 at 10:30:10AM +0530, Amit Sahrawat wrote:
>> 4)    Corrupt the block group ‘1’  by writing all ‘1’, we had one file
>> with all 1’s, so using ‘dd’ –
>> dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1
>> After this mount the partition – create few random size files and then
>> ran ‘fsstress,
>
> Um, sigh.  You didn't say that you were deliberately corrupting the
> file system.  That wasn't in the subject line, or anywhere else in the
> original message.
>
> So the question isn't how the file system got corrupted, but that
> you'd prefer that the system recovers without hanging after this
> corruption.
>
> I wish you had *said* that.  It would have saved me a lot of time,
> since I was trying to figure out how the system had gotten so
> corrupted (not realizing you had deliberately corrupted the file
> system).
>
> So I think if you run "tune2fs -e remount-ro /dev/sdb1" before you
> started the fsstress, the file system would have remounted the
> filesystem read-only at the first EXT4-fs error message.  This would
> avoid the hang that you saw, since the file system would hopefully
> "failed fast", before th euser had the opportunity to put data into
> the page cache that would be lost when the system discovered there was
> no place to put the data.
>
> Regards,
>
>                                                 - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong - April 16, 2014, 5:46 p.m.
On Wed, Apr 16, 2014 at 01:21:34PM +0530, Amit Sahrawat wrote:
> Sorry Ted, if it caused the confusion.
> 
> There were actually 2 parts to the problem, the logs in the first mail
> were from the original situation – where in there were many block
> groups and error prints also showed that.
> 
> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1493, 0
> clusters in bitmap, 58339 in gd
> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1000, 0
> clusters in bitmap, 3 in gd
> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1425, 0
> clusters in bitmap, 1 in gd
> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
> a risk of filesystem corruption in case of system crash.
> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
> a risk of filesystem corruption in case of system crash.
> 
> 1)	Original case – when the disk got corrupted and we only had the
> logs and the hung task messages. But not the HDD on which issue was
> observed.
> 2)	In order to reproduce the problem as was coming through the logs
> (which highlighted the problem in the bitmap corruption). To minimize
> the environment and make a proper case, we created a smaller partition
> size and with only 2 groups. And intentionally corrupted the group 1
> (our intention was just to replicate the error scenario).

I'm assuming that the original broken fs simply had a corrupt block bitmap, and
that the dd thing was just to simulate that corruption in a testing
environment?

> 3)	After corruption we used ‘fsstress’  - we got the similar problem
> as was coming the original logs. – We shared our analysis after this
> point for looping in the writepages part the free blocks mismatch.

Hm.  I tried it with 3.15-rc1 and didn't see any hangs.  Corrupt bitmaps shut
down allocations from the block group and the FS continues, as expected.

> 4)	We came across ‘Darrick’ patches(in which it also mentioned about
> how to corrupt to reproduce the problem) and applied on our
> environment. It solved the initial problem about the looping in
> writepages, but now we got hangs at other places.

There are hundreds of Darrick patches ... to which one are you referring? :)
(What was the subject line?)

> Using ‘tune2fs’ is not a viable solution in our case, we can only
> provide the solution via. the kernel changes. So, we made the changes
> as shared earlier.

Would it help if you could set errors=remount-ro in mke2fs?

--D
> So the question isn't how the file system got corrupted, but that
> you'd prefer that the system recovers without hanging after this
> corruption.
> >> Yes,  our priority is to keep the system running.
> 
> Again, Sorry for the confusion. But the intention was just to show the
> original problem and what we did in order to replicate the problem.
> 
> Thanks & Regards,
> Amit Sahrawat
> 
> 
> On Wed, Apr 16, 2014 at 10:37 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> > On Wed, Apr 16, 2014 at 10:30:10AM +0530, Amit Sahrawat wrote:
> >> 4)    Corrupt the block group ‘1’  by writing all ‘1’, we had one file
> >> with all 1’s, so using ‘dd’ –
> >> dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1
> >> After this mount the partition – create few random size files and then
> >> ran ‘fsstress,
> >
> > Um, sigh.  You didn't say that you were deliberately corrupting the
> > file system.  That wasn't in the subject line, or anywhere else in the
> > original message.
> >
> > So the question isn't how the file system got corrupted, but that
> > you'd prefer that the system recovers without hanging after this
> > corruption.
> >
> > I wish you had *said* that.  It would have saved me a lot of time,
> > since I was trying to figure out how the system had gotten so
> > corrupted (not realizing you had deliberately corrupted the file
> > system).
> >
> > So I think if you run "tune2fs -e remount-ro /dev/sdb1" before you
> > started the fsstress, the file system would have remounted the
> > filesystem read-only at the first EXT4-fs error message.  This would
> > avoid the hang that you saw, since the file system would hopefully
> > "failed fast", before th euser had the opportunity to put data into
> > the page cache that would be lost when the system discovered there was
> > no place to put the data.
> >
> > Regards,
> >
> >                                                 - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amit Sahrawat - April 22, 2014, 5:49 a.m.
Hi Darrick,

Thanks for the reply, sorry for responding late.

On Wed, Apr 16, 2014 at 11:16 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Wed, Apr 16, 2014 at 01:21:34PM +0530, Amit Sahrawat wrote:
>> Sorry Ted, if it caused the confusion.
>>
>> There were actually 2 parts to the problem, the logs in the first mail
>> were from the original situation – where in there were many block
>> groups and error prints also showed that.
>>
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1493, 0
>> clusters in bitmap, 58339 in gd
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1000, 0
>> clusters in bitmap, 3 in gd
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1425, 0
>> clusters in bitmap, 1 in gd
>> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>>
>> 1)    Original case – when the disk got corrupted and we only had the
>> logs and the hung task messages. But not the HDD on which issue was
>> observed.
>> 2)    In order to reproduce the problem as was coming through the logs
>> (which highlighted the problem in the bitmap corruption). To minimize
>> the environment and make a proper case, we created a smaller partition
>> size and with only 2 groups. And intentionally corrupted the group 1
>> (our intention was just to replicate the error scenario).
>
> I'm assuming that the original broken fs simply had a corrupt block bitmap, and
> that the dd thing was just to simulate that corruption in a testing
> environment?

Yes, we did so in order to replicate the error scenario.
>
>> 3)    After corruption we used ‘fsstress’  - we got the similar problem
>> as was coming the original logs. – We shared our analysis after this
>> point for looping in the writepages part the free blocks mismatch.
>
> Hm.  I tried it with 3.15-rc1 and didn't see any hangs.  Corrupt bitmaps shut
> down allocations from the block group and the FS continues, as expected.
>
We are using kernel version 3.8, so cannot switch to 3.15-rc1. It is a
limitation currently.

>> 4)    We came across ‘Darrick’ patches(in which it also mentioned about
>> how to corrupt to reproduce the problem) and applied on our
>> environment. It solved the initial problem about the looping in
>> writepages, but now we got hangs at other places.
>
> There are hundreds of Darrick patches ... to which one are you referring? :)
> (What was the subject line?)
>
ext4: error out if verifying the block bitmap fails
ext4: fix type declaration of ext4_validate_block_bitmap
ext4: mark block group as corrupt on block bitmap error
ext4: mark block group as corrupt on inode bitmap error
ext4: mark group corrupt on group descriptor checksum
ext4: don't count free clusters from a corrupt block group

So, the patches helps in marking the block group as corrupt and avoids
further allocation. But when we consider the normal write path using
write_begin. Since, there is mismatch between the free cluster count
from the group descriptor and the bitmap. In that case it marks the
pages dirty by copying dirty but later it get ENOSPC from the
writepages when it actually does the allocation.

So, our doubt is if we are marking the block group as corrupt, we
should also subtract the block group count from the
s_freeclusters_counter. This will make sure we have the valid
freecluster count and error ‘ENOSPC’ can be returned from the
write_begin, instead of propagating such paths till the writepages.

We made change like this:

@@ -737,14 +737,18 @@ void ext4_mb_generate_buddy(struct super_block *sb,
        grp->bb_fragments = fragments;

        if (free != grp->bb_free) {
+               struct ext4_sb_info *sbi = EXT4_SB(sb);
                ext4_grp_locked_error(sb, group, 0, 0,
                                      "%u clusters in bitmap, %u in gd; "
                                      "block bitmap corrupt.",
                                      free, grp->bb_free);
                /*
                 * If we intend to continue, we consider group descriptor
                 * corrupt and update bb_free using bitmap value
                 */
+               percpu_counter_sub(&sbi->s_freeclusters_counter, grp->bb_free);
                grp->bb_free = free;
                set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
        }
        mb_set_largest_free_order(sb, grp);

Is this the correct method? Or are missing something in this? Please
share your opinion.


>> Using ‘tune2fs’ is not a viable solution in our case, we can only
>> provide the solution via. the kernel changes. So, we made the changes
>> as shared earlier.
>
> Would it help if you could set errors=remount-ro in mke2fs?
>
Sorry, we cannot reformat or use tune2fs to change the ‘errors’ value.

> --D
>> So the question isn't how the file system got corrupted, but that
>> you'd prefer that the system recovers without hanging after this
>> corruption.
>> >> Yes,  our priority is to keep the system running.
>>
>> Again, Sorry for the confusion. But the intention was just to show the
>> original problem and what we did in order to replicate the problem.
>>
>> Thanks & Regards,
>> Amit Sahrawat
>>
>>
>> On Wed, Apr 16, 2014 at 10:37 AM, Theodore Ts'o <tytso@mit.edu> wrote:
>> > On Wed, Apr 16, 2014 at 10:30:10AM +0530, Amit Sahrawat wrote:
>> >> 4)    Corrupt the block group ‘1’  by writing all ‘1’, we had one file
>> >> with all 1’s, so using ‘dd’ –
>> >> dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1
>> >> After this mount the partition – create few random size files and then
>> >> ran ‘fsstress,
>> >
>> > Um, sigh.  You didn't say that you were deliberately corrupting the
>> > file system.  That wasn't in the subject line, or anywhere else in the
>> > original message.
>> >
>> > So the question isn't how the file system got corrupted, but that
>> > you'd prefer that the system recovers without hanging after this
>> > corruption.
>> >
>> > I wish you had *said* that.  It would have saved me a lot of time,
>> > since I was trying to figure out how the system had gotten so
>> > corrupted (not realizing you had deliberately corrupted the file
>> > system).
>> >
>> > So I think if you run "tune2fs -e remount-ro /dev/sdb1" before you
>> > started the fsstress, the file system would have remounted the
>> > filesystem read-only at the first EXT4-fs error message.  This would
>> > avoid the hang that you saw, since the file system would hopefully
>> > "failed fast", before th euser had the opportunity to put data into
>> > the page cache that would be lost when the system discovered there was
>> > no place to put the data.
>> >
>> > Regards,
>> >
>> >                                                 - Ted
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong - April 22, 2014, 7:25 p.m.
On Tue, Apr 22, 2014 at 11:19:32AM +0530, Amit Sahrawat wrote:
> Hi Darrick,
> 
> Thanks for the reply, sorry for responding late.
> 
> On Wed, Apr 16, 2014 at 11:16 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Wed, Apr 16, 2014 at 01:21:34PM +0530, Amit Sahrawat wrote:
> >> Sorry Ted, if it caused the confusion.
> >>
> >> There were actually 2 parts to the problem, the logs in the first mail
> >> were from the original situation – where in there were many block
> >> groups and error prints also showed that.
> >>
> >> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1493, 0
> >> clusters in bitmap, 58339 in gd
> >> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1000, 0
> >> clusters in bitmap, 3 in gd
> >> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1425, 0
> >> clusters in bitmap, 1 in gd
> >> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
> >> a risk of filesystem corruption in case of system crash.
> >> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
> >> a risk of filesystem corruption in case of system crash.
> >>
> >> 1)    Original case – when the disk got corrupted and we only had the
> >> logs and the hung task messages. But not the HDD on which issue was
> >> observed.
> >> 2)    In order to reproduce the problem as was coming through the logs
> >> (which highlighted the problem in the bitmap corruption). To minimize
> >> the environment and make a proper case, we created a smaller partition
> >> size and with only 2 groups. And intentionally corrupted the group 1
> >> (our intention was just to replicate the error scenario).
> >
> > I'm assuming that the original broken fs simply had a corrupt block bitmap, and
> > that the dd thing was just to simulate that corruption in a testing
> > environment?
> 
> Yes, we did so in order to replicate the error scenario.
> >
> >> 3)    After corruption we used ‘fsstress’  - we got the similar problem
> >> as was coming the original logs. – We shared our analysis after this
> >> point for looping in the writepages part the free blocks mismatch.
> >
> > Hm.  I tried it with 3.15-rc1 and didn't see any hangs.  Corrupt bitmaps shut
> > down allocations from the block group and the FS continues, as expected.
> >
> We are using kernel version 3.8, so cannot switch to 3.15-rc1. It is a
> limitation currently.
> 
> >> 4)    We came across ‘Darrick’ patches(in which it also mentioned about
> >> how to corrupt to reproduce the problem) and applied on our
> >> environment. It solved the initial problem about the looping in
> >> writepages, but now we got hangs at other places.
> >
> > There are hundreds of Darrick patches ... to which one are you referring? :)
> > (What was the subject line?)
> >
> ext4: error out if verifying the block bitmap fails
> ext4: fix type declaration of ext4_validate_block_bitmap
> ext4: mark block group as corrupt on block bitmap error
> ext4: mark block group as corrupt on inode bitmap error
> ext4: mark group corrupt on group descriptor checksum
> ext4: don't count free clusters from a corrupt block group

Ok, thank you for clarifying. :)

> So, the patches helps in marking the block group as corrupt and avoids
> further allocation. But when we consider the normal write path using
> write_begin. Since, there is mismatch between the free cluster count
> from the group descriptor and the bitmap. In that case it marks the
> pages dirty by copying dirty but later it get ENOSPC from the
> writepages when it actually does the allocation.
> 
> So, our doubt is if we are marking the block group as corrupt, we
> should also subtract the block group count from the
> s_freeclusters_counter. This will make sure we have the valid
> freecluster count and error ‘ENOSPC’ can be returned from the
> write_begin, instead of propagating such paths till the writepages.
> 
> We made change like this:
> 
> @@ -737,14 +737,18 @@ void ext4_mb_generate_buddy(struct super_block *sb,
>         grp->bb_fragments = fragments;
> 
>         if (free != grp->bb_free) {
> +               struct ext4_sb_info *sbi = EXT4_SB(sb);
>                 ext4_grp_locked_error(sb, group, 0, 0,
>                                       "%u clusters in bitmap, %u in gd; "
>                                       "block bitmap corrupt.",
>                                       free, grp->bb_free);
>                 /*
>                  * If we intend to continue, we consider group descriptor
>                  * corrupt and update bb_free using bitmap value
>                  */
> +               percpu_counter_sub(&sbi->s_freeclusters_counter, grp->bb_free);
>                 grp->bb_free = free;
>                 set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
>         }
>         mb_set_largest_free_order(sb, grp);
> 
> Is this the correct method? Or are missing something in this? Please
> share your opinion.

I think this looks ok.  If you send a proper patch doing this to the mailing
list, I'll officially review it.

> >> Using ‘tune2fs’ is not a viable solution in our case, we can only
> >> provide the solution via. the kernel changes. So, we made the changes
> >> as shared earlier.
> >
> > Would it help if you could set errors=remount-ro in mke2fs?
> >
> Sorry, we cannot reformat or use tune2fs to change the ‘errors’ value.

I apologize, my question was unclear; what I meant to ask is, would it have
been helpful if you could have set errors=remount-ro back when you ran mke2fs?

Now that the format's been done, I suppose the only recourse is mount -o
remount,errors=remount-ro (online) or tune2fs (offline).

--D
> 
> > --D
> >> So the question isn't how the file system got corrupted, but that
> >> you'd prefer that the system recovers without hanging after this
> >> corruption.
> >> >> Yes,  our priority is to keep the system running.
> >>
> >> Again, Sorry for the confusion. But the intention was just to show the
> >> original problem and what we did in order to replicate the problem.
> >>
> >> Thanks & Regards,
> >> Amit Sahrawat
> >>
> >>
> >> On Wed, Apr 16, 2014 at 10:37 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> >> > On Wed, Apr 16, 2014 at 10:30:10AM +0530, Amit Sahrawat wrote:
> >> >> 4)    Corrupt the block group ‘1’  by writing all ‘1’, we had one file
> >> >> with all 1’s, so using ‘dd’ –
> >> >> dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1
> >> >> After this mount the partition – create few random size files and then
> >> >> ran ‘fsstress,
> >> >
> >> > Um, sigh.  You didn't say that you were deliberately corrupting the
> >> > file system.  That wasn't in the subject line, or anywhere else in the
> >> > original message.
> >> >
> >> > So the question isn't how the file system got corrupted, but that
> >> > you'd prefer that the system recovers without hanging after this
> >> > corruption.
> >> >
> >> > I wish you had *said* that.  It would have saved me a lot of time,
> >> > since I was trying to figure out how the system had gotten so
> >> > corrupted (not realizing you had deliberately corrupted the file
> >> > system).
> >> >
> >> > So I think if you run "tune2fs -e remount-ro /dev/sdb1" before you
> >> > started the fsstress, the file system would have remounted the
> >> > filesystem read-only at the first EXT4-fs error message.  This would
> >> > avoid the hang that you saw, since the file system would hopefully
> >> > "failed fast", before th euser had the opportunity to put data into
> >> > the page cache that would be lost when the system discovered there was
> >> > no place to put the data.
> >> >
> >> > Regards,
> >> >
> >> >                                                 - Ted
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c0fbd96..04f3a66 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1641,8 +1641,10 @@  static void mpage_da_map_and_submit(struct
mpage_da_data *mpd)
                                 mpd->b_size >> mpd->inode->i_blkbits, err);
                        ext4_msg(sb, KERN_CRIT,
                                "This should not happen!! Data will be lost\n");
-                       if (err == -ENOSPC)
+                       if (err == -ENOSPC) {
                                ext4_print_free_blocks(mpd->inode);
+                               mpd->inode->i_sb->s_flags |= MS_RDONLY;
+                       }
                }
                /* invalidate all the pages */