Message ID: 20171007032917.bntgnubthdstmrrt@thunk.org
State: Accepted, archived
Series: ext4: fix interaction between i_size, fallocate, and delalloc after a crash
On Sat, Oct 7, 2017 at 6:29 AM, Theodore Ts'o <tytso@mit.edu> wrote: > On Thu, Oct 05, 2017 at 07:34:10PM -0500, Ashlie Martinez wrote: >> >> It almost seems that way, though to be honest, I don't think I know >> >> enough about 1. the setup used by Amir and 2. all the internal workings >> >> of KVM+virtio to say for sure. >> > >> > I believe you misread my email. >> > The problem was NOT reproduced on KVM+virtio. >> > The problem *is* reproduced on a 10G LVM volume over SSD on >> > Ubuntu 16.04 with latest kernel and latest e2fsprogs. > > I was able to reproduce it using both kvm-xfstests[1] and gce-xfstests[2]. > > [1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md > [2] https://thunk.org/gce-xfstests > > It did turn out to be timing related, and it requires that a journal > commit take place after fsstress runs, but it can *not* be triggered > by a sync/fsync (as this would cause the delayed allocation writes to > be forced out to disk, and that makes the problem go away). > > I initially tried using xfs_io as a replacement for fsstress (since it > is more flexible and would allow me to more easily run experiments), > but it turns out xfs_io was too fast/efficient, and so using xfs_io to > execute the same system calls (verified by strace) would not replicate > the problem. > >> > Now you have a broken file system image and the exact set of operations >> > that led to it. That should be a pretty big lead for investigation. > > It was indeed a big lead for investigation (thanks, Amir!), but it > still took me several hours before I was finally able to figure out > the problem. The patch and the commit description should explain what > was going on. > > I'll leave it to Ashlie and Vijay to investigate how to improve Crash > Monkey so it can better find problems like this automatically.
Since > you now have a clear reproducer (you can use generic/456 and run it on > gce-xfstests, using a standard cloud VM configuration) and an > explanation of the bug and the four-line fix, I suspect this might be > good grist for follow-on research after your Hot Storage '17 workshop > paper. :-) > > Best regards, > > - Ted > > > commit 3912e7b44cf77e9452d4d0cb6c1da9c7043bb7f1 > Author: Theodore Ts'o <tytso@mit.edu> > Date: Fri Oct 6 23:09:55 2017 -0400 > > ext4: fix interaction between i_size, fallocate, and delalloc after a crash > > If there are pending writes subject to delayed allocation, then i_size > will show size after the writes have completed, while i_disksize > contains the value of i_size on the disk (since the writes have not > been persisted to disk). > > If fallocate(2) is called with the FALLOC_FL_KEEP_SIZE flag, either According to the patch, without the keep size flag? BTW, looking at the patch, does a truncate racing with a fallocate that would increase the size beyond the truncated size work correctly? Didn't check the code, just wondering. Thanks! > with or without the FALLOC_FL_ZERO_RANGE flag set, and the new size > after the fallocate(2) is between i_size and i_disksize, then after a > crash, if a journal commit has resulted in the changes made by the > fallocate() call to be persisted after a crash, but the delayed > allocation write has not resolved itself, i_size would not be updated, > and this would cause the following e2fsck complaint: > > Inode 12, end of extent exceeds allowed value > (logical block 33, physical block 33441, len 7) > > This can only take place on a sparse file, where the fallocate(2) call > is allocating blocks in a range which is before a pending delayed > allocation write which is extending i_size. Since this situation is > quite rare, and the window in which the crash must take place is > typically < 30 seconds, in practice this condition will rarely happen.
> > Nevertheless, it can be triggered in testing, and in particular by > xfstests generic/456. > > Signed-off-by: Theodore Ts'o <tytso@mit.edu> > Reported-by: Amir Goldstein <amir73il@gmail.com> > Cc: stable@vger.kernel.org > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > index 97f0fd06728d..07bca11749d4 100644 > --- a/fs/ext4/extents.c > +++ b/fs/ext4/extents.c > @@ -4794,7 +4794,8 @@ static long ext4_zero_range(struct file *file, loff_t offset, > } > > if (!(mode & FALLOC_FL_KEEP_SIZE) && > - offset + len > i_size_read(inode)) { > + (offset + len > i_size_read(inode) || > + offset + len > EXT4_I(inode)->i_disksize)) { > new_size = offset + len; > ret = inode_newsize_ok(inode, new_size); > if (ret) > @@ -4965,7 +4966,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) > } > > if (!(mode & FALLOC_FL_KEEP_SIZE) && > - offset + len > i_size_read(inode)) { > + (offset + len > i_size_read(inode) || > + offset + len > EXT4_I(inode)->i_disksize)) { > new_size = offset + len; > ret = inode_newsize_ok(inode, new_size); > if (ret)
On Sat, Oct 07, 2017 at 08:54:06AM +0300, Amir Goldstein wrote: > > If fallocate(2) is called with the FALLOC_FL_KEEP_SIZE flag, either > > According to the patch, without the keep size flag? Oops, thanks for catching that. I'll fix the commit description. > BTW, looking at the patch, does a truncate racing with a fallocate that > would increase the size beyond the truncated size work correctly? Yes, because the inode is locked via i_mutex (well, these days a write lock on i_rwsem) during both truncate and fallocate operations. - Ted
Ted, Thanks for looking into this more. The information about what was causing the output Amir saw will certainly help us improve CrashMonkey so we can detect bugs better in the future! Best, Ashlie On Fri, Oct 6, 2017 at 10:29 PM, Theodore Ts'o <tytso@mit.edu> wrote: > On Thu, Oct 05, 2017 at 07:34:10PM -0500, Ashlie Martinez wrote: >> >> It almost seems that way, though to be honest, I don't think I know >> >> enough about 1. the setup used by Amir and 2. all the internal working >> >> of KVM+virtio to say for sure. >> > >> > I believe you misread my email. >> > The problem was NOT reproduced on KVM+virtio. >> > The problem *is* reproduced on a 10G LVM volume over SSD on >> > Ubuntu 16.04 with latest kernel and latest e2fsprogs. > > I was able to reproduce it using both kvm-xfstests[1] and gce-xfstests[2]. > > [1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md > [2] https://thunk.org/gce-xfstests > > It did turn out to be timing related, and it requires that a journal > commit take place after fsstress runs, but it can *not* be triggered > by a sync/fsync (as this would cause the delayed allocation writes to > be forced out to disk, and that makes the problem go away). > > I initially tried using xfs_io as a replacement for fsstress (since it > is more flexible and would allow me to more easily run experiments), > but it turns out xfs_io was too fast/efficient, and so using xfs_io to > execute the same system calls (verified by strace) would not replicate > the problem. > >> > Now you have a broken file system image and the exact set of operations >> > that led to it. That's should be a pretty big lead for investigation. > > It was indeed a big lead for investigation (thanks, Amir!), but it > still took me several hours before I was finally able to figure out > the problem. The patch and the commit description should explain what > was going on. 
On 2017/10/07 11:29, Theodore Ts'o wrote: > On Thu, Oct 05, 2017 at 07:34:10PM -0500, Ashlie Martinez wrote: >>>> It almost seems that way, though to be honest, I don't think I know >>>> enough about 1. the setup used by Amir and 2. all the internal working >>>> of KVM+virtio to say for sure. >>> I believe you misread my email. >>> The problem was NOT reproduced on KVM+virtio. >>> The problem *is* reproduced on a 10G LVM volume over SSD on >>> Ubuntu 16.04 with latest kernel and latest e2fsprogs. > I was able to reproduce it using both kvm-xfstests[1] and gce-xfstests[2]. > > [1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md > [2] https://thunk.org/gce-xfstests > > It did turn out to be timing related, and it requires that a journal > commit take place after fsstress runs, but it can *not* be triggered > by a sync/fsync (as this would cause the delayed allocation writes to > be forced out to disk, and that makes the problem go away). > > I initially tried using xfs_io as a replacement for fsstress (since it > is more flexible and would allow me to more easily run experiments), > but it turns out xfs_io was too fast/efficient, and so using xfs_io to > execute the same system calls (verified by strace) would not replicate > the problem. > >>> Now you have a broken file system image and the exact set of operations >>> that led to it. That's should be a pretty big lead for investigation. > It was indeed a big lead for investigation (thanks, Amir!), but it > still took me several hours before I was finally able to figure out > the problem. The patch and the commit description should explain what > was going on. > > I'll leave it to Ashlie and Vijay to investigate how to improve Crash > Monkey so it can better find problems like this automatically. 
Since > you now have a clear reproducer (you can use generic/456 and run it on > gce-xfstests, using a standard cloud VM configuration) and an > explanation of the bug and the four-line fix, I suspect this might be > good grist for follow-on research after your Hot Storage '17 workshop > paper. :-) > > Best regards, > > - Ted > > > commit 3912e7b44cf77e9452d4d0cb6c1da9c7043bb7f1 > Author: Theodore Ts'o <tytso@mit.edu> > Date: Fri Oct 6 23:09:55 2017 -0400 > > ext4: fix interaction between i_size, fallocate, and delalloc after a crash Hi Theodore, After applying your patch, generic/456 always passes on my system, which only ever triggers output [2]. So I could believe that these two different outputs [1] and [2] are triggered by different environments, but are caused by the same bug which your patch fixes. Is this right? [1] Inode 12, end of extent exceeds allowed value (logical block 33, physical block 33441, len 7) Clear? no Inode 12, i_blocks is 184, should be 128. Fix? no [2] Inode 12, i_size is 147456, should be 163840. Fix? no Sorry, I am not familiar with ext4. Thanks, Xiao Yang
On Wed, Oct 11, 2017 at 6:11 AM, Xiao Yang <yangx.jy@cn.fujitsu.com> wrote: > On 2017/10/07 11:29, Theodore Ts'o wrote: >> >> On Thu, Oct 05, 2017 at 07:34:10PM -0500, Ashlie Martinez wrote: >>>>> >>>>> It almost seems that way, though to be honest, I don't think I know >>>>> enough about 1. the setup used by Amir and 2. all the internal working >>>>> of KVM+virtio to say for sure. >>>> >>>> I believe you misread my email. >>>> The problem was NOT reproduced on KVM+virtio. >>>> The problem *is* reproduced on a 10G LVM volume over SSD on >>>> Ubuntu 16.04 with latest kernel and latest e2fsprogs. >> >> I was able to reproduce it using both kvm-xfstests[1] and gce-xfstests[2]. >> >> [1] >> https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md >> [2] https://thunk.org/gce-xfstests >> >> It did turn out to be timing related, and it requires that a journal >> commit take place after fsstress runs, but it can *not* be triggered >> by a sync/fsync (as this would cause the delayed allocation writes to >> be forced out to disk, and that makes the problem go away). Amir, I was actually looking into why only some machines can reproduce the output from generic/456 and I found that I was a little unsure about exactly how the test itself was operating. I'm not an expert on how device mapper works, so I was wondering if you could help shed some light on it. From what I've surmised from the test case, it replays the workload you specified with fsx and then immediately inserts the dm_flakey table and device in place of the hard disk *while requests are still in flight* (see nolockfs flag in dmsetup which is used by dm_flakey calls). The flakey device drops all requests sent to it. The flakey device is then unmounted (to force all pending writes out), the dm table changed once again back to the original backing block device, and then mounted again. Is this summary of the test case correct? 
If it is, then perhaps the reason why it is so hard for some people to reproduce the exact bug output you saw is that there does not appear to be anything in the test case that specifies the timing for when dm_flakey should begin dropping requests. Since the flakey device could be inserted any time after fsx completes, while writes are still pending, it seems like the test creates a race condition between inserting dm_flakey and a kernel thread waking up to persist writes. This race condition would likely be affected by things like having > 1 core on the VM or hardware the test is running on (so that dm_flakey could be inserted at the same time a kernel thread on another core is trying to persist writes), system load, dirty writeback times, etc. Do you have any thoughts on the above?
On Wed, Oct 11, 2017 at 4:17 PM, Ashlie Martinez <ashmrtn@utexas.edu> wrote: > On Wed, Oct 11, 2017 at 6:11 AM, Xiao Yang <yangx.jy@cn.fujitsu.com> wrote: >> On 2017/10/07 11:29, Theodore Ts'o wrote: >>> >>> On Thu, Oct 05, 2017 at 07:34:10PM -0500, Ashlie Martinez wrote: >>>>>> >>>>>> It almost seems that way, though to be honest, I don't think I know >>>>>> enough about 1. the setup used by Amir and 2. all the internal working >>>>>> of KVM+virtio to say for sure. >>>>> >>>>> I believe you misread my email. >>>>> The problem was NOT reproduced on KVM+virtio. >>>>> The problem *is* reproduced on a 10G LVM volume over SSD on >>>>> Ubuntu 16.04 with latest kernel and latest e2fsprogs. >>> >>> I was able to reproduce it using both kvm-xfstests[1] and gce-xfstests[2]. >>> >>> [1] >>> https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md >>> [2] https://thunk.org/gce-xfstests >>> >>> It did turn out to be timing related, and it requires that a journal >>> commit take place after fsstress runs, but it can *not* be triggered >>> by a sync/fsync (as this would cause the delayed allocation writes to >>> be forced out to disk, and that makes the problem go away). > > Amir, > > I was actually looking into why only some machines can reproduce the > output from generic/456 and I found that I was a little unsure about > exactly how the test itself was operating. I'm not an expert on how > device mapper works, so I was wondering if you could help shed some > light on it. > > From what I've surmised from the test case, it replays the workload > you specified with fsx and then immediately inserts the dm_flakey > table and device in place of the hard disk *while requests are still > in flight* (see nolockfs flag in dmsetup which is used by dm_flakey > calls). The flakey device drops all requests sent to it. 
The flakey > device is then unmounted (to force all pending writes out), the dm > table changed once again back to the original backing block device, > and then mounted again. > > Is this summary of the test case correct? If it is, then perhaps the > reason why it is so hard for some people to reproduce the exact bug > output you saw is because there does not appear to be anything in the > test case that specifies the timing for when dm_flakey should begin > dropping requests. Since the flakey device could be inserted any time > after fsx completes, but writes are still pending, it seems like the > test creates a race condition between inserting dm_flakey and a kernel > thread waking up to persist writes. This race condition would likely > be affected by things like having > 1 core on the VM or hardware (so > that dm_flakey could be inserted at the same time a kernel thread on > another core is trying to persist writes) the test is running on, > system load, dirty writeback times, etc. > > Do you have any thoughts on the above? Sounds about right. Giving the VM more cores was my first guess as to the reason. Note that all the bugs I have found so far depended on some timing between processes or kernel threads, so CrashMonkey trying all I/O completion combinations would not have helped. Maybe it would help to find other bugs, though. > >>> >>> I initially tried using xfs_io as a replacement for fsstress (since it >>> is more flexible and would allow me to more easily run experiments), >>> but it turns out xfs_io was too fast/efficient, and so using xfs_io to >>> execute the same system calls (verified by strace) would not replicate >>> the problem. >>> >>>>> Now you have a broken file system image and the exact set of operations >>>>> that led to it. That should be a pretty big lead for investigation. >>> >>> It was indeed a big lead for investigation (thanks, Amir!), but it >>> still took me several hours before I was finally able to figure out >>> the problem.
> commit 3912e7b44cf77e9452d4d0cb6c1da9c7043bb7f1 > Author: Theodore Ts'o <tytso@mit.edu> > Date: Fri Oct 6 23:09:55 2017 -0400 > > ext4: fix interaction between i_size, fallocate, and delalloc after a crash > > If there are pending writes subject to delayed allocation, then i_size > will show size after the writes have completed, while i_disksize > contains the value of i_size on the disk (since the writes have not > been persisted to disk). > > If fallocate(2) is called with the FALLOC_FL_KEEP_SIZE flag, either > with or without the FALLOC_FL_ZERO_RANGE flag set, and the new size > after the fallocate(2) is between i_size and i_disksize, then after a > crash, if a journal commit has resulted in the changes made by the > fallocate() call to be persisted after a crash, but the delayed > allocation write has not resolved itself, i_size would not be updated, > and this would cause the following e2fsck complaint: > > Inode 12, end of extent exceeds allowed value > (logical block 33, physical block 33441, len 7) > > This can only take place on a sparse file, where the fallocate(2) call > is allocating blocks in a range which is before a pending delayed > allocation write which is extending i_size. Since this situation is > quite rare, and the window in which the crash must take place is > typically < 30 seconds, in practice this condition will rarely happen. > > Nevertheless, it can be triggered in testing, and in particular by > xfstests generic/456. > > Signed-off-by: Theodore Ts'o <tytso@mit.edu> > Reported-by: Amir Goldstein <amir73il@gmail.com> > Cc: stable@vger.kernel.org The patch looks good to me. 
You can add: Reviewed-by: Jan Kara <jack@suse.cz> Honza
Amir, I know this is a bit late, but I've spent some time working through the disk image that you provided (so that I could determine how/if I could modify CrashMonkey to catch errors like this) and I don't think I understand what state the disk image reflects. After digging around the journal of the disk image you provided, I found that the first 10 journal blocks are used, with the journal superblock being placed in the very first block of the journal. The journal superblock says that the first journal transaction ID that should be in the journal is transaction ID 4. However, dumping the other journal blocks, I found that the next block is a descriptor block for transaction ID 2. The rest of the journal blocks are data blocks for that transaction plus a transaction commit block. This seems a little odd considering that the journal refers to a 4th transaction, which I have not been able to find (I quickly dumped the first 50 blocks in debugfs and found the rest to contain only zeros). With this in mind, I looked back at the xfstests code for controlling the dm_flakey device. What I realized is the `nolockfs` flag is provided both when it switches from the real device to the flakey device that drops writes and when it switches from the flakey device back to the real device. I know there is a call to umount once the flakey device that drops writes is inserted, but do you think it is possible that the flakey device is swapped back to the real device before all the writes forced out by umount have made it to the flakey device? Unfortunately I still don't have a local machine that is capable of reproducing your test results and I have not made any gce test appliance images to test this yet, so I'm not sure if this is a valid theory. 
On Wed, Oct 11, 2017 at 8:34 AM, Amir Goldstein <amir73il@gmail.com> wrote: > > On Wed, Oct 11, 2017 at 4:17 PM, Ashlie Martinez <ashmrtn@utexas.edu> wrote: > > On Wed, Oct 11, 2017 at 6:11 AM, Xiao Yang <yangx.jy@cn.fujitsu.com> wrote: > >> On 2017/10/07 11:29, Theodore Ts'o wrote: > >>> > >>> On Thu, Oct 05, 2017 at 07:34:10PM -0500, Ashlie Martinez wrote: > >>>>>> > >>>>>> It almost seems that way, though to be honest, I don't think I know > >>>>>> enough about 1. the setup used by Amir and 2. all the internal working > >>>>>> of KVM+virtio to say for sure. > >>>>> > >>>>> I believe you misread my email. > >>>>> The problem was NOT reproduced on KVM+virtio. > >>>>> The problem *is* reproduced on a 10G LVM volume over SSD on > >>>>> Ubuntu 16.04 with latest kernel and latest e2fsprogs. > >>> > >>> I was able to reproduce it using both kvm-xfstests[1] and gce-xfstests[2]. > >>> > >>> [1] > >>> https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md > >>> [2] https://thunk.org/gce-xfstests > >>> > >>> It did turn out to be timing related, and it requires that a journal > >>> commit take place after fsstress runs, but it can *not* be triggered > >>> by a sync/fsync (as this would cause the delayed allocation writes to > >>> be forced out to disk, and that makes the problem go away). > > > > Amir, > > > > I was actually looking into why only some machines can reproduce the > > output from generic/456 and I found that I was a little unsure about > > exactly how the test itself was operating. I'm not an expert on how > > device mapper works, so I was wondering if you could help shed some > > light on it. > > > > From what I've surmised from the test case, it replays the workload > > you specified with fsx and then immediately inserts the dm_flakey > > table and device in place of the hard disk *while requests are still > > in flight* (see nolockfs flag in dmsetup which is used by dm_flakey > > calls). 
The flakey device drops all requests sent to it. The flakey > > device is then unmounted (to force all pending writes out), the dm > > table changed once again back to the original backing block device, > > and then mounted again. > > > > Is this summary of the test case correct? If it is, then perhaps the > > reason why it is so hard for some people to reproduce the exact bug > > output you saw is because there does not appear to be anything in the > > test case that specifies the timing for when dm_flakey should begin > > dropping requests. Since the flakey device could be inserted any time > > after fsx completes, but writes are still pending, it seems like the > > test creates a race condition between inserting dm_flakey and a kernel > > thread waking up to persist writes. This race condition would likely > > be affected by things like having > 1 core on the VM or hardware (so > > that dm_flakey could be inserted at the same time a kernel thread on > > another core is trying to persist writes) the test is running on, > > system load, dirty writeback times, etc. > > > > Do you have any thoughts on the above? > > Sounds about right. > Having more cores to the VM was my first guess as the reason. > Note that all the bugs I found so far depended on some timing between > processes or kernel threads, so crash monkey trying all io completion > combinations would not have helped. Maybe it would help to find other > bugs though. > > > > >>> > >>> I initially tried using xfs_io as a replacement for fsstress (since it > >>> is more flexible and would allow me to more easily run experiments), > >>> but it turns out xfs_io was too fast/efficient, and so using xfs_io to > >>> execute the same system calls (verified by strace) would not replicate > >>> the problem. > >>> > >>>>> Now you have a broken file system image and the exact set of operations > >>>>> that led to it. That's should be a pretty big lead for investigation. 
> >>> > >>> It was indeed a big lead for investigation (thanks, Amir!), but it > >>> still took me several hours before I was finally able to figure out > >>> the problem. The patch and the commit description should explain what > >>> was going on. > >>> > >>> I'll leave it to Ashlie and Vijay to investigate how to improve Crash > >>> Monkey so it can better find problems like this automatically. Since > >>> you now have a clear reproducer (you can use generic/456 and run it on > >>> gce-xfstests, using is a standard cloud VM configuration) and an > >>> explanation of the bug and the four-line fix, I suspect this might be > >>> good grist for follow-on research after your Hot Storage '17 workshop > >>> paper. :-) > >>> > >>> Best regards, > >>> > >>> - Ted > >>> > >>> > >>> commit 3912e7b44cf77e9452d4d0cb6c1da9c7043bb7f1 > >>> Author: Theodore Ts'o<tytso@mit.edu> > >>> Date: Fri Oct 6 23:09:55 2017 -0400 > >>> > >>> ext4: fix interaction between i_size, fallocate, and delalloc after a > >>> crash > >> > >> Hi Theodore, > >> > >> After applying your patch, generic/456 always passes on my system which just > >> triggers the output[2]. > >> So i could believe this two different outputs[1][2] are triggered by > >> different environments, but they > >> are caused by the same bug which your patch fixes. Is this right? > >> > >> [1] Inode 12, end of extent exceeds allowed value(logical block 33, physical > >> block 33441, len 7)Clear? no > >> Inode 12, i_blocks is 184, should be 128. Fix? no > >> [2] Inode 12, i_size is 147456, should be 163840. Fix? no > >> > >> Sorry, i am not familiar with ext4. > >> > >> Thanks, > >> Xiao Yang > >>> > >>> > >>> If there are pending writes subject to delayed allocation, then > >>> i_size > >>> will show size after the writes have completed, while i_disksize > >>> contains the value of i_size on the disk (since the writes have not > >>> been persisted to disk). 
> >>> > >>> If fallocate(2) is called with the FALLOC_FL_KEEP_SIZE flag, either > >>> with or without the FALLOC_FL_ZERO_RANGE flag set, and the new size > >>> after the fallocate(2) is between i_size and i_disksize, then after a > >>> crash, if a journal commit has resulted in the changes made by the > >>> fallocate() call to be persisted after a crash, but the delayed > >>> allocation write has not resolved itself, i_size would not be > >>> updated, > >>> and this would cause the following e2fsck complaint: > >>> > >>> Inode 12, end of extent exceeds allowed value > >>> (logical block 33, physical block 33441, len 7) > >>> > >>> This can only take place on a sparse file, where the fallocate(2) > >>> call > >>> is allocating blocks in a range which is before a pending delayed > >>> allocation write which is extending i_size. Since this situation is > >>> quite rare, and the window in which the crash must take place is > >>> typically< 30 seconds, in practice this condition will rarely > >>> happen. > >>> > >>> Nevertheless, it can be triggered in testing, and in particular by > >>> xfstests generic/456. 
> >>> > >>> Signed-off-by: Theodore Ts'o<tytso@mit.edu> > >>> Reported-by: Amir Goldstein<amir73il@gmail.com> > >>> Cc: stable@vger.kernel.org > >>> > >>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > >>> index 97f0fd06728d..07bca11749d4 100644 > >>> --- a/fs/ext4/extents.c > >>> +++ b/fs/ext4/extents.c > >>> @@ -4794,7 +4794,8 @@ static long ext4_zero_range(struct file *file, > >>> loff_t offset, > >>> } > >>> > >>> if (!(mode& FALLOC_FL_KEEP_SIZE)&& > >>> - offset + len> i_size_read(inode)) { > >>> + (offset + len> i_size_read(inode) || > >>> + offset + len> EXT4_I(inode)->i_disksize)) { > >>> new_size = offset + len; > >>> ret = inode_newsize_ok(inode, new_size); > >>> if (ret) > >>> @@ -4965,7 +4966,8 @@ long ext4_fallocate(struct file *file, int mode, > >>> loff_t offset, loff_t len) > >>> } > >>> > >>> if (!(mode& FALLOC_FL_KEEP_SIZE)&& > >>> - offset + len> i_size_read(inode)) { > >>> + (offset + len> i_size_read(inode) || > >>> + offset + len> EXT4_I(inode)->i_disksize)) { > >>> new_size = offset + len; > >>> ret = inode_newsize_ok(inode, new_size); > >>> if (ret) > >>> -- > >>> To unsubscribe from this list: send the line "unsubscribe fstests" in > >>> the body of a message to majordomo@vger.kernel.org > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html > >>> > >>> > >>> . > >>> > >> > >> > >>
On Mon, Oct 16, 2017 at 10:32 PM, Ashlie Martinez <ashmrtn@utexas.edu> wrote: > Amir, > > I know this is a bit late, but I've spent some time working through > the disk image that you provided (so that I could determine how/if I > could modify CrashMonkey to catch errors like this) and I don't think > I understand what state the disk image reflects. The disk image SHOULD reflect a state on a disk after the power was cut in the middle of mounted fs. Then power came back on, filesystem was mounted, journal recovered, then filesystem was cleanly unmounted. At this stage, I don't expect there should be anything interesting in the journal. > After digging around > the journal of the disk image you provided, I found that the first 10 > journal blocks are used, with the journal superblock being placed in > the very first block of the journal. The journal superblock says that > the first journal transaction ID that should be in the journal is > transaction ID 4. However, dumping the other journal blocks, I found > that the next block is a descriptor block for transaction ID 2. The > rest of the journal blocks are data blocks for that transaction plus a > transaction commit block. This seems a little odd considering that the > journal refers to a 4th transaction, which I have not been able to > find (I quickly dumped the first 50 blocks in debugfs and found the > rest to contain only zeros). > I did not spend time analyzing the image, so I'll take your word for it, but I can't help you understand your findings. > With this in mind, I looked back at the xfstests code for controlling > the dm_flakey device. What I realized is the `nolockfs` flag is > provided both when it switches from the real device to the flakey > device that drops writes and when it switches from the flakey device > back to the real device. 
I know there is a call to umount once the > flakey device that drops writes is inserted, but do you think it is > possible that the flakey device is swapped back to the real device > before all the writes forced out by umount have made it to the flakey > device? I believe the umount call should be blocked until all writes have been flushed out to the flakey device. > Unfortunately I still don't have a local machine that is > capable of reproducing your test results and I have not made any gce > test appliance images to test this yet, so I'm not sure if this is a > valid theory. > Ted explained that the bug is related to very specific timing of the flusher thread vs. the fallocate thread. I was under the impression that CrashMonkey can only reorder writes between recorded FLUSH requests, so I am not really sure how you intend to modify CrashMonkey to catch this bug. Cheers, Amir.
On Tue, Oct 17, 2017 at 12:11:40AM +0300, Amir Goldstein wrote: > > The disk image SHOULD reflect a state on a disk after the power was > cut in the middle of mounted fs. Then power came back on, filesystem > was mounted, journal recovered, then filesystem was cleanly unmounted. > At this stage, I don't expect there should be anything interesting in the > journal. I suspect what Ashlie was hoping for was a file system image *before* the file system was remounted and the journal replayed (and then truncated). That would allow for an analysis of the image right after the simulated power cut, so it could be seen what was in the journal. The only way to get that is to modify the test so that it aborts before the file system is remounted. I did some investigations where I ran commands (such as debugfs -R "logdump -ac" /dev/vdc) before the file system was remounted to gather debugging information. That's how I tracked down the problem. Unfortunately I never bothered to grab a full file system snapshot, so I can't give Ashlie what she's hoping for. > I believe umount call should be blocked until all writes have been flushed > out to flakey device. That is correct. > Ted explained that the bug related to very specific timing of flusher > thread vs. fallocate thread. > I was under the impression that CrashMonkey can only reorder writes > between recorded FLUSH requests, so I am not really sure how you intent to > modify CrashMonkey to catch this bug. The real issue is that what CrashMonkey is testing is: given a test trace with N CACHE FLUSH operations, and a random X such that 1 <= X < N, if all of the writes before the Xth CACHE FLUSH are completed, and a random set of writes between the Xth and (X+1)th CACHE FLUSH are completed, is the file system still consistent after a journal replay? That's a fine thing to test, although you can probably do that more efficiently by simply looking at all of the metadata writes between the Xth and (X+1)th CACHE FLUSH.
Those writes must be effective no-ops after the journal is replayed up to the Xth cache flush. Which is to say, each write must either be to a data block, or its contents must match either the most recent journal entry for that block (up to the Xth cache flush) or the current state of the disk. So if you are willing to assume knowledge of what is stored in the journal and how ext4 works, it should be possible to implement CrashMonkey much more effectively. The problem that this bug exposed is a different sort of problem. To find this bug, given the I/O stream, you can simply examine the file system state after each journal commit (e.g., after each CACHE FLUSH) and just make sure the file system state is consistent. There is no need to include some random set of writes from the last commit epoch. The sort of searching of the test space a new CrashMonkey would have to do can't be done just by looking at the I/O traces. Instead, a workload consists of a series of micro-transactions (jbd2 handles) which are assigned to a set of journal transactions. Normally, which handles get assigned to a given transaction is based on timing (we close a transaction every 5 seconds), or based on the size of the transaction (we limit the number of blocks in a transaction), or based on file system operations --- e.g., an fsync() will cause a transaction to close. This new CrashMonkey would have to explore a different set of transaction boundaries (e.g., which handles are assigned to a transaction before a transaction closes), and whether the file system is consistent at each transaction boundary given each possible assignment of handles to transactions. It's doable, but it would have to be done by logging the data passed to the jbd2 logging layer, and checking file system consistency at each handle boundary. - Ted
Hi Ted, This was what we were suspecting as well. If the bug being exposed or not only depends on the number of transactions in the journal before the test workload starts, that’s something we could easily capture in CrashMonkey. It would be one more parameter to tweak, and we would potentially have to get multiple traces of the test workload with the journal being X percent full. It does expand our already-large search space, but our first order of business is making sure CrashMonkey can reproduce every crash-consistency bug reported in recent times (mostly by Amir :) ). So for now we were just analyzing the bug and trying to understand it, but it looks like the post-recovery image is not very useful for this. We do want CrashMonkey to remain file-system agnostic. (resending since earlier email failed) Thanks, Vijay On Mon, Oct 16, 2017 at 7:09 PM, Theodore Ts'o <tytso@mit.edu> wrote: > On Tue, Oct 17, 2017 at 12:11:40AM +0300, Amir Goldstein wrote: >> >> The disk image SHOULD reflect a state on a disk after the power was >> cut in the middle of mounted fs. Then power came back on, filesystem >> was mounted, journal recovered, then filesystem was cleanly unmounted. >> At this stage, I don't expect there should be anything interesting in the >> journal. > > I suspect what Ashlie was hoping for was a file system image *before* > the file system was remounted and the journal replayed (and then > truncated). That would allow for an analysis of image right after the > simulated power cut, so it could be seen what was in the journal. > > The only way to get that is to modify the test so that it aborts > before the file system is remounted. I did some investigations where > I ran commands (such as "debugfs -R "logdump -ac /dev/vdc") before the > file system was remounted to gather debugging information. That's how > I tracked down the problem. Unfortunately I never bothered to grab > full file system snapshot, so I can't give Ashlie what she's hoping > for. 
> >> I believe umount call should be blocked until all writes have been flushed >> out to flakey device. > > That is correct. > >> Ted explained that the bug related to very specific timing of flusher >> thread vs. fallocate thread. >> I was under the impression that CrashMonkey can only reorder writes >> between recorded FLUSH requests, so I am not really sure how you intent to >> modify CrashMonkey to catch this bug. > > The real issue is that what CrashMonkey is testing is given a test > trace with N CACHE FLUSH operations, given a random X such that: > > 1 <= X < N > > If of the writes before the Xth CACHE FLUSH are completed, and a > random set of writes between the Xth and (X+1)th CACHE FLUSH are > completed, is the file system still consistent after a journal replay. > > That's a fine thing to test, although you can probably do that more > efficiently by simply looking at all of the metadata writes between > the Xth and X+1th CACHE FLUSH. Those writes must be effective no-ops > after the journal is replayed up to the Xth cache flush. Which is to > say, the writes must either (a) be to a data block, or (b) the > contents of the writes must match either (a) the most recent journal > entry for that block (up to the Xth cache flush), or (b) the current > state of the disk. > > So if you are willing to assume knowledge of what is stored in the > journal and how ext4 works, it should be possible to implement > CrashMonkey much more effectively. > > > The problem that this bug exposed is different sort of problem. To > find this bug, given the I/O stream, you can simply examine the file > system state after each journal commit. (e.g., after each CACHE > FLUSH). And just make sure the file system state is consistent. > There is no need to include some random set of writes from the last > commit epoch. > > The sort of searching of the test space new CrashMonkey' would have to > test can't be done just by looking at the io traces. 
Instead, a > workload consists of a series of micro-transactions (jbd2 handles) > which are assigned to a set of journal transactions. Normally, which > handles get assigned to a given transaction is based on timing (we > close a transaction every 5 seconds), or based on the size of the > transaction (we limit the number of blocks in a transaction), or based > file system operations --- e.g., a fsync() will cause a transaction to > close. > > This CrashMoney' would have to explore a different set of transaction > boundaries (e.g., which handles are assigned to a transaction before a > transaction closes), and whether the file system is consistent at each > transaction boundary given a each possible assignment of handles to > transactions. > > It's doable, but it would have to be done by logging the data passed > to the jbd2 logging layer, and checking file system consistency at > each handle boundary. > > - Ted
On Tue, Oct 17, 2017 at 3:43 AM, Vijay Chidambaram <vvijay03@gmail.com> wrote: > Hi Ted, > > This was what we were suspecting as well. If the bug being exposed or not > only depends on the number of transactions in the journal before the test > workload starts, that’s something we could easily capture in CrashMonkey. It > would be one more parameter to tweak, and we would potentially have to get > multiple traces of the test workload with the journal being X percent full. > > It does expand our already-large search space, but our first order of > business is making sure CrashMonkey can reproduce every crash-consistency > bug reported in recent times (mostly by Amir :) ). So for now we were just > analyzing the bug and trying to understand it, but it looks like the > post-recovery image is not very useful for this. > > On Mon, Oct 16, 2017 at 7:09 PM Theodore Ts'o <tytso@mit.edu> wrote: >> >> On Tue, Oct 17, 2017 at 12:11:40AM +0300, Amir Goldstein wrote: >> > >> > The disk image SHOULD reflect a state on a disk after the power was >> > cut in the middle of mounted fs. Then power came back on, filesystem >> > was mounted, journal recovered, then filesystem was cleanly unmounted. >> > At this stage, I don't expect there should be anything interesting in >> > the >> > journal. >> >> I suspect what Ashlie was hoping for was a file system image *before* >> the file system was remounted and the journal replayed (and then >> truncated). That would allow for an analysis of image right after the >> simulated power cut, so it could be seen what was in the journal. >> >> The only way to get that is to modify the test so that it aborts >> before the file system is remounted. I did some investigations where >> I ran commands (such as "debugfs -R "logdump -ac /dev/vdc") before the >> file system was remounted to gather debugging information. That's how >> I tracked down the problem. 
Unfortunately I never bothered to grab
>> full file system snapshot, so I can't give Ashlie what she's hoping
>> for.

I can provide the desired file system snapshot after crash, but I will do more than that. I will provide an io recording of the test and will explain how to replay the recording to any given point before the crash.

To produce the recording, I copied this hunk from test 456:

fsxops=$tmp.fsxops
cat <<EOF > $fsxops
write 0x137dd 0xdc69 0x0
fallocate 0xb531 0xb5ad 0x21446
collapse_range 0x1c000 0x4000 0x21446
write 0x3e5ec 0x1a14 0x21446
zero_range 0x20fac 0x6d9c 0x40000 keep_size
mapwrite 0x216ad 0x274f 0x40000
EOF
run_check $here/ltp/fsx -d --replay-ops $fsxops $SCRATCH_MNT/testfile

to test 455 before the line NUM_FILES=4 and changed that line to NUM_FILES=0 to suppress the randomized fsx runs.

Now test 455 fails and leaves us a recording of all io on the $LOGWRITES_DEV device. As we can see in 455.full, the consistency check fails on the "pre umount" checkpoint, which test 455 has marked in the log device as mark "last":

checking pre umount ("_log_writes_replay_log last" in the test)
_check_generic_filesystem: filesystem on /dev/mapper/ssd-scratch is inconsistent

To export the recording from my log device I ran:

# ./src/log-writes/replay-log --log $LOGWRITES_DEV --find --end-mark last
17098@153852

Meaning that the "last" mark is record #17098 at sector offset 153852 on the log device, so I can export the content of the log up to and including this offset:

# dd if=$LOGWRITES_DEV of=ext4-crash.log bs=512 count=153853

The exported log is 76M compressed to 34K and attached to this message. To replay this log on your device and reproduce the fsck error, you need a $SCRATCH_DEV that is at least 10G and run:

# bunzip2 ext4-crash.log.bz2
# ./src/log-writes/replay-log --log ext4-crash.log --replay $SCRATCH_DEV --end-mark last

Now you can investigate the state of the file system right after the crash.
More importantly, you can use the --check [<number>|flush|fua] command line options to replay-log to run an arbitrary check utility on the replay device at any number of checkpoints before the crash. You can use --limit and --start-entry to fast forward to a given point in the recording and replay with consistency checks from that point on, and so forth.

Finally, see that replaying the recording to "last" reproduces the problem:

# mount $SCRATCH_DEV $SCRATCH_MNT
# umount $SCRATCH_MNT
# e2fsck -nf $SCRATCH_DEV

Hope this helps. I guess I could have saved Ted some work had I provided this guide and log sooner, but I thought the bug may be trivial enough to understand with just the dm-flakey reproducer - I was wrong about that. Cheers, Amir.
On Tue, Oct 17, 2017 at 12:43:20AM +0000, Vijay Chidambaram wrote: > It does expand our already-large search space, but our first order of > business is making sure CrashMonkey can reproduce every crash-consistency > bug reported in recent times (mostly by Amir :) ). So for now we were just > analyzing the bug and trying to understand it, but it looks like the > post-recovery image is not very useful for this. Right, the post-recovery image (after the journal is replayed) is not very useful. Unfortunately, I suspect the pre-recovery image (after the power cut, but before the journal replay) won't be terribly interesting either. It will show that the corruption is baked into the journal --- which is to say, the problem wasn't in whether the calls to the jbd2 layer were correct --- but rather, that one of the file system mutations in a specific jbd2 handle's "micro-transaction" left the file system in an inconsistent state. Not a terrible inconsistency, and it would be quickly papered over in a follow-up handle --- but one where, if the handle which left the file system in an inconsistent state and the handle which cleaned it up were in different transactions, and the power cut happened after the first transaction, the file system would be left in a state where e2fsck would complain. So if you have the I/O trace where the handles in question were assigned to the right (wrong) set of transactions, then yes, you'll see the problem, just as the xfstest will see the problem. But if you want to improve CrashMonkey's search of the problem space, it will require higher-level logging, because this is really a different sort of bug. CrashMonkey will find (a) bugs in jbd2, and (b) bugs in how the jbd2 layer is called. This bug is really a bug in the ext4 implementation, because it is in *how* the file system was mutated that temporarily left it in an inconsistent state, and that's a different thing from (a) or (b).
Which is great --- it's arguably additional research work that can be segregated into a different "Minimum Publishable Unit". :-) - Ted
Amir, thank you for providing the io recording, this is really useful! Most of our work in recent weeks has been driven by your input and your posts to the mailing list! Ted, I agree with your characterization of the bug. It's true that CrashMonkey will be missing context for certain bugs, and for those kinds of bugs, we will probably have a different project :) But for now, we are trying to widen the scope of CrashMonkey as much as we can. I think for the current bug, we can handle it in the following way:

1. run CrashMonkey with the smallest journal possible
2. in the setup phase of CrashMonkey, do N random metadata operations (that will fill up some percentage of the journal space)
3. run the test workload and collect the IO trace

We will do steps 2 and 3 with different N. For this bug, there must be a particular N that will force the micro-transaction to be at the end of the journal tx, revealing the bug. In general, it might be good for CrashMonkey to force the metadata from the test workload to be distributed across different journal transactions. The trick is to do it without modifying the file system itself. Thanks, Vijay On Tue, Oct 17, 2017 at 9:41 AM, Theodore Ts'o <tytso@mit.edu> wrote: > On Tue, Oct 17, 2017 at 12:43:20AM +0000, Vijay Chidambaram wrote: >> It does expand our already-large search space, but our first order of >> business is making sure CrashMonkey can reproduce every crash-consistency >> bug reported in recent times (mostly by Amir :) ). So for now we were just >> analyzing the bug and trying to understand it, but it looks like the >> post-recovery image is not very useful for this. > > Right, the post-recovery (after the journal replayed) is not very > useful. Unfortunately, the pre-recovery (after the power cut, but > before the journal replay) I suspect won't be terribly interesting > either.
It will show that the corruption is baked into the journal > --- which is to say, the problem wasn't in whether the calls to the > jbd2 layer were correct --- but rather, that one of the file system > mutations in a specific jbd2 handle's "micro-transaction" left the > file system is an inconsistent state. > > Not a terrible inconsistency, and it would be quickly papered over in > a follow-up handle --- but one where if the handle which left the file > system in an inconsistent state, and the handle which cleaned it up > were in different transactions, and the power cut happened after the > first transaction, the file system be left in a state where e2fsck > would complain. > > So if you have the I/O trace where the handles in question were > assigned to the right (wrong) set of transactions, then yes, you'll > see the problem, just as the xfstest will see the problem. > > But if you want to improve the CrashMonkey's search of the problem > space, it will require higher-level logging, because this is really a > different sort of bug. CrashMonkey will find (a) bugs in jbd2, and > (b) bugs in how the jbd2 layer is called. This bug is really a bug in > ext4 implementation, because it is in *how* the file system was > mutated that temporarily left it in an inconsistent state, and that's > a different thing from (a) or (b). Which is great --- it's arguably > additional research work that can be segregated into a different > "Minimum Publishable Unit". :-) > > - Ted
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 97f0fd06728d..07bca11749d4 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4794,7 +4794,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	}
 
 	if (!(mode & FALLOC_FL_KEEP_SIZE) &&
-	    offset + len > i_size_read(inode)) {
+	    (offset + len > i_size_read(inode) ||
+	     offset + len > EXT4_I(inode)->i_disksize)) {
 		new_size = offset + len;
 		ret = inode_newsize_ok(inode, new_size);
 		if (ret)
@@ -4965,7 +4966,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	}
 
 	if (!(mode & FALLOC_FL_KEEP_SIZE) &&
-	    offset + len > i_size_read(inode)) {
+	    (offset + len > i_size_read(inode) ||
+	     offset + len > EXT4_I(inode)->i_disksize)) {
 		new_size = offset + len;
 		ret = inode_newsize_ok(inode, new_size);
 		if (ret)