Message ID: 20130227184912.GA19624@thunk.org
State: Not Applicable, archived
On 2013.02.27 at 13:49 -0500, Theodore Ts'o wrote:
> Markus, Dave, can you confirm that this fixes your problem?
Yes, it fixes the issue.
Thanks Ted.
On Wed, Feb 27, 2013 at 01:49:12PM -0500, Theodore Ts'o wrote:
> Markus, Dave, can you confirm that this fixes your problem?
>
> Thanks!!
>
> (Sigh, this is a real brown paper bag bug; I'm embarrassed I missed
> this in my code review.)

Building now. Can you confirm that nothing on-disk should be awry?
Or will I need a new fsck to detect what happened?

	Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Ted,

On 02/28/2013 02:49 AM, Theodore Ts'o wrote:
> Markus, Dave, can you confirm that this fixes your problem?
>
> Thanks!!
>
> (Sigh, this is a real brown paper bag bug; I'm embarrassed I missed
> this in my code review.)

Sorry, I don't have a big disk at hand right now, so I couldn't
reproduce it. But the patch looks good. Thanks for fixing it.

Regards,
- Zheng

>
> - Ted
>
> From f47f0d11096ca5d9e1965d8a9f266aa13fe2b73b Mon Sep 17 00:00:00 2001
> From: Theodore Ts'o <tytso@mit.edu>
> Date: Wed, 27 Feb 2013 13:47:52 -0500
> Subject: [PATCH] ext4: fix extent status tree regression for file systems
>  > 512GB
>
> This fixes a regression introduced by commit f7fec032aa782. The
> problem was that the extent status flags caused us to mask out the
> high bits of block numbers above 2**28 blocks. Since we didn't test
> with file systems larger than 512GB, we didn't notice this during the
> development cycle.
>
> A typical failure looks like this:
>
> EXT4-fs error (device sdb1): htree_dirblock_to_tree:919: inode #172235804: block
> 152052301: comm ls: bad entry in directory: rec_len is smaller than minimal -
> offset=0(0), inode=0, rec_len=0, name_len=0
>
> ... where 'debugfs -R "stat <172235804>" /dev/sdb1' reports that the
> inode has block number 688923213. When viewed in hex, block number
> 152052301 (from the syslog) is 0x910224D, while block number 688923213
> is 0x2910224D. Note the missing "0x20000000" in the block number.
>
> Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
> Reported-by: Dave Jones <davej@redhat.com>
> Cc: Zheng Liu <gnehzuil.liu@gmail.com>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  fs/ext4/extents_status.h | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> index cf83e77..c795ff6 100644
> --- a/fs/ext4/extents_status.h
> +++ b/fs/ext4/extents_status.h
> @@ -20,10 +20,10 @@
>  #define es_debug(fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
>  #endif
>
> -#define EXTENT_STATUS_WRITTEN	0x80000000	/* written extent */
> -#define EXTENT_STATUS_UNWRITTEN	0x40000000	/* unwritten extent */
> -#define EXTENT_STATUS_DELAYED	0x20000000	/* delayed extent */
> -#define EXTENT_STATUS_HOLE	0x10000000	/* hole */
> +#define EXTENT_STATUS_WRITTEN	(((unsigned long long) 1) << 63)
> +#define EXTENT_STATUS_UNWRITTEN	(((unsigned long long) 1) << 62)
> +#define EXTENT_STATUS_DELAYED	(((unsigned long long) 1) << 61)
> +#define EXTENT_STATUS_HOLE	(((unsigned long long) 1) << 60)
>
>  #define EXTENT_STATUS_FLAGS	(EXTENT_STATUS_WRITTEN | \
>  				 EXTENT_STATUS_UNWRITTEN | \
> @@ -58,22 +58,22 @@ extern int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
>
>  static inline int ext4_es_is_written(struct extent_status *es)
>  {
> -	return (es->es_pblk & EXTENT_STATUS_WRITTEN);
> +	return (es->es_pblk & EXTENT_STATUS_WRITTEN) != 0;
>  }
>
>  static inline int ext4_es_is_unwritten(struct extent_status *es)
>  {
> -	return (es->es_pblk & EXTENT_STATUS_UNWRITTEN);
> +	return (es->es_pblk & EXTENT_STATUS_UNWRITTEN) != 0;
>  }
>
>  static inline int ext4_es_is_delayed(struct extent_status *es)
>  {
> -	return (es->es_pblk & EXTENT_STATUS_DELAYED);
> +	return (es->es_pblk & EXTENT_STATUS_DELAYED) != 0;
>  }
>
>  static inline int ext4_es_is_hole(struct extent_status *es)
>  {
> -	return (es->es_pblk & EXTENT_STATUS_HOLE);
> +	return (es->es_pblk & EXTENT_STATUS_HOLE) != 0;
>  }
>
>  static inline ext4_fsblk_t ext4_es_status(struct extent_status *es)
On Wed, Feb 27, 2013 at 01:57:55PM -0500, Dave Jones wrote:
> Building now. Can you confirm that nothing on-disk should be awry?
> Or will I need a new fsck to detect what happened?

Well, it's possible that a read from a data file which had blocks
located above 512GB might have gotten bogus information, or a block
allocated above 512GB might result in a write to the wrong place on
disk.

So if you are very cautious, running fsck just to make sure things are
OK is not a bad idea. But it's likely that the directory sanity checks
would have caught things quickly, or trying to run an executable would
have caused a seg fault quickly enough. If your system crashed very
quickly during the boot process, you'll probably be OK.

	- Ted
On Wed, Feb 27, 2013 at 01:49:12PM -0500, Theodore Ts'o wrote:
> -#define EXTENT_STATUS_WRITTEN	0x80000000	/* written extent */
> -#define EXTENT_STATUS_UNWRITTEN	0x40000000	/* unwritten extent */
> -#define EXTENT_STATUS_DELAYED	0x20000000	/* delayed extent */
> -#define EXTENT_STATUS_HOLE	0x10000000	/* hole */
> +#define EXTENT_STATUS_WRITTEN	(((unsigned long long) 1) << 63)
> +#define EXTENT_STATUS_UNWRITTEN	(((unsigned long long) 1) << 62)
> +#define EXTENT_STATUS_DELAYED	(((unsigned long long) 1) << 61)
> +#define EXTENT_STATUS_HOLE	(((unsigned long long) 1) << 60)

Just a nitpick: 1ULL << ... is shorter and probably the standard way we
do flags.
On Wed, Feb 27, 2013 at 02:04:19PM -0500, Theodore Ts'o wrote:
> On Wed, Feb 27, 2013 at 01:57:55PM -0500, Dave Jones wrote:
> > Building now. Can you confirm that nothing on-disk should be awry?
> > Or will I need a new fsck to detect what happened?
>
> Well, it's possible that a read from a data file which had blocks
> located above 512GB might have gotten bogus information, or a block
> allocated above 512GB might result in a write to the wrong place on
> disk.
>
> So if you are very cautious, running fsck just to make sure things are
> OK is not a bad idea. But it's likely that the directory sanity checks
> would have caught things quickly, or trying to run an executable would
> have caused a seg fault quickly enough. If your system crashed very
> quickly during the boot process, you'll probably be OK.

It had been up a few hours before I hit those checks. fsck never found
anything. I do have another disk (XFS formatted) with a backup from a
week ago. I'll run a --dry-run rsync to see if it picks up any changes
that I don't expect.

	Dave
On Wed, Feb 27, 2013 at 07:56:25PM +0100, Markus Trippelsdorf wrote:
> On 2013.02.27 at 13:49 -0500, Theodore Ts'o wrote:
> > Markus, Dave, can you confirm that this fixes your problem?
>
> Yes, it fixes the issue.

Looks like it's fixed here too.

How did this make it through -next without anyone hitting it?
I can't remember how many years ago I last bought a disk < 1TB,
and I can't be alone. Or is everyone all about SSDs these days?

Is anyone running xfstests or similar on linux-next regularly?

	Dave
On Wed, Feb 27, 2013 at 02:11:15PM -0500, Dave Jones wrote:
> fsck never found anything. I do have another disk (XFS formatted) with
> a backup from a week ago. I'll run a --dry-run rsync to see if it
> picks up any changes that I don't expect.

Probably a good idea.

Given that Markus reports that this fixes the problem for him, and the
regression smoke test has passed, I'll get a cleaned-up patch ready for
Linus to pull ASAP.

My deep apologies that we didn't catch this earlier. I'm going to think
about how we can improve our testing regime to make sure problems like
this get caught earlier.

	- Ted
On 02/28/2013 03:19 AM, Dave Jones wrote:
> On Wed, Feb 27, 2013 at 07:56:25PM +0100, Markus Trippelsdorf wrote:
> > On 2013.02.27 at 13:49 -0500, Theodore Ts'o wrote:
> > > Markus, Dave, can you confirm that this fixes your problem?
> >
> > Yes, it fixes the issue.
>
> Looks like it's fixed here too.
>
> How did this make it through -next without anyone hitting it?

Sorry, my apologies.

> I can't remember how many years ago I last bought a disk < 1TB,
> and I can't be alone. Or is everyone all about SSDs these days?

I admit that I run xfstests on an SSD that has 128G.

> Is anyone running xfstests or similar on linux-next regularly?

Ted has a dev branch for -next, and I work on this branch.

Regards,
- Zheng
On Wed, Feb 27, 2013 at 02:19:23PM -0500, Dave Jones wrote:
> Looks like it's fixed here too.
>
> How did this make it through -next without anyone hitting it?
>
> I can't remember how many years ago I last bought a disk < 1TB,
> and I can't be alone. Or is everyone all about SSDs these days?

I use LVM, so I have a number of volumes which are smaller than 512GB,
but very few which are actually larger than 1TB. And none on my test
boxes. I was running the bleeding edge ext4 code on my laptop for
dogfooding purposes, but I have an 80GB mSATA SSD and a 500GB HDD on my
X230 laptop (it requires a thin laptop drive, and 7mm drives don't come
any bigger, alas).

> Is anyone running xfstests or similar on linux-next regularly?

I run xfstests on the ext4 tree, and I ran it on ext4 plus Linus's tip
before I submitted a pull request. The problem is that xfstests is
S-L-O-W if you use large partitions, so typically I use 5GB partition
sizes for my test runs. Normally we're worried about race condition
bugs, not something as bone-headed as a bitmasking problem, so it makes
sense to use a smaller disk for most of your testing. (Some folks do
their xfstests runs on SSDs or tmpfs image files, again for speed
reasons, and it's unlikely they would be big enough.)

So what we probably need to do is to have a separate set of tests using
a loopback mount, and perhaps an artificially created file system which
has a large percentage of the blocks in the middle of the file system
busied out, to make testing of these sorts of bugs more efficient. As I
said, I'm thinking about the best way to improve our testing regime to
catch such problems the next time around.

Cheers,

	- Ted
On Wed, 27 Feb 2013 14:29:07 -0500, Theodore Ts'o <tytso@mit.edu> wrote:
> On Wed, Feb 27, 2013 at 02:19:23PM -0500, Dave Jones wrote:
> >
> > Looks like it's fixed here too.
> >
> > How did this make it through -next without anyone hitting it ?
> >
> > I can't remember how many years ago I last bought a disk < 1TB,
> > and I can't be alone. Or is everyone all about SSDs these days?
>
> I use LVM, so I have a number of volumes which are smaller than 512GB,
> but very few which are actually larger than 1TB. And none on my test
> boxes. I was running the bleeding edge ext4 code on my laptop for
> dogfooding purposes, but I have an 80GB mSATA SSD and a 500GB HDD on
> my X230 laptop (it requires a thin laptop drive, and 7mm drives don't
> come any bigger, alas).
>
> > Is anyone running xfstests or similar on linux-next regularly ?
>
> I run xfstests on the ext4 tree, and I ran it on ext4 plus Linus's tip
> before I submitted a pull request. The problem is that xfstests is
> S-L-O-W if you use large partitions, so typically I use 5GB

Indeed. That's why I gave up on rotating disks and run xfstests only on
an SSD or the brd module.

> partition sizes for my test runs. Normally we're worried about race
> condition bugs, not something as bone-headed as a bitmasking problem,
> so it makes sense to use a smaller disk for most of your testing.
> (Some folks do their xfstests runs on SSDs or tmpfs image files, again
> for speed reasons, and it's unlikely they would be big enough.)
>
> So what we probably need to do is to have a separate set of tests
> using a loopback mount, and perhaps an artificially created file
> system which has a large percentage of the blocks in the middle of the
> file system busied out, to make testing of these sorts of bugs more
> efficient. As I said, I'm thinking about the best way to improve our
> testing regime to catch such problems the next time around.

Amazing idea. Something like:

#dd if=/dev/zero of=/tmp/fs.img bs=1M seek=2000000 count=1
#mkfs.ext4 -m0 -i4096000 /tmp/fs.img
#mount /tmp/fs.img /mnt/ -oloop
#for ((i=0; i < 2000; i++));do fallocate -l $((1024*1024*1024)) /mnt/f$i ;done
#for ((i=0; i < 2000; i++));do truncate -s $((1023*1024*1024)) /mnt/f$i ;done

As a result, the file system image has 2GB of free space which is
fragmented into ~2000 chunks of 1MB each, but the image itself is quite
small:

# df /mnt
Filesystem      1K-blocks       Used  Available Use% Mounted on
/dev/loop0     2047678076 2045679228    1998848 100% /mnt
# du -sch /tmp/fs.img
242M /tmp/fs.img
242M total

Later we can simply run xfstests/fio/fsx on this image. I'll prepare a
new xfstest based on that idea. The only disadvantage is that the loop
device has a bottleneck: all requests will be serialized on i_mutex.

> Cheers,
>
> - Ted
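[Editor's note] The recipe above relies on sparse files: truncating a fully allocated file back frees the tail blocks, while the apparent size shrinks by only 1MB per file. A minimal illustration on an ordinary temp file (no root or loop mount needed; sizes are arbitrary, and it falls back to truncate where fallocate is unavailable):

```shell
set -e
f=$(mktemp)

# Allocate 4 MiB, then truncate back to 3 MiB: the last 1 MiB is freed.
fallocate -l $((4*1024*1024)) "$f" 2>/dev/null || truncate -s $((4*1024*1024)) "$f"
truncate -s $((3*1024*1024)) "$f"

# Apparent size (%s, bytes) is now 3 MiB; allocated 512-byte blocks (%b)
# cover at most that much, less if the file is sparse.
stat -c 'size=%s blocks=%b' "$f"

rm -f "$f"
```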
On Thu, Feb 28, 2013 at 12:58:55AM +0400, Dmitry Monakhov wrote:
> Indeed. That's why I gave up on rotating disks and run xfstests only
> on an SSD or the brd module.

Note that some problems only show up on slower devices (because then
the race window is wider), and sometimes they only show up on fast
devices. This is why I've been doing runs on both HDDs and on tmpfs kvm
disk image files.

> Later we can simply run xfstests/fio/fsx on this image.
> I'll prepare a new xfstest based on that idea. The only disadvantage
> is that the loop device has a bottleneck: all requests will be
> serialized on i_mutex.

If the image file is used as a disk file for KVM, that should avoid the
serialization bottleneck of using the loop device.

Regards,

	- Ted
On Wed, Feb 27, 2013 at 02:29:07PM -0500, Theodore Ts'o wrote:
> On Wed, Feb 27, 2013 at 02:19:23PM -0500, Dave Jones wrote:
> >
> > Looks like it's fixed here too.
> >
> > How did this make it through -next without anyone hitting it ?
> > Is anyone running xfstests or similar on linux-next regularly ?
>
> I run xfstests on the ext4 tree, and I ran it on ext4 plus Linus's tip
> before I submitted a pull request. The problem is that xfstests is
> S-L-O-W if you use large partitions, so typically I use 5GB
> partition sizes for my test runs.

This isn't the case for XFS. I typically see 1TB scratch devices only
being ~10-20% slower than 10GB scratch devices, and 10TB only being a
little slower than 1TB scratch devices.

I have to use sparse devices and --large-fs for 100TB filesystem
testing, so I can't directly compare the speeds to those that I run on
physical devices. However I can say that it isn't significantly slower
than using small scratch devices...

> So what we probably need to do is to have a separate set of tests
> using a loopback mount, and perhaps an artificially created file
> system which has a large percentage of the blocks in the middle of the
> file system busied out, to make efficient testing of these sorts of

That's exactly what the --large-fs patch set I posted months ago does
for ext4 - it uses fallocate() to fill all but 50GB of the large
filesystem without actually writing any data, and runs the standard
tests in the remaining unused space.

However, last time I tested ext4 with this patchset (when I posted the
patches months ago), multi-TB preallocation on ext4 was still too slow
to make it practical for testing on devices larger than 2-3TB. Perhaps
it would make testing 1-2TB ext4 filesystems fast enough that you could
do it regularly...

Cheers,

Dave.
On 2/27/13 2:58 PM, Dmitry Monakhov wrote:
> On Wed, 27 Feb 2013 14:29:07 -0500, Theodore Ts'o <tytso@mit.edu> wrote:
>> On Wed, Feb 27, 2013 at 02:19:23PM -0500, Dave Jones wrote:
>>>
>>> Looks like it's fixed here too.
>>>
>>> How did this make it through -next without anyone hitting it ?
>>>
>>> I can't remember how many years ago I last bought a disk < 1TB,
>>> and I can't be alone. Or is everyone all about SSDs these days?
>>
>> I use LVM, so I have a number of volumes which are smaller than 512GB,
>> but very few which are actually larger than 1TB. And none on my test
>> boxes. I was running the bleeding edge ext4 code on my laptop for
>> dogfooding purposes, but I have an 80GB mSATA SSD and a 500GB HDD on
>> my X230 laptop (it requires a thin laptop drive, and 7mm drives don't
>> come any bigger, alas).
>>
>>> Is anyone running xfstests or similar on linux-next regularly ?
>>
>> I run xfstests on the ext4 tree, and I ran it on ext4 plus Linus's tip
>> before I submitted a pull request. The problem is that xfstests is
>> S-L-O-W if you use large partitions, so typically I use 5GB
>
> Indeed. That's why I gave up on rotating disks and run xfstests only
> on an SSD or the brd module.
>
>> partition sizes for my test runs. Normally we're worried about race
>> condition bugs, not something as bone-headed as a bitmasking problem,
>> so it makes sense to use a smaller disk for most of your testing.
>> (Some folks do their xfstests runs on SSDs or tmpfs image files,
>> again for speed reasons, and it's unlikely they would be big enough.)
>>
>> So what we probably need to do is to have a separate set of tests
>> using a loopback mount, and perhaps an artificially created file
>> system which has a large percentage of the blocks in the middle of
>> the file system busied out, to make testing of these sorts of bugs
>> more efficient. As I said, I'm thinking about the best way to improve
>> our testing regime to catch such problems the next time around.
>
> Amazing idea. Something like:
>
> #dd if=/dev/zero of=/tmp/fs.img bs=1M seek=2000000 count=1
> #mkfs.ext4 -m0 -i4096000 /tmp/fs.img
> #mount /tmp/fs.img /mnt/ -oloop
> #for ((i=0; i < 2000; i++));do fallocate -l $((1024*1024*1024)) /mnt/f$i ;done
> #for ((i=0; i < 2000; i++));do truncate -s $((1023*1024*1024)) /mnt/f$i ;done
>
> As a result, the file system image has 2GB of free space which is
> fragmented into ~2000 chunks of 1MB each, but the image itself is
> quite small:
>
> # df /mnt
> Filesystem      1K-blocks       Used  Available Use% Mounted on
> /dev/loop0     2047678076 2045679228    1998848 100% /mnt
> # du -sch /tmp/fs.img
> 242M /tmp/fs.img
> 242M total
>
> Later we can simply run xfstests/fio/fsx on this image. I'll prepare a
> new xfstest based on that idea. The only disadvantage is that the loop
> device has a bottleneck: all requests will be serialized on i_mutex.

Before anyone does too much work, it would be worth revisiting
dchinner's "[PATCH 0/10] xfstests: rework large filesystem testing"
series from July 2012 to see if it meets the needs already. It almost
got all reviews, with one sticking point left, AFAICT.

-Eric
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index cf83e77..c795ff6 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -20,10 +20,10 @@
 #define es_debug(fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
 #endif
 
-#define EXTENT_STATUS_WRITTEN	0x80000000	/* written extent */
-#define EXTENT_STATUS_UNWRITTEN	0x40000000	/* unwritten extent */
-#define EXTENT_STATUS_DELAYED	0x20000000	/* delayed extent */
-#define EXTENT_STATUS_HOLE	0x10000000	/* hole */
+#define EXTENT_STATUS_WRITTEN	(((unsigned long long) 1) << 63)
+#define EXTENT_STATUS_UNWRITTEN	(((unsigned long long) 1) << 62)
+#define EXTENT_STATUS_DELAYED	(((unsigned long long) 1) << 61)
+#define EXTENT_STATUS_HOLE	(((unsigned long long) 1) << 60)
 
 #define EXTENT_STATUS_FLAGS	(EXTENT_STATUS_WRITTEN | \
 				 EXTENT_STATUS_UNWRITTEN | \
@@ -58,22 +58,22 @@ extern int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
 
 static inline int ext4_es_is_written(struct extent_status *es)
 {
-	return (es->es_pblk & EXTENT_STATUS_WRITTEN);
+	return (es->es_pblk & EXTENT_STATUS_WRITTEN) != 0;
 }
 
 static inline int ext4_es_is_unwritten(struct extent_status *es)
 {
-	return (es->es_pblk & EXTENT_STATUS_UNWRITTEN);
+	return (es->es_pblk & EXTENT_STATUS_UNWRITTEN) != 0;
 }
 
 static inline int ext4_es_is_delayed(struct extent_status *es)
 {
-	return (es->es_pblk & EXTENT_STATUS_DELAYED);
+	return (es->es_pblk & EXTENT_STATUS_DELAYED) != 0;
 }
 
 static inline int ext4_es_is_hole(struct extent_status *es)
 {
-	return (es->es_pblk & EXTENT_STATUS_HOLE);
+	return (es->es_pblk & EXTENT_STATUS_HOLE) != 0;
 }
 
 static inline ext4_fsblk_t ext4_es_status(struct extent_status *es)