Patchwork s390x: kernel BUG at fs/ext4/inode.c:1591! (powerpc too!)

Submitter Dmitry Monakhov
Date April 2, 2013, 9:47 a.m.
Message ID <877gkls1q7.fsf@openvz.org>
Permalink /patch/232928/
State Superseded

Comments

Dmitry Monakhov - April 2, 2013, 9:47 a.m.
On Mon, 1 Apr 2013 23:15:07 -0700 (PDT), Christian Kujau <lists@nerdbynature.de> wrote:
> Hi,
> 
> my machine (PowerBook G4) just crashed and the only thing netconsole was 
> able to transmit was:
> 
>   ------------[ cut here ]------------
>   kernel BUG at /usr/local/src/linux-git/fs/ext4/inode.c:1591!
> 
> But (unfortunately) nothing more. I have no clear way to reproduce this, 
> but I have some kind of a (longish) backstory to this, see below. The 
> system is running 3.9-rc4, its .config and dmesg:
> 
>   http://nerdbynature.de/bits/3.9.0-rc1/config.gz (oldconfig'ed to -rc4)
>   http://nerdbynature.de/bits/3.9.0-rc1/dmesg.txt (w/o the calltrace at the end)
> 
> 
> I was having trouble all day downloading a file via bittorrent to an 
> ext4 filesystem. It always came back as corrupted, though I can't 
> point out the corruption, as I don't know the contents of the source 
> file. The ext4 filesystem sits on top of a dm-crypt LUKS device:
> 
>  /dev/mapper/wdc0 on /mnt/data type ext4 (rw,nosuid,nodev,noexec,relatime,data=ordered)
> 
> While looking around as to why the file would be corrupt, the internet 
> suggested "bad memory" or "bad disk" or "kernel bugs". I have dismissed 
> the first two, as the system is rock-stable otherwise and dmesg has no 
> kernel messages suggesting disk or filesystem problems.
Unfortunately this looks like a regression which we missed
because s390x and ppc are not well tested.
> 
> The file in question is ~800 MB in size. Not getting any further on a 
> solution to my corrupted file, I decided to download a 4.3GB Fedora 
> installation image via bittorrent to the same filesystem and that's when 
> the machine crashed, leaving only the single BUG message as a hint.
Ohh, that is sad. Unfortunately I can't reproduce this in my own
environment. I have a power mac pro G5 but w/o a graphics card, so I can't
install Linux on it. If you know how to do that w/o a monitor, please let
me know.

So you just do a bunch of writes/mmap to a fallocated area.
My only guess is that there is some bug in the extent status tree.
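
For readers who want to see what "writes/mmap to a fallocated area" means in practice, here is a minimal user-space sketch of that access pattern. It is an illustration only: the helper name write_into_preallocated() is hypothetical, and nothing in this thread establishes that this exact sequence triggers the BUG. It preallocates a file with posix_fallocate(), writes into the middle of it with pwrite(), and dirties the first page through a shared mapping.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): preallocate `size` bytes,
 * then write into the preallocated range via pwrite() and via mmap().
 * Returns 0 on success, -1 on any failure.
 */
static int write_into_preallocated(const char *path, off_t size)
{
	static const char piece[] = "piece";
	long page = sysconf(_SC_PAGESIZE);
	char *map;
	int fd;

	assert(path != NULL);
	/* The sketch dirties one full page, so the file must hold one. */
	if (page <= 0 || size < page)
		return -1;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	/* Reserve the blocks up front; on ext4 these are unwritten extents. */
	if (posix_fallocate(fd, 0, size) != 0)
		goto fail;

	/* Buffered write into the middle of the preallocated range. */
	if (pwrite(fd, piece, sizeof(piece), size / 2) != (ssize_t)sizeof(piece))
		goto fail;

	/* Dirty the first page through a shared mapping and flush it. */
	map = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto fail;
	memset(map, 0xab, (size_t)page);
	if (msync(map, (size_t)size, MS_SYNC) != 0) {
		munmap(map, (size_t)size);
		goto fail;
	}
	munmap(map, (size_t)size);

	return close(fd);
fail:
	close(fd);
	return -1;
}
```

Calling write_into_preallocated() with a path on the affected ext4 filesystem and a size of a few megabytes exercises both write paths against preallocated space. A bittorrent client does something similar at a much larger scale and in random order.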

Please run the test with the patch which was posted here:
http://marc.info/?l=linux-kernel&m=136455173926544&w=2
This patch enables sanity checks for the extent_status tree.
Also please try the following patch. It deliberately disables the es_lookup functionality.
> 
> The system is back now, e2fsck-1.42.5 came back with no errors.
> 
> Thanks for reading,
> Christian.
> 
> PS: somewhat off-topic, but: is there a way to have BUG_ON print only
>     fs/ext4/inode.c:1591! instead of the full pathname? Is there a
>     config option for this?
> -- 
> BOFH excuse #339:
> 
> manager in the cable duct
Zheng Liu - April 2, 2013, 12:33 p.m.
On Tue, Apr 02, 2013 at 01:47:44PM +0400, Dmitry Monakhov wrote:
> On Mon, 1 Apr 2013 23:15:07 -0700 (PDT), Christian Kujau <lists@nerdbynature.de> wrote:
> > Hi,
> > 
> > my machine (PowerBook G4) just crashed and the only thing netconsole was 
> > able to transmit was:
> > 
> >   ------------[ cut here ]------------
> >   kernel BUG at /usr/local/src/linux-git/fs/ext4/inode.c:1591!
> > 
> > But (unfortunately) nothing more. I have no clear way to reproduce this, 
> > but I have some kind of a (longish) backstory to this, see below. The 
> > system is running 3.9-rc4, its .config and dmesg:
> > 
> >   http://nerdbynature.de/bits/3.9.0-rc1/config.gz (oldconfig'ed to -rc4)
> >   http://nerdbynature.de/bits/3.9.0-rc1/dmesg.txt (w/o the calltrace at the end)
> > 
> > 
> > I was having trouble all day downloading a file via bittorrent to an 
> > ext4 filesystem.

It looks like the same problem [1].  But it should have been fixed in
3.9-rc4.  Frankly, I think the root cause is es_cache.  Sorry, it hasn't
been well tested.

1. http://www.serverphorums.com/read.php?12,667656

Could you please revert your tree to this commit (3a225670) and try
again? I want to find out whether the regression was already present at
that commit and is still unfixed, or whether it was introduced after it.

Thanks in advance,
                                                - Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dmitry Monakhov - April 2, 2013, 5:19 p.m.
On Tue, 2 Apr 2013 09:58:46 -0700 (PDT), Christian Kujau <lists@nerdbynature.de> wrote:
> On Tue, 2 Apr 2013 at 13:47, Dmitry Monakhov wrote:
> > Unfortunately this looks like a regression which we missed
> > because s390x and ppc are not well tested.
> 
> :-(
> 
> > Ohh, that is sad. Unfortunately I can't reproduce this in my own
> > environment. I have a power mac pro G5 but w/o a graphics card, so I can't
> > install Linux on it. If you know how to do that w/o a monitor, please let
> > me know.
> 
> Hm, w/o a graphics card..the only way to install any OS would be via a 
> serial line, I assume.
I've tried to use qemu, but I cannot even boot the kernel:
Preparing to boot Linux version 3.9.0-rc4 (root@dbuild4.qa.sw.ru) (gcc
version 4.4.5 (Debian 4.4.5-8) ) #6 Tue Apr 2 19:12:42 MSK 2013
Detected machine type: 00000400
command line: console=ttyS0,9600 console=tty0
memory layout at init:
  memory_limit : 00000000 (16 MB aligned)
  alloc_bottom : 0164d000
  alloc_top    : 20000000
  alloc_top_hi : 20000000
  rmo_top      : 20000000
  ram_top      : 20000000
found display   : /pci@80000000/QEMU,VGA@1, opening... done
copying OF device tree...
Building dt strings...
Building dt structure...
Device tree strings 0x0164e000 -> 0x0164e4d7
Device tree struct  0x0164f000 -> 0x01651000
Calling quiesce...
returning from prom_init
Trying to write invalid spr 1015 3f7 at c0008bc0

Can anybody help me with a simple thing:
building and booting a kernel via qemu?

> 
> > So you just do a bunch of writes/mmap to a fallocated area.
> > My only guess is that there is some bug in the extent status tree.
> 
> "writes/mmap to fallocated area" - this sounds like the exact thing this 
> bittorrent client is doing!
> 
> > Please run the test with the patch which was posted here:
> > http://marc.info/?l=linux-kernel&m=136455173926544&w=2
> > This patch enables sanity checks for the extent_status tree.
> > Also please try the following patch. It deliberately disables the es_lookup functionality.
> 
> I'll find a way to reproduce this first and then play around with those patches.
Probably all you need is to run fsstress
(https://github.com/dmonakhov/xfstests/blob/master/ltp/fsstress.c).
Run it as follows:
#fsstress -d $YOUR_PATH -p 4 -z -f rmdir=10 -f link=10 -f creat=10 -f mkdir=10 \
-f rename=30 -f stat=30 -f unlink=30 -f truncate=20 -n99999999
> 
> Thanks for your response,
> Christian.
> -- 
> BOFH excuse #415:
> 
> Maintenance window broken
Dmitry Monakhov - April 2, 2013, 10:05 p.m.
On Tue, 2 Apr 2013 14:35:29 -0700 (PDT), Christian Kujau <lists@nerdbynature.de> wrote:
> On Tue, 2 Apr 2013 at 13:47, Dmitry Monakhov wrote:
> > So you just do a bunch of writes/mmap to a fallocated area.
> > My only guess is that there is some bug in the extent status tree.
> > 
> > Please run the test with the patch which was posted here:
> > http://marc.info/?l=linux-kernel&m=136455173926544&w=2
> > This patch enables sanity checks for the extent_status tree.
> > Also please try the following patch. It deliberately disables the es_lookup functionality.
> 
> I tested your patch below (applied to 3.9-rc4) and now the BUG is gone. 
> The machine stays up and the corruption of that torrent file is gone too! 
> 
> Feel free to add my Tested-by: but I don't know if this will be the final 
> solution to this issue, no?
No. This is just proof that es_cache is the root cause.
Please drop that patch and collect logs with a kernel which
has only the 0001-enable-ES_AGGRESSIVE_TEST-V2.patch patch applied.
This can help us understand what went wrong. From CAI Qian's
logs (http://marc.info/?l=linux-ext4&m=136489690730402&w=2)
I found that in most cases the assertion failed because
es_cache contains BH_Mapped entries, but extent_tree has no data at all.

Also there is another assertion failure where
es_cache {15/1/33490/MAPPED}  != extent_tree {15/1/33579/BH_UNWRITTEN}
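
To make the assertion failure above concrete, here is a small user-space sketch of the kind of consistency check involved. It is not the kernel's code: the struct, enum and function names are hypothetical, and it assumes the four fields in the braces are logical block / length / physical block / status. The check compares what the cache claims about a range with what the on-disk extent tree says and reports a mismatch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status values; the real kernel flags are not modelled here. */
enum extent_state { EXTENT_HOLE, EXTENT_MAPPED, EXTENT_UNWRITTEN };

/* One view of an extent: assumed to mirror {lblk/len/pblk/status}. */
struct extent_view {
	uint32_t lblk;           /* first logical block */
	uint32_t len;            /* number of blocks */
	uint64_t pblk;           /* first physical block (ignored for holes) */
	enum extent_state state; /* mapped, unwritten, or hole */
};

/* Return true when the cached view and the extent-tree view agree. */
static bool extent_views_agree(const struct extent_view *cached,
			       const struct extent_view *found)
{
	if (cached->state != found->state)
		return false;
	if (cached->lblk != found->lblk || cached->len != found->len)
		return false;
	/* A hole has no physical location to compare. */
	if (cached->state == EXTENT_HOLE)
		return true;
	return cached->pblk == found->pblk;
}
```

With the values from the log line above, a cached view of {15, 1, 33490, EXTENT_MAPPED} and a found view of {15, 1, 33579, EXTENT_UNWRITTEN} disagree in both physical block and status, so extent_views_agree() returns false.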

> 
> Thanks!
> Christian.
> 
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index fe3337a..95d27cd 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -689,6 +689,7 @@ int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
>  	trace_ext4_es_lookup_extent_enter(inode, lblk);
>  	es_debug("lookup extent in block %u\n", lblk);
>  
> +	return 0;
>  	tree = &EXT4_I(inode)->i_es_tree;
>  	read_lock(&EXT4_I(inode)->i_es_lock);
>  
> -- 
> BOFH excuse #414:
> 
> tachyon emissions overloading the system

Patch

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index fe3337a..95d27cd 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -689,6 +689,7 @@  int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
 	trace_ext4_es_lookup_extent_enter(inode, lblk);
 	es_debug("lookup extent in block %u\n", lblk);
 
+	return 0;
 	tree = &EXT4_I(inode)->i_es_tree;
 	read_lock(&EXT4_I(inode)->i_es_lock);