diff mbox

EXT4 ENOSPC Bug

Message ID 20081202075712.GA16172@mit.edu
State Accepted, archived
Headers show

Commit Message

Theodore Ts'o Dec. 2, 2008, 7:57 a.m. UTC
Hi Andres,

What we suspect may be happening is that somehow, there is something
in your workload which is causing a leak or a double-increment of the
number of blocks reserved so that delayed allocation will succeed.
Basically, when the system writes to a block that hasn't been
previously allocated by filesystme, we reserve a data block so that
when we finally do the late delayed allocation, we know that a free
block will be available.  When we do finally determine where the block
will be located on the filesystem, we decrement the reserved block
counter.  If somehow the reserved block counter is growing and not
shrinking when it should, that could lead to the problem which you
describe.

So.... could you apply this patch, attached below.  You can trigger it
using the attached program, debug-ioctl.  If the filesystem is
quiscent, and you've typed sync once or twice, you should get the
following in your printk logs:

[ 2742.603886] ext4 debug delalloc of dm-4
[ 2742.603948] ext4: dirty blocks 0 free blocks 7324359
[ 2742.603960] ext4 debug delalloc done

If you do have some dirty blocks that haen't been flushed out to disk,
it might look liket this:

[ 2758.653682] ext4 debug delalloc of dm-4
[ 2758.653697] ext4: dirty blocks 172 free blocks 7324439
[ 2758.653703] s_dirty list:
[ 2758.653708] ino 401167: 79 2
[ 2758.653713] ino 401200: 2 2
[ 2758.653718] ino 401197: 3 2
  	       	   ....
[ 2758.653828] ext4 debug delalloc done

If our theory is correct, I suspect you will start to see the number
of dirty blocks grown over time, even before you start seeing ENOSPC
errors (which will happen when the number of dirty blocks exceeds the
number of free blocks).

In that case, the list of inodes that have data and metadata blocks
reserved will hopefully tell us soemthing about what might be going
on.  Just run the debug-ioctl command giving a filename or directory
within the filesystem where you want to deump out the debugging
information; it will be dumpd out in the dmesg buffer.

							- Ted

Comments

Andres Freund Dec. 2, 2008, 2:58 p.m. UTC | #1
Hi Ted, Hi Andreas,

On Tuesday 02 December 2008 08:57:12 Theodore Tso wrote:
> So.... could you apply this patch, attached below.  You can trigger it
> using the attached program, debug-ioctl.  If the filesystem is
> quiscent, and you've typed sync once or twice, you should get the
> following in your printk logs:
Ok. The system now runs (without problems) with the patch enabled and I can 
get the debug output.
30 Minutes after boot the system still returns to zero dirty blocks and the 
free blocks seem to stay in a sensible range.
I will let it run for a while and report back if either something interesting 
happens or the problem reappears and I am seeing no significant amount of dirty 
blocks.

Andres
Theodore Ts'o Dec. 2, 2008, 4:47 p.m. UTC | #2
On Tue, Dec 02, 2008 at 03:58:53PM +0100, Andres Freund wrote:
> Ok. The system now runs (without problems) with the patch enabled and I can 
> get the debug output.
> 30 Minutes after boot the system still returns to zero dirty blocks and the 
> free blocks seem to stay in a sensible range.
> I will let it run for a while and report back if either something interesting 
> happens or the problem reappears and I am seeing no significant amount of dirty 
> blocks.

You say you are using Postgres, right?  Something you might try to see
if it triggers the problem it is creating a new database and then
restoring some database dump/backup into that new database.  Some
databases expand into a new table space (or whatever terminology
Postgres uses) by random writes into a sparse portion of the file.
This could be triggering the problem, or at least trigger the problem
more quickly.

The other thing I wanted to ask is whether "df" was showing the 37%
in-use statistic at the time, or was that after you rebooted.  And
although I hate to ask it, you're sure this isn't the standard "delete
an in-use file but not get the space back" Unix trap, right?

Thanks, regards,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andres Freund Dec. 2, 2008, 5:47 p.m. UTC | #3
Hi,

On Tuesday 02 December 2008 17:47:09 Theodore Tso wrote:
> You say you are using Postgres, right?  Something you might try to see
> if it triggers the problem it is creating a new database and then
> restoring some database dump/backup into that new database.  Some
> databases expand into a new table space (or whatever terminology
> Postgres uses) by random writes into a sparse portion of the file.
> This could be triggering the problem, or at least trigger the problem
> more quickly.
I tried that - I have seen no problems so far. But it is not the first time I 
did not see the problem for some time.

Btw, postgres just creates the database by copying over a default database.

For an easy test with sparse files, I created a big one, set it up as a loop 
device, created a filesystem and ran some stuff in it.
No Problem so far.

> The other thing I wanted to ask is whether "df" was showing the 37%
> in-use statistic at the time, or was that after you rebooted. 
It definitely was before a reboot. And there were plenty of both, inodes and 
blocks.

I think that I have seen the problem on metadata only changes (find /tmp -type 
f|xargs touch) as well, but sometimes metadata changes were possible while file 
creation was not.

Another Datapoint: File deletion sometimes made it possible to create more 
files, but by far not as much as the space freed.

> And although I hate to ask it, you're sure this isn't the standard "delete
> an in-use file but not get the space back" Unix trap, right?
There were over 200GB free, so I doubt that. I don't know what could have 
caused an allocation of so much space unnoticed in an idle system multiple 
times.
But I do understand the reason for the question ;-)


Andres
Theodore Ts'o Dec. 2, 2008, 8:33 p.m. UTC | #4
On Tue, Dec 02, 2008 at 06:47:24PM +0100, Andres Freund wrote:
> 
> I think that I have seen the problem on metadata only changes (find
> /tmp -type f|xargs touch) as well, but sometimes metadata changes
> were possible while file creation was not.

Hmm...  really?  Was the error message ENOSPC, or something else?

The only way I can see that it might be ENOSPC would be if we got an
error in the jbd2 layer, in start_this_handle(), but that would have
resulted in a kernel printk:

	if (nblocks > journal->j_max_transaction_buffers) {
		printk(KERN_ERR "JBD: %s wants too many credits (%d > %d)\n",
		       current->comm, nblocks,
		       journal->j_max_transaction_buffers);
		ret = -ENOSPC;
		goto out;
	}


If that wasn't too long ago, you might want to search your old system
log files in /var/log for any "JBD" or "ext4" messages.  That might be
an important clue, although that seems rather unlikely to me.  Or in
the case where metadata changes were failing, perhaps they were doing
so with some other error that ENOSPC?

   	   					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jonathan Bastien-Filiatrault Dec. 3, 2008, 12:37 a.m. UTC | #5
Theodore Tso wrote:

> Hmm...  really?  Was the error message ENOSPC, or something else?
>
I can confirm I get a ENOSPC on open/creat. Here is the relevant output
from strace touch test:

open("test", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = -1 ENOSPC (No
space left on device)

And the df -h output (first filesystem in list):

Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg0-root   88G   60G   25G  71% /
tmpfs                 1.6G   12K  1.6G   1% /lib/init/rw
udev                   10M  112K  9.9M   2% /dev
tmpfs                 1.6G     0  1.6G   0% /dev/shm
/dev/sda1              92M   19M   69M  22% /boot
none                  1.6G  3.3M  1.6G   1% /tmp

Cheers,
Jonathan
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jonathan Bastien-Filiatrault Dec. 3, 2008, 12:40 a.m. UTC | #6
Jonathan Bastien-Filiatrault wrote:
> Theodore Tso wrote:
> 
>> Hmm...  really?  Was the error message ENOSPC, or something else?
>>
> I can confirm I get a ENOSPC on open/creat. Here is the relevant output
> from strace touch test:
> 
(Sorry for replying to myself)

This happened while compiling a kernel with ccache enabled (with a
relatively low hit rate for this specific build). Vuze was also opened,
but with no active download.

Cheers,
Jonathan
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Dec. 3, 2008, 4:37 a.m. UTC | #7
On Tue, Dec 02, 2008 at 07:40:43PM -0500, Jonathan Bastien-Filiatrault wrote:
> Jonathan Bastien-Filiatrault wrote:
> > Theodore Tso wrote:
> > 
> >> Hmm...  really?  Was the error message ENOSPC, or something else?
> >>
> > I can confirm I get a ENOSPC on open/creat. Here is the relevant output
> > from strace touch test:
> > 
> (Sorry for replying to myself)
> 
> This happened while compiling a kernel with ccache enabled (with a
> relatively low hit rate for this specific build). Vuze was also opened,
> but with no active download.

Hmm... I build with ccache all the time.  I normally don't run a
bittorrent client, though.  What kernel version are you using?

Any chance you can apply the patch which I sent earlier in this
thread, trigger the debugging ioctl when it happens again?

Thanks,

		    	      	    	    - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Aneesh Kumar K.V Dec. 3, 2008, 3:34 p.m. UTC | #8
On Tue, Dec 02, 2008 at 07:37:20PM -0500, Jonathan Bastien-Filiatrault wrote:
> Theodore Tso wrote:
> 
> > Hmm...  really?  Was the error message ENOSPC, or something else?
> >
> I can confirm I get a ENOSPC on open/creat. Here is the relevant output
> from strace touch test:
> 
> open("test", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = -1 ENOSPC (No
> space left on device)
> 

Can you make sure you have the below patch in the kernel.

c001077f4003fa75793bb62979baa6241dd8eb19

commit c001077f4003fa75793bb62979baa6241dd8eb19
Author: Eric Sandeen <sandeen@redhat.com>
Date:   Tue Aug 19 22:19:50 2008 -0400

ext4: Fix bug where we return ENOSPC even though we have plenty of inodes

-aneesh
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ba18b81..43f12f6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -313,6 +313,8 @@  struct ext4_new_group_data {
 #define EXT4_IOC_FIEMAP_INO		_IOWR('f', 18, struct fiemap_ino)
 #define EXT4_IOC_MOVE_VICTIM		_IOW('f', 19, struct ext4_extents_info)
 
+#define EXT4_IOC_DEBUG_DELALLOC		_IO('f', 42)
+
 /*
  * ioctl commands in 32 bit emulation
  */
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 15050d3..be60cfc 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -18,6 +18,17 @@ 
 #include "ext4_jbd2.h"
 #include "ext4.h"
 
+static void print_inode_dealloc_info(struct inode *inode)
+{
+	if (!EXT4_I(inode)->i_reserved_data_blocks ||
+	    !EXT4_I(inode)->i_reserved_meta_blocks)
+		return;
+
+	printk(KERN_DEBUG "ino %lu: %u %u\n", inode->i_ino,
+	       EXT4_I(inode)->i_reserved_data_blocks,
+	       EXT4_I(inode)->i_reserved_meta_blocks);
+}
+
 long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
 	struct inode *inode = filp->f_dentry->d_inode;
@@ -346,6 +357,42 @@  setversion_out:
 		return err;
 	}
 
+	case EXT4_IOC_DEBUG_DELALLOC:
+	{
+		extern spinlock_t inode_lock;
+		struct super_block *sb = inode->i_sb;
+		struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+		struct inode *inode;
+		
+		printk(KERN_DEBUG "ext4 debug delalloc of %s\n", sb->s_id);
+		printk(KERN_DEBUG "ext4: dirty blocks %lld free blocks %lld\n",
+		       percpu_counter_sum(&sbi->s_dirtyblocks_counter),
+		       percpu_counter_sum(&sbi->s_freeblocks_counter));
+		spin_lock(&inode_lock);
+		if (!list_empty(&sb->s_dirty)) {
+			printk(KERN_DEBUG "s_dirty list:\n");
+			list_for_each_entry(inode, &sb->s_dirty, i_list) {
+				print_inode_dealloc_info(inode);
+			}
+		}
+		if (!list_empty(&sb->s_io)) {
+			printk(KERN_DEBUG "s_io list:\n");
+			list_for_each_entry(inode, &sb->s_io, i_list) {
+				print_inode_dealloc_info(inode);
+			}
+		}
+		if (!list_empty(&sb->s_more_io)) {
+			printk(KERN_DEBUG "s_more_io list:\n");
+			list_for_each_entry(inode, &sb->s_more_io, i_list) {
+				print_inode_dealloc_info(inode);
+			}
+		}
+		spin_unlock(&inode_lock);
+		printk(KERN_DEBUG "ext4 debug delalloc done\n");
+		return 0;
+	}
+
+
 	default:
 		return -ENOTTY;
 	}
@@ -389,6 +436,7 @@  long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		cmd = EXT4_IOC_SETRSVSZ;
 		break;
 	case EXT4_IOC_GROUP_ADD:
+	case EXT4_IOC_DEBUG_DELALLOC:
 		break;
 	default:
 		return -ENOIOCTLCMD;