diff mbox

ext4: add max_dir_size_kb mount option

Message ID 1344626638-31548-1-git-send-email-tytso@mit.edu
State Superseded, archived
Headers show

Commit Message

Theodore Ts'o Aug. 10, 2012, 7:23 p.m. UTC
Very large directories can cause significant performance problems,
especially if jobs are running in memory-tight environment (whether it
is VM's with a small amount of memory or small memory cgroups).

So it is useful, in cloud server/data center environments, to be able
to set a filesystem-wide cap on the maximum size of a directory, to
ensure that directories never get larger than a sane size.  We do this
via a new mount option, max_dir_size_kb.  If there is an attempt to
grow the directory larger than max_dir_size_kb, the system call will
return ENOSPC instead.

Google-Bug-Id: 6863013

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/ext4.h  | 1 +
 fs/ext4/namei.c | 6 ++++++
 fs/ext4/super.c | 7 +++++++
 3 files changed, 14 insertions(+)

Comments

Eric Sandeen Aug. 10, 2012, 7:38 p.m. UTC | #1
On 8/10/12 2:23 PM, Theodore Ts'o wrote:
> 
> So it is useful, in cloud server/data center environments, to be able
> to set a filesystem-wide cap on the maximum size of a directory, to
> ensure that directories never get larger than a sane size.  We do this
> via a new mount option, max_dir_size_kb.  If there is an attempt to
> grow the directory larger than max_dir_size_kb, the system call will
> return ENOSPC instead.

Can the commit message also describe more about the problem: how bad
it is, the root cause, and why it's so hard to fix properly?

Please also update Documentation/filesystems/ext4.txt so people
know for sure what the use & intent of this new knob is.

I guess it's self explanatory in the name but docs are easy & good.

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Aug. 10, 2012, 7:59 p.m. UTC | #2
On Fri, Aug 10, 2012 at 02:38:10PM -0500, Eric Sandeen wrote:
> 
> Can the commit message also describe more about the problem: how bad
> it is, the root cause, and why it's so hard to fix properly?

The use case is going to be fairly userspace specific, but one example
might be if the log reaper fails to run for whatever reason, and the
log directory then proceeds to grow without bound.  And then when the
log repear *does* have a chance to run, if it happens to be in tight
memory cgroup, it then dies so the directory grows even larger, and
then when other processes try to access the directory, a readdir will
cause them to die because of their memory limitation, and hilarity
ensues.

You can fix this in other places in the software stack, but belt and
suspenders is good, and if there is no reason for directories to grow
larger than some set size, it's better to get a hard failure with an
ENOSPC rather than an funny failures caused by OOM's or slower and
slower performance.

> Please also update Documentation/filesystems/ext4.txt so people
> know for sure what the use & intent of this new knob is.

Yes, I'll do that in the next spin of the patches.

     	     	     	      	      - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Moyer Aug. 10, 2012, 8:11 p.m. UTC | #3
"Theodore Ts'o" <tytso@mit.edu> writes:

> Very large directories can cause significant performance problems,
> especially if jobs are running in memory-tight environment (whether it
> is VM's with a small amount of memory or small memory cgroups).
>
> So it is useful, in cloud server/data center environments, to be able
> to set a filesystem-wide cap on the maximum size of a directory, to
> ensure that directories never get larger than a sane size.  We do this
> via a new mount option, max_dir_size_kb.  If there is an attempt to
> grow the directory larger than max_dir_size_kb, the system call will
> return ENOSPC instead.

I have no idea what a reasonable number for this would be.  Can you
provide guidelines that would help admins understand what factors
influence performance degradation due to directory size?

Finally, I don't pretend to understand how your mount option parsing
routines work, but based on what I see in this patch it looks like the
default will be set to and enforced as 0.  What am I missing?

Cheers,
Jeff

> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  fs/ext4/ext4.h  | 1 +
>  fs/ext4/namei.c | 6 ++++++
>  fs/ext4/super.c | 7 +++++++
>  3 files changed, 14 insertions(+)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index c3411d4..7c0841e 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1243,6 +1243,7 @@ struct ext4_sb_info {
>  	unsigned int s_mb_order2_reqs;
>  	unsigned int s_mb_group_prealloc;
>  	unsigned int s_max_writeback_mb_bump;
> +	unsigned int s_max_dir_size_kb;
>  	/* where last allocation was done - for stream allocation */
>  	unsigned long s_mb_last_group;
>  	unsigned long s_mb_last_start;
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 2a42cc0..bdde668 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -55,6 +55,12 @@ static struct buffer_head *ext4_append(handle_t *handle,
>  {
>  	struct buffer_head *bh;
>  
> +	if (unlikely((inode->i_size >> 10) >=
> +		     EXT4_SB(inode->i_sb)->s_max_dir_size_kb)) {
> +		*err = -ENOSPC;
> +		return NULL;
> +	}
> +
>  	*block = inode->i_size >> inode->i_sb->s_blocksize_bits;
>  
>  	bh = ext4_bread(handle, inode, *block, 1, err);
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 56bcaec..5896dcb 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1230,6 +1230,7 @@ enum {
>  	Opt_inode_readahead_blks, Opt_journal_ioprio,
>  	Opt_dioread_nolock, Opt_dioread_lock,
>  	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
> +	Opt_max_dir_size_kb,
>  };
>  
>  static const match_table_t tokens = {
> @@ -1303,6 +1304,7 @@ static const match_table_t tokens = {
>  	{Opt_init_itable, "init_itable=%u"},
>  	{Opt_init_itable, "init_itable"},
>  	{Opt_noinit_itable, "noinit_itable"},
> +	{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
>  	{Opt_removed, "check=none"},	/* mount option from ext2/3 */
>  	{Opt_removed, "nocheck"},	/* mount option from ext2/3 */
>  	{Opt_removed, "reservation"},	/* mount option from ext2/3 */
> @@ -1483,6 +1485,7 @@ static const struct mount_opts {
>  	{Opt_jqfmt_vfsold, QFMT_VFS_OLD, MOPT_QFMT},
>  	{Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT},
>  	{Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
> +	{Opt_max_dir_size_kb, 0, MOPT_GTE0},
>  	{Opt_err, 0, 0}
>  };
>  
> @@ -1598,6 +1601,8 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
>  			if (!args->from)
>  				arg = EXT4_DEF_LI_WAIT_MULT;
>  			sbi->s_li_wait_mult = arg;
> +		} else if (token == Opt_max_dir_size_kb) {
> +			sbi->s_max_dir_size_kb = arg;
>  		} else if (token == Opt_stripe) {
>  			sbi->s_stripe = arg;
>  		} else if (m->flags & MOPT_DATAJ) {
> @@ -1829,6 +1834,8 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
>  	if (nodefs || (test_opt(sb, INIT_INODE_TABLE) &&
>  		       (sbi->s_li_wait_mult != EXT4_DEF_LI_WAIT_MULT)))
>  		SEQ_OPTS_PRINT("init_itable=%u", sbi->s_li_wait_mult);
> +	if (nodefs || sbi->s_max_dir_size_kb)
> +		SEQ_OPTS_PRINT("max_dir_size_kb=%u", sbi->s_max_dir_size_kb);
>  
>  	ext4_show_quota_options(seq, sb);
>  	return 0;
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen Aug. 10, 2012, 8:16 p.m. UTC | #4
On 8/10/12 2:59 PM, Theodore Ts'o wrote:
> On Fri, Aug 10, 2012 at 02:38:10PM -0500, Eric Sandeen wrote:
>>
>> Can the commit message also describe more about the problem: how bad
>> it is, the root cause, and why it's so hard to fix properly?
> 
> The use case is going to be fairly userspace specific, but one example
> might be if the log reaper fails to run for whatever reason, and the
> log directory then proceeds to grow without bound.  And then when the
> log repear *does* have a chance to run, if it happens to be in tight
> memory cgroup, it then dies so the directory grows even larger, and
> then when other processes try to access the directory, a readdir will
> cause them to die because of their memory limitation, and hilarity
> ensues.

Oh, I thought this was papering over a scaling problem in ext4.  The intent
is to protect userspace from arbitrarily large readdir results?

If that's the case, this should probably be proposed as a VFS level
option, and see how it's received there...

(Can you tell I'm not a huge fan of the idea?) ;)

-Eric

> You can fix this in other places in the software stack, but belt and
> suspenders is good, and if there is no reason for directories to grow
> larger than some set size, it's better to get a hard failure with an
> ENOSPC rather than an funny failures caused by OOM's or slower and
> slower performance.

>> Please also update Documentation/filesystems/ext4.txt so people
>> know for sure what the use & intent of this new knob is.
> 
> Yes, I'll do that in the next spin of the patches.
> 
>      	     	     	      	      - Ted
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Aug. 10, 2012, 9:58 p.m. UTC | #5
On Fri, Aug 10, 2012 at 04:11:24PM -0400, Jeff Moyer wrote:
> 
> I have no idea what a reasonable number for this would be.  Can you
> provide guidelines that would help admins understand what factors
> influence performance degradation due to directory size?

Well, for example, if you have a job which is a 512mb memory
container, and the directory has grown to 176mb, an attempt to readdir
said directory will cause that job to thrash badly, and perhaps get
killed by the OOM killer.  If you know that no sane directory should
ever grow beyond a single megabyte, you might pick a max_dir_size_kb
of 1024.

> Finally, I don't pretend to understand how your mount option parsing
> routines work, but based on what I see in this patch it looks like the
> default will be set to and enforced as 0.  What am I missing?

Sorry, I sent out the wrong version of the patch.  The limit was only
supposed to be used if maximum directory size is greater than 0; that
is, the default is that the directory size is unlimited, as before.
I'll send out a revised v2 version of the patch.

I view this as a very specialized option, but if you're running in a
tightly constrained memory cgroup, or a tiny EC2 instance, or the
equivalent Cloud Open VM, it might be a very useful thing to be able
to cap.

Regards,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Aug. 10, 2012, 9:59 p.m. UTC | #6
On Fri, Aug 10, 2012 at 03:16:57PM -0500, Eric Sandeen wrote:
> 
> Oh, I thought this was papering over a scaling problem in ext4.  The intent
> is to protect userspace from arbitrarily large readdir results?

The problem that arose was that the directory had grown to 176mb, and
in a tight memory cgroup (say, 512mb, or some such), this could cause
significant memory thrashing, or the actual OOM killing of the task in
said tight memory cgroup when it tried to read the entire directory
via readdir().

> If that's the case, this should probably be proposed as a VFS level
> option, and see how it's received there...

It's not really a VFS level thing, since the goal is to stop the
directory size itself from growing beyond a size limit (say, 1mb).

	       	      	   	   	    - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger Aug. 10, 2012, 11:14 p.m. UTC | #7
On 2012-08-10, at 3:58 PM, Theodore Ts'o wrote:
> On Fri, Aug 10, 2012 at 04:11:24PM -0400, Jeff Moyer wrote:
>> 
>> I have no idea what a reasonable number for this would be.  Can you
>> provide guidelines that would help admins understand what factors
>> influence performance degradation due to directory size?
> 
> Well, for example, if you have a job which is a 512mb memory
> container, and the directory has grown to 176mb, an attempt to readdir
> said directory will cause that job to thrash badly, and perhaps get
> killed by the OOM killer.  If you know that no sane directory should
> ever grow beyond a single megabyte, you might pick a max_dir_size_kb
> of 1024.

Actually, we've been carrying a very similar patch to this in Lustre
for a long time.  I didn't think this would be of interest to others
outside of Lustre, so I don't think I ever sent it upstream.

The reason we have this is that some HPC jobs might create 10k files
every hour in the same output directory, and if the user/job don't
pay attention to clean up the old files, then they might get many
millions of files in the same directory.  Simple operations like
"ls -l" in the directory will behave badly because GNU ls will read
and sort all of the entries first.  Even if "-U" is given to not sort
entries, ls will try to read (and by default stat for color) all the
entries before displaying them so that column widths can be made nice.
Similarly, "rm *" or other foolish things will break on large dirs for
naïve users (who are scientists and not sysadmins).

Operations may take many minutes on a huge directory, and users will
complain and call support when they think the filesystem has hung.
Instead, the admins limit the directory size and cause such applications
to fail early to alert the user that their application is behaving badly.

Of course, other sites want huge directories (10 billion files in one
directory is the latest number I've seen), so this has to be tunable.
We have a patch to fix the 2-level htree and 2GB directory size limits
already, in case that is of interest to anyone.

Cheers, Andreas

>> Finally, I don't pretend to understand how your mount option parsing
>> routines work, but based on what I see in this patch it looks like the
>> default will be set to and enforced as 0.  What am I missing?
> 
> Sorry, I sent out the wrong version of the patch.  The limit was only
> supposed to be used if maximum directory size is greater than 0; that
> is, the default is that the directory size is unlimited, as before.
> I'll send out a revised v2 version of the patch.
> 
> I view this as a very specialized option, but if you're running in a
> tightly constrained memory cgroup, or a tiny EC2 instance, or the
> equivalent Cloud Open VM, it might be a very useful thing to be able
> to cap.
> 
> Regards,
> 
> 						- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c3411d4..7c0841e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1243,6 +1243,7 @@  struct ext4_sb_info {
 	unsigned int s_mb_order2_reqs;
 	unsigned int s_mb_group_prealloc;
 	unsigned int s_max_writeback_mb_bump;
+	unsigned int s_max_dir_size_kb;
 	/* where last allocation was done - for stream allocation */
 	unsigned long s_mb_last_group;
 	unsigned long s_mb_last_start;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 2a42cc0..bdde668 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -55,6 +55,12 @@  static struct buffer_head *ext4_append(handle_t *handle,
 {
 	struct buffer_head *bh;
 
+	if (unlikely((inode->i_size >> 10) >=
+		     EXT4_SB(inode->i_sb)->s_max_dir_size_kb)) {
+		*err = -ENOSPC;
+		return NULL;
+	}
+
 	*block = inode->i_size >> inode->i_sb->s_blocksize_bits;
 
 	bh = ext4_bread(handle, inode, *block, 1, err);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 56bcaec..5896dcb 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1230,6 +1230,7 @@  enum {
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
+	Opt_max_dir_size_kb,
 };
 
 static const match_table_t tokens = {
@@ -1303,6 +1304,7 @@  static const match_table_t tokens = {
 	{Opt_init_itable, "init_itable=%u"},
 	{Opt_init_itable, "init_itable"},
 	{Opt_noinit_itable, "noinit_itable"},
+	{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
 	{Opt_removed, "check=none"},	/* mount option from ext2/3 */
 	{Opt_removed, "nocheck"},	/* mount option from ext2/3 */
 	{Opt_removed, "reservation"},	/* mount option from ext2/3 */
@@ -1483,6 +1485,7 @@  static const struct mount_opts {
 	{Opt_jqfmt_vfsold, QFMT_VFS_OLD, MOPT_QFMT},
 	{Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT},
 	{Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
+	{Opt_max_dir_size_kb, 0, MOPT_GTE0},
 	{Opt_err, 0, 0}
 };
 
@@ -1598,6 +1601,8 @@  static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 			if (!args->from)
 				arg = EXT4_DEF_LI_WAIT_MULT;
 			sbi->s_li_wait_mult = arg;
+		} else if (token == Opt_max_dir_size_kb) {
+			sbi->s_max_dir_size_kb = arg;
 		} else if (token == Opt_stripe) {
 			sbi->s_stripe = arg;
 		} else if (m->flags & MOPT_DATAJ) {
@@ -1829,6 +1834,8 @@  static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
 	if (nodefs || (test_opt(sb, INIT_INODE_TABLE) &&
 		       (sbi->s_li_wait_mult != EXT4_DEF_LI_WAIT_MULT)))
 		SEQ_OPTS_PRINT("init_itable=%u", sbi->s_li_wait_mult);
+	if (nodefs || sbi->s_max_dir_size_kb)
+		SEQ_OPTS_PRINT("max_dir_size_kb=%u", sbi->s_max_dir_size_kb);
 
 	ext4_show_quota_options(seq, sb);
 	return 0;