A request to reserve a "tree id" field on ext[34] inodes

Message ID 87skcb20ge.fsf@openvz.org
State Rejected, archived

Commit Message

Dmitry Monakhov Nov. 18, 2009, 5:43 p.m. UTC
Dmitry Monakhov <dmonakhov@openvz.org> writes:

> Andreas Dilger <adilger@sun.com> writes:
>
>> On 2009-11-17, at 06:04, Pavel Emelyanov wrote:
>>> We have a proposal to implement a 2-level disk quota on ext3 and ext4.
>>>
>>> In short, the aim is to have directories on ext3/4 partitions
>>> whose disk usage and inode count are limited. Further, the plan
>>> is to allow configuring uid and gid quotas within them.
>>>
>>> The main use of this is containers. When two or more of them are
>>> located on one disk, their roots will be marked with a unique tree id
>>> and thus the disk consumption of each container will be limited. To
>>> achieve this, having an id of the tree an inode belongs to is
>>> a key requirement.
>>
>> How do you handle files with multiple links, if they are located in
>> different trees?  The inode would need to have multiple tree ids.
> The short answer is "no": an inode cannot belong to multiple trees.
> Containers have some non-obvious specifics.
> Each container is isolated from the others as much as possible.
> A container has its own root tree. This tree is exported inside the
> CT in one of several possible ways (namespace, virtual-stack-fs, chroot).
>
> So a container's root is an independent tree, or several trees.
> Usually they are organized as follows: /ct_root/CT_${ID}/${tree_content}
> There are many reasons to keep these trees separate from one another:
>    - inode attr:
>      If an inode has links in both trees A and B, and A's user calls
>      chown() on this inode, then B's owner will be surprised.
>      The only way to overcome this is to virtualize inode attributes
>      (for each tree), which is madness IMHO.
>    - checkpoint/restore/online-backup:
>      This is like suspend/resume for a VM, but in this case only the
>      container's processes are stopped (frozen) for some time. After the
>      CT's processes are stopped we may back up the CT's tree without
>      freezing the FS as a whole.
> As I already said, there are many ways to accomplish this task, but every
> one of them has strong disadvantages:
> Virtual block devices (qemu-like): problems with consistency and performance
> ext3/4 + stack-fs (unionfs/vzfs): Bad failure resistance. It is
>         impossible to support a journalled quota file at the stack-fs level.
> XFS with proj quota: Lack of quota file journalling. XFS itself
>         (please don't blame me, but I'm really not a huge XFS fan)
>
> So the only way to implement journalled quota for containers is to
> implement it on native fs level.
>
> "Containers directory tree-id" assumptions:
> (1) The tree id is embedded inside the inode
> (2) The tree id is inherited from the parent dir
> (3) An inode cannot belong to different directory trees
>
> The default directory tree (with id == 0) has a special meaning:
> a directory which belongs to the default tree may contain roots of
> other trees. The default tree is used for subtree manipulation.
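
Assumptions (1) to (3) and the special role of tree 0 can be illustrated with a minimal userspace sketch. Everything here is hypothetical: `struct toy_inode`, `toy_inherit_tree_id()` and `toy_make_tree_root()` are illustration-only names, not kernel code, and the rule that a tree root may only be created directly under the default tree is an assumption read from the description above rather than something the mail states as an interface.

```c
#include <stdint.h>

/* Hypothetical in-memory inode: only the fields this sketch needs. */
struct toy_inode {
    uint32_t tree_id;  /* rule (1): the id lives in the inode; 0 is the default tree */
    int is_dir;
};

/* Rule (2): a newly created inode takes the tree id of its parent directory. */
static void toy_inherit_tree_id(struct toy_inode *child,
                                const struct toy_inode *parent)
{
    child->tree_id = parent->tree_id;
}

/*
 * Assumed policy (not spelled out in the mail): a directory can become the
 * root of tree `id` only if its parent belongs to the default tree.
 * Returns 0 on success, -1 if the request is refused.
 */
static int toy_make_tree_root(struct toy_inode *dir,
                              const struct toy_inode *parent, uint32_t id)
{
    if (!dir->is_dir || id == 0 || parent->tree_id != 0)
        return -1;
    dir->tree_id = id;
    return 0;
}
```

With these helpers, a directory created under /ct_root starts in tree 0, can be promoted to the root of a container tree, and everything created beneath it afterwards carries that container's id.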
>
> ->rename restriction:
>   if (S_ISDIR(old_inode->i_mode)) {
>       if ((new_dir->i_tree_id == 0) || /* move to default tree */
>           (new_dir->i_tree_id == old_inode->i_tree_id)) /* same tree */
>             goto good;
>       return -EXDEV;
>   } else {
>       /* If an entry has more than one link, it is a bad idea to allow
>          renaming it into a different tree (even the default tree),
>          because this would violate rule (3). */
>       if ((old_inode->i_nlink > 1) &&
>           (new_dir->i_tree_id != old_inode->i_tree_id))
>             return -EXDEV;
>   }
> ->link restriction: /* All links must belong to a single tree */
>    if (new_dir->i_tree_id != old_inode->i_tree_id)
>             return -EXDEV;
>
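
The rename and link restrictions quoted above can be exercised outside the kernel with a small sketch. It mirrors the pseudocode branch for branch; `struct tid_inode`, `tid_may_rename()` and `tid_may_link()` are hypothetical names used only for illustration, and the real checks would live in the filesystem's ->rename and ->link paths.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the inode fields the checks look at. */
struct tid_inode {
    uint32_t tree_id;   /* 0 is the default tree */
    unsigned int nlink; /* number of hard links */
    int is_dir;
};

/* Returns 0 if the rename is allowed, -EXDEV otherwise. */
static int tid_may_rename(const struct tid_inode *old_inode,
                          const struct tid_inode *new_dir)
{
    if (old_inode->is_dir) {
        /* Directories may move within their tree or into the default tree. */
        if (new_dir->tree_id == 0 || new_dir->tree_id == old_inode->tree_id)
            return 0;
        return -EXDEV;
    }
    /* A multiply linked file must stay in its tree to preserve rule (3). */
    if (old_inode->nlink > 1 && new_dir->tree_id != old_inode->tree_id)
        return -EXDEV;
    return 0;
}

/* Returns 0 if a new hard link in new_dir is allowed, -EXDEV otherwise. */
static int tid_may_link(const struct tid_inode *old_inode,
                        const struct tid_inode *new_dir)
{
    return new_dir->tree_id != old_inode->tree_id ? -EXDEV : 0;
}
```

Returning -EXDEV matches what rename(2) and link(2) already report for cross-filesystem operations, so tools such as mv fall back to copy and delete.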
>>
>> You can instead just store this data in an xattr (which will normally
>> be stored in the inode, so no performance impact), and then you are
>> free to store multiple values per inode.
> Yes, an xattr is possible, but struct ext4_xattr_entry is quite big, plus
> space for attr_name ..., and we only want 4 bytes.
From another point of view, it may be too expensive to reserve the last 4
bytes in EXT4_GOOD_OLD_INODE. At the same time, storing tree_id as an xattr
wastes space. But in fact the new inode has room for a
reservation. We may store it the way it is done for the i_version_hi field.
-=-=-=-
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Andreas Dilger Nov. 19, 2009, 6:33 a.m. UTC | #1
On 2009-11-18, at 09:43, Dmitry Monakhov wrote:
> From another point of view, it may be too expensive to reserve the last 4
> bytes in EXT4_GOOD_OLD_INODE. At the same time, storing tree_id as an xattr
> wastes space.

Since the xattr is stored inside the inode, and you are accessing this
from the kernel, it uses only 24 bytes of space for your tree ID
(a 20-byte ext4_xattr_entry, including a 3-byte name, plus 4 bytes of data).
It also has virtually no performance overhead because it is kept in
the inode itself.
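
The 24-byte figure can be reproduced with a short sketch. The struct below mirrors what I understand to be the fixed part of struct ext4_xattr_entry (16 bytes, followed by the name), and the rounding follows the usual 4-byte xattr padding; the names `xattr_entry_hdr`, `xattr_entry_len()`, `xattr_value_len()` and `xattr_total()` are hypothetical. It computes only the per-entry cost quoted here, not any once-per-inode xattr header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the fixed fields of struct ext4_xattr_entry. */
struct xattr_entry_hdr {
    uint8_t  e_name_len;    /* length of the name */
    uint8_t  e_name_index;  /* attribute name index (namespace) */
    uint16_t e_value_offs;  /* offset of the value */
    uint32_t e_value_block; /* unused for in-inode values */
    uint32_t e_value_size;  /* size of the value */
    uint32_t e_hash;        /* hash of name and value */
};
static_assert(sizeof(struct xattr_entry_hdr) == 16, "fixed part is 16 bytes");

#define XATTR_PAD   4u
#define XATTR_ROUND (XATTR_PAD - 1u)

/* Entry size: fixed header plus name, rounded up to a 4-byte boundary. */
static size_t xattr_entry_len(size_t name_len)
{
    return (name_len + XATTR_ROUND + sizeof(struct xattr_entry_hdr))
           & ~(size_t)XATTR_ROUND;
}

/* Value size, rounded up to a 4-byte boundary. */
static size_t xattr_value_len(size_t value_len)
{
    return (value_len + XATTR_ROUND) & ~(size_t)XATTR_ROUND;
}

/* Total space one in-inode attribute consumes. */
static size_t xattr_total(size_t name_len, size_t value_len)
{
    return xattr_entry_len(name_len) + xattr_value_len(value_len);
}
```

For a 3-byte name and a 4-byte value this gives 20 + 4 = 24 bytes, against 4 bytes for a dedicated inode field that every inode would then carry.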

If you consider that tree_id is unlikely to be commonly used, then
reserving this field in the inode (whether in the old inode or the
larger ext4 inode) would be "wasting" 4 bytes of space in everyone
else's inodes.

> But in fact the new inode has room for a
> reservation. We may store it the way it is done for the i_version_hi field.
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -494,6 +494,7 @@ struct ext4_inode {
> 	__le32  i_crtime;       /* File Creation time */
> 	__le32  i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
> 	__le32  i_version_hi;	/* high 32 bits for 64-bit version */
> +	__le32	i_disk_tree_id; /* directory tree quota id */
> };


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Patch

--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -494,6 +494,7 @@  struct ext4_inode {
 	__le32  i_crtime;       /* File Creation time */
 	__le32  i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
 	__le32  i_version_hi;	/* high 32 bits for 64-bit version */
+	__le32	i_disk_tree_id; /* directory tree quota id */
 };
 
 struct move_extent {
@@ -1112,6 +1113,7 @@  static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
 #define EXT4_FEATURE_INCOMPAT_64BIT		0x0080
 #define EXT4_FEATURE_INCOMPAT_MMP               0x0100
 #define EXT4_FEATURE_INCOMPAT_FLEX_BG		0x0200
+#define EXT4_FEATURE_INCOMPAT_TREE_ID		0x0400 /* directory tree id */
 
 #define EXT4_FEATURE_COMPAT_SUPP	EXT2_FEATURE_COMPAT_EXT_ATTR
 #define EXT4_FEATURE_INCOMPAT_SUPP	(EXT4_FEATURE_INCOMPAT_FILETYPE| \
@@ -1119,7 +1121,8 @@  static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
 					 EXT4_FEATURE_INCOMPAT_META_BG| \
 					 EXT4_FEATURE_INCOMPAT_EXTENTS| \
 					 EXT4_FEATURE_INCOMPAT_64BIT| \
-					 EXT4_FEATURE_INCOMPAT_FLEX_BG)
+					 EXT4_FEATURE_INCOMPAT_FLEX_BG| \
+					 EXT4_FEATURE_INCOMPAT_TREE_ID)
 #define EXT4_FEATURE_RO_COMPAT_SUPP	(EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER| \
 					 EXT4_FEATURE_RO_COMPAT_LARGE_FILE| \
 					 EXT4_FEATURE_RO_COMPAT_GDT_CSUM| \
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1534,6 +1534,15 @@  set_qf_format:
 			set_opt(sbi->s_mount_opt, I_VERSION);
 			sb->s_flags |= MS_I_VERSION;
 			break;
+		case Opt_tree_id:
+			if (!(EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_TREE_ID) &&
+				EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
+					EXT4_FITS_IN_INODE(raw_inode, ei, i_disk_tree_id))) {
+				ext4_msg(sb, KERN_ERR, "tree_id is not supported");
+				return 0;
+			}
+			set_opt(sbi->s_mount_opt, TREE_ID);
+			break;
 		case Opt_nodelalloc:
 			clear_opt(sbi->s_mount_opt, DELALLOC);
 			break;
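
The condition guarding Opt_tree_id combines three checks: the feature bit is set, the inode is larger than the 128-byte "good old" inode, and the new field lies inside the extra space that is actually in use. A userspace sketch of that logic follows, under stated assumptions: `struct toy_raw_inode` only approximates the tail of struct ext4_inode (the field order before i_crtime is reconstructed from memory, not taken from the patch), and `tree_id_fits()` and `tree_id_usable()` are hypothetical helpers, not the kernel's EXT4_FITS_IN_INODE macro.

```c
#include <stddef.h>
#include <stdint.h>

#define GOOD_OLD_INODE_SIZE      128u
#define FEATURE_INCOMPAT_TREE_ID 0x0400u /* value proposed in the patch */

/* Approximate on-disk layout: the first 128 bytes are treated as opaque. */
struct toy_raw_inode {
    uint8_t  good_old[GOOD_OLD_INODE_SIZE];
    uint16_t i_extra_isize;  /* bytes of extra space in use past offset 128 */
    uint16_t i_pad1;
    uint32_t i_ctime_extra;
    uint32_t i_mtime_extra;
    uint32_t i_atime_extra;
    uint32_t i_crtime;
    uint32_t i_crtime_extra;
    uint32_t i_version_hi;
    uint32_t i_disk_tree_id; /* field added by the patch */
};

/* Nonzero if i_disk_tree_id ends within the extra space currently in use. */
static int tree_id_fits(uint16_t extra_isize)
{
    size_t end = offsetof(struct toy_raw_inode, i_disk_tree_id)
                 + sizeof(uint32_t);
    return end <= (size_t)GOOD_OLD_INODE_SIZE + extra_isize;
}

/* All three conditions from the mount-option check, taken together. */
static int tree_id_usable(uint32_t feature_incompat, uint32_t inode_size,
                          uint16_t extra_isize)
{
    return (feature_incompat & FEATURE_INCOMPAT_TREE_ID) != 0 &&
           inode_size > GOOD_OLD_INODE_SIZE &&
           tree_id_fits(extra_isize);
}
```

In this layout the new field ends at byte 160, so an inode whose extra size covers only the fields up to i_version_hi (28 bytes) would not qualify, while one with 32 bytes of extra space would.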