diff mbox

ext4: xattr-in-inode support

Message ID 86611BEE-5695-4047-9404-D2D3E232318A@dilger.ca
State Superseded, archived
Headers show

Commit Message

Andreas Dilger April 13, 2017, 7:58 p.m. UTC
Large xattr support is implemented for EXT4_FEATURE_INCOMPAT_EA_INODE.

If the size of an xattr value is larger than will fit in a single
external block, then the xattr value will be saved into the body
of an external xattr inode.

The also helps support a larger number of xattr, since only the headers
will be stored in the in-inode space or the single external block.

The inode is referenced from the xattr header via "e_value_inum",
which was formerly "e_value_block", but that field was never used.
The e_value_size still contains the xattr size so that listing
xattrs does not need to look up the inode if the data is not accessed.

struct ext4_xattr_entry {
 	__u8	e_name_len;	/* length of name */
 	__u8	e_name_index;	/* attribute name index */
 	__le16	e_value_offs;	/* offset in disk block of value */
	__le32	e_value_inum;	/* inode in which value is stored */
 	__le32	e_value_size;	/* size of attribute value */
 	__le32	e_hash;		/* hash value of name and value */
 	char	e_name[0];	/* attribute name */
};

The xattr inode is marked with the EXT4_EA_INODE_FL flag and also
holds a back-reference to the owning inode in its i_mtime field,
allowing the ext4/e2fsck to verify the correct inode is accessed.

Lustre-Jira: https://jira.hpdd.intel.com/browse/LU-80
Lustre-bugzilla: https://bugzilla.lustre.org/show_bug.cgi?id=4424
Signed-off-by: Kalpak Shah <kalpak.shah@sun.com>
Signed-off-by: James Simmons <uja.ornl@gmail.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
---

Per recent discussion, here is the latest version of the xattr-in-inode
patch.  This has just been freshly updated to the current kernel (from
4.4) and has not even been compiled, so it is unlikely to work properly.
The functional parts of the feature and on-disk format are unchanged,
and is really what Ted is interested in.

Cheers, Andreas
--

Comments

Theodore Ts'o April 14, 2017, 1:27 p.m. UTC | #1
To summarize the discussion that we had on this week's ext4
teleconference call, while discussing ways in which we might extend
ext4's extended attributes to provide better support for Samba.

Andreas pointed out that we already have an unused field,
e_value_block, in ext4_xattr_entry structure:

struct ext4_xattr_entry {
	__u8	e_name_len;	/* length of name */
	__u8	e_name_index;	/* attribute name index */
	__le16	e_value_offs;	/* offset in disk block of value */
	__le32	e_value_block;	/* disk block attribute is stored on (n/i) */
	__le32	e_value_size;	/* size of attribute value */
	__le32	e_hash;		/* hash value of name and value */
	char	e_name[0];	/* attribute name */
};

It's only a 32-bit field, and it was repurposed in a Lustre-specific
feature, EXT4_FEATURE_INCOMPAT_EA_INODE as e_value_inum (since inodes
are only 32-bit today).  If this feature flag is enabled, then kernels
which understand the feature will treat e_value_block as an inode
number, and if it is non-zero, the value of that extended attribute is
stored in the inode.  This ends up burning a lot of extra inodes for
each extended attribute, which is why there was never much excitement
for this patch going upstream.

However, we could extend this feature (it will almost certainly
require a new INCOMPAT feature flag) such that a particular inode
could be referenced from multiple strut ext4_xattr_entry's (from
multiple inodes or from a single inode), since the inode for the xattr
body already has a ref count, i_links_count.  And given that on a
typical Windows CIFS file system, there will be dozens of unique
acl's, the problem of exhausting inodes for xattrs won't be a issue in
this case.


However, another approach that we discussed on the weekly conference
call was to change e_value_size to be an 16-bit field, and to use the
high 16 bits for flags, where one of the flags bits (say, the MSB)
would mean that e_value_block and e_value_size should be treated as a
48-bit block number, where the block could be stored.

Thinking about this some more, we can use another 4 bits from the high
bits of e_value_size as a 16 bit number n, where if n=0, the block
number is stored in e_Value_block and e_value_size as above, and if n
> 1, that there are additional blocks for the xattr value, which will
be stored in the place where the xattr value would normally be stored
(e.g, in the inline xattr space or in the external xattr block).

So pictorally, it would look like this:

+----------------+----------------+
| 128-byte inode | in-line xattr  |
+----------------+----------------+
                /                  \
               /                    \
              /                      \
  +---------------------------------------------+
  | XE | XE | XE |               | XV | XV | XV |   XE == xattr_entry   XV == xattr value
  +---------------------------------------------+
           /      \             /     \
          /        \           /       \
         /          \         /         \
    +--------------------+  +-------------+
    |   ...  | blk0 |... |  | blk1 | blk2 |
    +--------------------+  +-------------+

(to those using gmail; please view the above in a fixed-width font, or
use "show original")

So in this picture, XE is the ext4_xattr_entry, and in this case, the
high bits of e_value_size indicate e_value_block and the low bits of
e_value_size indicate the location of the first 4k block where the
xattr value is to be stored, and if one were to look at region of
memory indicated by e_value_offs, there would be two 8-byte block
numbers indicating the location of the 2nd and 3rd file system blocks
where the xattr value can be found.

In the external xattr value blocks, at the beginning of the first
block (e.g., at blk0), there will be an ext4_xattr_header, so we can
take advantage of h_refcount field, but with the following changes:

* The low 16 bits of h_blocks will be used for the size of the xattr;
  the high bits of h_blocks must be zero (for now).

* The h_hash field will be a crc32c of the value of the xattr stored
  in the external xattr value block(s).

* The h_checksum field will be calculated so that the crc32c covers
  only the ext4_xattr_header, instead of the entire xattrblock.  e.g.,
  crc32c(fs uuid || id || xattr header), where id is the inode number
  if refcount = 1, and blknum otherwise.

What are the advantages of this approach over the Lustre's
xattr-value-in-inode approach?  First, we don't need to burn inodes
for the xattr value.  This could potentially be an issue for Windows
SID's, since there the number of SID's is roughly equal to number of
users plus the number of groups.  And for a large enterprise with
O(100,000) employees, we could burn a pretty large number of inodes.
The other advantage of this scheme is that h_refcount field is 32
bits, where as the inode's i_links_count field is only 16 bits, and
there could very easily be more than 64k files that might share the
same Windows ACL or Windows SID.  So we would need to figure out some
way of dealing with an extended i_links_count field if we went with
the xattr-value-in-inode approach.


We don't need to make this to be an either-or choice, of course.  We
could integrate the Lustre approach as well as this latter approach
which is more optimized for Windows ACL's.  And I do want to reiterate
that this is just a rough sketch as a design doc.  I'm sure we may
want to make changes to it, but hopefully it will serve as a good
starting point for discussion.

Cheers,

						- Ted


On Thu, Apr 13, 2017 at 01:58:56PM -0600, Andreas Dilger wrote:
> Large xattr support is implemented for EXT4_FEATURE_INCOMPAT_EA_INODE.
> 
> If the size of an xattr value is larger than will fit in a single
> external block, then the xattr value will be saved into the body
> of an external xattr inode.
> 
> The also helps support a larger number of xattr, since only the headers
> will be stored in the in-inode space or the single external block.
> 
> The inode is referenced from the xattr header via "e_value_inum",
> which was formerly "e_value_block", but that field was never used.
> The e_value_size still contains the xattr size so that listing
> xattrs does not need to look up the inode if the data is not accessed.
> 
> struct ext4_xattr_entry {
>  	__u8	e_name_len;	/* length of name */
>  	__u8	e_name_index;	/* attribute name index */
>  	__le16	e_value_offs;	/* offset in disk block of value */
> 	__le32	e_value_inum;	/* inode in which value is stored */
>  	__le32	e_value_size;	/* size of attribute value */
>  	__le32	e_hash;		/* hash value of name and value */
>  	char	e_name[0];	/* attribute name */
> };
> 
> The xattr inode is marked with the EXT4_EA_INODE_FL flag and also
> holds a back-reference to the owning inode in its i_mtime field,
> allowing the ext4/e2fsck to verify the correct inode is accessed.
Alexey Lyahkov April 16, 2017, 7:09 p.m. UTC | #2
Andreas,

I don’t sure it’s good idea to allocate one more inode to store a large EA.
It dramatically decrease a speed with accessing a EA data in this case.
And now we have already a hit a limit of inode count with large disks.
I think it code need to be rewritten to use an special extents to store a large EA, as it avoid so much problems related to bad credits while unlinking a parent inode,
some kind problems with integer overflow as backlink stored on mdata field, and other.

I know we don’t hit a problems in this area for last year, but anyway - i prefer a different solution.

> 13 апр. 2017 г., в 22:58, Andreas Dilger <adilger@dilger.ca> написал(а):
> 
> Large xattr support is implemented for EXT4_FEATURE_INCOMPAT_EA_INODE.
> 
> If the size of an xattr value is larger than will fit in a single
> external block, then the xattr value will be saved into the body
> of an external xattr inode.
> 
> The also helps support a larger number of xattr, since only the headers
> will be stored in the in-inode space or the single external block.
> 
> The inode is referenced from the xattr header via "e_value_inum",
> which was formerly "e_value_block", but that field was never used.
> The e_value_size still contains the xattr size so that listing
> xattrs does not need to look up the inode if the data is not accessed.
> 
> struct ext4_xattr_entry {
> 	__u8	e_name_len;	/* length of name */
> 	__u8	e_name_index;	/* attribute name index */
> 	__le16	e_value_offs;	/* offset in disk block of value */
> 	__le32	e_value_inum;	/* inode in which value is stored */
> 	__le32	e_value_size;	/* size of attribute value */
> 	__le32	e_hash;		/* hash value of name and value */
> 	char	e_name[0];	/* attribute name */
> };
> 
> The xattr inode is marked with the EXT4_EA_INODE_FL flag and also
> holds a back-reference to the owning inode in its i_mtime field,
> allowing the ext4/e2fsck to verify the correct inode is accessed.
> 
> Lustre-Jira: https://jira.hpdd.intel.com/browse/LU-80
> Lustre-bugzilla: https://bugzilla.lustre.org/show_bug.cgi?id=4424
> Signed-off-by: Kalpak Shah <kalpak.shah@sun.com>
> Signed-off-by: James Simmons <uja.ornl@gmail.com>
> Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
> ---
> 
> Per recent discussion, here is the latest version of the xattr-in-inode
> patch.  This has just been freshly updated to the current kernel (from
> 4.4) and has not even been compiled, so it is unlikely to work properly.
> The functional parts of the feature and on-disk format are unchanged,
> and is really what Ted is interested in.
> 
> Cheers, Andreas
> --
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index fb69ee2..afe830b 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1797,6 +1797,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
> 					 EXT4_FEATURE_INCOMPAT_EXTENTS| \
> 					 EXT4_FEATURE_INCOMPAT_64BIT| \
> 					 EXT4_FEATURE_INCOMPAT_FLEX_BG| \
> +					 EXT4_FEATURE_INCOMPAT_EA_INODE| \
> 					 EXT4_FEATURE_INCOMPAT_MMP | \
> 					 EXT4_FEATURE_INCOMPAT_INLINE_DATA | \
> 					 EXT4_FEATURE_INCOMPAT_ENCRYPT | \
> @@ -2220,6 +2221,12 @@ struct mmpd_data {
> #define EXT4_MMP_MAX_CHECK_INTERVAL	300UL
> 
> /*
> + * Maximum size of xattr attributes for FEATURE_INCOMPAT_EA_INODE 1Mb
> + * This limit is arbitrary, but is reasonable for the xattr API.
> + */
> +#define EXT4_XATTR_MAX_LARGE_EA_SIZE    (1024 * 1024)
> +
> +/*
>  * Function prototypes
>  */
> 
> @@ -2231,6 +2238,10 @@ struct mmpd_data {
> # define ATTRIB_NORET	__attribute__((noreturn))
> # define NORET_AND	noreturn,
> 
> +struct ext4_xattr_ino_array {
> +	unsigned int xia_count;		/* # of used item in the array */
> +	unsigned int xia_inodes[0];
> +};
> /* bitmap.c */
> extern unsigned int ext4_count_free(char *bitmap, unsigned numchars);
> void ext4_inode_bitmap_csum_set(struct super_block *sb, ext4_group_t group,
> @@ -2480,6 +2491,7 @@ int do_journal_get_write_access(handle_t *handle,
> extern void ext4_get_inode_flags(struct ext4_inode_info *);
> extern int ext4_alloc_da_blocks(struct inode *inode);
> extern void ext4_set_aops(struct inode *inode);
> +extern int ext4_meta_trans_blocks(struct inode *, int nrblocks, int chunk);
> extern int ext4_writepage_trans_blocks(struct inode *);
> extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
> extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 17bc043..01eaad6 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -294,7 +294,6 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
> 	 * as writing the quota to disk may need the lock as well.
> 	 */
> 	dquot_initialize(inode);
> -	ext4_xattr_delete_inode(handle, inode);
> 	dquot_free_inode(inode);
> 	dquot_drop(inode);
> 
> diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
> index 375fb1c..9601496 100644
> --- a/fs/ext4/inline.c
> +++ b/fs/ext4/inline.c
> @@ -61,7 +61,7 @@ static int get_max_inline_xattr_value_size(struct inode *inode,
> 
> 	/* Compute min_offs. */
> 	for (; !IS_LAST_ENTRY(entry); entry = EXT4_XATTR_NEXT(entry)) {
> -		if (!entry->e_value_block && entry->e_value_size) {
> +		if (!entry->e_value_inum && entry->e_value_size) {
> 			size_t offs = le16_to_cpu(entry->e_value_offs);
> 			if (offs < min_offs)
> 				min_offs = offs;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b9ffa9f..70069e0 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -139,8 +139,6 @@ static void ext4_invalidatepage(struct page *page, unsigned int offset,
> 				unsigned int length);
> static int __ext4_journalled_writepage(struct page *page, unsigned int len);
> static int ext4_bh_delay_or_unwritten(handle_t *handle, struct buffer_head *bh);
> -static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
> -				  int pextents);
> 
> /*
>  * Test whether an inode is a fast symlink.
> @@ -189,6 +187,8 @@ void ext4_evict_inode(struct inode *inode)
> {
> 	handle_t *handle;
> 	int err;
> +	int extra_credits = 3;
> +	struct ext4_xattr_ino_array *lea_ino_array = NULL;
> 
> 	trace_ext4_evict_inode(inode);
> 
> @@ -238,8 +238,8 @@ void ext4_evict_inode(struct inode *inode)
> 	 * protection against it
> 	 */
> 	sb_start_intwrite(inode->i_sb);
> -	handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE,
> -				    ext4_blocks_for_truncate(inode)+3);
> +
> +	handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, extra_credits);
> 	if (IS_ERR(handle)) {
> 		ext4_std_error(inode->i_sb, PTR_ERR(handle));
> 		/*
> @@ -251,9 +251,36 @@ void ext4_evict_inode(struct inode *inode)
> 		sb_end_intwrite(inode->i_sb);
> 		goto no_delete;
> 	}
> -
> 	if (IS_SYNC(inode))
> 		ext4_handle_sync(handle);
> +
> +	/*
> +	 * Delete xattr inode before deleting the main inode.
> +	 */
> +	err = ext4_xattr_delete_inode(handle, inode, &lea_ino_array);
> +	if (err) {
> +		ext4_warning(inode->i_sb,
> +			     "couldn't delete inode's xattr (err %d)", err);
> +		goto stop_handle;
> +	}
> +
> +	if (!IS_NOQUOTA(inode))
> +		extra_credits += 2 * EXT4_QUOTA_DEL_BLOCKS(inode->i_sb);
> +
> +	if (!ext4_handle_has_enough_credits(handle,
> +			ext4_blocks_for_truncate(inode) + extra_credits)) {
> +		err = ext4_journal_extend(handle,
> +			ext4_blocks_for_truncate(inode) + extra_credits);
> +		if (err > 0)
> +			err = ext4_journal_restart(handle,
> +			ext4_blocks_for_truncate(inode) + extra_credits);
> +		if (err != 0) {
> +			ext4_warning(inode->i_sb,
> +				     "couldn't extend journal (err %d)", err);
> +			goto stop_handle;
> +		}
> +	}
> +
> 	inode->i_size = 0;
> 	err = ext4_mark_inode_dirty(handle, inode);
> 	if (err) {
> @@ -277,10 +304,10 @@ void ext4_evict_inode(struct inode *inode)
> 	 * enough credits left in the handle to remove the inode from
> 	 * the orphan list and set the dtime field.
> 	 */
> -	if (!ext4_handle_has_enough_credits(handle, 3)) {
> -		err = ext4_journal_extend(handle, 3);
> +	if (!ext4_handle_has_enough_credits(handle, extra_credits)) {
> +		err = ext4_journal_extend(handle, extra_credits);
> 		if (err > 0)
> -			err = ext4_journal_restart(handle, 3);
> +			err = ext4_journal_restart(handle, extra_credits);
> 		if (err != 0) {
> 			ext4_warning(inode->i_sb,
> 				     "couldn't extend journal (err %d)", err);
> @@ -315,8 +342,12 @@ void ext4_evict_inode(struct inode *inode)
> 		ext4_clear_inode(inode);
> 	else
> 		ext4_free_inode(handle, inode);
> +
> 	ext4_journal_stop(handle);
> 	sb_end_intwrite(inode->i_sb);
> +
> +	if (lea_ino_array != NULL)
> +		ext4_xattr_inode_array_free(inode, lea_ino_array);
> 	return;
> no_delete:
> 	ext4_clear_inode(inode);	/* We must guarantee clearing of inode... */
> @@ -5475,7 +5506,7 @@ static int ext4_index_trans_blocks(struct inode *inode, int lblocks,
>  *
>  * Also account for superblock, inode, quota and xattr blocks
>  */
> -static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
> +int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
> 				  int pextents)
> {
> 	ext4_group_t groups, ngroups = ext4_get_groups_count(inode->i_sb);
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 996e790..f158798 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -190,9 +190,8 @@ static void ext4_xattr_block_csum_set(struct inode *inode,
> 
> 	/* Check the values */
> 	while (!IS_LAST_ENTRY(entry)) {
> -		if (entry->e_value_block != 0)
> -			return -EFSCORRUPTED;
> -		if (entry->e_value_size != 0) {
> +		if (entry->e_value_size != 0 &&
> +		    entry->e_value_inum == 0) {
> 			u16 offs = le16_to_cpu(entry->e_value_offs);
> 			u32 size = le32_to_cpu(entry->e_value_size);
> 			void *value;
> @@ -258,19 +257,26 @@ static void ext4_xattr_block_csum_set(struct inode *inode,
> 	__xattr_check_inode((inode), (header), (end), __func__, __LINE__)
> 
> static inline int
> -ext4_xattr_check_entry(struct ext4_xattr_entry *entry, size_t size)
> +ext4_xattr_check_entry(struct ext4_xattr_entry *entry, size_t size,
> +		       struct inode *inode)
> {
> 	size_t value_size = le32_to_cpu(entry->e_value_size);
> 
> -	if (entry->e_value_block != 0 || value_size > size ||
> +	if (!entry->e_value_inum &&
> 	    le16_to_cpu(entry->e_value_offs) + value_size > size)
> 		return -EFSCORRUPTED;
> +	if (entry->e_value_inum &&
> +	    (le32_to_cpu(entry->e_value_inum) < EXT4_FIRST_INO(inode->i_sb) ||
> +	     le32_to_cpu(entry->e_value_inum) >
> +	     le32_to_cpu(EXT4_SB(inode->i_sb)->s_es->s_inodes_count)))
> +		return -EFSCORRUPTED;
> 	return 0;
> }
> 
> static int
> ext4_xattr_find_entry(struct ext4_xattr_entry **pentry, int name_index,
> -		      const char *name, size_t size, int sorted)
> +		      const char *name, size_t size, int sorted,
> +		      struct inode *inode)
> {
> 	struct ext4_xattr_entry *entry;
> 	size_t name_len;
> @@ -290,11 +296,104 @@ static void ext4_xattr_block_csum_set(struct inode *inode,
> 			break;
> 	}
> 	*pentry = entry;
> -	if (!cmp && ext4_xattr_check_entry(entry, size))
> +	if (!cmp && ext4_xattr_check_entry(entry, size, inode))
> 		return -EFSCORRUPTED;
> 	return cmp ? -ENODATA : 0;
> }
> 
> +/*
> + * Read the EA value from an inode.
> + */
> +static int
> +ext4_xattr_inode_read(struct inode *ea_inode, void *buf, size_t *size)
> +{
> +	unsigned long block = 0;
> +	struct buffer_head *bh = NULL;
> +	int blocksize;
> +	size_t csize, ret_size = 0;
> +
> +	if (*size == 0)
> +		return 0;
> +
> +	blocksize = ea_inode->i_sb->s_blocksize;
> +
> +	while (ret_size < *size) {
> +		csize = (*size - ret_size) > blocksize ? blocksize :
> +							*size - ret_size;
> +		bh = ext4_bread(NULL, ea_inode, block, 0);
> +		if (IS_ERR(bh)) {
> +			*size = ret_size;
> +			return PTR_ERR(bh);
> +		}
> +		memcpy(buf, bh->b_data, csize);
> +		brelse(bh);
> +
> +		buf += csize;
> +		block += 1;
> +		ret_size += csize;
> +	}
> +
> +	*size = ret_size;
> +
> +	return 0;
> +}
> +
> +struct inode *ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino, int *err)
> +{
> +	struct inode *ea_inode = NULL;
> +
> +	ea_inode = ext4_iget(parent->i_sb, ea_ino);
> +	if (IS_ERR(ea_inode) || is_bad_inode(ea_inode)) {
> +		int rc = IS_ERR(ea_inode) ? PTR_ERR(ea_inode) : 0;
> +		ext4_error(parent->i_sb, "error while reading EA inode %lu "
> +			   "/ %d %d", ea_ino, rc, is_bad_inode(ea_inode));
> +		*err = rc != 0 ? rc : -EIO;
> +		return NULL;
> +	}
> +
> +	if (EXT4_XATTR_INODE_GET_PARENT(ea_inode) != parent->i_ino ||
> +	    ea_inode->i_generation != parent->i_generation) {
> +		ext4_error(parent->i_sb, "Backpointer from EA inode %lu "
> +			   "to parent invalid.", ea_ino);
> +		*err = -EINVAL;
> +		goto error;
> +	}
> +
> +	if (!(EXT4_I(ea_inode)->i_flags & EXT4_EA_INODE_FL)) {
> +		ext4_error(parent->i_sb, "EA inode %lu does not have "
> +			   "EXT4_EA_INODE_FL flag set.\n", ea_ino);
> +		*err = -EINVAL;
> +		goto error;
> +	}
> +
> +	*err = 0;
> +	return ea_inode;
> +
> +error:
> +	iput(ea_inode);
> +	return NULL;
> +}
> +
> +/*
> + * Read the value from the EA inode.
> + */
> +static int
> +ext4_xattr_inode_get(struct inode *inode, unsigned long ea_ino, void *buffer,
> +		     size_t *size)
> +{
> +	struct inode *ea_inode = NULL;
> +	int err;
> +
> +	ea_inode = ext4_xattr_inode_iget(inode, ea_ino, &err);
> +	if (err)
> +		return err;
> +
> +	err = ext4_xattr_inode_read(ea_inode, buffer, size);
> +	iput(ea_inode);
> +
> +	return err;
> +}
> +
> static int
> ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
> 		     void *buffer, size_t buffer_size)
> @@ -327,7 +426,8 @@ static void ext4_xattr_block_csum_set(struct inode *inode,
> 	}
> 	ext4_xattr_cache_insert(ext4_mb_cache, bh);
> 	entry = BFIRST(bh);
> -	error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
> +	error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1,
> +				      inode);
> 	if (error == -EFSCORRUPTED)
> 		goto bad_block;
> 	if (error)
> @@ -337,8 +437,16 @@ static void ext4_xattr_block_csum_set(struct inode *inode,
> 		error = -ERANGE;
> 		if (size > buffer_size)
> 			goto cleanup;
> -		memcpy(buffer, bh->b_data + le16_to_cpu(entry->e_value_offs),
> -		       size);
> +		if (entry->e_value_inum) {
> +			error = ext4_xattr_inode_get(inode,
> +					     le32_to_cpu(entry->e_value_inum),
> +					     buffer, &size);
> +			if (error)
> +				goto cleanup;
> +		} else {
> +			memcpy(buffer, bh->b_data +
> +			       le16_to_cpu(entry->e_value_offs), size);
> +		}
> 	}
> 	error = size;
> 
> @@ -372,7 +480,7 @@ static void ext4_xattr_block_csum_set(struct inode *inode,
> 	if (error)
> 		goto cleanup;
> 	error = ext4_xattr_find_entry(&entry, name_index, name,
> -				      end - (void *)entry, 0);
> +				      end - (void *)entry, 0, inode);
> 	if (error)
> 		goto cleanup;
> 	size = le32_to_cpu(entry->e_value_size);
> @@ -380,8 +488,16 @@ static void ext4_xattr_block_csum_set(struct inode *inode,
> 		error = -ERANGE;
> 		if (size > buffer_size)
> 			goto cleanup;
> -		memcpy(buffer, (void *)IFIRST(header) +
> -		       le16_to_cpu(entry->e_value_offs), size);
> +		if (entry->e_value_inum) {
> +			error = ext4_xattr_inode_get(inode,
> +					     le32_to_cpu(entry->e_value_inum),
> +					     buffer, &size);
> +			if (error)
> +				goto cleanup;
> +		} else {
> +			memcpy(buffer, (void *)IFIRST(header) +
> +			       le16_to_cpu(entry->e_value_offs), size);
> +		}
> 	}
> 	error = size;
> 
> @@ -648,7 +764,7 @@ static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
> 				    size_t *min_offs, void *base, int *total)
> {
> 	for (; !IS_LAST_ENTRY(last); last = EXT4_XATTR_NEXT(last)) {
> -		if (last->e_value_size) {
> +		if (!last->e_value_inum && last->e_value_size) {
> 			size_t offs = le16_to_cpu(last->e_value_offs);
> 			if (offs < *min_offs)
> 				*min_offs = offs;
> @@ -659,16 +775,172 @@ static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
> 	return (*min_offs - ((void *)last - base) - sizeof(__u32));
> }
> 
> -static int
> -ext4_xattr_set_entry(struct ext4_xattr_info *i, struct ext4_xattr_search *s)
> +/*
> + * Write the value of the EA in an inode.
> + */
> +static int ext4_xattr_inode_write(handle_t *handle, struct inode *ea_inode,
> +				  const void *buf, int bufsize)
> +{
> +	struct buffer_head *bh = NULL;
> +	unsigned long block = 0;
> +	unsigned blocksize = ea_inode->i_sb->s_blocksize;
> +	unsigned max_blocks = (bufsize + blocksize - 1) >> ea_inode->i_blkbits;
> +	int csize, wsize = 0;
> +	int ret = 0;
> +	int retries = 0;
> +
> +retry:
> +	while (ret >= 0 && ret < max_blocks) {
> +		struct ext4_map_blocks map;
> +		map.m_lblk = block += ret;
> +		map.m_len = max_blocks -= ret;
> +
> +		ret = ext4_map_blocks(handle, ea_inode, &map,
> +				      EXT4_GET_BLOCKS_CREATE);
> +		if (ret <= 0) {
> +			ext4_mark_inode_dirty(handle, ea_inode);
> +			if (ret == -ENOSPC &&
> +			    ext4_should_retry_alloc(ea_inode->i_sb, &retries)) {
> +				ret = 0;
> +				goto retry;
> +			}
> +			break;
> +		}
> +	}
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	block = 0;
> +	while (wsize < bufsize) {
> +		if (bh != NULL)
> +			brelse(bh);
> +		csize = (bufsize - wsize) > blocksize ? blocksize :
> +								bufsize - wsize;
> +		bh = ext4_getblk(handle, ea_inode, block, 0);
> +		if (IS_ERR(bh)) {
> +			ret = PTR_ERR(bh);
> +			goto out;
> +		}
> +		ret = ext4_journal_get_write_access(handle, bh);
> +		if (ret)
> +			goto out;
> +
> +		memcpy(bh->b_data, buf, csize);
> +		set_buffer_uptodate(bh);
> +		ext4_handle_dirty_metadata(handle, ea_inode, bh);
> +
> +		buf += csize;
> +		wsize += csize;
> +		block += 1;
> +	}
> +
> +	mutex_lock(&ea_inode->i_mutex);
> +	i_size_write(ea_inode, wsize);
> +	ext4_update_i_disksize(ea_inode, wsize);
> +	mutex_unlock(&ea_inode->i_mutex);
> +
> +	ext4_mark_inode_dirty(handle, ea_inode);
> +
> +out:
> +	brelse(bh);
> +
> +	return ret;
> +}
> +
> +/*
> + * Create an inode to store the value of a large EA.
> + */
> +static struct inode *ext4_xattr_inode_create(handle_t *handle,
> +					     struct inode *inode)
> +{
> +	struct inode *ea_inode = NULL;
> +
> +	/*
> +	 * Let the next inode be the goal, so we try and allocate the EA inode
> +	 * in the same group, or nearby one.
> +	 */
> +	ea_inode = ext4_new_inode(handle, inode->i_sb->s_root->d_inode,
> +				  S_IFREG | 0600, NULL, inode->i_ino + 1, NULL);
> +	if (!IS_ERR(ea_inode)) {
> +		ea_inode->i_op = &ext4_file_inode_operations;
> +		ea_inode->i_fop = &ext4_file_operations;
> +		ext4_set_aops(ea_inode);
> +		ea_inode->i_generation = inode->i_generation;
> +		EXT4_I(ea_inode)->i_flags |= EXT4_EA_INODE_FL;
> +
> +		/*
> +		 * A back-pointer from EA inode to parent inode will be useful
> +		 * for e2fsck.
> +		 */
> +		EXT4_XATTR_INODE_SET_PARENT(ea_inode, inode->i_ino);
> +		unlock_new_inode(ea_inode);
> +	}
> +
> +	return ea_inode;
> +}
> +
> +/*
> + * Unlink the inode storing the value of the EA.
> + */
> +int ext4_xattr_inode_unlink(struct inode *inode, unsigned long ea_ino)
> +{
> +	struct inode *ea_inode = NULL;
> +	int err;
> +
> +	ea_inode = ext4_xattr_inode_iget(inode, ea_ino, &err);
> +	if (err)
> +		return err;
> +
> +	clear_nlink(ea_inode);
> +	iput(ea_inode);
> +
> +	return 0;
> +}
> +
> +/*
> + * Add value of the EA in an inode.
> + */
> +static int ext4_xattr_inode_set(handle_t *handle, struct inode *inode,
> +				unsigned long *ea_ino, const void *value,
> +				size_t value_len)
> +{
> +	struct inode *ea_inode;
> +	int err;
> +
> +	/* Create an inode for the EA value */
> +	ea_inode = ext4_xattr_inode_create(handle, inode);
> +	if (IS_ERR(ea_inode))
> +		return PTR_ERR(ea_inode);
> +
> +	err = ext4_xattr_inode_write(handle, ea_inode, value, value_len);
> +	if (err)
> +		clear_nlink(ea_inode);
> +	else
> +		*ea_ino = ea_inode->i_ino;
> +
> +	iput(ea_inode);
> +
> +	return err;
> +}
> +
> +static int ext4_xattr_set_entry(struct ext4_xattr_info *i,
> +				struct ext4_xattr_search *s,
> +				handle_t *handle, struct inode *inode)
> {
> 	struct ext4_xattr_entry *last;
> 	size_t free, min_offs = s->end - s->base, name_len = strlen(i->name);
> +	int in_inode = i->in_inode;
> +
> +	if (ext4_feature_incompat(inode->i_sb, EA_INODE) &&
> +	    (EXT4_XATTR_SIZE(i->value_len) >
> +	     EXT4_XATTR_MIN_LARGE_EA_SIZE(inode->i_sb->s_blocksize)))
> +		in_inode = 1;
> 
> 	/* Compute min_offs and last. */
> 	last = s->first;
> 	for (; !IS_LAST_ENTRY(last); last = EXT4_XATTR_NEXT(last)) {
> -		if (last->e_value_size) {
> +		if (!last->e_value_inum && last->e_value_size) {
> 			size_t offs = le16_to_cpu(last->e_value_offs);
> 			if (offs < min_offs)
> 				min_offs = offs;
> @@ -676,15 +948,20 @@ static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
> 	}
> 	free = min_offs - ((void *)last - s->base) - sizeof(__u32);
> 	if (!s->not_found) {
> -		if (s->here->e_value_size) {
> +		if (!in_inode &&
> +		    !s->here->e_value_inum && s->here->e_value_size) {
> 			size_t size = le32_to_cpu(s->here->e_value_size);
> 			free += EXT4_XATTR_SIZE(size);
> 		}
> 		free += EXT4_XATTR_LEN(name_len);
> 	}
> 	if (i->value) {
> -		if (free < EXT4_XATTR_LEN(name_len) +
> -			   EXT4_XATTR_SIZE(i->value_len))
> +		size_t value_len = EXT4_XATTR_SIZE(i->value_len);
> +
> +		if (in_inode)
> +			value_len = 0;
> +
> +		if (free < EXT4_XATTR_LEN(name_len) + value_len)
> 			return -ENOSPC;
> 	}
> 
> @@ -698,7 +975,8 @@ static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
> 		s->here->e_name_len = name_len;
> 		memcpy(s->here->e_name, i->name, name_len);
> 	} else {
> -		if (s->here->e_value_size) {
> +		if (!s->here->e_value_inum && s->here->e_value_size &&
> +		    s->here->e_value_offs > 0) {
> 			void *first_val = s->base + min_offs;
> 			size_t offs = le16_to_cpu(s->here->e_value_offs);
> 			void *val = s->base + offs;
> @@ -732,12 +1010,18 @@ static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
> 			last = s->first;
> 			while (!IS_LAST_ENTRY(last)) {
> 				size_t o = le16_to_cpu(last->e_value_offs);
> -				if (last->e_value_size && o < offs)
> +				if (!last->e_value_inum &&
> +				    last->e_value_size && o < offs)
> 					last->e_value_offs =
> 						cpu_to_le16(o + size);
> 				last = EXT4_XATTR_NEXT(last);
> 			}
> 		}
> +		if (s->here->e_value_inum) {
> +			ext4_xattr_inode_unlink(inode,
> +					    le32_to_cpu(s->here->e_value_inum);
> +			s->here->e_value_inum = 0;
> +		}
> 		if (!i->value) {
> 			/* Remove the old name. */
> 			size_t size = EXT4_XATTR_LEN(name_len);
> @@ -750,11 +1034,20 @@ static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
> 
> 	if (i->value) {
> 		/* Insert the new value. */
> -		s->here->e_value_size = cpu_to_le32(i->value_len);
> -		if (i->value_len) {
> +		if (in_inode) {
> +			unsigned long ea_ino =
> +				le32_to_cpu(s->here->e_value_inum);
> +			rc = ext4_xattr_inode_set(handle, inode, &ea_ino,
> +						  i->value, i->value_len);
> +			if (rc)
> +				goto out;
> +			s->here->e_value_inum = cpu_to_le32(ea_ino);
> +			s->here->e_value_offs = 0;
> +		} else if (i->value_len) {
> 			size_t size = EXT4_XATTR_SIZE(i->value_len);
> 			void *val = s->base + min_offs - size;
> 			s->here->e_value_offs = cpu_to_le16(min_offs - size);
> +			s->here->e_value_inum = 0;
> 			if (i->value == EXT4_ZERO_XATTR_VALUE) {
> 				memset(val, 0, size);
> 			} else {
> @@ -764,8 +1057,11 @@ static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
> 				memcpy(val, i->value, i->value_len);
> 			}
> 		}
> +		s->here->e_value_size = cpu_to_le32(i->value_len);
> 	}
> -	return 0;
> +
> +out:
> +	return rc;
> }
> 
> struct ext4_xattr_block_find {
> @@ -804,7 +1100,7 @@ struct ext4_xattr_block_find {
> 		bs->s.end = bs->bh->b_data + bs->bh->b_size;
> 		bs->s.here = bs->s.first;
> 		error = ext4_xattr_find_entry(&bs->s.here, i->name_index,
> -					      i->name, bs->bh->b_size, 1);
> +					     i->name, bs->bh->b_size, 1, inode);
> 		if (error && error != -ENODATA)
> 			goto cleanup;
> 		bs->s.not_found = error;
> @@ -829,8 +1125,6 @@ struct ext4_xattr_block_find {
> 
> #define header(x) ((struct ext4_xattr_header *)(x))
> 
> -	if (i->value && i->value_len > sb->s_blocksize)
> -		return -ENOSPC;
> 	if (s->base) {
> 		BUFFER_TRACE(bs->bh, "get_write_access");
> 		error = ext4_journal_get_write_access(handle, bs->bh);
> @@ -849,7 +1143,7 @@ struct ext4_xattr_block_find {
> 			mb_cache_entry_delete_block(ext4_mb_cache, hash,
> 						    bs->bh->b_blocknr);
> 			ea_bdebug(bs->bh, "modifying in-place");
> -			error = ext4_xattr_set_entry(i, s);
> +			error = ext4_xattr_set_entry(i, s, handle, inode);
> 			if (!error) {
> 				if (!IS_LAST_ENTRY(s->first))
> 					ext4_xattr_rehash(header(s->base),
> @@ -898,7 +1192,7 @@ struct ext4_xattr_block_find {
> 		s->end = s->base + sb->s_blocksize;
> 	}
> 
> -	error = ext4_xattr_set_entry(i, s);
> +	error = ext4_xattr_set_entry(i, s, handle, inode);
> 	if (error == -EFSCORRUPTED)
> 		goto bad_block;
> 	if (error)
> @@ -1077,7 +1371,7 @@ int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
> 		/* Find the named attribute. */
> 		error = ext4_xattr_find_entry(&is->s.here, i->name_index,
> 					      i->name, is->s.end -
> -					      (void *)is->s.base, 0);
> +					      (void *)is->s.base, 0, inode);
> 		if (error && error != -ENODATA)
> 			return error;
> 		is->s.not_found = error;
> @@ -1095,7 +1389,7 @@ int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
> 
> 	if (EXT4_I(inode)->i_extra_isize == 0)
> 		return -ENOSPC;
> -	error = ext4_xattr_set_entry(i, s);
> +	error = ext4_xattr_set_entry(i, s, handle, inode);
> 	if (error) {
> 		if (error == -ENOSPC &&
> 		    ext4_has_inline_data(inode)) {
> @@ -1107,7 +1401,7 @@ int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
> 			error = ext4_xattr_ibody_find(inode, i, is);
> 			if (error)
> 				return error;
> -			error = ext4_xattr_set_entry(i, s);
> +			error = ext4_xattr_set_entry(i, s, handle, inode);
> 		}
> 		if (error)
> 			return error;
> @@ -1133,7 +1427,7 @@ static int ext4_xattr_ibody_set(struct inode *inode,
> 
> 	if (EXT4_I(inode)->i_extra_isize == 0)
> 		return -ENOSPC;
> -	error = ext4_xattr_set_entry(i, s);
> +	error = ext4_xattr_set_entry(i, s, handle, inode);
> 	if (error)
> 		return error;
> 	header = IHDR(inode, ext4_raw_inode(&is->iloc));
> @@ -1180,7 +1474,7 @@ static int ext4_xattr_value_same(struct ext4_xattr_search *s,
> 		.name = name,
> 		.value = value,
> 		.value_len = value_len,
> -
> +		.in_inode = 0,
> 	};
> 	struct ext4_xattr_ibody_find is = {
> 		.s = { .not_found = -ENODATA, },
> @@ -1250,6 +1544,15 @@ static int ext4_xattr_value_same(struct ext4_xattr_search *s,
> 					goto cleanup;
> 			}
> 			error = ext4_xattr_block_set(handle, inode, &i, &bs);
> +			if (EXT4_HAS_INCOMPAT_FEATURE(inode->i_sb,
> +					EXT4_FEATURE_INCOMPAT_EA_INODE) &&
> +			    error == -ENOSPC) {
> +				/* xattr not fit to block, store at external
> +				 * inode */
> +				i.in_inode = 1;
> +				error = ext4_xattr_ibody_set(handle, inode,
> +							     &i, &is);
> +			}
> 			if (error)
> 				goto cleanup;
> 			if (!is.s.not_found) {
> @@ -1293,9 +1596,22 @@ static int ext4_xattr_value_same(struct ext4_xattr_search *s,
> 	       const void *value, size_t value_len, int flags)
> {
> 	handle_t *handle;
> +	struct super_block *sb = inode->i_sb;
> 	int error, retries = 0;
> 	int credits = ext4_jbd2_credits_xattr(inode);
> 
> +	if ((value_len >= EXT4_XATTR_MIN_LARGE_EA_SIZE(sb->s_blocksize)) &&
> +	    EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EA_INODE)) {
> +		int nrblocks = (value_len + sb->s_blocksize - 1) >>
> +					sb->s_blocksize_bits;
> +
> +		/* For new inode */
> +		credits += EXT4_SINGLEDATA_TRANS_BLOCKS(sb) + 3;
> +
> +		/* For data blocks of EA inode */
> +		credits += ext4_meta_trans_blocks(inode, nrblocks, 0);
> +	}
> +
> retry:
> 	handle = ext4_journal_start(inode, EXT4_HT_XATTR, credits);
> 	if (IS_ERR(handle)) {
> @@ -1307,7 +1623,7 @@ static int ext4_xattr_value_same(struct ext4_xattr_search *s,
> 					      value, value_len, flags);
> 		error2 = ext4_journal_stop(handle);
> 		if (error == -ENOSPC &&
> -		    ext4_should_retry_alloc(inode->i_sb, &retries))
> +		    ext4_should_retry_alloc(sb, &retries))
> 			goto retry;
> 		if (error == 0)
> 			error = error2;
> @@ -1332,7 +1648,7 @@ static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry,
> 
> 	/* Adjust the value offsets of the entries */
> 	for (; !IS_LAST_ENTRY(last); last = EXT4_XATTR_NEXT(last)) {
> -		if (last->e_value_size) {
> +		if (!last->e_value_inum && last->e_value_size) {
> 			new_offs = le16_to_cpu(last->e_value_offs) +
> 							value_offs_shift;
> 			last->e_value_offs = cpu_to_le16(new_offs);
> @@ -1593,21 +1909,135 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
> }
> 
> 
> +#define EIA_INCR 16 /* must be 2^n */
> +#define EIA_MASK (EIA_INCR - 1)
> +/* Add the large xattr @ino into @lea_ino_array for later deletion.
> + * If @lea_ino_array is new or full it will be grown and the old
> + * contents copied over.
> + */
> +static int
> +ext4_expand_ino_array(struct ext4_xattr_ino_array **lea_ino_array, __u32 ino)
> +{
> +	if (*lea_ino_array == NULL) {
> +		/*
> +		 * Start with 15 inodes, so it fits into a power-of-two size.
> +		 * If *lea_ino_array is NULL, this is essentially offsetof()
> +		 */
> +		(*lea_ino_array) =
> +			kmalloc(offsetof(struct ext4_xattr_ino_array,
> +					 xia_inodes[EIA_MASK]),
> +				GFP_NOFS);
> +		if (*lea_ino_array == NULL)
> +			return -ENOMEM;
> +		(*lea_ino_array)->xia_count = 0;
> +	} else if (((*lea_ino_array)->xia_count & EIA_MASK) == EIA_MASK) {
> +		/* expand the array once all 15 + n * 16 slots are full */
> +		struct ext4_xattr_ino_array *new_array = NULL;
> +		int count = (*lea_ino_array)->xia_count;
> +
> +		/* if new_array is NULL, this is essentially offsetof() */
> +		new_array = kmalloc(
> +				offsetof(struct ext4_xattr_ino_array,
> +					 xia_inodes[count + EIA_INCR]),
> +				GFP_NOFS);
> +		if (new_array == NULL)
> +			return -ENOMEM;
> +		memcpy(new_array, *lea_ino_array,
> +		       offsetof(struct ext4_xattr_ino_array,
> +				xia_inodes[count]));
> +		kfree(*lea_ino_array);
> +		*lea_ino_array = new_array;
> +	}
> +	(*lea_ino_array)->xia_inodes[(*lea_ino_array)->xia_count++] = ino;
> +	return 0;
> +}
> +
> +/**
> + * Add xattr inode to orphan list
> + */
> +static int
> +ext4_xattr_inode_orphan_add(handle_t *handle, struct inode *inode,
> +			int credits, struct ext4_xattr_ino_array *lea_ino_array)
> +{
> +	struct inode *ea_inode = NULL;
> +	int idx = 0, error = 0;
> +
> +	if (lea_ino_array == NULL)
> +		return 0;
> +
> +	for (; idx < lea_ino_array->xia_count; ++idx) {
> +		if (!ext4_handle_has_enough_credits(handle, credits)) {
> +			error = ext4_journal_extend(handle, credits);
> +			if (error > 0)
> +				error = ext4_journal_restart(handle, credits);
> +
> +			if (error != 0) {
> +				ext4_warning(inode->i_sb,
> +					"couldn't extend journal "
> +					"(err %d)", error);
> +				return error;
> +			}
> +		}
> +		ea_inode = ext4_xattr_inode_iget(inode,
> +				lea_ino_array->xia_inodes[idx], &error);
> +		if (error)
> +			continue;
> +		ext4_orphan_add(handle, ea_inode);
> +		/* the inode's i_count will be released by caller */
> +	}
> +
> +	return 0;
> +}
> 
> /*
>  * ext4_xattr_delete_inode()
>  *
> - * Free extended attribute resources associated with this inode. This
> + * Free extended attribute resources associated with this inode. Traverse
> + * all entries and unlink any xattr inodes associated with this inode. This
>  * is called immediately before an inode is freed. We have exclusive
> - * access to the inode.
> + * access to the inode. If an orphan inode is deleted it will also delete any
> + * xattr block and all xattr inodes. They are checked by ext4_xattr_inode_iget()
> + * to ensure they belong to the parent inode and were not deleted already.
>  */
> -void
> -ext4_xattr_delete_inode(handle_t *handle, struct inode *inode)
> +int
> +ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
> +			struct ext4_xattr_ino_array **lea_ino_array)
> {
> 	struct buffer_head *bh = NULL;
> +	struct ext4_xattr_ibody_header *header;
> +	struct ext4_inode *raw_inode;
> +	struct ext4_iloc iloc;
> +	struct ext4_xattr_entry *entry;
> +	int credits = 3, error = 0;
> 
> -	if (!EXT4_I(inode)->i_file_acl)
> +	if (!ext4_test_inode_state(inode, EXT4_STATE_XATTR))
> +		goto delete_external_ea;
> +
> +	error = ext4_get_inode_loc(inode, &iloc);
> +	if (error)
> +		goto cleanup;
> +	raw_inode = ext4_raw_inode(&iloc);
> +	header = IHDR(inode, raw_inode);
> +	for (entry = IFIRST(header); !IS_LAST_ENTRY(entry);
> +	     entry = EXT4_XATTR_NEXT(entry)) {
> +		if (!entry->e_value_inum)
> +			continue;
> +		if (ext4_expand_ino_array(lea_ino_array,
> +					  entry->e_value_inum) != 0) {
> +			brelse(iloc.bh);
> +			goto cleanup;
> +		}
> +		entry->e_value_inum = 0;
> +	}
> +	brelse(iloc.bh);
> +
> +delete_external_ea:
> +	if (!EXT4_I(inode)->i_file_acl) {
> +		/* add xattr inode to orphan list */
> +		ext4_xattr_inode_orphan_add(handle, inode, credits,
> +						*lea_ino_array);
> 		goto cleanup;
> +	}
> 	bh = sb_bread(inode->i_sb, EXT4_I(inode)->i_file_acl);
> 	if (!bh) {
> 		EXT4_ERROR_INODE(inode, "block %llu read error",
> @@ -1620,11 +2050,69 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
> 				 EXT4_I(inode)->i_file_acl);
> 		goto cleanup;
> 	}
> +
> +	for (entry = BFIRST(bh); !IS_LAST_ENTRY(entry);
> +	     entry = EXT4_XATTR_NEXT(entry)) {
> +		if (!entry->e_value_inum)
> +			continue;
> +		if (ext4_expand_ino_array(lea_ino_array,
> +					  entry->e_value_inum) != 0)
> +			goto cleanup;
> +		entry->e_value_inum = 0;
> +	}
> +
> +	/* add xattr inode to orphan list */
> +	error = ext4_xattr_inode_orphan_add(handle, inode, credits,
> +					*lea_ino_array);
> +	if (error != 0)
> +		goto cleanup;
> +
> +	if (!IS_NOQUOTA(inode))
> +		credits += 2 * EXT4_QUOTA_DEL_BLOCKS(inode->i_sb);
> +
> +	if (!ext4_handle_has_enough_credits(handle, credits)) {
> +		error = ext4_journal_extend(handle, credits);
> +		if (error > 0)
> +			error = ext4_journal_restart(handle, credits);
> +		if (error != 0) {
> +			ext4_warning(inode->i_sb,
> +				"couldn't extend journal (err %d)", error);
> +			goto cleanup;
> +		}
> +	}
> +
> 	ext4_xattr_release_block(handle, inode, bh);
> 	EXT4_I(inode)->i_file_acl = 0;
> 
> cleanup:
> 	brelse(bh);
> +
> +	return error;
> +}
> +
> +void
> +ext4_xattr_inode_array_free(struct inode *inode,
> +			    struct ext4_xattr_ino_array *lea_ino_array)
> +{
> +	struct inode	*ea_inode = NULL;
> +	int		idx = 0;
> +	int		err;
> +
> +	if (lea_ino_array == NULL)
> +		return;
> +
> +	for (; idx < lea_ino_array->xia_count; ++idx) {
> +		ea_inode = ext4_xattr_inode_iget(inode,
> +				lea_ino_array->xia_inodes[idx], &err);
> +		if (err)
> +			continue;
> +		/* for inode's i_count get from ext4_xattr_delete_inode */
> +		if (!list_empty(&EXT4_I(ea_inode)->i_orphan))
> +			iput(ea_inode);
> +		clear_nlink(ea_inode);
> +		iput(ea_inode);
> +	}
> +	kfree(lea_ino_array);
> }
> 
> /*
> @@ -1676,10 +2164,9 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
> 		    entry1->e_name_index != entry2->e_name_index ||
> 		    entry1->e_name_len != entry2->e_name_len ||
> 		    entry1->e_value_size != entry2->e_value_size ||
> +		    entry1->e_value_inum != entry2->e_value_inum ||
> 		    memcmp(entry1->e_name, entry2->e_name, entry1->e_name_len))
> 			return 1;
> -		if (entry1->e_value_block != 0 || entry2->e_value_block != 0)
> -			return -EFSCORRUPTED;
> 		if (memcmp((char *)header1 + le16_to_cpu(entry1->e_value_offs),
> 			   (char *)header2 + le16_to_cpu(entry2->e_value_offs),
> 			   le32_to_cpu(entry1->e_value_size)))
> @@ -1751,7 +2238,7 @@ static inline void ext4_xattr_hash_entry(struct ext4_xattr_header *header,
> 		       *name++;
> 	}
> 
> -	if (entry->e_value_size != 0) {
> +	if (!entry->e_value_inum && entry->e_value_size) {
> 		__le32 *value = (__le32 *)((char *)header +
> 			le16_to_cpu(entry->e_value_offs));
> 		for (n = (le32_to_cpu(entry->e_value_size) +
> diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
> index 099c8b6..6e10ff9 100644
> --- a/fs/ext4/xattr.h
> +++ b/fs/ext4/xattr.h
> @@ -44,7 +44,7 @@ struct ext4_xattr_entry {
> 	__u8	e_name_len;	/* length of name */
> 	__u8	e_name_index;	/* attribute name index */
> 	__le16	e_value_offs;	/* offset in disk block of value */
> -	__le32	e_value_block;	/* disk block attribute is stored on (n/i) */
> +	__le32	e_value_inum;	/* inode in which the value is stored */
> 	__le32	e_value_size;	/* size of attribute value */
> 	__le32	e_hash;		/* hash value of name and value */
> 	char	e_name[0];	/* attribute name */
> @@ -69,6 +69,26 @@ struct ext4_xattr_entry {
> 		EXT4_I(inode)->i_extra_isize))
> #define IFIRST(hdr) ((struct ext4_xattr_entry *)((hdr)+1))
> 
> +/*
> + * Link EA inode back to parent one using i_mtime field.
> + * Extra integer type conversion added to ignore higher
> + * bits in i_mtime.tv_sec which might be set by ext4_get()
> + */
> +#define EXT4_XATTR_INODE_SET_PARENT(inode, inum)      \
> +do {                                                  \
> +      (inode)->i_mtime.tv_sec = inum;                 \
> +} while(0)
> +
> +#define EXT4_XATTR_INODE_GET_PARENT(inode)            \
> +((__u32)(inode)->i_mtime.tv_sec)
> +
> +/*
> + * The minimum size of EA value when you start storing it in an external inode
> + * size of block - size of header - size of 1 entry - 4 null bytes
> +*/
> +#define EXT4_XATTR_MIN_LARGE_EA_SIZE(b)					\
> +	((b) - EXT4_XATTR_LEN(3) - sizeof(struct ext4_xattr_header) - 4)
> +
> #define BHDR(bh) ((struct ext4_xattr_header *)((bh)->b_data))
> #define ENTRY(ptr) ((struct ext4_xattr_entry *)(ptr))
> #define BFIRST(bh) ENTRY(BHDR(bh)+1)
> @@ -77,10 +97,11 @@ struct ext4_xattr_entry {
> #define EXT4_ZERO_XATTR_VALUE ((void *)-1)
> 
> struct ext4_xattr_info {
> -	int name_index;
> 	const char *name;
> 	const void *value;
> 	size_t value_len;
> +	int name_index;
> +	int in_inode;
> };
> 
> struct ext4_xattr_search {
> @@ -140,7 +161,13 @@ static inline void ext4_write_unlock_xattr(struct inode *inode, int *save)
> extern int ext4_xattr_set(struct inode *, int, const char *, const void *, size_t, int);
> extern int ext4_xattr_set_handle(handle_t *, struct inode *, int, const char *, const void *, size_t, int);
> 
> -extern void ext4_xattr_delete_inode(handle_t *, struct inode *);
> +extern struct inode *ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
> +					   int *err);
> +extern int ext4_xattr_inode_unlink(struct inode *inode, unsigned long ea_ino);
> +extern int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
> +				   struct ext4_xattr_ino_array **array);
> +extern void ext4_xattr_inode_array_free(struct inode *inode,
> +					struct ext4_xattr_ino_array *array);
> 
> extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
> 			    struct ext4_inode *raw_inode, handle_t *handle);
> diff --git a/include/uapi/linux/netfilter/xt_CONNMARK.h b/include/uapi/linux/netfilter/xt_CONNMARK.h
> index 2f2e48e..efc17a8 100644
> --- a/include/uapi/linux/netfilter/xt_CONNMARK.h
> +++ b/include/uapi/linux/netfilter/xt_CONNMARK.h
> @@ -1,6 +1,31 @@
> -#ifndef _XT_CONNMARK_H_target
> -#define _XT_CONNMARK_H_target
> +#ifndef _XT_CONNMARK_H
> +#define _XT_CONNMARK_H
> 
> -#include <linux/netfilter/xt_connmark.h>
> +#include <linux/types.h>
> 
> -#endif /*_XT_CONNMARK_H_target*/
> +/* Copyright (C) 2002,2004 MARA Systems AB <http://www.marasystems.com>
> + * by Henrik Nordstrom <hno@marasystems.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +enum {
> +	XT_CONNMARK_SET = 0,
> +	XT_CONNMARK_SAVE,
> +	XT_CONNMARK_RESTORE
> +};
> +
> +struct xt_connmark_tginfo1 {
> +	__u32 ctmark, ctmask, nfmask;
> +	__u8 mode;
> +};
> +
> +struct xt_connmark_mtinfo1 {
> +	__u32 mark, mask;
> +	__u8 invert;
> +};
> +
> +#endif /*_XT_CONNMARK_H*/
> diff --git a/include/uapi/linux/netfilter/xt_DSCP.h b/include/uapi/linux/netfilter/xt_DSCP.h
> index 648e0b3..15f8932 100644
> --- a/include/uapi/linux/netfilter/xt_DSCP.h
> +++ b/include/uapi/linux/netfilter/xt_DSCP.h
> @@ -1,26 +1,31 @@
> -/* x_tables module for setting the IPv4/IPv6 DSCP field
> +/* x_tables module for matching the IPv4/IPv6 DSCP field
>  *
>  * (C) 2002 Harald Welte <laforge@gnumonks.org>
> - * based on ipt_FTOS.c (C) 2000 by Matthew G. Marsh <mgm@paktronix.com>
>  * This software is distributed under GNU GPL v2, 1991
>  *
>  * See RFC2474 for a description of the DSCP field within the IP Header.
>  *
> - * xt_DSCP.h,v 1.7 2002/03/14 12:03:13 laforge Exp
> + * xt_dscp.h,v 1.3 2002/08/05 19:00:21 laforge Exp
> */
> -#ifndef _XT_DSCP_TARGET_H
> -#define _XT_DSCP_TARGET_H
> -#include <linux/netfilter/xt_dscp.h>
> +#ifndef _XT_DSCP_H
> +#define _XT_DSCP_H
> +
> #include <linux/types.h>
> 
> -/* target info */
> -struct xt_DSCP_info {
> +#define XT_DSCP_MASK	0xfc	/* 11111100 */
> +#define XT_DSCP_SHIFT	2
> +#define XT_DSCP_MAX	0x3f	/* 00111111 */
> +
> +/* match info */
> +struct xt_dscp_info {
> 	__u8 dscp;
> +	__u8 invert;
> };
> 
> -struct xt_tos_target_info {
> -	__u8 tos_value;
> +struct xt_tos_match_info {
> 	__u8 tos_mask;
> +	__u8 tos_value;
> +	__u8 invert;
> };
> 
> -#endif /* _XT_DSCP_TARGET_H */
> +#endif /* _XT_DSCP_H */
> diff --git a/include/uapi/linux/netfilter/xt_MARK.h b/include/uapi/linux/netfilter/xt_MARK.h
> index 41c456d..ecadc40 100644
> --- a/include/uapi/linux/netfilter/xt_MARK.h
> +++ b/include/uapi/linux/netfilter/xt_MARK.h
> @@ -1,6 +1,15 @@
> -#ifndef _XT_MARK_H_target
> -#define _XT_MARK_H_target
> +#ifndef _XT_MARK_H
> +#define _XT_MARK_H
> 
> -#include <linux/netfilter/xt_mark.h>
> +#include <linux/types.h>
> 
> -#endif /*_XT_MARK_H_target */
> +struct xt_mark_tginfo2 {
> +	__u32 mark, mask;
> +};
> +
> +struct xt_mark_mtinfo1 {
> +	__u32 mark, mask;
> +	__u8 invert;
> +};
> +
> +#endif /*_XT_MARK_H*/
> diff --git a/include/uapi/linux/netfilter/xt_TCPMSS.h b/include/uapi/linux/netfilter/xt_TCPMSS.h
> index 9a6960a..fbac56b 100644
> --- a/include/uapi/linux/netfilter/xt_TCPMSS.h
> +++ b/include/uapi/linux/netfilter/xt_TCPMSS.h
> @@ -1,12 +1,11 @@
> -#ifndef _XT_TCPMSS_H
> -#define _XT_TCPMSS_H
> +#ifndef _XT_TCPMSS_MATCH_H
> +#define _XT_TCPMSS_MATCH_H
> 
> #include <linux/types.h>
> 
> -struct xt_tcpmss_info {
> -	__u16 mss;
> +struct xt_tcpmss_match_info {
> +    __u16 mss_min, mss_max;
> +    __u8 invert;
> };
> 
> -#define XT_TCPMSS_CLAMP_PMTU 0xffff
> -
> -#endif /* _XT_TCPMSS_H */
> +#endif /*_XT_TCPMSS_MATCH_H*/
> diff --git a/include/uapi/linux/netfilter/xt_rateest.h b/include/uapi/linux/netfilter/xt_rateest.h
> index 13fe50d..ec1b570 100644
> --- a/include/uapi/linux/netfilter/xt_rateest.h
> +++ b/include/uapi/linux/netfilter/xt_rateest.h
> @@ -1,38 +1,16 @@
> -#ifndef _XT_RATEEST_MATCH_H
> -#define _XT_RATEEST_MATCH_H
> +#ifndef _XT_RATEEST_TARGET_H
> +#define _XT_RATEEST_TARGET_H
> 
> #include <linux/types.h>
> #include <linux/if.h>
> 
> -enum xt_rateest_match_flags {
> -	XT_RATEEST_MATCH_INVERT	= 1<<0,
> -	XT_RATEEST_MATCH_ABS	= 1<<1,
> -	XT_RATEEST_MATCH_REL	= 1<<2,
> -	XT_RATEEST_MATCH_DELTA	= 1<<3,
> -	XT_RATEEST_MATCH_BPS	= 1<<4,
> -	XT_RATEEST_MATCH_PPS	= 1<<5,
> -};
> -
> -enum xt_rateest_match_mode {
> -	XT_RATEEST_MATCH_NONE,
> -	XT_RATEEST_MATCH_EQ,
> -	XT_RATEEST_MATCH_LT,
> -	XT_RATEEST_MATCH_GT,
> -};
> -
> -struct xt_rateest_match_info {
> -	char			name1[IFNAMSIZ];
> -	char			name2[IFNAMSIZ];
> -	__u16		flags;
> -	__u16		mode;
> -	__u32		bps1;
> -	__u32		pps1;
> -	__u32		bps2;
> -	__u32		pps2;
> +struct xt_rateest_target_info {
> +	char			name[IFNAMSIZ];
> +	__s8			interval;
> +	__u8		ewma_log;
> 
> 	/* Used internally by the kernel */
> -	struct xt_rateest	*est1 __attribute__((aligned(8)));
> -	struct xt_rateest	*est2 __attribute__((aligned(8)));
> +	struct xt_rateest	*est __attribute__((aligned(8)));
> };
> 
> -#endif /* _XT_RATEEST_MATCH_H */
> +#endif /* _XT_RATEEST_TARGET_H */
> diff --git a/include/uapi/linux/netfilter_ipv4/ipt_ECN.h b/include/uapi/linux/netfilter_ipv4/ipt_ECN.h
> index bb88d53..0e0c063 100644
> --- a/include/uapi/linux/netfilter_ipv4/ipt_ECN.h
> +++ b/include/uapi/linux/netfilter_ipv4/ipt_ECN.h
> @@ -1,33 +1,15 @@
> -/* Header file for iptables ipt_ECN target
> - *
> - * (C) 2002 by Harald Welte <laforge@gnumonks.org>
> - *
> - * This software is distributed under GNU GPL v2, 1991
> - *
> - * ipt_ECN.h,v 1.3 2002/05/29 12:17:40 laforge Exp
> -*/
> -#ifndef _IPT_ECN_TARGET_H
> -#define _IPT_ECN_TARGET_H
> -
> -#include <linux/types.h>
> -#include <linux/netfilter/xt_DSCP.h>
> -
> -#define IPT_ECN_IP_MASK	(~XT_DSCP_MASK)
> -
> -#define IPT_ECN_OP_SET_IP	0x01	/* set ECN bits of IPv4 header */
> -#define IPT_ECN_OP_SET_ECE	0x10	/* set ECE bit of TCP header */
> -#define IPT_ECN_OP_SET_CWR	0x20	/* set CWR bit of TCP header */
> -
> -#define IPT_ECN_OP_MASK		0xce
> -
> -struct ipt_ECN_info {
> -	__u8 operation;	/* bitset of operations */
> -	__u8 ip_ect;	/* ECT codepoint of IPv4 header, pre-shifted */
> -	union {
> -		struct {
> -			__u8 ece:1, cwr:1; /* TCP ECT bits */
> -		} tcp;
> -	} proto;
> +#ifndef _IPT_ECN_H
> +#define _IPT_ECN_H
> +
> +#include <linux/netfilter/xt_ecn.h>
> +#define ipt_ecn_info xt_ecn_info
> +
> +enum {
> +	IPT_ECN_IP_MASK       = XT_ECN_IP_MASK,
> +	IPT_ECN_OP_MATCH_IP   = XT_ECN_OP_MATCH_IP,
> +	IPT_ECN_OP_MATCH_ECE  = XT_ECN_OP_MATCH_ECE,
> +	IPT_ECN_OP_MATCH_CWR  = XT_ECN_OP_MATCH_CWR,
> +	IPT_ECN_OP_MATCH_MASK = XT_ECN_OP_MATCH_MASK,
> };
> 
> -#endif /* _IPT_ECN_TARGET_H */
> +#endif /* IPT_ECN_H */
> diff --git a/include/uapi/linux/netfilter_ipv4/ipt_TTL.h b/include/uapi/linux/netfilter_ipv4/ipt_TTL.h
> index f6ac169..37bee44 100644
> --- a/include/uapi/linux/netfilter_ipv4/ipt_TTL.h
> +++ b/include/uapi/linux/netfilter_ipv4/ipt_TTL.h
> @@ -1,5 +1,5 @@
> -/* TTL modification module for IP tables
> - * (C) 2000 by Harald Welte <laforge@netfilter.org> */
> +/* IP tables module for matching the value of the TTL
> + * (C) 2000 by Harald Welte <laforge@gnumonks.org> */
> 
> #ifndef _IPT_TTL_H
> #define _IPT_TTL_H
> @@ -7,14 +7,14 @@
> #include <linux/types.h>
> 
> enum {
> -	IPT_TTL_SET = 0,
> -	IPT_TTL_INC,
> -	IPT_TTL_DEC
> +	IPT_TTL_EQ = 0,		/* equals */
> +	IPT_TTL_NE,		/* not equals */
> +	IPT_TTL_LT,		/* less than */
> +	IPT_TTL_GT,		/* greater than */
> };
> 
> -#define IPT_TTL_MAXMODE	IPT_TTL_DEC
> 
> -struct ipt_TTL_info {
> +struct ipt_ttl_info {
> 	__u8	mode;
> 	__u8	ttl;
> };
> diff --git a/include/uapi/linux/netfilter_ipv6/ip6t_HL.h b/include/uapi/linux/netfilter_ipv6/ip6t_HL.h
> index ebd8ead..6e76dbc 100644
> --- a/include/uapi/linux/netfilter_ipv6/ip6t_HL.h
> +++ b/include/uapi/linux/netfilter_ipv6/ip6t_HL.h
> @@ -1,6 +1,6 @@
> -/* Hop Limit modification module for ip6tables
> +/* ip6tables module for matching the Hop Limit value
>  * Maciej Soltysiak <solt@dns.toxicfilms.tv>
> - * Based on HW's TTL module */
> + * Based on HW's ttl module */
> 
> #ifndef _IP6T_HL_H
> #define _IP6T_HL_H
> @@ -8,14 +8,14 @@
> #include <linux/types.h>
> 
> enum {
> -	IP6T_HL_SET = 0,
> -	IP6T_HL_INC,
> -	IP6T_HL_DEC
> +	IP6T_HL_EQ = 0,		/* equals */
> +	IP6T_HL_NE,		/* not equals */
> +	IP6T_HL_LT,		/* less than */
> +	IP6T_HL_GT,		/* greater than */
> };
> 
> -#define IP6T_HL_MAXMODE	IP6T_HL_DEC
> 
> -struct ip6t_HL_info {
> +struct ip6t_hl_info {
> 	__u8	mode;
> 	__u8	hop_limit;
> };
> diff --git a/net/netfilter/xt_RATEEST.c b/net/netfilter/xt_RATEEST.c
> index 498b54f..755d2f6 100644
> --- a/net/netfilter/xt_RATEEST.c
> +++ b/net/netfilter/xt_RATEEST.c
> @@ -8,184 +8,149 @@
> #include <linux/module.h>
> #include <linux/skbuff.h>
> #include <linux/gen_stats.h>
> -#include <linux/jhash.h>
> -#include <linux/rtnetlink.h>
> -#include <linux/random.h>
> -#include <linux/slab.h>
> -#include <net/gen_stats.h>
> -#include <net/netlink.h>
> 
> #include <linux/netfilter/x_tables.h>
> -#include <linux/netfilter/xt_RATEEST.h>
> +#include <linux/netfilter/xt_rateest.h>
> #include <net/netfilter/xt_rateest.h>
> 
> -static DEFINE_MUTEX(xt_rateest_mutex);
> 
> -#define RATEEST_HSIZE	16
> -static struct hlist_head rateest_hash[RATEEST_HSIZE] __read_mostly;
> -static unsigned int jhash_rnd __read_mostly;
> -
> -static unsigned int xt_rateest_hash(const char *name)
> -{
> -	return jhash(name, FIELD_SIZEOF(struct xt_rateest, name), jhash_rnd) &
> -	       (RATEEST_HSIZE - 1);
> -}
> -
> -static void xt_rateest_hash_insert(struct xt_rateest *est)
> +static bool
> +xt_rateest_mt(const struct sk_buff *skb, struct xt_action_param *par)
> {
> -	unsigned int h;
> -
> -	h = xt_rateest_hash(est->name);
> -	hlist_add_head(&est->list, &rateest_hash[h]);
> -}
> +	const struct xt_rateest_match_info *info = par->matchinfo;
> +	struct gnet_stats_rate_est64 sample = {0};
> +	u_int32_t bps1, bps2, pps1, pps2;
> +	bool ret = true;
> +
> +	gen_estimator_read(&info->est1->rate_est, &sample);
> +
> +	if (info->flags & XT_RATEEST_MATCH_DELTA) {
> +		bps1 = info->bps1 >= sample.bps ? info->bps1 - sample.bps : 0;
> +		pps1 = info->pps1 >= sample.pps ? info->pps1 - sample.pps : 0;
> +	} else {
> +		bps1 = sample.bps;
> +		pps1 = sample.pps;
> +	}
> 
> -struct xt_rateest *xt_rateest_lookup(const char *name)
> -{
> -	struct xt_rateest *est;
> -	unsigned int h;
> -
> -	h = xt_rateest_hash(name);
> -	mutex_lock(&xt_rateest_mutex);
> -	hlist_for_each_entry(est, &rateest_hash[h], list) {
> -		if (strcmp(est->name, name) == 0) {
> -			est->refcnt++;
> -			mutex_unlock(&xt_rateest_mutex);
> -			return est;
> +	if (info->flags & XT_RATEEST_MATCH_ABS) {
> +		bps2 = info->bps2;
> +		pps2 = info->pps2;
> +	} else {
> +		gen_estimator_read(&info->est2->rate_est, &sample);
> +
> +		if (info->flags & XT_RATEEST_MATCH_DELTA) {
> +			bps2 = info->bps2 >= sample.bps ? info->bps2 - sample.bps : 0;
> +			pps2 = info->pps2 >= sample.pps ? info->pps2 - sample.pps : 0;
> +		} else {
> +			bps2 = sample.bps;
> +			pps2 = sample.pps;
> 		}
> 	}
> -	mutex_unlock(&xt_rateest_mutex);
> -	return NULL;
> -}
> -EXPORT_SYMBOL_GPL(xt_rateest_lookup);
> 
> -void xt_rateest_put(struct xt_rateest *est)
> -{
> -	mutex_lock(&xt_rateest_mutex);
> -	if (--est->refcnt == 0) {
> -		hlist_del(&est->list);
> -		gen_kill_estimator(&est->rate_est);
> -		/*
> -		 * gen_estimator est_timer() might access est->lock or bstats,
> -		 * wait a RCU grace period before freeing 'est'
> -		 */
> -		kfree_rcu(est, rcu);
> +	switch (info->mode) {
> +	case XT_RATEEST_MATCH_LT:
> +		if (info->flags & XT_RATEEST_MATCH_BPS)
> +			ret &= bps1 < bps2;
> +		if (info->flags & XT_RATEEST_MATCH_PPS)
> +			ret &= pps1 < pps2;
> +		break;
> +	case XT_RATEEST_MATCH_GT:
> +		if (info->flags & XT_RATEEST_MATCH_BPS)
> +			ret &= bps1 > bps2;
> +		if (info->flags & XT_RATEEST_MATCH_PPS)
> +			ret &= pps1 > pps2;
> +		break;
> +	case XT_RATEEST_MATCH_EQ:
> +		if (info->flags & XT_RATEEST_MATCH_BPS)
> +			ret &= bps1 == bps2;
> +		if (info->flags & XT_RATEEST_MATCH_PPS)
> +			ret &= pps1 == pps2;
> +		break;
> 	}
> -	mutex_unlock(&xt_rateest_mutex);
> +
> +	ret ^= info->flags & XT_RATEEST_MATCH_INVERT ? true : false;
> +	return ret;
> }
> -EXPORT_SYMBOL_GPL(xt_rateest_put);
> 
> -static unsigned int
> -xt_rateest_tg(struct sk_buff *skb, const struct xt_action_param *par)
> +static int xt_rateest_mt_checkentry(const struct xt_mtchk_param *par)
> {
> -	const struct xt_rateest_target_info *info = par->targinfo;
> -	struct gnet_stats_basic_packed *stats = &info->est->bstats;
> +	struct xt_rateest_match_info *info = par->matchinfo;
> +	struct xt_rateest *est1, *est2;
> +	int ret = -EINVAL;
> 
> -	spin_lock_bh(&info->est->lock);
> -	stats->bytes += skb->len;
> -	stats->packets++;
> -	spin_unlock_bh(&info->est->lock);
> +	if (hweight32(info->flags & (XT_RATEEST_MATCH_ABS |
> +				     XT_RATEEST_MATCH_REL)) != 1)
> +		goto err1;
> 
> -	return XT_CONTINUE;
> -}
> +	if (!(info->flags & (XT_RATEEST_MATCH_BPS | XT_RATEEST_MATCH_PPS)))
> +		goto err1;
> 
> -static int xt_rateest_tg_checkentry(const struct xt_tgchk_param *par)
> -{
> -	struct xt_rateest_target_info *info = par->targinfo;
> -	struct xt_rateest *est;
> -	struct {
> -		struct nlattr		opt;
> -		struct gnet_estimator	est;
> -	} cfg;
> -	int ret;
> -
> -	net_get_random_once(&jhash_rnd, sizeof(jhash_rnd));
> -
> -	est = xt_rateest_lookup(info->name);
> -	if (est) {
> -		/*
> -		 * If estimator parameters are specified, they must match the
> -		 * existing estimator.
> -		 */
> -		if ((!info->interval && !info->ewma_log) ||
> -		    (info->interval != est->params.interval ||
> -		     info->ewma_log != est->params.ewma_log)) {
> -			xt_rateest_put(est);
> -			return -EINVAL;
> -		}
> -		info->est = est;
> -		return 0;
> +	switch (info->mode) {
> +	case XT_RATEEST_MATCH_EQ:
> +	case XT_RATEEST_MATCH_LT:
> +	case XT_RATEEST_MATCH_GT:
> +		break;
> +	default:
> +		goto err1;
> 	}
> 
> -	ret = -ENOMEM;
> -	est = kzalloc(sizeof(*est), GFP_KERNEL);
> -	if (!est)
> +	ret  = -ENOENT;
> +	est1 = xt_rateest_lookup(info->name1);
> +	if (!est1)
> 		goto err1;
> 
> -	strlcpy(est->name, info->name, sizeof(est->name));
> -	spin_lock_init(&est->lock);
> -	est->refcnt		= 1;
> -	est->params.interval	= info->interval;
> -	est->params.ewma_log	= info->ewma_log;
> -
> -	cfg.opt.nla_len		= nla_attr_size(sizeof(cfg.est));
> -	cfg.opt.nla_type	= TCA_STATS_RATE_EST;
> -	cfg.est.interval	= info->interval;
> -	cfg.est.ewma_log	= info->ewma_log;
> -
> -	ret = gen_new_estimator(&est->bstats, NULL, &est->rate_est,
> -				&est->lock, NULL, &cfg.opt);
> -	if (ret < 0)
> -		goto err2;
> +	est2 = NULL;
> +	if (info->flags & XT_RATEEST_MATCH_REL) {
> +		est2 = xt_rateest_lookup(info->name2);
> +		if (!est2)
> +			goto err2;
> +	}
> 
> -	info->est = est;
> -	xt_rateest_hash_insert(est);
> +	info->est1 = est1;
> +	info->est2 = est2;
> 	return 0;
> 
> err2:
> -	kfree(est);
> +	xt_rateest_put(est1);
> err1:
> 	return ret;
> }
> 
> -static void xt_rateest_tg_destroy(const struct xt_tgdtor_param *par)
> +static void xt_rateest_mt_destroy(const struct xt_mtdtor_param *par)
> {
> -	struct xt_rateest_target_info *info = par->targinfo;
> +	struct xt_rateest_match_info *info = par->matchinfo;
> 
> -	xt_rateest_put(info->est);
> +	xt_rateest_put(info->est1);
> +	if (info->est2)
> +		xt_rateest_put(info->est2);
> }
> 
> -static struct xt_target xt_rateest_tg_reg __read_mostly = {
> -	.name       = "RATEEST",
> +static struct xt_match xt_rateest_mt_reg __read_mostly = {
> +	.name       = "rateest",
> 	.revision   = 0,
> 	.family     = NFPROTO_UNSPEC,
> -	.target     = xt_rateest_tg,
> -	.checkentry = xt_rateest_tg_checkentry,
> -	.destroy    = xt_rateest_tg_destroy,
> -	.targetsize = sizeof(struct xt_rateest_target_info),
> -	.usersize   = offsetof(struct xt_rateest_target_info, est),
> +	.match      = xt_rateest_mt,
> +	.checkentry = xt_rateest_mt_checkentry,
> +	.destroy    = xt_rateest_mt_destroy,
> +	.matchsize  = sizeof(struct xt_rateest_match_info),
> +	.usersize   = offsetof(struct xt_rateest_match_info, est1),
> 	.me         = THIS_MODULE,
> };
> 
> -static int __init xt_rateest_tg_init(void)
> +static int __init xt_rateest_mt_init(void)
> {
> -	unsigned int i;
> -
> -	for (i = 0; i < ARRAY_SIZE(rateest_hash); i++)
> -		INIT_HLIST_HEAD(&rateest_hash[i]);
> -
> -	return xt_register_target(&xt_rateest_tg_reg);
> +	return xt_register_match(&xt_rateest_mt_reg);
> }
> 
> -static void __exit xt_rateest_tg_fini(void)
> +static void __exit xt_rateest_mt_fini(void)
> {
> -	xt_unregister_target(&xt_rateest_tg_reg);
> +	xt_unregister_match(&xt_rateest_mt_reg);
> }
> 
> -
> MODULE_AUTHOR("Patrick McHardy <kaber@trash.net>");
> MODULE_LICENSE("GPL");
> -MODULE_DESCRIPTION("Xtables: packet rate estimator");
> -MODULE_ALIAS("ipt_RATEEST");
> -MODULE_ALIAS("ip6t_RATEEST");
> -module_init(xt_rateest_tg_init);
> -module_exit(xt_rateest_tg_fini);
> +MODULE_DESCRIPTION("xtables rate estimator match");
> +MODULE_ALIAS("ipt_rateest");
> +MODULE_ALIAS("ip6t_rateest");
> +module_init(xt_rateest_mt_init);
> +module_exit(xt_rateest_mt_fini);
> diff --git a/net/netfilter/xt_TCPMSS.c b/net/netfilter/xt_TCPMSS.c
> index 27241a7..c53d4d1 100644
> --- a/net/netfilter/xt_TCPMSS.c
> +++ b/net/netfilter/xt_TCPMSS.c
> @@ -1,351 +1,110 @@
> -/*
> - * This is a module which is used for setting the MSS option in TCP packets.
> - *
> - * Copyright (C) 2000 Marc Boucher <marc@mbsi.ca>
> - * Copyright (C) 2007 Patrick McHardy <kaber@trash.net>
> +/* Kernel module to match TCP MSS values. */
> +
> +/* Copyright (C) 2000 Marc Boucher <marc@mbsi.ca>
> + * Portions (C) 2005 by Harald Welte <laforge@netfilter.org>
>  *
>  * This program is free software; you can redistribute it and/or modify
>  * it under the terms of the GNU General Public License version 2 as
>  * published by the Free Software Foundation.
>  */
> -#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> #include <linux/module.h>
> #include <linux/skbuff.h>
> -#include <linux/ip.h>
> -#include <linux/gfp.h>
> -#include <linux/ipv6.h>
> -#include <linux/tcp.h>
> -#include <net/dst.h>
> -#include <net/flow.h>
> -#include <net/ipv6.h>
> -#include <net/route.h>
> #include <net/tcp.h>
> 
> +#include <linux/netfilter/xt_tcpmss.h>
> +#include <linux/netfilter/x_tables.h>
> +
> #include <linux/netfilter_ipv4/ip_tables.h>
> #include <linux/netfilter_ipv6/ip6_tables.h>
> -#include <linux/netfilter/x_tables.h>
> -#include <linux/netfilter/xt_tcpudp.h>
> -#include <linux/netfilter/xt_TCPMSS.h>
> 
> MODULE_LICENSE("GPL");
> MODULE_AUTHOR("Marc Boucher <marc@mbsi.ca>");
> -MODULE_DESCRIPTION("Xtables: TCP Maximum Segment Size (MSS) adjustment");
> -MODULE_ALIAS("ipt_TCPMSS");
> -MODULE_ALIAS("ip6t_TCPMSS");
> -
> -static inline unsigned int
> -optlen(const u_int8_t *opt, unsigned int offset)
> -{
> -	/* Beware zero-length options: make finite progress */
> -	if (opt[offset] <= TCPOPT_NOP || opt[offset+1] == 0)
> -		return 1;
> -	else
> -		return opt[offset+1];
> -}
> -
> -static u_int32_t tcpmss_reverse_mtu(struct net *net,
> -				    const struct sk_buff *skb,
> -				    unsigned int family)
> -{
> -	struct flowi fl;
> -	const struct nf_afinfo *ai;
> -	struct rtable *rt = NULL;
> -	u_int32_t mtu     = ~0U;
> -
> -	if (family == PF_INET) {
> -		struct flowi4 *fl4 = &fl.u.ip4;
> -		memset(fl4, 0, sizeof(*fl4));
> -		fl4->daddr = ip_hdr(skb)->saddr;
> -	} else {
> -		struct flowi6 *fl6 = &fl.u.ip6;
> -
> -		memset(fl6, 0, sizeof(*fl6));
> -		fl6->daddr = ipv6_hdr(skb)->saddr;
> -	}
> -	rcu_read_lock();
> -	ai = nf_get_afinfo(family);
> -	if (ai != NULL)
> -		ai->route(net, (struct dst_entry **)&rt, &fl, false);
> -	rcu_read_unlock();
> -
> -	if (rt != NULL) {
> -		mtu = dst_mtu(&rt->dst);
> -		dst_release(&rt->dst);
> -	}
> -	return mtu;
> -}
> +MODULE_DESCRIPTION("Xtables: TCP MSS match");
> +MODULE_ALIAS("ipt_tcpmss");
> +MODULE_ALIAS("ip6t_tcpmss");
> 
> -static int
> -tcpmss_mangle_packet(struct sk_buff *skb,
> -		     const struct xt_action_param *par,
> -		     unsigned int family,
> -		     unsigned int tcphoff,
> -		     unsigned int minlen)
> +static bool
> +tcpmss_mt(const struct sk_buff *skb, struct xt_action_param *par)
> {
> -	const struct xt_tcpmss_info *info = par->targinfo;
> -	struct tcphdr *tcph;
> -	int len, tcp_hdrlen;
> -	unsigned int i;
> -	__be16 oldval;
> -	u16 newmss;
> -	u8 *opt;
> -
> -	/* This is a fragment, no TCP header is available */
> -	if (par->fragoff != 0)
> -		return 0;
> -
> -	if (!skb_make_writable(skb, skb->len))
> -		return -1;
> -
> -	len = skb->len - tcphoff;
> -	if (len < (int)sizeof(struct tcphdr))
> -		return -1;
> -
> -	tcph = (struct tcphdr *)(skb_network_header(skb) + tcphoff);
> -	tcp_hdrlen = tcph->doff * 4;
> -
> -	if (len < tcp_hdrlen)
> -		return -1;
> -
> -	if (info->mss == XT_TCPMSS_CLAMP_PMTU) {
> -		struct net *net = xt_net(par);
> -		unsigned int in_mtu = tcpmss_reverse_mtu(net, skb, family);
> -		unsigned int min_mtu = min(dst_mtu(skb_dst(skb)), in_mtu);
> -
> -		if (min_mtu <= minlen) {
> -			net_err_ratelimited("unknown or invalid path-MTU (%u)\n",
> -					    min_mtu);
> -			return -1;
> -		}
> -		newmss = min_mtu - minlen;
> -	} else
> -		newmss = info->mss;
> -
> -	opt = (u_int8_t *)tcph;
> -	for (i = sizeof(struct tcphdr); i <= tcp_hdrlen - TCPOLEN_MSS; i += optlen(opt, i)) {
> -		if (opt[i] == TCPOPT_MSS && opt[i+1] == TCPOLEN_MSS) {
> -			u_int16_t oldmss;
> -
> -			oldmss = (opt[i+2] << 8) | opt[i+3];
> -
> -			/* Never increase MSS, even when setting it, as
> -			 * doing so results in problems for hosts that rely
> -			 * on MSS being set correctly.
> -			 */
> -			if (oldmss <= newmss)
> -				return 0;
> -
> -			opt[i+2] = (newmss & 0xff00) >> 8;
> -			opt[i+3] = newmss & 0x00ff;
> -
> -			inet_proto_csum_replace2(&tcph->check, skb,
> -						 htons(oldmss), htons(newmss),
> -						 false);
> -			return 0;
> +	const struct xt_tcpmss_match_info *info = par->matchinfo;
> +	const struct tcphdr *th;
> +	struct tcphdr _tcph;
> +	/* tcp.doff is only 4 bits, ie. max 15 * 4 bytes */
> +	const u_int8_t *op;
> +	u8 _opt[15 * 4 - sizeof(_tcph)];
> +	unsigned int i, optlen;
> +
> +	/* If we don't have the whole header, drop packet. */
> +	th = skb_header_pointer(skb, par->thoff, sizeof(_tcph), &_tcph);
> +	if (th == NULL)
> +		goto dropit;
> +
> +	/* Malformed. */
> +	if (th->doff*4 < sizeof(*th))
> +		goto dropit;
> +
> +	optlen = th->doff*4 - sizeof(*th);
> +	if (!optlen)
> +		goto out;
> +
> +	/* Truncated options. */
> +	op = skb_header_pointer(skb, par->thoff + sizeof(*th), optlen, _opt);
> +	if (op == NULL)
> +		goto dropit;
> +
> +	for (i = 0; i < optlen; ) {
> +		if (op[i] == TCPOPT_MSS
> +		    && (optlen - i) >= TCPOLEN_MSS
> +		    && op[i+1] == TCPOLEN_MSS) {
> +			u_int16_t mssval;
> +
> +			mssval = (op[i+2] << 8) | op[i+3];
> +
> +			return (mssval >= info->mss_min &&
> +				mssval <= info->mss_max) ^ info->invert;
> 		}
> +		if (op[i] < 2)
> +			i++;
> +		else
> +			i += op[i+1] ? : 1;
> 	}
> +out:
> +	return info->invert;
> 
> -	/* There is data after the header so the option can't be added
> -	 * without moving it, and doing so may make the SYN packet
> -	 * itself too large. Accept the packet unmodified instead.
> -	 */
> -	if (len > tcp_hdrlen)
> -		return 0;
> -
> -	/*
> -	 * MSS Option not found ?! add it..
> -	 */
> -	if (skb_tailroom(skb) < TCPOLEN_MSS) {
> -		if (pskb_expand_head(skb, 0,
> -				     TCPOLEN_MSS - skb_tailroom(skb),
> -				     GFP_ATOMIC))
> -			return -1;
> -		tcph = (struct tcphdr *)(skb_network_header(skb) + tcphoff);
> -	}
> -
> -	skb_put(skb, TCPOLEN_MSS);
> -
> -	/*
> -	 * IPv4: RFC 1122 states "If an MSS option is not received at
> -	 * connection setup, TCP MUST assume a default send MSS of 536".
> -	 * IPv6: RFC 2460 states IPv6 has a minimum MTU of 1280 and a minimum
> -	 * length IPv6 header of 60, ergo the default MSS value is 1220
> -	 * Since no MSS was provided, we must use the default values
> -	 */
> -	if (xt_family(par) == NFPROTO_IPV4)
> -		newmss = min(newmss, (u16)536);
> -	else
> -		newmss = min(newmss, (u16)1220);
> -
> -	opt = (u_int8_t *)tcph + sizeof(struct tcphdr);
> -	memmove(opt + TCPOLEN_MSS, opt, len - sizeof(struct tcphdr));
> -
> -	inet_proto_csum_replace2(&tcph->check, skb,
> -				 htons(len), htons(len + TCPOLEN_MSS), true);
> -	opt[0] = TCPOPT_MSS;
> -	opt[1] = TCPOLEN_MSS;
> -	opt[2] = (newmss & 0xff00) >> 8;
> -	opt[3] = newmss & 0x00ff;
> -
> -	inet_proto_csum_replace4(&tcph->check, skb, 0, *((__be32 *)opt), false);
> -
> -	oldval = ((__be16 *)tcph)[6];
> -	tcph->doff += TCPOLEN_MSS/4;
> -	inet_proto_csum_replace2(&tcph->check, skb,
> -				 oldval, ((__be16 *)tcph)[6], false);
> -	return TCPOLEN_MSS;
> -}
> -
> -static unsigned int
> -tcpmss_tg4(struct sk_buff *skb, const struct xt_action_param *par)
> -{
> -	struct iphdr *iph = ip_hdr(skb);
> -	__be16 newlen;
> -	int ret;
> -
> -	ret = tcpmss_mangle_packet(skb, par,
> -				   PF_INET,
> -				   iph->ihl * 4,
> -				   sizeof(*iph) + sizeof(struct tcphdr));
> -	if (ret < 0)
> -		return NF_DROP;
> -	if (ret > 0) {
> -		iph = ip_hdr(skb);
> -		newlen = htons(ntohs(iph->tot_len) + ret);
> -		csum_replace2(&iph->check, iph->tot_len, newlen);
> -		iph->tot_len = newlen;
> -	}
> -	return XT_CONTINUE;
> -}
> -
> -#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES)
> -static unsigned int
> -tcpmss_tg6(struct sk_buff *skb, const struct xt_action_param *par)
> -{
> -	struct ipv6hdr *ipv6h = ipv6_hdr(skb);
> -	u8 nexthdr;
> -	__be16 frag_off, oldlen, newlen;
> -	int tcphoff;
> -	int ret;
> -
> -	nexthdr = ipv6h->nexthdr;
> -	tcphoff = ipv6_skip_exthdr(skb, sizeof(*ipv6h), &nexthdr, &frag_off);
> -	if (tcphoff < 0)
> -		return NF_DROP;
> -	ret = tcpmss_mangle_packet(skb, par,
> -				   PF_INET6,
> -				   tcphoff,
> -				   sizeof(*ipv6h) + sizeof(struct tcphdr));
> -	if (ret < 0)
> -		return NF_DROP;
> -	if (ret > 0) {
> -		ipv6h = ipv6_hdr(skb);
> -		oldlen = ipv6h->payload_len;
> -		newlen = htons(ntohs(oldlen) + ret);
> -		if (skb->ip_summed == CHECKSUM_COMPLETE)
> -			skb->csum = csum_add(csum_sub(skb->csum, oldlen),
> -					     newlen);
> -		ipv6h->payload_len = newlen;
> -	}
> -	return XT_CONTINUE;
> -}
> -#endif
> -
> -/* Must specify -p tcp --syn */
> -static inline bool find_syn_match(const struct xt_entry_match *m)
> -{
> -	const struct xt_tcp *tcpinfo = (const struct xt_tcp *)m->data;
> -
> -	if (strcmp(m->u.kernel.match->name, "tcp") == 0 &&
> -	    tcpinfo->flg_cmp & TCPHDR_SYN &&
> -	    !(tcpinfo->invflags & XT_TCP_INV_FLAGS))
> -		return true;
> -
> +dropit:
> +	par->hotdrop = true;
> 	return false;
> }
> 
> -static int tcpmss_tg4_check(const struct xt_tgchk_param *par)
> -{
> -	const struct xt_tcpmss_info *info = par->targinfo;
> -	const struct ipt_entry *e = par->entryinfo;
> -	const struct xt_entry_match *ematch;
> -
> -	if (info->mss == XT_TCPMSS_CLAMP_PMTU &&
> -	    (par->hook_mask & ~((1 << NF_INET_FORWARD) |
> -			   (1 << NF_INET_LOCAL_OUT) |
> -			   (1 << NF_INET_POST_ROUTING))) != 0) {
> -		pr_info("path-MTU clamping only supported in "
> -			"FORWARD, OUTPUT and POSTROUTING hooks\n");
> -		return -EINVAL;
> -	}
> -	if (par->nft_compat)
> -		return 0;
> -
> -	xt_ematch_foreach(ematch, e)
> -		if (find_syn_match(ematch))
> -			return 0;
> -	pr_info("Only works on TCP SYN packets\n");
> -	return -EINVAL;
> -}
> -
> -#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES)
> -static int tcpmss_tg6_check(const struct xt_tgchk_param *par)
> -{
> -	const struct xt_tcpmss_info *info = par->targinfo;
> -	const struct ip6t_entry *e = par->entryinfo;
> -	const struct xt_entry_match *ematch;
> -
> -	if (info->mss == XT_TCPMSS_CLAMP_PMTU &&
> -	    (par->hook_mask & ~((1 << NF_INET_FORWARD) |
> -			   (1 << NF_INET_LOCAL_OUT) |
> -			   (1 << NF_INET_POST_ROUTING))) != 0) {
> -		pr_info("path-MTU clamping only supported in "
> -			"FORWARD, OUTPUT and POSTROUTING hooks\n");
> -		return -EINVAL;
> -	}
> -	if (par->nft_compat)
> -		return 0;
> -
> -	xt_ematch_foreach(ematch, e)
> -		if (find_syn_match(ematch))
> -			return 0;
> -	pr_info("Only works on TCP SYN packets\n");
> -	return -EINVAL;
> -}
> -#endif
> -
> -static struct xt_target tcpmss_tg_reg[] __read_mostly = {
> +static struct xt_match tcpmss_mt_reg[] __read_mostly = {
> 	{
> +		.name		= "tcpmss",
> 		.family		= NFPROTO_IPV4,
> -		.name		= "TCPMSS",
> -		.checkentry	= tcpmss_tg4_check,
> -		.target		= tcpmss_tg4,
> -		.targetsize	= sizeof(struct xt_tcpmss_info),
> +		.match		= tcpmss_mt,
> +		.matchsize	= sizeof(struct xt_tcpmss_match_info),
> 		.proto		= IPPROTO_TCP,
> 		.me		= THIS_MODULE,
> 	},
> -#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES)
> 	{
> +		.name		= "tcpmss",
> 		.family		= NFPROTO_IPV6,
> -		.name		= "TCPMSS",
> -		.checkentry	= tcpmss_tg6_check,
> -		.target		= tcpmss_tg6,
> -		.targetsize	= sizeof(struct xt_tcpmss_info),
> +		.match		= tcpmss_mt,
> +		.matchsize	= sizeof(struct xt_tcpmss_match_info),
> 		.proto		= IPPROTO_TCP,
> 		.me		= THIS_MODULE,
> 	},
> -#endif
> };
> 
> -static int __init tcpmss_tg_init(void)
> +static int __init tcpmss_mt_init(void)
> {
> -	return xt_register_targets(tcpmss_tg_reg, ARRAY_SIZE(tcpmss_tg_reg));
> +	return xt_register_matches(tcpmss_mt_reg, ARRAY_SIZE(tcpmss_mt_reg));
> }
> 
> -static void __exit tcpmss_tg_exit(void)
> +static void __exit tcpmss_mt_exit(void)
> {
> -	xt_unregister_targets(tcpmss_tg_reg, ARRAY_SIZE(tcpmss_tg_reg));
> +	xt_unregister_matches(tcpmss_mt_reg, ARRAY_SIZE(tcpmss_mt_reg));
> }
> 
> -module_init(tcpmss_tg_init);
> -module_exit(tcpmss_tg_exit);
> +module_init(tcpmss_mt_init);
> +module_exit(tcpmss_mt_exit);
> diff --git a/net/netfilter/xt_dscp.c b/net/netfilter/xt_dscp.c
> index 236ac80..3f83d38 100644
> --- a/net/netfilter/xt_dscp.c
> +++ b/net/netfilter/xt_dscp.c
> @@ -1,11 +1,14 @@
> -/* IP tables module for matching the value of the IPv4/IPv6 DSCP field
> +/* x_tables module for setting the IPv4/IPv6 DSCP field, Version 1.8
>  *
>  * (C) 2002 by Harald Welte <laforge@netfilter.org>
> + * based on ipt_FTOS.c (C) 2000 by Matthew G. Marsh <mgm@paktronix.com>
>  *
>  * This program is free software; you can redistribute it and/or modify
>  * it under the terms of the GNU General Public License version 2 as
>  * published by the Free Software Foundation.
> - */
> + *
> + * See RFC2474 for a description of the DSCP field within the IP Header.
> +*/
> #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> #include <linux/module.h>
> #include <linux/skbuff.h>
> @@ -14,102 +17,150 @@
> #include <net/dsfield.h>
> 
> #include <linux/netfilter/x_tables.h>
> -#include <linux/netfilter/xt_dscp.h>
> +#include <linux/netfilter/xt_DSCP.h>
> 
> MODULE_AUTHOR("Harald Welte <laforge@netfilter.org>");
> -MODULE_DESCRIPTION("Xtables: DSCP/TOS field match");
> +MODULE_DESCRIPTION("Xtables: DSCP/TOS field modification");
> MODULE_LICENSE("GPL");
> -MODULE_ALIAS("ipt_dscp");
> -MODULE_ALIAS("ip6t_dscp");
> -MODULE_ALIAS("ipt_tos");
> -MODULE_ALIAS("ip6t_tos");
> +MODULE_ALIAS("ipt_DSCP");
> +MODULE_ALIAS("ip6t_DSCP");
> +MODULE_ALIAS("ipt_TOS");
> +MODULE_ALIAS("ip6t_TOS");
> 
> -static bool
> -dscp_mt(const struct sk_buff *skb, struct xt_action_param *par)
> +static unsigned int
> +dscp_tg(struct sk_buff *skb, const struct xt_action_param *par)
> {
> -	const struct xt_dscp_info *info = par->matchinfo;
> +	const struct xt_DSCP_info *dinfo = par->targinfo;
> 	u_int8_t dscp = ipv4_get_dsfield(ip_hdr(skb)) >> XT_DSCP_SHIFT;
> 
> -	return (dscp == info->dscp) ^ !!info->invert;
> +	if (dscp != dinfo->dscp) {
> +		if (!skb_make_writable(skb, sizeof(struct iphdr)))
> +			return NF_DROP;
> +
> +		ipv4_change_dsfield(ip_hdr(skb),
> +				    (__force __u8)(~XT_DSCP_MASK),
> +				    dinfo->dscp << XT_DSCP_SHIFT);
> +
> +	}
> +	return XT_CONTINUE;
> }
> 
> -static bool
> -dscp_mt6(const struct sk_buff *skb, struct xt_action_param *par)
> +static unsigned int
> +dscp_tg6(struct sk_buff *skb, const struct xt_action_param *par)
> {
> -	const struct xt_dscp_info *info = par->matchinfo;
> +	const struct xt_DSCP_info *dinfo = par->targinfo;
> 	u_int8_t dscp = ipv6_get_dsfield(ipv6_hdr(skb)) >> XT_DSCP_SHIFT;
> 
> -	return (dscp == info->dscp) ^ !!info->invert;
> +	if (dscp != dinfo->dscp) {
> +		if (!skb_make_writable(skb, sizeof(struct ipv6hdr)))
> +			return NF_DROP;
> +
> +		ipv6_change_dsfield(ipv6_hdr(skb),
> +				    (__force __u8)(~XT_DSCP_MASK),
> +				    dinfo->dscp << XT_DSCP_SHIFT);
> +	}
> +	return XT_CONTINUE;
> }
> 
> -static int dscp_mt_check(const struct xt_mtchk_param *par)
> +static int dscp_tg_check(const struct xt_tgchk_param *par)
> {
> -	const struct xt_dscp_info *info = par->matchinfo;
> +	const struct xt_DSCP_info *info = par->targinfo;
> 
> 	if (info->dscp > XT_DSCP_MAX) {
> 		pr_info("dscp %x out of range\n", info->dscp);
> 		return -EDOM;
> 	}
> -
> 	return 0;
> }
> 
> -static bool tos_mt(const struct sk_buff *skb, struct xt_action_param *par)
> +static unsigned int
> +tos_tg(struct sk_buff *skb, const struct xt_action_param *par)
> +{
> +	const struct xt_tos_target_info *info = par->targinfo;
> +	struct iphdr *iph = ip_hdr(skb);
> +	u_int8_t orig, nv;
> +
> +	orig = ipv4_get_dsfield(iph);
> +	nv   = (orig & ~info->tos_mask) ^ info->tos_value;
> +
> +	if (orig != nv) {
> +		if (!skb_make_writable(skb, sizeof(struct iphdr)))
> +			return NF_DROP;
> +		iph = ip_hdr(skb);
> +		ipv4_change_dsfield(iph, 0, nv);
> +	}
> +
> +	return XT_CONTINUE;
> +}
> +
> +static unsigned int
> +tos_tg6(struct sk_buff *skb, const struct xt_action_param *par)
> {
> -	const struct xt_tos_match_info *info = par->matchinfo;
> -
> -	if (xt_family(par) == NFPROTO_IPV4)
> -		return ((ip_hdr(skb)->tos & info->tos_mask) ==
> -		       info->tos_value) ^ !!info->invert;
> -	else
> -		return ((ipv6_get_dsfield(ipv6_hdr(skb)) & info->tos_mask) ==
> -		       info->tos_value) ^ !!info->invert;
> +	const struct xt_tos_target_info *info = par->targinfo;
> +	struct ipv6hdr *iph = ipv6_hdr(skb);
> +	u_int8_t orig, nv;
> +
> +	orig = ipv6_get_dsfield(iph);
> +	nv   = (orig & ~info->tos_mask) ^ info->tos_value;
> +
> +	if (orig != nv) {
> +		if (!skb_make_writable(skb, sizeof(struct iphdr)))
> +			return NF_DROP;
> +		iph = ipv6_hdr(skb);
> +		ipv6_change_dsfield(iph, 0, nv);
> +	}
> +
> +	return XT_CONTINUE;
> }
> 
> -static struct xt_match dscp_mt_reg[] __read_mostly = {
> +static struct xt_target dscp_tg_reg[] __read_mostly = {
> 	{
> -		.name		= "dscp",
> +		.name		= "DSCP",
> 		.family		= NFPROTO_IPV4,
> -		.checkentry	= dscp_mt_check,
> -		.match		= dscp_mt,
> -		.matchsize	= sizeof(struct xt_dscp_info),
> +		.checkentry	= dscp_tg_check,
> +		.target		= dscp_tg,
> +		.targetsize	= sizeof(struct xt_DSCP_info),
> +		.table		= "mangle",
> 		.me		= THIS_MODULE,
> 	},
> 	{
> -		.name		= "dscp",
> +		.name		= "DSCP",
> 		.family		= NFPROTO_IPV6,
> -		.checkentry	= dscp_mt_check,
> -		.match		= dscp_mt6,
> -		.matchsize	= sizeof(struct xt_dscp_info),
> +		.checkentry	= dscp_tg_check,
> +		.target		= dscp_tg6,
> +		.targetsize	= sizeof(struct xt_DSCP_info),
> +		.table		= "mangle",
> 		.me		= THIS_MODULE,
> 	},
> 	{
> -		.name		= "tos",
> +		.name		= "TOS",
> 		.revision	= 1,
> 		.family		= NFPROTO_IPV4,
> -		.match		= tos_mt,
> -		.matchsize	= sizeof(struct xt_tos_match_info),
> +		.table		= "mangle",
> +		.target		= tos_tg,
> +		.targetsize	= sizeof(struct xt_tos_target_info),
> 		.me		= THIS_MODULE,
> 	},
> 	{
> -		.name		= "tos",
> +		.name		= "TOS",
> 		.revision	= 1,
> 		.family		= NFPROTO_IPV6,
> -		.match		= tos_mt,
> -		.matchsize	= sizeof(struct xt_tos_match_info),
> +		.table		= "mangle",
> +		.target		= tos_tg6,
> +		.targetsize	= sizeof(struct xt_tos_target_info),
> 		.me		= THIS_MODULE,
> 	},
> };
> 
> -static int __init dscp_mt_init(void)
> +static int __init dscp_tg_init(void)
> {
> -	return xt_register_matches(dscp_mt_reg, ARRAY_SIZE(dscp_mt_reg));
> +	return xt_register_targets(dscp_tg_reg, ARRAY_SIZE(dscp_tg_reg));
> }
> 
> -static void __exit dscp_mt_exit(void)
> +static void __exit dscp_tg_exit(void)
> {
> -	xt_unregister_matches(dscp_mt_reg, ARRAY_SIZE(dscp_mt_reg));
> +	xt_unregister_targets(dscp_tg_reg, ARRAY_SIZE(dscp_tg_reg));
> }
> 
> -module_init(dscp_mt_init);
> -module_exit(dscp_mt_exit);
> +module_init(dscp_tg_init);
> +module_exit(dscp_tg_exit);
> diff --git a/net/netfilter/xt_hl.c b/net/netfilter/xt_hl.c
> index 0039511..1535e87 100644
> --- a/net/netfilter/xt_hl.c
> +++ b/net/netfilter/xt_hl.c
> @@ -1,96 +1,169 @@
> /*
> - * IP tables module for matching the value of the TTL
> - * (C) 2000,2001 by Harald Welte <laforge@netfilter.org>
> + * TTL modification target for IP tables
> + * (C) 2000,2005 by Harald Welte <laforge@netfilter.org>
>  *
> - * Hop Limit matching module
> - * (C) 2001-2002 Maciej Soltysiak <solt@dns.toxicfilms.tv>
> + * Hop Limit modification target for ip6tables
> + * Maciej Soltysiak <solt@dns.toxicfilms.tv>
>  *
>  * This program is free software; you can redistribute it and/or modify
>  * it under the terms of the GNU General Public License version 2 as
>  * published by the Free Software Foundation.
>  */
> -
> -#include <linux/ip.h>
> -#include <linux/ipv6.h>
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> #include <linux/module.h>
> #include <linux/skbuff.h>
> +#include <linux/ip.h>
> +#include <linux/ipv6.h>
> +#include <net/checksum.h>
> 
> #include <linux/netfilter/x_tables.h>
> -#include <linux/netfilter_ipv4/ipt_ttl.h>
> -#include <linux/netfilter_ipv6/ip6t_hl.h>
> +#include <linux/netfilter_ipv4/ipt_TTL.h>
> +#include <linux/netfilter_ipv6/ip6t_HL.h>
> 
> +MODULE_AUTHOR("Harald Welte <laforge@netfilter.org>");
> MODULE_AUTHOR("Maciej Soltysiak <solt@dns.toxicfilms.tv>");
> -MODULE_DESCRIPTION("Xtables: Hoplimit/TTL field match");
> +MODULE_DESCRIPTION("Xtables: Hoplimit/TTL Limit field modification target");
> MODULE_LICENSE("GPL");
> -MODULE_ALIAS("ipt_ttl");
> -MODULE_ALIAS("ip6t_hl");
> 
> -static bool ttl_mt(const struct sk_buff *skb, struct xt_action_param *par)
> +static unsigned int
> +ttl_tg(struct sk_buff *skb, const struct xt_action_param *par)
> {
> -	const struct ipt_ttl_info *info = par->matchinfo;
> -	const u8 ttl = ip_hdr(skb)->ttl;
> +	struct iphdr *iph;
> +	const struct ipt_TTL_info *info = par->targinfo;
> +	int new_ttl;
> +
> +	if (!skb_make_writable(skb, skb->len))
> +		return NF_DROP;
> +
> +	iph = ip_hdr(skb);
> 
> 	switch (info->mode) {
> -	case IPT_TTL_EQ:
> -		return ttl == info->ttl;
> -	case IPT_TTL_NE:
> -		return ttl != info->ttl;
> -	case IPT_TTL_LT:
> -		return ttl < info->ttl;
> -	case IPT_TTL_GT:
> -		return ttl > info->ttl;
> +	case IPT_TTL_SET:
> +		new_ttl = info->ttl;
> +		break;
> +	case IPT_TTL_INC:
> +		new_ttl = iph->ttl + info->ttl;
> +		if (new_ttl > 255)
> +			new_ttl = 255;
> +		break;
> +	case IPT_TTL_DEC:
> +		new_ttl = iph->ttl - info->ttl;
> +		if (new_ttl < 0)
> +			new_ttl = 0;
> +		break;
> +	default:
> +		new_ttl = iph->ttl;
> +		break;
> +	}
> +
> +	if (new_ttl != iph->ttl) {
> +		csum_replace2(&iph->check, htons(iph->ttl << 8),
> +					   htons(new_ttl << 8));
> +		iph->ttl = new_ttl;
> 	}
> 
> -	return false;
> +	return XT_CONTINUE;
> }
> 
> -static bool hl_mt6(const struct sk_buff *skb, struct xt_action_param *par)
> +static unsigned int
> +hl_tg6(struct sk_buff *skb, const struct xt_action_param *par)
> {
> -	const struct ip6t_hl_info *info = par->matchinfo;
> -	const struct ipv6hdr *ip6h = ipv6_hdr(skb);
> +	struct ipv6hdr *ip6h;
> +	const struct ip6t_HL_info *info = par->targinfo;
> +	int new_hl;
> +
> +	if (!skb_make_writable(skb, skb->len))
> +		return NF_DROP;
> +
> +	ip6h = ipv6_hdr(skb);
> 
> 	switch (info->mode) {
> -	case IP6T_HL_EQ:
> -		return ip6h->hop_limit == info->hop_limit;
> -	case IP6T_HL_NE:
> -		return ip6h->hop_limit != info->hop_limit;
> -	case IP6T_HL_LT:
> -		return ip6h->hop_limit < info->hop_limit;
> -	case IP6T_HL_GT:
> -		return ip6h->hop_limit > info->hop_limit;
> +	case IP6T_HL_SET:
> +		new_hl = info->hop_limit;
> +		break;
> +	case IP6T_HL_INC:
> +		new_hl = ip6h->hop_limit + info->hop_limit;
> +		if (new_hl > 255)
> +			new_hl = 255;
> +		break;
> +	case IP6T_HL_DEC:
> +		new_hl = ip6h->hop_limit - info->hop_limit;
> +		if (new_hl < 0)
> +			new_hl = 0;
> +		break;
> +	default:
> +		new_hl = ip6h->hop_limit;
> +		break;
> 	}
> 
> -	return false;
> +	ip6h->hop_limit = new_hl;
> +
> +	return XT_CONTINUE;
> +}
> +
> +static int ttl_tg_check(const struct xt_tgchk_param *par)
> +{
> +	const struct ipt_TTL_info *info = par->targinfo;
> +
> +	if (info->mode > IPT_TTL_MAXMODE) {
> +		pr_info("TTL: invalid or unknown mode %u\n", info->mode);
> +		return -EINVAL;
> +	}
> +	if (info->mode != IPT_TTL_SET && info->ttl == 0)
> +		return -EINVAL;
> +	return 0;
> +}
> +
> +static int hl_tg6_check(const struct xt_tgchk_param *par)
> +{
> +	const struct ip6t_HL_info *info = par->targinfo;
> +
> +	if (info->mode > IP6T_HL_MAXMODE) {
> +		pr_info("invalid or unknown mode %u\n", info->mode);
> +		return -EINVAL;
> +	}
> +	if (info->mode != IP6T_HL_SET && info->hop_limit == 0) {
> +		pr_info("increment/decrement does not "
> +			"make sense with value 0\n");
> +		return -EINVAL;
> +	}
> +	return 0;
> }
> 
> -static struct xt_match hl_mt_reg[] __read_mostly = {
> +static struct xt_target hl_tg_reg[] __read_mostly = {
> 	{
> -		.name       = "ttl",
> +		.name       = "TTL",
> 		.revision   = 0,
> 		.family     = NFPROTO_IPV4,
> -		.match      = ttl_mt,
> -		.matchsize  = sizeof(struct ipt_ttl_info),
> +		.target     = ttl_tg,
> +		.targetsize = sizeof(struct ipt_TTL_info),
> +		.table      = "mangle",
> +		.checkentry = ttl_tg_check,
> 		.me         = THIS_MODULE,
> 	},
> 	{
> -		.name       = "hl",
> +		.name       = "HL",
> 		.revision   = 0,
> 		.family     = NFPROTO_IPV6,
> -		.match      = hl_mt6,
> -		.matchsize  = sizeof(struct ip6t_hl_info),
> +		.target     = hl_tg6,
> +		.targetsize = sizeof(struct ip6t_HL_info),
> +		.table      = "mangle",
> +		.checkentry = hl_tg6_check,
> 		.me         = THIS_MODULE,
> 	},
> };
> 
> -static int __init hl_mt_init(void)
> +static int __init hl_tg_init(void)
> {
> -	return xt_register_matches(hl_mt_reg, ARRAY_SIZE(hl_mt_reg));
> +	return xt_register_targets(hl_tg_reg, ARRAY_SIZE(hl_tg_reg));
> }
> 
> -static void __exit hl_mt_exit(void)
> +static void __exit hl_tg_exit(void)
> {
> -	xt_unregister_matches(hl_mt_reg, ARRAY_SIZE(hl_mt_reg));
> +	xt_unregister_targets(hl_tg_reg, ARRAY_SIZE(hl_tg_reg));
> }
> 
> -module_init(hl_mt_init);
> -module_exit(hl_mt_exit);
> +module_init(hl_tg_init);
> +module_exit(hl_tg_exit);
> +MODULE_ALIAS("ipt_TTL");
> +MODULE_ALIAS("ip6t_HL");
> 
> 
> 
>
Andreas Dilger April 17, 2017, 7:07 p.m. UTC | #3
On Apr 14, 2017, at 7:27 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> To summarize the discussion that we had on this week's ext4
> teleconference call, while discussing ways in which we might extend
> ext4's extended attributes to provide better support for Samba.
> 
> Andreas pointed out that we already have an unused field,
> e_value_block, in ext4_xattr_entry structure:
> 
> struct ext4_xattr_entry {
> 	__u8	e_name_len;	/* length of name */
> 	__u8	e_name_index;	/* attribute name index */
> 	__le16	e_value_offs;	/* offset in disk block of value */
> 	__le32	e_value_block;	/* disk block attribute is stored on (n/i) */
> 	__le32	e_value_size;	/* size of attribute value */
> 	__le32	e_hash;		/* hash value of name and value */
> 	char	e_name[0];	/* attribute name */
> };
> 
> It's only a 32-bit field, and it was repurposed in a Lustre-specific
> feature, EXT4_FEATURE_INCOMPAT_EA_INODE as e_value_inum (since inodes
> are only 32-bit today).  If this feature flag is enabled, then kernels
> which understand the feature will treat e_value_block as an inode
> number, and if it is non-zero, the value of that extended attribute is
> stored in the inode.  This ends up burning a lot of extra inodes for
> each extended attribute, which is why there was never much excitement
> for this patch going upstream.
> 
> However, we could extend this feature (it will almost certainly
> require a new INCOMPAT feature flag) such that a particular inode
> could be referenced from multiple strut ext4_xattr_entry's (from
> multiple inodes or from a single inode), since the inode for the xattr
> body already has a ref count, i_links_count.  And given that on a
> typical Windows CIFS file system, there will be dozens of unique
> acl's, the problem of exhausting inodes for xattrs won't be a issue in
> this case.
> 
> 
> However, another approach that we discussed on the weekly conference
> call was to change e_value_size to be an 16-bit field, and to use the
> high 16 bits for flags, where one of the flags bits (say, the MSB)
> would mean that e_value_block and e_value_size should be treated as a
> 48-bit block number, where the block could be stored.
> 
> Thinking about this some more, we can use another 4 bits from the high
> bits of e_value_size as a 16 bit number n, where if n=0, the block
> number is stored in e_value_block and e_value_size as above, and if n
>> 1, that there are additional blocks for the xattr value, which will
> be stored in the place where the xattr value would normally be stored
> (e.g, in the inline xattr space or in the external xattr block).
> So pictorally, it would look like this:
> 
> +----------------+----------------+
> | 128-byte inode | in-line xattr  |
> +----------------+----------------+
>                /                  \
>               /                    \
>              /                      \
>  +---------------------------------------------+
>  | XE | XE | XE |               | XV | XV | XV |   XE == xattr_entry   XV == xattr value
>  +---------------------------------------------+
>           /      \             /     \
>          /        \           /       \
>         /          \         /         \
>    +--------------------+  +-------------+
>    |   ...  | blk0 |... |  | blk1 | blk2 |
>    +--------------------+  +-------------+
> 
> (to those using gmail; please view the above in a fixed-width font, or
> use "show original")
> 
> So in this picture, XE is the ext4_xattr_entry, and in this case, the
> high bits of e_value_size indicate e_value_block and the low bits of
> e_value_size indicate the location of the first 4k block where the
> xattr value is to be stored, and if one were to look at region of
> memory indicated by e_value_offs, there would be two 8-byte block
> numbers indicating the location of the 2nd and 3rd file system blocks
> where the xattr value can be found.

Given that this is a new INCOMPAT feature, wouldn't it be a lot more clear
to just create a new struct ext4_xattr_entry2 and struct ext4_xattr_header2
that had the right fields?  So long as the magic values were in the same
offset we could check at runtime which version of the struct was in use.

Also, IMHO it doesn't make sense to try and address block numbers directly
in this case, but rather just reference an inode number and then relative
offsets within the inode.  That allows flexibility in how the blocks are
addressed, either block mapped or extent mapped or some future type, and
will save space in the xattr entry since it doesn't need to specify both
the block number and offset.

> In the external xattr value blocks, at the beginning of the first
> block (e.g., at blk0), there will be an ext4_xattr_header, so we can
> take advantage of h_refcount field, but with the following changes:
> 
> * The low 16 bits of h_blocks will be used for the size of the xattr;
>  the high bits of h_blocks must be zero (for now).
> 
> * The h_hash field will be a crc32c of the value of the xattr stored
>  in the external xattr value block(s).
> 
> * The h_checksum field will be calculated so that the crc32c covers
>  only the ext4_xattr_header, instead of the entire xattrblock.  e.g.,
>  crc32c(fs uuid || id || xattr header), where id is the inode number
>  if refcount = 1, and blknum otherwise.
> 
> What are the advantages of this approach over the Lustre's
> xattr-value-in-inode approach?  First, we don't need to burn inodes
> for the xattr value.  This could potentially be an issue for Windows
> SID's, since there the number of SID's is roughly equal to number of
> users plus the number of groups.  And for a large enterprise with
> O(100,000) employees, we could burn a pretty large number of inodes.

Sure, the Lustre case is mostly used for very large files (e.g. hundreds
of GB or TB in size) with complex layouts across thousands of servers,
so there aren't expected to be a large number of such files.

> The other advantage of this scheme is that h_refcount field is 32
> bits, where as the inode's i_links_count field is only 16 bits, and
> there could very easily be more than 64k files that might share the
> same Windows ACL or Windows SID.

IMHO, it isn't necessarily desirable to have a single xattr that is
shared by so many inodes.  In cases where filesystem metadata is shared
so widely it can be reconstructed by e2fsck, but in this case the xattr
would be lost.  It would be fine to have an upper limit like 2^16 for
the number of inodes that reference the shared xattr to limit the data
that would be lost in case of corruption.

Cheers, Andreas

> So we would need to figure out some way of dealing with an extended
> i_links_count field if we went with the xattr-value-in-inode approach.
> 
> 
> We don't need to make this to be an either-or choice, of course.  We
> could integrate the Lustre approach as well as this latter approach
> which is more optimized for Windows ACL's.  And I do want to reiterate
> that this is just a rough sketch as a design doc.  I'm sure we may
> want to make changes to it, but hopefully it will serve as a good
> starting point for discussion.
> 
> Cheers,
> 
> 						- Ted
> 
> 
> On Thu, Apr 13, 2017 at 01:58:56PM -0600, Andreas Dilger wrote:
>> Large xattr support is implemented for EXT4_FEATURE_INCOMPAT_EA_INODE.
>> 
>> If the size of an xattr value is larger than will fit in a single
>> external block, then the xattr value will be saved into the body
>> of an external xattr inode.
>> 
>> The also helps support a larger number of xattr, since only the headers
>> will be stored in the in-inode space or the single external block.
>> 
>> The inode is referenced from the xattr header via "e_value_inum",
>> which was formerly "e_value_block", but that field was never used.
>> The e_value_size still contains the xattr size so that listing
>> xattrs does not need to look up the inode if the data is not accessed.
>> 
>> struct ext4_xattr_entry {
>> 	__u8	e_name_len;	/* length of name */
>> 	__u8	e_name_index;	/* attribute name index */
>> 	__le16	e_value_offs;	/* offset in disk block of value */
>> 	__le32	e_value_inum;	/* inode in which value is stored */
>> 	__le32	e_value_size;	/* size of attribute value */
>> 	__le32	e_hash;		/* hash value of name and value */
>> 	char	e_name[0];	/* attribute name */
>> };
>> 
>> The xattr inode is marked with the EXT4_EA_INODE_FL flag and also
>> holds a back-reference to the owning inode in its i_mtime field,
>> allowing the ext4/e2fsck to verify the correct inode is accessed.
> 
> 


Cheers, Andreas
Andreas Dilger April 17, 2017, 7:19 p.m. UTC | #4
On Apr 16, 2017, at 1:09 PM, Alexey Lyashkov <alexey.lyashkov@gmail.com> wrote:
> 
> Andreas,
> 
> I don’t sure it’s good idea to allocate one more inode to store a large EA.
> It dramatically decrease a speed with accessing a EA data in this case.
> And now we have already a hit a limit of inode count with large disks.
> I think it code need to be rewritten to use an special extents to store a
> large EA, as it avoid so much problems related to bad credits while unlinking
> a parent inode, some kind problems with integer overflow as backlink stored on mdata field, and other.
> 
> I know we don’t hit a problems in this area for last year, but anyway - i prefer a different solution.

We are of course not able to change the format of the large xattrs used in
existing filesystems for many years already (the first version of this
feature was in use since 2008).

It would be great if you can work with Ted to implement an improved solution
for large xattrs for ext4.  Since the new version would be using a different
feature flag, with some small amount of compatibility effort there is no
reason why the two cannot exist at the same time, and at some point migrate
from the old feature xattrs to the new one via e2fsck or userspace tool.

Cheers, Andreas

>> 13 апр. 2017 г., в 22:58, Andreas Dilger <adilger@dilger.ca> написал(а):
>> 
>> Large xattr support is implemented for EXT4_FEATURE_INCOMPAT_EA_INODE.
>> 
>> If the size of an xattr value is larger than will fit in a single
>> external block, then the xattr value will be saved into the body
>> of an external xattr inode.
>> 
>> The also helps support a larger number of xattr, since only the headers
>> will be stored in the in-inode space or the single external block.
>> 
>> The inode is referenced from the xattr header via "e_value_inum",
>> which was formerly "e_value_block", but that field was never used.
>> The e_value_size still contains the xattr size so that listing
>> xattrs does not need to look up the inode if the data is not accessed.
>> 
>> struct ext4_xattr_entry {
>> 	__u8	e_name_len;	/* length of name */
>> 	__u8	e_name_index;	/* attribute name index */
>> 	__le16	e_value_offs;	/* offset in disk block of value */
>> 	__le32	e_value_inum;	/* inode in which value is stored */
>> 	__le32	e_value_size;	/* size of attribute value */
>> 	__le32	e_hash;		/* hash value of name and value */
>> 	char	e_name[0];	/* attribute name */
>> };
>> 
>> The xattr inode is marked with the EXT4_EA_INODE_FL flag and also
>> holds a back-reference to the owning inode in its i_mtime field,
>> allowing the ext4/e2fsck to verify the correct inode is accessed.
>> 
>> Lustre-Jira: https://jira.hpdd.intel.com/browse/LU-80
>> Lustre-bugzilla: https://bugzilla.lustre.org/show_bug.cgi?id=4424
>> Signed-off-by: Kalpak Shah <kalpak.shah@sun.com>
>> Signed-off-by: James Simmons <uja.ornl@gmail.com>
>> Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
>> ---
>> 
>> Per recent discussion, here is the latest version of the xattr-in-inode
>> patch.  This has just been freshly updated to the current kernel (from
>> 4.4) and has not even been compiled, so it is unlikely to work properly.
>> The functional parts of the feature and on-disk format are unchanged,
>> and is really what Ted is interested in.
>> 
>> Cheers, Andreas
>> --
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index fb69ee2..afe830b 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1797,6 +1797,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
>> 					 EXT4_FEATURE_INCOMPAT_EXTENTS| \
>> 					 EXT4_FEATURE_INCOMPAT_64BIT| \
>> 					 EXT4_FEATURE_INCOMPAT_FLEX_BG| \
>> +					 EXT4_FEATURE_INCOMPAT_EA_INODE| \
>> 					 EXT4_FEATURE_INCOMPAT_MMP | \
>> 					 EXT4_FEATURE_INCOMPAT_INLINE_DATA | \
>> 					 EXT4_FEATURE_INCOMPAT_ENCRYPT | \
>> @@ -2220,6 +2221,12 @@ struct mmpd_data {
>> #define EXT4_MMP_MAX_CHECK_INTERVAL	300UL
>> 
>> /*
>> + * Maximum size of xattr attributes for FEATURE_INCOMPAT_EA_INODE 1Mb
>> + * This limit is arbitrary, but is reasonable for the xattr API.
>> + */
>> +#define EXT4_XATTR_MAX_LARGE_EA_SIZE    (1024 * 1024)
>> +
>> +/*
>> * Function prototypes
>> */
>> 
>> @@ -2231,6 +2238,10 @@ struct mmpd_data {
>> # define ATTRIB_NORET	__attribute__((noreturn))
>> # define NORET_AND	noreturn,
>> 
>> +struct ext4_xattr_ino_array {
>> +	unsigned int xia_count;		/* # of used item in the array */
>> +	unsigned int xia_inodes[0];
>> +};
>> /* bitmap.c */
>> extern unsigned int ext4_count_free(char *bitmap, unsigned numchars);
>> void ext4_inode_bitmap_csum_set(struct super_block *sb, ext4_group_t group,
>> @@ -2480,6 +2491,7 @@ int do_journal_get_write_access(handle_t *handle,
>> extern void ext4_get_inode_flags(struct ext4_inode_info *);
>> extern int ext4_alloc_da_blocks(struct inode *inode);
>> extern void ext4_set_aops(struct inode *inode);
>> +extern int ext4_meta_trans_blocks(struct inode *, int nrblocks, int chunk);
>> extern int ext4_writepage_trans_blocks(struct inode *);
>> extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
>> extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
>> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
>> index 17bc043..01eaad6 100644
>> --- a/fs/ext4/ialloc.c
>> +++ b/fs/ext4/ialloc.c
>> @@ -294,7 +294,6 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
>> 	 * as writing the quota to disk may need the lock as well.
>> 	 */
>> 	dquot_initialize(inode);
>> -	ext4_xattr_delete_inode(handle, inode);
>> 	dquot_free_inode(inode);
>> 	dquot_drop(inode);
>> 
>> diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
>> index 375fb1c..9601496 100644
>> --- a/fs/ext4/inline.c
>> +++ b/fs/ext4/inline.c
>> @@ -61,7 +61,7 @@ static int get_max_inline_xattr_value_size(struct inode *inode,
>> 
>> 	/* Compute min_offs. */
>> 	for (; !IS_LAST_ENTRY(entry); entry = EXT4_XATTR_NEXT(entry)) {
>> -		if (!entry->e_value_block && entry->e_value_size) {
>> +		if (!entry->e_value_inum && entry->e_value_size) {
>> 			size_t offs = le16_to_cpu(entry->e_value_offs);
>> 			if (offs < min_offs)
>> 				min_offs = offs;
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index b9ffa9f..70069e0 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -139,8 +139,6 @@ static void ext4_invalidatepage(struct page *page, unsigned int offset,
>> 				unsigned int length);
>> static int __ext4_journalled_writepage(struct page *page, unsigned int len);
>> static int ext4_bh_delay_or_unwritten(handle_t *handle, struct buffer_head *bh);
>> -static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
>> -				  int pextents);
>> 
>> /*
>> * Test whether an inode is a fast symlink.
>> @@ -189,6 +187,8 @@ void ext4_evict_inode(struct inode *inode)
>> {
>> 	handle_t *handle;
>> 	int err;
>> +	int extra_credits = 3;
>> +	struct ext4_xattr_ino_array *lea_ino_array = NULL;
>> 
>> 	trace_ext4_evict_inode(inode);
>> 
>> @@ -238,8 +238,8 @@ void ext4_evict_inode(struct inode *inode)
>> 	 * protection against it
>> 	 */
>> 	sb_start_intwrite(inode->i_sb);
>> -	handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE,
>> -				    ext4_blocks_for_truncate(inode)+3);
>> +
>> +	handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, extra_credits);
>> 	if (IS_ERR(handle)) {
>> 		ext4_std_error(inode->i_sb, PTR_ERR(handle));
>> 		/*
>> @@ -251,9 +251,36 @@ void ext4_evict_inode(struct inode *inode)
>> 		sb_end_intwrite(inode->i_sb);
>> 		goto no_delete;
>> 	}
>> -
>> 	if (IS_SYNC(inode))
>> 		ext4_handle_sync(handle);
>> +
>> +	/*
>> +	 * Delete xattr inode before deleting the main inode.
>> +	 */
>> +	err = ext4_xattr_delete_inode(handle, inode, &lea_ino_array);
>> +	if (err) {
>> +		ext4_warning(inode->i_sb,
>> +			     "couldn't delete inode's xattr (err %d)", err);
>> +		goto stop_handle;
>> +	}
>> +
>> +	if (!IS_NOQUOTA(inode))
>> +		extra_credits += 2 * EXT4_QUOTA_DEL_BLOCKS(inode->i_sb);
>> +
>> +	if (!ext4_handle_has_enough_credits(handle,
>> +			ext4_blocks_for_truncate(inode) + extra_credits)) {
>> +		err = ext4_journal_extend(handle,
>> +			ext4_blocks_for_truncate(inode) + extra_credits);
>> +		if (err > 0)
>> +			err = ext4_journal_restart(handle,
>> +			ext4_blocks_for_truncate(inode) + extra_credits);
>> +		if (err != 0) {
>> +			ext4_warning(inode->i_sb,
>> +				     "couldn't extend journal (err %d)", err);
>> +			goto stop_handle;
>> +		}
>> +	}
>> +
>> 	inode->i_size = 0;
>> 	err = ext4_mark_inode_dirty(handle, inode);
>> 	if (err) {
>> @@ -277,10 +304,10 @@ void ext4_evict_inode(struct inode *inode)
>> 	 * enough credits left in the handle to remove the inode from
>> 	 * the orphan list and set the dtime field.
>> 	 */
>> -	if (!ext4_handle_has_enough_credits(handle, 3)) {
>> -		err = ext4_journal_extend(handle, 3);
>> +	if (!ext4_handle_has_enough_credits(handle, extra_credits)) {
>> +		err = ext4_journal_extend(handle, extra_credits);
>> 		if (err > 0)
>> -			err = ext4_journal_restart(handle, 3);
>> +			err = ext4_journal_restart(handle, extra_credits);
>> 		if (err != 0) {
>> 			ext4_warning(inode->i_sb,
>> 				     "couldn't extend journal (err %d)", err);
>> @@ -315,8 +342,12 @@ void ext4_evict_inode(struct inode *inode)
>> 		ext4_clear_inode(inode);
>> 	else
>> 		ext4_free_inode(handle, inode);
>> +
>> 	ext4_journal_stop(handle);
>> 	sb_end_intwrite(inode->i_sb);
>> +
>> +	if (lea_ino_array != NULL)
>> +		ext4_xattr_inode_array_free(inode, lea_ino_array);
>> 	return;
>> no_delete:
>> 	ext4_clear_inode(inode);	/* We must guarantee clearing of inode... */
>> @@ -5475,7 +5506,7 @@ static int ext4_index_trans_blocks(struct inode *inode, int lblocks,
>> *
>> * Also account for superblock, inode, quota and xattr blocks
>> */
>> -static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
>> +int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
>> 				  int pextents)
>> {
>> 	ext4_group_t groups, ngroups = ext4_get_groups_count(inode->i_sb);
>> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
>> index 996e790..f158798 100644
>> --- a/fs/ext4/xattr.c
>> +++ b/fs/ext4/xattr.c
>> @@ -190,9 +190,8 @@ static void ext4_xattr_block_csum_set(struct inode *inode,
>> 
>> 	/* Check the values */
>> 	while (!IS_LAST_ENTRY(entry)) {
>> -		if (entry->e_value_block != 0)
>> -			return -EFSCORRUPTED;
>> -		if (entry->e_value_size != 0) {
>> +		if (entry->e_value_size != 0 &&
>> +		    entry->e_value_inum == 0) {
>> 			u16 offs = le16_to_cpu(entry->e_value_offs);
>> 			u32 size = le32_to_cpu(entry->e_value_size);
>> 			void *value;
>> @@ -258,19 +257,26 @@ static void ext4_xattr_block_csum_set(struct inode *inode,
>> 	__xattr_check_inode((inode), (header), (end), __func__, __LINE__)
>> 
>> static inline int
>> -ext4_xattr_check_entry(struct ext4_xattr_entry *entry, size_t size)
>> +ext4_xattr_check_entry(struct ext4_xattr_entry *entry, size_t size,
>> +		       struct inode *inode)
>> {
>> 	size_t value_size = le32_to_cpu(entry->e_value_size);
>> 
>> -	if (entry->e_value_block != 0 || value_size > size ||
>> +	if (!entry->e_value_inum &&
>> 	    le16_to_cpu(entry->e_value_offs) + value_size > size)
>> 		return -EFSCORRUPTED;
>> +	if (entry->e_value_inum &&
>> +	    (le32_to_cpu(entry->e_value_inum) < EXT4_FIRST_INO(inode->i_sb) ||
>> +	     le32_to_cpu(entry->e_value_inum) >
>> +	     le32_to_cpu(EXT4_SB(inode->i_sb)->s_es->s_inodes_count)))
>> +		return -EFSCORRUPTED;
>> 	return 0;
>> }
>> 
>> static int
>> ext4_xattr_find_entry(struct ext4_xattr_entry **pentry, int name_index,
>> -		      const char *name, size_t size, int sorted)
>> +		      const char *name, size_t size, int sorted,
>> +		      struct inode *inode)
>> {
>> 	struct ext4_xattr_entry *entry;
>> 	size_t name_len;
>> @@ -290,11 +296,104 @@ static void ext4_xattr_block_csum_set(struct inode *inode,
>> 			break;
>> 	}
>> 	*pentry = entry;
>> -	if (!cmp && ext4_xattr_check_entry(entry, size))
>> +	if (!cmp && ext4_xattr_check_entry(entry, size, inode))
>> 		return -EFSCORRUPTED;
>> 	return cmp ? -ENODATA : 0;
>> }
>> 
>> +/*
>> + * Read the EA value from an inode.
>> + */
>> +static int
>> +ext4_xattr_inode_read(struct inode *ea_inode, void *buf, size_t *size)
>> +{
>> +	unsigned long block = 0;
>> +	struct buffer_head *bh = NULL;
>> +	int blocksize;
>> +	size_t csize, ret_size = 0;
>> +
>> +	if (*size == 0)
>> +		return 0;
>> +
>> +	blocksize = ea_inode->i_sb->s_blocksize;
>> +
>> +	while (ret_size < *size) {
>> +		csize = (*size - ret_size) > blocksize ? blocksize :
>> +							*size - ret_size;
>> +		bh = ext4_bread(NULL, ea_inode, block, 0);
>> +		if (IS_ERR(bh)) {
>> +			*size = ret_size;
>> +			return PTR_ERR(bh);
>> +		}
>> +		memcpy(buf, bh->b_data, csize);
>> +		brelse(bh);
>> +
>> +		buf += csize;
>> +		block += 1;
>> +		ret_size += csize;
>> +	}
>> +
>> +	*size = ret_size;
>> +
>> +	return 0;
>> +}
>> +
>> +struct inode *ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino, int *err)
>> +{
>> +	struct inode *ea_inode = NULL;
>> +
>> +	ea_inode = ext4_iget(parent->i_sb, ea_ino);
>> +	if (IS_ERR(ea_inode) || is_bad_inode(ea_inode)) {
>> +		int rc = IS_ERR(ea_inode) ? PTR_ERR(ea_inode) : 0;
>> +		ext4_error(parent->i_sb, "error while reading EA inode %lu "
>> +			   "/ %d %d", ea_ino, rc, is_bad_inode(ea_inode));
>> +		*err = rc != 0 ? rc : -EIO;
>> +		return NULL;
>> +	}
>> +
>> +	if (EXT4_XATTR_INODE_GET_PARENT(ea_inode) != parent->i_ino ||
>> +	    ea_inode->i_generation != parent->i_generation) {
>> +		ext4_error(parent->i_sb, "Backpointer from EA inode %lu "
>> +			   "to parent invalid.", ea_ino);
>> +		*err = -EINVAL;
>> +		goto error;
>> +	}
>> +
>> +	if (!(EXT4_I(ea_inode)->i_flags & EXT4_EA_INODE_FL)) {
>> +		ext4_error(parent->i_sb, "EA inode %lu does not have "
>> +			   "EXT4_EA_INODE_FL flag set.\n", ea_ino);
>> +		*err = -EINVAL;
>> +		goto error;
>> +	}
>> +
>> +	*err = 0;
>> +	return ea_inode;
>> +
>> +error:
>> +	iput(ea_inode);
>> +	return NULL;
>> +}
>> +
>> +/*
>> + * Read the value from the EA inode.
>> + */
>> +static int
>> +ext4_xattr_inode_get(struct inode *inode, unsigned long ea_ino, void *buffer,
>> +		     size_t *size)
>> +{
>> +	struct inode *ea_inode = NULL;
>> +	int err;
>> +
>> +	ea_inode = ext4_xattr_inode_iget(inode, ea_ino, &err);
>> +	if (err)
>> +		return err;
>> +
>> +	err = ext4_xattr_inode_read(ea_inode, buffer, size);
>> +	iput(ea_inode);
>> +
>> +	return err;
>> +}
>> +
>> static int
>> ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
>> 		     void *buffer, size_t buffer_size)
>> @@ -327,7 +426,8 @@ static void ext4_xattr_block_csum_set(struct inode *inode,
>> 	}
>> 	ext4_xattr_cache_insert(ext4_mb_cache, bh);
>> 	entry = BFIRST(bh);
>> -	error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
>> +	error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1,
>> +				      inode);
>> 	if (error == -EFSCORRUPTED)
>> 		goto bad_block;
>> 	if (error)
>> @@ -337,8 +437,16 @@ static void ext4_xattr_block_csum_set(struct inode *inode,
>> 		error = -ERANGE;
>> 		if (size > buffer_size)
>> 			goto cleanup;
>> -		memcpy(buffer, bh->b_data + le16_to_cpu(entry->e_value_offs),
>> -		       size);
>> +		if (entry->e_value_inum) {
>> +			error = ext4_xattr_inode_get(inode,
>> +					     le32_to_cpu(entry->e_value_inum),
>> +					     buffer, &size);
>> +			if (error)
>> +				goto cleanup;
>> +		} else {
>> +			memcpy(buffer, bh->b_data +
>> +			       le16_to_cpu(entry->e_value_offs), size);
>> +		}
>> 	}
>> 	error = size;
>> 
>> @@ -372,7 +480,7 @@ static void ext4_xattr_block_csum_set(struct inode *inode,
>> 	if (error)
>> 		goto cleanup;
>> 	error = ext4_xattr_find_entry(&entry, name_index, name,
>> -				      end - (void *)entry, 0);
>> +				      end - (void *)entry, 0, inode);
>> 	if (error)
>> 		goto cleanup;
>> 	size = le32_to_cpu(entry->e_value_size);
>> @@ -380,8 +488,16 @@ static void ext4_xattr_block_csum_set(struct inode *inode,
>> 		error = -ERANGE;
>> 		if (size > buffer_size)
>> 			goto cleanup;
>> -		memcpy(buffer, (void *)IFIRST(header) +
>> -		       le16_to_cpu(entry->e_value_offs), size);
>> +		if (entry->e_value_inum) {
>> +			error = ext4_xattr_inode_get(inode,
>> +					     le32_to_cpu(entry->e_value_inum),
>> +					     buffer, &size);
>> +			if (error)
>> +				goto cleanup;
>> +		} else {
>> +			memcpy(buffer, (void *)IFIRST(header) +
>> +			       le16_to_cpu(entry->e_value_offs), size);
>> +		}
>> 	}
>> 	error = size;
>> 
>> @@ -648,7 +764,7 @@ static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
>> 				    size_t *min_offs, void *base, int *total)
>> {
>> 	for (; !IS_LAST_ENTRY(last); last = EXT4_XATTR_NEXT(last)) {
>> -		if (last->e_value_size) {
>> +		if (!last->e_value_inum && last->e_value_size) {
>> 			size_t offs = le16_to_cpu(last->e_value_offs);
>> 			if (offs < *min_offs)
>> 				*min_offs = offs;
>> @@ -659,16 +775,172 @@ static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
>> 	return (*min_offs - ((void *)last - base) - sizeof(__u32));
>> }
>> 
>> -static int
>> -ext4_xattr_set_entry(struct ext4_xattr_info *i, struct ext4_xattr_search *s)
>> +/*
>> + * Write the value of the EA in an inode.
>> + */
>> +static int ext4_xattr_inode_write(handle_t *handle, struct inode *ea_inode,
>> +				  const void *buf, int bufsize)
>> +{
>> +	struct buffer_head *bh = NULL;
>> +	unsigned long block = 0;
>> +	unsigned blocksize = ea_inode->i_sb->s_blocksize;
>> +	unsigned max_blocks = (bufsize + blocksize - 1) >> ea_inode->i_blkbits;
>> +	int csize, wsize = 0;
>> +	int ret = 0;
>> +	int retries = 0;
>> +
>> +retry:
>> +	while (ret >= 0 && ret < max_blocks) {
>> +		struct ext4_map_blocks map;
>> +		map.m_lblk = block += ret;
>> +		map.m_len = max_blocks -= ret;
>> +
>> +		ret = ext4_map_blocks(handle, ea_inode, &map,
>> +				      EXT4_GET_BLOCKS_CREATE);
>> +		if (ret <= 0) {
>> +			ext4_mark_inode_dirty(handle, ea_inode);
>> +			if (ret == -ENOSPC &&
>> +			    ext4_should_retry_alloc(ea_inode->i_sb, &retries)) {
>> +				ret = 0;
>> +				goto retry;
>> +			}
>> +			break;
>> +		}
>> +	}
>> +
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	block = 0;
>> +	while (wsize < bufsize) {
>> +		if (bh != NULL)
>> +			brelse(bh);
>> +		csize = (bufsize - wsize) > blocksize ? blocksize :
>> +								bufsize - wsize;
>> +		bh = ext4_getblk(handle, ea_inode, block, 0);
>> +		if (IS_ERR(bh)) {
>> +			ret = PTR_ERR(bh);
>> +			goto out;
>> +		}
>> +		ret = ext4_journal_get_write_access(handle, bh);
>> +		if (ret)
>> +			goto out;
>> +
>> +		memcpy(bh->b_data, buf, csize);
>> +		set_buffer_uptodate(bh);
>> +		ext4_handle_dirty_metadata(handle, ea_inode, bh);
>> +
>> +		buf += csize;
>> +		wsize += csize;
>> +		block += 1;
>> +	}
>> +
>> +	mutex_lock(&ea_inode->i_mutex);
>> +	i_size_write(ea_inode, wsize);
>> +	ext4_update_i_disksize(ea_inode, wsize);
>> +	mutex_unlock(&ea_inode->i_mutex);
>> +
>> +	ext4_mark_inode_dirty(handle, ea_inode);
>> +
>> +out:
>> +	brelse(bh);
>> +
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Create an inode to store the value of a large EA.
>> + */
>> +static struct inode *ext4_xattr_inode_create(handle_t *handle,
>> +					     struct inode *inode)
>> +{
>> +	struct inode *ea_inode = NULL;
>> +
>> +	/*
>> +	 * Let the next inode be the goal, so we try and allocate the EA inode
>> +	 * in the same group, or nearby one.
>> +	 */
>> +	ea_inode = ext4_new_inode(handle, inode->i_sb->s_root->d_inode,
>> +				  S_IFREG | 0600, NULL, inode->i_ino + 1, NULL);
>> +	if (!IS_ERR(ea_inode)) {
>> +		ea_inode->i_op = &ext4_file_inode_operations;
>> +		ea_inode->i_fop = &ext4_file_operations;
>> +		ext4_set_aops(ea_inode);
>> +		ea_inode->i_generation = inode->i_generation;
>> +		EXT4_I(ea_inode)->i_flags |= EXT4_EA_INODE_FL;
>> +
>> +		/*
>> +		 * A back-pointer from EA inode to parent inode will be useful
>> +		 * for e2fsck.
>> +		 */
>> +		EXT4_XATTR_INODE_SET_PARENT(ea_inode, inode->i_ino);
>> +		unlock_new_inode(ea_inode);
>> +	}
>> +
>> +	return ea_inode;
>> +}
>> +
>> +/*
>> + * Unlink the inode storing the value of the EA.
>> + */
>> +int ext4_xattr_inode_unlink(struct inode *inode, unsigned long ea_ino)
>> +{
>> +	struct inode *ea_inode = NULL;
>> +	int err;
>> +
>> +	ea_inode = ext4_xattr_inode_iget(inode, ea_ino, &err);
>> +	if (err)
>> +		return err;
>> +
>> +	clear_nlink(ea_inode);
>> +	iput(ea_inode);
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Add value of the EA in an inode.
>> + */
>> +static int ext4_xattr_inode_set(handle_t *handle, struct inode *inode,
>> +				unsigned long *ea_ino, const void *value,
>> +				size_t value_len)
>> +{
>> +	struct inode *ea_inode;
>> +	int err;
>> +
>> +	/* Create an inode for the EA value */
>> +	ea_inode = ext4_xattr_inode_create(handle, inode);
>> +	if (IS_ERR(ea_inode))
>> +		return PTR_ERR(ea_inode);
>> +
>> +	err = ext4_xattr_inode_write(handle, ea_inode, value, value_len);
>> +	if (err)
>> +		clear_nlink(ea_inode);
>> +	else
>> +		*ea_ino = ea_inode->i_ino;
>> +
>> +	iput(ea_inode);
>> +
>> +	return err;
>> +}
>> +
>> +static int ext4_xattr_set_entry(struct ext4_xattr_info *i,
>> +				struct ext4_xattr_search *s,
>> +				handle_t *handle, struct inode *inode)
>> {
>> 	struct ext4_xattr_entry *last;
>> 	size_t free, min_offs = s->end - s->base, name_len = strlen(i->name);
>> +	int in_inode = i->in_inode;
>> +
>> +	if (ext4_feature_incompat(inode->i_sb, EA_INODE) &&
>> +	    (EXT4_XATTR_SIZE(i->value_len) >
>> +	     EXT4_XATTR_MIN_LARGE_EA_SIZE(inode->i_sb->s_blocksize)))
>> +		in_inode = 1;
>> 
>> 	/* Compute min_offs and last. */
>> 	last = s->first;
>> 	for (; !IS_LAST_ENTRY(last); last = EXT4_XATTR_NEXT(last)) {
>> -		if (last->e_value_size) {
>> +		if (!last->e_value_inum && last->e_value_size) {
>> 			size_t offs = le16_to_cpu(last->e_value_offs);
>> 			if (offs < min_offs)
>> 				min_offs = offs;
>> @@ -676,15 +948,20 @@ static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
>> 	}
>> 	free = min_offs - ((void *)last - s->base) - sizeof(__u32);
>> 	if (!s->not_found) {
>> -		if (s->here->e_value_size) {
>> +		if (!in_inode &&
>> +		    !s->here->e_value_inum && s->here->e_value_size) {
>> 			size_t size = le32_to_cpu(s->here->e_value_size);
>> 			free += EXT4_XATTR_SIZE(size);
>> 		}
>> 		free += EXT4_XATTR_LEN(name_len);
>> 	}
>> 	if (i->value) {
>> -		if (free < EXT4_XATTR_LEN(name_len) +
>> -			   EXT4_XATTR_SIZE(i->value_len))
>> +		size_t value_len = EXT4_XATTR_SIZE(i->value_len);
>> +
>> +		if (in_inode)
>> +			value_len = 0;
>> +
>> +		if (free < EXT4_XATTR_LEN(name_len) + value_len)
>> 			return -ENOSPC;
>> 	}
>> 
>> @@ -698,7 +975,8 @@ static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
>> 		s->here->e_name_len = name_len;
>> 		memcpy(s->here->e_name, i->name, name_len);
>> 	} else {
>> -		if (s->here->e_value_size) {
>> +		if (!s->here->e_value_inum && s->here->e_value_size &&
>> +		    s->here->e_value_offs > 0) {
>> 			void *first_val = s->base + min_offs;
>> 			size_t offs = le16_to_cpu(s->here->e_value_offs);
>> 			void *val = s->base + offs;
>> @@ -732,12 +1010,18 @@ static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
>> 			last = s->first;
>> 			while (!IS_LAST_ENTRY(last)) {
>> 				size_t o = le16_to_cpu(last->e_value_offs);
>> -				if (last->e_value_size && o < offs)
>> +				if (!last->e_value_inum &&
>> +				    last->e_value_size && o < offs)
>> 					last->e_value_offs =
>> 						cpu_to_le16(o + size);
>> 				last = EXT4_XATTR_NEXT(last);
>> 			}
>> 		}
>> +		if (s->here->e_value_inum) {
>> +			ext4_xattr_inode_unlink(inode,
>> +					    le32_to_cpu(s->here->e_value_inum);
>> +			s->here->e_value_inum = 0;
>> +		}
>> 		if (!i->value) {
>> 			/* Remove the old name. */
>> 			size_t size = EXT4_XATTR_LEN(name_len);
>> @@ -750,11 +1034,20 @@ static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
>> 
>> 	if (i->value) {
>> 		/* Insert the new value. */
>> -		s->here->e_value_size = cpu_to_le32(i->value_len);
>> -		if (i->value_len) {
>> +		if (in_inode) {
>> +			unsigned long ea_ino =
>> +				le32_to_cpu(s->here->e_value_inum);
>> +			rc = ext4_xattr_inode_set(handle, inode, &ea_ino,
>> +						  i->value, i->value_len);
>> +			if (rc)
>> +				goto out;
>> +			s->here->e_value_inum = cpu_to_le32(ea_ino);
>> +			s->here->e_value_offs = 0;
>> +		} else if (i->value_len) {
>> 			size_t size = EXT4_XATTR_SIZE(i->value_len);
>> 			void *val = s->base + min_offs - size;
>> 			s->here->e_value_offs = cpu_to_le16(min_offs - size);
>> +			s->here->e_value_inum = 0;
>> 			if (i->value == EXT4_ZERO_XATTR_VALUE) {
>> 				memset(val, 0, size);
>> 			} else {
>> @@ -764,8 +1057,11 @@ static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
>> 				memcpy(val, i->value, i->value_len);
>> 			}
>> 		}
>> +		s->here->e_value_size = cpu_to_le32(i->value_len);
>> 	}
>> -	return 0;
>> +
>> +out:
>> +	return rc;
>> }
>> 
>> struct ext4_xattr_block_find {
>> @@ -804,7 +1100,7 @@ struct ext4_xattr_block_find {
>> 		bs->s.end = bs->bh->b_data + bs->bh->b_size;
>> 		bs->s.here = bs->s.first;
>> 		error = ext4_xattr_find_entry(&bs->s.here, i->name_index,
>> -					      i->name, bs->bh->b_size, 1);
>> +					     i->name, bs->bh->b_size, 1, inode);
>> 		if (error && error != -ENODATA)
>> 			goto cleanup;
>> 		bs->s.not_found = error;
>> @@ -829,8 +1125,6 @@ struct ext4_xattr_block_find {
>> 
>> #define header(x) ((struct ext4_xattr_header *)(x))
>> 
>> -	if (i->value && i->value_len > sb->s_blocksize)
>> -		return -ENOSPC;
>> 	if (s->base) {
>> 		BUFFER_TRACE(bs->bh, "get_write_access");
>> 		error = ext4_journal_get_write_access(handle, bs->bh);
>> @@ -849,7 +1143,7 @@ struct ext4_xattr_block_find {
>> 			mb_cache_entry_delete_block(ext4_mb_cache, hash,
>> 						    bs->bh->b_blocknr);
>> 			ea_bdebug(bs->bh, "modifying in-place");
>> -			error = ext4_xattr_set_entry(i, s);
>> +			error = ext4_xattr_set_entry(i, s, handle, inode);
>> 			if (!error) {
>> 				if (!IS_LAST_ENTRY(s->first))
>> 					ext4_xattr_rehash(header(s->base),
>> @@ -898,7 +1192,7 @@ struct ext4_xattr_block_find {
>> 		s->end = s->base + sb->s_blocksize;
>> 	}
>> 
>> -	error = ext4_xattr_set_entry(i, s);
>> +	error = ext4_xattr_set_entry(i, s, handle, inode);
>> 	if (error == -EFSCORRUPTED)
>> 		goto bad_block;
>> 	if (error)
>> @@ -1077,7 +1371,7 @@ int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
>> 		/* Find the named attribute. */
>> 		error = ext4_xattr_find_entry(&is->s.here, i->name_index,
>> 					      i->name, is->s.end -
>> -					      (void *)is->s.base, 0);
>> +					      (void *)is->s.base, 0, inode);
>> 		if (error && error != -ENODATA)
>> 			return error;
>> 		is->s.not_found = error;
>> @@ -1095,7 +1389,7 @@ int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
>> 
>> 	if (EXT4_I(inode)->i_extra_isize == 0)
>> 		return -ENOSPC;
>> -	error = ext4_xattr_set_entry(i, s);
>> +	error = ext4_xattr_set_entry(i, s, handle, inode);
>> 	if (error) {
>> 		if (error == -ENOSPC &&
>> 		    ext4_has_inline_data(inode)) {
>> @@ -1107,7 +1401,7 @@ int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
>> 			error = ext4_xattr_ibody_find(inode, i, is);
>> 			if (error)
>> 				return error;
>> -			error = ext4_xattr_set_entry(i, s);
>> +			error = ext4_xattr_set_entry(i, s, handle, inode);
>> 		}
>> 		if (error)
>> 			return error;
>> @@ -1133,7 +1427,7 @@ static int ext4_xattr_ibody_set(struct inode *inode,
>> 
>> 	if (EXT4_I(inode)->i_extra_isize == 0)
>> 		return -ENOSPC;
>> -	error = ext4_xattr_set_entry(i, s);
>> +	error = ext4_xattr_set_entry(i, s, handle, inode);
>> 	if (error)
>> 		return error;
>> 	header = IHDR(inode, ext4_raw_inode(&is->iloc));
>> @@ -1180,7 +1474,7 @@ static int ext4_xattr_value_same(struct ext4_xattr_search *s,
>> 		.name = name,
>> 		.value = value,
>> 		.value_len = value_len,
>> -
>> +		.in_inode = 0,
>> 	};
>> 	struct ext4_xattr_ibody_find is = {
>> 		.s = { .not_found = -ENODATA, },
>> @@ -1250,6 +1544,15 @@ static int ext4_xattr_value_same(struct ext4_xattr_search *s,
>> 					goto cleanup;
>> 			}
>> 			error = ext4_xattr_block_set(handle, inode, &i, &bs);
>> +			if (EXT4_HAS_INCOMPAT_FEATURE(inode->i_sb,
>> +					EXT4_FEATURE_INCOMPAT_EA_INODE) &&
>> +			    error == -ENOSPC) {
>> +				/* xattr not fit to block, store at external
>> +				 * inode */
>> +				i.in_inode = 1;
>> +				error = ext4_xattr_ibody_set(handle, inode,
>> +							     &i, &is);
>> +			}
>> 			if (error)
>> 				goto cleanup;
>> 			if (!is.s.not_found) {
>> @@ -1293,9 +1596,22 @@ static int ext4_xattr_value_same(struct ext4_xattr_search *s,
>> 	       const void *value, size_t value_len, int flags)
>> {
>> 	handle_t *handle;
>> +	struct super_block *sb = inode->i_sb;
>> 	int error, retries = 0;
>> 	int credits = ext4_jbd2_credits_xattr(inode);
>> 
>> +	if ((value_len >= EXT4_XATTR_MIN_LARGE_EA_SIZE(sb->s_blocksize)) &&
>> +	    EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EA_INODE)) {
>> +		int nrblocks = (value_len + sb->s_blocksize - 1) >>
>> +					sb->s_blocksize_bits;
>> +
>> +		/* For new inode */
>> +		credits += EXT4_SINGLEDATA_TRANS_BLOCKS(sb) + 3;
>> +
>> +		/* For data blocks of EA inode */
>> +		credits += ext4_meta_trans_blocks(inode, nrblocks, 0);
>> +	}
>> +
>> retry:
>> 	handle = ext4_journal_start(inode, EXT4_HT_XATTR, credits);
>> 	if (IS_ERR(handle)) {
>> @@ -1307,7 +1623,7 @@ static int ext4_xattr_value_same(struct ext4_xattr_search *s,
>> 					      value, value_len, flags);
>> 		error2 = ext4_journal_stop(handle);
>> 		if (error == -ENOSPC &&
>> -		    ext4_should_retry_alloc(inode->i_sb, &retries))
>> +		    ext4_should_retry_alloc(sb, &retries))
>> 			goto retry;
>> 		if (error == 0)
>> 			error = error2;
>> @@ -1332,7 +1648,7 @@ static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry,
>> 
>> 	/* Adjust the value offsets of the entries */
>> 	for (; !IS_LAST_ENTRY(last); last = EXT4_XATTR_NEXT(last)) {
>> -		if (last->e_value_size) {
>> +		if (!last->e_value_inum && last->e_value_size) {
>> 			new_offs = le16_to_cpu(last->e_value_offs) +
>> 							value_offs_shift;
>> 			last->e_value_offs = cpu_to_le16(new_offs);
>> @@ -1593,21 +1909,135 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
>> }
>> 
>> 
>> +#define EIA_INCR 16 /* must be 2^n */
>> +#define EIA_MASK (EIA_INCR - 1)
>> +/* Add the large xattr @ino into @lea_ino_array for later deletion.
>> + * If @lea_ino_array is new or full it will be grown and the old
>> + * contents copied over.
>> + */
>> +static int
>> +ext4_expand_ino_array(struct ext4_xattr_ino_array **lea_ino_array, __u32 ino)
>> +{
>> +	if (*lea_ino_array == NULL) {
>> +		/*
>> +		 * Start with 15 inodes, so it fits into a power-of-two size.
>> +		 * If *lea_ino_array is NULL, this is essentially offsetof()
>> +		 */
>> +		(*lea_ino_array) =
>> +			kmalloc(offsetof(struct ext4_xattr_ino_array,
>> +					 xia_inodes[EIA_MASK]),
>> +				GFP_NOFS);
>> +		if (*lea_ino_array == NULL)
>> +			return -ENOMEM;
>> +		(*lea_ino_array)->xia_count = 0;
>> +	} else if (((*lea_ino_array)->xia_count & EIA_MASK) == EIA_MASK) {
>> +		/* expand the array once all 15 + n * 16 slots are full */
>> +		struct ext4_xattr_ino_array *new_array = NULL;
>> +		int count = (*lea_ino_array)->xia_count;
>> +
>> +		/* if new_array is NULL, this is essentially offsetof() */
>> +		new_array = kmalloc(
>> +				offsetof(struct ext4_xattr_ino_array,
>> +					 xia_inodes[count + EIA_INCR]),
>> +				GFP_NOFS);
>> +		if (new_array == NULL)
>> +			return -ENOMEM;
>> +		memcpy(new_array, *lea_ino_array,
>> +		       offsetof(struct ext4_xattr_ino_array,
>> +				xia_inodes[count]));
>> +		kfree(*lea_ino_array);
>> +		*lea_ino_array = new_array;
>> +	}
>> +	(*lea_ino_array)->xia_inodes[(*lea_ino_array)->xia_count++] = ino;
>> +	return 0;
>> +}
>> +
>> +/**
>> + * Add xattr inode to orphan list
>> + */
>> +static int
>> +ext4_xattr_inode_orphan_add(handle_t *handle, struct inode *inode,
>> +			int credits, struct ext4_xattr_ino_array *lea_ino_array)
>> +{
>> +	struct inode *ea_inode = NULL;
>> +	int idx = 0, error = 0;
>> +
>> +	if (lea_ino_array == NULL)
>> +		return 0;
>> +
>> +	for (; idx < lea_ino_array->xia_count; ++idx) {
>> +		if (!ext4_handle_has_enough_credits(handle, credits)) {
>> +			error = ext4_journal_extend(handle, credits);
>> +			if (error > 0)
>> +				error = ext4_journal_restart(handle, credits);
>> +
>> +			if (error != 0) {
>> +				ext4_warning(inode->i_sb,
>> +					"couldn't extend journal "
>> +					"(err %d)", error);
>> +				return error;
>> +			}
>> +		}
>> +		ea_inode = ext4_xattr_inode_iget(inode,
>> +				lea_ino_array->xia_inodes[idx], &error);
>> +		if (error)
>> +			continue;
>> +		ext4_orphan_add(handle, ea_inode);
>> +		/* the inode's i_count will be released by caller */
>> +	}
>> +
>> +	return 0;
>> +}
>> 
>> /*
>> * ext4_xattr_delete_inode()
>> *
>> - * Free extended attribute resources associated with this inode. This
>> + * Free extended attribute resources associated with this inode. Traverse
>> + * all entries and unlink any xattr inodes associated with this inode. This
>> * is called immediately before an inode is freed. We have exclusive
>> - * access to the inode.
>> + * access to the inode. If an orphan inode is deleted it will also delete any
>> + * xattr block and all xattr inodes. They are checked by ext4_xattr_inode_iget()
>> + * to ensure they belong to the parent inode and were not deleted already.
>> */
>> -void
>> -ext4_xattr_delete_inode(handle_t *handle, struct inode *inode)
>> +int
>> +ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
>> +			struct ext4_xattr_ino_array **lea_ino_array)
>> {
>> 	struct buffer_head *bh = NULL;
>> +	struct ext4_xattr_ibody_header *header;
>> +	struct ext4_inode *raw_inode;
>> +	struct ext4_iloc iloc;
>> +	struct ext4_xattr_entry *entry;
>> +	int credits = 3, error = 0;
>> 
>> -	if (!EXT4_I(inode)->i_file_acl)
>> +	if (!ext4_test_inode_state(inode, EXT4_STATE_XATTR))
>> +		goto delete_external_ea;
>> +
>> +	error = ext4_get_inode_loc(inode, &iloc);
>> +	if (error)
>> +		goto cleanup;
>> +	raw_inode = ext4_raw_inode(&iloc);
>> +	header = IHDR(inode, raw_inode);
>> +	for (entry = IFIRST(header); !IS_LAST_ENTRY(entry);
>> +	     entry = EXT4_XATTR_NEXT(entry)) {
>> +		if (!entry->e_value_inum)
>> +			continue;
>> +		if (ext4_expand_ino_array(lea_ino_array,
>> +					  entry->e_value_inum) != 0) {
>> +			brelse(iloc.bh);
>> +			goto cleanup;
>> +		}
>> +		entry->e_value_inum = 0;
>> +	}
>> +	brelse(iloc.bh);
>> +
>> +delete_external_ea:
>> +	if (!EXT4_I(inode)->i_file_acl) {
>> +		/* add xattr inode to orphan list */
>> +		ext4_xattr_inode_orphan_add(handle, inode, credits,
>> +						*lea_ino_array);
>> 		goto cleanup;
>> +	}
>> 	bh = sb_bread(inode->i_sb, EXT4_I(inode)->i_file_acl);
>> 	if (!bh) {
>> 		EXT4_ERROR_INODE(inode, "block %llu read error",
>> @@ -1620,11 +2050,69 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
>> 				 EXT4_I(inode)->i_file_acl);
>> 		goto cleanup;
>> 	}
>> +
>> +	for (entry = BFIRST(bh); !IS_LAST_ENTRY(entry);
>> +	     entry = EXT4_XATTR_NEXT(entry)) {
>> +		if (!entry->e_value_inum)
>> +			continue;
>> +		if (ext4_expand_ino_array(lea_ino_array,
>> +					  entry->e_value_inum) != 0)
>> +			goto cleanup;
>> +		entry->e_value_inum = 0;
>> +	}
>> +
>> +	/* add xattr inode to orphan list */
>> +	error = ext4_xattr_inode_orphan_add(handle, inode, credits,
>> +					*lea_ino_array);
>> +	if (error != 0)
>> +		goto cleanup;
>> +
>> +	if (!IS_NOQUOTA(inode))
>> +		credits += 2 * EXT4_QUOTA_DEL_BLOCKS(inode->i_sb);
>> +
>> +	if (!ext4_handle_has_enough_credits(handle, credits)) {
>> +		error = ext4_journal_extend(handle, credits);
>> +		if (error > 0)
>> +			error = ext4_journal_restart(handle, credits);
>> +		if (error != 0) {
>> +			ext4_warning(inode->i_sb,
>> +				"couldn't extend journal (err %d)", error);
>> +			goto cleanup;
>> +		}
>> +	}
>> +
>> 	ext4_xattr_release_block(handle, inode, bh);
>> 	EXT4_I(inode)->i_file_acl = 0;
>> 
>> cleanup:
>> 	brelse(bh);
>> +
>> +	return error;
>> +}
>> +
>> +void
>> +ext4_xattr_inode_array_free(struct inode *inode,
>> +			    struct ext4_xattr_ino_array *lea_ino_array)
>> +{
>> +	struct inode	*ea_inode = NULL;
>> +	int		idx = 0;
>> +	int		err;
>> +
>> +	if (lea_ino_array == NULL)
>> +		return;
>> +
>> +	for (; idx < lea_ino_array->xia_count; ++idx) {
>> +		ea_inode = ext4_xattr_inode_iget(inode,
>> +				lea_ino_array->xia_inodes[idx], &err);
>> +		if (err)
>> +			continue;
>> +		/* for inode's i_count get from ext4_xattr_delete_inode */
>> +		if (!list_empty(&EXT4_I(ea_inode)->i_orphan))
>> +			iput(ea_inode);
>> +		clear_nlink(ea_inode);
>> +		iput(ea_inode);
>> +	}
>> +	kfree(lea_ino_array);
>> }
>> 
>> /*
>> @@ -1676,10 +2164,9 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
>> 		    entry1->e_name_index != entry2->e_name_index ||
>> 		    entry1->e_name_len != entry2->e_name_len ||
>> 		    entry1->e_value_size != entry2->e_value_size ||
>> +		    entry1->e_value_inum != entry2->e_value_inum ||
>> 		    memcmp(entry1->e_name, entry2->e_name, entry1->e_name_len))
>> 			return 1;
>> -		if (entry1->e_value_block != 0 || entry2->e_value_block != 0)
>> -			return -EFSCORRUPTED;
>> 		if (memcmp((char *)header1 + le16_to_cpu(entry1->e_value_offs),
>> 			   (char *)header2 + le16_to_cpu(entry2->e_value_offs),
>> 			   le32_to_cpu(entry1->e_value_size)))
>> @@ -1751,7 +2238,7 @@ static inline void ext4_xattr_hash_entry(struct ext4_xattr_header *header,
>> 		       *name++;
>> 	}
>> 
>> -	if (entry->e_value_size != 0) {
>> +	if (!entry->e_value_inum && entry->e_value_size) {
>> 		__le32 *value = (__le32 *)((char *)header +
>> 			le16_to_cpu(entry->e_value_offs));
>> 		for (n = (le32_to_cpu(entry->e_value_size) +
>> diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
>> index 099c8b6..6e10ff9 100644
>> --- a/fs/ext4/xattr.h
>> +++ b/fs/ext4/xattr.h
>> @@ -44,7 +44,7 @@ struct ext4_xattr_entry {
>> 	__u8	e_name_len;	/* length of name */
>> 	__u8	e_name_index;	/* attribute name index */
>> 	__le16	e_value_offs;	/* offset in disk block of value */
>> -	__le32	e_value_block;	/* disk block attribute is stored on (n/i) */
>> +	__le32	e_value_inum;	/* inode in which the value is stored */
>> 	__le32	e_value_size;	/* size of attribute value */
>> 	__le32	e_hash;		/* hash value of name and value */
>> 	char	e_name[0];	/* attribute name */
>> @@ -69,6 +69,26 @@ struct ext4_xattr_entry {
>> 		EXT4_I(inode)->i_extra_isize))
>> #define IFIRST(hdr) ((struct ext4_xattr_entry *)((hdr)+1))
>> 
>> +/*
>> + * Link EA inode back to parent one using i_mtime field.
>> + * Extra integer type conversion added to ignore higher
>> + * bits in i_mtime.tv_sec which might be set by ext4_get()
>> + */
>> +#define EXT4_XATTR_INODE_SET_PARENT(inode, inum)      \
>> +do {                                                  \
>> +      (inode)->i_mtime.tv_sec = inum;                 \
>> +} while(0)
>> +
>> +#define EXT4_XATTR_INODE_GET_PARENT(inode)            \
>> +((__u32)(inode)->i_mtime.tv_sec)
>> +
>> +/*
>> + * The minimum size of EA value when you start storing it in an external inode
>> + * size of block - size of header - size of 1 entry - 4 null bytes
>> +*/
>> +#define EXT4_XATTR_MIN_LARGE_EA_SIZE(b)					\
>> +	((b) - EXT4_XATTR_LEN(3) - sizeof(struct ext4_xattr_header) - 4)
>> +
>> #define BHDR(bh) ((struct ext4_xattr_header *)((bh)->b_data))
>> #define ENTRY(ptr) ((struct ext4_xattr_entry *)(ptr))
>> #define BFIRST(bh) ENTRY(BHDR(bh)+1)
>> @@ -77,10 +97,11 @@ struct ext4_xattr_entry {
>> #define EXT4_ZERO_XATTR_VALUE ((void *)-1)
>> 
>> struct ext4_xattr_info {
>> -	int name_index;
>> 	const char *name;
>> 	const void *value;
>> 	size_t value_len;
>> +	int name_index;
>> +	int in_inode;
>> };
>> 
>> struct ext4_xattr_search {
>> @@ -140,7 +161,13 @@ static inline void ext4_write_unlock_xattr(struct inode *inode, int *save)
>> extern int ext4_xattr_set(struct inode *, int, const char *, const void *, size_t, int);
>> extern int ext4_xattr_set_handle(handle_t *, struct inode *, int, const char *, const void *, size_t, int);
>> 
>> -extern void ext4_xattr_delete_inode(handle_t *, struct inode *);
>> +extern struct inode *ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
>> +					   int *err);
>> +extern int ext4_xattr_inode_unlink(struct inode *inode, unsigned long ea_ino);
>> +extern int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
>> +				   struct ext4_xattr_ino_array **array);
>> +extern void ext4_xattr_inode_array_free(struct inode *inode,
>> +					struct ext4_xattr_ino_array *array);
>> 
>> extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
>> 			    struct ext4_inode *raw_inode, handle_t *handle);
>> diff --git a/include/uapi/linux/netfilter/xt_CONNMARK.h b/include/uapi/linux/netfilter/xt_CONNMARK.h
>> index 2f2e48e..efc17a8 100644
>> --- a/include/uapi/linux/netfilter/xt_CONNMARK.h
>> +++ b/include/uapi/linux/netfilter/xt_CONNMARK.h
>> @@ -1,6 +1,31 @@
>> -#ifndef _XT_CONNMARK_H_target
>> -#define _XT_CONNMARK_H_target
>> +#ifndef _XT_CONNMARK_H
>> +#define _XT_CONNMARK_H
>> 
>> -#include <linux/netfilter/xt_connmark.h>
>> +#include <linux/types.h>
>> 
>> -#endif /*_XT_CONNMARK_H_target*/
>> +/* Copyright (C) 2002,2004 MARA Systems AB <http://www.marasystems.com>
>> + * by Henrik Nordstrom <hno@marasystems.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + */
>> +
>> +enum {
>> +	XT_CONNMARK_SET = 0,
>> +	XT_CONNMARK_SAVE,
>> +	XT_CONNMARK_RESTORE
>> +};
>> +
>> +struct xt_connmark_tginfo1 {
>> +	__u32 ctmark, ctmask, nfmask;
>> +	__u8 mode;
>> +};
>> +
>> +struct xt_connmark_mtinfo1 {
>> +	__u32 mark, mask;
>> +	__u8 invert;
>> +};
>> +
>> +#endif /*_XT_CONNMARK_H*/
>> diff --git a/include/uapi/linux/netfilter/xt_DSCP.h b/include/uapi/linux/netfilter/xt_DSCP.h
>> index 648e0b3..15f8932 100644
>> --- a/include/uapi/linux/netfilter/xt_DSCP.h
>> +++ b/include/uapi/linux/netfilter/xt_DSCP.h
>> @@ -1,26 +1,31 @@
>> -/* x_tables module for setting the IPv4/IPv6 DSCP field
>> +/* x_tables module for matching the IPv4/IPv6 DSCP field
>> *
>> * (C) 2002 Harald Welte <laforge@gnumonks.org>
>> - * based on ipt_FTOS.c (C) 2000 by Matthew G. Marsh <mgm@paktronix.com>
>> * This software is distributed under GNU GPL v2, 1991
>> *
>> * See RFC2474 for a description of the DSCP field within the IP Header.
>> *
>> - * xt_DSCP.h,v 1.7 2002/03/14 12:03:13 laforge Exp
>> + * xt_dscp.h,v 1.3 2002/08/05 19:00:21 laforge Exp
>> */
>> -#ifndef _XT_DSCP_TARGET_H
>> -#define _XT_DSCP_TARGET_H
>> -#include <linux/netfilter/xt_dscp.h>
>> +#ifndef _XT_DSCP_H
>> +#define _XT_DSCP_H
>> +
>> #include <linux/types.h>
>> 
>> -/* target info */
>> -struct xt_DSCP_info {
>> +#define XT_DSCP_MASK	0xfc	/* 11111100 */
>> +#define XT_DSCP_SHIFT	2
>> +#define XT_DSCP_MAX	0x3f	/* 00111111 */
>> +
>> +/* match info */
>> +struct xt_dscp_info {
>> 	__u8 dscp;
>> +	__u8 invert;
>> };
>> 
>> -struct xt_tos_target_info {
>> -	__u8 tos_value;
>> +struct xt_tos_match_info {
>> 	__u8 tos_mask;
>> +	__u8 tos_value;
>> +	__u8 invert;
>> };
>> 
>> -#endif /* _XT_DSCP_TARGET_H */
>> +#endif /* _XT_DSCP_H */
>> diff --git a/include/uapi/linux/netfilter/xt_MARK.h b/include/uapi/linux/netfilter/xt_MARK.h
>> index 41c456d..ecadc40 100644
>> --- a/include/uapi/linux/netfilter/xt_MARK.h
>> +++ b/include/uapi/linux/netfilter/xt_MARK.h
>> @@ -1,6 +1,15 @@
>> -#ifndef _XT_MARK_H_target
>> -#define _XT_MARK_H_target
>> +#ifndef _XT_MARK_H
>> +#define _XT_MARK_H
>> 
>> -#include <linux/netfilter/xt_mark.h>
>> +#include <linux/types.h>
>> 
>> -#endif /*_XT_MARK_H_target */
>> +struct xt_mark_tginfo2 {
>> +	__u32 mark, mask;
>> +};
>> +
>> +struct xt_mark_mtinfo1 {
>> +	__u32 mark, mask;
>> +	__u8 invert;
>> +};
>> +
>> +#endif /*_XT_MARK_H*/
>> diff --git a/include/uapi/linux/netfilter/xt_TCPMSS.h b/include/uapi/linux/netfilter/xt_TCPMSS.h
>> index 9a6960a..fbac56b 100644
>> --- a/include/uapi/linux/netfilter/xt_TCPMSS.h
>> +++ b/include/uapi/linux/netfilter/xt_TCPMSS.h
>> @@ -1,12 +1,11 @@
>> -#ifndef _XT_TCPMSS_H
>> -#define _XT_TCPMSS_H
>> +#ifndef _XT_TCPMSS_MATCH_H
>> +#define _XT_TCPMSS_MATCH_H
>> 
>> #include <linux/types.h>
>> 
>> -struct xt_tcpmss_info {
>> -	__u16 mss;
>> +struct xt_tcpmss_match_info {
>> +    __u16 mss_min, mss_max;
>> +    __u8 invert;
>> };
>> 
>> -#define XT_TCPMSS_CLAMP_PMTU 0xffff
>> -
>> -#endif /* _XT_TCPMSS_H */
>> +#endif /*_XT_TCPMSS_MATCH_H*/
>> diff --git a/include/uapi/linux/netfilter/xt_rateest.h b/include/uapi/linux/netfilter/xt_rateest.h
>> index 13fe50d..ec1b570 100644
>> --- a/include/uapi/linux/netfilter/xt_rateest.h
>> +++ b/include/uapi/linux/netfilter/xt_rateest.h
>> @@ -1,38 +1,16 @@
>> -#ifndef _XT_RATEEST_MATCH_H
>> -#define _XT_RATEEST_MATCH_H
>> +#ifndef _XT_RATEEST_TARGET_H
>> +#define _XT_RATEEST_TARGET_H
>> 
>> #include <linux/types.h>
>> #include <linux/if.h>
>> 
>> -enum xt_rateest_match_flags {
>> -	XT_RATEEST_MATCH_INVERT	= 1<<0,
>> -	XT_RATEEST_MATCH_ABS	= 1<<1,
>> -	XT_RATEEST_MATCH_REL	= 1<<2,
>> -	XT_RATEEST_MATCH_DELTA	= 1<<3,
>> -	XT_RATEEST_MATCH_BPS	= 1<<4,
>> -	XT_RATEEST_MATCH_PPS	= 1<<5,
>> -};
>> -
>> -enum xt_rateest_match_mode {
>> -	XT_RATEEST_MATCH_NONE,
>> -	XT_RATEEST_MATCH_EQ,
>> -	XT_RATEEST_MATCH_LT,
>> -	XT_RATEEST_MATCH_GT,
>> -};
>> -
>> -struct xt_rateest_match_info {
>> -	char			name1[IFNAMSIZ];
>> -	char			name2[IFNAMSIZ];
>> -	__u16		flags;
>> -	__u16		mode;
>> -	__u32		bps1;
>> -	__u32		pps1;
>> -	__u32		bps2;
>> -	__u32		pps2;
>> +struct xt_rateest_target_info {
>> +	char			name[IFNAMSIZ];
>> +	__s8			interval;
>> +	__u8		ewma_log;
>> 
>> 	/* Used internally by the kernel */
>> -	struct xt_rateest	*est1 __attribute__((aligned(8)));
>> -	struct xt_rateest	*est2 __attribute__((aligned(8)));
>> +	struct xt_rateest	*est __attribute__((aligned(8)));
>> };
>> 
>> -#endif /* _XT_RATEEST_MATCH_H */
>> +#endif /* _XT_RATEEST_TARGET_H */
>> diff --git a/include/uapi/linux/netfilter_ipv4/ipt_ECN.h b/include/uapi/linux/netfilter_ipv4/ipt_ECN.h
>> index bb88d53..0e0c063 100644
>> --- a/include/uapi/linux/netfilter_ipv4/ipt_ECN.h
>> +++ b/include/uapi/linux/netfilter_ipv4/ipt_ECN.h
>> @@ -1,33 +1,15 @@
>> -/* Header file for iptables ipt_ECN target
>> - *
>> - * (C) 2002 by Harald Welte <laforge@gnumonks.org>
>> - *
>> - * This software is distributed under GNU GPL v2, 1991
>> - *
>> - * ipt_ECN.h,v 1.3 2002/05/29 12:17:40 laforge Exp
>> -*/
>> -#ifndef _IPT_ECN_TARGET_H
>> -#define _IPT_ECN_TARGET_H
>> -
>> -#include <linux/types.h>
>> -#include <linux/netfilter/xt_DSCP.h>
>> -
>> -#define IPT_ECN_IP_MASK	(~XT_DSCP_MASK)
>> -
>> -#define IPT_ECN_OP_SET_IP	0x01	/* set ECN bits of IPv4 header */
>> -#define IPT_ECN_OP_SET_ECE	0x10	/* set ECE bit of TCP header */
>> -#define IPT_ECN_OP_SET_CWR	0x20	/* set CWR bit of TCP header */
>> -
>> -#define IPT_ECN_OP_MASK		0xce
>> -
>> -struct ipt_ECN_info {
>> -	__u8 operation;	/* bitset of operations */
>> -	__u8 ip_ect;	/* ECT codepoint of IPv4 header, pre-shifted */
>> -	union {
>> -		struct {
>> -			__u8 ece:1, cwr:1; /* TCP ECT bits */
>> -		} tcp;
>> -	} proto;
>> +#ifndef _IPT_ECN_H
>> +#define _IPT_ECN_H
>> +
>> +#include <linux/netfilter/xt_ecn.h>
>> +#define ipt_ecn_info xt_ecn_info
>> +
>> +enum {
>> +	IPT_ECN_IP_MASK       = XT_ECN_IP_MASK,
>> +	IPT_ECN_OP_MATCH_IP   = XT_ECN_OP_MATCH_IP,
>> +	IPT_ECN_OP_MATCH_ECE  = XT_ECN_OP_MATCH_ECE,
>> +	IPT_ECN_OP_MATCH_CWR  = XT_ECN_OP_MATCH_CWR,
>> +	IPT_ECN_OP_MATCH_MASK = XT_ECN_OP_MATCH_MASK,
>> };
>> 
>> -#endif /* _IPT_ECN_TARGET_H */
>> +#endif /* IPT_ECN_H */
>> diff --git a/include/uapi/linux/netfilter_ipv4/ipt_TTL.h b/include/uapi/linux/netfilter_ipv4/ipt_TTL.h
>> index f6ac169..37bee44 100644
>> --- a/include/uapi/linux/netfilter_ipv4/ipt_TTL.h
>> +++ b/include/uapi/linux/netfilter_ipv4/ipt_TTL.h
>> @@ -1,5 +1,5 @@
>> -/* TTL modification module for IP tables
>> - * (C) 2000 by Harald Welte <laforge@netfilter.org> */
>> +/* IP tables module for matching the value of the TTL
>> + * (C) 2000 by Harald Welte <laforge@gnumonks.org> */
>> 
>> #ifndef _IPT_TTL_H
>> #define _IPT_TTL_H
>> @@ -7,14 +7,14 @@
>> #include <linux/types.h>
>> 
>> enum {
>> -	IPT_TTL_SET = 0,
>> -	IPT_TTL_INC,
>> -	IPT_TTL_DEC
>> +	IPT_TTL_EQ = 0,		/* equals */
>> +	IPT_TTL_NE,		/* not equals */
>> +	IPT_TTL_LT,		/* less than */
>> +	IPT_TTL_GT,		/* greater than */
>> };
>> 
>> -#define IPT_TTL_MAXMODE	IPT_TTL_DEC
>> 
>> -struct ipt_TTL_info {
>> +struct ipt_ttl_info {
>> 	__u8	mode;
>> 	__u8	ttl;
>> };
>> diff --git a/include/uapi/linux/netfilter_ipv6/ip6t_HL.h b/include/uapi/linux/netfilter_ipv6/ip6t_HL.h
>> index ebd8ead..6e76dbc 100644
>> --- a/include/uapi/linux/netfilter_ipv6/ip6t_HL.h
>> +++ b/include/uapi/linux/netfilter_ipv6/ip6t_HL.h
>> @@ -1,6 +1,6 @@
>> -/* Hop Limit modification module for ip6tables
>> +/* ip6tables module for matching the Hop Limit value
>> * Maciej Soltysiak <solt@dns.toxicfilms.tv>
>> - * Based on HW's TTL module */
>> + * Based on HW's ttl module */
>> 
>> #ifndef _IP6T_HL_H
>> #define _IP6T_HL_H
>> @@ -8,14 +8,14 @@
>> #include <linux/types.h>
>> 
>> enum {
>> -	IP6T_HL_SET = 0,
>> -	IP6T_HL_INC,
>> -	IP6T_HL_DEC
>> +	IP6T_HL_EQ = 0,		/* equals */
>> +	IP6T_HL_NE,		/* not equals */
>> +	IP6T_HL_LT,		/* less than */
>> +	IP6T_HL_GT,		/* greater than */
>> };
>> 
>> -#define IP6T_HL_MAXMODE	IP6T_HL_DEC
>> 
>> -struct ip6t_HL_info {
>> +struct ip6t_hl_info {
>> 	__u8	mode;
>> 	__u8	hop_limit;
>> };
>> diff --git a/net/netfilter/xt_RATEEST.c b/net/netfilter/xt_RATEEST.c
>> index 498b54f..755d2f6 100644
>> --- a/net/netfilter/xt_RATEEST.c
>> +++ b/net/netfilter/xt_RATEEST.c
>> @@ -8,184 +8,149 @@
>> #include <linux/module.h>
>> #include <linux/skbuff.h>
>> #include <linux/gen_stats.h>
>> -#include <linux/jhash.h>
>> -#include <linux/rtnetlink.h>
>> -#include <linux/random.h>
>> -#include <linux/slab.h>
>> -#include <net/gen_stats.h>
>> -#include <net/netlink.h>
>> 
>> #include <linux/netfilter/x_tables.h>
>> -#include <linux/netfilter/xt_RATEEST.h>
>> +#include <linux/netfilter/xt_rateest.h>
>> #include <net/netfilter/xt_rateest.h>
>> 
>> -static DEFINE_MUTEX(xt_rateest_mutex);
>> 
>> -#define RATEEST_HSIZE	16
>> -static struct hlist_head rateest_hash[RATEEST_HSIZE] __read_mostly;
>> -static unsigned int jhash_rnd __read_mostly;
>> -
>> -static unsigned int xt_rateest_hash(const char *name)
>> -{
>> -	return jhash(name, FIELD_SIZEOF(struct xt_rateest, name), jhash_rnd) &
>> -	       (RATEEST_HSIZE - 1);
>> -}
>> -
>> -static void xt_rateest_hash_insert(struct xt_rateest *est)
>> +static bool
>> +xt_rateest_mt(const struct sk_buff *skb, struct xt_action_param *par)
>> {
>> -	unsigned int h;
>> -
>> -	h = xt_rateest_hash(est->name);
>> -	hlist_add_head(&est->list, &rateest_hash[h]);
>> -}
>> +	const struct xt_rateest_match_info *info = par->matchinfo;
>> +	struct gnet_stats_rate_est64 sample = {0};
>> +	u_int32_t bps1, bps2, pps1, pps2;
>> +	bool ret = true;
>> +
>> +	gen_estimator_read(&info->est1->rate_est, &sample);
>> +
>> +	if (info->flags & XT_RATEEST_MATCH_DELTA) {
>> +		bps1 = info->bps1 >= sample.bps ? info->bps1 - sample.bps : 0;
>> +		pps1 = info->pps1 >= sample.pps ? info->pps1 - sample.pps : 0;
>> +	} else {
>> +		bps1 = sample.bps;
>> +		pps1 = sample.pps;
>> +	}
>> 
>> -struct xt_rateest *xt_rateest_lookup(const char *name)
>> -{
>> -	struct xt_rateest *est;
>> -	unsigned int h;
>> -
>> -	h = xt_rateest_hash(name);
>> -	mutex_lock(&xt_rateest_mutex);
>> -	hlist_for_each_entry(est, &rateest_hash[h], list) {
>> -		if (strcmp(est->name, name) == 0) {
>> -			est->refcnt++;
>> -			mutex_unlock(&xt_rateest_mutex);
>> -			return est;
>> +	if (info->flags & XT_RATEEST_MATCH_ABS) {
>> +		bps2 = info->bps2;
>> +		pps2 = info->pps2;
>> +	} else {
>> +		gen_estimator_read(&info->est2->rate_est, &sample);
>> +
>> +		if (info->flags & XT_RATEEST_MATCH_DELTA) {
>> +			bps2 = info->bps2 >= sample.bps ? info->bps2 - sample.bps : 0;
>> +			pps2 = info->pps2 >= sample.pps ? info->pps2 - sample.pps : 0;
>> +		} else {
>> +			bps2 = sample.bps;
>> +			pps2 = sample.pps;
>> 		}
>> 	}
>> -	mutex_unlock(&xt_rateest_mutex);
>> -	return NULL;
>> -}
>> -EXPORT_SYMBOL_GPL(xt_rateest_lookup);
>> 
>> -void xt_rateest_put(struct xt_rateest *est)
>> -{
>> -	mutex_lock(&xt_rateest_mutex);
>> -	if (--est->refcnt == 0) {
>> -		hlist_del(&est->list);
>> -		gen_kill_estimator(&est->rate_est);
>> -		/*
>> -		 * gen_estimator est_timer() might access est->lock or bstats,
>> -		 * wait a RCU grace period before freeing 'est'
>> -		 */
>> -		kfree_rcu(est, rcu);
>> +	switch (info->mode) {
>> +	case XT_RATEEST_MATCH_LT:
>> +		if (info->flags & XT_RATEEST_MATCH_BPS)
>> +			ret &= bps1 < bps2;
>> +		if (info->flags & XT_RATEEST_MATCH_PPS)
>> +			ret &= pps1 < pps2;
>> +		break;
>> +	case XT_RATEEST_MATCH_GT:
>> +		if (info->flags & XT_RATEEST_MATCH_BPS)
>> +			ret &= bps1 > bps2;
>> +		if (info->flags & XT_RATEEST_MATCH_PPS)
>> +			ret &= pps1 > pps2;
>> +		break;
>> +	case XT_RATEEST_MATCH_EQ:
>> +		if (info->flags & XT_RATEEST_MATCH_BPS)
>> +			ret &= bps1 == bps2;
>> +		if (info->flags & XT_RATEEST_MATCH_PPS)
>> +			ret &= pps1 == pps2;
>> +		break;
>> 	}
>> -	mutex_unlock(&xt_rateest_mutex);
>> +
>> +	ret ^= info->flags & XT_RATEEST_MATCH_INVERT ? true : false;
>> +	return ret;
>> }
>> -EXPORT_SYMBOL_GPL(xt_rateest_put);
>> 
>> -static unsigned int
>> -xt_rateest_tg(struct sk_buff *skb, const struct xt_action_param *par)
>> +static int xt_rateest_mt_checkentry(const struct xt_mtchk_param *par)
>> {
>> -	const struct xt_rateest_target_info *info = par->targinfo;
>> -	struct gnet_stats_basic_packed *stats = &info->est->bstats;
>> +	struct xt_rateest_match_info *info = par->matchinfo;
>> +	struct xt_rateest *est1, *est2;
>> +	int ret = -EINVAL;
>> 
>> -	spin_lock_bh(&info->est->lock);
>> -	stats->bytes += skb->len;
>> -	stats->packets++;
>> -	spin_unlock_bh(&info->est->lock);
>> +	if (hweight32(info->flags & (XT_RATEEST_MATCH_ABS |
>> +				     XT_RATEEST_MATCH_REL)) != 1)
>> +		goto err1;
>> 
>> -	return XT_CONTINUE;
>> -}
>> +	if (!(info->flags & (XT_RATEEST_MATCH_BPS | XT_RATEEST_MATCH_PPS)))
>> +		goto err1;
>> 
>> -static int xt_rateest_tg_checkentry(const struct xt_tgchk_param *par)
>> -{
>> -	struct xt_rateest_target_info *info = par->targinfo;
>> -	struct xt_rateest *est;
>> -	struct {
>> -		struct nlattr		opt;
>> -		struct gnet_estimator	est;
>> -	} cfg;
>> -	int ret;
>> -
>> -	net_get_random_once(&jhash_rnd, sizeof(jhash_rnd));
>> -
>> -	est = xt_rateest_lookup(info->name);
>> -	if (est) {
>> -		/*
>> -		 * If estimator parameters are specified, they must match the
>> -		 * existing estimator.
>> -		 */
>> -		if ((!info->interval && !info->ewma_log) ||
>> -		    (info->interval != est->params.interval ||
>> -		     info->ewma_log != est->params.ewma_log)) {
>> -			xt_rateest_put(est);
>> -			return -EINVAL;
>> -		}
>> -		info->est = est;
>> -		return 0;
>> +	switch (info->mode) {
>> +	case XT_RATEEST_MATCH_EQ:
>> +	case XT_RATEEST_MATCH_LT:
>> +	case XT_RATEEST_MATCH_GT:
>> +		break;
>> +	default:
>> +		goto err1;
>> 	}
>> 
>> -	ret = -ENOMEM;
>> -	est = kzalloc(sizeof(*est), GFP_KERNEL);
>> -	if (!est)
>> +	ret  = -ENOENT;
>> +	est1 = xt_rateest_lookup(info->name1);
>> +	if (!est1)
>> 		goto err1;
>> 
>> -	strlcpy(est->name, info->name, sizeof(est->name));
>> -	spin_lock_init(&est->lock);
>> -	est->refcnt		= 1;
>> -	est->params.interval	= info->interval;
>> -	est->params.ewma_log	= info->ewma_log;
>> -
>> -	cfg.opt.nla_len		= nla_attr_size(sizeof(cfg.est));
>> -	cfg.opt.nla_type	= TCA_STATS_RATE_EST;
>> -	cfg.est.interval	= info->interval;
>> -	cfg.est.ewma_log	= info->ewma_log;
>> -
>> -	ret = gen_new_estimator(&est->bstats, NULL, &est->rate_est,
>> -				&est->lock, NULL, &cfg.opt);
>> -	if (ret < 0)
>> -		goto err2;
>> +	est2 = NULL;
>> +	if (info->flags & XT_RATEEST_MATCH_REL) {
>> +		est2 = xt_rateest_lookup(info->name2);
>> +		if (!est2)
>> +			goto err2;
>> +	}
>> 
>> -	info->est = est;
>> -	xt_rateest_hash_insert(est);
>> +	info->est1 = est1;
>> +	info->est2 = est2;
>> 	return 0;
>> 
>> err2:
>> -	kfree(est);
>> +	xt_rateest_put(est1);
>> err1:
>> 	return ret;
>> }
>> 
>> -static void xt_rateest_tg_destroy(const struct xt_tgdtor_param *par)
>> +static void xt_rateest_mt_destroy(const struct xt_mtdtor_param *par)
>> {
>> -	struct xt_rateest_target_info *info = par->targinfo;
>> +	struct xt_rateest_match_info *info = par->matchinfo;
>> 
>> -	xt_rateest_put(info->est);
>> +	xt_rateest_put(info->est1);
>> +	if (info->est2)
>> +		xt_rateest_put(info->est2);
>> }
>> 
>> -static struct xt_target xt_rateest_tg_reg __read_mostly = {
>> -	.name       = "RATEEST",
>> +static struct xt_match xt_rateest_mt_reg __read_mostly = {
>> +	.name       = "rateest",
>> 	.revision   = 0,
>> 	.family     = NFPROTO_UNSPEC,
>> -	.target     = xt_rateest_tg,
>> -	.checkentry = xt_rateest_tg_checkentry,
>> -	.destroy    = xt_rateest_tg_destroy,
>> -	.targetsize = sizeof(struct xt_rateest_target_info),
>> -	.usersize   = offsetof(struct xt_rateest_target_info, est),
>> +	.match      = xt_rateest_mt,
>> +	.checkentry = xt_rateest_mt_checkentry,
>> +	.destroy    = xt_rateest_mt_destroy,
>> +	.matchsize  = sizeof(struct xt_rateest_match_info),
>> +	.usersize   = offsetof(struct xt_rateest_match_info, est1),
>> 	.me         = THIS_MODULE,
>> };
>> 
>> -static int __init xt_rateest_tg_init(void)
>> +static int __init xt_rateest_mt_init(void)
>> {
>> -	unsigned int i;
>> -
>> -	for (i = 0; i < ARRAY_SIZE(rateest_hash); i++)
>> -		INIT_HLIST_HEAD(&rateest_hash[i]);
>> -
>> -	return xt_register_target(&xt_rateest_tg_reg);
>> +	return xt_register_match(&xt_rateest_mt_reg);
>> }
>> 
>> -static void __exit xt_rateest_tg_fini(void)
>> +static void __exit xt_rateest_mt_fini(void)
>> {
>> -	xt_unregister_target(&xt_rateest_tg_reg);
>> +	xt_unregister_match(&xt_rateest_mt_reg);
>> }
>> 
>> -
>> MODULE_AUTHOR("Patrick McHardy <kaber@trash.net>");
>> MODULE_LICENSE("GPL");
>> -MODULE_DESCRIPTION("Xtables: packet rate estimator");
>> -MODULE_ALIAS("ipt_RATEEST");
>> -MODULE_ALIAS("ip6t_RATEEST");
>> -module_init(xt_rateest_tg_init);
>> -module_exit(xt_rateest_tg_fini);
>> +MODULE_DESCRIPTION("xtables rate estimator match");
>> +MODULE_ALIAS("ipt_rateest");
>> +MODULE_ALIAS("ip6t_rateest");
>> +module_init(xt_rateest_mt_init);
>> +module_exit(xt_rateest_mt_fini);
>> diff --git a/net/netfilter/xt_TCPMSS.c b/net/netfilter/xt_TCPMSS.c
>> index 27241a7..c53d4d1 100644
>> --- a/net/netfilter/xt_TCPMSS.c
>> +++ b/net/netfilter/xt_TCPMSS.c
>> @@ -1,351 +1,110 @@
>> -/*
>> - * This is a module which is used for setting the MSS option in TCP packets.
>> - *
>> - * Copyright (C) 2000 Marc Boucher <marc@mbsi.ca>
>> - * Copyright (C) 2007 Patrick McHardy <kaber@trash.net>
>> +/* Kernel module to match TCP MSS values. */
>> +
>> +/* Copyright (C) 2000 Marc Boucher <marc@mbsi.ca>
>> + * Portions (C) 2005 by Harald Welte <laforge@netfilter.org>
>> *
>> * This program is free software; you can redistribute it and/or modify
>> * it under the terms of the GNU General Public License version 2 as
>> * published by the Free Software Foundation.
>> */
>> -#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>> #include <linux/module.h>
>> #include <linux/skbuff.h>
>> -#include <linux/ip.h>
>> -#include <linux/gfp.h>
>> -#include <linux/ipv6.h>
>> -#include <linux/tcp.h>
>> -#include <net/dst.h>
>> -#include <net/flow.h>
>> -#include <net/ipv6.h>
>> -#include <net/route.h>
>> #include <net/tcp.h>
>> 
>> +#include <linux/netfilter/xt_tcpmss.h>
>> +#include <linux/netfilter/x_tables.h>
>> +
>> #include <linux/netfilter_ipv4/ip_tables.h>
>> #include <linux/netfilter_ipv6/ip6_tables.h>
>> -#include <linux/netfilter/x_tables.h>
>> -#include <linux/netfilter/xt_tcpudp.h>
>> -#include <linux/netfilter/xt_TCPMSS.h>
>> 
>> MODULE_LICENSE("GPL");
>> MODULE_AUTHOR("Marc Boucher <marc@mbsi.ca>");
>> -MODULE_DESCRIPTION("Xtables: TCP Maximum Segment Size (MSS) adjustment");
>> -MODULE_ALIAS("ipt_TCPMSS");
>> -MODULE_ALIAS("ip6t_TCPMSS");
>> -
>> -static inline unsigned int
>> -optlen(const u_int8_t *opt, unsigned int offset)
>> -{
>> -	/* Beware zero-length options: make finite progress */
>> -	if (opt[offset] <= TCPOPT_NOP || opt[offset+1] == 0)
>> -		return 1;
>> -	else
>> -		return opt[offset+1];
>> -}
>> -
>> -static u_int32_t tcpmss_reverse_mtu(struct net *net,
>> -				    const struct sk_buff *skb,
>> -				    unsigned int family)
>> -{
>> -	struct flowi fl;
>> -	const struct nf_afinfo *ai;
>> -	struct rtable *rt = NULL;
>> -	u_int32_t mtu     = ~0U;
>> -
>> -	if (family == PF_INET) {
>> -		struct flowi4 *fl4 = &fl.u.ip4;
>> -		memset(fl4, 0, sizeof(*fl4));
>> -		fl4->daddr = ip_hdr(skb)->saddr;
>> -	} else {
>> -		struct flowi6 *fl6 = &fl.u.ip6;
>> -
>> -		memset(fl6, 0, sizeof(*fl6));
>> -		fl6->daddr = ipv6_hdr(skb)->saddr;
>> -	}
>> -	rcu_read_lock();
>> -	ai = nf_get_afinfo(family);
>> -	if (ai != NULL)
>> -		ai->route(net, (struct dst_entry **)&rt, &fl, false);
>> -	rcu_read_unlock();
>> -
>> -	if (rt != NULL) {
>> -		mtu = dst_mtu(&rt->dst);
>> -		dst_release(&rt->dst);
>> -	}
>> -	return mtu;
>> -}
>> +MODULE_DESCRIPTION("Xtables: TCP MSS match");
>> +MODULE_ALIAS("ipt_tcpmss");
>> +MODULE_ALIAS("ip6t_tcpmss");
>> 
>> -static int
>> -tcpmss_mangle_packet(struct sk_buff *skb,
>> -		     const struct xt_action_param *par,
>> -		     unsigned int family,
>> -		     unsigned int tcphoff,
>> -		     unsigned int minlen)
>> +static bool
>> +tcpmss_mt(const struct sk_buff *skb, struct xt_action_param *par)
>> {
>> -	const struct xt_tcpmss_info *info = par->targinfo;
>> -	struct tcphdr *tcph;
>> -	int len, tcp_hdrlen;
>> -	unsigned int i;
>> -	__be16 oldval;
>> -	u16 newmss;
>> -	u8 *opt;
>> -
>> -	/* This is a fragment, no TCP header is available */
>> -	if (par->fragoff != 0)
>> -		return 0;
>> -
>> -	if (!skb_make_writable(skb, skb->len))
>> -		return -1;
>> -
>> -	len = skb->len - tcphoff;
>> -	if (len < (int)sizeof(struct tcphdr))
>> -		return -1;
>> -
>> -	tcph = (struct tcphdr *)(skb_network_header(skb) + tcphoff);
>> -	tcp_hdrlen = tcph->doff * 4;
>> -
>> -	if (len < tcp_hdrlen)
>> -		return -1;
>> -
>> -	if (info->mss == XT_TCPMSS_CLAMP_PMTU) {
>> -		struct net *net = xt_net(par);
>> -		unsigned int in_mtu = tcpmss_reverse_mtu(net, skb, family);
>> -		unsigned int min_mtu = min(dst_mtu(skb_dst(skb)), in_mtu);
>> -
>> -		if (min_mtu <= minlen) {
>> -			net_err_ratelimited("unknown or invalid path-MTU (%u)\n",
>> -					    min_mtu);
>> -			return -1;
>> -		}
>> -		newmss = min_mtu - minlen;
>> -	} else
>> -		newmss = info->mss;
>> -
>> -	opt = (u_int8_t *)tcph;
>> -	for (i = sizeof(struct tcphdr); i <= tcp_hdrlen - TCPOLEN_MSS; i += optlen(opt, i)) {
>> -		if (opt[i] == TCPOPT_MSS && opt[i+1] == TCPOLEN_MSS) {
>> -			u_int16_t oldmss;
>> -
>> -			oldmss = (opt[i+2] << 8) | opt[i+3];
>> -
>> -			/* Never increase MSS, even when setting it, as
>> -			 * doing so results in problems for hosts that rely
>> -			 * on MSS being set correctly.
>> -			 */
>> -			if (oldmss <= newmss)
>> -				return 0;
>> -
>> -			opt[i+2] = (newmss & 0xff00) >> 8;
>> -			opt[i+3] = newmss & 0x00ff;
>> -
>> -			inet_proto_csum_replace2(&tcph->check, skb,
>> -						 htons(oldmss), htons(newmss),
>> -						 false);
>> -			return 0;
>> +	const struct xt_tcpmss_match_info *info = par->matchinfo;
>> +	const struct tcphdr *th;
>> +	struct tcphdr _tcph;
>> +	/* tcp.doff is only 4 bits, ie. max 15 * 4 bytes */
>> +	const u_int8_t *op;
>> +	u8 _opt[15 * 4 - sizeof(_tcph)];
>> +	unsigned int i, optlen;
>> +
>> +	/* If we don't have the whole header, drop packet. */
>> +	th = skb_header_pointer(skb, par->thoff, sizeof(_tcph), &_tcph);
>> +	if (th == NULL)
>> +		goto dropit;
>> +
>> +	/* Malformed. */
>> +	if (th->doff*4 < sizeof(*th))
>> +		goto dropit;
>> +
>> +	optlen = th->doff*4 - sizeof(*th);
>> +	if (!optlen)
>> +		goto out;
>> +
>> +	/* Truncated options. */
>> +	op = skb_header_pointer(skb, par->thoff + sizeof(*th), optlen, _opt);
>> +	if (op == NULL)
>> +		goto dropit;
>> +
>> +	for (i = 0; i < optlen; ) {
>> +		if (op[i] == TCPOPT_MSS
>> +		    && (optlen - i) >= TCPOLEN_MSS
>> +		    && op[i+1] == TCPOLEN_MSS) {
>> +			u_int16_t mssval;
>> +
>> +			mssval = (op[i+2] << 8) | op[i+3];
>> +
>> +			return (mssval >= info->mss_min &&
>> +				mssval <= info->mss_max) ^ info->invert;
>> 		}
>> +		if (op[i] < 2)
>> +			i++;
>> +		else
>> +			i += op[i+1] ? : 1;
>> 	}
>> +out:
>> +	return info->invert;
>> 
>> -	/* There is data after the header so the option can't be added
>> -	 * without moving it, and doing so may make the SYN packet
>> -	 * itself too large. Accept the packet unmodified instead.
>> -	 */
>> -	if (len > tcp_hdrlen)
>> -		return 0;
>> -
>> -	/*
>> -	 * MSS Option not found ?! add it..
>> -	 */
>> -	if (skb_tailroom(skb) < TCPOLEN_MSS) {
>> -		if (pskb_expand_head(skb, 0,
>> -				     TCPOLEN_MSS - skb_tailroom(skb),
>> -				     GFP_ATOMIC))
>> -			return -1;
>> -		tcph = (struct tcphdr *)(skb_network_header(skb) + tcphoff);
>> -	}
>> -
>> -	skb_put(skb, TCPOLEN_MSS);
>> -
>> -	/*
>> -	 * IPv4: RFC 1122 states "If an MSS option is not received at
>> -	 * connection setup, TCP MUST assume a default send MSS of 536".
>> -	 * IPv6: RFC 2460 states IPv6 has a minimum MTU of 1280 and a minimum
>> -	 * length IPv6 header of 60, ergo the default MSS value is 1220
>> -	 * Since no MSS was provided, we must use the default values
>> -	 */
>> -	if (xt_family(par) == NFPROTO_IPV4)
>> -		newmss = min(newmss, (u16)536);
>> -	else
>> -		newmss = min(newmss, (u16)1220);
>> -
>> -	opt = (u_int8_t *)tcph + sizeof(struct tcphdr);
>> -	memmove(opt + TCPOLEN_MSS, opt, len - sizeof(struct tcphdr));
>> -
>> -	inet_proto_csum_replace2(&tcph->check, skb,
>> -				 htons(len), htons(len + TCPOLEN_MSS), true);
>> -	opt[0] = TCPOPT_MSS;
>> -	opt[1] = TCPOLEN_MSS;
>> -	opt[2] = (newmss & 0xff00) >> 8;
>> -	opt[3] = newmss & 0x00ff;
>> -
>> -	inet_proto_csum_replace4(&tcph->check, skb, 0, *((__be32 *)opt), false);
>> -
>> -	oldval = ((__be16 *)tcph)[6];
>> -	tcph->doff += TCPOLEN_MSS/4;
>> -	inet_proto_csum_replace2(&tcph->check, skb,
>> -				 oldval, ((__be16 *)tcph)[6], false);
>> -	return TCPOLEN_MSS;
>> -}
>> -
>> -static unsigned int
>> -tcpmss_tg4(struct sk_buff *skb, const struct xt_action_param *par)
>> -{
>> -	struct iphdr *iph = ip_hdr(skb);
>> -	__be16 newlen;
>> -	int ret;
>> -
>> -	ret = tcpmss_mangle_packet(skb, par,
>> -				   PF_INET,
>> -				   iph->ihl * 4,
>> -				   sizeof(*iph) + sizeof(struct tcphdr));
>> -	if (ret < 0)
>> -		return NF_DROP;
>> -	if (ret > 0) {
>> -		iph = ip_hdr(skb);
>> -		newlen = htons(ntohs(iph->tot_len) + ret);
>> -		csum_replace2(&iph->check, iph->tot_len, newlen);
>> -		iph->tot_len = newlen;
>> -	}
>> -	return XT_CONTINUE;
>> -}
>> -
>> -#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES)
>> -static unsigned int
>> -tcpmss_tg6(struct sk_buff *skb, const struct xt_action_param *par)
>> -{
>> -	struct ipv6hdr *ipv6h = ipv6_hdr(skb);
>> -	u8 nexthdr;
>> -	__be16 frag_off, oldlen, newlen;
>> -	int tcphoff;
>> -	int ret;
>> -
>> -	nexthdr = ipv6h->nexthdr;
>> -	tcphoff = ipv6_skip_exthdr(skb, sizeof(*ipv6h), &nexthdr, &frag_off);
>> -	if (tcphoff < 0)
>> -		return NF_DROP;
>> -	ret = tcpmss_mangle_packet(skb, par,
>> -				   PF_INET6,
>> -				   tcphoff,
>> -				   sizeof(*ipv6h) + sizeof(struct tcphdr));
>> -	if (ret < 0)
>> -		return NF_DROP;
>> -	if (ret > 0) {
>> -		ipv6h = ipv6_hdr(skb);
>> -		oldlen = ipv6h->payload_len;
>> -		newlen = htons(ntohs(oldlen) + ret);
>> -		if (skb->ip_summed == CHECKSUM_COMPLETE)
>> -			skb->csum = csum_add(csum_sub(skb->csum, oldlen),
>> -					     newlen);
>> -		ipv6h->payload_len = newlen;
>> -	}
>> -	return XT_CONTINUE;
>> -}
>> -#endif
>> -
>> -/* Must specify -p tcp --syn */
>> -static inline bool find_syn_match(const struct xt_entry_match *m)
>> -{
>> -	const struct xt_tcp *tcpinfo = (const struct xt_tcp *)m->data;
>> -
>> -	if (strcmp(m->u.kernel.match->name, "tcp") == 0 &&
>> -	    tcpinfo->flg_cmp & TCPHDR_SYN &&
>> -	    !(tcpinfo->invflags & XT_TCP_INV_FLAGS))
>> -		return true;
>> -
>> +dropit:
>> +	par->hotdrop = true;
>> 	return false;
>> }
>> 
>> -static int tcpmss_tg4_check(const struct xt_tgchk_param *par)
>> -{
>> -	const struct xt_tcpmss_info *info = par->targinfo;
>> -	const struct ipt_entry *e = par->entryinfo;
>> -	const struct xt_entry_match *ematch;
>> -
>> -	if (info->mss == XT_TCPMSS_CLAMP_PMTU &&
>> -	    (par->hook_mask & ~((1 << NF_INET_FORWARD) |
>> -			   (1 << NF_INET_LOCAL_OUT) |
>> -			   (1 << NF_INET_POST_ROUTING))) != 0) {
>> -		pr_info("path-MTU clamping only supported in "
>> -			"FORWARD, OUTPUT and POSTROUTING hooks\n");
>> -		return -EINVAL;
>> -	}
>> -	if (par->nft_compat)
>> -		return 0;
>> -
>> -	xt_ematch_foreach(ematch, e)
>> -		if (find_syn_match(ematch))
>> -			return 0;
>> -	pr_info("Only works on TCP SYN packets\n");
>> -	return -EINVAL;
>> -}
>> -
>> -#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES)
>> -static int tcpmss_tg6_check(const struct xt_tgchk_param *par)
>> -{
>> -	const struct xt_tcpmss_info *info = par->targinfo;
>> -	const struct ip6t_entry *e = par->entryinfo;
>> -	const struct xt_entry_match *ematch;
>> -
>> -	if (info->mss == XT_TCPMSS_CLAMP_PMTU &&
>> -	    (par->hook_mask & ~((1 << NF_INET_FORWARD) |
>> -			   (1 << NF_INET_LOCAL_OUT) |
>> -			   (1 << NF_INET_POST_ROUTING))) != 0) {
>> -		pr_info("path-MTU clamping only supported in "
>> -			"FORWARD, OUTPUT and POSTROUTING hooks\n");
>> -		return -EINVAL;
>> -	}
>> -	if (par->nft_compat)
>> -		return 0;
>> -
>> -	xt_ematch_foreach(ematch, e)
>> -		if (find_syn_match(ematch))
>> -			return 0;
>> -	pr_info("Only works on TCP SYN packets\n");
>> -	return -EINVAL;
>> -}
>> -#endif
>> -
>> -static struct xt_target tcpmss_tg_reg[] __read_mostly = {
>> +static struct xt_match tcpmss_mt_reg[] __read_mostly = {
>> 	{
>> +		.name		= "tcpmss",
>> 		.family		= NFPROTO_IPV4,
>> -		.name		= "TCPMSS",
>> -		.checkentry	= tcpmss_tg4_check,
>> -		.target		= tcpmss_tg4,
>> -		.targetsize	= sizeof(struct xt_tcpmss_info),
>> +		.match		= tcpmss_mt,
>> +		.matchsize	= sizeof(struct xt_tcpmss_match_info),
>> 		.proto		= IPPROTO_TCP,
>> 		.me		= THIS_MODULE,
>> 	},
>> -#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES)
>> 	{
>> +		.name		= "tcpmss",
>> 		.family		= NFPROTO_IPV6,
>> -		.name		= "TCPMSS",
>> -		.checkentry	= tcpmss_tg6_check,
>> -		.target		= tcpmss_tg6,
>> -		.targetsize	= sizeof(struct xt_tcpmss_info),
>> +		.match		= tcpmss_mt,
>> +		.matchsize	= sizeof(struct xt_tcpmss_match_info),
>> 		.proto		= IPPROTO_TCP,
>> 		.me		= THIS_MODULE,
>> 	},
>> -#endif
>> };
>> 
>> -static int __init tcpmss_tg_init(void)
>> +static int __init tcpmss_mt_init(void)
>> {
>> -	return xt_register_targets(tcpmss_tg_reg, ARRAY_SIZE(tcpmss_tg_reg));
>> +	return xt_register_matches(tcpmss_mt_reg, ARRAY_SIZE(tcpmss_mt_reg));
>> }
>> 
>> -static void __exit tcpmss_tg_exit(void)
>> +static void __exit tcpmss_mt_exit(void)
>> {
>> -	xt_unregister_targets(tcpmss_tg_reg, ARRAY_SIZE(tcpmss_tg_reg));
>> +	xt_unregister_matches(tcpmss_mt_reg, ARRAY_SIZE(tcpmss_mt_reg));
>> }
>> 
>> -module_init(tcpmss_tg_init);
>> -module_exit(tcpmss_tg_exit);
>> +module_init(tcpmss_mt_init);
>> +module_exit(tcpmss_mt_exit);
>> diff --git a/net/netfilter/xt_dscp.c b/net/netfilter/xt_dscp.c
>> index 236ac80..3f83d38 100644
>> --- a/net/netfilter/xt_dscp.c
>> +++ b/net/netfilter/xt_dscp.c
>> @@ -1,11 +1,14 @@
>> -/* IP tables module for matching the value of the IPv4/IPv6 DSCP field
>> +/* x_tables module for setting the IPv4/IPv6 DSCP field, Version 1.8
>> *
>> * (C) 2002 by Harald Welte <laforge@netfilter.org>
>> + * based on ipt_FTOS.c (C) 2000 by Matthew G. Marsh <mgm@paktronix.com>
>> *
>> * This program is free software; you can redistribute it and/or modify
>> * it under the terms of the GNU General Public License version 2 as
>> * published by the Free Software Foundation.
>> - */
>> + *
>> + * See RFC2474 for a description of the DSCP field within the IP Header.
>> +*/
>> #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> #include <linux/module.h>
>> #include <linux/skbuff.h>
>> @@ -14,102 +17,150 @@
>> #include <net/dsfield.h>
>> 
>> #include <linux/netfilter/x_tables.h>
>> -#include <linux/netfilter/xt_dscp.h>
>> +#include <linux/netfilter/xt_DSCP.h>
>> 
>> MODULE_AUTHOR("Harald Welte <laforge@netfilter.org>");
>> -MODULE_DESCRIPTION("Xtables: DSCP/TOS field match");
>> +MODULE_DESCRIPTION("Xtables: DSCP/TOS field modification");
>> MODULE_LICENSE("GPL");
>> -MODULE_ALIAS("ipt_dscp");
>> -MODULE_ALIAS("ip6t_dscp");
>> -MODULE_ALIAS("ipt_tos");
>> -MODULE_ALIAS("ip6t_tos");
>> +MODULE_ALIAS("ipt_DSCP");
>> +MODULE_ALIAS("ip6t_DSCP");
>> +MODULE_ALIAS("ipt_TOS");
>> +MODULE_ALIAS("ip6t_TOS");
>> 
>> -static bool
>> -dscp_mt(const struct sk_buff *skb, struct xt_action_param *par)
>> +static unsigned int
>> +dscp_tg(struct sk_buff *skb, const struct xt_action_param *par)
>> {
>> -	const struct xt_dscp_info *info = par->matchinfo;
>> +	const struct xt_DSCP_info *dinfo = par->targinfo;
>> 	u_int8_t dscp = ipv4_get_dsfield(ip_hdr(skb)) >> XT_DSCP_SHIFT;
>> 
>> -	return (dscp == info->dscp) ^ !!info->invert;
>> +	if (dscp != dinfo->dscp) {
>> +		if (!skb_make_writable(skb, sizeof(struct iphdr)))
>> +			return NF_DROP;
>> +
>> +		ipv4_change_dsfield(ip_hdr(skb),
>> +				    (__force __u8)(~XT_DSCP_MASK),
>> +				    dinfo->dscp << XT_DSCP_SHIFT);
>> +
>> +	}
>> +	return XT_CONTINUE;
>> }
>> 
>> -static bool
>> -dscp_mt6(const struct sk_buff *skb, struct xt_action_param *par)
>> +static unsigned int
>> +dscp_tg6(struct sk_buff *skb, const struct xt_action_param *par)
>> {
>> -	const struct xt_dscp_info *info = par->matchinfo;
>> +	const struct xt_DSCP_info *dinfo = par->targinfo;
>> 	u_int8_t dscp = ipv6_get_dsfield(ipv6_hdr(skb)) >> XT_DSCP_SHIFT;
>> 
>> -	return (dscp == info->dscp) ^ !!info->invert;
>> +	if (dscp != dinfo->dscp) {
>> +		if (!skb_make_writable(skb, sizeof(struct ipv6hdr)))
>> +			return NF_DROP;
>> +
>> +		ipv6_change_dsfield(ipv6_hdr(skb),
>> +				    (__force __u8)(~XT_DSCP_MASK),
>> +				    dinfo->dscp << XT_DSCP_SHIFT);
>> +	}
>> +	return XT_CONTINUE;
>> }
>> 
>> -static int dscp_mt_check(const struct xt_mtchk_param *par)
>> +static int dscp_tg_check(const struct xt_tgchk_param *par)
>> {
>> -	const struct xt_dscp_info *info = par->matchinfo;
>> +	const struct xt_DSCP_info *info = par->targinfo;
>> 
>> 	if (info->dscp > XT_DSCP_MAX) {
>> 		pr_info("dscp %x out of range\n", info->dscp);
>> 		return -EDOM;
>> 	}
>> -
>> 	return 0;
>> }
>> 
>> -static bool tos_mt(const struct sk_buff *skb, struct xt_action_param *par)
>> +static unsigned int
>> +tos_tg(struct sk_buff *skb, const struct xt_action_param *par)
>> +{
>> +	const struct xt_tos_target_info *info = par->targinfo;
>> +	struct iphdr *iph = ip_hdr(skb);
>> +	u_int8_t orig, nv;
>> +
>> +	orig = ipv4_get_dsfield(iph);
>> +	nv   = (orig & ~info->tos_mask) ^ info->tos_value;
>> +
>> +	if (orig != nv) {
>> +		if (!skb_make_writable(skb, sizeof(struct iphdr)))
>> +			return NF_DROP;
>> +		iph = ip_hdr(skb);
>> +		ipv4_change_dsfield(iph, 0, nv);
>> +	}
>> +
>> +	return XT_CONTINUE;
>> +}
>> +
>> +static unsigned int
>> +tos_tg6(struct sk_buff *skb, const struct xt_action_param *par)
>> {
>> -	const struct xt_tos_match_info *info = par->matchinfo;
>> -
>> -	if (xt_family(par) == NFPROTO_IPV4)
>> -		return ((ip_hdr(skb)->tos & info->tos_mask) ==
>> -		       info->tos_value) ^ !!info->invert;
>> -	else
>> -		return ((ipv6_get_dsfield(ipv6_hdr(skb)) & info->tos_mask) ==
>> -		       info->tos_value) ^ !!info->invert;
>> +	const struct xt_tos_target_info *info = par->targinfo;
>> +	struct ipv6hdr *iph = ipv6_hdr(skb);
>> +	u_int8_t orig, nv;
>> +
>> +	orig = ipv6_get_dsfield(iph);
>> +	nv   = (orig & ~info->tos_mask) ^ info->tos_value;
>> +
>> +	if (orig != nv) {
>> +		if (!skb_make_writable(skb, sizeof(struct iphdr)))
>> +			return NF_DROP;
>> +		iph = ipv6_hdr(skb);
>> +		ipv6_change_dsfield(iph, 0, nv);
>> +	}
>> +
>> +	return XT_CONTINUE;
>> }
>> 
>> -static struct xt_match dscp_mt_reg[] __read_mostly = {
>> +static struct xt_target dscp_tg_reg[] __read_mostly = {
>> 	{
>> -		.name		= "dscp",
>> +		.name		= "DSCP",
>> 		.family		= NFPROTO_IPV4,
>> -		.checkentry	= dscp_mt_check,
>> -		.match		= dscp_mt,
>> -		.matchsize	= sizeof(struct xt_dscp_info),
>> +		.checkentry	= dscp_tg_check,
>> +		.target		= dscp_tg,
>> +		.targetsize	= sizeof(struct xt_DSCP_info),
>> +		.table		= "mangle",
>> 		.me		= THIS_MODULE,
>> 	},
>> 	{
>> -		.name		= "dscp",
>> +		.name		= "DSCP",
>> 		.family		= NFPROTO_IPV6,
>> -		.checkentry	= dscp_mt_check,
>> -		.match		= dscp_mt6,
>> -		.matchsize	= sizeof(struct xt_dscp_info),
>> +		.checkentry	= dscp_tg_check,
>> +		.target		= dscp_tg6,
>> +		.targetsize	= sizeof(struct xt_DSCP_info),
>> +		.table		= "mangle",
>> 		.me		= THIS_MODULE,
>> 	},
>> 	{
>> -		.name		= "tos",
>> +		.name		= "TOS",
>> 		.revision	= 1,
>> 		.family		= NFPROTO_IPV4,
>> -		.match		= tos_mt,
>> -		.matchsize	= sizeof(struct xt_tos_match_info),
>> +		.table		= "mangle",
>> +		.target		= tos_tg,
>> +		.targetsize	= sizeof(struct xt_tos_target_info),
>> 		.me		= THIS_MODULE,
>> 	},
>> 	{
>> -		.name		= "tos",
>> +		.name		= "TOS",
>> 		.revision	= 1,
>> 		.family		= NFPROTO_IPV6,
>> -		.match		= tos_mt,
>> -		.matchsize	= sizeof(struct xt_tos_match_info),
>> +		.table		= "mangle",
>> +		.target		= tos_tg6,
>> +		.targetsize	= sizeof(struct xt_tos_target_info),
>> 		.me		= THIS_MODULE,
>> 	},
>> };
>> 
>> -static int __init dscp_mt_init(void)
>> +static int __init dscp_tg_init(void)
>> {
>> -	return xt_register_matches(dscp_mt_reg, ARRAY_SIZE(dscp_mt_reg));
>> +	return xt_register_targets(dscp_tg_reg, ARRAY_SIZE(dscp_tg_reg));
>> }
>> 
>> -static void __exit dscp_mt_exit(void)
>> +static void __exit dscp_tg_exit(void)
>> {
>> -	xt_unregister_matches(dscp_mt_reg, ARRAY_SIZE(dscp_mt_reg));
>> +	xt_unregister_targets(dscp_tg_reg, ARRAY_SIZE(dscp_tg_reg));
>> }
>> 
>> -module_init(dscp_mt_init);
>> -module_exit(dscp_mt_exit);
>> +module_init(dscp_tg_init);
>> +module_exit(dscp_tg_exit);
>> diff --git a/net/netfilter/xt_hl.c b/net/netfilter/xt_hl.c
>> index 0039511..1535e87 100644
>> --- a/net/netfilter/xt_hl.c
>> +++ b/net/netfilter/xt_hl.c
>> @@ -1,96 +1,169 @@
>> /*
>> - * IP tables module for matching the value of the TTL
>> - * (C) 2000,2001 by Harald Welte <laforge@netfilter.org>
>> + * TTL modification target for IP tables
>> + * (C) 2000,2005 by Harald Welte <laforge@netfilter.org>
>> *
>> - * Hop Limit matching module
>> - * (C) 2001-2002 Maciej Soltysiak <solt@dns.toxicfilms.tv>
>> + * Hop Limit modification target for ip6tables
>> + * Maciej Soltysiak <solt@dns.toxicfilms.tv>
>> *
>> * This program is free software; you can redistribute it and/or modify
>> * it under the terms of the GNU General Public License version 2 as
>> * published by the Free Software Foundation.
>> */
>> -
>> -#include <linux/ip.h>
>> -#include <linux/ipv6.h>
>> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> #include <linux/module.h>
>> #include <linux/skbuff.h>
>> +#include <linux/ip.h>
>> +#include <linux/ipv6.h>
>> +#include <net/checksum.h>
>> 
>> #include <linux/netfilter/x_tables.h>
>> -#include <linux/netfilter_ipv4/ipt_ttl.h>
>> -#include <linux/netfilter_ipv6/ip6t_hl.h>
>> +#include <linux/netfilter_ipv4/ipt_TTL.h>
>> +#include <linux/netfilter_ipv6/ip6t_HL.h>
>> 
>> +MODULE_AUTHOR("Harald Welte <laforge@netfilter.org>");
>> MODULE_AUTHOR("Maciej Soltysiak <solt@dns.toxicfilms.tv>");
>> -MODULE_DESCRIPTION("Xtables: Hoplimit/TTL field match");
>> +MODULE_DESCRIPTION("Xtables: Hoplimit/TTL Limit field modification target");
>> MODULE_LICENSE("GPL");
>> -MODULE_ALIAS("ipt_ttl");
>> -MODULE_ALIAS("ip6t_hl");
>> 
>> -static bool ttl_mt(const struct sk_buff *skb, struct xt_action_param *par)
>> +static unsigned int
>> +ttl_tg(struct sk_buff *skb, const struct xt_action_param *par)
>> {
>> -	const struct ipt_ttl_info *info = par->matchinfo;
>> -	const u8 ttl = ip_hdr(skb)->ttl;
>> +	struct iphdr *iph;
>> +	const struct ipt_TTL_info *info = par->targinfo;
>> +	int new_ttl;
>> +
>> +	if (!skb_make_writable(skb, skb->len))
>> +		return NF_DROP;
>> +
>> +	iph = ip_hdr(skb);
>> 
>> 	switch (info->mode) {
>> -	case IPT_TTL_EQ:
>> -		return ttl == info->ttl;
>> -	case IPT_TTL_NE:
>> -		return ttl != info->ttl;
>> -	case IPT_TTL_LT:
>> -		return ttl < info->ttl;
>> -	case IPT_TTL_GT:
>> -		return ttl > info->ttl;
>> +	case IPT_TTL_SET:
>> +		new_ttl = info->ttl;
>> +		break;
>> +	case IPT_TTL_INC:
>> +		new_ttl = iph->ttl + info->ttl;
>> +		if (new_ttl > 255)
>> +			new_ttl = 255;
>> +		break;
>> +	case IPT_TTL_DEC:
>> +		new_ttl = iph->ttl - info->ttl;
>> +		if (new_ttl < 0)
>> +			new_ttl = 0;
>> +		break;
>> +	default:
>> +		new_ttl = iph->ttl;
>> +		break;
>> +	}
>> +
>> +	if (new_ttl != iph->ttl) {
>> +		csum_replace2(&iph->check, htons(iph->ttl << 8),
>> +					   htons(new_ttl << 8));
>> +		iph->ttl = new_ttl;
>> 	}
>> 
>> -	return false;
>> +	return XT_CONTINUE;
>> }
>> 
>> -static bool hl_mt6(const struct sk_buff *skb, struct xt_action_param *par)
>> +static unsigned int
>> +hl_tg6(struct sk_buff *skb, const struct xt_action_param *par)
>> {
>> -	const struct ip6t_hl_info *info = par->matchinfo;
>> -	const struct ipv6hdr *ip6h = ipv6_hdr(skb);
>> +	struct ipv6hdr *ip6h;
>> +	const struct ip6t_HL_info *info = par->targinfo;
>> +	int new_hl;
>> +
>> +	if (!skb_make_writable(skb, skb->len))
>> +		return NF_DROP;
>> +
>> +	ip6h = ipv6_hdr(skb);
>> 
>> 	switch (info->mode) {
>> -	case IP6T_HL_EQ:
>> -		return ip6h->hop_limit == info->hop_limit;
>> -	case IP6T_HL_NE:
>> -		return ip6h->hop_limit != info->hop_limit;
>> -	case IP6T_HL_LT:
>> -		return ip6h->hop_limit < info->hop_limit;
>> -	case IP6T_HL_GT:
>> -		return ip6h->hop_limit > info->hop_limit;
>> +	case IP6T_HL_SET:
>> +		new_hl = info->hop_limit;
>> +		break;
>> +	case IP6T_HL_INC:
>> +		new_hl = ip6h->hop_limit + info->hop_limit;
>> +		if (new_hl > 255)
>> +			new_hl = 255;
>> +		break;
>> +	case IP6T_HL_DEC:
>> +		new_hl = ip6h->hop_limit - info->hop_limit;
>> +		if (new_hl < 0)
>> +			new_hl = 0;
>> +		break;
>> +	default:
>> +		new_hl = ip6h->hop_limit;
>> +		break;
>> 	}
>> 
>> -	return false;
>> +	ip6h->hop_limit = new_hl;
>> +
>> +	return XT_CONTINUE;
>> +}
>> +
>> +static int ttl_tg_check(const struct xt_tgchk_param *par)
>> +{
>> +	const struct ipt_TTL_info *info = par->targinfo;
>> +
>> +	if (info->mode > IPT_TTL_MAXMODE) {
>> +		pr_info("TTL: invalid or unknown mode %u\n", info->mode);
>> +		return -EINVAL;
>> +	}
>> +	if (info->mode != IPT_TTL_SET && info->ttl == 0)
>> +		return -EINVAL;
>> +	return 0;
>> +}
>> +
>> +static int hl_tg6_check(const struct xt_tgchk_param *par)
>> +{
>> +	const struct ip6t_HL_info *info = par->targinfo;
>> +
>> +	if (info->mode > IP6T_HL_MAXMODE) {
>> +		pr_info("invalid or unknown mode %u\n", info->mode);
>> +		return -EINVAL;
>> +	}
>> +	if (info->mode != IP6T_HL_SET && info->hop_limit == 0) {
>> +		pr_info("increment/decrement does not "
>> +			"make sense with value 0\n");
>> +		return -EINVAL;
>> +	}
>> +	return 0;
>> }
>> 
>> -static struct xt_match hl_mt_reg[] __read_mostly = {
>> +static struct xt_target hl_tg_reg[] __read_mostly = {
>> 	{
>> -		.name       = "ttl",
>> +		.name       = "TTL",
>> 		.revision   = 0,
>> 		.family     = NFPROTO_IPV4,
>> -		.match      = ttl_mt,
>> -		.matchsize  = sizeof(struct ipt_ttl_info),
>> +		.target     = ttl_tg,
>> +		.targetsize = sizeof(struct ipt_TTL_info),
>> +		.table      = "mangle",
>> +		.checkentry = ttl_tg_check,
>> 		.me         = THIS_MODULE,
>> 	},
>> 	{
>> -		.name       = "hl",
>> +		.name       = "HL",
>> 		.revision   = 0,
>> 		.family     = NFPROTO_IPV6,
>> -		.match      = hl_mt6,
>> -		.matchsize  = sizeof(struct ip6t_hl_info),
>> +		.target     = hl_tg6,
>> +		.targetsize = sizeof(struct ip6t_HL_info),
>> +		.table      = "mangle",
>> +		.checkentry = hl_tg6_check,
>> 		.me         = THIS_MODULE,
>> 	},
>> };
>> 
>> -static int __init hl_mt_init(void)
>> +static int __init hl_tg_init(void)
>> {
>> -	return xt_register_matches(hl_mt_reg, ARRAY_SIZE(hl_mt_reg));
>> +	return xt_register_targets(hl_tg_reg, ARRAY_SIZE(hl_tg_reg));
>> }
>> 
>> -static void __exit hl_mt_exit(void)
>> +static void __exit hl_tg_exit(void)
>> {
>> -	xt_unregister_matches(hl_mt_reg, ARRAY_SIZE(hl_mt_reg));
>> +	xt_unregister_targets(hl_tg_reg, ARRAY_SIZE(hl_tg_reg));
>> }
>> 
>> -module_init(hl_mt_init);
>> -module_exit(hl_mt_exit);
>> +module_init(hl_tg_init);
>> +module_exit(hl_tg_exit);
>> +MODULE_ALIAS("ipt_TTL");
>> +MODULE_ALIAS("ip6t_HL");
>> 
>> 
>> 
>> 
> 


Cheers, Andreas
Jan Kara April 20, 2017, 7:58 a.m. UTC | #5
On Fri 14-04-17 09:27:20, Ted Tso wrote:
> To summarize the discussion that we had on this week's ext4
> teleconference call, while discussing ways in which we might extend
> ext4's extended attributes to provide better support for Samba.
> 
> Andreas pointed out that we already have an unused field,
> e_value_block, in ext4_xattr_entry structure:
> 
> struct ext4_xattr_entry {
> 	__u8	e_name_len;	/* length of name */
> 	__u8	e_name_index;	/* attribute name index */
> 	__le16	e_value_offs;	/* offset in disk block of value */
> 	__le32	e_value_block;	/* disk block attribute is stored on (n/i) */
> 	__le32	e_value_size;	/* size of attribute value */
> 	__le32	e_hash;		/* hash value of name and value */
> 	char	e_name[0];	/* attribute name */
> };
> 
> It's only a 32-bit field, and it was repurposed in a Lustre-specific
> feature, EXT4_FEATURE_INCOMPAT_EA_INODE as e_value_inum (since inodes
> are only 32-bit today).  If this feature flag is enabled, then kernels
> which understand the feature will treat e_value_block as an inode
> number, and if it is non-zero, the value of that extended attribute is
> stored in the inode.  This ends up burning a lot of extra inodes for
> each extended attribute, which is why there was never much excitement
> for this patch going upstream.
> 
> However, we could extend this feature (it will almost certainly
> require a new INCOMPAT feature flag) such that a particular inode
> could be referenced from multiple strut ext4_xattr_entry's (from
> multiple inodes or from a single inode), since the inode for the xattr
> body already has a ref count, i_links_count.  And given that on a
> typical Windows CIFS file system, there will be dozens of unique
> acl's, the problem of exhausting inodes for xattrs won't be a issue in
> this case.
> 
> 
> However, another approach that we discussed on the weekly conference
> call was to change e_value_size to be an 16-bit field, and to use the
> high 16 bits for flags, where one of the flags bits (say, the MSB)
> would mean that e_value_block and e_value_size should be treated as a
> 48-bit block number, where the block could be stored.
> 
> Thinking about this some more, we can use another 4 bits from the high
> bits of e_value_size as a 16 bit number n, where if n=0, the block
> number is stored in e_Value_block and e_value_size as above, and if n
> > 1, that there are additional blocks for the xattr value, which will
> be stored in the place where the xattr value would normally be stored
> (e.g, in the inline xattr space or in the external xattr block).
> 
> So pictorally, it would look like this:
> 
> +----------------+----------------+
> | 128-byte inode | in-line xattr  |
> +----------------+----------------+
>                 /                  \
>                /                    \
>               /                      \
>   +---------------------------------------------+
>   | XE | XE | XE |               | XV | XV | XV |   XE == xattr_entry   XV == xattr value
>   +---------------------------------------------+
>            /      \             /     \
>           /        \           /       \
>          /          \         /         \
>     +--------------------+  +-------------+
>     |   ...  | blk0 |... |  | blk1 | blk2 |
>     +--------------------+  +-------------+
> 
> (to those using gmail; please view the above in a fixed-width font, or
> use "show original")
> 
> So in this picture, XE is the ext4_xattr_entry, and in this case, the
> high bits of e_value_size indicate e_value_block and the low bits of
> e_value_size indicate the location of the first 4k block where the
> xattr value is to be stored, and if one were to look at region of
> memory indicated by e_value_offs, there would be two 8-byte block
> numbers indicating the location of the 2nd and 3rd file system blocks
> where the xattr value can be found.
> 
> In the external xattr value blocks, at the beginning of the first
> block (e.g., at blk0), there will be an ext4_xattr_header, so we can
> take advantage of h_refcount field, but with the following changes:
> 
> * The low 16 bits of h_blocks will be used for the size of the xattr;
>   the high bits of h_blocks must be zero (for now).
> 
> * The h_hash field will be a crc32c of the value of the xattr stored
>   in the external xattr value block(s).
> 
> * The h_checksum field will be calculated so that the crc32c covers
>   only the ext4_xattr_header, instead of the entire xattrblock.  e.g.,
>   crc32c(fs uuid || id || xattr header), where id is the inode number
>   if refcount = 1, and blknum otherwise.
> 
> What are the advantages of this approach over the Lustre's
> xattr-value-in-inode approach?  First, we don't need to burn inodes
> for the xattr value.  This could potentially be an issue for Windows
> SID's, since there the number of SID's is roughly equal to number of
> users plus the number of groups.  And for a large enterprise with
> O(100,000) employees, we could burn a pretty large number of inodes.
> The other advantage of this scheme is that h_refcount field is 32
> bits, where as the inode's i_links_count field is only 16 bits, and
> there could very easily be more than 64k files that might share the
> same Windows ACL or Windows SID.  So we would need to figure out some
> way of dealing with an extended i_links_count field if we went with
> the xattr-value-in-inode approach.

So the proposal seems to have implicit in it that we will be
"deduplicating" xattr values. Currently we deduplicate only full external
xattr blocks (which possibly contain more xattrs). Any idea how big win
that is going to be over deduplicating only full sets of xattrs?

One idea I had in mind was that one way of supporting larger xattrs would
be to support something like xattr fork - i.e., in the xattr space of the
inode we would have root of an extent tree describing xattr space of the
inode. Then inside the space described by the extent tree would be stored
xattrs - possibly in the same format as they are currently stored in a
block (we would just redefine that e_value_block+e_value_offs describe the
offset of xattr value inside the xattr space). From the perspective of
"disk reads required to get the xattrs" this proposal should be similar as
above (xattr space description will mostly fully fit in the xattr space of
the inode) so we will just go and read the xattr headers and then value.
It has an advantage that it basically does not limit xattr size or number
of xattrs. It has the disadvantage that deduplication possibilities are
lower.

								Honza
Andreas Dilger April 20, 2017, 9:22 p.m. UTC | #6
On Apr 20, 2017, at 1:58 AM, Jan Kara <jack@suse.cz> wrote:
> 
> On Fri 14-04-17 09:27:20, Ted Tso wrote:
>> To summarize the discussion that we had on this week's ext4
>> teleconference call, while discussing ways in which we might extend
>> ext4's extended attributes to provide better support for Samba.
>> 
>> Andreas pointed out that we already have an unused field,
>> e_value_block, in ext4_xattr_entry structure:
>> 
>> struct ext4_xattr_entry {
>> 	__u8	e_name_len;	/* length of name */
>> 	__u8	e_name_index;	/* attribute name index */
>> 	__le16	e_value_offs;	/* offset in disk block of value */
>> 	__le32	e_value_block;	/* disk block attribute is stored on (n/i) */
>> 	__le32	e_value_size;	/* size of attribute value */
>> 	__le32	e_hash;		/* hash value of name and value */
>> 	char	e_name[0];	/* attribute name */
>> };
>> 
>> It's only a 32-bit field, and it was repurposed in a Lustre-specific
>> feature, EXT4_FEATURE_INCOMPAT_EA_INODE as e_value_inum (since inodes
>> are only 32-bit today).  If this feature flag is enabled, then kernels
>> which understand the feature will treat e_value_block as an inode
>> number, and if it is non-zero, the value of that extended attribute is
>> stored in the inode.  This ends up burning a lot of extra inodes for
>> each extended attribute, which is why there was never much excitement
>> for this patch going upstream.
>> 
>> However, we could extend this feature (it will almost certainly
>> require a new INCOMPAT feature flag) such that a particular inode
>> could be referenced from multiple strut ext4_xattr_entry's (from
>> multiple inodes or from a single inode), since the inode for the xattr
>> body already has a ref count, i_links_count.  And given that on a
>> typical Windows CIFS file system, there will be dozens of unique
>> acl's, the problem of exhausting inodes for xattrs won't be a issue in
>> this case.
>> 
>> 
>> However, another approach that we discussed on the weekly conference
>> call was to change e_value_size to be an 16-bit field, and to use the
>> high 16 bits for flags, where one of the flags bits (say, the MSB)
>> would mean that e_value_block and e_value_size should be treated as a
>> 48-bit block number, where the block could be stored.
>> 
>> Thinking about this some more, we can use another 4 bits from the high
>> bits of e_value_size as a 16 bit number n, where if n=0, the block
>> number is stored in e_Value_block and e_value_size as above, and if n
>>> 1, that there are additional blocks for the xattr value, which will
>> be stored in the place where the xattr value would normally be stored
>> (e.g, in the inline xattr space or in the external xattr block).
>> 
>> So pictorally, it would look like this:
>> 
>> +----------------+----------------+
>> | 128-byte inode | in-line xattr  |
>> +----------------+----------------+
>>                /                  \
>>               /                    \
>>              /                      \
>>  +---------------------------------------------+
>>  | XE | XE | XE |               | XV | XV | XV |   XE == xattr_entry   XV == xattr value
>>  +---------------------------------------------+
>>           /      \             /     \
>>          /        \           /       \
>>         /          \         /         \
>>    +--------------------+  +-------------+
>>    |   ...  | blk0 |... |  | blk1 | blk2 |
>>    +--------------------+  +-------------+
>> 
>> (to those using gmail; please view the above in a fixed-width font, or
>> use "show original")
>> 
>> So in this picture, XE is the ext4_xattr_entry, and in this case, the
>> high bits of e_value_size indicate e_value_block and the low bits of
>> e_value_size indicate the location of the first 4k block where the
>> xattr value is to be stored, and if one were to look at region of
>> memory indicated by e_value_offs, there would be two 8-byte block
>> numbers indicating the location of the 2nd and 3rd file system blocks
>> where the xattr value can be found.
>> 
>> In the external xattr value blocks, at the beginning of the first
>> block (e.g., at blk0), there will be an ext4_xattr_header, so we can
>> take advantage of h_refcount field, but with the following changes:
>> 
>> * The low 16 bits of h_blocks will be used for the size of the xattr;
>>  the high bits of h_blocks must be zero (for now).
>> 
>> * The h_hash field will be a crc32c of the value of the xattr stored
>>  in the external xattr value block(s).
>> 
>> * The h_checksum field will be calculated so that the crc32c covers
>>  only the ext4_xattr_header, instead of the entire xattrblock.  e.g.,
>>  crc32c(fs uuid || id || xattr header), where id is the inode number
>>  if refcount = 1, and blknum otherwise.
>> 
>> What are the advantages of this approach over the Lustre's
>> xattr-value-in-inode approach?  First, we don't need to burn inodes
>> for the xattr value.  This could potentially be an issue for Windows
>> SID's, since there the number of SID's is roughly equal to number of
>> users plus the number of groups.  And for a large enterprise with
>> O(100,000) employees, we could burn a pretty large number of inodes.
>> The other advantage of this scheme is that h_refcount field is 32
>> bits, where as the inode's i_links_count field is only 16 bits, and
>> there could very easily be more than 64k files that might share the
>> same Windows ACL or Windows SID.  So we would need to figure out some
>> way of dealing with an extended i_links_count field if we went with
>> the xattr-value-in-inode approach.
> 
> So the proposal seems to have implicit in it that we will be
> "deduplicating" xattr values. Currently we deduplicate only full external
> xattr blocks (which possibly contain more xattrs). Any idea how big win
> that is going to be over deduplicating only full sets of xattrs?

We discussed storing xattrs to be shared with one-value-per-block so that
they could be shared independently of other xattrs that are unique.  In
general, it makes sense to fit unique xattrs into the inode (if possible)
and put shared xattrs into a common location (if needed) to reduce space
usage and disk IO.  That said, if everything fits into the inode, even
the shared xattrs are extra overhead (refcounting, lock contention, etc)
that isn't needed.

The good news is that there are still significant benefits if the sharing
is not perfect, so this can be done opportunistically (e.g. share up to
N times for an xattr block, look for other identical "under-shared" xattrs
in cache to add new references, or create a new one if none are found).

> One idea I had in mind was that one way of supporting larger xattrs would
> be to support something like xattr fork - i.e., in the xattr space of the
> inode we would have root of an extent tree describing xattr space of the
> inode. Then inside the space described by the extent tree would be stored
> xattrs - possibly in the same format as they are currently stored in a
> block (we would just redefine that e_value_block+e_value_offs describe the
> offset of xattr value inside the xattr space).

Yes, this is what I was trying to get at with my previous email as well.
There isn't much difference between allocating a bunch of blocks directly
as the xattr space vs. an inode that is allocating those blocks.  The main
difference from the current xattr inode implementation is that this packs
multiple xattrs into a single inode, while the current code only stores a
single value starting at offset=0, without any header.

> From the perspective of "disk reads required to get the xattrs" this
> proposal should be similar as above (xattr space description will mostly
> fully fit in the xattr space of the inode) so we will just go and read
> the xattr headers and then value.  It has an advantage that it basically
> does not limit xattr size or number of xattrs. It has the disadvantage
> that deduplication possibilities are lower.

If a single xattr inode was referenced by multiple inodes then the sharing
could be the same.  IMHO, we'd want to limit the number of inodes sharing
the xattr inode anyway, so the 2^16 link count on the inode wouldn't be a
real issue.  Storing shared xattrs one-per-block would increase sharing.

The difficulty would be how to distinguish xattrs that are NOT shared by a
given inode?  We wouldn't want the unshared xattrs to leak between inodes.
One option would be to store the "owning" inode into the xattr entry for
unique xattrs, and the refcount into shared xattrs (distinguished by a flag).

Cheers, Andreas
Theodore Ts'o April 20, 2017, 9:24 p.m. UTC | #7
On Thu, Apr 20, 2017 at 09:58:23AM +0200, Jan Kara wrote:
> So the proposal seems to have implicit in it that we will be
> "deduplicating" xattr values. Currently we deduplicate only full external
> xattr blocks (which possibly contain more xattrs). Any idea how big win
> that is going to be over deduplicating only full sets of xattrs?

So in Windows, the security ID can be larger than what can fit in the
inode (if file creator belongs to foreign domains; I'm told that the
SID in some cases can be 12k or more).  And of course the Windows/Rich
acl can also be substantially bigger than what can fit in the inode.

So if you a directory hierarcy which all have the same ACL's, and a
large number of users that writing into that directory (so there is a
large number of different sids), the resulting cross product can be
large.

Windows also has a large number of other use cases for extended
attributes that will be unique.  In some cases, such as the Unix
timestamps, file owner, permissions bits, for files written by the
Windows Subsystem for Linux will fit in the inode table.  The
information that a particular flie was downloaded from 
"http://russia.phish.org/rootme.exe" so the user could be asked if
they really wanted to open it is also stored in an xattr.

It's definitely true that adding some hueristics to sort certain
xattrs into in-inode xattr will definitely help.  (For example, this
will definitely help the Android SE Linux label / ext4 encryption
context overflow case.)  But there will be definitely some cases,
probably mostly with Windows CIFS serving, where Microsoft is using
enough xattrs where this will probably be useful.

> One idea I had in mind was that one way of supporting larger xattrs would
> be to support something like xattr fork - i.e., in the xattr space of the
> inode we would have root of an extent tree describing xattr space of the
> inode. Then inside the space described by the extent tree would be stored
> xattrs - possibly in the same format as they are currently stored in a
> block (we would just redefine that e_value_block+e_value_offs describe the
> offset of xattr value inside the xattr space). From the perspective of
> "disk reads required to get the xattrs" this proposal should be similar as
> above (xattr space description will mostly fully fit in the xattr space of
> the inode) so we will just go and read the xattr headers and then value.
> It has an advantage that it basically does not limit xattr size or number
> of xattrs. It has the disadvantage that deduplication possibilities are
> lower.

The concern of disk reads required to get the xattrs is especially of
concern for those things are needed every time the file is accessed
--- e.g., for Rich ACL's.  It's the sharing which is what fixes the
disk seeks, and so the lower deduplications possibilities are a major
weakness of the scheme you've proposed above.

I'm personally not that interested in suppporting a large number of
large xattr's.  If we allow xattr values in inodes, that will allow
for a small number large xattr's, which ought to be sufficient, no?

      	    	   	 	  	- Ted
Amir Goldstein April 21, 2017, 7:54 a.m. UTC | #8
On Fri, Apr 21, 2017 at 12:22 AM, Andreas Dilger <adilger@dilger.ca> wrote:
> On Apr 20, 2017, at 1:58 AM, Jan Kara <jack@suse.cz> wrote:
>>
[...]
>> One idea I had in mind was that one way of supporting larger xattrs would
>> be to support something like xattr fork - i.e., in the xattr space of the
>> inode we would have root of an extent tree describing xattr space of the
>> inode. Then inside the space described by the extent tree would be stored
>> xattrs - possibly in the same format as they are currently stored in a
>> block (we would just redefine that e_value_block+e_value_offs describe the
>> offset of xattr value inside the xattr space).

BTW, 'xattr fork' is the xfs way AFAIK and btrfs has 'xattr inodes'.

>
> Yes, this is what I was trying to get at with my previous email as well.
> There isn't much difference between allocating a bunch of blocks directly
> as the xattr space vs. an inode that is allocating those blocks.  The main
> difference from the current xattr inode implementation is that this packs
> multiple xattrs into a single inode, while the current code only stores a
> single value starting at offset=0, without any header.
>

That's not the only difference, is it?
Current ea-in-inode code can allocate many xattr inodes per regular inode.
'xattr fork' is equivalent to allocating a single 'xattr inode' per
regular inode.

I wonder if one-xattr-inode and one-EA-per-block can work out:
- EA block cannot have more than 1 EA
- EA can span more than 1 EA block (i.e. compound EA block)
- Refcounting is in the EA block as it is now, but may refcount a compund block
- inode A may have a reference to xattr-inode Ax, which is the host for writing
  new unshared EAs for inode A
- EA of inode B may have a shared EA with e_value_block pointing at block
  of xattr-inode Ax
- When refcount of any EA (compound) block drops to zero, punch holes in
  the xattr-inode hosting these blocks
- When inode A is deleted, it drops refcount on all the EA blocks its EAs
  are referencing
- If xattr-inode Ax has remaining EA block when inode A is going away,
  it transitions into a 'shared xattr-inode' and lives on the orphan list, or
  another dedicated list, until its own blocks count drops to zero.

I probably missed some details, maybe important ones as well,
but if I haven't, then this could reuse some of the existing EA dedup
code and cap the inodes overhead significantly (*).

(*) shared xattr-inodes may be compacted by handing their blocks over to
     a dedicated SHARED_XATTR_INO or to any random xattr-inode victim
     for that matter (i.e. of root inode).

Amir.
diff mbox

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index fb69ee2..afe830b 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1797,6 +1797,7 @@  static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
 					 EXT4_FEATURE_INCOMPAT_EXTENTS| \
 					 EXT4_FEATURE_INCOMPAT_64BIT| \
 					 EXT4_FEATURE_INCOMPAT_FLEX_BG| \
+					 EXT4_FEATURE_INCOMPAT_EA_INODE| \
 					 EXT4_FEATURE_INCOMPAT_MMP | \
 					 EXT4_FEATURE_INCOMPAT_INLINE_DATA | \
 					 EXT4_FEATURE_INCOMPAT_ENCRYPT | \
@@ -2220,6 +2221,12 @@  struct mmpd_data {
 #define EXT4_MMP_MAX_CHECK_INTERVAL	300UL

 /*
+ * Maximum size of xattr attributes for FEATURE_INCOMPAT_EA_INODE 1Mb
+ * This limit is arbitrary, but is reasonable for the xattr API.
+ */
+#define EXT4_XATTR_MAX_LARGE_EA_SIZE    (1024 * 1024)
+
+/*
  * Function prototypes
  */

@@ -2231,6 +2238,10 @@  struct mmpd_data {
 # define ATTRIB_NORET	__attribute__((noreturn))
 # define NORET_AND	noreturn,

+struct ext4_xattr_ino_array {
+	unsigned int xia_count;		/* # of used item in the array */
+	unsigned int xia_inodes[0];
+};
 /* bitmap.c */
 extern unsigned int ext4_count_free(char *bitmap, unsigned numchars);
 void ext4_inode_bitmap_csum_set(struct super_block *sb, ext4_group_t group,
@@ -2480,6 +2491,7 @@  int do_journal_get_write_access(handle_t *handle,
 extern void ext4_get_inode_flags(struct ext4_inode_info *);
 extern int ext4_alloc_da_blocks(struct inode *inode);
 extern void ext4_set_aops(struct inode *inode);
+extern int ext4_meta_trans_blocks(struct inode *, int nrblocks, int chunk);
 extern int ext4_writepage_trans_blocks(struct inode *);
 extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
 extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 17bc043..01eaad6 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -294,7 +294,6 @@  void ext4_free_inode(handle_t *handle, struct inode *inode)
 	 * as writing the quota to disk may need the lock as well.
 	 */
 	dquot_initialize(inode);
-	ext4_xattr_delete_inode(handle, inode);
 	dquot_free_inode(inode);
 	dquot_drop(inode);

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 375fb1c..9601496 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -61,7 +61,7 @@  static int get_max_inline_xattr_value_size(struct inode *inode,

 	/* Compute min_offs. */
 	for (; !IS_LAST_ENTRY(entry); entry = EXT4_XATTR_NEXT(entry)) {
-		if (!entry->e_value_block && entry->e_value_size) {
+		if (!entry->e_value_inum && entry->e_value_size) {
 			size_t offs = le16_to_cpu(entry->e_value_offs);
 			if (offs < min_offs)
 				min_offs = offs;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b9ffa9f..70069e0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -139,8 +139,6 @@  static void ext4_invalidatepage(struct page *page, unsigned int offset,
 				unsigned int length);
 static int __ext4_journalled_writepage(struct page *page, unsigned int len);
 static int ext4_bh_delay_or_unwritten(handle_t *handle, struct buffer_head *bh);
-static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
-				  int pextents);

 /*
  * Test whether an inode is a fast symlink.
@@ -189,6 +187,8 @@  void ext4_evict_inode(struct inode *inode)
 {
 	handle_t *handle;
 	int err;
+	int extra_credits = 3;
+	struct ext4_xattr_ino_array *lea_ino_array = NULL;

 	trace_ext4_evict_inode(inode);

@@ -238,8 +238,8 @@  void ext4_evict_inode(struct inode *inode)
 	 * protection against it
 	 */
 	sb_start_intwrite(inode->i_sb);
-	handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE,
-				    ext4_blocks_for_truncate(inode)+3);
+
+	handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, extra_credits);
 	if (IS_ERR(handle)) {
 		ext4_std_error(inode->i_sb, PTR_ERR(handle));
 		/*
@@ -251,9 +251,36 @@  void ext4_evict_inode(struct inode *inode)
 		sb_end_intwrite(inode->i_sb);
 		goto no_delete;
 	}
-
 	if (IS_SYNC(inode))
 		ext4_handle_sync(handle);
+
+	/*
+	 * Delete xattr inode before deleting the main inode.
+	 */
+	err = ext4_xattr_delete_inode(handle, inode, &lea_ino_array);
+	if (err) {
+		ext4_warning(inode->i_sb,
+			     "couldn't delete inode's xattr (err %d)", err);
+		goto stop_handle;
+	}
+
+	if (!IS_NOQUOTA(inode))
+		extra_credits += 2 * EXT4_QUOTA_DEL_BLOCKS(inode->i_sb);
+
+	if (!ext4_handle_has_enough_credits(handle,
+			ext4_blocks_for_truncate(inode) + extra_credits)) {
+		err = ext4_journal_extend(handle,
+			ext4_blocks_for_truncate(inode) + extra_credits);
+		if (err > 0)
+			err = ext4_journal_restart(handle,
+			ext4_blocks_for_truncate(inode) + extra_credits);
+		if (err != 0) {
+			ext4_warning(inode->i_sb,
+				     "couldn't extend journal (err %d)", err);
+			goto stop_handle;
+		}
+	}
+
 	inode->i_size = 0;
 	err = ext4_mark_inode_dirty(handle, inode);
 	if (err) {
@@ -277,10 +304,10 @@  void ext4_evict_inode(struct inode *inode)
 	 * enough credits left in the handle to remove the inode from
 	 * the orphan list and set the dtime field.
 	 */
-	if (!ext4_handle_has_enough_credits(handle, 3)) {
-		err = ext4_journal_extend(handle, 3);
+	if (!ext4_handle_has_enough_credits(handle, extra_credits)) {
+		err = ext4_journal_extend(handle, extra_credits);
 		if (err > 0)
-			err = ext4_journal_restart(handle, 3);
+			err = ext4_journal_restart(handle, extra_credits);
 		if (err != 0) {
 			ext4_warning(inode->i_sb,
 				     "couldn't extend journal (err %d)", err);
@@ -315,8 +342,12 @@  void ext4_evict_inode(struct inode *inode)
 		ext4_clear_inode(inode);
 	else
 		ext4_free_inode(handle, inode);
+
 	ext4_journal_stop(handle);
 	sb_end_intwrite(inode->i_sb);
+
+	if (lea_ino_array != NULL)
+		ext4_xattr_inode_array_free(inode, lea_ino_array);
 	return;
 no_delete:
 	ext4_clear_inode(inode);	/* We must guarantee clearing of inode... */
@@ -5475,7 +5506,7 @@  static int ext4_index_trans_blocks(struct inode *inode, int lblocks,
  *
  * Also account for superblock, inode, quota and xattr blocks
  */
-static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
+int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
 				  int pextents)
 {
 	ext4_group_t groups, ngroups = ext4_get_groups_count(inode->i_sb);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 996e790..f158798 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -190,9 +190,8 @@  static void ext4_xattr_block_csum_set(struct inode *inode,

 	/* Check the values */
 	while (!IS_LAST_ENTRY(entry)) {
-		if (entry->e_value_block != 0)
-			return -EFSCORRUPTED;
-		if (entry->e_value_size != 0) {
+		if (entry->e_value_size != 0 &&
+		    entry->e_value_inum == 0) {
 			u16 offs = le16_to_cpu(entry->e_value_offs);
 			u32 size = le32_to_cpu(entry->e_value_size);
 			void *value;
@@ -258,19 +257,26 @@  static void ext4_xattr_block_csum_set(struct inode *inode,
 	__xattr_check_inode((inode), (header), (end), __func__, __LINE__)

 static inline int
-ext4_xattr_check_entry(struct ext4_xattr_entry *entry, size_t size)
+ext4_xattr_check_entry(struct ext4_xattr_entry *entry, size_t size,
+		       struct inode *inode)
 {
 	size_t value_size = le32_to_cpu(entry->e_value_size);

-	if (entry->e_value_block != 0 || value_size > size ||
+	if (!entry->e_value_inum &&
 	    le16_to_cpu(entry->e_value_offs) + value_size > size)
 		return -EFSCORRUPTED;
+	if (entry->e_value_inum &&
+	    (le32_to_cpu(entry->e_value_inum) < EXT4_FIRST_INO(inode->i_sb) ||
+	     le32_to_cpu(entry->e_value_inum) >
+	     le32_to_cpu(EXT4_SB(inode->i_sb)->s_es->s_inodes_count)))
+		return -EFSCORRUPTED;
 	return 0;
 }

 static int
 ext4_xattr_find_entry(struct ext4_xattr_entry **pentry, int name_index,
-		      const char *name, size_t size, int sorted)
+		      const char *name, size_t size, int sorted,
+		      struct inode *inode)
 {
 	struct ext4_xattr_entry *entry;
 	size_t name_len;
@@ -290,11 +296,104 @@  static void ext4_xattr_block_csum_set(struct inode *inode,
 			break;
 	}
 	*pentry = entry;
-	if (!cmp && ext4_xattr_check_entry(entry, size))
+	if (!cmp && ext4_xattr_check_entry(entry, size, inode))
 		return -EFSCORRUPTED;
 	return cmp ? -ENODATA : 0;
 }

+/*
+ * Read the EA value from an inode.
+ */
+static int
+ext4_xattr_inode_read(struct inode *ea_inode, void *buf, size_t *size)
+{
+	unsigned long block = 0;
+	struct buffer_head *bh = NULL;
+	int blocksize;
+	size_t csize, ret_size = 0;
+
+	if (*size == 0)
+		return 0;
+
+	blocksize = ea_inode->i_sb->s_blocksize;
+
+	while (ret_size < *size) {
+		csize = (*size - ret_size) > blocksize ? blocksize :
+							*size - ret_size;
+		bh = ext4_bread(NULL, ea_inode, block, 0);
+		if (IS_ERR(bh)) {
+			*size = ret_size;
+			return PTR_ERR(bh);
+		}
+		memcpy(buf, bh->b_data, csize);
+		brelse(bh);
+
+		buf += csize;
+		block += 1;
+		ret_size += csize;
+	}
+
+	*size = ret_size;
+
+	return 0;
+}
+
+struct inode *ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino, int *err)
+{
+	struct inode *ea_inode = NULL;
+
+	ea_inode = ext4_iget(parent->i_sb, ea_ino);
+	if (IS_ERR(ea_inode) || is_bad_inode(ea_inode)) {
+		int rc = IS_ERR(ea_inode) ? PTR_ERR(ea_inode) : 0;
+		ext4_error(parent->i_sb, "error while reading EA inode %lu "
+			   "/ %d %d", ea_ino, rc, is_bad_inode(ea_inode));
+		*err = rc != 0 ? rc : -EIO;
+		return NULL;
+	}
+
+	if (EXT4_XATTR_INODE_GET_PARENT(ea_inode) != parent->i_ino ||
+	    ea_inode->i_generation != parent->i_generation) {
+		ext4_error(parent->i_sb, "Backpointer from EA inode %lu "
+			   "to parent invalid.", ea_ino);
+		*err = -EINVAL;
+		goto error;
+	}
+
+	if (!(EXT4_I(ea_inode)->i_flags & EXT4_EA_INODE_FL)) {
+		ext4_error(parent->i_sb, "EA inode %lu does not have "
+			   "EXT4_EA_INODE_FL flag set.\n", ea_ino);
+		*err = -EINVAL;
+		goto error;
+	}
+
+	*err = 0;
+	return ea_inode;
+
+error:
+	iput(ea_inode);
+	return NULL;
+}
+
+/*
+ * Read the value from the EA inode.
+ */
+static int
+ext4_xattr_inode_get(struct inode *inode, unsigned long ea_ino, void *buffer,
+		     size_t *size)
+{
+	struct inode *ea_inode = NULL;
+	int err;
+
+	ea_inode = ext4_xattr_inode_iget(inode, ea_ino, &err);
+	if (err)
+		return err;
+
+	err = ext4_xattr_inode_read(ea_inode, buffer, size);
+	iput(ea_inode);
+
+	return err;
+}
+
 static int
 ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
 		     void *buffer, size_t buffer_size)
@@ -327,7 +426,8 @@  static void ext4_xattr_block_csum_set(struct inode *inode,
 	}
 	ext4_xattr_cache_insert(ext4_mb_cache, bh);
 	entry = BFIRST(bh);
-	error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
+	error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1,
+				      inode);
 	if (error == -EFSCORRUPTED)
 		goto bad_block;
 	if (error)
@@ -337,8 +437,16 @@  static void ext4_xattr_block_csum_set(struct inode *inode,
 		error = -ERANGE;
 		if (size > buffer_size)
 			goto cleanup;
-		memcpy(buffer, bh->b_data + le16_to_cpu(entry->e_value_offs),
-		       size);
+		if (entry->e_value_inum) {
+			error = ext4_xattr_inode_get(inode,
+					     le32_to_cpu(entry->e_value_inum),
+					     buffer, &size);
+			if (error)
+				goto cleanup;
+		} else {
+			memcpy(buffer, bh->b_data +
+			       le16_to_cpu(entry->e_value_offs), size);
+		}
 	}
 	error = size;

@@ -372,7 +480,7 @@  static void ext4_xattr_block_csum_set(struct inode *inode,
 	if (error)
 		goto cleanup;
 	error = ext4_xattr_find_entry(&entry, name_index, name,
-				      end - (void *)entry, 0);
+				      end - (void *)entry, 0, inode);
 	if (error)
 		goto cleanup;
 	size = le32_to_cpu(entry->e_value_size);
@@ -380,8 +488,16 @@  static void ext4_xattr_block_csum_set(struct inode *inode,
 		error = -ERANGE;
 		if (size > buffer_size)
 			goto cleanup;
-		memcpy(buffer, (void *)IFIRST(header) +
-		       le16_to_cpu(entry->e_value_offs), size);
+		if (entry->e_value_inum) {
+			error = ext4_xattr_inode_get(inode,
+					     le32_to_cpu(entry->e_value_inum),
+					     buffer, &size);
+			if (error)
+				goto cleanup;
+		} else {
+			memcpy(buffer, (void *)IFIRST(header) +
+			       le16_to_cpu(entry->e_value_offs), size);
+		}
 	}
 	error = size;

@@ -648,7 +764,7 @@  static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
 				    size_t *min_offs, void *base, int *total)
 {
 	for (; !IS_LAST_ENTRY(last); last = EXT4_XATTR_NEXT(last)) {
-		if (last->e_value_size) {
+		if (!last->e_value_inum && last->e_value_size) {
 			size_t offs = le16_to_cpu(last->e_value_offs);
 			if (offs < *min_offs)
 				*min_offs = offs;
@@ -659,16 +775,172 @@  static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
 	return (*min_offs - ((void *)last - base) - sizeof(__u32));
 }

-static int
-ext4_xattr_set_entry(struct ext4_xattr_info *i, struct ext4_xattr_search *s)
+/*
+ * Write the value of the EA in an inode.
+ */
+static int ext4_xattr_inode_write(handle_t *handle, struct inode *ea_inode,
+				  const void *buf, int bufsize)
+{
+	struct buffer_head *bh = NULL;
+	unsigned long block = 0;
+	unsigned blocksize = ea_inode->i_sb->s_blocksize;
+	unsigned max_blocks = (bufsize + blocksize - 1) >> ea_inode->i_blkbits;
+	int csize, wsize = 0;
+	int ret = 0;
+	int retries = 0;
+
+retry:
+	while (ret >= 0 && ret < max_blocks) {
+		struct ext4_map_blocks map;
+		map.m_lblk = block += ret;
+		map.m_len = max_blocks -= ret;
+
+		ret = ext4_map_blocks(handle, ea_inode, &map,
+				      EXT4_GET_BLOCKS_CREATE);
+		if (ret <= 0) {
+			ext4_mark_inode_dirty(handle, ea_inode);
+			if (ret == -ENOSPC &&
+			    ext4_should_retry_alloc(ea_inode->i_sb, &retries)) {
+				ret = 0;
+				goto retry;
+			}
+			break;
+		}
+	}
+
+	if (ret < 0)
+		return ret;
+
+	block = 0;
+	while (wsize < bufsize) {
+		if (bh != NULL)
+			brelse(bh);
+		csize = (bufsize - wsize) > blocksize ? blocksize :
+								bufsize - wsize;
+		bh = ext4_getblk(handle, ea_inode, block, 0);
+		if (IS_ERR(bh)) {
+			ret = PTR_ERR(bh);
+			goto out;
+		}
+		ret = ext4_journal_get_write_access(handle, bh);
+		if (ret)
+			goto out;
+
+		memcpy(bh->b_data, buf, csize);
+		set_buffer_uptodate(bh);
+		ext4_handle_dirty_metadata(handle, ea_inode, bh);
+
+		buf += csize;
+		wsize += csize;
+		block += 1;
+	}
+
+	mutex_lock(&ea_inode->i_mutex);
+	i_size_write(ea_inode, wsize);
+	ext4_update_i_disksize(ea_inode, wsize);
+	mutex_unlock(&ea_inode->i_mutex);
+
+	ext4_mark_inode_dirty(handle, ea_inode);
+
+out:
+	brelse(bh);
+
+	return ret;
+}
+
+/*
+ * Create an inode to store the value of a large EA.
+ */
+static struct inode *ext4_xattr_inode_create(handle_t *handle,
+					     struct inode *inode)
+{
+	struct inode *ea_inode = NULL;
+
+	/*
+	 * Let the next inode be the goal, so we try and allocate the EA inode
+	 * in the same group, or nearby one.
+	 */
+	ea_inode = ext4_new_inode(handle, inode->i_sb->s_root->d_inode,
+				  S_IFREG | 0600, NULL, inode->i_ino + 1, NULL);
+	if (!IS_ERR(ea_inode)) {
+		ea_inode->i_op = &ext4_file_inode_operations;
+		ea_inode->i_fop = &ext4_file_operations;
+		ext4_set_aops(ea_inode);
+		ea_inode->i_generation = inode->i_generation;
+		EXT4_I(ea_inode)->i_flags |= EXT4_EA_INODE_FL;
+
+		/*
+		 * A back-pointer from EA inode to parent inode will be useful
+		 * for e2fsck.
+		 */
+		EXT4_XATTR_INODE_SET_PARENT(ea_inode, inode->i_ino);
+		unlock_new_inode(ea_inode);
+	}
+
+	return ea_inode;
+}
+
+/*
+ * Unlink the inode storing the value of the EA.
+ */
+int ext4_xattr_inode_unlink(struct inode *inode, unsigned long ea_ino)
+{
+	struct inode *ea_inode = NULL;
+	int err;
+
+	ea_inode = ext4_xattr_inode_iget(inode, ea_ino, &err);
+	if (err)
+		return err;
+
+	clear_nlink(ea_inode);
+	iput(ea_inode);
+
+	return 0;
+}
+
+/*
+ * Add value of the EA in an inode.
+ */
+static int ext4_xattr_inode_set(handle_t *handle, struct inode *inode,
+				unsigned long *ea_ino, const void *value,
+				size_t value_len)
+{
+	struct inode *ea_inode;
+	int err;
+
+	/* Create an inode for the EA value */
+	ea_inode = ext4_xattr_inode_create(handle, inode);
+	if (IS_ERR(ea_inode))
+		return PTR_ERR(ea_inode);
+
+	err = ext4_xattr_inode_write(handle, ea_inode, value, value_len);
+	if (err)
+		clear_nlink(ea_inode);
+	else
+		*ea_ino = ea_inode->i_ino;
+
+	iput(ea_inode);
+
+	return err;
+}
+
+static int ext4_xattr_set_entry(struct ext4_xattr_info *i,
+				struct ext4_xattr_search *s,
+				handle_t *handle, struct inode *inode)
 {
 	struct ext4_xattr_entry *last;
 	size_t free, min_offs = s->end - s->base, name_len = strlen(i->name);
+	int in_inode = i->in_inode;
+
+	if (ext4_feature_incompat(inode->i_sb, EA_INODE) &&
+	    (EXT4_XATTR_SIZE(i->value_len) >
+	     EXT4_XATTR_MIN_LARGE_EA_SIZE(inode->i_sb->s_blocksize)))
+		in_inode = 1;

 	/* Compute min_offs and last. */
 	last = s->first;
 	for (; !IS_LAST_ENTRY(last); last = EXT4_XATTR_NEXT(last)) {
-		if (last->e_value_size) {
+		if (!last->e_value_inum && last->e_value_size) {
 			size_t offs = le16_to_cpu(last->e_value_offs);
 			if (offs < min_offs)
 				min_offs = offs;
@@ -676,15 +948,20 @@  static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
 	}
 	free = min_offs - ((void *)last - s->base) - sizeof(__u32);
 	if (!s->not_found) {
-		if (s->here->e_value_size) {
+		if (!in_inode &&
+		    !s->here->e_value_inum && s->here->e_value_size) {
 			size_t size = le32_to_cpu(s->here->e_value_size);
 			free += EXT4_XATTR_SIZE(size);
 		}
 		free += EXT4_XATTR_LEN(name_len);
 	}
 	if (i->value) {
-		if (free < EXT4_XATTR_LEN(name_len) +
-			   EXT4_XATTR_SIZE(i->value_len))
+		size_t value_len = EXT4_XATTR_SIZE(i->value_len);
+
+		if (in_inode)
+			value_len = 0;
+
+		if (free < EXT4_XATTR_LEN(name_len) + value_len)
 			return -ENOSPC;
 	}

@@ -698,7 +975,8 @@  static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
 		s->here->e_name_len = name_len;
 		memcpy(s->here->e_name, i->name, name_len);
 	} else {
-		if (s->here->e_value_size) {
+		if (!s->here->e_value_inum && s->here->e_value_size &&
+		    s->here->e_value_offs > 0) {
 			void *first_val = s->base + min_offs;
 			size_t offs = le16_to_cpu(s->here->e_value_offs);
 			void *val = s->base + offs;
@@ -732,12 +1010,18 @@  static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
 			last = s->first;
 			while (!IS_LAST_ENTRY(last)) {
 				size_t o = le16_to_cpu(last->e_value_offs);
-				if (last->e_value_size && o < offs)
+				if (!last->e_value_inum &&
+				    last->e_value_size && o < offs)
 					last->e_value_offs =
 						cpu_to_le16(o + size);
 				last = EXT4_XATTR_NEXT(last);
 			}
 		}
+		if (s->here->e_value_inum) {
+			ext4_xattr_inode_unlink(inode,
+					    le32_to_cpu(s->here->e_value_inum);
+			s->here->e_value_inum = 0;
+		}
 		if (!i->value) {
 			/* Remove the old name. */
 			size_t size = EXT4_XATTR_LEN(name_len);
@@ -750,11 +1034,20 @@  static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,

 	if (i->value) {
 		/* Insert the new value. */
-		s->here->e_value_size = cpu_to_le32(i->value_len);
-		if (i->value_len) {
+		if (in_inode) {
+			unsigned long ea_ino =
+				le32_to_cpu(s->here->e_value_inum);
+			rc = ext4_xattr_inode_set(handle, inode, &ea_ino,
+						  i->value, i->value_len);
+			if (rc)
+				goto out;
+			s->here->e_value_inum = cpu_to_le32(ea_ino);
+			s->here->e_value_offs = 0;
+		} else if (i->value_len) {
 			size_t size = EXT4_XATTR_SIZE(i->value_len);
 			void *val = s->base + min_offs - size;
 			s->here->e_value_offs = cpu_to_le16(min_offs - size);
+			s->here->e_value_inum = 0;
 			if (i->value == EXT4_ZERO_XATTR_VALUE) {
 				memset(val, 0, size);
 			} else {
@@ -764,8 +1057,11 @@  static size_t ext4_xattr_free_space(struct ext4_xattr_entry *last,
 				memcpy(val, i->value, i->value_len);
 			}
 		}
+		s->here->e_value_size = cpu_to_le32(i->value_len);
 	}
-	return 0;
+
+out:
+	return rc;
 }

 struct ext4_xattr_block_find {
@@ -804,7 +1100,7 @@  struct ext4_xattr_block_find {
 		bs->s.end = bs->bh->b_data + bs->bh->b_size;
 		bs->s.here = bs->s.first;
 		error = ext4_xattr_find_entry(&bs->s.here, i->name_index,
-					      i->name, bs->bh->b_size, 1);
+					     i->name, bs->bh->b_size, 1, inode);
 		if (error && error != -ENODATA)
 			goto cleanup;
 		bs->s.not_found = error;
@@ -829,8 +1125,6 @@  struct ext4_xattr_block_find {

 #define header(x) ((struct ext4_xattr_header *)(x))

-	if (i->value && i->value_len > sb->s_blocksize)
-		return -ENOSPC;
 	if (s->base) {
 		BUFFER_TRACE(bs->bh, "get_write_access");
 		error = ext4_journal_get_write_access(handle, bs->bh);
@@ -849,7 +1143,7 @@  struct ext4_xattr_block_find {
 			mb_cache_entry_delete_block(ext4_mb_cache, hash,
 						    bs->bh->b_blocknr);
 			ea_bdebug(bs->bh, "modifying in-place");
-			error = ext4_xattr_set_entry(i, s);
+			error = ext4_xattr_set_entry(i, s, handle, inode);
 			if (!error) {
 				if (!IS_LAST_ENTRY(s->first))
 					ext4_xattr_rehash(header(s->base),
@@ -898,7 +1192,7 @@  struct ext4_xattr_block_find {
 		s->end = s->base + sb->s_blocksize;
 	}

-	error = ext4_xattr_set_entry(i, s);
+	error = ext4_xattr_set_entry(i, s, handle, inode);
 	if (error == -EFSCORRUPTED)
 		goto bad_block;
 	if (error)
@@ -1077,7 +1371,7 @@  int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
 		/* Find the named attribute. */
 		error = ext4_xattr_find_entry(&is->s.here, i->name_index,
 					      i->name, is->s.end -
-					      (void *)is->s.base, 0);
+					      (void *)is->s.base, 0, inode);
 		if (error && error != -ENODATA)
 			return error;
 		is->s.not_found = error;
@@ -1095,7 +1389,7 @@  int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,

 	if (EXT4_I(inode)->i_extra_isize == 0)
 		return -ENOSPC;
-	error = ext4_xattr_set_entry(i, s);
+	error = ext4_xattr_set_entry(i, s, handle, inode);
 	if (error) {
 		if (error == -ENOSPC &&
 		    ext4_has_inline_data(inode)) {
@@ -1107,7 +1401,7 @@  int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
 			error = ext4_xattr_ibody_find(inode, i, is);
 			if (error)
 				return error;
-			error = ext4_xattr_set_entry(i, s);
+			error = ext4_xattr_set_entry(i, s, handle, inode);
 		}
 		if (error)
 			return error;
@@ -1133,7 +1427,7 @@  static int ext4_xattr_ibody_set(struct inode *inode,

 	if (EXT4_I(inode)->i_extra_isize == 0)
 		return -ENOSPC;
-	error = ext4_xattr_set_entry(i, s);
+	error = ext4_xattr_set_entry(i, s, handle, inode);
 	if (error)
 		return error;
 	header = IHDR(inode, ext4_raw_inode(&is->iloc));
@@ -1180,7 +1474,7 @@  static int ext4_xattr_value_same(struct ext4_xattr_search *s,
 		.name = name,
 		.value = value,
 		.value_len = value_len,
-
+		.in_inode = 0,
 	};
 	struct ext4_xattr_ibody_find is = {
 		.s = { .not_found = -ENODATA, },
@@ -1250,6 +1544,15 @@  static int ext4_xattr_value_same(struct ext4_xattr_search *s,
 					goto cleanup;
 			}
 			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			if (EXT4_HAS_INCOMPAT_FEATURE(inode->i_sb,
+					EXT4_FEATURE_INCOMPAT_EA_INODE) &&
+			    error == -ENOSPC) {
+				/* xattr not fit to block, store at external
+				 * inode */
+				i.in_inode = 1;
+				error = ext4_xattr_ibody_set(handle, inode,
+							     &i, &is);
+			}
 			if (error)
 				goto cleanup;
 			if (!is.s.not_found) {
@@ -1293,9 +1596,22 @@  static int ext4_xattr_value_same(struct ext4_xattr_search *s,
 	       const void *value, size_t value_len, int flags)
 {
 	handle_t *handle;
+	struct super_block *sb = inode->i_sb;
 	int error, retries = 0;
 	int credits = ext4_jbd2_credits_xattr(inode);

+	if ((value_len >= EXT4_XATTR_MIN_LARGE_EA_SIZE(sb->s_blocksize)) &&
+	    EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EA_INODE)) {
+		int nrblocks = (value_len + sb->s_blocksize - 1) >>
+					sb->s_blocksize_bits;
+
+		/* For new inode */
+		credits += EXT4_SINGLEDATA_TRANS_BLOCKS(sb) + 3;
+
+		/* For data blocks of EA inode */
+		credits += ext4_meta_trans_blocks(inode, nrblocks, 0);
+	}
+
 retry:
 	handle = ext4_journal_start(inode, EXT4_HT_XATTR, credits);
 	if (IS_ERR(handle)) {
@@ -1307,7 +1623,7 @@  static int ext4_xattr_value_same(struct ext4_xattr_search *s,
 					      value, value_len, flags);
 		error2 = ext4_journal_stop(handle);
 		if (error == -ENOSPC &&
-		    ext4_should_retry_alloc(inode->i_sb, &retries))
+		    ext4_should_retry_alloc(sb, &retries))
 			goto retry;
 		if (error == 0)
 			error = error2;
@@ -1332,7 +1648,7 @@  static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry,

 	/* Adjust the value offsets of the entries */
 	for (; !IS_LAST_ENTRY(last); last = EXT4_XATTR_NEXT(last)) {
-		if (last->e_value_size) {
+		if (!last->e_value_inum && last->e_value_size) {
 			new_offs = le16_to_cpu(last->e_value_offs) +
 							value_offs_shift;
 			last->e_value_offs = cpu_to_le16(new_offs);
@@ -1593,21 +1909,135 @@  int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 }


+#define EIA_INCR 16 /* must be 2^n */
+#define EIA_MASK (EIA_INCR - 1)
+/* Add the large xattr @ino into @lea_ino_array for later deletion.
+ * If @lea_ino_array is new or full it will be grown and the old
+ * contents copied over.
+ */
+static int
+ext4_expand_ino_array(struct ext4_xattr_ino_array **lea_ino_array, __u32 ino)
+{
+	if (*lea_ino_array == NULL) {
+		/*
+		 * Start with 15 inodes, so it fits into a power-of-two size.
+		 * If *lea_ino_array is NULL, this is essentially offsetof()
+		 */
+		(*lea_ino_array) =
+			kmalloc(offsetof(struct ext4_xattr_ino_array,
+					 xia_inodes[EIA_MASK]),
+				GFP_NOFS);
+		if (*lea_ino_array == NULL)
+			return -ENOMEM;
+		(*lea_ino_array)->xia_count = 0;
+	} else if (((*lea_ino_array)->xia_count & EIA_MASK) == EIA_MASK) {
+		/* expand the array once all 15 + n * 16 slots are full */
+		struct ext4_xattr_ino_array *new_array = NULL;
+		int count = (*lea_ino_array)->xia_count;
+
+		/* if new_array is NULL, this is essentially offsetof() */
+		new_array = kmalloc(
+				offsetof(struct ext4_xattr_ino_array,
+					 xia_inodes[count + EIA_INCR]),
+				GFP_NOFS);
+		if (new_array == NULL)
+			return -ENOMEM;
+		memcpy(new_array, *lea_ino_array,
+		       offsetof(struct ext4_xattr_ino_array,
+				xia_inodes[count]));
+		kfree(*lea_ino_array);
+		*lea_ino_array = new_array;
+	}
+	(*lea_ino_array)->xia_inodes[(*lea_ino_array)->xia_count++] = ino;
+	return 0;
+}
+
+/**
+ * Add xattr inode to orphan list
+ */
+static int
+ext4_xattr_inode_orphan_add(handle_t *handle, struct inode *inode,
+			int credits, struct ext4_xattr_ino_array *lea_ino_array)
+{
+	struct inode *ea_inode = NULL;
+	int idx = 0, error = 0;
+
+	if (lea_ino_array == NULL)
+		return 0;
+
+	for (; idx < lea_ino_array->xia_count; ++idx) {
+		if (!ext4_handle_has_enough_credits(handle, credits)) {
+			error = ext4_journal_extend(handle, credits);
+			if (error > 0)
+				error = ext4_journal_restart(handle, credits);
+
+			if (error != 0) {
+				ext4_warning(inode->i_sb,
+					"couldn't extend journal "
+					"(err %d)", error);
+				return error;
+			}
+		}
+		ea_inode = ext4_xattr_inode_iget(inode,
+				lea_ino_array->xia_inodes[idx], &error);
+		if (error)
+			continue;
+		ext4_orphan_add(handle, ea_inode);
+		/* the inode's i_count will be released by caller */
+	}
+
+	return 0;
+}

 /*
  * ext4_xattr_delete_inode()
  *
- * Free extended attribute resources associated with this inode. This
+ * Free extended attribute resources associated with this inode. Traverse
+ * all entries and unlink any xattr inodes associated with this inode. This
  * is called immediately before an inode is freed. We have exclusive
- * access to the inode.
+ * access to the inode. If an orphan inode is deleted it will also delete any
+ * xattr block and all xattr inodes. They are checked by ext4_xattr_inode_iget()
+ * to ensure they belong to the parent inode and were not deleted already.
  */
-void
-ext4_xattr_delete_inode(handle_t *handle, struct inode *inode)
+int
+ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
+			struct ext4_xattr_ino_array **lea_ino_array)
 {
 	struct buffer_head *bh = NULL;
+	struct ext4_xattr_ibody_header *header;
+	struct ext4_inode *raw_inode;
+	struct ext4_iloc iloc;
+	struct ext4_xattr_entry *entry;
+	int credits = 3, error = 0;

-	if (!EXT4_I(inode)->i_file_acl)
+	if (!ext4_test_inode_state(inode, EXT4_STATE_XATTR))
+		goto delete_external_ea;
+
+	error = ext4_get_inode_loc(inode, &iloc);
+	if (error)
+		goto cleanup;
+	raw_inode = ext4_raw_inode(&iloc);
+	header = IHDR(inode, raw_inode);
+	for (entry = IFIRST(header); !IS_LAST_ENTRY(entry);
+	     entry = EXT4_XATTR_NEXT(entry)) {
+		if (!entry->e_value_inum)
+			continue;
+		if (ext4_expand_ino_array(lea_ino_array,
+					  entry->e_value_inum) != 0) {
+			brelse(iloc.bh);
+			goto cleanup;
+		}
+		entry->e_value_inum = 0;
+	}
+	brelse(iloc.bh);
+
+delete_external_ea:
+	if (!EXT4_I(inode)->i_file_acl) {
+		/* add xattr inode to orphan list */
+		ext4_xattr_inode_orphan_add(handle, inode, credits,
+						*lea_ino_array);
 		goto cleanup;
+	}
 	bh = sb_bread(inode->i_sb, EXT4_I(inode)->i_file_acl);
 	if (!bh) {
 		EXT4_ERROR_INODE(inode, "block %llu read error",
@@ -1620,11 +2050,69 @@  int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 				 EXT4_I(inode)->i_file_acl);
 		goto cleanup;
 	}
+
+	for (entry = BFIRST(bh); !IS_LAST_ENTRY(entry);
+	     entry = EXT4_XATTR_NEXT(entry)) {
+		if (!entry->e_value_inum)
+			continue;
+		if (ext4_expand_ino_array(lea_ino_array,
+					  entry->e_value_inum) != 0)
+			goto cleanup;
+		entry->e_value_inum = 0;
+	}
+
+	/* add xattr inode to orphan list */
+	error = ext4_xattr_inode_orphan_add(handle, inode, credits,
+					*lea_ino_array);
+	if (error != 0)
+		goto cleanup;
+
+	if (!IS_NOQUOTA(inode))
+		credits += 2 * EXT4_QUOTA_DEL_BLOCKS(inode->i_sb);
+
+	if (!ext4_handle_has_enough_credits(handle, credits)) {
+		error = ext4_journal_extend(handle, credits);
+		if (error > 0)
+			error = ext4_journal_restart(handle, credits);
+		if (error != 0) {
+			ext4_warning(inode->i_sb,
+				"couldn't extend journal (err %d)", error);
+			goto cleanup;
+		}
+	}
+
 	ext4_xattr_release_block(handle, inode, bh);
 	EXT4_I(inode)->i_file_acl = 0;

 cleanup:
 	brelse(bh);
+
+	return error;
+}
+
+void
+ext4_xattr_inode_array_free(struct inode *inode,
+			    struct ext4_xattr_ino_array *lea_ino_array)
+{
+	struct inode	*ea_inode = NULL;
+	int		idx = 0;
+	int		err;
+
+	if (lea_ino_array == NULL)
+		return;
+
+	for (; idx < lea_ino_array->xia_count; ++idx) {
+		ea_inode = ext4_xattr_inode_iget(inode,
+				lea_ino_array->xia_inodes[idx], &err);
+		if (err)
+			continue;
+		/* for inode's i_count get from ext4_xattr_delete_inode */
+		if (!list_empty(&EXT4_I(ea_inode)->i_orphan))
+			iput(ea_inode);
+		clear_nlink(ea_inode);
+		iput(ea_inode);
+	}
+	kfree(lea_ino_array);
 }

 /*
@@ -1676,10 +2164,9 @@  int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 		    entry1->e_name_index != entry2->e_name_index ||
 		    entry1->e_name_len != entry2->e_name_len ||
 		    entry1->e_value_size != entry2->e_value_size ||
+		    entry1->e_value_inum != entry2->e_value_inum ||
 		    memcmp(entry1->e_name, entry2->e_name, entry1->e_name_len))
 			return 1;
-		if (entry1->e_value_block != 0 || entry2->e_value_block != 0)
-			return -EFSCORRUPTED;
 		if (memcmp((char *)header1 + le16_to_cpu(entry1->e_value_offs),
 			   (char *)header2 + le16_to_cpu(entry2->e_value_offs),
 			   le32_to_cpu(entry1->e_value_size)))
@@ -1751,7 +2238,7 @@  static inline void ext4_xattr_hash_entry(struct ext4_xattr_header *header,
 		       *name++;
 	}

-	if (entry->e_value_size != 0) {
+	if (!entry->e_value_inum && entry->e_value_size) {
 		__le32 *value = (__le32 *)((char *)header +
 			le16_to_cpu(entry->e_value_offs));
 		for (n = (le32_to_cpu(entry->e_value_size) +
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 099c8b6..6e10ff9 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -44,7 +44,7 @@  struct ext4_xattr_entry {
 	__u8	e_name_len;	/* length of name */
 	__u8	e_name_index;	/* attribute name index */
 	__le16	e_value_offs;	/* offset in disk block of value */
-	__le32	e_value_block;	/* disk block attribute is stored on (n/i) */
+	__le32	e_value_inum;	/* inode in which the value is stored */
 	__le32	e_value_size;	/* size of attribute value */
 	__le32	e_hash;		/* hash value of name and value */
 	char	e_name[0];	/* attribute name */
@@ -69,6 +69,26 @@  struct ext4_xattr_entry {
 		EXT4_I(inode)->i_extra_isize))
 #define IFIRST(hdr) ((struct ext4_xattr_entry *)((hdr)+1))

+/*
+ * Link EA inode back to parent one using i_mtime field.
+ * Extra integer type conversion added to ignore higher
+ * bits in i_mtime.tv_sec which might be set by ext4_get()
+ */
+#define EXT4_XATTR_INODE_SET_PARENT(inode, inum)      \
+do {                                                  \
+      (inode)->i_mtime.tv_sec = inum;                 \
+} while(0)
+
+#define EXT4_XATTR_INODE_GET_PARENT(inode)            \
+((__u32)(inode)->i_mtime.tv_sec)
+
+/*
+ * The minimum size of EA value when you start storing it in an external inode
+ * size of block - size of header - size of 1 entry - 4 null bytes
+*/
+#define EXT4_XATTR_MIN_LARGE_EA_SIZE(b)					\
+	((b) - EXT4_XATTR_LEN(3) - sizeof(struct ext4_xattr_header) - 4)
+
 #define BHDR(bh) ((struct ext4_xattr_header *)((bh)->b_data))
 #define ENTRY(ptr) ((struct ext4_xattr_entry *)(ptr))
 #define BFIRST(bh) ENTRY(BHDR(bh)+1)
@@ -77,10 +97,11 @@  struct ext4_xattr_entry {
 #define EXT4_ZERO_XATTR_VALUE ((void *)-1)

 struct ext4_xattr_info {
-	int name_index;
 	const char *name;
 	const void *value;
 	size_t value_len;
+	int name_index;
+	int in_inode;
 };

 struct ext4_xattr_search {
@@ -140,7 +161,13 @@  static inline void ext4_write_unlock_xattr(struct inode *inode, int *save)
 extern int ext4_xattr_set(struct inode *, int, const char *, const void *, size_t, int);
 extern int ext4_xattr_set_handle(handle_t *, struct inode *, int, const char *, const void *, size_t, int);

-extern void ext4_xattr_delete_inode(handle_t *, struct inode *);
+extern struct inode *ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
+					   int *err);
+extern int ext4_xattr_inode_unlink(struct inode *inode, unsigned long ea_ino);
+extern int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
+				   struct ext4_xattr_ino_array **array);
+extern void ext4_xattr_inode_array_free(struct inode *inode,
+					struct ext4_xattr_ino_array *array);

 extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 			    struct ext4_inode *raw_inode, handle_t *handle);
diff --git a/include/uapi/linux/netfilter/xt_CONNMARK.h b/include/uapi/linux/netfilter/xt_CONNMARK.h
index 2f2e48e..efc17a8 100644
--- a/include/uapi/linux/netfilter/xt_CONNMARK.h
+++ b/include/uapi/linux/netfilter/xt_CONNMARK.h
@@ -1,6 +1,31 @@ 
-#ifndef _XT_CONNMARK_H_target
-#define _XT_CONNMARK_H_target
+#ifndef _XT_CONNMARK_H
+#define _XT_CONNMARK_H

-#include <linux/netfilter/xt_connmark.h>
+#include <linux/types.h>

-#endif /*_XT_CONNMARK_H_target*/
+/* Copyright (C) 2002,2004 MARA Systems AB <http://www.marasystems.com>
+ * by Henrik Nordstrom <hno@marasystems.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+enum {
+	XT_CONNMARK_SET = 0,
+	XT_CONNMARK_SAVE,
+	XT_CONNMARK_RESTORE
+};
+
+struct xt_connmark_tginfo1 {
+	__u32 ctmark, ctmask, nfmask;
+	__u8 mode;
+};
+
+struct xt_connmark_mtinfo1 {
+	__u32 mark, mask;
+	__u8 invert;
+};
+
+#endif /*_XT_CONNMARK_H*/
diff --git a/include/uapi/linux/netfilter/xt_DSCP.h b/include/uapi/linux/netfilter/xt_DSCP.h
index 648e0b3..15f8932 100644
--- a/include/uapi/linux/netfilter/xt_DSCP.h
+++ b/include/uapi/linux/netfilter/xt_DSCP.h
@@ -1,26 +1,31 @@ 
-/* x_tables module for setting the IPv4/IPv6 DSCP field
+/* x_tables module for matching the IPv4/IPv6 DSCP field
  *
  * (C) 2002 Harald Welte <laforge@gnumonks.org>
- * based on ipt_FTOS.c (C) 2000 by Matthew G. Marsh <mgm@paktronix.com>
  * This software is distributed under GNU GPL v2, 1991
  *
  * See RFC2474 for a description of the DSCP field within the IP Header.
  *
- * xt_DSCP.h,v 1.7 2002/03/14 12:03:13 laforge Exp
+ * xt_dscp.h,v 1.3 2002/08/05 19:00:21 laforge Exp
 */
-#ifndef _XT_DSCP_TARGET_H
-#define _XT_DSCP_TARGET_H
-#include <linux/netfilter/xt_dscp.h>
+#ifndef _XT_DSCP_H
+#define _XT_DSCP_H
+
 #include <linux/types.h>

-/* target info */
-struct xt_DSCP_info {
+#define XT_DSCP_MASK	0xfc	/* 11111100 */
+#define XT_DSCP_SHIFT	2
+#define XT_DSCP_MAX	0x3f	/* 00111111 */
+
+/* match info */
+struct xt_dscp_info {
 	__u8 dscp;
+	__u8 invert;
 };

-struct xt_tos_target_info {
-	__u8 tos_value;
+struct xt_tos_match_info {
 	__u8 tos_mask;
+	__u8 tos_value;
+	__u8 invert;
 };

-#endif /* _XT_DSCP_TARGET_H */
+#endif /* _XT_DSCP_H */
diff --git a/include/uapi/linux/netfilter/xt_MARK.h b/include/uapi/linux/netfilter/xt_MARK.h
index 41c456d..ecadc40 100644
--- a/include/uapi/linux/netfilter/xt_MARK.h
+++ b/include/uapi/linux/netfilter/xt_MARK.h
@@ -1,6 +1,15 @@ 
-#ifndef _XT_MARK_H_target
-#define _XT_MARK_H_target
+#ifndef _XT_MARK_H
+#define _XT_MARK_H

-#include <linux/netfilter/xt_mark.h>
+#include <linux/types.h>

-#endif /*_XT_MARK_H_target */
+struct xt_mark_tginfo2 {
+	__u32 mark, mask;
+};
+
+struct xt_mark_mtinfo1 {
+	__u32 mark, mask;
+	__u8 invert;
+};
+
+#endif /*_XT_MARK_H*/
diff --git a/include/uapi/linux/netfilter/xt_TCPMSS.h b/include/uapi/linux/netfilter/xt_TCPMSS.h
index 9a6960a..fbac56b 100644
--- a/include/uapi/linux/netfilter/xt_TCPMSS.h
+++ b/include/uapi/linux/netfilter/xt_TCPMSS.h
@@ -1,12 +1,11 @@ 
-#ifndef _XT_TCPMSS_H
-#define _XT_TCPMSS_H
+#ifndef _XT_TCPMSS_MATCH_H
+#define _XT_TCPMSS_MATCH_H

 #include <linux/types.h>

-struct xt_tcpmss_info {
-	__u16 mss;
+struct xt_tcpmss_match_info {
+    __u16 mss_min, mss_max;
+    __u8 invert;
 };

-#define XT_TCPMSS_CLAMP_PMTU 0xffff
-
-#endif /* _XT_TCPMSS_H */
+#endif /*_XT_TCPMSS_MATCH_H*/
diff --git a/include/uapi/linux/netfilter/xt_rateest.h b/include/uapi/linux/netfilter/xt_rateest.h
index 13fe50d..ec1b570 100644
--- a/include/uapi/linux/netfilter/xt_rateest.h
+++ b/include/uapi/linux/netfilter/xt_rateest.h
@@ -1,38 +1,16 @@ 
-#ifndef _XT_RATEEST_MATCH_H
-#define _XT_RATEEST_MATCH_H
+#ifndef _XT_RATEEST_TARGET_H
+#define _XT_RATEEST_TARGET_H

 #include <linux/types.h>
 #include <linux/if.h>

-enum xt_rateest_match_flags {
-	XT_RATEEST_MATCH_INVERT	= 1<<0,
-	XT_RATEEST_MATCH_ABS	= 1<<1,
-	XT_RATEEST_MATCH_REL	= 1<<2,
-	XT_RATEEST_MATCH_DELTA	= 1<<3,
-	XT_RATEEST_MATCH_BPS	= 1<<4,
-	XT_RATEEST_MATCH_PPS	= 1<<5,
-};
-
-enum xt_rateest_match_mode {
-	XT_RATEEST_MATCH_NONE,
-	XT_RATEEST_MATCH_EQ,
-	XT_RATEEST_MATCH_LT,
-	XT_RATEEST_MATCH_GT,
-};
-
-struct xt_rateest_match_info {
-	char			name1[IFNAMSIZ];
-	char			name2[IFNAMSIZ];
-	__u16		flags;
-	__u16		mode;
-	__u32		bps1;
-	__u32		pps1;
-	__u32		bps2;
-	__u32		pps2;
+struct xt_rateest_target_info {
+	char			name[IFNAMSIZ];
+	__s8			interval;
+	__u8		ewma_log;

 	/* Used internally by the kernel */
-	struct xt_rateest	*est1 __attribute__((aligned(8)));
-	struct xt_rateest	*est2 __attribute__((aligned(8)));
+	struct xt_rateest	*est __attribute__((aligned(8)));
 };

-#endif /* _XT_RATEEST_MATCH_H */
+#endif /* _XT_RATEEST_TARGET_H */
diff --git a/include/uapi/linux/netfilter_ipv4/ipt_ECN.h b/include/uapi/linux/netfilter_ipv4/ipt_ECN.h
index bb88d53..0e0c063 100644
--- a/include/uapi/linux/netfilter_ipv4/ipt_ECN.h
+++ b/include/uapi/linux/netfilter_ipv4/ipt_ECN.h
@@ -1,33 +1,15 @@ 
-/* Header file for iptables ipt_ECN target
- *
- * (C) 2002 by Harald Welte <laforge@gnumonks.org>
- *
- * This software is distributed under GNU GPL v2, 1991
- *
- * ipt_ECN.h,v 1.3 2002/05/29 12:17:40 laforge Exp
-*/
-#ifndef _IPT_ECN_TARGET_H
-#define _IPT_ECN_TARGET_H
-
-#include <linux/types.h>
-#include <linux/netfilter/xt_DSCP.h>
-
-#define IPT_ECN_IP_MASK	(~XT_DSCP_MASK)
-
-#define IPT_ECN_OP_SET_IP	0x01	/* set ECN bits of IPv4 header */
-#define IPT_ECN_OP_SET_ECE	0x10	/* set ECE bit of TCP header */
-#define IPT_ECN_OP_SET_CWR	0x20	/* set CWR bit of TCP header */
-
-#define IPT_ECN_OP_MASK		0xce
-
-struct ipt_ECN_info {
-	__u8 operation;	/* bitset of operations */
-	__u8 ip_ect;	/* ECT codepoint of IPv4 header, pre-shifted */
-	union {
-		struct {
-			__u8 ece:1, cwr:1; /* TCP ECT bits */
-		} tcp;
-	} proto;
+#ifndef _IPT_ECN_H
+#define _IPT_ECN_H
+
+#include <linux/netfilter/xt_ecn.h>
+#define ipt_ecn_info xt_ecn_info
+
+enum {
+	IPT_ECN_IP_MASK       = XT_ECN_IP_MASK,
+	IPT_ECN_OP_MATCH_IP   = XT_ECN_OP_MATCH_IP,
+	IPT_ECN_OP_MATCH_ECE  = XT_ECN_OP_MATCH_ECE,
+	IPT_ECN_OP_MATCH_CWR  = XT_ECN_OP_MATCH_CWR,
+	IPT_ECN_OP_MATCH_MASK = XT_ECN_OP_MATCH_MASK,
 };

-#endif /* _IPT_ECN_TARGET_H */
+#endif /* IPT_ECN_H */
diff --git a/include/uapi/linux/netfilter_ipv4/ipt_TTL.h b/include/uapi/linux/netfilter_ipv4/ipt_TTL.h
index f6ac169..37bee44 100644
--- a/include/uapi/linux/netfilter_ipv4/ipt_TTL.h
+++ b/include/uapi/linux/netfilter_ipv4/ipt_TTL.h
@@ -1,5 +1,5 @@ 
-/* TTL modification module for IP tables
- * (C) 2000 by Harald Welte <laforge@netfilter.org> */
+/* IP tables module for matching the value of the TTL
+ * (C) 2000 by Harald Welte <laforge@gnumonks.org> */

 #ifndef _IPT_TTL_H
 #define _IPT_TTL_H
@@ -7,14 +7,14 @@ 
 #include <linux/types.h>

 enum {
-	IPT_TTL_SET = 0,
-	IPT_TTL_INC,
-	IPT_TTL_DEC
+	IPT_TTL_EQ = 0,		/* equals */
+	IPT_TTL_NE,		/* not equals */
+	IPT_TTL_LT,		/* less than */
+	IPT_TTL_GT,		/* greater than */
 };

-#define IPT_TTL_MAXMODE	IPT_TTL_DEC

-struct ipt_TTL_info {
+struct ipt_ttl_info {
 	__u8	mode;
 	__u8	ttl;
 };
diff --git a/include/uapi/linux/netfilter_ipv6/ip6t_HL.h b/include/uapi/linux/netfilter_ipv6/ip6t_HL.h
index ebd8ead..6e76dbc 100644
--- a/include/uapi/linux/netfilter_ipv6/ip6t_HL.h
+++ b/include/uapi/linux/netfilter_ipv6/ip6t_HL.h
@@ -1,6 +1,6 @@ 
-/* Hop Limit modification module for ip6tables
+/* ip6tables module for matching the Hop Limit value
  * Maciej Soltysiak <solt@dns.toxicfilms.tv>
- * Based on HW's TTL module */
+ * Based on HW's ttl module */

 #ifndef _IP6T_HL_H
 #define _IP6T_HL_H
@@ -8,14 +8,14 @@ 
 #include <linux/types.h>

 enum {
-	IP6T_HL_SET = 0,
-	IP6T_HL_INC,
-	IP6T_HL_DEC
+	IP6T_HL_EQ = 0,		/* equals */
+	IP6T_HL_NE,		/* not equals */
+	IP6T_HL_LT,		/* less than */
+	IP6T_HL_GT,		/* greater than */
 };

-#define IP6T_HL_MAXMODE	IP6T_HL_DEC

-struct ip6t_HL_info {
+struct ip6t_hl_info {
 	__u8	mode;
 	__u8	hop_limit;
 };
diff --git a/net/netfilter/xt_RATEEST.c b/net/netfilter/xt_RATEEST.c
index 498b54f..755d2f6 100644
--- a/net/netfilter/xt_RATEEST.c
+++ b/net/netfilter/xt_RATEEST.c
@@ -8,184 +8,149 @@ 
 #include <linux/module.h>
 #include <linux/skbuff.h>
 #include <linux/gen_stats.h>
-#include <linux/jhash.h>
-#include <linux/rtnetlink.h>
-#include <linux/random.h>
-#include <linux/slab.h>
-#include <net/gen_stats.h>
-#include <net/netlink.h>

 #include <linux/netfilter/x_tables.h>
-#include <linux/netfilter/xt_RATEEST.h>
+#include <linux/netfilter/xt_rateest.h>
 #include <net/netfilter/xt_rateest.h>

-static DEFINE_MUTEX(xt_rateest_mutex);

-#define RATEEST_HSIZE	16
-static struct hlist_head rateest_hash[RATEEST_HSIZE] __read_mostly;
-static unsigned int jhash_rnd __read_mostly;
-
-static unsigned int xt_rateest_hash(const char *name)
-{
-	return jhash(name, FIELD_SIZEOF(struct xt_rateest, name), jhash_rnd) &
-	       (RATEEST_HSIZE - 1);
-}
-
-static void xt_rateest_hash_insert(struct xt_rateest *est)
+static bool
+xt_rateest_mt(const struct sk_buff *skb, struct xt_action_param *par)
 {
-	unsigned int h;
-
-	h = xt_rateest_hash(est->name);
-	hlist_add_head(&est->list, &rateest_hash[h]);
-}
+	const struct xt_rateest_match_info *info = par->matchinfo;
+	struct gnet_stats_rate_est64 sample = {0};
+	u_int32_t bps1, bps2, pps1, pps2;
+	bool ret = true;
+
+	gen_estimator_read(&info->est1->rate_est, &sample);
+
+	if (info->flags & XT_RATEEST_MATCH_DELTA) {
+		bps1 = info->bps1 >= sample.bps ? info->bps1 - sample.bps : 0;
+		pps1 = info->pps1 >= sample.pps ? info->pps1 - sample.pps : 0;
+	} else {
+		bps1 = sample.bps;
+		pps1 = sample.pps;
+	}

-struct xt_rateest *xt_rateest_lookup(const char *name)
-{
-	struct xt_rateest *est;
-	unsigned int h;
-
-	h = xt_rateest_hash(name);
-	mutex_lock(&xt_rateest_mutex);
-	hlist_for_each_entry(est, &rateest_hash[h], list) {
-		if (strcmp(est->name, name) == 0) {
-			est->refcnt++;
-			mutex_unlock(&xt_rateest_mutex);
-			return est;
+	if (info->flags & XT_RATEEST_MATCH_ABS) {
+		bps2 = info->bps2;
+		pps2 = info->pps2;
+	} else {
+		gen_estimator_read(&info->est2->rate_est, &sample);
+
+		if (info->flags & XT_RATEEST_MATCH_DELTA) {
+			bps2 = info->bps2 >= sample.bps ? info->bps2 - sample.bps : 0;
+			pps2 = info->pps2 >= sample.pps ? info->pps2 - sample.pps : 0;
+		} else {
+			bps2 = sample.bps;
+			pps2 = sample.pps;
 		}
 	}
-	mutex_unlock(&xt_rateest_mutex);
-	return NULL;
-}
-EXPORT_SYMBOL_GPL(xt_rateest_lookup);

-void xt_rateest_put(struct xt_rateest *est)
-{
-	mutex_lock(&xt_rateest_mutex);
-	if (--est->refcnt == 0) {
-		hlist_del(&est->list);
-		gen_kill_estimator(&est->rate_est);
-		/*
-		 * gen_estimator est_timer() might access est->lock or bstats,
-		 * wait a RCU grace period before freeing 'est'
-		 */
-		kfree_rcu(est, rcu);
+	switch (info->mode) {
+	case XT_RATEEST_MATCH_LT:
+		if (info->flags & XT_RATEEST_MATCH_BPS)
+			ret &= bps1 < bps2;
+		if (info->flags & XT_RATEEST_MATCH_PPS)
+			ret &= pps1 < pps2;
+		break;
+	case XT_RATEEST_MATCH_GT:
+		if (info->flags & XT_RATEEST_MATCH_BPS)
+			ret &= bps1 > bps2;
+		if (info->flags & XT_RATEEST_MATCH_PPS)
+			ret &= pps1 > pps2;
+		break;
+	case XT_RATEEST_MATCH_EQ:
+		if (info->flags & XT_RATEEST_MATCH_BPS)
+			ret &= bps1 == bps2;
+		if (info->flags & XT_RATEEST_MATCH_PPS)
+			ret &= pps1 == pps2;
+		break;
 	}
-	mutex_unlock(&xt_rateest_mutex);
+
+	ret ^= info->flags & XT_RATEEST_MATCH_INVERT ? true : false;
+	return ret;
 }
-EXPORT_SYMBOL_GPL(xt_rateest_put);

-static unsigned int
-xt_rateest_tg(struct sk_buff *skb, const struct xt_action_param *par)
+static int xt_rateest_mt_checkentry(const struct xt_mtchk_param *par)
 {
-	const struct xt_rateest_target_info *info = par->targinfo;
-	struct gnet_stats_basic_packed *stats = &info->est->bstats;
+	struct xt_rateest_match_info *info = par->matchinfo;
+	struct xt_rateest *est1, *est2;
+	int ret = -EINVAL;

-	spin_lock_bh(&info->est->lock);
-	stats->bytes += skb->len;
-	stats->packets++;
-	spin_unlock_bh(&info->est->lock);
+	if (hweight32(info->flags & (XT_RATEEST_MATCH_ABS |
+				     XT_RATEEST_MATCH_REL)) != 1)
+		goto err1;

-	return XT_CONTINUE;
-}
+	if (!(info->flags & (XT_RATEEST_MATCH_BPS | XT_RATEEST_MATCH_PPS)))
+		goto err1;

-static int xt_rateest_tg_checkentry(const struct xt_tgchk_param *par)
-{
-	struct xt_rateest_target_info *info = par->targinfo;
-	struct xt_rateest *est;
-	struct {
-		struct nlattr		opt;
-		struct gnet_estimator	est;
-	} cfg;
-	int ret;
-
-	net_get_random_once(&jhash_rnd, sizeof(jhash_rnd));
-
-	est = xt_rateest_lookup(info->name);
-	if (est) {
-		/*
-		 * If estimator parameters are specified, they must match the
-		 * existing estimator.
-		 */
-		if ((!info->interval && !info->ewma_log) ||
-		    (info->interval != est->params.interval ||
-		     info->ewma_log != est->params.ewma_log)) {
-			xt_rateest_put(est);
-			return -EINVAL;
-		}
-		info->est = est;
-		return 0;
+	switch (info->mode) {
+	case XT_RATEEST_MATCH_EQ:
+	case XT_RATEEST_MATCH_LT:
+	case XT_RATEEST_MATCH_GT:
+		break;
+	default:
+		goto err1;
 	}

-	ret = -ENOMEM;
-	est = kzalloc(sizeof(*est), GFP_KERNEL);
-	if (!est)
+	ret  = -ENOENT;
+	est1 = xt_rateest_lookup(info->name1);
+	if (!est1)
 		goto err1;

-	strlcpy(est->name, info->name, sizeof(est->name));
-	spin_lock_init(&est->lock);
-	est->refcnt		= 1;
-	est->params.interval	= info->interval;
-	est->params.ewma_log	= info->ewma_log;
-
-	cfg.opt.nla_len		= nla_attr_size(sizeof(cfg.est));
-	cfg.opt.nla_type	= TCA_STATS_RATE_EST;
-	cfg.est.interval	= info->interval;
-	cfg.est.ewma_log	= info->ewma_log;
-
-	ret = gen_new_estimator(&est->bstats, NULL, &est->rate_est,
-				&est->lock, NULL, &cfg.opt);
-	if (ret < 0)
-		goto err2;
+	est2 = NULL;
+	if (info->flags & XT_RATEEST_MATCH_REL) {
+		est2 = xt_rateest_lookup(info->name2);
+		if (!est2)
+			goto err2;
+	}

-	info->est = est;
-	xt_rateest_hash_insert(est);
+	info->est1 = est1;
+	info->est2 = est2;
 	return 0;

 err2:
-	kfree(est);
+	xt_rateest_put(est1);
 err1:
 	return ret;
 }

-static void xt_rateest_tg_destroy(const struct xt_tgdtor_param *par)
+static void xt_rateest_mt_destroy(const struct xt_mtdtor_param *par)
 {
-	struct xt_rateest_target_info *info = par->targinfo;
+	struct xt_rateest_match_info *info = par->matchinfo;

-	xt_rateest_put(info->est);
+	xt_rateest_put(info->est1);
+	if (info->est2)
+		xt_rateest_put(info->est2);
 }

-static struct xt_target xt_rateest_tg_reg __read_mostly = {
-	.name       = "RATEEST",
+static struct xt_match xt_rateest_mt_reg __read_mostly = {
+	.name       = "rateest",
 	.revision   = 0,
 	.family     = NFPROTO_UNSPEC,
-	.target     = xt_rateest_tg,
-	.checkentry = xt_rateest_tg_checkentry,
-	.destroy    = xt_rateest_tg_destroy,
-	.targetsize = sizeof(struct xt_rateest_target_info),
-	.usersize   = offsetof(struct xt_rateest_target_info, est),
+	.match      = xt_rateest_mt,
+	.checkentry = xt_rateest_mt_checkentry,
+	.destroy    = xt_rateest_mt_destroy,
+	.matchsize  = sizeof(struct xt_rateest_match_info),
+	.usersize   = offsetof(struct xt_rateest_match_info, est1),
 	.me         = THIS_MODULE,
 };

-static int __init xt_rateest_tg_init(void)
+static int __init xt_rateest_mt_init(void)
 {
-	unsigned int i;
-
-	for (i = 0; i < ARRAY_SIZE(rateest_hash); i++)
-		INIT_HLIST_HEAD(&rateest_hash[i]);
-
-	return xt_register_target(&xt_rateest_tg_reg);
+	return xt_register_match(&xt_rateest_mt_reg);
 }

-static void __exit xt_rateest_tg_fini(void)
+static void __exit xt_rateest_mt_fini(void)
 {
-	xt_unregister_target(&xt_rateest_tg_reg);
+	xt_unregister_match(&xt_rateest_mt_reg);
 }

-
 MODULE_AUTHOR("Patrick McHardy <kaber@trash.net>");
 MODULE_LICENSE("GPL");
-MODULE_DESCRIPTION("Xtables: packet rate estimator");
-MODULE_ALIAS("ipt_RATEEST");
-MODULE_ALIAS("ip6t_RATEEST");
-module_init(xt_rateest_tg_init);
-module_exit(xt_rateest_tg_fini);
+MODULE_DESCRIPTION("xtables rate estimator match");
+MODULE_ALIAS("ipt_rateest");
+MODULE_ALIAS("ip6t_rateest");
+module_init(xt_rateest_mt_init);
+module_exit(xt_rateest_mt_fini);
diff --git a/net/netfilter/xt_TCPMSS.c b/net/netfilter/xt_TCPMSS.c
index 27241a7..c53d4d1 100644
--- a/net/netfilter/xt_TCPMSS.c
+++ b/net/netfilter/xt_TCPMSS.c
@@ -1,351 +1,110 @@ 
-/*
- * This is a module which is used for setting the MSS option in TCP packets.
- *
- * Copyright (C) 2000 Marc Boucher <marc@mbsi.ca>
- * Copyright (C) 2007 Patrick McHardy <kaber@trash.net>
+/* Kernel module to match TCP MSS values. */
+
+/* Copyright (C) 2000 Marc Boucher <marc@mbsi.ca>
+ * Portions (C) 2005 by Harald Welte <laforge@netfilter.org>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
  */
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/module.h>
 #include <linux/skbuff.h>
-#include <linux/ip.h>
-#include <linux/gfp.h>
-#include <linux/ipv6.h>
-#include <linux/tcp.h>
-#include <net/dst.h>
-#include <net/flow.h>
-#include <net/ipv6.h>
-#include <net/route.h>
 #include <net/tcp.h>

+#include <linux/netfilter/xt_tcpmss.h>
+#include <linux/netfilter/x_tables.h>
+
 #include <linux/netfilter_ipv4/ip_tables.h>
 #include <linux/netfilter_ipv6/ip6_tables.h>
-#include <linux/netfilter/x_tables.h>
-#include <linux/netfilter/xt_tcpudp.h>
-#include <linux/netfilter/xt_TCPMSS.h>

 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Marc Boucher <marc@mbsi.ca>");
-MODULE_DESCRIPTION("Xtables: TCP Maximum Segment Size (MSS) adjustment");
-MODULE_ALIAS("ipt_TCPMSS");
-MODULE_ALIAS("ip6t_TCPMSS");
-
-static inline unsigned int
-optlen(const u_int8_t *opt, unsigned int offset)
-{
-	/* Beware zero-length options: make finite progress */
-	if (opt[offset] <= TCPOPT_NOP || opt[offset+1] == 0)
-		return 1;
-	else
-		return opt[offset+1];
-}
-
-static u_int32_t tcpmss_reverse_mtu(struct net *net,
-				    const struct sk_buff *skb,
-				    unsigned int family)
-{
-	struct flowi fl;
-	const struct nf_afinfo *ai;
-	struct rtable *rt = NULL;
-	u_int32_t mtu     = ~0U;
-
-	if (family == PF_INET) {
-		struct flowi4 *fl4 = &fl.u.ip4;
-		memset(fl4, 0, sizeof(*fl4));
-		fl4->daddr = ip_hdr(skb)->saddr;
-	} else {
-		struct flowi6 *fl6 = &fl.u.ip6;
-
-		memset(fl6, 0, sizeof(*fl6));
-		fl6->daddr = ipv6_hdr(skb)->saddr;
-	}
-	rcu_read_lock();
-	ai = nf_get_afinfo(family);
-	if (ai != NULL)
-		ai->route(net, (struct dst_entry **)&rt, &fl, false);
-	rcu_read_unlock();
-
-	if (rt != NULL) {
-		mtu = dst_mtu(&rt->dst);
-		dst_release(&rt->dst);
-	}
-	return mtu;
-}
+MODULE_DESCRIPTION("Xtables: TCP MSS match");
+MODULE_ALIAS("ipt_tcpmss");
+MODULE_ALIAS("ip6t_tcpmss");

-static int
-tcpmss_mangle_packet(struct sk_buff *skb,
-		     const struct xt_action_param *par,
-		     unsigned int family,
-		     unsigned int tcphoff,
-		     unsigned int minlen)
+static bool
+tcpmss_mt(const struct sk_buff *skb, struct xt_action_param *par)
 {
-	const struct xt_tcpmss_info *info = par->targinfo;
-	struct tcphdr *tcph;
-	int len, tcp_hdrlen;
-	unsigned int i;
-	__be16 oldval;
-	u16 newmss;
-	u8 *opt;
-
-	/* This is a fragment, no TCP header is available */
-	if (par->fragoff != 0)
-		return 0;
-
-	if (!skb_make_writable(skb, skb->len))
-		return -1;
-
-	len = skb->len - tcphoff;
-	if (len < (int)sizeof(struct tcphdr))
-		return -1;
-
-	tcph = (struct tcphdr *)(skb_network_header(skb) + tcphoff);
-	tcp_hdrlen = tcph->doff * 4;
-
-	if (len < tcp_hdrlen)
-		return -1;
-
-	if (info->mss == XT_TCPMSS_CLAMP_PMTU) {
-		struct net *net = xt_net(par);
-		unsigned int in_mtu = tcpmss_reverse_mtu(net, skb, family);
-		unsigned int min_mtu = min(dst_mtu(skb_dst(skb)), in_mtu);
-
-		if (min_mtu <= minlen) {
-			net_err_ratelimited("unknown or invalid path-MTU (%u)\n",
-					    min_mtu);
-			return -1;
-		}
-		newmss = min_mtu - minlen;
-	} else
-		newmss = info->mss;
-
-	opt = (u_int8_t *)tcph;
-	for (i = sizeof(struct tcphdr); i <= tcp_hdrlen - TCPOLEN_MSS; i += optlen(opt, i)) {
-		if (opt[i] == TCPOPT_MSS && opt[i+1] == TCPOLEN_MSS) {
-			u_int16_t oldmss;
-
-			oldmss = (opt[i+2] << 8) | opt[i+3];
-
-			/* Never increase MSS, even when setting it, as
-			 * doing so results in problems for hosts that rely
-			 * on MSS being set correctly.
-			 */
-			if (oldmss <= newmss)
-				return 0;
-
-			opt[i+2] = (newmss & 0xff00) >> 8;
-			opt[i+3] = newmss & 0x00ff;
-
-			inet_proto_csum_replace2(&tcph->check, skb,
-						 htons(oldmss), htons(newmss),
-						 false);
-			return 0;
+	const struct xt_tcpmss_match_info *info = par->matchinfo;
+	const struct tcphdr *th;
+	struct tcphdr _tcph;
+	/* tcp.doff is only 4 bits, ie. max 15 * 4 bytes */
+	const u_int8_t *op;
+	u8 _opt[15 * 4 - sizeof(_tcph)];
+	unsigned int i, optlen;
+
+	/* If we don't have the whole header, drop packet. */
+	th = skb_header_pointer(skb, par->thoff, sizeof(_tcph), &_tcph);
+	if (th == NULL)
+		goto dropit;
+
+	/* Malformed. */
+	if (th->doff*4 < sizeof(*th))
+		goto dropit;
+
+	optlen = th->doff*4 - sizeof(*th);
+	if (!optlen)
+		goto out;
+
+	/* Truncated options. */
+	op = skb_header_pointer(skb, par->thoff + sizeof(*th), optlen, _opt);
+	if (op == NULL)
+		goto dropit;
+
+	for (i = 0; i < optlen; ) {
+		if (op[i] == TCPOPT_MSS
+		    && (optlen - i) >= TCPOLEN_MSS
+		    && op[i+1] == TCPOLEN_MSS) {
+			u_int16_t mssval;
+
+			mssval = (op[i+2] << 8) | op[i+3];
+
+			return (mssval >= info->mss_min &&
+				mssval <= info->mss_max) ^ info->invert;
 		}
+		if (op[i] < 2)
+			i++;
+		else
+			i += op[i+1] ? : 1;
 	}
+out:
+	return info->invert;

-	/* There is data after the header so the option can't be added
-	 * without moving it, and doing so may make the SYN packet
-	 * itself too large. Accept the packet unmodified instead.
-	 */
-	if (len > tcp_hdrlen)
-		return 0;
-
-	/*
-	 * MSS Option not found ?! add it..
-	 */
-	if (skb_tailroom(skb) < TCPOLEN_MSS) {
-		if (pskb_expand_head(skb, 0,
-				     TCPOLEN_MSS - skb_tailroom(skb),
-				     GFP_ATOMIC))
-			return -1;
-		tcph = (struct tcphdr *)(skb_network_header(skb) + tcphoff);
-	}
-
-	skb_put(skb, TCPOLEN_MSS);
-
-	/*
-	 * IPv4: RFC 1122 states "If an MSS option is not received at
-	 * connection setup, TCP MUST assume a default send MSS of 536".
-	 * IPv6: RFC 2460 states IPv6 has a minimum MTU of 1280 and a minimum
-	 * length IPv6 header of 60, ergo the default MSS value is 1220
-	 * Since no MSS was provided, we must use the default values
-	 */
-	if (xt_family(par) == NFPROTO_IPV4)
-		newmss = min(newmss, (u16)536);
-	else
-		newmss = min(newmss, (u16)1220);
-
-	opt = (u_int8_t *)tcph + sizeof(struct tcphdr);
-	memmove(opt + TCPOLEN_MSS, opt, len - sizeof(struct tcphdr));
-
-	inet_proto_csum_replace2(&tcph->check, skb,
-				 htons(len), htons(len + TCPOLEN_MSS), true);
-	opt[0] = TCPOPT_MSS;
-	opt[1] = TCPOLEN_MSS;
-	opt[2] = (newmss & 0xff00) >> 8;
-	opt[3] = newmss & 0x00ff;
-
-	inet_proto_csum_replace4(&tcph->check, skb, 0, *((__be32 *)opt), false);
-
-	oldval = ((__be16 *)tcph)[6];
-	tcph->doff += TCPOLEN_MSS/4;
-	inet_proto_csum_replace2(&tcph->check, skb,
-				 oldval, ((__be16 *)tcph)[6], false);
-	return TCPOLEN_MSS;
-}
-
-static unsigned int
-tcpmss_tg4(struct sk_buff *skb, const struct xt_action_param *par)
-{
-	struct iphdr *iph = ip_hdr(skb);
-	__be16 newlen;
-	int ret;
-
-	ret = tcpmss_mangle_packet(skb, par,
-				   PF_INET,
-				   iph->ihl * 4,
-				   sizeof(*iph) + sizeof(struct tcphdr));
-	if (ret < 0)
-		return NF_DROP;
-	if (ret > 0) {
-		iph = ip_hdr(skb);
-		newlen = htons(ntohs(iph->tot_len) + ret);
-		csum_replace2(&iph->check, iph->tot_len, newlen);
-		iph->tot_len = newlen;
-	}
-	return XT_CONTINUE;
-}
-
-#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES)
-static unsigned int
-tcpmss_tg6(struct sk_buff *skb, const struct xt_action_param *par)
-{
-	struct ipv6hdr *ipv6h = ipv6_hdr(skb);
-	u8 nexthdr;
-	__be16 frag_off, oldlen, newlen;
-	int tcphoff;
-	int ret;
-
-	nexthdr = ipv6h->nexthdr;
-	tcphoff = ipv6_skip_exthdr(skb, sizeof(*ipv6h), &nexthdr, &frag_off);
-	if (tcphoff < 0)
-		return NF_DROP;
-	ret = tcpmss_mangle_packet(skb, par,
-				   PF_INET6,
-				   tcphoff,
-				   sizeof(*ipv6h) + sizeof(struct tcphdr));
-	if (ret < 0)
-		return NF_DROP;
-	if (ret > 0) {
-		ipv6h = ipv6_hdr(skb);
-		oldlen = ipv6h->payload_len;
-		newlen = htons(ntohs(oldlen) + ret);
-		if (skb->ip_summed == CHECKSUM_COMPLETE)
-			skb->csum = csum_add(csum_sub(skb->csum, oldlen),
-					     newlen);
-		ipv6h->payload_len = newlen;
-	}
-	return XT_CONTINUE;
-}
-#endif
-
-/* Must specify -p tcp --syn */
-static inline bool find_syn_match(const struct xt_entry_match *m)
-{
-	const struct xt_tcp *tcpinfo = (const struct xt_tcp *)m->data;
-
-	if (strcmp(m->u.kernel.match->name, "tcp") == 0 &&
-	    tcpinfo->flg_cmp & TCPHDR_SYN &&
-	    !(tcpinfo->invflags & XT_TCP_INV_FLAGS))
-		return true;
-
+dropit:
+	par->hotdrop = true;
 	return false;
 }

-static int tcpmss_tg4_check(const struct xt_tgchk_param *par)
-{
-	const struct xt_tcpmss_info *info = par->targinfo;
-	const struct ipt_entry *e = par->entryinfo;
-	const struct xt_entry_match *ematch;
-
-	if (info->mss == XT_TCPMSS_CLAMP_PMTU &&
-	    (par->hook_mask & ~((1 << NF_INET_FORWARD) |
-			   (1 << NF_INET_LOCAL_OUT) |
-			   (1 << NF_INET_POST_ROUTING))) != 0) {
-		pr_info("path-MTU clamping only supported in "
-			"FORWARD, OUTPUT and POSTROUTING hooks\n");
-		return -EINVAL;
-	}
-	if (par->nft_compat)
-		return 0;
-
-	xt_ematch_foreach(ematch, e)
-		if (find_syn_match(ematch))
-			return 0;
-	pr_info("Only works on TCP SYN packets\n");
-	return -EINVAL;
-}
-
-#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES)
-static int tcpmss_tg6_check(const struct xt_tgchk_param *par)
-{
-	const struct xt_tcpmss_info *info = par->targinfo;
-	const struct ip6t_entry *e = par->entryinfo;
-	const struct xt_entry_match *ematch;
-
-	if (info->mss == XT_TCPMSS_CLAMP_PMTU &&
-	    (par->hook_mask & ~((1 << NF_INET_FORWARD) |
-			   (1 << NF_INET_LOCAL_OUT) |
-			   (1 << NF_INET_POST_ROUTING))) != 0) {
-		pr_info("path-MTU clamping only supported in "
-			"FORWARD, OUTPUT and POSTROUTING hooks\n");
-		return -EINVAL;
-	}
-	if (par->nft_compat)
-		return 0;
-
-	xt_ematch_foreach(ematch, e)
-		if (find_syn_match(ematch))
-			return 0;
-	pr_info("Only works on TCP SYN packets\n");
-	return -EINVAL;
-}
-#endif
-
-static struct xt_target tcpmss_tg_reg[] __read_mostly = {
+static struct xt_match tcpmss_mt_reg[] __read_mostly = {
 	{
+		.name		= "tcpmss",
 		.family		= NFPROTO_IPV4,
-		.name		= "TCPMSS",
-		.checkentry	= tcpmss_tg4_check,
-		.target		= tcpmss_tg4,
-		.targetsize	= sizeof(struct xt_tcpmss_info),
+		.match		= tcpmss_mt,
+		.matchsize	= sizeof(struct xt_tcpmss_match_info),
 		.proto		= IPPROTO_TCP,
 		.me		= THIS_MODULE,
 	},
-#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES)
 	{
+		.name		= "tcpmss",
 		.family		= NFPROTO_IPV6,
-		.name		= "TCPMSS",
-		.checkentry	= tcpmss_tg6_check,
-		.target		= tcpmss_tg6,
-		.targetsize	= sizeof(struct xt_tcpmss_info),
+		.match		= tcpmss_mt,
+		.matchsize	= sizeof(struct xt_tcpmss_match_info),
 		.proto		= IPPROTO_TCP,
 		.me		= THIS_MODULE,
 	},
-#endif
 };

-static int __init tcpmss_tg_init(void)
+static int __init tcpmss_mt_init(void)
 {
-	return xt_register_targets(tcpmss_tg_reg, ARRAY_SIZE(tcpmss_tg_reg));
+	return xt_register_matches(tcpmss_mt_reg, ARRAY_SIZE(tcpmss_mt_reg));
 }

-static void __exit tcpmss_tg_exit(void)
+static void __exit tcpmss_mt_exit(void)
 {
-	xt_unregister_targets(tcpmss_tg_reg, ARRAY_SIZE(tcpmss_tg_reg));
+	xt_unregister_matches(tcpmss_mt_reg, ARRAY_SIZE(tcpmss_mt_reg));
 }

-module_init(tcpmss_tg_init);
-module_exit(tcpmss_tg_exit);
+module_init(tcpmss_mt_init);
+module_exit(tcpmss_mt_exit);
diff --git a/net/netfilter/xt_dscp.c b/net/netfilter/xt_dscp.c
index 236ac80..3f83d38 100644
--- a/net/netfilter/xt_dscp.c
+++ b/net/netfilter/xt_dscp.c
@@ -1,11 +1,14 @@ 
-/* IP tables module for matching the value of the IPv4/IPv6 DSCP field
+/* x_tables module for setting the IPv4/IPv6 DSCP field, Version 1.8
  *
  * (C) 2002 by Harald Welte <laforge@netfilter.org>
+ * based on ipt_FTOS.c (C) 2000 by Matthew G. Marsh <mgm@paktronix.com>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
- */
+ *
+ * See RFC2474 for a description of the DSCP field within the IP Header.
+*/
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include <linux/module.h>
 #include <linux/skbuff.h>
@@ -14,102 +17,150 @@ 
 #include <net/dsfield.h>

 #include <linux/netfilter/x_tables.h>
-#include <linux/netfilter/xt_dscp.h>
+#include <linux/netfilter/xt_DSCP.h>

 MODULE_AUTHOR("Harald Welte <laforge@netfilter.org>");
-MODULE_DESCRIPTION("Xtables: DSCP/TOS field match");
+MODULE_DESCRIPTION("Xtables: DSCP/TOS field modification");
 MODULE_LICENSE("GPL");
-MODULE_ALIAS("ipt_dscp");
-MODULE_ALIAS("ip6t_dscp");
-MODULE_ALIAS("ipt_tos");
-MODULE_ALIAS("ip6t_tos");
+MODULE_ALIAS("ipt_DSCP");
+MODULE_ALIAS("ip6t_DSCP");
+MODULE_ALIAS("ipt_TOS");
+MODULE_ALIAS("ip6t_TOS");

-static bool
-dscp_mt(const struct sk_buff *skb, struct xt_action_param *par)
+static unsigned int
+dscp_tg(struct sk_buff *skb, const struct xt_action_param *par)
 {
-	const struct xt_dscp_info *info = par->matchinfo;
+	const struct xt_DSCP_info *dinfo = par->targinfo;
 	u_int8_t dscp = ipv4_get_dsfield(ip_hdr(skb)) >> XT_DSCP_SHIFT;

-	return (dscp == info->dscp) ^ !!info->invert;
+	if (dscp != dinfo->dscp) {
+		if (!skb_make_writable(skb, sizeof(struct iphdr)))
+			return NF_DROP;
+
+		ipv4_change_dsfield(ip_hdr(skb),
+				    (__force __u8)(~XT_DSCP_MASK),
+				    dinfo->dscp << XT_DSCP_SHIFT);
+
+	}
+	return XT_CONTINUE;
 }

-static bool
-dscp_mt6(const struct sk_buff *skb, struct xt_action_param *par)
+static unsigned int
+dscp_tg6(struct sk_buff *skb, const struct xt_action_param *par)
 {
-	const struct xt_dscp_info *info = par->matchinfo;
+	const struct xt_DSCP_info *dinfo = par->targinfo;
 	u_int8_t dscp = ipv6_get_dsfield(ipv6_hdr(skb)) >> XT_DSCP_SHIFT;

-	return (dscp == info->dscp) ^ !!info->invert;
+	if (dscp != dinfo->dscp) {
+		if (!skb_make_writable(skb, sizeof(struct ipv6hdr)))
+			return NF_DROP;
+
+		ipv6_change_dsfield(ipv6_hdr(skb),
+				    (__force __u8)(~XT_DSCP_MASK),
+				    dinfo->dscp << XT_DSCP_SHIFT);
+	}
+	return XT_CONTINUE;
 }

-static int dscp_mt_check(const struct xt_mtchk_param *par)
+static int dscp_tg_check(const struct xt_tgchk_param *par)
 {
-	const struct xt_dscp_info *info = par->matchinfo;
+	const struct xt_DSCP_info *info = par->targinfo;

 	if (info->dscp > XT_DSCP_MAX) {
 		pr_info("dscp %x out of range\n", info->dscp);
 		return -EDOM;
 	}
-
 	return 0;
 }

-static bool tos_mt(const struct sk_buff *skb, struct xt_action_param *par)
+static unsigned int
+tos_tg(struct sk_buff *skb, const struct xt_action_param *par)
+{
+	const struct xt_tos_target_info *info = par->targinfo;
+	struct iphdr *iph = ip_hdr(skb);
+	u_int8_t orig, nv;
+
+	orig = ipv4_get_dsfield(iph);
+	nv   = (orig & ~info->tos_mask) ^ info->tos_value;
+
+	if (orig != nv) {
+		if (!skb_make_writable(skb, sizeof(struct iphdr)))
+			return NF_DROP;
+		iph = ip_hdr(skb);
+		ipv4_change_dsfield(iph, 0, nv);
+	}
+
+	return XT_CONTINUE;
+}
+
+static unsigned int
+tos_tg6(struct sk_buff *skb, const struct xt_action_param *par)
 {
-	const struct xt_tos_match_info *info = par->matchinfo;
-
-	if (xt_family(par) == NFPROTO_IPV4)
-		return ((ip_hdr(skb)->tos & info->tos_mask) ==
-		       info->tos_value) ^ !!info->invert;
-	else
-		return ((ipv6_get_dsfield(ipv6_hdr(skb)) & info->tos_mask) ==
-		       info->tos_value) ^ !!info->invert;
+	const struct xt_tos_target_info *info = par->targinfo;
+	struct ipv6hdr *iph = ipv6_hdr(skb);
+	u_int8_t orig, nv;
+
+	orig = ipv6_get_dsfield(iph);
+	nv   = (orig & ~info->tos_mask) ^ info->tos_value;
+
+	if (orig != nv) {
+		if (!skb_make_writable(skb, sizeof(struct iphdr)))
+			return NF_DROP;
+		iph = ipv6_hdr(skb);
+		ipv6_change_dsfield(iph, 0, nv);
+	}
+
+	return XT_CONTINUE;
 }

-static struct xt_match dscp_mt_reg[] __read_mostly = {
+static struct xt_target dscp_tg_reg[] __read_mostly = {
 	{
-		.name		= "dscp",
+		.name		= "DSCP",
 		.family		= NFPROTO_IPV4,
-		.checkentry	= dscp_mt_check,
-		.match		= dscp_mt,
-		.matchsize	= sizeof(struct xt_dscp_info),
+		.checkentry	= dscp_tg_check,
+		.target		= dscp_tg,
+		.targetsize	= sizeof(struct xt_DSCP_info),
+		.table		= "mangle",
 		.me		= THIS_MODULE,
 	},
 	{
-		.name		= "dscp",
+		.name		= "DSCP",
 		.family		= NFPROTO_IPV6,
-		.checkentry	= dscp_mt_check,
-		.match		= dscp_mt6,
-		.matchsize	= sizeof(struct xt_dscp_info),
+		.checkentry	= dscp_tg_check,
+		.target		= dscp_tg6,
+		.targetsize	= sizeof(struct xt_DSCP_info),
+		.table		= "mangle",
 		.me		= THIS_MODULE,
 	},
 	{
-		.name		= "tos",
+		.name		= "TOS",
 		.revision	= 1,
 		.family		= NFPROTO_IPV4,
-		.match		= tos_mt,
-		.matchsize	= sizeof(struct xt_tos_match_info),
+		.table		= "mangle",
+		.target		= tos_tg,
+		.targetsize	= sizeof(struct xt_tos_target_info),
 		.me		= THIS_MODULE,
 	},
 	{
-		.name		= "tos",
+		.name		= "TOS",
 		.revision	= 1,
 		.family		= NFPROTO_IPV6,
-		.match		= tos_mt,
-		.matchsize	= sizeof(struct xt_tos_match_info),
+		.table		= "mangle",
+		.target		= tos_tg6,
+		.targetsize	= sizeof(struct xt_tos_target_info),
 		.me		= THIS_MODULE,
 	},
 };

-static int __init dscp_mt_init(void)
+static int __init dscp_tg_init(void)
 {
-	return xt_register_matches(dscp_mt_reg, ARRAY_SIZE(dscp_mt_reg));
+	return xt_register_targets(dscp_tg_reg, ARRAY_SIZE(dscp_tg_reg));
 }

-static void __exit dscp_mt_exit(void)
+static void __exit dscp_tg_exit(void)
 {
-	xt_unregister_matches(dscp_mt_reg, ARRAY_SIZE(dscp_mt_reg));
+	xt_unregister_targets(dscp_tg_reg, ARRAY_SIZE(dscp_tg_reg));
 }

-module_init(dscp_mt_init);
-module_exit(dscp_mt_exit);
+module_init(dscp_tg_init);
+module_exit(dscp_tg_exit);
diff --git a/net/netfilter/xt_hl.c b/net/netfilter/xt_hl.c
index 0039511..1535e87 100644
--- a/net/netfilter/xt_hl.c
+++ b/net/netfilter/xt_hl.c
@@ -1,96 +1,169 @@ 
 /*
- * IP tables module for matching the value of the TTL
- * (C) 2000,2001 by Harald Welte <laforge@netfilter.org>
+ * TTL modification target for IP tables
+ * (C) 2000,2005 by Harald Welte <laforge@netfilter.org>
  *
- * Hop Limit matching module
- * (C) 2001-2002 Maciej Soltysiak <solt@dns.toxicfilms.tv>
+ * Hop Limit modification target for ip6tables
+ * Maciej Soltysiak <solt@dns.toxicfilms.tv>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
  */
-
-#include <linux/ip.h>
-#include <linux/ipv6.h>
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include <linux/module.h>
 #include <linux/skbuff.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <net/checksum.h>

 #include <linux/netfilter/x_tables.h>
-#include <linux/netfilter_ipv4/ipt_ttl.h>
-#include <linux/netfilter_ipv6/ip6t_hl.h>
+#include <linux/netfilter_ipv4/ipt_TTL.h>
+#include <linux/netfilter_ipv6/ip6t_HL.h>

+MODULE_AUTHOR("Harald Welte <laforge@netfilter.org>");
 MODULE_AUTHOR("Maciej Soltysiak <solt@dns.toxicfilms.tv>");
-MODULE_DESCRIPTION("Xtables: Hoplimit/TTL field match");
+MODULE_DESCRIPTION("Xtables: Hoplimit/TTL Limit field modification target");
 MODULE_LICENSE("GPL");
-MODULE_ALIAS("ipt_ttl");
-MODULE_ALIAS("ip6t_hl");

-static bool ttl_mt(const struct sk_buff *skb, struct xt_action_param *par)
+static unsigned int
+ttl_tg(struct sk_buff *skb, const struct xt_action_param *par)
 {
-	const struct ipt_ttl_info *info = par->matchinfo;
-	const u8 ttl = ip_hdr(skb)->ttl;
+	struct iphdr *iph;
+	const struct ipt_TTL_info *info = par->targinfo;
+	int new_ttl;
+
+	if (!skb_make_writable(skb, skb->len))
+		return NF_DROP;
+
+	iph = ip_hdr(skb);

 	switch (info->mode) {
-	case IPT_TTL_EQ:
-		return ttl == info->ttl;
-	case IPT_TTL_NE:
-		return ttl != info->ttl;
-	case IPT_TTL_LT:
-		return ttl < info->ttl;
-	case IPT_TTL_GT:
-		return ttl > info->ttl;
+	case IPT_TTL_SET:
+		new_ttl = info->ttl;
+		break;
+	case IPT_TTL_INC:
+		new_ttl = iph->ttl + info->ttl;
+		if (new_ttl > 255)
+			new_ttl = 255;
+		break;
+	case IPT_TTL_DEC:
+		new_ttl = iph->ttl - info->ttl;
+		if (new_ttl < 0)
+			new_ttl = 0;
+		break;
+	default:
+		new_ttl = iph->ttl;
+		break;
+	}
+
+	if (new_ttl != iph->ttl) {
+		csum_replace2(&iph->check, htons(iph->ttl << 8),
+					   htons(new_ttl << 8));
+		iph->ttl = new_ttl;
 	}

-	return false;
+	return XT_CONTINUE;
 }

-static bool hl_mt6(const struct sk_buff *skb, struct xt_action_param *par)
+static unsigned int
+hl_tg6(struct sk_buff *skb, const struct xt_action_param *par)
 {
-	const struct ip6t_hl_info *info = par->matchinfo;
-	const struct ipv6hdr *ip6h = ipv6_hdr(skb);
+	struct ipv6hdr *ip6h;
+	const struct ip6t_HL_info *info = par->targinfo;
+	int new_hl;
+
+	if (!skb_make_writable(skb, skb->len))
+		return NF_DROP;
+
+	ip6h = ipv6_hdr(skb);

 	switch (info->mode) {
-	case IP6T_HL_EQ:
-		return ip6h->hop_limit == info->hop_limit;
-	case IP6T_HL_NE:
-		return ip6h->hop_limit != info->hop_limit;
-	case IP6T_HL_LT:
-		return ip6h->hop_limit < info->hop_limit;
-	case IP6T_HL_GT:
-		return ip6h->hop_limit > info->hop_limit;
+	case IP6T_HL_SET:
+		new_hl = info->hop_limit;
+		break;
+	case IP6T_HL_INC:
+		new_hl = ip6h->hop_limit + info->hop_limit;
+		if (new_hl > 255)
+			new_hl = 255;
+		break;
+	case IP6T_HL_DEC:
+		new_hl = ip6h->hop_limit - info->hop_limit;
+		if (new_hl < 0)
+			new_hl = 0;
+		break;
+	default:
+		new_hl = ip6h->hop_limit;
+		break;
 	}

-	return false;
+	ip6h->hop_limit = new_hl;
+
+	return XT_CONTINUE;
+}
+
+static int ttl_tg_check(const struct xt_tgchk_param *par)
+{
+	const struct ipt_TTL_info *info = par->targinfo;
+
+	if (info->mode > IPT_TTL_MAXMODE) {
+		pr_info("TTL: invalid or unknown mode %u\n", info->mode);
+		return -EINVAL;
+	}
+	if (info->mode != IPT_TTL_SET && info->ttl == 0)
+		return -EINVAL;
+	return 0;
+}
+
+static int hl_tg6_check(const struct xt_tgchk_param *par)
+{
+	const struct ip6t_HL_info *info = par->targinfo;
+
+	if (info->mode > IP6T_HL_MAXMODE) {
+		pr_info("invalid or unknown mode %u\n", info->mode);
+		return -EINVAL;
+	}
+	if (info->mode != IP6T_HL_SET && info->hop_limit == 0) {
+		pr_info("increment/decrement does not "
+			"make sense with value 0\n");
+		return -EINVAL;
+	}
+	return 0;
 }

-static struct xt_match hl_mt_reg[] __read_mostly = {
+static struct xt_target hl_tg_reg[] __read_mostly = {
 	{
-		.name       = "ttl",
+		.name       = "TTL",
 		.revision   = 0,
 		.family     = NFPROTO_IPV4,
-		.match      = ttl_mt,
-		.matchsize  = sizeof(struct ipt_ttl_info),
+		.target     = ttl_tg,
+		.targetsize = sizeof(struct ipt_TTL_info),
+		.table      = "mangle",
+		.checkentry = ttl_tg_check,
 		.me         = THIS_MODULE,
 	},
 	{
-		.name       = "hl",
+		.name       = "HL",
 		.revision   = 0,
 		.family     = NFPROTO_IPV6,
-		.match      = hl_mt6,
-		.matchsize  = sizeof(struct ip6t_hl_info),
+		.target     = hl_tg6,
+		.targetsize = sizeof(struct ip6t_HL_info),
+		.table      = "mangle",
+		.checkentry = hl_tg6_check,
 		.me         = THIS_MODULE,
 	},
 };

-static int __init hl_mt_init(void)
+static int __init hl_tg_init(void)
 {
-	return xt_register_matches(hl_mt_reg, ARRAY_SIZE(hl_mt_reg));
+	return xt_register_targets(hl_tg_reg, ARRAY_SIZE(hl_tg_reg));
 }

-static void __exit hl_mt_exit(void)
+static void __exit hl_tg_exit(void)
 {
-	xt_unregister_matches(hl_mt_reg, ARRAY_SIZE(hl_mt_reg));
+	xt_unregister_targets(hl_tg_reg, ARRAY_SIZE(hl_tg_reg));
 }

-module_init(hl_mt_init);
-module_exit(hl_mt_exit);
+module_init(hl_tg_init);
+module_exit(hl_tg_exit);
+MODULE_ALIAS("ipt_TTL");
+MODULE_ALIAS("ip6t_HL");