[v3,09/13] ext4: fast-commit commit path changes
diff mbox series

Message ID 20191001074101.256523-10-harshadshirwadkar@gmail.com
State Changes Requested
Headers show
Series
  • ext4: add fast commit support
Related show

Commit Message

harshad shirwadkar Oct. 1, 2019, 7:40 a.m. UTC
This patch implements the actual commit path for fast commit. Based on
inodes tracked and their respective changes remembered, this
patch adds code to create a fast commit block that stores extents
added as well as dentrys created for the inode. We use new JBD2
interfaces added in previous patches in this series. The fast commit
blocks that are created have extents that _should_ be present in the
file. It doesn't yet support removing of extents, making operations
such as truncate, delete fast commit incompatible.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/ext4_jbd2.c         | 309 ++++++++++++++++++++++++++++++++++++
 fs/ext4/ext4_jbd2.h         |  50 +++++-
 fs/ext4/extents.c           |   8 +-
 fs/ext4/inode.c             |  22 ++-
 fs/ext4/super.c             |  11 ++
 include/trace/events/ext4.h |  39 +++++
 6 files changed, 429 insertions(+), 10 deletions(-)

Comments

Theodore Y. Ts'o Oct. 16, 2019, 10:45 p.m. UTC | #1
On Tue, Oct 01, 2019 at 12:40:58AM -0700, Harshad Shirwadkar wrote:
> This patch implements the actual commit path for fast commit. Based on
> inodes tracked and their respective changes remembered, this
> patch adds code to create a fast commit block that stores extents
> added as well as dentrys created for the inode. We use new JBD2
> interfaces added in previous patches in this series. The fast commit
> blocks that are created have extents that _should_ be present in the
> file. It doesn't yet support removing of extents, making operations
> such as truncate, delete fast commit incompatible.

This affects some of the earlier patches, but I didn't realize this
until now.  Right now, what we're doing is when initiate an fast
commit, we are writing out all fast-commit eligible inodes (and
flushing out any associated data blocks needed to maintain
data=ordered guarantees).

We don't actually have to do this.  Strictly speaking, we only have to
write out the specific inode being fsync'ed, or the specific inode for
which ext4_nfs_commit_metdata() has been called.  For an fsync()
workload, especially one where for example, we might have hundreds of
modified inodes, all of which are fc-eligible --- for example, because
a kernel build is happening in the background, and a single file which
is being fsync'ed --- for example because the programmer has just
saved a source file in emacs ---- we only need to include that single
inode in the fast commit.  Including *all* of the inodes in the
i_fc_list in the fast commit, is wasted effort, especially since the
inodes in question will be committed within the next 5 seconds.

Now, in the case of ext4_nfs_commit_metadata(), we know that NFS is
*very* aggressive at calling commit_metadata, and so writing out all
of the FC-eligible commit is probably a good thing to do.  So we might
want to do different things depending on whether the FC is being
initiated via fsync() or fdatasync() versus commit_metadata().

The other reason why it's better to only do this for
ext4_nfs_commit_metadata() is because if we only write out the inode
which is being fsync'ed, we don't have worry about fairness concerned,
since the I/O will be charged to the process/cgroup who requested the
fsync.  If we write out *all* the fc-eligible inodes in the FC commit,
then they will get charged to the process doing the fsync(2).  Whereas
for an NFS server, we don't care about cgroups, since they can all be
charged to the NFS server.


> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> index 0bb8de2139a5..fd7740372438 100644
> --- a/fs/ext4/ext4_jbd2.c
> +++ b/fs/ext4/ext4_jbd2.c
> @@ -4,6 +4,7 @@
> +static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
> +{
> +	struct buffer_head *orig_bh = bh->b_private;
> +
> +	BUFFER_TRACE(bh, "");
> +	if (uptodate) {
> +		ext4_debug("%s: Block %lld up-to-date",
> +			   __func__, bh->b_blocknr);
> +		set_buffer_uptodate(bh);
> +	} else {
> +		ext4_debug("%s: Block %lld not up-to-date",
> +			   __func__, bh->b_blocknr);
> +		clear_buffer_uptodate(bh);
> +	}
> +	if (orig_bh) {
> +		clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
> +		/* Protect BH_Shadow bit in b_state */
> +		smp_mb__after_atomic();
> +		wake_up_bit(&orig_bh->b_state, BH_Shadow);
> +	}

We don't need to deal with BH_Shadow handling here.  This is needed
when we are writing out buffer heads correspond to ext4 metadata
(e.g., an inode table block, a block group descriptor block).  We're
only writing out bh's corresponding to the journal, so the BH_Shadow
bit should never be set on such bh's.

> +static inline u8 *fc_add_tag(u8 *dst, u16 tag, u16 len, u8 *val)

Can you add some documentation for this function?  In particular, what
does it return?  I also tend to prefer to pass in the pointer to the
buffer (val) first, followed then by the length (len), but that's more
of a personal preference.

> +int ext4_fc_write_inode(journal_t *journal, struct buffer_head *bh,
> +			struct inode *inode, tid_t tid, tid_t subtid,
> +			int is_last, struct dentry *dentry)
> +{

  ...
> +
> +	memcpy(&fc_hdr->inode, ext4_raw_inode(&iloc), EXT4_INODE_SIZE(sb));

So this is a bit problematic.  In the structure definition,
fc_hdr->inode is not at the end of the structure

struct ext4_fc_commit_hdr {
	/* Fast commit magic, should be EXT4_FC_MAGIC */
	__le32 fc_magic;
	...	
	/* ext4 inode on disk copy */
	struct ext4_inode inode;
	/* Csum(hdr+contents) */
	__le32 fc_csum;
};

... and the size of struct ext4_inode is just the fixed portion of the
inode, and is almost always smaller than EXT4_INODE_SIZE(sb) ---
except in the case of 128 byte inodes, in which case the fields
i_extra_isize and beyond going to be beyond the 128 byte limit.

So this isn't going to work.  I'm guessing you didn't test with
extended attributes, because the checksum would have overwritten the
beginning of the in-inode xattrs?

Also, note that EXT4_INODE_SIZE(sb) can be set to the block size.
It's super-rare, but that is legal.  Which means we need to test for
that case somewhere, and either (a) disable fast commits when the
inode size == blocksize, or (b) support a fast commit log which is
larger than a single block.  (This is doable, since there is a
checksum field to protect against partial writes.)

> +struct ext4_fc_commit_hdr {
> +	/* Fast commit magic, should be EXT4_FC_MAGIC */
> +	__le32 fc_magic;
> +	/* Sub transaction ID */
> +	__le32 fc_subtid;
> +	/* Features used by this fast commit block */
> +	__u8 fc_features;
> +	/* Flags for this block. */
> +	__u8 fc_flags;

What fs_features and fc_flags are you thinking we would need?  I can't
think of a good reasons to have per-fc block features.  But I can
think of reasons why we might want to support a small number of blocks
in an fc entry.  So maybe repurpose fc_features with some limit, such
as say, 4 blocks, and on the replay side we can just kmalloc 4 *
blocksize worth of space to read in that number of blocks if
necessary?

> +	/* Number of TLVs in this fast commmit block */
> +	__le16 fc_num_tlvs;
> +	/* Inode number */
> +	__le32 fc_ino;
> +	/* ext4 inode on disk copy */
> +	struct ext4_inode inode;
> +	/* Csum(hdr+contents) */
> +	__le32 fc_csum;

I'd suggest putting the checksum at the very end of the fc entry.
e.g., at offset 4092 if there is only a single block in the fc commit
entry.  Also, I'd make sure that we explicitly zero all of the bytes
at the end of the TLV section and the checksum, and specify that the
checksum is calculated including the must-be-zero padding, just to
keep things simple.

						- Ted
Theodore Y. Ts'o Oct. 23, 2019, 12:44 p.m. UTC | #2
On Wed, Oct 23, 2019 at 04:58:47PM +0800, xiaohui li wrote:
> why not let fsync handle enjoy one transaction exclusively ?
> that is to say, in this transaction, there is only one handle which is
> generated in one file's fsync path .

There is only one handle which is generated in one file's fsync path.
That isn't the problem.  (If it were that simple, we would have done
it a long time ago.)

The problem is that there may have been other handles that have been
started before the fsync transaction, and these handles will have
already made changes to the file system.  Worse, some of those handles
may have made changes in the same metadata blocks which the fsync
operation needs to modify.

For example, suppose we are three seconds into the current
transaction, with potentially hundreds of handles that have already
been started and finished --- but not yet committed, because the
current transaction hasn't closed.  All of those handles have already
been attached to the current transaction, and they can't be ignored.

The fast commit patch set deals with this by using part of the journal
for a "fast commit journal" where we essentially are doing a very
simplified logical journal.  It doesn't handle all cases, and there
will be situations where we will need to fall back to the physical
journalling techniques used in ext4 today.  For example, if the file
has been truncated, and then a single 4k block is written, and then
the file gets fsync'ed, we won't be able to use the fast commit
logical journal.  Fortunately, the common case which compromises well
over 99% of most workloads are much simpler to handle, and these can
be handled via the fast commit patch.

The fast commit approach is a simplified version of the idea proposed
by Daejun Park and Dungkun Shih from the Sungkyunkwan University in
Korea, and which were presented in the paper "iJournaling:
Fine-Grained Journaling for Improving the Latency of Fsync System
Call[1]", presented at the Usenix Annual Technical Conference in 2017.

[1] https://www.usenix.org/conference/atc17/technical-sessions/presentation/park

Cheers,

						- Ted

Patch
diff mbox series

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 0bb8de2139a5..fd7740372438 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -4,6 +4,7 @@ 
  */
 
 #include "ext4_jbd2.h"
+#include "ext4_extents.h"
 
 #include <trace/events/ext4.h>
 
@@ -480,3 +481,311 @@  bool ext4_is_inode_fc_new(struct inode *inode)
 
 	return ret;
 }
+
+static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
+{
+	struct buffer_head *orig_bh = bh->b_private;
+
+	BUFFER_TRACE(bh, "");
+	if (uptodate) {
+		ext4_debug("%s: Block %lld up-to-date",
+			   __func__, bh->b_blocknr);
+		set_buffer_uptodate(bh);
+	} else {
+		ext4_debug("%s: Block %lld not up-to-date",
+			   __func__, bh->b_blocknr);
+		clear_buffer_uptodate(bh);
+	}
+	if (orig_bh) {
+		clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
+		/* Protect BH_Shadow bit in b_state */
+		smp_mb__after_atomic();
+		wake_up_bit(&orig_bh->b_state, BH_Shadow);
+	}
+	unlock_buffer(bh);
+}
+
+static inline u8 *fc_add_tag(u8 *dst, u16 tag, u16 len, u8 *val)
+{
+	struct ext4_fc_tl tl;
+
+	tl.fc_tag = cpu_to_le16(tag);
+	tl.fc_len = cpu_to_le16(len);
+	memcpy(dst, &tl, sizeof(tl));
+	memcpy(dst + sizeof(tl), val, len);
+
+	return dst + sizeof(tl) + len;
+}
+
+int ext4_fc_write_inode(journal_t *journal, struct buffer_head *bh,
+			struct inode *inode, tid_t tid, tid_t subtid,
+			int is_last, struct dentry *dentry)
+{
+	ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
+	struct super_block *sb = journal->j_private;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_commit_hdr *fc_hdr;
+	struct ext4_map_blocks map;
+	struct ext4_iloc iloc;
+	struct ext4_extent extent;
+	struct inode *parent;
+	__u32 dummy_csum = 0, csum;
+	__u8 *start, *cur, *end;
+	__u16 num_tlvs = 0;
+	int ret;
+
+	read_lock(&ei->i_fc.fc_lock);
+	if (tid != ei->i_fc.fc_tid) {
+		jbd_debug(3,
+			  "File not modified. Modified %d, expected %d",
+			  ei->i_fc.fc_tid, tid);
+		read_unlock(&ei->i_fc.fc_lock);
+		return 0;
+	}
+	read_unlock(&ei->i_fc.fc_lock);
+
+	if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+		return -ECANCELED;
+
+	if (ext4_is_inode_fc_new(inode)) {
+		parent = d_inode(dentry->d_parent);
+		if (parent && ext4_is_inode_fc_ineligible(parent))
+			return -ECANCELED;
+	}
+
+	ret = ext4_get_inode_loc(inode, &iloc);
+	if (ret)
+		return ret;
+
+	end = (__u8 *)bh->b_data + journal->j_blocksize;
+
+	write_lock(&ei->i_fc.fc_lock);
+	old_blk_size = (ei->i_fc.fc_lblk_start + sb->s_blocksize - 1) >>
+		       inode->i_blkbits;
+	new_blk_size = ei->i_fc.fc_lblk_end >> inode->i_blkbits;
+	ei->i_fc.fc_lblk_start = ei->i_fc.fc_lblk_end;
+	write_unlock(&ei->i_fc.fc_lock);
+
+	jbd_debug(3, "Committing as tid = %d, subtid = %d on buffer %lld\n",
+		  tid, subtid, bh->b_blocknr);
+
+	fc_hdr = (struct ext4_fc_commit_hdr *)
+			((__u8 *)bh->b_data + sizeof(journal_header_t));
+	fc_hdr->fc_magic = cpu_to_le32(EXT4_FC_MAGIC);
+	fc_hdr->fc_subtid = cpu_to_le32(subtid);
+	fc_hdr->fc_ino = cpu_to_le32(inode->i_ino);
+	fc_hdr->fc_features = 0;
+	fc_hdr->fc_flags = 0;
+
+	if (is_last)
+		ext4_fc_mark_last(fc_hdr);
+
+	memcpy(&fc_hdr->inode, ext4_raw_inode(&iloc), EXT4_INODE_SIZE(sb));
+	cur = (__u8 *)(fc_hdr + 1);
+	start = cur;
+	if (ext4_is_inode_fc_new(inode)) {
+		__le32 parent_ino;
+
+		read_lock(&ei->i_fc.fc_lock);
+		parent_ino = cpu_to_le32(ei->i_fc.fc_parent_ino);
+		read_unlock(&ei->i_fc.fc_lock);
+
+		if (!dentry)
+			return -ECANCELED;
+
+		cur = fc_add_tag(cur, EXT4_FC_TAG_PARENT_INO,
+				      sizeof(parent_ino), (u8 *)&parent_ino);
+		cur = fc_add_tag(cur, EXT4_FC_TAG_DNAME,
+				 dentry->d_name.len,
+				 (u8 *)dentry->d_name.name);
+		num_tlvs = 2;
+	}
+	csum = 0;
+	cur_lblk_off = old_blk_size;
+	while (cur_lblk_off <= new_blk_size) {
+		map.m_lblk = cur_lblk_off;
+		map.m_len = new_blk_size - cur_lblk_off + 1;
+		ret = ext4_map_blocks(NULL, inode, &map, 0);
+		if (!ret) {
+			cur_lblk_off += map.m_len;
+			continue;
+		}
+
+		if (map.m_flags & EXT4_MAP_UNWRITTEN)
+			return -ECANCELED;
+		extent.ee_block = cpu_to_le32(map.m_lblk);
+		cur_lblk_off += map.m_len;
+		if (cur + sizeof(struct ext4_extent) +
+		    sizeof(struct ext4_fc_tl) >= end)
+			return -ENOSPC;
+
+		extent.ee_len = cpu_to_le16(map.m_len);
+		ext4_ext_store_pblock(&extent, map.m_pblk);
+		ext4_ext_mark_initialized(&extent);
+		cur = fc_add_tag(cur, EXT4_FC_TAG_EXT,
+				 sizeof(struct ext4_extent),
+				 (u8 *)&extent);
+		num_tlvs++;
+	}
+
+	fc_hdr->fc_num_tlvs = cpu_to_le16(num_tlvs);
+	csum = ext4_chksum(sbi, csum, (__u8 *)fc_hdr,
+			   offsetof(struct ext4_fc_commit_hdr, fc_csum));
+	csum = ext4_chksum(sbi, csum, &dummy_csum, sizeof(dummy_csum));
+	csum = ext4_chksum(sbi, csum, start, cur - start);
+	fc_hdr->fc_csum = cpu_to_le32(csum);
+
+	jbd_debug(3, "Created FC block for inode %ld with [%d, %d]",
+		  inode->i_ino, tid, subtid);
+
+	return 1;
+}
+
+static void ext4_journal_fc_cleanup_cb(journal_t *journal)
+{
+	struct super_block *sb = journal->j_private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *iter;
+	struct inode *inode;
+
+	spin_lock(&sbi->s_fc_lock);
+	while (!list_empty(&sbi->s_fc_q)) {
+		iter = list_first_entry(&sbi->s_fc_q,
+				  struct ext4_inode_info, i_fc_list);
+		list_del_init(&iter->i_fc_list);
+		inode = &iter->vfs_inode;
+	}
+	INIT_LIST_HEAD(&sbi->s_fc_q);
+	sbi->s_fc_q_cnt = 0;
+	spin_unlock(&sbi->s_fc_lock);
+	sbi->s_fc_eligible = true;
+}
+
+/*
+ * Fast-commit commit callback. There is contention between sbi->s_fc_lock and
+ * i_data_sem. Locking order is - i_data_sem then s_fc_lock
+ */
+static int ext4_journal_fc_commit_cb(journal_t *journal, tid_t tid,
+			      tid_t subtid,
+			      struct transaction_run_stats_s *stats)
+{
+	struct super_block *sb = journal->j_private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct list_head *pos, *tmp;
+	struct ext4_inode_info *iter;
+	int num_bufs = 0, ret;
+
+	memset(stats, 0, sizeof(*stats));
+
+	trace_ext4_journal_fc_commit_cb_start(sb);
+	sbi = sbi;
+	spin_lock(&sbi->s_fc_lock);
+	if (!sbi->s_fc_eligible) {
+		sbi->s_fc_eligible = true;
+		spin_unlock(&sbi->s_fc_lock);
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "ineligible");
+		return -ECANCELED;
+	}
+
+	if (unlikely(ext4_forced_shutdown(EXT4_SB(sb)))) {
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "shutdown");
+		return -EIO;
+	}
+
+	stats->rs_flushing = jiffies;
+	/* Submit data buffers first */
+	list_for_each(pos, &sbi->s_fc_q) {
+		iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
+		ret = jbd2_submit_inode_data(journal, iter->jinode);
+		if (ret) {
+			spin_unlock(&sbi->s_fc_lock);
+			trace_ext4_journal_fc_commit_cb_stop(sb, 0,
+							     "data_commit");
+			return ret;
+		}
+	}
+	stats->rs_logging = jiffies;
+	stats->rs_flushing = jbd2_time_diff(stats->rs_flushing,
+					    stats->rs_logging);
+
+	list_for_each_safe(pos, tmp, &sbi->s_fc_q) {
+		struct inode *inode;
+		struct buffer_head *bh;
+		int is_last;
+
+		iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
+		inode = &iter->vfs_inode;
+
+		is_last = list_is_last(pos, &sbi->s_fc_q);
+		spin_unlock(&sbi->s_fc_lock);
+
+		ret = jbd2_map_fc_buf(journal, &bh);
+		if (ret) {
+			trace_ext4_journal_fc_commit_cb_stop(sb, 0,
+							     "map_fc_buf");
+			return -ENOMEM;
+		}
+
+		/*
+		 * Release s_fc_lock here since fc_write_inode calls
+		 * ext4_map_blocks which needs i_data_sem.
+		 */
+		ret = ext4_fc_write_inode(journal, bh, inode, tid, subtid,
+					  is_last, NULL);
+		if (ret < 0) {
+			trace_ext4_journal_fc_commit_cb_stop(sb, 0,
+							     "fc_write_inode");
+			return ret;
+		}
+		lock_buffer(bh);
+		clear_buffer_dirty(bh);
+		set_buffer_uptodate(bh);
+		bh->b_end_io = ext4_end_buffer_io_sync;
+		submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
+		spin_lock(&sbi->s_fc_lock);
+
+		num_bufs++;
+	}
+
+	stats->rs_logging = jbd2_time_diff(stats->rs_logging, jiffies);
+	if (num_bufs == 0) {
+		spin_unlock(&sbi->s_fc_lock);
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "no_data");
+		stats->rs_blocks_logged = num_bufs;
+		return 0;
+	}
+
+	/*
+	 * Before returning, check if s_fc_eligible was modified since we
+	 * started.
+	 */
+	if (!sbi->s_fc_eligible) {
+		spin_unlock(&sbi->s_fc_lock);
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "ineligible2");
+		return -ECANCELED;
+	}
+
+	if (unlikely(ext4_forced_shutdown(EXT4_SB(sb)))) {
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "shutdown2");
+		return -EIO;
+	}
+
+	spin_unlock(&sbi->s_fc_lock);
+
+	jbd_debug(3, "%s: Journal blocks ready for fast commit\n", __func__);
+
+	stats->rs_blocks_logged = num_bufs;
+
+	trace_ext4_journal_fc_commit_cb_stop(sb, num_bufs, "success");
+
+	return jbd2_wait_on_fc_bufs(journal, num_bufs);
+}
+
+void ext4_init_fast_commit(struct super_block *sb, journal_t *journal)
+{
+	if (ext4_should_fast_commit(sb)) {
+		journal->j_fc_commit_callback = ext4_journal_fc_commit_cb;
+		journal->j_fc_cleanup_callback = ext4_journal_fc_cleanup_cb;
+	}
+}
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 2cb7e7e1f025..acb9533068c4 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -397,8 +397,14 @@  static inline void ext4_update_inode_fsync_trans(handle_t *handle,
 
 	if (ext4_handle_valid(handle) && !is_handle_aborted(handle)) {
 		ei->i_sync_tid = handle->h_transaction->t_tid;
-		if (datasync)
+		if (ext4_should_fast_commit(inode->i_sb))
+			ei->i_sync_subtid = handle->h_transaction->t_subtid;
+		if (datasync) {
 			ei->i_datasync_tid = handle->h_transaction->t_tid;
+			if (ext4_should_fast_commit(inode->i_sb))
+				ei->i_datasync_subtid =
+						handle->h_transaction->t_subtid;
+		}
 	}
 }
 
@@ -470,6 +476,47 @@  static inline int ext4_should_dioread_nolock(struct inode *inode)
 	return 1;
 }
 
+/* Ext4 fast commit related info */
+
+/* Magic of fast commit header */
+#define EXT4_FC_MAGIC			0xE2540090
+
+#define EXT4_FC_FL_LAST			0x00000001
+
+#define ext4_fc_is_last(__fc_hdr)	(((__fc_hdr)->fc_flags) &	\
+					 EXT4_FC_FL_LAST)
+
+#define ext4_fc_mark_last(__fc_hdr)	(((__fc_hdr)->fc_flags) |=	\
+					 EXT4_FC_FL_LAST)
+
+struct ext4_fc_commit_hdr {
+	/* Fast commit magic, should be EXT4_FC_MAGIC */
+	__le32 fc_magic;
+	/* Sub transaction ID */
+	__le32 fc_subtid;
+	/* Features used by this fast commit block */
+	__u8 fc_features;
+	/* Flags for this block. */
+	__u8 fc_flags;
+	/* Number of TLVs in this fast commmit block */
+	__le16 fc_num_tlvs;
+	/* Inode number */
+	__le32 fc_ino;
+	/* ext4 inode on disk copy */
+	struct ext4_inode inode;
+	/* Csum(hdr+contents) */
+	__le32 fc_csum;
+};
+
+#define EXT4_FC_TAG_EXT		0x1	/* Extent */
+#define EXT4_FC_TAG_DNAME	0x2
+#define EXT4_FC_TAG_PARENT_INO	0x3
+
+struct ext4_fc_tl {
+	__le16 fc_tag;
+	__le16 fc_len;
+};
+
 void ext4_init_inode_fc_info(struct inode *inode);
 extern void ext4_fc_enqueue_inode(handle_t *handle, struct inode *inode);
 extern void ext4_fc_del(struct inode *inode);
@@ -507,4 +554,5 @@  void ext4_fc_update_commit_range(struct inode *inode, ext4_lblk_t start,
 void ext4_fc_mark_new(struct inode *inode);
 bool ext4_is_inode_fc_ineligible(struct inode *inode);
 bool ext4_is_inode_fc_new(struct inode *inode);
+void ext4_init_fast_commit(struct super_block *sb, journal_t *journal);
 #endif	/* _EXT4_JBD2_H */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index b30f6175eb71..dea4c2632272 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4898,10 +4898,10 @@  long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	if (ret)
 		goto out;
 
-	if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal) {
-		ret = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
-						EXT4_I(inode)->i_sync_tid);
-	}
+	if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal)
+		ret = jbd2_fc_complete_commit(
+		    EXT4_SB(inode->i_sb)->s_journal, EXT4_I(inode)->i_sync_tid,
+		    EXT4_I(inode)->i_sync_subtid);
 out:
 	inode_unlock(inode);
 	trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ea039e3e1a4d..cbfa1ec858a1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5039,20 +5039,25 @@  struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 	 */
 	if (journal) {
 		transaction_t *transaction;
-		tid_t tid;
+		tid_t tid, subtid;
 
 		read_lock(&journal->j_state_lock);
 		if (journal->j_running_transaction)
 			transaction = journal->j_running_transaction;
 		else
 			transaction = journal->j_committing_transaction;
-		if (transaction)
+		if (transaction) {
 			tid = transaction->t_tid;
-		else
+			subtid = transaction->t_subtid;
+		} else {
 			tid = journal->j_commit_sequence;
+			subtid = journal->j_fc_sequence;
+		}
 		read_unlock(&journal->j_state_lock);
 		ei->i_sync_tid = tid;
 		ei->i_datasync_tid = tid;
+		ei->i_sync_subtid = subtid;
+		ei->i_datasync_subtid = subtid;
 	}
 
 	if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
@@ -5475,8 +5480,9 @@  int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
 		if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync)
 			return 0;
 
-		err = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
-						EXT4_I(inode)->i_sync_tid);
+		err = jbd2_fc_complete_commit(
+		    EXT4_SB(inode->i_sb)->s_journal, EXT4_I(inode)->i_sync_tid,
+		    EXT4_I(inode)->i_sync_subtid);
 	} else {
 		struct ext4_iloc iloc;
 
@@ -5628,6 +5634,7 @@  int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		if (attr->ia_valid & ATTR_GID)
 			inode->i_gid = attr->ia_gid;
 		error = ext4_mark_inode_dirty(handle, inode);
+		ext4_fc_enqueue_inode(handle, inode);
 		ext4_journal_stop(handle);
 	}
 
@@ -5688,6 +5695,7 @@  int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 				inode->i_mtime = current_time(inode);
 				inode->i_ctime = inode->i_mtime;
 			}
+			ext4_fc_enqueue_inode(handle, inode);
 			down_write(&EXT4_I(inode)->i_data_sem);
 			EXT4_I(inode)->i_disksize = attr->ia_size;
 			rc = ext4_mark_inode_dirty(handle, inode);
@@ -5732,6 +5740,8 @@  int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 
 	if (!error) {
 		setattr_copy(inode, attr);
+		ext4_fc_enqueue_inode(ext4_journal_current_handle(),
+						   inode);
 		mark_inode_dirty(inode);
 	}
 
@@ -6144,6 +6154,7 @@  void ext4_dirty_inode(struct inode *inode, int flags)
 		goto out;
 
 	ext4_mark_inode_dirty(handle, inode);
+	ext4_fc_enqueue_inode(handle, inode);
 
 	ext4_journal_stop(handle);
 out:
@@ -6229,6 +6240,7 @@  int ext4_change_inode_journal_flag(struct inode *inode, int val)
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
 
+	ext4_fc_mark_ineligible(inode);
 	err = ext4_mark_inode_dirty(handle, inode);
 	ext4_handle_sync(handle);
 	ext4_journal_stop(handle);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3e9570ea9748..208c57b5ac80 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1129,6 +1129,16 @@  static void ext4_destroy_inode(struct inode *inode)
 				true);
 		dump_stack();
 	}
+	if (!list_empty(&(EXT4_I(inode)->i_fc_list))) {
+#ifdef EXT4FS_DEBUG
+		if (EXT4_SB(inode->i_sb)->s_fc_eligible) {
+			pr_warn("%s: INODE %ld in FC List with FC allowd",
+				__func__, inode->i_ino);
+			dump_stack();
+		}
+#endif
+		ext4_fc_del(inode);
+	}
 }
 
 static void init_once(void *foo)
@@ -4713,6 +4723,7 @@  static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
 	journal->j_commit_interval = sbi->s_commit_interval;
 	journal->j_min_batch_time = sbi->s_min_batch_time;
 	journal->j_max_batch_time = sbi->s_max_batch_time;
+	ext4_init_fast_commit(sb, journal);
 
 	write_lock(&journal->j_state_lock);
 	if (test_opt(sb, BARRIER))
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index d68e9e536814..9c24b1c5239f 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2703,6 +2703,45 @@  TRACE_EVENT(ext4_error,
 		  __entry->function, __entry->line)
 );
 
+TRACE_EVENT(ext4_journal_fc_commit_cb_start,
+	TP_PROTO(struct super_block *sb),
+
+	TP_ARGS(sb),
+
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+	),
+
+	TP_fast_assign(
+		__entry->dev = sb->s_dev;
+	),
+
+	TP_printk("fast_commit started on dev %d,%d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev))
+);
+
+TRACE_EVENT(ext4_journal_fc_commit_cb_stop,
+	    TP_PROTO(struct super_block *sb, int nblks, const char *reason),
+
+	TP_ARGS(sb, nblks, reason),
+
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, nblks)
+		__field(const char *, reason)
+	),
+
+	TP_fast_assign(
+		__entry->dev = sb->s_dev;
+		__entry->nblks = nblks;
+		__entry->reason = reason;
+	),
+
+	TP_printk("fast_commit done on dev %d,%d, nblks %d, reason %s",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->nblks, __entry->reason)
+);
+
 #endif /* _TRACE_EXT4_H */
 
 /* This part must be outside protection */