[v4,01/20] ext4: update docs for fast commit feature

Message ID 20191224081324.95807-1-harshadshirwadkar@gmail.com
State Superseded

Commit Message

harshad shirwadkar Dec. 24, 2019, 8:13 a.m. UTC
This patch series adds support for fast commits, a simplified version
of the scheme proposed by Park and Shin in their paper "iJournaling:
Fine-Grained Journaling for Improving the Latency of Fsync System
Call"[1]. The basic idea of fast commits is to make JBD2 give the
client file system an opportunity to perform a faster commit. JBD2
falls back to traditional commits only if the file system cannot
perform such a commit.

Because JBD2 operates at block granularity, for every file system
metadata update, all the changed blocks are written to the journal at
commit time. This is inefficient because updates to some of the
blocks that JBD2 commits are derivable from other blocks. For
example, if a new extent is added to an inode, then the corresponding
updates to the inode table, the block bitmap, the group descriptor and
the superblock can be derived from just the extent information and
the corresponding inode information. So, if we take this relationship
between blocks into account and replay the journalled blocks smartly,
we could increase the performance of file system commits significantly.
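
As a rough illustration of that derivability, the sketch below rebuilds a
block bitmap and a free-block counter from a single logical extent record.
Everything in it (`group_model`, `extent_rec`, `apply_extent`, the 64-block
group) is a hypothetical, simplified model made up for illustration; it is
not the ext4 on-disk format and not code from this series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCKS_PER_GROUP 64u /* toy value, not the ext4 default */

/* Hypothetical in-memory model of the derivable per-group metadata. */
struct group_model {
	uint8_t  block_bitmap[BLOCKS_PER_GROUP / 8]; /* one bit per block */
	uint32_t free_blocks;                        /* descriptor counter */
};

/* Hypothetical logical record: blocks [start, start + len) are in use. */
struct extent_rec {
	uint32_t start;
	uint32_t len;
};

static void group_init(struct group_model *g)
{
	memset(g->block_bitmap, 0, sizeof(g->block_bitmap));
	g->free_blocks = BLOCKS_PER_GROUP;
}

/*
 * Derive the bitmap and counter updates from the extent alone.
 * Bits that are already set are skipped, so replaying the same record
 * twice leaves the state unchanged. Returns the number of blocks newly
 * marked in use, or -1 if the extent does not fit in the group.
 */
static int apply_extent(struct group_model *g, const struct extent_rec *e)
{
	int newly_used = 0;
	uint32_t i;

	if (e->start >= BLOCKS_PER_GROUP ||
	    e->len > BLOCKS_PER_GROUP - e->start)
		return -1;

	for (i = e->start; i < e->start + e->len; i++) {
		uint8_t mask = (uint8_t)(1u << (i % 8));

		if (g->block_bitmap[i / 8] & mask)
			continue; /* already accounted for */
		assert(g->free_blocks > 0);
		g->block_bitmap[i / 8] |= mask;
		g->free_blocks--;
		newly_used++;
	}
	return newly_used;
}
```

Because the helper is idempotent, a replay that is interrupted and restarted
can apply the same record again without corrupting the counters, which is
presumably the motivation for the idempotent bitmap helpers added in this
series.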

Fast commits introduced in this patch series have two main contributions:

(1) Making JBD2 fast commit aware, so that clients of JBD2 can
    implement fast commits.

(2) Adding support in ext4 to use JBD2's new interfaces and implement
    fast commits.

Ext4 supports two modes of fast commits: 1) fast commits with hard
consistency guarantees and 2) fast commits with soft consistency
guarantees.

When hard consistency is enabled, fast commit guarantees that all the
updates will be committed. After a successful replay of fast commit
blocks in hard consistency mode, the entire file system would be in
the same state as when fsync() returned before the crash. This
guarantee is similar to what jbd2 gives with full commits.

With soft consistency, the file system only guarantees consistency for
the inode in question. In this mode, the file system tries to write as
little data to the backend as possible at commit time. To be precise,
the file system records all the data updates for the inode in question
and the directory updates that are required for guaranteeing
consistency of the inode in question.
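
The difference between the two modes can be sketched as a filter over the
updates that are pending when fsync() is called on one inode. The names
below (`fc_mode`, `tracked_update`, `fc_must_commit`) are hypothetical, and
the sketch assumes that a directory entry update counts as "required" exactly
when it links or unlinks the inode being synced; the real series may decide
this differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum fc_mode { FC_HARD, FC_SOFT };       /* hypothetical names */
enum upd_kind { UPD_DATA, UPD_DENTRY };  /* data range vs. link/unlink */

/* Hypothetical record of one pending, not yet committed update. */
struct tracked_update {
	enum upd_kind kind;
	uint32_t ino; /* inode that is modified or, for dentries, linked */
};

/* Should this update be written when fsync() is called on target_ino? */
static int fc_must_commit(enum fc_mode mode, const struct tracked_update *u,
			  uint32_t target_ino)
{
	if (mode == FC_HARD)
		return 1; /* hard mode: every pending update is committed */
	/* soft mode (assumption): only updates tied to the target inode */
	return u->ino == target_ino;
}

/* Count how many of the n pending updates a commit would write. */
static size_t fc_count_committed(enum fc_mode mode,
				 const struct tracked_update *ups, size_t n,
				 uint32_t target_ino)
{
	size_t i, count = 0;

	assert(ups != NULL || n == 0);
	for (i = 0; i < n; i++)
		count += fc_must_commit(mode, &ups[i], target_ino) ? 1 : 0;
	return count;
}
```

With four pending updates of which two belong to the synced inode, this model
writes all four in hard mode and two in soft mode, which matches the
observation that soft commits write fewer blocks but cannot piggyback other
inodes.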

In our evaluations, fast commits with hard consistency performed
better than fast commits with soft consistency. That's because with
hard consistency, a fast commit often ends up committing other inodes
together, while with soft consistency commits get serialized. Future
work can look at creating a hybrid approach between the two extremes
in this patchset.

Testing
-------

e2fsprogs was updated to set the fast commit feature flag and to ignore
fast commit blocks during e2fsck.

https://github.com/harshadjs/e2fsprogs.git

After applying all the patches in this series, the following runs of
xfstests were performed:

- kvm-xfstests.sh -g log -c 4k
- kvm-xfstests.sh smoke

All the log tests were successful, and the smoke tests didn't introduce
any additional failures.

Performance Evaluation
----------------------

Ext4 file system performance was tested with full commits, with fast
commits with soft consistency and with fast commits with hard
consistency. The fs_mark benchmark showed performance improvements of
up to 50%, depending on the file size. Soft fast commits performed
slightly worse than hard fast commits, but they ended up writing
slightly fewer blocks to disk.

Changes since V3:

- Removed invocation of fast commits from the jbd2 thread.

- Removed sub transaction ID from journal_t.

- Added rename, truncate, punch hole support.

- Added soft consistency mode and hard consistency mode.

- More bug fixes and refactoring.

- Added better debugging support: more tracepoints and debug mount
  options.

Harshad Shirwadkar (20):
 ext4: add debug mount option to test fast commit replay
 ext4: add fast commit replay path
 ext4: disable certain features in replay path
 ext4: add idempotent helpers to manipulate bitmaps
 ext4: fast commit recovery path preparation
 jbd2: add fast commit recovery path support
 ext4: main commit routine for fast commits
 jbd2: add new APIs for commit path of fast commits
 ext4: add fast commit on-disk format structs and helpers
 ext4: add fast commit track points
 ext4: break ext4_unlink() and ext4_link()
 ext4: add inode tracking and ineligible marking routines
 ext4: add directory entry tracking routines
 ext4: add generic diff tracking routines and range tracking
 jbd2: fast commit main commit path changes
 jbd2: disable fast commits if journal is empty
 jbd2: add fast commit block tracker variables
 ext4, jbd2: add fast commit initialization routines
 ext4: add handling for extended mount options
 ext4: update docs for fast commit feature

 Documentation/filesystems/ext4/journal.rst |  127 ++-
 Documentation/filesystems/journalling.rst  |   18 +
 fs/ext4/acl.c                              |    1 +
 fs/ext4/balloc.c                           |   10 +-
 fs/ext4/ext4.h                             |  127 +++
 fs/ext4/ext4_jbd2.c                        | 1484 +++++++++++++++++++++++++++-
 fs/ext4/ext4_jbd2.h                        |   71 ++
 fs/ext4/extents.c                          |    5 +
 fs/ext4/extents_status.c                   |   24 +
 fs/ext4/fsync.c                            |    2 +-
 fs/ext4/ialloc.c                           |  165 +++-
 fs/ext4/inline.c                           |    3 +
 fs/ext4/inode.c                            |   77 +-
 fs/ext4/ioctl.c                            |    9 +-
 fs/ext4/mballoc.c                          |  157 ++-
 fs/ext4/mballoc.h                          |    2 +
 fs/ext4/migrate.c                          |    1 +
 fs/ext4/namei.c                            |  172 ++--
 fs/ext4/super.c                            |   72 +-
 fs/ext4/xattr.c                            |    6 +
 fs/jbd2/commit.c                           |   61 ++
 fs/jbd2/journal.c                          |  217 +++-
 fs/jbd2/recovery.c                         |   67 +-
 include/linux/jbd2.h                       |   83 +-
 include/trace/events/ext4.h                |  208 +++-
 25 files changed, 3037 insertions(+), 132 deletions(-)
---
 Documentation/filesystems/ext4/journal.rst | 127 ++++++++++++++++++++-
 Documentation/filesystems/journalling.rst  |  18 +++
 2 files changed, 139 insertions(+), 6 deletions(-)

Comments

xiaohui li Jan. 9, 2020, 4:29 a.m. UTC | #1
hi Harshad
cc ted

Sorry, but I have some thoughts about this fast commit work that I
want to share with you.

There are nearly 20 patches in this v4 fast commit series, which is a
lot. I wonder whether it is necessary to make the fast commit feature
so complex.

Maybe I have not understood the difficulty of the fast commit coding
work, so I would appreciate it very much if you could give a more
detailed description of how the v4 fast commit patches relate to each
other, and especially of why so many patches are needed.

From my viewpoint, the purpose of the fast commit feature is to
resolve the problem that ext4 fsync takes so much time. First of all,
this work needs to resolve actual customer problems that exist in ext4
file systems.

So, in my opinion, the first release of fast commit only needs to
accomplish the goal of reducing the time cost of fsync caused by the
jbd2 order shortcoming described in the iJournaling paper. It need
not do so many other unnecessary things.

If I have free time, I will continue reviewing these patches.
Thank you for your reply.

On Tue, Dec 24, 2019 at 4:14 PM Harshad Shirwadkar
<harshadshirwadkar@gmail.com> wrote:
> diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
> index ea613ee701f5..f94e66f2f8c4 100644
> --- a/Documentation/filesystems/ext4/journal.rst
> +++ b/Documentation/filesystems/ext4/journal.rst
> @@ -29,10 +29,10 @@ safest. If ``data=writeback``, dirty data blocks are not flushed to the
>  disk before the metadata are written to disk through the journal.
>
>  The journal inode is typically inode 8. The first 68 bytes of the
> -journal inode are replicated in the ext4 superblock. The journal itself
> -is normal (but hidden) file within the filesystem. The file usually
> -consumes an entire block group, though mke2fs tries to put it in the
> -middle of the disk.
> +journal inode are replicated in the ext4 superblock. The journal
> +itself is normal (but hidden) file within the filesystem. The file
> +usually consumes an entire block group, though mke2fs tries to put it
> +in the middle of the disk.
>
>  All fields in jbd2 are written to disk in big-endian order. This is the
>  opposite of ext4.
> @@ -42,22 +42,74 @@ NOTE: Both ext4 and ocfs2 use jbd2.
>  The maximum size of a journal embedded in an ext4 filesystem is 2^32
>  blocks. jbd2 itself does not seem to care.
>
> +Fast Commits
> +~~~~~~~~~~~~
> +
> +Ext4 also implements fast commits and integrates it with JBD2 journalling.
> +Fast commits store metadata changes made to the file system as inode level
> +diff. In other words, each fast commit block identifies updates made to
> +a particular inode and collectively they represent total changes made to
> +the file system.
> +
> +A fast commit is valid only if there is no full commit after that particular
> +fast commit. Because of this feature, fast commit blocks can be reused by
> +the following transactions.
> +
> +Each fast commit block stores updates to 1 particular inode. Updates in each
> +fast commit block are one of the 2 types:
> +- Data updates (add range / delete range)
> +- Directory entry updates (Add / remove links)
> +
> +Fast commit blocks must be replayed in the order in which they appear on disk.
> +That's because directory entry updates are written in fast commit blocks
> +in the order in which they are applied on the file system before crash.
> +Changing the order of replaying for directory entry updates may result
> +in inconsistent file system. Note that only directory entry updates need
> +ordering; data updates, since they apply to only one inode, do not require
> +ordered replay. Also, fast commits guarantee that file system is in consistent
> +state after replay of each fast commit block as long as order of replay has
> +been followed.
> +
> +Note that directory inode updates are never directly recorded in fast commits.
> +Just like other file system level metadata, updates to directories are always
> +implied based on directory entry updates stored in fast commit blocks.
> +
> +Based on which directory entry updates are committed with an inode, fast
> +commits have two modes of operation:
> +
> +- Hard Consistency (default)
> +- Soft Consistency (can be enabled by setting mount flag "fc_soft_consistency")
> +
> +When hard consistency is enabled, fast commit guarantees that all the updates
> +will be committed. After a successful replay of fast commits blocks
> +in hard consistency mode, the entire file system would be in the same state as
> +that when fsync() returned before crash. This guarantee is similar to what
> +jbd2 gives.
> +
> +With soft consistency, file system only guarantees consistency for the
> +inode in question. In this mode, file system will try to write as little data
> +to the backend as possible during the commit time. To be precise, file system
> +records all the data updates for the inode in question and directory updates
> +that are required for guaranteeing consistency of the inode in question.
> +
>  Layout
>  ~~~~~~
>
>  Generally speaking, the journal has this format:
>
>  .. list-table::
> -   :widths: 16 48 16
> +   :widths: 16 48 16 18
>     :header-rows: 1
>
>     * - Superblock
>       - descriptor\_block (data\_blocks or revocation\_block) [more data or
>         revocations] commmit\_block
>       - [more transactions...]
> +     - [Fast commits...]
>     * -
>       - One transaction
>       -
> +     -
>
>  Notice that a transaction begins with either a descriptor and some data,
>  or a block revocation list. A finished transaction always ends with a
> @@ -76,7 +128,7 @@ The journal superblock will be in the next full block after the
>  superblock.
>
>  .. list-table::
> -   :widths: 12 12 12 32 12
> +   :widths: 12 12 12 32 12 12
>     :header-rows: 1
>
>     * - 1024 bytes of padding
> @@ -85,11 +137,13 @@ superblock.
>       - descriptor\_block (data\_blocks or revocation\_block) [more data or
>         revocations] commmit\_block
>       - [more transactions...]
> +     - [Fast commits...]
>     * -
>       -
>       -
>       - One transaction
>       -
> +     -
>
>  Block Header
>  ~~~~~~~~~~~~
> @@ -609,3 +663,64 @@ bytes long (but uses a full block):
>       - h\_commit\_nsec
>       - Nanoseconds component of the above timestamp.
>
> +Fast Commit Block
> +~~~~~~~~~~~~~~~~~
> +
> +The fast commit block indicates an append to the last commit block
> +that was written to the journal. One fast commit block records updates
> +to one inode. So, typically you would find as many fast commit blocks
> +as the number of inodes that got changed since the last commit. A fast
> +commit block is valid only if there is no commit block present with
> +transaction ID greater than that of the fast commit block. If such a
> +block is present, then there is no need to replay the fast commit
> +block.
> +
> +.. list-table::
> +   :widths: 8 8 24 40
> +   :header-rows: 1
> +
> +   * - Offset
> +     - Type
> +     - Name
> +     - Descriptor
> +   * - 0x0
> +     - journal\_header\_s
> +     - (open coded)
> +     - Common block header.
> +   * - 0xC
> +     - \_\_le32
> +     - fc\_magic
> +     - Magic value which should be set to 0xE2540090. This identifies
> +       that this block is a fast commit block.
> +   * - 0x10
> +     - \_\_u8
> +     - fc\_features
> +     - Features used by this fast commit block.
> +   * - 0x11
> +     - \_\_le16
> +     - fc\_num\_tlvs
> +     - Number of TLVs contained in this fast commit block
> +   * - 0x13
> +     - \_\_le32
> +     - \_\_fc\_len
> +     - Length of the fast commit block in terms of number of blocks
> +   * - 0x17
> +     - \_\_le32
> +     - fc\_ino
> +     - Inode number of the inode that will be recovered using this fast commit
> +   * - 0x2B
> +     - struct ext4\_inode
> +     - inode
> +     - On-disk copy of the inode at the commit time
> +   * - <Variable based on inode size>
> +     - struct ext4\_fc\_tl
> +     - Array of struct ext4\_fc\_tl
> +     - The actual delta with the last commit. Starting at this offset,
> +       there is an array of TLVs that indicates which all extents
> +       should be present in the corresponding inode. Currently,
> +       following tags are supported: EXT4\_FC\_TAG\_EXT (extent that
> +       should be present in the inode), EXT4\_FC\_TAG\_HOLE (extent
> +       that should be removed from the inode), EXT4\_FC\_TAG\_ADD\_DENTRY
> +       (dentry that should be linked), EXT4\_FC\_TAG\_DEL\_DENTRY
> +       (dentry that should be unlinked), EXT4\_FC\_TAG\_CREATE\_DENTRY
> +       (dentry for the file that should be created for the first time).
> diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
> index 58ce6b395206..1cb116ab27ab 100644
> --- a/Documentation/filesystems/journalling.rst
> +++ b/Documentation/filesystems/journalling.rst
> @@ -115,6 +115,24 @@ called after each transaction commit. You can also use
>  ``transaction->t_private_list`` for attaching entries to a transaction
>  that need processing when the transaction commits.
>
> +JBD2 also allows client file systems to implement file system specific
> +commits, which are called ``fast commits``. Fast commits are
> +asynchronous in nature i.e. file systems can call their own commit
> +functions at any time. In order to avoid the race with kjournald
> +thread and other possible fast commits that may be happening in
> +parallel, file systems should first call
> +:c:func:`jbd2_start_async_fc()`. File system can call
> +:c:func:`jbd2_map_fc_buf()` to get buffers reserved for fast
> +commits. Once a fast commit is completed, file system should call
> +:c:func:`jbd2_stop_async_fc()` to indicate and unblock other
> +committers and the kjournald thread.  After performing either a fast
> +or a full commit, JBD2 calls ``journal->j_fc_cleanup_cb`` to allow
> +file systems to perform cleanups for their internal fast commit
> +related data structures. At the replay time, JBD2 passes each and
> +every fast commit block to the file system via
> +``journal->j_fc_replay_cb``. Ext4 effectively uses this fast commit
> +mechanism to improve journal commit performance.
> +
>  JBD2 also provides a way to block all transaction updates via
>  :c:func:`jbd2_journal_lock_updates()` /
>  :c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
> --
> 2.24.1.735.g03f4e72817-goog
>
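
For readers following the quoted patch, the fixed part of the "Fast Commit
Block" table and its validity rule can be sketched in C as below. The struct
and function names are hypothetical, the packed attribute is a GCC/Clang
extension, and byte-order conversion is ignored. Note that the table lists
the inode copy at offset 0x2B, whereas the fields shown end at 0x1B when
packed, so the sketch stops at `fc_ino` rather than guess at the gap. The
wrap-safe signed comparison of transaction IDs is an assumption about how
such a check would be written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FC_MAGIC_SKETCH 0xE2540090u /* magic value from the table */

/* Common jbd2 block header: three 32-bit fields, 12 bytes in total. */
struct hdr_sketch {
	uint32_t h_magic;
	uint32_t h_blocktype;
	uint32_t h_sequence; /* transaction ID */
};

/* Fixed-size fields of the fast commit block, offsets as in the table. */
struct fc_hdr_sketch {
	struct hdr_sketch hdr;  /* 0x00 */
	uint32_t fc_magic;      /* 0x0C */
	uint8_t  fc_features;   /* 0x10 */
	uint16_t fc_num_tlvs;   /* 0x11 */
	uint32_t fc_len;        /* 0x13 */
	uint32_t fc_ino;        /* 0x17 */
} __attribute__((packed));

_Static_assert(offsetof(struct fc_hdr_sketch, fc_magic) == 0x0C, "fc_magic");
_Static_assert(offsetof(struct fc_hdr_sketch, fc_features) == 0x10, "fc_features");
_Static_assert(offsetof(struct fc_hdr_sketch, fc_num_tlvs) == 0x11, "fc_num_tlvs");
_Static_assert(offsetof(struct fc_hdr_sketch, fc_len) == 0x13, "fc_len");
_Static_assert(offsetof(struct fc_hdr_sketch, fc_ino) == 0x17, "fc_ino");

/* Magic check; compares host-order values and ignores endianness. */
static int fc_magic_ok(const struct fc_hdr_sketch *h)
{
	return h->fc_magic == FC_MAGIC_SKETCH;
}

/*
 * A fast commit block is replayed only if no full commit block with a
 * greater transaction ID exists. The signed difference keeps the
 * comparison correct when the 32-bit ID counter wraps around.
 */
static int fc_block_needs_replay(uint32_t fc_tid, uint32_t newest_full_tid)
{
	return (int32_t)(newest_full_tid - fc_tid) <= 0;
}
```

The static assertions make the compiler check the documented offsets, so any
change to the table that is not mirrored in the struct fails at build time.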
xiaohui li Jan. 10, 2020, 9:49 a.m. UTC | #2
hi Harshad:

I apologize for the overly direct and improper wording in my last email.

What I wanted to say is that an iterative software development
approach may work better for getting these patches applied. We cannot
do everything in the first release; it is good enough if it delivers
just one major function of ext4 fast commit.

I now understand that the development work for this fast commit
feature is difficult and complex, so I am very grateful for your work
on ext4 fast commit development. :)

On Thu, Jan 9, 2020 at 12:29 PM xiaohui li
<lixiaohui1@xiaomi.corp-partner.google.com> wrote:
> >
> > +Fast Commits
> > +~~~~~~~~~~~~
> > +
> > +Ext4 also implements fast commits and integrates it with JBD2 journalling.
> > +Fast commits store metadata changes made to the file system as inode level
> > +diff. In other words, each fast commit block identifies updates made to
> > +a particular inode and collectively they represent total changes made to
> > +the file system.
> > +
> > +A fast commit is valid only if there is no full commit after that particular
> > +fast commit. Because of this feature, fast commit blocks can be reused by
> > +the following transactions.
> > +
> > +Each fast commit block stores updates to 1 particular inode. Updates in each
> > +fast commit block are one of the 2 types:
> > +- Data updates (add range / delete range)
> > +- Directory entry updates (Add / remove links)
> > +
> > +Fast commit blocks must be replayed in the order in which they appear on disk.
> > +That's because directory entry updates are written in fast commit blocks
> > +in the order in which they are applied on the file system before crash.
> > +Changing the order of replaying for directory entry updates may result
> > +in inconsistent file system. Note that only directory entry updates need
> > +ordering, data updates, since they apply to only one inode, do not require
> > +ordered replay. Also, fast commits guarantee that file system is in consistent
> > +state after replay of each fast commit block as long as order of replay has
> > +been followed.
> > +
> > +Note that directory inode updates are never directly recorded in fast commits.
> > +Just like other file system level metaata, updates to directories are always
> > +implied based on directory entry updates stored in fast commit blocks.
> > +
> > +Based on which directory entry updates are committed with an inode, fast
> > +commits have two modes of operation:
> > +
> > +- Hard Consistency (default)
> > +- Soft Consistency (can be enabled by setting mount flag "fc_soft_consistency")
> > +
> > +When hard consistency is enabled, fast commit guarantees that all the updates
> > +will be committed. After a successful replay of fast commits blocks
> > +in hard consistency mode, the entire file system would be in the same state as
> > +that when fsync() returned before crash. This guarantee is similar to what
> > +jbd2 gives.
> > +
> > +With soft consistency, file system only guarantees consistency for the
> > +inode in question. In this mode, file system will try to write as less data
> > +to the backed as possible during the commit time. To be precise, file system
> > +records all the data updates for the inode in question and directory updates
> > +that are required for guaranteeing consistency of the inode in question.
> > +
> >  Layout
> >  ~~~~~~
> >
> >  Generally speaking, the journal has this format:
> >
> >  .. list-table::
> > -   :widths: 16 48 16
> > +   :widths: 16 48 16 18
> >     :header-rows: 1
> >
> >     * - Superblock
> >       - descriptor\_block (data\_blocks or revocation\_block) [more data or
> >         revocations] commmit\_block
> >       - [more transactions...]
> > +     - [Fast commits...]
> >     * -
> >       - One transaction
> >       -
> > +     -
> >
> >  Notice that a transaction begins with either a descriptor and some data,
> >  or a block revocation list. A finished transaction always ends with a
> > @@ -76,7 +128,7 @@ The journal superblock will be in the next full block after the
> >  superblock.
> >
> >  .. list-table::
> > -   :widths: 12 12 12 32 12
> > +   :widths: 12 12 12 32 12 12
> >     :header-rows: 1
> >
> >     * - 1024 bytes of padding
> > @@ -85,11 +137,13 @@ superblock.
> >       - descriptor\_block (data\_blocks or revocation\_block) [more data or
> >         revocations] commmit\_block
> >       - [more transactions...]
> > +     - [Fast commits...]
> >     * -
> >       -
> >       -
> >       - One transaction
> >       -
> > +     -
> >
> >  Block Header
> >  ~~~~~~~~~~~~
> > @@ -609,3 +663,64 @@ bytes long (but uses a full block):
> >       - h\_commit\_nsec
> >       - Nanoseconds component of the above timestamp.
> >
> > +Fast Commit Block
> > +~~~~~~~~~~~~~~~~~
> > +
> > +The fast commit block indicates an append to the last commit block
> > +that was written to the journal. One fast commit block records updates
> > +to one inode. So, typically you would find as many fast commit blocks
> > +as the number of inodes that got changed since the last commit. A fast
> > +commit block is valid only if there is no commit block present with
> > +transaction ID greater than that of the fast commit block. If such a
> > +block a present, then there is no need to replay the fast commit
> > +block.
> > +
> > +.. list-table::
> > +   :widths: 8 8 24 40
> > +   :header-rows: 1
> > +
> > +   * - Offset
> > +     - Type
> > +     - Name
> > +     - Descriptor
> > +   * - 0x0
> > +     - journal\_header\_s
> > +     - (open coded)
> > +     - Common block header.
> > +   * - 0xC
> > +     - \_\_le32
> > +     - fc\_magic
> > +     - Magic value which should be set to 0xE2540090. This identifies
> > +       that this block is a fast commit block.
> > +   * - 0x10
> > +     - \_\_u8
> > +     - fc\_features
> > +     - Features used by this fast commit block.
> > +   * - 0x11
> > +     - \_\_le16
> > +     - fc_num_tlvs
> > +     - Number of TLVs contained in this fast commit block
> > +   * - 0x13
> > +     - \_\_le32
> > +     - \_\_fc\_len
> > +     - Length of the fast commit block in terms of number of blocks
> > +   * - 0x17
> > +     - \_\_le32
> > +     - fc\_ino
> > +     - Inode number of the inode that will be recovered using this fast commit
> > +   * - 0x2B
> > +     - struct ext4\_inode
> > +     - inode
> > +     - On-disk copy of the inode at the commit time
> > +   * - <Variable based on inode size>
> > +     - struct ext4\_fc\_tl
> > +     - Array of struct ext4\_fc\_tl
> > +     - The actual delta with the last commit. Starting at this offset,
> > +       there is an array of TLVs that indicates which all extents
> > +       should be present in the corresponding inode. Currently,
> > +       following tags are supported: EXT4\_FC\_TAG\_EXT (extent that
> > +       should be present in the inode), EXT4\_FC\_TAG\_HOLE (extent
> > +       that should be removed from the inode), EXT4\_FC\_TAG\_ADD\_DENTRY
> > +       (dentry that should be linked), EXT4\_FC\_TAG\_DEL\_DENTRY
> > +       (dentry that should be unlinked), EXT4\_FC\_TAG\_CREATE\_DENTRY
> > +       (dentry that for the file that should be created for the first time).
> > diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
> > index 58ce6b395206..1cb116ab27ab 100644
> > --- a/Documentation/filesystems/journalling.rst
> > +++ b/Documentation/filesystems/journalling.rst
> > @@ -115,6 +115,24 @@ called after each transaction commit. You can also use
> >  ``transaction->t_private_list`` for attaching entries to a transaction
> >  that need processing when the transaction commits.
> >
> > +JBD2 also allows client file systems to implement file system specific
> > +commits which are called as ``fast commits``. Fast commits are
> > +asynchronous in nature i.e. file systems can call their own commit
> > +functions at any time. In order to avoid the race with kjournald
> > +thread and other possible fast commits that may be happening in
> > +parallel, file systems should first call
> > +:c:func:`jbd2_start_async_fc()`. File system can call
> > +:c:func:`jbd2_map_fc_buf()` to get buffers reserved for fast
> > +commits. Once a fast commit is completed, file system should call
> > +:c:func:`jbd2_stop_async_fc()` to indicate and unblock other
> > +committers and the kjournald thread.  After performing either a fast
> > +or a full commit, JBD2 calls ``journal->j_fc_cleanup_cb`` to allow
> > +file systems to perform cleanups for their internal fast commit
> > +related data structures. At the replay time, JBD2 passes each and
> > +every fast commit block to the file system via
> > +``journal->j_fc_replay_cb``. Ext4 effectively uses this fast commit
> > +mechanism to improve journal commit performance.
> > +
> >  JBD2 also provides a way to block all transaction updates via
> >  :c:func:`jbd2_journal_lock_updates()` /
> >  :c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
> > --
> > 2.24.1.735.g03f4e72817-goog
> >
harshad shirwadkar Jan. 11, 2020, 2:13 a.m. UTC | #3
Hi Xiaohui,

No worries. I think the reason we have so many patches is that this
feature requires an on-disk format change and it's better to get
everything in one go, so that the size of the testing matrix doesn't
become too big :)

- Harshad

On Fri, Jan 10, 2020 at 1:50 AM xiaohui li
<lixiaohui1@xiaomi.corp-partner.google.com> wrote:
>
> Hi Harshad,
>
> I apologize for some of the direct and improper wording in my last email.
>
> What I wanted to say in my last email is that an iterative
> software development method may work better for applying these patches.
> For the first release version, we cannot do everything. It is good
> enough if we finish just one major function in the ext4 fast
> commit area.
>
> I understand that the development work for this fast commit function
> is difficult and complex,
> so I am very grateful for your work on ext4 fast commit development. :)
>
> On Thu, Jan 9, 2020 at 12:29 PM xiaohui li
> <lixiaohui1@xiaomi.corp-partner.google.com> wrote:
> >
> > Hi Harshad,
> > cc Ted
> >
> > Sorry, but I have some thoughts about this fast commit work which I
> > want to share with you.
> >
> > There are nearly 20 patches in this v4 fast commit series, which is a lot.
> > I wonder whether it is necessary to make this fast commit function so complex.
> >
> > Maybe I have not understood the difficulty of the fast commit coding work,
> > so I would appreciate it very much if you could give a more detailed
> > description of how the v4 fast commit patches relate to each other,
> > especially the reason why so many patches are needed.
> >
> > From my viewpoint, the purpose of this fast commit function is
> > to resolve the problem of ext4 fsync taking so much time.
> > First, we need to resolve some actual customer problems which exist
> > in ext4 filesystems when doing this fast commit function.
> >
> > So, in my opinion, the first release version of fast commit only needs
> > to accomplish the goal of reducing the time cost of fsync caused by the
> > jbd2 ordering shortcoming described in the iJournaling paper.
> > It need not do so many other unnecessary things.
> >
> > If I have free time, I will continue reviewing these patches.
> > Thank you for your reply.
> >
> > On Tue, Dec 24, 2019 at 4:14 PM Harshad Shirwadkar
> > <harshadshirwadkar@gmail.com> wrote:
> > >
> > > This patch series adds support for fast commits which is a simplified
> > > version of the scheme proposed by Park and Shin, in their paper,
> > > "iJournaling: Fine-Grained Journaling for Improving the Latency of
> > > Fsync System Call"[1]. The basic idea of fast commits is to make JBD2
> > > give the client file system an opportunity to perform a faster
> > > commit. Only if the file system cannot perform such a commit
> > > operation, then JBD2 should fall back to traditional commits.
> > >
> > > Because JBD2 operates at block granularity, for every file system
> > > metadata update it commits all the changed blocks are written to the
> > > journal at commit time. This is inefficient because updates to some
> > > blocks that JBD2 commits are derivable from some other blocks. For
> > > example, if a new extent is added to an inode, then corresponding
> > > updates to the inode table, the block bitmap, the group descriptor and
> > > the superblock can be derived based on just the extent information and
> > > the corresponding inode information. So, if we take this relationship
> > > between blocks into account and replay the journalled blocks smartly,
> > > we could increase performance of file system commits significantly.
> > >
> > > Fast commits introduced in this patch have two main contributions:
> > >
> > > (1) Making JBD2 fast commit aware, so that clients of JBD2 can
> > >     implement fast commits
> > >
> > > (2) Add support in ext4 to use JBD2's new interfaces and implement
> > >     fast commits.
> > >
> > > Ext4 supports two modes of fast commits: 1) fast commits with hard
> > > consistency guarantees 2) fast commits with soft consistency guarantees
> > >
> > > When hard consistency is enabled, fast commit guarantees that all the
> > > updates will be committed. After a successful replay of fast commits
> > > blocks in hard consistency mode, the entire file system would be in
> > > the same state as that when fsync() returned before crash. This
> > > guarantee is similar to what jbd2 gives with full commits.
> > >
> > > With soft consistency, file system only guarantees consistency for the
> > > inode in question. In this mode, file system will try to write as less
> > > data to the backend as possible during the commit time. To be precise,
> > > file system records all the data updates for the inode in question and
> > > directory updates that are required for guaranteeing consistency of the
> > > inode in question.
> > >
> > > In our evaluations, fast commits with hard consistency performed
> > > better than fast commits with soft consistency. That's because with
> > > hard consistency, a fast commit often ends up committing other inodes
> > > together, while with soft consistency commits get serialized. Future
> > > work can look at creating hybrid approach between the two extremes
> > > that are there in this patchset.
> > >
> > > Testing
> > > -------
> > >
> > > e2fsprogs was updated to set fast commit feature flag and to ignore
> > > fast commit blocks during e2fsck.
> > >
> > > https://github.com/harshadjs/e2fsprogs.git
> > >
> > > After applying all the patches in this series, following runs of
> > > xfstests were performed:
> > >
> > > - kvm-xfstest.sh -g log -c 4k
> > > - kvm-xfstests.sh smoke
> > >
> > > All the log tests were successful and smoke tests didn't introduce any
> > > additional failures.
> > >
> > > Performance Evaluation
> > > ----------------------
> > >
> > > Ext4 file system performance was tested with full commits, with fast
> > > commits with soft consistency and with fast commits with hard
> > > consistency. fs_mark benchmark showed that depending on the file size,
> > > performance improvement was seen up to 50%. Soft fast commits performed
> > > slightly worse than hard fast commits. But soft fast commits ended up
> > > writing slightly lesser number of blocks on disk.
> > >
> > > Changes since V3:
> > >
> > > - Removed invocation of fast commits from the jbd2 thread.
> > >
> > > - Removed sub transaction ID from journal_t.
> > >
> > > - Added rename, truncate, punch hole support.
> > >
> > > - Added soft consistency mode and hard consistency mode.
> > >
> > > - More bug fixes and refactoring.
> > >
> > > - Added better debugging support: more tracepoints and debug mount
> > >   options.
> > >
> > > Harshad Shirwadkar(20):
> > >  ext4: add debug mount option to test fast commit replay
> > >  ext4: add fast commit replay path
> > >  ext4: disable certain features in replay path
> > >  ext4: add idempotent helpers to manipulate bitmaps
> > >  ext4: fast commit recovery path preparation
> > >  jbd2: add fast commit recovery path support
> > >  ext4: main commit routine for fast commits
> > >  jbd2: add new APIs for commit path of fast commits
> > >  ext4: add fast commit on-disk format structs and helpers
> > >  ext4: add fast commit track points
> > >  ext4: break ext4_unlink() and ext4_link()
> > >  ext4: add inode tracking and ineligible marking routines
> > >  ext4: add directory entry tracking routines
> > >  ext4: add generic diff tracking routines and range tracking
> > >  jbd2: fast commit main commit path changes
> > >  jbd2: disable fast commits if journal is empty
> > >  jbd2: add fast commit block tracker variables
> > >  ext4, jbd2: add fast commit initialization routines
> > >  ext4: add handling for extended mount options
> > >  ext4: update docs for fast commit feature
> > >
> > >  Documentation/filesystems/ext4/journal.rst |  127 ++-
> > >  Documentation/filesystems/journalling.rst  |   18 +
> > >  fs/ext4/acl.c                              |    1 +
> > >  fs/ext4/balloc.c                           |   10 +-
> > >  fs/ext4/ext4.h                             |  127 +++
> > >  fs/ext4/ext4_jbd2.c                        | 1484 +++++++++++++++++++++++++++-
> > >  fs/ext4/ext4_jbd2.h                        |   71 ++
> > >  fs/ext4/extents.c                          |    5 +
> > >  fs/ext4/extents_status.c                   |   24 +
> > >  fs/ext4/fsync.c                            |    2 +-
> > >  fs/ext4/ialloc.c                           |  165 +++-
> > >  fs/ext4/inline.c                           |    3 +
> > >  fs/ext4/inode.c                            |   77 +-
> > >  fs/ext4/ioctl.c                            |    9 +-
> > >  fs/ext4/mballoc.c                          |  157 ++-
> > >  fs/ext4/mballoc.h                          |    2 +
> > >  fs/ext4/migrate.c                          |    1 +
> > >  fs/ext4/namei.c                            |  172 ++--
> > >  fs/ext4/super.c                            |   72 +-
> > >  fs/ext4/xattr.c                            |    6 +
> > >  fs/jbd2/commit.c                           |   61 ++
> > >  fs/jbd2/journal.c                          |  217 +++-
> > >  fs/jbd2/recovery.c                         |   67 +-
> > >  include/linux/jbd2.h                       |   83 +-
> > >  include/trace/events/ext4.h                |  208 +++-
> > >  25 files changed, 3037 insertions(+), 132 deletions(-)
> > > ---
> > >  Documentation/filesystems/ext4/journal.rst | 127 ++++++++++++++++++++-
> > >  Documentation/filesystems/journalling.rst  |  18 +++
> > >  2 files changed, 139 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
> > > index ea613ee701f5..f94e66f2f8c4 100644
> > > --- a/Documentation/filesystems/ext4/journal.rst
> > > +++ b/Documentation/filesystems/ext4/journal.rst
> > > @@ -29,10 +29,10 @@ safest. If ``data=writeback``, dirty data blocks are not flushed to the
> > >  disk before the metadata are written to disk through the journal.
> > >
> > >  The journal inode is typically inode 8. The first 68 bytes of the
> > > -journal inode are replicated in the ext4 superblock. The journal itself
> > > -is normal (but hidden) file within the filesystem. The file usually
> > > -consumes an entire block group, though mke2fs tries to put it in the
> > > -middle of the disk.
> > > +journal inode are replicated in the ext4 superblock. The journal
> > > +itself is normal (but hidden) file within the filesystem. The file
> > > +usually consumes an entire block group, though mke2fs tries to put it
> > > +in the middle of the disk.
> > >
> > >  All fields in jbd2 are written to disk in big-endian order. This is the
> > >  opposite of ext4.
> > > @@ -42,22 +42,74 @@ NOTE: Both ext4 and ocfs2 use jbd2.
> > >  The maximum size of a journal embedded in an ext4 filesystem is 2^32
> > >  blocks. jbd2 itself does not seem to care.
> > >
> > > +Fast Commits
> > > +~~~~~~~~~~~~
> > > +
> > > +Ext4 also implements fast commits and integrates it with JBD2 journalling.
> > > +Fast commits store metadata changes made to the file system as inode level
> > > +diff. In other words, each fast commit block identifies updates made to
> > > +a particular inode and collectively they represent total changes made to
> > > +the file system.
> > > +
> > > +A fast commit is valid only if there is no full commit after that particular
> > > +fast commit. Because of this feature, fast commit blocks can be reused by
> > > +the following transactions.
> > > +
> > > +Each fast commit block stores updates to 1 particular inode. Updates in each
> > > +fast commit block are one of the 2 types:
> > > +- Data updates (add range / delete range)
> > > +- Directory entry updates (Add / remove links)
> > > +
> > > +Fast commit blocks must be replayed in the order in which they appear on disk.
> > > +That's because directory entry updates are written in fast commit blocks
> > > +in the order in which they are applied on the file system before crash.
> > > +Changing the order of replaying for directory entry updates may result
> > > +in inconsistent file system. Note that only directory entry updates need
> > > +ordering, data updates, since they apply to only one inode, do not require
> > > +ordered replay. Also, fast commits guarantee that file system is in consistent
> > > +state after replay of each fast commit block as long as order of replay has
> > > +been followed.
> > > +
> > > +Note that directory inode updates are never directly recorded in fast commits.
> > > +Just like other file system level metaata, updates to directories are always
> > > +implied based on directory entry updates stored in fast commit blocks.
> > > +
> > > +Based on which directory entry updates are committed with an inode, fast
> > > +commits have two modes of operation:
> > > +
> > > +- Hard Consistency (default)
> > > +- Soft Consistency (can be enabled by setting mount flag "fc_soft_consistency")
> > > +
> > > +When hard consistency is enabled, fast commit guarantees that all the updates
> > > +will be committed. After a successful replay of fast commits blocks
> > > +in hard consistency mode, the entire file system would be in the same state as
> > > +that when fsync() returned before crash. This guarantee is similar to what
> > > +jbd2 gives.
> > > +
> > > +With soft consistency, file system only guarantees consistency for the
> > > +inode in question. In this mode, file system will try to write as less data
> > > +to the backed as possible during the commit time. To be precise, file system
> > > +records all the data updates for the inode in question and directory updates
> > > +that are required for guaranteeing consistency of the inode in question.
> > > +
> > >  Layout
> > >  ~~~~~~
> > >
> > >  Generally speaking, the journal has this format:
> > >
> > >  .. list-table::
> > > -   :widths: 16 48 16
> > > +   :widths: 16 48 16 18
> > >     :header-rows: 1
> > >
> > >     * - Superblock
> > >       - descriptor\_block (data\_blocks or revocation\_block) [more data or
> > >         revocations] commmit\_block
> > >       - [more transactions...]
> > > +     - [Fast commits...]
> > >     * -
> > >       - One transaction
> > >       -
> > > +     -
> > >
> > >  Notice that a transaction begins with either a descriptor and some data,
> > >  or a block revocation list. A finished transaction always ends with a
> > > @@ -76,7 +128,7 @@ The journal superblock will be in the next full block after the
> > >  superblock.
> > >
> > >  .. list-table::
> > > -   :widths: 12 12 12 32 12
> > > +   :widths: 12 12 12 32 12 12
> > >     :header-rows: 1
> > >
> > >     * - 1024 bytes of padding
> > > @@ -85,11 +137,13 @@ superblock.
> > >       - descriptor\_block (data\_blocks or revocation\_block) [more data or
> > >         revocations] commmit\_block
> > >       - [more transactions...]
> > > +     - [Fast commits...]
> > >     * -
> > >       -
> > >       -
> > >       - One transaction
> > >       -
> > > +     -
> > >
> > >  Block Header
> > >  ~~~~~~~~~~~~
> > > @@ -609,3 +663,64 @@ bytes long (but uses a full block):
> > >       - h\_commit\_nsec
> > >       - Nanoseconds component of the above timestamp.
> > >
> > > +Fast Commit Block
> > > +~~~~~~~~~~~~~~~~~
> > > +
> > > +The fast commit block indicates an append to the last commit block
> > > +that was written to the journal. One fast commit block records updates
> > > +to one inode. So, typically you would find as many fast commit blocks
> > > +as the number of inodes that got changed since the last commit. A fast
> > > +commit block is valid only if there is no commit block present with
> > > +transaction ID greater than that of the fast commit block. If such a
> > > +block a present, then there is no need to replay the fast commit
> > > +block.
> > > +
> > > +.. list-table::
> > > +   :widths: 8 8 24 40
> > > +   :header-rows: 1
> > > +
> > > +   * - Offset
> > > +     - Type
> > > +     - Name
> > > +     - Descriptor
> > > +   * - 0x0
> > > +     - journal\_header\_s
> > > +     - (open coded)
> > > +     - Common block header.
> > > +   * - 0xC
> > > +     - \_\_le32
> > > +     - fc\_magic
> > > +     - Magic value which should be set to 0xE2540090. This identifies
> > > +       that this block is a fast commit block.
> > > +   * - 0x10
> > > +     - \_\_u8
> > > +     - fc\_features
> > > +     - Features used by this fast commit block.
> > > +   * - 0x11
> > > +     - \_\_le16
> > > +     - fc_num_tlvs
> > > +     - Number of TLVs contained in this fast commit block
> > > +   * - 0x13
> > > +     - \_\_le32
> > > +     - \_\_fc\_len
> > > +     - Length of the fast commit block in terms of number of blocks
> > > +   * - 0x17
> > > +     - \_\_le32
> > > +     - fc\_ino
> > > +     - Inode number of the inode that will be recovered using this fast commit
> > > +   * - 0x2B
> > > +     - struct ext4\_inode
> > > +     - inode
> > > +     - On-disk copy of the inode at the commit time
> > > +   * - <Variable based on inode size>
> > > +     - struct ext4\_fc\_tl
> > > +     - Array of struct ext4\_fc\_tl
> > > +     - The actual delta with the last commit. Starting at this offset,
> > > +       there is an array of TLVs that indicates which all extents
> > > +       should be present in the corresponding inode. Currently,
> > > +       following tags are supported: EXT4\_FC\_TAG\_EXT (extent that
> > > +       should be present in the inode), EXT4\_FC\_TAG\_HOLE (extent
> > > +       that should be removed from the inode), EXT4\_FC\_TAG\_ADD\_DENTRY
> > > +       (dentry that should be linked), EXT4\_FC\_TAG\_DEL\_DENTRY
> > > +       (dentry that should be unlinked), EXT4\_FC\_TAG\_CREATE\_DENTRY
> > > +       (dentry that for the file that should be created for the first time).
> > > diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
> > > index 58ce6b395206..1cb116ab27ab 100644
> > > --- a/Documentation/filesystems/journalling.rst
> > > +++ b/Documentation/filesystems/journalling.rst
> > > @@ -115,6 +115,24 @@ called after each transaction commit. You can also use
> > >  ``transaction->t_private_list`` for attaching entries to a transaction
> > >  that need processing when the transaction commits.
> > >
> > > +JBD2 also allows client file systems to implement file system specific
> > > +commits which are called as ``fast commits``. Fast commits are
> > > +asynchronous in nature i.e. file systems can call their own commit
> > > +functions at any time. In order to avoid the race with kjournald
> > > +thread and other possible fast commits that may be happening in
> > > +parallel, file systems should first call
> > > +:c:func:`jbd2_start_async_fc()`. File system can call
> > > +:c:func:`jbd2_map_fc_buf()` to get buffers reserved for fast
> > > +commits. Once a fast commit is completed, file system should call
> > > +:c:func:`jbd2_stop_async_fc()` to indicate and unblock other
> > > +committers and the kjournald thread.  After performing either a fast
> > > +or a full commit, JBD2 calls ``journal->j_fc_cleanup_cb`` to allow
> > > +file systems to perform cleanups for their internal fast commit
> > > +related data structures. At the replay time, JBD2 passes each and
> > > +every fast commit block to the file system via
> > > +``journal->j_fc_replay_cb``. Ext4 effectively uses this fast commit
> > > +mechanism to improve journal commit performance.
> > > +
> > >  JBD2 also provides a way to block all transaction updates via
> > >  :c:func:`jbd2_journal_lock_updates()` /
> > >  :c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
> > > --
> > > 2.24.1.735.g03f4e72817-goog
> > >
Theodore Ts'o Jan. 12, 2020, 3:45 a.m. UTC | #4
On Thu, Jan 09, 2020 at 12:29:01PM +0800, xiaohui li wrote:
> Maybe I have not understood the difficulty of the fast commit coding work,
> so I would appreciate it very much if you could give a more detailed
> description of how the v4 fast commit patches relate to each other,
> especially the reason why so many patches are needed.
>
> From my viewpoint, the purpose of this fast commit function is
> to resolve the problem of ext4 fsync taking so much time.
> First, we need to resolve some actual customer problems which exist
> in ext4 filesystems when doing this fast commit function.
>
> So, in my opinion, the first release version of fast commit only needs
> to accomplish the goal of reducing the time cost of fsync caused by the
> jbd2 ordering shortcoming described in the iJournaling paper.
> It need not do so many other unnecessary things.

As Harshad has mentioned, one of the reasons why an incremental
approach does not make sense is that once we release a version of fast
commit into a mainline kernel, we have to worry about what happens if
users start trying to use it, and we have to provide backwards
compatibility for it.  So if we were to break up fast commit into 5
parts, then we would have to allocate 5 feature bits, and we would
have to support each version of fast commit --- essentially forever.

As for why we are doing this, we absolutely have a specific use
case in mind, and that's to improve ext4's performance when used on an
NFS server.  The NFS protocol requires that any file system operation
requested by a client is persisted before the server sends an
acknowledgement back to the client.  For workloads that are heavy
with metadata updates, avoiding the need to do a full jbd2 commit for
every NFS RPC request which modifies metadata will make a big
difference to the NFS server's performance.

This is why we are interested in making things like renames fast
commit eligible, and not just the smaller set of system calls needed
by (for example) SQLite.

Regards,

						- Ted
Patch

diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
index ea613ee701f5..f94e66f2f8c4 100644
--- a/Documentation/filesystems/ext4/journal.rst
+++ b/Documentation/filesystems/ext4/journal.rst
@@ -29,10 +29,10 @@  safest. If ``data=writeback``, dirty data blocks are not flushed to the
 disk before the metadata are written to disk through the journal.
 
 The journal inode is typically inode 8. The first 68 bytes of the
-journal inode are replicated in the ext4 superblock. The journal itself
-is normal (but hidden) file within the filesystem. The file usually
-consumes an entire block group, though mke2fs tries to put it in the
-middle of the disk.
+journal inode are replicated in the ext4 superblock. The journal
+itself is a normal (but hidden) file within the filesystem. The file
+usually consumes an entire block group, though mke2fs tries to put it
+in the middle of the disk.
 
 All fields in jbd2 are written to disk in big-endian order. This is the
 opposite of ext4.
@@ -42,22 +42,74 @@  NOTE: Both ext4 and ocfs2 use jbd2.
 The maximum size of a journal embedded in an ext4 filesystem is 2^32
 blocks. jbd2 itself does not seem to care.
 
+Fast Commits
+~~~~~~~~~~~~
+
+Ext4 also implements fast commits and integrates them with JBD2 journalling.
+Fast commits store metadata changes made to the file system as inode-level
+diffs. In other words, each fast commit block identifies updates made to
+a particular inode, and collectively they represent the total changes made
+to the file system.
+
+A fast commit is valid only if there is no full commit after that particular
+fast commit. Because of this feature, fast commit blocks can be reused by
+the following transactions.
+
+Each fast commit block stores updates to one inode, of one of two types:
+
+- Data updates (add range / delete range)
+- Directory entry updates (add / remove links)
+
+Fast commit blocks must be replayed in the order in which they appear on disk.
+That's because directory entry updates are written in fast commit blocks
+in the order in which they were applied to the file system before the crash.
+Changing the replay order of directory entry updates may result in an
+inconsistent file system. Note that only directory entry updates need
+ordering; data updates, since they apply to only one inode, do not require
+ordered replay. Also, fast commits guarantee that the file system is in a
+consistent state after replay of each fast commit block, as long as the
+replay order has been followed.
+
+Note that directory inode updates are never directly recorded in fast commits.
+Just like other file system level metadata, updates to directories are always
+implied based on directory entry updates stored in fast commit blocks.
+
+Based on which directory entry updates are committed with an inode, fast
+commits have two modes of operation:
+
+- Hard Consistency (default)
+- Soft Consistency (can be enabled by setting mount flag "fc_soft_consistency")
+
+When hard consistency is enabled, fast commit guarantees that all the updates
+will be committed. After a successful replay of fast commit blocks
+in hard consistency mode, the entire file system will be in the same state as
+it was when fsync() returned before the crash. This guarantee is similar to
+what jbd2 gives.
+
+With soft consistency, the file system guarantees consistency only for the
+inode in question. In this mode, the file system tries to write as little data
+to the backend as possible at commit time. To be precise, the file system
+records all the data updates for the inode in question and directory updates
+that are required for guaranteeing consistency of the inode in question.
+
 Layout
 ~~~~~~
 
 Generally speaking, the journal has this format:
 
 .. list-table::
-   :widths: 16 48 16
+   :widths: 16 48 16 18
    :header-rows: 1
 
    * - Superblock
      - descriptor\_block (data\_blocks or revocation\_block) [more data or
        revocations] commmit\_block
      - [more transactions...]
+     - [Fast commits...]
    * - 
      - One transaction
      -
+     -
 
 Notice that a transaction begins with either a descriptor and some data,
 or a block revocation list. A finished transaction always ends with a
@@ -76,7 +128,7 @@  The journal superblock will be in the next full block after the
 superblock.
 
 .. list-table::
-   :widths: 12 12 12 32 12
+   :widths: 12 12 12 32 12 12
    :header-rows: 1
 
    * - 1024 bytes of padding
@@ -85,11 +137,13 @@  superblock.
      - descriptor\_block (data\_blocks or revocation\_block) [more data or
        revocations] commmit\_block
      - [more transactions...]
+     - [Fast commits...]
    * - 
      -
      -
      - One transaction
      -
+     -
 
 Block Header
 ~~~~~~~~~~~~
@@ -609,3 +663,64 @@  bytes long (but uses a full block):
      - h\_commit\_nsec
      - Nanoseconds component of the above timestamp.
 
+Fast Commit Block
+~~~~~~~~~~~~~~~~~
+
+The fast commit block indicates an append to the last commit block
+that was written to the journal. One fast commit block records updates
+to one inode. So, typically you would find as many fast commit blocks
+as the number of inodes that got changed since the last commit. A fast
+commit block is valid only if there is no commit block present with
+transaction ID greater than that of the fast commit block. If such a
+block is present, then there is no need to replay the fast commit
+block.
+
+.. list-table::
+   :widths: 8 8 24 40
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Descriptor
+   * - 0x0
+     - journal\_header\_s
+     - (open coded)
+     - Common block header.
+   * - 0xC
+     - \_\_le32
+     - fc\_magic
+     - Magic value which should be set to 0xE2540090. This identifies
+       that this block is a fast commit block.
+   * - 0x10
+     - \_\_u8
+     - fc\_features
+     - Features used by this fast commit block.
+   * - 0x11
+     - \_\_le16
+     - fc\_num\_tlvs
+     - Number of TLVs contained in this fast commit block
+   * - 0x13
+     - \_\_le32
+     - \_\_fc\_len
+     - Length of the fast commit block in terms of number of blocks
+   * - 0x17
+     - \_\_le32
+     - fc\_ino
+     - Inode number of the inode that will be recovered using this fast commit
+   * - 0x1B
+     - struct ext4\_inode
+     - inode
+     - On-disk copy of the inode at the commit time
+   * - <Variable based on inode size>
+     - struct ext4\_fc\_tl
+     - Array of struct ext4\_fc\_tl
+     - The actual delta with the last commit. Starting at this offset,
+       there is an array of TLVs that describes the extent and dentry
+       changes to apply to the corresponding inode. Currently, the
+       following tags are supported: EXT4\_FC\_TAG\_EXT (extent that
+       should be present in the inode), EXT4\_FC\_TAG\_HOLE (extent
+       that should be removed from the inode), EXT4\_FC\_TAG\_ADD\_DENTRY
+       (dentry that should be linked), EXT4\_FC\_TAG\_DEL\_DENTRY
+       (dentry that should be unlinked), EXT4\_FC\_TAG\_CREATE\_DENTRY
+       (dentry for a file that is being created for the first time).
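
Editorial note (not part of the patch): the fixed-size prefix of the fast commit block tabulated above can be cross-checked with a small userspace sketch. The struct names `jh_sketch` and `fc_hdr_sketch` and the helper names are hypothetical, chosen only for illustration; the sketch assumes the fields are packed exactly as the offsets in the table imply and that the common block header is 12 bytes. Summing the listed field sizes places the inode copy at 0x1B. The validity helper restates the rule that a fast commit block is stale once a commit block with a greater transaction ID exists, using an assumed wraparound-safe comparison of 32-bit IDs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the 12-byte common block header. */
struct jh_sketch {
    uint32_t h_magic;
    uint32_t h_blocktype;
    uint32_t h_sequence;
};

/*
 * Hypothetical mirror of the fixed fields listed in the table.
 * Assumption: the on-disk layout is packed (GCC/Clang attribute),
 * since the table's offsets 0x11 and 0x13 are not naturally aligned.
 * Byte order conversion is ignored here; only offsets are checked.
 */
struct fc_hdr_sketch {
    struct jh_sketch fc_header; /* 0x00: common block header      */
    uint32_t fc_magic;          /* 0x0C: little-endian on disk     */
    uint8_t  fc_features;       /* 0x10                            */
    uint16_t fc_num_tlvs;       /* 0x11: little-endian on disk     */
    uint32_t fc_len;            /* 0x13: __fc_len in the table     */
    uint32_t fc_ino;            /* 0x17: little-endian on disk     */
    /* The inode copy follows here, then the TLV array.            */
} __attribute__((packed));

static_assert(offsetof(struct fc_hdr_sketch, fc_magic) == 0x0C, "fc_magic");
static_assert(offsetof(struct fc_hdr_sketch, fc_features) == 0x10, "fc_features");
static_assert(offsetof(struct fc_hdr_sketch, fc_num_tlvs) == 0x11, "fc_num_tlvs");
static_assert(offsetof(struct fc_hdr_sketch, fc_len) == 0x13, "fc_len");
static_assert(offsetof(struct fc_hdr_sketch, fc_ino) == 0x17, "fc_ino");
static_assert(sizeof(struct fc_hdr_sketch) == 0x1B, "inode copy offset");

/* Assumed wraparound-safe test: is transaction ID a newer than b? */
static inline int tid_newer(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

/*
 * A fast commit block is worth replaying only if no full commit
 * block with a greater transaction ID has been found.
 */
static inline int fc_block_is_valid(uint32_t fc_tid, uint32_t newest_commit_tid)
{
    return !tid_newer(newest_commit_tid, fc_tid);
}
```

If the real structure carries padding or additional fields, the static assertions would fail at compile time, which is the point of writing the layout down this way.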
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
index 58ce6b395206..1cb116ab27ab 100644
--- a/Documentation/filesystems/journalling.rst
+++ b/Documentation/filesystems/journalling.rst
@@ -115,6 +115,24 @@  called after each transaction commit. You can also use
 ``transaction->t_private_list`` for attaching entries to a transaction
 that need processing when the transaction commits.
 
+JBD2 also allows client file systems to implement file system specific
+commits, which are called ``fast commits``. Fast commits are
+asynchronous in nature, i.e. file systems can call their own commit
+functions at any time. In order to avoid races with the kjournald
+thread and other possible fast commits that may be happening in
+parallel, file systems should first call
+:c:func:`jbd2_start_async_fc()`. The file system can then call
+:c:func:`jbd2_map_fc_buf()` to get buffers reserved for fast
+commits. Once a fast commit is completed, the file system should call
+:c:func:`jbd2_stop_async_fc()` to unblock other
+committers and the kjournald thread.  After performing either a fast
+or a full commit, JBD2 calls ``journal->j_fc_cleanup_cb`` to allow
+file systems to perform cleanups for their internal fast commit
+related data structures. At replay time, JBD2 passes each and
+every fast commit block to the file system via
+``journal->j_fc_replay_cb``. Ext4 uses this fast commit
+mechanism to improve journal commit performance.
+
 JBD2 also provides a way to block all transaction updates via
 :c:func:`jbd2_journal_lock_updates()` /
 :c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a