[v3,12/13] docs: Add fast commit documentation

Message ID 20191001074101.256523-13-harshadshirwadkar@gmail.com
State Changes Requested
Series ext4: add fast commit support

Commit Message

harshad shirwadkar Oct. 1, 2019, 7:41 a.m. UTC
This patch adds the necessary documentation to
Documentation/filesystems/journalling.rst and
Documentation/filesystems/ext4/journal.rst.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 Documentation/filesystems/ext4/journal.rst | 98 ++++++++++++++++++++--
 Documentation/filesystems/journalling.rst  | 22 +++++
 2 files changed, 114 insertions(+), 6 deletions(-)

Comments

Theodore Ts'o Oct. 18, 2019, 1:56 a.m. UTC | #1
On Tue, Oct 01, 2019 at 12:41:01AM -0700, Harshad Shirwadkar wrote:
> +
> +Multiple fast commit blocks are a part of one sub-transaction. To
> +indicate the last block in a fast commit transaction, fc_flags field
> +in the last block in every subtransaction is marked with "LAST" (0x1)
> +flag. A subtransaction is valid only if all the following conditions
> +are met:
> +
> +1) SUBTID of all blocks is either equal to or greater than SUBTID of
> +   the previous fast commit block.
> +2) For every sub-transaction, last block is marked with LAST flag.
> +3) There are no invalid blocks in between.

I'm wondering why we need to support multiple inodes being modified in
a single transaction.  As we currently have defined what can be done,
all updates to an inode should be free standing and not dependent on a
change to another inode, right?  And today, one block only modifies
one inode.

The only reason why we might want to define a sub-transaction as being
composed of multiple inodes, which must all be updated in an
all-or-nothing fashion, is the swap boot inode ioctl, and if that's
the only one, I wonder if it's worth the extra complexity.

Am I missing anything?

					- Ted
Andreas Dilger Oct. 18, 2019, 4:51 a.m. UTC | #2
What about rename or hard link?

Cheers, Andreas

Theodore Ts'o Oct. 18, 2019, 1:28 p.m. UTC | #3
On Fri, Oct 18, 2019 at 01:51:56PM +0900, Andreas Dilger wrote:
> What about rename or hard link?

Neither is currently handled by the fast commit patches, but each
operation can fit inside a single block, so it could be handled as an
update to a single inode.  In the case of rename, we will need to add
some tags to indicate the destination directory and directory entry
name, and whether or not there is a destination inode which needs to
have its refcount dropped and possibly be deleted.

Harshad, we probably should handle them, since in order to support
NFS, the nfs server will send the rename or hard link request,
followed by a commit metadata request, and that commit metadata
request needs to persist the rename or link.  So for the purposes of
accelerating NFS, we should handle these commands.

If we don't handle these commands, we will need to declare the inode
as fast commit ineligible, so that we force a full journal commit when
the commit metadata request is received.

						- Ted
harshad shirwadkar Oct. 31, 2019, 5:34 a.m. UTC | #4
Thanks, good point. I was trying to imitate how a jbd2 commit works, I
guess. There's no real reason to do this in an atomic way. I'll fix
this in the next version.

harshad shirwadkar Oct. 31, 2019, 6:41 a.m. UTC | #5
Also, at a high level, I realized that in order to allow fast commits
to be invoked from the kjournald thread, the whole patch set has become
more complicated than it needs to be. In other words, if we only
support "asynchronous fast commits" in this patch set and worry about
integrating them with the kjournald thread later, we can simplify this
series a whole lot and still retain most of the functionality. Besides,
adding support for fast commits in the kjournald thread would just be
an in-memory change. So, to summarize: 1) fsync() will result in only
the inode in question being fast committed, asynchronously.
2) ext4_nfs_commit_metadata() will result in all the changed inodes
being fast committed, asynchronously as well. 3) We could very well
use fast commits for normal jbd2 periodic commits too, but it's not
clear whether that would add any value, so we'll leave it out of this
patch series. Do you agree with this?

Andreas Dilger Oct. 31, 2019, 6:53 p.m. UTC | #6
On Oct 18, 2019, at 7:28 AM, Theodore Y. Ts'o <tytso@MIT.EDU> wrote:
> 
> On Fri, Oct 18, 2019 at 01:51:56PM +0900, Andreas Dilger wrote:
>> What about rename or hard link?
> 
> Neither is currently handled by the fast commit patches, but each
> operation can fit inside a single block, so it could be handled as an
> update to a single inode.  In the case of rename, we will need to add
> some tags to indicate the destination directory and directory entry
> name, and whether or not there is a destination inode which needs to
> have its refcount dropped and possibly be deleted.
> 
> Harshad, we probably should handle them, since in order to support
> NFS, the nfs server will send the rename or hard link request,
> followed by a commit metadata request, and that commit metadata
> request needs to persist the rename or link.  So for the purposes of
> accelerating NFS, we should handle these commands.
> 
> If we don't handle these commands, we will need to declare the inode
> as fast commit ineligible, so that we force a full journal commit when
> the commit metadata request is received.

As a simplifying assumption, you could limit the case of rename/link
within a single directory?  That handles the common case of "create
temp file, write contents there, sync, rename over original file"
used by most editors, rsync, etc.  The case of cross-directory rename
is much less common in my experience, so it is less important to
optimize that case (if this makes it easier to add to fast commits).

Cheers, Andreas
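
For reference, the "create temp file, write contents there, sync,
rename over original file" pattern mentioned above can be sketched in
user space roughly as follows. This is only an illustration of the
workload being discussed, not part of the patch set. The helper name
atomic_replace() is hypothetical, and the final fsync of the parent
directory is an assumption about what a careful application on Linux
would do.

```python
import os
import tempfile


def atomic_replace(path: str, data: bytes) -> None:
    """Replace `path` with `data` using temp file + fsync + rename."""
    dirname = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory as the target, so the
    # rename stays within a single directory.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # persist the new contents first
        os.replace(tmp_path, path)  # rename over the original file
    except BaseException:
        # Best-effort cleanup if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Assumption: also fsync the directory so the rename itself is
    # persisted (Linux-specific behaviour).
    dir_fd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Each call issues one fsync on the file and one on the directory, which
is the kind of small, single-directory metadata update the thread
suggests a fast commit could cover.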

Patch

diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
index ea613ee701f5..23e7db89fc6a 100644
--- a/Documentation/filesystems/ext4/journal.rst
+++ b/Documentation/filesystems/ext4/journal.rst
@@ -29,10 +29,14 @@  safest. If ``data=writeback``, dirty data blocks are not flushed to the
 disk before the metadata are written to disk through the journal.
 
 The journal inode is typically inode 8. The first 68 bytes of the
-journal inode are replicated in the ext4 superblock. The journal itself
-is normal (but hidden) file within the filesystem. The file usually
-consumes an entire block group, though mke2fs tries to put it in the
-middle of the disk.
+journal inode are replicated in the ext4 superblock. The journal
+itself is a normal (but hidden) file within the filesystem. The file
+usually consumes an entire block group, though mke2fs tries to put it
+in the middle of the disk. Ext4 also utilizes JBD2's fast
+commits. Fast commits store metadata changes to inodes in an
+incremental fashion. A fast commit is valid only if there is no full
+commit after that particular fast commit. Because of this, fast commit
+blocks are overwritten by the following transaction.
 
 All fields in jbd2 are written to disk in big-endian order. This is the
 opposite of ext4.
@@ -48,16 +52,18 @@  Layout
 Generally speaking, the journal has this format:
 
 .. list-table::
-   :widths: 16 48 16
+   :widths: 16 48 16 18
    :header-rows: 1
 
    * - Superblock
      - descriptor\_block (data\_blocks or revocation\_block) [more data or
        revocations] commmit\_block
      - [more transactions...]
+     - [Fast commits...]
    * - 
      - One transaction
      -
+     -
 
 Notice that a transaction begins with either a descriptor and some data,
 or a block revocation list. A finished transaction always ends with a
@@ -76,7 +82,7 @@  The journal superblock will be in the next full block after the
 superblock.
 
 .. list-table::
-   :widths: 12 12 12 32 12
+   :widths: 12 12 12 32 12 12
    :header-rows: 1
 
    * - 1024 bytes of padding
@@ -85,11 +91,13 @@  superblock.
      - descriptor\_block (data\_blocks or revocation\_block) [more data or
        revocations] commmit\_block
      - [more transactions...]
+     - [Fast commits...]
    * - 
      -
      -
      - One transaction
      -
+     -
 
 Block Header
 ~~~~~~~~~~~~
@@ -609,3 +617,81 @@  bytes long (but uses a full block):
      - h\_commit\_nsec
      - Nanoseconds component of the above timestamp.
 
+Fast Commit Block
+~~~~~~~~~~~~~~~~~
+
+The fast commit block indicates an append to the last commit block
+that was written to the journal. One fast commit block records updates
+to one inode. So, typically you would find as many fast commit blocks
+as the number of inodes that got changed since the last commit. A fast
+commit block is valid only if there is no commit block present with a
+transaction ID greater than that of the fast commit block. If such a
+block is present, then there is no need to replay the fast commit
+block.
+
+Multiple fast commit blocks are a part of one sub-transaction. To
+indicate the last block in a fast commit transaction, fc_flags field
+in the last block in every subtransaction is marked with "LAST" (0x1)
+flag. A subtransaction is valid only if all the following conditions
+are met:
+
+1) SUBTID of all blocks is either equal to or greater than SUBTID of
+   the previous fast commit block.
+2) For every sub-transaction, last block is marked with LAST flag.
+3) There are no invalid blocks in between.
+
+.. list-table::
+   :widths: 8 8 24 40
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - journal\_header\_s
+     - (open coded)
+     - Common block header.
+   * - 0xC
+     - \_\_le32
+     - fc\_magic
+     - Magic value which should be set to 0xE2540090. This identifies
+       that this block is a fast commit block.
+   * - 0x10
+     - \_\_le32
+     - fc\_subtid
+     - Sub-transaction ID for this commit block
+   * - 0x14
+     - \_\_u8
+     - fc\_features
+     - Features used by this fast commit block.
+   * - 0x15
+     - \_\_u8
+     - fc\_flags
+     - Flags. 0x1 (LAST): this is the last block in the sub-transaction.
+   * - 0x16
+     - \_\_le16
+     - fc\_num\_tlvs
+     - Number of TLVs contained in this fast commit block
+   * - 0x18
+     - \_\_le32
+     - \_\_fc\_len
+     - Length of the fast commit block in terms of number of blocks
+   * - 0x2c
+     - \_\_le32
+     - fc\_ino
+     - Inode number of the inode that will be recovered using this fast commit
+   * - 0x30
+     - struct ext4\_inode
+     - inode
+     - On-disk copy of the inode at the commit time
+   * - 0x34
+     - struct ext4\_fc\_tl
+     - Array of struct ext4\_fc\_tl
+     - The actual delta relative to the last commit. Starting at this
+       offset, there is an array of TLVs that indicate which extents
+       should be present in the corresponding inode. Currently, the
+       following tags are supported: EXT4\_FC\_TAG\_EXT (extent that
+       should be present in the inode), EXT4\_FC\_TAG\_DNAME (dentry
+       name of the inode), EXT4\_FC\_TAG\_PARENT\_INO (inode number of
+       the directory that should contain the dentry of the inode).
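
The three sub-transaction validity conditions in the hunk above can be
modelled with a short sketch. This is a reading of the documented
rules only, not the replay code from the patch. FcBlock, its `valid`
field (standing in for whatever magic or checksum test marks a block
as invalid) and replayable_subtids() are hypothetical names; only the
LAST flag value 0x1 comes from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

FC_FLAG_LAST = 0x1  # "LAST" flag value given in the documentation


@dataclass
class FcBlock:
    subtid: int          # fc_subtid of the block
    flags: int           # fc_flags of the block
    valid: bool = True   # hypothetical: block passed its sanity checks


def replayable_subtids(blocks: Iterable[FcBlock]) -> List[int]:
    """Return SUBTIDs of the complete sub-transactions, in order."""
    done: List[int] = []
    prev: Optional[int] = None      # SUBTID of the previous block
    pending: Optional[int] = None   # sub-transaction still missing LAST
    for blk in blocks:
        if not blk.valid:
            break  # condition 3: an invalid block ends the scan
        if prev is not None and blk.subtid < prev:
            break  # condition 1: SUBTID must never decrease
        if pending is not None and blk.subtid != pending:
            break  # condition 2: previous sub-transaction lacks LAST
        prev = blk.subtid
        if blk.flags & FC_FLAG_LAST:
            done.append(blk.subtid)
            pending = None
        else:
            pending = blk.subtid
    return done
```

The sketch stops at the first violation and treats everything before
it as replayable, which is an assumption; the text does not say
whether a later valid sub-transaction may still be replayed after a
bad one.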
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
index 58ce6b395206..217f66d67f9d 100644
--- a/Documentation/filesystems/journalling.rst
+++ b/Documentation/filesystems/journalling.rst
@@ -115,6 +115,28 @@  called after each transaction commit. You can also use
 ``transaction->t_private_list`` for attaching entries to a transaction
 that need processing when the transaction commits.
 
+JBD2 also allows client file systems to implement file system specific
+commits, which are called ``fast commits``. File systems that wish
+to use this feature should first set
+``journal->j_fc_commit_callback``. That function is called before
+performing a commit. The file system can call
+:c:func:`jbd2_map_fc_buf()` to get buffers reserved for fast commits.
+If the callback returns 0, JBD2 assumes that the file system performed
+a fast commit and backs off from performing a commit. Otherwise, JBD2
+falls back to a normal full commit. After performing either a fast or
+a full commit, JBD2 calls ``journal->j_fc_cleanup_cb`` to allow file
+systems to clean up their internal fast commit related data
+structures. At replay time, JBD2 passes every fast commit block to
+the file system via ``journal->j_fc_replay_cb``. Ext4 uses this fast
+commit mechanism to improve journal commit performance.
+
+It is possible for file systems to perform fast commits
+asynchronously (without involvement of the journalling thread). All a
+file system needs to do is call :c:func:`jbd2_start_async_fc()`
+before starting the commit and :c:func:`jbd2_stop_async_fc()`
+after the commit. This makes sure that the journalling thread and
+other async fast committers don't interfere with each other.
+
 JBD2 also provides a way to block all transaction updates via
 :c:func:`jbd2_journal_lock_updates()` /
 :c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
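
The commit-time callback flow described in the journalling.rst hunk
above can be summarised with a small model. It only mirrors the
control flow as documented (try the fast commit callback, fall back to
a full commit on a non-zero return, then always run the cleanup
callback). run_commit() and its parameters are hypothetical names for
illustration and do not correspond to JBD2 function signatures.

```python
from typing import Callable


def run_commit(fc_commit_cb: Callable[[], int],
               full_commit: Callable[[], None],
               fc_cleanup_cb: Callable[[], None]) -> str:
    """Model of one commit: returns "fast" or "full"."""
    # A return value of 0 means the file system did a fast commit,
    # so the full commit is skipped.
    if fc_commit_cb() == 0:
        kind = "fast"
    else:
        # Any other return value falls back to a normal full commit.
        full_commit()
        kind = "full"
    # Cleanup runs after either kind of commit.
    fc_cleanup_cb()
    return kind
```

For example, a callback that returns a non-zero value (say, because an
inode was marked fast commit ineligible) would lead to the full commit
path followed by cleanup.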