mbox series

[B/F,0/5] ext4/jbd2: data=journal: write-protect pages on transaction commit

Message ID 20210909202230.886329-1-mfo@canonical.com
Headers show
Series ext4/jbd2: data=journal: write-protect pages on transaction commit | expand

Message

Mauricio Faria de Oliveira Sept. 9, 2021, 8:22 p.m. UTC
BugLink: https://bugs.launchpad.net/bugs/1847340

[Impact]

With mmap()ed files on ext4's data journaling it's possible to change
a mapped page's buffers contents during their jbd2 transaction commit
(as currently nothing prevents/blocks the write access at that time.)

This might happen between the buffers checksum calculation and actual
write to journal, so the (old) checksum is invalid for the (new) data.

If the system crashes after that, but before such journal entry makes
it to the filesystem, the journal replay on the next mount just fails,
and the filesystem now requires fsck. (apparently curtin might set up
/etc/fstab with passno=0, requiring manual intervention.)

    [39751.096455] EXT4-fs: Warning: mounting with data=journal disables delayed allocation and O_DIRECT support!
    [39751.114435] JBD2: Invalid checksum recovering block 87305 in log
    [39751.146133] JBD2: Invalid checksum recovering block 88039 in log
    [39751.195950] JBD2: Invalid checksum recovering block 49633 in log
    [39751.265158] JBD2: recovery failed
    [39751.265163] EXT4-fs (vdc): error loading journal

[Fix]

The fix is to write-protect the pages during journal transaction commit,
so that writes to mapped pages hit a page fault, then ext4's page_mkwrite
hook can block until the commit finishes and the buffers can be modified.

In order to do that, add jbd2 journal callbacks that the filesystems can
customize, called before/after the critical region in transaction commit,
then have ext4 in data journaling mode to write-protect the pages whose
buffers are being committed (and handle cases that need pages redirtied.)

The changes are restricted to the data journaling mode and page_mkwrite
hook, and other modes/paths use the same code/behavior in the callbacks.

[Test Case]

Set up an ext4 filesystem in data journaling mode, and run stress-ng's
mmap file test on it, then crash the system after a bit; check whether
the filesystem can mount again or not (i.e., with jbd2 checksum errors.)

    # mkfs.ext4 $DEV
    # mount -o data=journal $DEV $DIR
    # cd $DIR
    # stress-ng --mmap $((4*$(nproc))) --mmap-file &
    # sleep 60
    # echo c >/proc/sysrq-trigger
    ...
    # mount -o data=journal $DEV $DIR   # PASS/FAIL.
    # dmesg | tail

[Regression Potential]

Regressions would likely manifest in ext4 data journaling mode (which
is not the default mode, 'ordered') with memory mapped access, as the
other modes/paths are largely unaffected by the changes/same behavior.

This has been tested with (x)fstests, that showed no regressions on
data=ordered and data=journal on both Bionic and Focal (with kernel
versions 4.15.0-156-generic and 5.4.0-84-generic) w/in 10 runs each.
And the stress-ng test-case as well. (Numbers/details in the LP bug.)

[Other info]

The patchset is applied on 5.10, so Hirsute (5.11) is already fixed;
only Focal and Bionic need it.

There are little changes in the patches between Focal and Bionic
(mostly minor backport adjustments, mainly due to no vm_fault_t)
but unfortunately that needs separate versions for most patches.

Thanks!

Comments

Tim Gardner Sept. 10, 2021, 2:40 p.m. UTC | #1
Acked-by: Tim Gardner <tim.gardner@canonical.com>

Good test results. Clean cherry-picks for Focal. Backports and 
explanations for Bionic are good.

On 9/9/21 2:22 PM, Mauricio Faria de Oliveira wrote:
> BugLink: https://bugs.launchpad.net/bugs/1847340
> 
> [Impact]
> 
> With mmap()ed files on ext4's data journaling it's possible to change
> a mapped page's buffers contents during their jbd2 transaction commit
> (as currently nothing prevents/blocks the write access at that time.)
> 
> This might happen between the buffers checksum calculation and actual
> write to journal, so the (old) checksum is invalid for the (new) data.
> 
> If the system crashes after that, but before such journal entry makes
> it to the filesystem, the journal replay on the next mount just fails,
> and the filesystem now requires fsck. (apparently curtin might set up
> /etc/fstab with passno=0, requiring manual intervention.)
> 
>      [39751.096455] EXT4-fs: Warning: mounting with data=journal disables delayed allocation and O_DIRECT support!
>      [39751.114435] JBD2: Invalid checksum recovering block 87305 in log
>      [39751.146133] JBD2: Invalid checksum recovering block 88039 in log
>      [39751.195950] JBD2: Invalid checksum recovering block 49633 in log
>      [39751.265158] JBD2: recovery failed
>      [39751.265163] EXT4-fs (vdc): error loading journal
> 
> [Fix]
> 
> The fix is to write-protect the pages during journal transaction commit,
> so that writes to mapped pages hit a page fault, then ext4's page_mkwrite
> hook can block until the commit finishes and the buffers can be modified.
> 
> In order to do that, add jbd2 journal callbacks that the filesystems can
> customize, called before/after the critical region in transaction commit,
> then have ext4 in data journaling mode to write-protect the pages whose
> buffers are being committed (and handle cases that need pages redirtied.)
> 
> The changes are restricted to the data journaling mode and page_mkwrite
> hook, and other modes/paths use the same code/behavior in the callbacks.
> 
> [Test Case]
> 
> Set up an ext4 filesystem in data journaling mode, and run stress-ng's
> mmap file test on it, then crash the system after a bit; check whether
> the filesystem can mount again or not (i.e., with jbd2 checksum errors.)
> 
>      # mkfs.ext4 $DEV
>      # mount -o data=journal $DEV $DIR
>      # cd $DIR
>      # stress-ng --mmap $((4*$(nproc))) --mmap-file &
>      # sleep 60
>      # echo c >/proc/sysrq-trigger
>      ...
>      # mount -o data=journal $DEV $DIR   # PASS/FAIL.
>      # dmesg | tail
> 
> [Regression Potential]
> 
> Regressions would likely manifest in ext4 data journaling mode (which
> is not the default mode, 'ordered') with memory mapped access, as the
> other modes/paths are largely unaffected by the changes/same behavior.
> 
> This has been tested with (x)fstests, that showed no regressions on
> data=ordered and data=journal on both Bionic and Focal (with kernel
> versions 4.15.0-156-generic and 5.4.0-84-generic) w/in 10 runs each.
> And the stress-ng test-case as well. (Numbers/details in the LP bug.)
> 
> [Other info]
> 
> The patchset is applied on 5.10, so Hirsute (5.11) is already fixed;
> only Focal and Bionic need it.
> 
> There are little changes in the patches between Focal and Bionic
> (mostly minor backport adjustments, mainly due to no vm_fault_t)
> but unfortunately that needs separate versions for most patches.
> 
> Thanks!
>
Stefan Bader Sept. 20, 2021, 7:22 a.m. UTC | #2
On 09.09.21 22:22, Mauricio Faria de Oliveira wrote:
> BugLink: https://bugs.launchpad.net/bugs/1847340
> 
> [Impact]
> 
> With mmap()ed files on ext4's data journaling it's possible to change
> a mapped page's buffers contents during their jbd2 transaction commit
> (as currently nothing prevents/blocks the write access at that time.)
> 
> This might happen between the buffers checksum calculation and actual
> write to journal, so the (old) checksum is invalid for the (new) data.
> 
> If the system crashes after that, but before such journal entry makes
> it to the filesystem, the journal replay on the next mount just fails,
> and the filesystem now requires fsck. (apparently curtin might set up
> /etc/fstab with passno=0, requiring manual intervention.)
> 
>      [39751.096455] EXT4-fs: Warning: mounting with data=journal disables delayed allocation and O_DIRECT support!
>      [39751.114435] JBD2: Invalid checksum recovering block 87305 in log
>      [39751.146133] JBD2: Invalid checksum recovering block 88039 in log
>      [39751.195950] JBD2: Invalid checksum recovering block 49633 in log
>      [39751.265158] JBD2: recovery failed
>      [39751.265163] EXT4-fs (vdc): error loading journal
> 
> [Fix]
> 
> The fix is to write-protect the pages during journal transaction commit,
> so that writes to mapped pages hit a page fault, then ext4's page_mkwrite
> hook can block until the commit finishes and the buffers can be modified.
> 
> In order to do that, add jbd2 journal callbacks that the filesystems can
> customize, called before/after the critical region in transaction commit,
> then have ext4 in data journaling mode to write-protect the pages whose
> buffers are being committed (and handle cases that need pages redirtied.)
> 
> The changes are restricted to the data journaling mode and page_mkwrite
> hook, and other modes/paths use the same code/behavior in the callbacks.
> 
> [Test Case]
> 
> Set up an ext4 filesystem in data journaling mode, and run stress-ng's
> mmap file test on it, then crash the system after a bit; check whether
> the filesystem can mount again or not (i.e., with jbd2 checksum errors.)
> 
>      # mkfs.ext4 $DEV
>      # mount -o data=journal $DEV $DIR
>      # cd $DIR
>      # stress-ng --mmap $((4*$(nproc))) --mmap-file &
>      # sleep 60
>      # echo c >/proc/sysrq-trigger
>      ...
>      # mount -o data=journal $DEV $DIR   # PASS/FAIL.
>      # dmesg | tail
> 
> [Regression Potential]
> 
> Regressions would likely manifest in ext4 data journaling mode (which
> is not the default mode, 'ordered') with memory mapped access, as the
> other modes/paths are largely unaffected by the changes/same behavior.
> 
> This has been tested with (x)fstests, that showed no regressions on
> data=ordered and data=journal on both Bionic and Focal (with kernel
> versions 4.15.0-156-generic and 5.4.0-84-generic) w/in 10 runs each.
> And the stress-ng test-case as well. (Numbers/details in the LP bug.)
> 
> [Other info]
> 
> The patchset is applied on 5.10, so Hirsute (5.11) is already fixed;
> only Focal and Bionic need it.
> 
> There are little changes in the patches between Focal and Bionic
> (mostly minor backport adjustments, mainly due to no vm_fault_t)
> but unfortunately that needs separate versions for most patches.
> 
> Thanks!
> 
Ok, modifying ext4 or its components generally is always scary. At least for F 
all changes are cherry picks and required changes for B are small. Also there 
has been some effort doing beforehand regression testing. That should limit the 
risk.

Acked-by: Stefan Bader <stefan.bader@canonical.com>
Kleber Sacilotto de Souza Sept. 23, 2021, 4:25 p.m. UTC | #3
On 09.09.21 22:22, Mauricio Faria de Oliveira wrote:
> BugLink: https://bugs.launchpad.net/bugs/1847340
> 
> [Impact]
> 
> With mmap()ed files on ext4's data journaling it's possible to change
> a mapped page's buffers contents during their jbd2 transaction commit
> (as currently nothing prevents/blocks the write access at that time.)
> 
> This might happen between the buffers checksum calculation and actual
> write to journal, so the (old) checksum is invalid for the (new) data.
> 
> If the system crashes after that, but before such journal entry makes
> it to the filesystem, the journal replay on the next mount just fails,
> and the filesystem now requires fsck. (apparently curtin might set up
> /etc/fstab with passno=0, requiring manual intervention.)
> 
>      [39751.096455] EXT4-fs: Warning: mounting with data=journal disables delayed allocation and O_DIRECT support!
>      [39751.114435] JBD2: Invalid checksum recovering block 87305 in log
>      [39751.146133] JBD2: Invalid checksum recovering block 88039 in log
>      [39751.195950] JBD2: Invalid checksum recovering block 49633 in log
>      [39751.265158] JBD2: recovery failed
>      [39751.265163] EXT4-fs (vdc): error loading journal
> 
> [Fix]
> 
> The fix is to write-protect the pages during journal transaction commit,
> so that writes to mapped pages hit a page fault, then ext4's page_mkwrite
> hook can block until the commit finishes and the buffers can be modified.
> 
> In order to do that, add jbd2 journal callbacks that the filesystems can
> customize, called before/after the critical region in transaction commit,
> then have ext4 in data journaling mode to write-protect the pages whose
> buffers are being committed (and handle cases that need pages redirtied.)
> 
> The changes are restricted to the data journaling mode and page_mkwrite
> hook, and other modes/paths use the same code/behavior in the callbacks.
> 
> [Test Case]
> 
> Set up an ext4 filesystem in data journaling mode, and run stress-ng's
> mmap file test on it, then crash the system after a bit; check whether
> the filesystem can mount again or not (i.e., with jbd2 checksum errors.)
> 
>      # mkfs.ext4 $DEV
>      # mount -o data=journal $DEV $DIR
>      # cd $DIR
>      # stress-ng --mmap $((4*$(nproc))) --mmap-file &
>      # sleep 60
>      # echo c >/proc/sysrq-trigger
>      ...
>      # mount -o data=journal $DEV $DIR   # PASS/FAIL.
>      # dmesg | tail
> 
> [Regression Potential]
> 
> Regressions would likely manifest in ext4 data journaling mode (which
> is not the default mode, 'ordered') with memory mapped access, as the
> other modes/paths are largely unaffected by the changes/same behavior.
> 
> This has been tested with (x)fstests, that showed no regressions on
> data=ordered and data=journal on both Bionic and Focal (with kernel
> versions 4.15.0-156-generic and 5.4.0-84-generic) w/in 10 runs each.
> And the stress-ng test-case as well. (Numbers/details in the LP bug.)
> 
> [Other info]
> 
> The patchset is applied on 5.10, so Hirsute (5.11) is already fixed;
> only Focal and Bionic need it.
> 
> There are little changes in the patches between Focal and Bionic
> (mostly minor backport adjustments, mainly due to no vm_fault_t)
> but unfortunately that needs separate versions for most patches.
> 
> Thanks!
> 

Applied to bionic:linux.

Thanks,
Kleber
Kelsey Skunberg Sept. 23, 2021, 10:57 p.m. UTC | #4
Applied to Focal master-next. Thank you!

-Kelsey

On 2021-09-09 17:22:21 , Mauricio Faria de Oliveira wrote:
> BugLink: https://bugs.launchpad.net/bugs/1847340
> 
> [Impact]
> 
> With mmap()ed files on ext4's data journaling it's possible to change
> a mapped page's buffers contents during their jbd2 transaction commit
> (as currently nothing prevents/blocks the write access at that time.)
> 
> This might happen between the buffers checksum calculation and actual
> write to journal, so the (old) checksum is invalid for the (new) data.
> 
> If the system crashes after that, but before such journal entry makes
> it to the filesystem, the journal replay on the next mount just fails,
> and the filesystem now requires fsck. (apparently curtin might set up
> /etc/fstab with passno=0, requiring manual intervention.)
> 
>     [39751.096455] EXT4-fs: Warning: mounting with data=journal disables delayed allocation and O_DIRECT support!
>     [39751.114435] JBD2: Invalid checksum recovering block 87305 in log
>     [39751.146133] JBD2: Invalid checksum recovering block 88039 in log
>     [39751.195950] JBD2: Invalid checksum recovering block 49633 in log
>     [39751.265158] JBD2: recovery failed
>     [39751.265163] EXT4-fs (vdc): error loading journal
> 
> [Fix]
> 
> The fix is to write-protect the pages during journal transaction commit,
> so that writes to mapped pages hit a page fault, then ext4's page_mkwrite
> hook can block until the commit finishes and the buffers can be modified.
> 
> In order to do that, add jbd2 journal callbacks that the filesystems can
> customize, called before/after the critical region in transaction commit,
> then have ext4 in data journaling mode to write-protect the pages whose
> buffers are being committed (and handle cases that need pages redirtied.)
> 
> The changes are restricted to the data journaling mode and page_mkwrite
> hook, and other modes/paths use the same code/behavior in the callbacks.
> 
> [Test Case]
> 
> Set up an ext4 filesystem in data journaling mode, and run stress-ng's
> mmap file test on it, then crash the system after a bit; check whether
> the filesystem can mount again or not (i.e., with jbd2 checksum errors.)
> 
>     # mkfs.ext4 $DEV
>     # mount -o data=journal $DEV $DIR
>     # cd $DIR
>     # stress-ng --mmap $((4*$(nproc))) --mmap-file &
>     # sleep 60
>     # echo c >/proc/sysrq-trigger
>     ...
>     # mount -o data=journal $DEV $DIR   # PASS/FAIL.
>     # dmesg | tail
> 
> [Regression Potential]
> 
> Regressions would likely manifest in ext4 data journaling mode (which
> is not the default mode, 'ordered') with memory mapped access, as the
> other modes/paths are largely unaffected by the changes/same behavior.
> 
> This has been tested with (x)fstests, that showed no regressions on
> data=ordered and data=journal on both Bionic and Focal (with kernel
> versions 4.15.0-156-generic and 5.4.0-84-generic) w/in 10 runs each.
> And the stress-ng test-case as well. (Numbers/details in the LP bug.)
> 
> [Other info]
> 
> The patchset is applied on 5.10, so Hirsute (5.11) is already fixed;
> only Focal and Bionic need it.
> 
> There are little changes in the patches between Focal and Bionic
> (mostly minor backport adjustments, mainly due to no vm_fault_t)
> but unfortunately that needs separate versions for most patches.
> 
> Thanks!
> 
> -- 
> Mauricio
> 
> 
> 
> Jan Kara (1):
>   ext4: fix mmap write protection for data=journal mode
> 
> Mauricio Faria de Oliveira (4):
>   jbd2: introduce/export functions
>     jbd2_journal_submit|finish_inode_data_buffers()
>   jbd2, ext4, ocfs2: introduce/use journal callbacks
>     j_submit|finish_inode_data_buffers()
>   ext4: data=journal: fixes for ext4_page_mkwrite()
>   ext4: data=journal: write-protect pages on
>     j_submit_inode_data_buffers()
> 
>  fs/ext4/inode.c      | 63 +++++++++++++++++++++++++++-----
>  fs/ext4/super.c      | 87 ++++++++++++++++++++++++++++++++++++++++++++
>  fs/jbd2/commit.c     | 62 ++++++++++++++++---------------
>  fs/jbd2/journal.c    |  2 +
>  fs/ocfs2/journal.c   |  4 ++
>  include/linux/jbd2.h | 29 ++++++++++++++-
>  6 files changed, 207 insertions(+), 40 deletions(-)
> 
> -- 
> 2.30.2
> 
> 
> -- 
> kernel-team mailing list
> kernel-team@lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team