Message ID | 20210909202230.886329-1-mfo@canonical.com |
---|---|
Headers | show |
Series | ext4/jbd2: data=journal: write-protect pages on transaction commit | expand |
Acked-by: Tim Gardner <tim.gardner@canonical.com> Good test results. Clean cherry-picks for Focal. Backports and explanations for Bionic are good. On 9/9/21 2:22 PM, Mauricio Faria de Oliveira wrote: > BugLink: https://bugs.launchpad.net/bugs/1847340 > > [Impact] > > With mmap()ed files on ext4's data journaling it's possible to change > a mapped page's buffers contents during their jbd2 transaction commit > (as currently nothing prevents/blocks the write access at that time.) > > This might happen between the buffers checksum calculation and actual > write to journal, so the (old) checksum is invalid for the (new) data. > > If the system crashes after that, but before such journal entry makes > it to the filesystem, the journal replay on the next mount just fails, > and the filesystem now requires fsck. (apparently curtin might set up > /etc/fstab with passno=0, requiring manual intervention.) > > [39751.096455] EXT4-fs: Warning: mounting with data=journal disables delayed allocation and O_DIRECT support! > [39751.114435] JBD2: Invalid checksum recovering block 87305 in log > [39751.146133] JBD2: Invalid checksum recovering block 88039 in log > [39751.195950] JBD2: Invalid checksum recovering block 49633 in log > [39751.265158] JBD2: recovery failed > [39751.265163] EXT4-fs (vdc): error loading journal > > [Fix] > > The fix is to write-protect the pages during journal transaction commit, > so that writes to mapped pages hit a page fault, then ext4's page_mkwrite > hook can block until the commit finishes and the buffers can be modified. > > In order to do that, add jbd2 journal callbacks that the filesystems can > customize, called before/after the critical region in transaction commit, > then have ext4 in data journaling mode to write-protect the pages whose > buffers are being committed (and handle cases that need pages redirtied.) > > The changes are restricted to the data journaling mode and page_mkwrite > hook, and other modes/paths use the same code/behavior in the callbacks. > > [Test Case] > > Set up an ext4 filesystem in data journaling mode, and run stress-ng's > mmap file test on it, then crash the system after a bit; check whether > the filesystem can mount again or not (i.e., with jbd2 checksum errors.) > > # mkfs.ext4 $DEV > # mount -o data=journal $DEV $DIR > # cd $DIR > # stress-ng --mmap $((4*$(nproc))) --mmap-file & > # sleep 60 > # echo c >/proc/sysrq-trigger > ... > # mount -o data=journal $DEV $DIR # PASS/FAIL. > # dmesg | tail > > [Regression Potential] > > Regressions would likely manifest in ext4 data journaling mode (which > is not the default mode, 'ordered') with memory mapped access, as the > other modes/paths are largely unaffected by the changes/same behavior. > > This has been tested with (x)fstests, that showed no regressions on > data=ordered and data=journal on both Bionic and Focal (with kernel > versions 4.15.0-156-generic and 5.4.0-84-generic) w/in 10 runs each. > And the stress-ng test-case as well. (Numbers/details in the LP bug.) > > [Other info] > > The patchset is applied on 5.10, so Hirsute (5.11) is already fixed; > only Focal and Bionic need it. > > There are little changes in the patches between Focal and Bionic > (mostly minor backport adjustments, mainly due to no vm_fault_t) > but unfortunately that needs separate versions for most patches. > > Thanks! >
On 09.09.21 22:22, Mauricio Faria de Oliveira wrote: > BugLink: https://bugs.launchpad.net/bugs/1847340 > > [Impact] > > With mmap()ed files on ext4's data journaling it's possible to change > a mapped page's buffers contents during their jbd2 transaction commit > (as currently nothing prevents/blocks the write access at that time.) > > This might happen between the buffers checksum calculation and actual > write to journal, so the (old) checksum is invalid for the (new) data. > > If the system crashes after that, but before such journal entry makes > it to the filesystem, the journal replay on the next mount just fails, > and the filesystem now requires fsck. (apparently curtin might set up > /etc/fstab with passno=0, requiring manual intervention.) > > [39751.096455] EXT4-fs: Warning: mounting with data=journal disables delayed allocation and O_DIRECT support! > [39751.114435] JBD2: Invalid checksum recovering block 87305 in log > [39751.146133] JBD2: Invalid checksum recovering block 88039 in log > [39751.195950] JBD2: Invalid checksum recovering block 49633 in log > [39751.265158] JBD2: recovery failed > [39751.265163] EXT4-fs (vdc): error loading journal > > [Fix] > > The fix is to write-protect the pages during journal transaction commit, > so that writes to mapped pages hit a page fault, then ext4's page_mkwrite > hook can block until the commit finishes and the buffers can be modified. > > In order to do that, add jbd2 journal callbacks that the filesystems can > customize, called before/after the critical region in transaction commit, > then have ext4 in data journaling mode to write-protect the pages whose > buffers are being committed (and handle cases that need pages redirtied.) > > The changes are restricted to the data journaling mode and page_mkwrite > hook, and other modes/paths use the same code/behavior in the callbacks. > > [Test Case] > > Set up an ext4 filesystem in data journaling mode, and run stress-ng's > mmap file test on it, then crash the system after a bit; check whether > the filesystem can mount again or not (i.e., with jbd2 checksum errors.) > > # mkfs.ext4 $DEV > # mount -o data=journal $DEV $DIR > # cd $DIR > # stress-ng --mmap $((4*$(nproc))) --mmap-file & > # sleep 60 > # echo c >/proc/sysrq-trigger > ... > # mount -o data=journal $DEV $DIR # PASS/FAIL. > # dmesg | tail > > [Regression Potential] > > Regressions would likely manifest in ext4 data journaling mode (which > is not the default mode, 'ordered') with memory mapped access, as the > other modes/paths are largely unaffected by the changes/same behavior. > > This has been tested with (x)fstests, that showed no regressions on > data=ordered and data=journal on both Bionic and Focal (with kernel > versions 4.15.0-156-generic and 5.4.0-84-generic) w/in 10 runs each. > And the stress-ng test-case as well. (Numbers/details in the LP bug.) > > [Other info] > > The patchset is applied on 5.10, so Hirsute (5.11) is already fixed; > only Focal and Bionic need it. > > There are little changes in the patches between Focal and Bionic > (mostly minor backport adjustments, mainly due to no vm_fault_t) > but unfortunately that needs separate versions for most patches. > > Thanks! > Ok, modifying ext4 or its components generally is always scary. At least for F all changes are cherry picks and required changes for B are small. Also there has been some effort doing beforehand regression testing. That should limit the risk. Acked-by: Stefan Bader <stefan.bader@canonical.com>
On 09.09.21 22:22, Mauricio Faria de Oliveira wrote: > BugLink: https://bugs.launchpad.net/bugs/1847340 > > [Impact] > > With mmap()ed files on ext4's data journaling it's possible to change > a mapped page's buffers contents during their jbd2 transaction commit > (as currently nothing prevents/blocks the write access at that time.) > > This might happen between the buffers checksum calculation and actual > write to journal, so the (old) checksum is invalid for the (new) data. > > If the system crashes after that, but before such journal entry makes > it to the filesystem, the journal replay on the next mount just fails, > and the filesystem now requires fsck. (apparently curtin might set up > /etc/fstab with passno=0, requiring manual intervention.) > > [39751.096455] EXT4-fs: Warning: mounting with data=journal disables delayed allocation and O_DIRECT support! > [39751.114435] JBD2: Invalid checksum recovering block 87305 in log > [39751.146133] JBD2: Invalid checksum recovering block 88039 in log > [39751.195950] JBD2: Invalid checksum recovering block 49633 in log > [39751.265158] JBD2: recovery failed > [39751.265163] EXT4-fs (vdc): error loading journal > > [Fix] > > The fix is to write-protect the pages during journal transaction commit, > so that writes to mapped pages hit a page fault, then ext4's page_mkwrite > hook can block until the commit finishes and the buffers can be modified. > > In order to do that, add jbd2 journal callbacks that the filesystems can > customize, called before/after the critical region in transaction commit, > then have ext4 in data journaling mode to write-protect the pages whose > buffers are being committed (and handle cases that need pages redirtied.) > > The changes are restricted to the data journaling mode and page_mkwrite > hook, and other modes/paths use the same code/behavior in the callbacks. > > [Test Case] > > Set up an ext4 filesystem in data journaling mode, and run stress-ng's > mmap file test on it, then crash the system after a bit; check whether > the filesystem can mount again or not (i.e., with jbd2 checksum errors.) > > # mkfs.ext4 $DEV > # mount -o data=journal $DEV $DIR > # cd $DIR > # stress-ng --mmap $((4*$(nproc))) --mmap-file & > # sleep 60 > # echo c >/proc/sysrq-trigger > ... > # mount -o data=journal $DEV $DIR # PASS/FAIL. > # dmesg | tail > > [Regression Potential] > > Regressions would likely manifest in ext4 data journaling mode (which > is not the default mode, 'ordered') with memory mapped access, as the > other modes/paths are largely unaffected by the changes/same behavior. > > This has been tested with (x)fstests, that showed no regressions on > data=ordered and data=journal on both Bionic and Focal (with kernel > versions 4.15.0-156-generic and 5.4.0-84-generic) w/in 10 runs each. > And the stress-ng test-case as well. (Numbers/details in the LP bug.) > > [Other info] > > The patchset is applied on 5.10, so Hirsute (5.11) is already fixed; > only Focal and Bionic need it. > > There are little changes in the patches between Focal and Bionic > (mostly minor backport adjustments, mainly due to no vm_fault_t) > but unfortunately that needs separate versions for most patches. > > Thanks! > Applied to bionic:linux. Thanks, Kleber
Applied to Focal master-next. Thank you! -Kelsey On 2021-09-09 17:22:21 , Mauricio Faria de Oliveira wrote: > BugLink: https://bugs.launchpad.net/bugs/1847340 > > [Impact] > > With mmap()ed files on ext4's data journaling it's possible to change > a mapped page's buffers contents during their jbd2 transaction commit > (as currently nothing prevents/blocks the write access at that time.) > > This might happen between the buffers checksum calculation and actual > write to journal, so the (old) checksum is invalid for the (new) data. > > If the system crashes after that, but before such journal entry makes > it to the filesystem, the journal replay on the next mount just fails, > and the filesystem now requires fsck. (apparently curtin might set up > /etc/fstab with passno=0, requiring manual intervention.) > > [39751.096455] EXT4-fs: Warning: mounting with data=journal disables delayed allocation and O_DIRECT support! > [39751.114435] JBD2: Invalid checksum recovering block 87305 in log > [39751.146133] JBD2: Invalid checksum recovering block 88039 in log > [39751.195950] JBD2: Invalid checksum recovering block 49633 in log > [39751.265158] JBD2: recovery failed > [39751.265163] EXT4-fs (vdc): error loading journal > > [Fix] > > The fix is to write-protect the pages during journal transaction commit, > so that writes to mapped pages hit a page fault, then ext4's page_mkwrite > hook can block until the commit finishes and the buffers can be modified. > > In order to do that, add jbd2 journal callbacks that the filesystems can > customize, called before/after the critical region in transaction commit, > then have ext4 in data journaling mode to write-protect the pages whose > buffers are being committed (and handle cases that need pages redirtied.) > > The changes are restricted to the data journaling mode and page_mkwrite > hook, and other modes/paths use the same code/behavior in the callbacks. > > [Test Case] > > Set up an ext4 filesystem in data journaling mode, and run stress-ng's > mmap file test on it, then crash the system after a bit; check whether > the filesystem can mount again or not (i.e., with jbd2 checksum errors.) > > # mkfs.ext4 $DEV > # mount -o data=journal $DEV $DIR > # cd $DIR > # stress-ng --mmap $((4*$(nproc))) --mmap-file & > # sleep 60 > # echo c >/proc/sysrq-trigger > ... > # mount -o data=journal $DEV $DIR # PASS/FAIL. > # dmesg | tail > > [Regression Potential] > > Regressions would likely manifest in ext4 data journaling mode (which > is not the default mode, 'ordered') with memory mapped access, as the > other modes/paths are largely unaffected by the changes/same behavior. > > This has been tested with (x)fstests, that showed no regressions on > data=ordered and data=journal on both Bionic and Focal (with kernel > versions 4.15.0-156-generic and 5.4.0-84-generic) w/in 10 runs each. > And the stress-ng test-case as well. (Numbers/details in the LP bug.) > > [Other info] > > The patchset is applied on 5.10, so Hirsute (5.11) is already fixed; > only Focal and Bionic need it. > > There are little changes in the patches between Focal and Bionic > (mostly minor backport adjustments, mainly due to no vm_fault_t) > but unfortunately that needs separate versions for most patches. > > Thanks! > > -- > Mauricio > > > > Jan Kara (1): > ext4: fix mmap write protection for data=journal mode > > Mauricio Faria de Oliveira (4): > jbd2: introduce/export functions > jbd2_journal_submit|finish_inode_data_buffers() > jbd2, ext4, ocfs2: introduce/use journal callbacks > j_submit|finish_inode_data_buffers() > ext4: data=journal: fixes for ext4_page_mkwrite() > ext4: data=journal: write-protect pages on > j_submit_inode_data_buffers() > > fs/ext4/inode.c | 63 +++++++++++++++++++++++++++----- > fs/ext4/super.c | 87 ++++++++++++++++++++++++++++++++++++++++++++ > fs/jbd2/commit.c | 62 ++++++++++++++++--------------- > fs/jbd2/journal.c | 2 + > fs/ocfs2/journal.c | 4 ++ > include/linux/jbd2.h | 29 ++++++++++++++- > 6 files changed, 207 insertions(+), 40 deletions(-) > > -- > 2.30.2 > > > -- > kernel-team mailing list > kernel-team@lists.ubuntu.com > https://lists.ubuntu.com/mailman/listinfo/kernel-team