[v3] Fix ext4 fault handling when mounted with -o dax,ro

Message ID 416a465a9fbe1d27085883dbf652c115cd195697.1503523424.git.dodgen@google.com
State New
Headers show

Commit Message

Randy Dodgen Aug. 23, 2017, 9:26 p.m.
From: Randy Dodgen <dodgen@google.com>

If an ext4 filesystem is mounted with both the DAX and read-only
options, executables on that filesystem will fail to start (claiming
'Segmentation fault') due to the fault handler returning
VM_FAULT_SIGBUS.

This is due to the DAX fault handler (see ext4_dax_huge_fault)
attempting to write to the journal when FAULT_FLAG_WRITE is set. This is
the wrong behavior for write faults which will lead to a COW page; in
particular, this fails for readonly mounts.

This change avoids journal writes for faults that are expected to COW.

It might be the case that this could be better handled in
ext4_iomap_begin / ext4_iomap_end (called via iomap_ops inside
dax_iomap_fault). These is some overlap already (e.g. grabbing journal
handles).

Signed-off-by: Randy Dodgen <dodgen@google.com>
---

This version is simplified as suggested by Ross; all fault sizes and fallbacks
are handled by dax_iomap_fault.

 fs/ext4/file.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Comments

Jan Kara Aug. 24, 2017, 2:45 p.m. | #1
On Wed 23-08-17 14:26:52, rdodgen@gmail.com wrote:
> From: Randy Dodgen <dodgen@google.com>
> 
> If an ext4 filesystem is mounted with both the DAX and read-only
> options, executables on that filesystem will fail to start (claiming
> 'Segmentation fault') due to the fault handler returning
> VM_FAULT_SIGBUS.
> 
> This is due to the DAX fault handler (see ext4_dax_huge_fault)
> attempting to write to the journal when FAULT_FLAG_WRITE is set. This is
> the wrong behavior for write faults which will lead to a COW page; in
> particular, this fails for readonly mounts.
> 
> This change avoids journal writes for faults that are expected to COW.
> 
> It might be the case that this could be better handled in
> ext4_iomap_begin / ext4_iomap_end (called via iomap_ops inside
> dax_iomap_fault). These is some overlap already (e.g. grabbing journal
> handles).
> 
> Signed-off-by: Randy Dodgen <dodgen@google.com>

Thanks for the verbose comment :). The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
> 
> This version is simplified as suggested by Ross; all fault sizes and fallbacks
> are handled by dax_iomap_fault.
> 
>  fs/ext4/file.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 0d7cf0cc9b87..dc1e1fb6b54c 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -279,7 +279,20 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
>  	handle_t *handle = NULL;
>  	struct inode *inode = file_inode(vmf->vma->vm_file);
>  	struct super_block *sb = inode->i_sb;
> -	bool write = vmf->flags & FAULT_FLAG_WRITE;
> +
> +	/*
> +	 * We have to distinguish real writes from writes which will result in a
> +	 * COW page; COW writes should *not* poke the journal (the file will not
> +	 * be changed). Doing so would cause unintended failures when mounted
> +	 * read-only.
> +	 *
> +	 * We check for VM_SHARED rather than vmf->cow_page since the latter is
> +	 * unset for pe_size != PE_SIZE_PTE (i.e. only in do_cow_fault); for
> +	 * other sizes, dax_iomap_fault will handle splitting / fallback so that
> +	 * we eventually come back with a COW page.
> +	 */
> +	bool write = (vmf->flags & FAULT_FLAG_WRITE) &&
> +		(vmf->vma->vm_flags & VM_SHARED);
>  
>  	if (write) {
>  		sb_start_pagefault(sb);
> -- 
> 2.14.1.342.g6490525c54-goog
>
Ross Zwisler Aug. 24, 2017, 3:11 p.m. | #2
On Wed, Aug 23, 2017 at 02:26:52PM -0700, rdodgen@gmail.com wrote:
> From: Randy Dodgen <dodgen@google.com>
> 
> If an ext4 filesystem is mounted with both the DAX and read-only
> options, executables on that filesystem will fail to start (claiming
> 'Segmentation fault') due to the fault handler returning
> VM_FAULT_SIGBUS.
> 
> This is due to the DAX fault handler (see ext4_dax_huge_fault)
> attempting to write to the journal when FAULT_FLAG_WRITE is set. This is
> the wrong behavior for write faults which will lead to a COW page; in
> particular, this fails for readonly mounts.
> 
> This change avoids journal writes for faults that are expected to COW.
> 
> It might be the case that this could be better handled in
> ext4_iomap_begin / ext4_iomap_end (called via iomap_ops inside
> dax_iomap_fault). These is some overlap already (e.g. grabbing journal
> handles).
> 
> Signed-off-by: Randy Dodgen <dodgen@google.com>

Cool, looks good from the DAX point of view.  You can add:

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Christoph Hellwig Aug. 24, 2017, 4:01 p.m. | #3
On Wed, Aug 23, 2017 at 02:26:52PM -0700, rdodgen@gmail.com wrote:
> From: Randy Dodgen <dodgen@google.com>
> 
> If an ext4 filesystem is mounted with both the DAX and read-only
> options, executables on that filesystem will fail to start (claiming
> 'Segmentation fault') due to the fault handler returning
> VM_FAULT_SIGBUS.
> 
> This is due to the DAX fault handler (see ext4_dax_huge_fault)
> attempting to write to the journal when FAULT_FLAG_WRITE is set. This is
> the wrong behavior for write faults which will lead to a COW page; in
> particular, this fails for readonly mounts.
> 
> This change avoids journal writes for faults that are expected to COW.
> 
> It might be the case that this could be better handled in
> ext4_iomap_begin / ext4_iomap_end (called via iomap_ops inside
> dax_iomap_fault). These is some overlap already (e.g. grabbing journal
> handles).
> 
> Signed-off-by: Randy Dodgen <dodgen@google.com>
> ---
> 
> This version is simplified as suggested by Ross; all fault sizes and fallbacks
> are handled by dax_iomap_fault.

We really need to do the same for ext2 and xfs.  And we really should
be doing that in common VM code instead of the file system.  See
my recent xfs synchronous page fault series on the mess the inconsistent
handling of the FAULT_FLAG_WRITE causes.  I think we just need a new
FAULT_FLAG_ALLOC or similar for those page faults that needs to
allocate space instead of piling hacks over hacks, and make sure
it's only set over the method that will actually do the allocation,
as the DAX and non-DAX path are not consistent on that.

Also any chance you could write an xfstest for this?
Theodore Ts'o Aug. 24, 2017, 7:26 p.m. | #4
On Wed, Aug 23, 2017 at 02:26:52PM -0700, rdodgen@gmail.com wrote:
> From: Randy Dodgen <dodgen@google.com>
> 
> If an ext4 filesystem is mounted with both the DAX and read-only
> options, executables on that filesystem will fail to start (claiming
> 'Segmentation fault') due to the fault handler returning
> VM_FAULT_SIGBUS.
> 
> This is due to the DAX fault handler (see ext4_dax_huge_fault)
> attempting to write to the journal when FAULT_FLAG_WRITE is set. This is
> the wrong behavior for write faults which will lead to a COW page; in
> particular, this fails for readonly mounts.
> 
> This change avoids journal writes for faults that are expected to COW.
> 
> It might be the case that this could be better handled in
> ext4_iomap_begin / ext4_iomap_end (called via iomap_ops inside
> dax_iomap_fault). These is some overlap already (e.g. grabbing journal
> handles).
> 
> Signed-off-by: Randy Dodgen <dodgen@google.com>

Applied, thanks!!

					- Ted
Theodore Ts'o Aug. 24, 2017, 8:57 p.m. | #5
On Thu, Aug 24, 2017 at 09:01:44AM -0700, Christoph Hellwig wrote:
> 
> We really need to do the same for ext2 and xfs.  And we really should
> be doing that in common VM code instead of the file system.  See
> my recent xfs synchronous page fault series on the mess the inconsistent
> handling of the FAULT_FLAG_WRITE causes.  I think we just need a new
> FAULT_FLAG_ALLOC or similar for those page faults that needs to
> allocate space instead of piling hacks over hacks, and make sure
> it's only set over the method that will actually do the allocation,
> as the DAX and non-DAX path are not consistent on that.

Yeah, agreed, that's the cleaner way of doing that.  It does mean
we'll have sweep through all of the file systems so that they DTRT
with this new FAULT_FLAG_ALLOC, though, right?

						- Ted
Christoph Hellwig Aug. 25, 2017, 7:28 a.m. | #6
On Thu, Aug 24, 2017 at 04:57:27PM -0400, Theodore Ts'o wrote:
> Yeah, agreed, that's the cleaner way of doing that.  It does mean
> we'll have sweep through all of the file systems so that they DTRT
> with this new FAULT_FLAG_ALLOC, though, right?

I think the only ones that it matters for for now are the DAX
fault handlers.  So we can add the new flag, check it in ext2, ext4
and xfs for now and we're done.  But if we eventually want to
phase out the ->page_mkwrite handler it would be useful for other
file systems as well.
Christoph Hellwig Aug. 29, 2017, 9:20 p.m. | #7
Randy, any chance yto at least share a test script so that others
can wire it up for the test suite?
Ross Zwisler Aug. 29, 2017, 9:37 p.m. | #8
On Tue, Aug 29, 2017 at 02:20:02PM -0700, Christoph Hellwig wrote:
> Randy, any chance yto at least share a test script so that others
> can wire it up for the test suite?

I made a reproducer for my testing. I'll make an xfstest if Randy isn't able
to.
Randy Dodgen Aug. 29, 2017, 10:07 p.m. | #9
An xfstest is a fine idea. I've started some work on that.

On Tue, Aug 29, 2017 at 2:37 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Tue, Aug 29, 2017 at 02:20:02PM -0700, Christoph Hellwig wrote:
>> Randy, any chance yto at least share a test script so that others
>> can wire it up for the test suite?
>
> I made a reproducer for my testing. I'll make an xfstest if Randy isn't able
> to.

Patch

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0d7cf0cc9b87..dc1e1fb6b54c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -279,7 +279,20 @@  static int ext4_dax_huge_fault(struct vm_fault *vmf,
 	handle_t *handle = NULL;
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct super_block *sb = inode->i_sb;
-	bool write = vmf->flags & FAULT_FLAG_WRITE;
+
+	/*
+	 * We have to distinguish real writes from writes which will result in a
+	 * COW page; COW writes should *not* poke the journal (the file will not
+	 * be changed). Doing so would cause unintended failures when mounted
+	 * read-only.
+	 *
+	 * We check for VM_SHARED rather than vmf->cow_page since the latter is
+	 * unset for pe_size != PE_SIZE_PTE (i.e. only in do_cow_fault); for
+	 * other sizes, dax_iomap_fault will handle splitting / fallback so that
+	 * we eventually come back with a COW page.
+	 */
+	bool write = (vmf->flags & FAULT_FLAG_WRITE) &&
+		(vmf->vma->vm_flags & VM_SHARED);
 
 	if (write) {
 		sb_start_pagefault(sb);