diff mbox series

[V4,07/13] fs: Add locking for a dynamic address space operations state

Message ID 20200221004134.30599-8-ira.weiny@intel.com
State Superseded
Headers show
Series Enable per-file/per-directory DAX operations V4 | expand

Commit Message

Ira Weiny Feb. 21, 2020, 12:41 a.m. UTC
From: Ira Weiny <ira.weiny@intel.com>

DAX requires special address space operations (aops).  Changing DAX
state therefore requires changing those aops.

However, many functions require aops to remain consistent through a deep
call stack.

Define a vfs level inode rwsem to protect aops throughout call stacks
which require them.

Finally, define calls to be used in subsequent patches when aops usage
needs to be quiesced by the file system.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V3:
	Convert from global to a per-inode rwsem
		Remove pre-optimization
	Remove static branch stuff
	Change function names from inode_dax_state_* to inode_aops_*
		I kept 'inode' as the synchronization is at the inode
		level now (probably where it belongs)...

		and I still prefer *_[down|up]_[read|write] as those
		names better reflect the use and interaction between
		users (readers) and writers better.  'XX_start_aop()'
		would have to be matched with something like
		'XX_wait_for_aop_user()' and 'XX_release_aop_users()' or
		something which does not make sense on the 'writer'
		side.

Changes from V2

	Rebase on linux-next-08-02-2020

	Fix locking order
	Change all references from mode to state where appropriate
	add CONFIG_FS_DAX requirement for state change
	Use a static branch to enable locking only when a dax capable
		device has been seen.

	Move the lock to a global vfs lock

		this does a few things
			1) preps us better for ext4 support
			2) removes funky callbacks from inode ops
			3) remove complexity from XFS and probably from
			   ext4 later

		We can do this because
			1) the locking order is required to be at the
			   highest level anyway, so why complicate xfs
			2) We had to move the sem to the super_block
			   because it is too heavy for the inode.
			3) After internal discussions with Dan we
			   decided that this would be easier, just as
			   performant, and with slightly less overhead
			   than in the VFS SB.

		We also change the functions names to up/down;
		read/write as appropriate.  Previous names were over
		simplified.

	Update comments and documentation

squash: add locking
---
 Documentation/filesystems/vfs.rst | 16 ++++++++
 fs/attr.c                         |  1 +
 fs/inode.c                        | 15 +++++--
 fs/iomap/buffered-io.c            |  1 +
 fs/open.c                         |  4 ++
 fs/stat.c                         |  2 +
 fs/xfs/xfs_icache.c               |  1 +
 include/linux/fs.h                | 66 ++++++++++++++++++++++++++++++-
 mm/fadvise.c                      |  7 +++-
 mm/filemap.c                      |  4 ++
 mm/huge_memory.c                  |  1 +
 mm/khugepaged.c                   |  2 +
 mm/util.c                         |  9 ++++-
 13 files changed, 121 insertions(+), 8 deletions(-)

Comments

Christoph Hellwig Feb. 21, 2020, 5:44 p.m. UTC | #1
On Thu, Feb 20, 2020 at 04:41:28PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> DAX requires special address space operations (aops).  Changing DAX
> state therefore requires changing those aops.
> 
> However, many functions require aops to remain consistent through a deep
> call stack.
> 
> Define a vfs level inode rwsem to protect aops throughout call stacks
> which require them.
> 
> Finally, define calls to be used in subsequent patches when aops usage
> needs to be quiesced by the file system.

I am very much against this.  There is a reason why we don't support
changes of ops vectors at runtime anywhere else, because it is
horribly complicated and impossible to get right.  IFF we ever want
to change the DAX vs non-DAX mode (which I'm still not sold on) the
right way is to just add a few simple conditionals and merge the
aops, which is much easier to reason about, and less costly in overall
overhead.
Dave Chinner Feb. 21, 2020, 10:44 p.m. UTC | #2
On Fri, Feb 21, 2020 at 06:44:49PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 20, 2020 at 04:41:28PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > DAX requires special address space operations (aops).  Changing DAX
> > state therefore requires changing those aops.
> > 
> > However, many functions require aops to remain consistent through a deep
> > call stack.
> > 
> > Define a vfs level inode rwsem to protect aops throughout call stacks
> > which require them.
> > 
> > Finally, define calls to be used in subsequent patches when aops usage
> > needs to be quiesced by the file system.
> 
> I am very much against this.  There is a reason why we don't support
> changes of ops vectors at runtime anywhere else, because it is
> horribly complicated and impossible to get right.  IFF we ever want
> to change the DAX vs non-DAX mode (which I'm still not sold on) the
> right way is to just add a few simple conditionals and merge the
> aops, which is much easier to reason about, and less costly in overall
> overhead.

*cough*

That's exactly what the original code did. And it was broken
because page faults call multiple aops that are dependent on the
result of the previous aop calls setting up the state correctly for
the latter calls. And when S_DAX changes between those calls, shit
breaks.

It's exactly the same problem as switching aops between two
dependent aops method calls - we don't solve anything by merging
aops and checking IS_DAX in each method because the race condition
is still there.

/me throws his hands in the air and walks away

-Dave.
Dan Williams Feb. 21, 2020, 11:26 p.m. UTC | #3
On Fri, Feb 21, 2020 at 2:44 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Feb 21, 2020 at 06:44:49PM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 20, 2020 at 04:41:28PM -0800, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > >
> > > DAX requires special address space operations (aops).  Changing DAX
> > > state therefore requires changing those aops.
> > >
> > > However, many functions require aops to remain consistent through a deep
> > > call stack.
> > >
> > > Define a vfs level inode rwsem to protect aops throughout call stacks
> > > which require them.
> > >
> > > Finally, define calls to be used in subsequent patches when aops usage
> > > needs to be quiesced by the file system.
> >
> > I am very much against this.  There is a reason why we don't support
> > changes of ops vectors at runtime anywhere else, because it is
> > horribly complicated and impossible to get right.  IFF we ever want
> > to change the DAX vs non-DAX mode (which I'm still not sold on) the
> > right way is to just add a few simple conditionals and merge the
> > aops, which is much easier to reason about, and less costly in overall
> > overhead.
>
> *cough*
>
> That's exactly what the original code did. And it was broken
> because page faults call multiple aops that are dependent on the
> result of the previous aop calls setting up the state correctly for
> the latter calls. And when S_DAX changes between those calls, shit
> breaks.
>
> It's exactly the same problem as switching aops between two
> dependent aops method calls - we don't solve anything by merging
> aops and checking IS_DAX in each method because the race condition
> is still there.
>
> /me throws his hands in the air and walks away

Please come back, because I think it's also clear that the "we don't
support changes of ops vectors at runtime" assertion is already being
violated by ext4 [1]. So that leaves "IFF we ever want to change the
dax vs non-dax mode" which I thought was already consensus given the
lingering hopes about having some future where the kernel is able to
dynamically optimize for dax vs non-dax based on memory media
performance characteristics. I thought the only thing missing from the
conclusion of the last conversation [2] was the "physical" vs
"effective" split that we identified at LSF'19 [3]. Christoph, that
split allows for for your concern about application intent to be
considered / overridden by kernel policy, and it allows for
communication of the effective state which applications need for
resource planning [4] independent of MAP_SYNC and other dax semantics.

The status quo of globally enabling dax for all files is empirically
the wrong choice for page-cache friendly workloads running on
slower-than-DRAM pmem media.

I am struggling to see how we address the above items without first
having a dynamic / less than global-filesystem scope facility to
control dax.

[1]: https://lore.kernel.org/linux-fsdevel/20191108131238.GK20863@quack2.suse.cz
[2]: https://lore.kernel.org/linux-fsdevel/20170927064001.GA27601@infradead.org
[3]: https://lwn.net/Articles/787973/
[4]: https://lwn.net/Articles/787233/
Darrick Wong Feb. 22, 2020, 12:33 a.m. UTC | #4
On Thu, Feb 20, 2020 at 04:41:28PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> DAX requires special address space operations (aops).  Changing DAX
> state therefore requires changing those aops.
> 
> However, many functions require aops to remain consistent through a deep
> call stack.
> 
> Define a vfs level inode rwsem to protect aops throughout call stacks
> which require them.
> 
> Finally, define calls to be used in subsequent patches when aops usage
> needs to be quiesced by the file system.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes from V3:
> 	Convert from global to a per-inode rwsem
> 		Remove pre-optimization
> 	Remove static branch stuff
> 	Change function names from inode_dax_state_* to inode_aops_*
> 		I kept 'inode' as the synchronization is at the inode
> 		level now (probably where it belongs)...
> 
> 		and I still prefer *_[down|up]_[read|write] as those
> 		names better reflect the use and interaction between
> 		users (readers) and writers better.  'XX_start_aop()'
> 		would have to be matched with something like
> 		'XX_wait_for_aop_user()' and 'XX_release_aop_users()' or
> 		something which does not make sense on the 'writer'
> 		side.
> 
> Changes from V2
> 
> 	Rebase on linux-next-08-02-2020
> 
> 	Fix locking order
> 	Change all references from mode to state where appropriate
> 	add CONFIG_FS_DAX requirement for state change
> 	Use a static branch to enable locking only when a dax capable
> 		device has been seen.
> 
> 	Move the lock to a global vfs lock
> 
> 		this does a few things
> 			1) preps us better for ext4 support
> 			2) removes funky callbacks from inode ops
> 			3) remove complexity from XFS and probably from
> 			   ext4 later
> 
> 		We can do this because
> 			1) the locking order is required to be at the
> 			   highest level anyway, so why complicate xfs
> 			2) We had to move the sem to the super_block
> 			   because it is too heavy for the inode.
> 			3) After internal discussions with Dan we
> 			   decided that this would be easier, just as
> 			   performant, and with slightly less overhead
> 			   than in the VFS SB.
> 
> 		We also change the functions names to up/down;
> 		read/write as appropriate.  Previous names were over
> 		simplified.
> 
> 	Update comments and documentation
> 
> squash: add locking
> ---
>  Documentation/filesystems/vfs.rst | 16 ++++++++
>  fs/attr.c                         |  1 +
>  fs/inode.c                        | 15 +++++--
>  fs/iomap/buffered-io.c            |  1 +
>  fs/open.c                         |  4 ++
>  fs/stat.c                         |  2 +
>  fs/xfs/xfs_icache.c               |  1 +
>  include/linux/fs.h                | 66 ++++++++++++++++++++++++++++++-
>  mm/fadvise.c                      |  7 +++-
>  mm/filemap.c                      |  4 ++
>  mm/huge_memory.c                  |  1 +
>  mm/khugepaged.c                   |  2 +
>  mm/util.c                         |  9 ++++-
>  13 files changed, 121 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 7d4d09dd5e6d..4a10a232f8e2 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -934,6 +934,22 @@ cache in your filesystem.  The following members are defined:
>  	Called during swapoff on files where swap_activate was
>  	successful.
>  
> +Changing DAX 'state' dynamically
> +----------------------------------
> +
> +Some file systems which support DAX want to be able to change the DAX state
> +dyanically.  To switch the state safely we lock the inode state in all "normal"
> +file system operations and restrict state changes to those operations.  The
> +specific rules are.
> +
> +        1) the direct_IO address_space_operation must be supported in all
> +           potential a_ops vectors for any state suported by the inode.
> +
> +        3) DAX state changes shall not be allowed while the file is mmap'ed
> +        4) For non-mmaped operations the VFS layer must take the read lock for any
> +           use of IS_DAX()
> +        5) Filesystems take the write lock when changing DAX states.
> +
>  
>  The File Object
>  ===============
> diff --git a/fs/attr.c b/fs/attr.c
> index b4bbdbd4c8ca..9b15f73d1079 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -332,6 +332,7 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
>  	if (error)
>  		return error;
>  
> +	/* DAX read state should already be held here */
>  	if (inode->i_op->setattr)
>  		error = inode->i_op->setattr(dentry, attr);
>  	else
> diff --git a/fs/inode.c b/fs/inode.c
> index 7d57068b6b7a..6e4f1cc872f2 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -200,6 +200,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
>  #endif
>  	inode->i_flctx = NULL;
>  	this_cpu_inc(nr_inodes);
> +	init_rwsem(&inode->i_aops_sem);
>  
>  	return 0;
>  out:
> @@ -1616,11 +1617,19 @@ EXPORT_SYMBOL(iput);
>   */
>  int bmap(struct inode *inode, sector_t *block)
>  {
> -	if (!inode->i_mapping->a_ops->bmap)
> -		return -EINVAL;
> +	int ret = 0;
> +
> +	inode_aops_down_read(inode);
> +	if (!inode->i_mapping->a_ops->bmap) {
> +		ret = -EINVAL;
> +		goto err;
> +	}
>  
>  	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> -	return 0;
> +
> +err:
> +	inode_aops_up_read(inode);
> +	return ret;
>  }
>  EXPORT_SYMBOL(bmap);
>  #endif
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 7c84c4c027c4..e313a34d5fa6 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -999,6 +999,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
>  		offset = offset_in_page(pos);
>  		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
>  
> +		/* DAX state read should already be held here */
>  		if (IS_DAX(inode))
>  			status = iomap_dax_zero(pos, offset, bytes, iomap);
>  		else
> diff --git a/fs/open.c b/fs/open.c
> index 0788b3715731..3abf0bfac462 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -59,10 +59,12 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
>  	if (ret)
>  		newattrs.ia_valid |= ret | ATTR_FORCE;
>  
> +	inode_aops_down_read(dentry->d_inode);
>  	inode_lock(dentry->d_inode);
>  	/* Note any delegations or leases have already been broken: */
>  	ret = notify_change(dentry, &newattrs, NULL);
>  	inode_unlock(dentry->d_inode);
> +	inode_aops_up_read(dentry->d_inode);
>  	return ret;
>  }
>  
> @@ -306,7 +308,9 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		return -EOPNOTSUPP;
>  
>  	file_start_write(file);
> +	inode_aops_down_read(inode);
>  	ret = file->f_op->fallocate(file, mode, offset, len);
> +	inode_aops_up_read(inode);
>  
>  	/*
>  	 * Create inotify and fanotify events.
> diff --git a/fs/stat.c b/fs/stat.c
> index 894699c74dde..274b3ccc82b1 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -79,8 +79,10 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
>  	if (IS_AUTOMOUNT(inode))
>  		stat->attributes |= STATX_ATTR_AUTOMOUNT;
>  
> +	inode_aops_down_read(inode);
>  	if (IS_DAX(inode))
>  		stat->attributes |= STATX_ATTR_DAX;
> +	inode_aops_up_read(inode);
>  
>  	if (inode->i_op->getattr)
>  		return inode->i_op->getattr(path, stat, request_mask,
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 836a1f09be03..3e83a97dc047 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -420,6 +420,7 @@ xfs_iget_cache_hit(
>  		rcu_read_unlock();
>  
>  		ASSERT(!rwsem_is_locked(&inode->i_rwsem));
> +		ASSERT(!rwsem_is_locked(&inode->i_aops_sem));
>  		error = xfs_reinit_inode(mp, inode);
>  		if (error) {
>  			bool wake;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 63d1e533a07d..ad0f2368031b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -40,6 +40,7 @@
>  #include <linux/fs_types.h>
>  #include <linux/build_bug.h>
>  #include <linux/stddef.h>
> +#include <linux/percpu-rwsem.h>
>  
>  #include <asm/byteorder.h>
>  #include <uapi/linux/fs.h>
> @@ -359,6 +360,11 @@ typedef struct {
>  typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
>  		unsigned long, unsigned long);
>  
> +/**
> + * NOTE: DO NOT define new functions in address_space_operations without first
> + * considering how dynamic DAX states are to be supported.  See the
> + * inode_aops_*_read() functions
> + */
>  struct address_space_operations {
>  	int (*writepage)(struct page *page, struct writeback_control *wbc);
>  	int (*readpage)(struct file *, struct page *);
> @@ -735,6 +741,9 @@ struct inode {
>  #endif
>  
>  	void			*i_private; /* fs or device private pointer */
> +#if defined(CONFIG_FS_DAX)
> +	struct rw_semaphore	i_aops_sem;
> +#endif
>  } __randomize_layout;
>  
>  struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
> @@ -1817,6 +1826,11 @@ struct block_device_operations;
>  
>  struct iov_iter;
>  
> +/**
> + * NOTE: DO NOT define new functions in file_operations without first
> + * considering how dynamic address_space operations are to be supported.  See
> + * the inode_aops_*_read() functions in this file.
> + */
>  struct file_operations {
>  	struct module *owner;
>  	loff_t (*llseek) (struct file *, loff_t, int);
> @@ -1889,16 +1903,64 @@ struct inode_operations {
>  	int (*set_acl)(struct inode *, struct posix_acl *, int);
>  } ____cacheline_aligned;
>  
> +#if defined(CONFIG_FS_DAX)
> +/*
> + * Filesystems wishing to support dynamic DAX states must do the following.
> + *
> + * 1) the direct_IO address_space_operation must be supported in all potential
> + *    a_ops vectors for any state suported by the inode.  This is because the
> + *    direct_IO function is used as a flag long before the function is called.
> +
> + * 3) DAX state changes shall not be allowed while the file is mmap'ed
> + * 4) For non-mmaped operations the VFS layer must take the read lock for any
> + *    use of IS_DAX()
> + * 5) Filesystems take the write lock when changing DAX states.

Do you envision ext4 migrating their aops changing code to use
i_aops_sem in the future?

--D

> + */
> +static inline void inode_aops_down_read(struct inode *inode)
> +{
> +	down_read(&inode->i_aops_sem);
> +}
> +static inline void inode_aops_up_read(struct inode *inode)
> +{
> +	up_read(&inode->i_aops_sem);
> +}
> +static inline void inode_aops_down_write(struct inode *inode)
> +{
> +	down_write(&inode->i_aops_sem);
> +}
> +static inline void inode_aops_up_write(struct inode *inode)
> +{
> +	up_write(&inode->i_aops_sem);
> +}
> +#else /* !CONFIG_FS_DAX */
> +#define inode_aops_down_read(inode) do { (void)(inode); } while (0)
> +#define inode_aops_up_read(inode) do { (void)(inode); } while (0)
> +#define inode_aops_down_write(inode) do { (void)(inode); } while (0)
> +#define inode_aops_up_write(inode) do { (void)(inode); } while (0)
> +#endif /* CONFIG_FS_DAX */
> +
>  static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
>  				     struct iov_iter *iter)
>  {
> -	return file->f_op->read_iter(kio, iter);
> +	struct inode		*inode = file_inode(kio->ki_filp);
> +	ssize_t ret;
> +
> +	inode_aops_down_read(inode);
> +	ret = file->f_op->read_iter(kio, iter);
> +	inode_aops_up_read(inode);
> +	return ret;
>  }
>  
>  static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
>  				      struct iov_iter *iter)
>  {
> -	return file->f_op->write_iter(kio, iter);
> +	struct inode		*inode = file_inode(kio->ki_filp);
> +	ssize_t ret;
> +
> +	inode_aops_down_read(inode);
> +	ret = file->f_op->write_iter(kio, iter);
> +	inode_aops_up_read(inode);
> +	return ret;
>  }
>  
>  static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> diff --git a/mm/fadvise.c b/mm/fadvise.c
> index 4f17c83db575..6a30febb11e0 100644
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -48,6 +48,8 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
>  	bdi = inode_to_bdi(mapping->host);
>  
>  	if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) {
> +		int ret = 0;
> +
>  		switch (advice) {
>  		case POSIX_FADV_NORMAL:
>  		case POSIX_FADV_RANDOM:
> @@ -58,9 +60,10 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
>  			/* no bad return value, but ignore advice */
>  			break;
>  		default:
> -			return -EINVAL;
> +			ret = -EINVAL;
>  		}
> -		return 0;
> +
> +		return ret;
>  	}
>  
>  	/*
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1784478270e1..3a7863ba51b9 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2293,6 +2293,8 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  		 * and return.  Otherwise fallthrough to buffered io for
>  		 * the rest of the read.  Buffered reads will not work for
>  		 * DAX files, so don't bother trying.
> +		 *
> +		 * IS_DAX is protected under ->read_iter lock
>  		 */
>  		if (retval < 0 || !count || iocb->ki_pos >= size ||
>  		    IS_DAX(inode))
> @@ -3377,6 +3379,8 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		 * holes, for example.  For DAX files, a buffered write will
>  		 * not succeed (even if it did, DAX does not handle dirty
>  		 * page-cache pages correctly).
> +		 *
> +		 * IS_DAX is protected under ->write_iter lock
>  		 */
>  		if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
>  			goto out;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b08b199f9a11..3d05bd10d83e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -572,6 +572,7 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>  	unsigned long ret;
>  	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
>  
> +	/* Should not need locking here because mmap is not allowed */
>  	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
>  		goto out;
>  
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b679908743cb..f048178e2b93 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1592,9 +1592,11 @@ static void collapse_file(struct mm_struct *mm,
>  		} else {	/* !is_shmem */
>  			if (!page || xa_is_value(page)) {
>  				xas_unlock_irq(&xas);
> +				inode_aops_down_read(file->f_inode);
>  				page_cache_sync_readahead(mapping, &file->f_ra,
>  							  file, index,
>  							  PAGE_SIZE);
> +				inode_aops_up_read(file->f_inode);
>  				/* drain pagevecs to help isolate_lru_page() */
>  				lru_add_drain();
>  				page = find_lock_page(mapping, index);
> diff --git a/mm/util.c b/mm/util.c
> index 988d11e6c17c..a4fb0670137d 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -501,11 +501,18 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
>  
>  	ret = security_mmap_file(file, prot, flag);
>  	if (!ret) {
> -		if (down_write_killable(&mm->mmap_sem))
> +		if (file)
> +			inode_aops_down_read(file_inode(file));
> +		if (down_write_killable(&mm->mmap_sem)) {
> +			if (file)
> +				inode_aops_up_read(file_inode(file));
>  			return -EINTR;
> +		}
>  		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
>  				    &populate, &uf);
>  		up_write(&mm->mmap_sem);
> +		if (file)
> +			inode_aops_up_read(file_inode(file));
>  		userfaultfd_unmap_complete(mm, &uf);
>  		if (populate)
>  			mm_populate(ret, populate);
> -- 
> 2.21.0
>
Ira Weiny Feb. 23, 2020, 3:03 p.m. UTC | #5
On Fri, Feb 21, 2020 at 04:33:17PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 20, 2020 at 04:41:28PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > DAX requires special address space operations (aops).  Changing DAX
> > state therefore requires changing those aops.
> > 
> > However, many functions require aops to remain consistent through a deep
> > call stack.
> > 
> > Define a vfs level inode rwsem to protect aops throughout call stacks
> > which require them.
> > 
> > Finally, define calls to be used in subsequent patches when aops usage
> > needs to be quiesced by the file system.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> > ---
> > Changes from V3:
> > 	Convert from global to a per-inode rwsem
> > 		Remove pre-optimization
> > 	Remove static branch stuff
> > 	Change function names from inode_dax_state_* to inode_aops_*
> > 		I kept 'inode' as the synchronization is at the inode
> > 		level now (probably where it belongs)...
> > 
> > 		and I still prefer *_[down|up]_[read|write] as those
> > 		names better reflect the use and interaction between
> > 		users (readers) and writers better.  'XX_start_aop()'
> > 		would have to be matched with something like
> > 		'XX_wait_for_aop_user()' and 'XX_release_aop_users()' or
> > 		something which does not make sense on the 'writer'
> > 		side.
> > 
> > Changes from V2
> > 
> > 	Rebase on linux-next-08-02-2020
> > 
> > 	Fix locking order
> > 	Change all references from mode to state where appropriate
> > 	add CONFIG_FS_DAX requirement for state change
> > 	Use a static branch to enable locking only when a dax capable
> > 		device has been seen.
> > 
> > 	Move the lock to a global vfs lock
> > 
> > 		this does a few things
> > 			1) preps us better for ext4 support
> > 			2) removes funky callbacks from inode ops
> > 			3) remove complexity from XFS and probably from
> > 			   ext4 later
> > 
> > 		We can do this because
> > 			1) the locking order is required to be at the
> > 			   highest level anyway, so why complicate xfs
> > 			2) We had to move the sem to the super_block
> > 			   because it is too heavy for the inode.
> > 			3) After internal discussions with Dan we
> > 			   decided that this would be easier, just as
> > 			   performant, and with slightly less overhead
> > 			   than in the VFS SB.
> > 
> > 		We also change the functions names to up/down;
> > 		read/write as appropriate.  Previous names were over
> > 		simplified.
> > 
> > 	Update comments and documentation
> > 
> > squash: add locking
> > ---
> >  Documentation/filesystems/vfs.rst | 16 ++++++++
> >  fs/attr.c                         |  1 +
> >  fs/inode.c                        | 15 +++++--
> >  fs/iomap/buffered-io.c            |  1 +
> >  fs/open.c                         |  4 ++
> >  fs/stat.c                         |  2 +
> >  fs/xfs/xfs_icache.c               |  1 +
> >  include/linux/fs.h                | 66 ++++++++++++++++++++++++++++++-
> >  mm/fadvise.c                      |  7 +++-
> >  mm/filemap.c                      |  4 ++
> >  mm/huge_memory.c                  |  1 +
> >  mm/khugepaged.c                   |  2 +
> >  mm/util.c                         |  9 ++++-
> >  13 files changed, 121 insertions(+), 8 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> > index 7d4d09dd5e6d..4a10a232f8e2 100644
> > --- a/Documentation/filesystems/vfs.rst
> > +++ b/Documentation/filesystems/vfs.rst
> > @@ -934,6 +934,22 @@ cache in your filesystem.  The following members are defined:
> >  	Called during swapoff on files where swap_activate was
> >  	successful.
> >  
> > +Changing DAX 'state' dynamically
> > +----------------------------------
> > +
> > +Some file systems which support DAX want to be able to change the DAX state
> > +dyanically.  To switch the state safely we lock the inode state in all "normal"
> > +file system operations and restrict state changes to those operations.  The
> > +specific rules are.
> > +
> > +        1) the direct_IO address_space_operation must be supported in all
> > +           potential a_ops vectors for any state suported by the inode.
> > +
> > +        3) DAX state changes shall not be allowed while the file is mmap'ed
> > +        4) For non-mmaped operations the VFS layer must take the read lock for any
> > +           use of IS_DAX()
> > +        5) Filesystems take the write lock when changing DAX states.
> > +
> >  
> >  The File Object
> >  ===============
> > diff --git a/fs/attr.c b/fs/attr.c
> > index b4bbdbd4c8ca..9b15f73d1079 100644
> > --- a/fs/attr.c
> > +++ b/fs/attr.c
> > @@ -332,6 +332,7 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
> >  	if (error)
> >  		return error;
> >  
> > +	/* DAX read state should already be held here */
> >  	if (inode->i_op->setattr)
> >  		error = inode->i_op->setattr(dentry, attr);
> >  	else
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 7d57068b6b7a..6e4f1cc872f2 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -200,6 +200,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
> >  #endif
> >  	inode->i_flctx = NULL;
> >  	this_cpu_inc(nr_inodes);
> > +	init_rwsem(&inode->i_aops_sem);
> >  
> >  	return 0;
> >  out:
> > @@ -1616,11 +1617,19 @@ EXPORT_SYMBOL(iput);
> >   */
> >  int bmap(struct inode *inode, sector_t *block)
> >  {
> > -	if (!inode->i_mapping->a_ops->bmap)
> > -		return -EINVAL;
> > +	int ret = 0;
> > +
> > +	inode_aops_down_read(inode);
> > +	if (!inode->i_mapping->a_ops->bmap) {
> > +		ret = -EINVAL;
> > +		goto err;
> > +	}
> >  
> >  	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > -	return 0;
> > +
> > +err:
> > +	inode_aops_up_read(inode);
> > +	return ret;
> >  }
> >  EXPORT_SYMBOL(bmap);
> >  #endif
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 7c84c4c027c4..e313a34d5fa6 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -999,6 +999,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
> >  		offset = offset_in_page(pos);
> >  		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
> >  
> > +		/* DAX state read should already be held here */
> >  		if (IS_DAX(inode))
> >  			status = iomap_dax_zero(pos, offset, bytes, iomap);
> >  		else
> > diff --git a/fs/open.c b/fs/open.c
> > index 0788b3715731..3abf0bfac462 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -59,10 +59,12 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
> >  	if (ret)
> >  		newattrs.ia_valid |= ret | ATTR_FORCE;
> >  
> > +	inode_aops_down_read(dentry->d_inode);
> >  	inode_lock(dentry->d_inode);
> >  	/* Note any delegations or leases have already been broken: */
> >  	ret = notify_change(dentry, &newattrs, NULL);
> >  	inode_unlock(dentry->d_inode);
> > +	inode_aops_up_read(dentry->d_inode);
> >  	return ret;
> >  }
> >  
> > @@ -306,7 +308,9 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> >  		return -EOPNOTSUPP;
> >  
> >  	file_start_write(file);
> > +	inode_aops_down_read(inode);
> >  	ret = file->f_op->fallocate(file, mode, offset, len);
> > +	inode_aops_up_read(inode);
> >  
> >  	/*
> >  	 * Create inotify and fanotify events.
> > diff --git a/fs/stat.c b/fs/stat.c
> > index 894699c74dde..274b3ccc82b1 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -79,8 +79,10 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
> >  	if (IS_AUTOMOUNT(inode))
> >  		stat->attributes |= STATX_ATTR_AUTOMOUNT;
> >  
> > +	inode_aops_down_read(inode);
> >  	if (IS_DAX(inode))
> >  		stat->attributes |= STATX_ATTR_DAX;
> > +	inode_aops_up_read(inode);
> >  
> >  	if (inode->i_op->getattr)
> >  		return inode->i_op->getattr(path, stat, request_mask,
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 836a1f09be03..3e83a97dc047 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -420,6 +420,7 @@ xfs_iget_cache_hit(
> >  		rcu_read_unlock();
> >  
> >  		ASSERT(!rwsem_is_locked(&inode->i_rwsem));
> > +		ASSERT(!rwsem_is_locked(&inode->i_aops_sem));
> >  		error = xfs_reinit_inode(mp, inode);
> >  		if (error) {
> >  			bool wake;
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 63d1e533a07d..ad0f2368031b 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -40,6 +40,7 @@
> >  #include <linux/fs_types.h>
> >  #include <linux/build_bug.h>
> >  #include <linux/stddef.h>
> > +#include <linux/percpu-rwsem.h>
> >  
> >  #include <asm/byteorder.h>
> >  #include <uapi/linux/fs.h>
> > @@ -359,6 +360,11 @@ typedef struct {
> >  typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
> >  		unsigned long, unsigned long);
> >  
> > +/**
> > + * NOTE: DO NOT define new functions in address_space_operations without first
> > + * considering how dynamic DAX states are to be supported.  See the
> > + * inode_aops_*_read() functions
> > + */
> >  struct address_space_operations {
> >  	int (*writepage)(struct page *page, struct writeback_control *wbc);
> >  	int (*readpage)(struct file *, struct page *);
> > @@ -735,6 +741,9 @@ struct inode {
> >  #endif
> >  
> >  	void			*i_private; /* fs or device private pointer */
> > +#if defined(CONFIG_FS_DAX)
> > +	struct rw_semaphore	i_aops_sem;
> > +#endif
> >  } __randomize_layout;
> >  
> >  struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
> > @@ -1817,6 +1826,11 @@ struct block_device_operations;
> >  
> >  struct iov_iter;
> >  
> > +/**
> > + * NOTE: DO NOT define new functions in file_operations without first
> > + * considering how dynamic address_space operations are to be supported.  See
> > + * the inode_aops_*_read() functions in this file.
> > + */
> >  struct file_operations {
> >  	struct module *owner;
> >  	loff_t (*llseek) (struct file *, loff_t, int);
> > @@ -1889,16 +1903,64 @@ struct inode_operations {
> >  	int (*set_acl)(struct inode *, struct posix_acl *, int);
> >  } ____cacheline_aligned;
> >  
> > +#if defined(CONFIG_FS_DAX)
> > +/*
> > + * Filesystems wishing to support dynamic DAX states must do the following.
> > + *
> > + * 1) the direct_IO address_space_operation must be supported in all potential
> > + *    a_ops vectors for any state suported by the inode.  This is because the
> > + *    direct_IO function is used as a flag long before the function is called.
> > +
> > + * 3) DAX state changes shall not be allowed while the file is mmap'ed
> > + * 4) For non-mmaped operations the VFS layer must take the read lock for any
> > + *    use of IS_DAX()
> > + * 5) Filesystems take the write lock when changing DAX states.
> 
> Do you envision ext4 migrating their aops changing code to use
> i_aops_sem in the future?

Ah... yea, which, at that time, this comment would be completely wrong.  I
tried to remove references to dax as per Dave but I missed this one.

Thanks,
Ira

> 
> --D
> 
> > + */
> > +static inline void inode_aops_down_read(struct inode *inode)
> > +{
> > +	down_read(&inode->i_aops_sem);
> > +}
> > +static inline void inode_aops_up_read(struct inode *inode)
> > +{
> > +	up_read(&inode->i_aops_sem);
> > +}
> > +static inline void inode_aops_down_write(struct inode *inode)
> > +{
> > +	down_write(&inode->i_aops_sem);
> > +}
> > +static inline void inode_aops_up_write(struct inode *inode)
> > +{
> > +	up_write(&inode->i_aops_sem);
> > +}
> > +#else /* !CONFIG_FS_DAX */
> > +#define inode_aops_down_read(inode) do { (void)(inode); } while (0)
> > +#define inode_aops_up_read(inode) do { (void)(inode); } while (0)
> > +#define inode_aops_down_write(inode) do { (void)(inode); } while (0)
> > +#define inode_aops_up_write(inode) do { (void)(inode); } while (0)
> > +#endif /* CONFIG_FS_DAX */
> > +
> >  static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
> >  				     struct iov_iter *iter)
> >  {
> > -	return file->f_op->read_iter(kio, iter);
> > +	struct inode		*inode = file_inode(kio->ki_filp);
> > +	ssize_t ret;
> > +
> > +	inode_aops_down_read(inode);
> > +	ret = file->f_op->read_iter(kio, iter);
> > +	inode_aops_up_read(inode);
> > +	return ret;
> >  }
> >  
> >  static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
> >  				      struct iov_iter *iter)
> >  {
> > -	return file->f_op->write_iter(kio, iter);
> > +	struct inode		*inode = file_inode(kio->ki_filp);
> > +	ssize_t ret;
> > +
> > +	inode_aops_down_read(inode);
> > +	ret = file->f_op->write_iter(kio, iter);
> > +	inode_aops_up_read(inode);
> > +	return ret;
> >  }
> >  
> >  static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> > diff --git a/mm/fadvise.c b/mm/fadvise.c
> > index 4f17c83db575..6a30febb11e0 100644
> > --- a/mm/fadvise.c
> > +++ b/mm/fadvise.c
> > @@ -48,6 +48,8 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
> >  	bdi = inode_to_bdi(mapping->host);
> >  
> >  	if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) {
> > +		int ret = 0;
> > +
> >  		switch (advice) {
> >  		case POSIX_FADV_NORMAL:
> >  		case POSIX_FADV_RANDOM:
> > @@ -58,9 +60,10 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
> >  			/* no bad return value, but ignore advice */
> >  			break;
> >  		default:
> > -			return -EINVAL;
> > +			ret = -EINVAL;
> >  		}
> > -		return 0;
> > +
> > +		return ret;
> >  	}
> >  
> >  	/*
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 1784478270e1..3a7863ba51b9 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2293,6 +2293,8 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> >  		 * and return.  Otherwise fallthrough to buffered io for
> >  		 * the rest of the read.  Buffered reads will not work for
> >  		 * DAX files, so don't bother trying.
> > +		 *
> > +		 * IS_DAX is protected under ->read_iter lock
> >  		 */
> >  		if (retval < 0 || !count || iocb->ki_pos >= size ||
> >  		    IS_DAX(inode))
> > @@ -3377,6 +3379,8 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >  		 * holes, for example.  For DAX files, a buffered write will
> >  		 * not succeed (even if it did, DAX does not handle dirty
> >  		 * page-cache pages correctly).
> > +		 *
> > +		 * IS_DAX is protected under ->write_iter lock
> >  		 */
> >  		if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
> >  			goto out;
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index b08b199f9a11..3d05bd10d83e 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -572,6 +572,7 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
> >  	unsigned long ret;
> >  	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
> >  
> > +	/* Should not need locking here because mmap is not allowed */
> >  	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
> >  		goto out;
> >  
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index b679908743cb..f048178e2b93 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1592,9 +1592,11 @@ static void collapse_file(struct mm_struct *mm,
> >  		} else {	/* !is_shmem */
> >  			if (!page || xa_is_value(page)) {
> >  				xas_unlock_irq(&xas);
> > +				inode_aops_down_read(file->f_inode);
> >  				page_cache_sync_readahead(mapping, &file->f_ra,
> >  							  file, index,
> >  							  PAGE_SIZE);
> > +				inode_aops_up_read(file->f_inode);
> >  				/* drain pagevecs to help isolate_lru_page() */
> >  				lru_add_drain();
> >  				page = find_lock_page(mapping, index);
> > diff --git a/mm/util.c b/mm/util.c
> > index 988d11e6c17c..a4fb0670137d 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -501,11 +501,18 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
> >  
> >  	ret = security_mmap_file(file, prot, flag);
> >  	if (!ret) {
> > -		if (down_write_killable(&mm->mmap_sem))
> > +		if (file)
> > +			inode_aops_down_read(file_inode(file));
> > +		if (down_write_killable(&mm->mmap_sem)) {
> > +			if (file)
> > +				inode_aops_up_read(file_inode(file));
> >  			return -EINTR;
> > +		}
> >  		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
> >  				    &populate, &uf);
> >  		up_write(&mm->mmap_sem);
> > +		if (file)
> > +			inode_aops_up_read(file_inode(file));
> >  		userfaultfd_unmap_complete(mm, &uf);
> >  		if (populate)
> >  			mm_populate(ret, populate);
> > -- 
> > 2.21.0
> >
Christoph Hellwig Feb. 24, 2020, 5:56 p.m. UTC | #6
On Sat, Feb 22, 2020 at 09:44:19AM +1100, Dave Chinner wrote:
> > I am very much against this.  There is a reason why we don't support
> > changes of ops vectors at runtime anywhere else, because it is
> > horribly complicated and impossible to get right.  IFF we ever want
> > to change the DAX vs non-DAX mode (which I'm still not sold on) the
> > right way is to just add a few simple conditionals and merge the
> > aops, which is much easier to reason about, and less costly in overall
> > overhead.
> 
> *cough*
> 
> That's exactly what the original code did. And it was broken
> because page faults call multiple aops that are dependent on the
> result of the previous aop calls setting up the state correctly for
> the latter calls. And when S_DAX changes between those calls, shit
> breaks.

No, the original code was broken because it didn't serialize between
DAX and buffer access.

Take a step back and look where the problems are, and they are not
mostly with the aops.  In fact the only aop useful for DAX is
is ->writepages, and how it uses ->writepages is actually a bit of
an abuse of that interface.

So what we really need it just a way to prevent switching the flag
when a file is mapped, and in the normal read/write path ensure the
flag can't be switch while I/O is going on, which could either be
done by ensuring it is only switched under i_rwsem or equivalent
protection, or by setting the DAX flag once in the iocb similar to
IOCB_DIRECT.

And they easiest way to get all this done is as a first step to
just allowing switching the flag when no blocks are allocated at
all, similar to how the rt flag works.  Once that works properly
and use cases show up we can relax the requirements, and we need
to do that in a way without bloating the inode and various VFS
code paths, as DAX is a complete fringe feature, and Ñ•witching
the flag and runtime is the fringe of a fringe.  It just isn't worth
bloating the inode and wasting tons of developer time on it.
Dave Chinner Feb. 25, 2020, 12:09 a.m. UTC | #7
On Mon, Feb 24, 2020 at 06:56:03PM +0100, Christoph Hellwig wrote:
> On Sat, Feb 22, 2020 at 09:44:19AM +1100, Dave Chinner wrote:
> > > I am very much against this.  There is a reason why we don't support
> > > changes of ops vectors at runtime anywhere else, because it is
> > > horribly complicated and impossible to get right.  IFF we ever want
> > > to change the DAX vs non-DAX mode (which I'm still not sold on) the
> > > right way is to just add a few simple conditionals and merge the
> > > aops, which is much easier to reason about, and less costly in overall
> > > overhead.
> > 
> > *cough*
> > 
> > That's exactly what the original code did. And it was broken
> > because page faults call multiple aops that are dependent on the
> > result of the previous aop calls setting up the state correctly for
> > the latter calls. And when S_DAX changes between those calls, shit
> > breaks.
> 
> No, the original code was broken because it didn't serialize between
> DAX and buffer access.
> 
> Take a step back and look where the problems are, and they are not
> mostly with the aops.  In fact the only aop useful for DAX is
> is ->writepages, and how it uses ->writepages is actually a bit of
> an abuse of that interface.

The races are all through the fops, too, which is one of the reasons
Darrick mentioned we should probably move this up to file ops
level...

> So what we really need it just a way to prevent switching the flag
> when a file is mapped,

That's not sufficient.

We also have to prevent the file from being mapped *while we are
switching*. Nothing in the mmap() path actually serialises against
filesystem operations, and the initial behavioural checks in the
page fault path are similarly unserialised against changing the
S_DAX flag.

e.g. there's a race against ->mmap() with switching the the S_DAX
flag. In xfs_file_mmap():

        file_accessed(file);
        vma->vm_ops = &xfs_file_vm_ops;
        if (IS_DAX(inode))
                vma->vm_flags |= VM_HUGEPAGE;

So if this runs while a switch from true to false is occurring,
we've got a non-dax VMA with huge pages enabled, and the non-dax
page fault path doesn't support this.

If we are really lucky, we'll have IS_DAX() set just long
enough to get through this check in xfs_filemap_huge_fault():

        if (!IS_DAX(file_inode(vmf->vma->vm_file)))
                return VM_FAULT_FALLBACK;

and then we'll get into __xfs_filemap_fault() and block on the
MMAPLOCK. When we actually get that lock, S_DAX will now be false
and we have a huge page fault running through a path (page cache IO)
that doesn't support huge pages...

This is the sort of landmine switching S_DAX without serialisation
causes. The methods themselves look sane, but it's cross-method
dependencies for a stable S_DAX flag that is the real problem.

Yes, we can probably fix this by adding XFS_MMAPLOCK_SHARED here,
but means every filesystem has to solve the same race conditions
itself. That's hard to get right and easy to get wrong. If the
VFS/mm subsystem provides the infrastructure needed to use this
functionality safely, then that is hard for filesystem developers to
get wrong.....

> and in the normal read/write path ensure the
> flag can't be switch while I/O is going on, which could either be
> done by ensuring it is only switched under i_rwsem or equivalent
> protection, or by setting the DAX flag once in the iocb similar to
> IOCB_DIRECT.

The iocb path is not the problem - that's entirely serialised
against S_DAX changes by the i_rwsem. The problem is that we have no
equivalent filesystem level serialisation for the entire mmap/page
fault path, and it checks S_DAX all over the place.

It's basically the same limitation that we have with mmap vs direct
IO - we can't lock out mmap when we do direct IO, so we cannot
guarantee coherency between the two. Similar here - we cannot
lockout mmap in any sane way, so we cannot guarantee coherency
between mmap and changing the S_DAX flag.

That's the underlying problem we need to solve here.

/me wonders if the best thing to do is to add a ->fault callout to
tell the filesystem to lock/unlock the inode right up at the top of
the page fault path, outside even the mmap_sem.  That means all the
methods that the page fault calls are protected against S_DAX
changes, and it gives us a low cost method of serialising page
faults against DIO (e.g. via inode_dio_wait())....

> And they easiest way to get all this done is as a first step to
> just allowing switching the flag when no blocks are allocated at
> all, similar to how the rt flag works.

False equivalence - it is not similar because the RT flag changes
and their associated state checks are *already fully serialised* by
the XFS_ILOCK_EXCL. S_DAX accesses have no such serialisation, and
that's the problem we need to solve...

In reality, I don't really care how we solve this problem, but we've
been down the path you are suggesting at least twice before and each
time we've ended up in a "that doesn't work" corner and had to
persue other solutions. I don't like going around in circles.

Cheers,

Dave.
Christoph Hellwig Feb. 25, 2020, 5:36 p.m. UTC | #8
On Tue, Feb 25, 2020 at 11:09:37AM +1100, Dave Chinner wrote:
> > No, the original code was broken because it didn't serialize between
> > DAX and buffer access.
> > 
> > Take a step back and look where the problems are, and they are not
> > mostly with the aops.  In fact the only aop useful for DAX is
> > is ->writepages, and how it uses ->writepages is actually a bit of
> > an abuse of that interface.
> 
> The races are all through the fops, too, which is one of the reasons
> Darrick mentioned we should probably move this up to file ops
> level...

But the file ops are very simple to use.  Pass the flag in the iocb,
and make sure the flag can only changed with i_rwsem held.  That part
is pretty trivial, the interesting case is mmap because it is so
spread out.

> > So what we really need it just a way to prevent switching the flag
> > when a file is mapped,
> 
> That's not sufficient.
> 
> We also have to prevent the file from being mapped *while we are
> switching*. Nothing in the mmap() path actually serialises against
> filesystem operations, and the initial behavioural checks in the
> page fault path are similarly unserialised against changing the
> S_DAX flag.

And the important part here is ->mmap.  If ->mmap doesn't get through
we are not going to see page faults.

> > and in the normal read/write path ensure the
> > flag can't be switch while I/O is going on, which could either be
> > done by ensuring it is only switched under i_rwsem or equivalent
> > protection, or by setting the DAX flag once in the iocb similar to
> > IOCB_DIRECT.
> 
> The iocb path is not the problem - that's entirely serialised
> against S_DAX changes by the i_rwsem. The problem is that we have no
> equivalent filesystem level serialisation for the entire mmap/page
> fault path, and it checks S_DAX all over the place.

Not right now.  We have various IS_DAX checks outside it.  But it is
easily fixable indeed.

> /me wonders if the best thing to do is to add a ->fault callout to
> tell the filesystem to lock/unlock the inode right up at the top of
> the page fault path, outside even the mmap_sem.  That means all the
> methods that the page fault calls are protected against S_DAX
> changes, and it gives us a low cost method of serialising page
> faults against DIO (e.g. via inode_dio_wait())....

Maybe.  Especially if it solves real problems and isn't just new
overhead to add an esoteric feature.

> 
> > And they easiest way to get all this done is as a first step to
> > just allowing switching the flag when no blocks are allocated at
> > all, similar to how the rt flag works.
> 
> False equivalence - it is not similar because the RT flag changes
> and their associated state checks are *already fully serialised* by
> the XFS_ILOCK_EXCL. S_DAX accesses have no such serialisation, and
> that's the problem we need to solve...

And my point is that if we ensure S_DAX can only be checked if there
are no blocks on the file, is is fairly easy to provide the same
guarantee.  And I've not heard any argument that we really need more
flexibility than that.  In fact I think just being able to change it
on the parent directory and inheriting the flag might be more than
plenty, which would lead to a very simple implementation without any
of the crazy overhead in this series.
Jeff Moyer Feb. 25, 2020, 7:37 p.m. UTC | #9
Christoph Hellwig <hch@lst.de> writes:

> And my point is that if we ensure S_DAX can only be checked if there
> are no blocks on the file, is is fairly easy to provide the same
> guarantee.  And I've not heard any argument that we really need more
> flexibility than that.  In fact I think just being able to change it
> on the parent directory and inheriting the flag might be more than
> plenty, which would lead to a very simple implementation without any
> of the crazy overhead in this series.

I know of one user who had at least mentioned it to me, so I cc'd him.
Jonathan, can you describe your use case for being able to change a
file between dax and non-dax modes?  Or, if I'm misremembering, just
correct me?

Thanks!
Jeff
Ira Weiny Feb. 25, 2020, 9:03 p.m. UTC | #10
On Tue, Feb 25, 2020 at 06:36:33PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 25, 2020 at 11:09:37AM +1100, Dave Chinner wrote:
> > > No, the original code was broken because it didn't serialize between
> > > DAX and buffer access.
> > > 
> > > Take a step back and look where the problems are, and they are not
> > > mostly with the aops.  In fact the only aop useful for DAX is
> > > is ->writepages, and how it uses ->writepages is actually a bit of
> > > an abuse of that interface.
> > 
> > The races are all through the fops, too, which is one of the reasons
> > Darrick mentioned we should probably move this up to file ops
> > level...
> 
> But the file ops are very simple to use.  Pass the flag in the iocb,
> and make sure the flag can only changed with i_rwsem held.  That part
> is pretty trivial, the interesting case is mmap because it is so
> spread out.
> 
> > > So what we really need it just a way to prevent switching the flag
> > > when a file is mapped,
> > 
> > That's not sufficient.
> > 
> > We also have to prevent the file from being mapped *while we are
> > switching*. Nothing in the mmap() path actually serialises against
> > filesystem operations, and the initial behavioural checks in the
> > page fault path are similarly unserialised against changing the
> > S_DAX flag.
> 
> And the important part here is ->mmap.  If ->mmap doesn't get through
> we are not going to see page faults.

The series already blocks mmap'ed files from a switch.  That was not crazy
hard.  I don't see a use case to support changing the mode while the file is
mapped.

> 
> > > and in the normal read/write path ensure the
> > > flag can't be switch while I/O is going on, which could either be
> > > done by ensuring it is only switched under i_rwsem or equivalent
> > > protection, or by setting the DAX flag once in the iocb similar to
> > > IOCB_DIRECT.
> > 
> > The iocb path is not the problem - that's entirely serialised
> > against S_DAX changes by the i_rwsem. The problem is that we have no
> > equivalent filesystem level serialisation for the entire mmap/page
> > fault path, and it checks S_DAX all over the place.
> 
> Not right now.  We have various IS_DAX checks outside it.  But it is
> easily fixable indeed.
> 
> > /me wonders if the best thing to do is to add a ->fault callout to
> > tell the filesystem to lock/unlock the inode right up at the top of
> > the page fault path, outside even the mmap_sem.  That means all the
> > methods that the page fault calls are protected against S_DAX
> > changes, and it gives us a low cost method of serialising page
> > faults against DIO (e.g. via inode_dio_wait())....
> 
> Maybe.  Especially if it solves real problems and isn't just new
> overhead to add an esoteric feature.

I thought about doing something like this but it seemed easier to block the
change while the file was mapped.

> 
> > 
> > > And they easiest way to get all this done is as a first step to
> > > just allowing switching the flag when no blocks are allocated at
> > > all, similar to how the rt flag works.
> > 
> > False equivalence - it is not similar because the RT flag changes
> > and their associated state checks are *already fully serialised* by
> > the XFS_ILOCK_EXCL. S_DAX accesses have no such serialisation, and
> > that's the problem we need to solve...
> 
> And my point is that if we ensure S_DAX can only be checked if there
> are no blocks on the file, is is fairly easy to provide the same
> guarantee.  And I've not heard any argument that we really need more
> flexibility than that.  In fact I think just being able to change it
> on the parent directory and inheriting the flag might be more than
> plenty, which would lead to a very simple implementation without any
> of the crazy overhead in this series.

Inherit on file creation or also if a file was moved under the directory?

Do we need to consider hard or soft links?

Ira
Jonathan Halliday Feb. 26, 2020, 9:28 a.m. UTC | #11
Hi All

I'm a middleware developer, focused on how Java (JVM) workloads can 
benefit from app-direct mode pmem. Initially the target is apps that 
need a fast binary log for fault tolerance: the classic database WAL use 
case; transaction coordination systems; enterprise message bus 
persistence and suchlike. Critically, there are cases where we use log 
based storage, i.e. it's not the strict 'read rarely, only on recovery' 
model that a classic db may have, but more of a 'append only, read many 
times' event stream model.

Think of the log oriented data storage as having logical segments (let's 
implement them as files), of which the most recent is being appended to 
(read_write) and the remaining N-1 older segments are full and sealed, 
so effectively immutable (read_only) until discarded. The tail segment 
needs to be in DAX mode for optimal write performance, as the size of 
the append may be sub-block and we don't want the overhead of the kernel 
call anyhow. So that's clearly a good fit for putting on a DAX fs mount 
and using mmap with MAP_SYNC.

However, we want fast read access into the segments, to retrieve stored 
records. The small access index can be built in volatile RAM (assuming 
we're willing to take the startup overhead of a full file scan at 
recovery time) but the data itself is big and we don't want to move it 
all off pmem. Which means the requirements are now different: we want 
the O/S cache to pull hot data into fast volatile RAM for us, which DAX 
explicitly won't do. Effectively a poor man's 'memory mode' pmem, rather 
than app-direct mode, except here we're using the O/S rather than the 
hardware memory controller to do the cache management for us.

Currently this requires closing the full (read_write) file, then copying 
it to a non-DAX device and reopening it (read_only) there. Clearly 
that's expensive and rather tedious. Instead, I'd like to close the 
MAP_SYNC mmap, then, leaving the file where it is, reopen it in a mode 
that will instead go via the O/S cache in the traditional manner. Bonus 
points if I can do it over non-overlapping ranges in a file without 
closing the DAX mode mmap, since then the segments are entirely logical 
instead of needing separate physical files.

I note a comment below regarding a per-directly setting, but don't have 
the background to fully understand what's being suggested. However, I'll 
note here that I can live with a per-directory granularity, as relinking 
a file into a new dir is a constant time operation, whilst the move 
described above isn't. So if a per-directory granularity is easier than 
a per-file one that's fine, though as a person with only passing 
knowledge of filesystem design I don't see how having multiple links to 
a file can work cleanly in that case.

Hope that helps.

Jonathan

P.S. I'll cheekily take the opportunity of having your attention to tack 
on one minor gripe about the current system: The only way to know if a 
mmap with MAP_SYNC will work is to try it and catch the error. Which 
would be reasonable if it were free of side effects.  However, the 
process requires first expanding the file to at least the size of the 
desired map, which is done non-atomically i.e. is user visible. There 
are thus nasty race conditions in the cleanup, where after a failed mmap 
attempt (e.g the device doesn't support DAX), we try to shrink the file 
back to its original size, but something else has already opened it at 
its new, larger size. This is not theoretical: I got caught by it whilst 
adapting some of our middleware to use pmem.  Therefore, some way to 
probe the file path for its capability would be nice, much the same as I 
can e.g. inspect file permissions to (more or less) evaluate if I can 
write it without actually mutating it.  Thanks!



On 25/02/2020 19:37, Jeff Moyer wrote:
> Christoph Hellwig <hch@lst.de> writes:
> 
>> And my point is that if we ensure S_DAX can only be checked if there
>> are no blocks on the file, is is fairly easy to provide the same
>> guarantee.  And I've not heard any argument that we really need more
>> flexibility than that.  In fact I think just being able to change it
>> on the parent directory and inheriting the flag might be more than
>> plenty, which would lead to a very simple implementation without any
>> of the crazy overhead in this series.
> 
> I know of one user who had at least mentioned it to me, so I cc'd him.
> Jonathan, can you describe your use case for being able to change a
> file between dax and non-dax modes?  Or, if I'm misremembering, just
> correct me?
> 
> Thanks!
> Jeff
>
Jan Kara Feb. 26, 2020, 11:17 a.m. UTC | #12
On Tue 25-02-20 11:09:37, Dave Chinner wrote:
> /me wonders if the best thing to do is to add a ->fault callout to
> tell the filesystem to lock/unlock the inode right up at the top of
> the page fault path, outside even the mmap_sem.  That means all the
> methods that the page fault calls are protected against S_DAX
> changes, and it gives us a low cost method of serialising page
> faults against DIO (e.g. via inode_dio_wait())....

Well, that's going to be pretty hard. The main problem is: you cannot
lookup VMA until you hold mmap_sem and the inode is inside the VMA. And
this is a fundamental problem because until you hold mmap_sem, the address
space can change and thus the virtual address you are faulting can be
changing inode it is mapped to. So you would have to do some dance like:

lock mmap_sem
lookup vma
get inode reference
drop mmap_sem
tell fs about page fault
lock mmap_sem
is the vma still the same?

And I'm pretty confident the overhead will be visible in page fault
intensive workloads...

								Honza
Jan Kara Feb. 26, 2020, 11:31 a.m. UTC | #13
Hello,

On Wed 26-02-20 09:28:57, Jonathan Halliday wrote:
> I'm a middleware developer, focused on how Java (JVM) workloads can benefit
> from app-direct mode pmem. Initially the target is apps that need a fast
> binary log for fault tolerance: the classic database WAL use case;
> transaction coordination systems; enterprise message bus persistence and
> suchlike. Critically, there are cases where we use log based storage, i.e.
> it's not the strict 'read rarely, only on recovery' model that a classic db
> may have, but more of a 'append only, read many times' event stream model.
> 
> Think of the log oriented data storage as having logical segments (let's
> implement them as files), of which the most recent is being appended to
> (read_write) and the remaining N-1 older segments are full and sealed, so
> effectively immutable (read_only) until discarded. The tail segment needs to
> be in DAX mode for optimal write performance, as the size of the append may
> be sub-block and we don't want the overhead of the kernel call anyhow. So
> that's clearly a good fit for putting on a DAX fs mount and using mmap with
> MAP_SYNC.
> 
> However, we want fast read access into the segments, to retrieve stored
> records. The small access index can be built in volatile RAM (assuming we're
> willing to take the startup overhead of a full file scan at recovery time)
> but the data itself is big and we don't want to move it all off pmem. Which
> means the requirements are now different: we want the O/S cache to pull hot
> data into fast volatile RAM for us, which DAX explicitly won't do.
> Effectively a poor man's 'memory mode' pmem, rather than app-direct mode,
> except here we're using the O/S rather than the hardware memory controller
> to do the cache management for us.
> 
> Currently this requires closing the full (read_write) file, then copying it
> to a non-DAX device and reopening it (read_only) there. Clearly that's
> expensive and rather tedious. Instead, I'd like to close the MAP_SYNC mmap,
> then, leaving the file where it is, reopen it in a mode that will instead go
> via the O/S cache in the traditional manner. Bonus points if I can do it
> over non-overlapping ranges in a file without closing the DAX mode mmap,
> since then the segments are entirely logical instead of needing separate
> physical files.
> 
> I note a comment below regarding a per-directly setting, but don't have the
> background to fully understand what's being suggested. However, I'll note
> here that I can live with a per-directory granularity, as relinking a file
> into a new dir is a constant time operation, whilst the move described above
> isn't. So if a per-directory granularity is easier than a per-file one
> that's fine, though as a person with only passing knowledge of filesystem
> design I don't see how having multiple links to a file can work cleanly in
> that case.

Well, with per-directory setting, relinking the file will not magically
make it stop using DAX. So your situation would be very similar to the
current one, except "copy to non-DAX device" can be replaced by "copy to
non-DAX directory". Maybe the "copy" part could be actually reflink which
would make it faster.

> P.S. I'll cheekily take the opportunity of having your attention to tack on
> one minor gripe about the current system: The only way to know if a mmap
> with MAP_SYNC will work is to try it and catch the error. Which would be
> reasonable if it were free of side effects.  However, the process requires
> first expanding the file to at least the size of the desired map, which is
> done non-atomically i.e. is user visible. There are thus nasty race
> conditions in the cleanup, where after a failed mmap attempt (e.g the device
> doesn't support DAX), we try to shrink the file back to its original size,
> but something else has already opened it at its new, larger size. This is
> not theoretical: I got caught by it whilst adapting some of our middleware
> to use pmem.  Therefore, some way to probe the file path for its capability
> would be nice, much the same as I can e.g. inspect file permissions to (more
> or less) evaluate if I can write it without actually mutating it.  Thanks!

Well, reporting error on mmap(2) is the only way how to avoid
time-to-check-time-to-use races. And these are very important when we are
speaking about data integrity guarantees. So that's not going to change.
But with Ira's patches you could use statx(2) to check whether file at
least supports DAX and so avoid doing mmap check with the side effects in
the common case where it's hopeless... I'd also think that you could
currently do mmap check with the current file size and if it succeeds,
expand the file to the desired size and mmap again. It's not ideal but it
should work.

								Honza
Jonathan Halliday Feb. 26, 2020, 11:56 a.m. UTC | #14
On 26/02/2020 11:31, Jan Kara wrote:
> Hello,
> 
> On Wed 26-02-20 09:28:57, Jonathan Halliday wrote:
>> I'm a middleware developer, focused on how Java (JVM) workloads can benefit
>> from app-direct mode pmem. Initially the target is apps that need a fast
>> binary log for fault tolerance: the classic database WAL use case;
>> transaction coordination systems; enterprise message bus persistence and
>> suchlike. Critically, there are cases where we use log based storage, i.e.
>> it's not the strict 'read rarely, only on recovery' model that a classic db
>> may have, but more of a 'append only, read many times' event stream model.
>>
>> Think of the log oriented data storage as having logical segments (let's
>> implement them as files), of which the most recent is being appended to
>> (read_write) and the remaining N-1 older segments are full and sealed, so
>> effectively immutable (read_only) until discarded. The tail segment needs to
>> be in DAX mode for optimal write performance, as the size of the append may
>> be sub-block and we don't want the overhead of the kernel call anyhow. So
>> that's clearly a good fit for putting on a DAX fs mount and using mmap with
>> MAP_SYNC.
>>
>> However, we want fast read access into the segments, to retrieve stored
>> records. The small access index can be built in volatile RAM (assuming we're
>> willing to take the startup overhead of a full file scan at recovery time)
>> but the data itself is big and we don't want to move it all off pmem. Which
>> means the requirements are now different: we want the O/S cache to pull hot
>> data into fast volatile RAM for us, which DAX explicitly won't do.
>> Effectively a poor man's 'memory mode' pmem, rather than app-direct mode,
>> except here we're using the O/S rather than the hardware memory controller
>> to do the cache management for us.
>>
>> Currently this requires closing the full (read_write) file, then copying it
>> to a non-DAX device and reopening it (read_only) there. Clearly that's
>> expensive and rather tedious. Instead, I'd like to close the MAP_SYNC mmap,
>> then, leaving the file where it is, reopen it in a mode that will instead go
>> via the O/S cache in the traditional manner. Bonus points if I can do it
>> over non-overlapping ranges in a file without closing the DAX mode mmap,
>> since then the segments are entirely logical instead of needing separate
>> physical files.
>>
>> I note a comment below regarding a per-directly setting, but don't have the
>> background to fully understand what's being suggested. However, I'll note
>> here that I can live with a per-directory granularity, as relinking a file
>> into a new dir is a constant time operation, whilst the move described above
>> isn't. So if a per-directory granularity is easier than a per-file one
>> that's fine, though as a person with only passing knowledge of filesystem
>> design I don't see how having multiple links to a file can work cleanly in
>> that case.
> 
> Well, with per-directory setting, relinking the file will not magically
> make it stop using DAX. So your situation would be very similar to the
> current one, except "copy to non-DAX device" can be replaced by "copy to
> non-DAX directory". Maybe the "copy" part could be actually reflink which
> would make it faster.

Indeed. The requirement is for 'change mode in constant time' rather 
than the current 'change mode in time proportional to file size'. That 
seems to imply requiring the approach to just change fs metadata, 
without relocating the file data bytes. Beyond that I'm largely 
indifferent to the implementation details.

>> P.S. I'll cheekily take the opportunity of having your attention to tack on
>> one minor gripe about the current system: The only way to know if a mmap
>> with MAP_SYNC will work is to try it and catch the error. Which would be
>> reasonable if it were free of side effects.  However, the process requires
>> first expanding the file to at least the size of the desired map, which is
>> done non-atomically i.e. is user visible. There are thus nasty race
>> conditions in the cleanup, where after a failed mmap attempt (e.g the device
>> doesn't support DAX), we try to shrink the file back to its original size,
>> but something else has already opened it at its new, larger size. This is
>> not theoretical: I got caught by it whilst adapting some of our middleware
>> to use pmem.  Therefore, some way to probe the file path for its capability
>> would be nice, much the same as I can e.g. inspect file permissions to (more
>> or less) evaluate if I can write it without actually mutating it.  Thanks!
> 
> Well, reporting error on mmap(2) is the only way how to avoid
> time-to-check-time-to-use races. And these are very important when we are
> speaking about data integrity guarantees. So that's not going to change.
> But with Ira's patches you could use statx(2) to check whether file at
> least supports DAX and so avoid doing mmap check with the side effects in
> the common case where it's hopeless... I'd also think that you could
> currently do mmap check with the current file size and if it succeeds,
> expand the file to the desired size and mmap again. It's not ideal but it
> should work.

Sure. Best effort is fine here, just as with looking at the permission 
bits on a file example - even in the absence of racing permission 
changes it's not definitive because of additional quota or selinux 
checks, but it's a reasonable approximation. That's a sufficiently 
useful improvement for my purposes, given the impractical nature of a 
100% solution.

Jonathan
Ira Weiny Feb. 26, 2020, 3:57 p.m. UTC | #15
On Wed, Feb 26, 2020 at 12:17:40PM +0100, Jan Kara wrote:
> On Tue 25-02-20 11:09:37, Dave Chinner wrote:
> > /me wonders if the best thing to do is to add a ->fault callout to
> > tell the filesystem to lock/unlock the inode right up at the top of
> > the page fault path, outside even the mmap_sem.  That means all the
> > methods that the page fault calls are protected against S_DAX
> > changes, and it gives us a low cost method of serialising page
> > faults against DIO (e.g. via inode_dio_wait())....
> 
> Well, that's going to be pretty hard. The main problem is: you cannot
> lookup VMA until you hold mmap_sem and the inode is inside the VMA. And
> this is a fundamental problem because until you hold mmap_sem, the address
> space can change and thus the virtual address you are faulting can be
> changing inode it is mapped to. So you would have to do some dance like:
> 
> lock mmap_sem
> lookup vma
> get inode reference
> drop mmap_sem
> tell fs about page fault
> lock mmap_sem
> is the vma still the same?
> 
> And I'm pretty confident the overhead will be visible in page fault
> intensive workloads...

I did not get to this level of detail...  Rather I looked at it from a high
level perspective and thought "does the mode need to change while someone has
the mmap?"  My thought is, that it does not make a lot of sense.  Generally the
user has mmaped with some use case in mind (either DAX or non-DAX) and it seems
reasonable to keep that mode consistent while the map is in place.

So I punted and restricted the change.

Ira

> 
> 								Honza
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Ira Weiny Feb. 26, 2020, 4:10 p.m. UTC | #16
On Wed, Feb 26, 2020 at 09:28:57AM +0000, Jonathan Halliday wrote:
> 
> Hi All
> 
> I'm a middleware developer, focused on how Java (JVM) workloads can benefit
> from app-direct mode pmem. Initially the target is apps that need a fast
> binary log for fault tolerance: the classic database WAL use case;
> transaction coordination systems; enterprise message bus persistence and
> suchlike. Critically, there are cases where we use log based storage, i.e.
> it's not the strict 'read rarely, only on recovery' model that a classic db
> may have, but more of a 'append only, read many times' event stream model.
> 
> Think of the log oriented data storage as having logical segments (let's
> implement them as files), of which the most recent is being appended to
> (read_write) and the remaining N-1 older segments are full and sealed, so
> effectively immutable (read_only) until discarded. The tail segment needs to
> be in DAX mode for optimal write performance, as the size of the append may
> be sub-block and we don't want the overhead of the kernel call anyhow. So
> that's clearly a good fit for putting on a DAX fs mount and using mmap with
> MAP_SYNC.
> 
> However, we want fast read access into the segments, to retrieve stored
> records. The small access index can be built in volatile RAM (assuming we're
> willing to take the startup overhead of a full file scan at recovery time)
> but the data itself is big and we don't want to move it all off pmem. Which
> means the requirements are now different: we want the O/S cache to pull hot
> data into fast volatile RAM for us, which DAX explicitly won't do.
> Effectively a poor man's 'memory mode' pmem, rather than app-direct mode,
> except here we're using the O/S rather than the hardware memory controller
> to do the cache management for us.
> 
> Currently this requires closing the full (read_write) file, then copying it
> to a non-DAX device and reopening it (read_only) there. Clearly that's
> expensive and rather tedious. Instead, I'd like to close the MAP_SYNC mmap,
> then, leaving the file where it is, reopen it in a mode that will instead go
> via the O/S cache in the traditional manner.

This patch set implements this capability.

> Bonus points if I can do it
> over non-overlapping ranges in a file without closing the DAX mode mmap,
> since then the segments are entirely logical instead of needing separate
> physical files.

But it is too hard to do so while an mmap is in place so it can't do this.  So
no bonus points...

> 
> I note a comment below regarding a per-directly setting, but don't have the
> background to fully understand what's being suggested. However, I'll note
> here that I can live with a per-directory granularity, as relinking a file
> into a new dir is a constant time operation, whilst the move described above
> isn't. So if a per-directory granularity is easier than a per-file one
> that's fine, though as a person with only passing knowledge of filesystem
> design I don't see how having multiple links to a file can work cleanly in
> that case.

The more I think about it the more I'm not comfortable with a directory option.
soft links and hard links complicate that IMO.  The current inheritance model
is at file creation time and the file sticks with that state (mode).  That is
easy enough and carry's with the file without having the complexity of possibly
looking at the wrong parent directory.

Ira


> 
> Hope that helps.
> 
> Jonathan
> 
> P.S. I'll cheekily take the opportunity of having your attention to tack on
> one minor gripe about the current system: The only way to know if a mmap
> with MAP_SYNC will work is to try it and catch the error. Which would be
> reasonable if it were free of side effects.  However, the process requires
> first expanding the file to at least the size of the desired map, which is
> done non-atomically i.e. is user visible. There are thus nasty race
> conditions in the cleanup, where after a failed mmap attempt (e.g the device
> doesn't support DAX), we try to shrink the file back to its original size,
> but something else has already opened it at its new, larger size. This is
> not theoretical: I got caught by it whilst adapting some of our middleware
> to use pmem.  Therefore, some way to probe the file path for its capability
> would be nice, much the same as I can e.g. inspect file permissions to (more
> or less) evaluate if I can write it without actually mutating it.  Thanks!
> 
> 
> 
> On 25/02/2020 19:37, Jeff Moyer wrote:
> > Christoph Hellwig <hch@lst.de> writes:
> > 
> > > And my point is that if we ensure S_DAX can only be checked if there
> > > are no blocks on the file, is is fairly easy to provide the same
> > > guarantee.  And I've not heard any argument that we really need more
> > > flexibility than that.  In fact I think just being able to change it
> > > on the parent directory and inheriting the flag might be more than
> > > plenty, which would lead to a very simple implementation without any
> > > of the crazy overhead in this series.
> > 
> > I know of one user who had at least mentioned it to me, so I cc'd him.
> > Jonathan, can you describe your use case for being able to change a
> > file between dax and non-dax modes?  Or, if I'm misremembering, just
> > correct me?
> > 
> > Thanks!
> > Jeff
> > 
> 
> -- 
> Registered in England and Wales under Company Registration No. 03798903
> Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
>
Dan Williams Feb. 26, 2020, 4:46 p.m. UTC | #17
On Wed, Feb 26, 2020 at 1:29 AM Jonathan Halliday
<jonathan.halliday@redhat.com> wrote:
>
>
> Hi All
>
> I'm a middleware developer, focused on how Java (JVM) workloads can
> benefit from app-direct mode pmem. Initially the target is apps that
> need a fast binary log for fault tolerance: the classic database WAL use
> case; transaction coordination systems; enterprise message bus
> persistence and suchlike. Critically, there are cases where we use log
> based storage, i.e. it's not the strict 'read rarely, only on recovery'
> model that a classic db may have, but more of a 'append only, read many
> times' event stream model.
>
> Think of the log oriented data storage as having logical segments (let's
> implement them as files), of which the most recent is being appended to
> (read_write) and the remaining N-1 older segments are full and sealed,
> so effectively immutable (read_only) until discarded. The tail segment
> needs to be in DAX mode for optimal write performance, as the size of
> the append may be sub-block and we don't want the overhead of the kernel
> call anyhow. So that's clearly a good fit for putting on a DAX fs mount
> and using mmap with MAP_SYNC.
>
> However, we want fast read access into the segments, to retrieve stored
> records. The small access index can be built in volatile RAM (assuming
> we're willing to take the startup overhead of a full file scan at
> recovery time) but the data itself is big and we don't want to move it
> all off pmem. Which means the requirements are now different: we want
> the O/S cache to pull hot data into fast volatile RAM for us, which DAX
> explicitly won't do. Effectively a poor man's 'memory mode' pmem, rather
> than app-direct mode, except here we're using the O/S rather than the
> hardware memory controller to do the cache management for us.
>
> Currently this requires closing the full (read_write) file, then copying
> it to a non-DAX device and reopening it (read_only) there. Clearly
> that's expensive and rather tedious. Instead, I'd like to close the
> MAP_SYNC mmap, then, leaving the file where it is, reopen it in a mode
> that will instead go via the O/S cache in the traditional manner. Bonus
> points if I can do it over non-overlapping ranges in a file without
> closing the DAX mode mmap, since then the segments are entirely logical
> instead of needing separate physical files.

Hi John,

IIRC we chatted about this at PIRL, right?

At the time it sounded more like mixed mode dax, i.e. dax writes, but
cached reads. To me that's an optimization to optionally use dax for
direct-I/O writes, with its existing set of page-cache coherence
warts, and not a capability to dynamically switch the dax-mode.
mmap+MAP_SYNC seems the wrong interface for this. This writeup
mentions bypassing kernel call overhead, but I don't see how a
dax-write syscall is cheaper than an mmap syscall plus fault. If
direct-I/O to a dax capable file bypasses the block layer, isn't that
about the maximum of kernel overhead that can be cut out of this use
case? Otherwise MAP_SYNC is a facility to achieve efficient sub-block
update-in-place writes not append writes.
Jan Kara Feb. 26, 2020, 5:20 p.m. UTC | #18
On Wed 26-02-20 08:46:42, Dan Williams wrote:
> On Wed, Feb 26, 2020 at 1:29 AM Jonathan Halliday
> <jonathan.halliday@redhat.com> wrote:
> >
> >
> > Hi All
> >
> > I'm a middleware developer, focused on how Java (JVM) workloads can
> > benefit from app-direct mode pmem. Initially the target is apps that
> > need a fast binary log for fault tolerance: the classic database WAL use
> > case; transaction coordination systems; enterprise message bus
> > persistence and suchlike. Critically, there are cases where we use log
> > based storage, i.e. it's not the strict 'read rarely, only on recovery'
> > model that a classic db may have, but more of a 'append only, read many
> > times' event stream model.
> >
> > Think of the log oriented data storage as having logical segments (let's
> > implement them as files), of which the most recent is being appended to
> > (read_write) and the remaining N-1 older segments are full and sealed,
> > so effectively immutable (read_only) until discarded. The tail segment
> > needs to be in DAX mode for optimal write performance, as the size of
> > the append may be sub-block and we don't want the overhead of the kernel
> > call anyhow. So that's clearly a good fit for putting on a DAX fs mount
> > and using mmap with MAP_SYNC.
> >
> > However, we want fast read access into the segments, to retrieve stored
> > records. The small access index can be built in volatile RAM (assuming
> > we're willing to take the startup overhead of a full file scan at
> > recovery time) but the data itself is big and we don't want to move it
> > all off pmem. Which means the requirements are now different: we want
> > the O/S cache to pull hot data into fast volatile RAM for us, which DAX
> > explicitly won't do. Effectively a poor man's 'memory mode' pmem, rather
> > than app-direct mode, except here we're using the O/S rather than the
> > hardware memory controller to do the cache management for us.
> >
> > Currently this requires closing the full (read_write) file, then copying
> > it to a non-DAX device and reopening it (read_only) there. Clearly
> > that's expensive and rather tedious. Instead, I'd like to close the
> > MAP_SYNC mmap, then, leaving the file where it is, reopen it in a mode
> > that will instead go via the O/S cache in the traditional manner. Bonus
> > points if I can do it over non-overlapping ranges in a file without
> > closing the DAX mode mmap, since then the segments are entirely logical
> > instead of needing separate physical files.
> 
> Hi John,
> 
> IIRC we chatted about this at PIRL, right?
> 
> At the time it sounded more like mixed mode dax, i.e. dax writes, but
> cached reads. To me that's an optimization to optionally use dax for
> direct-I/O writes, with its existing set of page-cache coherence
> warts, and not a capability to dynamically switch the dax-mode.
> mmap+MAP_SYNC seems the wrong interface for this. This writeup
> mentions bypassing kernel call overhead, but I don't see how a
> dax-write syscall is cheaper than an mmap syscall plus fault. If
> direct-I/O to a dax capable file bypasses the block layer, isn't that
> about the maximum of kernel overhead that can be cut out of this use
> case? Otherwise MAP_SYNC is a facility to achieve efficient sub-block
> update-in-place writes not append writes.

Well, even for appends you'll pay the cost only once per page (or maybe even
once per huge page) when using MAP_SYNC. With a syscall you'll pay once per
write. So although it would be good to check real numbers, the design isn't
non-sensical to me.

								Honza
Dan Williams Feb. 26, 2020, 5:54 p.m. UTC | #19
On Wed, Feb 26, 2020 at 9:20 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 26-02-20 08:46:42, Dan Williams wrote:
> > On Wed, Feb 26, 2020 at 1:29 AM Jonathan Halliday
> > <jonathan.halliday@redhat.com> wrote:
> > >
> > >
> > > Hi All
> > >
> > > I'm a middleware developer, focused on how Java (JVM) workloads can
> > > benefit from app-direct mode pmem. Initially the target is apps that
> > > need a fast binary log for fault tolerance: the classic database WAL use
> > > case; transaction coordination systems; enterprise message bus
> > > persistence and suchlike. Critically, there are cases where we use log
> > > based storage, i.e. it's not the strict 'read rarely, only on recovery'
> > > model that a classic db may have, but more of a 'append only, read many
> > > times' event stream model.
> > >
> > > Think of the log oriented data storage as having logical segments (let's
> > > implement them as files), of which the most recent is being appended to
> > > (read_write) and the remaining N-1 older segments are full and sealed,
> > > so effectively immutable (read_only) until discarded. The tail segment
> > > needs to be in DAX mode for optimal write performance, as the size of
> > > the append may be sub-block and we don't want the overhead of the kernel
> > > call anyhow. So that's clearly a good fit for putting on a DAX fs mount
> > > and using mmap with MAP_SYNC.
> > >
> > > However, we want fast read access into the segments, to retrieve stored
> > > records. The small access index can be built in volatile RAM (assuming
> > > we're willing to take the startup overhead of a full file scan at
> > > recovery time) but the data itself is big and we don't want to move it
> > > all off pmem. Which means the requirements are now different: we want
> > > the O/S cache to pull hot data into fast volatile RAM for us, which DAX
> > > explicitly won't do. Effectively a poor man's 'memory mode' pmem, rather
> > > than app-direct mode, except here we're using the O/S rather than the
> > > hardware memory controller to do the cache management for us.
> > >
> > > Currently this requires closing the full (read_write) file, then copying
> > > it to a non-DAX device and reopening it (read_only) there. Clearly
> > > that's expensive and rather tedious. Instead, I'd like to close the
> > > MAP_SYNC mmap, then, leaving the file where it is, reopen it in a mode
> > > that will instead go via the O/S cache in the traditional manner. Bonus
> > > points if I can do it over non-overlapping ranges in a file without
> > > closing the DAX mode mmap, since then the segments are entirely logical
> > > instead of needing separate physical files.
> >
> > Hi John,
> >
> > IIRC we chatted about this at PIRL, right?
> >
> > At the time it sounded more like mixed mode dax, i.e. dax writes, but
> > cached reads. To me that's an optimization to optionally use dax for
> > direct-I/O writes, with its existing set of page-cache coherence
> > warts, and not a capability to dynamically switch the dax-mode.
> > mmap+MAP_SYNC seems the wrong interface for this. This writeup
> > mentions bypassing kernel call overhead, but I don't see how a
> > dax-write syscall is cheaper than an mmap syscall plus fault. If
> > direct-I/O to a dax capable file bypasses the block layer, isn't that
> > about the maximum of kernel overhead that can be cut out of this use
> > case? Otherwise MAP_SYNC is a facility to achieve efficient sub-block
> > update-in-place writes not append writes.
>
> Well, even for appends you'll pay the cost only once per page (or maybe even
> once per huge page) when using MAP_SYNC. With a syscall you'll pay once per
> write. So although it would be good to check real numbers, the design isn't
> non-sensical to me.

True, Jonathan, how many writes per page are we talking about in this case?
diff mbox series

Patch

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 7d4d09dd5e6d..4a10a232f8e2 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -934,6 +934,22 @@  cache in your filesystem.  The following members are defined:
 	Called during swapoff on files where swap_activate was
 	successful.
 
+Changing DAX 'state' dynamically
+----------------------------------
+
+Some file systems which support DAX want to be able to change the DAX state
+dyanically.  To switch the state safely we lock the inode state in all "normal"
+file system operations and restrict state changes to those operations.  The
+specific rules are.
+
+        1) the direct_IO address_space_operation must be supported in all
+           potential a_ops vectors for any state suported by the inode.
+
+        3) DAX state changes shall not be allowed while the file is mmap'ed
+        4) For non-mmaped operations the VFS layer must take the read lock for any
+           use of IS_DAX()
+        5) Filesystems take the write lock when changing DAX states.
+
 
 The File Object
 ===============
diff --git a/fs/attr.c b/fs/attr.c
index b4bbdbd4c8ca..9b15f73d1079 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -332,6 +332,7 @@  int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	if (error)
 		return error;
 
+	/* DAX read state should already be held here */
 	if (inode->i_op->setattr)
 		error = inode->i_op->setattr(dentry, attr);
 	else
diff --git a/fs/inode.c b/fs/inode.c
index 7d57068b6b7a..6e4f1cc872f2 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -200,6 +200,7 @@  int inode_init_always(struct super_block *sb, struct inode *inode)
 #endif
 	inode->i_flctx = NULL;
 	this_cpu_inc(nr_inodes);
+	init_rwsem(&inode->i_aops_sem);
 
 	return 0;
 out:
@@ -1616,11 +1617,19 @@  EXPORT_SYMBOL(iput);
  */
 int bmap(struct inode *inode, sector_t *block)
 {
-	if (!inode->i_mapping->a_ops->bmap)
-		return -EINVAL;
+	int ret = 0;
+
+	inode_aops_down_read(inode);
+	if (!inode->i_mapping->a_ops->bmap) {
+		ret = -EINVAL;
+		goto err;
+	}
 
 	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
-	return 0;
+
+err:
+	inode_aops_up_read(inode);
+	return ret;
 }
 EXPORT_SYMBOL(bmap);
 #endif
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 7c84c4c027c4..e313a34d5fa6 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -999,6 +999,7 @@  iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		offset = offset_in_page(pos);
 		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
 
+		/* DAX state read should already be held here */
 		if (IS_DAX(inode))
 			status = iomap_dax_zero(pos, offset, bytes, iomap);
 		else
diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..3abf0bfac462 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -59,10 +59,12 @@  int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
 	if (ret)
 		newattrs.ia_valid |= ret | ATTR_FORCE;
 
+	inode_aops_down_read(dentry->d_inode);
 	inode_lock(dentry->d_inode);
 	/* Note any delegations or leases have already been broken: */
 	ret = notify_change(dentry, &newattrs, NULL);
 	inode_unlock(dentry->d_inode);
+	inode_aops_up_read(dentry->d_inode);
 	return ret;
 }
 
@@ -306,7 +308,9 @@  int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EOPNOTSUPP;
 
 	file_start_write(file);
+	inode_aops_down_read(inode);
 	ret = file->f_op->fallocate(file, mode, offset, len);
+	inode_aops_up_read(inode);
 
 	/*
 	 * Create inotify and fanotify events.
diff --git a/fs/stat.c b/fs/stat.c
index 894699c74dde..274b3ccc82b1 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -79,8 +79,10 @@  int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	if (IS_AUTOMOUNT(inode))
 		stat->attributes |= STATX_ATTR_AUTOMOUNT;
 
+	inode_aops_down_read(inode);
 	if (IS_DAX(inode))
 		stat->attributes |= STATX_ATTR_DAX;
+	inode_aops_up_read(inode);
 
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(path, stat, request_mask,
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 836a1f09be03..3e83a97dc047 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -420,6 +420,7 @@  xfs_iget_cache_hit(
 		rcu_read_unlock();
 
 		ASSERT(!rwsem_is_locked(&inode->i_rwsem));
+		ASSERT(!rwsem_is_locked(&inode->i_aops_sem));
 		error = xfs_reinit_inode(mp, inode);
 		if (error) {
 			bool wake;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63d1e533a07d..ad0f2368031b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -40,6 +40,7 @@ 
 #include <linux/fs_types.h>
 #include <linux/build_bug.h>
 #include <linux/stddef.h>
+#include <linux/percpu-rwsem.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -359,6 +360,11 @@  typedef struct {
 typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
 		unsigned long, unsigned long);
 
+/**
+ * NOTE: DO NOT define new functions in address_space_operations without first
+ * considering how dynamic DAX states are to be supported.  See the
+ * inode_aops_*_read() functions
+ */
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
@@ -735,6 +741,9 @@  struct inode {
 #endif
 
 	void			*i_private; /* fs or device private pointer */
+#if defined(CONFIG_FS_DAX)
+	struct rw_semaphore	i_aops_sem;
+#endif
 } __randomize_layout;
 
 struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
@@ -1817,6 +1826,11 @@  struct block_device_operations;
 
 struct iov_iter;
 
+/**
+ * NOTE: DO NOT define new functions in file_operations without first
+ * considering how dynamic address_space operations are to be supported.  See
+ * the inode_aops_*_read() functions in this file.
+ */
 struct file_operations {
 	struct module *owner;
 	loff_t (*llseek) (struct file *, loff_t, int);
@@ -1889,16 +1903,64 @@  struct inode_operations {
 	int (*set_acl)(struct inode *, struct posix_acl *, int);
 } ____cacheline_aligned;
 
+#if defined(CONFIG_FS_DAX)
+/*
+ * Filesystems wishing to support dynamic DAX states must do the following.
+ *
+ * 1) the direct_IO address_space_operation must be supported in all potential
+ *    a_ops vectors for any state suported by the inode.  This is because the
+ *    direct_IO function is used as a flag long before the function is called.
+
+ * 3) DAX state changes shall not be allowed while the file is mmap'ed
+ * 4) For non-mmaped operations the VFS layer must take the read lock for any
+ *    use of IS_DAX()
+ * 5) Filesystems take the write lock when changing DAX states.
+ */
+static inline void inode_aops_down_read(struct inode *inode)
+{
+	down_read(&inode->i_aops_sem);
+}
+static inline void inode_aops_up_read(struct inode *inode)
+{
+	up_read(&inode->i_aops_sem);
+}
+static inline void inode_aops_down_write(struct inode *inode)
+{
+	down_write(&inode->i_aops_sem);
+}
+static inline void inode_aops_up_write(struct inode *inode)
+{
+	up_write(&inode->i_aops_sem);
+}
+#else /* !CONFIG_FS_DAX */
+#define inode_aops_down_read(inode) do { (void)(inode); } while (0)
+#define inode_aops_up_read(inode) do { (void)(inode); } while (0)
+#define inode_aops_down_write(inode) do { (void)(inode); } while (0)
+#define inode_aops_up_write(inode) do { (void)(inode); } while (0)
+#endif /* CONFIG_FS_DAX */
+
 static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
 				     struct iov_iter *iter)
 {
-	return file->f_op->read_iter(kio, iter);
+	struct inode		*inode = file_inode(kio->ki_filp);
+	ssize_t ret;
+
+	inode_aops_down_read(inode);
+	ret = file->f_op->read_iter(kio, iter);
+	inode_aops_up_read(inode);
+	return ret;
 }
 
 static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
 				      struct iov_iter *iter)
 {
-	return file->f_op->write_iter(kio, iter);
+	struct inode		*inode = file_inode(kio->ki_filp);
+	ssize_t ret;
+
+	inode_aops_down_read(inode);
+	ret = file->f_op->write_iter(kio, iter);
+	inode_aops_up_read(inode);
+	return ret;
 }
 
 static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 4f17c83db575..6a30febb11e0 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -48,6 +48,8 @@  int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 	bdi = inode_to_bdi(mapping->host);
 
 	if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) {
+		int ret = 0;
+
 		switch (advice) {
 		case POSIX_FADV_NORMAL:
 		case POSIX_FADV_RANDOM:
@@ -58,9 +60,10 @@  int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 			/* no bad return value, but ignore advice */
 			break;
 		default:
-			return -EINVAL;
+			ret = -EINVAL;
 		}
-		return 0;
+
+		return ret;
 	}
 
 	/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 1784478270e1..3a7863ba51b9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2293,6 +2293,8 @@  generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		 * and return.  Otherwise fallthrough to buffered io for
 		 * the rest of the read.  Buffered reads will not work for
 		 * DAX files, so don't bother trying.
+		 *
+		 * IS_DAX is protected under ->read_iter lock
 		 */
 		if (retval < 0 || !count || iocb->ki_pos >= size ||
 		    IS_DAX(inode))
@@ -3377,6 +3379,8 @@  ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		 * holes, for example.  For DAX files, a buffered write will
 		 * not succeed (even if it did, DAX does not handle dirty
 		 * page-cache pages correctly).
+		 *
+		 * IS_DAX is protected under ->write_iter lock
 		 */
 		if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
 			goto out;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b08b199f9a11..3d05bd10d83e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -572,6 +572,7 @@  unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 	unsigned long ret;
 	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
 
+	/* Should not need locking here because mmap is not allowed */
 	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
 		goto out;
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b679908743cb..f048178e2b93 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1592,9 +1592,11 @@  static void collapse_file(struct mm_struct *mm,
 		} else {	/* !is_shmem */
 			if (!page || xa_is_value(page)) {
 				xas_unlock_irq(&xas);
+				inode_aops_down_read(file->f_inode);
 				page_cache_sync_readahead(mapping, &file->f_ra,
 							  file, index,
 							  PAGE_SIZE);
+				inode_aops_up_read(file->f_inode);
 				/* drain pagevecs to help isolate_lru_page() */
 				lru_add_drain();
 				page = find_lock_page(mapping, index);
diff --git a/mm/util.c b/mm/util.c
index 988d11e6c17c..a4fb0670137d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -501,11 +501,18 @@  unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 
 	ret = security_mmap_file(file, prot, flag);
 	if (!ret) {
-		if (down_write_killable(&mm->mmap_sem))
+		if (file)
+			inode_aops_down_read(file_inode(file));
+		if (down_write_killable(&mm->mmap_sem)) {
+			if (file)
+				inode_aops_up_read(file_inode(file));
 			return -EINTR;
+		}
 		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
 				    &populate, &uf);
 		up_write(&mm->mmap_sem);
+		if (file)
+			inode_aops_up_read(file_inode(file));
 		userfaultfd_unmap_complete(mm, &uf);
 		if (populate)
 			mm_populate(ret, populate);