mbox series

[0/5] Enable per-file/directory DAX operations

Message ID 20191020155935.12297-1-ira.weiny@intel.com
Headers show
Series Enable per-file/directory DAX operations | expand

Message

Ira Weiny Oct. 20, 2019, 3:59 p.m. UTC
From: Ira Weiny <ira.weiny@intel.com>

At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
consumption due to their inability to detect whether the kernel will
instantiate page cache for a file, and cases where a global dax enable via a
mount option is too coarse.

The following patch series enables selecting the use of DAX on individual files
and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
scheme the dax mount option can be omitted to allow the per-file property to
take effect.

The insight at LSF/MM was to separate the per-mount or per-file "physical"
capability switch from an "effective" attribute for the file.

At LSF/MM we discussed the difficulties of switching the mode of a file with
active mappings / page cache. Rather than solve those races the decision was to
just limit mode flips to 0-length files.

Finally, the physical DAX flag inheritance is maintained from previous work on 
XFS but should be added for other file systems for consistence.


[1] https://lwn.net/Articles/787973/
[2] https://lwn.net/Articles/787233/

To: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Cc: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org

Ira Weiny (5):
  fs/stat: Define DAX statx attribute
  fs/xfs: Isolate the physical DAX flag from effective
  fs/xfs: Separate functionality of xfs_inode_supports_dax()
  fs/xfs: Clean up DAX support check
  fs/xfs: Allow toggle of physical DAX flag

 fs/stat.c                 |  3 +++
 fs/xfs/xfs_ioctl.c        | 32 ++++++++++++++------------------
 fs/xfs/xfs_iops.c         | 36 ++++++++++++++++++++++++++++++------
 fs/xfs/xfs_iops.h         |  2 ++
 include/uapi/linux/stat.h |  1 +
 5 files changed, 50 insertions(+), 24 deletions(-)

Comments

Boaz Harrosh Oct. 22, 2019, 11:21 a.m. UTC | #1
On 20/10/2019 18:59, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
> consumption due to their inability to detect whether the kernel will
> instantiate page cache for a file, and cases where a global dax enable via a
> mount option is too coarse.
> 
> The following patch series enables selecting the use of DAX on individual files
> and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
> scheme the dax mount option can be omitted to allow the per-file property to
> take effect.
> 
> The insight at LSF/MM was to separate the per-mount or per-file "physical"
> capability switch from an "effective" attribute for the file.
> 
> At LSF/MM we discussed the difficulties of switching the mode of a file with
> active mappings / page cache. Rather than solve those races the decision was to
> just limit mode flips to 0-length files.
> 

What I understand above is that only "writers" before writing any bytes may
turn the flag on, which then persists. But as a very long time user of DAX, usually
it is the writers that are least interesting. With lots of DAX technologies and
emulations the write is slower and needs slow "flushing".

The more interesting and performance gains comes from DAX READs actually.
specially cross the VM guest. (IE. All VMs share host memory or pmem)

This fixture as I understand it, that I need to know before I write if I will
want DAX or not and then the write is DAX as well as reads after that, looks
not very interesting for me as a user.

Just my $0.17
Boaz

> Finally, the physical DAX flag inheritance is maintained from previous work on 
> XFS but should be added for other file systems for consistence.
> 
> 
> [1] https://lwn.net/Articles/787973/
> [2] https://lwn.net/Articles/787233/
> 
> To: linux-kernel@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
> Cc: Jan Kara <jack@suse.cz>
> Cc: linux-ext4@vger.kernel.org
> Cc: linux-xfs@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> 
> Ira Weiny (5):
>   fs/stat: Define DAX statx attribute
>   fs/xfs: Isolate the physical DAX flag from effective
>   fs/xfs: Separate functionality of xfs_inode_supports_dax()
>   fs/xfs: Clean up DAX support check
>   fs/xfs: Allow toggle of physical DAX flag
> 
>  fs/stat.c                 |  3 +++
>  fs/xfs/xfs_ioctl.c        | 32 ++++++++++++++------------------
>  fs/xfs/xfs_iops.c         | 36 ++++++++++++++++++++++++++++++------
>  fs/xfs/xfs_iops.h         |  2 ++
>  include/uapi/linux/stat.h |  1 +
>  5 files changed, 50 insertions(+), 24 deletions(-)
>
Boaz Harrosh Oct. 23, 2019, 1:09 p.m. UTC | #2
On 22/10/2019 14:21, Boaz Harrosh wrote:
> On 20/10/2019 18:59, ira.weiny@intel.com wrote:
<>
>> At LSF/MM we discussed the difficulties of switching the mode of a file with
>> active mappings / page cache. Rather than solve those races the decision was to
>> just limit mode flips to 0-length files.
>>
> 
> What I understand above is that only "writers" before writing any bytes may
> turn the flag on, which then persists. But as a very long time user of DAX, usually
> it is the writers that are least interesting. With lots of DAX technologies and
> emulations the write is slower and needs slow "flushing".
> 
> The more interesting and performance gains comes from DAX READs actually.
> specially cross the VM guest. (IE. All VMs share host memory or pmem)
> 
> This fixture as I understand it, that I need to know before I write if I will
> want DAX or not and then the write is DAX as well as reads after that, looks
> not very interesting for me as a user.
> 
> Just my $0.17
> Boaz
> 

I want to say one more thing about this patchset please. I was sleeping on it
and I think it is completely wrong approach with a completely wrong API.

The DAX or not DAX is a matter of transport. and is nothing to do with data
content and persistency. It's like connecting a disk behind, say, a scsi bus and then
take the same DB and putting it behind a multy-path or an NFS server. It is always
the same data.
(Same mistake we did with ndctl which is cry for generations)

For me the DAX or NO-DAX is an application thing and not a data thing.

The right API is perhaps an open O_FLAG where if you are not the first opener
the open returns EBUSY and then the app can decide if to open without the
flag or complain and fail. (Or apply open locks)

If you are a second opener when the file is already opened O_DAX you are
silently running DAX. If you are 2nd open(O_DAX) when already oppened
O_DAX then off course you succeed.
(Also the directory inheritance thing is all mute too)

Please explain the use case behind your model?

Thanks
Boaz
Dave Chinner Oct. 23, 2019, 10:13 p.m. UTC | #3
On Wed, Oct 23, 2019 at 04:09:50PM +0300, Boaz Harrosh wrote:
> On 22/10/2019 14:21, Boaz Harrosh wrote:
> > On 20/10/2019 18:59, ira.weiny@intel.com wrote:
> Please explain the use case behind your model?

No application changes needed to control whether they use DAX or
not. It allows the admin to control the application behaviour
completely, so they can turn off DAX if necessary. Applications are
unaware of constraints that may prevent DAX from being used, and so
admins need a mechanism to prevent DAX aware application from
actually using DAX if the capability is present.

e.g. given how slow some PMEM devices are when it comes to writing
data, especially under extremely high concurrency, DAX is not
necessarily a performance win for every application. Admins need a
guaranteed method of turning off DAX in these situations - apps may
not provide such a knob, or even be aware of a thing called DAX...

e.g. the data set being accessed by the application is mapped and
modified by RDMA applications, so those files must not be accessed
using DAX by any application because DAX+RDMA are currently
incompatible. Hence you can have RDMA on pmem devices co-exist
within the same filesystem as other applications using DAX to access
the pmem...

Cheers,

Dave.
Boaz Harrosh Oct. 24, 2019, 2:31 a.m. UTC | #4
On 24/10/2019 01:13, Dave Chinner wrote:
> On Wed, Oct 23, 2019 at 04:09:50PM +0300, Boaz Harrosh wrote:
>> On 22/10/2019 14:21, Boaz Harrosh wrote:
>>> On 20/10/2019 18:59, ira.weiny@intel.com wrote:
>> Please explain the use case behind your model?
> 
> No application changes needed to control whether they use DAX or
> not. It allows the admin to control the application behaviour
> completely, so they can turn off DAX if necessary. Applications are
> unaware of constraints that may prevent DAX from being used, and so
> admins need a mechanism to prevent DAX aware application from
> actually using DAX if the capability is present.
> 
> e.g. given how slow some PMEM devices are when it comes to writing
> data, especially under extremely high concurrency, DAX is not
> necessarily a performance win for every application. Admins need a
> guaranteed method of turning off DAX in these situations - apps may
> not provide such a knob, or even be aware of a thing called DAX...
> 

Thank you Dave for explaining. Forgive my slowness. I now understand
your intention.

But if so please address my first concern. That in the submitted implementation
you must set the flag-bit after the create of the file but before the write.
So exactly the above slow writes must always be DAX if I ever want the file
to be DAX accessed in the future.

In fact I do not see how you do this without changing the application because
most applications create thier own files, so you do not have a chance to set
the DAX-flag before the write happens. So the only effective fixture is the
inheritance from the parent directory.
But then again how do you separate from the slow writes that we would like
none-DAX to the DAX reads that are fast and save so much resources and latency.

What if, say in XFS when setting the DAX-bit we take all the three write-locks
same as a truncate. Then we check that there are no active page-cache mappings
ie. a single opener. Then allow to set the bit. Else return EBUISY. (file is in use)

> e.g. the data set being accessed by the application is mapped and
> modified by RDMA applications, so those files must not be accessed
> using DAX by any application because DAX+RDMA are currently
> incompatible. Hence you can have RDMA on pmem devices co-exist
> within the same filesystem as other applications using DAX to access
> the pmem...
> 

I actually like the lastest patchset that takes a lease on the file.
But yes an outside admin tool to set different needs.

> Cheers,
> Dave.
> 

Yes, thanks
Boaz
Dave Chinner Oct. 24, 2019, 7:34 a.m. UTC | #5
On Thu, Oct 24, 2019 at 05:31:13AM +0300, Boaz Harrosh wrote:
> On 24/10/2019 01:13, Dave Chinner wrote:
> > On Wed, Oct 23, 2019 at 04:09:50PM +0300, Boaz Harrosh wrote:
> >> On 22/10/2019 14:21, Boaz Harrosh wrote:
> >>> On 20/10/2019 18:59, ira.weiny@intel.com wrote:
> >> Please explain the use case behind your model?
> > 
> > No application changes needed to control whether they use DAX or
> > not. It allows the admin to control the application behaviour
> > completely, so they can turn off DAX if necessary. Applications are
> > unaware of constraints that may prevent DAX from being used, and so
> > admins need a mechanism to prevent DAX aware application from
> > actually using DAX if the capability is present.
> > 
> > e.g. given how slow some PMEM devices are when it comes to writing
> > data, especially under extremely high concurrency, DAX is not
> > necessarily a performance win for every application. Admins need a
> > guaranteed method of turning off DAX in these situations - apps may
> > not provide such a knob, or even be aware of a thing called DAX...
> > 
> 
> Thank you Dave for explaining. Forgive my slowness. I now understand
> your intention.
> 
> But if so please address my first concern. That in the submitted implementation
> you must set the flag-bit after the create of the file but before the write.
> So exactly the above slow writes must always be DAX if I ever want the file
> to be DAX accessed in the future.

The on disk DAX flag is inherited from the parent directory at
create time. Hence an admin only need to set it on the data
directory of the application when first configuring it, and
everything the app creates will be configured for DAX access
automatically.

Or, alternatively, mkfs sets the flag on the root dir so that
everything in the filesystem uses DAX by default (through
inheritance) unless the admin turns off the flag on a directory
before it starts to be used or on a set of files after they have
been created (because DAX causes problems)...

So, yeah, there's another problem with the basic assertion that we
only need to allow the on disk flag to be changed on zero length
files: we actually want to be able to -clear- the DAX flag when the
file has data attached to it, not just when it is an empty file...

> What if, say in XFS when setting the DAX-bit we take all the three write-locks
> same as a truncate. Then we check that there are no active page-cache mappings
> ie. a single opener. Then allow to set the bit. Else return EBUISY. (file is in use)

DAX doesn't have page cache mappings, so anything that relies on
checking page cache state isn't going to work reliably. I also seem
to recall that there was a need to take some vm level lock to really
prevent page fault races, and that we can't safely take that in a
safe combination with all the filesystem locks we need.

Cheers,

Dave.
Boaz Harrosh Oct. 24, 2019, 2:05 p.m. UTC | #6
On 24/10/2019 10:34, Dave Chinner wrote:
> On Thu, Oct 24, 2019 at 05:31:13AM +0300, Boaz Harrosh wrote:
<>
> 
> The on disk DAX flag is inherited from the parent directory at
> create time. Hence an admin only need to set it on the data
> directory of the application when first configuring it, and
> everything the app creates will be configured for DAX access
> automatically.
> 

Yes I said that as well. But again I am concerned that this is the
opposite of our Intention. As you said the WRITEs are slow and
do not scale so what we like, and why we have the all problem, is
to WRITE *none*-DAX. And if so then how do we turn the bit ON later
for the fast READs.

> Or, alternatively, mkfs sets the flag on the root dir so that
> everything in the filesystem uses DAX by default (through
> inheritance) unless the admin turns off the flag on a directory
> before it starts to be used

> or on a set of files after they have
> been created (because DAX causes problems)...
> 

Yes exactly this can not be done currently.

> So, yeah, there's another problem with the basic assertion that we
> only need to allow the on disk flag to be changed on zero length
> files: we actually want to be able to -clear- the DAX flag when the
> file has data attached to it, not just when it is an empty file...
> 

Exactly, This is my concern. And the case that I see most useful is the
opposite where I want to turn it ON, for DAX fast READs.

>> What if, say in XFS when setting the DAX-bit we take all the three write-locks
>> same as a truncate. Then we check that there are no active page-cache mappings
>> ie. a single opener. Then allow to set the bit. Else return EBUISY. (file is in use)
> 
> DAX doesn't have page cache mappings, so anything that relies on
> checking page cache state isn't going to work reliably. 

I meant on the opposite case, Where the flag was OFF and I want it ON for
fast READs. In that case if I have any users there are pages on the
xarray.
BTW the opposite is also true if we have active DAX users we will have
DAX entries in the xarray. What we want is that there are *no* active
users while we change the file-DAX-mode. Else we fail the change.

> I also seem
> to recall that there was a need to take some vm level lock to really
> prevent page fault races, and that we can't safely take that in a
> safe combination with all the filesystem locks we need.
> 

We do not really care with page fault races in the Kernel as long
as I protect the xarray access and these are protected well if we
take truncate locking. But we have a bigger problem that you pointed
out with the change of the operations vector pointer.

I was thinking about this last night. One way to do this is with
file-exclusive-lock. Correct me if I'm wrong:
file-exclusive-readwrite-lock means any other openers will fail and
if there are openers already the lock will fail. Which is what we want
no? to make sure we are the exclusive user of the file while we change
the op vector.
Now the question is if we force the application to take the lock and
Kernel only check that we are locked. Or Kernel take the lock within
the IOCTL.

Lets touch base. As I understand the protocol we want to establish with
the administration tool is:

- File is created, written. read...

- ALL file handles are closed, there are no active users
- File open by single opener for the purpose of changing the DAX-mode
- lock from all other opens
- change the DAX-mode, op vectors
- unlock-exlusivness

- File activity can resume...

That's easy to say, But how can we enforce this protocol?

> Cheers,
> Dave.
> 

Thanks Dave
Boaz
Dave Chinner Oct. 24, 2019, 9:35 p.m. UTC | #7
On Thu, Oct 24, 2019 at 05:05:45PM +0300, Boaz Harrosh wrote:
> On 24/10/2019 10:34, Dave Chinner wrote:
> > On Thu, Oct 24, 2019 at 05:31:13AM +0300, Boaz Harrosh wrote:
> <>
> > 
> > The on disk DAX flag is inherited from the parent directory at
> > create time. Hence an admin only need to set it on the data
> > directory of the application when first configuring it, and
> > everything the app creates will be configured for DAX access
> > automatically.
> > 
> 
> Yes I said that as well.

You said "it must be set between creation and first write",
stating the requirement for an on-disk flag to work. I'm
decribing how that requirement is actually implemented. i.e. what
you are stating is something we actually implemented years ago...

> > I also seem
> > to recall that there was a need to take some vm level lock to really
> > prevent page fault races, and that we can't safely take that in a
> > safe combination with all the filesystem locks we need.
> > 
> 
> We do not really care with page fault races in the Kernel as long

Oh yes we do. A write fault is a 2-part operation - a read fault to
populate the pte and mapping, then a write fault (->page_mkwrite) to 
do all the filesystem work needed to dirty the page and pte.

The read fault sets up the state for the write fault, and if we
change the aops between these two operations, then the
->page_mkwrite implementation goes kaboom.

This isn't a theoretical problem - this is exactly the race
condition that lead us to disabling the flag in the first place.
There is no serialisation between the read and write parts of the
page fault iand the filesystem changing the DAX flag and ops vector,
and so fixing this problem requires hold yet more locks in the
filesystem path to completely lock out page fault processing on the
inode's mapping.

> as I protect the xarray access and these are protected well if we
> take truncate locking. But we have a bigger problem that you pointed
> out with the change of the operations vector pointer.
> 
> I was thinking about this last night. One way to do this is with
> file-exclusive-lock. Correct me if I'm wrong:
> file-exclusive-readwrite-lock means any other openers will fail and
> if there are openers already the lock will fail. Which is what we want
> no?

The filesystem ioctls and page faults have no visibility of file
locks.  They don't know and can't find out in a sane manner that an
inode has a single -user- reference.

And it introduces a new problem for any application using the
fssetxattr() ioctl - accidentally not setting the S_DAX flag to be
unmodified will now fail, and that means such a change breaks
existing applications. Sure, you can say they are "buggy
applications", but the fact is this user API change breaks them.

Hence I don't think we can change the user API for setting/clearing
this flag like this.

Cheers,

Dave.
Boaz Harrosh Oct. 24, 2019, 11:29 p.m. UTC | #8
On 25/10/2019 00:35, Dave Chinner wrote:
> On Thu, Oct 24, 2019 at 05:05:45PM +0300, Boaz Harrosh wrote:
<>
>> Yes I said that as well.
> 
> You said "it must be set between creation and first write",
> stating the requirement for an on-disk flag to work.

Sorry for not being clear. Its not new, I do not explain myself
very well.

The above quote is if you set/clear the flag directly.

> I'm describing how that requirement is actually implemented. i.e. what
> you are stating is something we actually implemented years ago...
> 

What you are talking about is when the flag is inherited from parent.
And I did mention that option as well.

[Which is my concern because my main point is that I want creation+write
 to be none-DAX. Then after writer is done. Switch to DAX-on so READs can
 be fast and not take any memory.
 And you talked about another case when I start DAX-on and then for
 say, use for RDMA turn it off later.
]

This flag is Unique because current RFC has an i_size == 0 check
that other flags do not have

>>> I also seem
>>> to recall that there was a need to take some vm level lock to really
>>> prevent page fault races, and that we can't safely take that in a
>>> safe combination with all the filesystem locks we need.
>>>
>>
>> We do not really care with page fault races in the Kernel as long
> 
> Oh yes we do. A write fault is a 2-part operation - a read fault to
> populate the pte and mapping, then a write fault (->page_mkwrite) to 
> do all the filesystem work needed to dirty the page and pte.
> 
> The read fault sets up the state for the write fault, and if we
> change the aops between these two operations, then the
> ->page_mkwrite implementation goes kaboom.
> 
> This isn't a theoretical problem - this is exactly the race
> condition that lead us to disabling the flag in the first place.
> There is no serialisation between the read and write parts of the
> page fault iand the filesystem changing the DAX flag and ops vector,
> and so fixing this problem requires hold yet more locks in the
> filesystem path to completely lock out page fault processing on the
> inode's mapping.
> 

Again sorry that I do not explain very good.

Already on the read fault we populate the xarray,
My point was that if I want to set the DAX mode I must enforce that
there are no other parallel users on my inode. The check that the
xarray is empty is my convoluted way to check that there are no other
users except me. If xarray is not empty I bail out with EBUISY

I agree this is stupid.

Which is the same stupid as i_size==0 check. Both have the same
intention, to make sure that nothing is going on in parallel to
me.

>> as I protect the xarray access and these are protected well if we
>> take truncate locking. But we have a bigger problem that you pointed
>> out with the change of the operations vector pointer.
>>
>> I was thinking about this last night. One way to do this is with
>> file-exclusive-lock. Correct me if I'm wrong:
>> file-exclusive-readwrite-lock means any other openers will fail and
>> if there are openers already the lock will fail. Which is what we want
>> no?
> 
> The filesystem ioctls and page faults have no visibility of file
> locks.  They don't know and can't find out in a sane manner that an
> inode has a single -user- reference.
> 

This is not true. The FS has full control. It is not too hard a work
to take a file-excl-lock from within the IOCTL implementation. then
unlock. that is option 1. Option 2 of the App taking the lock then
for the check we might need a new export from the lock-subsystem.

> And it introduces a new problem for any application using the
> fssetxattr() ioctl - accidentally not setting the S_DAX flag to be
> unmodified will now fail, and that means such a change breaks
> existing applications. Sure, you can say they are "buggy
> applications", but the fact is this user API change breaks them.
> 

I am not sure I completely understood. let me try ...

The app wants to set some foo flag. It can set the ignore mask to all
1(s) except the flag it wants to set/clear.
And/or get_current_flags(); modify foo_flag; set_flags().

Off course in both the ignore case or when the DAX bit does not change
we do not go on a locking rampage.

So I'm not sure I see the problem

> Hence I don't think we can change the user API for setting/clearing
> this flag like this.
> 

Yes we must not modify behavior of Apps that are modifing other flags.

> Cheers,
> Dave.
> 

Perhaps we always go by the directory. And then do an mv dir_DAX/foo dir_NODAX/foo
to have an effective change. In hard links the first one at iget time before populating
the inode cache takes affect. (And never change the flag on the fly)
(Just brain storming here)

Or perhaps another API?

Thanks Dave
Boaz
Dave Chinner Oct. 25, 2019, 12:36 a.m. UTC | #9
On Fri, Oct 25, 2019 at 02:29:04AM +0300, Boaz Harrosh wrote:
> On 25/10/2019 00:35, Dave Chinner wrote:
> > On Thu, Oct 24, 2019 at 05:05:45PM +0300, Boaz Harrosh wrote:
> > This isn't a theoretical problem - this is exactly the race
> > condition that lead us to disabling the flag in the first place.
> > There is no serialisation between the read and write parts of the
> > page fault iand the filesystem changing the DAX flag and ops vector,
> > and so fixing this problem requires hold yet more locks in the
> > filesystem path to completely lock out page fault processing on the
> > inode's mapping.
> > 
> 
> Again sorry that I do not explain very good.
> 
> Already on the read fault we populate the xarray,

On a write fault we can have an empty xarray slot so the write fault
needs to both populate the xarray slot (read fault) and process the
write fault.

> My point was that if I want to set the DAX mode I must enforce that
> there are no other parallel users on my inode. The check that the
> xarray is empty is my convoluted way to check that there are no other
> users except me. If xarray is not empty I bail out with EBUISY

Checking the xarray being empty is racy. The moment you drop the
mapping lock, the page fault can populate a slot in the mapping that
you just checked was empty. And then you swap the aops between the
population and the ->page-mkwrite() call in the page fault
that is running, and things go boom.

Unless there's something new in the page fault path that nobody has
noticed in the past couple of years, this TOCTOU race hasn't been
solved....

> Perhaps we always go by the directory. And then do an mv dir_DAX/foo dir_NODAX/foo

The inode is instatiated before the rename is run, so it's set up
with it's old dir config, not the new one. So this ends up with the
same problem of haivng to change the S_DAX flag and aops vector
dynamically on rename. Same problem, not a solution.

> to have an effective change. In hard links the first one at iget time before populating
> the inode cache takes affect.

If something like a find or backup program brings the inode into
cache, the app may not even get the behaviour it wants, and it can't
change it until the inode is evicted from cache, which may be never.
Nobody wants implicit/random/uncontrollable/unchangeable behaviour
like this.

> (And never change the flag on the fly)
> (Just brain storming here)

We went over all this ground when we disabled the flag in the first
place. We disabled the flag because we couldn't come up with a sane
way to flip the ops vector short of tracking the number of aops
calls in progress at any given time. i.e. reference counting the
aops structure, but that's hard to do with a const ops structure,
and so it got disabled rather than allowing users to crash
kernels....

Cheers,

-Dave.
Boaz Harrosh Oct. 25, 2019, 1:15 a.m. UTC | #10
On 25/10/2019 03:36, Dave Chinner wrote:
> On Fri, Oct 25, 2019 at 02:29:04AM +0300, Boaz Harrosh wrote:
<>

>> Perhaps we always go by the directory. And then do an mv dir_DAX/foo dir_NODAX/foo
> 
> The inode is instatiated before the rename is run, so it's set up
> with it's old dir config, not the new one. So this ends up with the
> same problem of haivng to change the S_DAX flag and aops vector
> dynamically on rename. Same problem, not a solution.
> 

Yes Admin needs a inode-drop_caches after the mv if she/he wants an effective
change.

>> to have an effective change. In hard links the first one at iget time before populating
>> the inode cache takes affect.
> 
> If something like a find or backup program brings the inode into
> cache, the app may not even get the behaviour it wants, and it can't
> change it until the inode is evicted from cache, which may be never.

inode-drop-caches. (echo 2 > /proc/sys/vm/drop_caches)

> Nobody wants implicit/random/uncontrollable/unchangeable behaviour
> like this.
> 

You mean in the case of hard links between different mode directories?
I agree it is not so good. I do not like it too.

<>
> We went over all this ground when we disabled the flag in the first
> place. We disabled the flag because we couldn't come up with a sane
> way to flip the ops vector short of tracking the number of aops
> calls in progress at any given time. i.e. reference counting the
> aops structure, but that's hard to do with a const ops structure,
> and so it got disabled rather than allowing users to crash
> kernels....
> 

Do you mean dropping this patchset all together? I missed that.

Current patchset with the i_size == 0 thing is really bad I think.
Its the same has dropping the direct change all together and only
supporting inheritance from parent.
[Which again for me is really not interesting]

> Cheers,
> -Dave.

Lets sleep on it. Please remind me if xfs supports clone + DAX

Thanks Dave
Boaz
Ira Weiny Oct. 25, 2019, 8:49 p.m. UTC | #11
On Fri, Oct 25, 2019 at 11:36:03AM +1100, Dave Chinner wrote:
> On Fri, Oct 25, 2019 at 02:29:04AM +0300, Boaz Harrosh wrote:
> > On 25/10/2019 00:35, Dave Chinner wrote:
> 
> If something like a find or backup program brings the inode into
> cache, the app may not even get the behaviour it wants, and it can't
> change it until the inode is evicted from cache, which may be never.

Why would this be never?

> Nobody wants implicit/random/uncontrollable/unchangeable behaviour
> like this.

I'm thinking this could work with a bit of effort on the users part.  While the
behavior does have a bit of uncertainty, I feel like there has to be a way to
get the inode to drop from the cache when a final iput() happens on the inode.

Admin programs should not leave files open forever, without the users knowing
about it.  So I don't understand why the inode could not be evicted from the
cache if the FS knew that this change had been made and the inode needs to be
"re-loaded".  See below...

> 
> > (And never change the flag on the fly)
> > (Just brain storming here)
> 
> We went over all this ground when we disabled the flag in the first
> place. We disabled the flag because we couldn't come up with a sane
> way to flip the ops vector short of tracking the number of aops
> calls in progress at any given time. i.e. reference counting the
> aops structure, but that's hard to do with a const ops structure,
> and so it got disabled rather than allowing users to crash
> kernels....

Agreed.  We can't change the a_ops without some guarantee that no one is using
the file.  Which means we need all fds to close and a final iput().  I thought
that would mean an eviction of the inode and a subsequent reload.

Yesterday I coded up the following (applies on top of this series) but I can't
seem to get it to work because I believe xfs is keeping a reference on the
inode.  What am I missing?  I think if I could get xfs to recognize that the
inode needs to be cleared from it's cache this would work, with some caveats.

Currently this works if I remount the fs or if I use <procfs>/drop_caches like
Boaz mentioned.

Isn't there a way to get xfs to do that on it's own?

Ira


From 7762afd95a3e21a782dffd2fd8e13ae4a23b5e4a Mon Sep 17 00:00:00 2001
From: Ira Weiny <ira.weiny@intel.com>
Date: Fri, 25 Oct 2019 13:32:07 -0700
Subject: [PATCH] squash: Allow phys change on any length file

delay the changing of the effective bit to when the inode is re-read
into the cache.

Currently a work in Progress because xfs seems to cache the inodes as
well and I'm not sure how to get xfs to release it's reference.
---
 fs/xfs/xfs_ioctl.c | 18 +++++++-----------
 include/linux/fs.h |  5 ++++-
 2 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 89cf47ef273e..4d730d5781d9 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1233,10 +1233,13 @@ xfs_diflags_to_linux(
 		inode->i_flags |= S_NOATIME;
 	else
 		inode->i_flags &= ~S_NOATIME;
-	if (xflags & FS_XFLAG_DAX)
-		inode->i_flags |= S_DAX;
-	else
-		inode->i_flags &= ~S_DAX;
+	/* NOTE: we do not allow the physical DAX flag to immediately change
+	 * the effective IS_DAX() flag tell the VFS layer to remove the inode
+	 * from the cache on the final iput() to force recreation on the next
+	 * 'fresh' open */
+	if (((xflags & FS_XFLAG_DAX) && !IS_DAX(inode)) ||
+	    (!(xflags & FS_XFLAG_DAX) && IS_DAX(inode)))
+		inode->i_flags |= S_REVALIDATE;
 }
 
 static int
@@ -1320,13 +1323,6 @@ xfs_ioctl_setattr_dax_invalidate(
 	/* lock, flush and invalidate mapping in preparation for flag change */
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
 
-	/* File size must be zero to avoid races with asynchronous page
-	 * faults */
-	if (i_size_read(inode) > 0) {
-		error = -EINVAL;
-		goto out_unlock;
-	}
-
 	error = filemap_write_and_wait(inode->i_mapping);
 	if (error)
 		goto out_unlock;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0b4d8fc79e0f..4e9b7bf53c86 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1998,6 +1998,7 @@ struct super_operations {
 #define S_ENCRYPTED	16384	/* Encrypted file (using fs/crypto/) */
 #define S_CASEFOLD	32768	/* Casefolded file */
 #define S_VERITY	65536	/* Verity file (using fs/verity/) */
+#define S_REVALIDATE	131072	/* Drop inode from cache on final put */
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
@@ -2040,6 +2041,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
 #define IS_ENCRYPTED(inode)	((inode)->i_flags & S_ENCRYPTED)
 #define IS_CASEFOLDED(inode)	((inode)->i_flags & S_CASEFOLD)
 #define IS_VERITY(inode)	((inode)->i_flags & S_VERITY)
+#define IS_REVALIDATE(inode)	((inode)->i_flags & S_REVALIDATE)
 
 #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
 				 (inode)->i_rdev == WHITEOUT_DEV)
@@ -3027,7 +3029,8 @@ extern int inode_needs_sync(struct inode *inode);
 extern int generic_delete_inode(struct inode *inode);
 static inline int generic_drop_inode(struct inode *inode)
 {
-	return !inode->i_nlink || inode_unhashed(inode);
+	return !inode->i_nlink || inode_unhashed(inode) ||
+		IS_REVALIDATE(inode);
 }
 
 extern struct inode *ilookup5_nowait(struct super_block *sb,
Dave Chinner Oct. 27, 2019, 10:10 p.m. UTC | #12
On Fri, Oct 25, 2019 at 01:49:26PM -0700, Ira Weiny wrote:
> On Fri, Oct 25, 2019 at 11:36:03AM +1100, Dave Chinner wrote:
> > On Fri, Oct 25, 2019 at 02:29:04AM +0300, Boaz Harrosh wrote:
> > > On 25/10/2019 00:35, Dave Chinner wrote:
> > 
> > If something like a find or backup program brings the inode into
> > cache, the app may not even get the behaviour it wants, and it can't
> > change it until the inode is evicted from cache, which may be never.
> 
> Why would this be never?

Because only unreferenced inodes can be removed from cache. As long
as something holds a reference or repeatedly accesses the inode such
that reclaim always skips it because it is referenced, it will never
get evicted from the cache.

IOWs, "never" in the practical sense, not "never" in the theoretical
sense.

> > Nobody wants implicit/random/uncontrollable/unchangeable behaviour
> > like this.
> 
> I'm thinking this could work with a bit of effort on the users part.  While the
> behavior does have a bit of uncertainty, I feel like there has to be a way to
> get the inode to drop from the cache when a final iput() happens on the inode.

Keep in mind that the final iput()->evict() process doesn't mean the
inode is going to get removed from all filesystem inode caches, just
the VFS level cache. The filesystem can still have internal
references to the inode, and still be doing work on the inode that
the VFS knows nothing about. XFS definitely fits into this category.

XFS will, however, re-initialise the inode aops structure if the VFS
then does another lookup on the inode while it is in this
"reclaimed" state, so from the VFS perspective it looks like a
newly instantiated inodes on the next lookup. We don't actually need
to do this for large parts of the inode as it is already still in
the valid state from the evict() call. It's an implementation
simplification that means we always re-init the ops vectors attached
to the inode rather than just the fields that need to be
re-initialised.

IOWs, evict/reinit changing the aops vector because the on disk dax
flag changed on XFS works by luck right now, not intent....

> Admin programs should not leave files open forever, without the users knowing
> about it.  So I don't understand why the inode could not be evicted from the
> cache if the FS knew that this change had been made and the inode needs to be
> "re-loaded".  See below...

Doesn't need to be an open file - inodes are pinned in memory by the
reference the dentry holds on it. Hence as long as there are
actively referenced dentries that point at the inode, the inode
cannot be reclaimed. Hard links mean multiple dentries could pin the
inode, too.

> > > (And never change the flag on the fly)
> > > (Just brain storming here)
> > 
> > We went over all this ground when we disabled the flag in the first
> > place. We disabled the flag because we couldn't come up with a sane
> > way to flip the ops vector short of tracking the number of aops
> > calls in progress at any given time. i.e. reference counting the
> > aops structure, but that's hard to do with a const ops structure,
> > and so it got disabled rather than allowing users to crash
> > kernels....
> 
> Agreed.  We can't change the a_ops without some guarantee that no one is using
> the file.  Which means we need all fds to close and a final iput().  I thought
> that would mean an eviction of the inode and a subsequent reload.
> 
> Yesterday I coded up the following (applies on top of this series) but I can't
> seem to get it to work because I believe xfs is keeping a reference on the
> inode.  What am I missing?  I think if I could get xfs to recognize that the
> inode needs to be cleared from it's cache this would work, with some caveats.

You are missing the fact that dentries hold an active reference to
inodes. So a path lookup (access(), stat(), etc) will pin the inode
just as effectively as holding an open file because they instantiate
a dentry that holds a reference to the inode....

> Currently this works if I remount the fs or if I use <procfs>/drop_caches like
> Boaz mentioned.

drop_caches frees all the dentries that don't have an active
references before it iterates over inodes, thereby dropping the
cached reference(s) to the inode that pins it in memory before it
iterates the inode LRU.

> Isn't there a way to get xfs to do that on it's own?

Not reliably. Killing all the dentries doesn't guarantee the inode
will be reclaimed immediately. The ioctl() itself requires an open
file reference to the inode, and there's no telling how many other
references there are to the inode that the filesystem a) can't find,
and b) even if it can find them, it is illegal to release them.

IOWs, if you are relying on being able to force eviction of inode
from the cache for correct operation of a user controlled flag, then
it's just not going to work.

Cheers,

Dave.
Ira Weiny Oct. 31, 2019, 4:17 p.m. UTC | #13
On Mon, Oct 28, 2019 at 09:10:39AM +1100, Dave Chinner wrote:
> On Fri, Oct 25, 2019 at 01:49:26PM -0700, Ira Weiny wrote:

[snip]

> 
> > Currently this works if I remount the fs or if I use <procfs>/drop_caches like
> > Boaz mentioned.
> 
> drop_caches frees all the dentries that don't have an active
> references before it iterates over inodes, thereby dropping the
> cached reference(s) to the inode that pins it in memory before it
> iterates the inode LRU.
> 
> > Isn't there a way to get xfs to do that on it's own?
> 
> Not reliably. Killing all the dentries doesn't guarantee the inode
> will be reclaimed immediately. The ioctl() itself requires an open
> file reference to the inode, and there's no telling how many other
> references there are to the inode that the filesystem a) can't find,
> and b) even if it can find them, it is illegal to release them.
> 
> IOWs, if you are relying on being able to force eviction of inode
> from the cache for correct operation of a user controlled flag, then
> it's just not going to work.

Agree, I see the difficulty of forcing the effective flag to change in this
path.  However, the only thing I am relying on is that the ioctl will change
the physical flag.

IOW I am proposing that the semantic be that changing the physical flag does
_not_ immediately change the effective flag.  With that clarified up front the
user can adjust accordingly.

After thinking about this more I think there is a strong use case to be able to
change the physical flag on a non-zero length file.  That use case is to be
able to restore files from backups.

Therefore, having the effective flag flip at some later time when the a_ops can
safely be changed (for example a remount/drop_cache event) is beneficial.

I propose the user has no direct control over this event and it is mainly used
to restore files from backups which is mainly an admin operation where a
remount is a reasonable thing to do.

Users direct control of the effective flag is through inheritance.  The user
needs to create the file in a DAX enable dir and they get effective operation
right away.

If in the future we can determine a safe way to trigger the a_ops change we can
add that to the semantic as an alternative for users.

Ira

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Nov. 1, 2019, 10:47 p.m. UTC | #14
On Thu, Oct 31, 2019 at 09:17:58AM -0700, Ira Weiny wrote:
> On Mon, Oct 28, 2019 at 09:10:39AM +1100, Dave Chinner wrote:
> > On Fri, Oct 25, 2019 at 01:49:26PM -0700, Ira Weiny wrote:
> 
> [snip]
> 
> > 
> > > Currently this works if I remount the fs or if I use <procfs>/drop_caches like
> > > Boaz mentioned.
> > 
> > drop_caches frees all the dentries that don't have an active
> > references before it iterates over inodes, thereby dropping the
> > cached reference(s) to the inode that pins it in memory before it
> > iterates the inode LRU.
> > 
> > > Isn't there a way to get xfs to do that on it's own?
> > 
> > Not reliably. Killing all the dentries doesn't guarantee the inode
> > will be reclaimed immediately. The ioctl() itself requires an open
> > file reference to the inode, and there's no telling how many other
> > references there are to the inode that the filesystem a) can't find,
> > and b) even if it can find them, it is illegal to release them.
> > 
> > IOWs, if you are relying on being able to force eviction of inode
> > from the cache for correct operation of a user controlled flag, then
> > it's just not going to work.
> 
> Agree, I see the difficulty of forcing the effective flag to change in this
> path.  However, the only thing I am relying on is that the ioctl will change
> the physical flag.
> 
> IOW I am proposing that the semantic be that changing the physical flag does
> _not_ immediately change the effective flag.  With that clarified up front the
> user can adjust accordingly.

Which makes it useless from an admin perspective. i.e. to change the
way the application uses DAX now, admins are going to have to end up
rebooting the machine to guarantee that the kernel has picked up the
change in the on-disk flag.

> After thinking about this more I think there is a strong use case to be able to
> change the physical flag on a non-zero length file.  That use case is to be
> able to restore files from backups.

Why does that matter? Backup programs need to set the flag before
the data is written into the destination file, just like they do
with restoring other flags that influence data placement like the RT
device bit and extent size hints...

Basically, all these issues you keep trying to work around go away
if we can come up with a way of swapping the aops vector safely.
That's the problem we need to solve, anything else results in
largely unacceptible user visible admin warts.

> I propose the user has no direct control over this event and it is mainly used
> to restore files from backups which is mainly an admin operation where a
> remount is a reasonable thing to do.

As soon as users understand that they flag can be changed, they are
going to want to do that and they are going to want it to work
reliably.

> Users direct control of the effective flag is through inheritance.  The user
> needs to create the file in a DAX enable dir and they get effective operation
> right away.

Until they realise the application is slow or broken because it is
using DAX, and they want to turn DAX off for that application. Then
they have *no control*. You cannot have it both ways - being able to
turn something on but not turn it off is not "effective operation"
or user friendly.

> If in the future we can determine a safe way to trigger the a_ops change we can
> add that to the semantic as an alternative for users.

No, the flag does not get turned on until we've solved the problems
that resulted in us turning it off. We've gone over this mutliple
times, and nobody has solved the issues that need solving - everyone
seems to just hack around the issues rather than solving it
properly. If we thought taking some kind of shortcut full of
compromises and gotchas was the right solution, we would have never
turned the flag off in the first place.

Cheers,

Dave.
Dan Williams Nov. 2, 2019, 4:25 a.m. UTC | #15
On Fri, Nov 1, 2019 at 3:47 PM Dave Chinner <david@fromorbit.com> wrote:
[..]
> No, the flag does not get turned on until we've solved the problems
> that resulted in us turning it off. We've gone over this mutliple
> times, and nobody has solved the issues that need solving - everyone
> seems to just hack around the issues rather than solving it
> properly. If we thought taking some kind of shortcut full of
> compromises and gotchas was the right solution, we would have never
> turned the flag off in the first place.

My fault. I thought the effective vs physical distinction was worth
taking an incremental step forward. Ira was continuing to look at the
a_ops issue in the meantime.