[0/9] add ext4 per-inode DAX flag

Message ID 20170905223541.20594-1-ross.zwisler@linux.intel.com

Message

Ross Zwisler Sept. 5, 2017, 10:35 p.m. UTC
The original intent of this series was to add a per-inode DAX flag to ext4
so that it would be consistent with XFS.  In my travels I found and fixed
several related issues in both ext4 and XFS.

I'm not fully happy with the ways that ext4 DAX interacts with conflicting
features (journaling, inline data and encryption).  My goal with this
series was to make all these interactions as consistent as possible, and
of course to make them safe.  If anyone has ideas for improvements, I'm
very open.

Ross Zwisler (9):
  ext4: remove duplicate extended attributes defs
  xfs: always use DAX if mount option is used
  xfs: validate bdev support for DAX inode flag
  ext4: add ext4_should_use_dax()
  ext4: ext4_change_inode_journal_flag error handling
  ext4: safely transition S_DAX on journaling changes
  ext4: prevent data corruption with inline data + DAX
  ext4: add sanity check for encryption + DAX
  ext4: add per-inode DAX flag

 fs/ext4/ext4.h      | 47 ++++++---------------------------------------
 fs/ext4/ext4_jbd2.h | 16 ++++++++++++++++
 fs/ext4/inline.c    | 10 ----------
 fs/ext4/inode.c     | 45 ++++++++++++++++++++++++-------------------
 fs/ext4/ioctl.c     | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext4/super.c     |  8 ++++++++
 fs/xfs/xfs_ioctl.c  | 14 +++++++++++---
 7 files changed, 119 insertions(+), 76 deletions(-)

Comments

Eric Sandeen Sept. 6, 2017, 2:12 a.m. UTC | #1
On 9/5/17 5:35 PM, Ross Zwisler wrote:
> The original intent of this series was to add a per-inode DAX flag to ext4
> so that it would be consistent with XFS.  In my travels I found and fixed
> several related issues in both ext4 and XFS.

Hi Ross -

hch had a lot of reasons to nuke the dax flag from orbit, and we just
/disabled/ it in xfs due to its habit of crashing the kernel...
so a couple questions:

1) does this series pass hch's "test the per-inode DAX flag" fstest?
2) do we have an agreement that we need this flag at all, or is this
   just a parity item because xfs has^whad a per-inode flag?

Thanks,
-Eric

> I'm not fully happy with the ways that ext4 DAX interacts with conflicting
> features (journaling, inline data and encryption).  My goal with this
> series was to make all these interactions as consistent as possible, and
> of course to make them safe.  If anyone has ideas for improvements, I'm
> very open.
> 
> Ross Zwisler (9):
>   ext4: remove duplicate extended attributes defs
>   xfs: always use DAX if mount option is used
>   xfs: validate bdev support for DAX inode flag
>   ext4: add ext4_should_use_dax()
>   ext4: ext4_change_inode_journal_flag error handling
>   ext4: safely transition S_DAX on journaling changes
>   ext4: prevent data corruption with inline data + DAX
>   ext4: add sanity check for encryption + DAX
>   ext4: add per-inode DAX flag
> 
>  fs/ext4/ext4.h      | 47 ++++++---------------------------------------
>  fs/ext4/ext4_jbd2.h | 16 ++++++++++++++++
>  fs/ext4/inline.c    | 10 ----------
>  fs/ext4/inode.c     | 45 ++++++++++++++++++++++++-------------------
>  fs/ext4/ioctl.c     | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>  fs/ext4/super.c     |  8 ++++++++
>  fs/xfs/xfs_ioctl.c  | 14 +++++++++++---
>  7 files changed, 119 insertions(+), 76 deletions(-)
>
Ross Zwisler Sept. 6, 2017, 5:07 p.m. UTC | #2
On Tue, Sep 05, 2017 at 09:12:35PM -0500, Eric Sandeen wrote:
> On 9/5/17 5:35 PM, Ross Zwisler wrote:
> > The original intent of this series was to add a per-inode DAX flag to ext4
> > so that it would be consistent with XFS.  In my travels I found and fixed
> > several related issues in both ext4 and XFS.
> 
> Hi Ross -
> 
> hch had a lot of reasons to nuke the dax flag from orbit, and we just
> /disabled/ it in xfs due to its habit of crashing the kernel...

Ah, sorry, I wasn't CC'd on those threads and missed them.  For any interested
bystanders:

https://www.spinics.net/lists/linux-ext4/msg57840.html
https://www.spinics.net/lists/linux-xfs/msg09831.html
https://www.spinics.net/lists/linux-xfs/msg10124.html

> so a couple questions:
> 
> 1) does this series pass hch's "test the per-inode DAX flag" fstest?

Nope, it has the exact same problems as the XFS per-inode DAX flag.

> 2) do we have an agreement that we need this flag at all, or is this
>    just a parity item because xfs has^whad a per-inode flag?

It was for parity, and because it allows admins finer grained control over
their system.  Basically all things discussed in response to Lukas's original
patch in the first link above.

The way this series ended up, the first 8 patches were all fixes for the
existing code, and only patch 9 introduced the new per-inode flag.  I'll drop
patch 9 for now and rework the first 8 patches so we can get safer behavior of
the existing DAX mount option in ext4.  We can try patch 9 again later if we
come to an agreement that re-enables the XFS per-inode DAX option.
Dan Williams Sept. 7, 2017, 8:54 p.m. UTC | #3
On Wed, Sep 6, 2017 at 10:07 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Tue, Sep 05, 2017 at 09:12:35PM -0500, Eric Sandeen wrote:
>> On 9/5/17 5:35 PM, Ross Zwisler wrote:
>> > The original intent of this series was to add a per-inode DAX flag to ext4
>> > so that it would be consistent with XFS.  In my travels I found and fixed
>> > several related issues in both ext4 and XFS.
>>
>> Hi Ross -
>>
>> hch had a lot of reasons to nuke the dax flag from orbit, and we just
>> /disabled/ it in xfs due to its habit of crashing the kernel...
>
> Ah, sorry, I wasn't CC'd on those threads and missed them.  For any interested
> bystanders:
>
> https://www.spinics.net/lists/linux-ext4/msg57840.html
> https://www.spinics.net/lists/linux-xfs/msg09831.html
> https://www.spinics.net/lists/linux-xfs/msg10124.html
>
>> so a couple questions:
>>
>> 1) does this series pass hch's "test the per-inode DAX flag" fstest?
>
> Nope, it has the exact same problems as the XFS per-inode DAX flag.
>
>> 2) do we have an agreement that we need this flag at all, or is this
>>    just a parity item because xfs has^whad a per-inode flag?
>
> It was for parity, and because it allows admins finer grained control over
> their system.  Basically all things discussed in response to Lukas's original
> patch in the first link above.

I think it's more than parity. When pmem is slower than page cache it
is actively harmful to have DAX enabled globally for a filesystem. So,
not only should we push for per-inode DAX control, we should also push
to deprecate the mount option. I agree with Christoph that we should
try to automatically and transparently enable DAX where it makes
sense, but we also need a finer-grained mechanism than a mount flag to
force the behavior one way or the other.
Ross Zwisler Sept. 7, 2017, 9:13 p.m. UTC | #4
On Thu, Sep 07, 2017 at 01:54:45PM -0700, Dan Williams wrote:
> On Wed, Sep 6, 2017 at 10:07 AM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Tue, Sep 05, 2017 at 09:12:35PM -0500, Eric Sandeen wrote:
> >> On 9/5/17 5:35 PM, Ross Zwisler wrote:
> >> > The original intent of this series was to add a per-inode DAX flag to ext4
> >> > so that it would be consistent with XFS.  In my travels I found and fixed
> >> > several related issues in both ext4 and XFS.
> >>
> >> Hi Ross -
> >>
> >> hch had a lot of reasons to nuke the dax flag from orbit, and we just
> >> /disabled/ it in xfs due to its habit of crashing the kernel...
> >
> > Ah, sorry, I wasn't CC'd on those threads and missed them.  For any interested
> > bystanders:
> >
> > https://www.spinics.net/lists/linux-ext4/msg57840.html
> > https://www.spinics.net/lists/linux-xfs/msg09831.html
> > https://www.spinics.net/lists/linux-xfs/msg10124.html
> >
> >> so a couple questions:
> >>
> >> 1) does this series pass hch's "test the per-inode DAX flag" fstest?
> >
> > Nope, it has the exact same problems as the XFS per-inode DAX flag.
> >
> >> 2) do we have an agreement that we need this flag at all, or is this
> >>    just a parity item because xfs has^whad a per-inode flag?
> >
> > It was for parity, and because it allows admins finer grained control over
> > their system.  Basically all things discussed in response to Lukas's original
> > patch in the first link above.
> 
> I think it's more than parity. When pmem is slower than page cache it
> is actively harmful to have DAX enabled globally for a filesystem. So,
> not only should we push for per-inode DAX control, we should also push
> to deprecate the mount option. I agree with Christoph that we should
> try to automatically and transparently enable DAX where it makes
> sense, but we also need a finer-grained mechanism than a mount flag to
> force the behavior one way or the other.

Yep, agreed.  I'll play with how to make this work after I've sorted out all
the data corruptions I've found. :)
Andreas Dilger Sept. 7, 2017, 9:26 p.m. UTC | #5
On Sep 7, 2017, at 3:13 PM, Ross Zwisler <ross.zwisler@linux.intel.com> wrote:
> 
> On Thu, Sep 07, 2017 at 01:54:45PM -0700, Dan Williams wrote:
>> On Wed, Sep 6, 2017 at 10:07 AM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>>> On Tue, Sep 05, 2017 at 09:12:35PM -0500, Eric Sandeen wrote:
>>>> On 9/5/17 5:35 PM, Ross Zwisler wrote:
>>>>> The original intent of this series was to add a per-inode DAX flag to ext4
>>>>> so that it would be consistent with XFS.  In my travels I found and fixed
>>>>> several related issues in both ext4 and XFS.
>>>> 
>>>> Hi Ross -
>>>> 
>>>> hch had a lot of reasons to nuke the dax flag from orbit, and we just
>>>> /disabled/ it in xfs due to its habit of crashing the kernel...
>>> 
>>> Ah, sorry, I wasn't CC'd on those threads and missed them.  For any interested
>>> bystanders:
>>> 
>>> https://www.spinics.net/lists/linux-ext4/msg57840.html
>>> https://www.spinics.net/lists/linux-xfs/msg09831.html
>>> https://www.spinics.net/lists/linux-xfs/msg10124.html
>>> 
>>>> so a couple questions:
>>>> 
>>>> 1) does this series pass hch's "test the per-inode DAX flag" fstest?
>>> 
>>> Nope, it has the exact same problems as the XFS per-inode DAX flag.
>>> 
>>>> 2) do we have an agreement that we need this flag at all, or is this
>>>>   just a parity item because xfs has^whad a per-inode flag?
>>> 
>>> It was for parity, and because it allows admins finer grained control over
>>> their system.  Basically all things discussed in response to Lukas's original
>>> patch in the first link above.
>> 
>> I think it's more than parity. When pmem is slower than page cache it
>> is actively harmful to have DAX enabled globally for a filesystem. So,
>> not only should we push for per-inode DAX control, we should also push
>> to deprecate the mount option. I agree with Christoph that we should
>> try to automatically and transparently enable DAX where it makes
>> sense, but we also need a finer-grained mechanism than a mount flag to
>> force the behavior one way or the other.
> 
> Yep, agreed.  I'll play with how to make this work after I've sorted out all
> the data corruptions I've found. :)

It seems that the majority of problems are from enabling/disabling S_DAX
on an inode that already has dirty data.  However, I wonder if this could
be prevented at runtime, and only allow S_DAX to be set when the inode is
first instantiated, and wouldn't be allowed to change after that?  Setting
or clearing the per-inode DAX flag might still be allowed, but it wouldn't
be enabled until the inode is next fetched into cache?  Similarly,
inodes that have conflicting features (e.g. inline data or encryption)
would not be allowed to enable S_DAX.

My assumption here is that it is possible to fall back to always using
page cache for such inodes, and flush the data to pmem via the block
interface for inodes that don't have S_DAX set?

That would allow the vast majority of cases to work out of the box, or in
a few rare cases where the DAX feature is being changed (e.g. inline data
inode on disk growing to external disk blocks) would use the page cache
until such a time that the inode is dropped from cache and reloaded (at
worst the next remount).

Cheers, Andreas
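A minimal sketch of the deferred-activation model proposed above, written as a user-space Python model and not kernel code. All class and function names are hypothetical; in particular `should_use_dax()` only borrows the idea behind the `ext4_should_use_dax()` helper named in the series and is not its implementation. The point it illustrates is that the persistent flag and the in-memory S_DAX state are separate, and that S_DAX is decided only when the inode is instantiated.

```python
from dataclasses import dataclass


@dataclass
class DiskInode:
    """Persistent inode state (hypothetical model)."""
    dax_flag: bool = False
    inline_data: bool = False
    encrypted: bool = False


@dataclass
class CachedInode:
    """In-memory inode; s_dax is fixed when the inode is instantiated."""
    disk: DiskInode
    s_dax: bool


def should_use_dax(disk: DiskInode) -> bool:
    # Conflicting features force the page-cache path even if the flag is set.
    return disk.dax_flag and not (disk.inline_data or disk.encrypted)


def instantiate(disk: DiskInode) -> CachedInode:
    # The only place where S_DAX is decided in this model.
    return CachedInode(disk=disk, s_dax=should_use_dax(disk))


def set_dax_flag(inode: CachedInode, value: bool) -> None:
    # Updates the persistent flag only; inode.s_dax is deliberately untouched.
    inode.disk.dax_flag = value


disk = DiskInode()
cached = instantiate(disk)
set_dax_flag(cached, True)
pending = cached.disk.dax_flag != cached.s_dax  # flag set, not yet in effect
reloaded = instantiate(disk)                     # after eviction and re-read
```

In this model `pending` is the state a user would need some way to detect: the flag has been set, but the cached inode still uses the page cache until it is reloaded.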
Ross Zwisler Sept. 7, 2017, 9:51 p.m. UTC | #6
On Thu, Sep 07, 2017 at 03:26:10PM -0600, Andreas Dilger wrote:
> On Sep 7, 2017, at 3:13 PM, Ross Zwisler <ross.zwisler@linux.intel.com> wrote:
> > 
> > On Thu, Sep 07, 2017 at 01:54:45PM -0700, Dan Williams wrote:
> >> On Wed, Sep 6, 2017 at 10:07 AM, Ross Zwisler
> >> <ross.zwisler@linux.intel.com> wrote:
> >>> On Tue, Sep 05, 2017 at 09:12:35PM -0500, Eric Sandeen wrote:
> >>>> On 9/5/17 5:35 PM, Ross Zwisler wrote:
> >>>>> The original intent of this series was to add a per-inode DAX flag to ext4
> >>>>> so that it would be consistent with XFS.  In my travels I found and fixed
> >>>>> several related issues in both ext4 and XFS.
> >>>> 
> >>>> Hi Ross -
> >>>> 
> >>>> hch had a lot of reasons to nuke the dax flag from orbit, and we just
> >>>> /disabled/ it in xfs due to its habit of crashing the kernel...
> >>> 
> >>> Ah, sorry, I wasn't CC'd on those threads and missed them.  For any interested
> >>> bystanders:
> >>> 
> >>> https://www.spinics.net/lists/linux-ext4/msg57840.html
> >>> https://www.spinics.net/lists/linux-xfs/msg09831.html
> >>> https://www.spinics.net/lists/linux-xfs/msg10124.html
> >>> 
> >>>> so a couple questions:
> >>>> 
> >>>> 1) does this series pass hch's "test the per-inode DAX flag" fstest?
> >>> 
> >>> Nope, it has the exact same problems as the XFS per-inode DAX flag.
> >>> 
> >>>> 2) do we have an agreement that we need this flag at all, or is this
> >>>>   just a parity item because xfs has^whad a per-inode flag?
> >>> 
> >>> It was for parity, and because it allows admins finer grained control over
> >>> their system.  Basically all things discussed in response to Lukas's original
> >>> patch in the first link above.
> >> 
> >> I think it's more than parity. When pmem is slower than page cache it
> >> is actively harmful to have DAX enabled globally for a filesystem. So,
> >> not only should we push for per-inode DAX control, we should also push
> >> to deprecate the mount option. I agree with Christoph that we should
> >> try to automatically and transparently enable DAX where it makes
> >> sense, but we also need a finer-grained mechanism than a mount flag to
> >> force the behavior one way or the other.
> > 
> > Yep, agreed.  I'll play with how to make this work after I've sorted out all
> > the data corruptions I've found. :)
> 
> It seems that the majority of problems are from enabling/disabling S_DAX
> on an inode that already has dirty data. 

I don't think it's precisely about dirty data, more about having mappings set
up and I/Os in flight, even if those are read operations.  Tomorrow I'll post
some xfstests for the data corruptions due to DAX + each of inline data and
journaling, and those both happen because we set up one mapping to page cache,
and one to DAX.  Once either is written to they become out of sync.

> However, I wonder if this could
> be prevented at runtime, and only allow S_DAX to be set when the inode is
> first instantiated, and wouldn't be allowed to change after that?  Setting
> or clearing the per-inode DAX flag might still be allowed, but it wouldn't
> be enabled until the inode is next fetched into cache?  Similarly, for
> inodes that have conflicting features (e.g. inline data or encryption)
> would not be allowed to enable S_DAX.

Ooh, this seems interesting.  This would ensure that S_DAX transitions
couldn't ever race with I/Os or mmaps().  I had some other ideas for how to
handle this, but I think your idea is more promising. :)

I guess with this solution we'd need:

a) A good way of letting the user detect the state where they had set the DAX
inode flag, but that it wasn't yet in use by the inode.

b) A reliable way of flushing the inode from the filesystem cache, so that the
next time an open() happens they get the new behavior.  The way I usually do
this is via umount/remount, but there is probably already a way to do this?

> My assumption here is that it is possible to fall back to always using
> page cache for such inodes, and flush the data to pmem via the block
> interface for inodes that don't have S_DAX set?

Correct.

> That would allow the vast majority of cases to work out of the box, or in
> a few rare cases where the DAX feature is being changed (e.g. inline data
> inode on disk growing to external disk blocks) would use the page cache
> until such a time that the inode is dropped from cache and reloaded (at
> worst the next remount).

Ah, yep, this has the potential to solve those cases as well.  Seems
promising, to me at least. :)
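The "one mapping to page cache, one to DAX" corruption described above can be illustrated with a toy user-space model. This is only a sketch under simplifying assumptions: the `File` class and its methods are hypothetical, the page cache is a single lazily filled copy, and no writeback is modeled.

```python
class File:
    """Toy model: one file reachable through two incoherent paths."""

    def __init__(self, data: bytes):
        self.media = bytearray(data)  # contents of persistent memory
        self.page_cache = None        # lazily populated copy of the media

    def read_via_page_cache(self) -> bytes:
        if self.page_cache is None:
            self.page_cache = bytearray(self.media)
        return bytes(self.page_cache)

    def write_via_page_cache(self, off: int, buf: bytes) -> None:
        self.read_via_page_cache()    # make sure the cached copy exists
        self.page_cache[off:off + len(buf)] = buf

    def read_via_dax(self) -> bytes:
        return bytes(self.media)      # DAX bypasses the page cache

    def write_via_dax(self, off: int, buf: bytes) -> None:
        self.media[off:off + len(buf)] = buf


f = File(b"aaaa")
f.read_via_page_cache()               # page-cache mapping established first
f.write_via_dax(0, b"bb")             # DAX mapping writes straight to media
stale = f.read_via_page_cache()       # still sees the old contents
fresh = f.read_via_dax()              # sees the new contents
```

Once both paths exist, a write through either one leaves the other out of date, which is why the S_DAX transition has to exclude existing mappings and not only dirty data.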
Dave Chinner Sept. 7, 2017, 10:12 p.m. UTC | #7
On Thu, Sep 07, 2017 at 03:51:48PM -0600, Ross Zwisler wrote:
> On Thu, Sep 07, 2017 at 03:26:10PM -0600, Andreas Dilger wrote:
> > However, I wonder if this could
> > be prevented at runtime, and only allow S_DAX to be set when the inode is
> > first instantiated, and wouldn't be allowed to change after that?  Setting
> > or clearing the per-inode DAX flag might still be allowed, but it wouldn't
> > be enabled until the inode is next fetched into cache?  Similarly, for
> > inodes that have conflicting features (e.g. inline data or encryption)
> > would not be allowed to enable S_DAX.
> 
> Ooh, this seems interesting.  This would ensure that S_DAX transitions
> couldn't ever race with I/Os or mmaps().  I had some other ideas for how to
> handle this, but I think your idea is more promising. :)

IMO, that's an awful admin interface - it can't be done on demand
(i.e. when needed) because we can't force an inode to be evicted
from the cache. And then we have the "why the hell did that just
change" problem if an inode is evicted due to memory pressure and
then immediately reinstantiated by the running workload. That's a
recipe for driving admins insane...

> I guess with this solution we'd need:
> 
> a) A good way of letting the user detect the state where they had set the DAX
> inode flag, but that it wasn't yet in use by the inode.
> 
> b) A reliable way of flushing the inode from the filesystem cache, so that the
> next time an open() happens they get the new behavior.  The way I usually do
> this is via umount/remount, but there is probably already a way to do this?

Not if it's referenced. And if it's not referenced, then the only
hammer we have is Brutus^Wdrop_caches. That's not an option for
production machines.

Neat idea, but one I'd already thought of and discarded as "not
practical from an admin perspective".

Cheers,

Dave.
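The admin-interface objection above can be made concrete with a small user-space model of an inode cache. This is a sketch only: `InodeCache`, `iget`, `iput` and `try_evict` are hypothetical stand-ins, and the model assumes the DAX mode is latched when the inode is read in, as in the proposal being discussed.

```python
class Inode:
    def __init__(self, dax_flag_on_disk: bool):
        self.s_dax = dax_flag_on_disk  # latched when the inode is read in
        self.refcount = 0


class InodeCache:
    def __init__(self):
        self.disk_flags = {}  # inode number -> persistent DAX flag
        self.cached = {}      # inode number -> in-memory Inode

    def iget(self, ino: int) -> Inode:
        inode = self.cached.get(ino)
        if inode is None:
            inode = Inode(self.disk_flags.get(ino, False))
            self.cached[ino] = inode
        inode.refcount += 1
        return inode

    def iput(self, ino: int) -> None:
        self.cached[ino].refcount -= 1

    def try_evict(self, ino: int) -> bool:
        inode = self.cached.get(ino)
        if inode is None or inode.refcount > 0:
            return False      # referenced inodes cannot be evicted
        del self.cached[ino]
        return True


cache = InodeCache()
cache.disk_flags[7] = False
cache.iget(7)                             # workload opens the file
cache.disk_flags[7] = True                # admin sets the per-inode flag
evicted_while_open = cache.try_evict(7)   # cannot be forced on demand
cache.iput(7)
mode_before = cache.iget(7).s_dax         # inode still cached: old mode
cache.iput(7)
cache.try_evict(7)                        # memory pressure, at some later time
mode_after = cache.iget(7).s_dax          # mode changed with no admin action
```

The change cannot be applied when the admin asks for it, and it takes effect later at a moment nobody chose.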
Ross Zwisler Sept. 7, 2017, 10:19 p.m. UTC | #8
On Fri, Sep 08, 2017 at 08:12:01AM +1000, Dave Chinner wrote:
> On Thu, Sep 07, 2017 at 03:51:48PM -0600, Ross Zwisler wrote:
> > On Thu, Sep 07, 2017 at 03:26:10PM -0600, Andreas Dilger wrote:
> > > However, I wonder if this could
> > > be prevented at runtime, and only allow S_DAX to be set when the inode is
> > > first instantiated, and wouldn't be allowed to change after that?  Setting
> > > or clearing the per-inode DAX flag might still be allowed, but it wouldn't
> > > be enabled until the inode is next fetched into cache?  Similarly, for
> > > inodes that have conflicting features (e.g. inline data or encryption)
> > > would not be allowed to enable S_DAX.
> > 
> > Ooh, this seems interesting.  This would ensure that S_DAX transitions
> > couldn't ever race with I/Os or mmaps().  I had some other ideas for how to
> > handle this, but I think your idea is more promising. :)
> 
> IMO, that's an awful admin interface - it can't be done on demand
> (i.e. when needed) because we can't force an inode to be evicted
> from the cache. And then we have the "why the hell did that just
> change" problem if an inode is evicted due to memory pressure and
> then immediately reinstantiated by the running workload. That's a
> recipe for driving admins insane...
> 
> > I guess with this solution we'd need:
> > 
> > a) A good way of letting the user detect the state where they had set the DAX
> > inode flag, but that it wasn't yet in use by the inode.
> > 
> > b) A reliable way of flushing the inode from the filesystem cache, so that the
> > next time an open() happens they get the new behavior.  The way I usually do
> > this is via umount/remount, but there is probably already a way to do this?
> 
> Not if it's referenced. And if it's not referenced, then the only
> hammer we have is Brutus^Wdrop_caches. That's not an option for
> production machines.
> 
> Neat idea, but one I'd already thought of and discarded as "not
> practical from an admin perspective".

Okay, so other ideas (which you have also probably already thought of) include:

1) Just return -EBUSY if anyone tries to change the DAX flag of an inode with
open mappings or any open file handles.  To prevent TOCTOU races we'd have to
do some additional locking while actually changing the flag.

2) Be more drastic and follow the flow of ext4 file based encryption, only
allowing the inode flag to be set by an admin on an empty directory.  Files in
that directory will inherit it when they are created, and we don't provide a
way to clear.  If you want your file to not use DAX, move it to a different
directory (which I think for ext4 encryption turns it into a new inode).

Other ideas?
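A rough user-space sketch of option 1, assuming a single lock that serializes mapping creation against the flag change so the check and the update cannot be separated (the TOCTOU concern). The `Inode` class, its lock and its mapping counter are hypothetical; the sketch checks mappings only, not open file handles.

```python
import errno
import threading


class Inode:
    def __init__(self):
        # Stands in for whatever lock would serialize mmap() against the
        # flag change in a real filesystem (an assumption of this sketch).
        self.lock = threading.Lock()
        self.mappings = 0
        self.s_dax = False

    def mmap(self) -> None:
        with self.lock:
            self.mappings += 1

    def munmap(self) -> None:
        with self.lock:
            self.mappings -= 1

    def set_dax(self, value: bool) -> int:
        # Check and update happen under the same lock, so no new mapping
        # can appear between the test and the flag change.
        with self.lock:
            if self.mappings > 0:
                return -errno.EBUSY
            self.s_dax = value
            return 0


inode = Inode()
inode.mmap()
rc_busy = inode.set_dax(True)   # refused while a mapping exists
inode.munmap()
rc_ok = inode.set_dax(True)     # succeeds once the file is unmapped
```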
Dave Chinner Sept. 7, 2017, 11:25 p.m. UTC | #9
On Thu, Sep 07, 2017 at 04:19:00PM -0600, Ross Zwisler wrote:
> On Fri, Sep 08, 2017 at 08:12:01AM +1000, Dave Chinner wrote:
> > On Thu, Sep 07, 2017 at 03:51:48PM -0600, Ross Zwisler wrote:
> > > On Thu, Sep 07, 2017 at 03:26:10PM -0600, Andreas Dilger wrote:
> > > > However, I wonder if this could
> > > > be prevented at runtime, and only allow S_DAX to be set when the inode is
> > > > first instantiated, and wouldn't be allowed to change after that?  Setting
> > > > or clearing the per-inode DAX flag might still be allowed, but it wouldn't
> > > > be enabled until the inode is next fetched into cache?  Similarly, for
> > > > inodes that have conflicting features (e.g. inline data or encryption)
> > > > would not be allowed to enable S_DAX.
> > > 
> > > Ooh, this seems interesting.  This would ensure that S_DAX transitions
> > > couldn't ever race with I/Os or mmaps().  I had some other ideas for how to
> > > handle this, but I think your idea is more promising. :)
> > 
> > IMO, that's an awful admin interface - it can't be done on demand
> > (i.e. when needed) because we can't force an inode to be evicted
> > from the cache. And then we have the "why the hell did that just
> > change" problem if an inode is evicted due to memory pressure and
> > then immediately reinstantiated by the running workload. That's a
> > recipe for driving admins insane...
> > 
> > > I guess with this solution we'd need:
> > > 
> > > a) A good way of letting the user detect the state where they had set the DAX
> > > inode flag, but that it wasn't yet in use by the inode.
> > > 
> > > b) A reliable way of flushing the inode from the filesystem cache, so that the
> > > next time an open() happens they get the new behavior.  The way I usually do
> > > this is via umount/remount, but there is probably already a way to do this?
> > 
> > Not if it's referenced. And if it's not referenced, then the only
> > hammer we have is Brutus^Wdrop_caches. That's not an option for
> > production machines.
> > 
> > Neat idea, but one I'd already thought of and discarded as "not
> > practical from an admin perspective".
> 
> Okay, so other ideas (which you have also probably already thought of) include:
> 
> 1) Just return -EBUSY if anyone tries to change the DAX flag of an inode with
> open mappings or any open file handles.

You have to have an open fd to change the flag. :)

> To prevent TOCTOU races we'd have to
> do some additional locking while actually changing the flag.

I think that makes sense - the fundamental problem is that the
mappings are different between dax and non-dax, and that we can't
properly lock out page faults to prevent sending a racing
page fault down the wrong path.

> 2) Be more drastic and follow the flow of ext4 file based encryption, only
> allowing the inode flag to be set by an admin on an empty directory.  Files in
> that directory will inherit it when they are created, and we don't provide a
> way to clear.  If you want your file to not use DAX, move it to a different
> directory (which I think for ext4 encryption turns it into a new inode).

Seems like the wrong model to me - moving application data files
is a PITA because you've also got to change the app config to point
at the new location...

> Other ideas?

IMO, we need to fix the page fault path so we don't look at inode
flags to determine processing behaviour during the fault. Fault
processing as DAX or non-dax needs to be determined by the page
fault code and communicated to the fs via the vmf as the contents
of the vmf for a dax fault can be invalid for a non-dax fault. Fixing
that problem (i.e. make DAX a property of the mapping and
instantiate it from the inode only at mmap() time) means all the
page fault vs inode flag race problems go away and we have a model
that is much more robust if we want to expand it in future.

Combine that with -EBUSY when there are active mappings as you've
proposed above and I think we've got a much more solid solution to
the problem.

-Dave.
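One way to picture "DAX as a property of the mapping" is the following user-space sketch. It is an illustration under stated assumptions, not kernel code: `Mapping` is a hypothetical VMA-like object, and the mode is copied from the inode once, at mmap() time, after which the fault path never consults the inode flag.

```python
class Inode:
    def __init__(self, s_dax: bool):
        self.s_dax = s_dax


class Mapping:
    """Hypothetical VMA-like object created by mmap()."""

    def __init__(self, inode: Inode):
        self.inode = inode
        self.is_dax = inode.s_dax  # instantiated from the inode at mmap() time


def handle_fault(mapping: Mapping) -> str:
    # Dispatch on the mapping, never on mapping.inode.s_dax.
    return "dax_fault" if mapping.is_dax else "page_cache_fault"


inode = Inode(s_dax=False)
old_mapping = Mapping(inode)
inode.s_dax = True                       # flag flips after the mmap()
old_path = handle_fault(old_mapping)     # unaffected by the flip
new_path = handle_fault(Mapping(inode))  # a later mmap() sees the new mode
```

The sketch flips the flag while a mapping exists only to show that the fault path is unaffected. With the -EBUSY rule for active mappings, that flip would be refused, so two mappings of different modes would not coexist.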
Jan Kara Sept. 8, 2017, 9:48 a.m. UTC | #10
On Fri 08-09-17 09:25:43, Dave Chinner wrote:
> On Thu, Sep 07, 2017 at 04:19:00PM -0600, Ross Zwisler wrote:
> > On Fri, Sep 08, 2017 at 08:12:01AM +1000, Dave Chinner wrote:
> > > On Thu, Sep 07, 2017 at 03:51:48PM -0600, Ross Zwisler wrote:
> > > > On Thu, Sep 07, 2017 at 03:26:10PM -0600, Andreas Dilger wrote:
> > > > > However, I wonder if this could
> > > > > be prevented at runtime, and only allow S_DAX to be set when the inode is
> > > > > first instantiated, and wouldn't be allowed to change after that?  Setting
> > > > > or clearing the per-inode DAX flag might still be allowed, but it wouldn't
> > > > > be enabled until the inode is next fetched into cache?  Similarly, for
> > > > > inodes that have conflicting features (e.g. inline data or encryption)
> > > > > would not be allowed to enable S_DAX.
> > > > 
> > > > Ooh, this seems interesting.  This would ensure that S_DAX transitions
> > > > couldn't ever race with I/Os or mmaps().  I had some other ideas for how to
> > > > handle this, but I think your idea is more promising. :)
> > > 
> > > IMO, that's an awful admin interface - it can't be done on demand
> > > (i.e. when needed) because we can't force an inode to be evicted
> > > from the cache. And then we have the "why the hell did that just
> > > change" problem if an inode is evicted due to memory pressure and
> > > then immediately reinstantiated by the running workload. That's a
> > > recipe for driving admins insane...
> > > 
> > > > I guess with this solution we'd need:
> > > > 
> > > > a) A good way of letting the user detect the state where they had set the DAX
> > > > inode flag, but that it wasn't yet in use by the inode.
> > > > 
> > > > b) A reliable way of flushing the inode from the filesystem cache, so that the
> > > > next time an open() happens they get the new behavior.  The way I usually do
> > > > this is via umount/remount, but there is probably already a way to do this?
> > > 
> > > Not if it's referenced. And if it's not referenced, then the only
> > > hammer we have is Brutus^Wdrop_caches. That's not an option for
> > > production machines.
> > > 
> > > Neat idea, but one I'd already thought of and discarded as "not
> > > practical from an admin perspective".
> > 
> > Okay, so other ideas (which you have also probably already thought of) include:
> > 
> > 1) Just return -EBUSY if anyone tries to change the DAX flag of an inode with
> > open mappings or any open file handles.
> 
> You have to have an open fd to change the flag. :)

Yeah, open file handles don't matter and we can serialize against IO in
progress, that's not a big deal. Established mappings are difficult to deal
with.

> > To prevent TOCTOU races we'd have to
> > do some additional locking while actually changing the flag.
> 
> I think that makes sense - the fundamental problem is that the
> mappings are different between dax and non-dax, and that we can't
> properly lock out page faults to prevent sending a racing
> page fault down the wrong path.
> 
> > 2) Be more drastic and follow the flow of ext4 file based encryption, only
> > allowing the inode flag to be set by an admin on an empty directory.  Files in
> > that directory will inherit it when they are created, and we don't provide a
> > way to clear.  If you want your file to not use DAX, move it to a different
> > directory (which I think for ext4 encryption turns it into a new inode).
> 
> Seems like the wrong model to me - moving application data files
> is a PITA because you've also got to change the app config to point
> at the new location...

Agreed.

> > Other ideas?
> 
> IMO, we need to fix the page fault path so we don't look at inode
> flags to determine processing behaviour during the fault. Fault
> processing as DAX or non-dax needs to be determined by the page
> fault code and communicated to the fs via the vmf as the contents
> of the vmf for a dax fault can be invalid for a non-dax fault. Fixing
> that problem (i.e. make DAX a property of the mapping and
> instantiate it from the inode only at mmap() time) means all the
> page fault vs inode flag race problems go away and we have a model
> that is much more robust if we want to expand it in future.

In fact, the real problem is only with .page_mkwrite and .pfn_mkwrite
callbacks. For those setup of 'vmf' differs. For .fault or .huge_fault the
vmf is the same regardless whether we do DAX or non-DAX fault. But it seems
difficult to me to determine DAX / non-DAX fault in vmf since locks
necessary to stabilize S_DAX flag are acquired only in filesystem-specific
handlers (and the locks themselves are fs specific).

So the only way I see of dealing safely with these races is careful
checking in .page_mkwrite and .pfn_mkwrite after necessary locks are
obtained and bail out doing nothing if state is inconsistent. VM will retry
the fault and we'll get to the correct handler next time.

But if we disallow any mappings when switching S_DAX flag, then all the
above is moot and there can be no races... We just have to be sure to block
new mappings of the file while switching the flag.

								Honza
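The "check after taking the locks, bail out, let the VM retry" approach can be sketched as a user-space model. Everything here is a simplification and an assumption: the handler names only echo the .page_mkwrite and .pfn_mkwrite callbacks, a single `threading.Lock` stands in for the fs-specific locks, and the racing flag change is simulated with a parameter.

```python
import threading

RETRY = "retry"


class Inode:
    def __init__(self, s_dax: bool):
        self.s_dax = s_dax
        self.fs_lock = threading.Lock()  # stands in for fs-specific locks


def page_mkwrite(inode: Inode) -> str:
    """Non-DAX write-fault handler."""
    with inode.fs_lock:
        if inode.s_dax:        # state changed since this handler was chosen
            return RETRY       # do nothing; the fault will be retried
        return "page_cache_write_enabled"


def pfn_mkwrite(inode: Inode) -> str:
    """DAX write-fault handler."""
    with inode.fs_lock:
        if not inode.s_dax:
            return RETRY
        return "dax_write_enabled"


def write_fault(inode: Inode, flip_after_dispatch: bool = False):
    """VM side: pick a handler from an unlocked view, retry on bail-out."""
    attempts = 0
    while True:
        attempts += 1
        handler = pfn_mkwrite if inode.s_dax else page_mkwrite
        if flip_after_dispatch and attempts == 1:
            inode.s_dax = not inode.s_dax  # simulated racing flag change
        result = handler(inode)
        if result != RETRY:
            return result, attempts


raced = write_fault(Inode(s_dax=False), flip_after_dispatch=True)
quiet = write_fault(Inode(s_dax=False))
```

In the raced case the first handler notices the inconsistency under the lock and does nothing, and the second attempt reaches the handler that matches the new state.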
Theodore Ts'o Sept. 8, 2017, 3:39 p.m. UTC | #11
On Fri, Sep 08, 2017 at 09:25:43AM +1000, Dave Chinner wrote:
> > Okay, so other ideas (which you have also probably already thought of) include:
> > 
> > 1) Just return -EBUSY if anyone tries to change the DAX flag of an inode with
> > open mappings or any open file handles.
> 
> You have to have an open fd to change the flag. :)

What if we only allow the S_DAX flag to be *set*, when i_size and
i_blocks is zero?  We could also require that only one file descriptor
be open against the inode, and that it be opened O_RDONLY.

						- Ted
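The precondition suggested above can be written down as a simple predicate. This is a user-space sketch with hypothetical names: the `Inode` model tracks the open flags of each file descriptor explicitly, which a real kernel would not do in this form.

```python
import os
from dataclasses import dataclass, field

# Access-mode bits of the open flags (equivalent to O_ACCMODE on Linux).
ACCMODE = os.O_RDONLY | os.O_WRONLY | os.O_RDWR


@dataclass
class Inode:
    i_size: int = 0
    i_blocks: int = 0
    # Open flags of every file descriptor currently open on this inode.
    open_flags: list = field(default_factory=list)


def may_set_dax(inode: Inode) -> bool:
    # Only empty files qualify: no size and no allocated blocks.
    if inode.i_size != 0 or inode.i_blocks != 0:
        return False
    # Exactly one descriptor may be open (the one used to set the flag).
    if len(inode.open_flags) != 1:
        return False
    # That descriptor must have been opened read-only.
    return (inode.open_flags[0] & ACCMODE) == os.O_RDONLY


empty_ro = may_set_dax(Inode(open_flags=[os.O_RDONLY]))
has_data = may_set_dax(Inode(i_size=4096, i_blocks=8,
                             open_flags=[os.O_RDONLY]))
```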
Jan Kara Sept. 11, 2017, 8:47 a.m. UTC | #12
On Fri 08-09-17 11:39:13, Ted Tso wrote:
> On Fri, Sep 08, 2017 at 09:25:43AM +1000, Dave Chinner wrote:
> > > Okay, so other ideas (which you have also probably already thought of) include:
> > > 
> > > 1) Just return -EBUSY if anyone tries to change the DAX flag of an inode with
> > > open mappings or any open file handles.
> > 
> > You have to have an open fd to change the flag. :)
> 
> What if we only allow the S_DAX flag to be *set*, when i_size and
> i_blocks is zero?  We could also require that only one file descriptor
> be open against the inode, and that it be opened O_RDONLY.

We could do something like that but IMHO it will be a pain to use (e.g.
think how difficult it would be to switch your existing database to use DAX
for data files). We can make transition reliable whenever
inode->i_mapping->i_mmap RB tree is empty (effectively: whenever the file
is not mmaped). And that should be relaxed enough for most use cases... But
I agree that it will be somewhat tricky to prevent creation of new mappings
while we are switching S_DAX flag so it needs more thought.

								Honza