ext4: introduce per-inode DAX flag

Submitted by Lukas Czerner on Aug. 2, 2017, 4:09 p.m.

Details

Message ID 1501690186-17607-1-git-send-email-lczerner@redhat.com
State New
Headers show

Commit Message

Lukas Czerner Aug. 2, 2017, 4:09 p.m.
Currently it's only possible to enable, or disable DAX via mount option
on ext4. However some applications may only want to enable DAX for
certain files on a file system.

This patch introduces new user visible and user modifiable ext4 inode
flag EXT4_DAX_FL. This flag is also added into EXT4_FL_INHERITED which
means that new inodes created in a directory with EXT4_DAX_FL set will
automatically inherit the inode flag.

With this patch ext4 is on par with xfs when it comes to per inode DAX
enablement.

Note that there are some changes due to this patch to consider. Regural
users are now able to set up their files to use DAX which was previously
only possible by mounting the file system with -o dax option. Any IO
done to the file with the flag se will fail in case the block device is
not DAX enabled.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/ext4.h  | 10 ++++++----
 fs/ext4/inode.c |  6 +++---
 fs/ext4/ioctl.c |  7 ++++++-
 3 files changed, 15 insertions(+), 8 deletions(-)

Comments

Christoph Hellwig Aug. 5, 2017, 8:46 a.m.
Please don't do this until we have a coherent DAX plan.  So far
the flag in XFS seems like a massive mistake and we need to get
rid of it instead of adding it to more file systems.
Lukas Czerner Aug. 7, 2017, 12:12 p.m.
On Sat, Aug 05, 2017 at 01:46:01AM -0700, Christoph Hellwig wrote:
> Please don't do this until we have a coherent DAX plan.  So far
> the flag in XFS seems like a massive mistake and we need to get
> rid of it instead of adding it to more file systems.

Hi, pardon my ignorance, but can you point me to the discussion where
this was decided ? I must have missed it and I can't find it.

Thanks!
-Lukas
Lukas Czerner Aug. 8, 2017, 9 a.m.
On Mon, Aug 07, 2017 at 02:12:12PM +0200, Lukas Czerner wrote:
> On Sat, Aug 05, 2017 at 01:46:01AM -0700, Christoph Hellwig wrote:
> > Please don't do this until we have a coherent DAX plan.  So far
> > the flag in XFS seems like a massive mistake and we need to get
> > rid of it instead of adding it to more file systems.
> 
> Hi, pardon my ignorance, but can you point me to the discussion where
> this was decided ? I must have missed it and I can't find it.
> 
> Thanks!
> -Lukas

So I've read the thread where you talk about this. I do not see anything
about it being a masive mistake. I only see that you're confident that's
it's a wrong thing to do. Like you're ever not confident about
something;) Care to elaborate on what's wrong with it and why we do not
want it ? On the surface it does look like a usefull option to have.

-Lukas
Christoph Hellwig Aug. 11, 2017, 10:01 a.m.
On Tue, Aug 08, 2017 at 11:00:16AM +0200, Lukas Czerner wrote:
> So I've read the thread where you talk about this. I do not see anything
> about it being a masive mistake. I only see that you're confident that's
> it's a wrong thing to do. Like you're ever not confident about
> something;) Care to elaborate on what's wrong with it and why we do not
> want it ? On the surface it does look like a usefull option to have.

The problem is that DAX is an implementation detail on how to write
data back.  It has absolutely no user visible semantics.  Encoding
such a detail in the on-disk format is not a good idea.
Lukas Czerner Aug. 11, 2017, 12:11 p.m.
On Fri, Aug 11, 2017 at 03:01:47AM -0700, Christoph Hellwig wrote:
> On Tue, Aug 08, 2017 at 11:00:16AM +0200, Lukas Czerner wrote:
> > So I've read the thread where you talk about this. I do not see anything
> > about it being a masive mistake. I only see that you're confident that's
> > it's a wrong thing to do. Like you're ever not confident about
> > something;) Care to elaborate on what's wrong with it and why we do not
> > want it ? On the surface it does look like a usefull option to have.
> 
> The problem is that DAX is an implementation detail on how to write
> data back.  It has absolutely no user visible semantics.  Encoding
> such a detail in the on-disk format is not a good idea.

Thanks, for the answer. I do not know too much about the DAX enabled HW.
However I do know that there is some variety to it, some can be faster
than DRAM, some can be slower, or on-par with DRAM. Some can be more
expensive, hence probably smaller, some cheaper and bigger. What about
latency and throughput can there be difference as well ?

That said, it seems to me that there can be some user choice involved in
this at least based on the fact that when DAX is used system memory is not
used.

So for example when DAX HW is slower than system memory, user can make
a choice to exclude some inodes to speed up particular workload, while
saving system memory where it does not matter as much.

Also can this flag play a role in situation of hierarchical, or heterogeneous
storage where dax enabled hardware is used ? Seems to me that it might.

What I am trying to say is that while you say that it has no user
visible semantics, it does have effect on the system that should not be
simply ignored.

-Lukas
Christoph Hellwig Aug. 11, 2017, 12:58 p.m.
On Fri, Aug 11, 2017 at 02:11:32PM +0200, Lukas Czerner wrote:
> Thanks, for the answer. I do not know too much about the DAX enabled HW.
> However I do know that there is some variety to it, some can be faster
> than DRAM, some can be slower, or on-par with DRAM. Some can be more
> expensive, hence probably smaller, some cheaper and bigger. What about
> latency and throughput can there be difference as well ?

Or it could be similar for reads and much slower for writes.  Or it
might prefer larger access sizes than a cache line.

> That said, it seems to me that there can be some user choice involved in
> this at least based on the fact that when DAX is used system memory is not
> used.

All of the above are charateristics of the medium, not of the
application.


> So for example when DAX HW is slower than system memory, user can make
> a choice to exclude some inodes to speed up particular workload, while
> saving system memory where it does not matter as much.

What should be factored into the decision might be access hints from
the applications, which would be things like madvise/fadvise hints.

But the application should not even have to know about something like
DAX, and the detailed access characterisics of the medium.  And even
more importantly we should not encode those complex characterisitcs
(see above) into a binary flag in the on-disk format.
Lukas Czerner Aug. 11, 2017, 1:41 p.m.
On Fri, Aug 11, 2017 at 05:58:49AM -0700, Christoph Hellwig wrote:
> On Fri, Aug 11, 2017 at 02:11:32PM +0200, Lukas Czerner wrote:
> > Thanks, for the answer. I do not know too much about the DAX enabled HW.
> > However I do know that there is some variety to it, some can be faster
> > than DRAM, some can be slower, or on-par with DRAM. Some can be more
> > expensive, hence probably smaller, some cheaper and bigger. What about
> > latency and throughput can there be difference as well ?
> 
> Or it could be similar for reads and much slower for writes.  Or it
> might prefer larger access sizes than a cache line.
> 
> > That said, it seems to me that there can be some user choice involved in
> > this at least based on the fact that when DAX is used system memory is not
> > used.
> 
> All of the above are charateristics of the medium, not of the
> application.
> 
> 
> > So for example when DAX HW is slower than system memory, user can make
> > a choice to exclude some inodes to speed up particular workload, while
> > saving system memory where it does not matter as much.
> 
> What should be factored into the decision might be access hints from
> the applications, which would be things like madvise/fadvise hints.
> 
> But the application should not even have to know about something like
> DAX, and the detailed access characterisics of the medium.  And even
> more importantly we should not encode those complex characterisitcs
> (see above) into a binary flag in the on-disk format.

I understand your concept, I even agree that application should not
even know, or have to know, about DAX, or any characteristics of the
medium. Application can take advantage of madvise/fadvise hints, if
implemented in this case and it will be usefull. But I am not talking
about the application here.

Administrator is the one setting up the environment and no matter what
are the capabilities of the application he (it) can and should have the
information about the the system (and the medium characteristics) and
should be able to tweak it. Per inode DAX flags give him the ability to
do so permanently in a convenient way in this case.

We have plenty of knobs in the file system we can play with to optimize
it for a given medium already. Sometimes we maybe go too far, but I do not
think it's the case here. Especially since I do not see a way to implement
the above mentioned use case in a different way - without relying on the
application to do the right thing with hints (when those are implemented).

-Lukas

Patch hide | download patch | download mbox

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9ebde0c..689d003 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -395,14 +395,15 @@  struct flex_groups {
 #define EXT4_TOPDIR_FL			0x00020000 /* Top of directory hierarchies*/
 #define EXT4_HUGE_FILE_FL               0x00040000 /* Set to each huge file */
 #define EXT4_EXTENTS_FL			0x00080000 /* Inode uses extents */
+#define EXT4_DAX_FL			0x00100000 /* Use DAX for IO */
 #define EXT4_EA_INODE_FL	        0x00200000 /* Inode used for large EA */
 #define EXT4_EOFBLOCKS_FL		0x00400000 /* Blocks allocated beyond EOF */
 #define EXT4_INLINE_DATA_FL		0x10000000 /* Inode has inline data. */
 #define EXT4_PROJINHERIT_FL		0x20000000 /* Create with parents projid */
 #define EXT4_RESERVED_FL		0x80000000 /* reserved for ext4 lib */
 
-#define EXT4_FL_USER_VISIBLE		0x304BDFFF /* User visible flags */
-#define EXT4_FL_USER_MODIFIABLE		0x204BC0FF /* User modifiable flags */
+#define EXT4_FL_USER_VISIBLE		0x305BDFFF /* User visible flags */
+#define EXT4_FL_USER_MODIFIABLE		0x205BC0FF /* User modifiable flags */
 
 /* Flags we can manipulate with through EXT4_IOC_FSSETXATTR */
 #define EXT4_FL_XFLAG_VISIBLE		(EXT4_SYNC_FL | \
@@ -410,14 +411,15 @@  struct flex_groups {
 					 EXT4_APPEND_FL | \
 					 EXT4_NODUMP_FL | \
 					 EXT4_NOATIME_FL | \
-					 EXT4_PROJINHERIT_FL)
+					 EXT4_PROJINHERIT_FL |\
+					 EXT4_DAX_FL)
 
 /* Flags that should be inherited by new inodes from their parent. */
 #define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\
 			   EXT4_SYNC_FL | EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
 			   EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
 			   EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL |\
-			   EXT4_PROJINHERIT_FL)
+			   EXT4_PROJINHERIT_FL | EXT4_DAX_FL)
 
 /* Flags that are appropriate for regular files (all but dir-specific ones). */
 #define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3c600f0..051c4db5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4555,9 +4555,9 @@  void ext4_set_inode_flags(struct inode *inode)
 		new_fl |= S_NOATIME;
 	if (flags & EXT4_DIRSYNC_FL)
 		new_fl |= S_DIRSYNC;
-	if (test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode) &&
-	    !ext4_should_journal_data(inode) && !ext4_has_inline_data(inode) &&
-	    !ext4_encrypted_inode(inode))
+	if ((test_opt(inode->i_sb, DAX) || flags & EXT4_DAX_FL) &&
+	    S_ISREG(inode->i_mode) && !ext4_should_journal_data(inode) &&
+	    !ext4_has_inline_data(inode) && !ext4_encrypted_inode(inode))
 		new_fl |= S_DAX;
 	inode_set_flags(inode, new_fl,
 			S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 42b3a73..a46b606 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -424,12 +424,15 @@  static inline __u32 ext4_iflags_to_xflags(unsigned long iflags)
 		xflags |= FS_XFLAG_NOATIME;
 	if (iflags & EXT4_PROJINHERIT_FL)
 		xflags |= FS_XFLAG_PROJINHERIT;
+	if (iflags & EXT4_DAX_FL)
+		xflags |= FS_XFLAG_DAX;
 	return xflags;
 }
 
 #define EXT4_SUPPORTED_FS_XFLAGS (FS_XFLAG_SYNC | FS_XFLAG_IMMUTABLE | \
 				  FS_XFLAG_APPEND | FS_XFLAG_NODUMP | \
-				  FS_XFLAG_NOATIME | FS_XFLAG_PROJINHERIT)
+				  FS_XFLAG_NOATIME | FS_XFLAG_PROJINHERIT | \
+				  FS_XFLAG_DAX)
 
 /* Transfer xflags flags to internal */
 static inline unsigned long ext4_xflags_to_iflags(__u32 xflags)
@@ -448,6 +451,8 @@  static inline unsigned long ext4_xflags_to_iflags(__u32 xflags)
 		iflags |= EXT4_NOATIME_FL;
 	if (xflags & FS_XFLAG_PROJINHERIT)
 		iflags |= EXT4_PROJINHERIT_FL;
+	if (xflags & FS_XFLAG_DAX)
+		iflags |= EXT4_DAX_FL;
 
 	return iflags;
 }