
[v3,19/19] fs: handle inode->i_version more efficiently

Message ID 20171218151156.14565-20-jlayton@kernel.org
State Not Applicable, archived
Series fs: rework and optimize i_version handling in filesystems

Commit Message

Jeffrey Layton Dec. 18, 2017, 3:11 p.m. UTC
From: Jeff Layton <jlayton@redhat.com>

Since i_version is mostly treated as an opaque value, we can exploit that
fact to avoid incrementing it when no one is watching. With that change,
we can avoid incrementing the counter on writes, unless someone has
queried for it since it was last incremented. If the a/c/mtime don't
change, and the i_version hasn't changed, then there's no need to dirty
the inode metadata on a write.

Convert the i_version counter to an atomic64_t, and use the lowest order
bit to hold a flag that will tell whether anyone has queried the value
since it was last incremented.

When we go to maybe increment it, we fetch the value and check the flag
bit. If the flag is clear and the update isn't being forced, we don't
need to do anything.

If we do need to update, then we increment the counter by 2, clear the
flag bit, and use a CAS op to swap the new value into place. If that
works, we return true. If it doesn't, we retry with the value returned
by the CAS operation.

On the query side, if the flag is already set, then we just shift the
value down by 1 bit and return it. Otherwise, we set the flag in our
on-stack value and again use cmpxchg to swap it into place if it hasn't
changed. If it has, then we use the value from the cmpxchg as the new
"old" value and try again.

This method allows us to avoid incrementing the counter on writes (and
dirtying the metadata) under typical workloads. We only need to increment
if it has been queried since it was last changed.
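The scheme described above can be modeled in userspace with C11 atomics. This is an illustrative sketch only, not the kernel code (which uses the atomic64_t API); the helper names just echo the patch:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Lowest bit is the "queried" flag, so real increments step by 2 */
#define I_VERSION_QUERIED   1ULL
#define I_VERSION_INCREMENT 2ULL

static bool maybe_inc_iversion(_Atomic uint64_t *iv, bool force)
{
	uint64_t cur = atomic_load(iv);

	for (;;) {
		uint64_t next;

		/* Nobody has queried since the last bump: skip the store */
		if (!force && !(cur & I_VERSION_QUERIED))
			return false;

		/* Step the counter by 2 and clear the flag bit */
		next = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;

		/* On failure, cur is refreshed to the current value; retry */
		if (atomic_compare_exchange_weak(iv, &cur, next))
			return true;
	}
}

static uint64_t query_iversion(_Atomic uint64_t *iv)
{
	uint64_t cur = atomic_load(iv);

	while (!(cur & I_VERSION_QUERIED)) {
		/* Flag the value as seen so the next write must bump it */
		if (atomic_compare_exchange_weak(iv, &cur,
						 cur | I_VERSION_QUERIED))
			break;
	}
	return cur >> 1;	/* shift the flag bit out of the result */
}
```

Starting from 0, a write is a no-op until the first query sets the flag; the next write then bumps the stored value to 2, which queries report as 1.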

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/fs.h       |   2 +-
 include/linux/iversion.h | 180 ++++++++++++++++++++++++++++++++++-------------
 2 files changed, 131 insertions(+), 51 deletions(-)

Comments

Jan Kara Dec. 18, 2017, 4:34 p.m. UTC | #1
On Mon 18-12-17 10:11:56, Jeff Layton wrote:
>  static inline bool
>  inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
> -	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> +	u64 cur, old, new;
>  
> -	atomic64_inc(ivp);
> +	cur = (u64)atomic64_read(&inode->i_version);
> +	for (;;) {
> +		/* If flag is clear then we needn't do anything */
> +		if (!force && !(cur & I_VERSION_QUERIED))
> +			return false;

The fast path here misses any memory barrier. Thus it seems this query
could be in theory reordered before any store that happened to modify the
inode? Or maybe we could race and miss the fact that in fact this i_version
has already been queried? But maybe there's some higher level locking that
makes sure this is all a non-issue... But in that case it would deserve
some comment I guess.

> +
> +		/* Since lowest bit is flag, add 2 to avoid it */
> +		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> +
> +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> +		if (likely(old == cur))
> +			break;
> +		cur = old;
> +	}
>  	return true;
>  }
>  

...

>  static inline u64
>  inode_query_iversion(struct inode *inode)
>  {
> -	return inode_peek_iversion(inode);
> +	u64 cur, old, new;
> +
> +	cur = atomic64_read(&inode->i_version);
> +	for (;;) {
> +		/* If flag is already set, then no need to swap */
> +		if (cur & I_VERSION_QUERIED)
> +			break;
> +
> +		new = cur | I_VERSION_QUERIED;
> +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> +		if (old == cur)
> +			break;
> +		cur = old;
> +	}

Why not just use atomic64_or() here?

								Honza
Jeffrey Layton Dec. 18, 2017, 5:22 p.m. UTC | #2
On Mon, 2017-12-18 at 17:34 +0100, Jan Kara wrote:
> On Mon 18-12-17 10:11:56, Jeff Layton wrote:
> >  static inline bool
> >  inode_maybe_inc_iversion(struct inode *inode, bool force)
> >  {
> > -	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> > +	u64 cur, old, new;
> >  
> > -	atomic64_inc(ivp);
> > +	cur = (u64)atomic64_read(&inode->i_version);
> > +	for (;;) {
> > +		/* If flag is clear then we needn't do anything */
> > +		if (!force && !(cur & I_VERSION_QUERIED))
> > +			return false;
> 
> The fast path here misses any memory barrier. Thus it seems this query
> could be in theory reordered before any store that happened to modify the
> inode? Or maybe we could race and miss the fact that in fact this i_version
> has already been queried? But maybe there's some higher level locking that
> makes sure this is all a non-issue... But in that case it would deserve
> some comment I guess.
> 

There's no higher-level locking. Getting locking out of this codepath is
a good thing IMO. The larger question here is whether we really care
about ordering this with anything else.

The i_version, as implemented today, is not ordered with actual changes
to the inode. We only take the i_lock today when modifying it, not when
querying it. It's possible today that you could see the results of a
change and then do a fetch of the i_version that doesn't show an
increment vs. a previous change.

It'd be nice if this were atomic with the actual changes that it
represents, but I think that would be prohibitively expensive. That may
be something we need to address. I'm not sure we really want to do it as
part of this patchset though.

> > +
> > +		/* Since lowest bit is flag, add 2 to avoid it */
> > +		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> > +
> > +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > +		if (likely(old == cur))
> > +			break;
> > +		cur = old;
> > +	}
> >  	return true;
> >  }
> >  
> 
> ...
> 
> >  static inline u64
> >  inode_query_iversion(struct inode *inode)
> >  {
> > -	return inode_peek_iversion(inode);
> > +	u64 cur, old, new;
> > +
> > +	cur = atomic64_read(&inode->i_version);
> > +	for (;;) {
> > +		/* If flag is already set, then no need to swap */
> > +		if (cur & I_VERSION_QUERIED)
> > +			break;
> > +
> > +		new = cur | I_VERSION_QUERIED;
> > +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > +		if (old == cur)
> > +			break;
> > +		cur = old;
> > +	}
> 
> Why not just use atomic64_or() here?
> 

If the cmpxchg fails, then either:

1) it was incremented
2) someone flagged it QUERIED

If an increment happened then we don't need to flag it as QUERIED if
we're returning an older value. If we use atomic64_or, then we can't
tell if an increment happened so we'd end up potentially flagging it
more than necessary.

In principle, either outcome is technically OK and we don't have to loop
if the cmpxchg doesn't work. That said, if we think there might be a
later i_version available, then I think we probably want to try to query
it again so we can return as late a value as possible.
J. Bruce Fields Dec. 18, 2017, 5:36 p.m. UTC | #3
On Mon, Dec 18, 2017 at 12:22:20PM -0500, Jeff Layton wrote:
> On Mon, 2017-12-18 at 17:34 +0100, Jan Kara wrote:
> > On Mon 18-12-17 10:11:56, Jeff Layton wrote:
> > >  static inline bool
> > >  inode_maybe_inc_iversion(struct inode *inode, bool force)
> > >  {
> > > -	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> > > +	u64 cur, old, new;
> > >  
> > > -	atomic64_inc(ivp);
> > > +	cur = (u64)atomic64_read(&inode->i_version);
> > > +	for (;;) {
> > > +		/* If flag is clear then we needn't do anything */
> > > +		if (!force && !(cur & I_VERSION_QUERIED))
> > > +			return false;
> > 
> > The fast path here misses any memory barrier. Thus it seems this query
> > could be in theory reordered before any store that happened to modify the
> > inode? Or maybe we could race and miss the fact that in fact this i_version
> > has already been queried? But maybe there's some higher level locking that
> > makes sure this is all a non-issue... But in that case it would deserve
> > some comment I guess.
> > 
> 
> There's no higher-level locking. Getting locking out of this codepath is
> a good thing IMO. The larger question here is whether we really care
> about ordering this with anything else.
> 
> The i_version, as implemented today, is not ordered with actual changes
> to the inode. We only take the i_lock today when modifying it, not when
> querying it. It's possible today that you could see the results of a
> change and then do a fetch of the i_version that doesn't show an
> increment vs. a previous change.
> 
> It'd be nice if this were atomic with the actual changes that it
> represents, but I think that would be prohibitively expensive. That may
> be something we need to address. I'm not sure we really want to do it as
> part of this patchset though.

I don't think we need full atomicity, but we *do* need the i_version
increment to be seen as ordered after the inode change.  That should be
relatively inexpensive, yes?

Otherwise an inode change could be overlooked indefinitely, because a
client could cache the old contents with the new i_version value and
never see the i_version change again as long as the inode isn't touched
again.

(If the client instead caches new data with the old inode version, there
may be an unnecessary cache invalidation next time the client queries
the change attribute.  That's a performance bug, but not really a
correctness problem, at least for clients depending only on
close-to-open semantics: they'll only depend on seeing the new change
attribute after the change has been flushed to disk.  The performance
problem should be mitigated by the race being very rare.)

--b.
Jeffrey Layton Dec. 18, 2017, 7:35 p.m. UTC | #4
On Mon, 2017-12-18 at 12:36 -0500, J. Bruce Fields wrote:
> On Mon, Dec 18, 2017 at 12:22:20PM -0500, Jeff Layton wrote:
> > On Mon, 2017-12-18 at 17:34 +0100, Jan Kara wrote:
> > > On Mon 18-12-17 10:11:56, Jeff Layton wrote:
> > > >  static inline bool
> > > >  inode_maybe_inc_iversion(struct inode *inode, bool force)
> > > >  {
> > > > -	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> > > > +	u64 cur, old, new;
> > > >  
> > > > -	atomic64_inc(ivp);
> > > > +	cur = (u64)atomic64_read(&inode->i_version);
> > > > +	for (;;) {
> > > > +		/* If flag is clear then we needn't do anything */
> > > > +		if (!force && !(cur & I_VERSION_QUERIED))
> > > > +			return false;
> > > 
> > > The fast path here misses any memory barrier. Thus it seems this query
> > > could be in theory reordered before any store that happened to modify the
> > > inode? Or maybe we could race and miss the fact that in fact this i_version
> > > has already been queried? But maybe there's some higher level locking that
> > > makes sure this is all a non-issue... But in that case it would deserve
> > > some comment I guess.
> > > 
> > 
> > There's no higher-level locking. Getting locking out of this codepath is
> > a good thing IMO. The larger question here is whether we really care
> > about ordering this with anything else.
> > 
> > The i_version, as implemented today, is not ordered with actual changes
> > to the inode. We only take the i_lock today when modifying it, not when
> > querying it. It's possible today that you could see the results of a
> > change and then do a fetch of the i_version that doesn't show an
> > increment vs. a previous change.
> > 
> > It'd be nice if this were atomic with the actual changes that it
> > represents, but I think that would be prohibitively expensive. That may
> > be something we need to address. I'm not sure we really want to do it as
> > part of this patchset though.
> 
> I don't think we need full atomicity, but we *do* need the i_version
> increment to be seen as ordered after the inode change.  That should be
> relatively inexpensive, yes?
> 
> Otherwise an inode change could be overlooked indefinitely, because a
> client could cache the old contents with the new i_version value and
> never see the i_version change again as long as the inode isn't touched
> again.
> 
> (If the client instead caches new data with the old inode version, there
> may be an unnecessary cache invalidation next time the client queries
> the change attribute.  That's a performance bug, but not really a
> correctness problem, at least for clients depending only on
> close-to-open semantics: they'll only depend on seeing the new change
> attribute after the change has been flushed to disk.  The performance
> problem should be mitigated by the race being very rare.)

Perhaps something like this patch squashed into #19?

This should tighten up the initial loads and stores, like Jan was
suggesting (I think). The increments are done via cmpxchg so they
already have a full implied barrier (IIUC), and we don't exit the loop
until it succeeds or it looks like the flag is set.

To be clear though:

This series does not try to order the i_version update with any other
inode metadata or data (though we generally do update it after a write
is done, rather than before or during).

Inode field updates are not at all synchronized either and this series
does not change that. A stat() is also generally done locklessly so you
can easily get a half-baked attributes there if you hit the timing
wrong.

-------------------------8<--------------------------

[PATCH] SQUASH: add memory barriers around i_version accesses

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/iversion.h | 40 +++++++++++++++++++++-------------------
 1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index a9fbf99709df..39ec9aa9e08e 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -87,6 +87,25 @@ static inline void
 inode_set_iversion_raw(struct inode *inode, const u64 val)
 {
 	atomic64_set(&inode->i_version, val);
+	smp_wmb();
+}
+
+/**
+ * inode_peek_iversion_raw - grab a "raw" iversion value
+ * @inode: inode from which i_version should be read
+ *
+ * Grab a "raw" inode->i_version value and return it. The i_version is not
+ * flagged or converted in any way. This is mostly used to access a self-managed
+ * i_version.
+ *
+ * With those filesystems, we want to treat the i_version as an entirely
+ * opaque value.
+ */
+static inline u64
+inode_peek_iversion_raw(const struct inode *inode)
+{
+	smp_rmb();
+	return atomic64_read(&inode->i_version);
 }
 
 /**
@@ -152,7 +171,7 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
 {
 	u64 cur, old, new;
 
-	cur = (u64)atomic64_read(&inode->i_version);
+	cur = inode_peek_iversion_raw(inode);
 	for (;;) {
 		/* If flag is clear then we needn't do anything */
 		if (!force && !(cur & I_VERSION_QUERIED))
@@ -183,23 +202,6 @@ inode_inc_iversion(struct inode *inode)
 	inode_maybe_inc_iversion(inode, true);
 }
 
-/**
- * inode_peek_iversion_raw - grab a "raw" iversion value
- * @inode: inode from which i_version should be read
- *
- * Grab a "raw" inode->i_version value and return it. The i_version is not
- * flagged or converted in any way. This is mostly used to access a self-managed
- * i_version.
- *
- * With those filesystems, we want to treat the i_version as an entirely
- * opaque value.
- */
-static inline u64
-inode_peek_iversion_raw(const struct inode *inode)
-{
-	return atomic64_read(&inode->i_version);
-}
-
 /**
  * inode_iversion_need_inc - is the i_version in need of being incremented?
  * @inode: inode to check
@@ -248,7 +250,7 @@ inode_query_iversion(struct inode *inode)
 {
 	u64 cur, old, new;
 
-	cur = atomic64_read(&inode->i_version);
+	cur = inode_peek_iversion_raw(inode);
 	for (;;) {
 		/* If flag is already set, then no need to swap */
 		if (cur & I_VERSION_QUERIED)
Dave Chinner Dec. 18, 2017, 10:07 p.m. UTC | #5
On Mon, Dec 18, 2017 at 02:35:20PM -0500, Jeff Layton wrote:
> [PATCH] SQUASH: add memory barriers around i_version accesses

Why explicit memory barriers rather than annotating the operations
with the required semantics and getting the barriers the arch
requires automatically?  I suspect this should be using
atomic_read_acquire() and atomic_cmpxchg_release(), because AFAICT
the atomic_cmpxchg needs to have release semantics to match the
acquire semantics needed for the load of the current value.

From include/linux/atomics.h:

 * For compound atomics performing both a load and a store, ACQUIRE
 * semantics apply only to the load and RELEASE semantics only to the
 * store portion of the operation. Note that a failed cmpxchg_acquire
 * does -not- imply any memory ordering constraints.

Memory barriers hurt my brain. :/

At minimum, shouldn't the atomic op specific barriers be used rather
than full memory barriers? i.e:

> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index a9fbf99709df..39ec9aa9e08e 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -87,6 +87,25 @@ static inline void
>  inode_set_iversion_raw(struct inode *inode, const u64 val)
>  {
>  	atomic64_set(&inode->i_version, val);
> +	smp_wmb();

	smp_mb__after_atomic();
.....
> +static inline u64
> +inode_peek_iversion_raw(const struct inode *inode)
> +{
> +	smp_rmb();

	smp_mb__before_atomic();

> +	return atomic64_read(&inode->i_version);
>  }

And, of course, these will require comments explaining what they
match with and what they are ordering.
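In portable C11 terms, the acquire/release pairing being suggested would look roughly like this. This is a sketch of the idea only, not the kernel patch; the helper name and parameters are invented for illustration:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * The load of the counter gets ACQUIRE semantics and the successful
 * cmpxchg gets RELEASE semantics, so the increment cannot be observed
 * before the inode change that preceded it.
 */
static bool inc_with_release(_Atomic uint64_t *iv, uint64_t queried_bit,
			     uint64_t step)
{
	uint64_t cur = atomic_load_explicit(iv, memory_order_acquire);

	for (;;) {
		uint64_t next;

		if (!(cur & queried_bit))
			return false;

		next = (cur & ~queried_bit) + step;
		/*
		 * Success uses release ordering; per the atomics.h note
		 * quoted above, a failed cmpxchg implies no ordering, so
		 * the failure order here is relaxed and we simply retry.
		 */
		if (atomic_compare_exchange_weak_explicit(iv, &cur, next,
							  memory_order_release,
							  memory_order_relaxed))
			return true;
	}
}
```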

> @@ -152,7 +171,7 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
>  	u64 cur, old, new;
>  
> -	cur = (u64)atomic64_read(&inode->i_version);
> +	cur = inode_peek_iversion_raw(inode);
>  	for (;;) {
>  		/* If flag is clear then we needn't do anything */
>  		if (!force && !(cur & I_VERSION_QUERIED))

cmpxchg in this loop needs a release barrier so everyone will
see the change?

> @@ -183,23 +202,6 @@ inode_inc_iversion(struct inode *inode)
>  	inode_maybe_inc_iversion(inode, true);
>  }
>  
> -/**
> - * inode_peek_iversion_raw - grab a "raw" iversion value
> - * @inode: inode from which i_version should be read
> - *
> - * Grab a "raw" inode->i_version value and return it. The i_version is not
> - * flagged or converted in any way. This is mostly used to access a self-managed
> - * i_version.
> - *
> - * With those filesystems, we want to treat the i_version as an entirely
> - * opaque value.
> - */
> -static inline u64
> -inode_peek_iversion_raw(const struct inode *inode)
> -{
> -	return atomic64_read(&inode->i_version);
> -}
> -
>  /**
>   * inode_iversion_need_inc - is the i_version in need of being incremented?
>   * @inode: inode to check
> @@ -248,7 +250,7 @@ inode_query_iversion(struct inode *inode)
>  {
>  	u64 cur, old, new;
>  
> -	cur = atomic64_read(&inode->i_version);
> +	cur = inode_peek_iversion_raw(inode);
>  	for (;;) {
>  		/* If flag is already set, then no need to swap */
>  		if (cur & I_VERSION_QUERIED)

cmpxchg in this loop needs a release barrier so everyone will
see the change on load?

Cheers,

Dave.
Jan Kara Dec. 19, 2017, 9:29 a.m. UTC | #6
On Mon 18-12-17 12:22:20, Jeff Layton wrote:
> On Mon, 2017-12-18 at 17:34 +0100, Jan Kara wrote:
> > On Mon 18-12-17 10:11:56, Jeff Layton wrote:
> > >  static inline bool
> > >  inode_maybe_inc_iversion(struct inode *inode, bool force)
> > >  {
> > > -	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> > > +	u64 cur, old, new;
> > >  
> > > -	atomic64_inc(ivp);
> > > +	cur = (u64)atomic64_read(&inode->i_version);
> > > +	for (;;) {
> > > +		/* If flag is clear then we needn't do anything */
> > > +		if (!force && !(cur & I_VERSION_QUERIED))
> > > +			return false;
> > 
> > The fast path here misses any memory barrier. Thus it seems this query
> > could be in theory reordered before any store that happened to modify the
> > inode? Or maybe we could race and miss the fact that in fact this i_version
> > has already been queried? But maybe there's some higher level locking that
> > makes sure this is all a non-issue... But in that case it would deserve
> > some comment I guess.
> > 
> 
> There's no higher-level locking. Getting locking out of this codepath is
> a good thing IMO. The larger question here is whether we really care
> about ordering this with anything else.
> 
> The i_version, as implemented today, is not ordered with actual changes
> to the inode. We only take the i_lock today when modifying it, not when
> querying it. It's possible today that you could see the results of a
> change and then do a fetch of the i_version that doesn't show an
> increment vs. a previous change.

Yeah, so I don't suggest that you should fix unrelated issues, but the
original i_lock protection did actually provide memory barriers
(semi-permeable ones, but in practice they are very often enough), and
your patch removing those could have turned a theoretical issue into a
practical problem. So at least preserving the original acquire-release
semantics of i_version handling would IMHO be good.

> It'd be nice if this were atomic with the actual changes that it
> represents, but I think that would be prohibitively expensive. That may
> be something we need to address. I'm not sure we really want to do it as
> part of this patchset though.
> 
> > > +
> > > +		/* Since lowest bit is flag, add 2 to avoid it */
> > > +		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> > > +
> > > +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > > +		if (likely(old == cur))
> > > +			break;
> > > +		cur = old;
> > > +	}
> > >  	return true;
> > >  }
> > >  
> > 
> > ...
> > 
> > >  static inline u64
> > >  inode_query_iversion(struct inode *inode)
> > >  {
> > > -	return inode_peek_iversion(inode);
> > > +	u64 cur, old, new;
> > > +
> > > +	cur = atomic64_read(&inode->i_version);
> > > +	for (;;) {
> > > +		/* If flag is already set, then no need to swap */
> > > +		if (cur & I_VERSION_QUERIED)
> > > +			break;
> > > +
> > > +		new = cur | I_VERSION_QUERIED;
> > > +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > > +		if (old == cur)
> > > +			break;
> > > +		cur = old;
> > > +	}
> > 
> > Why not just use atomic64_or() here?
> > 
> 
> If the cmpxchg fails, then either:
> 
> 1) it was incremented
> 2) someone flagged it QUERIED
> 
> If an increment happened then we don't need to flag it as QUERIED if
> we're returning an older value. If we use atomic64_or, then we can't
> tell if an increment happened so we'd end up potentially flagging it
> more than necessary.
> 
> In principle, either outcome is technically OK and we don't have to loop
> if the cmpxchg doesn't work. That said, if we think there might be a
> later i_version available, then I think we probably want to try to query
> it again so we can return as late a value as possible.

OK, makes sense. I'm just a bit wary of cmpxchg loops as they tend to
behave pretty badly in contended cases but I guess i_version won't be
hammered *that* hard.

								Honza
Jeffrey Layton Dec. 19, 2017, 5:14 p.m. UTC | #7
On Tue, 2017-12-19 at 10:29 +0100, Jan Kara wrote:
> On Mon 18-12-17 12:22:20, Jeff Layton wrote:
> > On Mon, 2017-12-18 at 17:34 +0100, Jan Kara wrote:
> > > On Mon 18-12-17 10:11:56, Jeff Layton wrote:
> > > >  static inline bool
> > > >  inode_maybe_inc_iversion(struct inode *inode, bool force)
> > > >  {
> > > > -	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> > > > +	u64 cur, old, new;
> > > >  
> > > > -	atomic64_inc(ivp);
> > > > +	cur = (u64)atomic64_read(&inode->i_version);
> > > > +	for (;;) {
> > > > +		/* If flag is clear then we needn't do anything */
> > > > +		if (!force && !(cur & I_VERSION_QUERIED))
> > > > +			return false;
> > > 
> > > The fast path here misses any memory barrier. Thus it seems this query
> > > could be in theory reordered before any store that happened to modify the
> > > inode? Or maybe we could race and miss the fact that in fact this i_version
> > > has already been queried? But maybe there's some higher level locking that
> > > makes sure this is all a non-issue... But in that case it would deserve
> > > some comment I guess.
> > > 
> > 
> > There's no higher-level locking. Getting locking out of this codepath is
> > a good thing IMO. The larger question here is whether we really care
> > about ordering this with anything else.
> > 
> > The i_version, as implemented today, is not ordered with actual changes
> > to the inode. We only take the i_lock today when modifying it, not when
> > querying it. It's possible today that you could see the results of a
> > change and then do a fetch of the i_version that doesn't show an
> > increment vs. a previous change.
> 
> Yeah, so I don't suggest that you should fix unrelated issues, but the
> original i_lock protection did actually provide memory barriers
> (semi-permeable ones, but in practice they are very often enough), and
> your patch removing those could have turned a theoretical issue into a
> practical problem. So at least preserving the original acquire-release
> semantics of i_version handling would IMHO be good.
> 

Agreed. I've no objection to memory barriers here and I'm looking at
that, I just need to go over Dave's comments and memory-barriers.txt
(again!) to make sure I get them right.

> > It'd be nice if this were atomic with the actual changes that it
> > represents, but I think that would be prohibitively expensive. That may
> > be something we need to address. I'm not sure we really want to do it as
> > part of this patchset though.
> > 
> > > > +
> > > > +		/* Since lowest bit is flag, add 2 to avoid it */
> > > > +		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> > > > +
> > > > +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > > > +		if (likely(old == cur))
> > > > +			break;
> > > > +		cur = old;
> > > > +	}
> > > >  	return true;
> > > >  }
> > > >  
> > > 
> > > ...
> > > 
> > > >  static inline u64
> > > >  inode_query_iversion(struct inode *inode)
> > > >  {
> > > > -	return inode_peek_iversion(inode);
> > > > +	u64 cur, old, new;
> > > > +
> > > > +	cur = atomic64_read(&inode->i_version);
> > > > +	for (;;) {
> > > > +		/* If flag is already set, then no need to swap */
> > > > +		if (cur & I_VERSION_QUERIED)
> > > > +			break;
> > > > +
> > > > +		new = cur | I_VERSION_QUERIED;
> > > > +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > > > +		if (old == cur)
> > > > +			break;
> > > > +		cur = old;
> > > > +	}
> > > 
> > > Why not just use atomic64_or() here?
> > > 
> > 
> > If the cmpxchg fails, then either:
> > 
> > 1) it was incremented
> > 2) someone flagged it QUERIED
> > 
> > If an increment happened then we don't need to flag it as QUERIED if
> > we're returning an older value. If we use atomic64_or, then we can't
> > tell if an increment happened so we'd end up potentially flagging it
> > more than necessary.
> > 
> > In principle, either outcome is technically OK and we don't have to loop
> > if the cmpxchg doesn't work. That said, if we think there might be a
> > later i_version available, then I think we probably want to try to query
> > it again so we can return as late a value as possible.
> 
> OK, makes sense. I'm just a bit wary of cmpxchg loops as they tend to
> behave pretty badly in contended cases but I guess i_version won't be
> hammered *that* hard.
> 

That's the principle I'm operating under here, and I think it's valid
for almost all workloads. Incrementing the i_version on parallel writes
should be mostly uncontended now, whereas they had to serialize on
the i_lock before.

The pessimal case here, I think, is parallel increments and queries. We
may see that sort of workload under knfsd, but I'm fine with giving
knfsd a small performance hit to help performance on other workloads.

While we're on the subject of looping here, should I add a cpu_relax()
into these loops?

Thanks for the review so far!
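The cpu_relax() placement being asked about would go after a failed cmpxchg. Sketched in userspace C11, with a portable stand-in for the kernel's cpu_relax() (an assumption here, not the real primitive, which is typically a PAUSE-style spin hint):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the kernel's cpu_relax(): just a compiler fence so the
 * sketch stays portable; the real thing hints the CPU we are spinning. */
static inline void cpu_relax(void)
{
	atomic_signal_fence(memory_order_seq_cst);
}

/* The query-side cmpxchg loop, with the relaxation hint placed after a
 * failed swap, which is where it would go if contention were a problem. */
static uint64_t query_with_relax(_Atomic uint64_t *iv, uint64_t queried_bit)
{
	uint64_t cur = atomic_load(iv);

	while (!(cur & queried_bit)) {
		if (atomic_compare_exchange_weak(iv, &cur, cur | queried_bit))
			break;
		cpu_relax();	/* back off briefly before retrying */
	}
	return cur >> 1;	/* shift the flag bit out of the result */
}
```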
Jeffrey Layton Dec. 20, 2017, 2:03 p.m. UTC | #8
On Tue, 2017-12-19 at 09:07 +1100, Dave Chinner wrote:
> On Mon, Dec 18, 2017 at 02:35:20PM -0500, Jeff Layton wrote:
> > [PATCH] SQUASH: add memory barriers around i_version accesses
> 
> Why explicit memory barriers rather than annotating the operations
> with the required semantics and getting the barriers the arch
> requires automatically?  I suspect this should be using
> atomic_read_acquire() and atomic_cmpxchg_release(), because AFAICT
> the atomic_cmpxchg needs to have release semantics to match the
> acquire semantics needed for the load of the current value.
> 
> From include/linux/atomics.h:
> 
>  * For compound atomics performing both a load and a store, ACQUIRE
>  * semantics apply only to the load and RELEASE semantics only to the
>  * store portion of the operation. Note that a failed cmpxchg_acquire
>  * does -not- imply any memory ordering constraints.
> 
> Memory barriers hurt my brain. :/
> 
> At minimum, shouldn't the atomic op specific barriers be used rather
> than full memory barriers? i.e:
> 

They hurt my brain too. Yes, definitely atomic-specific barriers should
be used here instead, since this is an atomic64_t now.

After going over the docs again...my understanding has always been that
you primarily need memory barriers to order accesses to different areas
of memory.

As Jan and I were discussing in the other thread, i_version is not
synchronized with anything else. In this code, we're only dealing with a
single 64-bit word. I don't think there are any races there wrt the API
itself.

The "legacy" inode_inc_iversion() however _does_ have implied memory
barriers from the i_lock. There could be some subtle de-facto ordering
there, so I think we probably do want some barriers in here if only to
preserve that. It's not likely to cost much, and may save us tracking
down some fiddly bugs.

What about this patch? Note that I've only added barriers to
inode_maybe_inc_iversion. I don't see that we need it for the other
functions, but please do tell me if I'm wrong there:

--------------------8<---------------------------

[PATCH] SQUASH: add memory barriers around i_version accesses

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/iversion.h | 53 +++++++++++++++++++++++++++++-------------------
 1 file changed, 32 insertions(+), 21 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index a9fbf99709df..02187a3bec3b 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -89,6 +89,23 @@ inode_set_iversion_raw(struct inode *inode, const u64 val)
 	atomic64_set(&inode->i_version, val);
 }
 
+/**
+ * inode_peek_iversion_raw - grab a "raw" iversion value
+ * @inode: inode from which i_version should be read
+ *
+ * Grab a "raw" inode->i_version value and return it. The i_version is not
+ * flagged or converted in any way. This is mostly used to access a self-managed
+ * i_version.
+ *
+ * With those filesystems, we want to treat the i_version as an entirely
+ * opaque value.
+ */
+static inline u64
+inode_peek_iversion_raw(const struct inode *inode)
+{
+	return atomic64_read(&inode->i_version);
+}
+
 /**
  * inode_set_iversion - set i_version to a particular value
  * @inode: inode to set
@@ -152,7 +169,16 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
 {
 	u64 cur, old, new;
 
-	cur = (u64)atomic64_read(&inode->i_version);
+	/*
+	 * The i_version field is not strictly ordered with any other inode
+	 * information, but the legacy inode_inc_iversion code used a spinlock
+	 * to serialize increments.
+	 *
+	 * This code adds full memory barriers to ensure that any de-facto
+	 * ordering with other info is preserved.
+	 */
+	smp_mb__before_atomic();
+	cur = inode_peek_iversion_raw(inode);
 	for (;;) {
 		/* If flag is clear then we needn't do anything */
 		if (!force && !(cur & I_VERSION_QUERIED))
@@ -162,8 +188,10 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
 		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
 
 		old = atomic64_cmpxchg(&inode->i_version, cur, new);
-		if (likely(old == cur))
+		if (likely(old == cur)) {
+			smp_mb__after_atomic();
 			break;
+		}
 		cur = old;
 	}
 	return true;
@@ -183,23 +211,6 @@ inode_inc_iversion(struct inode *inode)
 	inode_maybe_inc_iversion(inode, true);
 }
 
-/**
- * inode_peek_iversion_raw - grab a "raw" iversion value
- * @inode: inode from which i_version should be read
- *
- * Grab a "raw" inode->i_version value and return it. The i_version is not
- * flagged or converted in any way. This is mostly used to access a self-managed
- * i_version.
- *
- * With those filesystems, we want to treat the i_version as an entirely
- * opaque value.
- */
-static inline u64
-inode_peek_iversion_raw(const struct inode *inode)
-{
-	return atomic64_read(&inode->i_version);
-}
-
 /**
  * inode_iversion_need_inc - is the i_version in need of being incremented?
  * @inode: inode to check
@@ -248,7 +259,7 @@ inode_query_iversion(struct inode *inode)
 {
 	u64 cur, old, new;
 
-	cur = atomic64_read(&inode->i_version);
+	cur = inode_peek_iversion_raw(inode);
 	for (;;) {
 		/* If flag is already set, then no need to swap */
 		if (cur & I_VERSION_QUERIED)
@@ -256,7 +267,7 @@ inode_query_iversion(struct inode *inode)
 
 		new = cur | I_VERSION_QUERIED;
 		old = atomic64_cmpxchg(&inode->i_version, cur, new);
-		if (old == cur)
+		if (likely(old == cur))
 			break;
 		cur = old;
 	}
Jan Kara Dec. 20, 2017, 4:41 p.m. UTC | #9
On Wed 20-12-17 09:03:06, Jeff Layton wrote:
> On Tue, 2017-12-19 at 09:07 +1100, Dave Chinner wrote:
> > On Mon, Dec 18, 2017 at 02:35:20PM -0500, Jeff Layton wrote:
> > > [PATCH] SQUASH: add memory barriers around i_version accesses
> > 
> > Why explicit memory barriers rather than annotating the operations
> > with the required semantics and getting the barriers the arch
> > requires automatically?  I suspect this should be using
> > atomic_read_acquire() and atomic_cmpxchg_release(), because AFAICT
> > the atomic_cmpxchg needs to have release semantics to match the
> > acquire semantics needed for the load of the current value.
> > 
> > From include/linux/atomics.h:
> > 
> >  * For compound atomics performing both a load and a store, ACQUIRE
> >  * semantics apply only to the load and RELEASE semantics only to the
> >  * store portion of the operation. Note that a failed cmpxchg_acquire
> >  * does -not- imply any memory ordering constraints.
> > 
> > Memory barriers hurt my brain. :/
> > 
> > At minimum, shouldn't the atomic op specific barriers be used rather
> > than full memory barriers? i.e:
> > 
> 
> They hurt my brain too. Yes, definitely atomic-specific barriers should
> be used here instead, since this is an atomic64_t now.
> 
> After going over the docs again...my understanding has always been that
> you primarily need memory barriers to order accesses to different areas
> of memory.

That is correct.

> As Jan and I were discussing in the other thread, i_version is not
> synchronized with anything else. In this code, we're only dealing with a
> single 64-bit word. I don't think there are any races there wrt the API
> itself.

There are not, but that is like saying a lock implementation is correct
because the lock state does not get corrupted ;). Who cares about the
protected data...

> The "legacy" inode_inc_iversion() however _does_ have implied memory
> barriers from the i_lock. There could be some subtle de-facto ordering
> there, so I think we probably do want some barriers in here if only to
> preserve that. It's not likely to cost much, and may save us tracking
> down some fiddly bugs.
> 
> What about this patch? Note that I've only added barriers to
> inode_maybe_inc_iversion. I don't see that we need it for the other
> functions, but please do tell me if I'm wrong there:
> 
> --------------------8<---------------------------
> 
> [PATCH] SQUASH: add memory barriers around i_version accesses
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  include/linux/iversion.h | 53 +++++++++++++++++++++++++++++-------------------
>  1 file changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index a9fbf99709df..02187a3bec3b 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -89,6 +89,23 @@ inode_set_iversion_raw(struct inode *inode, const u64 val)
>  	atomic64_set(&inode->i_version, val);
>  }
>  
> +/**
> + * inode_peek_iversion_raw - grab a "raw" iversion value
> + * @inode: inode from which i_version should be read
> + *
> + * Grab a "raw" inode->i_version value and return it. The i_version is not
> + * flagged or converted in any way. This is mostly used to access a self-managed
> + * i_version.
> + *
> + * With those filesystems, we want to treat the i_version as an entirely
> + * opaque value.
> + */
> +static inline u64
> +inode_peek_iversion_raw(const struct inode *inode)
> +{
> +	return atomic64_read(&inode->i_version);
> +}
> +
>  /**
>   * inode_set_iversion - set i_version to a particular value
>   * @inode: inode to set
> @@ -152,7 +169,16 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
>  	u64 cur, old, new;
>  
> -	cur = (u64)atomic64_read(&inode->i_version);
> +	/*
> +	 * The i_version field is not strictly ordered with any other inode
> +	 * information, but the legacy inode_inc_iversion code used a spinlock
> +	 * to serialize increments.
> +	 *
> +	 * This code adds full memory barriers to ensure that any de-facto
> +	 * ordering with other info is preserved.
> +	 */
> +	smp_mb__before_atomic();

This should be just smp_mb(). __before_atomic() pairs with atomic RMW
operations like atomic_inc(). atomic_read() is a completely unordered
operation (it happens to be a plain memory read on x86), so
__before_atomic() is not enough.

> +	cur = inode_peek_iversion_raw(inode);
>  	for (;;) {
>  		/* If flag is clear then we needn't do anything */
>  		if (!force && !(cur & I_VERSION_QUERIED))
> @@ -162,8 +188,10 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
>  		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
>  
>  		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> -		if (likely(old == cur))
> +		if (likely(old == cur)) {
> +			smp_mb__after_atomic();

I don't think you need this. Cmpxchg is guaranteed to be a full memory
barrier - from Documentation/atomic_t.txt:
  - RMW operations that have a return value are fully ordered;

>  			break;
> +		}
>  		cur = old;
>  	}
>  	return true;

...

> @@ -248,7 +259,7 @@ inode_query_iversion(struct inode *inode)
>  {
>  	u64 cur, old, new;
>  
> -	cur = atomic64_read(&inode->i_version);
> +	cur = inode_peek_iversion_raw(inode);
>  	for (;;) {
>  		/* If flag is already set, then no need to swap */
>  		if (cur & I_VERSION_QUERIED)

And here I'd expect smp_mb() after inode_peek_iversion_raw() (it is
actually needed only if you are not going to do the cmpxchg, as that
implies a barrier as well). "Safe" use of i_version would be:

Update:

modify inode
inode_maybe_inc_iversion(inode)

Read:

my_version = inode_query_iversion(inode)
get inode data

And you need to make sure 'get inode data' does not get speculatively
evaluated before you actually sample i_version so that you are guaranteed
that if data changes, you will observe larger i_version in the future.

Also please add a comment to the smp_mb() in inode_maybe_inc_iversion() like:

/* This barrier pairs with the barrier in inode_query_iversion() */

and a similar comment to inode_query_iversion(). Because memory barriers
make sense only in pairs (see SMP BARRIER PAIRING in
Documentation/memory-barriers.txt).

								Honza
Jeffrey Layton Dec. 21, 2017, 11:25 a.m. UTC | #10
On Wed, 2017-12-20 at 17:41 +0100, Jan Kara wrote:
> On Wed 20-12-17 09:03:06, Jeff Layton wrote:
> > On Tue, 2017-12-19 at 09:07 +1100, Dave Chinner wrote:
> > > On Mon, Dec 18, 2017 at 02:35:20PM -0500, Jeff Layton wrote:
> > > > [PATCH] SQUASH: add memory barriers around i_version accesses
> > > 
> > > Why explicit memory barriers rather than annotating the operations
> > > with the required semantics and getting the barriers the arch
> > > requires automatically?  I suspect this should be using
> > > atomic_read_acquire() and atomic_cmpxchg_release(), because AFAICT
> > > the atomic_cmpxchg needs to have release semantics to match the
> > > acquire semantics needed for the load of the current value.
> > > 
> > > From include/linux/atomics.h:
> > > 
> > >  * For compound atomics performing both a load and a store, ACQUIRE
> > >  * semantics apply only to the load and RELEASE semantics only to the
> > >  * store portion of the operation. Note that a failed cmpxchg_acquire
> > >  * does -not- imply any memory ordering constraints.
> > > 
> > > Memory barriers hurt my brain. :/
> > > 
> > > At minimum, shouldn't the atomic op specific barriers be used rather
> > > than full memory barriers? i.e:
> > > 
> > 
> > They hurt my brain too. Yes, definitely atomic-specific barriers should
> > be used here instead, since this is an atomic64_t now.
> > 
> > After going over the docs again...my understanding has always been that
> > you primarily need memory barriers to order accesses to different areas
> > of memory.
> 
> That is correct.
> 
> > As Jan and I were discussing in the other thread, i_version is not
> > synchronized with anything else. In this code, we're only dealing with a
> > single 64-bit word. I don't think there are any races there wrt the API
> > itself.
> 
> There are not but it is like saying that lock implementation is correct
> because the lock state does not get corrupted ;). Who cares about protected
> data...
> 
> > The "legacy" inode_inc_iversion() however _does_ have implied memory
> > barriers from the i_lock. There could be some subtle de-facto ordering
> > there, so I think we probably do want some barriers in here if only to
> > preserve that. It's not likely to cost much, and may save us tracking
> > down some fiddly bugs.
> > 
> > What about this patch? Note that I've only added barriers to
> > inode_maybe_inc_iversion. I don't see that we need it for the other
> > functions, but please do tell me if I'm wrong there:
> > 
> > --------------------8<---------------------------
> > 
> > [PATCH] SQUASH: add memory barriers around i_version accesses
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> >  include/linux/iversion.h | 53 +++++++++++++++++++++++++++++-------------------
> >  1 file changed, 32 insertions(+), 21 deletions(-)
> > 
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index a9fbf99709df..02187a3bec3b 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -89,6 +89,23 @@ inode_set_iversion_raw(struct inode *inode, const u64 val)
> >  	atomic64_set(&inode->i_version, val);
> >  }
> >  
> > +/**
> > + * inode_peek_iversion_raw - grab a "raw" iversion value
> > + * @inode: inode from which i_version should be read
> > + *
> > + * Grab a "raw" inode->i_version value and return it. The i_version is not
> > + * flagged or converted in any way. This is mostly used to access a self-managed
> > + * i_version.
> > + *
> > + * With those filesystems, we want to treat the i_version as an entirely
> > + * opaque value.
> > + */
> > +static inline u64
> > +inode_peek_iversion_raw(const struct inode *inode)
> > +{
> > +	return atomic64_read(&inode->i_version);
> > +}
> > +
> >  /**
> >   * inode_set_iversion - set i_version to a particular value
> >   * @inode: inode to set
> > @@ -152,7 +169,16 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
> >  {
> >  	u64 cur, old, new;
> >  
> > -	cur = (u64)atomic64_read(&inode->i_version);
> > +	/*
> > +	 * The i_version field is not strictly ordered with any other inode
> > +	 * information, but the legacy inode_inc_iversion code used a spinlock
> > +	 * to serialize increments.
> > +	 *
> > +	 * This code adds full memory barriers to ensure that any de-facto
> > +	 * ordering with other info is preserved.
> > +	 */
> > +	smp_mb__before_atomic();
> 
> This should be just smp_mb(). __before_atomic() pairs with atomic
> operations like atomic_inc(). atomic_read() is completely unordered
> operation (happens to be plain memory read on x86) and so __before_atomic()
> is not enough.
> 
> > +	cur = inode_peek_iversion_raw(inode);
> >  	for (;;) {
> >  		/* If flag is clear then we needn't do anything */
> >  		if (!force && !(cur & I_VERSION_QUERIED))
> > @@ -162,8 +188,10 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
> >  		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> >  
> >  		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > -		if (likely(old == cur))
> > +		if (likely(old == cur)) {
> > +			smp_mb__after_atomic();
> 
> I don't think you need this. Cmpxchg is guaranteed to be full memory
> barrier - from Documentation/atomic_t.txt:
>   - RMW operations that have a return value are fully ordered;
> 
> >  			break;
> > +		}
> >  		cur = old;
> >  	}
> >  	return true;
> 
> ...
> 
> > @@ -248,7 +259,7 @@ inode_query_iversion(struct inode *inode)
> >  {
> >  	u64 cur, old, new;
> >  
> > -	cur = atomic64_read(&inode->i_version);
> > +	cur = inode_peek_iversion_raw(inode);
> >  	for (;;) {
> >  		/* If flag is already set, then no need to swap */
> >  		if (cur & I_VERSION_QUERIED)
> 
> And here I'd expect smp_mb() after inode_peek_iversion_raw() (actually be
> needed only if you are not going to do cmpxchg as that implies barrier as
> well). "Safe" use of i_version would be:
> 
> Update:
> 
> modify inode
> inode_maybe_inc_iversion(inode)
> 
> Read:
> 
> my_version = inode_query_iversion(inode)
> get inode data
> 
> And you need to make sure 'get inode data' does not get speculatively
> evaluated before you actually sample i_version so that you are guaranteed
> that if data changes, you will observe larger i_version in the future.
> 
> Also please add a comment smp_mb() in inode_maybe_inc_iversion() like:
> 
> /* This barrier pairs with the barrier in inode_query_iversion() */
> 
> and a similar comment to inode_query_iversion(). Because memory barriers
> make sense only in pairs (see SMP BARRIER PAIRING in
> Documentation/memory-barriers.txt).
> 

Got it, I think. How about this (sorry for the unrelated deltas here):

[PATCH] SQUASH: add memory barriers around i_version accesses

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/iversion.h | 60 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 39 insertions(+), 21 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index a9fbf99709df..1b3b5ed7c5b8 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -89,6 +89,23 @@ inode_set_iversion_raw(struct inode *inode, const u64 val)
 	atomic64_set(&inode->i_version, val);
 }
 
+/**
+ * inode_peek_iversion_raw - grab a "raw" iversion value
+ * @inode: inode from which i_version should be read
+ *
+ * Grab a "raw" inode->i_version value and return it. The i_version is not
+ * flagged or converted in any way. This is mostly used to access a self-managed
+ * i_version.
+ *
+ * With those filesystems, we want to treat the i_version as an entirely
+ * opaque value.
+ */
+static inline u64
+inode_peek_iversion_raw(const struct inode *inode)
+{
+	return atomic64_read(&inode->i_version);
+}
+
 /**
  * inode_set_iversion - set i_version to a particular value
  * @inode: inode to set
@@ -152,7 +169,18 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
 {
 	u64 cur, old, new;
 
-	cur = (u64)atomic64_read(&inode->i_version);
+	/*
+	 * The i_version field is not strictly ordered with any other inode
+	 * information, but the legacy inode_inc_iversion code used a spinlock
+	 * to serialize increments.
+	 *
+	 * Here, we add full memory barriers to ensure that any de-facto
+	 * ordering with other info is preserved.
+	 *
+	 * This barrier pairs with the barrier in inode_query_iversion()
+	 */
+	smp_mb();
+	cur = inode_peek_iversion_raw(inode);
 	for (;;) {
 		/* If flag is clear then we needn't do anything */
 		if (!force && !(cur & I_VERSION_QUERIED))
@@ -183,23 +211,6 @@ inode_inc_iversion(struct inode *inode)
 	inode_maybe_inc_iversion(inode, true);
 }
 
-/**
- * inode_peek_iversion_raw - grab a "raw" iversion value
- * @inode: inode from which i_version should be read
- *
- * Grab a "raw" inode->i_version value and return it. The i_version is not
- * flagged or converted in any way. This is mostly used to access a self-managed
- * i_version.
- *
- * With those filesystems, we want to treat the i_version as an entirely
- * opaque value.
- */
-static inline u64
-inode_peek_iversion_raw(const struct inode *inode)
-{
-	return atomic64_read(&inode->i_version);
-}
-
 /**
  * inode_iversion_need_inc - is the i_version in need of being incremented?
  * @inode: inode to check
@@ -248,15 +259,22 @@ inode_query_iversion(struct inode *inode)
 {
 	u64 cur, old, new;
 
-	cur = atomic64_read(&inode->i_version);
+	cur = inode_peek_iversion_raw(inode);
 	for (;;) {
 		/* If flag is already set, then no need to swap */
-		if (cur & I_VERSION_QUERIED)
+		if (cur & I_VERSION_QUERIED) {
+			/*
+			 * This barrier (and the implicit barrier in the
+			 * cmpxchg below) pairs with the barrier in
+			 * inode_maybe_inc_iversion().
+			 */
+			smp_mb();
 			break;
+		}
 
 		new = cur | I_VERSION_QUERIED;
 		old = atomic64_cmpxchg(&inode->i_version, cur, new);
-		if (old == cur)
+		if (likely(old == cur))
 			break;
 		cur = old;
 	}
Jan Kara Dec. 21, 2017, 11:48 a.m. UTC | #11
On Thu 21-12-17 06:25:55, Jeff Layton wrote:
> Got it, I think. How about this (sorry for the unrelated deltas here):
> 
> [PATCH] SQUASH: add memory barriers around i_version accesses

Yep, this looks good to me.

								Honza
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  include/linux/iversion.h | 60 +++++++++++++++++++++++++++++++-----------------
>  1 file changed, 39 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index a9fbf99709df..1b3b5ed7c5b8 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -89,6 +89,23 @@ inode_set_iversion_raw(struct inode *inode, const u64 val)
>  	atomic64_set(&inode->i_version, val);
>  }
>  
> +/**
> + * inode_peek_iversion_raw - grab a "raw" iversion value
> + * @inode: inode from which i_version should be read
> + *
> + * Grab a "raw" inode->i_version value and return it. The i_version is not
> + * flagged or converted in any way. This is mostly used to access a self-managed
> + * i_version.
> + *
> + * With those filesystems, we want to treat the i_version as an entirely
> + * opaque value.
> + */
> +static inline u64
> +inode_peek_iversion_raw(const struct inode *inode)
> +{
> +	return atomic64_read(&inode->i_version);
> +}
> +
>  /**
>   * inode_set_iversion - set i_version to a particular value
>   * @inode: inode to set
> @@ -152,7 +169,18 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
>  	u64 cur, old, new;
>  
> -	cur = (u64)atomic64_read(&inode->i_version);
> +	/*
> +	 * The i_version field is not strictly ordered with any other inode
> +	 * information, but the legacy inode_inc_iversion code used a spinlock
> +	 * to serialize increments.
> +	 *
> +	 * Here, we add full memory barriers to ensure that any de-facto
> +	 * ordering with other info is preserved.
> +	 *
> +	 * This barrier pairs with the barrier in inode_query_iversion()
> +	 */
> +	smp_mb();
> +	cur = inode_peek_iversion_raw(inode);
>  	for (;;) {
>  		/* If flag is clear then we needn't do anything */
>  		if (!force && !(cur & I_VERSION_QUERIED))
> @@ -183,23 +211,6 @@ inode_inc_iversion(struct inode *inode)
>  	inode_maybe_inc_iversion(inode, true);
>  }
>  
> -/**
> - * inode_peek_iversion_raw - grab a "raw" iversion value
> - * @inode: inode from which i_version should be read
> - *
> - * Grab a "raw" inode->i_version value and return it. The i_version is not
> - * flagged or converted in any way. This is mostly used to access a self-managed
> - * i_version.
> - *
> - * With those filesystems, we want to treat the i_version as an entirely
> - * opaque value.
> - */
> -static inline u64
> -inode_peek_iversion_raw(const struct inode *inode)
> -{
> -	return atomic64_read(&inode->i_version);
> -}
> -
>  /**
>   * inode_iversion_need_inc - is the i_version in need of being incremented?
>   * @inode: inode to check
> @@ -248,15 +259,22 @@ inode_query_iversion(struct inode *inode)
>  {
>  	u64 cur, old, new;
>  
> -	cur = atomic64_read(&inode->i_version);
> +	cur = inode_peek_iversion_raw(inode);
>  	for (;;) {
>  		/* If flag is already set, then no need to swap */
> -		if (cur & I_VERSION_QUERIED)
> +		if (cur & I_VERSION_QUERIED) {
> +			/*
> +			 * This barrier (and the implicit barrier in the
> +			 * cmpxchg below) pairs with the barrier in
> +			 * inode_maybe_inc_iversion().
> +			 */
> +			smp_mb();
>  			break;
> +		}
>  
>  		new = cur | I_VERSION_QUERIED;
>  		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> -		if (old == cur)
> +		if (likely(old == cur))
>  			break;
>  		cur = old;
>  	}
> -- 
> 2.14.3
>

Patch

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 76382c24e9d0..6804d075933e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -639,7 +639,7 @@  struct inode {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
 	};
-	u64			i_version;
+	atomic64_t		i_version;
 	atomic_t		i_count;
 	atomic_t		i_dio_count;
 	atomic_t		i_writecount;
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index e08c634779df..a9fbf99709df 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -5,6 +5,8 @@ 
 #include <linux/fs.h>
 
 /*
+ * The inode->i_version field:
+ * ---------------------------
  * The change attribute (i_version) is mandated by NFSv4 and is mostly for
  * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
  * appear different to observers if there was a change to the inode's data or
@@ -27,86 +29,143 @@ 
  * i_version on namespace changes in directories (mkdir, rmdir, unlink, etc.).
  * We consider these sorts of filesystems to have a kernel-managed i_version.
  *
+ * This implementation uses the low bit in the i_version field as a flag to
+ * track when the value has been queried. If it has not been queried since it
+ * was last incremented, we can skip the increment in most cases.
+ *
+ * In the event that we're updating the ctime, we will usually go ahead and
+ * bump the i_version anyway. Since that has to go to stable storage in some
+ * fashion, we might as well increment it as well.
+ *
+ * With this implementation, the value should always appear to observers to
+ * increase over time if the file has changed. It's recommended to use
+ * inode_cmp_iversion() helper to compare values.
+ *
  * Note that some filesystems (e.g. NFS and AFS) just use the field to store
- * a server-provided value (for the most part). For that reason, those
+ * a server-provided value for the most part. For that reason, those
  * filesystems do not set SB_I_VERSION. These filesystems are considered to
  * have a self-managed i_version.
+ *
+ * Persistently storing the i_version
+ * ----------------------------------
+ * Queries of the i_version field are not gated on them hitting the backing
+ * store. It's always possible that the host could crash after allowing
+ * a query of the value but before it has made it to disk.
+ *
+ * To mitigate this problem, filesystems should always use
+ * inode_set_iversion_queried when loading an existing inode from disk. This
+ * ensures that the next attempted inode increment will result in the value
+ * changing.
+ *
+ * Storing the value to disk therefore does not count as a query, so those
+ * filesystems should use inode_peek_iversion to grab the value to be stored.
+ * There is no need to flag the value as having been queried in that case.
  */
 
+/*
+ * We borrow the lowest bit in the i_version to use as a flag to tell whether
+ * it has been queried since we last incremented it. If it has, then we must
+ * increment it on the next change. After that, we can clear the flag and
+ * avoid incrementing it again until it has again been queried.
+ */
+#define I_VERSION_QUERIED_SHIFT	(1)
+#define I_VERSION_QUERIED	(1ULL << (I_VERSION_QUERIED_SHIFT - 1))
+#define I_VERSION_INCREMENT	(1ULL << I_VERSION_QUERIED_SHIFT)
+
 /**
  * inode_set_iversion_raw - set i_version to the specified raw value
  * @inode: inode to set
- * @new: new i_version value to set
+ * @val: new i_version value to set
  *
- * Set @inode's i_version field to @new. This function is for use by
+ * Set @inode's i_version field to @val. This function is for use by
  * filesystems that self-manage the i_version.
  *
  * For example, the NFS client stores its NFSv4 change attribute in this way,
  * and the AFS client stores the data_version from the server here.
  */
 static inline void
-inode_set_iversion_raw(struct inode *inode, const u64 new)
+inode_set_iversion_raw(struct inode *inode, const u64 val)
 {
-	inode->i_version = new;
+	atomic64_set(&inode->i_version, val);
 }
 
 /**
  * inode_set_iversion - set i_version to a particular value
  * @inode: inode to set
- * @new: new i_version value to set
+ * @val: new i_version value to set
  *
- * Set @inode's i_version field to @new. This function is for filesystems with
- * a kernel-managed i_version.
+ * Set @inode's i_version field to @val. This function is for filesystems with
+ * a kernel-managed i_version, for initializing a newly-created inode from
+ * scratch.
  *
- * For now, this just does the same thing as the _raw variant.
+ * In this case, we do not set the QUERIED flag since we know that this value
+ * has never been queried.
  */
 static inline void
-inode_set_iversion(struct inode *inode, const u64 new)
+inode_set_iversion(struct inode *inode, const u64 val)
 {
-	inode_set_iversion_raw(inode, new);
+	inode_set_iversion_raw(inode, val << I_VERSION_QUERIED_SHIFT);
 }
 
 /**
- * inode_set_iversion_queried - set i_version to a particular value and set
- *                              flag to indicate that it has been viewed
+ * inode_set_iversion_queried - set i_version to a particular value as queried
  * @inode: inode to set
- * @new: new i_version value to set
+ * @val: new i_version value to set
  *
- * When loading in an i_version value from a backing store, we typically don't
- * know whether it was previously viewed before being stored or not. Thus, we
- * must assume that it was, to ensure that any changes will result in the
- * value changing.
+ * Set @inode's i_version field to @val, and flag it for increment on the next
+ * change.
  *
- * This function will set the inode's i_version, and possibly flag the value
- * as if it has already been viewed at least once.
+ * Filesystems that persistently store the i_version on disk should use this
+ * when loading an existing inode from disk.
  *
- * For now, this just does what inode_set_iversion does.
+ * When loading in an i_version value from a backing store, we can't be certain
+ * that it wasn't previously viewed before being stored. Thus, we must assume
+ * that it was, to ensure that we don't end up handing out the same value for
+ * different versions of the same inode.
  */
 static inline void
-inode_set_iversion_queried(struct inode *inode, const u64 new)
+inode_set_iversion_queried(struct inode *inode, const u64 val)
 {
-	inode_set_iversion(inode, new);
+	inode_set_iversion_raw(inode, (val << I_VERSION_QUERIED_SHIFT) |
+				I_VERSION_QUERIED);
 }
 
 /**
  * inode_maybe_inc_iversion - increments i_version
  * @inode: inode with the i_version that should be updated
- * @force: increment the counter even if it's not necessary
+ * @force: increment the counter even if it's not necessary?
  *
  * Every time the inode is modified, the i_version field must be seen to have
  * changed by any observer.
  *
- * In this implementation, we always increment it after taking the i_lock to
- * ensure that we don't race with other incrementors.
+ * If "force" is set or the QUERIED flag is set, then ensure that we increment
+ * the value, and clear the queried flag.
+ *
+ * In the common case where neither is set, then we can return "false" without
+ * updating i_version.
  *
- * Returns true if counter was bumped, and false if it wasn't.
+ * If this function returns false, and no other metadata has changed, then we
+ * can avoid logging the metadata.
  */
 static inline bool
 inode_maybe_inc_iversion(struct inode *inode, bool force)
 {
-	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
+	u64 cur, old, new;
 
-	atomic64_inc(ivp);
+	cur = (u64)atomic64_read(&inode->i_version);
+	for (;;) {
+		/* If flag is clear then we needn't do anything */
+		if (!force && !(cur & I_VERSION_QUERIED))
+			return false;
+
+		/* Since lowest bit is flag, add 2 to avoid it */
+		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
+
+		old = atomic64_cmpxchg(&inode->i_version, cur, new);
+		if (likely(old == cur))
+			break;
+		cur = old;
+	}
 	return true;
 }
 
@@ -124,21 +183,6 @@  inode_inc_iversion(struct inode *inode)
 	inode_maybe_inc_iversion(inode, true);
 }
 
-/**
- * inode_iversion_need_inc - is the i_version in need of being incremented?
- * @inode: inode to check
- *
- * Returns whether the inode->i_version counter needs incrementing on the next
- * change.
- *
- * For now, we assume that it always does.
- */
-static inline bool
-inode_iversion_need_inc(struct inode *inode)
-{
-	return true;
-}
-
 /**
  * inode_peek_iversion_raw - grab a "raw" iversion value
  * @inode: inode from which i_version should be read
@@ -153,7 +197,20 @@  inode_iversion_need_inc(struct inode *inode)
 static inline u64
 inode_peek_iversion_raw(const struct inode *inode)
 {
-	return inode->i_version;
+	return atomic64_read(&inode->i_version);
+}
+
+/**
+ * inode_iversion_need_inc - is the i_version in need of being incremented?
+ * @inode: inode to check
+ *
+ * Returns whether the inode->i_version counter needs incrementing on the next
+ * change. Just fetch the value and check the QUERIED flag.
+ */
+static inline bool
+inode_iversion_need_inc(struct inode *inode)
+{
+	return inode_peek_iversion_raw(inode) & I_VERSION_QUERIED;
 }
 
 /**
@@ -170,7 +227,7 @@  inode_peek_iversion_raw(const struct inode *inode)
 static inline u64
 inode_peek_iversion(const struct inode *inode)
 {
-	return inode_peek_iversion_raw(inode);
+	return inode_peek_iversion_raw(inode) >> I_VERSION_QUERIED_SHIFT;
 }
 
 /**
@@ -182,12 +239,28 @@  inode_peek_iversion(const struct inode *inode)
  * that a later query of the i_version will result in a different value if
  * anything has changed.
  *
- * This implementation just does a peek.
+ * In this implementation, we fetch the current value, set the QUERIED flag and
+ * then try to swap it into place with a cmpxchg, if it wasn't already set. If
+ * that fails, we try again with the newly fetched value from the cmpxchg.
  */
 static inline u64
 inode_query_iversion(struct inode *inode)
 {
-	return inode_peek_iversion(inode);
+	u64 cur, old, new;
+
+	cur = atomic64_read(&inode->i_version);
+	for (;;) {
+		/* If flag is already set, then no need to swap */
+		if (cur & I_VERSION_QUERIED)
+			break;
+
+		new = cur | I_VERSION_QUERIED;
+		old = atomic64_cmpxchg(&inode->i_version, cur, new);
+		if (old == cur)
+			break;
+		cur = old;
+	}
+	return cur >> I_VERSION_QUERIED_SHIFT;
 }
 
 /**
@@ -196,11 +269,18 @@  inode_query_iversion(struct inode *inode)
  * @old: old value to check against its i_version
  *
  * Compare an i_version counter with a previous one. Returns 0 if they are
- * the same or non-zero if they are different.
+ * the same, a positive value if the one in the inode appears newer than @old,
+ * and a negative value if @old appears to be newer than the one in the
+ * inode.
+ *
+ * Note that we don't need to set the QUERIED flag in this case, as the value
+ * in the inode is not being recorded for later use.
  */
+
 static inline s64
 inode_cmp_iversion(const struct inode *inode, const u64 old)
 {
-	return (s64)inode_peek_iversion(inode) - (s64)old;
+	return (s64)(inode_peek_iversion_raw(inode) & ~I_VERSION_QUERIED) -
+	       (s64)(old << I_VERSION_QUERIED_SHIFT);
 }
 #endif