[v6,0/9] Support negative dentries on case-insensitive ext4 and f2fs

Message ID	20230816050803.15660-1-krisman@suse.de
Headers	show Return-Path: <SRS0=k4Xd=EB=vger.kernel.org=linux-ext4-owner@ozlabs.org> From: Gabriel Krisman Bertazi <krisman@suse.de> To: viro@zeniv.linux.org.uk, brauner@kernel.org, tytso@mit.edu, ebiggers@kernel.org, jaegeuk@kernel.org Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, linux-f2fs-devel@lists.sourceforge.net, Gabriel Krisman Bertazi <krisman@suse.de> Subject: [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs Date: Wed, 16 Aug 2023 01:07:54 -0400 Message-ID: <20230816050803.15660-1-krisman@suse.de> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[v6,1/9] ecryptfs: Reject casefold directory inodes \| expand [v6,0/9] Support negative dentries on case-insensitive ext4 and f2fs [v6,1/9] ecryptfs: Reject casefold directory inodes [v6,2/9] 9p: Split ->weak_revalidate from ->revalidate [v6,3/9] fs: Expose name under lookup to d_revalidate hooks [v6,4/9] fs: Add DCACHE_CASEFOLDED_NAME flag [v6,5/9] libfs: Validate negative dentries in case-insensitive directories [v6,6/9] libfs: Chain encryption checks after case-insensitive revalidation [v6,7/9] libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops [v6,8/9] ext4: Enable negative dentries on case-insensitive lookup [v6,9/9] f2fs: Enable negative dentries on case-insensitive lookup

Gabriel Krisman Bertazi Aug. 16, 2023, 5:07 a.m. UTC

Hi,

This is v6 of the negative dentry on case-insensitive directories.
Thanks Eric for the review of the last iteration.  This version
drops the patch to expose the helper to check casefolding directories,
since it is not necessary in ecryptfs and it might be going away.  It
also addresses some documentation details, fix a build bot error and
simplifies the commit messages.  See the changelog in each patch for
more details.

Thanks,

---

Gabriel Krisman Bertazi (9):
  ecryptfs: Reject casefold directory inodes
  9p: Split ->weak_revalidate from ->revalidate
  fs: Expose name under lookup to d_revalidate hooks
  fs: Add DCACHE_CASEFOLDED_NAME flag
  libfs: Validate negative dentries in case-insensitive directories
  libfs: Chain encryption checks after case-insensitive revalidation
  libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops
  ext4: Enable negative dentries on case-insensitive lookup
  f2fs: Enable negative dentries on case-insensitive lookup

 Documentation/filesystems/locking.rst |   3 +-
 Documentation/filesystems/vfs.rst     |  11 ++-
 fs/9p/vfs_dentry.c                    |  11 ++-
 fs/afs/dir.c                          |   6 +-
 fs/afs/dynroot.c                      |   4 +-
 fs/ceph/dir.c                         |   3 +-
 fs/coda/dir.c                         |   3 +-
 fs/crypto/fname.c                     |   3 +-
 fs/dcache.c                           |   8 ++
 fs/ecryptfs/dentry.c                  |   5 +-
 fs/ecryptfs/inode.c                   |   8 ++
 fs/exfat/namei.c                      |   3 +-
 fs/ext4/namei.c                       |  35 +--------
 fs/f2fs/namei.c                       |  25 +-----
 fs/fat/namei_vfat.c                   |   6 +-
 fs/fuse/dir.c                         |   3 +-
 fs/gfs2/dentry.c                      |   3 +-
 fs/hfs/sysdep.c                       |   3 +-
 fs/jfs/namei.c                        |   3 +-
 fs/kernfs/dir.c                       |   3 +-
 fs/libfs.c                            | 107 ++++++++++++++++++--------
 fs/namei.c                            |  18 +++--
 fs/nfs/dir.c                          |   9 ++-
 fs/ocfs2/dcache.c                     |   4 +-
 fs/orangefs/dcache.c                  |   3 +-
 fs/overlayfs/super.c                  |  20 +++--
 fs/proc/base.c                        |   6 +-
 fs/proc/fd.c                          |   3 +-
 fs/proc/generic.c                     |   6 +-
 fs/proc/proc_sysctl.c                 |   3 +-
 fs/reiserfs/xattr.c                   |   3 +-
 fs/smb/client/dir.c                   |   3 +-
 fs/vboxsf/dir.c                       |   4 +-
 include/linux/dcache.h                |  10 ++-
 include/linux/fscrypt.h               |   4 +-
 35 files changed, 216 insertions(+), 136 deletions(-)

Eric Biggers Aug. 17, 2023, 5:06 p.m. UTC | #1

On Wed, Aug 16, 2023 at 01:07:54AM -0400, Gabriel Krisman Bertazi wrote:
> Hi,
> 
> This is v6 of the negative dentry on case-insensitive directories.
> Thanks Eric for the review of the last iteration.  This version
> drops the patch to expose the helper to check casefolding directories,
> since it is not necessary in ecryptfs and it might be going away.  It
> also addresses some documentation details, fix a build bot error and
> simplifies the commit messages.  See the changelog in each patch for
> more details.
> 
> Thanks,
> 
> ---
> 
> Gabriel Krisman Bertazi (9):
>   ecryptfs: Reject casefold directory inodes
>   9p: Split ->weak_revalidate from ->revalidate
>   fs: Expose name under lookup to d_revalidate hooks
>   fs: Add DCACHE_CASEFOLDED_NAME flag
>   libfs: Validate negative dentries in case-insensitive directories
>   libfs: Chain encryption checks after case-insensitive revalidation
>   libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops
>   ext4: Enable negative dentries on case-insensitive lookup
>   f2fs: Enable negative dentries on case-insensitive lookup
> 

Looks good,

Reviewed-by: Eric Biggers <ebiggers@google.com>

- Eric

Christian Brauner Aug. 21, 2023, 3:52 p.m. UTC | #2

On Thu, Aug 17, 2023 at 10:06:58AM -0700, Eric Biggers wrote:
> On Wed, Aug 16, 2023 at 01:07:54AM -0400, Gabriel Krisman Bertazi wrote:
> > Hi,
> > 
> > This is v6 of the negative dentry on case-insensitive directories.
> > Thanks Eric for the review of the last iteration.  This version
> > drops the patch to expose the helper to check casefolding directories,
> > since it is not necessary in ecryptfs and it might be going away.  It
> > also addresses some documentation details, fix a build bot error and
> > simplifies the commit messages.  See the changelog in each patch for
> > more details.
> > 
> > Thanks,
> > 
> > ---
> > 
> > Gabriel Krisman Bertazi (9):
> >   ecryptfs: Reject casefold directory inodes
> >   9p: Split ->weak_revalidate from ->revalidate
> >   fs: Expose name under lookup to d_revalidate hooks
> >   fs: Add DCACHE_CASEFOLDED_NAME flag
> >   libfs: Validate negative dentries in case-insensitive directories
> >   libfs: Chain encryption checks after case-insensitive revalidation
> >   libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops
> >   ext4: Enable negative dentries on case-insensitive lookup
> >   f2fs: Enable negative dentries on case-insensitive lookup
> > 
> 
> Looks good,
> 
> Reviewed-by: Eric Biggers <ebiggers@google.com>

Thanks! We're a bit too late for v6.6 with this given that this hasn't
even been in -next. So this will be up for v6.7.

Gabriel Krisman Bertazi Aug. 21, 2023, 6:53 p.m. UTC | #3

Christian Brauner <brauner@kernel.org> writes:

> On Thu, Aug 17, 2023 at 10:06:58AM -0700, Eric Biggers wrote:
>> On Wed, Aug 16, 2023 at 01:07:54AM -0400, Gabriel Krisman Bertazi wrote:
>> > Hi,
>> > 
>> > This is v6 of the negative dentry on case-insensitive directories.
>> > Thanks Eric for the review of the last iteration.  This version
>> > drops the patch to expose the helper to check casefolding directories,
>> > since it is not necessary in ecryptfs and it might be going away.  It
>> > also addresses some documentation details, fix a build bot error and
>> > simplifies the commit messages.  See the changelog in each patch for
>> > more details.
>> > 
>> > Thanks,
>> > 
>> > ---
>> > 
>> > Gabriel Krisman Bertazi (9):
>> >   ecryptfs: Reject casefold directory inodes
>> >   9p: Split ->weak_revalidate from ->revalidate
>> >   fs: Expose name under lookup to d_revalidate hooks
>> >   fs: Add DCACHE_CASEFOLDED_NAME flag
>> >   libfs: Validate negative dentries in case-insensitive directories
>> >   libfs: Chain encryption checks after case-insensitive revalidation
>> >   libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops
>> >   ext4: Enable negative dentries on case-insensitive lookup
>> >   f2fs: Enable negative dentries on case-insensitive lookup
>> > 
>> 
>> Looks good,
>> 
>> Reviewed-by: Eric Biggers <ebiggers@google.com>
>
> Thanks! We're a bit too late for v6.6 with this given that this hasn't
> even been in -next. So this will be up for v6.7.

Targeting 6.7 is fine by me. will you pick it up through the vfs tree? I
prefer it goes through there since it mostly touches vfs.

Christian Brauner Aug. 22, 2023, 9:03 a.m. UTC | #4

> Targeting 6.7 is fine by me. will you pick it up through the vfs tree? I
> prefer it goes through there since it mostly touches vfs.

Yes, I think that's what will end up happening.

Christian Brauner Oct. 25, 2023, 1:32 p.m. UTC | #5

On Wed, 16 Aug 2023 01:07:54 -0400, Gabriel Krisman Bertazi wrote:
> This is v6 of the negative dentry on case-insensitive directories.
> Thanks Eric for the review of the last iteration.  This version
> drops the patch to expose the helper to check casefolding directories,
> since it is not necessary in ecryptfs and it might be going away.  It
> also addresses some documentation details, fix a build bot error and
> simplifies the commit messages.  See the changelog in each patch for
> more details.
> 
> [...]

Ok, let's put it into -next so it sees some testing.
So it's too late for v6.7. Seems we forgot about this series.
Sorry about that.

---

Applied to the vfs.casefold branch of the vfs/vfs.git tree.
Patches in the vfs.casefold branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.casefold

[1/9] ecryptfs: Reject casefold directory inodes
      https://git.kernel.org/vfs/vfs/c/8512e7c7e665
[2/9] 9p: Split ->weak_revalidate from ->revalidate
      https://git.kernel.org/vfs/vfs/c/17f4423cb24a
[3/9] fs: Expose name under lookup to d_revalidate hooks
      https://git.kernel.org/vfs/vfs/c/24084e50e579
[4/9] fs: Add DCACHE_CASEFOLDED_NAME flag
      https://git.kernel.org/vfs/vfs/c/2daa2df800f8
[5/9] libfs: Validate negative dentries in case-insensitive directories
      https://git.kernel.org/vfs/vfs/c/8d879ccaf0f7
[6/9] libfs: Chain encryption checks after case-insensitive revalidation
      https://git.kernel.org/vfs/vfs/c/314e925d5a2c
[7/9] libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops
      https://git.kernel.org/vfs/vfs/c/07f820b77c58
[8/9] ext4: Enable negative dentries on case-insensitive lookup
      https://git.kernel.org/vfs/vfs/c/2562ec77f11e
[9/9] f2fs: Enable negative dentries on case-insensitive lookup
      https://git.kernel.org/vfs/vfs/c/39d2dd36a065

Gabriel Krisman Bertazi Nov. 19, 2023, 11:11 p.m. UTC | #6

Christian Brauner <brauner@kernel.org> writes:

> On Wed, 16 Aug 2023 01:07:54 -0400, Gabriel Krisman Bertazi wrote:
>> This is v6 of the negative dentry on case-insensitive directories.
>> Thanks Eric for the review of the last iteration.  This version
>> drops the patch to expose the helper to check casefolding directories,
>> since it is not necessary in ecryptfs and it might be going away.  It
>> also addresses some documentation details, fix a build bot error and
>> simplifies the commit messages.  See the changelog in each patch for
>> more details.
>> 
>> [...]
>
> Ok, let's put it into -next so it sees some testing.
> So it's too late for v6.7. Seems we forgot about this series.
> Sorry about that.

Christian,

We are approaching -rc2 and, until last Friday, it didn't shown up in
linux-next. So, to avoid turning a 6 month delay into 9 months, I pushed
your signed tag to linux-next myself.

That obviously uncovered a merge conflict: in v6.6, ceph added fscrypt,
and the caller had to be updated.  I fixed it and pushed again to
linux-next to get more testing.

Now, I don't want to send it to Linus myself. This is 100% VFS/FS code,
I'm not the maintainer and it will definitely raise eyebrows.  Can you
please requeue and make sure it goes through this time?  I'm happy to
drop my branch from linux-next once yours shows up.

  https://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode.git/log/?h=negative-dentries

This branch has the latest version with the ceph conflict folded in.  I
did it this way because I'd consider it was never picked up and there is
no point in making the history complex by adding a fix on top of your
signed tag, since it already fails to build ceph.

I can send it as a v7; but I prefer you just pull from the branch
above. Or you can ack and I'll send to Linus.

This is the diff from you signed tag:

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 629d8fb31d8f..21278a9d9baa 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1869,7 +1869,7 @@ static int ceph_d_revalidate(struct dentry *dentry, const struct qstr *name,
        struct inode *dir, *inode;
        struct ceph_mds_client *mdsc;

-       valid = fscrypt_d_revalidate(dentry, flags);
+       valid = fscrypt_d_revalidate(dentry, name, flags);
        if (valid <= 0)
                return valid;

diff --git a/fs/ecryptfs/dentry.c b/fs/ecryptfs/dentry.c
index 56093648d838..ce86891a1711 100644
--- a/fs/ecryptfs/dentry.c
+++ b/fs/ecryptfs/dentry.c
@@ -18,6 +18,7 @@
 /**
  * ecryptfs_d_revalidate - revalidate an ecryptfs dentry
  * @dentry: The ecryptfs dentry
+ * @name: The name under lookup
  * @flags: lookup flags
  *
  * Called when the VFS needs to revalidate a dentry. This
diff --git a/fs/gfs2/dentry.c b/fs/gfs2/dentry.c
index 3dd93d36aaf2..5e4910e016a8 100644
--- a/fs/gfs2/dentry.c
+++ b/fs/gfs2/dentry.c
@@ -22,6 +22,7 @@
 /**
  * gfs2_drevalidate - Check directory lookup consistency
  * @dentry: the mapping to check
+ * @name: The name under lookup
  * @flags: lookup flags
  *
  * Check to make sure the lookup necessary to arrive at this inode from its

Christian Brauner Nov. 20, 2023, 3:06 p.m. UTC | #7

On Sun, Nov 19, 2023 at 06:11:39PM -0500, Gabriel Krisman Bertazi wrote:
> Christian Brauner <brauner@kernel.org> writes:
> 
> > On Wed, 16 Aug 2023 01:07:54 -0400, Gabriel Krisman Bertazi wrote:
> >> This is v6 of the negative dentry on case-insensitive directories.
> >> Thanks Eric for the review of the last iteration.  This version
> >> drops the patch to expose the helper to check casefolding directories,
> >> since it is not necessary in ecryptfs and it might be going away.  It
> >> also addresses some documentation details, fix a build bot error and
> >> simplifies the commit messages.  See the changelog in each patch for
> >> more details.
> >> 
> >> [...]
> >
> > Ok, let's put it into -next so it sees some testing.
> > So it's too late for v6.7. Seems we forgot about this series.
> > Sorry about that.
> 
> Christian,
> 
> We are approaching -rc2 and, until last Friday, it didn't shown up in
> linux-next. So, to avoid turning a 6 month delay into 9 months, I pushed
> your signed tag to linux-next myself.
> 
> That obviously uncovered a merge conflict: in v6.6, ceph added fscrypt,
> and the caller had to be updated.  I fixed it and pushed again to
> linux-next to get more testing.
> 
> Now, I don't want to send it to Linus myself. This is 100% VFS/FS code,
> I'm not the maintainer and it will definitely raise eyebrows.  Can you
> please requeue and make sure it goes through this time?  I'm happy to

My current understanding is that core dcache stuff is usually handled by
Al. And he's got a dcache branches sitting in his tree.

So this isn't me ignoring you in any way. My hands are tied and so I
can't sort this out for you easily.

> drop my branch from linux-next once yours shows up.
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode.git/log/?h=negative-dentries
> 
> This branch has the latest version with the ceph conflict folded in.  I
> did it this way because I'd consider it was never picked up and there is
> no point in making the history complex by adding a fix on top of your
> signed tag, since it already fails to build ceph.
> 
> I can send it as a v7; but I prefer you just pull from the branch
> above. Or you can ack and I'll send to Linus.

Gabriel Krisman Bertazi Nov. 20, 2023, 4:59 p.m. UTC | #8

Christian Brauner <brauner@kernel.org> writes:

> On Sun, Nov 19, 2023 at 06:11:39PM -0500, Gabriel Krisman Bertazi wrote:
>> Christian Brauner <brauner@kernel.org> writes:
>> 
>> > On Wed, 16 Aug 2023 01:07:54 -0400, Gabriel Krisman Bertazi wrote:
>> >> This is v6 of the negative dentry on case-insensitive directories.
>> >> Thanks Eric for the review of the last iteration.  This version
>> >> drops the patch to expose the helper to check casefolding directories,
>> >> since it is not necessary in ecryptfs and it might be going away.  It
>> >> also addresses some documentation details, fix a build bot error and
>> >> simplifies the commit messages.  See the changelog in each patch for
>> >> more details.
>> >> 
>> >> [...]
>> >
>> > Ok, let's put it into -next so it sees some testing.
>> > So it's too late for v6.7. Seems we forgot about this series.
>> > Sorry about that.
>> 
>> Christian,
>> 
>> We are approaching -rc2 and, until last Friday, it didn't shown up in
>> linux-next. So, to avoid turning a 6 month delay into 9 months, I pushed
>> your signed tag to linux-next myself.
>> 
>> That obviously uncovered a merge conflict: in v6.6, ceph added fscrypt,
>> and the caller had to be updated.  I fixed it and pushed again to
>> linux-next to get more testing.
>> 
>> Now, I don't want to send it to Linus myself. This is 100% VFS/FS code,
>> I'm not the maintainer and it will definitely raise eyebrows.  Can you
>> please requeue and make sure it goes through this time?  I'm happy to
>
> My current understanding is that core dcache stuff is usually handled by
> Al. And he's got a dcache branches sitting in his tree.
>
> So this isn't me ignoring you in any way. My hands are tied and so I
> can't sort this out for you easily.

Please don't take it personally, but you surely see how frustrating this
is.

While I appreciate your very prompt answer, this is very different from:

  https://lore.kernel.org/linux-fsdevel/20230821-derart-serienweise-3506611e576d@brauner/
  https://lore.kernel.org/linux-fsdevel/20230822-denkmal-operette-f16d8bd815fc@brauner/
  https://lore.kernel.org/linux-fsdevel/20231025-selektiert-leibarzt-5d0070d85d93@brauner/

Perhaps it all has a vfs-specific meaning. But the following suggests it
was accepted and queued long ago:

> Thanks! We're a bit too late for v6.6 with this given that this hasn't
> even been in -next. So this will be up for v6.7.
[...]
> Ok, let's put it into -next so it sees some testing.
[...]
> It's encouraged to provide Acked-bys and Reviewed-bys even though the
> patch has now been applied.
[...]
> Patches in the vfs.casefold branch should appear in linux-next soon.
[...]

Obviously, there are big issues with the process here. But I fail to see
how I could have communicated clearer or where I didn't follow the
process in this entire thread.

The branches you mentioned are 10 days old. This patchset was
"accepted" two months ago.

As one of the VFS maintainer, can you send an acked-by - or at least a
r-b in cases like this, if you agree with the patches?  Then it makes
more sense for me to send to Linus directly.

Viro,

You are CC'ed since early 2022.  Can you comment?  Ted and Eric
reviewed, Christian too.  there's been only small changes since the
first RFC.

I'll ask Linus to pull from the unicode tree in the next merge window
if I don't hear from you.

Linus Torvalds Nov. 20, 2023, 6:07 p.m. UTC | #9

On Mon, 20 Nov 2023 at 07:06, Christian Brauner <brauner@kernel.org> wrote:
>
> My current understanding is that core dcache stuff is usually handled by
> Al. And he's got a dcache branches sitting in his tree.
>
> So this isn't me ignoring you in any way. My hands are tied and so I
> can't sort this out for you easily.

Well, we all know - very much including Al - that Al isn't always the
most responsive person, and tends to have his own ratholes that he
dives deep into.

The good news is that I do know the dcache code pretty well, and while
I really would like Al to deal with any locking issues (because
"pretty well" is not "as well as Al Viro"), for just about any other
issue I'll happily take pulls from you.

I dislike case folding with a passion - it's about the worst design
decision a filesystem can ever do - but the other side of that is that
if you have to have case folding, the last thing you want to do is to
have each filesystem deal with that sh*t-for-brains decision itself.

So moving more support for case folding into the VFS so that the
horrid thing at least gets better support is something I'm perfectly
fine with despite my dislike of it.

Of course, "do it in shared generic code" doesn't tend to really fix
the braindamage, but at least it's now shared braindamage and not
spread out all over. I'm looking at things like
generic_ci_d_compare(), and it hurts to see the mindless "let's do
lookups and compares one utf8 character at a time". What a disgrace.
Somebody either *really* didn't care, or was a Unicode person who
didn't understand the point of UTF-8.

Oh well. I guess people went "this is going to suck anyway, so let's
make sure it *really* sucks".

The patches look fine to me. Al - do you even care about them?

                   Linus

Theodore Ts'o Nov. 21, 2023, 2:02 a.m. UTC | #10

On Mon, Nov 20, 2023 at 10:07:51AM -0800, Linus Torvalds wrote:
> Of course, "do it in shared generic code" doesn't tend to really fix
> the braindamage, but at least it's now shared braindamage and not
> spread out all over. I'm looking at things like
> generic_ci_d_compare(), and it hurts to see the mindless "let's do
> lookups and compares one utf8 character at a time". What a disgrace.
> Somebody either *really* didn't care, or was a Unicode person who
> didn't understand the point of UTF-8.

This isn't because of case-folding brain damage, but rather Unicode
brain damage.  We compare one character at a time because it's
possible for some character like é to either be encoded as 0x0089 (aka
"Latin Small Letter E with Acute") OR as 0x0065 0x0301 ("Latin Small
Letter E" plus "Combining Acute Accent").

Typically, we pretend that UTF-8 means that we can just encode é, or
0x0089 as 0xC3 0xA9 and then call it a day and just use strcmp(3) on
the sucker.  But Unicode is a lot more insane than that.  Technically,
0x65 0xCC 0x81 is the same character as 0xC3 0xA9.

> Oh well. I guess people went "this is going to suck anyway, so let's
> make sure it *really* sucks".

It's more like, "this is going to suck, but if it's going to suck
anyway, let's implement the full Unicode spec in all its gory^H^H^H^H
glory, whether or not it's sane".

					- Ted

Al Viro Nov. 21, 2023, 2:27 a.m. UTC | #11

On Mon, Nov 20, 2023 at 10:07:51AM -0800, Linus Torvalds wrote:

> Well, we all know - very much including Al - that Al isn't always the
> most responsive person, and tends to have his own ratholes that he
> dives deep into.

FWIW, the right now I'm getting out of one of those - rename() deadlocks
fun.  Will post that, assuming it survives local beating, then post
the dcache stuff and other branches (all rebased by now).

Other ratholes - d_name/d_parent audit is about halfway through -
we do have fun shite in quite a few ->d_revalidate() instances (UAF,
etc.); RCU pathwalk methods need to be re-audited; the fixes caught
in the last cycle are either in mainline by now, or rebased.

But that stuff is relatively easy to suspend for a few days.

> Of course, "do it in shared generic code" doesn't tend to really fix
> the braindamage, but at least it's now shared braindamage and not
> spread out all over. I'm looking at things like
> generic_ci_d_compare(), and it hurts to see the mindless "let's do
> lookups and compares one utf8 character at a time". What a disgrace.
> Somebody either *really* didn't care, or was a Unicode person who
> didn't understand the point of UTF-8.
> 
> Oh well. I guess people went "this is going to suck anyway, so let's
> make sure it *really* sucks".
> 

> The patches look fine to me. Al - do you even care about them?

I will review that series; my impression from the previous iterations
had been fairly unpleasant, TBH, but I hadn't rechecked since April
or so.

Re dcache conflicts - see #untested-pile-dcache; most the dcache-affecting
stuff should be there.  One intersting exception is the general helper
for safely picking parent/parent's inode/entry name for ->d_revalidate()
use;  we have the parent-related part boilerplated, but the name side
tends to be missing.  That's still being tweaked, looking for sane
calling conventions.

Linus Torvalds Nov. 21, 2023, 2:29 a.m. UTC | #12

On Mon, 20 Nov 2023 at 18:03, Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Mon, Nov 20, 2023 at 10:07:51AM -0800, Linus Torvalds wrote:
> >     I'm looking at things like
> > generic_ci_d_compare(), and it hurts to see the mindless "let's do
> > lookups and compares one utf8 character at a time". What a disgrace.
> > Somebody either *really* didn't care, or was a Unicode person who
> > didn't understand the point of UTF-8.
>
> This isn't because of case-folding brain damage, but rather Unicode
> brain damage.

No, it really is just stupidity and horribleness.

The thing is, when you check two strings for equality, the FIRST THING
you should do is to just compare them for exactly that: equality.

And no, the way you do that is not by checking each unicode character
one by one.

You do it by just doing a regular memcmp. In fact, you can do even
better than that: while at it, check whether
 (a) all bytes are equal in everything but bit#5
 (b) none of the bytes have the high  bit set
and you have now narrowed down things in a big way. You can do these
things trivially one whole word at a time, and you'll handle 99% of
all input without EVER doing any Unicode garbage AT ALL.

Yes, yes, if you actually have complex characters, you end up having
to deal with that mess. But no, that is *not* an excuse for saying
"all characters are complex".

So no. There is absolutely zero excuse for doing stupid things, except
for "nobody has ever cared, because case folding is so stupid to begin
with that people just expect it to perform horribly badly".

End result:

 - generic_ci_d_compare() should *not* consider the memcmp() to be a
"fall back to this for non-casefolded". You should start with that,
and if the bytes are equal then the strings are equal. End of story.

 - if the bytes are not equal, then the strings *might* still compare
equal if it's a casefolded directory.

 - but EVEN THEN you shouldn't fall back to actually doing UTF-8
decoding unless you saw the high bit being set at some point.

 - and if they different in anything but bit #5 and you didn't see the
high bit, you know they are different.

It's a bit complicated, yes. But no, doing things one unicode
character at a time is just bad bad bad.

               Linus

Linus Torvalds Nov. 21, 2023, 3:03 a.m. UTC | #13

On Mon, 20 Nov 2023 at 18:29, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It's a bit complicated, yes. But no, doing things one unicode
> character at a time is just bad bad bad.

Put another way: the _point_ of UTF-8 is that ASCII is still ASCII.
It's literally why UTF-8 doesn't suck.

So you can still compare ASCII strings as-is.

No, that doesn't help people who are really using other locales, and
are actively using complicated characters.

But it very much does mean that you can compare "Bad" and "bad" and
never ever look at any unicode translation ever.

In a perfect world, you'd use all the complicated DCACHE_WORD_ACCESS
stuff that can do all of this one word at a time.

But even if you end up doing the rules just one byte at a time, it
means that you can deal with the common cases without "unicode
cursors" or function calls to extract unicode characters, or anything
like that. You can still treat things as bytes.

So the top of generic_ci_d_compare() should probably be something
trivial like this:

        const char *ct = name.name;
        unsigned int tcount = name.len;

        /* Handle the exact equality quickly */
        if (len == tcount && !dentry_string_cmp(str, ct, tcount))
                return 0;

because byte-wise equality is equality even if high bits are set.

After that, it should probably do something like

        /* Not byte-identical, but maybe igncase identical in ASCII */
        do {
                unsigned char a, b;

                /* Dentry name byte */
                a = *str;

                /* If that's NUL, the qstr needs to be done too! */
                if (!a)
                        return !!tcount;

                /* Alternatively, if the qstr is done, it needed to be NUL */
                if (!tcount)
                        return 1;
                b = *ct;

                if ((a | b) & 0x80)
                        break;

                if (a != b) {
                        /* Quick "not same" igncase ASCII */
                        if ((a ^ b) & ~32)
                                return 1;
                        a &= ~32;
                        if (a < 'A' || a > 'Z')
                                return 1;
                }

                /* Ok, same ASCII, bytefolded, go to next */
                str++;
                ct++;
                tcount--;
                len--;
        }

and only after THAT should it do the utf name comparison (and only on
the remaining parts, since the above will have checked for common
ASCII beginnings).

And the above was obviously never tested, and written in the MUA, and
may be completely wrong in all the details, but you get the idea. Deal
with the usual cases first. Do the full unicode only when you
absolutely have to.

                Linus

Theodore Ts'o Nov. 21, 2023, 5:12 a.m. UTC | #14

On Mon, Nov 20, 2023 at 07:03:13PM -0800, Linus Torvalds wrote:
> On Mon, 20 Nov 2023 at 18:29, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > It's a bit complicated, yes. But no, doing things one unicode
> > character at a time is just bad bad bad.
> 
> Put another way: the _point_ of UTF-8 is that ASCII is still ASCII.
> It's literally why UTF-8 doesn't suck.
> 
> So you can still compare ASCII strings as-is.
> 
> No, that doesn't help people who are really using other locales, and
> are actively using complicated characters.
> 
> But it very much does mean that you can compare "Bad" and "bad" and
> never ever look at any unicode translation ever.

Yeah, agreed, that would be a nice optimization.  However, in the
unfortunate case where (a) it's non-ASCII, and (b) the input string is
non-normalized and/or differs in case, we end up scanning some portion
of the two strings twice; once doing the strcmp, and once doing the
Unicode slow path.

That being said, given that even in the case where we're dealing with
non-ASCII strings, in the fairly common case where the program is
doing a readdir() followed by a open() or stat(), the filename will be
byte-identical and so a strcmp() will suffice.

So I agree that it's a nice optimization.  It'd be interesting how
much such an optimization would actually show up in various
benchmarks.  It'd have to be something that was really metadata-heavy,
or else the filenamea lookups would get drowned out.

   	    	      	      	    	- Ted

Gabriel Krisman Bertazi Nov. 22, 2023, 3:30 a.m. UTC | #15

Linus Torvalds <torvalds@linux-foundation.org> writes:

> I dislike case folding with a passion - it's about the worst design
> decision a filesystem can ever do - but the other side of that is that
> if you have to have case folding, the last thing you want to do is to
> have each filesystem deal with that sh*t-for-brains decision itself.

Thanks for pitching in.

We all agree it is a horrible feature that we support for very specific
use cases.  I'm the one who added the code, yes, but I was never
under the illusion it's a good feature.  It solely exists to let Linux
handle the bad decisions made elsewhere.

> So moving more support for case folding into the VFS so that the
> horrid thing at least gets better support is something I'm perfectly
> fine with despite my dislike of it.

Yes. The entire implementation was meant to stay as far away as possible
from the fast lookup path (didn't want to displease Viro).  The negative
dentry is the only exception that required more changes to vfs to
address the issue he found of dangling negative dentries when turning a
directory case-insensitive.

But, fyi, there is work in progress to add support to more filesystems.
This is why I really want to get all of this done first. There is a use
case to enable it in shmem because of containerized environments
running wine; I was recently cc'ed on a bcachefs implementation; and
there are people working on adding it to btrfs (to support wine in
specific products).

> Of course, "do it in shared generic code" doesn't tend to really fix
> the braindamage, but at least it's now shared braindamage and not
> spread out all over. I'm looking at things like
> generic_ci_d_compare(), and it hurts to see the mindless "let's do
> lookups and compares one utf8 character at a time". What a disgrace.
> Somebody either *really* didn't care, or was a Unicode person who
> didn't understand the point of UTF-8.

Yes. I saw the rest of the thread and you are obviously correct here.
It needs to be fixed.  I will follow up with patches.

> The patches look fine to me. Al - do you even care about them?

I saw that Al Viro answered. Thank you, Al. So I'll wait for either his
review or the merge window.

Thanks,

Al Viro Nov. 22, 2023, 9:04 p.m. UTC | #16

On Tue, Nov 21, 2023 at 12:12:15AM -0500, Theodore Ts'o wrote:
> Yeah, agreed, that would be a nice optimization.  However, in the
> unfortunate case where (a) it's non-ASCII, and (b) the input string is
> non-normalized and/or differs in case, we end up scanning some portion
> of the two strings twice; once doing the strcmp, and once doing the
> Unicode slow path.

The first pass is cheap and the second one will be running entirely
from the cache, so I doubt that scanning them twice will be a loss...

Al Viro Nov. 22, 2023, 9:19 p.m. UTC | #17

On Tue, Nov 21, 2023 at 02:27:34AM +0000, Al Viro wrote:

> I will review that series; my impression from the previous iterations
> had been fairly unpleasant, TBH, but I hadn't rechecked since April
> or so.

The serious gap, AFAICS, is the interplay with open-by-fhandle.
It's not unfixable, but we need to figure out what to do when
lookup runs into a disconnected directory alias.  d_splice_alias()
will move it in place, all right, but any state ->lookup() has
hung off the dentry that had been passed to it will be lost.

And I seriously suspect that we want to combine that state
propagation with d_splice_alias() (or its variant to be used in
such cases), rather than fixing the things up afterwards.

In particular, propagating ->d_op is really not trivial at that
point; it is safe to do to ->lookup() argument prior to d_splice_alias()
(even though that's too subtle and brittle, IMO), but after
d_splice_alias() has succeeded, the damn thing is live and can
be hit by hash lookups, revalidate, etc.

The only things that can't happen to it are ->d_delete(), ->d_prune(),
->d_iput() and ->d_init().  Everything else is fair game.

And then there's an interesting question about the interplay with
reparenting.  It's OK to return an error rather than reparent,
but we need some way to tell if we need to do so.

Linus Torvalds Nov. 23, 2023, 12:18 a.m. UTC | #18

On Wed, 22 Nov 2023 at 13:19, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> The serious gap, AFAICS, is the interplay with open-by-fhandle.

So I'm obviously not a fan of igncase filesystems, but I don't think
this series actually changes any of that.

> It's not unfixable, but we need to figure out what to do when
> lookup runs into a disconnected directory alias.  d_splice_alias()
> will move it in place, all right, but any state ->lookup() has
> hung off the dentry that had been passed to it will be lost.

I guess this migth be about the new DCACHE_CASEFOLDED_NAME bit.

At least for now, that is only used by generic_ci_d_revalidate() for
negative dentries, so it shouldn't matter for that d_splice_alias()
that only does positive dentries. No?

Or is there something else you worry about?

Side note: Gabriel, as things are now, instead of that

        if (!d_is_casefolded_name(dentry))
                return 0;

in generic_ci_d_revalidate(), I would suggest that any time a
directory is turned into a case-folded one, you'd just walk all the
dentries for that directory and invalidate negative ones at that
point. Or was there some reason I missed that made it a good idea to
do it at run-time after-the-fact?

             Linus

Al Viro Nov. 23, 2023, 1:12 a.m. UTC | #19

On Wed, Nov 22, 2023 at 09:19:01PM +0000, Al Viro wrote:
> On Tue, Nov 21, 2023 at 02:27:34AM +0000, Al Viro wrote:
> 
> > I will review that series; my impression from the previous iterations
> > had been fairly unpleasant, TBH, but I hadn't rechecked since April
> > or so.
> 
> The serious gap, AFAICS, is the interplay with open-by-fhandle.
> It's not unfixable, but we need to figure out what to do when
> lookup runs into a disconnected directory alias.  d_splice_alias()
> will move it in place, all right, but any state ->lookup() has
> hung off the dentry that had been passed to it will be lost.
> 
> And I seriously suspect that we want to combine that state
> propagation with d_splice_alias() (or its variant to be used in
> such cases), rather than fixing the things up afterwards.
> 
> In particular, propagating ->d_op is really not trivial at that
> point; it is safe to do to ->lookup() argument prior to d_splice_alias()
> (even though that's too subtle and brittle, IMO), but after
> d_splice_alias() has succeeded, the damn thing is live and can
> be hit by hash lookups, revalidate, etc.
> 
> The only things that can't happen to it are ->d_delete(), ->d_prune(),
> ->d_iput() and ->d_init().  Everything else is fair game.
> 
> And then there's an interesting question about the interplay with
> reparenting.  It's OK to return an error rather than reparent,
> but we need some way to tell if we need to do so.

Hmm... int (*d_transfer)(struct dentry *alias, struct dentry *new)?
Called if d_splice_alias() picks that sucker, under rename_lock,
before the call of __d_move().  Can check IS_ROOT(alias) (due to
rename_lock), so can tell attaching from reparenting, returning
an error - failed d_splice_alias().

Perhaps would be even better inside __d_move(), once all ->d_lock
are taken...  Turn the current bool exchange in there into honest
enum (exchange/move/splice) and call ->d_transfer() on splice.
In case of failure it's still not too late to back out - __d_move()
would return an int, ignored in d_move() and d_exchange() and
treated as "fail in unlikely case it's non-zero" in d_splice_alias()
and __d_unalias()...

Comments?  Note that e.g.
        res = d_splice_alias(inode, dentry);
        if (!IS_ERR(fid)) {
                if (!res)
                        v9fs_fid_add(dentry, &fid);
                else if (!IS_ERR(res))
                        v9fs_fid_add(res, &fid);
                else
                        p9_fid_put(fid);
        }

in 9p ->lookup() would turn into

	v9fs_fid_add(dentry, &fid);
        return d_splice_alias(inode, dentry);

with ->d_transfer(alias, new) being simply

	struct hlist_node *p = new->d_fsdata;
	hlist_del_init(p);
	__add_fid(alias, hlist_entry(p, struct p9_fid, dlist));
	return 0;

assuming the call from __d_move()...

Al Viro Nov. 23, 2023, 1:22 a.m. UTC | #20

On Thu, Nov 23, 2023 at 01:12:08AM +0000, Al Viro wrote:
> On Wed, Nov 22, 2023 at 09:19:01PM +0000, Al Viro wrote:
> > On Tue, Nov 21, 2023 at 02:27:34AM +0000, Al Viro wrote:
> > 
> > > I will review that series; my impression from the previous iterations
> > > had been fairly unpleasant, TBH, but I hadn't rechecked since April
> > > or so.
> > 
> > The serious gap, AFAICS, is the interplay with open-by-fhandle.
> > It's not unfixable, but we need to figure out what to do when
> > lookup runs into a disconnected directory alias.  d_splice_alias()
> > will move it in place, all right, but any state ->lookup() has
> > hung off the dentry that had been passed to it will be lost.
> > 
> > And I seriously suspect that we want to combine that state
> > propagation with d_splice_alias() (or its variant to be used in
> > such cases), rather than fixing the things up afterwards.
> > 
> > In particular, propagating ->d_op is really not trivial at that
> > point; it is safe to do to ->lookup() argument prior to d_splice_alias()
> > (even though that's too subtle and brittle, IMO), but after
> > d_splice_alias() has succeeded, the damn thing is live and can
> > be hit by hash lookups, revalidate, etc.
> > 
> > The only things that can't happen to it are ->d_delete(), ->d_prune(),
> > ->d_iput() and ->d_init().  Everything else is fair game.
> > 
> > And then there's an interesting question about the interplay with
> > reparenting.  It's OK to return an error rather than reparent,
> > but we need some way to tell if we need to do so.
> 
> Hmm... int (*d_transfer)(struct dentry *alias, struct dentry *new)?
> Called if d_splice_alias() picks that sucker, under rename_lock,
> before the call of __d_move().  Can check IS_ROOT(alias) (due to
> rename_lock), so can tell attaching from reparenting, returning
> an error - failed d_splice_alias().
> 
> Perhaps would be even better inside __d_move(), once all ->d_lock
> are taken...  Turn the current bool exchange in there into honest
> enum (exchange/move/splice) and call ->d_transfer() on splice.
> In case of failure it's still not too late to back out - __d_move()
> would return an int, ignored in d_move() and d_exchange() and
> treated as "fail in unlikely case it's non-zero" in d_splice_alias()
> and __d_unalias()...
> 
> Comments?  Note that e.g.
>         res = d_splice_alias(inode, dentry);
>         if (!IS_ERR(fid)) {
>                 if (!res)
>                         v9fs_fid_add(dentry, &fid);
>                 else if (!IS_ERR(res))
>                         v9fs_fid_add(res, &fid);
>                 else
>                         p9_fid_put(fid);
>         }
> 
> in 9p ->lookup() would turn into
> 
> 	v9fs_fid_add(dentry, &fid);
>         return d_splice_alias(inode, dentry);
> 
> with ->d_transfer(alias, new) being simply
> 
> 	struct hlist_node *p = new->d_fsdata;
> 	hlist_del_init(p);
> 	__add_fid(alias, hlist_entry(p, struct p9_fid, dlist));
> 	return 0;
> 
> assuming the call from __d_move()...

Incidentally, 9p and this one would not be the only places that could use it -
affs - alias->d_fsdata = new->d_fsdata
afs - ditto
ocfs2 - smells like another possible benefitiary (attaching locks, etc.
would be saner if done before d_splice_alias(), with ->d_transfer()
moving the lock to the alias)...

Al Viro Nov. 23, 2023, 5:09 a.m. UTC | #21

On Wed, Nov 22, 2023 at 04:18:56PM -0800, Linus Torvalds wrote:
> On Wed, 22 Nov 2023 at 13:19, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > The serious gap, AFAICS, is the interplay with open-by-fhandle.
> 
> So I'm obviously not a fan of igncase filesystems, but I don't think
> this series actually changes any of that.
> 
> > It's not unfixable, but we need to figure out what to do when
> > lookup runs into a disconnected directory alias.  d_splice_alias()
> > will move it in place, all right, but any state ->lookup() has
> > hung off the dentry that had been passed to it will be lost.
> 
> I guess this migth be about the new DCACHE_CASEFOLDED_NAME bit.
> 
> At least for now, that is only used by generic_ci_d_revalidate() for
> negative dentries, so it shouldn't matter for that d_splice_alias()
> that only does positive dentries. No?
> 
> Or is there something else you worry about?

Dentries created by d_obtain_alias() will never go anywhere near
generic_set_encrypted_ci_d_ops().  They do *not* get ->d_op set
that way.  When ext4_lookup() does a lookup in c-i directory it
does have ->d_op set on dentry it got from the caller.  Which is
promptly discarded when d_splice_alias() finds a preexisting
alias for it.

Positive dentries eventually become negative; not invalidating them
when that happens is a large part of the point of this series.
->d_revalidate() is taught to check if they are marked with that
bit, but... they won't have that ->d_revalidate() associated with
them, will they?  ->d_hash() and ->d_compare() come from the
parent, but ->d_revalidate() comes from dentry itself.

In other words, generic_ci_d_revalidate() won't see the lack of
that bit on dentry, etc. - it won't be called for that dentry
in the first place.

Gabriel Krisman Bertazi Nov. 23, 2023, 3:57 p.m. UTC | #22

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Side note: Gabriel, as things are now, instead of that
>
>         if (!d_is_casefolded_name(dentry))
>                 return 0;
>
> in generic_ci_d_revalidate(), I would suggest that any time a
> directory is turned into a case-folded one, you'd just walk all the
> dentries for that directory and invalidate negative ones at that
> point. Or was there some reason I missed that made it a good idea to
> do it at run-time after-the-fact?
>

The problem I found with that approach, which I originally tried, was
preventing concurrent lookups from racing with the invalidation and
creating more 'case-sensitive' negative dentries.  Did I miss a way to
synchronize with concurrent lookups of the children of the dentry?  We
can trivially ensure the dentry doesn't have positive children by
holding the parent lock, but that doesn't protect from concurrent
lookups creating negative dentries, as far as I understand.

Linus Torvalds Nov. 23, 2023, 4:41 p.m. UTC | #23

On Thu, 23 Nov 2023 at 07:57, Gabriel Krisman Bertazi
<gabriel@krisman.be> wrote:
>
> The problem I found with that approach, which I originally tried, was
> preventing concurrent lookups from racing with the invalidation and
> creating more 'case-sensitive' negative dentries.  Did I miss a way to
> synchronize with concurrent lookups of the children of the dentry?  We
> can trivially ensure the dentry doesn't have positive children by
> holding the parent lock, but that doesn't protect from concurrent
> lookups creating negative dentries, as far as I understand.

I'd just set the "casefolded" bit, then do a RCU grace period wait,
and then invalidate all old negative dentries.

Sure, there's technically a window there where somebody could hit an
existing negative dentry that matches a casefolded name after
casefolded has been set (but before the invalidation) and the lookup
would result in a "does not exist" lookup that way.

But that seems no different from the lookup having been done before
the casefolded bit got set, so I don't think that's an _actual_
difference. If you do a lookup concurrently with the directory being
set casefolded, you get one or the other.

And no, I haven't thought about this a ton, but it seems the obvious
thing to do. Temporary stale negative dentries while the casefolded
bit is in the process of being set seems a harmless thing, exactly
because they would seem to be the same thing as if the lookup was done
before...

And yes, that "wait for RCU grace period" is a somewhat slow
operation, but how often do the casefolded bits get changed?

This is not a huge deal. I don't hate your approach, I just found it surprising.

                 Linus

Al Viro Nov. 23, 2023, 5:12 p.m. UTC | #24

On Thu, Nov 23, 2023 at 10:57:22AM -0500, Gabriel Krisman Bertazi wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > Side note: Gabriel, as things are now, instead of that
> >
> >         if (!d_is_casefolded_name(dentry))
> >                 return 0;
> >
> > in generic_ci_d_revalidate(), I would suggest that any time a
> > directory is turned into a case-folded one, you'd just walk all the
> > dentries for that directory and invalidate negative ones at that
> > point. Or was there some reason I missed that made it a good idea to
> > do it at run-time after-the-fact?
> >
> 
> The problem I found with that approach, which I originally tried, was
> preventing concurrent lookups from racing with the invalidation and
> creating more 'case-sensitive' negative dentries.  Did I miss a way to
> synchronize with concurrent lookups of the children of the dentry?  We
> can trivially ensure the dentry doesn't have positive children by
> holding the parent lock, but that doesn't protect from concurrent
> lookups creating negative dentries, as far as I understand.

AFAICS, there is a problem with dentries that never came through
->lookup().  Unless I'm completely misreading your code, your
generic_ci_d_revalidate() is not called for them.  Ever.

Hash lookups are controlled by ->d_op of parent; that's where ->d_hash()
and ->d_compare() come from.  Revalidate comes from *child*.  You need
->d_op->d_revalidate of child dentry to be set to your generic_ci_d_revalidate().

The place where it gets set is generic_set_encrypted_ci_d_ops().  Look
at its callchain; in case of ext4 it gets called from ext4_lookup_dentry(),
which is called from ext4_lookup().  And dentry passed to it is the
argument of ->lookup().

Now take a look at open-by-fhandle stuff; all methods in there
(->fh_to_dentry(), ->fh_to_parent(), ->get_parent()) end up
returning d_obtain_alias(some inode).

We *do* call ->lookup(), all right - in reconnect_one(), while
trying to connect those suckers with the main tree.  But the way
it works is that d_splice_alias() in ext4_lookup() moves the
existing alias for subdirectory, connecting it to the parent.
That's not the dentry ext4_lookup() had set ->d_op on - that's
the dentry that came from d_obtain_alias().  And those do not
have ->d_op set by anything in your tree.

That's the problem I'd been talking about - there is a class of situations
where the work done by ext4_lookup() to set the state of dentry gets
completely lost.  After lookup you do have a dentry in the right place,
with the right name and inode, etc., but with NULL ->d_op->d_revalidate.

Gabriel Krisman Bertazi Nov. 23, 2023, 5:37 p.m. UTC | #25

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Thu, Nov 23, 2023 at 10:57:22AM -0500, Gabriel Krisman Bertazi wrote:
>> Linus Torvalds <torvalds@linux-foundation.org> writes:
>> 
>> > Side note: Gabriel, as things are now, instead of that
>> >
>> >         if (!d_is_casefolded_name(dentry))
>> >                 return 0;
>> >
>> > in generic_ci_d_revalidate(), I would suggest that any time a
>> > directory is turned into a case-folded one, you'd just walk all the
>> > dentries for that directory and invalidate negative ones at that
>> > point. Or was there some reason I missed that made it a good idea to
>> > do it at run-time after-the-fact?
>> >
>> 
>> The problem I found with that approach, which I originally tried, was
>> preventing concurrent lookups from racing with the invalidation and
>> creating more 'case-sensitive' negative dentries.  Did I miss a way to
>> synchronize with concurrent lookups of the children of the dentry?  We
>> can trivially ensure the dentry doesn't have positive children by
>> holding the parent lock, but that doesn't protect from concurrent
>> lookups creating negative dentries, as far as I understand.
>
> AFAICS, there is a problem with dentries that never came through
> ->lookup().  Unless I'm completely misreading your code, your
> generic_ci_d_revalidate() is not called for them.  Ever.
>
> Hash lookups are controlled by ->d_op of parent; that's where ->d_hash()
> and ->d_compare() come from.  Revalidate comes from *child*.  You need
> ->d_op->d_revalidate of child dentry to be set to your generic_ci_d_revalidate().
>
> The place where it gets set is generic_set_encrypted_ci_d_ops().  Look
> at its callchain; in case of ext4 it gets called from ext4_lookup_dentry(),
> which is called from ext4_lookup().  And dentry passed to it is the
> argument of ->lookup().
>
> Now take a look at open-by-fhandle stuff; all methods in there
> (->fh_to_dentry(), ->fh_to_parent(), ->get_parent()) end up
> returning d_obtain_alias(some inode).
>
> We *do* call ->lookup(), all right - in reconnect_one(), while
> trying to connect those suckers with the main tree.  But the way
> it works is that d_splice_alias() in ext4_lookup() moves the
> existing alias for subdirectory, connecting it to the parent.
> That's not the dentry ext4_lookup() had set ->d_op on - that's
> the dentry that came from d_obtain_alias().  And those do not
> have ->d_op set by anything in your tree.
>
> That's the problem I'd been talking about - there is a class of situations
> where the work done by ext4_lookup() to set the state of dentry gets
> completely lost.  After lookup you do have a dentry in the right place,
> with the right name and inode, etc., but with NULL
> ->d_op->d_revalidate.

I get the problem now. I admit to not understanding all the details yet,
which is why I haven't answered directly, but I understand already how
it can get borked.  I'm studying your explanation.

Originally, ->d_op could be propagated trivially since we had sb->s_d_op
set, which would be set by __d_alloc, but that is no longer the case
since we combined fscrypt and CI support.

What I still don't understand is why we shouldn't fixup ->d_op when
calling d_obtain_alias (before __d_instantiate_anon) and you say we
better do it in d_splice_alias.  The ->d_op is going to be the same
across the filesystem when the casefold feature is enabled, regardless
if the directory is casefolded.  If we set it there, the alias already
has the right d_op from the start.

Al Viro Nov. 23, 2023, 6:24 p.m. UTC | #26

On Thu, Nov 23, 2023 at 12:37:43PM -0500, Gabriel Krisman Bertazi wrote:
> > That's the problem I'd been talking about - there is a class of situations
> > where the work done by ext4_lookup() to set the state of dentry gets
> > completely lost.  After lookup you do have a dentry in the right place,
> > with the right name and inode, etc., but with NULL
> > ->d_op->d_revalidate.
> 
> I get the problem now. I admit to not understanding all the details yet,
> which is why I haven't answered directly, but I understand already how
> it can get borked.  I'm studying your explanation.
> 
> Originally, ->d_op could be propagated trivially since we had sb->s_d_op
> set, which would be set by __d_alloc, but that is no longer the case
> since we combined fscrypt and CI support.
>
> What I still don't understand is why we shouldn't fixup ->d_op when
> calling d_obtain_alias (before __d_instantiate_anon) and you say we
> better do it in d_splice_alias.  The ->d_op is going to be the same
> across the filesystem when the casefold feature is enabled, regardless
> if the directory is casefolded.  If we set it there, the alias already
> has the right d_op from the start.

*blink*

A paragraph above you've said that it's not constant over the entire
filesystem.

Look, it's really simple - any setup work of that sort done in ->lookup()
is either misplaced, or should be somehow transferred over to the alias
if one gets picked.

As for d_obtain_alias()... AFAICS, it's far more limited in what information
it could access.  It knows the inode, but it has no idea about the parent
to be.

The more I look at that, the more it feels like we need a method that would
tell the filesystem that this dentry is about to be spliced here.  9p is
another place where it would obviously simplify the things; ocfs2 'attach
lock' stuff is another case where the things get much more complicated
by having to do that stuff after splicing, etc.

It's not even hard to do:

1. turn bool exchange in __d_move() arguments into 3-value thing - move,
exchange or splice.  Have the callers in d_splice_alias() and __d_unalias()
pass "splice" instead of false (aka normal move).

2. make __d_move() return an int (normally 0)

3. if asked to splice and if there's target->d_op->d_transfer(), let
__d_move() call it right after
        spin_lock_nested(&dentry->d_lock, 2);
	spin_lock_nested(&target->d_lock, 3);
in there.  Passing it target and dentry, obviously.  In unlikely case
of getting a non-zero returned by the method, undo locks and return
that value to __d_move() caller.

4. d_move() and d_exchange() would ignore the value returned by __d_move();
__d_unalias() turn
        __d_move(alias, dentry, false);
	ret = 0;
into
	ret = __d_move(alias, dentry, Splice);
d_splice_alias() turn
				__d_move(new, dentry, false);
				write_sequnlock(&rename_lock);
into
				err = __d_move(new, dentry, Splice);
				write_sequnlock(&rename_lock);
				if (unlikely(err)) {
					dput(new);
					new = ERR_PTR(err);
				}
(actually, dput()-on-error part would be common to all 3 branches
in there, so it would probably get pulled out of that if-else if-else).

I can cook a patch doing that (and convert the obvious beneficiaries already
in the tree to it) and throw it into dcache branch - just need to massage
the series in there for repost...

PS: note, BTW, that fscrypt folks have already placed a hook into
__d_move(), exactly for the case of splice; I wonder if that would be
foldable into the same mechanism - hadn't looked in details yet.

Gabriel Krisman Bertazi Nov. 23, 2023, 7:06 p.m. UTC | #27

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Thu, Nov 23, 2023 at 12:37:43PM -0500, Gabriel Krisman Bertazi wrote:
>> > That's the problem I'd been talking about - there is a class of situations
>> > where the work done by ext4_lookup() to set the state of dentry gets
>> > completely lost.  After lookup you do have a dentry in the right place,
>> > with the right name and inode, etc., but with NULL
>> > ->d_op->d_revalidate.
>> 
>> I get the problem now. I admit to not understanding all the details yet,
>> which is why I haven't answered directly, but I understand already how
>> it can get borked.  I'm studying your explanation.
>> 
>> Originally, ->d_op could be propagated trivially since we had sb->s_d_op
>> set, which would be set by __d_alloc, but that is no longer the case
>> since we combined fscrypt and CI support.
>>
>> What I still don't understand is why we shouldn't fixup ->d_op when
>> calling d_obtain_alias (before __d_instantiate_anon) and you say we
>> better do it in d_splice_alias.  The ->d_op is going to be the same
>> across the filesystem when the casefold feature is enabled, regardless
>> if the directory is casefolded.  If we set it there, the alias already
>> has the right d_op from the start.
>
> *blink*
>
> A paragraph above you've said that it's not constant over the entire
> filesystem.

The same ->d_op is used by every dentry in the filesystem if the superblock
has the casefold bit enabled, regardless of whether a specific inode is
casefolded or not. See generic_set_encrypted_ci_d_ops in my tree. It is
called unconditionally by ext4_lookup and only checks the superblock:

void generic_set_encrypted_ci_d_ops(struct dentry *dentry)
{
        if (dentry->d_sb->s_encoding) {
		d_set_d_op(dentry, &generic_encrypted_ci_dentry_ops);
		return;
	}
        ...

What I meant was that this used to be set once at sb->s_d_op, and
propagated during dentry allocation.  Therefore, the propagation to the
alias would happen inside __d_alloc.  Once we enabled fscrypt and
casefold to work together, sb->s_d_op is NULL and we always set the same
handler for every dentry during lookup.

> Look, it's really simple - any setup work of that sort done in ->lookup()
> is either misplaced, or should be somehow transferred over to the alias
> if one gets picked.
>
> As for d_obtain_alias()... AFAICS, it's far more limited in what information
> it could access.  It knows the inode, but it has no idea about the parent
> to be.

Since it has the inode, d_obtain_alias has the superblock.  I think that's all
we need for generic_set_encrypted_ci_d_ops.

> The more I look at that, the more it feels like we need a method that would
> tell the filesystem that this dentry is about to be spliced here.  9p is
> another place where it would obviously simplify the things; ocfs2 'attach
> lock' stuff is another case where the things get much more complicated
> by having to do that stuff after splicing, etc.
>
> It's not even hard to do:
>
> 1. turn bool exchange in __d_move() arguments into 3-value thing - move,
> exchange or splice.  Have the callers in d_splice_alias() and __d_unalias()
> pass "splice" instead of false (aka normal move).
>
> 2. make __d_move() return an int (normally 0)
>
> 3. if asked to splice and if there's target->d_op->d_transfer(), let
> __d_move() call it right after
>         spin_lock_nested(&dentry->d_lock, 2);
> 	spin_lock_nested(&target->d_lock, 3);
> in there.  Passing it target and dentry, obviously.  In unlikely case
> of getting a non-zero returned by the method, undo locks and return
> that value to __d_move() caller.
>
> 4. d_move() and d_exchange() would ignore the value returned by __d_move();
> __d_unalias() turn
>         __d_move(alias, dentry, false);
> 	ret = 0;
> into
> 	ret = __d_move(alias, dentry, Splice);
> d_splice_alias() turn
> 				__d_move(new, dentry, false);
> 				write_sequnlock(&rename_lock);
> into
> 				err = __d_move(new, dentry, Splice);
> 				write_sequnlock(&rename_lock);
> 				if (unlikely(err)) {
> 					dput(new);
> 					new = ERR_PTR(err);
> 				}
> (actually, dput()-on-error part would be common to all 3 branches
> in there, so it would probably get pulled out of that if-else if-else).
>
> I can cook a patch doing that (and convert the obvious beneficiaries already
> in the tree to it) and throw it into dcache branch - just need to massage
> the series in there for repost...

if you can write that, I'll definitely appreciate it. It will surely
take me much longer to figure it out myself.

Al Viro Nov. 23, 2023, 7:53 p.m. UTC | #28

On Thu, Nov 23, 2023 at 02:06:39PM -0500, Gabriel Krisman Bertazi wrote:

> > A paragraph above you've said that it's not constant over the entire
> > filesystem.
> 
> The same ->d_op is used by every dentry in the filesystem if the superblock
> has the casefold bit enabled, regardless of whether a specific inode is
> casefolded or not. See generic_set_encrypted_ci_d_ops in my tree. It is
> called unconditionally by ext4_lookup and only checks the superblock:
> 
> void generic_set_encrypted_ci_d_ops(struct dentry *dentry)
> {
>         if (dentry->d_sb->s_encoding) {
> 		d_set_d_op(dentry, &generic_encrypted_ci_dentry_ops);
> 		return;
> 	}
>         ...
> 
> What I meant was that this used to be set once at sb->s_d_op, and
> propagated during dentry allocation.  Therefore, the propagation to the
> alias would happen inside __d_alloc.  Once we enabled fscrypt and
> casefold to work together, sb->s_d_op is NULL

Why?  That's what I don't understand - if you really want it for
all dentries on that filesystem, that's what ->s_d_op is for.
If it is not, you have that problem, no matter which way you flip ->d_op
value.

> and we always set the same
> handler for every dentry during lookup.

Not every dentry goes through lookup - see upthread for details.

> > Look, it's really simple - any setup work of that sort done in ->lookup()
> > is either misplaced, or should be somehow transferred over to the alias
> > if one gets picked.
> >
> > As for d_obtain_alias()... AFAICS, it's far more limited in what information
> > it could access.  It knows the inode, but it has no idea about the parent
> > to be.
> 
> Since it has the inode, d_obtain_alias has the superblock.  I think that's all
> we need for generic_set_encrypted_ci_d_ops.

Huh?  If it really depends only upon the superblock, just set it in ->s_d_op
when you set the superblock up.

Again, whatever setup you do for dentry in ->lookup(), you either
	* have a filesystem that never picks an existing directory alias
(e.g. doesn't allow open-by-fhandle or has a very unusual implementation
of related methods, like e.g. shmem), or
	* have that setup misplaced, in part that applies to all dentries out
there (->s_d_op for universal ->d_op value, ->d_init() for uniform allocation
of objects hanging from ->d_fsdata and other things like that), or
	* need to figure out how to transfer the result to alias (manually
after d_splice_alias(), if races do not matter or using a new method explicitly
for that), or
	* lose that state for aliases.

Al Viro Nov. 23, 2023, 8:15 p.m. UTC | #29

On Thu, Nov 23, 2023 at 07:53:27PM +0000, Al Viro wrote:

> Huh?  If it really depends only upon the superblock, just set it in ->s_d_op
> when you set the superblock up.
> 
> Again, whatever setup you do for dentry in ->lookup(), you either
> 	* have a filesystem that never picks an existing directory alias
> (e.g. doesn't allow open-by-fhandle or has a very unusual implementation
> of related methods, like e.g. shmem), or
> 	* have that setup misplaced, in part that applies to all dentries out
> there (->s_d_op for universal ->d_op value, ->d_init() for uniform allocation
> of objects hanging from ->d_fsdata and other things like that), or
> 	* need to figure out how to transfer the result to alias (manually
> after d_splice_alias(), if races do not matter or using a new method explicitly
> for that), or
> 	* lose that state for aliases.

Note, BTW, that fscrypt tries to be very special in its handling of that
stuff - see fscrypt_handle_d_move() thing and comments in front of its
definition.  Then take a look at the place where it's called.

BTW, it looks like it's broken, since it discounts the possibility of splice
caused by lookup on no-key name.  You get DCACHE_NOKEY_NAME removed unconditionally
there, no-key or not.

It's not impossible that the boilerplate around the fscrypt_has_permitted_context()
calls in fscrypt-enabled ->lookup() instances somehow prevents those, but if so,
it's not obvious from the causual look.

Al Viro Nov. 23, 2023, 9:52 p.m. UTC | #30

On Thu, Nov 23, 2023 at 02:06:39PM -0500, Gabriel Krisman Bertazi wrote:

> >
> > 4. d_move() and d_exchange() would ignore the value returned by __d_move();
> > __d_unalias() turn
> >         __d_move(alias, dentry, false);
> > 	ret = 0;
> > into
> > 	ret = __d_move(alias, dentry, Splice);
> > d_splice_alias() turn
> > 				__d_move(new, dentry, false);
> > 				write_sequnlock(&rename_lock);
> > into
> > 				err = __d_move(new, dentry, Splice);
> > 				write_sequnlock(&rename_lock);
> > 				if (unlikely(err)) {
> > 					dput(new);
> > 					new = ERR_PTR(err);
> > 				}
> > (actually, dput()-on-error part would be common to all 3 branches
> > in there, so it would probably get pulled out of that if-else if-else).
> >
> > I can cook a patch doing that (and convert the obvious beneficiaries already
> > in the tree to it) and throw it into dcache branch - just need to massage
> > the series in there for repost...
> 
> if you can write that, I'll definitely appreciate it. It will surely
> take me much longer to figure it out myself.

Speaking of other stuff in the series - passing the expected name to
->d_revalidate() is definitely the right thing to do, for a lot of
other reasons.  We do have ->d_name UAF issues in ->d_revalidate()
instances, and that allows to solve them nicely.

It's self-contained (your 2/9 and 3/9), so I'm going to grab that
into a never-rebased branch, just to be able to base the followups
propagating the use of stable name into instances.

Anyway, need to finish writing up the description of existing dcache
series first...

Gabriel Krisman Bertazi Nov. 24, 2023, 3:20 p.m. UTC | #31

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Thu, Nov 23, 2023 at 02:06:39PM -0500, Gabriel Krisman Bertazi wrote:
>
>> > A paragraph above you've said that it's not constant over the entire
>> > filesystem.
>> 
>> The same ->d_op is used by every dentry in the filesystem if the superblock
>> has the casefold bit enabled, regardless of whether a specific inode is
>> casefolded or not. See generic_set_encrypted_ci_d_ops in my tree. It is
>> called unconditionally by ext4_lookup and only checks the superblock:
>> 
>> void generic_set_encrypted_ci_d_ops(struct dentry *dentry)
>> {
>>         if (dentry->d_sb->s_encoding) {
>> 		d_set_d_op(dentry, &generic_encrypted_ci_dentry_ops);
>> 		return;
>> 	}
>>         ...
>> 
>> What I meant was that this used to be set once at sb->s_d_op, and
>> propagated during dentry allocation.  Therefore, the propagation to the
>> alias would happen inside __d_alloc.  Once we enabled fscrypt and
>> casefold to work together, sb->s_d_op is NULL
>
> Why?  That's what I don't understand - if you really want it for
> all dentries on that filesystem, that's what ->s_d_op is for.
> If it is not, you have that problem, no matter which way you flip ->d_op
> value.

I'm not sure why it changed.  I'm guessing that, since it doesn't make
sense to set fscrypt_d_revalidate for every dentry in the
!case-insensitive case, they just kept the same behavior for
case-insensitive+fscrypt.  This is what I get from looking at the git
history.

I will get a new series reverting to use ->s_d_op, folding the
dentry_cmp behavior you mentioned, and based on what you merge in your
branch.

>> and we always set the same
>> handler for every dentry during lookup.
>
> Not every dentry goes through lookup - see upthread for details.

Yes, I got that already.  This should be "we always set the same handler
for every dentry that goes through lookup and bork whatever doesn't come
through lookup."

Gabriel Krisman Bertazi Nov. 24, 2023, 3:22 p.m. UTC | #32

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Thu, Nov 23, 2023 at 02:06:39PM -0500, Gabriel Krisman Bertazi wrote:
>
>> >
>> > 4. d_move() and d_exchange() would ignore the value returned by __d_move();
>> > __d_unalias() turn
>> >         __d_move(alias, dentry, false);
>> > 	ret = 0;
>> > into
>> > 	ret = __d_move(alias, dentry, Splice);
>> > d_splice_alias() turn
>> > 				__d_move(new, dentry, false);
>> > 				write_sequnlock(&rename_lock);
>> > into
>> > 				err = __d_move(new, dentry, Splice);
>> > 				write_sequnlock(&rename_lock);
>> > 				if (unlikely(err)) {
>> > 					dput(new);
>> > 					new = ERR_PTR(err);
>> > 				}
>> > (actually, dput()-on-error part would be common to all 3 branches
>> > in there, so it would probably get pulled out of that if-else if-else).
>> >
>> > I can cook a patch doing that (and convert the obvious beneficiaries already
>> > in the tree to it) and throw it into dcache branch - just need to massage
>> > the series in there for repost...
>> 
>> if you can write that, I'll definitely appreciate it. It will surely
>> take me much longer to figure it out myself.
>
> Speaking of other stuff in the series - passing the expected name to
> ->d_revalidate() is definitely the right thing to do, for a lot of
> other reasons.  We do have ->d_name UAF issues in ->d_revalidate()
> instances, and that allows to solve them nicely.
>
> It's self-contained (your 2/9 and 3/9), so I'm going to grab that
> into a never-rebased branch, just to be able to base the followups
> propagating the use of stable name into instances.

ack. I'll base the other changes we discussed on top of your branch.

thanks,

Al Viro Nov. 25, 2023, 10:01 p.m. UTC | #33

On Fri, Nov 24, 2023 at 10:22:49AM -0500, Gabriel Krisman Bertazi wrote:

> ack. I'll base the other changes we discussed on top of your branch.

Rebased to v6.7-rc1, fixed up (ceph calls fscrypt_d_revalidate() directly,
and D/f/porting entry had been missing), pushed out as #no-rebase-d_revalidate

Al Viro Nov. 26, 2023, 4:52 a.m. UTC | #34

On Sat, Nov 25, 2023 at 10:01:36PM +0000, Al Viro wrote:
> On Fri, Nov 24, 2023 at 10:22:49AM -0500, Gabriel Krisman Bertazi wrote:
> 
> > ack. I'll base the other changes we discussed on top of your branch.
> 
> Rebased to v6.7-rc1, fixed up (ceph calls fscrypt_d_revalidate() directly,
> and D/f/porting entry had been missing), pushed out as #no-rebase-d_revalidate

FWIW, ->d_revalidate() has an old unpleasant problem we might try to solve
now.

In non-RCU mode We treat 0 as "invalidate that sucker and do a fresh lookup".
Fine, except that there's a possibility of race here - we'd hit ->d_revalidate()
while another thread was renaming object in question.  Or has just found it
by doing lookup in a place where it had been moved on server.

->d_revalidate() decides that it needs to be looked up on server and forms
a request before rename succeeds.  So NFS (e.g.) request goes out with the
old parent and name.  By the time server sees it, RENAME has been processed
and succeeded.  There's no such file in the old place anymore.

So ->d_revalidate() returns 0... and we proceed to invalidate the dentry.
Which had been moved to *new* place by now.  In that place it's perfectly
valid and does not deserve invalidation.

Scenario when rename had been done not from this client is even worse:

server:/srv/nfs/foo is mounted on /mnt/foo
we state /mnt/foo/bar
/mnt/foo/bar is in dcache
somebody on server renames /srv/nfs/foo/bar to /srv/nfs/foo/barf
process A: stat /mnt/foo/bar/baz.
process B: mount something on /mnt/foo/barf/
process B: no /mnt/foo/barf in dcache, let's look it up
	   found fhandle of /mnt/foo
	   sent LOOKUP "barf" in it
	   got an fhandle and found it matching the inode of /mnt/foo/bar
process A: has reached /mnt/foo/bar and decided to revalidate it.
	   found fhandle of /mnt/foo
	   sent a LOOKUP "bar" in that
	   got "nothing with that name there"
	   ->d_revalidate() returns 0
	   loses CPU
process B: splice the dentry of /mnt/foo/bar to /mnt/foo/barf
	   proceed to mount on top of it
process A: gets CPU back
	   calls d_invalidate() on the dentry that now is /mnt/foo/barf
	   dissolves the mount created by process B

Note that server:/srv/nfs/foo/barf has been there and perfectly valid
since before B has started doing anything.  It has no idea that the
damn thing used to be in a different place and something on the same
client had seen it at the old place once upon a time.  As far as it is
concerned, mount has succeeded and then quietly disappeared.  The mountpoint
is still there - with freshly looked up dentry, since the old one had been
invalidated, but userland doesn't see that, so... WTF?

It's not easy to hit, but I'd expect it to be feasible on SMP KVM, where instead
of A losing CPU we might've had the virtual CPU losing the timeslice on host.

IMO we should only do d_invalidate() if
	* ->d_revalidate() has returned 0
	* dentry is still hashed, still has the same parent and still matches
the name from ->d_compare() POV.
If it doesn't, we should just leave it whereever it has been moved to and
act as if we hadn't seen it in the first place.

In other words, have
d_revalidate(dentry, parent, name, flags) doing the following:
	if no ->d_revalidate
		return 1
	ret = ->d_revalidate(...)
	if (unlikely(ret == 0) && !(flags & LOOKUP_RCU)) {
		spin_lock(&dentry->d_lock);
		if (!d_same_name(dentry, parent, name))
			spin_lock(&dentry->d_lock);
		else
			d_invalidate_locked(dentry);
	}
	return ret

where d_invalidate_locked() would be d_invalidate() sans the initial
spin_lock(&dentry->d_lock);

That would solve that problem, AFAICS.  Objections, anyone?  I'm too
sleepy to put together a patch at the moment, will post after I get
some sleep...

PS: as the matter of fact, it might be a good idea to pass the parent
as explicit argument to ->d_revalidate(), now that we are passing the
name as well.  Look at the boilerplate in the instances; all that
        parent = READ_ONCE(dentry->d_parent);
	dir = d_inode_rcu(parent);
	if (!dir)
		return -ECHILD;
	...
on the RCU side combined with
	parent = dget_parent(dentry);
	dir = d_inode(parent);
	...
	dput(dir);
stuff.

It's needed only because the caller had not told us which directory
is that thing supposed to be in; in non-RCU mode the parent is
explicitly pinned down, no need to play those games.  All we need
is
	dir = d_inode_rcu(parent);
	if (!dir) // could happen only in RCU mode
		return -ECHILD;
assuming we need the parent inode, that is.

So... how about
	int (*d_revalidate)(struct dentry *dentry, struct dentry *parent,
			  const struct qstr *name, unsigned int flags);
since we are touching all instances anyway?

Al Viro Nov. 26, 2023, 6:41 p.m. UTC | #35

[folks involved into d_invalidate()/submount eviction stuff Cc'd]
On Sun, Nov 26, 2023 at 04:52:19AM +0000, Al Viro wrote:
> PS: as the matter of fact, it might be a good idea to pass the parent
> as explicit argument to ->d_revalidate(), now that we are passing the
> name as well.  Look at the boilerplate in the instances; all that
>         parent = READ_ONCE(dentry->d_parent);
> 	dir = d_inode_rcu(parent);
> 	if (!dir)
> 		return -ECHILD;
> 	...
> on the RCU side combined with
> 	parent = dget_parent(dentry);
> 	dir = d_inode(parent);
> 	...
> 	dput(dir);
> stuff.
> 
> It's needed only because the caller had not told us which directory
> is that thing supposed to be in; in non-RCU mode the parent is
> explicitly pinned down, no need to play those games.  All we need
> is
> 	dir = d_inode_rcu(parent);
> 	if (!dir) // could happen only in RCU mode
> 		return -ECHILD;
> assuming we need the parent inode, that is.
> 
> So... how about
> 	int (*d_revalidate)(struct dentry *dentry, struct dentry *parent,
> 			  const struct qstr *name, unsigned int flags);
> since we are touching all instances anyway?

OK, it's definitely a good idea for simplifying ->d_revalidate() instances
and I think we should go for it on thes grounds alone.  I'll do that.

d_invalidate() situation is more subtle - we need to sort out its interplay
with d_splice_alias().

More concise variant of the scenario in question:
* we have /mnt/foo/bar and a lot of its descendents in dcache on client
* server does a rename, after which what used to be /mnt/foo/bar is /mnt/foo/baz
* somebody on the client does a lookup of /mnt/foo/bar and gets told by
the server that there's no directory with that name anymore.
* that somebody hits d_invalidate(), unhashes /mnt/foo/bar and starts
evicting its descendents
* We try to mount something on /mnt/foo/baz/blah.  We look up baz, get
an fhandle and notice that there's a directory inode for it (/mnt/foo/bar).
d_splice_alias() picks the bugger and moves it to /mnt/foo/baz, rehashing
it in process, as it ought to.  Then we find /mnt/foo/baz/blah in dcache and 
mount on top of it.
* d_invalidate() finishes shrink_dcache_parent() and starts hunting for
submounts to dissolve.  And finds the mount we'd done.  Which mount quietly
disappears.

Note that from the server POV the thing had been moved quite a while ago.
No server-side races involved - all it seeem is a couple of LOOKUP in the
same directory, one for the old name, one for the new.

On the client on the mounter side we have an uneventful mount on /mnt/foo/baz,
which had been there on server at the time we started and which remains in
place after mount we'd created suddenly disappears.

For the thread that ended up calling d_invalidate(), they'd been doing e.g.
stat on a pathname that used to be there a while ago, but currently isn't.
They get -ENOENT and no indication that something odd might have happened.

From ->d_revalidate() point of view there's also nothing odd happening -
dentry is not a mountpoint, it stays in place until we return and there's
no directory entry with that name on in its parent.  It's as clear-cut
as it gets - dentry is stale.

The only overlap happening there is d_splice_alias() hitting in the middle
of already started d_invalidate().

For a while I thought that ff17fa561a04 "d_invalidate(): unhash immediately"
and 3a8e3611e0ba "d_walk(): kill 'finish' callback" might have something
to do with it, but the same problem existed prior to that.

FWIW, I suspect that the right answer would be along the lines of
	* if d_splice_alias() does move an exsiting (attached) alias in
place, it ought to dissolve all mountpoints in subtree being moved.
There might be subtleties, but in case when that __d_unalias() happens
due to rename on server this is definitely the right thing to do.
	* d_invalidate() should *NOT* do anything with dentry that
got moved (including moved by d_splice_alias()) from the place we'd
found it in dcache.  At least d_invalidate() done due to having
->d_revalidate() return 0.
	* d_invalidate() should dissolve all mountpoints in the
subtree that existed when it got started (and found the victim
still unmoved, that is).  It should (as it does) prevent any
new mountpoints added in that subtree, unless the mountpoint
to be had been moved (spliced) out.  What it really shouldn't
do is touch the mountpoints that are currently outside of it
due to moves.

I'm going to look around and see if we have any weird cases where
d_splice_alias() is used for things like "correct the case of
dentry name on a case-mangled filesystem" - that would presumably
not want to dissolve any submounts.  I seem to recall seeing
some shite of that sort, but that was a long time ago.

Eric, Miklos - it might be a good idea if you at least took a
look at whatever comes out of that (sub)thread; I'm trying to
reconstruct the picture, but the last round of serious reworking
of that area had been almost 10 years ago and your recollections
of the considerations back then might help.  I realize that they
are probably rather fragmentary (mine definitely are) and any
analysis will need to be redone on the current tree, but...

Al Viro Nov. 27, 2023, 6:38 a.m. UTC | #36

On Sun, Nov 26, 2023 at 06:41:41PM +0000, Al Viro wrote:

> d_invalidate() situation is more subtle - we need to sort out its interplay
> with d_splice_alias().
> 
> More concise variant of the scenario in question:
> * we have /mnt/foo/bar and a lot of its descendents in dcache on client
> * server does a rename, after which what used to be /mnt/foo/bar is /mnt/foo/baz
> * somebody on the client does a lookup of /mnt/foo/bar and gets told by
> the server that there's no directory with that name anymore.
> * that somebody hits d_invalidate(), unhashes /mnt/foo/bar and starts
> evicting its descendents
> * We try to mount something on /mnt/foo/baz/blah.  We look up baz, get
> an fhandle and notice that there's a directory inode for it (/mnt/foo/bar).
> d_splice_alias() picks the bugger and moves it to /mnt/foo/baz, rehashing
> it in process, as it ought to.  Then we find /mnt/foo/baz/blah in dcache and 
> mount on top of it.
> * d_invalidate() finishes shrink_dcache_parent() and starts hunting for
> submounts to dissolve.  And finds the mount we'd done.  Which mount quietly
> disappears.
> 
> Note that from the server POV the thing had been moved quite a while ago.
> No server-side races involved - all it seeem is a couple of LOOKUP in the
> same directory, one for the old name, one for the new.
> 
> On the client on the mounter side we have an uneventful mount on /mnt/foo/baz,
> which had been there on server at the time we started and which remains in
> place after mount we'd created suddenly disappears.
> 
> For the thread that ended up calling d_invalidate(), they'd been doing e.g.
> stat on a pathname that used to be there a while ago, but currently isn't.
> They get -ENOENT and no indication that something odd might have happened.
> 
> >From ->d_revalidate() point of view there's also nothing odd happening -
> dentry is not a mountpoint, it stays in place until we return and there's
> no directory entry with that name on in its parent.  It's as clear-cut
> as it gets - dentry is stale.
> 
> The only overlap happening there is d_splice_alias() hitting in the middle
> of already started d_invalidate().
> 
> For a while I thought that ff17fa561a04 "d_invalidate(): unhash immediately"
> and 3a8e3611e0ba "d_walk(): kill 'finish' callback" might have something
> to do with it, but the same problem existed prior to that.
> 
> FWIW, I suspect that the right answer would be along the lines of
> 	* if d_splice_alias() does move an exsiting (attached) alias in
> place, it ought to dissolve all mountpoints in subtree being moved.
> There might be subtleties, but in case when that __d_unalias() happens
> due to rename on server this is definitely the right thing to do.
> 	* d_invalidate() should *NOT* do anything with dentry that
> got moved (including moved by d_splice_alias()) from the place we'd
> found it in dcache.  At least d_invalidate() done due to having
> ->d_revalidate() return 0.
> 	* d_invalidate() should dissolve all mountpoints in the
> subtree that existed when it got started (and found the victim
> still unmoved, that is).  It should (as it does) prevent any
> new mountpoints added in that subtree, unless the mountpoint
> to be had been moved (spliced) out.  What it really shouldn't
> do is touch the mountpoints that are currently outside of it
> due to moves.
> 
> I'm going to look around and see if we have any weird cases where
> d_splice_alias() is used for things like "correct the case of
> dentry name on a case-mangled filesystem" - that would presumably
> not want to dissolve any submounts.  I seem to recall seeing
> some shite of that sort, but that was a long time ago.
> 
> Eric, Miklos - it might be a good idea if you at least took a
> look at whatever comes out of that (sub)thread; I'm trying to
> reconstruct the picture, but the last round of serious reworking
> of that area had been almost 10 years ago and your recollections
> of the considerations back then might help.  I realize that they
> are probably rather fragmentary (mine definitely are) and any
> analysis will need to be redone on the current tree, but...

TBH, I wonder if we ought to have d_invalidate() variant that would
unhash the dentry in question, do a variant of shrink_dcache_parent()
that would report if there had been any mountpoints and if there
had been any, do namespace_lock() and go hunting for mounts in that
subtree, moving corresponding struct mountpoint to a private list
as we go (removing them from mountpoint hash chains, that it).  Then
have them all evicted after we'd finished walking the subtree...

The tricky part will be lock ordering - right now we have the
mountpoint hash protected by mount_lock (same as mount hash, probably
worth splitting anyway) and that nests outside of ->d_lock.

Note that we don't do mountpoint hash lookups on mountpoint crossing
- it's nowhere near the really hot paths.  What we have is
	lookup_mountpoint() - plain hash lookup.  Always
under namespace_lock() and mount_lock.
	get_mountpoint() - there's an insertion into hash chain,
with dentry passed through the d_set_mounted(), which would
fail if we have d_invalidate() on the subtree.
Also always under namespace_lock() and mount_lock.
	__put_mountpoint() - removal from the hash chain.
We remove from hash chain after having cleared DCACHE_MOUNTED.
_That_ can happen under mount_lock alone (taking out the stuck
submounts on final mntput()).

So convert the mountpoint hash chains to hlist_bl, bitlocks nesting under
->d_lock.  Introduce a new dentry flag (DCHACE_MOUNT_INVALIDATION?)
In d_walk() callback we would
	* do nothing if DCACHE_MOUNT is not set or DCACHE_MOUNT_INVALIDATION
is.
	* otherwise set DCACHE_MOUNT_INVALIDATION, grab the bitlock on the
mountpoint hash chain matching that dentry, find struct mountpoint in it,
remove it from the chain and insert into a separate "collector" chain, all
without messing with refcount.
In lookup_mountpoint() and get_mountpoint() take the bitlock on chain.
In __put_mountpoint(), once it has grabbed ->d_lock
	* check if it has DCACHE_MOUNT_INVALIDATION, use that to
decide which chain we are locking - the normal one or the collector
	* clear both DCACHE_MOUNT and DCACHE_MOUNT_INVALIDATION
	* remove from chain
	* unlock the chain
	* drop ->d_lock.

Once we are finished walking the tree, go over the collector list
and do what __detach_mount() guts do.  We are no longer under
any ->d_lock, so locking is not a problem.  namespace_unlock() will
flush them all, same as it does for __detach_mount().

In __d_unalias() case do that d_invalidate() analogues of the alias.
Yes, it might do final mntput() of other filesystems, while under
->i_rwsem on our parent.  Not a problem, fs shutdown will go
either through task_work or schedule_delayed_work(); in any
case, it won't happen under ->i_rwsem.  We obviously can't do
that under rename_lock, though, so we'll need to massage that
path in d_splice_alias() a bit.

So, something like d_invalidate_locked(victim) called with
victim->d_lock held.  d_splice_alias() would use that (see above)
and places where we do d_invalidate() after ->d_revalidate() having
returned 0 would do this:
	lock dentry
	if it still has the same parent and name
		d_invalidate_locked()
	else
		unlock dentry
probably folded into fs/namei.c:d_revalidate()...  Not tonight,
though - I'd rather do that while properly awake ;-/

Eric W. Biederman Nov. 27, 2023, 3:47 p.m. UTC | #37

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Sun, Nov 26, 2023 at 06:41:41PM +0000, Al Viro wrote:
>
>> d_invalidate() situation is more subtle - we need to sort out its interplay
>> with d_splice_alias().
>> 
>> More concise variant of the scenario in question:
>> * we have /mnt/foo/bar and a lot of its descendents in dcache on client
>> * server does a rename, after which what used to be /mnt/foo/bar is /mnt/foo/baz
>> * somebody on the client does a lookup of /mnt/foo/bar and gets told by
>> the server that there's no directory with that name anymore.
>> * that somebody hits d_invalidate(), unhashes /mnt/foo/bar and starts
>> evicting its descendents
>> * We try to mount something on /mnt/foo/baz/blah.  We look up baz, get
>> an fhandle and notice that there's a directory inode for it (/mnt/foo/bar).
>> d_splice_alias() picks the bugger and moves it to /mnt/foo/baz, rehashing
>> it in process, as it ought to.  Then we find /mnt/foo/baz/blah in dcache and 
>> mount on top of it.
>> * d_invalidate() finishes shrink_dcache_parent() and starts hunting for
>> submounts to dissolve.  And finds the mount we'd done.  Which mount quietly
>> disappears.
>> 
>> Note that from the server POV the thing had been moved quite a while ago.
>> No server-side races involved - all it seeem is a couple of LOOKUP in the
>> same directory, one for the old name, one for the new.
>> 
>> On the client on the mounter side we have an uneventful mount on /mnt/foo/baz,
>> which had been there on server at the time we started and which remains in
>> place after mount we'd created suddenly disappears.
>> 
>> For the thread that ended up calling d_invalidate(), they'd been doing e.g.
>> stat on a pathname that used to be there a while ago, but currently isn't.
>> They get -ENOENT and no indication that something odd might have happened.
>> 
>> >From ->d_revalidate() point of view there's also nothing odd happening -
>> dentry is not a mountpoint, it stays in place until we return and there's
>> no directory entry with that name on in its parent.  It's as clear-cut
>> as it gets - dentry is stale.
>> 
>> The only overlap happening there is d_splice_alias() hitting in the middle
>> of already started d_invalidate().
>> 
>> For a while I thought that ff17fa561a04 "d_invalidate(): unhash immediately"
>> and 3a8e3611e0ba "d_walk(): kill 'finish' callback" might have something
>> to do with it, but the same problem existed prior to that.
>> 
>> FWIW, I suspect that the right answer would be along the lines of
>> 	* if d_splice_alias() does move an exsiting (attached) alias in
>> place, it ought to dissolve all mountpoints in subtree being moved.
>> There might be subtleties, but in case when that __d_unalias() happens
>> due to rename on server this is definitely the right thing to do.
>> 	* d_invalidate() should *NOT* do anything with dentry that
>> got moved (including moved by d_splice_alias()) from the place we'd
>> found it in dcache.  At least d_invalidate() done due to having
>> ->d_revalidate() return 0.
>> 	* d_invalidate() should dissolve all mountpoints in the
>> subtree that existed when it got started (and found the victim
>> still unmoved, that is).  It should (as it does) prevent any
>> new mountpoints added in that subtree, unless the mountpoint
>> to be had been moved (spliced) out.  What it really shouldn't
>> do is touch the mountpoints that are currently outside of it
>> due to moves.
>> 
>> I'm going to look around and see if we have any weird cases where
>> d_splice_alias() is used for things like "correct the case of
>> dentry name on a case-mangled filesystem" - that would presumably
>> not want to dissolve any submounts.  I seem to recall seeing
>> some shite of that sort, but that was a long time ago.
>> 
>> Eric, Miklos - it might be a good idea if you at least took a
>> look at whatever comes out of that (sub)thread; I'm trying to
>> reconstruct the picture, but the last round of serious reworking
>> of that area had been almost 10 years ago and your recollections
>> of the considerations back then might help.  I realize that they
>> are probably rather fragmentary (mine definitely are) and any
>> analysis will need to be redone on the current tree, but...

By subthread I assume you are referring to the work to that generalized
check_submounts_and_drop into the current d_invalidate.

My memory is that there were deliberate restrictions on where
d_revalidate could be called so as not to mess with mounts.

I believe those restrictions either prevented or convinced us it
prevented nasty interactions between d_invalidate and d_splice_alias.

There is a lot going on there.  I remember one of the relevant
restrictions was marking dentries dont_mount, and inodes S_DEAD
in unlink and rmdir.

But even without out that marking if d_invalidate is called
from d_revalidate the inode and all of it's dentries must be
dead because the inode is stale and most go.  There should
be no resurrecting it at that point.

I suspect the most fruitful way to think of the d_invalidate vs
d_splice_alias races is an unlink vs rename race.

I don't think the mechanism matters, but deeply and fundamentally
if we detect a directory inode is dead we need to stick with
that decision and not attempt to resurrect it with d_splice_alias.

Looking at ext4 and f2fs it appears when case folding they are calling
d_invalidate before the generic code can, and before marking like
dont_mount happen.  Is that the tie in with where the current
conversation comes in?

> TBH, I wonder if we ought to have d_invalidate() variant that would
> unhash the dentry in question,

You mean like the current d_invalidate does?  It calls __d_drop which
unhashes the thing and prevent lookups.  You even pointed to the change
that added that in the previous email.  The only thing that does not
happen currently is marking the dentry as unhashed.

Looking the rmdir code uses not only dont_mount but marks the
inode S_DEAD as well.

Right now we can't even get to d_splice_alias unless the original
dentry is unhashed.

So I suspect it isn't d_invalidate you are fighting.

> do a variant of shrink_dcache_parent()
> that would report if there had been any mountpoints and if there
> had been any, do namespace_lock() and go hunting for mounts in that
> subtree, moving corresponding struct mountpoint to a private list
> as we go (removing them from mountpoint hash chains, that it).  Then
> have them all evicted after we'd finished walking the subtree...

>
> The tricky part will be lock ordering - right now we have the
> mountpoint hash protected by mount_lock (same as mount hash, probably
> worth splitting anyway) and that nests outside of ->d_lock.

I don't get get it.

All we have to do is to prevent the inode lookup from succeeding
if we have decided the inode has been deleted.  It may be a little
more subtle the path of the inode we are connecting goes through
a dentry that is being invalidated.

But either need to prevent it in the lookup that leads to d_alloc,
or prevent the new dentry from being attached.

I know d_splice_alias takes the rename_lock to prevent some of those
races.

I hope that helps on the recollection front.

I am confused what is going on with ext4 and f2fs.  I think they
are calling d_invalidate when all they need to call is d_drop.

Eric

Eric W. Biederman Nov. 27, 2023, 4:01 p.m. UTC | #38

"Eric W. Biederman" <ebiederm@xmission.com> writes:

> I am confused what is going on with ext4 and f2fs.  I think they
> are calling d_invalidate when all they need to call is d_drop.

ext4 and f2f2 are buggy in how they call d_invalidate, if I am reading
the code correctly.

d_invalidate calls detach_mounts.

detach_mounts relies on setting D_CANT_MOUNT on the top level dentry to
prevent races with new mounts.

ext4 and f2fs (in their case insensitive code) are calling d_invalidate
before dont_mount has been called to set D_CANT_MOUNT.

Eric

Al Viro Nov. 27, 2023, 4:03 p.m. UTC | #39

On Mon, Nov 27, 2023 at 09:47:47AM -0600, Eric W. Biederman wrote:

> There is a lot going on there.  I remember one of the relevant
> restrictions was marking dentries dont_mount, and inodes S_DEAD
> in unlink and rmdir.
> 
> But even without out that marking if d_invalidate is called
> from d_revalidate the inode and all of it's dentries must be
> dead because the inode is stale and most go.  There should
> be no resurrecting it at that point.
> 
> I suspect the most fruitful way to think of the d_invalidate vs
> d_splice_alias races is an unlink vs rename race.
> 
> I don't think the mechanism matters, but deeply and fundamentally
> if we detect a directory inode is dead we need to stick with
> that decision and not attempt to resurrect it with d_splice_alias.

Wrong.  Deeply and fundamentally we detect a dentry that does not
match the directory contents according to the server.

For example, due to rename done on server.  With object in question
perfectly alive there - fhandle still works, etc.

However, it's no longer where it used to be.  And we would bloody better
not have lookups for the old name result in access to that object.
We also should never allow the access to *new* name lead to two live
dentries for the same directory inode.

Again, this is not about rmdir() or unlink() - invalidation can happen
for object that is still open, still accessed and still very much alive.
Does that all the time for any filesystem with ->d_revalidate().

Al Viro Nov. 27, 2023, 4:14 p.m. UTC | #40

On Mon, Nov 27, 2023 at 04:03:18PM +0000, Al Viro wrote:
> On Mon, Nov 27, 2023 at 09:47:47AM -0600, Eric W. Biederman wrote:
> 
> > There is a lot going on there.  I remember one of the relevant
> > restrictions was marking dentries dont_mount, and inodes S_DEAD
> > in unlink and rmdir.
> > 
> > But even without out that marking if d_invalidate is called
> > from d_revalidate the inode and all of it's dentries must be
> > dead because the inode is stale and most go.  There should
> > be no resurrecting it at that point.
> > 
> > I suspect the most fruitful way to think of the d_invalidate vs
> > d_splice_alias races is an unlink vs rename race.
> > 
> > I don't think the mechanism matters, but deeply and fundamentally
> > if we detect a directory inode is dead we need to stick with
> > that decision and not attempt to resurrect it with d_splice_alias.
> 
> Wrong.  Deeply and fundamentally we detect a dentry that does not
> match the directory contents according to the server.
> 
> For example, due to rename done on server.  With object in question
> perfectly alive there - fhandle still works, etc.
> 
> However, it's no longer where it used to be.  And we would bloody better
> not have lookups for the old name result in access to that object.
> We also should never allow the access to *new* name lead to two live
> dentries for the same directory inode.
> 
> Again, this is not about rmdir() or unlink() - invalidation can happen
> for object that is still open, still accessed and still very much alive.
> Does that all the time for any filesystem with ->d_revalidate().

Put another way, there used to be very odd song and dance in ->d_revalidate()
instances along the lines of "we can't possibly tell the caller to invalidate
a mountpoint"; it was racy in the best case and during the rewrite of
d_invalidate() to teach it how to evict submounts those attempts had been
dropped - ->d_revalidate() returning 0 does end up with mounts dissolved
by d_invalidate() from caller.

It always had been racy, starting with the checks that used to be in
->d_revalidate() instances way before all those changes.  So the switch
of d_invalidate() to dissolving submounts had been a step in the right
direction, but it's not being careful enough.

Again, it's about d_invalidate() caused by pathwalk running into a dentry that
doesn't match the reality vs. d_splice_alias() finding that it matches the
inode we had looked up elsewhere.

Christian Brauner Nov. 27, 2023, 4:33 p.m. UTC | #41

On Mon, Nov 27, 2023 at 06:38:42AM +0000, Al Viro wrote:
> On Sun, Nov 26, 2023 at 06:41:41PM +0000, Al Viro wrote:
> 
> > d_invalidate() situation is more subtle - we need to sort out its interplay
> > with d_splice_alias().
> > 
> > More concise variant of the scenario in question:
> > * we have /mnt/foo/bar and a lot of its descendents in dcache on client
> > * server does a rename, after which what used to be /mnt/foo/bar is /mnt/foo/baz
> > * somebody on the client does a lookup of /mnt/foo/bar and gets told by
> > the server that there's no directory with that name anymore.
> > * that somebody hits d_invalidate(), unhashes /mnt/foo/bar and starts
> > evicting its descendents
> > * We try to mount something on /mnt/foo/baz/blah.  We look up baz, get
> > an fhandle and notice that there's a directory inode for it (/mnt/foo/bar).
> > d_splice_alias() picks the bugger and moves it to /mnt/foo/baz, rehashing
> > it in process, as it ought to.  Then we find /mnt/foo/baz/blah in dcache and 
> > mount on top of it.
> > * d_invalidate() finishes shrink_dcache_parent() and starts hunting for
> > submounts to dissolve.  And finds the mount we'd done.  Which mount quietly
> > disappears.
> > 
> > Note that from the server POV the thing had been moved quite a while ago.
> > No server-side races involved - all it seeem is a couple of LOOKUP in the
> > same directory, one for the old name, one for the new.
> > 
> > On the client on the mounter side we have an uneventful mount on /mnt/foo/baz,
> > which had been there on server at the time we started and which remains in
> > place after mount we'd created suddenly disappears.
> > 
> > For the thread that ended up calling d_invalidate(), they'd been doing e.g.
> > stat on a pathname that used to be there a while ago, but currently isn't.
> > They get -ENOENT and no indication that something odd might have happened.
> > 
> > >From ->d_revalidate() point of view there's also nothing odd happening -
> > dentry is not a mountpoint, it stays in place until we return and there's
> > no directory entry with that name on in its parent.  It's as clear-cut
> > as it gets - dentry is stale.
> > 
> > The only overlap happening there is d_splice_alias() hitting in the middle
> > of already started d_invalidate().
> > 
> > For a while I thought that ff17fa561a04 "d_invalidate(): unhash immediately"
> > and 3a8e3611e0ba "d_walk(): kill 'finish' callback" might have something
> > to do with it, but the same problem existed prior to that.
> > 
> > FWIW, I suspect that the right answer would be along the lines of
> > 	* if d_splice_alias() does move an exsiting (attached) alias in
> > place, it ought to dissolve all mountpoints in subtree being moved.
> > There might be subtleties, but in case when that __d_unalias() happens
> > due to rename on server this is definitely the right thing to do.
> > 	* d_invalidate() should *NOT* do anything with dentry that
> > got moved (including moved by d_splice_alias()) from the place we'd
> > found it in dcache.  At least d_invalidate() done due to having
> > ->d_revalidate() return 0.
> > 	* d_invalidate() should dissolve all mountpoints in the
> > subtree that existed when it got started (and found the victim
> > still unmoved, that is).  It should (as it does) prevent any
> > new mountpoints added in that subtree, unless the mountpoint
> > to be had been moved (spliced) out.  What it really shouldn't
> > do is touch the mountpoints that are currently outside of it
> > due to moves.
> > 
> > I'm going to look around and see if we have any weird cases where
> > d_splice_alias() is used for things like "correct the case of
> > dentry name on a case-mangled filesystem" - that would presumably
> > not want to dissolve any submounts.  I seem to recall seeing
> > some shite of that sort, but that was a long time ago.
> > 
> > Eric, Miklos - it might be a good idea if you at least took a
> > look at whatever comes out of that (sub)thread; I'm trying to
> > reconstruct the picture, but the last round of serious reworking
> > of that area had been almost 10 years ago and your recollections
> > of the considerations back then might help.  I realize that they
> > are probably rather fragmentary (mine definitely are) and any
> > analysis will need to be redone on the current tree, but...
> 
> TBH, I wonder if we ought to have d_invalidate() variant that would
> unhash the dentry in question, do a variant of shrink_dcache_parent()
> that would report if there had been any mountpoints and if there
> had been any, do namespace_lock() and go hunting for mounts in that
> subtree, moving corresponding struct mountpoint to a private list
> as we go (removing them from mountpoint hash chains, that it).  Then
> have them all evicted after we'd finished walking the subtree...

That sounds reasonable.

> 
> The tricky part will be lock ordering - right now we have the
> mountpoint hash protected by mount_lock (same as mount hash, probably
> worth splitting anyway) and that nests outside of ->d_lock.
> 
> Note that we don't do mountpoint hash lookups on mountpoint crossing
> - it's nowhere near the really hot paths.  What we have is
> 	lookup_mountpoint() - plain hash lookup.  Always
> under namespace_lock() and mount_lock.
> 	get_mountpoint() - there's an insertion into hash chain,
> with dentry passed through the d_set_mounted(), which would
> fail if we have d_invalidate() on the subtree.
> Also always under namespace_lock() and mount_lock.
> 	__put_mountpoint() - removal from the hash chain.
> We remove from hash chain after having cleared DCACHE_MOUNTED.
> _That_ can happen under mount_lock alone (taking out the stuck
> submounts on final mntput()).
> 
> So convert the mountpoint hash chains to hlist_bl, bitlocks nesting under
> ->d_lock.  Introduce a new dentry flag (DCHACE_MOUNT_INVALIDATION?)
> In d_walk() callback we would
> 	* do nothing if DCACHE_MOUNT is not set or DCACHE_MOUNT_INVALIDATION
> is.
> 	* otherwise set DCACHE_MOUNT_INVALIDATION, grab the bitlock on the
> mountpoint hash chain matching that dentry, find struct mountpoint in it,
> remove it from the chain and insert into a separate "collector" chain, all
> without messing with refcount.

Ok.

> In lookup_mountpoint() and get_mountpoint() take the bitlock on chain.
> In __put_mountpoint(), once it has grabbed ->d_lock
> 	* check if it has DCACHE_MOUNT_INVALIDATION, use that to
> decide which chain we are locking - the normal one or the collector
> 	* clear both DCACHE_MOUNT and DCACHE_MOUNT_INVALIDATION
> 	* remove from chain
> 	* unlock the chain
> 	* drop ->d_lock.
> 
> Once we are finished walking the tree, go over the collector list
> and do what __detach_mount() guts do.  We are no longer under
> any ->d_lock, so locking is not a problem.  namespace_unlock() will
> flush them all, same as it does for __detach_mount().

Ok.

Al Viro Nov. 27, 2023, 5:25 p.m. UTC | #42

On Mon, Nov 27, 2023 at 10:01:34AM -0600, Eric W. Biederman wrote:
> "Eric W. Biederman" <ebiederm@xmission.com> writes:
> 
> > I am confused what is going on with ext4 and f2fs.  I think they
> > are calling d_invalidate when all they need to call is d_drop.
> 
> ext4 and f2f2 are buggy in how they call d_invalidate, if I am reading
> the code correctly.
> 
> d_invalidate calls detach_mounts.
> 
> detach_mounts relies on setting D_CANT_MOUNT on the top level dentry to
> prevent races with new mounts.
>
> ext4 and f2fs (in their case insensitive code) are calling d_invalidate
> before dont_mount has been called to set D_CANT_MOUNT.

Not really - note that the place where we check cant_mount() is under
the lock on the mountpoint's inode, so anything inside ->unlink() or
->rmdir() is indistinguishable from the places where we do dont_mount()
in vfs_{unlink,rmdir}.

Eric W. Biederman Nov. 27, 2023, 6:19 p.m. UTC | #43

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Mon, Nov 27, 2023 at 04:03:18PM +0000, Al Viro wrote:
>> On Mon, Nov 27, 2023 at 09:47:47AM -0600, Eric W. Biederman wrote:
>> 
>> > There is a lot going on there.  I remember one of the relevant
>> > restrictions was marking dentries dont_mount, and inodes S_DEAD
>> > in unlink and rmdir.
>> > 
>> > But even without out that marking if d_invalidate is called
>> > from d_revalidate the inode and all of it's dentries must be
>> > dead because the inode is stale and most go.  There should
>> > be no resurrecting it at that point.
>> > 
>> > I suspect the most fruitful way to think of the d_invalidate vs
>> > d_splice_alias races is an unlink vs rename race.
>> > 
>> > I don't think the mechanism matters, but deeply and fundamentally
>> > if we detect a directory inode is dead we need to stick with
>> > that decision and not attempt to resurrect it with d_splice_alias.
>> 
>> Wrong.  Deeply and fundamentally we detect a dentry that does not
>> match the directory contents according to the server.
>> 
>> For example, due to rename done on server.  With object in question
>> perfectly alive there - fhandle still works, etc.
>> 
>> However, it's no longer where it used to be.  And we would bloody better
>> not have lookups for the old name result in access to that object.
>> We also should never allow the access to *new* name lead to two live
>> dentries for the same directory inode.
>> 
>> Again, this is not about rmdir() or unlink() - invalidation can happen
>> for object that is still open, still accessed and still very much alive.
>> Does that all the time for any filesystem with ->d_revalidate().
>
> Put another way, there used to be very odd song and dance in ->d_revalidate()
> instances along the lines of "we can't possibly tell the caller to invalidate
> a mountpoint"; it was racy in the best case and during the rewrite of
> d_invalidate() to teach it how to evict submounts those attempts had been
> dropped - ->d_revalidate() returning 0 does end up with mounts dissolved
> by d_invalidate() from caller.
>
> It always had been racy, starting with the checks that used to be in
> ->d_revalidate() instances way before all those changes.  So the switch
> of d_invalidate() to dissolving submounts had been a step in the right
> direction, but it's not being careful enough.
>
> Again, it's about d_invalidate() caused by pathwalk running into a dentry that
> doesn't match the reality vs. d_splice_alias() finding that it matches the
> inode we had looked up elsewhere.

My point is we should have a atomic way to decide the disposition of
such a dentry, and it's children.

Either we should decide it is useless and remove it and all of it's
children.

Or we should decide it was renamed and just handle it that way.

If we can record such a decision on the dentry or possibly on the inode
then we can resolve the race by having it be a proper race of which
comes first.

It isn't a proper delete of the inode so anything messing with the inode
and marking it S_DEAD is probably wrong.

The code could do something like mark the dentry dont_mount which should
be enough to for d_splice_alias to say oops, something is not proper
here.  Let the d_invalidate do it's thing.

Or the code could remove the dentry from inode->i_dentry and keep
d_splice alias from finding it, and it's children completely.
That is different from unhashing it.

Anyway that is my memory and my general sense of what is going on.
I help it helps.

Eric

Al Viro Nov. 27, 2023, 6:26 p.m. UTC | #44

On Mon, Nov 27, 2023 at 05:25:44PM +0000, Al Viro wrote:
> On Mon, Nov 27, 2023 at 10:01:34AM -0600, Eric W. Biederman wrote:
> > "Eric W. Biederman" <ebiederm@xmission.com> writes:
> > 
> > > I am confused what is going on with ext4 and f2fs.  I think they
> > > are calling d_invalidate when all they need to call is d_drop.
> > 
> > ext4 and f2f2 are buggy in how they call d_invalidate, if I am reading
> > the code correctly.
> > 
> > d_invalidate calls detach_mounts.
> > 
> > detach_mounts relies on setting D_CANT_MOUNT on the top level dentry to
> > prevent races with new mounts.
> >
> > ext4 and f2fs (in their case insensitive code) are calling d_invalidate
> > before dont_mount has been called to set D_CANT_MOUNT.
> 
> Not really - note that the place where we check cant_mount() is under
> the lock on the mountpoint's inode, so anything inside ->unlink() or
> ->rmdir() is indistinguishable from the places where we do dont_mount()
> in vfs_{unlink,rmdir}.

Said that, we could simply use d_drop() in those, since the caller will
take care of mount eviction - we have ->unlink() or ->rmdir() returning
success, after all.

The same goes for xfs caller and for cifs_prime_dcache() (in the latter
case we have just checked that they sucker is negative, so d_invalidate()
and d_drop() are doing the same thing).

Al Viro Nov. 27, 2023, 6:43 p.m. UTC | #45

On Mon, Nov 27, 2023 at 12:19:09PM -0600, Eric W. Biederman wrote:

> Either we should decide it is useless and remove it and all of it's
> children.
> 
> Or we should decide it was renamed and just handle it that way.

How?  An extra roundtrip to server trying to do getattr on the fhandle
we've got?

Cost of that aside, we *still* need to dissolve submounts in such case;
there is no warranty that we'll ever guess the new name and no way
to ask the server for one, so we can't let them sit around.  Not that
having mounts (local by definition) suddenly show up in the unexpected
place because of rename on server looks like a good thing, especially
since had that rename on server been done as cp -rl + rm -rf the same
mounts would be gone...

> If we can record such a decision on the dentry or possibly on the inode
> then we can resolve the race by having it be a proper race of which
> comes first.
> 
> It isn't a proper delete of the inode so anything messing with the inode
> and marking it S_DEAD is probably wrong.

s/probably/certainly/, but where would d_invalidate() do such a thing?
It's none of its business...

> The code could do something like mark the dentry dont_mount which should
> be enough to for d_splice_alias to say oops, something is not proper
> here.  Let the d_invalidate do it's thing.
> 
> Or the code could remove the dentry from inode->i_dentry and keep
> d_splice alias from finding it, and it's children completely.
> That is different from unhashing it.

We might be just in the middle of getdents(2) on the directory in question.
It can be opened; we can't do anything that destructive there.

Again, it's about the d_invalidate() on ->d_revalidate() reporting 0;
uses like proc_invalidate_siblings_dcache() are separate story, simply
because there d_splice_alias() is not going to move anything anywhere.

Al Viro Nov. 28, 2023, 12:02 a.m. UTC | #46

On Fri, Nov 24, 2023 at 10:20:39AM -0500, Gabriel Krisman Bertazi wrote:

> I'm not sure why it changed.  I'm guessing that, since it doesn't make
> sense to set fscrypt_d_revalidate for every dentry in the
> !case-insensitive case, they just kept the same behavior for
> case-insensitive+fscrypt.  This is what I get from looking at the git
> history.
> 
> I will get a new series reverting to use ->s_d_op, folding the
> dentry_cmp behavior you mentioned, and based on what you merge in your
> branch.

FWIW, I suspect that we might want something along the lines of
d_set_always_valid(dentry)
{
	grab ->d_lock
	clear DCACHE_OP_REVALIDATE
	release ->d_lock
}

to make it possible to suppress ->d_revalidate() for this particular
dentry...

Al Viro Nov. 29, 2023, 4:53 a.m. UTC | #47

On Mon, Nov 27, 2023 at 06:38:43AM +0000, Al Viro wrote:

> > FWIW, I suspect that the right answer would be along the lines of
> > 	* if d_splice_alias() does move an exsiting (attached) alias in
> > place, it ought to dissolve all mountpoints in subtree being moved.
> > There might be subtleties,

Are there ever...  Starting with the "our test for loop creation
(alias is a direct ancestor, need to fail with -ELOOP) is dependent
upon rename_lock being held all along".

Folks, what semantics do we want for dissolving mounts on splice?
The situation when it happens is when we have a subtree on e.g. NFS
and have some mounts (on client) inside that.  Then somebody on
server moves the root of that subtree somewhere else and we try
to do a lookup in new place.  Options:

1) our dentry for directory that got moved on server is moved into
new place, along with the entire subtree *and* everything mounted
on it.  Very dubious semantics, especially since if we look the
old location up before looking for new one, the mounts will be
dissolved; no way around that.

2) lookup fails.  It's already possible; e.g. if server has
/srv/nfs/1/2/3 moved to /srv/nfs/x, then /srv/nfs/1/2 moved
to /srv/nfs/x/y and client has a process with cwd in /mnt/nfs/1/2/3
doing a lookup for "y", there's no way in hell to handle that -
the lookup will return the fhandle of /srv/nfs/x, which is the
same thing the client has for /mnt/nfs/1/2; we *can't* move that
dentry to /mnt/nfs/1/2/3/y - not without creating a detached loop.
We can also run into -ESTALE if one of the trylocks in
__d_unalias() fails.  Having the same happen if there are mounts
in the subtree we are trying to splice would be unpleasant, but
not fatal.  The trouble is, that won't be a transient failure -
not until somebody tries to look the old location up.

3) dissolve the mounts.  Doable, but it's not easy; especially
since we end up having to redo the loop-prevention check after
the mounts had been dissolved.  And that check may be failing
by that time, with no way to undo that dissolving...

Christian Brauner Nov. 29, 2023, 10:21 a.m. UTC | #48

> 2) lookup fails.  It's already possible; e.g. if server has

I think that's the sanest option. The other options seem even less
intuitive.

> not fatal.  The trouble is, that won't be a transient failure -
> not until somebody tries to look the old location up.

Eh, nfs semantics are quite special anyway already. I'd rather have that
in lookup than more magic involving moving mounts around or having them
disappear (Yes, we have the detach semantics on removal but that's
different.).

Eric W. Biederman Nov. 29, 2023, 3:19 p.m. UTC | #49

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Mon, Nov 27, 2023 at 06:38:43AM +0000, Al Viro wrote:
>
>> > FWIW, I suspect that the right answer would be along the lines of
>> > 	* if d_splice_alias() does move an exsiting (attached) alias in
>> > place, it ought to dissolve all mountpoints in subtree being moved.
>> > There might be subtleties,
>
> Are there ever...  Starting with the "our test for loop creation
> (alias is a direct ancestor, need to fail with -ELOOP) is dependent
> upon rename_lock being held all along".
>
> Folks, what semantics do we want for dissolving mounts on splice?
> The situation when it happens is when we have a subtree on e.g. NFS
> and have some mounts (on client) inside that.  Then somebody on
> server moves the root of that subtree somewhere else and we try
> to do a lookup in new place.  Options:
>
> 1) our dentry for directory that got moved on server is moved into
> new place, along with the entire subtree *and* everything mounted
> on it.  Very dubious semantics, especially since if we look the
> old location up before looking for new one, the mounts will be
> dissolved; no way around that.
>
> 2) lookup fails.  It's already possible; e.g. if server has
> /srv/nfs/1/2/3 moved to /srv/nfs/x, then /srv/nfs/1/2 moved
> to /srv/nfs/x/y and client has a process with cwd in /mnt/nfs/1/2/3
> doing a lookup for "y", there's no way in hell to handle that -
> the lookup will return the fhandle of /srv/nfs/x, which is the
> same thing the client has for /mnt/nfs/1/2; we *can't* move that
> dentry to /mnt/nfs/1/2/3/y - not without creating a detached loop.
> We can also run into -ESTALE if one of the trylocks in
> __d_unalias() fails.  Having the same happen if there are mounts
> in the subtree we are trying to splice would be unpleasant, but
> not fatal.  The trouble is, that won't be a transient failure -
> not until somebody tries to look the old location up.
>
> 3) dissolve the mounts.  Doable, but it's not easy; especially
> since we end up having to redo the loop-prevention check after
> the mounts had been dissolved.  And that check may be failing
> by that time, with no way to undo that dissolving...

To be clear this is a change in current semantics and has a minuscule
change of resulting in a regression.  That should be called out in the
change log.

If we choose to change the semantics I would suggest that the new
semantics be:

If a different name for a directory already exists:
	* Detach the mounts unconditionally (leaving dentry descendants alone).
	* Attempt the current splice.
          - If the splice succeeds ( return the new dentry )
	  - If the splice fails ( fail the lookup, and d_invalidate the existing name )

Unconditionally dissolving the mounts before attempting the rename
should simplify everything.

In the worst case a race between d_invalidate and d_splice_alias will
now become a race to see who can detach the mounts first.

Eric

patchwork-bot+f2fs@kernel.org Jan. 16, 2024, 7:02 p.m. UTC | #50

Hello:

This series was applied to jaegeuk/f2fs.git (dev)
by Gabriel Krisman Bertazi <krisman@suse.de>:

On Wed, 16 Aug 2023 01:07:54 -0400 you wrote:
> Hi,
> 
> This is v6 of the negative dentry on case-insensitive directories.
> Thanks Eric for the review of the last iteration.  This version
> drops the patch to expose the helper to check casefolding directories,
> since it is not necessary in ecryptfs and it might be going away.  It
> also addresses some documentation details, fix a build bot error and
> simplifies the commit messages.  See the changelog in each patch for
> more details.
> 
> [...]

Here is the summary with links:
  - [f2fs-dev,v6,1/9] ecryptfs: Reject casefold directory inodes
    https://git.kernel.org/jaegeuk/f2fs/c/cd72c7ef5fed
  - [f2fs-dev,v6,2/9] 9p: Split ->weak_revalidate from ->revalidate
    (no matching commit)
  - [f2fs-dev,v6,3/9] fs: Expose name under lookup to d_revalidate hooks
    (no matching commit)
  - [f2fs-dev,v6,4/9] fs: Add DCACHE_CASEFOLDED_NAME flag
    (no matching commit)
  - [f2fs-dev,v6,5/9] libfs: Validate negative dentries in case-insensitive directories
    (no matching commit)
  - [f2fs-dev,v6,6/9] libfs: Chain encryption checks after case-insensitive revalidation
    (no matching commit)
  - [f2fs-dev,v6,7/9] libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops
    (no matching commit)
  - [f2fs-dev,v6,8/9] ext4: Enable negative dentries on case-insensitive lookup
    (no matching commit)
  - [f2fs-dev,v6,9/9] f2fs: Enable negative dentries on case-insensitive lookup
    (no matching commit)

You are awesome, thank you!

[v6,0/9] Support negative dentries on case-insensitive ext4 and f2fs

Message

Comments