mbox series

[RFC,v2,00/13] NLS/UTF-8 Case-Insensitive lookups for ext4 and VFS proposal

Message ID 20180125025349.31494-1-krisman@collabora.co.uk
Headers show
Series NLS/UTF-8 Case-Insensitive lookups for ext4 and VFS proposal | expand

Message

Gabriel Krisman Bertazi Jan. 25, 2018, 2:53 a.m. UTC
Hi,

Along with the patch series, I am very interested in getting feedback on
the two items below, regarding VFS and NLS changes.

This is a v2 of the unicode + ext4 case-insensitive support which
extends support to Unicode 10.0.0, and applies the fixes suggested by
Olaf in the previous iteration.  For the same reason as mentioned
before, the ucd files are not included in the RFC, but the relevant
patch file explains how to fetch them.

If you'd rather pull everything in this RFC at once, including the UCD
files, you can clone from:

https://gitlab.collabora.com/krisman/linux.git -b charset-lib

The original cover letter, with explanations on some of the design
decisions made in this RFC, is documented in the archive below:

  https://www.spinics.net/lists/linux-ext4/msg59457.html

In addition to this RFC, I am making two new proposals (no code in this
RFC) for VFS and NLS, which I would like to hear feedback from you
before turning this from an RFC into a final patch submission:

(1) integrate the charset lib into the NLS system.

Basically, this requires introducing new higher-level hooks for string
comparison, like the ones we have in the charset patch, into the NLS
subsystem.

NLS also has to support versions of the same encoding, my idea is to
separate the information to register the encoding with the NLS system
into a separate structure, which is restricted to the NLS system.  The
nls_table or a similar structure, which is then passed to users of the
library, will then be specific to a given version of the charset and
carry pointers to the functions specific to that version.

One final important point for NLS is that we need to prevent users from
mounting CI filesystems with encodings that don't support
normalization/comparison functions and try not the break compatibility
of filesystems that already do toupper/tolower without normalization.
These points are important to keep in mind but are quite trivial to
implement.

The second proposal is related to the VFS layer:

(2) Enable Insensitive lookup support on a per-mountpoint basis,
via a MS_CASEFOLD flag, with the ultimate goal of supporting a
case-insensitive bind mount of a subtree, side-by-side with a
sensitive version of the filesystem.

I have a prototype code at

https://gitlab.collabora.com/krisman/linux.git -b vfs-ms_casefold

Which is *not fully functional*, since it confuses the dentry cache when
multiple mountpoints are installed, but it gives an idea of the design,
if anyone wants to review it.  Basically, I want to:

  - Add a new MS_CASEFOLD mount option, which flips a flag in struct
    vfsmount

  - When this flag is enabled, a LOOKUP_CASEFOLD flag is submitted to
  the fs .lookup() hook, asking it to perform a case-folded lookup.

  - LOOKUP_CASEFOLD also replaces .d_hash() and d_compare() with
    insensitive versions, provided by filesystems.

  - Allow "mount -o remount,bind" to flip the MNT_CASEFOLD flag, similar
    to what is done with the read-only setting.

  - filesystems that support the MS_CASEFOLD flag need to advertise
    support in struct file_system_type.  There will be no generic
    implementation of casefolding in the VFS layer for now.  Either the
    FS acknowledges support for it, or MS_CASEFOLD fails the mount
    operation.

This is implemented in the branch above (along with the required
modifications for EXT4) except for the issue in the dentry cache, that I
am still working on.

Do these changes to VFS seem acceptable?

Thanks,

Gabriel Krisman Bertazi (9):
  charsets: Introduce middle-layer for character encoding
  charsets: ascii: Wrap ascii functions to charsets library
  charsets: utf8: Hook-up utf-8 code to charsets library
  charsets: utf8: Introduce test module for kernel UTF-8 implementation
  ext4: Add ignorecase mount option
  ext4: Include encoding information on the superblock
  fscrypt: Introduce charset-based matching functions
  ext4: Support charset name matching
  ext4: Implement ext4 dcache hooks for custom charsets

Olaf Weber (4):
  charsets: utf8: Add unicode character database files
  scripts: add trie generator for UTF-8
  charsets: utf8: Introduce code for UTF-8 normalization
  charsets: utf8: reduce the size of utf8data[]

 fs/ext4/dir.c                   |   63 +
 fs/ext4/ext4.h                  |    6 +
 fs/ext4/namei.c                 |   27 +-
 fs/ext4/super.c                 |   35 +
 include/linux/charsets.h        |   73 +
 include/linux/fscrypt.h         |    1 +
 include/linux/fscrypt_notsupp.h |   16 +
 include/linux/fscrypt_supp.h    |   27 +
 include/linux/utf8norm.h        |  116 ++
 lib/Kconfig                     |   16 +
 lib/Makefile                    |    2 +
 lib/charsets/Makefile           |   24 +
 lib/charsets/ascii.c            |   98 ++
 lib/charsets/core.c             |   68 +
 lib/charsets/test_ucd.c         |  186 +++
 lib/charsets/ucd/README         |   33 +
 lib/charsets/utf8_core.c        |  178 ++
 lib/charsets/utf8norm.c         |  794 +++++++++
 scripts/Makefile                |    1 +
 scripts/mkutf8data.c            | 3464 +++++++++++++++++++++++++++++++++++++++
 20 files changed, 5219 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/charsets.h
 create mode 100644 include/linux/utf8norm.h
 create mode 100644 lib/charsets/Makefile
 create mode 100644 lib/charsets/ascii.c
 create mode 100644 lib/charsets/core.c
 create mode 100644 lib/charsets/test_ucd.c
 create mode 100644 lib/charsets/ucd/README
 create mode 100644 lib/charsets/utf8_core.c
 create mode 100644 lib/charsets/utf8norm.c
 create mode 100644 scripts/mkutf8data.c

Comments

Al Viro Jan. 25, 2018, 3:16 a.m. UTC | #1
On Thu, Jan 25, 2018 at 12:53:36AM -0200, Gabriel Krisman Bertazi wrote:

> The second proposal is related to the VFS layer:
> 
> (2) Enable Insensitive lookup support on a per-mountpoint basis,
> via a MS_CASEFOLD flag, with the ultimate goal of supporting a
> case-insensitive bind mount of a subtree, side-by-side with a
> sensitive version of the filesystem.

First reaction: No.  With the side of HELL NO.

Your ultimate goal sounds utterly insane - dcache tree must be shared
for all mounts.  Moreover, "would these two names refer to the same
object" can not be mount-dependent.  Not going to happen.

Please, post the description of what you are planning to do.
Detailed.  I'm not saying that it's 100% impossible to do correctly,
but I'm _very_ sceptical about the feasibility.

I'm certainly not going to ACK any VFS changes until you convince
me that this thing can be done with sane semantics.
Theodore Ts'o Jan. 25, 2018, 7:32 p.m. UTC | #2
On Thu, Jan 25, 2018 at 03:16:50AM +0000, Al Viro wrote:
> On Thu, Jan 25, 2018 at 12:53:36AM -0200, Gabriel Krisman Bertazi wrote:
> 
> > The second proposal is related to the VFS layer:
> > 
> > (2) Enable Insensitive lookup support on a per-mountpoint basis,
> > via a MS_CASEFOLD flag, with the ultimate goal of supporting a
> > case-insensitive bind mount of a subtree, side-by-side with a
> > sensitive version of the filesystem.
> 
> First reaction: No.  With the side of HELL NO.
> 
> Your ultimate goal sounds utterly insane - dcache tree must be shared
> for all mounts.  Moreover, "would these two names refer to the same
> object" can not be mount-dependent.  Not going to happen.
> 
> Please, post the description of what you are planning to do.
> Detailed.  I'm not saying that it's 100% impossible to do correctly,
> but I'm _very_ sceptical about the feasibility.

So I can't speak authoratatively about the Collabora's customer
requirements that drove this particular feature, but I can talk about
why I am interested in this feature --- although I understand why it's
hard to do correctly.

Android is currently using a horrible wrapfs scheme to provide case
insesitive lookups.  We talked about this last year at LSF/MM[1].  It
is using separate dentry entries for the upper and lower layers, and
it is *definitely* prone to races.  Running fsstress on the top and
lower layers simultaneously is a pretty simple way to induce hilarity,
although my understanding is that the developers working on it having
been hammering on it for a while so that the races that cause kernel
panicks or file system damage are rare.  I don't believe they have
been completely elimniated, because it's not clear to me that any
wrapfs solution can ever be made race-free.

[1] https://lwn.net/Articles/718640/

The problem is that both the case insensitive and case sensitive
directory trees are made available to userspace, and for backwards
compatibility reasons, they can't just make it disappear.  Although
they don't have a Linus Torvalds going all Mr. Shouty about breaking
userspace, the combined whinging from the App developer ecosystem
probably can be a fairly good substitute for Linus when they complain
about compatibility breakages visible to the Android application
programmer.  :-)

So my hope has been to find a way to (a) make wrapfs go away, and (b)
allow the functionality of case sensitive and case insensitive bind
mounts of the same underlying file system directories be *somehow*
supported.

If the answer is that if two files that differ only in case are
created via the case sensitive mount results in it being undefined
which file you get when you open, rename, or delete that file via the
case insensitive mount, that's fine.  I think that's what happpens in
the wrapfs implementation today anyway.  It would probably be fine if
the answer was you get the file with an exact match if it is present,
but if the two files created on the case sensitive side are "makefile"
and "Makefile", and you try open/delete "MaKeFiLe" via the case
insensitive mount, it's undefined which file you get --- that's fine.

Anyone who does that will get everything they deserve, and they get
that today anyway with the wrapfs or the previous fuse hack anyway.

This may indeed be an O_PONIES style request.  And in the worst case,
Android will continue using an out-of-tree wrapfs solution, and
hopefully I can fend off people asking me for help when the wrapfs
code blows up in various not-so-entertainng ways.  Which is,
admittedly, a purely selfish reason for why I would like to see if we
can give Android folks equivalent functionality to the wrapfs hack in
a sane, safe, and consensual way.   :-)

Cheers,

						- Ted
Gao Xiang Jan. 26, 2018, 2:52 a.m. UTC | #3
Hi Ted,

On 2018/1/26 3:32, Theodore Ts'o wrote:
> On Thu, Jan 25, 2018 at 03:16:50AM +0000, Al Viro wrote:

>> On Thu, Jan 25, 2018 at 12:53:36AM -0200, Gabriel Krisman Bertazi wrote:

>>

>>> The second proposal is related to the VFS layer:

>>>

>>> (2) Enable Insensitive lookup support on a per-mountpoint basis,

>>> via a MS_CASEFOLD flag, with the ultimate goal of supporting a

>>> case-insensitive bind mount of a subtree, side-by-side with a

>>> sensitive version of the filesystem.

>>

>> First reaction: No.  With the side of HELL NO.

>>

>> Your ultimate goal sounds utterly insane - dcache tree must be shared

>> for all mounts.  Moreover, "would these two names refer to the same

>> object" can not be mount-dependent.  Not going to happen.

>>

>> Please, post the description of what you are planning to do.

>> Detailed.  I'm not saying that it's 100% impossible to do correctly,

>> but I'm _very_ sceptical about the feasibility.

> 

> So I can't speak authoratatively about the Collabora's customer

> requirements that drove this particular feature, but I can talk about

> why I am interested in this feature --- although I understand why it's

> hard to do correctly.

> 

> Android is currently using a horrible wrapfs scheme to provide case

> insesitive lookups.  We talked about this last year at LSF/MM[1].  It

> is using separate dentry entries for the upper and lower layers, and

> it is *definitely* prone to races.  Running fsstress on the top and

> lower layers simultaneously is a pretty simple way to induce hilarity,

> although my understanding is that the developers working on it having

> been hammering on it for a while so that the races that cause kernel

> panicks or file system damage are rare.  I don't believe they have

> been completely elimniated, because it's not clear to me that any

> wrapfs solution can ever be made race-free.


Sorry for bother, I mean you refer to sdcardfs,
The current Samsung sdcardfs and esdfs are all fsstress broken for many issues,
However we developed an another approach last year, in

https://android-review.googlesource.com/#/c/kernel/common/+/574402/3/Documentation/filesystems/hwsdcardfs.txt

We applied it on HUAWEI Mate 10 (pro), and the complete solution is

https://android-review.googlesource.com/#/q/status:open+project:kernel/common+branch:android-4.4+topic:huawei_sdcardfs

It supports stackable case-insensitive without modifing VFS smoothly, 
and we tested fsstress on the top and lower layers at the same time 
without any problem.

> 

> [1] https://lwn.net/Articles/718640/

> 

> The problem is that both the case insensitive and case sensitive

> directory trees are made available to userspace, and for backwards

> compatibility reasons, they can't just make it disappear.  Although

> they don't have a Linus Torvalds going all Mr. Shouty about breaking

> userspace, the combined whinging from the App developer ecosystem

> probably can be a fairly good substitute for Linus when they complain

> about compatibility breakages visible to the Android application

> programmer.  :-)

> 

> So my hope has been to find a way to (a) make wrapfs go away, and (b)

> allow the functionality of case sensitive and case insensitive bind

> mounts of the same underlying file system directories be *somehow*

> supported.

> 

> If the answer is that if two files that differ only in case are

> created via the case sensitive mount results in it being undefined

> which file you get when you open, rename, or delete that file via the

> case insensitive mount, that's fine.  I think that's what happpens in

> the wrapfs implementation today anyway.  It would probably be fine if

> the answer was you get the file with an exact match if it is present,

> but if the two files created on the case sensitive side are "makefile"

> and "Makefile", and you try open/delete "MaKeFiLe" via the case

> insensitive mount, it's undefined which file you get --- that's fine.


We have a solution to solve case-insensitive stackable but without modifing VFS.

Thanks,

> 

> Anyone who does that will get everything they deserve, and they get

> that today anyway with the wrapfs or the previous fuse hack anyway.

> 

> This may indeed be an O_PONIES style request.  And in the worst case,

> Android will continue using an out-of-tree wrapfs solution, and

> hopefully I can fend off people asking me for help when the wrapfs

> code blows up in various not-so-entertainng ways.  Which is,

> admittedly, a purely selfish reason for why I would like to see if we

> can give Android folks equivalent functionality to the wrapfs hack in

> a sane, safe, and consensual way.   :-)

> 

> Cheers,

> 

> 						- Ted

>
Gabriel Krisman Bertazi Feb. 6, 2018, 2:24 a.m. UTC | #4
Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Thu, Jan 25, 2018 at 12:53:36AM -0200, Gabriel Krisman Bertazi wrote:
>
>> The second proposal is related to the VFS layer:
>> 
>> (2) Enable Insensitive lookup support on a per-mountpoint basis,
>> via a MS_CASEFOLD flag, with the ultimate goal of supporting a
>> case-insensitive bind mount of a subtree, side-by-side with a
>> sensitive version of the filesystem.
>
> First reaction: No.  With the side of HELL NO.
>
> Your ultimate goal sounds utterly insane - dcache tree must be shared
> for all mounts.  Moreover, "would these two names refer to the same
> object" can not be mount-dependent.  Not going to happen.
>
> Please, post the description of what you are planning to do.
> Detailed.  I'm not saying that it's 100% impossible to do correctly,
> but I'm _very_ sceptical about the feasibility.
>
> I'm certainly not going to ACK any VFS changes until you convince
> me that this thing can be done with sane semantics.

Hi Viro,

Sorry for the delay.  I wanted to get a better understanding of the
code before getting back to you.

Our customer is interested in exposing a subtree of an existing
filesystem (native Linux filesystems, xfs, ext4 and others) in an
case-insensitive lookup manner, without paying the cost of a userspace
getdents implementation, and, preferably, without requiring the user to
modify data on the disk.

After learning, last year, about the similar Android issue from Ted at a
conference, I started to believe a bind mount is the appropriate
solution for both our use cases.

Like Ted mentioned, we let the user shot herself in the foot by having
two files with the exact CI name, and which one she gets will be
undefined.  My current solution gets the exact match if it is present,
and my second approach could be adapted to do that with a performance
cost only if both files are not cached.  That seem to be acceptable by
Ted and shouldn't be a problem for me too.

Let me start by discussing what I have now, which is a functional
solution, to the best of my knowledge, but which still requires rework
to better benefit from the dentry cache.  Then I will discuss the
solutions I am working on.

(1) What I have now:

As I mentioned in the first email, MS_CASEFOLD triggers a MNT_CASEFOLD
flag in the vfsmount structure.  This flag, which gets recalculated
every time a mountpoint is traversed, is used to decide if we do a
casefolded ->lookup() for each component of the path.  This obviously
allows paths with multiple nested mountpoints to work.

The ->lookup() function calls d_add_ci(), making sure that only the
exact-match dentry is inserted in the cache.  This means that we don't
have multiple dentries, differing only by case, associated to the same
file.  This is an important semantic, which I am not modifying.  This
also means that a following CI lookup with an inexact-case match will
trigger another ->lookup(), which is bad, but is manageable (for now).
I address this problem later in this email.

Regarding negative dentries.  If we have a negative dentry in the cache,
the case-sensitive algorithm will abort the lookup and return the
dentry, nothing new here.  If, on the other hand, That component lookup
is under a MNT_CASEFOLD search, the code does a bigger effort, ignores
the negative dentry and still do ->lookup().

This means that:

 - Case-sensitive (CS) search performance is not affected, except for
   the following sequence: Do a CS(foo) which returns a negative_dentry,
   then CI(FOO), and finally CS(foo).  This will require the second
   CS(foo) to re-do the search and recreate the negative dentry.  If
   only CS searches are done, the performance should be unaffected.

 - Case-insensitive (CI) benefits from the dentry cache as long as it is
 doing an exact-case name match.  Otherwise, it always falls to
 ->lookup().  The exact-name dentry, if already in the cache, will be
 returned in both cases.

 - Negative dentries are ignored if doing CI and it always re-do the
   search.

The prototype code is in the following branch.  I can send
patches if you want.

 https://gitlab.collabora.com/krisman/linux.git -b vfs-ms_casefold

(2) What I want to do:

I want to allow lookups of the inexact name to find the cached dentry,
preventing it from going to the disk only to find out later it already
has that dentry cached in some other cache bucket.  This means that,
whenever there is a MNT_CASEFOLD enabled, every mountpoint of that
superblock needs to use d_hash(casefold(name)), such that every file is
in a known bucket.  Obviously this is a worse hash function, but the
cost is only payed by filesystems using the feature.  If a bind mount is
done, we must rehash all existing entries of that filesystem at
"bind,remount,ignorecase" time.

Like the previous version, we only store exact name dentries, even
though the bucket is different now.  When doing a LOOKUP_CASEFOLD, we
use a different ->d_compare_CI() to do the matching, such that we can
find the right dentry or not, in both scenarios (CI and CS).

We still can't trust negative dentries when doing a CI lookup. Consider
the case where we do CS(FOO) and foo doesn't exist. Now we have a
negative_dentry(FOO), but CI(FOO) would still need to find an existing
'foo' file.  The first solution is to make fast_CI_lookup() walk the
entire bucket and ignore previous matching negative dentries if a
positive one is found.  If no positive dentry is found, then fallback
and do ->lookup().  A superior solution is to mark negative dentries as
soft negatives and hard negatives, depending on what kind of lookup
created the dentry, whether CS or CI, respectively.  Then, we can allow
a CI lookup to return the negative dentry only if it is a HARD negative,
and try the ->lookup() otherwise.  For this to work, any dentry creation
on that parent requires that we invalidate hard negative children
dentries, and make them soft, at insertion/rename time.

With these changes I believe the dentry cache could be shared by code
doing CI and CS lookups of the same dentry.  My main concern is whether
we could invalidate a negative dentry without harming parallel lookups
using that dentry, but I believe I can work around that.

I believe this covers everything I could identify, but please let me
know if I missed something, obvious or not.
Gao Xiang Feb. 6, 2018, 3:21 a.m. UTC | #5
Hi Gabriel,

On 2018/2/6 10:24, Gabriel Krisman Bertazi wrote:
> Al Viro <viro@ZenIV.linux.org.uk> writes:
> 
>> On Thu, Jan 25, 2018 at 12:53:36AM -0200, Gabriel Krisman Bertazi wrote:
>>
>>> The second proposal is related to the VFS layer:
>>>
>>> (2) Enable Insensitive lookup support on a per-mountpoint basis,
>>> via a MS_CASEFOLD flag, with the ultimate goal of supporting a
>>> case-insensitive bind mount of a subtree, side-by-side with a
>>> sensitive version of the filesystem.
>>
>> First reaction: No.  With the side of HELL NO.
>>
>> Your ultimate goal sounds utterly insane - dcache tree must be shared
>> for all mounts.  Moreover, "would these two names refer to the same
>> object" can not be mount-dependent.  Not going to happen.
>>
>> Please, post the description of what you are planning to do.
>> Detailed.  I'm not saying that it's 100% impossible to do correctly,
>> but I'm _very_ sceptical about the feasibility.
>>
>> I'm certainly not going to ACK any VFS changes until you convince
>> me that this thing can be done with sane semantics.
> 
> Hi Viro,
> 
> Sorry for the delay.  I wanted to get a better understanding of the
> code before getting back to you.
> 
> Our customer is interested in exposing a subtree of an existing
> filesystem (native Linux filesystems, xfs, ext4 and others) in an
> case-insensitive lookup manner, without paying the cost of a userspace
> getdents implementation, and, preferably, without requiring the user to
> modify data on the disk.
Could I express my opinion? I have working on case-insensitive sdcardfs 
for months.

I think your problem is how we optimise a case-insensitive lookup on the 
file system with a case-sensitive dcache (I mean d_add and no d_compare 
and d_hash).
In that case, we could not trust the negative dentry when _creating_ a 
case-insensitive file, for example:
    there exists "anDroid" on-disk, but ext4's in-memory dcache only has 
the negative "Android", if we lookup "Android" we will get the 
_negative_ dentry, but we _cannot_ create it since "anDroid" exists on 
disk. In the create case, an on-disk _iterate_ (or readdir) is necessary.

I could give another example, if we uses case-insentive ext4 and create
"Android" and "anDroid", how to deal with the case in the 
case-insensitive way?
    I mean in that case we should make both "Android" and "anDroid" can 
access, right?
    I think we need to build a special case-sensitive dcache rather than 
a case-insensitive dcache following the native case-insentive fs(use 
d_add_ci, d_compare and d_hash, eg. fat, ntfs...)

Finally, I agree "let the user shot herself in the foot by having two 
files with the exact CI name", but I think it could not the VFS 
_busniess_ itself since each customer solution "case-sensitive ext4 -> 
case-insensitive lookup" has their _perfered_ way (for example, 
"android" and "Android" exist, A perfers android and B perfers Android.

Finally, I think for optmization, ext4 or other fs could add some dir 
inode _tag_ and supports native case-insensitive for these dirs could be 
better....

Sorry to bother,

Thanks,

> 
> After learning, last year, about the similar Android issue from Ted at a
> conference, I started to believe a bind mount is the appropriate
> solution for both our use cases.
> 
> Like Ted mentioned, we let the user shot herself in the foot by having
> two files with the exact CI name, and which one she gets will be
> undefined.  My current solution gets the exact match if it is present,
> and my second approach could be adapted to do that with a performance
> cost only if both files are not cached.  That seem to be acceptable by
> Ted and shouldn't be a problem for me too.
> 
> Let me start by discussing what I have now, which is a functional
> solution, to the best of my knowledge, but which still requires rework
> to better benefit from the dentry cache.  Then I will discuss the
> solutions I am working on.
> 
> (1) What I have now:
> 
> As I mentioned in the first email, MS_CASEFOLD triggers a MNT_CASEFOLD
> flag in the vfsmount structure.  This flag, which gets recalculated
> every time a mountpoint is traversed, is used to decide if we do a
> casefolded ->lookup() for each component of the path.  This obviously
> allows paths with multiple nested mountpoints to work.
> 
> The ->lookup() function calls d_add_ci(), making sure that only the
> exact-match dentry is inserted in the cache.  This means that we don't
> have multiple dentries, differing only by case, associated to the same
> file.  This is an important semantic, which I am not modifying.  This
> also means that a following CI lookup with an inexact-case match will
> trigger another ->lookup(), which is bad, but is manageable (for now).
> I address this problem later in this email.
> 
> Regarding negative dentries.  If we have a negative dentry in the cache,
> the case-sensitive algorithm will abort the lookup and return the
> dentry, nothing new here.  If, on the other hand, That component lookup
> is under a MNT_CASEFOLD search, the code does a bigger effort, ignores
> the negative dentry and still do ->lookup().
> 
> This means that:
> 
>   - Case-sensitive (CS) search performance is not affected, except for
>     the following sequence: Do a CS(foo) which returns a negative_dentry,
>     then CI(FOO), and finally CS(foo).  This will require the second
>     CS(foo) to re-do the search and recreate the negative dentry.  If
>     only CS searches are done, the performance should be unaffected.
> 
>   - Case-insensitive (CI) benefits from the dentry cache as long as it is
>   doing an exact-case name match.  Otherwise, it always falls to
>   ->lookup().  The exact-name dentry, if already in the cache, will be
>   returned in both cases.
> 
>   - Negative dentries are ignored if doing CI and it always re-do the
>     search.
> 
> The prototype code is in the following branch.  I can send
> patches if you want.
> 
>   https://gitlab.collabora.com/krisman/linux.git -b vfs-ms_casefold
> 
> (2) What I want to do:
> 
> I want to allow lookups of the inexact name to find the cached dentry,
> preventing it from going to the disk only to find out later it already
> has that dentry cached in some other cache bucket.  This means that,
> whenever there is a MNT_CASEFOLD enabled, every mountpoint of that
> superblock needs to use d_hash(casefold(name)), such that every file is
> in a known bucket.  Obviously this is a worse hash function, but the
> cost is only payed by filesystems using the feature.  If a bind mount is
> done, we must rehash all existing entries of that filesystem at
> "bind,remount,ignorecase" time.
> 
> Like the previous version, we only store exact name dentries, even
> though the bucket is different now.  When doing a LOOKUP_CASEFOLD, we
> use a different ->d_compare_CI() to do the matching, such that we can
> find the right dentry or not, in both scenarios (CI and CS).
> 
> We still can't trust negative dentries when doing a CI lookup. Consider
> the case where we do CS(FOO) and foo doesn't exist. Now we have a
> negative_dentry(FOO), but CI(FOO) would still need to find an existing
> 'foo' file.  The first solution is to make fast_CI_lookup() walk the
> entire bucket and ignore previous matching negative dentries if a
> positive one is found.  If no positive dentry is found, then fallback
> and do ->lookup().  A superior solution is to mark negative dentries as
> soft negatives and hard negatives, depending on what kind of lookup
> created the dentry, whether CS or CI, respectively.  Then, we can allow
> a CI lookup to return the negative dentry only if it is a HARD negative,
> and try the ->lookup() otherwise.  For this to work, any dentry creation
> on that parent requires that we invalidate hard negative children
> dentries, and make them soft, at insertion/rename time.
> 
> With these changes I believe the dentry cache could be shared by code
> doing CI and CS lookups of the same dentry.  My main concern is whether
> we could invalidate a negative dentry without harming parallel lookups
> using that dentry, but I believe I can work around that.
> 
> I believe this covers everything I could identify, but please let me
> know if I missed something, obvious or not.
>
Gabriel Krisman Bertazi Feb. 12, 2018, 7:56 p.m. UTC | #6
Gao Xiang <gaoxiang25@huawei.com> writes:

> Could I express my opinion? I have working on case-insensitive sdcardfs
> for months.

Hi Gao,

Thanks for helping out with this topic.

> I think your problem is how we optimise a case-insensitive lookup on the
> file system with a case-sensitive dcache (I mean d_add and no d_compare
> and d_hash).

Are d_compare and d_hash to be considered really disruptive
performance-wise?  Even if they are only used when casefold/encoding
support is enabled?  I don't see how we could better use the dcache
without at least requiring these functions to handle CI cases.

> In that case, we could not trust the negative dentry when _creating_ a
> case-insensitive file, for example:
>    there exists "anDroid" on-disk, but ext4's in-memory dcache only has
> the negative "Android", if we lookup "Android" we will get the
> _negative_ dentry, but we _cannot_ create it since "anDroid" exists on
> disk. In the create case, an on-disk _iterate_ (or readdir) is
> necessary.

In my previous email, I mentioned my current implementation ignores
negative dentries and forces a ->lookup(), which walks over the disk
entries.  (I had to add a fix to the creation path in the vfs-ms_casefold
branch to exactly match that description, so you might have missed the
updated version in that branch).

Either way, this case is supported like this:

If we have two bind-mounts of the same directory, /mnt and /mnt-ci,
case-sensitive and case-insensitive, respectively,  We can do:

open("/mnt/anDroid", O_EXCL|O_CREAT) = 3
open("/mnt/Android", 0) = -2 No such file or directory
open("/mnt-ci/Android", 0) = 4
open("/mnt-ci/Android", O_EXCL|O_CREAT) = -17 File exists
open("/mnt-ci/AndROID", O_EXCL|O_CREAT) = -17 File exists

The second open() is expected to create an negative_dentry of "Android",
which, if it wasn't ignored by the 3th open(), the CI operation would
have failed.  Notice that the 3th open() operation actually opens the
file that was created by the first open().  It doesn't create a new
file.

Following on, the 4th operation (file creation) *must fail* because
there is a CI name collision with /mnt-ci/anDroid.  The same is true for
the final case.

> I could give another example, if we uses case-insentive ext4 and create
> "Android" and "anDroid", how to deal with the case in the
> case-insensitive way?
>    I mean in that case we should make both "Android" and "anDroid" can
> access, right?

Not sure if I follow you here, but I'm assuming we create Android and
anDroid in the sensitive mountpoint, because, otherwise the
second file creation in the insensitive mountpoint would fail.

This is the case where I'm hiding one of the previously (CS) created
files, when in the insensitive mountpoint, and the user is shooting
himself.  For the sensitive case, Both stays visible to the user.

>    I think we need to build a special case-sensitive dcache rather than
> a case-insensitive dcache following the native case-insentive fs(use
> d_add_ci, d_compare and d_hash, eg. fat, ntfs...)

What do you think about the second part of my proposal, where I mention
dealing differently with negative dentries created by a CI lookup?
We don't need to ignore them if we can invalidate them after a creation
in the directory.

> Finally, I agree "let the user shot herself in the foot by having two
> files with the exact CI name", but I think it could not the VFS
> _busniess_ itself since each customer solution "case-sensitive ext4 -> 
> case-insensitive lookup" has their _perfered_ way (for example,
> "android" and "Android" exist, A perfers android and B perfers Android.

I don't see how we could defer the decision to the filesystem, that's a
pretty good problem, which I don't have a solution right now.

> Finally, I think for optmization, ext4 or other fs could add some dir
> inode _tag_ and supports native case-insensitive for these dirs could be
> better....

Agreed. But I'm seeing this as outside the scope of my proposal, since it
is specific to each filesystem.  My ext4 adaptation, for instance, falls
back to linear search when it can't find the exact match.

Thanks,
Gao Xiang Feb. 12, 2018, 10:43 p.m. UTC | #7
Hi Gabriel,

Thanks for your reply.

On 2018/2/13 3:56, Gabriel Krisman Bertazi wrote:
> Gao Xiang <gaoxiang25@huawei.com> writes:
>
>> Could I express my opinion? I have working on case-insensitive sdcardfs
>> for months.
> Hi Gao,
>
> Thanks for helping out with this topic.
>
>> I think your problem is how we optimise a case-insensitive lookup on the
>> file system with a case-sensitive dcache (I mean d_add and no d_compare
>> and d_hash).
> Are d_compare and d_hash to be considered really disruptive
> performance-wise?  Even if they are only used when casefold/encoding
> support is enabled?  I don't see how we could better use the dcache
> without at least requiring these functions to handle CI cases.
I mean if both "Android" and "anDroid" exist in the same directory of 
on-disk ext4,
I could tend to make both "Android" and "anDroid" accessed by
the case-exact name lookup, otherwise do a case-insensitive lookup. eg. 
(my current implementation),
1. open("/mnt/anDroid", O_EXCL|O_CREAT)
2. open("/mnt/Android", O_EXCL|O_CREAT)
3. open("/mnt-ci/Android") --- exactly Android
4. open("/mnt-ci/anDroid") --- exactly anDroid
5. open("/mnt-ci/ANDROID") --- Android or anDroid, do a case-insensitive 
lookup.
6. unlink("/mnt-ci/Android") --- exactly Android
7. unlink("/mnt-ci/ANDROID") --- anDroid, do a case-insensitive lookup.

>> In that case, we could not trust the negative dentry when _creating_ a
>> case-insensitive file, for example:
>>     there exists "anDroid" on-disk, but ext4's in-memory dcache only has
>> the negative "Android", if we lookup "Android" we will get the
>> _negative_ dentry, but we _cannot_ create it since "anDroid" exists on
>> disk. In the create case, an on-disk _iterate_ (or readdir) is
>> necessary.
> In my previous email, I mentioned my current implementation ignores
> negative dentries and forces a ->lookup(), which walks over the disk
> entries.  (I had to add a fix to the creation path in the vfs-ms_casefold
> branch to exactly match that description, so you might have missed the
> updated version in that branch).
I am sorry, I haven't looked into your CI patch yet.

I decided to join in this topic because in the past months, I did some work
on the case-insensitive sdcardfs, and I just want to express my thoughts 
for this topic.

I don't know which level negative dentries you ignore, your 
case-insensitive negative dentries
or the original underlayfs(eg. ext4) case-sensitive negative dentries?

Actually I observed some false "negative dentries" race or deadlock
if multiple case-insensitive mounts exist and do fs-ops on these mounts 
at the same time,
but I am not sure whether your implementation has these potential issues 
in principle or not.

In addition, quote `
Our customer is interested in exposing a subtree of an existing
filesystem (native Linux filesystems, xfs, ext4 and others) in an
case-insensitive lookup manner, without paying the cost of a userspace
getdents implementation, and, preferably, without requiring the user to
modify data on the disk.`

I think the *most expensive* operation for case-insensitive lookup is to
create a not-exist file rather than do a existed-file lookup.
It takes much overhead and I think no straightforward way to directly 
reduce the cost
and improve its performance.

>
> Either way, this case is supported like this:
>
> If we have two bind-mounts of the same directory, /mnt and /mnt-ci,
> case-sensitive and case-insensitive, respectively,  We can do:
>
> open("/mnt/anDroid", O_EXCL|O_CREAT) = 3
> open("/mnt/Android", 0) = -2 No such file or directory
> open("/mnt-ci/Android", 0) = 4
> open("/mnt-ci/Android", O_EXCL|O_CREAT) = -17 File exists
> open("/mnt-ci/AndROID", O_EXCL|O_CREAT) = -17 File exists
>
> The second open() is expected to create an negative_dentry of "Android",
> which, if it wasn't ignored by the 3th open(), the CI operation would
> have failed.  Notice that the 3th open() operation actually opens the
Yes, the 3th open() should invalidate the 2nd negative dentry.
However, I don't know your detailed implementation behavior, for example 
as follows:

1. open("/mnt/anDroid", O_EXCL|O_CREAT)
2. open("/mnt/Android", 0)                 ---- dentry A
3. open("/mnt-ci/Android", 0)             ---- CI-dentry B or the same 
dentry A?

if 2 and 3 are the different denties but the same inode, furthermore,
if the inode is a directory inode, it seems like a hard link directory,
any *potential deadlock* with that?

and how about?
1. open("/mnt/anDroid", O_EXCL|O_CREAT)             --- inode A
2. close("/mnt/anDroid") --- still positive, referenced
3. open("/mnt/Android", 0) --- No such file or directory
4. open("/mnt-ci/Android", 0) --- positive, inode A
5. close("/mnt-ci/Android") --- still positive, referenced
6. open("/mnt/android", O_EXCL|O_CREAT)             --- inode B
7. unlink("/mnt/anDroid")   --- expected behavior? since we have 2) and 
5) referenced, can the inode finally be evicted? ---
8. open("/mnt-ci/Android", 0) --- ??? positive and inode B? ---

> file that was created by the first open().  It doesn't create a new
> file.
>
> Following on, the 4th operation (file creation) *must fail* because
> there is a CI name collision with /mnt-ci/anDroid.  The same is true for
> the final case.
>
>> I could give another example, if we uses case-insentive ext4 and create
>> "Android" and "anDroid", how to deal with the case in the
>> case-insensitive way?
>>     I mean in that case we should make both "Android" and "anDroid" can
>> access, right?
> Not sure if I follow you here, but I'm assuming we create Android and
> anDroid in the sensitive mountpoint, because, otherwise the
> second file creation in the insensitive mountpoint would fail.
>
> This is the case where I'm hiding one of the previously (CS) created
> files, when in the insensitive mountpoint, and the user is shooting
> himself.  For the sensitive case, Both stays visible to the user.
As I mentioned above....
1. open("/mnt/anDroid", O_EXCL|O_CREAT)
2. open("/mnt/Android", O_EXCL|O_CREAT)
3. open("/mnt-ci/Android") --- exactly "Android"
4. open("/mnt-ci/anDroid") --- exactly "anDroid"
5. open("/mnt-ci/ANDROID") --- "Android" or "anDroid", do a 
case-insensitive lookup.
6. unlink("/mnt-ci/Android") --- exactly "Android"
7. unlink("/mnt-ci/ANDROID") --- "anDroid", do a case-insensitive lookup.

>>     I think we need to build a special case-sensitive dcache rather than
>> a case-insensitive dcache following the native case-insentive fs(use
>> d_add_ci, d_compare and d_hash, eg. fat, ntfs...)
> What do you think about the second part of my proposal, where I mention
> dealing differently with negative dentries created by a CI lookup?
> We don't need to ignore them if we can invalidate them after a creation
> in the directory.
I feel some complex of your second part... I still need some time to 
look into that...
and I think it is useless of negative dentries for that case 
intuitively.....
>
>> Finally, I agree "let the user shot herself in the foot by having two
>> files with the exact CI name", but I think it could not the VFS
>> _busniess_ itself since each customer solution "case-sensitive ext4 ->
>> case-insensitive lookup" has their _perfered_ way (for example,
>> "android" and "Android" exist, A perfers android and B perfers Android.
> I don't see how we could defer the decision to the filesystem, that's a
> pretty good problem, which I don't have a solution right now.
>
>> Finally, I think for optmization, ext4 or other fs could add some dir
>> inode _tag_ and supports native case-insensitive for these dirs could be
>> better....
> Agreed. But I'm seeing this as outside the scope of my proposal, since it
> is specific to each filesystem.  My ext4 adaptation, for instance, falls
> back to linear search when it can't find the exact match.
>
> Thanks,
>
I could miss something important, if I recall, I will reply in the 
future....

Thanks,
Gabriel Krisman Bertazi Feb. 13, 2018, 10:20 p.m. UTC | #8
Gao Xiang <hsiangkao@aol.com> writes:

> Hi Gabriel,
>
> Thanks for your reply.
>
> On 2018/2/13 3:56, Gabriel Krisman Bertazi wrote:
>> Gao Xiang <gaoxiang25@huawei.com> writes:
>>
>>> Could I express my opinion? I have working on case-insensitive sdcardfs
>>> for months.
>> Hi Gao,
>>
>> Thanks for helping out with this topic.
>>
>>> I think your problem is how we optimise a case-insensitive lookup on the
>>> file system with a case-sensitive dcache (I mean d_add and no d_compare
>>> and d_hash).
>> Are d_compare and d_hash to be considered really disruptive
>> performance-wise?  Even if they are only used when casefold/encoding
>> support is enabled?  I don't see how we could better use the dcache
>> without at least requiring these functions to handle CI cases.
> I mean if both "Android" and "anDroid" exist in the same directory of
> on-disk ext4,
> I could tend to make both "Android" and "anDroid" accessed by
> the case-exact name lookup, otherwise do a case-insensitive
> lookup. eg. (my current implementation),
> 1. open("/mnt/anDroid", O_EXCL|O_CREAT)
> 2. open("/mnt/Android", O_EXCL|O_CREAT)
> 3. open("/mnt-ci/Android") --- exactly Android
> 4. open("/mnt-ci/anDroid") --- exactly anDroid
> 5. open("/mnt-ci/ANDROID") --- Android or anDroid, do a case-insensitive
> lookup.
> 6. unlink("/mnt-ci/Android") --- exactly Android
> 7. unlink("/mnt-ci/ANDROID") --- anDroid, do a case-insensitive lookup.

Hmm, let me correct what I said before.  My code is not fetching the
exact-match in this case, it always returns the result of a CI lookup.

Still, I adapted it last night to return the exact-case file, by
touching the lookup hook itself, which I believe allows it to be made a
filesystem decision, despite being much more expensive.

Apart from that, can we revisit this design decision ?  It feels a bit
awkward that we can change the file returned by the lookup by doing an
unlink() or creat() on the other mountpoint.

$ echo foo > /mnt/bla
$ cat /mnt-ci/BLA
foo
$ echo bar > /mnt/BLA
$ cat /mnt-ci/BLA     <--- Same lookup has different result now
bar

This seems awfully confusing.  Maybe that's the nature of CI, but,
wouldn't entirely hiding the duplicated file be less confusing?  Is
there any benefit or use case that requires this behavior?


>>> In that case, we could not trust the negative dentry when _creating_ a
>>> case-insensitive file, for example:
>>>     there exists "anDroid" on-disk, but ext4's in-memory dcache only has
>>> the negative "Android", if we lookup "Android" we will get the
>>> _negative_ dentry, but we _cannot_ create it since "anDroid" exists on
>>> disk. In the create case, an on-disk _iterate_ (or readdir) is
>>> necessary.
>> In my previous email, I mentioned my current implementation ignores
>> negative dentries and forces a ->lookup(), which walks over the disk
>> entries.  (I had to add a fix to the creation path in the vfs-ms_casefold
>> branch to exactly match that description, so you might have missed the
>> updated version in that branch).
> I am sorry, I haven't looked into your CI patch yet.
>
> I decided to join in this topic because in the past months, I did some work
> on the case-insensitive sdcardfs, and I just want to express my thoughts
> for this topic.
>
> I don't know which level negative dentries you ignore, your
> case-insensitive negative dentries
> or the original underlayfs(eg. ext4) case-sensitive negative dentries?

I'm talking about the existing case-sensitive negative dentries.  I'm
not adding any dentries other than the ones created by the underlayfs

> Actually I observed some false "negative dentries" race or deadlock
> if multiple case-insensitive mounts exist and do fs-ops on these mounts
> at the same time,
> but I am not sure whether your implementation has these potential issues
> in principle or not.

Do you have a test case?  I didn't observe that, but I might not be
stressing my code enough.

>
> In addition, quote `
> Our customer is interested in exposing a subtree of an existing
> filesystem (native Linux filesystems, xfs, ext4 and others) in an
> case-insensitive lookup manner, without paying the cost of a userspace
> getdents implementation, and, preferably, without requiring the user to
> modify data on the disk.`
>
> I think the *most expensive* operation for case-insensitive lookup is to
> create a not-exist file rather than do a existed-file lookup.
> It takes much overhead and I think no straightforward way to directly
> reduce the cost
> and improve its performance.
>
>>
>> Either way, this case is supported like this:
>>
>> If we have two bind-mounts of the same directory, /mnt and /mnt-ci,
>> case-sensitive and case-insensitive, respectively,  We can do:
>>
>> open("/mnt/anDroid", O_EXCL|O_CREAT) = 3
>> open("/mnt/Android", 0) = -2 No such file or directory
>> open("/mnt-ci/Android", 0) = 4
>> open("/mnt-ci/Android", O_EXCL|O_CREAT) = -17 File exists
>> open("/mnt-ci/AndROID", O_EXCL|O_CREAT) = -17 File exists
>>
>> The second open() is expected to create an negative_dentry of "Android",
>> which, if it wasn't ignored by the 3th open(), the CI operation would
>> have failed.  Notice that the 3th open() operation actually opens the
> Yes, the 3th open() should invalidate the 2nd negative dentry.
> However, I don't know your detailed implementation behavior, for example
> as follows:
>
> 1. open("/mnt/anDroid", O_EXCL|O_CREAT)
> 2. open("/mnt/Android", 0)                 ---- dentry A
> 3. open("/mnt-ci/Android", 0)             ---- CI-dentry B or the same
> dentry A?
>
> if 2 and 3 are the different denties but the same inode, furthermore,
> if the inode is a directory inode, it seems like a hard link directory,
> any *potential deadlock* with that?

2 and 3 are the same dentry, not a duplicate or anything else.  The
point is to avoid dealing with multiple dentries kind of complexity.


> and how about?
> 1. open("/mnt/anDroid", O_EXCL|O_CREAT)             --- inode A
> 2. close("/mnt/anDroid") --- still positive, referenced
> 3. open("/mnt/Android", 0) --- No such file or directory
> 4. open("/mnt-ci/Android", 0) --- positive, inode A
> 5. close("/mnt-ci/Android") --- still positive, referenced
> 6. open("/mnt/android", O_EXCL|O_CREAT)             --- inode B
> 7. unlink("/mnt/anDroid")   --- expected behavior? since we have 2) and
> 5) referenced, can the inode finally be evicted? ---
> 8. open("/mnt-ci/Android", 0) --- ??? positive and inode B? ---

If I understand correctly, since the dentry recovered by 1 and 4 are the
same, 7 should unlink /mnt/anDroid and make it negative, since there are
no more references.

Since /mnt/anDroid no longer exists we are sure that step 8. when
accessing /mnt-ci/Android will return inode B, since that is the only
one remaining.  I think this is it.

>> file that was created by the first open().  It doesn't create a new
>> file.
>>
>> Following on, the 4th operation (file creation) *must fail* because
>> there is a CI name collision with /mnt-ci/anDroid.  The same is true for
>> the final case.
>>
>>> I could give another example, if we uses case-insentive ext4 and create
>>> "Android" and "anDroid", how to deal with the case in the
>>> case-insensitive way?
>>>     I mean in that case we should make both "Android" and "anDroid" can
>>> access, right?
>> Not sure if I follow you here, but I'm assuming we create Android and
>> anDroid in the sensitive mountpoint, because, otherwise the
>> second file creation in the insensitive mountpoint would fail.
>>
>> This is the case where I'm hiding one of the previously (CS) created
>> files, when in the insensitive mountpoint, and the user is shooting
>> himself.  For the sensitive case, Both stays visible to the user.
> As I mentioned above....
> 1. open("/mnt/anDroid", O_EXCL|O_CREAT) 
> 2. open("/mnt/Android", O_EXCL|O_CREAT)
> 3. open("/mnt-ci/Android") --- exactly "Android"
> 4. open("/mnt-ci/anDroid") --- exactly "anDroid"
> 5. open("/mnt-ci/ANDROID") --- "Android" or "anDroid", do a
> case-insensitive lookup.
> 6. unlink("/mnt-ci/Android") --- exactly "Android"
> 7. unlink("/mnt-ci/ANDROID") --- "anDroid", do a case-insensitive
> lookup.

This depends on my question about the behavior of doing an exact name
lookup.

>
>>>     I think we need to build a special case-sensitive dcache rather than
>>> a case-insensitive dcache following the native case-insentive fs(use
>>> d_add_ci, d_compare and d_hash, eg. fat, ntfs...)
>> What do you think about the second part of my proposal, where I mention
>> dealing differently with negative dentries created by a CI lookup?
>> We don't need to ignore them if we can invalidate them after a creation
>> in the directory.
> I feel some complex of your second part... I still need some time to
> look into that...
> and I think it is useless of negative dentries for that case
> intuitively.....
>>
>>> Finally, I agree "let the user shot herself in the foot by having two
>>> files with the exact CI name", but I think it could not the VFS
>>> _busniess_ itself since each customer solution "case-sensitive ext4 ->
>>> case-insensitive lookup" has their _perfered_ way (for example,
>>> "android" and "Android" exist, A perfers android and B perfers Android.
>> I don't see how we could defer the decision to the filesystem, that's a
>> pretty good problem, which I don't have a solution right now.
>>
>>> Finally, I think for optmization, ext4 or other fs could add some dir
>>> inode _tag_ and supports native case-insensitive for these dirs could be
>>> better....
>> Agreed. But I'm seeing this as outside the scope of my proposal, since it
>> is specific to each filesystem.  My ext4 adaptation, for instance, falls
>> back to linear search when it can't find the exact match.
>>
>> Thanks,
>>
> I could miss something important, if I recall, I will reply in the
> future....
>
> Thanks,
Gao Xiang Feb. 14, 2018, 12:27 p.m. UTC | #9
Hi Gabriel,

On 2018/2/14 6:20, Gabriel Krisman Bertazi wrote:
> Gao Xiang <hsiangkao@aol.com> writes:
>
>> Hi Gabriel,
>>
>> Thanks for your reply.
>>
>> On 2018/2/13 3:56, Gabriel Krisman Bertazi wrote:
>>> Gao Xiang <gaoxiang25@huawei.com> writes:
>>>
>>>> Could I express my opinion? I have working on case-insensitive sdcardfs
>>>> for months.
>>> Hi Gao,
>>>
>>> Thanks for helping out with this topic.
>>>
>>>> I think your problem is how we optimise a case-insensitive lookup on the
>>>> file system with a case-sensitive dcache (I mean d_add and no d_compare
>>>> and d_hash).
>>> Are d_compare and d_hash to be considered really disruptive
>>> performance-wise?  Even if they are only used when casefold/encoding
>>> support is enabled?  I don't see how we could better use the dcache
>>> without at least requiring these functions to handle CI cases.
>> I mean if both "Android" and "anDroid" exist in the same directory of
>> on-disk ext4,
>> I could tend to make both "Android" and "anDroid" accessed by
>> the case-exact name lookup, otherwise do a case-insensitive
>> lookup. eg. (my current implementation),
>> 1. open("/mnt/anDroid", O_EXCL|O_CREAT)
>> 2. open("/mnt/Android", O_EXCL|O_CREAT)
>> 3. open("/mnt-ci/Android") --- exactly Android
>> 4. open("/mnt-ci/anDroid") --- exactly anDroid
>> 5. open("/mnt-ci/ANDROID") --- Android or anDroid, do a case-insensitive
>> lookup.
>> 6. unlink("/mnt-ci/Android") --- exactly Android
>> 7. unlink("/mnt-ci/ANDROID") --- anDroid, do a case-insensitive lookup.
> Hmm, let me correct what I said before.  My code is not fetching the
> exact-match in this case, it always returns the result of a CI lookup.
>
> Still, I adapted it last night to return the exact-case file, by
> touching the lookup hook itself, which I believe allows it to be made a
> filesystem decision, despite being much more expensive.
I don't think it is a purely filesystem decision.
CI selection depends on the users...(eg. some users perfer the smallest 
alphabet order,
some users select CI file by time, such as the oldest one)...
IMO, it behaves as a file system filter users could select by themselves
rather than the part of VFS or filesystems.
> Apart from that, can we revisit this design decision ?  It feels a bit
> awkward that we can change the file returned by the lookup by doing an
> unlink() or creat() on the other mountpoint.
>
> $ echo foo > /mnt/bla
> $ cat /mnt-ci/BLA
> foo
> $ echo bar > /mnt/BLA
> $ cat /mnt-ci/BLA     <--- Same lookup has different result now
> bar
>
> This seems awfully confusing.  Maybe that's the nature of CI, but,
> wouldn't entirely hiding the duplicated file be less confusing?  Is
> there any benefit or use case that requires this behavior?
It seems no confusing... bacause the user creates a new case-exact 
"/mnt/BLA", and
the first "cat /mnt-ci/BLA" and the second "cat /mnt-ci/BLA" are not the 
same fd.
If the exact-case file exist, I think we should *give a way* to get
these files by the exact-case names on the CI mountpoint, and that is 
why I said
"a case-insensitive lookup on the file system with a case-sensitive dcache",
since it can create/rename a case-exact file on the original 
case-sensitive mnt.

BTW, If you feel confused, how about the following?

1) echo foo > /mnt/bla
2) echo bar > /mnt/BLA
3) cat /mnt-ci/bla    --- assume it is "/mnt/bla"
4) rm /mnt/bla
3) cat /mnt-ci/bla   ---> it will be "/mnt/BLA" because the user 
operates /mnt

I don't think these two have too much difference...
>
>>>> In that case, we could not trust the negative dentry when _creating_ a
>>>> case-insensitive file, for example:
>>>>      there exists "anDroid" on-disk, but ext4's in-memory dcache only has
>>>> the negative "Android", if we lookup "Android" we will get the
>>>> _negative_ dentry, but we _cannot_ create it since "anDroid" exists on
>>>> disk. In the create case, an on-disk _iterate_ (or readdir) is
>>>> necessary.
>>> In my previous email, I mentioned my current implementation ignores
>>> negative dentries and forces a ->lookup(), which walks over the disk
>>> entries.  (I had to add a fix to the creation path in the vfs-ms_casefold
>>> branch to exactly match that description, so you might have missed the
>>> updated version in that branch).
>> I am sorry, I haven't looked into your CI patch yet.
>>
>> I decided to join in this topic because in the past months, I did some work
>> on the case-insensitive sdcardfs, and I just want to express my thoughts
>> for this topic.
>>
>> I don't know which level negative dentries you ignore, your
>> case-insensitive negative dentries
>> or the original underlayfs(eg. ext4) case-sensitive negative dentries?
> I'm talking about the existing case-sensitive negative dentries.  I'm
> not adding any dentries other than the ones created by the underlayfs
OK, if you don't create duplicated dentries, I don't think it has the 
following two potential issues.
I have no more question at the moment, and I need to look into your 
detailed implementation first.
Thanks for your reply.

Thanks,
>> Actually I observed some false "negative dentries" race or deadlock
>> if multiple case-insensitive mounts exist and do fs-ops on these mounts
>> at the same time,
>> but I am not sure whether your implementation has these potential issues
>> in principle or not.
> Do you have a test case?  I didn't observe that, but I might not be
> stressing my code enough.
>
>> In addition, quote `
>> Our customer is interested in exposing a subtree of an existing
>> filesystem (native Linux filesystems, xfs, ext4 and others) in an
>> case-insensitive lookup manner, without paying the cost of a userspace
>> getdents implementation, and, preferably, without requiring the user to
>> modify data on the disk.`
>>
>> I think the *most expensive* operation for case-insensitive lookup is to
>> create a not-exist file rather than do a existed-file lookup.
>> It takes much overhead and I think no straightforward way to directly
>> reduce the cost
>> and improve its performance.
>>
>>> Either way, this case is supported like this:
>>>
>>> If we have two bind-mounts of the same directory, /mnt and /mnt-ci,
>>> case-sensitive and case-insensitive, respectively,  We can do:
>>>
>>> open("/mnt/anDroid", O_EXCL|O_CREAT) = 3
>>> open("/mnt/Android", 0) = -2 No such file or directory
>>> open("/mnt-ci/Android", 0) = 4
>>> open("/mnt-ci/Android", O_EXCL|O_CREAT) = -17 File exists
>>> open("/mnt-ci/AndROID", O_EXCL|O_CREAT) = -17 File exists
>>>
>>> The second open() is expected to create an negative_dentry of "Android",
>>> which, if it wasn't ignored by the 3th open(), the CI operation would
>>> have failed.  Notice that the 3th open() operation actually opens the
>> Yes, the 3th open() should invalidate the 2nd negative dentry.
>> However, I don't know your detailed implementation behavior, for example
>> as follows:
>>
>> 1. open("/mnt/anDroid", O_EXCL|O_CREAT)
>> 2. open("/mnt/Android", 0)                 ---- dentry A
>> 3. open("/mnt-ci/Android", 0)             ---- CI-dentry B or the same
>> dentry A?
>>
>> if 2 and 3 are the different denties but the same inode, furthermore,
>> if the inode is a directory inode, it seems like a hard link directory,
>> any *potential deadlock* with that?
> 2 and 3 are the same dentry, not a duplicate or anything else.  The
> point is to avoid dealing with multiple dentries kind of complexity.
>
>
>> and how about?
>> 1. open("/mnt/anDroid", O_EXCL|O_CREAT)             --- inode A
>> 2. close("/mnt/anDroid") --- still positive, referenced
>> 3. open("/mnt/Android", 0) --- No such file or directory
>> 4. open("/mnt-ci/Android", 0) --- positive, inode A
>> 5. close("/mnt-ci/Android") --- still positive, referenced
>> 6. open("/mnt/android", O_EXCL|O_CREAT)             --- inode B
>> 7. unlink("/mnt/anDroid")   --- expected behavior? since we have 2) and
>> 5) referenced, can the inode finally be evicted? ---
>> 8. open("/mnt-ci/Android", 0) --- ??? positive and inode B? ---
> If I understand correctly, since the dentry recovered by 1 and 4 are the
> same, 7 should unlink /mnt/anDroid and make it negative, since there are
> no more references.
>
> Since /mnt/anDroid no longer exists we are sure that step 8. when
> accessing /mnt-ci/Android will return inode B, since that is the only
> one remaining.  I think this is it.
>
>>> file that was created by the first open().  It doesn't create a new
>>> file.
>>>
>>> Following on, the 4th operation (file creation) *must fail* because
>>> there is a CI name collision with /mnt-ci/anDroid.  The same is true for
>>> the final case.
>>>
>>>> I could give another example, if we uses case-insentive ext4 and create
>>>> "Android" and "anDroid", how to deal with the case in the
>>>> case-insensitive way?
>>>>      I mean in that case we should make both "Android" and "anDroid" can
>>>> access, right?
>>> Not sure if I follow you here, but I'm assuming we create Android and
>>> anDroid in the sensitive mountpoint, because, otherwise the
>>> second file creation in the insensitive mountpoint would fail.
>>>
>>> This is the case where I'm hiding one of the previously (CS) created
>>> files, when in the insensitive mountpoint, and the user is shooting
>>> himself.  For the sensitive case, Both stays visible to the user.
>> As I mentioned above....
>> 1. open("/mnt/anDroid", O_EXCL|O_CREAT)
>> 2. open("/mnt/Android", O_EXCL|O_CREAT)
>> 3. open("/mnt-ci/Android") --- exactly "Android"
>> 4. open("/mnt-ci/anDroid") --- exactly "anDroid"
>> 5. open("/mnt-ci/ANDROID") --- "Android" or "anDroid", do a
>> case-insensitive lookup.
>> 6. unlink("/mnt-ci/Android") --- exactly "Android"
>> 7. unlink("/mnt-ci/ANDROID") --- "anDroid", do a case-insensitive
>> lookup.
> This depends on my question about the behavior of doing an exact name
> lookup.
>
>>>>      I think we need to build a special case-sensitive dcache rather than
>>>> a case-insensitive dcache following the native case-insentive fs(use
>>>> d_add_ci, d_compare and d_hash, eg. fat, ntfs...)
>>> What do you think about the second part of my proposal, where I mention
>>> dealing differently with negative dentries created by a CI lookup?
>>> We don't need to ignore them if we can invalidate them after a creation
>>> in the directory.
>> I feel some complex of your second part... I still need some time to
>> look into that...
>> and I think it is useless of negative dentries for that case
>> intuitively.....
>>>> Finally, I agree "let the user shot herself in the foot by having two
>>>> files with the exact CI name", but I think it could not the VFS
>>>> _busniess_ itself since each customer solution "case-sensitive ext4 ->
>>>> case-insensitive lookup" has their _perfered_ way (for example,
>>>> "android" and "Android" exist, A perfers android and B perfers Android.
>>> I don't see how we could defer the decision to the filesystem, that's a
>>> pretty good problem, which I don't have a solution right now.
>>>
>>>> Finally, I think for optmization, ext4 or other fs could add some dir
>>>> inode _tag_ and supports native case-insensitive for these dirs could be
>>>> better....
>>> Agreed. But I'm seeing this as outside the scope of my proposal, since it
>>> is specific to each filesystem.  My ext4 adaptation, for instance, falls
>>> back to linear search when it can't find the exact match.
>>>
>>> Thanks,
>>>
>> I could miss something important, if I recall, I will reply in the
>> future....
>>
>> Thanks,
Greg KH March 5, 2018, 12:10 p.m. UTC | #10
On Fri, Jan 26, 2018 at 02:52:28AM +0000, Gaoxiang (OS) wrote:
> 
> Sorry for bother, I mean you refer to sdcardfs,
> The current Samsung sdcardfs and esdfs are all fsstress broken for many issues,
> However we developed an another approach last year, in
> 
> https://android-review.googlesource.com/#/c/kernel/common/+/574402/3/Documentation/filesystems/hwsdcardfs.txt
> 
> We applied it on HUAWEI Mate 10 (pro), and the complete solution is
> 
> https://android-review.googlesource.com/#/q/status:open+project:kernel/common+branch:android-4.4+topic:huawei_sdcardfs
> 
> It supports stackable case-insensitive without modifing VFS smoothly, 
> and we tested fsstress on the top and lower layers at the same time 
> without any problem.

Any specific reason that filesystem code has not been submitted here for
inclusion upstream?  It seems like all of the various sdcardfs
forks/rewrites need to be standardized upstream first to allow everyone
to benifit, instead of all of the duplicated efforts that are happening
right now :(

By getting yours, or someone's, code merged properly, that should reduce
the amount of duplication here.

thanks,

greg k-h