mbox series

[e2fsprogs,v3,00/12] Support encoding awareness and casefold

Message ID 20181126221949.12172-1-krisman@collabora.com
Headers show
Series Support encoding awareness and casefold | expand

Message

Gabriel Krisman Bertazi Nov. 26, 2018, 10:19 p.m. UTC
Hi Ted,

This version includes all the changes you requested as well as the
unicode 11.0 data files.  The utf8data file is generated in the kernel
source, and it is validated in that code, by nls_utf8-selftest.c, I
simply copied it here.  For the 10.0-> 11.0 migration, I also added a
few tests in the kernel source for the scripts that changed.

e2fsprogs:
  https://gitlab.collabora.com/krisman/e2fsprogs -b encoding-feature-merge_v3
linux:
  https://gitlab.collabora.com/krisman/e2fsprogs -b ext4-ci-directrory_v3
xfstests
  https://gitlab.collabora.com/krisman/xfstests -b encoding_v3

Thanks,

----
Original cover letter message:


These are the modifications to e2fsprogs in order to support encoding
awareness and case folding.  This patch series is divided in 3 parts:

Patch 1 & 2 work on reserving superblock fields.  Patch 1 is actually
unrelated, just updating the super_block to resynchronize with the
kernel.  Patch 2 reserves the feature bit and superblock fields for this
feature.

Patch 3 through 5 implements the changes the changes to mke2fs and
chattr/lsattr to enable the encoding feature at mkfs time and flipping
the casefold flag on demand for specific directories.

Patch 6 through 9 is where things get a bit ugly.  fsck needs to become
encoding aware, in order to calculate directory hashes correctly and
verify/fix inconsistencies.  This requires a tiny bit of plumbing to
pass the encoding information up to the point where we calculate the
hash, as well as implementing a simple nls-like interface in e2fsprogs
to do normalization/casefolding.  You'll see that in this series I've
actually dropped the utf8 part because that patch is huge and I'd rather
discuss it separately.  I did it in a hacky way now, where we import the
utf8n code from linux.  I thought about using libunistring but it
doesn't seem to support versioning and we risk being incompatible with
the kernel hashes.  I think we could follow the kernel approach and make
ucd files available in e2fsprogs and generate the data at
compilation. What do you think?

If you want to see a full utf8 capable version of this series, please
clone from:

https://gitlab.collabora.com/krisman/e2fsprogs -b encoding-feature-merge

If you don't object to patch 1 & 2, can we get them merged before the
rest of the series is ready, so I can reserve the bits in the super
block for this feature (patch 2) and avoid more rebasing on my side?


Gabriel Krisman Bertazi (12):
  libe2p: Helpers for configuring the encoding superblock fields
  mke2fs: Configure encoding during superblock initialization
  chattr/lsattr: Support casefold attribute
  lib/ext2fs: Implement NLS support
  lib/ext2fs: Support encoding when calculating dx hashes
  debugfs/htree: Support encoding when printing the file hash
  tune2fs: Prevent enabling encryption flag on encoding-aware fs
  ext2fs: nls: Support UTF-8 11.0 with NFKD normalization
  ext4.5: Add fname_encoding feature to ext4 man page
  mke2fs.8: Document fname_encoding options
  mke2fs.conf.5: Document fname_encoding configuration option
  chattr.1: Document the casefold attribute

 debugfs/Makefile.in        |    1 +
 debugfs/htree.c            |   30 +-
 e2fsck/Makefile.in         |    7 +-
 e2fsck/dx_dirinfo.c        |    4 +-
 e2fsck/e2fsck.h            |    4 +-
 e2fsck/pass1.c             |    3 +-
 e2fsck/pass2.c             |   11 +-
 e2fsck/rehash.c            |   20 +-
 e2fsck/unix.c              |   18 +
 lib/e2p/Makefile.in        |    8 +-
 lib/e2p/e2p.h              |    5 +
 lib/e2p/encoding.c         |   97 +
 lib/e2p/feature.c          |    2 +-
 lib/e2p/pf.c               |    1 +
 lib/ext2fs/Makefile.in     |   16 +-
 lib/ext2fs/dirhash.c       |   52 +
 lib/ext2fs/ext2_fs.h       |   10 +-
 lib/ext2fs/ext2fs.h        |    8 +
 lib/ext2fs/initialize.c    |    4 +
 lib/ext2fs/nls.h           |   66 +
 lib/ext2fs/nls_ascii.c     |   48 +
 lib/ext2fs/nls_utf8-norm.c |  793 +++++
 lib/ext2fs/nls_utf8.c      |   85 +
 lib/ext2fs/utf8data.h      | 6079 ++++++++++++++++++++++++++++++++++++
 lib/ext2fs/utf8n.h         |  120 +
 misc/chattr.1.in           |    8 +-
 misc/chattr.c              |    3 +-
 misc/ext4.5.in             |   10 +
 misc/mke2fs.8.in           |   25 +
 misc/mke2fs.c              |   83 +-
 misc/mke2fs.conf.5.in      |    4 +
 misc/mke2fs.conf.in        |    3 +
 misc/tune2fs.c             |    6 +
 33 files changed, 7598 insertions(+), 36 deletions(-)
 create mode 100644 lib/e2p/encoding.c
 create mode 100644 lib/ext2fs/nls.h
 create mode 100644 lib/ext2fs/nls_ascii.c
 create mode 100644 lib/ext2fs/nls_utf8-norm.c
 create mode 100644 lib/ext2fs/nls_utf8.c
 create mode 100644 lib/ext2fs/utf8data.h
 create mode 100644 lib/ext2fs/utf8n.h

Comments

Theodore Ts'o Nov. 30, 2018, 4:12 p.m. UTC | #1
On Mon, Nov 26, 2018 at 05:19:45PM -0500, Gabriel Krisman Bertazi wrote:
> +static int utf8_casefold(const struct nls_table *table,
> +			  const unsigned char *str, size_t len,
> +			  unsigned char *dest, size_t dlen)
> +{
> +	const struct utf8data *data = utf8nfkdicf(UNICODE_AGE(10,0,0));
> +	struct utf8cursor cur;
> +	size_t nlen = 0;
> +
> +	if (utf8ncursor(&cur, data, str, len) < 0)
> +		goto invalid_seq;
> +
> +	for (nlen = 0; nlen < dlen; nlen++) {
> +		dest[nlen] = utf8byte(&cur);
> +		if (!dest[nlen])
> +			return nlen;
> +		if (dest[nlen] == -1)
> +			break;
> +	}
> +invalid_seq:
> +	/* Treat the sequence as a binary blob. */
> +	memcpy(dest, str, len);
> +	return len;
> +
> +}

So it looks like the interface is if the destination buffer is too
small OR if the string is not a valid UTF-8 string, we treat it as a
binary blob.  I wonder if we would be better off if this function
actually signalling that there is a problem?  (Buffer too small,
invalid UTF-8 string).

It's fine to treat it as a binary blob, and copy it out to the
destination buffer, but I can imagine be use cases where knowing this
will be useful.  *Especially* the destination buffer too small case;
I'm actually a little nervous about having it silently ignoring that
error condition and just copying the binary blob.

Also, there *really* needs to be a check before dlen is assumed to be
>= len in the memcpy after the invalid_seq label.

							- Ted
Theodore Ts'o Nov. 30, 2018, 4:53 p.m. UTC | #2
On Mon, Nov 26, 2018 at 05:19:45PM -0500, Gabriel Krisman Bertazi wrote:
> From: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
> 
> We need this such that we can do normalization and casefolding
> compatible with the kernel, in order to properly support fsck
> verification and rehashing.
> 
> The UTF-8 11.0 implementation is copied and adapted from the kernel code
> to ensure maximum compatibility.  The decode trie in utf8data.h is
> generated using a script and the UCD sources in the kernel code.
> 
> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>

One more thought.  Is there any test cases we can add here?  I assume
the SGI folks must have had some test code that they used when they
were developing their trie code.  Was any of that released?

Maybe there is some Unicode normalization and case folding test
vectors we can grab?

Thanks,

     		      	   	      	  - Ted
Gabriel Krisman Bertazi Nov. 30, 2018, 6:48 p.m. UTC | #3
"Theodore Y. Ts'o" <tytso@mit.edu> writes:

> One more thought.  Is there any test cases we can add here?  I assume
> the SGI folks must have had some test code that they used when they
> were developing their trie code.  Was any of that released?
>
> Maybe there is some Unicode normalization and case folding test
> vectors we can grab?

Since these file are generated and imported from the kernel code, I added
the tests there instead.  Should I duplicate or move them to here?