e2p: Print encoding information in superblock dump

Message ID	20181204211609.27477-1-krisman@collabora.com
State	New
Headers	show Return-Path: <linux-ext4-owner@vger.kernel.org> sender: krisman) with ESMTPSA id 27A07276109 From: Gabriel Krisman Bertazi <krisman@collabora.com> To: tytso@mit.edu Cc: linux-ext4@vger.kernel.org, kernel@collabora.com, Gabriel Krisman Bertazi <krisman@collabora.com> Subject: [PATCH] e2p: Print encoding information in superblock dump Date: Tue, 4 Dec 2018 16:16:09 -0500 Message-Id: <20181204211609.27477-1-krisman@collabora.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk
Series	e2p: Print encoding information in superblock dump \| expand e2p: Print encoding information in superblock dump

Message ID

20181204211609.27477-1-krisman@collabora.com

State

New

Headers

From: Gabriel Krisman Bertazi <krisman@collabora.com>
To: tytso@mit.edu
Cc: linux-ext4@vger.kernel.org, kernel@collabora.com,
	Gabriel Krisman Bertazi <krisman@collabora.com>
Subject: [PATCH] e2p: Print encoding information in superblock dump
Date: Tue,  4 Dec 2018 16:16:09 -0500
Message-Id: <20181204211609.27477-1-krisman@collabora.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: linux-ext4-owner@vger.kernel.org
Precedence: bulk

Series

e2p: Print encoding information in superblock dump | expand

Commit Message

Gabriel Krisman Bertazi Dec. 4, 2018, 9:16 p.m. UTC

Display encoding related fields when printing superblock in places like
dumpe2fs or debugfs.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
---
 lib/e2p/e2p.h      |  1 +
 lib/e2p/encoding.c | 10 ++++++++++
 lib/e2p/ls.c       | 44 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+)

Comments

Theodore Ts'o Dec. 8, 2018, 6:36 p.m. UTC | #1

On Tue, Dec 04, 2018 at 04:16:09PM -0500, Gabriel Krisman Bertazi wrote:
> diff --git a/lib/e2p/ls.c b/lib/e2p/ls.c
> index a7586e094da1..bb1fc8aa94da 100644
> --- a/lib/e2p/ls.c
> +++ b/lib/e2p/ls.c
    ....
> +	if (encoding == EXT4_ENC_UTF8_11_0) {
> +		if (flags & EXT4_UTF8_NORMALIZATION_TYPE_NFKD)
> +			fputs(" NFKD", f);
> +		else
> +			fputs(" Unnormalized", f);
> +		flags_found++;
> +
> +		if (flags & EXT4_UTF8_CASEFOLD_TYPE_NFKDCF)
> +			fputs(" NFKDCF", f);
> +		else
> +			fputs(" toUpper", f);
> +		flags_found++;
> +	}

I don't understand this.  Why is "toUpper" the opposite of
"CASEFOLD_TYPE_NFKDCF"?  From what I can tell looking at the kernel
patches, it appears that if CASEFOLD_TYPE_NFKDCF is not specified, no
case folding is done at all.  And it appears the opposite of "toupper"
is "tolower" --- for ASCII case folding.

More generally, we don't have a way of setting these flags, and I'm
wondering if we should just make a decision and be done with it.
After all, without any way of changing the flags, there's only one
code path that is going to be well tested, and realistically user
programs will come to *expect* only one way file systems will do
things.  The MacOS world has discovered the hard way what happens if
they try to change normalization conventions from one to another,
leading to all sorts of confusion for application programmers.

So perhaps we should just remove these flags from the superblock, and
only support one way of doing things.  We ask the opinion of various
stake-holders --- the Samba folks, fsdevel, Steam, etc.  But whether
we decide NFC, NFD, NFKC or NFKD, I suspect we'll be better off just
picking one and only one way of doing things.   WDYT?

	    	     	     	      		- Ted

Gabriel Krisman Bertazi Dec. 10, 2018, 9:05 p.m. UTC | #2

"Theodore Y. Ts'o" <tytso@mit.edu> writes:

> On Tue, Dec 04, 2018 at 04:16:09PM -0500, Gabriel Krisman Bertazi wrote:
>> diff --git a/lib/e2p/ls.c b/lib/e2p/ls.c
>> index a7586e094da1..bb1fc8aa94da 100644
>> --- a/lib/e2p/ls.c
>> +++ b/lib/e2p/ls.c
>     ....
>> +	if (encoding == EXT4_ENC_UTF8_11_0) {
>> +		if (flags & EXT4_UTF8_NORMALIZATION_TYPE_NFKD)
>> +			fputs(" NFKD", f);
>> +		else
>> +			fputs(" Unnormalized", f);
>> +		flags_found++;
>> +
>> +		if (flags & EXT4_UTF8_CASEFOLD_TYPE_NFKDCF)
>> +			fputs(" NFKDCF", f);
>> +		else
>> +			fputs(" toUpper", f);
>> +		flags_found++;
>> +	}
>
> I don't understand this.  Why is "toUpper" the opposite of
> "CASEFOLD_TYPE_NFKDCF"?  From what I can tell looking at the kernel
> patches, it appears that if CASEFOLD_TYPE_NFKDCF is not specified, no
> case folding is done at all.  And it appears the opposite of "toupper"
> is "tolower" --- for ASCII case folding.

In order to allow any NLS charset to benefit from the
nls_strcmp/strncasecmp API I specified some default
normalization/casefold operations that could be implemented using the
hooks we already have.  The default was toUpper.  That was my thinking.
utf8 was originally split between utf8 and utf8n, the former being the
original unnormalized behavior.  If we didn't have CASEFOLD_TYPE_NFKDCF,
it used the toUpper method.

> More generally, we don't have a way of setting these flags, and I'm
> wondering if we should just make a decision and be done with it.
> After all, without any way of changing the flags, there's only one
> code path that is going to be well tested, and realistically user
> programs will come to *expect* only one way file systems will do
> things.  The MacOS world has discovered the hard way what happens if
> they try to change normalization conventions from one to another,
> leading to all sorts of confusion for application programmers.
>
> So perhaps we should just remove these flags from the superblock, and
> only support one way of doing things.  We ask the opinion of various
> stake-holders --- the Samba folks, fsdevel, Steam, etc.  But whether
> we decide NFC, NFD, NFKC or NFKD, I suspect we'll be better off just
> picking one and only one way of doing things.   WDYT?

My approach is over-complex, just to support the existing NLS tables.
Since Linus seems ok to move the code into a separate module and not
support other encodings, I agree we can make things much simpler, define
a single normalization/casefold and be done with it.

So, I will revive the first versions of this charset/unicode module.  We drop
these flags from the superblock, but we still store the encoding and the
encoding version in it, since it is useful to maintain stability of name
sequences.  We also support ASCII, alongside with utf8 because that is a
safer and pretty trivial.  Finally, do we revisit the decision to
provide a strict mode to reject invalid sequences?  I still think that
flag is useful.  Do we also want a flag to specify if the default is +F for
newer directories?

Do you agree?

diff --git a/lib/e2p/e2p.h b/lib/e2p/e2p.h
index cc2dbf39bfeb..f4cf2d611f2c 100644
--- a/lib/e2p/e2p.h
+++ b/lib/e2p/e2p.h
@@ -84,3 +84,4 @@  int e2p_string2encmode(char *string);
 int e2p_str2encoding(const char *string);
 int e2p_get_encoding_flags(int encoding);
 int e2p_str2encoding_flags(int encoding, char *param, __u16 *flags);
+char *e2p_encoding2string(int encoding);
diff --git a/lib/e2p/encoding.c b/lib/e2p/encoding.c
index 88433310a1dd..6edc41a78964 100644
--- a/lib/e2p/encoding.c
+++ b/lib/e2p/encoding.c
@@ -61,6 +61,16 @@  int e2p_str2encoding(const char *string)
 	return -EINVAL;
 }
 
+char *e2p_encoding2string(int encoding)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(ext4_encoding_map); i++)
+		if (ext4_encoding_map[i].encoding_magic == encoding)
+			return ext4_encoding_map[i].name;
+	return NULL;
+}
+
 int e2p_get_encoding_flags(int encoding)
 {
 	int i;
diff --git a/lib/e2p/ls.c b/lib/e2p/ls.c
index a7586e094da1..bb1fc8aa94da 100644
--- a/lib/e2p/ls.c
+++ b/lib/e2p/ls.c
@@ -142,6 +142,36 @@  static void print_mntopts(struct ext2_super_block * s, FILE *f)
 #endif
 }
 
+static void print_encoding_flags(int encoding, __u16 flags, FILE *f)
+{
+	int flags_found = 0;
+
+	fputs("Filename encoding flags: ", f);
+	if (flags & EXT4_ENC_STRICT_MODE_FL) {
+		fputs(" strict", f);
+		flags_found++;
+	}
+
+	if (encoding == EXT4_ENC_UTF8_11_0) {
+		if (flags & EXT4_UTF8_NORMALIZATION_TYPE_NFKD)
+			fputs(" NFKD", f);
+		else
+			fputs(" Unnormalized", f);
+		flags_found++;
+
+		if (flags & EXT4_UTF8_CASEFOLD_TYPE_NFKDCF)
+			fputs(" NFKDCF", f);
+		else
+			fputs(" toUpper", f);
+		flags_found++;
+	}
+
+	if (flags_found)
+		fputs("\n", f);
+	else
+		fputs(" (none)\n", f);
+}
+
 static void print_super_flags(struct ext2_super_block * s, FILE *f)
 {
 	int	flags_found = 0;
@@ -457,6 +487,20 @@  void list_super2(struct ext2_super_block * sb, FILE *f)
 				*quota_sb_inump(sb, qtype));
 	}
 
+	if (ext2fs_has_feature_fname_encoding(sb)) {
+		str = e2p_encoding2string(sb->s_encoding);
+		if (str) {
+			fprintf(f, "Filename encoding:        %s\n", str);
+			print_encoding_flags(sb->s_encoding,
+					     sb->s_encoding_flags, f);
+		} else {
+			fprintf(f, "Filename encoding:        Unknown (%hu)\n",
+				sb->s_encoding);
+			fprintf(f, "Filename encoding flags:  0x%hX\n",
+				sb->s_encoding_flags);
+		}
+	}
+
 	if (ext2fs_has_feature_metadata_csum(sb)) {
 		fprintf(f, "Checksum type:            %s\n",
 			checksum_type(sb->s_checksum_type));

e2p: Print encoding information in superblock dump

Commit Message

Comments

Patch