diff mbox

[BUG] ext4: dx_map_entry cannot support over 64KB block size

Message ID 20090605165049.e8bd9c74.toshi.okajima@jp.fujitsu.com
State Accepted, archived
Headers show

Commit Message

Toshiyuki Okajima June 5, 2009, 7:50 a.m. UTC
From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>

The dx_map_entry structure doesn't support over 64KB block size by current usage
of its member("offs"). Because "offs" treats an offset of copies of the 
ext4_dir_entry_2 structure as is. This member size is 16 bits. But real offset 
for over 64KB(256KB) block size needs 18 bits. However, real offset keeps
4 byte boundary, so lower 2 bits is not used.

Therefore, we do the following to fix this limitation:
For "store": 
	we divide the real offset by 4 and then store this result to "offs" 
	member.
For "use":
	we multiply "offs" member by 4 and then use this result 
	as real offset.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
---
 fs/ext4/namei.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Andreas Dilger June 5, 2009, 9:20 p.m. UTC | #1
On Jun 05, 2009  16:50 +0900, Toshiyuki Okajima wrote:
> From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
> 
> The dx_map_entry structure doesn't support over 64KB block size by current
> usage of its member("offs"). Because "offs" treats an offset of copies of
> the ext4_dir_entry_2 structure as is. This member size is 16 bits. But real
> offset for over 64KB(256KB) block size needs 18 bits. However, real offset
> keeps 4 byte boundary, so lower 2 bits is not used.
> 
> Therefore, we do the following to fix this limitation:
> For "store": 
> 	we divide the real offset by 4 and then store this result to "offs" 
> 	member.
> For "use":
> 	we multiply "offs" member by 4 and then use this result 
> 	as real offset.

This patch unfortunately doesn't address all of the issues related
to blocksize > 64kB.

There are a number of other places where there are limits related
to > 64kB blocksize like ext4_dir_entry_2 itself having only a
"__u16 rec_len", so without changing the on-disk format it is not
possible to have > 64kB blocksize.

You would only notice this if you create a large enough directory.
It might be possible to force very large directory blocks to have
multiple directory entries (max size 65536 bytes using the helpers
ext4_rec_len_{to,from}_disk() to convert 0xffff -> 0x10000).


The dx_map_entry you are changing is only an in-memory structure,
so changing it to use an int doesn't matter, but without changing
the on-disk structures it is not useful.

> Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
> ---
>  fs/ext4/namei.c |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> --- linux-2.6.30-rc6/fs/ext4/namei.c.orig	2009-05-26 01:35:09.000000000 +0900
> +++ linux-2.6.30-rc6/fs/ext4/namei.c	2009-06-06 00:05:51.000000000 +0900
> @@ -750,7 +750,7 @@ static int dx_make_map(struct ext4_dir_e
>  			ext4fs_dirhash(de->name, de->name_len, &h);
>  			map_tail--;
>  			map_tail->hash = h.hash;
> -			map_tail->offs = (u16) ((char *) de - base);
> +			map_tail->offs = ((char *) de - base)>>2;
>  			map_tail->size = le16_to_cpu(de->rec_len);
>  			count++;
>  			cond_resched();
> @@ -1148,7 +1148,8 @@ dx_move_dirents(char *from, char *to, st
>  	unsigned rec_len = 0;
>  
>  	while (count--) {
> -		struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *) (from + map->offs);
> +		struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *) 
> +							(from + (map->offs<<2));
>  		rec_len = EXT4_DIR_REC_LEN(de->name_len);
>  		memcpy (to, de, rec_len);
>  		((struct ext4_dir_entry_2 *) to)->rec_len =

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Toshiyuki Okajima June 8, 2009, 7:30 a.m. UTC | #2
Hi,
> On Jun 05, 2009  16:50 +0900, Toshiyuki Okajima wrote:
> > > From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
> > > 
> > > The dx_map_entry structure doesn't support over 64KB block size by current
> > > usage of its member("offs"). Because "offs" treats an offset of copies of
> > > the ext4_dir_entry_2 structure as is. This member size is 16 bits. But real
> > > offset for over 64KB(256KB) block size needs 18 bits. However, real offset
> > > keeps 4 byte boundary, so lower 2 bits is not used.
> > > 
> > > Therefore, we do the following to fix this limitation:
> > > For "store": 
> > > 	we divide the real offset by 4 and then store this result to "offs" 
> > > 	member.
> > > For "use":
> > > 	we multiply "offs" member by 4 and then use this result 
> > > 	as real offset.
> 
> This patch unfortunately doesn't address all of the issues related
> to blocksize > 64kB.
> 
> There are a number of other places where there are limits related
> to > 64kB blocksize like ext4_dir_entry_2 itself having only a
> "__u16 rec_len", so without changing the on-disk format it is not
> possible to have > 64kB blocksize.
> 
> You would only notice this if you create a large enough directory.
> It might be possible to force very large directory blocks to have
> multiple directory entries (max size 65536 bytes using the helpers
> ext4_rec_len_{to,from}_disk() to convert 0xffff -> 0x10000).

This patch isn't one of the patches in order to support "blocksize > 64KB"
 with ext4. It only fixes dx_map_entry handling when we use "blocksize > 64KB".
You know, ext4_rec_len_{to,from}_disk() has already supported 
"blocksize > 64KB". But dx_map_entry doesn't support "blocksize > 64KB" now.
So, I have posted it to fix dx_map_entry handling on "blocksize > 64KB".

I understand ext4_rec_len_{to,from}_disk() are useful for almost purpose of 
ext4_dir_entry_2 handling. But "offs" which is the member of dx_map_entry is 
exceptional. Because dx_map_entry must treat "0" value (offset = 0) but they 
cannot treat "0" value.
("0" is special value for them.)

>  unsigned int ext4_rec_len_from_disk(__le16 dlen, unsigned blocksize)
>  {
>          unsigned len = le16_to_cpu(dlen);
>   
>          if (len == EXT4_MAX_REC_LEN || len == 0)
                                          ^^^^^^^^^
>                   return blocksize;
                    ^^^^^^^^^^^^^^^^^
>           return (len & 65532) | ((len & 3) << 16);
>   }

So, we cannot use them for "offs".
Therefore I decided to use shift operations instead of them
because it is easy implementation.

[store]
> > > +			map_tail->offs = ((char *) de - base)>>2;
[use]
> > > +		struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *) 
> > > +							(from + (map->offs<<2));
 
Thanks,
Toshiyuki Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o June 8, 2009, 1:31 p.m. UTC | #3
On Fri, Jun 05, 2009 at 03:20:00PM -0600, Andreas Dilger wrote:
> There are a number of other places where there are limits related
> to > 64kB blocksize like ext4_dir_entry_2 itself having only a
> "__u16 rec_len", so without changing the on-disk format it is not
> possible to have > 64kB blocksize.

That situation we have dealt with already, at least in the kernel:

unsigned int ext4_rec_len_from_disk(__le16 dlen, unsigned blocksize)
{
	unsigned len = le16_to_cpu(dlen);

	if (len == EXT4_MAX_REC_LEN || len == 0)
		return blocksize;
	return (len & 65532) | ((len & 3) << 16);
}

We do *not* have support in e2fsprogs using this this algorithm, so
there are some bugs to be fixed, though.  But the kernel at least
theoretically handles this case correctly.  That's not to say there
might not be other places where this is buggy, but fundamentally it's
not impossible for us to support block sizes up to 256k, at least from
an on-disk filesystem format perspective.

	  	     	  	      	   	   - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o June 21, 2009, 3:57 a.m. UTC | #4
Toshiyuki-san,

I'm curious what if any testing you've done using 128k or 256k block
sizes?  I note that EXT4_MAX_BLOCK_SIZE is still set to 65536, and of
course, > 64k blocksize testing can only take place on systes with
page sizes > 64k --- of which there are very few at the moment.

Thanks,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Toshiyuki Okajima June 22, 2009, 1:46 a.m. UTC | #5
Ted-san,

Theodore Tso wrote:
> Toshiyuki-san,
> 
> I'm curious what if any testing you've done using 128k or 256k block
> sizes?  I note that EXT4_MAX_BLOCK_SIZE is still set to 65536, and of
> course, > 64k blocksize testing can only take place on systes with
> page sizes > 64k --- of which there are very few at the moment.
> 
> Thanks,
> 
> 						- Ted
I have been reviewing the logic around dir_index(fs/ext4/dir.c fs/ext4/namei.c)
for the sake of performance and quality improvement. Then I found this bug.
Sorry, therefore it is not a result of the real test.

Best regards,
Toshiyuki Okajima


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o June 22, 2009, 2:47 a.m. UTC | #6
On Mon, Jun 22, 2009 at 10:46:17AM +0900, Toshiyuki Okajima wrote:
> I have been reviewing the logic around dir_index(fs/ext4/dir.c
> fs/ext4/namei.c) for the sake of performance and quality
> improvement. Then I found this bug.  Sorry, therefore it is not a
> result of the real test.

I've started doing some testing on the e2fsprogs side, and fixed some
problems in for the upcoming e2fsprogs 1.41.7, but both the kernel and
e2fsprogs currently don't officially support a blocksize larger than
64k.  At least e2fsprogs in the git tree will mke2fs that passes
e2fsck for a 128k block filesystem; this is *not* yet true for 256k
block filesystems (but it requires some commits in the 64k block
numbers patches which I don't want to merge into the maint branch).

Are you aware of any patches that enable an IA64 system to support a
page size greater than 64kB?  (And whether any customer would actually
want to use them, given the downsides of very large page size?)  The
issue is that Linux doesn't support filesystem block sizes > than the
page size.  In any case, I don't mind adding patches that attempt to
make it better to support large block sizes; I don't really want to
claim that we support it until we can actually fully test for that
feature, though.  Is it your intent to actually try to provide support
for this at some point?  I'll help you if you are, but you'll have to
do the testing, since I don't have access to an IA64 platform that
might be able to support these sorts of large pages.

Thanks,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Toshiyuki Okajima June 23, 2009, 3:32 a.m. UTC | #7
Ted-san,

Theodore Tso wrote:
 > Are you aware of any patches that enable an IA64 system to support a
 > page size greater than 64kB?  (And whether any customer would actually
 > want to use them, given the downsides of very large page size?)  The
No, I am not. But I have considered that POWERPC system has been
supported over 64KB page size after I examined all Kconfig's in kernel
source. Therefore I thought this bug should be fixed in kernel side.

 > issue is that Linux doesn't support filesystem block sizes > than the
 > page size.  In any case, I don't mind adding patches that attempt to
 > make it better to support large block sizes; I don't really want to
 > claim that we support it until we can actually fully test for that
I think so, too.

 > feature, though.  Is it your intent to actually try to provide support
 > for this at some point?  I'll help you if you are, but you'll have to
 > do the testing, since I don't have access to an IA64 platform that
 > might be able to support these sorts of large pages.
 >
 > Thanks,
 >
 > 						- Ted
I don't think this feature (over 64KB block size support) to be necessary
immediately. Because I have been investigating ext4 for my customers
in order that I may provide them greater quality and performance but they don't
  want it now.
But when my customers want this feature, I will try to test it.

Best regards,
Toshiyuki Okajima

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o June 23, 2009, 3:28 p.m. UTC | #8
On Tue, Jun 23, 2009 at 12:32:16PM +0900, Toshiyuki Okajima wrote:
> No, I am not. But I have considered that POWERPC system has been
> supported over 64KB page size after I examined all Kconfig's in kernel
> source. Therefore I thought this bug should be fixed in kernel side.

I hadn't ealized that PowerPC had support for 256k page sizes.  Looks
like it only works on a specialized embedded (AMCC) processor, with a
patched set of binutils, and all of the binaries have to be rebuilt to
accomodate having the ELF sections aligned on something greater than
64k boudaries (i.e. standard PPC distributions won't work).  So it's
not likely that I'm going to be able to get access to something that
will support such large pagesizes in the near future.

> I don't think this feature (over 64KB block size support) to be
> necessary immediately. Because I have been investigating ext4 for my
> customers in order that I may provide them greater quality and
> performance but they don't want it now.

That's my assessment on interest in >64k block size support as well; I
will prioritize accordingly.  I just wanted to make sure there wasn't
some customers interested in some unusual use case that I wasn't aware
of.

Regards,

							- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

--- linux-2.6.30-rc6/fs/ext4/namei.c.orig	2009-05-26 01:35:09.000000000 +0900
+++ linux-2.6.30-rc6/fs/ext4/namei.c	2009-06-06 00:05:51.000000000 +0900
@@ -750,7 +750,7 @@  static int dx_make_map(struct ext4_dir_e
 			ext4fs_dirhash(de->name, de->name_len, &h);
 			map_tail--;
 			map_tail->hash = h.hash;
-			map_tail->offs = (u16) ((char *) de - base);
+			map_tail->offs = ((char *) de - base)>>2;
 			map_tail->size = le16_to_cpu(de->rec_len);
 			count++;
 			cond_resched();
@@ -1148,7 +1148,8 @@  dx_move_dirents(char *from, char *to, st
 	unsigned rec_len = 0;
 
 	while (count--) {
-		struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *) (from + map->offs);
+		struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *) 
+							(from + (map->offs<<2));
 		rec_len = EXT4_DIR_REC_LEN(de->name_len);
 		memcpy (to, de, rec_len);
 		((struct ext4_dir_entry_2 *) to)->rec_len =