
fsck performance.

Message ID 20110222135431.GK21917@bitwizard.nl
State Superseded, archived

Commit Message

Rogier Wolff Feb. 22, 2011, 1:54 p.m. UTC
On Tue, Feb 22, 2011 at 02:36:52PM +0100, Rogier Wolff wrote:
> On Tue, Feb 22, 2011 at 11:20:56AM +0100, Rogier Wolff wrote:
> > I wouldn't be surprised if I'd need more than 3G of RAM. When I
> > extrapolated "more than a few days" it was at under 20% of the
> > filesystem and had already allocated on the order of 800Mb of
> > memory. Now I'm not entirely sure that this is fair: memory use seems
> > to go up quickly in the beginning, and then stabilize: as if it has
> > decided that 800M of memory use is "acceptable" and somehow uses a
> > different strategy once it hits that limit.
> 
> OK. Good news. It's finished pass1. It is currently using about 2100Mb
> of RAM (ehh. mostly swap, I have only 1G in there). Here is the patch.

Forgot the patch. 

	Roger.

Comments

Andreas Dilger Feb. 22, 2011, 4:32 p.m. UTC | #1
Roger,
Any idea what the hash size does to memory usage?  I wonder if we can scale this based on the directory count; or, if the memory usage is minimal (it's only needed in the case of tdb), just make it the default. It definitely appears to have been a major performance boost.

Another possible optimization is to use the in-memory icount list (preferably with the patch to reduce realloc size) until the allocations fail, and only then dump the list into tdb. That would allow people to run with a swapfile configured by default, but only pay the cost of on-disk operations if really needed.
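
[A rough sketch of this idea, for illustration only -- not e2fsck code.
ext2fs_icount_increment() is the real libext2fs call; the
spill_icount_to_tdb() helper and the exact error check are assumptions.]

#include "e2fsck.h"     /* e2fsck's own header; pulls in ext2fs.h */

/* Hypothetical helper: rebuild *icount on top of tdb, copying the
   entries accumulated in memory so far. */
static errcode_t spill_icount_to_tdb(e2fsck_t ctx, ext2_icount_t *icount);

static errcode_t icount_increment_with_fallback(e2fsck_t ctx,
                                                ext2_icount_t *icount,
                                                ext2_ino_t ino)
{
        errcode_t retval;
        __u16 count;

        retval = ext2fs_icount_increment(*icount, ino, &count);
        if (retval != EXT2_ET_NO_MEMORY)
                return retval;

        /* Out of memory: spill what we have to disk, then retry once. */
        retval = spill_icount_to_tdb(ctx, icount);
        if (retval)
                return retval;
        return ext2fs_icount_increment(*icount, ino, &count);
}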

Cheers, Andreas

On 2011-02-22, at 6:54, Rogier Wolff <R.E.Wolff@BitWizard.nl> wrote:

> On Tue, Feb 22, 2011 at 02:36:52PM +0100, Rogier Wolff wrote:
>> On Tue, Feb 22, 2011 at 11:20:56AM +0100, Rogier Wolff wrote:
>>> I wouldn't be surprised if I'd need more than 3G of RAM. When I
>>> extrapolated "more than a few days" it was at under 20% of the
>>> filesystem and had already allocated on the order of 800Mb of
>>> memory. Now I'm not entirely sure that this is fair: memory use seems
>>> to go up quickly in the beginning, and then stabilize: as if it has
>>> decided that 800M of memory use is "acceptable" and somehow uses a
>>> different strategy once it hits that limit.
>> 
>> OK. Good news. It's finished pass1. It is currently using about 2100Mb
>> of RAM (ehh. mostly swap, I have only 1G in there). Here is the patch.
> 
> Forgot the patch. 
> 
>    Roger. 
> 
> -- 
> ** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
> **    Delftechpark 26 2628 XH  Delft, The Netherlands. KVK: 27239233    **
> *-- BitWizard writes Linux device drivers for any device you may have! --*
> Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement. 
> Does it sit on the couch all day? Is it unemployed? Please be specific! 
> Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
> <cputimefix.patch>
Theodore Ts'o Feb. 22, 2011, 10:13 p.m. UTC | #2
On Tue, Feb 22, 2011 at 09:32:28AM -0700, Andreas Dilger wrote:
> 
> Any idea what the hash size does to memory usage?  I wonder if we
> can scale this based on the directory count, or if the memory usage
> is minimal (only needed in case of tdb) then just make it the
> default. It definitely appears to have been a major performance
> boost.

Yeah, that was my question.  Your patch adds a magic number which
probably works well on your machine (and I'm not really worried if
someone has less than 1G --- here's a quarter kid, buy yourself a
real computer :-).  But I wonder if we should be using a hash size
which is sized automatically depending on available memory or file
system size.

					- Ted
Rogier Wolff Feb. 23, 2011, 2:54 a.m. UTC | #3
On Tue, Feb 22, 2011 at 09:32:28AM -0700, Andreas Dilger wrote:
> Roger,

> Any idea what the hash size does to memory usage?  I wonder if we
> can scale this based on the directory count, or if the memory usage
> is minimal (only needed in case of tdb) then just make it the
> default. It definitely appears to have been a major performance
> boost.

First, that hash size is passed to the tdb module, so yes, it only
matters when tdb is actually used.

Second, I expect tdb's memory use to be significantly impacted by the
hash size. However, tdb's memory use is dwarfed by e2fsck's own memory
use... I have not noticed any difference in e2fsck's memory use
(judging by "top" output; I haven't done any scientific measurements).

> Another possible optimization is to use the in-memory icount list
> (preferably with the patch to reduce realloc size) until the
> allocations fail and only then dump the list into tdb?  That would
> allow people to run with a swapfile configured by default, but only
> pay the cost of on-disk operations if really needed.

I don't think this is a good idea. When you expect the "big"
allocations to eventually fail (i.e. icount), you'll eventually end up
with an allocation failing somewhere else, where you have nothing
prepared. A program like e2fsck will be handling larger and different
filesystems "in the field" than what you expected at the outset. It
should be robust.

My fsck is currently walking the ridge.... 

It grew from about 1000M to over 2500M after pass 1. I was expecting it
to hit the 3G limit before the end. But luckily some memory somehow
got released, and now it seems stable at 2001Mb.

It is currently again in a CPU-bound phase. I think it's doing
lots of tdb lookups.

It has asked me: 

First entry 'DSCN11194.JPG' (inode=279188586) in directory inode 277579348 (...) should be '.'
Fix<y>? yes


Which is clearly wrong. IF we can find directory entries in that
directory (i.e. it actually IS a directory), then it is likely that
the file DSCN11194.JPG still exists, and that it has inode
279188586. If it should've been '.', it would've been inode
277579348. So instead of overwriting this "first entry" of the
directory, the fix should've been:

Directory "." is missing in directory inode 277579348. Add?

If necessary, room should be made for it inside the directory.

	Roger.
Rogier Wolff Feb. 23, 2011, 4:44 a.m. UTC | #4
On Tue, Feb 22, 2011 at 05:13:04PM -0500, Ted Ts'o wrote:
> On Tue, Feb 22, 2011 at 09:32:28AM -0700, Andreas Dilger wrote:
> > 
> > Any idea what the hash size does to memory usage?  I wonder if we
> > can scale this based on the directory count, or if the memory usage
> > is minimal (only needed in case of tdb) then just make it the
> > default. It definitely appears to have been a major performance
> > boost.
> 
> Yeah, that was my question.  Your patch adds a magic number which
> probably works well on your machine (and I'm not really worried if
> someone has less than 1G --- here's a quarter kid, buy yourself a
> real computer :-).  But I wonder if we should be using a hash size
> which is sized automatically depending on available memory or file
> system size.

I fully agree that having "magic numbers" in the code is a bad thing.
A warning sign. 

I don't agree with your argument that 1G RAM should be considered
minimal. There are storage boxes (single-disk NAS systems) out there
that run on a 300MHz ARM chip and have little RAM. Some of them use
ext[234].

For example: http://iomega.nas-central.org/wiki/Main_Page

64Mb RAM. I'm not sure whether that CPU is capable of virtual memory.

I just mentioned this one because a friend brought one into my office
last week. I don't think it happens to run Linux. On the other hand,
some of the "competition" do run Linux.

As to the "disadvantage" of using a large hash value: 

As far as I can see, the library just seeks to that position in the
tdb file. With 32-bit file offsets (which are hardcoded into tdb), that
means the penalty is 4*hash_size bytes of extra disk space. So with my
currently suggested value that comes to about 4Mb.

As my current tdb database amounts to 1.5Gb I think the cost is
acceptable.

With up to 1 million keys, we can expect a speedup of 1M/131 (131
being tdb's default hash size), i.e. about 7500x. Above that we won't
gain much anymore.

This is assuming that we have a next-to-perfect hash function. In fact
we don't, because I see only about 30% hash bucket usage. And I'm sure
my fsck has used well over 1M keys...

I just tested the hash function: I hashed the first 10 million numbers
and got 91962 unique results (out of a possible 99931). That's only
about 10%. That's a lot worse than what e2fsck is seeing. And this is
the simplest case to get right.

Here is my test program. 

#include <stdio.h>
#include <stdlib.h>
typedef unsigned int u32;

/* This is based on the hash algorithm from gdbm */
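/* Simplified for this test: the key is always the 4 bytes of an int;
   tdb's real version hashes the whole variable-length key. */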
static unsigned int default_tdb_hash(unsigned char *key)
{
        u32 value;      /* Used to compute the hash value.  */
        u32   i;        /* Used to cycle through random values. */

        /* Set the initial value from the key size. */
        for (value = 0x238F13AF * 4, i=0; i < 4; i++)
                value = (value + (key[i] << (i*5 % 24)));

        return (1103515243 * value + 12345);
}



int main (int argc, char **argv)
{
  int i; 
  int max = 1000000; 

  if (argc > 1) max = atoi (argv[1]);
  for (i=0;i < max;i++) {
    printf ("%u %u\n", i, default_tdb_hash ((unsigned char *)&i) % 99931);
  }
  exit (0);
}

and here is the commandline I used to watch the results.

./a.out 10000000 | awk '{print $2}' | sort | uniq -c |sort -n  | less
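
[For reference, an equivalent self-contained check that counts bucket
usage directly, without the shell pipeline; same simplified 4-byte keys
and the same 99931-bucket modulus as above.]

#include <stdio.h>

typedef unsigned int u32;

#define NBUCKETS 99931          /* same (non-prime) modulus as above */
#define NKEYS    10000000

/* Copy of the simplified gdbm-style hash from the test program above. */
static unsigned int default_tdb_hash(unsigned char *key)
{
        u32 value;
        u32 i;

        for (value = 0x238F13AF * 4, i=0; i < 4; i++)
                value = (value + (key[i] << (i*5 % 24)));

        return (1103515243 * value + 12345);
}

int main(void)
{
        static unsigned char used[NBUCKETS];
        unsigned int i, hits = 0;

        for (i = 0; i < NKEYS; i++)
                used[default_tdb_hash((unsigned char *)&i) % NBUCKETS] = 1;
        for (i = 0; i < NBUCKETS; i++)
                hits += used[i];

        printf("%u of %u buckets used (%.1f%%)\n",
               hits, NBUCKETS, 100.0 * hits / NBUCKETS);
        return 0;
}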

It seems my "prime generator" program is wrong too. I had thought to
choose a prime with 99931, but apparently it's not prime. (13*7687).
Which, for hashing should not be too bad, but I'll go look for a
prime and check again. Ok. Hash bucket usage shot up: 16%. 

I just "designed" a new hash function, based on the "hash" page on
wikipedia.


static unsigned int my_tdb_hash(unsigned char *key)
{
        u32 value;      /* Used to compute the hash value.  */
        u32   i;        /* Index over the key bytes. */

        /* Mix each of the four key bytes into the running value. */
        for (value = 0, i=0; i < 4; i++)
                value = value * 256 + key[i] + (value >> 24) * 241;

        return value;
}


It behaves MUCH better than the "default_tdb_hash" in that it has 100%
bucket usage (not almost, but exactly 100%). It's not that hard to get
right.

The "hash" at the end (times BIGPRIME + RANDOMVALUE) in the original
is redundant. It only serves to make the results less obvious to
humans, but there is no computer-science relevant reason.

I'll shoot off an Email to the TDB guys as well. 

	Roger.
Theodore Ts'o Feb. 23, 2011, 11:32 a.m. UTC | #5
On Feb 22, 2011, at 11:44 PM, Rogier Wolff wrote:

> 
> I'll shoot off an Email to the TDB guys as well. 

I'm pretty sure this won't come as a surprise to them.   I'm using the last version of TDB which was licensed under the GPLv2, and they relicensed to GPLv3 quite a while ago.   I remember hearing they had added a new hash algorithm to TDB since the relicensing, but those newer versions aren't available to e2fsprogs....

-- Ted

Rogier Wolff Feb. 23, 2011, 8:53 p.m. UTC | #6
On Wed, Feb 23, 2011 at 06:32:17AM -0500, Theodore Tso wrote:
> 
> On Feb 22, 2011, at 11:44 PM, Rogier Wolff wrote:
> 
> > 
> > I'll shoot off an Email to the TDB guys as well. 
 
> I'm pretty sure this won't come as a surprise to them.  I'm using
> the last version of TDB which was licensed under the GPLv2, and they
> relicensed to GPLv3 quite a while ago.  I remember hearing they had
> added a new hash algorithm to TDB since the relicensing, but those
> newer versions aren't available to e2fsprogs....

Well then.... 

You're free to use my "new" hash function, provided it is kept under
GPLv2 and not under GPLv3.

My implementation has been a "cleanroom" implementation, in that I've
only looked at the specifications and implemented it from there,
although no external attestation is available that I have been
completely shielded from the newer GPLv3 version...

On a slightly different note: 

A pretty good estimate of the number of inodes is available in the
superblock (tot inodes - free inodes). A good hash size would be: "a
rough estimate of the number of inodes." Two or three times more or
less doesn't matter much. CPU is cheap. I'm not sure what the
estimate for the "dircount" tdb should be.
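
[A minimal sketch of that estimate, assuming the standard libext2fs
types; the 999931 floor is the minimum proposed further down in this
thread, not a tested value.]

#include <ext2fs/ext2fs.h>

/* Sketch: derive a hash-size hint for the icount tdb from the
   superblock counters (total inodes minus free inodes).  The free
   count may be stale, hence the floor. */
static unsigned int icount_hash_size_hint(ext2_filsys fs)
{
        struct ext2_super_block *sb = fs->super;
        unsigned int used = sb->s_inodes_count - sb->s_free_inodes_count;

        return (used > 999931) ? used : 999931;
}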

The amount of disk space that the tdb will use is at least: 
  overhead + hash_size * 4 + numrecords * (keysize + datasize +
                                                 perrecordoverhead)

There must also be some overhead to store the size of the keys and
data as both can be variable length. By implementing the "database"
ourselves we could optimize that out. I don't think it's worth the
trouble. 

With keysize equal 4, datasize also 4 and hash_size equal to numinodes
or numrecords, we would get

 overhead + numinodes * (12 + perrecordoverhead). 

In fact, my icount database grew to about 750Mb, with only 23M inodes,
so that means that apparently the perrecordoverhead is about 20 bytes.
This is the price you pay for using a much more versatile database
than what you really need. Disk is cheap (except when checking a root
filesystem!)
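
[The same arithmetic as a throwaway helper: the 20-byte per-record
overhead is the figure inferred above from 750Mb / 23M inodes, not a
constant taken from tdb, and the fixed header size is a rough guess.]

/* Rough on-disk size estimate for an icount tdb, per the formula above.
   keysize = datasize = 4 (inode number -> count); each record also
   effectively owns one 4-byte hash-table slot when hash_size ~= nrecords. */
static unsigned long long icount_tdb_size(unsigned long long nrecords,
                                          unsigned long long hash_size)
{
        const unsigned long long keysize = 4, datasize = 4;
        const unsigned long long per_record_overhead = 20;  /* observed */
        const unsigned long long header = 4096;             /* rough guess */

        return header + hash_size * 4 +
               nrecords * (keysize + datasize + per_record_overhead);
}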

So... 

-- I suggest that for the icount tdb we move to using the superblock
info as the hash size.

-- I suggest that we use our own hash function. tdb allows us to
specify our own hash function. Instead of modifying the bad tdb, we'll
just keep it intact and pass a better (local) hash function, along the
lines of the sketch below.
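
[A sketch of what that could look like, assuming the tdb copy bundled
with e2fsprogs exports tdb_open_ex() with a tdb_hash_func parameter, as
the Samba tdb of that era does; the hash-size choice follows the
superblock suggestion above and none of this is tested.]

#include <fcntl.h>
#include "tdb.h"        /* the copy bundled in lib/ext2fs */

/* Variable-length version of the my_tdb_hash() proposed earlier;
   a tdb_hash_func takes a TDB_DATA *, not a raw byte pointer. */
static unsigned int my_tdb_hash(TDB_DATA *key)
{
        unsigned int value = 0;
        size_t i;

        for (i = 0; i < key->dsize; i++)
                value = value * 256 + key->dptr[i] + (value >> 24) * 241;
        return value;
}

/* Open the icount tdb with a hash size based on the number of in-use
   inodes and with the replacement hash, leaving tdb itself unmodified. */
static TDB_CONTEXT *open_icount_tdb(const char *fn, unsigned int used_inodes)
{
        int hash_size = used_inodes ? (int) used_inodes : 999931;

        return tdb_open_ex(fn, hash_size, TDB_NOLOCK | TDB_NOSYNC,
                           O_RDWR | O_CREAT | O_TRUNC, 0600,
                           NULL /* default logging */, my_tdb_hash);
}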


Does anybody know what the "dircount" tdb database holds, and what is
an estimate for the number of elements eventually in the database?  (I
could find out myself: I have the source. But I'm lazy. I'm a
programmer you know...).


On a separate note, my filesystem finished the fsck (33 hours (*)),
and I started the backups again... :-)

	Roger. 

*) that might include an estimated 1-5 hours of "Fix <y>?" waiting.
Andreas Dilger Feb. 23, 2011, 10:24 p.m. UTC | #7
On 2011-02-23, at 1:53 PM, Rogier Wolff wrote:
> My implementation has been a "cleanroom" implementation in that I've
> only looked at the specifications and implemented it from
> there. Although no external attestation is available that I have been
> completely shielded from the newer GPLv3 version... 
> 
> On a slightly different note: 
> 
> A pretty good estimate of the number of inodes is available in the
> superblock (tot inodes - free inodes). A good hash size would be: "a
> rough estimate of the number of inodes." Two or three times more or
> less doesn't matter much. CPU is cheap. I'm not sure what the
> estimate for the "dircount" tdb should be.

The dircount can be extracted from the group descriptors, which count the number of allocated directories in each group.  Since the superblock "free inodes" count is no longer updated except at unmount time, the code would need to walk all of the group descriptors to get this number anyway.
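
[Sketch, using the 1.41-era in-memory group descriptor layout; if I'm
not mistaken, libext2fs already has ext2fs_get_num_dirs() doing
essentially this walk.]

#include <ext2fs/ext2fs.h>

/* Total up the per-group allocated-directory counters described above. */
static ext2_ino_t count_allocated_dirs(ext2_filsys fs)
{
        dgrp_t group;
        ext2_ino_t ndirs = 0;

        for (group = 0; group < fs->group_desc_count; group++)
                ndirs += fs->group_desc[group].bg_used_dirs_count;

        return ndirs;
}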

> The amount of disk space that the tdb will use is at least: 
>  overhead + hash_size * 4 + numrecords * (keysize + datasize +
>                                                 perrecordoverhead)
> 
> There must also be some overhead to store the size of the keys and
> data as both can be variable length. By implementing the "database"
> ourselves we could optimize that out. I don't think it's worth the
> trouble. 
> 
> With keysize equal 4, datasize also 4 and hash_size equal to numinodes
> or numrecords, we would get
> 
> overhead + numinodes * (12 + perrecordoverhead). 
> 
> In fact, my icount database grew to about 750Mb, with only 23M inodes,
> so that means that apparently the perrecordoverhead is about 20 bytes.
> This is the price you pay for using a much more versatile database
> than what you really need. Disk is cheap (except when checking a root
> filesystem!)
> 
> So... 
> 
> -- I suggest that for the icount tdb we move to using the superblock
> info as the hash size.
> 
> -- I suggest that we use our own hash function. tdb allows us to
> specify our own hash function. Instead of modifying the bad tdb, we'll
> just keep it intact, and pass a better (local) hash function.
> 
> 
> Does anybody know what the "dircount" tdb database holds, and what is
> an estimate for the number of elements eventually in the database?  (I
> could find out myself: I have the source. But I'm lazy. I'm a
> programmer you know...).
> 
> 
> On a separate note, my filesystem finished the fsck (33 hours (*)),
> and I started the backups again... :-)

If you have the opportunity, I wonder whether the entire need for tdb can be avoided in your case by using swap and the icount optimization patches previously posted?  I'd really like to get that patch included upstream, but it needs testing in an environment like yours where icount is a significant factor.  This would avoid all of the tdb overhead.

Cheers, Andreas





Theodore Ts'o Feb. 23, 2011, 11:17 p.m. UTC | #8
On Wed, Feb 23, 2011 at 03:24:18PM -0700, Andreas Dilger wrote:
> 
> If you have the opportunity, I wonder whether the entire need for
> tdb can be avoided in your case by using swap and the icount
> optimization patches previously posted?  

Unfortunately, there are people who are still using 32-bit CPUs, so
no, swap is not a solution here.

> I'd really like to get that patch included upstream, but it needs
> testing in an environment like yours where icount is a significant
> factor.  This would avoid all of the tdb overhead.

Adjusting the tdb hash parameters and changing the tdb hash function
shouldn't be hard to get into upstream.  We should really improve our
testing for [scratch files], but that's always been true...

	    	     	     	 	- Ted
Andreas Dilger Feb. 24, 2011, 12:41 a.m. UTC | #9
On 2011-02-23, at 4:17 PM, Ted Ts'o wrote:
> On Wed, Feb 23, 2011 at 03:24:18PM -0700, Andreas Dilger wrote:
>> 
>> If you have the opportunity, I wonder whether the entire need for
>> tdb can be avoided in your case by using swap and the icount
>> optimization patches previously posted?  
> 
> Unfortunately, there are people who are still using 32-bit CPU's, so
> no, swap is not a solution here.

I agree it isn't a solution in all cases, but avoiding GB-sized realloc() in the code was certainly enough to fix problems for the original people who hit them.  It likely also avoids a lot of memcpy() (depending on how realloc is implemented).

Cheers, Andreas





Rogier Wolff Feb. 24, 2011, 7:29 a.m. UTC | #10
On Wed, Feb 23, 2011 at 03:24:18PM -0700, Andreas Dilger wrote:

> The dircount can be extracted from the group descriptors, which
> count the number of allocated directories in each group.  Since the

OK. 

> superblock "free inodes" count is no longer updated except at
> unmount time, the code would need to walk all of the group
> descriptors to get this number anyway.

No worries. It matters a bit for performance, but if that free inode
count in the superblock is outdated, we'll just use the outdated
value. The one case that I'm afraid of is that someone creates a new
filesystem (superblock inodes-in-use =~= 0), then copies millions of
files onto it, and then crashes his system...

I'll add a minimum of 999931, causing an overhead of around 4Mb of
disk space usage if this was totally unnecessary.

> If you have the opportunity, I wonder whether the entire need for
> tdb can be avoided in your case by using swap and the icount
> optimization patches previously posted?  I'd really like to get that
> patch included upstream, but it needs testing in an environment like
> yours where icount is a significant factor.  This would avoid all of
> the tdb overhead.

First: I don't think it will work. The largest amount of memory that
e2fsck had allocated was 2.5Gb. At that point it also had around 1.5G
of disk space in use for tdb's, for a total of 4G. On the other hand,
we've established that the overhead in tdb is about 24 bytes per 8
bytes of real data... So maybe we would only have needed 200M of
in-memory datastructures to handle this. Two of those (400M), together
with the dircount (tdb = 750M, assume the same ratio), total about
600M, which on top of the 2.5G is still above 3G.

Second: e2fsck is too fragile as it is. It should be able to handle
big filesystems on little systems. I have a puny little 2GHz Athlon
system that currently has 3T of disk storage and 1G RAM. Embedded
Linux systems can be running those amounts of storage with only 64
or 128 Mb of RAM. 

Even if MY filesystem happens to pass with a little less memory use,
there is a slightly larger system that won't.

I have a server that has 4x2T instead of the server that has 4*1T. It
uses the same backup strategy, so it too has lots and lots of files.
In fact it has 84M inodes in use. (I thought 96M inodes would be
plenty... wrong! I HAVE run out of inodes on that thing!)

That one too may need to fsck the filesystem... 

I remember hearing about a tool that would extract all the filesystem
meta-info, so that I can make an image that I can then test e.g. fsck
upon? Inodes, directory blocks, indirect blocks etc.?

Then I could make an image where I could test this. I don't really
want to put this offline again for multiple days.


	Roger.
Amir Goldstein Feb. 24, 2011, 8:59 a.m. UTC | #11
On Thu, Feb 24, 2011 at 9:29 AM, Rogier Wolff <R.E.Wolff@bitwizard.nl> wrote:
> On Wed, Feb 23, 2011 at 03:24:18PM -0700, Andreas Dilger wrote:
>
>> The dircount can be extracted from the group descriptors, which
>> count the number of allocated directories in each group.  Since the
>
> OK.
>
>> superblock "free inodes" count is no longer updated except at
>> unmount time, the code would need to walk all of the group
>> descriptors to get this number anyway.
>
> No worries. It matters a bit for performance, but if that free inode
> count in the superblock is outdated, we'll just use that outdated
> one. The one case that I'm afraid of is that someone creates a new
> filesystem (superblock inodes-in-use =~= 0), then copies on millions
> of files, and then crashes his system....
>
> I'll add a minimum of 999931, causing an overhead of around 4Mb of
> disk space usage if this was totally unnecessary.
>
>> If you have the opportunity, I wonder whether the entire need for
>> tdb can be avoided in your case by using swap and the icount
>> optimization patches previously posted?  I'd really like to get that
>> patch included upstream, but it needs testing in an environment like
>> yours where icount is a significant factor.  This would avoid all of
>> the tdb overhead.
>
> First: I don't think it will work. The largest amount of memory that
> e2fsck had allocated was 2.5Gb. At that point it also had around 1.5G
> of disk space in use for tdb's for a total of 4G. On the other hand,
> we've established that the overhead in tdb is about 24bytes per 8
> bytes of real data.... So maybe we would only have needed 200M of
> in-memory datastructures to handle this. Two of those 400M together
> with the dircount (tdb =750M, assume same ratio) total 600M still
> above 3G.
>
> Second: e2fsck is too fragile as it is. It should be able to handle
> big filesystems on little systems. I have a puny little 2GHz Athlon
> system that currently has 3T of disk storage and 1G RAM. Embedded
> Linux systems can be running those amounts of storage with only 64
> or 128 Mb of RAM.
>
> Even if MY filesystem happens to pass, with a little less memory-use,
> then there is a slightly larger system that won't.
>
> I have a server that has 4x2T instead of the server that has 4*1T. It
> uses the same backup strategy, so it too has lots and lots of files.
> In fact it has 84M inodes in use. (I thought 96M inodes would be
> plenty... wrong! I HAVE run out of inodes on that thing!)
>
> That one too may need to fsck the filesystem...
>
> I remember hearing about a tool that would extract all the filesystem
> meta-info, so that I can make an image that I can then test e.g. fsck
> upon? Inodes, directory blocks, indirect blocks etc.?
>

That tool is e2image -r, which creates a sparse file image of your fs
(only metadata is written, the rest is holes), so you need to be careful
to copy/transfer it to another machine wisely
(e.g. bzip it, or dd it directly to a new HDD).
Not sure what you will do if fsck fixes errors on that image...
Mostly (if it didn't clone multiply-claimed blocks, for example), you would
be able to write the fixed image back onto your original fs,
but that would be risky.

> Then I could make an image where I could test this. I don't really
> want to put this offline again for multiple days.
>
>
>        Roger.
>
>
> --
> ** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
> **    Delftechpark 26 2628 XH  Delft, The Netherlands. KVK: 27239233    **
> *-- BitWizard writes Linux device drivers for any device you may have! --*
> Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
> Does it sit on the couch all day? Is it unemployed? Please be specific!
> Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
>
Rogier Wolff Feb. 24, 2011, 8:59 a.m. UTC | #12
On Wed, Feb 23, 2011 at 05:41:31PM -0700, Andreas Dilger wrote:
> On 2011-02-23, at 4:17 PM, Ted Ts'o wrote:
> > On Wed, Feb 23, 2011 at 03:24:18PM -0700, Andreas Dilger wrote:
> >> 
> >> If you have the opportunity, I wonder whether the entire need for
> >> tdb can be avoided in your case by using swap and the icount
> >> optimization patches previously posted?  
> > 
> > Unfortunately, there are people who are still using 32-bit CPU's, so
> > no, swap is not a solution here.
> 

> I agree it isn't a solution in all cases, but avoiding GB-sized
> realloc() in the code was certainly enough to fix problems for the
> original people who hit them.  It likely also avoids a lot of
> memcpy() (depending on how realloc is implemented).

So, assume that the biggest alloc is 1Gb, and assume that we realloc
(I haven't seen the code) to twice the size every time: we'll alloc 1M,
then 2M, then 4M, etc., up to 1G.

In the last case we'll realloc the 512M pointer to a 1G region. Note
that this requires a contiguous 1G area of free addressing space
within the 3G total available addressing space. But let's ignore that
problem for now.

So for the 1G alloc we'll have to memcpy 512Mb of existing data.
The previous one required a memcpy of 256Mb etc etc. The total is
just under 1G. 
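
[Quick check of that total, under the same doubling assumption.]

#include <stdio.h>

/* With doubling reallocs, growing to `size' copies the previous
   size/2 bytes; the initial 1M allocation copies nothing. */
int main(void)
{
        unsigned long long size, copied = 0;

        for (size = 2ULL << 20; size <= 1ULL << 30; size *= 2)
                copied += size / 2;

        printf("total copied: %llu MB\n", copied >> 20);  /* prints 1023 */
        return 0;
}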

So you're proposing to optimize out a memcpy of 1G of my main memory.

When it boots, my system says: pIII_sse  :  4884.000 MB/sec

So it can handle xor at almost 5G/second. It should be able to do
memcpy (xor with a bunch of zeroes) at that speed. But let's assume
that the libc guys are stupid and managed to make it 10 times slower.

So you're proposing to optimize out 1G of memcpy at 0.5G/second, or
two seconds of CPU time, on an fsck that takes over 24
hours. Congratulations! You've made e2fsck about 0.0023 percent
faster!

Andreas, I really value your efforts to improve e2fsck. But optimizing
code can be done by looking at the code and saying: "this looks
inefficient, let's fix it up". However, you're quickly going to be
spending time on optimizations that don't really matter.

(My second computer was a DOS 3.x machine. DOS came with a utility
called "sort". It does what you expect from a DOS program: it refuses
to sort datafiles larger than 64k. So I rewrote it. Turns out my
implementation was 100 times slower at reading in the dataset than the
original version. I did manage to sort 100 times faster than the
original version. End result? Mine was 10 times faster than the
original. They optimized something that didn't matter. I just read
some decades-old literature on sorting and implemented that.)

I firmly believe that a factor of ten performance improvement can be
achieved for fsck for my filesystem. It should be possible to fsck the
filesystem in 3.3 hours.

There are a total of 342M inodes. That's 87Gb. Reading that at a
leisurely 50M/second gives us about 1700 seconds, or half an hour. (It
should be possible to do better: I have 4 drives each doing 90M/sec,
allowing a total of over 300M/sec.)

Then I have 2.7T of data. With old ext2/ext3 that requires indirect
blocks worth 2.7G of data. Reading that at 10M/sec (it will be
scattered) requires 270 seconds, or about 5 minutes.

I have quite a lot of directories, so those might take some time. The
CPU time of actually doing the checks should be possible to overlap
with the I/O.

Anyway, although in theory 10x should be possible, I expect that 5x is
a more realistic goal.

	Roger.
Rogier Wolff Feb. 24, 2011, 9:02 a.m. UTC | #13
On Thu, Feb 24, 2011 at 10:59:23AM +0200, Amir Goldstein wrote:

> That tool is e2image -r, which creates a sparse file image of your
> fs (only metadata is written, the rest is holes), so you need to be
> careful when copying/transferring it to another machine to do it
> wisely (i.e. bzip or dd directly to a new HDD) Not sure what you
> will do if fsck fixes errors on that image...  Mostly (if it didn't
> clone multiply claimed blocks for example), you would be able to
> write the fixed image back onto your original fs, but that would be
> risky.

I can then run the fsck tests on the image. I expect fsck to find
errors: I'm using the filesystem while I'm making that image... it
won't be consistent.


	Roger.
Amir Goldstein Feb. 24, 2011, 9:33 a.m. UTC | #14
On Thu, Feb 24, 2011 at 11:02 AM, Rogier Wolff <R.E.Wolff@bitwizard.nl> wrote:
> On Thu, Feb 24, 2011 at 10:59:23AM +0200, Amir Goldstein wrote:
>
>> That tool is e2image -r, which creates a sparse file image of your
>> fs (only metadata is written, the rest is holes), so you need to be
>> careful when copying/transferring it to another machine to do it
>> wisely (i.e. bzip or dd directly to a new HDD) Not sure what you
>> will do if fsck fixes errors on that image...  Mostly (if it didn't
>> clone multiply claimed blocks for example), you would be able to
>> write the fixed image back onto your original fs, but that would be
>> risky.
>
> I can then run the fsck tests on the image. I expect fsck to find
> errors: I'm using the filesystem when I'm making that image.... It
> won't be consistent.
>

So you probably won't learn a lot from the fsck results, unless you
only want to provide memory-usage/runtime statistics as per Andreas'
request.

You have the option to use NEXT3 to take a snapshot of your fs
while it is online, but I don't suppose you would want to experiment
on your backup server.

Amir.
Rogier Wolff Feb. 24, 2011, 11:53 p.m. UTC | #15
On Thu, Feb 24, 2011 at 10:59:23AM +0200, Amir Goldstein wrote:
> That tool is e2image -r, which creates a sparse file image of your fs

Ah... I got: 

/home/wolff/e2image: File too large error writing block 535822337

Sigh. 2Tb max filesize on ext3. :-(

	Roger.
Daniel Taylor Feb. 25, 2011, 12:26 a.m. UTC | #16
> -----Original Message-----
> From: linux-ext4-owner@vger.kernel.org 
> [mailto:linux-ext4-owner@vger.kernel.org] On Behalf Of Rogier Wolff
> Sent: Wednesday, February 23, 2011 11:30 PM
> To: Andreas Dilger
> Cc: linux-ext4@vger.kernel.org
> Subject: Re: fsck performance.
> 
> On Wed, Feb 23, 2011 at 03:24:18PM -0700, Andreas Dilger wrote:
> 
...
> 
> Second: e2fsck is too fragile as it is. It should be able to handle
> big filesystems on little systems. I have a puny little 2GHz Athlon
> system that currently has 3T of disk storage and 1G RAM. Embedded
> Linux systems can be running those amounts of storage with only 64
> or 128 Mb of RAM. 

I have to second this comment.  One of our NAS units has 256 MBytes of
RAM (and they wanted 64) with a 3TB disk, 2.996TB of which is an EXT4
file system.  With our 2.6.32.11 kernel and e2fsprogs version 1.41.3-1,
all I get is a segfault when I run fsck.ext4.


Patch

diff --git a/e2fsck/dirinfo.c b/e2fsck/dirinfo.c
index 901235c..f0bf2ad 100644
--- a/e2fsck/dirinfo.c
+++ b/e2fsck/dirinfo.c
@@ -62,7 +62,7 @@  static void setup_tdb(e2fsck_t ctx, ext2_ino_t num_dirs)
 	uuid_unparse(ctx->fs->super->s_uuid, uuid);
 	sprintf(db->tdb_fn, "%s/%s-dirinfo-XXXXXX", tdb_dir, uuid);
 	fd = mkstemp(db->tdb_fn);
-	db->tdb = tdb_open(db->tdb_fn, 0, TDB_CLEAR_IF_FIRST,
+	db->tdb = tdb_open(db->tdb_fn, 99931, TDB_CLEAR_IF_FIRST,
 			   O_RDWR | O_CREAT | O_TRUNC, 0600);
 	close(fd);
 }
diff --git a/lib/ext2fs/icount.c b/lib/ext2fs/icount.c
index bec0f5f..bba740d 100644
--- a/lib/ext2fs/icount.c
+++ b/lib/ext2fs/icount.c
@@ -193,7 +193,7 @@  errcode_t ext2fs_create_icount_tdb(ext2_filsys fs, char *tdb_dir,
 	fd = mkstemp(fn);
 
 	icount->tdb_fn = fn;
-	icount->tdb = tdb_open(fn, 0, TDB_CLEAR_IF_FIRST,
+	icount->tdb = tdb_open(fn, 999931, TDB_NOLOCK | TDB_NOSYNC,
 			       O_RDWR | O_CREAT | O_TRUNC, 0600);
 	if (icount->tdb) {
 		close(fd);