
[RFC,3/9,v1] ext4: add physical block and status member into extent status tree

Message ID 1356335742-11793-4-git-send-email-wenqing.lz@taobao.com
State Superseded, archived

Commit Message

Zheng Liu Dec. 24, 2012, 7:55 a.m. UTC
From: Zheng Liu <wenqing.lz@taobao.com>

es_pblk records the physical block that the extent maps to on disk.  es_status
records the status of the extent.  Three statuses are defined: written,
unwritten and delayed.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/extents_status.h | 8 ++++++++
 1 file changed, 8 insertions(+)
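
As a quick orientation for readers of the thread, here is an illustrative
helper (not part of this series; the function name is made up, and fs/ext4
headers are assumed to be in scope) showing what the two new members buy us:
with es_pblk and es_status cached, a hit in the status tree can answer both
where a logical block lives on disk and in what state it is, without reading
the on-disk extent tree.

/*
 * Illustrative only: map a logical block inside a cached extent to its
 * physical block.  The result is only meaningful when es_status is WRITTEN
 * or UNWRITTEN; a DELAYED extent has no physical block yet.
 */
static ext4_fsblk_t es_lblk_to_pblk(const struct extent_status *es,
				    ext4_lblk_t lblk)
{
	/* caller guarantees es->es_lblk <= lblk < es->es_lblk + es->es_len */
	return es->es_pblk + (lblk - es->es_lblk);
}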

Comments

Jan Kara Dec. 31, 2012, 9:49 p.m. UTC | #1
On Mon 24-12-12 15:55:36, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> es_pblk is used to record physical block that maps to the disk.  es_status is
> used to record the status of the extent.  Three status are defined, which are
> written, unwritten and delayed.
  So this means one extent is 48 bytes on 64-bit architectures. If I'm a
nasty user and create an artificially fragmented file (by allocating every
second block), the extent tree takes 6 MB per GB of file. That's quite a bit,
and I think you need to provide a way for the kernel to reclaim extent
structures...
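
As a rough check of that estimate, a small userspace sketch (assuming 64-bit
pointers and 4 KiB blocks; the mock structs below only mirror the layout of
the patched kernel structures and are not kernel code):

#include <stdio.h>
#include <stdint.h>

/* Userspace mock of the kernel structures, just to reproduce the sizes. */
struct mock_rb_node { uintptr_t parent_color; void *left, *right; };	/* 24 bytes */

struct mock_extent_status {
	struct mock_rb_node rb_node;	/* 24 bytes */
	uint32_t es_lblk;		/* ext4_lblk_t */
	uint32_t es_len;		/* ext4_lblk_t */
	uint64_t es_pblk;		/* ext4_fsblk_t */
	int es_status;			/* padded out to 8 bytes */
};					/* 48 bytes in total */

int main(void)
{
	unsigned long blocks = (1UL << 30) / 4096;	/* 4 KiB blocks per GiB */
	unsigned long extents = blocks / 2;		/* every second block allocated */
	unsigned long bytes = extents * sizeof(struct mock_extent_status);

	printf("%zu bytes per extent, %lu bytes (~%lu MiB) per GiB of file\n",
	       sizeof(struct mock_extent_status), bytes, bytes >> 20);
	return 0;
}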

								Honza
> 
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> ---
>  fs/ext4/extents_status.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> index 81e9339..85115bb 100644
> --- a/fs/ext4/extents_status.h
> +++ b/fs/ext4/extents_status.h
> @@ -20,10 +20,18 @@
>  #define es_debug(fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
>  #endif
>  
> +enum {
> +	EXTENT_STATUS_WRITTEN = 0,	/* written extent */
> +	EXTENT_STATUS_UNWRITTEN = 1,	/* unwritten extent */
> +	EXTENT_STATUS_DELAYED = 2,	/* delayed extent */
> +};
> +
>  struct extent_status {
>  	struct rb_node rb_node;
>  	ext4_lblk_t es_lblk;	/* first logical block extent covers */
>  	ext4_lblk_t es_len;	/* length of extent in block */
> +	ext4_fsblk_t es_pblk;	/* first physical block */
> +	int es_status;		/* record the status of extent */
>  };
>  
>  struct ext4_es_tree {
> -- 
> 1.7.12.rc2.18.g61b472e
> 
Zheng Liu Jan. 1, 2013, 5:16 a.m. UTC | #2
On Mon, Dec 31, 2012 at 10:49:52PM +0100, Jan Kara wrote:
> On Mon 24-12-12 15:55:36, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > es_pblk is used to record physical block that maps to the disk.  es_status is
> > used to record the status of the extent.  Three status are defined, which are
> > written, unwritten and delayed.
>   So this means one extent is 48 bytes on 64-bit architectures. If I'm a
> nasty user and create artificially fragmented file (by allocating every
> second block), extent tree takes 6 MB per GB of file. That's quite a bit
> and I think you need to provide a way for kernel to reclaim extent
> structures...

Indeed, when a file is heavily fragmented, the status tree will occupy a lot
of memory.  That is why it is loaded on demand.  When I implemented it, there
were two options for populating the status tree: load it on demand, or load
the complete extent tree in ext4_alloc_inode().  I chose the former because
it reduces memory pressure most of the time.  But it has a disadvantage: the
status tree cannot be fully trusted, because it does not track the complete
state of the on-disk extent tree.

I will provide a way to reclaim extent structures from the status tree.  My
current idea is that we can reclaim all extents in WRITTEN/UNWRITTEN status,
because we always need the DELAYED extents in the fiemap, seek_data/hole and
bigalloc code.  Furthermore, as you said in another mail, unwritten extents
that are about to be converted to written should not be reclaimed either.
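
A very rough sketch of what such a reclaim pass could look like follows; the
function name, the 'root' member of ext4_es_tree and the ext4_es_cachep slab
name are assumptions made for illustration, and all locking is omitted.

/*
 * Illustrative only: walk the status tree and drop every extent whose state
 * can be re-read from the on-disk extent tree, keeping DELAYED ones, which
 * exist nowhere else.
 */
static void es_reclaim_written(struct ext4_es_tree *tree)
{
	struct rb_node *node = rb_first(&tree->root);
	struct extent_status *es;

	while (node) {
		es = rb_entry(node, struct extent_status, rb_node);
		node = rb_next(node);		/* grab next before erasing */
		if (es->es_status == EXTENT_STATUS_DELAYED)
			continue;		/* needed by fiemap/seek/bigalloc */
		rb_erase(&es->rb_node, &tree->root);
		kmem_cache_free(ext4_es_cachep, es);
	}
}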

Another question is when these extents should be reclaimed.  Currently the
whole status tree is reclaimed when clear_inode() is called.  Maybe a switch
in sysfs would be an option.  Any thoughts?

Thanks,
                                                - Zheng
Jan Kara Jan. 2, 2013, 11:22 a.m. UTC | #3
On Tue 01-01-13 13:16:07, Zheng Liu wrote:
> On Mon, Dec 31, 2012 at 10:49:52PM +0100, Jan Kara wrote:
> > On Mon 24-12-12 15:55:36, Zheng Liu wrote:
> > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > 
> > > es_pblk is used to record physical block that maps to the disk.  es_status is
> > > used to record the status of the extent.  Three status are defined, which are
> > > written, unwritten and delayed.
> >   So this means one extent is 48 bytes on 64-bit architectures. If I'm a
> > nasty user and create artificially fragmented file (by allocating every
> > second block), extent tree takes 6 MB per GB of file. That's quite a bit
> > and I think you need to provide a way for kernel to reclaim extent
> > structures...
> 
> Indeed, when a file has a lot of fragmentations, status tree will occupy
> a number of memory.  That is why it will be loaded on-demand.  When I make
> it, there are two solutions to load status tree.  One is loading
> on-demand, and another is loading complete extent tree in
> ext4_alloc_inode().  Finally I choose the former because it can reduce
> the pressure of memory at most of time.  But it has a disadvantage that
> status tree doesn't be fully trusted because it hasn't track a
> completely status of extent tree on disk.
  Not reading the whole extent tree in ext4_alloc_inode() is a good start,
but it's not the whole solution IMHO. It saves us from unnecessary reading
of extents, but if someone reads the whole filesystem (like grep -R "foo" /)
you will still end up with all extents cached, and that will make ext4
inodes pretty heavy in memory. Surely inode reclaim will eventually release
these inodes, including cached extents, but it is usually more beneficial to
cache the inode itself than more extents, so allowing us to strip cached
extents without releasing the inode itself would be good.

> I will provide a way to reclaim extent structures from status tree.  Now
> I have an idea in my mind that we can reclaim all extent which are
> WRITTEN/UNWRITTEN status because we always need DELAYED extent in
> fiemap, seek_data/hole and bigalloc code.  Furthermore, as you said in
> another mail, some unwritten extent which will be converted into
> written also doesn't be reclaimed.
> 
> Another question is when do these extents reclaim?  Currently when
> clear_inode() is called, the whole status tree will be reclaimed.  Maybe
> a switch in sysfs is a optional choice.  Any thoughts?
  The natural way to handle the shrinking is to use the 'shrinker' framework.
In this case, we could register a shrinker for shrinking extents. Just having
an LRU of extents would increase the size of the extent structure by 2
pointers, which I think is too big, and I'm not yet sure how to choose
extents for reclaim in some other way. I will think about it...
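
For concreteness, a registration sketch against the single ->shrink callback
API mainline used around this time; the per-sb shrinker field, the reclaim
helper and the cached-extent counter are invented for illustration.

#include <linux/shrinker.h>	/* struct shrinker, register_shrinker() */
#include "ext4.h"

/* hypothetical reclaim helper and fields in struct ext4_sb_info */
static void es_try_to_reclaim_extents(struct ext4_sb_info *sbi,
				      unsigned long nr_to_scan);

static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
{
	struct ext4_sb_info *sbi = container_of(shrink, struct ext4_sb_info,
						s_es_shrinker);

	if (sc->nr_to_scan)
		es_try_to_reclaim_extents(sbi, sc->nr_to_scan);

	/* report how many reclaimable extents remain cached */
	return atomic_read(&sbi->s_es_cached_extents);	/* hypothetical counter */
}

/* called from ext4_fill_super(), once the sb info is set up */
static void ext4_es_register_shrinker(struct ext4_sb_info *sbi)
{
	sbi->s_es_shrinker.shrink = ext4_es_shrink;
	sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
	register_shrinker(&sbi->s_es_shrinker);
}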

								Honza
Zheng Liu Jan. 5, 2013, 2:44 a.m. UTC | #4
On Wed, Jan 02, 2013 at 12:22:55PM +0100, Jan Kara wrote:
> On Tue 01-01-13 13:16:07, Zheng Liu wrote:
> > On Mon, Dec 31, 2012 at 10:49:52PM +0100, Jan Kara wrote:
> > > On Mon 24-12-12 15:55:36, Zheng Liu wrote:
> > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > 
> > > > es_pblk is used to record physical block that maps to the disk.  es_status is
> > > > used to record the status of the extent.  Three status are defined, which are
> > > > written, unwritten and delayed.
> > >   So this means one extent is 48 bytes on 64-bit architectures. If I'm a
> > > nasty user and create artificially fragmented file (by allocating every
> > > second block), extent tree takes 6 MB per GB of file. That's quite a bit
> > > and I think you need to provide a way for kernel to reclaim extent
> > > structures...
> > 
> > Indeed, when a file has a lot of fragmentations, status tree will occupy
> > a number of memory.  That is why it will be loaded on-demand.  When I make
> > it, there are two solutions to load status tree.  One is loading
> > on-demand, and another is loading complete extent tree in
> > ext4_alloc_inode().  Finally I choose the former because it can reduce
> > the pressure of memory at most of time.  But it has a disadvantage that
> > status tree doesn't be fully trusted because it hasn't track a
> > completely status of extent tree on disk.
>   Not reading the whole extent tree in ext4_alloc_inode() is a good start
> but it's not the whole solution IMHO. It saves us from unnecessary reading
> of extents but still if someone reads the whole filesystem (like
> grep -R "foo" /) you will still end up with all extents cached. And that
> will make ext4 inodes pretty heavy in memory. Surely inode reclaim will
> eventually release these inodes including cached extents but it is usually
> more beneficial to cache the inode itself than more extents so allowing us
> to strip cached extents without releasing inode itself would be good.
> 
> > I will provide a way to reclaim extent structures from status tree.  Now
> > I have an idea in my mind that we can reclaim all extent which are
> > WRITTEN/UNWRITTEN status because we always need DELAYED extent in
> > fiemap, seek_data/hole and bigalloc code.  Furthermore, as you said in
> > another mail, some unwritten extent which will be converted into
> > written also doesn't be reclaimed.
> > 
> > Another question is when do these extents reclaim?  Currently when
> > clear_inode() is called, the whole status tree will be reclaimed.  Maybe
> > a switch in sysfs is a optional choice.  Any thoughts?
>   The natural way to handle the shrinking is using 'shrinker' framework. In
> this case, we could register a shrinker for shrinking extents. Just having
> LRU of extents would increase the size of extent structure by 2 pointers
> which is too big I'd think and I'm not yet sure how to choose extents for
> reclaim in some other way. I will think about it...

Hi Jan,

Sorry for the delay.  The 'shrinker' framework is an option.  We can define
a callback function to reclaim extents from the status tree.  When we access
an extent in an inode, we would move that inode to the tail of an LRU list.
But this approach has a drawback: the spinlock that protects the LRU list
would be heavily contended, because every inode needs to take it.  I suspect
this overhead is unacceptable for us.  Any comments?
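
The access pattern being worried about here, written out with invented names,
boils down to something like the following on every status-tree lookup, so
every extent lookup on every inode serializes on one filesystem-wide spinlock:

/* illustrative only; the lock, list and fields are made-up names */
static void es_lru_touch(struct ext4_sb_info *sbi, struct ext4_inode_info *ei)
{
	spin_lock(&sbi->s_es_lru_lock);		/* one lock for the whole fs */
	list_move_tail(&ei->i_es_lru, &sbi->s_es_lru_list);
	spin_unlock(&sbi->s_es_lru_lock);
}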

Thanks,
                                                - Zheng
Dave Chinner Jan. 8, 2013, 1:27 a.m. UTC | #5
On Sat, Jan 05, 2013 at 10:44:01AM +0800, Zheng Liu wrote:
> On Wed, Jan 02, 2013 at 12:22:55PM +0100, Jan Kara wrote:
> > On Tue 01-01-13 13:16:07, Zheng Liu wrote:
> > > On Mon, Dec 31, 2012 at 10:49:52PM +0100, Jan Kara wrote:
> > > > On Mon 24-12-12 15:55:36, Zheng Liu wrote:
> > > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > > 
> > > > > es_pblk is used to record physical block that maps to the disk.  es_status is
> > > > > used to record the status of the extent.  Three status are defined, which are
> > > > > written, unwritten and delayed.
> > > >   So this means one extent is 48 bytes on 64-bit architectures. If I'm a
> > > > nasty user and create artificially fragmented file (by allocating every
> > > > second block), extent tree takes 6 MB per GB of file. That's quite a bit
> > > > and I think you need to provide a way for kernel to reclaim extent
> > > > structures...
> > > 
> > > Indeed, when a file has a lot of fragmentations, status tree will occupy
> > > a number of memory.  That is why it will be loaded on-demand.  When I make
> > > it, there are two solutions to load status tree.  One is loading
> > > on-demand, and another is loading complete extent tree in
> > > ext4_alloc_inode().  Finally I choose the former because it can reduce
> > > the pressure of memory at most of time.  But it has a disadvantage that
> > > status tree doesn't be fully trusted because it hasn't track a
> > > completely status of extent tree on disk.
> >   Not reading the whole extent tree in ext4_alloc_inode() is a good start
> > but it's not the whole solution IMHO. It saves us from unnecessary reading
> > of extents but still if someone reads the whole filesystem (like
> > grep -R "foo" /) you will still end up with all extents cached. And that
> > will make ext4 inodes pretty heavy in memory. Surely inode reclaim will
> > eventually release these inodes including cached extents but it is usually
> > more beneficial to cache the inode itself than more extents so allowing us
> > to strip cached extents without releasing inode itself would be good.
> > 
> > > I will provide a way to reclaim extent structures from status tree.  Now
> > > I have an idea in my mind that we can reclaim all extent which are
> > > WRITTEN/UNWRITTEN status because we always need DELAYED extent in
> > > fiemap, seek_data/hole and bigalloc code.  Furthermore, as you said in
> > > another mail, some unwritten extent which will be converted into
> > > written also doesn't be reclaimed.
> > > 
> > > Another question is when do these extents reclaim?  Currently when
> > > clear_inode() is called, the whole status tree will be reclaimed.  Maybe
> > > a switch in sysfs is a optional choice.  Any thoughts?
> >   The natural way to handle the shrinking is using 'shrinker' framework. In
> > this case, we could register a shrinker for shrinking extents. Just having
> > LRU of extents would increase the size of extent structure by 2 pointers
> > which is too big I'd think and I'm not yet sure how to choose extents for
> > reclaim in some other way. I will think about it...
> 
> Hi Jan,
> 
> Sorry for the delay.  'shrinker' framework is an option.  We can define
> a callback function to reclaim extents from status tree.  When we access
> an extent in an inode, we will move this inode into the tail of LRU list.
> But this way has a defect that the spinlock which protects the LRU list
> has a heavy contention because all inodes need to take this lock.  I
> guess this overhead is unacceptable for us.  Any comments?

Measure it first. There are several filesystem global locks still
in existence at the VFS level. Solve the simple problem first, and
then the hard problem might get solved for you by someone else, e.g.:

http://oss.sgi.com/archives/xfs/2012-11/msg00643.html

Cheers,

Dave.
Zheng Liu Jan. 8, 2013, 2:25 a.m. UTC | #6
On Tue, Jan 08, 2013 at 12:27:54PM +1100, Dave Chinner wrote:
> On Sat, Jan 05, 2013 at 10:44:01AM +0800, Zheng Liu wrote:
> > On Wed, Jan 02, 2013 at 12:22:55PM +0100, Jan Kara wrote:
> > > On Tue 01-01-13 13:16:07, Zheng Liu wrote:
> > > > On Mon, Dec 31, 2012 at 10:49:52PM +0100, Jan Kara wrote:
> > > > > On Mon 24-12-12 15:55:36, Zheng Liu wrote:
> > > > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > > > 
> > > > > > es_pblk is used to record physical block that maps to the disk.  es_status is
> > > > > > used to record the status of the extent.  Three status are defined, which are
> > > > > > written, unwritten and delayed.
> > > > >   So this means one extent is 48 bytes on 64-bit architectures. If I'm a
> > > > > nasty user and create artificially fragmented file (by allocating every
> > > > > second block), extent tree takes 6 MB per GB of file. That's quite a bit
> > > > > and I think you need to provide a way for kernel to reclaim extent
> > > > > structures...
> > > > 
> > > > Indeed, when a file has a lot of fragmentations, status tree will occupy
> > > > a number of memory.  That is why it will be loaded on-demand.  When I make
> > > > it, there are two solutions to load status tree.  One is loading
> > > > on-demand, and another is loading complete extent tree in
> > > > ext4_alloc_inode().  Finally I choose the former because it can reduce
> > > > the pressure of memory at most of time.  But it has a disadvantage that
> > > > status tree doesn't be fully trusted because it hasn't track a
> > > > completely status of extent tree on disk.
> > >   Not reading the whole extent tree in ext4_alloc_inode() is a good start
> > > but it's not the whole solution IMHO. It saves us from unnecessary reading
> > > of extents but still if someone reads the whole filesystem (like
> > > grep -R "foo" /) you will still end up with all extents cached. And that
> > > will make ext4 inodes pretty heavy in memory. Surely inode reclaim will
> > > eventually release these inodes including cached extents but it is usually
> > > more beneficial to cache the inode itself than more extents so allowing us
> > > to strip cached extents without releasing inode itself would be good.
> > > 
> > > > I will provide a way to reclaim extent structures from status tree.  Now
> > > > I have an idea in my mind that we can reclaim all extent which are
> > > > WRITTEN/UNWRITTEN status because we always need DELAYED extent in
> > > > fiemap, seek_data/hole and bigalloc code.  Furthermore, as you said in
> > > > another mail, some unwritten extent which will be converted into
> > > > written also doesn't be reclaimed.
> > > > 
> > > > Another question is when do these extents reclaim?  Currently when
> > > > clear_inode() is called, the whole status tree will be reclaimed.  Maybe
> > > > a switch in sysfs is a optional choice.  Any thoughts?
> > >   The natural way to handle the shrinking is using 'shrinker' framework. In
> > > this case, we could register a shrinker for shrinking extents. Just having
> > > LRU of extents would increase the size of extent structure by 2 pointers
> > > which is too big I'd think and I'm not yet sure how to choose extents for
> > > reclaim in some other way. I will think about it...
> > 
> > Hi Jan,
> > 
> > Sorry for the delay.  'shrinker' framework is an option.  We can define
> > a callback function to reclaim extents from status tree.  When we access
> > an extent in an inode, we will move this inode into the tail of LRU list.
> > But this way has a defect that the spinlock which protects the LRU list
> > has a heavy contention because all inodes need to take this lock.  I
> > guess this overhead is unacceptable for us.  Any comments?
> 
> Measure it first. There are several filesystem global locks still
> in existance at the VFS level. solve the simple problem first, and
> then the hard problem might get solved for you by someone else. e.g:
> 
> http://oss.sgi.com/archives/xfs/2012-11/msg00643.html

Thanks for teaching me. :-)  I will measure its overhead first.

Regards,
                                                - Zheng

Patch

diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 81e9339..85115bb 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -20,10 +20,18 @@ 
 #define es_debug(fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
 #endif
 
+enum {
+	EXTENT_STATUS_WRITTEN = 0,	/* written extent */
+	EXTENT_STATUS_UNWRITTEN = 1,	/* unwritten extent */
+	EXTENT_STATUS_DELAYED = 2,	/* delayed extent */
+};
+
 struct extent_status {
 	struct rb_node rb_node;
 	ext4_lblk_t es_lblk;	/* first logical block extent covers */
 	ext4_lblk_t es_len;	/* length of extent in block */
+	ext4_fsblk_t es_pblk;	/* first physical block */
+	int es_status;		/* record the status of extent */
 };
 
 struct ext4_es_tree {