Message ID | 20130614140940.GA20401@gmail.com |
---|---|
State | Superseded, archived |
Headers | show |
On Fri, Jun 14, 2013 at 10:09:40PM +0800, Zheng Liu wrote: > This commit tries to fix this problem. Now a new member called > i_touch_when is added into ext4_inode_info to record the last access > time for an inode. Meanwhile we never need to keep a proper in-order > LRU list. So this can avoid to burns some CPU time. When we try to > reclaim some entries from extent status tree, we use list_sort() to get > a proper in-order list. Then we traverse this list to discard some > entries. I think this approach is very sound. I have one reservation, however. If there are a large number of inodes on this list, list_sort() could burn a lot of CPU time, especially if the shrinker is called a very large number of times in a short time period (i.e., when the system is thrashing, or when one or more memory containers are under a large amount of memory pressure). I have a suggestion for how to address this: Keep a timestamp of when the list last has been sorted in struct ext4_super_info. When iterating over the list, looking for a candidate inode, if inode's i_touch_when is greater than the last time sorted, skip the inode. If there are no inodes left in the list, or more than some threshold inodes have been skipped, only then resort the list (and update the timestamp in the ext4_super_info structure). What do you think? - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 06/14/2013 07:09 AM, Zheng Liu wrote: > - INIT_LIST_HEAD(&scanned); > - > spin_lock(&sbi->s_es_lru_lock); > + list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp); > list_for_each_safe(cur, tmp, &sbi->s_es_lru) { How long can this list get? I have the feeling this might get a bit painful, especially on a NUMA machine. But, it definitely eliminates the spinlock contention that I was seeing. The top ext4 function in my profiles is way down under 1% of CPU time now. Thanks for the quick response, and please let me know if you need any further testing. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 06/14/2013 10:11 AM, Zheng Liu wrote: > On Fri, Jun 14, 2013 at 08:57:44AM -0700, Dave Hansen wrote: >> > On 06/14/2013 07:09 AM, Zheng Liu wrote: >>> > > - INIT_LIST_HEAD(&scanned); >>> > > - >>> > > spin_lock(&sbi->s_es_lru_lock); >>> > > + list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp); >>> > > list_for_each_safe(cur, tmp, &sbi->s_es_lru) { >> > >> > How long can this list get? I have the feeling this might get a bit >> > painful, especially on a NUMA machine. > I guess that you worry about the time of sorting a lru list, right? Yeah, just worried about the amount of time it takes to examine (and possibly write to) every item on the list. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jun 14, 2013 at 10:02:15AM -0400, Theodore Ts'o wrote: > On Fri, Jun 14, 2013 at 10:09:40PM +0800, Zheng Liu wrote: > > This commit tries to fix this problem. Now a new member called > > i_touch_when is added into ext4_inode_info to record the last access > > time for an inode. Meanwhile we never need to keep a proper in-order > > LRU list. So this can avoid to burns some CPU time. When we try to > > reclaim some entries from extent status tree, we use list_sort() to get > > a proper in-order list. Then we traverse this list to discard some > > entries. > > I think this approach is very sound. I have one reservation, however. > If there are a large number of inodes on this list, list_sort() could > burn a lot of CPU time, especially if the shrinker is called a very > large number of times in a short time period (i.e., when the system is > thrashing, or when one or more memory containers are under a large > amount of memory pressure). > > I have a suggestion for how to address this: Keep a timestamp of when > the list last has been sorted in struct ext4_super_info. When > iterating over the list, looking for a candidate inode, if inode's > i_touch_when is greater than the last time sorted, skip the inode. If > there are no inodes left in the list, or more than some threshold > inodes have been skipped, only then resort the list (and update the > timestamp in the ext4_super_info structure). > > What do you think? Thanks for your suggestion. I fully agree with you. What I concern is how to define this threshold. A fixed value is very simple but too inflexible. If this value is too big, this lru list could never be sorted. So my proposal is that we set this value accroding to the number of used inodes. For example, if the number of used inodes is 10,000, this threshold could be set to 10 (10,000 x 0.1%). Meanwhile, we need to provide a sysfs interface to let user set this value. Does it make sense? Thanks, - Zheng -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jun 14, 2013 at 08:57:44AM -0700, Dave Hansen wrote: > On 06/14/2013 07:09 AM, Zheng Liu wrote: > > - INIT_LIST_HEAD(&scanned); > > - > > spin_lock(&sbi->s_es_lru_lock); > > + list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp); > > list_for_each_safe(cur, tmp, &sbi->s_es_lru) { > > How long can this list get? I have the feeling this might get a bit > painful, especially on a NUMA machine. I guess that you worry about the time of sorting a lru list, right? Ted and I also worry about this, especially when shrinker is called() very frequently. Ted has presented a solution. I will give it a try. > > But, it definitely eliminates the spinlock contention that I was seeing. > The top ext4 function in my profiles is way down under 1% of CPU time > now. Thanks for the quick response, and please let me know if you need > any further testing. Thanks for your help. As I said above, I will try to improve this patch later. My server's numa has been turned off in BIOS and I haven't a privilege to turn it on. So that would be great if you could help me to do some further testing. Thanks in advance, - Zheng -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Jun 15, 2013 at 01:00:28AM +0800, Zheng Liu wrote: > > I have a suggestion for how to address this: Keep a timestamp of when > > the list last has been sorted in struct ext4_super_info. When > > iterating over the list, looking for a candidate inode, if inode's > > i_touch_when is greater than the last time sorted, skip the inode. If > > there are no inodes left in the list, or more than some threshold > > inodes have been skipped, only then resort the list (and update the > > timestamp in the ext4_super_info structure). > > > > Thanks for your suggestion. I fully agree with you. What I concern is > how to define this threshold. A fixed value is very simple but too > inflexible. If this value is too big, this lru list could never be > sorted. In my proposal, if the list is never sorted before, we would definitely sort the list the first time there is a request to shrink the number of extent caches. So the list would always be sorted at least once. If an inode gets accessed after the time when the list has been sorted, then the last_touch time will be greater than the last_sorted time, right? And our goal is to use the least recently used inode, so as long as we know they have been used more recently than the time the list was sorted, and there are still inodes on the list where the last_touch time is older than the last_sort time, we don't have a problem. So we will evict all of the non-delalloc extent cache items from that inode, and remove it from the list. The original reason for having the threshold at all is so that if we skip a huge number of inodes on the list, this would take a long time. But thinking about this some more, we don't even need a threshold. What we do instead is if the the last_touch time is newer than the last_sort time, we just move the inode to the end of the list. That means the end of the list will be unsorted, but that's OK, because the oldest inodes are still at the beginning of the list. Once the head of the list is newer than the last_sorted time, then we know we need to resort the entire list. Does that make sense to you? - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 019db3c..1c31f27 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -874,6 +874,7 @@ struct ext4_inode_info { rwlock_t i_es_lock; struct list_head i_es_lru; unsigned int i_es_lru_nr; /* protected by i_es_lock */ + unsigned long i_touch_when; /* jiffies of last accessing */ /* ialloc */ ext4_group_t i_last_alloc_group; diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c index e6941e6..0b867b8 100644 --- a/fs/ext4/extents_status.c +++ b/fs/ext4/extents_status.c @@ -10,6 +10,7 @@ * Ext4 extents status tree core functions. */ #include <linux/rbtree.h> +#include <linux/list_sort.h> #include "ext4.h" #include "extents_status.h" #include "ext4_extents.h" @@ -291,7 +292,6 @@ out: read_unlock(&EXT4_I(inode)->i_es_lock); - ext4_es_lru_add(inode); trace_ext4_es_find_delayed_extent_range_exit(inode, es); } @@ -672,7 +672,6 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk, error: write_unlock(&EXT4_I(inode)->i_es_lock); - ext4_es_lru_add(inode); ext4_es_print_tree(inode); return err; @@ -734,7 +733,6 @@ out: read_unlock(&EXT4_I(inode)->i_es_lock); - ext4_es_lru_add(inode); trace_ext4_es_lookup_extent_exit(inode, es, found); return found; } @@ -878,12 +876,30 @@ int ext4_es_zeroout(struct inode *inode, struct ext4_extent *ex) EXTENT_STATUS_WRITTEN); } +static int ext4_inode_touch_time_cmp(void *priv, struct list_head *a, + struct list_head *b) +{ + struct ext4_inode_info *eia, *eib; + unsigned long diff; + + eia = list_entry(a, struct ext4_inode_info, i_es_lru); + eib = list_entry(b, struct ext4_inode_info, i_es_lru); + + diff = eia->i_touch_when - eib->i_touch_when; + if (diff < 0) + return -1; + if (diff > 0) + return 1; + return 0; +} + static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc) { struct ext4_sb_info *sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker); struct ext4_inode_info *ei; - struct list_head *cur, *tmp, scanned; + struct list_head *cur, *tmp; + LIST_HEAD(scanned); int nr_to_scan = sc->nr_to_scan; int ret, nr_shrunk = 0; @@ -893,10 +909,16 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc) if (!nr_to_scan) return ret; - INIT_LIST_HEAD(&scanned); - spin_lock(&sbi->s_es_lru_lock); + list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp); list_for_each_safe(cur, tmp, &sbi->s_es_lru) { + /* + * If we have already reclaimed all extents from extent + * status tree, just stop the loop immediately. + */ + if (percpu_counter_read_positive(&sbi->s_extent_cache_cnt) == 0) + break; + list_move_tail(cur, &scanned); ei = list_entry(cur, struct ext4_inode_info, i_es_lru); @@ -947,11 +969,14 @@ void ext4_es_lru_add(struct inode *inode) struct ext4_inode_info *ei = EXT4_I(inode); struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); + ei->i_touch_when = jiffies; + + if (!list_empty(&ei->i_es_lru)) + return; + spin_lock(&sbi->s_es_lru_lock); if (list_empty(&ei->i_es_lru)) - list_add_tail(&ei->i_es_lru, &sbi->s_es_lru); - else - list_move_tail(&ei->i_es_lru, &sbi->s_es_lru); + list_add(&ei->i_es_lru, &sbi->s_es_lru); spin_unlock(&sbi->s_es_lru_lock); } diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 38f03dc..1477406 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -571,6 +571,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode, "logical block %lu\n", inode->i_ino, flags, map->m_len, (unsigned long) map->m_lblk); + ext4_es_lru_add(inode); + /* Lookup extent status tree firstly */ if (ext4_es_lookup_extent(inode, map->m_lblk, &es)) { if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) { @@ -1888,6 +1890,8 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock, "logical block %lu\n", inode->i_ino, map->m_len, (unsigned long) map->m_lblk); + ext4_es_lru_add(inode); + /* Lookup extent status tree firstly */ if (ext4_es_lookup_extent(inode, iblock, &es)) { diff --git a/fs/ext4/super.c b/fs/ext4/super.c index a9c1438..b365695 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -849,6 +849,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb) rwlock_init(&ei->i_es_lock); INIT_LIST_HEAD(&ei->i_es_lru); ei->i_es_lru_nr = 0; + ei->i_touch_when = 0; ei->i_reserved_data_blocks = 0; ei->i_reserved_meta_blocks = 0; ei->i_allocated_meta_blocks = 0;