Patchwork [2/2] ext4: Reduce contention on s_orphan_lock

login
register
mail settings
Submitter Jan Kara
Date May 15, 2014, 8:17 p.m.
Message ID <1400185026-3972-3-git-send-email-jack@suse.cz>
Download mbox | patch
Permalink /patch/349358/
State Superseded
Headers show

Comments

Jan Kara - May 15, 2014, 8:17 p.m.
Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that
we avoid taking global s_orphan_lock in some cases and hold it for
shorter time in other cases.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/namei.c | 109 +++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 65 insertions(+), 44 deletions(-)
Theodore Ts'o - May 20, 2014, 3:23 a.m.
On Thu, May 15, 2014 at 10:17:06PM +0200, Jan Kara wrote:
> Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that
> we avoid taking global s_orphan_lock in some cases and hold it for
> shorter time in other cases.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Hi Jan,

With this patch applies, xfstests is hanging after ext4/001 runs:

BEGIN TEST: Ext4 4k block Tue May 20 03:18:49 UTC 2014
Device: /dev/vdb
mk2fs options: -q
mount options: -o block_validity
FSTYP         -- ext4
PLATFORM      -- Linux/i686 candygram 3.15.0-rc2-00052-g0f430fc
MKFS_OPTIONS  -- -q /dev/vdc
MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity /dev/vdc /vdc

ext4/001 4s ...	 [03:19:05] [03:19:09] 4s

It just stops here, and never runs the next test.

If I revert this commit, the test run continus with ext4/002, and then
onto the rest of the tests.

Could you take a look?

Thanks!!

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thavatchai Makphaibulchoke - May 20, 2014, 8:33 a.m.
Please see my one comment below.

BTW, I've run aim7 on your before I notice what I commented below.  There are workloads that my patch outperform yours and vice versa.  I will have to redo it over again.

On 05/15/2014 02:17 PM, Jan Kara wrote:
> Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that
> we avoid taking global s_orphan_lock in some cases and hold it for
> shorter time in other cases.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/namei.c | 109 +++++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 65 insertions(+), 44 deletions(-)
> 
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 5fcaa85b6dc5..0486fbafb808 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -2539,13 +2539,17 @@ static int empty_dir(struct inode *inode)
>  	return 1;
>  }
>  
> -/* ext4_orphan_add() links an unlinked or truncated inode into a list of
> +/*
> + * ext4_orphan_add() links an unlinked or truncated inode into a list of
>   * such inodes, starting at the superblock, in case we crash before the
>   * file is closed/deleted, or in case the inode truncate spans multiple
>   * transactions and the last transaction is not recovered after a crash.
>   *
>   * At filesystem recovery time, we walk this list deleting unlinked
>   * inodes and truncating linked inodes in ext4_orphan_cleanup().
> + *
> + * Orphan list manipulation functions must be called under i_mutex unless
> + * we are just creating the inode or deleting it.
>   */
>  int ext4_orphan_add(handle_t *handle, struct inode *inode)
>  {
> @@ -2553,13 +2557,19 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
>  	struct ext4_iloc iloc;
>  	int err = 0, rc;
> +	bool dirty = false;
>  
>  	if (!sbi->s_journal)
>  		return 0;
>  
> -	mutex_lock(&sbi->s_orphan_lock);
> +	WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
> +		     !mutex_is_locked(&inode->i_mutex));
> +	/*
> +	 * Exit early if inode already is on orphan list. This is a big speedup
> +	 * since we don't have to contend on the global s_orphan_lock.
> +	 */
>  	if (!list_empty(&EXT4_I(inode)->i_orphan))
> -		goto out_unlock;
> +		return 0;
>  
>  	/*
>  	 * Orphan handling is only valid for files with data blocks
> @@ -2573,44 +2583,47 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
>  	BUFFER_TRACE(sbi->s_sbh, "get_write_access");
>  	err = ext4_journal_get_write_access(handle, sbi->s_sbh);
>  	if (err)
> -		goto out_unlock;
> +		goto out;
>  
>  	err = ext4_reserve_inode_write(handle, inode, &iloc);
>  	if (err)
> -		goto out_unlock;
> +		goto out;
> +
> +	mutex_lock(&sbi->s_orphan_lock);
>  	/*
>  	 * Due to previous errors inode may be already a part of on-disk
>  	 * orphan list. If so skip on-disk list modification.
>  	 */
> -	if (NEXT_ORPHAN(inode) && NEXT_ORPHAN(inode) <=
> -		(le32_to_cpu(sbi->s_es->s_inodes_count)))
> -			goto mem_insert;
> -
> -	/* Insert this inode at the head of the on-disk orphan list... */
> -	NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan);
> -	sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
> -	err = ext4_handle_dirty_super(handle, sb);
> -	rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
> -	if (!err)
> -		err = rc;
> -
> -	/* Only add to the head of the in-memory list if all the
> -	 * previous operations succeeded.  If the orphan_add is going to
> -	 * fail (possibly taking the journal offline), we can't risk
> -	 * leaving the inode on the orphan list: stray orphan-list
> -	 * entries can cause panics at unmount time.
> -	 *
> -	 * This is safe: on error we're going to ignore the orphan list
> -	 * anyway on the next recovery. */
> -mem_insert:
> -	if (!err)
> -		list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan);
> +	if (!NEXT_ORPHAN(inode) || NEXT_ORPHAN(inode) >
> +	    (le32_to_cpu(sbi->s_es->s_inodes_count))) {
> +		/* Insert this inode at the head of the on-disk orphan list */
> +		NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan);
> +		sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
> +		dirty = true;
> +	}
> +	list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan);
> +	mutex_unlock(&sbi->s_orphan_lock);
>  
> +	if (dirty) {
> +		err = ext4_handle_dirty_super(handle, sb);
> +		rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
> +		if (!err)
> +			err = rc;
> +		if (err) {
> +			/*
> +			 * We have to remove inode from in-memory list if
> + 			 * addition to on disk orphan list failed. Stray orphan
> +			 * list entries can cause panics at unmount time.
> +			 */
> +			mutex_lock(&sbi->s_orphan_lock);
> +			list_del(&EXT4_I(inode)->i_orphan);
> +			mutex_unlock(&sbi->s_orphan_lock);
> +		}
> +	}
>  	jbd_debug(4, "superblock will point to %lu\n", inode->i_ino);
>  	jbd_debug(4, "orphan inode %lu will point to %d\n",
>  			inode->i_ino, NEXT_ORPHAN(inode));
> -out_unlock:
> -	mutex_unlock(&sbi->s_orphan_lock);
> +out:
>  	ext4_std_error(sb, err);
>  	return err;
>  }
> @@ -2631,13 +2644,18 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
>  	if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
>  		return 0;
>  
> -	mutex_lock(&sbi->s_orphan_lock);
> +	WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
> +		     !mutex_is_locked(&inode->i_mutex));
> +	/* Do this quick check before taking global s_orphan_lock. */
>  	if (list_empty(&ei->i_orphan))
> -		goto out;
> +		return 0;
>  
> -	ino_next = NEXT_ORPHAN(inode);
> -	prev = ei->i_orphan.prev;
> +	if (handle) {
> +		/* Grab inode buffer early before taking global s_orphan_lock */
> +		err = ext4_reserve_inode_write(handle, inode, &iloc);
> +	}
>  
> +	mutex_lock(&sbi->s_orphan_lock);
>  	jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino);
> 

Should set prev = ei->i_orphan.prev; here, instead of down below where it has already been removed from the list.

Thanks,
Mak.
 
>  	list_del_init(&ei->i_orphan);
> @@ -2646,20 +2664,23 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
>  	 * transaction handle with which to update the orphan list on
>  	 * disk, but we still need to remove the inode from the linked
>  	 * list in memory. */
> -	if (!handle)
> -		goto out;
> -
> -	err = ext4_reserve_inode_write(handle, inode, &iloc);
> -	if (err)
> +	if (!handle || err) {
> +		mutex_unlock(&sbi->s_orphan_lock);
>  		goto out_err;
> +	}
>  
> +	ino_next = NEXT_ORPHAN(inode);
> +	prev = ei->i_orphan.prev;

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - May 20, 2014, 9:18 a.m.
On Tue 20-05-14 02:33:23, Thavatchai Makphaibulchoke wrote:
> Please see my one comment below.
> 
> BTW, I've run aim7 on your before I notice what I commented below.  There
> are workloads that my patch outperform yours and vice versa.  I will have
> to redo it over again.
> 
> On 05/15/2014 02:17 PM, Jan Kara wrote:
> > @@ -2631,13 +2644,18 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
> >  	if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
> >  		return 0;
> >  
> > -	mutex_lock(&sbi->s_orphan_lock);
> > +	WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
> > +		     !mutex_is_locked(&inode->i_mutex));
> > +	/* Do this quick check before taking global s_orphan_lock. */
> >  	if (list_empty(&ei->i_orphan))
> > -		goto out;
> > +		return 0;
> >  
> > -	ino_next = NEXT_ORPHAN(inode);
> > -	prev = ei->i_orphan.prev;
> > +	if (handle) {
> > +		/* Grab inode buffer early before taking global s_orphan_lock */
> > +		err = ext4_reserve_inode_write(handle, inode, &iloc);
> > +	}
> >  
> > +	mutex_lock(&sbi->s_orphan_lock);
> >  	jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino);
> > 
> 
> Should set prev = ei->i_orphan.prev; here, instead of down below where it
> has already been removed from the list.
  Bah, I'm sorry for causing extra work to you. That's a really stupid
mistake. Thanks for finding that! Also I've realized we cannot really
reliably do
  ext4_mark_iloc_dirty(handle, i_prev, &iloc2);
outside of s_orphan_lock because that inode may be reclaimed the moment
after we drop s_orphan_lock. So I had to remove that optimization as well
and move the call back under s_orphan_lock. I'm running xfstests now
(updated so they include also the test Ted found failing) to verify
correctness again and then I'll get my performance numbers with fixed
patches.

								Honza
Theodore Ts'o - May 20, 2014, 1:57 p.m.
On Tue, May 20, 2014 at 02:33:23AM -0600, Thavatchai Makphaibulchoke wrote:
> Please see my one comment below.
> 
> BTW, I've run aim7 on your before I notice what I commented below.  There are workloads that my patch outperform yours and vice versa.  I will have to redo it over again.

Thavatchai, it would be really great if you could do lock_stat runs
with both Jan's latest patches as well as yours.  We need to
understand where the differences are coming from.

As I understand things, there are two differences between Jan and your
approaches.  The first is that Jan is using the implicit locking of
i_mutex to avoid needing to keep a hashed array of mutexes to
synchronize an individual inode's being added or removed to the orphan
list.

The second is that you've split the orphan mutex into an on-disk mutex
and a in-memory spinlock.

Is it possible to split up your patch so we can measure the benefits
of each of these two changes?  More interestingly, is there a way we
can use the your second change in concert with Jan's changes?

Regards,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thavatchai Makphaibulchoke - May 20, 2014, 5:16 p.m.
On 05/20/2014 07:57 AM, Theodore Ts'o wrote:
> Thavatchai, it would be really great if you could do lock_stat runs
> with both Jan's latest patches as well as yours.  We need to
> understand where the differences are coming from.
> 
> As I understand things, there are two differences between Jan and your
> approaches.  The first is that Jan is using the implicit locking of
> i_mutex to avoid needing to keep a hashed array of mutexes to
> synchronize an individual inode's being added or removed to the orphan
> list.
> 
> The second is that you've split the orphan mutex into an on-disk mutex
> and a in-memory spinlock.
> 
> Is it possible to split up your patch so we can measure the benefits
> of each of these two changes?  More interestingly, is there a way we
> can use the your second change in concert with Jan's changes?
> 
> Regards,
> 
> 						- Ted
> 

Ted, depending on what Jan does with my latest comment regarding the dirty optimization, her patch and my patch are functionally identical, with the exception that mine uses an explicit hashed lock to serialize add/delete within a single node.  Hers uses the fact that either i_mutex is already held or the node is not accessible to other thread.


My assumption is that with the hashed mutex, the contentions for the s_orphan_mutex will spread out more, increasing its chance to hit the mutex_lock fast path.  Anyway, I will do lock_stat to get more info.

Thanks,
Mak.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thavatchai Makphaibulchoke - June 2, 2014, 5:45 p.m.
On 05/20/2014 07:57 AM, Theodore Ts'o wrote:
> On Tue, May 20, 2014 at 02:33:23AM -0600, Thavatchai Makphaibulchoke wrote:
> 
> Thavatchai, it would be really great if you could do lock_stat runs
> with both Jan's latest patches as well as yours.  We need to
> understand where the differences are coming from.
> 
> As I understand things, there are two differences between Jan and your
> approaches.  The first is that Jan is using the implicit locking of
> i_mutex to avoid needing to keep a hashed array of mutexes to
> synchronize an individual inode's being added or removed to the orphan
> list.
> 
> The second is that you've split the orphan mutex into an on-disk mutex
> and a in-memory spinlock.
> 
> Is it possible to split up your patch so we can measure the benefits
> of each of these two changes?  More interestingly, is there a way we
> can use the your second change in concert with Jan's changes?
> 
> Regards,
> 
> 						- Ted
> 

Thanks to Jan, as she pointed out one optimization in orphan_addr() that I've missed.

After integrated that into my patch, I've rerun the following aim7 workloads; alltests, custom, dbase, disk, fserver, new_fserver, shared and short.  Here are the results.

On an 8 core (16 thread) machine, both my revised patch (with additional optimization from Jan's oprhan_add()) and version 3 of Jan's patch give about the same results, for most of the workloads, except fserver and new_fserver, which Jan's outperforms about 9% and 16%, respectively.

Here are the lock_stat output for disk,
Jan's patch,
lock_stat version 0.4
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                     &sbi->s_orphan_lock:         80189          80246           3.94      464489.22    77219615.47         962.29         503289         809004           0.10      476537.44     3424587.77           4.23
Mine,
                     &sbi->s_orphan_lock:         82215          82259           3.61      640876.19    15098561.09         183.55         541422         794254           0.10      640368.86     4425140.61           5.57
              &sbi->s_orphan_op_mutex[n]:        102507         104880           4.21     1335087.21  1392773487.19       13279.69         398328         840120           0.11     1334711.17   397596137.90         473.26

For new_fserver,
Jan's patch,
                     &sbi->s_orphan_lock:       1063059        1063369           5.57     1073325.95 59535205188.94       55987.34        4525570        8446052           0.10       75625.72    10700844.58         1.27
Mine,
                     &sbi->s_orphan_lock:       1171433        1172220           3.02      349678.21   553168029.92         471.90        5517262        8446052           0.09      254108.75    16504015.29           1.95
              &sbi->s_orphan_op_mutex[n]:       2176760        2202674           3.44      633129.10 55206091750.06       25063.21        3259467        8452918           0.10      349687.82   605683982.34        71.65


On an 80 core (160 thread) machine, mine outpeforms Jan's in alltests, custom, fserver, new_fserver and shared about the same margin it did over the baseline, around 20%   For all these workloads, Jan's patch does not seem to show any noticeable improvement over baseline kernel.  I'm getting about the same performance with the rest of the workloads.

Here are the lock_stat output for alltests,
Jan;'s,
                     &sbi->s_orphan_lock:       2762871        2763355           4.46       49043.39  1763499587.40         638.17        5878253        6475844           0.15       20508.98    70827300.79          10.94
Mine,
                       &sbi->s_orphan_lock:       1171433        1172220           3.02      349678.21   553168029.92         471.90        5517262        8446052           0.09      254108.75    16504015.29           1.95
              &sbi->s_orphan_op_mutex[n]:        783176         785840           4.95       30358.58   432279688.66         550.09        2899889        6505883           0.16       30254.12  1668330140.08         256.43

For custom,
Jan's,
                     &sbi->s_orphan_lock:       5706466        5707069           4.54       44063.38  3312864313.18         580.48       11942088       13175060           0.15       15944.34   142660367.51          10.83
Mine,
                     &sbi->s_orphan_lock:       5518186        5518558           4.84       32040.05  2436898419.22         441.58       12290996       13175234           0.17       23160.65   141234888.88          10.72
              &sbi->s_orphan_op_mutex[n]:       1565216        1569333           4.50       32527.02   788215876.94         502.26        5894074       13196979           0.16       71073.57  3128766227.92         237.08

For dbase,
Jan's,
                     &sbi->s_orphan_lock:         14453          14489           5.84       39442.57     8678179.21         598.95         119847         153686           0.17        4390.25     1406816.03           9.15
Mine,
                     &sbi->s_orphan_lock:         13847          13868           6.23       31314.03     7982386.22         575.60         120332         153542           0.17        9354.86     1458061.28           9.50
              &sbi->s_orphan_op_mutex[n]:          1700           1717          22.00       50566.24     1225749.82         713.89          85062         189435           0.16       31374.44    14476217.56          76.42

In case the line-wrap making it hard to read, I've also attached the results as a text file.

The lock_stat seems to show that with my patch the s_orphan_lock performs better across the board.  But on a smaller machine, the hashed mutex seems to offset out the performance gain in the s_oprhan_lock and increase the hashed mutex size likely to make it perform better.

Jan, if you could send me your orphan stress test, I could run lock_stat for more performance comparison.

Ted, please let me know if there is anything else you like me to experiment with.  If you'd like I could also resubmit my revised patch for you to take a look. 



Thanks,
Mak.
On an 8 core (16 thread) machine,
lock_stat version 0.4
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Jan's disk,
                     &sbi->s_orphan_lock:         80189          80246           3.94      464489.22    77219615.47         962.29         503289         809004           0.10      476537.44     3424587.77           4.23
Mine's disk,
                     &sbi->s_orphan_lock:         82215          82259           3.61      640876.19    15098561.09         183.55         541422         794254           0.10      640368.86     4425140.61           5.57
              &sbi->s_orphan_op_mutex[n]:        102507         104880           4.21     1335087.21  1392773487.19       13279.69         398328         840120           0.11     1334711.17   397596137.90         473.26
Jan's new_fserver,
                     &sbi->s_orphan_lock:       1063059        1063369           5.57     1073325.95 59535205188.94       55987.34        4525570        8446052           0.10       75625.72    10700844.58           1.27
Mine's new_fserver,
                     &sbi->s_orphan_lock:       1171433        1172220           3.02      349678.21   553168029.92         471.90        5517262        8446052           0.09      254108.75    16504015.29           1.95
              &sbi->s_orphan_op_mutex[n]:       2176760        2202674           3.44      633129.10 55206091750.06       25063.21        3259467        8452918           0.10      349687.82   605683982.34          71.65

On an 80 core (160 thread) machine,
ock_stat version 0.4
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Jan's alltests,
                     &sbi->s_orphan_lock:       2762871        2763355           4.46       49043.39  1763499587.40         638.17        5878253        6475844           0.15       20508.98    70827300.79          10.94
Mine's alltests,
                     &sbi->s_orphan_lock:       2690362        2690612           5.08       24977.33  1286260951.01         478.06        6031175        6475694           0.17       12247.20    70182042.72          10.84
              &sbi->s_orphan_op_mutex[n]:        783176         785840           4.95       30358.58   432279688.66         550.09        2899889        6505883           0.16       30254.12  1668330140.08         256.43
Jan's custom,
                     &sbi->s_orphan_lock:       5706466        5707069           4.54       44063.38  3312864313.18         580.48       11942088       13175060           0.15       15944.34   142660367.51          10.83
Mine's custom,
                     &sbi->s_orphan_lock:       5518186        5518558           4.84       32040.05  2436898419.22         441.58       12290996       13175234           0.17       23160.65   141234888.88          10.72
              &sbi->s_orphan_op_mutex[n]:       1565216        1569333           4.50       32527.02   788215876.94         502.26        5894074       13196979           0.16       71073.57  3128766227.92         237.08
Jan's dbase,
                     &sbi->s_orphan_lock:         14453          14489           5.84       39442.57     8678179.21         598.95         119847         153686           0.17        4390.25     1406816.03           9.15
Mine's dbase,
                     &sbi->s_orphan_lock:         13847          13868           6.23       31314.03     7982386.22         575.60         120332         153542           0.17        9354.86     1458061.28           9.50
              &sbi->s_orphan_op_mutex[n]:          1700           1717          22.00       50566.24     1225749.82         713.89          85062         189435           0.16       31374.44    14476217.56          76.42
Jan Kara - June 3, 2014, 8:52 a.m.
On Mon 02-06-14 11:45:32, Thavatchai Makphaibulchoke wrote:
> On 05/20/2014 07:57 AM, Theodore Ts'o wrote:
> > On Tue, May 20, 2014 at 02:33:23AM -0600, Thavatchai Makphaibulchoke wrote:
> > 
> > Thavatchai, it would be really great if you could do lock_stat runs
> > with both Jan's latest patches as well as yours.  We need to
> > understand where the differences are coming from.
> > 
> > As I understand things, there are two differences between Jan and your
> > approaches.  The first is that Jan is using the implicit locking of
> > i_mutex to avoid needing to keep a hashed array of mutexes to
> > synchronize an individual inode's being added or removed to the orphan
> > list.
> > 
> > The second is that you've split the orphan mutex into an on-disk mutex
> > and a in-memory spinlock.
> > 
> > Is it possible to split up your patch so we can measure the benefits
> > of each of these two changes?  More interestingly, is there a way we
> > can use the your second change in concert with Jan's changes?
> > 
> > Regards,
> > 
> > 						- Ted
> > 
> 
> Thanks to Jan, as she pointed out one optimization in orphan_addr() that
> I've missed.
> 
> After integrated that into my patch, I've rerun the following aim7
> workloads; alltests, custom, dbase, disk, fserver, new_fserver, shared
> and short.  Here are the results.
> 
> On an 8 core (16 thread) machine, both my revised patch (with additional
> optimization from Jan's oprhan_add()) and version 3 of Jan's patch give
> about the same results, for most of the workloads, except fserver and
> new_fserver, which Jan's outperforms about 9% and 16%, respectively.
> 
> Here are the lock_stat output for disk,
> Jan's patch,
> lock_stat version 0.4
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>                               class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>                      &sbi->s_orphan_lock:         80189          80246           3.94      464489.22    77219615.47         962.29         503289         809004           0.10      476537.44     3424587.77           4.23
> Mine,
>                      &sbi->s_orphan_lock:         82215          82259           3.61      640876.19    15098561.09         183.55         541422         794254           0.10      640368.86     4425140.61           5.57
>               &sbi->s_orphan_op_mutex[n]:        102507         104880           4.21     1335087.21  1392773487.19       13279.69         398328         840120           0.11     1334711.17   397596137.90         473.26
> 
> For new_fserver,
> Jan's patch,
>                      &sbi->s_orphan_lock:       1063059        1063369           5.57     1073325.95 59535205188.94       55987.34        4525570        8446052           0.10       75625.72    10700844.58         1.27
> Mine,
>                      &sbi->s_orphan_lock:       1171433        1172220           3.02      349678.21   553168029.92         471.90        5517262        8446052           0.09      254108.75    16504015.29           1.95
>               &sbi->s_orphan_op_mutex[n]:       2176760        2202674           3.44      633129.10 55206091750.06       25063.21        3259467        8452918           0.10      349687.82   605683982.34        71.65
> 
> 
> On an 80 core (160 thread) machine, mine outpeforms Jan's in alltests,
> custom, fserver, new_fserver and shared about the same margin it did over
> the baseline, around 20%   For all these workloads, Jan's patch does not
> seem to show any noticeable improvement over baseline kernel.  I'm
> getting about the same performance with the rest of the workloads.
> 
> Here are the lock_stat output for alltests,
> Jan;'s,
>                      &sbi->s_orphan_lock:       2762871        2763355           4.46       49043.39  1763499587.40         638.17        5878253        6475844           0.15       20508.98    70827300.79          10.94
> Mine,
>                        &sbi->s_orphan_lock:       1171433        1172220           3.02      349678.21   553168029.92         471.90        5517262        8446052           0.09      254108.75    16504015.29           1.95
>               &sbi->s_orphan_op_mutex[n]:        783176         785840           4.95       30358.58   432279688.66         550.09        2899889        6505883           0.16       30254.12  1668330140.08         256.43
> 
> For custom,
> Jan's,
>                      &sbi->s_orphan_lock:       5706466        5707069           4.54       44063.38  3312864313.18         580.48       11942088       13175060           0.15       15944.34   142660367.51          10.83
> Mine,
>                      &sbi->s_orphan_lock:       5518186        5518558           4.84       32040.05  2436898419.22         441.58       12290996       13175234           0.17       23160.65   141234888.88          10.72
>               &sbi->s_orphan_op_mutex[n]:       1565216        1569333           4.50       32527.02   788215876.94         502.26        5894074       13196979           0.16       71073.57  3128766227.92         237.08
> 
> For dbase,
> Jan's,
>                      &sbi->s_orphan_lock:         14453          14489           5.84       39442.57     8678179.21         598.95         119847         153686           0.17        4390.25     1406816.03           9.15
> Mine,
>                      &sbi->s_orphan_lock:         13847          13868           6.23       31314.03     7982386.22         575.60         120332         153542           0.17        9354.86     1458061.28           9.50
>               &sbi->s_orphan_op_mutex[n]:          1700           1717          22.00       50566.24     1225749.82         713.89          85062         189435           0.16       31374.44    14476217.56          76.42
> 
> In case the line-wrap making it hard to read, I've also attached the
> results as a text file.
> 
> The lock_stat seems to show that with my patch the s_orphan_lock performs
> better across the board.  But on a smaller machine, the hashed mutex
> seems to offset out the performance gain in the s_oprhan_lock and
> increase the hashed mutex size likely to make it perform better.
  I'd interpret the data a bit differently :) With your patch the
contention for resource - access to orphan list - is split between
s_orphan_lock and s_orphan_op_mutex. For the smaller machine contending
directly on s_orphan_lock is a win and we spend less time waiting in total.
For the large machine it seems beneficial to contend on the hashed mutex
first and only after that on global lock. Likely that reduces amount of
cacheline bouncing, or maybe the mutex is more often acquired during the
spinning phase which reduces the acquisition latency.

> Jan, if you could send me your orphan stress test, I could run lock_stat
> for more performance comparison.
  Sure, it is attached.

								Honza
Thavatchai Makphaibulchoke - June 16, 2014, 7:20 p.m.
On 06/03/2014 02:52 AM, Jan Kara wrote:
>   I'd interpret the data a bit differently :) With your patch the
> contention for resource - access to orphan list - is split between
> s_orphan_lock and s_orphan_op_mutex. For the smaller machine contending
> directly on s_orphan_lock is a win and we spend less time waiting in total.
> For the large machine it seems beneficial to contend on the hashed mutex
> first and only after that on global lock. Likely that reduces amount of
> cacheline bouncing, or maybe the mutex is more often acquired during the
> spinning phase which reduces the acquisition latency.
> 
>   Sure, it is attached.
> 
> 								Honza
> 

Thanks Jan for the test program.

Anyway I did modify the test a little so that we could also actually run multiple incarnations of the test simultaneously, that is to generate the orphan stress operations on multiple files.  I have also attached the modified test, just in case.

These are the results that I got.

All values are real time in seconds, computed over ten runs with journaling disabled. "w/o" stand for without hashed mutexes and "with" for with mutexes, "Proc" number of processes, and "Files" number of files.


With only 1 file,

On an 8 core (16 thread) platform,


Proc |     1     |     20     |      40     |      80     |     160     |     400     |     800
----------------------------------------------------------------------------------------------------- 
     | Avg |  SD |  Avg |  SD |  Avg  |  SD |  Avg  | SD  |  Avg  |  SD |  Avg  | SD  |  Avg  | SD 
-----------------------------------------------------------------------------------------------------
w/o  |.7921|.0467|7.1342|.0316|12.4026|.3552|19.3930|.6917|22.7519|.7017|35.9374|1.658|66.7374|.4716
-----------------------------------------------------------------------------------------------------
with |.7819|.0362|6.3302|.2119|12.0933|.6589|18.7514|.9698|24.1351|1.659|38.6480|.6809|67.4810|.2749

On a 80 core (160 thread) platform,

Proc |      40      |      80      |      100      |      400      |       800     |     1600
----------------------------------------------------------------------------------------------------- 
     |  Avg  |  SD  |  Avg  |  SD  |  Avg   |  SD  |  Avg   |  SD  |   Avg  |  SD  |   Avg  |  SD 
-----------------------------------------------------------------------------------------------------
w/o  |44.8532|3.4991|67.8351|1.7534| 73.0616|2.4733|252.5798|1.1949|485.3289|5.7320|952.8874|2.0911
-----------------------------------------------------------------------------------------------------
with |46.1134|3.3228|99.1550|1.4894|109.0272|1.3617|259.6128|2.5247|284.4386|4.6767|266.8664|7.7726

With one file, we would expect without hashed mutexes would perform better than with.  The results do  show so on 80 core machine with 80 up to 400 processes.  Surprisingly there is no differences across all process ranges tested on 8 core.  Also on 80 core, with hashed mutexes the time seems to be steadying out at around high two hundred something with 400 or more processes and outperform without significantly with 800 or more processes.


With multiple files and only 1 process per file,

On an 8 core (16 thread) platform,


Proc |     40    |      80     |     150     |     400     |     800
-------------------------------------------------------------------------
     |  Avg |  SD |  Avg |  SD |  Avg  |  SD |  Avg  | SD  |  Avg  |  SD 
--------------------------------------------------------------------------
w/o  |3.3578|.0363|6.4533|.1204|12.1925|.2528|31.5862|.6016|63.9913|.3528
-------------------------------------------------------------------------
with |3.2583|.0384|6.3228|.1075|11.8328|.2290|30.5394|.3220|62.7672|.3802

On a 80 core (160 thread) platform,

Proc|      40      |      80      |      100      |     200      |   400     |      800     |     1200      |      1600
------------------------------------------------------------------------------------------------------------------ 
    |  Avg  |  SD  |  Avg  |  SD  |  Avg   |  SD  |  Avg  |  SD  |   Avg  |  SD |   Avg  |  SD  |  Avg   | SD   |
-------------------------------------------------------------------------------------------------------------------
w/o_|43.6507|2.9979|57.0404|1.8684|068.5557|1.2902|144.745|1.7939|52.7491|1.3585|487.8996|1.3997|715.1978|1.1224|942.5605|2.9629
-------------------------------------------------------------------------------------------------------------------
with|52.8003|2.1949|69.2455|1.2902|106.5026|1.8813|130.2995|7.8020150.3648|3.4153|184.7233|6.0525|270.1533|3.2261|298.5705|3.1318

Again, there is not much difference on 8 core.  On 80 core, without hashed mutexes performs better than with hashed mutexes with the number of files between 40 to less than 200.  With hashed mutexes outperorm without significantly with 400 or more files.

Overall there seems to be no performance difference on 8 core.  On 80 core with hashed mutexes, while performs worse than with in the lower ranges of both processes and files, seems to be scaling better with both higher processes and files.

Again, the change with hashed mutexes does include the additional optimization in orphan_add() introduced by your patch.  Please let me know if you need a copy of the modified patch with hashed mutexes for verification.

Thanks,
Mak.
Jan Kara - June 17, 2014, 9:29 a.m.
On Mon 16-06-14 13:20:32, Thavatchai Makphaibulchoke wrote:
> On 06/03/2014 02:52 AM, Jan Kara wrote:
> >   I'd interpret the data a bit differently :) With your patch the
> > contention for resource - access to orphan list - is split between
> > s_orphan_lock and s_orphan_op_mutex. For the smaller machine contending
> > directly on s_orphan_lock is a win and we spend less time waiting in total.
> > For the large machine it seems beneficial to contend on the hashed mutex
> > first and only after that on global lock. Likely that reduces amount of
> > cacheline bouncing, or maybe the mutex is more often acquired during the
> > spinning phase which reduces the acquisition latency.
> > 
> >   Sure, it is attached.
> > 
> > 								Honza
> > 
> 
> Thanks Jan for the test program.
> 
> Anyway I did modify the test a little so that we could also actually run
> multiple incarnations of the test simultaneously, that is to generate the
> orphan stress operations on multiple files.  I have also attached the
> modified test, just in case.
  Hum, looking at your test program source I'm not sure what do you mean.
Your program first forks 'niteration' times and each process starts using a
different directory. Then each of these processes forks 'procs' times and
each of these processes will use a different file in the directory
belonging to the parent. So what's the difference to just running
'niterations' * 'procs' processes? After some thought I guess the
difference is in how time to run on each individual file contributes to the
total average -
  (\sum_{i=1}^{procs} t_i)/procs
in the first case you ran, where t_i is the time to run test for file i, and
  (max^{i=1}^{procs} t_i)
in the second case. But what's the point?

> These are the results that I got.
> 
> All values are real time in seconds, computed over ten runs with
> journaling disabled. "w/o" stand for without hashed mutexes and "with"
> for with mutexes, "Proc" number of processes, and "Files" number of
> files.
  What do you exactly mean by 'journaling disabled'? Did you run ext4 in
nojournal mode? That wouldn't really make sense because in nojournal mode
all orphan list operations are skipped... So what did you really test?

> With only 1 file,
> 
> On an 8 core (16 thread) platform,
> 
> 
> Proc |     1     |     20     |      40     |      80     |     160     |     400     |     800
> ----------------------------------------------------------------------------------------------------- 
>      | Avg |  SD |  Avg |  SD |  Avg  |  SD |  Avg  | SD  |  Avg  |  SD |  Avg  | SD  |  Avg  | SD 
> -----------------------------------------------------------------------------------------------------
> w/o  |.7921|.0467|7.1342|.0316|12.4026|.3552|19.3930|.6917|22.7519|.7017|35.9374|1.658|66.7374|.4716
> -----------------------------------------------------------------------------------------------------
> with |.7819|.0362|6.3302|.2119|12.0933|.6589|18.7514|.9698|24.1351|1.659|38.6480|.6809|67.4810|.2749
> 
> On a 80 core (160 thread) platform,
> 
> Proc |      40      |      80      |      100      |      400      |       800     |     1600
> ----------------------------------------------------------------------------------------------------- 
>      |  Avg  |  SD  |  Avg  |  SD  |  Avg   |  SD  |  Avg   |  SD  |   Avg  |  SD  |   Avg  |  SD 
> -----------------------------------------------------------------------------------------------------
> w/o  |44.8532|3.4991|67.8351|1.7534| 73.0616|2.4733|252.5798|1.1949|485.3289|5.7320|952.8874|2.0911
> -----------------------------------------------------------------------------------------------------
> with |46.1134|3.3228|99.1550|1.4894|109.0272|1.3617|259.6128|2.5247|284.4386|4.6767|266.8664|7.7726
> 
> With one file, we would expect without hashed mutexes would perform
> better than with.  The results do  show so on 80 core machine with 80 up
> to 400 processes.  Surprisingly there is no differences across all
> process ranges tested on 8 core.  Also on 80 core, with hashed mutexes
> the time seems to be steadying out at around high two hundred something
> with 400 or more processes and outperform without significantly with 800
> or more processes.
> 
> With multiple files and only 1 process per file,
> 
> On an 8 core (16 thread) platform,
> 
> 
> Proc |     40    |      80     |     150     |     400     |     800
> -------------------------------------------------------------------------
>      |  Avg |  SD |  Avg |  SD |  Avg  |  SD |  Avg  | SD  |  Avg  |  SD 
> --------------------------------------------------------------------------
> w/o  |3.3578|.0363|6.4533|.1204|12.1925|.2528|31.5862|.6016|63.9913|.3528
> -------------------------------------------------------------------------
> with |3.2583|.0384|6.3228|.1075|11.8328|.2290|30.5394|.3220|62.7672|.3802
> 
> On a 80 core (160 thread) platform,
> 
> Proc|      40      |      80      |      100      |     200      |   400     |      800     |     1200      |      1600
> ------------------------------------------------------------------------------------------------------------------ 
>     |  Avg  |  SD  |  Avg  |  SD  |  Avg   |  SD  |  Avg  |  SD  |   Avg  |  SD |   Avg  |  SD  |  Avg   | SD   |
> -------------------------------------------------------------------------------------------------------------------
> w/o_|43.6507|2.9979|57.0404|1.8684|068.5557|1.2902|144.745|1.7939|52.7491|1.3585|487.8996|1.3997|715.1978|1.1224|942.5605|2.9629
                                                                    ^^^^
							this number is strange
> -------------------------------------------------------------------------------------------------------------------
> with|52.8003|2.1949|69.2455|1.2902|106.5026|1.8813|130.2995|7.8020150.3648|3.4153|184.7233|6.0525|270.1533|3.2261|298.5705|3.1318
> 
> Again, there is not much difference on 8 core.  On 80 core, without
> hashed mutexes performs better than with hashed mutexes with the number
> of files between 40 to less than 200.  With hashed mutexes outperorm
> without significantly with 400 or more files.
> 
> Overall there seems to be no performance difference on 8 core.  On 80
> core with hashed mutexes, while performs worse than with in the lower
> ranges of both processes and files, seems to be scaling better with both
> higher processes and files.
  Your numbers are interesting and seem to confirm that with really high
contention it is advantageous to contend on smaller locks first (your
hashed mutexes) and only after that on the global lock. But I'd like to
hear answers to my previous questions before drawing any conclusions...
 
> Again, the change with hashed mutexes does include the additional
> optimization in orphan_add() introduced by your patch.  Please let me
> know if you need a copy of the modified patch with hashed mutexes for
> verification.
  

								Honza
Thavatchai Makphaibulchoke - June 18, 2014, 4:38 a.m.
On 06/17/2014 03:29 AM, Jan Kara wrote:
>   Hum, looking at your test program source I'm not sure what do you mean.
> Your program first forks 'niteration' times and each process starts using a
> different directory. Then each of these processes forks 'procs' times and
> each of these processes will use a different file in the directory
> belonging to the parent. So what's the difference to just running
> 'niterations' * 'procs' processes? After some thought I guess the
> difference is in how time to run on each individual file contributes to the
> total average -
>   (\sum_{i=1}^{procs} t_i)/procs
> in the first case you ran, where t_i is the time to run test for file i, and
>   (max^{i=1}^{procs} t_i)
> in the second case. But what's the point?
> 

The original test program generates orphan traffic on only a single file, which not only does not seem to be a fair comparison for the WM (with hashed mutexes) algorithm, does not seem to represent realistic load.  The modified test's intention is to compare performance between the WM and WO (without hashed mutexes) under heavy orphan traffic on more than a single file.

Instead of running multiple copies of the original test in the background (for a short job we may not actually get overlapping traffic on different inodes), the modified test runs and start the excution, as close as possible to each other, of multiple copies of the original test.

The results of my test is the average, over ten runs, of the maximum time the original test takes to complete under certain number of processes and files.  Of course, with one copy (niteration of 1), we get the equivalence of the original test.

>   What do you exactly mean by 'journaling disabled'? Did you run ext4 in
> nojournal mode? That wouldn't really make sense because in nojournal mode
> all orphan list operations are skipped... So what did you really test?
> 

Yes, sorry my fault disabling journaling, should also inhibit the orphan activities.  Therefore there should not be any difference in performance between the two.

I just discovered that the two different branches I used do not have the same baseline.  Let me recreate the two branches and redo my test.  I'll get back with you with the results.

Sorry for the confusion.

Thanks,
Mak.


>   Your numbers are interesting and seem to confirm that with really high
> contention it is advantageous to contend on smaller locks first (your
> hashed mutexes) and only after that on the global lock. But I'd like to
> hear answers to my previous questions before drawing any conclusions...
>  
>   
> 
> 								Honza
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - June 18, 2014, 10:37 a.m.
On Tue 17-06-14 22:38:32, Thavatchai Makphaibulchoke wrote:
> On 06/17/2014 03:29 AM, Jan Kara wrote:
> >   Hum, looking at your test program source I'm not sure what do you mean.
> > Your program first forks 'niteration' times and each process starts using a
> > different directory. Then each of these processes forks 'procs' times and
> > each of these processes will use a different file in the directory
> > belonging to the parent. So what's the difference to just running
> > 'niterations' * 'procs' processes? After some thought I guess the
> > difference is in how time to run on each individual file contributes to the
> > total average -
> >   (\sum_{i=1}^{procs} t_i)/procs
> > in the first case you ran, where t_i is the time to run test for file i, and
> >   (max^{i=1}^{procs} t_i)
> > in the second case. But what's the point?
> > 
> 
> The original test program generates orphan traffic on only a single file,
> which not only does not seem to be a fair comparison for the WM (with
> hashed mutexes) algorithm, does not seem to represent realistic load.
> The modified test's intention is to compare performance between the WM
> and WO (without hashed mutexes) under heavy orphan traffic on more than a
> single file.
  That's not true. The original test program creates one file per process:
void run_test(char *base, int count)
{
...
       sprintf(pbuf, "%s/file-%d", base, count);
       fd = open(pbuf, O_CREAT | O_TRUNC | O_WRONLY, 0644);
...
  and forks given number of processes. Ah, maybe I see what you are getting
at. My original test program doesn't bother with synchronizing start of the
processes - I have manually set number of repetitions (COUNT) so that each
process runs long enough on my machine that the differences in start time
are negligible.  Maybe you thought these processes run sequentially (and
maybe they really did if you had fast enough setup). So I agree
synchronization of start using shared memory as you did it is a good thing
and gets more reliable numbers especially for high numbers of processes
(which I didn't really test because of my limited HW) just there's no need
to add another level of iteration...

> Instead of running multiple copies of the original test in the background
> (for a short job we may not actually get overlapping traffic on different
> inodes), the modified test runs and start the excution, as close as
> possible to each other, of multiple copies of the original test.
> 
> The results of my test is the average, over ten runs, of the maximum time
> the original test takes to complete under certain number of processes and
> files.  Of course, with one copy (niteration of 1), we get the
> equivalence of the original test.
> 
> >   What do you exactly mean by 'journaling disabled'? Did you run ext4 in
> > nojournal mode? That wouldn't really make sense because in nojournal mode
> > all orphan list operations are skipped... So what did you really test?
> > 
> 
> Yes, sorry my fault disabling journaling, should also inhibit the orphan
> activities.  Therefore there should not be any difference in performance
> between the two.
  Exactly, it's strange you saw a difference.

> I just discovered that the two different branches I used do not have the
> same baseline.  Let me recreate the two branches and redo my test.  I'll
> get back with you with the results.
  OK, looking forward to the results.

								Honza
Thavatchai Makphaibulchoke - July 22, 2014, 4:35 a.m.
On 06/18/2014 04:37 AM, Jan Kara wrote:
>   That's not true. The original test program creates one file per process:
> void run_test(char *base, int count)
> {
> ...
>        sprintf(pbuf, "%s/file-%d", base, count);
>        fd = open(pbuf, O_CREAT | O_TRUNC | O_WRONLY, 0644);
> ...
>   and forks given number of processes. Ah, maybe I see what you are getting
> at. My original test program doesn't bother with synchronizing start of the
> processes - I have manually set number of repetitions (COUNT) so that each
> process runs long enough on my machine that the differences in start time
> are negligible.  Maybe you thought these processes run sequentially (and
> maybe they really did if you had fast enough setup). So I agree
> synchronization of start using shared memory as you did it is a good thing
> and gets more reliable numbers especially for high numbers of processes
> (which I didn't really test because of my limited HW) just there's no need
> to add another level of iteration...
> 
First sorry for taking a little long to get back.

Your original main is now my new run_main() function.  The old main and new run_main() function, practically are doing the same thing, fork and immediately all the child processes as indicated by argv[1], running the original run_test() function.  By converting it to a run_main function, I'm trying to start each incarnation of the old main as closed to each other as possible, making sure I have some overlapping orphan activities on different files.  Actually I should have done the synchronization in the run_test function instead, this way both tests should be closer to each other.

>   Exactly, it's strange you saw a difference.
> 
>> I just discovered that the two different branches I used do not have the
>> same baseline.  Let me recreate the two branches and redo my test.  I'll
>> get back with you with the results.
>   OK, looking forward to the results.
> 
> 								Honza
> 

Here are the results with 3.15-rc5 baseline kernel, using your original test.  Each result is an average over ten runs, except those longer runs, marked with an '*', which is an average of only 5 runs.

On a 2 socket 16 thread machine,

            ----------------------------------
 Num files  |   40  |  100  |  400  |  1000  |
----------------------------------------------
|W/O Mutexes|118.593|178.118|595.017|1480.165|
-----------------------------------------------
|W Mutexes  |129.212|194.728|620.412|1430.908|
----------------------------------------------

On an 8 socket 160 thread machine,

            --------------------------------------------
 Num files  |   40  |  100  |   400  |  1000* |  1500* |
-------------------------------------------------------
|W/O Mutexes|387.257|644.270|1692.018|4254.766|7362.064|
-------------------------------------------------------
|W Mutexes  |328.490|557.508|1967.676|4721.060|7013.618|
--------------------------------------------------------

From the above data, looks like without mutexes (WOM) performs better across the board on a smaller machine, except at 1000 files.  For a larger machine, I give the edge to with mutexes (WM) for smaller number of files, until around 400 to 100 something files.  Looks like with WOM starts performing better again at 1500 files.

I've noticed that your test seems to generate high volume of both add and delete orphan operations for inodes that are either already on or off the orphan chain already, favoring the WM.  I've also done some data collection to verify that.  It turns out about half of both the delete and add orphan operations do indeed fell into that category.

I did some further experiment by changing the decrement in the ftruncate() loop in the run_test() function, from 1 to be a variable configurable from the command line, as shown here,

        for (i = 0; i < COUNT; i++) {
                if (pwrite(fd, wbuf, 4096, 0) != 4096) {
                        perror("pwrite");
                        exit(1);
                }

                for (j = 4096 - ndec ; j >= 1; j -= ndec) {
                        if (ftruncate(fd, j) < 0) {
                                perror("ftruncate");
                                exit(1);
                        }
                }
        }

Here are more results with different decrement values.

On a 2 socket 16 thread machine,
with decrement of 32,
   
            --------------------------------------------------
 Num files  |  40 | 100 | 400  | 1000 | 1500 | 2000  | 2500  |
--------------------------------------------------------------
|W/O Mutexes|4.077|5.928|19.005|46.056|95.767|514.050|621.227|
--------------------------------------------------------------
|W Mutexes  |4.441|6.354|19.837|44.771|63.068| 85.997|284.084|
--------------------------------------------------------------

with decrement of 256,
   
            -----------------------------------------------------
 Num files  |  40 | 100 | 400 | 1000| 1500 | 2000 | 2500 | 3000 |
--------------------------------------------------------------
|W/O Mutexes|0.482|0.708|2.486|6.072|11.111|54.994|69.695|81.657|
-----------------------------------------------------------------
|W Mutexes  |0.540|0.823|2.449|5.650| 8.146|11.179|26.453|34.705|
-----------------------------------------------------------------

with decrement of 4095,

            -------------------------------------------------------------
 Num files  |  40 | 100 | 400 | 1000| 1500| 2000| 2500| 3000| 4000| 5000|
-------------------------------------------------------------------------
|W/O Mutexes|0.063|0.010|0.235|0.479|0.665|1.701|2.620|3.812|4.986|7.047|
-------------------------------------------------------------------------
|W Mutexes  |0.064|0.011|0.217|0.462|0.682|0.886|1.374|2.091|3.602|4.650|
-------------------------------------------------------------------------

On an 8 socket 160 thread machine,
with decrement of 32,
   
            --------------------------------------------------------
 Num files  |  40  | 100  | 400  | 1000  | 1500  |  2000  | 2500*  |
--------------------------------------------------------------------
|W/O Mutexes|15.461|21.501|54.009|131.591|219.795|2375.105|6401.232|
--------------------------------------------------------------------
|W Mutexes  |13.376|18.557|61.573|146.369|217.394| 375.293|2626.881|
--------------------------------------------------------------------

with decrement of 256,
   
            -------------------------------------------------
 Num files  |  40 | 100 | 400 | 1000 | 1500 | 2000  | 2500  |
-------------------------------------------------------------
|W/O Mutexes|1.903|2.709|7.212|16.573|26.634|260.666|745.265|
-------------------------------------------------------------
|W Mutexes  |1.643|2.319|7.429|17.641|26.059| 45.333|302.096|
-------------------------------------------------------------

with decrement of 4095,

            ---------------------------------------------
 Num files  |  40 | 100 | 400 | 1000| 1500| 2000 | 2500 |
---------------------------------------------------------
|W/O Mutexes|0.160|0.231|0.746|1.640|2.307|13.586|49.986|
---------------------------------------------------------
|W Mutexes  |0.134|0.192|0.579|1.306|1.931| 3.332|15.481|
---------------------------------------------------------

Looks like the results are similar to that of decrement of 1.  With the exception that on smaller machine, with higher decrement values, both the WM and WOM seems to performance closer to each other for small to medium number of files.  The WOM does show better scaling for higher number of files on both the small and large core count machines.

Here are also aim7 results on an 8 socket 160 threads machine, at 2000 users,

           | % of Average job per minutes of WM compared to WOM
----------------------------------------------------------------
all_tests  | +42.02%
----------------------------------------------------------------
custom     | +56.22%
----------------------------------------------------------------
dbase      |      0%
----------------------------------------------------------------
disk       |      0%
----------------------------------------------------------------
fserver    | +51.23%
----------------------------------------------------------------
new_dbase  |      0%
----------------------------------------------------------------
new_fserver|+53.83%
----------------------------------------------------------------
shared     |+44.37%
----------------------------------------------------------------

Please let me know if you have need any additional info or have any question and also if you would like a copy of the patch for testing.

Thanks,
Mak.




--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - July 23, 2014, 8:15 a.m.
Hello,

On Mon 21-07-14 22:35:24, Thavatchai Makphaibulchoke wrote:
> On 06/18/2014 04:37 AM, Jan Kara wrote:
> >   That's not true. The original test program creates one file per process:
> > void run_test(char *base, int count)
> > {
> > ...
> >        sprintf(pbuf, "%s/file-%d", base, count);
> >        fd = open(pbuf, O_CREAT | O_TRUNC | O_WRONLY, 0644);
> > ...
> >   and forks given number of processes. Ah, maybe I see what you are getting
> > at. My original test program doesn't bother with synchronizing start of the
> > processes - I have manually set number of repetitions (COUNT) so that each
> > process runs long enough on my machine that the differences in start time
> > are negligible.  Maybe you thought these processes run sequentially (and
> > maybe they really did if you had fast enough setup). So I agree
> > synchronization of start using shared memory as you did it is a good thing
> > and gets more reliable numbers especially for high numbers of processes
> > (which I didn't really test because of my limited HW) just there's no need
> > to add another level of iteration...
> > 
> First sorry for taking a little long to get back.
> 
> Your original main is now my new run_main() function.  The old main and
> new run_main() function, practically are doing the same thing, fork and
> immediately all the child processes as indicated by argv[1], running the
> original run_test() function.  By converting it to a run_main function,
> I'm trying to start each incarnation of the old main as closed to each
> other as possible, making sure I have some overlapping orphan activities
> on different files.  Actually I should have done the synchronization in
> the run_test function instead, this way both tests should be closer to
> each other.
> 
> Here are the results with 3.15-rc5 baseline kernel, using your original
> test.  Each result is an average over ten runs, except those longer runs,
> marked with an '*', which is an average of only 5 runs.
> 
> On a 2 socket 16 thread machine,
> 
>             ----------------------------------
>  Num files  |   40  |  100  |  400  |  1000  |
> ----------------------------------------------
> |W/O Mutexes|118.593|178.118|595.017|1480.165|
> -----------------------------------------------
> |W Mutexes  |129.212|194.728|620.412|1430.908|
> ----------------------------------------------
> 
> On an 8 socket 160 thread machine,
> 
>             --------------------------------------------
>  Num files  |   40  |  100  |   400  |  1000* |  1500* |
> -------------------------------------------------------
> |W/O Mutexes|387.257|644.270|1692.018|4254.766|7362.064|
> -------------------------------------------------------
> |W Mutexes  |328.490|557.508|1967.676|4721.060|7013.618|
> --------------------------------------------------------
> 
> From the above data, looks like without mutexes (WOM) performs better
> across the board on a smaller machine, except at 1000 files.  For a
> larger machine, I give the edge to with mutexes (WM) for smaller number
> of files, until around 400 to 100 something files.  Looks like with WOM
> starts performing better again at 1500 files.
> 
  
<snip results for larger truncates as those don't seem to show anything
new>

> Here are also aim7 results on an 8 socket 160 threads machine, at 2000
> users,
> 
>            | % of Average job per minutes of WM compared to WOM
> ----------------------------------------------------------------
> all_tests  | +42.02%
> ----------------------------------------------------------------
> custom     | +56.22%
> ----------------------------------------------------------------
> dbase      |      0%
> ----------------------------------------------------------------
> disk       |      0%
> ----------------------------------------------------------------
> fserver    | +51.23%
> ----------------------------------------------------------------
> new_dbase  |      0%
> ----------------------------------------------------------------
> new_fserver|+53.83%
> ----------------------------------------------------------------
> shared     |+44.37%
> ----------------------------------------------------------------
> 
> Please let me know if you have need any additional info or have any
> question and also if you would like a copy of the patch for testing.
  Thanks for all the measurements! I don't have further questions I think.
All your tests seem to point to the fact that when contention on orphan
list is high, your hashed mutexes improve performance (especially when we
are bouncing cache between sockets) while they have some cost for low
contention cases. I'm not sure it is a worthwhile tradeoff but it's upto
Ted as a maintainer to decide.

							Honza

Patch

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 5fcaa85b6dc5..0486fbafb808 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2539,13 +2539,17 @@  static int empty_dir(struct inode *inode)
 	return 1;
 }
 
-/* ext4_orphan_add() links an unlinked or truncated inode into a list of
+/*
+ * ext4_orphan_add() links an unlinked or truncated inode into a list of
  * such inodes, starting at the superblock, in case we crash before the
  * file is closed/deleted, or in case the inode truncate spans multiple
  * transactions and the last transaction is not recovered after a crash.
  *
  * At filesystem recovery time, we walk this list deleting unlinked
  * inodes and truncating linked inodes in ext4_orphan_cleanup().
+ *
+ * Orphan list manipulation functions must be called under i_mutex unless
+ * we are just creating the inode or deleting it.
  */
 int ext4_orphan_add(handle_t *handle, struct inode *inode)
 {
@@ -2553,13 +2557,19 @@  int ext4_orphan_add(handle_t *handle, struct inode *inode)
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct ext4_iloc iloc;
 	int err = 0, rc;
+	bool dirty = false;
 
 	if (!sbi->s_journal)
 		return 0;
 
-	mutex_lock(&sbi->s_orphan_lock);
+	WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
+		     !mutex_is_locked(&inode->i_mutex));
+	/*
+	 * Exit early if inode already is on orphan list. This is a big speedup
+	 * since we don't have to contend on the global s_orphan_lock.
+	 */
 	if (!list_empty(&EXT4_I(inode)->i_orphan))
-		goto out_unlock;
+		return 0;
 
 	/*
 	 * Orphan handling is only valid for files with data blocks
@@ -2573,44 +2583,47 @@  int ext4_orphan_add(handle_t *handle, struct inode *inode)
 	BUFFER_TRACE(sbi->s_sbh, "get_write_access");
 	err = ext4_journal_get_write_access(handle, sbi->s_sbh);
 	if (err)
-		goto out_unlock;
+		goto out;
 
 	err = ext4_reserve_inode_write(handle, inode, &iloc);
 	if (err)
-		goto out_unlock;
+		goto out;
+
+	mutex_lock(&sbi->s_orphan_lock);
 	/*
 	 * Due to previous errors inode may be already a part of on-disk
 	 * orphan list. If so skip on-disk list modification.
 	 */
-	if (NEXT_ORPHAN(inode) && NEXT_ORPHAN(inode) <=
-		(le32_to_cpu(sbi->s_es->s_inodes_count)))
-			goto mem_insert;
-
-	/* Insert this inode at the head of the on-disk orphan list... */
-	NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan);
-	sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
-	err = ext4_handle_dirty_super(handle, sb);
-	rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
-	if (!err)
-		err = rc;
-
-	/* Only add to the head of the in-memory list if all the
-	 * previous operations succeeded.  If the orphan_add is going to
-	 * fail (possibly taking the journal offline), we can't risk
-	 * leaving the inode on the orphan list: stray orphan-list
-	 * entries can cause panics at unmount time.
-	 *
-	 * This is safe: on error we're going to ignore the orphan list
-	 * anyway on the next recovery. */
-mem_insert:
-	if (!err)
-		list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan);
+	if (!NEXT_ORPHAN(inode) || NEXT_ORPHAN(inode) >
+	    (le32_to_cpu(sbi->s_es->s_inodes_count))) {
+		/* Insert this inode at the head of the on-disk orphan list */
+		NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan);
+		sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
+		dirty = true;
+	}
+	list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan);
+	mutex_unlock(&sbi->s_orphan_lock);
 
+	if (dirty) {
+		err = ext4_handle_dirty_super(handle, sb);
+		rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
+		if (!err)
+			err = rc;
+		if (err) {
+			/*
+			 * We have to remove inode from in-memory list if
+ 			 * addition to on disk orphan list failed. Stray orphan
+			 * list entries can cause panics at unmount time.
+			 */
+			mutex_lock(&sbi->s_orphan_lock);
+			list_del(&EXT4_I(inode)->i_orphan);
+			mutex_unlock(&sbi->s_orphan_lock);
+		}
+	}
 	jbd_debug(4, "superblock will point to %lu\n", inode->i_ino);
 	jbd_debug(4, "orphan inode %lu will point to %d\n",
 			inode->i_ino, NEXT_ORPHAN(inode));
-out_unlock:
-	mutex_unlock(&sbi->s_orphan_lock);
+out:
 	ext4_std_error(sb, err);
 	return err;
 }
@@ -2631,13 +2644,18 @@  int ext4_orphan_del(handle_t *handle, struct inode *inode)
 	if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
 		return 0;
 
-	mutex_lock(&sbi->s_orphan_lock);
+	WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
+		     !mutex_is_locked(&inode->i_mutex));
+	/* Do this quick check before taking global s_orphan_lock. */
 	if (list_empty(&ei->i_orphan))
-		goto out;
+		return 0;
 
-	ino_next = NEXT_ORPHAN(inode);
-	prev = ei->i_orphan.prev;
+	if (handle) {
+		/* Grab inode buffer early before taking global s_orphan_lock */
+		err = ext4_reserve_inode_write(handle, inode, &iloc);
+	}
 
+	mutex_lock(&sbi->s_orphan_lock);
 	jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino);
 
 	list_del_init(&ei->i_orphan);
@@ -2646,20 +2664,23 @@  int ext4_orphan_del(handle_t *handle, struct inode *inode)
 	 * transaction handle with which to update the orphan list on
 	 * disk, but we still need to remove the inode from the linked
 	 * list in memory. */
-	if (!handle)
-		goto out;
-
-	err = ext4_reserve_inode_write(handle, inode, &iloc);
-	if (err)
+	if (!handle || err) {
+		mutex_unlock(&sbi->s_orphan_lock);
 		goto out_err;
+	}
 
+	ino_next = NEXT_ORPHAN(inode);
+	prev = ei->i_orphan.prev;
 	if (prev == &sbi->s_orphan) {
 		jbd_debug(4, "superblock will point to %u\n", ino_next);
 		BUFFER_TRACE(sbi->s_sbh, "get_write_access");
 		err = ext4_journal_get_write_access(handle, sbi->s_sbh);
-		if (err)
+		if (err) {
+			mutex_unlock(&sbi->s_orphan_lock);
 			goto out_brelse;
+		}
 		sbi->s_es->s_last_orphan = cpu_to_le32(ino_next);
+		mutex_unlock(&sbi->s_orphan_lock);
 		err = ext4_handle_dirty_super(handle, inode->i_sb);
 	} else {
 		struct ext4_iloc iloc2;
@@ -2669,20 +2690,20 @@  int ext4_orphan_del(handle_t *handle, struct inode *inode)
 		jbd_debug(4, "orphan inode %lu will point to %u\n",
 			  i_prev->i_ino, ino_next);
 		err = ext4_reserve_inode_write(handle, i_prev, &iloc2);
-		if (err)
+		if (err) {
+			mutex_unlock(&sbi->s_orphan_lock);
 			goto out_brelse;
+		}
 		NEXT_ORPHAN(i_prev) = ino_next;
+		mutex_unlock(&sbi->s_orphan_lock);
 		err = ext4_mark_iloc_dirty(handle, i_prev, &iloc2);
 	}
 	if (err)
 		goto out_brelse;
 	NEXT_ORPHAN(inode) = 0;
 	err = ext4_mark_iloc_dirty(handle, inode, &iloc);
-
 out_err:
 	ext4_std_error(inode->i_sb, err);
-out:
-	mutex_unlock(&sbi->s_orphan_lock);
 	return err;
 
 out_brelse: