
[RFC] qht: Align sequence lock to cache line

Message ID 20161025153507.27110-1-bobby.prani@gmail.com
State New

Commit Message

Pranith Kumar Oct. 25, 2016, 3:35 p.m. UTC
Using perf, I see that the sequence lock is a bottleneck, since it is
read by every thread. Giving it its own cache line seems to help
things quite a bit.
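
For reproduction, something along these lines (the exact perf options
used are not recorded here, so treat this as a sketch):

$ perf record -- ./tests/qht-bench -d 10 -n 24 -u 20
$ perf report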

Using qht-bench, I measured the following throughput (10-second runs,
24 threads, <x>% updates):

$ ./tests/qht-bench -d 10 -n 24 -u <x>

update %   base   patch   %change
0          8.07   13.33   +65%
10         7.10    8.90   +25%
20         6.34    7.02   +10%
30         5.48    6.11   +9.6%
40         4.90    5.46   +11.42%

I am not able to see any significant improvement at lower thread counts, though.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
---
 include/qemu/seqlock.h | 2 +-
 util/qht.c             | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

Comments

Paolo Bonzini Oct. 25, 2016, 3:41 p.m. UTC | #1
On 25/10/2016 17:35, Pranith Kumar wrote:
> Using perf, I see that the sequence lock is a bottleneck, since it is
> read by every thread. Giving it its own cache line seems to help
> things quite a bit.
> 
> Using qht-bench, I measured the following throughput (10-second runs,
> 24 threads, <x>% updates):
> 
> $ ./tests/qht-bench -d 10 -n 24 -u <x>
> 
> update %   base   patch   %change
> 0          8.07   13.33   +65%
> 10         7.10    8.90   +25%
> 20         6.34    7.02   +10%
> 30         5.48    6.11   +9.6%
> 40         4.90    5.46   +11.42%
> 
> I am not able to see any significant improvement at lower thread counts, though.

Whoa, that's a bit of a heavy hammer.  The idea is that whoever modifies
the seqlock must take the spinlock first, and whoever reads the seqlock
will read one of the members of hashes[]/pointers[].  So having
everything in the same cacheline should be better.
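
Roughly, the protocol is the following (a self-contained C11 sketch;
the names are illustrative and the memory ordering is simplified
compared to what qemu/seqlock.h and qemu/thread.h actually do):

#include <stdatomic.h>
#include <stdint.h>

#define ENTRIES 4

struct bucket {                 /* assume zero-initialized */
    atomic_uint sequence;       /* seqlock: odd while a write is in flight */
    atomic_flag lock;           /* writer-side spinlock */
    uint32_t hashes[ENTRIES];
    void *pointers[ENTRIES];
};

/* Reader: takes no lock; retries if the sequence moved underneath it. */
static void *bucket_lookup(struct bucket *b, uint32_t hash)
{
    unsigned seq;
    void *p;

    do {
        /* wait for any in-flight write (odd sequence) to finish */
        do {
            seq = atomic_load_explicit(&b->sequence, memory_order_acquire);
        } while (seq & 1);

        p = NULL;
        for (int i = 0; i < ENTRIES; i++) {
            if (b->hashes[i] == hash) {
                p = b->pointers[i];
                break;
            }
        }
        atomic_thread_fence(memory_order_acquire);
        /* retry if a writer bumped the sequence while we were reading */
    } while (atomic_load_explicit(&b->sequence, memory_order_relaxed) != seq);
    return p;
}

/* Writer: the spinlock serializes writers; the sequence bumps make
 * concurrent readers retry. */
static void bucket_set(struct bucket *b, int i, uint32_t hash, void *p)
{
    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire)) {
        /* spin */
    }
    atomic_fetch_add_explicit(&b->sequence, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);   /* sequence is now odd */
    b->hashes[i] = hash;
    b->pointers[i] = p;
    atomic_thread_fence(memory_order_release);
    atomic_fetch_add_explicit(&b->sequence, 1, memory_order_relaxed); /* even */
    atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

Note how the reader's first and last loads hit sequence and the loads
in between hit hashes[]/pointers[]; if those share a cache line, an
uncontended lookup touches one line instead of two.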

What happens if you just change QHT_BUCKET_ALIGN to 128?
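
That is, assuming the constant still reads 64 in util/qht.c, the
one-liner:

-#define QHT_BUCKET_ALIGN 64
+#define QHT_BUCKET_ALIGN 128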

Paolo

> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
> ---
>  include/qemu/seqlock.h | 2 +-
>  util/qht.c             | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
> index 8dee11d..954abe8 100644
> --- a/include/qemu/seqlock.h
> +++ b/include/qemu/seqlock.h
> @@ -21,7 +21,7 @@ typedef struct QemuSeqLock QemuSeqLock;
>  
>  struct QemuSeqLock {
>      unsigned sequence;
> -};
> +} QEMU_ALIGNED(64);
>  
>  static inline void seqlock_init(QemuSeqLock *sl)
>  {
> diff --git a/util/qht.c b/util/qht.c
> index ff4d2e6..4d82609 100644
> --- a/util/qht.c
> +++ b/util/qht.c
> @@ -101,14 +101,14 @@
>   * be grabbed first.
>   */
>  struct qht_bucket {
> -    QemuSpin lock;
>      QemuSeqLock sequence;
> +    QemuSpin lock;
>      uint32_t hashes[QHT_BUCKET_ENTRIES];
>      void *pointers[QHT_BUCKET_ENTRIES];
>      struct qht_bucket *next;
>  } QEMU_ALIGNED(QHT_BUCKET_ALIGN);
>  
> -QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > QHT_BUCKET_ALIGN);
> +QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > 2 * QHT_BUCKET_ALIGN);
>  
>  /**
>   * struct qht_map - structure to track an array of buckets
>
Pranith Kumar Oct. 25, 2016, 3:49 p.m. UTC | #2
On Tue, Oct 25, 2016 at 11:41 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 25/10/2016 17:35, Pranith Kumar wrote:
>> Using perf, I see that the sequence lock is a bottleneck, since it is
>> read by every thread. Giving it its own cache line seems to help
>> things quite a bit.
>>
>> Using qht-bench, I measured the following throughput (10-second runs,
>> 24 threads, <x>% updates):
>>
>> $ ./tests/qht-bench -d 10 -n 24 -u <x>
>>
>> update %   base   patch   %change
>> 0          8.07   13.33   +65%
>> 10         7.10    8.90   +25%
>> 20         6.34    7.02   +10%
>> 30         5.48    6.11   +9.6%
>> 40         4.90    5.46   +11.42%
>>
>> I am not able to see any significant improvement at lower thread counts, though.
>
> Whoa, that's a bit of a heavy hammer.  The idea is that whoever modifies
> the seqlock must take the spinlock first, and whoever reads the seqlock
> will read one of the members of hashes[]/pointers[].  So having
> everything in the same cacheline should be better.

But we are taking the seqlock of only the head bucket, while the
readers are reading hashes/pointers of the chained buckets.

>
> What happens if you just change QHT_BUCKET_ALIGN to 128?
>

update %   throughput
0           4.47
10          9.82  (weird!)
20          8.13
30          7.13
40          6.45

Thanks,
Paolo Bonzini Oct. 25, 2016, 3:54 p.m. UTC | #3
On 25/10/2016 17:49, Pranith Kumar wrote:
> But we are taking the seqlock of only the head bucket, while the
> readers are reading hashes/pointers of the chained buckets.

No, we aren't.  See qht_lookup__slowpath.
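
The read side is roughly the following (paraphrased from util/qht.c,
so check the tree for the exact code); the retry loop takes the *head*
bucket's sequence, while qht_do_lookup() walks the whole chain inside
it:

static void *qht_lookup__slowpath(struct qht_bucket *b, qht_lookup_func_t func,
                                  const void *userp, uint32_t hash)
{
    unsigned int version;
    void *ret;

    do {
        version = seqlock_read_begin(&b->sequence);
        /* walks b and every bucket chained off b->next */
        ret = qht_do_lookup(b, func, userp, hash);
    } while (seqlock_read_retry(&b->sequence, version));

    return ret;
}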

This patch:

update %   base   patch   %change
0          8.07   13.33   +65%
10         7.10    8.90   +25%
20         6.34    7.02   +10%
30         5.48    6.11   +9.6%
40         4.90    5.46   +11.42%


Just doubling the bucket alignment (QHT_BUCKET_ALIGN = 128):

update %   base   patch   %change
0          8.07    4.47   -45% ?!?
10         7.10    9.82   +38%
20         6.34    8.13   +28%
30         5.48    7.13   +30%
40         4.90    6.45   +30%

It seems to me that your machine has 128-byte cachelines.
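
A quick way to check (this uses a glibc-specific sysconf key, so
consider it a sketch; POWER and some arm64 machines do have 128-byte
lines):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("L1d cache line size: %ld bytes\n",
           sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    return 0;
}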

Paolo

Patch

diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
index 8dee11d..954abe8 100644
--- a/include/qemu/seqlock.h
+++ b/include/qemu/seqlock.h
@@ -21,7 +21,7 @@  typedef struct QemuSeqLock QemuSeqLock;
 
 struct QemuSeqLock {
     unsigned sequence;
-};
+} QEMU_ALIGNED(64);
 
 static inline void seqlock_init(QemuSeqLock *sl)
 {
diff --git a/util/qht.c b/util/qht.c
index ff4d2e6..4d82609 100644
--- a/util/qht.c
+++ b/util/qht.c
@@ -101,14 +101,14 @@ 
  * be grabbed first.
  */
 struct qht_bucket {
-    QemuSpin lock;
     QemuSeqLock sequence;
+    QemuSpin lock;
     uint32_t hashes[QHT_BUCKET_ENTRIES];
     void *pointers[QHT_BUCKET_ENTRIES];
     struct qht_bucket *next;
 } QEMU_ALIGNED(QHT_BUCKET_ALIGN);
 
-QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > QHT_BUCKET_ALIGN);
+QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > 2 * QHT_BUCKET_ALIGN);
 
 /**
  * struct qht_map - structure to track an array of buckets