qcow2: Optimize L2 table cache size based on image and cluster sizes

Message ID 1475588392-10286-1-git-send-email-eswierk@skyportsystems.com
State New

Commit Message

Ed Swierk Oct. 4, 2016, 1:39 p.m. UTC
The optimal size of the qcow2 L2 table cache depends on the working
set size and the cluster size of the disk image. If the cache is too
small, L2 tables are re-read from disk on every IO operation in the
worst case. The host's buffer cache can paper over this inefficiency,
but with cache=none or cache=directsync, L2 table cache thrashing
dramatically reduces IO operations per second (e.g. from 50k/sec to
5k/sec for a qcow2 stored on a modern SSD).
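
As an illustration (the benchmark tool and device name are assumptions,
not part of this commit): attach a large image with cache=none and run
a random-read job in the guest; with a working set far larger than the
cache covers, nearly every read misses the L2 cache.

    $ qemu-system-x86_64 ... \
        -drive file=big.qcow2,format=qcow2,cache=none,if=virtio
    # in the guest, assuming the disk appears as /dev/vdb:
    $ fio --name=randread --filename=/dev/vdb --rw=randread --bs=4k \
          --direct=1 --ioengine=libaio --iodepth=32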

The L2 table cache size can be set manually via the l2-cache-size
runtime option, but such a performance-critical value needs a sane
default. Conservatively assuming the entire image is the working
set, the existing hardcoded default value of 1 MiB is appropriate
for images up to 8 GiB with the default 64 KiB cluster size.

This change replaces the hardcoded default L2 table cache size with a
value computed from the image size and cluster size. The resulting
default cache size is just large enough to service random accesses in
the entire image.
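
The arithmetic behind both the old and the new default (each 8-byte L2
entry maps one cluster):

    l2_cache_size = image_size / cluster_size * 8
                  = 8 GiB / 64 KiB * 8 B
                  = 131072 entries * 8 B
                  = 1 MiB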

Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
---
 block/qcow2.c | 5 ++---
 block/qcow2.h | 4 ----
 2 files changed, 2 insertions(+), 7 deletions(-)

Comments

Alberto Garcia Oct. 4, 2016, 2:31 p.m. UTC | #1
On Tue 04 Oct 2016 03:39:52 PM CEST, Ed Swierk wrote:

> This change replaces the hardcoded default L2 table cache size with a
> value computed from the image size and cluster size. The resulting
> default cache size is just large enough to service random accesses in
> the entire image.

That might be a lot of memory if the image is big. 1 TB qcow2 image ==
128 MB L2 cache.

See https://bugzilla.redhat.com/show_bug.cgi?id=1377735#c2

If we want to add this feature, I think I'd rather make it explicit.

Berto
Ed Swierk Oct. 4, 2016, 3:36 p.m. UTC | #2
On Tue, Oct 4, 2016 at 7:31 AM, Alberto Garcia <berto@igalia.com> wrote:
> That might be a lot of memory if the image is big. 1 TB qcow2 image ==
> 128 MB L2 cache.
>
> See https://bugzilla.redhat.com/show_bug.cgi?id=1377735#c2
>
> If we want to add this feature, I think I'd rather make it explicit.

I agree explicit is generally better than automagical, but unlike say
the VM RAM size, the qcow L2 table cache is an implementation detail
no user should be expected to know exists, let alone needs tweaking
according to a specific formula to avoid an arbitrary performance
cliff.

At least giving users a way to skip the math would be an improvement.
Would you be okay with an explicitly-set option like
l2_cache_size=auto or =max that optimizes for performance at the
expense of memory?
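
Something like this, say (hypothetical syntax; the value doesn't exist
today):

    -drive file=hd.qcow2,format=qcow2,l2-cache-size=max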

--Ed
Max Reitz Oct. 4, 2016, 3:51 p.m. UTC | #3
On 04.10.2016 17:36, Ed Swierk wrote:
> On Tue, Oct 4, 2016 at 7:31 AM, Alberto Garcia <berto@igalia.com> wrote:
>> That might be a lot of memory if the image is big. 1 TB qcow2 image ==
>> 128 MB L2 cache.
>>
>> See https://bugzilla.redhat.com/show_bug.cgi?id=1377735#c2
>>
>> If we want to add this feature, I think I'd rather make it explicit.
> 
> I agree explicit is generally better than automagical, but unlike say
> the VM RAM size, the qcow L2 table cache is an implementation detail
> no user should be expected to know exists, let alone needs tweaking
> according to a specific formula to avoid an arbitrary performance
> cliff.

I think a user can be expected to know it when they really want optimal
performance for huge workloads.

> At least giving users a way to skip the math would be an improvement.
> Would you be okay with an explicitly-set option like
> l2_cache_size=auto or =max that optimizes for performance at the
> expense of memory?

That (cache-size=max) is actually something Stefan Hajnoczi has proposed
at KVM Forum 2015.

I agree that implementing the formula in qemu's qcow2 driver itself
makes sense and will help users; however, I do think it is appropriate
to expect the user to pass cache-size=max if they need it.

We shouldn't fix this lack of knowledge by making cache-size=max the
default but by documenting it properly (although I'm as yet unsure where to
do that, other than in some blog...). Also, we have the
cache-clean-interval option which makes the qcow2 driver automatically
clean up unused entries in the metadata caches, which would be very
useful for cache-size=max; we thus shouldn't forget to mention that in
the documentation as well.
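
For instance, today's syntax already composes like this (byte value
illustrative; only the "max" value would be new):

    -drive file=hd.qcow2,l2-cache-size=4194304,cache-clean-interval=900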

Max
Alberto Garcia Oct. 5, 2016, 2:57 p.m. UTC | #4
On Tue 04 Oct 2016 05:51:26 PM CEST, Max Reitz wrote:

>> At least giving users a way to skip the math would be an improvement.
>> Would you be okay with an explicitly-set option like
>> l2_cache_size=auto or =max that optimizes for performance at the
>> expense of memory?
>
> That (cache-size=max) is actually something Stefan Hajnoczi has
> proposed at KVM Forum 2015.
>
> I agree that implementing the formula in qemu's qcow2 driver itself
> makes sense and will help users; however, I do think it is appropriate
> to expect the user to pass cache-size=max if they need it.

Frank Myhr's suggestion (in bugzilla) is that we allow specifying a % of
the disk size, so

l2-cache-size=100%  (that would be cache-size=max)
l2-cache-size=80%
...

https://bugzilla.redhat.com/show_bug.cgi?id=1377735#c3
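
A minimal parsing sketch, just to show the shape of it (hypothetical,
not proposed code):

    #include <stdint.h>
    #include <stdlib.h>

    /* Map "NN%" of the virtual disk size to an L2 cache size in bytes;
     * plain integers pass through unchanged. Each 8 bytes of L2 cache
     * cover one cluster. */
    static int64_t l2_cache_size_for(const char *val, int64_t disk_bytes,
                                     int cluster_size)
    {
        char *end;
        int64_t n = strtoll(val, &end, 10);

        if (*end == '%') {
            int64_t covered = disk_bytes / 100 * n;
            return (covered + cluster_size - 1) / cluster_size * 8;
        }
        return n; /* plain byte count, as today */
    }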

Either one looks good to me. They would, however, change the data type
of 'l2-cache-size' from int to string in BlockdevOptionsQcow2; can we
actually do that?

Berto
Max Reitz Oct. 5, 2016, 3:13 p.m. UTC | #5
On 05.10.2016 16:57, Alberto Garcia wrote:
> On Tue 04 Oct 2016 05:51:26 PM CEST, Max Reitz wrote:
> 
>>> At least giving users a way to skip the math would be an improvement.
>>> Would you be okay with an explicitly-set option like
>>> l2_cache_size=auto or =max that optimizes for performance at the
>>> expense of memory?
>>
>> That (cache-size=max) is actually something Stefan Hajnoczi has
>> proposed at KVM Forum 2015.
>>
>> I agree that implementing the formula in qemu's qcow2 driver itself
>> makes sense and will help users; however, I do think it is appropriate
>> to expect the user to pass cache-size=max if they need it.
> 
> Frank Myhr's suggestion (in bugzilla) is that we allow specifying a % of
> the disk size, so
> 
> l2-cache-size=100%  (that would be cache-size=max)
> l2-cache-size=80%
> ...
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1377735#c3
> 
> Either one looks good to me. They would, however, change the data type
> of 'l2-cache-size' from int to string in BlockdevOptionsQcow2; can we
> actually do that?

I think we can do that with an alternate data type which accepts both an
integer and a string. If we only had "max", we'd probably want to make
it an enumeration instead of a free-form string, though. But with a %
suffix, we'd probably need a string.
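
In the schema that could look roughly like this (a sketch; the type
name is made up):

    { 'alternate': 'Qcow2CacheSize',
      'data': { 'bytes': 'int',
                'relative': 'str' } }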

For me, either works fine.

Apart from that, I have to say I think it would be a bit more useful if
one would specify the area covered by the metadata caches as an absolute
number instead of a relative one (I guess it's generally easier to know
what area your applications will perform random accesses on than the
relative size, but maybe that's just me).

Supporting that with cache-size is difficult, though, because how are
you going to decide between whether the user is specifying the actual
size of the cache or the area it covers?

So maybe it would make more sense to introduce a whole new set of
options called {,l2-,refcount-}cache-coverage or something. These
options would then accept both an absolute and a relative number.
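
E.g. (all hypothetical):

    -drive file=hd.qcow2,l2-cache-coverage=64G
    -drive file=hd.qcow2,cache-coverage=50%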

Max
Alberto Garcia Oct. 5, 2016, 3:19 p.m. UTC | #6
On Wed 05 Oct 2016 05:13:39 PM CEST, Max Reitz wrote:

>>>> At least giving users a way to skip the math would be an
>>>> improvement.  Would you be okay with an explicitly-set option like
>>>> l2_cache_size=auto or =max that optimizes for performance at the
>>>> expense of memory?
>>>
>> Frank Myhr's suggestion (in bugzilla) is that we allow specifying a %
>> of the disk size, so
>> 
>> l2-cache-size=100%  (that would be cache-size=max)
>> l2-cache-size=80%

> For me, either works fine.
>
> Apart from that, I have to say I think it would be a bit more useful
> if one would specify the area covered by the metadata caches as an
> absolute number instead of a relative one (I guess it's generally
> easier to know what area your applications will perform random
> accesses on than the relative size, but maybe that's just me).

I'm not sure if I'm following you, can you give an example of what the
area covered by the cache exactly means?

Berto
Kevin Wolf Oct. 5, 2016, 3:24 p.m. UTC | #7
Am 05.10.2016 um 17:13 hat Max Reitz geschrieben:
> On 05.10.2016 16:57, Alberto Garcia wrote:
> > On Tue 04 Oct 2016 05:51:26 PM CEST, Max Reitz wrote:
> > 
> >>> At least giving users a way to skip the math would be an improvement.
> >>> Would you be okay with an explicitly-set option like
> >>> l2_cache_size=auto or =max that optimizes for performance at the
> >>> expense of memory?
> >>
> >> That (cache-size=max) is actually something Stefan Hajnoczi has
> >> proposed at KVM Forum 2015.
> >>
> >> I agree that implementing the formula in qemu's qcow2 driver itself
> >> makes sense and will help users; however, I do think it is appropriate
> >> to expect the user to pass cache-size=max if they need it.
> > 
> > Frank Myhr's suggestion (in bugzilla) is that we allow specifying a % of
> > the disk size, so
> > 
> > l2-cache-size=100%  (that would be cache-size=max)
> > l2-cache-size=80%
> > ...
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=1377735#c3
> > 
> > Either one looks good to me. They would, however, change the data type
> > of 'l2-cache-size' from int to string in BlockdevOptionsQcow2; can we
> > actually do that?
> 
> I think we can do that with an alternate data type which accepts both an
> integer and a string. If we only had "max", we'd probably want to make
> it an enumeration instead of a free-form string, though. But with a %
> suffix, we'd probably need a string.
> 
> For me, either works fine.

I'm not sure if we want to support such syntax in QMP. It sounds like a
typical convenience option for human users.

> Apart from that, I have to say I think it would be a bit more useful if
> one would specify the area covered by the metadata caches as an absolute
> number instead of a relative one (I guess it's generally easier to know
> what area your applications will perform random accesses on than the
> relative size, but maybe that's just me).
> 
> Supporting that with cache-size is difficult, though, because how are
> you going to decide between whether the user is specifying the actual
> size of the cache or the area it covers?
> 
> So maybe it would make more sense to introduce a whole new set of
> options called {,l2-,refcount-}cache-coverage or something. These
> options would then accept both an absolute and a relative number.

If we want easy, make it easy: cache-coverage and apply it to both.

Kevin
Max Reitz Oct. 5, 2016, 3:35 p.m. UTC | #8
On 05.10.2016 17:19, Alberto Garcia wrote:
> On Wed 05 Oct 2016 05:13:39 PM CEST, Max Reitz wrote:
> 
>>>>> At least giving users a way to skip the math would be an
>>>>> improvement.  Would you be okay with an explicitly-set option like
>>>>> l2_cache_size=auto or =max that optimizes for performance at the
>>>>> expense of memory?
>>>>
>>> Frank Myhr's suggestion (in bugzilla) is that we allow specifying a %
>>> of the disk size, so
>>>
>>> l2-cache-size=100%  (that would be cache-size=max)
>>> l2-cache-size=80%
> 
>> For me, either works fine.
>>
>> Apart from that, I have to say I think it would be a bit more useful
>> if one would specify the area covered by the metadata caches as an
>> absolute number instead of a relative one (I guess it's generally
>> easier to know what area your applications will perform random
>> accesses on than the relative size, but maybe that's just me).
> 
> I'm not sure if I'm following you, can you give an example of what the
> area covered by the cache exactly means?

Let's take the L2 table cache. Every eight bytes of that cache point to
one cluster in the image, so if it's 8 MB in size, for instance, then
you can cover 1M clusters. With the default 64 kB clusters, you would
cover 64 GB of any image (although you have some constraints on that
range, of course; an 8 MB cache holds not 1M arbitrary cluster
pointers but 128 whole L2 tables (8 MB / 64 kB), so it actually covers
128 contiguous 512 MB ranges).
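
Spelled out with the same numbers:

    8 MB cache / 8 B per entry          = 1M entries
    1M entries * 64 kB per cluster      = 64 GB covered
    8 MB cache / 64 kB per L2 table     = 128 tables
    8192 entries per table * 64 kB      = 512 MB per table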

So as long as you perform any kind of access pattern within these 64 GB,
the cache will cover that. But when you're doing random accesses over
more than 64 GB, the cache cannot cover it.

So that's what I mean with coverage: The area you can access without
having to suffer cache misses.

Max
Alberto Garcia Oct. 6, 2016, 2:34 p.m. UTC | #9
On Wed 05 Oct 2016 05:35:20 PM CEST, Max Reitz <mreitz@redhat.com> wrote:

>>> Apart from that, I have to say I think it would be a bit more useful
>>> if one would specify the area covered by the metadata caches as an
>>> absolute number instead of a relative one (I guess it's generally
>>> easier to know what area your applications will perform random
>>> accesses on than the relative size, but maybe that's just me).
>> 
>> I'm not sure if I'm following you, can you give an example of what
>> the area covered by the cache exactly means?
>
> Let's take the L2 table cache. Every eight bytes of that cache point
> to one cluster in the image, so if it's 8 MB in size, for instance,
> then you can cover 1M clusters. With the default 64 kB clusters, you
> would cover 64 GB of any image (although you have some constraints on
> that range, of course; an 8 MB cache holds not 1M arbitrary cluster
> pointers but 128 whole L2 tables (8 MB / 64 kB), so it actually
> covers 128 contiguous 512 MB ranges).

Ah! You mean that instead of saying l2-cache-size=1M, you would say
l2-cache-coverage=8G, so the user doesn't need to do the numbers. I
thought you were talking about something different; that's why it didn't
make any sense to me :-)

Well, I don't know, if we had to design the interface again from the
beginning I guess it could make sense, but I don't think it's a good
idea to have two ways to specify exactly the same thing, when one can be
converted to the other with a very simple operation.

I admit the formula is not directly obvious (that's why I documented it
in docs/qcow2-cache.txt), but it's very simple.
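
For reference, it boils down to:

    l2_cache_size = disk_size * 8 / cluster_size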

I guess it wouldn't be so bad if we restricted it only to the command
line (for humans), since any program that communicates using QMP can do
the numbers itself. I don't know... how about a helper script that
checks the cluster size and tells you how much cache you need?
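
Something along these lines, maybe (a sketch; assumes jq is available):

    #!/bin/sh
    # Print the l2-cache-size needed to cover the whole image.
    info=$(qemu-img info --output=json "$1")
    vsize=$(echo "$info" | jq '."virtual-size"')
    csize=$(echo "$info" | jq '."cluster-size"')
    echo "l2-cache-size=$(( vsize * 8 / csize ))"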

Berto

Patch

diff --git a/block/qcow2.c b/block/qcow2.c
index 91ef4df..259e377 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -563,9 +563,8 @@  static void read_cache_sizes(BlockDriverState *bs, QemuOpts *opts,
         }
     } else {
         if (!l2_cache_size_set && !refcount_cache_size_set) {
-            *l2_cache_size = MAX(DEFAULT_L2_CACHE_BYTE_SIZE,
-                                 (uint64_t)DEFAULT_L2_CACHE_CLUSTERS
-                                 * s->cluster_size);
+            *l2_cache_size = (bs->total_sectors * BDRV_SECTOR_SIZE * 8
+                              + s->cluster_size - 1) / s->cluster_size;
             *refcount_cache_size = *l2_cache_size
                                  / DEFAULT_L2_REFCOUNT_SIZE_RATIO;
         } else if (!l2_cache_size_set) {
diff --git a/block/qcow2.h b/block/qcow2.h
index b36a7bf..f148c4d 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -68,10 +68,6 @@ 
 /* Must be at least 4 to cover all cases of refcount table growth */
 #define MIN_REFCOUNT_CACHE_SIZE 4 /* clusters */
 
-/* Whichever is more */
-#define DEFAULT_L2_CACHE_CLUSTERS 8 /* clusters */
-#define DEFAULT_L2_CACHE_BYTE_SIZE 1048576 /* bytes */
-
 /* The refblock cache needs only a fourth of the L2 cache size to cover as many
  * clusters */
 #define DEFAULT_L2_REFCOUNT_SIZE_RATIO 4