Message ID: 1475588392-10286-1-git-send-email-eswierk@skyportsystems.com
State: New
On Tue 04 Oct 2016 03:39:52 PM CEST, Ed Swierk wrote:

> This change replaces the hardcoded default L2 table cache size with a
> value computed from the image size and cluster size. The resulting
> default cache size is just large enough to service random accesses in
> the entire image.

That might be a lot of memory if the image is big. 1 TB qcow2 image == 128 MB L2 cache.

See https://bugzilla.redhat.com/show_bug.cgi?id=1377735#c2

If we want to add this feature, I think I'd rather make it explicit.

Berto
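For reference, Berto's 1 TB == 128 MB figure follows from the sizing formula documented in docs/qcow2-cache.txt: the L2 cache needs 8 bytes of metadata per data cluster to cover the whole image. A quick sketch (the function name is made up for illustration):

```python
def l2_cache_bytes(disk_size, cluster_size=64 * 1024):
    # One 8-byte L2 entry is needed per data cluster, so covering the
    # whole image takes (disk_size / cluster_size) entries of 8 bytes each.
    return disk_size * 8 // cluster_size

# 1 TiB image with the default 64 KiB clusters
print(l2_cache_bytes(1 << 40) // (1 << 20), "MiB")  # -> 128 MiB
```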
On Tue, Oct 4, 2016 at 7:31 AM, Alberto Garcia <berto@igalia.com> wrote:

> That might be a lot of memory if the image is big. 1 TB qcow2 image ==
> 128 MB L2 cache.
>
> See https://bugzilla.redhat.com/show_bug.cgi?id=1377735#c2
>
> If we want to add this feature, I think I'd rather make it explicit.

I agree explicit is generally better than automagical, but unlike, say, the VM RAM size, the qcow2 L2 table cache is an implementation detail no user should be expected to know exists, let alone one that needs tweaking according to a specific formula to avoid an arbitrary performance cliff.

At least giving users a way to skip the math would be an improvement. Would you be okay with an explicitly-set option like l2_cache_size=auto or =max that optimizes for performance at the expense of memory?

--Ed
On 04.10.2016 17:36, Ed Swierk wrote:

> On Tue, Oct 4, 2016 at 7:31 AM, Alberto Garcia <berto@igalia.com> wrote:
>> That might be a lot of memory if the image is big. 1 TB qcow2 image ==
>> 128 MB L2 cache.
>>
>> See https://bugzilla.redhat.com/show_bug.cgi?id=1377735#c2
>>
>> If we want to add this feature, I think I'd rather make it explicit.
>
> I agree explicit is generally better than automagical, but unlike say
> the VM RAM size, the qcow L2 table cache is an implementation detail
> no user should be expected to know exists, let alone needs tweaking
> according to a specific formula to avoid an arbitrary performance
> cliff.

I think a user can be expected to know it when they really want optimal performance for huge workloads.

> At least giving users a way to skip the math would be an improvement.
> Would you be okay with an explicitly-set option like
> l2_cache_size=auto or =max that optimizes for performance at the
> expense of memory?

That (cache-size=max) is actually something Stefan Hajnoczi proposed at KVM Forum 2015.

I agree that implementing the formula in qemu's qcow2 driver itself makes sense and will help users; however, I do think it is appropriate to expect the user to pass cache-size=max if they need it. We shouldn't fix this lack of knowledge by making cache-size=max the default but by documenting it properly (although I'm as yet unsure where to do that, other than in some blog...).

Also, we have the cache-clean-interval option, which makes the qcow2 driver automatically clean up unused entries in the metadata caches; that would be very useful with cache-size=max, so we shouldn't forget to mention it in the documentation as well.

Max
On Tue 04 Oct 2016 05:51:26 PM CEST, Max Reitz wrote:

>> At least giving users a way to skip the math would be an improvement.
>> Would you be okay with an explicitly-set option like
>> l2_cache_size=auto or =max that optimizes for performance at the
>> expense of memory?
>
> That (cache-size=max) is actually something Stefan Hajnoczi has
> proposed at KVM Forum 2015.
>
> I agree that implementing the formula in qemu's qcow2 driver itself
> makes sense and will help users; however, I do think it is appropriate
> to expect the user to pass cache-size=max if they need it.

Frank Myhr's suggestion (in bugzilla) is that we allow specifying a % of the disk size, so

  l2-cache-size=100%   (that would be cache-size=max)
  l2-cache-size=80%
  ...

https://bugzilla.redhat.com/show_bug.cgi?id=1377735#c3

Either one looks good to me. They would, however, change the data type of 'l2-cache-size' from int to string in BlockdevOptionsQcow2; can we actually do that?

Berto
On 05.10.2016 16:57, Alberto Garcia wrote:

> On Tue 04 Oct 2016 05:51:26 PM CEST, Max Reitz wrote:
>
>>> At least giving users a way to skip the math would be an improvement.
>>> Would you be okay with an explicitly-set option like
>>> l2_cache_size=auto or =max that optimizes for performance at the
>>> expense of memory?
>>
>> That (cache-size=max) is actually something Stefan Hajnoczi has
>> proposed at KVM Forum 2015.
>>
>> I agree that implementing the formula in qemu's qcow2 driver itself
>> makes sense and will help users; however, I do think it is appropriate
>> to expect the user to pass cache-size=max if they need it.
>
> Frank Myhr's suggestion (in bugzilla) is that we allow specifying a % of
> the disk size, so
>
>   l2-cache-size=100%   (that would be cache-size=max)
>   l2-cache-size=80%
>   ...
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1377735#c3
>
> Either one looks good to me. They would change however the data type of
> 'l2-cache-size' from int to string in BlockdevOptionsQcow2, can we
> actually do that?

I think we can do that with an alternate data type which accepts both an integer and a string. If we only had "max", we'd probably want to make it an enumeration instead of a free-form string, though. But with a % suffix, we'd probably need a string.

For me, either works fine.

Apart from that, I have to say I think it would be a bit more useful if one could specify the area covered by the metadata caches as an absolute number instead of a relative one (I guess it's generally easier to know what area your applications will perform random accesses on than the relative size, but maybe that's just me).

Supporting that with cache-size is difficult, though, because how are you going to decide whether the user is specifying the actual size of the cache or the area it covers?

So maybe it would make more sense to introduce a whole new set of options called {,l2-,refcount-}cache-coverage or something. These options would then accept both an absolute and a relative number.

Max
On Wed 05 Oct 2016 05:13:39 PM CEST, Max Reitz wrote:

>>>> At least giving users a way to skip the math would be an
>>>> improvement. Would you be okay with an explicitly-set option like
>>>> l2_cache_size=auto or =max that optimizes for performance at the
>>>> expense of memory?
>>>
>> Frank Myhr's suggestion (in bugzilla) is that we allow specifying a %
>> of the disk size, so
>>
>>   l2-cache-size=100%   (that would be cache-size=max)
>>   l2-cache-size=80%
>
> For me, either works fine.
>
> Apart from that, I have to say I think it would be a bit more useful
> if one would specify the area covered by the metadata caches as an
> absolute number instead of a relative one (I guess it's generally
> easier to know what area your applications will perform random
> accesses on than the relative size, but maybe that's just me).

I'm not sure I'm following you; can you give an example of what the area covered by the cache exactly means?

Berto
Am 05.10.2016 um 17:13 hat Max Reitz geschrieben:

> On 05.10.2016 16:57, Alberto Garcia wrote:
>> On Tue 04 Oct 2016 05:51:26 PM CEST, Max Reitz wrote:
>>
>>>> At least giving users a way to skip the math would be an improvement.
>>>> Would you be okay with an explicitly-set option like
>>>> l2_cache_size=auto or =max that optimizes for performance at the
>>>> expense of memory?
>>>
>>> That (cache-size=max) is actually something Stefan Hajnoczi has
>>> proposed at KVM Forum 2015.
>>>
>>> I agree that implementing the formula in qemu's qcow2 driver itself
>>> makes sense and will help users; however, I do think it is appropriate
>>> to expect the user to pass cache-size=max if they need it.
>>
>> Frank Myhr's suggestion (in bugzilla) is that we allow specifying a % of
>> the disk size, so
>>
>>   l2-cache-size=100%   (that would be cache-size=max)
>>   l2-cache-size=80%
>>   ...
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1377735#c3
>>
>> Either one looks good to me. They would change however the data type of
>> 'l2-cache-size' from int to string in BlockdevOptionsQcow2, can we
>> actually do that?
>
> I think we can do that with an alternate data type which accepts both an
> integer and a string. If we only had "max", we'd probably want to make
> it an enumeration instead of a free-form string, though. But with a %
> suffix, we'd probably need a string.
>
> For me, either works fine.

I'm not sure if we want to support such syntax in QMP. It sounds like a typical convenience option for human users.

> Apart from that, I have to say I think it would be a bit more useful if
> one would specify the area covered by the metadata caches as an absolute
> number instead of a relative one (I guess it's generally easier to know
> what area your applications will perform random accesses on than the
> relative size, but maybe that's just me).
>
> Supporting that with cache-size is difficult, though, because how are
> you going to decide between whether the user is specifying the actual
> size of the cache or the area it covers?
>
> So maybe it would make more sense to introduce a whole new set of
> options called {,l2-,refcount-}cache-coverage or something. These
> options would then accept both an absolute and a relative number.

If we want easy, make it easy: cache-coverage, and apply it to both.

Kevin
On 05.10.2016 17:19, Alberto Garcia wrote:

> On Wed 05 Oct 2016 05:13:39 PM CEST, Max Reitz wrote:
>
>>>>> At least giving users a way to skip the math would be an
>>>>> improvement. Would you be okay with an explicitly-set option like
>>>>> l2_cache_size=auto or =max that optimizes for performance at the
>>>>> expense of memory?
>>>>
>>> Frank Myhr's suggestion (in bugzilla) is that we allow specifying a %
>>> of the disk size, so
>>>
>>>   l2-cache-size=100%   (that would be cache-size=max)
>>>   l2-cache-size=80%
>
>> For me, either works fine.
>>
>> Apart from that, I have to say I think it would be a bit more useful
>> if one would specify the area covered by the metadata caches as an
>> absolute number instead of a relative one (I guess it's generally
>> easier to know what area your applications will perform random
>> accesses on than the relative size, but maybe that's just me).
>
> I'm not sure if I'm following you, can you give an example of what the
> area covered by the cache exactly means?

Let's take the L2 table cache. Every eight bytes of that cache point to one cluster in the image, so if it's 8 MB in size, for instance, then you can cover 1M clusters. With the default 64 kB clusters, you would cover 64 GB of any image (although you have some constraints on that range, of course; an 8 MB cache would contain not just 1M cluster pointers but actually 128 L2 tables (8 MB / 64 kB), so it would actually cover 128 contiguous 512 MB ranges).

So as long as you perform any kind of access pattern within these 64 GB, the cache will cover that. But when you're doing random accesses over more than 64 GB, the cache cannot cover them.

So that's what I mean by coverage: the area you can access without having to suffer cache misses.

Max
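Max's arithmetic above can be checked directly; a small sketch (variable names are illustrative, not taken from qemu):

```python
CLUSTER_SIZE = 64 * 1024   # qcow2 default cluster size
L2_ENTRY_SIZE = 8          # bytes per L2 table entry

cache_size = 8 * 1024 * 1024  # the 8 MB L2 cache from the example

# Total guest data addressable through the cached entries.
coverage = cache_size // L2_ENTRY_SIZE * CLUSTER_SIZE

# The cache holds whole L2 tables, each one cluster in size.
n_tables = cache_size // CLUSTER_SIZE

# Each L2 table maps cluster_size/8 entries, each pointing at one cluster.
range_per_table = CLUSTER_SIZE // L2_ENTRY_SIZE * CLUSTER_SIZE

print(coverage >> 30, "GB total, as", n_tables,
      "contiguous ranges of", range_per_table >> 20, "MB each")
# -> 64 GB total, as 128 contiguous ranges of 512 MB each
```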
On Wed 05 Oct 2016 05:35:20 PM CEST, Max Reitz <mreitz@redhat.com> wrote:

>>> Apart from that, I have to say I think it would be a bit more useful
>>> if one would specify the area covered by the metadata caches as an
>>> absolute number instead of a relative one (I guess it's generally
>>> easier to know what area your applications will perform random
>>> accesses on than the relative size, but maybe that's just me).
>>
>> I'm not sure if I'm following you, can you give an example of what
>> the area covered by the cache exactly means?
>
> Let's take the L2 table cache. Every eight bytes of that cache point
> to one cluster in the image, so if it's 8 MB in size, for instance,
> then you can cover 1M clusters. With the default 64 kB clusters, you
> would cover 64 GB of any image (although you have some constraints on
> that range, of course; an 8 MB cache would contain not just 1M cluster
> pointers but actually 128 L2 tables (8 MB / 64 kB), so it would
> actually cover 128 continuous 512 MB ranges).

Ah! You mean that instead of saying l2-cache-size=1M, you would say l2-cache-coverage=8G, so the user doesn't need to do the numbers. I thought you were talking about something different; that's why it didn't make any sense to me :-)

Well, I don't know. If we had to design the interface again from the beginning I guess it could make sense, but I don't think it's a good idea to have two ways to specify exactly the same thing when one can be converted to the other with a very simple operation. I admit the formula is not directly obvious (that's why I documented it in docs/qcow2-cache.txt), but it's very simple.

I guess it wouldn't be so bad if we restricted it only to the command line (for humans), since any program that communicates using QMP can do the numbers itself. I don't know... how about a helper script that checks the cluster size and tells you how much cache you need?

Berto
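No such helper script exists in the thread; a minimal sketch of what Berto describes might look like this, reading cluster_bits (header offset 20) and the virtual size (offset 24) directly from the qcow2 header as laid out in the qcow2 specification (the function name is hypothetical):

```python
import struct
import sys

def qcow2_l2_cache_hint(path):
    """Return the L2 cache size (bytes) needed to cover the whole image,
    using cluster_bits and the virtual size from the qcow2 header."""
    with open(path, 'rb') as f:
        header = f.read(32)
    magic, version = struct.unpack('>II', header[:8])
    assert magic == 0x514649fb, "not a qcow2 image"  # 'QFI\xfb'
    # Offsets per the qcow2 spec: cluster_bits at 20 (u32), size at 24 (u64)
    cluster_bits, size = struct.unpack('>IQ', header[20:32])
    cluster_size = 1 << cluster_bits
    return (size * 8 + cluster_size - 1) // cluster_size

if __name__ == '__main__' and len(sys.argv) > 1:
    print(qcow2_l2_cache_hint(sys.argv[1]), 'bytes')
```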
diff --git a/block/qcow2.c b/block/qcow2.c
index 91ef4df..259e377 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -563,9 +563,8 @@ static void read_cache_sizes(BlockDriverState *bs, QemuOpts *opts,
         }
     } else {
         if (!l2_cache_size_set && !refcount_cache_size_set) {
-            *l2_cache_size = MAX(DEFAULT_L2_CACHE_BYTE_SIZE,
-                                 (uint64_t)DEFAULT_L2_CACHE_CLUSTERS
-                                 * s->cluster_size);
+            *l2_cache_size = (bs->total_sectors * BDRV_SECTOR_SIZE * 8
+                              + s->cluster_size - 1) / s->cluster_size;
             *refcount_cache_size = *l2_cache_size
                                  / DEFAULT_L2_REFCOUNT_SIZE_RATIO;
         } else if (!l2_cache_size_set) {
diff --git a/block/qcow2.h b/block/qcow2.h
index b36a7bf..f148c4d 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -68,10 +68,6 @@
 /* Must be at least 4 to cover all cases of refcount table growth */
 #define MIN_REFCOUNT_CACHE_SIZE 4 /* clusters */
 
-/* Whichever is more */
-#define DEFAULT_L2_CACHE_CLUSTERS 8 /* clusters */
-#define DEFAULT_L2_CACHE_BYTE_SIZE 1048576 /* bytes */
-
 /* The refblock cache needs only a fourth of the L2 cache size to cover as many
  * clusters */
 #define DEFAULT_L2_REFCOUNT_SIZE_RATIO 4
The optimal size of the qcow2 L2 table cache depends on the working set size and the cluster size of the disk image. If the cache is too small, L2 tables are re-read from disk on every IO operation in the worst case. The host's buffer cache can paper over this inefficiency, but with cache=none or cache=directsync, L2 table cache thrashing dramatically reduces IO operations per second (e.g. from 50k/sec to 5k/sec for a qcow2 stored on a modern SSD).

The L2 table cache size can be set manually via the l2-cache-size runtime option, but such a performance-critical value needs a sane default. Conservatively assuming the entire image is the working set, the existing hardcoded default value of 1 MiB is appropriate only for images up to 8 GiB with the default 64 KiB cluster size.

This change replaces the hardcoded default L2 table cache size with a value computed from the image size and cluster size. The resulting default cache size is just large enough to service random accesses in the entire image.

Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
---
 block/qcow2.c | 5 ++---
 block/qcow2.h | 4 ----
 2 files changed, 2 insertions(+), 7 deletions(-)
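The new default can be sanity-checked outside qemu; this sketch (helper name invented here) mirrors the ceiling division the patch adds to read_cache_sizes():

```python
def new_default_l2_cache(image_size, cluster_size):
    # Mirrors the patch: ceil(image_size * 8 / cluster_size), i.e.
    # (bs->total_sectors * BDRV_SECTOR_SIZE * 8 + s->cluster_size - 1)
    #     / s->cluster_size
    return (image_size * 8 + cluster_size - 1) // cluster_size

# 8 GiB image, 64 KiB clusters -> 1 MiB, reproducing the old hardcoded
# default exactly at the 8 GiB boundary mentioned in the commit message.
print(new_default_l2_cache(8 << 30, 64 << 10))  # -> 1048576
```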