qcow2: do lazy allocation of the L2 cache

Message ID 1429629639-14328-1-git-send-email-berto@igalia.com
State New

Commit Message

Alberto Garcia April 21, 2015, 3:20 p.m. UTC
Large disk images need large L2 caches in order to maximize their
I/O performance. However, setting a correct size for the cache is not
necessarily easy since apart from the image size, it also depends
on other factors like its usage patterns or whether it's part of a
backing chain.

In order to be able to set a very large cache size to cover the
worst-case scenarios and yet prevent an unnecessary waste of memory,
this patch modifies the qcow2 cache algorithm so the memory for each
entry is allocated only when it's actually needed.

This also improves the situation for smaller images: the current
default L2 cache size can map a whole 8GB disk, so images smaller than
that allocate cache memory that can never be used.

Signed-off-by: Alberto Garcia <berto@igalia.com>
---
 block/qcow2-cache.c | 35 ++++++++++++++---------------------
 block/qcow2.c       |  4 ++--
 block/qcow2.h       |  2 +-
 3 files changed, 17 insertions(+), 24 deletions(-)

Comments

Stefan Hajnoczi April 22, 2015, 10:26 a.m. UTC | #1
On Tue, Apr 21, 2015 at 06:20:39PM +0300, Alberto Garcia wrote:
> Large disk images need large L2 caches in order to maximize their
> I/O performance. However setting a correct size for the cache is not
> necessarily easy since apart from the image size, it also depends
> on other factors like its usage patterns or whether it's part of a
> backing chain.
> 
> In order to be able to set a very large cache size to cover the
> worst-case scenarios and yet prevent an unnecessary waste of memory,
> this patch modifies the qcow2 cache algorithm so the memory for each
> entry is allocated only when it's actually needed.
> 
> This also improves the scenarios with smaller images: the current
> default L2 cache size can map a whole 8GB disk, so those smaller than
> that are allocating cache memory that can never be used.

What measurable improvement does this patch make?

Buffers allocated upfront are not touched, so they don't actually
consume physical memory.

Stefan
Alberto Garcia April 22, 2015, 2:37 p.m. UTC | #2
On Wed 22 Apr 2015 12:26:02 PM CEST, Stefan Hajnoczi wrote:

>> Large disk images need large L2 caches in order to maximize their I/O
>> performance. However setting a correct size for the cache is not
>> necessarily easy since apart from the image size, it also depends on
>> other factors like its usage patterns or whether it's part of a
>> backing chain.
>> 
>> In order to be able to set a very large cache size to cover the
>> worst-case scenarios and yet prevent an unnecessary waste of memory,
>> this patch modifies the qcow2 cache algorithm so the memory for each
>> entry is allocated only when it's actually needed.
>> 
>> This also improves the scenarios with smaller images: the current
>> default L2 cache size can map a whole 8GB disk, so those smaller than
>> that are allocating cache memory that can never be used.
>
> What measurable improvement does this patch make?
>
> Buffers allocated upfront are not touched, so they don't actually
> consume physical memory.

For a cache size of 128MB, the PSS is actually ~10MB larger without the
patch, which seems to come from posix_memalign().

But yeah, for smaller caches there's no difference. I should probably
add a timer to free unused cache entries after some expiration
time. What do you think?

Berto
Stefan Hajnoczi April 23, 2015, 10:15 a.m. UTC | #3
On Wed, Apr 22, 2015 at 04:37:15PM +0200, Alberto Garcia wrote:
> On Wed 22 Apr 2015 12:26:02 PM CEST, Stefan Hajnoczi wrote:
> 
> >> Large disk images need large L2 caches in order to maximize their I/O
> >> performance. However setting a correct size for the cache is not
> >> necessarily easy since apart from the image size, it also depends on
> >> other factors like its usage patterns or whether it's part of a
> >> backing chain.
> >> 
> >> In order to be able to set a very large cache size to cover the
> >> worst-case scenarios and yet prevent an unnecessary waste of memory,
> >> this patch modifies the qcow2 cache algorithm so the memory for each
> >> entry is allocated only when it's actually needed.
> >> 
> >> This also improves the scenarios with smaller images: the current
> >> default L2 cache size can map a whole 8GB disk, so those smaller than
> >> that are allocating cache memory that can never be used.
> >
> > What measurable improvement does this patch make?
> >
> > Buffers allocated upfront are not touched, so they don't actually
> > consume physical memory.
> 
> For a cache size of 128MB, the PSS is actually ~10MB larger without the
> patch, which seems to come from posix_memalign().

Do you mean RSS or are you using a tool that reports a "PSS" number that
I don't know about?

We should understand what is going on instead of moving the code around
to hide/delay the problem.

Stefan
Alberto Garcia April 23, 2015, 11:50 a.m. UTC | #4
On Thu 23 Apr 2015 12:15:04 PM CEST, Stefan Hajnoczi wrote:

>> For a cache size of 128MB, the PSS is actually ~10MB larger without
>> the patch, which seems to come from posix_memalign().
>
> Do you mean RSS or are you using a tool that reports a "PSS" number
> that I don't know about?
>
> We should understand what is going on instead of moving the code
> around to hide/delay the problem.

Both RSS and PSS ("proportional set size", also reported by the kernel).

I'm not an expert in memory allocators, but I measured the overhead like
this:

An L2 cache of 128MB implies a refcount cache of 32MB, in total 160MB.
With a default cluster size of 64k, that's 2560 cache entries.

So I wrote a test case that allocates 2560 blocks of 64k each using
posix_memalign and mmap, and here's how their /proc/<pid>/smaps compare:

-Size:             165184 kB
-Rss:               10244 kB
-Pss:               10244 kB
+Size:             161856 kB
+Rss:                   0 kB
+Pss:                   0 kB
 Shared_Clean:          0 kB
 Shared_Dirty:          0 kB
 Private_Clean:         0 kB
-Private_Dirty:     10244 kB
-Referenced:        10244 kB
-Anonymous:         10244 kB
+Private_Dirty:         0 kB
+Referenced:            0 kB
+Anonymous:             0 kB
 AnonHugePages:         0 kB
 Swap:                  0 kB
 KernelPageSize:        4 kB

Those are the 10MB I saw. For the record I also tried with malloc() and
the results are similar to those of posix_memalign().

Berto
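
A minimal standalone version of the test described above could look
like this. The block count and size follow the thread (2560 entries of
64k), the 512-byte alignment matches the values Berto mentions in a
later message, and the getchar() pause is only there so that
/proc/<pid>/smaps can be inspected before the process exits:

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>

  #define ENTRIES    2560     /* 128MB L2 cache + 32MB refcount cache */
  #define ENTRY_SIZE 65536    /* default cluster size */

  int main(void)
  {
      int i;

      for (i = 0; i < ENTRIES; i++) {
          void *p;
  #ifdef USE_MMAP
          /* One anonymous mapping per entry: no physical memory is
           * consumed until the pages are actually touched. */
          p = mmap(NULL, ENTRY_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (p == MAP_FAILED) {
              return 1;
          }
  #else
          /* One aligned allocation per entry, which is what the qcow2
           * cache currently does via posix_memalign(). */
          if (posix_memalign(&p, 512, ENTRY_SIZE) != 0) {
              return 1;
          }
  #endif
      }

      /* Keep the process alive so /proc/<pid>/smaps can be read */
      getchar();
      return 0;
  }

Building it once as-is and once with -DUSE_MMAP produces the two smaps
outputs being compared above.
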
Stefan Hajnoczi April 24, 2015, 9:26 a.m. UTC | #5
On Thu, Apr 23, 2015 at 01:50:28PM +0200, Alberto Garcia wrote:
> On Thu 23 Apr 2015 12:15:04 PM CEST, Stefan Hajnoczi wrote:
> 
> >> For a cache size of 128MB, the PSS is actually ~10MB larger without
> >> the patch, which seems to come from posix_memalign().
> >
> > Do you mean RSS or are you using a tool that reports a "PSS" number
> > that I don't know about?
> >
> > We should understand what is going on instead of moving the code
> > around to hide/delay the problem.
> 
> Both RSS and PSS ("proportional set size", also reported by the kernel).
> 
> I'm not an expert in memory allocators, but I measured the overhead like
> this:
> 
> An L2 cache of 128MB implies a refcount cache of 32MB, in total 160MB.
> With a default cluster size of 64k, that's 2560 cache entries.
> 
> So I wrote a test case that allocates 2560 blocks of 64k each using
> posix_memalign and mmap, and here's how their /proc/<pid>/smaps compare:
> 
> -Size:             165184 kB
> -Rss:               10244 kB
> -Pss:               10244 kB
> +Size:             161856 kB
> +Rss:                   0 kB
> +Pss:                   0 kB
>  Shared_Clean:          0 kB
>  Shared_Dirty:          0 kB
>  Private_Clean:         0 kB
> -Private_Dirty:     10244 kB
> -Referenced:        10244 kB
> -Anonymous:         10244 kB
> +Private_Dirty:         0 kB
> +Referenced:            0 kB
> +Anonymous:             0 kB
>  AnonHugePages:         0 kB
>  Swap:                  0 kB
>  KernelPageSize:        4 kB
> 
> Those are the 10MB I saw. For the record I also tried with malloc() and
> the results are similar to those of posix_memalign().

The posix_memalign() call wastes memory.  I compared:

  posix_memalign(&memptr, 65536, 2560 * 65536);
  memset(memptr, 0, 2560 * 65536);

with:

  for (i = 0; i < 2560; i++) {
      posix_memalign(&memptr, 65536, 65536);
      memset(memptr, 0, 65536);
  }

Here are the results:

-Size:             163920 kB
-Rss:              163860 kB
-Pss:              163860 kB
+Size:             337800 kB
+Rss:              183620 kB
+Pss:              183620 kB

Note the memset simulates a fully occupied cache.

The 19 MB RSS difference between the two seems wasteful.  The large
"Size" difference hints that the mmap pattern is very different when
posix_memalign() is called multiple times.

We could avoid the 19 MB overhead by switching to a single allocation.

What's more, dropping the memset() to simulate no cache entry
usage (like your example) gives us a grand total of 20 kB RSS.  There is
no point in delaying allocations if we do a single big upfront
allocation.

I'd prefer a patch that replaces the small allocations with a single big
one.  That's a win in both empty and full cache cases.

Stefan
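
For comparison, a rough sketch of the single-allocation idea inside the
current qcow2_cache_create(bs, num_tables). This is not the approach
taken by the patch in this thread, and the table_array field is
hypothetical:

  BDRVQcowState *s = bs->opaque;
  int i;

  /* Hypothetical extra field in Qcow2Cache: uint8_t *table_array,
   * holding all tables in one contiguous, aligned buffer. */
  c->table_array = qemu_try_blockalign(bs->file,
                                       (size_t) num_tables * s->cluster_size);
  if (c->table_array == NULL) {
      g_free(c->entries);
      g_free(c);
      return NULL;
  }

  for (i = 0; i < c->size; i++) {
      /* Every cache entry points into the single big buffer */
      c->entries[i].table = c->table_array + i * s->cluster_size;
  }

With a single allocation the per-allocation overhead measured above
disappears, and untouched tables still cost no physical memory.
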
Kevin Wolf April 24, 2015, 9:45 a.m. UTC | #6
Am 24.04.2015 um 11:26 hat Stefan Hajnoczi geschrieben:
> On Thu, Apr 23, 2015 at 01:50:28PM +0200, Alberto Garcia wrote:
> > On Thu 23 Apr 2015 12:15:04 PM CEST, Stefan Hajnoczi wrote:
> > 
> > >> For a cache size of 128MB, the PSS is actually ~10MB larger without
> > >> the patch, which seems to come from posix_memalign().
> > >
> > > Do you mean RSS or are you using a tool that reports a "PSS" number
> > > that I don't know about?
> > >
> > > We should understand what is going on instead of moving the code
> > > around to hide/delay the problem.
> > 
> > Both RSS and PSS ("proportional set size", also reported by the kernel).
> > 
> > I'm not an expert in memory allocators, but I measured the overhead like
> > this:
> > 
> > An L2 cache of 128MB implies a refcount cache of 32MB, in total 160MB.
> > With a default cluster size of 64k, that's 2560 cache entries.
> > 
> > So I wrote a test case that allocates 2560 blocks of 64k each using
> > posix_memalign and mmap, and here's how their /proc/<pid>/smaps compare:
> > 
> > -Size:             165184 kB
> > -Rss:               10244 kB
> > -Pss:               10244 kB
> > +Size:             161856 kB
> > +Rss:                   0 kB
> > +Pss:                   0 kB
> >  Shared_Clean:          0 kB
> >  Shared_Dirty:          0 kB
> >  Private_Clean:         0 kB
> > -Private_Dirty:     10244 kB
> > -Referenced:        10244 kB
> > -Anonymous:         10244 kB
> > +Private_Dirty:         0 kB
> > +Referenced:            0 kB
> > +Anonymous:             0 kB
> >  AnonHugePages:         0 kB
> >  Swap:                  0 kB
> >  KernelPageSize:        4 kB
> > 
> > Those are the 10MB I saw. For the record I also tried with malloc() and
> > the results are similar to those of posix_memalign().
> 
> The posix_memalign() call wastes memory.  I compared:
> 
>   posix_memalign(&memptr, 65536, 2560 * 65536);
>   memset(memptr, 0, 2560 * 65536);
> 
> with:
> 
>   for (i = 0; i < 2560; i++) {
>       posix_memalign(&memptr, 65536, 65536);
>       memset(memptr, 0, 65536);
>   }
> 
> Here are the results:
> 
> -Size:             163920 kB
> -Rss:              163860 kB
> -Pss:              163860 kB
> +Size:             337800 kB
> +Rss:              183620 kB
> +Pss:              183620 kB
> 
> Note the memset simulates a fully occupied cache.
> 
> The 19 MB RSS difference between the two seems wasteful.  The large
> "Size" difference hints that the mmap pattern is very different when
> posix_memalign() is called multiple times.
> 
> We could avoid the 19 MB overhead by switching to a single allocation.
> 
> What's more is that dropping the memset() to simulate no cache entry
> usage (like your example) gives us a grand total of 20 kB RSS.  There is
> no point in delaying allocations if we do a single big upfront
> allocation.

Report a bug against glibc? 19 MB is certainly much more than is
required for metadata managing 2560 memory blocks. That's something like
8k per allocation.

> I'd prefer a patch that replaces the small allocations with a single big
> one.  That's a win in both empty and full cache cases.

Or in cases where the object size is stored anyway (like in the qcow2
cache), we could just directly use mmap() and avoid any memory
management overhead in glibc.

Kevin
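
Sketched outside of QEMU's own helpers, the per-entry mmap() idea could
look like this (Linux-specific; the function names are illustrative):

  #include <sys/mman.h>

  static void *cache_entry_alloc(size_t cluster_size)
  {
      /* Anonymous private mapping: page-aligned, and no physical
       * memory is consumed until the table is actually written. */
      void *p = mmap(NULL, cluster_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      return p == MAP_FAILED ? NULL : p;
  }

  static void cache_entry_free(void *p, size_t cluster_size)
  {
      if (p != NULL) {
          munmap(p, cluster_size);
      }
  }

Since the qcow2 cache already knows the size of every table, no
per-allocation bookkeeping is needed, which is exactly the glibc
overhead seen in the measurements above.
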
Kevin Wolf April 24, 2015, 9:52 a.m. UTC | #7
Am 24.04.2015 um 11:26 hat Stefan Hajnoczi geschrieben:
> The posix_memalign() call wastes memory.  I compared:
> 
>   posix_memalign(&memptr, 65536, 2560 * 65536);
>   memset(memptr, 0, 2560 * 65536);
> 
> with:
> 
>   for (i = 0; i < 2560; i++) {
>       posix_memalign(&memptr, 65536, 65536);
>       memset(memptr, 0, 65536);
>   }

64k alignment is too much; in practice you need 512b or 4k, which
probably wastes a lot less memory.

But I just looked at the qcow2 cache code and you're right anyway.
Allocating one big block instead of many small allocations in a loop
looks like a good idea either way.

Kevin
Alberto Garcia April 24, 2015, 11:10 a.m. UTC | #8
On Fri 24 Apr 2015 11:52:14 AM CEST, Kevin Wolf <kwolf@redhat.com> wrote:

>> The posix_memalign() call wastes memory.  I compared:
>> 
>>   posix_memalign(&memptr, 65536, 2560 * 65536);
>>   memset(memptr, 0, 2560 * 65536);
>> 
>> with:
>> 
>>   for (i = 0; i < 2560; i++) {
>>       posix_memalign(&memptr, 65536, 65536);
>>       memset(memptr, 0, 65536);
>>   }
>
> 64k alignment is too much, in practice you need 512b or 4k, which
> probably wastes a lot less memory.

My tests were with 512b and 4k alignment, and the overhead was around
4k per allocation (hence 2560 * 4 = 10240 kB, the 10MB I was talking
about).

> But I just looked at the qcow2 cache code and you're right anyway.
> Allocating one big block instead of many small allocations in a loop
> looks like a good idea either way.

The problem is that I had the idea to make the cache dynamic.

Consider the scenario [A] <- [B], with a virtual size of 1TB and [B] a
newly created snapshot. The L2 cache size is 128MB for each image. You
read a lot of data from the disk and the cache of [A] starts to fill up
(at this point [B] is mostly empty, so all the data comes from [A]).
Then you start writing data into [B], and now its L2 cache starts to
fill up as well.

After a while you're going to have lots of cache entries in [A] that are
not needed anymore because now the data for those clusters is in [B].

I think it would be nice to have a way to free unused cache entries
after a while.

Berto
Stefan Hajnoczi April 24, 2015, 12:37 p.m. UTC | #9
On Fri, Apr 24, 2015 at 12:10 PM, Alberto Garcia <berto@igalia.com> wrote:
> On Fri 24 Apr 2015 11:52:14 AM CEST, Kevin Wolf <kwolf@redhat.com> wrote:
>
>>> The posix_memalign() call wastes memory.  I compared:
>>>
>>>   posix_memalign(&memptr, 65536, 2560 * 65536);
>>>   memset(memptr, 0, 2560 * 65536);
>>>
>>> with:
>>>
>>>   for (i = 0; i < 2560; i++) {
>>>       posix_memalign(&memptr, 65536, 65536);
>>>       memset(memptr, 0, 65536);
>>>   }
>>
>> 64k alignment is too much, in practice you need 512b or 4k, which
>> probably wastes a lot less memory.
>
> My tests were with 512b and 4k and the overhead was around 4k per page
> (hence 2560 * 4 = 10240, the 10MB I was talking about).
>
>> But I just looked at the qcow2 cache code and you're right anyway.
>> Allocating one big block instead of many small allocations in a loop
>> looks like a good idea either way.
>
> The problem is that I had the idea to make the cache dynamic.
>
> Consider the scenario [A] <- [B], with a virtual size of 1TB and [B] a
> newly created snapshot. The L2 cache size is 128MB for each image, you
> read a lot of data from the disk and the cache from [A] starts to fill
> up (at this point [B] is mostly empty so you get all the data from
> [A]). Then you start to write data into [B], and now its L2 cache starts
> to fill up as well.
>
> After a while you're going to have lots of cache entries in [A] that are
> not needed anymore because now the data for those clusters is in [B].
>
> I think it would be nice to have a way to free unused cache entries
> after a while.

Do you think mmap plus a periodic timer would work?

I'm hesitant about changes like this because they make QEMU more
complex, slow down the guest, and make the memory footprint volatile.
But if a simple solution addresses the problem, then I'd be happy.

Stefan
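
The "periodic timer" part could be wired up with QEMU's timer API
roughly as follows; the cache_clean_timer and cache_clean_interval
fields and the qcow2_cache_clean_unused() helper are hypothetical, not
something that exists at this point:

  static void cache_clean_timer_cb(void *opaque)
  {
      BlockDriverState *bs = opaque;
      BDRVQcowState *s = bs->opaque;

      /* Hypothetical helper that drops entries not used since the
       * previous tick (e.g. via madvise(MADV_DONTNEED)) */
      qcow2_cache_clean_unused(bs, s->l2_table_cache);
      qcow2_cache_clean_unused(bs, s->refcount_block_cache);

      /* Re-arm the timer; the interval would come from a new
       * user-visible option (timer created with timer_new_ms()) */
      timer_mod(s->cache_clean_timer,
                qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
                s->cache_clean_interval * 1000);
  }
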
Alberto Garcia April 24, 2015, 12:50 p.m. UTC | #10
On Fri 24 Apr 2015 02:37:21 PM CEST, Stefan Hajnoczi wrote:

>> I think it would be nice to have a way to free unused cache entries
>> after a while.
>
> Do you think mmap plus a periodic timer would work?
>
> I'm hesitant about changes like this because they make QEMU more
> complex, slow down the guest, and make the memory footprint volatile.
> But if a simple solution addresses the problem, then I'd be happy.

I understand. One possible approach would be to use the timer only if
the cache size requested by the user is large, so

 1) we only try to save memory when the amount of memory to save is
    significant.
 2) we don't make the default scenario slower.

I'll try to see if I can come up with a good solution (and with more
numbers).

Berto
Kevin Wolf April 24, 2015, 1:04 p.m. UTC | #11
Am 24.04.2015 um 14:50 hat Alberto Garcia geschrieben:
> On Fri 24 Apr 2015 02:37:21 PM CEST, Stefan Hajnoczi wrote:
> 
> >> I think it would be nice to have a way to free unused cache entries
> >> after a while.
> >
> > Do you think mmap plus a periodic timer would work?
> >
> > I'm hesitant about changes like this because they make QEMU more
> > complex, slow down the guest, and make the memory footprint volatile.
> > But if a simple solution addresses the problem, then I'd be happy.
> 
> I understand. One possible approach would be to use the timer only if
> the cache size requested by the user is large, so
> 
>  1) we only try to save memory when the amount of memory to save is
>     significant.
>  2) we don't make the default scenario slower.
> 
> I'll try to see if I can come up with a good solution (and with more
> numbers).

I suspect that at this point, Anthony would have reminded us about
separation of policy and mechanism. The very least we need is some
option to control the policy (probably defaulting to no automatic cache
size changes).

The other option would be to let the management tool change the cache
size options at runtime (and as it happens, my current bdrv_reopen()
overhaul allows just that - except that a QMP interface isn't planned
yet), but I'm not sure if it has enough information to make good
decisions. If we know what the criteria are, they could possibly be
exposed, though.

Kevin
Alberto Garcia May 8, 2015, 9 a.m. UTC | #12
On Fri 24 Apr 2015 03:04:06 PM CEST, Kevin Wolf <kwolf@redhat.com> wrote:

>> >> I think it would be nice to have a way to free unused cache
>> >> entries after a while.
>> >
>> > Do you think mmap plus a periodic timer would work?
>> >
>> > I'm hesitant about changes like this because they make QEMU more
>> > complex, slow down the guest, and make the memory footprint
>> > volatile.  But if a simple solution addresses the problem, then I'd
>> > be happy.
>> 
>> I understand. One possible approach would be to use the timer only if
>> the cache size requested by the user is large, so
>> 
>>  1) we only try to save memory when the amount of memory to save is
>>     significant.
>>  2) we don't make the default scenario slower.
>> 
>> I'll try to see if I can come up with a good solution (and with more
>> numbers).
>
> I suspect that at this point, Anthony would have reminded us about
> separation of policy and mechanism. The very least we need is some
> option to control the policy (probably defaulting to no automatic
> cache size changes).

There's also the problem that with a single chunk of memory for all
cache tables it's not so easy to free individual entries.

One possible solution would be madvise() with MADV_DONTNEED, and that's
rather simple to use, but I'm unsure about its portability.

> The other option would be to let the management tool change the cache
> size options at runtime (and as it happens, my current bdrv_reopen()
> overhaul allows just that - except that a QMP interface isn't planned
> yet), but I'm not sure if it has enough information to make good
> decisions. If we know what the criteria are, they could possibly be
> exposed, though.

But what would you do there? Recreate the whole cache?

Berto
Kevin Wolf May 8, 2015, 9:47 a.m. UTC | #13
Am 08.05.2015 um 11:00 hat Alberto Garcia geschrieben:
> On Fri 24 Apr 2015 03:04:06 PM CEST, Kevin Wolf <kwolf@redhat.com> wrote:
> 
> >> >> I think it would be nice to have a way to free unused cache
> >> >> entries after a while.
> >> >
> >> > Do you think mmap plus a periodic timer would work?
> >> >
> >> > I'm hesitant about changes like this because they make QEMU more
> >> > complex, slow down the guest, and make the memory footprint
> >> > volatile.  But if a simple solution addresses the problem, then I'd
> >> > be happy.
> >> 
> >> I understand. One possible approach would be to use the timer only if
> >> the cache size requested by the user is large, so
> >> 
> >>  1) we only try to save memory when the amount of memory to save is
> >>     significant.
> >>  2) we don't make the default scenario slower.
> >> 
> >> I'll try to see if I can come up with a good solution (and with more
> >> numbers).
> >
> > I suspect that at this point, Anthony would have reminded us about
> > separation of policy and mechanism. The very least we need is some
> > option to control the policy (probably defaulting to no automatic
> > cache size changes).
> 
> There's also the problem that with a single chunk of memory for all
> cache tables it's not so easy to free individual entries.

That's one of the reasons why I suggested fixing the problem by using
mmap() for the individual entries rather than allocating one big chunk.

> One possible solution would be madvise() with MADV_DONTNEED, and that's
> rather simple to use, but I'm unsure about its portability.

Probably not portable at all, but the worst that can happen is that a
cache doesn't shrink, i.e. behaves like today. Seems good enough.

> > The other option would be to let the management tool change the cache
> > size options at runtime (and as it happens, my current bdrv_reopen()
> > overhaul allows just that - except that a QMP interface isn't planned
> > yet), but I'm not sure if it has enough information to make good
> > decisions. If we know what the criteria are, they could possibly be
> > exposed, though.
> 
> But what would you do there? Recreate the whole cache?

Recreating the cache is the easiest solution (and it is what my reopen
patches do), and the overhead is probably negligible as long as you
don't resize the cache all the time.

If we really need it, resizing the cache while maintaining its content
would be possible, too. Extending shouldn't be a problem at all (with
your patches you'd have to rehash the entries, but that's it), and for
shrinking you would have to evict some entries (based on the algorithm
that is used in other cases as well). Neither sounds particularly
complicated, it's just some work that needs to be done if we ever get a
requirement like this.

Kevin
Alberto Garcia May 8, 2015, 11:47 a.m. UTC | #14
On Fri 08 May 2015 11:47:00 AM CEST, Kevin Wolf wrote:

>> There's also the problem that with a single chunk of memory for all
>> cache tables it's not so easy to free individual entries.
>
> That's one of the reasons why I suggested to fix the problem by using
> mmap() for the individual entries rather than allocating one big
> chunk.

Yeah, but how portable is mmap()? Does it work on Windows?

>> One possible solution would be madvise() with MADV_DONTNEED, and
>> that's rather simple to use, but I'm unsure about its portability.
>
> Probably not portable at all, but the worst that can happen is that a
> cache doesn't shrink, i.e. behaves like today. Seems good enough.

We actually have qemu_madvise() and QEMU_MADV_DONTNEED, so if the system
does not support the operation then it's just a no-op. However, that
means that if we expose some sort of API to configure the cache
expiration time, it would only work on some platforms.

>> > The other option would be to let the management tool change the cache
>> > size options at runtime (and as it happens, my current bdrv_reopen()
>> > overhaul allows just that - except that a QMP interface isn't planned
>> > yet), but I'm not sure if it has enough information to make good
>> > decisions. If we know what the criteria are, they could possibly be
>> > exposed, though.
>> 
>> But what would you do there? Recreate the whole cache?
>
> Recreating the cache is the easiest solution (and it is what my reopen
> patches do) and the overhead is probably neglegible as long as you
> don't resize the cache all the time.
>
> If we really need it, resizing the cache while maintaining its content
> would be possible, too.

I agree that the overhead of recreating the cache is probably
negligible, so there's no need to make it more complex.

But as you say, I don't think it's easy for the management tool to make
good decisions. If we have to expose the relevant information so that
it can, I think we would end up with an unnecessarily complex solution.

A simple way to set the expiration time of cache entries seems like a
much easier solution, both from the implementation and the user point
of view.

Berto
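
As a standalone illustration of what MADV_DONTNEED does to a buffer
that has already been populated (Linux-specific; the size follows the
160MB figure from the thread, and the pauses are only there so that RSS
can be checked from another terminal):

  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  #define CACHE_SIZE ((size_t)2560 * 65536)   /* 160MB */

  int main(void)
  {
      unsigned char *buf = mmap(NULL, CACHE_SIZE, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (buf == MAP_FAILED) {
          return 1;
      }

      /* Touch every page to simulate a fully populated cache */
      memset(buf, 1, CACHE_SIZE);
      puts("cache populated, check RSS (grep VmRSS /proc/<pid>/status)");
      getchar();

      /* What an expiration pass could do for stale entries: the mapping
       * stays valid, but the pages are returned to the kernel and read
       * back as zeroes if they are ever touched again. */
      madvise(buf, CACHE_SIZE, MADV_DONTNEED);
      puts("after madvise(MADV_DONTNEED), check RSS again");
      getchar();

      return 0;
  }
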

Patch

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index b115549..33f2561 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -42,37 +42,19 @@  struct Qcow2Cache {
     bool                    depends_on_flush;
 };
 
-Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables)
+Qcow2Cache *qcow2_cache_create(int num_tables)
 {
-    BDRVQcowState *s = bs->opaque;
     Qcow2Cache *c;
-    int i;
 
     c = g_new0(Qcow2Cache, 1);
     c->size = num_tables;
     c->entries = g_try_new0(Qcow2CachedTable, num_tables);
     if (!c->entries) {
-        goto fail;
-    }
-
-    for (i = 0; i < c->size; i++) {
-        c->entries[i].table = qemu_try_blockalign(bs->file, s->cluster_size);
-        if (c->entries[i].table == NULL) {
-            goto fail;
-        }
+        g_free(c);
+        c = NULL;
     }
 
     return c;
-
-fail:
-    if (c->entries) {
-        for (i = 0; i < c->size; i++) {
-            qemu_vfree(c->entries[i].table);
-        }
-    }
-    g_free(c->entries);
-    g_free(c);
-    return NULL;
 }
 
 int qcow2_cache_destroy(BlockDriverState* bs, Qcow2Cache *c)
@@ -151,6 +133,7 @@  static int qcow2_cache_entry_flush(BlockDriverState *bs, Qcow2Cache *c, int i)
         BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE);
     }
 
+    assert(c->entries[i].table != NULL);
     ret = bdrv_pwrite(bs->file, c->entries[i].offset, c->entries[i].table,
         s->cluster_size);
     if (ret < 0) {
@@ -227,6 +210,8 @@  int qcow2_cache_empty(BlockDriverState *bs, Qcow2Cache *c)
 
     for (i = 0; i < c->size; i++) {
         assert(c->entries[i].ref == 0);
+        qemu_vfree(c->entries[i].table);
+        c->entries[i].table = NULL;
         c->entries[i].offset = 0;
         c->entries[i].cache_hits = 0;
     }
@@ -296,6 +281,13 @@  static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c,
         return ret;
     }
 
+    if (!c->entries[i].table) {
+        c->entries[i].table = qemu_try_blockalign(bs->file, s->cluster_size);
+        if (!c->entries[i].table) {
+            return -ENOMEM;
+        }
+    }
+
     trace_qcow2_cache_get_read(qemu_coroutine_self(),
                                c == s->l2_table_cache, i);
     c->entries[i].offset = 0;
@@ -343,6 +335,7 @@  int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table)
 {
     int i;
 
+    assert(*table != NULL);
     for (i = 0; i < c->size; i++) {
         if (c->entries[i].table == *table) {
             goto found;
diff --git a/block/qcow2.c b/block/qcow2.c
index 316a8db..c3fd072 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -831,8 +831,8 @@  static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
     }
 
     /* alloc L2 table/refcount block cache */
-    s->l2_table_cache = qcow2_cache_create(bs, l2_cache_size);
-    s->refcount_block_cache = qcow2_cache_create(bs, refcount_cache_size);
+    s->l2_table_cache = qcow2_cache_create(l2_cache_size);
+    s->refcount_block_cache = qcow2_cache_create(refcount_cache_size);
     if (s->l2_table_cache == NULL || s->refcount_block_cache == NULL) {
         error_setg(errp, "Could not allocate metadata caches");
         ret = -ENOMEM;
diff --git a/block/qcow2.h b/block/qcow2.h
index 422b825..62e38f7 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -571,7 +571,7 @@  void qcow2_free_snapshots(BlockDriverState *bs);
 int qcow2_read_snapshots(BlockDriverState *bs);
 
 /* qcow2-cache.c functions */
-Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables);
+Qcow2Cache *qcow2_cache_create(int num_tables);
 int qcow2_cache_destroy(BlockDriverState* bs, Qcow2Cache *c);
 
 void qcow2_cache_entry_mark_dirty(Qcow2Cache *c, void *table);