diff mbox

[v5,07/12] qemu-img: Empty images after commit

Message ID 1397771992-31126-8-git-send-email-mreitz@redhat.com
State New
Headers show

Commit Message

Max Reitz April 17, 2014, 9:59 p.m. UTC
After the top image has been committed into an image in its backing
chain, all images above that base image should be emptied to restore the
old qemu-img commit behavior.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 qemu-img.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 84 insertions(+), 3 deletions(-)

Comments

Eric Blake April 22, 2014, 3:19 p.m. UTC | #1
On 04/17/2014 03:59 PM, Max Reitz wrote:
> After the top image has been committed into an image in its backing
> chain, all images above that base image should be emptied to restore the
> old qemu-img commit behavior.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  qemu-img.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 84 insertions(+), 3 deletions(-)

Does emptying an image take significant time?  If so, does that need to
be reflected in the progress meter?
Max Reitz April 22, 2014, 4:22 p.m. UTC | #2
On 22.04.2014 17:19, Eric Blake wrote:
> On 04/17/2014 03:59 PM, Max Reitz wrote:
>> After the top image has been committed into an image in its backing
>> chain, all images above that base image should be emptied to restore the
>> old qemu-img commit behavior.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   qemu-img.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 84 insertions(+), 3 deletions(-)
> Does emptying an image take significant time?  If so, does that need to
> be reflected in the progress meter?

For a 16 GB image I have here (should be nearly full) it took 1:22 min. 
Copying it took six minutes, so I guess committing it would take even 
more. I think the ratio is small enough not to include it in the 
progress meter.

Furthermore, I don't see a reasonable implementation for a make_empty 
progress output: As a general implementation for all image formats 
implementing discard will not work (the qcow2 implementation clearly 
states that discarded sectors should read back as zero), we would need 
some way of returning the progress in every format-specific make_empty 
implementation. This would probably require callbacks which I don't see 
feasible.

An in my opinion more reasonable approach would be to speed up the 
implementation of make_empty for qcow2 - as qcow does it, we could just 
create a new empty image (that is, rewrite the L1 and refcount table and 
truncate the image appropriately) instead of discarding every cluster in 
the existing file. This would be much faster, but the implementation 
would not be as nice and easy, so I chose not to implement it. I guess 
I'll have a second look, then. ;-)

Max
Max Reitz April 22, 2014, 5:18 p.m. UTC | #3
On 22.04.2014 18:22, Max Reitz wrote:
> On 22.04.2014 17:19, Eric Blake wrote:
>> On 04/17/2014 03:59 PM, Max Reitz wrote:
>>> After the top image has been committed into an image in its backing
>>> chain, all images above that base image should be emptied to restore 
>>> the
>>> old qemu-img commit behavior.
>>>
>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>> ---
>>>   qemu-img.c | 87 
>>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>   1 file changed, 84 insertions(+), 3 deletions(-)
>> Does emptying an image take significant time?  If so, does that need to
>> be reflected in the progress meter?
>
> For a 16 GB image I have here (should be nearly full) it took 1:22 
> min. Copying it took six minutes, so I guess committing it would take 
> even more. I think the ratio is small enough not to include it in the 
> progress meter.
>
> Furthermore, I don't see a reasonable implementation for a make_empty 
> progress output: As a general implementation for all image formats 
> implementing discard will not work (the qcow2 implementation clearly 
> states that discarded sectors should read back as zero), we would need 
> some way of returning the progress in every format-specific make_empty 
> implementation. This would probably require callbacks which I don't 
> see feasible.
>
> An in my opinion more reasonable approach would be to speed up the 
> implementation of make_empty for qcow2 - as qcow does it, we could 
> just create a new empty image (that is, rewrite the L1 and refcount 
> table and truncate the image appropriately) instead of discarding 
> every cluster in the existing file. This would be much faster, but the 
> implementation would not be as nice and easy, so I chose not to 
> implement it. I guess I'll have a second look, then. ;-)

Okay, after having begun to implement this, I noticed that this approach 
will not work for images with internal snapshots as we probably want to 
keep those during make_empty. On the other hand, I guess few people will 
use internal and external snapshots at the same time; I guess I'll 
implement both paths then. If there are internal snapshots, fall back to 
the implementation discarding every clusters, if there aren't (which is 
more likely), recreate the image.

Max
Eric Blake April 22, 2014, 5:48 p.m. UTC | #4
On 04/22/2014 10:22 AM, Max Reitz wrote:
> On 22.04.2014 17:19, Eric Blake wrote:
>> On 04/17/2014 03:59 PM, Max Reitz wrote:
>>> After the top image has been committed into an image in its backing
>>> chain, all images above that base image should be emptied to restore the
>>> old qemu-img commit behavior.
>>>
>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>> ---
>>>   qemu-img.c | 87
>>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>   1 file changed, 84 insertions(+), 3 deletions(-)
>> Does emptying an image take significant time?  If so, does that need to
>> be reflected in the progress meter?
> 
> For a 16 GB image I have here (should be nearly full) it took 1:22 min.
> Copying it took six minutes, so I guess committing it would take even
> more. I think the ratio is small enough not to include it in the
> progress meter.
> 
> Furthermore, I don't see a reasonable implementation for a make_empty
> progress output: As a general implementation for all image formats
> implementing discard will not work (the qcow2 implementation clearly
> states that discarded sectors should read back as zero), we would need
> some way of returning the progress in every format-specific make_empty
> implementation. This would probably require callbacks which I don't see
> feasible.

Thinking about it more, the ability to empty an image after commit
should be optional, and exposed as an option for the user to control.

I can see at least reasons to use a commit action:

1. I have 'base <- active', made changes in active, and want the current
state of active to become my new base, but still want to preserve active
for making further changes.  In this situation, I _want_ the commit
action to empty active (so that it is once again the minimal size).

2. I have 'base <- snap' long enough for me to copy base somewhere else
as a form of snapshot management, but now want to get rid of the
temporary 'snap' file by committing the chain back down to just 'base',
making 'base' once again the active file.  In this situation, I don't
want to waste ANY time modifying 'snap', so the operation should end as
soon as base is finished.

Furthermore, consider what happens when committing across more than one
image, as in 'base <- snap1 <- snap2' all the way down to 'base'.  The
moment that I copy a cluster from 'snap2' into 'base' that has not also
been changed in 'snap1', I have already invalidated the 'snap1' image.
Or more concretely, using XX for any cluster that refers to the parent:

base:  AA BB CC DD
snap1: EE XX FF XX
snap2: GG HH XX XX

The guest sees: GG HH FF DD.  Someone looking at just snap1 sees: EE BB
FF DD, which matches data that the guest saw at some point in the past.

The commit operation, prior to the empty step, changes things to:

base:  GG HH FF DD
snap1: EE XX FF XX
snap2: GG HH XX XX

Someone looking at just snap1 NOW sees: EE HH FF DD, which is not
something that the guest ever saw.  We have made an inconsistent image.

If you then empty the entire chain, we have:

base:  GG HH FF DD
snap1: XX XX XX XX
snap2: XX XX XX XX

Someone looking at snap2 still sees GG HH FF DD, which means the guest
does not notice any difference on whether we stuck with snap2 as the
active, or whether we changed over to base as the new active.  Someone
looking at snap1 NOW sees GG HH FF DD, which is a guest-consistent view,
but is NOT what the guest saw at the point snap1 was created.

In other words, I'm arguing that the ONLY time that leaving the active
image empty after completing a commit is when we are committing ONLY one
file deep - the moment multiple files are involved, we have ALREADY
invalidated the intermediate files, so wasting time on them to make them
empty is pointless.  If I wanted my entire chain to be consistent, I
would have already done it in stages (commit snap2 into snap1, then
commit that into base).

Furthermore, isn't the ability to empty an image precisely what
'qemu-img rebase' already gives us?  That is, for my two use cases
above, use case 2 never wants to empty images, and use case 1 has only a
single image to empty.  But observe what happens if we do the commit of
'base <- active' into just 'base' while NOT emptying 'active', and then
follow it up with a command to do 'qemu-img rebase -b base active'.  The
point of the rebase operation is that any cluster in 'active' that
duplicates a cluster in 'base' is discarded from 'active' - which has
the net result of making 'active' empty.  Thus, for my use case 1 above,
the option to empty an image is merely syntactic sugar for doing a
commit without empty followed by a rebase onto the same base that we
just committed into.

> 
> An in my opinion more reasonable approach would be to speed up the
> implementation of make_empty for qcow2 - as qcow does it, we could just
> create a new empty image (that is, rewrite the L1 and refcount table and
> truncate the image appropriately) instead of discarding every cluster in
> the existing file. This would be much faster, but the implementation
> would not be as nice and easy, so I chose not to implement it. I guess
> I'll have a second look, then. ;-)

We can't throw away internal snapshots (I see you already figured that
out), but indeed, recreating an empty image is faster than explicitly
emptying cluster-by-cluster if we know there is nothing else to be lost.
Max Reitz April 22, 2014, 6:01 p.m. UTC | #5
On 22.04.2014 19:48, Eric Blake wrote:
> On 04/22/2014 10:22 AM, Max Reitz wrote:
>> On 22.04.2014 17:19, Eric Blake wrote:
>>> On 04/17/2014 03:59 PM, Max Reitz wrote:
>>>> After the top image has been committed into an image in its backing
>>>> chain, all images above that base image should be emptied to restore the
>>>> old qemu-img commit behavior.
>>>>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>>    qemu-img.c | 87
>>>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>>    1 file changed, 84 insertions(+), 3 deletions(-)
>>> Does emptying an image take significant time?  If so, does that need to
>>> be reflected in the progress meter?
>> For a 16 GB image I have here (should be nearly full) it took 1:22 min.
>> Copying it took six minutes, so I guess committing it would take even
>> more. I think the ratio is small enough not to include it in the
>> progress meter.
>>
>> Furthermore, I don't see a reasonable implementation for a make_empty
>> progress output: As a general implementation for all image formats
>> implementing discard will not work (the qcow2 implementation clearly
>> states that discarded sectors should read back as zero), we would need
>> some way of returning the progress in every format-specific make_empty
>> implementation. This would probably require callbacks which I don't see
>> feasible.
> Thinking about it more, the ability to empty an image after commit
> should be optional, and exposed as an option for the user to control.
>
> I can see at least reasons to use a commit action:
>
> 1. I have 'base <- active', made changes in active, and want the current
> state of active to become my new base, but still want to preserve active
> for making further changes.  In this situation, I _want_ the commit
> action to empty active (so that it is once again the minimal size).
>
> 2. I have 'base <- snap' long enough for me to copy base somewhere else
> as a form of snapshot management, but now want to get rid of the
> temporary 'snap' file by committing the chain back down to just 'base',
> making 'base' once again the active file.  In this situation, I don't
> want to waste ANY time modifying 'snap', so the operation should end as
> soon as base is finished.
>
> Furthermore, consider what happens when committing across more than one
> image, as in 'base <- snap1 <- snap2' all the way down to 'base'.  The
> moment that I copy a cluster from 'snap2' into 'base' that has not also
> been changed in 'snap1', I have already invalidated the 'snap1' image.
> Or more concretely, using XX for any cluster that refers to the parent:
>
> base:  AA BB CC DD
> snap1: EE XX FF XX
> snap2: GG HH XX XX
>
> The guest sees: GG HH FF DD.  Someone looking at just snap1 sees: EE BB
> FF DD, which matches data that the guest saw at some point in the past.
>
> The commit operation, prior to the empty step, changes things to:
>
> base:  GG HH FF DD
> snap1: EE XX FF XX
> snap2: GG HH XX XX
>
> Someone looking at just snap1 NOW sees: EE HH FF DD, which is not
> something that the guest ever saw.  We have made an inconsistent image.
>
> If you then empty the entire chain, we have:
>
> base:  GG HH FF DD
> snap1: XX XX XX XX
> snap2: XX XX XX XX
>
> Someone looking at snap2 still sees GG HH FF DD, which means the guest
> does not notice any difference on whether we stuck with snap2 as the
> active, or whether we changed over to base as the new active.  Someone
> looking at snap1 NOW sees GG HH FF DD, which is a guest-consistent view,
> but is NOT what the guest saw at the point snap1 was created.
>
> In other words, I'm arguing that the ONLY time that leaving the active
> image empty after completing a commit is when we are committing ONLY one
> file deep - the moment multiple files are involved, we have ALREADY
> invalidated the intermediate files, so wasting time on them to make them
> empty is pointless.  If I wanted my entire chain to be consistent, I
> would have already done it in stages (commit snap2 into snap1, then
> commit that into base).

You're right. Great, then I can drop this ugly list for keeping the 
backing chain.

> Furthermore, isn't the ability to empty an image precisely what
> 'qemu-img rebase' already gives us?  That is, for my two use cases
> above, use case 2 never wants to empty images, and use case 1 has only a
> single image to empty.  But observe what happens if we do the commit of
> 'base <- active' into just 'base' while NOT emptying 'active', and then
> follow it up with a command to do 'qemu-img rebase -b base active'.  The
> point of the rebase operation is that any cluster in 'active' that
> duplicates a cluster in 'base' is discarded from 'active' - which has
> the net result of making 'active' empty.  Thus, for my use case 1 above,
> the option to empty an image is merely syntactic sugar for doing a
> commit without empty followed by a rebase onto the same base that we
> just committed into.

I took a look into rebase because I wondered how it could discard 
sectors in a way they don't read back as zero but fall through to the 
backing file (which is what a generic implementation of 
bdrv_make_empty() would require and which I would have missed so far, if 
it existed). As far as I see it, rebase does not discard sectors, it 
only writes differing currently unallocated sectors.

And if I have missed something and rebase actually does discard sectors, 
it is still a rather expensive operation, as it has to compare the whole 
file basically to itself, in this case. commit leaving the image empty 
(after specifying the appropriate option, which should in my opinion 
default to true, as this is the current mode of operation) is not only 
syntactic sugar, but faster as well.

Max

>> An in my opinion more reasonable approach would be to speed up the
>> implementation of make_empty for qcow2 - as qcow does it, we could just
>> create a new empty image (that is, rewrite the L1 and refcount table and
>> truncate the image appropriately) instead of discarding every cluster in
>> the existing file. This would be much faster, but the implementation
>> would not be as nice and easy, so I chose not to implement it. I guess
>> I'll have a second look, then. ;-)
> We can't throw away internal snapshots (I see you already figured that
> out), but indeed, recreating an empty image is faster than explicitly
> emptying cluster-by-cluster if we know there is nothing else to be lost.
>
Eric Blake April 22, 2014, 7:28 p.m. UTC | #6
On 04/22/2014 12:01 PM, Max Reitz wrote:

>> Thinking about it more, the ability to empty an image after commit
>> should be optional, and exposed as an option for the user to control.
>>

>> In other words, I'm arguing that the ONLY time that leaving the active
>> image empty after completing a commit is when we are committing ONLY one
>> file deep - the moment multiple files are involved, we have ALREADY
>> invalidated the intermediate files, so wasting time on them to make them
>> empty is pointless.  If I wanted my entire chain to be consistent, I
>> would have already done it in stages (commit snap2 into snap1, then
>> commit that into base).
> 
> You're right. Great, then I can drop this ugly list for keeping the
> backing chain.

So sounds like we want something like the following modes, when starting
with a chain 'base <- mid <- top':

qemu-img commit top
  - existing behavior: top committed to mid, then top is emptied (on
assumption that client will still use top)

qemu-img commit -d top
  - new boolean flag -d (mnemonic for drop) which states that the user
intends to drop 'top' after the operation, and instead use 'mid' as the
new active image.  Bypasses the attempt to empty top.

qemu-img commit -b mid top
qemu-img commit -d -b mid top
  - use of -b in isolation always implies -d; because the user
explicitly specified the base, even if the base is only one deep, we
don't bother emptying top

qemu-img commit -b base top
qemu-img commit -d -b base top
  - again, use of -b implies -d; here, since the commit crosses
boundaries, mid will be corrupted, so there is no point in emptying top
or mid

> I took a look into rebase because I wondered how it could discard
> sectors in a way they don't read back as zero but fall through to the
> backing file (which is what a generic implementation of
> bdrv_make_empty() would require and which I would have missed so far, if
> it existed). As far as I see it, rebase does not discard sectors, it
> only writes differing currently unallocated sectors.

I haven't looked closely at what rebase is actually doing, to know which
of the two scenarios you described actually happens.  And given this
thread, maybe it's worth a future patch to give the user the option for
both behaviors, as it sounds like a use case could be made for each.
But obviously not for this patch series.

> 
> And if I have missed something and rebase actually does discard sectors,
> it is still a rather expensive operation, as it has to compare the whole
> file basically to itself, in this case. commit leaving the image empty
> (after specifying the appropriate option, which should in my opinion
> default to true, as this is the current mode of operation) is not only
> syntactic sugar, but faster as well.

Yes, I agree that doing the discard as part of the commit operation can
indeed be faster than doing it in two separate steps, even if the end
result is the same.  And I also agree that existing behavior of emptying
the image as the default when no other option is specified is the best
way to preserve command-line compatibility; so my proposal is merely
limited to adding a new boolean flag (I chose -d) to explicitly request
no emptying, as well as make the new -b option imply the -d flag.
Kevin Wolf April 23, 2014, 9:32 a.m. UTC | #7
Am 22.04.2014 um 18:22 hat Max Reitz geschrieben:
> On 22.04.2014 17:19, Eric Blake wrote:
> >On 04/17/2014 03:59 PM, Max Reitz wrote:
> >>After the top image has been committed into an image in its backing
> >>chain, all images above that base image should be emptied to restore the
> >>old qemu-img commit behavior.
> >>
> >>Signed-off-by: Max Reitz <mreitz@redhat.com>
> >>---
> >>  qemu-img.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> >>  1 file changed, 84 insertions(+), 3 deletions(-)
> >Does emptying an image take significant time?  If so, does that need to
> >be reflected in the progress meter?
> 
> For a 16 GB image I have here (should be nearly full) it took 1:22
> min. Copying it took six minutes, so I guess committing it would
> take even more. I think the ratio is small enough not to include it
> in the progress meter.

Did you check why it took that long? Sounds like we're issuing a lot of
independent discard requests instead of few big ones. Is the image
heavily fragmented?

> Furthermore, I don't see a reasonable implementation for a
> make_empty progress output: As a general implementation for all
> image formats implementing discard will not work (the qcow2
> implementation clearly states that discarded sectors should read
> back as zero)

We can always add new flags or a separate new callback if it improves
things.

Kevin
Max Reitz April 24, 2014, 2:54 p.m. UTC | #8
On 23.04.2014 11:32, Kevin Wolf wrote:
> Am 22.04.2014 um 18:22 hat Max Reitz geschrieben:
>> On 22.04.2014 17:19, Eric Blake wrote:
>>> On 04/17/2014 03:59 PM, Max Reitz wrote:
>>>> After the top image has been committed into an image in its backing
>>>> chain, all images above that base image should be emptied to restore the
>>>> old qemu-img commit behavior.
>>>>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>>   qemu-img.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>>   1 file changed, 84 insertions(+), 3 deletions(-)
>>> Does emptying an image take significant time?  If so, does that need to
>>> be reflected in the progress meter?
>> For a 16 GB image I have here (should be nearly full) it took 1:22
>> min. Copying it took six minutes, so I guess committing it would
>> take even more. I think the ratio is small enough not to include it
>> in the progress meter.
> Did you check why it took that long? Sounds like we're issuing a lot of
> independent discard requests instead of few big ones. Is the image
> heavily fragmented?

Indeed it is, judging from qemu-img map.

Max

>> Furthermore, I don't see a reasonable implementation for a
>> make_empty progress output: As a general implementation for all
>> image formats implementing discard will not work (the qcow2
>> implementation clearly states that discarded sectors should read
>> back as zero)
> We can always add new flags or a separate new callback if it improves
> things.
>
> Kevin
Kevin Wolf April 24, 2014, 4:01 p.m. UTC | #9
Am 24.04.2014 um 16:54 hat Max Reitz geschrieben:
> On 23.04.2014 11:32, Kevin Wolf wrote:
> >Am 22.04.2014 um 18:22 hat Max Reitz geschrieben:
> >>On 22.04.2014 17:19, Eric Blake wrote:
> >>>On 04/17/2014 03:59 PM, Max Reitz wrote:
> >>>>After the top image has been committed into an image in its backing
> >>>>chain, all images above that base image should be emptied to restore the
> >>>>old qemu-img commit behavior.
> >>>>
> >>>>Signed-off-by: Max Reitz <mreitz@redhat.com>
> >>>>---
> >>>>  qemu-img.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> >>>>  1 file changed, 84 insertions(+), 3 deletions(-)
> >>>Does emptying an image take significant time?  If so, does that need to
> >>>be reflected in the progress meter?
> >>For a 16 GB image I have here (should be nearly full) it took 1:22
> >>min. Copying it took six minutes, so I guess committing it would
> >>take even more. I think the ratio is small enough not to include it
> >>in the progress meter.
> >Did you check why it took that long? Sounds like we're issuing a lot of
> >independent discard requests instead of few big ones. Is the image
> >heavily fragmented?
> 
> Indeed it is, judging from qemu-img map.

I see. But even that should be handled by the discard caching mechanism
in qcow2, so that it should still merge most requests, even if it has to
issue a few separate ones. I think some more in-depth debugging wouldn't
hurt.

Kevin
diff mbox

Patch

diff --git a/qemu-img.c b/qemu-img.c
index 7ee791f..42616da 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -49,6 +49,11 @@  typedef enum OutputFormat {
     OFORMAT_HUMAN,
 } OutputFormat;
 
+typedef struct BackingList {
+    BlockDriverState *bs;
+    QSIMPLEQ_ENTRY(BackingList) list;
+} BackingList;
+
 /* Default to cache=writeback as data integrity is not important for qemu-tcg. */
 #define BDRV_O_FLAGS BDRV_O_CACHE_WB
 #define BDRV_DEFAULT_CACHE "writeback"
@@ -717,10 +722,13 @@  static int img_commit(int argc, char **argv)
 {
     int c, ret, flags;
     const char *filename, *fmt, *cache;
-    BlockDriverState *bs, *base_bs;
+    BlockDriverState *bs, *base_bs, *backing_bs;
     bool quiet = false;
     Error *local_err = NULL;
     CommonBlockJobCBInfo cbi;
+    BackingList *bl_element, *bl_next;
+    QSIMPLEQ_HEAD(, BackingList) backing_list =
+        QSIMPLEQ_HEAD_INITIALIZER(backing_list);
 
     fmt = NULL;
     cache = BDRV_DEFAULT_CACHE;
@@ -750,7 +758,7 @@  static int img_commit(int argc, char **argv)
     }
     filename = argv[optind++];
 
-    flags = BDRV_O_RDWR;
+    flags = BDRV_O_RDWR | BDRV_O_UNMAP;
     ret = bdrv_parse_cache_flags(cache, &flags);
     if (ret < 0) {
         error_report("Invalid cache option: %s", cache);
@@ -771,6 +779,32 @@  static int img_commit(int argc, char **argv)
         goto done;
     }
 
+    /* Build a list of the intermediate backing images in order to be able to
+     * empty them later; note that base_bs is included in this list although we
+     * actually want to empty bs instead. As bdrv_swap() will be called on both
+     * by the commit block job, it is however correct to use the pointer to
+     * base_bs in order to clear bs. */
+    backing_bs = bs;
+
+    do {
+        backing_bs = backing_bs->backing_hd;
+        assert(backing_bs);
+
+        bl_element = g_new(BackingList, 1);
+        bl_element->bs = backing_bs;
+
+        if (backing_bs != base_bs) {
+            /* Generally, the images should be emptied from top to bottom in
+             * order to keep them consistent even if one make_empty operation
+             * fails because it could empty an image only partially */
+            QSIMPLEQ_INSERT_TAIL(&backing_list, bl_element, list);
+        } else {
+            /* The final pointer in the backing chain will however later point
+             * to the image at the very top, so put it at the front instead */
+            QSIMPLEQ_INSERT_HEAD(&backing_list, bl_element, list);
+        }
+    } while (backing_bs != base_bs);
+
     cbi = (CommonBlockJobCBInfo){
         .errp = &local_err,
         .bs   = bs,
@@ -779,10 +813,57 @@  static int img_commit(int argc, char **argv)
     commit_active_start(bs, base_bs, 0, BLOCKDEV_ON_ERROR_REPORT,
                         common_block_job_cb, &cbi, &local_err);
     if (local_err) {
-        goto done;
+        goto free_backing_list;
+    }
+
+    /* The block job will swap base_bs and bs (which is not what we really want
+     * here, but okay) and unref bs (and subsequently all intermediate block
+     * devices). In order to be able to empty these images afterwards, increment
+     * the reference counter here preemptively. */
+    QSIMPLEQ_FOREACH(bl_element, &backing_list, list) {
+        bdrv_ref(bl_element->bs);
     }
 
     run_block_job(bs->job, &local_err);
+    if (local_err) {
+        goto unref_backing;
+    }
+
+    QSIMPLEQ_FOREACH(bl_element, &backing_list, list) {
+        if (bl_element->bs->drv->bdrv_make_empty) {
+            if (bl_element->bs->read_only) {
+                /* If this is an intermediate file in the backing chain between
+                 * the top and the bottom image, it has been opened implicitly
+                 * read-only; reopen it R/W for emptying */
+                ret = bdrv_reopen(bl_element->bs, flags, &local_err);
+                if (ret < 0 || local_err) {
+                    if (!local_err) {
+                        error_setg_errno(&local_err, -ret,
+                                         "Could not write to %s",
+                                         bl_element->bs->filename);
+                    }
+                    goto unref_backing;
+                }
+            }
+
+            ret = bl_element->bs->drv->bdrv_make_empty(bl_element->bs);
+            if (ret) {
+                error_setg_errno(&local_err, -ret, "Could not empty %s",
+                                 bl_element->bs->filename);
+                goto unref_backing;
+            }
+        }
+    }
+
+unref_backing:
+    QSIMPLEQ_FOREACH(bl_element, &backing_list, list) {
+        bdrv_unref(bl_element->bs);
+    }
+
+free_backing_list:
+    QSIMPLEQ_FOREACH_SAFE(bl_element, &backing_list, list, bl_next) {
+        g_free(bl_element);
+    }
 
 done:
     bdrv_unref(bs);