
[v3,13/18] block: introduce new filter driver: fleecing-hook

Message ID 20181001102928.20533-14-vsementsov@virtuozzo.com
State New
Series fleecing-hook driver for backup

Commit Message

Vladimir Sementsov-Ogievskiy Oct. 1, 2018, 10:29 a.m. UTC
The fleecing-hook filter performs the copy-before-write (CBW) operation. It
should be inserted above the active disk and has a target node for CBW, like
the following:

    +-------+
    | Guest |
    +---+---+
        |r,w
        v
    +---+-----------+  target   +---------------+
    | Fleecing hook |---------->| target(qcow2) |
    +---+-----------+   CBW     +---+-----------+
        |                           |
backing |r,w                        |
        v                           |
    +---+---------+      backing    |
    | Active disk |<----------------+
    +-------------+        r

The target's backing may point to the active disk (this should be set up
separately), which gives the fleecing scheme.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 qapi/block-core.json  |  22 +++-
 block/fleecing-hook.c | 298 ++++++++++++++++++++++++++++++++++++++++++
 block/Makefile.objs   |   2 +
 3 files changed, 320 insertions(+), 2 deletions(-)
 create mode 100644 block/fleecing-hook.c

Comments

Kevin Wolf Oct. 4, 2018, 12:44 p.m. UTC | #1
On 01.10.2018 12:29, Vladimir Sementsov-Ogievskiy wrote:
> Fleecing-hook filter does copy-before-write operation. It should be
> inserted above active disk and has a target node for CBW, like the
> following:
> 
>     +-------+
>     | Guest |
>     +---+---+
>         |r,w
>         v
>     +---+-----------+  target   +---------------+
>     | Fleecing hook |---------->| target(qcow2) |
>     +---+-----------+   CBW     +---+-----------+
>         |                           |
> backing |r,w                        |
>         v                           |
>     +---+---------+      backing    |
>     | Active disk |<----------------+
>     +-------------+        r
> 
> Target's backing may point to active disk (should be set up
> separately), which gives fleecing-scheme.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

This lacks an explanation why we need a specialised fleecing hook driver
rather than just a generic bdrv_backup_top block driver in analogy to
what commit and mirror are already doing.

In fact, if I'm reading the last patch of the series right, backup
doesn't even restrict the use of the fleecing-hook driver to actual
fleecing scenarios.

Maybe what doesn't feel right to me is just that it's a misnomer, and if
you rename it into bdrv_backup_top (and make it internal to the block
job), it is very close to what I actually have in mind?

Kevin
Vladimir Sementsov-Ogievskiy Oct. 4, 2018, 1:59 p.m. UTC | #2
04.10.2018 15:44, Kevin Wolf wrote:
> On 01.10.2018 12:29, Vladimir Sementsov-Ogievskiy wrote:
>> Fleecing-hook filter does copy-before-write operation. It should be
>> inserted above active disk and has a target node for CBW, like the
>> following:
>>
>>      +-------+
>>      | Guest |
>>      +---+---+
>>          |r,w
>>          v
>>      +---+-----------+  target   +---------------+
>>      | Fleecing hook |---------->| target(qcow2) |
>>      +---+-----------+   CBW     +---+-----------+
>>          |                           |
>> backing |r,w                        |
>>          v                           |
>>      +---+---------+      backing    |
>>      | Active disk |<----------------+
>>      +-------------+        r
>>
>> Target's backing may point to active disk (should be set up
>> separately), which gives fleecing-scheme.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> This lacks an explanation why we need a specialised fleecing hook driver
> rather than just a generic bdrv_backup_top block driver in analogy to
> what commit and mirror are already doing.
>
> In fact, if I'm reading the last patch of the series right, backup
> doesn't even restrict the use of the fleecing-hook driver to actual
> fleecing scenarios.
>
> Maybe what doesn't feel right to me is just that it's a misnomer, and if
> you rename it into bdrv_backup_top (and make it internal to the block
> job), it is very close to what I actually have in mind?
>
> Kevin

Hm.
1. assume we move to an internal bdrv_backup_top
2. backup(mode=none) becomes just a wrapper for append/drop of the
bdrv_backup_top node
3. it looks interesting to get rid of the empty (do-nothing) job and use
bdrv_backup_top directly.

I eventually want to create different backup schemes based on the fleecing
hook, for example:

     +-------+
     | Guest |
     +-------+
         |r,w
         v
     +---+-----------+  target   +---------------+  backup   +--------+
     | Fleecing hook +---------->+ fleecing-node +---------->+ target |
     +---+-----------+   CBW     +---+-----------+ (no hook) +--------+
         |                           |
 backing |r,w                        |
         v                           |
     +---+---------+      backing    |
     | Active disk +<----------------+
     +-------------+        r


This is needed for a slow NBD target, when we don't want to slow down
guest writes. Here backup(no hook) is a backup job without the hook /
write notifiers, as it actually copies from a static source.

Or, we can use mirror instead of backup, as mirror is asynchronous and
is faster than backup. We can even use mirror with write-blocking mode
(proposed by Max) and use something like a null BDS (but with backing)
instead of the qcow2 fleecing-node - this imitates the current backup
approach, but with mirror instead of backup.

Of course, we can use the old backup(sync=none) for all such schemes; I just
think that the architecture with a filter node is cleaner than the one with a
backup job, which looks the same but has an additional job:
     +-------+
     | Guest |
     +-------+
         |r,w
         v
     +---------------+  target   +---------------+  backup   +--------+
     |bdrv_backup_top+---------->+ fleecing-node +---------->+ target |
     +---------------+   CBW     +---+----------++ (no hook) +--------+
         |                           |          ^
 backing |r,w                        |          |
         v                           |          |
     +---+---------+      backing    |          |
     | Active disk +<----------------+          |
     +----------+--+        r                   |
                |                               |
                |           backup(sync=none)   |
                +-------------------------------+



Finally, the first picture looks nicer and has fewer entities (and I
didn't draw the target blk which backup creates, nor all the permissions).
Hmm, it also may be more difficult to set up permissions in the second
scheme, but I didn't dive into that. Max and I agreed that a separate
building brick which may be reused in different schemes is better than an
internal thing in backup, so I went this way. However, if you are
against it, it isn't difficult to move it all into backup.
Kevin Wolf Oct. 4, 2018, 2:52 p.m. UTC | #3
On 04.10.2018 15:59, Vladimir Sementsov-Ogievskiy wrote:
> 04.10.2018 15:44, Kevin Wolf wrote:
> > On 01.10.2018 12:29, Vladimir Sementsov-Ogievskiy wrote:
> >> Fleecing-hook filter does copy-before-write operation. It should be
> >> inserted above active disk and has a target node for CBW, like the
> >> following:
> >>
> >>      +-------+
> >>      | Guest |
> >>      +---+---+
> >>          |r,w
> >>          v
> >>      +---+-----------+  target   +---------------+
> >>      | Fleecing hook |---------->| target(qcow2) |
> >>      +---+-----------+   CBW     +---+-----------+
> >>          |                           |
> >> backing |r,w                        |
> >>          v                           |
> >>      +---+---------+      backing    |
> >>      | Active disk |<----------------+
> >>      +-------------+        r
> >>
> >> Target's backing may point to active disk (should be set up
> >> separately), which gives fleecing-scheme.
> >>
> >> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > This lacks an explanation why we need a specialised fleecing hook driver
> > rather than just a generic bdrv_backup_top block driver in analogy to
> > what commit and mirror are already doing.
> >
> > In fact, if I'm reading the last patch of the series right, backup
> > doesn't even restrict the use of the fleecing-hook driver to actual
> > fleecing scenarios.
> >
> > Maybe what doesn't feel right to me is just that it's a misnomer, and if
> > you rename it into bdrv_backup_top (and make it internal to the block
> > job), it is very close to what I actually have in mind?
> >
> > Kevin
> 
> Hm.
> 1. assume we move to internal bdrv_backup_top
> 2. backup(mode=none) becomes just a wrapper for append/drop of the 
> bdrv_backup_top node

I think you mean sync=none?

Yes, this is true. There is no actual background job taking place there,
so the job infrastructure doesn't add much. As you say, it's just
inserting the node at the start and dropping it again at the end.

> 3. looks interesting to get rid of empty (doing nothing) job and use 
> bdrv_backup_top directly.

We could directly make the filter node available for the user, like this
series does. Should we do that? I'm not sure, but I'm not necessarily
opposed either.

But looking at the big picture, I have some more thoughts on this:

1. Is backup with sync=none only useful for fleecing? My understanding
   was that "fleecing" specifically means a setup where the target of
   the backup node is an overlay of the active layer of the guest
   device.

   I can imagine other use cases that would use sync=none (e.g. if you
   don't access arbitrary blocks like from the NBD server in the
   fleecing setup, but directly write to a backup file that can be
   committed back later to revert things).

   So I think 'fleecing-hook' is too narrow as a name. Maybe just
   'backup' would be better?

2. mirror has a sync=none mode, too. And like backup, it doesn't
   actually have any background job running then (at least in active
   mirror mode), but only changes the graph at the end of the job.
   Some consistency would be nice there, so is the goal to eventually
   let the user create filter nodes for all jobs that don't have a
   real background job?

3. We have been thinking about unifying backup, commit and mirror
   into a single copy block job because they are doing quite similar
   things. Of course, there are differences whether the old data or the
   new data should be copied on a write, and which graph changes to make
   at the end of the job, but many of the other differences are actually
   features that would make sense in all of them, but are only
   implemented in one job driver.

   Maybe having a single 'copy' filter driver that provides options to
   select backup-like behaviour or mirror-like behaviour, and that can
   then internally be used by all three block jobs would be an
   interesting first step towards this?

   We can start with supporting only what backup needs, but design
   everything with the idea that mirror and commit could use it, too.

I honestly feel that at first this wouldn't be very different from what
you have, so with a few renames and cleanups we might be good. But it
would give us a design in the grand scheme to work towards instead of
doing one-off things for every special case like fleecing and ending up
with even more similar things that are implemented separately even
though they do mostly the same thing.

> I want to finally create different backup schemes, based on fleecing 
> hook, for example:
> 
>      +-------+
>      | Guest |
>      +-------+
>          |r,w
>          v
>      +---+-----------+  target   +---------------+ +--------+
>      | Fleecing hook +---------->+ fleecing-node +---------->+ target |
>      +---+-----------+   CBW     +---+-----------+ backup +--------+
>          |                           |             (no hook)
> backing |r,w                        |
>          v                           |
>      +---+---------+      backing    |
>      | Active disk +<----------------+
>      +-------------+        r
> 
> 
> This is needed for slow nbd target, if we don't need to slow down
> guest writes.  Here backup(no hook) is a backup job without hook /
> write notifiers, as it actually do copy from static source.

Right.

We don't actually have a backup without a hook yet (which would be the
same as the equally missing mirror for read-only nodes), but we do have
commit without a hook - it doesn't share the WRITE permission for the
source.  This is an example for a mode that a unified 'copy' driver
would automatically support.

> Or, we can use mirror instead of backup, as mirror is asynchronous and 
> is faster than backup. We can even use mirror with write-blocking mode 
> (proposed by Max) and use something like null bds (but with backing) 
> instead of qcow2 fleecing-node - this will imitate current backup 
> approach, but with mirror instead of backup.

To be honest, I don't understand the null BDS part. null throws away
whatever data is written to it, so that's certainly not what you want?

> Of course, we can use old backup(sync=none) for all such schemes, I just 
> think that architecture with filter node is more clean, than with backup 
> job, which looks the same but with additional job:
>      +-------+
>      | Guest |
>      +-------+
>          |r,w
>          v
>      +---------------+  target   +---------------+ +--------+
>      |bdrv_backup_top+---------->+ fleecing-node +---------->+ target |
>      +---------------+   CBW     +---+----------++ backup +--------+
>          |                           |          ^  (no hook)
> backing |r,w                        |          |
>          v                           |          |
>      +---+---------+      backing    |          |
>      | Active disk +<----------------+          |
>      +----------+--+        r                   |
>                 |                               |
>                 |           backup(sync=none)   |
>                 +-------------------------------+

This looks only more complex because you decided to draw the block job
into the graph, as an edge connecting source and target. In reality,
this is not an edge that would be existing because bdrv_backup_top
already has both nodes as children. The job wouldn't have an additional
reference, but just use the BdrvChild that is owned by bdrv_backup_top.

Maybe this is an interesting point for the decision between an
integrated filter driver in the jobs and completely separate filter
driver. The jobs probably need access to the internal data structure
(bs->opaque) of the filter node at least, so that they can issue
requests on the child nodes.

Of course, if it isn't an internal filter driver, but a proper
standalone driver, letting jobs use those child nodes might be
considered a bit ugly...

> Finally, the first picture looks nicer and has less entities (and I 
> didn't draw target blk which backup creates and all the permissions). 
> Hmm, it also may be more difficult to setup permissions in the second 
> scheme, but I didn't dive into. We just agreed with Max that separate 
> building brick which may be reused in different schemes is better than 
> internal thing in backup, so, I went this way. However, if you are 
> against, it isn't difficult to move it all into backup.

The idea with bdrv_backup_top would obviously be to get rid of the
additional BlockBackend and BdrvChild instances and only access source
and target as children of the filter node.

Kevin
Vladimir Sementsov-Ogievskiy Oct. 4, 2018, 9:19 p.m. UTC | #4
On 10/04/2018 05:52 PM, Kevin Wolf wrote:
> On 04.10.2018 15:59, Vladimir Sementsov-Ogievskiy wrote:
>> 04.10.2018 15:44, Kevin Wolf wrote:
>>> On 01.10.2018 12:29, Vladimir Sementsov-Ogievskiy wrote:
>>>> Fleecing-hook filter does copy-before-write operation. It should be
>>>> inserted above active disk and has a target node for CBW, like the
>>>> following:
>>>>
>>>>       +-------+
>>>>       | Guest |
>>>>       +---+---+
>>>>           |r,w
>>>>           v
>>>>       +---+-----------+  target   +---------------+
>>>>       | Fleecing hook |---------->| target(qcow2) |
>>>>       +---+-----------+   CBW     +---+-----------+
>>>>           |                           |
>>>> backing |r,w                        |
>>>>           v                           |
>>>>       +---+---------+      backing    |
>>>>       | Active disk |<----------------+
>>>>       +-------------+        r
>>>>
>>>> Target's backing may point to active disk (should be set up
>>>> separately), which gives fleecing-scheme.
>>>>
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> This lacks an explanation why we need a specialised fleecing hook driver
>>> rather than just a generic bdrv_backup_top block driver in analogy to
>>> what commit and mirror are already doing.
>>>
>>> In fact, if I'm reading the last patch of the series right, backup
>>> doesn't even restrict the use of the fleecing-hook driver to actual
>>> fleecing scenarios.
>>>
>>> Maybe what doesn't feel right to me is just that it's a misnomer, and if
>>> you rename it into bdrv_backup_top (and make it internal to the block
>>> job), it is very close to what I actually have in mind?
>>>
>>> Kevin
>> Hm.
>> 1. assume we move to internal bdrv_backup_top
>> 2. backup(mode=none) becomes just a wrapper for append/drop of the
>> bdrv_backup_top node
> I think you mean sync=none?
>
> Yes, this is true. There is no actual background job taking place there,
> so the job infrastructure doesn't add much. As you say, it's just
> inserting the node at the start and dropping it again at the end.
>
>> 3. looks interesting to get rid of empty (doing nothing) job and use
>> bdrv_backup_top directly.
> We could directly make the filter node available for the user, like this
> series does. Should we do that? I'm not sure, but I'm not necessarily
> opposed either.
>
> But looking at the big picture, I have some more thoughts on this:
>
> 1. Is backup with sync=none only useful for fleecing? My understanding
>     was that "fleecing" specifically means a setup where the target of
>     the backup node is an overlay of the active layer of the guest
>     device.
>
>     I can imagine other use cases that would use sync=none (e.g. if you
>     don't access arbitrary blocks like from the NBD server in the
>     fleecing setup, but directly write to a backup file that can be
>     commited back later to revert things).
>
>     So I think 'fleecing-hook' is too narrow as a name. Maybe just
>     'backup' would be better?

Maybe copy-before-write?

>
> 2. mirror has a sync=none mode, too. And like backup, it doesn't
>     actually have any background job running then (at least in active
>     mirror mode), but only changes the graph at the end of the job.
>     Some consistency would be nice there, so is the goal to eventually
>     let the user create filter nodes for all jobs that don't have a
>     real background job?
>
> 3. We have been thinking about unifying backup, commit and mirror
>     into a single copy block job because they are doing quite similar
>     things. Of course, there are differences whether the old data or the
>     new data should be copied on a write, and which graph changes to make
>     at the end of the job, but many of the other differences are actually
>     features that would make sense in all of them, but are only
>     implemented in one job driver.
>
>     Maybe having a single 'copy' filter driver that provides options to
>     select backup-like behaviour or mirror-like behaviour, and that can
>     then internally be used by all three block jobs would be an
>     interesting first step towards this?


Isn't it a question of having several simple things versus one
complicated one? :)
All these jobs are similar only in the fact that they copy blocks
from one point to another. So, instead of creating one big job with a
lot of options, we could separate the copying code into some kind of
internal copying API, and then share it between the jobs (and qemu-img
convert). Didn't you consider that way? Intuitively, I'm not a fan of the
idea of creating one job, but I don't insist. Of course, we could create
one job and carefully split the code into different object files, with
(again) a separate copying API shared with qemu-img, so the difference
between one-job vs several-jobs will be mostly in qapi/, not in the real
code...

>
>     We can start with supporting only what backup needs, but design
>     everything with the idea that mirror and commit could use it, too.
>
> I honestly feel that at first this wouldn't be very different from what
> you have, so with a few renames and cleanups we might be good. But it
> would give us a design in the grand scheme to work towards instead of
> doing one-off things for every special case like fleecing and ending up
> with even more similar things that are implemented separately even
> though they do mostly the same thing.
>
>> I want to finally create different backup schemes, based on fleecing
>> hook, for example:
>>
>>       +-------+
>>       | Guest |
>>       +-------+
>>           |r,w
>>           v
>>       +---+-----------+  target   +---------------+ +--------+
>>       | Fleecing hook +---------->+ fleecing-node +---------->+ target |
>>       +---+-----------+   CBW     +---+-----------+ backup +--------+
>>           |                           |             (no hook)
>> backing |r,w                        |
>>           v                           |
>>       +---+---------+      backing    |
>>       | Active disk +<----------------+
>>       +-------------+        r
>>
>>
>> This is needed for slow nbd target, if we don't need to slow down
>> guest writes.  Here backup(no hook) is a backup job without hook /
>> write notifiers, as it actually do copy from static source.
> Right.
>
> We don't actually have a backup without a hook yet (which would be the
> same as the equally missing mirror for read-only nodes), but we do have
> commit without a hook - it doesn't share the WRITE permission for the
> source.  This is an example for a mode that a unified 'copy' driver
> would automatically support.

I'm just afraid that the copy driver will be even more complicated than
mirror already is.
Mirror needs several iterations through the whole disk; the other jobs don't.

>
>> Or, we can use mirror instead of backup, as mirror is asynchronous and
>> is faster than backup. We can even use mirror with write-blocking mode
>> (proposed by Max) and use something like null bds (but with backing)
>> instead of qcow2 fleecing-node - this will imitate current backup
>> approach, but with mirror instead of backup.
> To be honest, I don't understand the null BDS part. null throws away
> whatever data is written to it, so that's certainly not what you want?

Exactly that. We don't need this data in the fleecing node, as active sync
mirror will copy it in-flight.

>
>> Of course, we can use old backup(sync=none) for all such schemes, I just
>> think that architecture with filter node is more clean, than with backup
>> job, which looks the same but with additional job:
>>       +-------+
>>       | Guest |
>>       +-------+
>>           |r,w
>>           v
>>       +---------------+  target   +---------------+ +--------+
>>       |bdrv_backup_top+---------->+ fleecing-node +---------->+ target |
>>       +---------------+   CBW     +---+----------++ backup +--------+
>>           |                           |          ^  (no hook)
>> backing |r,w                        |          |
>>           v                           |          |
>>       +---+---------+      backing    |          |
>>       | Active disk +<----------------+          |
>>       +----------+--+        r                   |
>>                  |                               |
>>                  |           backup(sync=none)   |
>>                  +-------------------------------+
> This looks only more complex because you decided to draw the block job
> into the graph, as an edge connecting source and target. In reality,
> this is not an edge that would be existing because bdrv_backup_top
> already has both nodes as children. The job wouldn't have an additional
> reference, but just use the BdrvChild that is owned by bdrv_backup_top.
>
> Maybe this is an interesting point for the decision between an
> integrated filter driver in the jobs and completely separate filter
> driver. The jobs probably need access to the internal data structure
> (bs->opaque) of the filter node at least, so that they can issue
> requests on the child nodes.
>
> Of course, if it isn't an internal filter driver, but a proper
> standalone driver, letting jobs use those child nodes might be
> considered a bit ugly...

Yes, in this case there should be different children sharing the same
target BDS.
Hmm, sharing a BdrvChild is a point in favor of the internal filter.

>
>> Finally, the first picture looks nicer and has less entities (and I
>> didn't draw target blk which backup creates and all the permissions).
>> Hmm, it also may be more difficult to setup permissions in the second
>> scheme, but I didn't dive into. We just agreed with Max that separate
>> building brick which may be reused in different schemes is better than
>> internal thing in backup, so, I went this way. However, if you are
>> against, it isn't difficult to move it all into backup.
> The idea with bdrv_backup_top would obviously be to get rid of the
> additional BlockBackend and BdrvChild instances and only access source
> and target as children of the filter node.
>
> Kevin

Ok. Let's start from an internal driver; anyway, it is a lot easier to turn
it from internal into external if needed than vice versa. I'll resend.
Most of the prerequisites will not change. Also, I'll need to share the
copy-bitmap between the two backup jobs, to avoid extra copying in the
picture above. And anyway, I want to try to get rid of backup's
intersecting requests, using serializing requests and the copy-bitmap
instead.
Vladimir Sementsov-Ogievskiy Oct. 5, 2018, 3 p.m. UTC | #5
05.10.2018 00:19, Vladimir Sementsov-Ogievskiy wrote:
On 10/04/2018 05:52 PM, Kevin Wolf wrote:
Am 04.10.2018 um 15:59 hat Vladimir Sementsov-Ogievskiy geschrieben:
04.10.2018 15:44, Kevin Wolf wrote:
Am 01.10.2018 um 12:29 hat Vladimir Sementsov-Ogievskiy geschrieben:
Fleecing-hook filter does copy-before-write operation. It should be
inserted above active disk and has a target node for CBW, like the
following:

      +-------+
      | Guest |
      +---+---+
          |r,w
          v
      +---+-----------+  target   +---------------+
      | Fleecing hook |---------->| target(qcow2) |
      +---+-----------+   CBW     +---+-----------+
          |                           |
backing |r,w                        |
          v                           |
      +---+---------+      backing    |
      | Active disk |<----------------+
      +-------------+        r

Target's backing may point to active disk (should be set up
separately), which gives fleecing-scheme.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com><mailto:vsementsov@virtuozzo.com>
This lacks an explanation why we need a specialised fleecing hook driver
rather than just a generic bdrv_backup_top block driver in analogy to
what commit and mirror are already doing.

In fact, if I'm reading the last patch of the series right, backup
doesn't even restrict the use of the fleecing-hook driver to actual
fleecing scenarios.

Maybe what doesn't feel right to me is just that it's a misnomer, and if
you rename it into bdrv_backup_top (and make it internal to the block
job), it is very close to what I actually have in mind?

Kevin
Hm.
1. assume we move to internal bdrv_backup_top
2. backup(mode=none) becomes just a wrapper for append/drop of the
bdrv_backup_top node
I think you mean sync=none?

Yes, this is true. There is no actual background job taking place there,
so the job infrastructure doesn't add much. As you say, it's just
inserting the node at the start and dropping it again at the end.

3. looks interesting to get rid of empty (doing nothing) job and use
bdrv_backup_top directly.
We could directly make the filter node available for the user, like this
series does. Should we do that? I'm not sure, but I'm not necessarily
opposed either.

But looking at the big picture, I have some more thoughts on this:

1. Is backup with sync=none only useful for fleecing? My understanding
    was that "fleecing" specifically means a setup where the target of
    the backup node is an overlay of the active layer of the guest
    device.

    I can imagine other use cases that would use sync=none (e.g. if you
    don't access arbitrary blocks like from the NBD server in the
    fleecing setup, but directly write to a backup file that can be
    commited back later to revert things).

    So I think 'fleecing-hook' is too narrow as a name. Maybe just
    'backup' would be better?

may be copy-before-write?


2. mirror has a sync=none mode, too. And like backup, it doesn't
    actually have any background job running then (at least in active
    mirror mode), but only changes the graph at the end of the job.
    Some consistency would be nice there, so is the goal to eventually
    let the user create filter nodes for all jobs that don't have a
    real background job?

3. We have been thinking about unifying backup, commit and mirror
    into a single copy block job because they are doing quite similar
    things. Of course, there are differences whether the old data or the
    new data should be copied on a write, and which graph changes to make
    at the end of the job, but many of the other differences are actually
    features that would make sense in all of them, but are only
    implemented in one job driver.

    Maybe having a single 'copy' filter driver that provides options to
    select backup-like behaviour or mirror-like behaviour, and that can
    then internally be used by all three block jobs would be an
    interesting first step towards this?


Isn't it a question about having several simple things against one complicated?)
All these jobs are similar only in the fact that they are copying blocks from one point to another.. So, instead of creating one big job with a lot of options, we can separate copying code to some kind of internal copying api, to then share it between jobs (and qemi-img convert). Didn't you considered this way? Intuitively, I'm not a fan of idea to create one job, but I don't insist. Of course, we can create one job carefully split the code to different objects files, with (again) separate copying api, shared with qemu-img, so, difference between one-job vs several-jobs will be mostly in qapi/ , not in the real code...


    We can start with supporting only what backup needs, but design
    everything with the idea that mirror and commit could use it, too.

I honestly feel that at first this wouldn't be very different from what
you have, so with a few renames and cleanups we might be good. But it
would give us a design in the grand scheme to work towards instead of
doing one-off things for every special case like fleecing and ending up
with even more similar things that are implemented separately even
though they do mostly the same thing.

I want to finally create different backup schemes, based on fleecing
hook, for example:

    +-------+
    | Guest |
    +---+---+
        |r,w
        v
    +---+-----------+  target   +---------------+  backup   +--------+
    | Fleecing hook |---------->| fleecing-node |---------->| target |
    +---+-----------+   CBW     +---+-----------+ (no hook) +--------+
        |                           |
backing |r,w                        |
        v                           |
    +---+---------+      backing    |
    | Active disk |<----------------+
    +-------------+        r


This is needed for a slow NBD target, if we don't want to slow down
guest writes.  Here backup(no hook) is a backup job without a hook /
write notifiers, as it actually copies from a static source.
Right.

We don't actually have a backup without a hook yet (which would be the
same as the equally missing mirror for read-only nodes), but we do have
commit without a hook - it doesn't share the WRITE permission for the
source.  This is an example for a mode that a unified 'copy' driver
would automatically support.

I'm just afraid that a copy driver would be even more complicated than mirror already is.
Mirror needs several iterations through the whole disk; the other jobs don't.


Or, we can use mirror instead of backup, as mirror is asynchronous and
faster than backup. We can even use mirror in write-blocking mode
(proposed by Max) and use something like a null BDS (but with backing)
instead of the qcow2 fleecing-node - this would imitate the current
backup approach, but with mirror instead of backup.
To be honest, I don't understand the null BDS part. null throws away
whatever data is written to it, so that's certainly not what you want?

Exactly that. We don't need this data in the fleecing node, as the active sync mirror will copy it in-flight.


Of course, we can use the old backup(sync=none) for all such schemes; I just
think that the architecture with a filter node is cleaner than the one with a
backup job, which looks the same but with an additional job:
    +-------+
    | Guest |
    +---+---+
        |r,w
        v
    +---------------+  target   +---------------+  backup   +--------+
    |bdrv_backup_top|---------->| fleecing-node |---------->| target |
    +---+-----------+   CBW     +---+--------+--+ (no hook) +--------+
        |                           |        ^
backing |r,w                        |        |
        v                           |        |
    +---+---------+      backing    |        |
    | Active disk |<----------------+        |
    +---------+---+        r                 |
              |                              |
              |       backup(sync=none)      |
              +------------------------------+
This looks only more complex because you decided to draw the block job
into the graph, as an edge connecting source and target. In reality,
this is not an edge that would be existing because bdrv_backup_top
already has both nodes as children. The job wouldn't have an additional
reference, but just use the BdrvChild that is owned by bdrv_backup_top.

Maybe this is an interesting point for the decision between an
integrated filter driver in the jobs and completely separate filter
driver. The jobs probably need access to the internal data structure
(bs->opaque) of the filter node at least, so that they can issue
requests on the child nodes.

Of course, if it isn't an internal filter driver, but a proper
standalone driver, letting jobs use those child nodes might be
considered a bit ugly...

Yes, in this case there would be different children sharing the same target BDS..
Hmm, sharing a BdrvChild is a point in favor of an internal filter.

Hmm, how to share children?

The backup job has two source BdrvChild'ren - child_job and child_root of the job's blk - and two target BdrvChild'ren - again, child_job and child_root.

backup_top has a source child - child_backing - and a second one - child_file (named "target")..

Which BdrvChild'ren do you suggest to remove? They are all different.

I don't know why the job needs both the unnamed blk's and the child_job's, and I don't know whether it is necessary for backup to use blk's rather than BdrvChild'ren..

And with the internal way, in none-mode we'd have two unused blk's and four unused BdrvChild'ren.. Or do we want to rewrite backup to use BdrvChild'ren for I/O operations and drop the child_job BdrvChild'ren? So I'm lost. What did you mean?



Finally, the first picture looks nicer and has fewer entities (and I
didn't draw the target blk which backup creates, nor all the permissions).
Hmm, it may also be more difficult to set up permissions in the second
scheme, but I didn't dive into that. We just agreed with Max that a separate
building brick which may be reused in different schemes is better than an
internal thing in backup, so I went this way. However, if you are
against it, it isn't difficult to move it all into backup.
The idea with bdrv_backup_top would obviously be to get rid of the
additional BlockBackend and BdrvChild instances and only access source
and target as children of the filter node.

Kevin

Ok. Let's start with an internal driver; anyway, it is a lot easier to turn internal into external if needed than vice versa. I'll resend. Most of the prerequisites will not change. Also, I'll anyway need to share the copy-bitmap between two backup jobs, to avoid extra copying in the picture above. And anyway, I want to try to get rid of backup's intersecting-request tracking, using serializing requests and the copy-bitmap instead.




--
Best regards,
Vladimir
Kevin Wolf Oct. 5, 2018, 3:52 p.m. UTC | #6
Hi Vladimir,

can you please check your mailer settings? The plain text version of the
emails is hardly legible because it mixes quoted text and replies. I
had to manually open the HTML part to figure out what you really wrote.

Am 05.10.2018 um 17:00 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Hmm, how to share children?
> 
> backup job has two source BdrvChild'ren - child_job and child_root of
> job blk and two target BdrvChild'ren - again, child_job and
> child_root.
> 
> backup_top has source child - child_backing and second - child_file
> (named "target")..

Right, these are six BdrvChild instances in total. I think we can ignore
the child_job ones, they are internal to the block job infrastructure,
so we have four of them left.

> Which BdrvChild'ren you suggest to remove? They are all different.

Now that you introduced backup_top, I think we don't need any
BlockBackends any more. So I suggest to remove the child_root ones and
to do all I/O through the child_backing and child_file ones of
backup_top.

> I don't know, why job needs both unnamed blk's and child_job's, and I
> don't know is it necessary for backup to use blk's not BdrvChild'ren..

I think we had a case recently where it turned out that it is strictly
speaking even wrong for jobs to use BlockBackends in a function that
intercepts a request on the BDS level (like the copy-before-write of
backup).

So getting rid of the BlockBackends isn't only okay, but actually a good
thing by itself.

> And with internal way in none-mode we'll have two unused blk's  and
> four unused BdrvChild'ren.. Or we want to rewrite backup to use
> BdrvChild'ren for io operations and drop child_job BdrvChild'ren? So
> I'm lost. What did you mean?

child_job isn't actually unused, even though you never use them to make
requests. The child_job BdrvChild is important because of the
BdrvChildRole callbacks it provides to the block job infrastructure.

Kevin
Vladimir Sementsov-Ogievskiy Oct. 5, 2018, 4:40 p.m. UTC | #7
05.10.2018 18:52, Kevin Wolf wrote:
> Hi Vladimir,
>
> can you please check your mailer settings? The plain text version of the
> emails is hardly legible because it mixes quoted text and replies. I
> had to manually open the HTML part to figure out what you really wrote.

I've sent it from another Thunderbird instance at home; I hope the
Thunderbird at work (where I'm composing now) is ok..

>
> Am 05.10.2018 um 17:00 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> Hmm, how to share children?
>>
>> backup job has two source BdrvChild'ren - child_job and child_root of
>> job blk and two target BdrvChild'ren - again, child_job and
>> child_root.
>>
>> backup_top has source child - child_backing and second - child_file
>> (named "target")..
> Right, these are six BdrvChild instances in total. I think we can ignore
> the child_job ones, they are internal to the block job infrastructure,
> so we have four of them left.
>
>> Which BdrvChild'ren you suggest to remove? They are all different.
> Now that you introduced backup_top, I think we don't need any
> BlockBackends any more. So I suggest to remove the child_root ones and
> to do all I/O through the child_backing and child_file ones of
> backup_top.
>
>> I don't know, why job needs both unnamed blk's and child_job's, and I
>> don't know is it necessary for backup to use blk's not BdrvChild'ren..
> I think we had a case recently where it turned out that it is strictly
> speaking even wrong for jobs to use BlockBackends in a function that
> intercepts a request on the BDS level (like the copy-before-write of
> backup).
>
> So getting rid of the BlockBackends isn't only okay, but actually a good
> thing by itself.
>
>> And with internal way in none-mode we'll have two unused blk's  and
>> four unused BdrvChild'ren.. Or we want to rewrite backup to use
>> BdrvChild'ren for io operations and drop child_job BdrvChild'ren? So
>> I'm lost. What did you mean?
> child_job isn't actually unused, even though you never use them to make
> requests. The child_job BdrvChild is important because of the
> BdrvChildRole callbacks it provides to the block job infrastructure.
>
> Kevin

Ok, understand, thank you for the explanation!
Eric Blake Oct. 5, 2018, 4:47 p.m. UTC | #8
On 10/5/18 11:40 AM, Vladimir Sementsov-Ogievskiy wrote:
> 05.10.2018 18:52, Kevin Wolf wrote:
>> Hi Vladimir,
>>
>> can you please check your mailer settings? The plain text version of the
> emails is hardly legible because it mixes quoted text and replies. I
>> had to manually open the HTML part to figure out what you really wrote.
> 
> I've sent it from other thunderbird instance from home, I hope
> thunderbird at work (where I'm composing now) is ok..

Comparing the two:

Home:
Message-ID: <46e224bd-8c1f-4565-944e-52440e85e2f0@virtuozzo.com>
...
Content-Type: multipart/alternative;
	boundary="_000_46e224bd8c1f4565944e52440e85e2f0virtuozzocom_"

Work:
Message-ID: <05adf79a-4ae1-0ba1-aa7f-7696aa043594@virtuozzo.com>
...
Content-Type: text/plain; charset="utf-8"
Content-ID: <DF06DD561001084699A8EB982D66721B@eurprd08.prod.outlook.com>
Content-Transfer-Encoding: base64

So, the difference is that at home, you haven't told thunderbird to send 
plain-text only emails to specific recipients (setting up the list as 
one of those recipients that wants plain-text only), and something else 
in your local configurations then results in a multipart email where the 
html portion looks fine but the plain-text portion has horrendous 
quoting. But at work, you are configured for plain-text-only output, 
html is not even available, and the quoting is decent from the start.
Vladimir Sementsov-Ogievskiy Oct. 5, 2018, 6:31 p.m. UTC | #9
Thank you, hope that's fixed now)

On 10/05/2018 07:47 PM, Eric Blake wrote:
> On 10/5/18 11:40 AM, Vladimir Sementsov-Ogievskiy wrote:
>> 05.10.2018 18:52, Kevin Wolf wrote:
>>> Hi Vladimir,
>>>
>>> can you please check your mailer settings? The plain text version of the
>> emails is hardly legible because it mixes quoted text and replies. I
>>> had to manually open the HTML part to figure out what you really wrote.
>>
>> I've sent it from other thunderbird instance from home, I hope
>> thunderbird at work (where I'm composing now) is ok..
> 
> Comparing the two:
> 
> Home:
> Message-ID: <46e224bd-8c1f-4565-944e-52440e85e2f0@virtuozzo.com>
> ...
> Content-Type: multipart/alternative;
>      boundary="_000_46e224bd8c1f4565944e52440e85e2f0virtuozzocom_"
> 
> Work:
> Message-ID: <05adf79a-4ae1-0ba1-aa7f-7696aa043594@virtuozzo.com>
> ...
> Content-Type: text/plain; charset="utf-8"
> Content-ID: <DF06DD561001084699A8EB982D66721B@eurprd08.prod.outlook.com>
> Content-Transfer-Encoding: base64
> 
> So, the difference is that at home, you haven't told thunderbird to send 
> plain-text only emails to specific recipients (setting up the list as 
> one of those recipients that wants plain-text only), and something else 
> in your local configurations then results in a multipart email where the 
> html portion looks fine but the plain-text portion has horrendous 
> quoting. But at work, you are configured for plain-text-only output, 
> html is not even available, and the quoting is decent from the start.
>
Patch

diff --git a/qapi/block-core.json b/qapi/block-core.json
index c4774af18e..13cf90eab6 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2628,7 +2628,8 @@ 
             'host_cdrom', 'host_device', 'http', 'https', 'iscsi', 'luks',
             'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels', 'qcow',
             'qcow2', 'qed', 'quorum', 'raw', 'rbd', 'replication', 'sheepdog',
-            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
+            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs',
+            'fleecing-hook'] }
 
 ##
 # @BlockdevOptionsFile:
@@ -2719,6 +2720,22 @@ 
 { 'struct': 'BlockdevOptionsGenericFormat',
   'data': { 'file': 'BlockdevRef' } }
 
+##
+# @BlockdevOptionsFleecingHook:
+#
+# Driver specific options for the fleecing-hook copy-before-write filter.
+#
+# @append-to:     reference to the node to insert the filter above
+# @target:        reference to or definition of the copy-before-write target
+# @copy-bitmap:   name of the dirty bitmap tracking areas that still need CBW
+#                 (reused if it already exists on the source, else created)
+#
+# Since: 3.1
+##
+{ 'struct': 'BlockdevOptionsFleecingHook',
+  'data': { 'append-to': 'str', 'target': 'BlockdevRef',
+            '*copy-bitmap': 'str'} }
+
 ##
 # @BlockdevOptionsLUKS:
 #
@@ -3718,7 +3735,8 @@ 
       'vmdk':       'BlockdevOptionsGenericCOWFormat',
       'vpc':        'BlockdevOptionsGenericFormat',
       'vvfat':      'BlockdevOptionsVVFAT',
-      'vxhs':       'BlockdevOptionsVxHS'
+      'vxhs':       'BlockdevOptionsVxHS',
+      'fleecing-hook': 'BlockdevOptionsFleecingHook'
   } }
 
 ##
diff --git a/block/fleecing-hook.c b/block/fleecing-hook.c
new file mode 100644
index 0000000000..f4e2f3ce83
--- /dev/null
+++ b/block/fleecing-hook.c
@@ -0,0 +1,298 @@ 
+/*
+ * Fleecing Hook filter driver
+ *
+ * The driver performs Copy-Before-Write (CBW) operation: it is injected above
+ * some node, and before each write it copies _old_ data to the target node.
+ *
+ * Copyright (c) 2018 Virtuozzo International GmbH. All rights reserved.
+ *
+ * Author:
+ *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/cutils.h"
+#include "qapi/error.h"
+#include "block/block_int.h"
+#include "block/qdict.h"
+
+typedef struct BDRVFleecingHookState {
+    BdrvDirtyBitmap *cbw_bitmap; /* what should be copied to @target
+                                    on guest write. */
+    BdrvChild *target;
+    bool cbw_bitmap_created;
+} BDRVFleecingHookState;
+
+static coroutine_fn int fleecing_hook_co_preadv(
+        BlockDriverState *bs, uint64_t offset, uint64_t bytes,
+        QEMUIOVector *qiov, int flags)
+{
+    /* Features to be implemented:
+     * F1. COR. save read data to fleecing target for fast access
+     *     (to reduce reads). This possibly may be done with use of copy-on-read
+     *     filter, but we need an ability to make COR requests optional: for
+     *     example, if target is a ram-cache, and if it is full now, we should
+     *     skip doing COR request, as it is actually not necessary.
+     *
+     * F2. Feature for guest: read from fleecing target if data is in ram-cache
+     *     and is unchanged
+     */
+
+    return bdrv_co_preadv(bs->backing, offset, bytes, qiov, flags);
+}
+
+static coroutine_fn int fleecing_hook_cbw(BlockDriverState *bs, uint64_t offset,
+                                          uint64_t bytes)
+{
+    int ret = 0;
+    BDRVFleecingHookState *s = bs->opaque;
+    uint64_t gran = bdrv_dirty_bitmap_granularity(s->cbw_bitmap);
+    uint64_t end = QEMU_ALIGN_UP(offset + bytes, gran);
+    uint64_t off = QEMU_ALIGN_DOWN(offset, gran), len;
+    size_t align = MAX(bdrv_opt_mem_align(bs->backing->bs),
+                       bdrv_opt_mem_align(s->target->bs));
+    struct iovec iov = {
+        .iov_base = qemu_memalign(align, end - off),
+        .iov_len = end - off
+    };
+    QEMUIOVector qiov;
+
+    qemu_iovec_init_external(&qiov, &iov, 1);
+
+    /* Features to be implemented:
+     * F3. parallelize copying loop
+     * F4. detect zeros
+     * F5. use block_status ?
+     * F6. don't copy clusters which are already cached by COR [see F1]
+     */
+
+    len = end - off;
+    while (bdrv_dirty_bitmap_next_dirty_area(s->cbw_bitmap, &off, &len)) {
+        iov.iov_len = qiov.size = len;
+
+        bdrv_reset_dirty_bitmap(s->cbw_bitmap, off, len);
+
+        ret = bdrv_co_preadv(bs->backing, off, len, &qiov,
+                             BDRV_REQ_NO_SERIALISING);
+        if (ret < 0) {
+            bdrv_set_dirty_bitmap(s->cbw_bitmap, off, len);
+            goto finish;
+        }
+
+        ret = bdrv_co_pwritev(s->target, off, len, &qiov, BDRV_REQ_SERIALISING);
+        if (ret < 0) {
+            bdrv_set_dirty_bitmap(s->cbw_bitmap, off, len);
+            goto finish;
+        }
+
+        off += len;
+        if (off >= end) {
+            break;
+        }
+        len = end - off;
+    }
+
+finish:
+    qemu_vfree(iov.iov_base);
+
+    return ret;
+}
+
+static int coroutine_fn fleecing_hook_co_pdiscard(BlockDriverState *bs,
+                                                  int64_t offset, int bytes)
+{
+    int ret = fleecing_hook_cbw(bs, offset, bytes);
+    if (ret < 0) {
+        return ret;
+    }
+
+    /* Features to be implemented:
+     * F7. possibility of lazy discard: just defer the discard after fleecing
+     *     completion. If write (or new discard) occurs to the same area, just
+     *     drop deferred discard.
+     */
+
+    return bdrv_co_pdiscard(bs->backing, offset, bytes);
+}
+
+static int coroutine_fn fleecing_hook_co_pwrite_zeroes(BlockDriverState *bs,
+        int64_t offset, int bytes, BdrvRequestFlags flags)
+{
+    int ret = fleecing_hook_cbw(bs, offset, bytes);
+    if (ret < 0) {
+        /* F8. Additional option to break fleecing instead of breaking guest
+         * write here */
+        return ret;
+    }
+
+    return bdrv_co_pwrite_zeroes(bs->backing, offset, bytes, flags);
+}
+
+static coroutine_fn int fleecing_hook_co_pwritev(BlockDriverState *bs,
+                                                 uint64_t offset,
+                                                 uint64_t bytes,
+                                                 QEMUIOVector *qiov, int flags)
+{
+    int ret = fleecing_hook_cbw(bs, offset, bytes);
+    if (ret < 0) {
+        return ret;
+    }
+
+    return bdrv_co_pwritev(bs->backing, offset, bytes, qiov, flags);
+}
+
+static int coroutine_fn fleecing_hook_co_flush(BlockDriverState *bs)
+{
+    if (!bs->backing) {
+        return 0;
+    }
+
+    return bdrv_co_flush(bs->backing->bs);
+}
+
+static void fleecing_hook_refresh_filename(BlockDriverState *bs, QDict *opts)
+{
+    if (bs->backing == NULL) {
+        /* we can be here after failed bdrv_attach_child in
+         * bdrv_set_backing_hd */
+        return;
+    }
+    bdrv_refresh_filename(bs->backing->bs);
+    pstrcpy(bs->exact_filename, sizeof(bs->exact_filename),
+            bs->backing->bs->filename);
+}
+
+static void fleecing_hook_child_perm(BlockDriverState *bs, BdrvChild *c,
+                                       const BdrvChildRole *role,
+                                       BlockReopenQueue *reopen_queue,
+                                       uint64_t perm, uint64_t shared,
+                                       uint64_t *nperm, uint64_t *nshared)
+{
+    bdrv_filter_default_perms(bs, c, role, reopen_queue, perm, shared, nperm,
+                              nshared);
+
+    if (role == &child_file) {
+        /* share write to target, to not interfere with guest writes to its
+         * disk, which will be in the target's backing chain */
+        *nshared = *nshared | BLK_PERM_WRITE;
+    }
+}
+
+static int fleecing_hook_open(BlockDriverState *bs, QDict *options, int flags,
+                              Error **errp)
+{
+    BDRVFleecingHookState *s = bs->opaque;
+    Error *local_err = NULL;
+    const char *append_to, *copy_bitmap_name;
+    BlockDriverState *backing_bs;
+
+    append_to = qdict_get_str(options, "append-to");
+    qdict_del(options, "append-to");
+    backing_bs = bdrv_lookup_bs(append_to, append_to, errp);
+    if (!backing_bs) {
+        return -EINVAL;
+    }
+
+    bs->total_sectors = backing_bs->total_sectors;
+
+    copy_bitmap_name = qdict_get_try_str(options, "copy-bitmap");
+    if (copy_bitmap_name) {
+        qdict_del(options, "copy-bitmap");
+        s->cbw_bitmap = bdrv_find_dirty_bitmap(backing_bs, copy_bitmap_name);
+    }
+
+    if (!s->cbw_bitmap) {
+        s->cbw_bitmap = bdrv_create_dirty_bitmap(bs, 65536, copy_bitmap_name,
+                                                 errp);
+        if (!s->cbw_bitmap) {
+            return -EINVAL;
+        }
+        s->cbw_bitmap_created = true;
+    }
+
+    bdrv_disable_dirty_bitmap(s->cbw_bitmap);
+    bdrv_set_dirty_bitmap(s->cbw_bitmap, 0, bdrv_getlength(backing_bs));
+
+    s->target = bdrv_open_child(NULL, options, "target", bs, &child_file,
+                               false, errp);
+    if (!s->target) {
+        return -EINVAL;
+    }
+
+    bdrv_set_aio_context(bs, bdrv_get_aio_context(backing_bs));
+    bdrv_set_aio_context(s->target->bs, bdrv_get_aio_context(backing_bs));
+
+    bdrv_drained_begin(backing_bs);
+
+    bdrv_ref(bs);
+    bdrv_append(bs, backing_bs, &local_err);
+
+    if (local_err) {
+        bdrv_unref(bs);
+    }
+
+    bdrv_drained_end(backing_bs);
+
+    if (local_err) {
+        bdrv_unref_child(bs, s->target);
+        error_propagate(errp, local_err);
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static void fleecing_hook_close(BlockDriverState *bs)
+{
+    BDRVFleecingHookState *s = bs->opaque;
+
+    if (s->cbw_bitmap && s->cbw_bitmap_created) {
+        bdrv_release_dirty_bitmap(bs, s->cbw_bitmap);
+    }
+
+    if (s->target) {
+        bdrv_unref_child(bs, s->target);
+    }
+}
+
+BlockDriver bdrv_fleecing_hook_filter = {
+    .format_name = "fleecing-hook",
+    .instance_size = sizeof(BDRVFleecingHookState),
+
+    .bdrv_co_preadv             = fleecing_hook_co_preadv,
+    .bdrv_co_pwritev            = fleecing_hook_co_pwritev,
+    .bdrv_co_pwrite_zeroes      = fleecing_hook_co_pwrite_zeroes,
+    .bdrv_co_pdiscard           = fleecing_hook_co_pdiscard,
+    .bdrv_co_flush              = fleecing_hook_co_flush,
+
+    .bdrv_co_block_status       = bdrv_co_block_status_from_backing,
+
+    .bdrv_refresh_filename      = fleecing_hook_refresh_filename,
+
+    .bdrv_open                  = fleecing_hook_open,
+    .bdrv_close                 = fleecing_hook_close,
+
+    .bdrv_child_perm            = fleecing_hook_child_perm,
+
+    .is_filter = true,
+};
+
+static void bdrv_fleecing_hook_init(void)
+{
+    bdrv_register(&bdrv_fleecing_hook_filter);
+}
+
+block_init(bdrv_fleecing_hook_init);
diff --git a/block/Makefile.objs b/block/Makefile.objs
index c8337bf186..081720b14f 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -31,6 +31,8 @@  block-obj-y += throttle.o copy-on-read.o
 
 block-obj-y += crypto.o
 
+block-obj-y += fleecing-hook.o
+
 common-obj-y += stream.o
 
 nfs.o-libs         := $(LIBNFS_LIBS)
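For reference, with the QAPI addition above, inserting the hook over a node would presumably look like the following QMP command. This is an untested sketch based only on the schema in this patch; the node names disk0/fleecing0 and the bitmap name cbw0 are made up:

```json
{ "execute": "blockdev-add",
  "arguments": {
      "driver": "fleecing-hook",
      "node-name": "hook0",
      "append-to": "disk0",
      "target": "fleecing0",
      "copy-bitmap": "cbw0" } }
```

Note that unlike most blockdev-add drivers, the open function here appends itself above @append-to as a side effect, so no separate graph-manipulation command would be needed.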