diff mbox series

[1/2] block/raw-format: switch to BDRV_BLOCK_DATA with BDRV_BLOCK_RECURSE

Message ID 20190812181146.26121-2-vsementsov@virtuozzo.com
State New
Headers show
Series deal with BDRV_BLOCK_RAW | expand

Commit Message

Vladimir Sementsov-Ogievskiy Aug. 12, 2019, 6:11 p.m. UTC
BDRV_BLOCK_RAW makes generic bdrv_co_block_status to fallthrough to
returned file. But is it correct behavior at all? If returned file
itself has a backing file, we may report as totally unallocated and
area which actually has data in bottom backing file.

So, mirroring of qcow2 under raw-format is broken. Which is illustrated
by following commit with a test. Let's make raw-format behave more
correctly returning BDRV_BLOCK_DATA.

Suggested-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/raw-format.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Kevin Wolf Aug. 13, 2019, 11:04 a.m. UTC | #1
Am 12.08.2019 um 20:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
> BDRV_BLOCK_RAW makes generic bdrv_co_block_status to fallthrough to
> returned file. But is it correct behavior at all? If returned file
> itself has a backing file, we may report as totally unallocated and
> area which actually has data in bottom backing file.
> 
> So, mirroring of qcow2 under raw-format is broken. Which is illustrated
> by following commit with a test. Let's make raw-format behave more
> correctly returning BDRV_BLOCK_DATA.
> 
> Suggested-by: Max Reitz <mreitz@redhat.com>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

After some reading, I think I came to the conclusion that RAW is the
correct thing to do. There is indeed a problem, but this patch is trying
to fix it in the wrong place.

In the case where the backing file contains some data, and we have a
'raw' node above the qcow2 overlay node, the content of the respective
block is not defined by the queried backing file layer, so it is
completely correct that bdrv_is_allocated() returns false, like it would
if you queried the qcow2 layer directly. If it returned true, we would
copy everything, which isn't right either (the test cases should may add
the qemu-img map output of the target so this becomes visible).

The problem is that we try to recurse along the backing chain, but we
fail to make the step from the raw node to the backing file.

Note that just extending Max's "deal with filters" is not enough to fix
this because raw doesn't actually meet all of the criteria for being a
filter in this sense (at least because the 'offset' option can change
offsets between raw and its child).

I think this is essentially a result of special-casing backing files
everywhere instead of treating them like children like any other.
bdrv_co_block_status_above() probably shouldn't recurse along the
backing chain, but along the returned *file pointers, and consider the
returned offset in *map.

Kevin
Vladimir Sementsov-Ogievskiy Aug. 13, 2019, 11:28 a.m. UTC | #2
13.08.2019 14:04, Kevin Wolf wrote:
> Am 12.08.2019 um 20:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> BDRV_BLOCK_RAW makes generic bdrv_co_block_status to fallthrough to
>> returned file. But is it correct behavior at all? If returned file
>> itself has a backing file, we may report as totally unallocated and
>> area which actually has data in bottom backing file.
>>
>> So, mirroring of qcow2 under raw-format is broken. Which is illustrated
>> by following commit with a test. Let's make raw-format behave more
>> correctly returning BDRV_BLOCK_DATA.
>>
>> Suggested-by: Max Reitz <mreitz@redhat.com>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> 
> After some reading, I think I came to the conclusion that RAW is the
> correct thing to do. There is indeed a problem, but this patch is trying
> to fix it in the wrong place.
> 
> In the case where the backing file contains some data, and we have a
> 'raw' node above the qcow2 overlay node, the content of the respective
> block is not defined by the queried backing file layer, so it is
> completely correct that bdrv_is_allocated() returns false, like it would
> if you queried the qcow2 layer directly. If it returned true, we would
> copy everything, which isn't right either (the test cases should may add
> the qemu-img map output of the target so this becomes visible).
> 
> The problem is that we try to recurse along the backing chain, but we
> fail to make the step from the raw node to the backing file.

I'd say, the problem is that we ignore backing chain of non-backing child

> 
> Note that just extending Max's "deal with filters" is not enough to fix
> this because raw doesn't actually meet all of the criteria for being a
> filter in this sense (at least because the 'offset' option can change
> offsets between raw and its child).
> 
> I think this is essentially a result of special-casing backing files
> everywhere instead of treating them like children like any other.

But we need to special-case them, as we have interfaces operating on backing chain,

> bdrv_co_block_status_above() probably shouldn't recurse along the
> backing chain, but along the returned *file pointers, and consider the
> returned offset in *map.
> 

So, you mean that in case of unallocated, format layer should return it's backing file as file?
Then, maybe bdrv_co_block_status should not recurse at all?
Kevin Wolf Aug. 13, 2019, 12:01 p.m. UTC | #3
Am 13.08.2019 um 13:28 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 13.08.2019 14:04, Kevin Wolf wrote:
> > Am 12.08.2019 um 20:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> BDRV_BLOCK_RAW makes generic bdrv_co_block_status to fallthrough to
> >> returned file. But is it correct behavior at all? If returned file
> >> itself has a backing file, we may report as totally unallocated and
> >> area which actually has data in bottom backing file.
> >>
> >> So, mirroring of qcow2 under raw-format is broken. Which is illustrated
> >> by following commit with a test. Let's make raw-format behave more
> >> correctly returning BDRV_BLOCK_DATA.
> >>
> >> Suggested-by: Max Reitz <mreitz@redhat.com>
> >> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > 
> > After some reading, I think I came to the conclusion that RAW is the
> > correct thing to do. There is indeed a problem, but this patch is trying
> > to fix it in the wrong place.
> > 
> > In the case where the backing file contains some data, and we have a
> > 'raw' node above the qcow2 overlay node, the content of the respective
> > block is not defined by the queried backing file layer, so it is
> > completely correct that bdrv_is_allocated() returns false, like it would
> > if you queried the qcow2 layer directly. If it returned true, we would
> > copy everything, which isn't right either (the test cases should may add
> > the qemu-img map output of the target so this becomes visible).
> > 
> > The problem is that we try to recurse along the backing chain, but we
> > fail to make the step from the raw node to the backing file.
> 
> I'd say, the problem is that we ignore backing chain of non-backing
> child

Yes, exactly. And I know even less about what happens if a child is
neither bs->file nor bs->backing. Imagine a qcow2 image with an external
data file that is a qcow2 image with a backing file itself. :-)

Actually, just having two qcow2 layers nested with bs->file probably
already fails.

> > Note that just extending Max's "deal with filters" is not enough to fix
> > this because raw doesn't actually meet all of the criteria for being a
> > filter in this sense (at least because the 'offset' option can change
> > offsets between raw and its child).
> > 
> > I think this is essentially a result of special-casing backing files
> > everywhere instead of treating them like children like any other.
> 
> But we need to special-case them, as we have interfaces operating on
> backing chain,

I'm not sure yet if this means that these interfaces are wrong, but it
might. But in any case, I think we depend on special-casing in more
places than we should.

> > bdrv_co_block_status_above() probably shouldn't recurse along the
> > backing chain, but along the returned *file pointers, and consider the
> > returned offset in *map.
> 
> So, you mean that in case of unallocated, format layer should return
> it's backing file as file?

Yes, because that's where it's reading the data from.

Hm... Now I wonder what this means for DATA... In theory it would have
to be set for backing files, but that would make it completely useless.
We can distinguish the cases by looking at *file, but how does the
generic block layer know which child should be counted as "allocated"
and which shouldn't?

> Then, maybe bdrv_co_block_status should not recurse at all?

Maybe. I think I need to draw a bit on the whiteboard before I can
really say anything. It would probably be a pretty fundamental change to
the model how block_status works.

Kevin
Kevin Wolf Aug. 13, 2019, 1:21 p.m. UTC | #4
Am 13.08.2019 um 14:01 hat Kevin Wolf geschrieben:
> Am 13.08.2019 um 13:28 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > 13.08.2019 14:04, Kevin Wolf wrote:
> > > Am 12.08.2019 um 20:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > >> BDRV_BLOCK_RAW makes generic bdrv_co_block_status to fallthrough to
> > >> returned file. But is it correct behavior at all? If returned file
> > >> itself has a backing file, we may report as totally unallocated and
> > >> area which actually has data in bottom backing file.
> > >>
> > >> So, mirroring of qcow2 under raw-format is broken. Which is illustrated
> > >> by following commit with a test. Let's make raw-format behave more
> > >> correctly returning BDRV_BLOCK_DATA.
> > >>
> > >> Suggested-by: Max Reitz <mreitz@redhat.com>
> > >> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > > 
> > > After some reading, I think I came to the conclusion that RAW is the
> > > correct thing to do. There is indeed a problem, but this patch is trying
> > > to fix it in the wrong place.
> > > 
> > > In the case where the backing file contains some data, and we have a
> > > 'raw' node above the qcow2 overlay node, the content of the respective
> > > block is not defined by the queried backing file layer, so it is
> > > completely correct that bdrv_is_allocated() returns false, like it would
> > > if you queried the qcow2 layer directly. If it returned true, we would
> > > copy everything, which isn't right either (the test cases should may add
> > > the qemu-img map output of the target so this becomes visible).
> > > 
> > > The problem is that we try to recurse along the backing chain, but we
> > > fail to make the step from the raw node to the backing file.
> > 
> > I'd say, the problem is that we ignore backing chain of non-backing
> > child
> 
> Yes, exactly. And I know even less about what happens if a child is
> neither bs->file nor bs->backing. Imagine a qcow2 image with an external
> data file that is a qcow2 image with a backing file itself. :-)
> 
> Actually, just having two qcow2 layers nested with bs->file probably
> already fails.
> 
> > > Note that just extending Max's "deal with filters" is not enough to fix
> > > this because raw doesn't actually meet all of the criteria for being a
> > > filter in this sense (at least because the 'offset' option can change
> > > offsets between raw and its child).
> > > 
> > > I think this is essentially a result of special-casing backing files
> > > everywhere instead of treating them like children like any other.
> > 
> > But we need to special-case them, as we have interfaces operating on
> > backing chain,
> 
> I'm not sure yet if this means that these interfaces are wrong, but it
> might. But in any case, I think we depend on special-casing in more
> places than we should.
> 
> > > bdrv_co_block_status_above() probably shouldn't recurse along the
> > > backing chain, but along the returned *file pointers, and consider the
> > > returned offset in *map.
> > 
> > So, you mean that in case of unallocated, format layer should return
> > it's backing file as file?
> 
> Yes, because that's where it's reading the data from.
> 
> Hm... Now I wonder what this means for DATA... In theory it would have
> to be set for backing files, but that would make it completely useless.
> We can distinguish the cases by looking at *file, but how does the
> generic block layer know which child should be counted as "allocated"
> and which shouldn't?

Possible answer to my own question:

bdrv_is_allocated(bs) isn't even asking a complete question. What we
really need to ask is whether a specific child is where data comes from.

What the current callers of bdrv_is_allocated() are interested in is
whether the data comes from bs->backing or from somewhere else. That is,
if removing bs from the graph (so that all parents of bs would point to
bs->backing instead) would still result in the same data in the given
block.

Kevin
Max Reitz Aug. 13, 2019, 2:43 p.m. UTC | #5
On 13.08.19 13:04, Kevin Wolf wrote:
> Am 12.08.2019 um 20:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> BDRV_BLOCK_RAW makes generic bdrv_co_block_status to fallthrough to
>> returned file. But is it correct behavior at all? If returned file
>> itself has a backing file, we may report as totally unallocated and
>> area which actually has data in bottom backing file.
>>
>> So, mirroring of qcow2 under raw-format is broken. Which is illustrated
>> by following commit with a test. Let's make raw-format behave more
>> correctly returning BDRV_BLOCK_DATA.
>>
>> Suggested-by: Max Reitz <mreitz@redhat.com>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> 
> After some reading, I think I came to the conclusion that RAW is the
> correct thing to do. There is indeed a problem, but this patch is trying
> to fix it in the wrong place.
> 
> In the case where the backing file contains some data, and we have a
> 'raw' node above the qcow2 overlay node, the content of the respective
> block is not defined by the queried backing file layer, so it is
> completely correct that bdrv_is_allocated() returns false,like it would
> if you queried the qcow2 layer directly.

I disagree.  The queried backing file layer is the raw node.  As I said,
in my opinion raw nodes are not filter nodes, neither in behavior (they
have an offset option), nor in how they are generally used (as a format).

The raw format does not support backing files.  Therefore, everything on
a raw node is allocated.

(That is, like, my opinion.)

>                                          If it returned true, we would
> copy everything, which isn't right either (the test cases should may add
> the qemu-img map output of the target so this becomes visible).

It is right.

Max
Max Reitz Aug. 13, 2019, 2:46 p.m. UTC | #6
On 13.08.19 15:21, Kevin Wolf wrote:
> Am 13.08.2019 um 14:01 hat Kevin Wolf geschrieben:
>> Am 13.08.2019 um 13:28 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>> 13.08.2019 14:04, Kevin Wolf wrote:
>>>> Am 12.08.2019 um 20:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>> BDRV_BLOCK_RAW makes generic bdrv_co_block_status to fallthrough to
>>>>> returned file. But is it correct behavior at all? If returned file
>>>>> itself has a backing file, we may report as totally unallocated and
>>>>> area which actually has data in bottom backing file.
>>>>>
>>>>> So, mirroring of qcow2 under raw-format is broken. Which is illustrated
>>>>> by following commit with a test. Let's make raw-format behave more
>>>>> correctly returning BDRV_BLOCK_DATA.
>>>>>
>>>>> Suggested-by: Max Reitz <mreitz@redhat.com>
>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>
>>>> After some reading, I think I came to the conclusion that RAW is the
>>>> correct thing to do. There is indeed a problem, but this patch is trying
>>>> to fix it in the wrong place.
>>>>
>>>> In the case where the backing file contains some data, and we have a
>>>> 'raw' node above the qcow2 overlay node, the content of the respective
>>>> block is not defined by the queried backing file layer, so it is
>>>> completely correct that bdrv_is_allocated() returns false, like it would
>>>> if you queried the qcow2 layer directly. If it returned true, we would
>>>> copy everything, which isn't right either (the test cases should may add
>>>> the qemu-img map output of the target so this becomes visible).
>>>>
>>>> The problem is that we try to recurse along the backing chain, but we
>>>> fail to make the step from the raw node to the backing file.
>>>
>>> I'd say, the problem is that we ignore backing chain of non-backing
>>> child
>>
>> Yes, exactly. And I know even less about what happens if a child is
>> neither bs->file nor bs->backing. Imagine a qcow2 image with an external
>> data file that is a qcow2 image with a backing file itself. :-)
>>
>> Actually, just having two qcow2 layers nested with bs->file probably
>> already fails.
>>
>>>> Note that just extending Max's "deal with filters" is not enough to fix
>>>> this because raw doesn't actually meet all of the criteria for being a
>>>> filter in this sense (at least because the 'offset' option can change
>>>> offsets between raw and its child).
>>>>
>>>> I think this is essentially a result of special-casing backing files
>>>> everywhere instead of treating them like children like any other.
>>>
>>> But we need to special-case them, as we have interfaces operating on
>>> backing chain,
>>
>> I'm not sure yet if this means that these interfaces are wrong, but it
>> might. But in any case, I think we depend on special-casing in more
>> places than we should.
>>
>>>> bdrv_co_block_status_above() probably shouldn't recurse along the
>>>> backing chain, but along the returned *file pointers, and consider the
>>>> returned offset in *map.
>>>
>>> So, you mean that in case of unallocated, format layer should return
>>> it's backing file as file?
>>
>> Yes, because that's where it's reading the data from.
>>
>> Hm... Now I wonder what this means for DATA... In theory it would have
>> to be set for backing files, but that would make it completely useless.
>> We can distinguish the cases by looking at *file, but how does the
>> generic block layer know which child should be counted as "allocated"
>> and which shouldn't?
> 
> Possible answer to my own question:
> 
> bdrv_is_allocated(bs) isn't even asking a complete question. What we
> really need to ask is whether a specific child is where data comes from.
> 
> What the current callers of bdrv_is_allocated() are interested in is
> whether the data comes from bs->backing or from somewhere else. That is,
> if removing bs from the graph (so that all parents of bs would point to
> bs->backing instead) would still result in the same data in the given
> block.

Maybe callers of bdrv_is_allocated() should just ensure that the node
they pass actually has a backing file.

(If it doesn’t, they should skip all filters until it does.)

Max
Vladimir Sementsov-Ogievskiy Aug. 13, 2019, 2:56 p.m. UTC | #7
13.08.2019 17:43, Max Reitz wrote:
> On 13.08.19 13:04, Kevin Wolf wrote:
>> Am 12.08.2019 um 20:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>> BDRV_BLOCK_RAW makes generic bdrv_co_block_status to fallthrough to
>>> returned file. But is it correct behavior at all? If returned file
>>> itself has a backing file, we may report as totally unallocated and
>>> area which actually has data in bottom backing file.
>>>
>>> So, mirroring of qcow2 under raw-format is broken. Which is illustrated
>>> by following commit with a test. Let's make raw-format behave more
>>> correctly returning BDRV_BLOCK_DATA.
>>>
>>> Suggested-by: Max Reitz <mreitz@redhat.com>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>
>> After some reading, I think I came to the conclusion that RAW is the
>> correct thing to do. There is indeed a problem, but this patch is trying
>> to fix it in the wrong place.
>>
>> In the case where the backing file contains some data, and we have a
>> 'raw' node above the qcow2 overlay node, the content of the respective
>> block is not defined by the queried backing file layer, so it is
>> completely correct that bdrv_is_allocated() returns false,like it would
>> if you queried the qcow2 layer directly.
> 
> I disagree.  The queried backing file layer is the raw node.  As I said,
> in my opinion raw nodes are not filter nodes, neither in behavior (they
> have an offset option), nor in how they are generally used (as a format).
> 
> The raw format does not support backing files.  Therefore, everything on
> a raw node is allocated.
> 

Could you tell me at least, what means "allocated" ?

It's a term that describing a region somehow.. But how? Allocated where?
In raw node, in its child or both? Am I right that if region allocated in
one of non-cow children it is assumed to be allocated in parent too? Or what?

And it's unrelated to real disk allocation which (IMHO) directly shows that
this a bad term.
Max Reitz Aug. 13, 2019, 3:03 p.m. UTC | #8
On 13.08.19 16:56, Vladimir Sementsov-Ogievskiy wrote:
> 13.08.2019 17:43, Max Reitz wrote:
>> On 13.08.19 13:04, Kevin Wolf wrote:
>>> Am 12.08.2019 um 20:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> BDRV_BLOCK_RAW makes generic bdrv_co_block_status to fallthrough to
>>>> returned file. But is it correct behavior at all? If returned file
>>>> itself has a backing file, we may report as totally unallocated and
>>>> area which actually has data in bottom backing file.
>>>>
>>>> So, mirroring of qcow2 under raw-format is broken. Which is illustrated
>>>> by following commit with a test. Let's make raw-format behave more
>>>> correctly returning BDRV_BLOCK_DATA.
>>>>
>>>> Suggested-by: Max Reitz <mreitz@redhat.com>
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>
>>> After some reading, I think I came to the conclusion that RAW is the
>>> correct thing to do. There is indeed a problem, but this patch is trying
>>> to fix it in the wrong place.
>>>
>>> In the case where the backing file contains some data, and we have a
>>> 'raw' node above the qcow2 overlay node, the content of the respective
>>> block is not defined by the queried backing file layer, so it is
>>> completely correct that bdrv_is_allocated() returns false,like it would
>>> if you queried the qcow2 layer directly.
>>
>> I disagree.  The queried backing file layer is the raw node.  As I said,
>> in my opinion raw nodes are not filter nodes, neither in behavior (they
>> have an offset option), nor in how they are generally used (as a format).
>>
>> The raw format does not support backing files.  Therefore, everything on
>> a raw node is allocated.
>>
> 
> Could you tell me at least, what means "allocated" ?
> 
> It's a term that describing a region somehow.. But how? Allocated where?
> In raw node, in its child or both? Am I right that if region allocated in
> one of non-cow children it is assumed to be allocated in parent too? Or what?
> 
> And it's unrelated to real disk allocation which (IMHO) directly shows that
> this a bad term.

It’s a term for COW backing chains.  If something is allocated on a
given node in a COW backing chain, it means it is either present in
exactly that node or in one of its storage children (in case the node is
a format node).  If it is not allocated, it is not, and read accesses
will be forwarded to the COW backing child.

Max
Vladimir Sementsov-Ogievskiy Aug. 13, 2019, 3:22 p.m. UTC | #9
13.08.2019 18:03, Max Reitz wrote:
> On 13.08.19 16:56, Vladimir Sementsov-Ogievskiy wrote:
>> 13.08.2019 17:43, Max Reitz wrote:
>>> On 13.08.19 13:04, Kevin Wolf wrote:
>>>> Am 12.08.2019 um 20:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>> BDRV_BLOCK_RAW makes generic bdrv_co_block_status to fallthrough to
>>>>> returned file. But is it correct behavior at all? If returned file
>>>>> itself has a backing file, we may report as totally unallocated and
>>>>> area which actually has data in bottom backing file.
>>>>>
>>>>> So, mirroring of qcow2 under raw-format is broken. Which is illustrated
>>>>> by following commit with a test. Let's make raw-format behave more
>>>>> correctly returning BDRV_BLOCK_DATA.
>>>>>
>>>>> Suggested-by: Max Reitz <mreitz@redhat.com>
>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>
>>>> After some reading, I think I came to the conclusion that RAW is the
>>>> correct thing to do. There is indeed a problem, but this patch is trying
>>>> to fix it in the wrong place.
>>>>
>>>> In the case where the backing file contains some data, and we have a
>>>> 'raw' node above the qcow2 overlay node, the content of the respective
>>>> block is not defined by the queried backing file layer, so it is
>>>> completely correct that bdrv_is_allocated() returns false,like it would
>>>> if you queried the qcow2 layer directly.
>>>
>>> I disagree.  The queried backing file layer is the raw node.  As I said,
>>> in my opinion raw nodes are not filter nodes, neither in behavior (they
>>> have an offset option), nor in how they are generally used (as a format).
>>>
>>> The raw format does not support backing files.  Therefore, everything on
>>> a raw node is allocated.
>>>
>>
>> Could you tell me at least, what means "allocated" ?
>>
>> It's a term that describing a region somehow.. But how? Allocated where?
>> In raw node, in its child or both? Am I right that if region allocated in
>> one of non-cow children it is assumed to be allocated in parent too? Or what?
>>
>> And it's unrelated to real disk allocation which (IMHO) directly shows that
>> this a bad term.
> 
> It’s a term for COW backing chains.  If something is allocated on a
> given node in a COW backing chain, it means it is either present in
> exactly that node or in one of its storage children (in case the node is
> a format node).  If it is not allocated, it is not, and read accesses
> will be forwarded to the COW backing child.
> 

And this definition leads exactly to bug in these series:


[raw]
   |
   |file
   V       file
[qcow2]--------->[file]
   |
   |backing
   V
[base]


Assume something is actually allocated in [base] but not in [qcow2].
So, [qcow2] node reports it as unallocated. So nobdy of [raw]'s storage
children contains this as allocated, so it's unallocated in [raw].


And if you say that logic is different for raw as it don't have COW child,
we may put qcow2 instead of raw here with same problem [yes, qcow2 on qcow2
is a bit strange, but we try to make something generic]
Kevin Wolf Aug. 13, 2019, 3:41 p.m. UTC | #10
Am 13.08.2019 um 16:43 hat Max Reitz geschrieben:
> On 13.08.19 13:04, Kevin Wolf wrote:
> > Am 12.08.2019 um 20:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> BDRV_BLOCK_RAW makes generic bdrv_co_block_status to fallthrough to
> >> returned file. But is it correct behavior at all? If returned file
> >> itself has a backing file, we may report as totally unallocated and
> >> area which actually has data in bottom backing file.
> >>
> >> So, mirroring of qcow2 under raw-format is broken. Which is illustrated
> >> by following commit with a test. Let's make raw-format behave more
> >> correctly returning BDRV_BLOCK_DATA.
> >>
> >> Suggested-by: Max Reitz <mreitz@redhat.com>
> >> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > 
> > After some reading, I think I came to the conclusion that RAW is the
> > correct thing to do. There is indeed a problem, but this patch is trying
> > to fix it in the wrong place.
> > 
> > In the case where the backing file contains some data, and we have a
> > 'raw' node above the qcow2 overlay node, the content of the respective
> > block is not defined by the queried backing file layer, so it is
> > completely correct that bdrv_is_allocated() returns false,like it would
> > if you queried the qcow2 layer directly.
> 
> I disagree.  The queried backing file layer is the raw node.  As I said,
> in my opinion raw nodes are not filter nodes, neither in behavior (they
> have an offset option), nor in how they are generally used (as a format).
> 
> The raw format does not support backing files.  Therefore, everything on
> a raw node is allocated.
> 
> (That is, like, my opinion.)
> 
> >                                          If it returned true, we would
> > copy everything, which isn't right either (the test cases should may add
> > the qemu-img map output of the target so this becomes visible).
> 
> It is right.

So we don't even agree what mirroring the raw node should even mean.

I can the see your point when you say that the raw node has no backing
file, so everything should be copied. But I can also see the point that
the raw node can really just be used as a filter that limits the data
exposed from the qcow2 layer, and you want to keep the copy a COW
overlay over the same backing file.

Both are valid use cases in principle and there is no single right or
wrong.

We don't currently support the latter use case because we have only
sync=full or sync=top, but if you could specify a base node instead, we
could probably suport the case without all of the special-casing filter
nodes and backing file childs.

You would call bdrv_co_block_status_above() with the right base node and
it would just recurse whereever the data is stored, be it bs->backing,
bs->file or even driver-specific children. This would allow you to find
out whether some block in the top node came from the base node that
we're going to keep. If yes, skip it; if no, copy it.

This way we wouldn't have to decide whether raw is a filter or not,
because it wouldn't make a difference. The behaviour would only depend
on the base node given by the user. If you specified the top-level qcow2
file as the base, you get your full copy; if you specified the backing
qcow2, you get the partial copy where the target still uses the same
backing file.

(Hm... It would only actually work if the offsets stay the same in the
chain, which is true for backing file children, but not necessarily for
other children. Anyway, even if we don't gain much functionality, I
really want a more generic model that avoids different types of nodes
and edges as much as possible.)

Kevin
Vladimir Sementsov-Ogievskiy Aug. 13, 2019, 3:54 p.m. UTC | #11
13.08.2019 18:41, Kevin Wolf wrote:
> Am 13.08.2019 um 16:43 hat Max Reitz geschrieben:
>> On 13.08.19 13:04, Kevin Wolf wrote:
>>> Am 12.08.2019 um 20:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> BDRV_BLOCK_RAW makes generic bdrv_co_block_status to fallthrough to
>>>> returned file. But is it correct behavior at all? If returned file
>>>> itself has a backing file, we may report as totally unallocated and
>>>> area which actually has data in bottom backing file.
>>>>
>>>> So, mirroring of qcow2 under raw-format is broken. Which is illustrated
>>>> by following commit with a test. Let's make raw-format behave more
>>>> correctly returning BDRV_BLOCK_DATA.
>>>>
>>>> Suggested-by: Max Reitz <mreitz@redhat.com>
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>
>>> After some reading, I think I came to the conclusion that RAW is the
>>> correct thing to do. There is indeed a problem, but this patch is trying
>>> to fix it in the wrong place.
>>>
>>> In the case where the backing file contains some data, and we have a
>>> 'raw' node above the qcow2 overlay node, the content of the respective
>>> block is not defined by the queried backing file layer, so it is
>>> completely correct that bdrv_is_allocated() returns false,like it would
>>> if you queried the qcow2 layer directly.
>>
>> I disagree.  The queried backing file layer is the raw node.  As I said,
>> in my opinion raw nodes are not filter nodes, neither in behavior (they
>> have an offset option), nor in how they are generally used (as a format).
>>
>> The raw format does not support backing files.  Therefore, everything on
>> a raw node is allocated.
>>
>> (That is, like, my opinion.)
>>
>>>                                           If it returned true, we would
>>> copy everything, which isn't right either (the test cases should may add
>>> the qemu-img map output of the target so this becomes visible).
>>
>> It is right.
> 
> So we don't even agree what mirroring the raw node should even mean.
> 
> I can the see your point when you say that the raw node has no backing
> file, so everything should be copied. But I can also see the point that
> the raw node can really just be used as a filter that limits the data
> exposed from the qcow2 layer, and you want to keep the copy a COW
> overlay over the same backing file.
> 
> Both are valid use cases in principle and there is no single right or
> wrong.
> 
> We don't currently support the latter use case because we have only
> sync=full or sync=top, but if you could specify a base node instead, we
> could probably suport the case without all of the special-casing filter
> nodes and backing file childs.
> 
> You would call bdrv_co_block_status_above() with the right base node and
> it would just recurse whereever the data is stored, be it bs->backing,
> bs->file or even driver-specific children. This would allow you to find
> out whether some block in the top node came from the base node that
> we're going to keep. If yes, skip it; if no, copy it.
> 
> This way we wouldn't have to decide whether raw is a filter or not,
> because it wouldn't make a difference. The behaviour would only depend
> on the base node given by the user. If you specified the top-level qcow2
> file as the base, you get your full copy;

ahm, full-copy = base is NULL..

> if you specified the backing
> qcow2, you get the partial copy where the target still uses the same
> backing file.
> 
> (Hm... It would only actually work if the offsets stay the same in the
> chain, which is true for backing file children, but not necessarily for
> other children.

Don't follow, what you mean by offsets stay the same and what is wrong with it?

> Anyway, even if we don't gain much functionality, I
> really want a more generic model that avoids different types of nodes
> and edges as much as possible.)
> 
> Kevin
>
Max Reitz Aug. 13, 2019, 4:07 p.m. UTC | #12
On 13.08.19 17:22, Vladimir Sementsov-Ogievskiy wrote:
> 13.08.2019 18:03, Max Reitz wrote:
>> On 13.08.19 16:56, Vladimir Sementsov-Ogievskiy wrote:
>>> 13.08.2019 17:43, Max Reitz wrote:
>>>> On 13.08.19 13:04, Kevin Wolf wrote:
>>>>> Am 12.08.2019 um 20:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>>> BDRV_BLOCK_RAW makes generic bdrv_co_block_status to fallthrough to
>>>>>> returned file. But is it correct behavior at all? If returned file
>>>>>> itself has a backing file, we may report as totally unallocated and
>>>>>> area which actually has data in bottom backing file.
>>>>>>
>>>>>> So, mirroring of qcow2 under raw-format is broken. Which is illustrated
>>>>>> by following commit with a test. Let's make raw-format behave more
>>>>>> correctly returning BDRV_BLOCK_DATA.
>>>>>>
>>>>>> Suggested-by: Max Reitz <mreitz@redhat.com>
>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>
>>>>> After some reading, I think I came to the conclusion that RAW is the
>>>>> correct thing to do. There is indeed a problem, but this patch is trying
>>>>> to fix it in the wrong place.
>>>>>
>>>>> In the case where the backing file contains some data, and we have a
>>>>> 'raw' node above the qcow2 overlay node, the content of the respective
>>>>> block is not defined by the queried backing file layer, so it is
>>>>> completely correct that bdrv_is_allocated() returns false,like it would
>>>>> if you queried the qcow2 layer directly.
>>>>
>>>> I disagree.  The queried backing file layer is the raw node.  As I said,
>>>> in my opinion raw nodes are not filter nodes, neither in behavior (they
>>>> have an offset option), nor in how they are generally used (as a format).
>>>>
>>>> The raw format does not support backing files.  Therefore, everything on
>>>> a raw node is allocated.
>>>>
>>>
>>> Could you tell me at least, what means "allocated" ?
>>>
>>> It's a term that describing a region somehow.. But how? Allocated where?
>>> In raw node, in its child or both? Am I right that if region allocated in
>>> one of non-cow children it is assumed to be allocated in parent too? Or what?
>>>
>>> And it's unrelated to real disk allocation which (IMHO) directly shows that
>>> this a bad term.
>>
>> It’s a term for COW backing chains.  If something is allocated on a
>> given node in a COW backing chain, it means it is either present in
>> exactly that node or in one of its storage children (in case the node is
>> a format node).  If it is not allocated, it is not, and read accesses
>> will be forwarded to the COW backing child.
>>
> 
> And this definition leads exactly to bug in these series:
> 
> 
> [raw]
>    |
>    |file
>    V       file
> [qcow2]--------->[file]
>    |
>    |backing
>    V
> [base]
> 
> 
> Assume something is actually allocated in [base] but not in [qcow2].
> So, [qcow2] node reports it as unallocated. So nobdy of [raw]'s storage
> children contains this as allocated, so it's unallocated in [raw].

Well, I would say it is in raw’s storage child, because the definition
of backing chain allocation only recurses down the COW link.

If it is *anywhere* in the storage subtree, it is to be considered
allocated on this level.  If it is in the backing COW subtree, it is not.

And if it is in neither, it doesn’t matter whether we say it’s allocated
or not:

For example (1), for sparse files, the raw driver may not have the data
in the storage subtree.  But it sure as hell won’t have it in its
backing COW subtree, because it doesn’t have one.  So for simplicity’s
sake, it may just return everything as allocated.

For example (2), for sparse qcow2 files, the qcow2 driver may report
some ranges as unallocated even when it doesn’t have a backing file.  We
don’t need to change it to report such cases as allocated.

Max
Kevin Wolf Aug. 13, 2019, 4:08 p.m. UTC | #13
Am 13.08.2019 um 17:54 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 13.08.2019 18:41, Kevin Wolf wrote:
> > Am 13.08.2019 um 16:43 hat Max Reitz geschrieben:
> >> On 13.08.19 13:04, Kevin Wolf wrote:
> >>> Am 12.08.2019 um 20:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >>>> BDRV_BLOCK_RAW makes generic bdrv_co_block_status to fallthrough to
> >>>> returned file. But is it correct behavior at all? If returned file
> >>>> itself has a backing file, we may report as totally unallocated and
> >>>> area which actually has data in bottom backing file.
> >>>>
> >>>> So, mirroring of qcow2 under raw-format is broken. Which is illustrated
> >>>> by following commit with a test. Let's make raw-format behave more
> >>>> correctly returning BDRV_BLOCK_DATA.
> >>>>
> >>>> Suggested-by: Max Reitz <mreitz@redhat.com>
> >>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> >>>
> >>> After some reading, I think I came to the conclusion that RAW is the
> >>> correct thing to do. There is indeed a problem, but this patch is trying
> >>> to fix it in the wrong place.
> >>>
> >>> In the case where the backing file contains some data, and we have a
> >>> 'raw' node above the qcow2 overlay node, the content of the respective
> >>> block is not defined by the queried backing file layer, so it is
> >>> completely correct that bdrv_is_allocated() returns false,like it would
> >>> if you queried the qcow2 layer directly.
> >>
> >> I disagree.  The queried backing file layer is the raw node.  As I said,
> >> in my opinion raw nodes are not filter nodes, neither in behavior (they
> >> have an offset option), nor in how they are generally used (as a format).
> >>
> >> The raw format does not support backing files.  Therefore, everything on
> >> a raw node is allocated.
> >>
> >> (That is, like, my opinion.)
> >>
> >>>                                           If it returned true, we would
> >>> copy everything, which isn't right either (the test cases should may add
> >>> the qemu-img map output of the target so this becomes visible).
> >>
> >> It is right.
> > 
> > So we don't even agree what mirroring the raw node should even mean.
> > 
> > I can the see your point when you say that the raw node has no backing
> > file, so everything should be copied. But I can also see the point that
> > the raw node can really just be used as a filter that limits the data
> > exposed from the qcow2 layer, and you want to keep the copy a COW
> > overlay over the same backing file.
> > 
> > Both are valid use cases in principle and there is no single right or
> > wrong.
> > 
> > We don't currently support the latter use case because we have only
> > sync=full or sync=top, but if you could specify a base node instead, we
> > could probably suport the case without all of the special-casing filter
> > nodes and backing file childs.
> > 
> > You would call bdrv_co_block_status_above() with the right base node and
> > it would just recurse whereever the data is stored, be it bs->backing,
> > bs->file or even driver-specific children. This would allow you to find
> > out whether some block in the top node came from the base node that
> > we're going to keep. If yes, skip it; if no, copy it.
> > 
> > This way we wouldn't have to decide whether raw is a filter or not,
> > because it wouldn't make a difference. The behaviour would only depend
> > on the base node given by the user. If you specified the top-level qcow2
> > file as the base, you get your full copy;
> 
> ahm, full-copy = base is NULL..

Oops, yes, of course. Using the top-level node would create an empty
"copy".

> > if you specified the backing
> > qcow2, you get the partial copy where the target still uses the same
> > backing file.
> > 
> > (Hm... It would only actually work if the offsets stay the same in the
> > chain, which is true for backing file children, but not necessarily for
> > other children.
> 
> Don't follow, what you mean by offsets stay the same and what is wrong
> with it?

Say we have this graph:

raw,offset=65536
    |
    v
  qcow2-----+
    |       |
    v       v
  file     base

Now you can't just mirror the raw node into a target.qcow2 that shares
base as the backing file, because the offsets will be wrong. In order to
use such a copy correctly, you'd have to use a raw node again in the
backing chain:

target.qcow2----+
    |           |
    v           v
  file      raw,offset=65536
                |
                v
              base

So the case where offsets differ between the top and the base node isn't
trivial.

(If this case isn't complicated enough yet, imagine passing file as the
base node instead... It just can't work.)

Kevin
Max Reitz Aug. 13, 2019, 4:21 p.m. UTC | #14
On 13.08.19 17:41, Kevin Wolf wrote:
> Am 13.08.2019 um 16:43 hat Max Reitz geschrieben:
>> On 13.08.19 13:04, Kevin Wolf wrote:
>>> Am 12.08.2019 um 20:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> BDRV_BLOCK_RAW makes generic bdrv_co_block_status to fallthrough to
>>>> returned file. But is it correct behavior at all? If returned file
>>>> itself has a backing file, we may report as totally unallocated and
>>>> area which actually has data in bottom backing file.
>>>>
>>>> So, mirroring of qcow2 under raw-format is broken. Which is illustrated
>>>> by following commit with a test. Let's make raw-format behave more
>>>> correctly returning BDRV_BLOCK_DATA.
>>>>
>>>> Suggested-by: Max Reitz <mreitz@redhat.com>
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>
>>> After some reading, I think I came to the conclusion that RAW is the
>>> correct thing to do. There is indeed a problem, but this patch is trying
>>> to fix it in the wrong place.
>>>
>>> In the case where the backing file contains some data, and we have a
>>> 'raw' node above the qcow2 overlay node, the content of the respective
>>> block is not defined by the queried backing file layer, so it is
>>> completely correct that bdrv_is_allocated() returns false,like it would
>>> if you queried the qcow2 layer directly.
>>
>> I disagree.  The queried backing file layer is the raw node.  As I said,
>> in my opinion raw nodes are not filter nodes, neither in behavior (they
>> have an offset option), nor in how they are generally used (as a format).
>>
>> The raw format does not support backing files.  Therefore, everything on
>> a raw node is allocated.
>>
>> (That is, like, my opinion.)
>>
>>>                                          If it returned true, we would
>>> copy everything, which isn't right either (the test cases should may add
>>> the qemu-img map output of the target so this becomes visible).
>>
>> It is right.
> 
> So we don't even agree what mirroring the raw node should even mean.
> 
> I can the see your point when you say that the raw node has no backing
> file, so everything should be copied. But I can also see the point that
> the raw node can really just be used as a filter that limits the data
> exposed from the qcow2 layer, and you want to keep the copy a COW
> overlay over the same backing file.

But it is not a filter.  If you limit the data by using an offset, that
precisely makes it not a filter.

> Both are valid use cases in principle and there is no single right or
> wrong.

I don’t get what you mean because raw is not a filter.

If it were a filter, it’d be clear: Regarding allocation information, we
skip it.  But it isn’t.  Hence we cannot skip it.

I don’t see why you’re trying to make the point with raw, when it simply
is not a filter.


> We don't currently support the latter use case because we have only
> sync=full or sync=top, but if you could specify a base node instead, we
> could probably suport the case without all of the special-casing filter
> nodes and backing file childs.

This sounds like it’d be difficult to special-case filter nodes.  It isn’t.

Whenever the user specifies some node that should refer to a backing
chain element, all you do is call bdrv_skip_rw_filters() and you land at
the corresponding COW node.

> You would call bdrv_co_block_status_above() with the right base node and
> it would just recurse whereever the data is stored, be it bs->backing,
> bs->file or even driver-specific children. This would allow you to find
> out whether some block in the top node came from the base node that
> we're going to keep. If yes, skip it; if no, copy it.
> 
> This way we wouldn't have to decide whether raw is a filter or not,

Well, it isn’t.

> because it wouldn't make a difference. The behaviour would only depend
> on the base node given by the user. If you specified the top-level qcow2
> file as the base, you get your full copy; if you specified the backing
> qcow2, you get the partial copy where the target still uses the same
> backing file.

I simply do not see your point.

For two reasons:

(1) I don’t think anyone should use is_allocate on nodes that are not
COW nodes.  They are likely to be doing something wrong then.

You don’t seem to agree, because you say this makes the callers
complicated.  What I think is that we don’t have that many callers, and
it’s probably a good thing when they have to think about the backing
chain structure.

But anyway.

(2) What I understand is that you propose adding some way to find out
the next element in the COW chain if is_allocated returns false.  Well,
my “Deal with filters” series adds two ways to do that:

- bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs)) will return the first
child behind the COW node, but that may be another filter.

- bdrv_backing_chain_next(bs) will return “the first non-filter backing
image of the first non-filter image.”.  It’s just
bdrv_skip_rw_filters(bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs))).

So the thing is, we don’t need to modify block_status to return that
information.  If is_allocated returns that something is not allocated,
the caller can just call one of these and thus get the next node where
to look.

When it comes to bdrv_is_allocated_above(), my series fixes that anyway,
no?  (In the sense that it does not choke on filters.)


So I cannot see how it does not come down to the callers.

Max
Vladimir Sementsov-Ogievskiy Aug. 13, 2019, 4:32 p.m. UTC | #15
13.08.2019 19:08, Kevin Wolf wrote:
> Am 13.08.2019 um 17:54 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 13.08.2019 18:41, Kevin Wolf wrote:
>>> Am 13.08.2019 um 16:43 hat Max Reitz geschrieben:
>>>> On 13.08.19 13:04, Kevin Wolf wrote:
>>>>> Am 12.08.2019 um 20:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>>> BDRV_BLOCK_RAW makes generic bdrv_co_block_status to fallthrough to
>>>>>> returned file. But is it correct behavior at all? If returned file
>>>>>> itself has a backing file, we may report as totally unallocated and
>>>>>> area which actually has data in bottom backing file.
>>>>>>
>>>>>> So, mirroring of qcow2 under raw-format is broken. Which is illustrated
>>>>>> by following commit with a test. Let's make raw-format behave more
>>>>>> correctly returning BDRV_BLOCK_DATA.
>>>>>>
>>>>>> Suggested-by: Max Reitz <mreitz@redhat.com>
>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>
>>>>> After some reading, I think I came to the conclusion that RAW is the
>>>>> correct thing to do. There is indeed a problem, but this patch is trying
>>>>> to fix it in the wrong place.
>>>>>
>>>>> In the case where the backing file contains some data, and we have a
>>>>> 'raw' node above the qcow2 overlay node, the content of the respective
>>>>> block is not defined by the queried backing file layer, so it is
>>>>> completely correct that bdrv_is_allocated() returns false,like it would
>>>>> if you queried the qcow2 layer directly.
>>>>
>>>> I disagree.  The queried backing file layer is the raw node.  As I said,
>>>> in my opinion raw nodes are not filter nodes, neither in behavior (they
>>>> have an offset option), nor in how they are generally used (as a format).
>>>>
>>>> The raw format does not support backing files.  Therefore, everything on
>>>> a raw node is allocated.
>>>>
>>>> (That is, like, my opinion.)
>>>>
>>>>>                                            If it returned true, we would
>>>>> copy everything, which isn't right either (the test cases should may add
>>>>> the qemu-img map output of the target so this becomes visible).
>>>>
>>>> It is right.
>>>
>>> So we don't even agree what mirroring the raw node should even mean.
>>>
>>> I can the see your point when you say that the raw node has no backing
>>> file, so everything should be copied. But I can also see the point that
>>> the raw node can really just be used as a filter that limits the data
>>> exposed from the qcow2 layer, and you want to keep the copy a COW
>>> overlay over the same backing file.
>>>
>>> Both are valid use cases in principle and there is no single right or
>>> wrong.
>>>
>>> We don't currently support the latter use case because we have only
>>> sync=full or sync=top, but if you could specify a base node instead, we
>>> could probably suport the case without all of the special-casing filter
>>> nodes and backing file childs.
>>>
>>> You would call bdrv_co_block_status_above() with the right base node and
>>> it would just recurse whereever the data is stored, be it bs->backing,
>>> bs->file or even driver-specific children. This would allow you to find
>>> out whether some block in the top node came from the base node that
>>> we're going to keep. If yes, skip it; if no, copy it.
>>>
>>> This way we wouldn't have to decide whether raw is a filter or not,
>>> because it wouldn't make a difference. The behaviour would only depend
>>> on the base node given by the user. If you specified the top-level qcow2
>>> file as the base, you get your full copy;
>>
>> ahm, full-copy = base is NULL..
> 
> Oops, yes, of course. Using the top-level node would create an empty
> "copy".
> 
>>> if you specified the backing
>>> qcow2, you get the partial copy where the target still uses the same
>>> backing file.
>>>
>>> (Hm... It would only actually work if the offsets stay the same in the
>>> chain, which is true for backing file children, but not necessarily for
>>> other children.
>>
>> Don't follow, what you mean by offsets stay the same and what is wrong
>> with it?
> 
> Say we have this graph:
> 
> raw,offset=65536
>      |
>      v
>    qcow2-----+
>      |       |
>      v       v
>    file     base
> 
> Now you can't just mirror the raw node into a target.qcow2 that shares
> base as the backing file, because the offsets will be wrong. In order to
> use such a copy correctly, you'd have to use a raw node again in the
> backing chain:
> 
> target.qcow2----+
>      |           |
>      v           v
>    file      raw,offset=65536
>                  |
>                  v
>                base
> 
> So the case where offsets differ between the top and the base node isn't
> trivial.

Understand, but for me it don't look like the thing that behaves in unexpected
for user way, on the contrary, it seems obvious that it will not work, as user
understand what is backing file (offsets are backed by corresponding offsets)

> 
> (If this case isn't complicated enough yet, imagine passing file as the
> base node instead... It just can't work.)
> 

Yes, if we chose the way you proposed we have a possibility to use any node as a base,
but again this does something completely different from usual "top" or "based" mode..
So it's just a new possibility, may be useless, it don't break up the concept.


I like the idea of generic block_status_above that you propose, but still there should
a decision of what exactly DATA, ZERO and ALLOCATED means. Ok, assume we don't need ALLOCATED..
Then we don't need DATA too?

And how block_status_above would work? If node don't report ZERO and don't report DATA but report
*file pointer == base, we should stop, and don't go to *file, it seems obvious..

But what if node reports ZERO together with *file pointer == base? Should consider this region
belonging to base and UNALLOCATED and not copy it, or we should consider it ALLOCATED because
current node reports it ZERO? So, criteria of "ALLOCATED" or "where should we stop in recursion"
is not obvious.. Do you have one?
Vladimir Sementsov-Ogievskiy Aug. 14, 2019, 6:27 a.m. UTC | #16
13 авг. 2019 г. 19:32 пользователь Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> написал:

13.08.2019 19:08, Kevin Wolf wrote:
> Am 13.08.2019 um 17:54 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 13.08.2019 18:41, Kevin Wolf wrote:
>>> Am 13.08.2019 um 16:43 hat Max Reitz geschrieben:
>>>> On 13.08.19 13:04, Kevin Wolf wrote:
>>>>> Am 12.08.2019 um 20:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>>> BDRV_BLOCK_RAW makes generic bdrv_co_block_status to fallthrough to
>>>>>> returned file. But is it correct behavior at all? If returned file
>>>>>> itself has a backing file, we may report as totally unallocated and
>>>>>> area which actually has data in bottom backing file.
>>>>>>
>>>>>> So, mirroring of qcow2 under raw-format is broken. Which is illustrated
>>>>>> by following commit with a test. Let's make raw-format behave more
>>>>>> correctly returning BDRV_BLOCK_DATA.
>>>>>>
>>>>>> Suggested-by: Max Reitz <mreitz@redhat.com>
>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>
>>>>> After some reading, I think I came to the conclusion that RAW is the
>>>>> correct thing to do. There is indeed a problem, but this patch is trying
>>>>> to fix it in the wrong place.
>>>>>
>>>>> In the case where the backing file contains some data, and we have a
>>>>> 'raw' node above the qcow2 overlay node, the content of the respective
>>>>> block is not defined by the queried backing file layer, so it is
>>>>> completely correct that bdrv_is_allocated() returns false,like it would
>>>>> if you queried the qcow2 layer directly.
>>>>
>>>> I disagree.  The queried backing file layer is the raw node.  As I said,
>>>> in my opinion raw nodes are not filter nodes, neither in behavior (they
>>>> have an offset option), nor in how they are generally used (as a format).
>>>>
>>>> The raw format does not support backing files.  Therefore, everything on
>>>> a raw node is allocated.
>>>>
>>>> (That is, like, my opinion.)
>>>>
>>>>>                                            If it returned true, we would
>>>>> copy everything, which isn't right either (the test cases should may add
>>>>> the qemu-img map output of the target so this becomes visible).
>>>>
>>>> It is right.
>>>
>>> So we don't even agree what mirroring the raw node should even mean.
>>>
>>> I can the see your point when you say that the raw node has no backing
>>> file, so everything should be copied. But I can also see the point that
>>> the raw node can really just be used as a filter that limits the data
>>> exposed from the qcow2 layer, and you want to keep the copy a COW
>>> overlay over the same backing file.
>>>
>>> Both are valid use cases in principle and there is no single right or
>>> wrong.
>>>
>>> We don't currently support the latter use case because we have only
>>> sync=full or sync=top, but if you could specify a base node instead, we
>>> could probably suport the case without all of the special-casing filter
>>> nodes and backing file childs.
>>>
>>> You would call bdrv_co_block_status_above() with the right base node and
>>> it would just recurse whereever the data is stored, be it bs->backing,
>>> bs->file or even driver-specific children. This would allow you to find
>>> out whether some block in the top node came from the base node that
>>> we're going to keep. If yes, skip it; if no, copy it.
>>>
>>> This way we wouldn't have to decide whether raw is a filter or not,
>>> because it wouldn't make a difference. The behaviour would only depend
>>> on the base node given by the user. If you specified the top-level qcow2
>>> file as the base, you get your full copy;
>>
>> ahm, full-copy = base is NULL..
>
> Oops, yes, of course. Using the top-level node would create an empty
> "copy".
>
>>> if you specified the backing
>>> qcow2, you get the partial copy where the target still uses the same
>>> backing file.
>>>
>>> (Hm... It would only actually work if the offsets stay the same in the
>>> chain, which is true for backing file children, but not necessarily for
>>> other children.
>>
>> Don't follow, what you mean by offsets stay the same and what is wrong
>> with it?
>
> Say we have this graph:
>
> raw,offset=65536
>      |
>      v
>    qcow2-----+
>      |       |
>      v       v
>    file     base
>
> Now you can't just mirror the raw node into a target.qcow2 that shares
> base as the backing file, because the offsets will be wrong. In order to
> use such a copy correctly, you'd have to use a raw node again in the
> backing chain:
>
> target.qcow2----+
>      |           |
>      v           v
>    file      raw,offset=65536
>                  |
>                  v
>                base
>
> So the case where offsets differ between the top and the base node isn't
> trivial.

Understand, but for me it don't look like the thing that behaves in unexpected
for user way, on the contrary, it seems obvious that it will not work, as user
understand what is backing file (offsets are backed by corresponding offsets)

>
> (If this case isn't complicated enough yet, imagine passing file as the
> base node instead... It just can't work.)
>

Yes, if we chose the way you proposed we have a possibility to use any node as a base,
but again this does something completely different from usual "top" or "based" mode..
So it's just a new possibility, may be useless, it don't break up the concept.


I like the idea of generic block_status_above that you propose, but still there should
a decision of what exactly DATA, ZERO and ALLOCATED means. Ok, assume we don't need ALLOCATED..
Then we don't need DATA too?

And how block_status_above would work? If node don't report ZERO and don't report DATA but report
*file pointer == base, we should stop, and don't go to *file, it seems obvious..

But what if node reports ZERO together with *file pointer == base? Should consider this region
belonging to base and UNALLOCATED and not copy it, or we should consider it ALLOCATED because
current node reports it ZERO? So, criteria of "ALLOCATED" or "where should we stop in recursion"
is not obvious.. Do you have one?

And now I see a drawback of full recursion as Kevin proposes: it's slower when we are interested in allocation. We don't need to recurse through file* backing chain, if *file is not backing, as we already know that data is allocated.
diff mbox series

Patch

diff --git a/block/raw-format.c b/block/raw-format.c
index bffd424dd0..a273ee2387 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -275,7 +275,7 @@  static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
     *pnum = bytes;
     *file = bs->file->bs;
     *map = offset + s->offset;
-    return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
+    return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | BDRV_BLOCK_RECURSE;
 }
 
 static int coroutine_fn raw_co_pwrite_zeroes(BlockDriverState *bs,