diff mbox

[RFC,v2,5/6] qcow2: implement bdrv_preallocate

Message ID d75f71c79a4ebe8a11083eef03ed78f72fef4271.1385518315.git.hutao@cn.fujitsu.com
State New
Headers show

Commit Message

Hu Tao Nov. 27, 2013, 2:15 a.m. UTC
Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
---
 block/qcow2.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Fam Zheng Nov. 27, 2013, 3:01 a.m. UTC | #1
On 2013年11月27日 10:15, Hu Tao wrote:
> Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
> ---
>   block/qcow2.c | 7 +++++++
>   1 file changed, 7 insertions(+)
>
> diff --git a/block/qcow2.c b/block/qcow2.c
> index b054a01..a23fade 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2180,6 +2180,12 @@ static int qcow2_amend_options(BlockDriverState *bs,
>       return 0;
>   }
>
> +static int qcow2_preallocate(BlockDriverState *bs, int64_t offset,
> +                             int64_t length)
> +{
> +    return bdrv_preallocate(bs->file, offset, length);
> +}
> +

What's the semantics of .bdrv_preallocate? I think you should map 
[offset, offset + length) to clusters in image file, and then forward to 
bs->file, rather than this direct wrapper.

E.g. bdrv_preallocate(qcow2_bs, 0, cluster_size) should call 
bdrv_preallocate(qcow2_bs->file, offset_off_first_cluster, cluster_size).

Fam

>   static QEMUOptionParameter qcow2_create_options[] = {
>       {
>           .name = BLOCK_OPT_SIZE,
> @@ -2234,6 +2240,7 @@ static BlockDriver bdrv_qcow2 = {
>       .bdrv_reopen_prepare  = qcow2_reopen_prepare,
>       .bdrv_create        = qcow2_create,
>       .bdrv_has_zero_init = bdrv_has_zero_init_1,
> +    .bdrv_preallocate   = qcow2_preallocate,
>       .bdrv_co_get_block_status = qcow2_co_get_block_status,
>       .bdrv_set_key       = qcow2_set_key,
>       .bdrv_make_empty    = qcow2_make_empty,
>
Hu Tao Nov. 27, 2013, 6:01 a.m. UTC | #2
On Wed, Nov 27, 2013 at 11:01:23AM +0800, Fam Zheng wrote:
> On 2013年11月27日 10:15, Hu Tao wrote:
> >Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
> >---
> >  block/qcow2.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> >diff --git a/block/qcow2.c b/block/qcow2.c
> >index b054a01..a23fade 100644
> >--- a/block/qcow2.c
> >+++ b/block/qcow2.c
> >@@ -2180,6 +2180,12 @@ static int qcow2_amend_options(BlockDriverState *bs,
> >      return 0;
> >  }
> >
> >+static int qcow2_preallocate(BlockDriverState *bs, int64_t offset,
> >+                             int64_t length)
> >+{
> >+    return bdrv_preallocate(bs->file, offset, length);
> >+}
> >+
> 
> What's the semantics of .bdrv_preallocate? I think you should map
> [offset, offset + length) to clusters in image file, and then
> forward to bs->file, rather than this direct wrapper.
> 
> E.g. bdrv_preallocate(qcow2_bs, 0, cluster_size) should call
> bdrv_preallocate(qcow2_bs->file, offset_off_first_cluster,
> cluster_size).

You mean data clusters here, right? Is there a single function to get
the offset of the first data cluster?
Fam Zheng Nov. 27, 2013, 6:40 a.m. UTC | #3
On 2013年11月27日 14:01, Hu Tao wrote:
> On Wed, Nov 27, 2013 at 11:01:23AM +0800, Fam Zheng wrote:
>> On 2013年11月27日 10:15, Hu Tao wrote:
>>> Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
>>> ---
>>>   block/qcow2.c | 7 +++++++
>>>   1 file changed, 7 insertions(+)
>>>
>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>> index b054a01..a23fade 100644
>>> --- a/block/qcow2.c
>>> +++ b/block/qcow2.c
>>> @@ -2180,6 +2180,12 @@ static int qcow2_amend_options(BlockDriverState *bs,
>>>       return 0;
>>>   }
>>>
>>> +static int qcow2_preallocate(BlockDriverState *bs, int64_t offset,
>>> +                             int64_t length)
>>> +{
>>> +    return bdrv_preallocate(bs->file, offset, length);
>>> +}
>>> +
>>
>> What's the semantics of .bdrv_preallocate? I think you should map
>> [offset, offset + length) to clusters in image file, and then
>> forward to bs->file, rather than this direct wrapper.
>>
>> E.g. bdrv_preallocate(qcow2_bs, 0, cluster_size) should call
>> bdrv_preallocate(qcow2_bs->file, offset_off_first_cluster,
>> cluster_size).
>
> You mean data clusters here, right? Is there a single function to get
> the offset of the first data cluster?
>

There is a function, qcow2_get_cluster_offset.

Fam
Peter Lieven Nov. 27, 2013, 10:03 a.m. UTC | #4
Am 27.11.2013 07:40, schrieb Fam Zheng:
> On 2013年11月27日 14:01, Hu Tao wrote:
>> On Wed, Nov 27, 2013 at 11:01:23AM +0800, Fam Zheng wrote:
>>> On 2013年11月27日 10:15, Hu Tao wrote:
>>>> Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
>>>> ---
>>>>   block/qcow2.c | 7 +++++++
>>>>   1 file changed, 7 insertions(+)
>>>>
>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>> index b054a01..a23fade 100644
>>>> --- a/block/qcow2.c
>>>> +++ b/block/qcow2.c
>>>> @@ -2180,6 +2180,12 @@ static int qcow2_amend_options(BlockDriverState *bs,
>>>>       return 0;
>>>>   }
>>>>
>>>> +static int qcow2_preallocate(BlockDriverState *bs, int64_t offset,
>>>> +                             int64_t length)
>>>> +{
>>>> +    return bdrv_preallocate(bs->file, offset, length);
>>>> +}
>>>> +
>>>
>>> What's the semantics of .bdrv_preallocate? I think you should map
>>> [offset, offset + length) to clusters in image file, and then
>>> forward to bs->file, rather than this direct wrapper.
>>>
>>> E.g. bdrv_preallocate(qcow2_bs, 0, cluster_size) should call
>>> bdrv_preallocate(qcow2_bs->file, offset_off_first_cluster,
>>> cluster_size).
>>
>> You mean data clusters here, right? Is there a single function to get
>> the offset of the first data cluster?
>>
>
> There is a function, qcow2_get_cluster_offset.
This should return no valid offset as long as the cluster is not allocated.

I think you actually have to "write" all clusters of a qcow2 one by one.
Eventually this write could be an fallocate call instead of a zero write.

Peter
Fam Zheng Nov. 27, 2013, 10:07 a.m. UTC | #5
On 2013年11月27日 18:03, Peter Lieven wrote:
> Am 27.11.2013 07:40, schrieb Fam Zheng:
>> On 2013年11月27日 14:01, Hu Tao wrote:
>>> On Wed, Nov 27, 2013 at 11:01:23AM +0800, Fam Zheng wrote:
>>>> On 2013年11月27日 10:15, Hu Tao wrote:
>>>>> Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
>>>>> ---
>>>>>    block/qcow2.c | 7 +++++++
>>>>>    1 file changed, 7 insertions(+)
>>>>>
>>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>>> index b054a01..a23fade 100644
>>>>> --- a/block/qcow2.c
>>>>> +++ b/block/qcow2.c
>>>>> @@ -2180,6 +2180,12 @@ static int qcow2_amend_options(BlockDriverState *bs,
>>>>>        return 0;
>>>>>    }
>>>>>
>>>>> +static int qcow2_preallocate(BlockDriverState *bs, int64_t offset,
>>>>> +                             int64_t length)
>>>>> +{
>>>>> +    return bdrv_preallocate(bs->file, offset, length);
>>>>> +}
>>>>> +
>>>>
>>>> What's the semantics of .bdrv_preallocate? I think you should map
>>>> [offset, offset + length) to clusters in image file, and then
>>>> forward to bs->file, rather than this direct wrapper.
>>>>
>>>> E.g. bdrv_preallocate(qcow2_bs, 0, cluster_size) should call
>>>> bdrv_preallocate(qcow2_bs->file, offset_off_first_cluster,
>>>> cluster_size).
>>>
>>> You mean data clusters here, right? Is there a single function to get
>>> the offset of the first data cluster?
>>>
>>
>> There is a function, qcow2_get_cluster_offset.
> This should return no valid offset as long as the cluster is not allocated.
>
> I think you actually have to "write" all clusters of a qcow2 one by one.
> Eventually this write could be an fallocate call instead of a zero write.
>

Yes, I was wrong about qcow2_get_cluster_offset. The logic here is more 
like cluster allocation in qcow2_alloc_cluster_offset. Maybe we can 
reuse that.

Fam
Peter Lieven Nov. 27, 2013, 10:13 a.m. UTC | #6
Am 27.11.2013 11:07, schrieb Fam Zheng:
> On 2013年11月27日 18:03, Peter Lieven wrote:
>> Am 27.11.2013 07:40, schrieb Fam Zheng:
>>> On 2013年11月27日 14:01, Hu Tao wrote:
>>>> On Wed, Nov 27, 2013 at 11:01:23AM +0800, Fam Zheng wrote:
>>>>> On 2013年11月27日 10:15, Hu Tao wrote:
>>>>>> Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
>>>>>> ---
>>>>>>    block/qcow2.c | 7 +++++++
>>>>>>    1 file changed, 7 insertions(+)
>>>>>>
>>>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>>>> index b054a01..a23fade 100644
>>>>>> --- a/block/qcow2.c
>>>>>> +++ b/block/qcow2.c
>>>>>> @@ -2180,6 +2180,12 @@ static int qcow2_amend_options(BlockDriverState *bs,
>>>>>>        return 0;
>>>>>>    }
>>>>>>
>>>>>> +static int qcow2_preallocate(BlockDriverState *bs, int64_t offset,
>>>>>> +                             int64_t length)
>>>>>> +{
>>>>>> +    return bdrv_preallocate(bs->file, offset, length);
>>>>>> +}
>>>>>> +
>>>>>
>>>>> What's the semantics of .bdrv_preallocate? I think you should map
>>>>> [offset, offset + length) to clusters in image file, and then
>>>>> forward to bs->file, rather than this direct wrapper.
>>>>>
>>>>> E.g. bdrv_preallocate(qcow2_bs, 0, cluster_size) should call
>>>>> bdrv_preallocate(qcow2_bs->file, offset_off_first_cluster,
>>>>> cluster_size).
>>>>
>>>> You mean data clusters here, right? Is there a single function to get
>>>> the offset of the first data cluster?
>>>>
>>>
>>> There is a function, qcow2_get_cluster_offset.
>> This should return no valid offset as long as the cluster is not allocated.
>>
>> I think you actually have to "write" all clusters of a qcow2 one by one.
>> Eventually this write could be an fallocate call instead of a zero write.
>>
>
> Yes, I was wrong about qcow2_get_cluster_offset. The logic here is more like cluster allocation in qcow2_alloc_cluster_offset. Maybe we can reuse that.
What I don't like about the preallocation is that we would loose the information that a cluster contains no valid data and would read it e.g. during
conversion.

I think what we want is a preallocated image with all clusters sequentally mapped into the qcow2 file. Preallocate all the cluster extends, but still
have the information in the table that the cluster in fact has no valid data. So we would need a valid cluster offset while still haveing the
flag that the cluster is unallocated. I think this would require thoughtfully checking all the cluster functions if they can easily cope with this.

The quetion is Hu, what do you want to achieve? Do you want that the space on the filesystem is preallocated so you can't overcommit or
do you also want a sequential mapping of all the clusters into the file?

Peter
Hu Tao Nov. 28, 2013, 8:48 a.m. UTC | #7
On Wed, Nov 27, 2013 at 11:13:40AM +0100, Peter Lieven wrote:
> Am 27.11.2013 11:07, schrieb Fam Zheng:
> > On 2013年11月27日 18:03, Peter Lieven wrote:
> >> Am 27.11.2013 07:40, schrieb Fam Zheng:
> >>> On 2013年11月27日 14:01, Hu Tao wrote:
> >>>> On Wed, Nov 27, 2013 at 11:01:23AM +0800, Fam Zheng wrote:
> >>>>> On 2013年11月27日 10:15, Hu Tao wrote:
> >>>>>> Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
> >>>>>> ---
> >>>>>>    block/qcow2.c | 7 +++++++
> >>>>>>    1 file changed, 7 insertions(+)
> >>>>>>
> >>>>>> diff --git a/block/qcow2.c b/block/qcow2.c
> >>>>>> index b054a01..a23fade 100644
> >>>>>> --- a/block/qcow2.c
> >>>>>> +++ b/block/qcow2.c
> >>>>>> @@ -2180,6 +2180,12 @@ static int qcow2_amend_options(BlockDriverState *bs,
> >>>>>>        return 0;
> >>>>>>    }
> >>>>>>
> >>>>>> +static int qcow2_preallocate(BlockDriverState *bs, int64_t offset,
> >>>>>> +                             int64_t length)
> >>>>>> +{
> >>>>>> +    return bdrv_preallocate(bs->file, offset, length);
> >>>>>> +}
> >>>>>> +
> >>>>>
> >>>>> What's the semantics of .bdrv_preallocate? I think you should map
> >>>>> [offset, offset + length) to clusters in image file, and then
> >>>>> forward to bs->file, rather than this direct wrapper.
> >>>>>
> >>>>> E.g. bdrv_preallocate(qcow2_bs, 0, cluster_size) should call
> >>>>> bdrv_preallocate(qcow2_bs->file, offset_off_first_cluster,
> >>>>> cluster_size).
> >>>>
> >>>> You mean data clusters here, right? Is there a single function to get
> >>>> the offset of the first data cluster?
> >>>>
> >>>
> >>> There is a function, qcow2_get_cluster_offset.
> >> This should return no valid offset as long as the cluster is not allocated.
> >>
> >> I think you actually have to "write" all clusters of a qcow2 one by one.
> >> Eventually this write could be an fallocate call instead of a zero write.
> >>
> >
> > Yes, I was wrong about qcow2_get_cluster_offset. The logic here is more like cluster allocation in qcow2_alloc_cluster_offset. Maybe we can reuse that.
> What I don't like about the preallocation is that we would loose the information that a cluster contains no valid data and would read it e.g. during
> conversion.

So the information is stored in table and you mean we shouldn't clear
table when do preallocation? I'm not sure how the information could be
useful on a newly-created image, but it seems ideal to keep informations
in table.

> 
> I think what we want is a preallocated image with all clusters sequentally mapped into the qcow2 file. Preallocate all the cluster extends, but still
> have the information in the table that the cluster in fact has no valid data. So we would need a valid cluster offset while still haveing the
> flag that the cluster is unallocated. I think this would require thoughtfully checking all the cluster functions if they can easily cope with this.
> 
> The quetion is Hu, what do you want to achieve? Do you want that the space on the filesystem is preallocated so you can't overcommit or
> do you also want a sequential mapping of all the clusters into the file?

The goal is to avoid sparse file as it can cause performance problem. So
the first one. I'm not sure about the second but IIUC, one fallocate()
is enough for all clusters if they are sequentially mapped.
Peter Lieven Nov. 28, 2013, 10:03 a.m. UTC | #8
On 28.11.2013 09:48, Hu Tao wrote:
> On Wed, Nov 27, 2013 at 11:13:40AM +0100, Peter Lieven wrote:
>> Am 27.11.2013 11:07, schrieb Fam Zheng:
>>> On 2013年11月27日 18:03, Peter Lieven wrote:
>>>> Am 27.11.2013 07:40, schrieb Fam Zheng:
>>>>> On 2013年11月27日 14:01, Hu Tao wrote:
>>>>>> On Wed, Nov 27, 2013 at 11:01:23AM +0800, Fam Zheng wrote:
>>>>>>> On 2013年11月27日 10:15, Hu Tao wrote:
>>>>>>>> Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
>>>>>>>> ---
>>>>>>>>     block/qcow2.c | 7 +++++++
>>>>>>>>     1 file changed, 7 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>>>>>> index b054a01..a23fade 100644
>>>>>>>> --- a/block/qcow2.c
>>>>>>>> +++ b/block/qcow2.c
>>>>>>>> @@ -2180,6 +2180,12 @@ static int qcow2_amend_options(BlockDriverState *bs,
>>>>>>>>         return 0;
>>>>>>>>     }
>>>>>>>>
>>>>>>>> +static int qcow2_preallocate(BlockDriverState *bs, int64_t offset,
>>>>>>>> +                             int64_t length)
>>>>>>>> +{
>>>>>>>> +    return bdrv_preallocate(bs->file, offset, length);
>>>>>>>> +}
>>>>>>>> +
>>>>>>> What's the semantics of .bdrv_preallocate? I think you should map
>>>>>>> [offset, offset + length) to clusters in image file, and then
>>>>>>> forward to bs->file, rather than this direct wrapper.
>>>>>>>
>>>>>>> E.g. bdrv_preallocate(qcow2_bs, 0, cluster_size) should call
>>>>>>> bdrv_preallocate(qcow2_bs->file, offset_off_first_cluster,
>>>>>>> cluster_size).
>>>>>> You mean data clusters here, right? Is there a single function to get
>>>>>> the offset of the first data cluster?
>>>>>>
>>>>> There is a function, qcow2_get_cluster_offset.
>>>> This should return no valid offset as long as the cluster is not allocated.
>>>>
>>>> I think you actually have to "write" all clusters of a qcow2 one by one.
>>>> Eventually this write could be an fallocate call instead of a zero write.
>>>>
>>> Yes, I was wrong about qcow2_get_cluster_offset. The logic here is more like cluster allocation in qcow2_alloc_cluster_offset. Maybe we can reuse that.
>> What I don't like about the preallocation is that we would loose the information that a cluster contains no valid data and would read it e.g. during
>> conversion.
> So the information is stored in table and you mean we shouldn't clear
> table when do preallocation? I'm not sure how the information could be
> useful on a newly-created image, but it seems ideal to keep informations
> in table.
When you want to e.g. convert this qcow2 later the performance is lower than needed because
you read all those preallocated sectors altough you could now they are empty.
>
>> I think what we want is a preallocated image with all clusters sequentally mapped into the qcow2 file. Preallocate all the cluster extends, but still
>> have the information in the table that the cluster in fact has no valid data. So we would need a valid cluster offset while still haveing the
>> flag that the cluster is unallocated. I think this would require thoughtfully checking all the cluster functions if they can easily cope with this.
>>
>> The quetion is Hu, what do you want to achieve? Do you want that the space on the filesystem is preallocated so you can't overcommit or
>> do you also want a sequential mapping of all the clusters into the file?
> The goal is to avoid sparse file as it can cause performance problem. So
> the first one. I'm not sure about the second but IIUC, one fallocate()
> is enough for all clusters if they are sequentially mapped.
If you do not premap them they are allocated in the order they are written.
So if you are going to preallocate the whole file anyway, you should sequentally map all clusters into the file
AND still keep the information that they are in fact not yet written.

Peter
Hu Tao Dec. 11, 2013, 7:33 a.m. UTC | #9
On Thu, Nov 28, 2013 at 11:03:04AM +0100, Peter Lieven wrote:
> On 28.11.2013 09:48, Hu Tao wrote:
> >On Wed, Nov 27, 2013 at 11:13:40AM +0100, Peter Lieven wrote:
> >>Am 27.11.2013 11:07, schrieb Fam Zheng:
> >>>On 2013年11月27日 18:03, Peter Lieven wrote:
> >>>>Am 27.11.2013 07:40, schrieb Fam Zheng:
> >>>>>On 2013年11月27日 14:01, Hu Tao wrote:
> >>>>>>On Wed, Nov 27, 2013 at 11:01:23AM +0800, Fam Zheng wrote:
> >>>>>>>On 2013年11月27日 10:15, Hu Tao wrote:
> >>>>>>>>Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
> >>>>>>>>---
> >>>>>>>>    block/qcow2.c | 7 +++++++
> >>>>>>>>    1 file changed, 7 insertions(+)
> >>>>>>>>
> >>>>>>>>diff --git a/block/qcow2.c b/block/qcow2.c
> >>>>>>>>index b054a01..a23fade 100644
> >>>>>>>>--- a/block/qcow2.c
> >>>>>>>>+++ b/block/qcow2.c
> >>>>>>>>@@ -2180,6 +2180,12 @@ static int qcow2_amend_options(BlockDriverState *bs,
> >>>>>>>>        return 0;
> >>>>>>>>    }
> >>>>>>>>
> >>>>>>>>+static int qcow2_preallocate(BlockDriverState *bs, int64_t offset,
> >>>>>>>>+                             int64_t length)
> >>>>>>>>+{
> >>>>>>>>+    return bdrv_preallocate(bs->file, offset, length);
> >>>>>>>>+}
> >>>>>>>>+
> >>>>>>>What's the semantics of .bdrv_preallocate? I think you should map
> >>>>>>>[offset, offset + length) to clusters in image file, and then
> >>>>>>>forward to bs->file, rather than this direct wrapper.
> >>>>>>>
> >>>>>>>E.g. bdrv_preallocate(qcow2_bs, 0, cluster_size) should call
> >>>>>>>bdrv_preallocate(qcow2_bs->file, offset_off_first_cluster,
> >>>>>>>cluster_size).
> >>>>>>You mean data clusters here, right? Is there a single function to get
> >>>>>>the offset of the first data cluster?
> >>>>>>
> >>>>>There is a function, qcow2_get_cluster_offset.
> >>>>This should return no valid offset as long as the cluster is not allocated.
> >>>>
> >>>>I think you actually have to "write" all clusters of a qcow2 one by one.
> >>>>Eventually this write could be an fallocate call instead of a zero write.
> >>>>
> >>>Yes, I was wrong about qcow2_get_cluster_offset. The logic here is more like cluster allocation in qcow2_alloc_cluster_offset. Maybe we can reuse that.
> >>What I don't like about the preallocation is that we would loose the information that a cluster contains no valid data and would read it e.g. during
> >>conversion.
> >So the information is stored in table and you mean we shouldn't clear
> >table when do preallocation? I'm not sure how the information could be
> >useful on a newly-created image, but it seems ideal to keep informations
> >in table.
> When you want to e.g. convert this qcow2 later the performance is lower than needed because
> you read all those preallocated sectors altough you could now they are empty.
> >
> >>I think what we want is a preallocated image with all clusters sequentally mapped into the qcow2 file. Preallocate all the cluster extends, but still
> >>have the information in the table that the cluster in fact has no valid data. So we would need a valid cluster offset while still haveing the
> >>flag that the cluster is unallocated. I think this would require thoughtfully checking all the cluster functions if they can easily cope with this.
> >>
> >>The quetion is Hu, what do you want to achieve? Do you want that the space on the filesystem is preallocated so you can't overcommit or
> >>do you also want a sequential mapping of all the clusters into the file?
> >The goal is to avoid sparse file as it can cause performance problem. So
> >the first one. I'm not sure about the second but IIUC, one fallocate()
> >is enough for all clusters if they are sequentially mapped.
> If you do not premap them they are allocated in the order they are written.
> So if you are going to preallocate the whole file anyway, you should sequentally map all clusters into the file
> AND still keep the information that they are in fact not yet written.

Can this be achieved by first fallocate() the disk file, then allocate
metadata? This way all metadata clusters are allocated before any data
clusters, leaving all data clusters at the end of file.
Hu Tao Dec. 16, 2013, 8:24 a.m. UTC | #10
On Wed, Dec 11, 2013 at 03:33:40PM +0800, Hu Tao wrote:
> On Thu, Nov 28, 2013 at 11:03:04AM +0100, Peter Lieven wrote:
> > On 28.11.2013 09:48, Hu Tao wrote:
> > >On Wed, Nov 27, 2013 at 11:13:40AM +0100, Peter Lieven wrote:
> > >>Am 27.11.2013 11:07, schrieb Fam Zheng:
> > >>>On 2013年11月27日 18:03, Peter Lieven wrote:
> > >>>>Am 27.11.2013 07:40, schrieb Fam Zheng:
> > >>>>>On 2013年11月27日 14:01, Hu Tao wrote:
> > >>>>>>On Wed, Nov 27, 2013 at 11:01:23AM +0800, Fam Zheng wrote:
> > >>>>>>>On 2013年11月27日 10:15, Hu Tao wrote:
> > >>>>>>>>Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
> > >>>>>>>>---
> > >>>>>>>>    block/qcow2.c | 7 +++++++
> > >>>>>>>>    1 file changed, 7 insertions(+)
> > >>>>>>>>
> > >>>>>>>>diff --git a/block/qcow2.c b/block/qcow2.c
> > >>>>>>>>index b054a01..a23fade 100644
> > >>>>>>>>--- a/block/qcow2.c
> > >>>>>>>>+++ b/block/qcow2.c
> > >>>>>>>>@@ -2180,6 +2180,12 @@ static int qcow2_amend_options(BlockDriverState *bs,
> > >>>>>>>>        return 0;
> > >>>>>>>>    }
> > >>>>>>>>
> > >>>>>>>>+static int qcow2_preallocate(BlockDriverState *bs, int64_t offset,
> > >>>>>>>>+                             int64_t length)
> > >>>>>>>>+{
> > >>>>>>>>+    return bdrv_preallocate(bs->file, offset, length);
> > >>>>>>>>+}
> > >>>>>>>>+
> > >>>>>>>What's the semantics of .bdrv_preallocate? I think you should map
> > >>>>>>>[offset, offset + length) to clusters in image file, and then
> > >>>>>>>forward to bs->file, rather than this direct wrapper.
> > >>>>>>>
> > >>>>>>>E.g. bdrv_preallocate(qcow2_bs, 0, cluster_size) should call
> > >>>>>>>bdrv_preallocate(qcow2_bs->file, offset_off_first_cluster,
> > >>>>>>>cluster_size).
> > >>>>>>You mean data clusters here, right? Is there a single function to get
> > >>>>>>the offset of the first data cluster?
> > >>>>>>
> > >>>>>There is a function, qcow2_get_cluster_offset.
> > >>>>This should return no valid offset as long as the cluster is not allocated.
> > >>>>
> > >>>>I think you actually have to "write" all clusters of a qcow2 one by one.
> > >>>>Eventually this write could be an fallocate call instead of a zero write.
> > >>>>
> > >>>Yes, I was wrong about qcow2_get_cluster_offset. The logic here is more like cluster allocation in qcow2_alloc_cluster_offset. Maybe we can reuse that.
> > >>What I don't like about the preallocation is that we would loose the information that a cluster contains no valid data and would read it e.g. during
> > >>conversion.
> > >So the information is stored in table and you mean we shouldn't clear
> > >table when do preallocation? I'm not sure how the information could be
> > >useful on a newly-created image, but it seems ideal to keep informations
> > >in table.
> > When you want to e.g. convert this qcow2 later the performance is lower than needed because
> > you read all those preallocated sectors altough you could now they are empty.
> > >
> > >>I think what we want is a preallocated image with all clusters sequentally mapped into the qcow2 file. Preallocate all the cluster extends, but still
> > >>have the information in the table that the cluster in fact has no valid data. So we would need a valid cluster offset while still haveing the
> > >>flag that the cluster is unallocated. I think this would require thoughtfully checking all the cluster functions if they can easily cope with this.
> > >>
> > >>The quetion is Hu, what do you want to achieve? Do you want that the space on the filesystem is preallocated so you can't overcommit or
> > >>do you also want a sequential mapping of all the clusters into the file?
> > >The goal is to avoid sparse file as it can cause performance problem. So
> > >the first one. I'm not sure about the second but IIUC, one fallocate()
> > >is enough for all clusters if they are sequentially mapped.
> > If you do not premap them they are allocated in the order they are written.
> > So if you are going to preallocate the whole file anyway, you should sequentally map all clusters into the file
> > AND still keep the information that they are in fact not yet written.
> 
> Can this be achieved by first fallocate() the disk file, then allocate
> metadata? This way all metadata clusters are allocated before any data
> clusters, leaving all data clusters at the end of file.

Any comments?
Fam Zheng Dec. 16, 2013, 9:21 a.m. UTC | #11
On 2013年12月11日 15:33, Hu Tao wrote:
> On Thu, Nov 28, 2013 at 11:03:04AM +0100, Peter Lieven wrote:
>> On 28.11.2013 09:48, Hu Tao wrote:
>>> On Wed, Nov 27, 2013 at 11:13:40AM +0100, Peter Lieven wrote:
>>>> Am 27.11.2013 11:07, schrieb Fam Zheng:
>>>>> On 2013年11月27日 18:03, Peter Lieven wrote:
>>>>>> Am 27.11.2013 07:40, schrieb Fam Zheng:
>>>>>>> On 2013年11月27日 14:01, Hu Tao wrote:
>>>>>>>> On Wed, Nov 27, 2013 at 11:01:23AM +0800, Fam Zheng wrote:
>>>>>>>>> On 2013年11月27日 10:15, Hu Tao wrote:
>>>>>>>>>> Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
>>>>>>>>>> ---
>>>>>>>>>>     block/qcow2.c | 7 +++++++
>>>>>>>>>>     1 file changed, 7 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>>>>>>>> index b054a01..a23fade 100644
>>>>>>>>>> --- a/block/qcow2.c
>>>>>>>>>> +++ b/block/qcow2.c
>>>>>>>>>> @@ -2180,6 +2180,12 @@ static int qcow2_amend_options(BlockDriverState *bs,
>>>>>>>>>>         return 0;
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>> +static int qcow2_preallocate(BlockDriverState *bs, int64_t offset,
>>>>>>>>>> +                             int64_t length)
>>>>>>>>>> +{
>>>>>>>>>> +    return bdrv_preallocate(bs->file, offset, length);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>> What's the semantics of .bdrv_preallocate? I think you should map
>>>>>>>>> [offset, offset + length) to clusters in image file, and then
>>>>>>>>> forward to bs->file, rather than this direct wrapper.
>>>>>>>>>
>>>>>>>>> E.g. bdrv_preallocate(qcow2_bs, 0, cluster_size) should call
>>>>>>>>> bdrv_preallocate(qcow2_bs->file, offset_off_first_cluster,
>>>>>>>>> cluster_size).
>>>>>>>> You mean data clusters here, right? Is there a single function to get
>>>>>>>> the offset of the first data cluster?
>>>>>>>>
>>>>>>> There is a function, qcow2_get_cluster_offset.
>>>>>> This should return no valid offset as long as the cluster is not allocated.
>>>>>>
>>>>>> I think you actually have to "write" all clusters of a qcow2 one by one.
>>>>>> Eventually this write could be an fallocate call instead of a zero write.
>>>>>>
>>>>> Yes, I was wrong about qcow2_get_cluster_offset. The logic here is more like cluster allocation in qcow2_alloc_cluster_offset. Maybe we can reuse that.
>>>> What I don't like about the preallocation is that we would loose the information that a cluster contains no valid data and would read it e.g. during
>>>> conversion.
>>> So the information is stored in table and you mean we shouldn't clear
>>> table when do preallocation? I'm not sure how the information could be
>>> useful on a newly-created image, but it seems ideal to keep informations
>>> in table.
>> When you want to e.g. convert this qcow2 later the performance is lower than needed because
>> you read all those preallocated sectors altough you could now they are empty.
>>>
>>>> I think what we want is a preallocated image with all clusters sequentally mapped into the qcow2 file. Preallocate all the cluster extends, but still
>>>> have the information in the table that the cluster in fact has no valid data. So we would need a valid cluster offset while still haveing the
>>>> flag that the cluster is unallocated. I think this would require thoughtfully checking all the cluster functions if they can easily cope with this.
>>>>
>>>> The quetion is Hu, what do you want to achieve? Do you want that the space on the filesystem is preallocated so you can't overcommit or
>>>> do you also want a sequential mapping of all the clusters into the file?
>>> The goal is to avoid sparse file as it can cause performance problem. So
>>> the first one. I'm not sure about the second but IIUC, one fallocate()
>>> is enough for all clusters if they are sequentially mapped.
>> If you do not premap them they are allocated in the order they are written.
>> So if you are going to preallocate the whole file anyway, you should sequentally map all clusters into the file
>> AND still keep the information that they are in fact not yet written.
>
> Can this be achieved by first fallocate() the disk file, then allocate
> metadata? This way all metadata clusters are allocated before any data
> clusters, leaving all data clusters at the end of file.
>

I think Peter means your need to sequentially map clusters into the 
file, so that sequential IO in guest is translated to sequential IO on 
the image file.

fallocate() or posix_fallocate() should work. You need to set zero flag 
on the allocated cluster when mapping it in L2, instead of actually 
writing zeros.

Fam
Hu Tao Dec. 17, 2013, 2:03 a.m. UTC | #12
On Mon, Dec 16, 2013 at 05:21:23PM +0800, Fam Zheng wrote:
> On 2013年12月11日 15:33, Hu Tao wrote:
> >On Thu, Nov 28, 2013 at 11:03:04AM +0100, Peter Lieven wrote:
> >>On 28.11.2013 09:48, Hu Tao wrote:
> >>>On Wed, Nov 27, 2013 at 11:13:40AM +0100, Peter Lieven wrote:
> >>>>Am 27.11.2013 11:07, schrieb Fam Zheng:
> >>>>>On 2013年11月27日 18:03, Peter Lieven wrote:
> >>>>>>Am 27.11.2013 07:40, schrieb Fam Zheng:
> >>>>>>>On 2013年11月27日 14:01, Hu Tao wrote:
> >>>>>>>>On Wed, Nov 27, 2013 at 11:01:23AM +0800, Fam Zheng wrote:
> >>>>>>>>>On 2013年11月27日 10:15, Hu Tao wrote:
> >>>>>>>>>>Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
> >>>>>>>>>>---
> >>>>>>>>>>    block/qcow2.c | 7 +++++++
> >>>>>>>>>>    1 file changed, 7 insertions(+)
> >>>>>>>>>>
> >>>>>>>>>>diff --git a/block/qcow2.c b/block/qcow2.c
> >>>>>>>>>>index b054a01..a23fade 100644
> >>>>>>>>>>--- a/block/qcow2.c
> >>>>>>>>>>+++ b/block/qcow2.c
> >>>>>>>>>>@@ -2180,6 +2180,12 @@ static int qcow2_amend_options(BlockDriverState *bs,
> >>>>>>>>>>        return 0;
> >>>>>>>>>>    }
> >>>>>>>>>>
> >>>>>>>>>>+static int qcow2_preallocate(BlockDriverState *bs, int64_t offset,
> >>>>>>>>>>+                             int64_t length)
> >>>>>>>>>>+{
> >>>>>>>>>>+    return bdrv_preallocate(bs->file, offset, length);
> >>>>>>>>>>+}
> >>>>>>>>>>+
> >>>>>>>>>What's the semantics of .bdrv_preallocate? I think you should map
> >>>>>>>>>[offset, offset + length) to clusters in image file, and then
> >>>>>>>>>forward to bs->file, rather than this direct wrapper.
> >>>>>>>>>
> >>>>>>>>>E.g. bdrv_preallocate(qcow2_bs, 0, cluster_size) should call
> >>>>>>>>>bdrv_preallocate(qcow2_bs->file, offset_off_first_cluster,
> >>>>>>>>>cluster_size).
> >>>>>>>>You mean data clusters here, right? Is there a single function to get
> >>>>>>>>the offset of the first data cluster?
> >>>>>>>>
> >>>>>>>There is a function, qcow2_get_cluster_offset.
> >>>>>>This should return no valid offset as long as the cluster is not allocated.
> >>>>>>
> >>>>>>I think you actually have to "write" all clusters of a qcow2 one by one.
> >>>>>>Eventually this write could be an fallocate call instead of a zero write.
> >>>>>>
> >>>>>Yes, I was wrong about qcow2_get_cluster_offset. The logic here is more like cluster allocation in qcow2_alloc_cluster_offset. Maybe we can reuse that.
> >>>>What I don't like about the preallocation is that we would loose the information that a cluster contains no valid data and would read it e.g. during
> >>>>conversion.
> >>>So the information is stored in table and you mean we shouldn't clear
> >>>table when do preallocation? I'm not sure how the information could be
> >>>useful on a newly-created image, but it seems ideal to keep informations
> >>>in table.
> >>When you want to e.g. convert this qcow2 later the performance is lower than needed because
> >>you read all those preallocated sectors altough you could now they are empty.
> >>>
> >>>>I think what we want is a preallocated image with all clusters sequentally mapped into the qcow2 file. Preallocate all the cluster extends, but still
> >>>>have the information in the table that the cluster in fact has no valid data. So we would need a valid cluster offset while still haveing the
> >>>>flag that the cluster is unallocated. I think this would require thoughtfully checking all the cluster functions if they can easily cope with this.
> >>>>
> >>>>The quetion is Hu, what do you want to achieve? Do you want that the space on the filesystem is preallocated so you can't overcommit or
> >>>>do you also want a sequential mapping of all the clusters into the file?
> >>>The goal is to avoid sparse file as it can cause performance problem. So
> >>>the first one. I'm not sure about the second but IIUC, one fallocate()
> >>>is enough for all clusters if they are sequentially mapped.
> >>If you do not premap them they are allocated in the order they are written.
> >>So if you are going to preallocate the whole file anyway, you should sequentally map all clusters into the file
> >>AND still keep the information that they are in fact not yet written.
> >
> >Can this be achieved by first fallocate() the disk file, then allocate
> >metadata? This way all metadata clusters are allocated before any data
> >clusters, leaving all data clusters at the end of file.
> >
> 
> I think Peter means your need to sequentially map clusters into the
> file, so that sequential IO in guest is translated to sequential IO
> on the image file.

The question is how to map data clusters sequentially. If I read code
correctly, qemu allocates clusters (containing data and metadata) on
demand.  So if we allocates all metadata clusters right after the
creation of image file, we can leave all left space allocated for data
laterly. Fix me.

> 
> fallocate() or posix_fallocate() should work. You need to set zero
> flag on the allocated cluster when mapping it in L2, instead of
> actually writing zeros.

You mean bit 0 of L2 entry by zero flag? But the doc says it is always 0
with version 2.
diff mbox

Patch

diff --git a/block/qcow2.c b/block/qcow2.c
index b054a01..a23fade 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2180,6 +2180,12 @@  static int qcow2_amend_options(BlockDriverState *bs,
     return 0;
 }
 
+static int qcow2_preallocate(BlockDriverState *bs, int64_t offset,
+                             int64_t length)
+{
+    return bdrv_preallocate(bs->file, offset, length);
+}
+
 static QEMUOptionParameter qcow2_create_options[] = {
     {
         .name = BLOCK_OPT_SIZE,
@@ -2234,6 +2240,7 @@  static BlockDriver bdrv_qcow2 = {
     .bdrv_reopen_prepare  = qcow2_reopen_prepare,
     .bdrv_create        = qcow2_create,
     .bdrv_has_zero_init = bdrv_has_zero_init_1,
+    .bdrv_preallocate   = qcow2_preallocate,
     .bdrv_co_get_block_status = qcow2_co_get_block_status,
     .bdrv_set_key       = qcow2_set_key,
     .bdrv_make_empty    = qcow2_make_empty,