
[V5] qemu-img: align result of is_allocated_sectors

Message ID 1531235146-6048-1-git-send-email-pl@kamp.de
State: New
Series: [V5] qemu-img: align result of is_allocated_sectors

Commit Message

Peter Lieven July 10, 2018, 3:05 p.m. UTC
We currently don't enforce that the sparse segments we detect during convert are
aligned. This leads to unnecessary and costly read-modify-write cycles either
internally in Qemu or in the background on the storage device as nearly all
modern filesystems or hardware have a 4k alignment internally.

This patch modifies is_allocated_sectors so that its *pnum result will always
end at an alignment boundary. This way all requests will end at an alignment
boundary. The start of all requests will also be aligned as long as the results
of get_block_status do not lead to an unaligned offset.

Converting an example image [1] to a raw device with a 4k sector size currently
causes about 4600 additional 4k read requests (RMW cycles) while performing a total
of about 15000 write requests. With this patch the additional 4600 read requests
are eliminated while the total number of write requests stays constant.

[1] https://cloud-images.ubuntu.com/releases/16.04/release/ubuntu-16.04-server-cloudimg-amd64-disk1.vmdk

Signed-off-by: Peter Lieven <pl@kamp.de>
---
V4->V5: - is_zero is a bool [Kevin]
        - treat zero areas as allocated if i <= tail to avoid *pnum underflow [Kevin]
V3->V4: - only focus on the end offset in is_allocated_sectors [Kevin]
V2->V3: - ensure that s.alignment is a power of 2
        - correctly handle n < alignment in is_allocated_sectors if
          sector_num % alignment > 0.
V1->V2: - take the current sector offset into account [Max]
        - try to figure out the target alignment [Max]

 qemu-img.c | 44 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 38 insertions(+), 6 deletions(-)

Comments

Kevin Wolf July 10, 2018, 3:31 p.m. UTC | #1
On 10.07.2018 at 17:05, Peter Lieven wrote:
> We currently don't enforce that the sparse segments we detect during convert are
> aligned. This leads to unnecessary and costly read-modify-write cycles either
> internally in Qemu or in the background on the storage device as nearly all
> modern filesystems or hardware have a 4k alignment internally.
> 
> This patch modifies is_allocated_sectors so that its *pnum result will always
> end at an alignment boundary. This way all requests will end at an alignment
> boundary. The start of all requests will also be aligned as long as the results
> of get_block_status do not lead to an unaligned offset.
> 
> The number of RMW cycles when converting an example image [1] to a raw device that
> has 4k sector size is about 4600 4k read requests to perform a total of about 15000
> write requests. With this patch the additional 4600 read requests are eliminated while
> the number of total write requests stays constant.
> 
> [1] https://cloud-images.ubuntu.com/releases/16.04/release/ubuntu-16.04-server-cloudimg-amd64-disk1.vmdk
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>

It looked convincing, but I'm afraid this is still not correct.
qemu-iotests 122 fails for me with this patch.

Kevin
Peter Lieven July 10, 2018, 8:16 p.m. UTC | #2
> On 10.07.2018 at 17:31, Kevin Wolf <kwolf@redhat.com> wrote:
> 
> On 10.07.2018 at 17:05, Peter Lieven wrote:
>> We currently don't enforce that the sparse segments we detect during convert are
>> aligned. This leads to unnecessary and costly read-modify-write cycles either
>> internally in Qemu or in the background on the storage device as nearly all
>> modern filesystems or hardware have a 4k alignment internally.
>> 
>> This patch modifies is_allocated_sectors so that its *pnum result will always
>> end at an alignment boundary. This way all requests will end at an alignment
>> boundary. The start of all requests will also be aligned as long as the results
>> of get_block_status do not lead to an unaligned offset.
>> 
>> The number of RMW cycles when converting an example image [1] to a raw device that
>> has 4k sector size is about 4600 4k read requests to perform a total of about 15000
>> write requests. With this patch the additional 4600 read requests are eliminated while
>> the number of total write requests stays constant.
>> 
>> [1] https://cloud-images.ubuntu.com/releases/16.04/release/ubuntu-16.04-server-cloudimg-amd64-disk1.vmdk
>> 
>> Signed-off-by: Peter Lieven <pl@kamp.de>
> 
> It looked convincing, but I'm afraid this is still not correct.
> qemu-iotests 122 fails for me with this patch.

I will have a look at where and why exactly it fails, but the allocation pattern might be slightly different due to the alignment. What counts is whether the output is byte identical, right?

Peter
Kevin Wolf July 11, 2018, 8:25 a.m. UTC | #3
On 10.07.2018 at 22:16, Peter Lieven wrote:
> 
> 
> > On 10.07.2018 at 17:31, Kevin Wolf <kwolf@redhat.com> wrote:
> > 
> > On 10.07.2018 at 17:05, Peter Lieven wrote:
> >> We currently don't enforce that the sparse segments we detect during convert are
> >> aligned. This leads to unnecessary and costly read-modify-write cycles either
> >> internally in Qemu or in the background on the storage device as nearly all
> >> modern filesystems or hardware have a 4k alignment internally.
> >> 
> >> This patch modifies is_allocated_sectors so that its *pnum result will always
> >> end at an alignment boundary. This way all requests will end at an alignment
> >> boundary. The start of all requests will also be aligned as long as the results
> >> of get_block_status do not lead to an unaligned offset.
> >> 
> >> The number of RMW cycles when converting an example image [1] to a raw device that
> >> has 4k sector size is about 4600 4k read requests to perform a total of about 15000
> >> write requests. With this patch the additional 4600 read requests are eliminated while
> >> the number of total write requests stays constant.
> >> 
> >> [1] https://cloud-images.ubuntu.com/releases/16.04/release/ubuntu-16.04-server-cloudimg-amd64-disk1.vmdk
> >> 
> >> Signed-off-by: Peter Lieven <pl@kamp.de>
> > 
> > It looked convincing, but I'm afraid this is still not correct.
> > qemu-iotests 122 fails for me with this patch.
> 
> I will have a look, where and why exactly it fails, but the allocation
> pattern might be slightly different due to the alignment. What counts
> is that the output is byte identical or not?

Right, I noticed only after sending this email that it's qemu-img map
output that changes and this might actually be okay. I didn't check,
however, if the exact changes are what is expected and whether we need
to add more test cases to cover what the test originally wanted to
cover.

So after all, there's a good chance that all that's missing is just an
update to the test case.

Kevin
Peter Lieven July 12, 2018, 9:26 a.m. UTC | #5
On 11.07.2018 at 10:25, Kevin Wolf wrote:
> On 10.07.2018 at 22:16, Peter Lieven wrote:
>>
>>> On 10.07.2018 at 17:31, Kevin Wolf <kwolf@redhat.com> wrote:
>>>
>>> On 10.07.2018 at 17:05, Peter Lieven wrote:
>>>> We currently don't enforce that the sparse segments we detect during convert are
>>>> aligned. This leads to unnecessary and costly read-modify-write cycles either
>>>> internally in Qemu or in the background on the storage device as nearly all
>>>> modern filesystems or hardware have a 4k alignment internally.
>>>>
>>>> This patch modifies is_allocated_sectors so that its *pnum result will always
>>>> end at an alignment boundary. This way all requests will end at an alignment
>>>> boundary. The start of all requests will also be aligned as long as the results
>>>> of get_block_status do not lead to an unaligned offset.
>>>>
>>>> The number of RMW cycles when converting an example image [1] to a raw device that
>>>> has 4k sector size is about 4600 4k read requests to perform a total of about 15000
>>>> write requests. With this patch the additional 4600 read requests are eliminated while
>>>> the number of total write requests stays constant.
>>>>
>>>> [1] https://cloud-images.ubuntu.com/releases/16.04/release/ubuntu-16.04-server-cloudimg-amd64-disk1.vmdk
>>>>
>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>> It looked convincing, but I'm afraid this is still not correct.
>>> qemu-iotests 122 fails for me with this patch.
>> I will have a look, where and why exactly it fails, but the allocation
>> pattern might be slightly different due to the alignment. What counts
>> is that the output is byte identical or not?
> Right, I noticed only after sending this email that it's qemu-img map
> output that changes and this might actually be okay. I didn't check,
> however, if the exact changes are what is expected and whether we need
> to add more test cases to cover what the test originally wanted to
> cover.
>
> So after all, there's a good chance that all that's missing is just an
> update to the test case.
>
> Kevin

I checked the output of test 122, and what happens is exactly what is expected:
the zero areas align to 4k or 8k respectively, and areas that don't align are
reported as allocated. I would just go ahead, include the test update, and send
a V6 if there are no objections.

If someone feels that the behaviour is undesired we can also just go ahead
with a light version of the patch and use only bs->request_alignment as target
alignment and ignore the value of min_sparse.

In this case we could even think about treating this as a bug fix, as it
avoids the ugly RMW cycles we were seeing, which is why we created this
patch in the first place.

Best,
Peter
Kevin Wolf July 12, 2018, 11:41 a.m. UTC | #6
On 12.07.2018 at 11:26, Peter Lieven wrote:
> On 11.07.2018 at 10:25, Kevin Wolf wrote:
> > On 10.07.2018 at 22:16, Peter Lieven wrote:
> > > 
> > > > On 10.07.2018 at 17:31, Kevin Wolf <kwolf@redhat.com> wrote:
> > > > 
> > > > On 10.07.2018 at 17:05, Peter Lieven wrote:
> > > > > We currently don't enforce that the sparse segments we detect during convert are
> > > > > aligned. This leads to unnecessary and costly read-modify-write cycles either
> > > > > internally in Qemu or in the background on the storage device as nearly all
> > > > > modern filesystems or hardware have a 4k alignment internally.
> > > > > 
> > > > > This patch modifies is_allocated_sectors so that its *pnum result will always
> > > > > end at an alignment boundary. This way all requests will end at an alignment
> > > > > boundary. The start of all requests will also be aligned as long as the results
> > > > > of get_block_status do not lead to an unaligned offset.
> > > > > 
> > > > > The number of RMW cycles when converting an example image [1] to a raw device that
> > > > > has 4k sector size is about 4600 4k read requests to perform a total of about 15000
> > > > > write requests. With this patch the additional 4600 read requests are eliminated while
> > > > > the number of total write requests stays constant.
> > > > > 
> > > > > [1] https://cloud-images.ubuntu.com/releases/16.04/release/ubuntu-16.04-server-cloudimg-amd64-disk1.vmdk
> > > > > 
> > > > > Signed-off-by: Peter Lieven <pl@kamp.de>
> > > > It looked convincing, but I'm afraid this is still not correct.
> > > > qemu-iotests 122 fails for me with this patch.
> > > I will have a look, where and why exactly it fails, but the allocation
> > > pattern might be slightly different due to the alignment. What counts
> > > is that the output is byte identical or not?
> > Right, I noticed only after sending this email that it's qemu-img map
> > output that changes and this might actually be okay. I didn't check,
> > however, if the exact changes are what is expected and whether we need
> > to add more test cases to cover what the test originally wanted to
> > cover.
> > 
> > So after all, there's a good chance that all that's missing is just an
> > update to the test case.
> > 
> > Kevin
> 
> I checked the output of test 122 and what happens is exactly what is expected.
> The zero areas align to 4k or 8k respectively. If they don't align they are reported
> as allocated. I would just go ahead and include the test update and send V6 if there
> are no objections.

Sounds good to me.

Kevin

> If someone feels that the behaviour is undesired we can also just go ahead
> with a light version of the patch and use only bs->request_alignment as target
> alignment and ignore the value of min_sparse.
> 
> In this case we could even think about as treating this as a bug fix as it
> avoids these ugly RMW cycles that we were seeing and this is why we created this
> patch.
> 
> Best,
> Peter
> 

Patch

diff --git a/qemu-img.c b/qemu-img.c
index 7651d81..9021ab0 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1105,11 +1105,15 @@  static int64_t find_nonzero(const uint8_t *buf, int64_t n)
  *
  * 'pnum' is set to the number of sectors (including and immediately following
  * the first one) that are known to be in the same allocated/unallocated state.
+ * The function will try to align the end offset to alignment boundaries so
+ * that the request will at least end aligned and consecutive requests will
+ * also start at an aligned offset.
  */
-static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum)
+static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum,
+                                int64_t sector_num, int alignment)
 {
     bool is_zero;
-    int i;
+    int i, tail;
 
     if (n <= 0) {
         *pnum = 0;
@@ -1122,6 +1126,23 @@  static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum)
             break;
         }
     }
+
+    tail = (sector_num + i) & (alignment - 1);
+    if (tail) {
+        if (is_zero && i <= tail) {
+            /* treat unallocated areas which only consist
+             * of a small tail as allocated. */
+            is_zero = false;
+        }
+        if (!is_zero) {
+            /* align up end offset of allocated areas. */
+            i += alignment - tail;
+            i = MIN(i, n);
+        } else {
+            /* align down end offset of zero areas. */
+            i -= tail;
+        }
+    }
     *pnum = i;
     return !is_zero;
 }
@@ -1132,7 +1153,7 @@  static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum)
  * breaking up write requests for only small sparse areas.
  */
 static int is_allocated_sectors_min(const uint8_t *buf, int n, int *pnum,
-    int min)
+    int min, int64_t sector_num, int alignment)
 {
     int ret;
     int num_checked, num_used;
@@ -1141,7 +1162,7 @@  static int is_allocated_sectors_min(const uint8_t *buf, int n, int *pnum,
         min = n;
     }
 
-    ret = is_allocated_sectors(buf, n, pnum);
+    ret = is_allocated_sectors(buf, n, pnum, sector_num, alignment);
     if (!ret) {
         return ret;
     }
@@ -1149,13 +1170,15 @@  static int is_allocated_sectors_min(const uint8_t *buf, int n, int *pnum,
     num_used = *pnum;
     buf += BDRV_SECTOR_SIZE * *pnum;
     n -= *pnum;
+    sector_num += *pnum;
     num_checked = num_used;
 
     while (n > 0) {
-        ret = is_allocated_sectors(buf, n, pnum);
+        ret = is_allocated_sectors(buf, n, pnum, sector_num, alignment);
 
         buf += BDRV_SECTOR_SIZE * *pnum;
         n -= *pnum;
+        sector_num += *pnum;
         num_checked += *pnum;
         if (ret) {
             num_used = num_checked;
@@ -1560,6 +1583,7 @@  typedef struct ImgConvertState {
     bool wr_in_order;
     bool copy_range;
     int min_sparse;
+    int alignment;
     size_t cluster_sectors;
     size_t buf_sectors;
     long num_coroutines;
@@ -1724,7 +1748,8 @@  static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num,
              * zeroed. */
             if (!s->min_sparse ||
                 (!s->compressed &&
-                 is_allocated_sectors_min(buf, n, &n, s->min_sparse)) ||
+                 is_allocated_sectors_min(buf, n, &n, s->min_sparse,
+                                          sector_num, s->alignment)) ||
                 (s->compressed &&
                  !buffer_is_zero(buf, n * BDRV_SECTOR_SIZE)))
             {
@@ -2368,6 +2393,13 @@  static int img_convert(int argc, char **argv)
                                 out_bs->bl.pdiscard_alignment >>
                                 BDRV_SECTOR_BITS)));
 
+    /* try to align the write requests to the destination to avoid unnecessary
+     * RMW cycles. */
+    s.alignment = MAX(pow2floor(s.min_sparse),
+                      DIV_ROUND_UP(out_bs->bl.request_alignment,
+                                   BDRV_SECTOR_SIZE));
+    assert(is_power_of_2(s.alignment));
+
     if (skip_create) {
         int64_t output_sectors = blk_nb_sectors(s.target);
         if (output_sectors < 0) {