diff mbox

[v3,for-2.2,3/3] raw-posix: The SEEK_HOLE code is flawed, rewrite it

Message ID 1416219514-22530-4-git-send-email-armbru@redhat.com
State New
Headers show

Commit Message

Markus Armbruster Nov. 17, 2014, 10:18 a.m. UTC
On systems where SEEK_HOLE in a trailing hole seeks to EOF (Solaris,
but not Linux), try_seek_hole() reports trailing data instead.

Additionally, unlikely lseek() failures are treated badly:

* When SEEK_HOLE fails, try_seek_hole() reports trailing data.  For
  -ENXIO, there's in fact a trailing hole.  Can happen only when
  something truncated the file since we opened it.

* When SEEK_HOLE succeeds, SEEK_DATA fails, and SEEK_END succeeds,
  then try_seek_hole() reports a trailing hole.  This is okay only
  when SEEK_DATA failed with -ENXIO (which means the non-trailing hole
  found by SEEK_HOLE has since become trailing somehow).  For other
  failures (unlikely), it's wrong.

* When SEEK_HOLE succeeds, SEEK_DATA fails, SEEK_END fails (unlikely),
  then try_seek_hole() reports bogus data [-1,start), which its caller
  raw_co_get_block_status() turns into zero sectors of data.  Could
  theoretically lead to infinite loops in code that attempts to scan
  data vs. hole forward.

Rewrite from scratch, with very careful comments.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 block/raw-posix.c | 111 +++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 85 insertions(+), 26 deletions(-)

Comments

Max Reitz Nov. 17, 2014, 10:33 a.m. UTC | #1
On 2014-11-17 at 11:18, Markus Armbruster wrote:
> On systems where SEEK_HOLE in a trailing hole seeks to EOF (Solaris,
> but not Linux), try_seek_hole() reports trailing data instead.
>
> Additionally, unlikely lseek() failures are treated badly:
>
> * When SEEK_HOLE fails, try_seek_hole() reports trailing data.  For
>    -ENXIO, there's in fact a trailing hole.  Can happen only when
>    something truncated the file since we opened it.
>
> * When SEEK_HOLE succeeds, SEEK_DATA fails, and SEEK_END succeeds,
>    then try_seek_hole() reports a trailing hole.  This is okay only
>    when SEEK_DATA failed with -ENXIO (which means the non-trailing hole
>    found by SEEK_HOLE has since become trailing somehow).  For other
>    failures (unlikely), it's wrong.
>
> * When SEEK_HOLE succeeds, SEEK_DATA fails, SEEK_END fails (unlikely),
>    then try_seek_hole() reports bogus data [-1,start), which its caller
>    raw_co_get_block_status() turns into zero sectors of data.  Could
>    theoretically lead to infinite loops in code that attempts to scan
>    data vs. hole forward.
>
> Rewrite from scratch, with very careful comments.
>
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> ---
>   block/raw-posix.c | 111 +++++++++++++++++++++++++++++++++++++++++-------------
>   1 file changed, 85 insertions(+), 26 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>
Eric Blake Nov. 17, 2014, 4:43 p.m. UTC | #2
On 11/17/2014 03:18 AM, Markus Armbruster wrote:
> On systems where SEEK_HOLE in a trailing hole seeks to EOF (Solaris,
> but not Linux), try_seek_hole() reports trailing data instead.

Maybe worth a comment that this is not fatal, but also not optimal.

> 
> Additionally, unlikely lseek() failures are treated badly:
> 
> * When SEEK_HOLE fails, try_seek_hole() reports trailing data.  For
>   -ENXIO, there's in fact a trailing hole.  Can happen only when
>   something truncated the file since we opened it.
> 
> * When SEEK_HOLE succeeds, SEEK_DATA fails, and SEEK_END succeeds,
>   then try_seek_hole() reports a trailing hole.  This is okay only
>   when SEEK_DATA failed with -ENXIO (which means the non-trailing hole
>   found by SEEK_HOLE has since become trailing somehow).  For other
>   failures (unlikely), it's wrong.
> 
> * When SEEK_HOLE succeeds, SEEK_DATA fails, SEEK_END fails (unlikely),
>   then try_seek_hole() reports bogus data [-1,start), which its caller
>   raw_co_get_block_status() turns into zero sectors of data.  Could
>   theoretically lead to infinite loops in code that attempts to scan
>   data vs. hole forward.
> 
> Rewrite from scratch, with very careful comments.

Thanks for the careful commit message as well as the careful comments :)

> 
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> ---
>  block/raw-posix.c | 111 +++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 85 insertions(+), 26 deletions(-)
> 

> @@ -1542,25 +1600,26 @@ static int64_t coroutine_fn raw_co_get_block_status(BlockDriverState *bs,
>          nb_sectors = DIV_ROUND_UP(total_size - start, BDRV_SECTOR_SIZE);
>      }
>  

> +    } else if (data == start) {
>          /* On a data extent, compute sectors to the end of the extent.  */
>          *pnum = MIN(nb_sectors, (hole - start) / BDRV_SECTOR_SIZE);

I think we are safe for now that no file system supports holes smaller
than 512 bytes, so (hole - start) should always be a non-zero multiple
of sectors.  Similarly for the hole case of (data - start).  Maybe it's
worth assert(*pnum > 0) to ensure that we never hit a situation where we
go into an infinite loop where we aren't progressing because pnum is
never advancing to the next sector?  But that would be okay as a
separate patch, and I don't want to delay getting _this_ patch into 2.2.

Reviewed-by: Eric Blake <eblake@redhat.com>
Max Reitz Nov. 18, 2014, 8:42 a.m. UTC | #3
On 2014-11-17 at 17:43, Eric Blake wrote:
> On 11/17/2014 03:18 AM, Markus Armbruster wrote:
>> On systems where SEEK_HOLE in a trailing hole seeks to EOF (Solaris,
>> but not Linux), try_seek_hole() reports trailing data instead.
> Maybe worth a comment that this is not fatal, but also not optimal.
>
>> Additionally, unlikely lseek() failures are treated badly:
>>
>> * When SEEK_HOLE fails, try_seek_hole() reports trailing data.  For
>>    -ENXIO, there's in fact a trailing hole.  Can happen only when
>>    something truncated the file since we opened it.
>>
>> * When SEEK_HOLE succeeds, SEEK_DATA fails, and SEEK_END succeeds,
>>    then try_seek_hole() reports a trailing hole.  This is okay only
>>    when SEEK_DATA failed with -ENXIO (which means the non-trailing hole
>>    found by SEEK_HOLE has since become trailing somehow).  For other
>>    failures (unlikely), it's wrong.
>>
>> * When SEEK_HOLE succeeds, SEEK_DATA fails, SEEK_END fails (unlikely),
>>    then try_seek_hole() reports bogus data [-1,start), which its caller
>>    raw_co_get_block_status() turns into zero sectors of data.  Could
>>    theoretically lead to infinite loops in code that attempts to scan
>>    data vs. hole forward.
>>
>> Rewrite from scratch, with very careful comments.
> Thanks for the careful commit message as well as the careful comments :)
>
>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> ---
>>   block/raw-posix.c | 111 +++++++++++++++++++++++++++++++++++++++++-------------
>>   1 file changed, 85 insertions(+), 26 deletions(-)
>>
>> @@ -1542,25 +1600,26 @@ static int64_t coroutine_fn raw_co_get_block_status(BlockDriverState *bs,
>>           nb_sectors = DIV_ROUND_UP(total_size - start, BDRV_SECTOR_SIZE);
>>       }
>>   
>> +    } else if (data == start) {
>>           /* On a data extent, compute sectors to the end of the extent.  */
>>           *pnum = MIN(nb_sectors, (hole - start) / BDRV_SECTOR_SIZE);
> I think we are safe for now that no file system supports holes smaller
> than 512 bytes, so (hole - start) should always be a non-zero multiple
> of sectors.  Similarly for the hole case of (data - start).  Maybe it's
> worth assert(*pnum > 0) to ensure that we never hit a situation where we
> go into an infinite loop where we aren't progressing because pnum is
> never advancing to the next sector?

That's something the callers of bdrv_get_block_status() have to take 
care of. See commit 4b25bbc4c22cf39350b75bd250d568a4d975f7c5, for example.

Max

> But that would be okay as a
> separate patch, and I don't want to delay getting _this_ patch into 2.2.
>
> Reviewed-by: Eric Blake <eblake@redhat.com>
diff mbox

Patch

diff --git a/block/raw-posix.c b/block/raw-posix.c
index fd80d84..0b5d5a8 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -1478,28 +1478,86 @@  out:
     return result;
 }
 
-static int try_seek_hole(BlockDriverState *bs, off_t start, off_t *data,
-                         off_t *hole)
+/*
+ * Find allocation range in @bs around offset @start.
+ * May change underlying file descriptor's file offset.
+ * If @start is not in a hole, store @start in @data, and the
+ * beginning of the next hole in @hole, and return 0.
+ * If @start is in a non-trailing hole, store @start in @hole and the
+ * beginning of the next non-hole in @data, and return 0.
+ * If @start is in a trailing hole or beyond EOF, return -ENXIO.
+ * If we can't find out, return a negative errno other than -ENXIO.
+ */
+static int find_allocation(BlockDriverState *bs, off_t start,
+                           off_t *data, off_t *hole)
 {
 #if defined SEEK_HOLE && defined SEEK_DATA
     BDRVRawState *s = bs->opaque;
+    off_t offs;
 
-    *hole = lseek(s->fd, start, SEEK_HOLE);
-    if (*hole == -1) {
-        return -errno;
+    /*
+     * SEEK_DATA cases:
+     * D1. offs == start: start is in data
+     * D2. offs > start: start is in a hole, next data at offs
+     * D3. offs < 0, errno = ENXIO: either start is in a trailing hole
+     *                              or start is beyond EOF
+     *     If the latter happens, the file has been truncated behind
+     *     our back since we opened it.  All bets are off then.
+     *     Treating like a trailing hole is simplest.
+     * D4. offs < 0, errno != ENXIO: we learned nothing
+     */
+    offs = lseek(s->fd, start, SEEK_DATA);
+    if (offs < 0) {
+        return -errno;          /* D3 or D4 */
     }
+    assert(offs >= start);
 
-    if (*hole > start) {
+    if (offs > start) {
+        /* D2: in hole, next data at offs */
+        *hole = start;
+        *data = offs;
+        return 0;
+    }
+
+    /* D1: in data, end not yet known */
+
+    /*
+     * SEEK_HOLE cases:
+     * H1. offs == start: start is in a hole
+     *     If this happens here, a hole has been dug behind our back
+     *     since the previous lseek().
+     * H2. offs > start: either start is in data, next hole at offs,
+     *                   or start is in trailing hole, EOF at offs
+     *     Linux treats trailing holes like any other hole: offs ==
+     *     start.  Solaris seeks to EOF instead: offs > start (blech).
+     *     If that happens here, a hole has been dug behind our back
+     *     since the previous lseek().
+     * H3. offs < 0, errno = ENXIO: start is beyond EOF
+     *     If this happens, the file has been truncated behind our
+     *     back since we opened it.  Treat it like a trailing hole.
+     * H4. offs < 0, errno != ENXIO: we learned nothing
+     *     Pretend we know nothing at all, i.e. "forget" about D1.
+     */
+    offs = lseek(s->fd, start, SEEK_HOLE);
+    if (offs < 0) {
+        return -errno;          /* D1 and (H3 or H4) */
+    }
+    assert(offs >= start);
+
+    if (offs > start) {
+        /*
+         * D1 and H2: either in data, next hole at offs, or it was in
+         * data but is now in a trailing hole.  In the latter case,
+         * all bets are off.  Treating it as if it there was data all
+         * the way to EOF is safe, so simply do that.
+         */
         *data = start;
-    } else {
-        /* On a hole.  We need another syscall to find its end.  */
-        *data = lseek(s->fd, start, SEEK_DATA);
-        if (*data == -1) {
-            *data = lseek(s->fd, 0, SEEK_END);
-        }
+        *hole = offs;
+        return 0;
     }
 
-    return 0;
+    /* D1 and H1 */
+    return -EBUSY;
 #else
     return -ENOTSUP;
 #endif
@@ -1542,25 +1600,26 @@  static int64_t coroutine_fn raw_co_get_block_status(BlockDriverState *bs,
         nb_sectors = DIV_ROUND_UP(total_size - start, BDRV_SECTOR_SIZE);
     }
 
-    ret = try_seek_hole(bs, start, &data, &hole);
-    if (ret < 0) {
-        /* Assume everything is allocated. */
-        data = 0;
-        hole = start + nb_sectors * BDRV_SECTOR_SIZE;
-        ret = 0;
-    }
-
-    assert(ret >= 0);
-
-    if (data <= start) {
+    ret = find_allocation(bs, start, &data, &hole);
+    if (ret == -ENXIO) {
+        /* Trailing hole */
+        *pnum = nb_sectors;
+        ret = BDRV_BLOCK_ZERO;
+    } else if (ret < 0) {
+        /* No info available, so pretend there are no holes */
+        *pnum = nb_sectors;
+        ret = BDRV_BLOCK_DATA;
+    } else if (data == start) {
         /* On a data extent, compute sectors to the end of the extent.  */
         *pnum = MIN(nb_sectors, (hole - start) / BDRV_SECTOR_SIZE);
-        return ret | BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | start;
+        ret = BDRV_BLOCK_DATA;
     } else {
         /* On a hole, compute sectors to the beginning of the next extent.  */
+        assert(hole == start);
         *pnum = MIN(nb_sectors, (data - start) / BDRV_SECTOR_SIZE);
-        return ret | BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID | start;
+        ret = BDRV_BLOCK_ZERO;
     }
+    return ret | BDRV_BLOCK_OFFSET_VALID | start;
 }
 
 static coroutine_fn BlockAIOCB *raw_aio_discard(BlockDriverState *bs,