[PATCHv3,RESEND] block: introduce BDRV_O_SEQUENTIAL

Message ID 1401486037-25609-1-git-send-email-pl@kamp.de
State New

Commit Message

Peter Lieven May 30, 2014, 9:40 p.m. UTC
This patch introduces a new flag to indicate that we are going to read
sequentially from a file and do not plan to reread/reuse the data after it has
been read.

The current use of this flag is to open the source(s) of a qemu-img convert
process. If a protocol from block/raw-posix.c is used, posix_fadvise is utilized
to advise the kernel that we are going to read sequentially from the
file, and POSIX_FADV_DONTNEED advice is issued after each write to indicate
that there is no advantage in keeping the blocks in the buffers.

Consider the following test case that was created to confirm the behaviour of
the new flag:

A 10G logical volume was created and filled with random data.
Then the logical volume was exported via qemu-img convert to an iSCSI target.
Before the export was started, all caches of the Linux kernel were dropped.

Old behavior:
 - The convert process took 3m45s and the buffer cache grew up to 9.67 GB close
   to the end of the conversion. After qemu-img terminated all the buffers were
   freed by the kernel.

New behavior with the -N switch:
 - The convert process took 3m43s and the buffer cache grew up to 15.48 MB close
   to the end with some small peaks up to 30 MB during the conversion.

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
v2->v3: - rebased
        - fixed typo in commit msg [Fam]
v1->v2: - added test example to commit msg
        - added -N knob to qemu-img

 block/raw-posix.c     |   14 ++++++++++++++
 include/block/block.h |    1 +
 qemu-img-cmds.hx      |    4 ++--
 qemu-img.c            |   15 ++++++++++++---
 qemu-img.texi         |    9 ++++++++-
 5 files changed, 37 insertions(+), 6 deletions(-)
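
For readers unfamiliar with the two hints, the access pattern described in the
commit message can be sketched outside of QEMU as a plain read loop. This is
only an illustration under stated assumptions: stream_file() and its
drop_cache parameter are made-up names, not QEMU code, and the hints are
treated as best effort, so their return values are ignored.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustrative helper (not part of QEMU): read a file front to back.
 * Returns the number of bytes read, or -1 with errno set.
 * If drop_cache is non-zero, each chunk is released from the page cache
 * right after it has been read.
 */
int64_t stream_file(const char *path, int drop_cache)
{
    char buf[64 * 1024];
    int64_t total = 0;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }

#ifdef POSIX_FADV_SEQUENTIAL
    /* offset 0 and length 0 cover the whole file */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
#endif

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n < 0) {
            int saved;
            if (errno == EINTR) {
                continue;           /* interrupted, retry the read */
            }
            saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        if (n == 0) {
            break;                  /* end of file */
        }
#ifdef POSIX_FADV_DONTNEED
        if (drop_cache) {
            /* tell the kernel the chunk just read will not be reused */
            (void)posix_fadvise(fd, (off_t)total, (off_t)n,
                                POSIX_FADV_DONTNEED);
        }
#endif
        total += n;
    }

    close(fd);
    return total;
}
```

Calling stream_file(path, 0) and stream_file(path, 1) on the same file gives
the two variants (sequential hint only, and sequential hint plus DONTNEED)
that can then be compared separately.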

Comments

Stefan Hajnoczi June 4, 2014, 3:12 p.m. UTC | #1
On Fri, May 30, 2014 at 11:40:37PM +0200, Peter Lieven wrote:
> This patch introduces a new flag to indicate that we are going to read
> sequentially from a file and do not plan to reread/reuse the data after it has
> been read.
> 
> The current use of this flag is to open the source(s) of a qemu-img convert
> process. If a protocol from block/raw-posix.c is used, posix_fadvise is utilized
> to advise the kernel that we are going to read sequentially from the
> file, and POSIX_FADV_DONTNEED advice is issued after each write to indicate
> that there is no advantage in keeping the blocks in the buffers.
> 
> Consider the following test case that was created to confirm the behaviour of
> the new flag:
> 
> A 10G logical volume was created and filled with random data.
> Then the logical volume was exported via qemu-img convert to an iSCSI target.
> Before the export was started, all caches of the Linux kernel were dropped.
> 
> Old behavior:
>  - The convert process took 3m45s and the buffer cache grew up to 9.67 GB close
>    to the end of the conversion. After qemu-img terminated all the buffers were
>    freed by the kernel.
> 
> New behavior with the -N switch:
>  - The convert process took 3m43s and the buffer cache grew up to 15.48 MB close
>    to the end with some small peaks up to 30 MB during the conversion.

FADVISE_SEQUENTIAL can be good since it doubles read-ahead on Linux.

I'm skeptical of the effort to avoid buffer cache usage using
FADVISE_DONTNEED.  The performance results tell me that less buffer
cache was used but that number doesn't have a direct effect on
application performance.

Let's check GNU coreutils:

  $ cd coreutils
  $ git grep FADVISE_DONTNEED
  gl/lib/fadvise.h:  FADVISE_DONTNEED =   POSIX_FADV_DONTNEED,
  gl/lib/fadvise.h:  FADVISE_DONTNEED,
  $

GNU cp(1) does not care about minimizing impact on buffer cache using
FADVISE_DONTNEED.  It just sets FADVISE_SEQUENTIAL on the source file
and calls read() (plus uses FIEMAP to check extents for sparseness).

I want to avoid adding code just for the heck of it.  We need a deeper
understanding:

Please drop FADVISE_DONTNEED and compare again to see if it changes the
benchmark.

By the way, did you perform several runs to check the variance of the
running time?  I don't know whether the 2-second difference was noise or
due to FADVISE_SEQUENTIAL, FADVISE_DONTNEED, or both.

> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 6586a0c..9768cc4 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -447,6 +447,13 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>      }
>  #endif
>  
> +#ifdef POSIX_FADV_SEQUENTIAL
> +    if (bs->open_flags & BDRV_O_SEQUENTIAL &&
> +        !(bs->open_flags & BDRV_O_NOCACHE)) {
> +        posix_fadvise(s->fd, 0, 0, POSIX_FADV_SEQUENTIAL);
> +    }
> +#endif

This is only true if the image format is raw.  If the image format on
top of this raw-posix BDS is non-raw then the read pattern may not be
sequential.

Perhaps the extra I/O in that case doesn't matter but conceptually it's
wrong to think that a raw-posix file will have a sequential access
pattern just because bdrv_read() is called sequentially.
Peter Lieven June 4, 2014, 3:31 p.m. UTC | #2
On 04.06.2014 17:12, Stefan Hajnoczi wrote:
> By the way, did you perform several runs to check the variance of the
> running time?  I don't know whether the 2-second difference was noise or
> due to FADVISE_SEQUENTIAL, FADVISE_DONTNEED, or both.

There was no effect on the runtime as far as I remember. I ran
some tests, but not enough of them to filter out the noise.

I created this one because we saw that it helps under memory pressure.
Maybe it's too specific to add to mainline qemu, but I wanted to
avoid having too many individual changes that we need to maintain.


>
>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>> index 6586a0c..9768cc4 100644
>> --- a/block/raw-posix.c
>> +++ b/block/raw-posix.c
>> @@ -447,6 +447,13 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>>      }
>>  #endif
>>  
>> +#ifdef POSIX_FADV_SEQUENTIAL
>> +    if (bs->open_flags & BDRV_O_SEQUENTIAL &&
>> +        !(bs->open_flags & BDRV_O_NOCACHE)) {
>> +        posix_fadvise(s->fd, 0, 0, POSIX_FADV_SEQUENTIAL);
>> +    }
>> +#endif
> This is only true if the image format is raw.  If the image format on
> top of this raw-posix BDS is non-raw then the read pattern may not be
> sequential.

You are right, but will the other formats set BDRV_O_SEQUENTIAL?

>
> Perhaps the extra I/O in that case doesn't matter but conceptually it's
> wrong to think that a raw-posix file will have a sequential access
> pattern just because bdrv_read() is called sequentially.

Peter
Stefan Hajnoczi June 5, 2014, 7:53 a.m. UTC | #3
On Wed, Jun 04, 2014 at 05:31:48PM +0200, Peter Lieven wrote:
> 
> There was no effect on the runtime as far as I remember. I ran
> some tests, but not enough of them to filter out the noise.
> 
> I created this one because we saw that it helps under memory pressure.
> Maybe it's too specific to add to mainline qemu, but I wanted to
> avoid having too many individual changes that we need to maintain.

I'm open to merging it if the improvement can be quantified.  Right now
this might be a workaround for Linux memory management heuristics or it
might not have any effect, I don't know.
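
One way to put a number on the cache footprint, instead of watching the global
buffer counter, is to ask the kernel how many pages of a given file are
resident. The following is a rough sketch assuming Linux and a regular file
(fstat() reports a size of 0 for block devices such as a logical volume, so
those would need a different size query). resident_pages() is a made-up helper
name, not an existing tool or QEMU function.

```c
#define _DEFAULT_SOURCE   /* mincore() is a BSD/Linux extension */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustrative helper: count how many pages of a regular file are
 * currently resident in the page cache. Returns -1 on error.
 */
long resident_pages(const char *path)
{
    struct stat st;
    long resident = 0;
    long page = sysconf(_SC_PAGESIZE);
    size_t len, npages, i;
    unsigned char *vec;
    void *map;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    if (fstat(fd, &st) < 0 || page <= 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {
        close(fd);
        return 0;                   /* nothing to map */
    }

    len = (size_t)st.st_size;
    npages = (len + (size_t)page - 1) / (size_t)page;

    /* Mapping the file does not read it; it only gives mincore() a range. */
    map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED) {
        return -1;
    }

    vec = malloc(npages);
    if (vec != NULL && mincore(map, len, vec) == 0) {
        for (i = 0; i < npages; i++) {
            resident += vec[i] & 1; /* low bit set: page is in core */
        }
    } else {
        resident = -1;
    }

    free(vec);
    munmap(map, len);
    return resident;
}
```

Sampling this for the source image before, during and after a conversion,
with and without each hint, would give per-file figures to compare.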

> >
> >> diff --git a/block/raw-posix.c b/block/raw-posix.c
> >> index 6586a0c..9768cc4 100644
> >> --- a/block/raw-posix.c
> >> +++ b/block/raw-posix.c
> >> @@ -447,6 +447,13 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
> >>      }
> >>  #endif
> >>  
> >> +#ifdef POSIX_FADV_SEQUENTIAL
> >> +    if (bs->open_flags & BDRV_O_SEQUENTIAL &&
> >> +        !(bs->open_flags & BDRV_O_NOCACHE)) {
> >> +        posix_fadvise(s->fd, 0, 0, POSIX_FADV_SEQUENTIAL);
> >> +    }
> >> +#endif
> > This is only true if the image format is raw.  If the image format on
> > top of this raw-posix BDS is non-raw then the read pattern may not be
> > sequential.
> 
> You are right, but will the other formats set BDRV_O_SEQUENTIAL?

If the user specifies qemu-img convert -N then it will be set for any
image format.

Maybe qemu-img convert can always set BDRV_O_SEQUENTIAL and then have the
raw_bsd.c format propagate it to bs->file while other formats do not.
Then the user doesn't have to specify a command-line option and we don't
set it for non-raw image formats.

Stefan
Peter Lieven June 5, 2014, 8:09 a.m. UTC | #4
On 05.06.2014 09:53, Stefan Hajnoczi wrote:
> On Wed, Jun 04, 2014 at 05:31:48PM +0200, Peter Lieven wrote:
>> There was no effect on the runtime as far as I remember. I ran
>> some tests, but not enough of them to filter out the noise.
>>
>> I created this one because we saw that it helps under memory pressure.
>> Maybe it's too specific to add to mainline qemu, but I wanted to
>> avoid having too many individual changes that we need to maintain.
> I'm open to merging it if the improvement can be quantified.  Right now
> this might be a workaround for Linux memory management heuristics or it
> might not have any effect, I don't know.

I understand that you are critical of it. I can only say that it solved
the problem with the specific setup, kernel version, etc.

I found that FADVISE_DONTNEED also solves problems in other applications.
Off-topic: I have a Raspberry Pi running tvheadend and observed desync
of the DVB-S2 signal at times. Since I started to FADV_DONTNEED all written
frames away, it runs smoothly. In this case the freeing of the page cache was
CPU intensive for the small device and caused the desync.

>
>>>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>>>> index 6586a0c..9768cc4 100644
>>>> --- a/block/raw-posix.c
>>>> +++ b/block/raw-posix.c
>>>> @@ -447,6 +447,13 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>>>>       }
>>>>   #endif
>>>>   
>>>> +#ifdef POSIX_FADV_SEQUENTIAL
>>>> +    if (bs->open_flags & BDRV_O_SEQUENTIAL &&
>>>> +        !(bs->open_flags & BDRV_O_NOCACHE)) {
>>>> +        posix_fadvise(s->fd, 0, 0, POSIX_FADV_SEQUENTIAL);
>>>> +    }
>>>> +#endif
>>> This is only true if the image format is raw.  If the image format on
>>> top of this raw-posix BDS is non-raw then the read pattern may not be
>>> sequential.
>> You are right, but will the other formats set BDRV_O_SEQUENTIAL?
> If the user specifies qemu-img convert -N then it will be set for any
> image format.

Of course, but when e.g. qcow2 opens its underlying file, then BDRV_O_SEQUENTIAL
is not passed on, or is it?

>
> Maybe qemu-img convert can always set BDRV_O_SEQUENTIAL and then have the
> raw_bsd.c format propagate it to bs->file while other formats do not.
> Then the user doesn't have to specify a command-line option and we don't
> set it for non-raw image formats.

This would be an option.

Peter
Kevin Wolf June 5, 2014, 8:13 a.m. UTC | #5
On 05.06.2014 at 10:09, Peter Lieven wrote:
> On 05.06.2014 09:53, Stefan Hajnoczi wrote:
> >On Wed, Jun 04, 2014 at 05:31:48PM +0200, Peter Lieven wrote:
> >>There was no effect on the runtime as far as I remember. I ran
> >>some tests, but not enough of them to filter out the noise.
> >>
> >>I created this one because we saw that it helps under memory pressure.
> >>Maybe it's too specific to add to mainline qemu, but I wanted to
> >>avoid having too many individual changes that we need to maintain.
> >I'm open to merging it if the improvement can be quantified.  Right now
> >this might be a workaround for Linux memory management heuristics or it
> >might not have any effect, I don't know.
> 
> I understand that you are critical of it. I can only say that it solved
> the problem with the specific setup, kernel version, etc.
> 
> I found that FADVISE_DONTNEED also solves problems in other applications.
> Off-topic: I have a Raspberry Pi running tvheadend and observed desync
> of the DVB-S2 signal at times. Since I started to FADV_DONTNEED all written
> frames away, it runs smoothly. In this case the freeing of the page cache was
> CPU intensive for the small device and caused the desync.
> 
> >
> >>>>diff --git a/block/raw-posix.c b/block/raw-posix.c
> >>>>index 6586a0c..9768cc4 100644
> >>>>--- a/block/raw-posix.c
> >>>>+++ b/block/raw-posix.c
> >>>>@@ -447,6 +447,13 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
> >>>>      }
> >>>>  #endif
> >>>>+#ifdef POSIX_FADV_SEQUENTIAL
> >>>>+    if (bs->open_flags & BDRV_O_SEQUENTIAL &&
> >>>>+        !(bs->open_flags & BDRV_O_NOCACHE)) {
> >>>>+        posix_fadvise(s->fd, 0, 0, POSIX_FADV_SEQUENTIAL);
> >>>>+    }
> >>>>+#endif
> >>>This is only true if the image format is raw.  If the image format on
> >>>top of this raw-posix BDS is non-raw then the read pattern may not be
> >>>sequential.
> >>You are right, but will the other formats set BDRV_O_SEQUENTIAL?
> >If the user specifies qemu-img convert -N then it will be set for any
> >image format.
> 
> Of course, but when e.g. qcow2 opens its underlying file, then BDRV_O_SEQUENTIAL
> is not passed on, or is it?

It isn't qcow2 but block.c that opens bs->file, and unless you
explicitly filter out a flag, bs->file inherits it. (If it didn't do
that, your patch would have no effect for raw either.)

> >Maybe qemu-img convert can always set BDRV_O_SEQUENTIAL and then have the
> >raw_bsd.c format propagate it to bs->file while other formats do not.
> >Then the user doesn't have to specify a command-line option and we don't
> >set it for non-raw image formats.
> 
> This would be an option.

I agree, though it's not quite clear how raw_bsd would do that. Would
that involve a bdrv_reopen() for bs->file?

Kevin
Stefan Hajnoczi June 5, 2014, 1:54 p.m. UTC | #6
On Thu, Jun 05, 2014 at 10:13:04AM +0200, Kevin Wolf wrote:
> On 05.06.2014 at 10:09, Peter Lieven wrote:
> > On 05.06.2014 09:53, Stefan Hajnoczi wrote:
> > >On Wed, Jun 04, 2014 at 05:31:48PM +0200, Peter Lieven wrote:
> > >>On 04.06.2014 17:12, Stefan Hajnoczi wrote:
> > >>>On Fri, May 30, 2014 at 11:40:37PM +0200, Peter Lieven wrote:
> > >>>>diff --git a/block/raw-posix.c b/block/raw-posix.c
> > >>>>index 6586a0c..9768cc4 100644
> > >>>>--- a/block/raw-posix.c
> > >>>>+++ b/block/raw-posix.c
> > >>>>@@ -447,6 +447,13 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
> > >>>>      }
> > >>>>  #endif
> > >>>>+#ifdef POSIX_FADV_SEQUENTIAL
> > >>>>+    if (bs->open_flags & BDRV_O_SEQUENTIAL &&
> > >>>>+        !(bs->open_flags & BDRV_O_NOCACHE)) {
> > >>>>+        posix_fadvise(s->fd, 0, 0, POSIX_FADV_SEQUENTIAL);
> > >>>>+    }
> > >>>>+#endif
> > >>>This is only true if the image format is raw.  If the image format on
> > >>>top of this raw-posix BDS is non-raw then the read pattern may not be
> > >>>sequential.
> > >>You are right, but will the other formats set BDRV_O_SEQUENTIAL?
> > >If the user specifies qemu-img convert -N then it will be set for any
> > >image format.
> > 
> > Of course, but when e.g. qcow2 opens its underlying file, then BDRV_O_SEQUENTIAL
> > is not passed on, or is it?
> 
> It isn't qcow2 but block.c that opens bs->file, and unless you
> explicitly filter out a flag, bs->file inherits it. (If it didn't do
> that, your patch would have no effect for raw either.)

Yes, exactly.  When a raw image file is opened there are actually two
BlockDriverStates:

  raw_bsd ("drive0")
    file: raw-posix (anonymous)

Since your patch affected the buffer cache counter, we know that the
flag was propagated down to raw-posix (by block.c as Kevin explained).

The qcow2 case looks like this:

  qcow2 ("drive0")
    file: raw-posix (anonymous)

> > >Maybe qemu-img convert can always set BDRV_O_SEQUENTIAL and then have the
> > >raw_bsd.c format propagate it to bs->file while other formats do not.
> > >Then the user doesn't have to specify a command-line option and we don't
> > >set it for non-raw image formats.
> > 
> > This would be an option.
> 
> I agree, though it's not quite clear how raw_bsd would do that. Would
> that involve a bdrv_reopen() for bs->file?

One way is to add a BlockDriver bitmask field for options that get
propagated to its children.  Only raw_bsd will include
BDRV_O_SEQUENTIAL.

Stefan
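
The bitmask idea above can be modelled in a few lines. This is a toy sketch,
not QEMU code: ToyDriver, opt_in_child_flags, TOY_OPT_IN_MASK, TOY_O_RDWR and
toy_child_flags() are hypothetical names, and only the value of
BDRV_O_SEQUENTIAL is taken from the patch. Ordinary flags are inherited by the
child as before, while flags in the opt-in set only reach bs->file if the
parent's driver lists them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BDRV_O_SEQUENTIAL 0x10000  /* value from the patch */
#define TOY_O_RDWR        0x0002   /* stand-in for any ordinary open flag */

/* Flags that a child inherits only if the parent driver opts in. */
#define TOY_OPT_IN_MASK   BDRV_O_SEQUENTIAL

/* Hypothetical, heavily reduced stand-in for a block driver description. */
typedef struct ToyDriver {
    const char *format_name;
    uint32_t opt_in_child_flags;   /* opt-in flags this driver passes down */
} ToyDriver;

/* Compute the flags used to open the child (bs->file) of a node. */
uint32_t toy_child_flags(const ToyDriver *drv, uint32_t parent_flags)
{
    uint32_t inherited, opted_in;

    assert(drv != NULL);
    /* everything outside the opt-in set is inherited unconditionally */
    inherited = parent_flags & ~(uint32_t)TOY_OPT_IN_MASK;
    /* opt-in flags pass only if the driver lists them */
    opted_in = parent_flags & TOY_OPT_IN_MASK & drv->opt_in_child_flags;
    return inherited | opted_in;
}

/* raw passes reads through 1:1, so the sequential hint stays valid below it */
const ToyDriver toy_raw   = { "raw",   BDRV_O_SEQUENTIAL };
/* qcow2 may touch metadata and clusters out of order, so it does not opt in */
const ToyDriver toy_qcow2 = { "qcow2", 0 };
```

With this model, qemu-img convert could set the flag unconditionally on the
top-level node: below toy_raw the child still sees BDRV_O_SEQUENTIAL, below
toy_qcow2 it is masked out, and all other flags pass through unchanged.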
Patch

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 6586a0c..9768cc4 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -447,6 +447,13 @@  static int raw_open_common(BlockDriverState *bs, QDict *options,
     }
 #endif
 
+#ifdef POSIX_FADV_SEQUENTIAL
+    if (bs->open_flags & BDRV_O_SEQUENTIAL &&
+        !(bs->open_flags & BDRV_O_NOCACHE)) {
+        posix_fadvise(s->fd, 0, 0, POSIX_FADV_SEQUENTIAL);
+    }
+#endif
+
     ret = 0;
 fail:
     if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
@@ -919,6 +926,13 @@  static int aio_worker(void *arg)
             ret = aiocb->aio_nbytes;
         }
         if (ret == aiocb->aio_nbytes) {
+#ifdef POSIX_FADV_DONTNEED
+            if (aiocb->bs->open_flags & BDRV_O_SEQUENTIAL &&
+                !(aiocb->bs->open_flags & BDRV_O_NOCACHE)) {
+                posix_fadvise(aiocb->aio_fildes, aiocb->aio_offset,
+                              aiocb->aio_nbytes, POSIX_FADV_DONTNEED);
+            }
+#endif
             ret = 0;
         } else if (ret >= 0 && ret < aiocb->aio_nbytes) {
             ret = -EINVAL;
diff --git a/include/block/block.h b/include/block/block.h
index 1b119aa..9b42d54 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -110,6 +110,7 @@  typedef enum {
 #define BDRV_O_PROTOCOL    0x8000  /* if no block driver is explicitly given:
                                       select an appropriate protocol driver,
                                       ignoring the format layer */
+#define BDRV_O_SEQUENTIAL 0x10000  /* open device for sequential read */
 
 #define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NO_FLUSH)
 
diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index d029609..74c2c08 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -34,9 +34,9 @@  STEXI
 ETEXI
 
 DEF("convert", img_convert,
-    "convert [-c] [-p] [-q] [-n] [-f fmt] [-t cache] [-O output_fmt] [-o options] [-s snapshot_id_or_name] [-l snapshot_param] [-S sparse_size] filename [filename2 [...]] output_filename")
+    "convert [-c] [-p] [-q] [-n] [-N] [-f fmt] [-t cache] [-O output_fmt] [-o options] [-s snapshot_id_or_name] [-l snapshot_param] [-S sparse_size] filename [filename2 [...]] output_filename")
 STEXI
-@item convert [-c] [-p] [-q] [-n] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_id_or_name}] [-l @var{snapshot_param}] [-S @var{sparse_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
+@item convert [-c] [-p] [-q] [-n] [-N] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_id_or_name}] [-l @var{snapshot_param}] [-S @var{sparse_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
 ETEXI
 
 DEF("info", img_info,
diff --git a/qemu-img.c b/qemu-img.c
index 04ce02a..356d4ae 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -141,6 +141,8 @@  static void QEMU_NORETURN help(void)
            "  '--output' takes the format in which the output must be done (human or json)\n"
            "  '-n' skips the target volume creation (useful if the volume is created\n"
            "       prior to running qemu-img)\n"
+           "  '-N' opens the source file(s) for sequential reading and drops data from\n"
+           "       page cache immediately\n"
            "\n"
            "Parameters to check subcommand:\n"
            "  '-r' tries to repair any inconsistencies that are found during the check.\n"
@@ -1199,7 +1201,7 @@  static int img_convert(int argc, char **argv)
     char *options = NULL;
     const char *snapshot_name = NULL;
     int min_sparse = 8; /* Need at least 4k of zeros for sparse detection */
-    bool quiet = false;
+    bool quiet = false, sequential_read = false;
     Error *local_err = NULL;
     QemuOpts *sn_opts = NULL;
 
@@ -1210,7 +1212,7 @@  static int img_convert(int argc, char **argv)
     compress = 0;
     skip_create = 0;
     for(;;) {
-        c = getopt(argc, argv, "f:O:B:s:hce6o:pS:t:qnl:");
+        c = getopt(argc, argv, "f:O:B:s:hce6o:pS:t:qnNl:");
         if (c == -1) {
             break;
         }
@@ -1297,6 +1299,9 @@  static int img_convert(int argc, char **argv)
         case 'n':
             skip_create = 1;
             break;
+        case 'N':
+            sequential_read = true;
+            break;
         }
     }
 
@@ -1333,9 +1338,13 @@  static int img_convert(int argc, char **argv)
 
     total_sectors = 0;
     for (bs_i = 0; bs_i < bs_n; bs_i++) {
+        int open_flags = BDRV_O_FLAGS;
         char *id = bs_n > 1 ? g_strdup_printf("source %d", bs_i)
                             : g_strdup("source");
-        bs[bs_i] = bdrv_new_open(id, argv[optind + bs_i], fmt, BDRV_O_FLAGS,
+        if (sequential_read) {
+            open_flags |= BDRV_O_SEQUENTIAL;
+        }
+        bs[bs_i] = bdrv_new_open(id, argv[optind + bs_i], fmt, open_flags,
                                  true, quiet);
         g_free(id);
         if (!bs[bs_i]) {
diff --git a/qemu-img.texi b/qemu-img.texi
index f84590e..0fb63c2 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -190,7 +190,7 @@  Error on reading data
 
 @end table
 
-@item convert [-c] [-p] [-n] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_id_or_name}] [-l @var{snapshot_param}] [-S @var{sparse_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
+@item convert [-c] [-p] [-n] [-N] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_id_or_name}] [-l @var{snapshot_param}] [-S @var{sparse_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
 
 Convert the disk image @var{filename} or a snapshot @var{snapshot_param}(@var{snapshot_id_or_name} is deprecated)
 to disk image @var{output_filename} using format @var{output_fmt}. It can be optionally compressed (@code{-c}
@@ -220,6 +220,13 @@  skipped. This is useful for formats such as @code{rbd} if the target
 volume has already been created with site specific options that cannot
 be supplied through qemu-img.
 
+If the @code{-N} option is specified, the source image is opened
+for sequential reading. This means its contents are dropped from
+the page cache immediately after they have been read. The option
+is meant for reading in raw files or host devices and may have a
+bad performance impact on other formats which read a sector more
+than once.
+
 @item info [-f @var{fmt}] [--output=@var{ofmt}] [--backing-chain] @var{filename}
 
 Give information about the disk image @var{filename}. Use it in