Patchwork sheepdog: implement direct write semantics

Submitter namei.unix@gmail.com
Date Dec. 19, 2012, 6:29 p.m.
Message ID <1355941771-3418-1-git-send-email-namei.unix@gmail.com>
Permalink /patch/207501/
State New

Comments

namei.unix@gmail.com - Dec. 19, 2012, 6:29 p.m.
From: Liu Yuan <tailai.ly@taobao.com>

Sheepdog supports both writeback and writethrough writes, but does not yet
support DIRECTIO semantics, which bypass the cache completely even if the
Sheepdog daemon is set up with its cache enabled.

Assuming the cache is enabled on the Sheepdog daemon side, the new cache
control is:

cache=writeback # enable writeback semantics for writes
cache=writethrough # enable writethrough semantics for writes
cache='directsync | none | off' # disable the cache completely

Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
---
 block/sheepdog.c |   67 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 37 insertions(+), 30 deletions(-)
MORITA Kazutaka - Dec. 25, 2012, 7:47 a.m.
At Thu, 20 Dec 2012 02:29:31 +0800,
Liu Yuan wrote:
> 
> From: Liu Yuan <tailai.ly@taobao.com>
> 
> Sheepdog supports both writeback and writethrough writes, but does not yet
> support DIRECTIO semantics, which bypass the cache completely even if the
> Sheepdog daemon is set up with its cache enabled.
> 
> Assuming the cache is enabled on the Sheepdog daemon side, the new cache
> control is:
> 
> cache=writeback # enable writeback semantics for writes
> cache=writethrough # enable writethrough semantics for writes
> cache='directsync | none | off' # disable the cache completely

I wonder if we should disable the cache when cache=none.  Many management
frontends use cache=none by default but, I think, users still expect
that data is cached (e.g. by the disk write cache when a raw format is
used).  cache=none only means that the host page cache is not used for
VM disk IO.

In that sense,

> @@ -1118,12 +1118,19 @@ static int sd_open(BlockDriverState *bs, const char *filename, int flags)
>          goto out;
>      }
>  
> -    s->cache_enabled = true;
> -    s->flush_fd = connect_to_sdog(s->addr, s->port);
> -    if (s->flush_fd < 0) {
> -        error_report("failed to connect");
> -        ret = s->flush_fd;
> -        goto out;
> +    if (flags & BDRV_O_NOCACHE) {
> +        s->cache_flags = SD_FLAG_CMD_DIRECT;
> +    } else if (flags & BDRV_O_CACHE_WB) {

'else' should be removed, and

> +        s->cache_flags = SD_FLAG_CMD_CACHE;
> +    }
> +
> +    if (s->cache_flags != SD_FLAG_CMD_DIRECT) {

should be 's->cache_flags == SD_FLAG_CMD_CACHE'?  Do we need to send a
flush request when cache=writethrough?

Thanks,

Kazutaka
namei.unix@gmail.com - Dec. 25, 2012, 8:26 a.m.
On 12/25/2012 03:47 PM, MORITA Kazutaka wrote:
> I wonder if we should disable the cache when cache=none.  Many management
> frontends use cache=none by default but, I think, users still expect
> that data is cached (e.g. by the disk write cache when a raw format is
> used).  cache=none only means that the host page cache is not used for
> VM disk IO.
> 
> In that sense,
> 

I am fine with adapting the option to these semantics.

>> > @@ -1118,12 +1118,19 @@ static int sd_open(BlockDriverState *bs, const char *filename, int flags)
>> >          goto out;
>> >      }
>> >  
>> > -    s->cache_enabled = true;
>> > -    s->flush_fd = connect_to_sdog(s->addr, s->port);
>> > -    if (s->flush_fd < 0) {
>> > -        error_report("failed to connect");
>> > -        ret = s->flush_fd;
>> > -        goto out;
>> > +    if (flags & BDRV_O_NOCACHE) {
>> > +        s->cache_flags = SD_FLAG_CMD_DIRECT;
>> > +    } else if (flags & BDRV_O_CACHE_WB) {
> 'else' should be removed, and
> 

It seems it should not be removed. We need this 'else if' to allow for the writethrough case.

>> > +        s->cache_flags = SD_FLAG_CMD_CACHE;
>> > +    }
>> > +
>> > +    if (s->cache_flags != SD_FLAG_CMD_DIRECT) {
> should be 's->cache_flags == SD_FLAG_CMD_CACHE'?  Do we need to send a
> flush request when cache=writethrough?

Looks good to me. I'll change it at V2.

Thanks,
Yuan
namei.unix@gmail.com - Dec. 25, 2012, 8:45 a.m.
On 12/25/2012 03:47 PM, MORITA Kazutaka wrote:
> I wonder if we should disable the cache when cache=none.  Many management
> frontends use cache=none by default but, I think, users still expect
> that data is cached (e.g. by the disk write cache when a raw format is
> used).  cache=none only means that the host page cache is not used for
> VM disk IO.
> 
> In that sense,

Well, I found that setting cache=directsync results in flags that contain
BDRV_O_CACHE_WB. Is this a bug in current master? If not, my current scheme
will be the only way to bypass Sheepdog's cache.

Thanks,
Yuan
namei.unix@gmail.com - Jan. 3, 2013, 1:43 p.m.
On 12/25/2012 04:45 PM, Liu Yuan wrote:
> Well, I found that setting cache=directsync results in flags that contain
> BDRV_O_CACHE_WB. Is this a bug in current master? If not, my current scheme
> will be the only way to bypass Sheepdog's cache.

Ping. Can anyone confirm whether it is a bug that cache=directsync passes
BDRV_O_CACHE_WB in the flags?

Thanks,
Yuan
Stefan Hajnoczi - Jan. 4, 2013, 4:38 p.m.
On Thu, Jan 03, 2013 at 09:43:53PM +0800, Liu Yuan wrote:
> On 12/25/2012 04:45 PM, Liu Yuan wrote:
> > Well, I found that setting cache=directsync results in flags that contain
> > BDRV_O_CACHE_WB. Is this a bug in current master? If not, my current
> > scheme will be the only way to bypass Sheepdog's cache.
> 
> Ping. Can anyone confirm whether it is a bug that cache=directsync passes
> BDRV_O_CACHE_WB in the flags?

Hi Yuan,
BDRV_O_NOCACHE means bypass host page cache (O_DIRECT).

BDRV_O_CACHE_WB specifies the cache semantics that the guest sees - that
means whether the disk cache is writethrough or writeback.

In other words, BDRV_O_NOCACHE is a host performance tweak while
BDRV_O_CACHE_WB changes the cache safety of the BlockDriverState.  A
protocol driver like sheepdog doesn't need to look at BDRV_O_CACHE_WB
because it is implemented in block.c (see bdrv_co_do_writev(), where QEMU
will flush after each write when !bs->enable_write_cache).

Stefan
namei.unix@gmail.com - Jan. 5, 2013, 4:40 a.m.
On 01/05/2013 12:38 AM, Stefan Hajnoczi wrote:
> Hi Yuan,
> BDRV_O_NOCACHE means bypass host page cache (O_DIRECT).
> 
> BDRV_O_CACHE_WB specifies the cache semantics that the guest sees - that
> means whether the disk cache is writethrough or writeback.
> 
> In other words, BDRV_O_NOCACHE is a host performance tweak while
> BDRV_O_CACHE_WB changes the cache safety of the BlockDriverState.  A
> protocol driver like sheepdog doesn't need to look at BDRV_O_CACHE_WB
> because it is implemented in block.c (see bdrv_co_do_writev(), where QEMU
> will flush after each write when !bs->enable_write_cache).

Hi Stefan,

  Thanks for your explanation. But after more investigation, I find
myself more confused:

                      flags passed from block layer
  {writeback, writethrough}      0x2042
  {directsync, off, none}        0x2062
  {unsafe}                       0x2242

So an underlying driver like Sheepdog can't depend on the 'flags' passed to
.bdrv_file_open() to choose the right semantics (this was possible in old
QEMU, IIRC).

If we can't rely on the 'flags' to get the cache indications of users, could
you point me to how to implement tristate cache control for a network block
driver like Sheepdog? For example, I want to implement the following
semantics:
    cache=writeback|none|off # enable writeback semantics for write
    cache=writethrough # enable writethrough semantics for write
    cache=directsync # disable cache completely

Thanks,
Yuan
namei.unix@gmail.com - Jan. 5, 2013, 5:29 a.m.
On 01/05/2013 12:40 PM, Liu Yuan wrote:
> [...]
> 
> If we can't rely on the 'flags' to get the cache indications of users, could
> you point me to how to implement tristate cache control for a network block
> driver like Sheepdog? For example, I want to implement the following
> semantics:
>     cache=writeback|none|off # enable writeback semantics for write
>     cache=writethrough # enable writethrough semantics for write
>     cache=directsync # disable cache completely

I tried old QEMU versions v1.1.0 and v1.1.2, and they still worked as I
expected. So I guess the generic block layer changed a bit and the meaning
of 'flags' became different from the old code, which did indeed allow block
drivers to interpret the 'flags' passed from bdrv_file_open().

With the current upstream code, it seems that BDRV_O_CACHE_WB is always
enabled, which makes 'flags' completely unusable for block drivers to get
the user's intent from the 'cache=' field.

So are there other means block drivers can rely on to interpret the
'cache semantics'?

Thanks,
Yuan
namei.unix@gmail.com - Jan. 5, 2013, 7:56 a.m.
On 01/05/2013 01:29 PM, Liu Yuan wrote:
> [...]
> 
> So are there other means block drivers can rely on to interpret the
> 'cache semantics'?

I found commit e1e9b0ac ('always open drivers in writeback mode').
This is really undesirable for network block drivers such as Sheepdog,
which implement their own cache mechanisms supporting
writeback/writethrough/directio behavior and want to interpret these
flags on their own.

Is there any means for block drivers to get the semantics of 'cache=xxx'
from users?

Thanks,
Yuan
Stefan Hajnoczi - Jan. 7, 2013, 12:31 p.m.
On Sat, Jan 05, 2013 at 03:56:11PM +0800, Liu Yuan wrote:
> [...]
> 
> I found commit e1e9b0ac ('always open drivers in writeback mode').
> This is really undesirable for network block drivers such as Sheepdog,
> which implement their own cache mechanisms supporting
> writeback/writethrough/directio behavior and want to interpret these
> flags on their own.
> 
> Is there any means for block drivers to get the semantics of 'cache=xxx'
> from users?

Hi Yuan,
Please explain what cache semantics the sheepdog server supports so I
can understand better what you are trying to achieve.

Stefan
Kevin Wolf - Jan. 7, 2013, 1:23 p.m.
Am 05.01.2013 08:56, schrieb Liu Yuan:
> [...]
> 
> Is there any means for block drivers to get the semantics of 'cache=xxx'
> from users?

No, and in theory they shouldn't have to care.

Why do you want to handle writethrough semantics in the block driver
rather than letting qemu take care of sending the right flushes?

Kevin
namei.unix@gmail.com - Jan. 8, 2013, 5:28 a.m.
On 01/07/2013 08:31 PM, Stefan Hajnoczi wrote:
> On Sat, Jan 05, 2013 at 03:56:11PM +0800, Liu Yuan wrote:
>> On 01/05/2013 01:29 PM, Liu Yuan wrote:
>>> On 01/05/2013 12:40 PM, Liu Yuan wrote:
>>>> On 01/05/2013 12:38 AM, Stefan Hajnoczi wrote:
>>>>> Hi Yuan,
>>>>> BDRV_O_NOCACHE means bypass host page cache (O_DIRECT).
>>>>>
>>>>> BDRV_O_CACHE_WB specifies the cache semantics that the guest sees - that
>>>>> means whether the disk cache is writethrough or writeback.
>>>>>
>>>>> In other words, BDRV_O_NOCACHE is a host performance tweak while
>>>>> BDRV_O_CACHE_WB changes the cache safety of the BlockDriverState.  A
>>>>> protocol driver like sheepdog doesn't need to look at BDRV_O_CACHE_WB
>>>>> because it is implemented in block.c (see bdrv_co_do_writev() where QEMU
>>>>> will flush when after each write when !bs->enable_write_cache).
>>>>
>>>> Hi Stefan,
>>>>
>>>>   Thanks for your explanation. But after more investigation, I find
>>>> myself more confused:
>>>>
>>>>                       flags passed from block layer
>>>>   {writeback, writethrough}      0x2042
>>>>   {directsync, off, none}        0x2062
>>>>   {unsafe}                       0x2242
>>>>
>>>> So underlying driver like Sheepdog can't depend on 'flags' passed from
>>>> .bdrv_file_open() to choose the right semantics (This was possible for
>>>> old QEMU IIRC).
>>>>
>>>> If we can't rely on the 'flags' to get the cache indications of users,
>>>> would you point me how to implement tristate cache control for network
>>>> block driver like Sheepdog? For e.g, I want to implement following
>>>> semantics:
>>>>     cache=writeback|none|off # enable writeback semantics for write
>>>>     cache=writethrough # enable writethrough semantics for write
>>>>     cache=directsync # disable cache completely
>>>>
>>>> Thanks,
>>>> Yuan
>>>>
>>>
>>> I tried the old QEMU and v1.1.0 and v1.1.2, they still worked as I
>>> expected. So I guess generic block layer got changed a bit and the
>>> 'flags' meaning turned different than old code, which did indeed allow
>>> block drivers to interpret the 'flags' passed from bdrv_file_open().
>>>
>>> With the current upstream code, it seems that BDRV_O_CACHE_WB is always
>>> enabled and makes 'flags' completely unusable for block drivers to get
>>> the indications of user by specifying 'cache=' field.
>>>
>>> So is there other means to allow block drivers to rely on, in order to
>>> interpret the 'cache semantics'?
>>>
>>
>> I found the commit:e1e9b0ac 'always open drivers in writeback mode'.
>> This is really undesired for network block drivers such as Sheepdog,
>> which implement its own cache mechanism that support
>> writeback/writethrough/directio behavior and then want to interpret
>> these flags on its own.
>>
>> Is there any means for block drivers to get the semantics of 'cache=xxx'
>> from users?
> 
> Hi Yuan,
> Please explain what cache semantics the sheepdog server supports so I
> can understand better what you are trying to achieve.
> 

Hi Stefan,

Sheepdog supports writethrough, writeback, and directio (bypassing the
cache), and I want to implement the following semantics:
    cache=writeback|none|off # enable writeback semantics for write
    cache=writethrough # enable writethrough semantics for write
    cache=directsync # disable cache completely

Thanks,
Yuan
namei.unix@gmail.com - Jan. 8, 2013, 5:42 a.m.
On 01/07/2013 09:23 PM, Kevin Wolf wrote:
> No, and in theory they shouldn't have to care.
> 
> Why do you want to handle writethrough semantics in the block driver
> rather than letting qemu care for sending the right flushes?

Because Sheepdog itself provides the client-side cache implementation, and
we need a means to get the cache hint from users when starting up the VM.
This cache supports:
 writethrough: provides a read cache for this VM, which is always
consistent with the cluster data
 writeback: provides a writeback cache for this VM, which only flushes the
dirty bits when the VM sends a 'flush' request
 directio: disables the cache completely for this VM

Old QEMU (before v1.2) allowed block drivers to see the cache flags, so
Sheepdog could interpret these flags on its own to choose the right cache
mode for each VM. Doesn't QEMU need to keep backward compatibility?

Thanks,
Yuan
Stefan Hajnoczi - Jan. 8, 2013, 9:40 a.m.
On Tue, Jan 08, 2013 at 01:42:35PM +0800, Liu Yuan wrote:
> On 01/07/2013 09:23 PM, Kevin Wolf wrote:
> > No, and in theory they shouldn't have to care.
> > 
> > Why do you want to handle writethrough semantics in the block driver
> > rather than letting qemu care for sending the right flushes?
> 
> Because Sheepdog itself provides the client-side cache implementation, and
> we need a means to get the cache hint from users when starting up the VM.
> This cache supports:
>  writethrough: provides a read cache for this VM, which is always
> consistent with the cluster data
>  writeback: provides a writeback cache for this VM, which only flushes the
> dirty bits when the VM sends a 'flush' request
>  directio: disables the cache completely for this VM
> 
> Old QEMU (before v1.2) allowed block drivers to see the cache flags, so
> Sheepdog could interpret these flags on its own to choose the right cache
> mode for each VM. Doesn't QEMU need to keep backward compatibility?

This sounds fine - it fits together with QEMU like this:

BDRV_O_NOCACHE means sheepdog directio.  We're disabling the client-side
sheepdog cache.

Otherwise use sheepdog writeback and let QEMU block.c decide when to
flush.  Never use sheepdog writethrough because it's redundant here.

Stefan
namei.unix@gmail.com - Jan. 8, 2013, 9:45 a.m.
On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote:
> Otherwise use sheepdog writeback and let QEMU block.c decide when to
> flush.  Never use sheepdog writethrough because it's redundant here.

I don't get it. What do you mean by 'redundant'? If we use the virtio &
sheepdog block drivers, how can we specify writethrough mode for the
Sheepdog cache? Here 'writethrough' means using a pure read cache, which
doesn't need flushes at all.

Thanks,
Yuan
Kevin Wolf - Jan. 8, 2013, 10 a.m.
Am 08.01.2013 10:45, schrieb Liu Yuan:
> On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote:
>> Otherwise use sheepdog writeback and let QEMU block.c decide when to
>> flush.  Never use sheepdog writethrough because it's redundant here.
> 
> I don't get it. What do you mean by 'redundant'? If we use virtio &
> sheepdog block driver, how can we specify writethrough mode for Sheepdog
> cache? Here 'writethrough' means use a pure read cache, which doesn't
> need flush at all.

A writethrough cache is equivalent to a write-back cache where each
write is followed by a flush. qemu makes sure to send these flushes, so
there is no need to use Sheepdog's writethrough mode.

Kevin
namei.unix@gmail.com - Jan. 8, 2013, 10:39 a.m.
On 01/08/2013 06:00 PM, Kevin Wolf wrote:
> Am 08.01.2013 10:45, schrieb Liu Yuan:
>> On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote:
>>> Otherwise use sheepdog writeback and let QEMU block.c decide when to
>>> flush.  Never use sheepdog writethrough because it's redundant here.
>>
>> I don't get it. What do you mean by 'redundant'? If we use virtio &
>> sheepdog block driver, how can we specify writethrough mode for Sheepdog
>> cache? Here 'writethrough' means use a pure read cache, which doesn't
>> need flush at all.
> 
> A writethrough cache is equivalent to a write-back cache where each
> write is followed by a flush. qemu makes sure to send these flushes, so
> there is no need to use Sheepdog's writethrough mode.
> 

Implementing writethrough as writeback + flush will cause considerable
overhead for a network block device like Sheepdog: a single write request
is executed as two requests: write + flush. This also explains why
I saw a regression in write performance: old QEMU could issue multiple
write requests in one go, but now the requests are sent one by one (even
with cache=writeback set), which makes Sheepdog write performance drop a
lot. Is it possible to issue multiple requests in one go as old QEMU did?

It seems hard to restore the old semantics of the cache flags due to the
new design of the QEMU block layer. So would you accept adding a 'flags'
field to BlockDriverState which carries the 'cache flags' from the user,
to keep backward compatibility?

Thanks,
Yuan
Kevin Wolf - Jan. 8, 2013, 10:51 a.m.
Am 08.01.2013 11:39, schrieb Liu Yuan:
> On 01/08/2013 06:00 PM, Kevin Wolf wrote:
>> Am 08.01.2013 10:45, schrieb Liu Yuan:
>>> On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote:
>>>> Otherwise use sheepdog writeback and let QEMU block.c decide when to
>>>> flush.  Never use sheepdog writethrough because it's redundant here.
>>>
>>> I don't get it. What do you mean by 'redundant'? If we use virtio &
>>> sheepdog block driver, how can we specify writethrough mode for Sheepdog
>>> cache? Here 'writethrough' means use a pure read cache, which doesn't
>>> need flush at all.
>>
>> A writethrough cache is equivalent to a write-back cache where each
>> write is followed by a flush. qemu makes sure to send these flushes, so
>> there is no need to use Sheepdog's writethrough mode.
> 
> Implementing writethrough as writeback + flush will cause considerable
> overhead for a network block device like Sheepdog: a single write request
> is executed as two requests: write + flush

Yeah, maybe we should have some kind of a FUA flag with write requests
instead of sending a separate flush.

> This also explains why
> I saw a regression in write performance: old QEMU could issue multiple
> write requests in one go, but now the requests are sent one by one (even
> with cache=writeback set), which makes Sheepdog write performance drop a
> lot. Is it possible to issue multiple requests in one go as old QEMU did?

Huh? We didn't change anything in that respect, or at least not that I'm
aware of. qemu always only had the single-request bdrv_co_writev, so if
anything, that batching must have happened inside the Sheepdog code. Do you
know what makes it not batch requests any more?

> It seems hard to restore the old semantics of the cache flags due to the
> new design of the QEMU block layer. So would you accept adding a 'flags'
> field to BlockDriverState which carries the 'cache flags' from the user,
> to keep backward compatibility?

No, going back to the old behaviour would break guest-toggled WCE.

Kevin
namei.unix@gmail.com - Jan. 8, 2013, 11:08 a.m.
On 01/08/2013 06:51 PM, Kevin Wolf wrote:
> Am 08.01.2013 11:39, schrieb Liu Yuan:
>> [...]
>> Implementing writethrough as writeback + flush will cause considerable
>> overhead for a network block device like Sheepdog: a single write request
>> is executed as two requests: write + flush
> 
> Yeah, maybe we should have some kind of a FUA flag with write requests
> instead of sending a separate flush.
> 
>> This also explains why
>> I saw a regression in write performance: old QEMU could issue multiple
>> write requests in one go, but now the requests are sent one by one (even
>> with cache=writeback set), which makes Sheepdog write performance drop a
>> lot. Is it possible to issue multiple requests in one go as old QEMU did?
> 
> Huh? We didn't change anything in that respect, or at least not that I'm
> aware of. qemu always only had the single-request bdrv_co_writev, so if
> anything, that batching must have happened inside the Sheepdog code. Do you
> know what makes it not batch requests any more?
> 

QEMU v1.1.x works well with batched write requests. The Sheepdog block
driver doesn't do any batching trick as far as I know; it just sends
requests as they are fed to it. There are no noticeable changes between
v1.1.x and current master with regard to sheepdog.c.

To detail the different behavior, from the point of view of the Sheepdog
daemon, which receives the requests from QEMU:
 old: can receive many requests at virtually the same time and handle
them concurrently
 now: only receives one request, handles it, replies, and gets another.

So I think the problem is that current QEMU waits for the write response
before sending another request.

>> It seems hard to restore the old semantics of the cache flags due to the
>> new design of the QEMU block layer. So would you accept adding a 'flags'
>> field to BlockDriverState which carries the 'cache flags' from the user,
>> to keep backward compatibility?
> 
> No, going back to the old behaviour would break guest-toggled WCE.
> 

Guest-toggled WCE only works with IDE, and it seems that virtio-blk doesn't
support it, no? And I think there is a huge number of virtio-blk users.

I didn't mean to break WCE. What I meant is to allow backward
compatibility. For example, the Sheepdog driver could make use of this
dedicated cache flags field to implement its own cache control without
affecting other drivers at all.

Thanks,
Yuan
Kevin Wolf - Jan. 8, 2013, 11:19 a.m.
Am 08.01.2013 12:08, schrieb Liu Yuan:
> [...]
> 
> QEMU v1.1.x works well with batched write requests. The Sheepdog block
> driver doesn't do any batching trick as far as I know; it just sends
> requests as they are fed to it. There are no noticeable changes between
> v1.1.x and current master with regard to sheepdog.c.
> 
> To detail the different behavior, from the point of view of the Sheepdog
> daemon, which receives the requests from QEMU:
>  old: can receive many requests at virtually the same time and handle
> them concurrently
>  now: only receives one request, handles it, replies, and gets another.
> 
> So I think the problem is that current QEMU waits for the write response
> before sending another request.

I can't see a reason why it would do that. Can you bisect this?

>>> It seems it is hard to restore into old semantics of cache flags due to
>>> new design of QEMU block layer. So will you accept that adding a 'flags'
>>> into BlockDriverState which carry the 'cache flags' from user to keep
>>> backward compatibility?
>>
>> No, going back to the old behaviour would break guest-toggled WCE.
>>
> 
> Guest-toggled WCE only works with IDE and seems that virtio-blk doesn't
> support it, no? And I think there are huge virtio-blk users.

It works with virtio-blk and SCSI as well.

> I didn't meant to break WCE. What I meant is to allow backward
> compatibility. For e.g, Sheepdog driver can make use of this dedicated
> cache flags to implement its own cache control and doesn't affect other
> drivers at all.

How would you do it? With a WCE that changes during runtime the idea of
a flag that is passed to bdrv_open() and stays valid as long as the
BlockDriverState exists doesn't match reality any more.

Kevin
namei.unix@gmail.com - Jan. 8, 2013, 11:35 a.m.
On 01/08/2013 07:19 PM, Kevin Wolf wrote:
> I can't see a reason why it would do that. Can you bisect this?
> 

Sure, bisecting it is on my schedule, but I can't promise a deadline.

>>>> It seems it is hard to restore into old semantics of cache flags due to
>>>> new design of QEMU block layer. So will you accept that adding a 'flags'
>>>> into BlockDriverState which carry the 'cache flags' from user to keep
>>>> backward compatibility?
>>>
>>> No, going back to the old behaviour would break guest-toggled WCE.
>>>
>>
>> Guest-toggled WCE only works with IDE and seems that virtio-blk doesn't
>> support it, no? And I think there are huge virtio-blk users.
> It works with virtio-blk and SCSI as well.
> 

Okay, I see the code indeed supports WCE, but it requires the guest
kernel to support it. For a kernel that doesn't support it, there is no
way to disable the write cache, no?

>> I didn't meant to break WCE. What I meant is to allow backward
>> compatibility. For e.g, Sheepdog driver can make use of this dedicated
>> cache flags to implement its own cache control and doesn't affect other
>> drivers at all.
> How would you do it? With a WCE that changes during runtime the idea of
> a flag that is passed to bdrv_open() and stays valid as long as the
> BlockDriverState exists doesn't match reality any more.

I am not sure if I get it right. What I meant is to allow Sheepdog to
control its cache based on the 'cache flags' at startup and ignore WCE at
runtime.

So you mean, if I choose a writeback cache at startup, and then the guest
disables it via WCE, the block layer will never send flush requests down
to the Sheepdog driver?

Thanks,
Yuan
Kevin Wolf - Jan. 8, 2013, 12:12 p.m.
Am 08.01.2013 12:35, schrieb Liu Yuan:
> On 01/08/2013 07:19 PM, Kevin Wolf wrote:
>> I can't see a reason why it would do that. Can you bisect this?
>>
> 
> Sure, bisect it is on my schedule, but I can't promise a deadline.

Ok, thanks. It would be good if it was before the hard freeze for 1.4.

>>>>> It seems it is hard to restore into old semantics of cache flags due to
>>>>> new design of QEMU block layer. So will you accept that adding a 'flags'
>>>>> into BlockDriverState which carry the 'cache flags' from user to keep
>>>>> backward compatibility?
>>>>
>>>> No, going back to the old behaviour would break guest-toggled WCE.
>>>>
>>>
>>> Guest-toggled WCE only works with IDE and seems that virtio-blk doesn't
>>> support it, no? And I think there are huge virtio-blk users.
>> It works with virtio-blk and SCSI as well.
>>
> 
> Okay, I see the code indeed support WCE but it requires Guest kernel to
> support it. For the kernel doesn't support it, there is no way to
> disable write cache, no?

With Linux guests, it's possible for SCSI. I'm not sure about
virtio-blk, but you might be right that you can't manually change it there.

>>> I didn't meant to break WCE. What I meant is to allow backward
>>> compatibility. For e.g, Sheepdog driver can make use of this dedicated
>>> cache flags to implement its own cache control and doesn't affect other
>>> drivers at all.
>> How would you do it? With a WCE that changes during runtime the idea of
>> a flag that is passed to bdrv_open() and stays valid as long as the
>> BlockDriverState exists doesn't match reality any more.
> 
> I am not sure if I get it right. What I meant is allow Sheepdog to
> control cache on the 'cache flags' at startup and ignore WCE on the run
> time.

If you start with cache=writeback and then the guests switches WCE off
and you ignore that, then you're causing data corruption in the case of
a crash. This is not an option.

> So you mean, if I choose witeback cache at startup, and then Guest
> disable it via WCE, then block layer will never send flush request down
> to Sheepdog driver?

Yes, this is the problematic case. Currently the qemu block layer makes
sure to send flushes, but if you disable that logic for Sheepdog, you
would get broken behaviour in this case.

Kevin
namei.unix@gmail.com - Jan. 8, 2013, 1:18 p.m.
On 01/08/2013 08:12 PM, Kevin Wolf wrote:
> Ok, thanks. It would be good if it was before the hard freeze for 1.4.
> 

Oops, sorry. It is a false alarm. Last time I was running the latest QEMU
on tmpfs, which serviced the requests too fast.

>>>>>> It seems it is hard to restore into old semantics of cache flags due to
>>>>>> new design of QEMU block layer. So will you accept that adding a 'flags'
>>>>>> into BlockDriverState which carry the 'cache flags' from user to keep
>>>>>> backward compatibility?
>>>>>
>>>>> No, going back to the old behaviour would break guest-toggled WCE.
>>>>>
>>>>
>>>> Guest-toggled WCE only works with IDE and seems that virtio-blk doesn't
>>>> support it, no? And I think there are huge virtio-blk users.
>>> It works with virtio-blk and SCSI as well.
>>>
>>
>> Okay, I see the code indeed support WCE but it requires Guest kernel to
>> support it. For the kernel doesn't support it, there is no way to
>> disable write cache, no?
> With Linux guests, it's possible for SCSI. I'm not sure about
> virtio-blk, but you might be right that you can't manually change it there.
> 
>>>> I didn't meant to break WCE. What I meant is to allow backward
>>>> compatibility. For e.g, Sheepdog driver can make use of this dedicated
>>>> cache flags to implement its own cache control and doesn't affect other
>>>> drivers at all.
>>> How would you do it? With a WCE that changes during runtime the idea of
>>> a flag that is passed to bdrv_open() and stays valid as long as the
>>> BlockDriverState exists doesn't match reality any more.
>>
>> I am not sure if I get it right. What I meant is allow Sheepdog to
>> control cache on the 'cache flags' at startup and ignore WCE on the run
>> time.
> If you start with cache=writeback and then the guests switches WCE off
> and you ignore that, then you're causing data corruption in the case of
> a crash. This is not an option.
> 
>> So you mean, if I choose witeback cache at startup, and then Guest
>> disable it via WCE, then block layer will never send flush request down
>> to Sheepdog driver?
> Yes, this is the problematic case. Currently the qemu block layer makes
> sure to send flushes, but if you disable that logic for Sheepdog, you
> would get broken behaviour in this case.

Maybe not, on second thought. See the following combinations:

   cache flags            WCE toggled and resulting behavior
   writethrough           writethrough
   writeback              writethrough (writeback + flush, as expected)

'cache flags' means what is specified with 'cache=xxx' at startup, and WCE
is toggled on the fly in the guest (suppose the guest kernel supports WCE
control).

So the result is *not* broken. If we set cache=writethrough for
Sheepdog, then WCE won't take any effect because 'flush' requests will be
ignored by the Sheepdog driver. And with cache=writeback, WCE can disable
the write cache, actually turning it into a writethrough cache by sending
a flush request after every write.

To conclude, letting Sheepdog interpret the cache flags won't cause
trouble even with the current guest WCE feature; the difference is that if
we set cache=writethrough, the guest can't change it via WCE toggling. Is
this behavior acceptable?

Thanks,
Yuan
namei.unix@gmail.com - Jan. 8, 2013, 1:23 p.m.
On 01/08/2013 09:18 PM, Liu Yuan wrote:
> So the result is *not* broken. If we set cache=writethrough for
> sheepdog, then WCE won't take any effect because 'flush' request will be
> ignored by Sheepdog driver. 

The merit of this 'writethrough' is that a write request will never be
interpreted as two requests (write + flush) in the Sheepdog driver.

Thanks,
Yuan
Paolo Bonzini - Jan. 9, 2013, 10:23 a.m.
Il 08/01/2013 13:12, Kevin Wolf ha scritto:
>> Okay, I see the code indeed support WCE but it requires Guest kernel to
>> support it. For the kernel doesn't support it, there is no way to
>> disable write cache, no?
> With Linux guests, it's possible for SCSI. I'm not sure about
> virtio-blk, but you might be right that you can't manually change it there.

New enough kernels support it.  Also, if you reboot from a system that
supports writeback to one that doesn't (including old-enough Linux), the
new QEMU will be able to use writethrough while running on the older system.

Paolo
Paolo Bonzini - Jan. 9, 2013, 10:25 a.m.
Il 08/01/2013 14:18, Liu Yuan ha scritto:
> Maybe not for a second thought. See following combination:
> 
>    cache flags            WCE toggled and resulting behavior
>    writethrough           writethrough
>    writeback              writetrhough (writeback + flush as expected)
> 
> cache flags means specify 'cache=xxx' at startup and WCE toggled on the
> fly in the guest (supose guest kernel support WCE control)
> 
> So the result is *not* broken. If we set cache=writethrough for
> sheepdog, then WCE won't take any effect because 'flush' request will be
> ignored by Sheepdog driver. And with cache=writeback, WCE does disable
> the writecache and actually turns it to a writethrough cache by sending
> flush req every time for write.
> 
> To conclude, let Sheepdog interpret cache flags won't cause trouble even
> with current Guest WCE feature, the different is that if we set
> cache=writethrough, guest can't change it via WCE toggling. Is this
> behavior acceptable?

But why is it useful to force-disable writeback caching?  Do you have
any performance numbers?

Paolo
namei.unix@gmail.com - Jan. 9, 2013, 10:36 a.m.
On 01/09/2013 06:25 PM, Paolo Bonzini wrote:
> But why is it useful to force-disable writeback caching?  Do you have
> any performance numbers?

This is not related to performance. What I want is to allow users to
choose the Sheepdog cache mode (writethrough/writeback/directio) via
cache=xxx.

Obviously, writeback mode shows much better write performance than the
writethrough cache in Sheepdog. NOTE: the writethrough cache in Sheepdog
is a read-only cache which is always consistent with the cluster data and
thus needs no flush at all. This is to boost read performance for
read-intensive guests on Sheepdog images.

Thanks,
Yuan
Paolo Bonzini - Jan. 9, 2013, 10:40 a.m.
Il 09/01/2013 11:36, Liu Yuan ha scritto:
>> But why is it useful to force-disable writeback caching?  Do you have
>> any performance numbers?
> This is not related to performance. What I want is to allow users to
> choose Sheepdog cache mode (writethrough/writeback/directio) from cache=xxx.
> 
> Obviously writeback mode show much better write performance over
> writethrough cache in Sheepdog. NOTE: writethrough cache in Sheepdog is
> a readonly cache which is always consistent with cluster data thus need
> no flush at all. This is to boost read performance for read intensive
> Guest over Sheepdog iamges.

Ok, so the questions are (compared to older QEMU's writethrough mode):

1) how much slower is QEMU's emulated-writethrough mode for writes, due to
the extra requests?

2) how much slower is QEMU's writeback mode for reads, due to the
different structure of the cache?

Paolo
namei.unix@gmail.com - Jan. 9, 2013, 10:46 a.m.
On 01/09/2013 06:40 PM, Paolo Bonzini wrote:
> 1) how slower is QEMU's emulated-writethrough mode for writes, due to
> the extra requests?
> 

I'll collect some numbers on it.

> 2) how slower is QEMU's writeback mode for reads, due to the different
> structure of the cache?

Sorry, I don't get your question.

I didn't say QEMU's writeback mode is slow for reads, did I?

Thanks,
Yuan
namei.unix@gmail.com - Jan. 9, 2013, 10:58 a.m.
On 01/09/2013 06:46 PM, Liu Yuan wrote:
> I'll collect some numbers on it.

Well, when I use hdparm -W 0 /dev/vdx to choose the writethrough cache
in the Guest (virtio disk), it says:

test@vm1:~$ sudo hdparm -W 0 /dev/vdb

/dev/vdb:
 setting drive write-caching to 0 (off)
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(flushcache) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(setcache) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(flushcache) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device

Do I need extra flags to enable WCE?

Both guest and host kernels are 3.7, and I start QEMU with the following command:

$ qemu-system-x86_64 --enable-kvm -drive
file=~/images/test1,if=virtio,cache=writeback -smp 2 -cpu host -m 1024
-drive file=sheepdog:test,if=virtio,cache=writeback

Thanks,
Yuan
Kevin Wolf - Jan. 9, 2013, 11:10 a.m.
Am 09.01.2013 11:36, schrieb Liu Yuan:
> On 01/09/2013 06:25 PM, Paolo Bonzini wrote:
>> But why is it useful to force-disable writeback caching?  Do you have
>> any performance numbers?
> 
> This is not related to performance. What I want is to allow users to
> choose Sheepdog cache mode (writethrough/writeback/directio) from cache=xxx.

The question was which effects these modes have, and especially what the
user-visible differences between writeback+flush and writethrough cache are.

> Obviously writeback mode show much better write performance over
> writethrough cache in Sheepdog. NOTE: writethrough cache in Sheepdog is
> a readonly cache which is always consistent with cluster data thus need
> no flush at all. This is to boost read performance for read intensive
> Guest over Sheepdog iamges.

This is the definition of a writethrough cache.

Kevin
Paolo Bonzini - Jan. 9, 2013, 11:10 a.m.
Il 09/01/2013 11:58, Liu Yuan ha scritto:
> On 01/09/2013 06:46 PM, Liu Yuan wrote:
>> I'll collect some numbers on it.
> 
> Well, when I use hdparm -W 0 /dev/vdx to choose the writethrough cache
> in the Guest (virtio disk), it says:
> 
> test@vm1:~$ sudo hdparm -W 0 /dev/vdb
> 
> /dev/vdb:
>  setting drive write-caching to 0 (off)
>  HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
>  HDIO_DRIVE_CMD(flushcache) failed: Inappropriate ioctl for device
>  HDIO_DRIVE_CMD(setcache) failed: Inappropriate ioctl for device
>  HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
>  HDIO_DRIVE_CMD(flushcache) failed: Inappropriate ioctl for device
>  HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
> 
> Do I need extra flags to enable WCE?

"hdparm" is specific to IDE devices.  sysfs is supported by SCSI
(including IDE with libata, and virtio-scsi) and virtio-blk.

You can use this for SCSI:

echo 'write back' > /sys/block/sda/device/scsi_disk/*/cache_type

and a similar command for virtio-blk.

Paolo

> Both Guest and Host kernel is 3.7 and I start QEMU by following command:
> 
> $ qemu-system-x86_64 --enable-kvm -drive
> file=~/images/test1,if=virtio,cache=writeback -smp 2 -cpu host -m 1024
> -drive file=sheepdog:test,if=virtio,cache=writeback
> 
> Thanks,
> Yuan
> 
>
namei.unix@gmail.com - Jan. 9, 2013, 12:07 p.m.
On 01/09/2013 06:46 PM, Liu Yuan wrote:
>> 1) how slower is QEMU's emulated-writethrough mode for writes, due to
>> the extra requests?
>> 
> I'll collect some numbers on it.
> 

Okay, I got some numbers. I ran three sheep daemons on the same host to
emulate a 3-node cluster; the Sheepdog image has 3 copies, and I put the
Sheepdog client cache and the Sheepdog backend storage on a tmpfs.

Guest and host are both Linux 3.7. I start QEMU with the following command:

$ qemu-system-x86_64 --enable-kvm -drive
file=~/images/test1,if=virtio,cache=writeback -smp 2 -cpu host -m 1024
-drive file=sheepdog:test,if=virtio,cache=writeback

I ran 'dd if=/dev/urandom of=/dev/vdb bs=1M count=100 oflag=direct' 5
times and took the average number:

    emulated (write + flush)  old impl (single write)
            13.3 M/s               13.7 M/s

boost percentage: (13.7 - 13.3)/13.3 ≈ 3%.

The boost number is not big here, but if we ran QEMU and the sheep daemon
on separate boxes, we would expect a bigger boost because the overhead of
the extra 'flush' request increases.

Besides performance, I think backward compatibility is more important:
  1 if we run an old kernel host (quite possible for a long-running
server) which doesn't support WCE, then we will never have a chance to
choose a writethrough cache for the guest OS with new QEMU (most users
tend to update user-space tools to exclude bugs)
  2 the upper-layer software which relies on 'cache=xxx' to choose the
cache mode will have its assumptions broken by new QEMU.

and my proposal (adding another field to BlockDriverState to allow a
self-interpreted cache flag) can work well with the current block layer:

Sheepdog driver behavior:
   cache flags            WCE toggled and resulting behavior
   writethrough           writethrough
   writeback              writethrough (writeback + flush)

We can see that the writeback behavior with guest-WCE toggling is the
same as expected. The difference is that if we set cache=writethrough,
the guest can't change it via WCE toggling.

Thanks,
Yuan
namei.unix@gmail.com - Jan. 9, 2013, 12:16 p.m.
On 01/09/2013 08:07 PM, Liu Yuan wrote:
> and my proposal (adding another field to BlockDriverState to allow
> self-interpretation cache flag) can work well with current block layer:
> 
> Sheepdog driver behavior:
>    cache flags            WCE toggled and resulting behavior
>    writethrough           writethrough
>    writeback              writetrhough (writeback + flush)
> 
> We can see that the writeback behavior for Guest-WCE toggling is the
> same as expected. The difference is that if we set cache=writethrough,
> guest can't change it via WCE toggling.

I am wondering if the above scheme can be applied to the block layer
itself, so that 'cache=xxx' acts as a stronger control over WCE toggling.
Then we don't need the extra field.

Thanks,
Yuan
Kevin Wolf - Jan. 9, 2013, 12:42 p.m.
Am 09.01.2013 13:07, schrieb Liu Yuan:
> Besides performance, I think backward compatibility is more important:
>   1 if we run a old kernel host (quite possible for a long running
> server) which doesn't support WCE, then we will never have a chance to
> choose writethrough cache for guest OS against new QEMU (most users tend
> to update user space tools to exclude bugs)

How does the host kernel even play a role in this context?

>   2 The upper layer software which relies on the 'cache=xxx' to choose
> cache mode will fail its assumption against new QEMU.

Which assumptions do you mean? As far as I can say the behaviour hasn't
changed, except possibly for the performance.

> We can see that the writeback behavior for Guest-WCE toggling is the
> same as expected. The difference is that if we set cache=writethrough,
> guest can't change it via WCE toggling.

Then you need to make sure to communicate this to the guest. I'm not
convinced that adding extra logic to each device for this is a good idea.

Kevin
namei.unix@gmail.com - Jan. 9, 2013, 1:04 p.m.
On 01/09/2013 08:42 PM, Kevin Wolf wrote:
> Am 09.01.2013 13:07, schrieb Liu Yuan:
>> Besides performance, I think backward compatibility is more important:
>>   1 if we run a old kernel host (quite possible for a long running
>> server) which doesn't support WCE, then we will never have a chance to
>> choose writethrough cache for guest OS against new QEMU (most users tend
>> to update user space tools to exclude bugs)
> 
> How does the host kernel even play a role in this context?
> 

Oops, I was confused. Okay, a guest OS which doesn't support WCE can't
change the cache type, as we discussed.

>>   2 The upper layer software which relies on the 'cache=xxx' to choose
>> cache mode will fail its assumption against new QEMU.
> 
> Which assumptions do you mean? As far as I can say the behaviour hasn't
> changed, except possibly for the performance.
> 

When users set 'cache=writethrough', they expect to export only a
writethrough cache to the guest, but with new QEMU they will actually get
a writeback cache by default.

>> We can see that the writeback behavior for Guest-WCE toggling is the
>> same as expected. The difference is that if we set cache=writethrough,
>> guest can't change it via WCE toggling.
> 
> Then you need to make sure to communicate this to the guest. I'm not
> convinced that adding extra logic to each device for this is a good idea.
> 

We don't need to communicate it to the guest. I think 'cache=xxx' means
what kind of cache the users *expect* to export to the guest OS. So if
cache=writethrough is set, the guest OS shouldn't be able to turn it into
a writeback cache magically. This is like buying a disk with a
'writethrough' cache built in: I wouldn't expect it to turn out to be a
disk with a writeback cache under the hood, which could possibly lose
data when a power outage happens.

So guest-toggled WCE only works when users provide a writeback-capable disk.

Thanks,
Yuan
Paolo Bonzini - Jan. 9, 2013, 3:10 p.m.
Il 09/01/2013 14:04, Liu Yuan ha scritto:
> > >   2 The upper layer software which relies on the 'cache=xxx' to choose
> > > cache mode will fail its assumption against new QEMU.
> > 
> > Which assumptions do you mean? As far as I can say the behaviour hasn't
> > changed, except possibly for the performance.
>
> When users set 'cache=writethrough' to export only a writethrough cache
> to Guest, but with new QEMU, it will actually get a writeback cache as
> default.

They get a writeback cache implementation-wise, but they get a
writethrough cache safety-wise.  How the cache is implemented doesn't
matter, as long as it "looks like" a writethrough cache.

In fact, consider a local disk that doesn't support FUA.  In old QEMU,
images used to be opened with O_DSYNC and that splits each write into
WRITE+FLUSH, just like new QEMU.  All that changes is _where_ the
flushes are created.  Old QEMU changes it in the kernel, new QEMU
changes it in userspace.

> We don't need to communicate to the guest. I think 'cache=xxx' means
> what kind of cache the users *expect* to export to Guest OS. So if
> cache=writethrough set, Guest OS couldn't turn it to writeback cache
> magically. This is like I bought a disk with 'writethrough' cache
> built-in, I didn't expect that it turned to be a disk with writeback
> cache under the hood which could possible lose data when power outage
> happened.

It's not by magic.  It's by explicitly requesting the disk to do this.

Perhaps it's a bug that the cache mode is not reset when the machine is
reset.  I haven't checked that, but it would be a valid complaint.

Paolo

> 
> So Guest-WCE only works when users provide us a writeback capable disk.
namei.unix@gmail.com - Jan. 10, 2013, 5:38 a.m.
On 01/09/2013 11:10 PM, Paolo Bonzini wrote:
> Il 09/01/2013 14:04, Liu Yuan ha scritto:
>>>>   2 The upper layer software which relies on the 'cache=xxx' to choose
>>>> cache mode will fail its assumption against new QEMU.
>>>
>>> Which assumptions do you mean? As far as I can say the behaviour hasn't
>>> changed, except possibly for the performance.
>>
>> When users set 'cache=writethrough' to export only a writethrough cache
>> to Guest, but with new QEMU, it will actually get a writeback cache as
>> default.
> 
> They get a writeback cache implementation-wise, but they get a
> writethrough cache safety-wise.  How the cache is implemented doesn't
> matter, as long as it "looks like" a writethrough cache.
> 

> In fact, consider a local disk that doesn't support FUA.  In old QEMU,
> images used to be opened with O_DSYNC and that splits each write into
> WRITE+FLUSH, just like new QEMU.  All that changes is _where_ the
> flushes are created.  Old QEMU changes it in the kernel, new QEMU
> changes it in userspace.
> 
>> We don't need to communicate to the guest. I think 'cache=xxx' means
>> what kind of cache the users *expect* to export to Guest OS. So if
>> cache=writethrough set, Guest OS couldn't turn it to writeback cache
>> magically. This is like I bought a disk with 'writethrough' cache
>> built-in, I didn't expect that it turned to be a disk with writeback
>> cache under the hood which could possible lose data when power outage
>> happened.
> 
> It's not by magic.  It's by explicitly requesting the disk to do this.
> 
> Perhaps it's a bug that the cache mode is not reset when the machine is
> reset.  I haven't checked that, but it would be a valid complaint.
> 

Ah, I didn't get the current implementation right. I tried the 3.7 kernel
and it works as expected (cache=writethrough results in a 'writethrough'
cache in the guest).

It looks fine to me to emulate writethrough as writeback + flush, since
the performance drop isn't big, though Sheepdog itself supports a true
writethrough cache (no flush).

I am going to send v2 of directio patch for sheepdog driver.

Thanks,
Yuan
Jamie Lokier - Jan. 10, 2013, 3:12 p.m.
Kevin Wolf wrote:
> Am 08.01.2013 11:39, schrieb Liu Yuan:
> > On 01/08/2013 06:00 PM, Kevin Wolf wrote:
> >> Am 08.01.2013 10:45, schrieb Liu Yuan:
> >>> On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote:
> >>>> Otherwise use sheepdog writeback and let QEMU block.c decide when to
> >>>> flush.  Never use sheepdog writethrough because it's redundant here.
> >>>
> >>> I don't get it. What do you mean by 'redundant'? If we use virtio &
> >>> sheepdog block driver, how can we specify writethrough mode for Sheepdog
> >>> cache? Here 'writethrough' means use a pure read cache, which doesn't
> >>> need flush at all.
> >>
> >> A writethrough cache is equivalent to a write-back cache where each
> >> write is followed by a flush. qemu makes sure to send these flushes, so
> >> there is no need use Sheepdog's writethrough mode.
> > 
> > Implement writethrough as writeback + flush will cause considerable
> > overhead for network block device like Sheepdog: a single write request
> > will be executed as two requests: write + flush
> 
> Yeah, maybe we should have some kind of a FUA flag with write requests
> instead of sending a separate flush.

Note that write+FUA has different semantics than write+flush, at least
with regular disks.

write+FUA commits just what was written, while write+flush commits
everything that was written before.

-- Jamie
Kevin Wolf - Jan. 10, 2013, 3:21 p.m.
Am 10.01.2013 16:12, schrieb Jamie Lokier:
> Kevin Wolf wrote:
>> Am 08.01.2013 11:39, schrieb Liu Yuan:
>>> On 01/08/2013 06:00 PM, Kevin Wolf wrote:
>>>> Am 08.01.2013 10:45, schrieb Liu Yuan:
>>>>> On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote:
>>>>>> Otherwise use sheepdog writeback and let QEMU block.c decide when to
>>>>>> flush.  Never use sheepdog writethrough because it's redundant here.
>>>>>
>>>>> I don't get it. What do you mean by 'redundant'? If we use virtio &
>>>>> sheepdog block driver, how can we specify writethrough mode for Sheepdog
>>>>> cache? Here 'writethrough' means use a pure read cache, which doesn't
>>>>> need flush at all.
>>>>
>>>> A writethrough cache is equivalent to a write-back cache where each
>>>> write is followed by a flush. qemu makes sure to send these flushes, so
>>>> there is no need use Sheepdog's writethrough mode.
>>>
>>> Implement writethrough as writeback + flush will cause considerable
>>> overhead for network block device like Sheepdog: a single write request
>>> will be executed as two requests: write + flush
>>
>> Yeah, maybe we should have some kind of a FUA flag with write requests
>> instead of sending a separate flush.
> 
> Note that write+FUA has different semantics than write+flush, at least
> with regular disks.
> 
> write+FUA commits just what was written, while write+flush commits
> everything that was written before.

True. However, when you use it for implementing a writethrough mode,
i.e. every single write has the FUA flag set, then it ought to be the
same. One thing to take care of might be doing one explicit flush when
switching from writeback to writethrough.

Kevin
Jamie Lokier - Jan. 10, 2013, 3:25 p.m.
Paolo Bonzini wrote:
> Il 09/01/2013 14:04, Liu Yuan ha scritto:
> > > >   2 The upper layer software which relies on the 'cache=xxx' to choose
> > > > cache mode will fail its assumption against new QEMU.
> > > 
> > > Which assumptions do you mean? As far as I can say the behaviour hasn't
> > > changed, except possibly for the performance.
> >
> > When users set 'cache=writethrough' to export only a writethrough cache
> > to Guest, but with new QEMU, it will actually get a writeback cache as
> > default.
> 
> They get a writeback cache implementation-wise, but they get a
> writethrough cache safety-wise.  How the cache is implemented doesn't
> matter, as long as it "looks like" a writethrough cache.
> 
> In fact, consider a local disk that doesn't support FUA.  In old QEMU,
> images used to be opened with O_DSYNC and that splits each write into
> WRITE+FLUSH, just like new QEMU.  All that changes is _where_ the
> flushes are created.  Old QEMU changes it in the kernel, new QEMU
> changes it in userspace.
> 
> > We don't need to communicate to the guest. I think 'cache=xxx' means
> > what kind of cache the users *expect* to export to Guest OS. So if
> > cache=writethrough set, Guest OS couldn't turn it to writeback cache
> > magically. This is like I bought a disk with 'writethrough' cache
> > built-in, I didn't expect that it turned to be a disk with writeback
> > cache under the hood which could possible lose data when power outage
> > happened.
> 
> It's not by magic.  It's by explicitly requesting the disk to do this.
> 
> Perhaps it's a bug that the cache mode is not reset when the machine is
> reset.  I haven't checked that, but it would be a valid complaint.

The question is, is cache=writeback/cache=writethrough an initial
setting of guest-visible WCE that the guest is allowed to change, or
is cache=writethrough a way of saying "don't have a write cache"
(which may or may not be reflected in the guest-visible disk id)?

I couldn't tell from QEMU documentation which is intended.  It would
be a bit silly if it means different things for different backend
storage.

I have seen (obscure) guest code which toggled WCE to simulate FUA,
and there is plenty of advice out there saying to set WCE=0 for
certain kinds of databases because of its presumed crash safety.  Even
very ancient Linux and Windows guests can set WCE=0 with IDE and
SCSI.

So from a guest point of view, I think guest setting WCE=0 should mean
exactly the same as FUA every write, or flush after every write, until
guest setting WCE=1.

-- Jamie
Paolo Bonzini - Jan. 10, 2013, 3:31 p.m.
Il 10/01/2013 16:25, Jamie Lokier ha scritto:
>> > Perhaps it's a bug that the cache mode is not reset when the machine is
>> > reset.  I haven't checked that, but it would be a valid complaint.
> The question is, is cache=writeback/cache=writethrough an initial
> setting of guest-visible WCE that the guest is allowed to change, or
> is cache=writethrough a way of saying "don't have a write cache"
> (which may or may not be reflected in the guest-visible disk id).

It used to be the latter (with reflection in the disk data), but now it
is the former.

> I couldn't tell from QEMU documentation which is intended.  It would
> be a bit silly if it means different things for different backend
> storage.

It means the same thing for IDE, SCSI and virtio-blk.  Other backends,
such as SD, do not even have flush, and are really slow with
cache=writethrough because they write one sector at a time.  For this
reason they cannot really be used in a "safe" manner.

> I have seen (obscure) guest code which toggled WCE to simulate FUA,

That's quite useless, since WCE=1->WCE=0 is documented to cause a flush
(and it does).  Might as well send a real flush.

> and there is plenty of advice out there saying to set WCE=0 for
> certain kinds of databases because of its presumed crash safety.  Even
> very ancient guests on Linux and Windows can change WCE=0 with IDE and
> SCSI.

This is supported in QEMU.

> So from a guest point of view, I think guest setting WCE=0 should mean
> exactly the same as FUA every write, or flush after every write, until
> guest setting WCE=1.

Yes, that's indeed how it is implemented.

Paolo
Jamie Lokier - Jan. 10, 2013, 5:22 p.m.
Paolo Bonzini wrote:
> Il 10/01/2013 16:25, Jamie Lokier ha scritto:
> >> > Perhaps it's a bug that the cache mode is not reset when the machine is
> >> > reset.  I haven't checked that, but it would be a valid complaint.
> > The question is, is cache=writeback/cache=writethrough an initial
> > setting of guest-visible WCE that the guest is allowed to change, or
> > is cache=writethrough a way of saying "don't have a write cache"
> > (which may or may not be reflected in the guest-visible disk id).
> 
> It used to be the latter (with reflection in the disk data), but now it
> is the former.

Interesting.  It could be worth a note in the manual.

> > I couldn't tell from QEMU documentation which is intended.  It would
> > be a bit silly if it means different things for different backend
> > storage.
> 
> It means the same thing for IDE, SCSI and virtio-blk.  Other backends,
> such as SD, do not even have flush, and are really slow with
> cache=writethrough because they write one sector at a time.  For this
> reason they cannot really be used in a "safe" manner.
> 
> > I have seen (obscure) guest code which toggled WCE to simulate FUA,
> 
> That's quite useless, since WCE=1->WCE=0 is documented to cause a flush
> (and it does).  Might as well send a real flush.

It was because the ATA spec seemed to permit the combination of WCE
with no flush command supported.  So WCE=1->WCE=0 was used to flush,
and kept at WCE=0 for the subsequent logging write-FUA(s), until a
non-FUA write was wanted.

-- Jamie
MORITA Kazutaka - Jan. 11, 2013, 7:52 a.m.
At Thu, 10 Jan 2013 13:38:16 +0800,
Liu Yuan wrote:
> 
> On 01/09/2013 11:10 PM, Paolo Bonzini wrote:
> > Il 09/01/2013 14:04, Liu Yuan ha scritto:
> >>>>   2 The upper layer software which relies on the 'cache=xxx' to choose
> >>>> cache mode will fail its assumption against new QEMU.
> >>>
> >>> Which assumptions do you mean? As far as I can say the behaviour hasn't
> >>> changed, except possibly for the performance.
> >>
> >> When users set 'cache=writethrough' to export only a writethrough cache
> >> to Guest, but with new QEMU, it will actually get a writeback cache as
> >> default.
> > 
> > They get a writeback cache implementation-wise, but they get a
> > writethrough cache safety-wise.  How the cache is implemented doesn't
> > matter, as long as it "looks like" a writethrough cache.
> > 
> 
> > In fact, consider a local disk that doesn't support FUA.  In old QEMU,
> > images used to be opened with O_DSYNC and that splits each write into
> > WRITE+FLUSH, just like new QEMU.  All that changes is _where_ the
> > flushes are created.  Old QEMU changes it in the kernel, new QEMU
> > changes it in userspace.
> > 
> >> We don't need to communicate to the guest. I think 'cache=xxx' means
> >> what kind of cache the users *expect* to export to Guest OS. So if
> >> cache=writethrough set, Guest OS couldn't turn it to writeback cache
> >> magically. This is like I bought a disk with 'writethrough' cache
> >> built-in, I didn't expect that it turned to be a disk with writeback
> >> cache under the hood which could possible lose data when power outage
> >> happened.
> > 
> > It's not by magic.  It's by explicitly requesting the disk to do this.
> > 
> > Perhaps it's a bug that the cache mode is not reset when the machine is
> > reset.  I haven't checked that, but it would be a valid complaint.
> > 
> 
> Ah I didn't get the current implementation right. I tried the 3.7 kernel
> and it works as expected (cache=writethrough results in a 'writethrough'
> cache in the guest).
> 
> It looks fine to me to emulate writethrough as writeback + flush, since
> the performance drop isn't big, though sheepdog itself supports a true
> writethrough cache (no flush).

Can we drop the SD_FLAG_CMD_CACHE flag from sheepdog write requests
when bdrv_enable_write_cache() is false?  Then the requests behave
like FUA writes and we can safely omit succeeding flush requests.

Thanks,

Kazutaka
namei.unix@gmail.com - Jan. 11, 2013, 8:07 a.m.
On 01/11/2013 03:52 PM, MORITA Kazutaka wrote:
> At Thu, 10 Jan 2013 13:38:16 +0800,
> Liu Yuan wrote:
>>
>> On 01/09/2013 11:10 PM, Paolo Bonzini wrote:
>>> Il 09/01/2013 14:04, Liu Yuan ha scritto:
>>>>>>   2 The upper layer software which relies on the 'cache=xxx' to choose
>>>>>> cache mode will fail its assumption against new QEMU.
>>>>>
>>>>> Which assumptions do you mean? As far as I can say the behaviour hasn't
>>>>> changed, except possibly for the performance.
>>>>
>>>> When users set 'cache=writethrough' to export only a writethrough cache
>>>> to Guest, but with new QEMU, it will actually get a writeback cache as
>>>> default.
>>>
>>> They get a writeback cache implementation-wise, but they get a
>>> writethrough cache safety-wise.  How the cache is implemented doesn't
>>> matter, as long as it "looks like" a writethrough cache.
>>>
>>
>>> In fact, consider a local disk that doesn't support FUA.  In old QEMU,
>>> images used to be opened with O_DSYNC and that splits each write into
>>> WRITE+FLUSH, just like new QEMU.  All that changes is _where_ the
>>> flushes are created.  Old QEMU changes it in the kernel, new QEMU
>>> changes it in userspace.
>>>
>>>> We don't need to communicate to the guest. I think 'cache=xxx' means
>>>> what kind of cache the users *expect* to export to Guest OS. So if
>>>> cache=writethrough set, Guest OS couldn't turn it to writeback cache
>>>> magically. This is like I bought a disk with 'writethrough' cache
>>>> built-in, I didn't expect that it turned to be a disk with writeback
>>>> cache under the hood which could possible lose data when power outage
>>>> happened.
>>>
>>> It's not by magic.  It's by explicitly requesting the disk to do this.
>>>
>>> Perhaps it's a bug that the cache mode is not reset when the machine is
>>> reset.  I haven't checked that, but it would be a valid complaint.
>>>
>>
>> Ah I didn't get the current implementation right. I tried the 3.7 kernel
>> and it works as expected (cache=writethrough results in a 'writethrough'
>> cache in the guest).
>>
>> It looks fine to me to emulate writethrough as writeback + flush, since
>> the performance drop isn't big, though sheepdog itself supports a true
>> writethrough cache (no flush).
> 
> Can we drop the SD_FLAG_CMD_CACHE flag from sheepdog write requests
> when bdrv_enable_write_cache() is false?  Then the requests behave
> like FUA writes and we can safely omit succeeding flush requests.
> 

Let's first implement an emulated writethrough cache and then look for a
method to play tricks to implement a true writethrough cache.
This would bring complexity: before we drop SD_FLAG_CMD_CACHE, we
need to flush beforehand. What's more, bdrv_enable_write_cache() is always
true for now; I guess this needs a generic block layer change to get the
writethrough/writeback hints.

Thanks,
Yuan
Paolo Bonzini - Jan. 11, 2013, 9 a.m.
> bdrv_enable_write_cache() is always
> true for now; I guess this needs a generic block layer change to get the
> writethrough/writeback hints.

That's not correct.  Try hdparm with an IDE disk, or sysfs with a SCSI
disk (recent kernels will also support sysfs with an IDE or virtio-blk
disk) and you'll see that it changes.

Paolo
namei.unix@gmail.com - Jan. 11, 2013, 9:04 a.m.
On 01/11/2013 05:00 PM, Paolo Bonzini wrote:
> That's not correct.  Try hdparm with an IDE disk, or sysfs with a SCSI
> disk (recent kernels will also support sysfs with an IDE or virtio-blk
> disk) and you'll see that it changes.

Okay, at least at startup, even with cache=writethrough set, I saw
bdrv_enable_write_cache() return true from sd_open(). So you mean this
will be reset later (after sd_open())?

Thanks,
Yuan
Kevin Wolf - Jan. 11, 2013, 9:32 a.m.
Am 11.01.2013 08:52, schrieb MORITA Kazutaka:
> At Thu, 10 Jan 2013 13:38:16 +0800,
> Liu Yuan wrote:
>>
>> On 01/09/2013 11:10 PM, Paolo Bonzini wrote:
>>> Il 09/01/2013 14:04, Liu Yuan ha scritto:
>>>>>>   2 The upper layer software which relies on the 'cache=xxx' to choose
>>>>>> cache mode will fail its assumption against new QEMU.
>>>>>
>>>>> Which assumptions do you mean? As far as I can say the behaviour hasn't
>>>>> changed, except possibly for the performance.
>>>>
>>>> When users set 'cache=writethrough' to export only a writethrough cache
>>>> to Guest, but with new QEMU, it will actually get a writeback cache as
>>>> default.
>>>
>>> They get a writeback cache implementation-wise, but they get a
>>> writethrough cache safety-wise.  How the cache is implemented doesn't
>>> matter, as long as it "looks like" a writethrough cache.
>>>
>>
>>> In fact, consider a local disk that doesn't support FUA.  In old QEMU,
>>> images used to be opened with O_DSYNC and that splits each write into
>>> WRITE+FLUSH, just like new QEMU.  All that changes is _where_ the
>>> flushes are created.  Old QEMU changes it in the kernel, new QEMU
>>> changes it in userspace.
>>>
>>>> We don't need to communicate to the guest. I think 'cache=xxx' means
>>>> what kind of cache the users *expect* to export to Guest OS. So if
>>>> cache=writethrough set, Guest OS couldn't turn it to writeback cache
>>>> magically. This is like I bought a disk with 'writethrough' cache
>>>> built-in, I didn't expect that it turned to be a disk with writeback
>>>> cache under the hood which could possible lose data when power outage
>>>> happened.
>>>
>>> It's not by magic.  It's by explicitly requesting the disk to do this.
>>>
>>> Perhaps it's a bug that the cache mode is not reset when the machine is
>>> reset.  I haven't checked that, but it would be a valid complaint.
>>>
>>
>> Ah I didn't get the current implementation right. I tried the 3.7 kernel
>> and it works as expected (cache=writethrough results in a 'writethrough'
>> cache in the guest).
>>
>> It looks fine to me to emulate writethrough as writeback + flush, since
>> the performance drop isn't big, though sheepdog itself supports a true
>> writethrough cache (no flush).
> 
> Can we drop the SD_FLAG_CMD_CACHE flag from sheepdog write requests
> when bdrv_enable_write_cache() is false?  Then the requests behave
> like FUA writes and we can safely omit succeeding flush requests.

First we would need to make sure that on a writeback -> writethrough
switch a flush happens. But once this is implemented, I think your
suggestion would work well.

Kevin
Paolo Bonzini - Jan. 11, 2013, 9:34 a.m.
----- Messaggio originale -----
> Da: "Liu Yuan" <namei.unix@gmail.com>
> A: "Paolo Bonzini" <pbonzini@redhat.com>
> Cc: "Kevin Wolf" <kwolf@redhat.com>, "Stefan Hajnoczi" <stefanha@gmail.com>, qemu-devel@nongnu.org, "MORITA Kazutaka"
> <morita.kazutaka@lab.ntt.co.jp>
> Inviato: Venerdì, 11 gennaio 2013 10:04:23
> Oggetto: Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
> 
> On 01/11/2013 05:00 PM, Paolo Bonzini wrote:
> > That's not correct.  Try hdparm with an IDE disk, or sysfs with a
> > SCSI disk (recent kernels will also support sysfs with an IDE or
> > virtio-blk disk) and you'll see that it changes.
> 
> Okay, at least at startup, even with cache=writethrough set, I saw
> bdrv_enable_write_cache() return true from sd_open(). So you mean
> this will be reset later (after sd_open())?

It is always true for the protocol.  It is not always true for the format.

This is correct.  qcow2 can do some writes in writeback mode, but they
will always be followed by a flush before the write is reported to the
guest.  It is a bit faster, and does not lose any correctness.

Paolo
namei.unix@gmail.com - Jan. 11, 2013, 9:38 a.m.
On 01/11/2013 05:34 PM, Paolo Bonzini wrote:
> It is always true for the protocol.  It is not always true for the format.
> 

So you mean bdrv_enable_write_cache() always returns true for Sheepdog
at any time?

Thanks,
Yuan
Paolo Bonzini - Jan. 11, 2013, 9:40 a.m.
> On 01/11/2013 05:34 PM, Paolo Bonzini wrote:
> > It is always true for the protocol.  It is not always true for the
> > format.
> 
> So you mean bdrv_enable_write_cache() always returns true for Sheepdog
> at any time?

Yes, at least now.  You could in principle have a format that calls
bdrv_set_enable_write_cache(bs->file, ...), but nobody does it yet.

Paolo

Patch

diff --git a/block/sheepdog.c b/block/sheepdog.c
index ceabc00..134329a 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -36,7 +36,8 @@ 
 
 #define SD_FLAG_CMD_WRITE    0x01
 #define SD_FLAG_CMD_COW      0x02
-#define SD_FLAG_CMD_CACHE    0x04
+#define SD_FLAG_CMD_CACHE    0x04 /* Writeback mode for cache */
+#define SD_FLAG_CMD_DIRECT   0x08 /* Don't use cache */
 
 #define SD_RES_SUCCESS       0x00 /* Success */
 #define SD_RES_UNKNOWN       0x01 /* Unknown error */
@@ -293,7 +294,7 @@  typedef struct BDRVSheepdogState {
 
     char name[SD_MAX_VDI_LEN];
     bool is_snapshot;
-    bool cache_enabled;
+    uint32_t cache_flags;
 
     char *addr;
     char *port;
@@ -977,8 +978,8 @@  static int coroutine_fn add_aio_request(BDRVSheepdogState *s, AIOReq *aio_req,
         hdr.flags = SD_FLAG_CMD_WRITE | flags;
     }
 
-    if (s->cache_enabled) {
-        hdr.flags |= SD_FLAG_CMD_CACHE;
+    if (s->cache_flags) {
+        hdr.flags |= s->cache_flags;
     }
 
     hdr.oid = oid;
@@ -1023,7 +1024,7 @@  static int coroutine_fn add_aio_request(BDRVSheepdogState *s, AIOReq *aio_req,
 
 static int read_write_object(int fd, char *buf, uint64_t oid, int copies,
                              unsigned int datalen, uint64_t offset,
-                             bool write, bool create, bool cache)
+                             bool write, bool create, uint32_t cache_flags)
 {
     SheepdogObjReq hdr;
     SheepdogObjRsp *rsp = (SheepdogObjRsp *)&hdr;
@@ -1047,9 +1048,7 @@  static int read_write_object(int fd, char *buf, uint64_t oid, int copies,
         hdr.opcode = SD_OP_READ_OBJ;
     }
 
-    if (cache) {
-        hdr.flags |= SD_FLAG_CMD_CACHE;
-    }
+    hdr.flags |= cache_flags;
 
     hdr.oid = oid;
     hdr.data_length = datalen;
@@ -1072,18 +1071,19 @@  static int read_write_object(int fd, char *buf, uint64_t oid, int copies,
 }
 
 static int read_object(int fd, char *buf, uint64_t oid, int copies,
-                       unsigned int datalen, uint64_t offset, bool cache)
+                       unsigned int datalen, uint64_t offset,
+                       uint32_t cache_flags)
 {
     return read_write_object(fd, buf, oid, copies, datalen, offset, false,
-                             false, cache);
+                             false, cache_flags);
 }
 
 static int write_object(int fd, char *buf, uint64_t oid, int copies,
                         unsigned int datalen, uint64_t offset, bool create,
-                        bool cache)
+                        uint32_t cache_flags)
 {
     return read_write_object(fd, buf, oid, copies, datalen, offset, true,
-                             create, cache);
+                             create, cache_flags);
 }
 
 static int sd_open(BlockDriverState *bs, const char *filename, int flags)
@@ -1118,12 +1118,19 @@  static int sd_open(BlockDriverState *bs, const char *filename, int flags)
         goto out;
     }
 
-    s->cache_enabled = true;
-    s->flush_fd = connect_to_sdog(s->addr, s->port);
-    if (s->flush_fd < 0) {
-        error_report("failed to connect");
-        ret = s->flush_fd;
-        goto out;
+    if (flags & BDRV_O_NOCACHE) {
+        s->cache_flags = SD_FLAG_CMD_DIRECT;
+    } else if (flags & BDRV_O_CACHE_WB) {
+        s->cache_flags = SD_FLAG_CMD_CACHE;
+    }
+
+    if (s->cache_flags != SD_FLAG_CMD_DIRECT) {
+        s->flush_fd = connect_to_sdog(s->addr, s->port);
+        if (s->flush_fd < 0) {
+            error_report("failed to connect");
+            ret = s->flush_fd;
+            goto out;
+        }
     }
 
     if (snapid || tag[0] != '\0') {
@@ -1140,7 +1147,7 @@  static int sd_open(BlockDriverState *bs, const char *filename, int flags)
 
     buf = g_malloc(SD_INODE_SIZE);
     ret = read_object(fd, buf, vid_to_vdi_oid(vid), 0, SD_INODE_SIZE, 0,
-                      s->cache_enabled);
+                      s->cache_flags);
 
     closesocket(fd);
 
@@ -1387,7 +1394,7 @@  static void sd_close(BlockDriverState *bs)
 
     qemu_aio_set_fd_handler(s->fd, NULL, NULL, NULL, NULL);
     closesocket(s->fd);
-    if (s->cache_enabled) {
+    if (s->cache_flags) {
         closesocket(s->flush_fd);
     }
     g_free(s->addr);
@@ -1423,7 +1430,7 @@  static int sd_truncate(BlockDriverState *bs, int64_t offset)
     datalen = SD_INODE_SIZE - sizeof(s->inode.data_vdi_id);
     s->inode.vdi_size = offset;
     ret = write_object(fd, (char *)&s->inode, vid_to_vdi_oid(s->inode.vdi_id),
-                       s->inode.nr_copies, datalen, 0, false, s->cache_enabled);
+                       s->inode.nr_copies, datalen, 0, false, s->cache_flags);
     close(fd);
 
     if (ret < 0) {
@@ -1506,7 +1513,7 @@  static int sd_create_branch(BDRVSheepdogState *s)
     }
 
     ret = read_object(fd, buf, vid_to_vdi_oid(vid), s->inode.nr_copies,
-                      SD_INODE_SIZE, 0, s->cache_enabled);
+                      SD_INODE_SIZE, 0, s->cache_flags);
 
     closesocket(fd);
 
@@ -1707,7 +1714,7 @@  static int coroutine_fn sd_co_flush_to_disk(BlockDriverState *bs)
     int ret;
     unsigned int wlen = 0, rlen = 0;
 
-    if (!s->cache_enabled) {
+    if (s->cache_flags == SD_FLAG_CMD_DIRECT) {
         return 0;
     }
 
@@ -1723,7 +1730,7 @@  static int coroutine_fn sd_co_flush_to_disk(BlockDriverState *bs)
     if (rsp->result == SD_RES_INVALID_PARMS) {
         dprintf("disable write cache since the server doesn't support it\n");
 
-        s->cache_enabled = false;
+        s->cache_flags = SD_FLAG_CMD_DIRECT;
         closesocket(s->flush_fd);
         return 0;
     }
@@ -1774,7 +1781,7 @@  static int sd_snapshot_create(BlockDriverState *bs, QEMUSnapshotInfo *sn_info)
     }
 
     ret = write_object(fd, (char *)&s->inode, vid_to_vdi_oid(s->inode.vdi_id),
-                       s->inode.nr_copies, datalen, 0, false, s->cache_enabled);
+                       s->inode.nr_copies, datalen, 0, false, s->cache_flags);
     if (ret < 0) {
         error_report("failed to write snapshot's inode.");
         goto cleanup;
@@ -1791,7 +1798,7 @@  static int sd_snapshot_create(BlockDriverState *bs, QEMUSnapshotInfo *sn_info)
     inode = (SheepdogInode *)g_malloc(datalen);
 
     ret = read_object(fd, (char *)inode, vid_to_vdi_oid(new_vid),
-                      s->inode.nr_copies, datalen, 0, s->cache_enabled);
+                      s->inode.nr_copies, datalen, 0, s->cache_flags);
 
     if (ret < 0) {
         error_report("failed to read new inode info. %s", strerror(errno));
@@ -1845,7 +1852,7 @@  static int sd_snapshot_goto(BlockDriverState *bs, const char *snapshot_id)
 
     buf = g_malloc(SD_INODE_SIZE);
     ret = read_object(fd, buf, vid_to_vdi_oid(vid), s->inode.nr_copies,
-                      SD_INODE_SIZE, 0, s->cache_enabled);
+                      SD_INODE_SIZE, 0, s->cache_flags);
 
     closesocket(fd);
 
@@ -1942,7 +1949,7 @@  static int sd_snapshot_list(BlockDriverState *bs, QEMUSnapshotInfo **psn_tab)
         /* we don't need to read entire object */
         ret = read_object(fd, (char *)&inode, vid_to_vdi_oid(vid),
                           0, SD_INODE_SIZE - sizeof(inode.data_vdi_id), 0,
-                          s->cache_enabled);
+                          s->cache_flags);
 
         if (ret) {
             continue;
@@ -2003,11 +2010,11 @@  static int do_load_save_vmstate(BDRVSheepdogState *s, uint8_t *data,
         if (load) {
             ret = read_object(fd, (char *)data, vmstate_oid,
                               s->inode.nr_copies, data_len, offset,
-                              s->cache_enabled);
+                              s->cache_flags);
         } else {
             ret = write_object(fd, (char *)data, vmstate_oid,
                                s->inode.nr_copies, data_len, offset, create,
-                               s->cache_enabled);
+                               s->cache_flags);
         }
 
         if (ret < 0) {