Patchwork [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

login
register
mail settings
Submitter Alex Bligh
Date March 18, 2013, 12:18 p.m.
Message ID <1363609123-20748-1-git-send-email-alex@alex.org.uk>
Download mbox | patch
Permalink /patch/228471/
State New
Headers show

Comments

Alex Bligh - March 18, 2013, 12:18 p.m.
Due to what is almost certainly a kernel bug, writes with
O_DIRECT may continue to reference the page after the write
has been marked as completed, particularly in the case of
TCP retransmit. In other scenarios, this "merely" risks
data corruption on the write, but with Xen pages from domU
are only transiently mapped into dom0's memory, resulting
in kernel panics when they are subsequently accessed.

This brings PV devices in line with emulated devices. Removing
O_DIRECT is safe as barrier operations are now correctly passed
through.

See:
  http://lists.xen.org/archives/html/xen-devel/2012-12/msg01154.html
for more details.

This patch has already been applied to the xenbits.org
qemu-upstream repository.
  http://xenbits.xen.org/gitweb/?p=qemu-upstream-unstable.git;a=commit;h=f3903bbac78a81fcbce1350cdce860764a62783a
Clearly it should go into qemu's own repository.

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Alex Bligh <alex@alex.org.uk>
---
 hw/xen_disk.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)
Stefan Hajnoczi - March 18, 2013, 1:03 p.m.
On Mon, Mar 18, 2013 at 12:18:43PM +0000, Alex Bligh wrote:
> Due to what is almost certainly a kernel bug, writes with
> O_DIRECT may continue to reference the page after the write
> has been marked as completed, particularly in the case of
> TCP retransmit. In other scenarios, this "merely" risks
> data corruption on the write, but with Xen pages from domU
> are only transiently mapped into dom0's memory, resulting
> in kernel panics when they are subsequently accessed.
> 
> This brings PV devices in line with emulated devices. Removing
> O_DIRECT is safe as barrier operations are now correctly passed
> through.
> 
> See:
>   http://lists.xen.org/archives/html/xen-devel/2012-12/msg01154.html
> for more details.

From the mailing list discussion it appears that this patch is a
workaround - using the dom0 page cache to avoid the failed host kernel
paging request, which is caused by the true bug.

Has any progress been made at understanding the true problem?

> This patch has already been applied to the xenbits.org
> qemu-upstream repository.
>   http://xenbits.xen.org/gitweb/?p=qemu-upstream-unstable.git;a=commit;h=f3903bbac78a81fcbce1350cdce860764a62783a
> Clearly it should go into qemu's own repository.
> 
> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> Signed-off-by: Alex Bligh <alex@alex.org.uk>
> ---
>  hw/xen_disk.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/hw/xen_disk.c b/hw/xen_disk.c
> index a402ac8..14f8723 100644
> --- a/hw/xen_disk.c
> +++ b/hw/xen_disk.c
> @@ -603,7 +603,7 @@ static int blk_init(struct XenDevice *xendev)
>      }
>  
>      /* read-only ? */
> -    qflags = BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NATIVE_AIO;
> +    qflags = BDRV_O_CACHE_WB | BDRV_O_NATIVE_AIO;
>      if (strcmp(blkdev->mode, "w") == 0) {
>          qflags |= BDRV_O_RDWR;
>      } else {
> -- 
> 1.7.4.1
> 
>
Alex Bligh - March 18, 2013, 1:19 p.m.
Stefan,

--On 18 March 2013 14:03:49 +0100 Stefan Hajnoczi <stefanha@gmail.com> 
wrote:

> From the mailing list discussion it appears that this patch is a
> workaround - using the dom0 page cache to avoid the failed host kernel
> paging request, which is caused by the true bug.
>
> Has any progress been made at understanding the true problem?

It certainly is a workaround. My understanding is that ANY write with
O_DIRECT turned on can write data written to the page after the
O_DIRECT write is marked as complete, if tcp retransmit (and various
other skb related things) happen.

This thread is shorter that the one on xen-devel if you want to follow
the history and the explanation.
  http://comments.gmane.org/gmane.linux.nfs/54325

xen is particularly affected as the page in question is paged out
of dom0 when the access happens. However, anything using O_DIRECT
I/O to any form of network device (NFS, iSCSI, DRDB) is by my
analysis vulnerable to writing corrupt data.

Mel Gorman kindly forward ported (but not to tip) Ian Campbell's
fragment tracking patch, and I sent it to netdev here:
  http://marc.info/?l=linux-netdev&m=135912467817630

Given this was originally raised as an issue in 2008, and probably
has been an issue 'forever', I think it would be fair to say there
has not been an enormous amount of interest in fixing the underlying
problem.
Paolo Bonzini - March 18, 2013, 1:32 p.m.
Il 18/03/2013 13:18, Alex Bligh ha scritto:
> Due to what is almost certainly a kernel bug, writes with
> O_DIRECT may continue to reference the page after the write
> has been marked as completed, particularly in the case of
> TCP retransmit. In other scenarios, this "merely" risks
> data corruption on the write, but with Xen pages from domU
> are only transiently mapped into dom0's memory, resulting
> in kernel panics when they are subsequently accessed.
> 
> This brings PV devices in line with emulated devices. Removing
> O_DIRECT is safe as barrier operations are now correctly passed
> through.
> 
> See:
>   http://lists.xen.org/archives/html/xen-devel/2012-12/msg01154.html
> for more details.
> 
> This patch has already been applied to the xenbits.org
> qemu-upstream repository.
>   http://xenbits.xen.org/gitweb/?p=qemu-upstream-unstable.git;a=commit;h=f3903bbac78a81fcbce1350cdce860764a62783a
> Clearly it should go into qemu's own repository.
> 
> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> Signed-off-by: Alex Bligh <alex@alex.org.uk>
> ---
>  hw/xen_disk.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/hw/xen_disk.c b/hw/xen_disk.c
> index a402ac8..14f8723 100644
> --- a/hw/xen_disk.c
> +++ b/hw/xen_disk.c
> @@ -603,7 +603,7 @@ static int blk_init(struct XenDevice *xendev)
>      }
>  
>      /* read-only ? */
> -    qflags = BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NATIVE_AIO;
> +    qflags = BDRV_O_CACHE_WB | BDRV_O_NATIVE_AIO;
>      if (strcmp(blkdev->mode, "w") == 0) {
>          qflags |= BDRV_O_RDWR;
>      } else {
> 

How does migration work with this change?  The target may read stale
data from the page cache.  Fix the kernel bug please.

Paolo


Paolo
Alex Bligh - March 18, 2013, 1:49 p.m.
--On 18 March 2013 14:32:23 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote:

> How does migration work with this change?  The target may read stale
> data from the page cache.  Fix the kernel bug please.

Presumably the same way as if writeback caching is selected. I presume
that must fsync() / fdatasync() all the data to disk, and a barrier will
produce one of those.

It would be great to fix the kernel bug (and I have submitted code), but
the fix is pretty intrusive (see the link I posted) and there appears
to be little interest in taking it forward. Certainly my kernel hacking
skills are not adequate to the task.

The current position is that booting a Xen domU which does disk I/O
(Ubuntu cloud image used as the test case) with an NFS root crashes dom0
absolutely repeatably, and kills all other guests. Unless and until
there is a kernel fix for that, Xen is in essence unusable with HVM
and network based disk backend. So we need a workaround in the meantime
which doesn't require a kernel fix.
Paolo Bonzini - March 18, 2013, 2:05 p.m.
Il 18/03/2013 14:49, Alex Bligh ha scritto:
> 
> 
> --On 18 March 2013 14:32:23 +0100 Paolo Bonzini <pbonzini@redhat.com>
> wrote:
> 
>> How does migration work with this change?  The target may read stale
>> data from the page cache.  Fix the kernel bug please.
> 
> Presumably the same way as if writeback caching is selected. I presume
> that must fsync() / fdatasync() all the data to disk, and a barrier will
> produce one of those.

No, that's done already.  The source does an fsync/fdatasync before
terminating the migration.

The problem is that the target's page cache might host image data from a
previous run.  If you do not use O_DIRECT, it will not see the changes
made on the source.

> It would be great to fix the kernel bug (and I have submitted code), but
> the fix is pretty intrusive (see the link I posted) and there appears
> to be little interest in taking it forward. Certainly my kernel hacking
> skills are not adequate to the task.
> 
> The current position is that booting a Xen domU which does disk I/O
> (Ubuntu cloud image used as the test case) with an NFS root crashes dom0
> absolutely repeatably, and kills all other guests. Unless and until
> there is a kernel fix for that, Xen is in essence unusable with HVM
> and network based disk backend. So we need a workaround in the meantime
> which doesn't require a kernel fix.

If you want to have this patch, you need to detect the bug and only do
the hack if the bug is detect.  Plus, disable migration when the hack is
in use.

Also:

1) why does blkback not have the bug?

2) does it also affect virtio disks (or perhaps AHCI too)?  I think
Stefano experimented with virtio in Xen.  If it does, then you're
working around the problem in the wrong place.

Paolo
Alex Bligh - March 18, 2013, 2:30 p.m.
Paolo,

--On 18 March 2013 15:05:08 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote:

>> Presumably the same way as if writeback caching is selected. I presume
>> that must fsync() / fdatasync() all the data to disk, and a barrier will
>> produce one of those.
>
> No, that's done already.  The source does an fsync/fdatasync before
> terminating the migration.
>
> The problem is that the target's page cache might host image data from a
> previous run.  If you do not use O_DIRECT, it will not see the changes
> made on the source.

I was under the impression that with cache=writeback, qemu doesn't
use O_DIRECT, in which case why isn't there the same danger under
kvm, i.e. that the target page cache contains data from a previous
run?

>> It would be great to fix the kernel bug (and I have submitted code), but
>> the fix is pretty intrusive (see the link I posted) and there appears
>> to be little interest in taking it forward. Certainly my kernel hacking
>> skills are not adequate to the task.
>>
>> The current position is that booting a Xen domU which does disk I/O
>> (Ubuntu cloud image used as the test case) with an NFS root crashes dom0
>> absolutely repeatably, and kills all other guests. Unless and until
>> there is a kernel fix for that, Xen is in essence unusable with HVM
>> and network based disk backend. So we need a workaround in the meantime
>> which doesn't require a kernel fix.
>
> If you want to have this patch, you need to detect the bug and only do
> the hack if the bug is detect.  Plus, disable migration when the hack is
> in use.

I originally suggested having this as an option (detecting it live
and non-destructively is practically impossible - suggestions welcome),
but xen-devel felt it should just be changed. My original preference
was for xl to process cache= type options (so those using a local
file system known to be safe could use O_DIRECT still), but that
requires a change to xenstore, was not popular, and is probably too
intrusive. I patched it the way the xendevel folks wanted.

Disabling migration seems a bit excessive when migration isn't disabled
with cache=unsafe (AFAIK), and the alternative (using O_DIRECT)
is far far more unsafe (one tcp retransmit and your system is dead).

> 1) why does blkback not have the bug?
>
> 2) does it also affect virtio disks (or perhaps AHCI too)?  I think
> Stefano experimented with virtio in Xen.  If it does, then you're
> working around the problem in the wrong place.

I believe it affects PV disks and not emulated disks as emulated disks
under Xen do not use O_DIRECT (despite migration apparently working
notwithstanding your comment above).

Stefano did ack the patch, and for a one line change it's been
through a pretty extensive discussion on xen-devel ...

I've no idea what else it affects. I'd suggest it also affects kvm,
save that the kvm 'bad' will be writing the wrong data, not hosing
the whole machine.
Paolo Bonzini - March 18, 2013, 2:49 p.m.
Il 18/03/2013 15:30, Alex Bligh ha scritto:
> Paolo,
> 
> --On 18 March 2013 15:05:08 +0100 Paolo Bonzini <pbonzini@redhat.com>
> wrote:
> 
>>> Presumably the same way as if writeback caching is selected. I presume
>>> that must fsync() / fdatasync() all the data to disk, and a barrier will
>>> produce one of those.
>>
>> No, that's done already.  The source does an fsync/fdatasync before
>> terminating the migration.
>>
>> The problem is that the target's page cache might host image data from a
>> previous run.  If you do not use O_DIRECT, it will not see the changes
>> made on the source.
> 
> I was under the impression that with cache=writeback, qemu doesn't
> use O_DIRECT, in which case why isn't there the same danger under
> kvm, i.e. that the target page cache contains data from a previous
> run?

KVM in fact only supports migration using cache=none.  This does not
apply of course if you're using cache-coherent storage, such as rbd or
gluster; or if you're using one of the userspace backends that bypass
the page cache, like NBD or libiscsi.

> Disabling migration seems a bit excessive when migration isn't disabled
> with cache=unsafe (AFAIK)

It is not QEMU's task.  There are cases where the cache= options are
unnecessary or ignored.  But indeed libvirt warns (or blocks, I don't
remember) in this case.

> , and the alternative (using O_DIRECT)
> is far far more unsafe (one tcp retransmit and your system is dead).
> 
>> 1) why does blkback not have the bug?
>>
>> 2) does it also affect virtio disks (or perhaps AHCI too)?  I think
>> Stefano experimented with virtio in Xen.  If it does, then you're
>> working around the problem in the wrong place.
> 
> I believe it affects PV disks and not emulated disks as emulated disks
> under Xen do not use O_DIRECT (despite migration apparently working
> notwithstanding your comment above).

If libxl does migration without O_DIRECT, then that's a bug in libxl.
What about blkback?  IIRC it uses bios, so it also bypasses the page cache.

> Stefano did ack the patch, and for a one line change it's been
> through a pretty extensive discussion on xen-devel ...

It may be a one-line change, but it completely changes the paths that
I/O goes through.  Apparently the discussion was not enough.

> I've no idea what else it affects. I'd suggest it also affects kvm,
> save that the kvm 'bad' will be writing the wrong data, not hosing
> the whole machine.
>
Alex Bligh - March 18, 2013, 3:40 p.m.
Paolo,

--On 18 March 2013 15:49:48 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote:

>>> The problem is that the target's page cache might host image data from a
>>> previous run.  If you do not use O_DIRECT, it will not see the changes
>>> made on the source.
>>
>> I was under the impression that with cache=writeback, qemu doesn't
>> use O_DIRECT, in which case why isn't there the same danger under
>> kvm, i.e. that the target page cache contains data from a previous
>> run?
>
> KVM in fact only supports migration using cache=none.  This does not
> apply of course if you're using cache-coherent storage, such as rbd or
> gluster; or if you're using one of the userspace backends that bypass
> the page cache, like NBD or libiscsi.

Well right now emulated devices under xen with qemu-upstream
dm do not use O_DIRECT. So that's going to hit this problem anyway.

Let me understand the problem a little more. Let's say a VM has a
disk backed by NFS. It runs on node A, at which point pages are
introduced to the page cache. It then migrates to node B, with
node A flushing its dirty pages to disk before the migration
completed, but not emptying the page cache completely.
It's then migrated back to node A in which case I think you are
saying that node A may still hold the page cache entries from
prior to the first migration.

However, node A has closed and reopened the file. Doesn't NFSv4
notice that the file has changed at this point and invalidate
all page cache entries? Else (re)opening a file on one NFS client
that has been changed on another would never work. I know this
is unpredictable if you haven't closed the file, but in this
instance A has closed the file.

>> Disabling migration seems a bit excessive when migration isn't disabled
>> with cache=unsafe (AFAIK)
>
> It is not QEMU's task.  There are cases where the cache= options are
> unnecessary or ignored.  But indeed libvirt warns (or blocks, I don't
> remember) in this case.

Well we are kvm user too, and I understand hat kvm (as opposed to
libvirt) does not error or warn if you live migrate with cache=writeback.

I've no problem if xl or libvirt or whatever error or warn. My usage
is API based, rather than xl / libvirt based.

> If libxl does migration without O_DIRECT, then that's a bug in libxl.
> What about blkback?  IIRC it uses bios, so it also bypasses the page
> cache.

Possibly a bug in xl rather than libxl, but as no emulated devices
use O_DIRECT, that bug is already there, and isn't in QEMU.

>> Stefano did ack the patch, and for a one line change it's been
>> through a pretty extensive discussion on xen-devel ...
>
> It may be a one-line change, but it completely changes the paths that
> I/O goes through.  Apparently the discussion was not enough.

What would you suggest?
Paolo Bonzini - March 18, 2013, 4:19 p.m.
Il 18/03/2013 16:40, Alex Bligh ha scritto:
> Paolo,
> 
> --On 18 March 2013 15:49:48 +0100 Paolo Bonzini <pbonzini@redhat.com>
> wrote:
> 
>>>> The problem is that the target's page cache might host image data from a
>>>> previous run.

I remembered this incorrectly, sorry.  It's not from a previous run,
it's from the beginning of this run.  See
http://wiki.qemu.org/Migration/Storage for more information.

A VM has a disk backed by NFS. It runs on node A, at which point pages
are introduced to the page cache. It then migrates to node B, which
entails starting the VM on node B while it is still running on node A.
Closing has yet to happen on node A, but the file is already open on
node B; anything that is cached on node B will never be invalidated.

Thus, any changes done to the disk on node A during migration may not
become visible on node B.

>>> Disabling migration seems a bit excessive when migration isn't disabled
>>> with cache=unsafe (AFAIK)
>>
>> It is not QEMU's task.  There are cases where the cache= options are
>> unnecessary or ignored.  But indeed libvirt warns (or blocks, I don't
>> remember) in this case.
> 
> Well we are kvm user too, and I understand hat kvm (as opposed to
> libvirt) does not error or warn if you live migrate with cache=writeback.
> 
> I've no problem if xl or libvirt or whatever error or warn. My usage
> is API based, rather than xl / libvirt based.

What makes libvirt not an API (just like libxl)?

>> If libxl does migration without O_DIRECT, then that's a bug in libxl.
>> What about blkback?  IIRC it uses bios, so it also bypasses the page
>> cache.
> 
> Possibly a bug in xl rather than libxl, but as no emulated devices
> use O_DIRECT, that bug is already there, and isn't in QEMU.

blkback is the in-kernel PV device, it's not an emulated device.

>>> Stefano did ack the patch, and for a one line change it's been
>>> through a pretty extensive discussion on xen-devel ...
>>
>> It may be a one-line change, but it completely changes the paths that
>> I/O goes through.  Apparently the discussion was not enough.
> 
> What would you suggest?

Nothing except fixing the bug in the kernel.  No one has yet explained
why blkback is not susceptible to the same bug.

Paolo
Alex Bligh - March 18, 2013, 4:53 p.m.
Paolo,

--On 18 March 2013 17:19:14 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote:

> I remembered this incorrectly, sorry.  It's not from a previous run,
> it's from the beginning of this run.  See
> http://wiki.qemu.org/Migration/Storage for more information.
>
> A VM has a disk backed by NFS. It runs on node A, at which point pages
> are introduced to the page cache. It then migrates to node B, which
> entails starting the VM on node B while it is still running on node A.
> Closing has yet to happen on node A, but the file is already open on
> node B; anything that is cached on node B will never be invalidated.
>
> Thus, any changes done to the disk on node A during migration may not
> become visible on node B.

This might be a difference between Xen and KVM. On Xen migration is
made to a server in a paused state, and it's only unpaused when
the migration to B is complete. There's a sort of extra handshake at
the end.

I believe what's happening is that libxl_domain_suspend when
called with LIBXL_SUSPEND_LIVE will do a final fsync()/fdatasync()
at the end, then await a migrate_receiver_ready message, and only
when that has been received will it send a migrate_permission_to_go
message which unpauses the domain. Before that, I don't believe the
disk is read (I may be wrong about that). The sending code is in
migrate_domain() in xl_cmdimpl.c, and the receiving code is in
migrate_receive() (same file). On xen at least, I don't think
the VM is ever started on node B whilst it is still running on node
A.

>> I've no problem if xl or libvirt or whatever error or warn. My usage
>> is API based, rather than xl / libvirt based.
>
> What makes libvirt not an API (just like libxl)?

Nothing, just I'm using the QMP API and the libxl API. I'm just saying
whether libvirt or xl warn or error makes no difference to me.

>>> If libxl does migration without O_DIRECT, then that's a bug in libxl.
>>> What about blkback?  IIRC it uses bios, so it also bypasses the page
>>> cache.
>>
>> Possibly a bug in xl rather than libxl, but as no emulated devices
>> use O_DIRECT, that bug is already there, and isn't in QEMU.
>
> blkback is the in-kernel PV device, it's not an emulated device.

I mean that an emulated device will already not use O_DIRECT.
So if you are right about live migrate being unsafe without O_DIRECT,
it's already unsafe for emulated devices.

>>>> Stefano did ack the patch, and for a one line change it's been
>>>> through a pretty extensive discussion on xen-devel ...
>>>
>>> It may be a one-line change, but it completely changes the paths that
>>> I/O goes through.  Apparently the discussion was not enough.
>>
>> What would you suggest?
>
> Nothing except fixing the bug in the kernel.

I have already posted patches for that, as Ian Campbell did in 2008,
but no one seems particularly interested. Be my guest in trying to
get them adopted. That's quite obviously the long term solution.

In the mean time, however, there is a need to run Xen on kernels with
long term support. Not being able to run Xen in a stable manner is
not an acceptable position.

> No one has yet explained
> why blkback is not susceptible to the same bug.

I would guess it will be if it uses O_DIRECT or whatever the in kernel
equivalent is, unless it's doing a copy of the guest pages prior to the
write being marked as complete.

I can't claim to be familiar with blkback, but I presume this would
require a similar fix elsewhere.
George Dunlap - March 18, 2013, 5:38 p.m.
On 18/03/13 16:53, Alex Bligh wrote:
> Paolo,
>
> --On 18 March 2013 17:19:14 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> I remembered this incorrectly, sorry.  It's not from a previous run,
>> it's from the beginning of this run.  See
>> http://wiki.qemu.org/Migration/Storage for more information.
>>
>> A VM has a disk backed by NFS. It runs on node A, at which point pages
>> are introduced to the page cache. It then migrates to node B, which
>> entails starting the VM on node B while it is still running on node A.
>> Closing has yet to happen on node A, but the file is already open on
>> node B; anything that is cached on node B will never be invalidated.
>>
>> Thus, any changes done to the disk on node A during migration may not
>> become visible on node B.
> This might be a difference between Xen and KVM. On Xen migration is
> made to a server in a paused state, and it's only unpaused when
> the migration to B is complete. There's a sort of extra handshake at
> the end.

I think what you mean is that all the memory is handled by Xen and the 
toolstack, not by qemu.  The qemu state is sent as the very last thing, 
after all of the memory, and therefore (you are arguing) that qemu is 
not started, and the files cannot be opened, until after the migration 
is nearly complete, and certainly until after the file is closed on the 
sending side.

(In KVM this isn't the case because I assume it's qemu that handles the 
memory transition, and so it is likely to open the files on the 
receiving side when the migration starts.)

If so I think that should answer Paolo's argument -- but let me go 
through and verify that first.

It will have to wait until tomorrow, however. :-)

  -George
Alex Bligh - March 18, 2013, 5:47 p.m.
George,

--On 18 March 2013 17:38:54 +0000 George Dunlap 
<george.dunlap@eu.citrix.com> wrote:

> I think what you mean is that all the memory is handled by Xen and the
> toolstack, not by qemu.  The qemu state is sent as the very last thing,
> after all of the memory, and therefore (you are arguing) that qemu is not
> started, and the files cannot be opened, until after the migration is
> nearly complete, and certainly until after the file is closed on the
> sending side.

That is my understanding. Thank you for putting it better than I did :-)
Paolo Bonzini - March 18, 2013, 6 p.m.
Il 18/03/2013 18:38, George Dunlap ha scritto:
>>>
>> This might be a difference between Xen and KVM. On Xen migration is
>> made to a server in a paused state, and it's only unpaused when
>> the migration to B is complete. There's a sort of extra handshake at
>> the end.
> 
> I think what you mean is that all the memory is handled by Xen and the
> toolstack, not by qemu.  The qemu state is sent as the very last thing,
> after all of the memory, and therefore (you are arguing) that qemu is
> not started, and the files cannot be opened, until after the migration
> is nearly complete, and certainly until after the file is closed on the
> sending side.

That would be quite dangerous.  Files aren't closed until after QEMU
exits; at this point whatever problem you have launching QEMU on the
destination would be unrecoverable.

Even for successful migration, it would also be bad for downtime (QEMU
isn't exactly lightning-fast to start).  And even if failure weren't
catastrophic, it would be a pity to transfer a few gigs of memory and
then find out that QEMU isn't present in the destination. :)

Still, it's more than possible that I've forgotten something about Xen's
management of QEMU.

> (In KVM this isn't the case because I assume it's qemu that handles the
> memory transition, and so it is likely to open the files on the
> receiving side when the migration starts.)

Yup.

Paolo
George Dunlap - March 19, 2013, 10:06 a.m.
On Mon, Mar 18, 2013 at 6:00 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 18/03/2013 18:38, George Dunlap ha scritto:
>>>>
>>> This might be a difference between Xen and KVM. On Xen migration is
>>> made to a server in a paused state, and it's only unpaused when
>>> the migration to B is complete. There's a sort of extra handshake at
>>> the end.
>>
>> I think what you mean is that all the memory is handled by Xen and the
>> toolstack, not by qemu.  The qemu state is sent as the very last thing,
>> after all of the memory, and therefore (you are arguing) that qemu is
>> not started, and the files cannot be opened, until after the migration
>> is nearly complete, and certainly until after the file is closed on the
>> sending side.
>
> That would be quite dangerous.  Files aren't closed until after QEMU
> exits; at this point whatever problem you have launching QEMU on the
> destination would be unrecoverable.

But if I understand your concern correctly, you were concerned about
the following scenario:
R1. Receiver qemu opens file
R2. Something causes receiver kernel to cache parts of file (maybe
optimistic read-ahead)
S1. Sender qemu writes to file
S2. Sender qemu does final flush
S3. Sender qemu closes file
R3. Receiver reads stale blocks from cache

Even supposing that Xen doesn't actually shut down qemu until it is
started on the remote side, as long as the file isn't opened by qemu
until after S2, we should be safe, right?  It would look like this:

S1. Sender qemu writes to file
S2. Sender qemu does final flush
R1. Receiver qemu opens file
R2. Receiver kernel caches file
S3. Sender qemu closes file

This is all assuming that:
1. The barrier operations / write flush are effective at getting the
data back on to the NFS server
2. The receiver qemu doesn't open the file until after the last flush
by the sender.

Number 1 has been tested by Alex I believe, and is mentioned in the
changeset log; so if #2 is true, then we should be safe.  I'll try to
verify that today.

> Even for successful migration, it would also be bad for downtime (QEMU
> isn't exactly lightning-fast to start).  And even if failure weren't
> catastrophic, it would be a pity to transfer a few gigs of memory and
> then find out that QEMU isn't present in the destination. :)

Well, if qemu isn't present at the destination, that's definitely user
error. :-)  In any case, I know that he migrate can resume if it
fails, so I suspect that the qemu is just paused on the sending side
until the migration is known to complete.  As long as the last write
was flushed to the NFS server before the receiver opens the file, we
should be safe.

> Still, it's more than possible that I've forgotten something about Xen's
> management of QEMU.

And unfortunately I am not intimately familiar with that codepath; it
just happens that I'm the last person to have to dig into that code
and fix something. :-)

 -George
Paolo Bonzini - March 19, 2013, 10:43 a.m.
Il 19/03/2013 11:06, George Dunlap ha scritto:
> On Mon, Mar 18, 2013 at 6:00 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> Il 18/03/2013 18:38, George Dunlap ha scritto:
>>>>>
>>>> This might be a difference between Xen and KVM. On Xen migration is
>>>> made to a server in a paused state, and it's only unpaused when
>>>> the migration to B is complete. There's a sort of extra handshake at
>>>> the end.
>>>
>>> I think what you mean is that all the memory is handled by Xen and the
>>> toolstack, not by qemu.  The qemu state is sent as the very last thing,
>>> after all of the memory, and therefore (you are arguing) that qemu is
>>> not started, and the files cannot be opened, until after the migration
>>> is nearly complete, and certainly until after the file is closed on the
>>> sending side.
>>
>> That would be quite dangerous.  Files aren't closed until after QEMU
>> exits; at this point whatever problem you have launching QEMU on the
>> destination would be unrecoverable.
> 
> But if I understand your concern correctly, you were concerned about
> the following scenario:
> R1. Receiver qemu opens file
> R2. Something causes receiver kernel to cache parts of file (maybe
> optimistic read-ahead)

For some image formats, metadata is cached inside QEMU on startup.
There is a callback to invalidate QEMU's cache at the end of migration,
but that does not extend to the page cache.

> S1. Sender qemu writes to file
> S2. Sender qemu does final flush
> S3. Sender qemu closes file
> R3. Receiver reads stale blocks from cache
> 
> Even supposing that Xen doesn't actually shut down qemu until it is
> started on the remote side, as long as the file isn't opened by qemu
> until after S2, we should be safe, right?  It would look like this:
> 
> S1. Sender qemu writes to file
> S2. Sender qemu does final flush
> R1. Receiver qemu opens file
> R2. Receiver kernel caches file
> S3. Sender qemu closes file
> 
> This is all assuming that:
> 1. The barrier operations / write flush are effective at getting the
> data back on to the NFS server
> 2. The receiver qemu doesn't open the file until after the last flush
> by the sender.
> 
> Number 1 has been tested by Alex I believe, and is mentioned in the
> changeset log; so if #2 is true, then we should be safe.  I'll try to
> verify that today.

Thanks.

>> Even for successful migration, it would also be bad for downtime (QEMU
>> isn't exactly lightning-fast to start).  And even if failure weren't
>> catastrophic, it would be a pity to transfer a few gigs of memory and
>> then find out that QEMU isn't present in the destination. :)
> 
> Well, if qemu isn't present at the destination, that's definitely user
> error. :-)  In any case, I know that he migrate can resume if it
> fails, so I suspect that the qemu is just paused on the sending side
> until the migration is known to complete.  As long as the last write
> was flushed to the NFS server before the receiver opens the file, we
> should be safe.

Note that the close really must happen before the next open.  Otherwise
the file metadata might not be up-to-date on the destination, too.

Paolo
George Dunlap - March 19, 2013, 10:51 a.m.
On 03/19/2013 10:43 AM, Paolo Bonzini wrote:
>>> Even for successful migration, it would also be bad for downtime (QEMU
>>> isn't exactly lightning-fast to start).  And even if failure weren't
>>> catastrophic, it would be a pity to transfer a few gigs of memory and
>>> then find out that QEMU isn't present in the destination. :)
>>
>> Well, if qemu isn't present at the destination, that's definitely user
>> error. :-)  In any case, I know that he migrate can resume if it
>> fails, so I suspect that the qemu is just paused on the sending side
>> until the migration is known to complete.  As long as the last write
>> was flushed to the NFS server before the receiver opens the file, we
>> should be safe.
>
> Note that the close really must happen before the next open.  Otherwise
> the file metadata might not be up-to-date on the destination, too.

By "file metadata" I assume you mean "metadata about the virtual disk 
within the file", not "metadata about the file within the filesystem", 
right?  That's good to know, I'll keep that in mind.

Even if it's true that at the moment qemu doesn't write the file 
metadata until it closes the file, that just means we'd have to add a 
hook to the callback to save qemu state, to sync the metadata at that 
point, right?  (This may in fact already be done; I'm more or less 
completely unfamiliar with the qemu side of save/restore.)

  -George
Paolo Bonzini - March 19, 2013, 11:14 a.m.
Il 19/03/2013 11:51, George Dunlap ha scritto:
> On 03/19/2013 10:43 AM, Paolo Bonzini wrote:
>>>> Even for successful migration, it would also be bad for downtime (QEMU
>>>> isn't exactly lightning-fast to start).  And even if failure weren't
>>>> catastrophic, it would be a pity to transfer a few gigs of memory and
>>>> then find out that QEMU isn't present in the destination. :)
>>>
>>> Well, if qemu isn't present at the destination, that's definitely user
>>> error. :-)  In any case, I know that he migrate can resume if it
>>> fails, so I suspect that the qemu is just paused on the sending side
>>> until the migration is known to complete.  As long as the last write
>>> was flushed to the NFS server before the receiver opens the file, we
>>> should be safe.
>>
>> Note that the close really must happen before the next open.  Otherwise
>> the file metadata might not be up-to-date on the destination, too.
> 
> By "file metadata" I assume you mean "metadata about the virtual disk
> within the file", not "metadata about the file within the filesystem",
> right?  That's good to know, I'll keep that in mind.

Actually especially the former (I'm calling them respectively "image
metadata" and "file metadata").  File metadata could also be a problem,
but I think it might just work except in cases like on-line resizing
during migration.

> Even if it's true that at the moment qemu doesn't write the file
> metadata until it closes the file, that just means we'd have to add a
> hook to the callback to save qemu state, to sync the metadata at that
> point, right?

Unfortunately no.  The problem is in the loading side's kernel, on which
you do not have any control.  If the loading side doesn't use O_DIRECT,
any attempt to invalidate the metadata in userspace or on the source is
futile, because there is no way to invalidate the page cache's copy of
that metadata.

Paolo
George Dunlap - March 19, 2013, 11:21 a.m.
On 03/19/2013 11:14 AM, Paolo Bonzini wrote:
> Il 19/03/2013 11:51, George Dunlap ha scritto:
>> On 03/19/2013 10:43 AM, Paolo Bonzini wrote:
>>>>> Even for successful migration, it would also be bad for downtime (QEMU
>>>>> isn't exactly lightning-fast to start).  And even if failure weren't
>>>>> catastrophic, it would be a pity to transfer a few gigs of memory and
>>>>> then find out that QEMU isn't present in the destination. :)
>>>>
>>>> Well, if qemu isn't present at the destination, that's definitely user
>>>> error. :-)  In any case, I know that he migrate can resume if it
>>>> fails, so I suspect that the qemu is just paused on the sending side
>>>> until the migration is known to complete.  As long as the last write
>>>> was flushed to the NFS server before the receiver opens the file, we
>>>> should be safe.
>>>
>>> Note that the close really must happen before the next open.  Otherwise
>>> the file metadata might not be up-to-date on the destination, too.
>>
>> By "file metadata" I assume you mean "metadata about the virtual disk
>> within the file", not "metadata about the file within the filesystem",
>> right?  That's good to know, I'll keep that in mind.
>
> Actually especially the former (I'm calling them respectively "image
> metadata" and "file metadata").  File metadata could also be a problem,
> but I think it might just work except in cases like on-line resizing
> during migration.
>
>> Even if it's true that at the moment qemu doesn't write the file
>> metadata until it closes the file, that just means we'd have to add a
>> hook to the callback to save qemu state, to sync the metadata at that
>> point, right?
>
> Unfortunately no.  The problem is in the loading side's kernel, on which
> you do not have any control.  If the loading side doesn't use O_DIRECT,
> any attempt to invalidate the metadata in userspace or on the source is
> futile, because there is no way to invalidate the page cache's copy of
> that metadata.

Yes, I meant "the only further thing we would have to do". The entire 
discussion relies on the assumption that the receiving side doesn't open 
the file until after the sending side has issued the qemu state save. 
So as long as both the virtual blocks and the image metadata have been 
synced to the NFS server at that point, we should be all right.  If at 
the moment the image metadata is *not* synced at that point, it seems 
like we should be able to make it so relatively easily.

  -George
Alex Bligh - March 19, 2013, 11:44 a.m.
--On 19 March 2013 12:14:36 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote:

> Unfortunately no.  The problem is in the loading side's kernel, on which
> you do not have any control.  If the loading side doesn't use O_DIRECT,
> any attempt to invalidate the metadata in userspace or on the source is
> futile, because there is no way to invalidate the page cache's copy of
> that metadata.

If you closed the file and reopened it, then on NFS this would have this
effect I believe, as it needs to check whether other clients have modified
the file to give open/close coherency.

However, we are rather assuming the file isn't even open at the point it
is closed by the sender, which is what George is investigating.

If this isn't true, we have a problem anyway with (e.g.) emulated
devices which don't use O_DIRECT anyway. And I had thought (I may be
wrong) using O_DIRECT does not guarantee no read caching with NFS;
O_DIRECT merely guarantees the page cache is not used under Linux
and isn't defined under POSIX:
  http://www.spinics.net/lists/linux-nfs/msg17472.html

If it were just a write caching issue, we could use O_DSYNC instead of
O_DIRECT, which would at least ensure the copy from userspace.
Paolo Bonzini - March 19, 2013, 11:49 a.m.
Il 19/03/2013 12:44, Alex Bligh ha scritto:
> If this isn't true, we have a problem anyway with (e.g.) emulated
> devices which don't use O_DIRECT anyway.

Yes, though that would be a libxl bug, not a QEMU bug.

> And I had thought (I may be
> wrong) using O_DIRECT does not guarantee no read caching with NFS;
> O_DIRECT merely guarantees the page cache is not used under Linux
> and isn't defined under POSIX:
>  http://www.spinics.net/lists/linux-nfs/msg17472.html

Read caching on the server is fine, because it is the same server that
was used for the writes.

O_DIRECT bypasses the client's page cache, and that's enough for our
purposes.

> If it were just a write caching issue, we could use O_DSYNC instead of
> O_DIRECT, which would at least ensure the copy from userspace.

O_DSYNC is not necessary.  We do issue the appropriate fsync/fdatasync.
 What O_DSYNC does is add an implicit fdatasync after every write,
basically.

Paolo
George Dunlap - March 19, 2013, 3:12 p.m.
On Tue, Mar 19, 2013 at 11:21 AM, George Dunlap
<george.dunlap@eu.citrix.com> wrote:
> On 03/19/2013 11:14 AM, Paolo Bonzini wrote:
>>
>> Il 19/03/2013 11:51, George Dunlap ha scritto:
>>>
>>> On 03/19/2013 10:43 AM, Paolo Bonzini wrote:
>>>>>>
>>>>>> Even for successful migration, it would also be bad for downtime (QEMU
>>>>>> isn't exactly lightning-fast to start).  And even if failure weren't
>>>>>> catastrophic, it would be a pity to transfer a few gigs of memory and
>>>>>> then find out that QEMU isn't present in the destination. :)
>>>>>
>>>>>
>>>>> Well, if qemu isn't present at the destination, that's definitely user
>>>>> error. :-)  In any case, I know that he migrate can resume if it
>>>>> fails, so I suspect that the qemu is just paused on the sending side
>>>>> until the migration is known to complete.  As long as the last write
>>>>> was flushed to the NFS server before the receiver opens the file, we
>>>>> should be safe.
>>>>
>>>>
>>>> Note that the close really must happen before the next open.  Otherwise
>>>> the file metadata might not be up-to-date on the destination, too.
>>>
>>>
>>> By "file metadata" I assume you mean "metadata about the virtual disk
>>> within the file", not "metadata about the file within the filesystem",
>>> right?  That's good to know, I'll keep that in mind.
>>
>>
>> Actually especially the former (I'm calling them respectively "image
>> metadata" and "file metadata").  File metadata could also be a problem,
>> but I think it might just work except in cases like on-line resizing
>> during migration.
>>
>>> Even if it's true that at the moment qemu doesn't write the file
>>> metadata until it closes the file, that just means we'd have to add a
>>> hook to the callback to save qemu state, to sync the metadata at that
>>> point, right?
>>
>>
>> Unfortunately no.  The problem is in the loading side's kernel, on which
>> you do not have any control.  If the loading side doesn't use O_DIRECT,
>> any attempt to invalidate the metadata in userspace or on the source is
>> futile, because there is no way to invalidate the page cache's copy of
>> that metadata.
>
>
> Yes, I meant "the only further thing we would have to do". The entire
> discussion relies on the assumption that the receiving side doesn't open the
> file until after the sending side has issued the qemu state save. So as long
> as both the virtual blocks and the image metadata have been synced to the
> NFS server at that point, we should be all right.  If at the moment the
> image metadata is *not* synced at that point, it seems like we should be
> able to make it so relatively easily.

I've just had a chat with Stefano, and it turns out I was a bit
confused -- this change has nothing to do with qemu running as a
device model, but only as qemu running as a PV back-end for PV guests.
 So the question of when in the save/restore process the qemu is
started is moot.

For posterity's sake, however, here is what I have found about qemu as
a device model and Xen migration:
* qemu on the receiving side gets the name of the qemu save file from
the command-line arguments; so it cannot be started until after qemu
on the sending side has been paused, and the file send across the
wire.
* qemu on the sending side does not exit until it is determined that
the receiver is ready to begin.  So it is still running when qemu on
the receiving side starts.
* I didn't determine whether the commands to stop and save state cause
a disk image metadata sync.

 -George
Stefano Stabellini - March 19, 2013, 3:13 p.m.
On Tue, 19 Mar 2013, Paolo Bonzini wrote:
> Il 19/03/2013 11:06, George Dunlap ha scritto:
> > On Mon, Mar 18, 2013 at 6:00 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >> Il 18/03/2013 18:38, George Dunlap ha scritto:
> >>>>>
> >>>> This might be a difference between Xen and KVM. On Xen migration is
> >>>> made to a server in a paused state, and it's only unpaused when
> >>>> the migration to B is complete. There's a sort of extra handshake at
> >>>> the end.
> >>>
> >>> I think what you mean is that all the memory is handled by Xen and the
> >>> toolstack, not by qemu.  The qemu state is sent as the very last thing,
> >>> after all of the memory, and therefore (you are arguing) that qemu is
> >>> not started, and the files cannot be opened, until after the migration
> >>> is nearly complete, and certainly until after the file is closed on the
> >>> sending side.
> >>
> >> That would be quite dangerous.  Files aren't closed until after QEMU
> >> exits; at this point whatever problem you have launching QEMU on the
> >> destination would be unrecoverable.
> > 
> > But if I understand your concern correctly, you were concerned about
> > the following scenario:
> > R1. Receiver qemu opens file
> > R2. Something causes receiver kernel to cache parts of file (maybe
> > optimistic read-ahead)
> 
> For some image formats, metadata is cached inside QEMU on startup.
> There is a callback to invalidate QEMU's cache at the end of migration,
> but that does not extend to the page cache.

[...]

> > Well, if qemu isn't present at the destination, that's definitely user
> > error. :-)  In any case, I know that he migrate can resume if it
> > fails, so I suspect that the qemu is just paused on the sending side
> > until the migration is known to complete.  As long as the last write
> > was flushed to the NFS server before the receiver opens the file, we
> > should be safe.
> 
> Note that the close really must happen before the next open.  Otherwise
> the file metadata might not be up-to-date on the destination, too.


First of all, I want to thank Paolo for spending time on this, because
it's a very delicate issue. I didn't spot this possible race before. If
we make a mistake now, in 6 months time a very poor soul will be
spending weeks debugging a difficult, not very reproducible, bug.

Also, just as a clarification for the audience at home, it's either safe
or it's not: if it's not safe, we certainly can't have this patch go in.
If it is safe, then it should go in.


This patch only impacts the PV backend in QEMU, not the IDE interface.
PV frontends and backends always disconnect and reconnect during
save/restore.
So we can be *certain* that bdrv_close at the sender side is called
before the new connection is established at the receiver side.

Unfortunately the QEMU PV backend opens the disk during initialization,
so before the actual connection is established (blk_init).

Therefore I think that the current change is not safe, but it is pretty
easy to make it safe.
You just need to move the call to blk_open from blk_init to blk_connect.
Actually you can move most of blk_init to blk_connect, you just need to
keep the 4 xenstore_write_be_int at the end of the function.
George Dunlap - March 19, 2013, 3:29 p.m.
On Tue, Mar 19, 2013 at 3:12 PM, George Dunlap
<George.Dunlap@eu.citrix.com> wrote:
> > I've just had a chat with Stefano, and it turns out I was a bit
> confused -- this change has nothing to do with qemu running as a
> device model, but only as qemu running as a PV back-end for PV guests.
>  So the question of when in the save/restore process the qemu is
> started is moot.

Alex, the title of the changelog should certainly be more specific, so
people aren't confused.  Maybe something like the following?

"Disable use of O_DIRECT when acting as a Xen PV backend"

And emphasizing in the changelog that this has no effect on qemu when
acting as a device model for Xen?

 -George
Paolo Bonzini - March 19, 2013, 4:53 p.m.
Il 19/03/2013 16:13, Stefano Stabellini ha scritto:
> This patch only impacts the PV backend in QEMU, not the IDE interface.
> PV frontends and backends always disconnect and reconnect during
> save/restore.
> So we can be *certain* that bdrv_close at the sender side is called
> before the new connection is established at the receiver side.

From the little I remember about Xen, they go to XenbusStateClosed on
the source before invoking the suspend hypercall?  And then restart on
the destination?

If so, that sounds right.  Thanks for finding an elegant solution.

> Unfortunately the QEMU PV backend opens the disk during initialization,
> so before the actual connection is established (blk_init).
> 
> Therefore I think that the current change is not safe, but it is pretty
> easy to make it safe.
> You just need to move the call to blk_open from blk_init to blk_connect.

Paolo
Stefano Stabellini - March 19, 2013, 5:03 p.m.
On Tue, 19 Mar 2013, Paolo Bonzini wrote:
> Il 19/03/2013 16:13, Stefano Stabellini ha scritto:
> > This patch only impacts the PV backend in QEMU, not the IDE interface.
> > PV frontends and backends always disconnect and reconnect during
> > save/restore.
> > So we can be *certain* that bdrv_close at the sender side is called
> > before the new connection is established at the receiver side.
> 
> >From the little I remember about Xen, they go to XenbusStateClosed on
> the source before invoking the suspend hypercall?  And then restart on
> the destination?

That's right

> If so, that sounds right.  Thanks for finding an elegant solution.

Thank you :)
Alex Bligh - March 19, 2013, 7:15 p.m.
Stefano, George,

> This patch only impacts the PV backend in QEMU, not the IDE interface.
> PV frontends and backends always disconnect and reconnect during
> save/restore.
> So we can be *certain* that bdrv_close at the sender side is called
> before the new connection is established at the receiver side.
>
> Unfortunately the QEMU PV backend opens the disk during initialization,
> so before the actual connection is established (blk_init).
>
> Therefore I think that the current change is not safe, but it is pretty
> easy to make it safe.
> You just need to move the call to blk_open from blk_init to blk_connect.
> Actually you can move most of blk_init to blk_connect, you just need to
> keep the 4 xenstore_write_be_int at the end of the function.

Stefano: do you want me to come up with a patch to do this?

--On 19 March 2013 15:29:36 +0000 George Dunlap 
<George.Dunlap@eu.citrix.com> wrote:

> Alex, the title of the changelog should certainly be more specific, so
> people aren't confused.  Maybe something like the following?
>
> "Disable use of O_DIRECT when acting as a Xen PV backend"
>
> And emphasizing in the changelog that this has no effect on qemu when
> acting as a device model for Xen?

Sure, I can do that.

However, my understanding (possibly incorrect) is that when qemu is
acting as a device model for xen (i.e. emulated disks) it doesn't
use O_DIRECT anyway. I am aware this is a different problem!
Alex Bligh - March 20, 2013, 8:33 a.m.
Stefano,

--On 19 March 2013 15:13:29 +0000 Stefano Stabellini 
<stefano.stabellini@eu.citrix.com> wrote:

> Therefore I think that the current change is not safe, but it is pretty
> easy to make it safe.
> You just need to move the call to blk_open from blk_init to blk_connect.
> Actually you can move most of blk_init to blk_connect, you just need to
> keep the 4 xenstore_write_be_int at the end of the function.

The final of these 4 writes in blk_init is:

    xenstore_write_be_int(&blkdev->xendev, "sectors",
                          blkdev->file_size / blkdev->file_blk);

and blkdev->filesize comes from bdrv_getlength(blkdev->bs), and that
requires (I think) bdrv_open to have been called. Can these
4 writes move safely to blk_connect? Or can we write the sectors
count as (say) 0 here and fix it later in blk_connect? The remainder
look ok.
Paolo Bonzini - March 20, 2013, 9:26 a.m.
Il 20/03/2013 09:33, Alex Bligh ha scritto:
> Stefano,
> 
> --On 19 March 2013 15:13:29 +0000 Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
> 
>> Therefore I think that the current change is not safe, but it is pretty
>> easy to make it safe.
>> You just need to move the call to blk_open from blk_init to blk_connect.
>> Actually you can move most of blk_init to blk_connect, you just need to
>> keep the 4 xenstore_write_be_int at the end of the function.
> 
> The final of these 4 writes in blk_init is:
> 
>    xenstore_write_be_int(&blkdev->xendev, "sectors",
>                          blkdev->file_size / blkdev->file_blk);
> 
> and blkdev->filesize comes from bdrv_getlength(blkdev->bs), and that
> requires (I think) bdrv_open to have been called. Can these
> 4 writes move safely to blk_connect? Or can we write the sectors
> count as (say) 0 here and fix it later in blk_connect? The remainder
> look ok.

I think so.  blkfront reads "sectors" when QEMU moves to
XenbusStateConnected, in blkfront_connect.

blk_connect is called from xen_be_try_initialise, which moves to
XenbusStateConnected on success.  So, QEMU's blk_connect will always be
called before blkfront's blkfront_connect.

Paolo
Stefano Stabellini - March 20, 2013, 10:24 a.m.
On Tue, 19 Mar 2013, Alex Bligh wrote:
> Stefano, George,
> 
> > This patch only impacts the PV backend in QEMU, not the IDE interface.
> > PV frontends and backends always disconnect and reconnect during
> > save/restore.
> > So we can be *certain* that bdrv_close at the sender side is called
> > before the new connection is established at the receiver side.
> >
> > Unfortunately the QEMU PV backend opens the disk during initialization,
> > so before the actual connection is established (blk_init).
> >
> > Therefore I think that the current change is not safe, but it is pretty
> > easy to make it safe.
> > You just need to move the call to blk_open from blk_init to blk_connect.
> > Actually you can move most of blk_init to blk_connect, you just need to
> > keep the 4 xenstore_write_be_int at the end of the function.
> 
> Stefano: do you want me to come up with a patch to do this?

Yes, please.


> --On 19 March 2013 15:29:36 +0000 George Dunlap 
> <George.Dunlap@eu.citrix.com> wrote:
> 
> > Alex, the title of the changelog should certainly be more specific, so
> > people aren't confused.  Maybe something like the following?
> >
> > "Disable use of O_DIRECT when acting as a Xen PV backend"
> >
> > And emphasizing in the changelog that this has no effect on qemu when
> > acting as a device model for Xen?
> 
> Sure, I can do that.
> 
> However, my understanding (possibly incorrect) is that when qemu is
> acting as a device model for xen (i.e. emulated disks) it doesn't
> use O_DIRECT anyway. I am aware this is a different problem!

That is true. My guess is that nobody really migrates HVM guests without
PV drivers installed (it's not even possible on XenServer but xl let you
do that if you want to). When the PV drivers initialize at boot time,
the IDE disk is closed. Therefore we wouldn't have this problem.

Maybe we should prevent HVM guest migration without PV drivers with xl
too. Ian, what do you think?
George Dunlap - March 20, 2013, 10:37 a.m.
On 20/03/13 10:24, Stefano Stabellini wrote:
> That is true. My guess is that nobody really migrates HVM guests without
> PV drivers installed (it's not even possible on XenServer but xl let you
> do that if you want to). When the PV drivers initialize at boot time,
> the IDE disk is closed. Therefore we wouldn't have this problem.
>
> Maybe we should prevent HVM guest migration without PV drivers with xl
> too. Ian, what do you think?

We should be able to make it safe if we just make sure that qemu does a 
metadata sync / FS sync when we ask qemu to write the save file.

  -George
Paolo Bonzini - March 20, 2013, 11:08 a.m.
Il 20/03/2013 11:37, George Dunlap ha scritto:
>> That is true. My guess is that nobody really migrates HVM guests without
>> PV drivers installed (it's not even possible on XenServer but xl let you
>> do that if you want to). When the PV drivers initialize at boot time,
>> the IDE disk is closed. Therefore we wouldn't have this problem.
>>
>> Maybe we should prevent HVM guest migration without PV drivers with xl
>> too. Ian, what do you think?
> 
> We should be able to make it safe if we just make sure that qemu does a
> metadata sync / FS sync when we ask qemu to write the save file.

To make it safe, just use cache=none.  There are downsides (for example
tmpfs does not support it), but perhaps you can use a global option or
environment variable to toggle the behavior.

Is xl still using the driver:subdriver:/path,readonly syntax?  That's
really inflexible compared to libvirt's XML...

Paolo
Alex Bligh - March 20, 2013, 11:20 a.m.
--On 20 March 2013 12:08:19 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote:

> To make it safe, just use cache=none.  There are downsides (for example
> tmpfs does not support it), but perhaps you can use a global option or
> environment variable to toggle the behavior.

If you don't use cache=none, you it is unsafe because of the page cache
issue.

If you use cache=none, it uses O_DIRECT, and it is unsafe because of the
kernel issue.

My preference would be to leave this can of worms to the end user. I'd
rather take the risk on the first (as I'd rather domU broke than dom0)
but that's because I always use network drives. Some documentation
would be useful.
David Scott - March 20, 2013, 11:57 a.m.
Hi,

On 20/03/13 10:24, Stefano Stabellini wrote:

> That is true. My guess is that nobody really migrates HVM guests without
> PV drivers installed (it's not even possible on XenServer but xl let you
> do that if you want to). When the PV drivers initialize at boot time,
> the IDE disk is closed. Therefore we wouldn't have this problem.
>
> Maybe we should prevent HVM guest migration without PV drivers with xl
> too. Ian, what do you think?

Although it's not normally possible on XenServer to migrate HVM guests 
without PV drivers, a lot of people override the check and do it anyway. 
It happens a lot in cloud scenarios where the host (XenServer) admin is 
different from the guest admin (the end customer) and it's impractical 
for the host admin to force the guest admin to load the drivers.

We do our best to recommend, encourage and help people to install HVM PV 
drivers but we have to tolerate those who ignore our advice :/

Cheers,
Dave
Stefano Stabellini - March 29, 2013, 5:19 p.m.
On Wed, 20 Mar 2013, Paolo Bonzini wrote:
> Il 20/03/2013 09:33, Alex Bligh ha scritto:
> > Stefano,
> > 
> > --On 19 March 2013 15:13:29 +0000 Stefano Stabellini
> > <stefano.stabellini@eu.citrix.com> wrote:
> > 
> >> Therefore I think that the current change is not safe, but it is pretty
> >> easy to make it safe.
> >> You just need to move the call to blk_open from blk_init to blk_connect.
> >> Actually you can move most of blk_init to blk_connect, you just need to
> >> keep the 4 xenstore_write_be_int at the end of the function.
> > 
> > The final of these 4 writes in blk_init is:
> > 
> >    xenstore_write_be_int(&blkdev->xendev, "sectors",
> >                          blkdev->file_size / blkdev->file_blk);
> > 
> > and blkdev->filesize comes from bdrv_getlength(blkdev->bs), and that
> > requires (I think) bdrv_open to have been called. Can these
> > 4 writes move safely to blk_connect? Or can we write the sectors
> > count as (say) 0 here and fix it later in blk_connect? The remainder
> > look ok.
> 
> I think so.  blkfront reads "sectors" when QEMU moves to
> XenbusStateConnected, in blkfront_connect.
> 
> blk_connect is called from xen_be_try_initialise, which moves to
> XenbusStateConnected on success.  So, QEMU's blk_connect will always be
> called before blkfront's blkfront_connect.

Alex, do you have any updates on this patch?
Alex Bligh - March 31, 2013, 7:53 p.m.
Stefano,

--On 29 March 2013 17:19:26 +0000 Stefano Stabellini 
<stefano.stabellini@eu.citrix.com> wrote:

>> I think so.  blkfront reads "sectors" when QEMU moves to
>> XenbusStateConnected, in blkfront_connect.
>>
>> blk_connect is called from xen_be_try_initialise, which moves to
>> XenbusStateConnected on success.  So, QEMU's blk_connect will always be
>> called before blkfront's blkfront_connect.
>
> Alex, do you have any updates on this patch?

Sorry, I've been snowed under with other stuff.

I got to the speed bump below, then didn't get back to it after Paolo's
reply. I'll see if I can get some time tomorrow.
Stefano Stabellini - April 1, 2013, 3:44 p.m.
On Mon, 1 Apr 2013, Alex Bligh wrote:
> This commit delays the point at which bdrv_new (and hence blk_open
> on the underlying device) is called from blk_init to blk_connect.
> This ensures that in an inbound live migrate, the block device is
> not opened until it has been closed at the other end. This is in
> preparation for supporting devices with open/close consistency
> without using O_DIRECT. This commit does NOT itself change O_DIRECT
> semantics.
> 
> Note this patch is compile-tested only.

I think the patch looks good, just a minor comment.


> Signed-off-by: Alex Bligh <alex@alex.org.uk>
> ---
>  hw/xen_disk.c |   73 +++++++++++++++++++++++++++++++++------------------------
>  1 file changed, 42 insertions(+), 31 deletions(-)
> 
> diff --git a/hw/xen_disk.c b/hw/xen_disk.c
> index 69e1d9d..3cccea8 100644
> --- a/hw/xen_disk.c
> +++ b/hw/xen_disk.c
> @@ -701,7 +701,7 @@ static void blk_alloc(struct XenDevice *xendev)
>  static int blk_init(struct XenDevice *xendev)
>  {
>      struct XenBlkDev *blkdev = container_of(xendev, struct XenBlkDev, xendev);
> -    int index, qflags, info = 0;
> +    int info = 0;
>  
>      /* read xenstore entries */
>      if (blkdev->params == NULL) {
> @@ -744,10 +744,7 @@ static int blk_init(struct XenDevice *xendev)
>      }
>  
>      /* read-only ? */
> -    qflags = BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NATIVE_AIO;
> -    if (strcmp(blkdev->mode, "w") == 0) {
> -        qflags |= BDRV_O_RDWR;
> -    } else {
> +    if (strcmp(blkdev->mode, "w")) {
>          info  |= VDISK_READONLY;
>      }
>  
> @@ -756,6 +753,44 @@ static int blk_init(struct XenDevice *xendev)
>          info  |= VDISK_CDROM;
>      }
>  
> +    blkdev->file_blk  = BLOCK_SIZE;
> +
> +    /* fill info
> +     * Temporarily write zero sectors as we won't know file size until
> +     * bdrv_new has been called. blk_connect corrects this.
> +     */
> +    xenstore_write_be_int(&blkdev->xendev, "feature-flush-cache", 1);
> +    xenstore_write_be_int(&blkdev->xendev, "feature-persistent", 1);
> +    xenstore_write_be_int(&blkdev->xendev, "info", info);
> +    xenstore_write_be_int(&blkdev->xendev, "sector-size", BLOCK_SIZE);
> +    xenstore_write_be_int(&blkdev->xendev, "sectors", 0);
> +    return 0;

There is no need to fill the sector-size and sectors info here, you can
do it later in blk_connect.
Alex Bligh - April 1, 2013, 4:35 p.m.
Stefano,

--On 31 March 2013 19:53:28 +0000 Alex Bligh <alex@alex.org.uk> wrote:

>> Alex, do you have any updates on this patch?
>
> Sorry, I've been snowed under with other stuff.
>
> I got to the speed bump below, then didn't get back to it after Paolo's
> reply. I'll see if I can get some time tomorrow.

Just sent under the subject:
 [PATCH] [RFC] Xen PV backend: Move call to bdrv_new from blk_init to 
blk_connect

Compile tested only at this stage as I'd like some feedback on whether
this is the sort of change you meant. Unfortunately the git diff isn't
very readable as it has decided I didn't move a pile of code down
from blk_init to blk_connect, but rather moved the declaration of
blk_connect up above the code concerned ...
Alex Bligh - April 1, 2013, 8:56 p.m.
Stefano,

--On 1 April 2013 16:44:05 +0100 Stefano Stabellini 
<stefano.stabellini@eu.citrix.com> wrote:

>> Note this patch is compile-tested only.
>
> I think the patch looks good, just a minor comment.

Thanks. I guess I ought to actually test it works then :-)

>> +    /* fill info
>> +     * Temporarily write zero sectors as we won't know file size until
>> +     * bdrv_new has been called. blk_connect corrects this.
>> +     */
>> +    xenstore_write_be_int(&blkdev->xendev, "feature-flush-cache", 1);
>> +    xenstore_write_be_int(&blkdev->xendev, "feature-persistent", 1);
>> +    xenstore_write_be_int(&blkdev->xendev, "info", info);
>> +    xenstore_write_be_int(&blkdev->xendev, "sector-size", BLOCK_SIZE);
>> +    xenstore_write_be_int(&blkdev->xendev, "sectors", 0);
>> +    return 0;
>
> There is no need to fill the sector-size and sectors info here, you can
> do it later in blk_connect.

My concern (not knowing how xenstore works) was the possibility of leaving
them as uninitialized values. I'm taking it that if I don't initialise
them, the key is just absent - correct?

I'll redo and move setting sector-size and sectors into blk_connect, and
then add a second commit which removes BDRV_O_NOCACHE. Oh, and test it ...
Stefano Stabellini - April 2, 2013, 11:08 a.m.
On Mon, 1 Apr 2013, Alex Bligh wrote:
> Stefano,
> 
> --On 1 April 2013 16:44:05 +0100 Stefano Stabellini 
> <stefano.stabellini@eu.citrix.com> wrote:
> 
> >> Note this patch is compile-tested only.
> >
> > I think the patch looks good, just a minor comment.
> 
> Thanks. I guess I ought to actually test it works then :-)
> 
> >> +    /* fill info
> >> +     * Temporarily write zero sectors as we won't know file size until
> >> +     * bdrv_new has been called. blk_connect corrects this.
> >> +     */
> >> +    xenstore_write_be_int(&blkdev->xendev, "feature-flush-cache", 1);
> >> +    xenstore_write_be_int(&blkdev->xendev, "feature-persistent", 1);
> >> +    xenstore_write_be_int(&blkdev->xendev, "info", info);
> >> +    xenstore_write_be_int(&blkdev->xendev, "sector-size", BLOCK_SIZE);
> >> +    xenstore_write_be_int(&blkdev->xendev, "sectors", 0);
> >> +    return 0;
> >
> > There is no need to fill the sector-size and sectors info here, you can
> > do it later in blk_connect.
> 
> My concern (not knowing how xenstore works) was the possibility of leaving
> them as uninitialized values. I'm taking it that if I don't initialise
> them, the key is just absent - correct?

Right


> I'll redo and move setting sector-size and sectors into blk_connect, and
> then add a second commit which removes BDRV_O_NOCACHE. Oh, and test it ...

Good :)
Alex Bligh - April 5, 2013, 10:34 a.m.
Stefano,

--On 2 April 2013 12:08:25 +0100 Stefano Stabellini 
<stefano.stabellini@eu.citrix.com> wrote:

>> I'll redo and move setting sector-size and sectors into blk_connect, and
>> then add a second commit which removes BDRV_O_NOCACHE. Oh, and test it
>> ...
>
> Good :)

I've just sent patches for qemu-upstream-unstable and
qemu-upstream-4.2-testing to do this. These are pretty lightly
tested but should be fine as it's just moving lines around really.

Patch

diff --git a/hw/xen_disk.c b/hw/xen_disk.c
index a402ac8..14f8723 100644
--- a/hw/xen_disk.c
+++ b/hw/xen_disk.c
@@ -603,7 +603,7 @@  static int blk_init(struct XenDevice *xendev)
     }
 
     /* read-only ? */
-    qflags = BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NATIVE_AIO;
+    qflags = BDRV_O_CACHE_WB | BDRV_O_NATIVE_AIO;
     if (strcmp(blkdev->mode, "w") == 0) {
         qflags |= BDRV_O_RDWR;
     } else {