Message ID | 1284213896-12705-4-git-send-email-aliguori@us.ibm.com
State | New
On Sat, Sep 11, 2010 at 3:04 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> This fixes a couple nasty problems relating to live migration.
>
> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if we re-read, we may end up with undesired caching. By delaying any reads until we absolutely have to, we decrease the likelihood of any undesirable caching.
>
> 2) When dealing with copy-on-read, the local storage acts as a cache. We need to make sure to avoid any reads to avoid polluting the local cache.
>
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
>
> diff --git a/hw/ide/core.c b/hw/ide/core.c
> index 1e466d1..57d8db3 100644
> --- a/hw/ide/core.c
> +++ b/hw/ide/core.c
> @@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
>          return;
>      }
>
> +    guess_geometry(s);
> +

Does the same change need to be made in ide_cfata_identify()?

I quickly checked the VMStateDescription and don't see cylinders, heads, sectors being saved for migration. I am concerned that IDE will break after migration if the following happens:
1. Guest resumes and does not issue ATA IDENTIFY so cylinders, heads, sectors are not initialized.
2. Normal I/O is performed, invoking ide_get_sector() which uses geometry information that has not been initialized.

Did I miss something?

> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> index bd6bbe6..0bf17ec 100644
> --- a/hw/virtio-blk.c
> +++ b/hw/virtio-blk.c
> @@ -427,6 +427,10 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
>
>      bdrv_get_geometry(s->bs, &capacity);
>      bdrv_get_geometry_hint(s->bs, &cylinders, &heads, &secs);
> +    if (cylinders == 0) {
> +        bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
> +    }
> +

bdrv_guess_geometry() can be called unconditionally. The call to bdrv_get_geometry_hint() can be eliminated. bdrv_guess_geometry() updates the geometry hint and does not probe the boot sector after the first time.

Stefan
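Applied to the virtio-blk hunk above, Stefan's suggestion would look roughly like the sketch below; it is illustrative only and not part of the posted patch.

    bdrv_get_geometry(s->bs, &capacity);
    /* bdrv_guess_geometry() returns the existing geometry hint if one is
     * set and only probes the boot sector the first time, so the separate
     * bdrv_get_geometry_hint() call and the cylinders == 0 check can be
     * dropped entirely. */
    bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);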
On 09/11/2010 12:24 PM, Stefan Hajnoczi wrote:
> On Sat, Sep 11, 2010 at 3:04 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
>> This fixes a couple nasty problems relating to live migration.
>>
>> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if we re-read, we may end up with undesired caching. By delaying any reads until we absolutely have to, we decrease the likelihood of any undesirable caching.
>>
>> 2) When dealing with copy-on-read, the local storage acts as a cache. We need to make sure to avoid any reads to avoid polluting the local cache.
>>
>> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
>>
>> diff --git a/hw/ide/core.c b/hw/ide/core.c
>> index 1e466d1..57d8db3 100644
>> --- a/hw/ide/core.c
>> +++ b/hw/ide/core.c
>> @@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
>>          return;
>>      }
>>
>> +    guess_geometry(s);
>> +
>
> Does the same change need to be made in ide_cfata_identify()?
>
> I quickly checked the VMStateDescription and don't see cylinders, heads, sectors being saved for migration. I am concerned that IDE will break after migration if the following happens:
> 1. Guest resumes and does not issue ATA IDENTIFY so cylinders, heads, sectors are not initialized.
> 2. Normal I/O is performed, invoking ide_get_sector() which uses geometry information that has not been initialized.
>
> Did I miss something?

Nope, I just missed some places that need calls.

I've renamed guess_geometry() to init_geometry() and added calls to ide_cfata_identify(), ide_get_sector(), and ide_set_sector(). That should cover it all.

>> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
>> index bd6bbe6..0bf17ec 100644
>> --- a/hw/virtio-blk.c
>> +++ b/hw/virtio-blk.c
>> @@ -427,6 +427,10 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
>>
>>      bdrv_get_geometry(s->bs, &capacity);
>>      bdrv_get_geometry_hint(s->bs, &cylinders, &heads, &secs);
>> +    if (cylinders == 0) {
>> +        bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
>> +    }
>> +
>
> bdrv_guess_geometry() can be called unconditionally. The call to bdrv_get_geometry_hint() can be eliminated. bdrv_guess_geometry() updates the geometry hint and does not probe the boot sector after the first time.

Yeah, that's a fair point.

Regards,

Anthony Liguori

> Stefan
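The lazy-initialization approach Anthony describes would look roughly like the sketch below for ide_get_sector(); the body is abbreviated (the LBA48 branch is omitted) and init_geometry() is the renamed helper he mentions, so treat this as an illustration rather than the actual follow-up patch.

    static int64_t ide_get_sector(IDEState *s)
    {
        int64_t sector_num;

        init_geometry(s);   /* no-op once s->cylinders has been set */

        if (s->select & 0x40) {
            /* LBA28 addressing: geometry is not needed */
            sector_num = ((s->select & 0x0f) << 24) | (s->hcyl << 16) |
                         (s->lcyl << 8) | s->sector;
        } else {
            /* CHS addressing relies on s->heads and s->sectors */
            sector_num = ((s->hcyl << 8) | s->lcyl) * s->heads * s->sectors +
                         (s->select & 0x0f) * s->sectors + (s->sector - 1);
        }
        return sector_num;
    }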
On 09/11/2010 05:04 PM, Anthony Liguori wrote:
> This fixes a couple nasty problems relating to live migration.
>
> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if we re-read, we may end up with undesired caching. By delaying any reads until we absolutely have to, we decrease the likelihood of any undesirable caching.
>
> 2) When dealing with copy-on-read, the local storage acts as a cache. We need to make sure to avoid any reads to avoid polluting the local cache.
>
> +
>  static void ide_identify(IDEState *s)
>  {
>      uint16_t *p;
> @@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
>          return;
>      }
>
> +    guess_geometry(s);
> +

This can cause a disk read, no? Shouldn't it be made asynchronous?

Or just move it to just before the guest starts?
On 09/12/2010 05:42 AM, Avi Kivity wrote:
> On 09/11/2010 05:04 PM, Anthony Liguori wrote:
>> This fixes a couple nasty problems relating to live migration.
>>
>> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if we re-read, we may end up with undesired caching. By delaying any reads until we absolutely have to, we decrease the likelihood of any undesirable caching.
>>
>> 2) When dealing with copy-on-read, the local storage acts as a cache. We need to make sure to avoid any reads to avoid polluting the local cache.
>>
>> +
>>  static void ide_identify(IDEState *s)
>>  {
>>      uint16_t *p;
>> @@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
>>          return;
>>      }
>>
>> +    guess_geometry(s);
>> +
>
> This can cause a disk read, no? Shouldn't it be made asynchronous?

Yes, it should. I'm not sure there's an easy way to make it asynchronous, though; not because of the block layer, but because of how these functions are called.

> Or just move it to just before the guest starts?

We don't really have a notion of "guest starts" today, although maybe we should.

Regards,

Anthony Liguori
On 09/12/2010 03:08 PM, Anthony Liguori wrote:
>> This can cause a disk read, no? Shouldn't it be made asynchronous?
>
> Yes, it should. I'm not sure there's an easy way to make it asynchronous, though; not because of the block layer, but because of how these functions are called.

Sorry to harp on the subject, but that's the standard problem with state machines. Every time you want to do a blocking operation in a function, you have to put all its locals in some structure, split the function into two, do some scheduling, etc.

>> Or just move it to just before the guest starts?
>
> We don't really have a notion of "guest starts" today, although maybe we should.

Wasn't there some qdev callback that represents this? Faint memory from the reset thread.
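As a generic illustration of the transformation Avi describes (not QEMU code; every type and function below is hypothetical), a blocking helper and its callback-based equivalent might look like this:

    /* Blocking version: the locals live on the stack. */
    int read_geometry_sync(Disk *d)
    {
        uint8_t sector[512];

        if (disk_read_blocking(d, 0, sector, sizeof(sector)) < 0) {
            return -1;
        }
        d->cylinders = parse_cylinders(sector);
        return 0;
    }

    /* Asynchronous version: the locals move into a heap-allocated state
     * struct and the function is split in two around the completion
     * callback. */
    typedef struct {
        Disk *d;
        uint8_t sector[512];
    } GeometryReadState;

    static void read_geometry_done(void *opaque, int ret)
    {
        GeometryReadState *gs = opaque;

        if (ret == 0) {
            gs->d->cylinders = parse_cylinders(gs->sector);
        }
        free(gs);
    }

    void read_geometry_async(Disk *d)
    {
        GeometryReadState *gs = malloc(sizeof(*gs));

        gs->d = d;
        disk_read_async(d, 0, gs->sector, sizeof(gs->sector),
                        read_geometry_done, gs);
    }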
On 09/12/2010 08:26 AM, Avi Kivity wrote:
> On 09/12/2010 03:08 PM, Anthony Liguori wrote:
>>> This can cause a disk read, no? Shouldn't it be made asynchronous?
>>
>> Yes, it should. I'm not sure there's an easy way to make it asynchronous, though; not because of the block layer, but because of how these functions are called.
>
> Sorry to harp on the subject, but that's the standard problem with state machines. Every time you want to do a blocking operation in a function, you have to put all its locals in some structure, split the function into two, do some scheduling, etc.

We can't block the VCPU thread for arbitrarily long periods of time. If we get a PIO operation requesting information about geometry, we can't wait for a disk read in order to satisfy that request.

We need to kick off the I/O operations in the background such that the data is available before the PIO operation happens. This isn't SM vs. threads at all; this is simply about the fact that we can't do block I/O during a PIO operation.

>>> Or just move it to just before the guest starts?
>>
>> We don't really have a notion of "guest starts" today, although maybe we should.
>
> Wasn't there some qdev callback that represents this? Faint memory from the reset thread.

Yes, realize(). I guess that's a reasonable approach.

Regards,

Anthony Liguori
On 09/12/2010 05:29 PM, Anthony Liguori wrote:
> On 09/12/2010 08:26 AM, Avi Kivity wrote:
>> On 09/12/2010 03:08 PM, Anthony Liguori wrote:
>>>> This can cause a disk read, no? Shouldn't it be made asynchronous?
>>>
>>> Yes, it should. I'm not sure there's an easy way to make it asynchronous, though; not because of the block layer, but because of how these functions are called.
>>
>> Sorry to harp on the subject, but that's the standard problem with state machines. Every time you want to do a blocking operation in a function, you have to put all its locals in some structure, split the function into two, do some scheduling, etc.
>
> We can't block the VCPU thread for arbitrarily long periods of time. If we get a PIO operation requesting information about geometry, we can't wait for a disk read in order to satisfy that request.
>
> We need to kick off the I/O operations in the background such that the data is available before the PIO operation happens. This isn't SM vs. threads at all; this is simply about the fact that we can't do block I/O during a PIO operation.

Isn't this an identify command, where the guest can only read the data after the interface indicates it is ready (or posts an interrupt)? I thought the guest interface was already async.

The code appears to support this:

    switch(val) {
    case WIN_IDENTIFY:
        if (s->bs && s->drive_kind != IDE_CD) {
            if (s->drive_kind != IDE_CFATA)
                ide_identify(s);
            else
                ide_cfata_identify(s);
            s->status = READY_STAT | SEEK_STAT;
            ide_transfer_start(s, s->io_buffer, 512, ide_transfer_stop);
        } else {
            if (s->drive_kind == IDE_CD) {
                ide_set_signature(s);
            }
            ide_abort_command(s);
        }
        ide_set_irq(s->bus);
        break;

but I may be misinterpreting it. If I'm right, all we need to do is push this whole block to a thread.
On 11.09.2010 16:04, Anthony Liguori wrote:
> This fixes a couple nasty problems relating to live migration.
>
> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if we re-read, we may end up with undesired caching. By delaying any reads until we absolutely have to, we decrease the likelihood of any undesirable caching.
>
> 2) When dealing with copy-on-read, the local storage acts as a cache. We need to make sure to avoid any reads to avoid polluting the local cache.
>
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>

I think we should also delay even opening the image file to the latest possible point, to avoid introducing new problems of this kind. Ideally, opening the image would be the very last thing we do before telling the migration source that we're set and it should close the images.

Even better would be to only open the image when the source has already closed it (we could completely avoid the invalidation/reopen then), but I think you were afraid that we might lose the image on both ends.

Kevin
On 09/13/2010 03:32 AM, Kevin Wolf wrote:
> On 11.09.2010 16:04, Anthony Liguori wrote:
>> This fixes a couple nasty problems relating to live migration.
>>
>> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if we re-read, we may end up with undesired caching. By delaying any reads until we absolutely have to, we decrease the likelihood of any undesirable caching.
>>
>> 2) When dealing with copy-on-read, the local storage acts as a cache. We need to make sure to avoid any reads to avoid polluting the local cache.
>>
>> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
>
> I think we should also delay even opening the image file to the latest possible point, to avoid introducing new problems of this kind. Ideally, opening the image would be the very last thing we do before telling the migration source that we're set and it should close the images.

There are a lot of possible failure scenarios that opening an image file can introduce. Fortunately, I don't think we have a strict requirement for it, provided that we make a couple of reasonable changes.

> Even better would be to only open the image when the source has already closed it (we could completely avoid the invalidation/reopen then), but I think you were afraid that we might lose the image on both ends.

Yeah, one of the key design points of live migration is to minimize the number of failure scenarios where you lose a VM. If someone typed the wrong command line or shared storage hasn't been mounted yet, and we delay failure until live migration is in the critical path, that would be terribly unfortunate.

Regards,

Anthony Liguori

> Kevin
On 13.09.2010 15:29, Anthony Liguori wrote:
> On 09/13/2010 03:32 AM, Kevin Wolf wrote:
>> On 11.09.2010 16:04, Anthony Liguori wrote:
>>> This fixes a couple nasty problems relating to live migration.
>>>
>>> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if we re-read, we may end up with undesired caching. By delaying any reads until we absolutely have to, we decrease the likelihood of any undesirable caching.
>>>
>>> 2) When dealing with copy-on-read, the local storage acts as a cache. We need to make sure to avoid any reads to avoid polluting the local cache.
>>>
>>> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
>>
>> I think we should also delay even opening the image file to the latest possible point, to avoid introducing new problems of this kind. Ideally, opening the image would be the very last thing we do before telling the migration source that we're set and it should close the images.
>
> There are a lot of possible failure scenarios that opening an image file can introduce. Fortunately, I don't think we have a strict requirement for it, provided that we make a couple of reasonable changes.
>
>> Even better would be to only open the image when the source has already closed it (we could completely avoid the invalidation/reopen then), but I think you were afraid that we might lose the image on both ends.
>
> Yeah, one of the key design points of live migration is to minimize the number of failure scenarios where you lose a VM. If someone typed the wrong command line or shared storage hasn't been mounted yet, and we delay failure until live migration is in the critical path, that would be terribly unfortunate.

We would catch most of them if we try to open the image when migration starts and immediately close it again until migration is (almost) completed, so that no other code can possibly use it before the source has really closed it.

Kevin
On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>> Yeah, one of the key design points of live migration is to minimize the number of failure scenarios where you lose a VM. If someone typed the wrong command line or shared storage hasn't been mounted yet, and we delay failure until live migration is in the critical path, that would be terribly unfortunate.
>
> We would catch most of them if we try to open the image when migration starts and immediately close it again until migration is (almost) completed, so that no other code can possibly use it before the source has really closed it.

I think the only real advantage is that we fix NFS migration, right?

But if we do invalidate_cache() as you suggested, with a close/open of the qcow2 layer, and also acquire and release a lock in the file layer by propagating the invalidate_cache(), that should work robustly with NFS.

I think that's a simpler change. Do you see additional advantages to delaying the open?

Regards,

Anthony Liguori

> Kevin
On 13.09.2010 15:42, Anthony Liguori wrote:
> On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>>>> Yeah, one of the key design points of live migration is to minimize the number of failure scenarios where you lose a VM. If someone typed the wrong command line or shared storage hasn't been mounted yet, and we delay failure until live migration is in the critical path, that would be terribly unfortunate.
>>> We would catch most of them if we try to open the image when migration starts and immediately close it again until migration is (almost) completed, so that no other code can possibly use it before the source has really closed it.
>>
>> I think the only real advantage is that we fix NFS migration, right?

That's the one that we know about, yes.

The rest is not a specific scenario, but a strong feeling that having an image opened twice at the same time feels dangerous. As soon as an open/close sequence writes to the image for some format, we probably have a bug. For example, what about this mounted flag that you were discussing for QED?

> But if we do invalidate_cache() as you suggested, with a close/open of the qcow2 layer, and also acquire and release a lock in the file layer by propagating the invalidate_cache(), that should work robustly with NFS.
>
> I think that's a simpler change. Do you see additional advantages to delaying the open?

Just that it makes it very obvious if a device model is doing bad things and accessing the image before it should. The difference is a failed request vs. silently corrupted data.

Kevin
On 09/13/2010 09:13 AM, Kevin Wolf wrote:
>> I think the only real advantage is that we fix NFS migration, right?
>
> That's the one that we know about, yes.
>
> The rest is not a specific scenario, but a strong feeling that having an image opened twice at the same time feels dangerous.

We've never really had clear semantics about live migration and block drivers' life cycles. At a high level, for live migration to work, we need the following sequence:

1) src> flush all pending writes to disk
2) <barrier>
3) dst> invalidate any cached data
4) dst> start guest

We've gotten away with ignoring (3) because raw disks never cache anything. But that assumes that we're on top of cache-coherent storage. If we don't have fully cache-coherent storage, we need to do more: we need to extend (3) to also flush the cache of the underlying storage.

There are two ways we can solve this. We can either ensure that (3) is a no-op by not having any operations that would cause caching until after (3), or we can find a way to inject a flush into the underlying cache. Since the latter approach requires storage-specific knowledge, I went with the former approach.

Of course, if you did a close-open sequence at (3), it may achieve the same goal, but only really for NFS. If you have something weaker than close-to-open coherency, you still need to do something unique in step (3).

I don't know that I see a perfect model. Pushing reads past point (3) is easy and fixes raw on top of NFS. I think we want to do that because it's low-hanging fruit. A block driver hook for (3) also seems appealing because we can make use of it easily in QED.

That said, I'm open to suggestions of a better model. Delaying open (especially if you open, then close, then open again) seems a bit hacky.

With respect to the devices, I think the question of when block devices can begin accessing drives is orthogonal to this discussion. Even without delaying open, we could simply not give them their BlockDriverStates until realize() or something like that.

Regards,

Anthony Liguori

> As soon as an open/close sequence writes to the image for some format, we probably have a bug. For example, what about this mounted flag that you were discussing for QED?
>
>> But if we do invalidate_cache() as you suggested, with a close/open of the qcow2 layer, and also acquire and release a lock in the file layer by propagating the invalidate_cache(), that should work robustly with NFS.
>>
>> I think that's a simpler change. Do you see additional advantages to delaying the open?
>
> Just that it makes it very obvious if a device model is doing bad things and accessing the image before it should. The difference is a failed request vs. silently corrupted data.
>
> Kevin
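As an illustration of the block driver hook Anthony mentions for step (3), the generic dispatch might look roughly like the sketch below. The function name, the new BlockDriver callback, and the propagation to bs->file are assumptions of this sketch, not something the posted patch adds.

    /* Sketch only: invalidate cached data on the destination before the
     * guest starts (step 3 above). */
    void bdrv_invalidate_cache(BlockDriverState *bs)
    {
        /* A format driver (qcow2, QED, ...) would drop cached metadata
         * such as L1/L2 tables here; a protocol driver could re-open its
         * file descriptor or cycle a lock to get NFS close-to-open
         * semantics. */
        if (bs->drv && bs->drv->bdrv_invalidate_cache) {
            bs->drv->bdrv_invalidate_cache(bs);
        }

        /* Propagate down the chain, e.g. from qcow2 to the file protocol. */
        if (bs->file) {
            bdrv_invalidate_cache(bs->file);
        }
    }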
On Mon, Sep 13, 2010 at 3:13 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 13.09.2010 15:42, Anthony Liguori wrote:
>> On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>>>> Yeah, one of the key design points of live migration is to minimize the number of failure scenarios where you lose a VM. If someone typed the wrong command line or shared storage hasn't been mounted yet, and we delay failure until live migration is in the critical path, that would be terribly unfortunate.
>>> We would catch most of them if we try to open the image when migration starts and immediately close it again until migration is (almost) completed, so that no other code can possibly use it before the source has really closed it.
>>
>> I think the only real advantage is that we fix NFS migration, right?
>
> That's the one that we know about, yes.
>
> The rest is not a specific scenario, but a strong feeling that having an image opened twice at the same time feels dangerous. As soon as an open/close sequence writes to the image for some format, we probably have a bug. For example, what about this mounted flag that you were discussing for QED?

There is some room left to work in, even if we can't check in open(). One idea would be to do the check asynchronously once I/O begins. It is actually easy to check L1/L2 tables as they are loaded.

The only barrier relationship between I/O and checking is that an allocating write (which will need to update L1/L2 tables) is only allowed after check completes. Otherwise reads and non-allocating writes may proceed while the image is not yet fully checked. We can detect when a table element is an invalid offset and discard it.

Stefan
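Stefan's ordering constraint could be sketched along the following lines; the structure and field names (check_complete, alloc_pending) are illustrative stand-ins, not actual QED code.

    /* Sketch: reads and non-allocating writes proceed immediately, but an
     * allocating write waits until the background consistency check has
     * validated the tables it would modify. */
    static void start_allocating_write(BDRVQEDState *s, QEDAIOCB *acb)
    {
        if (!s->check_complete) {
            /* Parked here; resumed from the check's completion callback. */
            QSIMPLEQ_INSERT_TAIL(&s->alloc_pending, acb, next);
            return;
        }
        do_allocating_write(s, acb);
    }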
On 13.09.2010 21:29, Stefan Hajnoczi wrote:
> On Mon, Sep 13, 2010 at 3:13 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> On 13.09.2010 15:42, Anthony Liguori wrote:
>>> On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>>>>> Yeah, one of the key design points of live migration is to minimize the number of failure scenarios where you lose a VM. If someone typed the wrong command line or shared storage hasn't been mounted yet, and we delay failure until live migration is in the critical path, that would be terribly unfortunate.
>>>> We would catch most of them if we try to open the image when migration starts and immediately close it again until migration is (almost) completed, so that no other code can possibly use it before the source has really closed it.
>>>
>>> I think the only real advantage is that we fix NFS migration, right?
>>
>> That's the one that we know about, yes.
>>
>> The rest is not a specific scenario, but a strong feeling that having an image opened twice at the same time feels dangerous. As soon as an open/close sequence writes to the image for some format, we probably have a bug. For example, what about this mounted flag that you were discussing for QED?
>
> There is some room left to work in, even if we can't check in open(). One idea would be to do the check asynchronously once I/O begins. It is actually easy to check L1/L2 tables as they are loaded.
>
> The only barrier relationship between I/O and checking is that an allocating write (which will need to update L1/L2 tables) is only allowed after check completes. Otherwise reads and non-allocating writes may proceed while the image is not yet fully checked. We can detect when a table element is an invalid offset and discard it.

I'm not even talking about such complicated things. You wanted to have a dirty flag in the header, right? So when we allow opening an image twice, you get this sequence with migration:

Source: open
Destination: open (with dirty image)
Source: close

The image is now marked as clean, even though the destination is still working on it.

Kevin
On 09/13/2010 03:03 PM, Kevin Wolf wrote:
> On 13.09.2010 21:29, Stefan Hajnoczi wrote:
>> On Mon, Sep 13, 2010 at 3:13 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> On 13.09.2010 15:42, Anthony Liguori wrote:
>>>> On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>>>>>> Yeah, one of the key design points of live migration is to minimize the number of failure scenarios where you lose a VM. If someone typed the wrong command line or shared storage hasn't been mounted yet, and we delay failure until live migration is in the critical path, that would be terribly unfortunate.
>>>>> We would catch most of them if we try to open the image when migration starts and immediately close it again until migration is (almost) completed, so that no other code can possibly use it before the source has really closed it.
>>>>
>>>> I think the only real advantage is that we fix NFS migration, right?
>>>
>>> That's the one that we know about, yes.
>>>
>>> The rest is not a specific scenario, but a strong feeling that having an image opened twice at the same time feels dangerous. As soon as an open/close sequence writes to the image for some format, we probably have a bug. For example, what about this mounted flag that you were discussing for QED?
>>
>> There is some room left to work in, even if we can't check in open(). One idea would be to do the check asynchronously once I/O begins. It is actually easy to check L1/L2 tables as they are loaded.
>>
>> The only barrier relationship between I/O and checking is that an allocating write (which will need to update L1/L2 tables) is only allowed after check completes. Otherwise reads and non-allocating writes may proceed while the image is not yet fully checked. We can detect when a table element is an invalid offset and discard it.
>
> I'm not even talking about such complicated things. You wanted to have a dirty flag in the header, right? So when we allow opening an image twice, you get this sequence with migration:
>
> Source: open
> Destination: open (with dirty image)
> Source: close
>
> The image is now marked as clean, even though the destination is still working on it.

The dirty flag should be read on demand (which is the first time we fetch an L1/L2 table).

I agree that the life cycle of the block drivers is getting fuzzy. Need to think quite a bit here.

Regards,

Anthony Liguori

> Kevin
On 13.09.2010 22:09, Anthony Liguori wrote:
> On 09/13/2010 03:03 PM, Kevin Wolf wrote:
>> On 13.09.2010 21:29, Stefan Hajnoczi wrote:
>>> On Mon, Sep 13, 2010 at 3:13 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>> On 13.09.2010 15:42, Anthony Liguori wrote:
>>>>> On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>>>>>>> Yeah, one of the key design points of live migration is to minimize the number of failure scenarios where you lose a VM. If someone typed the wrong command line or shared storage hasn't been mounted yet, and we delay failure until live migration is in the critical path, that would be terribly unfortunate.
>>>>>> We would catch most of them if we try to open the image when migration starts and immediately close it again until migration is (almost) completed, so that no other code can possibly use it before the source has really closed it.
>>>>>
>>>>> I think the only real advantage is that we fix NFS migration, right?
>>>>
>>>> That's the one that we know about, yes.
>>>>
>>>> The rest is not a specific scenario, but a strong feeling that having an image opened twice at the same time feels dangerous. As soon as an open/close sequence writes to the image for some format, we probably have a bug. For example, what about this mounted flag that you were discussing for QED?
>>>
>>> There is some room left to work in, even if we can't check in open(). One idea would be to do the check asynchronously once I/O begins. It is actually easy to check L1/L2 tables as they are loaded.
>>>
>>> The only barrier relationship between I/O and checking is that an allocating write (which will need to update L1/L2 tables) is only allowed after check completes. Otherwise reads and non-allocating writes may proceed while the image is not yet fully checked. We can detect when a table element is an invalid offset and discard it.
>>
>> I'm not even talking about such complicated things. You wanted to have a dirty flag in the header, right? So when we allow opening an image twice, you get this sequence with migration:
>>
>> Source: open
>> Destination: open (with dirty image)
>> Source: close
>>
>> The image is now marked as clean, even though the destination is still working on it.
>
> The dirty flag should be read on demand (which is the first time we fetch an L1/L2 table).
>
> I agree that the life cycle of the block drivers is getting fuzzy. Need to think quite a bit here.

The point is that after the close, the flag is wrong on disk. That's completely unrelated to when you read it.

If anything, you could also delay writing the flag until the first write, so that the close comes first. But honestly, I'm not very excited about pulling half of the open() logic into read/write functions.

Kevin
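For illustration, the variant Kevin mentions (writing the dirty flag only when the first metadata-modifying write happens, so the source's clean close naturally comes first) might look roughly like the sketch below; the names QED_F_DIRTY and qed_write_header_sync() are illustrative stand-ins rather than code from this series.

    /* Sketch: mark the image dirty lazily, immediately before the first
     * allocating write, instead of in open(). */
    static int qed_mark_dirty_before_write(BDRVQEDState *s)
    {
        int ret;

        if (!(s->header.features & QED_F_DIRTY)) {
            s->header.features |= QED_F_DIRTY;
            ret = qed_write_header_sync(s);  /* flag reaches disk first */
            if (ret < 0) {
                s->header.features &= ~QED_F_DIRTY;
                return ret;
            }
        }
        return 0;
    }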
On 09/13/2010 04:34 PM, Anthony Liguori wrote:
> On 09/13/2010 09:13 AM, Kevin Wolf wrote:
>>> I think the only real advantage is that we fix NFS migration, right?
>> That's the one that we know about, yes.
>>
>> The rest is not a specific scenario, but a strong feeling that having an image opened twice at the same time feels dangerous.
>
> We've never really had clear semantics about live migration and block drivers' life cycles. At a high level, for live migration to work, we need the following sequence:
>
> 1) src> flush all pending writes to disk
> 2) <barrier>
> 3) dst> invalidate any cached data
> 4) dst> start guest

That's pretty complicated, compared to

1) src> close
2) dst> open
3) dst> start guest

There are two failure scenarios with this model:

1. dst cannot open the image

We fix that by killing dst and continuing src (which has to re-open its images).

2. dst cannot open the image, and src cannot either

In this case, what would be gained by having an image handle open in one of the hosts, but no way to open it again? As soon as the surviving qemu exited (or crashed), the image would be lost forever.

To get (1) working correctly we need an event that tells management that all initialization has completed and the guest is ready to run (so management can terminate the source).
On 09/14/2010 04:47 AM, Avi Kivity wrote:
> On 09/13/2010 04:34 PM, Anthony Liguori wrote:
>> On 09/13/2010 09:13 AM, Kevin Wolf wrote:
>>>> I think the only real advantage is that we fix NFS migration, right?
>>> That's the one that we know about, yes.
>>>
>>> The rest is not a specific scenario, but a strong feeling that having an image opened twice at the same time feels dangerous.
>>
>> We've never really had clear semantics about live migration and block drivers' life cycles. At a high level, for live migration to work, we need the following sequence:
>>
>> 1) src> flush all pending writes to disk
>> 2) <barrier>
>> 3) dst> invalidate any cached data
>> 4) dst> start guest
>
> That's pretty complicated, compared to
>
> 1) src> close

1.5) <barrier>

> 2) dst> open
> 3) dst> start guest

You need to make sure the open happens *after* the close.

You're just using close to flush all pending writes and open to invalidate any cached data.

Regards,

Anthony Liguori

> There are two failure scenarios with this model:
>
> 1. dst cannot open the image
>
> We fix that by killing dst and continuing src (which has to re-open its images).
>
> 2. dst cannot open the image, and src cannot either
>
> In this case, what would be gained by having an image handle open in one of the hosts, but no way to open it again? As soon as the surviving qemu exited (or crashed), the image would be lost forever.
>
> To get (1) working correctly we need an event that tells management that all initialization has completed and the guest is ready to run (so management can terminate the source).
On 09/14/2010 02:51 PM, Anthony Liguori wrote:
> On 09/14/2010 04:47 AM, Avi Kivity wrote:
>> On 09/13/2010 04:34 PM, Anthony Liguori wrote:
>>> On 09/13/2010 09:13 AM, Kevin Wolf wrote:
>>>>> I think the only real advantage is that we fix NFS migration, right?
>>>> That's the one that we know about, yes.
>>>>
>>>> The rest is not a specific scenario, but a strong feeling that having an image opened twice at the same time feels dangerous.
>>>
>>> We've never really had clear semantics about live migration and block drivers' life cycles. At a high level, for live migration to work, we need the following sequence:
>>>
>>> 1) src> flush all pending writes to disk
>>> 2) <barrier>
>>> 3) dst> invalidate any cached data
>>> 4) dst> start guest
>>
>> That's pretty complicated, compared to
>>
>> 1) src> close
>
> 1.5) <barrier>
>
>> 2) dst> open
>> 3) dst> start guest
>
> You need to make sure the open happens *after* the close.
>
> You're just using close to flush all pending writes and open to invalidate any cached data.

Right, that's simpler than teaching all block format drivers about invalidating any cached data, or using NFS locks to force the host to do the same. It also complies with close-to-open consistency without relying on implementation details.
Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 09/12/2010 05:42 AM, Avi Kivity wrote:
>> On 09/11/2010 05:04 PM, Anthony Liguori wrote:
>>> This fixes a couple nasty problems relating to live migration.
>>>
>>> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if we re-read, we may end up with undesired caching. By delaying any reads until we absolutely have to, we decrease the likelihood of any undesirable caching.
>>>
>>> 2) When dealing with copy-on-read, the local storage acts as a cache. We need to make sure to avoid any reads to avoid polluting the local cache.
>>>
>>> +
>>>  static void ide_identify(IDEState *s)
>>>  {
>>>      uint16_t *p;
>>> @@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
>>>          return;
>>>      }
>>>
>>> +    guess_geometry(s);
>>> +
>>
>> This can cause a disk read, no? Shouldn't it be made asynchronous?
>
> Yes, it should. I'm not sure there's an easy way to make it asynchronous, though; not because of the block layer, but because of how these functions are called.
>
>> Or just move it to just before the guest starts?
>
> We don't really have a notion of "guest starts" today, although maybe we should.

I agree that we should have more states. Expecting Migration / Migration Done are quite special cases that are handled ad hoc now. Having some state like NeverRun would make it trivial to have vm_start() do the right thing on the first run.

Later, Juan.

> Regards,
>
> Anthony Liguori
Kevin Wolf <kwolf@redhat.com> wrote:
> On 11.09.2010 16:04, Anthony Liguori wrote:
>> This fixes a couple nasty problems relating to live migration.
>>
>> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if we re-read, we may end up with undesired caching. By delaying any reads until we absolutely have to, we decrease the likelihood of any undesirable caching.
>>
>> 2) When dealing with copy-on-read, the local storage acts as a cache. We need to make sure to avoid any reads to avoid polluting the local cache.
>>
>> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
>
> I think we should also delay even opening the image file to the latest possible point, to avoid introducing new problems of this kind. Ideally, opening the image would be the very last thing we do before telling the migration source that we're set and it should close the images.
>
> Even better would be to only open the image when the source has already closed it (we could completely avoid the invalidation/reopen then), but I think you were afraid that we might lose the image on both ends.

But then we have formats like qcow2 that store the "number_of_sectors" inside the file, and we need the size of the disk to initialize some structures :(

About having the image opened in two different machines .....
- Having it opened in two places makes it "interesting" for coherence reasons.
- Not having it makes the failure case of anything failing during migration ... interesting.

The real solution for "this" problem is to just send the size/head/cylinder/sector ... values in the migration stream and not have to open the image. The problem here is that in the current qemu architecture we have to know these values before we call the migration functions.

As far as I can see, until we are able to create a machine from inside the monitor (i.e. qemu initialization & machine creation are decoupled), we are not going to be able to fix migration properly (using that very same mechanism to create the machine during migration).

Later, Juan.
diff --git a/hw/ide/core.c b/hw/ide/core.c
index 1e466d1..57d8db3 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -94,6 +94,33 @@ static void put_le16(uint16_t *p, unsigned int v)
     *p = cpu_to_le16(v);
 }
 
+static int guess_geometry(IDEState *s)
+{
+    int cylinders, heads, secs;
+
+    if (s->cylinders != 0) {
+        return -1;
+    }
+
+    bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
+    if (cylinders < 1 || cylinders > 16383) {
+        error_report("cyls must be between 1 and 16383");
+        return -1;
+    }
+    if (heads < 1 || heads > 16) {
+        error_report("heads must be between 1 and 16");
+        return -1;
+    }
+    if (secs < 1 || secs > 63) {
+        error_report("secs must be between 1 and 63");
+        return -1;
+    }
+    s->cylinders = cylinders;
+    s->heads = heads;
+    s->sectors = secs;
+    return 0;
+}
+
 static void ide_identify(IDEState *s)
 {
     uint16_t *p;
@@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
         return;
     }
 
+    guess_geometry(s);
+
     memset(s->io_buffer, 0, 512);
     p = (uint16_t *)s->io_buffer;
     put_le16(p + 0, 0x0040);
@@ -2614,27 +2643,13 @@ void ide_bus_reset(IDEBus *bus)
 int ide_init_drive(IDEState *s, BlockDriverState *bs,
                    const char *version, const char *serial)
 {
-    int cylinders, heads, secs;
     uint64_t nb_sectors;
 
     s->bs = bs;
-    bdrv_get_geometry(bs, &nb_sectors);
-    bdrv_guess_geometry(bs, &cylinders, &heads, &secs);
-    if (cylinders < 1 || cylinders > 16383) {
-        error_report("cyls must be between 1 and 16383");
-        return -1;
-    }
-    if (heads < 1 || heads > 16) {
-        error_report("heads must be between 1 and 16");
-        return -1;
-    }
-    if (secs < 1 || secs > 63) {
-        error_report("secs must be between 1 and 63");
-        return -1;
-    }
-    s->cylinders = cylinders;
-    s->heads = heads;
-    s->sectors = secs;
+    bdrv_get_geometry(s->bs, &nb_sectors);
+    s->cylinders = 0;
+    s->heads = 0;
+    s->sectors = 0;
     s->nb_sectors = nb_sectors;
     /* The SMART values should be preserved across power cycles
        but they aren't. */
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index bd6bbe6..0bf17ec 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -427,6 +427,10 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
 
     bdrv_get_geometry(s->bs, &capacity);
     bdrv_get_geometry_hint(s->bs, &cylinders, &heads, &secs);
+    if (cylinders == 0) {
+        bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
+    }
+
     memset(&blkcfg, 0, sizeof(blkcfg));
     stq_raw(&blkcfg.capacity, capacity);
     stl_raw(&blkcfg.seg_max, 128 - 2);
@@ -501,7 +505,6 @@ static int virtio_blk_load(QEMUFile *f, void *opaque, int version_id)
 VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf)
 {
     VirtIOBlock *s;
-    int cylinders, heads, secs;
     static int virtio_blk_id;
     DriveInfo *dinfo;
@@ -525,7 +528,6 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf)
     s->conf = conf;
     s->rq = NULL;
     s->sector_mask = (s->conf->logical_block_size / BDRV_SECTOR_SIZE) - 1;
-    bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
 
     /* NB: per existing s/n string convention the string is terminated
      * by '\0' only when less than sizeof (s->sn)
This fixes a couple nasty problems relating to live migration.

1) When dealing with shared storage with weak coherence (i.e. NFS), even if
   we re-read, we may end up with undesired caching. By delaying any reads
   until we absolutely have to, we decrease the likelihood of any undesirable
   caching.

2) When dealing with copy-on-read, the local storage acts as a cache. We need
   to make sure to avoid any reads to avoid polluting the local cache.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>