Message ID | 20180419075232.31407-1-stefanha@redhat.com |
---|---|
Headers | show |
Series | block/file-posix: allow -drive cache.direct=off live migration | expand |
On 04/19/2018 02:52 AM, Stefan Hajnoczi wrote: > file-posix.c only supports shared storage live migration with -drive > cache.direct=off due to cache consistency issues. There are two main shared > storage configurations: files on NFS and host block devices on SAN LUNs. > > The problem is that QEMU starts on the destination host before the source host > has written everything out to the disk. The page cache on the destination host > may contain stale data read when QEMU opened the image file (before migration > handover). Using O_DIRECT avoids this problem but prevents users from taking > advantage of the host page cache. > > Although cache=none is the recommended setting for virtualization use cases, > there are scenarios where cache=writeback makes sense. If the guest has much > less RAM than the host or many guests share the same backing file, then the > host page cache can significantly improve disk I/O performance. > > This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c > on Linux so that shared storage live migration works. I have sent it as an RFC > because cache consistency is not binary, there are corner cases which I've > described in the actual patch, and this may require more discussion. Interesting, in that the NBD list is also discussing the possible standardization of a NBD_CMD_CACHE command (based on existing practice in the xNBD implementation), and covering whether that MIGHT be worth doing as a thin wrapper that corresponds to posix_fadvise() semantics. Thus, if NBD_CMD_CACHE learns flags, we could support .bdrv_co_invalidate_cache() through the NBD protocol driver, in addition to the POSIX file driver. Obviously, your usage invalidates the cache of the entire file; but does it also make sense to expose a start/length subset invalidation, for better exposure to posix_fadvise() semantics?
On Thu, Apr 19, 2018 at 11:09:53AM -0500, Eric Blake wrote: > On 04/19/2018 02:52 AM, Stefan Hajnoczi wrote: > > file-posix.c only supports shared storage live migration with -drive > > cache.direct=off due to cache consistency issues. There are two main shared > > storage configurations: files on NFS and host block devices on SAN LUNs. > > > > The problem is that QEMU starts on the destination host before the source host > > has written everything out to the disk. The page cache on the destination host > > may contain stale data read when QEMU opened the image file (before migration > > handover). Using O_DIRECT avoids this problem but prevents users from taking > > advantage of the host page cache. > > > > Although cache=none is the recommended setting for virtualization use cases, > > there are scenarios where cache=writeback makes sense. If the guest has much > > less RAM than the host or many guests share the same backing file, then the > > host page cache can significantly improve disk I/O performance. > > > > This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c > > on Linux so that shared storage live migration works. I have sent it as an RFC > > because cache consistency is not binary, there are corner cases which I've > > described in the actual patch, and this may require more discussion. > > Interesting, in that the NBD list is also discussing the possible > standardization of a NBD_CMD_CACHE command (based on existing practice > in the xNBD implementation), and covering whether that MIGHT be worth > doing as a thin wrapper that corresponds to posix_fadvise() semantics. > Thus, if NBD_CMD_CACHE learns flags, we could support > .bdrv_co_invalidate_cache() through the NBD protocol driver, in addition > to the POSIX file driver. Obviously, your usage invalidates the cache > of the entire file; but does it also make sense to expose a start/length > subset invalidation, for better exposure to posix_fadvise() semantics? bdrv_co_invalidate_cache() is currently only used by migration before using a file that may have been touched by the other host. I don't think start/length will be needed for that use case. Can you describe how will NBD use cache invalidation? Maybe this will help me understand other use cases. Stefan
On 04/19/2018 10:05 PM, Stefan Hajnoczi wrote: >>> This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c >>> on Linux so that shared storage live migration works. I have sent it as an RFC >>> because cache consistency is not binary, there are corner cases which I've >>> described in the actual patch, and this may require more discussion. >> >> Interesting, in that the NBD list is also discussing the possible >> standardization of a NBD_CMD_CACHE command (based on existing practice >> in the xNBD implementation), and covering whether that MIGHT be worth >> doing as a thin wrapper that corresponds to posix_fadvise() semantics. >> Thus, if NBD_CMD_CACHE learns flags, we could support >> .bdrv_co_invalidate_cache() through the NBD protocol driver, in addition >> to the POSIX file driver. Obviously, your usage invalidates the cache >> of the entire file; but does it also make sense to expose a start/length >> subset invalidation, for better exposure to posix_fadvise() semantics? > > bdrv_co_invalidate_cache() is currently only used by migration before > using a file that may have been touched by the other host. I don't > think start/length will be needed for that use case. > > Can you describe how will NBD use cache invalidation? Maybe this will > help me understand other use cases. That's where things are still under discussion - no one has yet provided a case that would benefit from a POSIX_FADV_DONTNEED over just a range of the file [1]; on the other hand, it might make sense that if you know an implementation has a limited cache, then having control over the various posix_fadvise() flags over various ranges of the files may lead to more optimum behavior. And posix_fadvise() does have the ability to work over the entire file (offset 0 length 0) or a subrange (any offset and nonzero length). So I'm also fine if .bdrv_co_invalidate_cache() doesn't expose offset/length parameters, particularly if NBD can't come up with an actual use case that would benefit. [1] https://lists.debian.org/nbd/2018/04/msg00020.html
On Fri, Apr 20, 2018 at 08:53:33AM -0500, Eric Blake wrote: > On 04/19/2018 10:05 PM, Stefan Hajnoczi wrote: > > >>> This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c > >>> on Linux so that shared storage live migration works. I have sent it as an RFC > >>> because cache consistency is not binary, there are corner cases which I've > >>> described in the actual patch, and this may require more discussion. > >> > >> Interesting, in that the NBD list is also discussing the possible > >> standardization of a NBD_CMD_CACHE command (based on existing practice > >> in the xNBD implementation), and covering whether that MIGHT be worth > >> doing as a thin wrapper that corresponds to posix_fadvise() semantics. > >> Thus, if NBD_CMD_CACHE learns flags, we could support > >> .bdrv_co_invalidate_cache() through the NBD protocol driver, in addition > >> to the POSIX file driver. Obviously, your usage invalidates the cache > >> of the entire file; but does it also make sense to expose a start/length > >> subset invalidation, for better exposure to posix_fadvise() semantics? > > > > bdrv_co_invalidate_cache() is currently only used by migration before > > using a file that may have been touched by the other host. I don't > > think start/length will be needed for that use case. > > > > Can you describe how will NBD use cache invalidation? Maybe this will > > help me understand other use cases. > > That's where things are still under discussion - no one has yet provided > a case that would benefit from a POSIX_FADV_DONTNEED over just a range > of the file [1]; on the other hand, it might make sense that if you know > an implementation has a limited cache, then having control over the > various posix_fadvise() flags over various ranges of the files may lead > to more optimum behavior. And posix_fadvise() does have the ability to > work over the entire file (offset 0 length 0) or a subrange (any offset > and nonzero length). So I'm also fine if .bdrv_co_invalidate_cache() > doesn't expose offset/length parameters, particularly if NBD can't come > up with an actual use case that would benefit. > > [1] https://lists.debian.org/nbd/2018/04/msg00020.html Okay. .bdrv_co_invalidate_cache() is an internal interface. We can change it later to support NBD functionality, if necessary. Let's keep it without start/length for now. Stefan