Message ID | 1353488287-47077-1-git-send-email-borntraeger@de.ibm.com |
---|---|
State | New |
Headers | show |
Am 21.11.2012 09:58, schrieb Christian Borntraeger: > From: Heinz Graalfs <graalfs@linux.vnet.ibm.com> > > While testing IPL code (booting) for s390x we faced some problems > with cache=none on dasds (4k block size) on bdrv_preads with length > values != block size. > > This patch makes sure that bdrv_pread and friends work fine with > unaligned access even with cache=none > - propagate alignment value also into bs->file struct > - modify the size in case of no cache to avoid EINVAL on > pread() etc. (file was opened with O_DIRECT). > > This patch seems to cure the problems. > > CC: Kevin Wolf <kwolf@redhat.com> > CC: Stefan Hajnoczi <stefanha@redhat.com> > Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com> > Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> > --- > block.c | 3 +++ > block/raw-posix.c | 6 ++++++ > 2 files changed, 9 insertions(+) > > diff --git a/block.c b/block.c > index 854ebd6..f23c562 100644 > --- a/block.c > +++ b/block.c > @@ -4242,6 +4242,9 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs, > void bdrv_set_buffer_alignment(BlockDriverState *bs, int align) > { > bs->buffer_alignment = align; > + if ((bs->open_flags & BDRV_O_NOCACHE)) { > + bs->file->buffer_alignment = align; > + } Any reason to restrict this to BDRV_O_NOCACHE? There have been patches to change the BDRV_O_NOCACHE flag from the monitor, in which case bdrv_set_buffer_alignment() wouldn't be called anew and O_DIRECT requests start to fail again. > } > > void *qemu_blockalign(BlockDriverState *bs, size_t size) > diff --git a/block/raw-posix.c b/block/raw-posix.c > index f2f0404..baebf1d 100644 > --- a/block/raw-posix.c > +++ b/block/raw-posix.c > @@ -700,6 +700,12 @@ static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd, > acb->aio_nbytes = nb_sectors * 512; > acb->aio_offset = sector_num * 512; > > + /* O_DIRECT also requires an aligned length */ > + if (bs->open_flags & BDRV_O_NOCACHE) { > + acb->aio_nbytes += acb->bs->buffer_alignment - 1; > + acb->aio_nbytes &= ~(acb->bs->buffer_alignment - 1); > + } Modifying aio_nbytes, but not the iov looks wrong to me. This may work in the handle_aiocb_rw_linear() code path, but not with actual vectored I/O. Kevin
On 21/11/12 10:15, Kevin Wolf wrote: > Am 21.11.2012 09:58, schrieb Christian Borntraeger: >> From: Heinz Graalfs <graalfs@linux.vnet.ibm.com> >> >> While testing IPL code (booting) for s390x we faced some problems >> with cache=none on dasds (4k block size) on bdrv_preads with length >> values != block size. >> >> This patch makes sure that bdrv_pread and friends work fine with >> unaligned access even with cache=none >> - propagate alignment value also into bs->file struct >> - modify the size in case of no cache to avoid EINVAL on >> pread() etc. (file was opened with O_DIRECT). >> >> This patch seems to cure the problems. >> >> CC: Kevin Wolf <kwolf@redhat.com> >> CC: Stefan Hajnoczi <stefanha@redhat.com> >> Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com> >> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> >> --- >> block.c | 3 +++ >> block/raw-posix.c | 6 ++++++ >> 2 files changed, 9 insertions(+) >> >> diff --git a/block.c b/block.c >> index 854ebd6..f23c562 100644 >> --- a/block.c >> +++ b/block.c >> @@ -4242,6 +4242,9 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs, >> void bdrv_set_buffer_alignment(BlockDriverState *bs, int align) >> { >> bs->buffer_alignment = align; >> + if ((bs->open_flags & BDRV_O_NOCACHE)) { >> + bs->file->buffer_alignment = align; >> + } > > Any reason to restrict this to BDRV_O_NOCACHE? > > There have been patches to change the BDRV_O_NOCACHE flag from the > monitor, in which case bdrv_set_buffer_alignment() wouldn't be called > anew and O_DIRECT requests start to fail again. Right, should be ok to remove the check. > >> } >> >> void *qemu_blockalign(BlockDriverState *bs, size_t size) >> diff --git a/block/raw-posix.c b/block/raw-posix.c >> index f2f0404..baebf1d 100644 >> --- a/block/raw-posix.c >> +++ b/block/raw-posix.c >> @@ -700,6 +700,12 @@ static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd, >> acb->aio_nbytes = nb_sectors * 512; >> acb->aio_offset = sector_num * 512; >> >> + /* O_DIRECT also requires an aligned length */ >> + if (bs->open_flags & BDRV_O_NOCACHE) { >> + acb->aio_nbytes += acb->bs->buffer_alignment - 1; >> + acb->aio_nbytes &= ~(acb->bs->buffer_alignment - 1); >> + } > > Modifying aio_nbytes, but not the iov looks wrong to me. This may work > in the handle_aiocb_rw_linear() code path, but not with actual vectored I/O. I think it seemed to work because the vectored I/O cases that we were testing were properly aligned or were in the QEMU_AIO_MISALIGNED case which does bounce buffering anyway. But I am not sure... Heinz can you have a look at this and identify the exact place were it was failing and why this patch helps? (I just know it does). That might help to understand if we also need to touch the iovs. Christian
On Wed, 2012-11-21 at 11:00 +0100, Christian Borntraeger wrote: > On 21/11/12 10:15, Kevin Wolf wrote: > > Am 21.11.2012 09:58, schrieb Christian Borntraeger: > >> From: Heinz Graalfs <graalfs@linux.vnet.ibm.com> > >> > >> While testing IPL code (booting) for s390x we faced some problems > >> with cache=none on dasds (4k block size) on bdrv_preads with length > >> values != block size. > >> > >> This patch makes sure that bdrv_pread and friends work fine with > >> unaligned access even with cache=none > >> - propagate alignment value also into bs->file struct > >> - modify the size in case of no cache to avoid EINVAL on > >> pread() etc. (file was opened with O_DIRECT). > >> > >> This patch seems to cure the problems. > >> > >> CC: Kevin Wolf <kwolf@redhat.com> > >> CC: Stefan Hajnoczi <stefanha@redhat.com> > >> Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com> > >> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> > >> --- > >> block.c | 3 +++ > >> block/raw-posix.c | 6 ++++++ > >> 2 files changed, 9 insertions(+) > >> > >> diff --git a/block.c b/block.c > >> index 854ebd6..f23c562 100644 > >> --- a/block.c > >> +++ b/block.c > >> @@ -4242,6 +4242,9 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs, > >> void bdrv_set_buffer_alignment(BlockDriverState *bs, int align) > >> { > >> bs->buffer_alignment = align; > >> + if ((bs->open_flags & BDRV_O_NOCACHE)) { > >> + bs->file->buffer_alignment = align; > >> + } > > > > Any reason to restrict this to BDRV_O_NOCACHE? > > > > There have been patches to change the BDRV_O_NOCACHE flag from the > > monitor, in which case bdrv_set_buffer_alignment() wouldn't be called > > anew and O_DIRECT requests start to fail again. > > > Right, should be ok to remove the check. > > > > > >> } > >> > >> void *qemu_blockalign(BlockDriverState *bs, size_t size) > >> diff --git a/block/raw-posix.c b/block/raw-posix.c > >> index f2f0404..baebf1d 100644 > >> --- a/block/raw-posix.c > >> +++ b/block/raw-posix.c > >> @@ -700,6 +700,12 @@ static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd, > >> acb->aio_nbytes = nb_sectors * 512; > >> acb->aio_offset = sector_num * 512; > >> > >> + /* O_DIRECT also requires an aligned length */ > >> + if (bs->open_flags & BDRV_O_NOCACHE) { > >> + acb->aio_nbytes += acb->bs->buffer_alignment - 1; > >> + acb->aio_nbytes &= ~(acb->bs->buffer_alignment - 1); > >> + } > > > > Modifying aio_nbytes, but not the iov looks wrong to me. This may work > > in the handle_aiocb_rw_linear() code path, but not with actual vectored I/O. > > I think it seemed to work because the vectored I/O cases that we were testing were properly > aligned or were in the QEMU_AIO_MISALIGNED case which does bounce buffering anyway. > But I am not sure... > > Heinz can you have a look at this and identify the exact place were it was failing > and why this patch helps? (I just know it does). That might help to understand > if we also need to touch the iovs. The pread() call in handle_aiocb_rw_linear() is failing with errno 22. At least for this path the patch ensures that the length is correctly set. I need to look into the vectored I/O part in more detail. > Christian > >
Il 21/11/2012 10:15, Kevin Wolf ha scritto: >> > + if ((bs->open_flags & BDRV_O_NOCACHE)) { >> > + bs->file->buffer_alignment = align; >> > + } > Any reason to restrict this to BDRV_O_NOCACHE? > > There have been patches to change the BDRV_O_NOCACHE flag from the > monitor, in which case bdrv_set_buffer_alignment() wouldn't be called > anew and O_DIRECT requests start to fail again. > bdrv_set_buffer_alignment() is completely broken. It should set host alignment, but in fact it is passed the guest alignment. In practice, we only support logical_block_size matching the host's or bigger (which is unsafe due to torn writes, but works). So I suggest that we just look at writes outside the device models, and "fix" them to always read a multiple of 4k. Paolo
On 21/11/12 17:03, Paolo Bonzini wrote: > Il 21/11/2012 10:15, Kevin Wolf ha scritto: >>>> + if ((bs->open_flags & BDRV_O_NOCACHE)) { >>>> + bs->file->buffer_alignment = align; >>>> + } >> Any reason to restrict this to BDRV_O_NOCACHE? >> >> There have been patches to change the BDRV_O_NOCACHE flag from the >> monitor, in which case bdrv_set_buffer_alignment() wouldn't be called >> anew and O_DIRECT requests start to fail again. >> > > bdrv_set_buffer_alignment() is completely broken. It should set host > alignment, but in fact it is passed the guest alignment. > > In practice, we only support logical_block_size matching the host's or > bigger (which is unsafe due to torn writes, but works). For other reasons (partition table format) we want to have host block size == guest block size on s390 anyway - so it would not really matter for us. But I certainly agree that it makes more sense to use the host block size for the alignment checks. > So I suggest that we just look at writes outside the device models, and > "fix" them to always read a multiple of 4k. Wouldnt that cause performance regressions for block devices with 512 byte block size, because we read more than necessary. Wouldnt that also require read/update/write combinations for valid 512 byte writes? Christian
On Wed, 2012-11-21 at 10:15 +0100, Kevin Wolf wrote: > Am 21.11.2012 09:58, schrieb Christian Borntraeger: > > From: Heinz Graalfs <graalfs@linux.vnet.ibm.com> > > > > While testing IPL code (booting) for s390x we faced some problems > > with cache=none on dasds (4k block size) on bdrv_preads with length > > values != block size. > > > > This patch makes sure that bdrv_pread and friends work fine with > > unaligned access even with cache=none > > - propagate alignment value also into bs->file struct > > - modify the size in case of no cache to avoid EINVAL on > > pread() etc. (file was opened with O_DIRECT). > > > > This patch seems to cure the problems. > > > > CC: Kevin Wolf <kwolf@redhat.com> > > CC: Stefan Hajnoczi <stefanha@redhat.com> > > Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com> > > Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> > > --- > > block.c | 3 +++ > > block/raw-posix.c | 6 ++++++ > > 2 files changed, 9 insertions(+) > > > > diff --git a/block.c b/block.c > > index 854ebd6..f23c562 100644 > > --- a/block.c > > +++ b/block.c > > @@ -4242,6 +4242,9 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs, > > void bdrv_set_buffer_alignment(BlockDriverState *bs, int align) > > { > > bs->buffer_alignment = align; > > + if ((bs->open_flags & BDRV_O_NOCACHE)) { > > + bs->file->buffer_alignment = align; > > + } > > Any reason to restrict this to BDRV_O_NOCACHE? > > There have been patches to change the BDRV_O_NOCACHE flag from the > monitor, in which case bdrv_set_buffer_alignment() wouldn't be called > anew and O_DIRECT requests start to fail again. > OK > > } > > > > void *qemu_blockalign(BlockDriverState *bs, size_t size) > > diff --git a/block/raw-posix.c b/block/raw-posix.c > > index f2f0404..baebf1d 100644 > > --- a/block/raw-posix.c > > +++ b/block/raw-posix.c > > @@ -700,6 +700,12 @@ static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd, > > acb->aio_nbytes = nb_sectors * 512; > > acb->aio_offset = sector_num * 512; > > > > + /* O_DIRECT also requires an aligned length */ > > + if (bs->open_flags & BDRV_O_NOCACHE) { > > + acb->aio_nbytes += acb->bs->buffer_alignment - 1; > > + acb->aio_nbytes &= ~(acb->bs->buffer_alignment - 1); > > + } > > Modifying aio_nbytes, but not the iov looks wrong to me. This may work > in the handle_aiocb_rw_linear() code path, but not with actual vectored I/O. > Current coding ensures that read IO buffers always seem to be aligned correctly. Whereas read length values are not always appropriate for an O_DIRECT scenario. For a 2048 formatted disk I verified that 1. non vectored IO - the length needs to be adapted several times, which is accomplished now by the patch. 2. vectored IO - the qiov's total length is always a multiple of the logical block size (which is also verified in virtio_blk_handle_read()) The particular iov length fields are already correctly setup as a multiple of the logical block size when processed in virtio_blk_handle_request(). > Kevin >
Hello Kevin, I'm resending my answer as of Nov 23rd. Is this still on your queue? Heinz On Wed, 2012-11-21 at 10:15 +0100, Kevin Wolf wrote: > Am 21.11.2012 09:58, schrieb Christian Borntraeger: > > From: Heinz Graalfs <graalfs@linux.vnet.ibm.com> > > > > While testing IPL code (booting) for s390x we faced some problems > > with cache=none on dasds (4k block size) on bdrv_preads with length > > values != block size. > > > > This patch makes sure that bdrv_pread and friends work fine with > > unaligned access even with cache=none > > - propagate alignment value also into bs->file struct > > - modify the size in case of no cache to avoid EINVAL on > > pread() etc. (file was opened with O_DIRECT). > > > > This patch seems to cure the problems. > > > > CC: Kevin Wolf <kwolf@redhat.com> > > CC: Stefan Hajnoczi <stefanha@redhat.com> > > Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com> > > Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> > > --- > > block.c | 3 +++ > > block/raw-posix.c | 6 ++++++ > > 2 files changed, 9 insertions(+) > > > > diff --git a/block.c b/block.c > > index 854ebd6..f23c562 100644 > > --- a/block.c > > +++ b/block.c > > @@ -4242,6 +4242,9 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs, > > void bdrv_set_buffer_alignment(BlockDriverState *bs, int align) > > { > > bs->buffer_alignment = align; > > + if ((bs->open_flags & BDRV_O_NOCACHE)) { > > + bs->file->buffer_alignment = align; > > + } > > Any reason to restrict this to BDRV_O_NOCACHE? OK, can be removed > There have been patches to change the BDRV_O_NOCACHE flag from the > monitor, in which case bdrv_set_buffer_alignment() wouldn't be called > anew and O_DIRECT requests start to fail again. > > } > > > > void *qemu_blockalign(BlockDriverState *bs, size_t size) > > diff --git a/block/raw-posix.c b/block/raw-posix.c > > index f2f0404..baebf1d 100644 > > --- a/block/raw-posix.c > > +++ b/block/raw-posix.c > > @@ -700,6 +700,12 @@ static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd, > > acb->aio_nbytes = nb_sectors * 512; > > acb->aio_offset = sector_num * 512; > > > > + /* O_DIRECT also requires an aligned length */ > > + if (bs->open_flags & BDRV_O_NOCACHE) { > > + acb->aio_nbytes += acb->bs->buffer_alignment - 1; > > + acb->aio_nbytes &= ~(acb->bs->buffer_alignment - 1); > > + } > > Modifying aio_nbytes, but not the iov looks wrong to me. This may work > in the handle_aiocb_rw_linear() code path, but not with actual vectored I/O. Current coding ensures that read IO buffers always seem to be aligned correctly. Whereas read length values are not always appropriate for an O_DIRECT scenario. For a 2048 formatted disk I verified that 1. non vectored IO - the length needs to be adapted several times, which is accomplished now by the patch. 2. vectored IO - the qiov's total length is always a multiple of the logical block size (which is also verified in virtio_blk_handle_read()) The particular iov length fields are already correctly setup as a multiple of the logical block size when processed in virtio_blk_handle_request(). > > Kevin >
Am 07.12.2012 21:26, schrieb Heinz Graalfs: > Hello Kevin, > > I'm resending my answer as of Nov 23rd. > > Is this still on your queue? No, it wasn't. I guess I was waiting for a new version of the patch. >>> } >>> >>> void *qemu_blockalign(BlockDriverState *bs, size_t size) >>> diff --git a/block/raw-posix.c b/block/raw-posix.c >>> index f2f0404..baebf1d 100644 >>> --- a/block/raw-posix.c >>> +++ b/block/raw-posix.c >>> @@ -700,6 +700,12 @@ static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd, >>> acb->aio_nbytes = nb_sectors * 512; >>> acb->aio_offset = sector_num * 512; >>> >>> + /* O_DIRECT also requires an aligned length */ >>> + if (bs->open_flags & BDRV_O_NOCACHE) { >>> + acb->aio_nbytes += acb->bs->buffer_alignment - 1; >>> + acb->aio_nbytes &= ~(acb->bs->buffer_alignment - 1); >>> + } >> >> Modifying aio_nbytes, but not the iov looks wrong to me. This may work >> in the handle_aiocb_rw_linear() code path, but not with actual vectored I/O. > > Current coding ensures that read IO buffers always seem to be aligned > correctly. Whereas read length values are not always appropriate for an > O_DIRECT scenario. > > For a 2048 formatted disk I verified that > > 1. non vectored IO - the length needs to be adapted several times, > which is accomplished now by the patch. > > 2. vectored IO - the qiov's total length is always a multiple of the > logical block size > (which is also verified in virtio_blk_handle_read()) > The particular iov length fields are already correctly setup as a > multiple of the logical block size when processed in > virtio_blk_handle_request(). I must admit that I don't quite understand this. As far as I know, virtio-blk doesn't make any difference between requests with niov = 1 and real vectored requests. So how can the length of the latter always be right, whereas the length of the former may be wrong? The other point is that requests may not even be coming from virtio-blk. They could be made by other device emulations or they could come from a block job. (They also could be the result of a merge in the block layer, though if the original requests were aligned, the result will stay aligned) Kevin
Hi Kevin, I'm using the bdrv_pread() function during boot partition detection ... In detail: bdrv_pread() is called to read 32 bytes from a 2048 bytes formatted disk. This results in setting up a read of 512 bytes (1 sector multiplied by 512 current code in paio_submit()), which is wrong for a O_DIRECT opened file, and produces the error. Heinz On Mon, 2012-12-10 at 09:55 +0100, Kevin Wolf wrote: > Am 07.12.2012 21:26, schrieb Heinz Graalfs: > > Hello Kevin, > > > > I'm resending my answer as of Nov 23rd. > > > > Is this still on your queue? > > No, it wasn't. I guess I was waiting for a new version of the patch. > > >>> } > >>> > >>> void *qemu_blockalign(BlockDriverState *bs, size_t size) > >>> diff --git a/block/raw-posix.c b/block/raw-posix.c > >>> index f2f0404..baebf1d 100644 > >>> --- a/block/raw-posix.c > >>> +++ b/block/raw-posix.c > >>> @@ -700,6 +700,12 @@ static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd, > >>> acb->aio_nbytes = nb_sectors * 512; > >>> acb->aio_offset = sector_num * 512; > >>> > >>> + /* O_DIRECT also requires an aligned length */ > >>> + if (bs->open_flags & BDRV_O_NOCACHE) { > >>> + acb->aio_nbytes += acb->bs->buffer_alignment - 1; > >>> + acb->aio_nbytes &= ~(acb->bs->buffer_alignment - 1); > >>> + } > >> > >> Modifying aio_nbytes, but not the iov looks wrong to me. This may work > >> in the handle_aiocb_rw_linear() code path, but not with actual vectored I/O. > > > > Current coding ensures that read IO buffers always seem to be aligned > > correctly. Whereas read length values are not always appropriate for an > > O_DIRECT scenario. > > > > For a 2048 formatted disk I verified that > > > > 1. non vectored IO - the length needs to be adapted several times, > > which is accomplished now by the patch. > > > > 2. vectored IO - the qiov's total length is always a multiple of the > > logical block size > > (which is also verified in virtio_blk_handle_read()) > > The particular iov length fields are already correctly setup as a > > multiple of the logical block size when processed in > > virtio_blk_handle_request(). > > I must admit that I don't quite understand this. As far as I know, > virtio-blk doesn't make any difference between requests with niov = 1 > and real vectored requests. So how can the length of the latter always > be right, whereas the length of the former may be wrong? > > The other point is that requests may not even be coming from virtio-blk. > They could be made by other device emulations or they could come from a > block job. (They also could be the result of a merge in the block layer, > though if the original requests were aligned, the result will stay aligned) > > Kevin >
diff --git a/block.c b/block.c index 854ebd6..f23c562 100644 --- a/block.c +++ b/block.c @@ -4242,6 +4242,9 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs, void bdrv_set_buffer_alignment(BlockDriverState *bs, int align) { bs->buffer_alignment = align; + if ((bs->open_flags & BDRV_O_NOCACHE)) { + bs->file->buffer_alignment = align; + } } void *qemu_blockalign(BlockDriverState *bs, size_t size) diff --git a/block/raw-posix.c b/block/raw-posix.c index f2f0404..baebf1d 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -700,6 +700,12 @@ static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd, acb->aio_nbytes = nb_sectors * 512; acb->aio_offset = sector_num * 512; + /* O_DIRECT also requires an aligned length */ + if (bs->open_flags & BDRV_O_NOCACHE) { + acb->aio_nbytes += acb->bs->buffer_alignment - 1; + acb->aio_nbytes &= ~(acb->bs->buffer_alignment - 1); + } + trace_paio_submit(acb, opaque, sector_num, nb_sectors, type); return thread_pool_submit_aio(aio_worker, acb, cb, opaque); }