Patchwork [PATCH/RFC] block: Ensure that block size constraints are considered

login
register
mail settings
Submitter Christian Borntraeger
Date Nov. 21, 2012, 8:58 a.m.
Message ID <1353488287-47077-1-git-send-email-borntraeger@de.ibm.com>
Download mbox | patch
Permalink /patch/200609/
State New
Headers show

Comments

Christian Borntraeger - Nov. 21, 2012, 8:58 a.m.
From: Heinz Graalfs <graalfs@linux.vnet.ibm.com>

While testing IPL code (booting) for s390x we faced some problems
with cache=none on dasds (4k block size) on bdrv_preads with length
values != block size.

This patch makes sure that bdrv_pread and friends work fine with
unaligned access even with cache=none
   - propagate alignment value also into bs->file struct
   - modify the size in case of no cache to avoid EINVAL on
     pread() etc. (file was opened with O_DIRECT).

This patch seems to cure the problems.

CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
 block.c           |    3 +++
 block/raw-posix.c |    6 ++++++
 2 files changed, 9 insertions(+)
Kevin Wolf - Nov. 21, 2012, 9:15 a.m.
Am 21.11.2012 09:58, schrieb Christian Borntraeger:
> From: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
> 
> While testing IPL code (booting) for s390x we faced some problems
> with cache=none on dasds (4k block size) on bdrv_preads with length
> values != block size.
> 
> This patch makes sure that bdrv_pread and friends work fine with
> unaligned access even with cache=none
>    - propagate alignment value also into bs->file struct
>    - modify the size in case of no cache to avoid EINVAL on
>      pread() etc. (file was opened with O_DIRECT).
> 
> This patch seems to cure the problems.
> 
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
> ---
>  block.c           |    3 +++
>  block/raw-posix.c |    6 ++++++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/block.c b/block.c
> index 854ebd6..f23c562 100644
> --- a/block.c
> +++ b/block.c
> @@ -4242,6 +4242,9 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
>  void bdrv_set_buffer_alignment(BlockDriverState *bs, int align)
>  {
>      bs->buffer_alignment = align;
> +    if ((bs->open_flags & BDRV_O_NOCACHE)) {
> +        bs->file->buffer_alignment = align;
> +    }

Any reason to restrict this to BDRV_O_NOCACHE?

There have been patches to change the BDRV_O_NOCACHE flag from the
monitor, in which case bdrv_set_buffer_alignment() wouldn't be called
anew and O_DIRECT requests start to fail again.

>  }
>  
>  void *qemu_blockalign(BlockDriverState *bs, size_t size)
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index f2f0404..baebf1d 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -700,6 +700,12 @@ static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
>      acb->aio_nbytes = nb_sectors * 512;
>      acb->aio_offset = sector_num * 512;
>  
> +    /* O_DIRECT also requires an aligned length */
> +    if (bs->open_flags & BDRV_O_NOCACHE) {
> +        acb->aio_nbytes += acb->bs->buffer_alignment - 1;
> +        acb->aio_nbytes &= ~(acb->bs->buffer_alignment - 1);
> +    }

Modifying aio_nbytes, but not the iov looks wrong to me. This may work
in the handle_aiocb_rw_linear() code path, but not with actual vectored I/O.

Kevin
Christian Borntraeger - Nov. 21, 2012, 10 a.m.
On 21/11/12 10:15, Kevin Wolf wrote:
> Am 21.11.2012 09:58, schrieb Christian Borntraeger:
>> From: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
>>
>> While testing IPL code (booting) for s390x we faced some problems
>> with cache=none on dasds (4k block size) on bdrv_preads with length
>> values != block size.
>>
>> This patch makes sure that bdrv_pread and friends work fine with
>> unaligned access even with cache=none
>>    - propagate alignment value also into bs->file struct
>>    - modify the size in case of no cache to avoid EINVAL on
>>      pread() etc. (file was opened with O_DIRECT).
>>
>> This patch seems to cure the problems.
>>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
>> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
>> ---
>>  block.c           |    3 +++
>>  block/raw-posix.c |    6 ++++++
>>  2 files changed, 9 insertions(+)
>>
>> diff --git a/block.c b/block.c
>> index 854ebd6..f23c562 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -4242,6 +4242,9 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
>>  void bdrv_set_buffer_alignment(BlockDriverState *bs, int align)
>>  {
>>      bs->buffer_alignment = align;
>> +    if ((bs->open_flags & BDRV_O_NOCACHE)) {
>> +        bs->file->buffer_alignment = align;
>> +    }
> 
> Any reason to restrict this to BDRV_O_NOCACHE?
> 
> There have been patches to change the BDRV_O_NOCACHE flag from the
> monitor, in which case bdrv_set_buffer_alignment() wouldn't be called
> anew and O_DIRECT requests start to fail again.


Right, should be ok to remove the check.


> 
>>  }
>>  
>>  void *qemu_blockalign(BlockDriverState *bs, size_t size)
>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>> index f2f0404..baebf1d 100644
>> --- a/block/raw-posix.c
>> +++ b/block/raw-posix.c
>> @@ -700,6 +700,12 @@ static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
>>      acb->aio_nbytes = nb_sectors * 512;
>>      acb->aio_offset = sector_num * 512;
>>  
>> +    /* O_DIRECT also requires an aligned length */
>> +    if (bs->open_flags & BDRV_O_NOCACHE) {
>> +        acb->aio_nbytes += acb->bs->buffer_alignment - 1;
>> +        acb->aio_nbytes &= ~(acb->bs->buffer_alignment - 1);
>> +    }
> 
> Modifying aio_nbytes, but not the iov looks wrong to me. This may work
> in the handle_aiocb_rw_linear() code path, but not with actual vectored I/O.

I think it seemed to work because the vectored I/O cases that we were testing were properly
aligned or were in the QEMU_AIO_MISALIGNED case which does bounce buffering anyway.
But I am not sure...

Heinz can you have a look at this and identify the exact place were it was failing
and why this patch helps?  (I just know it does). That might help to understand
if we also need to touch the iovs.

Christian
Heinz Graalfs - Nov. 21, 2012, 11:24 a.m.
On Wed, 2012-11-21 at 11:00 +0100, Christian Borntraeger wrote:
> On 21/11/12 10:15, Kevin Wolf wrote:
> > Am 21.11.2012 09:58, schrieb Christian Borntraeger:
> >> From: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
> >>
> >> While testing IPL code (booting) for s390x we faced some problems
> >> with cache=none on dasds (4k block size) on bdrv_preads with length
> >> values != block size.
> >>
> >> This patch makes sure that bdrv_pread and friends work fine with
> >> unaligned access even with cache=none
> >>    - propagate alignment value also into bs->file struct
> >>    - modify the size in case of no cache to avoid EINVAL on
> >>      pread() etc. (file was opened with O_DIRECT).
> >>
> >> This patch seems to cure the problems.
> >>
> >> CC: Kevin Wolf <kwolf@redhat.com>
> >> CC: Stefan Hajnoczi <stefanha@redhat.com>
> >> Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
> >> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
> >> ---
> >>  block.c           |    3 +++
> >>  block/raw-posix.c |    6 ++++++
> >>  2 files changed, 9 insertions(+)
> >>
> >> diff --git a/block.c b/block.c
> >> index 854ebd6..f23c562 100644
> >> --- a/block.c
> >> +++ b/block.c
> >> @@ -4242,6 +4242,9 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
> >>  void bdrv_set_buffer_alignment(BlockDriverState *bs, int align)
> >>  {
> >>      bs->buffer_alignment = align;
> >> +    if ((bs->open_flags & BDRV_O_NOCACHE)) {
> >> +        bs->file->buffer_alignment = align;
> >> +    }
> > 
> > Any reason to restrict this to BDRV_O_NOCACHE?
> > 
> > There have been patches to change the BDRV_O_NOCACHE flag from the
> > monitor, in which case bdrv_set_buffer_alignment() wouldn't be called
> > anew and O_DIRECT requests start to fail again.
> 
> 
> Right, should be ok to remove the check.
> 
> 
> > 
> >>  }
> >>  
> >>  void *qemu_blockalign(BlockDriverState *bs, size_t size)
> >> diff --git a/block/raw-posix.c b/block/raw-posix.c
> >> index f2f0404..baebf1d 100644
> >> --- a/block/raw-posix.c
> >> +++ b/block/raw-posix.c
> >> @@ -700,6 +700,12 @@ static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
> >>      acb->aio_nbytes = nb_sectors * 512;
> >>      acb->aio_offset = sector_num * 512;
> >>  
> >> +    /* O_DIRECT also requires an aligned length */
> >> +    if (bs->open_flags & BDRV_O_NOCACHE) {
> >> +        acb->aio_nbytes += acb->bs->buffer_alignment - 1;
> >> +        acb->aio_nbytes &= ~(acb->bs->buffer_alignment - 1);
> >> +    }
> > 
> > Modifying aio_nbytes, but not the iov looks wrong to me. This may work
> > in the handle_aiocb_rw_linear() code path, but not with actual vectored I/O.
> 
> I think it seemed to work because the vectored I/O cases that we were testing were properly
> aligned or were in the QEMU_AIO_MISALIGNED case which does bounce buffering anyway.
> But I am not sure...
> 
> Heinz can you have a look at this and identify the exact place were it was failing
> and why this patch helps?  (I just know it does). That might help to understand
> if we also need to touch the iovs.

The pread() call in handle_aiocb_rw_linear() is failing with errno 22.
At least for this path the patch ensures that the length is correctly
set. I need to look into the vectored I/O part in more detail.

> Christian
> 
>
Paolo Bonzini - Nov. 21, 2012, 4:03 p.m.
Il 21/11/2012 10:15, Kevin Wolf ha scritto:
>> > +    if ((bs->open_flags & BDRV_O_NOCACHE)) {
>> > +        bs->file->buffer_alignment = align;
>> > +    }
> Any reason to restrict this to BDRV_O_NOCACHE?
> 
> There have been patches to change the BDRV_O_NOCACHE flag from the
> monitor, in which case bdrv_set_buffer_alignment() wouldn't be called
> anew and O_DIRECT requests start to fail again.
> 

bdrv_set_buffer_alignment() is completely broken.  It should set host
alignment, but in fact it is passed the guest alignment.

In practice, we only support logical_block_size matching the host's or
bigger (which is unsafe due to torn writes, but works).

So I suggest that we just look at writes outside the device models, and
"fix" them to always read a multiple of 4k.

Paolo
Christian Borntraeger - Nov. 22, 2012, 12:03 p.m.
On 21/11/12 17:03, Paolo Bonzini wrote:
> Il 21/11/2012 10:15, Kevin Wolf ha scritto:
>>>> +    if ((bs->open_flags & BDRV_O_NOCACHE)) {
>>>> +        bs->file->buffer_alignment = align;
>>>> +    }
>> Any reason to restrict this to BDRV_O_NOCACHE?
>>
>> There have been patches to change the BDRV_O_NOCACHE flag from the
>> monitor, in which case bdrv_set_buffer_alignment() wouldn't be called
>> anew and O_DIRECT requests start to fail again.
>>
> 
> bdrv_set_buffer_alignment() is completely broken.  It should set host
> alignment, but in fact it is passed the guest alignment.
> 
> In practice, we only support logical_block_size matching the host's or
> bigger (which is unsafe due to torn writes, but works).


For other reasons (partition table format) we want to have host block
size == guest block size on s390 anyway - so it would not really matter for
us.
But I certainly agree that it makes more sense to use the host block size
for the alignment checks.



> So I suggest that we just look at writes outside the device models, and
> "fix" them to always read a multiple of 4k.

Wouldnt that cause performance regressions for block devices with 512 byte 
block size, because we read more than necessary. Wouldnt that also require
read/update/write combinations for valid 512 byte writes?

Christian
Heinz Graalfs - Nov. 23, 2012, 10:45 a.m.
On Wed, 2012-11-21 at 10:15 +0100, Kevin Wolf wrote:
> Am 21.11.2012 09:58, schrieb Christian Borntraeger:
> > From: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
> > 
> > While testing IPL code (booting) for s390x we faced some problems
> > with cache=none on dasds (4k block size) on bdrv_preads with length
> > values != block size.
> > 
> > This patch makes sure that bdrv_pread and friends work fine with
> > unaligned access even with cache=none
> >    - propagate alignment value also into bs->file struct
> >    - modify the size in case of no cache to avoid EINVAL on
> >      pread() etc. (file was opened with O_DIRECT).
> > 
> > This patch seems to cure the problems.
> > 
> > CC: Kevin Wolf <kwolf@redhat.com>
> > CC: Stefan Hajnoczi <stefanha@redhat.com>
> > Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
> > Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
> > ---
> >  block.c           |    3 +++
> >  block/raw-posix.c |    6 ++++++
> >  2 files changed, 9 insertions(+)
> > 
> > diff --git a/block.c b/block.c
> > index 854ebd6..f23c562 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -4242,6 +4242,9 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
> >  void bdrv_set_buffer_alignment(BlockDriverState *bs, int align)
> >  {
> >      bs->buffer_alignment = align;
> > +    if ((bs->open_flags & BDRV_O_NOCACHE)) {
> > +        bs->file->buffer_alignment = align;
> > +    }
> 
> Any reason to restrict this to BDRV_O_NOCACHE?
> 
> There have been patches to change the BDRV_O_NOCACHE flag from the
> monitor, in which case bdrv_set_buffer_alignment() wouldn't be called
> anew and O_DIRECT requests start to fail again.
> 

OK

> >  }
> >  
> >  void *qemu_blockalign(BlockDriverState *bs, size_t size)
> > diff --git a/block/raw-posix.c b/block/raw-posix.c
> > index f2f0404..baebf1d 100644
> > --- a/block/raw-posix.c
> > +++ b/block/raw-posix.c
> > @@ -700,6 +700,12 @@ static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
> >      acb->aio_nbytes = nb_sectors * 512;
> >      acb->aio_offset = sector_num * 512;
> >  
> > +    /* O_DIRECT also requires an aligned length */
> > +    if (bs->open_flags & BDRV_O_NOCACHE) {
> > +        acb->aio_nbytes += acb->bs->buffer_alignment - 1;
> > +        acb->aio_nbytes &= ~(acb->bs->buffer_alignment - 1);
> > +    }
> 
> Modifying aio_nbytes, but not the iov looks wrong to me. This may work
> in the handle_aiocb_rw_linear() code path, but not with actual vectored I/O.
> 

Current coding ensures that read IO buffers always seem to be aligned
correctly. Whereas read length values are not always appropriate for an
O_DIRECT scenario.

For a 2048 formatted disk I verified that

1. non vectored IO - the length needs to be adapted several times,
   which is accomplished now by the patch.

2. vectored IO - the qiov's total length is always a multiple of the
   logical block size 
      (which is also verified in virtio_blk_handle_read())
   The particular iov length fields are already correctly setup as a
   multiple of the logical block size when processed in
   virtio_blk_handle_request().


> Kevin
>
Heinz Graalfs - Dec. 7, 2012, 8:26 p.m.
Hello Kevin,

I'm resending my answer as of Nov 23rd.

Is this still on your queue?

Heinz


On Wed, 2012-11-21 at 10:15 +0100, Kevin Wolf wrote:
> Am 21.11.2012 09:58, schrieb Christian Borntraeger:
> > From: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
> > 
> > While testing IPL code (booting) for s390x we faced some problems
> > with cache=none on dasds (4k block size) on bdrv_preads with length
> > values != block size.
> > 
> > This patch makes sure that bdrv_pread and friends work fine with
> > unaligned access even with cache=none
> >    - propagate alignment value also into bs->file struct
> >    - modify the size in case of no cache to avoid EINVAL on
> >      pread() etc. (file was opened with O_DIRECT).
> > 
> > This patch seems to cure the problems.
> > 
> > CC: Kevin Wolf <kwolf@redhat.com>
> > CC: Stefan Hajnoczi <stefanha@redhat.com>
> > Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
> > Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
> > ---
> >  block.c           |    3 +++
> >  block/raw-posix.c |    6 ++++++
> >  2 files changed, 9 insertions(+)
> > 
> > diff --git a/block.c b/block.c
> > index 854ebd6..f23c562 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -4242,6 +4242,9 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
> >  void bdrv_set_buffer_alignment(BlockDriverState *bs, int align)
> >  {
> >      bs->buffer_alignment = align;
> > +    if ((bs->open_flags & BDRV_O_NOCACHE)) {
> > +        bs->file->buffer_alignment = align;
> > +    }
> 
> Any reason to restrict this to BDRV_O_NOCACHE?


OK, can be removed

> There have been patches to change the BDRV_O_NOCACHE flag from the
> monitor, in which case bdrv_set_buffer_alignment() wouldn't be called
> anew and O_DIRECT requests start to fail again.

> >  }
> >  
> >  void *qemu_blockalign(BlockDriverState *bs, size_t size)
> > diff --git a/block/raw-posix.c b/block/raw-posix.c
> > index f2f0404..baebf1d 100644
> > --- a/block/raw-posix.c
> > +++ b/block/raw-posix.c
> > @@ -700,6 +700,12 @@ static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
> >      acb->aio_nbytes = nb_sectors * 512;
> >      acb->aio_offset = sector_num * 512;
> >  
> > +    /* O_DIRECT also requires an aligned length */
> > +    if (bs->open_flags & BDRV_O_NOCACHE) {
> > +        acb->aio_nbytes += acb->bs->buffer_alignment - 1;
> > +        acb->aio_nbytes &= ~(acb->bs->buffer_alignment - 1);
> > +    }
> 
> Modifying aio_nbytes, but not the iov looks wrong to me. This may work
> in the handle_aiocb_rw_linear() code path, but not with actual vectored I/O.

Current coding ensures that read IO buffers always seem to be aligned
correctly. Whereas read length values are not always appropriate for an
O_DIRECT scenario.

For a 2048 formatted disk I verified that

1. non vectored IO - the length needs to be adapted several times,
   which is accomplished now by the patch.

2. vectored IO - the qiov's total length is always a multiple of the
   logical block size 
      (which is also verified in virtio_blk_handle_read())
   The particular iov length fields are already correctly setup as a
   multiple of the logical block size when processed in
   virtio_blk_handle_request().

> 
> Kevin
>
Kevin Wolf - Dec. 10, 2012, 8:55 a.m.
Am 07.12.2012 21:26, schrieb Heinz Graalfs:
> Hello Kevin,
> 
> I'm resending my answer as of Nov 23rd.
> 
> Is this still on your queue?

No, it wasn't. I guess I was waiting for a new version of the patch.

>>>  }
>>>  
>>>  void *qemu_blockalign(BlockDriverState *bs, size_t size)
>>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>>> index f2f0404..baebf1d 100644
>>> --- a/block/raw-posix.c
>>> +++ b/block/raw-posix.c
>>> @@ -700,6 +700,12 @@ static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
>>>      acb->aio_nbytes = nb_sectors * 512;
>>>      acb->aio_offset = sector_num * 512;
>>>  
>>> +    /* O_DIRECT also requires an aligned length */
>>> +    if (bs->open_flags & BDRV_O_NOCACHE) {
>>> +        acb->aio_nbytes += acb->bs->buffer_alignment - 1;
>>> +        acb->aio_nbytes &= ~(acb->bs->buffer_alignment - 1);
>>> +    }
>>
>> Modifying aio_nbytes, but not the iov looks wrong to me. This may work
>> in the handle_aiocb_rw_linear() code path, but not with actual vectored I/O.
> 
> Current coding ensures that read IO buffers always seem to be aligned
> correctly. Whereas read length values are not always appropriate for an
> O_DIRECT scenario.
> 
> For a 2048 formatted disk I verified that
> 
> 1. non vectored IO - the length needs to be adapted several times,
>    which is accomplished now by the patch.
> 
> 2. vectored IO - the qiov's total length is always a multiple of the
>    logical block size 
>       (which is also verified in virtio_blk_handle_read())
>    The particular iov length fields are already correctly setup as a
>    multiple of the logical block size when processed in
>    virtio_blk_handle_request().

I must admit that I don't quite understand this. As far as I know,
virtio-blk doesn't make any difference between requests with niov = 1
and real vectored requests. So how can the length of the latter always
be right, whereas the length of the former may be wrong?

The other point is that requests may not even be coming from virtio-blk.
They could be made by other device emulations or they could come from a
block job. (They also could be the result of a merge in the block layer,
though if the original requests were aligned, the result will stay aligned)

Kevin
Heinz Graalfs - Dec. 11, 2012, 9:58 a.m.
Hi Kevin,

I'm using the bdrv_pread() function during boot partition detection ...

In detail: 
bdrv_pread() is called to read 32 bytes from a 2048 bytes formatted
disk. This results in setting up a read of 512 bytes (1 sector
multiplied by 512 current code in paio_submit()), which is wrong for a
O_DIRECT opened file, and produces the error.

Heinz

On Mon, 2012-12-10 at 09:55 +0100, Kevin Wolf wrote:
> Am 07.12.2012 21:26, schrieb Heinz Graalfs:
> > Hello Kevin,
> > 
> > I'm resending my answer as of Nov 23rd.
> > 
> > Is this still on your queue?
> 
> No, it wasn't. I guess I was waiting for a new version of the patch.
> 
> >>>  }
> >>>  
> >>>  void *qemu_blockalign(BlockDriverState *bs, size_t size)
> >>> diff --git a/block/raw-posix.c b/block/raw-posix.c
> >>> index f2f0404..baebf1d 100644
> >>> --- a/block/raw-posix.c
> >>> +++ b/block/raw-posix.c
> >>> @@ -700,6 +700,12 @@ static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
> >>>      acb->aio_nbytes = nb_sectors * 512;
> >>>      acb->aio_offset = sector_num * 512;
> >>>  
> >>> +    /* O_DIRECT also requires an aligned length */
> >>> +    if (bs->open_flags & BDRV_O_NOCACHE) {
> >>> +        acb->aio_nbytes += acb->bs->buffer_alignment - 1;
> >>> +        acb->aio_nbytes &= ~(acb->bs->buffer_alignment - 1);
> >>> +    }
> >>
> >> Modifying aio_nbytes, but not the iov looks wrong to me. This may work
> >> in the handle_aiocb_rw_linear() code path, but not with actual vectored I/O.
> > 
> > Current coding ensures that read IO buffers always seem to be aligned
> > correctly. Whereas read length values are not always appropriate for an
> > O_DIRECT scenario.
> > 
> > For a 2048 formatted disk I verified that
> > 
> > 1. non vectored IO - the length needs to be adapted several times,
> >    which is accomplished now by the patch.
> > 
> > 2. vectored IO - the qiov's total length is always a multiple of the
> >    logical block size 
> >       (which is also verified in virtio_blk_handle_read())
> >    The particular iov length fields are already correctly setup as a
> >    multiple of the logical block size when processed in
> >    virtio_blk_handle_request().
> 
> I must admit that I don't quite understand this. As far as I know,
> virtio-blk doesn't make any difference between requests with niov = 1
> and real vectored requests. So how can the length of the latter always
> be right, whereas the length of the former may be wrong?
> 
> The other point is that requests may not even be coming from virtio-blk.
> They could be made by other device emulations or they could come from a
> block job. (They also could be the result of a merge in the block layer,
> though if the original requests were aligned, the result will stay aligned)
> 
> Kevin
>

Patch

diff --git a/block.c b/block.c
index 854ebd6..f23c562 100644
--- a/block.c
+++ b/block.c
@@ -4242,6 +4242,9 @@  BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
 void bdrv_set_buffer_alignment(BlockDriverState *bs, int align)
 {
     bs->buffer_alignment = align;
+    if ((bs->open_flags & BDRV_O_NOCACHE)) {
+        bs->file->buffer_alignment = align;
+    }
 }
 
 void *qemu_blockalign(BlockDriverState *bs, size_t size)
diff --git a/block/raw-posix.c b/block/raw-posix.c
index f2f0404..baebf1d 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -700,6 +700,12 @@  static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
     acb->aio_nbytes = nb_sectors * 512;
     acb->aio_offset = sector_num * 512;
 
+    /* O_DIRECT also requires an aligned length */
+    if (bs->open_flags & BDRV_O_NOCACHE) {
+        acb->aio_nbytes += acb->bs->buffer_alignment - 1;
+        acb->aio_nbytes &= ~(acb->bs->buffer_alignment - 1);
+    }
+
     trace_paio_submit(acb, opaque, sector_num, nb_sectors, type);
     return thread_pool_submit_aio(aio_worker, acb, cb, opaque);
 }