diff mbox series

[PULL,2/5] block/nvme: support larger that 512 bytes sector devices

Message ID 20190722172616.28797-3-mreitz@redhat.com
State New
Headers show
Series [PULL,1/5] block/nvme: fix doorbell stride | expand

Commit Message

Max Reitz July 22, 2019, 5:26 p.m. UTC
From: Maxim Levitsky <mlevitsk@redhat.com>

Currently the driver hardcodes the sector size to 512,
and doesn't check the underlying device. Fix that.

Also fail if underlying nvme device is formatted with metadata
as this needs special support.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-id: 20190716163020.13383-3-mlevitsk@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/nvme.c | 45 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 40 insertions(+), 5 deletions(-)

Comments

Peter Maydell July 29, 2019, 1:16 p.m. UTC | #1
On Mon, 22 Jul 2019 at 18:26, Max Reitz <mreitz@redhat.com> wrote:
>
> From: Maxim Levitsky <mlevitsk@redhat.com>
>
> Currently the driver hardcodes the sector size to 512,
> and doesn't check the underlying device. Fix that.
>
> Also fail if underlying nvme device is formatted with metadata
> as this needs special support.
>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> Message-id: 20190716163020.13383-3-mlevitsk@redhat.com
> Signed-off-by: Max Reitz <mreitz@redhat.com>

> +static int64_t nvme_get_blocksize(BlockDriverState *bs)
> +{
> +    BDRVNVMeState *s = bs->opaque;
> +    assert(s->blkshift >= BDRV_SECTOR_BITS);
> +    return 1 << s->blkshift;
> +}

Hi -- Coverity points out here that we calculate the
"1 << s->blkshift" as a 32-bit shift, but then return an
int64_t type (CID 1403771).

Can the blkshift ever really be 31 or more ?

The types here seem weird anyway -- we return an int64_t,
but the only user of this is nvme_probe_blocksizes(),
which uses the value only to set BlockSizes::phys and ::log,
both of which are of type "uint32_t". That leads me to think
that the right return type for the function is uint32_t.

PS: this is the only Coverity issue currently outstanding so
if it's a trivial fix it might be nice to put it into rc3.

thanks
-- PMM
Max Reitz July 29, 2019, 1:25 p.m. UTC | #2
On 29.07.19 15:16, Peter Maydell wrote:
> On Mon, 22 Jul 2019 at 18:26, Max Reitz <mreitz@redhat.com> wrote:
>>
>> From: Maxim Levitsky <mlevitsk@redhat.com>
>>
>> Currently the driver hardcodes the sector size to 512,
>> and doesn't check the underlying device. Fix that.
>>
>> Also fail if underlying nvme device is formatted with metadata
>> as this needs special support.
>>
>> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
>> Message-id: 20190716163020.13383-3-mlevitsk@redhat.com
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> 
>> +static int64_t nvme_get_blocksize(BlockDriverState *bs)
>> +{
>> +    BDRVNVMeState *s = bs->opaque;
>> +    assert(s->blkshift >= BDRV_SECTOR_BITS);
>> +    return 1 << s->blkshift;
>> +}
> 
> Hi -- Coverity points out here that we calculate the
> "1 << s->blkshift" as a 32-bit shift, but then return an
> int64_t type (CID 1403771).
> 
> Can the blkshift ever really be 31 or more ?
> 
> The types here seem weird anyway -- we return an int64_t,
> but the only user of this is nvme_probe_blocksizes(),
> which uses the value only to set BlockSizes::phys and ::log,
> both of which are of type "uint32_t". That leads me to think
> that the right return type for the function is uint32_t.
> 
> PS: this is the only Coverity issue currently outstanding so
> if it's a trivial fix it might be nice to put it into rc3.

Maxim, what do you think?

How about we let nvme_identify() limit blkshift to something sane and
then return a uint32_t here?

In theory it would be limited by page_size, and that has a maximum value
of 2^27.  In practice, though, that limit is checked by another 32-bit
shift...

Max
Maxim Levitsky July 29, 2019, 2:50 p.m. UTC | #3
On Mon, 2019-07-29 at 14:16 +0100, Peter Maydell wrote:
> On Mon, 22 Jul 2019 at 18:26, Max Reitz <mreitz@redhat.com> wrote:
> > 
> > From: Maxim Levitsky <mlevitsk@redhat.com>
> > 
> > Currently the driver hardcodes the sector size to 512,
> > and doesn't check the underlying device. Fix that.
> > 
> > Also fail if underlying nvme device is formatted with metadata
> > as this needs special support.
> > 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > Message-id: 20190716163020.13383-3-mlevitsk@redhat.com
> > Signed-off-by: Max Reitz <mreitz@redhat.com>
> > +static int64_t nvme_get_blocksize(BlockDriverState *bs)
> > +{
> > +    BDRVNVMeState *s = bs->opaque;
> > +    assert(s->blkshift >= BDRV_SECTOR_BITS);
> > +    return 1 << s->blkshift;
> > +}
> 
> Hi -- Coverity points out here that we calculate the
> "1 << s->blkshift" as a 32-bit shift, but then return an
> int64_t type (CID 1403771).
> 
> Can the blkshift ever really be 31 or more ?

In theory, in the spec it is a 8 bit field, in practice, it should not be larger
that 12, because at least Linux doesn't support at all block devices that have
larger that 4K block size.

Best regards,
	Maxim Levitsky

> 
> The types here seem weird anyway -- we return an int64_t,
> but the only user of this is nvme_probe_blocksizes(),
> which uses the value only to set BlockSizes::phys and ::log,
> both of which are of type "uint32_t". That leads me to think
> that the right return type for the function is uint32_t.
> 
> PS: this is the only Coverity issue currently outstanding so
> if it's a trivial fix it might be nice to put it into rc3.
> 
> thanks
> -- PMM
>
Maxim Levitsky July 29, 2019, 3:01 p.m. UTC | #4
On Mon, 2019-07-29 at 15:25 +0200, Max Reitz wrote:
> On 29.07.19 15:16, Peter Maydell wrote:
> > On Mon, 22 Jul 2019 at 18:26, Max Reitz <mreitz@redhat.com> wrote:
> > > 
> > > From: Maxim Levitsky <mlevitsk@redhat.com>
> > > 
> > > Currently the driver hardcodes the sector size to 512,
> > > and doesn't check the underlying device. Fix that.
> > > 
> > > Also fail if underlying nvme device is formatted with metadata
> > > as this needs special support.
> > > 
> > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > Message-id: 20190716163020.13383-3-mlevitsk@redhat.com
> > > Signed-off-by: Max Reitz <mreitz@redhat.com>
> > > +static int64_t nvme_get_blocksize(BlockDriverState *bs)
> > > +{
> > > +    BDRVNVMeState *s = bs->opaque;
> > > +    assert(s->blkshift >= BDRV_SECTOR_BITS);
> > > +    return 1 << s->blkshift;
> > > +}
> > 
> > Hi -- Coverity points out here that we calculate the
> > "1 << s->blkshift" as a 32-bit shift, but then return an
> > int64_t type (CID 1403771).
> > 
> > Can the blkshift ever really be 31 or more ?
> > 
> > The types here seem weird anyway -- we return an int64_t,
> > but the only user of this is nvme_probe_blocksizes(),
> > which uses the value only to set BlockSizes::phys and ::log,
> > both of which are of type "uint32_t". That leads me to think
> > that the right return type for the function is uint32_t.
> > 
> > PS: this is the only Coverity issue currently outstanding so
> > if it's a trivial fix it might be nice to put it into rc3.
> 
> Maxim, what do you think?

Fully agree with that.

> 
> How about we let nvme_identify() limit blkshift to something sane and
> then return a uint32_t here?
> 
> In theory it would be limited by page_size, and that has a maximum value
> of 2^27.  In practice, though, that limit is checked by another 32-bit
> shift...

2^27 is the maximum NVME page size, but in theory, but only in theory,
you can have blocks larger that a page size, you will just have to give the controller
more that one page, even when you read a single block.

But like I said in the other mail, Linux doesn't support at all block size > cpu page size which is almost always 4K,
thus 12 is the practical limit for the block size these days, and of course both 27 and 31 is well above this.

So I'll send a patch to fix this today or tomorrow.

Best regards,
	Maxim Levitsky

> Max
>
diff mbox series

Patch

diff --git a/block/nvme.c b/block/nvme.c
index 82fdefccd6..35ce10dc79 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -102,8 +102,11 @@  typedef struct {
     size_t doorbell_scale;
     bool write_cache_supported;
     EventNotifier irq_notifier;
+
     uint64_t nsze; /* Namespace size reported by identify command */
     int nsid;      /* The namespace id to read/write data. */
+    size_t blkshift;
+
     uint64_t max_transfer;
     bool plugged;
 
@@ -418,8 +421,9 @@  static void nvme_identify(BlockDriverState *bs, int namespace, Error **errp)
     BDRVNVMeState *s = bs->opaque;
     NvmeIdCtrl *idctrl;
     NvmeIdNs *idns;
+    NvmeLBAF *lbaf;
     uint8_t *resp;
-    int r;
+    int r, hwsect_size;
     uint64_t iova;
     NvmeCmd cmd = {
         .opcode = NVME_ADM_CMD_IDENTIFY,
@@ -466,7 +470,22 @@  static void nvme_identify(BlockDriverState *bs, int namespace, Error **errp)
     }
 
     s->nsze = le64_to_cpu(idns->nsze);
+    lbaf = &idns->lbaf[NVME_ID_NS_FLBAS_INDEX(idns->flbas)];
+
+    if (lbaf->ms) {
+        error_setg(errp, "Namespaces with metadata are not yet supported");
+        goto out;
+    }
+
+    hwsect_size = 1 << lbaf->ds;
+
+    if (hwsect_size < BDRV_SECTOR_SIZE || hwsect_size > s->page_size) {
+        error_setg(errp, "Namespace has unsupported block size (%d)",
+                hwsect_size);
+        goto out;
+    }
 
+    s->blkshift = lbaf->ds;
 out:
     qemu_vfio_dma_unmap(s->vfio, resp);
     qemu_vfree(resp);
@@ -785,8 +804,22 @@  fail:
 static int64_t nvme_getlength(BlockDriverState *bs)
 {
     BDRVNVMeState *s = bs->opaque;
+    return s->nsze << s->blkshift;
+}
 
-    return s->nsze << BDRV_SECTOR_BITS;
+static int64_t nvme_get_blocksize(BlockDriverState *bs)
+{
+    BDRVNVMeState *s = bs->opaque;
+    assert(s->blkshift >= BDRV_SECTOR_BITS);
+    return 1 << s->blkshift;
+}
+
+static int nvme_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
+{
+    int64_t blocksize = nvme_get_blocksize(bs);
+    bsz->phys = blocksize;
+    bsz->log = blocksize;
+    return 0;
 }
 
 /* Called with s->dma_map_lock */
@@ -917,13 +950,14 @@  static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
     BDRVNVMeState *s = bs->opaque;
     NVMeQueuePair *ioq = s->queues[1];
     NVMeRequest *req;
-    uint32_t cdw12 = (((bytes >> BDRV_SECTOR_BITS) - 1) & 0xFFFF) |
+
+    uint32_t cdw12 = (((bytes >> s->blkshift) - 1) & 0xFFFF) |
                        (flags & BDRV_REQ_FUA ? 1 << 30 : 0);
     NvmeCmd cmd = {
         .opcode = is_write ? NVME_CMD_WRITE : NVME_CMD_READ,
         .nsid = cpu_to_le32(s->nsid),
-        .cdw10 = cpu_to_le32((offset >> BDRV_SECTOR_BITS) & 0xFFFFFFFF),
-        .cdw11 = cpu_to_le32(((offset >> BDRV_SECTOR_BITS) >> 32) & 0xFFFFFFFF),
+        .cdw10 = cpu_to_le32((offset >> s->blkshift) & 0xFFFFFFFF),
+        .cdw11 = cpu_to_le32(((offset >> s->blkshift) >> 32) & 0xFFFFFFFF),
         .cdw12 = cpu_to_le32(cdw12),
     };
     NVMeCoData data = {
@@ -1154,6 +1188,7 @@  static BlockDriver bdrv_nvme = {
     .bdrv_file_open           = nvme_file_open,
     .bdrv_close               = nvme_close,
     .bdrv_getlength           = nvme_getlength,
+    .bdrv_probe_blocksizes    = nvme_probe_blocksizes,
 
     .bdrv_co_preadv           = nvme_co_preadv,
     .bdrv_co_pwritev          = nvme_co_pwritev,