
[v3,1/6] block/nvme: don't touch the completion entries

Message ID 20190703155944.9637-2-mlevitsk@redhat.com
State New
Series Few fixes for userspace NVME driver

Commit Message

Maxim Levitsky July 3, 2019, 3:59 p.m. UTC
Completion entries are meant to be only read by the host and written by the device.
The driver is supposed to scan the completions starting from the point where it last left off,
until it sees a completion whose phase bit has not flipped.
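
For illustration only, the scan described above can be sketched as a small stand-alone model. This is not the QEMU code: the `Demo*` types, `DEMO_QUEUE_SIZE`, and all function names are hypothetical, the "device" is simulated in software, and endianness conversion and memory barriers are left out on purpose. The host side only reads entries and tracks which phase value still marks an entry as stale.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_QUEUE_SIZE 4u /* hypothetical; tiny so that wrap-around happens */

/* Hypothetical, simplified completion entry. */
typedef struct {
    uint16_t cid;    /* command identifier, written by the device */
    uint16_t status; /* bit 0 is the Phase Tag, written by the device */
} DemoCqe;

typedef struct {
    DemoCqe entries[DEMO_QUEUE_SIZE];
    unsigned head;        /* host: next entry to inspect */
    unsigned stale_phase; /* host: phase value that means "not new yet" */
    unsigned dev_tail;    /* simulated device: next entry to write */
    unsigned dev_phase;   /* simulated device: phase written on this pass */
} DemoCq;

/* Zero the queue once, before use; the host never writes entries afterwards. */
static void demo_cq_init(DemoCq *q)
{
    assert(q != NULL);
    memset(q, 0, sizeof(*q));
    q->dev_phase = 1; /* first pass over zeroed memory uses phase 1 */
}

/* Simulated device: post one completion, flipping the phase on wrap-around.
 * The caller must not post more than DEMO_QUEUE_SIZE unread entries. */
static void demo_device_post(DemoCq *q, uint16_t cid)
{
    q->entries[q->dev_tail].cid = cid;
    q->entries[q->dev_tail].status = (uint16_t)q->dev_phase;
    q->dev_tail = (q->dev_tail + 1) % DEMO_QUEUE_SIZE;
    if (q->dev_tail == 0) {
        q->dev_phase ^= 1u;
    }
}

/* Host: consume new completions without modifying any entry.
 * Stores up to max command identifiers in cids and returns the count. */
static size_t demo_host_scan(DemoCq *q, uint16_t *cids, size_t max)
{
    size_t n = 0;
    while (n < max) {
        const DemoCqe *c = &q->entries[q->head];
        if ((c->status & 0x1u) == q->stale_phase) {
            break; /* phase has not flipped: nothing new here */
        }
        cids[n++] = c->cid;
        q->head = (q->head + 1) % DEMO_QUEUE_SIZE;
        if (q->head == 0) {
            q->stale_phase ^= 1u; /* entries from the finished pass are now stale */
        }
    }
    return n;
}
```

In this model, entries left over from the previous pass carry the stale phase value, so the scan stops on them without the host having to clear the command identifier or flip the phase bit itself.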


Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 block/nvme.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

Comments

Max Reitz July 5, 2019, 11:03 a.m. UTC | #1
On 03.07.19 17:59, Maxim Levitsky wrote:
> Completion entries are meant to be only read by the host and written by the device.
> The driver is supposed to scan the completions from the last point where it left,
> and until it sees a completion with non flipped phase bit.

(Disclaimer: This is the first time I read the nvme driver, or really
something in the nvme spec.)

Well, no, completion entries are also meant to be initialized by the
host.  To me it looks like this is the place where that happens:
Everything that has been processed by the device is immediately being
re-initialized.

Maybe we shouldn’t do that here but in nvme_submit_command().  But
currently we don’t, and I don’t see any other place where we currently
initialize the CQ entries.

Max

> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  block/nvme.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 73ed5fa75f..6d4e7f3d83 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -315,7 +315,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
>      while (q->inflight) {
>          int16_t cid;
>          c = (NvmeCqe *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
> -        if (!c->cid || (le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
> +        if ((le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
>              break;
>          }
>          q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
> @@ -339,10 +339,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
>          qemu_mutex_unlock(&q->lock);
>          req.cb(req.opaque, nvme_translate_error(c));
>          qemu_mutex_lock(&q->lock);
> -        c->cid = cpu_to_le16(0);
>          q->inflight--;
> -        /* Flip Phase Tag bit. */
> -        c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1);
>          progress = true;
>      }
>      if (progress) {
>
Maxim Levitsky July 7, 2019, 8:43 a.m. UTC | #2
On Fri, 2019-07-05 at 13:03 +0200, Max Reitz wrote:
> On 03.07.19 17:59, Maxim Levitsky wrote:
> > Completion entries are meant to be only read by the host and written by the device.
> > The driver is supposed to scan the completions from the last point where it left,
> > and until it sees a completion with non flipped phase bit.
> 
> (Disclaimer: This is the first time I read the nvme driver, or really
> something in the nvme spec.)
> 
> Well, no, completion entries are also meant to be initialized by the
> host.  To me it looks like this is the place where that happens:
> Everything that has been processed by the device is immediately being
> re-initialized.
> 
> Maybe we shouldn’t do that here but in nvme_submit_command().  But
> currently we don’t, and I don’t see any other place where we currently
> initialize the CQ entries.

Hi!
I couldn't find any place in the spec that says that completion entries should be initialized.
It is probably wise to initialize that area to 0 on driver initialization, but nothing beyond that.
In particular that is what the kernel nvme driver does. 
Other than allocating zeroed memory (and I am not even sure it does that),
it doesn't write to the completion entries.

Thanks for the very very good review btw. I will go over all patches now and fix things.

Best regards,
	Maxim Levitsky

> 
> Max
> 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  block/nvme.c | 5 +----
> >  1 file changed, 1 insertion(+), 4 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 73ed5fa75f..6d4e7f3d83 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> > @@ -315,7 +315,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
> >      while (q->inflight) {
> >          int16_t cid;
> >          c = (NvmeCqe *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
> > -        if (!c->cid || (le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
> > +        if ((le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
> >              break;
> >          }
> >          q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
> > @@ -339,10 +339,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
> >          qemu_mutex_unlock(&q->lock);
> >          req.cb(req.opaque, nvme_translate_error(c));
> >          qemu_mutex_lock(&q->lock);
> > -        c->cid = cpu_to_le16(0);
> >          q->inflight--;
> > -        /* Flip Phase Tag bit. */
> > -        c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1);
> >          progress = true;
> >      }
> >      if (progress) {
> > 
> 
>
Max Reitz July 8, 2019, 12:23 p.m. UTC | #3
On 07.07.19 10:43, Maxim Levitsky wrote:
> On Fri, 2019-07-05 at 13:03 +0200, Max Reitz wrote:
>> On 03.07.19 17:59, Maxim Levitsky wrote:
>>> Completion entries are meant to be only read by the host and written by the device.
>>> The driver is supposed to scan the completions from the last point where it left,
>>> and until it sees a completion with non flipped phase bit.
>>
>> (Disclaimer: This is the first time I read the nvme driver, or really
>> something in the nvme spec.)
>>
>> Well, no, completion entries are also meant to be initialized by the
>> host.  To me it looks like this is the place where that happens:
>> Everything that has been processed by the device is immediately being
>> re-initialized.
>>
>> Maybe we shouldn’t do that here but in nvme_submit_command().  But
>> currently we don’t, and I don’t see any other place where we currently
>> initialize the CQ entries.
> 
> Hi!
> I couldn't find any place in the spec that says that completion entries should be initialized.
> It is probably wise to initialize that area to 0 on driver initialization, but nothing beyond that.

Ah, you’re right, I misread.  I didn’t pay as much attention to the
“...prior to setting CC.EN to ‘1’” as I should have.  Yep, and that is
done in nvme_init_queue().

OK, I cease my wrongful protest:

Reviewed-by: Max Reitz <mreitz@redhat.com>

> In particular that is what the kernel nvme driver does. 
> Other than allocating zeroed memory (and I am not even sure it does that),
> it doesn't write to the completion entries.
Maxim Levitsky July 8, 2019, 12:51 p.m. UTC | #4
On Mon, 2019-07-08 at 14:23 +0200, Max Reitz wrote:
> On 07.07.19 10:43, Maxim Levitsky wrote:
> > On Fri, 2019-07-05 at 13:03 +0200, Max Reitz wrote:
> > > On 03.07.19 17:59, Maxim Levitsky wrote:
> > > > Completion entries are meant to be only read by the host and written by the device.
> > > > The driver is supposed to scan the completions from the last point where it left,
> > > > and until it sees a completion with non flipped phase bit.
> > > 
> > > (Disclaimer: This is the first time I read the nvme driver, or really
> > > something in the nvme spec.)
> > > 
> > > Well, no, completion entries are also meant to be initialized by the
> > > host.  To me it looks like this is the place where that happens:
> > > Everything that has been processed by the device is immediately being
> > > re-initialized.
> > > 
> > > Maybe we shouldn’t do that here but in nvme_submit_command().  But
> > > currently we don’t, and I don’t see any other place where we currently
> > > initialize the CQ entries.
> > 
> > Hi!
> > I couldn't find any place in the spec that says that completion entries should be initialized.
> > It is probably wise to initialize that area to 0 on driver initialization, but nothing beyond that.
> 
> Ah, you’re right, I misread.  I didn’t pay as much attention to the
> “...prior to setting CC.EN to ‘1’” as I should have.  Yep, and that is
> done in nvme_init_queue().
> 
> OK, I cease my wrongful protest:
> 
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> 
> > 

Thank you very much!
BTW, the qemu driver does allocate zeroed memory (in nvme_init_queue:
"q->queue = qemu_try_blockalign0(bs, bytes);").

Thus I think this is all that is needed in that regard.
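
As a rough stand-alone sketch of that one-time initialization (not the QEMU helper itself), the following uses C11 `aligned_alloc` plus `memset` to get an aligned, zeroed buffer. The names `demo_alloc_zeroed_aligned` and `demo_is_zeroed` are hypothetical, and the sketch assumes the requested size is a multiple of the alignment, as `aligned_alloc` requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate `bytes` of zeroed memory aligned to `align`.
 * Assumes `align` is a power of two and `bytes` is a multiple of `align`.
 * Returns NULL on failure; release the buffer with free(). */
static void *demo_alloc_zeroed_aligned(size_t align, size_t bytes)
{
    void *p = aligned_alloc(align, bytes);
    if (p == NULL) {
        return NULL;
    }
    memset(p, 0, bytes); /* every entry, including its phase bit, starts at 0 */
    return p;
}

/* Hypothetical check: returns 1 if all `bytes` bytes of `buf` are zero. */
static int demo_is_zeroed(const void *buf, size_t bytes)
{
    const uint8_t *b = buf;
    assert(buf != NULL);
    for (size_t i = 0; i < bytes; i++) {
        if (b[i] != 0) {
            return 0;
        }
    }
    return 1;
}
```

With the queue zeroed once like this, a phase-bit scan that treats 0 as "not yet written" finds nothing new until the device posts its first completion.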

Note that this patch doesn't fix any real bug I know of;
it just brings the driver in line with the spec.
Also, racing with the hardware can in theory cause various memory ordering bugs,
although in this case the writes go to
entries that the controller probably won't touch, but still.

TL;DR - no need for code that does nothing and might cause issues.

Do you want me to resend the series or shall I wait till we decide
what to do with the image creation support? I finished addressing all the
review comments long ago, I just didn't want to resend the series.
Or shall I drop that patch and resend?

From the urgency standpoint the only patch that really should
be merged ASAP is the one that adds support for block sizes,
because without it, the whole thing crashes and burns on 4K
nvme drives.

Best regards,
	Maxim Levitsky
Max Reitz July 8, 2019, 1 p.m. UTC | #5
On 08.07.19 14:51, Maxim Levitsky wrote:
> On Mon, 2019-07-08 at 14:23 +0200, Max Reitz wrote:
>> On 07.07.19 10:43, Maxim Levitsky wrote:
>>> On Fri, 2019-07-05 at 13:03 +0200, Max Reitz wrote:
>>>> On 03.07.19 17:59, Maxim Levitsky wrote:
>>>>> Completion entries are meant to be only read by the host and written by the device.
>>>>> The driver is supposed to scan the completions from the last point where it left,
>>>>> and until it sees a completion with non flipped phase bit.
>>>>
>>>> (Disclaimer: This is the first time I read the nvme driver, or really
>>>> something in the nvme spec.)
>>>>
>>>> Well, no, completion entries are also meant to be initialized by the
>>>> host.  To me it looks like this is the place where that happens:
>>>> Everything that has been processed by the device is immediately being
>>>> re-initialized.
>>>>
>>>> Maybe we shouldn’t do that here but in nvme_submit_command().  But
>>>> currently we don’t, and I don’t see any other place where we currently
>>>> initialize the CQ entries.
>>>
>>> Hi!
>>> I couldn't find any place in the spec that says that completion entries should be initialized.
>>> It is probably wise to initialize that area to 0 on driver initialization, but nothing beyond that.
>>
>> Ah, you’re right, I misread.  I didn’t pay as much attention to the
>> “...prior to setting CC.EN to ‘1’” as I should have.  Yep, and that is
>> done in nvme_init_queue().
>>
>> OK, I cease my wrongful protest:
>>
>> Reviewed-by: Max Reitz <mreitz@redhat.com>
>>
>>>
> 
> Thank you very much!
> BTW, the qemu driver does allocate zeroed memory (in nvme_init_queue, 
> "q->queue = qemu_try_blockalign0(bs, bytes);"

Yes, that’s what I was referring to above. :-)

> Thus I think this is all that is needed in that regard.
> 
> Note that this patch doesn't fix any real bug I know of, 
> but just makes the thing right in regard to the spec.
> Also racing with hardware in theory can have various memory ordering bugs,
> although in this case the writes are done in 
> entries which controller probably won't touch, but still.
> 
> TL;DR - no need in code which does nothing and might cause issues.
> 
> Do you want me to resend the series or shall I wait till we decide
> what to do with the image creation support? I done fixing all the
> review comments long ago, just didn't want to resend the series.
> Or shall I drop that patch and resend?

I think I won’t apply the image creation patch now, so it’s probably
better to just drop it for now.

> From the urgency standpoint the only patch that really should
> be merged ASAP is the one that adds support for block sizes,
> because without it, the whole thing crashes and burns on 4K
> nvme drives.

By now we’re in softfreeze anyway, so unless write-zeroes/discard
support is important now, it’s difficult to justify taking them for 4.1.
So for me it would be best if you put patches 1 through 3 into a
for-4.1 series and move the rest to 4.2.  (I’d probably also split the
creation patch off, because I don’t think I’m going to apply it before
having experimented a bit with blockdev-create for qemu-img.)

If you think write-zeroes/discard support is important for 4.1, feel
free to include them in the for-4.1 series along with an explanation as
to why it’s important.

Max
Maxim Levitsky July 8, 2019, 1:06 p.m. UTC | #6
On Mon, 2019-07-08 at 15:00 +0200, Max Reitz wrote:
> On 08.07.19 14:51, Maxim Levitsky wrote:
> > On Mon, 2019-07-08 at 14:23 +0200, Max Reitz wrote:
> > > On 07.07.19 10:43, Maxim Levitsky wrote:
> > > > On Fri, 2019-07-05 at 13:03 +0200, Max Reitz wrote:
> > > > > On 03.07.19 17:59, Maxim Levitsky wrote:
> > > > > > Completion entries are meant to be only read by the host and written by the device.
> > > > > > The driver is supposed to scan the completions from the last point where it left,
> > > > > > and until it sees a completion with non flipped phase bit.
> > > > > 
> > > > > (Disclaimer: This is the first time I read the nvme driver, or really
> > > > > something in the nvme spec.)
> > > > > 
> > > > > Well, no, completion entries are also meant to be initialized by the
> > > > > host.  To me it looks like this is the place where that happens:
> > > > > Everything that has been processed by the device is immediately being
> > > > > re-initialized.
> > > > > 
> > > > > Maybe we shouldn’t do that here but in nvme_submit_command().  But
> > > > > currently we don’t, and I don’t see any other place where we currently
> > > > > initialize the CQ entries.
> > > > 
> > > > Hi!
> > > > I couldn't find any place in the spec that says that completion entries should be initialized.
> > > > It is probably wise to initialize that area to 0 on driver initialization, but nothing beyond that.
> > > 
> > > Ah, you’re right, I misread.  I didn’t pay as much attention to the
> > > “...prior to setting CC.EN to ‘1’” as I should have.  Yep, and that is
> > > done in nvme_init_queue().
> > > 
> > > OK, I cease my wrongful protest:
> > > 
> > > Reviewed-by: Max Reitz <mreitz@redhat.com>
> > > 
> > > > 
> > 
> > Thank you very much!
> > BTW, the qemu driver does allocate zeroed memory (in nvme_init_queue, 
> > "q->queue = qemu_try_blockalign0(bs, bytes);"
> 
> Yes, that’s what I was referring to above. :-)
> 
> > Thus I think this is all that is needed in that regard.
> > 
> > Note that this patch doesn't fix any real bug I know of, 
> > but just makes the thing right in regard to the spec.
> > Also racing with hardware in theory can have various memory ordering bugs,
> > although in this case the writes are done in 
> > entries which controller probably won't touch, but still.
> > 
> > TL;DR - no need in code which does nothing and might cause issues.
> > 
> > Do you want me to resend the series or shall I wait till we decide
> > what to do with the image creation support? I done fixing all the
> > review comments long ago, just didn't want to resend the series.
> > Or shall I drop that patch and resend?
> 
> I think I won’t apply the image creation patch now, so it’s probably
> better to just drop it for now.
> 
> > From the urgency standpoint the only patch that really should
> > be merged ASAP is the one that adds support for block sizes,
> > because without it, the whole thing crashes and burns on 4K
> > nvme drives.
> 
> By now we’re in softfreeze anyway, so unless write-zeroes/discard
> support is important now, it’s difficult to justify taking them for 4.1.
>  So for me it would be best if you put patches 1 through 3 into a
> for-4.1 series and move the rest to 4.2.  (I’d probably also split the
> creation patch off, because I don’t think I’m going to apply it before
> having experimented a bit with blockdev-create for qemu-img.)
> 
> If you think write-zeroes/discard support is important for 4.1, feel
> free to include them in the for-4.1 series along with an explanation as
> to why it’s important.

I don't think these are important either, so I will split them as you suggest.

Best regards,
	Maxim Levitsky

Patch

diff --git a/block/nvme.c b/block/nvme.c
index 73ed5fa75f..6d4e7f3d83 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -315,7 +315,7 @@  static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
     while (q->inflight) {
         int16_t cid;
         c = (NvmeCqe *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
-        if (!c->cid || (le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
+        if ((le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
             break;
         }
         q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
@@ -339,10 +339,7 @@  static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
         qemu_mutex_unlock(&q->lock);
         req.cb(req.opaque, nvme_translate_error(c));
         qemu_mutex_lock(&q->lock);
-        c->cid = cpu_to_le16(0);
         q->inflight--;
-        /* Flip Phase Tag bit. */
-        c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1);
         progress = true;
     }
     if (progress) {