[RFT,0/3] iscsi: fix NULL dereferences / races between task completion and abort

Message ID 5030E5F2.7060903@redhat.com
State New

Commit Message

Paolo Bonzini Aug. 19, 2012, 1:11 p.m. UTC
On 19/08/2012 09:55, Stefan Priebe wrote:
> Hi Paolo,
> 
> On 18.08.2012 23:49, Paolo Bonzini wrote:
>> Hi Stefan,
>>
>> this is my version of your patch.  I think the flow of the code is a
>> bit simpler (or at least matches other implementations of cancellation).
>> Can you test it on your test case?
> I'm really sorry, but your patch doesn't work at all. I'm not even able
> to start the VM; the KVM process hangs and never detaches itself.

No problem, my fault: I'm just back and haven't really picked up all
my work again, so the patch was untested.

This should fix it, though.

Paolo


     /* XXX we should pass the iovec to write16 to avoid the extra copy */
@@ -341,6 +342,7 @@ iscsi_aio_readv(BlockDriverState *bs, int64_t sector_num,
     acb->qiov     = qiov;

     acb->canceled    = 0;
+    acb->bh          = NULL;
     acb->status      = -EINPROGRESS;
     acb->read_size   = qemu_read_size;
     acb->buf         = NULL;
@@ -442,6 +444,7 @@ iscsi_aio_flush(BlockDriverState *bs,

     acb->iscsilun = iscsilun;
     acb->canceled   = 0;
+    acb->bh         = NULL;
     acb->status     = -EINPROGRESS;

     acb->task = iscsi_synchronizecache10_task(iscsi, iscsilun->lun,
@@ -494,6 +497,7 @@ iscsi_aio_discard(BlockDriverState *bs,

     acb->iscsilun = iscsilun;
     acb->canceled   = 0;
+    acb->bh         = NULL;
     acb->status     = -EINPROGRESS;

     list[0].lba = sector_qemu2lun(sector_num, iscsilun);
@@ -568,6 +572,7 @@ static BlockDriverAIOCB *iscsi_aio_ioctl(BlockDriverState *bs,

     acb->iscsilun = iscsilun;
     acb->canceled    = 0;
+    acb->bh          = NULL;
     acb->status      = -EINPROGRESS;
     acb->buf         = NULL;
     acb->ioh         = buf;
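
The point of the new "acb->bh = NULL" lines: the completion/cancellation
path uses acb->bh to decide whether a bottom half has already been
scheduled, and a freshly allocated acb otherwise contains garbage there.
A minimal sketch of the pattern (the helper name and exact shape are an
assumption for illustration, not the literal code of this series):

    /* Schedule the completion callback exactly once, via a bottom half.
     * If acb->bh held uninitialized garbage, the test below could wrongly
     * conclude a BH is already pending and the request would never
     * complete -- matching the "VM never starts" symptom above. */
    static void iscsi_schedule_bh(QEMUBHFunc *cb, IscsiAIOCB *acb)
    {
        if (acb->bh) {
            return;     /* completion already scheduled */
        }
        acb->bh = qemu_bh_new(cb, acb);
        qemu_bh_schedule(acb->bh);
    }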

Comments

Stefan Priebe - Profihost AG Aug. 19, 2012, 7:22 p.m. UTC | #1
On 19.08.2012 15:11, Paolo Bonzini wrote:
> No problem, my fault: I'm just back and haven't really picked up all
> my work again, so the patch was untested.
>
> This should fix it, though.

Booting works fine now. But the VM starts to hang after trying to unmap
large regions. No segfault or anything; it just stops responding.

Stefan
Paolo Bonzini Aug. 20, 2012, 7:22 a.m. UTC | #2
On 19/08/2012 21:22, Stefan Priebe - Profihost AG wrote:
> 
>> No problem, my fault: I'm just back and haven't really picked up all
>> my work again, so the patch was untested.
>>
>> This should fix it, though.
> 
> Booting works fine now. But the VM starts to hang after trying to unmap
> large regions. No segfault or anything; it just stops responding.

This is expected; unfortunately cancellation right now is a synchronous
operation in the block layer.  SCSI is the first big user of
cancellation, and it would indeed benefit from asynchronous cancellation.

Without these three patches, you risk corruption in case the following
happens:

    qemu                 target
  -----------------------------------
    send unmap -------->
    cancel unmap ------>
    send write -------->
       <---------------- complete write
                         <unmap just written sector>
       <---------------- complete unmap
       <---------------- cancellation done (unmap complete)
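
For reference, the cancellation implemented by this series is roughly
the following synchronous wait (a sketch: the exact field and callback
names are assumptions, not the literal patch):

    static void iscsi_aio_cancel(BlockDriverAIOCB *blockacb)
    {
        IscsiAIOCB *acb = (IscsiAIOCB *)blockacb;

        acb->canceled = 1;
        /* Ask the target to abort the in-flight task... */
        iscsi_task_mgmt_abort_task_async(acb->iscsilun->iscsi, acb->task,
                                         iscsi_abort_task_cb, acb);
        /* ...and spin the event loop until either the abort or the
         * original command completes.  The big QEMU lock is held all
         * the while, which is why the guest appears frozen. */
        while (acb->status == -EINPROGRESS) {
            qemu_aio_wait();
        }
    }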

Paolo
Stefan Priebe - Profihost AG Aug. 20, 2012, 7:34 a.m. UTC | #3
On 20.08.2012 09:22, Paolo Bonzini wrote:
> On 19/08/2012 21:22, Stefan Priebe - Profihost AG wrote:
>>
>>> No problem, my fault: I'm just back and haven't really picked up all
>>> my work again, so the patch was untested.
>>>
>>> This should fix it, though.
>>
>> Booting works fine now. But the VM starts to hang after trying to unmap
>> large regions. No segfault or anything; it just stops responding.
>
> This is expected; unfortunately cancellation right now is a synchronous
> operation in the block layer.  SCSI is the first big user of
> cancellation, and it would indeed benefit from asynchronous cancellation.
>
> Without these three patches, you risk corruption in case the following
> happens:
>
>      qemu                 target
>    -----------------------------------
>      send unmap -------->
>      cancel unmap ------>
>      send write -------->
>         <---------------- complete write
>                           <unmap just written sector>
>         <---------------- complete unmap
>         <---------------- cancellation done (unmap complete)

Hmm, OK, that makes sense. But I cannot even log in via SSH and I also see
no cancellation message in the kernel log.

So what is the right way to test these patches? With my version I can
still work via SSH.

Stefan
Paolo Bonzini Aug. 20, 2012, 8:08 a.m. UTC | #4
On 20/08/2012 09:34, Stefan Priebe - Profihost AG wrote:
>>> Booting works fine now. But the VM starts to hang after trying to unmap
>>> large regions. No segfault or anything; it just stops responding.
>>
>> This is expected; unfortunately cancellation right now is a synchronous
>> operation in the block layer.  SCSI is the first big user of
>> cancellation, and it would indeed benefit from asynchronous cancellation.
>>
>> Without these three patches, you risk corruption in case the following
>> happens:
>>
>>      qemu                 target
>>    -----------------------------------
>>      send unmap -------->
>>      cancel unmap ------>
>>      send write -------->
>>         <---------------- complete write
>>                           <unmap just written sector>
>>         <---------------- complete unmap
>>         <---------------- cancellation done (unmap complete)
> 
> Hmm, OK, that makes sense. But I cannot even log in via SSH

That's because the "big QEMU lock" is held by the thread that called
qemu_aio_cancel.

> and I also see
> no cancellation message in the kernel log.

And that's because the UNMAP actually ultimately succeeds.  You'll
probably see soft lockup messages though.

The solution here is to bump the timeout of the UNMAP command (either in
the kernel or in libiscsi, I didn't really understand who's at fault).

Paolo
Stefan Priebe - Profihost AG Aug. 20, 2012, 8:12 a.m. UTC | #5
Hi Ronnie,

On 20.08.2012 10:08, Paolo Bonzini wrote:
> That's because the "big QEMU lock" is held by the thread that called
> qemu_aio_cancel.
>
>> and I also see
>> no cancellation message in the kernel log.
>
> And that's because the UNMAP actually ultimately succeeds.  You'll
> probably see soft lockup messages though.
>
> The solution here is to bump the timeout of the UNMAP command (either in
> the kernel or in libiscsi, I didn't really understand who's at fault).

What's your suggestion / idea about that?

Greets,
Stefan
ronnie sahlberg Aug. 20, 2012, 10:36 p.m. UTC | #6
On Mon, Aug 20, 2012 at 6:12 PM, Stefan Priebe - Profihost AG
<s.priebe@profihost.ag> wrote:
> Hi Ronnie,
>
> On 20.08.2012 10:08, Paolo Bonzini wrote:
>
>> That's because the "big QEMU lock" is held by the thread that called
>> qemu_aio_cancel.
>>
>>> and I also see
>>> no cancellation message in the kernel log.
>>
>>
>> And that's because the UNMAP actually ultimately succeeds.  You'll
>> probably see soft lockup messages though.
>>
>> The solution here is to bump the timeout of the UNMAP command (either in
>> the kernel or in libiscsi, I didn't really understand who's at fault).
>
>
> What's your suggestion / idea about that?
>

There are no timeouts in libiscsi itself.
But you can probably tweak it through the guest kernel:


$ cat /sys/bus/scsi/devices/0\:0\:0\:0/timeout
30

should be the default scsi timeout for this device, and

$ cat /sys/bus/scsi/devices/0\:0\:0\:0/queue_depth
31

would be how many concurrent I/Os the guest will drive to the device.


When performing the UNMAPs, maybe set the timeout to something really
big and at the same time drop the queue depth to something really
small, so that the guest kernel will not be able to send so many
UNMAPs concurrently.
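
For example (the device address and the exact values are only
illustrative):

    $ echo 3600 > /sys/bus/scsi/devices/0\:0\:0\:0/timeout
    $ echo 1    > /sys/bus/scsi/devices/0\:0\:0\:0/queue_depth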


ronnie s
Stefan Priebe - Profihost AG Aug. 21, 2012, 7:22 a.m. UTC | #7
On 21.08.2012 00:36, ronnie sahlberg wrote:
> On Mon, Aug 20, 2012 at 6:12 PM, Stefan Priebe - Profihost AG
> <s.priebe@profihost.ag> wrote:
>> Hi Ronnie,
>>
>> On 20.08.2012 10:08, Paolo Bonzini wrote:
>>
>>> That's because the "big QEMU lock" is held by the thread that called
>>> qemu_aio_cancel.
>>>
>>>> and I also see
>>>> no cancellation message in the kernel log.
>>>
>>>
>>> And that's because the UNMAP actually ultimately succeeds.  You'll
>>> probably see soft lockup messages though.
>>>
>>> The solution here is to bump the timeout of the UNMAP command (either in
>>> the kernel or in libiscsi, I didn't really understand who's at fault).
>>
>>
>> What's your suggestion / idea about that?
>>
>
> There are no timeouts in libiscsi itself.
> But you can probably tweak it through the guest kernel :
>
>
> $ cat /sys/bus/scsi/devices/0\:0\:0\:0/timeout
> 30
>
> should be the default scsi timeout for this device, and
>
> $ cat /sys/bus/scsi/devices/0\:0\:0\:0/queue_depth
> 31
>
> would be how many concurrent i/o that the guest will drive to the device.
>
>
> When performing the UNMAPs, maybe set the timeout to something really
> big and at the same time drop the queue depth to something really
> small, so that the guest kernel will not be able to send so many
> UNMAPs concurrently.

How is this relevant to the fact that I can't even work via SSH any
longer with Paolo's patch while UNMAPs are running? With my patchset you
can still work on the machine.

Greets,
Stefan
Paolo Bonzini Aug. 21, 2012, 7:30 a.m. UTC | #8
> > When performing the UNMAPs, maybe set the timeout to something
> > really big and at the same time drop the queue depth to something
> > really small, so that the guest kernel will not be able to send so
> > many UNMAPs concurrently.
> 
> How is this relevant to the fact that I can't even work via SSH any
> longer with Paolo's patch while UNMAPs are running? With my patchset
> you can still work on the machine.

It is relevant because if you tweak the timeouts you will not hit the
cancel at all.

I would like to get a trace of the commands that are sent, so that we
can attack the problem in the guest kernel.  For example, if it's sending
UNMAPs that span 100GB or so, we could enforce a limit in the guest kernel
on the maximum number of discarded blocks per UNMAP.  But if the problem
is the _number_ of UNMAPs rather than the size, it would need a different
heuristic.
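
If the size turned out to be the culprit, the guest-side cap could be a
one-liner in the disk probe path (a sketch against the Linux block-layer
API; the 16 MiB value is an arbitrary example):

    /* Sketch: cap each discard at 32768 sectors (16 MiB with 512-byte
     * sectors); "sdkp" would be the scsi_disk being revalidated, as in
     * drivers/scsi/sd.c. */
    blk_queue_max_discard_sectors(sdkp->disk->queue, 32768);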

Paolo
Stefan Priebe - Profihost AG Nov. 6, 2012, 8:41 a.m. UTC | #9
Hello list,

I wanted to use SCSI UNMAP with rbd. The rbd documentation says you
need to set discard_granularity=512 for the device. I'm using qemu 1.2.

If I set this and send an UNMAP command, I get this kernel output:

Sense Key : Aborted Command [current]
sd 2:0:0:1: [sdb]
Add. Sense: I/O process terminated
sd 2:0:0:1: [sdb] CDB:
Write same(16): 93 08 00 00 00 00 00 00 00 00 00 7f ff ff 00 00
end_request: I/O error, dev sdb, sector 0
sd 2:0:0:1: [sdb]
Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:1: [sdb]
Sense Key : Aborted Command [current]
sd 2:0:0:1: [sdb]
Add. Sense: I/O process terminated
sd 2:0:0:1: [sdb] CDB:
Write same(16): 93 08 00 00 00 00 01 ff ff fc 00 7f ff ff 00 00
end_request: I/O error, dev sdb, sector 33554428
sd 2:0:0:1: [sdb]
Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:1: [sdb]
Sense Key : Aborted Command [current]
sd 2:0:0:1: [sdb]
Add. Sense: I/O process terminated
sd 2:0:0:1: [sdb] CDB:
Write same(16): 93 08 00 00 00 00 00 ff ff fe 00 7f ff ff 00 00
end_request: I/O error, dev sdb, sector 16777214
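
(For reference, those WRITE SAME(16) CDBs decode as: byte 0 = 0x93, the
WRITE SAME(16) opcode; byte 1 = 0x08, the UNMAP bit, so these really are
discards; bytes 2-9 = the starting LBA, i.e. 0, 33554428 and 16777214,
matching the end_request sectors above; bytes 10-13 = the block count,
0x007fffff = 8388607 blocks, so each aborted command tries to discard
roughly 4 GiB at 512-byte sectors.)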

Greets,
Stefan
Paolo Bonzini Nov. 6, 2012, 10:42 p.m. UTC | #10
> I wanted to use SCSI UNMAP with rbd. The rbd documentation says you
> need to set discard_granularity=512 for the device. I'm using qemu 1.2.
> 
> If I set this and send an UNMAP command, I get this kernel output:

The discard request is failing.  Please check why with a breakpoint on
rbd_aio_discard_wrapper or scsi_handle_rw_error for example.
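
For example (a sketch; substitute your usual command line):

    $ gdb --args qemu-system-x86_64 <your usual options>
    (gdb) break scsi_handle_rw_error
    (gdb) break rbd_aio_discard_wrapper
    (gdb) run

When one of the breakpoints hits, "backtrace" shows where the error
comes from.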

Paolo
Stefan Priebe - Profihost AG Nov. 7, 2012, 6:57 p.m. UTC | #11
On 06.11.2012 23:42, Paolo Bonzini wrote:
>
>> I wanted to use SCSI UNMAP with rbd. The rbd documentation says you
>> need to set discard_granularity=512 for the device. I'm using qemu 1.2.
>>
>> If I set this and send an UNMAP command, I get this kernel output:
>
> The discard request is failing.  Please check why with a breakpoint on
> rbd_aio_discard_wrapper or scsi_handle_rw_error for example.

I've no idea how to set breakpoints and analyse what happens... I'm no C
coder, just Perl. I can certainly modify code and compile, but I don't
know how to do that.

Greets,
Stefan
Stefan Priebe - Profihost AG Nov. 18, 2012, 10 p.m. UTC | #12
Hi Paolo,

On 06.11.2012 23:42, Paolo Bonzini wrote:
>
>> I wanted to use SCSI UNMAP with rbd. The rbd documentation says you
>> need to set discard_granularity=512 for the device. I'm using qemu 1.2.
>>
>> If I set this and send an UNMAP command, I get this kernel output:
>
> The discard request is failing.  Please check why with a breakpoint on
> rbd_aio_discard_wrapper or scsi_handle_rw_error for example.
>
> Paolo

I'm sorry, the discard requests aren't failing. QEMU / the block driver
starts to cancel a bunch of requests.

[   39.577194] sd 2:0:0:1: [sdb]
[   39.577194] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   39.577194] sd 2:0:0:1: [sdb]
[   39.577194] Sense Key : Aborted Command [current]
[   39.577194] sd 2:0:0:1: [sdb]
[   39.577194] Add. Sense: I/O process terminated
[   39.577194] sd 2:0:0:1: [sdb] CDB:
[   39.577194] Write same(16): 93 08 00 00 00 00 04 7f ff f7 00 62 00 09 00 00
[   39.577194] end_request: I/O error, dev sdb, sector 75497463

Greets,
Stefan
Paolo Bonzini Nov. 19, 2012, 8:10 a.m. UTC | #13
On 18/11/2012 23:00, Stefan Priebe wrote:
> Hi Paolo,
> 
> On 06.11.2012 23:42, Paolo Bonzini wrote:
>>
>>> I wanted to use SCSI UNMAP with rbd. The rbd documentation says you
>>> need to set discard_granularity=512 for the device. I'm using qemu 1.2.
>>>
>>> If I set this and send an UNMAP command, I get this kernel output:
>>
>> The discard request is failing.  Please check why with a breakpoint on
>> rbd_aio_discard_wrapper or scsi_handle_rw_error for example.
>>
>> Paolo
> 
> I'm sorry, the discard requests aren't failing. QEMU / the block driver
> starts to cancel a bunch of requests.

That is being done in the kernel (the guest, I think) because the UNMAPs
are taking too long.

Paolo

> [   39.577194] sd 2:0:0:1: [sdb]
> [   39.577194] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [   39.577194] sd 2:0:0:1: [sdb]
> [   39.577194] Sense Key : Aborted Command [current]
> [   39.577194] sd 2:0:0:1: [sdb]
> [   39.577194] Add. Sense: I/O process terminated
> [   39.577194] sd 2:0:0:1: [sdb] CDB:
> [   39.577194] Write same(16): 93 08 00 00 00 00 04 7f ff f7 00 62 00 09 00 00
> [   39.577194] end_request: I/O error, dev sdb, sector 75497463
Stefan Priebe - Profihost AG Nov. 19, 2012, 9:36 a.m. UTC | #14
Hi Paolo,

On 19.11.2012 09:10, Paolo Bonzini wrote:
>> I'm sorry, the discard requests aren't failing. QEMU / the block driver
>> starts to cancel a bunch of requests.
>
> That is being done in the kernel (the guest, I think) because the UNMAPs
> are taking too long.

That makes sense. RBD handles discards as buffered I/O. When I do an
mkfs.xfs on a 30GB device, I see around 900 pending discard requests;
rbd finishes them all before reporting success.

If it is correct that a 3.6.7 kernel sends that many discard requests,
the only solution I see is using unbuffered I/O for discards.

Do you know what is the correct way?

Greets,
Stefan
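
(A side note for reproducing this more comfortably: mkfs.xfs can skip
the format-time discard pass entirely:

    $ mkfs.xfs -K /dev/sdb

where -K means "do not attempt to discard blocks at mkfs time".)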
Paolo Bonzini Nov. 19, 2012, 9:54 a.m. UTC | #15
On 19/11/2012 10:36, Stefan Priebe - Profihost AG wrote:
> Hi Paolo,
> 
> On 19.11.2012 09:10, Paolo Bonzini wrote:
>>> I'm sorry, the discard requests aren't failing. QEMU / the block driver
>>> starts to cancel a bunch of requests.
>>
>> That is being done in the kernel (the guest, I think) because the UNMAPs
>> are taking too long.
> 
> That makes sense. RBD handles discards as buffered I/O. When I do an
> mkfs.xfs on a 30GB device, I see around 900 pending discard requests;
> rbd finishes them all before reporting success.
> 
> If it is correct that a 3.6.7 kernel sends that many discard requests,
> the only solution I see is using unbuffered I/O for discards.
> 
> Do you know what is the correct way?

I think the correct fix is to serialize them in the kernel.

Paolo
Stefan Priebe - Profihost AG Nov. 19, 2012, 9:59 a.m. UTC | #16
On 19.11.2012 10:54, Paolo Bonzini wrote:
> On 19/11/2012 10:36, Stefan Priebe - Profihost AG wrote:
>> Hi Paolo,
>>
>> On 19.11.2012 09:10, Paolo Bonzini wrote:
>>>> I'm sorry, the discard requests aren't failing. QEMU / the block driver
>>>> starts to cancel a bunch of requests.
>>>
>>> That is being done in the kernel (the guest, I think) because the UNMAPs
>>> are taking too long.
>>
>> That makes sense. RBD handles discards as buffered I/O. When I do an
>> mkfs.xfs on a 30GB device, I see around 900 pending discard requests;
>> rbd finishes them all before reporting success.
>>
>> If it is correct that a 3.6.7 kernel sends that many discard requests,
>> the only solution I see is using unbuffered I/O for discards.
>>
>> Do you know what is the correct way?
>
> I think the correct fix is to serialize them in the kernel.

So you mean this is not a bug in rbd or qemu, but a general bug in
the Linux kernel since they implemented discard?

Greets,
Stefan
Paolo Bonzini Nov. 19, 2012, 10:06 a.m. UTC | #17
On 19/11/2012 10:59, Stefan Priebe - Profihost AG wrote:
>>>
>>>
>>> Do you know what is the correct way?
>>
>> I think the correct fix is to serialize them in the kernel.
> 
> So you mean this is not a bug in rbd or qemu, but a general bug in
> the Linux kernel since they implemented discard?

Yes.

Paolo
Stefan Priebe - Profihost AG Nov. 19, 2012, 10:13 a.m. UTC | #18
On 19.11.2012 11:06, Paolo Bonzini wrote:
> On 19/11/2012 10:59, Stefan Priebe - Profihost AG wrote:
>>>>
>>>>
>>>> Do you know what is the correct way?
>>>
>>> I think the correct fix is to serialize them in the kernel.
>>
>> So you mean this is not a bug in rbd or qemu, but a general bug in
>> the Linux kernel since they implemented discard?
>
> Yes.

As you're known in the Linux dev community ;-) might you open a thread /
discussion on the Linux kernel mailing list, CC'ing me, regarding this
problem? I think when I open one, no one will listen ;-)

Greets
Stefan
Paolo Bonzini Nov. 19, 2012, 10:23 a.m. UTC | #19
On 19/11/2012 11:13, Stefan Priebe - Profihost AG wrote:
>>>>
>>>
>>> So you mean this is not a bug in rbd or qemu, but a general bug in
>>> the Linux kernel since they implemented discard?
>>
>> Yes.
> 
> As you're known in the Linux dev community ;-) might you open a thread /
> discussion on the Linux kernel mailing list, CC'ing me, regarding this
> problem? I think when I open one, no one will listen ;-)

Yeah, I'll try making and sending a patch, that's usually a good way to
convince those guys to listen...

Paolo
Stefan Priebe - Profihost AG Nov. 19, 2012, 10:30 a.m. UTC | #20
On 19.11.2012 11:23, Paolo Bonzini wrote:
> On 19/11/2012 11:13, Stefan Priebe - Profihost AG wrote:
>>>>>
>>>>
>>>> So you mean this is not a bug in rbd or qemu, but a general bug in
>>>> the Linux kernel since they implemented discard?
>>>
>>> Yes.
>>
>> As you're known in the Linux dev community ;-) might you open a thread /
>> discussion on the Linux kernel mailing list, CC'ing me, regarding this
>> problem? I think when I open one, no one will listen ;-)
>
> Yeah, I'll try making and sending a patch, that's usually a good way to
> convince those guys to listen...
Thanks.

But do you have any idea why it works with an iscsi / libiscsi backend? 
In that case the kernel does not cancel the I/O.

Greets,
Stefan
Paolo Bonzini Nov. 19, 2012, 10:36 a.m. UTC | #21
On 19/11/2012 11:30, Stefan Priebe - Profihost AG wrote:
> 
> 
> But do you have any idea why it works with an iscsi / libiscsi backend?
> In that case the kernel does not cancel the I/O.

It tries to, but the command ultimately succeeds and the success
response "undoes" the cancel.

http://thread.gmane.org/gmane.comp.emulators.qemu/166232/focus=166316

See also the whole thread (and I think there was a part of it off-list, too).

Paolo
Stefan Priebe - Profihost AG Nov. 19, 2012, 10:57 a.m. UTC | #22
Yeah, that's my old thread regarding iscsi and unmap, but this works
fine now since you patched qemu.

Stefan

On 19.11.2012 11:36, Paolo Bonzini wrote:
> On 19/11/2012 11:30, Stefan Priebe - Profihost AG wrote:
>>
>>
>> But do you have any idea why it works with an iscsi / libiscsi backend?
>> In that case the kernel does not cancel the I/O.
>
> It tries to, but the command ultimately succeeds and the success
> response "undoes" the cancel.
>
> http://thread.gmane.org/gmane.comp.emulators.qemu/166232/focus=166316
>
> See also the whole thread (and I think there was a part of it off-list, too).
>
> Paolo
>
Paolo Bonzini Nov. 19, 2012, 11:16 a.m. UTC | #23
On 19/11/2012 11:57, Stefan Priebe - Profihost AG wrote:
> Yeah, that's my old thread regarding iscsi and unmap, but this works
> fine now since you patched qemu.

It still causes hangs, no?  Though it works apart from that.

Paolo

> Stefan
> 
> On 19.11.2012 11:36, Paolo Bonzini wrote:
>> On 19/11/2012 11:30, Stefan Priebe - Profihost AG wrote:
>>>
>>>
>>> But do you have any idea why it works with an iscsi / libiscsi backend?
>>> In that case the kernel does not cancel the I/O.
>>
>> It tries to, but the command ultimately succeeds and the success
>> response "undoes" the cancel.
>>
>> http://thread.gmane.org/gmane.comp.emulators.qemu/166232/focus=166316
>>
>> See also the whole thread (and I think there was a part of it off-list,
>> too).
>>
>> Paolo
>>
Stefan Priebe - Profihost AG Nov. 19, 2012, 11:49 a.m. UTC | #24
On 19.11.2012 12:16, Paolo Bonzini wrote:
> On 19/11/2012 11:57, Stefan Priebe - Profihost AG wrote:
>> Yeah, that's my old thread regarding iscsi and unmap, but this works
>> fine now since you patched qemu.
>
> It still causes hangs, no?  Though it works apart from that.

iscsi/libiscsi and discards work fine since your latest patches:
1bd075f29ea6d11853475c7c42734595720c3ac6
cfb3f5064af2d2e29c976e292c9472dfe9d61e31
27cbd828c617944c0f9603763fdf4fa87e7ad923
b20909195745c34a819aed14ae996b60ab0f591f

But in rbd it starts to cancel the discard requests. Both using the same 
guest and host kernel with the same QEMU version.

Greets,
Stefan
Paolo Bonzini Nov. 19, 2012, 12:24 p.m. UTC | #25
On 19/11/2012 12:49, Stefan Priebe - Profihost AG wrote:
>>
>> It still causes hangs, no?  Though it works apart from that.
> 
> iscsi/libiscsi and discards work fine since your latest patches:
> 1bd075f29ea6d11853475c7c42734595720c3ac6
> cfb3f5064af2d2e29c976e292c9472dfe9d61e31
> 27cbd828c617944c0f9603763fdf4fa87e7ad923
> b20909195745c34a819aed14ae996b60ab0f591f

Does the console still hang though?

> But in rbd it starts to cancel the discard requests. Both using the same
> guest and host kernel with the same QEMU version.

rbd's cancellation is broken like libiscsi's was.  So the cancellation
apparently succeeds, but it is subject to the same race introduced in
commit 64e69e8 (iscsi: Fix NULL dereferences / races between task
completion and abort, 2012-08-15) for libiscsi.

It risks corruption in case the following happens:

        guest                  qemu                 rbd server
=========================================================================
        send write 1 -------->
                               send write 1 ------>
        cancel write 1 ------>
                               cancel write 1 ---->
           <------------------ cancelled
        send write 2 -------->
                               send write 2 ------>
                                   <--------------- completed write 2
           <------------------ completed write 2
                                   <--------------- completed write 1

Here, the guest would see write 2 superseding write 1, when in fact the
outcome could have been the opposite.  The right behavior is to return
only after the target says whether the cancellation was done or not.
For libiscsi, it was implemented by the commits you mention.

Paolo
Stefan Priebe - Profihost AG Nov. 19, 2012, 1:01 p.m. UTC | #26
On 19.11.2012 13:24, Paolo Bonzini wrote:
> On 19/11/2012 12:49, Stefan Priebe - Profihost AG wrote:
>>>
>>> It still causes hangs, no?  Though it works apart from that.
>>
>> iscsi/libiscsi and discards work fine since your latest patches:
>> 1bd075f29ea6d11853475c7c42734595720c3ac6
>> cfb3f5064af2d2e29c976e292c9472dfe9d61e31
>> 27cbd828c617944c0f9603763fdf4fa87e7ad923
>> b20909195745c34a819aed14ae996b60ab0f591f
>
> Does the console still hang though?

No everything is absolutely fine.

>> But in rbd it starts to cancel the discard requests. Both using the same
>> guest and host kernel with the same QEMU version.
>
> rbd's cancellation is broken like libiscsi's was.  So the cancellation
> apparently succeeds, but it is subject to the same race introduced in
> commit 64e69e8 (iscsi: Fix NULL dereferences / races between task
> completion and abort, 2012-08-15) for libiscsi.
>
> It risks corruption in case the following happens:
>
>          guest                  qemu                 rbd server
> =========================================================================
>          send write 1 -------->
>                                 send write 1 ------>
>          cancel write 1 ------>
>                                 cancel write 1 ---->
>             <------------------ cancelled
>          send write 2 -------->
>                                 send write 2 ------>
>                                     <--------------- completed write 2
>             <------------------ completed write 2
>                                     <--------------- completed write 1
>
> Here, the guest would see write 2 superseding write 1, when in fact the
> outcome could have been the opposite.  The right behavior is to return
> only after the target says whether the cancellation was done or not.
> For libiscsi, it was implemented by the commits you mention.

So is the whole bunch of changes needed for rbd? Or just 64e69e8? Or
all of the mentioned commits except 64e69e8, since 64e69e8 introduced
the problem?

Stefan
Paolo Bonzini Nov. 19, 2012, 1:06 p.m. UTC | #27
On 19/11/2012 14:01, Stefan Priebe - Profihost AG wrote:
>> The right behavior is to return
>> only after the target says whether the cancellation was done or not.
>> For libiscsi, it was implemented by the commits you mention.
> 
> So is the whole bunch of changes needed for rbd?

Something like the first three:

1bd075f29ea6d11853475c7c42734595720c3ac6
cfb3f5064af2d2e29c976e292c9472dfe9d61e31
27cbd828c617944c0f9603763fdf4fa87e7ad923

Paolo
Stefan Priebe - Profihost AG Nov. 19, 2012, 2:16 p.m. UTC | #28
Hi Paolo,
Hi Josh,

I started to port the iscsi fixes to rbd. The one thing I'm missing is:
how do I tell rbd to cancel I/O on the server side?

I grepped the librbd source code for abort / cancel but didn't find
anything.

QEMU must be able to tell rbd to cancel a specific I/O request.

Greets,
Stefan
On 19.11.2012 14:06, Paolo Bonzini wrote:
> On 19/11/2012 14:01, Stefan Priebe - Profihost AG wrote:
>>> The right behavior is to return
>>> only after the target says whether the cancellation was done or not.
>>> For libiscsi, it was implemented by the commits you mention.
>>
>> So is the whole bunch of changes needed for rbd?
>
> Something like the first three:
>
> 1bd075f29ea6d11853475c7c42734595720c3ac6
> cfb3f5064af2d2e29c976e292c9472dfe9d61e31
> 27cbd828c617944c0f9603763fdf4fa87e7ad923
>
> Paolo
>
Paolo Bonzini Nov. 19, 2012, 2:32 p.m. UTC | #29
On 19/11/2012 15:16, Stefan Priebe - Profihost AG wrote:
> Hi Paolo,
> Hi Josh,
> 
> I started to port the iscsi fixes to rbd. The one thing I'm missing is:
> how do I tell rbd to cancel I/O on the server side?
> 
> I grepped the librbd source code for abort / cancel but didn't find
> anything.
> 
> QEMU must be able to tell rbd to cancel a specific I/O request.

Just don't.  QEMU will wait for rbd to finish the operation, instead of
actually having it cancelled.
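
Concretely, that means the rbd cancel callback just blocks until the
request finishes (a sketch with assumed RBDAIOCB field names, mirroring
what the iscsi commits above do):

    static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb)
    {
        RBDAIOCB *acb = (RBDAIOCB *) blockacb;

        acb->cancelled = 1;
        /* Nothing is sent to the server; we simply block until rbd has
         * completed the request, so a later write cannot be overtaken
         * by this one (the race in the diagram above). */
        while (acb->status == -EINPROGRESS) {
            qemu_aio_wait();
        }
    }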

Paolo

Patch

diff --git a/block/iscsi.c b/block/iscsi.c
index 74ada64..0b96165 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -247,6 +247,7 @@ iscsi_aio_writev(BlockDriverState *bs, int64_t sector_num,
     acb->qiov     = qiov;

     acb->canceled   = 0;
+    acb->bh         = NULL;
     acb->status     = -EINPROGRESS;