mbox series

[V2,0/1] PCI/AER: fix use-after-free in pcie_do_fatal_recovery

Message ID 1531416823-17841-1-git-send-email-thomas.tai@oracle.com
Headers show
Series PCI/AER: fix use-after-free in pcie_do_fatal_recovery | expand

Message

Thomas Tai July 12, 2018, 5:33 p.m. UTC
Hi Bjorn,
Thank you very much to review the V1 patch. I reworked the patch as you have suggested. Your suggestion is indeed better and cleaner. Would you please have a look for me?

FYI, I did not clear the device structure in aer_isr_one_error() just to avoid over complicated the fix.

Also, thank you very much to read thru the caller of pcie_do_fatal_recovery() to ensure no one is using the device structure. Really appricated your extra effort.

Thank you,
Thomas

Comments

Bjorn Helgaas July 12, 2018, 9:51 p.m. UTC | #1
[+cc Keith for DPC question]

On Thu, Jul 12, 2018 at 11:33:42AM -0600, Thomas Tai wrote:
> Hi Bjorn,
> Thank you very much to review the V1 patch. I reworked the patch as
> you have suggested. Your suggestion is indeed better and cleaner.
> Would you please have a look for me?
> 
> FYI, I did not clear the device structure in aer_isr_one_error()
> just to avoid over complicated the fix.

I think that was a bad idea anyway.  I think the problem there is that
add_error_device() copies the pci_dev pointer without taking a
reference on it:

  aer_isr
    get_e_source              # dequeue from rpc->e_sources[]
      *e_src = rpc->e_sources[x]
    aer_isr_one_error
      pdev = rpc->rpd         # pdev = Root Port pci_dev
      find_source_device(pdev)
	find_device_iter
	  add_error_device
	    e_info->dev[i] = dev   # <-- save pci_dev pointer
      aer_process_err_devices
	handle_error_source(e_info->dev[i])
	  pcie_do_fatal_recovery

At this point, we have pci_dev pointers in e_info->dev[] that are
potentially stale.  Right now nobody uses them, but I think it's
sloppy that we still have them at all.

I think a better way to fix this would be to call pci_dev_get() in
add_error_device() when we save the pointer, and then do the
corresponding pci_dev_put() in aer_isr_one_error(), after we return
from aer_process_err_devices().

This could be done with a little helper function that iterates through
e_info->dev[i], calls pci_dev_put() for each, and clears
e_info->error_dev_num (which could then be removed from
find_source_device()).

I think this would fix the problem you're seeing, so we wouldn't need
the change in pcie_do_fatal_recovery().

However, I think we're also slightly exposed in dpc_work(), in basically
the same (possibly harmless) way.

  dpc_irq
    schedule_work(&dpc->work)
  ...
  dpc_work
    pdev = dpc->dev->port
    pcie_do_fatal_recovery(pdev)

pdev may be removed by pcie_do_fatal_recovery(), but dpc_work() is still
holding onto a pointer (which it never uses again).

The DPC driver should be holding a reference to pdev (through some black
magic I don't understand), but that would be released when pdev is removed,
and I don't know what ensures that dpc_work() runs before that release.

Bjorn
Keith Busch July 12, 2018, 9:57 p.m. UTC | #2
On Thu, Jul 12, 2018 at 04:51:51PM -0500, Bjorn Helgaas wrote:
> However, I think we're also slightly exposed in dpc_work(), in basically
> the same (possibly harmless) way.
> 
>   dpc_irq
>     schedule_work(&dpc->work)
>   ...
>   dpc_work
>     pdev = dpc->dev->port
>     pcie_do_fatal_recovery(pdev)
> 
> pdev may be removed by pcie_do_fatal_recovery(), but dpc_work() is still
> holding onto a pointer (which it never uses again).
> 
> The DPC driver should be holding a reference to pdev (through some black
> magic I don't understand), but that would be released when pdev is removed,
> and I don't know what ensures that dpc_work() runs before that release.
> 
> Bjorn

Yep, you're right on that point. There's different ways we can fix
that. The most recent one I proposed was to replace the scheduled work
with the threaded irq[1]. That should make it safe since the lifetime of
when bottom half can be executed is tied to the lifetime of the device
that registered it.

 1. https://patchwork.kernel.org/patch/10478755/
Oza Pawandeep July 15, 2018, 6:55 p.m. UTC | #3
On 2018-07-13 03:21, Bjorn Helgaas wrote:
> [+cc Keith for DPC question]
> 
> On Thu, Jul 12, 2018 at 11:33:42AM -0600, Thomas Tai wrote:
>> Hi Bjorn,
>> Thank you very much to review the V1 patch. I reworked the patch as
>> you have suggested. Your suggestion is indeed better and cleaner.
>> Would you please have a look for me?
>> 
>> FYI, I did not clear the device structure in aer_isr_one_error()
>> just to avoid over complicated the fix.
> 
> I think that was a bad idea anyway.  I think the problem there is that
> add_error_device() copies the pci_dev pointer without taking a
> reference on it:
> 
>   aer_isr
>     get_e_source              # dequeue from rpc->e_sources[]
>       *e_src = rpc->e_sources[x]
>     aer_isr_one_error
>       pdev = rpc->rpd         # pdev = Root Port pci_dev
>       find_source_device(pdev)
> 	find_device_iter
> 	  add_error_device
> 	    e_info->dev[i] = dev   # <-- save pci_dev pointer
>       aer_process_err_devices
> 	handle_error_source(e_info->dev[i])
> 	  pcie_do_fatal_recovery
> 
> At this point, we have pci_dev pointers in e_info->dev[] that are
> potentially stale.  Right now nobody uses them, but I think it's
> sloppy that we still have them at all.
> 
> I think a better way to fix this would be to call pci_dev_get() in
> add_error_device() when we save the pointer, and then do the
> corresponding pci_dev_put() in aer_isr_one_error(), after we return
> from aer_process_err_devices().
> 
> This could be done with a little helper function that iterates through
> e_info->dev[i], calls pci_dev_put() for each, and clears
> e_info->error_dev_num (which could then be removed from
> find_source_device()).
> 
> I think this would fix the problem you're seeing, so we wouldn't need
> the change in pcie_do_fatal_recovery().
> 
> However, I think we're also slightly exposed in dpc_work(), in 
> basically
> the same (possibly harmless) way.
> 
>   dpc_irq
>     schedule_work(&dpc->work)
>   ...
>   dpc_work
>     pdev = dpc->dev->port
>     pcie_do_fatal_recovery(pdev)
> 
> pdev may be removed by pcie_do_fatal_recovery(), but dpc_work() is 
> still
> holding onto a pointer (which it never uses again).
> 
> The DPC driver should be holding a reference to pdev (through some 
> black
> magic I don't understand), but that would be released when pdev is 
> removed,
> and I don't know what ensures that dpc_work() runs before that release.
> 

Is not that dpc_work() is calling pcie_do_fatal_recovery() which in-turn 
is removing pdev.
I do not see DPC is using any more reference beyond that.
So, isnt it safe to assume that no matter when the dpc_work() run, the 
pdev it reference is always intact.
this is with the fact, that both AER and DPC can not trigger 
simultaneously, and and hence no race for removal.
who else would remove that pdev other than AER and DPC ?

With Keith's patch, are not we just improving the latency with threaded 
irq instead of workqueue ?

the potential problem is another interrupt arriving while we have 
acknowledged it in dpc_reset_link() and we are about to remove the 
devices.
The acknowledgement of DPC status trigger and re-enabling interrupt can 
be done in
dpc_work() or threaded_handler() after we are done with 
pcie_do_fatal_recovery()

dpc_reset_link()
{
         following can be moved after pcie_do_fatal_recovery() ??
	pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
			      PCI_EXP_DPC_STATUS_TRIGGER);

	pci_read_config_word(pdev, cap + PCI_EXP_DPC_CTL, &ctl);
	pci_write_config_word(pdev, cap + PCI_EXP_DPC_CTL,
			      ctl | PCI_EXP_DPC_CTL_INT_EN);
}

and probably following also one also can be done after recovery.

pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
			      PCI_EXP_DPC_STATUS_INTERRUPT);

I am not sure if this was the concern or I did talk about something else 
all together !

> Bjorn
Keith Busch July 16, 2018, 2 p.m. UTC | #4
On Mon, Jul 16, 2018 at 12:25:07AM +0530, poza@codeaurora.org wrote:
> On 2018-07-13 03:21, Bjorn Helgaas wrote:
> > On Thu, Jul 12, 2018 at 11:33:42AM -0600, Thomas Tai wrote:
> > > Hi Bjorn,
> > > Thank you very much to review the V1 patch. I reworked the patch as
> > > you have suggested. Your suggestion is indeed better and cleaner.
> > > Would you please have a look for me?
> > > 
> > > FYI, I did not clear the device structure in aer_isr_one_error()
> > > just to avoid over complicated the fix.
> > 
> > I think that was a bad idea anyway.  I think the problem there is that
> > add_error_device() copies the pci_dev pointer without taking a
> > reference on it:
> > 
> >   aer_isr
> >     get_e_source              # dequeue from rpc->e_sources[]
> >       *e_src = rpc->e_sources[x]
> >     aer_isr_one_error
> >       pdev = rpc->rpd         # pdev = Root Port pci_dev
> >       find_source_device(pdev)
> > 	find_device_iter
> > 	  add_error_device
> > 	    e_info->dev[i] = dev   # <-- save pci_dev pointer
> >       aer_process_err_devices
> > 	handle_error_source(e_info->dev[i])
> > 	  pcie_do_fatal_recovery
> > 
> > At this point, we have pci_dev pointers in e_info->dev[] that are
> > potentially stale.  Right now nobody uses them, but I think it's
> > sloppy that we still have them at all.
> > 
> > I think a better way to fix this would be to call pci_dev_get() in
> > add_error_device() when we save the pointer, and then do the
> > corresponding pci_dev_put() in aer_isr_one_error(), after we return
> > from aer_process_err_devices().
> > 
> > This could be done with a little helper function that iterates through
> > e_info->dev[i], calls pci_dev_put() for each, and clears
> > e_info->error_dev_num (which could then be removed from
> > find_source_device()).
> > 
> > I think this would fix the problem you're seeing, so we wouldn't need
> > the change in pcie_do_fatal_recovery().
> > 
> > However, I think we're also slightly exposed in dpc_work(), in basically
> > the same (possibly harmless) way.
> > 
> >   dpc_irq
> >     schedule_work(&dpc->work)
> >   ...
> >   dpc_work
> >     pdev = dpc->dev->port
> >     pcie_do_fatal_recovery(pdev)
> > 
> > pdev may be removed by pcie_do_fatal_recovery(), but dpc_work() is still
> > holding onto a pointer (which it never uses again).
> > 
> > The DPC driver should be holding a reference to pdev (through some black
> > magic I don't understand), but that would be released when pdev is
> > removed,
> > and I don't know what ensures that dpc_work() runs before that release.
> > 
> 
> Is not that dpc_work() is calling pcie_do_fatal_recovery() which in-turn is
> removing pdev.
> I do not see DPC is using any more reference beyond that.
> So, isnt it safe to assume that no matter when the dpc_work() run, the pdev
> it reference is always intact.
> this is with the fact, that both AER and DPC can not trigger simultaneously,
> and and hence no race for removal.
> who else would remove that pdev other than AER and DPC ?

It's not a problem for the devices below the contained port. It's more
about the port itself, which may happen if you hot-remove the switch
that has DPC services: nothing coordinates that switch's removal with the
dpc_work so the dpc work could be referencing a pdev that pciehp removed
earlier, for example. Removing switches currently is not very common as
far as I know, but there are definitely users that want that ability.

> With Keith's patch, are not we just improving the latency with threaded irq
> instead of workqueue ?

It's a little more than that. The irq subsystem will sync with the
threaded irq before allowing teardown of the device to complete, so that
should ensure we don't have a use-after-free in the dpc bottom half.
 
> the potential problem is another interrupt arriving while we have
> acknowledged it in dpc_reset_link() and we are about to remove the devices.
> The acknowledgement of DPC status trigger and re-enabling interrupt can be
> done in
> dpc_work() or threaded_handler() after we are done with
> pcie_do_fatal_recovery()
> 
> dpc_reset_link()
> {
>         following can be moved after pcie_do_fatal_recovery() ??
> 	pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
> 			      PCI_EXP_DPC_STATUS_TRIGGER);
> 
> 	pci_read_config_word(pdev, cap + PCI_EXP_DPC_CTL, &ctl);
> 	pci_write_config_word(pdev, cap + PCI_EXP_DPC_CTL,
> 			      ctl | PCI_EXP_DPC_CTL_INT_EN);
> }
> 
> and probably following also one also can be done after recovery.
> 
> pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
> 			      PCI_EXP_DPC_STATUS_INTERRUPT);
> 
> I am not sure if this was the concern or I did talk about something else all
> together !
Thomas Tai July 16, 2018, 2:06 p.m. UTC | #5
On 07/12/2018 05:57 PM, Keith Busch wrote:
> On Thu, Jul 12, 2018 at 04:51:51PM -0500, Bjorn Helgaas wrote:
>> However, I think we're also slightly exposed in dpc_work(), in basically
>> the same (possibly harmless) way.
>>
>>    dpc_irq
>>      schedule_work(&dpc->work)
>>    ...
>>    dpc_work
>>      pdev = dpc->dev->port
>>      pcie_do_fatal_recovery(pdev)
>>
>> pdev may be removed by pcie_do_fatal_recovery(), but dpc_work() is still
>> holding onto a pointer (which it never uses again).
>>
>> The DPC driver should be holding a reference to pdev (through some black
>> magic I don't understand), but that would be released when pdev is removed,
>> and I don't know what ensures that dpc_work() runs before that release.
>>
>> Bjorn
> 
> Yep, you're right on that point. There's different ways we can fix
> that. The most recent one I proposed was to replace the scheduled work
> with the threaded irq[1]. That should make it safe since the lifetime of
> when bottom half can be executed is tied to the lifetime of the device
> that registered it.
> 
>   1. https://patchwork.kernel.org/patch/10478755/

Hi Bjorn, Keith and Poza,
I like the idea of using threaded irq if it can hold the device until 
the bottom half finished. It makes the AER and DPC driver codes more 
consistent. One problem I hit when changing the AER to threaded irq is 
that the error injection module aer_inject won't work anymore because it 
call the aer_irq directly and didn't go thru the threaded_irq path. I 
would have to fix that also. Will keep you posted.

Cheers,
Thomas

>
Thomas Tai July 16, 2018, 2:17 p.m. UTC | #6
On 07/15/2018 02:55 PM, poza@codeaurora.org wrote:
> On 2018-07-13 03:21, Bjorn Helgaas wrote:
>> [+cc Keith for DPC question]
>>
>> On Thu, Jul 12, 2018 at 11:33:42AM -0600, Thomas Tai wrote:
>>> Hi Bjorn,
>>> Thank you very much to review the V1 patch. I reworked the patch as
>>> you have suggested. Your suggestion is indeed better and cleaner.
>>> Would you please have a look for me?
>>>
>>> FYI, I did not clear the device structure in aer_isr_one_error()
>>> just to avoid over complicated the fix.
>>
>> I think that was a bad idea anyway.  I think the problem there is that
>> add_error_device() copies the pci_dev pointer without taking a
>> reference on it:
>>
>>   aer_isr
>>     get_e_source              # dequeue from rpc->e_sources[]
>>       *e_src = rpc->e_sources[x]
>>     aer_isr_one_error
>>       pdev = rpc->rpd         # pdev = Root Port pci_dev
>>       find_source_device(pdev)
>>     find_device_iter
>>       add_error_device
>>         e_info->dev[i] = dev   # <-- save pci_dev pointer
>>       aer_process_err_devices
>>     handle_error_source(e_info->dev[i])
>>       pcie_do_fatal_recovery
>>
>> At this point, we have pci_dev pointers in e_info->dev[] that are
>> potentially stale.  Right now nobody uses them, but I think it's
>> sloppy that we still have them at all.
>>
>> I think a better way to fix this would be to call pci_dev_get() in
>> add_error_device() when we save the pointer, and then do the
>> corresponding pci_dev_put() in aer_isr_one_error(), after we return
>> from aer_process_err_devices().
>>
>> This could be done with a little helper function that iterates through
>> e_info->dev[i], calls pci_dev_put() for each, and clears
>> e_info->error_dev_num (which could then be removed from
>> find_source_device()).
>>
>> I think this would fix the problem you're seeing, so we wouldn't need
>> the change in pcie_do_fatal_recovery().
>>
>> However, I think we're also slightly exposed in dpc_work(), in basically
>> the same (possibly harmless) way.
>>
>>   dpc_irq
>>     schedule_work(&dpc->work)
>>   ...
>>   dpc_work
>>     pdev = dpc->dev->port
>>     pcie_do_fatal_recovery(pdev)
>>
>> pdev may be removed by pcie_do_fatal_recovery(), but dpc_work() is still
>> holding onto a pointer (which it never uses again).
>>
>> The DPC driver should be holding a reference to pdev (through some black
>> magic I don't understand), but that would be released when pdev is 
>> removed,
>> and I don't know what ensures that dpc_work() runs before that release.
>>
> 
> Is not that dpc_work() is calling pcie_do_fatal_recovery() which in-turn 
> is removing pdev.
> I do not see DPC is using any more reference beyond that.

Agree.

> So, isnt it safe to assume that no matter when the dpc_work() run, the 
> pdev it reference is always intact.
> this is with the fact, that both AER and DPC can not trigger 
> simultaneously, and and hence no race for removal.
> who else would remove that pdev other than AER and DPC ?

Looks like no one is removing it other than the pcie_do_fatal_recovery(pdev)

> 
> With Keith's patch, are not we just improving the latency with threaded 
> irq instead of workqueue ?

I think the most important is that the threaded_irq hold the device 
until it finished the bottom half. Although I can't see where in the 
threaded_irq code increase the device reference count. I may be missing 
some info here, either way I am trying out the threaded_irq to see if it 
fixed the issue or not.

> 
> the potential problem is another interrupt arriving while we have 
> acknowledged it in dpc_reset_link() and we are about to remove the devices.
> The acknowledgement of DPC status trigger and re-enabling interrupt can 
> be done in
> dpc_work() or threaded_handler() after we are done with 
> pcie_do_fatal_recovery()

Good point. the pcie_do_fatal_recovery() use 
pci_stop_and_remove_bus_device() to remove the device, isn't it will 
stop the device from sending out anymore interrupts? Once we remove the 
device, we probably no need to reset the device.

-Thomas

> 
> dpc_reset_link()
> {
>          following can be moved after pcie_do_fatal_recovery() ??
>      pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
>                    PCI_EXP_DPC_STATUS_TRIGGER);
> 
>      pci_read_config_word(pdev, cap + PCI_EXP_DPC_CTL, &ctl);
>      pci_write_config_word(pdev, cap + PCI_EXP_DPC_CTL,
>                    ctl | PCI_EXP_DPC_CTL_INT_EN);
> }
> 
> and probably following also one also can be done after recovery.
> 
> pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
>                    PCI_EXP_DPC_STATUS_INTERRUPT);
> 
> I am not sure if this was the concern or I did talk about something else 
> all together !
> 
>> Bjorn
Bjorn Helgaas July 18, 2018, 9:16 p.m. UTC | #7
On Thu, Jul 12, 2018 at 03:57:01PM -0600, Keith Busch wrote:
> On Thu, Jul 12, 2018 at 04:51:51PM -0500, Bjorn Helgaas wrote:
> > However, I think we're also slightly exposed in dpc_work(), in basically
> > the same (possibly harmless) way.
> > 
> >   dpc_irq
> >     schedule_work(&dpc->work)
> >   ...
> >   dpc_work
> >     pdev = dpc->dev->port
> >     pcie_do_fatal_recovery(pdev)
> > 
> > pdev may be removed by pcie_do_fatal_recovery(), but dpc_work() is still
> > holding onto a pointer (which it never uses again).
> > 
> > The DPC driver should be holding a reference to pdev (through some black
> > magic I don't understand), but that would be released when pdev is removed,
> > and I don't know what ensures that dpc_work() runs before that release.
> > 
> > Bjorn
> 
> Yep, you're right on that point. There's different ways we can fix
> that. The most recent one I proposed was to replace the scheduled work
> with the threaded irq[1]. That should make it safe since the lifetime of
> when bottom half can be executed is tied to the lifetime of the device
> that registered it.
> 
>  1. https://patchwork.kernel.org/patch/10478755/

This one is headed for v4.19 (it's on pci/dpc right now).  I don't
think anybody will trip over this with DPC, so I don't think we need
to get this in v4.18 (let me know if you think otherwise).

But of course it doesn't fix the crash Thomas is seeing, because
that's in the AER path, not the DPC path.  I think you're working on
an AER fix involving threaded IRQs, right, Thomas?  Just want to make
sure I haven't missed something.

Bjorn
Thomas Tai July 18, 2018, 9:32 p.m. UTC | #8
Hi Bjorn, Keith,
Please have a look at the below inline comment.

On 07/12/2018 05:51 PM, Bjorn Helgaas wrote:
> [+cc Keith for DPC question]
> 
> On Thu, Jul 12, 2018 at 11:33:42AM -0600, Thomas Tai wrote:
>> Hi Bjorn,
>> Thank you very much to review the V1 patch. I reworked the patch as
>> you have suggested. Your suggestion is indeed better and cleaner.
>> Would you please have a look for me?
>>
>> FYI, I did not clear the device structure in aer_isr_one_error()
>> just to avoid over complicated the fix.
> 
> I think that was a bad idea anyway.  I think the problem there is that
> add_error_device() copies the pci_dev pointer without taking a
> reference on it:
> 
>    aer_isr
>      get_e_source              # dequeue from rpc->e_sources[]
>        *e_src = rpc->e_sources[x]
>      aer_isr_one_error
>        pdev = rpc->rpd         # pdev = Root Port pci_dev
>        find_source_device(pdev)
> 	find_device_iter
> 	  add_error_device
> 	    e_info->dev[i] = dev   # <-- save pci_dev pointer
>        aer_process_err_devices
> 	handle_error_source(e_info->dev[i])
> 	  pcie_do_fatal_recovery
> 
> At this point, we have pci_dev pointers in e_info->dev[] that are
> potentially stale.  Right now nobody uses them, but I think it's
> sloppy that we still have them at all.
> 
> I think a better way to fix this would be to call pci_dev_get() in
> add_error_device() when we save the pointer, and then do the
> corresponding pci_dev_put() in aer_isr_one_error(), after we return
> from aer_process_err_devices().
> 
> This could be done with a little helper function that iterates through
> e_info->dev[i], calls pci_dev_put() for each, and clears
> e_info->error_dev_num (which could then be removed from
> find_source_device()).

I tried the request_threaded_irq() method, hoping that it will hold on 
to the device, unfortunately I am still seeing the use-after-free issue. 
I guess the next step is to implement Bjorn's suggestion. If we fix the 
issue in the add_error_device(), that would require DPC to hold on to 
the device. Would Keith has any concern about it?

Thank you,
Thomas

> 
> I think this would fix the problem you're seeing, so we wouldn't need
> the change in pcie_do_fatal_recovery().
> 
> However, I think we're also slightly exposed in dpc_work(), in basically
> the same (possibly harmless) way.
> 
>    dpc_irq
>      schedule_work(&dpc->work)
>    ...
>    dpc_work
>      pdev = dpc->dev->port
>      pcie_do_fatal_recovery(pdev)
> 
> pdev may be removed by pcie_do_fatal_recovery(), but dpc_work() is still
> holding onto a pointer (which it never uses again).
> 
> The DPC driver should be holding a reference to pdev (through some black
> magic I don't understand), but that would be released when pdev is removed,
> and I don't know what ensures that dpc_work() runs before that release.
> 
> Bjorn
>
Thomas Tai July 18, 2018, 9:43 p.m. UTC | #9
On 07/18/2018 05:16 PM, Bjorn Helgaas wrote:
> On Thu, Jul 12, 2018 at 03:57:01PM -0600, Keith Busch wrote:
>> On Thu, Jul 12, 2018 at 04:51:51PM -0500, Bjorn Helgaas wrote:
>>> However, I think we're also slightly exposed in dpc_work(), in basically
>>> the same (possibly harmless) way.
>>>
>>>    dpc_irq
>>>      schedule_work(&dpc->work)
>>>    ...
>>>    dpc_work
>>>      pdev = dpc->dev->port
>>>      pcie_do_fatal_recovery(pdev)
>>>
>>> pdev may be removed by pcie_do_fatal_recovery(), but dpc_work() is still
>>> holding onto a pointer (which it never uses again).
>>>
>>> The DPC driver should be holding a reference to pdev (through some black
>>> magic I don't understand), but that would be released when pdev is removed,
>>> and I don't know what ensures that dpc_work() runs before that release.
>>>
>>> Bjorn
>>
>> Yep, you're right on that point. There's different ways we can fix
>> that. The most recent one I proposed was to replace the scheduled work
>> with the threaded irq[1]. That should make it safe since the lifetime of
>> when bottom half can be executed is tied to the lifetime of the device
>> that registered it.
>>
>>   1. https://patchwork.kernel.org/patch/10478755/
> 
> This one is headed for v4.19 (it's on pci/dpc right now).  I don't
> think anybody will trip over this with DPC, so I don't think we need
> to get this in v4.18 (let me know if you think otherwise).
> 
> But of course it doesn't fix the crash Thomas is seeing, because
> that's in the AER path, not the DPC path.  I think you're working on
> an AER fix involving threaded IRQs, right, Thomas?  Just want to make
> sure I haven't missed something.

Yes, Bjorn. I am working on the aer path. I followed Keith's patch and 
use threaded_irq in the aer. Even with the threaded_irq, the device got 
freed in the pcie_do_fatal_receovery(). So I am suggesting to use your 
idea, that is, have a helper functions to get/put the device.


> 
> Bjorn
>
Bjorn Helgaas July 18, 2018, 9:57 p.m. UTC | #10
On Wed, Jul 18, 2018 at 05:43:03PM -0400, Thomas Tai wrote:
> On 07/18/2018 05:16 PM, Bjorn Helgaas wrote:
> > On Thu, Jul 12, 2018 at 03:57:01PM -0600, Keith Busch wrote:
> > > On Thu, Jul 12, 2018 at 04:51:51PM -0500, Bjorn Helgaas wrote:
> > > > However, I think we're also slightly exposed in dpc_work(), in basically
> > > > the same (possibly harmless) way.
> > > > 
> > > >    dpc_irq
> > > >      schedule_work(&dpc->work)
> > > >    ...
> > > >    dpc_work
> > > >      pdev = dpc->dev->port
> > > >      pcie_do_fatal_recovery(pdev)
> > > > 
> > > > pdev may be removed by pcie_do_fatal_recovery(), but dpc_work() is still
> > > > holding onto a pointer (which it never uses again).
> > > > 
> > > > The DPC driver should be holding a reference to pdev (through some black
> > > > magic I don't understand), but that would be released when pdev is removed,
> > > > and I don't know what ensures that dpc_work() runs before that release.
> > > > 
> > > > Bjorn
> > > 
> > > Yep, you're right on that point. There's different ways we can fix
> > > that. The most recent one I proposed was to replace the scheduled work
> > > with the threaded irq[1]. That should make it safe since the lifetime of
> > > when bottom half can be executed is tied to the lifetime of the device
> > > that registered it.
> > > 
> > >   1. https://patchwork.kernel.org/patch/10478755/
> > 
> > This one is headed for v4.19 (it's on pci/dpc right now).  I don't
> > think anybody will trip over this with DPC, so I don't think we need
> > to get this in v4.18 (let me know if you think otherwise).
> > 
> > But of course it doesn't fix the crash Thomas is seeing, because
> > that's in the AER path, not the DPC path.  I think you're working on
> > an AER fix involving threaded IRQs, right, Thomas?  Just want to make
> > sure I haven't missed something.
> 
> Yes, Bjorn. I am working on the aer path. I followed Keith's patch and use
> threaded_irq in the aer. Even with the threaded_irq, the device got freed in
> the pcie_do_fatal_receovery(). So I am suggesting to use your idea, that is,
> have a helper functions to get/put the device.

Hmmm, the fact that it happens even with threaded IRQs is concerning
because it suggests that we don't understand the threaded IRQ
synchronization completely (I know for sure *I* don't!), and Keith's patch
above may not solve the problem even for DPC.