diff mbox

ata: disable port while unloading ATA controller driver

Message ID 20161127231856.11466-1-vz@mleia.com
State Not Applicable
Delegated to: David Miller
Headers show

Commit Message

Vladimir Zapolskiy Nov. 27, 2016, 11:18 p.m. UTC
While removing ATA controller driver ata_port_detach() sets
ATA_PFLAG_UNLOADING flag and charges the error handler, however actual
port disabling does not happen due to unset ATA_PFLAG_EH_PENDING flag.

To take care about clean port removal and ATA_PFLAG_EH_PENDING flag
setting it is sufficient to replace ata_port_schedule_eh() call with
ata_port_freeze().

The problem is reported by an assertion in ata_port_detach(),
if a controller driver is unloaded manually or on boot with
DEBUG_TEST_DRIVER_REMOVE option enabled:

  # rmmod ahci_imx
  WARNING: CPU: 2 PID: 379 at drivers/ata/libata-core.c:6484 ata_port_detach+0x11c/0x12c
  Modules linked in: ahci_imx(-) i2c_imx
  CPU: 2 PID: 379 Comm: rmmod Not tainted 4.9.0-rc6+ #66
  Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
  Backtrace:
  [<>] (dump_backtrace) from [<>] (show_stack+0x18/0x1c)
  [<>] (show_stack) from [<>] (dump_stack+0xb4/0xe8)
  [<>] (dump_stack) from [<>] (__warn+0xd8/0x104)
  [<>] (__warn) from [<>] (warn_slowpath_null+0x28/0x30)
  [<>] (warn_slowpath_null) from [<>] (ata_port_detach+0x11c/0x12c)
  [<>] (ata_port_detach) from [<>] (ata_platform_remove_one+0x2c/0x44)
  [<>] (ata_platform_remove_one) from [<>] (platform_drv_remove+0x2c/0x44)
  [<>] (platform_drv_remove) from [<>] (__device_release_driver+0x90/0x12c)
  [<>] (__device_release_driver) from [<>] (driver_detach+0xc0/0xc4)
  [<>] (driver_detach) from [<>] (bus_remove_driver+0x64/0xdc)
  [<>] (bus_remove_driver) from [<>] (driver_unregister+0x30/0x50)
  [<>] (driver_unregister) from [<>] (platform_driver_unregister+0x14/0x18)
  [<>] (platform_driver_unregister) from [<>] (imx_ahci_driver_exit+0x14/0x1c [ahci_imx])
  [<>] (imx_ahci_driver_exit [ahci_imx]) from [<>] (SyS_delete_module+0x140/0x1f4)
  [<>] (SyS_delete_module) from [<>] (ret_fast_syscall+0x0/0x1c)

Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
---
 drivers/ata/libata-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Vladimir Zapolskiy Nov. 27, 2016, 11:30 p.m. UTC | #1
On 11/28/2016 01:18 AM, Vladimir Zapolskiy wrote:
> While removing ATA controller driver ata_port_detach() sets
> ATA_PFLAG_UNLOADING flag and charges the error handler, however actual
> port disabling does not happen due to unset ATA_PFLAG_EH_PENDING flag.
>
> To take care about clean port removal and ATA_PFLAG_EH_PENDING flag
> setting it is sufficient to replace ata_port_schedule_eh() call with
> ata_port_freeze().
>
> The problem is reported by an assertion in ata_port_detach(),
> if a controller driver is unloaded manually or on boot with
> DEBUG_TEST_DRIVER_REMOVE option enabled:
>
>   # rmmod ahci_imx
>   WARNING: CPU: 2 PID: 379 at drivers/ata/libata-core.c:6484 ata_port_detach+0x11c/0x12c
>   Modules linked in: ahci_imx(-) i2c_imx
>   CPU: 2 PID: 379 Comm: rmmod Not tainted 4.9.0-rc6+ #66
>   Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
>   Backtrace:
>   [<>] (dump_backtrace) from [<>] (show_stack+0x18/0x1c)
>   [<>] (show_stack) from [<>] (dump_stack+0xb4/0xe8)
>   [<>] (dump_stack) from [<>] (__warn+0xd8/0x104)
>   [<>] (__warn) from [<>] (warn_slowpath_null+0x28/0x30)
>   [<>] (warn_slowpath_null) from [<>] (ata_port_detach+0x11c/0x12c)
>   [<>] (ata_port_detach) from [<>] (ata_platform_remove_one+0x2c/0x44)
>   [<>] (ata_platform_remove_one) from [<>] (platform_drv_remove+0x2c/0x44)
>   [<>] (platform_drv_remove) from [<>] (__device_release_driver+0x90/0x12c)
>   [<>] (__device_release_driver) from [<>] (driver_detach+0xc0/0xc4)
>   [<>] (driver_detach) from [<>] (bus_remove_driver+0x64/0xdc)
>   [<>] (bus_remove_driver) from [<>] (driver_unregister+0x30/0x50)
>   [<>] (driver_unregister) from [<>] (platform_driver_unregister+0x14/0x18)
>   [<>] (platform_driver_unregister) from [<>] (imx_ahci_driver_exit+0x14/0x1c [ahci_imx])
>   [<>] (imx_ahci_driver_exit [ahci_imx]) from [<>] (SyS_delete_module+0x140/0x1f4)
>   [<>] (SyS_delete_module) from [<>] (ret_fast_syscall+0x0/0x1c)
>
> Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
> ---

Tejun, Bartlomiej,

you may consider this change as RFC, because I'm not an expert in ATA
subsystem, for example I hesitate to claim that ata_port_freeze() is
actually needed here, in my test case (iMX6Q SabreAuto without
a connected SATA drive) a weaker ata_port_abort() also works well.

I suppose that the change is in the kernel sources for a long time,
so you may consider to send the fix to the maintainers of stable
branches, feel free to ask me to find a commit for Fixes tag.

--
With best wishes,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo Nov. 28, 2016, 6:34 p.m. UTC | #2
Hello, Vladimir.

On Mon, Nov 28, 2016 at 01:18:56AM +0200, Vladimir Zapolskiy wrote:
> While removing ATA controller driver ata_port_detach() sets
> ATA_PFLAG_UNLOADING flag and charges the error handler, however actual
> port disabling does not happen due to unset ATA_PFLAG_EH_PENDING flag.
> 
> To take care about clean port removal and ATA_PFLAG_EH_PENDING flag
> setting it is sufficient to replace ata_port_schedule_eh() call with
> ata_port_freeze().

Hmm... this explanation doesn't really make sense to me.
ATA_PFLAG_EH_PENDING is set by at_eh_set_pending() which is the same
for both ata_port_schedule_eh() and ata_port_freeze().  There gotta me
something else going on here.  Any chance you can track down why EH
isn't running?

Thanks.
Vladimir Zapolskiy Nov. 28, 2016, 11:51 p.m. UTC | #3
Hello Tejun,

On 11/28/2016 08:34 PM, Tejun Heo wrote:
> Hello, Vladimir.
> 
> On Mon, Nov 28, 2016 at 01:18:56AM +0200, Vladimir Zapolskiy wrote:
>> While removing ATA controller driver ata_port_detach() sets 
>> ATA_PFLAG_UNLOADING flag and charges the error handler, however
>> actual port disabling does not happen due to unset
>> ATA_PFLAG_EH_PENDING flag.
>> 
>> To take care about clean port removal and ATA_PFLAG_EH_PENDING
>> flag setting it is sufficient to replace ata_port_schedule_eh()
>> call with ata_port_freeze().
> 
> Hmm... this explanation doesn't really make sense to me. 
> ATA_PFLAG_EH_PENDING is set by at_eh_set_pending() which is the same 
> for both ata_port_schedule_eh() and ata_port_freeze().

correct, ATA_PFLAG_EH_PENDING is set by ata_eh_set_pending(),
you caused me doubt, and my analysis is crap...

> There gotta me something else going on here.  Any chance you can
> track down why EH isn't running?
> 

I've tested the unmodified master branch with a different kernel config
and on another but similar board (SabreSD) powered by the same iMX6Q
SoC, and I can not reproduce this problem, but I still experience it
on the SabreAuto board, I'll trace the kernel on it over JTAG tomorrow.

Sorry for the noise, the reported problem is unlikely related to ATA
subsystem, because in parallel there are some other funny reported
issues like:

  Unable to handle kernel paging request at virtual address 70732f34
  pgd = e5c04000
  [70732f34] *pgd=00000000
  Internal error: Oops: 5 [#1] SMP ARM
  CPU: 3 PID: 218 Comm: udevd Not tainted 4.9.0-rc7 #5
  Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
  task: e6a48c40 task.stack: e5c02000
  PC is at __lock_acquire+0xd8/0x1ba0
  LR is at 0x1

--
With best wishes,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vladimir Zapolskiy Nov. 29, 2016, 6:54 p.m. UTC | #4
Hello Tejun,

On 11/29/2016 01:51 AM, Vladimir Zapolskiy wrote:
> Hello Tejun,
> 
> On 11/28/2016 08:34 PM, Tejun Heo wrote:
>> Hello, Vladimir.
>>
>> On Mon, Nov 28, 2016 at 01:18:56AM +0200, Vladimir Zapolskiy wrote:
>>> While removing ATA controller driver ata_port_detach() sets 
>>> ATA_PFLAG_UNLOADING flag and charges the error handler, however
>>> actual port disabling does not happen due to unset
>>> ATA_PFLAG_EH_PENDING flag.
>>>
>>> To take care about clean port removal and ATA_PFLAG_EH_PENDING
>>> flag setting it is sufficient to replace ata_port_schedule_eh()
>>> call with ata_port_freeze().
>>
>> Hmm... this explanation doesn't really make sense to me. 
>> ATA_PFLAG_EH_PENDING is set by at_eh_set_pending() which is the same 
>> for both ata_port_schedule_eh() and ata_port_freeze().
> 
> correct, ATA_PFLAG_EH_PENDING is set by ata_eh_set_pending(),
> you caused me doubt, and my analysis is crap...
> 
>> There gotta me something else going on here.  Any chance you can
>> track down why EH isn't running?
>>
> 
> I've tested the unmodified master branch with a different kernel config
> and on another but similar board (SabreSD) powered by the same iMX6Q
> SoC, and I can not reproduce this problem, but I still experience it
> on the SabreAuto board, I'll trace the kernel on it over JTAG tomorrow.
> 

tracing on the board shows a race between driver initialization and
deinitialization, when async_port_probe() is scheduled after driver
removal, this causes the reported problem.

Since it is a race, it should be possible to fuzz the kernel by
introducing a delay (e.g. in ata_port_probe()) to get enough time
to reproduce the problem reliably and to verify a fix.

imx_ahci_probe()
  ahci_platform_init_host()
    ata_host_alloc_pinfo()
      ata_host_alloc()
        ata_port_alloc()    ---> sets ATA_PFLAG_INITIALIZING flag
          ata_link_init()
          ....
    ahci_host_activate()
      ata_host_activate()
        ata_host_start()
          ata_eh_freeze_port()
        ata_port_desc()
        ata_host_register() ---> schedules async_port_probe()
  ....

*** at this point the driver probe is completed, thus it can be removed ***

ata_platform_remove_one()    ==  imx_ahci_driver.remove()
  ata_port_detach()
    ata_port_schedule_eh()
      ata_std_sched_eh()    ---> return, ATA_PFLAG_EH_PENDING flag is not set
    ata_port_wait_eh()      ---> return, port cleanup work is not done

*** warning is printed out ***

async_port_probe()          ---- scheduled too late
  ata_port_probe()
    __ata_port_probe()      ---> now ATA_PFLAG_INITIALIZING flag unset
      ata_port_schedule_eh()
        ata_std_sched_eh()


It also explains why ata_port_schedule_eh() inside ata_port_detach()
replaced by ata_port_abort() with unconditional ATA_PFLAG_EH_PENDING
flag setting does not produce the warning, but still I'm not sure
that resource and state clean-ups are done correctly under the race.

If you buy this analysis sketch, it may take another day or two for
me to prepare a proper fix, or, if you have enough time and desire,
you may implement the fix on your own.

--
With best wishes,
Vladimir

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo Nov. 29, 2016, 7 p.m. UTC | #5
Hello, Vladimir.

On Tue, Nov 29, 2016 at 08:54:11PM +0200, Vladimir Zapolskiy wrote:
> tracing on the board shows a race between driver initialization and
> deinitialization, when async_port_probe() is scheduled after driver
> removal, this causes the reported problem.
> 
> Since it is a race, it should be possible to fuzz the kernel by
> introducing a delay (e.g. in ata_port_probe()) to get enough time
> to reproduce the problem reliably and to verify a fix.
> 
> imx_ahci_probe()
>   ahci_platform_init_host()
>     ata_host_alloc_pinfo()
>       ata_host_alloc()
>         ata_port_alloc()    ---> sets ATA_PFLAG_INITIALIZING flag
>           ata_link_init()
>           ....
>     ahci_host_activate()
>       ata_host_activate()
>         ata_host_start()
>           ata_eh_freeze_port()
>         ata_port_desc()
>         ata_host_register() ---> schedules async_port_probe()
>   ....
> 
> *** at this point the driver probe is completed, thus it can be removed ***

Not really.  Is this from the unloading test config?  Control doesn't
get passed to userland until async probings are flushed, so this
shouldn't normally be possible.

Thanks.
Vladimir Zapolskiy Nov. 29, 2016, 8:04 p.m. UTC | #6
On 11/29/2016 09:00 PM, Tejun Heo wrote:
> Hello, Vladimir.
> 
> On Tue, Nov 29, 2016 at 08:54:11PM +0200, Vladimir Zapolskiy wrote:
>> tracing on the board shows a race between driver initialization and
>> deinitialization, when async_port_probe() is scheduled after driver
>> removal, this causes the reported problem.
>>
>> Since it is a race, it should be possible to fuzz the kernel by
>> introducing a delay (e.g. in ata_port_probe()) to get enough time
>> to reproduce the problem reliably and to verify a fix.
>>
>> imx_ahci_probe()
>>   ahci_platform_init_host()
>>     ata_host_alloc_pinfo()
>>       ata_host_alloc()
>>         ata_port_alloc()    ---> sets ATA_PFLAG_INITIALIZING flag
>>           ata_link_init()
>>           ....
>>     ahci_host_activate()
>>       ata_host_activate()
>>         ata_host_start()
>>           ata_eh_freeze_port()
>>         ata_port_desc()
>>         ata_host_register() ---> schedules async_port_probe()
>>   ....
>>
>> *** at this point the driver probe is completed, thus it can be removed ***
> 
> Not really.  Is this from the unloading test config?

Correct, I always get the warning with CONFIG_DEBUG_TEST_DRIVER_REMOVE
option enabled.

I understand that a working solution might be just to disable the
option rather than handle this case, however because it is wanted
to test other drivers for potential errors also (e.g. the same ATA
controller driver regardless of the false positive in the ATA core), 
in my opinion the issue should not be ignored.

> Control doesn't get passed to userland until async probings are
> flushed, so this shouldn't normally be possible.
> 

I'm not an expert in ATA, can you please show me the synchronization
point?

In parallel I'll test the fuzzed kernel with an added artificial
delay in ata_port_probe() to get a runtime confirmation.

--
With best wishes,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo Nov. 29, 2016, 8:44 p.m. UTC | #7
Hello, Vladimir.

On Tue, Nov 29, 2016 at 10:04:14PM +0200, Vladimir Zapolskiy wrote:
> > Not really.  Is this from the unloading test config?
> 
> Correct, I always get the warning with CONFIG_DEBUG_TEST_DRIVER_REMOVE
> option enabled.
> 
> I understand that a working solution might be just to disable the
> option rather than handle this case, however because it is wanted
> to test other drivers for potential errors also (e.g. the same ATA
> controller driver regardless of the false positive in the ATA core), 
> in my opinion the issue should not be ignored.

So, the problem is that CONFIG_DEBUG_TEST_DRIVER_REMOVE is introducing
a condition which isn't otherwise possible, so it's triggering pseudo
bugs.  The solution here is to make CONFIG_DEBUG_TEST_DRIVER_REMOVE
flush async calls before trying to remove the driver.

> > Control doesn't get passed to userland until async probings are
> > flushed, so this shouldn't normally be possible.
> > 
> 
> I'm not an expert in ATA, can you please show me the synchronization
> point?

async_synchronize_full() call in kernel_init() for booting and in
delete_module() for module unloading.

Thanks.
diff mbox

Patch

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 43842fd..6d64fe3 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -6505,7 +6505,7 @@  static void ata_port_detach(struct ata_port *ap)
 	/* tell EH we're leaving & flush EH */
 	spin_lock_irqsave(ap->lock, flags);
 	ap->pflags |= ATA_PFLAG_UNLOADING;
-	ata_port_schedule_eh(ap);
+	ata_port_freeze(ap);
 	spin_unlock_irqrestore(ap->lock, flags);
 
 	/* wait till EH commits suicide */