Message ID | 20161127231856.11466-1-vz@mleia.com |
---|---|
State | Not Applicable |
Delegated to: | David Miller |
Headers | show |
On 11/28/2016 01:18 AM, Vladimir Zapolskiy wrote: > While removing ATA controller driver ata_port_detach() sets > ATA_PFLAG_UNLOADING flag and charges the error handler, however actual > port disabling does not happen due to unset ATA_PFLAG_EH_PENDING flag. > > To take care about clean port removal and ATA_PFLAG_EH_PENDING flag > setting it is sufficient to replace ata_port_schedule_eh() call with > ata_port_freeze(). > > The problem is reported by an assertion in ata_port_detach(), > if a controller driver is unloaded manually or on boot with > DEBUG_TEST_DRIVER_REMOVE option enabled: > > # rmmod ahci_imx > WARNING: CPU: 2 PID: 379 at drivers/ata/libata-core.c:6484 ata_port_detach+0x11c/0x12c > Modules linked in: ahci_imx(-) i2c_imx > CPU: 2 PID: 379 Comm: rmmod Not tainted 4.9.0-rc6+ #66 > Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) > Backtrace: > [<>] (dump_backtrace) from [<>] (show_stack+0x18/0x1c) > [<>] (show_stack) from [<>] (dump_stack+0xb4/0xe8) > [<>] (dump_stack) from [<>] (__warn+0xd8/0x104) > [<>] (__warn) from [<>] (warn_slowpath_null+0x28/0x30) > [<>] (warn_slowpath_null) from [<>] (ata_port_detach+0x11c/0x12c) > [<>] (ata_port_detach) from [<>] (ata_platform_remove_one+0x2c/0x44) > [<>] (ata_platform_remove_one) from [<>] (platform_drv_remove+0x2c/0x44) > [<>] (platform_drv_remove) from [<>] (__device_release_driver+0x90/0x12c) > [<>] (__device_release_driver) from [<>] (driver_detach+0xc0/0xc4) > [<>] (driver_detach) from [<>] (bus_remove_driver+0x64/0xdc) > [<>] (bus_remove_driver) from [<>] (driver_unregister+0x30/0x50) > [<>] (driver_unregister) from [<>] (platform_driver_unregister+0x14/0x18) > [<>] (platform_driver_unregister) from [<>] (imx_ahci_driver_exit+0x14/0x1c [ahci_imx]) > [<>] (imx_ahci_driver_exit [ahci_imx]) from [<>] (SyS_delete_module+0x140/0x1f4) > [<>] (SyS_delete_module) from [<>] (ret_fast_syscall+0x0/0x1c) > > Signed-off-by: Vladimir Zapolskiy <vz@mleia.com> > --- Tejun, Bartlomiej, you may consider this change as RFC, because I'm not an expert in ATA subsystem, for example I hesitate to claim that ata_port_freeze() is actually needed here, in my test case (iMX6Q SabreAuto without a connected SATA drive) a weaker ata_port_abort() also works well. I suppose that the change is in the kernel sources for a long time, so you may consider to send the fix to the maintainers of stable branches, feel free to ask me to find a commit for Fixes tag. -- With best wishes, Vladimir -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello, Vladimir. On Mon, Nov 28, 2016 at 01:18:56AM +0200, Vladimir Zapolskiy wrote: > While removing ATA controller driver ata_port_detach() sets > ATA_PFLAG_UNLOADING flag and charges the error handler, however actual > port disabling does not happen due to unset ATA_PFLAG_EH_PENDING flag. > > To take care about clean port removal and ATA_PFLAG_EH_PENDING flag > setting it is sufficient to replace ata_port_schedule_eh() call with > ata_port_freeze(). Hmm... this explanation doesn't really make sense to me. ATA_PFLAG_EH_PENDING is set by at_eh_set_pending() which is the same for both ata_port_schedule_eh() and ata_port_freeze(). There gotta me something else going on here. Any chance you can track down why EH isn't running? Thanks.
Hello Tejun, On 11/28/2016 08:34 PM, Tejun Heo wrote: > Hello, Vladimir. > > On Mon, Nov 28, 2016 at 01:18:56AM +0200, Vladimir Zapolskiy wrote: >> While removing ATA controller driver ata_port_detach() sets >> ATA_PFLAG_UNLOADING flag and charges the error handler, however >> actual port disabling does not happen due to unset >> ATA_PFLAG_EH_PENDING flag. >> >> To take care about clean port removal and ATA_PFLAG_EH_PENDING >> flag setting it is sufficient to replace ata_port_schedule_eh() >> call with ata_port_freeze(). > > Hmm... this explanation doesn't really make sense to me. > ATA_PFLAG_EH_PENDING is set by at_eh_set_pending() which is the same > for both ata_port_schedule_eh() and ata_port_freeze(). correct, ATA_PFLAG_EH_PENDING is set by ata_eh_set_pending(), you caused me doubt, and my analysis is crap... > There gotta me something else going on here. Any chance you can > track down why EH isn't running? > I've tested the unmodified master branch with a different kernel config and on another but similar board (SabreSD) powered by the same iMX6Q SoC, and I can not reproduce this problem, but I still experience it on the SabreAuto board, I'll trace the kernel on it over JTAG tomorrow. Sorry for the noise, the reported problem is unlikely related to ATA subsystem, because in parallel there are some other funny reported issues like: Unable to handle kernel paging request at virtual address 70732f34 pgd = e5c04000 [70732f34] *pgd=00000000 Internal error: Oops: 5 [#1] SMP ARM CPU: 3 PID: 218 Comm: udevd Not tainted 4.9.0-rc7 #5 Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) task: e6a48c40 task.stack: e5c02000 PC is at __lock_acquire+0xd8/0x1ba0 LR is at 0x1 -- With best wishes, Vladimir -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello Tejun, On 11/29/2016 01:51 AM, Vladimir Zapolskiy wrote: > Hello Tejun, > > On 11/28/2016 08:34 PM, Tejun Heo wrote: >> Hello, Vladimir. >> >> On Mon, Nov 28, 2016 at 01:18:56AM +0200, Vladimir Zapolskiy wrote: >>> While removing ATA controller driver ata_port_detach() sets >>> ATA_PFLAG_UNLOADING flag and charges the error handler, however >>> actual port disabling does not happen due to unset >>> ATA_PFLAG_EH_PENDING flag. >>> >>> To take care about clean port removal and ATA_PFLAG_EH_PENDING >>> flag setting it is sufficient to replace ata_port_schedule_eh() >>> call with ata_port_freeze(). >> >> Hmm... this explanation doesn't really make sense to me. >> ATA_PFLAG_EH_PENDING is set by at_eh_set_pending() which is the same >> for both ata_port_schedule_eh() and ata_port_freeze(). > > correct, ATA_PFLAG_EH_PENDING is set by ata_eh_set_pending(), > you caused me doubt, and my analysis is crap... > >> There gotta me something else going on here. Any chance you can >> track down why EH isn't running? >> > > I've tested the unmodified master branch with a different kernel config > and on another but similar board (SabreSD) powered by the same iMX6Q > SoC, and I can not reproduce this problem, but I still experience it > on the SabreAuto board, I'll trace the kernel on it over JTAG tomorrow. > tracing on the board shows a race between driver initialization and deinitialization, when async_port_probe() is scheduled after driver removal, this causes the reported problem. Since it is a race, it should be possible to fuzz the kernel by introducing a delay (e.g. in ata_port_probe()) to get enough time to reproduce the problem reliably and to verify a fix. imx_ahci_probe() ahci_platform_init_host() ata_host_alloc_pinfo() ata_host_alloc() ata_port_alloc() ---> sets ATA_PFLAG_INITIALIZING flag ata_link_init() .... ahci_host_activate() ata_host_activate() ata_host_start() ata_eh_freeze_port() ata_port_desc() ata_host_register() ---> schedules async_port_probe() .... *** at this point the driver probe is completed, thus it can be removed *** ata_platform_remove_one() == imx_ahci_driver.remove() ata_port_detach() ata_port_schedule_eh() ata_std_sched_eh() ---> return, ATA_PFLAG_EH_PENDING flag is not set ata_port_wait_eh() ---> return, port cleanup work is not done *** warning is printed out *** async_port_probe() ---- scheduled too late ata_port_probe() __ata_port_probe() ---> now ATA_PFLAG_INITIALIZING flag unset ata_port_schedule_eh() ata_std_sched_eh() It also explains why ata_port_schedule_eh() inside ata_port_detach() replaced by ata_port_abort() with unconditional ATA_PFLAG_EH_PENDING flag setting does not produce the warning, but still I'm not sure that resource and state clean-ups are done correctly under the race. If you buy this analysis sketch, it may take another day or two for me to prepare a proper fix, or, if you have enough time and desire, you may implement the fix on your own. -- With best wishes, Vladimir -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello, Vladimir. On Tue, Nov 29, 2016 at 08:54:11PM +0200, Vladimir Zapolskiy wrote: > tracing on the board shows a race between driver initialization and > deinitialization, when async_port_probe() is scheduled after driver > removal, this causes the reported problem. > > Since it is a race, it should be possible to fuzz the kernel by > introducing a delay (e.g. in ata_port_probe()) to get enough time > to reproduce the problem reliably and to verify a fix. > > imx_ahci_probe() > ahci_platform_init_host() > ata_host_alloc_pinfo() > ata_host_alloc() > ata_port_alloc() ---> sets ATA_PFLAG_INITIALIZING flag > ata_link_init() > .... > ahci_host_activate() > ata_host_activate() > ata_host_start() > ata_eh_freeze_port() > ata_port_desc() > ata_host_register() ---> schedules async_port_probe() > .... > > *** at this point the driver probe is completed, thus it can be removed *** Not really. Is this from the unloading test config? Control doesn't get passed to userland until async probings are flushed, so this shouldn't normally be possible. Thanks.
On 11/29/2016 09:00 PM, Tejun Heo wrote: > Hello, Vladimir. > > On Tue, Nov 29, 2016 at 08:54:11PM +0200, Vladimir Zapolskiy wrote: >> tracing on the board shows a race between driver initialization and >> deinitialization, when async_port_probe() is scheduled after driver >> removal, this causes the reported problem. >> >> Since it is a race, it should be possible to fuzz the kernel by >> introducing a delay (e.g. in ata_port_probe()) to get enough time >> to reproduce the problem reliably and to verify a fix. >> >> imx_ahci_probe() >> ahci_platform_init_host() >> ata_host_alloc_pinfo() >> ata_host_alloc() >> ata_port_alloc() ---> sets ATA_PFLAG_INITIALIZING flag >> ata_link_init() >> .... >> ahci_host_activate() >> ata_host_activate() >> ata_host_start() >> ata_eh_freeze_port() >> ata_port_desc() >> ata_host_register() ---> schedules async_port_probe() >> .... >> >> *** at this point the driver probe is completed, thus it can be removed *** > > Not really. Is this from the unloading test config? Correct, I always get the warning with CONFIG_DEBUG_TEST_DRIVER_REMOVE option enabled. I understand that a working solution might be just to disable the option rather than handle this case, however because it is wanted to test other drivers for potential errors also (e.g. the same ATA controller driver regardless of the false positive in the ATA core), in my opinion the issue should not be ignored. > Control doesn't get passed to userland until async probings are > flushed, so this shouldn't normally be possible. > I'm not an expert in ATA, can you please show me the synchronization point? In parallel I'll test the fuzzed kernel with an added artificial delay in ata_port_probe() to get a runtime confirmation. -- With best wishes, Vladimir -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello, Vladimir. On Tue, Nov 29, 2016 at 10:04:14PM +0200, Vladimir Zapolskiy wrote: > > Not really. Is this from the unloading test config? > > Correct, I always get the warning with CONFIG_DEBUG_TEST_DRIVER_REMOVE > option enabled. > > I understand that a working solution might be just to disable the > option rather than handle this case, however because it is wanted > to test other drivers for potential errors also (e.g. the same ATA > controller driver regardless of the false positive in the ATA core), > in my opinion the issue should not be ignored. So, the problem is that CONFIG_DEBUG_TEST_DRIVER_REMOVE is introducing a condition which isn't otherwise possible, so it's triggering pseudo bugs. The solution here is to make CONFIG_DEBUG_TEST_DRIVER_REMOVE flush async calls before trying to remove the driver. > > Control doesn't get passed to userland until async probings are > > flushed, so this shouldn't normally be possible. > > > > I'm not an expert in ATA, can you please show me the synchronization > point? async_synchronize_full() call in kernel_init() for booting and in delete_module() for module unloading. Thanks.
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index 43842fd..6d64fe3 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -6505,7 +6505,7 @@ static void ata_port_detach(struct ata_port *ap) /* tell EH we're leaving & flush EH */ spin_lock_irqsave(ap->lock, flags); ap->pflags |= ATA_PFLAG_UNLOADING; - ata_port_schedule_eh(ap); + ata_port_freeze(ap); spin_unlock_irqrestore(ap->lock, flags); /* wait till EH commits suicide */
While removing ATA controller driver ata_port_detach() sets ATA_PFLAG_UNLOADING flag and charges the error handler, however actual port disabling does not happen due to unset ATA_PFLAG_EH_PENDING flag. To take care about clean port removal and ATA_PFLAG_EH_PENDING flag setting it is sufficient to replace ata_port_schedule_eh() call with ata_port_freeze(). The problem is reported by an assertion in ata_port_detach(), if a controller driver is unloaded manually or on boot with DEBUG_TEST_DRIVER_REMOVE option enabled: # rmmod ahci_imx WARNING: CPU: 2 PID: 379 at drivers/ata/libata-core.c:6484 ata_port_detach+0x11c/0x12c Modules linked in: ahci_imx(-) i2c_imx CPU: 2 PID: 379 Comm: rmmod Not tainted 4.9.0-rc6+ #66 Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) Backtrace: [<>] (dump_backtrace) from [<>] (show_stack+0x18/0x1c) [<>] (show_stack) from [<>] (dump_stack+0xb4/0xe8) [<>] (dump_stack) from [<>] (__warn+0xd8/0x104) [<>] (__warn) from [<>] (warn_slowpath_null+0x28/0x30) [<>] (warn_slowpath_null) from [<>] (ata_port_detach+0x11c/0x12c) [<>] (ata_port_detach) from [<>] (ata_platform_remove_one+0x2c/0x44) [<>] (ata_platform_remove_one) from [<>] (platform_drv_remove+0x2c/0x44) [<>] (platform_drv_remove) from [<>] (__device_release_driver+0x90/0x12c) [<>] (__device_release_driver) from [<>] (driver_detach+0xc0/0xc4) [<>] (driver_detach) from [<>] (bus_remove_driver+0x64/0xdc) [<>] (bus_remove_driver) from [<>] (driver_unregister+0x30/0x50) [<>] (driver_unregister) from [<>] (platform_driver_unregister+0x14/0x18) [<>] (platform_driver_unregister) from [<>] (imx_ahci_driver_exit+0x14/0x1c [ahci_imx]) [<>] (imx_ahci_driver_exit [ahci_imx]) from [<>] (SyS_delete_module+0x140/0x1f4) [<>] (SyS_delete_module) from [<>] (ret_fast_syscall+0x0/0x1c) Signed-off-by: Vladimir Zapolskiy <vz@mleia.com> --- drivers/ata/libata-core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)