Message ID | 20170608024344.9311-1-david.thomson@alliedtelesis.co.nz |
---|---|
State | Rejected |
Headers | show |
On Thu, Jun 08, 2017 at 02:43:44PM +1200, David Thomson wrote: > There's a race condition that occurs when deleting & closing multiple > i2c devices, which results in a kernel warning on an incorrect reference > count: Thanks for the report and patch! CCing peda. Do you have time to look into it? Muxes and locking are kinda calling for your expertise ;) > > Call Trace: > [c00000007e2cfaf0] [c0000000000b84e4] .module_put+0x24/0xb0 (unreliable) > [c00000007e2cfb70] [c0000000005db830] .i2c_put_adapter+0x30/0x50 > [c00000007e2cfbf0] [c0000000005deb74] .i2cdev_release+0x24/0x60 > [c00000007e2cfc70] [c00000000014015c] .__fput+0xbc/0x290 > [c00000007e2cfd10] [c000000000059b64] .task_work_run+0xf4/0x130 > [c00000007e2cfdb0] [c00000000000a0a4] .do_notify_resume+0x74/0x80 > [c00000007e2cfe30] [c000000000000acc] .ret_from_except_lite+0x78/0x7c > > The race is seen when an i2c adapter is being deleted while > file descriptor on another i2c adapter is being closed. The following > shows two threads executing concurrently (the first number is the thread > id): > > 4580 i2c_del_mux_adapter adapter i2c-6-mux (chan_id 2) > 4580 i2c_del_adapter adapter i2c-6-mux (chan_id 2) > 4580 i2cdev_detach_adapter adapter i2c-6-mux (chan_id 2) > 2627 i2cdev_release START adapter i2c-6-mux (chan_id 2) > 2627 i2c_put_adapter START adapter i2c-6-mux (chan_id 2) owner (null) > 2627 i2c_adapter_dev_release adapter i2c-6-mux (chan_id 2) > 4580 i2c_del_mux_adapter adapter i2c-6-mux (chan_id 3) > 4580 i2c_del_adapter adapter i2c-6-mux (chan_id 3) > 4580 i2cdev_detach_adapter adapter i2c-6-mux (chan_id 3) > 2627 i2c_put_adapter END adapter i2c-6-mux (chan_id 2) owner c0000000edf66400 > 2627 i2cdev_release END adapter i2c-6-mux (chan_id 2) > 2627 i2cdev_release START adapter i2c-6-mux (chan_id 3) > 2627 i2c_put_adapter START adapter i2c-6-mux (chan_id 3) owner (null) > 2627 i2c_adapter_dev_release adapter i2c-6-mux (chan_id 3) > 2627 i2c_put_adapter END adapter i2c-6-mux (chan_id 3) owner (null) > 2627 i2cdev_release END adapter i2c-6-mux (chan_id 3) > > When thread 4580 for chan_id 3 interrupts 2627 doing > i2c_adapter_dev_release, the 'owner' field for chan_id 2 becomes > non-null. This causes the code to attempt to decrement the reference for > this owner, but the current reference count is invalid. I'm not sure how > the owner is getting set, but adding a lock around the > put_device/module_put calls resolves the issue. > > Signed-off-by: David Thomson <david.thomson@alliedtelesis.co.nz> > --- > drivers/i2c/i2c-core.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/drivers/i2c/i2c-core.c b/drivers/i2c/i2c-core.c > index 82576aaccc90..8831d9b363a7 100644 > --- a/drivers/i2c/i2c-core.c > +++ b/drivers/i2c/i2c-core.c > @@ -3151,8 +3151,12 @@ void i2c_put_adapter(struct i2c_adapter *adap) > if (!adap) > return; > > + mutex_lock(&core_lock); > + > put_device(&adap->dev); > module_put(adap->owner); > + > + mutex_unlock(&core_lock); > } > EXPORT_SYMBOL(i2c_put_adapter); > > -- > 2.13.0 >
On 2017-06-19 16:36, Wolfram Sang wrote: > On Thu, Jun 08, 2017 at 02:43:44PM +1200, David Thomson wrote: >> There's a race condition that occurs when deleting & closing multiple >> i2c devices, which results in a kernel warning on an incorrect reference >> count: > > Thanks for the report and patch! > > CCing peda. Do you have time to look into it? Muxes and locking are > kinda calling for your expertise ;) > >> >> Call Trace: >> [c00000007e2cfaf0] [c0000000000b84e4] .module_put+0x24/0xb0 (unreliable) >> [c00000007e2cfb70] [c0000000005db830] .i2c_put_adapter+0x30/0x50 >> [c00000007e2cfbf0] [c0000000005deb74] .i2cdev_release+0x24/0x60 >> [c00000007e2cfc70] [c00000000014015c] .__fput+0xbc/0x290 >> [c00000007e2cfd10] [c000000000059b64] .task_work_run+0xf4/0x130 >> [c00000007e2cfdb0] [c00000000000a0a4] .do_notify_resume+0x74/0x80 >> [c00000007e2cfe30] [c000000000000acc] .ret_from_except_lite+0x78/0x7c >> >> The race is seen when an i2c adapter is being deleted while >> file descriptor on another i2c adapter is being closed. The following >> shows two threads executing concurrently (the first number is the thread >> id): >> >> 4580 i2c_del_mux_adapter adapter i2c-6-mux (chan_id 2) This function no longer exist in the kernel. I'm not saying that the problem has been solved (it probably hasn't), but what version was this and can you please redo this with a newer kernel? >> 4580 i2c_del_adapter adapter i2c-6-mux (chan_id 2) >> 4580 i2cdev_detach_adapter adapter i2c-6-mux (chan_id 2) >> 2627 i2cdev_release START adapter i2c-6-mux (chan_id 2) >> 2627 i2c_put_adapter START adapter i2c-6-mux (chan_id 2) owner (null) >> 2627 i2c_adapter_dev_release adapter i2c-6-mux (chan_id 2) >> 4580 i2c_del_mux_adapter adapter i2c-6-mux (chan_id 3) >> 4580 i2c_del_adapter adapter i2c-6-mux (chan_id 3) >> 4580 i2cdev_detach_adapter adapter i2c-6-mux (chan_id 3) >> 2627 i2c_put_adapter END adapter i2c-6-mux (chan_id 2) owner c0000000edf66400 >> 2627 i2cdev_release END adapter i2c-6-mux (chan_id 2) >> 2627 i2cdev_release START adapter i2c-6-mux (chan_id 3) >> 2627 i2c_put_adapter START adapter i2c-6-mux (chan_id 3) owner (null) >> 2627 i2c_adapter_dev_release adapter i2c-6-mux (chan_id 3) >> 2627 i2c_put_adapter END adapter i2c-6-mux (chan_id 3) owner (null) >> 2627 i2cdev_release END adapter i2c-6-mux (chan_id 3) >> >> When thread 4580 for chan_id 3 interrupts 2627 doing >> i2c_adapter_dev_release, the 'owner' field for chan_id 2 becomes >> non-null. This causes the code to attempt to decrement the reference for >> this owner, but the current reference count is invalid. I'm not sure how >> the owner is getting set, but adding a lock around the >> put_device/module_put calls resolves the issue. It would certainly be nice to know why .owner is clobbered, because I don't see it. But then again, I don't know what sources I should be reading... Cheers, peda >> Signed-off-by: David Thomson <david.thomson@alliedtelesis.co.nz> >> --- >> drivers/i2c/i2c-core.c | 4 ++++ >> 1 file changed, 4 insertions(+) >> >> diff --git a/drivers/i2c/i2c-core.c b/drivers/i2c/i2c-core.c >> index 82576aaccc90..8831d9b363a7 100644 >> --- a/drivers/i2c/i2c-core.c >> +++ b/drivers/i2c/i2c-core.c >> @@ -3151,8 +3151,12 @@ void i2c_put_adapter(struct i2c_adapter *adap) >> if (!adap) >> return; >> >> + mutex_lock(&core_lock); >> + >> put_device(&adap->dev); >> module_put(adap->owner); >> + >> + mutex_unlock(&core_lock); >> } >> EXPORT_SYMBOL(i2c_put_adapter); >> >> -- >> 2.13.0 >> -- To unsubscribe from this list: send the line "unsubscribe linux-i2c" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2017-06-19 18:34, Peter Rosin wrote: > It would certainly be nice to know why .owner is clobbered, because I > don't see it. But then again, I don't know what sources I should be > reading... BTW, one thing I noticed when reading the current code is that I see this at the end of i2c_del_adapter: /* Clear the device structure in case this adapter is ever going to be added again */ memset(&adap->dev, 0, sizeof(adap->dev)); } EXPORT_SYMBOL(i2c_del_adapter); which is not very compatible with this part of the device_add() docs: * Do not call this routine or device_register() more than once for * any device structure. The driver model core is not designed to work * with devices that get unregistered and then spring back to life. * (Among other things, it's very hard to guarantee that all references * to the previous incarnation of @dev have been dropped.) Allocate * and register a fresh new struct device instead. That, of course, probably have no bearing on the problem/patch in this thread... Or maybe struct device reuse is exactly what is going on??? Cheers, peda -- To unsubscribe from this list: send the line "unsubscribe linux-i2c" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Peter Thanks for your reply. Unfortunately we cannot reproduce the issue on a newer kernel as it requires userspace applications (for processing the hotswap) that rely on our (many) changes to the kernel. I appreciate you taking the time to consider the issue. Thanks David
On Mon, Jun 19, 2017 at 09:30:06PM +0200, Peter Rosin wrote: > On 2017-06-19 18:34, Peter Rosin wrote: > > It would certainly be nice to know why .owner is clobbered, because I > > don't see it. But then again, I don't know what sources I should be > > reading... > > BTW, one thing I noticed when reading the current code is that I see this > at the end of i2c_del_adapter: > > /* Clear the device structure in case this adapter is ever going to be > added again */ > memset(&adap->dev, 0, sizeof(adap->dev)); > } > EXPORT_SYMBOL(i2c_del_adapter); It was added with commit bd4bc3dbded9cd ("i2c: Clear i2c_adapter.dev on adapter removal") with this commit message: i2c: Clear i2c_adapter.dev on adapter removal Clear i2c_adapter.dev on adapter removal. This makes it possible to re-add the adapter at a later point, which some drivers (i2c-amd756-s4882, i2c-nforce2-s4985) actually do. This fixes a bug reported by John Stultz here: http://lkml.org/lkml/2008/7/15/720 and by Ingo Molar there: http://lkml.org/lkml/2008/7/16/78 So, despite the docs, it used to be an issue actually...
On 2017-07-31 15:38, Wolfram Sang wrote: > On Mon, Jun 19, 2017 at 09:30:06PM +0200, Peter Rosin wrote: >> On 2017-06-19 18:34, Peter Rosin wrote: >>> It would certainly be nice to know why .owner is clobbered, because I >>> don't see it. But then again, I don't know what sources I should be >>> reading... >> >> BTW, one thing I noticed when reading the current code is that I see this >> at the end of i2c_del_adapter: >> >> /* Clear the device structure in case this adapter is ever going to be >> added again */ >> memset(&adap->dev, 0, sizeof(adap->dev)); >> } >> EXPORT_SYMBOL(i2c_del_adapter); > > It was added with commit bd4bc3dbded9cd ("i2c: Clear i2c_adapter.dev on > adapter removal") with this commit message: > > i2c: Clear i2c_adapter.dev on adapter removal > > Clear i2c_adapter.dev on adapter removal. This makes it possible to > re-add the adapter at a later point, which some drivers > (i2c-amd756-s4882, i2c-nforce2-s4985) actually do. > > This fixes a bug reported by John Stultz here: > http://lkml.org/lkml/2008/7/15/720 > and by Ingo Molar there: > http://lkml.org/lkml/2008/7/16/78 > > So, despite the docs, it used to be an issue actually... What do you mean "despite the docs"? The docs says that you shouldn't reuse struct device. When you did, you got splats. Zeroing out struct device happens to work in this case probably because there is a completion going on around the call to device_unregister(). Without that completion it would have been seriously wrong to zero out the struct device, I think. But even with it, I'm not sure it's correct to do so and worry about rcu grace periods etc. Has any driver model expert looked at this struct device reuse and blessed it as ok? Cheers, Peter
> to do so and worry about rcu grace periods etc. Has any driver model > expert looked at this struct device reuse and blessed it as ok? I am quite sure this did not happen. That's what I mean with "despite the docs": One shouldn't reuse struct device, some people did it and got this "fix" in place, and now people might rely on the "fix". As a result, we can't simply remove it by saying "Ah, that's cruft. It can just go." We'd need some deeper look into random drivers, like you sketched.
On 2017-06-29 00:20, David Thomson wrote: > On 2017-06-19 18:34, Peter Rosin wrote: >> It would certainly be nice to know why .owner is clobbered, because I >> don't see it. But then again, I don't know what sources I should be >> reading... > Hi Peter > > Thanks for your reply. Unfortunately we cannot reproduce the issue on a newer kernel as it requires userspace applications (for processing the hotswap) that rely on our (many) changes to the kernel. I appreciate you taking the time to consider the issue. Still interested in what base version you have and if there are any patches of relevance to the i2c subsystem... The oldest mainline that had i2c_del_mux_adapter is v4.6 (May 2016), so I had a second look at that. Could you also tell us what i2c-mux driver this is and whether it is expected that .owner is NULL in the first place? In other words, is the i2c-mux driver in question built-in or a loaded module? If you have not added any new i2c-mux driver (which I doubt, you would surely have mentioned that) the only way that this race can happen is if the driver is removed twice since i2c_del_mux_adapter is only called in .remove (and in .probe on failure, but I don't think that can race). At least that's the case for the i2c-mux drivers that I looked at... It seems pretty serious that a driver can have its .remove called twice so I suspect your patch just papers over a much bigger problem? So, why is the i2c-mux driver being removed more than once? The module refcounting must be badly screwed up for that to happen. Or? Cheers, Peter
diff --git a/drivers/i2c/i2c-core.c b/drivers/i2c/i2c-core.c index 82576aaccc90..8831d9b363a7 100644 --- a/drivers/i2c/i2c-core.c +++ b/drivers/i2c/i2c-core.c @@ -3151,8 +3151,12 @@ void i2c_put_adapter(struct i2c_adapter *adap) if (!adap) return; + mutex_lock(&core_lock); + put_device(&adap->dev); module_put(adap->owner); + + mutex_unlock(&core_lock); } EXPORT_SYMBOL(i2c_put_adapter);
There's a race condition that occurs when deleting & closing multiple i2c devices, which results in a kernel warning on an incorrect reference count: Call Trace: [c00000007e2cfaf0] [c0000000000b84e4] .module_put+0x24/0xb0 (unreliable) [c00000007e2cfb70] [c0000000005db830] .i2c_put_adapter+0x30/0x50 [c00000007e2cfbf0] [c0000000005deb74] .i2cdev_release+0x24/0x60 [c00000007e2cfc70] [c00000000014015c] .__fput+0xbc/0x290 [c00000007e2cfd10] [c000000000059b64] .task_work_run+0xf4/0x130 [c00000007e2cfdb0] [c00000000000a0a4] .do_notify_resume+0x74/0x80 [c00000007e2cfe30] [c000000000000acc] .ret_from_except_lite+0x78/0x7c The race is seen when an i2c adapter is being deleted while file descriptor on another i2c adapter is being closed. The following shows two threads executing concurrently (the first number is the thread id): 4580 i2c_del_mux_adapter adapter i2c-6-mux (chan_id 2) 4580 i2c_del_adapter adapter i2c-6-mux (chan_id 2) 4580 i2cdev_detach_adapter adapter i2c-6-mux (chan_id 2) 2627 i2cdev_release START adapter i2c-6-mux (chan_id 2) 2627 i2c_put_adapter START adapter i2c-6-mux (chan_id 2) owner (null) 2627 i2c_adapter_dev_release adapter i2c-6-mux (chan_id 2) 4580 i2c_del_mux_adapter adapter i2c-6-mux (chan_id 3) 4580 i2c_del_adapter adapter i2c-6-mux (chan_id 3) 4580 i2cdev_detach_adapter adapter i2c-6-mux (chan_id 3) 2627 i2c_put_adapter END adapter i2c-6-mux (chan_id 2) owner c0000000edf66400 2627 i2cdev_release END adapter i2c-6-mux (chan_id 2) 2627 i2cdev_release START adapter i2c-6-mux (chan_id 3) 2627 i2c_put_adapter START adapter i2c-6-mux (chan_id 3) owner (null) 2627 i2c_adapter_dev_release adapter i2c-6-mux (chan_id 3) 2627 i2c_put_adapter END adapter i2c-6-mux (chan_id 3) owner (null) 2627 i2cdev_release END adapter i2c-6-mux (chan_id 3) When thread 4580 for chan_id 3 interrupts 2627 doing i2c_adapter_dev_release, the 'owner' field for chan_id 2 becomes non-null. This causes the code to attempt to decrement the reference for this owner, but the current reference count is invalid. I'm not sure how the owner is getting set, but adding a lock around the put_device/module_put calls resolves the issue. Signed-off-by: David Thomson <david.thomson@alliedtelesis.co.nz> --- drivers/i2c/i2c-core.c | 4 ++++ 1 file changed, 4 insertions(+)