diff mbox series

[4/5] driver core: inhibit automatic driver binding on reserved devices

Message ID 20211022020032.26980-5-zev@bewilderbeest.net
State New
Headers show
Series driver core, of: support for reserved devices | expand

Commit Message

Zev Weiss Oct. 22, 2021, 2 a.m. UTC
Devices whose fwnodes are marked as reserved are instantiated, but
will not have a driver bound to them unless userspace explicitly
requests it by writing to a 'bind' sysfs file.  This is to enable
devices that may require special (userspace-mediated) preparation
before a driver can safely probe them.

Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
---
 drivers/base/bus.c            |  2 +-
 drivers/base/dd.c             | 13 ++++++++-----
 drivers/dma/idxd/compat.c     |  3 +--
 drivers/vfio/mdev/mdev_core.c |  2 +-
 include/linux/device.h        | 14 +++++++++++++-
 5 files changed, 24 insertions(+), 10 deletions(-)

Comments

Greg KH Oct. 22, 2021, 6:46 a.m. UTC | #1
On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote:
> Devices whose fwnodes are marked as reserved are instantiated, but
> will not have a driver bound to them unless userspace explicitly
> requests it by writing to a 'bind' sysfs file.  This is to enable
> devices that may require special (userspace-mediated) preparation
> before a driver can safely probe them.
> 
> Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
> ---
>  drivers/base/bus.c            |  2 +-
>  drivers/base/dd.c             | 13 ++++++++-----
>  drivers/dma/idxd/compat.c     |  3 +--
>  drivers/vfio/mdev/mdev_core.c |  2 +-
>  include/linux/device.h        | 14 +++++++++++++-
>  5 files changed, 24 insertions(+), 10 deletions(-)

Ugh, no, I don't really want to add yet-another-state to the driver core
like this.  Why are these devices even in the kernel with a driver that
wants to bind to them registered if the driver somehow should NOT be
bound to it?  Shouldn't all of that logic be in the crazy driver itself
as that is a very rare and odd thing to do that the driver core should
not care about at all.

And why does a device need userspace interaction at all?  Again, why
would the driver not know about this and handle it all directly?

thanks,

greg k-h
Zev Weiss Oct. 22, 2021, 8:32 a.m. UTC | #2
On Thu, Oct 21, 2021 at 11:46:56PM PDT, Greg Kroah-Hartman wrote:
>On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote:
>> Devices whose fwnodes are marked as reserved are instantiated, but
>> will not have a driver bound to them unless userspace explicitly
>> requests it by writing to a 'bind' sysfs file.  This is to enable
>> devices that may require special (userspace-mediated) preparation
>> before a driver can safely probe them.
>>
>> Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
>> ---
>>  drivers/base/bus.c            |  2 +-
>>  drivers/base/dd.c             | 13 ++++++++-----
>>  drivers/dma/idxd/compat.c     |  3 +--
>>  drivers/vfio/mdev/mdev_core.c |  2 +-
>>  include/linux/device.h        | 14 +++++++++++++-
>>  5 files changed, 24 insertions(+), 10 deletions(-)
>
>Ugh, no, I don't really want to add yet-another-state to the driver core
>like this.  Why are these devices even in the kernel with a driver that
>wants to bind to them registered if the driver somehow should NOT be
>bound to it?  Shouldn't all of that logic be in the crazy driver itself
>as that is a very rare and odd thing to do that the driver core should
>not care about at all.
>
>And why does a device need userspace interaction at all?  Again, why
>would the driver not know about this and handle it all directly?
>

Let me expand a bit more on the details of the specific situation I'm 
dealing with...

On a server motherboard we've got a host CPU (Xeon, Epyc, POWER, etc.) 
and a baseboard management controller, or BMC (typically an ARM SoC, an 
ASPEED AST2500 in my case).  The host CPU's firmware (BIOS/UEFI, ME 
firmware, etc.) lives in a SPI flash chip.  Because it's the host's 
firmware, that flash chip is connected to and generally (by default) 
under the control of the host CPU.  

But we also want the BMC to be able to perform out-of-band updates to 
the host's firmware, so the flash is *also* connected to the BMC.  
There's an external mux (controlled by a GPIO output driven by the BMC) 
that switches which processor (host or BMC) is actually driving the SPI 
signals to the flash chip, but there's a bunch of other stuff that's 
also required before the BMC can flip that switch and take control of 
the SPI interface:

  - the BMC needs to track (and potentially alter) the host's power state 
    to ensure it's not running (in OpenBMC the existing logic for this is 
    an entire non-trivial userspace daemon unto itself)

  - it needs to twiddle some other GPIOs to put the ME into recovery mode

  - it needs to exchange some IPMI messages with the ME to confirm it got 
    into recovery mode

(Some of the details here are specific to the particular motherboard I'm 
working with, but I'd guess other systems probably have broadly similar 
requirements.)

The firmware flash (or at least the BMC's side of the mux in front of 
it) is attached to a spi-nor controller that's well supported by an 
existing MTD driver (aspeed-smc), but that driver can't safely probe the 
chip until all the stuff described above has been done.  In particular, 
this means we can't reasonably bind the driver to that device during the 
normal device-discovery/driver-binding done in the BMC's boot process 
(nor do we want to, as that would pull the rug out from under the 
running host).  We basically only ever want to touch that SPI interface 
when a user (sysadmin using the BMC, let's say) has explicitly initiated 
an out-of-band firmware update.

So we want the kernel to be aware of the device's existence (so that we 
*can* bind a driver to it when needed), but we don't want it touching 
the device unless we really ask for it.

Does that help clarify the motivation for wanting this functionality?


Thanks,
Zev
Greg KH Oct. 22, 2021, 8:57 a.m. UTC | #3
On Fri, Oct 22, 2021 at 01:32:32AM -0700, Zev Weiss wrote:
> On Thu, Oct 21, 2021 at 11:46:56PM PDT, Greg Kroah-Hartman wrote:
> > On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote:
> > > Devices whose fwnodes are marked as reserved are instantiated, but
> > > will not have a driver bound to them unless userspace explicitly
> > > requests it by writing to a 'bind' sysfs file.  This is to enable
> > > devices that may require special (userspace-mediated) preparation
> > > before a driver can safely probe them.
> > > 
> > > Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
> > > ---
> > >  drivers/base/bus.c            |  2 +-
> > >  drivers/base/dd.c             | 13 ++++++++-----
> > >  drivers/dma/idxd/compat.c     |  3 +--
> > >  drivers/vfio/mdev/mdev_core.c |  2 +-
> > >  include/linux/device.h        | 14 +++++++++++++-
> > >  5 files changed, 24 insertions(+), 10 deletions(-)
> > 
> > Ugh, no, I don't really want to add yet-another-state to the driver core
> > like this.  Why are these devices even in the kernel with a driver that
> > wants to bind to them registered if the driver somehow should NOT be
> > bound to it?  Shouldn't all of that logic be in the crazy driver itself
> > as that is a very rare and odd thing to do that the driver core should
> > not care about at all.
> > 
> > And why does a device need userspace interaction at all?  Again, why
> > would the driver not know about this and handle it all directly?
> > 
> 
> Let me expand a bit more on the details of the specific situation I'm
> dealing with...
> 
> On a server motherboard we've got a host CPU (Xeon, Epyc, POWER, etc.) and a
> baseboard management controller, or BMC (typically an ARM SoC, an ASPEED
> AST2500 in my case).  The host CPU's firmware (BIOS/UEFI, ME firmware, etc.)
> lives in a SPI flash chip.  Because it's the host's firmware, that flash
> chip is connected to and generally (by default) under the control of the
> host CPU.
> 
> But we also want the BMC to be able to perform out-of-band updates to the
> host's firmware, so the flash is *also* connected to the BMC.  There's an
> external mux (controlled by a GPIO output driven by the BMC) that switches
> which processor (host or BMC) is actually driving the SPI signals to the
> flash chip, but there's a bunch of other stuff that's also required before
> the BMC can flip that switch and take control of the SPI interface:
> 
>  - the BMC needs to track (and potentially alter) the host's power state
> to ensure it's not running (in OpenBMC the existing logic for this is    an
> entire non-trivial userspace daemon unto itself)
> 
>  - it needs to twiddle some other GPIOs to put the ME into recovery mode
> 
>  - it needs to exchange some IPMI messages with the ME to confirm it got
> into recovery mode
> 
> (Some of the details here are specific to the particular motherboard I'm
> working with, but I'd guess other systems probably have broadly similar
> requirements.)
> 
> The firmware flash (or at least the BMC's side of the mux in front of it) is
> attached to a spi-nor controller that's well supported by an existing MTD
> driver (aspeed-smc), but that driver can't safely probe the chip until all
> the stuff described above has been done.  In particular, this means we can't
> reasonably bind the driver to that device during the normal
> device-discovery/driver-binding done in the BMC's boot process (nor do we
> want to, as that would pull the rug out from under the running host).  We
> basically only ever want to touch that SPI interface when a user (sysadmin
> using the BMC, let's say) has explicitly initiated an out-of-band firmware
> update.
> 
> So we want the kernel to be aware of the device's existence (so that we
> *can* bind a driver to it when needed), but we don't want it touching the
> device unless we really ask for it.
> 
> Does that help clarify the motivation for wanting this functionality?

Sure, then just do this type of thing in the driver itself.  Do not have
any matching "ids" for this hardware it so that the bus will never call
the probe function for this hardware _until_ a manual write happens to
the driver's "bind" sysfs file.

Then when userspace is done, do a "unbind" write.

No driver core changes should be needed at all here.

thanks,

greg k-h
Patrick Williams Oct. 22, 2021, 3:18 p.m. UTC | #4
Hi Greg,

On Fri, Oct 22, 2021 at 10:57:21AM +0200, Greg Kroah-Hartman wrote:
> On Fri, Oct 22, 2021 at 01:32:32AM -0700, Zev Weiss wrote:
> > On Thu, Oct 21, 2021 at 11:46:56PM PDT, Greg Kroah-Hartman wrote:
> > > On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote:

> > So we want the kernel to be aware of the device's existence (so that we
> > *can* bind a driver to it when needed), but we don't want it touching the
> > device unless we really ask for it.
> > 
> > Does that help clarify the motivation for wanting this functionality?
> 
> Sure, then just do this type of thing in the driver itself.  Do not have
> any matching "ids" for this hardware it so that the bus will never call
> the probe function for this hardware _until_ a manual write happens to
> the driver's "bind" sysfs file.

It sounds like you're suggesting a change to one particular driver to satisfy
this one particular case (and maybe I'm just not understanding your suggestion).
For a BMC, this is a pretty regular situation and not just as one-off as Zev's
example.

Another good example is where a system can have optional riser cards with a
whole tree of devices that might be on that riser card (and there might be
different variants of a riser card that could go in the same slot).  Usually
there is an EEPROM of some sort at a well-known address that can be parsed to
identify which kind of riser card it is and then the appropriate sub-devices can
be enumerated.  That EEPROM parsing is something that is currently done in
userspace due to the complexity and often vendor-specific nature of it.

Many of these devices require quite a bit more configuration information than
can be passed along a `bind` call.  I believe it has been suggested previously
that this riser-card scenario could also be solved with dynamic loading of DT
snippets, but that support seems simple pretty far from being merged.
Zev Weiss Oct. 22, 2021, 4:27 p.m. UTC | #5
On Fri, Oct 22, 2021 at 01:57:21AM PDT, Greg Kroah-Hartman wrote:
>On Fri, Oct 22, 2021 at 01:32:32AM -0700, Zev Weiss wrote:
>> On Thu, Oct 21, 2021 at 11:46:56PM PDT, Greg Kroah-Hartman wrote:
>> > On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote:
>> > > Devices whose fwnodes are marked as reserved are instantiated, but
>> > > will not have a driver bound to them unless userspace explicitly
>> > > requests it by writing to a 'bind' sysfs file.  This is to enable
>> > > devices that may require special (userspace-mediated) preparation
>> > > before a driver can safely probe them.
>> > >
>> > > Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
>> > > ---
>> > >  drivers/base/bus.c            |  2 +-
>> > >  drivers/base/dd.c             | 13 ++++++++-----
>> > >  drivers/dma/idxd/compat.c     |  3 +--
>> > >  drivers/vfio/mdev/mdev_core.c |  2 +-
>> > >  include/linux/device.h        | 14 +++++++++++++-
>> > >  5 files changed, 24 insertions(+), 10 deletions(-)
>> >
>> > Ugh, no, I don't really want to add yet-another-state to the driver core
>> > like this.  Why are these devices even in the kernel with a driver that
>> > wants to bind to them registered if the driver somehow should NOT be
>> > bound to it?  Shouldn't all of that logic be in the crazy driver itself
>> > as that is a very rare and odd thing to do that the driver core should
>> > not care about at all.
>> >
>> > And why does a device need userspace interaction at all?  Again, why
>> > would the driver not know about this and handle it all directly?
>> >
>>
>> Let me expand a bit more on the details of the specific situation I'm
>> dealing with...
>>
>> On a server motherboard we've got a host CPU (Xeon, Epyc, POWER, etc.) and a
>> baseboard management controller, or BMC (typically an ARM SoC, an ASPEED
>> AST2500 in my case).  The host CPU's firmware (BIOS/UEFI, ME firmware, etc.)
>> lives in a SPI flash chip.  Because it's the host's firmware, that flash
>> chip is connected to and generally (by default) under the control of the
>> host CPU.
>>
>> But we also want the BMC to be able to perform out-of-band updates to the
>> host's firmware, so the flash is *also* connected to the BMC.  There's an
>> external mux (controlled by a GPIO output driven by the BMC) that switches
>> which processor (host or BMC) is actually driving the SPI signals to the
>> flash chip, but there's a bunch of other stuff that's also required before
>> the BMC can flip that switch and take control of the SPI interface:
>>
>>  - the BMC needs to track (and potentially alter) the host's power state
>> to ensure it's not running (in OpenBMC the existing logic for this is    an
>> entire non-trivial userspace daemon unto itself)
>>
>>  - it needs to twiddle some other GPIOs to put the ME into recovery mode
>>
>>  - it needs to exchange some IPMI messages with the ME to confirm it got
>> into recovery mode
>>
>> (Some of the details here are specific to the particular motherboard I'm
>> working with, but I'd guess other systems probably have broadly similar
>> requirements.)
>>
>> The firmware flash (or at least the BMC's side of the mux in front of it) is
>> attached to a spi-nor controller that's well supported by an existing MTD
>> driver (aspeed-smc), but that driver can't safely probe the chip until all
>> the stuff described above has been done.  In particular, this means we can't
>> reasonably bind the driver to that device during the normal
>> device-discovery/driver-binding done in the BMC's boot process (nor do we
>> want to, as that would pull the rug out from under the running host).  We
>> basically only ever want to touch that SPI interface when a user (sysadmin
>> using the BMC, let's say) has explicitly initiated an out-of-band firmware
>> update.
>>
>> So we want the kernel to be aware of the device's existence (so that we
>> *can* bind a driver to it when needed), but we don't want it touching the
>> device unless we really ask for it.
>>
>> Does that help clarify the motivation for wanting this functionality?
>
>Sure, then just do this type of thing in the driver itself.  Do not have
>any matching "ids" for this hardware it so that the bus will never call
>the probe function for this hardware _until_ a manual write happens to
>the driver's "bind" sysfs file.
>

Perhaps I'm misunderstanding what you're suggesting, but if I just 
change the DT "compatible" string so that the device doesn't match the 
driver and then try to manually bind it, the driver_match_device() check 
in bind_store() prevents that manual bind from actually happening.


Thanks,
Zev
Greg KH Oct. 23, 2021, 8:55 a.m. UTC | #6
On Fri, Oct 22, 2021 at 09:27:41AM -0700, Zev Weiss wrote:
> On Fri, Oct 22, 2021 at 01:57:21AM PDT, Greg Kroah-Hartman wrote:
> > On Fri, Oct 22, 2021 at 01:32:32AM -0700, Zev Weiss wrote:
> > > On Thu, Oct 21, 2021 at 11:46:56PM PDT, Greg Kroah-Hartman wrote:
> > > > On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote:
> > > > > Devices whose fwnodes are marked as reserved are instantiated, but
> > > > > will not have a driver bound to them unless userspace explicitly
> > > > > requests it by writing to a 'bind' sysfs file.  This is to enable
> > > > > devices that may require special (userspace-mediated) preparation
> > > > > before a driver can safely probe them.
> > > > >
> > > > > Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
> > > > > ---
> > > > >  drivers/base/bus.c            |  2 +-
> > > > >  drivers/base/dd.c             | 13 ++++++++-----
> > > > >  drivers/dma/idxd/compat.c     |  3 +--
> > > > >  drivers/vfio/mdev/mdev_core.c |  2 +-
> > > > >  include/linux/device.h        | 14 +++++++++++++-
> > > > >  5 files changed, 24 insertions(+), 10 deletions(-)
> > > >
> > > > Ugh, no, I don't really want to add yet-another-state to the driver core
> > > > like this.  Why are these devices even in the kernel with a driver that
> > > > wants to bind to them registered if the driver somehow should NOT be
> > > > bound to it?  Shouldn't all of that logic be in the crazy driver itself
> > > > as that is a very rare and odd thing to do that the driver core should
> > > > not care about at all.
> > > >
> > > > And why does a device need userspace interaction at all?  Again, why
> > > > would the driver not know about this and handle it all directly?
> > > >
> > > 
> > > Let me expand a bit more on the details of the specific situation I'm
> > > dealing with...
> > > 
> > > On a server motherboard we've got a host CPU (Xeon, Epyc, POWER, etc.) and a
> > > baseboard management controller, or BMC (typically an ARM SoC, an ASPEED
> > > AST2500 in my case).  The host CPU's firmware (BIOS/UEFI, ME firmware, etc.)
> > > lives in a SPI flash chip.  Because it's the host's firmware, that flash
> > > chip is connected to and generally (by default) under the control of the
> > > host CPU.
> > > 
> > > But we also want the BMC to be able to perform out-of-band updates to the
> > > host's firmware, so the flash is *also* connected to the BMC.  There's an
> > > external mux (controlled by a GPIO output driven by the BMC) that switches
> > > which processor (host or BMC) is actually driving the SPI signals to the
> > > flash chip, but there's a bunch of other stuff that's also required before
> > > the BMC can flip that switch and take control of the SPI interface:
> > > 
> > >  - the BMC needs to track (and potentially alter) the host's power state
> > > to ensure it's not running (in OpenBMC the existing logic for this is    an
> > > entire non-trivial userspace daemon unto itself)
> > > 
> > >  - it needs to twiddle some other GPIOs to put the ME into recovery mode
> > > 
> > >  - it needs to exchange some IPMI messages with the ME to confirm it got
> > > into recovery mode
> > > 
> > > (Some of the details here are specific to the particular motherboard I'm
> > > working with, but I'd guess other systems probably have broadly similar
> > > requirements.)
> > > 
> > > The firmware flash (or at least the BMC's side of the mux in front of it) is
> > > attached to a spi-nor controller that's well supported by an existing MTD
> > > driver (aspeed-smc), but that driver can't safely probe the chip until all
> > > the stuff described above has been done.  In particular, this means we can't
> > > reasonably bind the driver to that device during the normal
> > > device-discovery/driver-binding done in the BMC's boot process (nor do we
> > > want to, as that would pull the rug out from under the running host).  We
> > > basically only ever want to touch that SPI interface when a user (sysadmin
> > > using the BMC, let's say) has explicitly initiated an out-of-band firmware
> > > update.
> > > 
> > > So we want the kernel to be aware of the device's existence (so that we
> > > *can* bind a driver to it when needed), but we don't want it touching the
> > > device unless we really ask for it.
> > > 
> > > Does that help clarify the motivation for wanting this functionality?
> > 
> > Sure, then just do this type of thing in the driver itself.  Do not have
> > any matching "ids" for this hardware it so that the bus will never call
> > the probe function for this hardware _until_ a manual write happens to
> > the driver's "bind" sysfs file.
> > 
> 
> Perhaps I'm misunderstanding what you're suggesting, but if I just change
> the DT "compatible" string so that the device doesn't match the driver and
> then try to manually bind it, the driver_match_device() check in
> bind_store() prevents that manual bind from actually happening.

Hm, I thought the bus had the ability to 'override' this somehow.  The
bus does get the callback in driver_match_device() so maybe do the logic
in there?  Somehow this works for other devices and busses, so there
must be a way it happens...

thanks,

greg k-h
Greg KH Oct. 23, 2021, 8:56 a.m. UTC | #7
On Fri, Oct 22, 2021 at 10:18:11AM -0500, Patrick Williams wrote:
> Hi Greg,
> 
> On Fri, Oct 22, 2021 at 10:57:21AM +0200, Greg Kroah-Hartman wrote:
> > On Fri, Oct 22, 2021 at 01:32:32AM -0700, Zev Weiss wrote:
> > > On Thu, Oct 21, 2021 at 11:46:56PM PDT, Greg Kroah-Hartman wrote:
> > > > On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote:
> 
> > > So we want the kernel to be aware of the device's existence (so that we
> > > *can* bind a driver to it when needed), but we don't want it touching the
> > > device unless we really ask for it.
> > > 
> > > Does that help clarify the motivation for wanting this functionality?
> > 
> > Sure, then just do this type of thing in the driver itself.  Do not have
> > any matching "ids" for this hardware it so that the bus will never call
> > the probe function for this hardware _until_ a manual write happens to
> > the driver's "bind" sysfs file.
> 
> It sounds like you're suggesting a change to one particular driver to satisfy
> this one particular case (and maybe I'm just not understanding your suggestion).
> For a BMC, this is a pretty regular situation and not just as one-off as Zev's
> example.
> 
> Another good example is where a system can have optional riser cards with a
> whole tree of devices that might be on that riser card (and there might be
> different variants of a riser card that could go in the same slot).  Usually
> there is an EEPROM of some sort at a well-known address that can be parsed to
> identify which kind of riser card it is and then the appropriate sub-devices can
> be enumerated.  That EEPROM parsing is something that is currently done in
> userspace due to the complexity and often vendor-specific nature of it.
> 
> Many of these devices require quite a bit more configuration information than
> can be passed along a `bind` call.  I believe it has been suggested previously
> that this riser-card scenario could also be solved with dynamic loading of DT
> snippets, but that support seems simple pretty far from being merged.

Then work to get the DT code merged!  Do not try to create
yet-another-way of doing things here if DT overlays is the correct
solution here (and it seems like it is.)

thanks,

greg k-h
Frank Rowand Oct. 25, 2021, 5:38 a.m. UTC | #8
On 10/23/21 3:56 AM, Greg Kroah-Hartman wrote:
> On Fri, Oct 22, 2021 at 10:18:11AM -0500, Patrick Williams wrote:
>> Hi Greg,
>>
>> On Fri, Oct 22, 2021 at 10:57:21AM +0200, Greg Kroah-Hartman wrote:
>>> On Fri, Oct 22, 2021 at 01:32:32AM -0700, Zev Weiss wrote:
>>>> On Thu, Oct 21, 2021 at 11:46:56PM PDT, Greg Kroah-Hartman wrote:
>>>>> On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote:
>>
>>>> So we want the kernel to be aware of the device's existence (so that we
>>>> *can* bind a driver to it when needed), but we don't want it touching the
>>>> device unless we really ask for it.
>>>>
>>>> Does that help clarify the motivation for wanting this functionality?
>>>
>>> Sure, then just do this type of thing in the driver itself.  Do not have
>>> any matching "ids" for this hardware it so that the bus will never call
>>> the probe function for this hardware _until_ a manual write happens to
>>> the driver's "bind" sysfs file.
>>
>> It sounds like you're suggesting a change to one particular driver to satisfy
>> this one particular case (and maybe I'm just not understanding your suggestion).
>> For a BMC, this is a pretty regular situation and not just as one-off as Zev's
>> example.
>>
>> Another good example is where a system can have optional riser cards with a
>> whole tree of devices that might be on that riser card (and there might be
>> different variants of a riser card that could go in the same slot).  Usually
>> there is an EEPROM of some sort at a well-known address that can be parsed to
>> identify which kind of riser card it is and then the appropriate sub-devices can
>> be enumerated.  That EEPROM parsing is something that is currently done in
>> userspace due to the complexity and often vendor-specific nature of it.
>>
>> Many of these devices require quite a bit more configuration information than
>> can be passed along a `bind` call.  I believe it has been suggested previously
>> that this riser-card scenario could also be solved with dynamic loading of DT
>> snippets, but that support seems simple pretty far from being merged.
> 
> Then work to get the DT code merged!  Do not try to create
> yet-another-way of doing things here if DT overlays is the correct
> solution here (and it seems like it is.)

I don't think this is a case that fits the overlay model.

We know what the description of the device is (which is what devicetree
is all about), but the device is to be shared between the Linux kernel
and some other entity, such as the firmware or another OS.  The issue
to be resolved is how to describe that the device is to be shared (in
this case exclusively by the kernel _or_ by the other entity at any
given moment), and how to move ownership of the device between the
Linux kernel and the other entity.

In the scenario presented by Zev, it is suggested that a user space
agent will be involved in deciding which entity owns the device and
to tell the Linux kernel when to take ownership of the device (and
presumably when to relinquish ownership, although we haven't seen
the implementation of relinquishing ownership yet).  One could
imagine direct communication between the driver and the other
entity to mediate ownership.  That seems like a driver specific
defined choice to me, though if there are enough different drivers
facing this situation then eventually a shared framework would
make sense.

So to step back and think architecture, it seems to me that the
devicetree needs to specify in the node describing the shared
device that the device must be (1) owned exclusively by either
the Linux kernel or some other entity, with a signalling method
between the Linux kernel and the other entity being defined
(possibly by information in the node or possibly by the definition
of the driver) or (2) actively shared by both the Linux
kernel and the other entity.  Actively shared may or may not be
possible, based on the specific hardware.  For example, if a single
contains some bits controlled by the Linux kernel and other bits
controlled by the other entity, then it can be difficult for one
of the two entities to atomically modify the register in coordination
with the other entity.  Obviously case 1 is much simpler than case 2,
I'm just trying to create a picture of the potential cases.  In a
simpler version of case 2, each entity would have control of their
own set of registers.

Diverging away from the overlay question, to comment on the
"status" property mentioned elsewhere in this thread, I do not
think that a status value of "reserved" is an adequate method
of conveying all of the information needed by the above range
of scenarios.

-Frank

> 
> thanks,
> 
> greg k-h
>
Greg KH Oct. 25, 2021, 6:15 a.m. UTC | #9
On Mon, Oct 25, 2021 at 12:38:08AM -0500, Frank Rowand wrote:
> On 10/23/21 3:56 AM, Greg Kroah-Hartman wrote:
> > On Fri, Oct 22, 2021 at 10:18:11AM -0500, Patrick Williams wrote:
> >> Hi Greg,
> >>
> >> On Fri, Oct 22, 2021 at 10:57:21AM +0200, Greg Kroah-Hartman wrote:
> >>> On Fri, Oct 22, 2021 at 01:32:32AM -0700, Zev Weiss wrote:
> >>>> On Thu, Oct 21, 2021 at 11:46:56PM PDT, Greg Kroah-Hartman wrote:
> >>>>> On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote:
> >>
> >>>> So we want the kernel to be aware of the device's existence (so that we
> >>>> *can* bind a driver to it when needed), but we don't want it touching the
> >>>> device unless we really ask for it.
> >>>>
> >>>> Does that help clarify the motivation for wanting this functionality?
> >>>
> >>> Sure, then just do this type of thing in the driver itself.  Do not have
> >>> any matching "ids" for this hardware it so that the bus will never call
> >>> the probe function for this hardware _until_ a manual write happens to
> >>> the driver's "bind" sysfs file.
> >>
> >> It sounds like you're suggesting a change to one particular driver to satisfy
> >> this one particular case (and maybe I'm just not understanding your suggestion).
> >> For a BMC, this is a pretty regular situation and not just as one-off as Zev's
> >> example.
> >>
> >> Another good example is where a system can have optional riser cards with a
> >> whole tree of devices that might be on that riser card (and there might be
> >> different variants of a riser card that could go in the same slot).  Usually
> >> there is an EEPROM of some sort at a well-known address that can be parsed to
> >> identify which kind of riser card it is and then the appropriate sub-devices can
> >> be enumerated.  That EEPROM parsing is something that is currently done in
> >> userspace due to the complexity and often vendor-specific nature of it.
> >>
> >> Many of these devices require quite a bit more configuration information than
> >> can be passed along a `bind` call.  I believe it has been suggested previously
> >> that this riser-card scenario could also be solved with dynamic loading of DT
> >> snippets, but that support seems simple pretty far from being merged.
> > 
> > Then work to get the DT code merged!  Do not try to create
> > yet-another-way of doing things here if DT overlays is the correct
> > solution here (and it seems like it is.)
> 
> I don't think this is a case that fits the overlay model.
> 
> We know what the description of the device is (which is what devicetree
> is all about), but the device is to be shared between the Linux kernel
> and some other entity, such as the firmware or another OS.  The issue
> to be resolved is how to describe that the device is to be shared (in
> this case exclusively by the kernel _or_ by the other entity at any
> given moment), and how to move ownership of the device between the
> Linux kernel and the other entity.
> 
> In the scenario presented by Zev, it is suggested that a user space
> agent will be involved in deciding which entity owns the device and
> to tell the Linux kernel when to take ownership of the device (and
> presumably when to relinquish ownership, although we haven't seen
> the implementation of relinquishing ownership yet).  One could
> imagine direct communication between the driver and the other
> entity to mediate ownership.  That seems like a driver specific
> defined choice to me, though if there are enough different drivers
> facing this situation then eventually a shared framework would
> make sense.

We have the bind/unbind ability today, from userspace, that can control
this.  Why not just have Linux grab the device when it boots, and then
when userspace wants to "give the device up", it writes to "unbind" in
sysfs, and then when all is done, it writes to the "bind" file and then
Linux takes back over.

Unless for some reason Linux should _not_ grab the device when booting,
then things get messier, as we have seen in this thread.

thanks,

greg k-h
Patrick Williams Oct. 25, 2021, 11:44 a.m. UTC | #10
On Mon, Oct 25, 2021 at 08:15:41AM +0200, Greg Kroah-Hartman wrote:
> On Mon, Oct 25, 2021 at 12:38:08AM -0500, Frank Rowand wrote:
> > On 10/23/21 3:56 AM, Greg Kroah-Hartman wrote:
 
> We have the bind/unbind ability today, from userspace, that can control
> this.  Why not just have Linux grab the device when it boots, and then
> when userspace wants to "give the device up", it writes to "unbind" in
> sysfs, and then when all is done, it writes to the "bind" file and then
> Linux takes back over.
> 
> Unless for some reason Linux should _not_ grab the device when booting,
> then things get messier, as we have seen in this thread.

This is probably more typical on a BMC than atypical.  The systems often require
the BMC (running Linux) to be able to reboot independently from the managed host
(running anything).  In the example Zev gave, the BMC rebooting would rip away
the BIOS chip from the running host.

The BMC almost always needs to come up in a "I don't know what could possibly be
going on in the system" state and re-discover where the system was left off.
Andy Shevchenko Oct. 25, 2021, 12:58 p.m. UTC | #11
On Mon, Oct 25, 2021 at 06:44:26AM -0500, Patrick Williams wrote:
> On Mon, Oct 25, 2021 at 08:15:41AM +0200, Greg Kroah-Hartman wrote:
> > On Mon, Oct 25, 2021 at 12:38:08AM -0500, Frank Rowand wrote:
> > > On 10/23/21 3:56 AM, Greg Kroah-Hartman wrote:
>  
> > We have the bind/unbind ability today, from userspace, that can control
> > this.  Why not just have Linux grab the device when it boots, and then
> > when userspace wants to "give the device up", it writes to "unbind" in
> > sysfs, and then when all is done, it writes to the "bind" file and then
> > Linux takes back over.
> > 
> > Unless for some reason Linux should _not_ grab the device when booting,
> > then things get messier, as we have seen in this thread.
> 
> This is probably more typical on a BMC than atypical.  The systems often require
> the BMC (running Linux) to be able to reboot independently from the managed host
> (running anything).  In the example Zev gave, the BMC rebooting would rip away
> the BIOS chip from the running host.
> 
> The BMC almost always needs to come up in a "I don't know what could possibly be
> going on in the system" state and re-discover where the system was left off.

Isn't it an architectural issue then?
Patrick Williams Oct. 25, 2021, 1:20 p.m. UTC | #12
On Mon, Oct 25, 2021 at 03:58:25PM +0300, Andy Shevchenko wrote:
> On Mon, Oct 25, 2021 at 06:44:26AM -0500, Patrick Williams wrote:
> > On Mon, Oct 25, 2021 at 08:15:41AM +0200, Greg Kroah-Hartman wrote:
> > > On Mon, Oct 25, 2021 at 12:38:08AM -0500, Frank Rowand wrote:
> > > > On 10/23/21 3:56 AM, Greg Kroah-Hartman wrote:
> >  
> > > We have the bind/unbind ability today, from userspace, that can control
> > > this.  Why not just have Linux grab the device when it boots, and then
> > > when userspace wants to "give the device up", it writes to "unbind" in
> > > sysfs, and then when all is done, it writes to the "bind" file and then
> > > Linux takes back over.
> > > 
> > > Unless for some reason Linux should _not_ grab the device when booting,
> > > then things get messier, as we have seen in this thread.
> > 
> > This is probably more typical on a BMC than atypical.  The systems often require
> > the BMC (running Linux) to be able to reboot independently from the managed host
> > (running anything).  In the example Zev gave, the BMC rebooting would rip away
> > the BIOS chip from the running host.
> > 
> > The BMC almost always needs to come up in a "I don't know what could possibly be
> > going on in the system" state and re-discover where the system was left off.
> 
> Isn't it an architectural issue then?

I'm not sure what "it" you are referring to here.

I was trying to explain why starting in "bind" state is not a good idea for a
BMC in most of these cases where we want to be able to dynamically add a device.
Greg KH Oct. 25, 2021, 1:34 p.m. UTC | #13
On Mon, Oct 25, 2021 at 08:20:05AM -0500, Patrick Williams wrote:
> On Mon, Oct 25, 2021 at 03:58:25PM +0300, Andy Shevchenko wrote:
> > On Mon, Oct 25, 2021 at 06:44:26AM -0500, Patrick Williams wrote:
> > > On Mon, Oct 25, 2021 at 08:15:41AM +0200, Greg Kroah-Hartman wrote:
> > > > On Mon, Oct 25, 2021 at 12:38:08AM -0500, Frank Rowand wrote:
> > > > > On 10/23/21 3:56 AM, Greg Kroah-Hartman wrote:
> > >  
> > > > We have the bind/unbind ability today, from userspace, that can control
> > > > this.  Why not just have Linux grab the device when it boots, and then
> > > > when userspace wants to "give the device up", it writes to "unbind" in
> > > > sysfs, and then when all is done, it writes to the "bind" file and then
> > > > Linux takes back over.
> > > > 
> > > > Unless for some reason Linux should _not_ grab the device when booting,
> > > > then things get messier, as we have seen in this thread.
> > > 
> > > This is probably more typical on a BMC than atypical.  The systems often require
> > > the BMC (running Linux) to be able to reboot independently from the managed host
> > > (running anything).  In the example Zev gave, the BMC rebooting would rip away
> > > the BIOS chip from the running host.
> > > 
> > > The BMC almost always needs to come up in a "I don't know what could possibly be
> > > going on in the system" state and re-discover where the system was left off.
> > 
> > Isn't it an architectural issue then?
> 
> I'm not sure what "it" you are referring to here.
> 
> I was trying to explain why starting in "bind" state is not a good idea for a
> BMC in most of these cases where we want to be able to dynamically add a device.

I think "it" is "something needs to be the moderator between the two
operating systems".  What is the external entity that handles the
switching between the two?

thanks,

greg k-h
Patrick Williams Oct. 25, 2021, 2:02 p.m. UTC | #14
On Mon, Oct 25, 2021 at 03:34:05PM +0200, Greg Kroah-Hartman wrote:
> On Mon, Oct 25, 2021 at 08:20:05AM -0500, Patrick Williams wrote:
> > On Mon, Oct 25, 2021 at 03:58:25PM +0300, Andy Shevchenko wrote:
> > > On Mon, Oct 25, 2021 at 06:44:26AM -0500, Patrick Williams wrote:
> > > > On Mon, Oct 25, 2021 at 08:15:41AM +0200, Greg Kroah-Hartman wrote:
> > > > > On Mon, Oct 25, 2021 at 12:38:08AM -0500, Frank Rowand wrote:
> > > > > > On 10/23/21 3:56 AM, Greg Kroah-Hartman wrote:
> > > >  
> > > > > We have the bind/unbind ability today, from userspace, that can control
> > > > > this.  Why not just have Linux grab the device when it boots, and then
> > > > > when userspace wants to "give the device up", it writes to "unbind" in
> > > > > sysfs, and then when all is done, it writes to the "bind" file and then
> > > > > Linux takes back over.
> > > > > 
> > > > > Unless for some reason Linux should _not_ grab the device when booting,
> > > > > then things get messier, as we have seen in this thread.
> > > > 
> > > > This is probably more typical on a BMC than atypical.  The systems often require
> > > > the BMC (running Linux) to be able to reboot independently from the managed host
> > > > (running anything).  In the example Zev gave, the BMC rebooting would rip away
> > > > the BIOS chip from the running host.
> > > > 
> > > > The BMC almost always needs to come up in a "I don't know what could possibly be
> > > > going on in the system" state and re-discover where the system was left off.
> > > 
> > > Isn't it an architectural issue then?
> > 
> > I'm not sure what "it" you are referring to here.
> > 
> > I was trying to explain why starting in "bind" state is not a good idea for a
> > BMC in most of these cases where we want to be able to dynamically add a device.
> 
> I think "it" is "something needs to be the moderator between the two
> operating systems".  What is the external entity that handles the
> switching between the two?

Ah, ok.

Those usually end up being system / device specific.  In the case of the BIOS
flash, most designs I've seen use a SPI mux between the BMC and the host
processor or IO hub (PCH on Xeons).  The BMC has a GPIO to control the mux.

As far as state, the BMC on start-up will go through a set of discovery code to
figure out where it left the system prior to getting reset.  That involves
looking at the power subsystem and usually doing some kind of query to the host
to see if it is alive.  These queries are mostly system / host-processor design
specific.  I've seen anything from an IPMI/IPMB message alert from the BMC to
the BIOS to ask "are you alive" to reading host processor state over JTAG to
figure out if the processors are "making progress".
Greg KH Oct. 25, 2021, 2:09 p.m. UTC | #15
On Mon, Oct 25, 2021 at 09:02:40AM -0500, Patrick Williams wrote:
> On Mon, Oct 25, 2021 at 03:34:05PM +0200, Greg Kroah-Hartman wrote:
> > On Mon, Oct 25, 2021 at 08:20:05AM -0500, Patrick Williams wrote:
> > > On Mon, Oct 25, 2021 at 03:58:25PM +0300, Andy Shevchenko wrote:
> > > > On Mon, Oct 25, 2021 at 06:44:26AM -0500, Patrick Williams wrote:
> > > > > On Mon, Oct 25, 2021 at 08:15:41AM +0200, Greg Kroah-Hartman wrote:
> > > > > > On Mon, Oct 25, 2021 at 12:38:08AM -0500, Frank Rowand wrote:
> > > > > > > On 10/23/21 3:56 AM, Greg Kroah-Hartman wrote:
> > > > >  
> > > > > > We have the bind/unbind ability today, from userspace, that can control
> > > > > > this.  Why not just have Linux grab the device when it boots, and then
> > > > > > when userspace wants to "give the device up", it writes to "unbind" in
> > > > > > sysfs, and then when all is done, it writes to the "bind" file and then
> > > > > > Linux takes back over.
> > > > > > 
> > > > > > Unless for some reason Linux should _not_ grab the device when booting,
> > > > > > then things get messier, as we have seen in this thread.
> > > > > 
> > > > > This is probably more typical on a BMC than atypical.  The systems often require
> > > > > the BMC (running Linux) to be able to reboot independently from the managed host
> > > > > (running anything).  In the example Zev gave, the BMC rebooting would rip away
> > > > > the BIOS chip from the running host.
> > > > > 
> > > > > The BMC almost always needs to come up in a "I don't know what could possibly be
> > > > > going on in the system" state and re-discover where the system was left off.
> > > > 
> > > > Isn't it an architectural issue then?
> > > 
> > > I'm not sure what "it" you are referring to here.
> > > 
> > > I was trying to explain why starting in "bind" state is not a good idea for a
> > > BMC in most of these cases where we want to be able to dynamically add a device.
> > 
> > I think "it" is "something needs to be the moderator between the two
> > operating systems".  What is the external entity that handles the
> > switching between the two?
> 
> Ah, ok.
> 
> Those usually end up being system / device specific.  In the case of the BIOS
> flash, most designs I've seen use a SPI mux between the BMC and the host
> processor or IO hub (PCH on Xeons).  The BMC has a GPIO to control the mux.
> 
> As far as state, the BMC on start-up will go through a set of discovery code to
> figure out where it left the system prior to getting reset.  That involves
> looking at the power subsystem and usually doing some kind of query to the host
> to see if it is alive.  These queries are mostly system / host-processor design
> specific.  I've seen anything from an IPMI/IPMB message alert from the BMC to
> the BIOS to ask "are you alive" to reading host processor state over JTAG to
> figure out if the processors are "making progress".

But which processor is "in control" here over the hardware?  What method
is used to pass the device from one CPU to another from a logical point
of view?  Sounds like it is another driver that needs to handle all of
this, so why not have that be the one that adds/removes the devices
under control here?

thanks,

greg k-h
Patrick Williams Oct. 25, 2021, 3:54 p.m. UTC | #16
On Mon, Oct 25, 2021 at 04:09:59PM +0200, Greg Kroah-Hartman wrote:
> On Mon, Oct 25, 2021 at 09:02:40AM -0500, Patrick Williams wrote:
> > On Mon, Oct 25, 2021 at 03:34:05PM +0200, Greg Kroah-Hartman wrote:
> > > On Mon, Oct 25, 2021 at 08:20:05AM -0500, Patrick Williams wrote:
> > > I think "it" is "something needs to be the moderator between the two
> > > operating systems".  What is the external entity that handles the
> > > switching between the two?
> > 
> > Ah, ok.
> > 
> > Those usually end up being system / device specific.  In the case of the BIOS
> > flash, most designs I've seen use a SPI mux between the BMC and the host
> > processor or IO hub (PCH on Xeons).  The BMC has a GPIO to control the mux.
> > 
> > As far as state, the BMC on start-up will go through a set of discovery code to
> > figure out where it left the system prior to getting reset.  That involves
> > looking at the power subsystem and usually doing some kind of query to the host
> > to see if it is alive.  These queries are mostly system / host-processor design
> > specific.  I've seen anything from an IPMI/IPMB message alert from the BMC to
> > the BIOS to ask "are you alive" to reading host processor state over JTAG to
> > figure out if the processors are "making progress".
> 
> But which processor is "in control" here over the hardware?  

The BMC.  It owns the GPIO that controls the SPI mux.  

But, the BMC is responsible for doing all operations in a way that doesn't mess
up the running host processor(s).  Pulling away the SPI flash containing the
BIOS code at an incorrect time might do that.

> What method
> is used to pass the device from one CPU to another from a logical point
> of view?  

The state of the server as a whole is determined and maintained by the BMC.  I'm
simplifying here a bit but the operation "turn on the host processors" implies
"the host processors will access the BIOS" so the BMC must ensure "SPI mux is
switched towards the host" before "turn on the host processors".

> Sounds like it is another driver that needs to handle all of
> this, so why not have that be the one that adds/removes the devices
> under control here?

If what you're describing is moving all of the state control logic into the
kernel, I don't think that is feasible.  For some systems it would mean moving
yet another entire IPMI stack into the kernel tree.  On others it might be
somewhat simpler, but it is still a good amount of code.  We could probably
write up more details on the scope of this.

If what you're describing is a small driver, similar to the board support
drivers that were used before the device tree, that instantiates subordinate
devices it doesn't seem like an unreasonable alternative to DT overlays to me
(for whatever my limited kernel contribution experience counts for).
Greg KH Oct. 25, 2021, 6:36 p.m. UTC | #17
On Mon, Oct 25, 2021 at 10:54:21AM -0500, Patrick Williams wrote:
> On Mon, Oct 25, 2021 at 04:09:59PM +0200, Greg Kroah-Hartman wrote:
> > On Mon, Oct 25, 2021 at 09:02:40AM -0500, Patrick Williams wrote:
> > > On Mon, Oct 25, 2021 at 03:34:05PM +0200, Greg Kroah-Hartman wrote:
> > > > On Mon, Oct 25, 2021 at 08:20:05AM -0500, Patrick Williams wrote:
> > > > I think "it" is "something needs to be the moderator between the two
> > > > operating systems".  What is the external entity that handles the
> > > > switching between the two?
> > > 
> > > Ah, ok.
> > > 
> > > Those usually end up being system / device specific.  In the case of the BIOS
> > > flash, most designs I've seen use a SPI mux between the BMC and the host
> > > processor or IO hub (PCH on Xeons).  The BMC has a GPIO to control the mux.
> > > 
> > > As far as state, the BMC on start-up will go through a set of discovery code to
> > > figure out where it left the system prior to getting reset.  That involves
> > > looking at the power subsystem and usually doing some kind of query to the host
> > > to see if it is alive.  These queries are mostly system / host-processor design
> > > specific.  I've seen anything from an IPMI/IPMB message alert from the BMC to
> > > the BIOS to ask "are you alive" to reading host processor state over JTAG to
> > > figure out if the processors are "making progress".
> > 
> > But which processor is "in control" here over the hardware?  
> 
> The BMC.  It owns the GPIO that controls the SPI mux.  
> 
> But, the BMC is responsible for doing all operations in a way that doesn't mess
> up the running host processor(s).  Pulling away the SPI flash containing the
> BIOS code at an incorrect time might do that.
> 
> > What method
> > is used to pass the device from one CPU to another from a logical point
> > of view?  
> 
> The state of the server as a whole is determined and maintained by the BMC.  I'm
> simplifying here a bit but the operation "turn on the host processors" implies
> "the host processors will access the BIOS" so the BMC must ensure "SPI mux is
> switched towards the host" before "turn on the host processors".
> 
> > Sounds like it is another driver that needs to handle all of
> > this, so why not have that be the one that adds/removes the devices
> > under control here?
> 
> If what you're describing is moving all of the state control logic into the
> kernel, I don't think that is feasible.  For some systems it would mean moving
> yet another entire IPMI stack into the kernel tree.  On others it might be
> somewhat simpler, but it is still a good amount of code.  We could probably
> write up more details on the scope of this.
> 
> If what you're describing is a small driver, similar to the board support
> drivers that were used before the device tree, that instantiates subordinate
> devices it doesn't seem like an unreasonable alternative to DT overlays to me
> (for whatever my limited kernel contribution experience counts for).
> 

Something has to be here doing the mediation between the two processors
and keeping things straight as to what processor is handling the
hardware when.  I suggest you focus on that first...

Good luck!

greg k-h
diff mbox series

Patch

diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index bdc98c5713d5..8a30f9a02de0 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -211,7 +211,7 @@  static ssize_t bind_store(struct device_driver *drv, const char *buf,
 
 	dev = bus_find_device_by_name(bus, NULL, buf);
 	if (dev && driver_match_device(drv, dev)) {
-		err = device_driver_attach(drv, dev);
+		err = device_driver_attach(drv, dev, DRV_BIND_EXPLICIT);
 		if (!err) {
 			/* success */
 			err = count;
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 68ea1f949daa..ee64740a63d9 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -22,6 +22,7 @@ 
 #include <linux/dma-map-ops.h>
 #include <linux/init.h>
 #include <linux/module.h>
+#include <linux/property.h>
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/async.h>
@@ -727,13 +728,14 @@  void wait_for_device_probe(void)
 }
 EXPORT_SYMBOL_GPL(wait_for_device_probe);
 
-static int __driver_probe_device(struct device_driver *drv, struct device *dev)
+static int __driver_probe_device(struct device_driver *drv, struct device *dev, u32 flags)
 {
 	int ret = 0;
 
 	if (dev->p->dead || !device_is_registered(dev))
 		return -ENODEV;
-	if (dev->driver)
+	if (dev->driver ||
+	    (fwnode_device_is_reserved(dev->fwnode) && !(flags & DRV_BIND_EXPLICIT)))
 		return -EBUSY;
 
 	dev->can_match = true;
@@ -778,7 +780,7 @@  static int driver_probe_device(struct device_driver *drv, struct device *dev)
 	int ret;
 
 	atomic_inc(&probe_count);
-	ret = __driver_probe_device(drv, dev);
+	ret = __driver_probe_device(drv, dev, DRV_BIND_DEFAULT);
 	if (ret == -EPROBE_DEFER || ret == EPROBE_DEFER) {
 		driver_deferred_probe_add(dev);
 
@@ -1052,16 +1054,17 @@  static void __device_driver_unlock(struct device *dev, struct device *parent)
  * device_driver_attach - attach a specific driver to a specific device
  * @drv: Driver to attach
  * @dev: Device to attach it to
+ * @flags: Bitmask of DRV_BIND_* flags
  *
  * Manually attach driver to a device. Will acquire both @dev lock and
  * @dev->parent lock if needed. Returns 0 on success, -ERR on failure.
  */
-int device_driver_attach(struct device_driver *drv, struct device *dev)
+int device_driver_attach(struct device_driver *drv, struct device *dev, u32 flags)
 {
 	int ret;
 
 	__device_driver_lock(dev, dev->parent);
-	ret = __driver_probe_device(drv, dev);
+	ret = __driver_probe_device(drv, dev, flags);
 	__device_driver_unlock(dev, dev->parent);
 
 	/* also return probe errors as normal negative errnos */
diff --git a/drivers/dma/idxd/compat.c b/drivers/dma/idxd/compat.c
index 3df21615f888..51df38dea15a 100644
--- a/drivers/dma/idxd/compat.c
+++ b/drivers/dma/idxd/compat.c
@@ -7,7 +7,6 @@ 
 #include <linux/device/bus.h>
 #include "idxd.h"
 
-extern int device_driver_attach(struct device_driver *drv, struct device *dev);
 extern void device_driver_detach(struct device *dev);
 
 #define DRIVER_ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store)	\
@@ -56,7 +55,7 @@  static ssize_t bind_store(struct device_driver *drv, const char *buf, size_t cou
 	if (!alt_drv)
 		return -ENODEV;
 
-	rc = device_driver_attach(alt_drv, dev);
+	rc = device_driver_attach(alt_drv, dev, DRV_BIND_EXPLICIT);
 	if (rc < 0)
 		return rc;
 
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index b314101237fe..f42c6ec543c8 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -309,7 +309,7 @@  int mdev_device_create(struct mdev_type *type, const guid_t *uuid)
 
 	if (!drv)
 		drv = &vfio_mdev_driver;
-	ret = device_driver_attach(&drv->driver, &mdev->dev);
+	ret = device_driver_attach(&drv->driver, &mdev->dev, DRV_BIND_DEFAULT);
 	if (ret)
 		goto out_del;
 
diff --git a/include/linux/device.h b/include/linux/device.h
index e270cb740b9e..1ada1093799b 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -876,12 +876,24 @@  static inline void *dev_get_platdata(const struct device *dev)
 	return dev->platform_data;
 }
 
+/*
+ * Driver-binding flags (for passing to device_driver_attach())
+ *
+ * DRV_BIND_DEFAULT: a default, automatic bind, e.g. as a result of a device
+ *                   being added for which we already have a driver, or vice
+ *                   versa.
+ * DRV_BIND_EXPLICIT: an explicit, userspace-requested driver bind, e.g. as a
+ *                    result of a write to /sys/bus/.../drivers/.../bind
+ */
+#define DRV_BIND_DEFAULT	0
+#define DRV_BIND_EXPLICIT	BIT(0)
+
 /*
  * Manual binding of a device to driver. See drivers/base/bus.c
  * for information on use.
  */
 int __must_check device_driver_attach(struct device_driver *drv,
-				      struct device *dev);
+				      struct device *dev, u32 flags);
 int __must_check device_bind_driver(struct device *dev);
 void device_release_driver(struct device *dev);
 int  __must_check device_attach(struct device *dev);