Message ID | 20211022020032.26980-5-zev@bewilderbeest.net |
---|---|
State | New |
Headers | show |
Series | driver core, of: support for reserved devices | expand |
On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote: > Devices whose fwnodes are marked as reserved are instantiated, but > will not have a driver bound to them unless userspace explicitly > requests it by writing to a 'bind' sysfs file. This is to enable > devices that may require special (userspace-mediated) preparation > before a driver can safely probe them. > > Signed-off-by: Zev Weiss <zev@bewilderbeest.net> > --- > drivers/base/bus.c | 2 +- > drivers/base/dd.c | 13 ++++++++----- > drivers/dma/idxd/compat.c | 3 +-- > drivers/vfio/mdev/mdev_core.c | 2 +- > include/linux/device.h | 14 +++++++++++++- > 5 files changed, 24 insertions(+), 10 deletions(-) Ugh, no, I don't really want to add yet-another-state to the driver core like this. Why are these devices even in the kernel with a driver that wants to bind to them registered if the driver somehow should NOT be bound to it? Shouldn't all of that logic be in the crazy driver itself as that is a very rare and odd thing to do that the driver core should not care about at all. And why does a device need userspace interaction at all? Again, why would the driver not know about this and handle it all directly? thanks, greg k-h
On Thu, Oct 21, 2021 at 11:46:56PM PDT, Greg Kroah-Hartman wrote: >On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote: >> Devices whose fwnodes are marked as reserved are instantiated, but >> will not have a driver bound to them unless userspace explicitly >> requests it by writing to a 'bind' sysfs file. This is to enable >> devices that may require special (userspace-mediated) preparation >> before a driver can safely probe them. >> >> Signed-off-by: Zev Weiss <zev@bewilderbeest.net> >> --- >> drivers/base/bus.c | 2 +- >> drivers/base/dd.c | 13 ++++++++----- >> drivers/dma/idxd/compat.c | 3 +-- >> drivers/vfio/mdev/mdev_core.c | 2 +- >> include/linux/device.h | 14 +++++++++++++- >> 5 files changed, 24 insertions(+), 10 deletions(-) > >Ugh, no, I don't really want to add yet-another-state to the driver core >like this. Why are these devices even in the kernel with a driver that >wants to bind to them registered if the driver somehow should NOT be >bound to it? Shouldn't all of that logic be in the crazy driver itself >as that is a very rare and odd thing to do that the driver core should >not care about at all. > >And why does a device need userspace interaction at all? Again, why >would the driver not know about this and handle it all directly? > Let me expand a bit more on the details of the specific situation I'm dealing with... On a server motherboard we've got a host CPU (Xeon, Epyc, POWER, etc.) and a baseboard management controller, or BMC (typically an ARM SoC, an ASPEED AST2500 in my case). The host CPU's firmware (BIOS/UEFI, ME firmware, etc.) lives in a SPI flash chip. Because it's the host's firmware, that flash chip is connected to and generally (by default) under the control of the host CPU. But we also want the BMC to be able to perform out-of-band updates to the host's firmware, so the flash is *also* connected to the BMC. There's an external mux (controlled by a GPIO output driven by the BMC) that switches which processor (host or BMC) is actually driving the SPI signals to the flash chip, but there's a bunch of other stuff that's also required before the BMC can flip that switch and take control of the SPI interface: - the BMC needs to track (and potentially alter) the host's power state to ensure it's not running (in OpenBMC the existing logic for this is an entire non-trivial userspace daemon unto itself) - it needs to twiddle some other GPIOs to put the ME into recovery mode - it needs to exchange some IPMI messages with the ME to confirm it got into recovery mode (Some of the details here are specific to the particular motherboard I'm working with, but I'd guess other systems probably have broadly similar requirements.) The firmware flash (or at least the BMC's side of the mux in front of it) is attached to a spi-nor controller that's well supported by an existing MTD driver (aspeed-smc), but that driver can't safely probe the chip until all the stuff described above has been done. In particular, this means we can't reasonably bind the driver to that device during the normal device-discovery/driver-binding done in the BMC's boot process (nor do we want to, as that would pull the rug out from under the running host). We basically only ever want to touch that SPI interface when a user (sysadmin using the BMC, let's say) has explicitly initiated an out-of-band firmware update. So we want the kernel to be aware of the device's existence (so that we *can* bind a driver to it when needed), but we don't want it touching the device unless we really ask for it. Does that help clarify the motivation for wanting this functionality? Thanks, Zev
On Fri, Oct 22, 2021 at 01:32:32AM -0700, Zev Weiss wrote: > On Thu, Oct 21, 2021 at 11:46:56PM PDT, Greg Kroah-Hartman wrote: > > On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote: > > > Devices whose fwnodes are marked as reserved are instantiated, but > > > will not have a driver bound to them unless userspace explicitly > > > requests it by writing to a 'bind' sysfs file. This is to enable > > > devices that may require special (userspace-mediated) preparation > > > before a driver can safely probe them. > > > > > > Signed-off-by: Zev Weiss <zev@bewilderbeest.net> > > > --- > > > drivers/base/bus.c | 2 +- > > > drivers/base/dd.c | 13 ++++++++----- > > > drivers/dma/idxd/compat.c | 3 +-- > > > drivers/vfio/mdev/mdev_core.c | 2 +- > > > include/linux/device.h | 14 +++++++++++++- > > > 5 files changed, 24 insertions(+), 10 deletions(-) > > > > Ugh, no, I don't really want to add yet-another-state to the driver core > > like this. Why are these devices even in the kernel with a driver that > > wants to bind to them registered if the driver somehow should NOT be > > bound to it? Shouldn't all of that logic be in the crazy driver itself > > as that is a very rare and odd thing to do that the driver core should > > not care about at all. > > > > And why does a device need userspace interaction at all? Again, why > > would the driver not know about this and handle it all directly? > > > > Let me expand a bit more on the details of the specific situation I'm > dealing with... > > On a server motherboard we've got a host CPU (Xeon, Epyc, POWER, etc.) and a > baseboard management controller, or BMC (typically an ARM SoC, an ASPEED > AST2500 in my case). The host CPU's firmware (BIOS/UEFI, ME firmware, etc.) > lives in a SPI flash chip. Because it's the host's firmware, that flash > chip is connected to and generally (by default) under the control of the > host CPU. > > But we also want the BMC to be able to perform out-of-band updates to the > host's firmware, so the flash is *also* connected to the BMC. There's an > external mux (controlled by a GPIO output driven by the BMC) that switches > which processor (host or BMC) is actually driving the SPI signals to the > flash chip, but there's a bunch of other stuff that's also required before > the BMC can flip that switch and take control of the SPI interface: > > - the BMC needs to track (and potentially alter) the host's power state > to ensure it's not running (in OpenBMC the existing logic for this is an > entire non-trivial userspace daemon unto itself) > > - it needs to twiddle some other GPIOs to put the ME into recovery mode > > - it needs to exchange some IPMI messages with the ME to confirm it got > into recovery mode > > (Some of the details here are specific to the particular motherboard I'm > working with, but I'd guess other systems probably have broadly similar > requirements.) > > The firmware flash (or at least the BMC's side of the mux in front of it) is > attached to a spi-nor controller that's well supported by an existing MTD > driver (aspeed-smc), but that driver can't safely probe the chip until all > the stuff described above has been done. In particular, this means we can't > reasonably bind the driver to that device during the normal > device-discovery/driver-binding done in the BMC's boot process (nor do we > want to, as that would pull the rug out from under the running host). We > basically only ever want to touch that SPI interface when a user (sysadmin > using the BMC, let's say) has explicitly initiated an out-of-band firmware > update. > > So we want the kernel to be aware of the device's existence (so that we > *can* bind a driver to it when needed), but we don't want it touching the > device unless we really ask for it. > > Does that help clarify the motivation for wanting this functionality? Sure, then just do this type of thing in the driver itself. Do not have any matching "ids" for this hardware it so that the bus will never call the probe function for this hardware _until_ a manual write happens to the driver's "bind" sysfs file. Then when userspace is done, do a "unbind" write. No driver core changes should be needed at all here. thanks, greg k-h
Hi Greg, On Fri, Oct 22, 2021 at 10:57:21AM +0200, Greg Kroah-Hartman wrote: > On Fri, Oct 22, 2021 at 01:32:32AM -0700, Zev Weiss wrote: > > On Thu, Oct 21, 2021 at 11:46:56PM PDT, Greg Kroah-Hartman wrote: > > > On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote: > > So we want the kernel to be aware of the device's existence (so that we > > *can* bind a driver to it when needed), but we don't want it touching the > > device unless we really ask for it. > > > > Does that help clarify the motivation for wanting this functionality? > > Sure, then just do this type of thing in the driver itself. Do not have > any matching "ids" for this hardware it so that the bus will never call > the probe function for this hardware _until_ a manual write happens to > the driver's "bind" sysfs file. It sounds like you're suggesting a change to one particular driver to satisfy this one particular case (and maybe I'm just not understanding your suggestion). For a BMC, this is a pretty regular situation and not just as one-off as Zev's example. Another good example is where a system can have optional riser cards with a whole tree of devices that might be on that riser card (and there might be different variants of a riser card that could go in the same slot). Usually there is an EEPROM of some sort at a well-known address that can be parsed to identify which kind of riser card it is and then the appropriate sub-devices can be enumerated. That EEPROM parsing is something that is currently done in userspace due to the complexity and often vendor-specific nature of it. Many of these devices require quite a bit more configuration information than can be passed along a `bind` call. I believe it has been suggested previously that this riser-card scenario could also be solved with dynamic loading of DT snippets, but that support seems simple pretty far from being merged.
On Fri, Oct 22, 2021 at 01:57:21AM PDT, Greg Kroah-Hartman wrote: >On Fri, Oct 22, 2021 at 01:32:32AM -0700, Zev Weiss wrote: >> On Thu, Oct 21, 2021 at 11:46:56PM PDT, Greg Kroah-Hartman wrote: >> > On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote: >> > > Devices whose fwnodes are marked as reserved are instantiated, but >> > > will not have a driver bound to them unless userspace explicitly >> > > requests it by writing to a 'bind' sysfs file. This is to enable >> > > devices that may require special (userspace-mediated) preparation >> > > before a driver can safely probe them. >> > > >> > > Signed-off-by: Zev Weiss <zev@bewilderbeest.net> >> > > --- >> > > drivers/base/bus.c | 2 +- >> > > drivers/base/dd.c | 13 ++++++++----- >> > > drivers/dma/idxd/compat.c | 3 +-- >> > > drivers/vfio/mdev/mdev_core.c | 2 +- >> > > include/linux/device.h | 14 +++++++++++++- >> > > 5 files changed, 24 insertions(+), 10 deletions(-) >> > >> > Ugh, no, I don't really want to add yet-another-state to the driver core >> > like this. Why are these devices even in the kernel with a driver that >> > wants to bind to them registered if the driver somehow should NOT be >> > bound to it? Shouldn't all of that logic be in the crazy driver itself >> > as that is a very rare and odd thing to do that the driver core should >> > not care about at all. >> > >> > And why does a device need userspace interaction at all? Again, why >> > would the driver not know about this and handle it all directly? >> > >> >> Let me expand a bit more on the details of the specific situation I'm >> dealing with... >> >> On a server motherboard we've got a host CPU (Xeon, Epyc, POWER, etc.) and a >> baseboard management controller, or BMC (typically an ARM SoC, an ASPEED >> AST2500 in my case). The host CPU's firmware (BIOS/UEFI, ME firmware, etc.) >> lives in a SPI flash chip. Because it's the host's firmware, that flash >> chip is connected to and generally (by default) under the control of the >> host CPU. >> >> But we also want the BMC to be able to perform out-of-band updates to the >> host's firmware, so the flash is *also* connected to the BMC. There's an >> external mux (controlled by a GPIO output driven by the BMC) that switches >> which processor (host or BMC) is actually driving the SPI signals to the >> flash chip, but there's a bunch of other stuff that's also required before >> the BMC can flip that switch and take control of the SPI interface: >> >> - the BMC needs to track (and potentially alter) the host's power state >> to ensure it's not running (in OpenBMC the existing logic for this is an >> entire non-trivial userspace daemon unto itself) >> >> - it needs to twiddle some other GPIOs to put the ME into recovery mode >> >> - it needs to exchange some IPMI messages with the ME to confirm it got >> into recovery mode >> >> (Some of the details here are specific to the particular motherboard I'm >> working with, but I'd guess other systems probably have broadly similar >> requirements.) >> >> The firmware flash (or at least the BMC's side of the mux in front of it) is >> attached to a spi-nor controller that's well supported by an existing MTD >> driver (aspeed-smc), but that driver can't safely probe the chip until all >> the stuff described above has been done. In particular, this means we can't >> reasonably bind the driver to that device during the normal >> device-discovery/driver-binding done in the BMC's boot process (nor do we >> want to, as that would pull the rug out from under the running host). We >> basically only ever want to touch that SPI interface when a user (sysadmin >> using the BMC, let's say) has explicitly initiated an out-of-band firmware >> update. >> >> So we want the kernel to be aware of the device's existence (so that we >> *can* bind a driver to it when needed), but we don't want it touching the >> device unless we really ask for it. >> >> Does that help clarify the motivation for wanting this functionality? > >Sure, then just do this type of thing in the driver itself. Do not have >any matching "ids" for this hardware it so that the bus will never call >the probe function for this hardware _until_ a manual write happens to >the driver's "bind" sysfs file. > Perhaps I'm misunderstanding what you're suggesting, but if I just change the DT "compatible" string so that the device doesn't match the driver and then try to manually bind it, the driver_match_device() check in bind_store() prevents that manual bind from actually happening. Thanks, Zev
On Fri, Oct 22, 2021 at 09:27:41AM -0700, Zev Weiss wrote: > On Fri, Oct 22, 2021 at 01:57:21AM PDT, Greg Kroah-Hartman wrote: > > On Fri, Oct 22, 2021 at 01:32:32AM -0700, Zev Weiss wrote: > > > On Thu, Oct 21, 2021 at 11:46:56PM PDT, Greg Kroah-Hartman wrote: > > > > On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote: > > > > > Devices whose fwnodes are marked as reserved are instantiated, but > > > > > will not have a driver bound to them unless userspace explicitly > > > > > requests it by writing to a 'bind' sysfs file. This is to enable > > > > > devices that may require special (userspace-mediated) preparation > > > > > before a driver can safely probe them. > > > > > > > > > > Signed-off-by: Zev Weiss <zev@bewilderbeest.net> > > > > > --- > > > > > drivers/base/bus.c | 2 +- > > > > > drivers/base/dd.c | 13 ++++++++----- > > > > > drivers/dma/idxd/compat.c | 3 +-- > > > > > drivers/vfio/mdev/mdev_core.c | 2 +- > > > > > include/linux/device.h | 14 +++++++++++++- > > > > > 5 files changed, 24 insertions(+), 10 deletions(-) > > > > > > > > Ugh, no, I don't really want to add yet-another-state to the driver core > > > > like this. Why are these devices even in the kernel with a driver that > > > > wants to bind to them registered if the driver somehow should NOT be > > > > bound to it? Shouldn't all of that logic be in the crazy driver itself > > > > as that is a very rare and odd thing to do that the driver core should > > > > not care about at all. > > > > > > > > And why does a device need userspace interaction at all? Again, why > > > > would the driver not know about this and handle it all directly? > > > > > > > > > > Let me expand a bit more on the details of the specific situation I'm > > > dealing with... > > > > > > On a server motherboard we've got a host CPU (Xeon, Epyc, POWER, etc.) and a > > > baseboard management controller, or BMC (typically an ARM SoC, an ASPEED > > > AST2500 in my case). The host CPU's firmware (BIOS/UEFI, ME firmware, etc.) > > > lives in a SPI flash chip. Because it's the host's firmware, that flash > > > chip is connected to and generally (by default) under the control of the > > > host CPU. > > > > > > But we also want the BMC to be able to perform out-of-band updates to the > > > host's firmware, so the flash is *also* connected to the BMC. There's an > > > external mux (controlled by a GPIO output driven by the BMC) that switches > > > which processor (host or BMC) is actually driving the SPI signals to the > > > flash chip, but there's a bunch of other stuff that's also required before > > > the BMC can flip that switch and take control of the SPI interface: > > > > > > - the BMC needs to track (and potentially alter) the host's power state > > > to ensure it's not running (in OpenBMC the existing logic for this is an > > > entire non-trivial userspace daemon unto itself) > > > > > > - it needs to twiddle some other GPIOs to put the ME into recovery mode > > > > > > - it needs to exchange some IPMI messages with the ME to confirm it got > > > into recovery mode > > > > > > (Some of the details here are specific to the particular motherboard I'm > > > working with, but I'd guess other systems probably have broadly similar > > > requirements.) > > > > > > The firmware flash (or at least the BMC's side of the mux in front of it) is > > > attached to a spi-nor controller that's well supported by an existing MTD > > > driver (aspeed-smc), but that driver can't safely probe the chip until all > > > the stuff described above has been done. In particular, this means we can't > > > reasonably bind the driver to that device during the normal > > > device-discovery/driver-binding done in the BMC's boot process (nor do we > > > want to, as that would pull the rug out from under the running host). We > > > basically only ever want to touch that SPI interface when a user (sysadmin > > > using the BMC, let's say) has explicitly initiated an out-of-band firmware > > > update. > > > > > > So we want the kernel to be aware of the device's existence (so that we > > > *can* bind a driver to it when needed), but we don't want it touching the > > > device unless we really ask for it. > > > > > > Does that help clarify the motivation for wanting this functionality? > > > > Sure, then just do this type of thing in the driver itself. Do not have > > any matching "ids" for this hardware it so that the bus will never call > > the probe function for this hardware _until_ a manual write happens to > > the driver's "bind" sysfs file. > > > > Perhaps I'm misunderstanding what you're suggesting, but if I just change > the DT "compatible" string so that the device doesn't match the driver and > then try to manually bind it, the driver_match_device() check in > bind_store() prevents that manual bind from actually happening. Hm, I thought the bus had the ability to 'override' this somehow. The bus does get the callback in driver_match_device() so maybe do the logic in there? Somehow this works for other devices and busses, so there must be a way it happens... thanks, greg k-h
On Fri, Oct 22, 2021 at 10:18:11AM -0500, Patrick Williams wrote: > Hi Greg, > > On Fri, Oct 22, 2021 at 10:57:21AM +0200, Greg Kroah-Hartman wrote: > > On Fri, Oct 22, 2021 at 01:32:32AM -0700, Zev Weiss wrote: > > > On Thu, Oct 21, 2021 at 11:46:56PM PDT, Greg Kroah-Hartman wrote: > > > > On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote: > > > > So we want the kernel to be aware of the device's existence (so that we > > > *can* bind a driver to it when needed), but we don't want it touching the > > > device unless we really ask for it. > > > > > > Does that help clarify the motivation for wanting this functionality? > > > > Sure, then just do this type of thing in the driver itself. Do not have > > any matching "ids" for this hardware it so that the bus will never call > > the probe function for this hardware _until_ a manual write happens to > > the driver's "bind" sysfs file. > > It sounds like you're suggesting a change to one particular driver to satisfy > this one particular case (and maybe I'm just not understanding your suggestion). > For a BMC, this is a pretty regular situation and not just as one-off as Zev's > example. > > Another good example is where a system can have optional riser cards with a > whole tree of devices that might be on that riser card (and there might be > different variants of a riser card that could go in the same slot). Usually > there is an EEPROM of some sort at a well-known address that can be parsed to > identify which kind of riser card it is and then the appropriate sub-devices can > be enumerated. That EEPROM parsing is something that is currently done in > userspace due to the complexity and often vendor-specific nature of it. > > Many of these devices require quite a bit more configuration information than > can be passed along a `bind` call. I believe it has been suggested previously > that this riser-card scenario could also be solved with dynamic loading of DT > snippets, but that support seems simple pretty far from being merged. Then work to get the DT code merged! Do not try to create yet-another-way of doing things here if DT overlays is the correct solution here (and it seems like it is.) thanks, greg k-h
On 10/23/21 3:56 AM, Greg Kroah-Hartman wrote: > On Fri, Oct 22, 2021 at 10:18:11AM -0500, Patrick Williams wrote: >> Hi Greg, >> >> On Fri, Oct 22, 2021 at 10:57:21AM +0200, Greg Kroah-Hartman wrote: >>> On Fri, Oct 22, 2021 at 01:32:32AM -0700, Zev Weiss wrote: >>>> On Thu, Oct 21, 2021 at 11:46:56PM PDT, Greg Kroah-Hartman wrote: >>>>> On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote: >> >>>> So we want the kernel to be aware of the device's existence (so that we >>>> *can* bind a driver to it when needed), but we don't want it touching the >>>> device unless we really ask for it. >>>> >>>> Does that help clarify the motivation for wanting this functionality? >>> >>> Sure, then just do this type of thing in the driver itself. Do not have >>> any matching "ids" for this hardware it so that the bus will never call >>> the probe function for this hardware _until_ a manual write happens to >>> the driver's "bind" sysfs file. >> >> It sounds like you're suggesting a change to one particular driver to satisfy >> this one particular case (and maybe I'm just not understanding your suggestion). >> For a BMC, this is a pretty regular situation and not just as one-off as Zev's >> example. >> >> Another good example is where a system can have optional riser cards with a >> whole tree of devices that might be on that riser card (and there might be >> different variants of a riser card that could go in the same slot). Usually >> there is an EEPROM of some sort at a well-known address that can be parsed to >> identify which kind of riser card it is and then the appropriate sub-devices can >> be enumerated. That EEPROM parsing is something that is currently done in >> userspace due to the complexity and often vendor-specific nature of it. >> >> Many of these devices require quite a bit more configuration information than >> can be passed along a `bind` call. I believe it has been suggested previously >> that this riser-card scenario could also be solved with dynamic loading of DT >> snippets, but that support seems simple pretty far from being merged. > > Then work to get the DT code merged! Do not try to create > yet-another-way of doing things here if DT overlays is the correct > solution here (and it seems like it is.) I don't think this is a case that fits the overlay model. We know what the description of the device is (which is what devicetree is all about), but the device is to be shared between the Linux kernel and some other entity, such as the firmware or another OS. The issue to be resolved is how to describe that the device is to be shared (in this case exclusively by the kernel _or_ by the other entity at any given moment), and how to move ownership of the device between the Linux kernel and the other entity. In the scenario presented by Zev, it is suggested that a user space agent will be involved in deciding which entity owns the device and to tell the Linux kernel when to take ownership of the device (and presumably when to relinquish ownership, although we haven't seen the implementation of relinquishing ownership yet). One could imagine direct communication between the driver and the other entity to mediate ownership. That seems like a driver specific defined choice to me, though if there are enough different drivers facing this situation then eventually a shared framework would make sense. So to step back and think architecture, it seems to me that the devicetree needs to specify in the node describing the shared device that the device must be (1) owned exclusively by either the Linux kernel or some other entity, with a signalling method between the Linux kernel and the other entity being defined (possibly by information in the node or possibly by the definition of the driver) or (2) actively shared by both the Linux kernel and the other entity. Actively shared may or may not be possible, based on the specific hardware. For example, if a single contains some bits controlled by the Linux kernel and other bits controlled by the other entity, then it can be difficult for one of the two entities to atomically modify the register in coordination with the other entity. Obviously case 1 is much simpler than case 2, I'm just trying to create a picture of the potential cases. In a simpler version of case 2, each entity would have control of their own set of registers. Diverging away from the overlay question, to comment on the "status" property mentioned elsewhere in this thread, I do not think that a status value of "reserved" is an adequate method of conveying all of the information needed by the above range of scenarios. -Frank > > thanks, > > greg k-h >
On Mon, Oct 25, 2021 at 12:38:08AM -0500, Frank Rowand wrote: > On 10/23/21 3:56 AM, Greg Kroah-Hartman wrote: > > On Fri, Oct 22, 2021 at 10:18:11AM -0500, Patrick Williams wrote: > >> Hi Greg, > >> > >> On Fri, Oct 22, 2021 at 10:57:21AM +0200, Greg Kroah-Hartman wrote: > >>> On Fri, Oct 22, 2021 at 01:32:32AM -0700, Zev Weiss wrote: > >>>> On Thu, Oct 21, 2021 at 11:46:56PM PDT, Greg Kroah-Hartman wrote: > >>>>> On Thu, Oct 21, 2021 at 07:00:31PM -0700, Zev Weiss wrote: > >> > >>>> So we want the kernel to be aware of the device's existence (so that we > >>>> *can* bind a driver to it when needed), but we don't want it touching the > >>>> device unless we really ask for it. > >>>> > >>>> Does that help clarify the motivation for wanting this functionality? > >>> > >>> Sure, then just do this type of thing in the driver itself. Do not have > >>> any matching "ids" for this hardware it so that the bus will never call > >>> the probe function for this hardware _until_ a manual write happens to > >>> the driver's "bind" sysfs file. > >> > >> It sounds like you're suggesting a change to one particular driver to satisfy > >> this one particular case (and maybe I'm just not understanding your suggestion). > >> For a BMC, this is a pretty regular situation and not just as one-off as Zev's > >> example. > >> > >> Another good example is where a system can have optional riser cards with a > >> whole tree of devices that might be on that riser card (and there might be > >> different variants of a riser card that could go in the same slot). Usually > >> there is an EEPROM of some sort at a well-known address that can be parsed to > >> identify which kind of riser card it is and then the appropriate sub-devices can > >> be enumerated. That EEPROM parsing is something that is currently done in > >> userspace due to the complexity and often vendor-specific nature of it. > >> > >> Many of these devices require quite a bit more configuration information than > >> can be passed along a `bind` call. I believe it has been suggested previously > >> that this riser-card scenario could also be solved with dynamic loading of DT > >> snippets, but that support seems simple pretty far from being merged. > > > > Then work to get the DT code merged! Do not try to create > > yet-another-way of doing things here if DT overlays is the correct > > solution here (and it seems like it is.) > > I don't think this is a case that fits the overlay model. > > We know what the description of the device is (which is what devicetree > is all about), but the device is to be shared between the Linux kernel > and some other entity, such as the firmware or another OS. The issue > to be resolved is how to describe that the device is to be shared (in > this case exclusively by the kernel _or_ by the other entity at any > given moment), and how to move ownership of the device between the > Linux kernel and the other entity. > > In the scenario presented by Zev, it is suggested that a user space > agent will be involved in deciding which entity owns the device and > to tell the Linux kernel when to take ownership of the device (and > presumably when to relinquish ownership, although we haven't seen > the implementation of relinquishing ownership yet). One could > imagine direct communication between the driver and the other > entity to mediate ownership. That seems like a driver specific > defined choice to me, though if there are enough different drivers > facing this situation then eventually a shared framework would > make sense. We have the bind/unbind ability today, from userspace, that can control this. Why not just have Linux grab the device when it boots, and then when userspace wants to "give the device up", it writes to "unbind" in sysfs, and then when all is done, it writes to the "bind" file and then Linux takes back over. Unless for some reason Linux should _not_ grab the device when booting, then things get messier, as we have seen in this thread. thanks, greg k-h
On Mon, Oct 25, 2021 at 08:15:41AM +0200, Greg Kroah-Hartman wrote: > On Mon, Oct 25, 2021 at 12:38:08AM -0500, Frank Rowand wrote: > > On 10/23/21 3:56 AM, Greg Kroah-Hartman wrote: > We have the bind/unbind ability today, from userspace, that can control > this. Why not just have Linux grab the device when it boots, and then > when userspace wants to "give the device up", it writes to "unbind" in > sysfs, and then when all is done, it writes to the "bind" file and then > Linux takes back over. > > Unless for some reason Linux should _not_ grab the device when booting, > then things get messier, as we have seen in this thread. This is probably more typical on a BMC than atypical. The systems often require the BMC (running Linux) to be able to reboot independently from the managed host (running anything). In the example Zev gave, the BMC rebooting would rip away the BIOS chip from the running host. The BMC almost always needs to come up in a "I don't know what could possibly be going on in the system" state and re-discover where the system was left off.
On Mon, Oct 25, 2021 at 06:44:26AM -0500, Patrick Williams wrote: > On Mon, Oct 25, 2021 at 08:15:41AM +0200, Greg Kroah-Hartman wrote: > > On Mon, Oct 25, 2021 at 12:38:08AM -0500, Frank Rowand wrote: > > > On 10/23/21 3:56 AM, Greg Kroah-Hartman wrote: > > > We have the bind/unbind ability today, from userspace, that can control > > this. Why not just have Linux grab the device when it boots, and then > > when userspace wants to "give the device up", it writes to "unbind" in > > sysfs, and then when all is done, it writes to the "bind" file and then > > Linux takes back over. > > > > Unless for some reason Linux should _not_ grab the device when booting, > > then things get messier, as we have seen in this thread. > > This is probably more typical on a BMC than atypical. The systems often require > the BMC (running Linux) to be able to reboot independently from the managed host > (running anything). In the example Zev gave, the BMC rebooting would rip away > the BIOS chip from the running host. > > The BMC almost always needs to come up in a "I don't know what could possibly be > going on in the system" state and re-discover where the system was left off. Isn't it an architectural issue then?
On Mon, Oct 25, 2021 at 03:58:25PM +0300, Andy Shevchenko wrote: > On Mon, Oct 25, 2021 at 06:44:26AM -0500, Patrick Williams wrote: > > On Mon, Oct 25, 2021 at 08:15:41AM +0200, Greg Kroah-Hartman wrote: > > > On Mon, Oct 25, 2021 at 12:38:08AM -0500, Frank Rowand wrote: > > > > On 10/23/21 3:56 AM, Greg Kroah-Hartman wrote: > > > > > We have the bind/unbind ability today, from userspace, that can control > > > this. Why not just have Linux grab the device when it boots, and then > > > when userspace wants to "give the device up", it writes to "unbind" in > > > sysfs, and then when all is done, it writes to the "bind" file and then > > > Linux takes back over. > > > > > > Unless for some reason Linux should _not_ grab the device when booting, > > > then things get messier, as we have seen in this thread. > > > > This is probably more typical on a BMC than atypical. The systems often require > > the BMC (running Linux) to be able to reboot independently from the managed host > > (running anything). In the example Zev gave, the BMC rebooting would rip away > > the BIOS chip from the running host. > > > > The BMC almost always needs to come up in a "I don't know what could possibly be > > going on in the system" state and re-discover where the system was left off. > > Isn't it an architectural issue then? I'm not sure what "it" you are referring to here. I was trying to explain why starting in "bind" state is not a good idea for a BMC in most of these cases where we want to be able to dynamically add a device.
On Mon, Oct 25, 2021 at 08:20:05AM -0500, Patrick Williams wrote: > On Mon, Oct 25, 2021 at 03:58:25PM +0300, Andy Shevchenko wrote: > > On Mon, Oct 25, 2021 at 06:44:26AM -0500, Patrick Williams wrote: > > > On Mon, Oct 25, 2021 at 08:15:41AM +0200, Greg Kroah-Hartman wrote: > > > > On Mon, Oct 25, 2021 at 12:38:08AM -0500, Frank Rowand wrote: > > > > > On 10/23/21 3:56 AM, Greg Kroah-Hartman wrote: > > > > > > > We have the bind/unbind ability today, from userspace, that can control > > > > this. Why not just have Linux grab the device when it boots, and then > > > > when userspace wants to "give the device up", it writes to "unbind" in > > > > sysfs, and then when all is done, it writes to the "bind" file and then > > > > Linux takes back over. > > > > > > > > Unless for some reason Linux should _not_ grab the device when booting, > > > > then things get messier, as we have seen in this thread. > > > > > > This is probably more typical on a BMC than atypical. The systems often require > > > the BMC (running Linux) to be able to reboot independently from the managed host > > > (running anything). In the example Zev gave, the BMC rebooting would rip away > > > the BIOS chip from the running host. > > > > > > The BMC almost always needs to come up in a "I don't know what could possibly be > > > going on in the system" state and re-discover where the system was left off. > > > > Isn't it an architectural issue then? > > I'm not sure what "it" you are referring to here. > > I was trying to explain why starting in "bind" state is not a good idea for a > BMC in most of these cases where we want to be able to dynamically add a device. I think "it" is "something needs to be the moderator between the two operating systems". What is the external entity that handles the switching between the two? thanks, greg k-h
On Mon, Oct 25, 2021 at 03:34:05PM +0200, Greg Kroah-Hartman wrote: > On Mon, Oct 25, 2021 at 08:20:05AM -0500, Patrick Williams wrote: > > On Mon, Oct 25, 2021 at 03:58:25PM +0300, Andy Shevchenko wrote: > > > On Mon, Oct 25, 2021 at 06:44:26AM -0500, Patrick Williams wrote: > > > > On Mon, Oct 25, 2021 at 08:15:41AM +0200, Greg Kroah-Hartman wrote: > > > > > On Mon, Oct 25, 2021 at 12:38:08AM -0500, Frank Rowand wrote: > > > > > > On 10/23/21 3:56 AM, Greg Kroah-Hartman wrote: > > > > > > > > > We have the bind/unbind ability today, from userspace, that can control > > > > > this. Why not just have Linux grab the device when it boots, and then > > > > > when userspace wants to "give the device up", it writes to "unbind" in > > > > > sysfs, and then when all is done, it writes to the "bind" file and then > > > > > Linux takes back over. > > > > > > > > > > Unless for some reason Linux should _not_ grab the device when booting, > > > > > then things get messier, as we have seen in this thread. > > > > > > > > This is probably more typical on a BMC than atypical. The systems often require > > > > the BMC (running Linux) to be able to reboot independently from the managed host > > > > (running anything). In the example Zev gave, the BMC rebooting would rip away > > > > the BIOS chip from the running host. > > > > > > > > The BMC almost always needs to come up in a "I don't know what could possibly be > > > > going on in the system" state and re-discover where the system was left off. > > > > > > Isn't it an architectural issue then? > > > > I'm not sure what "it" you are referring to here. > > > > I was trying to explain why starting in "bind" state is not a good idea for a > > BMC in most of these cases where we want to be able to dynamically add a device. > > I think "it" is "something needs to be the moderator between the two > operating systems". What is the external entity that handles the > switching between the two? Ah, ok. Those usually end up being system / device specific. In the case of the BIOS flash, most designs I've seen use a SPI mux between the BMC and the host processor or IO hub (PCH on Xeons). The BMC has a GPIO to control the mux. As far as state, the BMC on start-up will go through a set of discovery code to figure out where it left the system prior to getting reset. That involves looking at the power subsystem and usually doing some kind of query to the host to see if it is alive. These queries are mostly system / host-processor design specific. I've seen anything from an IPMI/IPMB message alert from the BMC to the BIOS to ask "are you alive" to reading host processor state over JTAG to figure out if the processors are "making progress".
On Mon, Oct 25, 2021 at 09:02:40AM -0500, Patrick Williams wrote: > On Mon, Oct 25, 2021 at 03:34:05PM +0200, Greg Kroah-Hartman wrote: > > On Mon, Oct 25, 2021 at 08:20:05AM -0500, Patrick Williams wrote: > > > On Mon, Oct 25, 2021 at 03:58:25PM +0300, Andy Shevchenko wrote: > > > > On Mon, Oct 25, 2021 at 06:44:26AM -0500, Patrick Williams wrote: > > > > > On Mon, Oct 25, 2021 at 08:15:41AM +0200, Greg Kroah-Hartman wrote: > > > > > > On Mon, Oct 25, 2021 at 12:38:08AM -0500, Frank Rowand wrote: > > > > > > > On 10/23/21 3:56 AM, Greg Kroah-Hartman wrote: > > > > > > > > > > > We have the bind/unbind ability today, from userspace, that can control > > > > > > this. Why not just have Linux grab the device when it boots, and then > > > > > > when userspace wants to "give the device up", it writes to "unbind" in > > > > > > sysfs, and then when all is done, it writes to the "bind" file and then > > > > > > Linux takes back over. > > > > > > > > > > > > Unless for some reason Linux should _not_ grab the device when booting, > > > > > > then things get messier, as we have seen in this thread. > > > > > > > > > > This is probably more typical on a BMC than atypical. The systems often require > > > > > the BMC (running Linux) to be able to reboot independently from the managed host > > > > > (running anything). In the example Zev gave, the BMC rebooting would rip away > > > > > the BIOS chip from the running host. > > > > > > > > > > The BMC almost always needs to come up in a "I don't know what could possibly be > > > > > going on in the system" state and re-discover where the system was left off. > > > > > > > > Isn't it an architectural issue then? > > > > > > I'm not sure what "it" you are referring to here. > > > > > > I was trying to explain why starting in "bind" state is not a good idea for a > > > BMC in most of these cases where we want to be able to dynamically add a device. > > > > I think "it" is "something needs to be the moderator between the two > > operating systems". What is the external entity that handles the > > switching between the two? > > Ah, ok. > > Those usually end up being system / device specific. In the case of the BIOS > flash, most designs I've seen use a SPI mux between the BMC and the host > processor or IO hub (PCH on Xeons). The BMC has a GPIO to control the mux. > > As far as state, the BMC on start-up will go through a set of discovery code to > figure out where it left the system prior to getting reset. That involves > looking at the power subsystem and usually doing some kind of query to the host > to see if it is alive. These queries are mostly system / host-processor design > specific. I've seen anything from an IPMI/IPMB message alert from the BMC to > the BIOS to ask "are you alive" to reading host processor state over JTAG to > figure out if the processors are "making progress". But which processor is "in control" here over the hardware? What method is used to pass the device from one CPU to another from a logical point of view? Sounds like it is another driver that needs to handle all of this, so why not have that be the one that adds/removes the devices under control here? thanks, greg k-h
On Mon, Oct 25, 2021 at 04:09:59PM +0200, Greg Kroah-Hartman wrote: > On Mon, Oct 25, 2021 at 09:02:40AM -0500, Patrick Williams wrote: > > On Mon, Oct 25, 2021 at 03:34:05PM +0200, Greg Kroah-Hartman wrote: > > > On Mon, Oct 25, 2021 at 08:20:05AM -0500, Patrick Williams wrote: > > > I think "it" is "something needs to be the moderator between the two > > > operating systems". What is the external entity that handles the > > > switching between the two? > > > > Ah, ok. > > > > Those usually end up being system / device specific. In the case of the BIOS > > flash, most designs I've seen use a SPI mux between the BMC and the host > > processor or IO hub (PCH on Xeons). The BMC has a GPIO to control the mux. > > > > As far as state, the BMC on start-up will go through a set of discovery code to > > figure out where it left the system prior to getting reset. That involves > > looking at the power subsystem and usually doing some kind of query to the host > > to see if it is alive. These queries are mostly system / host-processor design > > specific. I've seen anything from an IPMI/IPMB message alert from the BMC to > > the BIOS to ask "are you alive" to reading host processor state over JTAG to > > figure out if the processors are "making progress". > > But which processor is "in control" here over the hardware? The BMC. It owns the GPIO that controls the SPI mux. But, the BMC is responsible for doing all operations in a way that doesn't mess up the running host processor(s). Pulling away the SPI flash containing the BIOS code at an incorrect time might do that. > What method > is used to pass the device from one CPU to another from a logical point > of view? The state of the server as a whole is determined and maintained by the BMC. I'm simplifying here a bit but the operation "turn on the host processors" implies "the host processors will access the BIOS" so the BMC must ensure "SPI mux is switched towards the host" before "turn on the host processors". > Sounds like it is another driver that needs to handle all of > this, so why not have that be the one that adds/removes the devices > under control here? If what you're describing is moving all of the state control logic into the kernel, I don't think that is feasible. For some systems it would mean moving yet another entire IPMI stack into the kernel tree. On others it might be somewhat simpler, but it is still a good amount of code. We could probably write up more details on the scope of this. If what you're describing is a small driver, similar to the board support drivers that were used before the device tree, that instantiates subordinate devices it doesn't seem like an unreasonable alternative to DT overlays to me (for whatever my limited kernel contribution experience counts for).
On Mon, Oct 25, 2021 at 10:54:21AM -0500, Patrick Williams wrote: > On Mon, Oct 25, 2021 at 04:09:59PM +0200, Greg Kroah-Hartman wrote: > > On Mon, Oct 25, 2021 at 09:02:40AM -0500, Patrick Williams wrote: > > > On Mon, Oct 25, 2021 at 03:34:05PM +0200, Greg Kroah-Hartman wrote: > > > > On Mon, Oct 25, 2021 at 08:20:05AM -0500, Patrick Williams wrote: > > > > I think "it" is "something needs to be the moderator between the two > > > > operating systems". What is the external entity that handles the > > > > switching between the two? > > > > > > Ah, ok. > > > > > > Those usually end up being system / device specific. In the case of the BIOS > > > flash, most designs I've seen use a SPI mux between the BMC and the host > > > processor or IO hub (PCH on Xeons). The BMC has a GPIO to control the mux. > > > > > > As far as state, the BMC on start-up will go through a set of discovery code to > > > figure out where it left the system prior to getting reset. That involves > > > looking at the power subsystem and usually doing some kind of query to the host > > > to see if it is alive. These queries are mostly system / host-processor design > > > specific. I've seen anything from an IPMI/IPMB message alert from the BMC to > > > the BIOS to ask "are you alive" to reading host processor state over JTAG to > > > figure out if the processors are "making progress". > > > > But which processor is "in control" here over the hardware? > > The BMC. It owns the GPIO that controls the SPI mux. > > But, the BMC is responsible for doing all operations in a way that doesn't mess > up the running host processor(s). Pulling away the SPI flash containing the > BIOS code at an incorrect time might do that. > > > What method > > is used to pass the device from one CPU to another from a logical point > > of view? > > The state of the server as a whole is determined and maintained by the BMC. I'm > simplifying here a bit but the operation "turn on the host processors" implies > "the host processors will access the BIOS" so the BMC must ensure "SPI mux is > switched towards the host" before "turn on the host processors". > > > Sounds like it is another driver that needs to handle all of > > this, so why not have that be the one that adds/removes the devices > > under control here? > > If what you're describing is moving all of the state control logic into the > kernel, I don't think that is feasible. For some systems it would mean moving > yet another entire IPMI stack into the kernel tree. On others it might be > somewhat simpler, but it is still a good amount of code. We could probably > write up more details on the scope of this. > > If what you're describing is a small driver, similar to the board support > drivers that were used before the device tree, that instantiates subordinate > devices it doesn't seem like an unreasonable alternative to DT overlays to me > (for whatever my limited kernel contribution experience counts for). > Something has to be here doing the mediation between the two processors and keeping things straight as to what processor is handling the hardware when. I suggest you focus on that first... Good luck! greg k-h
diff --git a/drivers/base/bus.c b/drivers/base/bus.c index bdc98c5713d5..8a30f9a02de0 100644 --- a/drivers/base/bus.c +++ b/drivers/base/bus.c @@ -211,7 +211,7 @@ static ssize_t bind_store(struct device_driver *drv, const char *buf, dev = bus_find_device_by_name(bus, NULL, buf); if (dev && driver_match_device(drv, dev)) { - err = device_driver_attach(drv, dev); + err = device_driver_attach(drv, dev, DRV_BIND_EXPLICIT); if (!err) { /* success */ err = count; diff --git a/drivers/base/dd.c b/drivers/base/dd.c index 68ea1f949daa..ee64740a63d9 100644 --- a/drivers/base/dd.c +++ b/drivers/base/dd.c @@ -22,6 +22,7 @@ #include <linux/dma-map-ops.h> #include <linux/init.h> #include <linux/module.h> +#include <linux/property.h> #include <linux/kthread.h> #include <linux/wait.h> #include <linux/async.h> @@ -727,13 +728,14 @@ void wait_for_device_probe(void) } EXPORT_SYMBOL_GPL(wait_for_device_probe); -static int __driver_probe_device(struct device_driver *drv, struct device *dev) +static int __driver_probe_device(struct device_driver *drv, struct device *dev, u32 flags) { int ret = 0; if (dev->p->dead || !device_is_registered(dev)) return -ENODEV; - if (dev->driver) + if (dev->driver || + (fwnode_device_is_reserved(dev->fwnode) && !(flags & DRV_BIND_EXPLICIT))) return -EBUSY; dev->can_match = true; @@ -778,7 +780,7 @@ static int driver_probe_device(struct device_driver *drv, struct device *dev) int ret; atomic_inc(&probe_count); - ret = __driver_probe_device(drv, dev); + ret = __driver_probe_device(drv, dev, DRV_BIND_DEFAULT); if (ret == -EPROBE_DEFER || ret == EPROBE_DEFER) { driver_deferred_probe_add(dev); @@ -1052,16 +1054,17 @@ static void __device_driver_unlock(struct device *dev, struct device *parent) * device_driver_attach - attach a specific driver to a specific device * @drv: Driver to attach * @dev: Device to attach it to + * @flags: Bitmask of DRV_BIND_* flags * * Manually attach driver to a device. Will acquire both @dev lock and * @dev->parent lock if needed. Returns 0 on success, -ERR on failure. */ -int device_driver_attach(struct device_driver *drv, struct device *dev) +int device_driver_attach(struct device_driver *drv, struct device *dev, u32 flags) { int ret; __device_driver_lock(dev, dev->parent); - ret = __driver_probe_device(drv, dev); + ret = __driver_probe_device(drv, dev, flags); __device_driver_unlock(dev, dev->parent); /* also return probe errors as normal negative errnos */ diff --git a/drivers/dma/idxd/compat.c b/drivers/dma/idxd/compat.c index 3df21615f888..51df38dea15a 100644 --- a/drivers/dma/idxd/compat.c +++ b/drivers/dma/idxd/compat.c @@ -7,7 +7,6 @@ #include <linux/device/bus.h> #include "idxd.h" -extern int device_driver_attach(struct device_driver *drv, struct device *dev); extern void device_driver_detach(struct device *dev); #define DRIVER_ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) \ @@ -56,7 +55,7 @@ static ssize_t bind_store(struct device_driver *drv, const char *buf, size_t cou if (!alt_drv) return -ENODEV; - rc = device_driver_attach(alt_drv, dev); + rc = device_driver_attach(alt_drv, dev, DRV_BIND_EXPLICIT); if (rc < 0) return rc; diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c index b314101237fe..f42c6ec543c8 100644 --- a/drivers/vfio/mdev/mdev_core.c +++ b/drivers/vfio/mdev/mdev_core.c @@ -309,7 +309,7 @@ int mdev_device_create(struct mdev_type *type, const guid_t *uuid) if (!drv) drv = &vfio_mdev_driver; - ret = device_driver_attach(&drv->driver, &mdev->dev); + ret = device_driver_attach(&drv->driver, &mdev->dev, DRV_BIND_DEFAULT); if (ret) goto out_del; diff --git a/include/linux/device.h b/include/linux/device.h index e270cb740b9e..1ada1093799b 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -876,12 +876,24 @@ static inline void *dev_get_platdata(const struct device *dev) return dev->platform_data; } +/* + * Driver-binding flags (for passing to device_driver_attach()) + * + * DRV_BIND_DEFAULT: a default, automatic bind, e.g. as a result of a device + * being added for which we already have a driver, or vice + * versa. + * DRV_BIND_EXPLICIT: an explicit, userspace-requested driver bind, e.g. as a + * result of a write to /sys/bus/.../drivers/.../bind + */ +#define DRV_BIND_DEFAULT 0 +#define DRV_BIND_EXPLICIT BIT(0) + /* * Manual binding of a device to driver. See drivers/base/bus.c * for information on use. */ int __must_check device_driver_attach(struct device_driver *drv, - struct device *dev); + struct device *dev, u32 flags); int __must_check device_bind_driver(struct device *dev); void device_release_driver(struct device *dev); int __must_check device_attach(struct device *dev);
Devices whose fwnodes are marked as reserved are instantiated, but will not have a driver bound to them unless userspace explicitly requests it by writing to a 'bind' sysfs file. This is to enable devices that may require special (userspace-mediated) preparation before a driver can safely probe them. Signed-off-by: Zev Weiss <zev@bewilderbeest.net> --- drivers/base/bus.c | 2 +- drivers/base/dd.c | 13 ++++++++----- drivers/dma/idxd/compat.c | 3 +-- drivers/vfio/mdev/mdev_core.c | 2 +- include/linux/device.h | 14 +++++++++++++- 5 files changed, 24 insertions(+), 10 deletions(-)