
[14/14] cxl: Add cxl_check_and_switch_mode() API to switch bi-modal cards

Message ID 1467638532-9250-15-git-send-email-imunsie@au.ibm.com (mailing list archive)
State Changes Requested

Commit Message

Ian Munsie July 4, 2016, 1:22 p.m. UTC
From: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

Add a new API, cxl_check_and_switch_mode() to allow for switching of
bi-modal CAPI cards, such as the Mellanox CX-4 network card.

When a driver requests to switch a card to CAPI mode, use PCI hotplug
infrastructure to remove all PCI devices underneath the slot. We then write
an updated mode control register to the CAPI VSEC, hot reset the card, and
reprobe the card.

As the card may present a different set of PCI devices after the mode
switch, use the infrastructure provided by the pnv_php driver and the OPAL
PCI slot management facilities to ensure that:

  * the old devices are removed from both the OPAL and Linux device trees
  * the new devices are probed by OPAL and added to the OPAL device tree
  * the new devices are added to the Linux device tree and probed through
    the regular PCI device probe path

As such, introduce a new option, CONFIG_CXL_BIMODAL, with a dependency on
the pnv_php driver.

Refactor existing code that touches the mode control register in the
regular single mode case into a new function, setup_cxl_protocol_area().

Co-authored-by: Ian Munsie <imunsie@au1.ibm.com>
Cc: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 drivers/misc/cxl/Kconfig |   8 ++
 drivers/misc/cxl/pci.c   | 234 +++++++++++++++++++++++++++++++++++++++++++----
 include/misc/cxl.h       |  25 +++++
 3 files changed, 249 insertions(+), 18 deletions(-)
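
For reviewers who want to see the calling convention in isolation, here is a minimal user-space sketch of how a bi-modal card's own driver might consume the return values described above. It is only a model: `struct fake_card`, `fake_check_and_switch_mode()` and `fake_nic_probe()` are hypothetical stand-ins, not part of this patch, and the real function additionally checks for hypervisor mode and locates the VSEC.

```c
#include <errno.h>

#define CXL_BIMODE_CXL 1
#define CXL_BIMODE_PCI 2

/* Hypothetical stand-in for the card state (not a kernel structure). */
struct fake_card {
	int in_cxl_mode;       /* models the protocol-enable bit */
	int switch_scheduled;  /* set when the background switch is queued */
};

/* Models only the return-value contract described in the commit message. */
static int fake_check_and_switch_mode(struct fake_card *card, int mode)
{
	if (mode == CXL_BIMODE_PCI)
		return card->in_cxl_mode ? -EIO : 0; /* CXL -> PCI unsupported */

	if (card->in_cxl_mode)
		return 0;                            /* already in requested mode */

	card->switch_scheduled = 1;  /* the patch queues a work item here */
	return -EBUSY;               /* caller must fail its probe */
}

/* Hypothetical probe routine of the card's regular (network) driver. */
static int fake_nic_probe(struct fake_card *card)
{
	int rc = fake_check_and_switch_mode(card, CXL_BIMODE_CXL);

	if (rc)
		return rc; /* -EBUSY: release the device, probe runs again later */

	/* Card is already in CXL mode: continue with normal initialisation. */
	return 0;
}
```

On the first probe of a card still in PCI mode the sketch returns -EBUSY and records that a switch was scheduled; once the card is in CXL mode the same probe returns 0.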

Comments

Andrew Donnellan July 6, 2016, 3:55 a.m. UTC | #1
On 04/07/16 23:22, Ian Munsie wrote:
> +static int setup_cxl_protocol_area(struct pci_dev *dev)
> +{
> +	u8 val;
> +	int rc;
> +	int vsec = find_cxl_vsec(dev);
> +
> +	if (!vsec) {
> +		dev_info(&dev->dev, "CXL VSEC not found\n");
> +		return -ENODEV;
> +	}
> +
> +	rc = CXL_READ_VSEC_MODE_CONTROL(dev, vsec, &val);
> +	if (rc) {
> +		dev_err(&dev->dev, "Failed to read current mode control: %i\n", rc);
> +		return rc;
> +	}
> +
> +	if (!(val & CXL_VSEC_PROTOCOL_ENABLE)) {
> +		dev_err(&dev->dev, "Card not in CAPI mode!\n");
> +		return -EIO;
> +	}
> +
> +	/* Still configure the protocol area for single mode cards */

This comment is extraneous and will be dropped in V2.
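
For reference, the mode control byte handling quoted above boils down to a mask-and-set on one register. The sketch below is a user-space illustration only. `CXL_VSEC_PROTOCOL_MASK` is taken from the patch context, while the values used for `CXL_VSEC_PROTOCOL_256TB` and `CXL_VSEC_PROTOCOL_ENABLE` are assumptions, since their definitions are not visible in the quoted hunks. The helper names are hypothetical.

```c
#include <stdint.h>

/* Visible in the patch context. */
#define CXL_VSEC_PROTOCOL_MASK   0xe0
/* ASSUMED values: these definitions are not shown in the quoted hunks. */
#define CXL_VSEC_PROTOCOL_256TB  0x20
#define CXL_VSEC_PROTOCOL_ENABLE 0x01

/* Hypothetical helper: the byte written when switching a card to CXL mode. */
static uint8_t mode_control_for_cxl(uint8_t val)
{
	val &= (uint8_t)~CXL_VSEC_PROTOCOL_MASK;  /* clear the protocol-area field */
	val |= CXL_VSEC_PROTOCOL_256TB | CXL_VSEC_PROTOCOL_ENABLE;
	return val;
}

/* Hypothetical helper: the check setup_cxl_protocol_area() does before use. */
static int card_in_capi_mode(uint8_t val)
{
	return (val & CXL_VSEC_PROTOCOL_ENABLE) != 0;
}
```

Bits outside the protocol field and the enable bit are preserved, which is why the patch reads the register before writing it.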
Frederic Barrat July 6, 2016, 6:51 p.m. UTC | #2
On 04/07/2016 15:22, Ian Munsie wrote:
> From: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
>
> Add a new API, cxl_check_and_switch_mode() to allow for switching of
> bi-modal CAPI cards, such as the Mellanox CX-4 network card.
>
> When a driver requests to switch a card to CAPI mode, use PCI hotplug
> infrastructure to remove all PCI devices underneath the slot. We then write
> an updated mode control register to the CAPI VSEC, hot reset the card, and
> reprobe the card.
>
> As the card may present a different set of PCI devices after the mode
> switch, use the infrastructure provided by the pnv_php driver and the OPAL
> PCI slot management facilities to ensure that:
>
>    * the old devices are removed from both the OPAL and Linux device trees
>    * the new devices are probed by OPAL and added to the OPAL device tree
>    * the new devices are added to the Linux device tree and probed through
>      the regular PCI device probe path
>
> As such, introduce a new option, CONFIG_CXL_BIMODAL, with a dependency on
> the pnv_php driver.
>
> Refactor existing code that touches the mode control register in the
> regular single mode case into a new function, setup_cxl_protocol_area().
>
> Co-authored-by: Ian Munsie <imunsie@au1.ibm.com>
> Cc: Gavin Shan <gwshan@linux.vnet.ibm.com>
> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   drivers/misc/cxl/Kconfig |   8 ++
>   drivers/misc/cxl/pci.c   | 234 +++++++++++++++++++++++++++++++++++++++++++----
>   include/misc/cxl.h       |  25 +++++
>   3 files changed, 249 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/misc/cxl/Kconfig b/drivers/misc/cxl/Kconfig
> index 560412c..6859723 100644
> --- a/drivers/misc/cxl/Kconfig
> +++ b/drivers/misc/cxl/Kconfig
> @@ -38,3 +38,11 @@ config CXL
>   	  CAPI adapters are found in POWER8 based systems.
>
>   	  If unsure, say N.
> +
> +config CXL_BIMODAL
> +	bool "Support for bi-modal CAPI cards"
> +	depends on HOTPLUG_PCI_POWERNV = y && CXL || HOTPLUG_PCI_POWERNV = m && CXL = m
> +	default y
> +	help
> +	  Select this option to enable support for bi-modal CAPI cards, such as
> +	  the Mellanox CX-4.
> \ No newline at end of file
> diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
> index 090eee8..63abd26 100644
> --- a/drivers/misc/cxl/pci.c
> +++ b/drivers/misc/cxl/pci.c
> @@ -55,6 +55,8 @@
>   	pci_read_config_byte(dev, vsec + 0xa, dest)
>   #define CXL_WRITE_VSEC_MODE_CONTROL(dev, vsec, val) \
>   	pci_write_config_byte(dev, vsec + 0xa, val)
> +#define CXL_WRITE_VSEC_MODE_CONTROL_BUS(bus, devfn, vsec, val) \
> +	pci_bus_write_config_byte(bus, devfn, vsec + 0xa, val)
>   #define CXL_VSEC_PROTOCOL_MASK   0xe0
>   #define CXL_VSEC_PROTOCOL_1024TB 0x80
>   #define CXL_VSEC_PROTOCOL_512TB  0x40
> @@ -614,36 +616,232 @@ static int setup_cxl_bars(struct pci_dev *dev)
>   	return 0;
>   }
>
> -/* pciex node: ibm,opal-m64-window = <0x3d058 0x0 0x3d058 0x0 0x8 0x0>; */
> -static int switch_card_to_cxl(struct pci_dev *dev)
> -{
> +#ifdef CONFIG_CXL_BIMODAL
> +
> +struct cxl_switch_work {
> +	struct pci_dev *dev;
> +	struct work_struct work;
>   	int vsec;
> +	int mode;
> +};
> +
> +static void switch_card_to_cxl(struct work_struct *work)
> +{
> +	struct cxl_switch_work *switch_work =
> +		container_of(work, struct cxl_switch_work, work);
> +	struct pci_dev *dev = switch_work->dev;
> +	struct pci_bus *bus = dev->bus;
> +	struct pci_controller *hose = pci_bus_to_host(bus);
> +	struct pci_dev *bridge;
> +	struct pnv_php_slot *php_slot;
> +	unsigned int devfn;
>   	u8 val;
>   	int rc;
>
> -	dev_info(&dev->dev, "switch card to CXL\n");
> +	dev_info(&bus->dev, "cxl: Preparing for mode switch...\n");
> +	bridge = list_first_entry_or_null(&hose->bus->devices, struct pci_dev,
> +					  bus_list);
> +	if (!bridge) {
> +		dev_WARN(&bus->dev, "cxl: Couldn't find root port!\n");
> +		goto err_free_work;
> +	}
>
> -	if (!(vsec = find_cxl_vsec(dev))) {
> -		dev_err(&dev->dev, "ABORTING: CXL VSEC not found!\n");
> +	php_slot = pnv_php_find_slot(pci_device_to_OF_node(bridge));
> +	if (!php_slot) {
> +		dev_err(&bus->dev, "cxl: Failed to find slot hotplug "
> +			           "information. You may need to upgrade "
> +			           "skiboot. Aborting.\n");
> +		pci_dev_put(dev);
> +		goto err_free_work;
> +	}
> +
> +	rc = CXL_READ_VSEC_MODE_CONTROL(dev, switch_work->vsec, &val);
> +	if (rc) {
> +		dev_err(&bus->dev, "cxl: Failed to read CAPI mode control: %i\n", rc);
> +		pci_dev_put(dev);
> +		goto err_free_work;
> +	}
> +	devfn = dev->devfn;
> +	pci_dev_put(dev);

This is to balance the 'get' done in cxl_check_and_switch_mode(), right? 
A comment wouldn't hurt. I think we're missing the 'put' on the first 
error path above (!bridge).

I was half-expecting to see a new entry in the cxl_pci_tbl pci ID table 
for the Mellanox entry, but no such thing. By what magic is cxl_probe() 
called after the switch? Because of the device class?

Out of curiosity, could you tell me what the 3rd pci function looks like 
(vendor ID, device ID, ....)?
Thanks!

   Fred
Andrew Donnellan July 7, 2016, 1:18 a.m. UTC | #3
Thanks for the review Fred!

On 07/07/16 04:51, Frederic Barrat wrote:
>> +    rc = CXL_READ_VSEC_MODE_CONTROL(dev, switch_work->vsec, &val);
>> +    if (rc) {
>> +        dev_err(&bus->dev, "cxl: Failed to read CAPI mode control: %i\n", rc);
>> +        pci_dev_put(dev);
>> +        goto err_free_work;
>> +    }
>> +    devfn = dev->devfn;
>> +    pci_dev_put(dev);
>
> This is to balance the 'get' done in cxl_check_and_switch_mode(), right?
> A comment wouldn't hurt. I think we're missing the 'put' on the first
> error path above (!bridge).

Yep, it's to balance the pci_dev_get() in cxl_check_and_switch_mode() - 
you're right, a comment to that effect wouldn't hurt.

You're also right about the error path. Will fix in V2.

> I was half-expecting to see a new entry in the cxl_pci_tbl pci ID table
> for the Mellanox entry, but no such thing. By what magic is cxl_probe()
> called after the switch? Because of the device class?

It matches against the class, as function 0 of the device after reset 
comes up as a class 1200 processing accelerator.

Perhaps we should be a bit more explicit though...

> Out of curiosity, could you tell me what the 3rd pci function looks like
> (vendor ID, device ID, ....)?

Before:

root@io163:~# lspci -vnn
0000:00:00.0 PCI bridge [0604]: IBM Device [1014:03dc] (prog-if 00 [Normal decode])
         Flags: fast devsel
         Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
         I/O behind bridge: 00000000-00000fff
         Capabilities: [40] Power Management version 3
         Capabilities: [48] Express Root Port (Slot-), MSI 00
         Capabilities: [100] Advanced Error Reporting
         Capabilities: [148] #19

0000:01:00.0 Infiniband controller [0207]: Mellanox Technologies MT27700 Family [ConnectX-4] [15b3:1013]
         Subsystem: IBM Device [1014:04f4]
         Flags: fast devsel, IRQ 502
         Memory at 200000000000 (64-bit, prefetchable) [disabled] [size=32M]
         Capabilities: [60] Express Endpoint, MSI 00
         Capabilities: [48] Vital Product Data
         Capabilities: [9c] MSI-X: Enable- Count=128 Masked-
         Capabilities: [c0] Vendor Specific Information: Len=18 <?>
         Capabilities: [40] Power Management version 3
         Capabilities: [100] Device Serial Number ba-da-ce-55-de-ad-ca-fe
         Capabilities: [160] Vendor Specific Information: ID=1280 Rev=0 Len=080 <?>
         Capabilities: [240] #19

0000:01:00.1 Infiniband controller [0207]: Mellanox Technologies MT27700 Family [ConnectX-4] [15b3:1013]
         Subsystem: IBM Device [1014:04f4]
         Flags: fast devsel, IRQ 502
         Memory at 200002000000 (64-bit, prefetchable) [disabled] [size=32M]
         Capabilities: [60] Express Endpoint, MSI 00
         Capabilities: [48] Vital Product Data
         Capabilities: [9c] MSI-X: Enable- Count=128 Masked-
         Capabilities: [40] Power Management version 3
         Capabilities: [100] Device Serial Number ba-da-ce-55-de-ad-ca-fe

After:

root@io163:~# lspci -vnn
0000:00:00.0 PCI bridge [0604]: IBM Device [1014:03dc] (prog-if 00 [Normal decode])
         Flags: bus master, fast devsel, latency 0
         Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
         I/O behind bridge: 00000000-00000fff
         Capabilities: [40] Power Management version 3
         Capabilities: [48] Express Root Port (Slot-), MSI 00
         Capabilities: [100] Advanced Error Reporting
         Capabilities: [148] #19

0000:01:00.0 Processing accelerators [1200]: Mellanox Technologies MT27700 Family [ConnectX-4] [15b3:1013]
         Subsystem: IBM Device [1014:04f4]
         Physical Slot: Slot3
         Flags: bus master, fast devsel, latency 0, IRQ 502
         Memory at 200004000000 (64-bit, prefetchable) [size=128K]
         Memory at 200004020000 (64-bit, prefetchable) [size=128K]
         Memory at <ignored> (64-bit, prefetchable) [size=256T]
         Capabilities: [60] Express Endpoint, MSI 00
         Capabilities: [48] Vital Product Data
         Capabilities: [9c] MSI-X: Enable- Count=128 Masked-
         Capabilities: [100] Device Serial Number ba-da-ce-55-de-ad-ca-fe
         Capabilities: [160] Vendor Specific Information: ID=1280 Rev=0 Len=080 <?>
         Kernel driver in use: cxl-pci

0000:01:00.1 Infiniband controller [0207]: Mellanox Technologies MT27700 Family [ConnectX-4] [15b3:1013]
         Subsystem: IBM Device [1014:04f4]
         Physical Slot: Slot3
         Flags: bus master, fast devsel, latency 0, IRQ 502
         Memory at 200000000000 (64-bit, prefetchable) [size=32M]
         Capabilities: [60] Express Endpoint, MSI 00
         Capabilities: [48] Vital Product Data
         Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
         Capabilities: [40] Power Management version 3
         Capabilities: [100] Device Serial Number ba-da-ce-55-de-ad-ca-fe
         Kernel driver in use: mlx5_core

0000:01:00.2 Infiniband controller [0207]: Mellanox Technologies MT27700 Family [ConnectX-4] [15b3:1013]
         Subsystem: IBM Device [1014:04f4]
         Physical Slot: Slot3
         Flags: bus master, fast devsel, latency 0, IRQ 502
         Memory at 200002000000 (64-bit, prefetchable) [size=32M]
         Capabilities: [60] Express Endpoint, MSI 00
         Capabilities: [48] Vital Product Data
         Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
         Capabilities: [40] Power Management version 3
         Capabilities: [100] Device Serial Number ba-da-ce-55-de-ad-ca-fe
         Kernel driver in use: mlx5_core

Andrew
Ian Munsie July 7, 2016, 6:26 a.m. UTC | #4
Excerpts from andrew.donnellan's message of 2016-07-07 11:18:37 +1000:
> > This is to balance the 'get' done in cxl_check_and_switch_mode(), right?
> > A comment wouldn't hurt. I think we're missing the 'put' on the first
> > error path above (!bridge).
> 
> Yep, it's to balance the pci_dev_get() in cxl_check_and_switch_mode() - 
> you're right, a comment to that effect wouldn't hurt.
> 
> You're also right about the error path. Will fix in V2.

We could probably use a dedicated error label for all the error paths
before the pci_dev_put in the main function so we don't need it in every
error path.
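
A rough user-space sketch of that two-label layout follows, in case it helps the discussion. Everything here is a hypothetical model (`fake_dev`, `fake_work`, `fake_switch()` and the `fail_at` selector are illustrative only). The point is just that failures before the balancing put jump to a label that drops the reference and then falls through to the common free.

```c
#include <stdlib.h>

struct fake_dev  { int refcount; };
struct fake_work { struct fake_dev *dev; };

static void fake_dev_get(struct fake_dev *d) { d->refcount++; }
static void fake_dev_put(struct fake_dev *d) { d->refcount--; }

/* Models the caller: take a reference, then hand the work item over. */
static struct fake_work *fake_queue(struct fake_dev *dev)
{
	struct fake_work *w = malloc(sizeof(*w));

	if (!w)
		return NULL;
	fake_dev_get(dev);
	w->dev = dev;
	return w;
}

/*
 * fail_at: 0 = success, 1 = no bridge, 2 = no hotplug slot,
 *          3 = config read failed, 4 = a failure after the put.
 */
static int fake_switch(struct fake_work *work, int fail_at)
{
	struct fake_dev *dev = work->dev;
	int rc = -1;

	if (fail_at >= 1 && fail_at <= 3)
		goto err_dev_put;       /* reference still held on these paths */

	fake_dev_put(dev);              /* balances the get in fake_queue() */

	if (fail_at == 4)
		goto err_free_work;     /* reference already dropped */

	free(work);
	return 0;

err_dev_put:
	fake_dev_put(dev);
err_free_work:
	free(work);
	return rc;
}

/* Runs one queue/switch cycle and reports the device refcount afterwards. */
static int refcount_after(int fail_at)
{
	struct fake_dev dev = { 1 };
	struct fake_work *w = fake_queue(&dev);

	if (!w)
		return -1;
	fake_switch(w, fail_at);
	return dev.refcount;
}
```

With this shape the reference count returns to its starting value on every path, including the missing-bridge case Fred pointed out.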

> > I was half-expecting to see a new entry in the cxl_pci_tbl pci ID table
> > for the Mellanox entry, but no such thing. By what magic is cxl_probe()
> > called after the switch? Because of the device class?
> 
> It matches against the class, as function 0 of the device after reset 
> comes up as a class 1200 processing accelerator.
> 
> Perhaps we should be a bit more explicit though...

If we explicitly match the Vendor + Device ID we will also match the
networking functions, which we can't do, because before the mode switch
there *IS* a CAPI VSEC in one of the networking functions and our driver
would mistake it as a generic accelerator and try to initialise it. We
could add a comment to this effect to the PCI ID table.

Cheers,
-Ian
Andrew Donnellan July 7, 2016, 6:44 a.m. UTC | #5
On 07/07/16 16:26, Ian Munsie wrote:
> We could probably use a dedicated error label for all the error paths
> before the pci_dev_put in the main function so we don't need it in every
> error path.

Yep, I've added that.

> If we explicitly match the Vendor + Device ID we will also match the
> networking functions, which we can't do, because before the mode switch
> there *IS* a CAPI VSEC in one of the networking functions and our driver
> would mistake it as a generic accelerator and try to initialise it. We
> could add a comment to this effect to the PCI ID table.

We can match the vendor, device ID *and* class code - unfortunately 
there isn't a macro for this, which makes it a little bit less 
aesthetically pleasing, but I'm pretty sure this works.

I'm not entirely sure how I feel about our current strategy of matching 
on all class 1200 devices (though if it weren't a CAPI device we'd bail 
very quickly...) - my quick grepping tells me we're one of a very small 
set of drivers in the kernel that uses PCI_DEVICE_CLASS.
Andrew Donnellan July 7, 2016, 8:15 a.m. UTC | #6
On 07/07/16 16:44, Andrew Donnellan wrote:
> We can match the vendor, device ID *and* class code - unfortunately
> there isn't a macro for this, which makes it a little bit less
> aesthetically pleasing, but I'm pretty sure this works.

Something like the below, which works fine:

/*
  * Matches a given PCI vendor ID and device ID, but only for class 12
  * (processing accelerators). Useful for bi-modal cards, such as the
  * Mellanox ConnectX-4, which keep the same vendor/device ID
  * post-mode-switch.
  */
#define PCI_DEVICE_ACCEL(vend, dev) \
	.vendor = (vend), .device = (dev), \
	.subvendor = PCI_ANY_ID, .subdevice = PCI_ANY_ID, \
	.class = 0x120000, .class_mask = 0xff0000

static const struct pci_device_id cxl_pci_tbl[] = {
	/* FPGA devices */
	{ PCI_DEVICE(PCI_VENDOR_ID_IBM, 0x0477), },
	{ PCI_DEVICE(PCI_VENDOR_ID_IBM, 0x044b), },
	{ PCI_DEVICE(PCI_VENDOR_ID_IBM, 0x04cf), },
	{ PCI_DEVICE(PCI_VENDOR_ID_IBM, 0x0601), },
	/* Mellanox ConnectX-4 */
	{ PCI_DEVICE_ACCEL(PCI_VENDOR_ID_MELLANOX, 0x1013), },
	{ }
};
MODULE_DEVICE_TABLE(pci, cxl_pci_tbl);
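
To illustrate why the class mask matters, here is a small user-space model of the ID comparison, written from my understanding of how the PCI core compares a `pci_device_id` against a device (vendor, device, then `(id_class ^ dev_class) & class_mask`). The `fake_*` names are hypothetical, the subvendor/subdevice comparison is omitted for brevity, and the 24-bit class values are assumptions derived from the `[0207]` and `[1200]` class codes in the lspci output with a prog-if of 00.

```c
#include <stdint.h>

#define FAKE_ANY_ID 0xffffffffu  /* plays the role of PCI_ANY_ID */

struct fake_id   { uint32_t vendor, device, class_code, class_mask; };
struct fake_func { uint32_t vendor, device, class_code; };

/* Model of the match rule; subvendor/subdevice checks left out. */
static int fake_match(const struct fake_id *id, const struct fake_func *fn)
{
	return (id->vendor == FAKE_ANY_ID || id->vendor == fn->vendor) &&
	       (id->device == FAKE_ANY_ID || id->device == fn->device) &&
	       !((id->class_code ^ fn->class_code) & id->class_mask);
}

/* Entry shaped like PCI_DEVICE_ACCEL(PCI_VENDOR_ID_MELLANOX, 0x1013). */
static const struct fake_id accel_id = { 0x15b3, 0x1013, 0x120000, 0xff0000 };
/* Entry shaped like plain PCI_DEVICE(): class and mask are both zero. */
static const struct fake_id plain_id = { 0x15b3, 0x1013, 0, 0 };

/* Function 0 after the switch, and a networking function (assumed values). */
static const struct fake_func fn_accel = { 0x15b3, 0x1013, 0x120000 };
static const struct fake_func fn_ib    = { 0x15b3, 0x1013, 0x020700 };
```

In this model the `accel_id` entry matches only the processing-accelerator function, while a plain vendor/device entry would also match the Infiniband functions, which is the problem Ian described.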
Ian Munsie July 11, 2016, 9:19 a.m. UTC | #7
Excerpts from andrew.donnellan's message of 2016-07-07 18:15:06 +1000:
> On 07/07/16 16:44, Andrew Donnellan wrote:
> > We can match the vendor, device ID *and* class code - unfortunately
> > there isn't a macro for this, which makes it a little bit less
> > aesthetically pleasing, but I'm pretty sure this works.
> 
> Something like the below, which works fine:

I like this solution, but I'm not going to include it in v2 of this
series and would rather it be submitted separately. The reason is that
this series will work as is, and I'd like to see this undergo some
regression testing separate to the cx4 work, and a bit of scrutiny from
the hardware team just in case we are missing any device IDs that would
no longer be matched (I'm not aware of any, but you never know).

Cheers,
-Ian
Andrew Donnellan July 12, 2016, 1:20 a.m. UTC | #8
On 11/07/16 19:19, Ian Munsie wrote:
> I like this solution, but I'm not going to include it in v2 of this
> series and would rather it be submitted separately. The reason is that
> this series will work as is, and I'd like to see this undergo some
> regression testing separate to the cx4 work, and a bit of scrutiny from
> the hardware team just in case we are missing any device IDs that would
> no longer be matched (I'm not aware of any, but you never know).

Yep, I can send it separately.

Patch

diff --git a/drivers/misc/cxl/Kconfig b/drivers/misc/cxl/Kconfig
index 560412c..6859723 100644
--- a/drivers/misc/cxl/Kconfig
+++ b/drivers/misc/cxl/Kconfig
@@ -38,3 +38,11 @@  config CXL
 	  CAPI adapters are found in POWER8 based systems.
 
 	  If unsure, say N.
+
+config CXL_BIMODAL
+	bool "Support for bi-modal CAPI cards"
+	depends on HOTPLUG_PCI_POWERNV = y && CXL || HOTPLUG_PCI_POWERNV = m && CXL = m
+	default y
+	help
+	  Select this option to enable support for bi-modal CAPI cards, such as
+	  the Mellanox CX-4.
\ No newline at end of file
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 090eee8..63abd26 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -55,6 +55,8 @@ 
 	pci_read_config_byte(dev, vsec + 0xa, dest)
 #define CXL_WRITE_VSEC_MODE_CONTROL(dev, vsec, val) \
 	pci_write_config_byte(dev, vsec + 0xa, val)
+#define CXL_WRITE_VSEC_MODE_CONTROL_BUS(bus, devfn, vsec, val) \
+	pci_bus_write_config_byte(bus, devfn, vsec + 0xa, val)
 #define CXL_VSEC_PROTOCOL_MASK   0xe0
 #define CXL_VSEC_PROTOCOL_1024TB 0x80
 #define CXL_VSEC_PROTOCOL_512TB  0x40
@@ -614,36 +616,232 @@  static int setup_cxl_bars(struct pci_dev *dev)
 	return 0;
 }
 
-/* pciex node: ibm,opal-m64-window = <0x3d058 0x0 0x3d058 0x0 0x8 0x0>; */
-static int switch_card_to_cxl(struct pci_dev *dev)
-{
+#ifdef CONFIG_CXL_BIMODAL
+
+struct cxl_switch_work {
+	struct pci_dev *dev;
+	struct work_struct work;
 	int vsec;
+	int mode;
+};
+
+static void switch_card_to_cxl(struct work_struct *work)
+{
+	struct cxl_switch_work *switch_work =
+		container_of(work, struct cxl_switch_work, work);
+	struct pci_dev *dev = switch_work->dev;
+	struct pci_bus *bus = dev->bus;
+	struct pci_controller *hose = pci_bus_to_host(bus);
+	struct pci_dev *bridge;
+	struct pnv_php_slot *php_slot;
+	unsigned int devfn;
 	u8 val;
 	int rc;
 
-	dev_info(&dev->dev, "switch card to CXL\n");
+	dev_info(&bus->dev, "cxl: Preparing for mode switch...\n");
+	bridge = list_first_entry_or_null(&hose->bus->devices, struct pci_dev,
+					  bus_list);
+	if (!bridge) {
+		dev_WARN(&bus->dev, "cxl: Couldn't find root port!\n");
+		goto err_free_work;
+	}
 
-	if (!(vsec = find_cxl_vsec(dev))) {
-		dev_err(&dev->dev, "ABORTING: CXL VSEC not found!\n");
+	php_slot = pnv_php_find_slot(pci_device_to_OF_node(bridge));
+	if (!php_slot) {
+		dev_err(&bus->dev, "cxl: Failed to find slot hotplug "
+			           "information. You may need to upgrade "
+			           "skiboot. Aborting.\n");
+		pci_dev_put(dev);
+		goto err_free_work;
+	}
+
+	rc = CXL_READ_VSEC_MODE_CONTROL(dev, switch_work->vsec, &val);
+	if (rc) {
+		dev_err(&bus->dev, "cxl: Failed to read CAPI mode control: %i\n", rc);
+		pci_dev_put(dev);
+		goto err_free_work;
+	}
+	devfn = dev->devfn;
+	pci_dev_put(dev);
+
+	dev_dbg(&bus->dev, "cxl: Removing PCI devices from kernel\n");
+	pci_lock_rescan_remove();
+	pci_hp_remove_devices(bridge->subordinate);
+	pci_unlock_rescan_remove();
+
+	/* Switch the CXL protocol on the card */
+	if (switch_work->mode == CXL_BIMODE_CXL) {
+		dev_info(&bus->dev, "cxl: Switching card to CXL mode\n");
+		val &= ~CXL_VSEC_PROTOCOL_MASK;
+		val |= CXL_VSEC_PROTOCOL_256TB | CXL_VSEC_PROTOCOL_ENABLE;
+		rc = pnv_cxl_enable_phb_kernel_api(hose, true);
+		if (rc) {
+			dev_err(&bus->dev, "cxl: Failed to enable kernel API"
+				           " on real PHB, aborting\n");
+			goto err_free_work;
+		}
+	} else {
+		dev_WARN(&bus->dev, "cxl: Switching card to PCI mode not supported!\n");
+		goto err_free_work;
+	}
+
+	rc = CXL_WRITE_VSEC_MODE_CONTROL_BUS(bus, devfn, switch_work->vsec, val);
+	if (rc) {
+		dev_err(&bus->dev, "cxl: Failed to configure CXL protocol: %i\n", rc);
+		goto err_free_work;
+	}
+
+	/*
+	 * The CAIA spec (v1.1, Section 10.6 Bi-modal Device Support) states
+	 * we must wait 100ms after this mode switch before touching PCIe config
+	 * space.
+	 */
+	msleep(100);
+
+	/*
+	 * Hot reset to cause the card to come back in cxl mode. A
+	 * OPAL_RESET_PCI_LINK would be sufficient, but currently lacks support
+	 * in skiboot, so we use a hot reset instead.
+	 *
+	 * We call pci_set_pcie_reset_state() on the bridge, as a CAPI card is
+	 * guaranteed to sit directly under the root port, and setting the reset
+	 * state on a device directly under the root port is equivalent to doing
+	 * it on the root port itself.
+	 */
+	dev_info(&bus->dev, "cxl: Configuration write complete, resetting card\n");
+	pci_set_pcie_reset_state(bridge, pcie_hot_reset);
+	pci_set_pcie_reset_state(bridge, pcie_deassert_reset);
+
+	dev_dbg(&bus->dev, "cxl: Offlining slot\n");
+	rc = pnv_php_set_slot_power_state(&php_slot->slot, OPAL_PCI_SLOT_OFFLINE);
+	if (rc) {
+		dev_err(&bus->dev, "cxl: OPAL offlining call failed: %i\n", rc);
+		goto err_free_work;
+	}
+
+	dev_dbg(&bus->dev, "cxl: Onlining and probing slot\n");
+	rc = pnv_php_set_slot_power_state(&php_slot->slot, OPAL_PCI_SLOT_ONLINE);
+	if (rc) {
+		dev_err(&bus->dev, "cxl: OPAL onlining call failed: %i\n", rc);
+		goto err_free_work;
+	}
+
+	pci_lock_rescan_remove();
+	pci_hp_add_devices(bridge->subordinate);
+	pci_unlock_rescan_remove();
+
+	dev_info(&bus->dev, "cxl: CAPI mode switch completed\n");
+	kfree(switch_work);
+	return;
+
+err_free_work:
+	kfree(switch_work);
+}
+
+int cxl_check_and_switch_mode(struct pci_dev *dev, int mode, int vsec)
+{
+	struct cxl_switch_work *work;
+	u8 val;
+	int rc;
+
+	if (!cpu_has_feature(CPU_FTR_HVMODE))
 		return -ENODEV;
+
+	if (!vsec) {
+		vsec = find_cxl_vsec(dev);
+		if (!vsec) {
+			dev_info(&dev->dev, "CXL VSEC not found\n");
+			return -ENODEV;
+		}
 	}
 
-	if ((rc = CXL_READ_VSEC_MODE_CONTROL(dev, vsec, &val))) {
-		dev_err(&dev->dev, "failed to read current mode control: %i", rc);
+	rc = CXL_READ_VSEC_MODE_CONTROL(dev, vsec, &val);
+	if (rc) {
+		dev_err(&dev->dev, "Failed to read current mode control: %i\n", rc);
 		return rc;
 	}
-	val &= ~CXL_VSEC_PROTOCOL_MASK;
-	val |= CXL_VSEC_PROTOCOL_256TB | CXL_VSEC_PROTOCOL_ENABLE;
-	if ((rc = CXL_WRITE_VSEC_MODE_CONTROL(dev, vsec, val))) {
-		dev_err(&dev->dev, "failed to enable CXL protocol: %i", rc);
-		return rc;
+
+	if (mode == CXL_BIMODE_PCI) {
+		if (!(val & CXL_VSEC_PROTOCOL_ENABLE)) {
+			dev_info(&dev->dev, "Card is already in PCI mode\n");
+			return 0;
+		}
+		/*
+		 * TODO: Before it's safe to switch the card back to PCI mode
+		 * we need to disable the CAPP and make sure any cachelines the
+		 * card holds have been flushed out. Needs skiboot support.
+		 */
+		dev_WARN(&dev->dev, "CXL mode switch to PCI unsupported!\n");
+		return -EIO;
 	}
+
+	if (val & CXL_VSEC_PROTOCOL_ENABLE) {
+		dev_info(&dev->dev, "Card is already in CXL mode\n");
+		return 0;
+	}
+
+	dev_info(&dev->dev, "Card is in PCI mode, scheduling kernel thread "
+			    "to switch to CXL mode\n");
+
+	work = kmalloc(sizeof(struct cxl_switch_work), GFP_KERNEL);
+	if (!work)
+		return -ENOMEM;
+
+	pci_dev_get(dev);
+	work->dev = dev;
+	work->vsec = vsec;
+	work->mode = mode;
+	INIT_WORK(&work->work, switch_card_to_cxl);
+
+	schedule_work(&work->work);
+
 	/*
-	 * The CAIA spec (v0.12 11.6 Bi-modal Device Support) states
-	 * we must wait 100ms after this mode switch before touching
-	 * PCIe config space.
+	 * We return a failure now to abort the driver init. Once the
+	 * link has been cycled and the card is in cxl mode we will
+	 * come back (possibly using the generic cxl driver), but
+	 * return success as the card should then be in cxl mode.
+	 *
+	 * TODO: What if the card comes back in PCI mode even after
+	 *       the switch?  Don't want to spin endlessly.
 	 */
-	msleep(100);
+	return -EBUSY;
+}
+EXPORT_SYMBOL_GPL(cxl_check_and_switch_mode);
+
+#endif /* CONFIG_CXL_BIMODAL */
+
+static int setup_cxl_protocol_area(struct pci_dev *dev)
+{
+	u8 val;
+	int rc;
+	int vsec = find_cxl_vsec(dev);
+
+	if (!vsec) {
+		dev_info(&dev->dev, "CXL VSEC not found\n");
+		return -ENODEV;
+	}
+
+	rc = CXL_READ_VSEC_MODE_CONTROL(dev, vsec, &val);
+	if (rc) {
+		dev_err(&dev->dev, "Failed to read current mode control: %i\n", rc);
+		return rc;
+	}
+
+	if (!(val & CXL_VSEC_PROTOCOL_ENABLE)) {
+		dev_err(&dev->dev, "Card not in CAPI mode!\n");
+		return -EIO;
+	}
+
+	/* Still configure the protocol area for single mode cards */
+	if ((val & CXL_VSEC_PROTOCOL_MASK) != CXL_VSEC_PROTOCOL_256TB) {
+		val &= ~CXL_VSEC_PROTOCOL_MASK;
+		val |= CXL_VSEC_PROTOCOL_256TB;
+		rc = CXL_WRITE_VSEC_MODE_CONTROL(dev, vsec, val);
+		if (rc) {
+			dev_err(&dev->dev, "Failed to set CXL protocol area: %i\n", rc);
+			return rc;
+		}
+	}
 
 	return 0;
 }
@@ -1249,7 +1447,7 @@  static int cxl_configure_adapter(struct cxl *adapter, struct pci_dev *dev)
 	if ((rc = setup_cxl_bars(dev)))
 		return rc;
 
-	if ((rc = switch_card_to_cxl(dev)))
+	if ((rc = setup_cxl_protocol_area(dev)))
 		return rc;
 
 	if ((rc = cxl_update_image_control(adapter)))
diff --git a/include/misc/cxl.h b/include/misc/cxl.h
index ed81a17..e5e17ed 100644
--- a/include/misc/cxl.h
+++ b/include/misc/cxl.h
@@ -39,6 +39,31 @@ 
 bool cxl_slot_is_supported(struct pci_dev *dev, int flags);
 
 
+#define CXL_BIMODE_CXL 1
+#define CXL_BIMODE_PCI 2
+
+/*
+ * Check the mode that the given bi-modal CXL adapter is currently in and
+ * change it if necessary. This does not apply to AFU drivers.
+ *
+ * If the mode matches the requested mode this function will return 0 - if the
+ * driver was expecting the generic CXL driver to have bound to the adapter and
+ * it gets this return value it should fail the probe function to give the CXL
+ * driver a chance to probe it.
+ *
+ * If the mode does not match it will start a background task to unplug the
+ * device from Linux and switch its mode, and will return -EBUSY. At this
+ * point the calling driver should make sure it has released the device and
+ * fail its probe function.
+ *
+ * The offset of the CXL VSEC can be provided to this function. If 0 is passed,
+ * this function will search for a CXL VSEC with ID 0x1280 and return -ENODEV
+ * if it is not found.
+ */
+#ifdef CONFIG_CXL_BIMODAL
+int cxl_check_and_switch_mode(struct pci_dev *dev, int mode, int vsec);
+#endif
+
 /* Get the AFU associated with a pci_dev */
 struct cxl_afu *cxl_pci_to_afu(struct pci_dev *dev);