diff mbox series

witherspoon: Squash spurious I2C errors

Message ID 20191127044758.32596-1-oohall@gmail.com
State Accepted
Headers show
Series witherspoon: Squash spurious I2C errors | expand

Checks

Context Check Description
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco success Signed-off-by present
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot success Test snowpatch/job/snowpatch-skiboot on branch master
snowpatch_ozlabs/apply_patch success Successfully applied on branch master (d75e82dbfbb9443efeb3f9a5921ac23605aab469)

Commit Message

Oliver O'Halloran Nov. 27, 2019, 4:47 a.m. UTC
On witherspoon there's an I2C bus connecting the each chip to the GPUs
connected to that chip via NVLink. That bus has a 750kOhm pullup on the
system planar with prevents the bus from operating correctly.

Each GPU has a smaller pullup which makes the bus usable when a GPU is
plugged in, but on systems without GPUs we get a lot of spurious I2C
master errors. Specificly, because of the oversized pullup the SDA and
SCL for that bus cannot return to '1' fast enough, so the master assumes
that another master is driving the bus and that it should stop.

Work around this by disabling the affected port when there's no GPUs
detected in the system.

Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
Tested on a system with a GPU on each socket and on a system with none.
Couldn't find a mixed system.
---
 platforms/astbmc/witherspoon.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

Comments

Andrew Donnellan Nov. 27, 2019, 5:04 a.m. UTC | #1
On 27/11/19 3:47 pm, Oliver O'Halloran wrote:
> On witherspoon there's an I2C bus connecting the each chip to the GPUs
> connected to that chip via NVLink. That bus has a 750kOhm pullup on the
> system planar with prevents the bus from operating correctly.
> 
> Each GPU has a smaller pullup which makes the bus usable when a GPU is
> plugged in, but on systems without GPUs we get a lot of spurious I2C
> master errors. Specificly, because of the oversized pullup the SDA and
> SCL for that bus cannot return to '1' fast enough, so the master assumes
> that another master is driving the bus and that it should stop.
> 
> Work around this by disabling the affected port when there's no GPUs
> detected in the system.
> 
> Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

Looks reasonable.

Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com>

> ---
> Tested on a system with a GPU on each socket and on a system with none.
> Couldn't find a mixed system.
> ---
>   platforms/astbmc/witherspoon.c | 25 +++++++++++++++++++++++++
>   1 file changed, 25 insertions(+)
> 
> diff --git a/platforms/astbmc/witherspoon.c b/platforms/astbmc/witherspoon.c
> index edf84fb89601..081996684603 100644
> --- a/platforms/astbmc/witherspoon.c
> +++ b/platforms/astbmc/witherspoon.c
> @@ -465,6 +465,7 @@ static void npu2_phb_nvlink_dt(struct phb *npuphb)
>   static void witherspoon_finalise_dt(bool is_reboot)
>   {
>   	struct dt_node *np;
> +	struct proc_chip *c;
>   
>   	if (is_reboot)
>   		return;
> @@ -479,6 +480,30 @@ static void witherspoon_finalise_dt(bool is_reboot)
>   			continue;
>   		npu2_phb_nvlink_dt(npphb);
>   	}
> +
> +	/*
> +	 * The I2C bus on used to talk to the GPUs has a 750K pullup
> +	 * which is way too big. If there's no GPUs connected to the
> +	 * chip all I2C transactions fail with an Arb loss error since
> +	 * SCL/SDA don't return to the idle state fast enough. Disable
> +	 * the port to squash the errors.
> +	 */
> +	for (c = next_chip(0); c; c = next_chip(c)) {
> +		bool detected = false;
> +		int i;
> +
> +		np = dt_find_by_path(c->devnode, "i2cm@a1000/i2c-bus@4");
> +		if (!np)
> +			continue;
> +
> +		for (i = 0; i < 3; i++)
> +			detected |= occ_get_gpu_presence(c, i);
> +
> +		if (!detected) {
> +			dt_check_del_prop(np, "status");
> +			dt_add_property_string(np, "status", "disabled");
> +		}
> +	}
>   }
>   
>   /* The only difference between these is the PCI slot handling */
>
Oliver O'Halloran Dec. 16, 2019, 3:59 a.m. UTC | #2
On Wed, Nov 27, 2019 at 3:48 PM Oliver O'Halloran <oohall@gmail.com> wrote:
>
> On witherspoon there's an I2C bus connecting the each chip to the GPUs
> connected to that chip via NVLink. That bus has a 750kOhm pullup on the
> system planar with prevents the bus from operating correctly.
>
> Each GPU has a smaller pullup which makes the bus usable when a GPU is
> plugged in, but on systems without GPUs we get a lot of spurious I2C
> master errors. Specificly, because of the oversized pullup the SDA and
> SCL for that bus cannot return to '1' fast enough, so the master assumes
> that another master is driving the bus and that it should stop.
>
> Work around this by disabling the affected port when there's no GPUs
> detected in the system.
>
> Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

Merged as dc85bd46d20db4d878216283f818573cb0c8c05b
diff mbox series

Patch

diff --git a/platforms/astbmc/witherspoon.c b/platforms/astbmc/witherspoon.c
index edf84fb89601..081996684603 100644
--- a/platforms/astbmc/witherspoon.c
+++ b/platforms/astbmc/witherspoon.c
@@ -465,6 +465,7 @@  static void npu2_phb_nvlink_dt(struct phb *npuphb)
 static void witherspoon_finalise_dt(bool is_reboot)
 {
 	struct dt_node *np;
+	struct proc_chip *c;
 
 	if (is_reboot)
 		return;
@@ -479,6 +480,30 @@  static void witherspoon_finalise_dt(bool is_reboot)
 			continue;
 		npu2_phb_nvlink_dt(npphb);
 	}
+
+	/*
+	 * The I2C bus on used to talk to the GPUs has a 750K pullup
+	 * which is way too big. If there's no GPUs connected to the
+	 * chip all I2C transactions fail with an Arb loss error since
+	 * SCL/SDA don't return to the idle state fast enough. Disable
+	 * the port to squash the errors.
+	 */
+	for (c = next_chip(0); c; c = next_chip(c)) {
+		bool detected = false;
+		int i;
+
+		np = dt_find_by_path(c->devnode, "i2cm@a1000/i2c-bus@4");
+		if (!np)
+			continue;
+
+		for (i = 0; i < 3; i++)
+			detected |= occ_get_gpu_presence(c, i);
+
+		if (!detected) {
+			dt_check_del_prop(np, "status");
+			dt_add_property_string(np, "status", "disabled");
+		}
+	}
 }
 
 /* The only difference between these is the PCI slot handling */